In recent years, companies have become increasingly dependent on observability to manage and maintain complex systems and infrastructure. As those systems grow even more complex, observability must evolve to keep pace with changing requirements. The big question for 2023: what's next for observability?
The proliferation of microservices and distributed systems has made it harder to understand real-time system behavior, which is critical to problem solving. In response, more companies are addressing this with automated monitoring of distributed architectures, deep-dive tracing and real-time observability.
Moreover, each decade has brought a major change in the way observability is expected to function. The past three decades have seen transformation after transformation: from on-premises to cloud to cloud-native. Each generation brings new problems to solve, opening the door for new companies to form:
- The on-premises era gave birth to companies such as SolarWinds, BMC and CA Technologies.
- The cloud era (where AWS came in) led to a vibrant market, with new companies like Datadog, New Relic, Sumo Logic, Dynatrace, AppDynamics and more.
- The cloud-native era (starting in 2019-20) has resulted in another market shake-up.

Why is observability changing?
The main reason for the current upheaval is that companies build software with completely different technology than they did in 2010. Instead of monolithic architectures, they use microservices, Kubernetes and distributed architectures.
There are three main reasons why this is the case:
- Better security
- Easy scalability
- More efficiency for distributed teams
However, there are also challenges. Gartner predicts that 95% of new digital workloads will be deployed on cloud-native platforms by 2025. Since cloud-native systems generate far more data than previous generations of technology, hosting and scaling that data becomes more challenging. This poses three major problems.
1. Prohibitive costs
The first problem is relatively simple: cost. Legacy observability platforms have become so expensive that most startups and midsize companies can't afford them. As a result, those companies use old technology to host and process their data, technology that may not meet their needs in 2023.
2. Evolving priorities in observability
In addition, as observability capabilities have become more sophisticated, the KPIs and OKRs that development and operations teams track have evolved.
Previously, the primary focus was on ensuring that applications and infrastructure didn't crash. Now dev and ops teams are working at a deeper level, prioritizing:
- Request latency
- Saturation
- Scalability
- Traffic maps showing where usage takes place
- Optimizing and predicting future outcomes
- How new code changes cloud usage
In short, dev and ops teams have become more proactive than reactive. That calls for technology that can keep up.
3. Changing expectations for observability
Finally, the emergence of microservices architecture is changing the way IT teams think about application monitoring. One microservice can run on a hundred machines, and a hundred small services can run on a single machine. There is no "one size fits all" approach: dev and ops teams need deeper analysis to understand what's going on in their infrastructure.
These are the challenges. So how should the new generation of observability tools respond in 2023? From my perspective, here are eight things needed to win the market.
Note: I'm taking a 30,000-foot view of a huge market. It is unlikely that a single company will do all of these things. But these are the needs, and it will take new companies, technologies and platforms to meet them all.
Unified observability
All the legacy vendors say they offer a unified observability platform. What this really means is that they have separate tabs for metrics, logs and traces that can be accessed from their platform.
This doesn't actually solve the problem. What dev and ops teams need is one place to view all this data on a single timeline. Only then will they be able to detect correlations, identify the root causes of problems and fix them quickly.
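To make "one timeline" concrete, here is a minimal sketch of the idea: metrics, logs and traces merged into a single time-ordered stream, so a latency spike, a slow span and an error log line up side by side. The event shapes and field names are purely illustrative.

```python
from dataclasses import dataclass, field
import heapq

@dataclass(order=True)
class Event:
    timestamp: float                       # unix epoch seconds; the only sort key
    kind: str = field(compare=False)       # "metric" | "log" | "trace"
    payload: dict = field(compare=False)

def unified_timeline(metrics, logs, traces):
    """Merge three time-sorted telemetry streams into one ordered view."""
    return list(heapq.merge(metrics, logs, traces))

# Hypothetical data: one incident seen through all three signals at once.
metrics = [Event(100.0, "metric", {"name": "p99_latency_ms", "value": 900})]
traces  = [Event(100.1, "trace",  {"span": "checkout", "duration_ms": 870})]
logs    = [Event(100.2, "log",    {"level": "ERROR", "msg": "db timeout"})]

for e in unified_timeline(metrics, logs, traces):
    print(f"{e.timestamp:.1f} [{e.kind}] {e.payload}")
```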
Integrated observability and business data
As Bogomil of Sequoia noted in this blog post, most companies don't correlate their observability and business data. This is a problem, because powerful insights can be gained by analyzing the two side by side.
Amazon, for example, famously found that if its website took one second longer to load, it could lose millions of dollars in sales every day. This can be huge for ecommerce businesses, especially if they track a slowdown in orders: it could be due to poor application performance. The faster they fix the application, the more orders they receive and the more revenue they earn.
The same goes for software companies. A fast application improves usability, which improves the user experience, which in turn affects a number of business metrics. Only by integrating these two sets of data can companies begin to make the connections that improve business outcomes.
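As a toy illustration of that side-by-side analysis, the sketch below correlates hourly p95 latency with order volume using only Python's standard library (3.10+). The numbers are invented; the point is that a strongly negative coefficient is the signal that slowdowns and lost orders move together.

```python
from statistics import correlation  # available since Python 3.10

# Hypothetical hourly samples: p95 latency (ms) and completed orders.
latency_ms = [210, 230, 250, 400, 650, 700, 480, 260]
orders     = [980, 960, 940, 850, 610, 580, 790, 930]

r = correlation(latency_ms, orders)
print(f"latency vs. orders: r = {r:.2f}")  # close to -1: slow pages, fewer orders
```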
Vendor-agnostic OpenTelemetry (OTel)
Companies are looking for a solution that is not tied to one vendor. To that end, most technology companies are contributing to OpenTelemetry (OTel), making it the data collector of choice. OTel has many advantages: interoperability, flexibility and improved performance monitoring.
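For context, this is roughly what vendor-agnostic instrumentation looks like with the OpenTelemetry Python SDK: application code talks only to the neutral OTel API, while the exporter (a console exporter here, but it could be any OTLP-compatible backend) is swappable configuration. The service and attribute names are illustrative.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Wire the SDK once at startup. Changing vendors means changing this
# exporter, not re-instrumenting the application code below.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")

def checkout(cart_id: str):
    # Application code depends only on the vendor-neutral OTel API.
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute("cart.id", cart_id)
        # ... business logic ...

checkout("cart-42")
```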
Predictive observability
In the AI era, more and more of the operational experience requires no human in the loop. That allows systems to do things humans simply cannot, such as using machine learning to predict failures before they even happen.
This is not yet common in observability, and there is a great need for more innovation. By adding an AI layer to observability platforms, companies can predict problems before they happen and fix them before the user or customer knows something is wrong.
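As a deliberately simple sketch of the idea, the snippet below fits a linear trend to recent disk-usage samples and estimates when the disk will fill, so an alert can fire before the failure rather than after. A real predictive layer would use a trained model; the linear fit and the sample data here are stand-ins.

```python
from statistics import linear_regression  # available since Python 3.10

# Hypothetical samples: (minutes elapsed, disk percent used).
minutes = [0, 10, 20, 30, 40, 50]
percent = [61.0, 63.1, 65.2, 67.0, 69.2, 71.1]

slope, intercept = linear_regression(minutes, percent)

if slope > 0:
    minutes_to_full = (100.0 - percent[-1]) / slope
    # Alert now, long before the disk actually fills and the app fails.
    print(f"disk projected full in ~{minutes_to_full:.0f} minutes")
```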
Predictive security in observability
Observability and security work closely together. Many observability companies are moving into security because they already control all the data collected from applications and infrastructure.
By reading metrics, logs and traces, especially those that exhibit unusual behavior, AI should be able to recognize security threats. Most SIEM and XDR tools do not; even when they do, they rely on a rule-based model rather than analyzing behavior and learning from it.
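To illustrate the difference: a rule-based tool encodes a fixed threshold, while a behavioral approach learns a baseline and flags deviations from it. Below is a minimal z-score version of that idea; a real system would train a model on far richer signals, and the failed-login counts are invented.

```python
from statistics import mean, stdev

def is_anomalous(history, current, z_threshold=3.0):
    """Flag a value that deviates sharply from the learned baseline."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > z_threshold

# Hypothetical failed-login counts per minute for one service account.
baseline = [2, 1, 3, 2, 2, 1, 3, 2, 2, 1]
print(is_anomalous(baseline, 2))   # False: normal behavior
print(is_anomalous(baseline, 40))  # True: possible credential-stuffing attempt
```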
Cost optimization
Perhaps the biggest challenge in observability is cost. While cloud storage is getting cheaper, most observability companies are not lowering their prices. Customers get the short end of the stick, mainly because there are no alternatives.
OpenTelemetry collects over 200 data points every second, but we don't need all of them. So instead of charging users for storage they don't need, organizations should collect and store only the useful data points and discard the rest. This can significantly reduce the cost of storing and processing data.
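One way to act on that is to decide at the collector, before anything reaches paid storage, which points earn their keep. The filter below is a toy sketch; the retention rules (keep errors and flagged anomalies, sample a small fraction of routine points) are illustrative assumptions, not any vendor's actual policy.

```python
import random

def keep(point: dict) -> bool:
    """Decide at ingest time whether a data point is worth storing."""
    if point.get("level") == "ERROR":   # always keep errors
        return True
    if point.get("anomalous"):          # always keep flagged anomalies
        return True
    return random.random() < 0.05       # sample 5% of routine points

stream = [
    {"name": "heartbeat", "level": "INFO"},
    {"name": "p99_latency_ms", "value": 950, "anomalous": True},
    {"name": "request", "level": "ERROR", "msg": "db timeout"},
]
stored = [p for p in stream if keep(p)]
print(f"stored {len(stored)} of {len(stream)} points")
```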
Correlation and causality analysis
Most legacy observability platforms provide basic information about what is happening in the cloud or the application. Often, however, the inciting event occurs hours or even days earlier. That's why it's important to monitor CI/CD pipelines to see when code is pushed and which change or request is causing the problem.
Let’s say there is one network socket that is slow and starts to clog up requests. As a result, your backend starts to slow down, which then throws an error. Then the front-end slows down, which throws another error. Then the application crashes. You may only notice that the front-end is slowing down and think that this is the cause of the application crash. But in reality, the problem started elsewhere.
In a distributed architecture, this root-cause analysis takes far more time than in a monolith. Observability platforms must adapt to this new reality.
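The cascade above suggests one heuristic a platform can apply: when an incident fires, walk the event timeline backward and surface the earliest correlated anomaly rather than the last symptom. A toy version, with invented component names and timings:

```python
from datetime import datetime, timedelta

def probable_root_cause(anomalies, incident_time, window_minutes=60):
    """Return the earliest anomaly in the lookback window before an incident."""
    start = incident_time - timedelta(minutes=window_minutes)
    in_window = [a for a in anomalies if start <= a["time"] <= incident_time]
    return min(in_window, key=lambda a: a["time"], default=None)

t0 = datetime(2023, 1, 1, 12, 0)
anomalies = [
    {"time": t0 + timedelta(minutes=1),  "component": "net-socket", "signal": "queue depth rising"},
    {"time": t0 + timedelta(minutes=12), "component": "backend",    "signal": "latency errors"},
    {"time": t0 + timedelta(minutes=20), "component": "frontend",   "signal": "timeouts"},
]
crash = t0 + timedelta(minutes=25)
print(probable_root_cause(anomalies, crash))  # the socket, not the frontend
```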
AI-based alerts
Alert fatigue is a real challenge. When developers receive so many warnings that they mute email threads or Slack channels, problems stay hidden and resolution slows.
AI-based alerting systems instead use machine learning to predict which alerts are essential and which are not. AI can also provide context and even suggest possible solutions.
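Even before any model enters the picture, the shape of such a system is scoring plus suppression: collapse duplicates, rank by estimated importance, and page humans only for the top of the queue. In the sketch below, repeat count and severity stand in for a trained ranking model, and all the alert names are hypothetical.

```python
from collections import Counter

def triage(alerts, max_pages=3):
    """Collapse duplicate alerts and page only the highest-priority ones."""
    counts = Counter(a["fingerprint"] for a in alerts)
    unique = {a["fingerprint"]: a for a in alerts}
    ranked = sorted(
        unique.values(),
        key=lambda a: (a["severity"], counts[a["fingerprint"]]),
        reverse=True,
    )
    return ranked[:max_pages]  # page these; fold the rest into a digest

alerts = [
    {"fingerprint": "disk-full-db1", "severity": 3, "msg": "disk 95% on db1"},
    {"fingerprint": "disk-full-db1", "severity": 3, "msg": "disk 96% on db1"},
    {"fingerprint": "cert-expiry",   "severity": 1, "msg": "TLS cert expires in 20d"},
]
for a in triage(alerts):
    print(a["msg"])
```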
Final thoughts
This is an exciting time to be in observability. The changes we are seeing open the door to unprecedented opportunities. The question remains: who will rise to the top in 2023?
Laduram Vishnoi is founder and CEO of Middleware.