More is More - A Case for Dynamic Observability

Dynamic observability is the idea that the amount of data you collect should scale based on signals from your environment. Elastic infrastructure is not a new concept. Much of the internet is powered by services that provision more resources based on signals derived from metrics like CPU load, memory utilization and queue depth. If we can use tools to right-size our infrastructure, why can’t we also use tools to right-size the amount of data we collect?

Why does autoscaling exist at all?

In order for a service to run reliably, it needs to be able to accommodate peak load. You could follow a peak load provisioning strategy, sizing your services so they can always handle the high-water mark. However, that can be really expensive, especially if spikes in utilization are relatively short-lived. Instead of provisioning for peak load, most services are architected to support some degree of scale-out behavior.

Collecting, processing and storing more data costs money, just like spinning up more servers. However, if you look at the way the vast majority of observability strategies are implemented, the amount of data that is sent is almost always statically defined. People will send as much data as they can afford. This is understandable since the tools needed to send back more data based on signals from your environment are basically non-existent. Why would observability vendors that charge based on data volume help you build elasticity into your data collection strategy? When growth is prioritized over capital efficiency, what incentive do engineers have to address this problem in house?

Teams justify this peak provisioning approach to data collection by saying that the cost of not having the data when it is needed is higher than the cost of sending all of the data. While there is a grain of truth in that, it can be an excuse to write a blank check for observability tools, and it largely explains why businesses are spending so much on observability. The logic is also subtly flawed. Someone, somewhere is already making decisions on how much data to send when they define a poll interval, log level filter or trace sampling rate, so "all of the data" is never really all of it. You can always send more data. What is in question here is the idea that there is a one-size-fits-all collection strategy that sends back the same amount of data during business as usual as it does during a five-alarm fire.

Dynamic observability

How do you build elasticity into your observability strategy? At a high level you will need at least the following three components:

1. Something to tell you about the state of the things you’re monitoring
2. Something to decide what to do when your environment is in a given state
3. A mechanism through which to change how data is being collected

If you’re reading this article, you almost certainly have item 1, which is an observability platform. Observability platforms tell you a lot about the state of your environment and can emit signals like alerts which can be acted upon by other tools. The key here is that the alerts that are emitted need to carry a reasonably high amount of signal, and they must be leading indicators of service-impacting issues. They need to be high signal because, even with the best of protections, any automation carries risk, so you want to make sure you’re responding to the right things. They should be leading indicators because sending back more data on a service that has already gone down is considerably less useful than sending more data during the period that led up to the outage. Anything you can do to improve the signal, or to increase the proportion of your alerts that are leading indicators, can make a dynamic observability strategy more effective. AIOps platforms carry the promise of helping on both fronts and could be a good complement to the tools used to implement item 1.

The component responsible for deciding what to do in response to an alert must be able to handle the incoming alerts and then do something with that information. For example, maybe the component receives an alert that latency is increasing for one of the web services you manage. In this scenario, maybe you want the component to trigger some action that will increase the trace sampling rate from 1% to 5% on the affected services. Depending on your architecture, where you make this change could be an application, pipeline or observability platform concern.
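The latency example above can be sketched as a small decision function. This is only an illustration: the `Alert` shape, the alert name `latency_increasing` and the service names are hypothetical, and a real handler would parse whatever payload your observability platform actually sends.

```python
from __future__ import annotations

from dataclasses import dataclass

BASELINE_RATE = 0.01  # 1% of traces during business as usual
ELEVATED_RATE = 0.05  # 5% while a latency alert is firing


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert shape; real platforms define their own payloads."""
    name: str
    service: str
    firing: bool


def handle_alert(alert: Alert, current_rates: dict[str, float]) -> dict[str, float]:
    """Return a new service -> sampling rate mapping with the alert applied."""
    updated = dict(current_rates)
    if alert.name != "latency_increasing":
        return updated  # no rule for this alert, so change nothing
    # Raise the rate while the alert fires, restore the baseline when it clears.
    updated[alert.service] = ELEVATED_RATE if alert.firing else BASELINE_RATE
    return updated
```

Returning a new mapping instead of mutating state keeps the decision separate from the mechanism that applies it, which is the split between items 2 and 3 in the list above.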

The level of sophistication needed for this event handling component really depends on the level of complexity in your observability stack and the flexibility you need in terms of handling new use cases. If the number of events that the event handler will need to account for is small and the number of potential actions it could trigger is also limited, then something as simple as a serverless function could work here. However, safely allowing teams to define these rules can quickly create a number of edge cases that need to be handled for a production ready solution.

Finally, it’s one thing to know that an action should be taken; it’s another to actually perform that action. For item 3 in the list, you need a way of changing the way data is collected. In a typical observability pipeline, there are roughly four places where this could occur:

Application
Collector
Pipeline
Observability Platform

At the application level this could take the form of changing the log level it uses, or in more extreme cases changing the log or trace data the application emits. At the collector level, where the data is initially processed and shipped, this could take the form of changing filters or adjusting the poll interval. If you’re using a pipeline in front of your observability platform, you could adjust the filtering and routing of the data there as well. Finally, most observability platforms allow you to filter data on ingestion for a much smaller fee than if the data were stored and indexed. Each tier has its own set of tradeoffs, and the cost/complexity tradeoff will likely be different in every organization.
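As one illustration of the application tier, here is a minimal sketch that uses Python's standard `logging` module to switch a logger between a quiet baseline and a verbose incident mode at runtime. The function name and the logger name `checkout` are hypothetical, and the choice of WARNING and DEBUG as the two levels is an assumption.

```python
import logging


def set_service_log_level(logger_name: str, incident: bool) -> int:
    """Switch a logger between a quiet baseline and verbose incident mode."""
    level = logging.DEBUG if incident else logging.WARNING
    logging.getLogger(logger_name).setLevel(level)
    return level


# Business as usual: debug records are dropped before they are ever emitted.
set_service_log_level("checkout", incident=False)

# An alert fires: the same logger now lets debug records through.
set_service_log_level("checkout", incident=True)
```

How the running application learns that it should make this call (an admin endpoint, a feature flag, a config reload) is the harder part and is not shown here.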

Why is this not a solved problem?

If you’ve ever debugged an application (maybe by frantically adding print statements throughout your code), you know more data is key to finding the root cause and squashing bugs. If you put the cost-cutting benefits of dynamic observability aside, there is a scenario where you could continue to collect data exactly as you do today, but when there is an issue, use automation to turn on a firehose of data for a very targeted portion of your infrastructure. Why doesn’t this already exist? Here are a few reasons that come to mind:

Perverse incentives
Cheap money
FUD
Diverse tools
Observer effect

Observability vendors have little incentive to help you add elasticity to your collection strategy. They mostly charge by the amount of data you send them, so helping you send less cuts directly into their revenue. And since developers are expensive, tools that improve developer productivity can and do charge a premium.

Another reason this is not a solved problem is the macro environment that low interest rates helped support. Even though observability tools are a cost center in most organizations, their inefficient use could be justified if they supported revenue-generating services or growth. For many years growth has been more important than capital efficiency, though clearly that is now changing.

Fear, uncertainty and doubt (FUD) also play a role. You don’t always know what data will be important until after an incident occurs. So most organizations adopt a strategy of sending as much data as they can afford, rather than investing the time to refine what data is collected. The risk of not having the data you need during a service outage is an existential threat that is hard for organizations to price correctly.

Even if a vendor wanted to solve this problem, there are so many different tools and ways of collecting data that covering them all is difficult. OpenTelemetry (OTEL) is doing a lot to standardize things, but it still has a long way to go and will never be how all organizations instrument their applications and infrastructure. Changing the way data is collected is extremely expensive. Not only does it require developer time to re-instrument applications and infrastructure, it could also require rewrites of all existing dashboards and alert conditions. Managed agents do exist (Zabbix Agent, Elastic Fleet, and Cribl Edge come to mind). However, each requires its own agent to work, which, in addition to the high adoption cost, creates even more vendor lock-in risk.

Finally, there is the observer effect, where the act of observing something changes its behavior. This is the most legitimate critique of dynamic observability. There is an unspoken rule that observability tools should not affect the performance of your applications, or if they do, that the effect is predictable and can be accounted for in your design. Anyone who has gotten a little overzealous in increasing the trace sampling rate knows how changes like that can impact system performance. Similarly, as with any automation tool, you can automate yourself into oblivion if you’re not careful. So even if an ambitious internal tools team wanted to build some tooling to allow for dynamic observability, there is a long tail of safeguards and edge cases they’d need to account for. This makes it challenging to open the tooling to other teams, who typically have the knowledge required to define what data needs to be collected in which contexts.
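To make the idea of safeguards concrete, here is a minimal sketch of two of them: a hard ceiling on the sampling rate, so no rule can push it past a level the service can tolerate, and a time-to-live, so a boost reverts on its own if nobody renews it. The class name and the default values (10% ceiling, 15 minute expiry) are illustrative assumptions, not recommendations.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class GuardedSamplingRate:
    baseline: float = 0.01       # rate during business as usual
    ceiling: float = 0.10        # hard cap, whatever a rule asks for
    ttl_seconds: float = 900.0   # a boost expires unless it is renewed
    _boosted: Optional[float] = field(default=None, init=False)
    _expires_at: float = field(default=0.0, init=False)

    def boost(self, requested: float, now: float) -> float:
        """Apply a requested rate, clamped to the range [baseline, ceiling]."""
        self._boosted = max(self.baseline, min(requested, self.ceiling))
        self._expires_at = now + self.ttl_seconds
        return self._boosted

    def current(self, now: float) -> float:
        """Return the rate in effect at `now`, reverting once the TTL passes."""
        if self._boosted is not None and now < self._expires_at:
            return self._boosted
        self._boosted = None
        return self.baseline
```

The clock is passed in as `now` rather than read inside the class, which keeps the sketch easy to test; in practice a caller might pass `time.monotonic()`. A production version would need more than this, for example rate limits on how often changes are applied and a cap on how many services can be boosted at once.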

Conclusion

As an industry we need to move away from a one-size-fits-all collection strategy. The amount of data we collect while all of our services are healthy should not be the same as when production is on fire. This is intuitively how we navigate our world: we see a doctor when we have a fever, or a mechanic when our car engine starts to make funny noises. When we get a signal that something is wrong, we go and collect more data.

While controlling how much data makes it to your observability platform can happen in a few places, I believe it makes most sense to implement this at the collection layer. Almost all agents follow a similar design pattern where they can be controlled exclusively through configuration files. This pattern does a lot of things right, in that it decouples collection from your observability platform, allows for GitOps workflows and integrates with existing software distribution tools. However, by decoupling from upstream platforms and delegating the lifecycle management of these agents’ configs entirely to the user, it also makes any sophistication in how data is collected the user’s responsibility. For most organizations, building the tooling to safely make changes to how agents collect data would not be a good use of engineering time and focus. However, it’s not an intractable problem. It’s just one where the tooling available to achieve the desired outcome (i.e. elastic data collection) is under-developed or non-existent.
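Because agents are driven by configuration files, the mechanism at this layer can be as simple as rewriting a file. The sketch below assumes a made-up JSON config with `poll_interval_seconds` and `log_level` keys; real agents each have their own format and their own way of picking up changes (a reload signal, a file watch, a restart), none of which is shown here.

```python
import json
import os
import tempfile


def write_agent_config(path: str, poll_interval_seconds: int, log_level: str) -> None:
    """Atomically replace the agent's config file with new collection settings."""
    # Hypothetical keys; substitute whatever your agent's config format expects.
    config = {
        "poll_interval_seconds": poll_interval_seconds,
        "log_level": log_level,
    }
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temp file in the same directory, then swap it into place,
    # so the agent never reads a half-written config.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(config, tmp, indent=2)
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
```

A call such as `write_agent_config("agent.json", 10, "debug")` during an incident, followed by `write_agent_config("agent.json", 60, "warn")` once it clears, is the whole mechanism. The hard part is everything around it: validating the change, rolling it out safely and reverting it.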

We are starting to see some progress on the tooling front. Companies like Calyptia and ObservIQ have offerings that add some dynamism to Fluent Bit and OTEL Collector respectively. At Circonus, we have a new offering called Passport, which tries to address this problem in an agent-agnostic way and has a rules engine that introduces considerable flexibility. Regardless of the chosen tool, the idea that context should inform how data is collected is a conversation that is long overdue. The cost/visibility tradeoff is unavoidable, so tooling really needs to fill in the gaps in order to make the way data is collected more intelligent. If done right, you can get better visibility at a lower cost, keeping both engineering and procurement teams in perfect harmony for the time being.