4 Strategies to Reduce Observability Costs - Without Sacrificing Visibility

Today’s end users have little to no patience for performance issues. Jitters, slow load times, and full-blown outages can quickly lead to brand damage, lost customers, and diminished revenue. That’s why it’s essential for DevOps teams and engineers to be able to identify and resolve issues quickly, before users ever notice them.

Doing this requires collecting and analyzing massive amounts of telemetry data – metrics, traces, and logs. The problem is that today’s complex, distributed, microservices-based architectures are generating more telemetry data than ever before, leading to surging observability costs. In fact, observability is now the second highest spend in IT budgets, behind only the cloud.

Unfortunately, to address these rising costs, many companies adjust their observability strategies in ways that sacrifice visibility, leading to more performance issues and longer resolution times.

However, there are strategies to reduce observability costs that actually improve visibility and MTTR by removing the unnecessary noise. In this article, I share four of these strategies that organizations can employ now.

Strategy #1: Embrace Dynamic Observability

For a service to run reliably, it needs to be able to accommodate peak load. But provisioning for your highest peaks all the time would be very expensive. That’s why most services are architected to support some degree of scale-out behavior. Why shouldn’t organizations approach observability data collection the same way?

Just like spinning up more servers to accommodate peak load, collecting, processing and storing more observability data costs money. But most observability data collection strategies today are static, designed to send as much data as they can afford.

Dynamic observability is the concept that the volume of metrics, logs, and traces collected should auto-scale based on signals from your environment. You automatically collect more data when you need it, such as during incidents and high-traffic events, and less data when you don’t.

For example, if you’re alerted to a server experiencing increasing CPU load, start collecting metrics at 10-second granularity instead of your standard 60 seconds. Just as you right-size your infrastructure by provisioning resources based on signals from metrics like CPU load, you’re right-sizing the amount of observability data you collect.
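The CPU example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the IntervalPolicy class, the next_interval and simulate functions, and the 80%/60% thresholds are hypothetical names and values, not part of any particular product. The only figures taken from the text are the 60-second and 10-second intervals.

```python
from dataclasses import dataclass


@dataclass
class IntervalPolicy:
    """Hypothetical policy; thresholds are illustrative assumptions."""
    normal_interval_s: int = 60    # standard granularity from the example
    elevated_interval_s: int = 10  # finer granularity under rising load
    cpu_high: float = 0.80         # assumed: switch to fine granularity at or above this
    cpu_low: float = 0.60          # assumed: return to normal only below this


def next_interval(policy: IntervalPolicy, current_interval_s: int, cpu_load: float) -> int:
    """Pick the collection interval for the next cycle from the latest CPU load."""
    if cpu_load >= policy.cpu_high:
        return policy.elevated_interval_s
    if cpu_load < policy.cpu_low:
        return policy.normal_interval_s
    # Between the two thresholds, keep the current interval so the
    # collector does not flap back and forth around a single cutoff.
    return current_interval_s


def simulate(policy: IntervalPolicy, cpu_samples: list[float]) -> list[int]:
    """Return the interval chosen after each CPU sample, starting from normal."""
    interval = policy.normal_interval_s
    chosen = []
    for load in cpu_samples:
        interval = next_interval(policy, interval, load)
        chosen.append(interval)
    return chosen


if __name__ == "__main__":
    print(simulate(IntervalPolicy(), [0.35, 0.85, 0.70, 0.55]))  # [60, 10, 10, 60]
```

The gap between the two thresholds is a design choice: it keeps the finer granularity in place while load is still elevated, instead of toggling on every sample near the cutoff.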

SREs and DevOps engineers may be concerned that not having all the data they can afford will lead to visibility gaps, which can ultimately lead to costly outages. But dynamic observability doesn’t mean sacrificing data quality or visibility to save money. In fact, the opposite is the case. You save money by collecting less data when it’s not needed, so you can afford to collect even more of the relevant, meaningful data when you need it.

Fortunately, tooling to automate this is now available. Consider implementing tools that control how much data makes it to your observability platform at the data collection layer (rather than filtering data after it’s already collected). This decouples data collection from your observability platform, so you and your teams have control over defining what data you want, when you want it. It also helps avoid vendor lock-in.
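As a rough illustration of filtering at the collection layer, the sketch below decides which log records are forwarded before anything reaches the platform. Everything in it is an assumption made for the example: records are plain dictionaries with a "level" field, the SEVERITY table and the filter_at_collection function are hypothetical, and the incident_active flag stands in for whatever signal your alerting provides.

```python
from typing import Iterable, Iterator

# Assumed severity ordering for the example.
SEVERITY = {"DEBUG": 10, "INFO": 20, "WARN": 30, "ERROR": 40}


def filter_at_collection(records: Iterable[dict], incident_active: bool) -> Iterator[dict]:
    """Yield only the records that should be forwarded to the platform.

    Quiet periods forward WARN and above; during an incident everything
    is forwarded so responders have full detail.
    """
    min_level = SEVERITY["DEBUG"] if incident_active else SEVERITY["WARN"]
    for record in records:
        # Records with a missing or unknown level are treated as INFO.
        level = SEVERITY.get(record.get("level", "INFO"), SEVERITY["INFO"])
        if level >= min_level:
            yield record


if __name__ == "__main__":
    sample = [
        {"level": "DEBUG", "msg": "cache miss"},
        {"level": "INFO", "msg": "request served"},
        {"level": "WARN", "msg": "slow query"},
        {"level": "ERROR", "msg": "upstream timeout"},
    ]
    print(len(list(filter_at_collection(sample, incident_active=False))))
    print(len(list(filter_at_collection(sample, incident_active=True))))
```

Because the decision is made before transmission, the dropped records never incur transfer or storage cost.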

Check out Circonus’ new Dynamic observability solution, Passport.

Strategy #2: Convert Logs to Metrics

Logs contain valuable information about the behavior and performance of IT systems, as well as events and activities that occur within those systems. If stored, they provide a historical record of what has happened within an IT system, which is essential for understanding and resolving problems. Logs can also be used to identify security threats, detect performance bottlenecks, and gain insights into the behavior of IT systems and users.

However, transmitting and storing log data has become incredibly complex and expensive. As a result, most enterprises routinely discard logs after short retention periods, and also during periods of heavy volume, precisely when that data is most important but its volume and cost rise dramatically.

Historical data provides crucial business insights that help not only speed troubleshooting, but also prevent future performance issues. Yet it is often nonexistent. Even if organizations retained all of their log data, they would still need to manually sort through it to pull out relevant insights. The sheer volume of logs these days makes this incredibly difficult, resulting in a poor signal-to-noise ratio and increased MTTR.

By converting logs to metrics, organizations can significantly reduce costs while also lessening resolution time.

While metrics, logs, and traces together are essential for full observability, metrics offer a couple of key advantages. First, you can perform far more sophisticated and accurate data analysis on metrics, enabling teams to quickly and easily identify trends, performance issues, outliers, and anomalies. Second, they are cheap to store compared to logs. Why? A single number may represent an entire page or more of log entries when processed, so 4, 6, or 8 digits replace up to 3,000 characters (approx. 60 characters per line at 50 lines). A reduction of roughly 400-fold or more is a huge difference when you consider that it’s not uncommon today for a server to produce 10+ gigabytes of logs a day.
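To make the conversion concrete, here is a minimal Python sketch that collapses a batch of log lines into a handful of numbers. The log format, the LINE_RE pattern, the logs_to_metrics function, and the metric names are all hypothetical and chosen only for illustration; real log formats will need their own parsing.

```python
import re
from collections import Counter

# Assumed line format: "<timestamp> <method> <path> <status> <latency>ms"
LINE_RE = re.compile(
    r"^(?P<ts>\S+) (?P<method>\S+) (?P<path>\S+) (?P<status>\d{3}) (?P<latency_ms>\d+)ms$"
)


def logs_to_metrics(lines):
    """Summarize a batch of log lines as a few numeric metrics."""
    status_classes = Counter()
    total_latency_ms = 0
    parsed = 0
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip lines that do not fit the assumed format
        parsed += 1
        status_classes[match["status"][0] + "xx"] += 1
        total_latency_ms += int(match["latency_ms"])
    return {
        "requests_total": parsed,
        "errors_5xx": status_classes["5xx"],
        "avg_latency_ms": total_latency_ms / parsed if parsed else 0.0,
    }


if __name__ == "__main__":
    batch = [
        "2024-01-01T00:00:00Z GET /api/users 200 12ms",
        "2024-01-01T00:00:01Z GET /api/users 500 48ms",
        "2024-01-01T00:00:02Z POST /api/orders 201 30ms",
    ]
    print(logs_to_metrics(batch))
    # {'requests_total': 3, 'errors_5xx': 1, 'avg_latency_ms': 30.0}
```

Using the figures from the paragraph above, 50 lines of 60 characters is 3,000 characters; replaced by an 8-digit number that is a 375-fold reduction, and by a 4-digit number a 750-fold reduction.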

Converting logs to metrics can save thousands of dollars a month, shrink resolution time, and enable teams to keep historical data that’s valuable for post-mortems and proactive performance monitoring.

Strategy #3: Consolidate Tooling

Tool sprawl is one of the biggest challenges companies currently face when it comes to their observability strategies and costs. Organizations typically employ several open source and/or proprietary point solutions.

Open source tools are a less expensive option and may be the optimal choice for some teams. However, as data volume grows, managing multiple open source tools can become resource-intensive, and organizations with a larger telemetry footprint can suffer visibility issues when trying to correlate data and insights from several different tools. This ultimately leads to longer troubleshooting and downtime, along with their associated costs.

For example, according to NetBlocks, the “X” (formerly Twitter) outage in May 2023 cost the company a whopping $13,962,513 per hour in the United States alone. Moreover, the indirect costs associated with outages, such as customer churn, negative publicity, and increased customer support requests, can further strain a company’s finances. In an era where customers expect 24/7 access to services, even a short-lived outage can lead to long-term reputational damage.

Unified platforms for collecting and monitoring all observability data in one place can reduce overall costs in two ways: first, they’re less expensive than leveraging multiple commercial point solutions, and second, they improve MTTR by enabling faster data correlation, so issues are resolved before customers notice and before they become major outages.

For organizations currently using open source tools, look for unified platforms that are built utilizing open standards. This will significantly ease migration, reduce learning curves, and help prevent vendor lock-in.

Strategy #4: Simplify by Focusing on the Observability Essentials

Following up on strategy #3, in addition to considering unified platforms built on open standards, also consider exactly which observability capabilities are essential for your organization. There are several robust, comprehensive observability platforms on the market.

But some of these are incredibly expensive and force you to pay for many capabilities you do not need. A platform that provides fewer features overall but aligns with what’s essential to your team will be significantly more cost-effective. These platforms should also provide the flexible APIs you need to customize easily as necessary.

Wrapping Up

As expectations for performance and the complexities of IT environments continue to grow, so does the significance of observability within organizations. The need to rapidly act on data and resolve incidents before users notice them has never been greater – or more challenging to meet. As an industry, we need to ensure that companies can implement effective observability strategies at a price they can afford. While not a comprehensive list, these four strategies can help teams execute observability without facing the dilemma of the cost-performance trade-off.