Monitoring is an essential function of enterprise SRE teams and a critical component of business service deliverability. Its importance has only grown as enterprise environments and technologies continue to evolve at a rapid pace.
Unfortunately, traditional monitoring is no longer enough.
Why is traditional monitoring not enough?
As enterprises moved to cloud-based environments over the years, the nature of applications changed, their numbers grew exponentially, and environments became increasingly complex. As if in lockstep, the number of monitoring tools multiplied to address this growing complexity.
However, the vast majority of these tools were purpose-built to monitor highly domain-specific, component-level elements such as routers, storage systems, or web servers. The result was that the process of monitoring itself became increasingly complex, and even worse, siloed—composed of a patchwork of tools that couldn’t talk to each other.
This left SRE teams scrambling to filter out constant monitoring “noise” in order to understand which of the many component-level alarms to focus on at any given time and which to ignore. They lacked a big-picture view of their systems and, perhaps most importantly, visibility into the quality of service end users were actually experiencing.
How has the enterprise monitoring landscape changed?
Today’s environments are only growing more dynamic and complex, involving a range of standards, platforms, and technologies comprising millions of components.
They also generate more data than ever before—and hidden in that data are rich business insights.
Today’s globally distributed, hybrid environments must support the continued adoption of cloud-native services, software-defined components and networks, containers, microservices, and more. Monitoring such complex environments means observing a system to understand whether it is behaving the way it should. This is much more complex than basic on/off monitoring: it often involves a time-consuming process of continually forming educated hypotheses about how the system should be behaving, and determining the best way to observe the system in order to understand how it is actually behaving.
So, while monitoring continues to be necessary, achieving a big-picture view of today’s complex systems and business services, while squeezing the maximum value from your data, requires a modern, enterprise-level monitoring and observability platform that meets the following five requirements.
Note: There are of course more than five, but these are foundational and absolutely essential, especially for enterprises running global, distributed, mission-critical infrastructure and applications at scale.
What is required to achieve enterprise monitoring and observability at scale?
1. Unlimited Data Retention and Historical Analysis
SRE as a principle is a culture of learning. When things go wrong (as they inevitably do), it is critical to have a robust process for interrogating the system and the organization to understand how a particular failure transpired. By identifying the root cause of failure, processes can be altered to ensure that a similar scenario does not occur in the future.
The thing is, new questions and realizations can present themselves at any time, often long after a particular event has passed. In such cases, there is a distinct need to “go back in time” to investigate past failures in light of these new questions and ideas.
Doing so is an invaluable way to gain new knowledge that can then be implemented to reduce future risk. However, this sort of “time travel” to conduct effective postmortems requires immense data retention.
In fact, doing it well requires unlimited data retention.
Therefore, your monitoring and observability platform should provide years of historical data to look back upon—and the data you have today should be exactly the same 6 months from now and 12 months from now. After all, there is nothing more infuriating than having a new question come out in the postmortem but no longer having the data you need to answer it.
Furthermore, the data provided by your platform should be extremely granular and precise. Forget graphs that are just one big average over a day—how can you use that to ask a specific new question regarding an outage that occurred months or even years ago?
Instead of averaging data hourly and daily over time as many solutions do, seek out a platform that can provide data minute-by-minute and second-by-second. You should have a record of every single API call ever served so that you can look back upon them as needed, even months or years in the future.
Most monitoring vendors, and open source tools in particular, downplay the importance of such historical data, storing it for a month and telling you that anything older is not valuable. This couldn’t be further from the truth. These solutions don’t store long-term data because they weren’t purpose-built to do so, but that doesn’t mean the data isn’t important.
To accomplish this feat of storing unlimited data, today’s most advanced enterprise monitoring and observability platforms use histograms.
2. SLO Optimization
Thought experiment: Can you leverage your current monitoring platform to accurately calculate latency SLOs that are the optimal balance between cost and benefit to the enterprise?
For example, is it 99.99% or 99.999%? If a 99.99% SLO meets requirements, then setting your SLO at 99.999% would be a waste of money and resources.
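To make the difference between those two targets concrete, here is a minimal sketch of the arithmetic. The function name and the traffic volume of one billion requests per month are assumptions chosen for illustration, not figures from any real service.

```python
from fractions import Fraction


def allowed_slow_requests(total_requests: int, slo_percent: str) -> int:
    """Requests that may miss the latency target before the SLO is breached.

    slo_percent is passed as a string (e.g. "99.99") so the arithmetic
    stays exact instead of picking up floating-point rounding error.
    """
    miss_fraction = (100 - Fraction(slo_percent)) / 100
    return int(total_requests * miss_fraction)


# Hypothetical traffic volume: one billion requests per month.
MONTHLY_REQUESTS = 1_000_000_000

for slo in ("99.99", "99.999"):
    budget = allowed_slow_requests(MONTHLY_REQUESTS, slo)
    print(f"{slo}% SLO: up to {budget:,} slow requests per month")
```

Under these assumed numbers, the stricter target shrinks the budget from 100,000 slow requests to 10,000. Whether that tenfold reduction is worth its cost is exactly the question the thought experiment poses.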
Without unsummarized source data over a long period of time, accurately determining optimal latency SLOs is impossible.
By storing latency measurements as histograms, however, the best enterprise monitoring and observability platforms can answer these questions and many more—helping SRE teams determine which SLOs become too expensive to reach without resulting in benefits to their organization.
By now you may be asking…
What is a Histogram?
A histogram is a representation of the distribution of a continuous variable, in which the entire range of values is divided into a series of intervals (or “bins”) and the representation displays how many values fall into each bin. Histograms visualize the distribution of latency data to make it easy for engineers to identify disruptions and concentrations and ensure performance requirements are met.
Histograms are the best way to compute latency SLOs because they efficiently store the full distribution of latency measurements (as a count per bin, rather than a single pre-computed summary) and can calculate any percentile you would like to see, on demand. Having this flexibility comes in handy when, for example, you are still evaluating your service and are not yet ready to commit to a latency threshold.
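The idea can be sketched in a few lines. The class below is an illustrative toy, not any vendor's implementation: the fixed 10 ms bin width, the class name, and the sample values are all assumptions, and real platforms likely use more sophisticated bin layouts. It shows the two properties that matter here: any percentile can be requested after the fact, and histograms from different time windows can be merged without losing information.

```python
from collections import Counter

BIN_WIDTH_MS = 10  # assumed bin width, chosen only for illustration


class LatencyHistogram:
    """Toy fixed-width histogram: maps each bin's lower bound (ms) to a count."""

    def __init__(self):
        self.bins = Counter()

    def record(self, latency_ms: float) -> None:
        lower = int(latency_ms // BIN_WIDTH_MS) * BIN_WIDTH_MS
        self.bins[lower] += 1

    def merge(self, other: "LatencyHistogram") -> "LatencyHistogram":
        """Combine two histograms (e.g. two time windows) by adding bin counts."""
        merged = LatencyHistogram()
        merged.bins = self.bins + other.bins
        return merged

    def percentile(self, p: int) -> int:
        """Upper bound (ms) of the bin that contains the p-th percentile."""
        total = sum(self.bins.values())
        if total == 0:
            raise ValueError("histogram is empty")
        seen = 0
        for lower in sorted(self.bins):
            seen += self.bins[lower]
            if seen * 100 >= p * total:
                return lower + BIN_WIDTH_MS
        return max(self.bins) + BIN_WIDTH_MS


# Hypothetical measurements: mostly fast requests plus a slow tail.
hist = LatencyHistogram()
for latency in [5] * 95 + [55] * 4 + [905]:
    hist.record(latency)

for p in (50, 99, 100):
    print(f"p{p} <= {hist.percentile(p)} ms")
```

Because only bin counts are kept, the answer is accurate to within one bin width, which is the trade-off a histogram makes in exchange for compact storage.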
Employing an enterprise monitoring and observability platform that stores data as histograms will ensure your business is making truly informed decisions about its SLO commitments.
3. Context for Faster Mean Time to Resolution (MTTR)
Without context, information is just random data points. Context is key to connecting those dots—or in the case of investigating infrastructure monitoring data, spikes on a graph.
Those pesky little spikes often take a lot of time to explore. And we all know that time (particularly IT staff time) is money.
Perhaps it’s an upgrade to one part of the system that creates a spike in latency for a portion of end users. If someone sees that spike weeks later via a real user monitoring (RUM) tool and lacks the contextual data explaining why it happened, it could take quite a long time to determine the root cause and resolve the situation. In fact, a wide range of business events, from earnings announcements to marketing campaigns, Super Bowl ads, and beyond, can impact the entire infrastructure.
That’s why it’s necessary to maintain a detailed record of all system events, changes, upgrades, and alerts over a long time period, in order to provide the context needed to reduce the time required to identify and correct the root cause of service-impacting faults.
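As a rough sketch of how such a record shortens an investigation, the snippet below looks up every recorded event in the window leading up to a latency spike. The `Event` type, the sample events, and the 30-minute window are hypothetical and exist only to illustrate the lookup.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Event:
    time: datetime
    kind: str          # e.g. "deploy", "alert", "marketing"
    description: str


# Hypothetical event log kept alongside the metrics.
EVENT_LOG = [
    Event(datetime(2024, 3, 1, 9, 0), "deploy", "checkout service upgraded"),
    Event(datetime(2024, 3, 1, 13, 55), "deploy", "load balancer config change"),
    Event(datetime(2024, 3, 2, 10, 0), "marketing", "email campaign sent"),
]


def events_before(spike_time, events, window=timedelta(minutes=30)):
    """Return events that occurred within `window` before the spike."""
    return [e for e in events if spike_time - window <= e.time <= spike_time]


spike = datetime(2024, 3, 1, 14, 5)
for event in events_before(spike, EVENT_LOG):
    print(f"{event.time:%Y-%m-%d %H:%M} [{event.kind}] {event.description}")
```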
Furthermore, this data must be Metrics 2.0 compliant.
Before you ask…
What is Metrics 2.0?
Metrics 2.0 is a set of “conventions, standards and concepts around time series metrics metadata” with the goal of generating metrics in a format that is self-describing and standardized.
Metrics 2.0 requires metrics be tagged with associated metadata in order to provide context surrounding the metric that is being collected—because we all know that metrics without context do not offer much value (in fact, the fundamental premise of Metrics 2.0 is just that).
Let’s take the example of collecting CPU utilization from a few dozen servers at random. Without Metrics 2.0 tags, you can’t know much at all about any given CPU metric. With them, however, you’ll know exactly which server, rack, and data center a particular CPU metric is from and with which type of work it is associated.
Unfortunately, many monitoring tools are not currently Metrics 2.0 compliant. This leaves today’s SREs swimming in data without contextual metrics—often scrambling to identify the source of a performance issue over the course of hours and left helpless when attempting to execute core SRE functions like dynamically creating SLOs.
When all metrics are tagged in this manner, however, queries and analytics become quite powerful. You can search based on these tags and slice and dice the data in various ways to glean insights and intelligence about your operations and performance.
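A small sketch shows what this slicing and dicing can look like in practice. The dictionary layout, tag names, and sample values below are assumptions made for illustration; they are not the Metrics 2.0 wire format or any particular product's query language.

```python
from statistics import mean

# Hypothetical CPU utilization samples, each carrying descriptive tags.
SAMPLES = [
    {"value": 40.0, "tags": {"what": "cpu_utilization", "unit": "percent",
                             "host": "web-01", "datacenter": "us-east", "workload": "api"}},
    {"value": 60.0, "tags": {"what": "cpu_utilization", "unit": "percent",
                             "host": "web-02", "datacenter": "us-east", "workload": "api"}},
    {"value": 90.0, "tags": {"what": "cpu_utilization", "unit": "percent",
                             "host": "db-01", "datacenter": "us-east", "workload": "database"}},
    {"value": 20.0, "tags": {"what": "cpu_utilization", "unit": "percent",
                             "host": "web-03", "datacenter": "eu-west", "workload": "api"}},
]


def select(samples, **wanted_tags):
    """Keep only the samples whose tags match every requested key/value pair."""
    return [s for s in samples
            if all(s["tags"].get(k) == v for k, v in wanted_tags.items())]


def mean_by(samples, tag):
    """Group samples by the value of one tag and average each group."""
    groups = {}
    for s in samples:
        groups.setdefault(s["tags"].get(tag), []).append(s["value"])
    return {key: mean(values) for key, values in groups.items()}


api_samples = select(SAMPLES, workload="api")
print(mean_by(api_samples, "datacenter"))
```

Without the tags, the four values would be indistinguishable numbers. With them, the same data can answer questions by data center, by workload, or by any other dimension that was recorded.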
4. Unified Monitoring and Observability
As mentioned in the introduction, one of the main reasons traditional monitoring falls short in today’s complex environments is that it is based upon a patchwork of disparate monitoring tools, each built for a specific purpose, and this creates silos of metric data.
In such an environment, it’s challenging to share information in a cohesive way among different teams because of a lack of consistent standards and processes among these tools.
Compounding this issue is the fact that knowledge of how to use these tools can reside in just a few individuals, which causes bottlenecks and prevents teams within the IT organization from being able to find answers on their own.
At the strategic level, there is no way to get a comprehensive and consolidated view of the health and performance of the systems that underpin the business.
All of this results in increased time for essential tasks like troubleshooting—and beyond being less effective, having disparate tools also requires more resources and increases costs.
IT organizations can leverage a unified monitoring and observability platform to avoid these and other problems.
What is Unified Monitoring and Observability?
Unified monitoring and observability is defined as implementing consistent monitoring processes, workflows, and standards across the organization. Teams employ a centralized platform on which to collect, analyze, alert on, and graph their data to gain a comprehensive view of the health and performance of the systems that underpin the business.
Centralizing all of your metrics into a single monitoring and observability platform provides a consistent metrics framework across teams and services. This democratizes your data so that anybody can immediately access data any time and use it in a way that is correlated to the other parts of your business—eliminating the time-consuming barriers associated with legacy monitoring tools.
A centralized platform that consistently presents and correlates all data in real time consolidates monitoring efforts across all teams within the organization and enables the business to extract the maximum value from those efforts.
5. Unlimited Scale to Measure Everything
Of course, a unified monitoring and observability platform is only as good as the quality and quantity of data it collects, and the speed at which it does so.
In order to observe all of your infrastructure and all of your metrics, you need to monitor…well…everything.
Traditional monitoring was certainly not built to handle this type of scale.
For example, traditional tools reduce all latency measurements to a single number—the average latency over an arbitrarily determined time window (typically one minute). This can result in wildly inaccurate latency SLOs, which could end up costing organizations significant money and resources.
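The following sketch illustrates why a single average can mislead. The latency values are invented for the example (98% of requests at 20 ms and 2% at 2,000 ms within one window), and the nearest-rank percentile helper is a simple assumed definition rather than any tool's built-in function.

```python
import math
from statistics import mean


def nearest_rank_percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of data at or below it."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical one-minute window: 980 fast requests and 20 very slow ones.
latencies_ms = [20] * 980 + [2000] * 20

print(f"average: {mean(latencies_ms):.1f} ms")
print(f"median:  {nearest_rank_percentile(latencies_ms, 50)} ms")
print(f"p99:     {nearest_rank_percentile(latencies_ms, 99)} ms")
```

In this made-up window the average is 59.6 ms, a latency that no request actually experienced, while the 99th percentile is 2,000 ms. An SLO set from the average alone would miss the slow tail entirely.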
Monitoring everything, however, requires an enterprise monitoring and observability platform with the built-in ability to continuously aggregate all metrics from all infrastructure, tracking and storing them on demand at extremely high granularity, which can amount to millions of measurements per second.
This platform must run at scale, meaning there should be no compromise on performance, regardless of the size of the infrastructure environment or the amount of data collected to run analytics in real time.
Not only will this help accelerate problem resolution, but once these metrics are collected, it’s possible for your teams to derive additional business value from this vast expanse of contextualized data.
A truly unified monitoring and observability platform that offers unlimited data retention and historical analysis, SLO optimization, contextual metrics for faster MTTR, and unlimited scale to measure absolutely everything in real time may sound too good to be true…but it does exist.
And enterprises—especially those running global, distributed, mission-critical infrastructure and applications at scale—can’t afford to settle for anything less. To do so would disadvantage their IT organization, their end users, and their bottom line.
Conversely, using such a platform will give their businesses the power to see a big-picture view of their systems and the quality of service their end users are actually experiencing, from moment to moment, in a day and age when every moment counts.