How SREs Can Achieve More Success by Implementing Modern Monitoring
by Theo Schlossnagle
Today’s complex and service-centric, “always on” IT environments have placed greater strains on monitoring teams. Unfortunately, the reality is that many organizations still have legacy monitoring tools and processes in place that are no longer effective in today’s world. At Circonus, we speak with many companies that are looking to “modernize” their monitoring. They want to embrace and implement Site Reliability Engineering (SRE) principles and fully harness all the powerful data they are generating so they can gain insights and make decisions that have a big impact on the company.
Increasingly, organizations are implementing Site Reliability Engineering (SRE) functions that are responsible for defining ways to measure availability and uptime, accelerate releases, and reduce the costs of failures. SREs operate in continuous-delivery, continuous-integration environments where user demand drives frequent, high-performing release cycles and systems change very quickly. These environments are so dynamic that traditional monitoring approaches end up trying to solve problems that no longer exist and simply do not meet new monitoring expectations and requirements.
This post explores why SREs need a new, more modern approach to monitoring; 5 characteristics of modern monitoring that SREs must embrace; and two real examples of modern monitoring in action.
SREs Require an Updated Approach to Monitoring
Today’s systems are born in an agile world and remain fluid in order to accommodate changes in both the supplier and the consumer landscape. This highly dynamic system stands to challenge traditional monitoring paradigms.
At its heart, monitoring is about observing and determining the behavior of systems. Its purpose is to answer the ever-present question: Are my systems doing what they are supposed to? In the old world of slow release cycles (often between six and 18 months), the system deployed at the beginning of a release looked a lot like the same system several months later. Simply put, it was not very fluid – which is great for monitoring. If the system today is the system tomorrow, and the exercise that system does today is largely the same tomorrow, then the baselines developed by observing the behavior of the system’s components will very likely live long, useful lives.
In the new world of rapid and fluid business and development processes, we have change on a continual basis. The problem here is that the fundamental principles that power monitoring — the very methods that judge if your machine is behaving itself — require an understanding of what good behavior looks like. In order to understand if systems are misbehaving, you need to know what it looks like when they are behaving.
Today, many organizations are adopting a microservices architecture pattern. Microservices dictate that the solution to a specific technical problem should be isolated to a network-accessible service with clearly defined interfaces, such that the service is free to evolve behind those interfaces. This freedom is very powerful, but the true value lies in decoupling release schedules and maintenance, and allowing for independent higher-level decisions around security, resiliency and compliance. The combination of these two changes, rapid release cycles and microservices, results in something quite unexpected for the world of monitoring: the system of today neither looks like nor should behave like the system of tomorrow.
5 Characteristics of Modern Monitoring
For SREs to be successful, they need a new, modern way to manage and monitor rapidly scaling and rapidly changing IT infrastructure, where monitoring is a key component of service delivery. So what should monitoring in the world of SREs look like? Organizations and SREs who are successfully advancing their monitoring and elevating its impact on their businesses must embrace the following 5 characteristics of modern monitoring.
#1: Measuring Performance to Meet Quality of Service Requirements
It is time to move beyond only pinging a system to see if it is up or down. Pinging is useful, but not the same as knowing how well the service is running and meeting business requirements. Knowing that a machine is running and delivering some subset of a service currently being delivered to a customer, and having that knowledge in real time, is real business value.
The next question becomes how to most efficiently measure performance for those quality of service requirements. The answer is to measure the latency of every interaction between every component in the system. In this new service-centric world, high latency is the new “down”. Instead of just checking for available disk space or number of IO operations against that disk, it’s important to check (for example) the latency distribution of the API requests. Just knowing how much memory the system is using isn’t enough — it’s much more important to know how many microseconds of latency occur against every query.
What should be measured is the actual performance of all of the components in the system and along the path of service delivery between the customer and the data center. Don’t just check to see if an arbitrarily determined “average” transaction time has been met or a system is up. While these kinds of traditional metrics are still useful and necessary to monitor, it is crucial to see if your quality of service requirements are met.
Every user on a web app, or every customer website hit, uses a plethora of infrastructure components, and the quality of the user’s experience is affected by the performance of numerous microservices. Completely understanding performance requires checking the latency of every component and microservice in that system. All of those latencies add up to make or break the customer experience, thereby determining the quality of your service.
Will the quality of your service be affected if 1 out of every 100 database queries is painfully slow? Will your business be impacted if 5 out of every 100 customer experiences with your service are unpleasant? Traditional monitoring tools that store and alert on averages leave SREs blind to these situations. Every user matters and so does every user interaction. Their experience is directly affected by every component interaction, every disk interaction, every cloud service interaction, every microservice interaction, and every API query – so they should all be measured.
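To make the point about averages concrete, here is a minimal Python sketch. The latency values, the 1,000 ms threshold, and the helper name `percentile` are assumptions chosen for illustration, not measurements from a real system. It compares the mean of 100 request latencies with the share of requests that were actually slow.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical data: 100 requests, 95 fast and 5 painfully slow (milliseconds).
latencies_ms = [20.0] * 95 + [3000.0] * 5

mean_ms = statistics.mean(latencies_ms)   # what an average-based tool would store
p50_ms = percentile(latencies_ms, 50)     # the typical request
p99_ms = percentile(latencies_ms, 99)     # the tail that unhappy users experience

# Share of requests slower than an assumed 1,000 ms quality-of-service threshold.
slow_share = sum(1 for v in latencies_ms if v > 1000.0) / len(latencies_ms)

print(f"mean={mean_ms} ms, p50={p50_ms} ms, p99={p99_ms} ms, slow share={slow_share:.0%}")
```

With this made-up data the mean stays well under 200 ms even though 5 of every 100 requests take three seconds, which is the kind of situation an alert on the average would miss.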
Imagine measuring the total latency experienced by the user and alerting SREs to unacceptable latency in subcomponents and microservices – before they affect end-to-end service quality. If it is not measured, then SREs are blind to the underpinnings of what causes a web app or website to meet or to fail service level agreements.
#2: Centralizing Data To Correlate Business Outcomes and Improve Productivity
Do not silo data. The behavior of parts must be put in context. Correlating disparate systems and even business outcomes is critical. Today’s IT organizations want everything distributed, but if you don’t have your data together, then you cannot correlate your systems and business outcomes.
One of the hallmarks of conventional monitoring is having disparate monitoring tools that each have a specific purpose and create silos of metric data. It’s a patchwork environment where there is a lack of consistent standards and processes; and as a result, there’s no ability to share information in a clear and cohesive way among different teams within the organization.
Having disparate tools often requires more costs and resources; and knowledge of how to use them can reside in just a few individuals. This not only creates the potential for serious disruptions if people leave the organization, but it also prevents teams within the IT organization from being able to find answers on their own. For example, an engineer responsible for application performance monitoring cannot get the information they require on network health without relying on someone on the network team to get it for them – resulting in increased time for essential tasks like troubleshooting. At the strategic level, there is no way to get a comprehensive and consolidated view of the health and performance of the systems that underpin the business.
By centralizing all of your metrics — application, infrastructure, cloud, network, container — into one monitoring and analytics platform, your organization gains a consistent metrics framework across teams and services. You democratize your data, so that anybody can immediately access that data any time and use it in a way that is correlated to the other parts of your business – eliminating the time-consuming barriers associated with legacy monitoring tools. A centralized platform that consistently presents and correlates all data in real-time consolidates monitoring efforts across all teams within the organization and enables the business to extract the maximum value from its monitoring efforts.
#3: Gaining Deeper Context to Reduce MTTR and Gain Higher Insights
Today’s SREs are swimming in data that is constantly spewing from every infrastructure component — virtual, physical, or cloud. Identifying the source of a performance issue from what can be millions of data streams can require hours and hours of engineering time using traditional monitoring processes. To quickly troubleshoot performance issues, SREs need more context.
Metrics with context allow SREs to correlate events, so they can reduce the amount of time required to identify and correct the root cause of service-impacting faults. This is why it’s imperative SREs have monitoring solutions that are Metrics 2.0 compliant. Metrics 2.0 is a set of conventions, standards and concepts around time series metrics metadata with the goal of generating metrics in a format that is self-describing and standardized.
The fundamental premise of Metrics 2.0 is that metrics without context do not have a lot of value. Metrics 2.0 requires metrics be tagged with associated “metadata” or context about the metric that is being collected. For example, collecting CPU utilization from a hundred servers without any context is not particularly useful. But with Metrics 2.0 tags, you will know that this particular CPU metric is from this particular server, within this particular rack, at this specific data center, doing this particular type of work. Much more useful.
When all metrics are tagged in this manner, queries and analytics become quite powerful. You can search based on these tags and you are able to slice and dice the data in many ways to glean insights and intelligence about your operations and performance.
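As a rough illustration of what tagging buys you, the Python sketch below attaches context to each sample and then filters and groups by tag. The record layout, the tag names (`host`, `rack`, `datacenter`, `role`), and the helpers `select` and `mean_by` are all hypothetical; this is not the Metrics 2.0 wire format or any vendor’s query API.

```python
from collections import defaultdict

# Hypothetical tagged samples. Every name and value here is illustrative only.
samples = [
    {"name": "cpu.utilization", "value": 30.0,
     "tags": {"host": "web-01", "rack": "r1", "datacenter": "east", "role": "web"}},
    {"name": "cpu.utilization", "value": 60.0,
     "tags": {"host": "web-02", "rack": "r2", "datacenter": "east", "role": "web"}},
    {"name": "cpu.utilization", "value": 90.0,
     "tags": {"host": "db-01", "rack": "r2", "datacenter": "east", "role": "database"}},
    {"name": "cpu.utilization", "value": 20.0,
     "tags": {"host": "web-03", "rack": "r7", "datacenter": "west", "role": "web"}},
]


def select(records, name, **tag_filters):
    """Return the records for one metric name whose tags match every filter."""
    return [
        r for r in records
        if r["name"] == name
        and all(r["tags"].get(key) == want for key, want in tag_filters.items())
    ]


def mean_by(records, tag):
    """Group records by one tag and average the values in each group."""
    groups = defaultdict(list)
    for r in records:
        groups[r["tags"].get(tag, "unknown")].append(r["value"])
    return {key: sum(values) / len(values) for key, values in groups.items()}


cpu = select(samples, "cpu.utilization")
print(mean_by(cpu, "datacenter"))                      # slice by location
print(select(samples, "cpu.utilization", role="web",   # dice by role and location
             datacenter="east"))
```

Without the tags, the same four numbers would be an undifferentiated list; with them, the same data answers questions per data center, per rack, or per role.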
#4: Retaining Your Data So You Can Reduce Future Risk
Monitoring data has often been considered low value and high cost. Times have changed and, as with all things computing, the cost of storing data has fallen dramatically. More importantly, SREs have changed the value of long-term retention of this data. SREs operate in a culture of learning. When things go wrong, and they always do, it is critical to have a robust process for interrogating the system and the organization to understand how the failure transpired. This allows processes to be altered to reduce future risk. At the pace we move, it is undeniable that your organization will later develop intelligent questions about a failure that nobody thought to ask immediately after it happened. Those new questions are crucial to the development of your organization, but they become absolutely precious if you can travel back in time and ask those questions about past incidents.
A lot of monitoring vendors — particularly open source — downplay the importance of historical data. They store data for a month, believing anything older than that is not valuable. This couldn’t be more wrong. These solutions don’t store data long-term because they weren’t designed to. But that doesn’t mean it’s not important.
You should have years — not just weeks or months — of historical data that allows you to do post-mortems. But what’s key is that the data you collect today is the same tomorrow.
Your minute-by-minute data or second-by-second data should be just that, not averaged into hour-by-hour or day-by-day values over time as many solutions do. There is nothing more infuriating than having a new question come out of the post-mortem when you no longer have the data you need to answer it. Do you remember that outage that you had six months ago? You would like to ask a question about that now that you didn’t ask then. But wait, you can’t, because your graphs are just one big average over a day.
The data you have today should be exactly the same 6 months from now, 12 months from now. Capacity planning, retrospectives, comparative analysis, and modeling rely on accurate, high fidelity history. Take bandwidth utilization for example. You likely don’t serve the same bandwidth all day long. If you look at the history of bandwidth utilization and you’ve averaged it out over a day, then your maximum is completely obscured. All of the maximums are gone and you’re planning this trajectory curve that doesn’t accommodate your peaks at all. Having this granular data ensures you can answer all future questions correctly.
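The bandwidth example can be sketched in a few lines of Python. The traffic shape below (a flat 100 Mbit/s baseline with a ten-minute burst to 900 Mbit/s) is invented for illustration, and `rollup` is a hypothetical helper, not a function from any monitoring product.

```python
# Hypothetical per-minute bandwidth for one day, in Mbit/s.
per_minute = [100.0] * (24 * 60)
for minute in range(20 * 60, 20 * 60 + 10):   # ten-minute burst starting at 20:00
    per_minute[minute] = 900.0


def rollup(values, window, aggregate):
    """Downsample by applying `aggregate` to consecutive windows of `window` samples."""
    return [aggregate(values[i:i + window]) for i in range(0, len(values), window)]


def average(values):
    return sum(values) / len(values)


true_peak = max(per_minute)                          # what capacity planning needs
hourly_averages = rollup(per_minute, 60, average)    # a typical rolled-up archive
daily_average = average(per_minute)                  # "one big average over a day"

print(f"true peak:           {true_peak:.1f} Mbit/s")
print(f"max of hourly means: {max(hourly_averages):.1f} Mbit/s")
print(f"daily mean:          {daily_average:.1f} Mbit/s")
```

Once only the averaged series is kept, nothing in it can recover the 900 Mbit/s peak, so a capacity plan drawn from it would be sized for traffic that never reflects the busiest minutes.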
#5: Using Histograms for Latency SLOs
Latency measurements have become an important part of IT infrastructure and application monitoring. But there are a number of technical challenges associated with managing and analyzing latency data. The volume emitted by a single data source can easily become very large; data has to be collected and aggregated from a large number of different sources; and the data has to be stored over long time periods in order to allow historic comparisons and long-term service quality estimations (SLOs). In order to address these challenges, a compression scheme has to be applied that drastically reduces the size of the data to be stored and transmitted. The most accurate, cost-effective technology to enable this compression is histograms.
A histogram is a data structure that allows users to model the distribution of a set of samples – for example, the age of every human on earth. Instead of storing each sample as its own record, a histogram groups samples together in “buckets” or “bins,” which allows for significant data compression and thus superior economics. This compression of data allows for extraordinary metric transmission and ingestion rates, high frequency, real-time analytics, and economical long-term storage. Histograms are also particularly useful in handling the breadth and depth of metric data produced by container technologies such as Kubernetes.
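A minimal sketch of the bucketing idea, using the age example, might look like the following. The fixed 10-year bucket width and the function names are assumptions made for illustration; production histogram implementations generally use more sophisticated bin layouts.

```python
from collections import Counter


def bucket_of(value, width):
    """Lower edge of the fixed-width bucket that contains `value`."""
    return int(value // width) * width


def build_histogram(samples, width):
    """Count samples per bucket instead of storing each sample individually."""
    return Counter(bucket_of(v, width) for v in samples)


# A handful of hypothetical ages, grouped into 10-year buckets.
ages = [3, 7, 25, 27, 29, 31, 42, 48, 48, 71]
histogram = build_histogram(ages, 10)
print(sorted(histogram.items()))

# The compression argument: many samples, but the number of buckets stays small.
many_ages = [i % 100 for i in range(100_000)]
compact = build_histogram(many_ages, 10)
print(len(many_ages), "samples stored as", len(compact), "bucket counts")
```

The trade-off is resolution: each sample is known only to within its bucket, which is why the choice of bin boundaries matters in practice.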
Histograms are more essential to the monitoring industry now than ever before. Not only are many more user interactions being generated, collected, and analyzed, but organizations also now have multiple layers of systems, services, and applications communicating with each other that are generating an overwhelming volume of data, significantly more than users alone could generate.
SREs now need to analyze the behavior of their systems and determine, quantitatively, what is good enough. If you’re servicing web pages or an API endpoint, how fast do you need to service requests? The difficulty with asking how fast most requests need to be is that the question has two variables: how fast (measured in milliseconds) and how many (measured as a percentage of requests, that is, a percentile).
Histograms are the perfect model for solving that problem because they allow SREs to collect, compress, and store ALL data points and analyze what percentage of their traffic is slower or faster than a certain speed, at low cost and zero overhead. Histograms are ideal for SLO analysis because they can be aggregated over time, and they can be used to calculate arbitrary percentiles and inverse percentiles on the fly, after data ingestion. So instead of saying, “I need 99% of requests to be served faster than one second,” you can start to ask, “What does it look like when I have 98% of requests served faster than 500 milliseconds?”
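The sketch below shows, under simplifying assumptions, how stored histograms can be merged across time windows and then queried for a percentile or an inverse percentile after the fact. The fixed 10 ms bin width, the sample latencies, and the function names are all hypothetical; this is not Circonus’ implementation.

```python
from collections import Counter

BIN_WIDTH_MS = 10  # assumed fixed-width bins; real systems use finer schemes


def to_histogram(latencies_ms):
    """Map each latency to the lower edge of its bin and count per bin."""
    return Counter(int(v // BIN_WIDTH_MS) * BIN_WIDTH_MS for v in latencies_ms)


def merge(*histograms):
    """Aggregate histograms (for example, one per minute) by adding bin counts."""
    total = Counter()
    for h in histograms:
        total.update(h)
    return total


def inverse_percentile(hist, threshold_ms):
    """Percent of samples in bins lying entirely at or below threshold_ms."""
    total = sum(hist.values())
    if total == 0:
        raise ValueError("empty histogram")
    below = sum(n for lower, n in hist.items() if lower + BIN_WIDTH_MS <= threshold_ms)
    return 100.0 * below / total


def percentile(hist, p):
    """Upper edge of the bin where the cumulative count first reaches p percent."""
    total = sum(hist.values())
    if total == 0:
        raise ValueError("empty histogram")
    needed = p * total / 100
    seen = 0
    for lower in sorted(hist):
        seen += hist[lower]
        if seen >= needed:
            return lower + BIN_WIDTH_MS
    return max(hist) + BIN_WIDTH_MS


# Two hypothetical one-minute windows of request latencies (milliseconds).
minute_1 = to_histogram([12] * 90 + [480] * 8 + [1500] * 2)
minute_2 = to_histogram([15] * 95 + [700] * 4 + [2500] * 1)
merged = merge(minute_1, minute_2)

print("percent faster than 500 ms: ", inverse_percentile(merged, 500))
print("percent faster than 1000 ms:", inverse_percentile(merged, 1000))
print("p98 upper bound (ms):       ", percentile(merged, 98))
```

Because only bin counts are stored, both answers are bounded by the bin width rather than exact, but neither the threshold nor the percentile had to be chosen before the data was collected.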
Modern Monitoring in Action: Redfin and Major League Baseball
Redfin, the technology-powered real estate brokerage, was experiencing significant growth in observability data as its website and mobile application quickly grew in popularity. As the organization grew and also began implementing more SRE principles, it implemented Circonus for more modern, advanced monitoring capabilities.
Redfin Upgrades Monitoring
Using histograms for StatsD analysis and SLOs
Redfin’s legacy telemetry pipeline was based on a combination of StatsD and Graphite, which was unable to scale as Redfin’s StatsD metric load began to quickly increase — a common challenge organizations face as they now embrace Kubernetes, microservices, and stateless applications.
StatsD has built-in aggregation functions for timers that are performed by the StatsD daemon, which include count, min, max, median, standard deviation, sum, sum of squares, percentiles (p90) and more. But most StatsD servers only offer static aggregations, which you have to configure upfront. So for example, if you want the 97th percentile for metric values, you have to have known that you’ll need the 97th percentile and configure that from the start.
This prevented Redfin from dynamically analyzing latencies or calculating SLOs on demand. Also, its various teams all had to use the same SLOs because they were forced to share the same pre-calculated aggregations.
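The limitation of static aggregation can be illustrated with a deliberately simplified model of a StatsD-style flush. This is a sketch, not the StatsD daemon’s actual code: the `CONFIGURED_PERCENTILES` constant, the output key names, and the `api.request` metric are hypothetical, and only the basic `<name>:<value>|ms` timer line format is assumed.

```python
import math

# Assumed static configuration, fixed before any data arrives.
CONFIGURED_PERCENTILES = (90,)


def parse_timer(line):
    """Parse a timer line of the form '<name>:<value>|ms' (sample rates not handled)."""
    name, rest = line.split(":", 1)
    value, metric_type = rest.split("|", 1)
    if metric_type != "ms":
        raise ValueError(f"not a timer line: {line!r}")
    return name, float(value)


def flush(lines):
    """Aggregate one flush interval, keeping only the preconfigured summaries."""
    timers = {}
    for line in lines:
        name, value = parse_timer(line)
        timers.setdefault(name, []).append(value)

    out = {}
    for name, values in timers.items():
        values.sort()
        out[f"{name}.count"] = len(values)
        out[f"{name}.min"] = values[0]
        out[f"{name}.max"] = values[-1]
        out[f"{name}.sum"] = sum(values)
        for p in CONFIGURED_PERCENTILES:
            rank = max(1, math.ceil(p * len(values) / 100))  # nearest-rank percentile
            out[f"{name}.p{p}"] = values[rank - 1]
    return out  # the raw values are discarded at this point


lines = [f"api.request:{ms}|ms" for ms in range(1, 101)]
aggregates = flush(lines)
print(aggregates)
print("p97 available?", "api.request.p97" in aggregates)
```

Once the flush has run, a 97th percentile cannot be derived from the stored summaries, whereas a histogram of the same interval would still be able to answer the question.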
Additionally, as the cardinality of Redfin’s StatsD metrics increased, the operational burden of managing its StatsD pipelines became significant. The StatsD server can precalculate millions of aggregates that are not even used. In some cases, 20+ aggregated metrics are produced for a single application timer metric. On top of this, you have to manage multiple atomic implementations of the StatsD server.
To tackle these challenges, Redfin decided to replace its legacy StatsD pipeline with Circonus’ histograms. Histograms allow them to efficiently and cost-effectively store all raw data, so they can now perform StatsD aggregations and build percentiles on the fly, after ingestion. This flexibility empowers Redfin’s SRE teams to dynamically set and measure their own SLOs for existing and future use cases.
The histograms also eliminate the need for multiple StatsD servers performing aggregations. Redfin can now compress all data into a single histogram and then send all of this data to their backend in one transaction, rather than multiple ones. Overall, Redfin is reducing the number of metrics they’re ingesting and storing by 50% compared to pre-aggregations, thereby reducing network bandwidth and associated costs.
Data correlation for faster MTTR
Modern IT environments are dynamic and ephemeral, making tagging essential to monitoring services and infrastructure. However, Redfin’s legacy StatsD pipeline didn’t support tagging in the line protocol. They therefore lacked the ability to slice and dice metrics for visualization and alerting, identify and resolve issues quickly, or correlate insights across business units.
With Circonus, Redfin can now enable Metrics 2.0 tagging of StatsD telemetry. The additional context and metadata makes it easier for Redfin to analyze across various dimensions and drastically improves the insight discovery process among millions of unique metrics.
Major League Baseball Unifies Monitoring
Major League Baseball was using multiple different monitoring solutions across its organization. The process of managing all of these various solutions was beginning to get expensive and made troubleshooting challenging. They therefore decided to consolidate all of their monitoring data into Circonus, making it the league’s centralized monitoring and analytics platform that underpins applications, systems infrastructure, cloud infrastructure, and network infrastructure.
In a recent interview with Network World, Jeremy Schulman, Principal Network Automation Software Engineer at MLB, stated, “All this very rich information is being put into a common observability platform, and that democratizes the data in a very important way at MLB. Enabling other IT disciplines to access network data will potentially speed troubleshooting and improve performance.”
He continued, “It’s amazing to have a seat at that table. We don’t have to make isolated tool decisions. We get to work with a group of very sophisticated engineers across all these other domains in cloud infrastructure, systems infrastructure, and we get to use their tools, along with their technology.”
By centralizing all of their monitoring data, MLB is making it fully accessible and valuable across IT departments. As a result, they are saving money, improving productivity, and delighting millions of global fans — all while elevating the role monitoring plays in their organization’s success.
The Bottom Line: SREs Require More Advanced Monitoring Solutions
The reality of today’s “always on” service-centric IT environments means that monitoring plays a different and more impactful role than it has in the past. As such, SREs have new, more advanced requirements and expectations when it comes to monitoring. As you embrace these monitoring characteristics, you’ll immediately begin to elevate the relevance of monitoring to your business’ success. You’ll gain lots of other benefits as well — like faster problem identification and resolution, full visibility into all your metrics, better performance, reduced costs, and more confidence in the accuracy of your decisions.