Monitoring for Success: What All SREs Need to Know

The last ten years have seen a massive change in how IT operations and development enable business success. From virtualization and cloud computing to continuous delivery, continuous integration, and rapid application development, IT has never been more complex or more critical to creating competitive advantage. To support increasingly web-scale IT operations and wide-scale cloud adoption, applications now operate as services. This requires software engineers and operations teams to collaborate to meet “always on” customer expectations and deliver superior customer experiences. As more companies transform themselves into service-centric, “always on” environments, they are implementing Site Reliability Engineering (SRE) functions that are responsible for defining ways to measure availability and uptime, accelerate releases, and reduce the costs of failures.

SREs operate in continuous-integration, continuous-delivery environments where user demand drives frequent release cycles and systems change very quickly. These environments are so dynamic that older monitoring tools are solving problems that no longer exist and do not meet current monitoring requirements. At the same time, “always on” high reliability has become table stakes in application delivery, which means that “always on” high reliability in the monitoring systems themselves is nonnegotiable. It also means that SREs need an efficient way to identify performance problems.

Today’s SREs are swimming in data that is constantly spewing from every infrastructure component, virtual, physical, or cloud. Identifying the source of a performance issue from what can be millions of data streams can require hours and hours of engineering time using traditional IT monitoring tools. Clearly, SREs desperately need a new way to manage and monitor rapidly scaling and rapidly changing IT infrastructure where monitoring is a key component of service delivery – for an overall service, the microservices that make it up, and all the connections between them.

Measuring Performance to Meet Quality of Service Requirements

It is time to move beyond only pinging a system to see if it is up or down. Pinging is useful, but it is not the same as knowing how well the service is running and whether it is meeting business requirements. Knowing in real time that a machine is running and delivering its subset of a service to a customer is real business value.

The next question becomes how to most efficiently measure performance for those quality of service requirements. The answer is to measure the latency of every interaction between every component in the system. In this new service-centric world, high latency is the new “down”. Instead of just checking for available disk space or number of IO operations against that disk, it’s important to check (for example) the latency distribution of the API requests. Just knowing how much memory the system is using isn’t enough – it’s much more important to know how many microseconds of latency occur against every query.
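As a minimal sketch of what measuring the latency of every interaction can look like in application code, the Python example below wraps a call in a timer and keeps each measurement in microseconds. The names timed, latencies_us, and fake_query are hypothetical and used only for illustration; a real deployment would ship these measurements to a monitoring system instead of holding them in memory.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical in-memory store: interaction name -> latencies in microseconds.
latencies_us = defaultdict(list)

@contextmanager
def timed(interaction):
    """Record the wall-clock latency of one interaction, in microseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_us = (time.perf_counter() - start) * 1_000_000
        latencies_us[interaction].append(elapsed_us)

def fake_query():
    """Stand-in for a real database or API call."""
    time.sleep(0.001)

# Every call is measured, not just a sampled few.
for _ in range(5):
    with timed("db.query"):
        fake_query()

print(len(latencies_us["db.query"]), "measurements recorded")
```

The point of the sketch is that each individual request produces a measurement, so the full latency distribution is available later rather than a single pre-computed average.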

What should be measured is the actual performance of all of the components in the system and along the path of service delivery between the customer and the data center. Don’t just check to see if an arbitrarily determined “average” transaction time has been met or a system is up. While these kinds of traditional metrics are still useful and necessary to monitor, it is crucial to see if your quality of service requirements are met.

Every user on a web app, or every customer website hit, uses a plethora of infrastructure components, and the quality of the user’s experience is affected by the performance of numerous microservices. Completely understanding performance requires checking the latency of every component and microservice in that system. All of those latencies add up to make or break the customer experience, thereby determining the quality of your service.

Will the quality of your service be affected if 1 out of every 100 database queries is painfully slow? Will your business be impacted if 5 out of every 100 customer experiences with your service are unpleasant? Traditional monitoring tools that store and alert on averages leave SREs blind to these situations. Every user matters and so does every user interaction. Their experience is directly affected by every component interaction, every disk interaction, every cloud service interaction, every microservice interaction, and every API query – so they should all be measured.
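To make the problem with averages concrete, here is a small Python sketch using made-up numbers: 99 queries that take 10 ms and 1 query that takes 2 seconds. The sample values and the 500 ms "painfully slow" cutoff are assumptions chosen only for illustration.

```python
import statistics

# Hypothetical sample: 99 fast queries at 10 ms and 1 slow query at 2000 ms.
latencies_ms = [10.0] * 99 + [2000.0]

mean_ms = statistics.mean(latencies_ms)      # pulled up slightly by the outlier
median_ms = statistics.median(latencies_ms)  # unaffected by the outlier
worst_ms = max(latencies_ms)

# Fraction of queries slower than an assumed 500 ms cutoff.
slow_fraction = sum(1 for v in latencies_ms if v > 500.0) / len(latencies_ms)

print(f"mean={mean_ms:.1f} ms, median={median_ms:.1f} ms, worst={worst_ms:.1f} ms")
print(f"{slow_fraction:.0%} of queries exceeded 500 ms")
```

With these numbers the mean is about 29.9 ms, which looks healthy, even though 1 in every 100 users waited two full seconds. Only the distribution reveals that.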

Imagine measuring the total latency experienced by the user and alerting SREs to unacceptable latency in subcomponents and microservices – before they affect end-to-end service quality. If it is not measured, then SREs are blind to the underpinnings of what causes a web app or website to meet or to fail service level agreements.
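One way to picture this is a per-component latency budget. The Python sketch below is illustrative only: the component names, budgets, and observed values are hypothetical. It flags a subcomponent that has exceeded its own budget while the end-to-end total is still within bounds.

```python
# Hypothetical per-component latency budgets and observed values, in milliseconds.
budgets_ms = {"load_balancer": 5, "auth_service": 20, "api_gateway": 15, "database": 40}
observed_ms = {"load_balancer": 3, "auth_service": 35, "api_gateway": 12, "database": 25}
END_TO_END_BUDGET_MS = 80

def over_budget(observed, budgets):
    """Return the components whose observed latency exceeds their own budget."""
    return sorted(name for name, value in observed.items() if value > budgets[name])

total_ms = sum(observed_ms.values())
offenders = over_budget(observed_ms, budgets_ms)
end_to_end_ok = total_ms <= END_TO_END_BUDGET_MS

print(f"end-to-end: {total_ms} ms (within budget: {end_to_end_ok})")
print(f"components over budget: {offenders}")
```

Here the total of 75 ms still meets the 80 ms end-to-end budget, but auth_service is already 15 ms over its own allowance. That is the early warning SREs want before the user-facing number slips.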

SREs require a new set of monitoring solutions that reliably and cost-effectively measure everything, without increasing staff demands. These solutions must have the following capabilities:

Be reliable and comprehensive, covering all activity for all infrastructure, all the time. The solution needs to stay up to date so that it monitors exactly what is there, not components that were removed long ago.
Have 100% operability. The architecture needs to be on all the time and so does the system that monitors it. It should be upgradeable without disruptions.
Be API accessible. Do-it-yourself APIs replace the need to go through a support desk for monitoring, with automatic updates to keep up with always changing environments.
Be analytics-driven to correlate IT operations metrics and business metrics. Monitoring everything requires the built-in ability to continuously aggregate all infrastructure and transaction data, run machine learning algorithms, and graph data into visualizations that are easy to understand.
Run at scale. There should be no compromise on performance, regardless of the infrastructure environment size or the amount of data collected to run analytics in real-time.

How Circonus Empowers SREs

All computing architectures pump out constant streams of data. Circonus changes the economics of storing and processing all that data, so even a small company can collect billions of measurements per second and afford to analyze all of their data for better answers to better questions.

Circonus histograms enable cost-efficient, accurate storage and analysis of billions of metrics

How does IT process, consume, and make decisions using a thousand or a million measurements per second? Consider API request latency – tracking this between just two microservices, or a single microservice and a database, can require storage of millions of measurements per second. Existing monitoring tools will reduce all of those measurements to a single number – the average latency over an arbitrarily determined time window, typically a minute. On the other hand, Circonus uses its histograms to store and analyze massive amounts of data in a cost-efficient way. This method of storage, analysis, and visualization provides more meaningful graphs that can show distributions and outliers, regardless of data scale.
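The general idea behind histogram storage can be sketched in a few lines of Python. This is a simplified illustration and not Circonus's actual implementation: the binning rule (keep two significant digits of an integer microsecond value) and the class name LatencyHistogram are assumptions made for the example.

```python
from collections import Counter

def bin_lower_bound(value_us):
    """Lower edge of the bin for a positive integer, keeping two significant digits."""
    digits = len(str(value_us))
    step = 10 ** max(digits - 2, 0)
    return (value_us // step) * step

class LatencyHistogram:
    """Stores one counter per bin instead of every raw measurement."""

    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def record(self, value_us):
        self.counts[bin_lower_bound(value_us)] += 1
        self.total += 1

    def quantile(self, q):
        """Approximate the q-quantile as the lower edge of the bin that holds it."""
        target = q * self.total
        seen = 0
        for lower in sorted(self.counts):
            seen += self.counts[lower]
            if seen >= target:
                return lower
        raise ValueError("histogram is empty")

hist = LatencyHistogram()
for value_us in [1234] * 990 + [987654] * 10:   # hypothetical measurements
    hist.record(value_us)

print(len(hist.counts), "bins for", hist.total, "measurements")
print("median bin:", hist.quantile(0.5), "us; 99.9th percentile bin:", hist.quantile(0.999), "us")
```

A thousand measurements collapse into two counters, yet the slow outliers near one second remain visible in the upper quantiles. Storage grows with the number of distinct bins rather than with the number of measurements, which is what makes keeping the distribution affordable.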

Circonus gives SREs deeper context to improve root cause analysis

Context matters in quickly troubleshooting performance issues. In Circonus, events are correlated across the system as a whole. Circonus takes measurements from everything, no matter how many, and provides superior tools to reduce the amount of time required to identify and correct the root cause of service-impacting faults. Using an organization’s own system data, Circonus aims to put it all together to keep SREs as informed as possible, so they can quickly solve the problems that inevitably arise.

Circonus enables SREs to cost-effectively create and measure SLAs

Service Level Agreements (SLAs) are a critical factor for both customers and Software-as-a-Service providers. Unfortunately, many organizations make an arbitrary choice about an SLA requirement that could be unnecessarily expensive to hit. With Circonus, it’s easy to determine the cost of hitting a range of SLA specifications. It’s possible, for example, to identify if a 99.999% SLA costs the same as a 99.99% SLA. Using Circonus histograms, SREs can go back and accurately identify the optimal service level to cost-effectively satisfy all of their customers. They can determine which SLA objectives are inexpensive or costly to hit and which targets can be cost-effectively raised, so they can make informed decisions about their SLA commitments.
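The kind of retrospective analysis described here can be sketched in Python as follows. This is not the Circonus API: a plain list of hypothetical latencies stands in for stored histogram data, and the candidate thresholds are assumptions chosen for the example.

```python
def attained_percent(latencies_ms, threshold_ms):
    """Percentage of historical requests that completed within threshold_ms."""
    within = sum(1 for v in latencies_ms if v <= threshold_ms)
    return 100.0 * within / len(latencies_ms)

# Hypothetical history: 9,990 requests at 20 ms, 9 at 150 ms, and 1 at 4,000 ms.
history_ms = [20] * 9_990 + [150] * 9 + [4_000]

# Candidate latency commitments and the service level each would have achieved.
results = {t: attained_percent(history_ms, t) for t in (50, 200, 5_000)}

for threshold_ms, percent in results.items():
    print(f"requests within {threshold_ms} ms: {percent:.2f}%")
```

With this made-up history, a 50 ms commitment would have been met 99.9% of the time and a 200 ms commitment 99.99% of the time. Seeing the two side by side shows how much a tighter target would actually demand before anyone signs up to it.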

SREs Require More Advanced Monitoring

Many current monitoring systems were never designed to handle metric growth that scales to the millions or hundreds of millions of data streams that need to be collected to support today’s “always on” service-centric IT environments. The reality of today’s IT environment means that monitoring plays a different and more impactful role than it has in the past. As such, SREs have new, more advanced requirements and expectations when it comes to monitoring. Most importantly, SREs value monitoring and analytics tools that can provide intelligence beyond simple status and can correlate system performance with business metrics. With extensive analytics and a proprietary data store, the Circonus monitoring platform provides operational intelligence and meaningful visualization of even massive amounts of data – helping SREs to accelerate time to market and consistently improve customer experiences.