DevOps & Monitoring
by Theo Schlossnagle
As IT organizations evolve their understanding of how data helps them make decisions, they need to analyze massively increasing numbers of metrics. Current monitoring systems weren’t built for the new requirements of IT. How can you be sure that yours is?
In this post, you’ll learn about:
- The latest monitoring requirements you should consider, including being API accessible, having 100% operability, and scaling to meet new standards of metric collection
- How histograms power big data analytics
- How time series data provides a deeper context to improve root cause analysis
The last ten years have seen a massive change in how IT operations and development enable business success. From technological drivers like virtualization and cloud computing, to fundamental process game-changers like continuous delivery, continuous integration, and rapid application development, IT has never been more complex or more critical to creating competitive advantage.
To support increasingly Web-Scale IT operations and wide-scale cloud adoption, applications now operate as services. This requires software engineers and operations teams to collaborate to meet “always on” customer expectations and deliver superior customer experiences. As more and more companies transform themselves into service-centric, “always on” environments, they are also turning to DevOps, a fundamental shift in IT mentality, to help them align – and stay aligned – with the market. This means enterprise IT needs to become more agile, develop projects faster, shorten release cycles, and get to market faster with less risk.
Monitoring DevOps Environments Requires a New Approach
DevOps environments are continuous-delivery, continuous-integration environments where user demand drives frequent, high-performing release cycles and systems change very quickly. These environments are so dynamic that old monitoring tools are trying to solve problems that no longer exist, and they simply do not meet new monitoring expectations and requirements. IT patches, manipulates, mutilates, and evolves old monitoring tools to get them to work. Sometimes this works really well, but most of the time it does not.
At the same time, “always on” high reliability has become table stakes in application delivery, which means that “always on” high reliability in the monitoring systems themselves is non-negotiable. It also means that engineers need an efficient way to identify performance problems. Today’s IT is swimming in data that is constantly spewing from every infrastructure component, whether virtual, physical, or cloud. Identifying the source of a performance issue from what can be millions of data streams can require hours and hours of engineering time using traditional IT monitoring tools. Clearly, enterprises desperately need a new way to manage and monitor rapidly scaling and rapidly changing IT infrastructure where monitoring is a key component of service delivery – for an overall service, the microservices that make it up, and all the connections between them.
- DevOps has different requirements and expectations when it comes to monitoring, which traditional monitoring tools cannot meet.
- Reliability is critical in DevOps environments that must accommodate new levels of speed and scale.
- DevOps values monitoring and analytics tools that can provide intelligence beyond simple status, and can correlate IT performance to business metrics.
The DevOps Ideal: Real User Monitoring for Service Delivery
The example of recent airline website outages, which led to massive flight disruptions and lost revenue, highlights that new monitoring and management strategies are long overdue. It is time to move beyond only pinging a system to see if it is up or down. Pinging is useful, but it is not the same as knowing how well the service is running and meeting business requirements. Knowing in real time that a machine is running and delivering some subset of a service to a customer is real business value.
The next question becomes how to most efficiently measure performance for those quality of service requirements. The answer is to measure the latency of every interaction between every component in the system. In this new service-centric world, high latency is the new “down”. Instead of just checking for available disk space or the number of IO operations against that disk, it’s important to check (for example) the latency distribution of the API requests. Just knowing how much memory the system is using isn’t enough; it’s much more important to know how many microseconds of latency occur against every query.
What should be measured is the actual performance of all of the components in the system and along the path of service delivery between the customer and the data center. Don’t just check to see if an arbitrarily determined “average” transaction time has been met or a system is up. While these kinds of traditional metrics are still useful and necessary to monitor, it is crucial to see if your quality of service requirements are met.
Every user on a web app, or every customer website hit, uses a plethora of infrastructure components, and the quality of the user’s experience is affected by the performance of numerous microservices. Completely understanding performance requires checking the latency of every component and microservice in that system. All of those latencies add up to make or break the customer experience, thereby determining the quality of your service.
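As a rough illustration of what checking the latency of every component can look like in application code, the following Python sketch wraps each call in a timer and records its latency per component. The component names, the `timed` helper, and the stand-in workloads are hypothetical, and a real deployment would ship these measurements to a monitoring system rather than keep them in memory.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical in-memory store: component name -> observed latencies (microseconds).
latencies_us = defaultdict(list)

@contextmanager
def timed(component):
    """Record how long the wrapped block takes, even if it raises."""
    start = time.perf_counter_ns()
    try:
        yield
    finally:
        elapsed_us = (time.perf_counter_ns() - start) // 1_000
        latencies_us[component].append(elapsed_us)

def handle_request():
    """Stand-in request handler that touches two hypothetical components."""
    with timed("database"):
        sum(range(1_000))      # placeholder for a real query
    with timed("cache"):
        sorted(range(100))     # placeholder for a real cache lookup

handle_request()
handle_request()
print({name: len(values) for name, values in latencies_us.items()})
```

Every interaction produces one measurement, so the volume of data grows with traffic rather than with the number of hosts.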
Will the quality of your service be affected if 1 out of every 100 database queries is painfully slow? Will your business be impacted if 5 out of every 100 customer experiences with your service are unpleasant? Traditional monitoring tools that store and alert on averages leave DevOps personnel blind to these situations. Every user matters and so does every user interaction. Their experience is directly affected by every component interaction, every disk interaction, every cloud service interaction, every microservice interaction, and every API query – so they should all be measured.
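To make the point about averages concrete, here is a small Python sketch using made-up numbers: 95 requests that take 10 ms and 5 that take 2,000 ms. The sample data and the `summarize` helper are illustrative assumptions, not measurements from a real system.

```python
import math
import statistics

def summarize(latencies_ms):
    """Return the mean, median, and nearest-rank 99th percentile of a sample."""
    ordered = sorted(latencies_ms)
    # Nearest rank: the smallest value with at least 99% of samples at or below it.
    rank = max(1, math.ceil(len(ordered) * 99 / 100))
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p99": ordered[rank - 1],
    }

# Hypothetical sample: 5 out of every 100 experiences are painfully slow.
sample = [10.0] * 95 + [2000.0] * 5
summary = summarize(sample)
print(summary)
```

In this sample the mean is 109.5 ms, a value that no request actually experienced. The median says a typical request is fast, and only the 99th percentile exposes the two-second tail.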
Imagine using Real User Monitoring (RUM) as it exists today for measuring the total latency experienced by the user, but applied within the stack to alert DevOps personnel to unacceptable latency in subcomponents and microservices – before they affect end-to-end service quality. If it is not measured, then IT is blind to the underpinnings of what causes a web app or website to meet or to fail service level agreements.
Enterprises need a new set of monitoring solutions that reliably and cost-effectively measure everything, without increasing staff demands. Such a solution should:
- Be reliable and comprehensive, covering all activity for ALL infrastructure all the time. It needs to be up-to-date to monitor exactly what is there and not components that were removed long ago.
- Have 100% operability. The architecture needs to be on all the time and so does the system that monitors it. It should be upgradeable without disruptions.
- Be API accessible. Do-it-yourself APIs replace the need to go through a support desk for monitoring, with automatic updates to keep up with always changing environments.
- Be analytics-driven to correlate IT operations metrics and business metrics. Monitoring everything requires the built-in ability to continuously aggregate all infrastructure and transaction data, run machine learning algorithms, and graph data into visualizations that are easy to understand.
- Run at scale. There should be no compromise on performance, regardless of the infrastructure environment size or the amount of data collected to run analytics in real-time.
Circonus Examples – How to Monitor the DevOps Way
All computing architectures pump out constant streams of data. As valuable as it clearly is, can an organization afford to store that much data? Circonus changes the economics of storing and processing all that data, so even a small company can collect billions of measurements per second and afford to analyze all of their data for better answers to better questions.
Before, only a handful of hyperscale web properties or apps had this ability. Now, Circonus changes how data is stored, processed, and consumed, thereby saving time for IT staff and helping them optimize infrastructure performance for organizations of all sizes.
Circonus Provides Big Data Analytics for Monitoring Data Using Histograms
How does IT process, consume, and make decisions using a thousand or a million measurements per second? Consider IO latency. Knowing the microsecond latency of every IO operation, read and write, on every spindle on every machine in every cabinet in every data center could mean as many as 40 million measurements per second. That’s a lot of data. What about a billion measurements per second? Say every machine runs at two billion cycles per second on multiple cores. Consider measuring the latency of context switching on every machine. That could mean a million measurements per second on each of 10,000 machines, which would yield 10 billion measurements per second. How does a monitoring tool deal with 10 billion measurements per second?
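The arithmetic behind that last figure is easy to check. The sketch below reads the example as one million measurements per second on each of 10,000 machines, and it adds one assumption that is not in the text: that a raw measurement is stored as an 8-byte value.

```python
# Figures from the context-switch example.
measurements_per_machine_per_second = 1_000_000
machines = 10_000

total_per_second = measurements_per_machine_per_second * machines

# Assumption for illustration only: each raw measurement takes 8 bytes.
bytes_per_measurement = 8
seconds_per_day = 24 * 60 * 60

raw_bytes_per_second = total_per_second * bytes_per_measurement
raw_bytes_per_day = raw_bytes_per_second * seconds_per_day

print(f"{total_per_second:,} measurements per second")
print(f"{raw_bytes_per_second / 1e9:.0f} GB per second of raw values")
print(f"{raw_bytes_per_day / 1e15:.3f} PB per day of raw values")
```

Under that assumption, keeping every raw value would mean roughly 80 GB per second, which is why a more compact representation is needed.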
That’s an extreme example, but consider API request latency. Tracking this vital statistic between just two microservices, or a single microservice and a database, can require storage of millions of measurements per second. Existing monitoring tools will reduce all of those measurements to a single number – the average latency over an arbitrarily determined time window, typically a minute. They do this because they lack Circonus’s patent-pending ability to store all of this useful information cost-effectively.
Circonus solves this problem by using histograms to store and analyze massive amounts of data in a cost-efficient way. This method of storage, analysis, and visualization provides more meaningful graphs that can show distributions and outliers, regardless of data scale. Whether it’s a million per second, a billion per second, or a trillion per second… these are merely unit differences; ultimately it’s the same math problem.
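To show why a histogram makes the scale of the input largely irrelevant, here is a generic Python sketch that rounds each latency down to two significant digits and counts how many samples land in each bucket. This is an illustrative bucketing scheme chosen for this example, not a description of Circonus's actual storage format, and the function names and sample data are hypothetical.

```python
from collections import Counter

def bucket_lower_bound(latency_us: int) -> int:
    """Round a non-negative integer latency down to two significant digits."""
    if latency_us < 0:
        raise ValueError("latency must be non-negative")
    if latency_us < 100:
        return latency_us  # small values are kept exactly
    magnitude = 10 ** (len(str(latency_us)) - 2)
    return latency_us // magnitude * magnitude

def build_histogram(samples_us):
    """Map each bucket's lower bound to the number of samples in that bucket."""
    return Counter(bucket_lower_bound(value) for value in samples_us)

# Hypothetical data: 100,000 distinct latencies collapse into a few hundred buckets.
first_minute = build_histogram(range(1, 100_001))
second_minute = build_histogram([1_234, 1_250, 98_765])

# Histograms from different time windows or hosts merge by adding counts.
merged = first_minute + second_minute

print(len(first_minute), "buckets for", sum(first_minute.values()), "samples")
```

The number of buckets depends on the range of the values, not on how many samples arrive, and each bucket's lower bound is within 10% of every value it holds. Ten times more traffic makes the counts larger but does not add buckets.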
Looking at the IO measurements in Figure 2 shows us just how complicated a workload can be, but once a workload is realistically visualized, any casual observer with no prior knowledge about the graph can see the changes in response over time.
In the figure, the y-axis shows IO latency while the x-axis shows time, from January 4th at 4 PM to January 6th at 12 PM. The saturation of color in the heat map indicates the number of measurements in a given range. Clearly, on January 5th, the response time profile looks different. Shortly before 12 noon it started to change, and at 4:00 PM it reverted to normal.
The smaller graph in the upper right corner zooms in to show us a close up of this data. This graph displays the histogram for the slice of time selected by the cursor. The x-axis indicates the range of latencies, and the y-axis height represents the number of transactions that fall in each bucket. It doesn’t take math skills to see that something changed; all it takes is Circonus visualization.
Let’s take a look at another example in Figure 3. It’s a sample Circonus graph of four nodes on a Content Delivery Network (CDN). Containing about five million data points, these graphs are latency profiles of every request for a two-week time interval. Looking at any specific time frame represented by the vertical slices, it is easy to see that it is pixelated. Each of the four graphs shows the latency of web request delivery for one CDN node. This data visualization is an extremely valuable complement to Real User Monitoring (RUM). RUM represents the total latency experienced by the end user and, with graphs like this one, Circonus can highlight the portion of total latency attributable to the network.
The graph contains other interesting information as well. Note the overlaid histograms. Instead of showing just averages – which tell us very little about data distributions – it shows distribution curves. Note that these curves are all strongly multi-modal, showing how each node has a different user population experiencing very different latency behaviors. Such differences are important to monitor and are everywhere in every computing environment.
How does this apply to identifying abnormalities that may affect user experience in a computing environment with 10,000 machines? How easy is it for an engineering team to detect faults or other anomalies before they impact service quality? Circonus applies machine learning algorithms, which are heavily statistics-based, to numerically transform this histogram to show and, more importantly, “alert” DevOps personnel to changes in its modality.
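The article does not describe the algorithms Circonus uses, so the following Python sketch only illustrates the general idea of alerting on a change in modality. It counts local peaks in a histogram's bucket counts and compares the result with a baseline. The `find_modes` function, the 5% threshold, and the bucket counts are all assumptions made for this example, and a production detector would need smoothing and statistical tests that this naive version omits.

```python
def find_modes(counts, min_fraction=0.05):
    """Return indexes of buckets that are local peaks holding at least
    min_fraction of all samples. Naive: no smoothing is applied."""
    total = sum(counts)
    if total == 0:
        return []
    modes = []
    for i, count in enumerate(counts):
        left = counts[i - 1] if i > 0 else 0
        right = counts[i + 1] if i < len(counts) - 1 else 0
        # Strict on the left, non-strict on the right, so a flat-topped
        # peak is counted once.
        if count > left and count >= right and count >= min_fraction * total:
            modes.append(i)
    return modes

# Hypothetical bucket counts, ordered from low latency to high latency.
baseline = [2, 40, 90, 40, 5, 1, 0, 0, 0, 0]
current = [2, 40, 90, 40, 5, 1, 8, 30, 9, 1]

baseline_modes = find_modes(baseline)
current_modes = find_modes(current)
modality_changed = len(current_modes) != len(baseline_modes)

if modality_changed:
    print(f"alert: modes changed from {baseline_modes} to {current_modes}")
```

Here the baseline has a single hump, while the current window has grown a second hump in the higher-latency buckets. An average would barely register that shift.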
Figure 4 is a Circonus graph of an 8-hour time interval containing millions of data points. Circonus typically renders graphs like this in less than a half second. The smaller histograms to the right of the graph reveal different humps on the curve, each representing a kernel of activity. On the left side, there is a spike, meaning that there’s a workload that is generating latency. There’s a second workload that is generating latency in the long tail out there to the right of the curve (graphed at minute 25). In this case, these characteristics could trigger an alert. In places, the graph may look somewhat normal, but it’s clear that some workloads were not present and functioning as they should have been.
Circonus Solves the Problem of Lost Context for Better Root Cause Analysis and Faster MTTR
Investigating infrastructure monitoring data requires IT staff time, which means it costs money. All too often, a spike on a graph can take a lot of time to explore. It’s not unusual to do an upgrade on one part of the system, only to create a horrible spike in latency for a portion of end-users captured by a RUM tool. Weeks later, someone may see the spike and then spend hours researching what happened because that context has been lost. Quite simply, context matters in quickly troubleshooting performance issues.
In Circonus, events are correlated across the system as a whole. When a large IT architecture is deployed, particularly one that is microservices-driven, one component has the ability to negatively or positively impact another component in the system. Well-designed microservice architectures act like springs. Springs are designed not to break, but that does not mean a serious force applied to one part of the system has no impact on another part. Circonus visually displays this impact to speed up issue identification.
The ability to analyze and make decisions within context is a key feature of Circonus. Such context is captured by maintaining a timeline of system events, changes, upgrades, and alerts. As any IT staff member knows, business events, such as earnings announcements, special promotions, or marketing campaigns, can impact the entire infrastructure. Circonus easily allows anyone, even non-technical people, to use the system to make a record of impactful business events. Understanding the business context around IT events can reduce the amount of IT engineering time spent on investigating infrastructure performance problems. Circonus takes measurements from everything, no matter how many, and provides superior tools to reduce the amount of time required to identify and correct the root cause of service-impacting faults. Using an organization’s own system data, Circonus aims to put it all together to keep engineers as informed as possible, able to quickly solve the problems that inevitably arise.
Circonus Analytics Open Up New Opportunities to Optimize Service Level Agreements, Save Money and Generate Revenue for the Business
Everyone agrees that Service Level Agreements (SLAs) are a critical factor for both customers and Software-as-a-Service providers. They contain a lot of performance metrics, such as network performance requirements, disk service requirements, or database service requirements. Unfortunately, SLAs are typically written with arbitrarily chosen levels. They are written without any understanding of actual infrastructure performance and financial impact.
Most SLA metrics are created using quantiles. Metrics are captured on all transactions within a defined interval and the 99.9th percentile is used by default to identify the slowest relevant point. With no data or reasoning behind it, an organization usually makes an arbitrary choice about an SLA requirement such as maximum API latency. The danger to the business is that these arbitrary targets could be unnecessarily expensive to hit.
With Circonus, it’s easy to determine the cost of hitting a range of SLA specifications, and to determine whether 99% or 99.9% strikes the optimal balance between cost and benefit to the enterprise. It’s possible to determine, for example, if 99.999% costs the same as a 99.99% SLA. Circonus collects all source data, not just summary metrics computed over some arbitrary, predetermined time interval, and thereby enables decision-making based on real-world data rather than guesses based on inaccurate summaries.
Without unsummarized source data over a long period of time, there is just no way to actually answer these types of questions. Using histograms, which Circonus users have found to be up to 100x more efficient at storing metrics, Circonus can cost-effectively keep every single API call ever served, allowing the IT organization to go back and actually look at its data whenever the need arises. Storing latency measurements as histograms also enables an organization to go back and accurately identify the optimal service level to cost-effectively satisfy all of its customers. It can determine the SLA objectives that are inexpensive to hit and which targets can be cost-effectively inflated. Conversely, it can also identify those SLA targets that are especially costly to hit, so that the business can make informed decisions about its SLA commitments. Without Circonus machine learning insights, a business could lose significant revenue and incur large or unnecessary SLA violation costs.
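As a minimal sketch of how stored histograms can answer these SLA questions after the fact, the Python below estimates percentiles and target compliance from bucket counts. The histogram values, the `quantile` and `compliance` helpers, and the candidate targets are hypothetical and chosen for illustration. They do not represent a Circonus API.

```python
def quantile(histogram, q):
    """Estimate quantile q (0 < q <= 1) as the upper bound of the bucket
    that contains it. histogram maps bucket upper bound (ms) -> count."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = sum(histogram.values())
    if total == 0:
        raise ValueError("histogram is empty")
    running = 0
    for upper in sorted(histogram):
        running += histogram[upper]
        if running >= q * total:
            return upper
    return max(histogram)  # guards against floating-point rounding at q == 1

def compliance(histogram, target_ms):
    """Fraction of requests in buckets whose upper bound is within target_ms."""
    total = sum(histogram.values())
    within = sum(count for upper, count in histogram.items() if upper <= target_ms)
    return within / total

# Hypothetical API latency histogram: bucket upper bound in ms -> request count.
api_latency = {10: 9_000, 20: 850, 50: 120, 200: 25, 1_000: 5}

for q in (0.99, 0.999, 0.9999):
    print(f"p{q * 100:g} is at most {quantile(api_latency, q)} ms")

print(f"{compliance(api_latency, 50):.2%} of requests finished within 50 ms")
```

With these made-up numbers, a 50 ms target is met at the 99th percentile but not at the 99.9th, which would require promising 200 ms instead. That is the kind of trade-off that cannot be evaluated from per-minute averages.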
Current monitoring systems were never designed to handle metric growth that scales to the millions or hundreds of millions of data streams that need to be collected to support today’s “always on” service-centric IT environments. Circonus is architected to handle growth and resiliency so that if any component in the monitoring system goes down, it doesn’t prevent you from measuring service levels or receiving alerts. With extensive analytics and a proprietary data store, the Circonus monitoring platform provides operational intelligence and meaningful visualization of even massive amounts of data – helping DevOps (both engineering and the operations team) to accelerate time to market and consistently improve customer experience.