A Guide to Service Level Objectives, Part 3: Quantifying Your SLOs

A guide to the importance of, and techniques for, accurately quantifying your Service Level Objectives.

This is the third in a multi-part series about Service Level Objectives. The second part can be found here.

As we’ve discussed in part one and part two of this series, Service Level Objectives (SLOs) are essential performance indicators for organizations that want a real understanding of how their systems are performing. However, these indicators are driven by vast amounts of raw data. So how do we make sense of it all and quantify our SLOs? Let’s take a look.

Feel The Heat: Map Out Your Data

The following heat map, based on histogram data, shows two weeks of API request latency, displayed in two-hour time slices. At Circonus, we use log linear histograms to store time series data; the data is sorted into bin structures with roughly 100 bins for every power of 10 (for details, see Circonus Histogram Internals). This structure accommodates a wide range of values without requiring the operator to configure histogram bucket sizes explicitly. In all, this graph represents about 10 million data points. Notice that the left y-axis is in milliseconds, and most of the values are concentrated under 0.1 seconds, or about 100 milliseconds.
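To make the bin structure concrete, here is a minimal Python sketch of log-linear bucketing. It illustrates the general idea only; the exact bin layout Circonus uses differs in its details, and the two-significant-digit edges here are an assumption for the example.

import math

def log_linear_bin(value):
    # Lower edge of the log-linear bin containing `value`: each decade
    # [10^e, 10^(e+1)) is split into 90 linear bins, so edges keep two
    # significant digits and relative error stays roughly constant.
    if value <= 0:
        return 0.0
    exponent = math.floor(math.log10(value))
    bin_width = 10.0 ** (exponent - 1)
    # Tiny epsilon guards against floating-point error at bin edges.
    return math.floor(value / bin_width + 1e-9) * bin_width

# Latencies (in seconds) spanning several orders of magnitude land in
# proportionally sized bins.
for latency in (0.0042, 0.087, 12.4):
    print(latency, "->", log_linear_bin(latency))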

Heat map with histogram overlay

If we hover over one of the time slices in this heat map, we can see a histogram overlay showing the distribution of values for that slice. For example, the bin shown in the graph above covers a very wide range of values, but when we zoom in closer we see that the distribution is concentrated toward the left side of the graph, with modes at about 5 milliseconds and 20 milliseconds.

Histogram Overlay

Now let’s look at how this heat map is generated by examining the CAQL statement in the legend. The Circonus Analytics Query Language (CAQL) is a function-based language that works by piping data through commands, much like the UNIX command line. Because we store the raw distribution of the data, this graph gives us a great canvas on which to apply some math (i.e., transform overlays generated by CAQL statements) and give real context and meaning to our data.

Heat map with 99th percentile overlay

We can start by applying a 99th percentile overlay to the data, which shows, for each time slice, the latency below which 99% of the values fall. Notice that most of the high points on this graph are slightly over 10 seconds. That’s not a coincidence: since this is an API, most client default timeouts fall right around 10 seconds. What we’re seeing here is a number of client timeouts, which would also show up in the errors graph on a RED dashboard (which we will cover in another post). Here’s how we generated that overlay:

metric:histogram("14ab8f94-da3d-4047-f6fc-81cc68e9d4b5", "api`GET`/getState") | histogram:percentile(99)

This simple statement says to display the histogram values for a certain metric, here the latency of an API call, and then calculate the 99th percentile overlay for those values. This is something most monitoring solutions can’t do, because they typically store aggregated percentiles instead of the raw distribution of the data as a histogram.
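For intuition, here is a rough Python sketch of how a percentile can be read back from a stored distribution after the fact. The bin edges, counts, and helper name are invented for the example, and the interpolation CAQL performs within each bin is more refined than this.

def percentile_from_histogram(bins, p):
    # Approximate the p-th percentile from (bin_lower_edge, count) pairs.
    # Because the full distribution is retained, any percentile can be
    # computed after the fact by walking the cumulative bin counts.
    total = sum(count for _, count in bins)
    target = total * p / 100.0
    seen = 0
    for edge, count in sorted(bins):
        seen += count
        if seen >= target:
            return edge
    return sorted(bins)[-1][0]

# Latency histogram for one time slice: (bin lower edge in ms, sample count).
hist = [(5, 40_000), (20, 55_000), (100, 4_000), (500, 800), (10_000, 200)]
print(percentile_from_histogram(hist, 99))   # falls in the 100 ms bin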

Our approach allows us to calculate arbitrary percentiles over arbitrary time ranges and see the latency that 99% of requests fall under. That’s not something you can do when you only store the 99th percentile for a fixed set of time ranges: you can’t find the 99th percentile of a larger time range by averaging the 99th percentiles of the smaller ranges within it.
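A small synthetic example makes the point. The two latency windows below are invented, but they show how far an average of per-window 99th percentiles can drift from the true 99th percentile of the combined data.

import random

def percentile(samples, p):
    # Nearest-rank percentile, good enough for this illustration.
    s = sorted(samples)
    return s[min(len(s) - 1, int(len(s) * p / 100))]

random.seed(42)

# Two windows with very different traffic and latency profiles (synthetic).
window_a = [random.uniform(5, 50) for _ in range(100_000)]           # busy, fast
window_b = ([random.uniform(5, 50) for _ in range(1_000)] +
            [random.uniform(2_000, 10_000) for _ in range(1_000)])   # quiet, half slow

naive = (percentile(window_a, 99) + percentile(window_b, 99)) / 2    # averaged per-window p99s
true_p99 = percentile(window_a + window_b, 99)                       # p99 over the raw, merged data

print(f"averaged p99s: {naive:.0f} ms    true p99: {true_p99:.0f} ms")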

Inverse percentiles show us the percentage of values over a certain threshold, which we can then establish as a service level objective.

Inverse quantile calculation for 500 milliseconds

For example, let’s say we have an SLO of 500 milliseconds of latency. In the above graph there is a spike around the 50% mark, which means that 50% of the values in that time slice exceeded 500 milliseconds and we violated our SLO.

metric:histogram("14ab8f94-da3d-4047-f6fc-81cc68e9d4b5", "api`GET`/getState") | histogram:inverse_percentile(500)

The above CAQL statement will show the percentage of requests that exceed that SLO.
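In plain Python terms, the idea behind this inverse percentile looks roughly like the sketch below. The sample latencies and helper name are invented; CAQL computes the equivalent quantity directly from the stored histogram bins rather than from raw samples.

def inverse_percentile(latencies_ms, threshold_ms):
    # Percentage of requests whose latency exceeds the threshold.
    over = sum(1 for v in latencies_ms if v > threshold_ms)
    return 100.0 * over / len(latencies_ms)

# What fraction of this (made-up) time slice violated a 500 ms SLO?
slice_latencies = [12, 48, 230, 510, 950, 47, 1_200, 88, 430, 620]
print(f"{inverse_percentile(slice_latencies, 500):.0f}% of requests exceeded 500 ms")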

We can also display this as an area graph to make SLO violations more visible. The area under the line is the amount of time we spent within the limits of our SLO; in this example, we’re doing a good job.

Area graph, green shows where we meet the SLO

Determining the actual threshold is a business decision, but 200 milliseconds is generally a good expectation for web services. We find that setting the SLO as a time-based threshold, rather than picking an arbitrary percentile, is easier for humans to understand.

The traditional method might be to say we want 99 percent of our requests to fall under the 500 millisecond threshold, but what is more valuable and easier to understand is knowing how many requests exceeded the SLO, and by how much. When we violate our SLO, we want to know: how bad is the damage? How much did our service suffer?

Quantifying the percentage of requests that meet that SLO is a good start, but we can take it a bit further.

What really matters to the business is the number of requests that failed to meet our SLO, not just the percentage.

Inverse quantile calculation: count of requests that violated the 500 millisecond SLO
op:prod(){op:sub(){100,metric:histogram("14ab8f94-da3d-4047-f6fc-81cc68e9d4b5", "api`GET`/getState") | histogram:inverse_percentile(500)}, metric:histogram("14ab8f94-da3d-4047-f6fc-81cc68e9d4b5", "api`GET`/getState") | histogram:count()}

Using this graph, we can calculate the number of SLO violations by taking the inverse quantile calculation of requests that violated our 500 millisecond SLO and graphing it. The CAQL statement above subtracts the percentage of requests that did not violate our SLO from 100 to get the percentage that did, then multiplies that by the total request count, which gives us the number of requests that violated the SLO.
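The same arithmetic in plain Python, with an explicit division by 100 to turn the percentage into a fraction (the figures are invented for the example):

def violation_count(pct_meeting_slo, request_count):
    # Percentage of requests that violated the SLO, applied to the
    # total request count for the time slice.
    return (100.0 - pct_meeting_slo) / 100.0 * request_count

# e.g. a slice in which 98.5% of 2,000,000 requests met the 500 ms SLO
print(round(violation_count(98.5, 2_000_000)))   # roughly 30,000 violating requests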

The spikes in the graph above show the number of requests that violated our SLO in each time slice. As you can see, there are some instances with around 100,000 violations within a single time slice, which is fairly significant. Let’s take this a step further: we can use calculus to find the total number of violations, not just within a given time slice, but over time.

Cumulative number of requests that exceeded the 500 millisecond SLO
op:prod(){op:sub(){100,metric:histogram("14ab8f94-da3d-4047-f6fc-81cc68e9d4b5", "api`GET`/getState") | histogram:inverse_percentile(500)}, metric:histogram("14ab8f94-da3d-4047-f6fc-81cc68e9d4b5", "api`GET`/getState") | histogram:count()} | integrate()

The CAQL statement above is similar to the previous one, but pipes the result through the integrate() function to accumulate the violation counts over time. The blue line shows a monotonically increasing number of violations; the inflection points, where the slope of the graph changes, are the points where our system goes off the rails.
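A plain-Python analogue of that integration is a running total over the per-slice violation counts; the synthetic figures below show the same shape, a monotonically increasing series whose jumps mark the incidents.

from itertools import accumulate

# Violations per two-hour time slice (synthetic figures for illustration).
violations_per_slice = [0, 120, 0, 35_000, 400, 0, 98_000, 250, 0, 0]

# Running total over time; the large jumps correspond to SLO incidents.
print(list(accumulate(violations_per_slice)))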

Wherever the slope increases, we are violating our SLO. We can use these points as waypoints for forensic analysis in our logs, to figure out exactly why the system was misbehaving (for example, a database slowdown). The graph also shows how much damage the misbehavior caused: the bigger the jump on the y-axis, the more we violated our SLO.

We can now quantify this damage on a per-request basis. If each request that violated our SLO represents the loss of a product sale, we can modify the CAQL statement to assign a dollar value to each failed request. The result is a brutally honest KPI that will ripple across the entire business, demonstrating the importance of your SLOs and how failures in your system can cause failures in your business.
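As a hypothetical illustration (the per-request dollar value and the violation counts below are invented), the conversion is a simple multiplication:

# Assumed value of a lost sale per violating request, in dollars (hypothetical).
REVENUE_PER_VIOLATION = 1.75

# Per-slice violation counts, e.g. from the calculation above (synthetic).
violations_per_slice = [0, 120, 0, 35_000, 400, 0, 98_000, 250]

print(f"estimated revenue impact: ${sum(violations_per_slice) * REVENUE_PER_VIOLATION:,.2f}")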

On The Fly: Real Time Anomaly Detection

It’s vital to understand when your system was violating your SLO, and it’s good to be able to run forensics after the fact, but what’s really valuable is getting that information in real time. We can take another CAQL statement, compute the difference in the count of requests that violated the SLO, and apply an anomaly detection algorithm to identify the points where those violations occurred.

Anomaly detection using SLO violation request diff counts
op:prod(){op:sub(){100,metric:histogram("14ab8f94-da3d-4047-f6fc-81cc68e9d4b5", "api`GET`/getState") | histogram:inverse_percentile(500)}, metric:histogram("14ab8f94-da3d-4047-f6fc-81cc68e9d4b5", "api`GET`/getState") | histogram:count()} | diff() | anomaly_detection(20, model_period=120, model="constant")

These are instances where the algorithm has identified potential anomalies. It gives each potential anomaly a score from 0 to 100, with 100 being a definite anomaly and lower scores indicating less certainty. We can also create an alert from this CAQL statement, which will notify our operations team in real time every time we have an SLO violation.

This example uses the constant model with a model period of 120 and a sensitivity of 20. We can adjust the sensitivity to make the algorithm more or less sensitive to anomalies, independent of any fixed threshold. Either way, we can monitor and track these anomalies as they happen, providing contextual, actionable insight.
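For a rough sense of what such a detector does, here is a deliberately simple sliding-window sketch in Python. It is not the algorithm Circonus uses; the window size, the z-score rule, and the sample data are all assumptions made for illustration.

from statistics import mean, stdev

def flag_anomalies(series, window=60, sensitivity=3.0):
    # Flag points that sit more than `sensitivity` standard deviations
    # above the mean of the preceding `window` points.
    anomalies = []
    for i in range(window, len(series)):
        baseline = series[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and series[i] > mu + sensitivity * sigma:
            anomalies.append(i)
    return anomalies

# Per-minute diffs of the SLO-violation count (synthetic): mostly flat,
# with a sudden burst near the end.
diffs = [5, 3, 6, 4, 5, 4, 6, 5, 3, 4] * 12 + [250, 310, 8, 5]
print(flag_anomalies(diffs))   # indices of the suspicious spikes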

In Conclusion

Our approach gives you the freedom to stop constantly watching your system, because the analysis and alerts surface SLO violations for you.

By intelligently quantifying your SLOs through the methods described above, you can tie the performance of a given part of the system into the overall performance of the business.

This empowers you to adjust your operational footprint as needed to ensure that your SLOs are being met, and ultimately allows you to focus on your business objectives.

If you have questions about this article, or about SLOs in general, feel free to join our Slack channel and ask us. To see the work that inspired this article, have a look here.

Monitoring DevOps: Where are we now? [Infographic]

Our first DevOps & Monitoring Survey was conducted at ChefConf 2015. This year, we’ve created an infographic based on the facts and figures from our 2018 Monitoring DevOps Survey. It provides a visual representation of the prevalence of DevOps, how monitoring responsibilities are distributed, how metrics are used, and various aspects of current monitoring tools.

The infographic offers insights into the strategies used by others in our community. Let us know what you think, and feel free to share it with your friends.

DevOps & Monitoring Survey

Circonus Update: New UI Rollout

The Circonus team is excited to announce the release of our newest update, which includes sweeping changes to the Circonus Monitoring Platform UI.

This update is part of an ongoing effort to optimize the Circonus UI for performance and usability on mobile devices, such as phones and tablets, as well as on media PCs, desktops, and laptops; almost every change to our familiar UI directly supports that optimization. You’ll find that just about every page is now responsive. We’ll be continuing these efforts, tackling the dashboards and check creation workflows next.

We’re also grouping some features to provide a more streamlined experience, improving how data is displayed and making controls consistent across all of the different pages and views.

Look for these changes to go live this Thursday, April 26th!

What’s New?

The biggest change is that the displays are consistent for ALL of the different searchable items in Circonus: hosts, metrics, checks, rule sets, graphs, worksheets, everything!

Every item has a List View that includes the search functionality, and a Details View with more information for each item in the list. All of these Details View pages will have the same familiar design, and the List View for each item will also have the same functionality. Users will still be able to select their preferred layout for Metrics and Hosts lists, whether they want their List View in a grid or an actual list.

Another significant change is that the List View and the Details View are separate. Circonus will no longer display a dropdown accordion with a Details View inside of the List View. Instead, you can use the List View to search for data and simply click the View button to visit a page specifically designed for viewing those details.

These views group many familiar Circonus features together in a dropdown menu that appears under the View button, as you can see in this Alert History list.

Many frequently used Circonus features are grouped under one of our two (and only two) “burger”-style menus: one for mobile navigation and one for tag filters. Controls for other features have been replaced with intuitive icons, placed where they clearly indicate what users can do from different views.

Menu items are context dependent, and display options relevant to the current List View or Details View.

All of the Token API Management pages have been consolidated to a single API Token page, under the Integrations menu.

Account administration has also been consolidated and streamlined, as you can see from this view of the Team page:

FAQ

There are a lot of changes in this update, so to assist users with this transition, we’ve prepared answers for a few questions we anticipated that you might ask.

How do I view the check UUID?

The view for the API object, which includes the check UUID, is available on the checks list page by clicking the down arrow next to the View button. You can also visit the Details View page for the Check to get all pertinent Check info, including the UUID.

How do I view details for two things at once now that the List View and Details View are separate?

We recommend opening a separate Details View page for each item you want to view, by right-clicking on the View button in the List View and opening multiple new tabs.

What’s Next?

Our team is dedicated to continuously improving Circonus and has recently prepared our roadmap for the next year, so we can confidently say there are many more exciting new features and performance enhancements on the horizon.

Our next big UI update will enhance our dashboards and the check creation workflows. These features will receive the same responsive improvements you’ll see in the rest of the UI, along with usability improvements.

UI Redesign FAQ

Today, the Circonus team is releasing the first round of our new User Interface changes.

Since we started developing Circonus 7 years ago, we’ve worked with many customers in different spaces with different needs. Our team talked with many engineers about how they monitor their systems, what their workflow looks like, and the configuration of their ideal platform. You’ve spoken and we listened. Now, we’ve put those years of feedback to use, so that you get the benefit of that collective knowledge, with a new, improved interface for managing your monitoring.

The interface now features responsive improvements for different screen sizes (providing better support for mobile devices) and a revised navigation menu. More changes will follow as we adapt our interface to intuitively suit our users’ needs.

Frequently Asked Questions

Change, even change for the better, can be frustrating. This FAQ will help explain the interface changes. We’ll be adding more items to this FAQ as we get more feedback.

Where did the checks go?

Checks have been moved under the “Integrations” menu.

Where did the graphs go?

Graphs are now under the “Analytics” menu.

Where are the rulesets?

Rulesets can be found under the “Alerts” menu.

Send us your feedback

Tell us about your own Circonus experience, and let us know what you think about our new User Interface: