A guide to the importance of, and techniques for, accurately quantifying your Service Level Objectives.

This is the third in a multi-part series about Service Level Objectives.
The first part can be found here and the second part can be found here.

As we’ve discussed in part one and part two of this series, Service Level Objectives (SLOs) are essential performance indicators for organizations that want a real understanding of how their systems are performing. These indicators, however, are driven by vast amounts of raw data. So how do we make sense of it all and quantify our SLOs? Let’s take a look.

Feel The Heat: Map Out Your Data

The following heat map, based on histogram data, shows two weeks of API request latency data displayed in two-hour time slices. At Circonus, we use log linear histograms to store time series data; the data is sorted into bin structures with roughly 100 bins for every power of 10 (for details, see Circonus Histogram Internals). This structure provides flexibility for a wide range of values without requiring explicit operator configuration of histogram bucket sizes. In all, this graph represents about 10 million data points. Notice on the left y-axis that most of the values are concentrated below 0.1 seconds, or about 100 milliseconds.

Heat map with histogram overlay
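
As a rough sketch of how a log linear histogram buckets each latency sample, consider the short Python fragment below. This is a simplified illustration only, not the actual Circonus bin layout (see Circonus Histogram Internals for the real structure):

import math

# Simplified sketch: each bin covers [m * 10^e, (m+1) * 10^e), where m keeps two
# significant digits. That yields on the order of 100 bins per power of 10,
# similar in spirit to (but not the same as) the Circonus log linear histogram.
def log_linear_bin(value):
    exponent = math.floor(math.log10(value)) - 1
    mantissa = math.floor(value / 10 ** exponent + 1e-9)  # guard against float round-off
    return mantissa, exponent

# 4.2 ms and 4.8 ms land in different bins; 4.29 ms shares the 4.2 ms bin.
print(log_linear_bin(0.0042), log_linear_bin(0.0048), log_linear_bin(0.00429))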

If we hover over one of the time slices in this heat map, we can see a histogram overlay showing the distribution of values for that time slice. For example, the time slice shown above has a distribution with a very wide range of values, but when we zoom in closer we see it is concentrated toward the left side of the graph, with modes at about 5 milliseconds and 20 milliseconds.

Histogram Overlay

Now we can look at how this heat map is generated by examining the CAQL statement in the legend. The Circonus Analytics Query Language (CAQL) is a function-based language that works by piping data through commands, in a manner similar to the UNIX command line. Since we store the raw distribution of the data, this graph gives us a great canvas on which to apply some math (i.e., transform overlays generated by CAQL statements) that gives real context and meaning to our data.

Heat map with 99th percentile overlay

We can start by applying a 99th percentile overlay to the data, which shows, for each time slice, the latency below which 99% of the values fall. Notice that most of the high points on this graph sit slightly over 10 seconds. That’s not a coincidence: since this is an API, most default client timeouts fall right around 10 seconds. What we’re seeing here is a number of client timeouts, which would also show up in the errors graph on a RED dashboard (which we will cover in another post). Here’s how we generated that overlay:

metric:histogram("14ab8f94-da3d-4047-f6fc-81cc68e9d4b5", "api`GET`/getState") | histogram:percentile(99)

This simple statement says: take the histogram values for a given metric (here, the latency of an API call) and calculate the 99th percentile overlay for those values. This is something no other monitoring solution can do, because most of them store aggregated percentiles instead of storing the raw distribution of the data as a histogram.

Our approach allows us to calculate arbitrary percentiles over arbitrary time ranges and see what latency 99% of the requests fall under. That’s not something you can do when you only store the 99th percentile for a fixed set of time ranges: you can’t compute the 99th percentile of a larger time range by averaging the 99th percentiles of the smaller ranges within it.
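
A quick way to convince yourself of this is with synthetic data. The following is a hypothetical Python sketch, not Circonus code:

import random

random.seed(1)

def p99(samples):
    # Naive 99th percentile: the value below which 99% of the sorted samples fall.
    ordered = sorted(samples)
    return ordered[int(0.99 * (len(ordered) - 1))]

# Two hypothetical time slices: a quiet one and a smaller, much slower one.
quiet = [random.gauss(20, 5) for _ in range(10000)]    # ~20 ms latencies
busy = [random.gauss(200, 50) for _ in range(1000)]    # ~200 ms latencies

print((p99(quiet) + p99(busy)) / 2)   # average of the per-slice 99th percentiles
print(p99(quiet + busy))              # true 99th percentile of the combined data; very different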

Inverse percentiles show us the percentage of values over a certain threshold, which we can then establish as a service level objective.

Inverse quantile calculation for 500 milliseconds

For example, let’s say we have an SLO of 500 milliseconds of latency. In the graph above there is a spike around the 50% mark, which means that 50% of the values in that time slice exceeded 500 milliseconds and we violated our SLO.

metric:histogram("14ab8f94-da3d-4047-f6fc-81cc68e9d4b5", "api`GET`/getState") | histogram:inverse_percentile(500)

The above CAQL statement will show the percentage of requests that exceed that SLO.
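
Conceptually, the inverse percentile for a threshold is just the share of samples in a time slice that land beyond it. A hypothetical Python sketch with made-up latency numbers:

# Made-up latency samples for one time slice, in milliseconds.
latencies_ms = [12, 48, 95, 230, 510, 640, 120, 1800, 75, 410]

slo_ms = 500
over = sum(1 for latency in latencies_ms if latency > slo_ms)
pct_over_slo = 100.0 * over / len(latencies_ms)

print(f"{pct_over_slo:.1f}% of requests exceeded the {slo_ms} ms SLO")  # 30.0%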

We can also display this as an area graph to make SLO violations more visible. The area under the line is the amount of time we spent within the limits of our SLO; here, we’re doing a good job.

Area graph, green shows where we meet the SLO

Determining what the actual threshold should be is a business decision, but 200 milliseconds is generally a good expectation for web services. We find that setting the SLO as a time-based threshold is easier for humans to understand than picking an arbitrary percentile.

The traditional method might be to say we want 99 percent of our requests to fall under the 500 millisecond threshold. But what is more valuable, and easier to understand, is knowing how many requests exceeded the SLO and by how much each one exceeded it. When we violate our SLO, we want to know: how bad is the damage? How much did our service suffer?

Quantifying the percentage of requests that meet that SLO is a good start, but we can take it a bit further.

What really matters to the business is the number of requests that failed to meet our SLO, not just the percentage.

Inverse quantile calculation: count of requests that violated the 500 millisecond SLO
op:prod(){op:sub(){100,metric:histogram("14ab8f94-da3d-4047-f6fc-81cc68e9d4b5", "api`GET`/getState") | histogram:inverse_percentile(500)}, metric:histogram("14ab8f94-da3d-4047-f6fc-81cc68e9d4b5", "api`GET`/getState") | histogram:count()}

Using this data, we can calculate and graph the number of requests that violated our 500 millisecond SLO in each time slice. The CAQL statement above subtracts the percentage of requests that did not violate the SLO from 100 to get the percentage that did violate it, then multiplies that by the total request count, giving the total number of requests that violated the SLO.
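
The same arithmetic in a hypothetical Python sketch, assuming we already have a per-slice series for the percentage of requests that stayed within the SLO and a per-slice series of request counts:

# Hypothetical per-slice inputs: percentage of requests at or under 500 ms,
# and total request counts, mirroring the two series in the CAQL statement above.
pct_within_slo = [100.0, 99.2, 97.5, 50.0, 100.0]
request_count = [80000, 95000, 120000, 200000, 90000]

violations = [
    round((100.0 - pct) / 100.0 * count)      # violating percentage times request count
    for pct, count in zip(pct_within_slo, request_count)
]
print(violations)  # [0, 760, 3000, 100000, 0] -- the 50% slice alone has 100,000 violations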

The spikes in the graph above show the number of requests that violated our SLO in each time slice. As you can see, there are instances where we had 100,000 violations within a single time slice, which is fairly significant. Let’s take this a step further: we can use calculus to find the total number of violations, not just within a given time slice, but over time.

Cumulative number of requests that exceeded the 500 millisecond SLO
op:prod(){op:sub(){100,metric:histogram("14ab8f94-da3d-4047-f6fc-81cc68e9d4b5", "api`GET`/getState") | histogram:inverse_percentile(500)}, metric:histogram("14ab8f94-da3d-4047-f6fc-81cc68e9d4b5", "api`GET`/getState") | histogram:count()} | integrate()

The CAQL statement above is similar to the previous one, but pipes the result through the integral function to accumulate the total number of violating requests over time. The blue line shows a monotonically increasing count, and the points where the slope of the graph increases are the points where our system goes off the rails.

Anywhere the slope of this curve increases, we are violating our SLO. We can use these spots as way-points for forensic analysis in our logs to figure out exactly why the system was misbehaving (for example, a database slowdown), and the graph also shows how much damage was caused: the larger the jump on the y-axis, the more we violated our SLO.
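
Both the integration and the way-point hunting are easy to picture in a Python sketch that continues the hypothetical per-slice violation counts from above:

from itertools import accumulate

# Hypothetical per-slice violation counts from the previous sketch.
violations = [0, 760, 3000, 100000, 0]

# "Integrate": a running total of violating requests over time -- the
# monotonically increasing blue line.
cumulative = list(accumulate(violations))

# Way-points: slices where the per-slice count jumps, i.e. where the slope of
# the cumulative curve increases. These are where to start digging in the logs.
waypoints = [i for i in range(1, len(violations)) if violations[i] > violations[i - 1]]

print(cumulative)  # [0, 760, 3760, 103760, 103760]
print(waypoints)   # [1, 2, 3]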

We can now quantify this damage by tying it to a per-request basis. If each request that violated our SLO represents a lost product sale, we can modify that CAQL statement to assign a dollar value to each failed request and get a brutally honest KPI that will ripple across the entire business, demonstrating the importance of your SLOs and how failures in your system can become failures in your business.
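
For instance, with a purely hypothetical figure of $25 in lost sales per violating request, the same series turns directly into a dollar amount:

cost_per_violation = 25.00                    # hypothetical dollars lost per failed request
violations = [0, 760, 3000, 100000, 0]        # per-slice violation counts from above
lost_revenue = sum(violations) * cost_per_violation
print(f"${lost_revenue:,.2f} in sales put at risk by SLO violations")  # $2,594,000.00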

On The Fly: Real Time Anomaly Detection

It’s vital to understand when your system was violating your SLO, and it’s good to be able to run forensics after the fact, but what’s really valuable is getting that information in real time. We can take another CAQL statement, take the difference in the count of requests that violated the SLO, and apply an anomaly detection algorithm to identify the points where those SLO violations occurred.

Anomaly detection using SLO violation request diff counts
op:prod(){op:sub(){100,metric:histogram("14ab8f94-da3d-4047-f6fc-81cc68e9d4b5", "api`GET`/getState") | histogram:inverse_percentile(500)}, metric:histogram("14ab8f94-da3d-4047-f6fc-81cc68e9d4b5", "api`GET`/getState") | histogram:count()} | diff() | anomaly_detection(20, model_period=120, model="constant")

These are the instances where the algorithm has identified potential anomalies. Each one gets a score from 0 to 100, where 100 is a definite anomaly and lower scores reflect less certainty about the violation. We can also create an alert from this CAQL statement, which will send a message to our operations team in real time every time we have an SLO violation.

This example uses a constant model, with the model_period set to 120 and the sensitivity set to 20. We can adjust the sensitivity to make the algorithm more or less responsive to anomalies, without tying detection to any fixed threshold. Either way, we can monitor and track these anomalies as they happen, providing contextual, actionable insight.
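
To illustrate only the general idea (this is not the Circonus anomaly_detection() algorithm), a simple rolling z-score check over the diffed violation counts might look like this in Python:

import statistics

def flag_anomalies(diffs, window=12, threshold_sigmas=3.0):
    # Rough stand-in: flag a point when it sits well above the rolling mean of
    # the recent window. The scoring and sensitivity scale of the real CAQL
    # anomaly_detection() function are different; this is only a sketch.
    flags = []
    for i, value in enumerate(diffs):
        history = diffs[max(0, i - window):i]
        if len(history) < 3:
            flags.append(False)
            continue
        mean = statistics.mean(history)
        spread = statistics.pstdev(history) or 1.0
        flags.append(value > mean + threshold_sigmas * spread)
    return flags

# Hypothetical diffed violation counts: mostly flat, with one SLO incident.
diffs = [0, 2, 1, 0, 3, 2, 1, 0, 2, 97000, 1, 0]
print(flag_anomalies(diffs))  # only the 97000 spike is flagged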

In Conclusion

Our approach frees you from having to constantly watch your systems: SLO violations surface on their own, in real time.

By intelligently quantifying your SLOs through the methods described above, you can tie the performance of a given part of the system into the overall performance of the business.

This empowers you to adjust your operational footprint as needed to ensure that your SLOs are being met, and ultimately allows you to focus on your business objectives.

If you have questions about this article, or about SLOs in general, feel free to join our Slack channel and ask us. To see what we’ve been working on that inspired this article, have a look here.
