Circonus will soon release our next-generation fault detection system, faultd (pronounced "fault-dee"). Faultd is an internal component of our infrastructure that has run alongside our existing fault detection system for several months, with outputs verified for accuracy. It is also in use by several of our enterprise customers, who have reported no issues with faultd.
Faultd introduces powerful new features that make it easy to manage alerting in ephemeral infrastructures such as serverless and container-based applications, as well as in large enterprises.
Pattern Based Rulesets
Say I have a few thousand hosts that emit S.M.A.R.T. disk status telemetry, and I want to alert when the seek error rate exceeds a threshold. Previously I would need to create a few thousand rules to alert on this condition for each host. While the Circonus API certainly makes this programmatically feasible, I would also need to create or delete these rules on each host addition or removal.
Now I can create a single pattern based rule using regular expressions to cover swaths of infrastructure for a given metric. I can also harness the power of stream tags to create pattern based rules based on metric metadata. What would have taken operators hours to do in the past can now be done easily in minutes.
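To illustrate the idea, here is a minimal sketch in Python of how a single pattern-based rule might cover many hosts. The rule structure, field names, and matching helper here are hypothetical illustrations, not the actual Circonus API:

```python
import re

# Hypothetical pattern-based rule: one regex plus stream-tag filters
# covers every host emitting this metric, instead of one rule per host.
rule = {
    "metric_pattern": re.compile(r"smart`seek_error_rate$"),
    "tag_filters": {"env": "prod"},  # match on stream-tag metadata
    "threshold": 45.0,               # alert when the value exceeds this
}

def rule_matches(rule, metric_name, tags, value):
    """Return True when a metric sample should trigger the rule."""
    if not rule["metric_pattern"].search(metric_name):
        return False
    for key, wanted in rule["tag_filters"].items():
        if tags.get(key) != wanted:
            return False
    return value > rule["threshold"]

# One rule now covers thousands of hosts.
print(rule_matches(rule, "smart`seek_error_rate", {"env": "prod", "host": "db042"}, 50.0))
```

Adding or removing a host requires no rule changes at all; any host whose metric name and tags match the pattern is covered automatically.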
Histogram Based Alerting
Traditional alerting has been based on a value exceeding a threshold for a given amount of time. Every monitoring system can do this. And each one of them suffers from the shortcoming of outliers triggering false positive alerts which are infamous for waking up systems operators in the middle of the night for what turns out to be nothing.
Histogram based alerting paves the way for alerts based on percentiles, which are much more robust than alerting on individual values or averages which can become easily skewed by outliers. This also allows for alerting on conditions when Service Level Objectives (SLOs) are exceeded, a capability core to the mission of Software Reliability Engineers (SREs). Alerts based on Inverse Quantiles are also now possible – “alert me if 20% of my requests in the last five minutes exceeded 500ms”, or “alert me if more than 100 requests in the last 5 minutes exceeded 253ms”.
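As a sketch of what an inverse-quantile check computes (the latency values and alert thresholds below are made up for illustration):

```python
# Hypothetical 5-minute window of request latencies, in milliseconds.
window = [120, 480, 260, 900, 310, 505, 700, 95, 254, 610]

def count_over(latencies_ms, threshold_ms):
    """Inverse-quantile style question: how many samples exceeded the threshold?"""
    return sum(1 for v in latencies_ms if v > threshold_ms)

# "alert me if more than 5 requests in the last five minutes exceeded 253ms"
exceeded = count_over(window, 253)
print(exceeded, exceeded > 5)

# "alert me if 20% of my requests in the last five minutes exceeded 500ms"
fraction_slow = count_over(window, 500) / len(window)
print(fraction_slow, fraction_slow >= 0.20)
```

With full histograms stored, these questions can be answered for any threshold after the fact, rather than only for thresholds chosen in advance.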
Under the Hood
Faultd has been engineered in C with the libmtev application framework, which provides highly concurrent lock free data structures and safe memory reclamation. This implementation is radically more efficient for memory and CPU than the previous fault detection system written in Java. It also provides more powerful ways to scale out for ridiculously large installations, and supports more sophisticated clustering.
As a result, some window function alerts may show increased accuracy. Enterprise customers will also no longer need to occasionally restart a JVM as part of normal maintenance.
Faultd will be going live on December 17th for Circonus hosted customers. While you might not notice anything new that day, that’s intentional as we expect complete continuity of service during the transition. In the coming weeks to months after, we’ll be showcasing the new features provided by faultd here on this blog so that you can put them to work.
Percentiles have become one of the primary service level indicators used to represent real systems monitoring performance. When used correctly, they provide a robust metric that can serve as the basis of mission-critical service level objectives. However, there’s a reason for the “when used correctly” above.
For all their potential, percentiles do have subtle limitations that are very often overlooked by the people using them in analyses.
Right off the bat, the most misused technique is aggregation of percentiles. You should almost never average percentiles, because even very fundamental aggregation tasks cannot be accomplished with percentile metrics. Because percentiles are cheap to obtain from most telemetry systems, and good enough to use with little effort, it is often assumed that they are appropriate for aggregation and system-wide performance analysis most of the time. While this is true most of the time and for most systems, you lose the ability to determine when your data is lying to you; for example, when you have high error rates (±5% and greater) that are hidden from you.
Those times when your systems are misbehaving the most?
That’s exactly when you don’t have the data to tell you where things are going wrong.
Check the Math
Let’s look at an example* of request latencies of two webservers (W1 blue, W2 red). P95 of the blue server is 220ms, p95 of the red one is 650ms:
By aggregating the latency distributions of each web server, we find that the total p95 is 230ms. W2 barely served any requests, so adding requests from there did not change the p95 of W1 by much. Now, naive averaging of the percentiles would have given you: (220 + 650) / 2 = 870 / 2 = 435ms, which is nearly double the true total percentile (230ms).
So, if you have a Service Level Indicator in this scenario of “95th percentile latency of requests over past 5 minutes < 300ms,” and you averaged P95s instead of calculating from a distribution, you would be led to believe that you had exceeded your SLI by ~45% (435ms against the 300ms objective). Folks would be getting paged even though they didn’t need to be, and might conclude that additional servers were needed (when in fact this scenario represents overprovisioning).
Incorrect math can result in tens of thousands of dollars of unneeded capacity,
not to mention the cost of the time of the humans in the loop.
*If you want to play with the numbers yourself to get a feel for how these scenarios can develop, there is a sample calculation with a link to an online percentile calculator in the appendix.
“Almost Never” Actually Means “Never Ever”
“But above, you said ‘almost never’ when saying that we shouldn’t ever average percentiles?”
That’s 100% correct. (No pun intended.)
You see, there are circumstances where you can average percentiles and get a result that has low errors: namely, when the distributions of your data sources are identical. The most obvious case is when the latencies are from two web servers that are (a) healthy and (b) serving very similar load.
Be aware that this supposition breaks down as soon as either of those conditions is violated! Those are the cases where you are most interested in your monitoring data: when one of your servers starts misbehaving, or you have a load-balancing problem.
“But my servers usually have an even distribution of request latencies which are nearly identical, so that doesn’t affect me, right?”
Well, sure, if your web servers have nearly identical latency distributions, go ahead and calculate your total 95th percentile for the system by averaging the percentiles from each server. But when you have one server that decides to run into swap and slow down, you likely won’t notice a problem, since the data indicating it is effectively hidden.
So still, you should never average percentiles; you won’t be able to know when the approach you are taking is hurting you at the worst time.
Percentiles are aggregates, but they should not be aggregated; they should be calculated, not stored. A considerable number of operational time series data monitoring systems, both open source and commercial, will happily store percentiles at 5 minute (or similar) intervals. If you want to look at a year’s worth of data, you will encounter spike erosion: the percentiles are averaged to fit the number of pixels in the time windows on the graph, and that averaged data is mathematically wrong.
“So, the solution is to store every single sample like you did in the example, right?”
Well, yes and no.
You can store each sample and generate correct percentiles from them, but at anything more than a modest scale this becomes prohibitively expensive. Some open-source time series databases and monitoring systems do this, but you give up either scalability in data ingest or length of data retention. One million 64-bit integer samples per second for a year occupies 229 TB of space. One week of this data is roughly 4 TB; doable with off-the-shelf hardware, but economically impractical for analysis, as well as wasteful.
“Ah, but I’ve thought up a better solution. I can just store the number of requests that are under my desired objective, say 500 milliseconds, and the number of requests that are above, and I can divide by the two to calculate a correct percentile!”
This is a valid approach, one that I have even implemented with a monitoring system that was not able to store full distributions. However, there is a subtle limitation: if after some time I decide that my objective of 500ms was too aggressive and move it to 600ms, all of the historical data that I’ve collected becomes useless. I have to reset my counters and begin anew.
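A minimal sketch of that counting approach makes the limitation concrete: the 500ms threshold is fixed at collection time, so the counters cannot later be re-evaluated at 600ms. (The class and numbers here are illustrative, not any particular monitoring system's API.)

```python
class ThresholdCounter:
    """Count requests above/below a fixed objective.

    The limitation: the threshold is baked in at collection time,
    so historical counts cannot be re-evaluated at a new threshold.
    """
    def __init__(self, threshold_ms):
        self.threshold_ms = threshold_ms
        self.under = 0
        self.over = 0

    def observe(self, latency_ms):
        if latency_ms <= self.threshold_ms:
            self.under += 1
        else:
            self.over += 1

    def fraction_within_objective(self):
        """What fraction of requests met the objective? (one fixed inverse quantile)"""
        total = self.under + self.over
        return self.under / total if total else None

c = ThresholdCounter(500)
for latency in [120, 480, 900, 310, 700]:
    c.observe(latency)
print(c.fraction_within_objective())
```

Nothing stored here can tell you what fraction of requests would have met a 600ms objective; that answer requires the distribution, not two counters.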
Store Distributions, Not Percentiles
A better approach than storing percentiles is to store the source sample data in a manner that is more efficient than storing single samples, but still able to produce statistically significant aggregates. The histogram, or distribution, is one such approach.
There are many types of histograms, but here at Circonus we use the log-linear histogram. It provides a mix of storage efficiency and statistical accuracy. Worst-case errors at single digit sample sizes are 5%, quite a bit better than the 200% that we demonstrated above by averaging percentiles.
Storage efficiency is significantly better than storing individual samples; a year’s worth of 5 minute log linear histogram windows (10 bytes per bin, 300 bins/window) can be stored in ~300MB (sans compression). Reading this amount of data from disk is tractable in under a second on most physical (and virtualized) systems. The mergeability properties of histograms allow precomputed cumulative histograms to be stored for analytically useful windows such as 1 minute and 3 hours. This allows large spans of time series telemetry to be rapidly assembled from sets that are visually relevant to the end user (think one year of data with histogram windows of six hours each).
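The mergeability property can be sketched in a few lines. Here histograms are simplified to maps from bin lower edge to sample count, rather than the actual log-linear encoding, and the two windows contain made-up data:

```python
from collections import Counter

def merge(histograms):
    """Merge per-window histograms (bin lower edge -> count).
    Histograms merge by bin-wise addition, so cumulative windows
    can be precomputed without losing information."""
    total = Counter()
    for h in histograms:
        total.update(h)
    return dict(total)

def quantile(hist, q):
    """Estimate the q-quantile by walking the bins in ascending order."""
    n = sum(hist.values())
    target = q * n
    seen = 0
    for edge in sorted(hist):
        seen += hist[edge]
        if seen >= target:
            return edge
    return None

# Two hypothetical 5-minute latency windows (bin lower edge in ms -> count).
w1 = {10: 400, 20: 300, 50: 200, 100: 80, 500: 20}
w2 = {10: 100, 20: 100, 50: 100, 100: 100, 500: 100}

merged = merge([w1, w2])
print(quantile(merged, 0.95))
```

Note that the quantile is computed from the merged distribution, never by averaging the per-window quantiles; that is exactly the property that averaged percentiles lack.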
Percentiles are a staple tool of real systems monitoring, but their limitations should be understood. Because percentiles are provided by nearly every monitoring and observability toolset without limitations on their usage, they can be applied to analyses easily without the operator needing to understand the consequences of *how* they are applied. Understanding the common scenarios where percentiles give the wrong answers is just as important as having an understanding of how they are generated from operational time series data.
If you use percentiles now, you are already capturing data as distributions to some degree through your toolset. Knowing how that toolset generates percentiles from that source telemetry will ensure that you can evaluate if your use of percentiles to answer business questions is mathematically correct.
Suppose I have two servers behind a load balancer, answering web requests. I use some form of telemetry gathering software to ingest the time it takes for each web server to complete serving its respective requests in milliseconds. I get results like this:
Web server 1: [22,90,73,296,55]
Web server 2: [935,72,18,553,267,351,56,28,45,873]
Step 1. Arrange the data in ascending order: 18, 22, 28, 45, 55, 56, 72, 73, 90, 267, 296, 351, 553, 873, 935
Step 2. Compute the position of the pth percentile (index i):
i = (p / 100) * n, where p = 90 and n = 15
i = (90 / 100) * 15 = 13.5
Step 3. The index i is not an integer, round up.
(i = 14) ⇒
the 90th percentile is the value in 14th position, or 873
Answer: the 90th percentile is 873
Now our correct 90th percentile is 873 instead of 600 (the naive average of each server’s individual 90th percentile). That is an increase of 273, or 45.5%. Quite a difference, no? If you had used averages, you might think that your website is delivering responses at a latency you believed was acceptable for most users. In reality, 90% of requests fall under a threshold that is almost 50% larger than you thought.
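The steps above can be checked in a few lines of Python, using the same rank-and-round-up convention (when the index lands exactly on an integer, the two neighboring values are averaged, which is where the naive 600ms figure comes from):

```python
import math

def percentile(samples, p):
    """Rank method from the steps above: i = (p / 100) * n, round up;
    when i lands exactly on an integer, average the i-th and (i+1)-th values."""
    s = sorted(samples)
    i = (p / 100) * len(s)
    if i == int(i):
        i = int(i)
        return (s[i - 1] + s[i]) / 2
    return s[math.ceil(i) - 1]

w1 = [22, 90, 73, 296, 55]
w2 = [935, 72, 18, 553, 267, 351, 56, 28, 45, 873]

true_p90 = percentile(w1 + w2, 90)                           # from all samples: 873
avg_of_p90s = (percentile(w1, 90) + percentile(w2, 90)) / 2  # the naive shortcut

print(true_p90, avg_of_p90s)
```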
A guide to the importance of, and techniques for, accurately quantifying your Service Level Objectives.
This is the third in a multi-part series about Service Level Objectives. The second part can be found here.
As we’ve discussed in part one and part two of this series, Service Level Objectives (SLOs) are essential performance indicators for organizations that want a real understanding of how their systems are performing. However, these indicators are driven by vast amounts of raw data and information. So how do we make sense of it all and quantify our SLOs? Let’s take a look.
Feel The Heat: Map Out Your Data
The following heat map based on histogram data shows two weeks of API request latency data, displayed in 2 hour time slices. At Circonus, we use log linear histograms to store time series data, and the data is sorted into bin structures which have roughly 100 bins for every power of 10 (for details see Circonus Histogram Internals). This structure provides flexibility for a wide range of values without needing explicit operator configuration for the histogram bucket sizes. In all, this graph represents about 10 million data points. Notice that the left y-axis is in milliseconds, and so most of the values are concentrated under 0.1 seconds or about 100 milliseconds.
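As a rough sketch of log-linear binning (simplified; the exact Circonus bin structure is described in the Circonus Histogram Internals document referenced above), each power of ten can be divided into 90 linear bins:

```python
import math

def bin_lower_edge(value):
    """Map a positive sample to the lower edge of its log-linear bin.
    Each power of ten is split into 90 linear bins (1.0, 1.1, ..., 9.9
    times the power of ten). A simplified sketch, not the exact encoding."""
    if value <= 0:
        return 0.0
    exp = math.floor(math.log10(value))
    mantissa = value / 10 ** exp            # in [1, 10)
    step = math.floor(mantissa * 10) / 10   # keep two significant digits
    return step * 10 ** exp

# Values spanning several orders of magnitude land in sensible bins
# without any per-metric bucket configuration.
for v in [0.0237, 4.2, 523.0]:
    print(v, "->", bin_lower_edge(v))
```

This is why no explicit operator configuration of bucket sizes is needed: the bin boundaries are a fixed function of the value itself.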
If we hover over one of the time slices in this heat map with our mouse, we can see a histogram overlay for that time slice showing the distribution of values for this range. For example, when we look at the bin shown in the graph above, we have a distribution with a very wide range of values, but when we zoom in closer we see it’s concentrated toward the left side of the graph, with modes at about 5 milliseconds and 20 milliseconds.
Now, we can look at how this heat map is generated by examining the CAQL statement in the legend. The Circonus Analytics Query Language (CAQL) is a function-based language that works by piping data through commands, in a manner similar to the UNIX command line. Since we store the raw distribution of data, this graph gives us a great canvas to apply some math (i.e., transform overlays generated by CAQL statements) to give real context and meaning to our data.
We can start by applying a 99th percentile overlay to the data and show the points at which 99% of the values are below that latency value. Notice that most of the high points on this graph are slightly over 10 seconds. That’s not a coincidence. Since this is an API, most default timeouts for clients fall right around 10 seconds. What we’re seeing here is a number of client timeouts, which would also show up in the errors graph on a RED dashboard (which we will cover more in another post). Here’s how we generated that overlay:
This example is a simple statement which says to display the histogram values for a certain metric, which is the latency value for an API call, and then calculate the 99th percentile overlay for these values. This is something no other monitoring solution can do, because most of them typically store aggregated percentiles instead of storing the raw distribution of data as a histogram.
Our approach allows us to calculate arbitrary percentiles over arbitrary time ranges and see what latency 99% of the requests are falling in. That’s not something you can do when you’re storing the 99th percentile for a given set of time ranges. You can’t find the 99th percentile for a larger overall time range by averaging the 99th percentile for those smaller time ranges together.
Inverse percentiles show us the percentage of values over a certain threshold, which we can then establish as a service level objective.
For example, let’s say we have an SLO of 500 milliseconds of latency. So, in the above graph there is a spike around the 50% mark, which means that 50% of the values in that time slice exceeded 500 milliseconds and we violated our SLO.
The above CAQL statement will show the percentage of requests that exceed that SLO.
We can also display this as an area graph to make the SLO violation more visible. The area under the line in the graph here is the amount of time we spent within the limits of our SLO. Here, we’re doing a good job.
Determining what the actual threshold should be is a business decision, but 200 milliseconds is generally a good expectation for web services, and we find this approach of setting the SLO as a time based metric instead of a percentile is easier for humans to understand instead of just picking an arbitrary percentile.
The traditional method might be to say we want 99 percent of our requests to fall under the 500 milliseconds threshold, but what is really more valuable and easier to understand is knowing how many of the requests exceed the SLO and knowing by how much each request exceeded the SLO. When we violate our SLO, we want to know: how bad is the damage that we’ve done? How much did our service suffer?
Quantifying the percentage of requests that meet that SLO is a good start, but we can take it a bit further.
What really matters to the business is the number of requests that failed to meet our SLO, not just the percentage.
Using this graph, we can calculate the number of SLO violations by taking the inverse quantile calculation of requests that violated our 500 millisecond SLO and graphing them. The CAQL statement above says that we subtract the percentage of requests that did not violate our SLO from 100 to get the percentage that did violate the SLO, then multiply that by the count of total number of requests, which gives us the total number of requests that violate the SLO.
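The same arithmetic can be sketched in Python (this mirrors the logic just described, not the CAQL statement itself, and the window numbers are hypothetical):

```python
def slo_violations(pct_below_threshold, request_count):
    """Requests that violated the SLO in a window:
    subtract the percent under the threshold from 100,
    then apply that percentage to the total request count."""
    return (100 - pct_below_threshold) * request_count / 100

# Hypothetical window: 92% of 250,000 requests came in under 500ms.
print(slo_violations(92, 250_000))
```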
These spikes in the graph above show the number of times that requests violated our SLO. As you can see, there are some instances where we had 100,000 violations within a time slice, which is fairly significant. Let’s take this a step further. We can use calculus to find the total number of violations, not just within a given time-slice, but over time.
The CAQL statement above is similar to the previous one, but uses the integral function to accumulate the violation counts over time. The blue line shows a monotonically increasing count, with inflection points where the slope of the graph changes. These are the points where our system goes off the rails.
Anywhere the slope increases, we are in violation of our SLO. We can use these spots as waypoints for forensic analysis in our logs to figure out exactly why our system was behaving badly (for example, if the database had a slowdown), and this also shows us how much damage was caused by the misbehavior (the greater the difference on the y-axis, the more we’ve violated our SLO).
We can now quantify this damage by tying it to a request basis. If each request that violated our SLO represents the loss of a product sale, we can modify that CAQL statement to assign a dollar value for each failed request and get a brutally honest KPI that will ripple across the entire business and demonstrate the importance of your SLOs, and how failures in your system can cause failures in your business.
On The Fly: Real Time Anomaly Detection
It’s vital to be able to understand when your system was violating your SLO and it’s good to be able to run forensics on that after the fact, but what’s really valuable is getting that information in real time. We can take another CAQL statement and take the difference in the count of requests that violated that SLO and apply an anomaly detection algorithm to them to attempt to identify these points where we had these SLO request violations.
These are instances where the algorithm has identified potential anomalies. It gives each potential anomaly a score of 0 to 100, 100 being a definite anomaly, with lesser values depending on how the algorithm identifies the quantity of violation. We can also take this CAQL statement and create an alert for it, which will send an alert message to our operations team in real time every time we have an SLO violation.
This algorithm builds its model over a 60-second period. In the above example, the sensitivity is set to 20; we can adjust that to make the algorithm more or less sensitive to anomalies, independent of any fixed threshold. Either way, we can monitor and track these anomalies as they happen, providing contextual, actionable insight.
Our approach gives you the freedom to not have to proactively monitor your system.
By intelligently quantifying your SLOs through the methods described above, you can tie the performance of a given part of the system into the overall performance of the business.
This empowers you to adjust your operational footprint as needed to ensure that your SLOs are being met, and ultimately allows you to focus on your business objectives.
If you have questions about this article, or SLOs in general, feel free to join our slack channel and ask us. To see what we’ve been working on which inspired this article, feel free to have a look here.
A simple primer on the complicated statistical analysis behind setting your Service Level Objectives.
This is the second in a multi-part series about Service Level Objectives. The first part can be found here.
Statistical analysis is a critical – but often complicated – component in determining your ideal Service Level Objectives (SLOs). So, a “deep-dive” on the subject requires much more detail than can be explored in a blog post. However, we aim to provide enough information here to give you a basic understanding of the math behind a smart SLO – and why it’s so important that you get it right.
Auditable, measurable data is the cornerstone of setting and meeting your SLOs. As stated in part one, Availability and Quality of Service (QoS) are the indicators that help quantify what you’re delivering to your customers, via time quantum and/or transaction availability. The better data you have, the more accurate the analysis, and the more actionable insight you have to work with.
So yes, it’s complicated. But understanding the importance of the math of SLOs doesn’t have to be.
Functions of SLO Analysis
SLO analysis is based on probability, the likelihood that an event will (or will not) take place. As such, it primarily uses two types of functions: the Probability Density Function (PDF) and the Cumulative Distribution Function (CDF).
Simply put, the analysis behind determining your SLO is driven by the basic concept of probability.
For example, the PDF answers questions like “What is the probability that the next transaction will have a latency of X?” As the integral of the PDF, the CDF answers questions like “What’s the probability that the next transaction will have a latency less than X?” or “What’s the probability that the next transaction will have a latency greater than X?”
Probability Density Function (PDF): the probability that a given sample of data will have the input measurement.
Cumulative Distribution Function (CDF): the probability that X will take a value less than or equal to x.
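To make the two functions concrete, here is a generic illustration using the normal distribution from the Python standard library. The latency model here is hypothetical, not derived from real data:

```python
from statistics import NormalDist

# Hypothetical latency model: mean 200ms, standard deviation 50ms.
latency = NormalDist(mu=200, sigma=50)

# PDF: relative likelihood that the next transaction lands right at 200ms.
print(latency.pdf(200))

# CDF: probability that the next transaction takes less than 300ms.
print(latency.cdf(300))

# Complement of the CDF: probability that it takes more than 300ms.
print(1 - latency.cdf(300))
```

(As discussed later in this series, real latency data is rarely normally distributed; the normal distribution is used here only because it makes the PDF/CDF relationship easy to see.)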
Percentiles and Quantiles
Before we get further into expressing these functions, let’s have a quick sidebar about percentiles vs. quantiles. Unfortunately, this is a simple concept that has gotten quite complicated.
A percentile is measured on a 0-100 scale, and expressed as a percentage. For example: the “99th percentile” means “as good or better than” 99% of the distribution.
A quantile is the same data, expressed on a 0-1 scale. So as a quantile, that “99th percentile” above would be expressed as “.99.”
That’s basically it. While scientists prefer using percentiles, the only differences from a quantile are a decimal point and a percentage symbol. However, for SLO analysis, the quantile function is important because it is mapped to the CDF we discussed earlier.
Remember, this is an overview of basic concepts to provide a “top-level” understanding of the math behind a smart SLO. For a deeper dive, check out David N. Blank-Edelman’s book “Seeking SRE.”
The Data Volume Factor
As any analyst will tell you, the sheer volume of data (or lack thereof) can dramatically impact your results, leading to uninformed insight, inaccurate reporting, and poor decisions. So, it’s imperative that you have enough data to support your analysis. For example, low volumes in the time quantum can produce incredibly misleading results if you don’t specify your SLOs well.
So, with large amounts of data versus “not enough,” the error levels in quantile approximations tend to be lower. (The worst possible case is a single sample per bin, with the sample value at the edge of the bin; that can cause 5% errors.) In practice, with log linear histograms, we tend to see data sets span 300 bins, so sets that contain thousands of data points tend to provide sufficient data for accurate statistical analyses.
Inverse quantiles can also come into play. For example, consider defining an SLO such that our 99th percentile request latency completes within 200ms. At low sample volumes, this approach is likely to be meaningless; with only a dozen or so samples, the 99th percentile can be far out of band compared to the median. And the percentile-plus-time-quantum approach doesn’t tell us how many samples exceeded that 200ms quantum.
We can use inverse percentiles to define an SLO that says we want 80 percent of our requests to be faster than that 200ms quantum. Or alternatively, we can set our SLO as a fixed number of requests within the time quantum; say “I want less than 100 requests to exceed my 200ms time quantum over a span of 10 minutes.”
The actual implementations can vary, so it is incumbent upon the implementer to choose one which suits their business needs appropriately.
Defining Formulas and Analysis
Based on the information you’re trying to get, and your sample set, the next step is determining the right formulas or functions for analysis. For SLO-related data, most practitioners implement open-source histogram libraries. There are many implementations out there, ranging from log-linear, to t-digest, to fixed bin. These libraries often provide functions to execute quantile calculations, inverse calculations, bin count, and other mathematical implementations needed for statistical data analysis.
Some analysts use approximate histograms, such as t-digest. However, those implementations often exhibit double digit error rates near median values. With any histogram-based implementation, there will always be some level of error, but implementations such as log linear can generally minimize that error to well under 1%, particularly with large numbers of samples.
Common Distributions in SLO Analysis
Once you’ve begun analysis, there are several different mathematical models you will use to describe the distribution of your measurement samples, or at least how you expect them to be distributed.
Normal distributions: The common “bell-curve” distribution often used to describe random variables whose distribution is not known.
Gamma distributions: A two-parameter family of continuous probability distributions, important for using the PDF and CDF.
Pareto distributions: Most of the samples are concentrated near one end of the distribution. Often useful for describing how system resources are utilized.
In real life, our networks, systems, and computers are all complex entities, and you will probably almost never see something that perfectly fits any of these distributions. You may have spent a lot of time discussing normal distributions in Statistics 101, but you will probably never come across one as an SRE.
While you may often see distributions that resemble the Gamma or Pareto model, it’s highly unusual to see a distribution that’s a perfect fit.
Instead, most of your sample distributions will be a composition of different models, which is completely normal and expected. A single latency source is often well represented by a Gamma distribution, but it is exceptionally rare that we see a single latency distribution in isolation. What we actually observe is often multiple latency distributions “jammed together”, which results in multi-modal distributions.
That could be the result of a few different common code paths (each with a different distribution), a few different types of clients each with a different usage pattern or network connection… Or both. So most of the latency distributions we’ll see in practice are actually a handful (and sometimes a bucket full) of different gamma-like distributions stacked atop each other. The point being, don’t worry too much about any specific model – it’s the actual data that’s important.
Histograms in SLO Analysis
A histogram is a representation of the distribution of a continuous variable, in which the entire range of values is divided into a series of intervals (or “bins”) and the representation displays how many values fall into each bin.
If for any reason your range of values is on the low end, this is where a data volume issue (as we mentioned above) could rear its ugly head and distort your results.
However, histograms are ideal for SLO analysis, or any high-frequency, high-volume data, because they allow us to store the complete distribution of data at scale. You can describe a histogram with between 3 and 10 bytes per bin, depending on the varbit encoding of 8 of those bytes, and compression reduces that further. That’s an efficient approach to storing a large number of bounded sample values. So instead of storing a handful of quantiles, we can store the complete distribution of data and calculate arbitrary quantiles and inverse quantiles on demand, as well as apply more advanced modeling techniques.
We’ll dig deeper into histograms in part 3.
In summary, analysis plays a critical role in setting your Service Level Objectives, because raw data is just that — raw and unrefined. To put yourself in a good position when setting SLOs, you must:
Know the data you’re analyzing. Choose data structures that are appropriate for your samples, ones that provide the needed precision and robustness for analysis. Be knowledgeable of the expected cardinality and expected distribution of your data set.
Understand how you’re analyzing the data and reporting your results. Ensure your analyses are mathematically correct. Realize if your data fits known distributions, and the implications that arise from that.
Set realistic expectations for results. Your outputs are only as good as the data you provide as inputs. Aggregates are excellent tools but it is important to understand their limitations.
And always be sure that you have enough data to support the analysis. A 99th percentile calculated with a dozen samples will likely vary significantly from one with hundreds of samples. Outliers can exert great influence over aggregates on small sets of data, but larger data sets are robust and not as susceptible.
With each of those pieces in place, you’ll gain the insight you need to make the smartest decision possible.
That concludes the basic overview of SLO analysis. As mentioned above, part 3 will focus, in more detail, on how to use histograms in SLO analysis.
In their excellent SLO-workshop at SRECon2018 (program) Liz Fong-Jones, Kristina Bennett and Stephen Thorne (Google) presented some best practice examples for Latency SLI/SLOs. At Circonus we care deeply about measuring latency and SRE techniques such as SLI/SLOs. For example, we recently produced A Guide to Service Level Objectives. As we will explain here, Latency SLOs are particularly delicate to implement and benefit from having Histogram-data available to understand distributions and adjust SLO targets.
Here are the key definitions regarding Latency SLOs from the Workshop Handout (pdf).
The suggested specification for a request/response Latency SLI is:
The proportion of valid requests served faster than a threshold.
Turning this specification into an implementation requires making two choices: which of the requests this system serves are valid for the SLI, and what threshold marks the difference between requests that are fast enough and those that are not?
99% of home page requests in the past 28 days served in < 100ms.
Latency is typically measured with percentile metrics like these, which were presented for a similar use case:
Given this data, what can we say about the SLO?
What is the p90 computed over the full 28 days?
It’s very tempting to take the average of the p90 metric displayed in the graph, which would be just below the 500ms mark.
It’s important to note, and it was correctly pointed out during the session, that this is not generally true. There is no mathematical way to determine the 28-day percentile from the series of 1h percentiles shown in the above graphs (reddit, blog, math). You need to look at different metrics if you want to implement a latency SLO. In this post we will discuss three different methods for doing this correctly.
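To see why averaging windowed percentiles goes wrong, here is a small self-contained Python sketch (the numbers are illustrative, not the data from the graphs above): two one-hour windows with very different request volumes produce an “average p90” nowhere near the true p90 of the combined samples.

```python
import math

def p90(samples):
    """Nearest-rank 90th percentile."""
    s = sorted(samples)
    return s[math.ceil(0.9 * len(s)) - 1]

# A busy hour with 1000 fast requests and a quiet hour with 10 slow ones.
busy_hour = [10] * 1000   # 10ms latencies
quiet_hour = [100] * 10   # 100ms latencies

avg_of_p90s = (p90(busy_hour) + p90(quiet_hour)) / 2
true_p90 = p90(busy_hour + quiet_hour)

print(avg_of_p90s)  # 55.0 -- the naive average of the two window p90s
print(true_p90)     # 10   -- the actual p90 over all samples
```

The quiet hour contributes only 10 of the 1010 samples, so it barely moves the true p90, yet it dominates the naive average.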
Latency metrics in the wild
In the example above the error of averaging percentiles might not actually be that dramatic. The system seems to be very well-behaved with a high constant load. In this situation the average p90/1h is typically close to the total p90/28 days.
Let’s take a look at another API, from a less loaded system. This API handles very few requests between 2:00am and 4:00am:
What’s the true p90 over the 6h drawn on the graph? Is it above or below 30ms?
The average p90/1M (36.28ms), i.e. the average of the one-minute p90 values, looks far less appealing than before.
Computing Latency SLOs
So how do we compute latency SLOs correctly? There are three ways to go about this:
Compute the SLO from stored raw data (logs)
Count the number of bad requests in a separate metric
Use histograms to store the latency distribution.
Method 1: Using Raw/Log data
Storing access logs with latency data gives you accurate results. The drawback with this approach is that you must keep your logs over long periods of time (28 days), which can be costly.
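A sketch of the log-based approach in Python; note that the log format and field positions here are made up for illustration:

```python
# Each line: "<timestamp> <path> <status> <latency_ms>" (hypothetical format)
LOG_LINES = [
    "1528171020 /getState 200 12.5",
    "1528171021 /getState 200 95.1",
    "1528171022 /getState 200 310.0",
    "1528171023 /getState 200 8.2",
]

def slo_attainment(lines, threshold_ms=100.0):
    """Fraction of requests served faster than threshold_ms."""
    latencies = [float(line.split()[3]) for line in lines]
    good = sum(1 for l in latencies if l < threshold_ms)
    return good / len(latencies)

print(slo_attainment(LOG_LINES))  # 0.75 -> 75% of requests under 100ms
```

Because the raw latencies are retained, the threshold (and the reporting window) can be changed at any time, which is exactly what the other two methods trade away.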
Method 2: Counting bad requests
For this method, instrument the application to count the number of requests that violated the threshold. The resulting metrics will look like this:
Using these metrics, we see that 96% of our requests over the past 6h were faster than 30ms. Our SLO stated that 90% of the requests should be good, so that objective was met.
The drawback of this approach is that you have to choose the latency threshold upfront. There is no way to calculate the percentage of requests that were faster than, say, 200ms from the recorded data.
If your SLO changes, you will need to change the executable or the service configuration to count requests above a different threshold.
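A minimal sketch of the counting instrumentation (all names here are illustrative):

```python
# Per-window counters: total requests and requests under the fixed threshold.
THRESHOLD_MS = 30.0
counters = {"total": 0, "good": 0}

def record_request(latency_ms):
    """Instrumentation hook called once per served request."""
    counters["total"] += 1
    if latency_ms < THRESHOLD_MS:
        counters["good"] += 1

for latency in [12.0, 25.0, 31.0, 18.0, 45.0, 22.0, 9.0, 28.0, 16.0, 29.5]:
    record_request(latency)

good_ratio = counters["good"] / counters["total"]
print(good_ratio)  # 0.8 -> 80% good; a 90% objective would not be met here
```

Note the limitation described above: nothing in these counters lets you recompute the ratio for a different threshold after the fact.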
Method 3: Using Histograms
The third practical option you have for computing accurate SLOs is storing your request latency data as histograms. The advantages of storing latency data as histograms are:
Histograms can be freely aggregated across time.
Histograms can be used to derive approximations of arbitrary percentiles.
For (1) to be true it’s critical that your histograms have common bin choices. It’s usually a good idea to mandate the bin boundaries for your whole organization, otherwise you will not be able to aggregate histograms from different services.
For (2), it’s critical that you have enough bins in the latency range that is relevant for your percentiles. Sparsely encoded log-linear histograms allow you to cover a large floating point range (e.g. 10^-127 .. 10^128) with a fixed relative precision (e.g. 5%). In this way you can guarantee 5% accuracy on all percentiles, no matter how the data is distributed.
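The idea behind log-linear binning can be sketched in a few lines. This is a simplification for illustration, not the actual code of any particular implementation:

```python
import math

def log_linear_bin(value):
    """Map a positive value to a bin [m * 10**e, (m+1) * 10**e), m in 10..99.

    Keeping two significant digits bounds the relative error of a
    value-to-bin approximation, regardless of the value's magnitude.
    """
    exponent = math.floor(math.log10(value)) - 1
    mantissa = int(value / 10 ** exponent)
    return mantissa, exponent

print(log_linear_bin(253.0))  # (25, 1)  -> bin [250, 260)
print(log_linear_bin(42.0))   # (42, 0)  -> bin [42, 43)
print(log_linear_bin(0.5))    # (50, -2) -> bin [0.50, 0.51)
```

Because the bin boundaries depend only on the value itself, two services that record the same latency always land in the same bin, which is what makes cross-service aggregation safe.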
Two popular implementations of log-linear histograms are HdrHistogram and Circllhist. Circonus comes with native support for Circllhist, which is used for this example.
Histogram metrics store latency information per minute, and are commonly visualized as heatmaps:
Merging the 360 one-minute histograms shown above into a single 6h histogram, we arrive at the following graph:
This is the true latency distribution over the full SLO reporting period of 6h, in this example.
At the time of this writing, there is no nice UI option to overlay percentiles in the above histogram graph. As we will see, you can perform the SLO calculation with CAQL or Python.
SLO Reporting via CAQL
We can use the CAQL functions histogram:rolling(6h) and histogram:percentile() to aggregate histograms over the last 6h and compute percentiles over the aggregated histograms. The SLO value we are looking for will be the very last value displayed on the graph.
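Concretely, a CAQL statement for this report could look roughly like the following. The metric name is the one from this example, but the pipe-composition shape is an assumption; adapt it to your own account and metrics:

```
search:metric:histogram("api`GET`/getState")
| histogram:rolling(6h)
| histogram:percentile(90)
```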
SLO Reporting using Python
Using the Python API the calculation could look as follows:
# 1. Fetch Histogram Data
t = 1528171020  # exact start time of the graph
N = 364         # exact number of minutes on the above graph
circ = circonusdata.CirconusData(config["demo"])
data = circ.caql('search:metric:histogram("api`GET`/getState")', t, 60, N)

# 2. Merge Histograms
H = Circllhist()
for h in data['output']:
    H.merge(h)

# Check that the fetched data is consistent with the histogram in the UI
circllhist_plot(H)
50-latency percentile over 6h: 13.507ms
90-latency percentile over 6h: 21.065ms
95-latency percentile over 6h: 27.796ms
99-latency percentile over 6h: 56.058ms
99.9-latency percentile over 6h: 918.760ms
In particular we see that the true p90 is around 21ms, which is far away from the average p90 of 36.28ms we computed earlier.
18.465 percent faster than 10ms
96.238 percent faster than 30ms
98.859 percent faster than 50ms
99.484 percent faster than 100ms
99.649 percent faster than 200ms
In particular, we replicate the “96.238% below 30ms” result that we calculated using the counter metrics before.
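The inverse-quantile numbers above reduce to a simple calculation. Over raw samples it amounts to the following self-contained sketch (illustrative data; with a histogram you would use the library’s count-below-threshold operation instead):

```python
def percent_faster(latencies_ms, threshold_ms):
    """Inverse quantile: percentage of samples strictly below threshold_ms."""
    below = sum(1 for l in latencies_ms if l < threshold_ms)
    return 100.0 * below / len(latencies_ms)

samples = [5, 12, 18, 24, 29, 31, 48, 95, 210, 900]  # illustrative latencies
for t in (10, 30, 50, 100, 200):
    print(f"{percent_faster(samples, t):6.3f} percent faster than {t}ms")
```

Unlike a percentile, which answers “how slow is the Nth-worst request,” this answers “what fraction of requests met the goal,” which maps directly onto an SLO statement.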
It’s important to understand that percentile metrics do not allow you to implement accurate Service Level Objectives that are formulated against hours or weeks. Aggregating one-minute percentiles seems tempting, but can produce materially wrong results, especially if your load is highly volatile.
The most practical ways to calculate correct SLO percentiles are counters and histograms. Histograms give you additional flexibility to choose the latency threshold after the fact. This comes in particularly handy when you are still evaluating your service and are not ready to commit to a latency threshold just yet.
Four steps to ensure that you hit your targets – and learn from your successes.
This is the first in a multi-part series about Service Level Objectives. The second part can be found here.
Whether you’re just getting started with DevOps or you’re a seasoned pro, goals are critical to your growth and success. They indicate an endpoint, describe a purpose, or more simply, define success. But how do you ensure you’re on the right track to achieve your goals?
You can’t succeed at your goals without first identifying them – AND answering “What does success look like?”
Your goals are more than high-level mission statements or an inspiring vision for your company. They must be quantified, measured, and reconciled, so you can compare the end result with the desired result.
For example, to promote system reliability we use Service Level Indicators (SLIs), set Service Level Objectives (SLOs1), and create Service Level Agreements (SLAs) to clarify goals and ensure that we’re on the same page as our customers. Below, we’ll define each of these terms and explain their relationships with each other, to help you identify, measure, and meet your goals.
Whether you’re a Site Reliability Engineer (SRE), developer, or executive, as a service provider you have a vested interest in (or responsibility for) ensuring system reliability. However, “system reliability” in and of itself can be a vague and subjective term that depends on the specific needs of the enterprise. So, SLOs are necessary because they define your Quality of Service (QoS) and reliability goals in concrete, measurable, objective terms.
But how do you determine fair and appropriate measures of success, and define these goals? We’ll look at four steps to get you there:
Identify relevant SLIs
Measure success with SLOs
Agree to an SLA based on your defined SLOs
Use gained insights to restart the process
Before we jump into the four steps, let’s make sure we’re on the same page by defining SLIs, SLOs, and SLAs.
So, What’s the Difference?
For the purposes of our discussion, let’s quickly differentiate between an SLI, an SLO, and an SLA. For example, if your broad goal is for your system to “…run faster,” then:
A Service Level Indicator is what we’ve chosen to measure progress towards our goal. E.g., “Latency of a request.”
A Service Level Objective is the stated objective of the SLI – what we’re trying to accomplish for either ourselves or the customer. E.g., “99.5% of requests will be completed in 5ms.”
A Service Level Agreement, generally speaking2, is a contract explicitly stating the consequences of failing to achieve your defined SLOs. E.g., “If 99% of your system requests aren’t completed in 5ms, you get a refund.”
Although most SLOs are defined in terms of what you provide to your customer, as a service provider you should also have separate internal SLOs that are defined between components within your architecture. For example, your storage system is relied upon by other components in your architecture for availability and performance, and these dependencies are similar to the promise represented by the SLOs within your SLA. We’ll call these internal SLOs out later in the discussion.
What Are We Measuring?: SLIs
Before you can build your SLOs, you must determine what it is you’re measuring. This will not only help define your objectives, but will also help set a baseline to measure against.
In general, SLIs help quantify the service that will be delivered to the customer — what will eventually become the SLO. These terms will vary depending on the nature of the service, but they tend to be defined in terms of either Quality of Service (QoS) or in terms of Availability.
Defining Availability and QoS
Availability means that your service is there if the consumer wants it. Either the service is up or it is down. That’s it.
Quality of Service (QoS) is usually related to the performance of service delivery (measured in latencies)
Availability and QoS tend to work best together. For example, picture a restaurant that’s always open, but has horrible food and service; or one that has great food and service but is only open for one hour, once a week. Neither is optimal. If you don’t balance these carefully in your SLA, you could either expose yourself to unnecessary risk or end up making a promise to your customer that effectively means nothing. The real path to success is in setting a higher standard and meeting it. Now, we’ll get into some common availability measurement strategies.
Traditionally, availability is measured by counting failures. That means the SLI for availability is the percentage of uptime or downtime. While you can use time quantum or transactions to define your SLAs, we’ve found that a combination works best.
Time quantum availability is measured by splitting your assurance window into pieces. If we split a day into minutes (1440), each minute represents a time quantum we could use to measure failure. A time quantum is marked as bad if any failures are detected, and your availability is then measured by dividing the number of good time quanta by the total number of quanta. Simple enough, right?
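The time-quantum calculation can be sketched as follows (epoch-second failure timestamps, one-minute quanta; names are illustrative):

```python
def time_quantum_availability(failure_timestamps,
                              window_seconds=86400, quantum_seconds=60):
    """Availability = good quanta / total quanta.

    A quantum is bad if any failure timestamp falls inside it.
    """
    total_quanta = window_seconds // quantum_seconds
    bad_quanta = {ts // quantum_seconds for ts in failure_timestamps}
    return (total_quanta - len(bad_quanta)) / total_quanta

# Three failures, two in the same minute -> 2 bad minutes out of 1440.
print(time_quantum_availability([30, 45, 3700]))
```

Note how two failures in the same minute count as a single bad quantum, which is exactly why a quiet period with a single unlucky transaction can ruin a quantum, as discussed next.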
The downside of this relatively simple approach is that it doesn’t accurately measure failure unless you have an even distribution of transactions throughout the day – and most services do not. You must also ensure that your time quantum is large enough to prevent a single bad transaction from ruining your objective. For example, a 0.001% error rate threshold makes no sense applied to fewer than 10k requests.
Transaction availability management uses raw transactions to measure availability – calculated by dividing the count of all successful transactions by the count of all attempted transactions over the course of each window. This method:
Provides a much stronger guarantee for the customer than the time quantum method.
Helps service providers avoid being penalized for SLA violations caused by short periods of anomalous behavior that affect a tiny fraction of transactions.
However, this method only works if you can measure attempted transactions… which is actually impossible. If data doesn’t show up, how could we know if it was ever sent? We’re not offering the customer much peace of mind if the burden of proof is on them.
So, we combine these approaches by dividing the assurance window into time quantum and counting transactions within each time quantum. We then use the transaction method to define part of our SLO, but we also mark any time quantum where transactions cannot be counted as failed, and incorporate that into our SLO as well. We’re now able to compensate for the inherent weakness of each method.
For example, if we have 144 million transactions per day with a 99.9% uptime SLO, our combined method would give this service an SLO that defines 99.9% uptime something like this:
“The service will be available and process requests for at least 1439 out of 1440 minutes each day. Each minute, at least 99.9% of the attempted transactions will be processed. A given minute will be considered unavailable if a system outage prevents the number of attempted transactions during that minute from being measured, unless the system outage is outside of our control.”
Using this example, we would violate this SLO if the system is down for 2 minutes (consecutive or non-consecutive) in a day, or if we fail more than 100 transactions in a minute (assuming 100,000 transactions per minute).
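The combined check from that example can be sketched as follows (per-minute `(attempted, succeeded)` pairs, with `None` marking a minute that could not be measured; all names are illustrative):

```python
def day_meets_slo(minutes, per_minute_target=0.999, allowed_bad_minutes=1):
    """A minute is bad if unmeasurable or below the per-minute success target."""
    bad = 0
    for m in minutes:
        if m is None:                      # outage prevented measurement
            bad += 1
        else:
            attempted, succeeded = m
            if attempted > 0 and succeeded / attempted < per_minute_target:
                bad += 1
    return bad <= allowed_bad_minutes

# 1438 healthy minutes, then one degraded minute and one blacked-out minute.
day = [(100_000, 100_000)] * 1438 + [(100_000, 99_800)] + [None]
print(day_meets_slo(day))  # False: 2 bad minutes exceed the 1-minute budget
```

The unmeasurable-minute branch is what shifts the burden of proof off the customer: if we cannot count attempted transactions, the minute counts against us.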
This way you’re covered, even if you don’t have consistent system use throughout the day, or can’t measure attempted transactions. However, your indicators often require more than just crunching numbers.
Remember, some indicators are more than calculations. We’re often too focused on performance criteria instead of user experience.
Looking back to the example from the “What’s the Difference” section, if we can guarantee latency below the liminal threshold for 99% of users, then improving that to 99.9% would obviously be better because it means fewer users are having a bad experience. That’s a better goal than just improving upon an SLI like retrieval speed. If retrieval speed is already 5 ms, would it be better if it were 20% faster? In many cases the end user may not even notice an improvement.
We could gain better insight by analyzing the inverse quantile of our retrieval speed SLI. The 99th quantile for latency just tells us how slow the experience is for the 99th percentile of users. But the inverse quantile tells us what percentage of user experiences meet or exceed our performance goal.
Defining Your Goals: SLOs
Once you’ve decided on an SLI, an SLO is built around it. Generally, SLOs are used to set benchmarks for your goals. However, setting an SLO should be based on what’s cost-effective and mutually beneficial for your service and your customer. There is no universal, industry-standard set of SLOs. It’s a “case-by-case” decision based on data, what your service can provide and what your team can achieve.
That being said, how do you set your SLO? Knowing whether or not your system is up no longer cuts it. Modern customers expect fast service. High latencies will drive people away from your service almost as quickly as your service being unavailable. Therefore it’s highly probable that you won’t meet your SLO if your service isn’t fast enough.
Since “slow” is the new “down,” many speed-related SLOs are defined using SLIs for service latency.
We track the latencies on our services to assess the success of both our external promises and our internal goals. For your success, be clear and realistic about what you’re agreeing to — and don’t lose sight of the fact that the customer is focused on “what’s in it for me.” You’re not just making promises, you’re showing commitment to your customer’s success.
For example, let’s say you’re guaranteeing that the 99th percentile of requests will be completed with latency of 200 milliseconds or less. You might then go further with your SLO and establish an additional internal goal that 80% of those requests will be completed in 5 milliseconds.
Next, you have to ask the hard question: “What’s the lowest quality and availability I can possibly provide and still provide exceptional service to users?” The spread between this service level and 100% perfect service is your budget for failure. The answer that’s right for you and your service should be based on an analysis of the underlying technical requirements and business objectives of the service.
Base your goals on data. As an industry, we too often select arbitrary SLOs.
There can be big differences between 99%, 99.9%, and 99.99%.
Setting an SLO is about setting the minimum viable service level that will still deliver acceptable quality to the consumer. It’s not necessarily the best you can do, it’s an objective of what you intend to deliver. To position yourself for success, this should always be the minimum viable objective, so that you can more easily accrue error budgets to spend on risk.
Agreeing to Success: The SLA
As you see, defining your objectives and determining the best way to measure against them requires a significant amount of effort. However, well-planned SLIs and SLOs make the SLA process smoother for you and your customer.
While commonly built on SLOs, the SLA is driven by two factors:
the promise of customer satisfaction, and the best service you can deliver.
The key to defining fair and mutually beneficial SLAs (and limiting your liability) is calculating a cost-effective balance between these two needs.
SLAs also tend to be defined by multiple, fixed time frames to balance risks. These time frames are called assurance windows. Generally, these windows will match your billing cycle, because these agreements define your refund policy.
Breaking promises can get expensive when an SLA is in place
– and that’s part of the point – if you don’t deliver, you don’t get paid.
As mentioned earlier, you should give yourself some breathing room by setting the minimum viable service level that will still deliver acceptable quality to the consumer. You’ve probably heard the advice “under-promise and over-deliver.” That’s because exceeding expectations is always better than the alternative. Using a tighter internal SLO than what you’ve committed to gives you a buffer to address issues before they become problems that are visible — and disappointing — to users. So, by “budgeting for failure” and building some margin for error into your objectives, you give yourself a safety net for when you introduce new features, load-test, or otherwise experiment to improve system performance.
Learn, Innovate, and Start Over
Your SLOs should reflect the ways you and your users expect your service to behave. Your SLIs should measure them accurately. And your SLA must make sense for you, your client, and your specific situation. Use all available data to avoid guesswork. Select goals that fit you, your team, your service, and your users. And:
Identify the SLIs that are relevant to your goals
Measure your goals precisely with SLOs
Agree to an SLA based on your defined SLOs
Use any gained insights to set new goals, improve, and innovate
Knowing how well you’re meeting your goals allows you to budget for the risks inherent to innovation. If you’re in danger of violating an SLA or falling short of your internal SLO, it’s time to take fewer risks. On the other hand, if you’re comfortably exceeding your goals, it’s time to either set more ambitious ones, or to use that extra breathing room to take more risks. This enables you to deploy new features, innovate, and move faster!
That’s the overview. In part 2, we’ll take a closer look at the math used to set SLOs.
1 Although SLO still seems to be the favored term at the time of this writing, the Information Technology Infrastructure Library (ITIL) v3 has deprecated “SLO” and replaced it with Service Level Target (SLT).
2 There has been much debate as to whether an SLA is a collection of SLOs or simply an outward-facing SLO. Regardless, it is universally agreed that an SLA is a contract that defines the expected level of service and the consequences for not meeting it.
Getting paged at 11pm on New Year’s Eve because the application code used sprintf %d on a 32-bit system and your ids just passed 4.295 billion, sending the ids negative and crashing your object service. A wakeup call at 2am (or is it 3am?) on the ‘spring forward’ Daylight Savings transition because your timezone libraries didn’t incorporate one of the several dozen new politically mandated timezone changes. Sweating a four-hour downtime two days in a row due to primary/replica database failover, because your kernel RAID driver threw the same unhandled exception twice in a row; your backup primary database server naturally uses the same hardware as the active one, of course.
Circonus was created by its founders because they experienced the pain of reliability engineering on large-scale systems first hand. They needed tools to efficiently diagnose and resolve problems in distributed systems. And they needed to do it at scale. The existing tools at the time (Nagios, Ganglia, etc.) couldn’t cope with the volume of telemetry, nor provide the insight into system behaviors that was needed. So they set out to develop tools and methods that would fill the void.
The first of these was using histograms to visualize timing data. Existing solutions would give you the average latency, the 95th percentile, the 99th percentile, and maybe a couple of others. This information was useful for one host, but mathematically useless for aggregate system metrics. Capturing latency metrics and storing them as log-linear histograms allowed users to see the distribution of values over a time window. Moreover, this data could be aggregated across multiple hosts to give a holistic view of the performance of a distributed system or service.
However, systems are dynamic and constantly changing. Systems that behave well one second and poorly the next are the norm, not the exception, in today’s ephemeral infrastructures. So we added heatmaps, which are histogram representations over discrete windows of time, giving users an overview of the actual performance of their system. If the diagram below were a traditional line graph showing the average latency value, it would be a mostly straight line, hiding the parts where long-tail latencies became unbearable for certain customers. Heatmaps give SREs the power to separate ‘works fine’ in testing from ‘this is really slow’ for those outlier large customers (who are generally the ones paying the big bucks).
These tools became formative components of standards that had been developing in the SRE community. A few years ago, Brendan Gregg introduced the USE method (Utilization, Saturation, Errors), a set of metrics that are key indicators of host-level health. Following on its tails, Tom Wilkie introduced the RED method (Rate, Errors, Duration), a set of metrics that are indicators of service-level health. Combining the two gives SREs a powerful pair of standard frameworks for quickly identifying bad behavior in both hosts and services.
These types of visualizations display a wealth of information, and as a result can put demands on the underlying metrics storage layer. A year ago, we released IRONdb, the time-series database we developed in C and Lua. This standalone TSDB can now power Grafana-based visualizations, which have become part of the toolset for many SREs. As the complexity of today’s microservice-based architectures grows, and the lifetime of individual components falls, the need for high-volume time series telemetry continues to increase. Here at Circonus, we are dedicated to bringing you solutions that solve the parts of reliability engineering which have caused us pain in the past and which affect all SREs, so that you can focus your efforts on the parts of your business you know better than anyone else.
The Circonus platform delivers time-series data alerts, graphs, dashboards, and machine-learning intelligence that help to optimize not just your operations, but also your business.
IRONdb is a highly resilient stand-alone time-series database designed to power Circonus. Learn more about IRONdb at www.irondb.io.