A simple primer on the complicated statistical analysis behind setting your Service Level Objectives.
Statistical analysis is a critical – but often complicated – component in determining your ideal Service Level Objectives (SLOs). So, a “deep-dive” on the subject requires much more detail than can be explored in a blog post. However, we aim to provide enough information here to give you a basic understanding of the math behind a smart SLO – and why it’s so important that you get it right.
Auditable, measurable data is the cornerstone of setting and meeting your SLOs. As stated in part one, Availability and Quality of Service (QoS) are the indicators that help quantify what you’re delivering to your customers, via time quantum and/or transaction availability. The better data you have, the more accurate the analysis, and the more actionable insight you have to work with.
So yes, it’s complicated. But understanding the importance of the math of SLOs doesn’t have to be.
Functions of SLO Analysis
SLO analysis is based on probability: the likelihood that an event will, or will not, take place. As such, it primarily uses two types of functions: the Probability Density Function (PDF) and the Cumulative Distribution Function (CDF).
Simply put, the analysis behind determining your SLO is driven by the basic concept of probability.
For example, the PDF answers questions like “What is the probability that the next transaction will have a latency of X?” As the integral of the PDF, the CDF answers questions like “What’s the probability that the next transaction will have a latency less than X?” or, through its complement, “What’s the probability that the next transaction will have a latency greater than X?”
| | Probability Density Function (PDF) | Cumulative Distribution Function (CDF) |
|---|---|---|
| INPUTS | Any measurement | Any measurement |
| OUTPUTS | The probability that a given sample of data will have the input measurement. | The probability that a sample X will take a value less than or equal to the input measurement x. |
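To make the two functions concrete, here is a minimal sketch that estimates both from raw samples. The latency values and the function names are hypothetical, chosen only for illustration; a real analysis would work from your own measurements.

```python
from bisect import bisect_right

# Hypothetical latency samples in milliseconds (illustrative data only).
latencies_ms = [12, 15, 15, 18, 22, 25, 31, 40, 95, 210]

def empirical_cdf(samples, x):
    """Fraction of samples with a value less than or equal to x."""
    ordered = sorted(samples)
    return bisect_right(ordered, x) / len(ordered)

def empirical_density(samples, lo, hi):
    """Fraction of samples in [lo, hi), divided by the interval width."""
    in_range = sum(1 for s in samples if lo <= s < hi)
    return in_range / len(samples) / (hi - lo)

p_le_25 = empirical_cdf(latencies_ms, 25)   # P(latency <= 25 ms)
p_gt_25 = 1 - p_le_25                       # P(latency > 25 ms), the complement
density_10_20 = empirical_density(latencies_ms, 10, 20)

print(p_le_25, p_gt_25, density_10_20)
```

The "greater than X" question is answered by the complement of the CDF, which is why a single function covers both directions.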
Percentiles and Quantiles
Before we get further into expressing these functions, let’s have a quick sidebar about percentiles vs. quantiles. Unfortunately, this is a simple concept that has gotten quite complicated.
A percentile is measured on a 0-100 scale, and expressed as a percentage. For example: the “99th percentile” means “as good or better than” 99% of the distribution.
A quantile is the same data, expressed on a 0-1 scale. So as a quantile, that “99th percentile” above would be expressed as “0.99.”
That’s basically it. While scientists prefer using percentiles, the only differences from a quantile are a decimal point and a percentage symbol. However, for SLO analysis, the quantile function is important because it is mapped to the CDF we discussed earlier.
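As a rough illustration of how the quantile function maps back to the CDF, the sketch below uses the nearest-rank definition: the quantile at q is the smallest sample whose cumulative fraction reaches q. The nearest-rank choice and the sample data are assumptions made for this example; libraries differ in how they interpolate.

```python
import math

# Hypothetical latency samples in milliseconds (illustrative data only).
latencies_ms = [12, 15, 15, 18, 22, 25, 31, 40, 95, 210]

def quantile(samples, q):
    """Nearest-rank quantile: smallest sample v with CDF(v) >= q, for 0 < q <= 1."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]

def percentile(samples, p):
    """Same calculation on a 0-100 scale."""
    return quantile(samples, p / 100)

median = quantile(latencies_ms, 0.5)
p99_as_quantile = quantile(latencies_ms, 0.99)
p99_as_percentile = percentile(latencies_ms, 99)

print(median, p99_as_quantile, p99_as_percentile)
```

The percentile and quantile calls return the same value, which is the point of the sidebar: only the scale of the input differs.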
Remember, this is an overview of basic concepts to provide “top-level” understanding of the math behind a smart SLO.
For a deeper dive, check out “Seeking SRE,” edited by David N. Blank-Edelman.
The Data Volume Factor
As any analyst will tell you, the sheer volume of data (or lack thereof) can dramatically impact your results, leading to uninformed insight, inaccurate reporting, and poor decisions. So, it’s imperative that you have enough data to support your analysis. For example, low volumes in the time quantum can produce incredibly misleading results if you don’t specify your SLOs well.
With large amounts of data, the error levels in quantile approximations tend to be lower than with “not enough” data. The worst possible case, a single sample per bin with the sample value at the edge of the bin, can cause errors of 5%. In practice, with log linear histograms, we tend to see data sets span 300 bins, so sets that contain thousands of data points tend to provide sufficient data for accurate statistical analyses.
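One way to see where a 5% worst case can come from is to work through the arithmetic for a single bin. The sketch below assumes bins that keep two significant decimal digits (1.0, 1.1, ... 9.9, times a power of ten) and assumes a lone sample is reported as the midpoint of its bin. Both are illustrative assumptions about the layout, not a description of any particular library.

```python
def worst_case_relative_error(mantissa):
    """Relative error when the true value sits on the lower edge of its bin.

    Assumption: the bin covers [mantissa, mantissa + 1) * 10**k for some
    exponent k, and any value in the bin is reported as the bin midpoint.
    The exponent cancels out, so only the mantissa (10..99) matters.
    """
    lower_edge = mantissa
    midpoint = mantissa + 0.5
    return (midpoint - lower_edge) / lower_edge

errors = {m: worst_case_relative_error(m) for m in range(10, 100)}
widest = max(errors, key=errors.get)

print(widest, errors[widest])  # bin with the largest worst-case error
print(errors[99])              # bin with the smallest worst-case error
```

Under these assumptions the worst bin is the one starting at mantissa 10 (for example 1.0 to 1.1), where the midpoint is off by 0.05 / 1.0 = 5%. With many samples per bin, errors on either side of the midpoint tend to offset each other, which is consistent with the lower error levels described above.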
Inverse quantiles can also come into play. For example, suppose we define an SLO such that our 99th percentile request latency must be within 200ms. At low sample volumes, this approach is likely to be meaningless: with only a dozen or so samples, the 99th percentile can be far out of band compared to the median. And the percentile and time quantum approach doesn’t tell us how many samples exceeded that 200ms quantum.
We can use inverse percentiles to define an SLO that says we want 80 percent of our requests to be faster than that 200ms quantum. Or alternatively, we can set our SLO as a fixed number of requests within the time quantum; say “I want fewer than 100 requests to exceed my 200ms time quantum over a span of 10 minutes.”
The actual implementations can vary, so it is incumbent upon the implementer to choose one which suits their business needs appropriately.
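As one possible sketch of the two formulations above, the functions below evaluate a single window of latencies against a ratio-based SLO and a count-based SLO. The function names, default values, and the sample window are hypothetical; they simply mirror the 80 percent, 200ms, and 100-request figures used in the text.

```python
def fraction_faster_than(latencies_ms, threshold_ms):
    """Inverse percentile: share of requests faster than the threshold."""
    if not latencies_ms:
        raise ValueError("no samples in this window")
    fast = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return fast / len(latencies_ms)

def meets_ratio_slo(latencies_ms, threshold_ms=200, target=0.80):
    """True if at least `target` of the requests beat the threshold."""
    return fraction_faster_than(latencies_ms, threshold_ms) >= target

def meets_count_slo(latencies_ms, threshold_ms=200, max_slow=100):
    """True if fewer than `max_slow` requests exceeded the threshold."""
    slow = sum(1 for latency in latencies_ms if latency > threshold_ms)
    return slow < max_slow

# Hypothetical 10-minute window: 850 fast requests and 150 slow ones.
window = [50] * 850 + [250] * 150

print(fraction_faster_than(window, 200))
print(meets_ratio_slo(window), meets_count_slo(window))
```

The same window can pass one formulation and fail the other, which is why the choice should follow the business need rather than convenience.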
Defining Formulas and Analysis
Based on the information you’re trying to get, and your sample set, the next step is determining the right formulas or functions for analysis. For SLO-related data, most practitioners use open-source histogram libraries. There are many implementations out there, ranging from log-linear, to t-digest, to fixed bin. These libraries often provide functions to execute quantile calculations, inverse quantile calculations, bin counts, and other mathematical operations needed for statistical data analysis.
Some analysts use approximate histograms, such as t-digest. However, those implementations often exhibit double digit error rates near median values. With any histogram-based implementation, there will always be some level of error, but implementations such as log linear can generally minimize that error to well under 1%, particularly with large numbers of samples.
Common Distributions in SLO Analysis
Once you’ve begun analysis, there are several different mathematical models you will use to describe the distribution of your measurement samples, or at least how you expect them to be distributed.
Normal distributions: The common “bell-curve” distribution often used to describe random variables whose distribution is not known.
Gamma distributions: A two-parameter family of continuous probability distributions, important for using the PDF and CDF.
Pareto distributions: Most of the samples are concentrated near one end of the distribution. Often useful for describing how system resources are utilized.
In real life, our networks, systems, and computers are all complex entities, and you will probably almost never see something that perfectly fits any of these distributions. You may have spent a lot of time discussing normal distributions in Statistics 101, but you will probably never come across one as an SRE.
While you may often see distributions that resemble the Gamma or Pareto model, it’s highly unusual to see a distribution that’s a perfect fit.
Instead, most of your sample distributions will be a composition of different models, which is completely normal and expected. A “single mode” latency distribution is often well represented by a Gamma distribution, but it is exceptionally rare that we see a single latency distribution on its own. What we see is usually multiple latency distributions all “jammed together”, which results in multi-modal distributions.
That could be the result of a few different common code paths (each with a different distribution), a few different types of clients each with a different usage pattern or network connection… Or both. So most of the latency distributions we’ll see in practice are actually a handful (and sometimes a bucket full) of different gamma-like distributions stacked atop each other. The point being, don’t worry too much about any specific model – it’s the actual data that’s important.
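To illustrate how stacked code paths produce a multi-modal shape, here is a small simulation. The 70/30 split between paths and the Gamma parameters are invented for the example, and the fixed seed only makes the run repeatable.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

def sample_latency_ms():
    """Draw one latency from a hypothetical service with two code paths."""
    if rng.random() < 0.7:
        # Hypothetical fast path: Gamma with mean 4 * 5 = 20 ms.
        return rng.gammavariate(4, 5)
    # Hypothetical slow path: Gamma with mean 9 * 20 = 180 ms.
    return rng.gammavariate(9, 20)

samples = [sample_latency_ms() for _ in range(10_000)]

def density_between(values, lo, hi):
    """Samples per millisecond falling in [lo, hi)."""
    return sum(1 for v in values if lo <= v < hi) / (hi - lo)

fast_mode = density_between(samples, 0, 60)
valley = density_between(samples, 60, 100)
slow_mode = density_between(samples, 100, 300)

print(fast_mode, valley, slow_mode)
```

The band between the two paths is noticeably sparser than either side of it, so no single Gamma curve would describe the combined data well.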
Histograms in SLO Analysis
A histogram is a representation of the distribution of a continuous variable, in which the entire range of values is divided into a series of intervals (or “bins”) and the representation displays how many values fall into each bin.
If for any reason your range of values is on the low end, this is where a data volume issue (as we mentioned above) could rear its ugly head and distort your results.
However, histograms are ideal for SLO analysis, or any high-frequency, high-volume data, because they allow us to store the complete distribution of data at scale. You can describe a histogram with between 3 and 10 bytes per bin, depending on the varbit encoding of 8 of those bytes. Compression reduces that further. That’s an efficient approach to storing a large number of bounded sample values. So instead of storing a handful of quantiles, we can store the complete distribution of data and calculate arbitrary quantiles and inverse quantiles on demand, as well as apply more advanced modeling techniques.
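A minimal sketch of that idea follows: store only a count per bin, then answer quantile and inverse quantile questions from the counts. For simplicity it assumes fixed-width 10ms bins rather than a log-linear layout, and the class and method names are hypothetical rather than taken from any library.

```python
from collections import Counter

BIN_WIDTH_MS = 10  # assumption: fixed-width bins keep the example short

class LatencyHistogram:
    """Stores one counter per bin instead of every raw sample."""

    def __init__(self):
        self.bins = Counter()
        self.total = 0

    def record(self, latency_ms):
        self.bins[int(latency_ms // BIN_WIDTH_MS)] += 1
        self.total += 1

    def quantile(self, q):
        """Upper edge of the first bin where the cumulative count reaches q."""
        if self.total == 0:
            raise ValueError("histogram is empty")
        target = q * self.total
        running = 0
        for index in sorted(self.bins):
            running += self.bins[index]
            if running >= target:
                return (index + 1) * BIN_WIDTH_MS
        return (max(self.bins) + 1) * BIN_WIDTH_MS

    def inverse_quantile(self, threshold_ms):
        """Fraction of samples in bins that lie entirely below the threshold."""
        if self.total == 0:
            raise ValueError("histogram is empty")
        below = sum(count for index, count in self.bins.items()
                    if (index + 1) * BIN_WIDTH_MS <= threshold_ms)
        return below / self.total

hist = LatencyHistogram()
for latency, count in [(15, 900), (150, 90), (450, 10)]:
    for _ in range(count):
        hist.record(latency)

print(len(hist.bins), hist.quantile(0.5), hist.quantile(0.95))
print(hist.inverse_quantile(200))
```

A thousand samples collapse into three bins here, yet any quantile or threshold question can still be asked after the fact, at the cost of bin-width precision.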
We’ll dig deeper into histograms in part 3.
In summary, analysis plays a critical role in setting your Service Level Objectives, because raw data is just that — raw and unrefined. To put yourself in a good position when setting SLOs, you must:
- Know the data you’re analyzing. Choose data structures that are appropriate for your samples, ones that provide the needed precision and robustness for analysis. Be knowledgeable of the expected cardinality and expected distribution of your data set.
- Understand how you’re analyzing the data and reporting your results. Ensure your analyses are mathematically correct. Recognize whether your data fits known distributions, and the implications that arise from that.
- Set realistic expectations for results. Your outputs are only as good as the data you provide as inputs. Aggregates are excellent tools but it is important to understand their limitations.
- And always be sure that you have enough data to support the analysis. A 99th percentile calculated with a dozen samples will likely vary significantly from one with hundreds of samples. Outliers can exert great influence over aggregates on small sets of data, but larger data sets are robust and not as susceptible.
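The last point can be shown with a small, deterministic sketch. It uses a nearest-rank 99th percentile and invented sample sets in which a single 900ms outlier appears once among otherwise identical 20ms requests.

```python
import math

def p99(samples):
    """Nearest-rank 99th percentile of the samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(0.99 * len(ordered)))
    return ordered[rank - 1]

def mean(samples):
    return sum(samples) / len(samples)

# Hypothetical data: one 900 ms outlier among 20 ms requests.
small_set = [20] * 11 + [900]    # a dozen samples
large_set = [20] * 999 + [900]   # a thousand samples

print(p99(small_set), mean(small_set))
print(p99(large_set), mean(large_set))
```

With a dozen samples the 99th percentile is simply the largest value, so the outlier defines it and also drags the mean far from 20ms. With a thousand samples the same outlier barely moves either figure.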
With each of those pieces in place, you’ll gain the insight you need to make the smartest decision possible.
That concludes the basic overview of SLO analysis. As mentioned above, part 3 will focus, in more detail, on how to use histograms in SLO analysis.