Setting and measuring latency Service Level Objectives (SLOs) is a critical responsibility for engineers who monitor the performance and health of applications and systems. An SLO is an agreed target for an acceptable level of availability and performance, and it is key to helping engineers balance risk and innovation. Unfortunately, many organizations don’t fully realize the benefits of SLOs because they are often implemented incorrectly, and the consequences include significant loss of time, money, and resources. The following are four common SLO mistakes to avoid.
Error #1: Aggregating percentiles
Percentiles are commonly used to summarize distributions, particularly when analyzing things like latency. Unfortunately, many engineers aggregate pre-calculated percentiles when computing SLOs, which inevitably produces mathematically incorrect results.
As an example, if you’re monitoring a set of ten web servers and want to collect latency statistics across all of them, a common technique is to calculate percentiles on each server and then store those calculated percentiles. If you later want to perform an analysis, such as calculating global latency percentiles, the stored percentile metrics from all ten servers are aggregated, for example by averaging them. Unfortunately, deriving a percentile from pre-calculated percentile values is mathematically impossible: a percentile discards the distribution behind it, so the average of ten 99th percentiles is generally not the 99th percentile of the combined traffic. Once you have converted raw data to percentiles, there is no meaningful way to aggregate them any further. Because nearly every monitoring and observability toolset provides percentiles without any limits on their usage, operators can easily apply them to SLO analyses without understanding the consequences.
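To make the problem concrete, here is a minimal sketch in Python. The data is hypothetical (nine fast servers and one slow one, 1,000 requests each), and `percentile` is an illustrative nearest-rank helper, not the method of any particular monitoring tool.

```python
import math

def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least q% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical data: nine fast servers and one slow server, 1,000 requests each.
fast_server = [50 + (i % 100) for i in range(1000)]       # 50..149 ms
slow_server = [500 + 5 * (i % 100) for i in range(1000)]  # 500..995 ms
servers = [fast_server] * 9 + [slow_server]

# Common but incorrect: average the per-server p99 values.
averaged_p99 = sum(percentile(s, 99) for s in servers) / len(servers)

# Correct: compute p99 over the combined raw samples.
all_samples = [latency for s in servers for latency in s]
global_p99 = percentile(all_samples, 99)

print(f"average of per-server p99s: {averaged_p99:.1f} ms")
print(f"p99 of all requests:        {global_p99} ms")
```

With this made-up data the averaged figure is 232.2 milliseconds while the true 99th percentile of all requests is 945 milliseconds, because the slowest 1% of requests all come from the one slow server.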
Tip: A better approach than storing percentiles is to store the source sample data in a manner that is more efficient than storing single samples, but still able to produce statistically significant aggregates. At Circonus, we have long advocated histograms, which store summaries of the raw data rather than pre-calculated percentiles. Those histograms then can be freely aggregated, and they contain enough information to calculate accurate percentiles at the time of display instead of pre-calculating them.
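The idea of a freely aggregatable histogram can be sketched as follows. This is an illustration only, assuming a hypothetical fixed 10 millisecond bin width shared by every server; it is not Circonus’s implementation, and the function names are made up for the example.

```python
import math
from collections import Counter

BIN_WIDTH_MS = 10  # hypothetical fixed bin width shared by every server

def to_histogram(latencies_ms):
    """Summarize raw samples as counts per bin, keyed by the bin's lower bound."""
    return Counter((int(v) // BIN_WIDTH_MS) * BIN_WIDTH_MS for v in latencies_ms)

def merge(histograms):
    """Histograms with identical bin boundaries merge by adding their counts."""
    total = Counter()
    for hist in histograms:
        total.update(hist)
    return total

def percentile_bin(hist, q):
    """Return the (low, high) bounds of the bin holding the q-th percentile (nearest rank)."""
    rank = max(1, math.ceil(q * sum(hist.values()) / 100))
    seen = 0
    for low in sorted(hist):
        seen += hist[low]
        if seen >= rank:
            return low, low + BIN_WIDTH_MS
    raise ValueError("empty histogram")

# Hypothetical data: nine fast servers and one slow server, 1,000 requests each.
fast_server = [50 + (i % 100) for i in range(1000)]       # 50..149 ms
slow_server = [500 + 5 * (i % 100) for i in range(1000)]  # 500..995 ms

# Each server stores only its histogram; the global view is built at query time.
per_server = [to_histogram(fast_server)] * 9 + [to_histogram(slow_server)]
merged = merge(per_server)
print(percentile_bin(merged, 99))
```

Because merging only adds counts, the merged histogram is identical to the one you would get by binning all raw samples together, so the percentile computed from it is accurate to within one bin width.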
Error #2: Using the wrong supporting evidence to enforce your SLO
While histograms are the best method for computing SLOs, not all histograms are created equal. Many monitoring tools provide histograms for measuring SLOs, and these histograms divide all sample data into a series of intervals called bins. But many of these tools provide histograms with an extremely low number of bins (as few as 8). With so few bins, engineers are forced to make each bin very wide. On top of this, engineers often do not align their bin boundaries with the question they are actually trying to answer. The result is very large errors when calculating latency SLOs.
For example, say you have an SLO for your search API that requires 99% of search queries to return in under 600 milliseconds. You set your bin boundaries at 250-500 milliseconds, 501-750 milliseconds, 751-1,000 milliseconds, etc. None of your bin boundaries falls at 600 milliseconds, and each bin spans 250 milliseconds. Now you ask for the 99th percentile, and your graphs show 700 milliseconds, causing you to think you need to put resources into shaving off 100 milliseconds. But your latency is not actually 700 milliseconds. All the histogram can tell you is that it lies somewhere in the range of 501-750 milliseconds. So it could be faster or slower than you think, and the data does not accurately answer the question being asked.
Histograms with large binning provide an estimation of what your latencies are, and the error bars can be as high as 70-80%. Unfortunately, the sloppiness of the answer is not conveyed to the engineer reading the graph, so it’s extremely misleading. There’s an understandable assumption that the answers on the graph are right, and this is almost never the case. As a result, organizations will put resources into optimizing latencies based on inaccurate information.
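The 600 millisecond example can be sketched in Python. The bin counts below are hypothetical, and `fraction_below` is an illustrative helper that returns the lowest and highest fraction of requests that could be under a threshold, given only the bin counts.

```python
def fraction_below(boundaries, counts, threshold):
    """Return (lowest, highest) possible fraction of samples below threshold.

    boundaries: sorted bin edges; counts[i] covers [boundaries[i], boundaries[i + 1]).
    """
    total = sum(counts)
    certain = 0   # samples in bins that end at or below the threshold
    possible = 0  # the above, plus samples in the bin that straddles the threshold
    for i, count in enumerate(counts):
        low, high = boundaries[i], boundaries[i + 1]
        if high <= threshold:
            certain += count
            possible += count
        elif low < threshold:
            possible += count
    return certain / total, possible / total

# Coarse bins as in the example above (a 0-250 ms bin is assumed); counts are made up.
coarse_edges = [0, 250, 500, 750, 1000]
coarse_counts = [700, 250, 45, 5]
print(fraction_below(coarse_edges, coarse_counts, 600))    # (0.95, 0.995)

# Same hypothetical traffic, but with a bin boundary placed at 600 ms.
aligned_edges = [0, 250, 500, 600, 750, 1000]
aligned_counts = [700, 250, 42, 3, 5]
print(fraction_below(aligned_edges, aligned_counts, 600))  # (0.992, 0.992)
```

With the coarse bins, anywhere from 95% to 99.5% of requests may have met the 600 millisecond target, so you cannot tell whether a 99% SLO was met. With a boundary at 600 milliseconds the two bounds coincide and the question has a definite answer.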
Tip: Aim for histograms with a high number of bins (Circllhist has 43,000 bins) and place a bin boundary at the threshold your SLO actually asks about. It’s critical to have enough bins in the latency range that is relevant for your percentiles so you can guarantee 5% accuracy on all percentiles.
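One way to get many bins with a bounded relative error is log-linear binning, where every bin covers values that share the same two leading decimal digits. The following is a rough sketch of that idea under that assumption. It is loosely modeled on log-linear histograms in general and should not be read as the actual Circllhist implementation; both function names are hypothetical.

```python
from decimal import Decimal

def log_linear_bin(value_ms):
    """Return (low, high) of the two-significant-digit bin containing value_ms (> 0)."""
    d = Decimal(str(value_ms))
    exp = d.adjusted() - 1     # power of ten of the second significant digit
    width = Decimal(10) ** exp  # bin width at this order of magnitude
    low = (d // width) * width  # truncate to two significant digits
    return low, low + width

def worst_case_relative_error(value_ms):
    """Largest relative error if the bin midpoint is reported instead of the true value."""
    low, high = log_linear_bin(value_ms)
    return (high - low) / 2 / low

for v in (7, 42, 637, 600, 12345):
    low, high = log_linear_bin(v)
    print(f"{v} ms falls in [{low}, {high})")
```

In this scheme a bin is at most 10% as wide as its lower bound, so reporting the midpoint is off by at most 5%. Round thresholds such as 600 milliseconds also land exactly on a bin boundary.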
Error #3: Setting the wrong SLO
Setting a latency SLO is about setting the minimum viable service level that’s acceptable to your users, so you can take more risks. However, organizations routinely set SLOs too high. Why put time, effort, and money into optimizing uptime and performance to an unnecessary level?
When you break your SLO, there should actually be consequences. If your users are happy even when you don’t hit your SLO, then there’s a chance you’ve set your SLO too high. For instance, if you set your objective to the 99th percentile, but nobody notices a problem until you hit the 95th percentile, then your SLO is costing you unnecessary time and money.
In addition to setting SLOs too high, organizations often set SLOs that are of little value to the business because they focus on objectives that are too low level. SLOs should be set around customer-perceived value, because this is what directly affects your ability to succeed. If organizations invest in low-level SLOs and fail to invest in high-level ones, they’re spending resources without receiving any benefit.
Also, SLOs shouldn’t be confused with monitoring. An SLO is an availability and a performance guarantee — it doesn’t tell you when something’s down. An SLO should not be set around identifying when things are broken — this is what you use standard monitoring practices for.
Tip: Set your SLOs to the absolute minimum quality of service and availability that’s possible without having bad consequences, and ensure they are focused on actual business value.
Error #4: Thinking it’s right the first time
A lot of organizations spend significant effort trying to set their SLOs correctly up front. Unfortunately, this is wasted effort, because you’re going to be wrong. The goal should not be to get your SLOs perfect the first time, which is impossible. Rather, setting SLOs should be an iterative process. You should have a feedback loop that uses what you learn every day to tell you whether to change your view of what deserves an SLO and what its parameters should be.
The reality is that your software and your consumers are always changing. We’ve seen companies spend thousands of dollars and waste significant resources trying to achieve an objective that was unnecessary because they didn’t iteratively go back and ensure their objectives were set reasonably.
Tip: Using histograms, routinely review your data over the past few months to identify at what threshold you begin to see a negative impact downstream. The key is to have flexibility with your SLOs. You will need to reassess them regularly to ensure they’re not too loose and not too tight.
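A periodic review along these lines could start from something like the sketch below. It assumes you keep one histogram per month, keyed by hypothetical (low, high) bin bounds in milliseconds, and that the candidate thresholds fall on bin boundaries. All names and numbers are invented for illustration.

```python
def attainment(hist, threshold_ms):
    """Fraction of requests in bins that end at or below threshold_ms."""
    total = sum(hist.values())
    met = sum(count for (low, high), count in hist.items() if high <= threshold_ms)
    return met / total

def review(monthly_hists, candidate_thresholds_ms):
    """For each candidate threshold, report the fraction of requests under it per month."""
    return {
        threshold: {month: attainment(hist, threshold)
                    for month, hist in monthly_hists.items()}
        for threshold in candidate_thresholds_ms
    }

# Hypothetical monthly histograms: (low_ms, high_ms) -> request count.
months = {
    "month 1": {(0, 200): 9000, (200, 400): 800, (400, 600): 150, (600, 800): 50},
    "month 2": {(0, 200): 8800, (200, 400): 900, (400, 600): 250, (600, 800): 50},
}

report = review(months, [400, 600])
for threshold, by_month in report.items():
    print(f"under {threshold} ms: {by_month}")
```

Comparing a table like this with the months in which users or downstream systems actually reported problems gives you evidence for tightening or loosening the threshold.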
Latency SLOs are a key characteristic of more modern, advanced monitoring, but their benefits cannot be fully achieved without doing them right. There are common mistakes when setting and calculating SLOs that unfortunately can cost organizations a lot of time and money. Doing SLOs right is not easy. But embrace them and the effort to do them accurately, so you can feel confident in the critical decisions you make around ensuring performance and deploying new features.