Four steps to ensure that you hit your targets – and learn from your successes.
This is the first in a multi-part series about Service Level Objectives. The second part can be found here.
Whether you’re just getting started with DevOps or you’re a seasoned pro, goals are critical to your growth and success. They indicate an endpoint, describe a purpose, or more simply, define success. But how do you ensure you’re on the right track to achieve your goals?
You can’t succeed at your goals without first identifying them – AND answering “What does success look like?”
Your goals are more than high-level mission statements or an inspiring vision for your company. They must be quantified, measured, and reconciled, so you can compare the end result with the desired result.
For example, to promote system reliability we use Service Level Indicators (SLIs), set Service Level Objectives (SLOs1), and create Service Level Agreements (SLAs) to clarify goals and ensure that we’re on the same page as our customers. Below, we’ll define each of these terms and explain their relationships with each other, to help you identify, measure, and meet your goals.
Whether you’re a Site Reliability Engineer (SRE), developer, or executive, as a service provider you have a vested interest in (or responsibility for) ensuring system reliability. However, “system reliability” in and of itself can be a vague and subjective term that depends on the specific needs of the enterprise. So, SLOs are necessary because they define your Quality of Service (QoS) and reliability goals in concrete, measurable, objective terms.
But how do you determine fair and appropriate measures of success, and define these goals? We’ll look at four steps to get you there:
- Identify relevant SLIs
- Measure success with SLOs
- Agree to an SLA based on your defined SLOs
- Use gained insights to restart the process
Before we jump into the four steps, let’s make sure we’re on the same page by defining SLIs, SLOs, and SLAs.
So, What’s the Difference?
For the purposes of our discussion, let’s quickly differentiate between an SLI, an SLO, and an SLA. For example, if your broad goal is for your system to “…run faster,” then:
- A Service Level Indicator is what we’ve chosen to measure progress towards our goal. E.g., “Latency of a request.”
- A Service Level Objective is the stated objective of the SLI – what we’re trying to accomplish for either ourselves or the customer. E.g., “99.5% of requests will be completed in 5ms.”
- A Service Level Agreement, generally speaking2, is a contract explicitly stating the consequences of failing to achieve your defined SLOs. E.g., “If 99% of your system requests aren’t completed in 5ms, you get a refund.”
Although most SLOs are defined in terms of what you provide to your customer, as a service provider you should also have separate internal SLOs that are defined between components within your architecture. For example, your storage system is relied upon by other components in your architecture for availability and performance, and these dependencies are similar to the promise represented by the SLOs within your SLA. We’ll call these internal SLOs out later in the discussion.
What Are We Measuring?: SLIs
Before you can build your SLOs, you must determine what it is you’re measuring. This will not only help define your objectives, but will also help set a baseline to measure against.
In general, SLIs help quantify the service that will be delivered to the customer — what will eventually become the SLO. These terms will vary depending on the nature of the service, but they tend to be defined in terms of either Quality of Service (QoS) or in terms of Availability.
Defining Availability and QoS
- Availability means that your service is there if the consumer wants it. Either the service is up or it is down. That’s it.
- Quality of Service (QoS) is usually related to the performance of service delivery (measured in latencies)
Availability and QoS tend to work best together. For example, picture a restaurant that’s always open, but has horrible food and service; or one that has great food and service but is only open for one hour, once a week. Neither is optimal. If you don’t balance these carefully in your SLA, you could either expose yourself to unnecessary risk or end up making a promise to your customer that effectively means nothing. The real path to success is in setting a higher standard and meeting it. Now, we’ll get into some common availability measurement strategies.
Traditionally, availability is measured by counting failures. That means the SLI for availability is the percentage of uptime or downtime. While you can use time quantum or transactions to define your SLAs, we’ve found that a combination works best.
Time quantum availability is measured by splitting your assurance window into pieces. If we split a day into minutes (1440), each minute represents a time quantum we could use to measure failure. A time quantum is marked as bad if any failures are detected, and your availability is then measured by dividing the good time quantum by the total time quantum. Simple enough, right?
The downside of this relatively simple approach is that it doesn’t accurately measure failure unless you have an even distribution of transactions throughout the day – and most services do not. You must also ensure that your time quantum is large enough to prevent a single bad transaction from ruining your objective. For example, a 0.001% error rate threshold makes no sense applied to less than 10k requests.
Transaction availability management uses raw transactions to measure availability – calculated by dividing the count of all successful transactions by the count of all attempted transactions over the course of each window. This method:
- Provides a much stronger guarantee for the customer than the time quantum method.
- Helps service providers avoid being penalized for SLA violations caused by short periods of anomalous behavior that affect a tiny fraction of transactions.
However, this method only works if you can measure attempted transactions… which is actually impossible. If data doesn’t show up, how could we know if it was ever sent? We’re not offering the customer much peace of mind if the burden of proof is on them.
So, we combine these approaches by dividing the assurance window into time quantum and counting transactions within each time quantum. We then use the transaction method to define part of our SLO, but we also mark any time quantum where transactions cannot be counted as failed, and incorporate that into our SLO as well. We’re now able to compensate for the inherent weakness of each method.
For example, if we have 144 million transactions per day with a 99.9% uptime SLO, our combined method would give this service an SLO that defines 99.9% uptime something like this:
“The service will be available and process requests for at least 1439 out of 1440 minutes each day. Each minute, at least 99.9% of the attempted transactions will processed. A given minute will be considered unavailable if a system outage prevents the number of attempted transactions during that minute from being measured, unless the system outage is outside of our control.”
Using this example, we would violate this SLO if the system is down for 2 minutes (consecutive or non-consecutive) in a day, or if we fail more than 100 transactions in a minute (assuming 100,000 transactions per minute).
This way you’re covered, even if you don’t have consistent system use throughout the day, or can’t measure attempted transactions. However, your indicators often require more than just crunching numbers.
Remember, some indicators are more than calculations. We’re often too focused on performance criteria instead of user experience.
Looking back to the example from the “What’s the Difference” section, if we can guarantee latency below the liminal threshold for 99% of users, then improving that to 99.9% would obviously be better because it means fewer users are having a bad experience. That’s a better goal than just improving upon an SLI like retrieval speed. If retrieval speed is already 5 ms, would it be better if it were 20% faster? In many cases the end user may not even notice an improvement.
We could gain better insight by analyzing the inverse quantile of our retrieval speed SLI. The 99th quantile for latency just tells us how slow the experience is for the 99th percentile of users. But the inverse quantile tells us what percentage of user experiences meet or exceed our performance goal.
Defining Your Goals: SLOs
Once you’ve decided on an SLI, an SLO is built around it. Generally, SLOs are used to set benchmarks for your goals. However, setting an SLO should be based on what’s cost-effective and mutually beneficial for your service and your customer. There is no universal, industry-standard set of SLOs. It’s a “case-by-case” decision based on data, what your service can provide and what your team can achieve.
That being said, how do you set your SLO? Knowing whether or not your system is up no longer cuts it. Modern customers expect fast service. High latencies will drive people away from your service almost as quickly as your service being unavailable. Therefore it’s highly probable that you won’t meet your SLO if your service isn’t fast enough.
Since “slow” is the new “down,” many speed-related
SLOs are defined using SLIs for service latency.
We track the latencies on our services to assess the success of both our external promises and our internal goals. For your success, be clear and realistic about what you’re agreeing to — and don’t lose sight of the fact that the customer is focused on “what’s in it for me.” You’re not just making promises, you’re showing commitment to your customer’s success.
For example, let’s say you’re guaranteeing that the 99th percentile of requests will be completed with latency of 200 milliseconds or less. You might then go further with your SLO and establish an additional internal goal that 80% of those requests will be completed in 5 milliseconds.
Next, you have to ask the hard question: “What’s the lowest quality and availability I can possibly provide and still provide exceptional service to users?” The spread between this service level and 100% perfect service is your budget for failure. The answer that’s right for you and your service should be based on an analysis of the underlying technical requirements and business objectives of the service.
Base your goals on data. As an industry, we too often select arbitrary SLOs.
There can be big differences between 99%, 99.9%, and 99.99%.
Setting an SLO is about setting the minimum viable service level that will still deliver acceptable quality to the consumer. It’s not necessarily the best you can do, it’s an objective of what you intend to deliver. To position yourself for success, this should always be the minimum viable objective, so that you can more easily accrue error budgets to spend on risk.
Agreeing to Success: The SLA
As you see, defining your objectives and determining the best way to measure against them requires a significant amount of effort. However, well-planned SLIs and SLOs make the SLA process smoother for you and your customer.
While commonly built on SLOs, the SLA is driven by two factors:
the promise of customer satisfaction, and the best service you can deliver.
The key to defining fair and mutually beneficial SLAs (and limiting your liability) is calculating a cost-effective balance between these two needs.
SLAs also tend to be defined by multiple, fixed time frames to balance risks. These time frames are called assurance windows. Generally, these windows will match your billing cycle, because these agreements define your refund policy.
Breaking promises can get expensive when an SLA is in place
– and that’s part of the point – if you don’t deliver, you don’t get paid.
As mentioned earlier, you should give yourself some breathing room by setting the minimum viable service level that will still deliver acceptable quality to the consumer. You’ve probably heard the advice “under-promise and over-deliver.” That’s because exceeding expectations is always better than the alternative. Using a tighter internal SLO than what you’ve committed to gives you a buffer to address issues before they become problems that are visible — and disappointing — to users. So, by “budgeting for failure” and building some margin for error into your objectives, you give yourself a safety net for when you introduce new features, load-test, or otherwise experiment to improve system performance.
Learn, Innovate, and Start Over
Your SLOs should reflect the ways you and your users expect your service to behave. Your SLIs should measure them accurately. And your SLA must make sense for you, your client, and your specific situation. Use all available data to avoid guesswork. Select goals that fit you, your team, your service, and your users. And:
- Identify the SLIs that are relevant to your goals
- Measure your goals precisely with SLOs
- Agree to an SLA based on your defined SLOs
- Use any gained insights to set new goals, improve, and innovate
Knowing how well you’re meeting your goals allows you to budget for the risks inherent to innovation. If you’re in danger of violating an SLA or falling short of your internal SLO, it’s time to take fewer risks. On the other hand, if you’re comfortably exceeding your goals, it’s time to either set more ambitious ones, or to use that extra breathing room to take more the risks. This enables you to deploy new features, innovate, and move faster!
That’s the overview. In part 2, we’ll take a closer look at the math used to set SLOs.
1Although SLO still seems to be the favored term at the time of this writing, the Information Technology Infrastructure Library (ITIL) v3 has deprecated “SLO” and replaced it with Service Level Target (SLT).
2There has been much debate as to whether an SLA is a collection of SLOs or simply an outward-facing SLO. Regardless, it is universally agreed that an SLA is a contract that defines the expected level of service and the consequences for not meeting it.