The world of DevOps is a constant push and pull. There’s pressure to deploy faster, yet performance is expected to be perfect all of the time. The founders of Circonus lived these expectations and know the struggle all too well. They felt the pressure to develop new features quickly, but also understood how hard it is to fix something when it breaks. They founded Circonus with these experiences and challenges in mind, building a monitoring and analytics platform that serves as a foundation for implementing data-driven IT operations.
A primary goal of data-driven IT operations is to speed innovation while minimizing risks. There are three essential components to implementing data-driven operations:
- Supporting a DevOps culture
- Implementing and measuring SLOs/SLIs/SLAs
- Creating error-budgets
These components are not independent of each other; they are interrelated, and each affects the others. What all three have in common is that they rely on a centralized monitoring and analytics platform that provides highly precise measurement and metrics data.
In this post, I talk about how organizations can take the next step in professionalizing their IT operations by implementing DevOps, SLOs, and error-budgets, and how infrastructure monitoring analytics is critical to ensuring all three work together to accurately balance innovation and risk.
Supporting a DevOps Culture
The DevOps movement began as a way to bridge the gap between developers and operators. Operations wants to ensure all systems are highly available and operating smoothly. Meanwhile, developers are responsible for continuously deploying new products and features, which can introduce instability in applications and negatively impact system performance. Hence, friction ensues between the two functions.
DevOps helps create a culture where this friction is minimized. But successfully implementing a DevOps culture is a complex task that requires fundamental changes within an organization. Primary among these changes is a new, more advanced way of thinking about infrastructure monitoring.
Two key tenets of DevOps are “automation and tooling” and “measure everything.” Organizations looking to implement DevOps have to provide the team with the right tools to be successful. As a function, DevOps leverages data (lots of it) to guide its decisions, which is why infrastructure monitoring and analytics is one of these essential tools.
But to ensure they have all the data they need and the most precise data possible, DevOps teams require a monitoring and analytics platform that is capable of significantly more than just basic alerting. It must be a platform that automates and unifies organization-wide monitoring, provides highly granular analysis of the health and behavior of all infrastructure systems, and provides the visibility required to align IT performance to key business success indicators.
These capabilities and analytics are foundational to enabling DevOps because they are the basis for creating service level indicators (SLIs), service level objectives (SLOs), service level agreements (SLAs), and error budgets. As we’ll discuss, these tools are critical to enabling DevOps to make accurate, data-driven decisions on what acceptable performance is and when it is or isn’t OK to take risks.
Implementing and Measuring SLIs, SLOs, and SLAs
If your only goal is extreme reliability and performance, you severely limit your ability to introduce changes to production and you’ll lose out on deploying new features. But if you have the ability to “bend” a bit, and if you know with certainty the range of acceptable levels of performance, you can more rapidly innovate in a way that still maintains a level of service that delights your customers.
DevOps helps determine how to properly balance risk and innovation by creating SLOs. SLOs are an agreement on an acceptable level of availability and performance and help minimize confusion and conflicts between IT functions. But before you can build your SLOs, you must figure out what it is you’re measuring. This will not only help define your objectives, but will also help set a baseline to measure against.
There are three terms to understand:
- A Service Level Indicator is what we’ve chosen to measure progress towards our goal, e.g., “Latency of a request.”
- A Service Level Objective is the stated objective of the SLI: what we’re trying to accomplish for either ourselves or the customer, e.g., “99.5% of requests will be completed in 5 ms.” Importantly, it defines both the upper and lower bounds of what’s acceptable performance.
- A Service Level Agreement is a contract explicitly stating the consequences of failing to achieve your defined SLOs, e.g., “If 99% of your system requests aren’t completed in 5 ms, you get a refund.”
SLIs define the things we need to measure in order to know if we are delivering an acceptable level of service to the customer. Once you’ve decided on an SLI, an SLO is built around it. Setting an SLO is about setting the minimum viable service level that will still deliver acceptable quality to the consumer. This is critical. It’s not the best you can do, but rather an objective of what you intend to deliver. It is the foundation for creating error budgets, which we’ll discuss next.
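To make the relationship between an SLI and an SLO concrete, here is a minimal Python sketch based on the latency example above. The names `LatencySLO`, `sli_good_ratio`, and `meets_slo` are hypothetical, the sample latencies are made up for illustration, and treating a request as "good" when it finishes at or under the threshold is an assumption. A real monitoring platform would compute this from recorded metrics rather than an in-memory list.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class LatencySLO:
    """Hypothetical SLO: a target share of requests under a latency threshold."""
    threshold_ms: float  # a request counts as "good" at or below this latency
    target_ratio: float  # fraction of requests that must be good, e.g. 0.995


def sli_good_ratio(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """SLI: fraction of observed requests that completed within the threshold."""
    if not latencies_ms:
        raise ValueError("no requests observed")
    good = sum(1 for value in latencies_ms if value <= threshold_ms)
    return good / len(latencies_ms)


def meets_slo(latencies_ms: Sequence[float], slo: LatencySLO) -> bool:
    """Compare the measured SLI against the objective."""
    return sli_good_ratio(latencies_ms, slo.threshold_ms) >= slo.target_ratio


# Illustrative data only: 996 fast requests and 4 slow ones.
slo = LatencySLO(threshold_ms=5.0, target_ratio=0.995)
sample = [1.2] * 996 + [9.0] * 4
print(sli_good_ratio(sample, slo.threshold_ms), meets_slo(sample, slo))
```

The SLI is only the measurement; the SLO is the threshold and target applied to it. Keeping the two separate makes it easy to tighten or relax the objective without changing what is measured.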
Creating your SLOs depends on highly precise infrastructure performance analytics. All too often, however, organizations select arbitrary SLOs. The differences between 99%, 99.9%, and 99.99% are large: each additional nine cuts the amount of failure you are allowed by a factor of ten. SLOs are supposed to drive business outcomes, but when they’re framed incorrectly, which is a common problem, the result can be suboptimal business decisions that cost time, money, and resources. Having the right monitoring and analytics platform in place – one that will provide the correct math, historical metrics, and the ability to correlate metrics – is critical to calculating your SLOs correctly and avoiding those costly mistakes.
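The gap between those availability targets is easier to see as allowed downtime. The sketch below is plain arithmetic; the function name `allowed_downtime_minutes` and the 30-day window are assumptions chosen for illustration, not something the platform prescribes.

```python
def allowed_downtime_minutes(availability_pct: float, window_days: int = 30) -> float:
    """Minutes of downtime a given availability target permits in the window."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - availability_pct) / 100


# Compare the three targets over an assumed 30-day window.
for target in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% allows about {minutes:.1f} minutes of downtime per 30 days")
```

Over 30 days this works out to roughly 7.2 hours at 99%, about 43 minutes at 99.9%, and just over 4 minutes at 99.99%, which is why picking a target arbitrarily can be an expensive mistake.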
The final step is to define your SLAs with your customers. While commonly built on SLOs, the SLA is driven by two factors: the promise of customer satisfaction, and the best service you can deliver. The key to defining fair and mutually beneficial SLAs (and limiting your liability) is calculating a cost-effective balance between these two needs. SLAs also tend to be defined by multiple, fixed timeframes to balance risks. These timeframes are called assurance windows. Generally, these windows will match your billing cycle, because these agreements define your refund policy. Breaking promises can get expensive when an SLA is in place – and that’s part of the point – if you don’t deliver, you don’t get paid.
Creating Error Budgets
So assuming we want to keep our jobs and get paid, how do we safely introduce changes to production, and moreover how do we know if we could deploy faster or if we should slow down? Enter the error budget.
Because we know the level at which our systems are capable of performing and the level that still provides an acceptable experience to our customers, we can now define an “error budget,” which is essentially the difference between these two metrics. For example, if you have an SLO of 99.5% uptime and actually reach 99.99% in a typical month, consider the delta to be an error budget: time that your team can use to take risks.
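The uptime example above can be worked through as a short calculation. This is a sketch under stated assumptions: the function names and the 30-day window are hypothetical. It shows two numbers side by side: the total unreliability the SLO permits (100% minus the SLO, a commonly used way to state an error budget) and the headroom described in this post (the delta between measured availability and the SLO).

```python
WINDOW_DAYS = 30  # assumed measurement window


def total_budget_minutes(slo_pct: float, window_days: int = WINDOW_DAYS) -> float:
    """All the downtime the SLO permits in the window (100% minus the SLO)."""
    return window_days * 24 * 60 * (100 - slo_pct) / 100


def headroom_minutes(slo_pct: float, measured_pct: float,
                     window_days: int = WINDOW_DAYS) -> float:
    """Delta between measured availability and the SLO, expressed in minutes.

    A negative result means the SLO was missed.
    """
    return window_days * 24 * 60 * (measured_pct - slo_pct) / 100


slo, measured = 99.5, 99.99
print(f"total budget: {total_budget_minutes(slo):.2f} min")
print(f"headroom:     {headroom_minutes(slo, measured):.2f} min")
```

With a 99.5% SLO and 99.99% measured uptime, the SLO permits 216 minutes of downtime in 30 days, and about 211.7 of those minutes are still unspent and available for risk-taking.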
The error budget is a data-driven way to make decisions on balancing risk and innovation. It provides a clear metric on how “unreliable” a service is allowed to be and is based on your SLOs. You are in essence “budgeting for failure” and building some margin of error into your SLOs. This will give you a safety net for when you introduce new features or experiment to improve system performance.
Having an error budget will force you to have metrics in place to know how well you’re meeting goals and if there’s room in the budget for additional risk. If you’re consistently not meeting or getting close to not meeting your SLOs, then it’s time to dial back. Conversely, if you’re exceeding goals, then dial up innovation and deploy more features. Like SLOs, the error budget ensures teams are aligned on when to slow down or speed up, and is therefore another effective tool for DevOps to bridge the gap between developers and operators.
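The "dial back or dial up" rule can be expressed as a small policy function. This is only an illustrative sketch: the `Pace` enum, the `release_pace` function, and the 25% caution margin are assumptions, and each team would choose its own thresholds and measurement window.

```python
from enum import Enum


class Pace(Enum):
    SLOW_DOWN = "slow down"
    SPEED_UP = "speed up"


def release_pace(slo_pct: float, measured_pct: float,
                 caution_margin: float = 0.25) -> Pace:
    """Suggest a release pace from how much of the error budget remains.

    caution_margin is a hypothetical threshold: if less than this fraction of
    the budget is left (or the SLO is already missed), recommend slowing down.
    """
    if not 0 < slo_pct < 100:
        raise ValueError("SLO must be strictly between 0 and 100 percent")
    budget = 100 - slo_pct         # unreliability the SLO permits
    consumed = 100 - measured_pct  # unreliability actually observed
    remaining_fraction = (budget - consumed) / budget
    if remaining_fraction < caution_margin:
        return Pace.SLOW_DOWN      # missing the SLO, or close to missing it
    return Pace.SPEED_UP           # comfortably exceeding the SLO


print(release_pace(99.5, 99.99))  # plenty of budget left
print(release_pace(99.5, 99.6))   # meeting the SLO, but only just
print(release_pace(99.5, 99.4))   # SLO missed
```

Encoding the rule this way gives developers and operators one shared, data-driven answer to "can we ship?" instead of a negotiation each time.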
IT teams that implement data-driven operations by achieving a higher, more sophisticated level of monitoring provide significantly more value to the business. Getting to this point may take some time, but it can be a gradual journey, and ultimately it will save money, improve efficiencies, and provide better customer experiences. The key is ensuring the monitoring platform you choose can provide the precision and granularity of metric data that’s required.
For a real use case on how advanced monitoring and analytics drives data-driven IT operations, watch this presentation from Major League Baseball on how the league used Circonus to automate SLOs and error-budgets.