The Critical Role of the SRE & Error Budgeting

The Site Reliability Engineer (SRE) role was created by Benjamin Treynor at Google in 2003, after he was tasked with ensuring that the company’s websites were available and reliable.

The SRE role is multidisciplinary. It requires the ability to automate monitoring and observability across hundreds or even thousands of complex systems. SREs need to manage against SLOs (Service Level Objectives) and coordinate across multiple groups, including development, DevOps, networking, cloud providers, IT and more. It’s a dual role that demands an eye on both the business objectives and the technical minutiae, along with the ability to write code and deftly handle interpersonal conflict.

In large enterprises, there is often a team of SREs dedicated to ensuring the availability of both internal and external systems. These larger companies require an even greater level of discipline and rigor in their approach. However, SREs should not be expected to maintain 100% availability across the entire infrastructure stack. Setting unrealistic goals creates a culture of failure and does not advance the needs of the business. This is why error budgets are a critical tool for SREs.

Error budgeting is a key component of a successful enterprise monitoring and observability program. An error budget is the amount of unreliability an SLO permits: a 99.9% availability target, for example, leaves a budget of 0.1%. Error budgeting means balancing uptime and reliability against user expectations, and SREs must master it. For internal systems, error budgeting is about managing expectations and coordinating planned downtime with key stakeholders. For external systems, especially those that drive revenue, error budgeting must also take contractual obligations into account and stay in line with the uptime guarantees.
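To make the arithmetic concrete, here is a minimal sketch of an error budget calculation. It assumes a request-based SLO, where the budget is the share of requests allowed to fail over some window. The function name and the sample numbers are hypothetical and not taken from any particular tool.

```python
def error_budget_remaining(slo_target: float, total_requests: int,
                           failed_requests: int) -> float:
    """Return the unspent fraction of the error budget.

    slo_target is a fraction such as 0.999 (assumed to be strictly
    between 0 and 1). A negative result means the budget is overspent.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests == 0:
        return 1.0  # no traffic yet, so nothing has been spent

    # The budget is the number of failures the SLO tolerates.
    allowed_failures = (1 - slo_target) * total_requests
    return 1 - failed_requests / allowed_failures


# Hypothetical window: 1,000,000 requests, 400 failures, 99.9% target.
remaining = error_budget_remaining(0.999, 1_000_000, 400)
print(f"Error budget remaining: {remaining:.0%}")
```

With these sample numbers the SLO tolerates about 1,000 failed requests, so 400 failures leave roughly 60% of the budget unspent. A team might use a figure like this to decide whether a risky release or planned maintenance still fits inside the window.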

But error budgeting can’t be done unless the SRE team has a firm grasp on SLOs. SLOs are the double yellow line on the highway. Managing to SLOs requires the ability to consume massive quantities of telemetry data from every type of source, from hardware to networks to software, whether it’s in the cloud, in the data center, or at the edge.

Sampling this data and building averages does not solve for this volume and, in fact, can cause important signals to be missed or averaged out in the noise. Every data point from every source must be captured, and anomalies must be immediately visible and actionable at the time of capture. Only then can SLOs be guaranteed and error budgeting be attempted.
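A small illustration of how an average can bury a signal, using made-up latency values. The data set and the 1,000 ms alert threshold are assumptions chosen for the example.

```python
import statistics

# Hypothetical latencies in milliseconds: 99 normal requests and one
# 5-second stall.
latencies_ms = [20.0] * 99 + [5000.0]

# The average looks only mildly elevated.
mean_ms = statistics.fmean(latencies_ms)

# Inspecting every data point exposes the stall directly.
worst_ms = max(latencies_ms)
slow_requests = [v for v in latencies_ms if v > 1000.0]  # assumed threshold

print(f"mean: {mean_ms:.1f} ms, worst: {worst_ms:.0f} ms, "
      f"requests over threshold: {len(slow_requests)}")
```

Here the mean stays under 70 ms even though one user waited five seconds. A dashboard built only on averages would likely show nothing alarming, while the full data set makes the outlier obvious.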

Availability   Typical Environment       Unplanned Downtime
                                         Per Year        Per Month
99%            Conventional Server       87.72 hours     7.31 hours
99.9%          Public Cloud              8.77 hours      43.83 minutes
99.99%         Fault Tolerance           52.6 minutes    4.38 minutes
99.999%        Continuous Availability   5.26 minutes    26.30 seconds
The “nines” of availability
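The downtime figures for each level of "nines" follow from simple arithmetic, sketched below. The sketch assumes a 365.25-day year split into 12 equal months. Results should land close to the figures in the table, with small differences depending on the year length assumed.

```python
# Assumption: a 365.25-day year (8,766 hours) divided into 12 equal months.
HOURS_PER_YEAR = 365.25 * 24


def allowed_downtime_hours(availability_pct: float) -> tuple[float, float]:
    """Return (hours per year, hours per month) of permitted downtime."""
    unavailable_fraction = 1 - availability_pct / 100
    per_year = HOURS_PER_YEAR * unavailable_fraction
    return per_year, per_year / 12


for pct in (99.0, 99.9, 99.99, 99.999):
    year_h, month_h = allowed_downtime_hours(pct)
    print(f"{pct}%: {year_h * 60:.2f} min/year, {month_h * 60:.2f} min/month")
```

Each additional nine cuts the permitted downtime by a factor of ten, which is why every step up the table tends to demand a very different level of engineering investment.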

To accomplish this, you need to give SREs the tools to make them successful. In some cases, that means extensive training across multiple technologies and disciplines. But in every case, it means empowering them to collaborate across all departments and to influence decision-making. It also requires the right tools to give them visibility into the entire infrastructure. They need to see the “forest and the leaves” and understand how each leaf affects the entire forest.

Circonus has developed an enterprise-class observability and telemetry solution that can integrate with the entire stack across both on-premises and in-cloud environments, monitoring and alerting on billions of data points per minute. We help SREs do their jobs, maintain their SLOs, implement error budgeting and proactively identify potential issues before they impact internal users and customers.

Circling all the way back to Google and its creation of the role: the company maintains an entire website dedicated to SRE.