Learning from Failures: Better Crash Reporting for Better Incident Response

Crash events are one of the more serious problems that can occur when operating a service. Crashing components often cause cascading failures and service outages. To reveal the magnitude of…

Five Signs Your Monitoring Solution is Failing You

In a recent post, I talked about the strain placed on IT infrastructure by the current surge in demand for online services driven by the COVID-19 pandemic. I…

Monitoring Latency SLOs with Histograms and CAQL

Latency SLOs help us quantify the performance of an API endpoint over a period of time. A typical latency SLO reads as follows: The proportion of valid* requests served over…

Using CAQL to Identify Hosts with Top CPU Usage

A common task that users want to perform when monitoring their infrastructure is to identify their top resource consumers. Although the following techniques can be applied to numerous different resource…

Percentile Aggregation with Histograms and CAQL

Percentiles are commonly used as summary statistics, particularly when analyzing metrics such as latency. Unfortunately, people frequently get tripped up when they try to aggregate multiple percentiles. For…

A Guide to Service Level Objectives, Part 3: Quantifying Your SLOs

A guide to the importance of, and techniques for, accurately quantifying your Service Level Objectives. This is the third in a multi-part series about Service Level Objectives. The first part…
