Learning from Failures: Better Crash Reporting for Better Incident Response

Crash events are one of the more serious problems that can occur when operating a service. Crashing components often cause cascading failures and service outages. To reveal the magnitude of…

Monitoring Latency SLOs with Histograms and CAQL

Latency SLOs help us quantify the performance of an API endpoint over a period of time. A typical latency SLO reads as follows: The proportion of valid* requests served over…

Using CAQL to Identify Hosts with Top CPU Usage

A common task that users want to perform when monitoring their infrastructure is to identify their top resource consumers. Although the following techniques can be applied to numerous different resource…

Percentile Aggregation with Histograms and CAQL

Percentiles are commonly used for measuring statistics, particularly when analyzing things like latency. Unfortunately, people frequently get tripped up when they want to take multiple percentiles and aggregate them. For…

Latency SLOs Done Right

In their excellent SLO-workshop at SRECon2018 (program) Liz Fong-Jones, Kristina Bennett and Stephen Thorne (Google) presented some best practice examples for Latency SLI/SLOs. At Circonus we care deeply about measuring latency and…

Less Toil, More Coil – Telemetry Analysis with Python

This was a frequent request we were hearing from many customers: “How can I analyze my data with Python?” The Python Data Science toolchain (Jupyter/NumPy/pandas) offers a wide spectrum of…