  • Crash events are one of the more serious problems that can occur when operating a service. Crashing components often cause cascading failures and service outages. To reveal the magnitude of damage and help prevent future […]

  • Latency SLOs help us quantify the performance of an API endpoint over a period of time. A typical latency SLO reads as follows: The proportion of valid* requests served over the last 4 weeks that […]

  • A common task that users want to perform when monitoring their infrastructure is to identify their top resource consumers. Although the following techniques can be applied to numerous different resource metrics, we will specifically look […]

  • Percentiles are commonly used for measuring statistics, particularly when analyzing things like latency. Unfortunately, people frequently get tripped up when they want to take multiple percentiles and aggregate them. For example, let’s say we are […]

  • In their excellent SLO workshop at SRECon2018 (program), Liz Fong-Jones, Kristina Bennett and Stephen Thorne (Google) presented some best-practice examples for latency SLIs/SLOs. At Circonus we care deeply about measuring latency and SRE techniques such as SLIs/SLOs. […]

  • This was a frequent request we were hearing from many customers: “How can I analyze my data with Python?” The Python Data Science toolchain (Jupyter/NumPy/pandas) offers a wide spectrum of advanced data analytics capabilities. Therefore, […]

  • The Linux kernel is a ubiquitous component of modern IT systems. It provides the critical services of hardware abstraction and time-sharing to applications. The classical metrics for monitoring Linux are among the best known […]

  • There are a lot of interesting monitoring tasks that can be facilitated with a Raspberry Pi (e.g. here, there). Circonus does not officially support “Raspbian Linux on armv6/v7” as a deployment target, but given the […]