Kubernetes is one of the hottest topics in IT right now, but what exactly is it and where did it come from? As DevOps and Infrastructure-as-Code practices arose and took hold in the IT/OPS community […]
The world of DevOps is a constant push and pull. There’s pressure to deploy faster, yet performance is expected to be perfect all of the time. The founders of Circonus lived these expectations and know […]
Crash events are one of the more serious problems that can occur when operating a service. Crashing components often cause cascading failures and service outages. To reveal the magnitude of damage and help prevent future […]
In a recent post I talked about the strain being placed on IT Infrastructure with the current surge in demand for online services being driven by the COVID-19 pandemic. I talked about how this sudden […]
Latency SLOs help us quantify the performance of an API endpoint over a period of time. A typical latency SLO reads as follows: The proportion of valid* requests served over the last 4 weeks that […]
A common task that users want to perform when monitoring their infrastructure is to identify their top resource consumers. Although the following techniques can be applied to numerous different resource metrics, we will specifically look […]
Percentiles are commonly used for measuring statistics, particularly when analyzing things like latency. Unfortunately, people frequently get tripped up when they want to take multiple percentiles and aggregate them. For example, let’s say we are […]
When most people think of job scheduling, they consider all sorts of things. Container orchestration, serverless allocation, batch job running… all of these all qualify. In highly concurrent systems, it is important that you can […]