SREcon 2018 Americas

Getting paged at 11 pm on New Year’s Eve because the application code used sprintf %d on a 32-bit system and your ids just passed 2.147 billion (the signed 32-bit limit), sending the ids negative and crashing your object service. A wakeup call at 2 am (or is it 3 am?) on the ‘spring forward’ Daylight Saving Time transition because your timezone libraries didn’t incorporate one of the several dozen new politically mandated timezone changes. Sweating a four-hour downtime two days in a row due to primary/replica database failover because your kernel RAID driver threw the same unhandled exception both times; your backup primary database server naturally uses the same hardware as the active one, of course.

Circonus was created by its founders because they experienced the pain of reliability engineering on large-scale systems first hand. They needed tools to efficiently diagnose and resolve problems in distributed systems, and they needed to do it at scale. The existing tools at the time (Nagios, Ganglia, etc.) couldn’t cope with the volume of telemetry, nor could they provide the insight into system behavior that was needed. So they set out to develop tools and methods that would fill the void.

The first of these was using histograms to visualize timing data. Existing solutions would give you the average latency, the 95th percentile, the 99th percentile, and maybe a couple of others. This information was useful for one host, but mathematically useless for aggregate system metrics, because percentiles computed on separate hosts cannot be averaged or otherwise combined into a valid percentile for the whole fleet. Capturing latency metrics and storing them as a log-linear histogram allowed users to see the distribution of values over a time window. Moreover, this data could be aggregated across multiple hosts to give a holistic view of the performance of a distributed system or service.
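To make the aggregation point concrete, here is a minimal Python sketch. It is not the Circonus implementation: the `bucket` function, the two-significant-digit binning, and the sample data are all assumptions chosen for illustration. It compares averaging per-host 99th percentiles with merging per-host histograms and reading the percentile from the merged result.

```python
from collections import Counter

def bucket(value_us: int) -> int:
    """Illustrative log-linear bin: keep two significant decimal digits.

    Returns the lower bound of the bin, in microseconds.
    """
    scale = 1
    while value_us >= 100 * scale:
        scale *= 10
    return value_us // scale * scale

def histogram(latencies_us):
    """Count samples per bin."""
    return Counter(bucket(v) for v in latencies_us)

def quantile(hist, q):
    """Lower bound of the bin that contains the q-th quantile."""
    target = q * sum(hist.values())
    seen = 0
    for lower in sorted(hist):
        seen += hist[lower]
        if seen >= target:
            return lower
    raise ValueError("empty histogram")

# Hypothetical data: a busy, fast host and a quiet, slow one.
host_a = histogram([10_000] * 9_000 + [20_000] * 1_000)   # 10,000 requests
host_b = histogram([100_000] * 90 + [2_000_000] * 10)     # 100 requests

# Averaging the per-host p99 values weights both hosts equally,
# even though host_b served only 1% of the traffic.
averaged_p99 = (quantile(host_a, 0.99) + quantile(host_b, 0.99)) / 2

# Histograms merge by adding bin counts, so the fleet-wide p99
# reflects every request exactly once.
merged = host_a + host_b
fleet_p99 = quantile(merged, 0.99)

print(f"average of per-host p99s: {averaged_p99:,.0f} us")
print(f"p99 of merged histogram:  {fleet_p99:,} us")
```

With this made-up data the averaged figure lands around one second, while the merged histogram shows that 99% of all requests finished within the 20 ms bin. The merge itself is just addition of counts, which is why histograms can be combined across any number of hosts.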

However, systems are dynamic and constantly changing. Systems that behave well one second and poorly the next are the norm, not the exception, in today’s ephemeral infrastructures. So we added heatmaps, which are histogram representations over discrete windows of time. Now users could get an overview of the actual performance of their system. If the diagram below were a traditional line graph showing the average latency value, it would be a mostly straight line, hiding the parts where long-tail latencies became unbearable for certain customers. A heatmap gives SREs the power to tell ‘works fine’ in testing apart from ‘this is really slow’ for those outlier large customers (who are generally the ones paying the big bucks).
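The idea of a heatmap as one histogram per time window can be sketched in a few lines of Python. This is an illustrative sketch only: the `heatmap` function, the 60-second window, the fixed 10 ms linear bins, and the sample data are assumptions, not how any particular product stores its data.

```python
from collections import Counter, defaultdict

WINDOW_SECONDS = 60   # assumed width of one heatmap column
BIN_WIDTH_MS = 10     # assumed linear bins, for simplicity

def heatmap(samples):
    """Group (timestamp_s, latency_ms) samples into one histogram per window."""
    columns = defaultdict(Counter)
    for timestamp_s, latency_ms in samples:
        window = timestamp_s // WINDOW_SECONDS * WINDOW_SECONDS
        lower_bound = latency_ms // BIN_WIDTH_MS * BIN_WIDTH_MS
        columns[window][lower_bound] += 1
    return columns

# Hypothetical traffic: the first minute is uniformly fast; in the second
# minute 2% of requests take about half a second.
samples = [(i % 60, 20) for i in range(1000)]
samples += [(60 + i % 60, 20) for i in range(980)]
samples += [(60 + i % 60, 520) for i in range(20)]

columns = heatmap(samples)

means = {}
slowest_bin = {}
for window in sorted(columns):
    hist = columns[window]
    count = sum(hist.values())
    means[window] = sum(lower * n for lower, n in hist.items()) / count
    slowest_bin[window] = max(hist)
    print(f"t={window:>3}s  mean={means[window]:.0f} ms  "
          f"slowest bin={slowest_bin[window]} ms")
```

In this made-up example the mean only moves from 20 ms to 30 ms, a change that is easy to miss on a line graph, while the second column of the heatmap has a clearly populated bin at 520 ms.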

These tools became formative components of standards that had been developing in the SRE community. A few years ago, Brendan Gregg introduced the USE method (Utilization, Saturation, Errors). USE is a set of metrics that are key indicators of host-level health. Following on the heels of USE, Tom Wilkie introduced the RED method (Rate, Errors, Duration). RED is a set of metrics that are indicators of service-level health. Combining the two gives SREs a powerful pair of standard frameworks for quickly identifying bad behavior in both hosts and services.
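As a rough illustration of what RED asks you to collect, the Python sketch below summarizes one window of requests. The `red_summary` function, the tuple layout, and the choice to count HTTP 5xx responses as errors are assumptions made for this example, and the maximum duration stands in for what would more usefully be kept as a full histogram.

```python
def red_summary(requests, window_s):
    """Summarize one window of (duration_ms, http_status) tuples.

    Treating status >= 500 as an error is an assumption for this sketch.
    """
    total = len(requests)
    errors = sum(1 for _, status in requests if status >= 500)
    durations = [duration for duration, _ in requests]
    return {
        "rate_per_s": total / window_s,                      # Rate
        "error_ratio": errors / total if total else 0.0,     # Errors
        "max_duration_ms": max(durations) if durations else None,  # Duration
    }

# Hypothetical one-minute window for a single service.
window = [(12, 200)] * 570 + [(15, 200)] * 24 + [(900, 503)] * 6
summary = red_summary(window, window_s=60)
print(summary)
```

USE metrics would be gathered the same way, but per host resource (CPU, disk, network) rather than per service.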

These types of visualizations display a wealth of information, and as a result they can put heavy demands on the underlying metrics storage layer. A year ago we released IRONdb, the time series database that we developed in C and Lua. This standalone TSDB can now power Grafana-based visualizations, which have become part of the toolset for many SREs. As the complexity of today’s microservice-based architectures grows, and the lifetime of individual components falls, the need for high-volume time series telemetry continues to increase. Here at Circonus we are dedicated to bringing you solutions for the parts of reliability engineering that have caused us pain in the past and that affect all SREs, so that you can focus your efforts on the parts of your business that you know better than anyone else.