Faultd Update – Next Generation Alerting

Circonus will soon be releasing our next generation fault detection system, faultd (fault-dee). Faultd is an internal component of our infrastructure that has run alongside our existing fault detection system for several months, with outputs verified for accuracy. Additionally, it is in use by some of our enterprise customers, who have reported no issues with faultd.

Faultd introduces powerful new features that make it easy to manage alerting in ephemeral infrastructures, such as serverless and container-based applications, as well as in large enterprises.

Pattern Based Rulesets

Say I have a few thousand hosts that emit S.M.A.R.T. disk status telemetry, and I want to alert when the seek error rate exceeds a threshold. Previously I would need to create a few thousand rules to alert on this condition for each host. While the Circonus API certainly makes this programmatically feasible, I would also need to create or delete these rules on each host addition or removal.

Now I can create a single pattern based rule using regular expressions to cover swaths of infrastructure for a given metric. I can also harness the power of stream tags to create pattern based rules based on metric metadata. What would have taken operators hours to do in the past can now be done easily in minutes.
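To illustrate the idea, the effect of a single pattern based rule can be sketched in a few lines of Python. The metric names, tag layout, and threshold here are invented for the example and are not actual Circonus API syntax:

```python
import re

# Hypothetical sketch: one pattern-based rule covering every web host's
# S.M.A.R.T. seek error rate metric, instead of one rule per host.
RULE_PATTERN = re.compile(r"smart`seek_error_rate\|ST\[.*\bhost:web\d+\b.*\]")

def covered_metrics(metric_names):
    """Return the metrics that this single pattern-based rule applies to."""
    return [m for m in metric_names if RULE_PATTERN.search(m)]

metrics = [
    "smart`seek_error_rate|ST[dc:ashburn,host:web001]",
    "smart`seek_error_rate|ST[dc:sanjose,host:web002]",
    "smart`seek_error_rate|ST[dc:ashburn,host:db01]",
]
print(covered_metrics(metrics))  # matches the two web hosts; db01 is excluded
```

When a host is added or removed, the rule itself never changes — which is the point.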

Histogram Based Alerting

Traditional alerting has been based on a value exceeding a threshold for a given amount of time. Every monitoring system can do this. And each one of them suffers from the shortcoming of outliers triggering false positive alerts which are infamous for waking up systems operators in the middle of the night for what turns out to be nothing.

Histogram based alerting paves the way for alerts based on percentiles, which are much more robust than alerting on individual values or averages, which can easily be skewed by outliers. It also allows alerting on conditions where Service Level Objectives (SLOs) are exceeded, a capability core to the mission of Site Reliability Engineers (SREs). Alerts based on inverse quantiles are also now possible – “alert me if 20% of my requests in the last five minutes exceeded 500ms”, or “alert me if more than 100 requests in the last 5 minutes exceeded 253ms”.
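As a sketch of the idea (with invented latency samples, not Circonus API code), an inverse quantile check is just a counting exercise over the raw samples:

```python
def fraction_over(samples_ms, threshold_ms):
    """Inverse quantile: what fraction of samples exceeded the threshold?"""
    return sum(1 for s in samples_ms if s > threshold_ms) / len(samples_ms)

# Last five minutes of (invented) request latencies, in milliseconds
latencies = [120, 480, 90, 510, 300, 620, 200, 150, 700, 95]

# "alert me if 20% of my requests in the last five minutes exceeded 500ms"
alert = fraction_over(latencies, 500) > 0.20  # 3 of 10 exceeded, so this fires
```

The advantage of storing full distributions is that the 500ms and 20% figures remain adjustable after the fact, without re-collecting any data.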

Under the Hood

Faultd has been engineered in C with the libmtev application framework, which provides highly concurrent lock-free data structures and safe memory reclamation. This implementation is radically more efficient in memory and CPU than the previous fault detection system, which was written in Java. It also provides more powerful ways to scale out for ridiculously large installations, and supports more sophisticated clustering.

As a result, some window function alerts may show increased accuracy. Enterprise customers will also enjoy greater reliability, as they no longer have to occasionally restart a JVM as part of normal maintenance.

Happy Holidays

Faultd will be going live on December 17th for Circonus hosted customers. While you might not notice anything new that day, that’s intentional as we expect complete continuity of service during the transition. In the coming weeks to months after, we’ll be showcasing the new features provided by faultd here on this blog so that you can put them to work.

How Safe is Your Home’s Air? The Internet of Things and Air Quality Monitoring during Wildfires

IoT-driven monitoring of Air Quality during the 2018 California Wildfires

Over the past few weeks, the Camp Fire in Northern California and the Woolsey Fire in Southern California have devastated people and property. There has been tragic loss of life in the town of Paradise, and California’s firefighters remain tasked, once again, with the difficult job of containing and extinguishing the flames. What no one can contain, though, is the spread of hazardous wildfire smoke.

I was out of town when the smoke arrived, but on return to San Francisco several days later, the smoke in the air was as bad as I feared. Last year I wrote, “2017 was a bad year for fires in California,” which discussed the tools I used to collect and analyze the air quality data. I put those tools to use again in order to understand the air quality both inside and outside of my home.

At SFO the level of smoke in the air was obvious

Air Quality Basics

Air quality is measured with the Air Quality Index, or AQI. The AQI corresponds to the concentration of 2.5 and 10 micron particles in the air, usually referred to as PM2.5 and PM10. The “PM” stands for ‘Particulate Matter’.

Generally speaking, PM2.5 is the key metric to watch during wildfires, as these small particles are produced by combustion. These tiny particles stay in the air for a long time, can deeply penetrate the lungs, and significantly impair lung function. PM10 particles are harmful as well, but their story ends with the lungs – not so with PM2.5, which can pass right into your bloodstream.

Particulate matter sizes. Image credit: EPA.gov

The EPA maintains the website AirNow which tracks AQI throughout the United States, and indicates air quality ranging from ‘Good’ to ‘Hazardous’. Air quality is measured by the EPA (and others) with sensors deployed at locations around the country. In recent years, these sensors have become commoditized, allowing the general public to set up their own. PurpleAir has become a popular provider of these sensors, offering a low-cost version featuring miniature lasers. These sensors upload the air quality data to a map which shows air quality for thousands of locations. If you’ve heard the term ‘Internet of Things’ (or IoT) before, this map shows one way of harnessing this type of information.

Initial Data Collection

I have two PurpleAir sensors which can monitor PM2.5: a PA-I Indoor and a PA-II Outdoor. I’ve had them up and running since last year’s fires, which were in October of 2017. Based on that data, I purchased several HEPA filters to keep the air in my home free of particulate matter.

These air sensors are extremely sensitive; they can detect air quality changes from not just wildfires, but common household appliances. Shortly after I started using the indoor sensor, I discovered that my ultrasonic humidifiers produced large amounts of PM2.5. They also pick up air quality changes when a hairdryer is in use, or something is burning in the oven. So if you decide to monitor your indoor air quality, don’t be surprised if you find it’s worse than you expect from seemingly innocuous daily activity.

My HEPA filters were turned off to save electricity when I was away. I thought that my house was weather-sealed well enough that any particulates would stay outside. I was wrong.

Based on the sensor readings, I could see the current AQI level that I would walk into at home, and it was not healthy. The Air Quality Index outside was 284, solidly in the ‘Very Unhealthy’ band, and the air inside my house — which had been sealed shut since the fire — was AQI 158, designated as ‘Unhealthy.’

Outdoor (284) and Indoor (158) Air Quality Index readings upon returning home. The indoor air was in the ‘Unhealthy’ region. Image source: PurpleAir Map


Clearing the Air

First things first: I turned on my HEPA filters and tracked the indoor AQI until the air inside was brought down to a safe breathing level. It took an hour to bring the AQI down to yellow, and then another 30 minutes or so to get it to green.

One hour of HEPA filters dropped the AQI indoors from Unhealthy to Moderate. Right image source: PurpleAir.com
Air Quality Sensors
When the air quality is poor, the indoor sensor shows a red light. When it’s good, it changes to green. This happens below an AQI of about 50.

Once the indoor air quality dropped to an AQI of less than 50, I turned my attention to analyzing the sensor data. I wanted answers to the following questions:

  1. How effective are HEPA filters at keeping the indoor air clean even when it gets really bad outside?
  2. I have a HEPA filter in my garage too; is that as effective as the ones inside the house?
  3. How bad is the air quality outside? Does it measure up to the anecdotal evidence in the news and on social media?
  4. How does the PurpleAir sensor data compare with the sensor data published by the Bay Area Air Quality Management District?

To answer these questions, I collected both PM2.5 and PM10 data as histograms using Circonus. Circonus is a monitoring system that allows users to collect data directly from IoT devices as well as websites and other technology infrastructure. Circonus can ingest data in most common formats, which makes it capable of integrating with just about any device out there. This allows me to use Circonus’ heatmap visualizations, as well as run statistical analyses on the data.

Let there be Data

My sensors collected 24 hours worth of data, and I displayed both the indoor and outdoor sensor readings on one heatmap, as shown below. The green line is the outdoor sensor, the blue line is the indoor sensor.

Twenty four hours of the worst SF Bay area air quality. Outdoor sensor readings in green, indoor in blue. Additional interesting events called out. Click image to enlarge.

The AQI levels are displayed on the Y axis using a logarithmic scale; the relation between PM2.5 concentration and AQI is not linear, so this approach produces the best visualization.
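For the curious, AQI is derived from PM2.5 concentration by piecewise-linear interpolation over published breakpoint tables. Here is a sketch using the EPA PM2.5 breakpoints in effect at the time (the EPA revises these periodically):

```python
# EPA PM2.5 breakpoints (µg/m³, 24-hour) mapped to AQI bands, per the
# tables in effect in 2018
PM25_BREAKPOINTS = [
    (0.0, 12.0, 0, 50),        # Good
    (12.1, 35.4, 51, 100),     # Moderate
    (35.5, 55.4, 101, 150),    # Unhealthy for Sensitive Groups
    (55.5, 150.4, 151, 200),   # Unhealthy
    (150.5, 250.4, 201, 300),  # Very Unhealthy
    (250.5, 350.4, 301, 400),  # Hazardous
    (350.5, 500.4, 401, 500),  # Hazardous
]

def pm25_to_aqi(conc):
    """Linear interpolation within the bracketing breakpoint band."""
    for c_lo, c_hi, i_lo, i_hi in PM25_BREAKPOINTS:
        if c_lo <= conc <= c_hi:
            return round((i_hi - i_lo) / (c_hi - c_lo) * (conc - c_lo) + i_lo)
    raise ValueError("concentration out of AQI range")

# 287.5 µg/m³ converts to AQI ~338 here, in the ballpark of the ~336
# sensor-derived figure (sensors average over time windows)
print(pm25_to_aqi(287.5))
```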

AQI levels, their descriptions, and the associated colors. Image source: AirNow.gov

With this heatmap visualization, the answers to my questions are clear:

Question 1 – HEPA Filter Effectiveness

My question regarding the effectiveness of my indoor HEPA filters is answered with this heatmap. The blue line (indoor AQI) stays below the yellow ‘AQI Moderate’ level for almost the entire graph. HEPA filters are rated by the amount of square footage that they can cover, and mine are more than sufficient for the square footage of my house. The indoor air quality remains great even when the outdoor AQI goes above the ‘AQI Hazardous’ level (more on that level later). Since my house is older, it is not sealed as well as a newer house or most high-rise apartment buildings, and the filters had to work against bad air leaking in.

Question 2 – Is the air in the garage safe?

I have a HEPA filter in the garage, so I wondered if it was safe to exercise there. The garage has several spots where air can enter, especially along the sides of the rollup garage door. To measure this area, I moved the indoor sensor to the garage for several hours. I added an annotation to the heatmap (marked by the thick blue horizontal line at the top of the graph) to indicate the time when the sensor was in this location. It is clear from the heatmap that the garage air quality varied between orange (AQI Unhealthy for Sensitive Groups) and red (AQI Unhealthy). If you look closely, you’ll see a bump in the middle of the range where I opened the garage door for several minutes. The garage air quality was cleaner than the outside, but nowhere near as good as inside the house.

AQI jumped from Good to Unhealthy (blue line annotation at top) when the indoor sensor was moved to the garage. Click image to enlarge.
Note the slight bump in AQI (yellow line annotation at top) where the garage door was opened. Click image to enlarge.

Question 3 – Just how bad did the air get outside?

In a nutshell: the air outside was record-setting bad. The outdoor sensor recorded a maximum of ~288 micrograms per cubic meter at about 2pm on November 16th, 2018, which corresponds to about 336 on the AQI scale, at least 10% above the ‘AQI Hazardous’ level. I’ve only seen readings like this a couple of times, when I was barbecuing. Based on the data, the air outdoors was at the level of being surrounded by barbecues.

Question 4 – How did my sensor data compare to government AQI maps?

The Bay Area Air Quality Management District (BAAQMD) has high-end sensors spread around the SF Bay Area which report to various government agencies. They reported a maximum AQI value of 271 at about 2pm on November 16th. Why did they report AQI 271, but I measured AQI 336?

The outdoor PurpleAir sensor measured 287.5 micrograms per cubic meter (AQI ~336) on November 16th, 2018 at ~2:30pm. Click image to enlarge.
Bay Area Air Quality Management Readings for November 16th, 2018. Image credit: BAAQMD

BAAQMD has one air sensor in San Francisco, and a number of others around the Bay Area. Meanwhile, there are several dozen PurpleAir sensors in San Francisco, monitored by IoT enthusiasts like myself.

Readings from PurpleAir sensors can vary by +/- 50 AQI, but PurpleAir has stated that their sensors are roughly as accurate as other laser based sensors. Given the hilly topography and microclimates of San Francisco, a possible explanation is that AQI varies enough locally to account for these discrepancies. A PurpleAir sensor in the same location as the BAAQMD sensor would help answer this question.

Data Science is Here for Everyone

Democratization of technology is leading to democratization of data. The Internet of Things has provided hobbyists with the instruments that have long been only available to scientists, but with the added power of crowd-sourcing the data gathered by those instruments. The tools to monitor environmental data are available to any end user with only a few hundred dollars worth of sensors and an internet connection. With smart monitoring, we can ask and answer the right questions of the world around us.


Introducing Circonus Stream Tags

Gain greater insight into system behavior in aggregate, across multiple dimensions.

A “metric” is a measurement, or value, representing the operational state of your system at a given time. For example, the amount of free memory on a web host, or the number of users logged into your site at a given time. Factor in hundreds (or thousands) of hosts, availability zones, hardware batches, or service endpoints, and you’re suddenly dealing with a significant logistical challenge.

What’s the difference in average free memory between API hosts in my Ashburn and San Jose data centers? How about the difference in CPU utilization for the EC2 clusters using 4 r5.xlarge instances instead of 8 r5.larges? How do you stay on top of it all?

“As we grow our number of metrics and/or want to make more sense out of them, we need to be more systematic.” – Metrics 2.0 (metrics20.org)

For time-series data, it’s all about metadata — which may sound familiar if you take pictures with your smartphone. Metadata isn’t the picture you take, but the time of day, location, camera type, settings, and so on — all of which help you organize albums and find your photos more easily. The same principle applies to your operational time-series telemetry, as this additional information gives your metrics more depth. Known in the software industry as metric tags or dimensions, this metadata provides greater insight into your metrics.

Why are metric tags important?

Traditionally in a large enterprise, automated monitoring solutions are a one-time, “set it and forget it” situation based on the defined, up-front monitoring objectives. These solutions are then turned over to the operations team. With static infrastructure, this works, and these checks and metrics tend to remain unchanged for extended periods of time. However, static infrastructure isn’t as common as it used to be – and that’s the problem.

Today’s dynamic, ever-changing infrastructure (including web and application servers) makes it difficult and impractical to apply the same static analysis and operations when monitoring it.

The world of IT has always been a moving target. Enterprise data centers are continuously subject to transformation as new software, hardware, and process components are deployed and updated. Today’s dynamic infrastructure is often a fusion of bare metal, virtualized instances, orchestrated container-based applications, serverless functions, and cloud-deployed services. According to the IEEE, this puts an immense burden on monitoring needs. Not only are there thousands of different parameters to monitor, but the addition and modification of service level objectives (SLOs) may happen continuously.

Stream Tags from Circonus Can Help

Circonus’ metric tag implementation is known as Stream Tags. Our implementation improves infrastructure monitoring by adding metric metadata to help you label and organize your metrics data. Stream Tags provide all of the capabilities of the Metrics 2.0 specification: self-describing, orthogonal tags for every dimension. In addition, Stream Tags offer extra capabilities, as detailed below.

CPU metric using Stream Tags to specify host, role, state, and zone

What Makes Stream Tags Better?

Stream Tags offer the ability to base64 encode tag names and values. The ability to support multi-byte character sets, as well as the full set of ASCII symbols, offers a notable advantage for non-English speakers.

There are, of course, some fun things you can do with this too. We aren’t saying that you should embed emoji within Stream Tags, but should you so desire, that’s certainly possible.




Tag, You’re It

Let’s jump in and take a look at what stream tagged metrics actually look like. Above, we talked about metric tags which represent the amount of free memory on a host. We’ll call this metric meminfo`MemFree. This is a metric that you’ll find associated with any physical or virtual host, but whose average, across many hosts, would produce a broad and un-useful aggregate. Adding a Stream Tag such as “environment” allows us to examine a segment of these metrics, such as “environment:development” or “environment:production”. Adding the environment tag to this metric would give us something like:

meminfo`MemFree|ST[environment:production]

Now, we can use an aggregate such as the average free memory of our production environment to get an idea of how that differs from other environments. Stream Tag names and values are separated by colons, and enclosed within square brackets. The tag name and value section is prefixed by “ST” and separated from the metric name with “|” character.
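The construction is mechanical enough to sketch in code; the helper name here is my own, not part of any Circonus library:

```python
def with_stream_tags(metric, tags):
    """Append Stream Tags to a metric name: <metric name>|ST[tag:value,...]"""
    # Sorting keeps the output stable regardless of dict insertion order
    pairs = ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
    return f"{metric}|ST[{pairs}]"

print(with_stream_tags("meminfo`MemFree", {"environment": "production"}))
# meminfo`MemFree|ST[environment:production]
```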

Let’s take it a bit further. Say I want to know if our new release is going to blow out our cluster because we forgot to call free() in one important spot in the code. We can add a “deploy_pool” tag to the metric, which will now look like:

meminfo`MemFree|ST[environment:production,deploy_pool:blue]

Now we can compare memory usage between blue and green deployment pools using Stream Tags, and determine if the new release has a memory leak that needs to be addressed. This technique can also be used for traditional A/B testing, or canary based feature deployment.

Further segmenting is possible as needed; we could add kernel version, instance type, etc. as Stream Tags to delineate metadata boundaries.

Stream tag showing web server endpoint latency metadata

With Stream Tags, we are also able to apply as many labels and tags as we need to properly organize our metrics. Bear in mind, each unique combination of Stream Tags represents a new time series, which can dramatically increase the amount of data stored. So, it’s important to be careful when using Stream Tags to store dimensions with high cardinality, such as user IDs, email addresses, or other unbounded sets of values.
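That cardinality caution is easy to quantify: the number of possible series for one metric is the product of each tag’s distinct value count. With invented counts:

```python
from math import prod

# Invented per-tag cardinalities for a single metric
tag_values = {"host": 2000, "environment": 3, "deploy_pool": 2}
print(prod(tag_values.values()))  # 12,000 possible series for ONE metric

# An unbounded tag such as user_id multiplies that by every user ever seen
tag_values["user_id"] = 1_000_000
print(prod(tag_values.values()))  # 12,000,000,000 series -- don't do this
```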

Do I Really Need Stream Tags?

If your infrastructure is dynamic, which has become increasingly common: yes.

Stream Tags allow the flexibility needed by today’s infrastructure monitoring. They make it possible to filter, aggregate and organize metrics in multiple dimensions. Analysis and operations can be applied to a set of metrics rather than individual metrics, allowing for service level monitoring in addition to resource monitoring.

Implemented correctly, Circonus’ Stream Tags are critical to monitoring modern infrastructure, because they give admins the power to aggregate metrics at every level.

Where possible and applicable, the metrics collection agent may be able to detect metadata from the source automatically – e.g. a role: tag from Chef, availability-zone: and instance-type: from AWS, labels from Google Cloud, etc. That metadata can then be automatically added as Stream Tags.

How fancy can I make them?

You can use Stream Tags to organize metrics with a set of key characteristics such as role, zone, department, etc. At Circonus, we append those key characteristics to the metric name following this format:

<metric name>|ST[<tag name>:<tag value>, … ]


<tag name> may consist of any number of characters in [`+A-Za-z0-9!@#\$%^&"'\/\?\._-]

<tag value> may consist of any number of characters in [`+A-Za-z0-9!@#\$%^&"'\/\?\._-:=]


The following example shows a CPU metric for wait_io from a webserver with hostname node15, in the us-east1b availability zone.
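Reproduced here as plain text; the exact metric name is my reconstruction from the description, assuming Circonus-style backtick-separated naming:

```
cpu`wait_io|ST[host:node15,zone:us-east1b]
```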

<tag name> and <tag value> may alternatively consist of base64 encoded strings, enclosed in double quotes and prefixed with 'b'


Please note that there must not be any whitespace between tag delimiters.

Search across Stream Tags easily with the new Metrics Explorer

You’re also able to search for matching Stream Tags to filter and aggregate metrics, display them in graphs, set rules for alerts based on the search results, and more. This search can be performed by matching plain strings or base64 encoded strings:


(node:wnode15)                // string match

(b”X19uYW1lX18=”)             // base64 string match

(b/X.+/)                      // base64 <regex> match
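Incidentally, the base64 string in the middle example decodes to an ordinary metric name; Python’s standard library is enough to verify this, and to produce encoded tags of your own:

```python
import base64

# The encoded string from the search example above
print(base64.b64decode("X19uYW1lX18=").decode("utf-8"))  # __name__

# Encoding works the same way in reverse, e.g. for multi-byte tag values
encoded = base64.b64encode("environment".encode("utf-8")).decode("ascii")
print(f'b"{encoded}"')
```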

So What’s the Take-Away?

Stream Tags enable you to manage your dynamic infrastructure, while providing flexibility for future paradigms you’ll need to adopt. You can keep up with the ever-changing monitoring landscape with better transparency, more effective searches and comparisons, faster responses, and automatic metric aggregation. Stream Tags provide all the control you need to manage diverse, high cardinality sets for your operational metrics. They give you the power of the Metrics 2.0 specification, plus the forward looking capabilities provided by base64 capable tag names and values.

That’s what’s in it for you. Overall, this rich functionality will save you time and effort, enable agility to keep up with changing technology, and deliver better results than the conventional alternative. Stream Tags are expected to land on November 19th. We look forward to hearing how this feature helps each of your respective organizations.

The Problem with Percentiles – Aggregation brings Aggravation

Percentiles have become one of the primary service level indicators used to represent the performance of real systems. When used correctly, they provide a robust metric that can serve as the basis of mission-critical service level objectives. However, there’s a reason for the “when used correctly” above.

For all their potential, percentiles do have subtle limitations that are very often overlooked by the people using them in analyses.

There’s no shortage of previous writings on this topic, most notably Baron Schwartz’s “Why Percentiles Don’t Work the Way You Think.” Here, I’ll focus on data, and why you should pay close attention to how percentiles are applied.

Right off the bat, the most misused technique is aggregation of percentiles. You should almost never average percentiles, because even very fundamental aggregation tasks cannot be accomplished correctly with percentile metrics. Because percentiles are cheap to obtain from most telemetry systems and good enough with little effort, it is tempting to treat them as appropriate for aggregation and system-wide performance analysis. While that holds most of the time and for most systems, you lose the ability to determine when your data is lying to you; for example, when you have high (+/- 5% and greater) error rates that are hidden from you.

Those times when your systems are misbehaving the most?
That’s exactly when you don’t have the data to tell you where things are going wrong.

Check the Math

Let’s look at an example* of request latencies of two webservers (W1 blue, W2 red). P95 of the blue server is 220ms, p95 of the red one is 650ms:

What’s the total p95 across both nodes (W1, W2)? (plot generated with matplotlib)

By aggregating the latency distributions of each web server, we find that the total p95 is 230ms. W2 barely served any requests, so adding requests from there did not change the p95 of W1 by much. Now, naive averaging of the percentiles would have given you (220 + 650) / 2 = 870 / 2 = 435ms, which is nearly 200% of the true total percentile (230ms).

So, if you have a Service Level Indicator in this scenario of “95th percentile latency of requests over past 5 minutes < 300ms,” and you averaged p95s instead of calculating from a distribution, you would be led to believe that you have exceeded your objective by ~45%. Folks would be getting paged even though they didn’t need to be, and might conclude that additional servers were needed (when in fact this scenario represents overprovisioning).

Incorrect math can result in tens of thousands of dollars of unneeded capacity,
not to mention the cost of the time of the humans in the loop.

*If you want to play with the numbers yourself to get a feel for how these scenarios can develop, there is a sample calculation with link to an online percentile calculator in the appendix [1]

“Almost Never” Actually Means “Never Ever”

“But above, you said ‘almost never’ when saying that we shouldn’t ever average percentiles?”

That’s 100% correct. (No pun intended.)

You see, there are circumstances where you can average percentiles and get a result that has low errors: namely, when the distributions of your data sources are identical. The most obvious case is when the latencies are from two web servers that are (a) healthy and (b) serving very similar load.

Be aware that this supposition breaks down as soon as either of those conditions is violated! And those are exactly the cases where you are most interested in your monitoring data: when one of your servers starts misbehaving, or you have a load balancing problem.

“But my servers usually have an even distribution of request latencies which are nearly identical, so that doesn’t affect me, right?”

Well, sure, if your web servers have nearly identical latency distributions, go ahead and calculate your total 95th percentile for the system by averaging the percentiles from each server. But when you have one server that decides to run into swap and slow down, you likely won’t notice a problem, since the data indicating it is effectively hidden.

So still, you should never average percentiles; you won’t be able to know when the approach you are taking is hurting you at the worst time.

Averaging percentiles masks problems with nodes that would otherwise be apparent

Percentiles are aggregates, but they should not be aggregated; they should be calculated, not stored. There are a considerable number of operational time series data monitoring systems, both open source and commercial, which will happily store percentiles at 5 minute (or similar) intervals. If you want to look at a year’s worth of data, you will encounter spike erosion: the stored percentiles are averaged down to fit the number of pixels in the graph’s time window, and that averaged data is mathematically wrong.

Example of Spike Erosion: 2 week view on the left shows a max of ~22ms. 24 hour view on the right shows a max of ~70ms.

Clever Hacks

“So, the solution is to store every single sample like you did in the example, right?”

Well, yes and no.

You can store each sample and generate correct percentiles from them, but at any more than a modest scale, this becomes prohibitively expensive. Some open-source time series databases and monitoring systems do this, but you give up either scalability in data ingest or length of data retention. One million 64-bit integer samples per second for a year occupies 229 TB of space. One week of this data is 4 TB; doable with off-the-shelf hardware, but economically impractical for analysis, as well as wasteful.
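The arithmetic behind those figures, reading the “TB” as binary terabytes (TiB):

```python
BYTES_PER_SAMPLE = 8          # one 64-bit integer
SAMPLES_PER_SECOND = 1_000_000
TIB = 2**40

year = BYTES_PER_SAMPLE * SAMPLES_PER_SECOND * 86_400 * 365
week = BYTES_PER_SAMPLE * SAMPLES_PER_SECOND * 86_400 * 7

print(round(year / TIB))  # 229 -- the "229 TB" figure above
print(round(week / TIB))  # 4   -- the "4 TB" figure (more precisely ~4.4)
```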

“Ah, but I’ve thought up a better solution. I can just store the number of requests that are under my desired objective, say 500 milliseconds, and the number of requests that are above, and divide the count under by the total to calculate a correct percentile!”

This is a valid approach, one that I have even implemented with a monitoring system that was not able to store full distributions. However, the limitation is subtle; if after some time I decide that my objective of 500ms was too aggressive and move it to 600ms, all of the historical data that I’ve collected is useless. I have to reset my counters and begin anew.

Store Distributions, Not Percentiles

A better approach than storing percentiles is to store the source sample data in a manner that is more efficient than storing single samples, but still able to produce statistically significant aggregates. The histogram, or distribution, is one such approach.

There are many types of histograms, but here at Circonus we use the log-linear histogram. It provides a mix of storage efficiency and statistical accuracy. Worst-case errors at single digit sample sizes are 5%, quite a bit better than the 200% that we demonstrated above by averaging percentiles.
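A minimal sketch of the log-linear idea: each power of ten is split into 90 linear bins, i.e. values are grouped by their first two significant digits. (Circonus’s actual implementations handle zero, negatives, and overflow; this toy version assumes positive values.)

```python
import math

def log_linear_bin(value):
    """Lower edge and width of the log-linear bin holding `value`.
    Each decade is split into 90 linear bins (two significant digits)."""
    exp = math.floor(math.log10(value))
    width = 10.0 ** (exp - 1)
    # tiny epsilon guards against float artifacts like 7 / 0.1 == 69.999...
    lower = math.floor(value / width * (1 + 1e-12)) * width
    return lower, width

print(log_linear_bin(123.4))  # (120.0, 10.0): all of [120, 130) shares a bin
# Reporting the bin midpoint bounds the worst-case relative error at ~5%:
# half a bin width (5) against the smallest value in that decade (100).
```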

Log Linear histogram view of load balancer request latency. Note the increase in bin size by a factor of 10 at 1.0M (1 million)

Storage efficiency is significantly better than storing individual samples; a year’s worth of 5 minute log linear histogram windows (10 bytes per bin, 300 bins/window) can be stored in ~300MB (sans compression). Reading this amount of data from disk quickly is tractable with most physical (and virtualized) systems in under a second. The mergeability properties of histograms allow precomputed cumulative histograms to be stored for analytically useful windows such as 1 minute and 3 hours. This allows large spans of time series telemetry to be rapidly assembled into sets that are visually relevant to the end user (think one year of data with histogram windows of six hours each).

Using histograms for operational time series data may seem like a challenge at first, but there are a lot of resources out there to help you out. We have published open source libraries of our log linear histograms in C, Golang, and even JavaScript. The Envoy proxy is one project that has implemented the log linear histogram C implementation for operational statistics. The Istio service mesh uses the Golang version of the log linear histogram library via our open source gometrics package to record latency metrics as distributions.


In Conclusion

Percentiles are a staple tool of real systems monitoring, but their limitations should be understood. Because percentiles are provided by nearly every monitoring and observability toolset without limitations on their usage, they can be applied to analyses easily without the operator needing to understand the consequences of *how* they are applied. Understanding the common scenarios where percentiles give the wrong answers is just as important as having an understanding of how they are generated from operational time series data.

If you use percentiles now, you are already capturing data as distributions to some degree through your toolset. Knowing how that toolset generates percentiles from that source telemetry will ensure that you can evaluate if your use of percentiles to answer business questions is mathematically correct.


Suppose I have two servers behind a load balancer, answering web requests. I use some form of telemetry gathering software to ingest the time it takes for each web server to complete serving its respective requests in milliseconds. I get results like this:

Web server 1: [22,90,73,296,55]
Web server 2: [935,72,18,553,267,351,56,28,45,873]

I’ll make the calculations here so easy that you can try them out yourself at https://goodcalculators.com/percentile-calculator/

What is the 90th percentile of requests from web server 1?


  • Step 1. Arrange the data in ascending order: 22, 55, 73, 90, 296
  • Step 2. Compute the position of the pth percentile (index i):
    i = (p / 100) * n, where p = 90 and n = 5
    i = (90 / 100) * 5 = 4.5
  • Step 3. The index i is not an integer, so round up (i = 5) ⇒ the 90th percentile is the value in the 5th position, or 296
  • Answer: the 90th percentile is 296

What is the 90th percentile of requests from web server 2?


  • Step 1. Arrange the data in ascending order: 18, 28, 45, 56, 72, 267, 351, 553, 873, 935
  • Step 2. Compute the position of the pth percentile (index i):
    i = (p / 100) * n, where p = 90 and n = 10
    i = (90 / 100) * 10 = 9
  • Step 3. The index i is an integer ⇒ the 90th percentile is the average of the values in the 9th and 10th positions (873 and 935 respectively)
  • Answer: the 90th percentile is (873 + 935) / 2 = 904

So web server 1 q(0.9) is 296, web server 2 q(0.9) is 904. If our objective is to keep our 90th percentile of requests overall under 500ms, did we meet that objective?

Let’s try averaging these percentiles, which you should never do in practice.

q(0.9) as average of (q(0.9) web server 1, q(0.9) web server 2) = (904 + 296) / 2 = 600ms.

That average suggests we missed our 500ms objective by 20%, a seemingly modest overshoot. But is that the real picture? Let’s calculate the p90 the right way and see what we get.

First we merge all values together, then use the percentile calculator on the merged set.



  • Step 1. Arrange the data in ascending order: 18, 22, 28, 45, 55, 56, 72, 73, 90, 267, 296, 351, 553, 873, 935
  • Step 2. Compute the position of the pth percentile (index i):
    i = (p / 100) * n, where p = 90 and n = 15
    i = (90 / 100) * 15 = 13.5
  • Step 3. The index i is not an integer, so round up (i = 14) ⇒ the 90th percentile is the value in the 14th position, or 873
  • Answer: the 90th percentile is 873

Now our correct 90th percentile is 873 instead of 600: an increase of 273, or 45.5%. Quite a difference, no? If you had averaged percentiles, you might believe your website was delivering responses at a latency acceptable for most users. In reality, 90% of requests fall under a threshold that is almost 50% larger than you thought.
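The whole exercise can be reproduced in a few lines of Python that implement the same nearest-rank method used in the steps above:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile as in the worked steps: i = (p / 100) * n;
    if i is not an integer, round up and take that 1-based position;
    if i is an integer, average positions i and i + 1."""
    data = sorted(samples)
    n = len(data)
    i = (p / 100) * n
    if i != int(i):
        return data[math.ceil(i) - 1]
    return (data[int(i) - 1] + data[int(i)]) / 2

ws1 = [22, 90, 73, 296, 55]
ws2 = [935, 72, 18, 553, 267, 351, 56, 28, 45, 873]

wrong = (percentile(ws1, 90) + percentile(ws2, 90)) / 2  # averaged: 600.0
right = percentile(ws1 + ws2, 90)                        # merged:   873
```

The only correct path is the second one: merge the raw samples (or mergeable histograms of them), then compute the percentile.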

A Guide To Service Level Objectives, Part 2: It All Adds Up

A simple primer on the complicated statistical analysis behind setting your Service Level Objectives.

This is the second in a multi-part series about Service Level Objectives. The first part can be found here.

Statistical analysis is a critical –  but often complicated – component in determining your ideal Service Level Objectives (SLOs). So, a “deep-dive” on the subject requires much more detail than can be explored in a blog post. However, we aim to provide enough information here to give you a basic understanding of the math behind a smart SLO – and why it’s so important that you get it right.

Auditable, measurable data is the cornerstone of setting and meeting your SLOs. As stated in part one, Availability and Quality of Service (QoS) are the indicators that help quantify what you’re delivering to your customers, via time quantum and/or transaction availability. The better data you have, the more accurate the analysis, and the more actionable insight you have to work with.

So yes, it’s complicated. But understanding the importance of the math of SLOs doesn’t have to be.

Functions of SLO Analysis

SLO analysis is based on probability, the likelihood that an event will — or will not — take place. As such, it primarily uses two types of functions: the Probability Density Function (PDF) and the Cumulative Distribution Function (CDF).

Simply put, the analysis behind determining your SLO is driven by the basic concept of probability.

For example, the PDF answers questions like “What is the probability that the next transaction will have a latency of X?” As the integral of the PDF, the CDF answers questions like “What’s the probability that the next transaction will have a latency less than X?” or “What’s the probability that the next transaction will have a latency greater than X?”.

  • Probability Density Function (PDF). Input: any measurement. Output: the probability that a given sample of data will have the input measurement.
  • Cumulative Distribution Function (CDF). Input: any measurement. Output: the probability that X will take a value less than or equal to x.

Percentiles and Quantiles

Before we get further into expressing these functions, let’s have a quick sidebar about percentiles vs. quantiles. Unfortunately, this is a simple concept that has gotten quite complicated.

A percentile is measured on a 0-100 scale, and expressed as a percentage. For example: the “99th percentile” means “as good or better than” 99% of the distribution.

A quantile is the same data, expressed on a 0-1 scale. So as a quantile, that “99th percentile” above would be expressed as “.99.”

That’s basically it. While scientists prefer using percentiles, the only differences from a quantile are a decimal point and a percentage symbol. However, for SLO analysis, the quantile function is important because it is the inverse of the CDF we discussed earlier.
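That inverse relationship is easy to see with an empirical CDF over a set of samples; the quantile function simply runs it in reverse. A sketch (the step-function conventions here are one common choice, not the only one):

```python
import math

def cdf(samples, x):
    """Empirical CDF: the fraction of samples <= x."""
    return sum(1 for v in samples if v <= x) / len(samples)

def quantile(samples, q):
    """Inverse of the empirical CDF: the smallest sample value whose
    CDF is at least q, for 0 < q <= 1."""
    data = sorted(samples)
    return data[max(0, math.ceil(q * len(data)) - 1)]

latencies = [12, 15, 18, 22, 90, 110, 250, 300, 480, 900]
assert quantile(latencies, 0.99) == 900  # q(0.99), the 99th percentile
assert cdf(latencies, 900) == 1.0
```

So q(0.99) answers “what latency are 99% of samples at or under?”, while the CDF answers “what fraction of samples are at or under this latency?”.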

Remember, this is an overview of basic concepts to provide “top-level” understanding of the math behind a smart SLO.
For a deeper dive, check out David N. Blank-Edelman’s book “Seeking SRE.”

The Data Volume Factor

As any analyst will tell you, the sheer volume of data (or lack thereof) can dramatically impact your results, leading to uninformed insight, inaccurate reporting, and poor decisions. So, it’s imperative that you have enough data to support your analysis. For example, low volumes in the time quantum can produce incredibly misleading results if you don’t specify your SLOs well.

With large amounts of data, the error levels in quantile approximations tend to be low; the worst possible case (a single sample per bin, with the sample value at the edge of the bin) can introduce errors of about 5%. In practice, log linear histogram data sets tend to span roughly 300 bins, so sets that contain thousands of data points generally provide sufficient data for accurate statistical analyses.

Inverse quantiles can also come into play. For example, consider an SLO requiring that our 99th percentile request latency complete within 200ms. At low sample volumes, this approach is likely to be meaningless: with only a dozen or so samples, the 99th percentile can be far out of band compared to the median. And the percentile plus time quantum approach doesn’t tell us how many samples exceeded that 200ms quantum.

We can use inverse percentiles to define an SLO that says we want 80 percent of our requests to be faster than that 200ms quantum. Or alternatively, we can set our SLO as a fixed number of requests within the time quantum; say “I want less than 100 requests to exceed my 200ms time quantum over a span of 10 minutes.”

The actual implementations can vary, so it is incumbent upon the implementer to choose one which suits their business needs appropriately.
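Either flavor of inverse-quantile SLO reduces to counting samples against a threshold. A sketch of both, with illustrative function names and the thresholds from the examples above:

```python
def inverse_quantile(latencies_ms, threshold_ms):
    """Fraction of samples at or under the threshold."""
    return sum(1 for v in latencies_ms if v <= threshold_ms) / len(latencies_ms)

# SLO flavor 1: 80% of requests faster than the 200ms quantum.
def pct_slo_met(latencies_ms):
    return inverse_quantile(latencies_ms, 200.0) >= 0.80

# SLO flavor 2: fewer than 100 requests over 200ms within the window.
def count_slo_met(latencies_ms):
    return sum(1 for v in latencies_ms if v > 200.0) < 100

window = [150] * 90 + [250] * 10   # 90% of this window is under 200ms
```

Note that the percentage flavor scales with traffic, while the fixed-count flavor bounds the absolute number of poor experiences; which one suits you is exactly the business decision described above.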

Defining Formulas and Analysis

Based on the information you’re trying to get, and your sample set, the next step is determining the right formulas or functions for analysis. For SLO-related data, most practitioners implement open-source histogram libraries. There are many implementations out there, ranging from log-linear, to t-digest, to fixed bin. These libraries often provide functions to execute quantile calculations, inverse calculations, bin count, and other mathematical implementations needed for statistical data analysis.

Some analysts use approximate histograms, such as t-digest. However, those implementations often exhibit double digit error rates near median values. With any histogram-based implementation, there will always be some level of error, but implementations such as log linear can generally minimize that error to well under 1%, particularly with large numbers of samples.

Common Distributions in SLO Analysis

Once you’ve begun analysis, there are several different mathematical models you will use to describe the distribution of your measurement samples, or at least how you expect them to be distributed.

Normal distributions: The common “bell-curve” distribution often used to describe random variables whose distribution is not known.

Normal distribution histogram and curve fit

Gamma distributions: A two-parameter family of continuous probability distributions, important for using the PDF and CDF.

Gamma distribution histogram and curve fit

Pareto distributions: Most of the samples are concentrated near one end of the distribution. Often useful for describing how system resources are utilized.

Pareto distribution histogram and curve fit

In real life, our networks, systems, and computers are all complex entities, and you will probably almost never see something that perfectly fits any of these distributions. You may have spent a lot of time discussing normal distributions in Statistics 101, but you will probably never come across one as an SRE.

While you may often see distributions that resemble the Gamma or Pareto model, it’s highly unusual to see a distribution that’s a perfect fit.

Instead, most of your sample distributions will be a composition of different models, which is completely normal and expected. A single-mode latency distribution is often well represented by a Gamma distribution, but it is exceptionally rare to see a single latency distribution in practice. What we actually see are multiple latency distributions all “jammed together”, which results in multi-modal distributions.

That could be the result of a few different common code paths (each with a different distribution), a few different types of clients each with a different usage pattern or network connection… Or both. So most of the latency distributions we’ll see in practice are actually a handful (and sometimes a bucket full) of different gamma-like distributions stacked atop each other. The point being, don’t worry too much about any specific model – it’s the actual data that’s important.

Histograms in SLO Analysis

A histogram is a representation of the distribution of a continuous variable, in which the entire range of values is divided into a series of intervals (or “bins”) and the representation displays how many values fall into each bin.

Log Linear Bimodal Histogram

If for any reason your range of values is on the low end, this is where a data volume issue (as we mentioned above) could rear its ugly head and distort your results.

However, histograms are ideal for SLO analysis, or any high-frequency, high-volume data, because they allow us to store the complete distribution of data at scale. A histogram can be described with between 3 and 10 bytes per bin, depending on the varbit encoding of the bin counts, and compression reduces that further. That is an efficient approach to storing a large number of bounded sample values. So instead of storing a handful of quantiles, we can store the complete distribution of data and calculate arbitrary quantiles and inverse quantiles on demand, as well as apply more advanced modeling techniques.
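To make the bin structure concrete, here is a toy log linear binning function. It keeps two significant digits per bin, so bin width jumps by a factor of 10 at each power of ten; this mirrors the idea behind our open source libraries, not their exact encoding:

```python
import math
from collections import Counter

def bin_lower_edge(v):
    """Lower edge of the log linear bin containing v. Bins keep two
    significant digits, so their width grows 10x at each power of ten."""
    if v == 0:
        return 0.0
    exp = math.floor(math.log10(abs(v)))
    width = 10.0 ** (exp - 1)
    return math.floor(v / width) * width

samples = [98, 101, 105, 253, 999, 1000, 1049]
hist = Counter(bin_lower_edge(v) for v in samples)
# 101 and 105 land in the same bin; so do 1000 and 1049.
```

Storing only `hist` (edges and counts) rather than `samples` is what makes year-scale retention of full distributions practical.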

We’ll dig deeper into histograms in part 3.


In summary, analysis plays a critical role in setting your Service Level Objectives, because raw data is just that — raw and unrefined. To put yourself in a good position when setting SLOs, you must:

  • Know the data you’re analyzing. Choose data structures that are appropriate for your samples, ones that provide the needed precision and robustness for analysis, and be knowledgeable of the expected cardinality and distribution of your data set.
  • Understand how you’re analyzing the data and reporting your results. Ensure your analyses are mathematically correct. Realize if your data fits known distributions, and the implications that arise from that.
  • Set realistic expectations for results. Your outputs are only as good as the data you provide as inputs. Aggregates are excellent tools but it is important to understand their limitations.
  • And always be sure that you have enough data to support the analysis. A 99th percentile calculated with a dozen samples will likely vary significantly from one with hundreds of samples. Outliers can exert great influence over aggregates on small sets of data, but larger data sets are robust and not as susceptible.

With each of those pieces in place, you’ll gain the insight you need to make the smartest decision possible.

That concludes the basic overview of SLO analysis. As mentioned above, part 3 will focus, in more detail, on how to use histograms in SLO analysis.

A Guide To Service Level Objectives, Part 1: SLOs & You

Four steps to ensure that you hit your targets – and learn from your successes.

This is the first in a multi-part series about Service Level Objectives. The second part can be found here.

Whether you’re just getting started with DevOps or you’re a seasoned pro, goals are critical to your growth and success. They indicate an endpoint, describe a purpose, or more simply, define success. But how do you ensure you’re on the right track to achieve your goals?

You can’t succeed at your goals without first identifying them – AND answering “What does success look like?”

Your goals are more than high-level mission statements or an inspiring vision for your company. They must be quantified, measured, and reconciled, so you can compare the end result with the desired result.

For example, to promote system reliability we use Service Level Indicators (SLIs), set Service Level Objectives (SLOs1), and create Service Level Agreements (SLAs) to clarify goals and ensure that we’re on the same page as our customers. Below, we’ll define each of these terms and explain their relationships with each other, to help you identify, measure, and meet your goals.

Whether you’re a Site Reliability Engineer (SRE), developer, or executive, as a service provider you have a vested interest in (or responsibility for) ensuring system reliability. However, “system reliability” in and of itself can be a vague and subjective term that depends on the specific needs of the enterprise. So, SLOs are necessary because they define your Quality of Service (QoS) and reliability goals in concrete, measurable, objective terms.

But how do you determine fair and appropriate measures of success, and define these goals? We’ll look at four steps to get you there:

  1. Identify relevant SLIs
  2. Measure success with SLOs
  3. Agree to an SLA based on your defined SLOs
  4. Use gained insights to restart the process

Before we jump into the four steps, let’s make sure we’re on the same page by defining SLIs, SLOs, and SLAs.

So, What’s the Difference?

For the purposes of our discussion, let’s quickly differentiate between an SLI, an SLO, and an SLA. For example, if your broad goal is for your system to “…run faster,” then:

  • A Service Level Indicator is what we’ve chosen to measure progress towards our goal. E.g., “Latency of a request.”
  • A Service Level Objective is the stated objective of the SLI – what we’re trying to accomplish for either ourselves or the customer. E.g., “99.5% of requests will be completed in 5ms.”
  • A Service Level Agreement, generally speaking2, is a contract explicitly stating the consequences of failing to achieve your defined SLOs. E.g., “If 99% of your system requests aren’t completed in 5ms, you get a refund.”

Although most SLOs are defined in terms of what you provide to your customer, as a service provider you should also have separate internal SLOs that are defined between components within your architecture. For example, your storage system is relied upon by other components in your architecture for availability and performance, and these dependencies are similar to the promise represented by the SLOs within your SLA. We’ll call these internal SLOs out later in the discussion.

What Are We Measuring?: SLIs

Before you can build your SLOs, you must determine what it is you’re measuring. This will not only help define your objectives, but will also help set a baseline to measure against.

In general, SLIs help quantify the service that will be delivered to the customer — what will eventually become the SLO. These terms will vary depending on the nature of the service, but they tend to be defined in terms of either Quality of Service (QoS) or in terms of Availability.

Defining Availability and QoS

  • Availability means that your service is there if the consumer wants it. Either the service is up or it is down. That’s it.
  • Quality of Service (QoS) is usually related to the performance of service delivery (measured in latencies)

Availability and QoS tend to work best together. For example, picture a restaurant that’s always open, but has horrible food and service; or one that has great food and service but is only open for one hour, once a week. Neither is optimal. If you don’t balance these carefully in your SLA, you could either expose yourself to unnecessary risk or end up making a promise to your customer that effectively means nothing. The real path to success is in setting a higher standard and meeting it. Now, we’ll get into some common availability measurement strategies.

Traditionally, availability is measured by counting failures. That means the SLI for availability is the percentage of uptime or downtime. While you can use time quantum or transactions to define your SLAs, we’ve found that a combination works best.

Time quantum availability is measured by splitting your assurance window into pieces. If we split a day into minutes (1440), each minute represents a time quantum we could use to measure failure. A time quantum is marked as bad if any failures are detected, and your availability is then measured by dividing the good time quantum by the total time quantum. Simple enough, right?

The downside of this relatively simple approach is that it doesn’t accurately measure failure unless you have an even distribution of transactions throughout the day, and most services do not. You must also ensure that your time quantum is large enough to prevent a single bad transaction from ruining your objective. For example, a 0.001% error rate threshold makes no sense applied to fewer than 10k requests.

Transaction availability uses raw transactions to measure availability: the count of all successful transactions divided by the count of all attempted transactions over the course of each window. This method:

  • Provides a much stronger guarantee for the customer than the time quantum method.
  • Helps service providers avoid being penalized for SLA violations caused by short periods of anomalous behavior that affect a tiny fraction of transactions.

However, this method only works if you can measure attempted transactions… which is actually impossible. If data doesn’t show up, how could we know if it was ever sent? We’re not offering the customer much peace of mind if the burden of proof is on them.

So, we combine these approaches by dividing the assurance window into time quantum and counting transactions within each time quantum. We then use the transaction method to define part of our SLO, but we also mark any time quantum where transactions cannot be counted as failed, and incorporate that into our SLO as well. We’re now able to compensate for the inherent weakness of each method.

For example, if we have 144 million transactions per day with a 99.9% uptime SLO, our combined method would give this service an SLO that defines 99.9% uptime something like this:

“The service will be available and process requests for at least 1439 out of 1440 minutes each day. Each minute, at least 99.9% of the attempted transactions will be processed. A given minute will be considered unavailable if a system outage prevents the number of attempted transactions during that minute from being measured, unless the system outage is outside of our control.”

Using this example, we would violate this SLO if the system is down for 2 minutes (consecutive or non-consecutive) in a day, or if we fail more than 100 transactions in a minute (assuming 100,000 transactions per minute).
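The combined check can be expressed as a short function. This is a sketch of the logic described above with illustrative parameter names; a minute whose attempted transactions could not be measured is passed as None:

```python
def day_meets_slo(minutes, txn_target=0.999, max_bad_minutes=1):
    """minutes: one (attempted, succeeded) pair per minute of the day,
    or None when attempted transactions could not be measured."""
    bad = 0
    for m in minutes:
        if m is None:          # unmeasurable minute counts as failed
            bad += 1
            continue
        attempted, succeeded = m
        if attempted and succeeded / attempted < txn_target:
            bad += 1
    return bad <= max_bad_minutes

good_day = [(100_000, 99_950)] * 1440          # 0.05% failures per minute
bad_day = [(100_000, 99_800)] * 2 + [(100_000, 100_000)] * 1438
```

Here `good_day` passes (every minute clears 99.9%), while `bad_day` fails because two separate minutes each drop below the per-minute target.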

This way you’re covered, even if you don’t have consistent system use throughout the day, or can’t measure attempted transactions. However, your indicators often require more than just crunching numbers.

Remember, some indicators are more than calculations. We’re often too focused on performance criteria instead of user experience.

Looking back to the example from the “What’s the Difference” section, if we can guarantee latency below the liminal threshold for 99% of users, then improving that to 99.9% would obviously be better because it means fewer users are having a bad experience. That’s a better goal than just improving upon an SLI like retrieval speed. If retrieval speed is already 5 ms, would it be better if it were 20% faster? In many cases the end user may not even notice an improvement.

We could gain better insight by analyzing the inverse quantile of our retrieval speed SLI. The 99th percentile of latency just tells us how slow the experience is for users at the 99th percentile. But the inverse quantile tells us what percentage of user experiences meet or exceed our performance goal.

This example SLI graph shows the inverse quantile calculation of request latency, where our SLO specifies completion within 500 milliseconds. We’ll explore how this is derived and used in a later post.

Defining Your Goals: SLOs

Once you’ve decided on an SLI, an SLO is built around it. Generally, SLOs are used to set benchmarks for your goals. However, setting an SLO should be based on what’s cost-effective and mutually beneficial for your service and your customer. There is no universal, industry-standard set of SLOs. It’s a “case-by-case” decision based on data, what your service can provide and what your team can achieve.

That being said, how do you set your SLO? Knowing whether or not your system is up no longer cuts it. Modern customers expect fast service. High latencies will drive people away from your service almost as quickly as your service being unavailable. Therefore it’s highly probable that you won’t meet your SLO if your service isn’t fast enough.

Since “slow” is the new “down,” many speed-related
SLOs are defined using SLIs for service latency.

We track the latencies on our services to assess the success of both our external promises and our internal goals. For your success, be clear and realistic about what you’re agreeing to — and don’t lose sight of the fact that the customer is focused on “what’s in it for me.” You’re not just making promises, you’re showing commitment to your customer’s success.

For example, let’s say you’re guaranteeing that the 99th percentile of requests will be completed with latency of 200 milliseconds or less. You might then go further with your SLO and establish an additional internal goal that 80% of those requests will be completed in 5 milliseconds.

Next, you have to ask the hard question: “What’s the lowest quality and availability I can possibly provide and still provide exceptional service to users?” The spread between this service level and 100% perfect service is your budget for failure. The answer that’s right for you and your service should be based on an analysis of the underlying technical requirements and business objectives of the service.

Base your goals on data. As an industry, we too often select arbitrary SLOs.
There can be big differences between 99%, 99.9%, and 99.99%.

Setting an SLO is about setting the minimum viable service level that will still deliver acceptable quality to the consumer. It’s not necessarily the best you can do, it’s an objective of what you intend to deliver. To position yourself for success, this should always be the minimum viable objective, so that you can more easily accrue error budgets to spend on risk.

Agreeing to Success: The SLA

As you see, defining your objectives and determining the best way to measure against them requires a significant amount of effort. However, well-planned SLIs and SLOs make the SLA process smoother for you and your customer.

While commonly built on SLOs, the SLA is driven by two factors:
the promise of customer satisfaction, and the best service you can deliver.

The key to defining fair and mutually beneficial SLAs (and limiting your liability) is calculating a cost-effective balance between these two needs.

SLAs also tend to be defined by multiple, fixed time frames to balance risks. These time frames are called assurance windows. Generally, these windows will match your billing cycle, because these agreements define your refund policy.

Breaking promises can get expensive when an SLA is in place
– and that’s part of the point – if you don’t deliver, you don’t get paid.

As mentioned earlier, you should give yourself some breathing room by setting the minimum viable service level that will still deliver acceptable quality to the consumer. You’ve probably heard the advice “under-promise and over-deliver.” That’s because exceeding expectations is always better than the alternative. Using a tighter internal SLO than what you’ve committed to gives you a buffer to address issues before they become problems that are visible — and disappointing — to users. So, by “budgeting for failure” and building some margin for error into your objectives, you give yourself a safety net for when you introduce new features, load-test, or otherwise experiment to improve system performance.

Learn, Innovate, and Start Over

Your SLOs should reflect the ways you and your users expect your service to behave. Your SLIs should measure them accurately. And your SLA must make sense for you, your client, and your specific situation. Use all available data to avoid guesswork. Select goals that fit you, your team, your service, and your users. And:

  • Identify the SLIs that are relevant to your goals
  • Measure your goals precisely with SLOs
  • Agree to an SLA based on your defined SLOs
  • Use any gained insights to set new goals, improve, and innovate

Knowing how well you’re meeting your goals allows you to budget for the risks inherent to innovation. If you’re in danger of violating an SLA or falling short of your internal SLO, it’s time to take fewer risks. On the other hand, if you’re comfortably exceeding your goals, it’s time to either set more ambitious ones, or use that extra breathing room to take more risks. This enables you to deploy new features, innovate, and move faster!

That’s the overview. In part 2, we’ll take a closer look at the math used to set SLOs.

1Although SLO still seems to be the favored term at the time of this writing, the Information Technology Infrastructure Library (ITIL) v3 has deprecated “SLO” and replaced it with Service Level Target (SLT).
2There has been much debate as to whether an SLA is a collection of SLOs or simply an outward-facing SLO. Regardless, it is universally agreed that an SLA is a contract that defines the expected level of service and the consequences for not meeting it.

Air Quality Sensors and IoT Systems Monitoring

2017 was a bad year for fires in California. The Tubbs Fire in Sonoma County in October destroyed whole neighborhoods and sent toxic smoke south through most of the San Francisco Bay Area. The Air Quality Index (AQI) for parts of that area went up past the unhealthy level (101–150) to the hazardous level (301–500) at certain points during the fire. Once word got out that N99 dust masks were needed to keep the harmful particles out of the lungs, they became a common sight.

The EPA maintains the AirNow website, which displays the AQI for the entire US. The weather app on my Pixel phone conveniently displays a summary of the local air quality from the EPA’s source. This was my go-to source for information about outdoor air quality once the fires started. However, I started to notice that the observed air quality often didn’t match what the app reported. I realized that the data in the app was often delayed by an hour or more, and the local air quality could change much more quickly than the AirNow resource reported.

Pixel Weather App air quality

I started to look into how often the data was updated, and where the sensors that collected it were located. Unfortunately, I wasn’t able to find a lot of details. However, I did come across a link to PurpleAir.com while browsing a local news site. PurpleAir reports on the same air quality metrics as the EPA source, but uses sensors hosted by individuals. They have a network of over a thousand sensors across the planet, and the sensor density in the SF Bay Area is quite good, as can be seen from their map. Best of all, the data was reported in real time: one hour averages, twenty-four hour averages, particle counts, and so on. This made it easy to check the air quality reported in real time from a sensor close by, letting us know when it was okay to go out and when things had gotten bad in our area.

PurpleAir.com map

I considered obtaining one of these sensors for myself at the time, but as the Tubbs fire faded, I stopped checking the site as often. However, shortly after the Thomas fire in December, I decided to purchase a PurpleAir PA-II sensor. The sensor was easy to set up. I connected it to my WiFi network and gave it a name, location, and some other metadata. The configuration also allowed me to set a custom URL (KEY3) to PUT sensor data to, and a custom HTTP header (KEY4).


Circonus allows you to post data to it as a JSON object using an HTTPTrap endpoint. I didn’t know what format the PA-II would use to send the data, but I thought this was a good guess. So I created an HTTPTrap, grabbed the data submission URL, and put it into the sensor configuration. About thirty seconds later, metrics started flowing into Circonus. PurpleAir shared a helpful document that described each of these data points.

HTTPTrap configuration
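For the curious, submitting a reading by hand looks roughly like this. The trap URL below is a placeholder for the submission URL copied from the check configuration, and the field names are illustrative rather than the PA-II’s exact payload:

```python
import json
import urllib.request

# Placeholder URL: substitute the data submission URL from your HTTPTrap
# check. Field names below are illustrative, not the sensor's exact schema.
TRAP_URL = "https://example.circonus.com/module/httptrap/CHECK-UUID/secret"

reading = {"pm2_5": 9.4, "pm10": 12.1, "temp_f": 68, "humidity": 41}
req = urllib.request.Request(
    TRAP_URL,
    data=json.dumps(reading).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="PUT",
)
# urllib.request.urlopen(req) would actually send it; omitted here.
```

The sensor does exactly this on its own schedule, which is why metrics simply started flowing once the URL was configured.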

I wanted to create my own dashboard, but first I needed to understand the data. It turns out that AQI is calculated from the concentration of 2.5 micron and 10 micron particles in micrograms per cubic meter. I wasn’t able to find an equation that allows AQI calculation from these concentrations; it appears that AQI is linearly interpolated between different particulate concentration ranges.

In addition to PM2.5 and PM10, the PA-II provided a number of other particle measurements from its dual laser sensors; in particular, particles per deciliter for particles ranging from 0.3 to 10 microns. It seems that the small particles under 1 micron are exhaled, whereas the 1–10 micron particles are the ones that become lodged in the lungs. The PA-II sensor also provides other metrics such as humidity, barometric pressure, dewpoint, temperature, RSSI signal strength to the WiFi access point, and free heap memory. I put together a dashboard to track these metrics.

Temperature, Humidity, Air Pressure, Dewpoint
PM2.5 (2.5 micron) and PM10, micrograms per cubic meter. Log 20 scale. AQI levels added as horizontal lines.
PM1 micrograms per cubic meter, and 0.3/0.5/1.0/2.5 micron particles per deciliter
Free heap memory, WiFi signal strength, and 5.0/10.0 micron particles per deciliter

Now that I had a dashboard up and running, I could keep a close watch on local air quality, for my neighborhood specifically, in addition to some simple weather measurements. This was quite useful, but I wanted to know when the air quality started to go bad. So I created a rule to send an alert whenever the PM2.5 concentration went over 12 µg/m³, crossing from Good to Moderate.

Circonus rule

It took a bit of digging, but I was able to find the AQI breakpoints, which correlate air quality index values to PM2.5 µg/m³. The relation between AQI and particle concentration isn't linear across the categories, so I couldn't apply a single formula to do the conversion directly. Instead, I added threshold lines to the graphs for each AQI category, and I was easily able to set an alert on the particulate concentration boundary for each threshold.

AQI breakpoints

If the air quality changed, I got a text message. I created several rules with varying levels of severity, so that I could get an idea of how fast the air quality was changing.

At this point I had a pretty good setup; if the air got bad, I got a text message. The air sensor itself was pretty sensitive to local air quality fluctuations; if I fired up the meat smoker, I’d get an alert. Overall the system was fairly stable, but I did run into some issues where data wasn’t sent to the HTTPTrap at certain times. As a former WiFi firmware engineer, I decided to use tcpdump to look at the traffic directly. To do this, I had to get a host between the sensor and the internet, so I shared my iMac internet connection over WiFi and connected the air sensor to it. The air sensor has a basic web interface that you can use to specify the WiFi connection, and also get a real time readout of the laser air sensors.

Once I had the sensor bridged through my iMac, I was able to take a look at the network traffic. The sensor used HTTP GET requests to update the PurpleAir map and, as I had specified in the configuration interface, PUT requests to the Circonus HTTPTrap. Oddly enough, things worked just fine while requests were being routed through the iMac. I came to the conclusion that the Airport Extreme the sensor was normally associated with might somehow be interfering with the PUT requests at the TCP level. This is something I need to put more energy into at some point; these types of network level issues can be tricky to debug.

me@myhost ~ $ sudo tcpdump -AvvvXX -i bridge100 dst host or src host

10:58:11.781392 IP (tos 0x0, ttl 128, id 3301, offset 0, flags [none], proto TCP (6), length 1498) > Flags [P.], cksum 0xf46e (correct), seq 1:1459, ack 1, win 5840, length 1458: HTTP, length: 1458
PUT /module/httptrap/xxxxxxxx-adf9-40ec-bcc8-yyyyy3194/xx HTTP/1.1
Host: trap.noit.circonus.net
X-PurpleAir: 1
Content-Type: application/json
User-Agent: PurpleAir/2.50i
Content-Length: 1231
Connection: close

0x0000: ca2a 14f1 e064 5ccf 7f4b f8c4 0800 4500 .*…d\..K….E.
0x0010: 05da 0ce5 0000 8006 fbfe c0a8 0202 23c9 …………..#.
0x0020: 45c7 05bb 0050 0054 d1fa 9ab3 12ce 5018 E….P.T……P.
0x0030: 16d0 f46e 0000 5055 5420 2f6d 6f64 756c …n..PUT./modul
0x0040: 652f 6874 7470 7472 6170 2f31 6430 3131 e/httptrap/1d011
0x0050: 6339 332d 6164 6639 2d34 3065 632d 6263 xxx-xxx–40ec-bc
0x0060: 6338 2d35 3866 6634 3036 3733 3139 342f c8–xxxxxxxxxxx/
0x0070: 6d79 7333 6372 3374 2048 5454 502f 312e xxxxxxxx.HTTP/1.
0x0080: 310d 0a48 6f73 743a 2074 7261 702e 6e6f 1..Host:.trap.no
0x0090: 6974 2e63 6972 636f 6e75 732e 6e65 740d it.circonus.net.
0x00a0: 0a58 2d50 7572 706c 6541 6972 3a20 310d .X-PurpleAir:.1.
0x00b0: 0a43 6f6e 7465 6e74 2d54 7970 653a 2061 .Content-Type:.a
0x00c0: 7070 6c69 6361 7469 6f6e 2f6a 736f 6e0d pplication/json.
0x00d0: 0a55 7365 722d 4167 656e 743a 2050 7572 .User-Agent:.Pur
0x00e0: 706c 6541 6972 2f32 2e35 3069 0d0a 436f pleAir/2.50i..Co
0x00f0: 6e74 656e 742d 4c65 6e67 7468 3a20 3132 ntent-Length:.12
0x0100: 3331 0d0a 436f 6e6e 6563 7469 6f6e 3a20 31..Connection:.
0x0110: 636c 6f73 650d 0a0d 0a7b 2253 656e 736f close….{“Senso
0x0120: 7249 6422 3a22 3563 3a63 663a 3766 3a34 rId”:”5c:cf:7f:4
0x0130: 623a 6638 3a63 3422 2c22 4461 7465 5469 b:f8:c4",”DateTi

Overall, I'm pleased with the result. There are a couple more things I want to try with the sensor, such as putting a second order derivative alert on the air pressure metric to tell when a low pressure system is moving in. The folks at PurpleAir.com were kind and helpful in responding to my questions. I'm looking forward to trying out other sensors that I can plug into a monitoring system. Amazon has a CO2 sensor, so that might be next on my list.

Posted in IoT

Comprehensive Container-Based Service Monitoring with Kubernetes and Istio

Operating containerized infrastructure brings with it a new set of challenges. You need to instrument your containers, evaluate your API endpoint performance, and identify bad actors within your infrastructure. The Istio service mesh enables instrumentation of APIs without code changes and provides service latencies for free. But how do you make sense of all that data? With math, that's how.

Circonus built the first third party adapter for Istio. In a previous post, we talked about that first Istio community adapter and how to use it to monitor Istio based services. This post expands on that: we'll explain how to get a comprehensive understanding of your Kubernetes infrastructure, and how an Istio service mesh fits into your container based infrastructure.

Istio Overview

Istio is a service mesh for Kubernetes, which means that it takes care of all of the intercommunication and facilitation between services, much like network routing software does for TCP/IP traffic. In addition to Kubernetes, Istio can also interact with Docker and Consul based services. It’s similar to LinkerD, which has been around for a while.

Istio is an open source project developed by teams from Google, IBM, Cisco, and Lyft (the creators of Envoy). The project recently turned one year old, and Istio has found its way into a couple of production environments at scale. At the time of this post, the current version is 0.8.

So, how does Istio fit into the Kubernetes ecosystem? Kubernetes acts as the data plane and Istio acts as the control plane. Kubernetes carries the application traffic, handling container orchestration, deployment, and scaling. Istio routes the application traffic, handling policy enforcement, traffic management and load balancing. It also handles telemetry syndication such as metrics, logs, and tracing. Istio is the crossing guard and reporting piece of the container based infrastructure.

Istio service mesh architecture

The diagram above shows the service mesh architecture. Istio uses an Envoy sidecar proxy for each service. Envoy proxies inbound requests to the Istio Mixer service via a gRPC call. Mixer then applies rules for traffic management and syndicates request telemetry. Mixer is the brains of Istio. Operators can write YAML files that specify how Envoy should redirect traffic and what telemetry to push to monitoring and observability systems. Rules can be applied as needed at run time without restarting any Istio components.

Istio supports a number of adapters to send data to a variety of monitoring tools, including Prometheus, Circonus, and StatsD. You can also enable Zipkin and Jaeger tracing, and generate graphs to visualize the services involved.

Istio is easy to deploy. Way back when, around 7 to 8 months ago, you had to install Istio onto a Kubernetes cluster with a series of kubectl commands. And you still can today. But now you can just hop into Google Cloud platform, and deploy an Istio enabled Kubernetes cluster with a few clicks, including monitoring, tracing, and a sample application. You can get up and running very quickly, and then use the istioctl command to start having fun.

Another benefit is that we can gather data from services without requiring developers to instrument their services to provide that data. This has a multitude of benefits: it reduces maintenance, removes points of failure in the code, and provides vendor agnostic interfaces, which reduces the chance of vendor lock-in.

With Istio, we can deploy different versions of individual services and weight the traffic between them. Istio itself makes use of a number of different pods to operate itself, as shown here:

> kubectl get pods -n istio-system
NAME                     READY STATUS   RESTARTS AGE
istio-ca-797dfb66c5      1/1   Running  0        2m
istio-ingress-84f75844c4 1/1   Running  0        2m
istio-egress-29a16321d3  1/1   Running  0        2m
istio-mixer-9bf85fc68    3/3   Running  0        2m
istio-pilot-575679c565   2/2   Running  0        2m
grafana-182346ba12       2/2   Running  0        2m
prometheus-837521fe34    2/2   Running  0        2m

Istio is not exactly lightweight. The power and flexibility of Istio come at the cost of some operational overhead. However, if your application consists of more than a few microservices, your application containers will soon outnumber the Istio system containers, amortizing that overhead.

Service Level Objectives

This brief overview of Service Level Objectives will set the stage for how we should measure our service health. The concept of Service Level Agreements (SLAs) has been around for at least a decade. But just recently, the amount of online content related to Service Level Objectives (SLOs) and Service Level Indicators (SLIs) has been increasing rapidly.

In addition to the well-known Google SRE book, two new books that talk about SLOs are being published soon. The Site Reliability Workbook has a dedicated chapter on SLOs, and Seeking SRE has a chapter on defining SLO goals by Circonus founder and CEO Theo Schlossnagle. We also recommend watching the YouTube video “SLIs, SLOs, SLAs, oh my!” from Seth Vargo and Liz Fong-Jones to get an in-depth understanding of the differences between SLIs, SLOs, and SLAs.

To summarize: SLIs drive SLOs, which inform SLAs.

A Service Level Indicator (SLI) is a metric-derived measure of health for a service. For example, I could have an SLI that says that the 95th percentile latency of homepage requests over the last 5 minutes should be less than 300 milliseconds.

A Service Level Objective (SLO) is a goal or target for an SLI. We take an SLI, and extend its scope to quantify how we expect our service to perform over a strategic time interval. Using the SLI from the previous example, we could say that we want to meet the criteria set by that SLI for 99.9% of a trailing year window.
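
The arithmetic behind such an objective is worth making concrete: a 99.9% target over a trailing year leaves only a small error budget. A quick sketch:

```python
# For a 99.9% objective over a trailing 365-day window, how much time may
# the SLI criterion go unmet before the SLO is blown?
window_minutes = 365 * 24 * 60          # 525,600 minutes in the window
allowed_bad = window_minutes * (1 - 0.999)
print(round(allowed_bad))               # 526 minutes, roughly 8.8 hours
```

That budget is what makes the "strategic time interval" strategic: the tighter the objective, the fewer bad minutes you can afford.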

A Service Level Agreement (SLA) is an agreement between a business and a customer, defining the consequences for failing to meet an SLO. Generally, the SLOs which your SLA is based upon will be more relaxed than your internal SLOs, because we want our internal facing targets to be more strict than our external facing targets.

RED Dashboard

What combinations of SLIs are best for quantifying both host and service health? Over the past several years, there have been a number of emerging standards. The top standards are the USE method, the RED method, and the “four golden signals” discussed in the Google SRE book.

Brendan Gregg introduced the USE method, which seeks to quantify the health of a system host based on utilization, saturation, and error metrics. For something like a CPU, we can use the common utilization metrics for user, system, and idle percentages, and we can use load average and run queue depth for saturation. The Linux perf profiler is a good tool for measuring CPU error events.

Tom Wilkie introduced the RED method a few years ago. With RED, we monitor request rate, request errors, and request duration. The Google SRE book talks about using latency, traffic, errors, and saturation metrics. These “four golden signals” are targeted at service health; they are similar to the RED method, but extend it with saturation. In practice, it can be difficult to quantify service saturation.

So, how are we monitoring the containers? Containers are short lived entities. Monitoring them directly to discern our service health presents a number of complex problems, such as the high cardinality issue. It is easier and more effective to monitor the service outputs of those containers in aggregate. We don’t care if one container is misbehaving if the service is healthy. Chances are that our orchestration framework will reap that container anyway and replace it with a new one.

Let’s consider how best to integrate SLIs from Istio as part of a RED dashboard. To compose our RED dashboard, let’s look at what telemetry is provided by Istio:

  • Request Count by Response Code
  • Request Duration
  • Request Size
  • Response Size
  • Connection Received Bytes
  • Connection Sent Bytes
  • Connection Duration
  • Template Based MetaData (Metric Tags)

Istio provides several metrics about the requests it receives, the latency to generate a response, and connection level data. Note the first two items from the list above; we’ll want to include them in our RED dashboard.

Istio also gives us the ability to add metric tags, which it calls dimensions. So we can break down the telemetry by host, cluster, etc. We can get the rate in requests per second by taking the first derivative of the request count. We can get the error rate by taking the derivative of the request count of unsuccessful requests. Istio also provides us with the request latency of each request, so we can record how long each service request took to complete.
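
As a quick sketch of that rate calculation, the first derivative of a cumulative request counter sampled at intervals gives requests per second (the sample values below are invented for illustration):

```python
def rate_per_second(samples):
    """First derivative of a cumulative counter: count delta / time delta.
    samples is a list of (unix_ts, cumulative_count) pairs."""
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        rates.append((c1 - c0) / (t1 - t0))
    return rates

# A 10-second scrape interval over a cumulative request count:
samples = [(0, 100), (10, 400), (20, 900)]
print(rate_per_second(samples))  # [30.0, 50.0]
```

The error rate is the same calculation applied to the counter of unsuccessful requests only.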

In addition, Istio provides us with a Grafana dashboard out of the box that contains the pieces we want:

Grafana dashboard

The components we want are circled in red in the screenshot above. We have the request rate in operations per second in the upper left, the number of failed requests per second in the upper right, and a graph of response time in the bottom. There are several other indicators on this graph, but let’s take a closer look at the ones we’ve circled:

Dashboard Component

The above screenshot shows the rate component of the dashboard. This is pretty straightforward: we count the number of requests that returned a 200 response code and graph the rate over time.

Dashboard Component - Errors

The Istio dashboard does something similar for responses that return a 5xx error code. In the above screenshot, you can see how it breaks down the errors by either the ingress controller, or by errors from the application product page itself.

Request duration graph

This screenshot shows the request duration graph. This graph is the most informative about the health of our service. This data is provided by a Prometheus monitoring system, so we see request time percentiles graphed here, including the median, 90th, 95th, and 99th percentiles.

These percentiles give us some overall indication of how the service is performing. However, there are a number of deficiencies with this approach that are worth examining. During times of low activity, these percentiles can skew wildly because of limited numbers of samples. This could mislead you about the system performance in those situations. Let’s take a look at the other issues that can arise with this approach:

Duration Problems:

  • The percentiles are aggregated metrics over fixed time windows.
  • The percentiles cannot be re-aggregated for cluster health.
  • The percentiles cannot be averaged (and this is a common mistake).
  • This method stores aggregates as outputs, not inputs.
  • It is difficult to measure cluster SLIs with this method.

Percentiles often provide deeper insight than averages as they express the range of values with multiple data points instead of just one. But like averages, percentiles are an aggregated metric. They are calculated over a fixed time window for a fixed data set. If we calculate a duration percentile for one cluster member, we can not merge that with another one to get an aggregate performance metric for the whole cluster.

It is a common misconception that percentiles can be averaged; they cannot, except in rare cases where the distributions that generated them are nearly identical. And if you only have the percentile, not the source data, you cannot know whether that is the case. It is a chicken and egg problem.
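
A small numeric example makes the fallacy concrete. Here two cluster members with different latency profiles produce an averaged percentile nowhere near the true cluster percentile (the sample values and the naive percentile function are invented for illustration):

```python
def p90(values):
    """Naive 90th percentile: the value at or below which 90% of samples fall."""
    s = sorted(values)
    return s[int(0.9 * len(s)) - 1]

# Two cluster members with very different latency profiles (ms):
fast_node = [10] * 90 + [20] * 10    # p90 = 10
slow_node = [10] * 50 + [500] * 50   # p90 = 500

avg_of_p90s = (p90(fast_node) + p90(slow_node)) / 2  # 255.0 -- meaningless
true_cluster_p90 = p90(fast_node + slow_node)        # 500 -- from merged samples
print(avg_of_p90s, true_cluster_p90)                 # 255.0 500
```

The averaged value (255 ms) describes neither node and badly understates the merged 90th percentile (500 ms); only merging the underlying samples, or histograms of them, gives the right answer.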

This also means that if you are measuring percentile based performance only for individual cluster members, the lack of mergeability prevents you from setting service level indicators for the service as a whole.

Our ability to set meaningful SLIs (and as a result, meaningful SLOs) is limited here, due to only having four latency data points over fixed time windows. So when you are working with percentile based duration metrics, you have to ask yourself if your SLIs are really good SLIs. We can do better by using math to determine the SLIs that we need to give us a comprehensive view of our service’s performance and health.

Histogram Telemetry

Latency data for a service in microseconds using a histogram

Above is a visualization of latency data for a service, in microseconds, using a histogram. The number of samples is on the Y axis, and the sample value, in this case microsecond latency, is on the X axis. This is the open source histogram we developed at Circonus. (See the open source implementations in both C and Go, or read more about histograms here.) There are a few other open source histogram implementations out there, such as Ted Dunning's t-digest and the HDR histogram.

The Envoy project recently adopted the C implementation of Circonus's log linear histogram library. This allows Envoy data to be collected as distributions. They found a very minor bug in the implementation, which Circonus was quite happy to fix. That's the beauty of open source: the more eyes on the code, the better it gets over time.

Histograms are mergeable. Any two or more histograms can be merged as long as the bin boundaries are the same. That means that we can take this distribution and combine it with other distributions. Mergeable metrics are great for monitoring and observability. They allow us to combine outputs from similar sources, such as service members, and get aggregate service metrics.

As indicated in the image above, this log linear histogram contains 90 bins for each power of 10. You can see 90 bins between 100,000 and 1M. At each power of 10, the bin size increases by a factor of 10. This allows us to record a wide range of values with high relative accuracy without needing to know the data distribution ahead of time. Let’s see what this looks like when we overlay some percentiles:
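The bin selection can be sketched in a few lines of Python. This is not the circllhist implementation itself (which also handles zero and negative values and uses a compact bin encoding); it is just the two-significant-digit idea that yields 90 bins per power of ten:

```python
import math

def llhist_bin(value):
    """Return the log linear bin (lower edge) for a positive value: keep two
    significant digits, i.e. 90 bins (10-99 x 10^exp) per power of ten."""
    exp = math.floor(math.log10(value)) - 1
    mantissa = int(value / 10 ** exp)  # an integer in 10..99
    return mantissa * 10 ** exp

print(llhist_bin(123456))  # 120000
print(llhist_bin(987))     # 980
```

Every sample in [120000, 130000) lands in the same bin, so the relative error of any recorded value is bounded at a few percent regardless of magnitude.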

The Average

Now you can see where we have the average, the 50th percentile (also known as the median), and the 90th percentile. The 90th percentile is the value below which 90% of the samples fall.

Consider an SLI like the example from earlier. With latency data displayed in this format, we can easily calculate such an SLI for a service by merging histograms together to get a 5 minute view of the data, and then calculating the 90th percentile value for that distribution. If it is less than 1,000 milliseconds, we met our target.

The RED dashboard duration graph from our screenshot above has four percentiles: the 50th, 90th, 95th, and 99th. We could overlay those percentiles on this distribution as well. From just those four values we could guess at the rough shape of the request distribution, but that would be making a lot of assumptions. To see just how misleading assumptions based on a few percentiles can be, let's look at a distribution with additional modes:

This histogram shows a distribution with two distinct modes. The leftmost mode could be fast responses due to serving from a cache, and the right mode from serving from disk. Just measuring latency using four percentiles would make it nearly impossible to discern a distribution like this. This gives us a sense of the complexity that percentiles can mask. Consider a distribution with more than two modes:

This distribution has at least four visible modes. If we do the math on the full distribution, we will find 20+ modes here. How many percentiles would you need to record to approximate a latency distribution like the one above? What about a distribution like the one below?

Complex systems composed of many services will generate latency distributions that cannot be accurately represented by a few percentiles. You have to record the entire latency distribution to fully represent it. This is one reason it is preferable to store the complete distribution of the data in a histogram and calculate percentiles as needed, rather than storing just a few percentiles.

This type of histogram visualization shows a distribution over a fixed time window. We can store multiple distributions to get a sense of how it changes over time, as shown below:

This is a heatmap, which represents a set of histograms over time. Imagine each column in this heatmap as a separate bar chart viewed from above, with color indicating the height of each bin. This is a Grafana visualization of the response latency from a cluster of 10 load balancers. It gives us deep insight into the behavior of the entire cluster over a week; there are over 1 million data samples here. The median centers around 500 microseconds, shown in the red colored bands.

Above is another type of heatmap. Here, saturation is used to indicate the “height” of each bin (darker tiles are more “full”). This time we've also overlaid percentile calculations over time on top of the heatmap. Percentiles are robust and very useful metrics, but not by themselves. We can see here how the 90+ percentiles increase as the latency distribution shifts upwards.

Let’s take these distribution based duration maps and see if we can generate something more informative than the sample Istio dashboard:

The above screenshot is a RED dashboard revised to show distribution based latency data. In the lower left, we show a histogram of latencies over a fixed time window. To the right of it, we use a heat map to break that distribution down into smaller time windows. With this layout of RED dashboard, we can get a complete view of how our service is behaving with only a few panels of information. This particular dashboard was implemented using Grafana served from an IRONdb time series database which stores the latency data natively as log linear histograms.

We can extend this RED dashboard a bit further and overlay our SLIs onto the graphs as well:

For the rate panel, our SLI might be to maintain a minimum level of requests per second. For the error panel, our SLI might be to stay under a certain number of errors per second. And for duration, as previously examined, we might want the 99th percentile for our entire service, which is composed of several pods, to stay under a certain latency over a fixed window. Using Istio telemetry stored as histograms enables us to set these meaningful, service wide SLIs. Now we have a lot more to work with, and we're better able to interrogate our data (see below).

Asking the Right Questions

So now that we've put the pieces together and seen how to use Istio to get meaningful data from our services, let's see what kinds of questions we can answer with it.

We all love being able to solve technical problems, but not everyone has that same focus. The folks on the business side want to know how the business is doing, and you need to be able to answer their business-centric questions. Let's take the toolset we've assembled and align its capabilities with a couple of questions the business might ask its SREs:

Example Questions:

  • How many users got angry on the Tuesday slowdown after the big marketing promotion?
  • Are we over-provisioned or under-provisioned on our purchasing checkout service?

Consider the first example. Everyone has been through a big slowdown. Let’s say Marketing did a big push, traffic went up, performance speed went down, and users complained that the site got slow. How can we quantify the extent of how slow it was for everyone? How many users got angry? Let’s say that Marketing wants to know this so that they can send out a 10% discount email to the users affected and also because they want to avoid a recurrence of the same problem. Let’s craft an SLI and assume that users noticed the slowdown and got angry if requests took more than 500 milliseconds. How can we calculate how many users got angry with this SLI of 500 ms?

First, we need to already be recording the request latencies as a distribution. Then we can plot them as a heatmap. We can use the distribution data to calculate the percentage of requests that exceeded our 500ms SLI by using inverse percentiles. We take that answer, multiply it by the total number of requests in that time window, and integrate over time. Then we can plot the result overlaid on the heatmap:

In this screenshot, we’ve circled the part of the heatmap where the slowdown occurred. The increased latency distribution is fairly indicative of a slowdown. The line on the graph indicates the total number of requests affected over time.

In this example, we managed to miss our SLI for 4 million requests. Whoops. What isn’t obvious are the two additional slowdowns on the right because they are smaller in magnitude. Each of those cost us an additional 2 million SLI violations. Ouch.
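
The inverse percentile arithmetic above can be sketched in a few lines. Here it operates on raw samples for clarity (the sample values are invented; in practice the calculation runs against the stored histogram bins):

```python
def sli_violations(latencies_ms, threshold_ms=500):
    """Inverse percentile: the fraction of samples above the threshold,
    plus the count of violating requests in the window."""
    over = sum(1 for v in latencies_ms if v > threshold_ms)
    fraction = over / len(latencies_ms)
    return fraction, over

# One time window of request latencies (ms):
window = [120, 340, 510, 620, 95, 480, 900, 210, 505, 330]
frac, count = sli_violations(window)
print(frac, count)  # 0.4 4
```

Summing `count` across consecutive windows is the "integrate over time" step that produced the 4 million figure.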

We can do these kinds of mathematical analyses because we are storing data as distributions, not aggregations like percentiles.

Let’s consider another common question. Is my service under provisioned, or over provisioned?

The answer is often “it depends.” Loads vary based on the time of day and the day of week, in addition to varying because of special events. That’s before we even consider how the system behaves under load. Let’s put some math to work and use latency bands to visualize how our system can perform:

The visualization above shows the latency distribution broken down into latency bands over time. The bands show the number of requests that took under 25ms, 25-100ms, 100-250ms, 250-1000ms, and over 1000ms. The colors run from green for fast requests to red for slow requests.

What does this visualization tell us? It shows that requests to our service started off very quickly, then the percentage of fast requests dropped off after a few minutes, and the percentage of slow requests increased after about 10 minutes. This pattern repeated itself for two traffic sessions. What does that tell us about provisioning? It suggests that initially the service was over provisioned, but then became under provisioned over the course of 10-20 minutes. Sounds like a good candidate for auto-scaling.
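
Bucketing raw latency samples into such bands is simple. A sketch, with band boundaries matching the visualization (the sample values are invented):

```python
def latency_bands(latencies_ms):
    """Bucket request latencies into the bands used in the visualization."""
    bands = {"<25ms": 0, "25-100ms": 0, "100-250ms": 0, "250-1000ms": 0, ">1000ms": 0}
    for v in latencies_ms:
        if v < 25:
            bands["<25ms"] += 1
        elif v < 100:
            bands["25-100ms"] += 1
        elif v < 250:
            bands["100-250ms"] += 1
        elif v <= 1000:
            bands["250-1000ms"] += 1
        else:
            bands[">1000ms"] += 1
    return bands

print(latency_bands([10, 30, 120, 600, 1500, 20]))
# {'<25ms': 2, '25-100ms': 1, '100-250ms': 1, '250-1000ms': 1, '>1000ms': 1}
```

With histogram-stored data, the same band counts fall directly out of the bin boundaries, no raw samples required.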

We can also add this type of visualization to our RED dashboard. This type of data is excellent for business stakeholders. And it doesn’t require a lot of technical knowledge investment to understand the impact on the business.


Conclusion

We should monitor services, not containers. Services are long lived entities; containers are not. Your users don't care how your containers are performing, they care about how your services are performing.

You should record distributions instead of aggregates, and then generate your aggregates from those distributions. Aggregates are a very valuable source of information, but they are unmergeable, which makes them poorly suited to further statistical analysis.

Istio gives you a lot of this for free, and you don't have to instrument your code to get it. You don't need to build a high quality application monitoring framework from scratch.

Use math to ask and answer questions about your services that are important to the business. That’s what this is all about, right? When we can make systems reliable by answering questions that the business values, we achieve the goals of the organization.

Cassandra Query Observability with Libpcap and Protocol Observer

Opinions vary in recent online discussions regarding systems and software observability. Some state that observability is a replacement for monitoring, others that the two are parallel mechanisms, or that one is a subset of the other (not to mention where tracing fits into such a hierarchy). Monitoring Weekly recently provided a helpful list of resources for an overview of this discussion, as well as some practical applications of observability. Without attempting to put forth yet another opinion, let's take a step back and ask: what is observability?

What is Observability

Software observability wasn't invented yesterday; it has a history going back at least 10 years, to the birth of DTrace. One can go back even further, to the Apollo 11 source code, to see some of the earliest implementations of software observability, and of systems monitoring from mission control. If we ask Wikipedia what observability is, we get an answer along the lines of: a measure of how well the internal states of a system can be inferred from knowledge of its external outputs.

To understand this distinction between monitoring and observability, consider what a doctor does with a stethoscope. Anyone who has watched one of the many TV medical dramas, or been in a doctor’s office themselves, should be familiar with this. One uses the external outputs of the stethoscope to determine the internal state of the patient. The patient is monitored with instruments such as the stethoscope, and it is the observable traits of these instruments that allow this monitoring to occur. The doctor observes the instruments in order to monitor the patient, because this is often more effective than observing the patient directly, which might not even be possible for some of a patient’s internal states.

Observing Wirelatency and Monitoring Cassandra

So now that we have a baseline understanding of these terms, let's dive into a practical use case we've implemented here at Circonus. Our patient will be the Apache Cassandra wide column distributed data store. Our stethoscope will be the wirelatency tool, which uses the libpcap library to grab a copy of packets off the wire before they are processed by the operating system. We are going to infer how healthy Cassandra's internal state is by observing its external outputs: the query latency data.

Let’s take a quick overview of libpcap. Pcap allows one to get a copy of packets off the ethernet interface at the link layer prior to their being handled by the kernel networking code. The details of the implementation vary between operating systems, but packets essentially bypass the kernel network stack and are made available to user space. Linux uses PF_PACKET sockets to accomplish this. These are often referred to as raw sockets. BSD based systems use the Berkeley Packet Filter. For more detail, refer to the publication “Introduction to RAW-sockets” (Heuschkel, Hofmann, Hollstein, & Kuepper, 2017).

Now that we can grab packets off the wire in a computationally efficient manner, we can reassemble bidirectional TCP streams, decode application specific protocols, and collect telemetry. Enter wirelatency, a utility developed in Go. The gopacket library handles the reassembly of bidirectional data from the packets provided by pcap. Wirelatency provides a TCPProtocolInterpreter interface, which allows us to define protocol specific functions in modules for HTTP, PostgreSQL, Cassandra CQL, and Kafka application service calls. The circonus-gometrics library allows us to send that telemetry upstream to Circonus for analysis.

It’s time to do some observability. First, let’s use `tcpdump` to take a look at some actual Cassandra CQL network traffic. I used Google’s Compute Engine to set up a Cassandra 3.4 server and client node. Initially, I used an Ubuntu 17.04 host and installed Cassandra 3.11, but the server failed to start up and threw an exception. Fortunately, this was a documented issue. I could have downgraded the JDK or recompiled Cassandra from source (and probably would have done so 10 years ago), but I decided to take the easy route and lit up two new hosts running RHEL. Getting Cassandra up and running was a cakewalk by comparison, so I used the excellent DataStax documentation to get a simple schema up and insert some data. At this point, I was able to grab some network traffic.

[fredmoyer@instance-1 wirelatency]$ sudo tcpdump -AvvvX dst port 9042 or src port 9042
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
06:55:32.586018 IP (tos 0x0, ttl 64, id 60628, offset 0, flags [DF], proto TCP (6), length 109)
instance-2.c.deft-reflection-188505.internal.38428 > instance-1.c.deft-reflection-188505.internal.9042: Flags [
P.], cksum 0xacad (correct), seq 2362125213:2362125270, ack 3947549198, win 325, options [nop,nop,TS val 435889631
ecr 435892523], length 57
0x0000: 4500 006d ecd4 4000 4006 3896 0a8e 0003 E..m..@.@.8.....
0x0010: 0a8e 0002 961c 2352 8ccb 2b9d eb4a d20e ......#R..+..J..
0x0020: 8018 0145 acad 0000 0101 080a 19fb 25df ...E..........%.
0x0030: 19fb 312b 0400 000b 0700 0000 3000 0000 ..1+........0...
0x0040: 1b73 656c 6563 7420 2a20 6672 6f6d 2063 .select.*.from.c
0x0050: 7963 6c69 7374 5f6e 616d 653b 0001 3400 yclist_name;..4.
0x0060: 0000 6400 0800 0565 2698 f00e 08 ..d....e&....
06:55:32.593339 IP (tos 0x0, ttl 64, id 43685, offset 0, flags [DF], proto TCP (6), length 170)
instance-1.c.deft-reflection-188505.internal.9042 > instance-2.c.deft-reflection-188505.internal.38428: Flags [
P.], cksum 0x15bd (incorrect -> 0x1e61), seq 1:119, ack 57, win 220, options [nop,nop,TS val 435912925 ecr 43588963
1], length 118
0x0000: 4500 00aa aaa5 4000 4006 7a88 0a8e 0002 E.....@.@.z.....
0x0010: 0a8e 0003 2352 961c eb4a d20e 8ccb 2bd6 ....#R...J....+.
0x0020: 8018 00dc 15bd 0000 0101 080a 19fb 80dd ................
0x0030: 19fb 25df 8400 000b 0800 0000 6d00 0000 ..%.........m...
0x0040: 0200 0000 0100 0000 0300 0763 7963 6c69 ...........cycli
0x0050: 6e67 000c 6379 636c 6973 745f 6e61 6d65 ng..cyclist_name
0x0060: 0002 6964 000c 0009 6669 7273 746e 616d ..id....firstnam
0x0070: 6500 0d00 086c 6173 746e 616d 6500 0d00 e....lastname...
0x0080: 0000 0100 0000 105b 6962 dd3f 904c 938f .......[ib.?.L..
0x0090: 61ea bfa4 a803 e200 0000 084d 6172 6961 a..........Maria
0x00a0: 6e6e 6500 0000 0356 4f53 nne....VOS
06:55:32.593862 IP (tos 0x0, ttl 64, id 60629, offset 0, flags [DF], proto TCP (6), length 52)
instance-2.c.deft-reflection-188505.internal.38428 > instance-1.c.deft-reflection-188505.internal.9042: Flags [
.], cksum 0x55bd (correct), seq 57, ack 119, win 325, options [nop,nop,TS val 435889639 ecr 435912925], length 0
0x0000: 4500 0034 ecd5 4000 4006 38ce 0a8e 0003 E..4..@.@.8.....
0x0010: 0a8e 0002 961c 2352 8ccb 2bd6 eb4a d284 ......#R..+..J..
0x0020: 8010 0145 55bd 0000 0101 080a 19fb 25e7 ...EU.........%.
0x0030: 19fb 80dd

Here we can observe the query issued and the result sent back from the server. And because the packets are timestamped, we can calculate the query latency, which was about 7.3 milliseconds (06:55:32.593339 – 06:55:32.586018 = 0.007321 seconds).

Here’s the approach using `protocol-observer`, the wirelatency executable. The Go binary takes an API token, a wire protocol (in this case `cassandra_cql`), and a number of optional debugging arguments. We can see below how it tracks inbound and outbound TCP streams. It reassembles those streams, pulls out the Cassandra queries, and records the query latencies. At regular intervals, it executes an HTTP PUT to the Circonus API endpoint to store the observed metrics. The metrics are recorded as log linear histograms, which means that we can store thousands of query latencies (or many more) in a very compact data structure without losing any accuracy from a statistical analysis standpoint.

[fredmoyer@instance-1 wirelatency]$ sudo ./protocol-observer -apitoken xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx -wire cassandra_cql -debug_capture=true -debug_circonus=true -debug_capture_data=true -debug_cql=true -debug_measurements

2018/02/14 07:25:07 [DEBUG] New(>, 38428->9042) -> true
2018/02/14 07:25:07 [DEBUG] establishing sessions for net:>
2018/02/14 07:25:07 [DEBUG] establishing dsessions for ports:38428->9042
2018/02/14 07:25:07 [DEBUG] new inbound TCP stream>>9042 started, paired: false
2018/02/14 07:25:07 [DEBUG] New(>, 9042->38428) -> false
2018/02/14 07:25:07 [DEBUG] new outbound TCP stream>>38428 started, paired: true
2018/02/14 07:25:08 [DEBUG] flushing all streams that haven't seen packets, pcap stats: &{PacketsReceived:3 Packets
Dropped:0 PacketsIfDropped:0}
2018/02/14 07:25:08 [DEBUG] Packaging metrics
2018/02/14 07:25:08 [DEBUG] PUT URL:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
2018/02/14 07:25:09 [DEBUG] 2 stats sent

Check out the Wirelatency repo or ask about it on the Circonus-labs slack to learn more about how it works.


We can look at the distribution of observed latencies by displaying them as a histogram.

This histogram shows that a lot of requests are clustered between 2 and 4 milliseconds. We can also see a much smaller mode between 25 and 30 milliseconds. This tells us that we likely have two different data access patterns going on for this example select query. It’s possible that the first mode indicates queries that returned data from a memory cache, and the second from disk. This is something that can’t be discerned just from this data collection, but we can go one step further and plot an overlay with block I/O activity via eBPF to see if there is a correlation between the higher latency query times and disk activity.

If we had just looked at average query latency with a more blunt instrument, we would have probably concluded that most queries were taking around 10 milliseconds. But looking at the distribution here, we can see that very few queries actually took that long to execute. Your ability to observe your system accurately depends on the correctness of the instruments you are using. Here we can see how libpcap and the histogram visualization give us a detailed view into how our Cassandra instance is really performing.

Observability into key Service Level Objectives is essential for the success of your enterprise, and observability is the first step for successful analysis. Only once a doctor collects their observations of the symptoms can they begin to make their diagnosis. What kind of insights can we gain into the behavior of our systems (of the patient’s internal states) from these observations?

That’s just part of what we will explore next week, when we’ll talk more about Service Level Objectives and dive into SLO analysis in more detail.

Effective Management of High Volume Numeric Data with Histograms

How do you capture and organize billions of measurements per second such that you can answer a rich set of queries effectively (percentiles, counts below X, aggregations across streams), and you don’t blow through your AWS budget in minutes?

To effectively manage billions of data points, your system has to be both performant and scalable. How do you accomplish that? Not only do your algorithms have to be on point, but your implementation of them has to be efficient. You want to avoid allocating memory where possible, avoid copying data (pass pointers around instead), avoid locks, and avoid waits. Lots of little optimizations that add up to being able to run your code as close to the metal as possible.

You also need to be able to scale your data structures. They need to be as size efficient as possible, which means using strongly typed languages with the optimum choice of data types. We’ve found that histograms are the most efficient data structure for storing the data types we care about at scale.

What is a histogram?

A histogram is a representation of the distribution of a continuous variable, in which the entire range of values is divided into a series of intervals (or “bins”) and the representation displays how many values fall into each bin.
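As a quick illustration of that definition, here is a minimal Go sketch that bins a handful of samples into fixed-width bins. The sample values and the bin width of 1 are made up for the example:

```go
package main

import (
	"fmt"
	"math"
)

// binSamples counts how many samples fall into each fixed-width bin,
// keyed by the bin's lower boundary.
func binSamples(samples []float64, width float64) map[float64]uint64 {
	bins := make(map[float64]uint64)
	for _, s := range samples {
		lower := math.Floor(s/width) * width
		bins[lower]++
	}
	return bins
}

func main() {
	// Hypothetical latency samples in milliseconds, binned at 1ms width.
	samples := []float64{2.1, 2.7, 3.3, 3.4, 3.9, 27.5}
	for lower, count := range binSamples(samples, 1.0) {
		fmt.Printf("[%.0f, %.0f): %d\n", lower, lower+1.0, count)
	}
}
```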

This histogram diagram shows a skewed histogram where the mode is near the minimum value q(0). The Y axis is the number of samples (or sample density), and the X axis shows the sample value. On this histogram we can see that the median is slightly left of the midpoint between the highest value and the lowest value. The mode is at a low sample value, so the median is below the mean, or average value. The 90th percentile, also called q(0.9), is the value below which 90 percent of the sample values fall.

This might look like the 2nd generation display in KITT from Knight Rider, but this is a heatmap. A heatmap is essentially a series of histograms over time. This heatmap represents web service request latency. Imagine each column in the heatmap is a bar graph (like the previous histogram) viewed “from above”; the parts that are red are where the sample density is highest. So we can see that most of the latencies tend to concentrate around 500 nanoseconds. We can overlay quantiles onto this visualization; we’ll cover that in a bit.

Types of Histograms

There are five types of histograms:

  • Fixed Bucket – require the user to specify the bucket or bin boundaries.
  • Approximate – use approximations of values.
  • Linear – have one bin at even intervals, such as one bin per integer.
  • Log Linear – have bins at logarithmically increasing intervals.
  • Cumulative – each successive bin contains the sum of the counts of previous bins.

Fixed Bucket Histograms

Fixed bucket histograms require the user to specify the bin boundaries.

Traits of fixed bucket histograms:

  • Histogram bin sizes can be fine tuned for known data sets to achieve increased precision.
  • Cannot be merged with other types of histograms because the bin boundaries are likely uncommon.
  • Less experienced users will likely pick suboptimal bin sizes.
  • If you change your bin sizes, you can’t do calculations across older configurations.

Approximate Histograms

Approximate histograms such as the t-digest histogram (created by Ted Dunning) use approximations of values, such as this example above which displays centroids. The number of samples on each side of the centroid is the same.

Traits of approximate histograms:

  • Space efficient.
  • High accuracy at extreme percentiles (95%, 99%+).
  • Worst case errors ~10% at the median with small sample sizes.
  • Can be merged with other t-digest histograms.

Linear Histograms

Linear histograms have one bin at even intervals, such as one bin per integer. Because the bins are all evenly sized, this type of histogram uses a large number of bins.

Traits of linear histograms:

  • Accuracy dependent on data distribution and bin size.
  • Low accuracy at fractional sample values (though this indicates improperly sized bins).
  • Inefficient bin footprint at higher sample values.

Log Linear Histograms

Log Linear histograms have bins at logarithmically increasing intervals.

Traits of log linear histograms:

  • High accuracy at all sample values.
  • Fits all ranges of data well.
  • Worst case bin error ~5%, but only with absurdly low sample density.
  • Bins are often subdivided into even slices for increased accuracy.
  • HDR histograms (high dynamic range) are a type of log linear histograms.

Cumulative Histograms

Cumulative histograms are different from other types of histograms in that each successive bin contains the sum of the counts of previous bins.

Traits of cumulative histograms:

  • Final bin contains the total sample count, and as such is q(1).
  • Used at Google for their Monarch monitoring system.
  • Easy to calculate bin quantile – just divide the count by the maximum.
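The “divide the count by the maximum” trick is easy to see in code. Here is a minimal Go sketch that builds a cumulative histogram from per-bin counts; the counts are hypothetical:

```go
package main

import "fmt"

// cumulative converts per-bin counts into a cumulative histogram:
// each bin holds the sum of all counts up to and including itself.
func cumulative(counts []uint64) []uint64 {
	out := make([]uint64, len(counts))
	var sum uint64
	for i, c := range counts {
		sum += c
		out[i] = sum
	}
	return out
}

func main() {
	counts := []uint64{100, 400, 300, 200} // per-bin sample counts
	cum := cumulative(counts)
	fmt.Println(cum) // [100 500 800 1000]

	// The final bin is the total sample count, so the quantile at any
	// bin's right boundary is its cumulative count over that total.
	total := cum[len(cum)-1]
	fmt.Println(float64(cum[1]) / float64(total)) // 0.5
}
```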

Open Source Log Linear Histograms

So what does a programmatic implementation of a log linear histogram look like? Circonus has released its implementation of log linear histograms as open source, in both C and Golang.

We need to ask a few things:

  • Why does it scale?
  • Why is it fast?
  • How does it calculate quantiles?

First let’s examine what the visual representation of this histogram is, to get an idea of the structure:

At first glance this looks like a linear histogram, but take a look at the 1 million point on the X axis. You’ll notice the bin size changes by a factor of 10. Where there was a bin from 990k to 1M, the next bin spans 1M to 1.1M. Each power of 10 contains 90 evenly spaced bins. Why not 100 bins, you might ask? Because the lower bound isn’t zero. In the case of a lower bound of 1 and an upper bound of 10, 10-1 = 9, and spacing those intervals in 0.1 increments yields 90 bins.
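Here is a rough Go sketch of that bin-boundary arithmetic. This is an illustration of the idea, not the actual libcircllhist code: the mantissa is truncated to one decimal place (1.0, 1.1, … 9.9 — 90 buckets per power of ten), and the lower boundary of the bin falls out directly:

```go
package main

import (
	"fmt"
	"math"
)

// binLower returns the lower boundary of the log linear bin containing v,
// for v >= 1. Each power of ten is divided into 90 bins of equal width.
func binLower(v float64) float64 {
	exp := math.Floor(math.Log10(v))
	scale := math.Pow(10, exp)
	// Mantissa bucket truncated to one decimal place:
	// 10 means 1.0, 11 means 1.1, ... 99 means 9.9.
	m := math.Floor(v / scale * 10)
	return m * scale / 10
}

func main() {
	fmt.Printf("%.0f\n", binLower(995000))  // 990000: that bin spans 990k to 1M
	fmt.Printf("%.0f\n", binLower(1050000)) // 1000000: the next bin spans 1M to 1.1M
}
```

Note that the exponent here is an unbounded float for simplicity; in the actual bin structure described below it is stored as a single signed byte.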

Here’s an alternate look at the bin size transition:

You can see the transition to larger bins at the 1,000 sample value boundary.

Bin Data Structure

The C implementation of each bin is a struct containing a value and exponent struct paired with the count of samples in that bin. The diagram above shows the overall memory footprint of each bin. The value is one byte representing the value of the data sample times 10. The Exponent is the power of 10 of the bin, and ranges from -128 to +127. The sample count is an unsigned 64 bit integer. This field is variable bit encoded, and as such occupies a maximum of 8 bytes.

Many of the existing time series data stores out there store a single data point as an average using the value as a uint64. That’s one value for every eight bytes – this structure can store a virtually unlimited count of samples in this bin range with the same data storage requirements.

Let’s take a look at what the storage footprint is in practice for this type of bin data structure. In practice, we have not seen more than a 300 bin span for operational sample sets. Bins are not pre-allocated linearly, they are allocated as samples are recorded for that span, so the histogram data structure can have gaps between bins containing data.

To calculate storage efficiency for 1 month, that’s 30 days of one minute histograms:

30 days * 24 hours/day * 60 bins/hour * 300 bin span * 10 bytes/bin * 1kB/1,024bytes * 1MB/1024kB = 123.6 MB

These calculations show that we can store 30 days of one minute distributions in a maximum space of 123.6 megabytes. Less than a second of disk read operations, if that.

Now, 30 days of one minute averages only takes about a third of a megabyte – but that data is essentially useless for any sort of analysis.

Let’s examine what a year’s worth of data looks like in five minute windows. That’s 365 days of five minute histograms.

365 days * 24 hours/day * 12 bins/hour * 300 bin span * 10 bytes/bin * 1kB/1,024bytes * 1MB/1024kB = 300.8 MB

The same calculation with different values yields a maximum of about 300 megabytes to represent a year’s worth of data in five minute windows.

Note that this is invariant to the total number of samples; the biggest factors in the actual size of the data are the span of bins covered, and the compression factor per bin.

Quantile Calculations

Let’s talk about performing quantile calculations.

  1. Given a quantile q(X) where 0 < X < 1.
  2. Sum up the counts of all the bins, C.
  3. Multiply X * C to get count Q.
  4. Walk bins, sum bin boundary counts until > Q.
  5. Interpolate quantile value q(X) from bin.

The quantile notation q of X is just a fancy way of specifying a percentile. The 90th percentile would be q of 0.9, the 95th would be q of 0.95, and the maximum would be q of 1.

So say we wanted to calculate the median, q of 0.5. First we iterate over the histogram and sum up the counts in all of the bins. Remember that cumulative histogram we talked about earlier? Guess what – the far right bin already contains the total count, so if you are using a cumulative histogram variation, that part is already done for you. This is a small optimization for quantile calculation, since you have an O(1) operation instead of O(n).

Now we multiply that count by the quantile value. If we say our count is 1,000 and our quantile is 0.5, we get a Q of 500.

Next iterate over the bins, summing the left and right bin boundary counts, until Q is between those counts. If the count Q matches the left bin boundary count, our quantile value is that bin boundary value. If Q falls in between the left and right boundary counts, we use linear interpolation to calculate the sample value.

Let’s go over the linear interpolation part of quantile calculation. Once we have walked the bins to where Q is located in a certain bin, we use the formula shown here.

Little q of X is the left bin value, plus big Q minus the left bin boundary count divided by the count difference between the left and right sides of the bin, times the bin width:

q(X) = left_value + (Q - left_count) / (right_count - left_count) * bin_width

Using this approach we can determine the quantile for a log linear histogram to a high degree of accuracy.
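The five steps above can be sketched in Go as follows. This is an illustration with simplified bins, not the libcircllhist implementation; the bin boundaries and counts in main are hypothetical:

```go
package main

import "fmt"

// bin is a simplified histogram bin: a lower boundary, a width, and a
// sample count. (The real libcircllhist bins are value/exponent pairs.)
type bin struct {
	lower, width float64
	count        uint64
}

// quantile walks the bins as described above: compute the target count
// Q = X * total, find the bin whose left and right boundary counts
// straddle Q, and linearly interpolate inside it.
func quantile(bins []bin, x float64) float64 {
	var total uint64
	for _, b := range bins {
		total += b.count
	}
	q := x * float64(total)

	var leftCount float64
	for _, b := range bins {
		rightCount := leftCount + float64(b.count)
		if q <= rightCount {
			return b.lower + (q-leftCount)/(rightCount-leftCount)*b.width
		}
		leftCount = rightCount
	}
	// q(1): the upper boundary of the last bin.
	last := bins[len(bins)-1]
	return last.lower + last.width
}

func main() {
	// Hypothetical latency bins (milliseconds): 1,000 samples total.
	bins := []bin{
		{lower: 0, width: 10, count: 400},
		{lower: 10, width: 10, count: 400},
		{lower: 20, width: 10, count: 200},
	}
	fmt.Println(quantile(bins, 0.5)) // 12.5 (Q = 500 falls in the [10, 20) bin)
}
```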

With one sample in a bin, the worst case error in a quantile calculation is about 5%: a value of 109.9 in a bin spanning 100-110 would be reported as 105. The error shrinks at higher mantissas; for a single sample in a bin spanning 950-960, the worst case error is roughly 0.5%.

However, our use of histograms is geared towards very large sets of data. With bins that contain dozens, hundreds, or more samples, accuracy should be expected to 3 or 4 nines.

Inverse Quantiles

We can also use histograms to calculate inverse quantiles. Humans can reason about thresholds more naturally than they can about quantiles.

If my web service gets a surge in requests and my 99th percentile response time doesn’t change, that not only means I just got a bunch of angry users whose requests took too long; even worse, I don’t know by how much those requests exceeded that percentile. I don’t know how bad the bleeding is.

Inverse quantiles allow me to set a threshold sample value, then calculate what percentage of values exceeded it.

To calculate the inverse quantile, we start with the target sample and work backwards towards the target count Q.

  1. Given a sample value X, locate its bin.
  2. Using the previous linear interpolation equation, solve for Q given X.

Given the previous equation we had, we can use some middle school level algebra (well, it was when I was in school) and solve for Q.

X = left_value + (Q - left_count) / (right_count - left_count) * bin_width

X - left_value = (Q - left_count) / (right_count - left_count) * bin_width

(X - left_value) / bin_width = (Q - left_count) / (right_count - left_count)

(X - left_value) / bin_width * (right_count - left_count) = Q - left_count

Q = (X - left_value) / bin_width * (right_count - left_count) + left_count

Solving for Q, we get 700, which is expected for our value of 1.05.

Now that we know Q, we can add up the counts to the left of it, subtract that from the total and then divide by the total to get the percentage of sample values which exceeded our sample value X.

  1. Sum the bin counts up to Q as Qleft.
  2. Inverse quantile qinv(X) = (Qtotal-Qleft)/Qtotal.
  3. For Qleft=700, Qtotal = 1,000, qinv(X) = 0.3.
  4. 30% of sample values exceeded X.
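Putting the solved equation and the final division together, here is a Go sketch of the inverse quantile calculation. The bins in main are hypothetical, chosen so the numbers line up with the worked example above (Q = 700 out of 1,000 samples for X = 1.05):

```go
package main

import "fmt"

// bin is a simplified histogram bin: a lower boundary, a width, and a
// sample count.
type bin struct {
	lower, width float64
	count        uint64
}

// inverseQuantile locates the bin holding sample value x, solves the
// interpolation equation for Q, then returns the fraction of samples
// that exceeded x.
func inverseQuantile(bins []bin, x float64) float64 {
	var total, leftCount float64
	for _, b := range bins {
		total += float64(b.count)
	}
	for _, b := range bins {
		rightCount := leftCount + float64(b.count)
		if x < b.lower+b.width {
			// Q = (X-left_value)/bin_width*(right_count-left_count)+left_count
			q := (x-b.lower)/b.width*(rightCount-leftCount) + leftCount
			return (total - q) / total
		}
		leftCount = rightCount
	}
	return 0 // x is above every bin: no samples exceeded it
}

func main() {
	// Hypothetical bins: X = 1.05 lands in [1.0, 1.1) with a left
	// boundary count of 400 and a right boundary count of 1,000.
	bins := []bin{
		{lower: 0.9, width: 0.1, count: 400},
		{lower: 1.0, width: 0.1, count: 600},
	}
	// Q works out to 700, so qinv(1.05) = (1000-700)/1000 = 0.3.
	fmt.Printf("%.2f\n", inverseQuantile(bins, 1.05)) // 0.30
}
```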

So if we are running a website, and we know from industry research that we’ll lose users if our requests take longer than three seconds, we can set X at 3 seconds, calculate the inverse quantile for our request times, and figure out what percentages of our users are getting mad at us. Let’s take a look at how we have been doing that in real time with Circonus.


This is a heatmap of one year’s worth of latency data for a web service. It contains about 300 million samples. Each histogram window in the heatmap is one day’s worth of data; five minute histograms are merged together to create that window. The overlay window shows the distribution for the day the mouse is hovering over.

Here we added a 99th percentile overlay, which you can see implemented as the green lines. It’s pretty easy to spot the monthly recurring rises in the percentile to around 10 seconds. That looks like a network timeout issue; those usually default to around 10 seconds. Most of the time, the 99th percentile is relatively low, a few hundred milliseconds.

Here we can see the inverse quantile shown for 500 millisecond request times. As you can see, for most of the graph, at least 90% of the requests are finishing within the 500 millisecond service level objective. We can still see the monthly increases, which we believe are related to network client timeouts, but when they spike, they only affect about 25% of requests – not great, but at least we know the extent to which our SLO is exceeded.

We can take the percentage of requests that violated our SLO of 500 milliseconds and multiply it by the total number of requests to get the number of requests which exceeded 500 milliseconds. This has a direct bearing on your business if each of these failed requests is costing you money.

Note that we’ve dropped the range here to a month to get a closer look at the data presented by these calculations.

What if we sum up the number of requests that exceeded 500 milliseconds over time? Here we integrate the number of requests that exceeded the SLO over time and plot that as the increasing line. You can clearly see where things get bad with this service by the increase in slope of the blue line. What if we had a way to automatically detect when that is happening and then page whoever is on call?

Here we can see the red line on the graph is the output of a constant anomaly detection algorithm. When the number of SLO violations increases, the code identifies it as an anomaly and rates it on a 0-100 scale. There are several anomaly detection algorithms available out there, and most don’t use a lot of complicated machine learning, just statistics.

Video Presentation

This video of our presentation at DataEngConf 2018 covers the material explained in this post.

You can also view the slides here.


We looked at a few different implementations of histograms, an overview of Circonus’ open source implementation of a log linear histogram, the data structures it uses to codify bin data, and the algorithm used to calculate quantiles. Reading the code (in C or in Golang) will demonstrate how avoiding locks, memory allocations, waits, and copies is essential to making these calculations highly optimized. Good algorithms are still limited by how they are implemented on the metal itself.

One problem to solve might be to collect all of the syscall latency data on your systems via eBPF, and compare the 99th percentile syscall latencies once you’ve applied the Spectre patch which disabled CPU branch prediction. You could find that your 99th percentile syscall overhead has gone from 10 nanoseconds to 30!

While percentiles are a good first step for analyzing your Service Level Objectives, what if we could look at the change in the number of requests that exceeded that objective? Say 10% of your syscalls exceeded 10 nanoseconds before the Spectre patch, but after patching, 50% of those syscalls exceeded 10 nanoseconds?

Soon, in a later post, we’ll talk more about Service Level Objectives and SLO analysis.