Latency SLOs Done Right

In their excellent SLO workshop at SRECon 2018, Liz Fong-Jones, Kristina Bennett, and Stephen Thorne (Google) presented some best practice examples for latency SLI/SLOs. At Circonus we care deeply about measuring latency and about SRE techniques such as SLIs/SLOs. For example, we recently produced A Guide to Service Level Objectives. As we will explain here, latency SLOs are particularly delicate to implement and benefit from having histogram data available to understand distributions and adjust SLO targets.

Latency SLOs

Here are the key definitions regarding Latency SLOs from the Workshop Handout (pdf).

Latency SLI

The suggested specification for a request/response Latency SLI is:

The proportion of valid requests served faster than a threshold.

Turning this specification into an implementation requires making two choices: which of the requests this system serves are valid for the SLI, and what threshold marks the difference between requests that are fast enough and those that are not?

Latency SLO

99% of home page requests in the past 28 days served in < 100ms.

Latency Metrics

Latency is typically measured with percentile metrics like these, which were presented for a similar use case:

Given this data, what can we say about the SLO?

What is the p90 computed over the full 28 days?

It’s very tempting to take the average of the p90 metric displayed in the graph, which would be just below the 500ms mark.
It’s important to note, and it was correctly pointed out during the session, that this is not generally valid. There is no mathematical way to determine the 28-day percentile from the series of 1h(?) percentiles shown in the above graphs (reddit, blog, math). You need to look at different metrics if you want to implement a latency SLO. In this post we will discuss three different methods for doing this correctly.
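
To see why averaging percentiles fails, consider a toy counterexample (a sketch using NumPy, not data from the graphs above): two one-hour windows with very different request counts.

import numpy as np

hour1 = np.full(1000, 10.0)   # busy hour: 1000 requests at 10ms
hour2 = np.full(10, 500.0)    # quiet hour: 10 requests at 500ms

avg_of_p90s = (np.percentile(hour1, 90) + np.percentile(hour2, 90)) / 2
true_p90 = np.percentile(np.concatenate([hour1, hour2]), 90)

print("average of hourly p90s: %.1fms" % avg_of_p90s)  # 255.0ms
print("true p90 of both hours: %.1fms" % true_p90)     # 10.0ms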

Latency metrics in the wild

In the example above, the error introduced by averaging percentiles might not actually be that dramatic. The system seems very well-behaved, with a high, constant load. In this situation the average p90/1h is typically close to the total p90/28 days.
Let’s take a look at another API, from a less loaded system. This API handles very few requests between 2:00am and 4:00am:

What’s the true p90 over the 6h shown on the graph? Is it above or below 30ms?
The average p90/1M (i.e. the average of the per-minute p90 values, 36.28ms) looks far less appealing than before.

Computing Latency SLOs

So how do we compute latency SLOs correctly? There are three ways to go about this:

  1. Compute the SLO from stored raw data (logs)
  2. Count the number of bad requests in a separate metric
  3. Use histograms to store latency distribution.

Method 1: Using Raw/Log data

Storing access logs with latency data gives you accurate results. The drawback of this approach is that you must keep your logs over long periods of time (28 days), which can be costly.

Method 2: Counting bad requests

For this method, instrument the application to count the number of requests that violated the threshold. The resulting metrics will look like this:

Percent good = 100 - (2262/60124)*100 = 96.238%

Using these metrics, we see that 96% of our requests over the past 6h were faster than 30ms. Our SLO stated that 90% of the requests should be good, so that objective was met.

The drawback of this approach is that you have to choose the latency threshold upfront. There is no way to calculate the percentage of requests that were faster than, say, 200ms from the recorded data.

If your SLO changes, you will need to change the executable or the service configuration to count requests above a different threshold.
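
For illustration, here is a minimal sketch of this kind of counter instrumentation (the threshold constant, the handler wrapper, and the metric names are hypothetical, not part of any specific library):

import time

LATENCY_THRESHOLD = 0.030  # seconds; must be chosen up front

request_total = 0
request_slow = 0  # requests that violated the threshold

def handle_request(handler):
    global request_total, request_slow
    start = time.monotonic()
    response = handler()
    elapsed = time.monotonic() - start
    request_total += 1
    if elapsed > LATENCY_THRESHOLD:
        request_slow += 1
    return response

# Periodically emit both counters, e.g. as requests`total and requests`slow,
# so that "percent good" can be derived for any reporting window:
#   percent_good = 100 - (slow / total) * 100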

Method 3: Using Histograms

The third practical option you have for computing accurate SLOs is storing your request latency data as histograms. The advantages of storing latency data as histograms are:

  1. Histograms can be freely aggregated across time.
  2. Histograms can be used to derive approximations of arbitrary percentiles.

For (1) to be true it’s critical that your histograms have common bin choices. It’s usually a good idea to mandate the bin boundaries for your whole organization, otherwise you will not be able to aggregate histograms from different services.
For (2), it’s critical that you have enough bins in the latency range that is relevant for your percentiles. Sparsely encoded log-linear histograms allow you to cover a large floating point range (e.g. 10^-127 .. 10^128) with a fixed relative precision (e.g. 5%). In this way you can guarantee 5% accuracy on all percentiles, no matter how the data is distributed.
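
To make the bin structure concrete, here is a toy sketch of log-linear binning (an illustration of the idea, not the actual circllhist implementation): each positive value is mapped to a bucket given by its decimal exponent and its first two significant digits, which bounds the relative error of any value reconstructed from the bucket.

import math

def loglinear_bucket(value):
    """Map a positive value to a (mantissa, exponent) bucket."""
    exponent = math.floor(math.log10(value))       # decimal order of magnitude
    mantissa = int(value / 10 ** (exponent - 1))   # two significant digits: 10..99
    return (mantissa, exponent)

print(loglinear_bucket(0.01237))  # (12, -2) -> bucket [0.012, 0.013)
print(loglinear_bucket(56.3))     # (56, 1)  -> bucket [56, 57)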

Two popular implementations of log-linear histograms are HdrHistogram and Circllhist. Circonus comes with native support for Circllhist, which is used for this example.
Histogram metrics store latency information per minute, and are commonly visualized as heatmaps:

Merging those 360 one-minute histograms shown above into a single 6h histogram, we arrive at the following graph:

This is the true latency distribution over the full SLO reporting period of 6h, in this example.
At the time of this writing, there is no nice UI option to overlay percentiles in the above histogram graph. As we will see, you can perform the SLO calculation with CAQL or Python.

SLO Reporting via CAQL

We can use the CAQL functions histogram:rolling(6h) and histogram:percentile() to aggregate histograms over the last 6h and compute percentiles over the aggregated histograms. The SLO value we are looking for will be the very last value displayed on the graph.
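
Put together, the query might look like the following sketch (the pipe syntax matches the CAQL examples later in this post; the exact form of the percentile argument is an assumption):

# Rolling 6h aggregation of the latency histogram, then the 90th percentile;
# the last value of the resulting series is the SLO figure we are after.
slo_query = (
    'search:metric:histogram("api`GET`/getState")'
    ' | histogram:rolling(6h)'
    ' | histogram:percentile(90)'
)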

SLO Reporting using Python

Using the Python API the calculation could look as follows:

# 0. Imports (assuming the circonusapi and circllhist Python bindings)
from circonusapi import circonusdata
from circllhist import Circllhist

# 1. Fetch Histogram Data
t = 1528171020 # exact start time of the graph
N = 364        # exact number of minutes on the above graph
circ = circonusdata.CirconusData(config["demo"])
data = circ.caql('search:metric:histogram("api`GET`/getState")', t, 60, N)

# 2. Merge Histograms
H = Circllhist()
for h in data['output[0]']: H.merge(h)
# Let's check that the fetched data is consistent with the histogram in the UI
circllhist_plot(H)

# 3. Calculate Aggregated Percentiles
P = [50, 90, 95, 99, 99.9] # arbitrary percentiles
for p in P: print("{:>8}-latency percentile over 6h: {:>8.3f}ms".format(p, H.quantile(p/100)))
      50-latency percentile over 6h:   13.507ms
      90-latency percentile over 6h:   21.065ms
      95-latency percentile over 6h:   27.796ms
      99-latency percentile over 6h:   56.058ms
    99.9-latency percentile over 6h:  918.760ms

In particular, we see that the true p90 is around 21ms, far from the average p90 of 36.28ms we computed earlier.

# 4. Calculate Aggregated Counts:
Y = [10, 30, 50, 100, 200] # arbitrary thresholds
for y in Y: print("{:>10.3f} percent faster than {}ms".format(H.count_below(y)/H.count()*100,y))
    18.465 percent faster than 10ms
    96.238 percent faster than 30ms
    98.859 percent faster than 50ms
    99.484 percent faster than 100ms
    99.649 percent faster than 200ms

In particular, we replicate the "96.238% below 30ms" result that we calculated using the counter metrics before.

Conclusion

It’s important to understand that percentile metrics do not allow you to implement accurate Service Level Objectives that are formulated over hours or weeks. Aggregating one-minute percentiles seems tempting, but can produce materially wrong results, especially if your load is highly volatile.
The most practical ways to calculate correct SLO percentiles are counters and histograms. Histograms give you the additional flexibility to choose the latency threshold after the fact. This comes in particularly handy when you are still evaluating your service and are not ready to commit to a latency threshold just yet.

Less Toil, More Coil – Telemetry Analysis with Python

This was a frequent request we were hearing from many customers:

“How can I analyze my data with Python?”

The Python Data Science toolchain (Jupyter/NumPy/pandas) offers a wide spectrum of advanced data analytics capabilities. Therefore, seamless integration with this environment is important for our customers who want to make use of those tools.

Circonus has for a long time provided Python bindings for its API. With these bindings, you can configure the account, create graphs and dashboards, etc. However, fetching data and getting it into the right format involves multiple steps and was not easy to get right.

We are now pleased to announce that this has changed. We have just added new capabilities to our Python bindings that allow you to fetch and analyze data more effectively. Here is how to use it.

Quick Tour

Connecting to the API

You need an API token to connect to the API. You can create one using the UI under Integrations > API Tokens. In the following we assume the variable api_token holds a valid API token for your account.

from circonusapi import circonusdata
circ = circonusdata.CirconusData(api_token)

Searching for Metrics

The first thing we can do is search for some metrics:

>>> M = circ.search('(metric:duration)', limit=10)

The returned object extends the list class and can be manipulated like any list object.
We override the __str__ method, so that printing the list gives a table representation of the fetched metrics:

>>> print(M)

check_id   type       metric_name
--------------------------------------------------
195902     numeric    duration
218003     numeric    duration
154743     numeric    duration
217833     numeric    duration
217834     numeric    duration
218002     numeric    duration
222857     numeric    duration
222854     numeric    duration
222862     numeric    duration
222860     numeric    duration

Metric lists provide a .fetch() method that can be used to fetch data. Fetches are performed serially, one metric at a time, so the retrieval can take some time. We will later see how to parallelize fetches with CAQL.

R = M.fetch(
    start=datetime(2018,1,1), # start at Midnight UTC 2018-01-01
    period=60,                # return 60 second (=1min) aggregates
    count=180,                # return 180 samples
    kind="value"              # return (mean-)value aggregate
)

The resulting object is a dict that maps metric names to the fetched data. It is designed in such a way that it can be passed directly to a pandas DataFrame constructor.

import pandas as pd
df = pd.DataFrame(R)

# [OPTIONAL] Make the DataFrame aware of the time column
df['time']=pd.to_datetime(df['time'],unit='s')
df.set_index('time', inplace=True)

df.head()
time			154743/duration	195902/duration	217833/duration	217834/duration	218002/duration	218003/duration	222854/duration	222857/duration	222860/duration	222862/duration
2018-01-01 00:00:00	1		4		1		1		1		1		12		11		12		1
2018-01-01 00:01:00	1		2		1		1		2		1		11		12		12		1
2018-01-01 00:02:00	1		2		1		1		1		1		12		12		11		1
2018-01-01 00:03:00	1		2		1		1		1		1		12		11		12		1
2018-01-01 00:04:00	1		2		1		1		1		1		12		11		11		1

Data Analysis with pandas

Pandas makes common data analysis methods very easy to perform. We start with computing some summary statistics:

df.describe()
154743/duration	195902/duration	217833/duration	217834/duration	218002/duration	218003/duration	222854/duration	222857/duration	222860/duration	222862/duration
count	180.000000	180.000000	180.0		180.000000	180.000000	180.000000	180.000000	180.000000	180.00000	180.000000
mean	1.316667	2.150000	1.0		1.150000	1.044444	1.177778	11.677778	11.783333	11.80000	1.022222
std	1.642573	0.583526	0.0		1.130951	0.232120	0.897890	0.535401	0.799965	0.89941 	0.181722
min	1.000000	1.000000	1.0		1.000000	1.000000	1.000000	11.000000	11.000000	11.00000	1.000000
25%	1.000000	2.000000	1.0		1.000000	1.000000	1.000000	11.000000	11.000000	11.00000	1.000000
50%	1.000000	2.000000	1.0		1.000000	1.000000	1.000000	12.000000	12.000000	12.00000	1.000000
75%	1.000000	2.000000	1.0		1.000000	1.000000	1.000000	12.000000	12.000000	12.00000	1.000000
max	15.000000	4.000000	1.0		12.000000	3.000000	9.000000	13.000000	17.000000	16.00000	3.000000

Here is a plot of the dataset over time:

from matplotlib import pyplot as plt
ax = df.plot(style=".",figsize=(20,5),legend=False, ylim=(0,20), linewidth=0.2)

We can also summarize the individual distributions as box plots:

ax = df.plot(figsize=(20,5),legend=False, ylim=(0,20), kind="box")
ax.figure.autofmt_xdate(rotation=-20,ha="left")

Working with Histogram Data

Histogram data can be fetched using the kind="histogram" parameter to fetch(). Numeric metrics will be converted to histograms. Histograms are represented as libcircllhist objects, which have very efficient methods for the most common histogram operations (mean, quantiles).

MH = circ.search("api`GET`/getState", limit=1)
print(MH)
check_id   type       metric_name
--------------------------------------------------
160764     histogram  api`GET`/getState                                 

Let’s fetch the 1h latency distributions of this API for the timespan of one day:

RH = MH.fetch(datetime(2018,1,1), 60*60, 24, kind="histogram")

We can plot the resulting histograms with a little helper function:

fig = plt.figure(figsize=(20, 5))
for H in RH['160764/api`GET`/getState']:
    circllhist_plot(H, alpha=0.2)
ax = fig.get_axes()
ax[0].set_xlim(0,100)

The output:
(0, 100)

Again, we can directly import the data into a pandas data frame, and perform some calculations on the data:

dfh = pd.DataFrame(RH)

# [OPTIONAL] Make the DataFrame aware of the time column
dfh['time']=pd.to_datetime(dfh['time'],unit='s')
dfh.set_index('time', inplace=True)
dfh['p99'] = dfh.iloc[:,0].map(lambda h: h.quantile(0.99))
dfh['p90'] = dfh.iloc[:,0].map(lambda h: h.quantile(0.90))
dfh['p95'] = dfh.iloc[:,0].map(lambda h: h.quantile(0.95))
dfh['p50'] = dfh.iloc[:,0].map(lambda h: h.quantile(0.5))
dfh['mean'] = dfh.iloc[:,0].map(lambda h: h.mean())
dfh.head()
time             	160764/api`GET`/getState				p99		p90		p95		p50		mean
2018-01-01 00:00:00	{"+29e-002": 2, "+40e-002": 6, "+50e-002": 8, ...	112.835714	112.835714	112.835714	11.992790	15.387013
2018-01-01 01:00:00	{"+40e-002": 2, "+50e-002": 2, "+59e-002": 5, ...	114.961628	114.961628	114.961628	16.567822	19.542284
2018-01-01 02:00:00	{"+40e-002": 3, "+50e-002": 12, "+59e-002": 4,...	118.124324	118.124324	118.124324	20.556859	24.012226
2018-01-01 03:00:00	{"+29e-002": 1, "+40e-002": 7, "+50e-002": 21,...	427.122222	427.122222	427.122222	20.827982	37.040173
2018-01-01 04:00:00	{"+40e-002": 6, "+50e-002": 26, "+59e-002": 15...	496.077778	496.077778	496.077778	23.247373	40.965517

The CAQL API

Circonus comes with a wide range of data analysis capabilities that are integrated into the Circonus Analytics Query Language, CAQL.

CAQL provides highly efficient data fetching operations that allow you to process multiple metrics at the same time. Also by performing the computation close to the data, you can save time and bandwidth.

To get started, we search for duration metrics, like we did before, using CAQL:

A = circ.caql('search:metric("duration")', datetime(2018,1,1), 60, 5000)
dfc = pd.DataFrame(A)
dfc.head()
		output[0]	output[10]	output[11]	output[12]	output[13]	output[14]	output[15]	output[16]	output[17]	output[18]	...	output[21]	output[2]	output[3]	output[4]	output[5]	output[6]	output[7]	output[8]	output[9]	time
0		4		12		1		1		2		1		1		1		11		1		...		1		1		1		1		1		1		11		12		1	1514764800
1		2		12		1		1		1		1		1		1		11		1		...		1		1		1		1		1		2		12		11		1	1514764860
2		2		11		1		1		2		1		1		1		12		1		...		1		1		1		1		1		1		12		12		1	1514764920
3		2		12		1		1		2		1		1		1		12		1		...		1		1		1		1		1		1		11		12		1	1514764980
4		2		11		1		1		2		1		1		1		11		1		...		1		1		1		1		1		1		11		12		1	1514765040
5 rows × 23 columns

This API call fetched 1000 samples from 22 metrics, and completed in just over 1 second. The equivalent circ.search().fetch() statement would have taken around one minute to complete.

One drawback of CAQL fetching is that the metric names are not preserved in the output (columns are simply labeled output[0], output[1], and so on). We are working on resolving this shortcoming.

To showcase some of the analytics features, we’ll now use CAQL to compute a rolling mean over the second largest duration metric in the above cluster, and plot the transformed data using pandas:

B = circ.caql("""

search:metric("duration") | stats:trim(1) | stats:max() | rolling:mean(10M)

""", datetime(2018,1,1), 60, 1000)
df = pd.DataFrame(B)
df['time']=pd.to_datetime(df['time'],unit='s')
df.set_index('time', inplace=True)
df.plot(figsize=(20,5), lw=.5,ylim=(0,50))

You can also fetch histogram data with circ.caql():

AH = circ.caql('search:metric:histogram("api`GET`/getState")', datetime(2018,1,1), 60*60, 24)
dfch = pd.DataFrame(AH)
dfch.head()
        output[0]						time
0	{"+29e-002": 2, "+40e-002": 6, "+50e-002": 8, ...	1514764800
1	{"+40e-002": 2, "+50e-002": 2, "+59e-002": 5, ...	1514768400
2	{"+40e-002": 3, "+50e-002": 12, "+59e-002": 4,...	1514772000
3	{"+29e-002": 1, "+40e-002": 7, "+50e-002": 21,...	1514775600
4	{"+40e-002": 6, "+50e-002": 26, "+59e-002": 15...	1514779200

We can perform a wide variety of data transformation tasks directly inside Circonus using CAQL expressions. This speeds up the computation even further. Another advantage is that we can leverage CAQL queries for live graphing and alerting in the Circonus UI.

In this example, we compute how many requests were serviced above certain latency thresholds:

B = circ.caql('''

search:metric:histogram("api`GET`/getState") | histogram:count_above(0,10,50,100,500,1000)

''', datetime(2018,1,1), 60*5, 24*20)
dfc2 = pd.DataFrame(B)
dfc2['time']=pd.to_datetime(dfc2['time'],unit='s')
dfc2.set_index('time', inplace=True)
dfc2.plot(figsize=(20,5), colormap="gist_heat",legend=False, lw=.5)

Conclusion

Getting Circonus data into Python has never been easier. We hope that this blog post allows you to get started with the new data fetching capabilities.  A Jupyter notebook version of this blog post containing the complete source code is available here. If you run into any problems or have some suggestions, feel free to open an issue on GitHub, or get in touch on our Slack channel.

Linux System Monitoring with eBPF

The Linux kernel is a ubiquitous component of modern IT systems. It provides the critical services of hardware abstraction and time-sharing to applications. The classical metrics for monitoring Linux are among the most well known metrics in monitoring: CPU utilization, memory usage, disk utilization, and network throughput. For a while now, Circonus installations have organized the key system metrics in the form of a USE Dashboard, as a high-level overview of the system resources.

While those metrics are clearly useful and important, there is a lot left to be wished for. Even the most basic metrics like CPU utilization have serious flaws (cpu-load.txt) that limit their significance. Also, there are a lot of questions for which there are simply no metrics exposed (such as disk errors and failed mallocs).

eBPF is a game-changing technology that became available in recent kernel versions (v4.1 and later). It allows subscribing to a large variety of in-kernel events (kprobes, function call tracing) and aggregating them with minimal overhead. This unlocks a wide range of meaningful, precise measurements that can help narrow the observability gap. A great and ever-growing collection of system tracing tools is provided by the bcc toolkit by iovisor.
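
To give a feel for what bcc-based collection looks like, here is a minimal sketch in the spirit of bcc's biolatency tool (this is not the Circonus plugin itself, and the kprobe attach points vary between kernel versions):

from bcc import BPF
from time import sleep

bpf_text = """
#include <uapi/linux/ptrace.h>
#include <linux/blkdev.h>

BPF_HASH(start, struct request *);
BPF_HISTOGRAM(dist);

// record a timestamp when a block I/O request is issued
int trace_req_start(struct pt_regs *ctx, struct request *req) {
    u64 ts = bpf_ktime_get_ns();
    start.update(&req, &ts);
    return 0;
}

// on completion, add the elapsed time to a log2 histogram (in microseconds)
int trace_req_done(struct pt_regs *ctx, struct request *req) {
    u64 *tsp = start.lookup(&req);
    if (tsp != 0) {
        u64 delta = bpf_ktime_get_ns() - *tsp;
        dist.increment(bpf_log2l(delta / 1000));
        start.delete(&req);
    }
    return 0;
}
"""

b = BPF(text=bpf_text)
b.attach_kprobe(event="blk_start_request", fn_name="trace_req_start")
b.attach_kprobe(event="blk_mq_start_request", fn_name="trace_req_start")
b.attach_kprobe(event="blk_account_io_completion", fn_name="trace_req_done")

sleep(10)
b["dist"].print_log2_hist("usecs")  # latency distribution of the last 10 seconds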

The Circonus Monitoring Agent comes with a plugin that collects eBPF metrics using the bcc toolkit (see source code & instructions here). At the time of this writing, the plugin is supported on the Ubuntu 16.04 platform. In the following examples we will demonstrate how this information can be used.

Block-I/O Latencies

The block-I/O layer of the operating system is the interface that block devices, like disks, offer to the file system. Since they form an API, it’s natural to apply the RED methodology (adapted from the SRE Book, see e.g. Tom Wilkie 2018) and monitor rate, errors, and duration. One famous example of how this information can be used is to identify environmental influences on I/O performance, as seen in Brendan Gregg – Shouting in the Datacenter (YouTube 2008). The example duration measurements can be seen in the figure below.

150M I/O events that were recorded on three disks over the period of a week.

This diagram shows a total of 150M I/O events that were recorded on three disks over the period of a week. The visualization as stand-alone histogram allows us to qualitatively compare the access latency profiles very easily. In this case, we see that the traffic pattern is imbalanced (one disk serving less than half of the load of the others), but the latency modes are otherwise very similar, indicating good (or at least consistent) health of the disks.

The next figure shows how these requests are distributed over time.

Disk array visualized as a heatmap.

This is the latency duration profile of the disk array visualized as a heatmap. The line graphs show the p10, p50, and p90 percentiles calculated over 20 minute spans. One can see how the workload changes over time. Most of the requests were issued between Sept 10th and Sept 11th 12:00, with a median performance of around 3ms.

File System Latency

From the application perspective, the file system latencies are much more relevant than block I/O latencies. The following graphic shows the latency of the read(2) and write(2) syscalls executed over the period of a few days.

The median latency of this dataset is around 5u-sec for read and 14u-sec for write accesses. This is an order of magnitude faster than block I/O latencies and indicates that buffering and caching of file system accesses is indeed speeding things up.

Caveat: In UNIX systems, “everything is a file.” Hence the same syscalls are used to write data to all kinds of devices (sockets, pipes) and not only disks. The above metrics do not differentiate between those devices.

CPU Scheduling Latency

Everyone knows that systems become less responsive when they are overloaded. If there are more runnable processes than CPUs in the system, processes begin to queue and additional scheduling latency is introduced. The load average reported by top(1) gives you a rough idea of how many processes were queued for execution over the last few minutes on average (the reality is quite a bit more subtle). If this metric is higher than the number of CPUs, you will get "some" scheduling latency.
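
That classic check is easy to script (a quick sketch using the standard library, Unix only):

import os

load1, load5, load15 = os.getloadavg()
ncpu = os.cpu_count()

print("1-minute load average per CPU: %.2f" % (load1 / ncpu))
if load1 > ncpu:
    print("more runnable processes than CPUs -> expect some scheduling latency")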

But how much scheduling latency did your application actually experience?

With eBPF, you can just measure the latency of every scheduling event. The diagram below shows the latency of 17.4B scheduling events collected over a 4 week period.

The median scheduling latency (30u-sec) was very reasonable. Clearly visible are several modes, which I suppose can be attributed to processes waiting behind none, one, or two other processes in the queue. The tail of the distribution shows the collateral damage caused by periods of extreme load during the collection period. The longest scheduling delay was a severe hang of 45 seconds!

Next steps

If you want to try this out on your system, you can get a free Circonus account in a matter of minutes. Installing the Circonus agent on an Ubuntu 16.04 machine can be done with a single command. Then enable the eBPF plugin on your host by following the instructions here.

It’s an ongoing effort to extend the capabilities of the eBPF plugin. Apart from the metrics shown above, the plugin also exposes rate and duration metrics for all 392 Linux system calls. There are a lot more interesting tools in iovisor/bcc waiting to be ported.

Happy Monitoring!

Circonus On The Raspberry Pi

There are a lot of interesting monitoring tasks that can be facilitated with a Raspberry Pi (e.g. here, there). Circonus does not officially support "Raspbian Linux on armv6/v7" as a deployment target, but given the steady interest in this topic, we took the time to pave the way and write down some instructions in this post.

Let’s dive right in:

# Install dependencies
sudo apt-get update
sudo apt-get install git make

# Only node v4 or node v6 is supported. Install the official arm package
wget https://nodejs.org/dist/latest-v6.x/node-v6.12.3-linux-armv6l.tar.xz
sudo mkdir -p /opt/node
sudo tar -vxf node-v6.12.3-linux-armv6l.tar.xz -C /opt/node --strip-components=1
sudo ln -s /opt/node/bin/node /usr/bin/node
sudo ln -s /opt/node/bin/npm /usr/bin/npm
sudo ln -s /opt/node/bin/node /opt/circonus/bin/node

# Install nad from sources
git clone https://github.com/circonus-labs/nad.git
cd nad
git checkout v2.5.1
sudo make install install-linux-init

# Fix nad disk plugins which depend on /proc/mdstat
sudo modprobe md

This should install and enable the monitoring agent nad on the RPI. Check that it is up and running using:

# Is it up? -> You should see a status report mentioning "active (running)"
systemctl status nad
# Does it serve metrics? -> You should see a large JSON blob, containing metrics
curl localhost:2609

Now copy and paste the one-step-install command from the Integrations > Checks > [New Host] page. In my case this is:

curl -sSL https://onestep.circonus.com/install | bash \
   -s -- \
   --key xxxxxxxxx-50ce-4118-9145-xxxxxxxxxx \
   --app circonus:osi:xxxxxx...

The credentials will be different for you. The installer should find the running agent and register checks, graphs, and the USE Dashboard for you. We needed to make some tweaks to the COSI installer itself to get this working as it is now. Special thanks goes to Matt Maier from Circonus for making this happen so quickly.

These instructions were tested on Raspbian stretch lite (Version: November 2017 / Release date: 2017-11-29 / Kernel version: 4.9) on a RPI2 Model B.

Happy Monitoring!

This post was also published on Heinrich Hartmann’s blog.

System Monitoring with the USE Dashboard

The USE method was developed by Brendan Gregg to study performance problems in a systematic way.1 It provides a simple, top-down approach to identify bottlenecks quickly and reliably. The gathering of performance metrics is usually done using a myriad of different system tools (sar, iostat, vmstat) and tracers (dtrace, perf, ebpf). Circonus has now added the most relevant USE metrics to its monitoring agent and conveniently presents them in the form of a USE Dashboard (Figure 1, live demo), which allows a full USE analysis at a single glance.

Figure 1: The USE System Dashboard.
Figure 1: The USE System Dashboard. Click to view a live demo.

Outline

  1. System Monitoring
  2. The USE Method
  3. Performance Metrics
    3.1 CPU / Scheduler Monitoring
    3.2 Memory Management
    3.3 Network Monitoring
    3.4 Disks and File Systems
  4. The USE Dashboard

1. System Monitoring

System monitoring is concerned with monitoring basic system resources: CPU, Memory, Network and Disks. These computational resources are not consumed directly by the application. Instead, the Operating System abstracts and manages these resources and provides a consistent abstracted API to the application (“process environment”, system calls, etc). Figure 2 illustrates the high level architecture. A more detailed version can be found in Gregg.2, 3

Figure 2: High-level System Overview
Figure 2: High-level System Overview

One critical objective of system monitoring is to check how the available resources are utilized. Typical questions are: Is my CPU fully utilized? Is my application running out of memory? Do we have enough disk capacity left?

While a fully utilized resource is an indication of a performance bottleneck, it might not be a problem at all. A fully utilized CPU means only that we are making good use of the system. It starts causing problems only when incoming requests start queuing up or producing errors, and hence the performance of the application is impacted. But queuing does not only occur in the application layer. Modern software stacks use queuing in all system components to improve performance and distribute load. The degree to which a resource has extra work that it can not service is called saturation,3, p42 and is another important indicator for performance bottlenecks.

2. The USE Method

The USE method, by Gregg, is an excellent way to identify performance problems quickly. It uses a top down approach to summarize the system resources, which ensures that every resource is covered. Other approaches suffer from a “street light syndrome,” in that the focus lies on parts of the system where metrics are readily available. In other cases, random changes are applied in the hope that the problems go away.

The USE method can be summarized as follows:

For each resource, check utilization, saturation, and errors.

The USE analysis is started by creating an exhaustive list of the resources that are consumed by the application. The four resource types mentioned above are the most important ones, but there are more resources, like the I/O bus, memory bus, and network controllers, that should be included in a thorough analysis. For each resource, errors should be investigated first, since they impact performance and might not be noticed immediately when the failure is recoverable. Then utilization and saturation are checked.

For more details about the use method and its application to system performance analysis the reader is referred to the excellent book by Gregg.3

3. Performance Metrics

It’s not immediately clear how utilization, saturation, and errors can be quantified for different system resources. Fortunately, Gregg has compiled a sensible list of indicators2 that are available on Linux systems. We have taken this list as our starting point to define a set of USE metrics for monitoring systems with Circonus. In this section, we will go over each of them and explain their significance.

3.1. CPU Monitoring

Utilization Metrics:

  1. cpu`idle — time spent in the idle thread and not waiting on I/O (yellow)
  2. cpu`wait_io — time spent in the idle thread while waiting on IO (dark yellow)
  3. cpu`user — time spent executing user code (blue)
  4. cpu`system + cpu`intr — time spent executing OS-kernel code (dark blue)

Those metrics should give you a rough idea of what the CPU was doing during the last reporting period (1 minute). Blues represent time that the system spent doing work; yellow colors represent time that the system spent waiting.

Like all other monitoring tools we are aware of, the values are derived from `/proc/stat`.4 However, they should be taken with the necessary precaution:5, 6 Those counters do not usually add up to 100%, which is already a bad sign. Also, accounting takes place in units of full CPU time slices (jiffies) at the time of the clock interrupt. There are many voluntary context switches within a single time slice that are missed this way. Tickless kernels and varying clock speeds (Intel Turbo Boost/Turbo Step) further blur the meaning of these metrics. For the subtleties of accounting wait_io on multiprocessor systems, see A. Veithen.11

Nevertheless, these metrics are used everywhere to measure CPU utilization, and have proven to be a valuable first source of information. We hope that we will be able to replace them with more precise metrics in the future.

There are some differences from Gregg:2 There, `vmstat "us" + "sy" + "st"` is the suggested utilization metric. We account steal time ("st") as idle.
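
As a concrete illustration, here is a minimal sketch of how these `/proc/stat` counters turn into a utilization breakdown (two samples are needed because the counters are cumulative since boot; steal time is ignored here for brevity):

import time

def read_cpu_ticks():
    with open("/proc/stat") as f:
        fields = f.readline().split()[1:]  # aggregate "cpu" line, values in jiffies
    user, nice, system, idle, iowait, irq, softirq = map(int, fields[:7])
    return {"user": user + nice, "system": system + irq + softirq,
            "idle": idle, "wait_io": iowait}

a = read_cpu_ticks()
time.sleep(5)
b = read_cpu_ticks()

deltas = {k: b[k] - a[k] for k in a}
total = sum(deltas.values())
for k, v in deltas.items():
    print("%-8s %5.1f%%" % (k, 100.0 * v / total))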

Saturation Metrics:

  1. cpu`procs_runnable (normalized, corrected) — runnable processes per CPU (purple)
  2. loadavg`1 (normalized) — one-minute load average per CPU (dark purple)
  3. Saturation threshold guide at value one. (black)

The CPU keeps runnable processes waiting in a queue. The length of this queue at any given point in time is a good metric for CPU saturation. Unfortunately, this number is not directly exposed by the kernel. Via `/proc/stat` we get the total number of runnable processes (`procs_running`), which includes queued processes as well as currently running processes. We report this number as the `procs_runnable` metric. While being a straightforward measure, it suffers from the low sampling frequency (only once a minute) and an observation bias: the Circonus agent NAD currently lists 3 running processes on an idle system. We account for these problems by (a) subtracting the processes run by the monitoring agent (3) and (b) dividing by the number of CPU cores on the system. In this way, a value of one, which is marked with the guide, represents a system that has one runnable process per core.

The load average is a smoothed-out version of the `procs_runnable` metric, maintained by the kernel. It’s typically sampled every 5 seconds and aggregated with an exponential smoothing algorithm.7 Recent kernel versions spend a lot of effort to maintain a meaningful load average metric across systems with a high number of CPUs and tickless kernels. This metric is divided by the number of CPU cores as well.

While 1min, 5min, and 15min load averages are maintained by the kernel, we only show the 1min average, since the others don’t provide any added value when plotted over time.

Both collected metrics are similar in their interpretation: if the value of either of them is larger than one (the guide), you have processes queuing for CPU time on the machine.

Differences from Gregg:2 In Gregg’s method, only the procs_runnable statistic is considered. We think load averages are valuable as well.

Error Metrics:

CPU error metrics are hard to come by. If CPU performance counters are available (often not the case on virtualized hardware) perf(1) can be used to read them out. At the moment, Circonus does not provide CPU error metrics.

3.2. Memory Management

This section is concerned with the memory capacity resource. The bandwidth of the memory interconnect is another resource that can be worth analyzing, but it is much harder to get.

Utilization Metrics:

  1. vm`meminfo`Buffers — file system buffers (darker yellow)
  2. vm`meminfo`Cached — file-system caches (dark yellow)
  3. vm`meminfo`MemFree — free memory (yellow)
  4. vm`meminfo`Used (computed) — memory used by the application (blue)

The OS uses memory that is not utilized by the application for caching file system content. These memory pages can be reclaimed by the application as needed, and are usually not a problem for system performance.
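
A small sketch of how the computed "Used" value can be derived from `/proc/meminfo` (values are in kB; the exact formula the agent uses may differ):

def meminfo():
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            info[key] = int(rest.split()[0])  # value in kB
    return info

m = meminfo()
used = m["MemTotal"] - m["MemFree"] - m["Buffers"] - m["Cached"]
print("Used: %d kB (%.1f%% of MemTotal)" % (used, 100.0 * used / m["MemTotal"]))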

Saturation Metrics:

  1. vm`info`pg_scan — the number of memory pages scanned per second (purple)
  2. vm`swap`used — usage of swap devices (purple, zero in the example)

When free memory is close to exhaustion, the system begins freeing memory from buffers and caches, or begins moving pages to a swap partition on disk (if present). The page scanner is responsible for identifying suitable memory pages to free; hence, scanning activity is an indicator of memory saturation. A growing amount of used swap space is also an indicator of saturated memory.

When the system has neither memory nor swap space left, it must free memory by force. Linux does this by killing applications that consume too much memory. When this happens, we have an OOM ("out of memory") event, which is logged to dmesg.

Differences from Gregg:2 We are missing metrics for swapping (i.e. anonymous paging) and OOM events.

Errors:

Physical memory failures are logged to dmesg. Failed malloc(3) can be detected using SystemTap. We don’t have any metrics for either of them at the moment.

3.3. Network Monitoring

Utilization Metrics:

  1. if`<interface>`in_bytes — inbound network throughput in bytes/sec
  2. if`<interface>`out_bytes — outbound network throughput in bytes/sec

The network utilization can be measured as throughput divided by the bandwidth (maximal throughput) of each network interface. A full-duplex interface is fully utilized if either inbound or outbound throughput exhaust the available bandwidth. For half-duplex interfaces, the sum of inbound and outbound throughput is the relevant metric to consider.

For graphing throughput we use a logarithmic scale, so that a few kb/sec remain visibly distinct from the x-axis, and set the y-limit to the available bandwidth. The available bandwidth is often not exposed by virtual hardware; in this case, we don’t set a y-limit.

Saturation Metrics:

  1. tcp`segments_retransmitted (purple)
  2. if`<interface>`{in,out}_drop — dropped packets due to full buffers (purple, zero)
  3. if`<interface>`{in,out}_fifo_overrun — dropped packets due to full buffers (purple, zero)

Network saturation metrics are hard to come by. Ideally, we’d like to know how many packets are queued in the send/receive buffers, but these statistics do not seem to be exposed via /proc. Instead, we have to settle for the indirect indicators that are available, such as TCP-level retransmits, as well as drop and overrun counts.

Error Metrics:

  1. if`<interface>`{in,out}_errors — the number of errors that occurred, e.g. invalid packets received

All metrics covered in this section should be taken with a great deal of caution. The Linux networking stack is a large, undocumented codebase with a lot of known bugs in the accounting mechanisms; see Damato.8

3.4. Disks and File Systems

Utilization Metrics:

  1. diskstats`<device>`io_ms — the percentage of time the device has been utilized during the last reporting interval

The disk utilization is measured per device and not per file system. We simply record the percentage of time the device was busy during the last reporting period. This metric is read from `/proc/diskstats`.

Saturation Metrics:

  1. diskstats`<disks>`io_ms_weighted — Average number of inflight I/O operations

This metric is exposed under the name "avgqu-sz" by `sar -d` and `iostat -x`, and measures the number of requests that are currently queued or being processed by the disk. The metric is derived from the weighted_io_ms counter in `/proc/diskstats`, and involves a little bit of mathematics.9

Error Metrics:

According to Gregg,2 disk errors can be found in /sys/devices/<>/ioerr_cnt, via smartctl, or by "tracing of the IO subsystem". However, in the (virtualized) systems we used for testing, these metrics were not exposed. Hence, at the moment, we don’t include any disk error metrics.

4. The USE Dashboard

The USE Dashboard, shown in Figure 1 above, combines all metrics discussed above in a single dashboard for each host.

Each row corresponds to a resource type, and each column contains an analysis dimension: Utilization, Saturation, and Errors. To perform a USE performance analysis as outlined in Gregg,1, 2 you would traverse the dashboard line by line and check:

  1. Are errors present? => Graphs in column 3 are non-zero.
  2. Is saturation present? => Graphs in column 2 show large values over an extended time period.
  3. Is the resource fully utilized? => Graphs in column 1 are maxed out.

The graphs are organized in such a way that all these checks can be done at a single glance.

Discussion

The USE Dashboard allows a rapid performance analysis of a single host. Instead of ssh-ing into a host and collecting statistics from a number of system tools (vmstat, iostat, sar), we get all relevant numbers together with their historical context in a single dashboard. We found this visualization valuable to judge the utilization of our own infrastructure, and as a first go-to source for analyzing performance problems.

However, the work on this dashboard is far from complete. First, we are lacking error metrics for the CPU, Memory, and Disk resources. We have included text notes in the USE dashboard on how to get them manually, to make those blind spots "known unknowns". Another big area of work is the quality of the measurements. As explained above, even basic metrics like CPU utilization have large margins of error and conceptual weaknesses in the measurement methodology. Also, we did not cover all resources that a system provides to the application; e.g. we don’t have metrics about the file system, network controllers, I/O and memory interconnects, and even basic metrics about the physical CPU and memory resources are missing (instructions per second, memory ECC events). Figure 3 below positions the covered resources into the high-level system overview presented in the introduction.

Figure 3: Metrics displayed in the USE dashboard per resource
Figure 3: Metrics displayed in the USE dashboard per resource

Try it yourself!

To try the USE Dashboard for yourself, log into your Circonus account and click on "Checks" > "+ New Host" to provision a new host with a single command. The USE Dashboard will be created automatically. A link is provided at the command line.

No Circonus account, yet? Get one for free here.

References

  1. Gregg – Thinking Methodically about Performance, CACM 2013
  2. Gregg – Use Method: Linux Performance Checklist
  3. Gregg – Systems Performance, Prentice Hall 2014
  4. https://www.kernel.org/doc/Documentation/filesystems/proc.txt
  5. https://github.com/torvalds/linux/blob/master/Documentation/cpu-load.txt#L20-L22
  6. Gregg – Broken Linux Performance Tools, Scale14x (2016)
  7. https://en.wikipedia.org/wiki/Load_(computing)
  8. Damato – All of your network monitoring is probably wrong
  9. Hartmann – Monitoring Queue Sizes
  10. https://github.com/iovisor/bcc
  11. Andreas Veithen – The precise meaning of IO wait on Linux (blog)

Monitoring Queue Sizes

At Circonus we collect a number of metrics about block devices. An interesting one is the average length of the request queue, which is exposed by iostat[1]:

aqu-sz
  The average queue length of the requests that were issued to the device.
  Note: In previous versions, this field was known as avgqu-sz.

To collect aqu-sz as a metric we don’t want to fire up iostat on each collection interval. So we needed to understand how this value is derived from statistics exposed by the operating system. It turns out that there is an interesting bit of mathematics involved in deriving this value. Here is how it works.

It was already noted by Baron Schwartz in 2010[2] that the calculations behind iostat are interesting. This post can be seen as a formalized elaboration.

Step 1: Checking the Implementation

The first step in understanding the iostat calculation of aqu-sz is to check the implementation (source):

/* aqu-sz */
    cprintf_f(NO_UNIT, 1, 6, 2, S_VALUE(ioj->rq_ticks, ioi->rq_ticks, itv) / 1000.0);

Here S_VALUE is defined (source) as:

#define S_VALUE(m,n,p)		(((double) ((n) - (m))) / (p) * HZ)

and rq_ticks is Field 11 from /proc/diskstats (source).

Digging around in the source code a little more, you will be able to verify for yourself that the following
formula is used to calculate aqu-sz:

$$ aqusz = \frac{F_{11}(t_1) - F_{11}(t_0)}{t_1 - t_0} $$

where $t_0, t_1$ are the measurement timestamps in ms, and $F_{11}(t)$ is the value of Field 11 at time $t$.

So, the average queue size can be calculated as a slope or discrete derivative of a mystical field from /proc/diskstats. Interesting!
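
For illustration, here is a minimal sketch of that calculation from two readings of /proc/diskstats (the device name is hypothetical; Field 11 is the 11th statistics column, i.e. index 13 of the whitespace-split line):

import time

def read_field11(device):
    with open("/proc/diskstats") as f:
        for line in f:
            parts = line.split()
            if parts[2] == device:
                return int(parts[13])  # weighted number of ms spent doing I/O
    raise ValueError("device not found: %s" % device)

t0, f0 = time.time(), read_field11("sda")
time.sleep(10)
t1, f1 = time.time(), read_field11("sda")

aqu_sz = (f1 - f0) / ((t1 - t0) * 1000.0)  # ms of weighted I/O time per ms elapsed
print("average queue size over the interval: %.2f" % aqu_sz)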

Step 2: Checking the kernel documentation

Let’s see how that works. The kernel documentation[3] says the following:

Field 11 -- weighted # of milliseconds spent doing I/Os
  This field is incremented at each I/O start, I/O completion, I/O
  merge, or read of these stats by the number of I/Os in progress
  (field 9) times the number of milliseconds spent doing I/O since the
  last update of this field.  This can provide an easy measure of both
  I/O completion time and the backlog that may be accumulating.

So:

$$
F_{11}(t_1) = F_{11}(t_0) + (\text{I/Os in progress at time $t_1$}) \times (\text{ms spent doing I/O in $t_0,t_1$})
$$

Where $t_0$ is the time of the last update, and $t_1$ is the current time.

So if we want to calculate the value of $F_{11}$ at a time $t$, we have to recursively apply this rule until the beginning of time (boot):

$$
F_{11}(t) = \sum_{t_i \leq t} (\text{I/Os in progress at time $t_i$}) \times (\text{ms spent doing I/O in $t_{i-1},t_{i}$})
$$

where $t_i$ runs through the times of every I/O event in the system. For the mathematical reader, this sum has a very familiar form: it sums function values times small time intervals. This is precisely how integration over time works. So we are inclined to compare it with:

$$
F_{11}(T) \overset{?}{\leftrightarrow} \int_0^T (\text{number of I/Os in progress at time t}) \text{dt}
$$

This is not quite right, since we should only sum over I/O time, and it’s not clear that the sum can be replaced by an integral, but it points us in the right direction. Assuming this were the case, we would get the following expression for `avgqz`:

$$
avgqz \overset{?}{\leftrightarrow}  \frac{\int_{t_0}^{t_1} (\text{I/Os in progress at time t}) \text{dt}}{t_1 - t_0}.
$$

This is a very common way to express a mean value of a function. In this case, we average the number of I/Os in progress, and not the queue length, but that kind of imprecise wording is commonly found in man pages.

So all this would make sense, if only Field 11 accounted for the total time, and not just the time "spent doing I/O", and if we could replace the sum by an integral!

Step 3: A mathematical model

In order to make the above idea precise, let’s introduce some notation:

\begin{align*}
ios(t) &= \text{(number of I/O operations in progress at time t)} \\
busy(t) &= \text{(one if the device is processing IO at time t, zero otherwise)}
\end{align*}

Here, again $t$ denotes system uptime in ms. To warm up, let’s first look at the integral:

$$
Busy(T) = \int_0^T busy(t) \text{dt} = \int_{t : busy(t) = 1} 1 \text{dt}
$$

This integral measures the total time the system was busy processing I/O since boot. We can get the utilization of the system during a time interval $[t_0, t_1]$ as:

$$
\frac{Busy(t_1) - Busy(t_0)}{t_1 - t_0} = \frac{\int_{t_0}^{t_1} busy(t) \text{dt}}{t_1-t_0} = \frac{\text{time spent doing IO}}{\text{total time}} = util(t_0, t_1)
$$

This is another very important statistic that is reported by `iostat -x` under the %util column.

To get a mathematical description of Field 11, let’s try the following:

$$ F(T) := \int_0^T ios(t) busy(t) \text{dt} $$

In this way we have:

$$ F(t_1) - F(t_0) = \int_{t_0}^{t_1} ios(t) busy(t) \text{dt} $$

or, equivalently:

$$ F(t_1) = F(t_0) + \int_{t_0}^{t_1} ios(t) busy(t) \text{dt} $$

Now, if $ios(t)$ is constant during the time interval $[t_0,t_1]$, then:

$$ F(t_1) = F(t_0) + ios(t_1) \times \int_{t_0}^{t_1} busy(t) \text{dt} $$

Looking closely at this, we see that this is precisely the recursion formula for Field 11 from /proc/diskstats:

$$ F(t_1) = F(t_0) + (\text{I/O ops in progress at time $t_1$}) \times (\text{number of ms spent doing I/O in $t_0,t_1$}) $$

And if $t_0,t_1$ are adjacent I/O events (start, complete, merge), the assumption that $ios(t)$ is constant in between is justified. Hence we see that $F_{11} = F$.

Step 4: This can’t be true!

Now that we have a firm grip of $F_{11}$, we can start examining the claimed formula for the average queue size

$$ aqusz = \frac{F_{11}(t_1) - F_{11}(t_0)}{t_1 - t_0} = \frac{\int_{t_0}^{t_1} ios(t) busy(t) \text{dt}}{t_1 - t_0} $$

Is this indeed a sensible measure for the average queue size?

It does not seem to be the case. Take, for example, a time interval where $ios(t) = 10$ but $busy(t) = 0$: the average queue size should be 10, but the integrand evaluates to 0. Hence the above expression is zero, which is clearly not sensible.

Step 5: But is it really?

But is this really a valid example? If the two functions busy() and ios() were really independent, then this condition could certainly occur. In the analogous case of CPU scheduling, cases like these, where there are runnable threads but the CPU is busy doing something else (interrupt handlers, hypervisor), can indeed happen.

But is this really the case for block IO scheduling? Another look at the documentation[3] reveals the following:

    Field  9 -- # of I/Os currently in progress
        The only field that should go to zero. Incremented as requests are
        given to appropriate struct request_queue and decremented as they finish.
    Field 10 -- # of milliseconds spent doing I/Os
        This field increases so long as field 9 is nonzero.

So, Field 9 is our function $ios(t)$ and Field 10 is actually our function $Busy(t)$! And they indeed have a relation:

> "[Busy(t)] increases as long as [ios(t)] is nonzero"

In other words $busy(t) = 1$ if $ios(t) > 0$ and $0$ otherwise!

Revisiting our definition of $F$, we find the following (explanations follow)

\begin{align*}
F(T) &:= \int_{t \in [0,T]} ios(t) busy(t) \text{dt} \\
&= \int_{t \in [0,T], busy(t) = 1} ios(t) busy(t) \text{dt} + \int_{t \in [0,T], busy(t) = 0} ios(t) busy(t) \text{dt} \\
&= \int_{t \in [0,T], busy(t) = 1} ios(t) \text{dt} + 0 \\
&= \int_{t \in [0,T], ios(t) > 0} ios(t) \text{dt} + \int_{t \in [0,T], ios(t) = 0} ios(t) \text{dt} \\ &= \int_{t \in [0,T]} ios(t) \text{dt}
\end{align*}

Let’s break that down. In the first step, we can divide the integral into two parts, where $busy = 0/1$, the part where $busy = 0$ evaluates to 0. In the next step we note that $busy = 0$ is equivalent to $ios = 0$, so we can replace the condition. Next we complete the integral by introducing another 0-summand: Integrating over $ios(t) = 0$ which evaluates to zero as well. Finally we put things back together.

We see that for the calculation of Field 11, it does not matter if we include $busy(t)$ in the integrand. We end up with the same number whether we sum over “the number of milliseconds *spent doing I/O* since the last update of this field” or just “the number of milliseconds since the last update of this field”:

\begin{align*}
avgqz(t_0, t_1)
&= \frac{F_{11}(t_1) - F_{11}(t_0)}{t_1 - t_0} \\
& \overset{!}{=} \frac{\int_{t_0}^{t_1} ios(t) \text{dt}}{t_1 - t_0} \\
&= \text{Integral-average over the number of I/Os in progress within $t_0,t_1$.}
\end{align*}

Conclusion

We have seen that the aqu-sz reported by iostat has indeed the interpretation of an integral average over the number of I/Os in progress. We find it interesting to see how calculus can be applied to various state-accounting metrics to clarify their relationships. Moreover, the ability to express integral averages as discrete derivatives of a counter ($F_{11}$) is another remarkable takeaway.

References

  1. man iostat
  2. Schwartz – How iostat computes metrics (2010)
  3. I/O statistics fields

Show Me the Data

Avoid spike erosion with Percentile- and Histogram-Aggregation

It has become common wisdom that the lossy process of averaging measurements leads to all kinds of problems when measuring the performance of services (see Schlossnagle2015, Ugurlu2013, Schwarz2015, Gregg2014). Yet most people are not aware that averages are even more abundant than just in old-fashioned formulations of SLAs and storage backends for monitoring data. In fact, it is likely that most graphs you are viewing involve some averaging behind the scenes, which introduces severe side effects. In this post, we will describe a phenomenon called spike erosion, and highlight some alternative views that allow you to get a more accurate picture of your data.

Meet Spike Erosion

Spike Erosion of Request Rates

Take a look at Figure 1. It shows a graph of request rates over the last month. The spike near December 23 marks the apparent maximum at around 7 requests per second (rps).

Figure 1: Web request rate in requests per second over one month time window
Figure 1: Web request rate in requests per second over one month time window

What if I told you that the actual maximal request rate was almost double that value, at 13.67rps (marked with the horizontal guide)? And moreover, it was not reached on December 23, but on December 15 at 16:44, near the left boundary of the graph?

Looks way off, right?

But it’s actually true! Figure 2 shows the same graph zoomed in at said time window.

Figure 2: Web request rates (in rps) over a 4h period
Figure 2: Web request rates (in rps) over a 4h period

We call this phenomenon spike erosion: the farther you zoom out, the lower the spikes. It’s actually very common in all kinds of graphs across all monitoring products.

Let’s see another example.

Spike Erosion of Ping Latencies

Take a look at Figure 3. It shows a graph of ping latencies (from twitter.com) over the course of 4 weeks. Again, it looks like the latency is rather stable around 0.015ms, with occasional spikes above 0.02ms and a clear maximum around December 23, with a value of ca. 0.03ms.

Figure 3: Ping latency of twitter.com in ms over the last month
Figure 3: Ping latency of twitter.com in ms over the last month

Again, we have marked the actual maximum with a horizontal guide line. It is more than double the apparent maximum and is attained at every one of the visible spikes. That’s right: all spikes do in fact have the same maximal height. Figure 4 shows a closeup of the one on December 30, in the center.

Figure 4: Ping latency of twitter.com in ms on December 30
Figure 4: Ping latency of twitter.com in ms on December 30

What’s going on?

The mathematical explanation of spike erosion is actually pretty simple. It is an artifact of an averaging process that happens behind the scenes, in order to produce sensible plots with high performance.

Note that within a four-week period we have a total of 40,320 samples collected that we need to represent in a plot over that time window. Figure 5 shows how a plot of all those samples looks in GnuPlot. There are quite a few issues with this raw presentation.

Figure 5: Plot of the raw data of request rates over a month
Figure 5: Plot of the raw data of request rates over a month

First, there is a ton of visual noise in that image. In fact, you cannot even see the individual 40,000 samples for the simple reason that the image is only 1240 pixels wide.

Also, rendering such an image within a browser puts a lot of load on the CPU. The biggest issue with producing such an image is the latency involved with retrieving 40K float values from the db and transmitting them as JSON over the internet.

In order to address the above issues, all mainstream graphing tools pre-aggregate the data before sending it to the browser. The size of the graph determines the number of values that should be displayed, e.g. 500. The raw data is then distributed across 500 bins, and for each bin the average is taken and displayed in the plot.

This process leads to plots like Figure 1, which (a) can be produced much faster, since less data has to be transferred and rendered (in fact, you can cache the pre-aggregated values to speed up retrieval from the db), and (b) are less visually cluttered. However, it also leads to (c) spike erosion!

When looking at a four-week time window, the raw number of 40,320 samples is reduced to a mere 448 plotted values, where each plotted value corresponds to an average over a 90 minute period. If there is a single spike in one of the bins, it gets averaged with 90 other samples of lower value, which leads to the erosion of the spike height.
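
The effect is easy to reproduce with a small synthetic example (a sketch with NumPy, not the Circonus aggregation pipeline):

import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(5.0, 0.5, 40320)   # four weeks of 1-minute request rates
samples[10000] = 13.67                  # a single spike

bins = samples.reshape(-1, 90)          # 448 bins of 90 minutes each
mean_rollup = bins.mean(axis=1)
max_rollup = bins.max(axis=1)

print("true maximum:        %.2f" % samples.max())      # 13.67
print("max of mean roll-up: %.2f" % mean_rollup.max())  # eroded spike
print("max of max roll-up:  %.2f" % max_rollup.max())   # spike preserved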

What to do about it?

There are (at least) two ways to avoid spike erosion and get more insight into your data. Both change the way the data is aggregated.

Min-Max Aggregation

The first way is to show the minimum and the maximum values of each bin along with the mean value. By doing so, you get a sense of the full range of the data, including the highest spikes. Figures 6 and 7 show how Min-Max Aggregation looks in Circonus for the request rate and latency examples.

Figure 6: Request rate graph with Min-Max Aggregation Overlay
Figure 6: Request rate graph with Min-Max Aggregation Overlay
Figure 7: Latencies with Min/Max-Aggregation Overlay
Figure 7: Latencies with Min/Max-Aggregation Overlay

In both cases, the points where the maximum values are assumed are clearly visible in the graph. When zooming into the spikes, the Max aggregation values stay aligned with the global maximum.

Keeping in mind that minimum and maximum are special cases of percentiles (namely the 0%-percentile and 100%-percentile), it appears natural to extend the aggregation methods to allow general quantiles as well. This is what we implemented in Circonus with the Percentile Aggregation overlay.

Histogram Aggregation

There is another, structurally different approach to mitigate spike erosion. It begins with the observation that histograms have a natural aggregation logic: Just add the bucket counts. More concretely, a histogram metric that stores data for each minute can be aggregated to larger time windows (e.g. 90 minutes) without applying any summary statistic, like a mean value, simply by adding the counts for each histogram bin across the aggregation time window.
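
As a toy sketch of this logic (bin-to-count dictionaries standing in for real histogram structures):

from collections import Counter

minute_histograms = [
    {0.012: 40, 0.015: 55, 0.021: 5},
    {0.013: 38, 0.015: 60, 0.034: 2},
    # ... one dictionary per minute in the 90-minute window
]

rollup = Counter()
for h in minute_histograms:
    rollup.update(h)  # add counts bin by bin; no averaging involved

print(dict(rollup))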

If we combine this observation with the simple fact that a time-series metric can be considered a histogram with a single value in it, we arrive at the powerful Histogram Aggregation, which rolls up time series into histogram metrics with a lower time resolution. Figures 8 and 9 show Histogram Aggregation Overlays for the request rate and latency examples discussed above.

Figure 8: Request Rates with Histogram Aggregation Overlay
Figure 8: Request Rates with Histogram Aggregation Overlay
Figure 9: Latencies with Histogram Aggregation Overlay
Figure 9: Latencies with Histogram Aggregation Overlay

In addition to showing the value range (which in the above figure is amplified by using the Min-Max Aggregation Overlay), we also gain a sense of the distribution of values across each bin. In the request rate example, the data varies widely across a corridor of 2.5-10rps. In the latency example, the distribution is concentrated near the global median of 0.015ms, with single-value outliers.

Going Further

We have seen that displaying data as histograms gives a more concise picture of what is going on. Circonus allows you to go one step further and collect your data as histograms in the first place. This allows you to capture the latencies of all requests made to your API, instead of only probing your API once per minute. See [G.Schlossnagle2015] for an in-depth discussion of the pros and cons of this “passive monitoring” approach. Note that you can still compute averages and percentiles for viewing and alerting.

Figure 10: API Latency Histogram Metric with Average Overlay
Figure 10: API Latency Histogram Metric with Average Overlay

Figure 10 shows a histogram metric of API latencies, together with the mean value computed as an overlay. While this figure looks quite similar to Figures 8 and 9, the logical dependency is reversed. The mean values are computed from the histogram, not the other way around. Also, note that the time window of this figure only spans a few hours, instead of four weeks. This shows how much richer the captured histogram data is.

Resources

  1. Theo Schlossnagle – Problem with Math
  2. Dogan Ugurlu (Optimizely) – The Most Misleading Measure of Response Time: Average
  3. Baron Schwartz – Why percentiles don’t work the way you think
  4. Brendan Gregg – Frequency Tails: What the mean really means
  5. George Schlossnagle – API Performance Monitoring