Introducing Circonus Stream Tags

Gain greater insight into system behavior in aggregate, across multiple dimensions.

A “metric” is a measurement, or value, representing the operational state of your system at a given time. For example, the amount of free memory on a web host, or the number of users logged into your site at a given time. Factor in hundreds (or thousands) of hosts, availability zones, hardware batches, or service endpoints, and you’re suddenly dealing with a significant logistical challenge.

What’s the difference in average free memory between API hosts in my Ashburn and San Jose data centers? How about the difference in CPU utilization for the EC2 clusters using 4 r5.xlarge instances instead of 8 r5.larges? How do you stay on top of it all?

“As we grow our number of metrics and/or want to make more sense out of them, we need to be more systematic.” – Metrics 2.0 (metrics20.org)

For time-series data, it’s all about metadata — which may sound familiar if you take pictures with your smartphone. Metadata isn’t the picture you take, but the time of day, location, camera type, settings, and so on — all to help you organize albums and find/organize your photos more easily. The same principle applies to how you manage your operational time-series telemetry, as this additional information gives them more depth. Known in the software industry as metric tags or dimensions, this metadata provides greater insight into your metrics.

Why are metric tags important?

Traditionally in a large enterprise, automated monitoring solutions are a one-time, “set it and forget it” situation based on the defined, up-front monitoring objectives. These solutions are then turned over to the operations team. With static infrastructure, this works, and these checks and metrics tend to remain unchanged for extended periods of time. However, static infrastructure isn’t as common as it used to be – and that’s the problem.

Today’s dynamic, ever-changing infrastructure (including web and application servers) makes it difficult and impractical to apply the same static analysis and operations when monitoring it.

The world of IT has always been a moving target. Enterprise data centers are continuously subject to transformation as new software, hardware, and process components are deployed and updated. Today’s dynamic infrastructure is often a fusion of bare metal, virtualized instances, orchestrated container-based applications, serverless functions, and cloud-deployed services. According to the IEEE, this puts an immense burden on monitoring needs. Not only are there thousands of different parameters to monitor, but the addition and modification of service level objectives (SLOs) may happen continuously.

Stream Tags from Circonus Can Help

Circonus’ metric tag implementation is known as Stream Tags. Our implementation improves infrastructure monitoring by adding metric metadata to help you label and organize your metrics data. Stream Tags provide all of the capabilities of the Metrics 2.0 specification: self-describing, orthogonal tags for every dimension. In addition, Stream Tags offer extra capabilities, as detailed below.

CPU metric using Stream Tags to specify host, role, state, and zone

What Makes Stream Tags Better?

Stream Tags offer the ability to base64 encode tag names and values. The ability to support multi-byte character sets, as well as the full list of ASCII symbols, offers a notable advantage for non-English speakers.

There are, of course, some fun things you can do with this too. We aren’t saying that you should embed emoji within Stream Tags, but should you so desire, that’s certainly possible.

user_order|ST[drink:🍺,food:🌯]

becomes

user_order|ST[b"ZHJpbms=":b"8J+Nug==",b"Zm9vZA==":b"8J+Mrw=="]
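
For illustration, here is a short Python sketch (using only the standard library, not an official Circonus SDK) that reproduces the base64 form shown above:

import base64

def b64_tag(name, value):
    # base64-encode a tag name and value, quoting them in the b"..." form
    enc = lambda s: base64.b64encode(s.encode("utf-8")).decode("ascii")
    return 'b"%s":b"%s"' % (enc(name), enc(value))

tags = [("drink", "🍺"), ("food", "🌯")]
print("user_order|ST[%s]" % ",".join(b64_tag(n, v) for n, v in tags))
# user_order|ST[b"ZHJpbms=":b"8J+Nug==",b"Zm9vZA==":b"8J+Mrw=="]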

Tag, You’re It

Let’s jump in and take a look at what stream tagged metrics actually look like. Earlier, we talked about a metric that represents the amount of free memory on a host. We’ll call this metric “meminfo`MemFree”. This is a metric that you’ll find associated with any physical or virtual host, but whose average, across many hosts, would produce a broad and not very useful aggregate. Adding a Stream Tag such as “environment” allows us to examine a segment of these metrics, such as “environment:development” or “environment:production”. Adding the environment tag to this metric would give us something like

meminfo`MemFree|ST[environment:production]

Now, we can use an aggregate such as the average free memory of our production environment to get an idea of how it differs from other environments. Stream Tag names and values are separated by colons and enclosed within square brackets. The tag section is prefixed by “ST” and separated from the metric name with a “|” character.

Let’s take it a bit further. Say I want to know if our new release is going to blow out our cluster because we forgot to call free() in one important spot in the code. We can add a tag “deploy_pool” to the metric, which will now look like

meminfo`MemFree|ST[environment:production,deploy_pool:blue]

Now we can compare memory usage between blue and green deployment pools using Stream Tags, and determine if the new release has a memory leak that needs to be addressed. This technique can also be used for traditional A/B testing, or canary based feature deployment.

Further segmenting is possible as needed; we could add kernel version, instance type, etc. as Stream Tags to delineate metadata boundaries.

Stream tag showing web server endpoint latency metadata

With Stream Tags, we are also able to apply as many labels and tags as we need to properly organize our metrics. Bear in mind that each unique combination of Stream Tags represents a new time series, which can dramatically increase the amount of data stored. So, it’s important to be careful when using Stream Tags to store dimensions with high cardinality, such as user IDs, email addresses, or other unbounded sets of values.

Do I Really Need Stream Tags?

If your infrastructure is dynamic, which has become increasingly common: yes.

Stream Tags allow the flexibility needed by today’s infrastructure monitoring. They make it possible to filter, aggregate and organize metrics in multiple dimensions. Analysis and operations can be applied to a set of metrics rather than individual metrics, allowing for service level monitoring in addition to resource monitoring.

Implemented correctly, Circonus’ Stream Tags are critical to monitoring modern infrastructure, because they give admins the power to aggregate metrics at every level.

When possible and applicable, automatic tagging means the metrics collection agent may be able to detect metadata from the source – e.g. role: tag from Chef, availability-zone: and instance-type: from AWS, labels from Google Cloud, etc. The metadata can then be automatically added as Stream Tags.
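
As a hypothetical sketch (the endpoint paths are the standard EC2 instance metadata paths; this is not the Circonus agent’s implementation), such automatic tagging could look like this:

import urllib.request

def imds(path):
    # query the EC2 instance metadata service (IMDSv1); error handling omitted
    url = "http://169.254.169.254/latest/meta-data/" + path
    return urllib.request.urlopen(url, timeout=1).read().decode()

tags = {
    "instance-type": imds("instance-type"),
    "availability-zone": imds("placement/availability-zone"),
}
print("meminfo`MemFree|ST[%s]" % ",".join("%s:%s" % kv for kv in sorted(tags.items())))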

How fancy can I make them?

You can use Stream Tags to organize metrics with a set of key characteristics such as role, zone, department, etc. At Circonus, we append those key characteristics to the metric name following this format:

<metric name>|ST[<tag name>:<tag value>, … ]

where

<tag name> may consist of any number of characters in [`+A-Za-z0-9!@#\$%^&"'\/\?\._-]

<tag value> may consist of any number of characters in [`+A-Za-z0-9!@#\$%^&"'\/\?\._-:=]

EXAMPLE:
CPU|ST[mode:wait_io,host:wnode15,role:webserver,zone:us-east1b]

The example above shows a CPU metric for wait_io from a web server with hostname wnode15, in the us-east1b availability zone.

<tag-name> and <tag-value> may also consist of base64-encoded strings enclosed in double quotes and prefixed with 'b'.

EXAMPLE:
Base64encodedMetric|ST[b"X19uYW1lX18=":b"cHJvbWV0aGV1c19yZW1vdGVfc3RvcmFnZV9zdWNjZWVkZWRfc2FtcGxl"]

Please note that there must not be any whitespace between tag delimiters.
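
To make the format concrete, here is a small, hypothetical helper (not part of any Circonus SDK) that assembles a tagged metric name and falls back to the base64 form when a tag contains characters outside the allowed sets above:

import base64
import re

ALLOWED_NAME  = re.compile(r"""^[`+A-Za-z0-9!@#$%^&"'/?._-]+$""")
ALLOWED_VALUE = re.compile(r"""^[`+A-Za-z0-9!@#$%^&"'/?._:=-]+$""")

def encode_part(s, allowed):
    # use the plain form if possible, otherwise the base64 b"..." form
    if allowed.match(s):
        return s
    return 'b"%s"' % base64.b64encode(s.encode("utf-8")).decode("ascii")

def tagged_metric(name, tags):
    pairs = ",".join("%s:%s" % (encode_part(k, ALLOWED_NAME), encode_part(v, ALLOWED_VALUE))
                     for k, v in sorted(tags.items()))
    return "%s|ST[%s]" % (name, pairs)  # no whitespace between tag delimiters

print(tagged_metric("CPU", {"mode": "wait_io", "host": "wnode15",
                            "role": "webserver", "zone": "us-east1b"}))
# CPU|ST[host:wnode15,mode:wait_io,role:webserver,zone:us-east1b]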

Search across Stream Tags easily with the new Metrics Explorer

You’re also able to search for matching Stream Tags to filter and aggregate metrics, display them in graphs, set rules for alerts based on the search results, and more. This search can be performed by matching the <tag-name> and <tag-value> strings, or their base64 encoded forms.

EXAMPLE:

(node:wnode15)                // string match

(b"X19uYW1lX18=")             // base64 string match

(b/X.+/)                      // base64 <regex> match

So What’s the Take-Away?

Stream Tags enable you to manage your dynamic infrastructure, while providing flexibility for future paradigms you’ll need to adopt. You can keep up with the ever-changing monitoring landscape with better transparency, more effective searches and comparisons, faster responses, and automatic metric aggregation. Stream Tags provide all the control you need to manage diverse, high-cardinality sets of operational metrics. They give you the power of the Metrics 2.0 specification, plus the forward-looking capabilities provided by base64-capable tag names and values.

That’s what’s in it for you. Overall, this rich functionality will save you time and effort, enable agility to keep up with changing technology, and deliver better results than the conventional alternative. Stream Tags are expected to land on November 19th. We look forward to hearing how this feature helps each of your respective organizations.

Less Toil, More Coil – Telemetry Analysis with Python

This was a frequent request we were hearing from many customers:

“How can I analyze my data with Python?”

The Python Data Science toolchain (Jupyter/NumPy/pandas) offers a wide spectrum of advanced data analytics capabilities. Therefore, seamless integration with this environment is important for our customers who want to make use of those tools.

Circonus has long provided Python bindings for its API. With these bindings, you can configure your account, create graphs and dashboards, and so on. However, fetching data and getting it into the right format involved multiple steps and was not easy to get right.

We are now pleased to announce that this has changed. We have just added new capabilities to our Python bindings that allow you to fetch and analyze data more effectively. Here is how to use it.

Quick Tour

Connecting to the API

You need an API token to connect to the API. You can create one using the UI under Integrations > API Tokens. In the following we assume the variable api_token holds a valid API token for your account.

from circonusapi import circonusdata
circ = circonusdata.CirconusData(api_token)

Searching for Metrics

The first thing we can do is search for some metrics:

>>> M = circ.search('(metric:duration)', limit=10)

The returned object extends the list class, and can be manipulated like any list object.
We override the __str__ method so that printing the list gives a table representation of the fetched metrics:

>>> print(M)

check_id   type       metric_name
--------------------------------------------------
195902     numeric    duration
218003     numeric    duration
154743     numeric    duration
217833     numeric    duration
217834     numeric    duration
218002     numeric    duration
222857     numeric    duration
222854     numeric    duration
222862     numeric    duration
222860     numeric    duration
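
For illustration only, a list subclass with an overridden __str__ could look roughly like this sketch (the actual circonusapi class differs in its details):

class MetricList(list):
    def __str__(self):
        # render the fetched metrics as a fixed-width table
        header = "%-10s %-10s %s" % ("check_id", "type", "metric_name")
        rows = ["%-10s %-10s %s" % (m["check_id"], m["type"], m["name"]) for m in self]
        return "\n".join([header, "-" * 50] + rows)

print(MetricList([{"check_id": 195902, "type": "numeric", "name": "duration"}]))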

Metric lists provide a .fetch() method that can be used to fetch data. Fetches are performed serially, one metric at a time, so the retrieval can take some time. We will later see how to parallelize fetches with CAQL.

from datetime import datetime

R = M.fetch(
    start=datetime(2018,1,1), # start at Midnight UTC 2018-01-01
    period=60,                # return 60 second (=1min) aggregates
    count=180,                # return 180 samples
    kind="value"              # return (mean-)value aggregate
)

The resulting object is a dict that maps metrics names to the fetched data. This is designed in such a way that it can be directly passed to a pandas DataFrame constructor.

import pandas as pd
df = pd.DataFrame(R)

# [OPTIONAL] Make the DataFrame aware of the time column
df['time']=pd.to_datetime(df['time'],unit='s')
df.set_index('time', inplace=True)

df.head()
time			154743/duration	195902/duration	217833/duration	217834/duration	218002/duration	218003/duration	222854/duration	222857/duration	222860/duration	222862/duration
2018-01-01 00:00:00	1		4		1		1		1		1		12		11		12		1
2018-01-01 00:01:00	1		2		1		1		2		1		11		12		12		1
2018-01-01 00:02:00	1		2		1		1		1		1		12		12		11		1
2018-01-01 00:03:00	1		2		1		1		1		1		12		11		12		1
2018-01-01 00:04:00	1		2		1		1		1		1		12		11		11		1

Data Analysis with pandas

Pandas makes common data analysis methods very easy to perform. We start by computing some summary statistics:

df.describe()
154743/duration	195902/duration	217833/duration	217834/duration	218002/duration	218003/duration	222854/duration	222857/duration	222860/duration	222862/duration
count	180.000000	180.000000	180.0		180.000000	180.000000	180.000000	180.000000	180.000000	180.00000	180.000000
mean	1.316667	2.150000	1.0		1.150000	1.044444	1.177778	11.677778	11.783333	11.80000	1.022222
std	1.642573	0.583526	0.0		1.130951	0.232120	0.897890	0.535401	0.799965	0.89941 	0.181722
min	1.000000	1.000000	1.0		1.000000	1.000000	1.000000	11.000000	11.000000	11.00000	1.000000
25%	1.000000	2.000000	1.0		1.000000	1.000000	1.000000	11.000000	11.000000	11.00000	1.000000
50%	1.000000	2.000000	1.0		1.000000	1.000000	1.000000	12.000000	12.000000	12.00000	1.000000
75%	1.000000	2.000000	1.0		1.000000	1.000000	1.000000	12.000000	12.000000	12.00000	1.000000
max	15.000000	4.000000	1.0		12.000000	3.000000	9.000000	13.000000	17.000000	16.00000	3.000000

Here is a plot of the dataset over time:

from matplotlib import pyplot as plt
ax = df.plot(style=".",figsize=(20,5),legend=False, ylim=(0,20), linewidth=0.2)

We can also summarize the individual distributions as box plots:

ax = df.plot(figsize=(20,5),legend=False, ylim=(0,20), kind="box")
ax.figure.autofmt_xdate(rotation=-20,ha="left")

Working with Histogram Data

Histogram data can be fetched by passing the kind="histogram" parameter to fetch(). Numeric metrics will be converted to histograms. Histograms are represented as libcircllhist objects, which have very efficient methods for the most common histogram operations (mean, quantiles).

MH = circ.search("api`GET`/getState", limit=1)
print(MH)
check_id   type       metric_name
--------------------------------------------------
160764     histogram  api`GET`/getState                                 

Let’s fetch the 1h latency distributions of this API for the timespan of one day:

RH = MH.fetch(datetime(2018,1,1), 60*60, 24, kind="histogram")

We can plot the resulting histograms with a little helper function, circllhist_plot (included in the accompanying notebook):

fig = plt.figure(figsize=(20, 5))
for H in RH['160764/api`GET`/getState']:
    circllhist_plot(H, alpha=0.2)
ax = fig.get_axes()
ax[0].set_xlim(0,100)

The output is a figure overlaying the 24 hourly latency histograms (the set_xlim call also echoes its limits, (0, 100), in the notebook).

Again, we can directly import the data into a pandas data frame, and perform some calculations on the data:

dfh = pd.DataFrame(RH)

# [OPTIONAL] Make the DataFrame aware of the time column
dfh['time']=pd.to_datetime(dfh['time'],unit='s')
dfh.set_index('time', inplace=True)
dfh['p99'] = dfh.iloc[:,0].map(lambda h: h.quantile(0.99))
dfh['p90'] = dfh.iloc[:,0].map(lambda h: h.quantile(0.90))
dfh['p95'] = dfh.iloc[:,0].map(lambda h: h.quantile(0.95))
dfh['p50'] = dfh.iloc[:,0].map(lambda h: h.quantile(0.5))
dfh['mean'] = dfh.iloc[:,0].map(lambda h: h.mean())
dfh.head()
time             	160764/api`GET`/getState				p99		p90		p95		p50		mean
2018-01-01 00:00:00	{"+29e-002": 2, "+40e-002": 6, "+50e-002": 8, ...	112.835714	112.835714	112.835714	11.992790	15.387013
2018-01-01 01:00:00	{"+40e-002": 2, "+50e-002": 2, "+59e-002": 5, ...	114.961628	114.961628	114.961628	16.567822	19.542284
2018-01-01 02:00:00	{"+40e-002": 3, "+50e-002": 12, "+59e-002": 4,...	118.124324	118.124324	118.124324	20.556859	24.012226
2018-01-01 03:00:00	{"+29e-002": 1, "+40e-002": 7, "+50e-002": 21,...	427.122222	427.122222	427.122222	20.827982	37.040173
2018-01-01 04:00:00	{"+40e-002": 6, "+50e-002": 26, "+59e-002": 15...	496.077778	496.077778	496.077778	23.247373	40.965517

The CAQL API

Circonus comes with a wide range of data analysis capabilities that are integrated into the Circonus Analytics Query Language, CAQL.

CAQL provides highly efficient data fetching operations that allow you to process multiple metrics at the same time. Also, by performing the computation close to the data, you can save time and bandwidth.

To get started, we search for duration metrics, like we did before, using CAQL:

A = circ.caql('search:metric("duration")', datetime(2018,1,1), 60, 5000)
dfc = pd.DataFrame(A)
dfc.head()
		output[0]	output[10]	output[11]	output[12]	output[13]	output[14]	output[15]	output[16]	output[17]	output[18]	...	output[21]	output[2]	output[3]	output[4]	output[5]	output[6]	output[7]	output[8]	output[9]	time
0		4		12		1		1		2		1		1		1		11		1		...		1		1		1		1		1		1		11		12		1	1514764800
1		2		12		1		1		1		1		1		1		11		1		...		1		1		1		1		1		2		12		11		1	1514764860
2		2		11		1		1		2		1		1		1		12		1		...		1		1		1		1		1		1		12		12		1	1514764920
3		2		12		1		1		2		1		1		1		12		1		...		1		1		1		1		1		1		11		12		1	1514764980
4		2		11		1		1		2		1		1		1		11		1		...		1		1		1		1		1		1		11		12		1	1514765040
5 rows × 23 columns

This API call fetched 1000 samples from 22 metrics, and completed in just over 1 second. The equivalent circ.search().fetch() statement would have taken around one minute to complete.

One drawback of CAQL fetching is that the original metric names are lost in the output (the columns are simply labeled output[N]). We are working on resolving this shortcoming.

To showcase some of the analytics features, we’ll now use CAQL to compute a rolling mean over the second largest duration metric in the above cluster, and plot the transformed data using pandas:

B = circ.caql("""

search:metric("duration") | stats:trim(1) | stats:max() | rolling:mean(10M)

""", datetime(2018,1,1), 60, 1000)
df = pd.DataFrame(B)
df['time']=pd.to_datetime(df['time'],unit='s')
df.set_index('time', inplace=True)
df.plot(figsize=(20,5), lw=.5,ylim=(0,50))

You can also fetch histogram data with circ.caql():

AH = circ.caql('search:metric:histogram("api`GET`/getState")', datetime(2018,1,1), 60*60, 24)
dfch = pd.DataFrame(AH)
dfch.head()
        output[0]						time
0	{"+29e-002": 2, "+40e-002": 6, "+50e-002": 8, ...	1514764800
1	{"+40e-002": 2, "+50e-002": 2, "+59e-002": 5, ...	1514768400
2	{"+40e-002": 3, "+50e-002": 12, "+59e-002": 4,...	1514772000
3	{"+29e-002": 1, "+40e-002": 7, "+50e-002": 21,...	1514775600
4	{"+40e-002": 6, "+50e-002": 26, "+59e-002": 15...	1514779200

We can perform a wide variety of data transformation tasks directly inside Circonus using CAQL expressions. This speeds up the computation even further. Another advantage is that we can leverage CAQL queries for live graphing and alerting in the Circonus UI.

In this example, we compute how many requests were serviced above certain latency thresholds:

B = circ.caql('''

search:metric:histogram("api`GET`/getState") | histogram:count_above(0,10,50,100,500,1000)

''', datetime(2018,1,1), 60*5, 24*20)
dfc2 = pd.DataFrame(B)
dfc2['time']=pd.to_datetime(dfc2['time'],unit='s')
dfc2.set_index('time', inplace=True)
dfc2.plot(figsize=(20,5), colormap="gist_heat",legend=False, lw=.5)

Conclusion

Getting Circonus data into Python has never been easier. We hope that this blog post allows you to get started with the new data fetching capabilities.  A Jupyter notebook version of this blog post containing the complete source code is available here. If you run into any problems or have some suggestions, feel free to open an issue on GitHub, or get in touch on our Slack channel.

Linux System Monitoring with eBPF

The Linux kernel is a ubiquitous component of modern IT systems. It provides the critical services of hardware abstraction and time-sharing to applications. The classical metrics for monitoring Linux are among the most well known metrics in monitoring: CPU utilization, memory usage, disk utilization, and network throughput. For a while now, Circonus installations have organized the key system metrics in the form of a USE Dashboard, as a high-level overview of the system resources.

While those metrics are clearly useful and important, there is a lot left to be wished for. Even the most basic metrics, like CPU utilization, have some serious flaws (cpu-load.txt) that limit their significance. Also, there are a lot of questions for which there are simply no metrics exposed (such as disk errors and failed mallocs).

eBPF is a game-changing technology that became available in recent kernel versions (v4.1 and later). It allows subscribing to a large variety of in-kernel events (kprobes, function call tracing) and aggregating them with minimal overhead. This unlocks a wide range of meaningful, precise measurements that can help narrow the observability gap. A great and ever-growing collection of system tracing tools is provided by the bcc toolkit from iovisor.
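
To give a flavor of how such in-kernel aggregation works, here is a condensed Python sketch in the spirit of bcc’s biolatency tool (this is not the Circonus plugin itself, and the traced kernel function names vary between kernel versions):

from bcc import BPF
import time

prog = """
#include <uapi/linux/ptrace.h>
#include <linux/blkdev.h>

BPF_HASH(start, struct request *);   // request -> start timestamp
BPF_HISTOGRAM(dist);                 // log2 latency histogram, in usecs

int trace_start(struct pt_regs *ctx, struct request *req) {
    u64 ts = bpf_ktime_get_ns();
    start.update(&req, &ts);
    return 0;
}

int trace_done(struct pt_regs *ctx, struct request *req) {
    u64 *tsp = start.lookup(&req);
    if (tsp != 0) {
        u64 delta = bpf_ktime_get_ns() - *tsp;
        dist.increment(bpf_log2l(delta / 1000));
        start.delete(&req);
    }
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event="blk_mq_start_request", fn_name="trace_start")
b.attach_kprobe(event="blk_account_io_completion", fn_name="trace_done")

print("Tracing block I/O latency... hit Ctrl-C to end early.")
try:
    time.sleep(60)   # collect events for one minute
except KeyboardInterrupt:
    pass
b["dist"].print_log2_hist("usecs")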

The Circonus Monitoring Agent comes with a plugin that collects eBPF metrics using the bcc toolkit (see source code & instructions here). At the time of this writing, the plugin is supported on the Ubuntu 16.04 platform. In the following examples we will demonstrate how this information can be used.

Block-I/O Latencies

The block-I/O layer of the operating system is the interface that block devices, like disks, offer to the file system. Since it is an API, it’s natural to apply the RED methodology (adapted from the SRE Book, see e.g. Tom Wilkie 2018) and monitor rate, errors, and duration. One famous example of how this information can be used is to identify environmental influences on I/O performance, as seen in Brendan Gregg – Shouting in the Datacenter (YouTube 2008). Example duration measurements can be seen in the figure below.

150M I/O events that were recorded on three disks over the period of a week

This diagram shows a total of 150M I/O events that were recorded on three disks over the period of a week. The visualization as a stand-alone histogram allows us to qualitatively compare the access latency profiles very easily. In this case, we see that the traffic pattern is imbalanced (one disk serving less than half of the load of the others), but the latency modes are otherwise very similar, indicating good (or at least consistent) health of the disks.

The next figure shows how these requests are distributed over time.

Disk array visualized as a heatmap.

This is the latency duration profile of the disk array visualized as a heatmap. The line graphs show the p10, p50, and p90 percentiles calculated over 20 minute spans. One can see how the workload changes over time. Most of the requests were issued between Sept 10th and Sept 11th 12:00, with a median performance of around 3ms.

File System Latency

From the application perspective, the file system latencies are much more relevant than block I/O latencies. The following graphic shows the latency of the read(2) and write(2) syscalls executed over the period of a few days.

The median latency of this dataset is around 5µs for reads and 14µs for writes. This is an order of magnitude faster than block I/O latencies and indicates that buffering and caching of file system accesses is indeed speeding things up.

Caveat: In UNIX systems, “everything is a file.” Hence the same syscalls are used to write data to all kinds of devices (sockets, pipes) and not only disks. The above metrics do not differentiate between those devices.

CPU Scheduling Latency

Everyone knows that systems become less responsive when they are overloaded. If there are more runnable processes than CPUs in the system, processes begin to queue and additional scheduling latency is introduced. The load average reported by top(1) gives you a rough idea of how many processes were queued for execution over the last few minutes on average (the reality is quite a bit more subtle). If this metric is higher than the number of CPUs, you will get “some” scheduling latency.

But how much scheduling latency did your application actually experience?

With eBPF, you can just measure the latency of every scheduling event. The diagram below shows the latency of 17.4B scheduling events collected over a 4 week period.

The median scheduling latency (30µs) was very reasonable. Clearly visible are several modes, which I suppose can be attributed to processes waiting behind none, one, or two other processes in the queue. The tail of the distribution shows the collateral damage caused by periods of extreme load during the collection period. The longest scheduling delay was a severe hang of 45 seconds!

Next steps

If you want to try this out on your system, you can get a free Circonus account in a matter of minutes. Installing the Circonus agent on an Ubuntu 16.04 machine can be done with a single command. Then enable the eBPF plugin on your host by following the instructions here.

It’s an ongoing effort to extend the capabilities of the eBPF plugin. Apart from the metrics shown above, the plugin also exposes rate and duration metrics for all 392 Linux system calls. There are a lot more interesting tools in iovisor/bcc waiting to be ported.

Happy Monitoring!

Circonus Update: New UI Rollout

The Circonus team is excited to announce the release of our newest update, which includes sweeping changes to the Circonus Monitoring Platform UI.

This update is part of an ongoing effort to optimize the Circonus UI for performance and usability on mobile devices, such as phones and tablets, as well as on media PCs, desktops, and laptops. Almost every change to our familiar UI directly supports that optimization, and you’ll find that just about every page is now responsive. In the future we’ll be continuing these efforts, and will be tackling the dashboards and check creation workflows next.

We’re also grouping some of our features to provide a more streamlined experience, by making improvements to how data is displayed, and changing how controls are displayed to be consistent throughout all of the different pages and views.

Look for these changes to go live this Thursday, April 26th!

What’s New?

The biggest change is that the displays are consistent for ALL of the different searchable items in Circonus. Hosts, metrics, checks, rule sets, graphs, worksheets, everything!

Every item has a List View that includes the search functionality, and a Details View with more information for each item in the list. All of these Details View pages will have the same familiar design, and the List View for each item will also have the same functionality. Users will still be able to select their preferred layout for Metrics and Hosts lists, whether they want their List View in a grid or an actual list.

Another significant change is that the List View and the Details View are separate. Circonus will no longer display a dropdown accordion with a Details View inside of the List View. Instead, you can use the List View to search for data and simply click the View button to visit a page specifically designed for viewing those details.

These views group many familiar Circonus features together in a dropdown menu that appears under the View button, as you can see in this Alert History list.

Many frequently used Circonus features are grouped under one of our two (and only two) “burger”-style menus, which are for mobile navigation and tag filters. Controls for other features have been replaced with intuitive icons in the appropriate places that will clearly indicate what users can do from different views.

Menu items are context dependent, and display options relevant to the current List View or Details View.

All of the Token API Management pages have been consolidated to a single API Token page, under the Integrations menu.

Account administration has also been consolidated and streamlined, as you can see from this view of the Team page:

FAQ

There are a lot of changes in this update, so to assist users with this transition, we’ve prepared answers for a few questions we anticipated that you might ask.

How do I view the check UUID?

The view for the API object, which includes the check UUID, is available on the checks list page by clicking the down arrow next to the View button. You can also visit the Details View page for the Check to get all pertinent Check info, including the UUID.

How do I view details for two things at once now that the List View and Details View are separate?

We recommend opening a separate Details View page for each item you want to view, by right-clicking on the View button in the List View and opening multiple new tabs.

What’s Next?

Our team is dedicated to continuously improving Circonus, and has recently prepared our roadmap for the next year, so we can confidently say there are many more exciting new features and performance enhancements on the horizon.

Our next big UI update will enhance our dashboards and the check creation workflows. These features will receive the same responsive improvements you will see in the rest of the UI, along with usability improvements.

Circonus On The Raspberry Pi

There are a lot of interesting monitoring tasks that can be facilitated with a Raspberry Pi (e.g. here, there). Circonus does not officially support “Raspbian Linux on armv6/v7” as a deployment target, but given the steady interest in this topic, we took the time to pave the way and write down some instructions in this post.

Let’s dive right in:

# Install dependencies
sudo apt-get update
sudo apt-get install git make

# Only node v4 or node v6 are supported. Install the official ARM package
wget https://nodejs.org/dist/latest-v6.x/node-v6.12.3-linux-armv6l.tar.xz
sudo mkdir -p /opt/node
sudo tar -vxf node-v6.12.3-linux-armv6l.tar.xz -C /opt/node --strip-components=1
sudo ln -s /opt/node/bin/node /usr/bin/node
sudo ln -s /opt/node/bin/npm /usr/bin/npm
sudo ln -s /opt/node/bin/node /opt/circonus/bin/node

# Install nad from sources
git clone https://github.com/circonus-labs/nad.git
cd nad
git checkout v2.5.1
sudo make install install-linux-init

# Fix nad disk plugins which depend on /proc/mdstat
sudo modprobe md

This should install and enable the monitoring agent nad on the RPI. Check that it is up and running using:

# Is it up? -> You should see a status report mentioning "active (running)"
systemctl status nad
# Does it serve metrics? -> You should see a large JSON blob, containing metrics
curl localhost:2609

Now copy and paste the one-step-install command from the Integrations > Checks > [New Host] page. In my case this is:

curl -sSL https://onestep.circonus.com/install | bash \
   -s -- \
   --key xxxxxxxxx-50ce-4118-9145-xxxxxxxxxx \
   --app circonus:osi:xxxxxx...

The credentials will be different for you. The installer should find the running agent and register checks, graphs, and the USE Dashboard for you. We needed to make some tweaks to the COSI installer itself to get this working as it is now. Special thanks goes to Matt Maier from Circonus for making this happen so quickly.

These instructions were tested on Raspbian stretch lite (Version: November 2017 / Release date: 2017-11-29 / Kernel version: 4.9) on a RPI2 Model B.

Happy Monitoring!

This post was also published on Heinrich Hartmann’s blog.

UI Redesign FAQ

Today, the Circonus team is releasing the first round of our new User Interface changes.

Since we started developing Circonus 7 years ago, we’ve worked with many customers in different spaces with different needs. Our team talked with many engineers about how they monitor their systems, what their workflow looks like, and the configuration of their ideal platform. You’ve spoken and we listened. Now, we’ve put those years of feedback to use, so that you get the benefit of that collective knowledge, with a new, improved interface for managing your monitoring.

The interface now features responsive improvements for different screen sizes (which provide better support for mobile devices) and a revised navigation menu. More changes will follow, as we innovate to adapt our interface to intuitively suit our users’ needs.

Frequently Asked Questions

Change, even change for the better, can be frustrating. This FAQ will help explain the interface changes. We’ll be adding more items to this FAQ as we get more feedback.

Where did the checks go?

The checks have been moved under the “Integrations” menu.

Where did the graphs go?

Graphs are now under the “Analytics” menu.

Where are the rulesets?

Rulesets can be found under the “Alerts” menu.

Send us your feedback

Tell us about your own Circonus experience, and let us know what you think about our new User Interface.

Some Like It Flat

JSON rules the world, much to our collective chagrin. I’ve mentioned before the atrocious shortcomings of JSON as a format and I feel deeply saddened that the format has taken the world by storm. However, it is here and we must cope… but not always.

Like every other API-centric platform in the world, we support data in and out in the ubiquitous JavaScript Object Notation format. I’ll admit, for a (more or less) machine parseable notation, it is remarkably comprehensible to humans (this is, of course, one of the leading drivers for its ubiquity). While I won’t dive into the deficits of JSON on the data-type interpretation side (which is insidious for systems that mainly communicate numerical data), I will talk about speed.

JSON is slow. JSONB was created because JSON was slow. JSONB is still slow. Object (data) serialization has always been of interest to protocols. When one system must communicate data to another, both systems must agree on a format for transmission. While JSON is naturally debuggable, it does not foster agreement (specifically on numeric values) and it is truly abysmal on the performance side. This is why so many protocol serializations exist today.

From the Java world come Thrift, its successor Avro, and MessagePack. From Python we have pickle, which somehow has escaped the Python world to inflict harm upon others. From the C and C++ world we have Cap’n Proto, Flatbuffers, and perhaps the most popular, Google Protobuf (the heart of the widely adopted gRPC protocol). Now, these serialization libraries might have come from one language world, but they’d be useless without bindings in basically every other language… which they do generally boast, with the exception of pickle.

It should be noted that not all of these serialization libraries stop at serialization. Some bring protocol specification (RPC definition and calling convention) into scope. Notwithstanding that this can be useful, the fact that they are conflated within a single implementation is a tragedy.

For IRONdb, we needed a faster alternative to JSON because we were sacrificing an absurd amount of performance to the JSON god. This is a common design in databases; either a binary protocol is used from the beginning or a fast binary protocol is added for performance reasons. I will say that starting with JSON and later adopting a binary encoding and protocol has some rather obvious and profound advantages.

By delaying the adoption of a binary protocol, we had large system deployments with petabytes of data in them under demanding, real-world use cases. This made it a breeze to evaluate both the suitability and performance gains. Additionally, we were able to understand how much performance we squandered on the protocol side vs. the encoding side. Our protocol has been HTTP and our payload encoding has been JSON. It turns out that in our use cases, the overhead for dealing with HTTP was nominal in our workloads, but the overhead for serializing and deserializing JSON was comically absurd (and yes, this is even with non-allocating, SAX-style JSON parsers).

So, Google Protobuf right? Google can’t get it wrong. It’s popular; there are a great many tools around it. Google builds a lot of good stuff, but there were three undesirable issues with Google Protobuf: (1) IRONdb is in C and while the C++ support is good, the C support is atrocious, (2) it conflates protocol with encoding so it becomes burdensome to not adopt gRPC, and (3) it’s actually pretty slow.

So, if not Google’s tech, then whose? Well, Google’s of course. It is little known that Flatbuffers is also a Google project, designed specifically as an encoding and decoding format for gaming and other high-performance applications. To understand why it is so fast, consider this: you don’t have to parse anything to access data elements of encoded objects. Additionally, it boasts the strong forward and backward compatibility guarantees that you get with other complex serialization systems like Protobuf.

The solution for us:

Same REST calls, same endpoints, same data, different encoding. Simply specify the data encoding you are sending with a ‘Content-Type: x-circonus-<datatype>-flatbuffer’ header, and the remote end can avoid copying or even parsing any memory; it just accesses it. The integration into our C code is very macro-oriented and simple to understand. Along with a roughly 1000x speedup in encoding and decoding data, we save about 75% volumetrically “over the wire.”

Site Maintenance Oct 23rd, 2017 10:00 EDT

We will be performing site wide maintenance on Monday, October 23rd, 2017 at 10:00am EDT (14:00 UTC). This maintenance is expected to last 15 minutes. During this window, the UI and API will be unavailable as we fail over to a new primary datacenter. This maintenance also includes the promotion of a new primary DB and movement of alerting services.

Over the past few weeks, we have spun up data collection to this new DC, and have been serving graph, worksheet, and dashboard data from it. During the maintenance window, data collection will continue uninterrupted. There will be an alerting outage as we switch services to their new home. Alerts that would have fired in the window will fire when we come out of maintenance; alerts that would have cleared will clear when we come out of maintenance.

Please double-check that our IPs listed in out.circonus.net are permitted through any firewalls, especially if you have rules to permit webhook notifications.

We expect no major issues with this move. If you have any questions, please contact our support at support@circonus.com for further clarification.

System Monitoring with the USE Dashboard

The USE method was developed by Brendan Gregg to study performance problems in a systematic way [1]. It provides a simple, top-down approach to identify bottlenecks quickly and reliably. The gathering of performance metrics is usually done using a myriad of different system tools (sar, iostat, vmstat) and tracers (dtrace, perf, eBPF). Circonus has now added the most relevant USE metrics to its monitoring agent, and conveniently presents them in the form of a USE Dashboard (Figure 1, live demo) that allows a full USE analysis at a single glance.

Figure 1: The USE System Dashboard. Click to view a live demo.

Outline

  1. System Monitoring
  2. The USE Method
  3. Performance Metrics
    3.1 CPU / Scheduler Monitoring
    3.2 Memory Management
    3.3 Network Monitoring
    3.4 Disks and File Systems
  4. The USE Dashboard

1. System Monitoring

System monitoring is concerned with monitoring basic system resources: CPU, Memory, Network, and Disks. These computational resources are not consumed directly by the application. Instead, the Operating System abstracts and manages these resources and provides a consistent, abstracted API to the application (“process environment”, system calls, etc.). Figure 2 illustrates the high-level architecture. A more detailed version can be found in Gregg [2, 3].

Figure 2: High-level System Overview

One critical objective of system monitoring is to check how the available resources are utilized. Typical questions are: Is my CPU fully utilized? Is my application running out of memory? Do we have enough disk capacity left?

While a fully utilized resource is an indication of a performance bottleneck, it might not be a problem at all. A fully utilized CPU means only that we are making good use of the system. It starts causing problems only when incoming requests start queuing up or producing errors, and hence the performance of the application is impacted. But queuing does not only occur in the application layer. Modern software stacks use queuing in all system components to improve performance and distribute load. The degree to which a resource has extra work that it cannot service is called saturation [3, p. 42], and is another important indicator of performance bottlenecks.

2. The USE Method

The USE method, by Gregg, is an excellent way to identify performance problems quickly. It uses a top down approach to summarize the system resources, which ensures that every resource is covered. Other approaches suffer from a “street light syndrome,” in that the focus lies on parts of the system where metrics are readily available. In other cases, random changes are applied in the hope that the problems go away.

The USE method can be summarized as follows:

For each resource, check utilization, saturation, and errors.

The USE analysis is started by creating an exhaustive list of the resources that are consumed by the application. The four resource types mentioned above are the most important ones, but there are more resources, like the I/O bus, memory bus, and network controllers, that should be included in a thorough analysis. For each resource, errors should be investigated first, since they impact performance and might not be noticed immediately when the failure is recoverable. Then, utilization and saturation are checked.

For more details about the USE method and its application to system performance analysis, the reader is referred to the excellent book by Gregg [3].

3. Performance Metrics

It’s not immediately clear how utilization, saturation, and errors can be quantified for different system resources. Fortunately, Gregg has compiled a sensible list of indicators [2] that are available on Linux systems. We have taken this list as our starting point to define a set of USE metrics for monitoring systems with Circonus. In this section, we will go over each of them and explain their significance.

3.1. CPU Monitoring

Utilization Metrics:

  1. cpu`idle — time spent in the idle thread and not waiting on I/O (yellow)
  2. cpu`wait_io — time spent in the idle thread while waiting on IO (dark yellow)
  3. cpu`user — time spent executing user code (blue)
  4. cpu`system + cpu`intr — time spent executing OS-kernel code (dark blue)

Those metrics should give you a rough idea of what the CPU was doing during the last reporting period (one minute). Blue colors represent time the system spent doing work; yellow colors represent time the system spent waiting.

Like all other monitoring tools we are aware of, the values are derived from `/proc/stat` [4]. However, they should be taken with the necessary precaution [5, 6]: those counters do not usually add up to 100%, which is already a bad sign. Also, accounting takes place in units of full CPU time slices (jiffies) at the time of the clock interrupt. There are many voluntary context switches within a single time slice that are missed this way. Tickless kernels and varying clock speeds (Intel Turbo Boost/Turbo Step) further blur the meaning of these metrics. For the subtleties in accounting wait_io on multiprocessor systems, see A. Veithen [11].

Nevertheless, these metrics are used everywhere to measure CPU utilization, and they have proven to be a valuable first source of information. We hope that we will be able to replace them with more precise metrics in the future.

There are some differences from Gregg [2]: there, `vmstat “us” + “sy” + “st”` is the suggested utilization metric. We account steal time (“st”) as idle.
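
As an illustration of how these utilization figures can be derived (a minimal sketch, not the NAD plugin’s implementation), consider two readings of the aggregate cpu line in /proc/stat:

import time

def read_cpu_ticks():
    # first line of /proc/stat: cpu user nice system idle iowait irq softirq steal ...
    with open("/proc/stat") as f:
        return [int(x) for x in f.readline().split()[1:]]

a = read_cpu_ticks()
time.sleep(1)
b = read_cpu_ticks()

delta = [y - x for x, y in zip(a, b)]
total = float(sum(delta)) or 1.0
user, nice, system, idle, iowait, irq, softirq = delta[:7]
steal = delta[7] if len(delta) > 7 else 0

print("user    %5.1f%%" % (100 * (user + nice) / total))
print("system  %5.1f%%" % (100 * (system + irq + softirq) / total))
print("wait_io %5.1f%%" % (100 * iowait / total))
print("idle    %5.1f%%" % (100 * (idle + steal) / total))   # steal accounted as idle, as above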

Saturation Metrics:

  1. cpu`procs_runnable (normalized, corrected) — runnable processes per CPU (purple)
  2. loadavg`1 (normalized) — one-minute load average per CPU (dark purple)
  3. Saturation threshold guide at value one. (black)

The kernel keeps runnable processes waiting on a queue. The length of this queue at any given point in time is a good metric for CPU saturation. Unfortunately, this number is not directly exposed by the kernel. Via `/proc/stat` we get the total number of runnable processes (`procs_running`), which includes queued processes as well as currently running processes. We report this number as the `procs_runnable` metric. While it is a straightforward measure, it suffers from the low sampling interval (only once a minute) and an observation bias: the Circonus agent NAD currently lists 3 running processes on an idle system. We account for these problems by (a) subtracting the processes run by the monitoring agent (3) and (b) dividing by the number of CPU cores on the system. In this way, a value of one, which is marked with the guide, represents a system that has one runnable process per core.

The load average is a smoothed-out version of the procs_runnable metric, maintained by the kernel. It’s typically sampled every 5 seconds and aggregated with an exponential smoothing algorithm [7]. Recent kernel versions spend a lot of effort to maintain a meaningful load average metric across systems with a high number of CPUs and tickless kernels. This metric is divided by the number of CPU cores as well.

While 1min, 5min, and 15min load averages are maintained by the kernel, we only show the 1min average, since the others don’t provide any added value when plotted over time.

Both collected metrics are similar in their interpretation. If the value of either of these is larger than one (the guide), you have processes queuing for CPU time on the machine.

Differences from Gregg [2]: in Gregg’s method, only the procs_runnable statistic is considered. We think load averages are valuable as well.
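
A minimal sketch of the normalization described above (the agent-process constant is illustrative; the exact NAD correction may differ):

import os

AGENT_PROCS = 3                      # processes attributable to the monitoring agent itself
CORES = os.cpu_count() or 1

def procs_runnable_normalized():
    running = 0
    with open("/proc/stat") as f:
        for line in f:
            if line.startswith("procs_running"):
                running = int(line.split()[1])
    return max(running - AGENT_PROCS, 0) / float(CORES)

def loadavg1_normalized():
    with open("/proc/loadavg") as f:
        return float(f.read().split()[0]) / CORES

print("procs_runnable per core:", procs_runnable_normalized())
print("loadavg`1 per core:     ", loadavg1_normalized())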

Error Metrics:

CPU error metrics are hard to come by. If CPU performance counters are available (often not the case on virtualized hardware), perf(1) can be used to read them out. At the moment, Circonus does not provide CPU error metrics.

3.2. Memory Management

This section is concerned with the memory capacity resource. The bandwidth of the memory interconnect is another resource that can be worth analyzing, but it is much harder to measure.

Utilization Metrics:

  1. vm`meminfo`Buffers — file system buffers (darker yellow)
  2. vm`meminfo`Cached — file-system caches (dark yellow)
  3. vm`meminfo`MemFree — free memory (yellow)
  4. vm`meminfo`Used (computed) — memory used by the application (blue)

The OS uses memory that is not utilized by the application for caching file system content. These memory pages can be reclaimed by the application as needed, and are usually not a problem for system performance.
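
For illustration, this utilization split can be read directly from /proc/meminfo; here is a minimal sketch (the exact formula behind the computed “Used” metric is an assumption):

def meminfo():
    # parse /proc/meminfo into a dict of kB values
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            info[key] = int(rest.split()[0])
    return info

m = meminfo()
used_kb = m["MemTotal"] - m["MemFree"] - m["Buffers"] - m["Cached"]
for name, kb in [("Used", used_kb), ("Buffers", m["Buffers"]),
                 ("Cached", m["Cached"]), ("MemFree", m["MemFree"])]:
    print("%-8s %8.1f MB" % (name, kb / 1024.0))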

Saturation Metrics:

  1. vm`info`pg_scan — the number of memory pages scanned per second (purple)
  2. vm`swap`used — usage of swap devices (purple, zero in the example)

When free memory is close to exhausted, the system begins freeing memory from buffers and caches, or begins moving pages to a swap partition on disk (if present). The page scanner is responsible for identifying suitable memory pages to free. Hence, scanning activity is an indicator of memory saturation. A growing amount of used swap space is also an indicator of saturated memory.

When the system has neither memory nor swap space left, it must free memory by force. Linux does this by killing applications that consume too much memory. When this happens, we have an OOM-(“out of memory”)-event, which is logged to dmesg.

Differences from Gregg [2]: we are missing metrics for swapping (i.e. anonymous paging) and OOM events.

Errors:

Physical memory failures are logged to dmesg. Failed malloc(3) calls can be detected using SystemTap. We don’t have any metrics for either of them at the moment.

3.3. Network Monitoring

Utilization Metrics:

  1. if`<interface>`in_bytes — inbound network throughput in bytes/sec
  2. if`<interface>`out_bytes — outbound network throughput in bytes/sec

The network utilization can be measured as throughput divided by the bandwidth (maximal throughput) of each network interface. A full-duplex interface is fully utilized if either inbound or outbound throughput exhausts the available bandwidth. For half-duplex interfaces, the sum of inbound and outbound throughput is the relevant metric to consider.

For graphing throughput we use a logarithmic scale, so that a few kb/sec remain visibly distinct from the x-axis, and set the y-limit to the available bandwidth. The available bandwidth is often not exposed by virtual hardware; in this case, we don’t set a y-limit.
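
A minimal sketch of how in/out throughput can be derived from /proc/net/dev counters (not the agent’s implementation):

import time

def read_net_bytes():
    # /proc/net/dev: per interface, field 0 is rx bytes and field 8 is tx bytes
    stats = {}
    with open("/proc/net/dev") as f:
        for line in f.readlines()[2:]:        # skip the two header lines
            iface, data = line.split(":", 1)
            fields = data.split()
            stats[iface.strip()] = (int(fields[0]), int(fields[8]))
    return stats

a = read_net_bytes()
time.sleep(1)
b = read_net_bytes()
for iface, (rx, tx) in sorted(b.items()):
    rx0, tx0 = a.get(iface, (rx, tx))
    print("%-8s in: %10d bytes/sec   out: %10d bytes/sec" % (iface, rx - rx0, tx - tx0))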

Saturation Metrics:

  1. tcp`segments_retransmitted (purple)
  2. if`<interface>`{in,out}_drop — dropped packets due to full buffers (purple, zero)
  3. if`<interface>`{in,out}_fifo_overrun — dropped packets due to full buffers (purple, zero)

Network saturation is hard to measure. Ideally, we’d like to know how many packets are queued in send/receive buffers, but these statistics do not seem to be exposed via /proc. Instead, we have to settle for the indirect indicators which are available, such as TCP-level retransmits, as well as drop and overrun counts.

Error Metrics:

  1. if`<interface>`{in,out}_errors — the number of errors that occurred, e.g. invalid packets received

All metrics covered in this section should be taken with a good bit of caution. The Linux networking stack is a large, undocumented codebase with a lot of known bugs in its accounting mechanisms; see Damato [8].

3.4. Disks and File Systems

Utilization Metrics:

  1. diskstats`<device>`io_ms — the percentage of time the device has been utilized during the last reporting interval

The disk utilization is measured per device and not per file system. We simply record the percentage of time the device was busy during the last reporting period. This metric is read from `/proc/diskstats`.

Saturation Metrics:

  1. diskstats`<device>`io_ms_weighted — average number of in-flight I/O operations

This metric is exposed under the name “avgqu-sz” by `sar -d` and `iostat -x`, and measures the number of requests that are currently queued or being processed by the disk. The metric is derived from the weighted_io_ms counter in `/proc/diskstats`, and involves a little bit of mathematics [9].

Error Metrics:

According to Gregg [2], disk errors can be found in /sys/devices/<>/ioerr_cnt, smartctl, or by “tracing of the IO subsystem”. However, in the (virtualized) systems we used for testing, these metrics were not exposed. Hence, at the moment, we don’t include any disk error metrics.

4. The USE Dashboard

The USE Dashboard, shown in Figure 1 above, combines all metrics discussed above in a single dashboard for each host.

Each row corresponds to a resource type, and each column contains an analysis dimension: Utilization, Saturation, and Errors. To perform a USE performance analysis as outlined in Gregg [1, 2], you would traverse the dashboard line by line and check:

  1. Are errors present? => Graphs in Column 3 are non-zero.
  2. Is saturation present? => Graphs in Column 2 show large values over an extended time period.
  3. Is the resource fully utilized? => Graphs in Column 1 are maxed out.

The graphs are organized in such a way that all these checks can be done at a single glance.

Discussion

The USE Dashboard allows a rapid performance analysis of a single host. Instead of ssh-ing into a host and collecting statistics from a number of system tools (vmstat, iostat, sar), we get all relevant numbers together with their historical context in a single dashboard. We found this visualization valuable to judge the utilization of our own infrastructure, and as a first go-to source for analyzing performance problems.

However, the work on this dashboard is far from complete. First, we are lacking error metrics for CPU, Memory, and Disk resources. We have included text notes on how to get them manually in the USE Dashboard, to make those blind spots “known unknowns”. Another big area of work is the quality of the measurements. As explained above, even basic metrics like CPU utilization have large margins of error and conceptual weaknesses in the measurement methodology. Also, we did not cover all resources that a system provides to the application; e.g., we don’t have metrics about the File System, Network Controllers, or I/O and Memory interconnects, and even basic metrics about the physical CPU and Memory resources are missing (instructions per second, memory ECC events). Figure 3 below positions the covered resources into the high-level system overview presented in the introduction.

Figure 3: Metrics displayed in the USE dashboard per resource

Try it yourself!

To try the USE Dashboard for yourself, log into your Circonus account, click on “Checks” > “+ New Host” to provision a new host with a single command. The USE Dashboard will be automatically created. A link is provided at the command line.

No Circonus account, yet? Get one for free here.

References

  1. Gregg – Thinking Methodically about Performance, CACM 2013
  2. Gregg – Use Method: Linux Performance Checklist
  3. Gregg – Systems Performance, Prentice Hall 2014
  4. https://www.kernel.org/doc/Documentation/filesystems/proc.txt
  5. https://github.com/torvalds/linux/blob/master/Documentation/cpu-load.txt#L20-L22
  6. Gregg – Broken Linux Performance Tools, Scale14x (2016)
  7. https://en.wikipedia.org/wiki/Load_(computing)
  8. Damato – All of your network monitoring is probably wrong
  9. Hartmann – Monitoring Queue Sizes
  10. https://github.com/iovisor/bcc
  11. Andreas Veithen – The precise meaning of IO wait on Linux (blog)

Monitoring Queue Sizes

At Circonus we collect a number of metrics about block devices. An interesting one is the average length of the request queue, which is exposed by iostat[1]:

aqu-sz
  The average queue length of the requests that were issued to the device.
  Note: In previous versions, this field was known as avgqu-sz.

To collect aqu-sz as a metric, we don’t want to fire up iostat on each collection interval. So we needed to understand how this value is derived from statistics exposed by the operating system. It turns out that there is an interesting bit of mathematics involved in deriving this value. Here is how it works.

It was already noted by Baron Schwartz in 2010[2] that the calculations behind iostat are interesting. This post can be seen as a formalized elaboration.

Step 1: Checking the Implementation

The first step in understanding the iostat calculation of aqu-sz is to check the implementation (source):

/* aqu-sz */
    cprintf_f(NO_UNIT, 1, 6, 2, S_VALUE(ioj->rq_ticks, ioi->rq_ticks, itv) / 1000.0);

Here S_VALUE is defined (source) as:

#define S_VALUE(m,n,p)		(((double) ((n) - (m))) / (p) * HZ)

and rq_ticks is Field 11 from /proc/diskstats (source).

Digging around in the source code a little more, you will be able to verify for yourself that the following formula is used to calculate aqu-sz:

$$ aqusz = \frac{F_{11}(t_1) - F_{11}(t_0)}{t_1 - t_0} $$

where $t_0, t_1$ are the measurement timestamps in ms, and $F_{11}(t)$ is the value of Field 11 at time $t$.

So, the average queue size can be calculated as a slope or discrete derivative of a mystical field from /proc/diskstats. Interesting!
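
As a sketch, the same calculation can be reproduced with a few lines of Python reading Field 11 directly (the device name is hypothetical):

import time

def read_field11(device):
    # each /proc/diskstats line: major minor name F1 ... ; F11 is the 14th whitespace-separated column
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                return int(fields[13])   # F11 = weighted ms spent doing I/O
    raise ValueError("device %r not found" % device)

device = "sda"                       # hypothetical device name
t0, f0 = time.time(), read_field11(device)
time.sleep(10)
t1, f1 = time.time(), read_field11(device)

aqu_sz = (f1 - f0) / ((t1 - t0) * 1000.0)   # numerator and denominator both in ms
print("aqu-sz for %s over the last %.0fs: %.2f" % (device, t1 - t0, aqu_sz))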

Step 2: Checking the kernel documentation

Let’s see how that works. The kernel documentation[3] says the following:

Field 11 -- weighted # of milliseconds spent doing I/Os
  This field is incremented at each I/O start, I/O completion, I/O
  merge, or read of these stats by the number of I/Os in progress
  (field 9) times the number of milliseconds spent doing I/O since the
  last update of this field.  This can provide an easy measure of both
  I/O completion time and the backlog that may be accumulating.

So:

$$
F_{11}(t_1) = F_{11}(t_0) + (\text{I/Os in progress at time $t_1$}) \times (\text{ms spent doing I/O in $t_0,t_1$})
$$

Where $t_0$ is the time of the last update, and $t_1$ is the current time.

So if we want to calculate the value of $F_{11}$ at a time $t$, we have to recursively apply this rule until the beginning of time (boot):

$$
F_{11}(t) = \sum_{t_i} (\text{I/Os in progress at time $t_i$}) \times (\text{ms spent doing I/O in $t_{i-1},t_{i}$})
$$

where $t_i$ runs through the times of every I/O event in the system. For the mathematical reader, this sum has a very familiar form: it sums function values times a little time interval. This is precisely how integration over time works. So we are inclined to compare it with:

$$
F_{11}(T) \overset{?}{\leftrightarrow} \int_0^T (\text{number of I/Os in progress at time t}) \text{dt}
$$

This is not quite right, since we should only sum over I/O time, and it’s not clear that the sum can be replaced by an integral, but it points us in the right direction. Assuming this were the case, we would get the following expression for `avgqz`:

$$
avgqz \overset{?}{\leftrightarrow}  \frac{\int_{t_0}^{t_1} (\text{I/Os in progress at time t}) \text{dt}}{t_1 - t_0}.
$$

This is a very common way to express a mean value of a function. In this case, we average the number of I/Os in progress, and not the queue length, but that kind of imprecise wording is commonly found in man pages.

So all this would make sense, if only Field 11 accounted for the total time, and not just the time “spent doing I/O”, and if we could replace the sum by an integral!

Step 3: A mathematical model

In order to make the above idea precise, let’s introduce some notation:

\begin{align*}
ios(t) &= \text{(number of I/O operations in progress at time t)} \\
busy(t) &= \text{(one if the device is processing IO at time t, zero otherwise)}
\end{align*}

Here, again $t$ denotes system uptime in ms. To warm up, let’s first look at the integral:

$$
Busy(T) = \int_0^T busy(t) \text{dt} = \int_{t : busy(t) = 1} 1 \text{dt}
$$

This integral measures the total time the system was busy processing I/O since boot. We can get the utilization of the system during a time interval between $t_0$ and $t_1$ as:

$$
\frac{Busy(t_1) - Busy(t_0)}{t_1 - t_0} = \frac{\int_{t_0}^{t_1} busy(t) \text{dt}}{t_1-t_0} = \frac{\text{time spent doing IO}}{\text{total time}} = util(t_0, t_1)
$$

This is another very important statistic, which is reported by `iostat -x` under the %util column.

To get a mathematical description of Field 11, let’s try the following:

$$ F(T) := \int_0^T ios(t) busy(t) \text{dt} $$

In this way we have:

$$ F(t_1) - F(t_0) = \int_{t_0}^{t_1} ios(t) busy(t) \text{dt} $$

or, equivalently:

$$ F(t_1) = F(t_0) + \int_{t_0}^{t_1} ios(t) busy(t) \text{dt} $$

Now, if $ios(t)$ is constant during the time interval $[t_0,t_1]$, then:

$$ F(t_1) = F(t_0) + ios(t_1) \times \int_{t_0}^{t_1} busy(t) \text{dt} $$

Looking closely at this, we see that this is precisely the recursion formula for Field 11 from /proc/diskstats:

$$ F(t_1) = F(t_0) + (\text{IO ops in progress at time $t_1$}) \times (\text{number of ms spent doing I/O in $t_0,t_1$}) $$

And if $t_0,t_1$ are adjacent I/O events (start, complete, merge), the assumption that $ios(t)$ is constant in between is justified. Hence we see that $F_{11} = F$.

Step 4: This can’t be true!

Now that we have a firm grip of $F_{11}$, we can start examining the claimed formula for the average queue size

$$ aqusz = \frac{F_{11}(t_1) - F_{11}(t_0)}{t_1 - t_0} = \frac{\int_{t_0}^{t_1} ios(t) busy(t) \text{dt}}{t_1 - t_0} $$

Is this indeed a sensible measure for the average queue size?

It does not seem to be the case. Take, for example, a time interval where $ios(t) = 10$ but $busy(t) = 0$: the average queue size should be 10, but the integral evaluates to 0, which is clearly not a sensible result.

Step 5: But is it really?

But is this really a valid example? If the two functions busy() and ios() were really independent, then this condition could certainly occur. In the analogous case of CPU scheduling, cases like these, where there are runnable threads but the CPU is busy doing something else (interrupt handlers, hypervisor), can indeed happen.

But is this really the case for block I/O scheduling? Another look at the documentation[3] reveals the following:

    Field  9 -- # of I/Os currently in progress
        The only field that should go to zero. Incremented as requests are
        given to appropriate struct request_queue and decremented as they finish.
    Field 10 -- # of milliseconds spent doing I/Os
        This field increases so long as field 9 is nonzero.

So, Field 9 is our function $ios(t)$ and Field 10 is actually our function $Busy(t)$! And they indeed have a relation:

> "[Busy(t)] increases as long as [ios(t)] is nonzero"

In other words, $busy(t) = 1$ if $ios(t) > 0$, and $busy(t) = 0$ otherwise!

Revisiting our definition of $F$, we find the following (explanations follow)

\begin{align*}
F(T) &:= \int_{t \in [0,T]} ios(t) busy(t) \text{dt} \\
&= \int_{t \in [0,T], busy(t) = 1} ios(t) busy(t) \text{dt} + \int_{t \in [0,T], busy(t) = 0} ios(t) busy(t) \text{dt} \\
&= \int_{t \in [0,T], busy(t) = 1} ios(t) \text{dt} + 0 \\
&= \int_{t \in [0,T], ios(t) > 0} ios(t) \text{dt} + \int_{t \in [0,T], ios(t) = 0} ios(t) \text{dt} \\ &= \int_{t \in [0,T]} ios(t) \text{dt}
\end{align*}

Let’s break that down. In the first step, we divide the integral into two parts, one where $busy(t) = 1$ and one where $busy(t) = 0$; the part where $busy(t) = 0$ evaluates to 0. In the next step, we note that $busy(t) = 0$ is equivalent to $ios(t) = 0$, so we can replace the condition. Next, we complete the integral by introducing another 0-summand: the integral over $ios(t) = 0$, which evaluates to zero as well. Finally, we put things back together.

We see that for the calculation of Field 11, it does not matter if we include $busy(t)$ in the integrand. We end up with the same number whether we sum over “the number of milliseconds *spent doing I/O* since the last update of this field” or just “the number of milliseconds since the last update of this field”:

\begin{align*}
avgqz(t_0, t_1)
&= \frac{F_{11}(t_1) - F_{11}(t_0)}{t_1 - t_0} \\
& \overset{!}{=} \frac{\int_{t_0}^{t_1} ios(t) \text{dt}}{t_1 - t_0} \\
&= \text{Integral-average over the number of IOs in progress within $t_0,t_1$.}
\end{align*}

Conclusion

We have seen that the aqu-sz reported by iostat indeed has the interpretation of an integral average over the number of I/Os in progress. We find it interesting to see how calculus can be applied to various state accounting metrics to clarify their relationships. Moreover, the ability to express integral averages as discrete derivatives of a counter ($F_{11}$) is another remarkable takeaway.

References

  1. man iostat
  2. Schwartz – How iostat computes metrics (2010)
  3. I/O statistics fields