Linux System Monitoring with eBPF

The Linux kernel is a ubiquitous component of modern IT systems. It provides the critical services of hardware abstraction and time-sharing to applications. The classical metrics for monitoring Linux are among the best-known metrics in monitoring: CPU utilization, memory usage, disk utilization, and network throughput. For a while now, Circonus installations have organized the key system metrics in the form of a USE Dashboard, which gives a high-level overview of the system resources.

While those metrics are clearly useful and important, they leave a lot to be wished for. Even the most basic metrics like CPU utilization have some serious flaws (cpu-load.txt) that limit their significance. There are also many questions for which no metrics are exposed at all (such as disk errors and failed mallocs).

eBPF is a game-changing technology that became available in recent kernel versions (v4.1 and later). It allows subscribing to a large variety of in-kernel events (Kprobes, function call tracing) and aggregating them with minimal overhead. This unlocks a wide range of meaningful, precise measurements that can help narrow the observability gap. A great and ever-growing collection of system tracing tools is provided by the bcc toolkit by iovisor.
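
To give a feel for what "aggregating" means here: bcc tools such as biolatency and runqlat do not ship every event to user space. Instead, they increment a counter in a power-of-two histogram while still inside the kernel. The following pure-Python sketch mimics that bucketing idea in user space. The function names log2_bucket and aggregate are made up for this illustration, the sample values are invented, and this is not the code of the Circonus plugin or of bcc.

```python
from collections import Counter


def log2_bucket(value_us: int) -> int:
    """Return the index of the power-of-two bucket that holds value_us.

    Bucket k covers the range [2**k, 2**(k+1) - 1].
    The value 0 is placed in bucket 0 together with 1.
    """
    return max(value_us, 1).bit_length() - 1


def aggregate(latencies_us):
    """Count how many latency samples fall into each power-of-two bucket."""
    hist = Counter()
    for value in latencies_us:
        hist[log2_bucket(value)] += 1
    return hist


if __name__ == "__main__":
    # Invented latency samples in microseconds, for illustration only.
    sample = [3, 5, 14, 30, 30, 2900, 45_000_000]
    for bucket, count in sorted(aggregate(sample).items()):
        low, high = 2 ** bucket, 2 ** (bucket + 1) - 1
        print(f"{low:>10} .. {high:<10} us : {count}")
```

Because only a handful of counters are kept per metric, the cost per event stays small and the amount of data copied out of the kernel does not grow with the event rate.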

The Circonus Monitoring Agent comes with a plugin that collects eBPF metrics using the bcc toolkit (see source code & instructions here). At the time of this writing, the plugin is supported on the Ubuntu 16.04 platform. In the following examples we will demonstrate how this information can be used.

Block-I/O Latencies

The block-I/O layer of the operating system is the interface that block devices, such as disks, offer to the file system. Since this interface is an API, it’s natural to apply the RED methodology (adapted from the SRE Book, see e.g. Tom Wilkie 2018) and monitor rate, errors, and duration. One famous example of how this information can be used is identifying environmental influences on I/O performance, as seen in Brendan Gregg's Shouting in the Datacenter (YouTube 2008). Example duration measurements can be seen in the figure below.

150M I/O events that were recorded on three disks over the period of a week.

This diagram shows a total of 150M I/O events that were recorded on three disks over the period of a week. The visualization as a stand-alone histogram allows us to qualitatively compare the access latency profiles very easily. In this case, we see that the traffic pattern is imbalanced (one disk serves less than half of the load of the others), but the latency modes are otherwise very similar, indicating good (or at least consistent) health of the disks.

The next figure shows how these requests are distributed over time.

Disk array visualized as a heatmap.

This is the latency profile of the disk array visualized as a heatmap. The line graphs show the p10, p50, and p90 percentiles calculated over 20-minute spans. One can see how the workload changes over time. Most of the requests were issued between Sept 10th and Sept 11th 12:00, with a median latency of around 3ms.
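
As a rough illustration of how percentiles such as p10, p50, and p90 can be derived from histogram data for one time span, here is a minimal sketch. The function name percentile_from_histogram, the bucket layout, and the sample counts are assumptions made for this example. It is not the algorithm Circonus uses, and it simply reports the upper bound of the bucket in which the percentile falls.

```python
def percentile_from_histogram(buckets, q):
    """Estimate the q-th percentile (0 < q <= 100) from (upper_bound, count) pairs.

    Returns the upper bound of the first bucket at which the cumulative
    count reaches q percent of all samples, so the result is only as
    precise as the bucket width.
    """
    buckets = sorted(buckets)
    total = sum(count for _, count in buckets)
    if total == 0:
        raise ValueError("histogram is empty")
    threshold = total * q / 100.0
    running = 0
    for upper_bound, count in buckets:
        running += count
        if running >= threshold:
            return upper_bound
    return buckets[-1][0]


if __name__ == "__main__":
    # Hypothetical 20-minute window: (bucket upper bound in ms, number of I/Os).
    window = [(1, 50), (2, 150), (4, 500), (8, 250), (16, 50)]
    for q in (10, 50, 90):
        print(f"p{q} <= {percentile_from_histogram(window, q)} ms")
```

Repeating this calculation for each consecutive window yields one point per window for each percentile line.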

File System Latency

From the application perspective, the file system latencies are much more relevant than block I/O latencies. The following graphic shows the latency of the read(2) and write(2) syscalls executed over the period of a few days.

The median latency of this dataset is around 5 µs for read and 14 µs for write accesses. This is an order of magnitude faster than block I/O latencies and indicates that buffering and caching of file system accesses is indeed speeding things up.

Caveat: In UNIX systems, “everything is a file.” Hence the same syscalls are used to write data to all kinds of devices (sockets, pipes) and not only disks. The above metrics do not differentiate between those devices.
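
To make the caveat concrete, the sketch below shows from user space that the same kind of file descriptor can refer to very different objects. It uses os.fstat and the stat module from the Python standard library. The helper name describe_fd is made up for this example, and this is not something the eBPF plugin does.

```python
import os
import socket
import stat
import tempfile


def describe_fd(fd: int) -> str:
    """Return a human-readable type for the object behind a file descriptor."""
    mode = os.fstat(fd).st_mode
    if stat.S_ISREG(mode):
        return "regular file"
    if stat.S_ISSOCK(mode):
        return "socket"
    if stat.S_ISFIFO(mode):
        return "pipe"
    if stat.S_ISCHR(mode):
        return "character device"
    if stat.S_ISBLK(mode):
        return "block device"
    if stat.S_ISDIR(mode):
        return "directory"
    return "other"


if __name__ == "__main__":
    read_end, write_end = os.pipe()
    sock_a, sock_b = socket.socketpair()
    with tempfile.TemporaryFile() as tmp:
        # read(2) and write(2) work on all three descriptors.
        for name, fd in (("tmp file", tmp.fileno()),
                         ("pipe", read_end),
                         ("socketpair", sock_a.fileno())):
            print(f"{name}: {describe_fd(fd)}")
    os.close(read_end)
    os.close(write_end)
    sock_a.close()
    sock_b.close()
```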

CPU Scheduling Latency

Everyone knows that systems become less responsive when they are overloaded. If there are more runnable processes than CPUs in the system, processes begin to queue and additional scheduling latency is introduced. The load average reported by top(1) gives you a rough idea of how many processes were queued for execution over the last few minutes on average (the reality is quite a bit more subtle). If this metric is higher than the number of CPUs, you will get “some” scheduling latency.
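
The comparison of load average against CPU count can be sketched in a few lines with the Python standard library. The helper name load_per_cpu is made up for this example. A value above 1.0 only hints that processes may have been queuing; it says nothing about how long any single process had to wait.

```python
import os


def load_per_cpu():
    """Return the 1, 5, and 15 minute load averages divided by the CPU count."""
    one, five, fifteen = os.getloadavg()
    cpus = os.cpu_count() or 1  # cpu_count() may return None
    return tuple(load / cpus for load in (one, five, fifteen))


if __name__ == "__main__":
    # Note: on Linux the load average also counts tasks in
    # uninterruptible sleep, not only runnable ones.
    for label, value in zip(("1 min", "5 min", "15 min"), load_per_cpu()):
        hint = "possible queuing" if value > 1.0 else "ok"
        print(f"{label}: {value:.2f} per CPU ({hint})")
```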

But how much scheduling latency did your application actually experience?

With eBPF, you can just measure the latency of every scheduling event. The diagram below shows the latency of 17.4B scheduling events collected over a 4 week period.

The median scheduling latency (30 µs) was very reasonable. Clearly visible are several modes, which I suppose can be attributed to processes waiting behind none, one, or two other processes in the queue. The tail of the distribution shows the collateral damage caused by periods of extreme load during the collection period. The longest scheduling delay was a severe hang of 45 seconds!

Next steps

If you want to try this out on your system, you can get a free Circonus account in a matter of minutes. Installing the Circonus agent on an Ubuntu 16.04 machine can be done with a single command. Then enable the eBPF plugin on your host by following the instructions here.

It’s an ongoing effort to extend the capabilities of the eBPF plugin. Apart from the metrics shown above, the plugin also exposes rate and duration metrics for all 392 Linux system calls. There are a lot more interesting tools in iovisor/bcc waiting to be ported.

Happy Monitoring!