Putting Linux kernel block I/O schedulers under the eBPF microscope
eBPF tracing is a broad and deep subject, and can be a bit daunting at first sight. However, when Brendan Gregg issued the dictum “Perhaps you’d like a new year’s resolution: learn eBPF!”, I figured it was as good a time as any to do something fun with it. Here at Circonus, we’ve talked about eBPF previously, so I had a starting point to look for an interesting problem to solve. Having spent a lot of time digging into the guts of Linux kernel schedulers, I decided to look at three different block I/O schedulers impacted disk performance. I set out expecting to see differing distributions of latencies for each block scheduler, but ultimately found that I didn’t understand low-level systems behavior to the degree I thought I did.
The Testing Environment
I configured a testing environment on my 2011 iMac (3.4Ghz quad core i7, Crucial M4 750GB SSD) using VMWare Fusion 10 configured with one CPU as the virtual machine provider. In an ideal world, I would have used a bare metal setup, but it’s been several years since I’ve owned rack-mounted hardware (I miss it, mostly). For the purposes of my investigation, being able to do an apples-to-apples comparison with different kernel schedulers would suffice.
I installed an Ubuntu 16.04 host (Xenial Xerus), which loaded the 4.4.0-141-generic kernel (eBPF requires a minimum kernel version of 4.1). I installed the iovisor BCC 0.5 toolkit from source; BCC is under rapid development and I wanted to make sure to match the versions that my colleague Heinrich had used in his eBPF blog post. I then installed the Circonus Agent and the bccbpf plugin to capture BPF metrics. At this point, I was able to capture several points of telemetry as histograms; block I/O latency, runq latency, and a number of syscall latencies as well.
I/O Schedulers and Sysbench
Ubuntu Linux 16 provides 3 different I/O schedulers with the default kernel – noop, deadline, and cfq (completely fair queuing). Each scheduler prioritizes disk operations using a different algorithm. While investigating these schedulers I came across an excellent article by Ben Cane titled “Improving Linux System Performance with I/O Scheduler Tuning”. As documented in that post, Ubuntu allows one to change the I/O scheduler on a per-device basis at runtime just by
echoing a new value to a file.
root@ebpfiotesting:~# cat /sys/block/sda/queue/scheduler noop [deadline] cfq root@ebpfiotesting:~# echo 'noop' > /sys/block/sda/queue/scheduler root@ebpfiotesting:~# cat /sys/block/sda/queue/scheduler [noop] deadline cfq
Sysbench is a benchmark tool based on LuaJIT. It provides an easy to use file I/O benchmark as well as a number of other system-oriented benchmarks. I planned to run a number of file I/O benchmarks and compare the distribution of block I/O latencies measured by eBPF to evaluate how each of the different schedulers performed. In theory, I would see a different distribution for each scheduler, along with a range of statistics such as 99th percentile, median, and modes describing the scheduler latency performance for writing to block devices.
Command to run a random read write benchmark lasting one hour, 16k block size writes, single thread:
sysbench --max-requests=1000000000 --max-time=3600 --num-threads=1 --test=fileio --file-test-mode=rndrw run
I executed a one hour benchmark for each scheduler of randomly ordered reads and writes, using a single thread and 16k block size writes. For each test, I captured several BPF metrics, but focused on the block I/O latency metrics. Each benchmark generated 1.82k samples, which I displayed as a summary histogram to examine the latency profile.
Why are the schedulers all exhibiting nearly the same latency profiles?
The distributions are all bimodal, and the summary statistics from each are largely identical. Why are the schedulers all exhibiting nearly the same latency profiles? The answer is simple; block I/O latency is a measure of the time it takes to read or write a block to storage. The kernel I/O scheduler orders system reads and writes. Measurement of the block I/O latency takes place after the read or write is scheduled.
The kernel I/O scheduler has no effect on block I/O latency – it is purely a function of the block device itself (in this case an SSD). The median latency for each of these distributions was right about 55 microseconds – not bad for a write through a virtualization layer. The maximum latency was approximately 200 microseconds. 95th and 99th percentiles were about 90 and 130 microseconds, respectively.
|Scheduler||Median q(0.5)||q(0.95)||q(0.99)||Max q(1)||Primary Mode||Secondary Mode|
|Cfq||55.9 μs||89 μs||130 μs||220 μs||52 μs||63 μs|
|Deadline||55.1 μs||86 μs||130 μs||200 μs||51 μs||60 μs|
|Noop||54.7 μs||90 μs||130 μs||200 μs||52 μs||61 μs|
The Two Modes
After poring over the results of the benchmarks, one oddity stood out. Why are there two modes in the histograms for block I/O latency? There is no disk head seek to read or write from physically separate sectors in an SSD. There is no need to spin disk platters up to speed from a low power state. Why don’t all block I/O reads/writes take largely the same amount of time?
To understand better how SSDs work, I read a detailed article at ExtremeTech. SSDs have on disk caches just like spinning rust drives. SSDs can write data very quickly to an empty drive, but overwriting data is much slower due to NAND flash voltage requirements. Is it possible that the higher latency mode is due to overwrites and not read/writes to empty blocks?
To investigate this anomaly, I executed another set of benchmarks. Sysbench normally calls
fsync() every 100 writes. I decided to use the option to disable
fsync() to the end of the test to see if that had any effect on the latency distribution.
Deadline I/O scheduler latency distribution,
The secondary latency mode which had previously been seen at ~60 μs was not there anymore. The primary mode was still present at ~52 μs. The median was largely unchanged at 53.8 μs, as were the 95th percentile at 91 μs, and the 99th percentile at 120 μs. The primary mode contained about 400 samples, as opposed to 300 samples with
fsync() enabled every 100 writes.
fsync() every 100 writes result in the secondary latency mode?
fsync() forces a physical write of data from the disk buffer cache to the disk. It is possible that the primary mode represents the latency taken to write data to the disk buffer cache, and the secondary mode represents the additional time to flush the cache to disk. The exact mechanics of how data is written to disk varies between file systems (ext4 was used in this example), as well as kernel versions.
Proving this theory requires tracing the code path used by the benchmark, as well as an understanding of the write-back cache behavior of the block device used. Alternatively, it is possible that some part of the virtualization layer is causing this bimodal distribution. But based on the distribution observed by my colleague in his eBPF block I/O measurements which was also bimodal, I am led to believe that this bimodality is coupled to how the filesystem is executing the writes.
eBPF has brought to Linux the observability into low-level systems behavior which has only been matched by the DTrace tool (which has historically not been available for Linux). In this exercise, it furthered my understanding of the difference between I/O scheduling and block device operations, but I didn’t reach the goal of figuring out which one is the ‘best’. Nor do I have an answer to why the latency I/O distribution was bimodal, but perhaps as my experience with eBPF grows that can be the subject of another examination. Sometimes we end up with more questions than answers from initial investigations.
The question of which block I/O scheduler is best remains unanswered, but it’s worth postulating how we can use other approaches with eBPF to find an answer. From an end user, and consequently application perspective, the best I/O scheduler would result in both the lowest latency reads and writes to the filesystem. eBPF provides us with file system latencies for the
write() system calls which we could use to investigate this. However, because everything is a file in UNIX based systems, the same syscalls used to write to the filesystem are also used to write to sockets and pipes. Hence the latencies I was interested in were conflated with network traffic latencies, and could not be used to evaluate block I/O scheduler performance.
I contemplated running the benchmark again with the network disabled, which would eliminate one source of socket calls affecting the measurements, but I would need to install a local data broker to store the eBPF telemetry, which could then forward the relevant metrics to Circonus once the benchmark was finished. I’ll be exploring this approach for a follow up to this post.