High Volume Ingest

Like it or not, high-volume time series databases face one relentless challenge: ingestion. Unlike many other databases, TSDBs tend to record things that have already happened. Records are inserted with “past time stamps” (even if they’re often just a few milliseconds old) because the event has already happened. This has a consequence that is particularly hostile: being down, slow, or otherwise unavailable doesn’t mean the event didn’t happen… it still happened, it’s not going to disappear, and when you become available again, it will show up and the database will have to ingest it. Basically, any reduction in availability only serves to increase future load. It is literally an architecture in which you cannot afford to have unavailability. These constraints also require TSDBs to be operated with reasonable, sufficient capacity to accommodate expected events; if you do go offline, you need to be able to “catch-up quick” when service is restored.

Ingestion speed and system resiliency are dual targets for any production TSDB. IRONdb handles resiliency as a core competence through careful distributed system design and battle hardened implementation work that has seen hostile networks, systems, and even operators; but that’s not what we’ll focus on here… This is an article about speed.

First lets start with some assumptions. These might be controversial, but they shouldn’t be, since they’ve been assumed and shown correct again and again in large production installs.

  • TSDBs have high write volumes and ultimately store a lot of data (millions of new measurements per second and tens to hundreds of terabytes of data over time).
  • At these write volumes and storage sizes, bit error rate (BER) is a certainty, not a likelihood. Bit errors in a piece of data can cause wrong answers, but errors in metadata can result in catastrophic data loss. Techniques to get data onto disk and maintain integrity as it rests there is paramount. One must store their data on ZFS or otherwise employ the data safety techniques it does in some ancillary way. To be clear, you don’t specifically need ZFS; it’s just a reference implementation of safety techniques that happens to be free.
  • The data arriving at TSDBs tends to be “recent”. This is to say that you are unlikely to get one day of data for each event source: A0,A1,A2,A3,B0,B1,B2,B3,C0,C1,C2,C3. We call this pattern long-striped. You are instead likely to get all data from all event sources for a short time period, then get them again: A0,B0,C0,A1,B1,C1,A2,B2,C2,A3,B3,C3. We call this pattern short-striped.
  • When data is read from TSDBs, the vast majority of operations are sequential per event source: requesting A0,…,An.

These assumptions lead to a well understood, albeit poorly solved, challenge of choosing expensive writes or choosing expensive reads. Given that TSDBs never have a hiatus on their write demands, it is clear that it would be nonsensical to sacrifice anything consequential on the write performance side.

There are a variety of database techniques that can be used and to say that we use only one within IRONdb would be highly disingenuous. The ingestion pipeline is layered to apply time-based partitioning, on-disk format changes optimizing for different workloads, and pipelines of deferred work. That said, the data that arrives “right now” tends to follow a fairly simple path: get it to disk fast, cheap, and accommodate our other layers as efficiently as possible. Today, we use RocksDB for this early stage pipeline.

RocksDB itself is a complicated piece of software providing layers and pipelines itself. Data put into RocksDB is first written to a write-ahead-log (WAL) unordered, but kept in memory in efficient ordered data structures. Once that in-memory structure gets “too large”, the ordered data is written out to disk and the WAL checkpointed. This leaves us with a whole slew of files on disk that have sorted data in them, but data across the files aren’t sorted (so we need to look in each for the data). As the amount of those files becomes too great, we merge sort those into a “level” of larger sorted files… and so on for several levels. As the system runs at high volume, most of the data finds itself in a very large sorted layout. This has the effect (at a cost) of slowly churning out data from short-striped to long-striped.

Why is long-striped data better? For long-term storage and servicing read queries, all of the data you tend to read in a single query finds itself “together” on disk. You might say that with SSD (and even moreso with NVMe) that random reads are so cheap that I don’t need my data stored sequentially to be performant. You’re not as right as you might think. There are two things at play here. First, at very, very large storage volumes, spinning disks still make sense, so ignoring them completely would effectively sacrifice economic efficiencies by design in the software! Second, disks, HBAs, and filesystems still work together to get you data, and they use sectors, command queues, and blocks/records respectively. Even in the case of NVMe, where the HBA is largely erased from the picture, in order to fetch data the filesystem must pull the relevant blocks, and those blocks are backed by some set of assigned sectors on the underlying storage. Those things need to be big enough to be useful units of work (as we only get so many of those retrievals per second). Typically today, storage sector sizes are 4 kilobytes and filesystem block sizes range from 8 kilobytes to 128 kilobytes.

If I ask for some sequence of data, say A0,… An, I’d like that data to live together in large enough chunks (long enough stripes) that it fills a block on the filesystem.

As a part of our transformation into dense, read-optimized formatting for long-term storage, we need to read “all the data” and convert it. We used to do a simple set up for loops:

1: Foreach eligible partition:
    2: Foreach stream
        3: Foreach measurement
            Collect and transform
        Write range

A challenge with this is that the measurements (each stripe) might not be stored in the same order that they are in our stream iteration. This would cause us to read segments of data (stripes) in for loop 3 that would not be directly sequential to one another. This causes random read I/O which on spinning storage systems exhibits radically poor performance when compared to sequential read I/O. On such mechanical media, you need to wait for the heads to relocate, which induces latency (a “seek penalty”) that is not present on non-mechanical media such as SSDs and NVMe. We changed it thusly:

1: foreach eligible partition
    2: foreach measurement
        Cache lookup stream
        if stream different : write range
        Collect and transform

This changes the random I/O to the stream lookup (which is a small enough dataset to be cached aggressively) resulting in almost exclusively in-memory operations (or no I/O at all).

The benefits here should be obvious for spinning media. When we start reading the measurements block reads and read-ahead will always be subsequently used; no I/O is wasted. The interesting part is the efficiencies gained on SSD, NVMe, and other “random I/O friendly” storage media.

The issue is that RocksDB doesn’t pretend to know the underlying fileystem and block device layers and pays little attention to the unit sizes of those underlying operations. Assuming that we have a filesystem block size of 8k, we need to read at least that much data at a time. If I read a stripe of data out of RocksDB, the beginning of that data (and the end) are very unlikely to align with the filesystem block boundary. This means that when I issue the read, I’m pulling data with that read that isn’t relevant to the range request. If I read a very large amount of data randomly this way, the read amplification can be significant. In our setup, a day’s worth of data can have up to 17kb in a stripe which dictates 7k of unrelated read data. That manifests as 1.4x read amplification; for every 10 bytes of data we request, we actually pull 14 bytes of data off disk. This obviously varies with the stripe length, but in bad scenarios where you have short striping (say 32 bytes or less of data in a stripe within an 8k filesystem block) you can have colossal read amplification in the range of 256x. This read amplification is compounded if your filesystem is configured to optimistically “read-ahead” subsequent blocks “just in case” they are going to be read; this is common on rotating media.

It turns out that if you can long-stripe the data for reads and your reads have a large stride, you pull many sequential blocks. While SSDs and other non-mechanical media have enormous advantages for randomized reads, ignoring read amplification can be a performance killer. In optimizing our system for optimal performance on mechanical storage (by driving operations sequentially and read-amplification to near 1x), we also gained significantly on SSDs and NVMe storage technology.