Circonus’ time-series database, IRONdb, can ingest, compute, and analyze time-series data at greater volume and frequency than any other solution on the market. We realize that’s a bold statement, but we stand by it. In fact, IRONdb can ingest over a trillion measurements per second, giving companies the data granularity they need for better, more accurate monitoring. It also offers unlimited scalability and the highest data safety guarantees. So what goes into designing such a powerful database?
Circonus Founder and CTO Theo Schlossnagle recently spoke with Justin Sheehy, the chief architect of global performance and operations for Akamai, about some of the key decisions he made in building Circonus’ IRONdb. In this post, we include a snapshot of that discussion. The full interview was published in ACM Queue (a publication of the Association for Computing Machinery) in an article entitled “Always-On Time-Series Database: Keeping Up When There’s No Way to Catch Up.”
JS: Theo, why did you feel the need to write your own time-series database?
TS: There were a number of reasons. For one, almost all the data that flows into our telemetry analysis platform comes in the form of numbers over time. We’ve witnessed more than exponential growth in the volume and breadth of that data. In fact, by our estimate, we’ve seen an increase by a factor of about 10¹² over the past decade. Obviously, compute platforms haven’t kept pace with that. Nor have storage costs dropped by a factor of 10¹². Which is to say the rate of data growth we’ve experienced has been way out of line with the economic realities of what it takes to store and analyze all that data. So, the leading reason we decided to create our own database had to do with simple economics. Basically, you can work around any problem but money. It seemed that, by restricting the problem space, we could have a cheaper, faster solution that would end up being more maintainable over time.
JS: Did you consider any open-source databases? If so, did you find any that seemed almost adequate for your purposes, only to reject them for some interesting reason? Also, I’m curious whether you came upon any innovations while looking around that you found intriguing… or, for that matter, anything you really wanted to avoid?
TS: I’ve been very influenced by DynamoDB and Cassandra and some other consistent hashing databases like Riak. As inspiring as I’ve found those designs to be, I’ve also become very frustrated by how their approach to consistent hashing tends to limit what you can do with constrained datasets. What we wanted was a topology that looked similar to consistent hashing databases like DynamoDB or Riak or Cassandra, but we also wanted to make some minor adjustments, and we wanted all of the data types to be CRDTs. We ended up building a CRDT-exclusive database. That radically changes what is possible, specifically around how you make progress writing to the database.
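For illustration, here is a toy sketch of the consistent-hashing topology Theo references (the shape shared by Dynamo-style systems, not IRONdb’s actual implementation): each physical node owns many points (“vnodes”) on a hash ring, and a key’s replicas are the first distinct nodes found clockwise from the key’s hash. All names and parameters below are hypothetical.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    """Map a key to a stable point on the ring."""
    return int(hashlib.sha256(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Toy consistent-hash ring with virtual nodes (vnodes)."""

    def __init__(self, nodes, vnodes=64):
        # Each node claims `vnodes` pseudo-random points on the ring,
        # which smooths out ownership when nodes join or leave.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self._points = [p for p, _ in self._ring]

    def owners(self, key, replicas=2):
        """Return the first `replicas` distinct nodes clockwise from the key."""
        idx = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        found = []
        while len(found) < replicas:
            node = self._ring[idx][1]
            if node not in found:
                found.append(node)
            idx = (idx + 1) % len(self._ring)
        return found

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.owners("sensor-42|2024-01-01T00:00:00Z"))
```

Because placement is a pure function of the key and the node list, any node can compute where a measurement belongs without consulting a coordinator.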
…But the real advantage of the CRDT approach is that, if you can limit your entire operation set to CRDTs, you can forgo consensus algorithms such as Paxos, MultiPaxos, Fast Paxos, Raft, Extended Virtual Synchrony, and anything else along those lines. Which is not to disparage any of those algorithms. It’s just that they come with a lot of unnecessary baggage once the ability to make progress without quorum can be realized. There are a couple of reasons why this makes CRDTs incredibly appealing for time-series databases.
One is that most time-series databases—especially those that work at the volume ours does—take data from machines, not people. The networks that connect those machines are generally wide and what I’d describe as “always on and operational.” This means that, if you have any sort of interruption of service on the ingest side of your database, every moment you’re down is a moment where it’s going to become all the more difficult to recover state. So, if I have an outage where I lose quorum in my database for an hour, it’s not like I’ll be able just to pick up right away once service resumes. First, I’ll need to process the last hour of data, since the burden on the ingest side of the database continues to accumulate over time, regardless of the system’s availability. Which is to say, there’s an incredibly strong impetus to build time-series databases for always-on data ingest since otherwise—in the event of disruptions of service or catastrophic failures—you’ll find, upon resumption of service, your state will be so far behind present that there will simply be no way to catch up.
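The catch-up arithmetic behind this point is worth making explicit. Assuming a constant ingest rate and a fixed maximum replay throughput (both numbers below are hypothetical, not Circonus figures), the recovery time after an outage depends entirely on the spare headroom:

```python
def catchup_time(outage_s: float, ingest_rate: float, max_throughput: float) -> float:
    """Seconds needed to drain the backlog left by an outage, assuming
    only the spare capacity (max_throughput - ingest_rate) is available
    to replay old data while live ingest continues."""
    spare = max_throughput - ingest_rate
    if spare <= 0:
        return float("inf")  # backlog grows forever: no way to catch up
    return outage_s * ingest_rate / spare

# A one-hour outage at 1M samples/s with only 25% headroom
# takes four hours to recover from:
print(catchup_time(3600, 1_000_000, 1_250_000))  # → 14400.0
```

With thin headroom the recovery window dwarfs the outage itself, which is exactly why the design goal is to avoid losing ingest availability in the first place.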
JS: I’ve learned, however, that many so-called CRDT implementations don’t actually live up to that billing. I’m curious about how you got to where you could feel confident your data-structure implementations truly qualified as CRDTs.
TS: It’s certainly the case that a lot of CRDTs are really complicated, especially in those instances where they represent some sort of complex interrelated state. A classic example would be the CRDTs used for document editing, string insertion, and that sort of thing. But the vast majority of machine-generated data is of the write-once, delete-never, update-never, append-only variety. That’s the type of data yielded by the idempotent transactions that occur when a device measures what something looked like at one particular point in time. It’s this element of idempotency in machine-generated data that really lends itself to the use of simplistic CRDTs.
In our case, conflict resolution is the primary goal since, for time-series data, there can be only one measurement that corresponds to a particular timestamp from one specific sensor. To make sure of that, we use a pretty simplistic architecture to ensure that the largest absolute value for any measurement will win. If, for some reason, a device should supply us with two samples for the same point in time, our system will converge on the largest absolute value. We also have a generational counter that we use out of band. This is all just to provide for simplistic conflict resolution. With all that said, in the course of a year, when we might have a million trillion samples coming in, we’ll generally end up with zero instances where conflict resolution is required simply because that’s not the sort of data machines generate.
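As a concrete sketch of the conflict-resolution rule Theo describes (largest absolute value wins, with an out-of-band generational counter taking precedence), here is a minimal max-wins register in Python. The names and structure are illustrative, not IRONdb’s actual code; the point is that the merge is commutative, associative, and idempotent, which is what makes it a CRDT:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sample:
    """One measurement for a given (sensor, timestamp) slot."""
    generation: int  # out-of-band generational counter
    value: float

def merge(a: Sample, b: Sample) -> Sample:
    """Deterministic merge: replicas converge on the same sample
    no matter the order or number of times samples are exchanged."""
    if a.generation != b.generation:
        return max(a, b, key=lambda s: s.generation)  # newer generation wins
    # Same generation: largest absolute value wins; ties break on the
    # signed value so the result is independent of argument order.
    return max(a, b, key=lambda s: (abs(s.value), s.value))
```

Because every replica applies the same pure function, two nodes that have seen the same set of samples, in any order, hold identical state with no coordination.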
JS: It’s clear you viewed CRDTs as a way to address one of your foremost design constraints: the need to provide for always-on data ingest. Were there any other constraints on the system that impacted your design?
TS: We’ve learned firsthand that whenever you have a system that runs across multiple clusters, you don’t want to have certain nodes that are more critical than all the others since that can lead to operational problems. A classic problem here is something like NameNodes in Hadoop’s infrastructure or, basically, any sort of special metadata node that needs to be more available than all the other nodes. That’s something we definitely wanted to avoid if only to eliminate single points of failure. That very much informed our design.
JS: Given your always-on, always-ingesting system and the obvious need to protect all the data you’re storing for your customers, I’m curious about how you deal with version upgrades.
TS: I think you’ll find a lot of parallels throughout database computing in general. In our case, because we know the challenges of perfect forward compatibility, we try to make sure each upgrade between versions is just as seamless as possible so we don’t end up needing to rebuild the whole system. But the even more important concern is that we really can’t afford to have any disruption of service on ingestion, which means we need to treat forward and backward compatibility as an absolutely fundamental requirement.
JS: I know you made a conscious decision to go with one specific type of file-system technology. What drove that choice, and how do you feel about it now?
TS: There absolutely is a performance penalty to be paid when you’re operating on top of a file system. Most of that relates to baggage that doesn’t help when you’re running a database. With that said, there still are some significant data-integrity issues to be solved whenever you’re looking at writing data to raw devices. The bottom line is: I can write something to a raw device. I can sync it there. I can read it back to make sure it’s correct. But then I can go back to read it later, and it will no longer be correct. Bit rot is real, especially when you’re working with large-scale systems. If you’re writing an exabyte of data out, I can guarantee it’s not all coming back. There are questions about how you deal with that. Our choice to use ZFS was really about delivering value in a timely manner. That is, ZFS gave us growable storage volumes; the ability to add more disks seamlessly during operation; built-in compression; some safety guarantees around snapshot protection; checksumming; and provisions for the autorecovery of data. The ability to get all of that in one fell swoop made it worthwhile to take the performance penalty of using a file system.
JS: How do four embedded database technologies prove useful to you?
TS: At root, the answer is: A single generalized technology rarely fits a problem perfectly; there’s always compromise. So, why might I hesitate to use LMDB? It turns out that writing to LMDB can be significantly slower at scale than writing to the end of a log file, which is how LSM-style databases like RocksDB work. But then, reading from a database like RocksDB is significantly slower than reading from a B+ tree database like LMDB. You must take all these tradeoffs and concessions into account when you’re building a large-scale database system that spans years’ worth of data and encompasses a whole range of access patterns and user requirements. Ultimately, we chose to optimize various paths within our system so we could take advantage of the strengths of each of these database technologies while working around their weaknesses.
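To make the write/read tradeoff concrete, here is a toy LSM-style store. This is nothing like production RocksDB, just the shape of the idea: writes go to a cheap in-memory table that is periodically flushed as a sorted run, while reads may have to consult the memtable plus every run, which is the read penalty Theo mentions.

```python
import bisect

class MiniLSM:
    """Toy LSM-ish store: writes are cheap memtable updates;
    reads search flushed, sorted runs from newest to oldest."""

    def __init__(self):
        self.memtable = {}  # recent writes; cheap to update
        self.runs = []      # flushed sorted runs, newest first

    def put(self, key, value):
        self.memtable[key] = value  # O(1); akin to appending to a log

    def flush(self):
        """Freeze the memtable into a sorted run (an LSM 'SSTable')."""
        if self.memtable:
            self.runs.insert(0, sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:  # newest run wins; reads may touch many runs
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None
```

A B+ tree inverts this profile: every read is a single tree descent, but every write pays for keeping the tree sorted in place, which is why a system serving both hot ingest and years of historical reads benefits from mixing storage engines.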
JS: What did you do to provide the live inspection capabilities people could use to really understand the system while it’s running?
TS: That’s something I can talk a lot about since one of the really interesting parts of our technology has to do with our use of high-definition histograms. We have an open-source implementation of HDR [high dynamic range] log-linear quantized histograms, called circllhist, that’s both very fast and memory efficient. With the help of that, we’re able to track the performance of function boundaries for every single IOP [input/output operation], database operation, time in queue, and the amount of time required to hand off to another core in the system. This is something we use to track everything on the system, including every single RocksDB or LMDB get() or put(). The latency of each one of those is tracked in terms of nanoseconds.
Within a little web console, we’re able to review any of those histograms and watch them change over time. Also, since this happens to be a time-series database, we can then put that data back in the database and connect our tooling to it to get time-series graphs for the 99.9th percentile of latencies for writing numeric data, for example. Once the performance characteristics of the part you’re looking to troubleshoot or optimize have been captured, you’ve got what you need to perform controlled experiments where you can change one tuning at a time. This gives you a way to gather direct feedback by changing just one parameter and then tracking how that changes the performance of the system, while also looking for any unanticipated regressions in other parts of the system. I should add this all comes as part of an open-source framework. (Learn more about OpenHistogram, vendor-neutral log-linear histograms: https://openhistogram.io/.)
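The core of a log-linear histogram like circllhist is its bucketing function: values share a bin when they agree on their base-10 exponent and their first two significant digits, so every bin carries bounded relative error while memory stays small regardless of the value range. Here is a minimal sketch of that bucketing scheme (illustrative only, not the circllhist implementation):

```python
import math

def bucket(value: float):
    """Map a value to a log-linear bin: (signed two-significant-digit
    mantissa, base-10 exponent). E.g. 123.4 -> (12, 2), the bin
    covering [120, 130)."""
    if value == 0:
        return (0, 0)
    exp = math.floor(math.log10(abs(value)))
    mant = int(abs(value) / 10.0 ** (exp - 1))  # in 10..99
    return (int(math.copysign(mant, value)), exp)

class LogLinearHistogram:
    """Counts samples per log-linear bin."""

    def __init__(self):
        self.bins = {}

    def record(self, value: float):
        b = bucket(value)
        self.bins[b] = self.bins.get(b, 0) + 1
```

Nanosecond latencies and multi-second stalls land in the same compact structure, which is what makes it practical to histogram every single get() and put() in a running system.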
To learn more about the thinking behind the design of IRONdb and how these decisions enabled IRONdb to handle unprecedented volume and frequency of telemetry data, read the full interview with Theo in ACM Queue.