When many people talk about clean data, they are referring to data that was collected in a controlled and rigorous process where bad inputs are avoided. Dirty data has samples outside of the intended collection set or values for certain fields that may be mixed up (e.g. consider “First Name: Theo Schlossnagle” and “Last Name: Male” …oops). These problems pose huge challenges for data scientists and statisticians, but it can get a whole lot worse. What if your clean data were rotten?
All (or almost all) of this data is stored on disks today… in files on disks (yes, even if it is in a database)… in files that are part of a filesystem on disks. There is also a saying, “It’s turtles all the way down,” that here refers to the poor implementation of foundational technology. Case in point: did you know that you’re likely to have a bit error (i.e. one bit read back opposite of how it was stored) every time you write between 200TB and 2PB of data? This probability of storing bad data is called the Bit Error Rate (BER). Did you know that most filesystems assume a BER of zero, when it never has been and never will be zero? That means that on every filesystem you’ve used (unless you’ve been blessed to run on one of the few filesystems that account for this) you’ve had a chance of reading back data that you legitimately never wrote there!
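To put rough numbers on that, here is a small Python sketch of the arithmetic. It assumes decimal units (1TB = 10^12 bytes) and independent bit errors, neither of which is stated above, and the function names are mine for illustration only.

```python
import math

TB = 10**12  # decimal terabyte (assumption)
PB = 10**15  # decimal petabyte (assumption)

def implied_ber(bytes_per_error: float) -> float:
    """BER implied by 'one bad bit for every N bytes written'."""
    return 1.0 / (bytes_per_error * 8)

def expected_bit_errors(bytes_written: float, ber: float) -> float:
    """Expected number of bad bits after writing this many bytes."""
    return bytes_written * 8 * ber

def prob_at_least_one_error(bytes_written: float, ber: float) -> float:
    """Poisson approximation, assuming independent bit errors."""
    return -math.expm1(-expected_bit_errors(bytes_written, ber))

pessimistic = implied_ber(200 * TB)  # one bad bit per 200TB written
optimistic = implied_ber(2 * PB)     # one bad bit per 2PB written

for label, ber in (("pessimistic", pessimistic), ("optimistic", optimistic)):
    print(label,
          f"BER={ber:.2e}",
          f"expected errors in 2PB={expected_bit_errors(2 * PB, ber):.1f}",
          f"P(at least one)={prob_at_least_one_error(2 * PB, ber):.3f}")
```

Under these assumptions, writing 2PB at the pessimistic end of the range yields about ten expected bad bits, and even at the optimistic end the chance of at least one is well over half.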
Now, you may be thinking that 2PB is quite a lot of data to write before hitting a single bad bit. This BER is published by drive manufacturers and while they are not outright lying, they omit a very real truth. You don’t store data on drives without connecting them to a system via cables to a Host Bus Adapter (HBA). Two more pieces of hardware that we’ll simply call turtles. Most HBAs use a memory type called Error-Correcting Code (ECC) that is designed to compensate for single bit errors in memory, but cabling is often imperfect and the effective BER of the attached drives is bumped ever so slightly higher. Also take into account that physical media are imperfect: it is possible to write something correctly and have it altered over time due to environmental conditions and (to a lesser extent) use. This effect is called bit rot or data rot. All of this illustrates that the BER listed on your hard drive specification is optimistic. Combine all this with the inconvenient truth that writing out 2PB of data is quite common in today’s data systems and you wind up with even your cleanest data soiled. As an anecdote, at one point we detected more than one bit of error per month in a relatively small cluster (< 100TB).
You’ll notice that I said we detected these issues; this is because we use the ZFS filesystem underneath our customers’ data. ZFS checksums all data written so that it can be verified when it is retrieved. The authors of ZFS knew that on large data systems these issues would be real and must be handled, and for that they have my deep gratitude. There is one issue here that escapes most people who have the foresight to run an advanced filesystem, and it is hidden within this very paragraph.
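As a toy illustration of checksumming on write and verifying on retrieval, consider the Python sketch below. It is not how ZFS is implemented: the class, its methods, and the use of SHA-256 from hashlib are all assumptions made for the example.

```python
import hashlib

class ChecksumError(IOError):
    """Raised when stored data no longer matches its checksum."""

class ChecksummedStore:
    """Toy block store (hypothetical): checksum on write, verify on read."""

    def __init__(self):
        self._blocks = {}  # block id -> bytearray, standing in for the disk
        self._sums = {}    # block id -> digest, kept apart from the data

    def write(self, block_id, data: bytes) -> None:
        self._blocks[block_id] = bytearray(data)
        self._sums[block_id] = hashlib.sha256(data).digest()

    def read(self, block_id) -> bytes:
        data = bytes(self._blocks[block_id])
        if hashlib.sha256(data).digest() != self._sums[block_id]:
            raise ChecksumError(f"block {block_id!r} failed verification")
        return data

    def flip_bit(self, block_id, bit_index: int) -> None:
        """Simulate bit rot by flipping one stored bit behind the store's back."""
        self._blocks[block_id][bit_index // 8] ^= 1 << (bit_index % 8)

store = ChecksummedStore()
store.write("blk", b"clean data")
store.flip_bit("blk", 3)
try:
    store.read("blk")
except ChecksumError as err:
    print("detected:", err)
```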
In order for a checksumming filesystem (like ZFS) to detect bit errors, it must read the bad data. On large systems, some data is hot (meaning it is read often), but a significant amount of data is cold: written and then ignored for extended periods of time (months or years). When data engineers design systems, they account for the data access patterns of the applications that run on top of their systems: How many writes and reads? How much hot and cold? Are the reads and writes sequential or random? The answers to these questions help specify the configuration of the underlying storage systems so that they have enough space, enough bandwidth, and low enough latency to satisfy the expected usage. But, if we add into this the chance that our precious data is rotting and that we must detect an error before we can choose to repair it, then we are left with a situation where we must read all our cold data. We must read all our cold data. We must read all our cold data. Said three times it will induce cold sweats in most storage engineers; it wasn’t part of the originally specified workload and if you didn’t account for that in your system design, you’re squarely misspecified.
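To see why this extra read workload belongs in the original specification, here is a back-of-the-envelope Python sketch. The capacity, bandwidth, and interval figures are made up for illustration and are not measurements from our systems.

```python
TB = 10**12  # decimal terabyte (assumption)
MB = 10**6   # decimal megabyte (assumption)

def scrub_duration_hours(total_bytes: float, scrub_bytes_per_sec: float) -> float:
    """Hours needed to read everything once at a given read bandwidth."""
    return total_bytes / scrub_bytes_per_sec / 3600

def required_scrub_bandwidth(total_bytes: float, interval_days: float) -> float:
    """Sustained bytes/sec needed to read everything once per interval."""
    return total_bytes / (interval_days * 86400)

# Hypothetical node: 100TB stored, 500MB/s of read bandwidth spared for scrubbing.
print(f"{scrub_duration_hours(100 * TB, 500 * MB):.1f} hours per full pass")

# Hypothetical policy: every byte must be verified at least once every 30 days.
print(f"{required_scrub_bandwidth(100 * TB, 30) / MB:.1f} MB/s sustained")
```

Whatever the real numbers are for a given system, that bandwidth comes on top of the application's own reads and writes.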
Scrubbing out the rot
In the ZFS world, the action of reading all of your data to verify its integrity and correct for data rot is aptly named “scrubbing.” For the storage engineers out there, I thought this would be an interesting exploration into what scrubbing actually does to your I/O latency. At Circonus we actually care about our customers’ data and scrub it regularly. I’ll show you what this looks like and then very briefly describe what we do to make sure that users aren’t affected.
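Conceptually, a scrub walks every block, verifies it against its checksum, and repairs bad copies from a good one when redundancy allows. The Python sketch below shows that idea on in-memory mirrored copies. It is a simplification for illustration, not ZFS code, and every name in it is hypothetical.

```python
import hashlib

def digest(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def scrub(mirrors, checksums):
    """Verify every block on every mirror and repair from a good copy.

    mirrors:   list of dicts mapping block id -> bytes (one dict per copy)
    checksums: dict mapping block id -> expected digest
    Returns (repaired, unrecoverable), where repaired is a list of
    (mirror index, block id) pairs and unrecoverable is a list of block ids.
    """
    repaired, unrecoverable = [], []
    for block_id, expected in checksums.items():
        # Reading every copy of every block is the whole point of a scrub.
        good = next((m[block_id] for m in mirrors
                     if digest(m[block_id]) == expected), None)
        if good is None:
            unrecoverable.append(block_id)  # no intact copy left
            continue
        for index, mirror in enumerate(mirrors):
            if digest(mirror[block_id]) != expected:
                mirror[block_id] = good  # overwrite the rotten copy
                repaired.append((index, block_id))
    return repaired, unrecoverable

blocks = {"a": b"alpha", "b": b"bravo"}
sums = {block_id: digest(data) for block_id, data in blocks.items()}
copy0, copy1 = dict(blocks), dict(blocks)
copy1["b"] = b"brav0"  # simulated rot on one mirror
print(scrub([copy0, copy1], sums))
```

Note that the loop touches cold and hot blocks alike, which is exactly the extra read load examined in the graphs that follow.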
On our telemetry storage nodes, we measure and record the latency of every disk I/O operation against every physical disk in the server using the io nad plugin (which leverages DTrace on Illumos and ktap on Linux). All of these measurements are sent up to Circonus as a histogram and from there we can analyze the distribution of latencies.
In this first graph, we’re looking at a time-series histogram focused on the period of time immediately before an apparently radical change in behavior.
Moving our mouse one time unit to the right (just before 4am), we can see an entirely different workload present. One might initially think that in the new workload we have much better performance, as many samples are now present in the lower latency side of the distribution (the left side of the heads-up graph). However, in the legend you’ll notice that the first graph is focused on approximately 900 thousand samples whereas the second graph is focused on approximately 3.2 million samples. So, while we have more low-latency samples, we also have many more samples overall.
Of further interest is that, almost immediately at 4am, the workload changes again and we see a new distribution emerge in the signal. This distribution stays fairly consistent for about 7 hours with a few short interruptions, changes yet again just before Jan 5 at 12pm, and seemingly recovers to the original workload just after 4pm (16:00). This is the havoc a scrub can play, but we’ll see with some cursory analysis that the effects aren’t actually over at 4pm.
The next thing we do is add an analytical overlay to our latency graph. This overlay represents an approximation of two times the number of modes in the distribution (the number of humps in the histogram) as time goes on. This measurement is an interesting characteristic of workload and can be used to detect changes in workload. As we can see, we veered radically from our workload just before 4am and returned to our original workload (or at least something with the same modality) just after midnight the following day.
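For readers who want to experiment with modality themselves, here is a naive Python sketch that counts the humps in one histogram's bin counts. It is not the algorithm behind the overlay described above. The function name and the `min_fraction` noise floor are assumptions, and real data would need smoothing first because this approach is sensitive to noise.

```python
def count_modes(counts, min_fraction=0.05):
    """Count humps (rise-then-fall runs) in a list of histogram bin counts.

    Bins below min_fraction of the tallest bin are treated as empty so that
    a handful of stray samples does not register as a mode.
    """
    if not counts:
        return 0
    floor = max(counts) * min_fraction
    cleaned = [c if c >= floor else 0 for c in counts]
    modes, rising, prev = 0, False, 0
    for c in cleaned + [0]:  # trailing 0 closes a hump that ends at the edge
        if c > prev:
            rising = True
        elif c < prev and rising:
            modes += 1  # we just passed a peak
            rising = False
        prev = c
    return modes

bimodal = [0, 5, 20, 5, 0, 0, 3, 30, 3, 0]
print(count_modes(bimodal))
```

Tracking this count over time gives a crude signal that the shape of the workload has changed, even when averages look similar.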
Lastly, we can see the effects on the upper end of the latency distribution spectrum by looking at some quantiles. In the above graph we reset the maximum y-value to 1M (the units here are in microseconds, so this is a 1s maximum). The overlays here are q(0.5), q(0.99), q(0.995), and q(0.999). We can see our service times growing into a range that would cause customer dissatisfaction.
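As a rough illustration of how such quantiles can be read out of histogram data, the Python sketch below uses the nearest-rank method and reports the upper edge of the bin that contains the requested rank. The bin layout and counts are invented for the example, and this is a coarse approximation rather than the calculation Circonus performs.

```python
import math

def histogram_quantile(bins, q):
    """Nearest-rank quantile from (upper_edge_us, count) bins sorted by edge.

    Returns the upper edge of the bin holding the q-th quantile, so the
    answer is only as precise as the bin widths.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = sum(count for _, count in bins)
    if total == 0:
        raise ValueError("histogram is empty")
    rank = math.ceil(q * total)  # 1-based rank of the sample we want
    seen = 0
    for upper_edge_us, count in bins:
        seen += count
        if seen >= rank:
            return upper_edge_us
    return bins[-1][0]  # guards against floating-point rounding of rank

# Hypothetical latency histogram in microseconds: (upper edge, sample count).
bins = [(100, 9000), (1000, 970), (10000, 25), (1000000, 5)]
for q in (0.5, 0.99, 0.995, 0.999):
    print(f"q({q}) <= {histogram_quantile(bins, q)} us")
```

The high quantiles are where a scrub shows up first: a few very slow operations barely move q(0.5) but dominate q(0.999).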
While I won’t go into detail about how we solve this issue, it is fairly simple. All data in our telemetry store is replicated on multiple nodes. The system understands node latency and can prefer reads from nodes with lower latency.
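One way such a preference could work is sketched below in Python. To be clear, this is an assumption about the general technique and not a description of our implementation: the class name, the exponentially weighted moving average, and the `alpha` parameter are all hypothetical.

```python
class ReplicaSelector:
    """Prefer the replica with the lowest smoothed read latency (hypothetical)."""

    def __init__(self, nodes, alpha=0.2):
        self.alpha = alpha  # weight given to the newest observation
        self.latency_us = {node: None for node in nodes}

    def observe(self, node, latency_us: float) -> None:
        """Fold one measured read latency into the node's moving average."""
        previous = self.latency_us[node]
        if previous is None:
            self.latency_us[node] = float(latency_us)
        else:
            self.latency_us[node] = (self.alpha * latency_us
                                     + (1 - self.alpha) * previous)

    def pick(self, replicas):
        """Choose among the nodes holding the data; unmeasured nodes go first."""
        def score(node):
            latency = self.latency_us[node]
            return 0.0 if latency is None else latency
        return min(replicas, key=score)

selector = ReplicaSelector(["a", "b", "c"])
selector.observe("a", 500)
selector.observe("b", 20000)  # e.g. a node in the middle of a scrub
selector.observe("c", 800)
print(selector.pick(["b", "c"]))
```

With a scheme like this, a node that slows down during a scrub naturally receives fewer reads until its latency recovers.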
Understanding how our systems behave while we keep our customers’ data from rotting away allows us to always serve the cleanest data as fast as possible.