Thank you, ZFS

If you’ve had a technical conversation with anyone at Circonus, there are very likely two technologies that came up: ZFS and DTrace. While we love DTrace, ZFS has literally changed our world, turning what could otherwise have been catastrophic, business-ending mistakes into personal “whoopsies.” ZFS has changed the way we interact with production computing systems and business problems.

ZFS has a million features that made it over a decade ahead of its time, but that’s not so important today. Today, we’re just thankful.

We want to thank the creators of ZFS and the community that placed such abusive production demands on it before us. We want to thank the community of people that saw the value of ZFS and ported it to… everything: FreeBSD, Linux, Mac OS X, and even Windows. It was a dreadful mistake that many vendors didn’t prop those communities up and assist them in making ZFS ubiquitous and default on their systems; you’ve done a disservice to your customers.

For 5 years, we didn’t bring “snowth” (IRONdb’s internal name) to market as a standalone product for a variety of reasons. One of the primary technical reasons was that our reliance on ZFS and Linux’s lack of adoption of ZFS made our deployable market artificially small. Last year, we decided that ZFS on Linux was “stable enough” to support our customers, and the last gating factor for IRONdb as a product was eliminated. I’m thankful for all of the ridiculously hard work the ZoL (ZFS on Linux) team has put into making ZFS as good on Linux as it is today.

At Circonus, under the torture harness of IRONdb, we’ve pushed on ZFS in ways that are hard to contemplate. From dialogue with other ZFS developers and users, we know that we push on this filesystem in ways that are simply diabolical. If you look in the sizing section of the documentation for the excellent and capable InfluxDB, IRONdb sits squarely in their “infeasible” performance category; we attribute our ability to shatter those notions to having built our tech atop ZFS.

Yes, we use all the magical features therein: compression, device management (growing pools), online disk replacement, scrubbing, checksumming, snapshots, and many more, but today we want to show appreciation for the most important ZFS characteristic of all… stability. You’re just there for us. You’ve lost less data than any other filesystem we’ve used. You’ve crashed and locked up less than any of your counterparts. You’ve saved us years of filesystem checks, by completely eliminating them. You’ve saved us.

Thank you.

Some Like It Flat

JSON rules the world, much to our collective chagrin. I’ve mentioned before the atrocious shortcomings of JSON as a format and I feel deeply saddened that the format has taken the world by storm. However, it is here and we must cope… but not always.

Like every other API-centric platform in the world, we support data in and out in the ubiquitous JavaScript Object Notation format. I’ll admit, for a (more or less) machine parseable notation, it is remarkably comprehensible to humans (this is, of course, one of the leading drivers for its ubiquity). While I won’t dive into the deficits of JSON on the data-type interpretation side (which is insidious for systems that mainly communicate numerical data), I will talk about speed.

JSON is slow. JSONB was created because JSON was slow. JSONB is still slow. Object (data) serialization has always been of interest to protocols. When one system must communicate data to another, both systems must agree on a format for transmission. While JSON is naturally debuggable, it does not foster agreement (specifically on numeric values) and it is truly abysmal on the performance side. This is why so many protocol serializations exist today.

From the Java world comes Thrift, its successor Avro, and MessagePack. From Python we have pickle, which somehow has escaped the Python world to inflict harm upon others. From the C and C++ world we have Cap’n Proto, Flatbuffers, and perhaps the most popular, Google Protobuf (the heart of the widely adopted gRPC protocol). Now, these serialization libraries might have come from one language world, but they’d be useless without bindings in basically every other language… which they do generally boast, with the exception of pickle.

It should be noted that not all of these serialization libraries stop at serialization. Some bring protocol specification (RPC definition and calling convention) into scope. Notwithstanding that this can be useful, the fact that they are conflated within a single implementation is a tragedy.

For IRONdb, we needed a faster alternative to JSON because we were sacrificing an absurd amount of performance to the JSON god. This is a common design in databases; either a binary protocol is used from the beginning or a fast binary protocol is added for performance reasons. I will say that starting with JSON and later adopting a binary encoding and protocol has some rather obvious and profound advantages.

By delaying the adoption of a binary protocol, we had large system deployments with petabytes of data in them under demanding, real-world use cases. This made it a breeze to evaluate both the suitability and the performance gains. Additionally, we were able to understand how much performance we squandered on the protocol side vs. the encoding side. Our protocol has been HTTP and our payload encoding has been JSON. It turns out that the overhead of dealing with HTTP was nominal in our workloads, but the overhead of serializing and deserializing JSON was comically absurd (and yes, this is even with non-allocating, SAX-style JSON parsers).

So, Google Protobuf right? Google can’t get it wrong. It’s popular; there are a great many tools around it. Google builds a lot of good stuff, but there were three undesirable issues with Google Protobuf: (1) IRONdb is in C and while the C++ support is good, the C support is atrocious, (2) it conflates protocol with encoding so it becomes burdensome to not adopt gRPC, and (3) it’s actually pretty slow.

So, if not Google’s tech, then whose? Well, Google’s of course. It is little known that FlatBuffers is also a Google project, designed specifically as an encoding and decoding format for gaming and other high-performance applications. To understand why it is so fast, consider this: you don’t have to parse anything to access data elements of encoded objects. Additionally, it boasts the strong forward and backward compatibility guarantees that you get with other complex serialization systems like Protobuf.

The solution for us:

Same REST calls, same endpoints, same data, different encoding. Simply specify the data encoding you are sending with a ‘Content-Type: x-circonus-<datatype>-flatbuffer’ header, and the remote end can avoid copying or even parsing any memory; it just accesses it. The integration into our C code is very macro-oriented and simple to understand. Along with a roughly 1000x speedup in encoding and decoding data, we save about 75% volumetrically “over the wire.”
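
To make the zero-copy idea concrete, here is a deliberately simplified C sketch. It is not the FlatBuffers wire format and not our actual x-circonus encoding (the field names are invented for illustration); it only shows why a fixed-offset layout lets the receiver read fields straight out of the network buffer, where a JSON decoder has to tokenize, convert, and usually allocate first.

/* Deliberately simplified: a fixed-layout record read directly out of the
 * receive buffer. This is NOT the FlatBuffers wire format; FlatBuffers adds
 * a schema, vtables for optional fields, and compatibility rules on top of
 * the same core idea of offset-based, in-place access. */
#include <stdint.h>
#include <stddef.h>

typedef struct {
  uint64_t check_id;   /* illustrative field names */
  uint64_t whence_ms;  /* measurement timestamp, in milliseconds */
  double   value;
} __attribute__((packed)) wire_measurement_t;

/* No parse step and no allocation: validate the length, then read the
 * fields in place, straight from the bytes that arrived off the wire. */
static int read_measurement(const uint8_t *buf, size_t len,
                            uint64_t *check_id, double *value) {
  if(len < sizeof(wire_measurement_t)) return -1;
  const wire_measurement_t *m = (const wire_measurement_t *)buf;
  *check_id = m->check_id;  /* assumes both ends agree on byte order */
  *value    = m->value;
  return 0;
}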

Post-Mortem 2017.1.12.1

TL;DR: Some users received spurious false alerts for approximately 30 minutes, starting at 2017-01-12 22:10 UTC. It is our assessment that no expected alerts were missed. There was no data loss.

Overview

Due to a software bug in the ingestion pipeline specific to fault detection, a small subset (less than 2.5%) of checks were not analyzed by the online fault detection system for 31 minutes, starting at 2017-01-12 22:10 UTC.

The problem was triaged. Broker provisioning and deprovisioning services were taken offline at 22:40 UTC, at which time all fault detection services returned to normal.

Broker provisioning and deprovisioning services were brought back online at 2017-01-13 00:11 UTC. All broker provisioning and deprovisioning requests issued during that period were queued and processed successfully upon service resumption.

Gratuitous Detail

Within the Circonus architecture, we have an aggregation layer at the edge of our service that communicates with our store-and-forward telemetry brokers (which in turn accept/acquire data from agents). This component is called “stratcond.” On January 5th, we launched new code that allows more flexible configuration orchestration and, despite both unit tests and end-to-end tests, an error was introduced. Normal operations continued successfully until January 12th, when a user issued a command requiring reconfiguration of this system. That command exercised the code path containing this specific error and stratcond crashed. As with all resilient systems, stratcond was restarted immediately, and it suffered approximately 1.5 seconds of “disconnection” from downstream brokers.

The system is designed to tolerate failures, as failures are pretty much the only guaranteed thing in distributed systems. These can happen at the most unexpected times and many of our algorithms for handling failure are designed to cope with the randomness (or perceived randomness) of distributed failure.

The command that caused the crash was queued and reattempted precisely 60 seconds later, and again 60 seconds after that, and again after that: a recurrent and very non-random failure. Many checks that customers have scheduled run every 60 seconds. When a check is scheduled to run on a broker, it is scheduled to run at a random offset within the first 60 seconds of that broker’s boot time. So, of the 60-second-period checks, 2.5% (1.5 seconds out of every 60) would have been scheduled to run during the 1.5 second real-time-stream outage that transpired as part of this failure. The particular issue here is that because the crash recurred almost exactly every 60 seconds, the same 1.5 seconds of each minute was vulnerable to exclusion. Therefore the same 2.5% of checks were affected each minute, making them “disappear” to the fault detection system.

The same general pipeline that powers graphs and analysis is also used for long-term storage, but due to its open-ended temporal requirements, that system was unaffected. All checks that ran in those “outage” windows had their measurements successfully sent upstream and stored; the storage stream experienced no outage at all.

Operational response led to diagnosis of the cause of the crash, avoidance, and restoration of normal fault detection operation within 31 minutes. Crash analysis and all-hands engineering triage led to a bug fix, test, packaging, and deployment within 2 hours and 11 minutes.

Actions

There are two actions to be taken, and both will require research and implementation.

The first is to implement better instability detection, further enhancing the fault detection system’s already sophisticated capabilities for flagging instability. The short, regular timing of the disconnections in this case did not trigger the fault detection system’s instability mode, and thus it did not react as it should have.

The second is to better exploit “at least once delivery” in the fault pipeline. To make sure we get the job done that we promise to get done, we make sure our systems can process the same data more than once. Often, a metric is actually delivered to the fault detection system four times. We can further extend this “duplication tolerance” to the stratcond-broker feed and replay some window of past traffic to send upstream. In online systems, old data is worthless. In all systems, “old” is subjective. By relaxing our definition of “old” a bit and leveraging the fact that no additional upstream protections will be required, we should easily be able to make this tiny section of our pipeline even more resilient to failure.
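
As a toy illustration of why duplicate delivery is harmless downstream, consider measurements applied as keyed writes: applying the same record twice leaves the store in exactly the same state as applying it once. The names and structure here are invented for the example and are not the actual fault-detection pipeline.

/* A toy illustration of duplicate-tolerant ingestion: measurements are
 * applied as keyed writes, so replaying the same record is idempotent. */
#include <stdint.h>

#define SLOTS 4096

typedef struct {
  uint64_t check_id;
  uint64_t whence;   /* when the measurement was taken */
  double   value;
  int      used;
} record_t;

static record_t table[SLOTS];

static void apply(uint64_t check_id, uint64_t whence, double value) {
  /* open addressing on (check_id, whence); a duplicate delivery finds the
   * slot it already wrote and simply writes the same thing again */
  uint64_t h = (check_id * 0x9e3779b97f4a7c15ULL) ^ whence;
  for(uint64_t i = 0; i < SLOTS; i++) {
    record_t *r = &table[(h + i) % SLOTS];
    if(!r->used || (r->check_id == check_id && r->whence == whence)) {
      r->check_id = check_id;
      r->whence = whence;
      r->value = value;
      r->used = 1;
      return;
    }
  }
  /* table full: a real system would evict by age or grow */
}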

To close, we live in the real world. Failure is the only option. We embrace the failures that we see on a daily basis and do our best to ensure that the failures we see do not impact the service we deliver to you in any way. Yesterday, we learned that we can do better. We will.

Systems Monitoring is Ripe for a Revolution

Before we explore systems, let’s talk users. After all, most of our businesses wouldn’t exist without lots of users; users that have decreasing brand loyalty and who value unintrusive, convenient, and quick experiences. We’ve intuited that if a user has a better experience on a competitor’s site, then they will stop being your customer and start being theirs. Research into exactly how much impact substandard web performance has on users began in earnest around 2010, progressed to consensus, and has turned into a tome of easily consumable knowledge. What allowed for this? RUM.

Real User Monitoring

The term RUM wasn’t in constant usage until just after 2010, but the concept and practice was a slow growth that transformed the previous synthetic web monitoring industry. Both Keynote and Gomez (the pall bearers of synthetic web monitoring) successfully survived that transition and became leaders in RUM as well. Of course, the industry has many more and varied competitors now.

Synthetic monitoring is the process of performing some action and taking measurements around the aspects of that performance. A simple example would be asking, “how fast does my homepage load?” The old logic was that an automated system would perform a load of your homepage and measure how long various aspects took: initial page load, image rendering, above-the-fold completeness, etc. One problem is that real users are spread around the world, so to simulate them “better,” one would need to place these automated “agents” around the world so that a synthetic load could indeed come from Paris, or Copenhagen, or Detroit. The fundamental problem remained that the measurements being taken represented exactly zero real users of your web site… while users of your website were actively loading your home page. RUM was born when people decided to simply observe what’s actually happening. Now, synthetic monitoring isn’t completely useless, but RUM largely displaced most of its obvious value.

What took RUM so long? The short answer was the size of the problem relative to the capabilities of technology. The idea of tracking the performance of every user action before 2000 was seen as a “Big Data Problem” before we coined the term Big Data. Once the industry better understood how to cope with data volumes like this, RUM solutions became commonplace.

Now it seems fairly obvious to anyone that monitoring real users is fundamental in attempting to understand the behavior of their website and its users… so why not with systems?

Systems are Stuck

Systems, like websites, have “real users,” those users just happen to be other systems most of the time. It is common practice today to synthetically perform some operation against a system and measure facets of that performance. It is uncommon today to passively observe all operations against the system and extract the same measurements. Systems are stuck in synthetic monitoring land.

Now, to be fair, certain technologies have been around for a while that allow the observation of inflight systems; caveat “systems” with a focus on custom applications running on systems.

The APM industry took a thin horizontal slice of this problem, added sampling, and sold a solution (getting much market capitalization). To sum up their “solution,” you have an exceptional view into part of your system some of the time. Imagine selling that story in the web analytics industry today: “now see real users… only on your search pages, only 1% of the time.”

Why don’t we have a magically delicious RUM store for systems? For the same reason it took so long to get RUM: the technology available today doesn’t afford us the opportunity to crunch that much data. Users work in human time (seconds and minutes) at human scale (tens of millions); systems work in computing time (nanoseconds) at cloud scale (tens of thousands of machines and components). The time resolution is roughly a billion times finer, and even a thousand-fold smaller population still leaves a factor of a million. It’s literally a million times harder to think about Real Systems Monitoring (RSM) than it is to think about Real User Monitoring (RUM).

The Birth of Real Systems Monitoring

The technology has not improved a million-fold over the last 10 years, so we can’t solve this RSM problem as comprehensively. But it has improved significantly, so we’re ready for a departure from synthetic systems monitoring into a brave new world. Circonus and many of its industry peers have been nipping at the heels of this problem and we are entering the age of tangible gains. Here’s what’s coming (and 5-10 years from now will be ubiquitous table stakes):

  • 100% sampling of microsecond or larger latencies in systems operation (i.e., you see everything).
  • Software and services will expose measurement streams from real activity.
  • Histograms as the primary data type in most measurement systems.
  • Significantly more sophisticated math to cope with human reasoning about large datasets.
  • Measurement collection at computer scale (billions of measurements per second).
  • Ultimately, a merger of RUM and RSM… after all, we only have systems because we have users.

Exciting Times

At Circonus, we’ve been building the architectures required to tackle these problems: the scale, the histograms, and the math. We see the cost efficiencies increasing, resulting in positive (and often huge) returns on investment. We see software and service providers avidly adding instrumentation that exposes real measurements to interested observers. We’re at an inflection point and the world of systems monitoring is about to take an evolutionary leap forward. These are exciting times.

No, We “Fixed the Glitch”

If you haven’t seen the movie Office Space, you should do so at your earliest convenience. As with the new TV comedy “Silicon Valley,” Mike Judge hits far too close to home for the movie to be comfortable… its hilarity, on the other hand, is indisputable. So much of our lives is wrapped up in “making the machine work” that comedic exposures of our industry’s deep malfunctions are, perhaps, the only things that keep me sane.

Not a day goes by that I don’t see some scene or line from “Office Space” percolate from either the Industry or Circonus itself. Just after 21:30 UTC on October 3rd was just another one of these events, but the situation that brought it up is interesting enough to share.

In “Office Space,” there is an employee named Milton, whom management believes they have fired, but who has been working and getting paid for years. Classic communication breakdown. However, due to the over-the-top passive aggressive behavior in the organization, management doesn’t want a confrontation to correct the situation. Instead of informing Milton, they simply decide to stop paying him and let the situation work itself out… They “fixed the glitch.” If you do this, you’re an asshole. Spoiler alert: Milton burns the building down.

The interesting thing about software is that it is full of bugs. So full of bugs that we tend to fix things we didn’t even know were broken. While it’s no less frustrating to have a “glitch” fixed on you, it’s a bit more understandable when it happens unintentionally. We’re fixing glitches before they are identified as glitches. This most commonly occurs with undocumented behavior that is assumed to be stable by some consumer of a system. It happens during a new feature introduction, or some other unrelated bug fixing, or a reimplementation of the system exhibiting the undocumented behavior, and then boom… some unsuspecting consumer has their world turned upside down. I’m sure we’ve done this at Circonus.

On October 3rd, a few customers had their Amazon Cloudwatch checks stop returning data. After much fretting and testing, we could find nothing wrong with Amazon’s API. Sure, it was a bit slow and gave stale information, but this is something we’ve accommodated from the beginning. Amazon’s Cloudwatch service is literally a metrics tire fire. But this was different… the answers just stopped happening.

Circonus’ collection system is three-tier (unlike many of our competitors that use two-tier systems). First, there’s the thing that’s got the info: the agent. In this case, the agent is the Cloudwatch API itself. Then, there’s the thing that stores and analyzes the data: Circonus SaaS. And finally there’s this middle tier that talks to the agents, then stores and forwards the data back to Circonus SaaS. We call this the broker. Brokers are basically babelfish; they speak every protocol (e.g. they can interrogate the Cloudwatch API), and they are spread out throughout the world. By spreading them out, we can place brokers closer to the agents so that network disruptions don’t affect the collection of data, and so that we get a more resilient observation fabric. This explains why I can assert that “we didn’t change anything,” even with as many as fifty code launches per day. The particular broker in question, the one talking to the Cloudwatch API, hadn’t been upgraded in weeks. Additionally, we audit changes to the configuration of the broker, and the configurations related to Cloudwatch interrogations hadn’t been modified either.

So, with no changes to the system or code asking Cloudwatch for data and no changes to the questions we are asking Cloudwatch, how could the answers just stop? Our first thought was that Amazon must have changed something, but that’s a pretty unsatisfying speculation without more evidence.

The way Cloudwatch works is that you ask for a metric and then limit the ask by fixing certain dimensions on the data. For example, if I wanted to look at a specific Elastic Load Balancer (ELB) servicing one of my properties and ascertain the number of healthy hosts backing it, then I’d work backwards. First, I’d ask for the number of healthy hosts, the “HealthyHostCount”, and then I’d limit that to the namespace “AWS/ELB” and specify a set of dimensions. Some of the available dimensions are “Service”, “Namespace”, and “LoadBalancerName”. Now, our Cloudwatch integration is very flexible, and users can specify whatever dimensions they please, knowing that it is possible that they might work themselves out of an answer (by setting dimensions that are not possible).

The particular Cloudwatch interrogation said that the dimensions should match the following: Service=”ELB”, Namespace=”AWS”, and LoadBalancerName=”website-prod13”. And behold: data. The broker was set to collect this data at 12:00 UTC on October 1st and to check it every minute.

[Graph: There appears to be an outage.]

As we can see from this graph, while it worked at first, there appears to be an outage. “It just stopped working.” Or did it? Around 21:30 on October 3rd, things went off the rails.

[Graph: Things went off the rails here.]

This graph tells a very different story than things “just stopping.” For anyone that runs very large clusters of machines where they do staged rollouts, this might look familiar. It looks a lot like a probability of 1 shifting to a probability of 0 over about two hours. Remember, there are no changes in what we are asking or how we are asking it… just different answers. In this case, the expected answer is 2, but we received no answer at all.

The part I regret most about this story is how long it took for the problem to be completely resolved. It turns out that removing the Service=”ELB” and Namespace=”AWS” dimensions, leaving only LoadBalancerName=”website-prod13”, resulted in Amazon Cloudwatch correctly returning the expected answer again. The sudden recovery on October 7th wasn’t magic; the customer changed the configuration in Circonus to eliminate those two dimensions from the query.

Our confidence is pretty high that nothing changed on our end. My confidence is also pretty high that in a code launch on October 3rd, Amazon “fixed a glitch.”

How We Keep Your Data Safe

Some of our users tell us that moving to Circonus from in-house operated tools was driven by the desire to outsource responsibility for availability and data safety to someone else. This might sound odd, but building highly resilient systems is hard, most engineering teams are focused on solving those hard problems for their applications and their customers, and it can be highly distracting (even flat-out wasteful) to focus on doing that for telemetry data. Telemetry data has very different storage, recall, and consistency requirements than most typical application data. At first, this might seem easier than your typical database problem. Once you’ve struggled first-hand with the combination of availability and performance requirements on the read path (for both failure detection and operational decision making) and the intensely painful, disk saturating write path, it becomes clear that this is a hard beast to wrestle.

When placing this responsibility in someone else’s hands, you could simply ignore the gruesome details and assume you’re in good hands… but you should expect an explanation of just how that’s all accomplished. At Circonus, we care very much about our customers’ data being correctly collected and reported on, but that would be all for naught if we couldn’t safely store it. Here’s how we do it…

ZFS

The first thing you should know is that we’re committed to ZFS. If you know about ZFS, that should make you feel better already. ZFS is the most advanced and production-ready file system available; full stop. It supports data redundancy and checksumming, as well as replication and snapshots. From an operational perspective, putting data on ZFS puts a big, fat smile on our faces.

ZFS Checksums

ZFS checksumming means that we can detect when the data we wrote to disk isn’t the data we’re reading back. How could that be? Bitrot and other errors on the read or write path can cause it. The bottom line is that at any reasonable scale, you will have bad data. If you are storing massive amounts of data on a filesystem that doesn’t protect the data round trip the way ZFS does, you may not know (or notice) errors in your data, but you almost certainly have them… and that should turn your stomach. In fact, we “scrub” the filesystems regularly just to check for and correct bitrot. Bitrot doesn’t separate us from our competitors; noticing it does.

ZFS Snapshots and Rollback

ZFS also supports snapshots. When we perform maintenance on our system, change on-disk formats, or apply large changesets of data, we use snapshots to protect the state of the overall system from operator error. Have we screwed things up? Have we broken databases? Have we corrupted data? Yes, you betcha; mistakes happen. Have we ever lost a piece of data or exposed bad data to customers? No, and we owe that largely to ZFS snapshots and rollback.

ZFS Device Management

On top of the data safety features baked into ZFS for modern-sized systems, ZFS supports software data redundancy across physical media. This might sound like RAID, and in concept it is, but because the implementation is baked through the whole filesystem block layer, it allows for enormous volumes with speedy recovery and rebuild times. So, getting into the nitty gritty: all data that arrives on a node is stored on at least two physical disks (think RAID-1). Any disk in our storage node can fail (and often several can fail) without any interruption of service, and replacement disks are integrated with online rebuild and zero service interruption.

Snowth Clustering

Our magic mojo is provided by our timeseries database, called Snowth. Snowth uses a consistent hashing model to place arriving data on more than one node. Our Snowth clusters are configured for three write copies. This means that every piece of data we collect is stored on three machines. With three copies of the data, any two machines in the cluster can fail (or malfunction) and both writes and reads can continue uninterrupted.
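
As a rough sketch of what “consistent hashing with three write copies” means, consider a ring of node hashes where each metric’s key is hashed onto the ring and the next three distinct nodes clockwise receive the write. The hash, ring construction, and node naming below are illustrative stand-ins, not Snowth’s actual implementation.

#include <stdint.h>
#include <stdio.h>

#define NNODES  8   /* nodes on the ring */
#define NCOPIES 3   /* write copies per measurement */

/* FNV-1a, a stand-in for whatever hash the real ring uses. */
static uint64_t hash64(const char *s) {
  uint64_t h = 1469598103934665603ULL;
  for(; *s; s++) { h ^= (uint8_t)*s; h *= 1099511628211ULL; }
  return h;
}

/* Walk clockwise from the key's point on the ring and take the next
 * NCOPIES distinct nodes; every node and every key lives at the point
 * given by its hash. */
static void place(const char *metric_key, int out_nodes[NCOPIES]) {
  uint64_t ring[NNODES], last = hash64(metric_key);
  for(int i = 0; i < NNODES; i++) {
    char name[32];
    snprintf(name, sizeof(name), "node-%d", i);
    ring[i] = hash64(name);
  }
  for(int found = 0; found < NCOPIES; found++) {
    int best = 0;
    uint64_t bestd = UINT64_MAX;
    for(int i = 0; i < NNODES; i++) {
      uint64_t d = ring[i] - last;  /* clockwise distance, wraps naturally */
      if(d != 0 && d < bestd) { bestd = d; best = i; }
    }
    out_nodes[found] = best;
    last = ring[best];
  }
}

The useful property of this arrangement is that adding or removing a node only remaps the keys adjacent to its point on the ring, rather than reshuffling everything.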

Since data is stored on two physical drives per node and stored redundantly on three nodes, that means each data point you record in Circonus lives on six disks.

Multi-Datacenter Operation

While disk failure and node failure can (and do) happen, datacenters can also fail (sometimes catastrophically due to fire or natural disaster). Each measurement you send to Circonus follows two completely independent paths from the broker to each of two datacenters. Basically, each production datacenter acts as a subscriber to the metrics published by a broker. The Snowth clusters in the two datacenters are completely independent. So while your data lives on only six disks in one datacenter, rest assured that it resides on six other disks in the other datacenter. We’re up to 12 disks.

Safe and Sound

How serious are we about data safety? Each measurement lives on 12 disks in 6 different chassis in 2 different datacenters, protected from bitrot, controller errors, and operator errors by ZFS. So yeah, we’re pretty serious about it.

How safe is your data? Maybe it’s time to ask your current vendor. You have a right to know.

Time, but Faster


Every once in awhile, you find yourself in a rabbit hole, unsure of where you are or what time it might be. In this post, I’ll show you a computing adventure about time through the looking glass.

The first premise you must understand was summed up perfectly by the late Douglas Adams: “Time is an illusion, lunchtime doubly so.” The concept of time, when colliding with decoupled networks of computers that run at billions of operations per second, is… well, the truth of the matter is that you simply never really know what time it is. That is why Leslie Lamport’s seminal paper on Lamport timestamps was so important to the industry, but this post is actually about wall-clock time, or a reasonably useful estimation of it.

Even on today’s computers, it is feasible to execute an instruction in under a nanosecond. When the white rabbit looks at his pocket watch, he is seeing what time it was a nanosecond before, as the light travels from the hands on the watch to his eye – assuming that Carroll’s timeless tale took place in a vacuum and that the rabbit was holding the watch one third of a meter from his eyes.

When you think of a distributed system where a cluster of fast computers are often more than one light-nanosecond away from each other, it is understandably difficult to time something that starts in one place and ends in another with nanosecond precision; this is the realm of physicists, not bums like us with commodity computing hardware run in environments we don’t even manage. To further upset the newcomer, every computer today is effectively a distributed system itself, with each CPU core having its own clock ticking away, with its own subtly different frequency and sense of universal beginning.

All that said, computers need to give us the illusion of a clock. Without it, we won’t know what time it is. As computers get faster and faster, we are able to improve the performance of our systems, but one fundamental of performance engineering is that you cannot improve what you cannot measure; so measure we must. The fundamental paradox is that as what we measure gets smaller, the cost of measuring it remains fixed, and thus becomes relatively monstrous.

The beginning of the tumble…

We write a database here at Circonus. It’s fast and it keeps getting faster and more efficient. We dump energy into this seemingly endless journey because we operate at scale and every bit of efficiency we eke out results in lower COGS (Cost Of Goods Sold) for us, better service for our customers, and fundamentally affords a cost-effectiveness of telemetry collection and analysis that approaches reasonable economics for us to “monitor all the things.” In that context…

Let us assume we want to achieve an average latency for operations of 1 microsecond. Now, let’s wrap some numbers around that. I’ll make some notes about certain aspects of hardware, but I’ll only really focus on hardware from the last several years. While you and I like to think in terms of seconds, computers don’t care about our concept of time… They only care about clock ticks.

The TSC

Online CPUs are forever marching forward at some frequency, and the period of this frequency is called a “tick.” In an effort to save power, many computers can shift between different power saving states that cause the frequency of the CPU to change. This could make it excruciatingly difficult to accurately tell high-granularity time if the frequency of the CPU were used for timing. Each core on a modern CPU has a TSC (TimeStamp Counter) that counts the number of ticks that have transpired, and you can read this counter with the cleverly named rdtsc assembly instruction. Also, modern CPUs have a “feature” called an invariant TSC, which guarantees that the frequency at which ticks occur will not change for any reason (most importantly, not for power saving state changes). My development box has an invariant TSC that ticks approximately 2.5999503 times per nanosecond. Other machines here have different frequencies.

The standard tooling to figure out how long an operation takes on a UNIX machine is either clock_gettime(CLOCK_MONOTONIC,…) or gethrtime(). These calls return the number of nanoseconds since some arbitrary fixed point in the past. I will use gethrtime() in my examples because it is shorter to write.

hrtime_t start = gethrtime();
some_operation_worthy_of_measurement();
hrtime_t elapsed = gethrtime() - start;

As we measure these things, the gethrtime() call itself will take some time. The question to ask is: where does the time it returns sit relative to the beginning and end of the gethrtime() call itself? We can answer that with benchmarks. The bias introduced by measuring the start and finish is relative to its contribution to overall running time. In other words, if we make the “operation” we’re measuring take a long time over many iterations, we can reduce the measurement bias to near zero. Timing gethrtime() with gethrtime() would look like this:

#define LOTS 10000000
hrtime_t start = gethrtime();
for(int i=0;i<LOTS;i++) (void)gethrtime();
hrtime_t elapsed = gethrtime() - start;
double avg_ns_per_op = (double) elapsed / (double)LOTS;

And behold: a benchmark is born. Furthermore, we could actually measure the number of ticks elapsed in the test by bracketing the test with calls to rdtsc in assembly. Note that you must bind yourself to a specific CPU on the box to make this effective because the TSC clocks on each core do not have the same concept of “beginning.”

If I run this on our two primary platforms (Linux and Illumos/OmniOS, on a 24-core 2.60GHz Intel box), then I get these results:

Operating System   Call           Call Time
Linux 3.10.0       gettimeofday   35.7 ns/op
Linux 3.10.0       gethrtime      119.8 ns/op
OmniOS r151014     gettimeofday   304.6 ns/op
OmniOS r151014     gethrtime      297.0 ns/op

The first noticeable thing is that Linux optimizes both of these calls significantly more than OmniOS does. This has actually been addressed as part of the LX brand work in SmartOS by Joyent and will soon be upstreamed into Illumos for general consumption by OmniOS. Alas, that isn’t the worst thing; objectively determining what time it is is simply too slow for microsecond level timing, even at the lower 119.8 ns/op number above. Note that gettimeofday() only supports microsecond level accuracy and thus is not suitable for timing faster operations.

At just 119.8ns/op, bracketing a 1 microsecond call will result in (119.8*2)/(1000 + 119.8*2) -> 19.33%. So 19.33% of the execution time is spent on calculating the timing, and that doesn’t even include the time spent recording the result. A good goal to target here is 10% or less. So, how do we get there?

Looking at our tools

On these same modern CPUs that have invariant TSCs, we have the rdtsc instruction, which reads the TSC, yet doesn’t provide insight into which CPU you are executing on. That would require either prefixing the call with a cpuid instruction or binding the executing thread to a specific core. The former adds ticks to the work; the latter is wholly inconvenient and can really defeat any advanced NUMA-aware scheduling that the kernel might provide you. Basically, binding the CPU gives you a super-fast, but overly restrictive solution. We just want the gethrtime() call to work and be fast.

We’re not the only ones in need. Out of the generally recognized need, the rdtscp instruction was introduced. It supplies the value in the TSC and a programmable 32 bit value. The operating system can program this value to be the ID of the CPU, and a sufficient amount of information is emitted in a single instruction… Don’t be deceived; this instruction isn’t cheap and it measures in at 34 ticks on this machine. We’ll code that instruction call as uint64_t mtev_rdtscp(int *cpuid), and that returns the TSC and optionally sets a cpuid to the programmed value.
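
A sketch of what such a wrapper can look like using the compiler intrinsic follows; the actual mtev_rdtscp in libmtev may differ, and the “CPU id in the low 12 bits of TSC_AUX” convention is a Linux detail assumed here.

#include <stdint.h>
#include <x86intrin.h>

/* Read the TSC and, optionally, which CPU the read happened on. */
static inline uint64_t my_rdtscp(int *cpuid) {
  unsigned int aux = 0;
  uint64_t tsc = __rdtscp(&aux);          /* one instruction: TSC + TSC_AUX */
  if(cpuid) *cpuid = (int)(aux & 0xfff);  /* assumption: cpu id in low 12 bits */
  return tsc;
}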

The first challenge here is to understand the frequency. This is a straight-forward timing exercise:

mtev_thread_bind_to_cpu(0);
hrtime_t start_ns = gethrtime();
uint64_t start_ticks = mtev_rdtscp(NULL);
sleep(10);
hrtime_t end_ns = gethrtime();
uint64_t end_ticks = mtev_rdtscp(NULL);
double ticks_per_nano = (double)(end_ticks-start_ticks) / (double)(end_ns-start_ns);
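
The mtev_thread_bind_to_cpu() call above is ordinary processor-affinity work; a Linux-flavored sketch of such a helper might look like the following (the real libmtev helper is portable across our platforms and may look different).

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to one CPU so all TSC reads during calibration
 * come from the same core's counter. */
static int bind_self_to_cpu(int cpu) {
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(cpu, &set);
  return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}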

The next challenge becomes quite clear when testing out this solution for timing the execution of a job… even the simplest of jobs:

uint64_t start = mtev_rdtscp(NULL);
*some_memory = 0;
uint64_t elapsed = mtev_rdtscp(NULL) - start;

This usually takes around 10ns, assuming there isn’t a major page fault during the assignment. 10ns to set a piece of memory! Now remember, that includes the average time of a call to mtev_rdtscp(), which is just over 9ns. That’s not really the problem… the problem is that sometimes we get HUGE answers. Why? We switched CPUs between the two reads, and the two TSC values come from completely unrelated counters. So, to rephrase the problem: we must relate the counters.

The code for skew assessment is a bit much to include inline here. The basic idea is that we should run a calibration loop on each CPU that measures the TSC*ticks_per_nano and assess the skew from gethrtime(), accommodating the running time of gethrtime(). As with most calibration loops, we’ll discard the most skewed and average the remaining. It’s basically back to primary school math regarding finding the linear intercept equation: y = mx + b. We want gethrtime() = ticks_per_nano * mtev_rdtscp() + skew. As the TSC is per CPU, we need to track m and b (ticks_per_nano and skew) on a per-CPU basis.

Another nuance is that these two values together describe the translation between a CPU’s TSC and the system’s gethrtime(), and they are estimates. That means two important things: (1) they need to be updated regularly to correct for error in the calibration and estimation, and (2) they need to be set and read atomically. Here enters the cmpxchg16b instruction.
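
A sketch of that idea follows, assuming gcc or clang on x86_64 and compilation with -mcx16; the names are illustrative and this is not libmtev’s actual implementation. Note that m is expressed here as nanoseconds per tick, the reciprocal of the ticks_per_nano computed by the calibration loop above.

#include <stdint.h>
#include <string.h>

#define MAX_CPUS 256

typedef struct {
  double nanos_per_tick;  /* m: nanoseconds per TSC tick (1 / ticks_per_nano) */
  double skew_ns;         /* b: offset aligning this CPU's TSC with gethrtime() */
} tsc_cal_t;              /* exactly 16 bytes, set and read as one unit */

static unsigned __int128 cal_per_cpu[MAX_CPUS];  /* one slot per CPU */

/* Publish a new (m, b) pair for a CPU with a single 16-byte CAS
 * (cmpxchg16b; compile with -mcx16 on x86_64). */
static void cal_store(int cpu, tsc_cal_t next) {
  unsigned __int128 newv, oldv;
  memcpy(&newv, &next, sizeof(newv));
  do {
    oldv = cal_per_cpu[cpu];
  } while(!__sync_bool_compare_and_swap(&cal_per_cpu[cpu], oldv, newv));
}

/* A CAS with identical old/new values is one way to read 16 bytes atomically. */
static tsc_cal_t cal_load(int cpu) {
  tsc_cal_t out;
  unsigned __int128 v = __sync_val_compare_and_swap(&cal_per_cpu[cpu], 0, 0);
  memcpy(&out, &v, sizeof(out));
  return out;
}

/* y = m*x + b: convert a TSC reading taken on this CPU into nanoseconds. */
static inline uint64_t ticks_to_ns(int cpu, uint64_t tsc) {
  tsc_cal_t c = cal_load(cpu);
  return (uint64_t)((double)tsc * c.nanos_per_tick + c.skew_ns);
}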

Additionally, we perform this calibration work every five seconds in a separate thread and attempt to make that thread high-priority on a real-time scheduler. It turns out that this all works quite well, even without the ability to change priority or scheduling class.

Gravy

Since we’re clearly having to correct for skew to align with the system gethrtime(), and the point in the past to which gethrtime() is relative is arbitrary (according to documentation), we’ve elected to make that “arbitrary” point the UNIX epoch. No additional instructions are required, and now we can use our replacement gethrtime() to power gettimeofday(). So, our y = mx + b is actually implemented as nano_seconds_since_epoch = ticks_per_nano * mtev_rdtscp() + skew. Obviously, we’ll only pick up changes to the wall clock (via adjtime() et al.) when we recalibrate.

Safety

Obviously things can and do go wrong. We’ve put a variety of fail-safe mechanisms in place to ensure proper behavior when our optimizations become unsafe. By default, we detect the lack of an invariant TSC and disable. If a calibration loop fails for too long (fifteen seconds), we mark the CPU as bad and disable. We do some rudimentary performance tests, and if the system’s gethrtime() can beat our emulation, then we disable. If all those tests pass, we still check to see if the underlying system can perform gettimeofday() faster than we can emulate it; if so, we disable gettimeofday() emulation. The goal is to have mtev_gethrtime() be as fast or faster than gethrtime() and to have mtev_gettimeofday() be as fast or faster than gettimeofday().

Results

The overall results are better than expected. Our original goal was to simply provide a way for our implementation on Illumos to meet the performance of Linux. The value of ZFS is deeply profound, and while Linux has some performance advantages in specific arenas, that doesn’t matter much if you have undetectable corruption of the data you store.

Further optimization is possible in the implementation, but we’ve stopped for now, having achieved our initial goals. Additionally, for the purposes of this test, we’ve built the code portably. We can find a couple of nanoseconds if we compile -march=native on machines supporting the AVX instruction set.

It is true that an approximately 40ns gethrtime() can still be considered slow relative to microsecond-level efforts, so prudent selection of measurement points is still necessary. It is also true that a 40ns gethrtime() opens up a new world of possibilities for user-space instrumentation. It has certainly opened our eyes to some startling things.

Operating System   Call           System Call Time   Mtev-variant Call   Speedup
Linux 3.10.0       gettimeofday   35.7 ns/op         35.7 ns/op          x1.00
Linux 3.10.0       gethrtime      119.8 ns/op        40.4 ns/op          x2.96
OmniOS r151014     gettimeofday   304.6 ns/op        47.6 ns/op          x6.40
OmniOS r151014     gethrtime      297.0 ns/op        39.9 ns/op          x7.44

This all comes for free with libmtev (see mtev_time.c for reference). As of writing this, Linux and Illumos are supported platforms and Darwin and FreeBSD do not have “faster time” support. The faster time support in libmtev was a collaborative effort between Riley Berton and Theo Schlossnagle.

Rapid Resolution With The Right Tools

Here at Circonus, we obviously monitor our infrastructure with the Circonus platform. However, there are many types of monitoring, and our product isn’t intended to cover all the bases. While Circonus leads to profound insights into the behavior of systems, it isn’t a deep-dive tool for when you already know a system is broken.

Many of our components here are written in C, so as to be as small, portable, and efficient as possible. One such component gets run on relatively small systems on-premises with customers, to perform the tasks of data collection, store-and-forward, and real-time observation. When it comes to the data collection pipeline, a malfunction is unacceptable.

This story is of one such unacceptable event, on a warm Wednesday in July.

Around 2pm, the engineering group is hacking away on their various deliverables for new features and system improvements to the underlying components that power Circonus. We’ve just launched our Pully McPushface feature to a wider audience (everyone) after over a year of dark launches.

At 2:47pm, a crash notice from Amsterdam appears in our #backtrace slack channel. We use Backtrace to capture and analyze crashes throughout our globally distributed production network. Crashes do happen, and this one will go into an engineer’s queue to be analyzed and fixed when they switch from one of their current tasks. At 2:49pm, things change; we get an eerily similar crash signature from Amsterdam. At this point, we know that we have an event worthy of interruption.

Around 2:51pm, we snag the Backtrace object identifier out of Slack and pull it up in the analysis tool (coroner). It provides the first several steps of post-mortem debugging analysis for us. About 8 minutes later, we have consensus on the problem and a proposed fix, and we commit that change to Github at 2:59pm.

Now, in the modern world, all this CI/CD stuff claims to provide great benefits. We use Jenkins internally for most of the development automation. Jenkins stirs and begins its build and test process. Now, this code runs on at least five supported platforms, and since the fix is in a library, we actually can count more than fifty packages that need to be rebuilt. After just 6 minutes, Jenkins notes all test suites passed and begins package builds at 3:05pm.

Building, and testing, and packaging software is tedious work. Many organizations have continuous integration, but fail to have continuous deployment. At 3:09pm, just 4 minutes later, we have all of our fixed packages deployed to our package repositories and ready for install. One might claim that we don’t have continuous deployment because our machines in the field do not auto-update their software. It’s an interesting fight internal to Circonus as to whether or not we should be performing automated updates; as of today, a human is still involved to green-light production deployment for new packages.

One more minute passes, and at 3:10pm, Amsterdam performs its yum update (it’s a CentOS box), and the Slack channel quiets. All is well.

The engineers here at Circonus have been building production software for many years, and a lot has changed over that time. I distinctly remember manually testing software in the 1990s. I remember some hobbled form of CI entering the scene, but we were still manually packaging software for distribution in the 2000s. Why did it take our industry so long to automate these pieces? I believe some of that has to do with relative scale. Collecting the necessary post-mortem debugging information to confidently issue a fix took hours or days when machines in production malfunctioned, because operations groups had to notice, report, and then interact with development. As such, the extra hour or two to compile, test, and package the software was inconsequential, reminding us of a golden XKCD comic in its time.

[XKCD: Compiling]

Recapping this event makes me, the old grey-beard, feel like I’m living in the future. Jenkins was able to reduce a process from hours to minutes, and Backtrace was able to reduce a highly inconsistent hours-to-days process to minutes.

A two minute MTTD and a 23 minute MTTR for a tested fix involving a core library. The future looks way more comfortable than the past.

LuaJIT and the Illumos VM

I’m about to dive into some esoteric stuff, so be warned. We use LuaJIT extensively throughout most of our product suite here at Circonus. It is core supported technology in the libmtev platform on which we build most of our high-performance server apps. Lua (and LuaJIT) integrates fantastically with C code, and it makes many complex tasks seem simple if you can get the abstraction right between procedural Lua code and a non-blocking event framework like the one within libmtev. Lua is all fine and well, but for performance reasons we use LuaJIT, and therein bodies are buried.

What’s so special about LuaJIT? An object in LuaJIT is always 64 bits. It can represent a double or, if the leading 13 bits are 1s (which is invalid in any normal 64-bit double representation), it can represent other things:

**                  ---MSW---.---LSW---
** primitive types |  itype  |         |
** lightuserdata   |  itype  |  void * |  (32 bit platforms)
** lightuserdata   |ffff|    void *    |  (64 bit platforms, 47 bit pointers)
** GC objects      |  itype  |  GCRef  |
** int (LJ_DUALNUM)|  itype  |   int   |
** number           -------double------

Ouch, that’s right. We’ve got pointer compaction going on. On 64-bit systems, we can only represent 47 bits of our pointers. That is 128TB of addressable space, but we’ll explore the problems it causes.

LuaJIT is a garbage-collected runtime and as such wants to manage its own memory. It mmaps large chunks of memory and then allocates and manages objects within those segments. These objects must sit in the lower 32 bits of the address space because LuaJIT leverages the other 32 bits for type information and other notes (e.g. garbage collection annotations). Linux and several other UNIX-like operating systems support a MAP_32BIT flag to mmap that instructs the kernel to return a memory mapped segment under the 4GB boundary.
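
A minimal Linux-flavored sketch of asking for low memory, as an illustration of the mechanism rather than LuaJIT’s actual allocator:

#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdint.h>
#include <stddef.h>

/* Ask for memory below the 4GB boundary: MAP_32BIT where the OS supports
 * it, otherwise fall back to hinting an address and hoping. */
static void *alloc_low(size_t len) {
#ifdef MAP_32BIT
  void *p = mmap(NULL, len, PROT_READ|PROT_WRITE,
                 MAP_PRIVATE|MAP_ANONYMOUS|MAP_32BIT, -1, 0);
  if(p != MAP_FAILED) return p;
#endif
  /* The kernel may ignore the hint; a real allocator must verify the
   * result actually landed below 4GB and retry with other hints. */
  void *hint = (void *)(uintptr_t)0x40000000;  /* arbitrary low address */
  void *p2 = mmap(hint, len, PROT_READ|PROT_WRITE,
                  MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
  return (p2 == MAP_FAILED) ? NULL : p2;
}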

[Diagram: LuaJIT_and_IllumosVM_1]

Here we see a diagram of how this usually works with LuaJIT on platforms that support MAP_32BIT. When asked for a MAP_32BIT mapping, Linux (and some other operating systems) starts near the 4GB boundary and works backwards. The heap (grown via brk()/sbrk(), where malloc and friends live) typically starts near the beginning of the address space and works upwards. 4GB is enough space that many apps never have problems, but if you are a heavy-lifting app, at some point you could legitimately attempt to grow your heap into the memory mapped regions, and that would result in a failed allocation attempt. You’re effectively out of memory! If your allocator uses mmap behind the scenes, you won’t have this problem.

Enter Illumos:

[Diagram: LuaJIT_and_IllumosVM_2]

Enter Illumos, and a set of problems emerges. On old versions of Illumos, MAP_32BIT wasn’t supported, which usually caused issues at around 300MB or so of heap usage. That’s an “itty-bitty living space,” not useful for most applications; certainly not ours. Additionally, the stack doesn’t grow down from the 47-bit boundary; it grows down from the 64-bit boundary.

On systems that don’t support MAP_32BIT, LuaJIT will make a best effort by leveraging mmap()’s “hinting” system to hint at memory locations it would like down in the lower 32bit space. This works and can get off the starting blocks, but we’ve compounded our issues, as you can see in this diagram. Because we’re not growing strictly down from the 4GB boundary, our heap has even less room to grow.

Lack of MAP_32BIT support was fixed back in August of 2013, but you know people are still running old systems. On more recent versions of Illumos, it looks more like Linux.

[Diagram: LuaJIT_and_IllumosVM_3]

The interaction between LuaJIT and our stratospheric stack pointers remains an unaddressed issue. If we push lightuserdata into Lua from our stack, we get a crash. Surprisingly, in all our years, we’ve run across only one dependency that has triggered the issue: LPEG, for which we have a patch.

We’re still left with a space issue: our apps still push the envelope, and 4GB of heap is a limitation we thought it would be ridiculous to accept. So we have a stupid trick to fix the issue.

On Illumos (and thus OmniOS and SmartOS), the “brk” starting point is immediately after the BSS segment. If we can simply inform the loader that our binary has a variable declared in BSS somewhere else, we can make our heap start growing from somewhere other than near zero.

# gcc -m64 -o mem mem.c
# elfdump -s mem  | grep _end
      [30]  0x0000000000412480 0x0000000000000000  OBJT GLOB  D    1 .bss           _end
      [88]  0x0000000000412480 0x0000000000000000  OBJT GLOB  D    0 .bss           _end

[Diagram: LuaJIT_and_IllumosVM_4]

Kudos to Rich Lowe for pointers on how to accomplish this feat. Using the Illumos linker, we can actually specify a map using -M map64bit, where map64bit is a file containing:

mason_dixon = LOAD ?E V0x100000000 L0x8;

# gcc -Wl,-M -Wl,map64bit -m64 -o mem mem.c
# elfdump -s mem  | grep _end
      [30]  0x0000000100001000 0x0000000000000000  OBJT GLOB  D    1 ABS            _end
      [88]  0x0000000100001000 0x0000000000000000  OBJT GLOB  D    0 ABS            _end

This has the effect of placing an 8 byte sized variable at the virtual address 4GB, and the VM space looks like the above diagram. Much more room!

That leaves roughly 4GB of open space for native LuaJIT memory management, and since we run lots of Lua VMs in a single process, this gets exercised. The heap grows up from 4GB, leaving the application plenty of room to grow. There is little risk of extending the heap past 128TB, as we don’t own any machines with that much memory. So everything on the heap and in the VM is cleanly accessible to LuaJIT.

So when we say we have some full stack engineers here at Circonus — we know which way the stack grows on x86_64.

The Uphill Battle for Visibility

In 2011, Circonus released a feature that revolutionized the way people could observe systems. We throw around the word “histogram” a lot and the industry has listened. And while we’ve made great progress, it seems that somewhere along the way the industry got lost.

I’ve said for some time that without a histogram, you just don’t understand the behavior of your system. Many listened, and the “histogram metric type” was born; it is now used in so many of today’s server apps to provide “better understanding” of systems behavior. The deep problem is that those histograms might be used “inside” the product, but from the operator’s perspective you only see things like min, max, median, and some other arbitrary percentiles (like 75 and 99). Quick question: do you know what you get when you store a 99th percentile over time? Answer: not a histogram. The “problem” with averages that I’ve railed against for years is not that averages are bad or useless, but instead that an average is a single statistical assessment of a large distribution of data… it doesn’t provide enough insight into the data set. Adding real histograms solves this issue, but analyzing a handful of quantiles that people are calling histograms puts us right back where we started. Perhaps we need a T-shirt: “I was sold a histogram, but all I got were a few lousy quantiles. :(”

I understand the problem: it’s hard. Storing histograms as an actual datatype is innovative to the point that people don’t understand it. Let’s say, for example, you’re trying to measure the performance of your Cassandra cluster… as you know, every transaction counts. In the old days, we all just tracked things like averages. If you served approximately 100,000 requests in a second, you would measure the service time of each, track an EWMA (exponentially weighted moving average), and store this single output over time as an assessment of ongoing performance.

Stored Values

Time   EWMA (seconds)
T1     0.0020
T2     0.0023
T3     0.0019
T4     0.0042
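
For concreteness, the rollup that produces a table like the one above is a one-line update per sample; a minimal sketch, with alpha standing in for whatever smoothing factor the system chooses:

/* A minimal sketch of the EWMA rollup described above (names are illustrative). */
typedef struct { double ewma; int initialized; } ewma_t;

static void ewma_update(ewma_t *e, double sample_seconds, double alpha) {
  if(!e->initialized) { e->ewma = sample_seconds; e->initialized = 1; return; }
  e->ewma = alpha * sample_seconds + (1.0 - alpha) * e->ewma;
}
/* Feed it ~100,000 service times per second; what gets stored each period
 * is the single number e->ewma, and everything else about the distribution
 * is gone. */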

Now, most monitoring systems didn’t actually store second-by-second data, so instead of having an approximation of an average of 100,000 or so measurements every second, you would have an approximation of an average of around 6,000,000 measurements every minute. When I first started talking about histograms and their value, it was easy for an audience to say, “Wow, if I see one datapoint representing six million, it stands to reason I don’t know all that much about the nature of the six million.”

Enter the histogram hack:

By tracking the service latencies in a histogram (and there are a lot of methods, from replacement sampling, to exponential decay, to high-definition-resolution collection), one now had a much richer model to reason about for understanding behavior. This is not the hack part… that followed. You see, people didn’t know what to do with these histograms because their monitoring/storage systems didn’t understand them, so they degraded all that hard work back into something like this:

Stored Values

Time   mean     min       max     50th     75th     95th     99th
T1     0.0020   0.00081   0.120   0.0031   0.0033   0.0038   0.091
T2     0.0023   0.00073   0.140   0.0027   0.0031   0.0033   0.092
T3     0.0019   0.00062   0.093   0.0024   0.0027   0.0029   0.051
T4     0.0042   0.00092   0.100   0.0043   0.0050   0.0057   0.082

This may seem like a lot of information, but even at second-by-second sampling, we’re taking 100,000 data points and reducing them to a mere seven. This is most certainly better than a single statistical output (EWMA), but claiming even marginal success is folly. You see, the histogram that was used to calculate those seven outputs could provide myriad other statistical insights, but all of that insight is lost if the histograms themselves aren’t stored. Additionally, reasoning about the extracted statistics over time is extremely difficult.

At Circonus, from the beginning, we realized the value of storing a histogram as a first-class data type in our time series database.

Stored Values            Derived Values

Time   Histogram         mean       min       max         50th       nth
T1     H1(100427)        mean(H1)   q(H1,0)   q(H1,100)   q(H1,50)   q(H1,n)
T2     H2(108455)        mean(H2)   q(H2,0)   q(H2,100)   q(H2,50)   q(H2,n)
T3     H3(94302)         mean(H3)   q(H3,0)   q(H3,100)   q(H3,50)   q(H3,n)
T4     H4(98223)         mean(H4)   q(H4,0)   q(H4,100)   q(H4,50)   q(H4,n)

Obviously, this is a little abstract, because how do you store a histogram? Well, we take the abstract to something quite concrete in Go, Java, Javascript, and C. This also means that we can view the actual shape of the distribution of the data and extract other assessments, like estimated modality (how many modes there are).
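
As a conceptual sketch of what “concrete” can mean here, the general log-linear binning idea behind compact histogram storage maps each sample to a bucket keyed by its decimal exponent and first two significant digits, so the structure stays small and mergeable no matter how many samples arrive. This is illustrative only, not Circonus’s actual circllhist format or API.

#include <math.h>
#include <stdint.h>

/* A bucket covers [val x 10^(exp-1), (val+1) x 10^(exp-1)), val in 10..99. */
typedef struct { int8_t exp; int8_t val; } bucket_key_t;

static bucket_key_t bucket_of(double d) {
  bucket_key_t k = { 0, 0 };
  if(d <= 0) return k;                   /* collapse non-positive samples */
  int e = (int)floor(log10(d));          /* decimal exponent */
  int m = (int)(d / pow(10.0, e - 1));   /* first two significant digits */
  if(m > 99) { m /= 10; e++; }           /* guard against rounding at bucket edges */
  if(m < 10) { m *= 10; e--; }
  k.exp = (int8_t)e;
  k.val = (int8_t)m;
  return k;
}
/* 0.0023s and 0.00231s land in the same bucket (val 23, exp -3), so a
 * counter per bucket captures the shape of millions of samples in a few
 * hundred integers, and two histograms merge by simply adding counters. */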

All this sounds awesome, right? It is! However, there is an uphill battle. When the industry started to get excited about using histograms, it adopted the hacky model where histograms are used internally to provide poor statistics upstream, and we’re left at Circonus trying to get water from a stone. It seems each and every product out there requires code modifications to be able to expose this sort of rich insight. Don’t fear, we’re climbing that hill, and more and more products are falling into place.

Today’s infrastructure and infrastructure applications (like Cassandra and Kafka), as well as today’s user and machine facing APIs, are incredibly sensitive to performance regressions. Without good visibility, the performance of your platform and the performance of your efforts atop it will always be guesswork.

In order to make things easier, we’re introducing Circonus’s Wirelatency Protocol Observer. Run it on your API server or on your Cassandra node and you’ll immediately get deep insight into the performance of your system via real histograms in all their glory.