No, We “Fixed the Glitch”

If you haven’t seen the movie Office Space, you should do so at your earliest convenience. As with the new TV comedy, “Silicon Valley,” Mike Judge hits far too close to home for the movie to be comfortable… its hilarity, on the other hand, is indisputable. So much of our lives are wrapped up in “making the machine work” that comedic exposures of our industries deep malfunctions are, perhaps, the only things that keep me sane.

Not a day goes by that I don’t see some scene or line from “Office Space” percolate from either the Industry or Circonus itself. Just after 21:30 UTC on October 3rd was just another one of these events, but the situation that brought it up is interesting enough to share.

In “Office Space,” there is an employee named Milton, whom management believes they have fired, but who has been working and getting paid for years. Classic communication breakdown. However, due to the over-the-top passive aggressive behavior in the organization, management doesn’t want a confrontation to correct the situation. Instead of informing Milton, they simply decide to stop paying him and let the situation work itself out… They “fixed the glitch.” If you do this, you’re an asshole. Spoiler alert: Milton burns the building down


The interesting thing about software is that it is full of bugs. So full of bugs, that we tend to fix things we didn’t even know were broken. While it’s no less frustrating to have a “glitch” fixed on you, it’s a bit more understandable when it happens unintentionally. We’re fixing glitches before they are identified as glitches. This most commonly occurs in undocumented behavior that is assumed to be stable by some consumer of a system. It happens during a new feature introduction, or some other unrelated bug fixing, or a reimplementation of the system exhibiting the undocumented behavior, and then boom… some unsuspecting consumer has their world turned upside down. I’m sure we’ve done this at Circonus.

On October 3rd, a few customers had their Amazon Cloudwatch checks stop returning data. After much fretting and testing, we could find nothing wrong with Amazon’s API. Sure, it was a bit slow and gave stale information, but this is something we’ve accommodated from the beginning. Amazon’s Cloudwatch service is literally a metrics tire fire. But this was different… the answers just stopped happening.

Circonus’ collection system is three-tier (unlike many of our competitors that use two-tier systems). First, there’s the thing that’s got the info: the agent. In this case, the agent is the Cloudwatch API itself. Then, there’s the thing that stores and analyzes the data: Circonus SaaS. And finally there’s this middle tier that talks to the agents, then stores and forwards the data back to Circonus SaaS. We call this the broker. Brokers are basically babelfish; they speak every protocol (e.g. they can interrogate the Cloudwatch API), and they are spread out throughout the world. By spreading them out, we can place brokers closer to the agents so that network disruptions don’t affect the collection of data, and so that we get a more resilient observation fabric. This explains why I can assert that “we didn’t change anything,” even with as many as fifty code launches per day. The particular broker in question, the one talking to the cloudwatch API, hadn’t been upgraded in weeks. Additionally, we audit changes to the the configuration of the broker, and the configurations related to Cloudwatch interrogations hasn’t been modified either.

So, with no changes to the system or code asking Cloudwatch for data and no changes to the questions we are asking Cloudwatch, how could the answers just stop? Our first thought was that Amazon must have changed something, but that’s a pretty unsatisfying speculation without more evidence.

The way Cloudwatch works is that you ask for a metric and then limit the ask by fixing certain dimensions on the data. For example, if I wanted to look at a specific Elastic Load Balancer (ELB) servicing one of my properties and ascertain the number of healthy hosts backing it, then I’d work backwards. First, I’d ask for the number of healthy hosts, the “HealthyHostCount”, and then I’d limit that to the namespace “AWS/ELB” and specify a set of dimensions. Some of the available dimensions are “Service”, “Namespace”, and “LoadBalancerName”. Now, our Cloudwatch integration is very flexible, and users can specify whatever dimensions they please, knowing that it is possible that they might work themselves out of an answer (by setting dimensions that are not possible).

The particular Cloudwatch interrogation said that dimension should match the following: Service=”ELB”, Namespace=”AWS”, and LoadBalancerName=”website-prod13.” And behold: data. The broker was set to collect this data at 12:00 UTC on October 1st and to check it every minute.

There appears to be an outage
There appears to be an outage.

As we can see from this graph, while it worked at first, there appears to be an outage. “It just stopped working.” Or did it? Around 21:30 on October 3rd, things went off the rails.

Things went off the rails here.
Things went off the rails here.

This graph tells a very different story than things “just stopping.” For anyone that runs very large clusters of machines where they do staged rollouts, this might look familiar. It looks a lot like a probability of 1 shifting to a probability of 0 over about two hours. Remember, there are no changes in what we are asking or how we are asking it… just different answers. In this case, the expected answer is 2, but we received no answer at all.

The part I regret most about this story is how long it took for the problem to be completely resolved. It turns out that by removing the Service=”ELB” and Namespace=”AWS” dimensions, leaving only the LoadBalancerName=”website-prod13”, resulted in Amazon Cloudwatch correctly returning the expected answer again. The sudden recovery on October 7th wasn’t magic; the customer changed the configuration in Circonus to eliminate those two dimensions from the query.

Our confidence is pretty high that nothing changed on our end. My confidence is also pretty high that in a code launch on October 3rd, Amazon “fixed a glitch.”

The Circonus API and Raspberry PI

Building a Raspberry PI Circonus AlertBot

To a Site Reliability Engineer, one of the most important things is making sure that you get alerts as soon as events happen. Circonus provides many contact options, such as SMS, email, Slack, PagerDuty, or VictorOps. For me, if I’m not actively watching for alerts,I rely on my phone to do something to get my attention. On occasion, I may bump the mute switch on my iPhone, or the battery might run out, etc, etc. It’s rare, but it happens.

Folks that know me also know that I get a kick out of the Raspberry Pi. They run linux, they’re cheap, and they provide all those nice GPIO pins. So it was a good practical exercise to have a Raspberry Pi do some work for me. I decided to build a back up alerting tool for those times when I don’t see the alerts on my phone.

First I attached a pair of LEDs to the GPIO (one red, one green), along with appropriate resistors, one to GPIO pin 7, and one to pin 11. Then I attached a super obnoxious buzzer with a resistor to GPIO pin 24, and a momentary switch (button) to GPIO pin 26 along with a pullup resistor.

Then I wrote up a script in Perl that will poll the Circonus API for any sev1 alerts. The green light means no alerts, and the red light means something broke. And since I’m usually sleeping when it goes off, the buzzer accompanies the red light. This is as annoying as intended, and therefore the momentary switch will signal to the script to keep quiet.

I have a few improvements planned for this set up. My next task is to implement a cheap OLED display to provide some information about an alert. I’ve also thought about replacing the LEDs with a relay that could switch on something more robust, like a lamp or a motor, which might be useful in an office setting.

I’ll admit the script leaves room for a lot of improvement, and pull requests are welcome! Either way, pairing the Raspberry Pi with the Circonus API has been fun and productive. I imagine there are plenty other applications that could make use the two! Join the Circonus Labs slack community and you can share and discuss your ideas with our engineers and other Circonus users.

Crash-Dump Necromancy

Recently, one of our on-premises customers experienced a system crash with one of their metric-storage nodes. Circonus offers the option to operate a complete, private copy of our product on a customer’s own premises. This may be because of a compliance requirement or for reasons of scale that make the public SaaS version unsuitable. In this case, the node was temporarily incapacitated, and recovered on its own without intervention. This is the story of how we diagnosed the issue to determine why the node had this panic, which then informed our customer’s decisions about hardware replacement.

Our metrics store runs on OmniOS, a distribution of illumos, itself a fork of the now-defunct OpenSolaris project. The Solaris pedigree is important, because it means we have a robust set of post-mortem debugging capabilities, beginning with a kernel crash dump.

Whenever an OmniOS system encounters an unexpected condition, the kernel logs a panic message and dumps the contents of kernel memory to the local dump or swap device. On the next boot, a service called savecore notices that a dump is present and captures that data as a compressed vmdump.X file in /var/crash/<hostname>, where X is just an ordinal number, starting at 0, to distinguish multiple crash dumps. Our customer was able to send that file to us for analysis.

Analysis of such a crash dump is performed with MDB, the Modular Debugger. MDB is something of a swiss-army knife for debugging in illumos-based systems. It can operate on various types of “targets,” including kernel crash dumps and application core files, as well as a running system. In this exercise, we’ll primarily be using MDB interactively, but it can also function quite handily in a shell pipeline, as we’ll see later.

The customer knew from the panic message that the system had panicked due to the ZFS deadman timer:

genunix: [ID 918906 kern.notice] I/O to pool 'data' appears to be hung.

ZFS is an advanced storage subsystem that combines the traditional capabilities of a filesystem, volume manager, and RAID controller into a unified storage stack. It features a pooled storage model, provable data integrity with end-to-end checksumming, and dead-simple administration. It’s a cornerstone of how Circonus protects metric data – we utilize node-local data redundancy with a mirroring discipline, as well as writing the same data to multiple nodes in the metric storage cluster. The end-to-end checksumming ensures that when we read back metric data, we can be confident it’s the same data we originally wrote.

The ZFS deadman timer does what it says on the tin – if the timer expires, the system panics and restarts. It is a last resort for when ZFS has sent one or more I/O commands to the underlying storage and has not heard back in the allotted time (this is tunable; the default is 1000 seconds, or about 16.5 minutes). Panicking the system at this point provides the administrator with a snapshot of the system state at the point when the deadman timer expired. As we will see, this is a rich source of information that helps us drill down on the culprit(s).

In order to analyze the customer’s crash dump, we must first decompress the crash dump file produced by savecore on the customer’s system:

# savecore -f vmdump.0 .

The uncompressed dump is written to ./vmcore.0, along with a copy of the corresponding kernel binary as ./unix.0. The kernel binary provides mappings of the core image for kernel functions, global variables, etc. To launch an interactive session using this crash dump, we just invoke mdb with the name of the core file.

# mdb vmcore.0
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci zfs mr_sas sd ip hook neti sockfs arp usba lofs idm random ptm ipc nfs cpc ufs logindmux ]

We are now at the prompt (>) where we can run MDB dcmds (debugger commands), which are routines that access various properties of the target. If we didn’t have the hint from the kernel log about the reason for the crash, we could see what we’re dealing with by using the ::status dcmd:

> ::status
debugging crash dump vmcore.0 (64-bit) from [hostname redacted]
operating system: 5.11 omnios-d0a4f2e (i86pc)
panic message: I/O to pool 'data' appears to be hung.
dump content: kernel pages only

The deadman timer is tracked in the spa_sync function, so the first order of business is to locate the kernel thread that was in this function. To do that, we use the ::stacks dcmd, which prints kernel thread stacks. This dcmd is provided by the illumos kernel, so it is only available when debugging a kernel crash dump or interacting with a live system in kernel mode. Additionally, we can narrow down the list of stacks to our target function with the ‘-c’ option:

> ::stacks -c spa_sync
THREAD           STATE    SOBJ                COUNT
ffffff00f4ce5c40 SLEEP    CV                      2

Interesting – a count of 2. ::stacks summarizes unique kernel thread stacks that share common features, such as thread state, type of synchronization object (conditional variable, mutex, etc.), and program counter. To find out more about these stacks, we can use another kernel-provided dcmd, ::findstack. This prints more information about a thread’s stack, including function arguments if the “-v” option is used. We can also take advantage of MDB’s handy pipelining capability, similar to a shell pipeline, in order to avoid a lot of tedious copy/pasting:

> ::stacks -c spa_sync | ::findstack -v
stack pointer for thread ffffff00f4ce5c40: ffffff00f4ce5960
[ ffffff00f4ce5960 _resume_from_idle+0xf4() ]
  ffffff00f4ce5990 swtch+0x141()
  ffffff00f4ce59d0 cv_wait+0x70(ffffff30ef69ff40, ffffff30ef69ff38)
  ffffff00f4ce5a10 zio_wait+0x5b(ffffff30ef69fc18)
  ffffff00f4ce5aa0 dsl_pool_sync+0xb3(ffffff21f098a6c0, 168a1db)
  ffffff00f4ce5b70 spa_sync+0x27e(ffffff21f0998000, 168a1db)
  ffffff00f4ce5c20 txg_sync_thread+0x227(ffffff21f098a6c0)
  ffffff00f4ce5c30 thread_start+8()
stack pointer for thread ffffff00f457dc40: ffffff00f457d960
[ ffffff00f457d960 _resume_from_idle+0xf4() ]
  ffffff00f457d990 swtch+0x141()
  ffffff00f457d9d0 cv_wait+0x70(ffffff30106b6dc8, ffffff30106b6dc0)
  ffffff00f457da10 zio_wait+0x5b(ffffff30106b6aa0)
  ffffff00f457daa0 dsl_pool_sync+0xb3(ffffff21f6494c80, 168a1df)
  ffffff00f457db70 spa_sync+0x27e(ffffff21f0f93000, 168a1df)
  ffffff00f457dc20 txg_sync_thread+0x227(ffffff21f6494c80)
  ffffff00f457dc30 thread_start+8()

Unlike the shell pipeline, the output that gets piped is not the literal text that we see in the terminal, but the value of an expression, which MDB calls “dot”. MDB’s syntax is based on computing the value of something, usually a memory address in the target, with dot representing the current address. Here, ::stacks is setting dot to the address of each stack in succession and transmitting those over the pipe, to be used as input by ::findstack. Then ::findstack operates on each one in turn. Without the pipeline, we would need to manually copy and paste each of the stack addresses and prefix it to the next dcmd, e.g. <addr>::findstack -v.

With both stacks shown with function arguments, we can see that the first argument to spa_sync differs between the two threads. From the source, we know that the first argument to spa_sync is a pointer to a spa_t struct, and that has a member, spa_name. This tells us the name of the pool that we’re syncing to. We can get that value by printing at the address of the struct and providing the struct type and member name we’re interested in:

> ffffff21f0998000::print -t spa_t spa_name
char [256] spa_name = [ "rpool" ]

> ffffff21f0f93000::print -t spa_t spa_name
char [256] spa_name = [ "data" ]

There is one for each pool in this system. We know from the panic message that it was the “data” pool that was hung, so we’ll focus on the “dc40” stack from here on:

stack pointer for thread ffffff00f457dc40: ffffff00f457d960
[ ffffff00f457d960 _resume_from_idle+0xf4() ]
  ffffff00f457d990 swtch+0x141()
  ffffff00f457d9d0 cv_wait+0x70(ffffff30106b6dc8, ffffff30106b6dc0)
  ffffff00f457da10 zio_wait+0x5b(ffffff30106b6aa0)
  ffffff00f457daa0 dsl_pool_sync+0xb3(ffffff21f6494c80, 168a1df)
  ffffff00f457db70 spa_sync+0x27e(ffffff21f0f93000, 168a1df)
  ffffff00f457dc20 txg_sync_thread+0x227(ffffff21f6494c80)
  ffffff00f457dc30 thread_start+8()

Moving up the stack, dsl_pool_sync issued some I/O, in the form of a zio_t, to the underlying storage. This resulted in calling zio_wait for notification that the I/O operations are complete. Each ZIO can have child ZIOs for related data, and those can have their own children, etc. So rather than looking up a single struct member, we potentially could be recursing down a non-trivial list. It’s therefore a good thing that, in addition to dcmds, MDB has something called a “walker,” which knows the specifics of a particular data structure and can automatically “walk” through it for us. The ZFS kernel module helpfully provides just such a walker for ZIOs called (perhaps obviously) ::zio.

> ::help zio

  zio - zio_t summary

  addr ::zio [-cpr]
        -c display children
        -p display parents
        -r display recursively

Starting from the address of the ZIO that we’re waiting on, we use both the -c option to show child ZIOs and the -r option to recurse down through any descendants. For a busy pool, this can produce a lot of output. We can use another MDB feature, calling out to the shell with “!” to send the output to a file that we can work with outside of MDB.

> ffffff30106b6aa0::zio -c -r !cat > zio_data
> ::quit

$ wc -l zio_data 
4798 zio_data

Now we can look at this file in a text editor in order to more easily explore the output.

I/O in ZFS flows through the ZIO pipeline, where various stages can not only perform the actual I/O, but also transform the data in different ways, such as compression or deduplication. Since we’re dealing with a hung pool, we’re interested in actual I/O. Therefore we want to look for ZIOs that are in the VDEV_IO_* stages.

ADDRESS                 TYPE  STAGE            WAITER           TIME_ELAPSED
ffffff30106b6aa0        NULL  CHECKSUM_VERIFY  ffffff00f457dc40 -
 ffffff30106afc58       WRITE ISSUE_ASYNC      -                -
  ffffff2e4e9248e0      WRITE ISSUE_ASYNC      -                -
   ffffff309de9ac18     WRITE ISSUE_ASYNC      -                -
    ffffff22749a8578    WRITE ISSUE_ASYNC      -                -
     ffffff318ec07750   WRITE ISSUE_ASYNC      -                -
      ffffff23025a5e58  WRITE ISSUE_ASYNC      -                -
       ffffff31ba43ee50 WRITE VDEV_IO_START    -                1007685ms

Starting at the top of our output file, we see the ZIO address that we were waiting on, followed by its descendants. This pool was primarily doing writes, as expected, given that it’s running the Circonus metrics store. If we proceed down the list, we eventually come to some actual VDEV_IO operations. In ZFS, a “vdev” (virtual device) is the building block for a pool. There are many types of vdevs (some representing higher-level objects such as mirror or raidz sets), but for our purposes, we’re talking about “leaf vdevs”. These are the actual devices that store data, i.e., disks. The last column is interesting too – time elapsed. This means that this ZIO has been in the VDEV_IO_START stage for a whopping 1007 seconds. Something’s definitely wrong here, and that lines up pretty well with the ZFS deadman timer.

We would expect that the ZIOs doing vdev I/O would have some information about which vdev they are operating on. Recalling the zio_t structure, does it mention vdevs?

> ffffff31ba43ee50::print -t zio_t !grep vdev
    zio_done_func_t *io_done = vdev_mirror_child_done
    vdev_t *io_vd = 0xffffff21f4db0d00

Yes it does, and there’s a pointer, *io_vd, to a vdev_t. If we print that with MDB, we find that it has a wealth of information about the vdev in question. In particular, there is a device path:

> ffffff21f4db0d00::print -t vdev_t
vdev_t {
    char *vdev_path = 0xffffff21f5b196e0 "/dev/dsk/c1t7d0s0"

Now we know that our waited-upon ZIO was at least writing to /dev/dsk/c1t7d0s0 and maybe to some others as well (remember, there were nearly 4800 lines of output for all the active ZIOs for the “data” zpool).

As an aside, we can actually collapse the above two invocations into one by using the structure pointer operator:

> ffffff31ba43ee50::print -t zio_t io_vd->vdev_path
char *io_vd->vdev_path = 0xffffff21f5b196e0 "/dev/dsk/c1t7d0s0"

It’s time to break out our text-processing tools and crunch this data. We’d need to get vdev_t pointers out of all the ZIOs that are “interesting,” and print the vdev_path from each one of those to show what device the ZIO was issued to. I like to use awk for this:

$ awk '$1 ~ /^ff/ && $5 != "-"' zio_data | head -5
       ffffff31ba43ee50 WRITE VDEV_IO_START    -                1007685ms
       ffffff31855c0c70 WRITE VDEV_IO_START    -                1007685ms
       ffffff2d9fe99af0 WRITE VDEV_IO_START    -                1007685ms
       ffffff22746e7060 WRITE VDEV_IO_START    -                1007686ms
       ffffff2d9a82c1b0 WRITE VDEV_IO_START    -                1007685ms

“Interesting” ZIOs in this case would be ones where the elapsed time has a value, i.e., the last column is not “-”. But this isn’t helpful on its own – we need to drill down into each zio_t, find the vdev_t, and extract the vdev_path from it. To do this, we build up a list of MDB dcmds and use the shell to invoke mdb with this list.

$ awk '$1 ~ /^ff/ && $5 != "-" \
    printf("%s::print -t zio_t io_vd->vdev_path\n", $1)
  }' zio_data | \
  mdb vmcore.0 | \
  sort | uniq
char *io_vd->vdev_path = 0xffffff21f5b196e0 "/dev/dsk/c1t7d0s0"
char *io_vd->vdev_path = 0xffffff21f5b196f8 "/dev/dsk/c1t6d0s0"
char *io_vd->vdev_path = 0xffffff21f5b19710 "/dev/dsk/c1t5d0s0"
char *io_vd->vdev_path = 0xffffff21f5b19728 "/dev/dsk/c1t4d0s0"

This takes the ZIO address (column 1) and prints the *io_vd->vdev_path value. Awk generates a command line for every ZIO address, which we shell-pipe into MDB. Then we sort and uniq the output to produce a concise list of the disks with outstanding I/O at the time of the panic. Also, in case you might be wondering, the vmcore.0 image is ~25G in size and this completes in under 0.5 seconds on a modest machine.

If a single disk were implicated, we might suspect that disk of being the culprit. However, since multiple disks are shown, the culprit is more likely either the disk controller or possibly some other core system problem, such as a PCIe bus error, a bad backplane port, or a cabling issue.

Even though this problem wasn’t caused by our software, the fact that OmniOS has these debugging facilities available enabled us to help the customer diagnose the issue with their system, leading to a replacement of the disk controller.

How We Keep Your Data Safe

Some of our users tell us that moving to Circonus from in-house operated tools was driven by the desire to outsource responsibility for availability and data safety to someone else. This might sound odd, but building highly resilient systems is hard, most engineering teams are focused on solving those hard problems for their applications and their customers, and it can be highly distracting (even flat-out wasteful) to focus on doing that for telemetry data. Telemetry data has very different storage, recall, and consistency requirements than most typical application data. At first, this might seem easier than your typical database problem. Once you’ve struggled first-hand with the combination of availability and performance requirements on the read path (for both failure detection and operational decision making) and the intensely painful, disk saturating write path, it becomes clear that this is a hard beast to wrestle.

When placing this responsibility in someone else’s hands, you could simply ignore the gruesome details and assume you’re in good hands… but you should expect an explanation of just how that’s all accomplished. At Circonus, we care very much about our customers’ data being correctly collected and reported on, but that would be all for naught if we couldn’t safely store it. Here’s how we do it…


The first thing you should know is that we’re committed to ZFS. If you know about ZFS, that should make you feel better already. ZFS is the most advanced and production-ready file system available; full stop. It supports data redundancy and checksumming, as well as replication and snapshots. From an operational perspective, putting data on ZFS puts a big, fat smile on our faces.

ZFS Checksums


ZFS checksumming means that we can detect when the data we wrote to disk isn’t the data we’re reading back. How could that be? Bitrot and other errors on the read or write path can cause it. The bottom line is that at any reasonable scale, you will have bad data. If you are storing massive data on a filesystem that doesn’t protect the data round trip the way ZFS does, you may not know (or notice) errors in your data, but you almost certainly have them… and that should turn your stomach. In fact, we “scrub” the filesystems regularly just to check and correct for bitrot. Bitrot doesn’t separate us from our competitors, noticing it does.

ZFS Snapshots and Rollback

ZFS also supports snapshots. When we perform maintenance on our system, change on-disk formats, or apply large changesets of data, we use snapshots to protect the state of the overall system from operator error. Have we screwed things up? Have we broken databases? Have we corrupted data? Yes, you betcha; mistakes happen. Have we ever lost a piece of data or exposed bad data to customers? No, and we owe that largely to ZFS snapshots and rollback.

ZFS Device Management

On top of the data safety features baked into ZFS for modern-sized systems, ZFS supports software data redundancy across physical media. This might sound like RAID, and in concept it is, but the implementation being baked through the whole filesystem block layer allows for enormous volumes with speedy recovery and rebuild times. So, getting into the nitty gritty: all data that arrives on a node is stored in at least two physical disks (think RAID-1). Any disk in our storage node can fail (and often several can fail) without any interruption of service, and replacement disks are integrated with online rebuild with zero service interruption.

Snowth Clustering


Our magic mojo is provided by our timeseries database, called Snowth. Snowth uses a consistent hashing model to place arriving data on more than one node. Our Snowth clusters are configured for three write copies. This means that every piece of data we collect is stored on three machines. With three copies of the data, any two machines in the cluster can fail (or malfunction) and both writes and reads can continue uninterrupted.

Since data is stored on two physical drives per node and stored redundantly on three nodes, that means each data point you record in Circonus lives on six disks.


Multi-Datacenter Operation

While disk failure and node failure can (and do) happen, datacenters can also fail (sometimes catastrophically due to fire or natural disaster). Each measurement you send to Circonus follows two completely independent paths from the broker to each of two datacenters. Basically, each production datacenter acts as a subscriber to the metrics published by a broker. The Snowth clusters in the two datacenters are completely independent. So while your data lives on only six disks in one datacenter, rest assured that it resides on six other disks in the other datacenter. We’re up to 12 disks.


Safe and Sound

How serious are we about data safety? Each measurement lives on 12 disks in 6 different chassis in 2 different datacenters, protected from bitrot, controller errors, and operator errors by ZFS. So yeah, we’re pretty serious about it.

How safe is your data? Maybe it’s time to ask your current vendor. You have a right to know.

Time, but Faster

Time, Faster 1

Every once in awhile, you find yourself in a rabbit hole, unsure of where you are or what time it might be. In this post, I’ll show you a computing adventure about time through the looking glass.

The first premise you must understand was summed up perfectly by the late Douglas Adams; “Time is an illusion, lunchtime doubly so.” The concept of time, when colliding with decoupled networks of computers that run at billions of operations per second, is… well, the truth of the matter is that you simply never really know what time it is. That is why Leslie Lamport’s seminal paper on Lamport timestamps was so important to the industry, but this post is actually about wall-clock time, or a reasonably useful estimation of it.

Time, Faster 2

Even on today’s computers, it is feasible to execute an instruction in under a nanosecond. When the white rabbit looks at his pocket watch, he is seeing what time it was a nanosecond before, as the light travels from the the hands on the watch to his eye – assuming that Carroll’s timeless tale took place in a vacuum and that the rabbit was holding the watch one third of a meter from his eyes.

When you think of a distributed system where a cluster of fast computers are often more than one light nanosecond away from each other, it is understandably difficult to time something that starts in one place and ends in another with nanosecond precision; this is the realm of physicists, not bums like us with commodity computing hardware run in environments we don’t even manage. To further upset the newcomer, every computer today is effectively a distributed system itself, with each CPU core having its own clock ticking away, with it’s own subtly different frequency and sense of universal beginning.

All that said, computers need to give us the illusion of a clock. Without it, we won’t know what time it is. As computers get faster and faster, we are able to improve the performance of our systems, but one fundamental of performance engineering is that you cannot improve what you cannot measure; so measure we must. The fundamental paradox is that as what we measure gets smaller, the cost of measuring it remains fixed, and thus becomes relatively monstrous.

The beginning of the tumble…

Time, Faster 3

We write a database here at Circonus. It’s fast and it keeps getting faster and more efficient. We dump energy into this seemingly endless journey because we operate at scale and every bit of efficiency we eke out results in lower COGS (Cost Of Goods Sold) for us, better service for our customers, and fundamentally affords a cost-effectiveness of telemetry collection and analysis that approaches reasonable economics for us to “monitor all the things.” In that context…

Let us assume we want to achieve an average latency for operations of 1 microsecond. Now, let’s wrap some numbers around that. I’ll make some notes about certain aspects of hardware, but I’ll only really focus on hardware from the last several years. While you and I like to think in terms of seconds, computers don’t care about our concept of time… They only care about clock ticks.


Time, Faster 4

Online CPUs are forever marching forward at some frequency, and the period of this frequency is called a “tick.” In an effort to save power, many computers can shift between different power saving states that cause the frequency of the CPU to change. This could make it excruciatingly difficult to accurately tell high-granularity time, if the frequency of the CPU were used for timing. Each core on a modern CPU has a TSC (TimeStamp Counter) that counts the number of ticks that have transpired, and you can read this counter with the cleverly named rdtsc assembly instruction. Also, modern CPUs have a “feature” called an invariant TSC, which guarantees that the frequency at which ticks occur will not change for any reason (but mainly for power saving mode changes). My development box has an invariant TSC that ticks approximately 2.5999503 times per nanosecond. Other machines here have different frequencies.

The standard tooling to figure out how long an operation takes on a UNIX machine is either clock_gettime(CLOCK_MONOTONIC,…) or gethrtime(). These calls return the number of nanoseconds since some arbitrary fixed point in the past. I will use gethrtime() in my examples because it is shorter to write.

hrtime_t start = gethrtime();
hrtime_t elapsed = gethrtime() - start;

As we measure these things, the gethrtime() call itself will take some time. The question to ask is: where does the time it returns sit relative to the beginning and end of the gethrtime() call itself? We can answer that with benchmarks. The bias introduced by measuring the start and finish is relative to its contribution to overall running time. In other words, if we make the “operation” we’re measuring take a long time over many iterations, we can reduce the measurement bias to near zero. Timing gethrtime() with gethrtime() would look like this:

#define LOTS 10000000
hrtime_t start = gethrtime();
for(int i=0;i<LOTS;i++) (void)gethrtime();
hrtime_t elapsed = gethrtime() - start;
double avg_ns_per_op = (double) elapsed / (double)LOTS;

And behold: a benchmark is born. Furthermore, we could actually measure the number of ticks elapsed in the test by bracketing the test with calls to rdtsc in assembly. Note that you must bind yourself to a specific CPU on the box to make this effective because the TSC clocks on each core do not have the same concept of “beginning.”

If I run this on our two primary platforms (Linux and Illumos/OmniOS,on a 24-core 2.60GHz Intel box), then I get these results:

Operating System Call Call Time
Linux 3.10.0 gettimeofday 35.7 ns/op
Linux 3.10.0 gethrtime 119.8 ns/op
OmniOS r151014 gettimeofday 304.6 ns/op
OmniOS r151014 gethrtime 297.0 ns/op

The first noticeable thing is that Linux optimizes both of these calls significantly more than OmniOS does. This has actually been addressed as part of the LX brand work in SmartOS by Joyent and will soon be upstreamed into Illumos for general consumption by OmniOS. Alas, that isn’t the worst thing; objectively determining what time it is is simply too slow for microsecond level timing, even at the lower 119.8 ns/op number above. Note that gettimeofday() only supports microsecond level accuracy and thus is not suitable for timing faster operations.

At just 119.8ns/op, bracketing a 1 microsecond call will result in (119.8*2)/(1000 + 119.8*2) -> 19.33%. So 19.33% of the execution time is spent on calculating the timing, and that doesn’t even include the time spent recording the result. A good goal to target here is 10% or less. So, how do we get there?

Looking at our tools

Time, Faster 5

On these same modern CPUs that have invariant TSCs, we have the rdtsc instruction, which reads the TSC, yet doesn’t provide insight into which CPU you are executing on. That would require either prefixing the call with a cpuid instruction or binding the executing thread to a specific core. The former adds ticks to the work, the latter is wholly inconvenient and can really defeat any advanced NUMA aware scheduling that the kernel might provide you. Basically, binding the CPU gives you a super-fast, but overly restrictive solution. We just want the gethrtime() call to work and be fast.

We’re not the only ones in need. Out of the generally recognized need, the rdtscp instruction was introduced. It supplies the value in the TSC and a programmable 32 bit value. The operating system can program this value to be the ID of the CPU, and a sufficient amount of information is emitted in a single instruction… Don’t be deceived; this instruction isn’t cheap and it measures in at 34 ticks on this machine. We’ll code that instruction call as uint64_t mtev_rdtscp(int *cpuid), and that returns the TSC and optionally sets a cpuid to the programmed value.

The first challenge here is to understand the frequency. This is a straight-forward timing exercise:

hrtime_t start_ns = gethrtime();
uint64_t start_ticks = mtev_rdtscp(NULL);
hrtime_t end_ns = gethrtime();
uint64_t end_ticks = mtev_rdtscp(NULL);
double ticks_per_nano = (double)(end_ticks-start_ticks) / (double)(end_ns-start_ns);

The next challenge becomes quite clear when testing out this solution for timing the execution of a job… even the simplest of jobs:

uint64_t start = mtev_rdtscp(NULL);
*some_memory = 0;
uint64_t elapsed = mtev_rdtscp(NULL) - start;

This usually takes around 10ns, assuming there isn’t a major page fault during the assignment. 10ns to set a piece of memory! Now remember, that includes the average time of a call to mtev_rdtscp(), which is just over 9ns. That’s not really the problem… the problem is that sometimes we get HUGE answers. Why? We switch CPUs and the output of the two TSC calls are reporting to completely unrelated counters. So, to rephrase the problem: we must relate the counters.

Time, Faster 6

The code for skew assessment is a bit much to include inline here. The basic idea is that we should run a calibration loop on each CPU that measures the TSC*ticks_per_nano and assess the skew from gethrtime(), accommodating the running time of gethrtime(). As with most calibration loops, we’ll discard the most skewed and average the remaining. It’s basically back to primary school math regarding finding the linear intercept equation: y = mx + b. We want gethrtime() = ticks_per_nano * mtev_rdtscp() + skew. As the TSC is per CPU, we need to track m and b (ticks_per_nano and skew) on a per-CPU basis.

Another nuance is that these two values together describe the translation between a CPU’s TSC and the systems gethrtime(), and they are estimates. That means two important things: (1) they need to be updated regularly to correct for error in the calibration and estimation, and (2) they need to be set and read atomically. Here enters the cmpxchg16b instruction.

Additionally, we perform this calibration work every five seconds in a separate thread and attempt to make that thread high-priority on a real-time scheduler. It turns out that this all works quite well, even without the ability to change priority or scheduling class.


Since we’re clearly having to correct for skew to align with the system gethrtime(), and the point in the past to which gethrtime() is relative is arbitrary (according to documentation), we’ve elected to make that “arbitrary” point UNIX epoch. No additional instructions are required, and now we can use our replacement gethrtime() to power gettimeofday(). So, our y = mx + b is actually implemented as nano_second_since_epoch = ticks_per_nano * mtev_rdtscp() + skew. Obviously, we’ll only pick up changes to the wall-clock (via adjtime() when we recalibrate.


Time, Faster 7

Obviously things can and do go wrong. We’ve put a variety of fail-safe mechanisms in place to ensure proper behavior when our optimizations become unsafe. By default, we detect the lack of an invariant TSC and disable. If a calibration loop fails for too long (fifteen seconds), we mark the CPU as bad and disable. We do some rudimentary performance tests, and if the system’s gethrtime() can beat our emulation, then we disable. If all those tests pass, we still check to see if the underlying system can perform gettimeofday() faster than we can emulate it; if so, we disable gettimeofday() emulation. The goal is to have mtev_gethrtime() be as fast or faster than gethrtime() and to have mtev_gettimeofday() be as fast or faster than gettimeofday().


Time, Faster 8

The overall results are better than expected. Our original goal was to simply provide a way for our implementation on Illumos to meet the performance of Linux. The value of ZFS is deeply profound, and while Linux has some performance advantages in specific arenas, that doesn’t matter much if you have undetectable corruption of the data you store.

Further optimization is possible in the implementation, but we’ve stopped for now, having achieved our initial goals. Additionally, for the purposes of this test, we’ve built the code portably. We can find a couple of nanoseconds if we compile -march=native on machines supporting the AVX instruction set.

It is true that an approximately 40ns gehrtime() can be considered slow enough, relative to microsecond-level efforts, that very prudent selection is still necessary. It is also true that 40ns gethrtime() can open up a new world of possibilities for user-space instrumentation. It has certainly opened our eyes to some startling things.

Operating System Call System Call Time Mtev-variant Call Speedup
Linux 3.10.0 gettimeofday 35.7 ns/op 35.7 ns/op x1.00
Linux 3.10.0 gethrtime 119.8 ns/op 40.4 ns/op x2.96
OmniOS r151014 gettimeofday 304.6 ns/op 47.6 ns/op x6.40
OmniOS r151014 gethrtime 297.0 ns/op 39.9 ns/op x7.44

This all comes for free with libmtev (see mtev_time.c for reference). As of writing this, Linux and Illumos are supported platforms and Darwin and FreeBSD do not have “faster time” support. The faster time support in libmtev was a collaborative effort between Riley Berton and Theo Schlossnagle.

Rapid Resolution With The Right Tools

Here at Circonus, we obviously monitor our infrastructure with the Circonus platform. However, there are many types of monitoring, and our product isn’t intended to cover all the bases. While Circonus leads to profound insights into the behavior of systems, it isn’t a deep-dive tool for when you already know a system is broken.

Many of our components here are written in C, so as to be as small, portable, and efficient as possible. One such component gets run on relatively small systems on-premises with customers, to perform the tasks of data collection, store-and-forward, and real-time observation. When it comes to the data collection pipeline, a malfunction is unacceptable.

This story is of one such unacceptable event, on a warm Wednesday in July.

Around 2pm, the engineering group is hacking away on their various deliverables for new features and system improvements to the underlying components that power Circonus. We’ve just launched our Pully McPushface feature to a wider audience (everyone) after over a year of dark launches.


At 2:47pm, a crash notice from Amsterdam appears in our #backtrace slack channel. We use Backtrace to capture and analyze crashes throughout our globally distributed production network. Crashes do happen, and this one will go into an engineer’s queue to be analyzed and fixed when they switch from one of their current tasks. At 2:49pm, things change; we get an eerily similar crash signature from Amsterdam. At this point, we know that we have an event worthy of interruption.

Around 2:51pm, we snag the Backtrace object identifier out of Slack and pull it up in the analysis tool (coroner). It provides the first several steps of post-mortem debugging analysis for us. About 8 minutes later, we have consensus on the problem and a proposed fix, and we commit that change to Github at 2:59pm.

Now, in the modern world, all this CI/CD stuff claims to provide great benefits. We use Jenkins internally for most of the development automation. Jenkins stirs and begins its build and test process. Now, this code runs on at least five supported platforms, and since the fix is in a library, we actually can count more than fifty packages that need to be rebuilt. After just 6 minutes, Jenkins notes all test suites passed and begins package builds at 3:05pm.

Building, and testing, and packaging software is tedious work. Many organizations have continuous integration, but fail to have continuous deployment. At 3:09pm, just 4 minutes later, we have all of our fixed packages deployed to our package repositories and ready for install. One might claim that we don’t have continuous deployment because our machines in the field do not auto-update their software. It’s an interesting fight internal to Circonus as to whether or not we should be performing automated updates; as of today, a human is still involved to green-light production deployment for new packages.

One more minute passes, and at 3:10pm, Amsterdam performs its yum update (it’s a CentOS box), and the Slack channel quiets. All is well.

The engineers here at Circonus have been building production software for many years, and a lot has changed over that time. I distinctly remember manually testing software in the 1990s. I remember some hobbled form of CI entering the scene, but we were still manually packaging software for distribution in the 2000s. Why did it take our industry so long to automate these pieces? I believe some of that has to do with relative scale. Collecting the necessary post-mortem debugging information to confidently issue a fix took hours or days when machines in production malfunctioned, because operations groups had to notice, report, and then interact with development. As such, the extra hour or two to compile, test, and package the software was inconsequential, reminding us of a golden XKCD comic in its time.


Recapping this event makes me, the old grey-beard, feel like I’m living in the future. Jenkins was able to reduce a process from hours to minutes, and Backtrace was able to reduce a highly inconsistent hours-to-days process to minutes.

A two minute MTTD and a 23 minute MTTR for a tested fix involving a core library. The future looks way more comfortable than the past.

LuaJIT and the Illumos VM

I’m about to dive into some esoteric stuff, so be warned. We use LuaJIT extensively throughout most of our product suite here at Circonus. It is core supported technology in the libmtev platform on which we build most of our high-performance server apps. Lua (and LuaJIT) integrate fantastically with C code and it makes many complex tasks seem simple if you can get the abstraction right between procedural lua code and a non-blocking event framework like that within libmtev. Lua is all fine and well, but for performance reasons we use LuaJIT and therein bodies are buried.

What’s so special about LuaJIT? An object in LuaJIT is always 64bit. It can represent a double, or if the leading 13 bits are 1s (which is invalid in any 64bit double representation) then it can represent other things:

**                  ---MSW---.---LSW---
** primitive types |  itype  |         |
** lightuserdata   |  itype  |  void * |  (32 bit platforms)
** lightuserdata   |ffff|    void *    |  (64 bit platforms, 47 bit pointers)
** GC objects      |  itype  |  GCRef  |
** int (LJ_DUALNUM)|  itype  |   int   |
** number           -------double------

Ouch, that’s right. We’ve got pointer compaction going on. On 64 bit systems, we can only represent 47 bits of our pointer. That is 128TB of addressable space, but we’ll explore problems it causes.

LuaJIT is a garbage collected language and as such wants to manage its own memory. It mmaps large chunks of memory and then allocates and manages objects within that segment. These objects actually must sit in the lower 32bits of memory because luajit leverages the other 32bits for type information and other notes (e.g. garbage collection annotations). Linux and several other UNIX-like operating systems support a MAP_32BIT flag to mmap that instructs the kernel to return a memory mapped segment under the 4GB boundary.


Here we see a diagram of how this usually works with LuaJIT on platforms that support MAP32_BIT. When asked for a MMAP_32BIT mapping, Linux (and some other operating systems) starts near the 4GB boundary and works backwards. The heap (or brk()/sbrk() calls), where malloc and friends live, typically starts near the beginning of memory space and works upwards. 4GB is enough space to cause many apps to not have problems, but if you are a heavy lifting app, at some point you could legitimately attempt to group your heap into the memory mapped regions and that would result in a failed allocation attempt. You’re effectively out of memory! If your allocator uses mmap behind the scenes, you won’t have this problem.

Enter Illumos:


Enter Illumos, and we have a set of problems emerge. On old versions of Illumos, MAP_32BIT wasn’t supported, and this usually caused issues around 300MB or so of heap usage. That’s an “Itty-bitty living space,” not useful for most applications; certainly not ours. Additionally, the stack doesn’t grow down from the 47-bit boundary; it grows down from the 64-bit boundary.

On systems that don’t support MAP_32BIT, LuaJIT will make a best effort by leveraging mmap()’s “hinting” system to hint at memory locations it would like down in the lower 32bit space. This works and can get off the starting blocks, but we’ve compounded our issues, as you can see in this diagram. Because we’re not growing strictly down from the 4GB boundary, our heap has even less room to grow.

Lack of MAP_32BIT support was fixed back in August of 2013, but you know people are still running old systems. On more recent versions Illumos, it looks more like Linux.


The interaction between LuaJIT and our stratospheric stack pointers remains an unaddressed issue. If we push lightuserdata into lua from our stack, we get a crash. Surprisingly, in all our years, we’ve ran across only one dependency that has triggered the issue: LPEG, for which we have a patch.

We’re still left with a space issue: our apps still push the envelope, and 4GB’s of heap is a limitation we thought it would be ridiculous to accept. So we have stupid trick to fix the issue.

On Illumos (and thus OmniOS and SmartOS), the “brk” starting point is immediately after the BSS segment. If we can simply inform the loader that our binary had a variable declared in BSS somewhere else, we could make our heap start growing from somewhere other than near zero.

# gcc -m64 -o mem mem.c
# elfdump -s mem  | grep _end
      [30]  0x0000000000412480 0x0000000000000000  OBJT GLOB  D    1 .bss           _end
      [88]  0x0000000000412480 0x0000000000000000  OBJT GLOB  D    0 .bss           _end

Kudos to Rich Lowe for pointers on how to accomplish this feat. Using the Illumos linker, we can actually specify a map using -M map64bit where map64bit is file containing:

mason_dixon = LOAD ?E V0x100000000 L0x8;

# gcc -Wl,-M -Wl,map64bit -m64 -o mem mem.c
# elfdump -s mem  | grep _end
      [30]  0x0000000100001000 0x0000000000000000  OBJT GLOB  D    1 ABS            _end
      [88]  0x0000000100001000 0x0000000000000000  OBJT GLOB  D    0 ABS            _end

This has the effect of placing an 8 byte sized variable at the virtual address 4GB, and the VM space looks like the above diagram. Much more room!

Roughly 4GBs of open space for native LuaJIT memory management, and since we run lots of VMs in a single process, this gets exercised. The heap grows up from 4GB, leaving the application plenty of room to grow. There is little risk of extending the heap past 128TB, as we don’t own any machines with that much memory. So, everything on the heap and in the VM are all cleanly accessible to LuaJIT.

So when we say we have some full stack engineers here at Circonus — we know which way the stack grows on x86_64.

The Uphill Battle for Visibility

In 2011, Circonus released a feature that revolutionized the way people could observe systems. We throw around the word “histogram” a lot and the industry has listened. And while we’ve made great progress, it seems that somewhere along the way the industry got lost.

I’ve said for some time that without a histogram, you just don’t understand the behavior of your system. Many listened and the “histogram metric type” was born, and now is used in so many of today’s server apps to provide “better understanding” of systems behavior. The deep problem is that those histograms might be used “inside” the product, but from the operator’s perspective you only see things like min, max, median, and some other arbitrary percentiles (like 75 and 99). Quick question: do you know what you get when you store a 99th percentile over time? Answer: not a histogram. The “problem” with averages that I’ve railed against for years is not that averages are bad or useless, but instead that an average is a single statistical assessment of a large distribution of data… it doesn’t provide enough insight into the data set. Adding real histograms solves this issue, but analyzing a handful of quantiles that people are calling histograms puts us right back where we started. Perhaps we need a T-shirt: “I was sold a histogram, but I all got were a few lousy quantiles. :(”

I understand the problem: it’s hard. Storing histograms as an actual datatype is actually innovative to the point that people don’t understand it. Let’s say, for example, you’re trying to measure the performance of your Cassandra cluster… as you know, every transaction counts. In the old days, we all just tracked things like averages. If you served approximately 100,000 requests in a second, you would measure the service times of each and track an EWMA (exponential weighted moving average) and store this single output over time as an assessment of ongoing performance.

Stored Values
Time EWMA (seconds)
T1 0.0020
T2 0.0023
T3 0.0019
T4 0.0042

Now, most monitoring systems didn’t actually store second-by-second data, so instead of having an approximation of an average of 100,000 or so measurements every second, you would have an approximation of an average of around 6,000,000 measurements every minute. When I first started talking about histograms and their value, it was easy for an audience to say, “Wow, if I see one datapoint representing six million, it stands to reason I don’t know all that much about the nature of the six million.”

Enter the histogram hack:

By tracking the service latencies in a histogram (and there a lot of methods, from replacement sampling, to exponentially decaying, to high-definition-resolution collection), one now had a much richer model to reason about for understanding behavior. This is not the hack part… that followed. You see, people didn’t know what to do with these histograms because their monitoring/storage systems didn’t understand them, so they degraded all that hard work back into something like this:

Stored Values
Time mean min max 50th 75th 95th 99th
T1 0.0020 0.00081 0.120 0.0031 0.0033 0.0038 0.091
T2 0.0023 0.00073 0.140 0.0027 0.0031 0.0033 0.092
T3 0.0019 0.00062 0.093 0.0024 0.0027 0.0029 0.051
T4 0.0042 0.00092 0.100 0.0043 0.0050 0.0057 0.082

This may seem like a lot of information, but even at second-by-second sampling, we’re taking 100,000 data points and reducing them to a mere seven. Is this is most-certainly better than a single statistical output (EWMA), but claiming even marginal success is folly. You see, the histogram that was used to calculate those 7 outputs could calculate myriad other statistical insights, but all of that insight is lost if those histograms aren’t stored. Additionally, reasoning about the extracted statistics over time is extremely difficult.

At Circonus, from the beginning, we realized the value of storing a histogram as a first-class data type in our time series database.

Stored Values Derived Values
Time Histogram mean min max 50th nth
T1 H1(100427) mean(H1) q(H1,0) q(H1,100) q(H1,50) q(H1,n)
T2 H2(108455) mean(H2) q(H2,0) q(H2,100) q(H2,50) q(H2,n)
T3 H3(94302) mean(H3) q(H3,0) q(H3,100) q(H3,50) q(H3,n)
T4 H4(98223) mean(H4) q(H4,0) q(H4,100) q(H4,50) q(H4,n)

Obviously, this is a little abstract, because how do you store a histogram? Well, we take the abstract to something quite concrete in Go, Java, Javascript, and C. This also means that we can view the actual shape of the distribution of the data and extract other assessments, like estimated modality (how many modes there are).

All this sounds awesome, right? It is! However, there is an uphill battle. When the industry started to get excited about using histograms, they adopted the hacky model where histograms are used internally to provide poor statistics upstream, we’re left at Circonus trying to get water from a stone. It seems each and every product out there requires code modifications to be able to expose this sort of rich insight. Don’t fear, we’re climbing that hill and more and more products are falling into place.

Today’s infrastructure and infrastructure applications (like Cassandra and Kafka), as well as today’s user and machine facing APIs, are incredibly sensitive to performance regressions. Without good visibilities, the performance of your platform and the performance of your efforts atop it will always be guesswork.

In order to make things easier, we’re introducing Circonus’s Wirelatency Protocol Observer. Run it on your API server or on your Cassandra node and you’ll immediately get deep insight to the performance of your system via real histograms in all their glory.

Percentages Aren’t People

This is a story about an engineering group celebrating success when it shouldn’t be… and their organization buying into it. This is not the fault of the engineering groups, or the operations team, or any one person. This is the fault of yesterday’s tools not providing the right data. The right insights. The ability to dig into the data to get meaningful information to push your business forward.

Herein, we’ll dive into a day in the life of an online service where a team wakes up and triages an outage after ignorantly celebrating a larger outage as a success just twelve hours before. All names have been removed to protect the exceptionally well-intending and competent parties. You see, the problem is that the industry has been misleading us with misapplied math and bad statistics for years.

I’ll set the stage with a simple fact of this business… when it takes longer than one and half seconds to use their service, users leave. Armed with this fact, let’s begin our journey. Despite this data living in Circonus, it isn’t measuring Circonus; alas, as stories are best told in the first person with friends along for the ride, I shall drop into the first-person plural for the rest of the ride: let’s go.

We track the user’s experience logging into the application. We do this not by synthetically logging in and measuring (we do this too, but only for functional testing), but by measuring each user’s experience and recording it. When drawn as a heatmap, the data looks like the graph below. The red line indicates a number that, through research, we’ve found to be a line of despair and loss. Delivering an experience of 1.5 seconds or slower causes our users to leave.


Heatmaps can be a bit confusing to reason about, so this is the last we’ll see of it here. The important part to remember is that we are storing a complete model of the distribution of user experiences over time and we’ll get to why that is important in just a bit. From this data, we can calculate and visualize all the things we’re used to.


The above is a graph for that same lonely day in June, and it shows milliseconds of latency; specifically, the line represents the average user experience. If I ask you to spot the problem on the graph, you can do so just as easily as a four year old; it’s glaring. However, you’ll note that our graph indicates we’re well under our 1.5s line of despair and loss. We’re all okay right? Wrong.

A long time ago, the industry realized that averages (and standard deviations) are very poor representations of sample sets because our populations are not normally distributed. Instead of using an average (specifically an arithmetic mean), we all decided that measuring on some large quantile would be better. We were right. So, an organization would pick a percentage: 99.9% or 99% and articulate, “I have to be at least ‘this good’ for at least ‘this percentage’ of my users.” If this percentage seems arbitrary, it is… but, like the 1.5 second line of despair and loss, it can be derived from lots of business data and user behavior studies.

This, ladies and gentlemen, is why we don’t use averages. Saying that averages are misleading is a bit less accurate than admitting that many people are misled by averages. They simply don’t represent the things that are important to us here: how are we treating our users? This question is critical because it is our users who fund us and our real question is, “How many users are having a dissatisfying experience?”


The above graph is radically different than the first; it might surprise you to know that it is showing the same underlying dataset. Instead of the average experience, it shows the 99th percentile experience over time. It is much clearer that we had something catastrophically bad happen at 5am. It also shows that aside from two small infractions (7:52pm and 11:00pm), the rest of the day delivered the objective of a “less than 1.5s 99th percentile experience.” Okay, let’s stop. That’s such a disgustingly opaque and clinical way to talk about what we’re representing. These are actual people attempting to use this service.

What we’re saying here is that for each of the points on the purple line in this graph, during the time window that it represents (at this zoom level, each point represents 4 minutes), that 99% of visitors had an experience better than the value, and 1% had an experience worse than the value. Here we should see our first problem: percentages aren’t people.

Reflecting on the day as a whole, we see a catastrophic problem at 5am, to which our mighty engineering organization responded and remediated diligently over the course of approximately fifty minutes. Go Team! The rest of the day was pretty good, and we have those two little blips to diagnose and fix going forward.

I’m glad we’re not using averages for monitoring! We’d most likely not have been alerted to that problem at 5am! Here is where most monitoring stories end because a few quantiles is all that is stored and the raw data behind everything isn’t available for further analysis. Let’s return to our earlier question, “How many users are having a dissatisfying experience?” Luckily for us, we know how many users were on the site, so we can actually just multiply 1% by the number of current visitors to understand “how many” of the users are having an experience worse than the graph… But that isn’t the question is it? The question is how many users are having a worse experience than 1.5s, not worse than the 99th percentile.


This graph adds a black line that shows the number of current users each minute on the site (numbered on the right axis). To illustrate how we’re really missing the point, let’s just take a random point from our 99th percentile graph (again each point represents 4 minutes at this zoom level). We randomly pick 9:32pm. The graph tells us that the 99th percentile experience at that point is at 1.266s. This is better than our goal of 1.5s. Well, looking at the black line we see that we have about 86 users each minute on the site at that point, or 344 users over the four minute period. 1% of that is between 3 and 4 users. Okay, we’re getting somewhere! So we know that between 3 and 4 users had an experience over 1.266s. Wait, that wasn’t our question. Who cares about 1.266s, when we want to know about 1.5s? We’re not getting anywhere at all.

Our objective is 1.5 seconds. We’re looking at this all upside down and backwards. We should not be asking how bad the experience is for the worst 1%, instead we should be asking what percentage has a bad experience (any experience worse than our objective of 1.5 seconds). We shouldn’t be asking about quantiles; we should be asking about inverse quantiles. Since we’re storing the whole distribution of experiences in Circonus, we can simply ask, “What percentage of the population is faster than 1.5s?” If we take one minus this inverse quantile at 1.5 seconds, we get exactly the answer to our question: What percentage of users had a “bad experience?”


Now we’re getting somewhere. It is clear that we had a bad time at 5am and we did pretty well with just some line noise during our successful prior evening, right? Let’s return to our first problem: percentages aren’t people.


Luckily, just as we did before, we can simply look at how many people are visiting the site (the green line above) and multiply that by the percentage of people having a bad time and we get the number of actual people. Now we’re talking about something everyone understands. How many people had a bad experience? Let’s multiply!


In this image, we have simply multiplied two data streams from before, and we see the human casualties of our system. This is the number of users per minute that we screwed out of a good experience. These are users that, in all likelihood, are taking their business elsewhere. As anyone that thinks about it for more than a few seconds realizes, a small percentage of a large number can easily be bigger than a large percentage of a small number. Managing to inverse quantile numbers (let alone abstractly reasoning about quantiles), without knowing the size of the population, is misleading (to put it mildly).

Another way to look at this graph is to integrate; that is, to calculate the area under the curve. Integrating a graph representing users over time results in a graph of users. In other words, the number of cumulative users that have had a bad experience.


This should be flat-out eye opening. The eight hours from 2am to 10am (including the event of despair and loss) affected 121 people. The eight hours preceding it affected almost as many: 113.

It can be pretty depressing to think you’ve celebrated a successful day of delivery only to learn that it really wasn’t that successful at all. But, this isn’t so much about celebrating successes that were actually failures; it’s about understanding what, when, and where you can improve. Every user matters; and if you treat them that way, you stand to get a lot more of them.


When you look back at your own graphs, just remember that the most casualties of our day happened in these two bands. You should be using inverse quantiles for SLA reporting; if you don’t have those, use quantiles… if you only have averages, you’re blind as a bat.

Understanding API Latencies

Today’s Internet is powered by APIs. Tomorrow’s will be even more so. Without a pretty UI or a captivating experience, you’re judged simply on performance and availability. As an API provider, it is more critical than ever to understand how your system is performing.

With the emergence of micro services, we have an API layered cake. And often that layered cake looks like one from a Dr. Seuss story. That complex systems fail in complex ways is a deep and painful truth that developers are facing now in even the most ordinary of applications. So, as we build these decoupled, often asynchronous, systems that compose a single user transaction from often tens of underlying networked subtransactions we’re left with a puzzle. How is the performance changing as volume increases usage and, often more importantly, how is it changing as we rapidly deploy micro updates to our micro services?

Developers have long known that they must be aware of their code performance and, at least in my experience, developers tend to be fairly good about minding their performance P’s and Q’s. However, in complex systems, the deployment environment and other production environmental conditions have tremendous influence on the actual performance delivered. The cry, “but it worked in dev” has moved from the functionality to the performance realm of software. I tell you now that I can sympathize.

It has always been a challenge to take a bug in functionality observed in production and build a repeatable test case in development to diagnose, address, and test for future regression. This challenge has been met by the best developers out there. The emergent conditions in complex, decoupled production system are nigh impossible to replicate in a development environment. This leaves developers fantastically frustrated and requires a different tack: production instrumentation.

As I see it, there are two approaches to production instrumentation that are critically important (there would be one approach if storage and retrieval were free and observation had no effect — alas we live in the real world and must compromise). You can either sacrifice coverage for depth or sacrifice depth for coverage. What am I talking about?

I’d love to be able to pick apart a single request coming into my service in excruciating detail. Watch it arrive, calculate the cycles spent on CPU, the time off, which instruction and stack took me off CPU, the activity that requested information from another microservice, the perceived latency between systems, all of the same things on the remote micro service, the disk accesses and latency on delivery for my query against Cassandra, and the details of the read-repair it induced. This list might seem long, but I could go on for pages. The amount of low-level work that is performed to serve even the simplest of requests is staggering… and every single step is subject to bugs, poor interactions, performance regressions and other generally bad behavior. The Google Dapper paper and the OpenZipkin project take a stab at delivering on this type of visibility, and now companies like Lightstep are attempting to deliver on this commercially. I’m excited! This type of tooling is one of two critical approaches to production system visibility.


The idea of storing this information on every single request that arrives is absurd today, but even when it is no longer absurd tomorrow, broad and insightful reporting on it will remain a challenge. Hence the need for the second approach.

You guessed it, Circonus falls squarely into the second approach: coverage over depth. You may choose not to agree with my terminology, but hopefully the point will come across. In this approach, instead of looking at individual transactions into the system (acknowledging that we cannot feasibly record and report all of them), we look at the individual components of the system and measure everything. That API we’re serving? Measure the latency of every single request on every exposed endpoint. The micro service you talked to? Measure the latency there. The network protocol over which you communicated? Measure the size of every single package sent in each direction. That Cassandra cluster? Measure the client-facing latency, but also measure the I/O latency of every single disk operation on each spindle (or EBS volume, or ephemeral SSD) on each node. It sounds like a lot of data, sure. We live in the future, and analytics systems are capable of handling a billion measurements per second these days, all the while remaining economical.


The above graph shows the full distribution of every IO operation on one of our core database nodes. The histogram in the breakout box shows three distinct modes (two tightly coupled in the left peak and one smaller mode further out in the latency spectrum. We can also see a radical divergence in behavior immediately following Feb 14th at 9am. As we’re looking at one week of data, each time slice vertically is 1h30m. The slice highlighted by the vertical grey hairline is displayed in the upper-left breakout box; it represents nearly 12 million data points alone. The full graph represents about 1.2 billion measurements, and fetching that from the Circonus time series database took 48ms. When you start using the right tools, your eyes will open.