How We Keep Your Data Safe

Some of our users tell us that moving to Circonus from in-house operated tools was driven by the desire to outsource responsibility for availability and data safety to someone else. This might sound odd, but building highly resilient systems is hard, most engineering teams are focused on solving those hard problems for their applications and their customers, and it can be highly distracting (even flat-out wasteful) to focus on doing that for telemetry data. Telemetry data has very different storage, recall, and consistency requirements than most typical application data. At first, this might seem easier than your typical database problem. Once you’ve struggled first-hand with the combination of availability and performance requirements on the read path (for both failure detection and operational decision making) and the intensely painful, disk saturating write path, it becomes clear that this is a hard beast to wrestle.

When placing this responsibility in someone else’s hands, you could simply ignore the gruesome details and assume you’re in good hands… but you should expect an explanation of just how that’s all accomplished. At Circonus, we care very much about our customers’ data being correctly collected and reported on, but that would be all for naught if we couldn’t safely store it. Here’s how we do it…

ZFS

The first thing you should know is that we’re committed to ZFS. If you know about ZFS, that should make you feel better already. ZFS is the most advanced and production-ready file system available; full stop. It supports data redundancy and checksumming, as well as replication and snapshots. From an operational perspective, putting data on ZFS puts a big, fat smile on our faces.

ZFS Checksums

[Figure: How We Keep Your Data Safe 1]

ZFS checksumming means that we can detect when the data we wrote to disk isn’t the data we’re reading back. How could that be? Bitrot and other errors on the read or write path can cause it. The bottom line is that at any reasonable scale, you will have bad data. If you are storing massive data on a filesystem that doesn’t protect the data round trip the way ZFS does, you may not know (or notice) errors in your data, but you almost certainly have them… and that should turn your stomach. In fact, we “scrub” the filesystems regularly just to check and correct for bitrot. Bitrot doesn’t separate us from our competitors, noticing it does.

ZFS Snapshots and Rollback

ZFS also supports snapshots. When we perform maintenance on our system, change on-disk formats, or apply large changesets of data, we use snapshots to protect the state of the overall system from operator error. Have we screwed things up? Have we broken databases? Have we corrupted data? Yes, you betcha; mistakes happen. Have we ever lost a piece of data or exposed bad data to customers? No, and we owe that largely to ZFS snapshots and rollback.

ZFS Device Management

On top of the data safety features baked into ZFS for modern-sized systems, ZFS supports software data redundancy across physical media. This might sound like RAID, and in concept it is, but because the implementation is baked into the filesystem’s block layer, it allows for enormous volumes with speedy recovery and rebuild times. So, getting into the nitty gritty: all data that arrives on a node is stored on at least two physical disks (think RAID-1). Any disk in our storage node can fail (and often several can fail) without any interruption of service, and replacement disks are integrated via online rebuild with zero service interruption.

Snowth Clustering

[Figure: How We Keep Your Data Safe 2]

Our magic mojo is provided by our timeseries database, called Snowth. Snowth uses a consistent hashing model to place arriving data on more than one node. Our Snowth clusters are configured for three write copies. This means that every piece of data we collect is stored on three machines. With three copies of the data, any two machines in the cluster can fail (or malfunction) and both writes and reads can continue uninterrupted.
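
To make the placement idea concrete, here is a minimal sketch of consistent hashing with multiple owners per key: nodes are hashed onto a ring many times, the metric name is hashed onto the same ring, and we walk clockwise until we have three distinct nodes. This illustrates the general technique only, not Snowth itself; the node count, the FNV-1a hash, and the metric name are illustrative assumptions.

/* Sketch: consistent hashing with three distinct owners per key (illustration only). */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NODES  4    /* storage nodes in the cluster */
#define VNODES 64   /* virtual points each node contributes to the ring */
#define COPIES 3    /* write copies per measurement */

typedef struct { uint64_t point; int node; } ring_entry;

/* FNV-1a, a simple well-known 64-bit hash, used here for illustration. */
static uint64_t fnv1a(const char *s) {
  uint64_t h = 14695981039346656037ULL;
  for (; *s; s++) { h ^= (uint8_t)*s; h *= 1099511628211ULL; }
  return h;
}

static int cmp(const void *a, const void *b) {
  uint64_t x = ((const ring_entry *)a)->point, y = ((const ring_entry *)b)->point;
  return (x > y) - (x < y);
}

int main(void) {
  ring_entry ring[NODES * VNODES];
  char name[32];

  /* Each node appears at VNODES pseudo-random points on the ring. */
  for (int n = 0; n < NODES; n++)
    for (int v = 0; v < VNODES; v++) {
      snprintf(name, sizeof(name), "node-%d/%d", n, v);
      ring[n * VNODES + v] = (ring_entry){ fnv1a(name), n };
    }
  qsort(ring, NODES * VNODES, sizeof(ring_entry), cmp);

  /* Hash the metric, then walk clockwise collecting COPIES distinct nodes. */
  uint64_t h = fnv1a("c_1234`cpu`idle");
  int start = 0;
  while (start < NODES * VNODES && ring[start].point < h) start++;

  int owners[COPIES], found = 0;
  for (int i = 0; i < NODES * VNODES && found < COPIES; i++) {
    int node = ring[(start + i) % (NODES * VNODES)].node;
    int seen = 0;
    for (int j = 0; j < found; j++) if (owners[j] == node) seen = 1;
    if (!seen) owners[found++] = node;
  }
  for (int j = 0; j < found; j++)
    printf("copy %d -> node %d\n", j + 1, owners[j]);
  return 0;
}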

Since data is stored on two physical drives per node and stored redundantly on three nodes, that means each data point you record in Circonus lives on six disks.

[Figure: How We Keep Your Data Safe 3]

Multi-Datacenter Operation

While disk failure and node failure can (and do) happen, datacenters can also fail (sometimes catastrophically due to fire or natural disaster). Each measurement you send to Circonus follows two completely independent paths from the broker to each of two datacenters. Basically, each production datacenter acts as a subscriber to the metrics published by a broker. The Snowth clusters in the two datacenters are completely independent. So while your data lives on only six disks in one datacenter, rest assured that it resides on six other disks in the other datacenter. We’re up to 12 disks.

[Figure: How We Keep Your Data Safe 4]

Safe and Sound

How serious are we about data safety? Each measurement lives on 12 disks in 6 different chassis in 2 different datacenters, protected from bitrot, controller errors, and operator errors by ZFS. So yeah, we’re pretty serious about it.

How safe is your data? Maybe it’s time to ask your current vendor. You have a right to know.

Time, but Faster


[Figure: Time, Faster 1]

Every once in a while, you find yourself in a rabbit hole, unsure of where you are or what time it might be. In this post, I’ll show you a computing adventure about time through the looking glass.

The first premise you must understand was summed up perfectly by the late Douglas Adams: “Time is an illusion, lunchtime doubly so.” The concept of time, when colliding with decoupled networks of computers that run at billions of operations per second, is… well, the truth of the matter is that you simply never really know what time it is. That is why Leslie Lamport’s seminal paper on Lamport timestamps was so important to the industry, but this post is actually about wall-clock time, or a reasonably useful estimation of it.

[Figure: Time, Faster 2]

Even on today’s computers, it is feasible to execute an instruction in under a nanosecond. When the white rabbit looks at his pocket watch, he is seeing what time it was a nanosecond before, as the light travels from the hands on the watch to his eye – assuming that Carroll’s timeless tale took place in a vacuum and that the rabbit was holding the watch one third of a meter from his eyes.

When you think of a distributed system where a cluster of fast computers are often more than one light nanosecond away from each other, it is understandably difficult to time something that starts in one place and ends in another with nanosecond precision; this is the realm of physicists, not bums like us with commodity computing hardware run in environments we don’t even manage. To further upset the newcomer, every computer today is effectively a distributed system itself, with each CPU core having its own clock ticking away, with its own subtly different frequency and sense of universal beginning.

All that said, computers need to give us the illusion of a clock. Without it, we won’t know what time it is. As computers get faster and faster, we are able to improve the performance of our systems, but one fundamental of performance engineering is that you cannot improve what you cannot measure; so measure we must. The fundamental paradox is that as what we measure gets smaller, the cost of measuring it remains fixed, and thus becomes relatively monstrous.

The beginning of the tumble…

[Figure: Time, Faster 3]

We write a database here at Circonus. It’s fast and it keeps getting faster and more efficient. We dump energy into this seemingly endless journey because we operate at scale and every bit of efficiency we eke out results in lower COGS (Cost Of Goods Sold) for us, better service for our customers, and fundamentally affords a cost-effectiveness of telemetry collection and analysis that approaches reasonable economics for us to “monitor all the things.” In that context…

Let us assume we want to achieve an average latency for operations of 1 microsecond. Now, let’s wrap some numbers around that. I’ll make some notes about certain aspects of hardware, but I’ll only really focus on hardware from the last several years. While you and I like to think in terms of seconds, computers don’t care about our concept of time… They only care about clock ticks.

The TSC

[Figure: Time, Faster 4]

Online CPUs are forever marching forward at some frequency, and the period of this frequency is called a “tick.” In an effort to save power, many computers can shift between different power saving states that cause the frequency of the CPU to change. This could make it excruciatingly difficult to accurately tell high-granularity time, if the frequency of the CPU were used for timing. Each core on a modern CPU has a TSC (TimeStamp Counter) that counts the number of ticks that have transpired, and you can read this counter with the cleverly named rdtsc assembly instruction. Also, modern CPUs have a “feature” called an invariant TSC, which guarantees that the frequency at which ticks occur will not change for any reason (but mainly for power saving mode changes). My development box has an invariant TSC that ticks approximately 2.5999503 times per nanosecond. Other machines here have different frequencies.
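
For the curious, reading the counter yourself is a one-liner with the compiler’s __rdtsc() intrinsic. A minimal sketch follows; converting ticks to nanoseconds requires the machine-specific rate, so the 2.5999503 figure is just the example number quoted above.

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

int main(void) {
  uint64_t t0 = __rdtsc();
  /* ... the operation you want to time goes here ... */
  uint64_t t1 = __rdtsc();

  /* Ticks only become nanoseconds once you know this machine's rate;
     2.5999503 is the example figure quoted above, not a constant. */
  double ticks_per_ns = 2.5999503;
  printf("elapsed: %llu ticks (~%.1f ns)\n",
         (unsigned long long)(t1 - t0), (double)(t1 - t0) / ticks_per_ns);
  return 0;
}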

The standard tooling to figure out how long an operation takes on a UNIX machine is either clock_gettime(CLOCK_MONOTONIC,…) or gethrtime(). These calls return the number of nanoseconds since some arbitrary fixed point in the past. I will use gethrtime() in my examples because it is shorter to write.

hrtime_t start = gethrtime();
some_operation_worthy_of_measurement();
hrtime_t elapsed = gethrtime() - start;

As we measure these things, the gethrtime() call itself will take some time. The question to ask is: where does the time it returns sit relative to the beginning and end of the gethrtime() call itself? We can answer that with benchmarks. The bias introduced by measuring the start and finish is relative to its contribution to overall running time. In other words, if we make the “operation” we’re measuring take a long time over many iterations, we can reduce the measurement bias to near zero. Timing gethrtime() with gethrtime() would look like this:

#define LOTS 10000000
hrtime_t start = gethrtime();
for(int i=0;i<LOTS;i++) (void)gethrtime();
hrtime_t elapsed = gethrtime() - start;
double avg_ns_per_op = (double) elapsed / (double)LOTS;

And behold: a benchmark is born. Furthermore, we could actually measure the number of ticks elapsed in the test by bracketing the test with calls to rdtsc in assembly. Note that you must bind yourself to a specific CPU on the box to make this effective because the TSC clocks on each core do not have the same concept of “beginning.”

If I run this on our two primary platforms (Linux and Illumos/OmniOS, on a 24-core 2.60GHz Intel box), then I get these results:

Operating System    Call            Call Time
Linux 3.10.0        gettimeofday    35.7 ns/op
Linux 3.10.0        gethrtime       119.8 ns/op
OmniOS r151014      gettimeofday    304.6 ns/op
OmniOS r151014      gethrtime       297.0 ns/op

The first noticeable thing is that Linux optimizes both of these calls significantly more than OmniOS does. This has actually been addressed as part of the LX brand work in SmartOS by Joyent and will soon be upstreamed into Illumos for general consumption by OmniOS. Alas, that isn’t the worst thing; objectively determining what time it is is simply too slow for microsecond level timing, even at the lower 119.8 ns/op number above. Note that gettimeofday() only supports microsecond level accuracy and thus is not suitable for timing faster operations.

At just 119.8ns/op, bracketing a 1 microsecond call will result in (119.8*2)/(1000 + 119.8*2) -> 19.33%. So 19.33% of the execution time is spent on calculating the timing, and that doesn’t even include the time spent recording the result. A good goal to target here is 10% or less. So, how do we get there?

Looking at our tools

[Figure: Time, Faster 5]

On these same modern CPUs that have invariant TSCs, we have the rdtsc instruction, which reads the TSC, yet doesn’t provide insight into which CPU you are executing on. That would require either prefixing the call with a cpuid instruction or binding the executing thread to a specific core. The former adds ticks to the work, the latter is wholly inconvenient and can really defeat any advanced NUMA aware scheduling that the kernel might provide you. Basically, binding the CPU gives you a super-fast, but overly restrictive solution. We just want the gethrtime() call to work and be fast.

We’re not the only ones in need. Out of the generally recognized need, the rdtscp instruction was introduced. It supplies the value in the TSC and a programmable 32 bit value. The operating system can program this value to be the ID of the CPU, and a sufficient amount of information is emitted in a single instruction… Don’t be deceived; this instruction isn’t cheap and it measures in at 34 ticks on this machine. We’ll code that instruction call as uint64_t mtev_rdtscp(int *cpuid), and that returns the TSC and optionally sets a cpuid to the programmed value.
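
A wrapper with that shape can be sketched on top of the compiler’s __rdtscp() intrinsic, as below. This is an illustration rather than libmtev’s actual implementation, and it assumes the operating system has programmed the auxiliary TSC register with the CPU id.

#include <stdint.h>
#include <x86intrin.h>

/* Returns the TSC; optionally reports which CPU the read executed on. */
static uint64_t my_rdtscp(int *cpuid) {
  unsigned int aux;
  uint64_t tsc = __rdtscp(&aux);   /* one instruction: TSC plus IA32_TSC_AUX */
  if (cpuid) *cpuid = (int)aux;    /* assumes the OS programmed TSC_AUX with the CPU id */
  return tsc;
}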

The first challenge here is to understand the frequency. This is a straight-forward timing exercise:

mtev_thread_bind_to_cpu(0);
hrtime_t start_ns = gethrtime();
uint64_t start_ticks = mtev_rdtscp(NULL);
sleep(10);
hrtime_t end_ns = gethrtime();
uint64_t end_ticks = mtev_rdtscp(NULL);
double ticks_per_nano = (double)(end_ticks-start_ticks) / (double)(end_ns-start_ns);

The next challenge becomes quite clear when testing out this solution for timing the execution of a job… even the simplest of jobs:

uint64_t start = mtev_rdtscp(NULL);
*some_memory = 0;
uint64_t elapsed = mtev_rdtscp(NULL) - start;

This usually takes around 10ns, assuming there isn’t a major page fault during the assignment. 10ns to set a piece of memory! Now remember, that includes the average time of a call to mtev_rdtscp(), which is just over 9ns. That’s not really the problem… the problem is that sometimes we get HUGE answers. Why? Because we switched CPUs between the two reads, so the two TSC calls report values from completely unrelated counters. So, to rephrase the problem: we must relate the counters.

[Figure: Time, Faster 6]

The code for skew assessment is a bit much to include inline here. The basic idea is that we should run a calibration loop on each CPU that measures TSC*ticks_per_nano and assesses the skew from gethrtime(), accommodating the running time of gethrtime(). As with most calibration loops, we’ll discard the most skewed and average the remaining. It’s basically back to primary school math regarding finding the linear intercept equation: y = mx + b. We want gethrtime() = ticks_per_nano * mtev_rdtscp() + skew. As the TSC is per CPU, we need to track m and b (ticks_per_nano and skew) on a per-CPU basis.

Another nuance is that these two values together describe the translation between a CPU’s TSC and the system’s gethrtime(), and they are estimates. That means two important things: (1) they need to be updated regularly to correct for error in the calibration and estimation, and (2) they need to be set and read atomically. Here enters the cmpxchg16b instruction.
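
Here is a sketch of what that per-CPU state could look like. The struct is exactly 16 bytes so the pair can be published and read as one atomic unit, which is precisely the job cmpxchg16b does; the GCC __atomic builtins stand in for the raw instruction here. This illustrates the technique and is not libmtev’s actual code.

#include <stdint.h>

typedef struct {
  double ticks_per_nano;   /* m: the slope measured by the calibration loop above */
  double skew;             /* b: the offset aligning TSC-derived time with gethrtime() */
} calib_t;                 /* exactly 16 bytes: publishable as one double-width store/CAS */

#define MAX_CPUS 256
static _Alignas(16) calib_t calib[MAX_CPUS];

/* Calibration thread: publish a fresh (m, b) pair for one CPU as a unit.
   With -mcx16 this can compile down to cmpxchg16b; otherwise libatomic steps in. */
static void calib_publish(int cpu, double ticks_per_nano, double skew) {
  calib_t fresh = { ticks_per_nano, skew };
  __atomic_store(&calib[cpu], &fresh, __ATOMIC_RELEASE);
}

/* Fast path: convert a TSC value read on `cpu` into nanoseconds. */
static uint64_t tsc_to_ns(int cpu, uint64_t tsc) {
  calib_t c;
  __atomic_load(&calib[cpu], &c, __ATOMIC_ACQUIRE);   /* read m and b together */
  return (uint64_t)((double)tsc / c.ticks_per_nano + c.skew);
}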

Additionally, we perform this calibration work every five seconds in a separate thread and attempt to make that thread high-priority on a real-time scheduler. It turns out that this all works quite well, even without the ability to change priority or scheduling class.

Gravy

Since we’re clearly having to correct for skew to align with the system gethrtime(), and the point in the past to which gethrtime() is relative is arbitrary (according to documentation), we’ve elected to make that “arbitrary” point the UNIX epoch. No additional instructions are required, and now we can use our replacement gethrtime() to power gettimeofday(). So, our y = mx + b is actually implemented as nano_second_since_epoch = ticks_per_nano * mtev_rdtscp() + skew. Obviously, we’ll only pick up changes to the wall-clock (via adjtime() et al.) when we recalibrate.

Safety

[Figure: Time, Faster 7]

Obviously things can and do go wrong. We’ve put a variety of fail-safe mechanisms in place to ensure proper behavior when our optimizations become unsafe. By default, we detect the lack of an invariant TSC and disable. If a calibration loop fails for too long (fifteen seconds), we mark the CPU as bad and disable. We do some rudimentary performance tests, and if the system’s gethrtime() can beat our emulation, then we disable. If all those tests pass, we still check to see if the underlying system can perform gettimeofday() faster than we can emulate it; if so, we disable gettimeofday() emulation. The goal is to have mtev_gethrtime() be as fast or faster than gethrtime() and to have mtev_gettimeofday() be as fast or faster than gettimeofday().

Results

[Figure: Time, Faster 8]

The overall results are better than expected. Our original goal was to simply provide a way for our implementation on Illumos to meet the performance of Linux. The value of ZFS is deeply profound, and while Linux has some performance advantages in specific arenas, that doesn’t matter much if you have undetectable corruption of the data you store.

Further optimization is possible in the implementation, but we’ve stopped for now, having achieved our initial goals. Additionally, for the purposes of this test, we’ve built the code portably. We can find a couple of nanoseconds if we compile -march=native on machines supporting the AVX instruction set.

It is true that an approximately 40ns gethrtime() is still slow enough, relative to microsecond-level efforts, that very prudent selection of what to time is still necessary. It is also true that a 40ns gethrtime() can open up a new world of possibilities for user-space instrumentation. It has certainly opened our eyes to some startling things.

Operating System    Call            System Call Time    Mtev-variant Call    Speedup
Linux 3.10.0        gettimeofday    35.7 ns/op          35.7 ns/op           x1.00
Linux 3.10.0        gethrtime       119.8 ns/op         40.4 ns/op           x2.96
OmniOS r151014      gettimeofday    304.6 ns/op         47.6 ns/op           x6.40
OmniOS r151014      gethrtime       297.0 ns/op         39.9 ns/op           x7.44

This all comes for free with libmtev (see mtev_time.c for reference). As of writing this, Linux and Illumos are supported platforms and Darwin and FreeBSD do not have “faster time” support. The faster time support in libmtev was a collaborative effort between Riley Berton and Theo Schlossnagle.

Rapid Resolution With The Right Tools

Here at Circonus, we obviously monitor our infrastructure with the Circonus platform. However, there are many types of monitoring, and our product isn’t intended to cover all the bases. While Circonus leads to profound insights into the behavior of systems, it isn’t a deep-dive tool for when you already know a system is broken.

Many of our components here are written in C, so as to be as small, portable, and efficient as possible. One such component gets run on relatively small systems on-premises with customers, to perform the tasks of data collection, store-and-forward, and real-time observation. When it comes to the data collection pipeline, a malfunction is unacceptable.

This story is of one such unacceptable event, on a warm Wednesday in July.

Around 2pm, the engineering group is hacking away on their various deliverables for new features and system improvements to the underlying components that power Circonus. We’ve just launched our Pully McPushface feature to a wider audience (everyone) after over a year of dark launches.

[Figure: Rapid Response with the Right Tools]

At 2:47pm, a crash notice from Amsterdam appears in our #backtrace slack channel. We use Backtrace to capture and analyze crashes throughout our globally distributed production network. Crashes do happen, and this one will go into an engineer’s queue to be analyzed and fixed when they switch from one of their current tasks. At 2:49pm, things change; we get an eerily similar crash signature from Amsterdam. At this point, we know that we have an event worthy of interruption.

Around 2:51pm, we snag the Backtrace object identifier out of Slack and pull it up in the analysis tool (coroner). It provides the first several steps of post-mortem debugging analysis for us. About 8 minutes later, we have consensus on the problem and a proposed fix, and we commit that change to GitHub at 2:59pm.

Now, in the modern world, all this CI/CD stuff claims to provide great benefits. We use Jenkins internally for most of the development automation. Jenkins stirs and begins its build and test process. Now, this code runs on at least five supported platforms, and since the fix is in a library, we actually can count more than fifty packages that need to be rebuilt. After just 6 minutes, Jenkins notes all test suites passed and begins package builds at 3:05pm.

Building, and testing, and packaging software is tedious work. Many organizations have continuous integration, but fail to have continuous deployment. At 3:09pm, just 4 minutes later, we have all of our fixed packages deployed to our package repositories and ready for install. One might claim that we don’t have continuous deployment because our machines in the field do not auto-update their software. It’s an interesting fight internal to Circonus as to whether or not we should be performing automated updates; as of today, a human is still involved to green-light production deployment for new packages.

One more minute passes, and at 3:10pm, Amsterdam performs its yum update (it’s a CentOS box), and the Slack channel quiets. All is well.

The engineers here at Circonus have been building production software for many years, and a lot has changed over that time. I distinctly remember manually testing software in the 1990s. I remember some hobbled form of CI entering the scene, but we were still manually packaging software for distribution in the 2000s. Why did it take our industry so long to automate these pieces? I believe some of that has to do with relative scale. Collecting the necessary post-mortem debugging information to confidently issue a fix took hours or days when machines in production malfunctioned, because operations groups had to notice, report, and then interact with development. As such, the extra hour or two to compile, test, and package the software was inconsequential, reminding us of a golden XKCD comic in its time.

[Image: XKCD “Compiling”]

Recapping this event makes me, the old grey-beard, feel like I’m living in the future. Jenkins was able to reduce a process from hours to minutes, and Backtrace was able to reduce a highly inconsistent hours-to-days process to minutes.

A two minute MTTD and a 23 minute MTTR for a tested fix involving a core library. The future looks way more comfortable than the past.

LuaJIT and the Illumos VM

I’m about to dive into some esoteric stuff, so be warned. We use LuaJIT extensively throughout most of our product suite here at Circonus. It is core supported technology in the libmtev platform on which we build most of our high-performance server apps. Lua (and LuaJIT) integrates fantastically with C code, which makes many complex tasks seem simple if you can get the abstraction right between procedural Lua code and a non-blocking event framework like the one within libmtev. Lua is all fine and well, but for performance reasons we use LuaJIT, and therein bodies are buried.

What’s so special about LuaJIT? An object in LuaJIT is always 64bit. It can represent a double, or if the leading 13 bits are 1s (which is invalid in any 64bit double representation) then it can represent other things:

**                  ---MSW---.---LSW---
** primitive types |  itype  |         |
** lightuserdata   |  itype  |  void * |  (32 bit platforms)
** lightuserdata   |ffff|    void *    |  (64 bit platforms, 47 bit pointers)
** GC objects      |  itype  |  GCRef  |
** int (LJ_DUALNUM)|  itype  |   int   |
** number           -------double------

Ouch, that’s right. We’ve got pointer compaction going on. On 64 bit systems, we can only represent 47 bits of our pointer. That is 128TB of addressable space, but we’ll explore problems it causes.

LuaJIT is a garbage collected language and as such wants to manage its own memory. It mmaps large chunks of memory and then allocates and manages objects within that segment. These objects must actually sit in the lower 32 bits of the address space, because LuaJIT leverages the other 32 bits for type information and other notes (e.g. garbage collection annotations). Linux and several other UNIX-like operating systems support a MAP_32BIT flag to mmap that instructs the kernel to return a memory mapped segment under the 4GB boundary.
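
The sketch below shows roughly what such a request looks like: MAP_32BIT where the operating system supports it, and a low-address hint as a fallback (the approach discussed further below). It is an illustration, not LuaJIT’s allocator; the arena size and hint address are arbitrary assumptions.

#define _GNU_SOURCE            /* for MAP_32BIT on Linux */
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
  size_t len = 1 << 20;        /* a 1MB arena, arbitrary for illustration */
  void *p;

#ifdef MAP_32BIT
  /* Ask the kernel for a mapping below the 4GB boundary. */
  p = mmap(NULL, len, PROT_READ | PROT_WRITE,
           MAP_PRIVATE | MAP_ANON | MAP_32BIT, -1, 0);
#else
  /* No MAP_32BIT: hint at a low address and hope the kernel honors it. */
  p = mmap((void *)0x10000000UL, len, PROT_READ | PROT_WRITE,
           MAP_PRIVATE | MAP_ANON, -1, 0);
#endif
  if (p == MAP_FAILED) { perror("mmap"); return 1; }
  printf("arena at %p%s\n", p,
         ((uintptr_t)p >> 32) ? " (outside the low 4GB!)" : "");
  munmap(p, len);
  return 0;
}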

[Figure: LuaJIT and Illumos VM 1]

Here we see a diagram of how this usually works with LuaJIT on platforms that support MAP_32BIT. When asked for a MAP_32BIT mapping, Linux (and some other operating systems) starts near the 4GB boundary and works backwards. The heap (or brk()/sbrk() calls), where malloc and friends live, typically starts near the beginning of memory space and works upwards. 4GB is enough space that many apps never have problems, but if you are a heavy-lifting app, at some point you could legitimately attempt to grow your heap into the memory mapped regions, and that would result in a failed allocation attempt. You’re effectively out of memory! If your allocator uses mmap behind the scenes, you won’t have this problem.

Enter Illumos:

[Figure: LuaJIT and Illumos VM 2]

Enter Illumos, and a set of problems emerges. On old versions of Illumos, MAP_32BIT wasn’t supported, and this usually caused issues around 300MB or so of heap usage. That’s an “Itty-bitty living space,” not useful for most applications; certainly not ours. Additionally, the stack doesn’t grow down from the 47-bit boundary; it grows down from the 64-bit boundary.

On systems that don’t support MAP_32BIT, LuaJIT will make a best effort by leveraging mmap()’s “hinting” system to hint at memory locations it would like down in the lower 32bit space. This works and can get off the starting blocks, but we’ve compounded our issues, as you can see in this diagram. Because we’re not growing strictly down from the 4GB boundary, our heap has even less room to grow.

Lack of MAP_32BIT support was fixed back in August of 2013, but you know people are still running old systems. On more recent versions of Illumos, it looks more like Linux.

[Figure: LuaJIT and Illumos VM 3]

The interaction between LuaJIT and our stratospheric stack pointers remains an unaddressed issue. If we push lightuserdata into Lua from our stack, we get a crash. Surprisingly, in all our years, we’ve run across only one dependency that has triggered the issue: LPEG, for which we have a patch.

We’re still left with a space issue: our apps still push the envelope, and 4GB of heap is a limitation we thought it would be ridiculous to accept. So we have a stupid trick to fix the issue.

On Illumos (and thus OmniOS and SmartOS), the “brk” starting point is immediately after the BSS segment. If we can simply inform the loader that our binary had a variable declared in BSS somewhere else, we could make our heap start growing from somewhere other than near zero.

# gcc -m64 -o mem mem.c
# elfdump -s mem  | grep _end
      [30]  0x0000000000412480 0x0000000000000000  OBJT GLOB  D    1 .bss           _end
      [88]  0x0000000000412480 0x0000000000000000  OBJT GLOB  D    0 .bss           _end

[Figure: LuaJIT and Illumos VM 4]

Kudos to Rich Lowe for pointers on how to accomplish this feat. Using the Illumos linker, we can actually specify a map using -M map64bit, where map64bit is a file containing:

mason_dixon = LOAD ?E V0x100000000 L0x8;

# gcc -Wl,-M -Wl,map64bit -m64 -o mem mem.c
# elfdump -s mem  | grep _end
      [30]  0x0000000100001000 0x0000000000000000  OBJT GLOB  D    1 ABS            _end
      [88]  0x0000000100001000 0x0000000000000000  OBJT GLOB  D    0 ABS            _end

This has the effect of placing an 8 byte sized variable at the virtual address 4GB, and the VM space looks like the above diagram. Much more room!

Roughly 4GB of open space for native LuaJIT memory management, and since we run lots of VMs in a single process, this gets exercised. The heap grows up from 4GB, leaving the application plenty of room to grow. There is little risk of extending the heap past 128TB, as we don’t own any machines with that much memory. So, everything on the heap and in the VM is cleanly accessible to LuaJIT.

So when we say we have some full stack engineers here at Circonus — we know which way the stack grows on x86_64.

The Uphill Battle for Visibility

In 2011, Circonus released a feature that revolutionized the way people could observe systems. We throw around the word “histogram” a lot and the industry has listened. And while we’ve made great progress, it seems that somewhere along the way the industry got lost.

I’ve said for some time that without a histogram, you just don’t understand the behavior of your system. Many listened and the “histogram metric type” was born, and now is used in so many of today’s server apps to provide “better understanding” of systems behavior. The deep problem is that those histograms might be used “inside” the product, but from the operator’s perspective you only see things like min, max, median, and some other arbitrary percentiles (like 75 and 99). Quick question: do you know what you get when you store a 99th percentile over time? Answer: not a histogram. The “problem” with averages that I’ve railed against for years is not that averages are bad or useless, but instead that an average is a single statistical assessment of a large distribution of data… it doesn’t provide enough insight into the data set. Adding real histograms solves this issue, but analyzing a handful of quantiles that people are calling histograms puts us right back where we started. Perhaps we need a T-shirt: “I was sold a histogram, but all I got were a few lousy quantiles. :(”

I understand the problem: it’s hard. Storing histograms as an actual datatype is innovative to the point that people don’t understand it. Let’s say, for example, you’re trying to measure the performance of your Cassandra cluster… as you know, every transaction counts. In the old days, we all just tracked things like averages. If you served approximately 100,000 requests in a second, you would measure the service times of each and track an EWMA (exponential weighted moving average) and store this single output over time as an assessment of ongoing performance.

Stored Values
Time    EWMA (seconds)
T1      0.0020
T2      0.0023
T3      0.0019
T4      0.0042
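
As an aside, the EWMA in the table above is cheap to keep precisely because it discards information: each new sample merely nudges a single running value. A minimal sketch, with the 0.1 smoothing factor chosen purely for illustration:

/* Exponentially weighted moving average: each series stores one running value. */
static double ewma_update(double current, double sample, double alpha) {
  /* alpha near 1.0 chases new samples; alpha near 0.0 smooths heavily */
  return alpha * sample + (1.0 - alpha) * current;
}

/* e.g. for each of the ~100,000 service times observed in a second:
     ewma = ewma_update(ewma, latency_seconds, 0.1);                   */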

Now, most monitoring systems didn’t actually store second-by-second data, so instead of having an approximation of an average of 100,000 or so measurements every second, you would have an approximation of an average of around 6,000,000 measurements every minute. When I first started talking about histograms and their value, it was easy for an audience to say, “Wow, if I see one datapoint representing six million, it stands to reason I don’t know all that much about the nature of the six million.”

Enter the histogram hack:

By tracking the service latencies in a histogram (and there are a lot of methods, from replacement sampling, to exponentially decaying, to high-definition-resolution collection), one now had a much richer model to reason about for understanding behavior. This is not the hack part… that followed. You see, people didn’t know what to do with these histograms because their monitoring/storage systems didn’t understand them, so they degraded all that hard work back into something like this:

Stored Values
Time    mean      min       max      50th      75th      95th      99th
T1      0.0020    0.00081   0.120    0.0031    0.0033    0.0038    0.091
T2      0.0023    0.00073   0.140    0.0027    0.0031    0.0033    0.092
T3      0.0019    0.00062   0.093    0.0024    0.0027    0.0029    0.051
T4      0.0042    0.00092   0.100    0.0043    0.0050    0.0057    0.082

This may seem like a lot of information, but even at second-by-second sampling, we’re taking 100,000 data points and reducing them to a mere seven. This is most certainly better than a single statistical output (EWMA), but claiming even marginal success is folly. You see, the histogram that was used to calculate those 7 outputs could calculate myriad other statistical insights, but all of that insight is lost if those histograms aren’t stored. Additionally, reasoning about the extracted statistics over time is extremely difficult.

At Circonus, from the beginning, we realized the value of storing a histogram as a first-class data type in our time series database.

        Stored Values    Derived Values
Time    Histogram        mean        min        max          50th        nth
T1      H1(100427)       mean(H1)    q(H1,0)    q(H1,100)    q(H1,50)    q(H1,n)
T2      H2(108455)       mean(H2)    q(H2,0)    q(H2,100)    q(H2,50)    q(H2,n)
T3      H3(94302)        mean(H3)    q(H3,0)    q(H3,100)    q(H3,50)    q(H3,n)
T4      H4(98223)        mean(H4)    q(H4,0)    q(H4,100)    q(H4,50)    q(H4,n)

Obviously, this is a little abstract, because how do you store a histogram? Well, we take the abstract to something quite concrete in Go, Java, JavaScript, and C. This also means that we can view the actual shape of the distribution of the data and extract other assessments, like estimated modality (how many modes there are).
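
As a toy illustration of the idea, the sketch below collapses each measurement into a coarse bin (two significant decimal digits) and keeps a count per bin; the set of (bin, count) pairs is the stored histogram. This mirrors the spirit of the approach, but it is a simplification, not the actual implementation behind those libraries.

#include <math.h>
#include <stdint.h>
#include <stdio.h>

typedef struct { double low; uint64_t count; } bin_t;

/* Collapse a value to two significant decimal digits: 0.002317 -> 0.0023 */
static double bin_of(double v) {
  if (v == 0) return 0;
  int e = (int)floor(log10(fabs(v)));
  double scale = pow(10, e - 1);
  return floor(v / scale + 1e-9) * scale;   /* small epsilon absorbs FP rounding */
}

int main(void) {
  double samples[] = { 0.0021, 0.0023, 0.0023, 0.0042, 0.091, 0.0024 };
  bin_t bins[64];
  int nbins = 0;

  for (size_t i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
    double b = bin_of(samples[i]);
    int j;
    for (j = 0; j < nbins && bins[j].low != b; j++) ;
    if (j == nbins) {                  /* first sample in this bin */
      if (nbins == 64) continue;       /* toy capacity guard */
      bins[nbins].low = b; bins[nbins].count = 0; nbins++;
    }
    bins[j].count++;
  }
  for (int j = 0; j < nbins; j++)      /* the stored "shape" of the distribution */
    printf("bin %.4f: %llu sample(s)\n", bins[j].low, (unsigned long long)bins[j].count);
  return 0;
}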

All this sounds awesome, right? It is! However, there is an uphill battle. Because the industry, when it started to get excited about histograms, adopted the hacky model where histograms are used internally only to emit poor statistics upstream, we’re left at Circonus trying to get water from a stone. It seems each and every product out there requires code modifications to be able to expose this sort of rich insight. Don’t fear: we’re climbing that hill, and more and more products are falling into place.

Today’s infrastructure and infrastructure applications (like Cassandra and Kafka), as well as today’s user and machine facing APIs, are incredibly sensitive to performance regressions. Without good visibility, the performance of your platform and the performance of your efforts atop it will always be guesswork.

In order to make things easier, we’re introducing Circonus’s Wirelatency Protocol Observer. Run it on your API server or on your Cassandra node and you’ll immediately get deep insight into the performance of your system via real histograms in all their glory.

Percentages Aren’t People

This is a story about an engineering group celebrating success when it shouldn’t be… and their organization buying into it. This is not the fault of the engineering groups, or the operations team, or any one person. This is the fault of yesterday’s tools not providing the right data. The right insights. The ability to dig into the data to get meaningful information to push your business forward.

Herein, we’ll dive into a day in the life of an online service where a team wakes up and triages an outage after ignorantly celebrating a larger outage as a success just twelve hours before. All names have been removed to protect the exceptionally well-intending and competent parties. You see, the problem is that the industry has been misleading us with misapplied math and bad statistics for years.

I’ll set the stage with a simple fact of this business… when it takes longer than one and a half seconds to use their service, users leave. Armed with this fact, let’s begin our journey. Despite this data living in Circonus, it isn’t measuring Circonus; alas, as stories are best told in the first person with friends along for the ride, I shall drop into the first-person plural for the rest of the ride: let’s go.

We track the user’s experience logging into the application. We do this not by synthetically logging in and measuring (we do this too, but only for functional testing), but by measuring each user’s experience and recording it. When drawn as a heatmap, the data looks like the graph below. The red line indicates a number that, through research, we’ve found to be a line of despair and loss. Delivering an experience of 1.5 seconds or slower causes our users to leave.

[Figure: Percentages Aren’t People 1]

Heatmaps can be a bit confusing to reason about, so this is the last we’ll see of it here. The important part to remember is that we are storing a complete model of the distribution of user experiences over time and we’ll get to why that is important in just a bit. From this data, we can calculate and visualize all the things we’re used to.

[Figure: Percentages Aren’t People 2]

The above is a graph for that same lonely day in June, and it shows milliseconds of latency; specifically, the line represents the average user experience. If I ask you to spot the problem on the graph, you can do so just as easily as a four year old; it’s glaring. However, you’ll note that our graph indicates we’re well under our 1.5s line of despair and loss. We’re all okay, right? Wrong.

A long time ago, the industry realized that averages (and standard deviations) are very poor representations of sample sets because our populations are not normally distributed. Instead of using an average (specifically an arithmetic mean), we all decided that measuring on some large quantile would be better. We were right. So, an organization would pick a percentage: 99.9% or 99% and articulate, “I have to be at least ‘this good’ for at least ‘this percentage’ of my users.” If this percentage seems arbitrary, it is… but, like the 1.5 second line of despair and loss, it can be derived from lots of business data and user behavior studies.

This, ladies and gentlemen, is why we don’t use averages. Saying that averages are misleading is a bit less accurate than admitting that many people are misled by averages. They simply don’t represent the things that are important to us here: how are we treating our users? This question is critical because it is our users who fund us and our real question is, “How many users are having a dissatisfying experience?”

[Figure: Percentages Aren’t People 3]

The above graph is radically different than the first; it might surprise you to know that it is showing the same underlying dataset. Instead of the average experience, it shows the 99th percentile experience over time. It is much clearer that we had something catastrophically bad happen at 5am. It also shows that aside from two small infractions (7:52pm and 11:00pm), the rest of the day delivered the objective of a “less than 1.5s 99th percentile experience.” Okay, let’s stop. That’s such a disgustingly opaque and clinical way to talk about what we’re representing. These are actual people attempting to use this service.

What we’re saying here is that for each of the points on the purple line in this graph, during the time window that it represents (at this zoom level, each point represents 4 minutes), that 99% of visitors had an experience better than the value, and 1% had an experience worse than the value. Here we should see our first problem: percentages aren’t people.

Reflecting on the day as a whole, we see a catastrophic problem at 5am, to which our mighty engineering organization responded and remediated diligently over the course of approximately fifty minutes. Go Team! The rest of the day was pretty good, and we have those two little blips to diagnose and fix going forward.

I’m glad we’re not using averages for monitoring! We’d most likely not have been alerted to that problem at 5am! Here is where most monitoring stories end because a few quantiles is all that is stored and the raw data behind everything isn’t available for further analysis. Let’s return to our earlier question, “How many users are having a dissatisfying experience?” Luckily for us, we know how many users were on the site, so we can actually just multiply 1% by the number of current visitors to understand “how many” of the users are having an experience worse than the graph… But that isn’t the question is it? The question is how many users are having a worse experience than 1.5s, not worse than the 99th percentile.

[Figure: Percentages Aren’t People 4]

This graph adds a black line that shows the number of current users each minute on the site (numbered on the right axis). To illustrate how we’re really missing the point, let’s just take a random point from our 99th percentile graph (again each point represents 4 minutes at this zoom level). We randomly pick 9:32pm. The graph tells us that the 99th percentile experience at that point is at 1.266s. This is better than our goal of 1.5s. Well, looking at the black line we see that we have about 86 users each minute on the site at that point, or 344 users over the four minute period. 1% of that is between 3 and 4 users. Okay, we’re getting somewhere! So we know that between 3 and 4 users had an experience over 1.266s. Wait, that wasn’t our question. Who cares about 1.266s, when we want to know about 1.5s? We’re not getting anywhere at all.

Our objective is 1.5 seconds. We’re looking at this all upside down and backwards. We should not be asking how bad the experience is for the worst 1%, instead we should be asking what percentage has a bad experience (any experience worse than our objective of 1.5 seconds). We shouldn’t be asking about quantiles; we should be asking about inverse quantiles. Since we’re storing the whole distribution of experiences in Circonus, we can simply ask, “What percentage of the population is faster than 1.5s?” If we take one minus this inverse quantile at 1.5 seconds, we get exactly the answer to our question: What percentage of users had a “bad experience?”
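
Computationally, the inverse quantile is just a walk over the stored distribution: count the samples above the threshold and divide by the total. The sketch below uses made-up buckets and counts purely for illustration.

#include <stdint.h>
#include <stdio.h>

typedef struct { double upper_bound_s; uint64_t count; } bucket_t;

int main(void) {
  /* A made-up four-minute window of login latencies, bucketed by upper bound. */
  bucket_t dist[] = {
    { 0.5, 250 }, { 1.0, 70 }, { 1.5, 21 }, { 2.0, 2 }, { 10.0, 1 }
  };
  double threshold = 1.5;   /* the "line of despair and loss" */
  uint64_t total = 0, slow = 0;

  for (size_t i = 0; i < sizeof(dist) / sizeof(dist[0]); i++) {
    total += dist[i].count;
    if (dist[i].upper_bound_s > threshold) slow += dist[i].count;  /* worse than 1.5s */
  }
  /* one minus the inverse quantile at 1.5s: the share of users having a bad experience */
  printf("%llu of %llu users (%.2f%%) were slower than %.1fs\n",
         (unsigned long long)slow, (unsigned long long)total,
         100.0 * (double)slow / (double)total, threshold);
  return 0;
}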

[Figure: Percentages Aren’t People 5]

Now we’re getting somewhere. It is clear that we had a bad time at 5am and we did pretty well with just some line noise during our successful prior evening, right? Let’s return to our first problem: percentages aren’t people.

[Figure: Percentages Aren’t People 6]

Luckily, just as we did before, we can simply look at how many people are visiting the site (the green line above) and multiply that by the percentage of people having a bad time and we get the number of actual people. Now we’re talking about something everyone understands. How many people had a bad experience? Let’s multiply!

[Figure: Percentages Aren’t People 7]

In this image, we have simply multiplied two data streams from before, and we see the human casualties of our system. This is the number of users per minute that we screwed out of a good experience. These are users that, in all likelihood, are taking their business elsewhere. As anyone who thinks about it for more than a few seconds realizes, a small percentage of a large number can easily be bigger than a large percentage of a small number. Reasoning about inverse quantile numbers (let alone reasoning abstractly about quantiles) without knowing the size of the population is misleading (to put it mildly).

Another way to look at this graph is to integrate; that is, to calculate the area under the curve. Integrating a graph representing users over time results in a graph of users. In other words, the number of cumulative users that have had a bad experience.

[Figure: Percentages Aren’t People 8]

This should be flat-out eye opening. The eight hours from 2am to 10am (including the event of despair and loss) affected 121 people. The eight hours preceding it affected almost as many: 113.

It can be pretty depressing to think you’ve celebrated a successful day of delivery only to learn that it really wasn’t that successful at all. But, this isn’t so much about celebrating successes that were actually failures; it’s about understanding what, when, and where you can improve. Every user matters; and if you treat them that way, you stand to get a lot more of them.

[Figure: Percentages Aren’t People 9]

When you look back at your own graphs, just remember that most of the casualties of our day happened in these two bands. You should be using inverse quantiles for SLA reporting; if you don’t have those, use quantiles… if you only have averages, you’re blind as a bat.

Understanding API Latencies

Today’s Internet is powered by APIs. Tomorrow’s will be even more so. Without a pretty UI or a captivating experience, you’re judged simply on performance and availability. As an API provider, it is more critical than ever to understand how your system is performing.

With the emergence of microservices, we have an API layered cake. And often that layered cake looks like one from a Dr. Seuss story. That complex systems fail in complex ways is a deep and painful truth that developers are facing now in even the most ordinary of applications. So, as we build these decoupled, often asynchronous systems that compose a single user transaction from tens of underlying networked subtransactions, we’re left with a puzzle. How is performance changing as usage volume increases and, often more importantly, how is it changing as we rapidly deploy micro updates to our microservices?

Developers have long known that they must be aware of their code performance and, at least in my experience, developers tend to be fairly good about minding their performance P’s and Q’s. However, in complex systems, the deployment environment and other production environmental conditions have tremendous influence on the actual performance delivered. The cry, “but it worked in dev” has moved from the functionality to the performance realm of software. I tell you now that I can sympathize.

It has always been a challenge to take a bug in functionality observed in production and build a repeatable test case in development to diagnose, address, and test for future regression. This challenge has been met by the best developers out there. The emergent conditions in complex, decoupled production systems are nigh impossible to replicate in a development environment. This leaves developers fantastically frustrated and requires a different tack: production instrumentation.

As I see it, there are two approaches to production instrumentation that are critically important (there would be one approach if storage and retrieval were free and observation had no effect — alas we live in the real world and must compromise). You can either sacrifice coverage for depth or sacrifice depth for coverage. What am I talking about?

I’d love to be able to pick apart a single request coming into my service in excruciating detail. Watch it arrive, calculate the cycles spent on CPU, the time spent off CPU, which instruction and stack took me off CPU, the activity that requested information from another microservice, the perceived latency between systems, all of the same things on the remote microservice, the disk accesses and latency on delivery for my query against Cassandra, and the details of the read-repair it induced. This list might seem long, but I could go on for pages. The amount of low-level work that is performed to serve even the simplest of requests is staggering… and every single step is subject to bugs, poor interactions, performance regressions and other generally bad behavior. The Google Dapper paper and the OpenZipkin project take a stab at delivering on this type of visibility, and now companies like Lightstep are attempting to deliver on this commercially. I’m excited! This type of tooling is one of two critical approaches to production system visibility.

[Figure: Understanding API Latencies 1]

The idea of storing this information on every single request that arrives is absurd today, but even when it is no longer absurd tomorrow, broad and insightful reporting on it will remain a challenge. Hence the need for the second approach.

You guessed it, Circonus falls squarely into the second approach: coverage over depth. You may choose not to agree with my terminology, but hopefully the point will come across. In this approach, instead of looking at individual transactions into the system (acknowledging that we cannot feasibly record and report all of them), we look at the individual components of the system and measure everything. That API we’re serving? Measure the latency of every single request on every exposed endpoint. The microservice you talked to? Measure the latency there. The network protocol over which you communicated? Measure the size of every single packet sent in each direction. That Cassandra cluster? Measure the client-facing latency, but also measure the I/O latency of every single disk operation on each spindle (or EBS volume, or ephemeral SSD) on each node. It sounds like a lot of data, sure. We live in the future, and analytics systems are capable of handling a billion measurements per second these days, all the while remaining economical.

[Figure: Understanding API Latencies 2]

The above graph shows the full distribution of every IO operation on one of our core database nodes. The histogram in the breakout box shows three distinct modes (two tightly coupled in the left peak and one smaller mode further out in the latency spectrum). We can also see a radical divergence in behavior immediately following Feb 14th at 9am. As we’re looking at one week of data, each time slice vertically is 1h30m. The slice highlighted by the vertical grey hairline is displayed in the upper-left breakout box; it represents nearly 12 million data points alone. The full graph represents about 1.2 billion measurements, and fetching that from the Circonus time series database took 48ms. When you start using the right tools, your eyes will open.

Pully McPushface

The Argument for Connectivity Agnosticism

[Figure: Turning the corner]

It’s about push vs. pull… but it shouldn’t be.

There has been a lot of heated debate on whether pushing telemetry data from systems or pulling that data from systems is better. If you’re just hearing about this argument now, bless you. One would think that this debate is as ridiculous as vim vs. emacs or tabs vs. spaces, but it turns out there is a bit of meat on this bone. The problem is that the proposition is wrong. I hope that here I can reframe the discussion productively to help turn the corner and walk a path where people get back to more productive things.

At Circonus, we’ve always been of the mindset that both push and pull should have their moments to shine. We accept both, but honestly, we are duped into this push vs. pull dialogue all too often. As I’ll explain, the choices we are shown aren’t the only options.

The idea behind pushing metrics is that the “system” in question (be it a machine or a service) should emit telemetry data to an “upstream” entity. The idea of pull is that some “upstream” entity should actively query systems for telemetry data. I am careful not to use the word “centralized” because in most large-scale modern monitoring systems, all of these bits (push or pull) are decentralized rather considerably. Let’s look through both sides of the argument (I’ve done the courtesy of marking the claims that are patently false as struck through):

Push has some arguments:

  1. Pull doesn’t scale well (struck through)
  2. I don’t know where my data will be coming from.
  3. Push works behind complex network setups.
  4. When events transpire, I should push; pulling doesn’t match my cadence.
  5. Push is more secure. (struck through)

Pull has some arguments:

  1. I know better when a machine or service goes bad because I control the polling interval.
  2. Controlling the polling interval allows me to investigate issues faster and more effectively.
  3. Pull is more secure. (struck through)

To address the strikethroughs in verse: Pulling data from 2 million machines isn’t a difficult job. Do you have more than 2 million machines? Pull scales fine… Google does it. When pulling data from a secure place to the cloud or pushing data from a secure place to the cloud, you are moving some data across the same boundary and are thus exposed to the same security risks involved. It is worth mentioning that in a setup where data is pulled, the target machine need not be able to even route to the Internet at all, thus making the attack surface more slippery. I personally find that argument to be weak and believe that if the right security policies are put in place, both methods can be considered equally “securable.” It’s also worth mentioning that many of those making claims about security concerns have wide open policies about pushing information beyond the boundaries of their digital enclave and should spend some serious time reflecting on that.


Now to address the remaining issues.

Push: I don’t know where my data will be coming from.

Yes, it’s true that you don’t always know where your data is coming from. A perfect example is web clients. They show up to load a page or call an API, and then could potentially disappear for good. You don’t own that resource and, more importantly, don’t pay an operational or capital expenditure on acquiring or running it. So, I sympathize that we don’t always know which systems will be submitting telemetry information to us. On the flip side, those machines or services that you know about and pay for — it’s just flat-out lazy to not know what they are. In the case of short-lived resources, it is imperative that you know when it is doing work and when it is gone for good. Considering this, it would stand to reason that the resource being monitored must initiate this. This is an argument for push… at least on layer 3. Woah! What? Why am I talking about OSI layers? I’ll get to that.

Push: Works behind complex network setups.

It turns out that pull actually works behind some complex network configurations where push fails, though these are quite rare in practice. Still, it also turns out that TCP sessions are bidirectional, so once you’ve conquered setup you’ve solved this issue. So this argument (and the rare counterargument) are layer 3 arguments that struggle to find any relevance at layer 7.

Push: When events transpire, I should push; pulling doesn’t match my cadence.

Finally, some real meat. I’ve talked about this many times in the past, and it is 100% true that some things you want to observe fall well into the push realm and others into the pull realm. When an event transpires, you likely want to get that information upstream as quickly as possible, so push makes good sense. And as this is information… we’re talking layer 7. If you instrument processes starting and stopping, you likely don’t want to miss anything. On the other hand, the way to not miss disk space usage monitoring on a system is to log every block allocation and deallocation — sounds like a bit of overkill, perhaps? This is a good example of where pulling that information at an operator’s discretion (say every few seconds or every minute) would suffice. Basically, sometimes it makes good sense to push on layer 7, sometimes it makes better sense to pull.

Pull: I know better when a machine or service goes bad because I control the polling interval.

This, to me, comes down to the responsible party. Is each of your million machines (or 10) responsible for detecting failure (in the form of absenteeism) or is that the responsibility of the monitoring system? That was rhetorical, of course. The monitoring system is responsible, full stop. Yet detecting failure of systems by tracking the absenteeism of data in the push model requires elaborate models on acceptable delinquency in emissions. When the monitoring system pulls data, it controls the interval and can determine unavailability in a way that is reliable, simple, and, perhaps most importantly, simple to reason about. While there are elements of layer 3 here if the client is not currently “connected” to the monitoring system, this issue is almost entirely addressed on layer 7.

Pull: Controlling the polling interval allows me to investigate issues faster and more effectively.

For metrics in many systems, taking a measurement every 100ms is overkill. I have thousands of metrics available on a machine, and most of them are very expressive on observation intervals as large as five minutes. However, there are times at which a tighter observation interval is warranted. This is an argument of control, and it is a good argument. The claim that an operator should be able to dynamically control the interval at which measurements are taken is a completely legitimate claim and expectation to have. This argument and its solution live in layer 7.

Enter Pully McPushface.


Pully McPushface is just a name to get attention: attention to something that can potentially make people cease their asinine pull vs. push arguments. It is simply the acknowledgement that one can push or pull at layer 3 (the direction in which one establishes a TCP session) and also push (send) or pull (request/response) on layer 7, independent of one another. To be clear, this approach has been possible since TCP hit the scene in 1982… so why haven’t monitoring systems leveraged it?

At Circonus, we’ve recently revamped our stack to allow for this freedom in almost every level of our architecture. Since the beginning, we’ve supported both push and pull protocols (like collectd, statsd, json over HTTP, NRPE, etc.), and we’ll continue to do so. The problem was that these all (as do the pundits) conflate layer 3 and layer 7 “initiation” in their design. (The collectd agent connects out via TCP to push data, and a monitor connects into NRPE to pull data.) We’re changing the dialogue.

Our collection system is designed to be distributed. We have our first tier, the core system; our second tier, the broker network; and our third tier, agents. While we support a multitude of agents (including the aforementioned statsd, collectd, etc.), we also have our own open source agent called NAD.

When we initially designed Circonus, we did extensive research with world-leading security teams to understand whether our layer 3 connections between tier 1 and tier 2 should be initiated by the broker to the core or vice versa. The consensus (unanimous, I might add) was that security would be improved by controlling a single inbound TCP connection to the broker; the broker could then be operated without a default route, making it hard for it to send data to malicious parties were it ever duped. It turns out that our audience wholeheartedly disagreed with this expert opinion. The solution? Be agnostic. Today, the conversations between tier 1 and tier 2 do not care who initiates the connection. Prefer the broker reaches out? That’s just fine. Want the core to connect to the broker? That’ll work too.

In our recent release of C:OSI (and NAD), we’ve applied the same agnosticism to connectivity between tier 2 and tier 3. Here is where the magic happens. The NAD agent now has the ability to both dial in and be dialed to on layer 3, while maintaining all of its normal layer 7 flexibility. Basically, however your network and systems are set up, we can work with that and still get on-demand, high-frequency data out; no more compromises. Say hello to Pully McPushface.

What’s new in JLog?

There is a class of problems in systems software that require guaranteed delivery of data from one stage of processing to the next stage of processing. In database systems, this usually involves a WAL file and a commit process that moves data from the WAL to the main storage files. If a crash or power loss occurs, we can replay the WAL file to reconstitute the database correctly. Nothing gets lost. Most database systems use some variant of ARIES.

In message broker systems, this usually involves an acknowledgement that a message was received and a retry from the client if there was no response or an error response. For durable message brokers, that acknowledgement should not go to the client until the data is committed to disk and safe. In larger brokered systems, like Kafka, this can extend to the data safely arriving at multiple nodes before acknowledging receipt to the client. These systems can usually be configured based on the relative tolerance of data loss for the application. For ephemeral stream data where the odd message or two can be dropped, we might set Kafka to acknowledge the message after only the leader has it, for example.

JLog is a library that provides journaled log functionality for your application and allows decoupling of data ingestion from data processing using a publish subscribe semantic. It supports both thread and multi-process safety. JLog can be used to build pub/sub systems that guarantee message delivery by relying on permanent storage for each received message and allowing different subscribers to maintain a different position in the log. It fully manages file segmentation and cleanup when all subscribers have finished reading a file segment.
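
To make that concrete, here’s a minimal writer sketch using the public JLog API (the path, subscriber name, and error handling are illustrative, and your setup ordering may differ):

#include <errno.h>
#include <jlog.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
  /* First run: create the log directory on disk. */
  jlog_ctx *ctx = jlog_new("/tmp/events.jlog");
  jlog_ctx_init(ctx);
  jlog_ctx_close(ctx);

  /* Register a subscriber so messages are retained until it reads them. */
  ctx = jlog_new("/tmp/events.jlog");
  if (jlog_ctx_add_subscriber(ctx, "processor", JLOG_BEGIN) != 0 && errno != EEXIST) {
    fprintf(stderr, "add_subscriber failed: %d %s\n",
            jlog_ctx_err(ctx), jlog_ctx_err_string(ctx));
    exit(-1);
  }

  /* Open as a writer and append a message; it stays on permanent storage
     until every subscriber has read past it. */
  if (jlog_ctx_open_writer(ctx) != 0) {
    fprintf(stderr, "open_writer failed: %d %s\n",
            jlog_ctx_err(ctx), jlog_ctx_err_string(ctx));
    exit(-1);
  }
  const char *msg = "hello, subscriber";
  jlog_ctx_write(ctx, msg, strlen(msg));
  jlog_ctx_close(ctx);
  return 0;
}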

Recent additions

To support ongoing scalability and availability objectives at Circonus, I recently added a set of new features for JLog. I’ll discuss each of them in more detail below:

  • Compression with LZ4
  • Single process support on demand
  • Rewindable checkpoints
  • Pre-commit buffering

Compression with LZ4

If you are running on a file system that does not support compression, JLog now supports turning on LZ4 compression to reduce disk storage requirements and also increase write throughput, when used with pre-commit buffering. The API for turning on compression looks like:

typedef enum {
  JLOG_COMPRESSION_NULL = 0,
  JLOG_COMPRESSION_LZ4 = 0x01
} jlog_compression_provider_choice;

int jlog_ctx_set_use_compression(jlog_ctx *ctx, uint8_t use);

int jlog_ctx_set_compression_provider(jlog_ctx *ctx,    
    jlog_compression_provider_choice provider);

Currently, only LZ4 is supported, but other compression formats may be added in the future. Choosing the NULL compression provider option is the same as choosing no compression. It’s important to note that you must turn on compression before calling jlog_ctx_init, and the chosen compression will be stored with the JLog for its lifetime.
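
For example, a new JLog might be created with LZ4 enabled like this (a sketch; the path and error handling are illustrative):

jlog_ctx *ctx = jlog_new("/var/log/feed.jlog");

/* Select compression before jlog_ctx_init(); the choice is stored with
   the JLog and applies for its lifetime. */
jlog_ctx_set_use_compression(ctx, 1);
jlog_ctx_set_compression_provider(ctx, JLOG_COMPRESSION_LZ4);

if (jlog_ctx_init(ctx) != 0) {
  fprintf(stderr, "jlog_ctx_init failed: %d %s\n",
          jlog_ctx_err(ctx), jlog_ctx_err_string(ctx));
}
jlog_ctx_close(ctx);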

Single process support

This really should be called “switching off multi-process support”, as multi-process is the default behavior. Multi-process mode protects the JLog directory with a file lock via fcntl(). Thread safety is always maintained and cannot be disabled, but you can skip this locking system call if you know that writes will only ever come from a single process (probably the most common usage for JLog).

Using the following call with mproc == 0 will turn off this file locking, which should result in a throughput increase:

int jlog_ctx_set_multi_process(jlog_ctx *ctx, uint8_t mproc);
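
In practice, that’s one extra call before opening the writer (a sketch; the path is illustrative):

jlog_ctx *ctx = jlog_new("/var/log/feed.jlog");

/* Only this process ever writes to the log, so skip the fcntl() file
   locking; thread safety within the process is unaffected. */
jlog_ctx_set_multi_process(ctx, 0);

if (jlog_ctx_open_writer(ctx) != 0) {
  fprintf(stderr, "open_writer failed: %d %s\n",
          jlog_ctx_err(ctx), jlog_ctx_err_string(ctx));
}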

Rewindable checkpoints

Highly available systems may require the ability to go back to a previously read checkpoint. JLog, by default, will delete file segments when all subscribers have read all messages in the segment. If you wanted to go back to a previously read checkpoint for some reason (such as failed processing), you were stuck with no ability to rewind. Now, with support for rewindable checkpoints, you can set an ephemeral subscriber at a known spot and back up to that specially named checkpoint. The API for using rewindable checkpoints is:

int jlog_ctx_add_subscriber(jlog_ctx *ctx, const char *subscriber,
    jlog_position whence);
int jlog_ctx_set_subscriber_checkpoint(jlog_ctx *ctx, 
    const char *subscriber, 
    const jlog_id *checkpoint);

Here’s an example of its usage:

  jlog_ctx *ctx;
  jlog_id checkpoint;

  ctx = jlog_new("/tmp/test.foo");
  if (jlog_ctx_open_reader(ctx, "reader") != 0) {
    fprintf(stderr, "jlog_ctx_open_reader failed: %d %s\n", jlog_ctx_err(ctx), jlog_ctx_err_string(ctx));
    exit(-1);
  }

  /* add our special trailing checkpoint subscriber */
  if (jlog_ctx_add_subscriber(ctx, "checkpoint-name", JLOG_BEGIN) != 0 && errno != EEXIST) {
    fprintf(stderr, "jlog_ctx_add_subscriber failed: %d %s\n", jlog_ctx_err(ctx), jlog_ctx_err_string(ctx));
    exit(-1);
  }

  /* now move the checkpoint subscriber to where the real reader is */
  if (jlog_get_checkpoint(ctx, "reader", &checkpoint) != 0) {
    fprintf(stderr, "jlog_get_checkpoint failed: %d %s\n", jlog_ctx_err(ctx), jlog_ctx_err_string(ctx));
    exit(-1);
  }

  if (jlog_ctx_set_subscriber_checkpoint(ctx, "checkpoint-name", &checkpoint) != 0) {
    fprintf(stderr, "jlog_ctx_set_subscriber_checkpoint failed: %d %s\n", jlog_ctx_err(ctx), jlog_ctx_err_string(ctx));
    exit(-1);
  }

Now we have a checkpoint named “checkpoint-name” at the same location as the main subscriber “reader”. If we want to rewind, we simply do this:

  /* move checkpoint to our original position, first read checkpoint location */
  if (jlog_get_checkpoint(ctx, "checkpoint-name", &checkpoint) != 0) {
    fprintf(stderr, "jlog_get_checkpoint failed: %d %s\n", jlog_ctx_err(ctx), jlog_ctx_err_string(ctx));
    exit(-1);
  }

  /* now move the main read checkpoint there */
  if (jlog_ctx_read_checkpoint(ctx, &checkpoint) != 0) {
    fprintf(stderr, "checkpoint failed: %d %s\n", jlog_ctx_err(ctx), jlog_ctx_err_string(ctx));
  } else {
    fprintf(stderr, "\trewound checkpoint...\n");
  }

To move our checkpoint forward, we merely call jlog_ctx_set_subscriber_checkpoint with the safe checkpoint.
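
Put together, a processing loop might look like the following sketch (the JLOG_ID_ADVANCE helper and the loop shape follow the patterns in JLog’s test code; treat the details as illustrative):

  jlog_id begin, end;
  jlog_message msg;
  int count, i;

  /* See what interval of messages is currently available to "reader". */
  count = jlog_ctx_read_interval(ctx, &begin, &end);
  for (i = 0; i < count; i++, JLOG_ID_ADVANCE(&begin)) {
    if (jlog_ctx_read_message(ctx, &begin, &msg) != 0) break;
    /* ... process msg.mess / msg.mess_len ... */
  }

  if (count > 0) {
    /* Mark the interval consumed for "reader"... */
    jlog_ctx_read_checkpoint(ctx, &end);
    /* ...and, once processing is known good, advance the rewindable
       "checkpoint-name" subscriber to the same safe point. */
    jlog_ctx_set_subscriber_checkpoint(ctx, "checkpoint-name", &end);
  }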

Pre-commit buffering

One of the largest challenges with JLog is throughput. The ability to disable multi-process support helps reduce the syscalls required to write our data. This is good, but we still need to make a writev call for each message. This syscall overhead takes a serious bite out of throughput (more in the benchmarks section below). To get around this issue, we have to find a safe-ish way to reduce the syscall overhead of lots of tiny writes. We can either map the underlying block device and write to it directly (a nightmare) or we can batch the writes. Batching writes is way easier, but sacrifices way too much data safety (a crash before a batch commit can lose many rows, depending on the size of the batch). At the end of the day, I chose a middle-ground approach that is fairly safe for the most common case but also allows very high throughput using batch writes.

int jlog_ctx_set_pre_commit_buffer_size(jlog_ctx *ctx, size_t s);

Setting this to something greater than zero will turn on pre-commit buffering. This is implemented as a writable mmapped memory region where all writes are batched up. The pre-commit buffer is flushed to the actual log files when it fills to the requested size. We rely on the OS to flush the mmapped data back to the backing file even if the process crashes; however, if we lose the machine to power loss, this approach is not safe. There is always a tradeoff between safety and throughput. Only use this approach if you are comfortable losing data in the event of power loss or kernel panic.

It is important to note that pre-commit buffering is not multi-process writer safe. If you are using JLog under a scheme where multiple writing processes write to the same JLog, you have to set the pre-commit buffer size to zero (the default). However, it is safe to use from a single-process, multi-threaded writer setup, and it is also safe to use under multi-process operation when there are multiple reading processes but only a single writing process.

There is a tradeoff between throughput and read side latency if you are using pre-commit buffering. Since reads only ever occur out of the materialized files on disk and do not consider the pre-commit buffer, reads can only advance when the pre-commit buffer is flushed. If you have a large-ish pre-commit buffer size and a slow-ish write rate, your readers could be waiting for a while before they advance. Choose your pre-commit buffer size wisely based on the expected throughput of your JLog. Note that we also provide a flush function, which you could wire up to a timer to ensure the readers are advancing even in the face of slow writes:

int jlog_ctx_flush_pre_commit_buffer(jlog_ctx *ctx);
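
Here’s a sketch of wiring this up for a single-writer, high-throughput log (the buffer size, path, and flush cadence are illustrative):

jlog_ctx *ctx = jlog_new("/var/log/feed.jlog");

/* Pre-commit buffering is only safe with a single writing process,
   so the fcntl() locking can be turned off as well. */
jlog_ctx_set_multi_process(ctx, 0);
jlog_ctx_set_pre_commit_buffer_size(ctx, 128 * 1024);  /* 128 KB batches */

if (jlog_ctx_open_writer(ctx) != 0) {
  fprintf(stderr, "open_writer failed: %d %s\n",
          jlog_ctx_err(ctx), jlog_ctx_err_string(ctx));
  exit(-1);
}

/* Writes accumulate in the mmapped pre-commit buffer and are batched
   out to the log files when the buffer fills. */
jlog_ctx_write(ctx, "sample event", strlen("sample event"));

/* From a periodic timer (say, once a second), flush so a slow write
   rate doesn't leave readers stalled behind the buffer. */
jlog_ctx_flush_pre_commit_buffer(ctx);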

Benchmarks

All benchmarks are timed by writing one million JLog entries with a message size of 100 bytes. All tests were conducted on OmniOS v11 r151014 using ZFS as the file system with compression enabled.

Test                              Entries/sec   Time to complete
JLog Default                      ~114,000      8.735 sec
LZ4 compression on                ~96,000       10.349 sec
Multi-process OFF                 ~138,000      7.248 sec
MP OFF + LZ4                      ~121,000      8.303 sec
MP OFF + Pre-commit buffer 128K   ~1,080,000    0.925 sec
MP OFF + Pre-commit + LZ4         ~474,000      2.113 sec

As you can see from the table above, turning multi-process support off provides a modest throughput advantage, since all those calls to fcntl are elided, but the truly impressive gains come from pre-commit buffering. Even a relatively small buffer of 128 KB gains us almost 8X in throughput over the next best option.

That LZ4 runs more slowly is not surprising: we are trading CPU for space savings. If you are already running on a compressing file system, you get those space gains without flipping on compression in JLog; if you are on a non-compressing file system, LZ4 will save you disk space.

Circonus One Step Install

Introducing Quick and Simple Onboarding with C:OSI

When we started developing Circonus 6 years ago, we found that many customers had very specific ideas about how they wanted their onboarding process to work. Since then, we’ve found that many more customers aren’t sure where to start.

The most rudimentary task new and existing users face is just getting metric data flowing from a new host into Circonus. New users want to see their data, graphs, and worksheets right away, and that process should be quick and easy, without any guesswork involved in sorting through all of the options. But those options need to continue to be available for users who require that flexibility, usually because they have a particular configuration in mind.

So, we listened. Now we’ve put those 6 years of accumulated expertise to use in this new tool, so that everyone gets the benefit of that knowledge, but with a simple, streamlined process. This is a prescriptive process, so users who just want their data don’t have to be concerned with figuring out the best way to get started.

You can now register systems with Circonus in one simple command, or as a simple part of configuration management. With that single command, you get a reasonable and comprehensive set of metrics and visuals. Check out the C:OSI tutorial on our Support Portal to see just how quick and simple it is, or have a quick look at the short demo video there.

New and existing Circonus users can use C:OSI to automate the process of bringing systems online. Without inhibiting customization, a single cut-and-paste command does all of the following in one step:

  1. Select an agent.
  2. Install the agent.
  3. Configure the agent to expose metrics.
  4. Start the agent.
  5. Create a check to retrieve/accept the metrics from the agent.
  6. Enable basic system metrics.
  7. Create graphs for each of the basic metric groups.
  8. Create a worksheet containing the basic graphs so there is a unified view of the specific host.

C:OSI does all this via configuration files, pulled from a central site or read from a local configuration file, either of which can easily be modified to suit your needs.

C:OSI also allows for customization, so users who depend on the flexibility of Circonus can still benefit from the simplicity of the streamlined process. If the default configuration prescribed by C:OSI doesn’t meet your specifications, you can modify it, and the onboarding process remains as simple as running a single command.

You can dig into those customization options by visiting the C:OSI documentation in the Circonus Labs public Github repository.

Anyone in DevOps, or anyone who has been responsible for monitoring a stack, knows that creating connections or nodes can be a time-consuming task. A streamlined, prescriptive onboarding process is faster and more efficient. It also provides stronger consistency in the data collected, which in turn allows us to do better, smarter things with that data.