Time, but Faster
Every once in awhile, you find yourself in a rabbit hole, unsure of where you are or what time it might be. In this post, I’ll show you a computing adventure about time through the looking glass.
The first premise you must understand was summed up perfectly by the late Douglas Adams; “Time is an illusion, lunchtime doubly so.” The concept of time, when colliding with decoupled networks of computers that run at billions of operations per second, is… well, the truth of the matter is that you simply never really know what time it is. That is why Leslie Lamport’s seminal paper on Lamport timestamps was so important to the industry, but this post is actually about wall-clock time, or a reasonably useful estimation of it.
Even on today’s computers, it is feasible to execute an instruction in under a nanosecond. When the white rabbit looks at his pocket watch, he is seeing what time it was a nanosecond before, as the light travels from the the hands on the watch to his eye – assuming that Carroll’s timeless tale took place in a vacuum and that the rabbit was holding the watch one third of a meter from his eyes.
When you think of a distributed system where a cluster of fast computers are often more than one light nanosecond away from each other, it is understandably difficult to time something that starts in one place and ends in another with nanosecond precision; this is the realm of physicists, not bums like us with commodity computing hardware run in environments we don’t even manage. To further upset the newcomer, every computer today is effectively a distributed system itself, with each CPU core having its own clock ticking away, with it’s own subtly different frequency and sense of universal beginning.
All that said, computers need to give us the illusion of a clock. Without it, we won’t know what time it is. As computers get faster and faster, we are able to improve the performance of our systems, but one fundamental of performance engineering is that you cannot improve what you cannot measure; so measure we must. The fundamental paradox is that as what we measure gets smaller, the cost of measuring it remains fixed, and thus becomes relatively monstrous.
The beginning of the tumble…
We write a database here at Circonus. It’s fast and it keeps getting faster and more efficient. We dump energy into this seemingly endless journey because we operate at scale and every bit of efficiency we eke out results in lower COGS (Cost Of Goods Sold) for us, better service for our customers, and fundamentally affords a cost-effectiveness of telemetry collection and analysis that approaches reasonable economics for us to “monitor all the things.” In that context…
Let us assume we want to achieve an average latency for operations of 1 microsecond. Now, let’s wrap some numbers around that. I’ll make some notes about certain aspects of hardware, but I’ll only really focus on hardware from the last several years. While you and I like to think in terms of seconds, computers don’t care about our concept of time… They only care about clock ticks.
Online CPUs are forever marching forward at some frequency, and the period of this frequency is called a “tick.” In an effort to save power, many computers can shift between different power saving states that cause the frequency of the CPU to change. This could make it excruciatingly difficult to accurately tell high-granularity time, if the frequency of the CPU were used for timing. Each core on a modern CPU has a TSC (TimeStamp Counter) that counts the number of ticks that have transpired, and you can read this counter with the cleverly named rdtsc assembly instruction. Also, modern CPUs have a “feature” called an invariant TSC, which guarantees that the frequency at which ticks occur will not change for any reason (but mainly for power saving mode changes). My development box has an invariant TSC that ticks approximately 2.5999503 times per nanosecond. Other machines here have different frequencies.
The standard tooling to figure out how long an operation takes on a UNIX machine is either clock_gettime(CLOCK_MONOTONIC,…) or gethrtime(). These calls return the number of nanoseconds since some arbitrary fixed point in the past. I will use gethrtime() in my examples because it is shorter to write.
hrtime_t start = gethrtime(); some_operation_worthy_of_measurement(); hrtime_t elapsed = gethrtime() - start;
As we measure these things, the gethrtime() call itself will take some time. The question to ask is: where does the time it returns sit relative to the beginning and end of the gethrtime() call itself? We can answer that with benchmarks. The bias introduced by measuring the start and finish is relative to its contribution to overall running time. In other words, if we make the “operation” we’re measuring take a long time over many iterations, we can reduce the measurement bias to near zero. Timing gethrtime() with gethrtime() would look like this:
#define LOTS 10000000 hrtime_t start = gethrtime(); for(int i=0;i<LOTS;i++) (void)gethrtime(); hrtime_t elapsed = gethrtime() - start; double avg_ns_per_op = (double) elapsed / (double)LOTS;
And behold: a benchmark is born. Furthermore, we could actually measure the number of ticks elapsed in the test by bracketing the test with calls to rdtsc in assembly. Note that you must bind yourself to a specific CPU on the box to make this effective because the TSC clocks on each core do not have the same concept of “beginning.”
If I run this on our two primary platforms (Linux and Illumos/OmniOS,on a 24-core 2.60GHz Intel box), then I get these results:
|Operating System||Call||Call Time|
|Linux 3.10.0||gettimeofday||35.7 ns/op|
|Linux 3.10.0||gethrtime||119.8 ns/op|
|OmniOS r151014||gettimeofday||304.6 ns/op|
|OmniOS r151014||gethrtime||297.0 ns/op|
The first noticeable thing is that Linux optimizes both of these calls significantly more than OmniOS does. This has actually been addressed as part of the LX brand work in SmartOS by Joyent and will soon be upstreamed into Illumos for general consumption by OmniOS. Alas, that isn’t the worst thing; objectively determining what time it is is simply too slow for microsecond level timing, even at the lower 119.8 ns/op number above. Note that gettimeofday() only supports microsecond level accuracy and thus is not suitable for timing faster operations.
At just 119.8ns/op, bracketing a 1 microsecond call will result in (119.8*2)/(1000 + 119.8*2) -> 19.33%. So 19.33% of the execution time is spent on calculating the timing, and that doesn’t even include the time spent recording the result. A good goal to target here is 10% or less. So, how do we get there?
Looking at our tools
On these same modern CPUs that have invariant TSCs, we have the rdtsc instruction, which reads the TSC, yet doesn’t provide insight into which CPU you are executing on. That would require either prefixing the call with a cpuid instruction or binding the executing thread to a specific core. The former adds ticks to the work, the latter is wholly inconvenient and can really defeat any advanced NUMA aware scheduling that the kernel might provide you. Basically, binding the CPU gives you a super-fast, but overly restrictive solution. We just want the gethrtime() call to work and be fast.
We’re not the only ones in need. Out of the generally recognized need, the rdtscp instruction was introduced. It supplies the value in the TSC and a programmable 32 bit value. The operating system can program this value to be the ID of the CPU, and a sufficient amount of information is emitted in a single instruction… Don’t be deceived; this instruction isn’t cheap and it measures in at 34 ticks on this machine. We’ll code that instruction call as uint64_t mtev_rdtscp(int *cpuid), and that returns the TSC and optionally sets a cpuid to the programmed value.
The first challenge here is to understand the frequency. This is a straight-forward timing exercise:
mtev_thread_bind_to_cpu(0); hrtime_t start_ns = gethrtime(); uint64_t start_ticks = mtev_rdtscp(NULL); sleep(10); hrtime_t end_ns = gethrtime(); uint64_t end_ticks = mtev_rdtscp(NULL); double ticks_per_nano = (double)(end_ticks-start_ticks) / (double)(end_ns-start_ns);
The next challenge becomes quite clear when testing out this solution for timing the execution of a job… even the simplest of jobs:
uint64_t start = mtev_rdtscp(NULL); *some_memory = 0; uint64_t elapsed = mtev_rdtscp(NULL) - start;
This usually takes around 10ns, assuming there isn’t a major page fault during the assignment. 10ns to set a piece of memory! Now remember, that includes the average time of a call to mtev_rdtscp(), which is just over 9ns. That’s not really the problem… the problem is that sometimes we get HUGE answers. Why? We switch CPUs and the output of the two TSC calls are reporting to completely unrelated counters. So, to rephrase the problem: we must relate the counters.
The code for skew assessment is a bit much to include inline here. The basic idea is that we should run a calibration loop on each CPU that measures the TSC*ticks_per_nano and assess the skew from gethrtime(), accommodating the running time of gethrtime(). As with most calibration loops, we’ll discard the most skewed and average the remaining. It’s basically back to primary school math regarding finding the linear intercept equation: y = mx + b. We want gethrtime() = ticks_per_nano * mtev_rdtscp() + skew. As the TSC is per CPU, we need to track m and b (ticks_per_nano and skew) on a per-CPU basis.
Another nuance is that these two values together describe the translation between a CPU’s TSC and the systems gethrtime(), and they are estimates. That means two important things: (1) they need to be updated regularly to correct for error in the calibration and estimation, and (2) they need to be set and read atomically. Here enters the cmpxchg16b instruction.
Additionally, we perform this calibration work every five seconds in a separate thread and attempt to make that thread high-priority on a real-time scheduler. It turns out that this all works quite well, even without the ability to change priority or scheduling class.
Since we’re clearly having to correct for skew to align with the system gethrtime(), and the point in the past to which gethrtime() is relative is arbitrary (according to documentation), we’ve elected to make that “arbitrary” point UNIX epoch. No additional instructions are required, and now we can use our replacement gethrtime() to power gettimeofday(). So, our y = mx + b is actually implemented as nano_second_since_epoch = ticks_per_nano * mtev_rdtscp() + skew. Obviously, we’ll only pick up changes to the wall-clock (via adjtime() et.al.) when we recalibrate.
Obviously things can and do go wrong. We’ve put a variety of fail-safe mechanisms in place to ensure proper behavior when our optimizations become unsafe. By default, we detect the lack of an invariant TSC and disable. If a calibration loop fails for too long (fifteen seconds), we mark the CPU as bad and disable. We do some rudimentary performance tests, and if the system’s gethrtime() can beat our emulation, then we disable. If all those tests pass, we still check to see if the underlying system can perform gettimeofday() faster than we can emulate it; if so, we disable gettimeofday() emulation. The goal is to have mtev_gethrtime() be as fast or faster than gethrtime() and to have mtev_gettimeofday() be as fast or faster than gettimeofday().
The overall results are better than expected. Our original goal was to simply provide a way for our implementation on Illumos to meet the performance of Linux. The value of ZFS is deeply profound, and while Linux has some performance advantages in specific arenas, that doesn’t matter much if you have undetectable corruption of the data you store.
Further optimization is possible in the implementation, but we’ve stopped for now, having achieved our initial goals. Additionally, for the purposes of this test, we’ve built the code portably. We can find a couple of nanoseconds if we compile -march=native on machines supporting the AVX instruction set.
It is true that an approximately 40ns gehrtime() can be considered slow enough, relative to microsecond-level efforts, that very prudent selection is still necessary. It is also true that 40ns gethrtime() can open up a new world of possibilities for user-space instrumentation. It has certainly opened our eyes to some startling things.
|Operating System||Call||System Call Time||Mtev-variant Call||Speedup|
|Linux 3.10.0||gettimeofday||35.7 ns/op||35.7 ns/op||x1.00|
|Linux 3.10.0||gethrtime||119.8 ns/op||40.4 ns/op||x2.96|
|OmniOS r151014||gettimeofday||304.6 ns/op||47.6 ns/op||x6.40|
|OmniOS r151014||gethrtime||297.0 ns/op||39.9 ns/op||x7.44|
This all comes for free with libmtev (see mtev_time.c for reference). As of writing this, Linux and Illumos are supported platforms and Darwin and FreeBSD do not have “faster time” support. The faster time support in libmtev was a collaborative effort between Riley Berton and Theo Schlossnagle.