Percentages Aren’t People

This is a story about an engineering group celebrating success when it shouldn’t have been… and their organization buying into it. This is not the fault of the engineering group, or the operations team, or any one person. It is the fault of yesterday’s tools not providing the right data, the right insights, or the ability to dig into the data for meaningful information to push your business forward.

Herein, we’ll dive into a day in the life of an online service where a team wakes up and triages an outage after unknowingly celebrating a larger outage as a success just twelve hours before. All names have been removed to protect the exceptionally well-intentioned and competent parties. You see, the problem is that the industry has been misleading us with misapplied math and bad statistics for years.

I’ll set the stage with a simple fact of this business: when it takes longer than one and a half seconds to use their service, users leave. Armed with this fact, let’s begin our journey. This data lives in Circonus, but it isn’t measuring Circonus; and since stories are best told in the first person with friends along for the ride, I’ll drop into the first-person plural from here on: let’s go.

We track the user’s experience logging into the application. We do this not by synthetically logging in and measuring (we do this too, but only for functional testing), but by measuring each user’s experience and recording it. When drawn as a heatmap, the data looks like the graph below. The red line indicates a number that, through research, we’ve found to be a line of despair and loss. Delivering an experience of 1.5 seconds or slower causes our users to leave.

Percentages_Are_Not_People_1

Heatmaps can be a bit confusing to reason about, so this is the last we’ll see of one here. The important part to remember is that we are storing a complete model of the distribution of user experiences over time; we’ll get to why that is important in just a bit. From this data, we can calculate and visualize all the things we’re used to.

Percentages_Are_Not_People_2

The above is a graph for that same lonely day in June, and it shows milliseconds of latency; specifically, the line represents the average user experience. If I ask you to spot the problem on the graph, you can do so just as easily as a four-year-old; it’s glaring. However, you’ll note that our graph indicates we’re well under our 1.5s line of despair and loss. We’re all okay, right? Wrong.

A long time ago, the industry realized that averages (and standard deviations) are very poor representations of sample sets because our populations are not normally distributed. Instead of using an average (specifically an arithmetic mean), we all decided that measuring on some large quantile would be better. We were right. So, an organization would pick a percentage, 99.9% or 99%, and articulate, “I have to be at least ‘this good’ for at least ‘this percentage’ of my users.” If this percentage seems arbitrary, it is… but, like the 1.5 second line of despair and loss, it can be derived from lots of business data and user behavior studies.

This, ladies and gentlemen, is why we don’t use averages. Saying that averages are misleading is a bit less accurate than admitting that many people are misled by averages. They simply don’t represent the things that are important to us here: how are we treating our users? This question is critical because it is our users who fund us and our real question is, “How many users are having a dissatisfying experience?”

Percentages_Are_Not_People_3

The above graph is radically different from the first; it might surprise you to know that it is showing the same underlying dataset. Instead of the average experience, it shows the 99th percentile experience over time. It is much clearer that we had something catastrophically bad happen at 5am. It also shows that, aside from two small infractions (7:52pm and 11:00pm), the rest of the day delivered the objective of a “less than 1.5s 99th percentile experience.” Okay, let’s stop. That’s such a disgustingly opaque and clinical way to talk about what we’re representing. These are actual people attempting to use this service.

What we’re saying here is that for each point on the purple line in this graph, during the time window it represents (at this zoom level, each point represents 4 minutes), 99% of visitors had an experience better than the value and 1% had an experience worse than the value. Here we should see our first problem: percentages aren’t people.

Reflecting on the day as a whole, we see a catastrophic problem at 5am, to which our mighty engineering organization responded and remediated diligently over the course of approximately fifty minutes. Go Team! The rest of the day was pretty good, and we have those two little blips to diagnose and fix going forward.

I’m glad we’re not using averages for monitoring! We’d most likely not have been alerted to that problem at 5am! Here is where most monitoring stories end, because only a few quantiles are stored and the raw data behind everything isn’t available for further analysis. Let’s return to our earlier question, “How many users are having a dissatisfying experience?” Luckily for us, we know how many users were on the site, so we can actually just multiply 1% by the number of current visitors to understand “how many” of the users are having an experience worse than the graph… But that isn’t the question, is it? The question is how many users are having a worse experience than 1.5s, not worse than the 99th percentile.

Percentages_Are_Not_People_4

This graph adds a black line that shows the number of current users each minute on the site (numbered on the right axis). To illustrate how we’re really missing the point, let’s just take a random point from our 99th percentile graph (again, each point represents 4 minutes at this zoom level). We randomly pick 9:32pm. The graph tells us that the 99th percentile experience at that point is 1.266s. This is better than our goal of 1.5s. Looking at the black line, we see that we have about 86 users each minute on the site at that point, or 344 users over the four-minute period. 1% of that is between 3 and 4 users. Okay, we’re getting somewhere! So we know that between 3 and 4 users had an experience slower than 1.266s. Wait, that wasn’t our question. Who cares about 1.266s, when we want to know about 1.5s? We’re not getting anywhere at all.

Our objective is 1.5 seconds. We’re looking at this all upside down and backwards. We should not be asking how bad the experience is for the worst 1%, instead we should be asking what percentage has a bad experience (any experience worse than our objective of 1.5 seconds). We shouldn’t be asking about quantiles; we should be asking about inverse quantiles. Since we’re storing the whole distribution of experiences in Circonus, we can simply ask, “What percentage of the population is faster than 1.5s?” If we take one minus this inverse quantile at 1.5 seconds, we get exactly the answer to our question: What percentage of users had a “bad experience?”
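
To make the arithmetic concrete, here is a minimal sketch (illustrative C, not Circonus code) of an inverse quantile computed over one window of recorded latencies: the fraction of samples at or below the threshold, whose complement is the fraction of users having a bad experience. The sample values are made up.

#include <stdio.h>
#include <stddef.h>

/* fraction of samples <= threshold_ms: the inverse quantile at that threshold */
static double inverse_quantile(const double *samples_ms, size_t n, double threshold_ms) {
  size_t at_or_below = 0;
  for (size_t i = 0; i < n; i++)
    if (samples_ms[i] <= threshold_ms) at_or_below++;
  return n ? (double)at_or_below / (double)n : 1.0;
}

int main(void) {
  /* hypothetical login latencies (ms) observed during one 4-minute window */
  double window_ms[] = { 420, 980, 1120, 1377, 1680, 2400, 810, 1490 };
  size_t n = sizeof(window_ms) / sizeof(window_ms[0]);
  double iq = inverse_quantile(window_ms, n, 1500.0);
  printf("%.1f%% of users were under 1.5s; %.1f%% had a bad experience\n",
         iq * 100.0, (1.0 - iq) * 100.0);
  return 0;
}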

Percentages_Are_Not_People_5

Now we’re getting somewhere. It is clear that we had a bad time at 5am and we did pretty well with just some line noise during our successful prior evening, right? Let’s return to our first problem: percentages aren’t people.

Percentages_Are_Not_People_6

Luckily, just as we did before, we can simply look at how many people are visiting the site (the green line above) and multiply that by the percentage of people having a bad time and we get the number of actual people. Now we’re talking about something everyone understands. How many people had a bad experience? Let’s multiply!

Percentages_Are_Not_People_7

In this image, we have simply multiplied the two data streams from before, and we see the human casualties of our system. This is the number of users per minute that we screwed out of a good experience. These are users that, in all likelihood, are taking their business elsewhere. As anyone who thinks about it for more than a few seconds realizes, a small percentage of a large number can easily be bigger than a large percentage of a small number. Reasoning about inverse quantiles (let alone abstract quantiles) without knowing the size of the population is misleading, to put it mildly.

Another way to look at this graph is to integrate; that is, to calculate the area under the curve. Integrating a graph of users per minute over time results in a graph of users: the cumulative number of users that have had a bad experience.
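
As a sketch of that arithmetic (again illustrative, with made-up per-minute numbers): multiply each minute’s bad-experience fraction by that minute’s visitor count to get people, then keep a running sum, which is the discrete equivalent of integrating the curve.

#include <stdio.h>
#include <stddef.h>

int main(void) {
  /* hypothetical per-minute series: fraction of bad experiences and visitors */
  double bad_fraction[] = { 0.010, 0.004, 0.022, 0.000, 0.015 };
  double visitors[]     = { 86,    90,    84,    91,    88 };
  size_t minutes = sizeof(visitors) / sizeof(visitors[0]);
  double cumulative = 0.0;

  for (size_t i = 0; i < minutes; i++) {
    double bad_users = bad_fraction[i] * visitors[i]; /* people, not percentages */
    cumulative += bad_users;                          /* area under the curve so far */
    printf("minute %zu: %.2f bad experiences, %.2f cumulative\n", i, bad_users, cumulative);
  }
  return 0;
}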

Percentages_Are_Not_People_8

This should be flat-out eye opening. The eight hours from 2am to 10am (including the event of despair and loss) affected 121 people. The eight hours preceding it affected almost as many: 113.

It can be pretty depressing to think you’ve celebrated a successful day of delivery only to learn that it really wasn’t that successful at all. But, this isn’t so much about celebrating successes that were actually failures; it’s about understanding what, when, and where you can improve. Every user matters; and if you treat them that way, you stand to get a lot more of them.

Percentages_Are_Not_People_9

When you look back at your own graphs, just remember that most of the casualties of our day happened in these two bands. You should be using inverse quantiles for SLA reporting; if you don’t have those, use quantiles… and if you only have averages, you’re blind as a bat.

Understanding API Latencies

Today’s Internet is powered by APIs. Tomorrow’s will be even more so. Without a pretty UI or a captivating experience, you’re judged simply on performance and availability. As an API provider, it is more critical than ever to understand how your system is performing.

With the emergence of microservices, we have an API layered cake. And often that layered cake looks like one from a Dr. Seuss story. That complex systems fail in complex ways is a deep and painful truth that developers are facing now in even the most ordinary of applications. So, as we build these decoupled, often asynchronous systems that compose a single user transaction from tens of underlying networked subtransactions, we’re left with a puzzle. How is performance changing as usage volume increases and, often more importantly, how is it changing as we rapidly deploy micro updates to our microservices?

Developers have long known that they must be aware of their code performance and, at least in my experience, developers tend to be fairly good about minding their performance P’s and Q’s. However, in complex systems, the deployment environment and other production environmental conditions have tremendous influence on the actual performance delivered. The cry, “but it worked in dev” has moved from the functionality to the performance realm of software. I tell you now that I can sympathize.

It has always been a challenge to take a bug in functionality observed in production and build a repeatable test case in development to diagnose, address, and test for future regression. This challenge has been met by the best developers out there. The emergent conditions in complex, decoupled production systems are nigh impossible to replicate in a development environment. This leaves developers fantastically frustrated and requires a different tack: production instrumentation.

As I see it, there are two approaches to production instrumentation that are critically important (there would be one approach if storage and retrieval were free and observation had no effect — alas we live in the real world and must compromise). You can either sacrifice coverage for depth or sacrifice depth for coverage. What am I talking about?

I’d love to be able to pick apart a single request coming into my service in excruciating detail. Watch it arrive, calculate the cycles spent on CPU, the time spent off CPU, which instruction and stack took me off CPU, the activity that requested information from another microservice, the perceived latency between systems, all of the same things on the remote microservice, the disk accesses and latency on delivery for my query against Cassandra, and the details of the read-repair it induced. This list might seem long, but I could go on for pages. The amount of low-level work that is performed to serve even the simplest of requests is staggering… and every single step is subject to bugs, poor interactions, performance regressions, and other generally bad behavior. The Google Dapper paper and the OpenZipkin project take a stab at delivering on this type of visibility, and now companies like Lightstep are attempting to deliver on this commercially. I’m excited! This type of tooling is one of two critical approaches to production system visibility.

Understanding_API_Latencies_1

The idea of storing this information on every single request that arrives is absurd today, but even when it is no longer absurd tomorrow, broad and insightful reporting on it will remain a challenge. Hence the need for the second approach.

You guessed it, Circonus falls squarely into the second approach: coverage over depth. You may choose not to agree with my terminology, but hopefully the point will come across. In this approach, instead of looking at individual transactions into the system (acknowledging that we cannot feasibly record and report all of them), we look at the individual components of the system and measure everything. That API we’re serving? Measure the latency of every single request on every exposed endpoint. The microservice you talked to? Measure the latency there. The network protocol over which you communicated? Measure the size of every single packet sent in each direction. That Cassandra cluster? Measure the client-facing latency, but also measure the I/O latency of every single disk operation on each spindle (or EBS volume, or ephemeral SSD) on each node. It sounds like a lot of data, sure. But we live in the future, and analytics systems are capable of handling a billion measurements per second these days, all the while remaining economical.
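
As a sketch of what “measure everything” means mechanically, here is a toy fixed-bucket latency histogram in C. Circonus itself stores log-linear histograms; the bucket boundaries and sample values below are arbitrary and only illustrate the idea that keeping counts per bucket preserves the whole distribution at a tiny, constant cost per measurement.

#include <stdio.h>
#include <stddef.h>

#define NBUCKETS 6
static const double upper_ms[NBUCKETS] = { 1, 10, 50, 100, 1000, 1e12 };
static unsigned long counts[NBUCKETS];

/* record one latency sample into the first bucket whose upper bound covers it */
static void record_latency(double ms) {
  for (int i = 0; i < NBUCKETS; i++)
    if (ms <= upper_ms[i]) { counts[i]++; return; }
}

int main(void) {
  double io_samples_ms[] = { 0.4, 0.7, 3.2, 42.0, 0.9, 120.0 };
  for (size_t i = 0; i < sizeof(io_samples_ms) / sizeof(io_samples_ms[0]); i++)
    record_latency(io_samples_ms[i]);
  for (int i = 0; i < NBUCKETS; i++)
    printf("<= %g ms: %lu\n", upper_ms[i], counts[i]);
  return 0;
}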

Understanding_API_Latencies_2

The above graph shows the full distribution of every I/O operation on one of our core database nodes. The histogram in the breakout box shows three distinct modes (two tightly coupled in the left peak and one smaller mode further out in the latency spectrum). We can also see a radical divergence in behavior immediately following Feb 14th at 9am. As we’re looking at one week of data, each vertical time slice is 1h30m. The slice highlighted by the vertical grey hairline is displayed in the upper-left breakout box; it alone represents nearly 12 million data points. The full graph represents about 1.2 billion measurements, and fetching that from the Circonus time series database took 48ms. When you start using the right tools, your eyes will open.

Pully McPushface

The Argument for Connectivity Agnosticism


It’s about push vs. pull… but it shouldn’t be.

There has been a lot of heated debate on whether pushing telemetry data from systems or pulling that data from systems is better. If you’re just hearing about this argument now, bless you. One would think that this debate is as ridiculous as vim vs. emacs or tabs vs. spaces, but it turns out there is a bit of meat on this bone. The problem is that the proposition is wrong. I hope that here I can reframe the discussion to help turn the corner and walk a path where people get back to more productive things.

At Circonus, we’ve always been of the mindset that both push and pull should have their moments to shine. We accept both, but honestly, we are duped into this push vs. pull dialogue all too often. As I’ll explain, the choices we are shown aren’t the only options.

The idea behind pushing metrics is that the “system” in question (be it a machine or a service) should emit telemetry data to an “upstream” entity. The idea of pull is that some “upstream” entity should actively query systems for telemetry data. I am careful not to use the word “centralized” because in most large-scale modern monitoring systems, all of these bits (push or pull) are considerably decentralized. Let’s look through both sides of the argument (I’ll dispatch the patently false claims directly below):

Push has some arguments:

  1. Pull doesn’t scale well.
  2. I don’t know where my data will be coming from.
  3. Push works behind complex network setups.
  4. When events transpire, I should push; pulling doesn’t match my cadence.
  5. Push is more secure.

Pull has some arguments:

  1. I know better when a machine or service goes bad because I control the polling interval.
  2. Controlling the polling interval allows me to investigate issues faster and more effectively.
  3. Pull is more secure.

To address the false claims in turn: Pulling data from 2 million machines isn’t a difficult job. Do you have more than 2 million machines? Pull scales fine… Google does it. Whether you are pulling data from a secure place to the cloud or pushing data from a secure place to the cloud, you are moving the same data across the same boundary and are thus exposed to the same security risks. It is worth mentioning that in a setup where data is pulled, the target machine need not be able to route to the Internet at all, thus making the attack surface more slippery. I personally find that argument to be weak and believe that if the right security policies are put in place, both methods can be considered equally “securable.” It’s also worth mentioning that many of those making claims about security concerns have wide-open policies about pushing information beyond the boundaries of their digital enclave, and should spend some serious time reflecting on that.


Now to address the remaining issues.

Push: I don’t know where my data will be coming from.

Yes, it’s true that you don’t always know where your data is coming from. A perfect example is web clients. They show up to load a page or call an API, and then could potentially disappear for good. You don’t own that resource and, more importantly, don’t pay an operational or capital expenditure on acquiring or running it. So, I sympathize that we don’t always know which systems will be submitting telemetry information to us. On the flip side, for those machines or services that you know about and pay for, it’s just flat-out lazy not to know what they are. In the case of short-lived resources, it is imperative that you know when they are doing work and when they are gone for good. Considering this, it would stand to reason that the resource being monitored must initiate the conversation. This is an argument for push… at least on layer 3. Woah! What? Why am I talking about OSI layers? I’ll get to that.

Push: Works behind complex network setups.

It turns out that pull actually works behind some complex network configurations where push fails, though these are quite rare in practice. Still, it also turns out that TCP sessions are bidirectional, so once you’ve conquered session setup, you’ve solved this issue. So this argument (and the rare counterargument) is a layer 3 argument that struggles to find any relevance at layer 7.

Push: When events transpire, I should push; pulling doesn’t match my cadence.

Finally, some real meat. I’ve talked about this many times in the past, and it is 100% true that some things you want to observe fall well into the push realm and others into the pull realm. When an event transpires, you likely want to get that information upstream as quickly as possible, so push makes good sense. And as this is information… we’re talking layer 7. If you instrument processes starting and stopping, you likely don’t want to miss anything. On the other hand, the only way to never miss a change in disk space usage on a system is to log every block allocation and deallocation, which sounds like a bit of overkill. This is a good example of where pulling that information at an operator’s discretion (say every few seconds or every minute) would suffice. Basically, sometimes it makes good sense to push on layer 7, and sometimes it makes better sense to pull.

Pull: I know better when a machine or service goes bad because I control the polling interval.

This, to me, comes down to the responsible party. Is each of your million machines (or 10) responsible for detecting failure (in the form of absenteeism), or is that the responsibility of the monitoring system? That was rhetorical, of course. The monitoring system is responsible, full stop. Yet detecting failure of systems by tracking the absenteeism of data in the push model requires elaborate models of acceptable delinquency in emissions. When the monitoring system pulls data, it controls the interval and can determine unavailability in a way that is reliable, simple, and, perhaps most importantly, easy to reason about. While there are elements of layer 3 here if the client is not currently “connected” to the monitoring system, this issue is almost entirely addressed on layer 7.

Pull: Controlling the polling interval allows me to investigate issues faster and more effectively.

For metrics in many systems, taking a measurement every 100ms is overkill. I have thousands of metrics available on a machine, and most of them remain informative at observation intervals as large as five minutes. However, there are times when a tighter observation interval is warranted. This is an argument about control, and it is a good one. The claim that an operator should be able to dynamically control the interval at which measurements are taken is a completely legitimate claim and expectation to have. This argument and its solution live in layer 7.

Enter Pully McPushface.


Pully McPushface is just a name to get attention: attention to something that can potentially make people cease their asinine pull vs. push arguments. It is simply the acknowledgement that one can push or pull at layer 3 (the direction in which one establishes a TCP session) and also push (send) or pull (request/response) on layer 7, independent of one another. To be clear, this approach has been possible since TCP hit the scene in 1982… so why haven’t monitoring systems leveraged it?

At Circonus, we’ve recently revamped our stack to allow for this freedom in almost every level of our architecture. Since the beginning, we’ve supported both push and pull protocols (like collectd, statsd, json over HTTP, NRPE, etc.), and we’ll continue to do so. The problem was that these all (as do the pundits) conflate layer 3 and layer 7 “initiation” in their design. (The collectd agent connects out via TCP to push data, and a monitor connects into NRPE to pull data.) We’re changing the dialogue.

Our collection system is designed to be distributed. We have our first tier: the core system, our second tier: the broker network, and our third tier: agents. While we support a multitude of agents (including the aforementioned statsd, collectd, etc.), we also have our own open source agent called NAD.

When we initially designed Circonus, we did extensive research with world-leading security teams to understand whether our layer 3 connections between tier 1 and tier 2 should be initiated by the broker to the core or vice versa. The consensus (unanimous, I might add) was that security would be improved by controlling a single inbound TCP connection to the broker, and the broker could be operated without a default route, preventing it from easily sending data to malicious parties were it ever duped. It turns out that our audience wholeheartedly disagreed with this expert opinion. The solution? Be agnostic. Today, the conversations between tier 1 and tier 2 care not who initiates the connection. Prefer the broker reaches out? That’s just fine. Want the core to connect to the broker? That’ll work too.

In our recent release of C:OSI (and NAD), we’ve applied the same agnosticism to connectivity between tier 2 and tier 3. Here is where the magic happens. The NAD agent now has the ability to both dial in and be dialed to on layer 3, while maintaining all of its normal layer 7 flexibility. Basically, however your network and systems are set up, we can work with that and still get on-demand, high-frequency data out; no more compromises. Say hello to Pully McPushface.

What’s new in JLog?

There is a class of problems in systems software that require guaranteed delivery of data from one stage of processing to the next stage of processing. In database systems, this usually involves a WAL file and a commit process that moves data from the WAL to the main storage files. If a crash or power loss occurs, we can replay the WAL file to reconstitute the database correctly. Nothing gets lost. Most database systems use some variant of ARIES.

In message broker systems, this usually involves an acknowledgement that a message was received and a retry from the client if there was no response or an error response. For durable message brokers, that acknowledgement should not go to the client until the data is committed to disk and safe. In larger brokered systems, like Kafka, this can extend to the data safely arriving at multiple nodes before acknowledging receipt to the client. These systems can usually be configured based on the relative tolerance of data loss for the application. For ephemeral stream data where the odd message or two can be dropped, we might set Kafka to acknowledge the message after only the leader has it, for example.

JLog is a library that provides journaled log functionality for your application and allows decoupling of data ingestion from data processing using publish/subscribe semantics. It supports both thread and multi-process safety. JLog can be used to build pub/sub systems that guarantee message delivery by relying on permanent storage for each received message and allowing different subscribers to maintain different positions in the log. It fully manages file segmentation and cleanup when all subscribers have finished reading a file segment.
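
To make the pub/sub model concrete, here is a rough sketch of a writer and a subscriber. It is drawn from memory of the JLog API (jlog_new, jlog_ctx_init, jlog_ctx_add_subscriber, jlog_ctx_open_writer, jlog_ctx_write, jlog_ctx_open_reader, jlog_ctx_read_interval, jlog_ctx_read_message, jlog_ctx_read_checkpoint, and the JLOG_ID_ADVANCE macro), so check the jlog.h in your tree for exact signatures, field names, and error handling:

#include <stdio.h>
#include <string.h>
#include <jlog.h>

int main(void) {
  /* create the log (first run only) */
  jlog_ctx *ctx = jlog_new("/tmp/events.jlog");
  jlog_ctx_init(ctx);                 /* errors if it already exists; harmless here */
  jlog_ctx_close(ctx);

  /* register a subscriber position; each subscriber keeps its own place in the log */
  ctx = jlog_new("/tmp/events.jlog");
  jlog_ctx_add_subscriber(ctx, "worker", JLOG_BEGIN);
  jlog_ctx_close(ctx);

  /* publish: each message is on permanent storage before the call returns */
  ctx = jlog_new("/tmp/events.jlog");
  if (jlog_ctx_open_writer(ctx) != 0) return 1;
  const char *msg = "hello, durable world";
  jlog_ctx_write(ctx, msg, strlen(msg));
  jlog_ctx_close(ctx);

  /* subscribe: read everything since this subscriber's checkpoint, then advance it */
  ctx = jlog_new("/tmp/events.jlog");
  if (jlog_ctx_open_reader(ctx, "worker") != 0) return 1;
  jlog_id begin, end;
  int count = jlog_ctx_read_interval(ctx, &begin, &end);
  for (int i = 0; i < count; i++) {
    jlog_message m;
    if (jlog_ctx_read_message(ctx, &begin, &m) == 0)
      printf("got: %.*s\n", (int)m.mess_len, (const char *)m.mess);
    JLOG_ID_ADVANCE(&begin);
  }
  if (count > 0) jlog_ctx_read_checkpoint(ctx, &end);  /* fully read segments can now be reclaimed */
  jlog_ctx_close(ctx);
  return 0;
}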

Recent additions

To support ongoing scalability and availability objectives at Circonus, I recently added a set of new features for JLog. I’ll discuss each of them in more detail below:

  • Compression with LZ4
  • Single process support on demand
  • Rewindable checkpoints
  • Pre-commit buffering

Compression with LZ4

If you are running on a file system that does not support compression, JLog now supports turning on LZ4 compression to reduce disk storage requirements and also increase write throughput, when used with pre-commit buffering. The API for turning on compression looks like:

typedef enum {
  JLOG_COMPRESSION_NULL = 0,
  JLOG_COMPRESSION_LZ4 = 0x01
} jlog_compression_provider_choice;

int jlog_ctx_set_use_compression(jlog_ctx *ctx, uint8_t use);

int jlog_ctx_set_compression_provider(jlog_ctx *ctx,    
    jlog_compression_provider_choice provider);

Currently, only LZ4 is supported, but other compression formats may be added in the future. Choosing the NULL compression provider option is the same as choosing no compression. It’s important to note that you must turn on compression before calling jlog_ctx_init, and the chosen compression will be stored with the JLog for its lifetime.
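
A minimal sketch of wiring this up, using the calls above plus the jlog_new / jlog_ctx_init creation step referenced in the text (the path is just an example, and error handling is abbreviated):

  jlog_ctx *ctx = jlog_new("/var/log/ingest.jlog");

  /* compression must be configured before jlog_ctx_init(); the choice is
     then stored with the JLog for its lifetime */
  jlog_ctx_set_use_compression(ctx, 1);
  jlog_ctx_set_compression_provider(ctx, JLOG_COMPRESSION_LZ4);

  if (jlog_ctx_init(ctx) != 0) {
    fprintf(stderr, "jlog_ctx_init failed: %d %s\n", jlog_ctx_err(ctx), jlog_ctx_err_string(ctx));
    exit(-1);
  }
  jlog_ctx_close(ctx);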

Single process support

This really should be called “switching off multi-process support,” as multi-process is the default behavior. Multi-process support protects the JLog directory with a file lock via fcntl(). Thread safety is always maintained and cannot be disabled, but you can turn off this file-locking system call if you know that writes will only ever come from a single process (probably the most common usage for JLog).

Using the following call with mproc == 0 will turn off this file locking, which should result in a throughput increase:

int jlog_ctx_set_multi_process(jlog_ctx *ctx, uint8_t mproc);

Rewindable checkpoints

Highly available systems may require the ability to go back to a previously read checkpoint. JLog, by default, will delete file segments when all subscribers have read all messages in the segment. If you wanted to go back to a previously read checkpoint for some reason (such as failed processing), you were stuck with no ability to rewind. Now, with support for rewindable checkpoints, you can set an ephemeral subscriber at a known spot and back up to that special named checkpoint. The API for using rewindable checkpoints is:

int jlog_ctx_add_subscriber(jlog_ctx *ctx, const char *subscriber,
    jlog_position whence);
int jlog_ctx_set_subscriber_checkpoint(jlog_ctx *ctx, 
    const char *subscriber, 
    const jlog_id *checkpoint);

Here’s an example of its usage:

  jlog_ctx *ctx;
  jlog_id checkpoint;

  ctx = jlog_new("/tmp/test.foo");
  if (jlog_ctx_open_reader(ctx, "reader") != 0) {
    fprintf(stderr, "jlog_ctx_open_reader failed: %d %s\n", jlog_ctx_err(ctx), jlog_ctx_err_string(ctx));
    exit(-1);
  }

  /* add our special trailing checkpoint subscriber */
  if (jlog_ctx_add_subscriber(ctx, "checkpoint-name", JLOG_BEGIN) != 0 && errno != EEXIST) {
    fprintf(stderr, "jlog_ctx_add_subscriber failed: %d %s\n", jlog_ctx_err(ctx), jlog_ctx_err_string(ctx));
    exit(-1);
  }

  /* now move the checkpoint subscriber to where the real reader is */
  if (jlog_get_checkpoint(ctx, "reader", &checkpoint) != 0) {
    fprintf(stderr, "jlog_get_checkpoint failed: %d %s\n", jlog_ctx_err(ctx), jlog_ctx_err_string(ctx));
    exit(-1);
  }

  if (jlog_ctx_set_subscriber_checkpoint(ctx, "checkpoint-name", &checkpoint) != 0) {
    fprintf(stderr, "jlog_ctx_set_subscriber_checkpoint failed: %d %s\n", jlog_ctx_err(ctx), jlog_ctx_err_string(ctx));
    exit(-1);
  }

Now we have a checkpoint named “checkpoint-name” at the same location as the main subscriber “reader”. If we want to rewind, we simply do this:

  /* move checkpoint to our original position, first read checkpoint location */
  if (jlog_get_checkpoint(ctx, "checkpoint-name", &checkpoint) != 0) {
    fprintf(stderr, "jlog_get_checkpoint failed: %d %s\n", jlog_ctx_err(ctx), jlog_ctx_err_string(ctx));
    exit(-1);
  }

  /* now move the main read checkpoint there */
  if (jlog_ctx_read_checkpoint(ctx, &checkpoint) != 0) {
    fprintf(stderr, "checkpoint failed: %d %s\n", jlog_ctx_err(ctx), jlog_ctx_err_string(ctx));
  } else {
    fprintf(stderr, "\trewound checkpoint...\n");
  }

To move our checkpoint forward, we merely call jlog_ctx_set_subscriber_checkpoint with the safe checkpoint.

Pre-commit buffering

One of the largest challenges with JLog is throughput. The ability to disable multi-process support helps reduce the syscalls required to write our data. This is good, but we still need to make a writev call for each message. This syscall overhead takes a serious bite out of throughput (more in the benchmarks section below). To get around this issue, we have to find a safe-ish way to reduce the syscall overhead of lots of tiny writes. We can either map the underlying block device and write to it directly (a nightmare) or we can batch the writes. Batching writes is way easier, but it sacrifices way too much data safety (a crash before a batch commit can lose many rows, depending on the size of the batch). At the end of the day, I chose a middle-ground approach which is fairly safe for the most common case but also allows very high throughput using batched writes.

int jlog_ctx_set_pre_commit_buffer_size(jlog_ctx *ctx, size_t s);

Setting this to something greater than zero will turn on pre-commit buffering. This is implemented as a writable mmapped memory region where all writes are batched up. The pre-commit buffer is flushed to the actual files when it is filled to the requested size. We rely on the OS to flush the mmapped data back to the backing file even if the process crashes. However, if we lose the machine to power loss, this approach is not safe. There is always a tradeoff between safety and throughput. Only use this approach if you are comfortable losing data in the event of power loss or kernel panic.

It is important to note that pre-commit buffering is not multi-process writer safe. If you are using JLog under a scheme that has multiple writing processes writing to the same JLog, you have to set the pre-commit buffer size to zero (the default). However, it is safe to use from a single-process, multi-threaded writer setup, and it is also safe to use under multi-process when there are multiple reading processes but a single writing process.

There is a tradeoff between throughput and read side latency if you are using pre-commit buffering. Since reads only ever occur out of the materialized files on disk and do not consider the pre-commit buffer, reads can only advance when the pre-commit buffer is flushed. If you have a large-ish pre-commit buffer size and a slow-ish write rate, your readers could be waiting for a while before they advance. Choose your pre-commit buffer size wisely based on the expected throughput of your JLog. Note that we also provide a flush function, which you could wire up to a timer to ensure the readers are advancing even in the face of slow writes:

int jlog_ctx_flush_pre_commit_buffer(jlog_ctx *ctx);
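
As a sketch of how these two calls might be used together (the 128 KB size, the payload variables, and the timer idea are illustrative, not prescribed; jlog_ctx_write is the normal JLog write call):

  /* when configuring the writer: enable a 128 KB pre-commit buffer */
  jlog_ctx_set_pre_commit_buffer_size(ctx, 128 * 1024);

  /* hot path: writes accumulate in the mmapped pre-commit buffer and are
     batched out to segment files when the buffer fills */
  jlog_ctx_write(ctx, payload, payload_len);

  /* from a periodic timer (e.g. once per second): flush so that a slow
     write rate doesn't leave readers stalled behind the buffer */
  jlog_ctx_flush_pre_commit_buffer(ctx);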

Benchmarks

All benchmarks are timed by writing one million JLog entries with a message size of 100 bytes. All tests were conducted on OmniOS v11 r151014 using ZFS as the file system with compression enabled.

Test                               Entries/sec    Time to complete
JLog Default                       ~114,000       8.735 sec
LZ4 compression on                 ~96,000        10.349 sec
Multi-process OFF                  ~138,000       7.248 sec
MP OFF + LZ4                       ~121,000       8.303 sec
MP OFF + Pre-commit buffer 128K    ~1,080,000     0.925 sec
MP OFF + Pre-commit + LZ4          ~474,000       2.113 sec

As you can see from the table above, turning multi-process support off provides a slight throughput advantage and all those calls to fcntl are elided, but the real amazing gains come from pre-commit buffering. Even a relatively small buffer of 128 KBytes gains us almost 8X in throughput over the next best option.

That LZ4 runs more slowly is not surprising; we are basically trading CPU for space savings. If you are already on a compressing file system, you get the space gains without flipping on compression in JLog, but on a non-compressing file system, JLog’s LZ4 support will save you disk space.

Circonus One Step Install

Introducing Quick and Simple Onboarding with C:OSI

When we started developing Circonus 6 years ago, we found many customers had very specific ideas about how they wanted their onboarding process to work. Since then, we’ve found that many more customers aren’t sure where to start.

The most rudimentary task new and existing users face is just getting metric data flowing from a new host into Circonus. New users want to see their data, graphs, and worksheets right away, and that process should be quick and easy, without any guesswork involved in sorting through all of the options. But those options need to continue to be available for users who require that flexibility, usually because they have a particular configuration in mind.

So, we listened. Now we’ve put those 6 years of gathering expertise to use in this new tool, so that everyone gets the benefit of that knowledge, but with a simple, streamlined process. This is a prescriptive process, so users who just want their data don’t have to be concerned with figuring out the best way to get started.

You can now register systems with Circonus in one simple command or as a simple part of configuration management. With that single command, you get a reasonable and comprehensive set of metrics and visuals. Check out the C:OSI tutorial on our Support Portal to see just how quick and simple it is, or have a quick look at the short demo video there.

New and existing Circonus users can use C:OSI to automate the process of bringing systems online, without inhibiting customization. In just one step, a single cut-and-paste command will:

  1. Select an agent.
  2. Install the agent.
  3. Configure the agent to expose metrics.
  4. Start the agent.
  5. Create a check to retrieve/accept the metrics from the agent.
  6. Enable basic system metrics.
  7. Create graphs for each of the basic metric groups.
  8. Create a worksheet containing the basic graphs so there is a unified view of the specific host.

C:OSI does all this via configuration files, pulled from a central site or read locally, either of which can easily be modified to suit your needs.

C:OSI also allows for customization, so users who depend on the flexibility of Circonus can also benefit from the simplicity of the streamlined process. If the default configuration prescribed by C:OSI doesn’t meet your own specifications, you can modify it, and the onboarding process is still as simple as running a single command.

You can dig into those customization options by visiting the C:OSI documentation in the Circonus Labs public GitHub repository.

Anyone in DevOps, or anyone who has been responsible for monitoring a stack, knows that creating connections or nodes can be a time consuming task. A streamlined, prescriptive onboarding process is faster and more efficient. This provides stronger consistency in the data collected, which in turn allows us to do better, smarter things with that data.