Pully McPushface

The Argument for Connectivity Agnosticism


It’s about push vs. pull… but it shouldn’t be.

There has been a lot of heated debate on whether pushing telemetry data from systems or pulling that data from systems is better. If you’re just hearing about this argument now, bless you. One would think that this debate is as ridiculous as vim vs. emacs or tabs vs. spaces, but it turns out there is a bit of meat on this bone. The problem is that the proposition is wrong. I hope that here I can reframe the discussion, help turn the corner, and walk a path where people get back to more productive things.

At Circonus, we’ve always been of the mindset that both push and pull should have their moments to shine. We accept both, but honestly, we are duped into this push vs. pull dialogue all too often. As I’ll explain, the choices we are shown aren’t the only options.

The idea behind pushing metrics is that the “system” in question (be it a machine or a service) should emit telemetry data to an “upstream” entity. The idea of pull is that some “upstream” entity should actively query systems for telemetry data. I am careful not to use the word “centralized” because in most large-scale modern monitoring systems, all of these bits (push or pull) are considerably decentralized. Let’s look through both sides of the argument (I’ll dispense with the patently false claims first):

Push has some arguments:

  1. Pull doesn’t scale well
  2. I don’t know where my data will be coming from.
  3. Push works behind complex network setups.
  4. When events transpire, I should push; pulling doesn’t match my cadence.
  5. Push is more secure.

Pull has some arguments:

  1. I know better when a machine or service goes bad because I control the polling interval.
  2. Controlling the polling interval allows me to investigate issues faster and more effectively.
  3. Pull is more secure.

To address the patently false claims (that pull doesn’t scale, and that either approach is inherently more secure): Pulling data from 2 million machines isn’t a difficult job. Do you have more than 2 million machines? Pull scales fine… Google does it. Whether you are pulling data from a secure place to the cloud or pushing data from a secure place to the cloud, you are moving data across the same boundary and are thus exposed to the same security risks. It is worth mentioning that in a setup where data is pulled, the target machine need not be able to route to the Internet at all, thus making the attack surface more slippery. I personally find that argument to be weak and believe that if the right security policies are put in place, both methods can be considered equally “securable.” It’s also worth mentioning that many of those making claims about security concerns have wide-open policies about pushing information beyond the boundaries of their digital enclave, and should spend some serious time reflecting on that.


Now to address the remaining issues.

Push: I don’t know where my data will be coming from.

Yes, it’s true that you don’t always know where your data is coming from. A perfect example is web clients: they show up to load a page or call an API, and then could potentially disappear for good. You don’t own that resource and, more importantly, don’t pay an operational or capital expenditure on acquiring or running it. So, I sympathize that we don’t always know which systems will be submitting telemetry information to us. On the flip side, for those machines or services that you know about and pay for, it’s just flat-out lazy to not know what they are. In the case of short-lived resources, it is imperative that you know when they are doing work and when they are gone for good. Considering this, it stands to reason that the resource being monitored must initiate the conversation. This is an argument for push… at least on layer 3. Woah! What? Why am I talking about OSI layers? I’ll get to that.

Push: Works behind complex network setups.

It turns out that pull actually works behind some complex network configurations where push fails, though these are quite rare in practice. Still, TCP sessions are bidirectional, so once you’ve conquered connection establishment you’ve solved this issue. So this argument (and the rare counterargument) is a layer 3 concern that struggles to find any relevance at layer 7.

Push: When events transpire, I should push; pulling doesn’t match my cadence.

Finally, some real meat. I’ve talked about this many times in the past, and it is 100% true that some things you want to observe fall well into the push realm and others into the pull realm. When an event transpires, you likely want to get that information upstream as quickly as possible, so push makes good sense. And as this is information… we’re talking layer 7. If you instrument processes starting and stopping, you likely don’t want to miss anything. On the other hand, the only way to never miss a disk space measurement would be to log every block allocation and deallocation, which sounds like overkill. This is a good example of where pulling that information at an operator’s discretion (say, every few seconds or every minute) would suffice. Basically, sometimes it makes good sense to push on layer 7, and sometimes it makes better sense to pull.

Pull: I know better when a machine or service goes bad because I control the polling interval.

This, to me, comes down to the responsible party. Is each of your million machines (or 10) responsible for detecting failure (in the form of absenteeism), or is that the responsibility of the monitoring system? That was rhetorical, of course. The monitoring system is responsible, full stop. Yet detecting failure by tracking the absenteeism of data in the push model requires elaborate models of acceptable delinquency in emissions. When the monitoring system pulls data, it controls the interval and can determine unavailability in a way that is reliable, simple, and, perhaps most importantly, easy to reason about. While there are elements of layer 3 here if the client is not currently “connected” to the monitoring system, this issue is almost entirely addressed on layer 7.

Pull: Controlling the polling interval allows me to investigate issues faster and more effectively.

For metrics in many systems, taking a measurement every 100ms is overkill. I have thousands of metrics available on a machine, and most of them remain expressive at observation intervals as large as five minutes. However, there are times when a tighter observation interval is warranted. This is an argument about control, and it is a good one. The expectation that an operator should be able to dynamically control the interval at which measurements are taken is completely legitimate. This argument and its solution live in layer 7.

Enter Pully McPushface.


Pully McPushface is just a name to get attention: attention to something that can potentially make people cease their asinine pull vs. push arguments. It is simply the acknowledgement that one can push or pull at layer 3 (the direction in which one establishes a TCP session) and also push (send) or pull (request/response) on layer 7, independent of one another. To be clear, this approach has been possible since TCP hit the scene in 1982… so why haven’t monitoring systems leveraged it?

At Circonus, we’ve recently revamped our stack to allow for this freedom in almost every level of our architecture. Since the beginning, we’ve supported both push and pull protocols (like collectd, statsd, json over HTTP, NRPE, etc.), and we’ll continue to do so. The problem was that these all (as do the pundits) conflate layer 3 and layer 7 “initiation” in their design. (The collectd agent connects out via TCP to push data, and a monitor connects into NRPE to pull data.) We’re changing the dialogue.

Our collection system is designed to be distributed. The first tier is the core system, the second tier is our broker network, and the third tier is the agents. While we support a multitude of agents (including the aforementioned statsd, collectd, etc.), we also have our own open source agent called NAD.

When we initially designed Circonus, we did extensive research with world-leading security teams to understand whether our layer 3 connections between tier 1 and tier 2 should be initiated by the broker to the core or vice versa. The consensus (unanimous, I might add) was that security would be improved by controlling a single inbound TCP connection to the broker, and that the broker could be operated without a default route, preventing it from easily sending data to malicious parties were it duped into doing so. It turns out that our audience wholeheartedly disagreed with this expert opinion. The solution? Be agnostic. Today, the conversations between tier 1 and tier 2 care not who initiates the connection. Prefer the broker reaches out? That’s just fine. Want the core to connect to the broker? That’ll work too.

In our recent release of C:OSI (and NAD), we’ve applied the same agnosticism to connectivity between tier 2 and tier 3. Here is where the magic happens. The NAD agent now has the ability to both dial in and be dialed to on layer 3, while maintaining all of its normal layer 7 flexibility. Basically, however your network and systems are set up, we can work with that and still get on-demand, high-frequency data out; no more compromises. Say hello to Pully McPushface.

What’s new in JLog?

There is a class of problems in systems software that require guaranteed delivery of data from one stage of processing to the next stage of processing. In database systems, this usually involves a WAL file and a commit process that moves data from the WAL to the main storage files. If a crash or power loss occurs, we can replay the WAL file to reconstitute the database correctly. Nothing gets lost. Most database systems use some variant of ARIES.

In message broker systems, this usually involves an acknowledgement that a message was received and a retry from the client if there was no response or an error response. For durable message brokers, that acknowledgement should not go to the client until the data is committed to disk and safe. In larger brokered systems, like Kafka, this can extend to the data safely arriving at multiple nodes before acknowledging receipt to the client. These systems can usually be configured based on the relative tolerance of data loss for the application. For ephemeral stream data where the odd message or two can be dropped, we might set Kafka to acknowledge the message after only the leader has it, for example.

JLog is a library that provides journaled log functionality for your application and allows decoupling of data ingestion from data processing using publish/subscribe semantics. It is both thread and multi-process safe. JLog can be used to build pub/sub systems that guarantee message delivery by relying on permanent storage for each received message and allowing different subscribers to maintain different positions in the log. It fully manages file segmentation and cleanup when all subscribers have finished reading a file segment.
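To make the moving parts concrete, here is a minimal write/read sketch in C. It follows the general pattern from the JLog README (create the log with jlog_ctx_init, then use separate writer and reader contexts, advancing the read position with the JLOG_ID_ADVANCE macro); the path is hypothetical and most error handling is omitted for brevity:

#include <stdio.h>
#include <string.h>
#include <jlog.h>

int main(void) {
  jlog_ctx *ctx;

  /* one-time creation of the log directory (ignore the error if it already exists) */
  ctx = jlog_new("/tmp/example.jlog");
  jlog_ctx_init(ctx);
  jlog_ctx_close(ctx);

  /* writer: register a subscriber and append a couple of messages */
  ctx = jlog_new("/tmp/example.jlog");
  jlog_ctx_add_subscriber(ctx, "reader", JLOG_BEGIN);
  jlog_ctx_open_writer(ctx);
  jlog_ctx_write(ctx, "hello", strlen("hello"));
  jlog_ctx_write(ctx, "world", strlen("world"));
  jlog_ctx_close(ctx);

  /* reader: consume everything between our checkpoint and the end of the log */
  ctx = jlog_new("/tmp/example.jlog");
  jlog_ctx_open_reader(ctx, "reader");
  jlog_id begin, end;
  int i, cnt = jlog_ctx_read_interval(ctx, &begin, &end);
  for (i = 0; i < cnt; i++, JLOG_ID_ADVANCE(&begin)) {
    jlog_message m;
    if (jlog_ctx_read_message(ctx, &begin, &m) == 0)
      printf("read: %.*s\n", (int)m.mess_len, (char *)m.mess);
  }
  jlog_ctx_read_checkpoint(ctx, &end);  /* mark everything read as consumed */
  jlog_ctx_close(ctx);
  return 0;
}

Once every subscriber’s checkpoint has moved past a file segment, JLog removes that segment from disk.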

Recent additions

To support ongoing scalability and availability objectives at Circonus, I recently added a set of new features for JLog. I’ll discuss each of them in more detail below:

  • Compression with LZ4
  • Single process support on demand
  • Rewindable checkpoints
  • Pre-commit buffering

Compression with LZ4

If you are running on a file system that does not support compression, JLog now supports turning on LZ4 compression to reduce disk storage requirements and, when used with pre-commit buffering, to also increase write throughput. The API for turning on compression looks like:

typedef enum {
  JLOG_COMPRESSION_NULL = 0,
  JLOG_COMPRESSION_LZ4 = 0x01
} jlog_compression_provider_choice;

int jlog_ctx_set_use_compression(jlog_ctx *ctx, uint8_t use);

int jlog_ctx_set_compression_provider(jlog_ctx *ctx,    
    jlog_compression_provider_choice provider);

Currently, only LZ4 is supported, but other compression formats may be added in the future. Choosing the NULL compression provider option is the same as choosing no compression. It’s important to note that you must turn on compression before calling jlog_ctx_init, and the chosen compression will be stored with the JLog for its lifetime.
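As a quick sketch of the ordering (hypothetical path, minimal error handling):

jlog_ctx *ctx = jlog_new("/tmp/compressed.jlog");
jlog_ctx_set_use_compression(ctx, 1);
jlog_ctx_set_compression_provider(ctx, JLOG_COMPRESSION_LZ4);
/* the compression choice must be made before jlog_ctx_init() creates the log;
   it is then stored with the JLog for its lifetime */
if (jlog_ctx_init(ctx) != 0)
  fprintf(stderr, "jlog_ctx_init failed: %d %s\n", jlog_ctx_err(ctx), jlog_ctx_err_string(ctx));
jlog_ctx_close(ctx);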

Single process support

This really should be called “switching off multi-process support”, as multi-process is the default behavior. Multi-process support protects the JLog directory with a file lock via fcntl(). We always maintain thread safety and there is no option to disable it, but you can turn off this file locking if you know that writes will only ever come from a single process (probably the most common usage for JLog).

Using the following call with mproc == 0 will turn off this file locking, which should result in a throughput increase:

int jlog_ctx_set_multi_process(jlog_ctx *ctx, uint8_t mproc);
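For example (a sketch; the assumption here, as with the other options, is that the flag is set while configuring the context, before the writer is opened):

jlog_ctx *ctx = jlog_new("/tmp/single_writer.jlog");
jlog_ctx_set_multi_process(ctx, 0);  /* single writing process: skip the fcntl() directory lock */
jlog_ctx_open_writer(ctx);
/* ... jlog_ctx_write() as usual ... */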

Rewindable checkpoints

Highly available systems may require the ability to go back to a previously read checkpoint. JLog, by default, will delete file segments when all subscribers have read all messages in the segment. If you wanted to go back to a previously read checkpoint for some reason (such as failed processing), you were stuck with no ability to rewind. Now, with support for rewindable checkpoints, you can set an ephemeral subscriber at a known spot and back up to that specially named checkpoint. The API for using rewindable checkpoints is:

int jlog_ctx_add_subscriber(jlog_ctx *ctx, const char *subscriber,
    jlog_position whence);
int jlog_ctx_set_subscriber_checkpoint(jlog_ctx *ctx, 
    const char *subscriber, 
    const jlog_id *checkpoint);

Here’s an example of its usage:

  jlog_ctx *ctx;
  jlog_id checkpoint;
  jlog_message message;

  ctx = jlog_new("/tmp/test.foo");
  if (jlog_ctx_open_reader(ctx, "reader") != 0) {
    fprintf(stderr, "jlog_ctx_open_reader failed: %d %s\n", jlog_ctx_err(ctx), jlog_ctx_err_string(ctx));
    exit(-1);
  }

  /* add our special trailing checkpoint subscriber */
  if (jlog_ctx_add_subscriber(ctx, "checkpoint-name", JLOG_BEGIN) != 0 && errno != EEXIST) {
    fprintf(stderr, "jlog_ctx_add_subscriber failed: %d %s\n", jlog_ctx_err(ctx), jlog_ctx_err_string(ctx));
    exit(-1);
  }

  /* now move the checkpoint subscriber to where the real reader is */
  if (jlog_get_checkpoint(ctx, "reader", &checkpoint) != 0) {
    fprintf(stderr, "jlog_get_checkpoint failed: %d %s\n", jlog_ctx_err(ctx), jlog_ctx_err_string(ctx));
    exit(-1);
  }

  if (jlog_ctx_set_subscriber_checkpoint(ctx, "checkpoint-name", &checkpoint) != 0) {
    fprintf(stderr, "jlog_ctx_set_subscriber_checkpoint failed: %d %s\n", jlog_ctx_err(ctx), jlog_ctx_err_string(ctx));
    exit(-1);
  }

Now we have a checkpoint named “checkpoint-name” at the same location as the main subscriber “reader”. If we want to rewind, we simply do this:

  /* move checkpoint to our original position, first read checkpoint location */
  if (jlog_get_checkpoint(ctx, "checkpoint-name", &checkpoint) != 0) {
    fprintf(stderr, "jlog_get_checkpoint failed: %d %s\n", jlog_ctx_err(ctx), jlog_ctx_err_string(ctx));
    exit(-1);
  }

  /* now move the main read checkpoint there */
  if (jlog_ctx_read_checkpoint(ctx, &checkpoint) != 0) {
    fprintf(stderr, "checkpoint failed: %d %s\n", jlog_ctx_err(ctx), jlog_ctx_err_string(ctx));
  } else {
    fprintf(stderr, "\trewound checkpoint...\n");
  }

To move our checkpoint forward, we merely call jlog_ctx_set_subscriber_checkpoint with the safe checkpoint.
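In other words, once a batch has been processed successfully, the trailing checkpoint is advanced to the reader’s current position by reusing the same two calls shown above:

  /* processing succeeded: advance the safe checkpoint to match the reader */
  if (jlog_get_checkpoint(ctx, "reader", &checkpoint) == 0) {
    if (jlog_ctx_set_subscriber_checkpoint(ctx, "checkpoint-name", &checkpoint) != 0) {
      fprintf(stderr, "advancing safe checkpoint failed: %d %s\n", jlog_ctx_err(ctx), jlog_ctx_err_string(ctx));
    }
  }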

Pre-commit buffering

One of the largest challenges with JLog is throughput. The ability to disable multi-process support helps reduce the syscalls required to write our data. This is good, but we still need to make a writev call for each message. This syscall overhead takes a serious bite out of throughput (more in the benchmarks section below). To get around this issue, we have to find a safe-ish way to reduce the syscall overhead of lots of tiny writes. We can either directly map the underlying block device and write to it directly (a nightmare) or we can batch the writes. Batching writes is way easier, but sacrifices way too much data safety (a crash before a batch commit can lose many rows depending on the size of the batch). At the end of the day, I chose a middle ground approach which is fairly safe for the most common case but also allows very high throughput using batch writes.

int jlog_ctx_set_pre_commit_buffer_size(jlog_ctx *ctx, size_t s);

Setting this to something greater than zero will turn on pre-commit buffering. This is implemented as a writable mmapped memory region where writes are batched up. The pre-commit buffer is flushed to the actual files when it fills to the requested size. We rely on the OS to flush the mmapped data back to the backing file even if the process crashes. However, if we lose the machine to power loss, this approach is not safe. There is always a tradeoff between safety and throughput. Only use this approach if you are comfortable losing data in the event of power loss or kernel panic.

It is important to note that pre-commit buffering is not multi-process writer safe. If you are using JLog under a scheme that has multiple writing processes writing to the same JLog, you have to set the pre-commit buffer size to zero (the default). However, it is safe to use from a single-process, multi-threaded writer setup, and it is also safe to use under multi-process when there are multiple reading processes but a single writing process.

There is a tradeoff between throughput and read side latency if you are using pre-commit buffering. Since reads only ever occur out of the materialized files on disk and do not consider the pre-commit buffer, reads can only advance when the pre-commit buffer is flushed. If you have a large-ish pre-commit buffer size and a slow-ish write rate, your readers could be waiting for a while before they advance. Choose your pre-commit buffer size wisely based on the expected throughput of your JLog. Note that we also provide a flush function, which you could wire up to a timer to ensure the readers are advancing even in the face of slow writes:

int jlog_ctx_flush_pre_commit_buffer(jlog_ctx *ctx);
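Putting the pieces together, a single-writer setup with a 128KB pre-commit buffer (the size used in the benchmarks below) might look like the following sketch; payload and payload_len are placeholders, and the flush would typically be driven by a timer:

jlog_ctx *ctx = jlog_new("/tmp/fast.jlog");
jlog_ctx_set_multi_process(ctx, 0);                    /* single writing process only */
jlog_ctx_set_pre_commit_buffer_size(ctx, 128 * 1024);  /* batch writes in a 128KB mmapped buffer */
jlog_ctx_open_writer(ctx);

/* hot path: each write lands in the pre-commit buffer and is flushed to disk in batches */
jlog_ctx_write(ctx, payload, payload_len);

/* e.g. from a periodic timer, so readers keep advancing even when writes are slow */
jlog_ctx_flush_pre_commit_buffer(ctx);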

Benchmarks

All benchmarks are timed by writing one million JLog entries with a message size of 100 bytes. All tests were conducted on OmniOS v11 r151014 using ZFS as the file system with compression enabled.

Test                              Entries/sec   Time to complete
JLog Default                      ~114,000      8.735 sec
LZ4 compression on                ~96,000       10.349 sec
Multi-process OFF                 ~138,000      7.248 sec
MP OFF + LZ4                      ~121,000      8.303 sec
MP OFF + Pre-commit buffer 128K   ~1,080,000    0.925 sec
MP OFF + Pre-commit + LZ4         ~474,000      2.113 sec

As you can see from the table above, turning multi-process support off provides a slight throughput advantage and all those calls to fcntl are elided, but the real amazing gains come from pre-commit buffering. Even a relatively small buffer of 128 KBytes gains us almost 8X in throughput over the next best option.

That LZ4 is running more slowly is not surprising. We are basically trading CPU for space savings. In addition, using a compressing file system will get you these space gains without having to flip on compression in JLog. However, if you are running on a non-compressed file system, it will save you disk space.

Circonus One Step Install

Introducing Quick and Simple Onboarding with C:OSI

When we started developing Circonus 6 years ago, we found many customers had very specific ideas about how they wanted their onboarding process to work. Since then, we’ve found that many more customers aren’t sure where to start.

The most rudimentary task new and existing users face is just getting metric data flowing from a new host into Circonus. New users want to see their data, graphs, and worksheets right away, and that process should be quick and easy, without any guesswork involved in sorting through all of the options. But those options need to continue to be available for users who require that flexibility, usually because they have a particular configuration in mind.

So, we listened. Now we’ve put those 6 years of gathering expertise to use in this new tool, so that everyone gets the benefit of that knowledge, but with a simple, streamlined process. This is a prescriptive process, so users who just want their data don’t have to be concerned with figuring out the best way to get started.

You can now register systems with Circonus in one simple command or as a simple part of configuration management. With that single command, you get a reasonable and comprehensive set of metrics and visuals. Check out our C:OSI tutorial on our Support Portal to see just how quick and simple it is.

New and existing Circonus users can use C:OSI to automate the process of bringing systems online. Without inhibiting customization, a single cut-and-paste command does all of the following in one step:

  1. Select an agent.
  2. Install the agent.
  3. Configure the agent to expose metrics.
  4. Start the agent.
  5. Create a check to retrieve/accept the metrics from the agent.
  6. Enable basic system metrics.
  7. Create graphs for each of the basic metric groups.
  8. Create a worksheet containing the basic graphs so there is a unified view of the specific host.

C:OSI does all this via configuration files, pulled from a central site or read from a local file, either of which can easily be modified to suit your needs.

C:OSI also allows for customization, so users who depend on the flexibility of Circonus can also benefit from the simplicity of the streamlined process. If the default configuration prescribed by C:OSI doesn’t meet your own specifications, you can modify it, but the onboarding process would still be as simple as running a single command.

You can dig into those customization options by visiting the C:OSI documentation in the Circonus Labs public Github repository.

Anyone in DevOps, or anyone who has been responsible for monitoring a stack, knows that creating connections or nodes can be a time consuming task. A streamlined, prescriptive onboarding process is faster and more efficient. This provides stronger consistency in the data collected, which in turn allows us to do better, smarter things with that data.

Circonus Instrumentation Packs

In our Circonus Labs public github repo, we have started a project called Circonus Instrumentation Packs, or CIP. This is a series of libraries to make it even easier to submit telemetry data from your application.

Currently there are CIP directories for go, java, and node.js. Each language directory has useful resources to help instrument applications written in that language.

Some languages have a strong leaning toward frameworks, while others are about patterns, and still others are about tooling. These packs are intended to “meld in” with the common way of doing things in each language, so that developer comfort is high and integration time and effort are minimal.

Each of these examples utilizes the HTTP Trap check, which you can create within Circonus. Simply create a new JSON push (HTTPTrap) check in Circonus using the HTTPTRAP broker, and then the CheckID, UUID, and secret will be available on the check details page.

CHECKID / UUID / Secret Example

This can be done via the user interface or via the API. The “target” for the check does not need to be an actual hostname or IP address; the name of your service might be a good substitute.

We suggest that you use a different trap for different node.js apps, as well as for production, staging, and testing.

Below is a bit more detail on each of the currently available CIPs:

Java

Java has a very popular instrumentation library called “metrics,” originally written by Coda Hale and later adopted by Dropwizard. Metrics has some great ideas that we support whole-heartedly; in particular, the use of histograms for more insightful reporting. Unfortunately, the way these measurements are captured and reported makes calculating service level agreements and other such analytics impossible. Furthermore, the implementations of the underlying histograms (Reservoirs in metrics-terminology) are opaque to the reporting tools. The Circonus metrics support in this CIP is designed to layer (non-disruptively) on top of the Dropwizard metrics packages.

Go

This library supports named counters, gauges, and histograms. It also provides convenience wrappers for registering latency instrumented functions with Go’s built-in http server.

Initializing only requires you set the AuthToken (which you generate in your API Tokens page) and CheckId, and then “Start” the metrics reporter.

You’ll need two github repos:

Here is the sample code (also found in the circonus-gometrics readme):

package main

import (
    "fmt"
    "net/http"

    metrics "github.com/circonus-gometrics"
)

func main() {
    // Get your Auth token at https://login.circonus.com/user/tokens
    metrics.WithAuthToken("cee5d8ec-aac7-cf9d-bfc4-990e7ceeb774")
    // Get your CheckId on the check details page
    metrics.WithCheckId(163063)
    metrics.Start()

    http.HandleFunc("/", metrics.TrackHTTPLatency("/", func(w http.ResponseWriter, r *http.Request) {
        fmt.Fprintf(w, "Hello, %s!", r.URL.Path[1:])
    }))
    http.ListenAndServe(":8080", http.DefaultServeMux)
}

After you start the app (go run the_file_name.go), load http://localhost:8080 in your browser, or curl http://localhost:8080. You’ll need to approve access to the API Token (if it is the first time you have used it), and then you can create a graph (make sure you are collecting histogram data) and you’ll see something like this:

Go HTTPTrap histogram example graph

Node.js

This instrumentation pack is designed to allow node.js applications to easily report telemetry data to Circonus using the UUID and Secret (instead of an API Token and CheckID). It has special support for providing sample-free (100% sampling) collection of service latencies for submission, visualization, and alerting to Circonus.

Here is a basic example to measure latency:

First, some setup: making the app.

% mkdir restify-circonus-example
% cd restify-circonus-example
% npm init .

(Accepting the npm init defaults works fine.) Then:

% npm install --save restify
% npm install --save probability-distributions
% npm install --save circonus-cip

Next, edit index.js and include:

var restify = require('restify'),
    PD = require("probability-distributions"),
    circonus_cip = require('circonus-cip')

var circonus_uuid = '33e894e6-5b94-4569-b91b-14bda9c650b1'
var circonus_secret = 'ssssssssh_its_oh_so_quiet'

var server = restify.createServer()
server.on('after', circonus_cip.restify(circonus_uuid, circonus_secret))
server.get('/', function (req, res, next) {
  setTimeout(function() {
    res.writeHead(200, { 'Content-Type': 'text/plain' });
    //res.write("Hello to a new world of understanding.\n");
    res.end("Hello to a new world of understanding.\n");
    next();
  }, PD.rgamma(1, 3, 2) * 200);
})

server.listen(8888)

Now just start up the app:

node index.js

Then go to your browser and load localhost:8888, or at the prompt, curl http://localhost:8888.

You’ll then go and create the graph in your account. Make sure to enable collection of the metric (“… httptrap: restify`GET`/`latency …”) as a histogram, and you’ll end up with a graph like this:

The Restify Histogram graph

Discovering circonusvi

Folks who know Circonus and use it regularly for operations also know that its API is an important part of efficient management for your monitoring facility.

For example, after building out a new datacenter with our SaaS application, we wanted to apply tagging to our checks to make searching more effective. The UI is easy to use, but I needed to tag batches of checks rather than one at a time, which is a job for the API.

I could also write a small program to do a search and populate tags. That’s when a co-worker suggested I use circonusvi (https://github.com/circonus-labs/circonusvi).

Circonusvi is a neat little tool contributed by Ben Chapman to the Circonus Labs github repo. It’s a natural tool for most folks who work with unix or unix-like platforms. Blend that with the JSON input/output of the Circonus API and you have a quick way to make ad hoc changes.

So after installing the python requirements for circonusvi, I generated a normal API token from the browser, ran circonusvi once, and validated the token in the browser user interface.

My first run of circonusvi without arguments returned everything, allowing me to look over things and understand the JSON structure.

Now for the business.

This returns the matching server checks as JSON output, which I can now edit in vi:

./circonusvi.py 'display_name=servers([0-9]).foo.net json:nad'

And this example finds all the empty tags and populates them with something useful:

%s/\"tags\"\:\ \[\]/\"tags\":\ [\"component:http\",\"datacenter:ohio\",\"os:plan9\"]/g

After saving the changes and verifying the results, circonusvi prompts you one last time about updating the server. Then it updates and you’re done!

Graph Hover Lock

A new feature to help make sense of graphs with multiple data points

When visualizing your data, you may often want to compare multiple data points on a single graph. You may even want to compare a metric across a dozen machines, but a graph with more than two or three data points can quickly turn into a visual mess. Circonus helps make these more complex graphs human-readable by allowing users to highlight one data point at a time. This new feature expands on that capability.

When you hover over a graph with multiple datapoints, the datapoint closest to your cursor is highlighted. Now it’s highlighted more prominently and brought to the front, while the other datapoints fade to the back.

You can also click the graph to lock that state into place. You can tell it’s in a locked hover state by the lock icon in the upper right corner of the graph. Click the graph again to unlock.

For graphs with many datapoints, this will help you zero in on the specific datapoint(s) you want to focus on.

See Figure 1. This graph shows HTTP Connect Times across a dozen combinations of different services and different brokers. A number of the data points are hard to see because of the number of data points in the graph.

Figure 1 – Graph Hover Lock

Hovering over the graph allows us to view the datapoints more easily. Here in Figure 2, we have used this feature to lock the graph and now we can see one of the smaller datapoints clearly.

Figure 2 – Locked graph

To enable this behavior across all graphs, a couple click behaviors have changed. First, when on the graphs list page or on a worksheet, you can no longer click a graph to go view that graph; now you have to click a graph’s title bar to go view it. Second, on the metrics page in grid mode, you can no longer click a metric graph to select that metric for graph creation; instead, you have to click the metric graph’s title bar.

This tool should make it even easier to visualize your data.

Advanced Search Builder

Last year, changes on the backend allowed Circonus to make significant improvements to our search capability. Now, we’ve added an Advanced Search tool to allow users to easily build complex search queries, making Search in Circonus more powerful and flexible than ever before.

When you click on the search icon, you will see an “Advanced” button to the right of the search field after it is expanded. Clicking this button will expand the Advanced Search Builder and allow you to construct advanced search queries.


More information about our Search functionality and the logic it uses is available in our user documentation.

The New Grid View – Instant Graphs on the Metrics Page

We just added a new feature to the UI which displays a graph for every metric in your account.

While the previous view (now called List View) did show a graph for each metric, these graphs were hidden by default. The new Grid View shows a full page of graphs, one for each metric. You can easily switch between Grid and List views as needed.

These screenshots below illustrate the old list view, the new layout options menu, and the new grid view.

Figure 1 – Old list view
Figure 2 – New “Layout Options” menu
Figure 3 – New Grid View

The grid-style layout provides you with an easy way to view a graph for each metric in the list. It lets you click-to-select as many metrics as you want and easily create a graph out of them.

You can also:

  • Choose from 3 layouts with different graph sizes.
  • Define how the titles are displayed.
  • Hover over a graph to see the metric value.
  • Play any number of graphs to get real-time data.

We hope this feature is as useful to you as it has been to us. More information is available in our user documentation.

Show Me the Data

Avoid spike erosion with Percentile and Histogram Aggregation

It has become common wisdom that the lossy process of averaging measurements leads to all kinds of problems when measuring performance of services (see Schlossnagle2015, Ugurlu2013, Schwartz2015, Gregg2014). Yet most people are not aware that averages are even more pervasive than just old-fashioned SLA formulations and the storage backends for monitoring data. In fact, it is likely that most graphs you are viewing involve some averaging behind the scenes, which introduces severe side effects. In this post, we will describe a phenomenon called spike erosion, and highlight some alternative views that give you a more accurate picture of your data.

Meet Spike Erosion

Spike Erosion of Request Rates

Take a look at Figure 1. It shows a graph of request rates over the last month. The spike near December 23 marks the apparent maximum, at around 7 requests per second (rps).

Figure 1: Web request rate in requests per second over one month time window

What if I told you the actual maximal request rate was almost double that value, at 13.67 rps (marked with the horizontal guide)? Moreover, it was not reached on December 23, but on December 15 at 16:44, near the left boundary of the graph?

Looks way off, right?

But it’s actually true! Figure 2 shows the same graph zoomed in on that time window.

Figure 2: Web request rates (in rps) over a 4h period

We call this phenomenon spike erosion; the farther you zoom out, the lower the spikes, and it’s actually very common in all kinds of graphs across all monitoring products.

Let’s see another example.

Spike Erosion of Ping Latencies

Take a look at Figure 3. It shows a graph of ping latencies (from twitter.com) over the course of 4 weeks. Again, it looks like the latency is rather stable around 0.015ms, with occasional spikes above 0.02ms and a clear maximum around December 23, with a value of about 0.03ms.

Figure 3: Ping latency of twitter.com in ms over the last month

Again, we have marked the actual maximum with a horizontal guide line. It is more than double the apparent maximum and is attained at each of the visible spikes. That’s right. All spikes do in fact have the same maximal height. Figure 4 shows a closeup of the one on December 30, in the center.

Figure 4: Ping latency of twitter.com in ms on December 30

What’s going on?

The mathematical explanation of spike erosion is actually pretty simple. It is an artifact of an averaging process that happens behind the scenes, in order to produce sensible plots with high performance.

Note that within a 4 week period we have a total of 40,320 samples collected that we need to represent in a plot over that time window. Figure 5 shows how a plot of all those samples looks in Gnuplot. There are quite a few issues with this raw presentation.

Figure 5: Plot of the raw data of request rates over a month

First, there is a ton of visual noise in that image. In fact, you cannot even see the individual 40,000 samples for the simple reason that the image is only 1240 pixels wide.

Also, rendering such an image within a browser puts a lot of load on the CPU. The biggest issue with producing such an image is the latency involved with retrieving 40K float values from the db and transmitting them as JSON over the internet.

In order to address the above issues, all mainstream graphing tools pre-aggregate the data before sending it to the browser. The size of the graph determines the number of values that should be displayed, e.g. 500. The raw data is then distributed across 500 bins; for each bin the average is taken and displayed in the plot.

This process leads to plots like Figure 1, which (a) can be produced much faster, since less data has to be transferred and rendered (in fact, you can cache the pre-aggregated values to speed up retrieval from the db), and (b) are less visually cluttered. However, it also leads to (c) spike erosion!

When looking at a four week time window, the raw count of 40,320 samples is reduced to a mere 448 plotted values, where each plotted value corresponds to an average over a 90 minute period. If there is a single spike in one of the bins, it gets averaged together with roughly 90 other samples of lower value, which erodes the height of the spike.
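To make the arithmetic concrete, here is a small standalone sketch (illustrative C, not Circonus code) that bins 900 one-minute samples into 10 plotted values. The per-bin mean all but hides a lone 13.67 rps spike, while the per-bin minimum and maximum (the subject of the next section) preserve it:

#include <stdio.h>
#include <float.h>

#define SAMPLES 900   /* e.g. 900 one-minute samples */
#define BINS     10   /* rendered as 10 plotted values, 90 samples per bin */

int main(void) {
  double v[SAMPLES];
  int i, b;
  for (i = 0; i < SAMPLES; i++) v[i] = 7.0;  /* steady ~7 rps */
  v[123] = 13.67;                            /* a single spike */

  int per_bin = SAMPLES / BINS;
  for (b = 0; b < BINS; b++) {
    double sum = 0, mn = DBL_MAX, mx = -DBL_MAX;
    for (i = b * per_bin; i < (b + 1) * per_bin; i++) {
      sum += v[i];
      if (v[i] < mn) mn = v[i];
      if (v[i] > mx) mx = v[i];
    }
    /* the mean of the bin containing the spike is barely above 7,
       but its max still reports the true 13.67 */
    printf("bin %d: mean=%.3f min=%.2f max=%.2f\n", b, sum / per_bin, mn, mx);
  }
  return 0;
}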

What to do about it?

There are (at least) two ways to allow you to avoid spike erosion and get more insight into your data. Both change the way the data is aggregated.

Min-Max Aggregation

The first way is to show the minimum and the maximum values of each bin along with the mean value. By doing so, you get a sense of the full range of the data, including the highest spikes. Figures 6 and 7 show how Min-Max Aggregation looks in Circonus for the request rate and latency examples.

Figure 6: Request rate graph with Min-Max Aggregation Overlay
Figure 7: Latencies with Min/Max-Aggregation Overlay

In both cases, the points where the maximum values are attained are clearly visible in the graph. When zooming into the spikes, the Max aggregation values stay aligned with the global maximum.

Keeping in mind that minimum and maximum are special cases of percentiles (namely the 0%-percentile and 100%-percentile), it appears natural to extend the aggregation methods to allow general quantiles as well. This is what we implemented in Circonus with the Percentile Aggregation overlay.

Histogram Aggregation

There is another, structurally different approach to mitigate spike erosion. It begins with the observation that histograms have a natural aggregation logic: Just add the bucket counts. More concretely, a histogram metric that stores data for each minute can be aggregated to larger time windows (e.g. 90 minutes) without applying any summary statistic, like a mean value, simply by adding the counts for each histogram bin across the aggregation time window.
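As a sketch of that logic (using a hypothetical fixed set of buckets rather than Circonus’ actual histogram format), rolling up 90 one-minute histograms into a single 90-minute histogram is nothing more than element-wise addition of the bucket counts:

#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define NBUCKETS 16  /* hypothetical fixed set of latency buckets */

/* Roll up n per-minute histograms into one aggregate histogram by adding
   the counts bucket by bucket; no mean or other summary statistic is involved. */
static void aggregate(uint64_t minutes[][NBUCKETS], size_t n, uint64_t out[NBUCKETS]) {
  size_t m, b;
  memset(out, 0, NBUCKETS * sizeof(uint64_t));
  for (m = 0; m < n; m++)
    for (b = 0; b < NBUCKETS; b++)
      out[b] += minutes[m][b];
}

int main(void) {
  uint64_t window[90][NBUCKETS] = {{0}};  /* 90 one-minute histograms */
  window[0][3] = 42;                      /* some example counts in bucket 3 */
  window[89][3] = 8;

  uint64_t rollup[NBUCKETS];
  aggregate(window, 90, rollup);
  printf("bucket 3 over 90 minutes: %llu\n", (unsigned long long)rollup[3]);
  return 0;
}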

If we combine this bucket-count observation with the simple fact that a time-series metric can be considered a histogram with a single value in it, we arrive at the powerful Histogram Aggregation, which rolls up time series into histogram metrics with lower time resolution. Figures 8 and 9 show Histogram Aggregation Overlays for the Request Rate and Latency examples discussed above.

Figure 8: Request Rates with Histogram Aggregation Overlay
Figure 9: Latencies with Histogram Aggregation Overlay

In addition to showing the value range (which in the figures above is amplified by using the Min-Max Aggregation Overlay), we also gain a sense of the distribution of values across each bin. In the request rate example, the data varies widely across a corridor of 2.5-10 rps. In the latency example, the distribution is concentrated near the global median of 0.015ms, with single-value outliers.

Going Further

We have seen that displaying data as histograms gives a more concise picture of what is going on. Circonus allows you to go one step further and collect your data as histograms in the first place. This allows you to capture the latencies of all requests made to your API, instead of only probing your API once per minute. See [G.Schlossnagle2015] for an in-depth discussion of the pros and cons of this “passive monitoring” approach. Note that you can still compute averages and percentiles for viewing and alerting.

Figure 10: API Latency Histogram Metric with Average Overlay

Figure 10 shows a histogram metric of API latencies, together with the mean value computed as an overlay. While this figure looks quite similar to Figures 8 and 9, the logical dependency is reversed. The mean values are computed from the histogram, not the other way around. Also, note that the time window of this figure only spans a few hours, instead of four weeks. This shows how much richer the captured histogram data is.

Resources

  1. Theo Schlossnagle – Problem with Math
  2. Dogan Ugurlu (Optimizely) – The Most Misleading Measure of Response Time: Average
  3. Baron Schwartz – Why percentiles don’t work the way you think
  4. Brendan Gregg – Frequency Tails: What the mean really means
  5. George Schlossnagle – API Performance Monitoring

The Future of Monitoring: Q&A with Jez Humble



Jez Humble is a lecturer at U.C. Berkeley, and co-author of the Jolt Award-winning Continuous Delivery: Reliable Software Releases through Build, Test and Deployment Automation (Addison-Wesley 2011) and Lean Enterprise: How High Performance Organizations Innovate at Scale (O’Reilly 2015), in Eric Ries’ Lean series. He has worked as a software developer, product manager, consultant and trainer across a wide variety of domains and technologies. His focus is on helping organisations deliver valuable, high-quality software frequently and reliably through implementing effective engineering practices.

Theo’s Intro:

It is my perspective that the world of technology will be a continual place eventually.  As services become more and more componentized they stand to become more independently developed and operated.  The implications on engineering design when attempting to maintain acceptable resiliency levels are significant.  The convergence on a continual world is simply a natural progression and will not be stopped.
Jez has taken to deep thought and practice around these challenges quite a bit ahead of the adoption curve, and has a unique perspective on where we are going, why we are going there, and (likely vivid) images of the catastrophic derailments that might occur along the tracks. While I spend all my time thinking about how people might have peace of mind that their systems and businesses are measurably functioning during and after transitions into this new world, my interest is compounded by Circonus’ own internal use of continual integration and deployment practices for both our SaaS and on-premises customers.

THEO: Most of the slides, talks and propaganda around CI/CD (Continuous Integration/Continuous Delivery) are framed in the context of businesses launching software services that are consumed by customers as opposed to software products consumed by customers. Do you find that people need a different frame of mind, a different perspective or just more discipline when they are faced with shipping product vs. shipping services as it relates to continual practices?

JEZ: The great thing about continuous delivery is that the same principles apply whether you’re doing web services, product development, embedded or mobile. You need to make sure you’re working in small batches, and that your software is always releasable, otherwise you won’t get the benefits. I started my career at a web startup but then spent several years working on packaged software, and the discipline is the same. Some of the problems are different: for example, when I was working on go.cd, we built a stage into our deployment pipeline to do automated upgrade testing from every previous release to what was on trunk. But fundamentally, it’s the same foundations: comprehensive configuration management, good test automation, and the practice of working in small batches on trunk and keeping it releasable. In fact, one of my favourite case studies for CI/CD is HP’s LaserJet Firmware division — yet nobody is deploying new firmware multiple times a day. You do make a good point about discipline: when you’re not actually having to deploy to production on a regular basis it can be easy to let things slide. Perhaps you don’t pay too much attention to the automated functional tests breaking, or you decide that one long-lived branch to do some deep surgery on a fragile subsystem is OK. Continuous deployment (deploying to production frequently) tends to concentrate the mind. But the discipline is equally important however frequently you release.

THEO: Do you find that organizations “going lean” struggle more, take longer or navigate more risk when they are primarily shipping software products vs. services?

JEZ: Each model has its own trade-offs. Products (including mobile apps) usually require a large matrix of client devices to test in order to make sure your product will work correctly. You also have to worry about upgrade testing. Services, on the other hand, require development to work with IT operations to get the deployment process to a low-risk pushbutton state, and make sure the service is easy to operate. Both of these problems are hard to solve — I don’t think anybody gets an easy ride. Many companies who started off shipping product are now moving to a SaaS model in any case, so they’re having to negotiate both models, which is an interesting problem to face. In both cases, getting fast, comprehensive test automation in place and being able to run as much as possible on every check-in, and then fixing things when they break, is the beginning of wisdom.

THEO: Thinking continuously is only a small part of establishing a “lean enterprise.” Do you find engineers more easily reason about adopting CI/CD than other changes such as organizational retooling and process refinements? What’s the most common sticking point (or point of flat-out derailment) for organizations attempting to go lean?

JEZ: My biggest frustration is how conservative most technology organizations are when it comes to changing the way people behave. There are plenty of engineers who are happy to play with new languages or technologies, but god forbid you try and mess with their worldview on process. The biggest sticking point – whether it’s engineers, middle management or leadership – is getting them to change their behavior and ways of thinking.

But the best people – and organizations – are never satisfied with how they’re doing and are always looking for ways to improve.

The worst ones either just accept the status quo, or are always blowing things up (continuous re-orgs are a great example), lurching from one crisis to another. Sometimes you get both. Effective leaders and managers understand that it’s essential to have a measurable customer or organizational outcome to work towards, and that their job is to help the people working for them experiment in a disciplined, scientific way with process improvement work to move towards the goal. That requires that you actually have time and resources to invest in this work, and that you have people with the capacity for and interest in making things better.

THEO: Finance is precise and process-oriented, and oftentimes bad things happen (people working from different or incorrect base assumptions) when there are too many cooks in the kitchen. This is why finance is usually tightly controlled by the CFO, and models and representations are fastidiously enforced. Monitoring and analytics around that data share a lot in common with respect to models and meanings. However, many engineering groups have far less discipline and control than financial groups do. Where do you see things going here?

JEZ: Monitoring isn’t really my area, but my guess is that there are similar factors at play here to other parts of the DevOps world, which is the lack of both an economic model and the discipline to apply it. Don Reinertsen has a few quotes that I rather like: “you may ignore economics, but economics won’t ignore you.” He also says of product development “The measure of execution in product development is our ability to constantly align our plans to whatever is, at the moment, the best economic choice.” Making good decisions is fundamentally about risk management: what are the risks we face? What choices are available to us to mitigate those risks? What are the impacts? What should we be prepared to pay to mitigate those impacts? What information is required to assess the probability of those risks occurring? How much should we be prepared to pay for that information? For CFOs working within business models that are well understood, there are templates and models that encapsulate this information in a way that makes effective risk management somewhat algorithmic, provided of course you stay within the bounds of the model. I don’t know whether we’re yet at that stage with respect to monitoring, but I certainly don’t feel like we’re yet at that stage with the rest of DevOps. Thus a lot of what we do is heuristic in nature — and that requires constant adaptation and improvement, which takes even more discipline, effort, and attention. That, in a department which is constantly overloaded by firefighting. I guess that’s a very long way of saying that I don’t have a very clear picture of where things are going, but I think it’ll be a while before we’re in a place that has a bunch of proven models with well understood trade-offs.

THEO: In your experience how do organizations today habitually screw up monitoring? What are they simply thinking about “the wrong way?”

JEZ: I haven’t worked in IT operations professionally for over a decade, but based on what I hear and observe, I feel like a lot of people still treat monitoring as little more than setting up a bunch of alerts. This leads to a lot of the issues we see everywhere with alert fatigue and people working very reactively. Tom Limoncelli has a nice blog post where he recommends deleting all your alerts and then, when there’s an outage, working out what information would have predicted it, and just collecting that information. Of course he’s being provocative, but we have a similar situation with tests — people are terrified about deleting them because they feel like they’re literally deleting quality (or in the case of alerts, stability) from their system. But it’s far better to have a small number of alerts that actually have information value than a large number that are telling you very little, but drown the useful data in noise.

THEO: Andrew Shafer said that “technology is 90% tribalism and fashion.” I’m not sure about the percentage, but he nailed the heart of the problem. You and I both know that process, practice, and methods sunset faster in technology than in most other fields. I’ll ask the impossible question… after enterprises go lean, what’s next?

JEZ: I actually believe that there’s no end state to “going lean.” In my opinion, lean is fundamentally about taking a disciplined, scientific approach to product development and process improvement — and you’re never done with that. The environment is always changing, and it’s a question of how fast you can adapt, and how long you can stay in the game. Lean is the science of growing adaptive, resilient organizations, and the best of those are always getting better. Andrew is (as is often the case) correct, and what I find really astonishing is that as an industry we have a terrible grasp of our own history. As George Santayana has it, we seem condemned to repeat our mistakes endlessly, albeit every time with some shiny new technology stack. I feel like there’s a long way to go before any software company truly embodies lean principles — especially the ability to balance moving fast at high quality while maintaining a humane working environment. The main obstacle is the appalling ineptitude of a large proportion of IT management and leadership — so many of these people are either senior engineers who are victims of the Peter Principle or MBAs with no real understanding of how technology works. Many technologists even believe effective management is an oxymoron. While I am lucky enough to know several great leaders and managers, they have not in general become who they are as a result of any serious effort in our industry to cultivate such people. We’re many years away from addressing these problems at scale.