What’s new in JLog?


Introduction

There is a class of problems in systems software that require guaranteed delivery of data from one stage of processing to the next stage of processing. In database systems, this usually involves a WAL file and a commit process that moves data from the WAL to the main storage files. If a crash or power loss occurs, we can replay the WAL file to reconstitute the database correctly. Nothing gets lost. Most database systems use some variant of ARIES.

In message broker systems, this usually involves an acknowledgement that a message was received and a retry from the client if there was no response or an error response. For durable message brokers, that acknowledgement should not go to the client until the data is committed to disk and safe. In larger brokered systems, like Kafka, this can extend to the data safely arriving at multiple nodes before acknowledging receipt to the client. These systems can usually be configured based on the relative tolerance of data loss for the application. For ephemeral stream data where the odd message or two can be dropped, we might set Kafka to acknowledge the message after only the leader has it, for example.

JLog is a library that provides journaled log functionality for your application and allows decoupling of data ingestion from data processing using a publish subscribe semantic. It supports both thread and multi-process safety. JLog can be used to build pub/sub systems that guarantee message delivery by relying on permanent storage for each received message and allowing different subscribers to maintain a different position in the log. It fully manages file segmentation and cleanup when all subscribers have finished reading a file segment.

Recent additions

To support ongoing scalability and availability objectives at Circonus, I recently added a set of new features for JLog. I’ll discuss each of them in more detail below:

  • Compression with LZ4
  • Single process support on demand
  • Rewindable checkpoints
  • Pre-commit buffering

Compression with LZ4

If you are running on a file system that does not support compression, JLog now supports turning on LZ4 compression to reduce disk storage requirements and also increase write throughput, when used with pre-commit buffering. The API for turning on compression looks like:

typedef enum {
  JLOG_COMPRESSION_NULL = 0,
  JLOG_COMPRESSION_LZ4 = 0x01
} jlog_compression_provider_choice;

int jlog_ctx_set_use_compression(jlog_ctx *ctx, uint8_t use);

int jlog_ctx_set_compression_provider(jlog_ctx *ctx,    
    jlog_compression_provider_choice provider);

Currently, only LZ4 is supported, but other compression formats may be added in the future. Choosing the NULL compression provider option is the same as choosing no compression. It’s important to note that you must turn on compression before calling jlog_ctx_init, and the chosen compression will be stored with the JLog for its lifetime.
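
For example, a writer that opts into LZ4 might be configured roughly like this (a minimal sketch based on the calls above plus the jlog_new/jlog_ctx_init calls used elsewhere in this post; the path is illustrative and error handling is abbreviated):

  jlog_ctx *ctx = jlog_new("/tmp/compressed.jlog");

  /* compression must be configured before jlog_ctx_init(); the choice is
     recorded in the JLog metadata for the lifetime of the log */
  jlog_ctx_set_use_compression(ctx, 1);
  jlog_ctx_set_compression_provider(ctx, JLOG_COMPRESSION_LZ4);

  if (jlog_ctx_init(ctx) != 0) {
    fprintf(stderr, "jlog_ctx_init failed: %d %s\n",
            jlog_ctx_err(ctx), jlog_ctx_err_string(ctx));
    exit(-1);
  }
  jlog_ctx_close(ctx);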

Single process support

This really should be called “switching off multi-process support”, as multi-process is the default behavior. Multi-process mode protects the JLog directory with a file lock via fcntl(). Thread safety is always maintained and cannot be disabled, but you can turn off this locking system call if you know that writes will only ever come from a single process (probably the most common usage for JLog).

Using the following call with mproc == 0 will turn off this file locking, which should result in a throughput increase:

int jlog_ctx_set_multi_process(jlog_ctx *ctx, uint8_t mproc);
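
In practice this is a single extra call before opening the writer; a minimal sketch (the path is illustrative, and jlog_ctx_open_writer is the standard writer-open call):

  jlog_ctx *ctx = jlog_new("/var/log/myapp.jlog");

  /* writes only ever come from this process: skip the fcntl() locking */
  jlog_ctx_set_multi_process(ctx, 0);

  if (jlog_ctx_open_writer(ctx) != 0) {
    fprintf(stderr, "jlog_ctx_open_writer failed: %d %s\n",
            jlog_ctx_err(ctx), jlog_ctx_err_string(ctx));
    exit(-1);
  }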

Rewindable checkpoints

Highly available systems may require the ability to go back to a previously read checkpoint. By default, JLog deletes file segments when all subscribers have read all messages in the segment. If you wanted to go back to a previously read checkpoint for some reason (such as failed processing), you were stuck, with no ability to rewind. Now, with support for rewindable checkpoints, you can set an ephemeral subscriber at a known spot and back up to that special named checkpoint. The API for using rewindable checkpoints is:

int jlog_ctx_add_subscriber(jlog_ctx *ctx, const char *subscriber,
    jlog_position whence);
int jlog_ctx_set_subscriber_checkpoint(jlog_ctx *ctx, 
    const char *subscriber, 
    const jlog_id *checkpoint);

Here’s an example of its usage:

  jlog_ctx *ctx;
  jlog_id checkpoint;

  ctx = jlog_new("/tmp/test.foo");
  if(jlog_ctx_open_reader(ctx, "reader") != 0) {
    fprintf(stderr, "jlog_ctx_open_reader failed: %d %s\n", jlog_ctx_err(ctx), jlog_ctx_err_string(ctx));
    exit(-1);
  }

  /* add our special trailing check point subscriber */
  if (jlog_ctx_add_subscriber(ctx, "checkpoint-name", JLOG_BEGIN) != 0 && errno != EEXIST) {
    fprintf(stderr, "jlog_ctx_add_subscriber failed: %d %s\n", jlog_ctx_err(ctx), jlog_ctx_err_string(ctx));
    exit(-1);
  }

  /* now move the checkpoint subscriber to where the real reader is */
  if (jlog_get_checkpoint(ctx, "reader", &checkpoint) != 0) {
    fprintf(stderr, "jlog_get_checkpoint failed: %d %s\n", jlog_ctx_err(ctx), jlog_ctx_err_string(ctx));
    exit(-1);
  }

  if (jlog_ctx_set_subscriber_checkpoint(ctx, "checkpoint-name", &checkpoint) != 0) {
    fprintf(stderr, "jlog_ctx_set_subscriber_checkpoint failed: %d %s\n", jlog_ctx_err(ctx), jlog_ctx_err_string(ctx));
    exit(-1);
  }

Now we have a checkpoint named “checkpoint-name” at the same location as the main subscriber “reader”. If we want to rewind, we simply do this:

  /* move checkpoint to our original position, first read checkpoint location */
  if (jlog_get_checkpoint(ctx, "checkpoint-name", &checkpoint) != 0) {
    fprintf(stderr, "jlog_get_checkpoint failed: %d %s\n", jlog_ctx_err(ctx), jlog_ctx_err_string(ctx));
    exit(-1);
  }

  /* now move the main read checkpoint there */
  if (jlog_ctx_read_checkpoint(ctx, &checkpoint) != 0) {
    fprintf(stderr, "checkpoint failed: %d %s\n", jlog_ctx_err(ctx), jlog_ctx_err_string(ctx));
  } else {
    fprintf(stderr, "\trewound checkpoint...\n");
  }

To move our checkpoint forward, we merely call jlog_ctx_set_subscriber_checkpoint with the safe checkpoint.

Pre-commit buffering

One of the largest challenges with JLog is throughput. The ability to disable multi-process support helps reduce the syscalls required to write our data. This is good, but we still need to make a writev call for each message. This syscall overhead takes a serious bite out of throughput (more in the benchmarks section below). To get around this issue, we have to find a safe-ish way to reduce the syscall overhead of lots of tiny writes. We can either directly map the underlying block device and write to it directly (a nightmare) or we can batch the writes. Batching writes is way easier, but sacrifices way too much data safety (a crash before a batch commit can lose many rows depending on the size of the batch). At the end of the day, I chose a middle ground approach which is fairly safe for the most common case but also allows very high throughput using batch writes.

int jlog_ctx_set_pre_commit_buffer_size(jlog_ctx *ctx, size_t s);

Setting this to something greater than zero will turn on pre-commit buffering. This is implemented as a writable mmapped memory region where writes are batched up. The pre-commit buffer is flushed to the actual log files when it fills to the requested size. We rely on the OS to flush the mmapped data back to the backing file even if the process crashes. However, if we lose the machine to power loss, this approach is not safe. There is always a tradeoff between safety and throughput. Only use this approach if you are comfortable losing data in the event of power loss or kernel panic.

It is important to note that pre-commit buffering is not multi-process writer safe. If you are using JLog under a scheme where multiple writing processes write to the same JLog, you have to set the pre-commit buffer size to zero (the default). However, it is safe to use with a single-process, multi-threaded writer, and it is also safe to use under multi-process setups where there are multiple reading processes but only a single writing process.

There is a tradeoff between throughput and read side latency if you are using pre-commit buffering. Since reads only ever occur out of the materialized files on disk and do not consider the pre-commit buffer, reads can only advance when the pre-commit buffer is flushed. If you have a large-ish pre-commit buffer size and a slow-ish write rate, your readers could be waiting for a while before they advance. Choose your pre-commit buffer size wisely based on the expected throughput of your JLog. Note that we also provide a flush function, which you could wire up to a timer to ensure the readers are advancing even in the face of slow writes:

int jlog_ctx_flush_pre_commit_buffer(jlog_ctx *ctx);
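
Putting the pieces together, a high-throughput single-writer setup might look roughly like the sketch below. The 128K size, the path, and the flush placement are illustrative only; jlog_ctx_write and jlog_ctx_open_writer are the standard JLog write and open calls, and the buffer size is assumed to be configured before the writer is opened:

  jlog_ctx *ctx = jlog_new("/var/log/myapp.jlog");

  jlog_ctx_set_multi_process(ctx, 0);               /* single writing process */
  jlog_ctx_set_pre_commit_buffer_size(ctx, 131072); /* 128K pre-commit buffer */

  if (jlog_ctx_open_writer(ctx) != 0) {
    fprintf(stderr, "jlog_ctx_open_writer failed: %d %s\n",
            jlog_ctx_err(ctx), jlog_ctx_err_string(ctx));
    exit(-1);
  }

  for (int i = 0; i < 1000000; i++) {
    const char msg[100] = "one hundred byte payload ...";
    jlog_ctx_write(ctx, msg, sizeof(msg));          /* lands in the mmapped buffer */
  }

  /* make everything visible to readers before shutting down (or call this
     from a timer to bound read-side latency under slow write rates) */
  jlog_ctx_flush_pre_commit_buffer(ctx);
  jlog_ctx_close(ctx);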

Benchmarks

All benchmarks are timed by writing one million JLog entries with a message size of 100 bytes. All tests were conducted on OmniOS v11 r151014 using ZFS as the file system with compression enabled.

Test                              Entries/sec   Time to complete
JLog Default                      ~114,000      8.735 sec
LZ4 compression on                ~96,000       10.349 sec
Multi-process OFF                 ~138,000      7.248 sec
MP OFF + LZ4                      ~121,000      8.303 sec
MP OFF + Pre-commit buffer 128K   ~1,080,000    0.925 sec
MP OFF + Pre-commit + LZ4         ~474,000      2.113 sec

As you can see from the table above, turning off multi-process support provides a modest throughput advantage, since all those calls to fcntl are elided, but the truly dramatic gains come from pre-commit buffering. Even a relatively small 128 KB buffer gains us almost 8x in throughput over the next best option.

That LZ4 is running more slowly is not surprising. We are basically trading CPU for space savings. In addition, using a compressing file system will get you these space gains without having to flip on compression in JLog. However, if you are running on a non-compressed file system, it will save you disk space.


We’re always happy to get feedback. Will these JLog features be helpful for your operation?
What other new features would you like to see?

Circonus One Step Install

Introducing Quick and Simple Onboarding with C:OSI


When we started developing Circonus 6 years ago, we found many customers had very specific ideas about how they wanted their onboarding process to work. Since then, we’ve found that many more customers aren’t sure where to start.

The most rudimentary task new and existing users face is just getting metric data flowing from a new host into Circonus. New users want to see their data, graphs, and worksheets right away, and that process should be quick and easy, without any guesswork involved in sorting through all of the options. But those options need to continue to be available for users who require that flexibility, usually because they have a particular configuration in mind.

So, we listened. Now we’ve put those 6 years of gathering expertise to use in this new tool, so that everyone gets the benefit of that knowledge, but with a simple, streamlined process. This is a prescriptive process, so users who just want their data don’t have to be concerned with figuring out the best way to get started.

You can now register systems with Circonus in one simple command or as a simple part of configuration management. With that single command, you get a reasonable and comprehensive set of metrics and visuals. Check out our C:OSI tutorial on our Support Portal to see just how quick and simple it is, or have a quick look at this short demo video:

New and existing Circonus users can use C:OSI to automate the process of bringing systems online. Without inhibiting customization, a single cut-and-paste command does all of this in one step.

In that one step, C:OSI will:

  1. Select an agent.
  2. Install the agent.
  3. Configure the agent to expose metrics.
  4. Start the agent.
  5. Create a check to retrieve/accept the metrics from the agent.
  6. Enable basic system metrics.
  7. Create graphs for each of the basic metric groups.
  8. Create a worksheet containing the basic graphs so there is a unified view of the specific host.

C:OSI does all this via configuration files, pulled off a central site or read from a local configuration file, either of which can easily be modified to suit your needs.

C:OSI also allows for customization, so users who depend on the flexibility of Circonus can still benefit from the simplicity of the streamlined process. If the default configuration prescribed by C:OSI doesn’t meet your specifications, you can modify it, and the onboarding process remains as simple as running a single command.

You can dig into those customization options by visiting the C:OSI documentation in the Circonus Labs public GitHub repository.

Anyone in DevOps, or anyone who has been responsible for monitoring a stack, knows that creating connections or nodes can be a time consuming task. A streamlined, prescriptive onboarding process is faster and more efficient. This provides stronger consistency in the data collected, which in turn allows us to do better, smarter things with that data.

To learn more about Circonus, click here to schedule a demo.

We’d really like to hear what you think.

Tell us about your own onboarding experience, and let us know what you think about C:OSI. Do you like the C:OSI default settings? What else would you like to see?


Circonus Instrumentation Packs

In our Circonus Labs public GitHub repo, we have started a project called Circonus Instrumentation Packs, or CIP. This is a series of libraries that make it even easier to submit telemetry data from your application.

Currently, there are CIP directories for Go, Java, and Node.js. Each language directory has useful resources to help instrument applications written in that language.

Some languages have a strong leaning toward frameworks, while others are about patterns, and still others are about tooling. These packs are intended to “meld in” with the common way of doing things in each language, so that developer comfort is high and integration time and effort are minimal.

Each of these examples utilizes the HTTP Trap check, which you can create within Circonus. Simply create a new JSON push (HTTPTrap) check in Circonus using the HTTPTRAP broker, and then the CheckID, UUID, and secret will be available on the check details page.

HTTPTrap uuid-secret
CHECKID / UUID / Secret Example

This can be done via the user interface or via the API. The “target” for the check does not need to be an actual hostname or IP address; the name of your service might be a good substitute.

We suggest that you use a different trap for different node.js apps, as well as for production, staging, and testing.

Below is a bit more detail on each of the currently available CIPs:

Java

Java has a very popular instrumentation library called “metrics,” originally written by Coda Hale and later adopted by Dropwizard. Metrics has some great ideas that we support wholeheartedly; in particular, the use of histograms for more insightful reporting. Unfortunately, the way these measurements are captured and reported makes calculating service level agreements and other such analytics impossible. Furthermore, the implementations of the underlying histograms (Reservoirs in metrics terminology) are opaque to the reporting tools. The Circonus metrics support in this CIP is designed to layer (non-disruptively) on top of the Dropwizard metrics packages.

Go

This library supports named counters, gauges, and histograms. It also provides convenience wrappers for registering latency instrumented functions with Go’s built-in http server.

Initializing only requires you set the AuthToken (which you generate in your API Tokens page) and CheckId, and then “Start” the metrics reporter.

You’ll need two GitHub repos:

Here is the sample code (also found in the circonus-gometrics readme):

package main

import (
    "fmt"
    "net/http"

    metrics "github.com/circonus-gometrics"
)

func main() {
    // Get your Auth token at https://login.circonus.com/user/tokens
    metrics.WithAuthToken("cee5d8ec-aac7-cf9d-bfc4-990e7ceeb774")
    // Get your CheckId on the check details page
    metrics.WithCheckId(163063)
    metrics.Start()

    http.HandleFunc("/", metrics.TrackHTTPLatency("/", func(w http.ResponseWriter, r *http.Request) {
        fmt.Fprintf(w, "Hello, %s!", r.URL.Path[1:])
    }))
    http.ListenAndServe(":8080", http.DefaultServeMux)
}

After you start the app (go run the_file_name.go), load http://localhost:8080 in your browser, or curl http://localhost:8080. You’ll need to approve access to the API Token (if it is the first time you have used it), and then you can create a graph (make sure you are collecting histogram data), and you’ll see something like this:

go-httptrap-histogram-example

Node.js

This instrumentation pack is designed to allow node.js applications to easily report telemetry data to Circonus using the UUID and Secret (instead of an API Token and CheckID). It has special support for providing sample-free (100% sampling) collection of service latencies for submission, visualization, and alerting to Circonus.

Here is a basic example to measure latency:

First, some setup, making the app:

% mkdir restify-circonus-example
% cd restify-circonus-example
% npm init

(Accepting the defaults npm init offers works fine.) Then:

% npm install --save restify
% npm install --save probability-distributions
% npm install --save circonus-cip

Next, edit index.js and include:

var restify = require('restify'),
    PD = require("probability-distributions"),
    circonus_cip = require('circonus-cip')

var circonus_uuid = '33e894e6-5b94-4569-b91b-14bda9c650b1'
var circonus_secret = 'ssssssssh_its_oh_so_quiet'

var server = restify.createServer()
server.on('after', circonus_cip.restify(circonus_uuid, circonus_secret))

server.get('/', function (req, res, next) {
  setTimeout(function() {
    res.writeHead(200, { 'Content-Type': 'text/plain' });
    //res.write("Hello to a new world of understanding.\n");
    res.end("Hello to a new world of understanding.\n");
    next();
  }, PD.rgamma(1, 3, 2) * 200);
})

server.listen(8888)

Now just start up the app:

node index.js

Then go to your browser and load localhost:8888, or at the prompt, curl http://localhost:8888.

You’ll then go and create the graph in your account. Make sure to enable collection of the metric (“… httptrap: restify `GET` `/` `latency` …”) as a histogram, and you’ll end up with a graph like this:

The Restify Histogram graph

Let us know what you think; more examples and languages will follow. Community participation is encouraged, and feedback of any kind is more than welcome. If you want a demo, or have a specific question, we’re happy to work with you.


Discovering circonusvi

Folks who know Circonus and use it regularly for operations also know that its API is an important part of efficient management for your monitoring facility.

For example, after building out a new datacenter with our SaaS application, we wanted to apply tagging to our checks to make searching more effective. The UI is easy to use, but I needed to tag batches of checks rather than one at a time, which is a job for the API.

I could have written a small program to do the search and populate the tags. That’s when a co-worker suggested I use circonusvi (https://github.com/circonus-labs/circonusvi).

Circonusvi is a neat little tool contributed by Ben Chapman to the Circonus Labs GitHub repo. It’s a natural tool for most folks who work with Unix or Unix-like platforms. Blend that with the JSON input/output of the Circonus API and you have a quick way to make ad hoc changes.

So after installing the Python requirements for circonusvi, I generated a normal API token from the browser, ran circonusvi once, and validated the token in the browser user interface.

My first run of circonusvi without arguments returned everything, allowing me to look things over and understand the JSON structure.

Now for the business.

This returns the JSON output for a list of matching servers, which I can now edit in vi:

./circonusvi.py 'display_name=servers([0-9]).foo.net json:nad'

And this example finds all the empty tags and populates them with something useful:

%s/\"tags\"\:\ \[\]/\"tags\":\ [\"component:http\",\"datacenter:ohio\",\"os:plan9\"]/g

After saving the changes and verifying the results, circonusvi prompts you one last time about updating the server. Then it updates and you’re done!


Graph Hover Lock

A new feature to help make sense of graphs with multiple data points

When visualizing your data, you may often want to compare multiple data points on a single graph. You may even want to compare a metric across a dozen machines, but a graph with more than two or three data points can quickly turn into a visual mess. Circonus helps make these more complex graphs human-readable by allowing users to highlight one data point at a time. This new feature expands on that capability.

When you hover over a graph with multiple datapoints, with your cursor close to one datapoint, that datapoint is highlighted. It is now highlighted more prominently and brought to the front, while the other datapoints fade to the back.

You can also click the graph to lock that state into place. You can tell it’s in a locked hover state by the lock icon in the upper right corner of the graph. Click the graph again to unlock.

For graphs with many datapoints, this will help you zero in on the specific datapoint(s) you want to focus on.


See Figure 1. This graph shows HTTP Connect Times across a dozen combinations of different services and different brokers. A number of the data points are hard to see because of the number of data points in the graph.

Graph_Hover_Lock_1

Hovering over the graph allows us to view the datapoints more easily. Here in Figure 2, we have used this feature to lock the graph, and now we can see one of the smaller datapoints clearly.

Graph_Hover_Lock_2

To enable this behavior across all graphs, a couple click behaviors have changed. First, when on the graphs list page or on a worksheet, you can no longer click a graph to go view that graph; now you have to click a graph’s title bar to go view it. Second, on the metrics page in grid mode, you can no longer click a metric graph to select that metric for graph creation; instead, you have to click the metric graph’s title bar.

This tool should make it even easier to visualize your data.

Please click here for a Circonus demo: http://bit.ly/1TZOrGJ

To learn more about Circonus please contact us at sales@circonus.com or by phone at 877.385.6194 X244

About Circonus

Circonus is a microservices monitoring and analytics platform built for on-premises or SaaS deployment. Its fully automatable, API-centric platform is more scalable and reliable than the systems it monitors. Developed for the requirements of DevOps, Circonus delivers percentile-based alerts, graphs, dashboards, and machine-learning intelligence that enable business optimization. If you’re not using Circonus, your results are average.


Advanced Search Builder

Last year, changes on the backend allowed Circonus to make significant improvements to our search capability. Now, we’ve added an Advanced Search tool to allow users to easily build complex search queries, making Search in Circonus more powerful and flexible than ever before.

When you click on the search icon, you will see an “Advanced” button to the right of the search field after it is expanded. Clicking this button will expand the Advanced Search Builder and allow you to construct advanced search queries.

Advanced_Search_Builder

More information about our Search functionality and the logic it uses is available in our user documentation.


The New Grid View – Instant Graphs on the Metrics Page

We just added a new feature to the UI which displays a graph for every metric in your account.

While the previous view (now called List View) did show a graph for each metric, these graphs were hidden by default. The new Grid View shows a full page of graphs, one for each metric. You can easily switch between Grid and List views as needed.

These screenshots show, from left to right, the old list view, the new layout options menu, and the new grid view.

The grid-style layout provides you with an easy way to view a graph for each metric in the list. It lets you click-to-select as many metrics as you want and easily create a graph out of them.

You can also:

  • Choose from 3 layouts with different graph sizes.
  • Define how the titles are displayed.
  • Hover over a graph to see the metric value.
  • Play any number of graphs to get real-time data.

We hope this feature is as useful to you as it has been to us. More information is available in our user documentation and below is a short video showing off some of these features:


Show Me the Data

Avoid spike erosion with Percentile – and Histogram – Aggregation

It has become common wisdom that the lossy process of averaging measurements leads to all kinds of problems when measuring the performance of services (see Schlossnagle2015, Ugurlu2013, Schwarz2015, Gregg2014). Yet most people are not aware that averages crop up in far more places than old-fashioned SLA formulations and the storage backends for monitoring data. In fact, it is likely that most graphs you are viewing involve some averaging behind the scenes, which introduces severe side effects. In this post, we will describe a phenomenon called spike erosion and highlight some alternative views that allow you to get a more accurate picture of your data.

Meet Spike Erosion

Spike Erosion of Request Rates

Take a look at Figure 1. It shows a graph of request rates over the last month. The spike near December 23 marks the apparent maximum, at around 7 requests per second (rps).

request-rates.png
Figure 1: Web request rate in requests per second over one month time window

What if I told you that the actual maximal request rate was almost double that value, at 13.67 rps (marked with the horizontal guide)? And moreover, it was not reached on December 23, but on December 15 at 16:44, near the left boundary of the graph?

Looks way off, right?

But it’s actually true! Figure 2 shows the same graph zoomed in at said time window.

request-rates_zoomed.png
Figure 2: Web request rates (in rps) over a 4h period

We call this phenomenon spike erosion; the farther you zoom out, the lower the spikes, and it’s actually very common in all kinds of graphs across all monitoring products.

Let’s see another example.

Spike Erosion of Ping Latencies

Take a look at Figure 3. It shows a graph of ping latencies (from twitter.com) over the course of 4 weeks. Again, it looks like the latency is rather stable around 0.015ms, with occasional spikes above 0.02ms and a clear maximum around December 23, with a value of ca. 0.03ms.

latencies_max.png
Figure 3: Ping latency of twitter.com in ms over the last month

 

Again, we have marked the actual maximum with a horizontal guide line. It is more than double the apparent maximum, and it is attained at every one of the visible spikes. That’s right: all spikes do in fact have the same maximal height. Figure 4 shows a closeup of the one on December 30, in the center.

latencies_zoomed.png
Figure 4: Ping latency of twitter.com in ms on December 30

 

What’s going on?

The mathematical explanation of spike erosion is actually pretty simple. It is an artifact of an averaging process that happens behind the scenes, in order to produce sensible plots with high performance.

Note that within a four-week period we have a total of 40,320 samples collected that we need to represent in a plot over that time window. Figure 5 shows how a plot of all those samples looks in GnuPlot. There are quite a few issues with this raw presentation.

raw_data.png
Figure 5: Plot of the raw data of request rates over a month

First, there is a ton of visual noise in that image. In fact, you cannot even see the individual 40,000 samples for the simple reason that the image is only 1240 pixels wide.

Also, rendering such an image within a browser puts a lot of load on the CPU. The biggest issue with producing such an image is the latency involved with retrieving 40K float values from the db and transmitting them as JSON over the internet.

In order to address the above issues, all mainstream graphing tools pre-aggregate the data before sending it to the browser. The size of the graph determines the number of values that should be displayed, e.g. 500. The raw data is then distributed across 500 bins, and for each bin the average is taken and displayed in the plot.

This process leads to plots like Figure 1, which (a) can be produced much faster, since less data has to be transferred and rendered (in fact, you can cache the pre-aggregated values to speed up retrieval from the db), and (b) are less visually cluttered. However, it also leads to (c) spike erosion!

When looking at a four-week time window, the raw count of 40,320 samples is reduced to a mere 448 plotted values, where each plotted value corresponds to an average over a 90-minute period. If there is a single spike in one of the bins, it gets averaged with roughly 90 other samples of lower value, which leads to the erosion of the spike height.
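
To make the mechanics concrete, here is a minimal sketch in C (with made-up sample data and a hypothetical bin_stats() helper, not part of any Circonus library) of how averaging a bin erodes a spike, while keeping the per-bin minimum and maximum, as discussed in the next section, preserves it:

  #include <stdio.h>

  #define SAMPLES_PER_BIN 90

  /* aggregate one bin of raw samples: the plotted average hides a single
     spike, while the per-bin min/max retain the full value range */
  static void bin_stats(const double *samples, int n,
                        double *avg, double *min, double *max) {
    double sum = 0.0;
    *min = *max = samples[0];
    for (int i = 0; i < n; i++) {
      sum += samples[i];
      if (samples[i] < *min) *min = samples[i];
      if (samples[i] > *max) *max = samples[i];
    }
    *avg = sum / n;
  }

  int main(void) {
    double bin[SAMPLES_PER_BIN];
    /* a steady ~7 rps signal with one 13.67 rps spike somewhere in the bin */
    for (int i = 0; i < SAMPLES_PER_BIN; i++) bin[i] = 7.0;
    bin[42] = 13.67;

    double avg, min, max;
    bin_stats(bin, SAMPLES_PER_BIN, &avg, &min, &max);
    printf("avg=%.2f min=%.2f max=%.2f\n", avg, min, max);
    /* prints avg=7.07 min=7.00 max=13.67: the spike erodes in the average
       but survives in the max */
    return 0;
  }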

What to do about it?

There are (at least) two ways to allow you to avoid spike erosion and get more insight into your data. Both change the way the data is aggregated.

Min-Max Aggregation

The first way is to show the minimum and the maximum values of each bin along with the mean value. By doing so, you get a sense of the full range of the data, including the highest spikes. Figures 6 and 7 show how Min-Max Aggregation looks in Circonus for the request rate and latency examples.

request-rates_w_min_max.png
Figure 6: Request rate graph with Min-Max Aggregation Overlay

 

latencies_w_min_max.png
Figure 7: Latencies with Min/Max-Aggregation Overlay

 

In both cases, the points where the maximum values are assumed are clearly visible in the graph. When zooming into the spikes, the Max aggregation values stay aligned with the global maximum.

Keeping in mind that minimum and maximum are special cases of percentiles (namely the 0%-percentile and 100%-percentile), it appears natural to extend the aggregation methods to allow general quantiles as well. This is what we implemented in Circonus with the Percentile Aggregation overlay.

Histogram Aggregation

There is another, structurally different approach to mitigate spike erosion. It begins with the observation that histograms have a natural aggregation logic: Just add the bucket counts. More concretely, a histogram metric that stores data for each minute can be aggregated to larger time windows (e.g. 90 minutes) without applying any summary statistic, like a mean value, simply by adding the counts for each histogram bin across the aggregation time window.
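
As a minimal illustration of that aggregation logic (assuming a simplified fixed-bucket histogram type, not the actual log-linear histograms Circonus uses), rolling up histograms is just element-wise addition of bucket counts:

  #include <stdio.h>

  #define NUM_BUCKETS 64

  /* a simplified fixed-bucket histogram covering one minute of samples */
  typedef struct {
    unsigned long counts[NUM_BUCKETS];
  } histogram_t;

  /* roll up n per-minute histograms into one coarser histogram by summing
     bucket counts; no mean or other summary statistic is involved */
  static void histogram_merge(histogram_t *dst, const histogram_t *src, int n) {
    for (int i = 0; i < n; i++)
      for (int b = 0; b < NUM_BUCKETS; b++)
        dst->counts[b] += src[i].counts[b];
  }

  int main(void) {
    histogram_t minute[90] = {{{0}}}; /* 90 one-minute histograms */
    histogram_t rollup = {{0}};       /* the aggregated 90-minute histogram */

    minute[0].counts[3] = 5;          /* toy data */
    minute[89].counts[3] = 2;

    histogram_merge(&rollup, minute, 90);
    printf("bucket 3 over 90 minutes: %lu\n", rollup.counts[3]); /* 7 */
    return 0;
  }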

If we combine this observation with the simple fact that a time-series metric can be considered a histogram with a single value in it, we arrive at the powerful Histogram Aggregation, which rolls up time series into histogram metrics with lower time resolution. Figures 8 and 9 show Histogram Aggregation Overlays for the Request Rate and Latency examples discussed above.

request-rates_w_histogram.png
Figure 8: Request Rates with Histogram Aggregation Overlay

 

latencies_w_histogram.png
Figure 9: Latencies with Histogram Aggregation Overlay

 

In addition to showing the value range (which in the above figures is amplified by using the Min-Max Aggregation Overlay), we also gain a sense of the distribution of values across the bin. In the request rate example, the data varies widely across a corridor of 2.5-10 rps. In the latency example, the distribution is concentrated near the global median of 0.015ms, with single-value outliers.

Going Further

We have seen that displaying data as histograms gives a more concise picture of what is going on. Circonus allows you to go one step further and collect your data as histograms in the first place. This allows you to capture the latencies of all requests made to your API, instead of only probing your API once per minute. See [G.Schlossnagle2015] for an in-depth discussion of the pros and cons of this “passive monitoring” approach. Note that you can still compute averages and percentiles for viewing and alerting.

histogram_metric.png
Figure 10: API Latency Histogram Metric with Average Overlay

 

Figure 10 shows a histogram metric of API latencies, together with the mean value computed as an overlay. While this figure looks quite similar to Figures 8 and 9, the logical dependency is reversed. The mean values are computed from the histogram, not the other way around. Also, note that the time window of this figure only spans a few hours, instead of four weeks. This shows how much richer the captured histogram data is.


The Future of Monitoring: Q&A with Jez Humble


Jez Humble is a lecturer at U.C. Berkeley and co-author of the Jolt Award-winning Continuous Delivery: Reliable Software Releases through Build, Test and Deployment Automation (Addison-Wesley 2011) and Lean Enterprise: How High Performance Organizations Innovate at Scale (O’Reilly 2015), in Eric Ries’ Lean series. He has worked as a software developer, product manager, consultant, and trainer across a wide variety of domains and technologies. His focus is on helping organisations deliver valuable, high-quality software frequently and reliably through implementing effective engineering practices.

Theo’s Intro:

It is my perspective that the world of technology will eventually be a continual place. As services become more and more componentized, they stand to become more independently developed and operated. The implications for engineering design when attempting to maintain acceptable resiliency levels are significant. The convergence on a continual world is simply a natural progression and will not be stopped.
Jez has taken to deep thought and practice around these challenges quite a bit ahead of the adoption curve, and he has a unique perspective on where we are going, why we are going there, and (likely vivid) images of the catastrophic derailments that might occur along the tracks. While I spend all my time thinking about how people might have peace of mind that their systems and businesses are measurably functioning during and after transitions into this new world, my interest is compounded by Circonus’ internal use of continual integration and deployment practices for both our SaaS and on-premise customers.

THEO: Most of the slides, talks and propaganda around CI/CD (Continuous Integration/Continuous Delivery) are framed in the context of businesses launching software services that are consumed by customers as opposed to software products consumed by customers. Do you find that people need a different frame of mind, a different perspective or just more discipline when they are faced with shipping product vs. shipping services as it relates to continual practices?

JEZ: The great thing about continuous delivery is that the same principles apply whether you’re doing web services, product development, embedded or mobile. You need to make sure you’re working in small batches, and that your software is always releasable, otherwise you won’t get the benefits. I started my career at a web startup but then spent several years working on packaged software, and the discipline is the same. Some of the problems are different: for example, when I was working on go.cd, we built a stage into our deployment pipeline to do automated upgrade testing from every previous release to what was on trunk. But fundamentally, it’s the same foundations: comprehensive configuration management, good test automation, and the practice of working in small batches on trunk and keeping it releasable. In fact, one of my favourite case studies for CI/CD is HP’s LaserJet Firmware division — yet nobody is deploying new firmware multiple times a day. You do make a good point about discipline: when you’re not actually having to deploy to production on a regular basis it can be easy to let things slide. Perhaps you don’t pay too much attention to the automated functional tests breaking, or you decide that one long-lived branch to do some deep surgery on a fragile subsystem is OK. Continuous deployment (deploying to production frequently) tends to concentrate the mind. But the discipline is equally important however frequently you release.

THEO: Do you find that organizations “going lean” struggle more, take longer or navigate more risk when they are primarily shipping software products vs. services?

JEZ: Each model has its own trade-offs. Products (including mobile apps) usually require a large matrix of client devices to test in order to make sure your product will work correctly. You also have to worry about upgrade testing. Services, on the other hand, require development to work with IT operations to get the deployment process to a low-risk pushbutton state, and make sure the service is easy to operate. Both of these problems are hard to solve — I don’t think anybody gets an easy ride. Many companies who started off shipping product are now moving to a SaaS model in any case, so they’re having to negotiate both models, which is an interesting problem to face. In both cases, getting fast, comprehensive test automation in place and being able to run as much as possible on every check-in, and then fixing things when they break, is the beginning of wisdom.

THEO: Thinking continuously is only a small part of establishing a “lean enterprise.” Do you find engineers more easily reason about adopting CI/CD than other changes such as organizational retooling and process refinements? What’s the most common sticking point (or point of flat-out derailment) for organizations attempting to go lean?

JEZ: My biggest frustration is how conservative most technology organizations are when it comes to changing the way people behave. There are plenty of engineers who are happy to play with new languages or technologies, but god forbid you try and mess with their worldview on process. The biggest sticking point – whether it’s engineers, middle management or leadership – is getting them to change their behavior and ways of thinking.

But the best people – and organizations – are never satisfied with how they’re doing and are always looking for ways to improve.

The worst ones either just accept the status quo, or are always blowing things up (continuous re-orgs are a great example), lurching from one crisis to another. Sometimes you get both. Effective leaders and managers understand that it’s essential to have a measurable customer or organizational outcome to work towards, and that their job is to help the people working for them experiment in a disciplined, scientific way with process improvement work to move towards the goal. That requires that you actually have time and resources to invest in this work, and that you have people with the capacity for and interest in making things better.

THEO: Finance is precise and process oriented and often times bad things happen (people working from different/incorrect base assumptions) when there are too many cooks in the kitchen. This is why finance is usually tightly controlled by the CFO and models and representations are fastidiously enforced. Monitoring and analytics around that data shares a lot in common with respect to models and meanings. However, many engineering groups have far less discipline and control than do financial groups. Where do you see things going here?

JEZ: Monitoring isn’t really my area, but my guess is that there are similar factors at play here to other parts of the DevOps world, which is the lack of both an economic model and the discipline to apply it. Don Reinertsen has a few quotes that I rather like: “you may ignore economics, but economics won’t ignore you.” He also says of product development “The measure of execution in product development is our ability to constantly align our plans to whatever is, at the moment, the best economic choice.” Making good decisions is fundamentally about risk management: what are the risks we face? What choices are available to us to mitigate those risks? What are the impacts? What should we be prepared to pay to mitigate those impacts? What information is required to assess the probability of those risks occurring? How much should we be prepared to pay for that information? For CFOs working within business models that are well understood, there are templates and models that encapsulate this information in a way that makes effective risk management somewhat algorithmic, provided of course you stay within the bounds of the model. I don’t know whether we’re yet at that stage with respect to monitoring, but I certainly don’t feel like we’re yet at that stage with the rest of DevOps. Thus a lot of what we do is heuristic in nature — and that requires constant adaptation and improvement, which takes even more discipline, effort, and attention. That, in a department which is constantly overloaded by firefighting. I guess that’s a very long way of saying that I don’t have a very clear picture of where things are going, but I think it’ll be a while before we’re in a place that has a bunch of proven models with well understood trade-offs.

THEO: In your experience how do organizations today habitually screw up monitoring? What are they simply thinking about “the wrong way?”

JEZ: I haven’t worked in IT operations professionally for over a decade, but based on what I hear and observe, I feel like a lot of people still treat monitoring as little more than setting up a bunch of alerts. This leads to a lot of the issues we see everywhere with alert fatigue and people working very reactively. Tom Limoncelli has a nice blog post where he recommends deleting all your alerts and then, when there’s an outage, working out what information would have predicted it, and just collecting that information. Of course he’s being provocative, but we have a similar situation with tests — people are terrified about deleting them because they feel like they’re literally deleting quality (or in the case of alerts, stability) from their system. But it’s far better to have a small number of alerts that actually have information value than a large number that are telling you very little, but drown the useful data in noise.

THEO: Andrew Shaffer said that “technology is 90% tribalism and fashion.” I’m not sure about the percentage, but he nailed the heart of the problem. You and I both know that process, practice and methods sunset faster in technology than in most other fields. I’ll ask the impossible question… after enterprises go lean, what’s next?

JEZ: I actually believe that there’s no end state to “going lean.” In my opinion, lean is fundamentally about taking a disciplined, scientific approach to product development and process improvement — and you’re never done with that. The environment is always changing, and it’s a question of how fast you can adapt, and how long you can stay in the game. Lean is the science of growing adaptive, resilient organizations, and the best of those are always getting better. Andrew is (as is often the case) correct, and what I find really astonishing is that as an industry we have a terrible grasp of our own history. As George Santayana has it, we seem condemned to repeat our mistakes endlessly, albeit every time with some shiny new technology stack. I feel like there’s a long way to go before any software company truly embodies lean principles — especially the ability to balance moving fast at high quality while maintaining a humane working environment. The main obstacle is the appalling ineptitude of a large proportion of IT management and leadership — so many of these people are either senior engineers who are victims of the Peter Principle or MBAs with no real understanding of how technology works. Many technologists even believe effective management is an oxymoron. While I am lucky enough to know several great leaders and managers, they have not in general become who they are as a result of any serious effort in our industry to cultivate such people. We’re many years away from addressing these problems at scale.

ACM – Testing a Distributed System

I want to sing the praises of one of our lead engineers, Phil Maddox, for authoring a very interesting paper, Testing a Distributed System, which was published in Communications of the ACM, Vol. 58 No. 9.

A brief excerpt follows:


“Distributed systems can be especially difficult to program for a variety of reasons. They can be difficult to design, difficult to manage, and, above all, difficult to test. Testing a normal system can be trying even under the best of circumstances, and no matter how diligent the tester is, bugs can still get through. Now take all of the standard issues and multiply them by multiple processes written in multiple languages running on multiple boxes that could potentially all be on different operating systems, and there is potential for a real disaster.

Individual component testing, usually done via automated test suites, certainly helps by verifying that each component is working correctly. Component testing, however, usually does not fully test all of the bits of a distributed system. Testers need to be able to verify that data at one end of a distributed system makes its way to all of the other parts of the system and, perhaps more importantly, is visible to the various components of the distributed system in a manner that meets the consistency requirements of the system as a whole.”