Circonus Instrumentation Packs

In our Circonus Labs public github repo, we have started a project called Circonus Instrumentation Packs, or CIP. This is a series of libraries to make it even easier to submit telemetry data from your application.

Currently there are CIP directories for Go, Java, and Node.js. Each language directory has useful resources to help instrument applications written in that language.

Some languages have a strong leaning toward frameworks, while others are about patterns, and still others are about tooling. These packs are intended to “meld in” with the common way of doing things in each language, so that developer comfort is high and integration time and effort are minimal.

Each of these examples utilizes the HTTP Trap check, which you can create within Circonus. Simply create a new JSON push (HTTPTrap) check in Circonus using the HTTPTRAP broker; the CheckID, UUID, and secret will then be available on the check details page.

CHECKID / UUID / Secret Example

This can be done via the user interface or via the API. The “target” for the check does not need to be an actual hostname or IP address; the name of your service might be a good substitute.

We suggest that you use a different trap for different node.js apps, as well as for production, staging, and testing.
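
As a quick illustration of what an HTTPTrap check accepts, here is a minimal sketch in Go that pushes a JSON payload to a trap. The submission URL below is illustrative; use the exact URL, UUID, and secret shown on your own check details page.

package main

import (
    "bytes"
    "fmt"
    "net/http"
)

func main() {
    // Illustrative URL only; copy the real submission URL from your check details page.
    url := "https://trap.noit.circonus.net/module/httptrap/YOUR-CHECK-UUID/YOUR-SECRET"

    // HTTPTrap accepts arbitrary JSON; numeric values are recorded as metrics.
    payload := []byte(`{"requests": 42, "latency_ms": 17.3}`)

    resp, err := http.Post(url, "application/json", bytes.NewBuffer(payload))
    if err != nil {
        fmt.Println("submit failed:", err)
        return
    }
    defer resp.Body.Close()
    fmt.Println("status:", resp.Status)
}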

Below is a bit more detail on each of the currently available CIPs:

Java

Java has a very popular instrumentation library called “metrics,” originally written by Coda Hale and later adopted by Dropwizard. Metrics has some great ideas that we support whole-heartedly; in particular, the use of histograms for more insightful reporting. Unfortunately, the way these measurements are captured and reported makes calculating service level agreements and other such analytics impossible. Furthermore, the implementations of the underlying histograms (Reservoirs in metrics-terminology) are opaque to the reporting tools. The Circonus metrics support in this CIP is designed to layer (non-disruptively) on top of the Dropwizard metrics packages.

Go

This library supports named counters, gauges, and histograms. It also provides convenience wrappers for registering latency instrumented functions with Go’s built-in http server.

Initializing only requires that you set the AuthToken (which you generate on your API Tokens page) and the CheckId, and then “Start” the metrics reporter.

You’ll need two github repos:

Here is the sample code (also found in the circonus-gometrics readme):

package main

import (
    "fmt"
    "net/http"

    metrics "github.com/circonus-gometrics"
)

func main() {
    // Get your Auth token at https://login.circonus.com/user/tokens
    metrics.WithAuthToken("cee5d8ec-aac7-cf9d-bfc4-990e7ceeb774")
    // Get your CheckId on the check details page
    metrics.WithCheckId(163063)
    metrics.Start()

    http.HandleFunc("/", metrics.TrackHTTPLatency("/", func(w http.ResponseWriter, r *http.Request) {
        fmt.Fprintf(w, "Hello, %s!", r.URL.Path[1:])
    }))
    http.ListenAndServe(":8080", http.DefaultServeMux)
}
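
The named counters, gauges, and histograms mentioned above can also be updated directly from anywhere in your code. The helper names below are assumptions based on the circonus-gometrics readme rather than a verified API listing, so treat this as a sketch and check the package documentation for the exact signatures:

// Sketch only: helper names are assumed from the circonus-gometrics readme.
// Call these anywhere after metrics.Start() in the program above.
func recordCustomMetrics() {
    metrics.Increment("requests`total")      // named counter: add one
    metrics.Gauge("queue`depth", 42)         // named gauge: record the latest value
    metrics.Timing("db`query`latency", 12.7) // latency sample recorded into a histogram
}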

After you start the app (go run the_file_name.go), load http://localhost:8080 in your browser, or curl http://localhost:8080. You’ll need to approve access to the API Token (if it is the first time you have used it), and then you can create a graph (make sure you are collecting histogram data) and you’ll see something like this:

go-httptrap-histogram-example

Node.js

This instrumentation pack is designed to allow node.js applications to easily report telemetry data to Circonus using the UUID and Secret (instead of an API Token and CheckID). It has special support for providing sample-free (100% sampling) collection of service latencies for submission, visualization, and alerting to Circonus.

Here is a basic example to measure latency:

First, some setup – making the app:

% mkdir restify-circonus-example
% cd restify-circonus-example
% npm init

(Accepting the defaults that npm init suggests works fine.) Then:

% npm install --save restify
% npm install --save probability-distributions
% npm install --save circonus-cip

Next, edit index.js and include:

var restify = require('restify'),
    PD = require("probability-distributions"),
    circonus_cip = require('circonus-cip')

var circonus_uuid = '33e894e6-5b94-4569-b91b-14bda9c650b1'
var circonus_secret = 'ssssssssh_its_oh_so_quiet'

var server = restify.createServer()
server.on('after', circonus_cip.restify(circonus_uuid, circonus_secret))

server.get('/', function (req, res, next) {
  setTimeout(function() {
    res.writeHead(200, { 'Content-Type': 'text/plain' });
    //res.write("Hello to a new world of understanding.\n");
    res.end("Hello to a new world of understanding.\n");
    next();
  }, PD.rgamma(1, 3, 2) * 200);
})

server.listen(8888)

Now just start up the app:

node index.js

Then go to your browser and load localhost:8888, or at the prompt, curl http://localhost:8888.

You’ll then go and create the graph in your account. Make sure to enable collection of the metric – “… httptrap: restify `GET `/ `latency…” as a histogram, and you’ll end up with a graph like this:

The Restify Histogram graph

Discovering circonusvi

Folks who know Circonus and use it regularly for operations also know that its API is an important part of efficient management for your monitoring facility.

For example, after building out a new datacenter with our SaaS application, we wanted to apply tagging to our checks to make searching more effective. The UI is easy to use, but I needed to tag batches of checks rather than one at a time, which is a job for the API.

I could have written a small program to do the search and populate the tags; that’s when a co-worker suggested I use circonusvi (https://github.com/circonus-labs/circonusvi).

Circonusvi is a neat little tool contributed by Ben Chapman to the Circonus Labs github repo. It’s a natural tool for most folks who work with unix or unix-like platforms. Blend that with the JSON input/output of the Circonus API and you have a quick way to make ad hoc changes.

So after installing the python requirements for circonusvi, I generated a normal API token from the browser, ran circonusvi once, and validated the token in the browser user interface.

My first run of circonusvi without arguments returned everything, allowing me to look over things and understand the JSON structure.

Now for the business.

This returns the JSON output for a list of matching servers, which I can now edit in vi:

./circonusvi.py 'display_name=servers([0-9]).foo.net json:nad'

And this example finds all the empty tags and populates them with something useful:

%s/\"tags\"\:\ \[\]/\"tags\":\ [\"component:http\",\"datacenter:ohio\",\"os:plan9\"]/g

After saving the changes and verifying the results, circonusvi prompts you one last time about updating the server. Then it updates and you’re done!

Graph Hover Lock

A new feature to help make sense of graphs with multiple data points

When visualizing your data, you may often want to compare multiple data points on a single graph. You may even want to compare a metric across a dozen machines, but a graph with more than two or three data points can quickly turn into a visual mess. Circonus helps make these more complex graphs human-readable by allowing users to highlight one data point at a time. This new feature expands on that capability.

When you hover over a graph with multiple datapoints, the datapoint closest to your cursor is highlighted. Now it’s highlighted more prominently and brought to the front, while the other datapoints fade to the back.

You can also click the graph to lock that state into place. You can tell it’s in a locked hover state by the lock icon in the upper right corner of the graph. Click the graph again to unlock.

For graphs with many datapoints, this will help you zero in on the specific datapoint(s) you want to focus on.

See Figure 1. This graph shows HTTP Connect Times across a dozen combinations of different services and different brokers. A number of the data points are hard to see because of the number of data points in the graph.

Graph Hover Lock

Hovering over the graph allows us to view the datapoints more easily. Here in Figure 2, we have used this feature to lock the graph and now we can see one of the smaller datapoints clearly.

Locked graph

To enable this behavior across all graphs, a couple click behaviors have changed. First, when on the graphs list page or on a worksheet, you can no longer click a graph to go view that graph; now you have to click a graph’s title bar to go view it. Second, on the metrics page in grid mode, you can no longer click a metric graph to select that metric for graph creation; instead, you have to click the metric graph’s title bar.

This tool should make it even easier to visualize your data.

Advanced Search Builder

Last year, changes on the backend allowed Circonus to make significant improvements to our search capability. Now, we’ve added an Advanced Search tool to allow users to easily build complex search queries, making Search in Circonus more powerful and flexible than ever before.

When you click on the search icon, you will see an “Advanced” button to the right of the search field after it is expanded. Clicking this button will expand the Advanced Search Builder and allow you to construct advanced search queries.


More information about our Search functionality and the logic it uses is available in our user documentation.

The New Grid View – Instant Graphs on the Metrics Page

We just added a new feature to the UI which displays a graph for every metric in your account.

While the previous view (now called List View) did show a graph for each metric, these graphs were hidden by default. The new Grid View now shows a full page of graphs, one for each metric. You can easily switch between Grid and List views as needed.

These screenshots below illustrate the old list view, the new layout options menu, and the new grid view.

Figure 1 – Old list view
Figure 2 – New “Layout Options” menu
Figure 3 – New Grid View

The grid-style layout provides you with an easy way to view a graph for each metric in the list. It lets you click-to-select as many metrics as you want and easily create a graph out of them.

You can also:

  • Choose from 3 layouts with different graph sizes.
  • Define how the titles are displayed.
  • Hover over a graph to see the metric value.
  • Play any number of graphs to get real-time data.

We hope this feature is as useful to you as it has been to us. More information is available in our user documentation, along with a short video showing off some of these features.

Show Me the Data

Avoid spike erosion with Percentile and Histogram Aggregation

It has become common wisdom that the lossy process of averaging measurements leads to all kinds of problems when measuring the performance of services (see Schlossnagle 2015, Ugurlu 2013, Schwartz 2015, Gregg 2014). Yet most people are not aware that averages are even more pervasive than their use in old-fashioned SLA formulations and storage backends for monitoring data. In fact, it is likely that most graphs you are viewing involve some averaging behind the scenes, which introduces severe side effects. In this post, we will describe a phenomenon called spike erosion and highlight some alternative views that allow you to get a more accurate picture of your data.

Meet Spike Erosion

Spike Erosion of Request Rates

Take a look at Figure 1. It shows a graph of request rates over the last month. The spike near December 23 marks the apparent maximum, at around 7 requests per second (rps).

Figure 1: Web request rate in requests per second over one month time window

What if I told you that the actual maximal request rate was almost double that value, at 13.67 rps (marked with the horizontal guide)? Moreover, it was not reached on December 23, but on December 15 at 16:44, near the left boundary of the graph?

Looks way off, right?

But it’s actually true! Figure 2 shows the same graph zoomed in at said time window.

Figure 2: Web request rates (in rps) over a 4h period

We call this phenomenon spike erosion: the farther you zoom out, the lower the spikes. It’s actually very common in all kinds of graphs across all monitoring products.

Let’s see another example.

Spike Erosion of Ping Latencies

Take a look at Figure 3. It shows a graph of ping latencies (from twitter.com) over the course of 4 weeks. Again, it looks like the latency is rather stable around 0.015ms, with occasional spikes above 0.02ms and a clear maximum around December 23 with a value of about 0.03ms.

Figure 3: Ping latency of twitter.com in ms over the last month

Again, we have marked the actual maximum with a horizontal guide line. It is more than double the apparent maximum, and it is attained at each of the visible spikes. That’s right: all of the spikes do in fact have the same maximal height. Figure 4 shows a closeup of the one on December 30, in the center.

Figure 4: Ping latency of twitter.com in ms on December 30

What’s going on?

The mathematical explanation of spike erosion is actually pretty simple. It is an artifact of an averaging process that happens behind the scenes, in order to produce sensible plots with high performance.

Note that within a 4 week period we have a total of 40,320 samples collected (one per minute) that we need to represent in a plot over that time window. Figure 5 shows how a plot of all those samples looks in gnuplot. There are quite a few issues with this raw presentation.

Figure 5: Plot of the raw data of request rates over a month

First, there is a ton of visual noise in that image. In fact, you cannot even see the individual 40,000 samples for the simple reason that the image is only 1240 pixels wide.

Also, rendering such an image within a browser puts a lot of load on the CPU. The biggest issue with producing such an image is the latency involved with retrieving 40K float values from the db and transmitting them as JSON over the internet.

In order to address the above issues, all mainstream graphing tools pre-aggregate the data before sending it to the browser. The size of the graph determines the number of values that should be displayed, e.g., 500. The raw data is then distributed across 500 bins, and for each bin the average is taken and displayed in the plot.

This process leads to plots like Figure 1, which (a) can be produced much faster, since less data has to be transferred and rendered (in fact, you can cache the pre-aggregated values to speed up retrieval from the db), and (b) are less visually cluttered. However, it also leads to (c) spike erosion!

When looking at a four week time window, the raw number of 40,320 samples is reduced to a mere 448 plotted values, where each plotted value corresponds to an average over a 90 minute period. If there is a single spike in one of the bins, it gets averaged with roughly 90 other samples of lower value, which leads to the erosion of the spike height.
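
To make the mechanism concrete, here is a minimal sketch (in Go, with made-up sample data) of the binning step described above; it shows how a lone spike nearly vanishes when each bin is reduced to its average:

package main

import "fmt"

func main() {
    // Fake one-minute samples: a flat series of 7.0 rps with a single spike.
    samples := make([]float64, 360)
    for i := range samples {
        samples[i] = 7.0
    }
    samples[200] = 13.67 // the spike

    // Pre-aggregate into 90-minute bins by averaging, as a graphing backend
    // would before rendering a long time window.
    const binSize = 90
    for start := 0; start < len(samples); start += binSize {
        sum, max := 0.0, samples[start]
        for _, v := range samples[start : start+binSize] {
            sum += v
            if v > max {
                max = v
            }
        }
        fmt.Printf("bin %d: avg=%.2f max=%.2f\n", start/binSize, sum/binSize, max)
    }
    // The bin containing the spike averages to about 7.07 even though its
    // true maximum is 13.67: the spike has been eroded by averaging.
}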

What to do about it?

There are (at least) two ways to allow you to avoid spike erosion and get more insight into your data. Both change the way the data is aggregated.

Min-Max Aggregation

The first way is to show the minimum and the maximum values of each bin along with the mean value. By doing so, you get a sense of the full range of the data, including the highest spikes. Figures 6 and 7 show how Min-Max Aggregation looks in Circonus for the request rate and latency examples.

Figure 6: Request rate graph with Min-Max Aggregation Overlay
Figure 7: Latencies with Min/Max-Aggregation Overlay

In both cases, the points where the maximum values are attained are clearly visible in the graph. When zooming into the spikes, the Max aggregation values stay aligned with the global maximum.

Keeping in mind that minimum and maximum are special cases of percentiles (namely the 0%-percentile and 100%-percentile), it appears natural to extend the aggregation methods to allow general quantiles as well. This is what we implemented in Circonus with the Percentile Aggregation overlay.
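
To illustrate that relationship, here is a small sketch of a per-bin quantile function (a generic helper, not Circonus code) where q=0 yields the bin minimum and q=1 the bin maximum:

package main

import (
    "fmt"
    "math"
    "sort"
)

// quantile returns the q-quantile (0 <= q <= 1) of a bin of samples using
// linear interpolation between ranks; q=0 is the min, q=1 is the max.
func quantile(bin []float64, q float64) float64 {
    s := append([]float64(nil), bin...)
    sort.Float64s(s)
    rank := q * float64(len(s)-1)
    lo := int(math.Floor(rank))
    hi := int(math.Ceil(rank))
    frac := rank - float64(lo)
    return s[lo]*(1-frac) + s[hi]*frac
}

func main() {
    bin := []float64{7.1, 6.9, 7.0, 13.67, 7.2}
    fmt.Println("min (q=0.0):", quantile(bin, 0))
    fmt.Println("median (q=0.5):", quantile(bin, 0.5))
    fmt.Println("max (q=1.0):", quantile(bin, 1))
}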

Histogram Aggregation

There is another, structurally different approach to mitigate spike erosion. It begins with the observation that histograms have a natural aggregation logic: Just add the bucket counts. More concretely, a histogram metric that stores data for each minute can be aggregated to larger time windows (e.g. 90 minutes) without applying any summary statistic, like a mean value, simply by adding the counts for each histogram bin across the aggregation time window.
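
A minimal sketch of that aggregation logic, assuming a simplified histogram represented as a map from bucket value to count (Circonus’ actual histograms use log-linear buckets):

package main

import "fmt"

// mergeHistograms aggregates per-minute histograms into one larger time
// window by simply summing the counts in each bucket; no mean, percentile,
// or other summary statistic is applied, so no detail is lost.
func mergeHistograms(minutes []map[float64]uint64) map[float64]uint64 {
    merged := make(map[float64]uint64)
    for _, h := range minutes {
        for bucket, count := range h {
            merged[bucket] += count
        }
    }
    return merged
}

func main() {
    minute1 := map[float64]uint64{0.015: 58, 0.02: 2}
    minute2 := map[float64]uint64{0.015: 57, 0.03: 3}
    fmt.Println(mergeHistograms([]map[float64]uint64{minute1, minute2}))
}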

If we combine this observation with the simple fact that time-series metrics can be considered histograms with a single value per sample period, we arrive at the powerful Histogram Aggregation that rolls up time series into histogram metrics with lower time resolution. Figures 8 and 9 show Histogram Aggregation Overlays for the Request Rate and Latency examples discussed above.

Figure 8: Request Rates with Histogram Aggregation Overlay
Figure 9: Latencies with Histogram Aggregation Overlay

In addition to showing the value range (which in the above figure is amplified by using the Min-Max Aggregation Overlay), we also gain a sense of the distribution of values across each bin. In the request rate example, the data varies widely across a corridor of 2.5-10 rps. In the latency example, the distribution is concentrated near the global median of 0.015ms, with single-value outliers.

Going Further

We have seen that displaying data as histograms gives a more concise picture of what is going on. Circonus allows you to go one step further and collect your data as histograms in the first place. This allows you to capture the latencies of all requests made to your API, instead of only probing your API once per minute. See [G.Schlossnagle2015] for an in-depth discussion of the pros and cons of this “passive monitoring” approach. Note that you can still compute averages and percentiles for viewing and alerting.

Figure 10: API Latency Histogram Metric with Average Overlay

Figure 10 shows a histogram metric of API latencies, together with the mean value computed as an overlay. While this figure looks quite similar to Figures 8 and 9, the logical dependency is reversed. The mean values are computed from the histogram, not the other way around. Also, note that the time window of this figure only spans a few hours, instead of four weeks. This shows how much richer the captured histogram data is.
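
As a sketch of that reversed dependency, using the same simplified bucket-count representation as above, an average can be derived from the stored histogram after the fact:

package main

import "fmt"

// histogramMean computes an approximate mean from bucket counts, treating
// each bucket's nominal value as representative of its samples. The average
// is derived from the stored histogram, not stored alongside it.
func histogramMean(h map[float64]uint64) float64 {
    var sum float64
    var n uint64
    for bucket, count := range h {
        sum += bucket * float64(count)
        n += count
    }
    if n == 0 {
        return 0
    }
    return sum / float64(n)
}

func main() {
    latencies := map[float64]uint64{0.015: 115, 0.02: 2, 0.03: 3}
    fmt.Printf("approximate mean: %.4f ms\n", histogramMean(latencies))
}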

Resources

  1. Theo Schlossnagle – Problem with Math
  2. Dogan Ugurlu (Optimizely) – The Most Misleading Measure of Response Time: Average
  3. Baron Schwartz – Why percentiles don’t work the way you think
  4. Brendan Gregg – Frequency Tails: What the mean really means
  5. George Schlossnagle – API Performance Monitoring

The Future of Monitoring: Q&A with Jez Humble


jez_humble

Jez Humble is a lecturer at U.C. Berkeley, and co-author of the Jolt Award-winning Continuous Delivery: Reliable Software Releases through Build, Test and Deployment Automation (Addison-Wesley 2011) and Lean Enterprise: How High Performance Organizations Innovate at Scale (O’Reilly 2015), in Eric Ries’ Lean series. He has worked as a software developer, product manager, consultant and trainer across a wide variety of domains and technologies. His focus is on helping organisations deliver valuable, high-quality software frequently and reliably through implementing effective engineering practices.

Theo’s Intro:

It is my perspective that the world of technology will eventually be a continual place.  As services become more and more componentized, they stand to become more independently developed and operated.  The implications for engineering design when attempting to maintain acceptable resiliency levels are significant.  The convergence on a continual world is simply a natural progression and will not be stopped.
Jez has taken to deep thought and practice around these challenges quite a bit ahead of the adoption curve, and has a unique perspective on where we are going, why we are going there, and (likely vivid) images of the catastrophic derailments that might occur along the tracks.  While I spend all my time thinking about how people might have peace of mind that their systems and businesses are measurably functioning during and after transitions into this new world, my interest is compounded by Circonus’ internal use of continuous integration and deployment practices for both our SaaS and on-premise customers.

THEO: Most of the slides, talks and propaganda around CI/CD (Continuous Integration/Continuous Delivery) are framed in the context of businesses launching software services that are consumed by customers as opposed to software products consumed by customers. Do you find that people need a different frame of mind, a different perspective or just more discipline when they are faced with shipping product vs. shipping services as it relates to continual practices?

JEZ: The great thing about continuous delivery is that the same principles apply whether you’re doing web services, product development, embedded or mobile. You need to make sure you’re working in small batches, and that your software is always releasable, otherwise you won’t get the benefits. I started my career at a web startup but then spent several years working on packaged software, and the discipline is the same. Some of the problems are different: for example, when I was working on go.cd, we built a stage into our deployment pipeline to do automated upgrade testing from every previous release to what was on trunk. But fundamentally, it’s the same foundations: comprehensive configuration management, good test automation, and the practice of working in small batches on trunk and keeping it releasable. In fact, one of my favourite case studies for CI/CD is HP’s LaserJet Firmware division — yet nobody is deploying new firmware multiple times a day. You do make a good point about discipline: when you’re not actually having to deploy to production on a regular basis it can be easy to let things slide. Perhaps you don’t pay too much attention to the automated functional tests breaking, or you decide that one long-lived branch to do some deep surgery on a fragile subsystem is OK. Continuous deployment (deploying to production frequently) tends to concentrate the mind. But the discipline is equally important however frequently you release.

THEO: Do you find that organizations “going lean” struggle more, take longer or navigate more risk when they are primarily shipping software products vs. services?

JEZ: Each model has its own trade-offs. Products (including mobile apps) usually require a large matrix of client devices to test in order to make sure your product will work correctly. You also have to worry about upgrade testing. Services, on the other hand, require development to work with IT operations to get the deployment process to a low-risk pushbutton state, and make sure the service is easy to operate. Both of these problems are hard to solve — I don’t think anybody gets an easy ride. Many companies who started off shipping product are now moving to a SaaS model in any case, so they’re having to negotiate both models, which is an interesting problem to face. In both cases, getting fast, comprehensive test automation in place and being able to run as much as possible on every check-in, and then fixing things when they break, is the beginning of wisdom.

THEO: Thinking continuously is only a small part of establishing a “lean enterprise.” Do you find engineers more easily reason about adopting CI/CD than other changes such as organizational retooling and process refinements? What’s the most common sticking point (or point of flat-out derailment) for organizations attempting to go lean?

JEZ: My biggest frustration is how conservative most technology organizations are when it comes to changing the way people behave. There are plenty of engineers who are happy to play with new languages or technologies, but god forbid you try and mess with their worldview on process. The biggest sticking point – whether it’s engineers, middle management or leadership – is getting them to change their behavior and ways of thinking.

But the best people – and organizations – are never satisfied with how they’re doing and are always looking for ways to improve.

The worst ones either just accept the status quo, or are always blowing things up (continuous re-orgs are a great example), lurching from one crisis to another. Sometimes you get both. Effective leaders and managers understand that it’s essential to have a measurable customer or organizational outcome to work towards, and that their job is to help the people working for them experiment in a disciplined, scientific way with process improvement work to move towards the goal. That requires that you actually have time and resources to invest in this work, and that you have people with the capacity for and interest in making things better.

THEO: Finance is precise and process oriented, and oftentimes bad things happen (people working from different or incorrect base assumptions) when there are too many cooks in the kitchen. This is why finance is usually tightly controlled by the CFO, and models and representations are fastidiously enforced. Monitoring and analytics around that data share a lot in common with respect to models and meanings. However, many engineering groups have far less discipline and control than financial groups do. Where do you see things going here?

JEZ: Monitoring isn’t really my area, but my guess is that there are similar factors at play here to other parts of the DevOps world, which is the lack of both an economic model and the discipline to apply it. Don Reinertsen has a few quotes that I rather like: “you may ignore economics, but economics won’t ignore you.” He also says of product development “The measure of execution in product development is our ability to constantly align our plans to whatever is, at the moment, the best economic choice.” Making good decisions is fundamentally about risk management: what are the risks we face? What choices are available to us to mitigate those risks? What are the impacts? What should we be prepared to pay to mitigate those impacts? What information is required to assess the probability of those risks occurring? How much should we be prepared to pay for that information? For CFOs working within business models that are well understood, there are templates and models that encapsulate this information in a way that makes effective risk management somewhat algorithmic, provided of course you stay within the bounds of the model. I don’t know whether we’re yet at that stage with respect to monitoring, but I certainly don’t feel like we’re yet at that stage with the rest of DevOps. Thus a lot of what we do is heuristic in nature — and that requires constant adaptation and improvement, which takes even more discipline, effort, and attention. That, in a department which is constantly overloaded by firefighting. I guess that’s a very long way of saying that I don’t have a very clear picture of where things are going, but I think it’ll be a while before we’re in a place that has a bunch of proven models with well understood trade-offs.

THEO: In your experience how do organizations today habitually screw up monitoring? What are they simply thinking about “the wrong way?”

JEZ: I haven’t worked in IT operations professionally for over a decade, but based on what I hear and observe, I feel like a lot of people still treat monitoring as little more than setting up a bunch of alerts. This leads to a lot of the issues we see everywhere with alert fatigue and people working very reactively. Tom Limoncelli has a nice blog post where he recommends deleting all your alerts and then, when there’s an outage, working out what information would have predicted it, and just collecting that information. Of course he’s being provocative, but we have a similar situation with tests — people are terrified about deleting them because they feel like they’re literally deleting quality (or in the case of alerts, stability) from their system. But it’s far better to have a small number of alerts that actually have information value than a large number that are telling you very little, but drown the useful data in noise.

THEO: Andrew Shafer said that “technology is 90% tribalism and fashion.” I’m not sure about the percentage, but he nailed the heart of the problem. You and I both know that process, practice, and methods sunset faster in technology than in most other fields. I’ll ask the impossible question… after enterprises go lean, what’s next?

JEZ: I actually believe that there’s no end state to “going lean.” In my opinion, lean is fundamentally about taking a disciplined, scientific approach to product development and process improvement — and you’re never done with that. The environment is always changing, and it’s a question of how fast you can adapt, and how long you can stay in the game. Lean is the science of growing adaptive, resilient organizations, and the best of those are always getting better. Andrew is (as is often the case) correct, and what I find really astonishing is that as an industry we have a terrible grasp of our own history. As George Santayana has it, we seem condemned to repeat our mistakes endlessly, albeit every time with some shiny new technology stack. I feel like there’s a long way to go before any software company truly embodies lean principles — especially the ability to balance moving fast at high quality while maintaining a humane working environment. The main obstacle is the appalling ineptitude of a large proportion of IT management and leadership — so many of these people are either senior engineers who are victims of the Peter Principle or MBAs with no real understanding of how technology works. Many technologists even believe effective management is an oxymoron. While I am lucky enough to know several great leaders and managers, they have not in general become who they are as a result of any serious effort in our industry to cultivate such people. We’re many years away from addressing these problems at scale.

ACM – Testing a Distributed System

I want to sing the praises of one of our lead engineers, Phil Maddox, for authoring a very interesting paper, Testing a Distributed System, which was published in Communications of the ACM, Vol. 58 No. 9.

A brief excerpt follows:

“Distributed systems can be especially difficult to program for a variety of reasons. They can be difficult to design, difficult to manage, and, above all, difficult to test. Testing a normal system can be trying even under the best of circumstances, and no matter how diligent the tester is, bugs can still get through. Now take all of the standard issues and multiply them by multiple processes written in multiple languages running on multiple boxes that could potentially all be on different operating systems, and there is potential for a real disaster.

Individual component testing, usually done via automated test suites, certainly helps by verifying that each component is working correctly. Component testing, however, usually does not fully test all of the bits of a distributed system. Testers need to be able to verify that data at one end of a distributed system makes its way to all of the other parts of the system and, perhaps more importantly, is visible to the various components of the distributed system in a manner that meets the consistency requirements of the system as a whole.”

Read the entire paper here: Testing a Distributed System

The Future of Monitoring: Q&A with John Allspaw


john allspaw

John Allspaw is CTO at Etsy. John has worked in systems operations for over 14 years in biotech, government, and online media. He started out tuning parallel clusters running vehicle crash simulations for the U.S. government, and then moved on to the Internet in 1997. He built the backing infrastructures at Salon, InfoWorld, Friendster, and Flickr. He is a well-known industry pundit, speaker, blogger, and the author of Web Operations and The Art of Capacity Planning. Visit John’s blog


Theo: As you know I live numbers. The future of monitoring is leaning strongly toward complex analytics on epic amounts of telemetry data. How do you think this will affect how operations and engineering teams work?

John: Two things come to mind. The first is that we could look at it in the same way the field is looking at “Big Data.” While we now have technologies to help us get answers to questions we have, it turns out that finding the right question is just as important. And you’re right: it’s surprisingly easy to collect a massive amount of telemetry data at a rate that outpaces our abilities to analyze it. I think the real challenge is one of designing systems that can make it easy to navigate this data without getting too much in our way.

I’m fond of Herb Simon’s saying “Information is not a scarce resource. Attention is.” I think that part of this challenge includes using novel ways of analyzing data algorithmically. I think another part, just as critical, is to design software and interfaces that can act as true advisors or partners. More often than not, I’m not going to know what I want to look at until I look around in these billions of time-series datasets. If we make it easy and effortless for a team to “look around” – maybe this is a navigation challenge – I’ll bet on that team being better at operations.

Theo: Given your (long) work in operations, you’ve seen good systems and bad systems, good teams and bad teams, good approaches and bad approaches. If you could describe a commonality of all the bads in one word, what would it be, and why?

John: Well, anyone who knows me knows that summarizing (forget about in one word!) is pretty difficult for me. 🙂 If I had to, I would say that what we want to avoid is being brittle. Brittle process, brittle architecture design, brittle incident response, etc. Being brittle in this case means that we can always be prepared for anything, as long as we can imagine it beforehand. The companies we grow and systems we build have too much going on to be perfectly predictable. Brittle is what you get when you bet all your chips on procedures, prescriptive processes, and software that takes on too much of its own control.

Resilience is what you get when you invest in preparing to be unprepared.

Theo: What was it about “Resilience Engineering” that sucked you in?

John: One of the things that really drew me into the field was the idea that we can have a different perspective on how we look at how successful work actually happens in our field. Traditionally, we judge ourselves on the absence of failures, and we assume almost tacitly that we can design a system (software, company org chart, financial model, etc.) that will work all the time, perfectly. All you have to do is: don’t touch it.

Resilience Engineering concepts assert something different: that success comes from people adapting to what they see happening, anticipating what limits and breakdowns the system is headed towards, and making adjustments to forestall them. In other words, success is the result of the presence of adaptive capacity, not the absence of failures.

This idea isn’t just plain-old “fault tolerance” – it’s something more. David Woods (a researcher in the field) calls this something “graceful extensibility” – the idea that it’s not just degradation after failure, but adapting when you get close to the boundaries of failure. Successful teams do this all the time, but no attention is paid to it, because there’s no drama in a non-outage.

That’s what I find fascinating: instead of starting with an outage and explaining what a team lacked or didn’t do, we could look at all the things that make for an outage-less day. Many of the expertise ingredients of outage-less days are unspoken; they come from “muscle memory” and rules of thumb that engineers have developed tacitly over the years. I want to discover all of that.

Theo: How do you think the field of Resilience Engineering can improve the data science that happens around telemetry analysis in complex systems?

John: Great question! I think a really fertile area is to explore the qualitative aspects of how people make sense of telemetry data, at different levels (aggregate, component, etc.) and find ways that use quantitative methods to provide stronger signals than the user could do on their own. An example of this might be to explore expert critiquing systems, where a monitoring system doesn’t try to be “intelligent” but instead provides options/directions for diagnosis for the user to take, essentially providing decision support. This isn’t an approach I see taken yet, in earnest. Maybe Circonus can take a shot at it? 🙂

Theo: As two emerging fields of research are integrated into practice, there are bound to be problems. Can you make a wild prediction as to what some of these problems might be?

John: Agreed. I think it might be awkward like a junior high school dance. We have human factors, qualitative analysts, and UX/UI folks on one side of the gymnasium, and statisticians, computer scientists, and mathematicians on the other. One of the more obvious potential quagmires is the insistence that each approach will be superior, resulting in a mangled mess of tools or worse: no progress at all. In a cartoon-like stereotype of the fields, I can imagine one camp designing with the belief that all bets must be placed on algorithms, no humans needed. And in the other camp, an over-focus on interfaces that ignores or downplays potential computational processing advantages.

If we do it well, both camps won’t play their solos at the same time, and will take the nuanced approach. Maybe data science and resilience engineering can view themselves as a great rhythm section of a jazz band.

Hallway Track: The Future of Monitoring

I’ve been in this “Internet industry” since around 1997. That doesn’t make me the first on the stage, but I’ve had a very wide set of experiences: from deep within the commercial software world to the front lines of open source and from the smallest startup sites to helping fifteen of the world’s most highly trafficked web sites. My focus for a long time was scalability, but that slowly morphed into general hacking and speaking. As a part of my rigorous speaking schedule, I’ve been to myriad conferences all around the globe; sometimes attending, sometimes chairing, but almost always speaking. I’ve often been asked: “Why do you travel so much? Why do you go to so many conferences?” The answer is simple: the people.

Some go to conferences for session material, perhaps most attendees even. In multi-track conferences, people tend to stick to one track or another. I’d argue that all conferences are inherently multi-tracked: you have whatever tracks are on the program, and then you have the hallway track. The hallway track is where I go to learn, to feel small, and to be truly inspired and excited about the challenges we’re collectively facing and the pain they’re causing.

The hallway track is like a market research group, a support group, a cheerleading sideline and a therapy session all in one. I like it so much, I founded the Surge conference at OmniTI to bring together the right people thinking about the right things with an ulterior and selfish motive to concoct the perfect hallway track. Success!

Now for the next experiment: can we emulate a hallway track conversation from the observer’s perspective? Would an online Q&A between me and a variety of industry luminaries be interesting? I hope so, and we’re going to find out.