Percentages Aren’t People

This is a story about an engineering group celebrating a success it shouldn't have… and its organization buying into it. This is not the fault of the engineering group, or the operations team, or any one person. This is the fault of yesterday's tools not providing the right data. The right insights. The ability to dig into the data to get meaningful information to push your business forward.

Herein, we'll dive into a day in the life of an online service where a team wakes up and triages an outage after unwittingly celebrating a larger outage as a success just twelve hours before. All names have been removed to protect the exceptionally well-intentioned and competent parties. You see, the problem is that the industry has been misleading us with misapplied math and bad statistics for years.

I'll set the stage with a simple fact of this business… when it takes longer than one and a half seconds to use their service, users leave. Armed with this fact, let's begin our journey. This data lives in Circonus, but it isn't measuring Circonus itself; and since stories are best told in the first person with friends along for the ride, I shall drop into the first-person plural for the rest of the journey: let's go.

We track the user’s experience logging into the application. We do this not by synthetically logging in and measuring (we do this too, but only for functional testing), but by measuring each user’s experience and recording it. When drawn as a heatmap, the data looks like the graph below. The red line indicates a number that, through research, we’ve found to be a line of despair and loss. Delivering an experience of 1.5 seconds or slower causes our users to leave.

Percentages_Are_Not_People_1

Heatmaps can be a bit confusing to reason about, so this is the last we’ll see of it here. The important part to remember is that we are storing a complete model of the distribution of user experiences over time and we’ll get to why that is important in just a bit. From this data, we can calculate and visualize all the things we’re used to.

Percentages_Are_Not_People_2

The above is a graph of that same lonely day in June, showing milliseconds of latency; specifically, the line represents the average user experience. If I ask you to spot the problem on the graph, you can do so just as easily as a four-year-old; it's glaring. However, you'll note that our graph indicates we're well under our 1.5s line of despair and loss. We're all okay, right? Wrong.

A long time ago, the industry realized that averages (and standard deviations) are very poor representations of sample sets because our populations are not normally distributed. Instead of using an average (specifically an arithmetic mean), we all decided that measuring on some large quantile would be better. We were right. So, an organization would pick a percentage: 99.9% or 99% and articulate, “I have to be at least ‘this good’ for at least ‘this percentage’ of my users.” If this percentage seems arbitrary, it is… but, like the 1.5 second line of despair and loss, it can be derived from lots of business data and user behavior studies.

This, ladies and gentlemen, is why we don’t use averages. Saying that averages are misleading is a bit less accurate than admitting that many people are misled by averages. They simply don’t represent the things that are important to us here: how are we treating our users? This question is critical because it is our users who fund us and our real question is, “How many users are having a dissatisfying experience?”
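To make the contrast concrete, here is a minimal sketch, with invented latencies and plain Python rather than anything from the Circonus pipeline, of how the same sample set answers the two questions very differently:

```python
# A minimal sketch (invented numbers, not Circonus code) of how an average
# can look healthy while a high quantile exposes a miserable tail.
import random
import statistics

random.seed(7)

# 980 decent experiences and 20 terrible ones, latencies in seconds.
latencies = [random.uniform(0.2, 0.9) for _ in range(980)] + \
            [random.uniform(3.0, 8.0) for _ in range(20)]

mean = statistics.mean(latencies)
p99 = sorted(latencies)[int(0.99 * (len(latencies) - 1))]

print(f"average: {mean:.2f}s")  # comfortably under 1.5s -- "we're all okay, right?"
print(f"q(0.99): {p99:.2f}s")   # the tail that the average quietly hides
```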

Percentages_Are_Not_People_3

The above graph is radically different from the first; it might surprise you to know that it is showing the same underlying dataset. Instead of the average experience, it shows the 99th percentile experience over time. It is much clearer that we had something catastrophically bad happen at 5am. It also shows that aside from two small infractions (7:52pm and 11:00pm), the rest of the day delivered the objective of a "less than 1.5s 99th percentile experience." Okay, let's stop. That's such a disgustingly opaque and clinical way to talk about what we're representing. These are actual people attempting to use this service.

What we’re saying here is that for each of the points on the purple line in this graph, during the time window that it represents (at this zoom level, each point represents 4 minutes), that 99% of visitors had an experience better than the value, and 1% had an experience worse than the value. Here we should see our first problem: percentages aren’t people.

Reflecting on the day as a whole, we see a catastrophic problem at 5am, to which our mighty engineering organization responded and remediated diligently over the course of approximately fifty minutes. Go Team! The rest of the day was pretty good, and we have those two little blips to diagnose and fix going forward.

I'm glad we're not using averages for monitoring! We'd most likely not have been alerted to that problem at 5am! Here is where most monitoring stories end, because a few quantiles are all that is stored and the raw data behind them isn't available for further analysis. Let's return to our earlier question: "How many users are having a dissatisfying experience?" Luckily for us, we know how many users were on the site, so we can just multiply 1% by the number of current visitors to understand "how many" users are having an experience worse than the graph… But that isn't the question, is it? The question is how many users are having a worse experience than 1.5s, not worse than the 99th percentile.

Percentages_Are_Not_People_4

This graph adds a black line that shows the number of current users on the site each minute (numbered on the right axis). To illustrate how we're really missing the point, let's just take a random point from our 99th percentile graph (again, each point represents 4 minutes at this zoom level). We randomly pick 9:32pm. The graph tells us that the 99th percentile experience at that point is 1.266s. This is better than our goal of 1.5s. Looking at the black line, we see that we have about 86 users each minute on the site at that point, or 344 users over the four-minute period. 1% of that is between 3 and 4 users. Okay, we're getting somewhere! So we know that between 3 and 4 users had an experience slower than 1.266s. Wait, that wasn't our question. Who cares about 1.266s, when we want to know about 1.5s? We're not getting anywhere at all.

Our objective is 1.5 seconds. We’re looking at this all upside down and backwards. We should not be asking how bad the experience is for the worst 1%, instead we should be asking what percentage has a bad experience (any experience worse than our objective of 1.5 seconds). We shouldn’t be asking about quantiles; we should be asking about inverse quantiles. Since we’re storing the whole distribution of experiences in Circonus, we can simply ask, “What percentage of the population is faster than 1.5s?” If we take one minus this inverse quantile at 1.5 seconds, we get exactly the answer to our question: What percentage of users had a “bad experience?”
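For the curious, here is a tiny sketch of the two questions side by side, assuming a plain list of per-request latencies rather than the histograms Circonus actually stores:

```python
# Hypothetical per-request latencies (seconds) for one time window.
latencies = [0.42, 0.61, 0.77, 0.80, 0.95, 1.10, 1.35, 1.70, 2.40, 5.10]

THRESHOLD = 1.5  # the line of despair and loss

# The quantile question: "how slow was the worst 1%?" (the answer floats with traffic)
p99 = sorted(latencies)[int(0.99 * (len(latencies) - 1))]

# The inverse quantile question: "what fraction of users were slower than 1.5s?"
bad_fraction = sum(1 for x in latencies if x > THRESHOLD) / len(latencies)

print(f"q(0.99) = {p99:.2f}s")
print(f"worse than {THRESHOLD}s: {bad_fraction:.0%} of users")
```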

Percentages_Are_Not_People_5

Now we’re getting somewhere. It is clear that we had a bad time at 5am and we did pretty well with just some line noise during our successful prior evening, right? Let’s return to our first problem: percentages aren’t people.

Percentages_Are_Not_People_6

Luckily, just as we did before, we can simply look at how many people are visiting the site (the green line above) and multiply that by the percentage of people having a bad time and we get the number of actual people. Now we’re talking about something everyone understands. How many people had a bad experience? Let’s multiply!

Percentages_Are_Not_People_7

In this image, we have simply multiplied the two data streams from before, and we see the human casualties of our system. This is the number of users per minute that we screwed out of a good experience. These are users that, in all likelihood, are taking their business elsewhere. As anyone who thinks about it for more than a few seconds realizes, a small percentage of a large number can easily be bigger than a large percentage of a small number. Reporting inverse quantiles (let alone abstractly reasoning about quantiles) without knowing the size of the population is misleading, to put it mildly.

Another way to look at this graph is to integrate; that is, to calculate the area under the curve. Integrating a graph of users per minute over time results in a graph of users: the cumulative number of users that have had a bad experience.
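A toy sketch of these last two steps, with invented per-minute numbers: multiply the visitor count by the bad-experience fraction, then keep a running sum (the discrete version of integrating the curve):

```python
# Invented per-minute data: concurrent visitors and the fraction having a
# bad (>1.5s) experience during that minute.
visitors_per_minute = [80, 86, 90, 85, 400, 420, 95, 88]
bad_fraction        = [0.03, 0.02, 0.01, 0.01, 0.01, 0.01, 0.03, 0.01]

harmed_per_minute = [v * f for v, f in zip(visitors_per_minute, bad_fraction)]

cumulative, total = [], 0.0
for h in harmed_per_minute:
    total += h
    cumulative.append(round(total, 1))

print([round(h, 1) for h in harmed_per_minute])
print(cumulative)
# Note that 1% of a busy minute (400 visitors -> 4 people) harms more people
# than 3% of a quiet one (80 visitors -> 2.4 people).
```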

Percentages_Are_Not_People_8

This should be flat-out eye-opening. The eight hours from 2am to 10am (including the event of despair and loss) affected 121 people. The eight hours preceding it affected almost as many: 113.

It can be pretty depressing to think you’ve celebrated a successful day of delivery only to learn that it really wasn’t that successful at all. But, this isn’t so much about celebrating successes that were actually failures; it’s about understanding what, when, and where you can improve. Every user matters; and if you treat them that way, you stand to get a lot more of them.

Percentages_Are_Not_People_9

When you look back at your own graphs, just remember that the majority of the day's casualties happened in these two bands. You should be using inverse quantiles for SLA reporting; if you don't have those, use quantiles… and if you only have averages, you're blind as a bat.


Understanding API Latencies

Today’s Internet is powered by APIs. Tomorrow’s will be even more so. Without a pretty UI or a captivating experience, you’re judged simply on performance and availability. As an API provider, it is more critical than ever to understand how your system is performing.

With the emergence of micro services, we have an API layered cake, and often that cake looks like one from a Dr. Seuss story. That complex systems fail in complex ways is a deep and painful truth that developers are now facing in even the most ordinary of applications. So, as we build these decoupled, often asynchronous systems that compose a single user transaction from tens of underlying networked subtransactions, we're left with a puzzle: how is performance changing as usage volume increases and, often more importantly, how is it changing as we rapidly deploy micro updates to our micro services?

Developers have long known that they must be aware of their code performance and, at least in my experience, developers tend to be fairly good about minding their performance P’s and Q’s. However, in complex systems, the deployment environment and other production environmental conditions have tremendous influence on the actual performance delivered. The cry, “but it worked in dev” has moved from the functionality to the performance realm of software. I tell you now that I can sympathize.

It has always been a challenge to take a bug in functionality observed in production and build a repeatable test case in development to diagnose, address, and test for future regression. This challenge has been met by the best developers out there. The emergent conditions in complex, decoupled production systems, however, are nigh impossible to replicate in a development environment. This leaves developers fantastically frustrated and requires a different tack: production instrumentation.

As I see it, there are two approaches to production instrumentation that are critically important (there would be one approach if storage and retrieval were free and observation had no effect — alas we live in the real world and must compromise). You can either sacrifice coverage for depth or sacrifice depth for coverage. What am I talking about?

I'd love to be able to pick apart a single request coming into my service in excruciating detail. Watch it arrive, calculate the cycles spent on CPU, the time spent off CPU, which instruction and stack took me off CPU, the activity that requested information from another microservice, the perceived latency between systems, all of the same things on the remote micro service, the disk accesses and latency on delivery for my query against Cassandra, and the details of the read-repair it induced. This list might seem long, but I could go on for pages. The amount of low-level work that is performed to serve even the simplest of requests is staggering… and every single step is subject to bugs, poor interactions, performance regressions, and other generally bad behavior. The Google Dapper paper and the OpenZipkin project take a stab at delivering this type of visibility, and now companies like Lightstep are attempting to deliver on it commercially. I'm excited! This type of tooling is one of two critical approaches to production system visibility.

Understanding_API_Latencies_1

The idea of storing this information on every single request that arrives is absurd today, but even when it is no longer absurd tomorrow, broad and insightful reporting on it will remain a challenge. Hence the need for the second approach.

You guessed it, Circonus falls squarely into the second approach: coverage over depth. You may choose not to agree with my terminology, but hopefully the point will come across. In this approach, instead of looking at individual transactions into the system (acknowledging that we cannot feasibly record and report all of them), we look at the individual components of the system and measure everything. That API we're serving? Measure the latency of every single request on every exposed endpoint. The micro service you talked to? Measure the latency there. The network protocol over which you communicated? Measure the size of every single packet sent in each direction. That Cassandra cluster? Measure the client-facing latency, but also measure the I/O latency of every single disk operation on each spindle (or EBS volume, or ephemeral SSD) on each node. It sounds like a lot of data, sure. But we live in the future, and analytics systems are capable of handling a billion measurements per second these days, all the while remaining economical.
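As a sketch of what "measure everything" can look like without keeping every raw value, here is a toy histogram recorder that rounds each latency to two significant digits and counts occurrences. This is an illustration only, not the actual Circonus histogram format:

```python
# Toy latency histogram: round each measurement to two significant digits
# and count how many land in each bucket. Cheap to store, yet it preserves
# the shape of the distribution well enough for quantile-style questions.
from collections import Counter

def bucket(latency_seconds: float) -> float:
    """Round a latency to two significant digits (the bucket's nominal value)."""
    return float(f"{latency_seconds:.1e}")

histogram = Counter()

def record(latency_seconds: float) -> None:
    histogram[bucket(latency_seconds)] += 1

# Record a handful of hypothetical request latencies (seconds).
for sample in (0.0042, 0.0044, 0.012, 0.013, 0.0131, 0.250, 1.7):
    record(sample)

for b in sorted(histogram):
    print(f"{b:>8.4f}s : {histogram[b]}")
```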

Understanding_API_Latencies_2

The above graph shows the full distribution of every IO operation on one of our core database nodes. The histogram in the breakout box shows three distinct modes (two tightly coupled in the left peak and one smaller mode further out in the latency spectrum). We can also see a radical divergence in behavior immediately following Feb 14th at 9am. As we're looking at one week of data, each vertical time slice is 1h30m. The slice highlighted by the vertical grey hairline is displayed in the upper-left breakout box; it represents nearly 12 million data points alone. The full graph represents about 1.2 billion measurements, and fetching that from the Circonus time series database took 48ms. When you start using the right tools, your eyes will open.


Pully McPushface

The Argument for Connectivity Agnosticism

turning the corner

It’s about push vs. pull… but it shouldn’t be.

There has been a lot of heated debate on whether pushing telemetry data from systems or pulling that data from systems is better. If you're just hearing about this argument now, bless you. One would think that this debate is as ridiculous as vim vs. emacs or tabs vs. spaces, but it turns out there is a bit of meat on this bone. The problem is that the proposition itself is wrong. I hope that here I can reframe the discussion, help us turn the corner, and walk a path where people get back to more productive things.

At Circonus, we’ve always been of the mindset that both push and pull should have their moments to shine. We accept both, but honestly, we are duped into this push vs. pull dialogue all too often. As I’ll explain, the choices we are shown aren’t the only options.

The idea behind pushing metrics is that the "system" in question (be it a machine or a service) should emit telemetry data to an "upstream" entity. The idea of pull is that some "upstream" entity should actively query systems for telemetry data. I am careful not to use the word "centralized" because in most large-scale modern monitoring systems, all of these bits (push or pull) are decentralized rather considerably. Let's look through both sides of the argument (I've done the courtesy of marking the claims that are patently false as struck through):

Push has some arguments:

  1. Pull doesn't scale well [struck through]
  2. I don’t know where my data will be coming from.
  3. Push works behind complex network setups.
  4. When events transpire, I should push; pulling doesn’t match my cadence.
  5. Push is more secure. [struck through]

Pull has some arguments:

  1. I know better when a machine or service goes bad because I control the polling interval.
  2. Controlling the polling interval allows me to investigate issues faster and more effectively.
  3. Pull is more secure. [struck through]

To address the strikethroughs in verse: Pulling data from 2 million machines isn’t a difficult job. Do you have more than 2 million machines? Pull scales fine… Google does it. When pulling data from a secure place to the cloud or pushing data from a secure place to the cloud, you are moving some data across the same boundary and are thus exposed to the same security risks involved. It is worth mentioning that in a setup where data is pulled, the target machine need not be able to even route to the Internet at all, thus making the attack surface more slippery. I personally find that argument to be weak and believe that if the right security policies are put in place, both methods can be considered equally “securable.” It’s also worth mentioning that many of those making claims about security concerns have wide open policies about pushing information beyond the boundaries of their digital enclave and should spend some serious time reflecting on that.

Layer3_and_7

Now to address the remaining issues.

Push: I don’t know where my data will be coming from.

Yes, it's true that you don't always know where your data is coming from. A perfect example is web clients. They show up to load a page or call an API, and then could potentially disappear for good. You don't own that resource and, more importantly, don't pay an operational or capital expenditure on acquiring or running it. So, I sympathize that we don't always know which systems will be submitting telemetry information to us. On the flip side, those machines or services that you know about and pay for — it's just flat-out lazy to not know what they are. In the case of short-lived resources, it is imperative that you know when they are doing work and when they are gone for good. Considering this, it would stand to reason that the resource being monitored must initiate this. This is an argument for push… at least on layer 3. Woah! What? Why am I talking about OSI layers? I'll get to that.

Push: Works behind complex network setups.

It turns out that pull actually works behind some complex network configurations where push fails, though these are quite rare in practice. Still, it also turns out that TCP sessions are bidirectional, so once you've conquered connection setup, you've solved this issue. So this argument (and the rare counterargument) are layer 3 arguments that struggle to find any relevance at layer 7.

Push: When events transpire, I should push; pulling doesn’t match my cadence.

Finally, some real meat. I've talked about this many times in the past, and it is 100% true that some things you want to observe fall well into the push realm and others into the pull realm. When an event transpires, you likely want to get that information upstream as quickly as possible, so push makes good sense. And as this is information… we're talking layer 7. If you instrument processes starting and stopping, you likely don't want to miss anything. On the other hand, the way to never miss disk space usage on a system would be to log every block allocation and deallocation — sounds like a bit of overkill, perhaps? This is a good example of where pulling that information at an operator's discretion (say, every few seconds or every minute) would suffice. Basically, sometimes it makes good sense to push on layer 7, and sometimes it makes better sense to pull.

Pull: I know better when a machine or service goes bad because I control the polling interval.

This, to me, comes down to the responsible party. Is each of your million machines (or 10) responsible for detecting failure (in the form of absenteeism), or is that the responsibility of the monitoring system? That was rhetorical, of course. The monitoring system is responsible, full stop. Yet detecting the failure of systems by tracking the absenteeism of data in the push model requires elaborate models of acceptable delinquency in emissions. When the monitoring system pulls data, it controls the interval and can determine unavailability in a way that is reliable, simple, and, perhaps most importantly, easy to reason about. While there are elements of layer 3 here if the client is not currently "connected" to the monitoring system, this issue is almost entirely addressed on layer 7.

Pull: Controlling the polling interval allows me to investigate issues faster and more effectively.

For metrics in many systems, taking a measurement every 100ms is overkill. I have thousands of metrics available on a machine, and most of them are very expressive on observation intervals as large as five minutes. However, there are times at which a tighter observation interval is warranted. This is an argument of control, and it is a good argument. The claim that an operator should be able to dynamically control the interval at which measurements are taken is a completely legitimate claim and expectation to have. This argument and its solution live in layer 7.

Enter Pully McPushface.

Layer3

Pully McPushface is just a name to get attention: attention to something that can potentially make people cease their asinine pull vs. push arguments. It is simply the acknowledgement that one can push or pull at layer 3 (the direction in which one establishes a TCP session) and also push (send) or pull (request/response) on layer 7, independent of one another. To be clear, this approach has been possible since TCP hit the scene in 1982… so why haven’t monitoring systems leveraged it?
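A bare-bones illustration of the idea, in plain Python sockets rather than the actual NAD or broker protocol: the layer-7 exchange is identical whether the agent accepts the TCP session or dials out to establish it.

```python
# The same layer-7 behavior (answer metric requests) rides on a TCP session
# regardless of which side initiated it at layer 3.
import json
import socket

def read_metric(name: str) -> float:
    return 42.0  # stand-in for a real collection routine

def answer_metric_requests(conn: socket.socket) -> None:
    """Layer 7: request/response metric exchange over any connected socket."""
    with conn, conn.makefile("rw") as stream:
        for line in stream:
            name = line.strip()
            if not name:
                break
            stream.write(json.dumps({"metric": name, "value": read_metric(name)}) + "\n")
            stream.flush()

def agent_inbound(port: int) -> None:
    """Layer 3 'pull': the monitoring system connects to the agent."""
    with socket.create_server(("", port)) as server:
        conn, _addr = server.accept()
        answer_metric_requests(conn)

def agent_outbound(host: str, port: int) -> None:
    """Layer 3 'push': the agent dials out, then behaves exactly the same."""
    answer_metric_requests(socket.create_connection((host, port)))
```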

At Circonus, we’ve recently revamped our stack to allow for this freedom in almost every level of our architecture. Since the beginning, we’ve supported both push and pull protocols (like collectd, statsd, json over HTTP, NRPE, etc.), and we’ll continue to do so. The problem was that these all (as do the pundits) conflate layer 3 and layer 7 “initiation” in their design. (The collectd agent connects out via TCP to push data, and a monitor connects into NRPE to pull data.) We’re changing the dialogue.

Our collection system is designed to be distributed. We have our first tier: the core system, our second tier: the broker network, and our third tier: agents. While we support a multitude of agents (including the aforementioned statsd, collectd, etc.), we also have our own open source agent called NAD.

When we initially designed Circonus, we did extensive research with world-leading security teams to understand whether our layer 3 connections between tier 1 and tier 2 should be initiated by the broker to the core or vice versa. The consensus (unanimous, I might add) was that security would be improved by controlling a single inbound TCP connection to the broker, and that the broker could be operated without a default route, preventing it from easily sending data to malicious parties were it duped. It turns out that our audience wholeheartedly disagreed with this expert opinion. The solution? Be agnostic. Today, the conversations between tier 1 and tier 2 care not who initiates the connection. Prefer the broker reaches out? That's just fine. Want the core to connect to the broker? That'll work too.

In our recent release of C:OSI (and NAD), we've applied the same agnosticism to connectivity between tier 2 and tier 3. Here is where the magic happens. The NAD agent now has the ability to both dial in and be dialed to on layer 3, while maintaining all of its normal layer 7 flexibility. Basically, however your network and systems are set up, we can work with that and still get on-demand, high-frequency data out; no more compromises. Say hello to Pully McPushface.


The Future of Monitoring: Q&A with Jez Humble


jez_humble

Jez Humble is a lecturer at U.C. Berkeley, and co-author of the Jolt Award-winning Continuous Delivery: Reliable Software Releases through Build, Test and Deployment Automation (Addison-Wesley 2011) and Lean Enterprise: How High Performance Organizations Innovate at Scale (O'Reilly 2015), in Eric Ries' Lean series. He has worked as a software developer, product manager, consultant, and trainer across a wide variety of domains and technologies. His focus is on helping organisations deliver valuable, high-quality software frequently and reliably through implementing effective engineering practices.

Theo’s Intro:

It is my perspective that the world of technology will eventually be a continual place. As services become more and more componentized, they stand to become more independently developed and operated. The implications for engineering design when attempting to maintain acceptable resiliency levels are significant. The convergence on a continual world is simply a natural progression and will not be stopped.
Jez has taken to deep thought and practice around these challenges quite a bit ahead of the adoption curve, and he has a unique perspective on where we are going, why we are going there, and (likely vivid) images of the catastrophic derailments that might occur along the tracks. While I spend all my time thinking about how people might have peace of mind that their systems and businesses are measurably functioning during and after transitions into this new world, my interest is compounded by Circonus' internal use of continual integration and deployment practices for both our SaaS and on-premises customers.

THEO: Most of the slides, talks and propaganda around CI/CD (Continuous Integration/Continuous Delivery) are framed in the context of businesses launching software services that are consumed by customers as opposed to software products consumed by customers. Do you find that people need a different frame of mind, a different perspective or just more discipline when they are faced with shipping product vs. shipping services as it relates to continual practices?

JEZ: The great thing about continuous delivery is that the same principles apply whether you’re doing web services, product development, embedded or mobile. You need to make sure you’re working in small batches, and that your software is always releasable, otherwise you won’t get the benefits. I started my career at a web startup but then spent several years working on packaged software, and the discipline is the same. Some of the problems are different: for example, when I was working on go.cd, we built a stage into our deployment pipeline to do automated upgrade testing from every previous release to what was on trunk. But fundamentally, it’s the same foundations: comprehensive configuration management, good test automation, and the practice of working in small batches on trunk and keeping it releasable. In fact, one of my favourite case studies for CI/CD is HP’s LaserJet Firmware division — yet nobody is deploying new firmware multiple times a day. You do make a good point about discipline: when you’re not actually having to deploy to production on a regular basis it can be easy to let things slide. Perhaps you don’t pay too much attention to the automated functional tests breaking, or you decide that one long-lived branch to do some deep surgery on a fragile subsystem is OK. Continuous deployment (deploying to production frequently) tends to concentrate the mind. But the discipline is equally important however frequently you release.

THEO: Do you find that organizations “going lean” struggle more, take longer or navigate more risk when they are primarily shipping software products vs. services?

JEZ: Each model has its own trade-offs. Products (including mobile apps) usually require a large matrix of client devices to test in order to make sure your product will work correctly. You also have to worry about upgrade testing. Services, on the other hand, require development to work with IT operations to get the deployment process to a low-risk pushbutton state, and make sure the service is easy to operate. Both of these problems are hard to solve — I don’t think anybody gets an easy ride. Many companies who started off shipping product are now moving to a SaaS model in any case, so they’re having to negotiate both models, which is an interesting problem to face. In both cases, getting fast, comprehensive test automation in place and being able to run as much as possible on every check-in, and then fixing things when they break, is the beginning of wisdom.

THEO: Thinking continuously is only a small part of establishing a “lean enterprise.” Do you find engineers more easily reason about adopting CI/CD than other changes such as organizational retooling and process refinements? What’s the most common sticking point (or point of flat-out derailment) for organizations attempting to go lean?

JEZ: My biggest frustration is how conservative most technology organizations are when it comes to changing the way people behave. There are plenty of engineers who are happy to play with new languages or technologies, but god forbid you try and mess with their worldview on process. The biggest sticking point – whether it’s engineers, middle management or leadership – is getting them to change their behavior and ways of thinking.

But the best people – and organizations – are never satisfied with how they’re doing and are always looking for ways to improve.

The worst ones either just accept the status quo, or are always blowing things up (continuous re-orgs are a great example), lurching from one crisis to another. Sometimes you get both. Effective leaders and managers understand that it’s essential to have a measurable customer or organizational outcome to work towards, and that their job is to help the people working for them experiment in a disciplined, scientific way with process improvement work to move towards the goal. That requires that you actually have time and resources to invest in this work, and that you have people with the capacity for and interest in making things better.

THEO: Finance is precise and process-oriented, and oftentimes bad things happen (people working from different or incorrect base assumptions) when there are too many cooks in the kitchen. This is why finance is usually tightly controlled by the CFO, and models and representations are fastidiously enforced. Monitoring and analytics around that data share a lot in common with respect to models and meanings. However, many engineering groups have far less discipline and control than financial groups do. Where do you see things going here?

JEZ: Monitoring isn’t really my area, but my guess is that there are similar factors at play here to other parts of the DevOps world, which is the lack of both an economic model and the discipline to apply it. Don Reinertsen has a few quotes that I rather like: “you may ignore economics, but economics won’t ignore you.” He also says of product development “The measure of execution in product development is our ability to constantly align our plans to whatever is, at the moment, the best economic choice.” Making good decisions is fundamentally about risk management: what are the risks we face? What choices are available to us to mitigate those risks? What are the impacts? What should we be prepared to pay to mitigate those impacts? What information is required to assess the probability of those risks occurring? How much should we be prepared to pay for that information? For CFOs working within business models that are well understood, there are templates and models that encapsulate this information in a way that makes effective risk management somewhat algorithmic, provided of course you stay within the bounds of the model. I don’t know whether we’re yet at that stage with respect to monitoring, but I certainly don’t feel like we’re yet at that stage with the rest of DevOps. Thus a lot of what we do is heuristic in nature — and that requires constant adaptation and improvement, which takes even more discipline, effort, and attention. That, in a department which is constantly overloaded by firefighting. I guess that’s a very long way of saying that I don’t have a very clear picture of where things are going, but I think it’ll be a while before we’re in a place that has a bunch of proven models with well understood trade-offs.

THEO: In your experience how do organizations today habitually screw up monitoring? What are they simply thinking about “the wrong way?”

JEZ: I haven’t worked in IT operations professionally for over a decade, but based on what I hear and observe, I feel like a lot of people still treat monitoring as little more than setting up a bunch of alerts. This leads to a lot of the issues we see everywhere with alert fatigue and people working very reactively. Tom Limoncelli has a nice blog post where he recommends deleting all your alerts and then, when there’s an outage, working out what information would have predicted it, and just collecting that information. Of course he’s being provocative, but we have a similar situation with tests — people are terrified about deleting them because they feel like they’re literally deleting quality (or in the case of alerts, stability) from their system. But it’s far better to have a small number of alerts that actually have information value than a large number that are telling you very little, but drown the useful data in noise.

THEO: Andrew Shaffer said that “technology is 90% tribalism and fashion.” I’m not sure about the percentage, but he nailed the heart of the problem. You and I both know that process, practice and methods sunset faster in technology than in most other fields. I’ll ask the impossible question… after enterprises go lean, what’s next?

JEZ: I actually believe that there’s no end state to “going lean.” In my opinion, lean is fundamentally about taking a disciplined, scientific approach to product development and process improvement — and you’re never done with that. The environment is always changing, and it’s a question of how fast you can adapt, and how long you can stay in the game. Lean is the science of growing adaptive, resilient organizations, and the best of those are always getting better. Andrew is (as is often the case) correct, and what I find really astonishing is that as an industry we have a terrible grasp of our own history. As George Santayana has it, we seem condemned to repeat our mistakes endlessly, albeit every time with some shiny new technology stack. I feel like there’s a long way to go before any software company truly embodies lean principles — especially the ability to balance moving fast at high quality while maintaining a humane working environment. The main obstacle is the appalling ineptitude of a large proportion of IT management and leadership — so many of these people are either senior engineers who are victims of the Peter Principle or MBAs with no real understanding of how technology works. Many technologists even believe effective management is an oxymoron. While I am lucky enough to know several great leaders and managers, they have not in general become who they are as a result of any serious effort in our industry to cultivate such people. We’re many years away from addressing these problems at scale.

The Future of Monitoring: Q&A with John Allspaw


johnallspaw_2015

John Allspaw is CTO at Etsy. John has worked in systems operations for over 14 years in biotech, government, and online media. He started out tuning parallel clusters running vehicle crash simulations for the U.S. government, and then moved on to the Internet in 1997. He built the backing infrastructures at Salon, InfoWorld, Friendster, and Flickr. He is a well-known industry pundit, speaker, blogger, and the author of Web Operations and The Art of Capacity Planning. Visit John's blog.

Theo: As you know I live numbers. The future of monitoring is leaning strongly toward complex analytics on epic amounts of telemetry data. How do you think this will affect how operations and engineering teams work?

John: Two things come to mind. The first is that we could look at it in the same way the field is looking at “Big Data.” While we now have technologies to help us get answers to questions we have, it turns out that finding the right question is just as important. And you’re right: it’s surprisingly easy to collect a massive amount of telemetry data at a rate that outpaces our abilities to analyze it. I think the real challenge is one of designing systems that can make it easy to navigate this data without getting too much in our way.

I'm fond of Herb Simon's saying, "Information is not a scarce resource. Attention is." I think that part of this challenge includes using novel ways of analyzing data algorithmically. I think another part, just as critical, is to design software and interfaces that can act as true advisors or partners. More often than not, I'm not going to know what I want to look at until I look around in these billions of time-series datasets. If we make it easy and effortless for a team to "look around" – maybe this is a navigation challenge – I'll bet on that team being better at operations.

Theo: Given your (long) work in operations, you’ve seen good systems and bad systems, good teams and bad teams, good approaches and bad approaches. If you could describe a commonality of all the bads in one word, what would it be? and why?

John: Well, anyone who knows me knows that summarizing (forget about in one word!) is pretty difficult for me. 🙂 If I had to, I would say that what we want to avoid is being brittle. Brittle process, brittle architecture design, brittle incident response, etc. Being brittle in this case means that we can always be prepared for anything, as long as we can imagine it beforehand. The companies we grow and systems we build have too much going on to be perfectly predictable. Brittle is what you get when you bet all your chips on procedures, prescriptive processes, and software that takes on too much of its own control.

Resilience is what you get when you invest in preparing to be unprepared.

Theo: What was it about “Resilience Engineering” that sucked you in?

John: One of the things that really drew me into the field was the idea that we can have a different perspective on how we look at how successful work actually happens in our field. Traditionally, we judge ourselves on the absence of failures, and we assume almost tacitly that we can design a system (software, company org chart, financial model, etc.) that will work all the time, perfectly. All you have to do is: don’t touch it.

Resilience Engineering concepts assert something different: that success comes from people adapting to what they see happening, anticipating what limits and breakdowns the system is headed towards, and making adjustments to forestall them. In other words, success is the result of the presence of adaptive capacity, not the absence of failures.

This idea isn’t just plain-old “fault tolerance” – it’s something more. David Woods (a researcher in the field) calls this something “graceful extensibility” – the idea that it’s not just degradation after failure, but adapting when you get close to the boundaries of failure. Successful teams do this all the time, but no attention is paid to it, because there’s no drama in a non-outage.

That's what I find fascinating: instead of starting with an outage and explaining what a team lacked or didn't do, we could look at all the things that make for an outage-less day. Many of the expertise ingredients of outage-less days are unspoken, coming from "muscle memory" and rules of thumb that engineers have developed tacitly over the years. I want to discover all of that.

Theo: How do you think the field of Resilience Engineering can improve the data science that happens around telemetry analysis in complex systems?

John: Great question! I think a really fertile area is to explore the qualitative aspects of how people make sense of telemetry data, at different levels (aggregate, component, etc.) and find ways that use quantitative methods to provide stronger signals than the user could do on their own. An example of this might be to explore expert critiquing systems, where a monitoring system doesn’t try to be “intelligent” but instead provides options/directions for diagnosis for the user to take, essentially providing decision support. This isn’t an approach I see taken yet, in earnest. Maybe Circonus can take a shot at it? 🙂

Theo: As two emerging fields of research are integrated into practice, there are bound to be problems. Can you make a wild prediction as to what some of these problems might be?

John: Agreed. I think it might be awkward like a junior high school dance. We have human factors, qualitative analysts, and UX/UI folks on one side of the gymnasium, and statisticians, computer scientists, and mathematicians on the other. One of the more obvious potential quagmires is the insistence that one approach or the other must be superior, resulting in a mangled mess of tools or worse: no progress at all. In a cartoon-like stereotype of the fields, I can imagine one camp designing with the belief that all bets must be placed on algorithms, no humans needed. And in the other camp, an over-focus on interfaces that ignores or downplays potential computational processing advantages.

If we do it well, both camps won’t play their solos at the same time, and will take the nuanced approach. Maybe data science and resilience engineering can view themselves as a great rhythm section of a jazz band.

Hallway Track: The Future of Monitoring

I’ve been in this “Internet industry” since around 1997. That doesn’t make me the first on the stage, but I’ve had a very wide set of experiences: from deep within the commercial software world to the front lines of open source and from the smallest startup sites to helping fifteen of the world’s most highly trafficked web sites. My focus for a long time was scalability, but that slowly morphed into general hacking and speaking. As a part of my rigorous speaking schedule, I’ve been to myriad conferences all around the globe; sometimes attending, sometimes chairing, but almost always speaking. I’ve often been asked: “Why do you travel so much? Why do you go to so many conferences?” The answer is simple: the people.

Some go to conferences for session material, perhaps most attendees even. In multi-track conferences, people tend to stick to one track or another. I'd argue that all conferences are inherently multi-tracked: you have whatever tracks are on the program, and you and I have the hallway track. The hallway track is where I go to learn, to feel small, and to be truly inspired and excited about the challenges we're collectively facing and the pain they're causing.

The hallway track is like a market research group, a support group, a cheerleading sideline and a therapy session all in one. I like it so much, I founded the Surge conference at OmniTI to bring together the right people thinking about the right things with an ulterior and selfish motive to concoct the perfect hallway track. Success!

Now for the next experiment: can we emulate a hallway track conversation from the observer's perspective? Would an online Q&A between me and a variety of industry luminaries be interesting? I hope so, and we're going to find out.

The Problem with Math: Why Your Monitoring Solution is Wrong

Math is perfect – the most perfect thing in the world. The problem with math is that it is perpetrated by us imperfect humans. Circonus has long aimed to bring deep numerical analysis to business telemetry data. That may sound like a lot of mumbo-jumbo, but it really means we want better answers to better questions about all that data your business creates. Like it or not, to do that, we need to do maths and to do them right.

That’s not how it works

I was watching a line of Esurance commercials that have recently aired, wherein people execute Internet-sounding tasks in nonsensical ways, like posting photographs to your [Facebook] wall by pinning things to your living room wall while your friends sit on the couch to observe, and I was reminded of something I see too often: bad math. Math to which I'm inclined to rebut: "That's not how it works; that's not how any of this works!"

A great example of such bad math is the inaccurate calculation of quantiles. This may sound super “mathy,” but quantiles have wide applications and there is a strong chance that you need them for things such as enforcing service level agreements and calculating billing and payments.

What is a quantile? First, we'll use the syntax q(N, v) to represent a quantile, where N is a set of samples and v is some number between 0 and 1, inclusive. Herein, we'll assume some set N and just write a quantile as q(v). Remember that 0% is 0 and 100% is actually 1.0 in probability, so the quantile simply asks which sample in my set represents a number such that a fraction v of the samples are less than it and (1-v) of the samples are greater than it. This may sound complicated. Most descriptions of math concepts are a bit opaque at first, but a few examples can be quite illustrative (a short code sketch follows the list below).

  • What is q(0)? What sample is such that 0 (or none) in the set are smaller and the rest are larger? Well, that’s easy: the smallest number in the set. q(0) is another way of writing the minimum.
  • What is q(1)? What sample is such that 1 (100% or all) in the set are smaller and none are larger? Also quite simple: the largest number in the set. q(1) is another way of writing the maximum.
  • What is q(0.5)? What sample is such that 50% (or half) in the set are smaller and the other half are larger? This is the middle or in statistics, the median. q(0.5) is another way of writing the median.
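As code, a minimal "nearest rank" style q(N, v) is enough to see the boundary cases above behave as described; real implementations, Circonus included, interpolate more carefully:

```python
def q(samples, v):
    """Nearest-rank quantile: the sample with roughly a fraction v of the set below it."""
    if not 0.0 <= v <= 1.0:
        raise ValueError("v must be between 0 and 1 inclusive")
    ordered = sorted(samples)
    return ordered[min(int(v * len(ordered)), len(ordered) - 1)]

N = [12, 3, 7, 19, 5, 8, 21, 4, 10, 6]

print(q(N, 0))    # 3  -> the minimum
print(q(N, 1))    # 21 -> the maximum
print(q(N, 0.5))  # 8  -> the median (the upper of the two middle values here)
```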

SLA calculations – the good, the bad and the ugly

Quite often, when articulating Internet bandwidth billing scenarios, one will measure the traffic over each 5 minute period throughout an entire month and calculate q(0.95) over those samples. This is called 95th percentile billing. In service level agreements, one can stipulate that the latency for a particular service must be faster than a specific amount (some percentage of the time), or that some percentage of all interactions with said service must be at a specific latency or faster. Why are those methods different? In the first, your samples are calculated over discrete buckets of time, whereas in the second your samples are simply the latencies of each service request. As a note to those writing SLAs: the first is dreadfully difficult to articulate and thus nigh-impossible to calculate consistently or meaningfully; the second just makes sense. While discrete time buckets might make sense for availability-based SLAs, they make little-to-no sense for latency-based SLAs. As "slow is the new down" is adopted across modern businesses, simple availability-based SLAs are rapidly becoming irrelevant.

So, let's get more concrete: I have an API service with a simply stated SLA: 99.9% of my accesses should be serviced in 100 or fewer milliseconds. It might be a simple statement, but it is missing a critical piece of information: over what time period is this enforced? Not specifying the time unit of SLA enforcement is most often the first mistake people make. A simple example will best serve to illustrate why.

Assume I have an average of 5 requests per second to this API service. "Things go wrong" and I have twenty requests that are served at 200ms (significantly above our 100ms requirement) during the day: ten slow requests at 9:02am and the other ten offenders at 12:18pm. Some back-of-the-napkin math says that as long as fewer than 1 request out of 1000 is slow, I'm fine (99.9% are still fast enough). As I far exceed 20,000 total requests during the day, I've not violated my SLA… However, if I enforce my SLA on five minute intervals, I have 1500 requests occurring between 9:00 and 9:05 and 10 slow responses. 10 out of 1500 is… well… there goes my SLA. The same blatant failure occurs from 12:15 to 12:20. So, based on a five-minute-enforcement method, I have 10 minutes of SLA violation during the day versus no SLA violation whatsoever using a one-day-enforcement method. But wait… it gets worse. Why? Bad math.
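The back-of-the-napkin arithmetic above, written out (the 5 requests/second and twenty slow requests are the example's own numbers):

```python
SLA_FRACTION = 0.999   # 99.9% of requests must be <= 100ms
rate = 5               # average requests per second

# Day-long enforcement: 20 slow requests against a whole day of traffic.
daily_requests = rate * 86400          # 432,000 -- far more than 20,000
daily_slow = 20
print(daily_slow / daily_requests <= 1 - SLA_FRACTION)    # True: SLA "met"

# Five-minute enforcement: 10 of those slow requests land in one window.
window_requests = rate * 300           # 1,500
window_slow = 10
print(window_slow / window_requests <= 1 - SLA_FRACTION)  # False: SLA violated
```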

Many systems out there calculate quantiles over short periods of time (like 1 minute or 5 minutes). Instead of storing the latency for every request, the system retains 5 minutes of these measurements, calculates q(0.999), and then discards the samples and stores the resulting quantile for later use. At the end of the day, you have q(0.999) for each 5 minute period throughout the day (288 of them). So given 288 quantiles throughout the day, how do you calculate the quantile for the whole day? Despite the fictitious answer some tools provide, the answer is you don’t. Math… it doesn’t work that way.

There are only two magical quantiles that will allow this type of reduction: q(0) and q(1). The minimum of a set of minimums is indeed the global minimum; the same is true for maximums. Do you know what the q(0.999) of a set of q(0.999)s is? Or what the average of a set of q(0.999)s is? Hint: it's not the answer you're looking for. Basically, if you have a set of 288 quantiles representing each of the day's five minute intervals and you want the quantile for the whole day, you are simply out of luck. Because math.

In fact, the situation is quite dire. If I calculated the quantile of the aforementioned 9:00 to 9:05 interval, where ten samples out of 1500 are 200ms, the q(0.999) is 200ms despite the other 1490 samples being faster than 100ms. Devastatingly, if those other 1490 samples were 150ms (such that every single request was over the prescribed 100ms limit), the q(0.999) would still be 200ms. Because I've tossed the original samples, I have no idea if all of my samples violated the SLA or just 0.1% of them. In the worst case scenario, all of them were "too slow" and now I have 3000 requests across the two bad windows that were too slow. While 20 slow requests aren't enough to break the day, 3000 most certainly are, and my whole day is actually in violation. Because the system I'm using for collecting quantile information is doing the math wrong, the only reasonable answer to "did we violate the SLA today?" is "I don't know, maybe." It doesn't need to be like this – Circonus calculates these quantiles correctly.
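A small sketch of that example (the 50ms figure for the "fast" requests is invented; the text only says they were under 100ms) shows why the stored per-window quantile can't distinguish a blip from a disaster:

```python
def q(samples, v):
    ordered = sorted(samples)
    return ordered[min(int(v * len(ordered)), len(ordered) - 1)]

SLA_LIMIT = 0.100  # seconds

# Window A: 1,490 requests at 50ms (fine) and 10 at 200ms (slow).
window_a = [0.050] * 1490 + [0.200] * 10
# Window B: 1,490 requests at 150ms (every one violates the SLA) and 10 at 200ms.
window_b = [0.150] * 1490 + [0.200] * 10

# The stored per-window q(0.999) is identical for both windows...
print(q(window_a, 0.999), q(window_b, 0.999))   # 0.2 0.2

# ...while the raw samples (or a full histogram) tell the real story.
bad = lambda w: sum(1 for x in w if x > SLA_LIMIT) / len(w)
print(f"{bad(window_a):.2%}  vs  {bad(window_b):.2%}")   # 0.67%  vs  100.00%
```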

An aside on SLAs: While some of this might be a bit laborious to follow, the takeaway is to be very careful how you articulate your SLAs, or you will often have no idea if you are meeting them. I recommend calculating quantiles on a day-to-day basis (and that all days are measured in UTC, so the abomination that is daylight saving time never foils you). So, to restate the example SLA above: 99.9% or more of the requests occurring on a 24-hour calendar day (UTC) shall be serviced in 100ms or less. If you prefer to keep your increments of failure smaller, you can opt for an hour-by-hour SLA instead of a day-by-day one. I do not recommend stating an SLA in anything less than one-hour spans.

Bad math in monitoring – don’t let it happen to you

Quantiles are an example where the current methods most tools use are simply wrong, and there is no trick or method that can help. However, something I've learned working on Circonus is how often other tools screw up even the most basic math. I'll list a few examples in the form of tips, without the indignity of attributing them to specific products, and a short sketch after the list makes two of them concrete. (If you run any of these products, you might recognize the example… or at some point in the future you will either wake up in a cold sweat or simply let loose a stream of astounding expletives.)

  • The average of a set of minimums is not the global minimum. It is, instead, nothing useful.
  • The simple average of a set of averages of varying sample sizes isn’t the average of all the samples combined. It is, instead, nothing useful.
  • The average of the standard deviations of separate sets of samples is not the standard deviation of the combined set of samples. It is, instead, nothing useful.
  • The q(v) of a set of q(v) calculated over sample sets is not the q(v) of the combined sample set. While creative, it is nothing useful.
  • The average of a set of q(v) calculated over sample sets is, you guessed it, nothing useful.
  • The average of a bunch of rates (which are nothing more than a change in value divided by a change in time: dv/dt) with varying dts is not the damn average rate (and a particularly horrible travesty).
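Two of these fallacies, made concrete with invented numbers (the others fail for the same underlying reason: the aggregate forgets how many samples, or how much time, each input represents):

```python
# 1. The simple average of per-group averages is not the overall average
#    when the groups have different sample counts.
group_a = [100] * 99   # 99 samples averaging 100
group_b = [1000]       #  1 sample  averaging 1000
avg_of_avgs = (sum(group_a) / len(group_a) + sum(group_b) / len(group_b)) / 2
true_avg = sum(group_a + group_b) / (len(group_a) + len(group_b))
print(avg_of_avgs, true_avg)    # 550.0 vs 109.0

# 2. The simple average of rates (dv/dt) with varying dt is not the true rate;
#    you must divide the total change by the total time.
deltas = [(10, 1.0), (10, 1.0), (100, 100.0)]   # (dv, dt) pairs
avg_of_rates = sum(dv / dt for dv, dt in deltas) / len(deltas)
true_rate = sum(dv for dv, _ in deltas) / sum(dt for _, dt in deltas)
print(avg_of_rates, true_rate)  # 7.0 vs ~1.18
```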

At Circonus we learned early on that math is both critically important to our mission and quite challenging for many of our customers to understand. This makes it imperative that we not screw it up. Many new adopters contact support asking for an explanation as to why the numbers they see in our tool don’t match their existing tools. We have to explain that we thought they deserved the right answer. As for their existing tools: that’s not how it works; that’s not how any of this works.

Underneath Clean Data: Avoiding Rot

When many people talk about clean data, they are referring to data that was collected in a controlled and rigorous process where bad inputs are avoided. Dirty data has samples outside of the intended collection set or values for certain fields that may be mixed up (e.g. consider “First Name: Theo Schlossnagle” and “Last Name: Male” …oops). These problems pose huge challenges for data scientists and statisticians, but it can get a whole lot worse. What if your clean data were rotten?

Rotten data

All (or almost all) of this data is stored on disks today… in files on disks (yes, even if it is in a database)… in files that are part of a filesystem on disks. There is also a saying, "It's turtles all the way down," which here refers to the stack of imperfect foundational technology we build upon. Case in point: did you know that you're likely to have a bit error (i.e. one bit read back opposite of how it was stored) every time you write between 200TB and 2PB of data? This probability of storing bad data is called the Bit Error Rate (BER). Did you know that most filesystems assume a BER of zero, when it never has been and never will be zero? That means that on every filesystem you've used (unless you've been blessed to run on one of the few filesystems that accounts for this), you've had a chance of reading back data that you legitimately never wrote there!

Now, you may be thinking that one bad bit per 2PB written is a comfortable margin. This BER is published by drive manufacturers, and while they are not outright lying, they omit a very real truth. You don't store data on drives without connecting them to a system via cables to a Host Bus Adapter (HBA): two more pieces of hardware that we'll simply call turtles. Most HBAs use Error-Correcting Code (ECC) memory that is designed to compensate for single-bit errors in memory, but cabling is often imperfect, and the effective BER of the attached drives is bumped ever so slightly higher. Also take into account that physical media is an imperfect storage medium; it is possible to write something correctly and have it altered over time due to environmental conditions and (to a lesser extent) use; this effect is called bit rot or data rot. All of this illustrates that the BER listed on your hard drive specification is optimistic. Combine all this with the inconvenient truth that writing out 2PB of data is quite common in today's data systems, and you wind up with even your cleanest data soiled. As an anecdote, at one point we detected more than one bit error per month in a relatively small cluster (< 100TB).
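The arithmetic is sobering even with round numbers; the 300TB/month write volume below is an invented example, while the 200TB-2PB range comes from the figures above:

```python
BITS_PER_TB = 8 * 10**12

# The implied bit error rate for "one bad bit per 200TB (or 2PB) written".
for tb_per_error in (200, 2000):
    ber = 1 / (tb_per_error * BITS_PER_TB)
    print(f"one error per {tb_per_error:>4}TB  ->  BER ~ {ber:.1e} per bit")

# A cluster writing (and rewriting) 300TB a month should therefore *expect*
# somewhere between a fraction of an error and a couple of errors per month,
# none of which it will notice without checksums and a read-back (scrub).
monthly_writes_tb = 300
for tb_per_error in (200, 2000):
    print(f"expected errors/month at 1 per {tb_per_error}TB: "
          f"{monthly_writes_tb / tb_per_error:.2f}")
```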

You'll notice that I said we detected these issues; this is because we use the ZFS filesystem underneath our customers' data. ZFS checksums all data written so that it can be verified when it is retrieved. The authors of ZFS knew that on large data systems these issues would be real and must be handled, and for that they have my deep gratitude. There is one issue here that escapes most people who have the foresight to run an advanced filesystem, and it is hidden within this very paragraph.

In order for a checksumming filesystem (like ZFS) to detect bit errors, it must read the bad data. On large systems, some data is hot (meaning it is read often), but a significant amount of data is cold: written once and then ignored for extended periods of time (months or years). When data engineers design systems, they account for the data access patterns of the applications that run on top of their systems: How many writes and reads? How much hot and cold? Are the reads and writes sequential or random? The answers to these questions help specify the configuration of the underlying storage systems so that they have enough space, enough bandwidth, and low enough latency to satisfy the expected usage. But, if we add into this the chance that our precious data is rotting and that we must detect an error before we can choose to repair it, then we are left with a situation where we must read all our cold data. We must read all our cold data. We must read all our cold data. Said three times it will induce cold sweats in most storage engineers; it wasn’t part of the originally specified workload, and if you didn’t account for it in your system design, your system is squarely misspecified.
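
To make “read all our cold data” concrete, here is a rough sketch of the extra read throughput a regular scrub demands. The pool size and scrub interval are made-up numbers for illustration; plug in your own:

```python
# Rough extra read throughput required to scrub every byte on a schedule.
# Pool size and scrub interval below are illustrative assumptions.
pool_bytes = 80e12            # assumed: an 80TB pool
scrub_interval_days = 14      # assumed: scrub everything every two weeks

seconds = scrub_interval_days * 24 * 3600
extra_read_mb_s = pool_bytes / seconds / 1e6
print(f"Sustained extra read load: ~{extra_read_mb_s:.0f} MB/s on top of the application workload")
```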

Scrubbing out the rot

In the ZFS world, the action of reading all of your data to verify its integrity and correct for data rot is aptly named “scrubbing.” For the storage engineers out there, I thought this would be an interesting exploration into what scrubbing actually does to your I/O latency. At Circonus we actually care about our customers’ data and scrub it regularly. I’ll show you what this looks like and then very briefly describe what we do to make sure that users aren’t affected.
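
If you run ZFS yourself, kicking off a scrub is a one-liner with the standard zpool tooling. The sketch below simply shells out to those commands; the pool name “tank” is a placeholder, and this is not how Circonus schedules its own scrubs:

```python
# Minimal sketch: start a ZFS scrub and print the pool status.
# "tank" is a placeholder pool name; adjust for your environment.
import subprocess

POOL = "tank"

subprocess.run(["zpool", "scrub", POOL], check=True)          # begin a scrub of the pool
status = subprocess.run(["zpool", "status", POOL],
                        capture_output=True, text=True, check=True)
print(status.stdout)                                          # scrub progress is reported in this output
```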

On our telemetry storage nodes, we measure and record the latency of every disk I/O operation against every physical disk in the server using the io nad plugin (which leverages DTrace on Illumos and ktap on Linux). All of these measurements are sent up to Circonus as histograms, and from there we can analyze the distribution of latencies.
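
For a feel of what “sent up as histograms” means, here is a toy sketch that buckets raw per-I/O latencies (in microseconds) into coarse bins. The binning scheme is an illustrative simplification, not the actual histogram format Circonus uses:

```python
# Toy sketch: bucket per-I/O latencies (microseconds) into a histogram.
# The binning below is an illustrative simplification of a real
# log-linear histogram format.
from collections import Counter

def bucket(latency_us: float) -> int:
    """Collapse a latency into a coarse bucket within its decade."""
    b = 1
    while b * 10 <= latency_us:
        b *= 10
    return (int(latency_us) // b) * b     # e.g. 4731us lands in the 4000us bucket

samples_us = [85, 92, 110, 480, 512, 4731, 5120, 160000]   # made-up latencies
histogram = Counter(bucket(s) for s in samples_us)
for bin_start, count in sorted(histogram.items()):
    print(f">= {bin_start:>7} us : {count}")
```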

Scrubbing Data #1

In this first graph, we’re looking at a time-series histogram focused on the period of time immediately before an apparently radical change in behavior.

Scrubbing Data #2

Moving our mouse one time unit to the right (just before 4am), we can see an entirely different workload present. One might initially think that the new workload performs much better, as many samples are now present on the lower-latency side of the distribution (the left side of the heads-up graph). However, in the legend you’ll notice that the first graph is focused on approximately 900 thousand samples whereas the second graph is focused on approximately 3.2 million samples. So, while we have more low-latency samples, we also have many more samples overall.

Scrubbing Data #3

Of further interest is that, almost immediately at 4am, the workload changes again and we see a new distribution emerge in the signal. This distribution stays fairly consistent for about 7 hours with a few short interruptions, changes yet again just before 12pm on Jan 5, and seemingly recovers to the original workload just after 4pm (16:00). This is the havoc a scrub can wreak, but we’ll see with some cursory analysis that the effects aren’t actually over at 4pm.

Scrubbing Data #4

The next thing we do is add an analytical overlay to our latency graph. This overlay represents an approximation of two times the number of modes in the distribution (the number of humps in the histogram) as time goes on. This measurement is an interesting characteristic of a workload and can be used to detect changes in it. As we can see, we veered radically from our workload just before 4am and returned to our original workload (or at least something with the same modality) just after midnight the following day.
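
As a rough illustration of the idea (and only the idea; this is not the algorithm behind the overlay), one can approximate the number of modes in a histogram by counting local maxima after a little smoothing:

```python
# Rough sketch: count the modes (humps) in a histogram by looking for
# local maxima after light smoothing. Illustrative only; not the
# modality estimator Circonus uses for its overlay.

def count_modes(counts, smooth_passes=2):
    # Simple moving-average smoothing to suppress single-bin noise.
    for _ in range(smooth_passes):
        counts = [
            (counts[max(i - 1, 0)] + counts[i] + counts[min(i + 1, len(counts) - 1)]) / 3
            for i in range(len(counts))
        ]
    # A mode is a bin strictly greater than both of its neighbors.
    return sum(
        1
        for i in range(1, len(counts) - 1)
        if counts[i] > counts[i - 1] and counts[i] > counts[i + 1]
    )

bimodal = [1, 4, 9, 5, 2, 1, 3, 8, 12, 7, 2]    # two humps
print(count_modes(bimodal))                      # -> 2
```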

Scrubbing Data #5

Lastly, we can see the effects on the upper end of the latency distribution by looking at some quantiles. In the above graph we reset the maximum y-value to 1M (the units here are microseconds, so this is a 1s maximum). The overlays here are q(0.5), q(0.99), q(0.995), and q(0.999). We can see our service times growing into a range that would cause customer dissatisfaction.
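
Because the full distribution is stored, quantiles like these can be computed after the fact from the histogram itself. A minimal sketch of the idea, using made-up bins rather than real telemetry:

```python
# Minimal sketch: compute an approximate quantile from histogram bins.
# The bins below are made up; real bins come from the stored distribution.

def histogram_quantile(bins, q):
    """bins: list of (bin_lower_bound_us, count); q in [0, 1]."""
    total = sum(count for _, count in bins)
    target = q * total
    running = 0
    for lower, count in sorted(bins):
        running += count
        if running >= target:
            return lower          # approximate: report the bin's lower bound
    return sorted(bins)[-1][0]

bins = [(100, 5000), (1000, 800), (10000, 150), (100000, 12)]   # made-up latency bins (us)
for q in (0.5, 0.99, 0.995, 0.999):
    print(f"q({q}) ~ {histogram_quantile(bins, q)} us")
```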

While I won’t go into detail about how we solve this issue, the approach is fairly simple. All data in our telemetry store is replicated on multiple nodes. The system understands node latency and can prefer reads from nodes with lower latency.
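
A hedged sketch of what “prefer reads from nodes with lower latency” can look like; the node names and latency numbers are hypothetical, and this is not the actual replication code:

```python
# Hypothetical sketch of latency-aware read preference across replicas.
# Node names and latencies are made up; this is not Circonus's code.

recent_read_latency_ms = {     # e.g. rolling medians reported per node
    "node-a": 3.2,
    "node-b": 41.7,            # this replica is busy scrubbing
    "node-c": 4.1,
}

def pick_replica(replicas, latencies):
    """Prefer the replica with the lowest recently observed read latency."""
    return min(replicas, key=lambda node: latencies.get(node, float("inf")))

print(pick_replica(["node-a", "node-b", "node-c"], recent_read_latency_ms))  # -> node-a
```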

Understanding how our systems behave while we keep our customers’ data from rotting away allows us to always serve the cleanest data as fast as possible.

Alerting on disk space the right way

Most people who alert on disk space use an arbitrary threshold, such as “notify me when my disk is 85% full.” Most people then get alerted, spend an hour trying to delete things, and update their rule to “notify me when my disk is 86% full.” Sounds dumb, right? I’ve done it and pretty much everyone I know in operations has done it. The good news is that we didn’t do this because we are all stupid people; we did it because the tools we were using didn’t allow us to ask the questions we really wanted answered. Let’s work backwards to a better disk space check.

There are occasionally reasons to set static thresholds, but most of the time we care about disk space, it’s because we need to buy more. The question then becomes, “How much advance notice do I need?” Let’s assume, for the sake of argument, that I need 4 weeks to execute on increasing storage capacity (planning for and scheduling possible system downtime, resizing a LUN, etc.). If you run a cloudier sort of architecture, maybe you only need a single day, so that this sort of change happens during a maintenance window where all necessary parties are available. After all, why would you want to act on this in an emergency?

Really, the question we’re aiming at is, “Will I be out of disk space 4 weeks from now?” It turns out that this is a very simple statistical question, and with a few hints you can get an answer in short order. First we need a model of the data growth, and this is where we need a bit more information. Specifically, how much history should drive the model? This depends heavily on the usage of the system, but most systems have a fairly steady growth pattern and you’d like to include some multiple of the period of that pattern.

Graph Adding an Exponential Regression
Adding an Exponential Regression

To make this a little more concrete, let’s say we have a system that is growing over time and also generates logs that get deleted daily. We expect a general trend upward with a daily periodic oscillation as we accumulate log files and then wipe them out. As a rule of thumb, one week of data should be sufficient for most systems, so we should build our model off 7 days’ worth of history.

Graph looking 1 week back and 28 days forward.
Looking 1 week back and 28 days forward.

Quite simply, we take our data over the last 7 days and generate a regression model. Then, we time-shift the regression model backwards by 4 weeks (the amount of notice we’d like), so its “current value” is the model-predicted value four weeks from today. If that value is more than 100%, we need to tell someone. Easy.
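
A minimal sketch of the approach, assuming hourly “percent used” samples and numpy; the sample data, window sizes, and threshold are illustrative, and Circonus’s built-in feature is what performs this calculation continuously:

```python
# Minimal sketch: fit an exponential trend to 7 days of disk usage and
# ask whether the predicted value 28 days out exceeds 100%.
# Sample data, window sizes, and threshold are illustrative assumptions.
import numpy as np

HOURS_BACK = 7 * 24
HOURS_AHEAD = 28 * 24

def predict_pct_used(samples_pct, hours_ahead=HOURS_AHEAD):
    """samples_pct: one 'percent used' sample per hour, oldest first."""
    t = np.arange(len(samples_pct))
    # Exponential regression: fit a line to log(percent used).
    slope, intercept = np.polyfit(t, np.log(samples_pct), 1)
    future_t = len(samples_pct) - 1 + hours_ahead
    return float(np.exp(intercept + slope * future_t))

# Fake history: slow exponential growth plus a daily log-file sawtooth.
t = np.arange(HOURS_BACK)
history = 60 * np.exp(0.0004 * t) + 2 * ((t % 24) / 24)

predicted = predict_pct_used(history)
if predicted >= 100:
    print(f"Alert: projected {predicted:.1f}% used four weeks from now")
else:
    print(f"OK: projected {predicted:.1f}% used four weeks from now")
```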

Suffice it to say, some tools require extracting the data into Excel or pulling it out with R or Python to accomplish this. While those tools work well, they fail to fit the bill with respect to monitoring, because this model and projected value must be constantly recalculated as new data arrives if we want to keep the mean time to detection (MTTD) where we expect it.

While Circonus has had this feature squirreled away for many months, I’m pleased to say that the alerting UI has been refactored and it is now accessible to mere mortals (at least those mortals who use Circonus).

Exploring Keynote XML Data Pulse

I’ll be the first to admit that the Circonus service can be somewhat intimidating. Sometimes it is hard to puzzle out what we do and what we don’t do. Case in point: perspective-based transactional web monitoring.

Many people have asked us, given our global infrastructure, why we don’t support complex web flows from all of our global nodes and report back granular telemetry about the web pages, assets, and interactivity. The short and simple answer is: someone else is better at it. It turns out they are a lot better at it.

Keynote has been providing synthetic web transaction monitoring via their global network of nodes for many years and has an evolved, widely adopted product offering. So, why all this talk about Keynote?

But why?

You might ask why it is important to get deep, global data about web page performance into Circonus. It’s already in Keynote, right? Their tools even support exploring that data, arguably better than Circonus does.

The reason is simple… your other critical performance data is in Circonus too. Real-time correlations, visualization, and trending can’t happen easily unless the data is available in the same toolset. Web performance is delivered by web infrastructure. Web performance powers business. Once all your performance data is in Circonus, you can tie these three macro-systems together in a cohesive view and produce actionable information quickly.

The story of how we made this possible is, as most good stories are, rife with failures.

Phase Failure: the Keynote API

For over a year, we’ve had support for extracting telemetry data from Keynote via their traditional API. For over a year, most of our customers had no idea… because it was in a hidden beta. It was hidden because we struggled to make it work. Honestly, the integration was painful because the API allowed us to pull only a single telemetry point at a time. It was so painful that we struggled to add any real value on top of the data they stored. The API is so bad (for our needs) it almost looks like Amazon CloudWatch (a pit of hell deserving of a separate blog post).

If you look at a standard deployment of Keynote, you might find yourself pulling data for 200-300 measurements from 15 different locations every minute. For Circonus to pull that feed, we’d have to make up to 4,500 API calls per minute to Keynote for each customer! That’s not good for anyone involved.

Phase Success: the Keynote XML Data Pulse

Recently, our friends over at Keynote let us in on their new XML Data Pulse service which looks at their data more “east and west” as opposed to “north and south.” This newer interface with Keynote’s global infrastructure allows us to pull wide swaths of telemetry data into our systems in near real-time… just like Circonus wants it.

If you’re a Keynote customer and are interested in leveraging our new Data Pulse integration, please reach out to your Keynote contact and get set up with a Data Pulse agreement.