ACM – Testing a Distributed System

I want to sing the praises of one of our lead engineers, Phil Maddox, for authoring a very interesting paper, Testing a Distributed System, which was published in Communications of the ACM, Vol. 58 No. 9.

A brief excerpt follows:

“Distributed systems can be especially difficult to program for a variety of reasons. They can be difficult to design, difficult to manage, and, above all, difficult to test. Testing a normal system can be trying even under the best of circumstances, and no matter how diligent the tester is, bugs can still get through. Now take all of the standard issues and multiply them by multiple processes written in multiple languages running on multiple boxes that could potentially all be on different operating systems, and there is potential for a real disaster.

Individual component testing, usually done via automated test suites, certainly helps by verifying that each component is working correctly. Component testing, however, usually does not fully test all of the bits of a distributed system. Testers need to be able to verify that data at one end of a distributed system makes its way to all of the other parts of the system and, perhaps more importantly, is visible to the various components of the distributed system in a manner that meets the consistency requirements of the system as a whole.”

Read the entire paper here: Testing a Distributed System

The Future of Monitoring: Q&A with John Allspaw

John Allspaw is CTO at Etsy. John has worked in systems operations for over 14 years in biotech, government, and online media. He started out tuning parallel clusters running vehicle crash simulations for the U.S. government, and then moved on to the Internet in 1997. He built the backing infrastructures at Salon, InfoWorld, Friendster, and Flickr. He is a well-known industry pundit, speaker, blogger, and the author of Web Operations and The Art of Capacity Planning. Visit John’s blog

Theo: As you know I live numbers. The future of monitoring is leaning strongly toward complex analytics on epic amounts of telemetry data. How do you think this will affect how operations and engineering teams work?

John: Two things come to mind. The first is that we could look at it in the same way the field is looking at “Big Data.” While we now have technologies to help us get answers to questions we have, it turns out that finding the right question is just as important. And you’re right: it’s surprisingly easy to collect a massive amount of telemetry data at a rate that outpaces our abilities to analyze it. I think the real challenge is one of designing systems that can make it easy to navigate this data without getting too much in our way.

I’m fond of Herb Simon’s saying “Information is not a scarce resource. Attention is.” I think that part of this challenge includes using novel ways of analyzing data algorithmically. I think another part, just as critical, is to design software and interfaces that can act as true advisors or partners. More often than not, I’m not going to know what I want to look at until I look around in these billions of time-series datasets. If we make it easy and effortless for a team to “look around” – maybe this is a navigation challenge – I’ll bet on that team being better at operations.

Theo: Given your (long) work in operations, you’ve seen good systems and bad systems, good teams and bad teams, good approaches and bad approaches. If you could describe a commonality of all the bads in one word, what would it be, and why?

John: Well, anyone who knows me knows that summarizing (forget about in one word!) is pretty difficult for me. 🙂 If I had to, I would say that what we want to avoid is being brittle. Brittle process, brittle architecture design, brittle incident response, etc. Being brittle in this case means that we can always be prepared for anything, as long as we can imagine it beforehand. The companies we grow and systems we build have too much going on to be perfectly predictable. Brittle is what you get when you bet all your chips on procedures, prescriptive processes, and software that takes on too much of its own control.

Resilience is what you get when you invest in preparing to be unprepared.

Theo: What was it about “Resilience Engineering” that sucked you in?

John: One of the things that really drew me into the field was the idea that we can have a different perspective on how we look at how successful work actually happens in our field. Traditionally, we judge ourselves on the absence of failures, and we assume almost tacitly that we can design a system (software, company org chart, financial model, etc.) that will work all the time, perfectly. All you have to do is: don’t touch it.

Resilience Engineering concepts assert something different: that success comes from people adapting to what they see happening, anticipating what limits and breakdowns the system is headed towards, and making adjustments to forestall them. In other words, success is the result of the presence of adaptive capacity, not the absence of failures.

This idea isn’t just plain-old “fault tolerance” – it’s something more. David Woods (a researcher in the field) calls this something “graceful extensibility” – the idea that it’s not just degradation after failure, but adapting when you get close to the boundaries of failure. Successful teams do this all the time, but no attention is paid to it, because there’s no drama in a non-outage.

That’s what I find fascinating: instead of starting with an outage and explaining what a team lacked or didn’t do, we could look at all the things that make for an outage-less day. Many of the expertise ingredients of outage-less days are unspoken and come from “muscle memory” and rules-of-thumb that engineers have developed tacitly over the years. I want to discover all of that.

Theo: How do you think the field of Resilience Engineering can improve the data science that happens around telemetry analysis in complex systems?

John: Great question! I think a really fertile area is to explore the qualitative aspects of how people make sense of telemetry data, at different levels (aggregate, component, etc.) and find ways that use quantitative methods to provide stronger signals than the user could do on their own. An example of this might be to explore expert critiquing systems, where a monitoring system doesn’t try to be “intelligent” but instead provides options/directions for diagnosis for the user to take, essentially providing decision support. This isn’t an approach I see taken yet, in earnest. Maybe Circonus can take a shot at it? 🙂

Theo: As two emerging fields of research are integrated into practice, there are bound to be problems. Can you make a wild prediction as to what some of these problems might be?

John: Agreed. I think it might be awkward like a junior high school dance. We have human factors, qualitative analysts and UX/UI folks on one side of the gymnasium, and statisticians, computer scientists, and mathematicians on the other. One of the more obvious potential quagmires is the insistence that each approach will be superior, resulting in a mangled mess of tools or worse: no progress at all. In a cartoon-like stereotype of the fields, I can imagine one camp designing with the belief that all bets must be placed on algorithms, no humans needed. And in the other camp, an over-focus on interfaces that ignores or downplays potential computational processing advantages.

If we do it well, both camps won’t play their solos at the same time, and will take the nuanced approach. Maybe data science and resilience engineering can view themselves as a great rhythm section of a jazz band.

Hallway Track: The Future of Monitoring

I’ve been in this “Internet industry” since around 1997. That doesn’t make me the first on the stage, but I’ve had a very wide set of experiences: from deep within the commercial software world to the front lines of open source and from the smallest startup sites to helping fifteen of the world’s most highly trafficked web sites. My focus for a long time was scalability, but that slowly morphed into general hacking and speaking. As a part of my rigorous speaking schedule, I’ve been to myriad conferences all around the globe; sometimes attending, sometimes chairing, but almost always speaking. I’ve often been asked: “Why do you travel so much? Why do you go to so many conferences?” The answer is simple: the people.

Some go to conferences for session material, perhaps even most attendees. In multi-track conferences, people tend to stick to one track or another. I’d argue that all conferences are inherently multi-tracked: you have whatever tracks are on the program, and you have the hallway track. The hallway track is where I go to learn, to feel small and to be truly inspired and excited about the challenges we’re collectively facing and the pain they’re causing.

The hallway track is like a market research group, a support group, a cheerleading sideline and a therapy session all in one. I like it so much, I founded the Surge conference at OmniTI to bring together the right people thinking about the right things with an ulterior and selfish motive to concoct the perfect hallway track. Success!

Now for the next experiment: can we emulate a hallway track conversation from the observer’s perspective? Would an online Q&A between me and a variety of industry luminaries be interesting? I hope so, and we’re going to find out.

Case for a Broader, Deeper, Coordinated Approach to Monitoring Your Business

Effective monitoring is essential for running a successful web business. It improves business performance by accelerating response to issues and opportunities that arise from web application operations, helping your company get and keep customers, boost revenue and build brand reputation.

Regardless of their role, everyone responsible for the success of the business needs the ability to assess its status at any given point. Adopting a holistic approach to monitoring that integrates business and technology goals and metrics provides executives, analysts and engineers with a clear picture of how the entire business is operating. It also provides invaluable data on trends and component interactions to guide planning, troubleshooting and strategy optimization. While system engineers don’t need to understand the details of marketing, they should be aware of their company’s marketing objectives and how the web applications they support contribute to, and are affected by, those objectives. Likewise, the CEO doesn’t need to know how the web applications work in the background, but should be able to correlate the importance of key operating metrics, such as email bounce rates for an e-commerce marketing business, and their impact on costs, revenue and market perception.

While almost all web businesses perform some level of monitoring, companies would benefit by adopting a broader, more sophisticated and proactive monitoring strategy. Use the approach recommended in this paper to determine the business objectives, measures and thresholds that define the success of your web application and will drive your monitoring strategy. Create a dashboard that combines this business and technical information to produce a visually impactful, holistic view of your web business performance. Review existing web applications to ensure monitoring is sufficient and used effectively. If your current sources of monitoring data are insufficient, research, acquire, learn and deploy the right set of monitoring tools to support your new guidelines. When developing new web applications, incorporate the design and construction of business and functionality monitors within the scope of the projects to focus efforts on the most important success measures and maximize the benefit of monitoring efforts once the application is deployed.

To learn more of the case for taking a broader, deeper and more coordinated approach to monitoring, read Monitoring the Big Picture: A Modern Approach for Web Application Monitoring. It provides technical and business managers a greater understanding of the role and importance of monitoring in managing their web businesses. It will illustrate how a holistic, multi-disciplinary monitoring program can solve complex issues that cross business and technical boundaries and drive real improvements in business performance.

Distributed Systems Testing or Why My Hair Is Falling Out

Here at Circonus, our infrastructure is highly distributed. Many of the functions of Circonus are distinct systems that communicate with each other. In addition, we use a distributed data store for storing and retrieving data for graphs. Of course, testing a system is necessary to ensure that everything keeps working smoothly, but testing distributed systems like ours can be extraordinarily difficult.

There are a bunch of issues that come up when testing distributed systems – accounting for asynchronous delivery among nodes in a distributed data store, assuring that all data has been replicated and stored properly across the data cluster, making sure that the entire system works end-to-end, dealing with and recovering from individual component or data storage node failure, and more. These issues can be extremely difficult to test, as anyone who has worked on a distributed system can attest. I have been working on distributed systems for years, and have developed a few strategies for dealing with these and other issues.

In the July issue of ACM Queue, I discuss how we deal with these and other issues when testing our systems. To learn about our approach to testing in more detail, check out Testing A Distributed System.

Also, if tackling complex infrastructures is your dream job, come work with us!

Introducing Leo – A tool to automatically set up & configure NAD!

Leo is an extension to the NAD (Node Agent Daemon) client which will automatically create a check, a set of graphs, and a worksheet for a particular host. Leo’s goal is to make configuration of a host as simple as possible.

After running a simple command line script, you will be able to log into your Circonus account and view graphs for CPU Usage, Disk, Network, and Memory utilization, as well as a worksheet for your host. Leo will prompt you for information such as IP or hostname, Circonus auth token, broker id, and the location of a config file. Then, it will take that information and give it to NAD, which will use it to create the check, graphs, and worksheet.

If you decide to try Leo, we would love to hear your feedback. You can send any comments, including what you liked, what you thought could be improved, and any questions, to:

To get started with Leo, visit our YouTube channel and watch our tutorials for installing and configuring Leo:

Installation Tutorial


  1. This program is installed on an Amazon Web Services EC2 instance running CentOS.
  2. You must have Node.js and NAD already installed.

More info on NAD is available on github here.

There are two ways to install Leo: wget and git clone.


# wget
# unzip
# cd leo-master
# npm install

git clone

# git clone
# cd leo
# npm install

There is a third way to install Leo that is not included in the video. This method takes advantage of the fact that Leo is published as an npm module named circonus-leo. However, this method is not recommended because it requires users to go through the node_modules directory before they can reach Leo, which adds an extra step to the process of running Leo.


# npm install

Configuration Tutorial


The instructions below assume that you installed Leo using git clone. If you used wget or npm, the instructions are the same, but you access Leo through the leo-master directory for wget, and go through node_modules/leo for npm.

To get an auth/API token:

  1. On your Circonus account, go to “API Tokens” under the “User” section of the Main Menu and click “New API Token +”
  2. In your terminal, write:
    # leo/bin/circonus-setup -k ["YOUR API TOKEN"] -t ["YOUR IP ADDRESS OR HOSTNAME"]
  3. Hit enter. You will receive an error saying “App: nad still pending approval”
  4. Then go back to your Circonus account and refresh your “API Tokens” page. You should now have an option to allow NAD access.
  5. Click the “Allow Access” option.

To find your Broker Id:

  1. Go to your Dashboard and click the “manage brokers” option above the map that displays all of the brokers.
  2. Click the menu symbol (the little hamburger) to the left of the broker you want to use.
  3. Click “view API object”. The number that comes after ‘”_cid”: “/broker/’ is your broker id. For example, if the API object read “_cid”: “/broker/3”, the broker id would be 3.
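If you would rather not read the number off the screen, the last step can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and the only field it relies on is the "_cid" value described above.

```python
import json

def broker_id_from_api_object(api_object_text):
    """Extract the numeric broker id from the JSON text of a broker API object.

    Hypothetical helper: it only relies on the "_cid" field having the
    form "/broker/<number>", as in the example above.
    """
    api_object = json.loads(api_object_text)
    cid = api_object["_cid"]
    prefix = "/broker/"
    if not cid.startswith(prefix):
        raise ValueError("unexpected _cid value: " + cid)
    return int(cid[len(prefix):])

# The example from the steps above.
example_object = '{"_cid": "/broker/3"}'
broker_id = broker_id_from_api_object(example_object)
print(broker_id)
```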

You can either configure Leo with one command line request that contains all of your information or run Leo and let it prompt you for your info.

Command Line Request

This example is using the default settings, which includes a JSON check, 4 graphs (CPU Usage, Disks, Network, and Memory), and a worksheet containing that check and those 4 graphs:

 # leo/bin/circonus-setup -k ["YOUR API TOKEN"] -t ["YOUR IP ADDRESS OR HOSTNAME"] -b ["BROKER ID"] --alldefault 

Once you hit enter, it will prompt you to save your settings to a config file. Either enter in the name of the file to which you want your information to be saved, or just hit enter to skip this step. Then it should tell you that 1 check, 4 graphs, and 1 worksheet have been created.

Letting Leo Prompt You

# leo/bin/circonus-setup 

After running this command, Leo will prompt you for your auth token, target, broker id, whether or not you want to use the default settings (if you say no, you can choose the type of check you want to create and the metrics you want), and a config file to which it will send your information. It will then create a check, graphs, and a worksheet based on the information you provided.

Let’s look at Leo in action through the creation of a JSON check and its counterparts:

Checks & Metrics

Leo can create either a JSON check or a PostgreSQL check containing up to 155 different metrics.



One worksheet will be created for each configured check.



If you choose to create a JSON check, Leo and NAD will create a graph for CPU, Disk space, Memory, and Network utilization. For a PostgreSQL check, Leo and NAD will create a PostgreSQL Connections graph.


Metrics from Custom Apps…Easy!

Many developers are looking for a platform to which they can send arbitrary data from their custom applications to collect and visualize those metrics, as well as alert on specific thresholds. With Circonus’s ability to accept and parse raw JSON, it’s easy to send metrics from custom applications into the system. More information on JSON parsing can be found here or in the User Docs, but the steps below will get you up and running quickly.

1.) The first step for sending JSON data to Circonus is to create an HTTPTrap check. Under the Checks page, click on “New Check +” in the upper right corner, then expand the JSON option and choose “Push (HTTPTrap)”. Select the HTTPTrap broker from the list, then set up the host and secret. Click on “Test Check”, then “Finish”, even though there are no metrics selected.

2.) Now that the check is created, find that check and go into the details. There will be a “Data Submission URL” listed, which is the URL to which you will PUT the data from your application. Once the data is being submitted at regular intervals (either as frequently as you have it, or every 30 seconds if it is a sample), you can go back into the check to enable the metrics. Alternatively, you can use the Check Bundle API to manage the metrics and enable any metrics that are present but disabled. You can also enable histogram collection using the same methods.
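As a rough illustration of step 2, the sketch below builds an HTTP PUT request carrying a JSON document using only the Python standard library. The URL is a placeholder (use the "Data Submission URL" shown in your check details), and the metric names are made up for the example. The request is built but not sent, so the snippet runs without network access.

```python
import json
import urllib.request

def build_put_request(submission_url, metrics):
    """Build (but do not send) an HTTP PUT request with metrics as JSON."""
    body = json.dumps(metrics).encode("utf-8")
    return urllib.request.Request(
        submission_url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )

# Placeholder URL and hypothetical metric names, for illustration only.
request = build_put_request(
    "https://example.invalid/your-data-submission-url",
    {"queue_depth": 12, "checkout_latency_ms": 87.5},
)

# To actually submit the data, uncomment the lines below:
# with urllib.request.urlopen(request, timeout=10) as response:
#     print(response.status)
```

In a real application you would call something like this on a timer, so the data arrives at the regular interval the check expects.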

Once data is being collected, you can then start graphing, alerting, streaming to dashboards, and performing analytics on your data. Additionally, if you add other check types to your Circonus account, you can compare these custom metrics to other data to get the full picture of what is really going on at any given time.

Video: Architecture of a Distributed Analytics/Storage Engine for Massive Time-Series Data

The numerical analysis of time-series data isn’t new. The scale of today’s problems is. With millions of concurrent data streams, some of which run at 1MM samples per second, storing the data and making it continuously available for analysis is a daunting challenge.

At Circonus, we designed such a solution. Our CEO, Theo Schlossnagle, during Applicative 2015, discusses the approach and the technical details of how the system was constructed. Check out his talk…

The Problem with Math: Why Your Monitoring Solution is Wrong

Math is perfect – the most perfect thing in the world. The problem with math is that it is perpetrated by us imperfect humans. Circonus has long aimed to bring deep numerical analysis to business telemetry data. That may sound like a lot of mumbo-jumbo, but it really means we want better answers to better questions about all that data your business creates. Like it or not, to do that, we need to do maths and to do them right.

That’s not how it works

I was watching a line of e-surance commercials that have recently aired wherein people execute Internet-sounding tasks in nonsensical ways, like posting photographs to your [Facebook] wall by pinning things to your living room wall while your friends sit on the couch to observe, and I was reminded of something I see too often: bad math. Math to which I’m inclined to rebut: “That’s not how it works; that’s not how any of this works!”

A great example of such bad math is the inaccurate calculation of quantiles. This may sound super “mathy,” but quantiles have wide applications and there is a strong chance that you need them for things such as enforcing service level agreements and calculating billing and payments.

What is a quantile? First we’ll use the syntax q(N, v) to represent a quantile where N is a set of samples and v is some number between 0 and 1, inclusive. Herein, we’ll assume some set N and just write a quantile as q(v). Remember that 0% is 0 and 100% is actually 1.0 in probability, so the quantile is simply asking what sample in my set represents a number such that v of the samples are less than it and (1-v) of the samples are greater than it. This may sound complicated. Most descriptions of math concepts are a bit opaque at first, but a few examples can be quite illustrative.

  • What is q(0)? What sample is such that 0 (or none) in the set are smaller and the rest are larger? Well, that’s easy: the smallest number in the set. q(0) is another way of writing the minimum.
  • What is q(1)? What sample is such that 1 (100% or all) in the set are smaller and none are larger? Also quite simple: the largest number in the set. q(1) is another way of writing the maximum.
  • What is q(0.5)? What sample is such that 50% (or half) in the set are smaller and the other half are larger? This is the middle or in statistics, the median. q(0.5) is another way of writing the median.
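The three examples above can be checked with a small sketch. The function q below is a hypothetical helper, not how Circonus computes quantiles; it assumes the simple "nearest-rank" convention, which is only one of several common ways to define a sample quantile. The latency values are made up.

```python
import math

def q(samples, v):
    """Nearest-rank quantile: the smallest sample such that at least a
    fraction v of the samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0.0 <= v <= 1.0:
        raise ValueError("v must be between 0 and 1, inclusive")
    ordered = sorted(samples)
    # Ranks are 1-based; q(0) maps to the first (smallest) sample.
    rank = max(1, math.ceil(v * len(ordered)))
    return ordered[rank - 1]

latencies = [12, 7, 3, 25, 18]  # made-up samples

print(q(latencies, 0))    # the minimum
print(q(latencies, 1))    # the maximum
print(q(latencies, 0.5))  # the median
```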

SLA calculations – the good, the bad and the ugly

Quite often when articulating Internet bandwidth billing scenarios, one will measure the traffic over each 5 minute period throughout an entire month and calculate q(0.95) over those samples. This is called 95th percentile billing. In service level agreements, one can stipulate that the latency for a particular service must be faster than a specific amount (some percentage of the time), or that some percentage of all interactions with said service must be at a specific latency or faster. Why are those methods different? In the first, your set of samples is calculated over discrete buckets of time, whereas in the second your samples are simply the latencies of each service request. As a note to those writing SLAs, the first is dreadfully difficult to articulate and thus nigh-impossible to calculate consistently or meaningfully; the second just makes sense. While discrete time buckets might make sense for availability-based SLAs, they make little-to-no sense for latency-based SLAs. As “slow is the new down” is adopted across modern businesses, simple availability-based SLAs are rapidly becoming irrelevant.

So, let’s get more concrete: I have an API service with a simply stated SLA: 99.9% of my accesses should be serviced in 100 or fewer milliseconds. It might be a simple statement, but it is missing a critical piece of information: over what time period is this enforced? Not specifying the time unit of SLA enforcement is most often the first mistake people make. A simple example will best serve to illustrate why.

Assume I have an average of 5 requests per second to this API service. “Things go wrong” and I have twenty requests that are served at 200ms (significantly above our 100ms requirement) during the day: ten slow requests at 9:02am and the other ten offenders at 12:18pm. Some back-of-the-napkin math says that as long as fewer than 1 request out of 1000 is slow, I’m fine (99.9% are still fast enough). As I far exceed 20,000 total requests during the day, I’ve not violated my SLA… However, if I enforce my SLA on five-minute intervals, I have 1500 requests occurring between 9:00 and 9:05 and 10 slow responses. 10 out of 1500 is… well… there goes my SLA. Same blatant failure from 12:15 to 12:20. So, based on a five-minute-enforcement method I have 10 minutes of SLA violation during the day vs no SLA violation whatsoever using a one-day-enforcement method. But wait… it gets worse. Why? Bad math.
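The napkin math is easy to reproduce. The sketch below uses the numbers from the example (5 requests per second, two incidents of ten slow requests each); the function and variable names are just for illustration.

```python
REQUESTS_PER_SECOND = 5
ALLOWED_SLOW_FRACTION = 0.001  # 99.9% of requests must be fast enough

def violates_sla(slow_requests, total_requests):
    """True when more than 0.1% of the requests in the period were slow."""
    return slow_requests / total_requests > ALLOWED_SLOW_FRACTION

# One-day enforcement: 20 slow requests out of a full day of traffic.
requests_per_day = REQUESTS_PER_SECOND * 24 * 60 * 60
day_violation = violates_sla(20, requests_per_day)

# Five-minute enforcement: 10 slow requests between 9:00 and 9:05.
requests_per_window = REQUESTS_PER_SECOND * 5 * 60
window_violation = violates_sla(10, requests_per_window)

print(requests_per_day, day_violation)        # 432000 False
print(requests_per_window, window_violation)  # 1500 True
```

The same twenty slow requests pass or fail depending purely on the enforcement period, which is why the period has to be part of the SLA.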

Many systems out there calculate quantiles over short periods of time (like 1 minute or 5 minutes). Instead of storing the latency for every request, the system retains 5 minutes of these measurements, calculates q(0.999), and then discards the samples and stores the resulting quantile for later use. At the end of the day, you have q(0.999) for each 5 minute period throughout the day (288 of them). So given 288 quantiles throughout the day, how do you calculate the quantile for the whole day? Despite the fictitious answer some tools provide, the answer is you don’t. Math… it doesn’t work that way.

There are only two magical quantiles that will allow this type of reduction: q(0) and q(1). The minimum of a set of minimums is indeed the global minimum; the same is true for maximums. Do you know what the q(0.999) of a set of q(0.999)s is? Or what the average of a set of q(0.999)s is? Hint: it’s not the answer you’re looking for. Basically, if you have a set of 288 quantiles representing each of the day’s five minute intervals and you want the quantile for the whole day, you are simply out of luck. Because math.
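A tiny experiment shows the difference. The sketch below uses made-up latency windows of different sizes and a hypothetical nearest-rank q helper (one of several quantile conventions). It uses the median rather than q(0.999) only to keep the sample sets small; the same problem applies to any quantile other than q(0) and q(1).

```python
import math

def q(samples, v):
    """Nearest-rank quantile (one of several possible conventions)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(v * len(ordered)))
    return ordered[rank - 1]

# Three made-up windows of latencies (ms) with different request counts.
windows = [
    [1, 1, 1, 1, 1, 1, 1, 1, 1],
    [50, 60, 70],
    [80, 90, 100],
]
all_samples = [x for window in windows for x in window]

# q(0) and q(1) reduce correctly across windows.
min_of_mins = min(q(w, 0) for w in windows)
max_of_maxes = max(q(w, 1) for w in windows)

# The median (or any other quantile) does not.
per_window_medians = [q(w, 0.5) for w in windows]
median_of_medians = q(per_window_medians, 0.5)
true_median = q(all_samples, 0.5)

print(min_of_mins, q(all_samples, 0))    # 1 1
print(max_of_maxes, q(all_samples, 1))   # 100 100
print(median_of_medians, true_median)    # 60 1
```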

In fact, the situation is quite dire. If I calculated the quantile of the aforementioned 9:00 to 9:05 time interval where ten samples of 1500 are 200ms, the q(0.999) is 200ms despite the other 1490 samples being faster than 100ms. Devastatingly, if the other 1490 samples were 150ms (such that every single request was over the prescribed 100ms limit), the q(0.999) would still be 200ms. Because I’ve tossed the original samples, I have no idea if all of my samples violated the SLA or just 0.1% of them. In the worst case scenario, all of them were “too slow” and now I have 3000 requests that were too slow. While 20 requests aren’t enough to break the day, 3000 most certainly are and my whole day is actually in violation. Because the system I’m using for collecting quantile information is doing math wrong, the only reasonable answer to “did we violate the SLA today?” is “I don’t know, maybe.” It doesn’t need to be like this – Circonus calculates these quantiles correctly.
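The information loss is easy to demonstrate. In the sketch below, two five-minute windows produce exactly the same q(0.999) even though one has 10 slow requests and the other has 1500. The q helper is a hypothetical nearest-rank implementation, and the 50ms figure for the fast requests is an assumed value (the example above only says they were faster than 100ms).

```python
import math

def q(samples, v):
    """Nearest-rank quantile (one of several possible conventions)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(v * len(ordered)))
    return ordered[rank - 1]

SLA_LIMIT_MS = 100

# Window A: 1490 fast requests (50 ms is an assumed value) plus 10 slow ones.
window_a = [50] * 1490 + [200] * 10
# Window B: every single request is over the limit.
window_b = [150] * 1490 + [200] * 10

quantile_a = q(window_a, 0.999)
quantile_b = q(window_b, 0.999)

slow_a = sum(1 for latency in window_a if latency > SLA_LIMIT_MS)
slow_b = sum(1 for latency in window_b if latency > SLA_LIMIT_MS)

print(quantile_a, quantile_b)  # 200 200
print(slow_a, slow_b)          # 10 1500
```

Once the samples are discarded, the stored 200ms cannot tell these two windows apart.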

An aside on SLAs: While some of this might be a bit laborious to follow, the takeaway is to be very careful how you articulate your SLAs or you will often have no idea if you are meeting them. I recommend calculating quantiles on a day-to-day basis (and measuring all days in UTC so the abomination that is daylight saving time never foils you). So to restate the example SLA above: 99.9% or more of the requests occurring on a 24-hour calendar day (UTC) shall be serviced in 100ms or less time. If you prefer to keep your increments of failure lower, you can opt for an hour-by-hour SLA instead of a day-by-day one. I do not recommend stating an SLA in anything less than one-hour spans.

Bad math in monitoring – don’t let it happen to you

Quantiles are an example where the current methods that most tools use are simply wrong and there is no trick or method that can help. However, something that I’ve learned working on Circonus is how often other tools screw up even the most basic math. I’ll list a few examples in the form of tips, without the indignity of attributing them to specific products. (If you run any of these products, you might recognize the example… or at some point in the future you will either wake up in a cold sweat or simply let loose a stream of astounding expletives.)

  • The average of a set of minimums is not the global minimum. It is, instead, nothing useful.
  • The simple average of a set of averages of varying sample sizes isn’t the average of all the samples combined. It is, instead, nothing useful.
  • The average of the standard deviations of separate sets of samples is not the standard deviation of the combined set of samples. It is, instead, nothing useful.
  • The q(v) of a set of q(v) calculated over sample sets is not the q(v) of the combined sample set. While creative, it is nothing useful.
  • The average of a set of q(v) calculated over sample sets is, you guessed it, nothing useful.
  • The average of a bunch of rates (which are nothing more than a change in value divided by a change in time: dv/dt) with varying dts is not the damn average rate (and a particularly horrible travesty).
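Two of these tips can be verified in a few lines. The sketch below uses made-up numbers chosen to make the error obvious: sample sets of very different sizes, and rate intervals of very different durations.

```python
from statistics import mean

# Average of averages with varying sample sizes (made-up values).
set_a = [10] * 99   # 99 samples, average 10
set_b = [1000]      # 1 sample, average 1000

average_of_averages = mean([mean(set_a), mean(set_b)])
true_average = mean(set_a + set_b)

# Average of rates with varying dt: (change in value, change in time).
intervals = [(100, 1), (100, 99)]  # 100 units in 1s, then 100 units in 99s

average_of_rates = mean(dv / dt for dv, dt in intervals)
true_rate = sum(dv for dv, _ in intervals) / sum(dt for _, dt in intervals)

print(average_of_averages, true_average)  # 505 19.9
print(average_of_rates, true_rate)        # roughly 50.5 versus 2.0
```

In both cases the correct answer weights each input by how much data (or time) it covers, which the naive average throws away.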

At Circonus we learned early on that math is both critically important to our mission and quite challenging for many of our customers to understand. This makes it imperative that we not screw it up. Many new adopters contact support asking for an explanation as to why the numbers they see in our tool don’t match their existing tools. We have to explain that we thought they deserved the right answer. As for their existing tools: that’s not how it works; that’s not how any of this works.

Wrangling Elephants in the Cloud

Yonah Russ is a hands-on Technology Executive, System Architect, and Performance Engineer. He is founder of DonateMyFee. You can read more articles by Yonah on LinkedIn where you will also find the original version of this post.

You know the elephant in the room, the one no one wants to talk about. Well it turns out there was a whole herd of them hiding in my cloud. There’s a herd of them hiding in your cloud too. I’m sure of it. Here is my story and how I learned to wrangle the elephants in the cloud.

Like many of you, my boss walked into my office about three years ago and said, “We need to move everything to the cloud.” At the time, I wasn’t convinced that moving to the cloud had technical merit. The business, on the other hand, had decided that, for whatever reason, it was absolutely necessary.

As I began planning the move, selecting a cloud provider, picking tools with which to manage the deployment, I knew that I wasn’t going to be able to provide the same quality of service in a cloud as I had in our server farm. There were too many unknowns.

The cloud providers don’t like to give too many details on their setups nor do they like to provide many meaningful SLAs. I have very little idea what hardware I’m running. I have almost no idea how it’s connected. How many disks I’m running on? What RAID configuration? How many IOPS can I count on? Is a disk failing? Is it being replaced? What will happen if the power supply blows? Do I have redundant network connections?

Whatever it was that made the business decide to move, it trumped all these unknowns. In the beginning, I focused on getting what we had from one place to the other, following whichever tried and true best practices were still relevant.

Since then, I’ve come up with these guiding principles for working around the unknowns in the cloud.


  • Develop in the cloud
  • Develop for failure
  • Automate deployment to the cloud
  • Distribute deployments across regions
  • Monitor everything
  • Use multiple providers
  • Mix and match private cloud

Wrangling elephants for beginners:

Develop in the cloud.

Developers invariably want to work locally. It’s more comfortable. It’s faster. It’s why you bought them a crazy expensive MacBook Pro. It is also nothing like production and nothing developed that way ever really works the same in real life.

If you want to run with the IOPS limitations of standard Amazon EBS or you want to rely on Amazon ELBs to distribute traffic under sudden load, you need to have those limitations in development as well. I’ve seen developers cry when their MongoDB was deployed to EBS, and I’ve seen ELBs make 40% of a huge media campaign disappear.

Develop for failure.

Cloud providers will fail. It is cheaper for them to fail and, in the worst case, credit your account for some machine hours, than it is for them to buy high quality hardware and set up highly available networks. In many cases, the failure is not even a complete and total failure (that would be too easy). Instead, it could just be some incredibly high response times which your application may not know how to deal with.

You need to develop your application with these possibilities in mind. Chaos Monkey by Netflix is a classic, if not over-achieving, example.
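As a rough sketch of what planning for "slow" failures can look like, the snippet below wraps a call in a deadline, retries it with exponential backoff, and finally falls back to a default value. The name `call_with_deadline` and all of its parameters are made up for illustration; they do not come from any particular library, and real code would likely lean on the timeout options of its HTTP or database client instead.

```python
import concurrent.futures
import time

# A shared worker pool. A call that times out keeps running in its worker
# thread, so we deliberately do not wait for it before moving on.
_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=8)


def call_with_deadline(fn, timeout_s, attempts=3, backoff_s=0.1, fallback=None):
    """Call fn(), treating a slow answer the same way as a failed one.

    Each attempt is abandoned after timeout_s seconds. Between attempts we
    sleep backoff_s, 2 * backoff_s, 4 * backoff_s, and so on. If every
    attempt fails or times out, the fallback value is returned.
    """
    for attempt in range(attempts):
        future = _POOL.submit(fn)
        try:
            return future.result(timeout=timeout_s)
        except concurrent.futures.TimeoutError:
            pass  # too slow: give up on this attempt
        except Exception:
            pass  # outright failure: also worth another attempt
        if attempt < attempts - 1:
            time.sleep(backoff_s * (2 ** attempt))
    return fallback
```

The point of the design is that the caller always gets an answer within a bounded time, even if that answer is only a cached or degraded one.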

Automate deployment to the cloud.

I’m not even talking about more complicated, possibly overcomplicated, auto-scaling solutions. I’m talking about when it’s 3 a.m. and your customers are switching over to your competitors. Your cloud provider just lost a rack of machines including half of your service. You need to redeploy those machines ASAP, possibly to a completely different data center.

If you’ve automated your deployments and there aren’t any other hiccups, it will hopefully take less than 30 minutes to get back up. If not, well, it will take what it takes. There are many other advantages to automating your deployments but this is the one that will let you sleep at night.

Distribute deployments across regions.

A pet peeve of mine is the mess that Amazon has made with their “availability zones.” While the concept is a very easy-to-implement solution (from Amazon’s point of view) to the logistical problems involved in running a cloud service, it is a constantly overlooked source of unreliability for beginners choosing AWS. Even running a multi-availability zone deployment in Amazon only marginally increases reliability, whereas deploying to multiple regions can be much more beneficial with a similar amount of complexity.

Whether you use Amazon or another provider, it is best to build your service from the ground up to run in multiple regions, even if only in an active/passive capacity. Aside from the standard benefits of a distributed deployment (mitigation of DDoS attacks and uplink provider issues, lower latency to customers, disaster recovery, etc.), running in multiple regions will protect you against regional problems caused by hardware failure, regional maintenance, or human error.
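To make the active/passive idea a little more concrete, here is a minimal sketch of the decision logic only: regions are listed in priority order, and traffic moves to the next one after a few consecutive failed health checks. The class name `RegionFailover`, the `max_failures` threshold, and the region labels are assumptions made for illustration; how you actually run health checks and redirect traffic (DNS, load balancer, and so on) depends on your setup.

```python
from dataclasses import dataclass, field


@dataclass
class RegionFailover:
    regions: list                 # priority order: active first, passive after
    max_failures: int = 3         # consecutive failed checks before skipping a region
    failures: dict = field(default_factory=dict)

    def record(self, region, healthy):
        """Store the result of one health check for a region."""
        if healthy:
            self.failures[region] = 0
        else:
            self.failures[region] = self.failures.get(region, 0) + 1

    def active(self):
        """Return the highest-priority region that is not considered down."""
        for region in self.regions:
            if self.failures.get(region, 0) < self.max_failures:
                return region
        return None  # every region is failing its checks


failover = RegionFailover(["primary", "standby"], max_failures=2)
failover.record("primary", healthy=False)
failover.record("primary", healthy=False)
print(failover.active())  # the standby region takes over
```

Requiring several consecutive failures before switching is one simple way to avoid flapping between regions on a single bad check.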

Advanced elephant wrangling:

The four principles before this are really about being prepared for the worst. If you’re prepared for the worst, then you’ve managed 80% of the problem. You may be wasting resources or you may be susceptible to provider level failures, but your services should be up all of the time.

Monitor everything.

It is very hard to get reliable information about system resource usage in a cloud. It really isn’t in the cloud provider’s interest to give you that information. After all, they are making money by overbooking resources on their hardware. No, you shouldn’t rely on Amazon to monitor your Amazon performance, at least not entirely.

Even when they give you system metrics, it might not be the information you need to solve your problem. I highly recommend reading the book Systems Performance: Enterprise and the Cloud by Brendan Gregg.

Some clouds are better than others at providing system metrics. If you can choose them, great! Otherwise, you need to start finding other strategies for monitoring your systems. It could be to monitor your services higher up in the stack by adding more metric points to your code. It could be to audit your request logs. It could be to install an APM agent.
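Adding metric points to your code can be as small as timing the operations you care about. The sketch below is one hedged way to do that with only the standard library; the names `timed`, `summary`, and `METRICS` are hypothetical, and a real deployment would ship the samples to a monitoring system rather than keep them in memory.

```python
import statistics
import time
from collections import defaultdict
from contextlib import contextmanager

# In-memory store of raw timing samples, keyed by metric name (illustration only).
METRICS = defaultdict(list)


@contextmanager
def timed(name):
    """Record how long the wrapped block took, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        METRICS[name].append(time.perf_counter() - start)


def summary(name):
    """Summarize the samples for one metric, or return None if there are none."""
    samples = METRICS.get(name, [])
    if not samples:
        return None
    return {
        "count": len(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }


with timed("checkout"):
    time.sleep(0.01)  # stand-in for real work, e.g. a database call

print(summary("checkout"))
```

Keeping the raw samples rather than a running average leaves room to compute medians and worst cases later, which are usually more telling than the mean.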

Aside from monitoring your services, you need to monitor your providers. Make sure they are doing their jobs. Trust me that sometimes they aren’t.

I highly recommend monitoring your services from multiple points of view so you can corroborate the data from multiple observers. This happens to fit in well with the next principle.

Use multiple providers.

There is no way around it. Using one provider for any third party service is putting all your eggs in one basket. You should use multiple providers for everything in your critical path, especially the following four:

  • DNS
  • Cloud
  • CDN
  • Monitoring

Regarding DNS, there are some great providers out there. CloudFlare is a great option for the budget conscious. Route53 is not free but not expensive. DNSMadeEasy is a little bit pricier but will give you some more advanced DNS features. Some of the nastiest downtimes in the past year were due to DNS providers.

Regarding Cloud, using multiple providers requires very good automation and configuration management. If you can find multiple providers which run the same underlying platform (for example, Joyent licenses out their cloud platform to various other public cloud vendors), then you can save some work. In any case, using multiple cloud providers can save you from some downtime, bad cloud maintenance, or worse.

CDNs also have their ups and downs. The Internet is a fluid space and one CDN may be faster one day and slower the next. A good Multi-CDN solution will save you from the bad days, and make every day a little better at the same time.

Monitoring is great, but who’s monitoring the monitor? It’s a classic problem. Instead of trying to make sure every monitoring solution you use is perfect, use multiple providers from multiple points of view (application performance, system monitoring, synthetic polling).

These perspectives all overlap to some degree, backing each other up. If multiple providers start alerting, you know there is a real actionable problem, and from how they alert, you can sometimes home in on the root cause much more quickly.

If your APM solution starts crying about CPU utilization but your system monitoring solution is silent, you know that you may have a problem that needs to be verified. Is the APM system misreading the situation or has your system monitoring agent failed to warn you of a serious issue?
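The corroboration logic described above can be sketched in a few lines. This is only an illustration under simple assumptions: each observer reports a plain alerting/not-alerting flag, and the function name `classify`, the `quorum` parameter, and the observer labels are invented for the example.

```python
def classify(alerts, quorum=2):
    """Combine alert states from several independent observers.

    alerts maps an observer name to True (alerting) or False (silent).
    Returns a (verdict, firing) pair where firing lists the alerting observers.
    """
    firing = sorted(name for name, is_alerting in alerts.items() if is_alerting)
    if len(firing) >= quorum:
        return "actionable", firing   # several observers agree: a real problem
    if firing:
        return "verify", firing       # a lone alert: check who is mistaken
    return "ok", firing


print(classify({"apm": True, "system": False, "synthetic": False}))
print(classify({"apm": True, "system": True, "synthetic": False}))
```

A lone alert is not ignored; it is flagged for verification, because either the alerting observer is misreading the situation or the silent ones have missed something.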

Mix and match private cloud.

Regardless of all the above steps you can take to mitigate the risks of working in environments not completely in your control, really important business should remain in-house. You can keep the paradigm of software-defined infrastructure by building a private cloud.

Joyent licenses their cloud platform out to companies for building private clouds with enterprise support. This makes mixing and matching between public and private very easy. In addition, they have open sourced the entire cloud platform, so if you want to install without support, you are free to do so.


When a herd of elephants is stampeding, there is no hope of stopping them in their tracks. The best you can hope for is to point them in the right direction. Similarly, in the cloud, we will never get back the depth of visibility and control that we have with private deployments. What’s important is to learn how to steer the herd so we are prepared for the occasional stampede while still delivering high quality systems.