ACM – Testing a Distributed System

I want to sing the praises of one of our lead engineers, Phil Maddox, for authoring a very interesting paper, Testing a Distributed System, which was published in Communications of the ACM, Vol. 58 No. 9.

A brief excerpt follows:

“Distributed systems can be especially difficult to program for a variety of reasons. They can be difficult to design, difficult to manage, and, above all, difficult to test. Testing a normal system can be trying even under the best of circumstances, and no matter how diligent the tester is, bugs can still get through. Now take all of the standard issues and multiply them by multiple processes written in multiple languages running on multiple boxes that could potentially all be on different operating systems, and there is potential for a real disaster.

Individual component testing, usually done via automated test suites, certainly helps by verifying that each component is working correctly. Component testing, however, usually does not fully test all of the bits of a distributed system. Testers need to be able to verify that data at one end of a distributed system makes its way to all of the other parts of the system and, perhaps more importantly, is visible to the various components of the distributed system in a manner that meets the consistency requirements of the system as a whole.”

Read the entire paper here: Testing a Distributed System
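To make the excerpt's point concrete, here is a minimal sketch (mine, not the paper's) of the kind of end-to-end check it describes: write through one node, then poll the others until the value becomes visible or a deadline passes. The `NodeClient` class, its `write`/`read` methods, and the addresses are hypothetical stand-ins for whatever interface your system under test actually exposes.

```python
import time


class NodeClient:
    """Hypothetical client for one process/box in the system under test."""

    def __init__(self, address):
        self.address = address

    def write(self, key, value):
        raise NotImplementedError("wire this up to the real system under test")

    def read(self, key):
        raise NotImplementedError("wire this up to the real system under test")


def assert_eventually_consistent(writer, readers, key, value,
                                 timeout_s=30.0, poll_s=0.5):
    """Write via one node, then verify every reader sees the value before timeout_s."""
    writer.write(key, value)
    deadline = time.monotonic() + timeout_s
    pending = list(readers)
    while pending and time.monotonic() < deadline:
        # Keep only the nodes that have not yet converged, then back off briefly.
        pending = [node for node in pending if node.read(key) != value]
        if pending:
            time.sleep(poll_s)
    assert not pending, (
        f"value for {key!r} never reached: {[n.address for n in pending]}"
    )
```

In practice you would point it at real processes, e.g. `assert_eventually_consistent(NodeClient("10.0.0.1"), [NodeClient(a) for a in other_addrs], "order:42", "shipped")`, and tune the timeout to whatever consistency guarantee the system actually claims.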

The Future of Monitoring: Q&A with John Allspaw


John Allspaw is CTO at Etsy. John has worked in systems operations for over 14 years in biotech, government, and online media. He started out tuning parallel clusters running vehicle crash simulations for the U.S. government, and then moved on to the Internet in 1997. He built the backing infrastructures at Salon, InfoWorld, Friendster, and Flickr. He is a well-known industry pundit, speaker, blogger, and the author of Web Operations and The Art of Capacity Planning. Visit John’s blog


Theo: As you know, I live numbers. The future of monitoring is leaning strongly toward complex analytics on epic amounts of telemetry data. How do you think this will affect how operations and engineering teams work?

John: Two things come to mind. The first is that we could look at it in the same way the field is looking at “Big Data.” While we now have technologies to help us get answers to questions we have, it turns out that finding the right question is just as important. And you’re right: it’s surprisingly easy to collect a massive amount of telemetry data at a rate that outpaces our abilities to analyze it. I think the real challenge is one of designing systems that can make it easy to navigate this data without getting too much in our way.

I’m fond of Herb Simon’s saying “Information is not a scarce resource. Attention is.” I think that part of this challenge includes using novel ways of analyzing data algorithmically. I think another part, just as critical, is to design software and interfaces that can act as true advisors or partners. More often than not, I’m not going to know what I want to look at until I look around in these billions of time-series datasets. If we make it easy and effortless for a team to “look around” – maybe this is a navigation challenge – I’ll bet on that team being better at operations.

Theo: Given your (long) work in operations, you’ve seen good systems and bad systems, good teams and bad teams, good approaches and bad approaches. If you could describe a commonality of all the bads in one word, what would it be, and why?

John: Well, anyone who knows me knows that summarizing (forget about in one word!) is pretty difficult for me. 🙂 If I had to, I would say that what we want to avoid is being brittle. Brittle process, brittle architecture design, brittle incident response, etc. Being brittle in this case means we’re prepared only for the things we could imagine beforehand. The companies we grow and systems we build have too much going on to be perfectly predictable. Brittle is what you get when you bet all your chips on procedures, prescriptive processes, and software that takes on too much of its own control.

Resilience is what you get when you invest in preparing to be unprepared.

Theo: What was it about “Resilience Engineering” that sucked you in?

John: One of the things that really drew me into the field was the idea that we can have a different perspective on how we look at how successful work actually happens in our field. Traditionally, we judge ourselves on the absence of failures, and we assume almost tacitly that we can design a system (software, company org chart, financial model, etc.) that will work all the time, perfectly. All you have to do is: don’t touch it.

Resilience Engineering concepts assert something different: that success comes from people adapting to what they see happening, anticipating what limits and breakdowns the system is headed towards, and making adjustments to forestall them. In other words, success is the result of the presence of adaptive capacity, not the absence of failures.

This idea isn’t just plain-old “fault tolerance” – it’s something more. David Woods (a researcher in the field) calls this something “graceful extensibility” – the idea that it’s not just degradation after failure, but adapting when you get close to the boundaries of failure. Successful teams do this all the time, but no attention is paid to it, because there’s no drama in a non-outage.

That’s what I find fascinating: instead of starting with an outage and explaining what a team lacked or didn’t do, we could look at all the things that make for an outage-less day. Many of the expertise ingredients of outage-less days are unspoken; they come from “muscle memory” and rules of thumb that engineers have developed tacitly over the years. I want to discover all of that.

Theo: How do you think the field of Resilience Engineering can improve the data science that happens around telemetry analysis in complex systems?

John: Great question! I think a really fertile area is to explore the qualitative aspects of how people make sense of telemetry data, at different levels (aggregate, component, etc.) and find ways that use quantitative methods to provide stronger signals than the user could find on their own. An example of this might be to explore expert critiquing systems, where a monitoring system doesn’t try to be “intelligent” but instead provides options/directions for diagnosis for the user to take, essentially providing decision support. This isn’t an approach I’ve seen taken in earnest yet. Maybe Circonus can take a shot at it? 🙂
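For what it’s worth, here is a rough sketch of what such a critiquing-style check might look like, under entirely hypothetical metric names and thresholds: rather than acting on an anomaly, it ranks related series by how strongly they moved with the anomalous one and hands those back as suggested directions for a human to investigate. It uses `statistics.correlation`, which requires Python 3.10 or newer.

```python
from statistics import StatisticsError, correlation, fmean, pstdev


def zscore(series):
    """Distance of the latest sample from the series mean, in standard deviations."""
    mu, sigma = fmean(series), pstdev(series)
    return 0.0 if sigma == 0 else (series[-1] - mu) / sigma


def safe_corr(a, b):
    """Pearson correlation, treating constant series as uncorrelated."""
    try:
        return correlation(a, b)
    except StatisticsError:
        return 0.0


def suggest_diagnoses(target, related, z_threshold=3.0, top_n=3):
    """Return candidate series for a human to investigate, or [] if nothing looks off."""
    if abs(zscore(target)) < z_threshold:
        return []  # no anomaly detected; say nothing rather than guess
    ranked = sorted(related.items(),
                    key=lambda kv: abs(safe_corr(target, kv[1])),
                    reverse=True)
    return [name for name, _ in ranked[:top_n]]
```

Feeding it, say, a latency window as `target` and `{"db_connections": ..., "gc_pause_ms": ..., "queue_depth": ...}` as `related` yields a short list of leads rather than a verdict, which is the decision-support posture described above.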

Theo: As two emerging fields of research are integrated into practice, there are bound to be problems. Can you make a wild prediction as to what some of these problems might be?

John: Agreed. I think it might be awkward like a junior high school dance. We have human-factors researchers, qualitative analysts, and UX/UI folks on one side of the gymnasium, and statisticians, computer scientists, and mathematicians on the other. One of the more obvious potential quagmires is the insistence that each approach will be superior, resulting in a mangled mess of tools or worse: no progress at all. In a cartoon-like stereotype of the fields, I can imagine one camp designing with the belief that all bets must be placed on algorithms, no humans needed. And in the other camp, an over-focus on interfaces that ignores or downplays potential computational processing advantages.

If we do it well, both camps won’t play their solos at the same time, and will instead take a nuanced approach. Maybe data science and resilience engineering can view themselves as a great rhythm section of a jazz band.

Hallway Track: The Future of Monitoring

I’ve been in this “Internet industry” since around 1997. That doesn’t make me the first on the stage, but I’ve had a very wide set of experiences: from deep within the commercial software world to the front lines of open source and from the smallest startup sites to helping fifteen of the world’s most highly trafficked web sites. My focus for a long time was scalability, but that slowly morphed into general hacking and speaking. As a part of my rigorous speaking schedule, I’ve been to myriad conferences all around the globe; sometimes attending, sometimes chairing, but almost always speaking. I’ve often been asked: “Why do you travel so much? Why do you go to so many conferences?” The answer is simple: the people.

Some go to conferences for session material, perhaps most attendees even. In multi-track conferences, people tend to stick to one track or another. I’d argue that all conferences are inherently multi-tracked: you have whatever tracks are on the program, and you and I have the hallway track. The hallway track is where I go to learn, to feel small, and to be truly inspired and excited about the challenges we’re collectively facing and the pain they’re causing.

The hallway track is like a market research group, a support group, a cheerleading sideline, and a therapy session all in one. I like it so much that I founded the Surge conference at OmniTI to bring together the right people thinking about the right things, with the ulterior and selfish motive of concocting the perfect hallway track. Success!

Now for the next experiment: can we emulate a hallway track conversation from the observer’s perspective? Would an online Q&A between me and a variety of industry luminaries be interesting? I hope so, and we’re going to find out.