John Allspaw is CTO at Etsy. John has worked in systems operations for over 14 years in biotech, government, and online media. He started out tuning parallel clusters running vehicle crash simulations for the U.S. government, and then moved on to the Internet in 1997. He built the backing infrastructures at Salon, InfoWorld, Friendster, and Flickr. He is a well-known industry pundit, speaker, blogger, and the author of Web Operations and The Art of Capacity Planning. Visit John’s blog
John: Two things come to mind. The first is that we could look at it in the same way the field is looking at “Big Data.” While we now have technologies to help us get answers to questions we have, it turns out that finding the right question is just as important. And you’re right: it’s surprisingly easy to collect a massive amount of telemetry data at a rate that outpaces our abilities to analyze it. I think the real challenge is one of designing systems that can make it easy to navigate this data without getting too much in our way.
I’m fond of Herb Simon’s saying “Information is not a scarce resource. Attention is.” I think that part of this challenge includes using novel ways of analyzing data algorithmically. I think another part, just as critical, is to design software and interfaces that can act as true advisors or partners. More often than not, I’m not going to know what I want to look until I look around in these billions of time-series datasets. If we make it easy and effortless for a team to “look around” – maybe this is a navigation challenge – I’ll bet on that team being better at operations.
Theo: Given your (long) work in operations, you’ve seen good systems and bad systems, good teams and bad teams, good approaches and bad approaches. If you could describe a commonality of all the bads in one word, what would it be? and why?
John: Well, anyone who knows me knows that summarizing (forget about in one word!) is pretty difficult for me. 🙂 If I had to, I would say that what we want to avoid is being brittle. Brittle process, brittle architecture design, brittle incident response, etc. Being brittle in this case means that we can always be prepared for anything, as long as we can imagine it beforehand. The companies we grow and systems we build have too much going on to be perfectly predictable. Brittle is what you get when you bet all your chips on procedures, prescriptive processes, and software that takes on too much of its own control.
Resilience is what you get when you invest in preparing to be unprepared.
Theo: What was it about “Resilience Engineering” that sucked you in?
John: One of the things that really drew me into the field was the idea that we can have a different perspective on how we look at how successful work actually happens in our field. Traditionally, we judge ourselves on the absence of failures, and we assume almost tacitly that we can design a system (software, company org chart, financial model, etc.) that will work all the time, perfectly. All you have to do is: don’t touch it.
Resilience Engineering concepts assert something different: that success comes from people adapting to what they see happening, anticipating what limits and breakdowns the system is headed towards, and making adjustments to forestall them. In other words, success is the result of the presence of adaptive capacity, not the absence of failures.
This idea isn’t just plain-old “fault tolerance” – it’s something more. David Woods (a researcher in the field) calls this something “graceful extensibility” – the idea that it’s not just degradation after failure, but adapting when you get close to the boundaries of failure. Successful teams do this all the time, but no attention is paid to it, because there’s no drama in a non-outage.
That’s what I find fascinating: instead of starting with an outage and explain what a team lacked or didn’t do, we could look at all the things that make for an outage-less day. Many of the expertise ingredients of outage-less days are unspoken, come from “muscle memory” and rules-of-thumb that engineers have developed tacitly over the years. I want to discover all of that.
Theo: How do you think the field of Resilience Engineering can improve the data science that happens around telemetry analysis in complex systems?
John: Great question! I think a really fertile area is to explore the qualitative aspects of how people make sense of telemetry data, at different levels (aggregate, component, etc.) and find ways that use quantitative methods to provide stronger signals than the user could do on their own. An example of this might be to explore expert critiquing systems, where a monitoring system doesn’t try to be “intelligent” but instead provides options/directions for diagnosis for the user to take, essentially providing decision support. This isn’t an approach I see taken yet, in earnest. Maybe Circonus can take a shot at it? 🙂
Theo: As two emerging fields of research are integrated into practice, there are bound to be problems. Can you make a wild prediction as to some of these problems might be?
John: Agreed. I think it might be awkward like a junior high school dance. We have human factors, qualitative analysts and UX/UI folks on one side of the gymnasium, and statisticians, computer scientists, and mathematicians on the other. One of the more obvious potential quagmires is the insistence that each approach will be superior, resulting in a mangled mess of tools or worse: no progress at all. In a cartoon-like stereotype of the fields, I can imagine one camp designing with the belief that all bets must be placed on algorithms, no humans needed. And in the other camp, an over-focus on interfaces that ignore or downplays potential computational processing advantages.
If we do it well, both camps won’t play their solos at the same time, and will take the nuanced approach. Maybe data science and resilience engineering can view themselves as a great rhythm section of a jazz band.