The Future of Monitoring: Q&A with Jez Humble


jez_humbleJez Humble is a lecturer at U.C. Berkeley, and co-author of the Jolt Award-winning Continuous Delivery : Reliable Software Releases through Build, Test and Deployment Automation (Addison-Wesley 2011) and Lean Enterprise : How High Performance Organizations Innovate at Scale (O’Reilly 2015), in Eric Ries’ Lean series. He has worked as a software developer, product manager, consultant and trainer across a wide variety of domains and technologies. His focus is on helping organisations deliver valuable, high-quality software frequently and reliably through implementing effective engineering practices.

Theo’s Intro:

It is my perspective that the world of technology will be a continual place eventually.  As services become more and more componentized they stand to become more independently developed and operated.  The implications on engineering design when attempting to maintain acceptable resiliency levels are significant.  The convergence on a continual world is simply a natural progression and will not be stopped.
Jez has taken to deep thought and practice around these challenges quite a bit ahead of the adoption curve, and has a unique perspective on where we are going, why we are going there and (likely vivid) images of the catastrophic derailments that might occur along the tracks.  While I spend all my time thinking about how people might have peace of mind that their systems and businesses are measurably functioning during and after transitions into this new world, my interest is compounded by the Circonus’ internal uses of continual integration and deployment practices for both our SaaS and on-premise customers.

THEO: Most of the slides, talks and propaganda around CI/CD (Continuous Integration/Continuous Delivery) are framed in the context of businesses launching software services that are consumed by customers as opposed to software products consumed by customers. Do you find that people need a different frame of mind, a different perspective or just more discipline when they are faced with shipping product vs. shipping services as it relates to continual practices?

JEZ: The great thing about continuous delivery is that the same principles apply whether you’re doing web services, product development, embedded or mobile. You need to make sure you’re working in small batches, and that your software is always releasable, otherwise you won’t get the benefits. I started my career at a web startup but then spent several years working on packaged software, and the discipline is the same. Some of the problems are different: for example, when I was working on go.cd, we built a stage into our deployment pipeline to do automated upgrade testing from every previous release to what was on trunk. But fundamentally, it’s the same foundations: comprehensive configuration management, good test automation, and the practice of working in small batches on trunk and keeping it releasable. In fact, one of my favourite case studies for CI/CD is HP’s LaserJet Firmware division — yet nobody is deploying new firmware multiple times a day. You do make a good point about discipline: when you’re not actually having to deploy to production on a regular basis it can be easy to let things slide. Perhaps you don’t pay too much attention to the automated functional tests breaking, or you decide that one long-lived branch to do some deep surgery on a fragile subsystem is OK. Continuous deployment (deploying to production frequently) tends to concentrate the mind. But the discipline is equally important however frequently you release.

THEO: Do you find that organizations “going lean” struggle more, take longer or navigate more risk when they are primarily shipping software products vs. services?

JEZ: Each model has its own trade-offs. Products (including mobile apps) usually require a large matrix of client devices to test in order to make sure your product will work correctly. You also have to worry about upgrade testing. Services, on the other hand, require development to work with IT operations to get the deployment process to a low-risk pushbutton state, and make sure the service is easy to operate. Both of these problems are hard to solve — I don’t think anybody gets an easy ride. Many companies who started off shipping product are now moving to a SaaS model in any case, so they’re having to negotiate both models, which is an interesting problem to face. In both cases, getting fast, comprehensive test automation in place and being able to run as much as possible on every check-in, and then fixing things when they break, is the beginning of wisdom.

THEO: Thinking continuously is only a small part of establishing a “lean enterprise.” Do you find engineers more easily reason about adopting CI/CD than other changes such as organizational retooling and process refinements? What’s the most common sticking point (or point of flat-out derailment) for organizations attempting to go lean?

JEZ: My biggest frustration is how conservative most technology organizations are when it comes to changing the way people behave. There are plenty of engineers who are happy to play with new languages or technologies, but god forbid you try and mess with their worldview on process. The biggest sticking point – whether it’s engineers, middle management or leadership – is getting them to change their behavior and ways of thinking.

But the best people – and organizations – are never satisfied with how they’re doing and are always looking for ways to improve.

The worst ones either just accept the status quo, or are always blowing things up (continuous re-orgs are a great example), lurching from one crisis to another. Sometimes you get both. Effective leaders and managers understand that it’s essential to have a measurable customer or organizational outcome to work towards, and that their job is to help the people working for them experiment in a disciplined, scientific way with process improvement work to move towards the goal. That requires that you actually have time and resources to invest in this work, and that you have people with the capacity for and interest in making things better.

THEO: Finance is precise and process oriented and often times bad things happen (people working from different/incorrect base assumptions) when there are too many cooks in the kitchen. This is why finance is usually tightly controlled by the CFO and models and representations are fastidiously enforced. Monitoring and analytics around that data shares a lot in common with respect to models and meanings. However, many engineering groups have far less discipline and control than do financial groups. Where do you see things going here?

JEZ: Monitoring isn’t really my area, but my guess is that there are similar factors at play here to other parts of the DevOps world, which is the lack of both an economic model and the discipline to apply it. Don Reinertsen has a few quotes that I rather like: “you may ignore economics, but economics won’t ignore you.” He also says of product development “The measure of execution in product development is our ability to constantly align our plans to whatever is, at the moment, the best economic choice.” Making good decisions is fundamentally about risk management: what are the risks we face? What choices are available to us to mitigate those risks? What are the impacts? What should we be prepared to pay to mitigate those impacts? What information is required to assess the probability of those risks occurring? How much should we be prepared to pay for that information? For CFOs working within business models that are well understood, there are templates and models that encapsulate this information in a way that makes effective risk management somewhat algorithmic, provided of course you stay within the bounds of the model. I don’t know whether we’re yet at that stage with respect to monitoring, but I certainly don’t feel like we’re yet at that stage with the rest of DevOps. Thus a lot of what we do is heuristic in nature — and that requires constant adaptation and improvement, which takes even more discipline, effort, and attention. That, in a department which is constantly overloaded by firefighting. I guess that’s a very long way of saying that I don’t have a very clear picture of where things are going, but I think it’ll be a while before we’re in a place that has a bunch of proven models with well understood trade-offs.

THEO: In your experience how do organizations today habitually screw up monitoring? What are they simply thinking about “the wrong way?”

JEZ: I haven’t worked in IT operations professionally for over a decade, but based on what I hear and observe, I feel like a lot of people still treat monitoring as little more than setting up a bunch of alerts. This leads to a lot of the issues we see everywhere with alert fatigue and people working very reactively. Tom Limoncelli has a nice blog post where he recommends deleting all your alerts and then, when there’s an outage, working out what information would have predicted it, and just collecting that information. Of course he’s being provocative, but we have a similar situation with tests — people are terrified about deleting them because they feel like they’re literally deleting quality (or in the case of alerts, stability) from their system. But it’s far better to have a small number of alerts that actually have information value than a large number that are telling you very little, but drown the useful data in noise.

THEO: Andrew Shaffer said that “technology is 90% tribalism and fashion.” I’m not sure about the percentage, but he nailed the heart of the problem. You and I both know that process, practice and methods sunset faster in technology than in most other fields. I’ll ask the impossible question… after enterprises go lean, what’s next?

JEZ: I actually believe that there’s no end state to “going lean.” In my opinion, lean is fundamentally about taking a disciplined, scientific approach to product development and process improvement — and you’re never done with that. The environment is always changing, and it’s a question of how fast you can adapt, and how long you can stay in the game. Lean is the science of growing adaptive, resilient organizations, and the best of those are always getting better. Andrew is (as is often the case) correct, and what I find really astonishing is that as an industry we have a terrible grasp of our own history. As George Santayana has it, we seem condemned to repeat our mistakes endlessly, albeit every time with some shiny new technology stack. I feel like there’s a long way to go before any software company truly embodies lean principles — especially the ability to balance moving fast at high quality while maintaining a humane working environment. The main obstacle is the appalling ineptitude of a large proportion of IT management and leadership — so many of these people are either senior engineers who are victims of the Peter Principle or MBAs with no real understanding of how technology works. Many technologists even believe effective management is an oxymoron. While I am lucky enough to know several great leaders and managers, they have not in general become who they are as a result of any serious effort in our industry to cultivate such people. We’re many years away from addressing these problems at scale.