In today’s world, the performance of your IT systems has a direct impact on your brand reputation and overall business revenue. A “good enough” approach to software performance is no longer good enough. This has led to the growing importance of SREs and a shift toward more sophisticated observability that requires moving beyond basic on/off monitoring to advanced monitoring techniques. Engineers must measure everything, understand the behavior of their systems, and build failure budgets that allow them to de-risk innovation.
In the following post, I share six rules SREs must always apply to make their software run better in production, particularly in large-scale distributed systems — rules that align with a more modern, advanced approach to monitoring and observability.
Rule #1: Fail quickly and safely: crash landings should be both fast and controlled.
The most important rule for engineers to follow when building a system — in particular a distributed system — is understanding how to fail quickly and safely. Performance issues are inevitable. When things go wrong, you don’t just keep trying to make it work as is. Rather, you need a procedure in place: shut this down, turn this off as quickly as possible, and so on. A pilot whose plane suddenly experiences engine failure doesn’t keep flying in the hope they’ll make it — they follow procedures for safely landing the plane. The same concept applies to software. Any part of your system can fail at any time, and knowing how to react is critical. There are many techniques for doing this that are beyond the scope of this post, but three points are worth remembering: the scope of failure should collapse quickly and completely; the time to failure should be measured in small multiples of normal service time; and nothing outside the scope of failure should be impacted.
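The second point — failing within a small multiple of normal service time — can be sketched with a simple deadline wrapper. This is an illustrative sketch, not a prescription: the names, the thread-pool approach, and the 4× multiplier are assumptions for the example.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_pool = ThreadPoolExecutor(max_workers=4)

NORMAL_SERVICE_TIME = 0.05          # 50 ms typical call time (illustrative)
DEADLINE = 4 * NORMAL_SERVICE_TIME  # fail within a small multiple of normal

def call_with_deadline(fn, *args):
    """Run fn, but give up quickly and cleanly if it overruns the deadline."""
    future = _pool.submit(fn, *args)
    try:
        return future.result(timeout=DEADLINE)
    except FutureTimeout:
        future.cancel()  # contain the failure; callers outside are unaffected
        raise RuntimeError("dependency overran its deadline; failing fast")
```

A caller that wraps a dependency this way fails in roughly 200 ms rather than hanging for the dependency’s full (possibly unbounded) stall, and the failure stays scoped to that one call.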
Rule #2: Post mortems are fundamental: pragmatic analysis is required to understand failure’s true nature.
Autopsies are not just for medicine. When a failure happens within your system, post mortem analysis is essential to preventing a repeat. Systems always follow instructions. In fact, there tend to be very few scenarios in which a system malfunctions in a non-deterministic way. So when it fails, it did what it was told — meaning it’s very likely it will do the same thing again. As your system grows, it’s unacceptable to have a repeated failure in your architecture that you cannot explain. You need to understand why this failure happened to prevent it from inevitably impacting even more of your users.
Rule #3: Use circuit breakers: circuit breakers are designed to avoid cascading failure.
The difference between a shock and an electrocution is real. Circuit breakers are a technique for failing gracefully. When something is about to fail, trying harder is usually not the right answer. You actually want the circuit to break, or flip. You need to design tolerances into your systems so that when a component or interaction starts to go wrong or run too slow, the system doesn’t just keep piling on more work. It’s like a traffic jam that causes a three-and-a-half-hour backup: when the road slows down, there is no way for the underlying system to stop putting traffic into the problem, and more traffic just makes it worse. By putting circuit breakers into your system, you control its behavior better. It’s much better to turn away 50% of your traffic than to serve 100% of your traffic poorly — say, with slow page loads. Circuit breakers ensure you let in only as many customers as you know you can serve. Allowing too many in degrades service for everyone, to the point where they abandon the site.
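The pattern above can be sketched in a few lines. This is a minimal illustration of the idea — the class name, thresholds, and three-state (closed/open/half-open) behavior are assumptions for the example, not a production implementation.

```python
import time

class CircuitBreaker:
    """Minimal circuit-breaker sketch (illustrative thresholds).

    After `max_failures` consecutive failures the circuit opens and calls
    are rejected immediately, instead of piling more work onto a struggling
    dependency. After `reset_after` seconds one trial call is let through
    (half-open); if it succeeds, the circuit closes again.
    """

    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # success closes the circuit
        return result
```

The key property is the fast rejection while open: callers get an immediate error they can handle (a cached page, a "please try later" response) instead of queuing behind a dying dependency.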
Rule #4: Understand system behavior: you cannot understand what you cannot measure.
Many times, when I’ve asked organizations how long it takes their API to service a call, the response is, “it’s pretty fast.” Fast is not a speed. If you run an API service, each call takes a measurable number of milliseconds to serve, and engineers must know exactly what that number is.
Software should be instrumented so that you can answer any question at a later point. If you’re asked how you improved the system, “I made it faster” is a lot less impactful than “I dropped the user experience from 700 milliseconds to 250 milliseconds and after this, shopping cart conversions and revenue increased.” If you’re measuring this, you can prove the impact of your work.
Engineers need to measure everything and err on the side of exposing everything they can in a piece of software. Only then can they understand performance changes and build robust models of behavior. Just remember: don’t rely on averages, and don’t use percentiles alone.
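The warning about averages is easy to demonstrate: a mean hides tail latency that percentiles expose. Here is a small sketch using the standard library (the function name and the choice of percentiles are illustrative):

```python
import statistics

def latency_summary(samples_ms):
    """Summarize latency samples in milliseconds.

    The mean alone hides tail latency; percentiles show what the
    slowest users actually experience.
    """
    qs = statistics.quantiles(samples_ms, n=100)  # 99 percentile cut points
    return {
        "mean": statistics.mean(samples_ms),
        "p50": qs[49],
        "p95": qs[94],
        "p99": qs[98],
    }
```

For a workload where 95% of calls take 100 ms and 5% take 1000 ms, the mean (145 ms) looks respectable while p99 reveals the second-long experiences — which is exactly why neither number is sufficient on its own.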
Rule #5: Have a failure budget: avoiding failure is impossible, so expect and manage failure.
Never set up your environment with the goal of achieving 100% uptime and 100% service quality. Rather, use the data from measuring your systems, as described in Rule #4, to build failure budgets. By measuring everything and defining what your uptime and performance are, you can identify when to introduce risks that allow you to move faster. Why not launch those features that were marked slightly unstable and test them out in production, where you can get faster feedback while still delivering on your quality-of-service requirements? This failure-budget methodology allows you to de-risk speed, and to define and reward success on improvement and competency, not just uptime.
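The arithmetic behind a failure budget is simple: the budget is whatever downtime your availability target leaves on the table. A minimal sketch (the function name and the 30-day window are assumptions; real budgets also weigh partial outages and degraded service, not just full downtime):

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of downtime an availability SLO permits over a window.

    slo: target availability as a fraction, e.g. 0.999 for "three nines".
    """
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo)
```

A 99.9% target over 30 days yields roughly 43 minutes of budget; as long as you are inside it, you can afford to ship those slightly unstable features and learn from production.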
Rule #6: Instrumentation and observability have no equals: instrument code for observability.
The only failure that matters is the one you’re experiencing right now. So it’s absolutely critical that everything you do in developing software, and all the practices you apply in operating it, lead you toward diagnosing that failure. I’ve seen software architectures where a single bug report from a customer has multiple engineers spending weeks trying to reproduce the error in development in order to fix it. However, reproducing failures outside production is highly error-prone, and the failures themselves are often elusive.
Instead, go into production live and figure it out there. And if you wreck a train car in the process, that’s OK — as long as you have a failure budget. Regulations like GDPR make this more difficult, because in production you are exposed to information that you may not be allowed to see. But there are ways to do this using canary systems. When you have a problem, it’s a lot easier to divert a problematic user to a canary system or infrastructure than it is to reproduce the entire problem in a development environment. There are methodologies and techniques for deploying production systems that give us easy access to isolating existing production problems so that we can solve them.
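Diverting a problematic user to a canary is, at its core, a routing decision. The sketch below shows the idea; `canary_users`, `handle_canary`, and `handle_prod` are hypothetical hooks standing in for your routing layer and the two deployments.

```python
def route_request(user_id, canary_users, handle_canary, handle_prod):
    """Send specific problematic users to a canary deployment.

    The canary runs the same code with extra instrumentation (verbose
    logging, tracing), so the failure can be observed live instead of
    being reproduced from scratch in development.
    """
    if user_id in canary_users:
        return handle_canary(user_id)  # instrumented canary stack
    return handle_prod(user_id)        # normal production path
```

In practice the routing decision usually lives in a load balancer or service mesh rather than application code, but the principle is the same: isolate the affected user, observe the failure where it actually happens, and leave everyone else on the normal path.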
This is obviously by no means a complete list of how to improve the performance of your software. Some organizations will have different processes and requirements. But I have found these six to be essential for the majority of organizations and a good way, at a high level, to organize an approach to the more modern monitoring required in today’s environment.