In the new world of rapid releases, continuous change, and increasingly high user expectations, more organizations are embracing DevOps. One of the primary drivers for adopting DevOps is speed — particularly the reduction of risk at speed. As DevOps seeks to reduce risk and deliver insight at an increasingly faster pace, new tools have emerged in the monitoring space. But these tools alone will not deliver us into the low-risk world of DevOps — not without new and updated thinking. Organizations looking to adopt DevOps and implement functional roles that ascribe to the DevOps way of thinking, like Site Reliability Engineering (SRE), need a new, updated way to approach monitoring. Why?
At its heart, monitoring is about observing and determining the behavior of systems. Its purpose is to answer the ever-present question: Are my systems doing what they are supposed to? It’s also worth mentioning that systems is a very generic term, and in healthy organizations, systems are seen in a far wider scope than just computers and computing services — they include sales, marketing, and finance, alongside other “business units,” so the business is seen as the complex interdependent system it truly is. That is, good monitoring can help people take a truly systems view not only of systems, but also organizations. Today’s systems are born in an agile world and remain fluid to accommodate changes in both the supplier and the consumer landscape. This highly dynamic system stands to challenge traditional monitoring paradigms.
Why DevOps Can’t Rely on Traditional Monitoring
In the old world of slow release cycles (often between six and 18 months), the system deployed at the beginning of a release looked a lot like the same system several months later. It was maintained with bug fixes and performance enhancements, but it did not gain new features or functions that would fundamentally change the stress on the architecture. Simply put, it was not very fluid, which is great for monitoring. If the system today is the system tomorrow, and the work that system does today is largely the same tomorrow, then the baselines developed by observing the behavior of the system’s components will very likely live long, useful lives.
In the new world of rapid and fluid business and development processes, we have change on a continual basis. The problem here is that the fundamental principles that power monitoring — the very methods that judge if your machine is behaving itself — require an understanding of what good behavior looks like. In order to understand if systems are misbehaving, you need to know what it looks like when they are behaving.
In this new world, you have also likely adopted a microservices architecture pattern. Microservices dictate that the solution to a specific technical problem should be isolated behind a network-accessible service with clearly defined interfaces, so that the service is free to evolve on its own. This freedom is very powerful, but its true value lies in decoupling release schedules and maintenance, and in allowing independent higher-level decisions around security, resiliency, and compliance. The combination of these two changes results in something quite unexpected for the world of monitoring: the system of today neither looks like nor should behave like the system of tomorrow.
Characteristics of Successful Monitoring
So what should monitoring in the new world of DevOps look like? Organizations that are successfully adjusting their monitoring strategies to align with the DevOps philosophy have the following characteristics in common:
What is more important than how
The first thing to remember is that all the tools in the world will not help you detect bad behavior if you are looking at the wrong things. Be wary of tools that come with prescribed monitoring for complex assembled systems; rarely are systems in the tech industry assembled and used in the same way at two different organizations. The likely scenario is that the monitoring will seem useless; worse, in some cases it may provide false confidence that the systems are functioning well. When it comes to monitoring the “right thing,” always look at your business from the top down. The technical systems the organization operates are only provisioned and operated to meet some stated business goal. Start by monitoring whether that goal is being met.
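As a minimal sketch of this top-down approach: rather than starting from host metrics, check the business outcome first. The function, names, and the 97% completion target below are all hypothetical, chosen only to illustrate the idea.

```python
def goal_is_met(orders_completed: int, orders_attempted: int,
                target_rate: float = 0.97) -> bool:
    """Check the business goal (order completion rate) before
    drilling down into CPU, memory, or service-level metrics.
    The 97% target is an illustrative placeholder."""
    if orders_attempted == 0:
        # Zero traffic deserves its own alert; it is not a goal failure.
        return True
    return orders_completed / orders_attempted >= target_rate

print(goal_is_met(980, 1000))  # healthy completion rate
print(goal_is_met(900, 1000))  # goal missed; now drill into subsystems
```

Only when this top-level check fails do the lower-level system metrics become interesting as diagnostic detail.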
Most often, systems are monitored around delivered performance, so the consumed values (or indicators) are numbers, frequently latencies (the time a specific operation took). Basic statistics are a fundamental requirement for both asking questions about the behavior of systems and interpreting the answers. As systems grow and the focus turns more to their behavior, data volumes rise. Some people still monitor systems by taking a measurement every minute or so, but more and more are actually observing what their systems are doing, which results in millions or tens of millions of measurements per second on standard servers. Handling 10 million measurements per second from a single server, when you might have thousands of servers, might sound like overkill, but people are doing it because the technology exists to make the cost of finding the answers less than the value of those answers. To handle data at that volume, you must use a capable set of tools. To form intelligent questions around data at this volume, you must embrace mathematics. Without tools that help you perform fast, accurate, and appropriate mathematical analysis against your observed data, you will be at a considerable disadvantage.
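To make the statistics point concrete, here is a small sketch using only the Python standard library. The lognormal shape and its parameters are assumptions standing in for real observed request timings; the point is that quantiles, not averages alone, describe what users actually experience.

```python
import random
import statistics

# Hypothetical latency samples in seconds, standing in for observed
# request timings (a lognormal distribution is a common rough shape
# for service latencies; the parameters here are illustrative).
random.seed(42)
latencies = [random.lognormvariate(-3.5, 0.6) for _ in range(10_000)]

# The mean hides tail behavior; percentile cut points expose it.
cuts = statistics.quantiles(latencies, n=100)  # 99 percentile boundaries
p50, p95, p99 = cuts[49], cuts[94], cuts[98]

print(f"mean={statistics.fmean(latencies):.4f}s "
      f"p50={p50:.4f}s p95={p95:.4f}s p99={p99:.4f}s")
```

At production volumes you would compute these with purpose-built tools rather than in-memory lists, but the questions you ask remain the same.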
A third important characteristic of successful monitoring systems is data retention. Monitoring data has often been considered low value and high cost. Times have changed, and, as with all things computing, the cost of storing data has fallen dramatically. More importantly, DevOps has changed the value of long-term retention of this data. DevOps is a culture of learning. When things go wrong, and they always do, it is critical to have a robust process for interrogating the system and the organization to understand how the failure transpired. This allows processes to be altered to reduce future risk. At the pace we move, it is undeniable that your organization will develop intelligent questions regarding a failure that were missed immediately after past failures. Those new questions are crucial to the development of your organization, but they become absolutely precious if you can travel back in time and ask those questions about past incidents. Data retention can often lead to valuable learning that reduces future risk.
Be articulate about what success looks like
Having a shared language to articulate what success looks like allows people to win. It is wholly disheartening to think you’ve done a good job and met expectations, and then learn the goalposts have moved or that you cannot articulate why you’ve been successful. The art of the SLI (service-level indicator), SLO (service-level objective), and SLA (service-level agreement) reigns here. Understanding the service your business provides and the levels at which you aim to deliver that service is the heart of monitoring. SLIs are measurements that you have identified as directly related to the delivery of a service. SLOs are the goals you set for the team responsible for a given SLI. SLAs are SLOs with consequences, often financial. For this, a good understanding of histograms can help.
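A small sketch of how a histogram connects these terms, with made-up sample data and thresholds: the SLI below is the fraction of requests served under 300 ms, and the SLO says that fraction should be at least 99%. The bucket width and all numbers are illustrative, not drawn from any particular tool.

```python
from collections import Counter

BUCKET_MS = 50  # illustrative fixed-width histogram buckets

def bucket(latency_ms: float) -> int:
    """Map a latency to the lower edge of its histogram bucket."""
    return int(latency_ms // BUCKET_MS) * BUCKET_MS

SLO_THRESHOLD_MS = 300  # SLI: fraction of requests under 300 ms
SLO_TARGET = 0.99       # SLO: that fraction should be >= 99%

observed = [12, 48, 250, 310, 95, 530, 180, 40, 60, 280]  # made-up samples
hist = Counter(bucket(ms) for ms in observed)

# A bucket counts as "good" only if it lies entirely under the threshold,
# so the histogram's bucket edges bound the answer conservatively.
good = sum(n for edge, n in hist.items()
           if edge + BUCKET_MS <= SLO_THRESHOLD_MS)
sli = good / sum(hist.values())
print(f"SLI={sli:.2%}  SLO met: {sli >= SLO_TARGET}")
```

Storing the histogram rather than raw samples is what makes this cheap to retain long-term, which is why histograms recur in SLI/SLO tooling.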
Aim for Better
Today, with architectures dynamically shifting in size by the minute or hour and shifting in design by the day or the week, we need to step back and remember that monitoring is about understanding the behavior of systems, and that systems need not be limited to computers and software. A business is a complex system itself, and monitoring can be applied to all of these systems to measure important indicators and detect changes in overall systems behavior. Monitoring can seem quite overwhelming. The most important thing to remember is that perfect should never be the enemy of better, and if you have embraced DevOps, you have already signed up for making it better over time.