Shipping Software with an SRE Mindset

This blog post is a summary of a presentation I delivered at SREcon Americas, which can be viewed here.

A lot of companies bequeath the projects that they build internally unto the world as open source. But they’re usually in for a pretty rude awakening when they ship their software and it has to be run for the first time by someone other than them. The reason is that the software they built does not stand on its own once it is stripped of the myriad supporting tooling they were leveraging internally, such as their deployment platform, metrics platform, distributed tracing platform, or log analytics collection platform.

So why would you ship software in the first place? That is likely the topic for another post by a different author, but on the short list is that people feel they can’t put critical data into someone else’s infrastructure or into code they cannot audit or modify. Many organizations also have compliance requirements or procedures that put constraints on their deployment methodology and/or topology, constraints that open source software often solves for. At Circonus, we deal with a lot of on-premises software shipment due to hybrid customer requirements. We run Circonus as a SaaS product, but our customers can, for compliance, governance, performance, or cost reasons, elect to run various pieces of our stack on-premises as well. Over the years, we’ve learned that many SRE techniques apply directly to the construction, packaging, and shipment of installed software.

In fact, every effort that organizations spend bringing SRE techniques into software engineering makes our entire ecosystem of technologies more accessible to every single person that wants to employ them. In the following, I share six ways engineers can ship software with an SRE mindset.

#1: Crash analysis: if you don’t know why it failed, then you don’t know anything at all.

When your systems or applications fail, it’s absolutely fundamental that you get the stack traces, core dumps, etc., so you can diagnose that failure and create a permanent solution to it. At Circonus, we use a tool called Backtrace for this. About 35 milliseconds after a failure condition, we receive a 300-kilobyte minidump that shows every thread, every stack frame of every thread, and every local variable, and in our UI we can see the line numbers of the code that was executing. We ship this integration on-premises to customers who don’t want their data to leave the firewall. In the best case, within a second of a crash, we already have a stack trace and it has already entered the product development pipeline. In the worst case, the customer sends us the diagnostic files manually.
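Even a small service can approximate the spirit of this. As a minimal sketch (using Python’s standard `faulthandler` module; the temp-file destination here is an illustrative choice, not Circonus’s mechanism), a fatal signal can at least leave behind a per-thread stack trace instead of nothing:

```python
import faulthandler
import tempfile

# Keep the crash-log handle open for the life of the process so the
# handler can still write even if the crash leaves the heap unusable.
crash_log = tempfile.NamedTemporaryFile(
    mode="w", prefix="myapp-crash-", suffix=".log", delete=False)

# On fatal signals (SIGSEGV, SIGFPE, SIGABRT, ...) dump the traceback
# of every thread to crash_log instead of dying silently.
faulthandler.enable(file=crash_log, all_threads=True)
```

A real crash-analysis pipeline adds symbolication, local variables, and automatic upload, but the principle is the same: the diagnostic artifact is produced at crash time, not reconstructed afterward.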

#2: Expose telemetry: ideally, any question you would ask of a production system can be answered nondisruptively.

Exposing telemetry is absolutely critical. At Circonus, we collect telemetry on everything – every disk IOP, KV manipulation, storage record, etc. – the nanosecond latency of each of those is measured and stored in histograms. So, for instance, if we find search queries are running slow and want to see what has changed, we can look at every single thing that’s happening in the system, as well as all historical data on everything that has ever happened.
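Histograms are what make “measure everything” affordable: they keep a fixed-size summary no matter how many samples you record. As a toy sketch of the idea (not Circonus’s actual histogram implementation), a log-scale histogram with per-order-of-magnitude buckets looks like this:

```python
import math
import time
from collections import Counter

class LatencyHistogram:
    """Toy log-scale histogram: buckets latencies by order of
    magnitude, so it stays tiny regardless of sample volume."""
    def __init__(self):
        self.buckets = Counter()
        self.count = 0

    def record(self, nanos):
        # Bucket index = floor(log10(nanos)): 1us -> 3, 1ms -> 6, etc.
        self.buckets[int(math.log10(max(nanos, 1)))] += 1
        self.count += 1

hist = LatencyHistogram()
start = time.perf_counter_ns()
sum(range(1000))  # stand-in for a disk IOP or KV operation
hist.record(time.perf_counter_ns() - start)
```

Production histograms (log-linear, HDR, etc.) use finer bucketing so you can recover accurate quantiles, but the cost model is the same: constant space per metric, so recording every operation is cheap.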

This is invaluable in troubleshooting a problem, because when we have a question, we have the data to answer it. This is what you should aim for. What’s critical is that when you ship a piece of software to someone else, you must have some simple way to extract that data from the product natively, because they are likely not using the same tools (like dashboards) as you. If operating your application relies on your orchestration framework or your metrics framework, then on its own your application is no longer operable. You need an operational substrate in your application, or whatever it is that you’re shipping, that itself qualifies as a minimum viable product, so that when you ask a question, you’re able to answer it.
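The simplest native extraction point is an endpoint the product serves itself, with no external stack required. A minimal sketch (the stats and the `/stats` path are hypothetical; this is just the shape of the idea, using only the Python standard library):

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters the application already maintains.
STATS = {"requests_total": 0, "search_latency_ms_p99": 42.0}

class StatsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/stats":
            body = json.dumps(STATS).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        pass  # keep the example quiet

# Port 0 asks the OS for any free port; a daemon thread serves requests.
server = HTTPServer(("127.0.0.1", 0), StatsHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
```

An operator with nothing but `curl` can now answer basic questions about the running process, which is exactly the point: the product carries its own minimum viable observability.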

#3: Logging for humans: during failure reconstruction, logs hold truth; computers talking to computers have better ways than logs, and logs are for computers talking to humans.

Logging is for humans; structured logging is for computers. Events and distributed tracing are types of structured logging. Once you realize that logging is for humans, the first step is asking, “will this be useful to me?” The next step is realizing that some logs may be the opposite of useful for someone who is not you and not familiar with your product. When you deploy a product like MySQL or MongoDB and get developer-centric logs out of it, no one would ever understand what’s going on. When you start focusing on shipping software and thinking about the logging and observability inside it, you start to treat your service like a product, and you start to think about how these logs can make the life of an operator of that product easier.
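The distinction is easy to see side by side. A minimal sketch (the event fields and logger name are hypothetical) of the same failure logged once for computers and once for the human operator of shipped software:

```python
import json
import logging

logging.basicConfig(format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("myapp")  # hypothetical logger name

event = {"op": "compact", "shard": 7, "duration_ms": 1843, "ok": False}

# For computers: one machine-parseable JSON record per line.
log.error("event %s", json.dumps(event))

# For the human operator of shipped software: what happened, why it
# matters, and what (if anything) they need to do about it.
log.error("Compaction of shard 7 failed after 1.8s; the shard remains "
          "readable, and compaction will be retried automatically.")
```

The first line feeds pipelines; the second is the one a stranger to your codebase reads at 3 a.m. Shipped software needs the second to stand on its own.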

#4: Dynamic tracing: real unknown unknowns are solved by dynamic tracing.

There’s a big misconception that observability systems give you the ability to answer questions about your unknown unknowns. This sounds great, because engineers want to solve problems where they don’t understand or can’t predict the input, don’t know what the output is, and don’t know what the question is until they want to ask it. Well, it turns out that most of these products require you to add something to your code in advance, so those aren’t really unknown unknowns anymore. The important part of dynamic tracing is the idea of being able to instrument (not just report on) a system on the fly, with no prior context.

This has been available for years in DTrace, and more recently in tools like bpftrace, which allow you to go in and ask questions. The typical scenario is that when there’s an issue, an engineer goes back to their code, adds a metric, and then waits hours (or, for databases, weeks) to redeploy the application and for the problem to happen again. With dynamic tracing, the memory on the box is yours: with one line of code, you can get the answer to your unknown unknowns and gain really deep insight into the application.
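DTrace and bpftrace do this at the kernel and process level, with no changes to the target. As a rough in-language analog of the same idea (a sketch, not a substitute for real dynamic tracing), Python lets you attach instrumentation to code you never edited and detach it afterward:

```python
import sys
from collections import Counter

calls = Counter()

def tracer(frame, event, arg):
    # Count every function entry, keyed by function name, without
    # touching the code being traced.
    if event == "call":
        calls[frame.f_code.co_name] += 1
    return None

def slow_path(n):  # hypothetical function we never instrumented
    return sum(range(n))

sys.settrace(tracer)   # attach instrumentation on the fly
slow_path(10)
slow_path(20)
sys.settrace(None)     # detach; no trace of the probe remains
```

The key property is the same as with bpftrace: the question is formulated *after* the code shipped, answered against the live system, and removable without a redeploy.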

This is very effective when working with both productionized software that’s deployed in your SaaS and with software you’ve shipped because the scripts end up being really tiny and easy to supply to customer-operators. Dynamic tracing actually gives you the instrumentation and questioning framework to be able to answer questions live in production without disruption.

#5: Internalized MVP: No additional apparatus; no additional deployment constraints.

It’s great to send metrics, distributed traces, and eBPF data into your company’s overall analysis framework, but it’s also incredibly valuable to be able to review them within the application itself. You really need to be able to interrogate the app without all of these surrogate systems. The more complicated dependency cycles you have (say, your monitoring graphs don’t show up because they depend on the very system you’re trying to debug right now), the more obtuse and hard to manage those dependencies become. The more self-sustaining you make each individual product, service, or application that you ship, the more power (or at least accessibility) you have to analyze its behavior. You’re no longer required to participate in the overall ecosystem to understand how it works.

#6: Operational assessment and procedures: shipping software means more operators, less average knowledge; tools → solutions.

Everybody has operational assessments and procedures so that when things start to fail, you have steps you take as an operator of that software to interrogate the system, try different things, and remediate the problem. But when you’re trying to get somebody else to do that through a game of telephone, it never works. Shipping software forces you to codify those things: they stop being things that you do, and they start being automated tasks that you trigger, which eventually turn into product features where the system self-heals and self-maintains. It’s amazing how often an engineer faced with a challenge comes up with a set of solutions, and when you ship the product and customers can’t apply those solutions by hand, they turn into product features under the hood. It’s really about turning tools into solutions.
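The progression from prose runbook to product feature can be sketched in a few lines. This is a toy shape only; the check, the threshold, and the remediation are hypothetical placeholders for whatever your runbook actually says:

```python
def queue_depth():
    """Stand-in for a real probe (e.g. reading an internal counter)."""
    return 12000

def restart_consumer():
    """Stand-in for a real remediation step."""
    return "consumer restarted"

RUNBOOK = [
    # (description, check that returns True when unhealthy, remediation)
    ("consumer backlog too deep",
     lambda: queue_depth() > 10000,
     restart_consumer),
]

def self_heal():
    """Run every codified runbook entry; remediate what is unhealthy."""
    actions = []
    for desc, unhealthy, remediate in RUNBOOK:
        if unhealthy():
            actions.append((desc, remediate()))
    return actions
```

Once the procedure lives in code like this, the next steps follow naturally: run it on a timer, log what it did for the operator, and eventually make it an invisible part of the product.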

Making SRE more accessible and useful

Every effort to bring SRE techniques into software engineering makes SRE more accessible and useful in cloud/SaaS engineering. And, as I stated above, the most important part of this is that every such effort makes our entire ecosystem of software products more accessible to every single person who wants to employ them. As we codify SRE techniques into the software itself, we make everybody’s life tremendously better and advance the SRE movement as a whole.
