Here at Circonus, we obviously monitor our infrastructure with the Circonus platform. However, there are many types of monitoring, and our product isn’t intended to cover all the bases. While Circonus leads to profound insights into the behavior of systems, it isn’t a deep-dive tool for when you already know a system is broken.
Many of our components here are written in C, so as to be as small, portable, and efficient as possible. One such component gets run on relatively small systems on-premises with customers, to perform the tasks of data collection, store-and-forward, and real-time observation. When it comes to the data collection pipeline, a malfunction is unacceptable.
This story is of one such unacceptable event, on a warm Wednesday in July.
Around 2pm, the engineering group is hacking away on their various deliverables for new features and system improvements to the underlying components that power Circonus. We’ve just launched our Pully McPushface feature to a wider audience (everyone) after over a year of dark launches.
At 2:47pm, a crash notice from Amsterdam appears in our #backtrace slack channel. We use Backtrace to capture and analyze crashes throughout our globally distributed production network. Crashes do happen, and this one will go into an engineer’s queue to be analyzed and fixed when they switch from one of their current tasks. At 2:49pm, things change; we get an eerily similar crash signature from Amsterdam. At this point, we know that we have an event worthy of interruption.
Around 2:51pm, we snag the Backtrace object identifier out of Slack and pull it up in the analysis tool (coroner). It provides the first several steps of post-mortem debugging analysis for us. About 8 minutes later, we have consensus on the problem and a proposed fix, and we commit that change to Github at 2:59pm.
Now, in the modern world, all this CI/CD stuff claims to provide great benefits. We use Jenkins internally for most of the development automation. Jenkins stirs and begins its build and test process. Now, this code runs on at least five supported platforms, and since the fix is in a library, we actually can count more than fifty packages that need to be rebuilt. After just 6 minutes, Jenkins notes all test suites passed and begins package builds at 3:05pm.
Building, and testing, and packaging software is tedious work. Many organizations have continuous integration, but fail to have continuous deployment. At 3:09pm, just 4 minutes later, we have all of our fixed packages deployed to our package repositories and ready for install. One might claim that we don’t have continuous deployment because our machines in the field do not auto-update their software. It’s an interesting fight internal to Circonus as to whether or not we should be performing automated updates; as of today, a human is still involved to green-light production deployment for new packages.
One more minute passes, and at 3:10pm, Amsterdam performs its yum update (it’s a CentOS box), and the Slack channel quiets. All is well.
The engineers here at Circonus have been building production software for many years, and a lot has changed over that time. I distinctly remember manually testing software in the 1990s. I remember some hobbled form of CI entering the scene, but we were still manually packaging software for distribution in the 2000s. Why did it take our industry so long to automate these pieces? I believe some of that has to do with relative scale. Collecting the necessary post-mortem debugging information to confidently issue a fix took hours or days when machines in production malfunctioned, because operations groups had to notice, report, and then interact with development. As such, the extra hour or two to compile, test, and package the software was inconsequential, reminding us of a golden XKCD comic in its time.
Recapping this event makes me, the old grey-beard, feel like I’m living in the future. Jenkins was able to reduce a process from hours to minutes, and Backtrace was able to reduce a highly inconsistent hours-to-days process to minutes.
A two minute MTTD and a 23 minute MTTR for a tested fix involving a core library. The future looks way more comfortable than the past.