Post-Mortem 2017.1.12.1

TL;DR: Some users received spurious alerts for approximately 30 minutes, starting at 2017-01-12 22:10 UTC. It is our assessment that no expected alerts were missed. There was no data loss.

Overview

Due to a software bug in the ingestion pipeline specific to fault detection, a small subset (less than 2.5%) of checks was not analyzed by the online fault detection system for 31 minutes, starting at 2017-01-12 22:10 UTC.

The problem was triaged. Broker provisioning and deprovisioning services were taken offline at 22:40 UTC, at which time all fault detection services returned to normal.

Broker provisioning and deprovisioning services were brought back online at 2017-01-13 00:11 UTC. All broker provisioning and deprovisioning requests issued during that period were queued and processed successfully upon service resumption.

Gratuitous Detail

Within the Circonus architecture, we have an aggregation layer at the edge of our service that communicates with our store-and-forward telemetry brokers (which, in turn, accept or acquire data from agents). This component is called “stratcond.” On January 5th, we launched new code that allows more flexible configuration orchestration and, despite both unit tests and end-to-end tests, an error was introduced. Normal operations continued successfully until January 12th, when a user issued a command requiring reconfiguration of this system. That command exercised the code path containing this specific error, and stratcond crashed. As with all resilient systems, stratcond was restarted immediately, and it suffered approximately 1.5 seconds of “disconnection” from downstream brokers.

The system is designed to tolerate failures, as failures are pretty much the only guaranteed thing in distributed systems. These can happen at the most unexpected times and many of our algorithms for handling failure are designed to cope with the randomness (or perceived randomness) of distributed failure.

The command that caused the crash was queued and reattempted precisely 60 seconds later, and again 60 seconds after that, and again after that: a recurrent and very non-random failure. Many customers have checks scheduled to run every 60 seconds. When such a check is scheduled on a broker, it runs at a random offset within the first 60 seconds of that broker’s boot time. So, of the 60-second-period checks, about 2.5% (1.5s / 60s) would have been scheduled to run during the 1.5-second real-time-stream outage caused by each crash. The particular issue here is that because the crash recurred almost exactly every 60 seconds, the same 1.5 seconds of each minute was vulnerable to exclusion. Therefore the same 2.5% of checks were affected each minute, making them “disappear” to the fault detection system.
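To make the arithmetic concrete, here is a small, hypothetical simulation (illustrative only, not Circonus code) that assigns each 60-second check a random offset and counts how many land inside a fixed 1.5-second blind window each minute:

    import random

    PERIOD = 60.0         # check period in seconds
    OUTAGE = 1.5          # blind window caused by each crash/restart
    NUM_CHECKS = 100_000  # hypothetical number of 60-second-period checks

    # Each check runs at a fixed random offset within its 60-second period,
    # chosen once (at broker boot) and reused every minute thereafter.
    offsets = [random.uniform(0, PERIOD) for _ in range(NUM_CHECKS)]

    # The crash recurred almost exactly every 60 seconds, so the blind window
    # covers the same slice of every minute, e.g. seconds [10.0, 11.5).
    window_start = 10.0
    affected = sum(window_start <= o < window_start + OUTAGE for o in offsets)

    print(f"affected: {affected / NUM_CHECKS:.1%}")  # ~2.5%

Because the offsets are fixed, it is the same ~2.5% of checks that fall into the window minute after minute, which is exactly why they vanished from fault detection rather than being missed once at random.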

The same general pipeline that powers graphs and analysis is also used for long-term storage, but because that system has open-ended temporal requirements, it was unaffected. All checks that ran during those “outage” windows had their measurements successfully sent upstream and stored; the outages applied only to the real-time fault detection stream, not to the storage stream.

Operational response led to diagnosis of the cause of the crash, avoidance, and restoration of normal fault detection operation within 31 minutes. Crash analysis and all-hands engineer triage led to a bug fix, test, packaging, and deployment within 2 hours and 11 minutes.

Actions

There are two actions to be taken, and both will require research and implementation.

The first is to implement better instability detection to further enhance the fault detection system’s already sophisticated capabilities for flagging instability. The disconnections in this case were so short and so reliably timed that they did not trigger the fault detection system’s instability mode, and thus it did not react as it should have.
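As an illustration only (a sketch, not our production design; the class, thresholds, and names below are assumptions), a detector along these lines could treat short but regularly spaced disconnections as instability:

    from collections import deque
    import statistics

    class InstabilityDetector:
        """Hypothetical sketch: flag instability when disconnections recur
        at a suspiciously regular interval, even if each one is brief."""

        def __init__(self, window=10, max_jitter=2.0, min_events=3):
            self.disconnects = deque(maxlen=window)  # recent disconnect timestamps
            self.max_jitter = max_jitter             # allowed gap variation, seconds
            self.min_events = min_events

        def record_disconnect(self, ts):
            self.disconnects.append(ts)

        def unstable(self):
            if len(self.disconnects) < self.min_events:
                return False
            times = list(self.disconnects)
            gaps = [b - a for a, b in zip(times, times[1:])]
            # Nearly constant gaps (e.g. ~60s apart) suggest a crash loop,
            # which should put fault detection into its instability mode
            # even though each individual outage is tiny.
            return statistics.pstdev(gaps) <= self.max_jitter

Feeding such a detector disconnect timestamps at t = 0, 60, 120, and 180 seconds would report instability, even though each individual outage lasted only about 1.5 seconds.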

The second is to better exploit “at least once delivery” in the fault pipeline. To make sure we get the job done that we promise to get done, our systems are built to process the same data more than once; often, a metric is actually delivered to the fault detection system four times. We can extend this “duplication tolerance” to the stratcond-broker feed and replay some window of past traffic upstream after a disconnection. In online systems, old data is worthless. In all systems, “old” is subjective. By relaxing our definition of “old” a bit and leveraging the fact that no new upstream protections are required (the consumers already tolerate duplicates), we should easily be able to make this tiny section of our pipeline even more resilient to failure.
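As a rough sketch of the idea (hypothetical code, not the stratcond implementation; the window size and names are assumptions), a bounded replay buffer on the broker feed plus idempotent handling upstream would let us re-send a recent window of traffic after a disconnection without risking duplicate processing:

    from collections import deque

    REPLAY_WINDOW = 120.0  # seconds of "old" data we are willing to re-send (assumed)

    class ReplayBuffer:
        """Keep recent measurements so they can be replayed after a reconnect.
        Duplicates are harmless because the upstream consumer deduplicates."""

        def __init__(self):
            self.buffer = deque()  # (timestamp, check_id, value)

        def record(self, ts, check_id, value):
            self.buffer.append((ts, check_id, value))
            cutoff = ts - REPLAY_WINDOW
            while self.buffer and self.buffer[0][0] < cutoff:
                self.buffer.popleft()  # drop data that is now "too old"

        def replay(self):
            return list(self.buffer)  # re-send everything still in the window

    class DedupingConsumer:
        """Upstream side: at-least-once delivery means the same (check, ts)
        pair may arrive several times; process it only once."""

        def __init__(self):
            self.seen = set()

        def ingest(self, ts, check_id, value):
            key = (check_id, ts)
            if key in self.seen:
                return False  # duplicate from a replay; ignore it
            self.seen.add(key)
            # ... hand the measurement to fault detection here ...
            return True

The design choice is the same one the rest of the pipeline already makes: accept duplicates freely and make processing idempotent, so replaying a short window after a 1.5-second disconnection costs nothing but a little bandwidth.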

To close, we live in the real world. Failure is the only option. We embrace the failures that we see on a daily basis and do our best to ensure that the failures we see do not impact the service we deliver to you in any way. Yesterday, we learned that we can do better. We will.