First, I want to sincerely apologize to anyone adversely affected by yesterday’s false alerts. This is exactly the kind of failure we try to be conscious of when rolling out new changes, and it is clearly something I hope never to repeat.
How did it happen? First, a quick rundown of the systems involved. As data is streamed in from the brokers, it is sent over RabbitMQ to a group of Complex Event Processors (CEPs) running Esper; additionally, the last collected value for each unique metric is stored in Redis for quick lookups. The CEPs are responsible for identifying when a value has triggered an alert and then telling the notification system about it.
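The ingest path above can be sketched in a few lines. This is an illustrative model, not our actual code: a dict stands in for Redis, a list stands in for the RabbitMQ hop to the CEP workers, and the key format is invented.

```python
# Illustrative model of the ingest path (all names invented for this sketch).
last_value = {}   # stands in for Redis: last collected value per unique metric
cep_inbox = []    # stands in for the RabbitMQ hop to the CEP workers

def ingest(check_id, metric, value):
    # Keep the most recent reading one lookup away...
    last_value[f"{check_id}:{metric}"] = value
    # ...and forward the event to the CEP layer for alert evaluation.
    cep_inbox.append((check_id, metric, value))

ingest(7, "cpu_idle", 98.2)
ingest(7, "cpu_idle", 12.5)
print(last_value["7:cpu_idle"])  # only the latest value is retained: 12.5
```

Every event flows through the CEPs, while Redis only ever holds the latest value per metric; that last-value cache is what gets replayed into the CEP at boot, which is where the trouble started.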
Yesterday we were working on a bug in the CEP system where, under certain conditions, if a value went from bad to good while we were restarting the service, it was possible we would never trigger an “all clear” event, and as such your alert would never clear. After rigorously testing in our development environment, we thought we had it fixed and all our (known) corner cases covered.
So the change was deployed to one of the CEP systems to verify it in production. For the first few minutes all was well: stale alerts were clearing, and I was a happy camper. Then, roughly 5 minutes after the restart, all hell broke loose. Every “on absence” alert fired and then cleared within a minute, pagers went off around the office, and happiness was aborted.
Digging into the code, we thought we had spotted the problem: when we load the last values into the CEP from Redis, we need to do so in a particular order. Because we used multiple threads to load the data asynchronously, some of it was loaded in the proper order, but the vast majority was loaded too late. Strike one for our dev environment: it doesn’t have anywhere near production’s volume of data, so everything there was loaded in order by chance. We fixed the concurrency issue, tested, redeployed, and BOOM, same behavior as before.
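A sketch of that kind of fix, under my own assumption (the post doesn’t spell out the ordering requirement) that events must reach the engine in collection-timestamp order: keep the concurrent fetches, but buffer and sort before anything is fed to the CEP.

```python
from concurrent.futures import ThreadPoolExecutor

# Fake last-value store: key -> (collection_timestamp, value). A stand-in
# for Redis; the shape and names are illustrative only.
STORE = {
    "check1:cpu": (103, 0.42),
    "check1:mem": (101, 0.80),
    "check2:cpu": (102, 0.10),
}

def fetch(key):
    # Stand-in for a Redis lookup; the real fetches are network I/O,
    # which is why they were parallelized in the first place.
    return key, STORE[key]

def preload(keys, feed):
    # Fetch concurrently, but never feed the engine from worker threads:
    # buffer everything, then replay in timestamp order.
    with ThreadPoolExecutor(max_workers=4) as pool:
        records = list(pool.map(fetch, keys))
    for key, (ts, value) in sorted(records, key=lambda r: r[1][0]):
        feed(key, ts, value)

seen = []
preload(STORE, lambda k, ts, v: seen.append(k))
print(seen)  # ['check1:mem', 'check2:cpu', 'check1:cpu'], oldest first
```

The design point is simply that parallel I/O and ordered delivery are separate concerns; the dev environment masked the bug because its tiny dataset happened to complete in order.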
The next failure was a result of the grouping we do in the Esper queries: we group by the check ID, the name of the metric, and the target host being observed. The preload data was missing the target field. As a result, the initial preload event was inserted fine, and new data coming in was also inserted fine, but the two were grouped differently. Our absence windows currently have a 5-minute timeout, so 5 minutes after boot all of the preload data would exit its window, which was now empty, and we triggered an alert. Then, as the newly collected data filled its own window, we would send an all clear for that metric, and at that point we would be running normally, albeit with a lot of false alerts getting cleaned up.
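The mismatch is easy to demonstrate with a toy grouping function. This is a Python illustration of the behavior, not our actual Esper EPL:

```python
# Events are grouped by (check id, metric name, target host). Preload
# rows were missing the target, so they landed in a different group
# than the live data for the same metric.
def group_key(event):
    return (event["check_id"], event["metric"], event.get("target"))

preload_event = {"check_id": 7, "metric": "cpu"}                   # no target
live_event    = {"check_id": 7, "metric": "cpu", "target": "web1"}

print(group_key(preload_event) != group_key(live_event))  # True
# The preload group empties when the 5-minute absence window times out and
# fires a false alert; the live group fills and later sends an all clear,
# matching the fire-then-clear pattern we saw after each restart.
```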
Unfortunately, at this point the Redis servers didn’t have the target information in their store, so a quick change was made to push that data into them. That rollout was a success, and a little happiness was restored since something had gone right. After they had enough time to populate all the check data, changes were again rolled out to the CEP to add the target to the preload; hopes were high. At this point we had still only rolled the changes to the first CEP machine, so it was updated again and rebooted. After 5 minutes things still looked solid, so the other systems were updated. BOOM.
The timing of this failure didn’t make sense. CEP one had now been running for 15 minutes, and there are no timers in the system that would explain this behavior. The code was reviewed and looked correct. Upon review of the log files, we saw failures and recoveries for each CEP system; however, they were being generated by different machines.
The reason was a recent scale-out of the CEP infrastructure. Each CEP is connected to RabbitMQ to receive events, and to split the processing amongst them, each binds a set of routing keys for the events it cares about. This splitting of events wasn’t mimicked in the preload code: each CEP was preloaded with all events. Since each system only cared about its share, the events it wasn’t receiving would trigger an absence alert, as it would see them in the preload and then never again. And because the CEP systems are decoupled, an event on CEP one wouldn’t be relayed to any other system, so the others had no way of knowing they needed to send a clear event; as far as they were concerned, everything was fine. Strike two for dev: we don’t use that distributed setup there.
Once again the CEP was patched; this time the preloader was given the intelligence to construct the routing key for each metric. At boot it would pull the list of keys it cared about from its config, and then, as it pulled the data from Redis, it would compare each metric’s routing key against that list and only preload the data it actually owned. One last time: roll changes, restart, wait, wait (the longest 5 minutes in recent memory), wait some more… no boom!!!
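The filter amounts to something like the following sketch. The routing-key format and sharding scheme here are invented for illustration; only the filtering idea comes from what actually shipped.

```python
# Hypothetical routing-key scheme: shard events across CEP nodes by
# check id. The real key format is not described in this post.
def routing_key(check_id):
    return f"check.{check_id % 4}"

# The set of routing keys this CEP binds in RabbitMQ, from its config.
MY_BINDINGS = {"check.0", "check.1"}

def should_preload(check_id):
    # Preload a metric only if this node would also receive its live
    # events; otherwise the preloaded event is never followed up and a
    # false absence alert fires when the window times out.
    return routing_key(check_id) in MY_BINDINGS

print(should_preload(4), should_preload(2))  # True False
```

The invariant worth naming: whatever function maps an event to a consumer must be the same function the preloader uses, or boot-time state and live state will diverge on every node.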
At this point, though, one of the initial problems I had set out to solve was still an issue. Because the data streaming in looked good, the CEP wouldn’t emit an all clear for no reason; a metric has to go bad first. So we had a lot of false alerts hanging around and people being reminded about them. To rectify this, I went into the primary DB, cleared all the alerts with a few updates, and rebooted the notification system so it would no longer see them as an issue. This stopped the reminders and brought us back to a state of peace. And this is where we sit now.
What are the lessons learned, and how do we hope to prevent this in the future? Step 1 is, of course, always making sure dev matches production: not just in code, but in data volume and topology. Outside of the CEP setup it already does, so we need a few new zones brought into the mix today, and that will close the gap. Next, a better staging and rollout procedure for this system. We can bring up a new CEP in production and give it a workload whose events don’t generate real alerts; going forward, we will be verifying changes against production traffic like this before a rollout.
Once again, sorry for the false positives. Disaster porn is a wonderful learning experience, and if any of the problems mentioned in this post hit home, I hope it gets you thinking about what changes you might need to make. For updates on outages or general system information, remember to follow circonusops on Twitter.