Last night circonus.com was unavailable for 34 minutes because the primary database server went down. Here is a breakdown of events; all times are US/Eastern.
- 8:23 pm kernel panic on the primary DB machine; the system rebooted but did not come back up properly
- 8:25 -> 8:27 pm first set of pages went out about the DB being down and dependent systems not operating
- 8:30 pm work began on migrating to the backup DB
- 8:57 pm migration complete and systems back online
In addition to the web portal being down during this time, alerts were delayed. The fault detection system continued to operate; however, we discovered some edge cases in the case management portion that will be addressed soon.
Because of the highly decoupled nature of Circonus, metric collection, ingestion, and long-term storage were not impacted by this event. Other services, such as search, streaming, and even fault detection (except as outlined above), receive their updates over a message queue and continued to operate as normal.
After the outage we discussed why recovery took so long and boiled it down to inadequate documentation of the failover process: not everyone on call that night knew the system well enough to execute it quickly. We are addressing this so that recovery from an event like this can be handled much faster in the future.