Postmortem: 2017-04-11 Firewall Outage

The Event

At approximately 05:40AM GMT on 4/11/2017, we experienced a network outage in our main datacenter in Chicago, IL.
The outage lasted until approximately 10:55AM GMT on the same day. The Circonus SaaS service, as well as any PUSH based checks that use the public trap.noit.circonus.net, were affected by this outage. Any HTTPTrap checks using the public trap.noit.circonus.net broker would have been unable to send data to Circonus during this time period. As a result, any alerts based on this PUSH data would have also not been working. Meanwhile, enterprise brokers may have experienced a delay in processing data, but no data would have been lost for users of enterprise brokers as we use a store and forward mechanism on the brokers.

The Explanation

We use a pair of firewall devices in an active/passive configuration with automatic failover should one of the devices become unresponsive. The firewall device in question went down, and automatic failover did not trigger for an unknown reason (we are still investigating). When we realized the problem, we killed off the bad firewall device, causing the secondary to promote itself to master and service to be restored.

What We’re Doing About It

Going forward, we will use more robust checking mechanisms on these firewall devices to be alerted more quickly should we encounter a similar situation. Using an enterprise broker can insulate you from network outages between your infrastructure and Circonus should any future issues arise similar to this, or should any future issues arise in the network path between your infrastructure and Circonus.