Fault Detection: New Features and Fixes

One of the trickier problems when detecting faults is detecting the absence of data. Did the check run and not produce data? Did we lose connection and miss the data? The latter problems are where we lost a bit of insight, which we sought to correct.

The system is down

A loss of connection to the broker happens for one of two reasons. First, the broker itself might be down, the software restarted, machine crashed, etc. Second, there was a loss of connectivity in the network between the broker and the Circonus NOC. Note that for our purposes, a failure in our NOC would look identical to the broker running but having network problems.

Lets start with a broker being down. Since we aren’t receiving any data, it looks to the system like all of the metrics just went absent. In the event that a broker goes down, the customer owning that broker be inundated with absence alerts.

Back in July, we solved this by adding the ability to set a contact group on a broker. If the broker disconnects, you will get a single alert notifying you that the broker is down. While disconnected, the system automatically puts all metrics on the broker into an internal maintenance mode, when it reconnects we flip them out of maintenance and then ask for a current state of the world, so anything that is bad will alert. Note that if you do not set a contact group, we have no way to tell you the broker is disconnected so we will fall back to not putting metrics in maintenance and you will get paged about each one as they go absent. Even though this feature isn’t brand new, it is worth pointing out.

Can you hear me now?

It is important to know a little about how the brokers work… When they restart, all the checks configured on it are scheduled to run within the first minute, then after that they follow the normal frequency settings. To this end, when we reestablish connectivity with a broker, we look at the internal uptime monitor, if it is >= 60 seconds we know all the checks have run and we can again use the data for alerting purposes.

This presented a problem when an outage was caused by a network interruption or a problem in our NOC. Such a network problem happened late one night and connections to a handful of brokers were lost temporarily. When they came back online, because they had never restarted we saw the uptime was good and immediately started using the data. This poses a problem if we reconnected at the very end of an absence window. A given check might not run again for 1 – 5 minutes, so we would potentially trigger absences, and then recover them when the check ran.

We made two changes to fix this. First, we now have two criteria for a stable / connected broker:

Uptime >= 60 seconds
Connected to the NOC for >= 60 seconds

Since the majority of the checks run every minute, this meant that we would see the data again before declaring the data absent. This, however, doesn’t account for any checks with a larger period. To that end, we changed the absence alerting to first check to see how long the broker has been connected. If it has been connected for less than the absence window length, we push out the absence check to another window in order to first ensure the check would have run. A small change but one that took a lot of testing and should drastically cut down on false absence alerts due to network problems.