Circonus will soon be releasing our next generation fault detection system, faultd (fault-dee). Faultd is an internal component of our infrastructure that has run alongside our existing fault detection system for several months, with its outputs verified for accuracy. Additionally, it is in use by some of our enterprise customers, who have reported no issues with it.
Faultd introduces powerful new features that make it easy to manage alerting in ephemeral infrastructures, such as serverless and container-based applications, as well as in large enterprises.
Pattern Based Rulesets
Say I have a few thousand hosts that emit S.M.A.R.T. disk status telemetry, and I want to alert when the seek error rate exceeds a threshold. Previously I would need to create a few thousand rules to alert on this condition for each host. While the Circonus API certainly makes this programmatically feasible, I would also need to create or delete these rules on each host addition or removal.
Now I can create a single pattern based rule using regular expressions to cover swaths of infrastructure for a given metric. I can also harness the power of stream tags to create pattern based rules based on metric metadata. What would have taken operators hours to do in the past can now be done easily in minutes.
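To make the idea concrete, here is a minimal sketch in Python of how one regular expression can stand in for thousands of per-host rules. This is not the faultd implementation or the Circonus API; the metric naming scheme (`<host>.smart.seek_error_rate`), the threshold value, and the function name `hosts_in_violation` are all assumptions made up for illustration.

```python
import re

# Hypothetical naming scheme: "<host>.smart.seek_error_rate".
# One pattern covers every host, including hosts added later.
PATTERN = re.compile(r"^(?P<host>[\w-]+)\.smart\.seek_error_rate$")

# Illustrative threshold only, not a recommended value.
THRESHOLD = 50.0


def hosts_in_violation(samples, pattern=PATTERN, threshold=THRESHOLD):
    """Return the sorted hosts whose matching metric exceeds the threshold.

    `samples` maps metric names to their latest values. Metrics that do
    not match the pattern are ignored.
    """
    violating = set()
    for name, value in samples.items():
        match = pattern.match(name)
        if match and value > threshold:
            violating.add(match.group("host"))
    return sorted(violating)


if __name__ == "__main__":
    latest = {
        "web-01.smart.seek_error_rate": 72.0,
        "web-02.smart.seek_error_rate": 3.0,
        "web-01.cpu.idle": 99.0,  # does not match the pattern
        "db-07.smart.seek_error_rate": 51.0,
    }
    print(hosts_in_violation(latest))
```

Because the rule is defined by the pattern rather than by a list of hosts, adding or removing a host requires no change to the rule itself.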
Histogram Based Alerting
Traditional alerting has been based on a value exceeding a threshold for a given amount of time. Every monitoring system can do this, and each one suffers from the same shortcoming: outliers trigger false positive alerts, which are infamous for waking up systems operators in the middle of the night for what turns out to be nothing.
Histogram based alerting paves the way for alerts based on percentiles, which are much more robust than alerts on individual values or averages, both of which are easily skewed by outliers. This also allows for alerting when Service Level Objectives (SLOs) are violated, a capability core to the mission of Site Reliability Engineers (SREs). Alerts based on inverse quantiles are also now possible: “alert me if 20% of my requests in the last 5 minutes exceeded 500ms”, or “alert me if more than 100 requests in the last 5 minutes exceeded 253ms”.
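The difference between averages, percentiles, and inverse quantiles can be sketched in a few lines of Python. This is a simplified illustration that works on a raw list of latency samples rather than on an actual histogram data structure, and the function names (`percentile`, `fraction_above`, `count_above`) are hypothetical, not part of faultd or the Circonus API.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile: the smallest sample such that at least
    q percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def fraction_above(values, threshold):
    """Inverse quantile as a ratio: the share of samples above threshold."""
    if not values:
        return 0.0
    return sum(1 for v in values if v > threshold) / len(values)


def count_above(values, threshold):
    """Inverse quantile as a count: the number of samples above threshold."""
    return sum(1 for v in values if v > threshold)


if __name__ == "__main__":
    # 99 fast requests and a single 5-second outlier, in milliseconds.
    window = [100] * 99 + [5000]
    print("mean:", sum(window) / len(window))          # pulled up by the outlier
    print("p95:", percentile(window, 95))              # unaffected by it
    print("alert on 20% > 500ms:", fraction_above(window, 500) >= 0.20)
    print("alert on >100 requests > 253ms:", count_above(window, 253) > 100)
```

In this window the single outlier raises the mean well above the typical request time, while the 95th percentile stays at the typical value and neither inverse quantile condition fires.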
Under the Hood
Faultd has been engineered in C with the libmtev application framework, which provides highly concurrent lock-free data structures and safe memory reclamation. This implementation is radically more efficient in its use of memory and CPU than the previous fault detection system, which was written in Java. It also provides more powerful ways to scale out for ridiculously large installations, and supports more sophisticated clustering.
As a result, some window function alerts may show increased accuracy. Enterprise customers will enjoy greater reliability in not having to occasionally restart a JVM as part of normal maintenance.
Faultd will be going live on December 17th for Circonus hosted customers. You might not notice anything new that day, and that’s intentional: we expect complete continuity of service during the transition. In the weeks and months that follow, we’ll be showcasing the new features provided by faultd here on this blog so that you can put them to work.