Alerting on disk space the right way.
Most people who alert on disk space use an arbitrary threshold, such as “notify me when my disk is 85% full.” Then they get alerted, spend an hour trying to delete things, and update their rule to “notify me when my disk is 86% full.” Sounds dumb, right? I’ve done it, and pretty much everyone I know in operations has done it. The good news is that we didn’t do this because we’re all stupid; we did it because the tools we were using didn’t let us ask the questions we really wanted answered. Let’s work backwards to a better disk space check.
There are occasionally reasons to set static thresholds, but most of the time, when we care about disk space, it’s because we need to buy more. The question then becomes, “how much advance notice do I need?” Let’s assume, for the sake of argument, that I need 4 weeks to execute on increasing storage capacity (planning for and scheduling possible system downtime, resizing a LUN, etc.). If you’re running a cloudy sort of architecture, maybe you only need a single day, so that the change happens during a maintenance window where all the necessary parties are available. After all, why would you want to act on this in an emergency?
Really, the question we’re aiming at is “will I run out of disk space in the next 4 weeks?” It turns out that this is a very simple statistical question, and with a few hints you can get an answer in short order. First we need a model of the data growth, and this is where we need a bit more information. Specifically, how much history should drive the model? This depends heavily on how the system is used, but most systems have a fairly steady growth pattern, and you’ll want to include some multiple of the period of that pattern.
To be a little more example-oriented, let’s say we have a system whose disk usage is growing over time and that also generates logs that get deleted daily. We expect a general upward trend with a daily periodic oscillation as we accumulate log files and then wipe them out. As a rule of thumb, one week of data should be sufficient for most systems, so we should build our model off 7 days’ worth of history.
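To make the shape of that data concrete, here is a tiny synthetic generator (purely illustrative; the growth rate and log volume are made-up numbers) producing a steady trend plus a daily sawtooth:

```python
def synthetic_disk_usage(days=7, samples_per_day=24):
    """Illustrative usage series (percent full): steady growth plus a
    daily sawtooth from log files that accumulate and are wiped daily."""
    series = []
    for i in range(days * samples_per_day):
        t = i / samples_per_day        # time in days
        trend = 60.0 + 0.5 * t         # underlying growth: 0.5% per day
        logs = 3.0 * (t % 1.0)         # logs build up, cleared each day
        series.append(trend + logs)
    return series

usage = synthetic_disk_usage()  # 168 hourly samples over one week
```

Over a full week, the sawtooth averages out and the trend dominates, which is exactly why a whole multiple of the period makes a good history window.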
Quite simply, we take our data over the last 7 days and fit a regression model. Then we time-shift the regression model backwards by 4 weeks (the amount of notice we’d like), so that its “current value” is the model-predicted utilization four weeks from today. If that value is more than 100%, we need to tell someone. Easy.
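As a sketch of that procedure (the names here are illustrative, not any tool’s actual implementation), an ordinary least-squares line fit plus a 4-week projection is only a few lines of Python:

```python
def forecast_usage(samples, horizon):
    """Fit a least-squares line to (time, pct_used) pairs and return the
    predicted utilization `horizon` time units after the last sample."""
    n = len(samples)
    sum_t = sum(t for t, _ in samples)
    sum_y = sum(y for _, y in samples)
    sum_tt = sum(t * t for t, _ in samples)
    sum_ty = sum(t * y for t, y in samples)
    slope = (n * sum_ty - sum_t * sum_y) / (n * sum_tt - sum_t ** 2)
    intercept = (sum_y - slope * sum_t) / n
    return slope * (samples[-1][0] + horizon) + intercept

# Hypothetical history: one sample per hour for 7 days, disk starting
# at 90% full and growing about 0.5% per day (time measured in days).
history = [(h / 24.0, 90.0 + 0.5 * h / 24.0) for h in range(7 * 24)]
predicted = forecast_usage(history, horizon=28)  # 4 weeks out
alert = predicted >= 100.0  # we'd hit capacity before our notice runs out
```

Note that because the history window is a whole multiple of the daily log cycle, the linear fit naturally averages over the sawtooth and tracks the underlying trend.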
Suffice it to say, some tools require extracting the data into Excel, or pulling it out with R or Python, to accomplish this. While those tools work well, they fail to fit the bill for monitoring, because the model and projected value must be constantly recalculated as new data arrives in order to keep the mean time to detect (MTTD) within expectations.
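That continuous recalculation is the part a one-off spreadsheet analysis can’t give you. A minimal rolling-window sketch (using NumPy’s `polyfit`; the function names, window size, and threshold are assumptions for illustration, not any monitoring product’s API) might look like:

```python
from collections import deque

import numpy as np

WINDOW = 7 * 24     # keep one week of hourly samples
HORIZON = 28.0      # project 4 weeks (in days) past "now"

window = deque(maxlen=WINDOW)  # old samples fall off automatically

def check(t_days, pct_used):
    """Refit the model on every new sample so the projection
    always reflects the latest data; True means 'tell someone'."""
    window.append((t_days, pct_used))
    if len(window) < 2:
        return False  # not enough history to fit a line yet
    t, y = zip(*window)
    slope, intercept = np.polyfit(t, y, 1)
    return bool(slope * (t[-1] + HORIZON) + intercept >= 100.0)
```

Each incoming sample triggers a cheap refit over the rolling window, so the alert fires as soon as the projected trajectory crosses capacity, rather than when a human next opens a spreadsheet.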
While Circonus has had this feature squirreled away for many months, I’m pleased to say that the alerting UI has been refactored and it is now accessible to mere mortals (at least those mortals who use Circonus).