Consider me old-fashioned, but I remember a time when an alert notification meant something. Drives failed, servers ran short on memory, or a cage monkey pulled the wrong cable at 3 A.M. Regardless of the circumstance, it demanded attention. Those were the days.
Today, operations is all about doing more with less. No more dedicated hardware or late-night maintenance windows. Everything is virtual, cloud-based, or filling up squares in the grid. Automation reigns supreme, limitless scalability at our disposal. Abstraction at its finest.
But woe unto you, the flapping anomaly.
That visitor who tried to load your website was turned away, timed out and left to wither. Poor Jane wanted to view your site. She needed to view your site. She’d already submitted her order, only to be ignored. Forgotten. Disconnected with nary a trace to route nor a cookie to favor.
Jane was a victim of a numbers game. Someone, somewhere, decided that some problems don’t matter. Which ones? Who cares? They don’t matter. And because she happened to visit when this problem reared its head, you ignored her request. Who would ever make such a silly presumption that one failure is less important than another? What criteria are used to determine the worthiness of this alert or that one? Pure random circumstance, it would appear.
Many “uptime” services and monitoring suites promote the concept of selective or flapping failures. Vendors sell these features as a convenience, ostensibly as a sleep aid. The administrator’s snooze-bar. I can’t think of any other reason that ignoring a faulty condition would be considered a good thing. Perhaps they reason that only the check is affected. If it responds after the third attempt, it was probably OK for visitors all along. Right?
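To put rough numbers on that snooze-bar, here is a minimal sketch in Python. It is not any vendor's implementation; the function names, the three-attempt threshold, and the check history are all made up for illustration. It compares a check that only alerts after several consecutive failures against one that reports every failure.

```python
from typing import Iterable, List


def alerts_with_retry_threshold(results: Iterable[bool], max_attempts: int = 3) -> List[int]:
    """Indices at which an alert fires when a check must fail
    max_attempts times in a row before anyone is notified.
    (Hypothetical policy, assumed for illustration.)"""
    alerts = []
    consecutive_failures = 0
    for i, ok in enumerate(results):
        if ok:
            # Any success resets the counter; earlier failures are forgotten.
            consecutive_failures = 0
            continue
        consecutive_failures += 1
        if consecutive_failures == max_attempts:
            alerts.append(i)
    return alerts


def alerts_on_every_failure(results: Iterable[bool]) -> List[int]:
    """Indices of every failed check, with nothing suppressed."""
    return [i for i, ok in enumerate(results) if not ok]


# Invented check history: True = responded, False = failed or timed out.
history = [True, False, True, True, False, False, True, False, False, False, True]

suppressed = alerts_with_retry_threshold(history)
strict = alerts_on_every_failure(history)

print("alerts with threshold:", suppressed)
print("alerts on every failure:", strict)
print("failed checks that raised no alert:", len(strict) - len(suppressed))
```

With this invented history, six checks fail and the threshold policy raises a single alert. Each of the other five could have been a Jane.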
It’s disappointing how many vendors embrace this broken methodology. It probably seemed innocent at a glance. But the damage has been done; recklessness has taken root. We’ve been conditioned to accept these transient malfunctions as mere operational speed bumps. Rather than address the problem, we nudge the threshold a tad higher. Throw additional nodes into the cluster. Increase capacity, while decreasing exposure.
But there is a more responsible alternative. Whatever happened to purposeful, iterative corrections and Root Cause Analysis? Notifications may be annoying at times, but they serve a crucial function in a healthy production architecture. Ignored alerts lead to stagnant bugs, lost traffic and missed opportunities. Stop treating your visitors like they don’t matter. There’s no such thing as a flapping customer.