Systems Monitoring is Ripe for a Revolution

Before we explore systems, let’s talk users. After all, most of our businesses wouldn’t exist without lots of users; users that have decreasing brand loyalty and who value unintrusive, convenient, and quick experiences. We’ve intuited that if a user has a better experience on a competitor’s site, then they will stop being your customer and start being theirs. Significant research into exactly how much impact is had by substandard web performance started around 2010, progressed to consensus, and has turned into a tome of easily consumable knowledge. What allowed for this? RUM.

Real User Monitoring

The term RUM wasn’t in constant usage until just after 2010, but the concept and practice was a slow growth that transformed the previous synthetic web monitoring industry. Both Keynote and Gomez (the pall bearers of synthetic web monitoring) successfully survived that transition and became leaders in RUM as well. Of course, the industry has many more and varied competitors now.

Synthetic monitoring is the process of performing some action and taking measurements around the aspects of that performance. A simple example would be asking, “how fast does my homepage load?” The old logic was that an automated system would perform a load of your homepage and measure how long various aspects took: initial page load, image rendering, above-the-fold completeness, etc. One problem is that real users are spread around the world, so to simulate them “better,” one would need to place these automated “agents” around the world so that a synthetic load could indeed come from Paris, or Copenhagen, or Detroit. The fundamental problem remained that the measurements being taken represented exactly zero real users of your web site… while users of your website were actively loading your home page. RUM was born when people decided to simply observe what’s actually happening. Now, synthetic monitoring isn’t completely useless, but RUM largely displaced most of its obvious value.

What took RUM so long? The short answer was the size of the problem relative to the capabilities of technology. The idea of tracking the performance of every user action before 2000 was seen as a “Big Data Problem” before we coined the term Big Data. Once the industry better understood how to cope with data volumes like this, RUM solutions became commonplace.

Now it seems fairly obvious to anyone that monitoring real users is fundamental in attempting to understand the behavior of their website and its users… so why not with systems?

Systems are Stuck

Systems, like websites, have “real users,” those users just happen to be other systems most of the time. It is common practice today to synthetically perform some operation against a system and measure facets of that performance. It is uncommon today to passively observe all operations against the system and extract the same measurements. Systems are stuck in synthetic monitoring land.

Now, to be fair, certain technologies have been around for a while that allow the observation of inflight systems; caveat “systems” with a focus on custom applications running on systems.

The APM industry took a thin horizontal slice of this problem, added sampling, and sold a solution (getting much market capitalization). To sum up their “solution,” you have an exceptional view into part of your system some of the time. Imagine selling that story in the web analytics industry today: “now see real users… only on your search pages, only 1% of the time.”

Why don’t we have a magically delicious RUM store for systems? For the same reason it took so long to get RUM. The technology available today doesn’t afford us the opportunity to crunch that much data. Users work in human time (second and minutes) at human scale (tens of millions); systems work at computing time (nanoseconds) at cloud scale (tens of thousands of machines and components). It’s literally a million times harder to think about Real Systems Monitoring (RSM) than it is to think about Real User Monitoring (RUM).

The Birth of Real Systems Monitoring

The technology has not improved a million-fold over the last 10 years, so we can’t solve this RSM problem as comprehensively. But it has improved significantly, so we’re ready for a departure from synthetic systems monitoring into a brave new world. Circonus and many of its industry peers have been nipping at the heels of this problem and we are entering the age of tangible gains. Here’s what’s coming (and 5-10 years from now will be ubiquitous table stakes):

100% sampling of microsecond or larger latencies in systems operation. (ie, You see everything)
Software and services will expose measurement streams from real activity.
Histograms as the primary data types in most measurement systems
Significantly more sophisticated math to cope with human reasoning of large datasets
Measurement collection on computer-scale (billions of measurements per second)
Ultimately a merge of RUM and RSM… After all, we only have systems because we have users.

Exciting Times

At Circonus, we’ve been building the architectures required to tackle these problems: the scale, the histograms, and the math. We see the cost efficiencies increasing, resulting in positive (and often huge) returns on investment. We see software and service providers avidly adding instrumentation that exposes real measurements to interested observers. We’re at an inflection point and the world of systems monitoring is about to take an evolutionary leap forward. These are exciting times.