Many organizations that rely on Graphite ingest their telemetry through StatsD. If that describes your stack, you're likely suffering from aggregation bloat.
In a typical Graphite ingestion pipeline, applications emit data points via UDP, which are then received by an aggregator such as StatsD.
Most StatsD servers offer only static aggregations, which must be configured up front. For example, if you want the 92nd percentile of a metric's values, you must anticipate that need and configure it in advance.
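For instance, in the reference (Etsy) StatsD daemon, percentiles are declared statically via the percentThreshold option in config.js. A sketch, with placeholder host and port values:

```javascript
// Sketch of a config.js for the reference (Etsy) StatsD daemon.
// Host values are placeholders. percentThreshold fixes which percentiles
// get computed: every percentile you might ever want must be listed up
// front, and adding one later only affects data collected from then on.
{
  graphiteHost: "graphite.example.com",
  graphitePort: 2003,
  port: 8125,                     // UDP port StatsD listens on
  percentThreshold: [90, 92, 95]  // static, pre-declared percentiles
}
```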
It is difficult to know in advance which percentile is appropriate for a given metric. In practice, practitioners guess, hoping to pick the right metrics to aggregate based on what they think they'll need later. The result is that teams often create metrics they don't need while lacking the ones they do.
Teams commonly require different aggregations. For example, perhaps one team is interested in the 90th, 92nd, and 95th percentiles, while another wants the 55th and 60th. The common workaround is to pre-compute all five percentiles.
Unfortunately, this means aggregating far more metrics than any one team needs. Worse yet, all of those extra metrics must be stored and managed across servers, which has a real financial and performance impact.
Additionally, teams tend to change the aggregations they collect over time, which in real terms means continuing to add more aggregations, further magnifying the problem.
Over time, it's common to find that the majority of the metrics you collect are ones no one even looks at. However, it is nigh impossible to prove that no person or process relies on a given metric. So rather than risk breaking a service with an undocumented dependency, teams tend to live with the slow queries caused by their cluttered Whisper storage.
If this resembles your current situation, you should consider histograms.
Circonus Histograms Eliminate The Need For Pre-Aggregation Servers
A histogram is a representation of the distribution of a continuous variable, in which the entire range of values is divided into a series of intervals (or “bins”) and the representation displays how many values fall into each bin. Histograms visualize the distribution of latency data to make it easy for engineers to identify disruptions and concentrations and ensure performance requirements are met.
Histograms eliminate pre-aggregation servers entirely: they capture all of your raw measurements and store them extremely efficiently, with no racks of pre-aggregation servers required. This means you'll never again be forced to guess ahead of time which metrics you need.
In fact, you’ll be able to conduct on-demand aggregations whenever you choose, after the fact—including advanced calculations, such as arbitrary quantiles, percentiles, inverse quantiles, and inverse percentiles—all without needing to manage or store additional metrics, or to reconfigure an aggregator.
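As a sketch of the idea (this is not Circonus's implementation; the bin edges, counts, and function names below are invented for illustration), both a percentile and an inverse percentile can be recovered after the fact from nothing but stored bin counts:

```python
from bisect import bisect_right

# Hypothetical stored histogram: bin upper edges (ms) and sample counts.
edges = [1, 2, 5, 10, 20, 50, 100, 200, 500]
counts = [120, 340, 910, 1500, 820, 310, 70, 20, 5]

def percentile(edges, counts, p):
    """Approximate the p-th percentile: return the upper edge of the
    bin in which the p-th percentile sample falls."""
    target = sum(counts) * p / 100.0
    running = 0
    for edge, count in zip(edges, counts):
        running += count
        if running >= target:
            return edge
    return edges[-1]

def inverse_percentile(edges, counts, threshold):
    """Fraction of samples at or below `threshold` (the inverse query)."""
    idx = bisect_right(edges, threshold)
    return sum(counts[:idx]) / sum(counts)

print(percentile(edges, counts, 95))         # -> 50 (falls in the <=50 ms bin)
print(inverse_percentile(edges, counts, 20)) # -> ~0.90 of requests within 20 ms
```

Because the bins are retained, any percentile, not just the ones someone thought to configure, can be answered later from the same data.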
In the end, your metric ingestion will be reduced by 20-40x, because log-linear histograms are such an information-dense way of representing your data. You will gain more efficient ingestion and more flexible querying, eliminate spurious aggregations, and cut down on overall metric congestion.
Equally important, you will now be able to empower downstream users to leverage histograms that enable modern SRE practices like SLOs and error budgeting.
Circonus Histograms Help You Ask Better Questions
Rarely are we able to ask all of the relevant questions that would fully explain the root cause of a service impacting event. New questions and realizations can present themselves at any time, often long after a particular event has passed. In such cases, there is a distinct need to “go back in time” to investigate past failures in light of these new questions and ideas.
Histograms make this possible by offering unlimited data retention and efficiently storing all raw latency data. This provides the ability to easily calculate any percentile you would like to see, on demand, and turns activities like postmortems into enlightening experiences that provide new knowledge that reduces future risk.
Near unlimited data retention and querying ability also make histograms the best way to compute latency SLOs, whether you are improving efficiencies in a long-established enterprise or still evaluating your service and are not yet ready to commit to a latency threshold.
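For example (a minimal sketch with made-up numbers, not a Circonus feature), a latency SLO and its error budget can be checked directly against stored bin counts:

```python
# Hypothetical histogram for one reporting period: bin upper edges in ms
# and the number of requests that landed in each bin.
edges = [10, 25, 50, 100, 250, 500]
counts = [4000, 3500, 1800, 500, 150, 50]

slo_threshold_ms = 100  # "requests should complete within 100 ms..."
slo_target = 0.99       # "...at least 99% of the time"

total = sum(counts)
within = sum(c for e, c in zip(edges, counts) if e <= slo_threshold_ms)
achieved = within / total

budget_allowed = total * (1 - slo_target)  # requests permitted to miss
budget_used = total - within               # requests that actually missed

print(f"achieved {achieved:.2%} against a {slo_target:.0%} target")
print(f"error budget: {budget_used} used of {budget_allowed:.0f} allowed")
```

Because the raw distribution is retained, the same stored data can answer a different threshold (say, 50 ms) or a different target later, without re-collecting anything.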
In these ways and more, employing an enterprise monitoring and observability platform that stores data as histograms will ensure your business is making truly informed decisions about its SLO commitments.
Redfin Uses Circonus Monitoring and Observability
Real estate listing and brokerage service Redfin formerly relied on Graphite to monitor the various applications and services that power its web and mobile experiences.
Historically, Redfin had relied on a conventional Graphite deployment—their applications emitted StatsD telemetry via UDP, which was then shipped to a StatsD aggregator, and the resulting aggregations were stored in Whisper.
Like many businesses, Redfin reached a point at which the volume of StatsD telemetry it was emitting had become difficult to manage. This inability to scale its metric ingestion pipeline (and the corresponding operational overhead) hindered its goal of adopting modern SRE practices.
Further complicating things was the fact that Redfin had an incredibly large number of Graphite dashboards and alerting rules that it wanted to keep—and seamlessly move over to a modern platform without having to rebuild everything.
Redfin was able to accomplish this and more by replacing its legacy StatsD and Graphite components with the full Circonus platform.
The company re-architected its metric ingestion pipeline using OpenHistograms, and in doing so, reduced its overall metric footprint by 50 percent, thereby solving its aggregation bloat issue.
The Redfin team can now aggregate on demand, so it no longer needs to store the many aggregations it doesn't use, and it can more easily create SLOs and monitor compliance with them.
Moreover, Redfin benefits from the incredible data safety of the broader Circonus monitoring platform, as discussed in our previous post.
Migrating Graphite to Circonus
As with the IronDB "drop-in" Whisper database replacement, when you adopt the complete Circonus platform, Circonus transfers all of your Graphite dashboards near instantaneously.
You get to keep your dashboards and everything else you like about Graphite. Only now, everything runs faster, your metric ingestion footprint is dramatically reduced, you can calculate latency data on demand, and you can use all of the Circonus platform's features. The end result is lower overhead and management costs, greater efficiency, and access to an array of additional features. This helps you improve the accuracy and flexibility of your SLOs and implement modern SRE practices that benefit both your customers and your bottom line.