Three Common Challenges to Monitoring StatsD and How to Tackle Them

StatsD is a key unifying protocol and set of tools for collecting application metrics and gaining visibility into application performance. The StatsD protocol was created by Etsy in 2011 for emitting application metrics; the StatsD server followed soon after as a daemon that receives line-protocol metrics and aggregates them. While the StatsD ecosystem has no official backend, Graphite became the most commonly used. StatsD quickly grew in popularity and today is a critical component of many monitoring stacks.
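To make the line protocol concrete, here is a minimal sketch in Python of emitting metrics over UDP in the StatsD wire format, name:value|type, optionally followed by |@sample_rate. The metric names, the helper function, and the local daemon address are illustrative, not part of any standard library.

```python
import socket

def format_metric(name, value, mtype, sample_rate=None):
    """Render one metric in the StatsD line protocol:
    <name>:<value>|<type>[|@<sample rate>]"""
    line = f"{name}:{value}|{mtype}"
    if sample_rate is not None:
        line += f"|@{sample_rate}"
    return line.encode()

metrics = [
    format_metric("page.views", 1, "c"),          # counter: increment by 1
    format_metric("request.latency", 320, "ms"),  # timer: one request took 320 ms
    format_metric("queue.depth", 42, "g"),        # gauge: current queue depth
    format_metric("user.signup", 1, "c", 0.1),    # counter sampled at ~10%
]

# The StatsD daemon traditionally listens for UDP datagrams on port 8125.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
for m in metrics:
    sock.sendto(m, ("127.0.0.1", 8125))
sock.close()
```

Because delivery is fire-and-forget UDP, instrumented applications pay almost nothing to emit a metric, which is a big part of why the protocol spread so widely.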

Despite the protocol's age, legacy StatsD pipelines remain well suited to application monitoring, so long as you can keep up with the volume and submission frequency and have a good place to store the data long-term. That is most realistic for smaller organizations just beginning to analyze telemetry from their applications. As organizations expand their applications and onboard new application teams, however, their StatsD metric load quickly increases. When that happens, their StatsD monitoring solutions inevitably become too fragile to handle the breadth of metrics their applications now emit, which leads to inaccuracies, performance issues, and rising costs.

Many organizations are currently exploring alternatives to their legacy StatsD pipelines to address these challenges. There are many solutions to choose from, ranging from open source tools to managed offerings. What's right for you depends on which of these challenges most affects your organization and which monitoring objectives you are currently working toward. The following is a list of these challenges, their pitfalls, and the type of solution you may want to consider for each. Even if you're not experiencing significant challenges yet, these insights can help you improve the scale and effectiveness of your StatsD monitoring.

Pre-aggregations hinder flexibility when calculating SLOs

Challenge: StatsD has built-in aggregation functions for timers that are performed by the StatsD daemon, including count, min, max, median, standard deviation, sum, sum of squares, and percentiles (e.g., p90). But most StatsD servers only offer static aggregations, which you have to configure upfront. If you want the 97th percentile of a metric's values, for example, you must know from the start that you'll need the 97th percentile and configure it then; otherwise you run the risk of not having the data when it's requested.
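As a simplified illustration of these static aggregations, the sketch below approximates how an Etsy-style daemon summarizes one timer per flush interval. Only percentiles configured upfront are ever produced; the function name and sample values are illustrative.

```python
def aggregate_timer(samples, percent_thresholds=(90,)):
    """Sketch of the static aggregations a StatsD daemon computes per
    flush interval for one timer. Percentiles must be configured
    upfront (percent_thresholds); anything else is simply not stored."""
    s = sorted(samples)
    n = len(s)
    out = {
        "count": n,
        "lower": s[0],
        "upper": s[-1],
        "sum": sum(s),
        "mean": sum(s) / n,
    }
    for p in percent_thresholds:
        # upper_<p>: largest value within the lowest p% of samples
        k = int(round(n * p / 100.0))
        out[f"upper_{p}"] = s[max(k, 1) - 1]
    return out

# One flush window of timer samples (milliseconds)
window = [12, 45, 7, 110, 38, 250, 19, 64, 31, 90]
print(aggregate_timer(window, percent_thresholds=(90,)))
# A later request for p97 can't be answered: only upper_90 was computed.
```

Once the flush window closes, the raw samples are discarded; any percentile that wasn't in the configured list is gone for good.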

Pitfalls: This information is hard to predict upfront, which ultimately prevents teams from dynamically analyzing latencies or calculating SLOs on demand. A manager may want to see a p85 or p80, but the closest thing available may be a p90. Different teams are also forced to use the same SLO thresholds, because they all share the same pre-calculated aggregations.

Solution: If your organization is looking to implement SRE/DevOps principles like SLOs and “measure everything,” then use log-linear histograms for StatsD aggregation. Histograms allow you to efficiently and cost-effectively store all raw data, so that you can perform StatsD aggregations and build percentiles on the fly, after ingestion. Because the histogram contains all the data, no pre-configuration is required. This flexibility empowers your SRE teams to dynamically set and measure their own SLOs for existing and future use cases.
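To illustrate the idea, here is a minimal, hypothetical log-linear histogram in Python, in the spirit of Circonus/OpenHistogram-style bucketing: values are grouped by base-10 exponent plus their two most significant digits, so relative error stays bounded while any percentile can be approximated after ingestion. The class and method names are this sketch's own, not a real library's API.

```python
import math
from collections import Counter

class LogLinearHistogram:
    """Minimal sketch of a log-linear histogram: each positive value
    lands in a bucket keyed by its base-10 exponent and its two most
    significant digits, so storage stays tiny and buckets are mergeable."""

    def __init__(self):
        self.buckets = Counter()

    def insert(self, value):
        exp = math.floor(math.log10(value))
        mantissa = int(value / 10 ** (exp - 1))  # two significant digits, 10..99
        self.buckets[(exp, mantissa)] += 1

    def percentile(self, p):
        """Approximate p-th percentile, computed after ingestion;
        no pre-configuration of which percentiles you will need."""
        total = sum(self.buckets.values())
        target = p / 100.0 * total
        seen = 0
        for (exp, mantissa), count in sorted(self.buckets.items()):
            seen += count
            if seen >= target:
                return mantissa * 10 ** (exp - 1)
        return None

h = LogLinearHistogram()
for latency_ms in [12, 45, 7, 110, 38, 250, 19, 64, 31, 90]:
    h.insert(latency_ms)
# Any percentile on demand: p85 and p97 needed no upfront configuration.
print(h.percentile(85), h.percentile(97))
```

Because buckets from different sources merge by simple addition, histograms from many hosts can be combined and still yield accurate fleet-wide percentiles, something pre-computed per-host percentiles can never do.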

Increase in scale results in performance issues and high operational overhead

Challenge: A lot has changed since 2011. Organizations are embracing Kubernetes, microservices, and stateless applications, which means they're emitting significantly more StatsD metrics. StatsD server aggregations also introduce challenges at scale, including the precalculation of a large number of aggregates, potentially millions, that are never used. In some cases 20+ aggregated series are produced for a single application timer, so what could have been one raw metric becomes 10 or 20 individual series for every metric you want to collect. This consumes significant compute, and all of this data must be flushed to a backend.
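To illustrate the fan-out, this sketch lists the kind of per-flush series a typical daemon derives from a single timer. The exact names and counts vary by implementation and configuration; the helper and metric name here are illustrative.

```python
def timer_series(name, percent_thresholds=(90, 95, 99)):
    """Illustrative list of the stored series a typical StatsD daemon
    derives from one timer each flush (names vary by implementation)."""
    aggs = ["count", "count_ps", "lower", "upper", "sum", "sum_squares",
            "mean", "median", "std"]
    for p in percent_thresholds:
        aggs += [f"upper_{p}", f"mean_{p}", f"sum_{p}"]
    return [f"stats.timers.{name}.{agg}" for agg in aggs]

series = timer_series("api.request.latency")
print(len(series))  # one timer has fanned out into 18 stored series
```

Multiply that fan-out by every timer in every service, and the compute spent on aggregation plus the write load flushed to the backend grows far faster than the number of things actually being measured.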

On top of this, you have to manage multiple independent StatsD server instances, because a given metric must always route to the same server for its aggregations to be correct, as well as relays that duplicate traffic to multiple servers and backends for redundancy. As metric cardinality increases, the reality is that many backends simply can't scale as required, and the operational burden of managing these StatsD pipelines becomes significant.

Pitfalls: The inability to scale as needed inevitably leads to performance issues, loss of visibility, and longer troubleshooting times. Increasingly complex architectures mean more network congestion, higher resource consumption, and higher costs.

Solution: If your company is investing in Kubernetes and microservices, growing rapidly, or emitting a significant volume of StatsD metrics, then you need to invest in a more modern backend database, one that can easily handle the volume of metrics emitted by today's applications. It should also automate redundancy, to remove that burden from your team. You should be able to scale as needed, without sacrificing performance, and confidently deliver great user experiences.

While not required to ensure scale and performance, log-linear histograms can also help here. Because histograms can compress and store all source data for years at low cost, they eliminate the need for multiple StatsD servers performing aggregations. All data is compressed into a single histogram and sent to your backend in one transaction rather than many. Overall, you significantly reduce the number of metrics you're ingesting and storing compared to pre-aggregations (as much as 10-20x less), along with network bandwidth and associated costs.

No data correlation = longer MTTR

Challenge: The original StatsD line protocol has no field for tags, and many server/backend combinations don't support tagging at all. Modern IT environments are dynamic and ephemeral, making tagging essential for monitoring services and infrastructure.

Pitfalls: Without tagging, monitoring today’s complex IT infrastructures becomes ineffective. You lack the ability to slice and dice metrics for visualization and alerting, identify and resolve issues quickly, or correlate insights across business units.

Solution: If you’re looking to advance the sophistication of your monitoring by gaining deeper insights and correlating them in a way that empowers monitoring to drive more business value, then you need a monitoring solution that enables Metrics 2.0 tagging of StatsD telemetry. Metrics 2.0 requires that metrics be tagged with associated “metadata,” or context about the metric being collected, such as the application version. This additional context makes it easier to analyze across various dimensions and drastically improves the insight discovery process among millions of unique metrics. You can search based on these tags and identify specific services for deeper analysis. Tagging allows you to correlate and alert on your data, so that you can more quickly identify the cause of issues and glean more overall intelligence about your operations and performance.
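For example, one widely used extension of the line protocol (popularized by DogStatsD) appends tags after a “|#” delimiter. A minimal sketch, with illustrative tag keys and values:

```python
def format_tagged(name, value, mtype, tags=None):
    """Render a metric with DogStatsD-style tags appended to the line.
    (The original Etsy protocol has no tag field at all.)"""
    line = f"{name}:{value}|{mtype}"
    if tags:
        line += "|#" + ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
    return line

# Context travels with the metric, so the backend can slice by any dimension.
print(format_tagged("request.latency", 320, "ms",
                    {"service": "checkout", "version": "2.1.4",
                     "region": "us-east-1"}))
```

With tags in place, one metric name like request.latency can be filtered or grouped by service, version, or region at query time, instead of encoding those dimensions into ever-longer dotted metric names.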

Improve the ease, scale, and flexibility of your StatsD monitoring

Many StatsD pipelines are not equipped to handle the volume of data emitted by today's applications, causing inaccuracies and limitations when monitoring StatsD metrics. Depending on your business, your monitoring goals, and the StatsD challenges impacting your organization the most, it may be time to evaluate other solutions, so that you can improve the ease and flexibility of your StatsD monitoring and get more value out of all the insightful data you're generating.