No, We “Fixed the Glitch”

If you haven’t seen the movie Office Space, you should do so at your earliest convenience. As with the new TV comedy, “Silicon Valley,” Mike Judge hits far too close to home for the movie to be comfortable… its hilarity, on the other hand, is indisputable. So much of our lives is wrapped up in “making the machine work” that comedic exposures of our industry’s deep malfunctions are, perhaps, the only things that keep me sane.

Not a day goes by that I don’t see some scene or line from “Office Space” percolate up from either the industry or Circonus itself. What happened just after 21:30 UTC on October 3rd was another one of these events, and the situation that brought it up is interesting enough to share.

In “Office Space,” there is an employee named Milton, whom management believes they have fired, but who has been working and getting paid for years. Classic communication breakdown. However, due to the over-the-top passive-aggressive behavior in the organization, management doesn’t want a confrontation to correct the situation. Instead of informing Milton, they simply decide to stop paying him and let the situation work itself out… They “fixed the glitch.” If you do this, you’re an asshole. Spoiler alert: Milton burns the building down.

The interesting thing about software is that it is full of bugs. So full of bugs, that we tend to fix things we didn’t even know were broken. While it’s no less frustrating to have a “glitch” fixed on you, it’s a bit more understandable when it happens unintentionally. We’re fixing glitches before they are identified as glitches. This most commonly occurs in undocumented behavior that is assumed to be stable by some consumer of a system. It happens during a new feature introduction, or some other unrelated bug fixing, or a reimplementation of the system exhibiting the undocumented behavior, and then boom… some unsuspecting consumer has their world turned upside down. I’m sure we’ve done this at Circonus.

On October 3rd, a few customers had their Amazon Cloudwatch checks stop returning data. After much fretting and testing, we could find nothing wrong with Amazon’s API. Sure, it was a bit slow and gave stale information, but this is something we’ve accommodated from the beginning. Amazon’s Cloudwatch service is literally a metrics tire fire. But this was different… the answers just stopped happening.

Circonus’ collection system is three-tier (unlike many of our competitors that use two-tier systems). First, there’s the thing that’s got the info: the agent. In this case, the agent is the Cloudwatch API itself. Then, there’s the thing that stores and analyzes the data: Circonus SaaS. And finally there’s this middle tier that talks to the agents, then stores and forwards the data back to Circonus SaaS. We call this the broker. Brokers are basically babelfish; they speak every protocol (e.g. they can interrogate the Cloudwatch API), and they are spread out throughout the world. By spreading them out, we can place brokers closer to the agents so that network disruptions don’t affect the collection of data, and so that we get a more resilient observation fabric. This explains why I can assert that “we didn’t change anything,” even with as many as fifty code launches per day. The particular broker in question, the one talking to the Cloudwatch API, hadn’t been upgraded in weeks. Additionally, we audit changes to the configuration of the broker, and the configurations related to Cloudwatch interrogations haven’t been modified either.

So, with no changes to the system or code asking Cloudwatch for data and no changes to the questions we are asking Cloudwatch, how could the answers just stop? Our first thought was that Amazon must have changed something, but that’s a pretty unsatisfying speculation without more evidence.

The way Cloudwatch works is that you ask for a metric and then limit the ask by fixing certain dimensions on the data. For example, if I wanted to look at a specific Elastic Load Balancer (ELB) servicing one of my properties and ascertain the number of healthy hosts backing it, then I’d work backwards. First, I’d ask for the number of healthy hosts, the “HealthyHostCount”, and then I’d limit that to the namespace “AWS/ELB” and specify a set of dimensions. Some of the available dimensions are “Service”, “Namespace”, and “LoadBalancerName”. Now, our Cloudwatch integration is very flexible, and users can specify whatever dimensions they please, knowing that it is possible that they might work themselves out of an answer (by setting dimensions that are not possible).

The particular Cloudwatch interrogation said that the dimensions should match the following: Service=“ELB”, Namespace=“AWS”, and LoadBalancerName=“website-prod13”. And behold: data. The broker was set to collect this data at 12:00 UTC on October 1st and to check it every minute.
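To make the shape of that question concrete, here is a minimal sketch of the request parameters, laid out the way the Cloudwatch GetMetricStatistics call expects them (a namespace, a metric name, and a list of name/value dimensions). The `build_query` helper is hypothetical, and the 60-second period and the “Average” statistic are assumptions chosen to match a once-a-minute check; this is not the broker’s actual code.

```python
from datetime import datetime, timedelta, timezone


def build_query(metric, namespace, dimensions, end, period_seconds=60):
    """Assemble parameters in the shape of a GetMetricStatistics request.

    Hypothetical helper for illustration only. The period and the
    statistic are assumptions, not values taken from the real check.
    """
    return {
        "Namespace": namespace,
        "MetricName": metric,
        # Each dimension narrows the question a little further.
        "Dimensions": [{"Name": k, "Value": v} for k, v in dimensions.items()],
        "StartTime": end - timedelta(seconds=period_seconds),
        "EndTime": end,
        "Period": period_seconds,
        "Statistics": ["Average"],
    }


# The dimensions exactly as the customer's check specified them.
query = build_query(
    metric="HealthyHostCount",
    namespace="AWS/ELB",
    dimensions={
        "Service": "ELB",
        "Namespace": "AWS",
        "LoadBalancerName": "website-prod13",
    },
    end=datetime.now(timezone.utc),
)

for dim in query["Dimensions"]:
    print(dim["Name"], "=", dim["Value"])
```

Nothing in this request changed between October 1st and October 3rd, which is what makes the rest of the story interesting.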

As we can see from this graph, while it worked at first, there appears to be an outage. “It just stopped working.” Or did it? Around 21:30 on October 3rd, things went off the rails.

This graph tells a very different story than things “just stopping.” For anyone that runs very large clusters of machines where they do staged rollouts, this might look familiar. It looks a lot like a probability of 1 shifting to a probability of 0 over about two hours. Remember, there are no changes in what we are asking or how we are asking it… just different answers. In this case, the expected answer is 2, but we received no answer at all.
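A toy model of why a staged rollout produces that shape, assuming (purely for illustration) that the new code spreads across the serving fleet linearly over 120 minutes and that each request lands on a random server. The function name and the linear schedule are my assumptions; I have no visibility into how Amazon actually deploys.

```python
def answer_probability(minutes_since_start, rollout_minutes=120):
    """Chance that a request still reaches a server running the old code.

    Assumes a linear rollout and uniformly random request routing,
    both of which are simplifications.
    """
    upgraded = min(max(minutes_since_start / rollout_minutes, 0.0), 1.0)
    return 1.0 - upgraded


for minutes in (0, 30, 60, 90, 120):
    print(minutes, answer_probability(minutes))
```

Early in the rollout nearly every query still gets an answer; by the end, none do. A check that runs once a minute samples that curve and shows a gradual fade rather than a cliff.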

The part I regret most about this story is how long it took for the problem to be completely resolved. It turns out that removing the Service=“ELB” and Namespace=“AWS” dimensions, leaving only LoadBalancerName=“website-prod13”, resulted in Amazon Cloudwatch correctly returning the expected answer again. The sudden recovery on October 7th wasn’t magic; the customer changed the configuration in Circonus to eliminate those two dimensions from the query.
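One way to picture the kind of change that would produce exactly this behavior is a switch from lenient to strict dimension matching. The sketch below is speculation in code form: the `STORED` data, the `lookup` function, and both matching rules are invented for illustration and say nothing about how Cloudwatch is really implemented.

```python
# Hypothetical stored series: only LoadBalancerName is a real dimension on it.
STORED = [
    {
        "metric": "HealthyHostCount",
        "dimensions": {"LoadBalancerName": "website-prod13"},
        "value": 2,
    },
]


def lookup(metric, query_dims, strict):
    """Return the value of the first matching series, or None."""
    for series in STORED:
        if series["metric"] != metric:
            continue
        dims = series["dimensions"]
        if strict:
            # Strict: the query's dimensions must equal the series' exactly.
            matched = query_dims == dims
        else:
            # Lenient: extra dimensions in the query are silently ignored.
            matched = all(query_dims.get(k) == v for k, v in dims.items())
        if matched:
            return series["value"]
    return None


original = {"Service": "ELB", "Namespace": "AWS", "LoadBalancerName": "website-prod13"}
reduced = {"LoadBalancerName": "website-prod13"}

print(lookup("HealthyHostCount", original, strict=False))  # 2
print(lookup("HealthyHostCount", original, strict=True))   # None
print(lookup("HealthyHostCount", reduced, strict=True))    # 2
```

Under the lenient rule the over-specified query works by accident; under the strict rule it quietly returns nothing, and only the reduced query gets its answer back.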

Our confidence is pretty high that nothing changed on our end. My confidence is also pretty high that in a code launch on October 3rd, Amazon “fixed a glitch.”