Visualizing Regressions

We’ve heard a lot of talk about Continuous Deployment strategies over the last 12-18 months. Timothy Fitz was one of the earliest proponents, publishing stories of their success over at IMVU last year. One of the greatest benefits to continually pushing your changes to production is that it takes less time and effort to find bugs when something goes wrong, since you have fewer commits in-between to navigate. But even with this style of release management, it helps to know which versions of code are running live on your components at any point. What happens when your newest code is enough to alter the normal behavior of the system, but not so drastic as to trigger an alert?

One of the nicer trending features in Circonus (or its open-source relative, Reconnoiter) is the ability to correlate unrelated datasets. I can take any collection of metrics on my account and group them together on a single graph. But what if you could view isolated events on the same graph, as an orthogonal data point? Check out these two graphs displaying some recent activity on one of our fault detection systems. The vertical lines represent the point at which a text metric’s value changed. Circonus renders them this way so you can easily recognize that specific moment in time.

In the first graph I’m hovering over a dip in performance caused by the most recent release to that component (svn r6230). In the second graph we’re running a fix (svn r6232) for the regression introduced in the previous commit. Could I have done the same level of correlation manually? Of course, but it’s nice to be able to zoom out and study the long-term effects of our release strategy on our overall stability. This is an enormously helpful tool for performing Root Cause Analysis on our live systems, especially if you perform releases many times in a week (like we do). If you’re one of many using automation and Configuration Management suites like Puppet, Chef and the Marionette Collective, no doubt you’ll find it even more useful.

If you’d like to start trending your own text metrics, check out the Resmon DTD. Circonus can pull in your custom metrics in this format. Although the version numbers I mentioned earlier look like integers (well, they are integers), I can explicitly cast them as a string metric using the Resmon DTD. Here is what that might look like:

<ResmonResults> 
  <ResmonResult module="Site::CircProd" service="vers"> 
    <last_runtime_seconds>0.000274</last_runtime_seconds> 
    <last_update>1288044642</last_update> 
    <metric name="ernie" type="s">6297</metric> 
  </ResmonResult> 
</ResmonResults>
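
If your deploy tooling already knows the running revision, a document like this is easy to generate. The following is a minimal Python sketch, not an official Resmon client: the function name `build_resmon_xml` and its parameters are made up for illustration, and it assumes that mirroring the elements in the sample above is enough for your check. The only metric type it uses is the string type ("s") shown in the sample.

```python
import time
import xml.etree.ElementTree as ET


def build_resmon_xml(module, service, metrics, runtime_seconds=0.0, now=None):
    """Return a Resmon-style XML string with every metric typed as a string ("s")."""
    root = ET.Element("ResmonResults")
    result = ET.SubElement(root, "ResmonResult", module=module, service=service)
    ET.SubElement(result, "last_runtime_seconds").text = f"{runtime_seconds:.6f}"
    # last_update is a Unix timestamp, as in the sample document.
    timestamp = int(now if now is not None else time.time())
    ET.SubElement(result, "last_update").text = str(timestamp)
    for name, value in metrics.items():
        # str() forces values such as revision numbers to be reported as text.
        metric = ET.SubElement(result, "metric", name=name, type="s")
        metric.text = str(value)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_resmon_xml("Site::CircProd", "vers", {"ernie": 6297},
                           runtime_seconds=0.000274, now=1288044642))
```

Serving that string from any URL your check can reach is left out here, since how you expose it depends on your environment.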

As you might imagine, you can get pretty creative with the sort of data you can pull into Circonus. In our next post I plan to look at how you can combine WebHook Notifications (that Brian announced last week) with these text metrics to start trending your alert history. Stay tuned!

WebHook Notifications

This week we added support for webhook notifications in Circonus. For those who are unsure what a webhook is, it’s simply an HTTP POST with all the information about an alert you would normally get via email, XMPP or AIM.

Webhooks can be added to any contact group. Unlike other methods, you can’t add one to an individual user and then add that user to a group, although this might be supported in the future based on feedback. To add one, go to your account profile, click on the “Type to Add New Contact” field on the group you would like to add the hook to, and enter the URL you would like us to contact. The contact type will then display as your URL with the method of HTTP (for brevity).

Now that your hook is set up, what will it look like when the data is posted to you? Here is a Perl Data::Dumper example, grouped by alert for readability, of the parameters posted for 2 alerts:

%post = (
    'alert_id' => [
        '21190',
        '21191'
    ],
    'account_name' => 'My Account',
    'severity_21190' => '1',
    'metric_name_21190' => 'A',
    'check_name_21190' => 'My Check',
    'agent_21190' => 'Ashburn, VA, US',
    'alert_value_21190' => '91.0',
    'clear_value_21190' => '0.0',
    'alert_time_21190' => 'Thu, 21 Oct 2010 16:35:49',
    'clear_time_21190' => 'Thu, 21 Oct 2010 16:36:49',
    'alert_url_21190' =>
        'https://circonus.com/account/my_account/fault-detection?alert_id=21190',
    'severity_21191' => '1',
    'metric_name_21191' => 'B',
    'check_name_21191' => 'My Other Check',
    'agent_21191' => 'Ashburn, VA, US',
    'alert_value_21191' => '91.0',
    'alert_time_21191' => 'Thu, 21 Oct 2010 16:36:21',
    'alert_url_21191' =>
        'https://circonus.com/account/my_account/fault-detection?alert_id=21191',
);

So let’s look at what we have here. The first thing to notice is that we pass multiple alert_id parameters, giving you the ID of each alert in the payload. From there, every other alert-specific parameter is suffixed with _<alert_id> so you know which alert that parameter is associated with. In this example 21190 is a recovery and 21191 is an alert; recoveries get the additional parameters clear_value and clear_time.
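
On the receiving end, the suffix convention makes it straightforward to regroup the flat parameter list into one record per alert. Here is a rough Python sketch of that regrouping. It assumes the POST body arrives form-encoded (application/x-www-form-urlencoded), which the dump above suggests but does not state outright; `group_alerts` and `SAMPLE_BODY` are illustrative names, and the sample carries only a few of the fields shown above.

```python
from urllib.parse import parse_qs, urlencode


def group_alerts(form_body):
    """Split a form-encoded webhook body into shared fields and per-alert fields."""
    params = parse_qs(form_body)
    alert_ids = params.get("alert_id", [])
    suffixes = {alert_id: "_" + alert_id for alert_id in alert_ids}

    alerts = {}
    for alert_id, suffix in suffixes.items():
        # Strip the "_<alert_id>" suffix so each alert gets plain field names.
        fields = {
            key[: -len(suffix)]: values[0]
            for key, values in params.items()
            if key.endswith(suffix)
        }
        # Recoveries carry clear_value and clear_time; open alerts do not.
        fields["recovered"] = "clear_time" in fields
        alerts[alert_id] = fields

    # Anything without an alert suffix (such as account_name) applies to the whole payload.
    shared = {
        key: values[0]
        for key, values in params.items()
        if key != "alert_id" and not key.endswith(tuple(suffixes.values()))
    }
    return shared, alerts


# A cut-down body built from a few of the parameters in the dump above.
SAMPLE_BODY = urlencode([
    ("alert_id", "21190"), ("alert_id", "21191"),
    ("account_name", "My Account"),
    ("metric_name_21190", "A"), ("alert_value_21190", "91.0"),
    ("clear_value_21190", "0.0"), ("clear_time_21190", "Thu, 21 Oct 2010 16:36:49"),
    ("metric_name_21191", "B"), ("alert_value_21191", "91.0"),
])

if __name__ == "__main__":
    shared, alerts = group_alerts(SAMPLE_BODY)
    for alert_id, fields in alerts.items():
        print(alert_id, "recovery" if fields["recovered"] else "alert", fields["metric_name"])
```

Keying off clear_time to tell recoveries from alerts follows the description above; if your payloads differ, adjust that check accordingly.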

Webhooks open up all sorts of possibilities both inside and outside of Circonus. Maybe you have a crazy complicated paging schedule, or prefer a contact method that we don’t natively support yet. Fair enough: let us post the data to you and you can integrate it however you like. Want to graph your alerts? We are in the process of working on a way to overlay alerts on any graph, but in the meantime, set up your webhook and feed the data back to Circonus via Resmon XML. Now you have data for your graphs.
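
As a rough illustration of that last idea, the Python sketch below turns per-alert fields into a Resmon-style document with one string metric per alert. Everything specific here is an assumption made for the example: the function name `alerts_to_resmon`, the "Site::Alerts" module name, the choice to report "alert" or "clear" as the metric value, and the shape of the input dictionary (alert ID mapped to un-suffixed webhook fields).

```python
import time
import xml.etree.ElementTree as ET


def alerts_to_resmon(alerts, module="Site::Alerts", now=None):
    """Render each alert's state as a string metric in a Resmon-style document.

    `alerts` maps an alert ID to a dict of webhook fields with the
    "_<alert_id>" suffix already removed.
    """
    timestamp = str(int(now if now is not None else time.time()))
    root = ET.Element("ResmonResults")
    for alert_id, fields in sorted(alerts.items()):
        # One ResmonResult per check; fall back to the alert ID if no name was posted.
        service = fields.get("check_name", alert_id)
        result = ET.SubElement(root, "ResmonResult", module=module, service=service)
        ET.SubElement(result, "last_update").text = timestamp
        # A recovery is recognised by the presence of clear_time.
        state = "clear" if "clear_time" in fields else "alert"
        metric = ET.SubElement(result, "metric",
                               name=fields.get("metric_name", "unknown"), type="s")
        metric.text = state
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(alerts_to_resmon({
        "21190": {"check_name": "My Check", "metric_name": "A",
                  "clear_time": "Thu, 21 Oct 2010 16:36:49"},
        "21191": {"check_name": "My Other Check", "metric_name": "B"},
    }, now=1287678981))
```

Because the metric is text, each flip between "alert" and "clear" is a value change that can be trended like any other text metric.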

If you are curious about other features and would like to see an in-depth post on them, please contact us at hello@circonus.com.

Good Times in Charm City

It’s been a while since I had time to enjoy the technical conference scene. Thanks to my involvement with Circonus, I have plenty of action scheduled between RailsConf, Velocity and the Surge Scalability Conference. We attended RailsConf in Baltimore a couple weeks ago and had a great time. Circonus had an exhibition booth and we gave out tons of demonstrations, free swag and t-shirts. But the best part of any con is catching up with old friends and making new ones.

I finally met Mark Imbriaco of 37signals in person. Mark has been a valued user for us, giving plenty of awesome feedback during the beta and after our production launch. If you haven’t seen it already, check out Mark’s interview on webpulp.tv. He offers a lot of insight into 37signals’ operations and architecture. Good stuff.

Last but not least, a nice relationship blossomed out of our participation at RailsConf. I’ve been aware of the RPM service over at NewRelic for a while now. Although they sometimes market it as monitoring software for Rails, a more apt description would be to call it a kickass profiling tool for Ruby and Java applications. It’s very useful for tracking down performance issues within your application code. But what happens when the problem isn’t in your source code… or maybe you’re just not sure? Fortunately for NewRelic RPM users, the solution just became very clear.

We recently rolled out support for importing your NewRelic RPM metrics directly into Circonus! All of the application statistics available over NewRelic’s data API are now easily accessible inside your Circonus account. Correlate your application CPU and Response Time with HTTP first-byte and total duration. See the impact of optimizations on your end user experience! All you need to create your first RPM check in Circonus is the account id and license key (available in your application’s newrelic.yml).

Ironically, the one question we kept hearing over and over again at RailsConf was:

How is Circonus different from NewRelic?

The answer is simple; we’re perfect complements. Circonus offers a holistic view of your networks, architectures, systems and services. NewRelic RPM provides a detailed view of your application internals. Both support real-time analysis of their individual focus areas. It’s really the perfect monitoring combination for any serious Rails shop.

If you’d like more information on Circonus or how it can support your architecture, shoot us a line or stop by Booth 103 at Velocity this week. We still have plenty of black t-shirts to give away. 🙂

Circonus at Velocity 2010

Hot on the heels of our RailsConf ticket giveaway, we have another contest for a free pass to Velocity 2010! I’m really excited to attend this year’s Velocity. It’s the Web Performance event to attend, and a great place to see the sharpest whips in the industry.

Like before, the rules of this giveaway are simple. Just tweet a message about Circonus being at Velocity and ask your friends to retweet it. The original "twitterer" with the most retweets by Friday, June 14 at noon (12pm EDT) wins. Here’s an example:

The @Circonus stuff is hot and it looks like they’ll be at #velocityconf this year:

That’s an easy way to earn a free 2010 Velocity sessions pass ($1295 value). Feel free to get creative with your tweet message. Our only requirements are that it’s a positive message that mentions @Circonus and #velocityconf, and that it includes the link.

Yay, free stuff!

Circonus at RailsConf 2010

We’re anxious to meet and greet everyone at RailsConf next month in Baltimore. This will be our first conference appearance since the production launch. Some of our customers, including 37signals, will be visiting Charm City for this big event. I’m excited to see so many talented Web developers and operations folk in one conference. Having it in our hometown is icing on the cake.

As if that wasn’t enough, we have a couple of fun things to announce. First, Circonus will be giving away a free RailsConf sessions pass! All you have to do is tweet a message about Circonus at RailsConf to your friends and ask them to retweet it for you. The individual with the most retweets by noon (12pm EDT) on Monday, May 31, 2010 wins. Here’s an example tweet:

The @Circonus stuff is hot and it looks like they’ll be at #railsconf this year:

If you’re keeping score at home, that’s a free 2010 RailsConf sessions pass ($795 value) for the price of a few clicks. Feel free to get creative with your tweet message. Our only requirements are that it’s a positive message that mentions @Circonus and #railsconf, and that it includes the link.

Why are you still reading this? Go off and start tweeting for your free RailsConf pass (Conference Sessions Only).

See you in Baltimore!

Your Visitors Don’t Matter

Consider me old-fashioned, but I remember a time when an alert notification meant something. Drives failed, servers ran short on memory, or a cage monkey pulled the wrong cable at 3 A.M. Regardless of the circumstance, it demanded attention. Those were the days.

Today, operations is all about doing more with less. No more dedicated hardware or late-night maintenance windows. Everything is virtual, cloud-based, or filling up squares in the grid. Automation reigns supreme, limitless scalability at our disposal. Abstraction at its finest.

But woe unto you, the flapping anomaly.

That visitor who tried to load your website was turned away, timed out and left to wither. Poor Jane wanted to view your site. She needed to view your site. She’d already submitted her order, only to be ignored. Forgotten. Disconnected with nary a trace to route nor a cookie to favor.

Jane was a victim of a numbers game. Someone, somewhere, decided that some problems don’t matter. Which ones? Who cares? They don’t matter. And because she happened to visit when this problem reared its head, you ignored her request. Who would ever make such a silly presumption that one failure is less important than another? What criteria are used to determine the worthiness of this alert or that one? Pure random circumstance, it would appear.

Many “uptime” services and monitoring suites promote the concept of selective or flapping failures. Vendors sell these features as a convenience, ostensibly as a sleep aid. The administrator’s snooze-bar. I can’t think of any other reason that ignoring a faulty condition would be considered a good thing. Perhaps they reason that only the check is affected. If it responds after the third attempt, it was probably ok for visitors all along. Right?

It’s disappointing how many vendors embrace this broken methodology. It probably seemed innocent at a glance. But the damage has been done; recklessness has taken root. We’ve been conditioned to accept these transient malfunctions as mere operational speed bumps. Rather than address the problem, we nudge the threshold a tad higher. Throw additional nodes into the cluster. Increase capacity, while decreasing exposure.

But there is a more responsible alternative. What ever happened to purposeful, iterative corrections and Root Cause Analysis? Notifications may be annoying at times, but they serve a crucial function in a healthy production architecture. Ignored alerts lead to stagnant bugs, lost traffic and missed opportunities. Stop treating your visitors like they don’t matter. There’s no such thing as a flapping customer.

Disrupting the Status Quo

As a hobbyist programmer and full-time operations geek, I’ve been involved in my share of odd software projects. More often than not I’ve had to explain the purpose of the thing, answering numerous questions about the why, what or whowuzzit. I can say without any reservation that Circonus is that rare venture that breaks through the trappings of application design and me-too engineering principles to become something truly revolutionary. To use the product is to highlight Circonus’ strengths. User reactions tell the story.

Bryan Allen, chief server wrangler over at Pobox, has been one of our earliest and most active Beta participants. These folks have been providing email services for longer than I’ve been using email. In a field this competitive, there is zero room for slack, and they know it. Bryan is a very sharp guy, so we were very pleased to read his thoughts on Circonus.

Monitoring, trending and fault analysis are tedious. So much so, most shops get them wrong, or don’t bother at all. Circonus is already poised to be a disruptive player; making the tedious easy, fast and accurate.

I was grateful to meet Bryan in person during my visit to Philly for PostgreSQL Conference, U.S. 2010. I’ve learned that Pobox and OmniTI share a number of common technical interests and philosophies, so it should come as no large surprise that they’d see some value in our efforts.

On the other end of the spectrum, you have the team at 37signals. They are an established leader in web design and SaaS solutions. Their specific forte is with simple (yet powerful) productivity services like Basecamp, Backpack, Campfire and Highrise. Heck, they created Ruby on Rails. If anyone knows good web applications, you better believe they do. We were fortunate to have Mark Imbriaco, Operations Manager for 37signals, run Circonus through the paces during our Beta program.

Circonus’ trending functions are incredibly powerful. The ability to consolidate metrics across a variety of services into a single graph makes it much easier to spot bottlenecks in one area that may correlate to performance problems in another. It’s a graph nerd’s paradise!

I’ll have to take Mark’s word on the last part. Many geeks’ idea of paradise lies somewhere on a beach with a frosty beverage and a strong wireless signal. But if you’re like Mark, and you need something to monitor your systems, you probably owe it to yourself to add Circonus to your shopping list.

There’s one word that I’ve heard repeated a few times by users: disruptive. Occasionally you’ll hear the word bandied about to describe a new social media outlet or computing device. It’s usually associated with a revolutionary technology. There’s nothing new about monitoring, trending or fault detection. But there is something refreshingly insightful about the synergy of monitoring services on a single unified metric collection.

Enjoy the Revolution.

Introducing Circonus

Great ideas always begin with a catalyst. They can ignite in a flash of brilliance, or grow slowly like an ember hidden in the ashes of failure. Inspiration comes from different places, and is only ever cultivated into success with the right combination of talent, timing and fortitude.

And sometimes it just happens because you get fed up with inferior products.

The beginnings of Circonus land somewhere in between. Circonus was created by the engineers at OmniTI, where we’ve been dealing with the pains of performance monitoring and trending in highly scalable environments for years. We’ve tried various combinations of Open Source and COTS software packages, all of which left us with a sour taste and wanting more.

Over the last couple of years, our team of highly skilled engineers, led by OmniTI’s own Theo Schlossnagle, has been crafting and refining a truly convergent monitoring platform. Circonus started off as the Reconnoiter project, attempting to address the disconnect between existing monitoring and trending solutions.

Circonus is currently in a closed beta, receiving valuable feedback from customers and partners. We expect to launch publicly in April 2010. In the meantime, we’ll use this blog as an outlet to discuss the upcoming release and divulge all the cool stuff in the pipeline. I hope you visit here often to find out what we’re working on.