September 24, 2012 :: Incorrectly reported outages by Pingdom

Started by Jason, September 24, 2012, 04:46:53 PM

Jason

Pingdom, the company we use for site monitoring, experienced an issue during an upgrade today that caused some of its monitoring stations to report false outages for some servers.

It appears that they incorrectly recorded between 40 and 50 minutes of outages on three of our servers, all of which were online the entire day:

Avalanche
Deluge
Sirocco

Here's a shortened version of the email they sent this afternoon:

Quote
Today, the Pingdom team deployed a software upgrade to some of our monitoring probes. Despite thorough testing, this upgrade contained a malfunction that led to false down alerts being sent to a portion of our customers, including you.

Even if the issue affected monitoring for less than 90 minutes for a limited number of customers, it's of course frustrating if you were one of them. We take a lot of pride in delivering a reliable service and this doesn't represent what Pingdom stands for.

Let us first stress how rare it is that something like this happens at Pingdom. In fact, this is the first time a similar occurrence has struck us. That said, we want to take this opportunity to provide information about what happened, present what actions we've already taken, as well as tell you how we move forward.

Our normal deployment of new and updated software consists of a series of tests designed to make sure that our systems are reliable. This means that we roll out updates gradually to our infrastructure and only after they've been thoroughly tested in our development and staging environment.

Today at around 8 am GMT we gradually started to roll out the update to a few selected monitoring probes. Immediately we saw that there was an issue with the code and did a rollback. But, unfortunately, a limited number of customers had faulty downtimes recorded in their data and in some cases also received faulty down alerts during a limited time.

After a thorough investigation we've already initiated actions to minimize the effect this may have had, including:

- Affected Pingdom checks will have their up and down records marked as
  unmonitored for the period in question, up to a maximum of 90 minutes.
  Therefore, each site's uptime record will not be affected. In other
  words, your uptime percentage will not change due to this incident.

- Any lost SMS credits due to incorrect alerts in connection with this
  issue have been refunded. You will receive double the amount of credits
  that was used during the incident.

- We will take further steps to make sure that future upgrades to our
  infrastructure will be implemented with even more caution. This
  incident has already led to improvements in our deployment routines.

We want you to rest assured that all of us working at Pingdom take significant pride in delivering the best possible service, and even though mistakes happen they are not acceptable to us.


I'm very happy with their service and their approach to communication, but I wanted to share this in case anyone was concerned about what appear to be long outages today that did not actually occur.

Thanks,
Jason