Aug 5, 2009 :: Blizzard outage

Started by Jason, August 05, 2009, 06:08:34 AM

Jason

Blizzard has been unresponsive for about 10 minutes.  Looking into this now.

Jason

I've opened a reboot ticket with the dc.  Rebooting virtually is not working.

Jason

The server has been back up for about 10 minutes, and we're trying to determine a cause.  We may not find one, since the server appears to have simply hung, but we'll certainly try.

Powerbob

It's nice to be nice

Jason

Unfortunately no. We know the load was high based on how the services went down one by one, but ultimately the server stopped responding to anything, which is why I had to power cycle it via console.  It came up fine after that, and I had a tech monitoring it for several hours afterwards.  He didn't find a cause in the logs.

akheir

Sounds like my car. :(

Quote from: Jason on August 06, 2009, 05:15:48 PM
...tech monitoring it for several hours afterwards.  He didn't find a cause...
Plan
Promote
Profit

Kheir Consulting
http://www.kheirconsulting.com

Jason

There was an outage for approx 15-20 minutes on Blizzard about 1.5 hours ago.  The same thing also happened yesterday.

I'm waiting on the results from some logging that we put in place yesterday because this has been occurring far too often on Blizzard in the past two weeks.  I can see from one alert that a specific account had a high CPU load right before today's outage.  I'd like to track that back to processes before I can say that was to blame (at least for today), but I'll update this thread as we get more definite details.

All of these outages are caused by high loads that ultimately lead to latency and then httpd service restarts.

Jason

Quick update: we had an outage yesterday afternoon and another about 30 minutes ago.  We know the cause of yesterday's thanks to the additional logging we set up.  I'm waiting to see whether this morning's cause was the same.  If so, I'll be working with that site to fix or remove the script that is to blame.

Jason

Two cases in the last 24 hours (possibly 3 now) are due to one customer's script.  We're disabling it now and contacting the customer.  The script itself is not malicious but definitely has some coding problems that end up impacting MySQL serverwide.

Thank you for your patience.  We'll continue to monitor to see if that was the true cause.

Jason

We've blocked that script from loading and so far all looks good.  We'll continue to monitor closely over the next few days to see if this solves the problem we've seen lately.

I appreciate your patience.  Hopefully that was the cause and we're set going forward.  If not, I will update this thread as necessary.