December 16, 2011 :: Jetstream outage

Started by Jason, December 16, 2011, 05:10:25 PM

Jason

We're currently investigating an issue with Jetstream.

The server isn't down but services are not responding correctly.

I'll update this thread as we learn more.

Jason


Jason

We're going to reboot now. 

On a related note, this server also became unresponsive this morning at roughly 2:30 AM EST and had to be rebooted then as well.

We'll work with the datacenter once it's accessible again to try to locate the issue.  For this server to have an outage, much less two in one day, is not normal.

Jason

Service is restored as of about 15 minutes ago.

We're going through logs now to see if we can track the cause.

Jason

Jetstream became unresponsive again about 10 minutes ago.  The server is unreachable and I've requested a reboot. 

Jason

The reboot is complete.  We're monitoring this server again now.

We didn't find anything in the logs last week and we monitored for almost a day afterwards.  However, it's been about 2 more days and the issue seems to have repeated. 

I have a few more ideas we're looking into now. 

Mark

Anything come of this? Just being curious. Haven't had any issues.

Jason

Quote from: Mark on December 22, 2011, 01:19:02 PM
Anything come of this? Just being curious. Haven't had any issues.

Yes and no.  Unfortunately nothing very concrete at this time.

I had a lead regarding our off-server backups.  From time to time over the years, we've had issues where servers become unresponsive if the NAS becomes blocked or hangs during a transfer.  I thought that process was running on Thursday when the first crash happened but it actually wasn't.  However, it did run on Monday morning before the crash that day.  The backup raises the load a good bit even when all is well, so it could've been the cause or a part of the cause.  So I'm not ruling that out.

Loads were definitely high at the times the outages occurred but nothing abnormal in the logs points to anything specific.  I hate to theorize without concrete evidence but it could've been a number of things.

What's frustrating is that we monitored the server extensively over the weekend and then again Monday after the reboot and of course, nothing has occurred again. 

I am still waiting on some feedback on whether we want to do any kind of hardware scans, but for now we're monitoring things in case we can catch it while it's happening.

Sorry -- nothing very useful I'm afraid.

Mark

A Ghost in the Machine! Lawnmower Man!... Lawnmower Man 2?

Wait, no, that's not right.

Well, I guess since we haven't seen anything new happen, that's a good thing right, aside from it making it harder to find out what happened.

Jason

Quote from: Mark on December 22, 2011, 02:05:27 PM
Well, I guess since we haven't seen anything new happen, that's a good thing right, aside from it making it harder to find out what happened.

Yes.  Frustrating but good.  I've seen stranger issues work themselves out with no explanation so it wouldn't be the first time.  It's part of that "Self-Heal v.3" application that I'm beta testing.  :)


Jason

On a side note -- somewhat related -- I've migrated our monitoring to a new provider and we now have a new spot where you can monitor uptime on demand.  This is in addition to our status page on our site.

Here's the master page:   http://uptime.charlottezweb.com/

If you click on a server name from that page, you can view all the downtime reports and all kinds of interesting stats.  You can change from the current timeframe to past months too.  For example, here are the December uptime stats for Jetstream specifically that show all the outages discussed in this thread.  You can hover over lots of things to see more details. 

http://uptime.charlottezweb.com/430858/2011/12

One thing to note -- with Alertra, we were monitoring every 5 minutes.  With our new service, we are monitoring every 1 minute.  The upside is that issues are caught faster and the uptime figures are typically more accurate.  The downside (for me) is that if the load spikes for a moment or a network disruption impacts a server for even 1 minute, it will be caught.  Bottom line: the reports should be much more granular -- down to the minute -- but in some cases there may be false positives.
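
To illustrate the difference the check interval makes, here's a rough Python sketch.  This is not how the monitoring service works internally; the function names, the sample outages, and the assumption that each failed check is counted as one full interval of downtime are all made up for illustration.

```python
# Hypothetical model: a monitor probes the server every `interval` minutes.
# Each probe that lands inside an outage window counts as a failed check,
# and (by assumption) each failed check is reported as `interval` minutes down.

def failed_checks(outages, interval, total_minutes):
    """Return the probe times (in minutes) that fall inside any outage.

    `outages` is a list of (start, end) minute pairs, end exclusive.
    """
    return [
        t
        for t in range(0, total_minutes, interval)
        if any(start <= t < end for start, end in outages)
    ]


def estimated_downtime(outages, interval, total_minutes):
    """Downtime in minutes as the monitor would report it under this model."""
    return len(failed_checks(outages, interval, total_minutes)) * interval


# Made-up hour of monitoring: a 1-minute blip at minute 7 and a
# 12-minute outage starting at minute 20 (13 minutes of real downtime).
sample_outages = [(7, 8), (20, 32)]

print(estimated_downtime(sample_outages, 5, 60))  # 5-minute checks
print(estimated_downtime(sample_outages, 1, 60))  # 1-minute checks
```

With these made-up numbers, the 5-minute checks never see the 1-minute blip and round the longer outage up to 15 minutes, while the 1-minute checks report 13 minutes, which matches the real downtime but also includes the brief blip.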

I'll make a post about this in our Service Monitoring board shortly since the reports posted there will change.

Thanks,
Jason

Mark