Charlottezweb

Charlottezweb Hosting => Server Updates & Outages => Topic started by: Jason on September 01, 2007, 01:46:14 AM

Title: 9/01/2007 :: Server outage
Post by: Jason on September 01, 2007, 01:46:14 AM
Something at the datacenter caused a loss of connectivity to 6 of our servers tonight for approx. 10 minutes.  (Alertra recorded around 4-5 minutes for most servers, but I think it was several minutes longer than that given other reports I've been reading.)

All servers but Thunder (hosted elsewhere) were unreachable at that time.

I will update this thread once I hear back from the datacenter with what occurred.

All servers are up again but loads are a bit high as they compensate for the increased traffic they're currently fielding.  This will lower shortly.
Title: Re: 9/01/2007 :: Server outage
Post by: Finner on September 01, 2007, 10:57:36 PM
My guess is it was power related   ::)
Title: Re: 9/01/2007 :: Server outage
Post by: Jason on September 01, 2007, 11:52:43 PM
It was, but not at our datacenter; it was at the network (telco) building upstream.  I'll post more tomorrow.
Title: Re: 9/01/2007 :: Server outage
Post by: Finner on September 02, 2007, 12:18:09 AM
If the Gnax DC crashes will CWebs still be up??
Title: Re: 9/01/2007 :: Server outage
Post by: Jason on September 02, 2007, 12:54:52 AM
Quote from: Finner on September 02, 2007, 12:18:09 AM
If the Gnax DC crashes will CWebs still be up??

Not sure I follow ... Are you asking if Charlottezweb.com (the site) will still be up?

If so, the answer is "yes" for right now but within the next month, I will be relocating it with all other accounts on Thunder to our new server Thunderstorm.
Title: Re: 9/01/2007 :: Server outage
Post by: Finner on September 03, 2007, 12:22:24 PM
Yes I meant the Cwebs site....

Is there an update on this outage? 
Title: Re: 9/01/2007 :: Server outage
Post by: Jason on September 03, 2007, 03:34:16 PM
Quote from: Finner on September 03, 2007, 12:22:24 PM
Yes I meant the Cwebs site....

Gotcha.  Charlottezweb.com is hosted on Thunder, which is the last of our servers that's not located in the Atlantanap (GNAX) datacenter.  I'm in the process of moving it -- along with all accounts on Thunder -- to our new server Thunderstorm.  Therefore, within the next month or two, all accounts will be hosted on my servers at that datacenter.

Quote
Is there an update on this outage? 

As I hinted above, the outage was directly upstream of the datacenter.

Datacenter owner's quote:
"This was Telx doing maintenance on their power in their building and dropped the load - we have no control over this - all providers I looked at in the building were out.  We were quite surprised that this happened and have opened inquiries with them regarding prevention of this in the future."

Here's a post from me and their response (below).  Unfortunately, the timing seems to be pretty bad given that we've experienced 2 or 3 outages related to power or network cuts in the last 3 weeks.  Any outage is no good but on the positive side, one outage was a Telia cut that impacted much of the country, one was a power/circuit problem in the dc and this latest was upstream.  Honestly, only one of these outages could've been addressed by them so again, the coincidental timing is poor.

In the last several years of service with them, I haven't experienced this over the span of months, much less within 1 month.  It's extremely frustrating to experience this, though I trust the guys there and have to give them the benefit of the doubt.  Ultimately, it was less than 10 minutes of outage (except for Blizzard, which had its DNS service corrupted; that was just a painful byproduct of the increased loads):

----------


Quote
Originally Posted by charlottezweb
Quote
Can you expand on this: "this was telx doing maintenance on their power in their building and dropped the load - we have no control over this - all providers I looked at in the building were out."

Does that mean all your connectivity runs through them as your upstream provider thus creating a single point of failure?

Please forgive my ignorance but I thought as part of all the infrastructure build-up that you were creating a fully redundant center both network and power-wise?

This appears to be the 2nd or 3rd "outage" in the last two weeks whether it's been caused by power or connectivity.

I have a number of clients who are done with my excuses and I'd really like to provide them a reassuring answer of what happened, what should've happened and why it won't happen again. It sounds like I can't deliver them that confidence if the blame is on a third party we have no control over? I'm waiting for them to start sending questions regarding the dc's 99.999% uptime claim...

in most major internet hubs in the us - all the connectivity usually ends up going through one or 2 major points of interconnect. this is how people are able to get the bandwidth pricing down.

it is certainly possible for us to go out and pull metro rings to other providers in different locations to diversify - however prices would definitely go up across the board as this would get very expensive.

it is the nature of the beast.

we are waiting to find out why telx had a power outage that affected an entire floor.

in fact - even if we had our backed up by our own mini ups - it would have still had outages on the other carriers located on the same floor.

cogent was not hit because they are on net in the nap and we dont pick them up from telx.

we ordered diverse circuits from telx 5 months ago and they delivered non diverse - we are still in the process of rectifying this with them as of last week - we are supposed to have another circuit from another floor / generator / ups. then it will be tied into our dual fed power strips that feed our different switches for a lot more redundancy.

why did they drop the power? I dont know. they claim ups maintenance - but they should have just put it in bypass to do it and not dropped the power.
are we going to quit doing business with them over it? no

we are going to keep at the process of putting in place more safeguards against it happening again - and in fact have been working with them to get this done. they are not the fastest moving ship.

bad timing on this one.

again - the internet is not a 100% thing - you can only try to make it as close as possible to that as you can afford and that your customers dictate through their $$.


Hopefully that helps explain a bit of what happened and that they're doing all they can to address it with Telx in this case. 

Regards,
Jason