January 9, 2008 :: Datacenter outage

Jason · January 09, 2008, 05:23:50 AM

We're waiting on info from the dc but right around 4:50am EST today, some sort of widespread outage (looks to be power) happened at the dc that took a majority of our servers offline.

All came back up quickly except for Maelstrom which is still down. It likely hung on the reboot/restart.

A ticket is open for this; however, given the size of the outage (it looks to be very widespread) and the early-morning hour, support there is under extreme pressure to respond. Senior mgmt has posted that they're on their way in to address it.

I'll keep this posted as details are learned.

Jason · January 09, 2008, 10:33:18 AM

Here's what little more I know at this time...

I've heard conflicting news about the cause; all I know for now is that it is still very much impacting the datacenter. They called in all staff a few hours back and have everyone working phones and tickets.

Overall, Charlottezweb managed to be a lot better off than a lot of people who have lots of servers down. There's been quite a heated discussion taking place all morning on the clients' board there.

I've requested an update a few times in that thread and updated our ticket. As painful as having this server down is, I know they're doing everything in their power to address everything at this time.

I fully expect to have a lot more information to share when this is over because whatever occurred this morning has stirred up a lot of anger and questions over reliability.

Jason · January 09, 2008, 11:28:18 AM

I've received word that they're working on Maelstrom now...

Ken · January 09, 2008, 11:29:25 AM

Thanks for keeping us updated, Jason.

Jason · January 09, 2008, 11:39:11 AM

Service to Maelstrom has been restored.

I'll absolutely be updating this thread as the details unfold on what happened and what's being done to address it.

Jason · January 09, 2008, 02:33:49 PM

Here's a full RFO from the owner of the datacenter. I'll keep adding to this thread if/as more information comes from this event.

Quote
RFO January 9, 2008 4:45 am EST

--------------------------------------------------------------------------------

At approximately 4:45 am EST the NAP suffered a power outage lasting approximately 10 seconds from Georgia Power.

The generators fired and came online 15 seconds after the initial outage, and the load was transferred to the generators, which ran for 30 minutes while we monitored the incoming power quality from GA Power, at which time the load was transferred back to utility.

One of the UPSs that serves part of the facility suffered a battery outage on 2 different redundant strings, which caused it to drop the load.
We installed a second redundant string approximately 9 months ago to minimize the possibility of this type of situation. The batteries in the 2 strings are set up in parallel, meaning each string is capable of carrying the full load for up to 5 minutes.

All it takes is 1 battery in a string to fail for the entire string to fail. This is the same in all UPS systems and is the reason we installed the second string on advice from the manufacturer.

The original string batteries are 1.5 years old and were installed new. The second string is 9 months old and was installed new.

A single battery in the second string failed after 3 batteries in the first string failed.

We turned the generators back on to avoid an interruption during troubleshooting and maintenance, and MGE sent a tech onsite within an hour to troubleshoot, at which time we discovered the battery issue. We replaced the batteries within an hour of diagnosis and brought the system back online and out of maintenance bypass.

The load is currently protected and all batteries have been tested again.

Both sets of batteries have been maintained and tested by MGE direct service every 6 months under a PM plan that they recommended for proper maintenance and operation.

It was extremely rare and unforeseen for something like this to happen.

We are purchasing our own battery tester and will set up a monthly PM on the batteries that we will conduct ourselves, in addition to the 6-month PM that MGE does on the UPS as well as the batteries. We are also researching a real-time battery monitoring system that can predict battery failure.

Batteries are the weakest link in the system, and we feel we properly followed the recommended engineering and maintenance on these systems. However, that will not assure 100%, as we found out today in a very rare incident.

Extraneous events that continued to affect service during the outage:
One of the main Metro E switches that runs the links of our backbone went offline during the outage, and during that power-induced reboot we lost connectivity to half our backbones. We have our backbones split in half, with half going out the east side and half out the west side of the building, taking diverse paths across redundant switches to the final interconnect points.
The switch was unstable when it came back online due to a GBIC that died, and for some odd reason it rebooted itself several times, about every 10 minutes. We replaced the GBIC with a spare we keep onsite.

This caused half the backbones to go up and down and placed a large CPU load on the different core routers we have, due to the BGP table loads going on. This is very CPU intensive, and when you have a lot of up and down it can appear that the network is completely down (it is if you are on a link that is flapping), but the fact is that the entire network was not down but was impacted. This settled down when the switch was stabilized.

We split our backbones up over several different redundant backbone routers.

Once this switch was brought back online and stabilized, the network stabilized as well.

An access switch that serves 16 servers also died, and we replaced it with a spare once we found the issue. We keep spares on site for every piece of network gear we have.

An APC that was only 6 months old and is dual fed from 2 different power sources (including the newer UPS) failed and did not come back; we replaced it with an onsite spare. It was bizarre to say the least, and of course it powered one of our 3 main DNS clusters, so we lost DNS capacity for an hour.

Most of the issues currently going on are related to server hardware that did not do well in a power reboot situation or needs an fsck. We are actively working on them and will not rest until all is well.

Many customers in the facility do have A and B feeds from our power. We offer this through different UPS systems, different power panels and different transformers. Some very early customers that purchased A and B feeds when we only had one UPS system at the NAP are on the same UPS and as such lost power. Those customers will be offered a free move of their B feed to the newer UPS to increase their power diversity; they simply need to open a ticket.

What are we doing on power in the future?

We have another UPS from MGE on order as of 4 weeks ago that is due to deliver in mid-February and will increase the diversity of the power in the facility. We plan on having 2 battery strings on it as well.

We are in the process of installing another set of 5 Cummins generators and another 3000 amp transformer, which will further diversify our generator and transformer plant. This will be completed in mid-February. Construction is going on currently; we took delivery of the switchgear and generators 2 weeks ago. 4 UPSs will be moved to the new power feed and generators to diversify the power source to the UPS. This will give us 100% redundancy on the A / B feeds at that point.

We installed a redundant B feed to our Metro E gear and 2 dual-fed APCs at our TELX cabinet after TELX suffered a complete UPS failure at 56 Marietta 4 months ago. This turned out to be good because there was another complete failure of the B UPS 4 weeks ago, but we were not affected since we had a redundant feed from them. The outage affected all customers on the second floor. We would have lost more than 50% of our network had we not been on dual-fed APCs and dual power feeds at the building, which would have been bad.

We are increasing the battery PM schedule to monthly from biannual.

We are researching a battery monitoring system for the strings.

We will be taking a fuel delivery this week to restock our main fuel supply.

We are examining in depth the abnormalities seen this morning on one of our 4 core metro switches, and if we do not get an RFO from the manufacturer we will look at replacing it or upgrading to a different, more robust solution, which has been in our long-term plan but may get moved up.

We will be doing another power examination of our core switching routers (currently 6 of them, all with dual-fed power) and our core Metro E switches (currently 4 of them) to make sure that our power feeds are truly redundant and no legacy circuits are there to affect them.

We will be examining our on-site spares inventory to make sure we are still at correct levels, since we used some items this morning.

We apologize for the outage caused by the failure of the primary and backup batteries and will continue to provide the best service at an excellent price.
The MGE tech who has all the major accounts in Atlanta, including Coke and several others, told us that this was a very freak occurrence with negligible odds of happening. In his opinion we have done everything right on our maintenance, PM and redundancy of the batteries; he would have done the same thing, and there was really nothing he would have recommended doing differently at that point.

We are still going to make the changes above that I mentioned, though.
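
To spell out the battery logic in the RFO above, here is a rough sketch in Python. It assumes each string is a series chain (one dead battery opens the whole string) and that the strings sit in parallel (one healthy string is enough to carry the load). The number of batteries per string is made up for illustration; the RFO does not give it.

```python
def string_ok(batteries):
    """A series string works only if every battery in it works."""
    return all(batteries)


def ups_holds_load(strings):
    """Parallel strings: the UPS carries the load if at least one string works."""
    return any(string_ok(s) for s in strings)


# Hypothetical size: 40 batteries per string (not stated in the RFO).
N = 40
first = [True] * N
second = [True] * N

# Three batteries fail in the first string, as described in the RFO.
for i in (0, 1, 2):
    first[i] = False

# With the second string still healthy, the load is protected.
protected_before = ups_holds_load([first, second])

# One battery then fails in the second string, and both strings are open.
second[0] = False
protected_after = ups_holds_load([first, second])

print(protected_before, protected_after)
```

The point is only that redundancy here is per string, not per battery: three failures in one string cost nothing extra, but a single failure in the other string is enough to drop the load.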

Finner · January 09, 2008, 03:54:37 PM

Sounds like he's having a fun day

Jason · January 09, 2008, 04:02:24 PM

I feel bad for him in many ways -- they just launched some new services and have really taken a lot of initiative to answer and provide things that have been requested in the past to really build a solid center up. For this to happen with what appears to be quite a number of standalone problems all caused by the one grid outage by the local power co. is very bad timing. Granted, it doesn't mean they don't need to address everything fully but even with the best fail-overs in place, this is a perfect example that weird things can happen.

They're still working on servers there (many of their clients still have down boxes) and it looks like we just had a very quick drop to Tempest a few mins ago so hopefully things will hold up as they work on solidifying everything.

Jason · January 10, 2008, 09:40:52 PM

Followup from the dc:

Quote
AtlantaNAP adding a third string of batteries
We have decided to add a third string of batteries to our current N+1 parallel setup. Apparently having a redundant string was not enough, so we will now have 2 backup strings for each unit.

This will be installed Friday 1/11/08 and Monday. We have already acquired the batteries.

We have also implemented daily testing of the strings to look for an open cell; the test runs every morning at 8 am, M-F.

Finally, we have acquired an advanced testing machine which will allow us to do comprehensive testing and tracking of the batteries once a month, with trend tracking of the cells by test. This will allow us to spot trends that can identify weak cells. This is what MGE did, and still does, for us on their recommended every-6-months schedule. We will now do it once a month internally.

We still highly recommend that if you cannot take any outage at all, ever, you open a ticket and order a redundant power setup fed by 2 different UPS systems here at the NAP. Remember, uninterruptible is not un-interruptible. It's less interruptible.

Good for them.
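
For the curious, the monthly trend tracking they describe could look roughly like the sketch below. This is only my guess at the idea, not how their tester works: the cell names, voltages and the 0.05 V threshold are all hypothetical.

```python
def weak_cells(readings, max_drop=0.05):
    """Return ids of cells whose voltage fell by more than max_drop volts
    between the first and the last test in the series."""
    flagged = []
    for cell_id, volts in readings.items():
        if len(volts) >= 2 and volts[0] - volts[-1] > max_drop:
            flagged.append(cell_id)
    return sorted(flagged)


# Hypothetical monthly readings in volts for three cells; not real data.
readings = {
    "cell-01": [2.25, 2.25, 2.24],
    "cell-02": [2.25, 2.21, 2.15],
    "cell-03": [2.26, 2.26, 2.26],
}

print(weak_cells(readings))
```

A cell that keeps sagging from test to test gets flagged before it fails outright, which is the gap a 6-month schedule leaves open.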

Finner · January 10, 2008, 10:06:01 PM

It's nice to see he's not messing around..

Is losing Global Crossing as a backbone going to affect anything?

So far it seems that it helped..

I hope Jeff sleeps well tonight, as he has had a rough day I'm sure..
When it rains it pours..

Jason thanks for the updates today..

Jason · January 10, 2008, 10:25:47 PM

No problem. Hopefully we can all have a relaxing weekend

Ultimately, it's possible they may turn that circuit back up if they can get proof that something is diagnosed and fixed. Otherwise, it's only one of 6 gigE circuits. With that circuit being unstable, it does more harm than good having it included.

dania · January 14, 2008, 02:49:17 PM

I wonder did that happen to designsexchange forums on the 8th

Hmm..you never know...still down.

Jason · January 14, 2008, 03:04:25 PM

Quote from: dania on January 14, 2008, 02:49:17 PM
I wonder did that happen to designsexchange forums on the 8th

Hmm..you never know...still down.

What was their actual website?