April 25, 2014 :: Avalanche extended outage due to parent hardware failure

Started by Jason, April 26, 2014, 04:05:48 PM


Jason

The following is a capture of what I posted to our Status page covering the issues faced by server Avalanche on Friday, April 25.

The issue was originally thought to be a problem with the server itself, but it was later traced upstream to the parent server, which suffered a very rare failure (explained below in the datacenter's summary) that impacted its storage configuration.

Quote
Initial Issue Description - This server initially became unresponsive and required a reboot just before 8am Eastern today. The reboot completed, but the startup required a file system check, which lengthened the time to get it back online.

Update 9:50am Eastern - the datacenter confirmed it is running a file system check after the reboot. These can take some time to complete. Given how long it's been already (approx. 50 minutes since the reboot), it should complete fairly soon.
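
While a file system check runs, there's little to do but wait and watch for the server to start answering again. For the curious, here's a minimal sketch of the kind of probe that reports the moment a port starts accepting connections (the hostname is hypothetical):

Code
#!/usr/bin/env python3
# Minimal sketch of a probe that watches for a host to come back up
# after a reboot/fsck. The hostname below is hypothetical.
import socket
import time

HOST = "avalanche.example.com"  # hypothetical hostname
PORT = 80                       # any service port works; 22 for SSH
INTERVAL = 30                   # seconds between attempts

while True:
    try:
        # A successful TCP connection means the service is accepting connections.
        with socket.create_connection((HOST, PORT), timeout=5):
            print(time.strftime("%H:%M:%S"), "- host is responding on port", PORT)
            break
    except OSError:
        print(time.strftime("%H:%M:%S"), "- no response yet, retrying...")
        time.sleep(INTERVAL)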

Update 10:17am Eastern - the file system check appears to have completed and services are starting to come up now. Loads are very high due to the downtime and everyone trying to reconnect repeatedly, so it will take some time for things to stabilize.

Update 10:35am Eastern - I see that the datacenter has initiated another reboot just now.  I'm awaiting an update from them on why that was initiated.

Update 10:45am Eastern - I've spoken to the datacenter.  The loads after reboot spiked so high that it became unresponsive again. After the current reboot we're going to temporarily block web traffic so they can get in and diagnose what's driving the loads.

Update 11am Eastern - The server is up; however, we're blocking all HTTP traffic while they investigate the load issue. This means your sites are still down, but the server itself is up and running again. I hope to have their update shortly.

Update 11:50am Eastern - The firewall has been updated to allow HTTP traffic again. Loads are very high (but dropping) due to the influx of inbound traffic. Hopefully we will see stabilization soon. I know that many sites are not loading due to the resources required to generate them. This should correct itself as the server stabilizes.
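
For those wondering what "watching the loads" looks like in practice, it mostly amounts to polling the load averages until they settle. A minimal sketch (the threshold is an assumption, not a value the datacenter uses):

Code
#!/usr/bin/env python3
# Minimal sketch of watching the load averages until they settle below
# a threshold. The threshold is a hypothetical value; in practice you
# would pick one relative to the machine's core count.
import os
import time

THRESHOLD = 8.0   # hypothetical "stable" load for this box
INTERVAL = 60     # seconds between samples

while True:
    # os.getloadavg() returns the 1-, 5-, and 15-minute load averages (Unix only).
    one_min, five_min, fifteen_min = os.getloadavg()
    print(f"load averages: {one_min:.2f} {five_min:.2f} {fifteen_min:.2f}")
    if one_min < THRESHOLD and five_min < THRESHOLD:
        print("load appears to have stabilized")
        break
    time.sleep(INTERVAL)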

Update 1:20pm Eastern - The datacenter is still actively working on this server. While it's technically up, it's essentially unable to serve sites due to the high load.

Update 2:50pm Eastern - Unfortunately the server required another reboot due to a load spike (unresponsiveness), and it's running an automated file system check again. As mentioned before, a file system check can take anywhere from 15 minutes or so to much longer; earlier today it took just over an hour before access was restored. The onsite tech I'm working with is connected to the server, so he'll know the moment the check completes and can log in to attempt to isolate the load cause before it spikes again.

Update 3:20pm Eastern - The datacenter is now aware of an issue with the parent server that Avalanche runs from. They are performing emergency maintenance on it right now. Hopefully that is the cause of what we've been experiencing today (vs. an issue with Avalanche itself). We'll know if that's the case once the maintenance is complete.

Update 3:35pm Eastern - The datacenter is quite confident the issue we're seeing stems from the parent and not from Avalanche itself. Once they complete their emergency maintenance (currently in progress) we'll know for sure. They will also provide a root cause analysis for us.

Update 4:50pm Eastern - The datacenter maintenance on the parent server is still in progress. I will update this once they've advised of completion.

Update 6:20pm Eastern - I've been advised that they're currently rebuilding (cloning) the parent server.  If all goes as planned they're hoping to have that restored in the next 1-2 hours.
   
Update 9:25pm Eastern - Here is the status as of right now. The parent server suffered a hard drive failure that could not wait for a normal maintenance window or process to address. That led to the need for a complete rebuild/clone of the parent, as I mentioned above. Given the massive amount of data involved in something like this, it's taking time to restore everything. The parent server's configuration uses four hard drives. The first drive has been restored. At present, they are rebuilding the second drive, which should take approximately another hour. When that drive is complete, there's a chance Avalanche's data falls within the first and second drives' space, in which case we could be back online while drives 3 and 4 rebuild in the background. If we fall within the 3rd/4th drive space, however, we would remain down for however long it takes to rebuild drives 3 and 4. That could be several more hours.

To summarize - the parent server suffered a storage-related hardware problem, and a rebuild is ongoing.

Update 9:45pm Eastern - Avalanche is back online. I'm able to load a few sites, but they're a little slow, which is to be expected given the traffic and load after being down all day. I don't know if this means the maintenance that impacts us is complete or not. I will attempt to confirm.
     
Update 8:10am Eastern, April 26 - Avalanche has remained up overnight while the datacenter continued their rebuild. We were fortunate that our data fell into the first half of the restore. While the parent is up and not under an immediate threat of disruption, they would like to migrate us to a new parent that's more stable. We've had to do this in the past on a few of our other instances. It's a normal process they follow when they need to upgrade or perform (non-emergency) maintenance on a parent. This involves an initial brief outage (usually around 15 minutes), after which the server is back up and online while data transfers in the background to the new parent. The transfer can take anywhere from a few hours to 10+ hours. During this time there's typically no performance impact and sites will be online. When the transfer is complete, a second outage occurs to sync up and finalize the move.

They would like to schedule this for 12am Monday morning; however, I'm looking into possibly doing it sometime late tonight so it completes over the weekend. I don't want to impact another business day (especially a Monday) if avoidable. I will confirm here once I have full details. I will also be moving all of these updates into a thread on our forum and will send a summary email to everyone with an account on Avalanche.
     
Thank you for your ongoing patience.  Hardware-related issues like this are never pleasant.

Regards,
Jason

Ultimately the parent server was rebuilt last night to restore service; however, the datacenter would like to relocate Avalanche to a new parent as soon as possible. Rather than wait for a potential impact next week, I'm going to have this maintenance done starting at 4am Eastern tomorrow (Sunday) morning. I will be posting information on that separately here:

http://www.charlottezweb.com/forums/index.php?topic=1967.0

We've had to do this in the past on a few of our other servers. It's a normal process they follow when they need to upgrade or perform (non-emergency) maintenance on a parent. This involves an initial brief outage (usually around 15 minutes), after which the server is back up and online while data transfers in the background to the new parent. The transfer can take anywhere from a few hours to 10+ hours. During this time there's typically no performance impact and sites remain online. When the transfer is complete, a second outage occurs to sync up and finalize the move.
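
I don't know the datacenter's exact tooling for these parent migrations, but the two-outage pattern they describe matches a common approach: copy the bulk of the data while the server stays live, then take a short outage to transfer only what changed since. A rough sketch of that pattern using rsync (hostnames, paths, and the instance start/stop hooks are all hypothetical):

Code
#!/usr/bin/env python3
# Rough sketch of a two-phase migration: a bulk copy while the source
# stays live, then a brief outage for a final delta sync. Hostnames and
# paths are hypothetical; the datacenter's actual tooling is unknown.
import subprocess

SRC = "/vz/avalanche/"                          # hypothetical data path on old parent
DEST = "newparent.example.com:/vz/avalanche/"   # hypothetical new parent

def rsync(src, dest):
    # -a preserves permissions/ownership/timestamps, -H preserves hard
    # links, --delete removes files that no longer exist on the source.
    subprocess.run(["rsync", "-aH", "--delete", src, dest], check=True)

# Phase 1: bulk transfer in the background while the instance is running.
# This can take hours, but sites stay online.
rsync(SRC, DEST)

# Phase 2: stop the instance (the second, brief outage), then re-run the
# same rsync. Only files changed since phase 1 move, so it finishes fast.
# stop_instance()                  # placeholder; depends on the platform
rsync(SRC, DEST)
# start_instance_on_new_parent()   # placeholder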

I will be emailing all customers on Avalanche within the next 30 minutes to advise them of this scheduled follow-up maintenance.

For those interested in the technical summary from the datacenter of what occurred, here are the details:

Quote
First of all I would like to apologize for the downtime you experienced. We would not have taken the server down had it not been totally necessary.

A series of hardware failures caused this downtime, and the parent's issues negatively impacted the instances running on it.

This parent is set up with four 1TB SATA drives in RAID 10. Typically, if there is an issue with one of the drives in the RAID, the drive will be removed by the card and the degraded RAID will be detected. In this case, there were hardware issues with all 4 of the drives in the RAID, but an issue with the card itself prevented the drives from being removed. If the bad drives are not removed, the RAID does not report as degraded, so we are not alerted to the issue until the parent starts showing instability. This is what happened yesterday.

This is the first time I have seen issues with all 4 drives, though. The fact that this is such a rare occurrence contributed to how much time it took to identify the issue. Typically, if a parent is exhibiting the symptoms this parent was exhibiting yesterday, the cause is related to one of the instances on the parent receiving a DoS attack, sending out spam, or something of that nature. This was the direction our monitoring team looked in first. Once they had exhausted their options and did not know what else to look for, they asked me to assist in tracking down the issue. Once I identified the issue, we took the parent down right away to prevent any data loss.
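
The takeaway from their summary is that the RAID card itself masked the drive failures, so the array never reported as degraded. One way to catch that class of problem is to query each drive's SMART health directly, below the controller. A minimal sketch using smartctl (the device names are assumptions; drives behind a hardware RAID card often need a vendor-specific -d option to be reachable at all):

Code
#!/usr/bin/env python3
# Minimal sketch of checking each physical drive's SMART health with
# smartctl, independent of what the RAID card reports. Device names are
# assumptions; drives behind a hardware RAID card often require a
# vendor-specific "-d" option (see the smartctl man page).
import subprocess

DRIVES = ["/dev/sda", "/dev/sdb", "/dev/sdc", "/dev/sdd"]  # hypothetical

for drive in DRIVES:
    # "smartctl -H" prints the drive's overall health self-assessment and
    # exits non-zero on failure conditions (the exit code is a bitmask).
    result = subprocess.run(
        ["smartctl", "-H", drive],
        capture_output=True, text=True,
    )
    status = "OK" if result.returncode == 0 else "CHECK DRIVE"
    print(f"{drive}: {status}")
    print(result.stdout.strip())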

For everyone on Avalanche, I greatly appreciate your patience throughout the day yesterday. Ultimately it's nice to know Avalanche itself wasn't the source of the issues; however, that certainly doesn't make such an extended outage any easier to accept.

If you have any questions, suggestions or feedback, feel free to post them here.

Thank you,
Jason