March 3, 2012 :: Avalanche Scheduled Maintenance [outage expected]

Started by Jason, February 28, 2012, 06:19:34 PM

Jason

Within the hour, I will be posting a link to this thread to our server status page (which now shows maintenance messages) and will be sending an email from our billing system to all customers with accounts on our server "Avalanche."

-------------

Update: This maintenance has been rescheduled for midnight Sunday evening (12:01am EST, Monday, March 5).

This coming Saturday morning at approximately 4am EST, Avalanche will undergo scheduled datacenter maintenance that will result in two periods of outage, with potentially slower response times in between. Each outage should last no more than 30 minutes and could be shorter.

To provide some background...
Avalanche is one of our newer cloud servers. As a virtual server instance, it is managed by a "parent" server that is responsible for storage, among other things. In this case, the parent has one hard drive that is reporting errors and needs to be replaced. This is normal, of course, but what's nice is that the datacenter is migrating us to a new parent so that we don't see any extended downtime. To complete this migration, there will be a couple of outages while data is backed up, copied and resynced. This is the same process we would follow to upgrade any of our virtual servers, and we recently did this with our server Thundersnow. That upgrade went very well.

For anyone wanting a more technical explanation, here is a response I received from the datacenter today:

Quote
Hello Jason,
1. One of the drives in the parent's RAID array has become degraded, and we will need to take the server offline to replace it, so we will need to move all the instances off before we can take it offline. Also, since this is one of the older models of parent, it will be decommissioned after it is empty.
2. Since all the non-bare metal parents use RAID, from time to time one of the drives in a given array will need to be replaced. This is preferable to having only one drive and needing it to be replaced, since there is less downtime involved in replacing one drive in an array than one stand-alone drive. From time to time, individual Storm parents require maintenance, and this will involve a brief amount of downtime for each instance while we move it to a new parent.
3. A move involves two periods of downtime. The first happens at the beginning, when the instance is shut down so that the filesystem can be checked and prepared for the new parent. A very rough estimate of the downtime, based on the size of your instance, would be 15 - 30 minutes. The instance is then started on the destination parent, and data is copied over from the origin parent. Once the data transfer is complete, there is another period of downtime when the instance is stopped and the move finalized. The time it takes to transfer the data varies greatly depending on several factors, but the instance is online during that time.

If you have any questions, please feel free to post them here.

Thanks,
Jason

--------
edit:  fixed grammatical error.

Jason

Reminder that this maintenance is scheduled for early tomorrow morning.

Thanks,
Jason

Jason

There is another server on the same parent that has a migration running right now, so we're holding off on doing ours. If it doesn't complete within the next hour, I'll probably reschedule ours.

I will update this thread as soon as I reach a decision with the datacenter.

Thank you,
Jason

Jason

We've decided to move this to midnight Sunday, March 4 (12:01am EST, Monday, March 5).

I have updated our Status page with this adjustment.

http://www.charlottezweb.com/clients/serverstatus.php

Thank you for your patience,
Jason

Jason

A final reminder email was sent to customers on Avalanche approximately 5 minutes ago.

I will update this thread as I learn more tonight.

Jason

This process is partially complete. There was a series of outages between approximately 12:10 and 12:30am EST.

There will be a second outage later this morning once the full image transfer is complete.  That is likely to occur within the next several hours.

Regards,
Jason

Jason

Update -- as of 9:30am EST, the process is 86% complete.

I can't be certain of an exact time but it may complete within the next couple of hours.  At that time we will experience the final outage required to finalize the migration.


Jason

It appears the final phase (outage) is occurring now.  I will update this thread once everything is complete.

Jason

This maintenance is now complete. 

PingdomAlert UP:
Avalanche (67.225.235.5) is UP again at 03/05/2012 10:25:02AM, after 11m of downtime.


Thank you,
Jason