March 31, 2014 :: Network Disruption (Atlanta - Multiple servers impacted)

Started by Jason, March 31, 2014, 10:24:18 PM

Jason

One of our datacenters experienced a network disruption today that affected accessibility from approximately 2:30pm to 3:15pm Eastern.

Most (if not all) of our servers at that location were impacted during this timeframe. This includes:

Jetstream
Monsoon
Sirocco
Supercell
Tempest
Tsunami

Our other servers were not impacted.

I will be updating this thread once I have the root cause and remediation details from the datacenter.

Thank you,
Jason

Jason

For those interested, here's the full writeup from the datacenter:

Quote
Summary of Service Event on 3/31/2014

On Monday, March 31 at 2:17 PM EST, GNAX routers were unable to effectively route traffic to the internet. The issue stemmed from a large peer at the TIE peering fabric flooding the peer routers with unproductive routes, which crippled our route tables on the adjacent routers and then propagated and affected our core routers as BGP neighbors. The immediate fix re-converged routes at 3:12 PM EST.

To prevent this type of incident from occurring again, our network team has applied more stringent access lists on those peers. Also, our stricter configuration will terminate a BGP peer if it shows a sudden, unexpected increase in routes, further protecting our customers from this type of occurrence in the future.

The immediate fix was determined and implementation was started in less than 30 minutes, as our network team launched into action. However, due to the scale and variety of our network infrastructure, it took a few hours to fully diagnose and confirm the issue from the logs, design a more permanent resolution and carefully test it.

We apologize for the inconvenience and trouble this disruption caused to our customers and sincerely thank you for your patience and understanding as we worked through the issue. We know how critical our services are to our customers. We will do everything we can to learn from this event over the coming days and weeks to further understand the details and refine our resolution and processes. We are committed to providing our customers mission-critical IT infrastructure; therefore, we are implementing a status page that will give periodic updates during any future issues. We will only update events as they are confirmed with factual information.
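
For anyone curious how the "terminate a BGP peer on a sudden increase in routes" safeguard works in principle, here is a rough Python sketch of the idea, which is commonly called a maximum-prefix limit. This is only an illustration and not the datacenter's actual configuration: the PeerSession class, the peer name, and the limit of 1000 are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class PeerSession:
    """Toy model of one BGP peer session with a prefix limit (hypothetical)."""
    name: str
    max_prefixes: int        # assumed limit; real values depend on the peer
    established: bool = True
    prefix_count: int = 0

    def receive_routes(self, new_prefixes: int) -> bool:
        """Accept announced routes, or drop the session if the limit is exceeded.

        Returns True if the session is still up after the update.
        """
        if not self.established:
            return False
        if self.prefix_count + new_prefixes > self.max_prefixes:
            # Too many routes from this peer: tear the session down and
            # discard everything learned from it, so the flood cannot
            # spread to other routers.
            self.established = False
            self.prefix_count = 0
            return False
        self.prefix_count += new_prefixes
        return True


peer = PeerSession(name="example-peer", max_prefixes=1000)
normal_ok = peer.receive_routes(800)   # within the limit, session stays up
flood_ok = peer.receive_routes(500)    # 1300 > 1000, session is dropped
```

After the second call the session is marked down and its routes are discarded, which is the behavior the writeup describes: a misbehaving peer is cut off before its routes can overwhelm the route tables.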

Regards,
Jason