August 12, 2014 :: Upstream network issues

Started by Jason, August 12, 2014, 11:17:11 AM

Jason

Issue:  Some customers are experiencing connectivity issues where their sites are down and/or very slow.  So far I've had about 15 customers inquire about this.  The servers are up and running normally -- I'm not experiencing the issue, and neither Pingdom (our monitoring provider) nor any of the uptime check sites I've tried is reporting a problem.

I've confirmed with the datacenter that this issue is not internal to us -- there is a large-scale disruption affecting several major networks today, and customers whose connections pass through those networks to reach the datacenter are the ones experiencing it.

Here's the latest from the datacenter:

Update 10:38am Eastern from datacenter:  "Multiple large network providers have experienced issues affecting many across the web this morning. This has caused some of our customers to experience ongoing connectivity issues. Since this isn't directly related to any of our infrastructure, we are trying to get more information as well. We're in the process of communicating with several providers and will relay any further updates the second we receive them."

Unfortunately, if the issues are with external providers, they will need to restore service to their networks to return proper connectivity.

I have this posted on our Status Page and will keep that updated as more details arrive.

Thank you,
Jason

---------
marking resolved
---------

Jason

A very quick update --

I've been reaching out to the customers who reported an impact today and so far everyone is reporting that service appears to be back to normal.

I'm waiting on an official response on what the datacenter observed and will update this forum post with all the details I receive once they are available.

Thank you for your patience -- I know this has been a frustrating day for those impacted.

Jason

I'm starting a recap of today's events while I wait on the final pieces --

To start, here are my observations as well as those experienced by customers:
- I saw no impact myself to any sites nor did the vast majority of customers.
- Our monitoring service (Pingdom.com) and the uptime checking sites I tried were also clear.
- That being said, a significant number of customers DID see issues, which we quickly confirmed.
- From the reports I received, the largest number of users seemed to be on servers: Avalanche, Wildfire, Sandstorm, Thunderbolt (although that doesn't mean other ones weren't impacted).
- It *seemed* like a lot of customers were in the upper mid-west region although I had reports overseas of impact.
- Many customers reported that they or several of their own customers saw the impact whereas others were fine.
- Some of my customers could reach accounts on some servers but not others (all within the same datacenter)
- Some of my customers couldn't reach their sites on their home connection but were fine on their cellular networks (or vice versa)
- I would estimate the total number of customers who reached out to me was around 15-20. 

I was monitoring numerous uptime sites and forums and there were some consistent conversations.  Comcast was seeing issues, as were Level3 and Cogent, which are two significant service providers.  Several customers sent me traceroutes (tracert output) showing the network path between themselves and our servers, and several of those paths traversed networks that were dropping their data.  This matched the initial information the datacenter provided as well.
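For anyone who wants to check their own trace the same way, here is a minimal Python sketch of how a tracert can be scanned for hops with unanswered probes. The sample output, the addresses (taken from documentation-only ranges), and the function names are all made up for illustration; they are not from any customer's trace. Keep in mind that a router in the middle of the path may simply decline to answer probes, so a starred hop is only a hint, not proof that data is being dropped there.

```python
import re

# Hypothetical Windows `tracert` output; addresses are documentation-only ranges.
SAMPLE_TRACERT = """\
Tracing route to example.com over a maximum of 30 hops

  1     1 ms     1 ms     1 ms  192.0.2.1
  2    12 ms    11 ms    13 ms  198.51.100.7
  3     *        *        *     Request timed out.
  4    48 ms     *       51 ms  203.0.113.9

Trace complete.
"""

# A hop line starts with the hop number, followed by the probe results.
HOP_LINE = re.compile(r"^\s*(\d+)\s+(.*)$")


def summarize_hops(tracert_text):
    """Return a list of (hop_number, probes_lost, probes_sent) tuples."""
    summary = []
    for line in tracert_text.splitlines():
        match = HOP_LINE.match(line)
        if not match:
            continue  # skip the header, blank lines, and "Trace complete."
        hop = int(match.group(1))
        # Each probe shows up either as a latency ("12 ms", "<1 ms") or "*".
        probes = re.findall(r"<?\d+\s*ms|\*", match.group(2))
        summary.append((hop, probes.count("*"), len(probes)))
    return summary


def lossy_hops(tracert_text):
    """Hop numbers where at least one probe went unanswered."""
    return [hop for hop, lost, _sent in summarize_hops(tracert_text) if lost > 0]


if __name__ == "__main__":
    print(lossy_hops(SAMPLE_TRACERT))  # prints [3, 4]
```

If the hops with losses all sit inside one provider's network and the later hops never recover, that is the kind of pattern several of the customer traces showed.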

Meanwhile, I started reading some interesting posts online that were focusing on an issue much more global in nature.  It appears the "Global Routing Table" used by routers reached a milestone that exceeded the default setting on many routers.  This starts getting fairly complex but the Internet essentially outgrew the limits once thought sufficient by such hardware providers as Cisco.

Quote
http://www.nux.ro/
Today someone announced some more IPv4 classes on the Internet, nothing new here, but this meant the global routing table has exceeded 500k entries (501,525 as we speak). This has caused a lot of very popular Cisco router models to go belly up because their default value for the IPv4 table size is 512k which in this case was not enough to hold the global table.

Our datacenter confirmed the same cause in their latest update:

Quote
08/12/14 5:00PM (EDT)
Last night the global routing table exceeded 512,000 entries. This caused problems for many popular router models, including the Cisco routers that Liquid Web uses. The default memory allocation for the BGP table size is not large enough to hold the global table after surpassing the 512,000 entry mark. This caused many routers across the globe to experience issues, including the routers at Liquid Web. Unfortunately at this time we do not have an exact diagnosis of what has definitively caused this issue. We are still investigating and troubleshooting all possible solutions.

What are we currently doing to fix the issue?
We believe that allocating more memory to our core routers to handle the additional BGP routes will potentially remedy this situation on our network. We have begun the process to upgrade the memory allocation on our core routers. As a part of this memory upgrade, a reboot of some core routers is required and we are in the process of completing this. While performing these reboots we encountered a problem with a line card in Core 5 and we are currently working to repair this. We believe that this issue should be alleviated after successfully allocating more memory to the core routers and performing the reboot, however, there still may be problems with other providers outside of our network.
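To put the numbers from the two quotes above side by side, here is a small Python sketch of the headroom arithmetic. The route count is the figure from the nux.ro quote; whether "512k" means 512,000 or 512 x 1024 entries is not settled by either quote, so the sketch treats both readings as assumptions. The function name is hypothetical, and this is only arithmetic, not how any router actually manages its table.

```python
def route_headroom(route_count, table_capacity):
    """Entries left before the table is full; a negative result means overflow."""
    return table_capacity - route_count


QUOTED_ROUTE_COUNT = 501_525   # snapshot figure from the nux.ro quote
CAPACITY_DECIMAL = 512_000     # assumption: "512k" read as 512 x 1000
CAPACITY_BINARY = 512 * 1024   # assumption: "512k" read as 512 x 1024

if __name__ == "__main__":
    print(route_headroom(QUOTED_ROUTE_COUNT, CAPACITY_DECIMAL))  # prints 10475
    print(route_headroom(QUOTED_ROUTE_COUNT, CAPACITY_BINARY))   # prints 22763
    # Once the count passes the capacity, the headroom goes negative.
    print(route_headroom(512_001, CAPACITY_DECIMAL))             # prints -1
```

Under either reading, the quoted snapshot leaves only a few percent of the table free, which gives a sense of how little growth was needed to cross the limit the datacenter describes.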

What's interesting is that most of our customers (myself included) did not see any impact today.  I strongly suspect the external providers the datacenter mentioned earlier played a larger role in the problem than the datacenter's own router settings did.  If the issue were strictly within our provider's network, all of us would have felt it, not just a subset.  It's possible those external providers were facing routing problems from the same cause discussed above.

That's the remaining question I'm hoping to get an answer to.
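The reasoning behind that question can be sketched in a few lines of Python: a customer is only affected if their path to the datacenter crosses a network that is having trouble. Every name below (the customers, the ISPs, the transit networks) is hypothetical and chosen purely to illustrate the logic; real paths are more complicated and can change from minute to minute.

```python
def impacted_customers(customer_paths, troubled_networks):
    """Return the customers whose path crosses at least one troubled network."""
    troubled = set(troubled_networks)
    return sorted(
        name for name, path in customer_paths.items() if troubled & set(path)
    )


# Hypothetical paths: each list is the chain of networks a customer crosses.
PATHS = {
    "customer_a": ["home-isp-1", "transit-x", "datacenter"],
    "customer_b": ["home-isp-2", "transit-y", "datacenter"],
    "customer_c": ["mobile-isp", "transit-y", "datacenter"],
}

if __name__ == "__main__":
    # Trouble in one transit network only hits the customers routed through it.
    print(impacted_customers(PATHS, ["transit-x"]))   # prints ['customer_a']
    # Trouble at the datacenter itself would hit everyone.
    print(impacted_customers(PATHS, ["datacenter"]))
```

A subset of customers being affected, as we saw today, looks more like the first case than the second.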

In the meantime, this may be an interesting topic to pay attention to as it likely has a global impact.

Thank you for your patience -- please post any questions you have. If I can't answer them, I'll seek an answer for us.

-Jason

jillwritergrrl

Very helpful to put some of this in context. Thanks for the updates.

Jill

aura

Jason

Thanks for all of this. I was lucky and didn't have any issues that day, but it was bad in many places for people I know.  I was reading in one of our papers about why, and it gave the same explanation as the article.

I just look at how much the net is used in my household compared to 5 years ago: it's used by the TV, it's on everyone's phone, iPad, etc.  We are constantly looking something up.

Thanks for everything you do to keep your customers happy, this is an amazing host.

aura

