Thunder :: DOS April 11 2005

Started by Jason, April 11, 2005, 08:13:51 AM

Jason

There was effectively a DOS (Denial of Service) on Thunder, resulting in multiple outages early this morning.  I am gathering more information and will post it here as soon as it's available.

Regards,
Jason

Jason

More info... A client site was hosting a file that must have been advertised somewhere popular last night, which caused a steady stream of 1000+ connections to the server since shortly after midnight EST this morning.  This many simultaneous connections to download the file raised loads to unacceptable levels, and our monitoring system restarted services to compensate and maintain stability.  Each time the system brought the services back up, the loads would spike again, causing a repeating cycle.  The server never crashed, but the monitoring software was continually stopping services to prevent a crash.
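To illustrate the restart cycle described above, here is a rough sketch in Python. It is only a toy model, not how the actual monitoring software works: the function name, the capacity threshold, and the reconnect rate are all made-up values for illustration. Only the 1000+ connection figure comes from the post.

```python
def simulate_restart_loop(demand, capacity, ramp_per_tick, ticks):
    """Count restarts a load-triggered watchdog performs while demand stays high.

    demand: connections clients keep trying to open (assumed constant)
    capacity: hypothetical connection count at which the watchdog restarts
    ramp_per_tick: hypothetical number of connections that reattach per tick
    """
    active = 0
    restarts = 0
    for _ in range(ticks):
        # Clients reconnect until they reach the outstanding demand.
        active = min(demand, active + ramp_per_tick)
        if active > capacity:
            # The watchdog restarts the service, dropping every connection.
            active = 0
            restarts += 1
    return restarts


# 1000 waiting downloads against an assumed threshold of 600 connections.
print(simulate_restart_loop(demand=1000, capacity=600, ramp_per_tick=250, ticks=12))
# The same watchdog with demand below the threshold.
print(simulate_restart_loop(demand=500, capacity=600, ramp_per_tick=250, ticks=12))
```

As long as the demand stays above the threshold, every restart is followed by another one a few ticks later, which is the pattern seen this morning. The cycle only ends when the demand itself drops.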

The file has been deleted from the server and the account suspended pending further investigation with the client today.  Meanwhile, we are still getting hundreds of requests for that file, and though they can't retrieve anything, this is still taxing the system.  CPU loads are presently doing well, though we may experience some latency or short outages today.

I am still investigating this with our techs in the meantime and will update this thread if we get anything else worth sharing.  I am going to email all clients on that server to direct them to this thread for information.

Regards,
Jason

Jason

More info... The file was an install program for a new piece of software that was just released yesterday.  There was no malicious attack involved in this outage and the traffic was legit, which is why it didn't trigger some of the DOS/DDOS failsafes that we and the datacenter are running.  Therefore, the server continued to try to serve the file despite the huge traffic increase.  This resulted in the latency and httpd restarts.

I have contacted the software developer and he's removed the links to the file from his website.  This should stop further incoming traffic from his site.

I will be updating this thread regarding server monitoring, as I'm not happy with the time involved in this incident.  Since there was no real server outage, our techs were not notified for quite some time despite the problems visible when viewing your sites.  I will address this issue after speaking with a few people later today.

Thank you,
Jason

Mark

Hehe, I was wondering what was going on last night... guess I chose to do my homework assignment at the right time: as soon as I finished uploading my project I tried to view it and nothing worked :P All is well now as far as I can tell though, and many thanks for the heads up in e-mail and on the forum :)

Elril Galia

Well whilst we had a period of up time earlier this afternoon, our site seems to be mostly down this evening :(

As does my access to this site... it's been periodically down too :(

Jason

Quote from: Elril Galia on April 11, 2005, 04:00:12 PM
Well whilst we had a period of up time earlier this afternoon, our site seems to be mostly down this evening :(

As does my access to this site... it's been periodically down too :(

I can access your site without problems.  Has this happened recently?  The issue described above has been resolved for about 6-8 hours now.

Regards,
Jason

Elril Galia

at the moment it's behaving fine

but yes, from about 4pm until 9pm (BST) its uptime deteriorated badly

Jason

Quote from: Elril Galia on April 11, 2005, 05:59:12 PM
at the moment it's behaving fine

but yes, from about 4pm until 9pm (BST) its uptime deteriorated badly

BST (British Summer Time) = GMT, correct?

You had problems between 11am - 4pm EST today?  The server was operating normally during that time unless I misunderstood your timezone?

Regards,
Jason

Elril Galia

BST is an hour ahead of GMT

I don't know what time zone EST is
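For anyone else comparing the two clocks, the 4pm to 9pm BST window can be converted with a few lines of Python. This is just a sketch using fixed offsets. It assumes US Eastern was on daylight time (UTC-4) on April 11, 2005, so "EST" in this thread is really EDT. The helper name is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets for April 11, 2005. Both regions were on daylight time then,
# so US Eastern was UTC-4 (EDT) rather than UTC-5 (EST).
BST = timezone(timedelta(hours=1), "BST")
EDT = timezone(timedelta(hours=-4), "EDT")


def bst_to_eastern(hour):
    """Convert an hour on April 11, 2005 from BST to US Eastern."""
    return datetime(2005, 4, 11, hour, tzinfo=BST).astimezone(EDT)


start = bst_to_eastern(16)  # 4pm BST
end = bst_to_eastern(21)    # 9pm BST
print(start.strftime("%I%p %Z"), "to", end.strftime("%I%p %Z"))
```

That works out to 11am to 4pm Eastern, a five hour difference, which matches the window quoted above.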

the forum and this one are again intermittently working... whereas other sites are loading just fine

I've emailed you two traceroutes

Jason

I emailed you back.  I'm going to have this looked into to see if it's something datacenter or ISP related. 

Regards,
Jason

Jason

As a result of this initial event, we've now added Alertra as another monitoring service (on top of Hyperspin and the ones used by our support).  Alertra will automatically call my cell any time one of our servers doesn't respond within 5 minutes, to avoid any extended situations like this one.

Furthermore, it will test the servers every 5 minutes, 24 hours a day, from multiple locations around the world.  If one encounters a problem, the others will try to hit it to make sure it's not a false alarm.  For more "how it works" info, please click here.
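The "confirm from the other locations" idea can be sketched roughly like this in Python. To be clear, this is not Alertra's code or API. The function and type names are hypothetical, and the rule that every other location must also fail before an alert goes out is my assumption, since the description above doesn't say how the results are combined.

```python
from typing import Callable, Sequence

# A probe stands in for one monitoring location; it returns True when the
# server answered. Real probes would make a network request.
Probe = Callable[[str], bool]


def confirmed_down(server: str, probes: Sequence[Probe]) -> bool:
    """Report an outage only when the first failure is confirmed elsewhere."""
    if not probes:
        raise ValueError("at least one probe location is required")
    first, *others = probes
    if first(server):
        return False  # healthy from the first location, nothing to confirm
    if not others:
        return True  # no second opinion available
    # The first location failed; ask the remaining ones before alerting.
    # Assumption: all of them must fail for the outage to count.
    return all(not probe(server) for probe in others)


def reachable(server: str) -> bool:
    return True


def unreachable(server: str) -> bool:
    return False


# One location failing while the others succeed is treated as a false alarm.
print(confirmed_down("thunder", [unreachable, reachable, reachable]))
# Every location failing is treated as a real outage.
print(confirmed_down("thunder", [unreachable, unreachable, unreachable]))
```

Requiring agreement between locations trades a slightly slower alert for fewer calls caused by a network problem near a single monitoring site.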

I will shortly provide a stats page where you can view uptime logs recorded by this service.  This should further our ability to maintain the highest service levels possible.

Regards,
Jason

Mack Bolan

 Thanks Jason!  Keep up the good work!  8)