Team Analysis: Inside the Amazon Cloud Outage

As millions of frustrated Americans found out when they lost Netflix, Instagram and other services, rolling thunderstorms across the US knocked out part of Amazon Web Services (AWS) this past weekend.

If there ever was a time for cloud computing cynics to feel vindicated, it is now. Amazon is capable of much better given its resources and the many services it is responsible for delivering to us. Last weekend’s outage was a big-time disappointment. Here’s our analysis of what happened and why. Analysis by: Ant Pruitt, Shane Brady and Mat Lee for aNewDomain.net.

 Photo Credit: Gina Smith in Geneva for aNewDomain.net

AWS, which includes Amazon’s EC2 service, went down at approximately 8:21 pm Friday and was fully restored Saturday afternoon. The outage affected several cloud services that depend on Amazon’s EC2 equipment, including Netflix and Instagram.

Datacenterknowledge.com reports the data center in Ashburn, VA that hosts the US-East-1 region lost power for about 30 minutes. But scores of customers suffered longer as Amazon worked to recover virtual machine instances after that.

Network outages at any level can mean lost revenue, regardless of the size of the business. Enterprises understand this. They take measures to keep outages to a minimum and plan backup and disaster recovery protocols.

With Amazon getting hit by the storm and the customers who depend on its servers losing service, you’ve just got to question what its redundancy protocols are. Amazon should take at least the same precautions, and more, given the number and range of cloud services it hosts.

Every geek with a USB flash drive understands that backing up data and having redundancy is pivotal in bouncing back from unforeseen disasters such as power outages or damage. What makes for a successful backup strategy is keeping the redundant data at a separate location.

It’s as simple as having an alternate hard drive stored at a friend’s home in another city.

So what happened? The problem occurs when services such as Amazon fail to have a legitimate fail-over in place for when power is lost in a data center. Did Amazon have an onsite UPS (uninterruptible power supply) and gasoline or diesel generators to kick in if facility power went out? Having such equipment could easily buy the onsite engineers time to investigate the issue and keep services online if a connection could be re-established off site.

“Putting all your servers in one EC2 Zone is like putting all your servers in the same datacenter and hoping a tornado doesn’t hit it,” says our Shane Brady. He has a point.

A ton of sites and apps are running on AWS. The smart ones are using more than one availability zone. If you don’t and that zone happens to go down, what do you expect? Of course there is a risk/benefit analysis that goes into these decisions. And sometimes you gamble on cutting a few corners to increase profits.
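To make the single-zone risk concrete, here is a minimal Python sketch of the kind of check an operations team might run against its own inventory. The function name, the service names and the instance placements are all made up for illustration, and it does not call any AWS API. It simply flags services whose instances all sit in one availability zone.

```python
from collections import defaultdict


def single_zone_services(inventory):
    """Return the services whose instances all live in one availability zone.

    `inventory` is a list of (service, zone) pairs, one per running instance.
    """
    zones_by_service = defaultdict(set)
    for service, zone in inventory:
        zones_by_service[service].add(zone)
    # A service with exactly one zone goes dark if that zone loses power.
    return sorted(s for s, zones in zones_by_service.items() if len(zones) == 1)


# Hypothetical inventory: names and placements are assumptions, not real data.
inventory = [
    ("web", "us-east-1a"),
    ("web", "us-east-1b"),
    ("db", "us-east-1a"),
    ("db", "us-east-1a"),
]

print(single_zone_services(inventory))  # ['db']
```

In this made-up inventory the web tier would survive the loss of one zone, while the database would not.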

But that’s at the expense of your customers’ data. And happiness.

Companies like Amazon need to spend more to make sure their sites are up 24/7. Is it fair to say that the more money you spend, the less you make? Where do you draw the line? Cutting corners with design and backup is just plain silly.

It’s up to those higher powers to decide how long they’re willing to walk the line. How permanent is your data and business? Some people are starting to see the mortality in their own design flaws and cut corners.

We also think an important point being missed is that this was a pretty substantial storm, yet only one segment of one zone of AWS was having problems. If your business was using AWS and that one failed segment in the problem zone caused your whole business to collapse, you are mismanaging your infrastructure or just being cheap, and that is not Amazon’s fault.

Some larger companies like Netflix were indeed using the redundancy measures AWS offers, and this is where the finger can fairly be pointed at Amazon. For some customers, the Elastic Load Balancers (ELBs) for multi-zone deployments failed completely. Amazon Web Services spokeswoman Tera Randall said one of its 10 data centers in Virginia lost both primary and backup power, which “ended up impacting a single-digit percentage” of instances in the region. This is what caused ELB and Beanstalk to drop completely for some. It would also make the management console and API functions completely inaccessible, possibly because they were hosted in the affected data center. Our best guess is that this is what happened in the case of Netflix and Instagram.

If you are properly using ELB, you can distribute incoming traffic across your Amazon EC2 instances in a single Availability Zone or multiple Availability Zones. Spreading across multiple zones gives your site greater fault tolerance in the case of a localized outage like this. That’s assuming ELB is working properly, and not just making things worse.
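The idea behind spreading traffic across zones can be shown with a toy Python model. This is only a sketch under our own assumptions, not how ELB is implemented. The class name, instance IDs and zone names are hypothetical. It hands out instances round-robin and skips any that sit in a zone marked as down.

```python
from itertools import cycle


class ZoneAwareBalancer:
    """Toy round-robin balancer that skips instances in failed zones."""

    def __init__(self, instances):
        # instances: list of (instance_id, zone) pairs
        self.instances = list(instances)
        self.down_zones = set()
        self._ring = cycle(self.instances)

    def mark_zone_down(self, zone):
        self.down_zones.add(zone)

    def mark_zone_up(self, zone):
        self.down_zones.discard(zone)

    def next_instance(self):
        # Look at each instance at most once per request.
        for _ in range(len(self.instances)):
            instance_id, zone = next(self._ring)
            if zone not in self.down_zones:
                return instance_id
        raise RuntimeError("no healthy instances in any zone")


# Hypothetical two-zone deployment.
lb = ZoneAwareBalancer([("i-a1", "us-east-1a"), ("i-b1", "us-east-1b")])
lb.mark_zone_down("us-east-1a")
served = [lb.next_instance() for _ in range(3)]
print(served)  # ['i-b1', 'i-b1', 'i-b1']
```

With both instances in the same zone, the same failure would leave the balancer with nothing to route to and every request would raise an error. That is the single-zone gamble in miniature.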

According to Rick Branson, an exec on the Instagram infrastructure team, even after AWS was brought back up, employees were still trying to figure out why the sick instances kept causing problems.

The sooner everyone learns that the mystical, magical cloud is just as prone to failure as anything else running on electricity, the better off we will all be. And by everyone, we mean customers and providers alike.

A major disaster is one thing. But thunderstorms? We expect better from Amazon.

 

 

About the author

Ant Pruitt

Based in Charlotte, NC, Anthony Pruitt is an IT pro and senior contributor at aNewDomain.net. Follow him at @ant_pruitt or as +Ant Pruitt on Google+. Email him at Ant@aNewDomain.net