Dec 27, 2012 (06:12 AM EST)
Amazon Outage Scrooges Netflix, Heroku
Read the Original Article at InformationWeek
Netflix customer complaints lit up social networks on Christmas Eve, but the video service could only point a finger of blame at Amazon Web Services (AWS), its cloud services provider. Amazon offered little explanation. However, a "status history" report for Dec. 24 on the AWS Service Health Dashboard shows that "performance issues" affected Amazon's Northern Virginia data center.
The outage hit Netflix viewers from Canada to Brazil. It also affected Amazon's own Amazon Prime video-streaming service and Salesforce.com's Heroku cloud platform, which served up HTTP errors and ssl:endpoint unavailability messages during the outage.
Netflix reported that it restored service to most affected customers by late Christmas Eve. It did so with a workaround: manually reassigning capacity to other Amazon data centers. Amazon reported that fixing the problems at its Northern Virginia data center took until the afternoon of Christmas Day.
The three AWS services affected were Amazon CloudWatch, EC2 and Elastic Beanstalk. CloudWatch monitors AWS cloud resources and applications. EC2, the Elastic Compute Cloud, provides on-demand compute capacity. Elastic Beanstalk automatically handles the deployment details of capacity provisioning, load balancing, auto-scaling and application health monitoring. AWS builds redundancy into all these services through multiple data centers and availability zones around the globe. The outage suggests, however, that automatic failover provisions went down along with the CloudWatch and Beanstalk services.
The latest incident marks the fourth AWS outage of 2012. The June 14 and June 29 disruptions were tied to power outages, while a less serious Oct. 22 incident involved the vendor's Elastic Block Store (EBS) service.
Amazon's Easter outage of 2011 still ranks as one of the service provider's worst disruptions, as multiple availability zones went down and some customers took days to recover. The outage was ultimately blamed on human error.