Post Mortem: When Amazon's Cloud Turned On Itself

Apr 29, 2011 (02:04 PM EDT)

The snafus in the cloud, it turns out, aren't so different from those occurring in the overworked, under-automated and undocumented processes of the average data center. According to Amazon's post mortem explanation of its recent hours-long outage, the failure was apparently triggered by a human error.

If so, processes susceptible to human error will not be good enough if the cloud is to become a permanent platform for enterprise computing.

Amazon's recent outage, which would have been more of a disaster than it was but for the low Easter holiday traffic, was the result of a configuration error in a scheduled network update. The change was attempted in the middle of the night--at 3:47 a.m. in Northern Virginia--"as part of our normal scaling activity," according to the official explanation. That sounds as if the EC2 data center was anticipating the start of early morning activity, when big customers such as Bizo or Reddit start refreshing hundreds of websites in preparation to meet the day's earliest readers.

The primary network serving one of the four availability zones in EC2's U.S. East-1 data center needed more network capacity. The attempt to provide it mistakenly shifted the traffic off the primary network onto a secondary, lower-bandwidth network used for backup purposes. This is a change that has probably been implemented correctly thousands of times. It's the kind of error an operator could make through a wrong choice on a menu or by entering the name of the last network worked on instead of the one needed. In short, it was a human error that's all too likely to occur with anyone momentarily preoccupied with the price of mangoes or a flare-up with a spouse.

However, I thought the Amazon Web Services cloud used more automated procedures than that. I thought obvious errors had been anticipated and worked through, with defenses in place. Two lines of logic, checking the operator's decision, would have halted him in his tracks. A simple network configuration error should not be the source of a monumental hit to confidence in cloud computing. But apparently it is.
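To make the "two lines of logic" concrete, here is a minimal sketch of such a pre-change check. It is an illustration only, not Amazon's tooling: the `Network` record, its `role` and `capacity_gbps` fields, and the `check_traffic_shift` function are all hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Network:
    # Hypothetical description of a network; not an AWS data structure.
    name: str
    role: str             # "primary" or "backup"
    capacity_gbps: float


def check_traffic_shift(target: Network, expected_load_gbps: float) -> list[str]:
    """Return the reasons a planned traffic shift should be refused.

    An empty list means the shift passes both checks.
    """
    problems = []
    # Check 1: production traffic should only ever land on a primary network.
    if target.role != "primary":
        problems.append(f"{target.name} is a {target.role} network, not a primary")
    # Check 2: the target must be able to carry the load being moved onto it.
    if target.capacity_gbps < expected_load_gbps:
        problems.append(
            f"{target.name} carries {target.capacity_gbps} Gbps, "
            f"but the shift needs {expected_load_gbps} Gbps"
        )
    return problems
```

A change tool built this way would refuse to proceed whenever the returned list is non-empty, leaving the operator to confirm or correct the target before any traffic moves.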

What happened next is not so different from what we speculated in "Cloud Takes A Hit; Amazon Must Fix EC2" a week ago, based on the cryptic postings on Amazon's Service Health Dashboard. Eight minutes after the change came the start of what the dashboard described as "a networking event." The misconfiguration choked the backup network, and as a result, in Amazon's words, "a large number of EBS nodes in a single EBS cluster lost connection to their replicas."

An EBS cluster is a set of servers and disks serving as short-term storage for running workloads in a given availability zone. The preceding description doesn't sound like much of an event, but in the cloud, it triggers a massive response. Suddenly large sets of data no longer knew whether their backup copy still existed on the cluster, and a central tenet of the cluster's operation is that a backup copy is always available in case of a hardware failure.

The networking error in itself was relatively minor and easily rectified. But the error set off a massive "re-mirroring storm," a new and valuable addition to the computing lexicon's already long list of disaster terms. So many Elastic Block Store volumes were trying to find disk space on which to recreate themselves that when they failed to find it, they aggressively tried again, tying up disk operations in a zone. You get the picture.
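One common way to keep retries from piling up like this is capped exponential backoff with random jitter, so that failed requests wait progressively longer and do not all retry at the same instant. The sketch below shows the general technique only; it is not a description of how EBS behaves or of any fix Amazon has made, and the function name and parameters are made up for illustration.

```python
import random


def backoff_delays(attempts: int, base: float = 1.0, cap: float = 60.0, rng=random):
    """Yield one wait time in seconds per retry attempt.

    The ceiling doubles with each attempt up to `cap`, and the actual wait is
    drawn uniformly between 0 and that ceiling ("full jitter"), which spreads
    out clients that all failed at the same moment.
    """
    for attempt in range(attempts):
        ceiling = min(cap, base * 2 ** attempt)
        yield rng.uniform(0, ceiling)
```

A volume looking for replica space would sleep for each yielded delay before its next attempt, rather than retrying immediately and adding to the load that caused the failure.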

In building high availability into cloud software, we've escaped the confines of hardware failures that brought running systems to a halt. In the cloud, the hardware may fail and everything else keeps running. On the other hand, we've discovered that we've entered a higher atmosphere of operations and a larger plane on which potential failures may occur.

The new architecture works great when only one disk or server fails, a predictable event when running tens of thousands of devices. But the solution itself doesn't work if it thinks hundreds of servers or thousands of disks have failed all at once, taking valuable data with them. That's an unanticipated event in cloud architecture because it isn't supposed to happen. Nor did it happen last week. But the governing cloud software thought it had, and triggered a massive recovery effort. That effort in turn froze EBS and Relational Database Service in place. Server instances continued running in U.S. East-1, but they couldn't access anything, more servers couldn't be initiated, and for all practical purposes the cloud ceased functioning in one of its availability zones for more than 12 hours.

The accounts that I have paid the most attention to in the aftermath have been those from customers whose operations didn't fail, despite the Amazon architecture's breakdown: accounts like the ones from Donnie Flood, VP of engineering at Bizo, or Oren Michels, CEO of Mashery. Jesse Lipson, CEO of ShareFile, an original EC2 beta customer in 2008 and still a customer, told me, "We're pretty paranoid about betting on any company, even if it's Amazon," and his firm invoked the option of redirecting its traffic to Amazon's West Coast data center when it found its servers failing. ShareFile, which supplies a file sharing and storage service to business, maintains its own "heartbeat" monitoring system for its servers, and the system detected ShareFile servers disappearing after the "network event" in EC2. The system automatically shifted ShareFile traffic toward its servers in the West Coast data center.
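The general idea behind a heartbeat monitor of this kind can be sketched in a few lines. This is a simplified illustration under stated assumptions, not ShareFile's actual system: the `HeartbeatMonitor` class, the region labels, and the timeout rule are all hypothetical.

```python
import time


class HeartbeatMonitor:
    """Track when each server last reported in and choose a region to serve from."""

    def __init__(self, timeout_s: float, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_seen = {}   # server name -> (region, time of last heartbeat)

    def beat(self, server: str, region: str) -> None:
        # Called each time a server reports that it is alive.
        self.last_seen[server] = (region, self.clock())

    def live_servers(self, region: str) -> list[str]:
        # A server counts as live if it has reported within the timeout.
        now = self.clock()
        return [name for name, (reg, seen) in self.last_seen.items()
                if reg == region and now - seen <= self.timeout_s]

    def pick_region(self, preferred: str, fallback: str) -> str:
        # Stay on the preferred region while any of its servers is still live.
        return preferred if self.live_servers(preferred) else fallback
```

When every server in the preferred region goes silent for longer than the timeout, `pick_region` returns the fallback, which is the cue to point traffic at the other coast.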

I think Amazon itself should have a traffic shifting system that reroutes the bulk of customer traffic when an availability zone or whole data center is no longer available. It should shift it, as individual customers did, from East to West, degrading service no doubt, but keeping customers online. Lipson points out, however, that linking data centers might allow the harm to spread. Inside the Northern Virginia data center, the trouble spread like a contagion across availability zones, which are subdivisions of the data center meant to operate independently. Backup measures that worked in individual cases or across a small set cascaded out of control when invoked on a scale that had not been anticipated.

Despite that risk, I still think Amazon must link data centers, but it must also include a circuit breaker that queues up traffic or shunts it away if it turns into a threat to the functioning facility. Within a data center, availability zones need to be, well, available, even if there is trouble in one of them. I think that means architecting services so that they operate in one zone in some isolation from troubles in another. As the outage showed, the EBS and RDS services operated across availability zones, and freezing them in one froze them in all.
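For readers unfamiliar with the term, a software circuit breaker counts recent failures and, past a threshold, stops sending traffic for a cool-down period before cautiously trying again. The following is a bare-bones sketch of that pattern, not a description of anything Amazon runs; the class name, thresholds, and method names are assumptions made for the example.

```python
import time


class CircuitBreaker:
    """Stop forwarding traffic after repeated failures, then retry after a cool-down."""

    def __init__(self, max_failures: int, reset_after_s: float, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed and traffic flows

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Once the cool-down has elapsed, let trial traffic through;
        # the next recorded failure reopens the circuit.
        return self.clock() - self.opened_at >= self.reset_after_s

    def record_success(self) -> None:
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()
```

A link between data centers guarded this way would call `allow()` before forwarding each batch of traffic and queue or redirect it whenever the answer is no.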

All of this is much easier said than done when operating on the scale and complexity of Amazon's EC2. Amazon has done such a good job of pioneering the cloud that there is an immense reservoir of faith among its customers that it will eventually get it right. No one I've talked to says they're willing to switch. Cloud computing may have had a setback, but it will make a quick comeback. There is a widespread belief that when it does, it will be better. Still, it remains to be said: Amazon has got to do better than this. It has got to get it right.

Charles Babcock is an editor-at-large for InformationWeek.