Apr 26, 2011 (02:04 PM EDT)
How HootSuite Recovered From Amazon's Cloud Outage
Read the Original Article at InformationWeek
HootSuite lost a business day's worth of service Thursday in what HootSuite CTO Simon Stanlake called an "absolute worst case scenario" when the East Coast data center for Amazon Web Services suffered a major outage.
HootSuite, which provides a social media dashboard to monitor chatter on the Web, was offline from about 1 a.m. to 7 p.m. Pacific time on Thursday, after which it was able to restore service from backup servers in the same Amazon data center but within a different "availability zone." Amazon has marketed these zones as part of its architecture for high availability, saying that they were architected so that the failure of one zone within a data center would not affect customers hosted in the other zones. However, all four zones in the Virginia data center were affected during Thursday's incident, and data center services weren't fully restored until the weekend.
HootSuite was only able to restore service Thursday night by sacrificing some customer data, which at that point was trapped in an availability zone that was still offline, and proceeding on the basis of a Tuesday night backup. That meant new users and users who had made updates to their accounts in the meantime lost some work. In a blog post, HootSuite said it would offer all customers a $50 credit (and some unspecified additional consideration for enterprise account holders). Even though HootSuite's terms of service say refunds will be offered only for outages of more than 24 hours, CEO Ryan Holmes wrote "we acknowledge users were inconvenienced and we want to make things right."
Stanlake said the fragility of Amazon's infrastructure came as an unpleasant surprise. "I was under the impression that an entire availability zone going dark was an earthquake-like scenario--let alone an entire region going dark," he said in an interview.
Stanlake based his decisions about how conservative to make his backup and recovery plans partly on the perception that such a failure was fairly remote. That perception shaped, for example, his assessment of the greater expense of maintaining a "hot backup"--a frequently synchronized instance of the HootSuite production database and Web services that could be fired up at a moment's notice. He was willing to accept the possibility of losing a day's worth of data because he thought the odds were relatively low.
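The trade-off between a periodic backup and a hot backup comes down to how much recent data is lost when the primary database becomes unreachable. The following is a minimal sketch of that arithmetic, not HootSuite's actual tooling: the function names, the 24-hour and 5-minute sync intervals, and the clock times are all illustrative assumptions (the article gives only "Tuesday night" and "about 1 a.m." Thursday).

```python
from datetime import datetime, timedelta


def data_loss_window(last_sync: datetime, failure: datetime) -> timedelta:
    """Writes made after the last sync and before the failure are lost on restore."""
    if failure < last_sync:
        raise ValueError("failure must not precede the last sync")
    return failure - last_sync


def worst_case_loss(sync_interval: timedelta) -> timedelta:
    """Worst case: the failure hits just before the next scheduled sync."""
    return sync_interval


# Hypothetical schedules, chosen only to contrast the two strategies.
nightly_backup = timedelta(hours=24)
hot_backup = timedelta(minutes=5)

# Assumed clock times (Pacific); the exact backup time is not given in the article.
last_backup = datetime(2011, 4, 19, 23, 0)   # "Tuesday night"
outage_start = datetime(2011, 4, 21, 1, 0)   # "about 1 a.m." Thursday

if __name__ == "__main__":
    print("Worst case, nightly backup:", worst_case_loss(nightly_backup))
    print("Worst case, hot backup:    ", worst_case_loss(hot_backup))
    print("Loss window in this sketch:", data_loss_window(last_backup, outage_start))
```

A shorter sync interval shrinks the window of lost writes but costs more to run, which is the expense Stanlake was weighing against what he believed to be a remote risk.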
Stanlake said HootSuite's service relies on Amazon's Elastic Block Store service to manage user account data. By late Thursday afternoon, Amazon had restored this service in three of the four availability zones, but the one containing HootSuite's production database was still inaccessible. As the hours passed and Amazon seemed to be coming no closer to resolving the issue, the engineering team decided it had no choice but to restore service from a backup copy of the database. "We hated to make this decision but it turned out to be the best option," he said, given that HootSuite didn't regain access to that database until Sunday.
As tough as this situation was, Stanlake said it hasn't soured him on the idea of cloud computing, which has become a standard Web strategy for technology startups that want to be in a position to grow quickly. "I don't think there's any doubt that we couldn't have gotten to where we are today without these services being in place," he said. Still, he said he plans to have a serious conversation with the service provider about what it will do to prevent a recurrence.
"I'm not ruling anything out, but my gut says we won't be moving off Amazon anytime soon," Stanlake said.
HootSuite was already working on replicating its infrastructure to other data centers--as much for geographic coverage for customers around the world as for redundancy--and will likely accelerate those plans, Stanlake said. However, figuring out the most efficient distributed architecture will take time, he said. "The difficult thing is doing it properly."