TechWeb

Google Vs. Zombies -- And Worse

Mar 12, 2013 (06:03 AM EDT)

Read the Original Article at http://www.informationweek.com/news/showArticle.jhtml?articleID=240150484


After the zombies took over Google's data center, the heroic action of a few selfless individuals saved the day. Never underestimate what a site reliability engineer can do with an axe.

"If you look at zombies in the data center, they're after the people," explained Kripa Krishnan, technical program manager at Google. "So it becomes less of a machine's problem and becomes more of a people problem..."

The zombie invasion occurred back in 2007. It was one of the first Disaster Recovery Testing (DiRT) events created to evaluate Google's operational resilience in a crisis. This was before the Centers for Disease Control and Prevention began warning about zombies because storms, pandemics and earthquakes don't get people's attention anymore.

Although heroism has played a central role in saving Google more than once -- another scenario involved an executive wielding the teleportation gun from Valve's Portal -- it can't be relied on when disaster strikes, any more than a single IT system or business process can. Google as a company promotes the perception that its employees are exceptionally talented. But when it comes to preparing for the worst, the company can't simply assume that exceptional skills will save the day.

"We find that people are people, and they burn out if they work insane hours and long shifts," said Krishnan. "Heroic tactics are not a sustainable model if you're in a disaster."

The DiRT program was created seven years ago and Krishnan began managing DiRT events a year after that. Genial and sharp, with a penchant for using the word "goodness" to emphasize a point, she has a background that recalls the famously overachieving Buckaroo Banzai, depicted in the 1984 movie that bears his name as a neurosurgeon, physicist, rock musician and test pilot.

Hyperbole perhaps, but it's a necessary element in a story about heroism. Krishnan was studying medicine over a decade ago when her interests took her to music and theater. Three years in, she decided to study performance arts, and eventually came to the U.S. to focus on theater. Then a professor convinced her to take a computer science course. Having left science for the arts, Krishnan finally emerged from graduate school with a degree in Management Information Systems. Thereafter, she became involved with telemedicine networking in Kosovo and later landed at Google.

Now her job is to break things, as Krishnan explained in an interview at Google's Mountain View, Calif., headquarters.

"Sometimes we will bring in someone to write something that will cause a failure in some underneath layer and it will manifest itself as cascading failures in some front-end facing product," Krishnan said. Other times, she says, her team might direct someone to introduce corrupt data into a system, to see how long it takes to find the problem.

DiRT is an annual exercise. Although various Google product groups conduct their own internal stress tests, DiRT's scope is companywide. DiRT scenarios challenge both technical infrastructure and organizational dynamics. Initially, the tests were restricted to user-facing systems, but they have been expanded to cover the full range of Google operations. Beyond data centers, DiRT testing might include systems used by facilities, finance, human resources and security, among other business groups. More recently, as the company's enterprise business has become more successful, customer support systems were added to the tests.

DiRT exercises require the work of hundreds of engineering and operations employees for several days, which means they're not inexpensive to run. They can affect live systems and have even resulted in revenue loss. But the price is deemed to be worth it.

Sanjay Jain, associate industry professor in the department of decision sciences at George Washington University, said in an email that the apparent increase in manmade and natural disasters around the globe demands more active continuity planning.

"Recently, companies have had to face major issues due to disasters including the loss of operations in New York and New Jersey area following Hurricane Sandy a few months ago, and the major impact on supply chains following the tsunami in Japan in 2011," Jain said. "Companies need to be more thorough in planning for safety of their personnel and maintaining business continuity in face of such eventualities. Such efforts have to go beyond duplicating data servers (that is of course needed) to employing live and computer simulations of potential disaster scenarios and their impact on companies' personnel, operations, and assets, and testing of measures to eliminate or substantially reduce the negative impacts."

In case of emergency, Google has a war room. DiRT tests are run from a simulated war room, which can be one of the company's many conference rooms.

"The war room is actually a physical room..." said Krishnan. "We have a lot of tech leads and a lot of coordinators sitting together. So we're communicating with each other constantly. It's a very adrenaline-filled room. Very little sleep. Everybody, when something goes down, we're all stressed and alert. And we are supposed to know everything that's going on at any given point in time."

The events often require being up in the middle of the night, due to the global nature of the testing. They're powered by caffeine and donuts, which pretty much covers the hacker food pyramid.

"The one time we said, 'let's bring health food into it,' we brought this vegetable tray with dip," said Krishnan. "[But] you don't eat vegetables in the war room.... The good food went bad in like two hours. It was just smelling up the room, and people are gagging, 'Get me out of here.'"

The cause of the problem, whether it's rotting vegetables, zombies taking over a data center or something more mundane, isn't as important as the problems that are revealed and the response to them. DiRT exists to increase the likelihood that Google can keep its equipment and operations up and running.

One recent DiRT exercise, for example, involved an earthquake near the company's headquarters that took down a data center housing several internal Google systems. It revealed not only systems that didn't have adequate backups but also unexpected dependencies. Some engineers had systems failing over to workstations at offices in Mountain View, but these became inaccessible when the "earthquake" caused authentication mechanisms to fail.
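
The kind of hidden dependency the earthquake exercise exposed can be illustrated with a minimal sketch. The system names and the dependency map below are hypothetical, invented purely for illustration; nothing here reflects Google's actual systems or tooling. The idea is simply to walk each system's dependencies transitively and flag anything, including failover targets, that still relies on the failed site.

```python
# Hypothetical dependency map: each system lists what it needs to function.
# These names are illustrative assumptions, not real Google systems.
DEPENDS_ON = {
    "internal-tool": ["workstation-failover"],
    "workstation-failover": ["auth"],
    "auth": ["hq-datacenter"],
    "wiki": ["remote-datacenter"],
    "hq-datacenter": [],
    "remote-datacenter": [],
}


def affected_by(failed, depends_on):
    """Return every system that directly or transitively depends on `failed`."""
    affected = set()
    changed = True
    # Keep sweeping the map until a full pass adds nothing new.
    while changed:
        changed = False
        for system, deps in depends_on.items():
            if system == failed or system in affected:
                continue
            if any(dep == failed or dep in affected for dep in deps):
                affected.add(system)
                changed = True
    return affected


if __name__ == "__main__":
    # The failover target shows up as affected because it needs "auth",
    # which in this made-up map lives in the failed data center.
    print(sorted(affected_by("hq-datacenter", DEPENDS_ON)))
```

In this toy map, the workstation failover looks independent of the data center at first glance, yet it is flagged because authentication sits between the two.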

Real disasters, such as Hurricane Sandy, have informed how Google deals with imagined ones.

"Sandy corrected a lot of our assumptions," said Krishnan. "That was a real world application of a lot of the things that we've done. We found a lot of gaps that we hadn't addressed at all. We found that some of the things that we decided would work are somewhat contrived and we should fix that."

Google's technical infrastructure weathered Sandy just fine, according to Krishnan. The problems that arose had to do with people: people who had to deal with flooded homes or family emergencies, people who didn't have power or who had lost Internet access, people who didn't have the information they needed to contact others, and people whose concerns overwhelmed incident managers. Google ended up sending many employees based in the New York area home during the crisis.

The problems exposed by Sandy showed up in a subsequent DiRT test: Sandy created a lot of internal company email, "so we actually simulated that environment during our recent DiRT exercise," Krishnan explained. "We started bombarding our incident managers with, 'Hey, I have a flight home. I don't know how to get there. Tell me. The airline is costing me a bazillion dollars. Can you expense this for me? Will you pay for me?'"

To Krishnan's surprise, the test participants responded well. They self-organized and dealt with the problems, she said.

The learning isn't always so swift. During the first DiRT test, only one person was able to find the emergency communication plan and dial in to the conference call at the designated time. A follow-up test produced a far better response, so good in fact that the number of callers exceeded the bridge line's capacity. And a subsequent call was undone by someone who called in and then placed the call on hold, subjecting the other conference call participants to "hold music" and revealing the lack of a mechanism to eject the absent caller or silence the music.

It turns out that it isn't easy to crash Google's systems. Krishnan recounted an attempt to simulate network packet loss that proved ineffective. "Our test bombed on us," she explained. "Then we realized that it was because we chose a certain time of day when there was almost no traffic. And based on the traffic, we had to actually cause a full outage to notice anything. There was no amount of packet loss we could create for anybody to notice anything. We were super-resilient to that."
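
One way to see why injected packet loss might go unnoticed at a quiet hour is a toy model. Everything in it is an assumption made for illustration: the article does not say how Google's systems handle loss, and the retry count, loss rate and traffic figures below are invented. The sketch assumes each request is retried a fixed number of times and each attempt is lost independently.

```python
def expected_visible_failures(requests, loss_rate, retries):
    """Expected number of requests that fail even after all retries.

    Assumes (hypothetically) that each attempt is lost independently
    with probability `loss_rate`, so a request only fails outright
    when the first attempt and every retry are all lost.
    """
    per_request_failure = loss_rate ** (retries + 1)
    return requests * per_request_failure


if __name__ == "__main__":
    # Illustrative traffic levels only; not real measurements.
    for label, requests in [("quiet period", 100), ("peak period", 1_000_000)]:
        failures = expected_visible_failures(requests, loss_rate=0.10, retries=3)
        print(f"{label}: about {failures:.2f} user-visible failures expected")
```

Under these assumed numbers, a quiet period produces almost no visible failures at 10 percent loss, and only total loss (`loss_rate=1.0`) guarantees that anyone notices, which is consistent with the shape of what Krishnan describes.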

The goal of another DiRT exercise, Krishnan said, was to test executive decision making. An alert was issued. "Within 15 minutes, some of our most senior executives showed up on a phone bridge," she said. "They were making decisions all over the place. The beauty of the whole thing is -- even through those decisions -- the first thing they thought about was their users." She characterized the call as "the most inspirational eight minutes of any DiRT so far."

DiRT, in conjunction with other quality-control regimes, has helped make Google software better. Over the years, engineers have whitelisted certain applications to exempt them from tests when they know the applications cannot pass. Krishnan says that the number of whitelisted applications has been declining and that now hardly any applications have to be excluded.

"We spend a ton of energy and time building top-notch products for our users," Krishnan explained. "But we also want to build top-notch infrastructure, so that our users believe our systems are reliable and available. DiRT tests to make sure that is true."

Heroism might not be a sustainable model for dealing with disasters, but at Google, it's more than a fictional framework; it's part of the job. "[W]e have had scenarios of zombies, and the incredibly axe- and baseball-bat skilled site reliability engineers and hardware operations techs who save the day," Krishnan said in an email. "In reality, however, all credit goes to the people who work tirelessly each year to make this happen: the DiRT team, the incident commanders, and the rest of those at Google who respond to these intense exercises."

It's often said that failing to plan is planning to fail. However, the converse is not true: Planning to fail isn't failing to plan. Rather, as Google demonstrates, planning to fail is preparing to succeed.