Hurricane Sandy Lesson: VM Migration Can Stop Outages

Nov 16, 2012 (08:11 AM EST)

When it comes to disaster recovery, there's nothing like having learned your lesson before the disaster arrives. One of the most useful lessons is that virtualized systems now enable IT managers to move entire systems out of harm's way, even if that means halfway across the country -- although it's necessary to have an alternative site set up well in advance.

Another lesson is that moderate-cost backup telephony services based on IP networks, including the Internet, and open-source code can serve as hosted substitutes for your local carrier and office PBX service, if those get knocked offline.

Asterisk and BroadSoft are providers of what's come to be known as SIP trunkline services. Another, Evolve IP, played a role in ensuring patients needing oxygen in the aftermath of Hurricane Sandy could still place their calls to Apria Health Care and get their deliveries.

Apria, a national provider of home health care services, learned in 2005, when Katrina hit New Orleans, that it could lose telephone service at the moment it was most needed. Apria supplies hospital beds, drug infusion equipment, and other medical goods and nursing services to the homes of elderly, incapacitated and convalescing patients. But one of the most frequently needed services, especially in a disaster, is simply the personal oxygen tank.

During Katrina, Apria employees were incapacitated by the failure of office phone systems as the storm and its flood waters swept through the New Orleans region. That meant oxygen users who had left home to move in with relatives or find a room on higher ground had only what they had been able to carry away with them, often reflecting a hurried departure at the last minute. But when they picked up the phone -- their primary way of placing orders -- they often found the line to the Apria Health Care office was dead.

"Patients couldn't get through to us after Katrina," said David Slack, VP of IT network engineering for Apria. There was an alternative -- calling a national toll free number -- but few patients had it when they needed it. Something had to be done.

Apria looked at backups to its phone system from the major carriers, who expected the firm to spend $2,000 to $4,000 per office to upgrade to SIP trunkline services. With 550 Apria offices nationwide, that would have brought the total bill to as much as $2.2 million. It found a lower-cost substitute in the firms that carry enterprise voice and data over IP networks, including the Internet. BroadSoft and Asterisk use private branch exchange open-source code to provide such a service. Apria settled on another provider, Evolve IP, which offers what amounts to an enterprise version of the Vonage VoIP service for consumers.
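As a quick check on the carrier quote, the per-office range can be multiplied out across all offices. The figures are the ones cited above; the function and variable names are illustrative only.

```python
# Figures from the article: 550 offices, $2,000 to $4,000 per office.
OFFICES = 550
PER_OFFICE_LOW = 2_000
PER_OFFICE_HIGH = 4_000


def upgrade_cost_range(offices, low, high):
    """Return the (minimum, maximum) total cost of upgrading every office."""
    return offices * low, offices * high


low_total, high_total = upgrade_cost_range(OFFICES, PER_OFFICE_LOW, PER_OFFICE_HIGH)
print(f"${low_total:,} to ${high_total:,}")  # $1,100,000 to $2,200,000
```

The top of the range is where the total passes $2 million.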

The Evolve IP system, providing a virtual PBX for Apria hosted in an Evolve IP data center, was installed a year ago, based on the lessons learned from Katrina. The hosted PBX needed to be accessible from any branch and have a call forwarding capability in case a Katrina-like storm should knock out a branch. As Sandy gathered strength and was projected to veer into Middle Atlantic states, Apria's executives knew the system was going to get its first test. Twenty of its offices were directly in Sandy's path. The system needed to be programmed where to forward calls if the primary destination wasn't available, and Apria made those adjustments. If the Middletown, N.Y., branch lost telephone service, the system was to shift calls to the nearest well-staffed hub, in Cromwell, Conn.
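The forwarding rule described above can be pictured as an ordered list of backup sites per branch. The sketch below is a hypothetical illustration, not Evolve IP's actual configuration or API: the table, the function name and the second backup entry are assumptions, and only the Middletown-to-Cromwell pairing comes from the article.

```python
# Hypothetical failover table: each branch maps to backups in priority order.
# Only the Middletown -> Cromwell pairing is from the article; the second
# entry is a made-up placeholder to show how a further fallback would work.
FAILOVER = {
    "Middletown, N.Y.": ["Cromwell, Conn.", "Example Hub (hypothetical)"],
}


def route_call(branch, reachable, failover=FAILOVER):
    """Pick the site that should ring for a call placed to `branch`.

    `reachable` is the set of sites that currently have phone service.
    Returns None if neither the branch nor any of its backups is reachable.
    """
    if branch in reachable:
        return branch
    for backup in failover.get(branch, []):
        if backup in reachable:
            return backup
    return None


# Middletown is down, Cromwell is up: the call rings in Cromwell.
print(route_call("Middletown, N.Y.", {"Cromwell, Conn."}))
```

Keeping the table outside the branch offices is the point of the design: the routing decision still works when the branch itself has lost voice and data service, and changing a backup site means editing one entry.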

"We lost both phone and data connectivity in Middletown. Immediately the phones started ringing in Cromwell, the backup site. All the intelligence on (emergency) routing was in the cloud. It understood instantly if a branch office was down. It worked fantastically," said Slack in an interview. In this case, the "cloud" is three private Apria data centers in Philadelphia; Wayne, Pa.; and Las Vegas, connected through SIP gateways to the branches.

Apria offices in Brooklyn, N.Y., and Elmsford, N.J., also lost voice and data service and the backup system took over their calls as well.

In all, 1,006 messages dealing with customers' needs were handled by Apria's remote, virtual PBX system. The system could be reprogrammed on the fly to select a new backup location if a designated one was knocked out. "Our product is not a nicety that people come and pick up. To maintain themselves, clients need our services," Slack said. And those 1,006 rerouted calls were proof of that after Sandy.

Not everyone was fortunate enough to have had a prior hurricane as an instructor. One hard lesson taught by Sandy was that the magnitude of the disaster can change the terms on which you thought your recovery plan was operating. The top priority of a plan is to keep a data center running; the top priority of government authorities might lie somewhere else.

Datapipe prepared its two data centers in Somerset, N.J., for the storm, making sure to top off the fuel tanks of the diesel backup generators and going the extra length of calling in a diesel fuel tank truck from its private contractor and parking it on premises. Daniel Newton, senior VP of operations, said other preparations on staffing and communications had been made and the data centers rode out the storm, experiencing only a couple of minor leaks from wind-driven rain.

The site kept a bank of emergency generators running as other sites lost their power supplies and the Somerset utility power showed fluctuations. Newton thought nothing of using a little diesel fuel. He had plenty to spare. Then "an unforeseen circumstance occurred" as his backup supply tanker fired up its engine and drove off. Its owner had been ordered to deliver fuel to hospitals, nursing homes and convalescent centers instead of standing by at Datapipe. The sheer scale of the storm had undermined the plan.

Newton said the site never suffered a power outage, so the issue became moot. But he had discovered a hole in the plan. Datapipe immediately plugged it "by procuring our own fuel truck."

In New York, the solution wasn't as simple. The unexpected happened and a storm surge washed over three blocks of lower Manhattan from Battery Park. Caught in that surge was 75 Broad Street, which houses Peer 1 Hosting and Internap data centers. Steve Orchard, Internap's senior VP of development and operations, knew the building's fuel supply system was stocked up and his backup generators were on the second floor, well above any conceivable flooding. But he hadn't allowed for the reserve fuel tank's vent pipe, which lets air enter the tank as fuel is pumped out. It was two feet above the ground, outside the building.

When the storm surge hit the neighborhood, it flooded the basement, disrupting the redundant pumping system's electrical supply and shutting down the pumps. That would have been a relatively simple problem to fix: bring in a new pump and move fuel from ground level to the second floor. But salt water had been able to enter and flow down the vent pipe into the building's reserve diesel supply, and 10,000 gallons of precious fuel were contaminated. Orchard had two major issues to overcome with the building's engineers. They did so, rigging a new pump, fuel supply and "creative fabrication of piping and hoses" to start moving fuel to the second floor.

Orchard said there were many different workmen involved, figuring out how to disengage the fuel line from its current linkages and apply new fittings to allow it to connect to a fuel truck. They had to locate a small generator to provide power to let them do the work. Internap had to shut down late in the morning Oct. 30. It was up and running again before midnight, having been out of commission for less than 12 hours, thanks to "the creativity and resiliency" of the Internap staff and 75 Broad building engineers.

One thing that might have gone wrong didn't. When salt water got into the vent pipe, the redundant system of pumps in the basement stopped working at the same time, so the contamination didn't spread in the line. Instead of needing to flush the line and perhaps repair generators, Orchard only needed to connect the new source.

The best-laid plans of many data centers had some aspect of disaster recovery go off the rails. The collected experiences might become an argument for disaster preparedness to move out of the realm of attempting to guarantee the physical integrity and continuous operation of a given data center to system transfer -- migrating mission-critical virtual machines out of the data center to another, outside of harm's way. That approach would be useful not only for hurricanes but for fires, floods and earthquakes.

The ability to move virtual machines, which started with VMware's VMotion capability, used to be limited to a new location in the same data center rack. It gained the capability to move across racks in one data center, then between data centers.

Internap, Datapipe and many other data center service providers now have high-level disaster recovery services that allow the movement of critical systems from one location to another. But SunGard, another provider of the services, warns you can't wait until the last minute to invoke them.

"If you don't have a subscription (for disaster recovery) with SunGard, we don't allow you to sign up on the fly. The owner can't call the insurance company to write a policy when the building's on fire," said Walter Dearing, VP of recovery services at the SunGard Availability Services unit.

In fact, such recovery systems still take forethought and planning to implement. The key issue is not whether you can place virtual machine duplicates in some other location. That's a cinch. The main problem is getting a synchronized and up-to-date data flow into those systems to allow them to keep running.

Few companies keep a hot mirrored system running in a remote location, receiving a real-time stream of data and ready to pick up where another leaves off in a few milliseconds. There are many ways to recover systems that are less expensive than a complete live duplicate, and each customer decides what level of recovery it must have.

Do they want to recover by digging week-old tapes out of a vault somewhere? Do they have snapshot backups that are only a day old? Do they have all the server logs they need to reconstruct transactions up until a point that is only a few minutes or a few seconds short of the point of failure?
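Those questions amount to asking how much recent data a business can afford to lose. A minimal sketch of that comparison follows; the option names and the exact age assigned to each are assumptions chosen to mirror the examples above (week-old tapes, day-old snapshots, logs a few minutes short of the failure), not measurements from any provider.

```python
from datetime import timedelta

# Illustrative ages of the most recent recoverable copy for each method.
# These values are assumptions that echo the article's examples.
RECOVERY_OPTIONS = {
    "vaulted tape": timedelta(weeks=1),
    "daily snapshot": timedelta(days=1),
    "server logs": timedelta(minutes=5),
}


def options_within(max_data_loss, options=RECOVERY_OPTIONS):
    """List the recovery methods whose data is no older than `max_data_loss`."""
    return [name for name, age in options.items() if age <= max_data_loss]


# A business that can tolerate losing at most one hour of transactions:
print(options_within(timedelta(hours=1)))  # ['server logs']
```

The tighter the tolerance, the fewer options remain, and the remaining ones are the more expensive to run.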

"Tape is the cheapest backup. It's also the most error prone in terms of physically accessing the tapes and the data on the tapes," Dearing said, especially during weather like a hurricane. But even a site that is taking frequent snapshots of its data and replicating them to two or more remote locations will need data recovery systems in place to maintain data integrity and restart systems.

"Recovery facilities do not have a cookie cutter similarity," he noted. SunGard's facilities in Carlstadt, N.J., and Philadelphia are prepared to handle recovery of much larger, more complex systems than its facility in Arizona, he said. Even a virtual machine recovery system must be tested frequently and rigorously, something many customers find hard to fit into busy schedules. Seamless failover based on virtual machines is possible today between sites, Internap, Datapipe and SunGard executives all agree. "But you can't do it without the proper due diligence," Dearing said.
