Feb 27, 2009 (07:02 PM EST)
New Options Power Always-On Apps
Read the Original Article at InformationWeek
Enterprises of all sizes demand 24/7 application delivery. Server failures, maintenance downtime, and acts of nature are no excuse. If keeping key applications online is your job, you should consider yourself lucky: You have more options for keeping apps up than ever before.
With the tools available today, organizations have few excuses -- not even budgetary ones -- for relying on the hours-long process of manually restoring mission-critical apps from backup.
Application failover approaches run the gamut, from basic clustered server "ping and a prayer" software to complete virtualized systems and application-specific schemes. Finding the one that's right for you will involve more than a glance at the price tag, which runs from $1,500 to $10,000-plus per protected server. You'll also need to consider ease of use, speed of failover, bandwidth consumption, and how much data is at risk.
When most system administrators look to improve application availability, they start with server clusters. Failover clustering has been available in Windows Server's Enterprise Editions since Windows NT 4 was state of the art in the mid-1990s, but it developed a well-deserved reputation for being finicky.
Until Windows Server 2008 was released, Windows clusters used shared storage, which of course made the storage subsystem a single point of failure. Microsoft supported only integrated server and storage solutions, so users who had Hewlett-Packard servers and EqualLogic storage, for example, were out of luck when it came to support. Most significantly, applications had to be cluster-aware to smoothly fail over from one server node to another.
Even before Microsoft added clustering to Windows itself, vendors like Double-Take Software released solutions that combined data replication, which eliminates storage as a single point of failure, with automatic failover. Early versions of these products required a lot of setup and tweaking, including installing the OS and applications on both servers. However, the current crop, such as SteelEye's LifeKeeper, CA/XOsoft's WANsync, NeverFail's Continuous Protection Suite, and, of course, Double-Take, can clone a production server to the standby server, both speeding setup and ensuring the servers are similarly configured. And some of these offerings support Linux clustering as well as traditional Windows clusters.
In a generic cluster or high-availability system, the failover server, or servers, monitor the primary host by exchanging heartbeat messages across the network (see diagram, "Two Ways To Keep Apps At Your Service"). If the primary host doesn't respond within a given period of time, the standby server assumes the primary host's identity and starts processing data in its place.
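The heartbeat scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the class name StandbyMonitor, the five-second timeout, and the explicit timestamps passed in by the caller are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class StandbyMonitor:
    """Standby-side view of the heartbeat link (hypothetical names)."""
    timeout_s: float          # how long the primary may stay silent
    last_heartbeat_s: float   # time the most recent heartbeat arrived
    is_active: bool = False   # True once the standby has taken over

    def on_heartbeat(self, now_s: float) -> None:
        # Record the arrival time of a heartbeat from the primary.
        self.last_heartbeat_s = now_s

    def check(self, now_s: float) -> bool:
        # Take over once the primary has been silent longer than the timeout.
        silent_for = now_s - self.last_heartbeat_s
        if not self.is_active and silent_for > self.timeout_s:
            # A real product would assume the primary's identity here.
            self.is_active = True
        return self.is_active


monitor = StandbyMonitor(timeout_s=5.0, last_heartbeat_s=0.0)
monitor.on_heartbeat(2.0)
print(monitor.check(6.0))  # 4 s of silence: still standby -> False
print(monitor.check(8.0))  # 6 s of silence: takes over -> True
```

Passing the clock in as an argument keeps the sketch deterministic; a production monitor would read a monotonic clock and would also need to guard against both servers becoming active at once.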
This method can prevent data loss due to a complete failure of the primary host and allows manual failovers for patching and other server maintenance, but it can't detect more subtle failures of services and daemon processes. Vendors including SonaSoft and Marathon sell more app-aware offerings, which check the state of services or connect directly to applications to ensure they're running.
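To show the difference between a plain heartbeat and an application-aware check, here is a rough Python sketch. The function names and the idea of passing each service check as a callable are assumptions for illustration only; they do not describe how SonaSoft, Marathon, or any other vendor actually probes applications.

```python
from typing import Callable, Dict, List


def failed_services(probes: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run each service probe and return the names of those that failed."""
    failed = []
    for name, probe in probes.items():
        try:
            ok = probe()
        except Exception:
            # A probe that raises is treated the same as a failed service.
            ok = False
        if not ok:
            failed.append(name)
    return failed


def should_fail_over(heartbeat_ok: bool,
                     probes: Dict[str, Callable[[], bool]]) -> bool:
    """Fail over if the host is silent OR any monitored service is down."""
    return (not heartbeat_ok) or bool(failed_services(probes))


# The host still answers heartbeats, but one (hypothetical) service has died.
probes = {"mail-transport": lambda: True, "mail-store": lambda: False}
print(failed_services(probes))         # ['mail-store']
print(should_fail_over(True, probes))  # True
```

A heartbeat-only monitor would see this host as healthy; the per-service probes are what catch the subtler failure.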
Products also use different methods to allow a standby server to assume the identity of a production server in the event of a failure. The simplest way is to assume the production server's IP address and start appropriate services. A more sophisticated approach used by NeverFail and others is to hide the standby server behind an internal firewall to prevent users from accessing it until it's called on to take on the primary server role. At the top end of the product spectrum, Marathon's EverRun runs the primary and standby servers in lockstep in a virtual environment. Each server processes all data, but users access only the primary one. The backup server waits in the wings until something goes wrong.
In a high-availability cluster (left), a data center's standby server can adopt the primary server's IP address and identity when it doesn't respond over the heartbeat link. Effective disaster recovery (right) demands a more sophisticated scheme: In the event of server failure, the data must fail over to the standby server, often in a remote facility, over a wide-area network (WAN).
Just as in data center management, server virtualization has enabled administrators to greatly improve availability. The most obvious impact is that virtualizing standby servers breaks the expensive one-to-one relationship between production and standby servers, thus reducing the cost of providing standby servers. Because all virtual servers on the same virtualization platform look identical to the guest OS, driver and other related hardware issues are eliminated.
Organizations can also quickly provision servers in data center high-availability systems at the virtual server host. VMware High Availability, Microsoft Clustering of Hyper-V, and Linux failover solutions for Xen all protect guest servers from host failures and allow host maintenance without significant guest downtime. Marathon's EverRun for Hyper-V and Xen can extend host protection to true multisite disaster recovery as well.
While VMware's Site Recovery Manager (SRM) does require some scripting, it also provides for site-to-site failover of virtual servers across a variety of applications and guest operating systems. SRM relies on storage arrays to replicate the data from site to site, and array manufacturers have to write an adapter to enable SRM to manage the replication process.
Another recent trend in application availability has been the development of high-availability and disaster-recovery solutions that are not only application-aware but also operate at the application layer. A general-purpose solution, by contrast, replicates the file- or block-level writes made to the primary host's storage to a standby storage system.
Regardless of whether the replication is done using software in the primary and standby hosts or by the storage system itself, the application's database is being managed by the primary host with the standby's copy of the application sitting idly by. When the failover occurs, the application starts on the standby server and mounts its "crash-consistent" copy of the database. (Crash consistent is the industry euphemism for a database that's as consistent as it would be when the server crashes -- or, in plain English, not consistent at all since some number of transactions were assumed to be in the middle of being processed when the server crashed.) Therefore, the first thing the server has to do is a quick consistency check to roll back the transactions that were in progress when the crash occurred. This process usually takes just a minute or two but can occasionally leave the server unavailable to users for several hours as the database is checked and reindexed, especially if the crash occurs in the middle of a database defragmentation.
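The consistency check on a crash-consistent copy can be pictured with a toy transaction log. The sketch below is a deliberately simplified assumption: the record layout ("begin", "write", "commit") and the recover function are invented for illustration and do not reflect the log format of any real database.

```python
def recover(log):
    """Rebuild a database from a crash-consistent log.

    Writes from committed transactions are applied; transactions that had
    begun but not committed when the crash hit are rolled back (ignored).
    """
    began = {rec[1] for rec in log if rec[0] == "begin"}
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    in_flight = began - committed

    db = {}
    for rec in log:
        if rec[0] == "write" and rec[1] in committed:
            _, _, key, value = rec
            db[key] = value
    return db, in_flight


# Transaction 1 finished; transaction 2 was mid-flight when the server crashed.
log = [
    ("begin", 1), ("write", 1, "a", 10), ("commit", 1),
    ("begin", 2), ("write", 2, "b", 20),
]
db, rolled_back = recover(log)
print(db)           # {'a': 10}
print(rolled_back)  # {2}
```

The recovery pass has to read the log before users can connect, which is why the time it takes grows with the amount of unfinished work at the moment of the crash.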
Application-specific solutions replicate transactional data to a standby server where the running application applies the transaction to its copy of the database. This approach has several advantages. First, because the backup server is already running the application, failover is quick: The standby doesn't have to take the time to start the application processes and mount the database. Second, posting completed transactions prevents many sources of database corruption, such as those caused by malware on the primary host or storage system I/O errors, from propagating to the backup server.
The secondary server can also be used as a data source for operations like backup, archiving, and reporting, allowing these processes to run anytime without affecting users.
Replicating transactional data also reduces the amount of data that must be sent between primary and secondary data stores. Modern databases write data to transaction log files and then, when the transaction is complete, to the on-disk database. Solutions that replicate storage data must replicate the writes to both the transaction log and database, whereas transaction-based solutions only have to send any given transaction across the line once.
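A back-of-the-envelope calculation makes the bandwidth argument concrete. The sketch below assumes, purely for illustration, that the log write and the database write are each about the size of the transaction itself, and it ignores protocol overhead, compression, and page-level write granularity. The function names and transaction sizes are hypothetical.

```python
def storage_level_bytes(tx_sizes):
    # Storage replication sees two writes per transaction:
    # one to the transaction log and one to the on-disk database.
    return sum(2 * size for size in tx_sizes)


def transaction_level_bytes(tx_sizes):
    # Transaction replication sends each transaction across the line once.
    return sum(tx_sizes)


sizes = [4096, 8192, 2048]             # example transaction sizes in bytes
print(storage_level_bytes(sizes))      # 28672
print(transaction_level_bytes(sizes))  # 14336
```

Under these simplifying assumptions the storage-level approach moves twice the data; real ratios depend on the database and the replication product.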
Many application-specific failover vendors have focused on Exchange, in no small part because it has so many moving parts and interconnections to Active Directory and other network services. Software like Cemaphore's MailShadow OnSite and SonaSoft's SonaSafe for Exchange capture data from the Exchange server using the native Exchange MAPI protocol and transfer it to a running Exchange server. One backup server can provide protection for several source servers, and, with SonaSafe, production servers in different offices can back each other up. To fail over, they run a script that updates the user's mailbox location data in Active Directory; users then connect to their mailboxes on the standby server.
Teneros' Application Continuity Appliance packages the MAPI data acquisition, failover, and standby Exchange server in an appliance that's positioned inline between the users and the production Exchange server. It will also asynchronously replicate data to an additional appliance at a remote site for disaster-recovery purposes.
Microsoft has even dipped its toe in the water with disaster recovery features for Exchange 2007: cluster continuous replication (CCR) for high availability and standby continuous replication (SCR), both of which ship transaction log files from primary to secondary servers to keep the database up to date. SCR requires significant manual intervention, or scripting, to bring up the standby server, but it shows promise. CCR relies on Windows clustering for failover and is limited to having all the systems on the same subnet.
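The general idea of log shipping, in which a standby keeps its database current by replaying log files received from the primary, can be sketched as follows. This is a generic illustration, not Exchange's actual mechanism: the StandbyReplica class, the sequence numbers, and the key/value records are all assumptions made for the example.

```python
class StandbyReplica:
    """Replays shipped log files strictly in sequence order (hypothetical)."""

    def __init__(self):
        self.db = {}        # the standby's copy of the database
        self.next_seq = 1   # sequence number of the next log file to replay
        self.pending = {}   # log files that arrived ahead of their turn

    def receive(self, seq, records):
        self.pending[seq] = records
        # Replay every log file that is now contiguous with what was applied.
        while self.next_seq in self.pending:
            for key, value in self.pending.pop(self.next_seq):
                self.db[key] = value
            self.next_seq += 1


standby = StandbyReplica()
standby.receive(2, [("b", 2)])  # arrives early, so it is held back
print(standby.db)               # {}
standby.receive(1, [("a", 1)])  # fills the gap; files 1 and 2 are replayed
print(standby.db)               # {'a': 1, 'b': 2}
```

Holding back out-of-order files keeps the standby's database a faithful prefix of the primary's history, at the price of lagging behind whenever a file is delayed.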
The downside to application-level failover is that, in the event of a primary server failure, at least a few transactions will be lost in transit, posted to the primary server but not replicated to the standby. So while these solutions improve recovery time, they can also negatively affect recovery points.
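The recovery-point trade-off can be illustrated with a small calculation. The sketch assumes a fixed replication lag and uses made-up commit times in milliseconds; the function name lost_transactions is hypothetical, and real replication delays vary with load and link quality.

```python
def lost_transactions(commit_times_ms, replication_lag_ms, failure_time_ms):
    """Return commits that reached the primary but not the standby.

    A commit is lost if it happened at or before the failure but would
    only have arrived at the standby after the failure.
    """
    return [
        t for t in commit_times_ms
        if t <= failure_time_ms and t + replication_lag_ms > failure_time_ms
    ]


commits = [1000, 2000, 2800, 2950]  # commit times on the primary, in ms
print(lost_transactions(commits, replication_lag_ms=100, failure_time_ms=3000))
# [2950]
```

With these example numbers, only the commit made 50 ms before the failure is lost; a longer lag or a busier server widens that window.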