2 Sept 2013

VMware and Hyper-V virtual machine disaster recovery

What qualifies as a disaster can be defined widely, but in this article it refers to any risk of service outage due to hardware, software or environmental failure and the process of managing that outage.

Specifically, this article will cover virtual machine (VM) disaster recovery using the VMware vSphere and Microsoft Hyper-V platforms.

Within their products, VMware and Microsoft provide the ability to cater for multiple disaster scenarios other than total site loss.

Most virtual machine disaster recovery (DR) products need additional hardware at the local or remote sites, and in some cases require shared storage. With careful planning, administrators can integrate these products into their virtual server designs to provide effective business continuity and mitigate the impact of failure.

Disaster recovery basics – backup and replication

Ensuring business continuity typically takes one of two forms:
  • Data and system backups to tape or disk that enable the recovery of entire systems onto new hardware, either rebuilt or at a new location
  • Real-time replication of data to a new location with hardware ready and waiting, using replication technology and a wide area network. Unsurprisingly, data replication is the more expensive option and is typically used only for important production systems.

In the pre-virtual world, backup was achieved using backup software agents installed onto each server, with backup data taken offsite manually or written offsite across the network.

Replication was typically done at the storage array, using technologies such as SRDF from EMC or TrueCopy from Hitachi, but it was also possible to replicate data at the server or application, using, for example, Oracle's Data Guard.

Array-based replication works better in large-scale environments where the complexity and time required to restore individual servers means recovery time objectives (RTO) cannot easily be met.

Virtual machine backup and replication

Server virtualisation introduces new challenges in implementing disaster recovery policies.

The traditional backup process doesn't work well for virtual server backup. A virtual environment uses shared hardware, such as network and storage ports, and achieves cost savings by virtue of the fact that most server hardware is underutilised.

During the backup window in traditional environments, the aim is to back up data as quickly as possible, and that means using all the network capacity available. The result is that traditional backup methods can cause bottlenecks and performance issues in virtual deployments.

Data replication has similar challenges. The recommended configuration for both vSphere and Hyper-V involves creating large volumes (or LUNs) within the storage array and storing multiple VMs on each of them.

This means servers on the same LUN are grouped together for DR purposes, because the storage array replicates and fails over only entire LUNs. Administrators therefore have to plan any array-based data layout carefully to strike a compromise between space utilisation and disaster recovery flexibility.
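The layout trade-off can be illustrated with a minimal Python sketch. It is not taken from any supplier's tooling: the VM and LUN classes, the place_vms function and the sizes used are all hypothetical, and it assumes a simple first-fit placement in which VMs that need replication are never mixed on a LUN with VMs that do not.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class VM:
    name: str
    size_gb: int
    replicate: bool  # True if this VM needs array-based replication


@dataclass
class LUN:
    capacity_gb: int
    replicate: bool  # the array replicates the whole LUN or none of it
    vms: list[VM] = field(default_factory=list)

    @property
    def used_gb(self) -> int:
        return sum(vm.size_gb for vm in self.vms)


def place_vms(vms: list[VM], lun_capacity_gb: int) -> list[LUN]:
    """First-fit placement that never mixes replicated and non-replicated VMs."""
    luns: list[LUN] = []
    # Place the largest VMs first so they are not squeezed out later.
    for vm in sorted(vms, key=lambda v: v.size_gb, reverse=True):
        if vm.size_gb > lun_capacity_gb:
            raise ValueError(f"{vm.name} does not fit on a single LUN")
        for lun in luns:
            same_tier = lun.replicate == vm.replicate
            if same_tier and lun.used_gb + vm.size_gb <= lun_capacity_gb:
                lun.vms.append(vm)
                break
        else:
            # No existing LUN of the right tier has room, so open a new one.
            luns.append(LUN(lun_capacity_gb, vm.replicate, [vm]))
    return luns


if __name__ == "__main__":
    inventory = [
        VM("db", 600, True),
        VM("web", 300, False),
        VM("app", 300, True),
        VM("test", 500, False),
    ]
    for lun in place_vms(inventory, lun_capacity_gb=1000):
        names = ", ".join(vm.name for vm in lun.vms)
        print(f"replicated={lun.replicate} used={lun.used_gb}GB vms=[{names}]")
```

Keeping the tiers apart means only the LUNs that hold protected VMs consume replication bandwidth and remote capacity, at the cost of possibly leaving more free space stranded on each LUN.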

Adding intelligence – hypervisor-based solutions

Hypervisor suppliers have recognised the difficulty of managing DR for virtual environments and added features to their products to address it.

First we'll talk about VMware features, then those of Hyper-V.

VMware, VADP and VDP

vStorage APIs for Data Protection (VADP) is an API framework that provides a set of features for managing virtual machine backups. It supersedes VCB (VMware Consolidated Backup), an early VMware backup feature, and is an integral part of the hypervisor itself.

VADP allows backup software suppliers to interface with a vSphere host and back up entire virtual machines, either as a full image or incrementally, using Changed Block Tracking (CBT).

CBT tracks the changes applied to a virtual machine's disks at block level, giving incremental backups a high level of granularity, much as traditional backups look for changed files.
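The idea behind changed block tracking can be shown with a toy Python model. This is a conceptual sketch only and does not use or represent the VADP interface: the TrackedDisk class, the apply_backup function and the four-byte block size are assumptions made for illustration.

```python
from __future__ import annotations

BLOCK_SIZE = 4  # bytes per block; deliberately tiny to keep the example readable


class TrackedDisk:
    """Toy virtual disk that records which blocks changed since the last backup."""

    def __init__(self, num_blocks: int) -> None:
        self.data = bytearray(num_blocks * BLOCK_SIZE)
        self.changed: set[int] = set()

    def write(self, offset: int, payload: bytes) -> None:
        """Write payload at a byte offset and mark every block it touches."""
        if not payload:
            return
        if offset < 0 or offset + len(payload) > len(self.data):
            raise ValueError("write outside disk bounds")
        self.data[offset:offset + len(payload)] = payload
        first = offset // BLOCK_SIZE
        last = (offset + len(payload) - 1) // BLOCK_SIZE
        self.changed.update(range(first, last + 1))

    def read_block(self, index: int) -> bytes:
        start = index * BLOCK_SIZE
        return bytes(self.data[start:start + BLOCK_SIZE])

    def incremental_backup(self) -> dict[int, bytes]:
        """Return only the changed blocks, then reset the tracking set."""
        delta = {i: self.read_block(i) for i in sorted(self.changed)}
        self.changed.clear()
        return delta


def apply_backup(image: bytearray, delta: dict[int, bytes]) -> None:
    """Apply an incremental backup on top of a previously taken full image."""
    for index, block in delta.items():
        start = index * BLOCK_SIZE
        image[start:start + BLOCK_SIZE] = block


if __name__ == "__main__":
    disk = TrackedDisk(num_blocks=8)
    full_image = bytearray(disk.data)       # initial full backup
    disk.write(6, b"abcd")                  # spans the boundary of two blocks
    delta = disk.incremental_backup()
    apply_backup(full_image, delta)
    print(sorted(delta), full_image == disk.data)
```

Only the blocks touched since the previous backup are copied, which is why an incremental backup built this way moves far less data than a full image.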

VADP can also integrate with VSS (Volume Shadow Copy Service) on Windows Server 2008 and later, ensuring the server is in a consistent state during the backup process, rather than the standard "crash copy" style of backup where no synchronisation takes place.

VDP, or vSphere Data Protection, is VMware's virtual appliance for backups. This uses EMC Avamar to store backups on disk, taking advantage of features such as data deduplication to improve space utilisation.
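As a rough illustration of why deduplication improves space utilisation, the following Python sketch stores each unique chunk of data only once. It is a generic, fixed-size-chunk example and makes no claim about how Avamar or VDP work internally; the DedupStore class and its chunk size are hypothetical.

```python
from __future__ import annotations

import hashlib


class DedupStore:
    """Toy backup store that keeps one copy of each unique fixed-size chunk."""

    def __init__(self, chunk_size: int = 4) -> None:
        self.chunk_size = chunk_size
        self.chunks: dict[str, bytes] = {}  # digest -> chunk contents

    def put(self, data: bytes) -> list[str]:
        """Store data and return the list of chunk digests needed to rebuild it."""
        recipe = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # keep the first copy only
            recipe.append(digest)
        return recipe

    def get(self, recipe: list[str]) -> bytes:
        """Rebuild the original data from its list of chunk digests."""
        return b"".join(self.chunks[d] for d in recipe)

    def stored_bytes(self) -> int:
        return sum(len(c) for c in self.chunks.values())


if __name__ == "__main__":
    store = DedupStore(chunk_size=4)
    first = store.put(b"AAAABBBBAAAA")   # 12 bytes in, two unique chunks
    second = store.put(b"BBBBCCCC")      # shares one chunk with the first backup
    print(store.stored_bytes(), store.get(first), store.get(second))
```

Backups of similar virtual machines tend to share many identical chunks, so the space saved grows with the number of backups kept.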

Many third-party suppliers also support VADP, including Symantec (both the NetBackup and Backup Exec product lines), Veeam, CommVault, Arkeia, HP (with Data Protector) and EMC (with Avamar and NetWorker).

VMware Fault Tolerance

Fault Tolerance is a vSphere feature that ensures virtual machine availability in the event of a hardware failure. Fault Tolerance works by maintaining a second "shadow" copy of a virtual machine, which is continually kept in sync and up to date with the primary.

In the event of a disaster, such as the loss of hardware or power to the primary systems, Fault Tolerance automatically fails over to the secondary copy with no downtime or outage.

Fault Tolerance is best suited to implementing local DR, where the outage does not affect all of the local systems and where a recovery point objective (RPO) of zero is required.

This could mean implementing two sets of hardware, separated by physical and power boundaries, for example. Extending the primary and secondary systems over any great distance could, however, introduce issues of latency in keeping the secondary copy up to date.

VMware Replication

vSphere's Replication feature tracks changed blocks and replicates them to a remote site for disaster recovery purposes. Data is moved at the virtual machine level (the VMDK) and so is independent of the underlying storage. This is a good solution for replicating data between different array types where the DR site is deployed on less expensive hardware.

Replication is implemented using a dedicated virtual appliance at the source site, plus replication agents on each VM in the replication process. This makes it more invasive than an array-only replication solution.

Replication can be used in conjunction with VMware's SRM (Site Recovery Manager) to provide a comprehensive DR management solution that covers the process of failover and failback in the event of a DR incident.

VMware Replication alternatives

There are a number of third-party alternatives to the native VMware Replication feature, all using the same underlying interface.

Zerto offers Virtual Replication, a product that provides the ability to fully manage the disaster recovery process of virtual machines, including replicating into public cloud as a DR target. Meanwhile, VirtualSharp, acquired by PHD Virtual Technologies, provides DR Assurance via its PHD Virtual Reliable DR solution. This enables testing and validation of disaster recovery scenarios on a regular basis to ensure configurations are correct at the time of an actual disaster.

Hyper-V Cluster Shared Volumes

In Windows Server 2008 R2, Microsoft introduced the concept of Cluster Shared Volumes. These are shared storage volumes that are accessible by all Hyper-V nodes in a shared cluster environment. In the event of a failure of a node, another node in the cluster can take over the virtual machines of the failing server.

Cluster Shared Volumes are good for local DR, where hardware can be separated into failure domains by physical and power boundaries, but are not well suited to DR over distance because of the latency any extended distance would introduce.

Hyper-V Replica

With the release of Windows Server 2012 and Hyper-V 3.0, Microsoft introduced a new feature called Replica. This allows asynchronous replication of a virtual machine over distance to a secondary site.

In the initial release, the replication interval is fixed, but with Windows Server 2012 R2, due this year, the interval will be user-configurable.

In addition, Microsoft will add extended site support to enable replication to a third site. This makes it possible to keep one replica close by and another at a greater distance, providing more DR flexibility.
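The effect of the replication interval on potential data loss can be sketched in a few lines of Python. This is a simplified model, not a description of how Replica is implemented: it assumes each replication cycle runs exactly on the interval and transfers its changes instantly, and the function names and the timings used are illustrative only.

```python
from __future__ import annotations


def last_sync_time(failure_time: float, interval_s: float) -> float:
    """Time of the most recent replication cycle at or before the failure."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    return (failure_time // interval_s) * interval_s


def lost_writes(
    write_times: list[float], interval_s: float, failure_time: float
) -> list[float]:
    """Writes made since the last cycle, which the replica never received."""
    synced_up_to = last_sync_time(failure_time, interval_s)
    return [t for t in write_times if synced_up_to <= t <= failure_time]


if __name__ == "__main__":
    writes = [10, 290, 310, 590, 610]   # seconds since replication was enabled
    failure = 599                       # primary site fails at this time
    for interval in (300, 30):          # illustrative intervals, in seconds
        exposure = failure - last_sync_time(failure, interval)
        print(interval, exposure, lost_writes(writes, interval, failure))
```

With asynchronous replication the achievable RPO is bounded by the interval rather than being zero, so a shorter, configurable interval directly reduces the window of writes that can be lost.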
