24 Jul 2013

VMware backup and the VMware snapshot

Server virtualisation introduces a radically different approach to backing up data in the enterprise. Instead of installing backup agents in the guest operating system and backing up VMs across the network, you can use VM-aware backup technologies.

If your virtualisation environment is growing rapidly and you don't move over to such a technology quickly, you'll soon begin to notice the hit on CPU, disk and network.

But it's critical that you understand the role of the VMware snapshot and how VM backup and restore work before parting with your budget. In particular, you will want to ensure that VM-based backup offers the same level of granularity as your legacy backup system.

Early VMware backup products were very good at backing up but somewhat lacking in flexibility when it came to the restore process. Sadly, our industry has got the whole issue the wrong way round. Backup software should be called restore software, because that's what customers buy it for.

Basic operation

Many vendors' VM backup products use the same methods and APIs for backing up VMs. The process begins with the backup software taking a VMware snapshot of the VM. This performs two main tasks.

First, it triggers the quiescing of the VM and flushes the disk contents out of the file system cache. This leverages Microsoft Volume Shadow Copy Service (VSS) at both the OS and application level, which ensures that files that may be locked and in use inside Windows are released in such a way that a full backup is more likely to happen. The snapshot is exactly the same kind that the VMware admin can take manually from the vSphere Client.

Second, the VMware snapshot unlocks the files that make up a VM from the file system. When a snapshot is engaged, each virtual disk receives a snapshot delta file (you will find it is called something like "vmname-000001.vmdk"). From this point onwards, all disk changes accrue in the delta files, which grow in increments of 16 MB. This leaves the files that make up the VM, such as the VM's configuration file (VMX) and, critically, the virtual disk files (VMDK), free to be archived.

Without the VMware snapshot engaged, the files would be locked by the ESX server that "owns" that VM when it is powered on. The situation is similar to when you try to copy or move a file that's already open in an application.
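The sequence described above (snapshot, copy the unlocked files, remove the snapshot) can be sketched in Python. This is only an illustration under stated assumptions: FakeHypervisor and its methods are hypothetical stand-ins invented for this example, not calls from the vSphere API or from any vendor's product, and the snapshot name "backup-temp" is likewise made up.

```python
class FakeHypervisor:
    """Hypothetical in-memory stand-in for vCenter/ESX; not a real API."""

    def __init__(self):
        self.snapshots = {}  # VM name -> snapshot name
        self.log = []

    def create_snapshot(self, vm, name, quiesce=True):
        self.snapshots[vm] = name
        self.log.append(f"snapshot {vm} quiesce={quiesce}")

    def remove_snapshot(self, vm):
        del self.snapshots[vm]
        self.log.append(f"remove {vm}")


def backup_vm(hv, vm, copy_files):
    """Snapshot the VM, copy its base files, then always try to remove the snapshot."""
    hv.create_snapshot(vm, name="backup-temp", quiesce=True)
    try:
        copy_files(vm)  # base VMX/VMDK files are now free to be read
        hv.log.append(f"copied {vm}")
    finally:
        # Runs even if the copy fails, so the delta files do not keep growing.
        hv.remove_snapshot(vm)


hv = FakeHypervisor()
backup_vm(hv, "web01", copy_files=lambda vm: None)
print(hv.log)
```

The try/finally is the point of the sketch: the snapshot removal is attempted whether or not the copy succeeds.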

Delta file size and removal of snapshots

As you can probably guess, there are a number of challenges associated with the use of VMware snapshots. The longer the snapshot is engaged, the larger the delta files that make up a VMware snapshot can be -- relative to the rate of "churn" on your data. This can have implications for available free space on the volume where the snapshots reside and also a potential performance hit, dependent again on the rate of data churn and type of storage deployed. For example, RAID-enabled SSD or SAS drives will outperform SATA volumes in most cases.
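To get a feel for the numbers, here is a rough back-of-the-envelope sketch in Python. It assumes a constant churn rate and treats every changed megabyte as new delta data, so it is a pessimistic estimate rather than a model of how ESX actually allocates space. The churn and free-space figures are hypothetical.

```python
import math

INCREMENT_MB = 16  # growth step of a delta file, as described earlier


def delta_size_mb(churn_mb_per_hour, hours_open):
    """Estimate delta size, rounded up to the next 16 MB increment."""
    changed = churn_mb_per_hour * hours_open
    return math.ceil(changed / INCREMENT_MB) * INCREMENT_MB


def hours_until_full(free_mb, churn_mb_per_hour):
    """Hours before the delta would consume the free space on the volume."""
    return free_mb / churn_mb_per_hour


# Hypothetical VM: 500 MB/hour of churn, snapshot open for 3 hours,
# 50 GB (51200 MB) free on the volume.
print(delta_size_mb(500, 3))         # 1504
print(hours_until_full(51200, 500))  # 102.4
```

A three-hour backup window is harmless here, but the same snapshot left behind for a few days would eat the volume.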

However, your main concern with VMware snapshots is how your backup vendor handles their removal. Once the backup job has completed, an instruction is normally sent to either vCenter or directly to the ESX servers to remove the VMware snapshot from the VMs. This assumes that communication with these nodes is available at that time. A failure to communicate between the backup system and the management layer of vSphere can result in "orphaned" VMware snapshots being left behind after the backup job has completed.

A good backup product will at least log and alert the VMware admin to this fact, and the better ones will cycle through a garbage collection process to remove them at the earliest opportunity or when the next backup runs. There is many a sorry tale to be heard from VMware admins who have found a VMware snapshot file has grown so large that it fills a volume or a LUN.
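The "log and alert" step might look something like the following Python sketch. It assumes you already have an inventory of snapshots from somewhere. The SnapshotRecord type, the "backup-temp" naming prefix and the 24-hour threshold are all hypothetical choices made for this example, not how any particular backup product works.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class SnapshotRecord:
    vm: str
    name: str
    created: datetime
    size_mb: int


def find_orphans(records, now, prefix="backup-temp", max_age=timedelta(hours=24)):
    """Return backup-created snapshots older than max_age, largest first."""
    stale = [r for r in records
             if r.name.startswith(prefix) and now - r.created > max_age]
    return sorted(stale, key=lambda r: r.size_mb, reverse=True)


now = datetime(2013, 7, 24, 9, 0)
inventory = [
    SnapshotRecord("web01", "backup-temp-0722", datetime(2013, 7, 22, 1, 0), 40960),
    SnapshotRecord("db01", "backup-temp-0724", datetime(2013, 7, 24, 1, 0), 2048),
    SnapshotRecord("app01", "pre-upgrade", datetime(2013, 7, 1, 12, 0), 8192),
]
for r in find_orphans(inventory, now):
    print(f"ALERT: {r.vm} has orphaned snapshot {r.name} ({r.size_mb} MB)")
```

Filtering on a name prefix keeps the check away from snapshots an admin took on purpose, such as the "pre-upgrade" one above. Sorting by size puts the snapshot most likely to fill a volume at the top of the list.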

Restoration

Another important feature is how the backup product goes about restoring files. In the early days of VM backup, many vendors merely mounted the virtual disks that made up the backup to their management system and left it to the VMware admin to copy files around using Windows' hidden "dollar" shares, such as C$ and admin$. That's hardly approaching the sophisticated use of backup agents that intelligently restore files to the same or different locations.

Fortunately, things have improved in recent years. There are two methods that most vendors support. In the first method, the backed-up VMDK files are taken from a shared location accessible to the ESX hosts and "hot-added" to the VM to which they need to be recovered.

The result is that the VM "magically" has a new drive added to it whilst it is powered on. This appears as a new X or Y drive, for example, and allows the application owner of the VM to restore files using Windows Explorer.

In the second method, used when the entire VM has been lost, many vendors allow a temporary VM to be started using the files that have been backed up. Once it is booted and in use on the network, the VMware admin can use Storage vMotion to relocate the restored VM to its rightful location.

These two restore methods are far more sophisticated than copying files using Microsoft's CIFS protocol. You should ensure that your chosen backup vendor supports at least one of them.

Scalable recovery

Of course, backup shouldn't be your only strategy when it comes to data protection. While it's true that the overwhelming majority of recoveries involve relatively small amounts of end-user data, there are situations that require a more robust and scalable recovery strategy than backups allow on their own.

What if you have very large VMs that hold terabytes of data? What if the storage array suffered a major outage? What if a storage admin took a LUN used by VMs offline and deleted it? These possibilities for data loss all share the same attribute: terrifying amounts of data lost in seconds. Attempting to recover this amount of data, even with disk-to-disk backups, could take hours or days depending on the volume of data and your maximum restore throughput.
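The arithmetic is worth doing before you need it. The Python sketch below assumes a sustained throughput with no protocol overhead, deduplication rehydration or ramp-up, so treat the result as a best case. The 20 TB and 200 MB/s figures are hypothetical.

```python
def restore_hours(data_gb, throughput_mb_per_s):
    """Hours needed to restore data_gb at a sustained throughput."""
    seconds = (data_gb * 1024) / throughput_mb_per_s
    return seconds / 3600


# Hypothetical figures: a 20 TB LUN restored at a sustained 200 MB/s.
print(round(restore_hours(20 * 1024, 200), 1))  # 29.1
```

Under those assumptions, the restore takes more than a day of continuous copying before a single VM on that LUN is back in service.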

For this reason, you really need to consider a cycle of snapshots driven by your storage vendor's array technology. The storage vendor snapshot offers your environment an enormous Undo button for your volumes and LUNs. This can be incredibly helpful if a couple of VMs get accidentally deleted or destroyed. The history of previous snapshots is available, and these can be presented to the ESX hosts and mounted by them directly from the storage layer.

It's like having a Recycle Bin for VMs, for gigabytes or terabytes of data. Storage vendors such as Dell, EMC and NetApp now have tools that integrate directly with the vCenter system to facilitate this recovery process without the need to understand the storage management tools or speak to the storage admins.

Storage vendor snapshots also form the basis of most storage vendors' replication technologies. Replication should also be one of the linchpins of your data recovery strategy. The single point of failure of virtualisation is often the storage array. All VMs are stored on it and it is central to virtualisation's advanced features such as vMotion, High Availability (HA), Distributed Resource Scheduler (DRS), Distributed Power Management (DPM), Fault Tolerance (FT) and maintenance mode.

Without some kind of centralised storage array, most virtualisation projects are hobbled from the get-go. At the same time, however, with a storage array, VMs become eggs stored in the proverbial basket.

There are two ways to approach this risk. If maximum availability is required, most storage vendors have their own "continuous availability" models, in which two arrays are kept in the same state using synchronous replication. If one of the storage arrays goes down, the standby array takes the place of the primary.

Sadly, this can be an expensive option. The second, often more viable, approach is to extend this replication from two arrays within a site to a storage array at a different site. This gives you the option to fail over if you lose the array and at the same time gives the business protection from the ultimate form of data loss: loss of an entire site.
