4 Dec 2013

Troubleshooting VMware performance

A prerequisite to troubleshooting VMware storage performance is confirming that storage or its infrastructure -- rather than the CPU -- is actually the problem. While there are many sophisticated tools available to monitor the virtual environment, a simple and free way to make this determination is to monitor host CPU and virtual machine (VM) CPU utilization over time. Essentially, you want to know how heavily the CPU is utilized when the performance problem is most noticeable. If utilization is above 65%, the problem can more than likely be solved by upgrading the host, allocating more CPU resources to that particular VM or moving the VM to another host.
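If you want to script that check, a short pyVmomi sketch along the following lines can sample host CPU utilization and flag the 65% mark. The vCenter address, credentials and output format are illustrative assumptions, not part of the original procedure:

```python
# A minimal sketch, assuming pyVmomi is installed and a vCenter is reachable.
# The address, credentials and threshold handling are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()          # lab use only; verify certificates in production
si = SmartConnect(host="vcenter.example.com",   # assumption: your vCenter address
                  user="administrator@vsphere.local",
                  pwd="secret",
                  sslContext=ctx)
try:
    content = si.RetrieveContent()
    hosts = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)
    for host in hosts.view:
        hw = host.summary.hardware
        total_mhz = hw.cpuMhz * hw.numCpuCores                # host CPU capacity in MHz
        used_mhz = host.summary.quickStats.overallCpuUsage    # current usage in MHz
        pct = 100.0 * used_mhz / total_mhz
        note = "  <-- above 65%, suspect CPU before storage" if pct > 65 else ""
        print(f"{host.name}: {pct:.1f}% CPU{note}")
finally:
    Disconnect(si)
```

Run it repeatedly (or on a timer) during the period when users report the problem, since a single sample can easily miss the peak.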

A simple way to rule out a CPU-related performance issue is to migrate the VM to a more powerful host with more memory, if possible. Assuming the alternate host is on the same shared storage infrastructure, a repeat of the performance loss on the second host makes storage performance a top candidate for the source of the issue.

One of the prime benefits that virtualization offers is its role in isolating performance problems. In the past, moving an application to another host meant acquiring server hardware, installing the operating system and application, and then migrating users. With virtualization, a simple vMotion can provide a lot of information in the troubleshooting process.
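For reference, the vMotion itself can also be scripted. This is a minimal sketch using pyVmomi's MigrateVM_Task, assuming a connection like the one in the earlier example; the VM and host names are placeholders:

```python
# A minimal sketch, assuming "si" is a ServiceInstance obtained as in the
# earlier example and that both hosts share storage and networking.
# "problem-vm" and "esxi-big.example.com" are placeholder names.
from pyVmomi import vim

def find_by_name(content, vimtype, name):
    """Return the first managed object of the given type whose name matches."""
    view = content.viewManager.CreateContainerView(content.rootFolder, [vimtype], True)
    for obj in view.view:
        if obj.name == name:
            return obj
    return None

content = si.RetrieveContent()
vm = find_by_name(content, vim.VirtualMachine, "problem-vm")
target = find_by_name(content, vim.HostSystem, "esxi-big.example.com")

# Issue the live migration and let vCenter schedule it at default priority.
task = vm.MigrateVM_Task(host=target,
                         priority=vim.VirtualMachine.MovePriority.defaultPriority)
# Wait for the task to complete, then repeat the performance test on the new host.
```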

Targeting the storage network

Once a performance problem has been better isolated to the storage infrastructure, the next step is to determine where it's occurring in that infrastructure. Conventional wisdom (and storage vendors) says to "throw hardware" at the problem and buy more disk drives, solid-state drives (SSDs) or a more powerful storage controller. While a faster storage device may be in order, IT planners should first look at the storage network between the VMware hosts and the storage system. If a network problem exists, it doesn't matter how fast the storage devices in the system are.

A simple way to identify a network performance issue is to look at disk performance. Assuming CPU utilization is low, a storage device performance issue should show a relatively steady state of IOPS, which means disk I/O has hit a wall. Sporadic spikes in disk I/O performance, on the other hand, mean the device and storage system have performance to spare but data isn't getting to them fast enough. In other words, there is a problem in the network.
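One way to make that distinction concrete is to look at how variable the IOPS samples are during the problem window. The sketch below is plain Python with made-up numbers; it uses the coefficient of variation as a rough heuristic, where a flat trace suggests the device has hit a wall and a spiky trace suggests data isn't reaching the device fast enough:

```python
# Illustrative heuristic only: classify an IOPS trace as "steady" (device at
# its limit) or "spiky" (device has headroom, look at the network). The sample
# data and the 0.15 threshold are made-up assumptions.
import statistics

def classify_iops(samples, cv_threshold=0.15):
    mean = statistics.mean(samples)
    cv = statistics.pstdev(samples) / mean        # coefficient of variation
    if cv < cv_threshold:
        return f"steady (CV={cv:.2f}): disk I/O has likely hit a wall"
    return f"spiky (CV={cv:.2f}): device has headroom; check the storage network"

steady_trace = [9800, 9900, 9850, 9875, 9820, 9910]   # hypothetical IOPS samples
spiky_trace  = [2500, 9400, 1800, 8700, 3100, 9200]

print(classify_iops(steady_trace))
print(classify_iops(spiky_trace))
```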

IT professionals tend to focus on overall bandwidth as the biggest area of contention in the storage network -- for example, when moving from a 1 Gigabit Ethernet (GbE) environment to 10 GbE, or from 4 Gb Fibre Channel (FC) to 8 Gb FC. While an increase in bandwidth can improve performance, it's not always the main culprit. Other problem areas, like the quality and capabilities of the network card or the configuration of the network switch, should also be considered at the outset. Resolving issues at these levels is often far less expensive.

Network interface cards (NICs), whether they're FC- or Internet Protocol-based, are typically shared across multiple VMs within a host. Even multi-port cards are typically aggregated and shared. If a particular VM has a performance problem, dedicating that VM to its own port on a card -- or even its own card -- may be all that's needed to resolve the performance problem. If the decision is made to upgrade the NIC to a faster speed, look for cards where specific VM traffic can be isolated or provided a certain Quality of Service.
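If the dedicated port is exposed to the host as its own port group, reattaching the VM's network adapter to it can be scripted. The sketch below is assumption-laden: it presumes the administrator has already created a port group backed by the dedicated uplink, and it reuses the connection and helper from the earlier examples; the VM and port group names are placeholders.

```python
# A hedged sketch: point an existing vNIC at a port group that is already
# backed by a dedicated physical uplink. Reuses "si" and find_by_name from
# the earlier examples; "problem-vm" and "Dedicated-Uplink-PG" are placeholders.
from pyVmomi import vim

content = si.RetrieveContent()
vm = find_by_name(content, vim.VirtualMachine, "problem-vm")
net = find_by_name(content, vim.Network, "Dedicated-Uplink-PG")

changes = []
for device in vm.config.hardware.device:
    if isinstance(device, vim.vm.device.VirtualEthernetCard):
        # Rewire the existing adapter to the dedicated port group.
        device.backing = vim.vm.device.VirtualEthernetCard.NetworkBackingInfo(
            network=net, deviceName=net.name)
        changes.append(vim.vm.device.VirtualDeviceSpec(
            operation=vim.vm.device.VirtualDeviceSpec.Operation.edit,
            device=device))

if changes:
    vm.ReconfigVM_Task(spec=vim.vm.ConfigSpec(deviceChange=changes))
```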

You can also upgrade the NIC without upgrading the rest of the network. While it may seem counterintuitive, placing a 16 Gb FC card into an 8 Gb FC network does two things: It lays the foundation for a faster storage infrastructure, and it improves performance even over the old cabling. This is because the processing capability of the interface card becomes more robust with each generation. Moving data into and out of a NIC requires processing power, so the faster that processing occurs, the better the card performs.

Switches can get overwhelmed

The second area of the storage network to explore is the switch. Just like a card, a switch can be overwhelmed by the amount of traffic it has to handle; many switches on the market weren't designed for a 100% I/O load. Switch designers may have counted on some connections not needing full bandwidth at all times, so while a switch may have 48 ports, it can't sustain full bandwidth to all of them at the same time. In fairness, in the pre-virtualization days this was a safe assumption. In a modern virtualized infrastructure, however, physical hosts are rarely idle, so that assumption no longer holds.
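A quick back-of-the-envelope calculation shows how this plays out. The figures below are hypothetical, not the specs of any particular switch:

```python
# Hypothetical worst-case math for a 48-port 10 GbE switch. All figures are
# made up for illustration; check your switch's actual fabric capacity.
ports = 48
port_speed_gbps = 10
fabric_capacity_gbps = 320              # assumed internal switching capacity

demand_gbps = ports * port_speed_gbps   # every port at full line rate
oversubscription = demand_gbps / fabric_capacity_gbps

print(f"Worst-case demand: {demand_gbps} Gbps")
print(f"Fabric capacity:   {fabric_capacity_gbps} Gbps")
print(f"Oversubscription:  {oversubscription:.1f}:1")   # 1.5:1 in this example
```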

Another common problem is the configuration of inter-switch links. As switch infrastructures are upgraded, it's not uncommon to find inter-switch connections still hard-set to the prior network speed. This configuration error effectively caps the new switches at the old performance level.

Looking for trouble in the storage controller

If disk performance measurements show a relatively steady state and CPU utilization is low, then it's more than likely that there's a problem with the storage system. Again, most tuning efforts tend to focus on the storage device, but the storage controller should be ruled out first. The modern storage controller is responsible for getting data into and out of the system, providing features like snapshots and managing RAID. In addition, some systems now perform even more sophisticated activities, such as data tiering between SSDs and hard disk drives (HDDs).

There are two parts of the storage controller that must be ruled out: the network interconnect between the controller and the drives, and the controller's processing resource. Most storage systems provide a GUI that displays the relevant statistics; it's important to monitor them during the problem period to determine if either one is the source of the problem. In the past these two resources were seldom a concern, but in a virtualized data center it's not uncommon for either to become a bottleneck. Also, if and when SSDs are installed in the storage system, it's important to recheck those resources to ensure they're not keeping the SSDs from reaching their full potential.

Analyzing the storage device

After all this triage is done, the storage device can finally be analyzed. It's important to note that most storage tuning efforts start here, when in actuality this is where they should end. Having a fast storage device without an optimized infrastructure is a waste of resources. That said, the modifications described above (host CPU, storage network and storage controller) will often bring performance to an acceptable, if not optimal, level. The easiest way to confirm a disk I/O performance problem is when your measurement tool shows a consistent result -- for example, IOPS that report in the same narrow range while CPU and network utilization remain low.
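As a rough way to combine those three measurements, a small helper like the following can codify the decision; the thresholds are illustrative assumptions, not recommendations from this article:

```python
# Rough heuristic combining the checks described above. Every threshold is an
# illustrative assumption; calibrate against your own baseline measurements.
def device_bound(cpu_util_pct, net_util_pct, iops_cv):
    """Return True when the evidence points at the storage device itself."""
    cpu_ok = cpu_util_pct < 65      # host CPU is not the bottleneck
    net_ok = net_util_pct < 70      # storage network has headroom
    steady = iops_cv < 0.15         # IOPS trace is flat (see the earlier sketch)
    return cpu_ok and net_ok and steady

print(device_bound(cpu_util_pct=40, net_util_pct=35, iops_cv=0.05))   # True: tune the device
print(device_bound(cpu_util_pct=40, net_util_pct=35, iops_cv=0.60))   # False: look upstream
```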

The fix for device-based performance problems is typically to add drives or to migrate to SSD. In the modern era, a move to SSD is almost always more beneficial, providing a bigger performance improvement for less expense. Before shifting to more or faster drives, however, IT professionals should also look at how the VM disk files are distributed. Too many on a single volume can be problematic, and moving some to different HDD volumes can help. Ultimately, a move to SSD should solve this problem as well.
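Redistributing a VM's disk files can be scripted as well. The sketch below uses pyVmomi's RelocateVM_Task to Storage vMotion a VM to a different datastore, again reusing the connection and helper from the earlier examples; the names are placeholders:

```python
# A minimal Storage vMotion sketch, reusing "si" and find_by_name from the
# earlier examples. The VM and datastore names are placeholders.
from pyVmomi import vim

content = si.RetrieveContent()
vm = find_by_name(content, vim.VirtualMachine, "problem-vm")
target_ds = find_by_name(content, vim.Datastore, "less-busy-datastore")

spec = vim.vm.RelocateSpec(datastore=target_ds)   # move the VM's files only
task = vm.RelocateVM_Task(spec=spec)
# Wait for the task to finish, then re-measure IOPS on the new volume.
```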

Tuning the VMware environment is a step-by-step process. Before you upgrade to higher-performance storage devices, go through the process above to ensure your VMware environment will see the maximum benefit from your investment.
