The Hyper-V Cloud - no clusters?

Microsoft has recently published a set of guides to build your own private cloud solution using Hyper-V, System Center Virtual Machine Manager and its Self-Service Portal 2.0.

They cover planning, deployment and operations. You can find them here.

Note however that the guides assume a deployment on stand-alone servers. There is no discussion of clustering for high availability.

I'd like to add a few observations derived from experience of running a private cloud implementation that includes clusters too.

Storage

It seems obvious, but local storage is not relevant for clusters. SCVMM will explicitly place virtual machines ONLY on cluster volumes (dedicated or shared). You can still manually create virtual machines on directly-attached disks, but it just complicates things, as they are not easily distinguishable from highly available ones in VMM. It makes sense to purchase systems with a hardware-mirrored boot volume only, plus whatever dedicated storage adapter is appropriate to your workload.

Clusters tend to drive high consolidation ratios on the storage, simply because it is shared amongst several nodes. The number of sustained IOPS often becomes more relevant than the maximum theoretical bandwidth, because VM operations frequently generate a lot of relatively small I/Os with a random access pattern. Fibre Channel may fit such a profile better than iSCSI. Ideally, you'll profile the load before consolidation, but in a private cloud / infrastructure-on-demand environment you may not have that luxury.

You'll be well advised to consult the storage vendor's planning guide beforehand.

Storage vendors often publish the IOPS and bandwidth ratings of their arrays. For instance, for the HP P2000 G3, you will find that 10 Gb/s iSCSI (with dedicated adapters) and 8 Gb/s FC are comparable in bandwidth utilization for large-block sequential reads and writes. However, FC still sustains 20% more IOPS than iSCSI with a random 60/40 mix of read-write operations on 8 KB blocks.

Interestingly, 6 Gb/s SAS is equal to or slightly better than FC in HP's measurements, which used 4 directly-attached servers (no fabric) for testing. Results may vary when a fabric is involved.

Servers

Blade servers have grown in popularity and are often recommended for private cloud solutions, thanks to their good price / performance ratio, density and flexibility. However, in a highly available implementation due consideration must be given to:

- Connectivity: Microsoft recommends at least 4 network ports + 1 storage port (2 is better) for each node in a Hyper-V cluster. Blade connectivity may limit your options.

- I/O performance: I/O bandwidth and operations per second depend on the midplane capabilities. Chassis capable of full redundancy with non-blocking backplanes and 10 Gb/s per lane are available - at a price (e.g. HP c7000). Cheaper solutions typically involve some element of oversubscription or lack redundancy.

- Reliability of the shared components. For instance, in a study published in the IBM Systems Journal, the lowest mean time to failure (MTTF) belonged to the chassis switch modules, followed by the blade base board. An equivalent external Cisco switch lasts about twice as long between failures, according to published specifications. Even so, according to the same study, a blade's availability can reach 99.99% and it is possible to build a 99.999% available infrastructure with blades by having at least 1 "hot-standby" in the chassis, in addition to using redundant components where possible.

High-end servers (e.g. IBM x3850) are also interesting virtualization platforms, because hardware is becoming more affordable and Windows Datacenter edition comes with unlimited virtualization rights. They drive high consolidation and operating efficiency, mainly thanks to a capacity of up to 1 TB of RAM and 64 cores. They enhance the availability of the basic platform by providing features not commonly found elsewhere, like:

- ChipKill or SDDC memory with hot add / replace.

- Hot-add / remove of I/O adapters.

- Hot-add CPUs.

Particular attention must be given to the memory configuration of the servers, especially with the recent NUMA chipsets from Intel and AMD:

- Populate processor local memory banks with equal amounts of RAM.

- Populate memory controllers with equal capacity.

- Populate all memory channels for each controller evenly to exploit the maximum memory bandwidth.

- Use dual-rank DIMMs.

Network

Whilst Hyper-V will work with any supported network adapter, it is important to note that certain features will be available only with the appropriate combination of chipsets.

VMQs (hardware-managed network queues for VMs) are highly recommended for best performance and consolidation ratios, but they require support for interrupt coalescing (in R2), the latest Intel Pro or Broadcom chipsets, appropriate drivers and some registry hacking, as explained here and here.

VM Chimney (TCP offload for VMs) has proven unreliable due to driver issues in my experience. I'd rather have VMQs.

Note that if you enable IPSec or any filter driver on a particular connection (e.g. Windows firewall), that connection may not be offloaded.

Microsoft does NOT support network teaming with Hyper-V. Teaming is supported by the OEMs. VMQs and teaming may be mutually exclusive, depending on vendor.

Operating System Editions

In order to build fail-over clusters, you will need the Enterprise or Datacenter edition. Note that Hyper-V Server R2 is also capable of clustering and is similar in many respects to Enterprise edition. Because of its smaller footprint, Hyper-V Server R2 needs patching less often than a full edition, hence it is ideal for minimizing planned downtime.

It is possible and legal to purchase Datacenter edition (for the unlimited licensing) but deploy Hyper-V Server R2 (for the reduced footprint), transferring the licenses.

Cluster Size

There are several considerations to determine the number of nodes and machines per node in a cluster:

- The officially supported maximum is 64 VMs per node in a Hyper-V R2 cluster (increasing to 384 in SP1).

- The officially supported max virtual / physical core ratio is 8:1.

- Large clusters are more likely to run into the WMI issues mentioned in my previous post.

- Having the CPU capacity is fine, but what about the I/O? How many IOPS per node can you sustain with your adapter / SAN combination?
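The limits above can be combined into a rough per-node estimate. The following is only a minimal sketch: the function name, the parameters and the sample figures (cores, vCPUs per VM, IOPS per node and per VM) are hypothetical, and only the 64 VMs per node and 8:1 ratio come from the list above.

```python
def max_vms_per_node(physical_cores, vcpus_per_vm, node_iops, iops_per_vm,
                     supported_vm_limit=64, max_vcpu_ratio=8):
    """Rough upper bound on VMs per node: the tightest of three limits.

    All inputs are planning assumptions, not measured values.
    """
    # CPU limit: virtual cores allowed by the supported virtual/physical ratio
    cpu_bound = (physical_cores * max_vcpu_ratio) // vcpus_per_vm
    # I/O limit: sustained IOPS the adapter / SAN combination gives one node
    io_bound = node_iops // iops_per_vm
    return min(supported_vm_limit, cpu_bound, io_bound)


# Hypothetical node: 16 cores, 2 vCPUs per VM, 5000 IOPS, 100 IOPS per VM
print(max_vms_per_node(16, 2, 5000, 100))  # I/O is the bottleneck here
```

In this made-up case the CPU would allow 64 VMs, but the I/O budget stops at 50, which is the point of the last bullet.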

Cluster Shared Volumes

Assuming that you want live migration to minimize planned downtime and optimize allocation of resources, you will need CSVs. A common question is how many machines to deploy on each CSV and how many CSVs to have per cluster. The answer to that depends on several factors. In my experience, the most troublesome one is the data you need to back up and how long you can tolerate reduced performance during the backup.

Each time you perform a CSV backup, the server hosting the VM to back up requires ownership of the whole CSV volume to snapshot. I/O to the volume is redirected over the network for all VMs hosted on that volume, with a consequent performance impact. The time you can tolerate that, multiplied by the backup throughput, will give you the max amount of data to put on that CSV. Divide that by the average size of a VHD and you'll have a rough estimate of how many VMs will fit.
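As a back-of-the-envelope illustration of that arithmetic, here is a small sketch. The function names and the sample figures (30 minutes, 100 MB/s, 40 GB per VHD) are assumptions chosen for the example, not measurements.

```python
def max_csv_data_gb(tolerated_minutes, backup_throughput_mb_s):
    """Max data (GB) on a CSV = tolerated redirected-I/O time x backup throughput."""
    seconds = tolerated_minutes * 60
    return seconds * backup_throughput_mb_s / 1024  # MB -> GB


def vms_per_csv(tolerated_minutes, backup_throughput_mb_s, avg_vhd_gb):
    """Rough number of VMs that fit, given an average VHD size in GB."""
    data_gb = max_csv_data_gb(tolerated_minutes, backup_throughput_mb_s)
    return int(data_gb // avg_vhd_gb)


# Hypothetical figures: 30 minutes tolerated, 100 MB/s backup, 40 GB VHDs
print(max_csv_data_gb(30, 100))   # about 176 GB
print(vms_per_csv(30, 100, 40))   # 4 VMs
```

The estimate is deliberately conservative: it assumes the whole CSV content is backed up within a single window of redirected I/O.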

Rules of Thumb

I claim no scientific basis for the following rules, other than my empirical observations. Here they go, in no particular order:

1. Keep the number of nodes in a cluster small (2-4) to avoid annoying SCVMM bugs.

2. Assume a random 60/40 I/O pattern if you don't know in advance what your VM workload will be. It is quite common, in my experience.

3. Plan for at least 1 management network, 1 for heartbeat, 1 for live migration, 1 dedicated to VMs and 1 dedicated storage adapter. For higher consolidation ratios and availability, plan on 2 storage adapters with MPIO.

4. Use separate VLANs for management and VMs to isolate traffic and for ease of administration. Consider 10GbE for the shared VM networks in order to minimize the number of adapters and cables.

5. The quality of VSS (snapshot) and VDS (disk) providers varies greatly with the OEM. Be sure to test them. A snapshot should NOT take longer than 10 seconds or it will fail.

6. If you don't know how long you can tolerate redirected I/O, 30 minutes is a useful maximum. I have seen CSVs crash on several occasions (again, depending on OEM) after that.

7. Fail-over clusters do NOT take into account connectivity to the VMs. In other words, if you are sharing the VM network with the host OS and for some reason that connection fails, the cluster may fail over its own IP addresses on that network to another node, but not the VMs attached to it.

8. A few large servers are easier to manage than many small blades, if you implement appropriate procedures to minimize downtime and have a support contract to fix things quickly when they go wrong :-) They may also be more cost-effective, if you drive consolidation and take advantage of power optimization technology.

9. Use group policies to control patching with WSUS or similar. Do NOT use the default "download and install at 3am" option on all cluster nodes, or they will all reboot at the same time.

10. If you don't know your storage vendor's IOPS ratings, use these ballpark figures on Wikipedia.

11. On your hosts, make sure that the antivirus excludes .vhd, .vsv, .avhd files and vmms.exe, vmwp.exe. If you are running Hyper-V Server only, do you need an antivirus on the host? This is not a rhetorical question by the way; I am interested in opinions.

12. If you don't know the size of your CSVs in advance, 2 TB works on both MBR and GPT disks. Most backup and restore utilities, snapshot providers etc. can handle 2 TB. It is also a tolerable size should you ever need to run chkdsk or defrag on the volume (a few large VHD files should not cause much trouble or take much time to fix in that respect).

13. Neither fail-over nor PRO has any idea of virtual applications, i.e. applications that require a set of interconnected virtual machines. They may move the machines to different nodes. A way around it is to script the appropriate migration sequence with PowerShell.
