Hyper-V How To: Plan HA VMs

Here's another guest post from Jeremy Hagan.  Thanks again, Jeremy!

If you are planning to use High Availability for Hyper-V, there are a number of things to keep in mind and several best practices that will affect your deployment.  I will cover the best practices first, then discuss how they affect your pre-deployment decision making.

Best Practice 1 – One VM per LUN

High Availability in Hyper-V is provided through WS08 Failover Clustering.  When you fail over a cluster resource that depends on a disk, that disk can be active on only one cluster node at a time.  So if your cluster disk holds more than one VM, they must all fail over as a group.  There is nothing inherently wrong with this, so long as you know what you are getting into when you decide to go down this path.  The benefit of one VM per LUN is that you get the finest granularity for moving workloads between cluster nodes, whether for maintenance or for smoothing out performance bottlenecks.

Imagine you had 5 VMs running on one LUN in a 5-node cluster.  One of your cluster nodes is over-utilized and one is under-utilized.  If you moved the 5-VM group to the under-utilized node, you might simply move the bottleneck along with it.  If the 5 VMs were on separate LUNs, you could fail them over individually and spread the load more evenly across the nodes.
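
To put some numbers on that scenario, here is a rough Python sketch.  The node names, the load percentages and the greedy "move to the idlest node" rule are all made up for illustration; this is not how the cluster itself places workloads, just a way to compare the two options.

```python
def spread(loads):
    """Gap between the busiest and the idlest node (lower is better)."""
    return max(loads.values()) - min(loads.values())

# Hypothetical per-node load in percent, NOT counting the five VMs below.
base = {"node1": 30, "node2": 50, "node3": 50, "node4": 50, "node5": 10}
# Five hypothetical VMs, each adding 12% load to whichever node hosts it.
vm_loads = [12, 12, 12, 12, 12]

def place_as_group(base, vm_loads, target):
    """All VMs share one LUN, so they all land on the same node."""
    loads = dict(base)
    loads[target] += sum(vm_loads)
    return loads

def place_individually(base, vm_loads):
    """One VM per LUN: put each VM on whichever node is idlest right now."""
    loads = dict(base)
    for vm in sorted(vm_loads, reverse=True):
        idlest = min(loads, key=loads.get)
        loads[idlest] += vm
    return loads

if __name__ == "__main__":
    print(spread(place_as_group(base, vm_loads, "node1")))  # group stays put
    print(spread(place_as_group(base, vm_loads, "node5")))  # group moved
    print(spread(place_individually(base, vm_loads)))       # VMs moved one by one
```

With these made-up figures the gap between the busiest and idlest node is 80 points if the group stays on node1, 40 if the whole group moves to node5 (which then becomes the busiest node), and 8 if the VMs can be placed one at a time.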

Best Practice 2 – Fixed VHD files

VHD files can be Dynamic (i.e. the VHD file grows with the data written to it) or Fixed (i.e. the VHD file is allocated at its full size up front, data plus free space).  Having Fixed VHD files confers a performance benefit since there is no need to keep allocating new space to the VHD file as it grows.  Fixed VHD files also prevent performance issues caused by file fragmentation (especially when combined with one VM per LUN).

With these two best practices in mind, I will now explain how they may affect your planning decisions around LUN sizing for an HA Hyper-V cluster.  My first attempt at planning my LUN sizes consisted of monitoring the physical servers I was going to virtualize for a period (in my case, three weeks), extrapolating their observed data growth out to 3 years and then adding a 15% premium.  After that I rounded up to the nearest 5 GB and ordered my LUNs from the SAN team.

My first mistake was that I forgot to factor the amount of memory I planned to allocate to each VM into the calculations (a real beginner's mistake).  Hyper-V keeps a file roughly the size of the VM's memory alongside the VM so that it can save the VM's state, and that file needs room on the LUN too.  The next thing I didn’t think about until it came time to implement was that even though you might have a Fixed VHD file with 5 GB of free space inside it, as soon as you take a snapshot that 5 GB of free space becomes inaccessible, since all changes are now written to a differencing disk file outside the VHD until the snapshot is deleted.
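
For what it's worth, the sizing arithmetic can be sketched in a few lines of Python.  The function name, the parameter names and the example figures are mine, and the snapshot reserve is a number you have to pick for yourself.  I have also assumed that the 15% premium applies to the projected data only and that the memory and snapshot allowances are added on top; adjust to taste.

```python
import math

def lun_size_gb(current_data_gb, growth_gb_per_week, vm_memory_gb,
                snapshot_reserve_gb=0.0, years=3, premium=0.15, round_to_gb=5):
    """Rough LUN size for one VM, rounded up to the next multiple of round_to_gb."""
    # Extrapolate the observed weekly growth over the planning horizon.
    projected_data = current_data_gb + growth_gb_per_week * 52 * years
    # Add the safety premium to the projected data (assumption: data only).
    vhd_size = projected_data * (1 + premium)
    # Room for the VM's memory plus whatever you allow for snapshots.
    total = vhd_size + vm_memory_gb + snapshot_reserve_gb
    return math.ceil(total / round_to_gb) * round_to_gb

if __name__ == "__main__":
    # Hypothetical server: 40 GB of data today, growing 0.1 GB per week.
    print(lun_size_gb(40, 0.1, vm_memory_gb=0))                          # data only
    print(lun_size_gb(40, 0.1, vm_memory_gb=4, snapshot_reserve_gb=10))  # with extras
```

For this hypothetical server the data-only figure comes out at 65 GB, while allowing 4 GB for memory and 10 GB for snapshots pushes it to 80 GB, which is the kind of gap that bit me.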

So if you decide to go for both of the above best practices, you really need to think hard not only about your LUN sizes but also about your management practices regarding snapshots.  In my environment I have decided that no Production VM will be allowed to run on a snapshot beyond a certain period.  So if you take a snapshot of a VM prior to making a change, then once the change is accepted and you have decided not to revert, the snapshot should be deleted.  The changes are then merged back into the original VHD the next time the VM is shut down (a restart from inside the guest is not enough).
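
A policy like that is easy to check once you have a list of VMs and snapshot dates.  The Python sketch below assumes you have already exported that list somehow; the 7-day limit, the VM names and the dates are placeholders, not values from my environment.

```python
from datetime import datetime, timedelta

def overdue_snapshots(snapshots, now, max_age=timedelta(days=7)):
    """Return the (vm_name, taken_at) pairs whose snapshot is older than max_age."""
    return [(vm, taken) for vm, taken in snapshots if now - taken > max_age]

if __name__ == "__main__":
    # Hypothetical inventory: (VM name, when its snapshot was taken).
    inventory = [
        ("web01", datetime(2009, 1, 29)),
        ("sql01", datetime(2009, 1, 10)),
    ]
    for vm, taken in overdue_snapshots(inventory, now=datetime(2009, 1, 31)):
        print(f"{vm}: snapshot from {taken:%Y-%m-%d} is past the limit")
```

Passing `now` in explicitly rather than calling `datetime.now()` inside the function keeps the check repeatable, which helps if you want to run it against an old inventory.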

If you decide to ignore best practice number 1, make sure you do it with your eyes open.  Know what you are getting yourself in for and have a bailout plan ready.  Oh, and if you do ignore BP #1, I wouldn’t recommend ignoring BP #2.  The benefits of pooling VMs on a LUN are significant.  It simplifies your SAN configuration, reduces the chances that you will run out of drive letters on your parent partition and, best of all, you don’t have to make as many hard decisions about locking away disk space.  You can just go ahead and create a 1 TB LUN, start putting VMs on it, run with as many snapshots as you like and not worry.  Until you start to reach a performance bottleneck, that is.

In summary, think hard about sizing your LUNs.  The consequences of under-sizing are significant: no room for snapshots and no room for boosting the amount of memory allocated to a VM.  If you have the luxury of copious amounts of SAN space (hey, storage is cheap, right?) then over-spec.

Notes:

  • If you want to put more than 1 VM on a LUN and don’t want to have to write a script to do it, then you will want this hotfix (actually you will want it anyway, since the other changes in functionality it enables are invaluable for long term operations).
  • If you group clustered VMs on a single LUN, don’t shut down the OS from within the OS or from Hyper-V Manager.  This is not a cluster-aware shutdown and counts as a failure.  The OS will be restarted by the cluster, and if you then go “what the ?!?!” and shut it down again, depending on how the cluster resource is configured, this will induce a failover to another node, taking the rest of your VMs with it.  Try explaining that outage to your boss when a business-critical server goes out of action for a couple of minutes at a crucial period because you shut down a scratch VM you had been using to test some software.  I for one do not feel like having to get a change request authorized just to shut down a machine.  Once you have made a VM a clustered VM, ALWAYS use Failover Cluster Management or the SCVMM Console to shut it down.
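
To make that last scenario concrete, here is a toy Python model of it.  This is emphatically not the cluster service's real logic: the class, the "one restart, then fail over the whole group" rule and the node and VM names are all assumptions standing in for whatever your cluster resource is actually configured to do.

```python
class ClusterGroup:
    """Toy model of several VMs that share one cluster disk."""

    def __init__(self, vms, node, other_node, max_restarts=1):
        self.vms = list(vms)
        self.node = node                  # node currently owning the disk
        self.other_node = other_node      # node the group would fail over to
        self.max_restarts = max_restarts  # hypothetical restart policy
        self.restarts = 0
        self.log = []

    def guest_shutdown(self, vm):
        """A shutdown from inside the guest looks like a failure to the cluster."""
        if self.restarts < self.max_restarts:
            # First "failure": the cluster restarts the VM where it is.
            self.restarts += 1
            self.log.append(f"restart {vm} on {self.node}")
        else:
            # Restart budget used up: the whole group moves with the disk.
            self.node, self.other_node = self.other_node, self.node
            self.restarts = 0
            self.log.append(f"fail over {', '.join(self.vms)} to {self.node}")

if __name__ == "__main__":
    group = ClusterGroup(["scratch", "critical"], "nodeA", "nodeB")
    group.guest_shutdown("scratch")   # cluster brings it straight back
    group.guest_shutdown("scratch")   # "what the ?!?!" ... and off goes the group
    print("\n".join(group.log))
```

The point of the model is the second log line: shutting down the scratch VM twice drags the business-critical VM to the other node as well, purely because the two share a disk.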