Drilling into ‘reasons for not switching to Hyper-V’

InformationWeek published an article last week, "9 Reasons Why Enterprises Shouldn't Switch to Hyper-V". The author is Elias Khnaser; this is his website and this is the company he works for. A few people have taken him to task over it, including Aidan. I've covered all the points he made, most of which seem to have come from VMware's bumper book of FUD, but I wanted to start with one point which I hadn't seen before.

Live migration. Elias talked of "an infrastructure that would cause me to spend more time in front of my management console waiting for live migration to migrate 40 VMs from one host to another, ONE AT A TIME." and claimed it "would take an administrator double or triple the time it would an ESX admin just to move VMs from host to host". Posting a comment to the original piece, he went off the deep end replying to Justin's comments, saying "Live Migration you can migrate 40 VMs if nothing is happening? Listen, I really have no time to sit here trying to educate you as a reply like this on the live migration is just a mockery. Son, Hyper-v supports 1 live VM migration at a time." Now this does at least start with a fact: Hyper-V only allows one VM to be in flight on a given node at any moment. But you can issue one command and it moves all the Hyper-V VMs between nodes. Here's the PowerShell command that does it.
# For each cluster group on the source node that contains a Virtual Machine
# resource, live-migrate it to the destination node, one after another.
Get-ClusterNode -Name grommit-r2 | Get-ClusterGroup |
  Where-Object { Get-ClusterResource -InputObject $_ |
    Where-Object { $_.ResourceType -like "Virtual Machine*" } } |
      Move-ClusterVirtualMachineRole -Node wallace-r2
The video shows it in action with 2 VMs, but it could just as easily be 200. The only people who would "spend more time in front of [a] management console" are those who are not up to speed with Windows clustering. System Center will sequence the moves for you as well.

But… does it matter if the VMs are migrated in series or in parallel? If you have a mesh of network connections between cluster nodes you could be copying to two nodes over two networks with the parallel method, but if you don't (and most clusters don't) then n simultaneous copies will each go at 1/n the speed of a single copy. Surely if you have 40 VMs and each takes a minute to move, it takes 40 minutes either way… right? Well, no. Let's use some rounded numbers for illustration only: say 55 seconds of the minute goes on the initial copy of memory, 4 seconds on the second-pass copy of the memory pages which changed during those 55 seconds, and 1 second on the third-pass copy and handshaking. Then Hyper-V moves on to the next VM, and the process repeats 40 times. What happens with 40 copies in parallel? Somewhere in the 37th minute the first-pass copies complete – and none of the VMs have moved to their new node yet. Now: if 4 seconds' worth of pages changed in 55 seconds – that's about 7% of all the pages – what percentage will have changed in 36 minutes? Some pages won't change from hour to hour and others change from second to second; how many actually change in 55 seconds, or 36 minutes, or any other length of time depends on the work being done at that point and on the memory size, and will be enormously variable. However, the extreme points are clear: (a) in the very best case no memory changes and the parallel copy takes as long as the sequential one; in all other cases it takes longer. (b) In the worst case the second pass has to copy everything – and when that happens the migration will never complete.
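To put that illustration into numbers – and these are the rounded figures assumed above, not measurements – here's a quick back-of-the-envelope in PowerShell:

$vms        = 40
$firstPass  = 55     # seconds: first-pass memory copy at full link speed
$secondPass = 4      # seconds: re-copy of pages dirtied during the first pass

# Sequential: one VM at a time, each takes about a minute end to end.
$sequentialMinutes = $vms * ($firstPass + $secondPass + 1) / 60
"Sequential: $sequentialMinutes minutes"                            # 40 minutes

# Parallel: 40 copies share the bandwidth, so the first pass alone
# takes 40 x 55 seconds before any second pass can even begin.
"Parallel, first pass alone: {0:N1} minutes" -f ($vms * $firstPass / 60)   # ~36.7

# Dirty-page rate implied by the illustration: 4 seconds' worth in 55.
"Pages dirtied per first pass: {0:P0}" -f ($secondPass / $firstPass)       # ~7%

The open question the paragraph above poses is the last number: 7% of pages dirtied per 55 seconds is survivable, but nobody can promise what the figure becomes over a 36-minute stretched first pass.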

Breadth of OS support. In Microsoft-speak, "supported" means a support incident can go all the way to issuing a hot-fix if need be. "Not supported" doesn't mean non-cooperation if you need help – but the support people can't make the same guarantee of a resolution. By that definition we don't "support" any other company's software – they provide the hot-fixes, not us – but we do have arrangements with some vendors so a customer can open a support case and have it handed on to Microsoft, or handed on by Microsoft, as a single incident. We have those arrangements with Novell for SUSE Linux and with Red Hat for RHEL, and it's reasonable to think we are negotiating arrangements for more platforms: those who know what is likely to be announced won't identify which platforms, to avoid prejudicing the process. In VMware-speak "supported" has a different meaning: in their terms NT4 is "supported". NT4 works on Hyper-V, but without hot-fixes for NT4 it's not "supported". If NT4 is supported on VMware and not on Hyper-V, exactly how is a customer better off? Comparisons using different definitions of "support" are meaningless. "Such-and-such an OS works on ESX/vSphere but fails on Hyper-V" or "Vendor X works with VMware but not with Microsoft" allows the customer to say "so what?" or "that's a deal-breaker".

Security. Was it Hyper-V that had the vulnerability which let VMs break out into the host partition? No, that was VMware. Elias commented that "You had some time to patch before the exploit hit all your servers", which makes me worry about his understanding of network worms. He also brings up the discredited disk-footprint argument, which rests on the fallacy that every megabyte of code is equally prone to vulnerabilities; Jeff sank that one months ago, and pretty comprehensively – the patch record shows a little code from VMware has had more flaws than a lot of code from Microsoft.

Memory over-commit. VMware's own advice is: don't do it. Deceiving a virtualized OS about the amount of memory at its disposal means it makes bad decisions about what to bring into memory, with the virtualization layer paging blindly – not knowing what needs to be in memory and what doesn't. That means you must size your hardware for more disk operations, and still accept worse performance. Elias writes about using oversubscription "to power-on VMs when a host experiences hardware failure". In other words, the VMs fail over to another host which is already at capacity, and oversubscription magically makes the extra capacity you need. We'd design things with a node's worth of unused memory (and CPU, network, and disk IOps) spread across the other node[s] of the cluster. VMware will cite their ability to share memory pages, but this doesn't scale well to very large memory systems (more pages to compare), and for it to work you must not have [1] large amounts of data in memory in the VMs (the data will be different in each), or [2] OSes which support entry-point randomization (Vista, Windows 7, Server 2008/2008-R2), or [3] heterogeneous operating systems. Back in March 2008 I showed how a Hyper-V solution was more cost-effective if you spent some of the extra cost of buying VMware on memory – in fact I showed the maths underneath it, and how under limited circumstances VMware could come out better. Advocates for VMware [Elias included] say buying VMware buys greater VM density: the same amount spent on RAM buys even greater density. The VMware case is always based on a fixed amount of memory in the server; as I said back then, either you want to run [a number of] VMs on the box, or the budget per box is [a number]. Who ever yelled "Screw the budget, screw the workload, keep the memory constant!"? The flaw in that argument is more pronounced now than when I first pointed it out, as the amount of RAM you can get for the price of VMware has increased.
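By way of illustration only – the licence price, RAM price and per-VM memory below are hypothetical round numbers, not quotes – the shape of that maths looks like this:

# Hypothetical figures, for illustration - not real prices.
$vmwareLicencePerHost = 3000    # extra cost of VMware on one host ($)
$ramPricePerGB        = 50      # street price of server RAM ($/GB)
$memoryPerVM          = 2       # GB allocated to each VM

# Spend the licence money on RAM instead:
$extraRamGB = $vmwareLicencePerHost / $ramPricePerGB       # 60 GB
$extraVMs   = [math]::Floor($extraRamGB / $memoryPerVM)    # 30 more VMs

"The licence price buys $extraRamGB GB of RAM - room for $extraVMs more VMs."
"Over-commit has to beat that density gain before it pays for itself."

As RAM prices fall, the number of extra VMs the licence money buys only goes up – which is why the argument gets stronger over time, not weaker.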

Hot add memory. Hyper-V only does hot-add of disks, not memory, and some guest OSes won't support hot-added memory at all. Is it an operation which justifies the extra cost of VMware?
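For what it's worth, the hot-add Hyper-V does do is a two-liner – this is a minimal sketch assuming the Hyper-V PowerShell cmdlets, with a hypothetical VM name and path:

# Create a new dynamic VHD and hot-add it to a running VM.
New-VHD -Path 'C:\VHDs\Data.vhd' -SizeBytes 20GB -Dynamic
# Hot-add only works on the SCSI controller, not IDE:
Add-VMHardDiskDrive -VMName 'App01' -ControllerType SCSI -Path 'C:\VHDs\Data.vhd'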

Priority restart. Elias describes a situation where all the domain controllers / DNS servers are on one host. In my days in Microsoft Consulting Services, reviewing the designs customers had in front of them, I would have condemned a design which did that, and asked some tough questions of whoever proposed it. It takes scripting (or very conservative start-up timeouts) in Hyper-V to manage this – a sketch of the scripted approach follows below. I don't know the feature in VMware well enough to say whether it sequences things based not just on the OS running, but on all the services being ready to respond.
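The scripted approach might look something like this – a minimal sketch assuming the Hyper-V PowerShell cmdlets, with hypothetical VM and host names; a ping is a crude readiness check, and testing the services themselves (a DNS query, say) would be better still:

# Bring up the domain controller first and wait until it answers.
Start-VM -Name 'DC01'
do { Start-Sleep -Seconds 10 }
until (Test-Connection -ComputerName 'dc01.contoso.com' -Count 1 -Quiet)

# The DC is responding - now the dependent VMs can start in any order.
'SQL01','App01','App02' | ForEach-Object { Start-VM -Name $_ }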

Fault tolerance. VMware can offer parallel running – with serious restrictions; Hyper-V needs third-party products (Marathon) to match that. What this saves is the downtime to restart the VM after an unforeseen hardware failure. It's no help with software failures: if the app crashes, or the OS in the VM crashes, then both instances crash identically. Clustering at the application level is the only way to guarantee high levels of service – how else do you cope with patching the OS in the VM, or the application itself?

Maturity. If a new competitor shows up in your market, you tell people how long you have been around. But what is the advantage in VMware's case? Shouldn't age give rise to wisdom – the kind of wisdom which stops you shipping updates which cause High Availability VMs to unexpectedly reboot, or shipping beta time-bomb code in a release product? It's an interesting debating point whether VMware had that wisdom and lost it – if so, they have passed through maturity and reached senility.

Third-party vendor support. Here's a photo from a meet-the-suppliers event one of our customers put on, where they had us next to VMware. Notice we've got System Center Virtual Machine Manager on our stand, running in a VM, managing two other Hyper-V hosts which happen to be clustered; the lack of traffic at the VMware stand lets us see they weren't showing any software at all. A full demo of our latest and greatest needs 3 laptops – and theirs? Well, the choice of hardware is a bit limiting. There is a huge range of management products to augment Windows – indeed the whole reason for bringing System Center in is that it manages hardware, virtualization (including VMware) and the virtualized workloads. When Elias talks of third-party vendors I think he means people like him – and that would mean he's saying you should buy VMware because that's what he sells.