Scaling Hyper-V

A couple of stories have been doing the rounds on our internal virtualization discussions. One was headed "HOLY COW HYPER-V VIRTUALIZING MICROSOFT.COM!!!!!!" (and before anyone wonders if this is revealing something internal to the world, it's already been described in detail by Rob Emanuel on the Windows Server blog). The MS.COM operations team have also produced an article on how they virtualized TechNet and MSDN.

Now... Microsoft.com is not your average home page. The statistics are staggering: 1.2 billion hits per month; on average that's over 4,000 every second, but at busy times it peaks at 4 times that. Its 7 million pages take up 300GB of space. It used to need 80 servers to deliver it, but the Ops team migrated it to Server 2008 and newer hardware and saw the opportunity to reduce the number of servers. What became apparent was that, with modern CPUs and RAM sizes, the servers were disk bound. Simply throwing more CPU cores and more RAM at the servers wasn't going to reduce the number of boxes needed. Redesigning the site so it used more disk spindles would help - but the quickest win would be to take bigger servers and use Hyper-V to partition them, with each virtual machine getting its own disks; that would double the number of disk IOs available without breaking the site into parts on different disks.

But could they use virtualization without Hyper-V itself becoming the bottleneck? Deploying new servers into the array involves sync'ing 7 million pages: would virtualizing the servers - even if they ran one VM per box - help deployment? Even if Hyper-V could scale and wasn't a drag on management and deployment, would it be reliable? And would running one mega site on Hyper-V give the Microsoft.com folks confidence to consolidate some of the smaller machines they manage? (Incidentally, blogs.technet.com, where this page is hosted, is run for us by a third party.)

If you read the post you'll find the detail behind why the answers to all these questions turn out to be yes.

I think (at least in the short term) most deployments of Hyper-V will be consolidating 5-20 servers into a single box. It's perfectly capable of running many more VMs than that - indeed we demonstrated hundreds of VMs on the old Virtual Server product (more than VMware will support) - but my own view is that the typical VM requirement and the typical hardware capacity lead to a typical ratio of 10:1 (it could be 8, or 12, but I'm using rough orders of magnitude here), and the great majority of deployments will fall between half and double that. That's not scientific, but that's how I get to my own "gut feel". I say servers because I'm not a great believer in virtualizing the desktop OS: a thin client with a fatter server running your desktop as a VM doesn't reduce hardware costs compared with a rich client and skinnier server architecture. It doesn't reduce power consumption (in fact it probably increases power and A/C costs) and it delivers an inferior service; don't try video conferencing company events, or deploying a voice technology from the PC. Don't try working offline either. Yet it has the management and licensing overhead of having many machines. If there isn't really a requirement for a PC, just one or two PC applications, then a Terminal Services way of working is usually better.

So, I see Hyper-V mostly running servers, and this case of running a single workload under Hyper-V - even running multiple identical instances of the same workload - is unusual. But if anyone tries to tell you Hyper-V doesn't scale to take on the biggest workloads... well, you know different.

In the same vein, QLogic announced Hyper-V can do 180,000 IOPS. That's not a typo. It's a vast number of I/O operations per second. In fact some people find it unbelievable: Chris Wolf posted a critique of the test on his blog, the comments are interesting, and I felt the need to join in. Chris actually sent me a nice mail afterwards, so I'll repost what I said in my comment.

The purpose of this benchmark is to prove - if it can be proved - that Hyper-V is not an I/O bottleneck. I read the numbers and said "What the hell kind of system can do 200,000 IOPS?" It was plainly not the kind of system which is going to be installed in many environments. It allows Microsoft people to shout "B.S." at the top of their lungs if anyone from VMware claims to have drivers which are much better than Windows ones. It also kills any suggestion that Hyper-V and Windows drivers are OK in small systems but don't scale.

You're right that if a Microsoft benchmark says "runs at 90% of the speed of RAW hardware" the intelligent question to ask is "is that better, worse or about the same as the competition?" Is it "Faster than any previous benchmark on virtualization" because it got a better percentage of the hardware, or because it kept on scaling when the hardware improved? Either would be a win for Microsoft. Just saying "ya boo sucks... we're faster than you" isn't.

Would VMware spread disinformation? Sure they would. These are the people who can title a section "VMware ESXi – The Most Advanced Hypervisor" and in the very next sentence say "VMware ESXi 3.5 is the latest generation of the bare-metal x86 hypervisor that VMware pioneered and introduced over seven years ago." So a design that's more than 7 years old and wasn't designed to exploit the latest Intel and AMD technology is also the most advanced? These are the people who can claim "Many VMware ESX customers have achieved uptimes of more than 1,000 days without reboots", which is pretty remarkable when you look at independent analysis of VMware's patch history. (Follow the link and you'll find a quoted interval of every 19 days... 50 sets of missed patches! Don't tell the boss.)
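For anyone who wants to check the "50 sets of missed patches" sum, here is a rough sketch of the arithmetic. It takes the two figures quoted above (1,000 days of uptime, a patch set every 19 days) at face value; the function name is made up for illustration, and treating every patch set as one that needs a reboot is an assumption, not something the figures themselves establish.

```python
def missed_patch_sets(uptime_days: int, patch_interval_days: int) -> int:
    """Count the whole patch releases that fall inside an uptime window.

    Assumption (labelled, not from the source): patches arrive at a steady
    interval and each one would have required a reboot to apply.
    """
    if patch_interval_days <= 0:
        raise ValueError("patch interval must be a positive number of days")
    # Integer division: only complete intervals count as a released patch set.
    return uptime_days // patch_interval_days


# Figures quoted in the text: 1,000 days without a reboot, patches every 19 days.
missed = missed_patch_sets(1000, 19)
print(missed)  # 52, i.e. "about 50"
```

So the round number of 50 is, if anything, slightly generous to the 1,000-day claim.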

"The Xen and Microsoft architectures rely on routing all virtual machine I/O to generic drivers installed in the Linux or Windows OS in the hypervisor’s management partition. These generic drivers can be overtaxed easily by the activity of multiple virtual machines."

When I challenged VMware to find a customer who was in production with the over-commit ratios they claimed, they could only produce one who was thinking about it. So I don't think I'm being unfair in calling it BS. Interestingly, the post I linked to above repeats that claim. So I really don't feel bad calling their pronouncement on drivers BS. (I'll wait to see if someone from VMware comes up with a reason why the sum of the activity of many small VMs is different from one big one.)