Fighting talk from VMware.

There’s a running joke on my team – if I want to drive my blog statistics up, all I need to do is talk tough about VMware. A few days ago I posted a response to some VMware FUD, and it’s had 3 times the readers of a typical post and more than the usual number of re-tweets, inbound links and comments, including one from Scott Drummonds. He said:

I work for VMware and am one of the people responsible for our performance white papers.

I know who Scott is, and that’s not all he’s responsible for: I got a huge spike earlier in the year when I went after him for posting a dishonest video on YouTube. He goes on:

You incorrectly state that VMware recommends against memory over-commit.  It is foolish for you to make this statement, supported by unknown text in our performance white paper, If you think any of the language in this document supports your position, please quote the specific text.  I urge you to put that comment in a blog entry of its own.

Fighting talk. I gave the link to , where both the following quotes are “unknown” to Scott – they are on page 6.

Avoid frequent memory reclamation. Make sure the host has more physical memory than the total amount of memory that will be used by ESX plus the sum of the working set sizes that will be used by all the virtual machines running at any one time.

Due to page sharing and other techniques, ESX Server can be overcommitted on memory and still not swap. However, if the over-commitment is large and ESX is swapping, performance in the virtual machines is significantly reduced.

Scott says that he is “sure that your [i.e. my] interpretation of the text will receive comments from many of your customers.” I don’t think I’m doing any special interpretation here (comment away if you think I am): the basic definition of “fully committed” is when the host has an amount of physical memory equal to the total amount of memory that will be used by the virtualization layer plus the sum of the working set sizes that will be used by all the virtual machines running at any one time. Saying you should make sure the host has more memory than that translates as DO NOT OVER COMMIT.
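To put that definition in numbers, here is a minimal Python sketch. The function name, the labels and the GB figures are my own illustration and appear nowhere in VMware's paper; all it does is compare physical memory with what the virtualization layer uses plus the sum of the VMs' working sets.

```python
def commit_state(physical_gb, hypervisor_gb, working_sets_gb):
    """Classify a host using the definition above (whole GB, hypothetical figures)."""
    demand = hypervisor_gb + sum(working_sets_gb)
    if physical_gb > demand:
        return "under-committed"
    if physical_gb == demand:
        return "fully committed"
    return "over-committed"


# Hypothetical host: 16 GB physical, 1 GB used by the virtualization layer.
print(commit_state(16, 1, [4, 4, 4]))  # demand is 13 GB
print(commit_state(16, 1, [5, 5, 5]))  # demand is 16 GB
print(commit_state(16, 1, [6, 6, 6]))  # demand is 19 GB
```

Only the first of those three hosts meets the "more physical memory than" advice quoted above.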

The second point qualifies the first: there’s a nerdy but important point that working set is not the same as memory allocated. The critical thing is to avoid the virtualization layer swapping memory. If you had – say – two hosts running 8GB VMs with Exchange in, and you tried to fit both VMs into one host with 8GB of available RAM, both VMs would try to cache 6-7GB of mail-store data; but without physical memory behind it, what got pulled in from disk gets swapped out to disk again. In this case you would be better off telling the VMs they had 4GB to use: that way they can keep 2GB of critical data (indexes and so on) in memory, and not take pot luck with what the virtualization layer swaps to disk. “Balloon” drivers make memory allocated to a VM unavailable, reducing an individual working set; page sharing reduces the sum of working sets. It might sound pedantic to say ‘you can over-allocate without over-committing’ but that’s what this comes down to: the question is “by how much”, as I said in the original post:

VMware will cite their ability to share memory pages, but this doesn’t scale well to very large memory systems (more pages to compare), and to work you must not have [1] large amounts of data in memory in the VMs (the data will be different in each), or [2]  OSes which support entry point randomization (Vista, Win7, Server 2008/2008-R2) or [3] heterogeneous operating systems.
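The over-allocate versus over-commit distinction can be sketched in a few lines of Python. This is a rough illustration under my own assumptions: the function name, the `shared_saving_gb` parameter (memory recovered by page sharing) and every figure are hypothetical, and real hypervisor accounting is far more involved.

```python
def host_memory_check(available_gb, allocated_gb, working_sets_gb, shared_saving_gb=0):
    """Return (over_allocated, over_committed) for one host.

    over_allocated: the VMs have been told they have more RAM than the host can offer.
    over_committed: the working sets, less anything saved by page sharing,
                    exceed that RAM, so the virtualization layer has to swap.
    """
    over_allocated = sum(allocated_gb) > available_gb
    effective_demand = sum(working_sets_gb) - shared_saving_gb
    over_committed = effective_demand > available_gb
    return over_allocated, over_committed


# Two 8 GB Exchange VMs squeezed onto 8 GB, each caching about 7 GB.
print(host_memory_check(8, [8, 8], [7, 7]))      # (True, True)

# The same VMs told they have 4 GB each.
print(host_memory_check(8, [4, 4], [4, 4]))      # (False, False)

# Over-allocated, but sharing and smaller working sets keep it within RAM.
print(host_memory_check(8, [6, 6], [5, 4], 1))   # (True, False)
```

The third case is the one page sharing and ballooning are meant to deliver; the argument is about how often real workloads look like it.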

We had this argument with VMware before when they claimed you could over-commit by 2:1 in every situation – eventually as “proof” they found a customer (they didn’t name) who was evaluating (not in production with) a VDI solution based on Windows XP, with 256MB in each VM. The goal was to run a single (unnamed) application, so this looked like a much better fit for Terminal Services (one OS instance / many users & apps) than for VDI (many OS instances), but if the app won’t run in TS then this is a situation where a VMware-based solution has the advantage over a Microsoft-based one. [Yes, such situations do exist! I just maintain they are less common than many people would have you believe].

Typing this up I wondered if Scott thought I was saying that VMware’s advice was that customers should never, ever, under any circumstances whatsoever use the ability to over-commit – which they have taken the time and trouble to put into their products. You can see that they recommend against memory over-commit, as he puts it, in a far more qualified way than that. The article from Elias which I was responding to (and I’ve no idea if Scott read it or not) talked about using oversubscription “to power-on VMs when a host experiences hardware failure”. This sounds like a design with cluster nodes having redundant capacity for CPU, network and disk I/O (not sweating the hardware as in the VDI case) but with none for memory; after a failure it moves to fully used CPU, network and disk capacity, but accepts a large over-commit on memory, with poor performance as a result. In my former role in Microsoft Consulting Services I used to joke in design reviews that I could only say “we can not foresee any problems” and if problems came up we’d say “that was unforeseen”. I wouldn’t sign off Elias’ design: if I did, and over-committing memory meant service levels weren’t met, any lawyer would say that was a foreseeable problem and produce that paper to show it is exactly the kind of set-up VMware tell people to avoid.

There is one last point: Scott says everyone loves over-commit (within these limitations, presumably), and we will too “once you have provided similar functionality in Hyper-V.” Before the beta of Server 2008-R2 was publicly available we showed dynamic memory. This allowed a VM to say it needed more memory (if added, it showed up using the hot-add ability of newer OSes). The host could ask for some memory back – in which case a balloon driver reduced the working set of the VM. There was no sharing of pages between VMs and no paging of VMs’ memory by the virtualization layer – the total in use by the VMs could not exceed physical memory. It was gone from the public beta, and I have heard just about every rumour on what will happen to it: “it will come out as a patch”, “it will come out in a service pack”, “it will be out in the next release of Windows”, “the feature’s gone for good”, “it will return much as you saw it”, “it will return with extra abilities”. As I often say in these situations, those who know the truth aren’t talking, and those who talk don’t know.

Oh yes, Merry Christmas and Goodwill to all.

Comments (4)

  1. Anonymous says:

    Fernando, read the last paragraph.

    I’ve no information whether "dynamic memory" will come back. Technically we have said we would ship it, but not when, and never retracted that, so my guess is it will; but my guess is no better than that of people who watch the Hyper-V team from outside Microsoft.

    I’ve also said I don’t know what the features will be if it does – it didn’t do page sharing because that doesn’t offer any benefit on any current Microsoft OS (or on the previous release either), and it didn’t let our customers get into the same problems with the hypervisor doing paging that VMware want people to avoid. That could change, but I’d be surprised.

    Is the "remember Microsoft said Live motion was not needed" thing now being pushed as the standard response to any doubt expressed about any VMware feature? It’s come up twice in quick succession on different posts. I’ve said in a lot of discussions recently that I think Live migration is a bit like an airbag on a car – you really shouldn’t need it, but we’re all much happier knowing it’s there.

  2. Anonymous says:

    Wade, yes. I made that point in the body text. That’s why I made a distinction between over-allocating memory and over-committing it. People – including the author I singled out in the original piece – would have you believe a 4GB server can run an 8GB Exchange VM, or that if you kept to a 4GB Exchange VM on 2008-R2 you can move in a 2GB SQL machine on 2003 and a 1GB XP and a 1GB Windows 7 machine as clients, and everything will be great. It won’t.

    Overallocation *IS* OK if the memory in sharable pages exceeds the excess memory allocated, but there are a lot of places where you can’t do page sharing: heterogeneous OSes, VMs with a large amount of memory used for unique data rather than sharable programs, OSes with entry point randomization (Vista or Windows 7 VDI clients, Server 2008/2008-R2). So, as I said in the main piece, it is not a blanket prohibition (it would be odd to allow something forbidden to be configured), rather a statement of limitations which some VMware advocates choose to ignore.

  3. Fernando says:

    So, can we assume Hyper-V will never ever have this feature, since it is so bad, right?

    Careful … remember the "VMotion is not needed, quick migration will do the job", just to have a full live migration 1 year later.

  4. Wade H. says:

    Hi James,

    From my interpretation of the above quoted VMware document, VMware is recommending against frequent memory reclamation (i.e. host swapping). They are not recommending against the over-commit potential derived from Transparent Page Sharing. I do not see how the above text translates to "do not over commit". It translates more to "do not allow ESX host swapping, or there will be performance degradation".