Imagine this. Lots of machines, some virtual, some physical. Every single machine can ping every other machine, except for just one pair. An administrator’s nightmare? Yes. Trust me, I’ve been there, spending several hours diagnosing this particularly tricky problem.
Let’s simplify the situation. Start with two Virtual Server hosts. For ease of reference, let’s call them A & B. Now let’s say A & B each have two virtual machines running on them. We’ll call these C, D, E & F. C & D are running on host A, while E & F are running on host B. Still with me? Maybe the summary below will make it a bit clearer.
Host | A    | B    |
VMs  | C, D | E, F |
Now let’s treat physical and virtual machines the same, and draw up the problem in terms of which machine can ping which. Each row is the machine sending the ping; each column is the target.
  | A | B | C | D | E | F |
A | Y | Y | Y | Y |*N*| Y |
B | Y | Y | Y | Y | Y | Y |
C | Y | Y | Y | Y | Y | Y |
D | Y | Y | Y | Y | Y | Y |
E |*N*| Y | Y | Y | Y | Y |
F | Y | Y | Y | Y | Y | Y |
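With only six machines the odd pair is easy to spot by eye, but with more machines it helps to let a script pick out the failures. Here is a minimal sketch in Python, assuming the ping results have already been collected into a nested dictionary; the `results` structure and the `failing_pairs` name are my own illustration, not part of any tool.

```python
# Machines from the table above, physical and virtual treated alike.
MACHINES = ["A", "B", "C", "D", "E", "F"]


def failing_pairs(results):
    """Return a sorted list of (source, target) pairs where the ping failed.

    results[src][dst] is True when src can ping dst.
    """
    return sorted(
        (src, dst)
        for src, row in results.items()
        for dst, ok in row.items()
        if not ok
    )


# Rebuild the matrix from the table: everything works except A <-> E.
results = {src: {dst: True for dst in MACHINES} for src in MACHINES}
results["A"]["E"] = False
results["E"]["A"] = False

print(failing_pairs(results))  # [('A', 'E'), ('E', 'A')]
```

A failure that shows up in both directions for a single pair, as here, points away from a general network or firewall fault and towards something specific to those two machines.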
In other words, A cannot ping E and E cannot ping A. Obvious thoughts? Firewall. Checked, OK. Besides, if a firewall were to blame, why would B, C, D & F be able to ping E? It made no sense.
The first step was to run a network trace on host A while trying to ping VM E. This yielded an ARP query of the sort “Who has <IP1>, tell <IP2>” where IP1 is the IP address of VM E and IP2 is the IP address of host A. No response to that ARP query showed up in the trace. Even more curious.
The next step was to ensure the ARP cache on the host was correct. First I did an “arp -a” and it showed that host A indeed didn’t know the Ethernet address of VM E. This makes sense; otherwise there wouldn’t be an ARP query going out on the network.
Working around the problem, I again used arp, this time with the -s parameter, to add the correct IP address and MAC address of VM E to the ARP cache on host A. Another network trace later, and all I saw was the ping request going out, but no response being received.
Just to be safe, I cleared out the ARP cache using netsh interface ip delete arpcache. Another network trace showed the ARP request again.
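If you need to repeat that check across several hosts, the “arp -a” output is easy enough to parse. The following is a rough Python sketch, assuming the usual Windows layout of three columns (Internet Address, Physical Address, Type); the sample text and all addresses in it are made up for illustration, and `parse_arp_table` is a hypothetical helper name.

```python
import re

# Hypothetical sample of Windows "arp -a" output; every address is invented.
SAMPLE = """\
Interface: 192.168.0.10 --- 0x2
  Internet Address      Physical Address      Type
  192.168.0.1           00-11-22-33-44-55     dynamic
  192.168.0.20          00-03-ff-aa-bb-cc     dynamic
"""

# One cache entry per line: IPv4 address, dash-separated MAC, entry type.
ENTRY = re.compile(
    r"^\s*(\d+\.\d+\.\d+\.\d+)\s+"
    r"([0-9A-Fa-f]{2}(?:-[0-9A-Fa-f]{2}){5})\s+"
    r"(\w+)\s*$"
)


def parse_arp_table(text):
    """Map each IP address in the output to a (mac, type) tuple."""
    table = {}
    for line in text.splitlines():
        match = ENTRY.match(line)
        if match:  # header and "Interface:" lines simply do not match
            ip, mac, kind = match.groups()
            table[ip] = (mac.lower(), kind)
    return table


table = parse_arp_table(SAMPLE)
print(table.get("192.168.0.30"))  # None: no entry, so an ARP query would go out
```

On a real host the text could come from `subprocess.run(["arp", "-a"], capture_output=True, text=True).stdout` instead of the sample string. A missing entry for the target IP matches what I saw on host A.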
Of course, I’d already done a reboot of both host A and VM E to no avail, but another reboot never hurts. Same result though.
In the end, a light bulb went on as to the probable cause of the problem. In retrospect, it’s obvious. So obvious, in fact, that I could kick myself, given that I regularly tell Virtual Server users to be aware of it. I just wasn’t expecting it to manifest itself in quite the way seen above.
VMs C, D, E & F all have dynamic MAC addresses. OK, that’s fine, but Virtual Server only guarantees MAC addresses to be unique between all VMs running on a _single_ host. In other words, a VM configured with a dynamic MAC address on one host could end up with the same MAC address as a VM on another host. I came unstuck precisely because of this. Host A did indeed have VM C set up with two network adapters. Both were set for dynamic MAC addresses, but the second NIC was set to disabled within the VM itself. It turned out that the dynamic MAC address for the second NIC of VM C was the same as that of the single NIC configured for VM E. I would have found this much earlier had that NIC been enabled. But there you go: that’s half the fun of diagnosing this stuff.
I’ve clearly oversimplified the problem description above. In my case there were actually 11 VMs across three hosts, just to add a bit more head scratching to the equation. So the golden rule here: if you have multiple VS hosts with multiple VMs which all need to inter-communicate, either manually assign static MAC addresses to each NIC on each VM yourself, or be extra careful to check the dynamically assigned MAC addresses. Virtual Server isn’t infallible.
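Checking the assigned MAC addresses by hand gets tedious quickly, so here is a minimal Python sketch of the check. It assumes you have already gathered a list of (host, VM, NIC, MAC) entries from your VM configurations by whatever means you prefer; the inventory below and its MAC values are invented, and `normalize_mac` and `find_duplicate_macs` are hypothetical helper names.

```python
from collections import defaultdict


def normalize_mac(mac):
    """Reduce a MAC address to 12 lowercase hex digits so that
    differently formatted copies of the same address compare equal."""
    digits = mac.replace("-", "").replace(":", "").replace(".", "").lower()
    if len(digits) != 12 or any(c not in "0123456789abcdef" for c in digits):
        raise ValueError(f"not a MAC address: {mac!r}")
    return digits


def find_duplicate_macs(inventory):
    """Return {mac: [(host, vm, nic), ...]} for every MAC used more than once.

    inventory is an iterable of (host, vm, nic, mac) tuples.
    """
    seen = defaultdict(list)
    for host, vm, nic, mac in inventory:
        seen[normalize_mac(mac)].append((host, vm, nic))
    return {mac: owners for mac, owners in seen.items() if len(owners) > 1}


# Hypothetical inventory mirroring the simplified scenario; MACs are made up.
inventory = [
    ("A", "C", "nic1", "00-03-FF-00-00-01"),
    ("A", "C", "nic2", "00-03-FF-00-00-05"),  # disabled in the guest, still listed
    ("A", "D", "nic1", "00-03-FF-00-00-02"),
    ("B", "E", "nic1", "00:03:ff:00:00:05"),  # same address, different notation
    ("B", "F", "nic1", "00-03-FF-00-00-04"),
]

for mac, owners in find_duplicate_macs(inventory).items():
    print(mac, owners)
```

The important design point is to include every configured NIC in the inventory, whether or not it is enabled inside the guest, since a disabled adapter was exactly what hid the clash in my case.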