Hi, my name is Alex Mihai and I work in the Cluster/Virtualization Team at Microsoft. Today I want to talk about a problem with Shared Nothing Live Migration in Windows Server 2012 that one of my customers recently encountered.
Shared Nothing Live Migration is a new feature introduced in Windows Server 2012 that allows you to live migrate virtual workloads in a non-clustered environment.
If you would like to know more, please have a look at:
Configure and Use Live Migration on Non-clustered Virtual Machines
OK, with that explained, I will go ahead and describe my customer’s problem. His environment consisted of 3 non-clustered Hyper-V 2012 hosts with some virtual workloads split between them. Let’s simply call them S1, S2 and S3.
At some point, my customer could no longer live migrate any virtual machine to S1. Live migration between S2 and S3 worked, but live migration to S1 did not.
When performing the live migration to S1, we got:
Virtual machine migration operation failed at the migration source.
Failed to establish a connection with host ‘S1’: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. (0x8007274C)
The first thing we did was to check whether live migration was enabled on S1 in Hyper-V Settings. It was. We could also ping S1 from S2 and S3 and back. S1 had also been restarted.
I wanted to see whether this happened only with the already configured virtual machines or with newly created ones as well. We created a TestVM on S2 and tried to live migrate it to S1. It didn’t work.
We then created a TestVM1 on S1 and tried to live migrate it to S2. That worked, but we couldn’t migrate it back to S1.
Taking into account that we had a communication problem with S1, I asked the customer to check whether the Hyper-V rules in the Windows Firewall on S1 were the same as those on server S2. They were the same.
We then disabled the Windows Firewall on S1 and tried to do a live migration. This time it failed with another error:
Virtual machine migration operation failed at migration source.
Failed to establish a connection with host ‘S1’: No connection could be made because the target machine actively refused it. (0x8007274D).
Since the System and Application event logs didn’t reveal any additional information about the above error message, we went ahead and looked in Microsoft-Windows-Hyper-V-VMMS-Admin and found the following:
Log Name: Microsoft-Windows-Hyper-V-VMMS-Admin
Date: 08.10.2012 09:48:47
Event ID: 20408
Task Category: None
The Virtual Machine Management Service failed to start the listener for Virtual Machine migration connections: The requested address is not valid in its context. (0x80072741).
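As a side note, all three error codes quoted so far start with 0x8007, which marks a Win32 error wrapped in an HRESULT, so the low 16 bits are a plain Winsock error number. The following is only a small Python sketch to show the arithmetic. The function name and the lookup table are mine and are not part of any Hyper-V tooling.

```python
# Sketch: split an HRESULT such as 0x8007274C into facility and error code.
# decode_hresult and WINSOCK_NAMES are illustrative names, not a Windows API.

FACILITY_WIN32 = 7

# The three Winsock error numbers that appear in this post.
WINSOCK_NAMES = {
    10049: "WSAEADDRNOTAVAIL",  # requested address is not valid in its context
    10060: "WSAETIMEDOUT",      # connected party did not respond in time
    10061: "WSAECONNREFUSED",   # target machine actively refused the connection
}


def decode_hresult(hresult):
    """Return (facility, code) for a 32-bit HRESULT value."""
    facility = (hresult >> 16) & 0x1FFF  # same mask as the HRESULT_FACILITY macro
    code = hresult & 0xFFFF              # low 16 bits carry the error code
    return facility, code


for value in (0x8007274C, 0x8007274D, 0x80072741):
    facility, code = decode_hresult(value)
    if facility == FACILITY_WIN32:
        print(hex(value), code, WINSOCK_NAMES.get(code, "unknown"))
```

Read this way, the event 20408 error says that the address the listener tried to use was not available on the machine at that moment.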
I couldn’t find any information about this error, so I also had a look at a network trace taken during the live migration failure.
Live migration uses TCP port 6600, and the network trace showed us that the connection was simply reset.
In the output of netstat -a, I could see that the vmms.exe process wasn’t listening on port 6600 on S1. S2 and S3 had the vmms process listening on port 6600.
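If you want to run the same check from another machine instead of reading netstat output on the host, a simple TCP connection test against port 6600 does the job. Below is a minimal Python sketch. The function name is my own, and S1, S2 and S3 are just the placeholder host names used in this post.

```python
import socket


def is_port_open(host, port=6600, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection performs the full TCP handshake, so a refused or
        # timed-out attempt ends up in the except branch.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example usage (host names are placeholders):
# for host in ("S1", "S2", "S3"):
#     print(host, is_port_open(host))
```

Keep in mind that a False result only tells you that the connection could not be made from where you ran the check. A firewall in between gives the same result as a missing listener, so netstat on the host itself is still the more direct evidence.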
With the above information, we could establish that this was our problem. We restarted the vmms process on S1 and then BINGO, a live migration was possible.
But the problem wasn’t gone. After S1 was restarted, my customer couldn’t do a live migration again. The vmms process wasn’t listening on port 6600.
I should also mention that my customer used 2 NICs:
1 x Hyper-V Host NIC
1 x Teamed NIC
The IP for the live migration came from the Teamed NIC.
After discussing this with one of my experienced virtualization colleagues, we concluded that there seems to be a race condition when the server is restarted.
After setting the startup type of the Virtual Machine Management Service (vmms.exe) to Automatic (Delayed Start), the problem no longer appeared after a reboot of server S1.
When the Virtual Machine Management Service (vmms.exe) starts, it tries to register port 6600 to listen for incoming migration requests. When the underlying network is not (yet) ready, the listening port cannot be registered.
This might happen on teamed networks that are not completely initialized during system startup.
Another workaround was to use the IP of the Hyper-V host NIC, instead of the teamed NIC, for the live migration traffic.
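To make the race condition a bit more tangible: binding a listening socket to an IP address that is not yet assigned to any interface fails, and waiting a little before trying again can succeed once the interface is up. The Python sketch below only illustrates that general idea. It is not how vmms.exe is implemented, and the function name and the retry parameters are assumptions of mine.

```python
import socket
import time


def listen_with_retry(address, port, attempts=5, delay=2.0):
    """Try to bind and listen on address:port, retrying if the bind fails.

    Returns the listening socket, or None if every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            # Fails with an OSError if the address is not (yet) configured
            # on any local interface, or if the port is already taken.
            sock.bind((address, port))
            sock.listen()
            return sock
        except OSError:
            sock.close()
            if attempt < attempts:
                time.sleep(delay)  # give the network stack time to come up
    return None
```

Setting the service to delayed start has a similar effect to the sleep in this sketch: the single attempt to register the listener simply happens later, when the teamed NIC has a better chance of being ready.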
I hope this helps if you encounter the same symptoms.
GTSC Support Engineer