How to troubleshoot Host status of 'Not Responding' in SCVMM 2008

This is one of the most frequently asked questions on our support forums. It can be difficult to troubleshoot, but if you follow the steps below you can approach the issue systematically and usually resolve it. A status of 'Not Responding' indicates that the VMM server is unable to communicate with the Host. If this communication is interrupted for any reason, even intermittently, you can expect the Host status to change in the VMM admin console.

1. First, make sure that the following hotfixes are installed:

·          958124 A wmiprvse.exe process may leak memory when a WMI notification query is used heavily on a Windows Server 2008-based or Windows Vista-based computer

·          954563 Memory corruption may occur with the Windows Management Instrumentation (WMI) service on a computer that is running Windows Server 2008 or Windows Vista Service Pack 1

·          955805 Certain applications become very slow on a Windows Server 2008-based or Windows Vista SP1-based computer when a certificate with SIA extension is installed

·          961983 Description of the hotfix rollup package for System Center Virtual Machine Manager 2008: April 14th, 2009
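The check above can be sketched in a few lines of Python. This is only a sketch: the function name and the sample data are made up for illustration, and it assumes you already have a list of installed hotfix IDs (for example from the HotFixID column of `wmic qfe list`). Whether every package in the list shows up there is also an assumption, so treat a "missing" result as something to verify by hand.

```python
# KB numbers taken from the hotfix list above.
REQUIRED_HOTFIXES = ("958124", "954563", "955805", "961983")


def missing_hotfixes(installed_ids, required=REQUIRED_HOTFIXES):
    """Return the required KB numbers that are not in installed_ids.

    installed_ids may contain entries such as "KB958124" or "958124".
    """
    # Normalize "KB958124", "kb958124" and "958124" to the bare number.
    normalized = {
        str(item).upper().replace("KB", "").strip() for item in installed_ids
    }
    return [kb for kb in required if kb not in normalized]


if __name__ == "__main__":
    # Hypothetical sample data, not taken from a real machine.
    sample = ["KB958124", "KB955805"]
    print("Missing hotfixes:", missing_hotfixes(sample))
```

Run this against the list from both the VMM server and each Host, since the hotfixes apply to both sides of the connection.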

 

2. TCP Offloading will need to be disabled in Windows, in the registry, and in any NIC teaming management software that may be in use. It is very important to check all three locations to ensure that TCP Offloading is completely disabled. This needs to be done on both the VMM server and the Host computer.

· Locate all physical NICs in the registry under the following key: 'HKLM\System\CurrentControlSet\Control\Class\{4D36E972-E325-11CE-BFC1-08002BE10318}'

· Under this key there are subkeys with four-digit names, starting with '0000'. In each subkey, look at the 'DriverDesc' value; for a physical NIC it contains the adapter name, such as 'HP NC360T PCIe Gigabit Server Adapter'. For each of these subkeys, make the changes below.

            Disable all vendor-specific offloading. Set values for any entries below that include the word 'Offload' to '0' (Disabled).

            *FlowControl: No description available.
            *IPChecksumOffloadIPv4: Describes whether the device enabled or disabled the calculation of IPv4 checksums.
            *TCPChecksumOffloadIPv4: Describes whether the device enabled or disabled the calculation of TCP checksum over IPv4 packets.
            *TCPChecksumOffloadIPv6: Describes whether the device enabled or disabled the calculation of TCP checksum over IPv6 packets.
            *UDPChecksumOffloadIPv4: Describes whether the device enabled or disabled the calculation of UDP checksum over IPv4 packets.
            *UDPChecksumOffloadIPv6: Describes whether the device enabled or disabled the calculation of UDP checksum over IPv6 packets.
            *LsoV1IPv4: Describes whether the device enabled or disabled the segmentation of large TCP packets over IPv4 for large send offload version 1 (LSOv1).
            *LsoV2IPv4: Describes whether the device enabled or disabled the segmentation of large TCP packets over IPv4 for large send offload version 2 (LSOv2).
            *LsoV2IPv6: Describes whether the device enabled or disabled the segmentation of large TCP packets over IPv6 for large send offload version 2 (LSOv2).
            *IPsecOffloadV1IPv4: Describes whether the device enabled or disabled the calculation of IPsec headers over IPv4.
            *IPsecOffloadV2: Describes whether the device enabled or disabled IPsec offload version 2 (IPsecOV2). IPsecOV2 provides support for additional crypto-algorithms, IPv6, and co-existence with large send offload version 2 (LSOv2).
            *IPsecOffloadV2IPv4: Describes whether the device enabled or disabled IPsecOV2 for IPv4 only.
            *RSS: Receive side scaling.
            *TCPUDPChecksumOffloadIPv4: Describes whether the device enabled or disabled the calculation of TCP or UDP checksum over IPv4.
            *TCPUDPChecksumOffloadIPv6: Describes whether the device enabled or disabled the calculation of TCP or UDP checksum over IPv6.
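To make the rule above concrete, here is a minimal Python sketch that takes the values found in one adapter subkey and works out which ones would be set to '0'. The function name and the sample values are hypothetical. The sketch applies the rule exactly as written (names that include the word 'Offload'), so entries such as *RSS or *LsoV2IPv4 are left alone; it does not read or write the registry itself.

```python
def offload_changes(adapter_values):
    """Return {name: "0"} for each value whose name contains 'Offload'
    and that is not already set to "0"."""
    return {
        name: "0"
        for name, current in adapter_values.items()
        if "offload" in name.lower() and str(current) != "0"
    }


if __name__ == "__main__":
    # Hypothetical values for one adapter subkey (for example 0000).
    sample = {
        "DriverDesc": "HP NC360T PCIe Gigabit Server Adapter",
        "*IPChecksumOffloadIPv4": "3",
        "*TCPChecksumOffloadIPv4": "0",
        "*IPsecOffloadV2": "1",
        "*RSS": "1",
    }
    for name, value in sorted(offload_changes(sample).items()):
        print(name, "->", value)
```

Listing the planned changes before touching anything makes it easier to record the original values in case they need to be restored later.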

· Disable Offloading in Windows.
Use the following registry value to enable or disable task offloading for the TCP/IP protocol: HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\TCPIP\Parameters\DisableTaskOffload

            Setting this DWORD value to '1' disables all of the task offloads from the TCP/IP transport. Setting this value to '0' enables all of the task offloads.
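As an illustration, the Python sketch below builds the `reg add` command line that would set this value. The helper name is made up for this example, and the sketch only prints the command rather than running it, so you can review it before executing it from an elevated command prompt.

```python
# Key path as given above, using the HKLM abbreviation that reg.exe accepts.
TCPIP_PARAMETERS_KEY = r"HKLM\System\CurrentControlSet\Services\TCPIP\Parameters"


def task_offload_command(disable=True):
    """Build the reg.exe argument list that sets DisableTaskOffload.

    disable=True writes 1 (task offloads disabled);
    disable=False writes 0 (task offloads enabled).
    """
    return [
        "reg", "add", TCPIP_PARAMETERS_KEY,
        "/v", "DisableTaskOffload",
        "/t", "REG_DWORD",
        "/d", "1" if disable else "0",
        "/f",  # overwrite an existing value without prompting
    ]


if __name__ == "__main__":
    # Print the command for review instead of executing it.
    print(" ".join(task_offload_command()))
```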

· Disable Offloading in teaming management software.

            This is an often overlooked part of the troubleshooting process. Many vendors build some form of offloading capability into their teaming management software. The setting can take many forms and is often vendor specific, so check the teaming utility itself and disable offloading there as well.

Additional information on offloading can be found in the following MSDN article:

Using Registry Values to Enable and Disable Task Offloading

https://msdn.microsoft.com/en-us/library/aa938424.aspx

3. WinRM and svchost.

            SCVMM depends heavily on WinRM for its underlying communication, so a problem with WinRM communication between the VMM server and the Host computer can produce this same error condition. The clue that most often leads me down this path is a Host whose status is 'Ok' after a reboot but changes back to 'Not Responding' after a period of up to 3-4 hours.

            If this is the underlying problem, stopping the WinRM service from a command prompt will take much longer than normal. I've seen it take up to 5 minutes to stop in this scenario. This can occur when the shared instance of svchost that WinRM runs under becomes backed up.

            As a solution, run WinRM in its own instance of svchost by typing the following command from an elevated command prompt. The exact syntax is very important: notice the space after the =. If the command succeeds, you will see the SUCCESS message shown below it:

                        c:\>sc config winrm type= own

                        [SC] ChangeServiceConfig SUCCESS
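To confirm that the change took effect, you can look at the TYPE line reported by `sc qc winrm`. The Python sketch below parses that output. The function name is hypothetical, the sample text is an illustrative excerpt rather than output captured from a real host, and the sketch assumes the English-language output format.

```python
def runs_in_own_process(sc_qc_output):
    """Return True if the TYPE line of `sc qc` output reports
    WIN32_OWN_PROCESS, and False otherwise."""
    for line in sc_qc_output.splitlines():
        key, _, value = line.partition(":")
        # Match the TYPE line only, not START_TYPE.
        if key.strip().upper() == "TYPE":
            return "WIN32_OWN_PROCESS" in value.upper()
    return False


if __name__ == "__main__":
    # Illustrative excerpt of what the service may report before the change.
    sample = (
        "SERVICE_NAME: winrm\n"
        "        TYPE               : 20  WIN32_SHARE_PROCESS\n"
        "        START_TYPE         : 2   AUTO_START\n"
    )
    print("Own process:", runs_in_own_process(sample))
```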

4. Another possible cause is a "Restricted Groups" group policy that removes the VMM server machine account from the local Administrators group on the Host computer. This issue is discussed in further detail in KB 969164. If this is the case, move the VMM server and Host computers to a new OU that blocks inheritance of all group policy objects.

5. Some additional potential causes of this problem include:

· The VMM Agent is not running

· Antivirus software is scanning ports or protocols

While we're on the topic of VMM security, here is a list of which accounts need to be members of which groups for VMM to function properly.

· VMM server machine account

            Administrators group on the VMM server and all Hosts

            Virtual Machine Manager Servers local group on the VMM server

· The account used to perform actions in VMM

            Local Administrators group on the VMM server and all Host computers
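The membership requirements above can be written down as a simple checklist. The Python sketch below is only an illustration: the function names, computer names, and account names are hypothetical, and it assumes you have already collected the members of each local group by whatever means you prefer.

```python
ADMINS = "Administrators"
VMM_SERVERS = "Virtual Machine Manager Servers"


def required_memberships(vmm_server, hosts, machine_account, action_account):
    """List (computer, group, account) tuples from the requirements above."""
    required = []
    for computer in [vmm_server, *hosts]:
        required.append((computer, ADMINS, machine_account))
        required.append((computer, ADMINS, action_account))
    required.append((vmm_server, VMM_SERVERS, machine_account))
    return required


def missing_memberships(required, actual):
    """Return the required entries not present in actual.

    actual maps (computer, group) to a set of member account names.
    Account names are compared without regard to case.
    """
    missing = []
    for computer, group, account in required:
        members = {m.lower() for m in actual.get((computer, group), set())}
        if account.lower() not in members:
            missing.append((computer, group, account))
    return missing


if __name__ == "__main__":
    # Hypothetical environment: one VMM server and one Host.
    required = required_memberships(
        "VMM01", ["HOST01"], "CONTOSO\\VMM01$", "CONTOSO\\vmmadmin"
    )
    actual = {
        ("VMM01", ADMINS): {"CONTOSO\\VMM01$", "CONTOSO\\vmmadmin"},
        ("VMM01", VMM_SERVERS): {"CONTOSO\\VMM01$"},
        ("HOST01", ADMINS): {"CONTOSO\\vmmadmin"},
    }
    for entry in missing_memberships(required, actual):
        print("Missing:", entry)
```

A missing machine account on a Host is the situation described in step 4, so a result like this is a good reason to check for a group policy that manages the local Administrators group.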

           

In summary, these steps should resolve most issues relating to the Host status of 'Not Responding'.