I received a call from a customer who was experiencing networking problems with virtual machines running on Hyper-V. The customer had implemented a two-node Windows Server 2008 R2 SP1 Failover Cluster using Live Migration for Cluster Shared Volumes. Specifically, virtual machines running on the cluster could not access a CIFS-based share on a NetAPP appliance. Oddly enough, all of the virtual machines on either node could access any standard Windows share; only the NetAPP appliance could not be reached. Keep in mind that the NetAPP appliance was not new to the environment and could be accessed from any machine, physical or virtual, as long as the client was not hosted by the cluster in question. The share could also be reached without a problem from the Hyper-V hosts themselves. The error from the client virtual machine(s) when attempting to access the share is below. Sometimes the error was generated immediately; other times the Explorer shell would hang for a full minute first. A network trace from the client showed the traditional TCP three-way handshake between the virtual machine and the NetAPP appliance, followed by a reset from the client. I did not have a network trace from the NetAPP appliance to see what was occurring on the server side.
The first question, of course, is the infamous “What has changed?” Fortunately, the customer had a formal change process and could detail the changes made before the problem appeared. In a nutshell:
- Recommended hotfixes and updates for Windows Server 2008 R2 SP1 Failover Clusters
- The network location profile changes from "Domain" to "Public" in Windows 7 or in Windows Server 2008 R2
- Last month’s security updates via Windows Update
- Latest HP Proliant Support Pack
At the time of the problem I had the following thoughts:
- Can’t be the NetAPP appliance because it can be accessed from other Windows machines outside of the cluster
- Can’t be the Hyper-V hosts because the Administrator could access the share from the parent partition of each host
- Not a hardware problem since the issue is present on both cluster nodes
- Highly unlikely that the Windows security patches are relevant, especially since they were applied only to the Hyper-V hosts and not to the virtual machines
- Highly unlikely to be the recommended cluster updates, since they will be included in SP2 for 2008 R2
- Can’t be security because the same Administrator who accesses the share from the Hyper-V host gets the permission error within the VM; therefore the dialog box is bogus
The problem with this train of thought is that I had essentially ruled out everything. We tried all the usual steps to collect more information about the problem: disconnecting and reconnecting virtual NICs within the VMs, changing from synthetic to legacy adapters within the VMs, restarting the VMs, restarting the Hyper-V hosts, and so on.
After a lot of trial and error, a suggestion was made to roll back, on one of the cluster nodes, the network card driver that the HP PSP had updated to the latest version. Immediately after replacing the latest driver with the older one, the problem was solved. Evidently a bug in the new driver prevented only the virtual machines from accessing the NetAPP appliance; truly a strange one indeed.
The problematic driver was:
In summary, with multiple layers of virtualization and integration with third-party devices, identifying where a problem resides can certainly be difficult. One lesson I learned, which might have made the problem easier to identify, is to update only one cluster node at a time and have the client certify and sign off before patching the second node. Of course, I am not sure that a basic certification would even have revealed the problem unless someone attempted to access the NetAPP device from within a VM.
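One way to make that kind of certification less dependent on someone remembering to try the right share is a small smoke test run from inside a VM after each node is patched. The Python sketch below is only an illustration, not something used at the customer site: the function name `check_shares` and the UNC paths are hypothetical. It tries to list each share and records whether that worked and how long it took. It lists the share rather than just opening a TCP connection because, in this case, the TCP handshake completed and the failure only appeared afterwards.

```python
import os
import time


def check_shares(paths, lister=os.listdir):
    """Try to list each path; record success, elapsed time, and any error."""
    results = []
    for path in paths:
        start = time.monotonic()
        try:
            lister(path)
            ok, error = True, None
        except OSError as exc:
            # Covers permission errors, missing paths, and network failures.
            ok, error = False, str(exc)
        results.append({
            "path": path,
            "ok": ok,
            "seconds": round(time.monotonic() - start, 2),
            "error": error,
        })
    return results


if __name__ == "__main__":
    # Hypothetical share names; replace with the shares the VMs depend on.
    shares = [r"\\netapp01\data", r"\\winfs01\public"]
    for result in check_shares(shares):
        status = "OK" if result["ok"] else "FAILED: " + result["error"]
        print(f'{result["path"]}: {status} ({result["seconds"]}s)')
```

Running this in a VM on the freshly patched node, before touching the second node, would list every share the VMs depend on alongside a pass or fail result, and the elapsed time would help spot the "hangs for a minute" variant of the failure.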