In Part 2, I discussed implementing networks in a Failover Cluster. In this final segment, I will discuss troubleshooting cluster networking issues.
As previously stated, it is important that redundant and reliable communications connectivity exist between all nodes in a cluster. However, connectivity within a cluster can be disrupted, either by actual network failures or by misconfiguration. A loss of communications connectivity with a node can result in that node being removed from cluster membership. When a node is removed from cluster membership, it terminates its cluster service to avoid problems or conflicts as other nodes take over the services, applications, and resources it was hosting. The node attempts to rejoin the cluster when the cluster service restarts.
This problem can also have broader effects because the loss of a node affects quorum. Should the number of nodes participating in a cluster fall below a majority, all highly available services are taken Offline until quorum is re-established. The one exception is the No Majority: Disk Only quorum model; however, that model is not recommended.
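To make the majority rule concrete, here is a minimal Python sketch of the arithmetic. It is an illustration only and not how the cluster service itself is implemented. The function name `has_quorum` and its parameters are hypothetical, and the sketch assumes a majority-based quorum model in which every participant (node or witness) holds exactly one vote. It does not cover the No Majority: Disk Only model.

```python
def has_quorum(total_votes: int, active_votes: int) -> bool:
    """Return True when a strict majority of the configured votes is present.

    Hypothetical helper for illustration: assumes a majority-based quorum
    model where each node (or witness) contributes exactly one vote.
    """
    if total_votes <= 0:
        raise ValueError("total_votes must be positive")
    if not 0 <= active_votes <= total_votes:
        raise ValueError("active_votes must be between 0 and total_votes")
    # A strict majority means more than half of all configured votes.
    return active_votes > total_votes // 2


if __name__ == "__main__":
    # Walk a five-vote cluster down one lost node at a time.
    for remaining in range(5, 0, -1):
        state = "quorum held" if has_quorum(5, remaining) else "quorum lost"
        print(f"{remaining} of 5 votes active: {state}")
```

Under these assumptions, a cluster with an even number of votes tolerates no more failures than one with one vote fewer: four votes and three votes both survive only a single loss.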
Here are some recommended troubleshooting procedures for cluster connectivity issues:
1. Examine the system log on each cluster node and identify any errors reporting a loss of communications connectivity in the cluster or even broader network related issues. Here are some example cluster related error messages you may encounter:
Figure 22: Cluster Network Connectivity error messages
Figure 23: Network Connectivity and Configuration error messages
2. If the system logs provide insufficient detail, generate the cluster logs and inspect the contents for more detailed information concerning the loss of network connectivity.
Note: Generate the cluster logs by running the Get-ClusterLog PowerShell cmdlet.
3. Verify the configuration of all networks in the cluster.
4. Verify the configuration of network connectivity devices such as Ethernet switches.
5. Run an abbreviated cluster validation process by selecting only the Network tests.
The tests that are executed are shown here:
The desired end result is this:
As an example, here is the section in the validation report that shows the results for the List Network Binding Order test:
Some of the common issues seen with respect to the network validation tests include, but may not be limited to:
· Multiple NICs on a cluster node configured to be on the same subnet.
· Excessive latency (usually > 2 seconds) in ping tests between interfaces on cluster nodes.
· Warning that the firewall has been disabled on one or more nodes.
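The first issue in the list above, multiple NICs on the same subnet, can also be spotted by hand from a node's adapter configuration. The Python sketch below shows the idea using only the standard `ipaddress` module. It is not what the validation wizard runs; the function name `find_shared_subnets`, the adapter names, and the addresses are all made up for illustration, and the adapter list is assumed to have been collected separately (for example, from `ipconfig /all`).

```python
from collections import defaultdict
from ipaddress import ip_interface


def find_shared_subnets(adapters):
    """Group adapters by subnet and return only subnets used more than once.

    `adapters` is an iterable of (adapter_name, "address/prefix") pairs.
    Hypothetical helper for illustration; the input is supplied by the caller.
    """
    by_subnet = defaultdict(list)
    for name, cidr in adapters:
        # ip_interface keeps the host address and derives the enclosing network.
        by_subnet[ip_interface(cidr).network].append(name)
    return {str(net): names for net, names in by_subnet.items() if len(names) > 1}


if __name__ == "__main__":
    # Example data only: two adapters on this node share 192.168.1.0/24.
    node_adapters = [
        ("Public", "192.168.1.10/24"),
        ("Private", "10.0.0.1/24"),
        ("Backup", "192.168.1.11/24"),
    ]
    for subnet, names in find_shared_subnets(node_adapters).items():
        print(f"Subnet {subnet} is shared by: {', '.join(names)}")
```

An empty result means each adapter in the list sits on its own subnet.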
6. Conduct simple networking tests, such as a ‘ping’ test, across all networks enabled for cluster communications to verify connectivity between the nodes. Use network monitoring tools such as Microsoft’s Network Monitor to analyze network traffic between the nodes in the cluster (Refer to Figures 13 and 14).
7. Evaluate hardware failures related to networking devices such as Network Interface Cards (NICs), network cabling, or network connectivity devices such as switches and routers as needed.
8. Review the change management log (if one exists in your organization) to determine what, if any, changes were made to the nodes in the cluster that may be related to the disruption in communications connectivity.
9. Consider opening a support incident with Microsoft. If a node is removed from cluster membership, no network configured on that node could be used to communicate with the other nodes in the cluster. If multiple networks are configured for cluster use, as recommended, a loss of cluster membership indicates a problem that affects all of those networks or the system’s ability to send or receive heartbeat messages.
Note: For additional information on troubleshooting Windows Server 2008, consult TechNet.
I hope the information in this three-part blog series was helpful and will assist you in properly configuring network connectivity in Windows Server 2008 Failover Clusters.
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support