What’s Going On With My Cluster?

Hello everyone. Today’s topic covers a behavior you’ll notice on Windows Server 2008 Failover Clusters when there are networking issues between the nodes of the cluster.

I’ve simulated a loss of network connectivity between the two nodes of my cluster. When this happens, the cluster determines whether any of the nodes have enough quorum votes to keep the cluster up. In this scenario, I am using the ‘Node and Disk Majority’ quorum model. Once the outage between the nodes occurs, the node that owns the ‘witness disk’ at the time should take over the cluster. The other node doesn’t have enough votes to maintain quorum, so its cluster service is shut down.
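
Before going further, it helps to know which node owns the witness disk. Here is a quick check from a command prompt; this is a sketch assuming the default core group name ‘Cluster Group’ and a witness disk resource named ‘Cluster Disk 1’, both of which may differ in your cluster:

    REM Show which node currently owns the core cluster group (the witness disk lives here)
    cluster group "Cluster Group" /status

    REM Or check the witness disk resource directly
    cluster res "Cluster Disk 1" /status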

The first sign of a problem shows up on the remaining node of the cluster: the other node shows as ‘Down’ in Failover Cluster Manager.

[Screenshot: Failover Cluster Manager on node 1 showing node 2 as Down]

So, as an administrator, my first action would be to start the cluster service on node 2 to get it to rejoin the cluster.

I could do this from Failover Cluster Manager by right-clicking that node and selecting [More Actions…], [Start Cluster Service]:

[Screenshot: the node’s right-click menu showing More Actions > Start Cluster Service]
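
The command-line equivalent, which relies on the same RPC connectivity and so will fail the same way in this scenario, would be something along these lines (node2 is my lab’s node name):

    REM Ask the Service Control Manager on node 2 to start the cluster service remotely
    sc \\node2 start clussvc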

If I am still having network issues where node 1 can’t talk to node 2 over ANY interface, I’ll get a pretty vague RPC error.

[Screenshot: the RPC error returned by Failover Cluster Manager]

What can I try now? I’ll go over to node 2 and try to start the cluster service from the Services snap-in.

[Screenshot: starting the Cluster Service from the Services snap-in on node 2]
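
The same thing from a command prompt on node 2, using the standard service control commands:

    REM Start the cluster service locally
    net start clussvc

    REM Confirm the service state
    sc query clussvc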

Great, it started. I am good to go. Am I?

I went back and looked at Failover Cluster Manager on node 1.

[Screenshot: Failover Cluster Manager on node 1 still showing node 2 as Down]

Node 2 still shows as ‘Down’. Wait a minute, the cluster service still shows as ‘Started’ on node 2.

[Screenshot: the Services snap-in on node 2 showing the Cluster Service as Started]

How can the service be started AND the node show as ‘Down’ in Failover Cluster Manager?

Let’s go to the command line to dig a little deeper. If I run ‘cluster node’ from a command prompt on the working node, I see:

[Screenshot: ‘cluster node’ output on node 1]
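
In text form, the output looks roughly like this (node names are from my lab):

    Listing status for all available nodes:

    Node           Node ID Status
    -------------- ------- ------
    node1                1 Up
    node2                2 Down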

But if I run the same command on node 2, I see:

[Screenshot: ‘cluster node’ output on node 2]
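
Again in rough text form:

    Listing status for all available nodes:

    Node           Node ID Status
    -------------- ------- -------
    node1                1 Down
    node2                2 Joining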

So there is a disparity in what each node is reporting. Node 1 shows node 2 as ‘Down’. Node 2 shows node 1 as ‘Down’ and its own status as ‘Joining’.

It’s this ‘Joining’ status that’s the key. What’s happening is that node 2 is trying to join the existing cluster by contacting node 1 for permission to join. Since there’s no network connectivity, node 2 starts the cluster service anyway because that node “thinks” it needs to form a new cluster. In order for node 2 to form a cluster, it needs enough quorum votes (two in this case) to fully start. In the meantime, it repeatedly attempts to arbitrate for the ‘witness disk’ so that the needed second vote can be obtained; the same behavior occurs with the other quorum models as well. We’ll try for about 20 seconds to arbitrate for that second vote. If we can’t, we terminate the cluster service and, depending on your service restart configuration, start the whole process over again. It’s this 20-second window, or “waiting for quorum” state, that shows as the ‘Joining’ status when viewed from the command prompt.
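
That “service restart configuration” is just the recovery settings on the cluster service itself. You can inspect them from a command prompt with the standard Service Control Manager tooling:

    REM Show the failure/recovery actions configured for the cluster service
    sc qfailure clussvc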

If you look in the system event log on node 2, you’ll see:

[Screenshot: Event ID 1553 in the System event log on node 2]

Event ID 1553: This cluster node has no network connectivity. It cannot participate in the cluster until connectivity is restored.

The event you’ll see when node 2 tries, and fails, to arbitrate for the witness disk is:

[Screenshot: Event ID 1573 in the System event log on node 2]

Event ID 1573: Node ‘nodename’ failed to form a cluster. This was because the witness was not accessible.
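
If you’d rather pull these events from a command prompt, something like this should work (wevtutil ships with Windows Server 2008):

    REM Show the ten most recent 1553/1573 events from the System log, newest first
    wevtutil qe System /q:"*[System[(EventID=1553 or EventID=1573)]]" /c:10 /rd:true /f:text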

If you were focusing only on that event, it might appear that the whole problem is that node 2 can’t access the witness, whether that is a witness disk or a file share witness. What’s really happening is that node 2 is trying to form its own cluster but can’t access the witness resource because the other node owns a lock on that resource. This ‘lock’ is how we keep a cluster node from trying to form its own cluster, a “split brain” scenario, while other nodes are up and running.
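
The repeated arbitration attempts, and the lock denials, are recorded in the cluster debug log. On Windows Server 2008 you can generate it from a command prompt (it is written to %windir%\Cluster\Reports on each node):

    REM Generate cluster.log on every node of the cluster
    cluster log /g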

A good tool for troubleshooting networking issues between nodes is the cluster validation report. If I run validation from node 1 and select only the ‘Network’ tests, I see the following:

[Screenshot: validation report results for the Network tests]

It looks like there may be networking issues on node 2. I try a basic ping test, and it fails. Now I know I’m in the right ballpark.
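
For example, testing connectivity from node 1 over each cluster network (the name and address below are from my lab; substitute your own):

    REM Test name resolution and the primary network
    ping -n 4 node2

    REM Test each remaining cluster network by address
    ping -n 4 10.10.10.2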

After I resolve the network issue(s), I can try to start the cluster service on node 2 again.
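
From a command prompt on node 2, that’s simply:

    REM Start the cluster service, then confirm both nodes report 'Up'
    net start clussvc
    cluster node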

[Screenshot: Failover Cluster Manager showing both nodes as Up]

Looks much better! I hope this sheds some light on how to troubleshoot scenarios like this.

Jeff Hughes
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support