Failover Cluster Node Startup Order in Windows Server 2012 R2


In this blog, my colleague JP and I would like to talk about how to start a Cluster without needing the ForceQuorum (FQ) switch.  We have identified three different scenarios showing how a Windows Server 2012 R2 Cluster behaves when you turn the nodes on in a certain order.  First, I want to mention two articles that you should be familiar with.

How to Properly Shutdown a Failover Cluster or a Node
http://blogs.msdn.com/b/clustering/archive/2013/08/23/10443912.aspx

Microsoft’s recommendation is to configure the Cluster with a witness
https://technet.microsoft.com/en-us/library/dn265972.aspx#BKMK_Witness

Now, on to the scenarios.

Scenario 1: Cluster without a witness (Node majority)
Scenario 2: Cluster with a disk witness
Scenario 3: Cluster with a file share witness

In the scenarios below, we try starting the Cluster both with and without a witness.

Scenario 1: Cluster without a witness (Node Majority)
=====================================================

Let’s name the Cluster ‘CLUSTER’ and the nodes ‘A’, ‘B’, and ‘C’.  The quorum configuration is Node Majority, so there is no witness.  All nodes have a weighted vote (meaning an assigned and a current vote).  The core Cluster Group and the other resources (two Cluster Shared Volumes) are on Node A.  We also have not defined Preferred Owners for any group.  For simplicity’s sake, the Node ID of each node is shown below.  You can get the Node ID with the PowerShell cmdlet Get-ClusterNode.

Name ID State
==== == =====
A    1  Up
B    2  Up
C    3  Up
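The vote bookkeeping in this walkthrough can be modeled with a small simulation.  This is an illustrative sketch of the behavior described in this post, not the actual Cluster Service algorithm; the class and method names are our own.

```python
# Toy model of dynamic quorum vote bookkeeping in a 3-node cluster.
# Illustrative sketch only -- not the actual Cluster Service code.

class ToyCluster:
    def __init__(self, node_ids):
        self.votes = {n: 1 for n in node_ids}   # current vote per node
        self.up = set(node_ids)                 # nodes that are running
        self.owner = min(node_ids)              # node holding the Cluster Group

    def graceful_shutdown(self, node):
        """A clean shutdown zeroes the node's vote and fails the
        Cluster Group over to the surviving node with the lowest ID."""
        self.up.discard(node)
        self.votes[node] = 0
        if self.owner == node and self.up:
            self.owner = min(self.up)

cluster = ToyCluster([1, 2, 3])     # A=1, B=2, C=3
cluster.graceful_shutdown(1)        # shut down Node A
print(cluster.owner)                # -> 2 (Node B now owns the group)
print(cluster.votes)                # -> {1: 0, 2: 1, 3: 1}
```

Running the same shutdowns as below against this toy model reproduces the vote tallies shown at each step.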

When we gracefully shut down Node A, all the resources on the node fail over to Node B, which has the next highest Node ID.  By a graceful shutdown, we mean shutting down the machine from the Start Menu or shutting down after applying patches.  All the resources are now on Node B, so the current votes are:

Node A = 0
Node B = 1
Node C = 1

Now, let’s gracefully shut down Node B.  All the resources now fail over to Node C.  Because of the way dynamic quorum works in Windows Server 2012 R2, the Cluster sustains on one node as the “last man standing”.  So our current votes are:

Node A = 0
Node B = 0
Node C = 1

Now we want to gracefully shut down Node C as well.  Since all the nodes are down, the Cluster is down. 

When we start Node A, which was shut down first, the Cluster is not formed and we see the below in the Cluster log:

INFO  [NODE] Node 3: New join with n1: stage: ‘Attempt Initial Connection’ status (10060) reason: ‘Failed to connect to remote endpoint 192.168.1.101:~3343~’
DBG   [HM] Connection attempt to C failed with error (10060): Failed to connect to remote endpoint 192.168.1.101:~3343~.
INFO  [NODE] Node 3: New join with n2: stage: ‘Attempt Initial Connection’ status (10060) reason: ‘Failed to connect to remote endpoint 192.168.1.100:~3343~’
DBG   [HM] Connection attempt to C failed with error (10060): Failed to connect to remote endpoint 192.168.1.100:~3343~.
DBG   [VER] Calculated cluster versions: highest [Major 8 Minor 9600 Upgrade 3 ClusterVersion 0x00082580], lowest [Major 8 Minor 9600 Upgrade 3 ClusterVersion 0x00082580] with exclude node list: ()

When we start Node B, which was shut down second, the Cluster is not formed and below are the entries we see in the Cluster log:

INFO  [NODE] Node 1: New join with n2: stage: ‘Attempt Initial Connection’ status (10060) reason: ‘Failed to connect to remote endpoint 192.168.1.100:~3343~’
DBG   [HM] Connection attempt to C failed with error (10060): Failed to connect to remote endpoint 192.168.1.100:~3343~.
DBG   [VER] Calculated cluster versions: highest [Major 8 Minor 9600 Upgrade 3 ClusterVersion 0x00082580], lowest [Major 8 Minor 9600 Upgrade 3 ClusterVersion 0x00082580] with exclude node list: ()

Both nodes are trying to connect to Node C, which is shut down.  Since they cannot connect to Node C, they do not form the Cluster.  Even though we have two nodes up (A and B) in a Node Majority configuration, the Cluster is not formed.

WHY??  Well, let’s see.

We start Node C and now the Cluster is formed.

Again, WHY??  Why did this happen when the others wouldn’t??

This is because the last node that was shut down (Node C) was holding the Cluster Group.  So to answer the question: the node that was shut down last should be the first node to be turned on.  When a node is shut down, its vote is changed to 0 in the Cluster registry.  When a node starts the Cluster Service, it checks its vote.  If the vote is 0, it will only join an existing Cluster.  If the vote is 1, it will first try to join a Cluster; if it cannot connect to a Cluster to join, it will form the Cluster.

This is by design.
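The decision rule just described can be sketched as a tiny function.  The function name and return values are our own simplification, not the real Cluster Service code path:

```python
# Sketch of the startup decision described above (assumed simplification).

def startup_action(stored_vote, can_reach_cluster):
    """What a starting node does, based on the vote persisted in its
    Cluster registry hive at shutdown."""
    if stored_vote == 0:
        # Vote was zeroed on graceful shutdown: this node may only join.
        return "join" if can_reach_cluster else "wait"
    # Vote is 1: try to join first; if no cluster is reachable, form one.
    return "join" if can_reach_cluster else "form"

print(startup_action(0, False))   # Node A or B starting alone -> "wait"
print(startup_action(1, False))   # Node C starting alone -> "form"
```

This is why Nodes A and B (votes zeroed at shutdown) sit waiting, while Node C (vote still 1) forms the Cluster on its own.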

Shut down all three nodes again in the same order.

Node A first
Node B second
Node C last

Power up Node C and the Cluster is formed with the current votes as:

Node A = 0
Node B = 0
Node C = 1

Turn on Node B.  It joins and is given a vote.  Turn on Node A.  It joins and is given a vote. 

If you start any node other than the one that was shut down last, the ForceQuorum (FQ) switch must be used to form the Cluster.  Once the Cluster is formed, you can start the other nodes in any order and they will join.

Scenario 2: Cluster with a disk witness
=======================================
We take the same three nodes and the same environment, but add a disk witness.

Let’s observe the difference and the advantage of adding the witness.  To view the dynamic witness weight, use PowerShell: (Get-Cluster).WitnessDynamicWeight.

PS C:\> (Get-Cluster).WitnessDynamicWeight
0

NOTE:
A setting of 1 means the witness has a vote; a setting of 0 means it does not.  Remember, the Cluster still follows the old rule of keeping the total number of votes at an odd number.

Initially, the Cluster Group and all the resources are on Node A, with the other two nodes adding votes.  The disk witness also adds a vote dynamically when it is needed.

Node A = 1 vote = Node ID 1
Node B = 1 vote = Node ID 2
Node C = 1 vote = Node ID 3
Disk Witness = 0 vote
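The rule of thumb that the witness only votes when needed to keep the total odd can be sketched as follows.  This is an illustrative simplification, not the exact internal algorithm:

```python
# Illustrative rule of thumb: the witness gives itself a vote only when
# the number of voting nodes is even, keeping the total vote count odd.

def witness_dynamic_weight(node_votes):
    """node_votes: list of current node votes (0 or 1)."""
    return 0 if sum(node_votes) % 2 == 1 else 1

print(witness_dynamic_weight([1, 1, 1]))  # all three nodes voting -> 0
print(witness_dynamic_weight([0, 1, 1]))  # Node A shut down -> 1
```

With all three nodes voting the witness stays at 0, matching the (Get-Cluster).WitnessDynamicWeight output above; once a node drops its vote, the witness picks one up.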

We gracefully shut down Node A.  All the resources and the Cluster Group move to Node B while Node A loses its vote.  Next, we gracefully shut down Node B and it loses its vote.  All the resources and Cluster Group move to Node C.  This leaves Node C as the “Last Man Standing” as in the previous scenario.  Gracefully shut down Node C as well and the Cluster is down.

This time, instead of powering on the last node that was shut down (Node C), power on Node B, which was shut down second.

THE CLUSTER IS UP !!!!!

This is because we have a witness configured and the Dynamic Quorum comes into play.  If you check the witness dynamic weight now, you will see that it has a vote.

PS C:\> (Get-Cluster).WitnessDynamicWeight
1

Because it has a vote, the Cluster forms.

Scenario 3: Cluster with a file share witness
=============================================
Again, we take the same 3 nodes, with the same environment and add a file share witness to it.

Presently, Node A is holding the Cluster Group and the other resources, with the other two nodes voting and a file share witness able to dynamically give itself a vote if it is needed.

The votes are as follows:

Node A = 1 vote = Node ID 1
Node B = 1 vote = Node ID 2
Node C = 1 vote = Node ID 3
File Share Witness = 0 vote

We gracefully shut down Node A.  The resources move over to Node B and Node A loses its vote.  Because Node A lost its vote, the file share witness dynamically adjusted and gave itself a vote to keep the total at an odd number.  Next, we gracefully shut down Node B.  The resources move over to Node C and Node B also loses its vote.

Node C is now the “Last Man Standing” which is holding the Cluster Group and all other resources.  When we shut down Node C, the Cluster shuts down.

Let’s look back at the second scenario, where we had a disk witness in place and could turn on any node to form the Cluster and bring all the resources online.  With a file share witness, this does not happen.

If we turn on Node A, which was shut down first, the Cluster will not form even though we have a file share witness.  We need to revert to turning on the node that was shut down last, i.e. Node C (the “last man standing”), to automatically form the Cluster.

So what is the difference?  We have a witness configured….  The difference is that the file share witness does not hold a copy of the Cluster Database.

So why are you doing it this way? 

To answer this, we have to go back in time to the way the Cluster starts and which copy of the database the Cluster uses when a form takes place.

In Windows 2003 and below, we had the quorum drive.  The quorum drive always had the latest copy of the database, which holds all configurations, resources, etc. for the Cluster.  The Cluster also took care of replicating any changes to all nodes so they would have up-to-date information.  So when the Cluster formed, it would download the copy on the quorum drive and then start.  This wasn’t the best way of doing things, as there was really only one copy, and if it went down, everything went down.

In Windows 2008, this changed.  Now, any of the nodes or the disk witness can have the latest copy.  We track this with a “paxos” tag.  When a change is made on a node (add resource, delete resource, node join, etc.), that node’s paxos tag is updated.  It then sends a message to all other nodes (and the disk witness, if available) to update their databases.  This way, everything is current.

When you start a node to form the Cluster, it compares its paxos tag with the one on the disk witness.  Whichever copy is later wins.  If the paxos tag on the disk witness is later, the node downloads that copy and uses it.  If the local node’s copy is later, the node uploads it to the disk witness and runs with it.
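A minimal sketch of that comparison, with paxos tags reduced to plain integers for illustration (the real tags are more structured):

```python
# Sketch of the paxos-tag comparison at form time (illustrative only).

def resolve_database(local_paxos, witness_paxos):
    """Decide which copy of the Cluster database wins when a node forms."""
    if witness_paxos > local_paxos:
        return "download"   # disk witness copy is newer: node pulls it down
    if local_paxos > witness_paxos:
        return "upload"     # node copy is newer: pushed to the disk witness
    return "in-sync"        # both copies are already current

print(resolve_database(local_paxos=5, witness_paxos=9))   # -> "download"
print(resolve_database(local_paxos=9, witness_paxos=5))   # -> "upload"
```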

We do things this way so that you do not lose any configuration.  For example, say you have a 7-node Hyper-V Cluster with 600 virtual machines running.  Node 6 is powered down, for whatever reason, and stays down for a while.  In the meantime, you add an additional 200 virtual machines.  All the other nodes and the disk witness know about this.  Now say the rack or datacenter the Cluster is in loses power.  Power is restored and Node 6 is powered up first.  If there is a disk witness, it holds a copy of the Cluster database with all 800 virtual machines, so even this node that has been down for so long will have them.  If you had a file share witness (or no witness), which does not hold the Cluster database, you would lose the 200 and have to reconfigure them.

The ForceQuorum (FQ) switch overrides this and starts with whatever Cluster database (and configuration) is on the node, regardless of paxos tag numbers.  When you use it, that node’s Cluster database becomes the “golden” copy and replicates to all other nodes (and the disk witness) as they come up.  So be cautious when using this switch.  In the above example, if you started Node 6 with ForceQuorum, you would lose the 200 virtual machines and need to recreate them in the Cluster.
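The override can be sketched the same way, again with made-up names and paxos tags as integers:

```python
# Sketch: ForceQuorum skips the paxos comparison and declares the forced
# node's database the "golden" copy (illustrative only).

def form_with_force_quorum(node_dbs, forced_node):
    """node_dbs maps node name -> (paxos_tag, database).  The forced
    node's copy overwrites every other copy, regardless of which paxos
    tag is higher."""
    golden = node_dbs[forced_node]
    return {node: golden for node in node_dbs}

# Node 6 has a stale database (paxos 5) but is forced; the newer copy
# elsewhere (paxos 9, with the extra 200 VMs) gets overwritten.
dbs = {"node6": (5, "600 VMs"), "node1": (9, "800 VMs")}
print(form_with_force_quorum(dbs, "node6")["node1"])   # -> (5, '600 VMs')
```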

As a side note, Windows Server 2016 Failover Clustering follows this same design.  If you haven’t had a chance to test it out and see all the new features, come aboard and try it out.

https://www.microsoft.com/en-us/server-cloud/products/windows-server-2016/default.aspx

Regards,
Santosh Bathija and S Jayaprakash
Microsoft India GTSC