File Share Witness (FSW) placement and the cluster group.

With Exchange 2007 Cluster Continuous Replication (CCR) clusters, the recommended quorum type for the host cluster is Majority Node Set with File Share Witness (Windows Server 2003) or Node and File Share Majority (Windows Server 2008).

In this blog post I want to talk about two things that influence how these clusters behave: placement of the file share witness, and which node owns the cluster group.

(Note: All information in this blog assumes a two-node scenario, since that is the maximum node count supported for Exchange 2007 CCR-based clustered installations.)

The first item is placement of the file share witness.

In order for a two-node solution to maintain quorum, we have to have a minimum of two votes.  In our two-node cluster scenarios we attempt to maintain two votes by arbitrating for a lock on the file share witness.  When a node is able to establish an SMB file lock on the file share witness, that node gets the benefit of the witness vote.  The node that holds the minimum two votes necessary has quorum, and will stay functional and host applications.  The node left with the remaining single vote has lost quorum, and will terminate its cluster service.
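
If you want to confirm which node ended up on the surviving side of that arbitration, a quick check of node status from the surviving node will show it.  This is a sketch only; the cluster FQDN is a placeholder:

Cluster.exe <clusterFQDN> node /status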

When both nodes are in the same data center, the placement of the file share witness is generally not an issue.  When multiple data centers / physical locations are involved, with WAN connections used to maintain connectivity between them, the placement of the file share witness becomes important.

In many scenarios customers are only dealing with a primary and a secondary data center.  Generally I would recommend that the file share witness be placed in the location where Exchange will service user accounts.  In this case, if the link between the two nodes is down (for example, a WAN failure), Exchange will stay functioning on the node where users will be serviced.  This is because two votes are available in the primary data center, so that node has quorum, while only one vote is available in the secondary data center, so that node has lost quorum.  In the event that the primary data center is actually lost, and the secondary data center must be activated, administrators could follow the appropriate forcequorum instructions for their operating system to force the solution online.
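
As a rough sketch of what forcing quorum looks like (node names are placeholders, and the exact switches should be verified against the documentation for your operating system and service pack before use):

net start clussvc /forcequorum:NodeB     (Windows Server 2003)
net start clussvc /forcequorum           (Windows Server 2008; /fq for short)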

The consideration with the aforementioned scenario is that when connectivity is lost between the two data centers, Exchange stays functioning in the primary data center.  Manual activation of the secondary data center would be necessary in the event of a full primary data center loss.  Should the active node in the primary data center stop functioning, the solution would still function using the node in the secondary data center and the file share witness in the primary data center.

Another scenario is where the file share witness is placed in the secondary data center.  Given the same WAN failure as outlined before, Exchange would automatically be moved to the node in the secondary data center, since that is the only node that can maintain quorum (i.e., has two votes).  The node in the primary data center does not have access to the file share witness and will terminate its cluster service (lost quorum).  This scenario does appeal to some customers.  For example, should the primary data center be lost, Exchange would automatically come online in the secondary data center.  What I consider to be a drawback of this design is that any communications loss between the primary data center and the secondary data center would result in Exchange automatically coming online only in the secondary data center, unable to service users (assuming users use the same WAN connection between data centers).  As in the previous scenario, should the WAN be functioning and the node in the secondary data center be lost, Exchange would continue to function in the primary data center, using the file share witness in the remote data center to maintain quorum.

The last scenario is for customers that have at least three data centers.  In this scenario, the assumption is that each data center has direct connectivity to the others (think triangle here).  For example, NodeA would be placed in DataCenter1, NodeB in DataCenter2, and the file share witness in DataCenter3.  Should DataCenter1 and DataCenter2 lose connectivity, each will have equal access to the file share witness.  The first to successfully lock the file share witness gets the benefit of the vote, and can maintain quorum.  Any node maintaining quorum in this scenario will continue to host its existing applications, and will arbitrate applications away from nodes that have lost quorum.

In the previous example you get automatic activation should either the primary or the secondary data center be unavailable, protection from a single WAN failure between any two data centers, and automatic activation for any node failure.

In the first two examples above it is generally not relevant which node owns the cluster group.  The ability to lock the file share witness is derived from its placement on one side of the WAN or the other, and from the ability to maintain that WAN connection.  It is in the three data center scenario that the location of the cluster group becomes important.  Let's take a look at that…

The second item – which node owns the cluster group (Applies to Windows 2003 Only).

In Windows 2003 the cluster group contains the cluster name, cluster IP address, and majority node set resource (configured to use file share witness). 

If you review the private properties of the majority node set resource, you will see a timer value called MNSFileShareDelay.  (cluster <clusterFQDN> res "Majority Node Set" /priv)

Cluster.exe cluster-1.exchange.msft res "Majority Node Set" /priv

Listing private properties for 'Majority Node Set':

T  Resource             Name                           Value
-- -------------------- ------------------------------ -----------------------
S  Majority Node Set    MNSFileShare                   \\2003-DC1\MNS_FSW_Cluster-1
D  Majority Node Set    MNSFileShareCheckInterval      240 (0xf0)
D  Majority Node Set    MNSFileShareDelay              4 (0x4)

By default the MNSFileShareDelay is 4 seconds.  You can configure this to a different value but in general this is not necessary. 
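
Should you ever need to change it, setting the private property with cluster.exe would look something like this (the cluster FQDN is a placeholder and the 8-second value is purely illustrative):

cluster <clusterFQDN> res "Majority Node Set" /priv MNSFileShareDelay=8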

When there is a condition where the two member nodes cannot communicate, and there is a need to use the file share witness to maintain quorum, the node that owns the cluster group gets the first chance to lock the file share witness.  The node that does not own the cluster group sleeps for MNSFileShareDelay (in this case 4 seconds) before attempting its own lock.
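
To see which node currently owns the cluster group, and therefore which node gets that first chance at the lock, checking the group status is enough (again, the cluster FQDN is a placeholder):

cluster <clusterFQDN> group "Cluster Group" /status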

The second item – which node owns the cluster group (Applies to Windows 2008 Only).

In Windows 2008 the cluster group is partially abstracted from the user.  The items that comprise the cluster group (IP address, network name, and quorum resource) are now known as cluster core resources.
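
To see the core resources and which node currently owns them, listing the cluster resources will show each resource along with its owning group and node (the cluster FQDN is a placeholder):

cluster <clusterFQDN> res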

Like Windows 2003, Windows 2008 also implements a delay for nodes not owning the cluster core resources when attempting to lock the file share witness. 

If you review the private properties of the File Share Witness resource, you will see a value called ArbitrationDelay. 
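
By analogy with the Windows 2003 command shown earlier, and using the resource name exactly as it appears in the output below, the command would look something like this:

cluster <clusterFQDN> res "File Share Witness (\\HT-2\MNS_FSW_MBX-1)" /priv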

Listing private properties for 'File Share Witness (\\HT-2\MNS_FSW_MBX-1)':

T  Resource                                    Name                 Value
-- ------------------------------------------- -------------------- --------------------
S  File Share Witness (\\HT-2\MNS_FSW_MBX-1)   SharePath            \\HT-2\MNS_FSW_MBX-1
D  File Share Witness (\\HT-2\MNS_FSW_MBX-1)   ArbitrationDelay     6 (0x6)

The default arbitration delay value is 6 seconds and it is generally not necessary to change this value.
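
If you did need to adjust it, the syntax should mirror the Windows 2003 example above (the cluster FQDN is a placeholder and the 10-second value is purely illustrative):

cluster <clusterFQDN> res "File Share Witness (\\HT-2\MNS_FSW_MBX-1)" /priv ArbitrationDelay=10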

When there is a condition where the member nodes (two or more, since a file share witness can be used with more than two nodes in Windows 2008) can no longer communicate, and utilization of the file share witness is necessary in order to maintain quorum, the node that owns the cluster core resources gets the first attempt to lock the file share witness.  Challenging nodes will sleep for 6 seconds before attempting to lock the witness directory.

So…why does this delay matter?

Take the example of the three data center scenario.  DataCenter1 hosts NodeA, currently running the clustered mailbox server; DataCenter2 hosts NodeB, currently owning the cluster group; and DataCenter3 hosts the file share witness.  The link between DataCenter1 and DataCenter2 is interrupted, while no interruption exists between DataCenter1 and DataCenter3 or between DataCenter2 and DataCenter3, so both nodes have equal access to the file share witness.  Since the cluster group is owned by NodeB, NodeB will immediately lock the file share witness.  NodeA, because a lock already exists, will be unable to lock the file share witness and will terminate its cluster service.  NodeB will arbitrate the Exchange resources and bring them online.  Because of this delay, in the three location scenario, you may end up with results that were unexpected (for example, expecting NodeA to continue running Exchange without interruption).

When using the Exchange cmdlets to manage the cluster (Move-ClusteredMailboxServer), we do not take any action in regard to the cluster group; we only act on the Exchange group.  Taking into account the above example, you might find it necessary to modify how you move the Exchange and cluster resources between nodes.  Let me give a few examples of where you might modify how you move resources between nodes.
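
As a sketch (the CMS name, cluster FQDN, node name, and comment text are all placeholders), moving the Exchange resources and then the cluster group to the same node would look something like this:

Move-ClusteredMailboxServer -Identity <CMSNAME> -TargetMachine NodeA -MoveComment "Returning resources to the primary data center"
cluster <clusterFQDN> group "Cluster Group" /moveto:NodeA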

Example #1:  You have the three data center scenario outlined before.  Your primary client base accessing Exchange is in DataCenter1.  You have decided to run Exchange on NodeB in DataCenter2.  The cluster group remains on NodeA in DataCenter1.  The link between DataCenter1 and DataCenter2 is interrupted.  Connections from each data center to DataCenter3 are not impacted.  NodeA, which owns the cluster group, is first to lock the file share witness.  NodeB, after waiting out its delay period, finds an existing lock and is unable to maintain quorum, so its cluster service terminates.  NodeA successfully arbitrates the Exchange resources.  In this case, by leaving the cluster group on the node in the main data center, Exchange came home when the link was lost so that user service could continue.

Example #2:  You have the three data center scenario outlined before.  Your primary client base accessing Exchange is in DataCenter1.  It is time to apply patches to your operating system requiring a reboot.  You successfully apply the patches to NodeB in DataCenter2.  Post reboot, you issue a move command for the Exchange resources (Move-ClusteredMailboxServer -Identity <CMSNAME> -TargetMachine NodeB) and the resources move successfully.  You then patch NodeA and issue a reboot.  During the reboot process, the cluster automatically arbitrates the cluster group to NodeB.  When NodeA has completed rebooting, you issue a command to move the Exchange resources back to NodeA.  Sometime after these moves occur, the link between DataCenter1 and DataCenter2 is interrupted.  The link between each data center and DataCenter3 is not impacted.  NodeB, currently owning the cluster group, is allowed first access to the file share witness and is successful in establishing a lock.  NodeA, which also has access, is unable to establish a lock and terminates its cluster service.  In this case Exchange is moved from NodeA to NodeB (and presumably users are now cut off from mail services since the link between DataCenter1 and DataCenter2 is not available).

Example #3:  You have the three data center scenario outlined before.  Your primary client base accessing Exchange is in DataCenter1.  It is time to apply patches to your operating system requiring a reboot.  You successfully apply the patches to NodeB in DataCenter2.  Post reboot, you issue a move command for the Exchange resources (Move-ClusteredMailboxServer -Identity <CMSNAME> -TargetMachine NodeB) and the resources move successfully.  You then patch NodeA and issue a reboot.  During the reboot process, the cluster automatically arbitrates the cluster group to NodeB.  When NodeA has completed rebooting, you issue a command to move the Exchange resources back to NodeA.  You also issue a command to move the cluster group back to NodeA, presumably because you've read and understood this blog (cluster <clusterFQDN> group "Cluster Group" /moveto:<NODE>).  Sometime after these moves occur, the link between DataCenter1 and DataCenter2 is interrupted.  The link between each data center and DataCenter3 is not impacted.  NodeA, currently owning the cluster group, is allowed first access to the file share witness and is successful in establishing a lock.  NodeB, which also has access, is unable to establish a lock and terminates its cluster service.  In this case Exchange is not impacted.

In most installations I work on it is not necessary to manage the cluster group, since both nodes and the file share witness are located in the same data center.  If you are using multiple data centers, consider what is outlined here when managing your Exchange and cluster resources.