Placement of the File Share Witness (FSW) on a Geographically Dispersed CCR Cluster


Update 6/4/2008: Please note that we have posted our new guidance around this subject here: New File Share Witness and Force Quorum Guidance for Exchange 2007 Clusters.

Exchange Server 2007 introduced Cluster Continuous Replication (CCR), which combines the log shipping and replay features of Exchange 2007 with a Majority Node Set (MNS) cluster. CCR uses the Microsoft Cluster service to provide failover to the passive node. An Exchange 2007 CCR cluster can be configured with two nodes and a File Share Witness (FSW). The FSW is the recommended option because it does not require a third piece of hardware. The FSW takes the place of the voter node that was required prior to update 921181; that hotfix allows a file share to serve as the quorum resource without using a shared disk. The FSW is used by the cluster nodes for arbitration. At least two of the three voters (counting both nodes and the FSW) must be up and communicating to maintain a majority; otherwise, cluster resources go offline and the Cluster service shuts down cleanly.

With a CCR cluster it is possible to place each node in a separate datacenter; however, the nodes must be on the same subnet. This is a Windows Clustering limitation that will be removed in Longhorn Server. With the nodes in separate datacenters, where should you place the FSW? Microsoft's recommendation is to place the FSW in the same datacenter as the preferred owning node of the CCR cluster. This recommendation is based on manageability and on failure statistics, which show that the most common network failure is WAN related, not LAN related. This means the highest risk of communication failure between voters in a geographically dispersed CCR implementation is between the Active and Passive nodes. If we place the FSW in the datacenter with the preferred owning node, a loss of communication between the datacenters will not cause the Cluster service to shut down. We recommend placing the FSW on a Hub Transport server.

Note: We also recommend that you create a CNAME record in DNS for the server hosting the share, and reference that alias instead of the actual server name. When creating the share for the File Share Witness, use the fully qualified domain name of the CNAME record instead of the server name, as this practice assists with site resilience.
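On a Windows DNS server, the alias can be created with dnscmd. This is a minimal sketch only; the server, zone, and host names (dns01.contoso.com, contoso.com, fsw, hub01) are hypothetical examples, not values from this article:

```batch
rem Create a CNAME "fsw" in the contoso.com zone that points to the Hub
rem Transport server hosting the witness share (all names are examples).
dnscmd dns01.contoso.com /RecordAdd contoso.com fsw CNAME hub01.contoso.com
```

The witness share would then be referenced through the alias (for example, \\fsw.contoso.com\sharename) rather than through the physical server name.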

Please see article 281308 for some configuration issues with using a CNAME to access the FSW share.

The following examples assume that this configuration is already in place:

  1. Creation of FSW in primary data center on the Hub Transport Server.
  2. CCR Geo-Cluster has been configured.
  3. IP subnets have been properly configured.
  4. FSW share has been previously configured in the Backup Data Center.

So what happens if we lose the primary datacenter? If this were to occur, there could not be an automatic failover to the passive node: we would have lost both the active node and the FSW, the cluster would no longer have a majority of voters (counting both nodes and the FSW), and the Cluster service on the passive node would shut down. Manual intervention is necessary in this scenario to force quorum and redirect the FSW to a new share in the secondary datacenter:

1.) Update the CNAME to point to a Hub Transport Server in the Backup Data Center.

2.) Force the quorum to start the cluster service on the Secondary Node:

net start clussvc /forcequorum node_list

Please see article 258078, "Cluster service startup options," for more details.

3.) At this point the Cluster service will start and you should be able to bring your cluster resources online.

The FSW share at the secondary site should already exist; the recommended location is on the Hub Transport server in the backup datacenter. You can easily script the previous steps if desired, but a Primary Data Center failure will still require manual intervention.
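The redirection steps above can be sketched as a small batch script. This is only a sketch, assuming Windows DNS and the CNAME approach recommended earlier; every name here (dns02.contoso.com, contoso.com, fsw, hub02, NODE2) is a hypothetical example:

```batch
rem Sketch of the manual recovery after losing the primary datacenter.

rem 1) Repoint the FSW alias at a Hub Transport server in the Backup
rem    Data Center by replacing the CNAME record.
dnscmd dns02.contoso.com /RecordDelete contoso.com fsw CNAME /f
dnscmd dns02.contoso.com /RecordAdd contoso.com fsw CNAME hub02.contoso.com

rem 2) Force quorum so the Cluster service starts on the surviving node.
net start clussvc /forcequorum NODE2
```

Once the Cluster service is running on the surviving node, the cluster resources can be brought online.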

For the procedure to create and secure the file share for the FSW please refer to:

http://technet.microsoft.com/en-us/bb124922.aspx
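As a very rough sketch of what that procedure involves (the directory, share name, and service account used here are hypothetical; follow the article above for the exact permissions required):

```batch
rem Create the witness directory, share it, and grant the Cluster service
rem account full control at both the share and NTFS level (all names
rem are examples only).
mkdir C:\FSW_DIR
net share FSW_DIR=C:\FSW_DIR /GRANT:CONTOSO\clussvc-acct,FULL
cacls C:\FSW_DIR /E /G CONTOSO\clussvc-acct:F
```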

What happens if the Primary Data Center comes back online, but the WAN link to the Backup Data Center is down? The following illustrates this scenario:

1.) When the Primary Node comes up in the Primary Data Center, it would not see the Secondary node.

2.) The Primary Node would still use the local FSW: it would not see the DNS update, because the local DNS and the DNS in the backup datacenter have not synchronized.

3.) The FSW (in the Primary Data Center) would show that the Primary Node has a current version of the cluster database, so the Primary Node would form a cluster. At this point we have hit split brain syndrome (a scenario in which an MNS cluster is partitioned by a network communication failure and each partition considers itself the owner and brings resources online).

To avoid split brain syndrome, you can delete the files in the FSW share in the primary datacenter before starting the Primary Node, or simply not start the server where the FSW resides. If the WAN link has been restored and the Primary Node has not yet been started, propagate the DNS updates if possible so that the Primary Node will use the FSW in the Backup Data Center.
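Clearing the stale witness data can itself be scripted. A sketch, assuming the old witness share is reachable at \\hub01.contoso.com\FSW_DIR (a hypothetical path) and that this runs before the Primary Node is started:

```batch
rem Remove the stale cluster data from the old witness share so the Primary
rem Node cannot use it to form a majority (the share path is an example).
del /s /q \\hub01.contoso.com\FSW_DIR\*.*
```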

Note: You should ensure that your servers are not configured to power back on automatically. Most current server models have a BIOS setting that controls when the server powers back on; it can be set to remember the previous power state, turn on, or remain off. The servers should be set to a "remain off" configuration, so that when power is restored to the datacenter they can be brought up in a controlled, managed way.

Using a 3rd Site to achieve Automatic Failover

If possible, you could use a 3rd site to host the FSW:

With the 3rd site, a failure at the Primary Data Center would allow an automatic failover to the Secondary Node, as the cluster keeps a majority of the voters (the Secondary Node and the FSW).

- Matt Richoux

Comments (24)
  1. Paul Flaherty says:

    if you have replicated storage, could you use fsw with scc?

  2. Robert says:

    This whole CCR idea is a little confusing. I deployed an SCC Exchange 2007 cluster with MNS/FSW, added a HUB server as the ‘witness’, and everything seems to be working just fine. MNS simply replaces the need to have a dedicated shared disk for your quorum requirements. You still need to have shared disks for the rest of your edb files and your logs. Can someone please explain whether we still need MSDTC as a cluster resource? My Exchange 2007 SCC cluster does not have MSDTC running.

    thank you

  3. pesospesos says:

    Do you guys have any information on the following? http://www.eggheadcafe.com/software/aspnet/29244557/re-setup-preparead-fail.aspx

    I am getting this error when trying to install the 32-bit management tools on a server…  very frustrating!

  4. pesospesos says:

    now each time i run it i get error:

    "active directory operation failed on dc1.domain.local.  The object ‘CN=Public Folders,CN=All Address Lists,CN=Address Lists Container,CN=DOMAIN,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=domain,DC=local" already exists.

    The object exists."

    I get this each time i try to install.  Any tips would be greatly appreciated…

    thanks,

    Wes

  5. Matt Richoux says:

    Hi Robert,

    While you can use FSW with SCC, SCC does not perform data replication to the passive node, so you still have a single point of failure with your database. CCR does not use shared disk: each node has its own copy of the database on its own local disk. MSDTC is not needed for an Exchange 2007 cluster.

  6. Matt Richoux says:

    Hi Paul,

    While technically you could get this to work, our support stance for Exchange 2007 would still be the same as for Exchange 2003. For 3rd-party replication, please see the links below:

    http://support.microsoft.com/kb/895847

    http://technet.microsoft.com/en-us/library/bedf62a9-dff7-49a8-bd27-b2f1c46d5651.aspx

  7. Robert says:

    Hi Matt

    Thanks for your comments. I do use LCR for my SCC cluster. Presently CCR is an expensive option for us, as we host up to 12 databases on each of our Exchange 2003 clusters using EMC SAN storage. That would mean we now have to deploy an additional 12 SAN disks. As for CCR, do I have any control over the replication process, like say replicate every 4 hours or so? The reason I am asking is that we are using BCVs (business continuity volumes) and we had corruption replicated to the BCV and had to restore from tape backup anyway.

  8. Matt Richoux says:

    Hi Pesos,

    This really isn’t the right forum for this question, try the Exchange Newsgroups:

    http://msexchangeteam.com/archive/2004/06/14/155584.aspx

  9. Ross says:

    Robert,

    LCR is not supported in a "single copy" cluster scenario.  

    Ross

  10. Ben Hoffman Exchange 2007 says:

    Great Post good information, keep up the good work!

    Ben

    http://exchangeis.com

  11. James Kahn says:

    Matt,

    In the last scenario covered (FSW at third WAN site), what would the result be if the WAN is broken in such a way that the clusters can’t see each other, but both can see the FSW?  Will they both think they are primary?

    Thanks,

    JK.

  12. HARISH says:

    James,

    In the last scenario, since both nodes can see the FSW, they will try to lock the FSW share so that they can get a majority. Whichever node is able to access the FSW first will be the active node and will host all the resources.

  13. Hi James says:

    Harish is correct. If we just lose network connectivity between the nodes, the primary node will continue as the active node as long as it can communicate with the FSW. If we had a condition where everything was coming back up but communication between the nodes was not restored, the first node up and communicating with the FSW will own the resources.

  14. James Kahn says:

    Thanks guys, that makes sense.

  15. sfotovat says:

    If I have two data centers, is it possible to use "Single copy clusters" for redundancy within a data center and "Cluster continuous replication" for across data centers redundancy?    

  16. Harry L says:

    Since CCR supports multiple Clustered Mailbox Servers, why can't we configure Active/Active for CCR? Will a future release, say Exchange 2007 SP1, support Active/Active CCR?

    Thanks.

  17. Matt Richoux says:

    Hi Harry,

    No, Exchange 2007 clusters do not support Active/Active; this includes CCR.

    CCR does not support more than one CMS per cluster.

  18. Matt Richoux says:

    Hi Sfotovat,

    No you can’t use SCC with CCR, but you can use SCC with SCR, which will be available with Exchange 2007 SP1.  For more information on SCR see the following:

    http://msexchangeteam.com/archive/2007/02/23/435699.aspx

  19. Jim Ealer says:

    Hi Matt:

    When planning for the public and private "heartbeat" networks in a geo-dispersed CCR implementation, do we need to have separate network links for the public/private networks (i.e. separate routers, switches, WAN connections), or can we use VLANs over a single wide area link?

    The CCR Planning documentation at http://technet.microsoft.com/en-us/library/a26a07fc-f692-4d84-a645-4c7bda090a70.aspx says that "A separate cluster private network must be provided" and that "to avoid single points of failure, use independent VLAN hardware for the different paths between the nodes."

    This suggests we should have two separate links, but I wanted to get clarification on this if possible.

    Thanks for your help!

    Jim

  20. tony says:

    Hi Matt,

    In your last scenario of the FSW at a third site, what would happen to CCR if this third site goes down?

    Thanks, Tony

  21. Matt Richoux says:

    The CCR would continue to function on the active node. We can still communicate with the passive node, so no failure would occur. If we then lost communication with the passive node, the Cluster service on both nodes would stop.

  22. Matt Richoux says:

    Hi Jim,

    You can use separate VLANs on the same network link (as long as your hardware supports it), but it would be a single point of failure if they both come off the same router/switch.

  23. Matthew Wie says:

    We’ve got exactly this setup; however, in testing, the /forcequorum switch does not seem to start the Cluster service on the secondary node if the Hub Transport server hosting the FSW is transferred to the secondary site and the primary Exchange node is down. Any ideas?

  24. bobster says:

    What happens if you have your active and passive CCR nodes in the same location, with the FSW installed on your HUB server, and you add a second HUB for HA? If I rebooted the HUB with the FSW, the cluster would go down, would it not? If so, how would I use both HUBs and the FSW to provide HA?

