Part 4: Datacenter Activation Coordination and the Prevention of Split Brain

In Part 1 of this blog post series, I discussed the rules for mounting databases when Datacenter Activation Coordination (DAC) mode is enabled on a Database Availability Group (DAG).  In this post, I want to look at the specific scenarios that apply to the prevention of split brain after a Datacenter Switchover has been performed.

 

In the example below, we have a 4-member DAG that has two members in Datacenter-A and two in Datacenter-B.  There is a single domain controller installed in each datacenter, as well.

 

[Image: a four-member DAG with two members in Datacenter-A and two in Datacenter-B]

 

Imagine that an outage has occurred in Datacenter-A and the decision has been made to perform a datacenter switchover.  The two remaining Exchange servers are online in Datacenter-B, and the cluster has lost quorum.  Using the built-in cmdlets, the administrator starts the switchover process by running Stop-DatabaseAvailabilityGroup for the DAG members in Datacenter-A, as illustrated below.

 

Stop-DatabaseAvailabilityGroup -Identity DAG -ActiveDirectorySite:Datacenter-A -ConfigurationOnly:$true

 

The results of this command can be verified with Get-DatabaseAvailabilityGroup -Identity DAG | fl Name,StartedMailboxServers,StoppedMailboxServers.  As expected, the servers in Datacenter-B remain on the started servers list, while the servers in Datacenter-A are on the stopped servers list.

 

The administrator then stops the Cluster service on the surviving DAG members in Datacenter-B in preparation for running Restore-DatabaseAvailabilityGroup.

 

Stop-Service ClusSvc

or, from a command prompt:

net stop ClusSvc

The final step is to run Restore-DatabaseAvailabilityGroup.  This task forces the Cluster service online on the remaining DAG members, evicts the DAG members on the stopped servers list, and configures the appropriate quorum model.

 

Restore-DatabaseAvailabilityGroup -Identity DAG -ActiveDirectorySite:Datacenter-B

 

At this point, the administrator might perform additional procedures for database activation, restoration of mail flow, and restoration of client access.

 

After several hours, the DAG members in Datacenter-A come back online, but without network connectivity between Datacenter-A and Datacenter-B.  The Cluster service on each DAG member in Datacenter-A starts, and a cluster is formed.  This is normal and expected: two of the four members are in Datacenter-A along with the witness server, which provides 3 of the 5 votes, enough to establish quorum.
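The vote arithmetic behind that statement can be checked with a small sketch. It assumes the usual majority rule (more than half of all configured votes) and one vote for each DAG member plus one for the witness server. The function name Get-QuorumMajority is made up for illustration; it is not an Exchange or Failover Clustering cmdlet.

```powershell
# Hypothetical helper: the smallest number of votes that is a strict majority.
function Get-QuorumMajority {
    param([int]$TotalVotes)
    return [int][math]::Floor([double]$TotalVotes / 2) + 1
}

$totalVotes         = 4 + 1   # four DAG members plus the witness server
$votesInDatacenterA = 2 + 1   # mbx-1, mbx-2, and the witness server

$needed    = Get-QuorumMajority -TotalVotes $totalVotes
$hasQuorum = $votesInDatacenterA -ge $needed

"Votes needed: $needed; votes in Datacenter-A: $votesInDatacenterA; quorum: $hasQuorum"
```

With the witness on its side, Datacenter-A reaches the majority on its own, which is why the cluster there forms even though Datacenter-B is unreachable.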

 

You can verify the membership of the cluster as seen from a member in Datacenter-A:

 

Import-Module FailoverClusters

Get-ClusterNode | fl name,state

Name : mbx-1
State : Up

Name : mbx-2
State : Up

Name : mbx-3
State : Down

Name : mbx-4
State : Down

You can also verify the membership of the cluster as seen from a member in Datacenter-B:

 

Import-Module FailoverClusters

Get-ClusterNode | fl name,state

Name : mbx-3
State : Up

Name : mbx-4
State : Up

Both outputs are accurate given the steps that were taken during the datacenter switchover.  We know that Restore-DatabaseAvailabilityGroup successfully forced the remaining members online in Datacenter-B and evicted the members in Datacenter-A.  We also know that the members in Datacenter-A have not been able to communicate with the members in Datacenter-B, so they are unaware that any cluster membership changes have occurred.

 

The Stop-DatabaseAvailabilityGroup cmdlet updated the stopped servers list on a domain controller in Datacenter-B, but the domain controller in Datacenter-A was not accessible when that command was issued.  Using the Get-DatabaseAvailabilityGroup cmdlet, we can verify the started and stopped servers lists as recorded on the domain controller in Datacenter-A.

 

[PS] C:\>Get-DatabaseAvailabilityGroup -Identity DAG | fl name,startedmailboxservers,stoppedmailboxservers

Name : DAG
StartedMailboxServers : {MBX-2.exchange.msft, MBX-1.exchange.msft, MBX-3.exchange.msft, MBX-4.exchange.msft}
StoppedMailboxServers : {}

 

Comparing the two lists, note that according to a domain controller in Datacenter-A, all servers are started and no servers are stopped. According to a domain controller in Datacenter-B, two of the servers are started and two of the servers are stopped.
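The difference between the two views can be made explicit with a short sketch. The Datacenter-A values are copied from the output above; the Datacenter-B values are not shown as output in this post and are filled in from the description of the Stop-DatabaseAvailabilityGroup step, so treat them as an assumption. The variable names are illustrative only.

```powershell
# The same DAG object as recorded by each datacenter's domain controller.
$viewA = @{
    Started = @('MBX-2.exchange.msft', 'MBX-1.exchange.msft', 'MBX-3.exchange.msft', 'MBX-4.exchange.msft')
    Stopped = @()
}
# Assumed from the switchover steps: Datacenter-A members stopped, Datacenter-B members started.
$viewB = @{
    Started = @('MBX-3.exchange.msft', 'MBX-4.exchange.msft')
    Stopped = @('MBX-1.exchange.msft', 'MBX-2.exchange.msft')
}

# Servers that Datacenter-A's copy still lists as started but Datacenter-B's copy does not.
$staleStarted = @($viewA.Started | Where-Object { $viewB.Started -notcontains $_ })

"Started only in Datacenter-A's view: $($staleStarted -join ', ')"
```

The servers this picks out are the two Datacenter-A members, which is the discrepancy that matters once those members start up and consult their local domain controller.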

 

Even though the members in Datacenter-A established quorum, we can verify with Get-MailboxDatabase -Status | fl Name,Mounted that they did not mount their databases:

 

[PS] C:\>Get-MailboxDatabase -Status | fl name,mounted

Name : Mailbox Database 1252068500
Mounted : False

Name : Mailbox Database 1757981393
Mounted : False

Name : Mailbox Database 1370762657
Mounted :

WARNING: Exchange can't connect to the Information Store service on server MBX-4.exchange.msft. Make sure that the
service is running and that there is network connectivity to the server.
Name : Mailbox Database 1511135053
Mounted :

WARNING: Exchange can't connect to the Information Store service on server MBX-3.exchange.msft. Make sure that the
service is running and that there is network connectivity to the server.

 

So why didn’t the databases mount in Datacenter-A, even though the members had quorum?

 

DAC mode relies on the Datacenter Activation Coordination Protocol (DACP), which Active Manager implements as a single bit held in memory. The bit is set to either 1 or 0. A value of 1 means Active Manager can issue mount requests, and a value of 0 means it cannot.

 

The starting value of the bit is always 0, and because the bit is held in memory, it reverts to 0 any time the Microsoft Exchange Replication service (MSExchangeRepl.exe) is stopped and restarted.  In the example of a lost datacenter, the bit is set to 0 when the servers power on and the Replication service initializes.  In order to change its DACP bit to 1 and be able to mount databases, a starting DAG member needs to either:

 

  • Be able to communicate with any other DAG member that has a DACP bit set to 1; or
  • Be able to communicate with all DAG members that are listed on the StartedMailboxServers list.

 

If either condition is true, Active Manager on a starting DAG member will issue mount requests for the active database copies it hosts. If neither condition is true, Active Manager will not issue any mount requests.

 

In order for the DACP bit to be set to 1 (mount database allowed) the starting DAG member must also be a member of the DAG’s cluster, and the cluster must have quorum.

 

In this example, MBX-1 can contact MBX-2 but no other members of the DAG.  The first condition fails because MBX-2 does not have its DACP bit set to 1.  The second condition fails because Active Directory in Datacenter-A has not received the updated started servers list from Datacenter-B, so all four nodes in the DAG still appear on the started servers list, and MBX-1 cannot contact MBX-3 or MBX-4.
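The rules above, applied to MBX-1, can be sketched as a small model. This is not Exchange code: the function Test-DacpMountAllowed and its parameters are invented for illustration, and the sketch takes reachability and peer DACP bits as inputs rather than showing how Active Manager actually discovers them.

```powershell
# Hypothetical model of the DACP mount decision described in this post.
function Test-DacpMountAllowed {
    param(
        [string]$Self,               # the starting DAG member
        [bool]$IsClusterMember,      # it belongs to the DAG's cluster
        [bool]$ClusterHasQuorum,     # that cluster currently has quorum
        [string[]]$Reachable,        # DAG members it can communicate with
        [hashtable]$DacpBit,         # member name -> 0 or 1, for reachable members
        [string[]]$StartedServers    # started servers list as read from the local DC
    )

    # Prerequisite: cluster membership and quorum.
    if (-not ($IsClusterMember -and $ClusterHasQuorum)) { return $false }

    # Condition 1: any reachable member already has its DACP bit set to 1.
    foreach ($peer in $Reachable) {
        if ($DacpBit[$peer] -eq 1) { return $true }
    }

    # Condition 2: every member on the started servers list is reachable.
    foreach ($server in $StartedServers) {
        if ($server -ne $Self -and $Reachable -notcontains $server) { return $false }
    }
    return $true
}

# MBX-1 as described in this example.
$mbx1 = @{
    Self             = 'MBX-1'
    IsClusterMember  = $true
    ClusterHasQuorum = $true
    Reachable        = @('MBX-2')
    DacpBit          = @{ 'MBX-2' = 0 }
    StartedServers   = @('MBX-1', 'MBX-2', 'MBX-3', 'MBX-4')
}
$allowed = Test-DacpMountAllowed @mbx1
"MBX-1 may issue mount requests: $allowed"
```

In this model the answer for MBX-1 is no, despite quorum, because MBX-2's bit is 0 and MBX-3 and MBX-4 are on the started servers list but cannot be reached.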

 

By requiring a starting member to contact either another member with its DACP bit set to 1 or all servers on the started servers list, DAC mode prevents a split brain condition even when a quorum of nodes exists and the Cluster service is functioning.

 

========================================================

Datacenter Activation Coordination Series:

 

Part 1:  My databases do not mount automatically after I enabled Datacenter Activation Coordination (https://aka.ms/F6k65e)
Part 2:  Datacenter Activation Coordination and the File Share Witness (https://aka.ms/Wsesft)
Part 3:  Datacenter Activation Coordination and the Single Node Cluster (https://aka.ms/N3ktdy)
Part 4:  Datacenter Activation Coordination and the Prevention of Split Brain (https://aka.ms/C13ptq)
Part 5:  Datacenter Activation Coordination:  How do I Force Automount Concensus? (https://aka.ms/T5sgqa)
Part 6:  Datacenter Activation Coordination:  Who has a say?  (https://aka.ms/W51h6n)
Part 7:  Datacenter Activation Coordination:  When to run start-databaseavailabilitygroup to bring members back into the DAG after a datacenter switchover.  (https://aka.ms/Oieqqp)
Part 8:  Datacenter Activation Coordination:  Stop!  In the Name of DAG... (https://aka.ms/Uzogbq)
Part 9:  Datacenter Activation Coordination:  An error cause a change in the current set of domain controllers (https://aka.ms/Qlt035)

========================================================