Part 7: Datacenter Activation Coordination: When to run Start-DatabaseAvailabilityGroup to bring members back into the DAG after a datacenter switchover…

When Restore-DatabaseAvailabilityGroup is run as part of the datacenter switchover process, servers in the secondary datacenter are forced online from a quorum and cluster perspective, and servers in the primary datacenter are evicted from the DAG’s cluster. When nodes in the primary datacenter come back online and network connectivity is restored, they are not aware that cluster membership has changed. The Cluster service on each of these nodes attempts to join or form a cluster with the nodes running in the secondary datacenter. When this occurs, the nodes in the secondary datacenter inform the nodes in the primary datacenter that they were evicted.

 

After a datacenter switchover has occurred, unless the original datacenter is gone or otherwise unrecoverable, eventually services in the primary datacenter will be restored.  When services are restored, including full network connectivity, database availability group (DAG) administrators can begin the switchback process by using the Start-DatabaseAvailabilityGroup cmdlet.
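As a rough sketch of where this cmdlet fits, the commands below first list which members are currently marked as stopped and then start the primary datacenter's members by site. The DAG name DAG1 and the site name PrimarySite are placeholders assumed for illustration, and the commands are assumed to run from the Exchange Management Shell. The second command should only be run after the verification tasks that follow.

```powershell
# Run from the Exchange Management Shell.
# 'DAG1' and 'PrimarySite' are hypothetical names; substitute your own.

# Show which members are currently on the stopped and started lists.
Get-DatabaseAvailabilityGroup -Identity 'DAG1' |
    Format-List Name, StoppedMailboxServers, StartedMailboxServers

# After the verification tasks pass, bring the primary datacenter's
# members back into the DAG.
Start-DatabaseAvailabilityGroup -Identity 'DAG1' -ActiveDirectorySite 'PrimarySite'
```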

 

Before performing a switchback, you can perform the following tasks to verify that it is safe to run Start-DatabaseAvailabilityGroup for servers in the primary datacenter.

 

The first task is to ensure that the following events are present in the system log of the servers on the StoppedMailboxServers list:

 

Log Name: System
Source: Service Control Manager
Date: 5/27/2012 1:13:35 PM
Event ID: 7040
Task Category: None
Level: Information
Keywords: Classic
User: SYSTEM
Computer: MBX-1.exchange.msft
Description:
The start type of the Cluster Service service was changed from auto start to disabled.

Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: 5/27/2012 1:13:35 PM
Event ID: 4621
Task Category: Cluster Evict/Destroy Cleanup
Level: Information
Keywords:
User: SYSTEM
Computer: MBX-1.exchange.msft
Description:
This node was successfully removed from the cluster.

Log Name: System
Source: Service Control Manager
Date: 5/27/2012 1:13:35 PM
Event ID: 7036
Task Category: None
Level: Information
Keywords: Classic
User: N/A
Computer: MBX-1.exchange.msft
Description:
The Cluster Service service entered the stopped state.

In this example, MBX-1 was informed of the eviction, had its cluster services cleaned up, and had its Cluster service startup type set to Disabled.

The second task is to verify that the Cluster service startup type is set to Disabled. You can use the Services snap-in to verify this.

 

[Image: Services snap-in showing the Cluster service startup type]
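The first two tasks can also be checked from a prompt. The following is a minimal sketch, assuming it is run locally in Windows PowerShell on a server from the StoppedMailboxServers list and that the event text is in English (the message filter matches the descriptions shown above). The variable names are illustrative only.

```powershell
# Task one: pull the three event IDs from the System log, then keep only
# entries from the failover clustering provider or about the Cluster service.
# Assumption: English event text, matching the descriptions quoted above.
$found = Get-WinEvent -FilterHashtable @{ LogName = 'System'; Id = 7040, 4621, 7036 } -ErrorAction SilentlyContinue |
    Where-Object {
        $_.ProviderName -eq 'Microsoft-Windows-FailoverClustering' -or
        $_.Message -like '*Cluster Service*'
    }

# Report the most recent occurrence of each expected event, if any.
foreach ($id in 7040, 4621, 7036) {
    $hit = $found | Where-Object { $_.Id -eq $id } | Select-Object -First 1
    if ($hit) {
        "Event $id found, logged $($hit.TimeCreated)"
    } else {
        "Event $id NOT found"
    }
}

# Task two: the Cluster service (service name ClusSvc) should show
# a StartMode of Disabled.
Get-WmiObject -Class Win32_Service -Filter "Name='ClusSvc'" |
    Select-Object Name, State, StartMode
```

Because the System log may hold entries from earlier evictions, compare the reported timestamps against the time the node came back online after the switchover.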

 

The third and last task is to verify that the cluster registry has been successfully cleaned up. This is an important step because any remnants of the cluster registry can lead the server to believe it is still in a cluster even though it has been evicted. Open Registry Editor and navigate to HKEY_LOCAL_MACHINE (HKLM). If there is a hive called Cluster under the root of HKLM, the cleanup did not complete successfully.

 

Here is an example of a node where a successful cleanup was performed:

 

[Image: Registry Editor showing no Cluster hive under HKEY_LOCAL_MACHINE]

 

Here is an example of a node where the Cluster service has not been successfully cleaned up:

 

[Image: Registry Editor showing a Cluster hive still present under HKEY_LOCAL_MACHINE]
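The same registry check can be made without opening Registry Editor. This is a small sketch, assuming it is run locally in PowerShell on the evicted node; it only tests whether the Cluster key exists under HKLM and changes nothing.

```powershell
# Run locally on the evicted node.
# A leftover HKLM\Cluster key means the cleanup did not finish.
if (Test-Path -Path 'HKLM:\Cluster') {
    'HKLM\Cluster is still present: cleanup did not complete.'
} else {
    'HKLM\Cluster is absent: cluster registry cleanup completed.'
}
```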

 

Any time part of the cleanup process fails, Start-DatabaseAvailabilityGroup will typically fail as well. If any of these three tasks shows that cleanup did not complete successfully, the problem is relatively easy to fix: administrators can force the cleanup by running a cluster command.

 

Windows 2008:

 

cluster node <NODENAME> /forcecleanup

 

Windows 2008 R2 / Windows 2012:

 

Import-Module FailoverClusters

Clear-ClusterNode <NODENAME> -Force

 

Some administrators proactively include this as a step in their datacenter switchover documentation when bringing resources back to the primary datacenter. This is not a bad idea. Proactively running this command, even on a node that was cleaned up successfully, has no ill effects and eliminates the need to perform the three tasks listed above.
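For documentation that takes the proactive route, the cleanup can be looped over every server being returned to the DAG. The sketch below assumes Windows Server 2008 R2 or later, that the FailoverClusters module is available where the loop runs, and that the account has administrative rights on each node. MBX-2 is a hypothetical second server added for illustration.

```powershell
# Hypothetical node list; replace with the servers being returned to the DAG.
$nodes = 'MBX-1', 'MBX-2'

Import-Module FailoverClusters

foreach ($node in $nodes) {
    # -Force suppresses the confirmation prompt, as in the command above.
    Clear-ClusterNode -Name $node -Force
}
```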

 

Therefore, I recommend administrators either incorporate the three tasks or proactively run the cleanup command as a part of their datacenter switchover procedures.

 

========================================================

Datacenter Activation Coordination Series:

 

Part 1:  My databases do not mount automatically after I enabled Datacenter Activation Coordination (https://aka.ms/F6k65e)
Part 2:  Datacenter Activation Coordination and the File Share Witness (https://aka.ms/Wsesft)
Part 3:  Datacenter Activation Coordination and the Single Node Cluster (https://aka.ms/N3ktdy)
Part 4:  Datacenter Activation Coordination and the Prevention of Split Brain (https://aka.ms/C13ptq)
Part 5:  Datacenter Activation Coordination:  How do I Force Automount Consensus? (https://aka.ms/T5sgqa)
Part 6:  Datacenter Activation Coordination:  Who has a say?  (https://aka.ms/W51h6n)
Part 7:  Datacenter Activation Coordination:  When to run Start-DatabaseAvailabilityGroup to bring members back into the DAG after a datacenter switchover.  (https://aka.ms/Oieqqp)
Part 8:  Datacenter Activation Coordination:  Stop!  In the Name of DAG... (https://aka.ms/Uzogbq)
Part 9:  Datacenter Activation Coordination:  An error cause a change in the current set of domain controllers (https://aka.ms/Qlt035)

========================================================