DAG Active Manager Deep Dive

In Exchange 2007 and earlier versions, Exchange used the cluster resource management model to install, implement, and manage the Mailbox server high availability solution, so we were completely reliant on the cluster manager. Historically, building a highly available Mailbox server meant first building a Windows Failover Cluster and then running Exchange Setup in clustered mode on top of it. In this mode, the Exchange cluster resource DLL, exres.dll, was registered, which allowed the creation of a clustered mailbox server or CMS (called an Exchange Virtual Server in Exchange 2003).

Exchange 2010 uses a new component called Active Manager, which replaces the resource model and failover management features that previous versions of Exchange obtained through integration with the Cluster service.

All Exchange cluster resources provided by exres.dll no longer exist, including the construct known as a clustered mailbox server (CMS). A Windows Failover Cluster is still needed and used by Exchange, but there are no cluster groups for Exchange, and there are no storage resources in the cluster. So, if you examine the cluster using cluster management tools, you’ll see only the core cluster resources (IP Address and Network Name, and if needed, quorum resource). Cluster nodes and networks will also exist, but those are managed by Exchange (DAG Networks) and not the cluster or cluster tools.

Say Hello to PAM and SAM...

Active Manager runs on all Mailbox servers that are members of a database availability group (DAG). There are two Active Manager roles: Primary Active Manager (PAM) and Standby Active Manager (SAM). The PAM is the Active Manager in a DAG that decides which database copies will be active and which will be passive. The PAM is responsible for getting topology change notifications and reacting to server failures. The DAG member that holds the PAM role is always the member that currently owns the cluster quorum resource (default cluster group). If the server that owns the cluster quorum resource fails, the PAM role automatically moves to a surviving server that takes ownership of the cluster quorum resource. The PAM controls all movement of the active designations between a database's copies. The PAM also performs the functions of the SAM role on the local system (detecting local database and local Information Store failures).

So, what’s SAM up to then?

The SAM provides information on which server hosts the active copy of a mailbox database to other components of Exchange that are running an Active Manager client component (for example, RPC Client Access service or Hub Transport server). The SAM detects failures of local databases and the local Information Store. It reacts to failures by asking the PAM to initiate a failover (if the database is replicated). A SAM does not determine the target of failover, nor does it update a database’s location state in the PAM.

Replication, Replication, Replication..

In Exchange 2010, the Microsoft Exchange Replication service periodically monitors the health of all mounted databases. It also monitors the Extensible Storage Engine (ESE) for any I/O errors or failures. When the service detects a failure, it notifies Active Manager (PAM). Active Manager (PAM) then determines which database copy should be mounted and what is required to mount that database. In addition, it tracks the active copy of a mailbox database (based on the last mounted copy of the database) and provides this tracking information to the RPC Client Access component on the Client Access server to which the client is connected.

When the lights go out in database world...

When a failure occurs that affects a replicated mailbox database, the PAM initiates failover logic and selects the best available database copy for activation. The PAM uses up to ten separate sets of criteria when locating the best copy to activate. Before the selection criteria are applied, a process called attempt copy last logs (ACLL) runs. Exchange 2010 has been enhanced to deal with multiple database copies, and it recognizes which copy is the best source for copying log files. ACLL makes parallel remote procedure calls to each Mailbox server in the DAG that hosts a copy of the mailbox database to check whether the server is available and healthy, and to examine the value of LogInspectorGeneration for the database copy. The mailbox database copy with the highest value for LogInspectorGeneration is the best source for copying log files.
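The ACLL source choice described above can be pictured with a small Python sketch. This is only an illustration of the rule "highest LogInspectorGeneration among available, healthy servers wins"; it is not how Exchange implements it. The CopyProbe class, the best_log_source function, and the server names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CopyProbe:
    """Hypothetical result of one ACLL probe against a server hosting a copy."""
    server: str
    reachable: bool
    healthy: bool
    log_inspector_generation: int


def best_log_source(probes: List[CopyProbe]) -> Optional[CopyProbe]:
    """Return the reachable, healthy copy with the highest LogInspectorGeneration."""
    candidates = [p for p in probes if p.reachable and p.healthy]
    if not candidates:
        return None  # no server can supply the missing log files
    return max(candidates, key=lambda p: p.log_inspector_generation)


# Hypothetical probe results: MBX1 is the failed server and cannot be used.
probes = [
    CopyProbe("MBX1", reachable=False, healthy=False, log_inspector_generation=1500),
    CopyProbe("MBX2", reachable=True, healthy=True, log_inspector_generation=1498),
    CopyProbe("MBX3", reachable=True, healthy=True, log_inspector_generation=1495),
]
source = best_log_source(probes)
print(source.server if source else "no source available")
```

In this made-up example MBX2 is chosen, because it is the furthest-ahead copy among the servers that answered and reported healthy.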

After the ACLL process has completed, if all missing log files were copied from the selected best source, the database mounts without any data loss. This is known as a lossless failover. If the ACLL process is unsuccessful, the configured value for AutoDatabaseMountDial is consulted. For more information about AutoDatabaseMountDial, see Set-MailboxServer. If the number of lost logs is within the configured value for AutoDatabaseMountDial, the database is mounted. If the number of lost logs is outside the configured value for AutoDatabaseMountDial, the database isn't mounted until either the missing log files are recovered or an administrator explicitly mounts the database and accepts the larger data loss.
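As a rough illustration of the mount decision, here is a minimal Python sketch. The thresholds (Lossless = 0, GoodAvailability = 6, BestAvailability = 12 lost logs) are not stated in this post; they are the values I understand the Set-MailboxServer documentation to give for Exchange 2010, so check them against your own environment. The function name should_auto_mount is hypothetical.

```python
# Assumed AutoDatabaseMountDial thresholds (maximum number of lost logs tolerated).
# These values are an assumption based on the Set-MailboxServer documentation.
MOUNT_DIAL_MAX_LOST_LOGS = {
    "Lossless": 0,
    "GoodAvailability": 6,
    "BestAvailability": 12,
}


def should_auto_mount(lost_logs: int, mount_dial: str) -> bool:
    """Return True if the lost-log count is within the configured dial setting."""
    return lost_logs <= MOUNT_DIAL_MAX_LOST_LOGS[mount_dial]


# Four logs could not be copied during ACLL.
print(should_auto_mount(4, "GoodAvailability"))  # within the limit, mounts
print(should_auto_mount(4, "Lossless"))          # outside the limit, stays dismounted
```

When the check fails, the database stays dismounted until the logs are recovered or an administrator mounts it by hand, as described above.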

Which of my 300 database copies are you going to mount up then?? - Active Manager Best Copy Selection Criteria

When a failure affecting the active database occurs, Active Manager calls several sets of selection criteria to determine which database copy should be activated. Active Manager attempts to locate a mailbox database copy that has a status of Healthy, DisconnectedAndHealthy, DisconnectedAndResynchronizing, or SeedingSource, and that meets all of the following criteria:

  • It has a content index with a status of Healthy.
  • It has a copy queue length less than 10 log files.
  • It has a replay queue length of less than 50 log files.

If none of the database copies meet all of the preceding criteria, Active Manager tries to locate a database copy that meets the next set of criteria:

  • It has a content index with a status of Crawling.
  • It has a copy queue length less than 10 log files.
  • It has a replay queue length of less than 50 log files.

If none of the database copies meet all of the preceding criteria, Active Manager tries to locate a database copy that meets the next set of criteria:

  • It has a content index with a status of Healthy.
  • It has a replay queue length of less than 50 log files.

If none of the database copies meet all of the preceding criteria, Active Manager tries to locate a database copy that meets the next set of criteria:

  • It has a content index with a status of Crawling.
  • It has a replay queue length of less than 50 log files.

If none of the database copies meet all of the preceding criteria, Active Manager tries to locate a database copy that meets the next set of criteria:

  • It has a replay queue length of less than 50 log files.

If none of the database copies meet all of the preceding criteria, Active Manager tries to locate a database copy that meets the next set of criteria:

  • It has a content index with a status of Healthy.
  • It has a copy queue length less than 10 log files.

If none of the database copies meet all of the preceding criteria, Active Manager tries to locate a database copy that meets the next set of criteria:

  • It has a content index with a status of Crawling.
  • It has a copy queue length less than 10 log files.

If none of the database copies meet all of the preceding criteria, Active Manager tries to locate a database copy that meets the next set of criteria:

  • It has a content index with a status of Healthy.

If none of the database copies meet all of the preceding criteria, Active Manager tries to locate a database copy that meets the next set of criteria:

  • It has a content index with a status of Crawling.

If none of the database copies meet all of the preceding criteria, Active Manager tries to activate any database copy with a status of Healthy, DisconnectedAndHealthy, DisconnectedAndResynchronizing, or SeedingSource. If it can't find any database copies with this status, it isn't able to automatically activate a database copy.

In each of the preceding passes, if more than one database copy meets all of the criteria for that pass, the configured value for ActivationPreference is consulted, and the database copy with the lowest value is activated and mounted.
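The ten passes and the ActivationPreference tie-break can be summarised in a short Python sketch. It is a simplified model of the rules listed in this post, not Exchange's actual code; the DatabaseCopy class, the PASSES table, and select_best_copy are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional

ELIGIBLE_STATUSES = {
    "Healthy", "DisconnectedAndHealthy",
    "DisconnectedAndResynchronizing", "SeedingSource",
}


@dataclass
class DatabaseCopy:
    server: str
    status: str
    content_index: str
    copy_queue: int
    replay_queue: int
    activation_preference: int


# One tuple per pass, in order:
# (required content index status or None, require copy queue < 10, require replay queue < 50)
PASSES = [
    ("Healthy", True, True),
    ("Crawling", True, True),
    ("Healthy", False, True),
    ("Crawling", False, True),
    (None, False, True),
    ("Healthy", True, False),
    ("Crawling", True, False),
    ("Healthy", False, False),
    ("Crawling", False, False),
    (None, False, False),  # final fallback: any copy with an eligible status
]


def select_best_copy(copies: List[DatabaseCopy]) -> Optional[DatabaseCopy]:
    """Walk the passes in order; within a pass, lowest ActivationPreference wins."""
    eligible = [c for c in copies if c.status in ELIGIBLE_STATUSES]
    for index_status, need_copy_queue, need_replay_queue in PASSES:
        matches = [
            c for c in eligible
            if (index_status is None or c.content_index == index_status)
            and (not need_copy_queue or c.copy_queue < 10)
            and (not need_replay_queue or c.replay_queue < 50)
        ]
        if matches:
            return min(matches, key=lambda c: c.activation_preference)
    return None  # nothing can be activated automatically


copies = [
    DatabaseCopy("MBX2", "Healthy", "Crawling", copy_queue=2, replay_queue=10, activation_preference=2),
    DatabaseCopy("MBX3", "Healthy", "Healthy", copy_queue=25, replay_queue=5, activation_preference=3),
    DatabaseCopy("MBX4", "Failed", "Healthy", copy_queue=0, replay_queue=0, activation_preference=1),
]
best = select_best_copy(copies)
print(best.server if best else "no copy can be activated")
```

In this made-up example no copy satisfies the first pass (MBX3 has too long a copy queue and MBX4 has an ineligible status), so the second pass selects MBX2 even though its content index is still crawling.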