Exchange 2007 - Standby clustering with pre-staged resources (part 1)

Recently I've worked with several Exchange 2007 customers that are leveraging storage replication solutions with a Single Copy Cluster (SCC) as part of their site resiliency / disaster recovery solution for Exchange data. As part of these implementations, customers are pre-staging clusters in their standby datacenters and creating Exchange clustered resources on those clusters.

In general, two configurations are typically seen:

1. The same clustered mailbox server (CMS) is recovered to a standby cluster.

2. An alternate CMS is installed and mailboxes are moved to the standby cluster.

In part 1 of this series, I will address the first method: recovering the original CMS to a standby cluster.

In part 2 of this series, I will address the second method.

First, let’s take a look at the topology.

In my primary site, I establish a two-node shared-storage cluster with NodeA and NodeB. In my remote datacenter, I establish a second two-node shared-storage cluster with NodeC and NodeD. Third-party storage replication technology is used to replicate the storage from the primary site to the remote site.

 

Figure 1 - Implementation prior to introduction of CMS

 

On the primary cluster, I install a CMS named MBX-1 in an SCC configuration and create my desired storage groups and databases. This in turn creates the associated cluster resources for the database instances (in Exchange 2007, each database has an associated clustered resource called a Microsoft Exchange Database Instance).
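As an illustration of this step, the storage group and database could be created from the Exchange Management Shell with commands similar to the following (the storage group name, database name, and file paths are placeholders for this post, not values from a real deployment):

New-StorageGroup -Server MBX-1 -Name SG1 -LogFolderPath "E:\SG1\Logs" -SystemFolderPath "E:\SG1\System"

New-MailboxDatabase -StorageGroup "MBX-1\SG1" -Name DB1 -EdbFilePath "F:\SG1\DB1.edb"

Mount-Database "MBX-1\SG1\DB1"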

From a storage standpoint, the disks connected to the primary cluster are in read-write mode and the disks connected to the standby cluster are in read-only mode.

 

Figure 2 - Implementation after introduction of CMS in primary site

 

Figure 3 - Example of database instances as seen in Failover Cluster Manager

 

After preparing the CMS in the primary site, the administrator prepares the standby site. As part of this preparation, the existing CMS is taken offline. Then, the administrator reverses the direction of storage replication, making the storage connected to the standby cluster R/W and the storage connected to the primary cluster R/O. Both storage solutions are synchronized so that they contain the same data.

Once storage synchronization has completed, the administrator uses the /recoverCMS process to recover MBX-1 to the standby cluster. The /recoverCMS process reads the CMS configuration data from Active Directory and then recreates the CMS and its resources on the standby cluster.

 

Figure 4 - Implementation after introduction of the CMS in the remote site

 

At this point the same CMS exists on two different clusters. After the standby CMS has been brought online and validated on the standby cluster, the CMS is moved back to the primary cluster and the direction of storage replication is again reversed, so that the storage connected to the primary cluster is R/W and the storage connected to the standby cluster is R/O.

Once storage synchronization has completed, the administrator brings the CMS on the primary cluster online.

Next, the administrator updates the redundantMachines property of the CMS to reflect the nodes in the primary cluster.

 

Figure 5 - Implementation after introduction of CMS in the remote site and activation of CMS in the primary site

 

Because these solutions are often used for site resilience, when a failure of the primary cluster or site occurs, the administrator will perform the following steps to activate the standby cluster.

· Ensure all CMS resources are offline on the primary site cluster

· Change storage from R/O to R/W in the remote site

· Update the redundantMachines property to reference the nodes in the standby cluster

· Bring the CMS online on the remote servers

Often these steps work without issue, but recently I've worked on some cases where this process fails.

Let’s take a look at some issues that may arise with this type of implementation.

1. Exchange was not designed to have the same resources exist simultaneously on two different clusters. Recovery using pre-staged resources is not a recommended recovery mechanism for Exchange servers (we'll talk about the recommended recovery process shortly).

 

2. Administrators sometimes fail to update the redundantMachines attribute of the CMS. Each CMS has a property called redundantMachines. This property is a list of the names of nodes that can take ownership of the CMS. In general, the /recoverCMS process will reset this property for a CMS when the CMS is recovered to a different set of nodes.

In this case, the resources are pre-staged and /recoverCMS is not used after the initial configuration. As a result, the administrator must manually set this property using the Set-MailboxServer cmdlet. If an administrator fails to do this, other cmdlets that depend on this attribute (like Start-ClusteredMailboxServer, Move-ClusteredMailboxServer and Stop-ClusteredMailboxServer) will fail.
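For illustration only, and assuming the property is exposed through a RedundantMachines parameter on Set-MailboxServer (verify the exact parameter name against your Exchange 2007 documentation), updating MBX-1 to point at the standby nodes in this example would look similar to:

Set-MailboxServer -Identity MBX-1 -RedundantMachines NodeC,NodeD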

 

3. Resource configuration on the standby cluster is static.

Each database on a CMS has an associated clustered resource. When pre-staging the standby cluster, you are copying the configuration that existed at that time. Often, the configuration of the CMS on the primary cluster will change over time. I have worked with customers who added storage groups and databases to a CMS on the primary cluster after the standby cluster was configured. This results in clustered resources missing from the standby cluster.

To resolve this problem, some administrators have attempted to manually create clustered resources for the missing database instances. Unfortunately, this is not supported, and it results in the administrator having to follow a process similar to the one I recommend below.

 

4. Issues when applying Exchange Service Packs

When applying Exchange service packs to a CMS, the final step is to run /upgradeCMS. In order for /upgradeCMS to be considered successful (which is defined as the upgrade process reporting success and the CMS watermark being cleared from the registry), all of the resources on the cluster must be brought online.

For the primary cluster this does not present any issues. However, it does for the standby cluster, where the following resources will not be able to come online:

· Physical Disk Resources – these resources in the remote site cluster are R/O and cannot be brought online for the cluster upgrade

· Network Name Resource – this would result in a duplicate name on the network

Therefore, /upgradeCMS will fail. To resolve this condition, an administrator must either take the primary cluster offline or isolate the standby cluster from the primary cluster in order to complete the upgrade.
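For context, the upgrade that fails here is the standard two-step service pack sequence also referenced at the end of this post: the program files are upgraded on each node, and then the CMS itself is upgraded.

setup.com /mode:upgrade

setup.com /upgradeCMS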

 

 

Obviously, this process can cause longer-term issues in the environment after its initial establishment. So, I want to outline a process that I've recommended in these environments. The first few parts of the process are the same as above:

1. In my primary site, I establish a two-node shared-storage cluster with NodeA and NodeB. In my remote datacenter, I establish a second two-node shared-storage cluster with NodeC and NodeD. Third-party storage replication technology is used to replicate the storage from the primary site to the remote site.

 

Figure 6 - Implementation prior to introduction of CMS

 

2.  On the primary cluster, I install a CMS named MBX-1 in an SCC configuration and create my desired storage groups and databases. This in turn creates the associated cluster resources for the database instances.

3.  From a storage standpoint, the disks connected to the primary cluster are in read-write mode and the disks connected to the standby cluster are in read-only mode.

 

Figure 7 - Implementation after introduction of CMS in primary site

 

4.  On the standby cluster, I prepare each node by installing and configuring the shared-storage cluster, but instead of performing a /recoverCMS operation, I install only the passive mailbox server role on each node using the command shown below. This process puts the Exchange program files on the system, performs the cluster registrations, and prepares the nodes to accept a CMS at a later time.
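setup.com /mode:install /roles:mailbox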

 

Figure 8 - Implementation after introduction of CMS in primary site and passive role installation on clustered nodes in remote site

 

At this point, all preparation for the two sites is complete. When a failure occurs and a decision is made to activate the standby cluster, I recommend that customers use the following procedure:

 

1.  Ensure that all CMS resources on the primary cluster are offline.

2.  Change the replication direction to allow the disks in the remote site to be R/W and the disks in the primary site to be R/O.

 

Figure 9 - Storage in remote site changed to R/W

 

3. Use the Exchange installation media to run the /recoverCMS process and establish the CMS on the standby cluster.

setup.com /recoverCMS /cmsName:<NAME> /cmsIPV4Addresses:<IPAddress,IPAddress>

Figure 10 - CMS recovery to passive nodes in remote site

 

4. Move disks into appropriate groups and update resource dependencies as necessary.

At this point, the resources have been established on the standby cluster and clients should be able to resume connectivity.
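If the CMS does not come online automatically at the end of the /recoverCMS operation, it can be started with Start-ClusteredMailboxServer (mentioned earlier in this post) and its state confirmed with Get-ClusteredMailboxServerStatus, using the example CMS name:

Start-ClusteredMailboxServer MBX-1

Get-ClusteredMailboxServerStatus -Identity MBX-1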

 

Assuming that the primary site will come back up and the original nodes are available, the following process can be used to prepare the nodes in the primary site.

1. Ensure that the disks and network name do not come online. This can be accomplished by keeping the nodes disconnected from the network.

2. On the node that shows as owner of the offline Exchange CMS group, run the command setup.com /clearLocalCMS. The setup command will clear the local cluster configuration from the node and remove the CMS resources. The physical disk resources will be maintained in a renamed cluster group.

 

Figure 11 - Removal of the CMS in the source site

 

3.  Ensure that storage replication is in place, healthy, and that a full synchronization of changes has occurred.

4.  Schedule downtime to accomplish the failback to the source nodes.

During this downtime, the following steps can be used to establish services in the primary site.

 

1.  Take the CMS offline in the remote site.
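For example, using the Stop-ClusteredMailboxServer cmdlet mentioned earlier (this assumes the cmdlet's -StopReason parameter; the reason text below is only a placeholder):

Stop-ClusteredMailboxServer MBX-1 -StopReason "Planned failback to the primary site"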

 

Figure 12 - CMS offline in remote site

 

2.  On the node owning the Exchange resource group in the remote site cluster, run setup.com /clearLocalCMS. This will remove the clustered instance from the remote cluster.

 

Figure 13 - Removal of the CMS resources from the remote site cluster

 

3.  Change the replication direction to allow the disks in the primary site to be R/W and the disks in the remote site to be R/O.

 

Figure 14 - Disks in primary site changed to R/W; disks in remote site changed to R/O

 

4.  Using the setup media, run the /recoverCMS command to establish the clustered resources on the primary cluster.

setup.com /recoverCMS /cmsName:<NAME> /cmsIPV4Addresses:<IPAddress,IPAddress>

 

Figure 15 - Recovery of CMS resources completed to primary site cluster

 

5. Move disks into appropriate groups and update dependencies as necessary. 

6.  Clients should be able to resume connectivity when this process is completed.

 

How does this address the issues that I’ve outlined above?

1. The /recoverCMS process is a fully supported method to recover a CMS between nodes.

2. The /recoverCMS process is responsible for updating the redundantMachines property of the CMS. This prevents the administrator from having to manually change this as resources are recovered between clusters.

3. The /recoverCMS process will always recreate resources based on the configuration information in the directory. If databases are added to the primary cluster, the appropriate resources will be populated on the standby cluster when /recoverCMS is run. Similarly, if the CMS runs on the standby cluster for an extended period of time, and additional resources are created there, they will be added to the primary cluster when it is restored to service.

4. Service pack upgrades can be performed without having any special configuration. On the primary cluster you follow the standard practice of upgrading the program files with setup.com /mode:upgrade and then upgrading the CMS using setup.com /upgradeCMS. The nodes in the standby cluster are independent passive role installations and can be upgraded by using setup.com /mode:upgrade.