Public Folder Problems in Exchange 2007 CCR - it's all about replication...

We see a number of support cases involving Public Folder databases that fail to mount after a failover on an Exchange 2007 SP1/SP2 Cluster Continuous Replication (CCR) cluster. The behavior is well documented, but it seems to be either misunderstood or simply unknown to many Exchange admins. For reference, here is our support stance:

 https://technet.microsoft.com/en-us/library/bb123996(EXCHG.80).aspx

"CCR and public folder replication are two very different forms of replication built into Exchange. Due to interoperability limitations between continuous replication and public folder replication, if more than one Mailbox server in the Exchange organization has a public folder database, public folder replication is enabled and public folder databases should not be hosted in CCR environments."

I want to take a few minutes to explain what can happen if you have a public folder database on a CCR cluster and you replicate it to another public folder server in your organization. Keep in mind that if you have a public folder database on CCR and another public folder database anywhere in your environment, you will always be replicating between the two, even if you don't add replicas of your folders. This is because the hierarchy of your public folder tree is replicated to all public folder servers, even when folder content is not.
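A quick way to see whether this applies to you is to count the public folder databases in the organization. The following is a minimal sketch for the Exchange 2007 Management Shell; the column selection is just one reasonable choice, and the output will vary by environment.

```powershell
# Run from the Exchange Management Shell (Exchange 2007).
# List every public folder database in the organization,
# including any hosted on pre-Exchange 2007 servers.
$pfDatabases = @(Get-PublicFolderDatabase -IncludePreExchange2007)
$pfDatabases | Format-Table Name, Server, StorageGroup -AutoSize

# More than one database means the hierarchy is being replicated,
# even if no folder has additional content replicas.
if ($pfDatabases.Count -gt 1) {
    Write-Host "Public folder replication is in effect ($($pfDatabases.Count) databases found)."
} else {
    Write-Host "Only one public folder database found; no public folder replication partners."
}
```

If one of the databases listed is hosted on your clustered mailbox server and the count is greater than one, the failover behavior described below applies to you.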

The issues occur when there is an unexpected failover of your CCR cluster. There are three possible CCR failover scenarios with regard to mounting public folder stores:

1. Scheduled Failover. If you move the clustered mailbox server (CMS) to the passive node via the GUI or via the Move-ClusteredMailboxServer command (and the Storage Groups are Healthy at the time), the public folder store will come online successfully. This assumes that no logs are lost during the failover.

2. Unscheduled Failover with no data loss. In this scenario, the active node of the CCR goes down unexpectedly and the CMS fails over to the passive node automatically. The public folder store will not mount in this scenario until the former active node is brought back online and all the logs for the storage group are available. Once the first node is back online, you should be able to run Restore-StorageGroupCopy on the second node (now active) and then mount the Public Folder store.

3. Unscheduled Failover with data loss. This scenario is the same as #2, except in this case either the original active node is unrecoverable or its log data is lost due to disk corruption, etc. Unfortunately, in this situation, you will not be able to mount the public folder store at all. When the failover occurs, the following event will be logged on the newly active node when it attempts to mount the public folder database:

Log Name: Application
Source: MSExchangeRepl
Date: 4/2/2010 9:50:28 AM
Event ID: 2094
Task Category: Action
Level: Warning
Keywords: Classic
User: N/A
Computer: CCRnode2.ferris.com
Description:
Clustered Mailbox Server: CMS1
Physical Server: CCRnode2
Storage group CMS1\PF contains a public folder database that will not be automatically mounted because there was data loss.
* The last log generated before the Move-ClusteredMailboxServer operation or failover was: 0
* The last log successfully replicated to the passive node was: 0
Attempts to copy to the last logs from the active node were not successful. Error code: The directory name \\CCRnode1\f4ce3b2b-b512-412b-ba09-131ed76090b1$ is invalid.

If you then attempt to mount the database manually from Exchange Management Shell, the following error will be returned:

[PS] C:\>mount-database pf

Mount-Database : Failed to mount database 'CMS1.contoso.com\pf' after a lossy failover occurred because of the current setting for AutoDatabaseMountDial. You must run Restore-StorageGroupCopy before you can mount the database.

This sounds logical, so let's just run Restore-StorageGroupCopy as directed. We'll even add the -Force switch since the source is not available. Now we get:

[PS] C:\>Restore-StorageGroupCopy pf -force

Restore-StorageGroupCopy : Invalid operation for Restore-StorageGroupCopy. Reason: The specified storage group (PF) cannot be restored. The database will remain dismounted because it is a public folder database and public folder replication is enabled.

So in this scenario there is no way to mount the existing database. The only option at this point is to create a new storage group and public folder store on the CCR cluster. To repopulate the database you could then either use public folder replication, or swap the original .edb file in for the new database's file. As you can imagine, this is a lot of trouble to go through for a single server failure.

The good news is that none of this occurs if you’re not doing public folder replication. This is why the guidance is to either (A) Create only one public folder store for your entire organization on the CCR cluster or (B) House your public folder stores on non-clustered mailbox servers and enable public folder replication. Either one is a valid scenario, and either one will give you redundancy. The big questions to answer are whether your CCR can handle the user load from all your public folder users and also whether your users are geographically dispersed and need a local public folder server.
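Whichever design you choose, planned moves of the CMS (scenario 1 above) are worth doing carefully. Here is a minimal sketch of checking copy health before a scheduled handoff. It reuses the names from the event shown earlier (CMS1, the PF storage group, CCRnode2) purely as placeholders, and the zero-queue check is my own conservative assumption rather than a documented requirement.

```powershell
# Run from the Exchange Management Shell (Exchange 2007 SP1 or later).
# Placeholder names taken from the example event above; substitute your own.
$storageGroup = 'CMS1\PF'
$targetNode   = 'CCRnode2'

# Check the replication status of the storage group copy.
$status = Get-StorageGroupCopyStatus -Identity $storageGroup
$status | Format-List Identity, SummaryCopyStatus, CopyQueueLength, ReplayQueueLength

# Only hand off when the copy is Healthy and no logs are waiting to be copied
# (assumption: an empty copy queue is a reasonable bar for "no logs lost").
if ($status.SummaryCopyStatus -eq 'Healthy' -and $status.CopyQueueLength -eq 0) {
    Move-ClusteredMailboxServer -Identity 'CMS1' -TargetMachine $targetNode `
        -MoveComment 'Scheduled maintenance handoff'
} else {
    Write-Warning "Storage group $storageGroup is not ready for a handoff; investigate before moving."
}
```

Move-ClusteredMailboxServer prompts for confirmation by default, so this will not move anything without an operator acknowledging it.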