Let me touch on an interesting topic in this blog post: “dirty shutdown” recovery of DFS Replication (DFSR). A TechNet blog post reviewing “What is new in Windows Server 2008” includes a good description of the DFS Replication dirty shutdown recovery process. Related system event log entries, e.g. Event ID 2212, refer to the same event as an “unexpected shutdown”, and I will use that term for the rest of this blog post. This post extends the existing description of unexpected shutdown recovery and adds new details about the current behavior as of Windows Server 2012.
The DFS Replication service maintains state information pertaining to the contents of each replicated folder in a database on the volume that hosts the replicated folders. In this database, DFSR keeps track of file versions and other metadata that enables it to function as a multi-master file replication engine and to automatically resolve conflicts. The DFS Replication service is a consumer of the NTFS USN (Update Sequence Number) journal, which is a journal of updates to files and folders maintained by NTFS. Entries in this journal notify the DFS Replication service about changes occurring to the contents of a replicated folder. These notifications thus end up triggering replication activity. Every unique change occurring on the file system relating to a folder replicated by DFSR triggers the creation or update of a record in the DFSR database as well. DFSR also stores a “USN checkpoint” in the DFSR database to keep track of the last USN journal entry that it has consumed.
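To make the checkpoint idea concrete, here is a minimal Python sketch of a journal consumer. It is only a toy model: the class name, the record layout and the sample journal are hypothetical and do not reflect the real DFSR database schema or the NTFS journal format.

```python
from dataclasses import dataclass, field


@dataclass
class ToyReplicaDatabase:
    """Hypothetical stand-in for the DFSR database; not the real schema."""
    records: dict = field(default_factory=dict)  # path -> USN of the last recorded change
    usn_checkpoint: int = 0                      # last journal entry consumed so far

    def consume(self, journal):
        """Process journal entries newer than the checkpoint, in USN order."""
        for usn, path in sorted(journal):
            if usn <= self.usn_checkpoint:
                continue                  # already consumed on an earlier pass
            self.records[path] = usn      # create or update the record for this path
            self.usn_checkpoint = usn     # remember how far we have read


# Toy journal: (USN, path) pairs, invented for illustration.
journal = [(101, "docs/a.txt"), (102, "docs/b.txt"), (103, "docs/a.txt")]
db = ToyReplicaDatabase()
db.consume(journal)
db.consume(journal + [(104, "docs/c.txt")])  # only entry 104 is new on this pass
```

If the process is killed between updating a record and persisting the checkpoint, the two can disagree, which is the kind of inconsistency the recovery process has to deal with.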
Sometimes the database and the file system can get out of sync. Examples of such scenarios are an abrupt power loss on the server or an abnormal stop of the DFSR service for any reason. Another example is a volume hosting a replicated folder that loses power, gets disconnected or is forced to dismount. Any of these exception conditions can cause inconsistencies between the database and the file system, so they result in an unexpected shutdown of the DFSR database. Starting with Windows Server 2008, DFSR is designed to recover from these situations automatically, and this behavior continued through Windows Server 2008 R2.
In January 2012, Microsoft released a hotfix for Windows Server 2008 R2 that made the following changes (this is now also the default behavior of Windows Server 2012):
1. Change the default unexpected shutdown handling policy from auto-recovery to manual-recovery, so the default behavior requires a manual user approval to go ahead with unexpected shutdown recovery. This was done to allow a user to take a backup of existing replicated folders on the volume before the recovery operation.
2. Support manually resuming the unexpected shutdown recovery and replication of the replicated folder(s) in a volume, using a WMI method. The command(*) to do that is:
wmic /namespace:\\root\microsoftdfs path dfsrVolumeConfig where volumeGuid="<volume-GUID>" call ResumeReplication
3. Support setting the default behavior back to automatic unexpected shutdown recovery, as in Windows Server 2008. The command for this:
wmic /namespace:\\root\microsoftdfs path dfsrmachineconfig set StopReplicationOnAutoRecovery=FALSE
(*) One way to retrieve <volume-GUID> is via: dfsradmin RF List /RgName:<Replication Group-Name> /Attr:All
Note, however, that the helpful new event log entry (Event ID 2213) includes the entire command line, which you can simply copy and paste, e.g.:
To resume the replication for this volume, use the WMI method ResumeReplication of the DfsrVolumeConfig class. For example, from an elevated command prompt, type the following command:
wmic /namespace:\\root\microsoftdfs path dfsrVolumeConfig where volumeGuid="0D9806D1-AC1A-11E1-98C3-00155D4FBB00" call ResumeReplication
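If you obtain the volume GUID some other way and need to assemble the command yourself, a small helper can guard against typos and against the typographic quotes that sneak in when commands are copied from web pages. The Python sketch below is only an illustration: the function name and the validation pattern are my own assumptions, and it merely builds the command string shown above without running it.

```python
import re

# Plain 8-4-4-4-12 hexadecimal GUID, as in the event log example above.
GUID_PATTERN = re.compile(r"^[0-9A-Fa-f]{8}(-[0-9A-Fa-f]{4}){3}-[0-9A-Fa-f]{12}$")


def build_resume_command(volume_guid: str) -> str:
    """Return the wmic ResumeReplication command line for one volume (hypothetical helper)."""
    guid = volume_guid.strip().strip("{}")  # tolerate surrounding whitespace and braces
    if not GUID_PATTERN.match(guid):
        raise ValueError(f"not a valid volume GUID: {volume_guid!r}")
    # Straight double quotes only; curly quotes would break the WHERE clause.
    return ('wmic /namespace:\\\\root\\microsoftdfs path dfsrVolumeConfig '
            f'where volumeGuid="{guid}" call ResumeReplication')


print(build_resume_command("0D9806D1-AC1A-11E1-98C3-00155D4FBB00"))
```

The printed line can then be pasted into an elevated command prompt on the affected server.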
It turns out that the new recovery behavior also has an important implication for DFSR failover cluster deployments.
Let’s say in a 2-node cluster with nodes N1 and N2, you have set up a clustered file server ContosoFS and added a DFS replicated folder on that file server. This creates a DFS replicated folder resource, as part of the ContosoFS resource group – called a ‘clustered role’ in Windows Server 2012. For more details on DFSR clustering deployment, refer to Mahesh’s old blog post which is a pretty good read. At any given time, ContosoFS can be owned by only one node (say N1), which means DFS Replication service for the replicated folder also runs on N1.
Let’s say you move ContosoFS in a planned way to N2 – by using Failover Cluster Manager, or the Failover Cluster Windows PowerShell cmdlets. DFS Replication service also fails over to N2 in this case; however since this is a graceful failover, there is no DFSR unexpected shutdown recovery here. ContosoFS is now owned by N2.
Now let’s say that instead of performing a graceful failover, you power off N2. ContosoFS and the DFS Replication service fail over to N1 as expected. However, note that this is an unplanned failover. In this case, the DFS Replication service detects an unexpected shutdown of the database, logs the new event 2213 cited above, and then waits for manual intervention (the new default behavior). So unless you monitor for the new event and resume replication, your clustered DFS replicated folder is not highly available, because it remains offline waiting for someone to start the unexpected shutdown recovery operation manually. If you would rather have the DFS replicated folder auto-recover and be automatically highly available in unplanned failover scenarios, i.e. you want the Windows Server 2008 behavior, you should change the default behavior to auto-recovery on each of the cluster nodes (N1 and N2 in this example) and restart the DFS Replication service. This is a one-time configuration step.
So how does the unexpected shutdown recovery process work?
Let’s discuss how the unexpected shutdown recovery process works, and specifically why one might not want the auto-recovery behavior.
When the DFS Replication service is asked to resume replication and perform unexpected shutdown recovery, either via auto-recovery or via manual intervention, it performs the following steps:
1) The first thing DFSR does is check whether the “USN checkpoint” in the database is valid, by comparing the database against the referenced USN record in the journal. If the checkpoint itself is invalid, the entry for every file and folder in all replicated folders on the volume is examined for correctness by comparing it to the corresponding file or folder on the volume. This can take some time, depending on how many files are in the replicated folder(s). If, on the other hand, the checkpoint is valid, there is less need for cleanup: DFSR simply deletes the database entries made after the last checkpoint, because they are not reliable.
2) DFSR marks each of the file and folder database records with the “Initial Sync” fence value. Then it asks each replication partner for information about all changes that may have happened to the files in the replicated folder(s) on the affected volume; this information is called “version vectors” in DFSR parlance. There are two possible outcomes in this phase:
a. If the hash value for the local file matches the value returned by a remote replication partner for the same file, it means that the local file version is correct. In that case, DFSR resets the fence value for that file from “Initial Sync” to “Default” in the local DFSR database.
b. If the hash value for the local file does not match the value returned by a remote replication partner for the same file, it means that the local file version is not correct. The remote data is always considered authoritative in this case. So DFSR moves the local file to the <ReplicatedFolderPath>\DfsrPrivate\ConflictAndDeleted folder, and installs the remote version of the file in its place.
3) At the end, there may still be some files and folders with the “Initial Sync” fence value. This is the subset of files that exists only on the local machine and is not known to the remote replication partners. DFSR moves this subset to the <ReplicatedFolderPath>\DfsrPrivate\PreExisting folder. Finally, DFSR also cleans up entries in the local database that do not have valid hash values, and takes the DFSR volume management state out of unexpected shutdown.
Standard DFS replication mechanics resume at this point.
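The decision made for each local file in steps 2 and 3 can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the service's actual implementation: the function and enum names are hypothetical, and a single dictionary of partner hashes stands in for the version vector exchange with all replication partners.

```python
from enum import Enum


class Outcome(Enum):
    KEPT_IN_PLACE = "fence reset to Default, file left where it was"
    MOVED_TO_CONFLICT_AND_DELETED = "local copy moved aside, partner version installed"
    MOVED_TO_PRE_EXISTING = "unknown to partners, moved to PreExisting"


def classify_after_recovery(local_hashes, partner_hashes):
    """Map each local path to its outcome, given path -> hash dictionaries."""
    outcomes = {}
    for path, local_hash in local_hashes.items():
        if path not in partner_hashes:
            # Step 3: still fenced as "Initial Sync" at the end.
            outcomes[path] = Outcome.MOVED_TO_PRE_EXISTING
        elif partner_hashes[path] == local_hash:
            # Step 2a: the hashes match, so the local version is correct.
            outcomes[path] = Outcome.KEPT_IN_PLACE
        else:
            # Step 2b: mismatch, so the remote data wins.
            outcomes[path] = Outcome.MOVED_TO_CONFLICT_AND_DELETED
    return outcomes


# Invented sample data: the hash strings are placeholders, not real DFSR hashes.
local = {"report.docx": "aa11", "budget.xlsx": "bb22", "draft.txt": "cc33"}
partner = {"report.docx": "aa11", "budget.xlsx": "ff99"}
for path, outcome in classify_after_recovery(local, partner).items():
    print(f"{path}: {outcome.value}")
```

Note that nothing in this model ever lets the local copy overwrite the partner's copy, which is the point of the summary that follows.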
Let’s summarize the two most important resulting implications from the previous discussion:
a) The local copy of replicated folder data on the server going through unexpected shutdown recovery is never considered “authoritative”; remote data is considered more trustworthy wherever a local file version does not match that of a replication partner.
b) At the end of an unexpected shutdown recovery, a local file or a folder may end up in one of four states:
1. Left just where it was, if the local file or folder is identical to that on the remote replication partner, OR,
2. It may move to the DFSR-private ‘Pre-existing’ folder, if the local file or folder does not exist on a remote replication partner, OR,
3. It may move to the DFSR-private ‘Conflict And Deleted’ folder, if the local copy is different from that on the remote replication partner, OR,
4. It may move to the DFSR-private ‘Conflict And Deleted’ folder and then get purged. This is due to the quota size and high watermark configured on the ‘Conflict And Deleted’ folder. The “least recently used” content in the ‘Conflict And Deleted’ folder is purged when the high watermark is reached for the folder, until the folder size drops to the configured low watermark. The DFS Replication service enforces the configured size and the watermarks on this folder as a machine-local activity; this does not involve remote replication partners. You can read more about this in the TechNet article Staging folders and Conflict and Deleted folders.
These possible states (#2 through #4 above) are precisely the reason why the new Windows Server 2012 default behavior provides an opportunity for you to take a backup of the local data before unexpected shutdown recovery goes ahead. Depending on your application scenario, you are the best judge to determine if the local data is in fact accurate and therefore has business value.
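To see how a file can end up in state #4, here is a small Python sketch of the watermark-based purge. It is a toy model only: the function name is hypothetical, sizes are in arbitrary units, and the 90 and 60 percent watermarks are placeholder values chosen for illustration, so check the values actually configured on your server.

```python
def purge_conflict_and_deleted(files, quota, high_pct=90, low_pct=60):
    """Simulate the purge. `files` is a list of (name, size), least recently used first.

    Returns (kept, purged_names). The percentages are assumed, illustrative values.
    """
    kept = list(files)
    purged = []
    used = sum(size for _, size in kept)
    if used * 100 < quota * high_pct:
        return kept, purged            # below the high watermark: nothing to do
    while kept and used * 100 > quota * low_pct:
        name, size = kept.pop(0)       # drop the least recently used item first
        purged.append(name)
        used -= size
    return kept, purged


# Invented sample data: 950 of 1000 units used, which is above the 90% mark.
files = [("old.bak", 300), ("older.docx", 300), ("recent.xlsx", 200), ("newest.txt", 150)]
kept, purged = purge_conflict_and_deleted(files, quota=1000)
print("purged:", purged)
print("kept:", [name for name, _ in kept])
```

In this run the two oldest items are purged and usage falls to 350 units, which is below the low watermark. A file moved aside during recovery can therefore disappear for good once later conflicts push the folder over its high watermark, which is one more reason to back up first.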
Hope this discussion helped you understand DFS Replication, particularly the unexpected shutdown recovery mechanics, better.