What happens in a Journal Wrap?

FRS is a multi-master replication system that takes care of replicating the contents of Sysvol between all DCs in the domain (it can also replicate normal data, but we're primarily interested in Sysvol replication in this blog entry).

With proper care and maintenance, post-SP2 FRS on W2k3 is pretty stable and happily hums along as long as there isn’t an external condition, such as a network outage or disk problems, that causes it to break down (assuming the data you're replicating isn't completely unsuitable for replication, like .PST files, profile data or content that changes frequently).

The most frequent FRS issue is where a Journal Wrap occurs; let’s take a closer look at what happens during a Journal Wrap under the hood.

 

The way FRS works is that it has an internal database listing all the files and folders it is replicating, each of which has a globally unique ID (GUID). The database also contains a pointer to the last NTFS disk operation (in the USN Journal/NTFS Journal) that the FRS service processed.

If a user changes a file or folder on a disk, the following happens:

1) the operation is picked up by NTFS and an entry is made in the NTFS Journal

2) FRS monitors the NTFS Journal for changes and notes that a change has been made to that file

3) FRS keeps a record of the last NTFS Journal event that it processed and uses it to check whether it has already processed this event

4) If it hasn’t processed it already, it looks at whether it is a file that it should replicate

5) If it should be replicated, the file goes into the normal process of staging, replicating, etc.

6) FRS updates the entry in its database that records the last NTFS Journal event it processed, so it won’t consider that event again
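The steps above can be sketched as a toy model in Python. This is only an illustration of the bookkeeping, not how FRS is actually implemented: the class names, the `process_journal` function and the sample paths are all made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class JournalRecord:
    usn: int   # sequence number of the NTFS Journal entry
    path: str  # file or folder that changed


@dataclass
class FrsDatabase:
    replicated_paths: set           # paths FRS is responsible for (hypothetical)
    last_processed_usn: int = 0     # pointer to the last journal entry handled
    staged: list = field(default_factory=list)


def process_journal(db, records):
    """Walk journal records in order and stage the changes FRS has not seen yet."""
    for record in records:
        # Step 3: anything at or before the stored pointer was already handled.
        if record.usn <= db.last_processed_usn:
            continue
        # Steps 4-5: only files FRS should replicate go on to staging.
        if record.path in db.replicated_paths:
            db.staged.append(record.path)
        # Step 6: move the pointer forward so this record is not considered again.
        db.last_processed_usn = record.usn


# Hypothetical sample data: one replicated file, one unrelated file.
db = FrsDatabase(replicated_paths={r"C:\replica\policy.txt"}, last_processed_usn=4)
journal = [
    JournalRecord(4, r"C:\replica\policy.txt"),   # already processed
    JournalRecord(5, r"C:\replica\policy.txt"),   # new, should be staged
    JournalRecord(6, r"C:\elsewhere\notes.txt"),  # new, but not replicated
]
process_journal(db, journal)
print(db.staged, db.last_processed_usn)
```

Note that the pointer moves forward even for records FRS does not replicate, so every journal entry is looked at exactly once.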

Now…let’s simplify things a bit.

- Our disk contains one folder and one file (E:\test and E:\test\test.txt)

- Our NTFS Journal has a size of 10 entries (the default NTFS Journal size in real life is ~512 MB, depending on your OS/SP level)

- Our FRS database contains three entries

o a GUID for E:\test

o a GUID for E:\test\test.txt

o a reference to the last NTFS Journal entry we processed (let’s say #4)

Normal operations:

- someone makes a change to test.txt

o the NTFS Journal is updated to #5

o FRS notes that the NTFS journal says that a change has been made to test.txt and it sees that it hasn’t processed that change

o Stage/Replicate and update the FRS database to reflect that we have processed that NTFS Journal entry.

Now, an Admin stops the FRS service for 30 minutes….

- Someone makes 10 changes to test.txt

o The NTFS Journal is updated 20 times (say, two entries per change) and, counting from our starting point of #4, is now at #24. Remember that the log only holds the last 10 entries, so it has to wrap around

o FRS is stopped so it isn’t monitoring the NTFS Journal log

At this point, we have changes on the disk which FRS isn’t aware of. FRS still knows the last NTFS Journal entry that it processed and it will compare this with the current NTFS Journal the next time it restarts.

The next time the FRS service starts, it sees that it has missed NTFS operations on the disk: it last processed NTFS operation #4, but the NTFS Journal is now at #24 and only goes back 10 entries (#15-#24), so operations #5-#14 are no longer in the journal.

This is when FRS complains that it has reached a Journal Wrap state: the NTFS Journal log has wrapped around and FRS no longer knows the current state of things on the disk.
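The wrap in this simplified example can be reproduced with a few lines of Python. This is a toy model under the same assumptions as the text (a journal that holds only 10 entries, last processed entry #4); `detect_journal_wrap` is a hypothetical helper, not an FRS function.

```python
from collections import deque

JOURNAL_CAPACITY = 10  # toy size from the example, not the real NTFS default


def detect_journal_wrap(last_processed_usn, journal):
    """Return (first, last) missing entry numbers, or None if nothing was lost."""
    oldest_available = journal[0]
    next_needed = last_processed_usn + 1
    if next_needed < oldest_available:
        # The entry FRS needs next has already been overwritten.
        return (next_needed, oldest_available - 1)
    return None


# The journal holds entries #1-#4 and FRS has processed up to #4.
journal = deque(range(1, 5), maxlen=JOURNAL_CAPACITY)
last_processed_usn = 4

# FRS is stopped while 20 more entries (#5-#24) are written.
for usn in range(5, 25):
    journal.append(usn)  # the oldest entries fall off once the journal is full

print(list(journal))                                     # entries #15 to #24
print(detect_journal_wrap(last_processed_usn, journal))  # (5, 14)
```

Had FRS been restarted while entry #5 was still in the journal, the helper would have returned `None` and FRS could simply have caught up.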

The impact of this on an affected DC is that FRS will not set the SysvolReady registry value that tells the Netlogon service all is well. Sysvol will therefore not be shared out, and the DC will not be able to authenticate users fully until the Journal Wrap condition has been resolved.
Manually sharing out Sysvol or setting the SysvolReady registry value to 1 are not valid ways to resolve this issue, because they do not address the real problem.

For FRS to recover from a Journal Wrap, you basically have to start from scratch: reset the FRS database and resume tracking the NTFS Journal from its current position.
This means either:

- Replicating in data from an existing inbound partner (the d2 or non-authoritative FRS restore approach)

- Making your own data authoritative and letting everyone else replicate from you (the d4 or authoritative FRS restore approach)

The d2 approach is fairly simple to perform. It does, however, require a good network connection to the inbound replication partner, and the time it takes depends on the amount of data to be replicated versus the capacity of the link.
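As a rough back-of-the-envelope aid, the data-versus-link trade-off can be estimated like this. The function name, the 2 GB / 1 Mbit/s figures and the assumption that only half the link is usable are all made up for illustration; the estimate ignores compression, protocol overhead and replication schedules.

```python
def estimate_transfer_hours(data_bytes, link_bits_per_second, usable_fraction=0.5):
    """Rough time needed to pull data_bytes across a link, in hours.

    usable_fraction is an assumed share of the link that replication can use.
    """
    if link_bits_per_second <= 0 or not 0 < usable_fraction <= 1:
        raise ValueError("link speed must be positive and usable_fraction in (0, 1]")
    seconds = (data_bytes * 8) / (link_bits_per_second * usable_fraction)
    return seconds / 3600


# Hypothetical example: 2 GB of replicated data over a 1 Mbit/s WAN link.
hours = estimate_transfer_hours(2 * 1024**3, 1_000_000)
print(f"{hours:.1f} hours")  # 9.5 hours
```

If the number that comes out is measured in days rather than hours, that is a hint to look at a better-connected inbound partner before starting.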

On the other hand, this may not always be sufficient, and you can find yourself forced to go with the d4 option. The d4 approach should always be a last resort: it is a time-consuming operation that requires careful planning and co-ordination between all DCs, which will be more or less inoperative during that time because the FRS service has to be stopped on each of them and only restarted gradually as the operation progresses. This matters especially for DCs, as they will have a hard time servicing users without a proper Sysvol being present.

For a full description of the d2/d4 BurFlags values and how to use them, see KB 290762.

Further reading:

Troubleshooting journal_wrap errors on Sysvol and DFS replica sets
http://support.microsoft.com/kb/292438

Using the BurFlags registry key to reinitialize File Replication Service replica sets
http://support.microsoft.com/kb/290762

How to rebuild the SYSVOL tree and its content in a domain
http://support.microsoft.com/kb/315457

Monitoring and Troubleshooting the File Replication Service
http://www.microsoft.com/windowsserver2003/technologies/storage/dfs/tshootfrs.mspx

Why is placing the Sysvol directory on a separate partition a good practice?
http://www.microsoft.com/technet/abouttn/flash/tips/tips_091404.mspx

Troubleshooting File Replication Service
http://technet.microsoft.com/en-us/library/bb727056.aspx