Ned here again. When setting up Distributed File System Replication (DFSR) on Windows Server 2003 R2 and Windows Server 2008, it’s fairly common to hear the following from a customer:
- “I set up DFSR a few hours ago, but it does not seem to be configured on all the servers”
- “I ran the DFSR Diagnostic health report and after hours it still says:
‘This member is waiting for initial replication for replicated folder FOO and is not currently participating in replication. This delay can occur because the member is waiting for the DFS Replication service to retrieve replication settings from Active Directory. After the member detects that it is part of replication group, the member will begin initial replication.’”
What’s up here?
How DFSR works
First, a little background. DFSR depends almost completely on Active Directory (AD) and a Domain Controller (DC) to get settings, topology, and all the other goo that configures replication. It pulls them from two places:
Within a global container for the domain, like:
And within each DFSR server’s container, like:
This means that for a DFSR configuration to take effect, the data must first have replicated to your DC’s. Furthermore, DFSR has what we call “DC affinity”. Once it finds a DC, the DFSR service will stick with that DC until the DFSR service is itself restarted. Yes, even if that DC is down or if it no longer exists. By default the DFSR service polls its DC every 60 minutes to pick up changes – if the data has not yet replicated to that DC, the changes will not take effect.
Finding the DC
So now we know the importance of AD to DFSR. When we get the ‘waiting to retrieve’ warning, what can we do to figure out what’s going on? Let’s go through this step by step.
There are a few ways we can determine which DC is being polled.
- Check the DFSR event log
If we open the Event Viewer snap-in on the server, navigate to the DFSR event log, and then search for the most recent 1206 event, we will see which DC the server is talking to:
The downside is that the event log may have already wrapped by the time you look. If you restart the DFSR service, a new event will be logged, but there’s a risk that the service will start talking to a different DC that isn’t having a problem. So your problem is (kind of) fixed, but you still don’t have a root cause.
- Use a WMI query
A more reliable way to determine the DC is by using the WMIC command. Open a CMD prompt as an administrator on the DFSR server and run:
WMIC /namespace:\\root\microsoftdfs path DfsrReplicationGroupConfig get LastChangeSource
This will return the DC you are talking to:
- Examine the DFSR debug logs
Finally, you can examine the DFSR debug logs. Go to %systemroot%\debug and open the DFSR<somenumber>.log file. Look for:
20080521 10:57:47.110 2972 CFAD 7256 Config::AdConfig::Connect Binding to dcAddr:\\10.10.0.10 dcDnsName:\\2003DC10.contoso.com
20080521 10:57:47.110 2972 CFAD 6586 Config::AdConfig::BindToAd Trying to connect. hostName:2003DC10.contoso.com
20080521 10:57:47.130 2972 CFAD 6605 Config::AdConfig::BindToAd Bound. hostName:2003DC10.contoso.com
20080521 10:57:47.130 2972 CFAD 6641 Config::AdConfig::BindToDc Try to bind. hostName:\\2003DC10.contoso.com domainName:<null>
20080521 10:57:47.150 2972 CFAD 6651 Config::AdConfig::BindToDc Bound. hostName:\\2003DC10.contoso.com domainName:<null>
There’s our domain controller. If it’s not in the latest log, you will need to unzip the older DFSR debug logs using a tool that understands the GZ format (such as 7-Zip, WinZip, or WinRAR), and then you can find the entry. As you can tell, the debug logs are probably the least friendly way to find the DC – but you’ll see later that they’re critical for troubleshooting.
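If you would rather not eyeball the log, the search can be scripted. Here is a minimal Python sketch that pulls the DC name out of the `BindToDc Bound.` lines. The regular expression is based only on the sample lines above, and the helper name `find_bound_dcs` is made up for illustration, so treat it as a starting point rather than a supported tool.

```python
import re

# Matches the "Bound." line written after DFSR binds to a DC.
# Assumption: the line layout is the same as in the sample log above.
BOUND_DC = re.compile(
    r"Config::AdConfig::BindToDc\s+Bound\.\s+hostName:\\\\(\S+)"
)


def find_bound_dcs(lines):
    """Return the DC host names from 'BindToDc Bound.' lines, in log order."""
    dcs = []
    for line in lines:
        match = BOUND_DC.search(line)
        if match:
            dcs.append(match.group(1))
    return dcs


# Two of the sample lines from the log excerpt above.
SAMPLE = [
    r"20080521 10:57:47.130 2972 CFAD 6641 Config::AdConfig::BindToDc Try to bind. hostName:\\2003DC10.contoso.com domainName:<null>",
    r"20080521 10:57:47.150 2972 CFAD 6651 Config::AdConfig::BindToDc Bound. hostName:\\2003DC10.contoso.com domainName:<null>",
]

if __name__ == "__main__":
    dcs = find_bound_dcs(SAMPLE)
    # The last entry is the most recent bind recorded in the log.
    print(dcs[-1] if dcs else "no BindToDc entry found")
```

To run it against a real log, pass the open file object instead of `SAMPLE`. Python’s standard `gzip` module can open the archived .gz logs in text mode, which saves the unzip step.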
Figuring out why DFSR doesn’t have the info from AD
When we talk to customers in this scenario, DFSR is always blameless – it’s really just exposing a deeper issue in Active Directory. Since we now know the DC in question, let’s run through the five most common reasons why we might be getting this warning:
1. AD replication latency
Active Directory replication is based on the theory of ‘multi-master loose consistency with convergence’. Intra-site replication on Win2003 DC’s takes only about fifteen seconds, but the default inter-site replication interval is 180 minutes (three hours).
So if I have a chain of sites connected like below and my original DFSR configuration was set on a DC in Site B, those three DC’s would have the change available for their DFSR customers within 30 seconds or so. But the DC in Site F might not see it for nine hours.
You have a few options here:
A. Wait patiently and enjoy some light reading.
C. Change your replication schedule and topology.
So this isn’t always an ‘issue’ per se – it’s just a fact of how your environment is configured.
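The arithmetic behind that nine-hour figure can be sketched in a few lines of Python. This is a rough worst-case model, not how the KCC or the replication scheduler actually computes anything: it assumes every inter-site hop waits one full default interval, and the hop count of three between Site B and Site F is an assumption chosen to match the figure above. The function name is made up for illustration.

```python
# Default inter-site replication interval quoted in the text.
INTER_SITE_MINUTES = 180


def worst_case_minutes(hops, interval_minutes=INTER_SITE_MINUTES):
    """Worst case: the change waits one full interval at every inter-site hop."""
    if hops < 0:
        raise ValueError("hops must be zero or more")
    return hops * interval_minutes


if __name__ == "__main__":
    # Assumption: three inter-site hops separate Site B from Site F.
    hours = worst_case_minutes(3) / 60
    print(hours)  # 9.0
```

Plug in your own hop count and site link interval to estimate how long a remote DFSR member might wait before its DC even has the configuration. The DFSR polling interval then adds to that.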
2. AD replication blocked due to network misconfigurations
It’s possible that your DC is not actually replicating at all. Historically, the most likely cause is a network misconfiguration: DNS resolution problems, firewall blocks, and so on. To step through the most common areas, check out:
Whatever you suspect, it’s always advisable to start with a REPADMIN.EXE /SHOWREPS on the DC just to get a line on what replication health looks like.
3. AD replication blocked due to topology misconfigurations
AD replication might be in a state where, if it just knew where its partners were, it could replicate fine. It’s common to find a variety of site topology missteps in environments, so make sure you run down:
Your event logs and DSSITE.MSC will tell the tale…
4. AD replication blocked due to lingering objects
Now we have moved past trivial configuration issues into much more insidious territory. Lingering objects are typically objects that exist in the read-only GC partition of a domain controller but no longer exist in the read-write source domain partition. This can happen when an administrator brings a DC back online after it has been shut off for months; the source objects that were deleted and tombstoned are no longer available and the old DC can’t be told about the deletions anymore, so it still holds ‘reanimated’ versions.
If ‘strict replication consistency’ is in place, that server is not allowed to replicate anymore until it’s fixed (this is a good thing – and if you don’t think so, I will be happy to tell you stories about lingering object cleanup on a forest with 20 domains and 3000 DC’s, all infected with LO’s). So you will want to follow:
5. AD replication blocked due to tombstone lifetime
Honestly, if you get this one, you should call us. Event ID 2042 (“It has been too long since this machine replicated”) is tricky to fix without having a frank discussion about the ramifications for your environment. While we do have some techniques, the cure is usually worse than the disease. In most circumstances, the best answer is to forcibly demote the DC because you (of course!) have several other DC’s that can handle the load in the meantime.
So you came here looking to fix DFSR and left having to fix Active Directory. That’s the funny thing about troubleshooting distributed systems; often the component throwing the errors isn’t actually at fault. In order for DFSR to function smoothly, it needs solid information from its domain controller – keeping that in mind will help you through all your days.
One last thing – I can already hear people yelling ‘USN Rollback!’, ‘Target principal name incorrect!’ and other more esoteric scenarios that might cause AD replication failures. We’re just trying to cover the 99% here, you AD veterans.
– Ned Pyle