Preventing large time offset problems

Greetings, Todd here and I wanted to take a few moments to talk to you about an issue that arises from time to time. I will start this time-related issue exploration with a worst case scenario.

The Primary Domain Controller Emulator (also known as the PDCe) in your forest root has a hardware issue which requires the replacement of the motherboard or even the replacement of the machine due to theft, fire, water based fire suppression system damage, etc, etc… The motherboard or machine is replaced and the machine is started.

Shortly after the motherboard or system replacement, you notice that AD replication is failing everywhere. The error you are receiving is:

Event Type: Error
Event Source: NTDS Replication
Event Category: Replication Event
ID: 2042
Date: 12/01/2008 Time: 1:13:153 AM
Computer: DALTX-DC00

Description: It has been too long since this machine last replicated with the named source machine. The time between replications with this source has exceeded the tombstone lifetime. Replication has been stopped with this source.

The reason that replication is not allowed to continue is that the two machine’s views of deleted objects may now be different. The source machine may still have copies of objects that have been deleted (and garbage collected) on this machine. If they were allowed to replicate, the source machine might return objects which have already been deleted.

Time of last successful replication:

2004-10-27 08:59:52

Invocation ID of source:

154ef845-f894-054e-88fc-a205dcbff605

Name of source:

Tombstone lifetime (days): 60

So here is the breakdown of our time-related AD replication failure: the replacement motherboard never had its time set in the BIOS, so the time the OS referenced was from 2003. It helps to know that the OS, at startup, reads the BIOS CMOS clock and sets the time within the OS to this value. When the machine booted, it read the BIOS time and set the current system time to that value, even though it was 5 years in the past.

Because this machine is the PDCe for the forest, the other DCs in the root domain and the PDCes of the other domains in the forest sync from it. So we have just managed to propagate a bogus time setting to all the other DCs in the forest.

For a detailed explanation of how the Windows Time hierarchy works, please review the TechNet documentation.

The PDCe at the root of the forest then syncs from its local or Internet time source and sets the time properly, and this new time setting is then propagated throughout the environment.

Every DC that replicated while the date was set back to 2003 recorded that date as its last replication time. Once a DC receives the current time, it checks, before replicating, when it last replicated. The last replication timestamp shows 2003, so as far as the machine is concerned it has not replicated for 5 years, which just slightly exceeds the 60 to 180 day tombstone lifetime.
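The check described above boils down to simple date arithmetic. Here is a minimal Python sketch of it, not the actual directory service logic; the function name is hypothetical, and the dates come from the sample event (the time of day of the event is left out).

```python
from datetime import datetime, timedelta


def exceeds_tombstone_lifetime(last_replication, now, tombstone_days=60):
    """Return True if the gap since the last replication is longer than
    the tombstone lifetime (hypothetical helper, for illustration only)."""
    return (now - last_replication) > timedelta(days=tombstone_days)


# "Time of last successful replication" from the sample event 2042.
last_ok = datetime(2004, 10, 27, 8, 59, 52)
# Date the sample event was logged (time of day omitted).
event_date = datetime(2008, 12, 1)

gap = event_date - last_ok
print(f"Gap since last replication: {gap.days} days")
print("Quarantined at 60 days: ", exceeds_tombstone_lifetime(last_ok, event_date, 60))
print("Quarantined at 180 days:", exceeds_tombstone_lifetime(last_ok, event_date, 180))
```

With a gap measured in years, it makes no difference whether the tombstone lifetime is 60 or 180 days; the partner is quarantined either way.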

Recovering from this AD replication error state can be ugly and time consuming, though we have methods to ultimately resolve it. Beyond the initial AD replication failure, caused by the quarantine of replication partners that have not replicated for longer than the tombstone lifetime, the chances are very high that you will experience lingering objects. So the operative word here is prevention.

What could we have done to prevent the DCs from experiencing a large time offset, or at least to limit it? Here are some ideas:

1. Set the motherboard BIOS time to the current date and time before booting the operating system by powering the machine up and selecting the BIOS or system configuration settings.

2. Transfer or even seize the PDCe FSMO role to a different machine before reintroducing the PDCe with the replaced motherboard to the environment.

3. Implement KB 884776, "How to configure the Windows Time service against a large time offset." This effectively prevents a machine from correcting its time when the offset falls beyond the hard upper and lower limits. If this is in place on the DCs, servers, and clients, this scenario would not be a big problem.
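To make the third idea concrete, here is a simplified Python model of the kind of limit check that KB 884776 configures. It is a sketch of the concept, not the Windows Time service's actual algorithm. The limits correspond to the MaxPosPhaseCorrection and MaxNegPhaseCorrection values of the W32Time service; the 48-hour figure and the function name are assumptions chosen for illustration.

```python
from datetime import datetime

# Illustrative limits in seconds (assumed value: 48 hours each way).
MAX_POS_PHASE_CORRECTION = 48 * 3600
MAX_NEG_PHASE_CORRECTION = 48 * 3600


def should_accept_time_sample(local_time, source_time,
                              max_pos=MAX_POS_PHASE_CORRECTION,
                              max_neg=MAX_NEG_PHASE_CORRECTION):
    """Return True if moving the local clock to source_time stays within
    the configured limits (hypothetical helper, simplified model)."""
    offset = (source_time - local_time).total_seconds()
    if offset >= 0:
        # The source is ahead of us: a positive correction.
        return offset <= max_pos
    # The source is behind us: a negative correction.
    return -offset <= max_neg


# A healthy DC in 2008 is offered the bogus 2003 time from the PDCe.
healthy_dc = datetime(2008, 12, 1, 1, 0, 0)
bogus_pdce = datetime(2003, 12, 1, 1, 0, 0)
print(should_accept_time_sample(healthy_dc, bogus_pdce))  # rejected

# A normal small drift of a few seconds is still corrected.
slight_drift = datetime(2008, 12, 1, 1, 0, 5)
print(should_accept_time_sample(healthy_dc, slight_drift))  # accepted
```

With limits like these in place, a five-year jump in either direction is refused, while ordinary drift is still corrected.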

One final comment concerning the circumstances in which this issue occurs: anything that can change the time on a DC can cause it. Examples include the BIOS on the motherboard being reset, the BIOS battery going bad, a poorly patched DC getting a virus which flips the time, a router or hardware-based time solution being used as the central Network Time Protocol (NTP) time source, etc…

Hopefully you will never experience this type of issue since with a little forethought and configuration you will be able to completely prevent an otherwise difficult situation.

- Todd Maxey