USN Rollback, Virtualized DCs and improvements in Windows Server 2012.

The USN rollback issue has caused hundreds of support calls and AD replication halts throughout the world, from the introduction of AD in Windows 2000 Server up to Windows Server 2008 R2.

Every DC maintains a table, ReplUpToDateVector (or Up-to-Dateness Vector), per Naming Context (NC or AD partition).
These tables record data about the local DC and its replication partners. For a particular partition, this data includes the uuidDSA (or DSA GUID), the usnHighPropUpdate (or High Watermark), and the timeLastSyncSuccess (the time stamp of the last successful replication from that replication partner).

When a change is made (i.e., an object is created or deleted, or an attribute of an object is modified), one or more attributes will have their Originating and Local USN incremented.

Example:

repadmin /showobjmeta * OU=USNROLLBACK,DC=contoso,DC=com

Additionally, the local DC updates its own entry in the ReplUpToDateVector table (UTDVEC table from now on).
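To make this bookkeeping concrete, here is a minimal Python sketch of a DC that stamps attribute metadata and advances the UTDVEC entry it keeps for itself on every originating write. The names ToyDC, AttrMeta and originating_write are hypothetical and only model the idea; this is not how the directory service is actually implemented.

```python
from dataclasses import dataclass, field


@dataclass
class AttrMeta:
    """Per-attribute replication metadata, loosely modeled on /showobjmeta output."""
    originating_dsa: str   # DC where the change was originally made
    originating_usn: int   # USN assigned by the originating DC
    local_usn: int         # USN assigned by the DC that holds this copy
    version: int           # bumped on every originating write


@dataclass
class ToyDC:
    """Hypothetical model of a DC holding a single naming context."""
    name: str
    highest_usn: int = 0
    objects: dict = field(default_factory=dict)   # dn -> {attr: AttrMeta}
    utdvec: dict = field(default_factory=dict)    # DSA name -> highest USN seen

    def originating_write(self, dn: str, attr: str) -> AttrMeta:
        # Every change consumes the next USN on the local DC.
        self.highest_usn += 1
        attrs = self.objects.setdefault(dn, {})
        previous = attrs.get(attr)
        meta = AttrMeta(
            originating_dsa=self.name,
            originating_usn=self.highest_usn,
            local_usn=self.highest_usn,
            version=(previous.version + 1) if previous else 1,
        )
        attrs[attr] = meta
        # The DC also advances the UTDVEC entry it keeps for itself.
        self.utdvec[self.name] = self.highest_usn
        return meta


dc1 = ToyDC("ContosoDC1", highest_usn=220290)
meta = dc1.originating_write("OU=USNROLLBACK,DC=contoso,DC=com", "description")
print(meta.originating_usn, meta.local_usn, meta.version, dc1.utdvec)
```

On the originating DC the Originating and Local USN are the same value; they only differ on the partners that later receive the change.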

screenshot of repadmin /showutdvec * dc=contoso,dc=com

 

Normal operation. In this case there is no outstanding replication from ContosoDC1 to ContosoDC2 (the values for ContosoDC1 match on both DCs). On the other hand, ContosoDC1 still has to replicate 66 changes from ContosoDC2 (220356 - 220290).

 

A replication partner then compares this with its own version of the table and requests from the source the changes that are higher than the High Watermark it has recorded.

If the USN for the DC on the replication partners is higher than the one the DC has for itself, you are dealing with a USN Rollback.

Example:

screenshot of repadmin /showutdvec ( relevant USNs highlighted)

 

USN Rollback. Note that ContosoDC1 "thinks" that ContosoDC2 has a higher "High Watermark" than it has in reality. So, without the USN rollback protection mechanism, the next 182 changes (220380 - 220198) originated on ContosoDC2 would be discarded by ContosoDC1.
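The comparison behind the two examples can be sketched in a few lines of Python. The function name compare_usn and its status labels are illustrative assumptions; the USN values are the ones quoted in the screenshot captions.

```python
from typing import Tuple


def compare_usn(partner_view: int, source_current: int) -> Tuple[str, int]:
    """Compare the USN a partner has recorded for a source DC with the
    USN the source DC currently holds for itself.

    Returns a (status, delta) pair, where delta is the number of USNs
    separating the two values.
    """
    if partner_view > source_current:
        # The partner knows about USNs the source has not issued yet: rollback.
        return "usn_rollback", partner_view - source_current
    if partner_view < source_current:
        # Normal case: the partner still has changes to pull.
        return "changes_pending", source_current - partner_view
    return "in_sync", 0


# Normal operation: ContosoDC1 knows ContosoDC2 up to 220290, ContosoDC2 is at 220356.
print(compare_usn(220290, 220356))   # ('changes_pending', 66)

# USN rollback: ContosoDC1 has 220380 recorded, ContosoDC2 is back at 220198.
print(compare_usn(220380, 220198))   # ('usn_rollback', 182)
```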

 

In that case the originating DC will log Replication Event ID 2095 in the Directory Service log and will disable inbound and outbound replication as a protection mechanism, in order to avoid further damage.

screenshot of event 2095

Without this safety valve, further changes made on the originating DC would not be replicated until its USN catches up with the USN known by its replication partners. Only then would replication start again, and any changes made in between would be lost forever.
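A short simulation shows why the changes in between are lost. It assumes a simplified pull model in which the partner only asks for USNs above the high watermark it has recorded. The function name replicate_without_protection and the figure of 200 new changes are made up for illustration, while the two USN values come from the rollback example.

```python
from typing import List, Tuple


def replicate_without_protection(source_usn: int, partner_watermark: int,
                                 new_changes: int) -> Tuple[List[int], List[int]]:
    """Simulate a rolled-back DC that keeps accepting writes with no
    rollback protection in place.

    Returns (replicated, lost) lists of the USNs assigned to the new changes.
    """
    # The rolled-back DC reuses USNs it had already handed out before.
    assigned = list(range(source_usn + 1, source_usn + new_changes + 1))
    # The partner only requests USNs above the watermark it remembers.
    replicated = [usn for usn in assigned if usn > partner_watermark]
    lost = [usn for usn in assigned if usn <= partner_watermark]
    return replicated, lost


replicated, lost = replicate_without_protection(
    source_usn=220198, partner_watermark=220380, new_changes=200)
print(len(lost), len(replicated))   # 182 18
```

Only the 18 changes whose USN exceeds 220380 reach the partner; the first 182 are silently skipped.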

NOTE: Ensure that "Allow replication with corrupt and divergent partners" is not in use, or this protection will be ignored.

In order to avoid this issue you should back up your DCs using a supported method, that is, an AD-aware backup application (NTBackup and Windows Backup are AD aware, as are other third-party backup applications).
What you should NOT do as a replacement for the applications above is restore AD from unsupported backup methods like:

Disk Mirroring
Cloning
VHD copies
VM snapshots
or any other cloning method that doesn't reset the DSA Invocation ID when an AD restore is executed.

Note:

The DSA invocation ID is reset once you restore AD using a supported method. Thus replication partners will update their UTDVEC tables with the new value for the restored DC. This doesn't happen when using unsupported methods.
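The effect of the invocation ID reset can be illustrated with a toy model in Python. Everything here is a simplification: Partner, RestoredDC and supported_restore are hypothetical names, and the UTDVEC is reduced to a dictionary keyed by invocation ID.

```python
import uuid


class RestoredDC:
    """Hypothetical DC that has just been rolled back to an older USN."""

    def __init__(self, invocation_id: str, usn: int):
        self.invocation_id = invocation_id
        self.usn = usn

    def supported_restore(self) -> None:
        # A supported restore gives the database a fresh invocation ID.
        self.invocation_id = str(uuid.uuid4())

    def write(self) -> tuple:
        # New changes are stamped with the current invocation ID and USN.
        self.usn += 1
        return (self.invocation_id, self.usn)


class Partner:
    """Hypothetical replication partner with a simplified UTDVEC."""

    def __init__(self, utdvec: dict):
        self.utdvec = dict(utdvec)   # invocation ID -> highest USN received

    def accepts(self, change: tuple) -> bool:
        invocation_id, usn = change
        # Unknown invocation IDs start at 0, so nothing from them is skipped.
        if usn <= self.utdvec.get(invocation_id, 0):
            return False
        self.utdvec[invocation_id] = usn
        return True


old_id = str(uuid.uuid4())
partner = Partner({old_id: 220380})

# Unsupported restore: same invocation ID, lower USN, so the change is ignored.
cloned = RestoredDC(old_id, usn=220198)
print(partner.accepts(cloned.write()))      # False

# Supported restore: new invocation ID, so the change is accepted.
restored = RestoredDC(old_id, usn=220198)
restored.supported_restore()
print(partner.accepts(restored.write()))    # True
```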

To fix the problem there are two supported methods:

1. Reinstall Active Directory on the affected Domain Controller.

Transfer any FSMO roles if needed.
Demote the DC.
Perform a metadata clean-up of all references to the DC.
Re-promote the DC.

2. Restore the System State.

If a valid system state backup was made before the DC was restored using an unsupported method, restore the system state from the most recent backup.

For more information on how AD replication works and USN rollback please refer to the following articles:

How the Active Directory Replication Model Works
https://technet.microsoft.com/en-us/library/cc772726(WS.10).aspx

Running Domain Controllers in Hyper-V
https://technet.microsoft.com/en-us/library/d2cae85b-41ac-497f-8cd1-5fbaa6740ffe(v=WS.10)#usn_and_usn_rollback

With Windows Server 2012 virtualized Domain Controllers, you can now restore snapshots without permanently damaging domain controllers.
While this does not prevent other issues for other technologies and applications, it does make domain controller virtualization safer.

A virtualized domain controller snapshot restore now resets the DC's unique Invocation ID.
It additionally discards the local RID pool and non-authoritatively restores the SYSVOL folder.
This means that accidentally restoring a snapshot is no longer an unsafe operation on domain controllers.

The following process describes how Virtualized DC (VDC) Safe Restore is achieved:

1. An existing virtual machine (VM) domain controller is restored from a snapshot on a hypervisor that supports VM-Generation ID (Windows Server 2012 Hyper-V, for instance).

This assumes that, when the snapshot was taken, the VM already had a VM-Generation ID stored in the msDS-GenerationID attribute (schema version 56) of its DC computer object.

2. The VM then reads the VM-Generation ID provided by the Hyper-V VMGenerationCounter driver and compares it with the value stored on its DC computer object.

If they do not match, it continues with the snapshot restoration operations.
Once restored, the Generation-ID set on the DC computer object (in AD) is updated to match the new ID provided by the hypervisor host.

If the hypervisor does not provide a VM-Generation ID for comparison, the guest will operate like a Windows Server 2008 R2 or earlier virtualized domain controller.

3. The Virtualized DC then:

Invalidates the local RID pool.
Sets a new DSA invocation ID.

4. Non-authoritative inbound replication is triggered from a replication partner. The DC requests changes starting at a USN that precedes the USN at which the local directory service was restored. The UTDVEC table of the destination DC is updated appropriately.

5. The virtualized DC synchronizes the SYSVOL:

If using FRS, it stops the NTFRS service and sets the BURFLAGS registry value (D2).
It then starts the NTFRS service, thus performing a non-authoritative restore of the SYSVOL.

If using DFSR, it stops the DFSR service and deletes the DFSR database files. It then starts the DFSR service, thus performing a non-authoritative restore of the SYSVOL.

6. The VM updates the msDS-GenerationID attribute on its own DC object to match the current hypervisor VM-Generation ID.
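The decision flow in the steps above can be summarized as a small Python sketch. The function safe_restore_actions and the action strings are invented for illustration and do not correspond to real APIs; None stands for a hypervisor that exposes no VM-Generation ID, and the sketch assumes the DC object already holds a stored value.

```python
from typing import List, Optional


def safe_restore_actions(stored_gen_id: str,
                         hypervisor_gen_id: Optional[str],
                         sysvol_engine: str = "DFSR") -> List[str]:
    """Return the ordered actions a virtualized DC takes at startup.

    stored_gen_id     -- value of msDS-GenerationID on the DC object
                         (assumed to exist already)
    hypervisor_gen_id -- VM-Generation ID exposed by the hypervisor, if any
    sysvol_engine     -- "FRS" or "DFSR"
    """
    if hypervisor_gen_id is None:
        # No VM-Generation ID: behave like Windows Server 2008 R2 or earlier.
        return ["no safeguards (legacy behavior)"]
    if stored_gen_id == hypervisor_gen_id:
        # IDs match: the VM was not restored from a snapshot.
        return ["normal startup"]

    actions = [
        "invalidate local RID pool",
        "set new DSA invocation ID",
        "non-authoritative inbound replication",
    ]
    if sysvol_engine == "FRS":
        actions.append("stop NTFRS, set BurFlags to D2, start NTFRS")
    elif sysvol_engine == "DFSR":
        actions.append("stop DFSR, delete DFSR database files, start DFSR")
    else:
        raise ValueError(f"unknown SYSVOL engine: {sysvol_engine}")
    actions.append("update msDS-GenerationID to " + hypervisor_gen_id)
    return actions


# Hypothetical IDs: the stored value differs from the one the hypervisor reports.
for action in safe_restore_actions("gen-A", "gen-B", "FRS"):
    print(action)
```

Keeping the generation ID update as the last action mirrors step 6: the stored value is only brought in line with the hypervisor once the safeguards have run.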

If you haven't gotten started with Windows Server 2012 yet, download it from:
https://technet.microsoft.com/en-us/evalcenter/hh670538.aspx

To finish this post on a personal note: of all the improvements in Windows Server 2012, this is my preferred feature. And I think that whoever imagined it deserves a big round of applause from all of us.

Feel free to leave your comments.

Cheers.