“Lag site” or “hot site” (aka delayed replication) for Active Directory Disaster Recovery support

Hi, Gary from Directory Services here, and today I’m going to talk about the concept of “lag sites” or “hot sites” as a recovery strategy. I recently had a case where the customer asked if the replication interval for a site link could be set higher than 10,080 minutes (7 days). The quick answer was that Active Directory only supports values from 15 up to 10,080 minutes, and the schedule is based on a week. If the replInterval attribute on the site link is manually set to something lower than 15, the minimum of 15 is used. If it is set to something higher than 10,080, the value is ignored and 10,080 is used.
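The clamping behavior described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this post, not code taken from Active Directory; the function and constant names are hypothetical.

```python
# Illustrative sketch only: models the replInterval limits described in the post.
# The names below are hypothetical and are not part of any Active Directory API.

MIN_REPL_INTERVAL = 15          # minutes, lowest supported value
MAX_REPL_INTERVAL = 7 * 24 * 60  # 10,080 minutes, one week


def effective_repl_interval(configured_minutes: int) -> int:
    """Return the interval that would actually be used for a configured value."""
    # Values below 15 are raised to 15; values above 10,080 are lowered to 10,080.
    return max(MIN_REPL_INTERVAL, min(configured_minutes, MAX_REPL_INTERVAL))


if __name__ == "__main__":
    for value in (5, 180, 20160):
        print(value, "->", effective_repl_interval(value))
```

In other words, asking for a 14-day interval (20,160 minutes) still gets you replication at least once a week, which is why a site link interval alone cannot produce a lag longer than 7 days.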

But the underlying question kept coming back to the recommendation of a latent “lag site”.

First let me give a quick definition of a lag site or hot site and its general intended purpose. A lag site is just an Active Directory site that is configured with a replication schedule of one, two or maybe three days out of the week. That way it will have data that would be intentionally out-of-date as of the last successful inbound replication. It is sometimes used as a quick way to recover accidentally deleted objects without having to resort to finding the most recent successful backup within the tombstone lifetime of the domain that has the data.

This sounds like a decent idea, in theory. However, Microsoft Support does not recommend a lag site as a disaster recovery strategy. Servicing products such as hotfixes and service packs do not recognize the quasi-offline state of a lag site DC. Monitoring software may also detect the state of a lag site DC as malfunctioning and attempt to re-enable it (or tell an unwitting administrator to do so). Microsoft makes no guarantees that servicing and monitoring products will not re-enable the Netlogon and KDC services in a lag site. In addition, other Microsoft products, such as Exchange Server, are not designed to operate in a lag site, and they may not function properly with lag site DCs.

The following lists some reasons why lag sites should not be relied upon as a disaster recovery strategy, especially in lieu of proper Active Directory System State backups:

Lag sites are not guaranteed to be intact in a disaster:

  • If the disaster is not discovered before the next replication occurs, the problem is replicated to the lag site, and the lag site cannot be used to undo the disaster. A lag site typically needs to be three days latent in order to cover situations that occur during the weekend, when visibility is low. However, this means that you are actually forced to ‘lose’ more changes than you would with a reliable daily backup of your domain controllers.
  • Thus, the administrator must act immediately when a disaster occurs: inbound and outbound replication must be disabled, and use of repadmin /force must be forbidden.
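To make the timing argument in the two points above concrete, here is a minimal sketch in Python. It assumes a simplified model in which the lag site holds a snapshot as of its last successful inbound replication; the function names and the example dates are hypothetical and do not come from any Active Directory tool.

```python
# Illustrative sketch only: a simplified timing model of a lag site.
# Function names and example dates are hypothetical.
from datetime import datetime, timedelta


def lag_site_is_intact(disaster_time: datetime, last_inbound_replication: datetime) -> bool:
    """The lag site still holds pre-disaster data only if it last replicated before the disaster."""
    return last_inbound_replication < disaster_time


def changes_lost_window(disaster_time: datetime, last_inbound_replication: datetime) -> timedelta:
    """Changes made in this window are absent from the lag site and would be lost on recovery."""
    return disaster_time - last_inbound_replication


if __name__ == "__main__":
    disaster = datetime(2024, 1, 6, 10, 0)           # Saturday morning deletion
    replicated_before = datetime(2024, 1, 5, 2, 0)   # lag site last replicated Friday 02:00
    replicated_after = datetime(2024, 1, 7, 2, 0)    # lag site replicated again Sunday 02:00

    print(lag_site_is_intact(disaster, replicated_before))
    print(changes_lost_window(disaster, replicated_before))
    print(lag_site_is_intact(disaster, replicated_after))
```

In this model, a deletion on Saturday that nobody notices until Monday has already reached a lag site that replicated on Sunday, and even in the good case the recovery gives up every change made since the last replication.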

Replicating from a lag site might have unrecoverable consequences:

  • Since a lag site contains out-of-date data, using it as a replication source may result in data loss depending on the amount of latency between the disaster and the last replication to the lag site.
  • If something goes wrong during recovery from a lag site, a forest recovery might be required in order to roll back the changes.

Lag sites pose security threats to the corporate environment:

  • For example, when an employee is fired from the company, his/her account is immediately deleted (or disabled) from Active Directory, but the account might still be left behind in the lag site. If the lag site domain controllers allow logons, this could potentially lead to unauthorized users with access to corporate resources during the lag site replication delay “window”.

Careful consideration must be put in configuring and deploying lag sites:

  • An Administrator needs to decide the number of lag sites to deploy in a forest. The more domains that have lag sites, the more likely one can recover from a replicated disaster. However, this would also mean increased hardware and maintenance costs.
  • An Administrator needs to decide the amount of latency to introduce. The shorter the latency, the more up-to-date and useful the data would be in the lag site. However, this would also mean that administrators must act quickly to stop replication to the lag site when a disaster occurs.

The above list is not exhaustive, and there could be other unseen problems with deploying lag sites as a disaster recovery strategy. It has always been strongly recommended that the best way to prepare for disasters such as mass deletions, mass password changes, etc. is to back up domain controllers daily and verify these backups regularly through test restorations.

Finally, keep in mind that testing your disaster recovery routine is vital, both before you begin to rely on it and regularly once it has become your recovery strategy. Surprises are never good when a disaster strikes.

Here are some links to Microsoft recommended recovery steps and practices:

840001 How to restore deleted user accounts and their group memberships in Active Directory – http://support.microsoft.com/kb/840001

Useful shelf life of a system-state backup of Active Directory – http://support.microsoft.com/kb/216993

Managing Active Directory Backup and Restore – http://technet2.microsoft.com/windowsserver/en/library/5d683eeb-e76c-46e9-92f4-fcb2a10f955f1033.mspx

Step-by-Step Guide for Windows Server 2008 AD DS Backup and Recovery – http://technet.microsoft.com/en-us/library/cc771290.aspx

Active Directory Backup and Restore in Windows Server 2008 – http://technet.microsoft.com/en-us/magazine/cc462796.aspx

– Gary Mudgett