Active Directory Forest Recovery...

The helpdesk phone had been ringing incessantly all day. Many people throughout the AD forest were unable to log in to their respective domains. It seemed that accounts throughout the forest had somehow been deleted. John, tired from having been up all night watching "White and Nerdy", was called in to help identify what was going on. Fortunately, he had enabled auditing for account deletions after a recent problem. After some serious filtering, he was able to find the following event in the Security event log:

 

Event Type: Success Audit
Event Source: Security
Event Category: Account Management
Event ID: 630
Date: 1/17/2007
Time: 12:30:44 AM
User: Contoso\JuniorAdmin
Computer: DisgruntledXP
Description:
User Account Deleted:
Target Account Name: JustinTurner
Target Domain: Contoso
Target AccountID: Justin Turner []DEL:3f4567f2-f90b-493e-81a3-dcfc75596cd7
Caller User Name: JuniorAdmin
Caller Domain: Contoso

 

This was a little unsettling, to say the least. "JuniorAdmin" was the account of one of his junior network administrators, who had just been fired for getting them into that last mess. He quickly disabled the account, and then tried to work out what kind of mess they were in now. His heart sank into his stomach when he discovered that JuniorAdmin was a member of the Schema Admins and Enterprise Admins security groups...

 

I had planned on providing an in-depth discussion of forest recovery, and then realized that there is already more than enough information on this topic. Since I have already advertised this, I will go ahead and provide what I hope will serve as a good general overview, and then point you to a few good resources for the process. There is now a Server 2003 specific forest recovery whitepaper, but the process is unchanged from Windows 2000. There are some additional Server 2003 specific goodies, however (like repadmin /removelingeringobjects).

Before we dive into the process, I want to point out a couple of reasons why you might have to perform an Active Directory forest recovery.

There are a few reasons that I won't mention, but the two most common I see are:

1. The security of your directory has been compromised by a virus, a hacker, or a disgruntled employee.

2. A change was made to the schema which needs to be undone.

 

This really is a big deal, and it is not something you want to jump straight to without first consulting Microsoft PSS/CSS/EPS/Platforms Support (we've had so many different names, I don't remember the current one :-). The team you would be dealing with for this particular issue is Platforms Directory Services. We want to determine what caused the forest failure, and also to confirm that a forest recovery is the best recovery option. A full forest recovery is one of the last things you want to try, so it really is best to explore all other recovery options first.

The 500,000-foot overview of the process is:

1. Recover one DC in the forest root domain from backup first.

2. Recover one DC in each of the remaining domains from backup.

3. Restore additional DCs by promoting them via dcpromo.

What follows is a general overview of the process that is outlined in both the Windows 2000 and Server 2003 forest recovery whitepapers referenced earlier. Please refer to the appropriate whitepaper for the specific steps.

There are three major stages of a forest recovery:

Pre-Recovery, Recovery, and Post-Recovery

Pre-Recovery:

1. Determine the current forest structure/topology

2. Find one trusted backup to use per domain

3. Shut down, and if possible disconnect, all DCs in the forest

Recovery:

1. Isolate the server (unplug the network cable) and perform a system state restore. Choose the Advanced option to perform a Primary restore of Sysvol, but only for the first DC in a domain.

2. Verify that the DC was restored successfully after rebooting

3. Configure DNS

4. Disable Global Catalog (if enabled)

5. Raise RID pool by 100,000

6. Seize FSMO roles

7. Perform metadata cleanup of all other DCs in the forest root domain (also delete the DC computer objects for DCs in this domain that will not be restored from backup)

8. Reset the machine account password twice

9. Reset the krbtgt account password twice

10. Reset all trust passwords twice

11. Restore the first DC in each of the remaining domains from backup (perform Recovery steps 1-10 to recover one DC in each of the remaining domains)

As you restore each DC, you will want to point it to the recovered forest root DC for DNS.

12. Connect the restored DCs back to the network (before performing this step, ensure that no old DCs are still online)

13. Perform a full replica set sync of AD

14. Enable the forest root DC as a GC

15. Seize the schema master role on the forest root DC (if the schema master wasn't the DC that was restored)

16. Recover additional DCs in each of the domains using dcpromo
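Recovery step 5 is mostly arithmetic, so here is a minimal Python sketch of it. It assumes the usual layout of the RID Manager's rIDAvailablePool attribute: a 64-bit number whose upper 32 bits hold the highest RID the domain can ever hand out, and whose lower 32 bits hold the start of the next pool to be allocated. The function names and the sample value are mine, purely for illustration; follow the whitepaper for the supported way to read and write the real attribute.

```python
RID_POOL_INCREMENT = 100_000   # the amount recovery step 5 asks for
LOW_32_BITS = 0xFFFFFFFF


def split_rid_available_pool(value):
    """Split a rIDAvailablePool value into (next_rid, max_rid).

    Assumption: low 32 bits = start of the next RID pool,
    high 32 bits = highest RID the domain can allocate.
    """
    return value & LOW_32_BITS, value >> 32


def raise_rid_pool(value, increment=RID_POOL_INCREMENT):
    """Return the value with the low part raised by `increment`."""
    next_rid, max_rid = split_rid_available_pool(value)
    new_next_rid = next_rid + increment
    if new_next_rid > max_rid:
        raise ValueError("increment would exceed the maximum RID")
    # Re-pack: the high part is unchanged, only the low part moves forward.
    return (max_rid << 32) | new_next_rid


# Illustrative value only: low part 2100, high part 1073741823.
sample = 4611686014132422708
print(split_rid_available_pool(sample))
print(raise_rid_pool(sample))
```

The point of the jump is that the restored DC never hands out a RID that was already issued after the backup was taken, so any increment comfortably larger than the number of accounts created since the backup serves the same purpose.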

Post-Recovery:

1. Revert forest back to original DNS configuration

2. Redistribute FSMO roles

3. Enable additional Global catalog servers

4. Get a good system state backup from at least two DCs in each domain

 

As you can see, this is a very lengthy process. The whitepaper walks you through each step in detail. There is a good index in the paper that has step by step instructions for every single process as well.

Finally I just want to expand on a couple of the items listed above.

Some things to consider when choosing which DCs to restore:

You will only be restoring one DC per domain. The recovery process will go much quicker if the restored DC was a DNS server and was not a GC at the time the backup was taken. For some of you this may be an easy choice, as you may only be able to find one good backup. I find that in these situations, many people have trouble locating a decent system state backup (but maybe my view is skewed because the customers that have tested their disaster recovery plan don't call us?). Additionally, the process will go quicker if the DC that you restore in the forest root domain was the Domain Naming and/or Schema master. Selecting one that was a RID master will also help. If you are unable to locate a backup from one of these FSMO role holders, you will just need to seize the role after the server is restored. To help you out with this, there is a cool repadmin command that shows you the last time a DC's system state was backed up: repadmin /showbackup DCName
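If you are lucky enough to have several usable backups, the preferences above can be turned into a rough ranking. The Python sketch below is only an illustration: the class, the role names, the DC names, and the equal one-point weights are all assumptions of mine, not something the whitepapers prescribe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCandidate:
    """A DC with a trusted backup; all fields describe the DC at backup time."""
    name: str
    is_dns_server: bool
    is_global_catalog: bool
    fsmo_roles: frozenset = frozenset()


# Roles that save a seizure step later (hypothetical labels).
PREFERRED_ROLES = frozenset({"domain naming", "schema", "rid"})


def preference_score(candidate):
    """One point per favorable property; the weights are an assumption."""
    score = 0
    if candidate.is_dns_server:
        score += 1
    if not candidate.is_global_catalog:
        score += 1
    score += len(PREFERRED_ROLES & candidate.fsmo_roles)
    return score


def rank_candidates(candidates):
    """Return the candidates, most attractive restore target first."""
    return sorted(candidates, key=preference_score, reverse=True)


candidates = [
    BackupCandidate("DC-C", is_dns_server=False, is_global_catalog=True),
    BackupCandidate("DC-B", True, True, frozenset({"rid"})),
    BackupCandidate("DC-A", True, False, frozenset({"schema", "domain naming"})),
]
print([c.name for c in rank_candidates(candidates)])
```

In practice the age and trustworthiness of the backup will trump all of this; the score only helps break ties between backups you already trust.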

Don't try to shortcut this process by leaving out steps:  

For example, when it says to shut down and/or disconnect each DC, do exactly that. We want to ensure that a restored DC does not replicate in bad data from a DC that we forgot to (or couldn't) shut down. So at the very least, make sure that the servers you are restoring are disconnected from the network. Also make sure that you reset each of the listed passwords twice, and be very thorough with the metadata cleanup stage. Otherwise you will have a not-so-fun time troubleshooting why your DCs aren't replicating.

There is a typo several times in both whitepapers that greatly changes the meaning of the step:

"Delete server objects and computer objects for all domain controllers in the forest root domain that you are restoring from backup..."

This should read "...that you aren't restoring from backup". I will attempt to get this changed in the whitepapers.
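Put another way, the list of objects to delete is a simple set difference: every DC in the domain, minus the one you restored from backup. A tiny Python sketch, with made-up DC names and a hypothetical helper name:

```python
def dcs_to_clean_up(all_dcs_in_domain, restored_from_backup):
    """Return the DCs whose server and computer objects should be deleted.

    These are the DCs that are NOT being restored from backup.
    """
    return sorted(set(all_dcs_in_domain) - set(restored_from_backup))


# Hypothetical forest root domain with three DCs, one restored from backup.
print(dcs_to_clean_up(["DC1", "DC2", "DC3"], ["DC1"]))
```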

 

Repadmin is your friend:

There are a few steps where you will use various repadmin commands. Learning the repadmin syntax ahead of time will aid in the process. It is also very useful for day-to-day AD operations.

Some options that you will need to use:

/showbackup

/syncall

/showreps

/options

You may also end up having to use /add, /sync, and /removelingeringobjects. However, if you follow the step where it says not to restore a DC that was a GC (or just uncheck the GC option after the restore), then you shouldn't have to worry about lingering objects.
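If you like to script your checks, here is a minimal Python sketch that only builds the command lines for the four options listed above; it does not run anything. The helper name and the DC name are hypothetical, and I have deliberately left out the extra flags that some of these options accept, so check the repadmin help for the exact syntax on your version.

```python
# The options called out above; anything else is rejected on purpose.
RECOVERY_OPTIONS = ("/showbackup", "/syncall", "/showreps", "/options")


def repadmin_command(option, dc_name):
    """Build an argument list of the form: repadmin <option> <DCName>."""
    if option not in RECOVERY_OPTIONS:
        raise ValueError(f"unexpected repadmin option: {option}")
    return ["repadmin", option, dc_name]


# Print the commands you would run against a hypothetical restored DC.
for option in RECOVERY_OPTIONS:
    print(" ".join(repadmin_command(option, "DC01")))
```

On a machine that actually has repadmin installed, each list could be handed to subprocess.run, but I would rehearse that in a lab forest first.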

 

Well that's all I have to say about that. :-) I'll add more later if I think of something else that I left out.

 

Post any comments or questions you have about this or any other topic that I have blogged about.

Up next: Cluster service failure troubleshooting

 

Thanks for reading!

 

Justin

 
