Myth debunking: "Multiple forests help us reduce the risk of a forest-wide Active Directory failure"

More and more customers seem to be using "lower risk" as justification for their multiple-forest plans. It may seem reasonable at first glance, but something about it doesn't feel right to me. Since recognizing this pattern a couple of months ago, I've spent a lot of time mulling it over with numerous colleagues.

The first question that comes to mind: how would you decide on the number of forests? Is the right number two? Five? Ten? Fifty? A failure that impacts 50% of your users is probably as bad from a business perspective as one impacting 100% of your users, especially if the forests are divided along business unit or geographic boundaries. Even a 10% outage could be catastrophic. The model could work if every business function were covered by at least two forests, but that would probably result in a complex, difficult-to-manage environment with significant cross-forest resource access.

One of my customers recently proposed ten forests to reduce their risk of failure. Let's use this as an example and compare the ten-forest approach to a single forest.

A commonly used formula for quantifying risk is Probability x Impact = Exposure. In a single forest environment, the impact of a forest-wide failure is incredibly high. "Losing the forest" would mean downtime for practically everyone. I'm not sure exactly how to assign a dollar value to this. What I do know is that the cost of a 10% partial outage (one of the ten forests) is higher than the cost of a total outage divided by ten. If I had the choice between ten 10% outages and one 100% outage, I'd choose the latter. Based on my conversations with others, I estimate the cost of one such partial outage to be 20%-40% of the full outage cost.
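
To make that preference concrete, here is a minimal Python sketch of the arithmetic. It is only an illustration: the cost of a full outage is normalized to 1.0 (an assumption, since no dollar value is available), and the names `FULL_OUTAGE_COST`, `FOREST_COUNT` and `total_partial_cost` are made up for this example. The 20%-40% range is the rough estimate quoted above, not measured data.

```python
# Assumed unit: the cost of one full, forest-wide outage is normalized to 1.0.
FULL_OUTAGE_COST = 1.0

# Rough estimate from the text: one 10% partial outage costs
# 20%-40% of a full outage.
PARTIAL_COST_LOW = 0.20
PARTIAL_COST_HIGH = 0.40

# The ten-forest proposal used as the running example.
FOREST_COUNT = 10


def total_partial_cost(fraction_of_full: float, outages: int = FOREST_COUNT) -> float:
    """Combined cost of `outages` partial outages, each costing
    `fraction_of_full` of one full outage."""
    return fraction_of_full * FULL_OUTAGE_COST * outages


low = total_partial_cost(PARTIAL_COST_LOW)
high = total_partial_cost(PARTIAL_COST_HIGH)

print(f"One full outage:     {FULL_OUTAGE_COST:.1f}")
print(f"Ten partial outages: {low:.1f} to {high:.1f}")
```

Under these assumptions, ten partial outages cost roughly two to four times as much as a single full outage, which is why the single 100% outage is the less painful choice.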

What about the other side of the equation: probability? What would cause a forest outage? I'm assuming an enterprise environment with dozens of domain controllers, so a hardware failure alone won't take out the forest. Some sort of replicated corruption is a possibility, but it is so rare that I don't think its probability can be calculated accurately. The lion's share of forest problems are a direct result of administrator error. What factors affect the rate of administrator error?
- Environment complexity
- Administrative skill and experience
- Adherence to testing processes
- Automation vs. manual changes

Given the close tie between forest failure and administrator error, I've come up with the following logic:

- Scripting is goodness. It reduces administrative errors in production when coupled with a comprehensive testing process.
- More forests mean more administrative tasks. You can either handle this with more administrators or more scripting.
- If you use more administrators instead of more scripting, your probability of failure increases faster than the corresponding reduction in impact.
- If you use scripting, the impact in a multi-forest environment is similar to the impact in a single forest because the automated tasks will span all forests.
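
The logic above can be sketched with the Probability x Impact = Exposure formula. This is a toy model, not a measurement: the error probability `P_ERROR` is a purely hypothetical number, the partial impact is simply the midpoint of the 20%-40% estimate, and the manual scenario generously assumes that each of the ten forests is no more error-prone than the single forest would be (the argument above is that it would actually be worse).

```python
def exposure(probability: float, impact: float) -> float:
    """Exposure = Probability x Impact."""
    return probability * impact


# Hypothetical chance of a forest-breaking administrator error in some period.
P_ERROR = 0.01

# Impact normalized so that losing a single, company-wide forest is 1.0.
FULL_IMPACT = 1.0

# Assumed midpoint of the 20%-40% estimate for losing one of ten forests.
PARTIAL_IMPACT = 0.30

FORESTS = 10

# Scenario 1: one well-managed forest.
single_forest = exposure(P_ERROR, FULL_IMPACT)

# Scenario 2: ten forests managed by hand. Each forest carries its own chance
# of error (assumed equal to P_ERROR), so the expected exposures add up.
ten_forests_manual = FORESTS * exposure(P_ERROR, PARTIAL_IMPACT)

# Scenario 3: ten forests managed by shared scripts. One bad script reaches
# every forest, so the impact is back to a full outage.
ten_forests_scripted = exposure(P_ERROR, FULL_IMPACT)

for label, value in [
    ("single forest", single_forest),
    ("ten forests, manual", ten_forests_manual),
    ("ten forests, scripted", ten_forests_scripted),
]:
    print(f"{label:<22} exposure = {value / single_forest:.1f}x single forest")
```

Changing `P_ERROR` does not change the ratios. The comparison depends only on the number of forests and on how much a partial outage costs relative to a full one.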

And so we're back to square one. The multiple forest environment ends up being a riskier venture, or at best it results in the same risk as a well-managed single forest.