Michael here again, and this time I would like to talk a little bit about Active Directory replication and Disaster Recovery sites.
Since the not-so-recent events of 9/11, many companies have invested time and money in designing and implementing Disaster Recovery solutions at a separate physical location, so that the organization can seamlessly fail over to that site and keep the business working (saying "seamlessly" after a disaster strikes is kind of a bad choice of words, but hey… don't shoot the messenger).
One aspect that needs to be considered is the oh-so-important but often overlooked matter of client logon. In the first part of this post we will discuss how clients logging on to the network are affected.
In the second part we'll review how to fail over clients when we have multiple sites with Domain Controllers and we want each client to fail over to the best site possible. In the third part we'll see the effect this may have on Domain Controller replication in the organization, and how to properly configure and test the failover scenarios.
Client logon scenario:
In our sample organization we have a hub site and a DR site both of which contain Domain Controllers. In addition we have several branch sites which don't contain any DCs and rely on the Automatic Site Coverage feature to provide the closest DC for authentication to clients. (More information on the Automatic Site Coverage can be found here - http://technet.microsoft.com/en-us/library/cc978016.aspx)
In this scenario the required behavior (and please note I say the required behavior and not expected behavior – that will be explained later) is for clients to authenticate to the Domain Controllers in the DR site in case the Domain Controllers in the Hub site have failed (or the site link from a specific Branch to the Hub site has failed).
So, we have clients located at a branch site relying on the Automatic Site Coverage feature of the Domain Controllers in order to find the closest DC.
In reality this would look similar to this:
Child-DC01 (located in the Hub site) is performing Automatic Site Coverage for the Branch and Branch2 sites, while Child-DC02, which is located in the DR site, is not.
Now comes the interesting part…
What happens when Child-DC01 fails?
Based on http://support.microsoft.com/kb/314861 (and http://technet.microsoft.com/en-us/library/cc759550(WS.10).aspx), the client would then fall back to the generic list of all Domain Controllers in the domain:
So in our "simple" scenario (and I say simple because there's more to it in a second) the client would fall back to that list and successfully find a Domain Controller in the domain. Since the only option left is the DC in the DR site, Child-DC02, we're good!
Now, you remember (well, it's just in the line above, if you don't then you have more serious things to worry about than this post… ) me saying there's more??
So here's more:
How do I make a client fail over to that DR site if I have a 3rd DC in another site?
In this scenario we have a Branch site which, just as in the previous example, is automatically covered by the DC in the Hub site. We also have a second branch site, Branch2, which does contain a DC (child-dc03), and the DRP site, which contains child-dc02:
Here we need to consider costs. Assuming BASL (Bridge All Site Links) is enabled, meaning all site links are transitive, these are the costs from the perspective of the Branch site:
Branch -> Hub = 100
Branch -> DRP = 110 (the site links are transitive, so we add the costs of Branch -> Hub and Hub -> DRP: 100 + 10)
Branch -> Branch2 = 200 (Branch -> Hub plus Hub -> Branch2: 100 + 100)
Obviously we would prefer to go where it's cheaper (cost ultimately reflects WAN link bandwidth, latency and the other considerations that go into choosing link costs).
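To make the cost arithmetic concrete, here is a minimal Python sketch that treats the three site links from the example as a graph and sums costs along the cheapest path, which is what transitive (bridged) site links amount to. The function name and data layout are my own illustration, not anything Active Directory exposes.

```python
import heapq

# Site link costs taken from the example above (each link works in both directions).
SITE_LINKS = {
    ("Branch", "Hub"): 100,
    ("Hub", "DRP"): 10,
    ("Hub", "Branch2"): 100,
}


def cheapest_costs(links, source):
    """Return the lowest cumulative link cost from source to every reachable site."""
    graph = {}
    for (a, b), cost in links.items():
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))

    best = {source: 0}
    queue = [(0, source)]
    while queue:
        cost, site = heapq.heappop(queue)
        if cost > best.get(site, float("inf")):
            continue  # stale queue entry, a cheaper path was already found
        for neighbour, link_cost in graph.get(site, []):
            new_cost = cost + link_cost
            if new_cost < best.get(neighbour, float("inf")):
                best[neighbour] = new_cost
                heapq.heappush(queue, (new_cost, neighbour))
    return best


costs = cheapest_costs(SITE_LINKS, "Branch")
for site in ("Hub", "DRP", "Branch2"):
    print(f"Branch -> {site} = {costs[site]}")
```

Run from the Branch site, this reproduces the three figures listed above, with DRP coming out cheaper than Branch2.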
But, based on what we saw previously, the client would get all DCs from the DNS query, including child-dc03, which we don't want the client to use.
Looking at a netmon trace the DNS result would be similar to:
This is the default Netlogon DC Locator behavior. If the DCs in the client's site fail (DCs performing Automatic Site Coverage for the site count as being in it), the client falls back to querying the generic list of DCs for the domain (_ldap._tcp.dc._msdcs.domain.name).
To resolve this situation we can use a solution that was documented long ago:
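As a rough illustration of that lookup order, the sketch below builds the two SRV record names involved: the site-specific one first, then the generic one. The domain name child.example.com and the function name are hypothetical, and the code only formats names; it performs no DNS queries and is not how Netlogon itself is implemented.

```python
def locator_srv_queries(domain, site=None):
    """Return the SRV names to try, in order: site-specific first, then generic."""
    queries = []
    if site:
        # Site-specific record, also registered by DCs covering the site automatically.
        queries.append(f"_ldap._tcp.{site}._sites.dc._msdcs.{domain}")
    # Generic record listing every DC in the domain that registers it.
    queries.append(f"_ldap._tcp.dc._msdcs.{domain}")
    return queries


# Hypothetical domain name, used for illustration only.
for name in locator_srv_queries("child.example.com", site="Branch"):
    print(name)
```

The second name is the generic list mentioned above, and it is the one that hands child-dc03 to the client.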
As explained in the Branch Offices Guide (http://technet.microsoft.com/en-us/library/cc749944.aspx) we can prevent the domain controllers at the branch site from registering the generic SRV records.
The recommended configuration in a branch office deployment is as follows:
On all branch office domain controllers, add all entries that do not have "AtSite" as part of the mnemonic, to the value of the registry key, except the DsaCname.
On hub domain controllers, do not use the registry key. This allows the domain controller to register all records.
Creating the registry key on the branch DC (child-dc03) with the value of:
Would result in having only the Hub and DR DCs (child-dc01 and child-dc02) listed in the generic list of Domain Controllers for the domain:
Which leaves the client with only one option if the Hub site DCs fail… the DR site DCs.
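The selection rule quoted from the guide (everything without "AtSite" in the mnemonic, except DsaCname) can be sketched as a simple filter. I am assuming the registry key in question is Netlogon's DnsAvoidRegisterRecords value, and the mnemonic list below is only an illustrative subset; check the full list in Microsoft's documentation before building a real value from it.

```python
# Illustrative subset of Netlogon record mnemonics (an assumption, not the full list).
SAMPLE_MNEMONICS = [
    "Ldap", "LdapAtSite",
    "Dc", "DcAtSite",
    "Kdc", "KdcAtSite",
    "Gc", "GcAtSite",
    "DsaCname",
]


def branch_avoid_list(mnemonics):
    """Keep the generic (non-site) mnemonics, excluding DsaCname, per the quoted rule."""
    return [m for m in mnemonics if "AtSite" not in m and m != "DsaCname"]


avoid = branch_avoid_list(SAMPLE_MNEMONICS)
print(avoid)
```

Everything the filter keeps is a generic record that the branch DC should stop registering, while its site-specific records and its DsaCname record stay in DNS.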
That's it for Part 1. In Part 2 we'll talk about failover scenarios for client sites which DO contain Domain Controllers, but those have failed.