ISTG: what happens when it fails?

I had an interesting investigation the other day with a customer of mine into the ISTG (Inter-Site Topology Generator) role and how it fails over.

In the customer's scenario they switch off the ISTG and several other servers in a very busy site to simulate an outage. This site is a main hub site in a very large enterprise forest and has in excess of 200 sites hanging off it. Many of these sites are empty; many have just 1 or 2 Domain Controllers in them.

So what should happen?

When the server that holds the ISTG role is unavailable for a period of time, there is a built-in automated process which the servers within the site go through to reallocate the role to another server. The check for the existence of an ISTG effectively happens every 15 minutes as part of the KCC process.

So in the example I am about to make up, we have 10 Domain Controllers in one site, all servicing the same domain.

DC1, DC2, DC3, DC4, DC5, DC6, DC7, DC8, DC9, DC10 - all are Windows 2003 SP1 and we are running at 2003 FFL (Forest Functional Level)

DC1 is the ISTG

DC3 is the Bridgehead server to a second Site

For the failover test we isolate part of the network, which means that

DC1, DC3, DC5, DC6 and DC9 are all isolated.

DC2, DC4, DC7, DC8 and DC10 are still on the live network and can see the second site (network-wise). However, we now have no correct replication links, and the ISTG is also unavailable because it is part of the group that has been isolated.

So what is the process by which the ISTG is reallocated, and how long should it take?

1. Generate a list of all domain controllers in the same site, in ascending order by GUID, and, by evaluating the msDS-Behavior-Version attribute on the NTDS Settings object of each DC, determine which ones have a version number greater than or equal to .NET Interim Forest Mode.

2. With this list the KCC determines which election algorithm to use by evaluating whether we are at 2003 or 2000. In my example all the DCs are W2003, so it uses the 2003 algorithm once it has ascertained that we are at the correct forest level.

3. The next check reads the interSiteTopologyGenerator attribute. In this example it returns DC1, which is no longer available as it has been isolated.

4. The next check determines whether that DC (DC1) is in the list of valid DCs from step 1. In my example it is.

The last synchronisation time is then checked and compared to the following parameter:

interSiteTopologyFailover

CN=NTDS Site Settings,CN=SITENAME,CN=Sites,CN=Configuration

Default -> “not set” = 120min (W2003)

Eventually the 120 minute period will expire, and then:

5. We need to change the ISTG, as too much time has passed since the last successful synchronisation. (If the ISTG could not be determined from the NTDS Site Settings object, start at the beginning of the DC list created in step 1. If the ISTG could be determined but was deemed invalid, start at that DC's position in the list. From either position, for each interval that has passed, skip one domain controller in the list.)

6. The new ISTG is set.
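To make steps 1 to 6 a little more concrete, here is a minimal Python sketch of the selection logic as I read it from the description above. It is not the real KCC code. The function name, the made-up GUIDs, the plain integer GUID comparison and the wrap-around at the end of the list are all my own assumptions, and the msDS-Behavior-Version filtering from step 1 is left out.

```python
from uuid import UUID


def pick_istg_candidate(dc_guids, current_istg, minutes_since_update,
                        failover_minutes=120):
    """Return the DC that would claim the ISTG role (hypothetical helper).

    dc_guids: dict of DC name -> GUID string (the step 1 list, unsorted).
    current_istg: name read from interSiteTopologyGenerator, or None.
    minutes_since_update: time since the ISTG last refreshed its claim.
    """
    # Step 1: order the DCs by GUID (assumption: plain integer comparison).
    ordered = sorted(dc_guids, key=lambda name: UUID(dc_guids[name]).int)

    # Steps 3-4: start at the current ISTG if it is a valid DC in the list,
    # otherwise start at the beginning of the list.
    start = ordered.index(current_istg) if current_istg in ordered else 0

    # Step 5: skip one DC for every full failover interval that has passed.
    intervals = minutes_since_update // failover_minutes

    # Assumption: wrap around when the end of the list is reached.
    return ordered[(start + intervals) % len(ordered)]


# Made-up GUIDs chosen so that the sort order is DC1, DC2, DC3, DC4.
dcs = {
    "DC1": "00000000-0000-0000-0000-000000000001",
    "DC2": "00000000-0000-0000-0000-000000000002",
    "DC3": "00000000-0000-0000-0000-000000000003",
    "DC4": "00000000-0000-0000-0000-000000000004",
}
print(pick_istg_candidate(dcs, "DC1", 0))    # no interval has passed yet
print(pick_istg_candidate(dcs, "DC1", 120))  # one interval has passed
```

With no interval elapsed the current ISTG keeps the role; after one full interval the next DC in GUID order becomes the candidate.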

Where we have a potential weakness is the last statement in step 5:

“If the ISTG could be determined but was deemed invalid, start at that DC's position in the list. From either position, for each interval that has passed, skip one domain controller in the list.”

The algorithm works fine if left to the defaults, with a maximum failover time of 2 hours (120 mins) to the next Domain Controller in the GUID list. However, if several Domain Controllers are switched off and they sit next to each other in the GUID list created in step 1, the transfer of the ISTG role could take considerably longer, because each unavailable candidate costs a further failover interval.
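A rough way to see how bad this can get is to count the intervals until the candidate lands on a DC that is actually reachable. The Python sketch below assumes a worst-case GUID order in which the five isolated DCs from my example (DC1, DC3, DC5, DC6, DC9) sit next to each other. Both that ordering and the helper name are hypothetical; the real order depends on the GUIDs in your forest.

```python
def minutes_until_live_istg(ordered_dcs, current_istg, live_dcs,
                            failover_minutes=120):
    """Estimate when a reachable DC takes over the ISTG role.

    ordered_dcs: DC names already sorted by GUID (the step 1 list).
    live_dcs: set of DC names that are still reachable.
    Returns (minutes, dc_name), or None if no DC in the list is live.
    """
    start = ordered_dcs.index(current_istg)
    count = len(ordered_dcs)
    # One candidate is skipped for every failover interval that passes.
    for intervals in range(1, count + 1):
        candidate = ordered_dcs[(start + intervals) % count]
        if candidate in live_dcs:
            return intervals * failover_minutes, candidate
    return None


# Hypothetical GUID order with the isolated DCs grouped together.
guid_order = ["DC1", "DC3", "DC5", "DC6", "DC9",
              "DC2", "DC4", "DC7", "DC8", "DC10"]
live = {"DC2", "DC4", "DC7", "DC8", "DC10"}

print(minutes_until_live_istg(guid_order, "DC1", live))
# (600, 'DC2'): five intervals of 120 minutes, i.e. 10 hours
```

Under these assumptions the role only reaches a live DC after 10 hours, compared with 2 hours when the next DC in the list is available.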

One approach would be to shorten the failover time by lowering this value from its 120 minute default.

interSiteTopologyFailover

CN=NTDS Site Settings,CN=SITENAME,CN=Sites,CN=Configuration

Default -> “not set” = 120min (W2003)

Also, considering that the remaining Domain Controllers will be very busy recreating KCC connection objects and, as a consequence of that re-creation, carrying out VVJOINs, it would be worth investigating the use of the following:

Redundant connection mode or Branch Office Mode in Active Directory (AD)?

Normally, only one replication object is created per namespace between sites, which achieves the most efficient replication. In situations in which branch offices all connect to a hub location, if a domain controller (DC) at the hub goes down, all the remote locations must recalculate replication objects. This results in a huge number of changes, which won't fail back when the DC returns.

This mode requires two steps: Step one is to enable the redundant connection mode to have two connection objects to the hub location; the second step is to disable detection of failed connection objects because you're assuming a failed DC will be coming back so no need to modify the connection objects. You need to run the commands on all remote locations that will require the redundant connections.

C:\>repadmin /siteoptions /site:London +IS_REDUNDANT_SERVER_TOPOLOGY_ENABLED

Branch10

Current Site Options: (none)

New Site Options: IS_REDUNDANT_SERVER_TOPOLOGY_ENABLED

C:\>repadmin /siteoptions /site:London +IS_TOPL_DETECT_STALE_DISABLED

Branch10

Current Site Options: IS_REDUNDANT_SERVER_TOPOLOGY_ENABLED

New Site Options: IS_TOPL_DETECT_STALE_DISABLED IS_REDUNDANT_SERVER_TOPOLOGY_ENABLED
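The two flags shown in the repadmin output end up as bits in the options attribute of the site's NTDS Site Settings object. If you want to sanity-check a raw options value, the Python sketch below decodes it. The bit values (0x008 and 0x400) are what I understand them to be from Microsoft's documentation of the nTDSSiteSettings options, so treat them as an assumption and verify them before relying on this; the helper names are mine.

```python
# Assumed bit values for the two site options used above (verify first).
SITE_OPTION_BITS = {
    "IS_TOPL_DETECT_STALE_DISABLED": 0x008,
    "IS_REDUNDANT_SERVER_TOPOLOGY_ENABLED": 0x400,
}


def encode_site_options(names):
    """Combine option names into a single integer options value."""
    value = 0
    for name in names:
        value |= SITE_OPTION_BITS[name]
    return value


def decode_site_options(value):
    """List the known option names whose bit is set in an options value."""
    return [name for name, bit in SITE_OPTION_BITS.items() if value & bit]


both = encode_site_options(SITE_OPTION_BITS)
print(both, decode_site_options(both))
```

Only the two options discussed here are in the table, so any other bits set on the attribute are simply ignored by the decoder.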

Information on KCC Branch Office Mode

KCC Branch Office mode was created to provide an easily managed redundant topology for branch office deployments. This mode reduces VV join load on FRS by maintaining a relatively static topology between hub and branch DCs. KCC Branch Office mode can be enabled on a per-site basis after the Forest Functional Level has been raised to Windows Server 2003. Under this mode the KCC will build 2 redundant connections between a DC in a branch site and 2 DCs in the hub site. KCC Branch Office connections are created on preferred bridgeheads if defined; otherwise a random DC will be selected.
Also, when these connections are made they are given staggered schedules. Once the KCC creates these connections it treats them as though they were created manually and disables KCC failover as long as the metadata for the preferred bridgehead or randomly selected server remains in AD. Gracefully demoting a DC using DCPROMO or removing its metadata with NTDSUTIL "remove selected server" will cause the KCC to re-evaluate its redundancy requirements. A new DC will be considered in the redundant topology when promoted into the forest.