What to do with FSMO roles...

We recently hired a new engineer to a team which manages some of the internal MS environments...  We were discussing FSMO role placement and he sent me mail (snippet below slightly edited) which I thought was interesting...

The reason why we separated the roles at my last company was due to the FSMO role seizure process. You are correct, although the server is still a single point of failure, we can mitigate this single point of failure by placing the forest roles on one box and the domain roles on another. In the event that we unexpectedly lose a DC that is either a forest or domain FSMO role holder, the process of seizing the roles is minimized (fewer roles to seize). Also, it had been our experience that the forest roles aren't really used that often. You are correct, FSMO roles are still a single point of failure, however, unless we really need to perform any forest related “stuff”, the single point of failure (from a forest FSMO perspective) is a non-issue. This is not the case with the domain FSMO roles, specifically the PDCE. At my last company, we felt that due to the PDCE functions it was necessary to place the domain FSMO roles on a separate box...

I wanted to share this because it reminded me of a FSMO-related interview question which I've used in some variation or another:

Suppose you're paged in the middle of the night and told that one of the 150 domain controllers in your single domain forest crashed. Your first thought is likely "So what, I'll deal with it in the morning," but then you remember it's the one holding all 5 FSMO roles. If you could only pick one FSMO role to seize, which one would let you go back to sleep without worrying about the next day?

I've asked this question to many people...the large majority of whom answered, "The Schema Master, because without the schema the AD can't function."... Hopefully they aren't reading this blog from whichever other job they landed...

So back to the whole FSMO single-point of failure and redundancy thing...

I figured there were 2 possible reasons that they arrived at the idea that separating FSMO roles based on the forest/domain division was logical:

  1. There was some sort of fault tolerance between FSMO roles which could be preserved in a failure
  2. There was some urgency (specifically user impact) to getting a role holder back online immediately should a failure occur

The first reason is obviously false.  FSMO stands for "Flexible Single Master Operations" with the emphasis on "single"...the whole point of these roles is that even though Active Directory is a distributed system, there are just some things that can only be done in one place at a time.  So let's just take the generally accepted knowledge that each FSMO role provides specific functionality which only exists in that role.

The second reason takes a bit more thought, but what really happens when a FSMO role holder fails?  Looking at each role, what the impact of it being offline is, and the urgency:

Schema Master – Schema updates are not available – These are generally planned changes, and the first step when doing a schema change is normally something like "make sure your environment is healthy".  There isn't any urgency if the schema master fails; having it offline is largely irrelevant until you want to make a schema change.

Domain Naming Master – No new domains or application partitions can be added – This sort of falls into the same "healthy environment" bucket as the schema master.  I don't know of anyone who has just randomly decided to add a new domain to a forest without much thought or planning...of course, then again, I don't know all that many people either...  You might wonder why I mentioned app partitions there as well...personal experience.  When we upgraded the first DC to a beta Server 2003 OS which included the code to create the DNS application partitions, we couldn't figure out why they weren't instantiated...until we realized that the server hosting the DNM was offline (being upgraded) at the same time.  Sure enough, it came online and there they were...  But I've never said we were perfect here...

Infrastructure Master – No cross domain updates, can't run any domain preps – Domain preps are planned (again)...But no cross-domain updates.  Hmmm...that could be important if you have a multi-domain environment with a lot of changes occurring...but wait...the IM tasks are throttled to run over a 2 day period (by default), so how much urgency does that really imply?  I guess you'd have to call it as you see it in your environment but it's probably not 3am urgency...for my buddy the new engineer, he's only working in single domain forests anyways, so urgency = zero.

RID Master – New RID pools unable to be issued to DCs – This gets a bit more complicated, but let me see if I can make it easy.  Every DC is initially issued 500 RIDs.  When it gets down to 50% (250) it requests a second pool of RIDs from the RID master.  So when the RID master goes offline, every DC has anywhere between 250 and 750 RIDs available (depending on whether it's hit 50% and received the new pool).  So the urgency question is how long will it take your environment to exhaust the RIDs on a given DC?  My guess is that in most environments, this isn't that urgent either.  Oh yeah, and don't forget that if you do seize the RID master during a failure...that's an automatic flatten and rebuild of the old role holder; you can't bring it back online.
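To put numbers on that RID question, here's a minimal sketch of the arithmetic.  It only uses the figures quoted above (500-RID pools, a refill request at 50%); the function names and the "new accounts per day" rate are made up for illustration and nothing here talks to a real directory.

```python
# Illustrative arithmetic only. The pool size and threshold mirror the figures
# quoted in the post; they are not read from a live domain controller.

POOL_SIZE = 500          # RIDs handed to a DC per pool, per the text
REFILL_THRESHOLD = 0.5   # a DC requests a new pool once 50% of the current one is used


def rids_available_range(pool_size=POOL_SIZE, threshold=REFILL_THRESHOLD):
    """Return (worst_case, best_case) RIDs a DC may hold when the RID master goes offline."""
    remaining_at_threshold = int(pool_size * (1 - threshold))
    worst_case = remaining_at_threshold              # just about to request a new pool
    best_case = remaining_at_threshold + pool_size   # just received a new pool
    return worst_case, best_case


def days_until_exhaustion(rids_left, new_principals_per_day):
    """Hypothetical helper: days before a DC runs dry at a steady creation rate."""
    if new_principals_per_day <= 0:
        return float("inf")  # nothing is being created, so the pool never runs out
    return rids_left / new_principals_per_day


if __name__ == "__main__":
    worst, best = rids_available_range()
    print(f"RIDs available per DC: between {worst} and {best}")
    # Assumed rate of 10 new accounts per day on one DC, purely for illustration.
    print(f"Worst case at 10/day: {days_until_exhaustion(worst, 10)} days")
```

Even in the worst case (250 RIDs left), a DC creating 10 security principals a day has 25 days of headroom, which is why this one probably doesn't qualify as a 3am problem either.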

PDC – Time, logins, pw changes, trusts – So we made it to the bottom of the list, and by this point you've figured that the PDC has to be the most urgent FSMO role holder to get back online...the rest of them can be offline for varying amounts of time with no impact at all...so what about this one?  Yes, you should get the PDC back online whenever you can, but it's not even something that I'd jump out of bed to do...let's put it on the "first thing in the morning" list.  Time sync is important, but w32time does a pretty good job and nothing's going to diverge enough between today and tomorrow to impact you...users may see funky behavior if they changed their password, but replication will probably have completed before they call the help desk so nothing to worry about, and trusts go back to that whole "healthy forest" thing again...  The biggest impact we see internally at Microsoft from the PDC being offline is on all of the applications which were written in the NT 4.0 timeframe and are biased towards it.  Now that's something to consider.
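If it helps to see the whole rundown in one place, here's a rough sketch that captures the list above as data and sorts it.  The impact text comes straight from the descriptions above, but the 0-2 urgency scores are my own shorthand (0 = "until you next need it", 1 = "keep an eye on it", 2 = "first thing in the morning"), not anything official, so adjust them to how you'd call it in your environment.

```python
from typing import NamedTuple


class RoleImpact(NamedTuple):
    role: str
    scope: str                  # "forest" or "domain"
    offline_impact: str         # summarized from the role descriptions above
    urgency: int                # assumed score for a single domain forest
    urgency_multi_domain: int   # assumed score when there are multiple domains


# Urgency values are illustrative assumptions, not measured or documented figures.
ROLES = [
    RoleImpact("Schema Master", "forest", "schema updates unavailable", 0, 0),
    RoleImpact("Domain Naming Master", "forest",
               "no new domains or application partitions", 0, 0),
    RoleImpact("Infrastructure Master", "domain",
               "no cross-domain updates, no domain preps", 0, 1),
    RoleImpact("RID Master", "domain", "no new RID pools issued to DCs", 1, 1),
    RoleImpact("PDC", "domain", "time, logins, pw changes, trusts", 2, 2),
]


def triage(roles, multi_domain=False):
    """Return role names ordered from most to least urgent to bring back online."""
    def score(entry):
        return entry.urgency_multi_domain if multi_domain else entry.urgency
    return [entry.role for entry in sorted(roles, key=score, reverse=True)]


if __name__ == "__main__":
    for name in triage(ROLES):
        print(name)
```

With these scores, the PDC lands at the top of the list for both single and multi-domain forests, and the two forest roles land at the bottom.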

So when it really comes down to it, is there any benefit to separating the forest and domain roles onto separate servers?  Probably not...is there any harm in it?  Nahh...let's just chalk it up to "operational preference" since the guys who are watching this stuff day to day need to be comfortable with the way the environments are configured.

Pop Quiz Time:

Raise your hand, if when your phone rings in the middle of the night and you get that call...you transfer the PDC role and go back to sleep...

...

...

Now keep your hand in the air if you also reconfigured the server that you transferred the role to, so that it's authoritative for time.  I think I found a topic for my next blog...

If I don't see you before then, Merry Christmas, Happy Holidays...or like that commercial says, Merry Chrismahanakwanzaka.