Exchange Does Not Always Use Local GCs


Until last week I was under the impression that Exchange always uses local GCs (Global Catalog servers) and only uses out-of-site directory servers (GCs or DCs) if no local ones are available. I found out otherwise from working on the issue described below.


 


One of our customers had a server outage and Outlook clients could no longer connect to the Exchange server. In other words the server was hung and we had a critsit (critical customer situation case). When dealing with hangs, the first thing we normally ask customers to do is capture a few hang dumps a few minutes apart so that we can determine what the hung process is doing.


 


The customer dutifully sent us 3 hang dumps taken about 5 minutes apart. On opening the dumps, the first thing that stood out was that there were 6 threads with similar call stacks whose execution state remained completely unchanged in all three dumps. In just one second of execution a thread’s stack can change so drastically that it is almost unrecognizable as the same thread.  Put in context, these threads had stayed at the same point for millions of years--in processor time. They each had a stack that looked like this (the stack grows upwards from kernel32!BaseThreadStart and only the relevant parts are included):


 


00 3386dbb8 74fd1394 NTDLL!NtWaitForSingleObject+0xb
...
05 3386ddc8 77955bb2 WLDAP32!LdapWaitForResponseFromServer+0x533
06 3386de04 77955a86 WLDAP32!ldap_result_with_error+0x101
07 3386de34 77959f71 WLDAP32!ldap_search_ext_sW+0x84
08 3386de90 77958f41 WLDAP32!LdapDetermineServerVersion+0x56
09 3386e22c 7795ef47 WLDAP32!LdapBind+0x1e4
0a 3386e25c 7795eed0 WLDAP32!LdapNonUnicodeBind+0x84
0b 3386e274 62ebcfbc WLDAP32!ldap_bind_s+0x17
0c 3386e2cc 62ebcefd dsaccess!CLdapConnection::BindToHost+0x100
...
16 3386e834 61ee2422 tokenm!CSearchResults::DoDCSearches+0x7b
...
21 3386f850 77d5d899 store!EcDoConnectEx+0x4c
22 3386f8c8 77d9c912 rpcrt4!Invoke+0x30
31 3386ffb4 7c57b388 rpcrt4!ThreadStartRoutine+0x18
32 3386ffec 00000000 KERNEL32!BaseThreadStart+0x52


 


As frame 05 (WLDAP32!LdapWaitForResponseFromServer) shows, these threads were all waiting for a response from an LDAP server, creating a convoy that was in turn blocking several other RPC threads. None of the threads was making progress, and as a result the available RPC connections that Outlook clients rely on were maxed out. OWA clients worked fine while all this was going on because OWA does not use RPC.
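The comparison across dumps can be sketched in a few lines of Python. This is only an illustration of the idea, not a debugger extension: it assumes you have already transcribed each dump into a mapping of thread id to frame symbols (top of stack first), and the names unchanged_threads, waiting_on, thread_a and thread_b are made up for the example.

```python
from typing import Dict, List

# Hypothetical input shape: one dict per dump, thread id -> frame symbols.
Dump = Dict[str, List[str]]

def unchanged_threads(dumps: List[Dump]) -> List[str]:
    """Return ids of threads whose stack is identical in every dump."""
    if not dumps:
        return []
    stuck = []
    for thread_id, frames in dumps[0].items():
        # A thread counts as stuck only if every later dump shows the same frames.
        if all(d.get(thread_id) == frames for d in dumps[1:]):
            stuck.append(thread_id)
    return sorted(stuck)

def waiting_on(frames: List[str],
               marker: str = "WLDAP32!LdapWaitForResponseFromServer") -> bool:
    """True if any frame in the stack starts with the given symbol."""
    return any(f.startswith(marker) for f in frames)

# Abbreviated stack taken from the frames quoted in this post.
ldap_stack = [
    "NTDLL!NtWaitForSingleObject+0xb",
    "WLDAP32!LdapWaitForResponseFromServer+0x533",
    "WLDAP32!ldap_bind_s+0x17",
    "dsaccess!CLdapConnection::BindToHost+0x100",
]

# Three made-up dumps: thread_a never moves, thread_b does.
dumps = [
    {"thread_a": ldap_stack, "thread_b": ["store!EcDoConnectEx+0x4c"]},
    {"thread_a": ldap_stack, "thread_b": ["rpcrt4!Invoke+0x30"]},
    {"thread_a": ldap_stack, "thread_b": ["rpcrt4!ThreadStartRoutine+0x18"]},
]

for tid in unchanged_threads(dumps):
    print(tid, "waiting on LDAP:", waiting_on(dumps[0][tid]))
```

With a single dump every thread trivially counts as unchanged, which is one reason to capture several dumps a few minutes apart.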


 


So why wasn’t the LDAP server responding? First I dumped the LDAP connection object to determine what server these threads were querying. Interestingly, the server was named


 


                  aussydneygc1.au.company1.com


 


Per the naming convention, this was a GC in Australia, yet the Exchange server was in the US. Hmm, why were we going across the sea all the way to Australia when there were 16 healthy GCs in the US? The next thing I had them do was turn up diagnostic logging for DSACCESS (the Exchange component responsible for querying the AD) to find out what GCs were being discovered and in what order. DSACCESS uses an algorithm that discovers GCs in the local site first and then those in remote sites based on the cost of WAN (Wide Area Network) links, and logs the list in the informational Event ID 2080 (viewable every 15 minutes when diagnostic logging for MSExchangeDSAccess -> Topology is set to maximum). The table looked like this:


 


 



process INETINFO.EXE (PID=2380). DSAccess has discovered the following servers with the following characteristics:
 (Server name | Roles | Reachability | Synchronized | GC capable | PDC | SACL right | Critical Data)

In-site:
wsdc1.us.company1.com   CDG 7 7 1 0 1 1 7 1
wsdc2.us.company1.com   CDG 7 7 1 0 0 1 7 1
wsdc3.us.company1.com   CDG 7 7 1 0 0 1 7 1
wsdc4.us.company1.com   CDG 7 7 1 0 1 1 7 1

Out-of-site:
nydc5.us.company1.com   CDG 7 7 1 0 1 1 7 1
nydc7.us.company1.com   CDG 7 7 1 0 1 1 7 1


 


For more information on what these values mean, see KB article 316300.


 


Only the GCs/DCs in the US were listed (none of the ones in Australia were) and so DSACCESS could not have been the culprit.
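If you have a long 2080 listing, the same check can be scripted. The following is a rough Python sketch that assumes the event text has been pasted into a string in the layout shown above; parse_topology and servers_in_domain are hypothetical helper names. It deliberately keeps the numeric columns as raw strings rather than mapping them to header names.

```python
SAMPLE = """\
process INETINFO.EXE (PID=2380). DSAccess has discovered the following servers with the following characteristics:
 (Server name | Roles | Reachability | Synchronized | GC capable | PDC | SACL right | Critical Data)
In-site:
wsdc1.us.company1.com   CDG 7 7 1 0 1 1 7 1
wsdc2.us.company1.com   CDG 7 7 1 0 0 1 7 1
wsdc3.us.company1.com   CDG 7 7 1 0 0 1 7 1
wsdc4.us.company1.com   CDG 7 7 1 0 1 1 7 1
Out-of-site:
nydc5.us.company1.com   CDG 7 7 1 0 1 1 7 1
nydc7.us.company1.com   CDG 7 7 1 0 1 1 7 1
"""

def parse_topology(text):
    """Return (section, server, roles, raw_flags) tuples from a 2080-style listing."""
    section = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line in ("In-site:", "Out-of-site:"):
            section = line.rstrip(":")
            continue
        parts = line.split()
        # Server rows appear only after a section header and start with an FQDN.
        if section and len(parts) >= 2 and "." in parts[0]:
            rows.append((section, parts[0], parts[1], parts[2:]))
    return rows

def servers_in_domain(rows, dns_suffix):
    """List discovered servers whose name ends with the given DNS suffix."""
    suffix = "." + dns_suffix.lower()
    return [server for _, server, _, _ in rows if server.lower().endswith(suffix)]

rows = parse_topology(SAMPLE)
print("US servers:", servers_in_domain(rows, "us.company1.com"))
print("AU servers:", servers_in_domain(rows, "au.company1.com"))
```

An empty list for the Australian suffix is the scripted equivalent of the observation above: DSACCESS had not discovered any Australian servers.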


 


I looked at the stuck threads a little more and determined that the users connected to the Exchange server via RPC had DNs (Distinguished Names) that looked like this:


 


"CN=John Doe,OU=Users,OU=Melbourne,OU=Company1 AUS,DC=au,DC=company1,DC=com"
"CN=David Boe,OU=Users,OU=Melbourne,OU=Company1 AUS,DC=au,DC=company1,DC=com"
"CN=Jane Zoe,OU=Users,OU=Sydney,OU=Company1 AUS,DC=au,DC=company1,DC=com"



For a moment I thought maybe these users in the Australian domain had mailboxes on the US Exchange server but the customer assured me that their mailboxes were properly located on Australian mailbox servers. The only other possibility was that these Australian users were logging on to the US server for public folder content.  Looking at the code and dumping the mdb (store) GUID confirmed this.


To recap the scenario, Australian mailbox users were connecting to a US Exchange server for public folder content. In order to determine what permissions the users had on the public folder content, the US Exchange server was querying a DC in Australia. The threads sending queries to this DC were stuck waiting for a response and they were in turn blocking other RPC threads, maxing out available RPC connections and adversely affecting Outlook clients on the US server. The question still remains, why wasn’t the US Exchange server querying the local US GCs listed above in the DSACCESS table?


It turns out that Exchange does not always use the local GCs. For certain security-related user attributes such as tokenGroups and tokenGroupsGlobalAndUniversal (used to determine what security groups a user is a member of, and therefore what permissions he or she has to secured resources such as public folders), Exchange MUST query a DC that is authoritative for the user’s home domain, which will likely be an out-of-site DC; in this case it happened to be a DC in Australia. This behavior was introduced around the Exchange 2000 SP2 time frame to address an issue where users from remote domains (sibling or parent) were denied access to public folders even when the security groups they were in should have allowed them access. Pre-SP2 we had made the false assumption (in the product) that a local GC can service ALL queries that Exchange issues. A local GC can (and should) service MOST queries in a well-designed multi-site AD environment.
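To make the rule concrete, here is a small Python sketch of the decision as described in this post. It is not Exchange's actual code; domain_from_dn and lookup_target are hypothetical names, and the DN parsing is naive (it assumes no escaped commas inside RDN values).

```python
# Attributes this post says must be answered by a DC in the user's home domain.
HOME_DOMAIN_ATTRIBUTES = {"tokengroups", "tokengroupsglobalanduniversal"}

def domain_from_dn(dn):
    """Join the DC= components of a distinguished name into a DNS domain name."""
    labels = []
    for rdn in dn.split(","):
        key, _, value = rdn.strip().partition("=")
        if key.upper() == "DC":
            labels.append(value)
    return ".".join(labels)

def lookup_target(user_dn, attribute):
    """Illustrative only: which kind of directory server a query would go to."""
    if attribute.lower() in HOME_DOMAIN_ATTRIBUTES:
        return ("home-domain DC", domain_from_dn(user_dn))
    return ("local GC", None)

dn = "CN=John Doe,OU=Users,OU=Melbourne,OU=Company1 AUS,DC=au,DC=company1,DC=com"
print(lookup_target(dn, "tokenGroups"))
print(lookup_target(dn, "mail"))
```

For the users in this case the home domain works out to au.company1.com, which is why the token group lookups left the US site even though healthy local GCs were available.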


Now back to Company1’s situation. At the time of the outage the WAN link to Australia was, in the customer’s words, "having some serious issues", which is why the LDAP responses were severely delayed. The connection was really spotty but it wasn’t down, which is probably why the connections didn’t time out. "How could this have been avoided?" they wanted to know. One simple way would be to use dedicated public folder servers. If the US Exchange server were only a public folder server, the worst that could happen is that the Australian users wouldn’t have had access to the public folder content stored on the server (and perhaps individual Australian Outlook clients would display the infamous "retrieving information from the server" RPC dialog). This is a less serious problem any day than lots of users (US users in this case) having no access to email. Another possible way to avoid the situation would be to set up replicas of the US public folder content on an Australian public folder server to avoid referrals over the WAN link. "But what if we have too little public folder content to justify a dedicated public folder server?" the customer asked. Fair question, but the honest answer is that you risk running into this kind of problem again. Hopefully the WAN link going down is a relatively rare occurrence. If cross-site public folder referral is also rare, the two rarities multiplied make for a low probability of the event occurring, but, like an earthquake, when it does occur...you get the point.
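The "two rarities multiplied" remark is simple arithmetic, shown here as a Python sketch. The two input probabilities are made-up placeholders, not measurements from this case, and multiplying them assumes the two events are independent.

```python
def joint_probability(p_wan_trouble, p_cross_site_referral):
    """Chance that both events coincide, assuming they are independent."""
    return p_wan_trouble * p_cross_site_referral

# Hypothetical figures: WAN trouble in 1% of time windows,
# a cross-site public folder referral in 5% of them.
p = joint_probability(0.01, 0.05)
print(f"{p:.2%}")
```

The product is small, but as the outage shows, the impact when both coincide is not.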


BTW we might have some good news on this in a few weeks. Stay tuned!


You may also want to read a related post by Ross Smith IV on some New DSProxy referral changes introduced in Exchange 2003 SP2 here.


- Jasper Kuria

Comments (11)
  1. Joe says:

    Great post.

    Is there a way to have determined that the Australian GC was being used without chasing through a dump?

  2. YJ says:

    Or, would it have helped (from a design perspective), if 1 or 2 GCs from the Australian domain were setup in the US? This might also help Australian users who travel to the US offices to have fast logon services.

  3. Andrewva says:

    Great posting.

    Another very similar case came up on an Exchange project at my client when we were migrating users from 5.5 to 2003. When Exchange tries to upgrade 5.5 DLs to AD SGs because they are used for permissions on public folders, the Exchange server will attempt to contact a GC from the domain where the DL resides to convert it.

    Seems obvious, but it had the same impact on us (hung stores) until we figured it out and resolved the communications issues between the Exchange site and the site where the GCs from the domain that owned the DL were located.

  4. I had a look at the "tokengroupsGlobalandUniversal" attribute in the schema. This attribute is NOT replicated to the GC. Therefore it is totally clear that Exchange has to query a DC from the user's domain. The question is: why isn’t that attribute added to the GC? Wouldn’t that make sense to avoid such issues?

  5. jasperk says:

    Joe,

    I don't think there is a way to do this configuration-wise. You could capture network traces and see some LDAP traffic going to Australian GCs, but it would be hard to conclude that it is this problem based on that alone.

  6. jasperk says:

    YJ,

    that would help but could lead to some very inelegant design. In a case where you have lots of domains, lots of sites and lots of Exchange servers, you would have to have in domain A-SiteA (assuming the domains are mapped to sites) a DC for domains B, C, D, E… but that's just my personal opinion as I tend to like elegant design. Someone once said, Donald Knuth I believe, that the key to great performance is elegant design, not lots of special cases :)

  7. jasperk says:

    thanks Andrewva. I’m glad you found the post useful.

  8. jasperk says:

    Christian,

    I asked this question to our AD folks and Steve Linehan had this to say:

    "tokengroupsGlobalandUniversal is a constructed attribute so it is really not replicated anywhere but built on the fly. You can determine this by looking at the system-flags value which in this case is 0x08000014 and that last value tells you it is constructed. Now on to why you have to contact a DC in the security principals domain. That is the only way to build a full token of the user and not only must the DC be contacted but if that DC is not a GC it will contact a GC to build the transitive groups and the full token. This really comes down to how the authentication and security subsystems were architected in the OS so it is not as simple as making that attribute part of the Global Catalog."

  9. Thanks Jasper,

    there’s always something one can learn about AD internals :-)

  10. Anonymous says:

    We now have a solution for a problem that was described last month in Exchange Does Not Always Use Local…

  11. Anonymous says:

    Here is the updated, corrected post. Exchange Does Not Always Use Local GC(s). Thanks to Dmitri for his…

