Until last week I was under the impression that Exchange always uses local GCs (Global Catalog servers) and only uses out-of-site directory servers (GCs or DCs) if no local ones are available. I found out otherwise from working on the issue described below.
One of our customers had a server outage and Outlook clients could no longer connect to the Exchange server. In other words the server was hung and we had a critsit (critical customer situation case). When dealing with hangs, the first thing we normally ask customers to do is capture a few hang dumps a few minutes apart so that we can determine what the hung process is doing.
The customer dutifully sent us 3 hang dumps taken about 5 minutes apart. On opening the dumps, the first thing that stood out was that there were 6 threads with similar call stacks whose execution state remained completely unchanged in all three dumps. In just one second of execution a thread’s stack can change so drastically that it is almost unrecognizable as the same thread. Put in context, these threads had stayed at the same point for millions of years--in processor time. They each had a stack that looked like this (the stack grows upwards from kernel32!BaseThreadStart and only the relevant parts are included):
00 3386dbb8 74fd1394 NTDLL!NtWaitForSingleObject+0xb
05 3386ddc8 77955bb2 WLDAP32!LdapWaitForResponseFromServer+0x533
06 3386de04 77955a86 WLDAP32!ldap_result_with_error+0x101
07 3386de34 77959f71 WLDAP32!ldap_search_ext_sW+0x84
08 3386de90 77958f41 WLDAP32!LdapDetermineServerVersion+0x56
09 3386e22c 7795ef47 WLDAP32!LdapBind+0x1e4
0a 3386e25c 7795eed0 WLDAP32!LdapNonUnicodeBind+0x84
0b 3386e274 62ebcfbc WLDAP32!ldap_bind_s+0x17
0c 3386e2cc 62ebcefd dsaccess!CLdapConnection::BindToHost+0x100
16 3386e834 61ee2422 tokenm!CSearchResults::DoDCSearches+0x7b
21 3386f850 77d5d899 store!EcDoConnectEx+0x4c
22 3386f8c8 77d9c912 rpcrt4!Invoke+0x30
31 3386ffb4 7c57b388 rpcrt4!ThreadStartRoutine+0x18
32 3386ffec 00000000 KERNEL32!BaseThreadStart+0x52
As frame 05 shows, these threads were all waiting for a response from an LDAP server, creating a convoy that was in turn blocking several other RPC threads. None of the threads was making progress, and as a result the available RPC connections that Outlook clients rely on were maxed out. OWA clients worked fine while all this was going on because OWA does not use RPC.
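The "same stack in every dump" check can be sketched in a few lines of Python. This is only an illustration, not the debugger workflow we actually used: it assumes each dump has already been reduced to a mapping from thread id to a list of frame strings (a hypothetical format), and the function names are made up for the example.

```python
from collections import defaultdict


def unchanged_threads(dumps):
    """Return ids of threads whose stack is identical in every dump.

    dumps: list of dicts mapping thread id -> list of frame strings
    (hypothetical pre-parsed form of each hang dump).
    """
    if not dumps:
        return []
    first = dumps[0]
    stuck = []
    for tid, frames in first.items():
        # A thread counts as stuck only if it exists in every later dump
        # with exactly the same frames in the same order.
        if all(d.get(tid) == frames for d in dumps[1:]):
            stuck.append(tid)
    return sorted(stuck)


def group_by_stack(dump, thread_ids):
    """Group the given threads by call stack to spot similar stuck threads."""
    groups = defaultdict(list)
    for tid in thread_ids:
        groups[tuple(dump[tid])].append(tid)
    return dict(groups)
```

Threads that come back from `unchanged_threads` and land in the same `group_by_stack` bucket are the ones worth reading frame by frame, as with the six threads above.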
So why wasn’t the LDAP server responding? First I dumped the LDAP connection object to determine what server these threads were querying. Interestingly, the server was named
Per the naming convention, this was a GC in Australia, yet the Exchange server was in the US. Hmm, why were we going across the sea all the way to Australia when there were 16 healthy GCs in the US? The next thing I had them do was turn up diagnostic logging for DSACCESS (the Exchange component responsible for querying the AD) to find out which GCs were being discovered and in what order. DSACCESS uses an algorithm that discovers GCs in the local site first and then those in remote sites based on the cost of the WAN (Wide Area Network) links, and logs the list in the informational Event ID 2080 (viewable every 15 minutes when diagnostic logging for MSExchangeDSAccess -> Topology is set to maximum). The table looked like this:
process INETINFO.EXE (PID=2380). DSAccess has discovered the following servers with the following characteristics:
(Server name | Roles | Reachability | Synchronized | GC capable | PDC | SACL right | Critical Data)
wsdc1.us.company1.com CDG 7 7 1 0 1 1 7 1
wsdc2.us.company1.com CDG 7 7 1 0 0 1 7 1
wsdc3.us.company1.com CDG 7 7 1 0 0 1 7 1
wsdc4.us.company1.com CDG 7 7 1 0 1 1 7 1
nydc5.us.company1.com CDG 7 7 1 0 1 1 7 1
nydc7.us.company1.com CDG 7 7 1 0 1 1 7 1
For more information on what these values mean, see KB article 316300.
Only the GCs/DCs in the US were listed (none of the ones in Australia were) and so DSACCESS could not have been the culprit.
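Checking a long server list by eye gets tedious, so here is a rough sketch of the same check in Python. It assumes the event text has been copied out as plain text and that each server row is a host name, a roles string, and then only numeric flags, as in the rows above; the function names are invented for this example.

```python
def parse_server_rows(event_text):
    """Extract (server name, roles) pairs from pasted Event 2080 text."""
    rows = []
    for line in event_text.splitlines():
        fields = line.split()
        # A server row looks like: host.name ROLES n n n ...
        if (len(fields) >= 3 and "." in fields[0]
                and all(f.isdigit() for f in fields[2:])):
            rows.append((fields[0].lower(), fields[1]))
    return rows


def servers_in_domain(rows, dns_suffix):
    """Return the listed servers whose name ends with the given DNS suffix."""
    suffix = "." + dns_suffix.lower()
    return [name for name, _roles in rows if name.endswith(suffix)]
```

Run against the table above, `servers_in_domain(rows, "au.company1.com")` comes back empty, which matches the conclusion that DSACCESS had not discovered any Australian servers.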
I looked at the stuck threads a little more and determined that the users connected to the Exchange server via RPC had DNs (Distinguished Names) that looked like this:
"CN=John Doe,OU=Users,OU=Melbourne,OU=Company1 AUS,DC=au,DC=company1,DC=com"
"CN=David Boe,OU=Users,OU=Melbourne,OU=Company1 AUS,DC=au,DC=company1,DC=com"
"CN=Jane Zoe,OU=Users,OU=Sydney,OU=Company1 AUS,DC=au,DC=company1,DC=com"
For a moment I thought maybe these users in the Australian domain had mailboxes on the US Exchange server but the customer assured me that their mailboxes were properly located on Australian mailbox servers. The only other possibility was that these Australian users were logging on to the US server for public folder content. Looking at the code and dumping the mdb (store) GUID confirmed this.
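The home domain of each of those users can be read straight off the DC components of the DN. A minimal sketch in Python, assuming plain DNs like the ones above with no escaped commas (a real DN parser would need to handle escaping); the function name is made up for illustration:

```python
def dns_domain_from_dn(dn):
    """Build the DNS domain name from the DC= components of a DN."""
    parts = [p.strip() for p in dn.split(",")]  # naive split: no escaped commas
    labels = [p[3:] for p in parts if p.upper().startswith("DC=")]
    return ".".join(labels)


dn = "CN=John Doe,OU=Users,OU=Melbourne,OU=Company1 AUS,DC=au,DC=company1,DC=com"
print(dns_domain_from_dn(dn))  # au.company1.com
```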
To recap the scenario, Australian mailbox users were connecting to a US Exchange server for public folder content. In order to determine what permissions the users had on the public folder content, the US Exchange server was querying a DC in Australia. The threads sending queries to this DC were stuck waiting for a response and they were in turn blocking other RPC threads, maxing out available RPC connections and adversely affecting Outlook clients on the US server. The question still remains, why wasn’t the US Exchange server querying the local US GCs listed above in the DSACCESS table?
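To make the "maxed out" part of the recap concrete, here is a toy simulation in Python. It is not how the store or the RPC runtime is implemented; it simply assumes a fixed-size worker pool (the pool size and all names are invented) and shows that once every worker is parked waiting on a reply that never comes, even a trivial request cannot run.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def simulate_pool_exhaustion(pool_size=4, wait_seconds=0.2):
    """Return (starved, result) for an ordinary request behind stuck ones."""
    ldap_reply = threading.Event()  # stands in for the delayed LDAP response

    def stuck_request():
        ldap_reply.wait()  # blocks until the "LDAP server" answers
        return "stuck request finished"

    def ordinary_request():
        return "ordinary request finished"

    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        # Fill every worker with a request that waits on the remote server.
        stuck = [pool.submit(stuck_request) for _ in range(pool_size)]
        ordinary = pool.submit(ordinary_request)
        try:
            ordinary.result(timeout=wait_seconds)
            starved = False
        except FutureTimeout:
            starved = True  # no free worker: the request never started
        ldap_reply.set()  # the remote server finally answers
        result = ordinary.result(timeout=5)
        for f in stuck:
            f.result(timeout=5)
    return starved, result
```

The ordinary request stays queued for as long as the simulated reply is withheld and completes as soon as the blocked workers are released, which is the same pattern the US users saw from the Outlook side.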
It turns out that Exchange does not always use the local GCs. For certain security-related user attributes such as tokenGroups and tokenGroupsGlobalAndUniversal (used to determine which security groups a user is a member of, and therefore what permissions s/he has to secured resources such as public folders), Exchange MUST query a DC that is authoritative for the user’s home domain, which will likely be an out-of-site DC. In this case it happened to be a DC in Australia. This behavior was introduced around the Exchange 2000 SP2 time frame to address an issue where users from remote domains (sibling or parent) were denied access to public folders even when the security groups they were in should have allowed them access. Pre-SP2 we had made the false assumption (in the product) that a local GC can service ALL queries that Exchange issues. A local GC can (and should) service MOST queries in a well-designed multi-site AD environment.
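The rule boils down to a small decision, sketched below in Python. This is purely illustrative and is not the actual Exchange code: the function, its parameters, and the Australian DC name are hypothetical (the real server name is not given here), and only the two attribute names come from the discussion above.

```python
# Attributes that, per the text above, must be read from a DC that is
# authoritative for the user's home domain.
HOME_DOMAIN_ATTRS = {"tokengroups", "tokengroupsglobalanduniversal"}


def pick_directory_server(requested_attrs, user_domain, local_gcs, dcs_by_domain):
    """Choose which directory server a query should go to (hypothetical)."""
    needs_home_dc = any(a.lower() in HOME_DOMAIN_ATTRS for a in requested_attrs)
    if needs_home_dc:
        candidates = dcs_by_domain.get(user_domain, [])
        if not candidates:
            raise LookupError("no DC known for domain " + user_domain)
        return candidates[0]  # possibly out-of-site, across the WAN
    if not local_gcs:
        raise LookupError("no local GC available")
    return local_gcs[0]  # the common case: stay in the local site


local_gcs = ["wsdc1.us.company1.com"]
# "dc1.au.company1.com" is a placeholder name, not the customer's real server.
dcs_by_domain = {"au.company1.com": ["dc1.au.company1.com"]}

print(pick_directory_server(["displayName"], "au.company1.com",
                            local_gcs, dcs_by_domain))
print(pick_directory_server(["tokenGroups"], "au.company1.com",
                            local_gcs, dcs_by_domain))
```

The first call stays on the local GC; the second is forced over to the user's home domain, which is how an Australian DC ended up in a US server's call stacks.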
Now back to Company1’s situation. At the time of the outage the WAN link to Australia was, in the customer’s words, "having some serious issues", which is why the LDAP responses were severely delayed. The connection was really spotty but it wasn’t down, which is probably why the connections didn’t time out. "How could this have been avoided?" they wanted to know. One simple way would be to use dedicated public folder servers. If the US Exchange server were only a public folder server, the worst that could happen is that the Australian users wouldn’t have had access to the public folder content stored on the server (and perhaps individual Australian Outlook clients would display the infamous "retrieving information from the server" RPC dialog). That is a far less serious problem than lots of users (US users in this case) having no access to email. Another possible way to avoid the situation would be to set up replicas of the US public folder content on an Australian public folder server, avoiding referrals over the WAN link. "But what if we have too little public folder content to justify a dedicated public folder server?" the customer asked. Fair question, but the honest answer is that you risk running into this kind of problem again. Hopefully WAN link trouble is a relatively rare occurrence. If cross-site public folder referral is also rare, the two rarities multiplied make for a low probability of the event occurring, but, like an earthquake, when it does occur... you get the point.
BTW we might have some good news on this in a few weeks. Stay tuned!
You may also want to read a related post by Ross Smith IV on some New DSProxy referral changes introduced in Exchange 2003 SP2 here.