We now have a solution for a problem that was described last month in Exchange Does Not Always Use Local GCs. The gist of the problem is that Exchange makes occasional requests to out-of-site DCs, and if a WAN link is flaky or the remote DC is unresponsive, a few of those remote calls can block the majority of LDAP calls that are otherwise happy to use local GCs. This sometimes led to severe server outages, especially when the initial LDAP connect succeeded but the bind or request response took too long to complete.
Not anymore. We have introduced a shorter timeout for LDAP calls, so if a query takes too long it now quickly times out, allowing other calls to proceed. The new default timeout is 30 seconds, but it can be further tuned by using the following registry key:
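As a sketch of what that registry change looks like: the value name, LdapBindTimeoutSecs, comes from the fix itself, but treat the key path below as an assumption on my part and confirm the exact location against KB 911830 before using it.

```
Windows Registry Editor Version 5.00

; Sketch only -- confirm the exact key path against KB 911830.
; DSAccess settings for Exchange 2003 live under the MSExchangeDSAccess
; service key; the path shown here is an assumption, not from the KB.
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\MSExchangeDSAccess]
; LDAP bind timeout in seconds (DWORD); 0x1e = 30, the new default
"LdapBindTimeoutSecs"=dword:0000001e
```

Set the DWORD to however many seconds you want Exchange to wait before abandoning a bind attempt against an unresponsive DC.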
This key is available with the fix contained in the soon-to-be-published knowledge base article KB 911830 (please check back, the article WILL be there). It is an Exchange 2003 post-SP2 fix. Without this fix, Exchange uses the default LDAP timeout of 2 minutes and attempts 24 retries. With the fix, we use a 30-second timeout (or whatever LdapBindTimeoutSecs is set to) and still attempt 24 retries. If you experience the problem, it's probably best to set the value based on what is normal for a remote LDAP query in your environment. An LDAP query to a local GC generally takes well under 1 second. If a remote query takes 3-5 seconds in your environment, consider setting LdapBindTimeoutSecs to roughly double that. This is just a suggestion; you are the best judge of what works for your particular environment.
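To see why the shorter timeout matters, here is a quick back-of-the-envelope calculation. It assumes, purely for illustration, that all 24 retries against an unresponsive DC happen back to back with no backoff; the retry pacing is my assumption, not a detail from the KB.

```python
# Worst-case time a single unresponsive DC could hold up a request,
# assuming 24 sequential retries with no backoff (illustrative only).
RETRIES = 24

old_timeout_secs = 120   # pre-fix default: 2 minutes per attempt
new_timeout_secs = 30    # post-fix default (LdapBindTimeoutSecs)

old_worst_case = RETRIES * old_timeout_secs   # 2880 s = 48 minutes
new_worst_case = RETRIES * new_timeout_secs   # 720 s = 12 minutes

print(f"old: {old_worst_case // 60} min, new: {new_worst_case // 60} min")
```

Even with the fix, 24 retries against a dead DC add up, which is why it pays to size LdapBindTimeoutSecs to what a normal remote query actually takes in your environment rather than leaving generous headroom.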
I should emphasize that the fix does NOT solve LDAP connectivity issues. It only ensures that you don't suffer a complete Exchange outage because some remote DC that Exchange tried to contact is unresponsive while the local GCs are available to service the majority of Exchange LDAP requests. If the Exchange server is hung waiting on responses from local GCs, or if all local GCs are unavailable and therefore all requests end up going to out-of-site servers, the fix is not particularly helpful. In both of those situations the first order of business should be figuring out what is wrong with the local GCs and fixing that. And even in the situation where the fix is useful, you should still address whatever is wrong with the underlying network (the WAN link or whatever else the case might be).
In the blog feedback Nino requested in November, MKohlman said he would like to see:
"...a post or two regarding a recent KB article concerning a recent issue followed with a history of how the problem was discovered and documented. It would be interesting to follow this process from discovery to resolution or work-around (say, was it discovered internally by MS or did it start as an issue in the field that was researched and resolved either by the admins, MS or both?) I know that I've been very curious on occasion when I've run into an issue that I could not find an answer for via KB or newsgroups, then see a KB turn up a few weeks or months later that matches or closely matches the issue that no one else could initially confirm or duplicate."
MKohlman, we hope this is what you had in mind!