SharePoint: The problem with changing UserQueryMaxTimeout

11/26/2017

Consider the following scenario:

You have a fairly large and/or complex Active Directory (AD) infrastructure.
When using People Picker in a SharePoint 2013 or 2016 site, you are unable to find users from certain domains, and eventually the People Picker control displays an error:

“Sorry, we’re having trouble reaching the server”.

You do some research and find several blogs that say to edit clientpeoplepicker.js and set SPClientPeoplePicker.UserQueryMaxTimeout to 60 (seconds).

Here are the problems with that “solution”:

It’s technically unsupported to make changes to SharePoint supporting files on the file system.
Any SharePoint cumulative update or service pack could include a new version of clientpeoplepicker.js, in which case, your change will be overwritten.
Several blogs seem to imply that changing the default value ("25e3") to "60" will increase the timeout to 60 seconds. That is incorrect. The value is in milliseconds: "25e3" is 25,000 milliseconds, which is the 25-second default. To get 60 seconds you would actually need to change it to "60e3". Changing it to just "60" sets the timeout to 60 milliseconds, and the result is that People Picker immediately displays the “Sorry, we’re having trouble reaching the server” error.
Increasing a timeout is rarely the best solution. Ok great, now the People Picker control no longer throws the error, but it takes 45 seconds to pull up results. Are you happy?
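
To make the unit mix-up concrete, here is a small sketch of the arithmetic. It is not a recommendation to edit the file. It relies only on the fact that PowerShell reads exponent notation such as 25e3 the same way JavaScript does; the variable names are made up for illustration.

```powershell
# UserQueryMaxTimeout is expressed in milliseconds.
# The variable names below are illustrative only.
$defaultMs  = 25e3   # the shipped default
$intendedMs = 60e3   # what the blogs meant to set
$mistakenMs = 60     # what the blogs actually tell you to set

$defaultMs  / 1000   # 25 seconds
$intendedMs / 1000   # 60 seconds
$mistakenMs / 1000   # 0.06 seconds, so the picker gives up almost immediately
```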

So what should you do instead?

Troubleshoot. Review logs, review network traces, review your People Picker configuration, etc., to figure out why it’s taking so long to find users.

In most cases, the reason that the default 25-second timeout is not long enough falls into one of these categories:

Your People Picker settings are not optimally configured, so time is wasted searching unnecessary domains.
There is a firewall blocking communication and causing the LDAP connection to time out.
DNS and Active Directory Sites and Subnets are not configured correctly, which can lead to connecting to a domain controller that is at the other end of a slow network link.

Step 1: Configure your web application People Picker settings.

Review your domain and forest trusts. Have some conversations with web app stakeholders to understand exactly which domains this web app is going to need access to.
I find many SharePoint admins don’t have these conversations and just default to allowing People Picker to search every trusted domain. Sometimes that’s all you can do, but understand that restricting People Picker to only search the domains you actually need will improve performance and reliability.
This is how you output the current People Picker settings for a web app:

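# Run from the SharePoint Management Shell.
$wa = Get-SPWebApplication https://yourWebApplicationUrl
# All People Picker settings for the web application
$wa.PeoplePickerSettings
# Only the list of domains and forests that People Picker searches
$wa.PeoplePickerSettings.SearchActiveDirectoryDomains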

Note: If SearchActiveDirectoryDomains is blank, that means that People Picker will search all trusted domains.
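
As a rough sketch of what restricting the search scope can look like, the following adds one domain to SearchActiveDirectoryDomains. The domain names are placeholders, and the sketch assumes the domain is reachable over a two-way trust, so no credentials are supplied. Domains behind a one-way trust need additional credential setup that is not covered here. Keep in mind that once the list is no longer blank, you should expect People Picker to search only the entries in it, so add every domain the web app needs.

```powershell
# Run from the SharePoint Management Shell.
$wa = Get-SPWebApplication https://yourWebApplicationUrl

# Placeholder values: replace with a domain your web app actually needs.
$domain = New-Object Microsoft.SharePoint.Administration.SPPeoplePickerSearchActiveDirectoryDomain
$domain.DomainName      = "contoso.com"
$domain.ShortDomainName = "CONTOSO"
$domain.IsForest        = $false   # set to $true to search a whole forest

# Add the entry and persist the change to the web application.
$wa.PeoplePickerSettings.SearchActiveDirectoryDomains.Add($domain)
$wa.Update()

# Confirm the result.
$wa.PeoplePickerSettings.SearchActiveDirectoryDomains
```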

Step 2: Review your logs.

Turn your SharePoint logging up to verbose:

Set-SPLogLevel -TraceSeverity Verbose

Reproduce the problem behavior on one of the sites.
Get the logs from the web-front-ends and search for “ClientPeoplePicker”.
You should find the request from your reproduction. You can then filter by correlation ID.
You’re interested in the “SearchFromGC” events. That’s where People Picker finds a domain controller and issues an LDAP query for the user. You should see it iterate through each domain in your environment.
What you may find is that some domains take a long time to respond, or don’t respond at all. That’s what’s causing the People Picker to time out.
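
One possible way to gather those entries is sketched below. It uses Merge-SPLogFile to pull a single correlation ID from every server in the farm into one file, then filters for the SearchFromGC events. The output path is an assumption, and the correlation ID is the one from the example in this article; substitute the ID from your own reproduction.

```powershell
# Run from the SharePoint Management Shell on a farm server.
$logPath       = 'C:\Temp\PeoplePicker.log'               # assumed output location
$correlationId = '15142f9e-6389-1024-5551-eed4886f8d25'   # replace with your own

# Collect every ULS entry for that request from all servers in the farm.
Merge-SPLogFile -Path $logPath -Correlation $correlationId -Overwrite

# Show only the per-domain lookup events.
Select-String -Path $logPath -Pattern 'SearchFromGC'
```

When you have finished troubleshooting, Clear-SPLogLevel returns logging to its default levels so that verbose logging does not keep filling the disk.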

Here’s an example where we get no response from “contoso.com” when searching for “user1”. Note the timestamps.

 11/21/2017 09:56:23.82 w3wp.exe (0x2388) 0x4984 SharePoint Foundation Performance ftq1 Verbose SearchFromGC name = contoso.com. start 15142f9e-6389-1024-5551-eed4886f8d25

11/21/2017 09:57:10.95 w3wp.exe (0x2388) 0x4984 SharePoint Foundation Performance ftq3 Verbose SearchFromGC name = contoso.com. Error Message: The server is not operational. 15142f9e-6389-1024-5551-eed4886f8d25

11/21/2017 09:57:10.95 w3wp.exe (0x2388) 0x4984 SharePoint Foundation General 7fbh Verbose Exception when search "user1" from domain "contoso.com". Exception: "The server is not operational.  ", StackTrace: " at System.DirectoryServices.DirectoryEntry.Bind(Boolean throwIfFail) at System.DirectoryServices.DirectoryEntry.Bind() at System.DirectoryServices.DirectoryEntry.get_AdsObject() at System.DirectoryServices.DirectorySearcher.FindAll(Boolean findMoreThanOne) at Microsoft.SharePoint.WebControls.PeopleEditor.SearchFromGC(SPActiveDirectoryDomain domain, String strFilter, String[] rgstrProp, Int32 nTimeout, Int32 nSizeLimit, SPUserCollection spUsers, ArrayList& rgResults) at Microsoft.SharePoint.Utilities.SPUserUtility.SearchAgainstAD(String input, Boolean useUpnInResolve, SPActiveDirectoryDomain domainController, SPPrincipalType scopes, SPUserCollection usersContainer, Int32 maxCount, String customQuery, String customFilter, TimeSpan searchTimeout, Boolean& reachMaxCount)". 15142f9e-6389-1024-5551-eed4886f8d25

In the above case, it’s pretty clear that a firewall is blocking the traffic between the SharePoint web-front-ends and domain controllers for contoso.com. We spent about 47 seconds trying to initiate an LDAP Bind before giving up.

When you’re looking at the SearchFromGC entries, pay attention to the start and end timestamps. The difference tells you how long it took to query the domain. If each domain takes several seconds and you have many domains, the total can easily exceed the timeout.
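
The timestamp comparison can be scripted. The function below is a hypothetical helper, not part of SharePoint: it pairs each SearchFromGC "start" entry with the next SearchFromGC entry for the same domain and reports the elapsed seconds. It assumes the timestamp and message layout shown in the example above; the only end-of-search entry shown in this article is the error case, so treating any later entry for the same domain as the end is an assumption.

```powershell
# Hypothetical helper: reports how long each SearchFromGC lookup took.
function Get-SearchFromGCDuration {
    param([string[]]$Lines)

    $pattern = '^\s*(?<ts>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}\.\d{2}).*SearchFromGC name = (?<domain>\S+)\. (?<rest>.*)$'
    $starts  = @{}

    foreach ($line in $Lines) {
        if ($line -match $pattern) {
            $ts     = [datetime]::ParseExact($Matches['ts'], 'MM/dd/yyyy HH:mm:ss.ff',
                          [cultureinfo]::InvariantCulture)
            $domain = $Matches['domain']

            if ($Matches['rest'] -like 'start*') {
                # Remember when the lookup for this domain began.
                $starts[$domain] = $ts
            }
            elseif ($starts.ContainsKey($domain)) {
                # Any later SearchFromGC entry for the domain is treated as the end.
                [pscustomobject]@{
                    Domain  = $domain
                    Seconds = ($ts - $starts[$domain]).TotalSeconds
                }
                $starts.Remove($domain)
            }
        }
    }
}

# The two SearchFromGC entries from the example above.
$sample = @(
    '11/21/2017 09:56:23.82 w3wp.exe (0x2388) 0x4984 SharePoint Foundation Performance ftq1 Verbose SearchFromGC name = contoso.com. start 15142f9e-6389-1024-5551-eed4886f8d25'
    '11/21/2017 09:57:10.95 w3wp.exe (0x2388) 0x4984 SharePoint Foundation Performance ftq3 Verbose SearchFromGC name = contoso.com. Error Message: The server is not operational. 15142f9e-6389-1024-5551-eed4886f8d25'
)
Get-SearchFromGCDuration -Lines $sample
```

For the sample entries this reports contoso.com at roughly 47 seconds, matching the manual reading of the timestamps. To run it against a real log, pass the output of Get-Content for the log file as -Lines.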

Step 3: Review network traces.

Now that you have a feel for which are the problematic domains, a network trace will help identify exactly what the problem is.

Run ipconfig /flushdns on the SharePoint web-front-end to ensure the DNS calls are captured in a network trace.

Get another set of verbose ULS logs, and at the same time take a Netmon or Wireshark trace from the web-front-end.
Find the applicable entries in the ULS logs again and note the timestamps.
Filter the network trace to DNS (port 53) and LDAP (ports 389 and 3268).
The DNS entries will show you how SharePoint determined which domain controller (DC) to reach out to for a given domain.
The LDAP entries will show you if the connection succeeded, and if so, how long it took to get the results.

If you find that LDAP connections to the remote DCs are successful, but taking several seconds, you’ll want to look at your DNS and routing configuration to see if there’s a way to make sure SharePoint always contacts a DC that is geographically close and quick to respond.
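
To see which domain controllers DNS offers for a domain, and which one Windows would pick from the SharePoint server, something like the following can help. The domain name is a placeholder. Resolve-DnsName requires Windows Server 2012 or later; nltest is a standard Windows tool but may need to be installed with the AD DS tools on some servers.

```powershell
$domain = 'contoso.com'   # placeholder: use the slow or failing domain

# List the domain controllers advertised in DNS for the domain.
Resolve-DnsName -Name "_ldap._tcp.dc._msdcs.$domain" -Type SRV |
    Where-Object { $_.Type -eq 'SRV' } |
    Sort-Object Priority, Weight |
    Select-Object NameTarget, Priority, Weight, Port

# Ask the DC locator which domain controller it selects, and in which AD site.
nltest "/dsgetdc:$domain"
```

If the selected domain controller sits in a distant site, that points toward the Sites and Subnets or DNS configuration rather than a firewall.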

If you find that the LDAP connections are not successful, it’s likely due to a firewall. You'll need to get your network people to assist.
Some have found that the People Picker Port Tester tool can be useful for troubleshooting this kind of problem.
https://github.com/Nauplius/PeoplePickerPortTester

Here's what a firewall situation looks like in a network trace. The SharePoint server is the source address. It reaches out to multiple domain controllers, trying to establish a TCP connection on port 389. It never gets a response, so it retransmits the SYN packet, which also gets no response. This is attempted multiple times before SharePoint gives up and throws the "The server is not operational" error.

For People Picker to function properly, we need all these ports open between the SharePoint servers and the domain controllers in the remote domain:
TCP/UDP 135, 137, 138, 139 (RPC and NetBIOS)
TCP/UDP 389 (LDAP)
TCP 636 (LDAP SSL)
TCP 3268 (LDAP Global Catalog)
TCP 3269 (LDAP Global Catalog SSL)
TCP/UDP 53 (DNS)
TCP/UDP 88 (Kerberos)
TCP/UDP 445 (Directory Services)
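
As a quick check before involving the network team, the TCP side of that list can be probed from a SharePoint server. This is only a sketch: the domain controller name is hypothetical, Test-NetConnection requires Windows Server 2012 R2 or later, and it tests TCP only, so the UDP ports (including 137 and 138) are not covered.

```powershell
$dc       = 'dc01.contoso.com'   # hypothetical DC in the remote domain
$tcpPorts = 53, 88, 135, 139, 389, 445, 636, 3268, 3269

foreach ($port in $tcpPorts) {
    # Attempt a TCP connection and record whether it succeeded.
    $result = Test-NetConnection -ComputerName $dc -Port $port -WarningAction SilentlyContinue
    [pscustomobject]@{
        Port = $port
        Open = $result.TcpTestSucceeded
    }
}
```

A port that reports False here is consistent with the unanswered SYN packets described above, although a network trace is still the more reliable evidence.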