Understanding ATQ performance counters, yet another twist in the world of TLAs

Hello again, this is guest author Herbert from Germany.

If you have worked on an Active Directory performance issue, you might have noticed a number of AD performance counters for the NTDS and “Directory Services” objects, including some ATQ-related counters.

In this post, I provide a brief overview of the ATQ performance counters, explain how to use them, and discuss several scenarios we’ve seen.

What are all these ATQ thread counters there for anyway?

“ATQ” stands for “Asynchronous Thread Queue”.
LSASS adopted its threading library from IIS to handle Windows socket communication and uses a thread queue to handle requests from Kerberos and LDAP.
English versions of the ATQ counters are named per component so you can group them together when viewing a performance log. Here is a list with a short explanation of each ATQ counter:

Counter | Explanation
ATQ Estimated Queue Delay | How long a request has to wait in the queue
ATQ Outstanding Queued Requests | Current number of requests in the queue
ATQ Request Latency | Time it takes to process a request
ATQ Threads LDAP | The number of threads used by the LDAP server as determined by LDAP policy
ATQ Threads Other | Threads used by other components, in this case the KDC
ATQ Threads Total | All threads currently allocated

More details on the counters
ATQ Threads Total
This counter tracks the total number of threads, that is, the sum of the ATQ Threads LDAP and ATQ Threads Other counters. The maximum number of threads that a given DC can apply to incoming workloads can be found by multiplying MaxPoolThreads by the number of logical CPU cores. MaxPoolThreads defaults to a value of 4 in LDAP policy and should not be modified without understanding the implications.
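As a quick worked example of the thread ceiling described above, here is a minimal Python sketch. The function name and the 8-core DC are illustrative assumptions; only the default of 4 comes from the LDAP policy mentioned above.

```python
def max_atq_threads(logical_cores, max_pool_threads=4):
    """Upper bound on ATQ threads: MaxPoolThreads per logical CPU core."""
    if logical_cores < 1 or max_pool_threads < 1:
        raise ValueError("both values must be positive")
    return max_pool_threads * logical_cores


# Hypothetical DC with 8 logical cores and the default MaxPoolThreads of 4
print(max_atq_threads(8))  # 32
```

This ceiling is the number to compare “ATQ Threads Total” against in a performance log.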

When viewing performance logs from a performance-challenged DC:

  • Compare the “ATQ Threads Total” counter with the other two “ATQ Threads…” counters. If the “ATQ Threads LDAP” counter equals “ATQ Threads Total”, then all of the LDAP listen threads are currently busy processing LDAP requests. If the “ATQ Threads Other” counter equals “ATQ Threads Total”, then all of the LDAP listen threads are busy responding to Kerberos-related traffic.
  • Similarly, note how close the current value for “ATQ Threads Total” is to the maximum value recorded in the trace, and whether both values have reached the maximum number of threads supported by the DC being monitored.

Note that the current value of ATQ Threads Total does not have to match the maximum value, as the thread count increases and decreases based on load. Pay attention when the current value for this counter matches the total number of threads supported by the DC being monitored.
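The comparison above can be sketched as a small Python helper that reads one sample of the three thread counters. This is only an illustration: the function name, the wording of the notes and the sample values are assumptions, and max_threads is the ceiling calculated from MaxPoolThreads and the number of logical cores.

```python
def describe_atq_threads(ldap, other, total, max_threads):
    """Return short notes for one sample of the three ATQ thread counters."""
    notes = []
    if total >= max_threads:
        # The pool has grown to the ceiling supported by this DC.
        notes.append("thread pool at maximum")
    if total > 0 and ldap == total:
        notes.append("all threads busy with LDAP")
    elif total > 0 and other == total:
        notes.append("all threads busy with Kerberos/other")
    return notes or ["headroom available"]


# Hypothetical samples from a DC that supports 32 threads
print(describe_atq_threads(ldap=32, other=0, total=32, max_threads=32))
print(describe_atq_threads(ldap=10, other=2, total=12, max_threads=32))
```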

ATQ Threads LDAP
This is the number of threads currently servicing LDAP requests. If there are a significant number of concurrent LDAP queries being processed, check for:

  • Expensive or inefficient LDAP queries
  • Excessive numbers of LDAP queries
  • An insufficient number of DCs to service the workload (or existing DCs that are undersized)
  • Memory, CPU or disk bottlenecks on the DC

Large values for this counter are common, but the thread count should remain below the total number of threads supported by your DC. The ATQ Threads LDAP and other ATQ counters are captured by the built-in AD Diagnostic Data Collector Set documented in this blog entry.

Follow these guides if applications are generating expensive queries:

The ATQ Threads LDAP counter can also run “hot” for reasons that are initially triggered by LDAP but are ultimately caused by external factors:

External factors

Scenario 1
Scenario: DC locator traffic (LDAP ping) from clients whose IP address doesn’t map to an AD site. The LDAP server performs an exhaustive address lookup to discover additional client IP addresses so that it can find a site to map to the client.
Symptom and cause: LDAP, Kerberos and DC locator responses are slow or time out. Netlogon event 5807 may be logged within a four-hour window. Until name resolution responds or times out, the related LDAP ping ties up one of the threads of the limited Asynchronous Thread Queue (ATQ) pool. Many of these LDAP pings over a longer period can keep the ATQ pool exhausted. Because the same pool is required for regular LDAP and Kerberos requests, the domain controller may become unresponsive or unavailable to users and applications.
Resolution: The problem is described in KB article 2668820. Install the corrective fixes and policy documented in KB 2922852.

Scenario 2
Scenario: The DC supports LDAP over SSL/TLS. A user sends a certificate on a session. The server needs to check for certificate revocation, which may take some time. This becomes problematic if network communication is restricted and the DC cannot reach the CRL Distribution Point (CDP) for a certificate.
Symptom and cause: To determine if your clients are using secure LDAP (LDAPs), check the counter “LDAP New SSL Connections/sec”. If there are a significant number of sessions, you might want to look at CAPI logging.
Resolution: See the details below.

For scenario 2: Depending on the details, there are a few approaches to remove the bottleneck:

  1. In certificate manager, locate the certificate used for LDAPs for the account in question and, in the general pane, select the item “Enable only the following purposes” and uncheck the CLIENT_AUTHENTICATION purpose. The Internet Proxy and Unified Access Gateway team sees this more often in reverse proxy scenarios; this guide describes the Windows Server 2003 UI: http://technet.microsoft.com/en-us/library/cc514301.aspx
  2. Use different certificates that can be checked in the internal network, or remove the CLIENT_AUTHENTICATION purpose on new certificates.
  3. Allow the DC to access the real CDP, perhaps by allowing it to traverse the proxy to the Internet. It’s quite possible that your security department will get a bit frantic about the idea.
  4. Shorten the time-out for CRL checks so the DC gives up faster; see ChainUrlRetrievalTimeoutMilliseconds and ChainRevAccumulativeUrlRetrievalTimeoutMilliseconds on TechNet. This does not avoid the problem, but it reduces the performance impact.
  5. Suppress the “invitation” to send certificates by not sending the list of trusted roots in the local store, using SendTrustedIssuerList=0. This does not help if the client is coded to always include a certificate when a suitable one is present. The Microsoft LDAP client defaults to doing this, thus:
  6. Change the client application to not include the user certificate. This requires setting an LDAP session option before starting the actual connection. In the LDAP API, set the option:

LDAP_OPT_SSPI_FLAGS (0x92): Sets or retrieves a ULONG value giving the flags to pass to the SSPI InitializeSecurityContext function.

In System.DirectoryServices.Protocols:

SspiFlag: The SspiFlag property specifies the flags to pass to the Security Support Provider Interface (SSPI) InitializeSecurityContext function. For more information about the InitializeSecurityContext function, see the InitializeSecurityContext function topic in the MSDN library.

From InitializeSecurityContext:

ISC_REQ_USE_SUPPLIED_CREDS: Schannel must not attempt to supply credentials for the client automatically.
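To make the flag handling concrete, here is a minimal Python sketch of the read-modify-write step a client would perform on the SSPI flags before connecting. It only does the bit arithmetic and does not call the LDAP API. The helper name is hypothetical, and the numeric value 0x80 for ISC_REQ_USE_SUPPLIED_CREDS is taken from the Windows SDK header sspi.h as I recall it, so please verify it against your SDK.

```python
# Session option number quoted in the text above
LDAP_OPT_SSPI_FLAGS = 0x92

# Assumption: value as defined in the Windows SDK header sspi.h
ISC_REQ_USE_SUPPLIED_CREDS = 0x00000080


def with_supplied_creds_only(current_flags):
    """Add ISC_REQ_USE_SUPPLIED_CREDS while keeping all flags already set."""
    return current_flags | ISC_REQ_USE_SUPPLIED_CREDS


# Hypothetical current flag value read from the session (0x2 used as an example)
new_flags = with_supplied_creds_only(0x2)
print(hex(new_flags))  # 0x82
```

The resulting value is what the client would write back with the LDAP_OPT_SSPI_FLAGS session option (or the SspiFlag property) before starting the connection.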

ATQ Threads Other
You can also have external dependencies generating requests that hit the Kerberos Key Distribution Center (KDC).
One common operation is getting the list of global and universal groups from a DC that is not a Global Catalog (GC).
A second external and potentially intermittent root cause occurs when the Kerberos Forest Search Order (KFSO) feature has been enabled on Windows Server 2008 R2 and later KDCs to search trusted forests for SPNs that cannot be located in the local forest.
The worst-case scenario occurs when the KDC searches both local and trusted forests for an SPN that can’t be found, either because the SPN does not exist or because the search used an incorrect SPN.
Memory dumps from KDCs in this state will reveal a number of threads working on Kerberos service ticket requests along with pending RPC calls to remote domain controllers.
Procdump triggered by performance counters could also be used to identify the condition if the spikes last long enough to start and capture the related traffic.

More information on KFSO can be found on TechNet including performance counters to monitor when using this feature.

ATQ Queues and ATQ Request Latency
The ATQ Queue and latency counters provide statistics as to how requests are being processed. Since the type of requests can differ, the average processing time is typically not significant. An expensive LDAP query that takes minutes to execute can be masked by hundreds of fast LDAP queries or KDC requests.
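A short arithmetic example shows how the average hides an outlier. The latencies below are made-up numbers chosen only for illustration, not measurements from a real DC.

```python
# Hypothetical request latencies in milliseconds:
# 599 fast requests and one expensive LDAP query that ran for two minutes
latencies_ms = [5] * 599 + [120_000]

average = sum(latencies_ms) / len(latencies_ms)
worst = max(latencies_ms)

print(f"average: {average:.0f} ms, worst: {worst} ms")
# average: 205 ms, worst: 120000 ms
```

An average of about 205 ms looks harmless even though one request held a thread for 120 seconds.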

The main use of these counters is to monitor the wait time in queue and the number of requests in the queue. Any non-zero values indicate that the DC has run out of threads.

Note that a performance monitor counter only reflects the value at the moment it is sampled. This is quite a problem when you use a long sample interval: a counter for the current queue length such as “ATQ Outstanding Queued Requests” may not reliably show the actual degree of server overload.

To work around this sampling problem, take the other counters into consideration to gain confidence in the value. If a wait time is reported, there must have been requests sitting in the queue at some point during the last sample interval, even if the queue length reads zero. The load and processing delay were just not bad enough to leave at least one request in the queue at the sample time-stamp.
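The cross-check between the two queue counters can be sketched in Python as follows. The function name and the sample values are assumptions for illustration; the units of the delay counter do not matter here, only whether the value is non-zero.

```python
def queueing_occurred(queue_delay, outstanding_requests):
    """True if either counter shows that requests had to wait for a thread."""
    return queue_delay > 0 or outstanding_requests > 0


# Hypothetical samples:
# (ATQ Estimated Queue Delay, ATQ Outstanding Queued Requests)
samples = [(0, 0), (35, 0), (120, 4)]

flags = [queueing_occurred(delay, queued) for delay, queued in samples]
print(flags)  # [False, True, True]
```

The second sample is the interesting one: the queue was empty at the sample time-stamp, but the non-zero delay shows that requests were queued during the interval.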

What about other thread pools?

LSASS has a number of other worker threads, for example to process IPsec handshakes. Then of course there is the land of RPC server threads for the various RPC servers. Describing all the RPC servers would take up a number of additional blog entries. You can see them listed as “load generators” in the data collector set results.

A lot of details on LSASS ATQ performance counters, I know. But geeks love the details.

Cheers,

Herbert