Exchange 2010: AD Recipient Cache and Locking Threads in EdgeTransport - Or How We Figured Out Something Wasn't Right With Mail Stuck in Submission Queue

A while ago we had an issue wherein SCOM fired an alert on message queue length for messages in Submission. When we went to take a look, the messages were stuck in Submission and weren't really going anywhere. We took a dump of EdgeTransport, paused Transport, and the queues eventually flushed.

We opened WinDbg against the dump, and here's what was going on: every time a flood of email came in, Transport would see that the local recipient cache for a group was out of date. Threads would spin up (typically one per message) to update the recipient cache. Since this is an AD call, a lock is used. The lock is owned by the thread until it's released, which means we have to wait for the LDAP call to complete before the next thread can enter the critical section. Repeat ad infinitum. What complicated matters is that the refresh was for the same group of sixteen-thousand-plus users, each time.
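
To make the shape of that concrete, here is a minimal C# sketch of the pattern (my own simplification for illustration; the class, member names, and timings are invented, and this is not the actual Exchange code). Every thread that notices the stale cache queues up on a single lock, and each one repeats the same expensive directory load once it finally gets in:

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

class GroupMembershipCache
{
    private readonly object _refreshLock = new object();
    private readonly ConcurrentDictionary<Guid, HashSet<Guid>> _members =
        new ConcurrentDictionary<Guid, HashSet<Guid>>();
    private DateTime _expiresUtc = DateTime.MinValue;

    public bool IsMemberOf(Guid groupId, Guid recipientId)
    {
        if (DateTime.UtcNow > _expiresUtc)
        {
            // Every caller that saw a stale cache lines up here. Each one that
            // already passed the staleness check repeats the full refresh,
            // one at a time, even though it's the same group every time.
            lock (_refreshLock)
            {
                _members[groupId] = LoadGroupMembersFromAd(groupId);
                _expiresUtc = DateTime.UtcNow.AddMinutes(5);
            }
        }

        HashSet<Guid> members;
        return _members.TryGetValue(groupId, out members) && members.Contains(recipientId);
    }

    // Stand-in for the paged LDAP expansion of a sixteen-thousand-plus member group.
    private static HashSet<Guid> LoadGroupMembersFromAd(Guid groupId)
    {
        Thread.Sleep(2000); // pretend the directory round-trip takes a couple of seconds
        return new HashSet<Guid>();
    }

    static void Main()
    {
        var cache = new GroupMembershipCache();
        var groupId = Guid.NewGuid();

        // A burst of messages, each checked on its own thread: they all find the
        // cache stale, then serialize behind the lock and refresh it one by one.
        Parallel.For(0, 8, _ => cache.IsMemberOf(groupId, Guid.NewGuid()));
    }
}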

In the dump, this is how I went down the rabbit hole. First, I checked for any locks that may be blocking in the process:

!locks

CritSec Wldap32!SelectLock1+0 at 000007fefdd5ada0
WaiterWoken No
LockCount 0
RecursionCount 1
OwningThread 2a18
EntryCount 0
ContentionCount 2
*** Locked

I then checked another thread to see whether it was aware that there was contention (it was):

00 00000000`2b8fd4f8 000007fe`fd991430 ntdll!ZwWaitForMultipleObjects+0xa
01 00000000`2b8fd500 00000000`77731723 KERNELBASE!WaitForMultipleObjectsEx+0xe8
02 00000000`2b8fd600 000007fe`f830d0a5 kernel32!WaitForMultipleObjectsExImplementation+0xb3
03 00000000`2b8fd690 000007fe`f8311aa9 mscorwks!WaitForMultipleObjectsEx_SO_TOLERANT+0xc1
04 00000000`2b8fd730 000007fe`f8344e95 mscorwks!Thread::DoAppropriateAptStateWait+0x41
05 00000000`2b8fd790 000007fe`f8427d94 mscorwks!Thread::DoAppropriateWaitWorker+0x191
06 00000000`2b8fd890 000007fe`f83ee48a mscorwks!Thread::DoAppropriateWait+0x5c
07 00000000`2b8fd900 000007fe`f82f077d mscorwks!CLREvent::WaitEx+0xbe
08 00000000`2b8fd9b0 000007fe`f843e6ea mscorwks!AwareLock::EnterEpilog+0xc9
09 00000000`2b8fda80 000007fe`f83336db mscorwks!AwareLock::Enter+0x72
0a 00000000`2b8fdab0 000007fe`f88b308f mscorwks!AwareLock::Contention+0x1fb
0b 00000000`2b8fdb80 000007fe`f2ff2beb mscorwks!JITutil_MonContention+0xdf
0c 00000000`2b8fdd30 000007fe`f2ff318f Microsoft_Exchange_Data_Directory_ni!Microsoft.Exchange.Data.Directory.IsMemberOfProvider.AutoRefreshCache`3[[System.__Canon, mscorlib],[System.__Canon, mscorlib],[System.__Canon, mscorlib]].CreateAndCache(System.__Canon, System.__Canon, Boolean)+0x16cb
0d 00000000`2b8fddf0 000007fe`f324dd16 Microsoft_Exchange_Data_Directory_ni!Microsoft.Exchange.Data.Directory.IsMemberOfProvider.AutoRefreshCache`3[[System.__Canon, mscorlib],[System.__Canon, mscorlib],[System.__Canon, mscorlib]].GetValue(System.__Canon, System.__Canon)+0x192f
0e 00000000`2b8fdeb0 000007fe`f324e2a0 Microsoft_Exchange_Data_Directory_ni!Microsoft.Exchange.Data.Directory.IsMemberOfProvider.IsMemberOfResolver`1[[System.__Canon, mscorlib]].IsMemberOf(Microsoft.Exchange.Data.Directory.Recipient.ADRecipientSession, System.Guid, System.Guid)+0x25b6a6
0f 00000000`2b8fe090 000007fe`f0940c15 Microsoft_Exchange_Data_Directory_ni!Microsoft.Exchange.Data.Directory.IsMemberOfProvider.IsMemberOfResolver`1[[System.__Canon, mscorlib]].IsMemberOf(Microsoft.Exchange.Data.Directory.Recipient.ADRecipientSession, Microsoft.Exchange.Data.Directory.ADObjectId, System.__Canon)+0x25bfd0
10 00000000`2b8fe180 000007fe`e0fc7267 Microsoft_Exchange_Transport_ni!Microsoft.Exchange.Transport.RecipientAPI.AddressBookImpl.IsMemberOf(Microsoft.Exchange.Data.Transport.RoutingAddress, Microsoft.Exchange.Data.Transport.RoutingAddress)+0x135
11 00000000`2b8fe200 000007fe`f245c46a Microsoft_Exchange_MessagingPolicies_Rules_ni!Microsoft.Exchange.MessagingPolicies.Rules.MembershipChecker.Equals(System.String, System.String)+0x47
12 00000000`2b8fe250 000007fe`e0fd5584 Microsoft_Exchange_Net_ni!Microsoft.Exchange.MessagingPolicies.Rules.RuleUtils.CompareStringValues(System.Object, System.Object, Microsoft.Exchange.MessagingPolicies.Rules.IStringComparer)+0x7a
13 00000000`2b8fe2a0 000007fe`f2459efe Microsoft_Exchange_MessagingPolicies_Rules_ni!Microsoft.Exchange.MessagingPolicies.Rules.IsMemberOfPredicate.Evaluate(Microsoft.Exchange.MessagingPolicies.Rules.RulesEvaluationContext)+0x74
14 00000000`2b8fe2e0 000007fe`f2459efe Microsoft_Exchange_Net_ni!Microsoft.Exchange.MessagingPolicies.Rules.OrCondition.Evaluate(Microsoft.Exchange.MessagingPolicies.Rules.RulesEvaluationContext)+0x6e
15 00000000`2b8fe360 000007fe`f246a157 Microsoft_Exchange_Net_ni!Microsoft.Exchange.MessagingPolicies.Rules.OrCondition.Evaluate(Microsoft.Exchange.MessagingPolicies.Rules.RulesEvaluationContext)+0x6e
16 00000000`2b8fe3e0 000007fe`e0fd839f Microsoft_Exchange_Net_ni!Microsoft.Exchange.MessagingPolicies.Rules.RulesEvaluator.Run()+0x1b7

O.k., so if we’re updating the cache, how many times are we doing it? Oh, I see…:

0:086> !findstack Microsoft_Exchange_Data_Directory_ni!Microsoft.Exchange.Data.Directory.IsMemberOfProvider.AutoRefreshCache`3 0

Thread 023, 2 frame(s) match
Thread 024, 2 frame(s) match
Thread 025, 2 frame(s) match
Thread 034, 2 frame(s) match
Thread 035, 2 frame(s) match
Thread 036, 2 frame(s) match
Thread 037, 2 frame(s) match
Thread 038, 2 frame(s) match
Thread 039, 2 frame(s) match
Thread 040, 2 frame(s) match
Thread 050, 2 frame(s) match
Thread 051, 2 frame(s) match
Thread 052, 2 frame(s) match
Thread 053, 2 frame(s) match
Thread 054, 2 frame(s) match
Thread 055, 2 frame(s) match
Thread 056, 2 frame(s) match
Thread 057, 2 frame(s) match
Thread 058, 2 frame(s) match
Thread 059, 2 frame(s) match
Thread 060, 2 frame(s) match
Thread 061, 2 frame(s) match
Thread 062, 2 frame(s) match
Thread 063, 2 frame(s) match
Thread 064, 2 frame(s) match
Thread 065, 2 frame(s) match
Thread 066, 2 frame(s) match
Thread 067, 2 frame(s) match
Thread 068, 2 frame(s) match
Thread 069, 2 frame(s) match
Thread 070, 2 frame(s) match
Thread 071, 2 frame(s) match
Thread 072, 2 frame(s) match
Thread 073, 2 frame(s) match
Thread 074, 2 frame(s) match
Thread 075, 2 frame(s) match
Thread 076, 2 frame(s) match
Thread 077, 2 frame(s) match
Thread 078, 2 frame(s) match
Thread 079, 2 frame(s) match
Thread 080, 2 frame(s) match
Thread 081, 2 frame(s) match
Thread 082, 2 frame(s) match
Thread 083, 2 frame(s) match
Thread 084, 2 frame(s) match
Thread 085, 2 frame(s) match
Thread 086, 2 frame(s) match
Thread 087, 2 frame(s) match

The thread holding the lock has an LDAP call in progress:

00 00000000`5443cfc8 000007fe`fce91e35 ntdll!ZwWaitForSingleObject+0xa
01 00000000`5443cfd0 000007fe`fce91f96 mswsock!SockWaitForSingleObject+0xe145
02 00000000`5443d050 000007fe`ff1f4efc mswsock!WSPSelect+0xac26
03 00000000`5443d1f0 000007fe`ff1f4e7d ws2_32!DPROVIDER::WSPSelect+0x3c
04 00000000`5443d230 000007fe`fdd13a19 ws2_32!select+0x75a
05 00000000`5443d330 000007fe`fdd199b5 Wldap32!DrainWinsock+0x417
06 00000000`5443d3c0 000007fe`fdd136ec Wldap32!LdapWaitForResponseFromServer+0x646
07 00000000`5443d480 000007fe`fdd1d967 Wldap32!ldap_result_with_error+0x279
08 00000000`5443d520 000007fe`f84abec7 Wldap32!ldap_result+0x74
*** WARNING: Unable to verify checksum for System.DirectoryServices.Protocols.ni.dll
09 00000000`5443d580 000007fe`f1c453ff mscorwks!DoNDirectCall__PatchGetThreadCall+0x7b
0a 00000000`5443d630 000007fe`f1c5c61b System_DirectoryServices_Protocols_ni!DomainBoundILStubClass.IL_STUB(IntPtr, Int32, Int32, System.DirectoryServices.Protocols.LDAP_TIMEVAL, IntPtr ByRef)+0x8f
0b 00000000`5443d720 000007fe`f1c59f79 System_DirectoryServices_Protocols_ni!System.DirectoryServices.Protocols.LdapConnection.ConstructResponse(Int32, System.DirectoryServices.Protocols.LdapOperation, System.DirectoryServices.Protocols.ResultAll, System.TimeSpan, Boolean)+0x2cb

It looks like we’re performing this query at the time of the dump:

000000005443d538 00000000e0188bf8 System.String "(&(memberOf=<GUID=20ff7cf3-3b7e-4767-8669-d2295d6fb0f3>)(!(msExchCU=*)))"

0:086> !do 00000000e0188bf8
Name: System.String
MethodTable: 000007fef7747d90
EEClass: 000007fef734e560
Size: 170(0xaa) bytes
GC Generation: 2
(C:\Windows\assembly\GAC_64\mscorlib\2.0.0.0__b77a5c561934e089\mscorlib.dll)
String: (&(memberOf=<GUID=20ff7cf3-3b7e-4767-8669-d2295d6fb0f3>)(!(msExchCU=*)))
Fields:
              MT Field Offset Type VT Attr Value Name
000007fef774f000 4000096 8 System.Int32 1 instance 73 m_arrayLength
000007fef774f000 4000097 c System.Int32 1 instance 72 m_stringLength
000007fef77497d8 4000098 10 System.Char 1 instance 28 m_firstChar
000007fef7747d90 4000099 20 System.String 0 shared static Empty
                                 >> Domain:Value 00000000004e5e50:000000015fff0370 <<
000007fef7749688 400009a 28 System.Char[] 0 shared static WhitespaceChars
                                 >> Domain:Value 00000000004e5e50:000000015fff0b90 <<
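
Reading that filter literally, it asks the domain controller for every object whose memberOf contains that group (referenced by its GUID), excluding anything with msExchCU set; in other words, a membership expansion of one large group. For context, here is roughly what such a query looks like when issued through System.DirectoryServices.Protocols, the same layer the lock owner's stack is sitting in. The server name and search base below are placeholders, the real code pages through the results (note the ADPagedReader in the output further down), and this is only an approximation for illustration, not Exchange's actual implementation:

using System;
using System.DirectoryServices.Protocols;   // the same assembly seen in the stack above

class MembershipQuerySketch
{
    static void Main()
    {
        // "dc01.contoso.com" and the search base are placeholders for this sketch.
        using (var connection = new LdapConnection("dc01.contoso.com"))
        {
            connection.SessionOptions.ProtocolVersion = 3;
            connection.Bind();   // bind with the current credentials

            // The same shape of filter seen in the dump: expand the membership of one
            // group (referenced by its GUID), excluding objects with msExchCU set.
            string filter = "(&(memberOf=<GUID=20ff7cf3-3b7e-4767-8669-d2295d6fb0f3>)(!(msExchCU=*)))";

            var request = new SearchRequest(
                "DC=contoso,DC=com",          // placeholder search base
                filter,
                SearchScope.Subtree,
                "objectGuid");                // only need an identifier per member

            // One synchronous round-trip; with sixteen thousand plus members (and the
            // paging the real code does), this is the slow call the lock owner sits in.
            var response = (SearchResponse)connection.SendRequest(request);
            Console.WriteLine("Members returned: " + response.Entries.Count);
        }
    }
}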

 

O.k., how many calls against this object are we making?:

0:086> !refs 00000000e0188bf8
This command will work faster with an SOSEX heap index. To build an index, run !bhi.
Objects referenced by 00000000e0188bf8 (System.String):
NONE

Objects referencing 00000000e0188bf8 (System.String):
follow 00000000c0393040 80 System.Object[]
follow 00000000c0393108 96 System.Object[]
follow 00000000c04a1928 80 System.DirectoryServices.Protocols.SearchRequest
follow 00000000c04a2960 72 System.Object[]
follow 00000000c0689800 80 System.Object[]
follow 00000000c06898c8 96 System.Object[]
follow 00000000c07c3e68 80 System.DirectoryServices.Protocols.SearchRequest
follow 00000000c07c4de0 72 System.Object[]
follow 00000000c08ba200 80 System.Object[]
follow 00000000c08ba2c8 96 System.Object[]
follow 00000000c0976870 80 System.DirectoryServices.Protocols.SearchRequest
follow 00000000c09777e8 72 System.Object[]
follow 00000000c0fbaf58 80 System.Object[]
follow 00000000c0fbb020 96 System.Object[]
follow 00000000c103fa40 80 System.DirectoryServices.Protocols.SearchRequest
follow 00000000c1046fd0 72 System.Object[]
follow 00000000c113a868 80 System.Object[]
follow 00000000c113a930 96 System.Object[]
follow 00000000c11e75a0 80 System.DirectoryServices.Protocols.SearchRequest
follow 00000000c11e8518 72 System.Object[]
follow 00000000c13b1068 80 System.Object[]
follow 00000000c13b1130 96 System.Object[]
follow 00000000c1547260 80 System.DirectoryServices.Protocols.SearchRequest
follow 00000000c15481d8 72 System.Object[]
follow 00000000c1669338 80 System.Object[]
follow 00000000c1669400 96 System.Object[]
follow 00000000c175b778 80 System.DirectoryServices.Protocols.SearchRequest
follow 00000000c175c6f0 72 System.Object[]
follow 00000000c1904c80 80 System.Object[]
follow 00000000c1904d48 96 System.Object[]
follow 00000000c1acba70 80 System.DirectoryServices.Protocols.SearchRequest
follow 00000000c1acc9e8 72 System.Object[]
follow 00000000c1d03f08 80 System.Object[]
follow 00000000c1d03fd0 96 System.Object[]
follow 00000000c1f8f398 80 System.DirectoryServices.Protocols.SearchRequest
follow 00000000c1f90310 72 System.Object[]
follow 00000000c21bd2f0 80 System.Object[]
follow 00000000c21bd3b8 96 System.Object[]
follow 00000000c243d8a0 80 System.DirectoryServices.Protocols.SearchRequest
follow 00000000c243e818 72 System.Object[]
follow 00000000e0188510 176 Microsoft.Exchange.Data.Directory.ADPagedReader`1[[Microsoft.Exchange.Data.Directory.ADRawEntry, Microsoft.Exchange.Data.Directory]]
follow 00000000f096aea8 80 System.DirectoryServices.Protocols.SearchRequest
follow 00000001106ce9f8 80 System.DirectoryServices.Protocols.SearchRequest

CLR Thread 0x49
stack:000000005443d538
stack:000000005443d5d0
stack:000000005443d970

O.k., how many threads do we have waiting on this?:

0:086> !syncblk
Index SyncBlock MonitorHeld Recursion Owning Thread Info SyncBlock Owner
  148 000000004ef62350 95 1 000000004ef6a070 2a18 86 0000000080074a08 System.Object
            Waiting threads: 23 24 25 34 35 36 37 38 39 40 50 51 52 53 54 55 56 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 87
-----------------------------
Total 1548
CCW 0
RCW 0
ComClassFactory 0
Free 0

In the words of Strongbad: That is not a very small number. That is a very BIG number!
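
(For anyone parsing that output: MonitorHeld in !syncblk counts 1 for the owning thread and 2 for each waiting thread, so a value of 95 works out to one owner plus roughly 47 waiters, which lines up with the long list of waiting threads above.)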

 

For those of you wondering what this means in common, everyday production use of Exchange:

 

Mail would come in, destined for recipient 'x'. Transport would see a rule with the IsMemberOf condition. Then, Transport would see that the local recipient cache for the group it needs to check membership against is out of date. It would spin up a thread to update it. Since the threads aren't aware of one another's work, this would occur for many of the recipients currently in the Submission queue. The thread making the LDAP call would own the lock as it entered the critical section. The other threads had to wait for the lock to be released before they could make their own AD call. Since only one thread in the process can own the lock at a time, the threads could only complete one by one. This is also why the mail couldn't be categorized: without a current cache a thread couldn't validate IsMemberOf, and the lock kept that refresh from completing in a timely fashion.

 

For those of you concerned about whether you'll run into this issue: so far this condition has reproduced in only a single environment, and it has recurred only twice over the span of a year. Rest assured that even though the perceivable footprint of this issue is small, it is being investigated by the Transport developers.