In today’s blog post, I’m going to talk about a local resource access performance issue.
One of our customers reported that SQL server instances running on a Windows 2003 server were failing over to another node especially when CPU load on the server was high for a short period (like 5 seconds or so). After some research by our SQL expert, it was determined that the fail over occurred because the IsAlive() function implemented by the SQL server does a “SELECT @@SERVERNAME” T-SQL query to the local SQL server instance by using the local IP address and that query doesn’t return in a timely manner.
When loopback interface was monitored during the problem time, it was also seen that “Output Queue Length” for the MS TCP loopback interface was dramatically increasing and then dropping down to 0 and then dramatically increase again like below:
Where access to local SQL instance was too slow as such, the same SQL instance could be accessed without issues from remote clients. That behavior suggested that there was a problem with loopback access.
One of our senior escalation engineers (Thanks to PhilJ) mentioned that loopback access (accessing to 127.0.0.1 or any other local IP addresses) simply queues up Work items to be later processed by a function in AFD driver. The work items queued that way are later processed by a worker thread in kernel which runs a function provided by AFD driver. Normally system can create up to 16 of such dynamic worker threads and those worker threads run at a priority level of 12. If the system had a higher priority work and if it wasn’t possible to create new worker threads, then the problem might be visible as explained above.
There’s a way to make more delayed worker threads available initially which could be configured as follows:
You can find more information at the following link:
http://msdn.microsoft.com/en-us/library/ee377058(BTS.10).aspx Registry Settings that can be Modified to Improve Operating System Performance
Even though the AdditionalDelayedWorkerThreads registry key was set to a higher value than default, the problem was still in place. Finally it was decided to get a kernel dump to better understand what was going wrong with those Delayed worker threads so that local access was too slow. Before going to the conclusion part, it would be good to state that this local resource access problem was not specific to SQL and could be experienced with any application locally accessing a resource (there were similar issues reported by other customers like “slow performance seen when local applications accessing local web services, databases” etc)
Finally a kernel dump has revealed the real problem why the delayed worker threads couldn’t catch up with the work items being posted:
A 3rd party Firewall driver was posting many delayed work items and delayed worker threads processing these work items were all trying to synchronize on the same executive resource (which was apparently a contention point) and hence the system was hitting the ceiling for the number of worker threads quickly so that new items couldn’t processed in a timely manner and local resource access was being too slow. I just wanted to give a sample call stack for one of those delayed worker threads for your reference:
THREAD fffffaee4460aaf0 Cid 0004.0044 Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT: (WrResource) KernelMode Non-Alertable
Owning Process fffffadc44622040 Image: System
Attached Process N/A Image: N/A
Wait Start TickCount 294068 Ticks: 6 (0:00:00:00.093)
Context Switch Count 200681
Start Address nt!ExpWorkerThread (0xfffff800010039f0)
Stack Init fffffadc1f1c1e00 Current fffffadc1f1c1950
Base fffffadc1f1c2000 Limit fffffadc1f1bc000 Call 0
Priority 13 BasePriority 12 PriorityDecrement 1
Child-SP RetAddr : Args to Child : Call Site
fffffadc`1f1c1990 fffff800`01027682 : fffffadc`1edcb910 fffffadc`1edeb180 00000000`0000000b fffffadc`1ed2b180 : nt!KiSwapContext+0x85
fffffadc`1f1c1b10 fffff800`0102828e : 0000000a`b306fa71 fffff800`011b4dc0 fffffadc`44605c88 fffffadc`44605bf0 : nt!KiSwapThread+0x3c9
fffffadc`1f1c1b70 fffff800`01047688 : 00000000`000000d4 fffffadc`0000001b fffffadc`1edeb100 fffffadc`1edeb100 : nt!KeWaitForSingleObject+0x5a6
fffffadc`1f1c1bf0 fffff800`01047709 : 00000000`00000000 fffffadc`167a6c70 fffffadf`fbc44a00 fffff800`01024d4a : nt!ExpWaitForResource+0x48
fffffadc`1f1c1c60 fffffadc`167a720b : fffffadc`3d41fc20 fffffadc`167a6c70 fffffadc`44605bf0 fffffadc`3d3deef8 : nt!ExAcquireResourceExclusiveLite+0x1ab
fffffadc`1f1c1c90 fffffadc`167a6c87 : 00000000`00000001 fffffadf`fbc44a00 fffff800`011cda18 fffffadc`44605bf0 : XYZ+0x1220b
fffffadc`1f1c1cd0 fffff800`010375ca : 00000000`00000000 fffffadc`1d6f7001 00000000`00000000 fffffadf`fbc44a00 : XYZ+0x11c87
fffffadc`1f1c1d00 fffff800`0124a972 : fffffadc`44605bf0 00000000`00000080 fffffadc`44605bf0 fffffadc`1edf3680 : nt!ExpWorkerThread+0x13b
fffffadc`1f1c1d70 fffff800`01020226 : fffffadc`1edeb180 fffffadc`44605bf0 fffffadc`1edf3680 00000000`00000000 : nt!PspSystemThreadStartup+0x3e
fffffadc`1f1c1dd0 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KxStartSystemThread+0x16
Note: The driver name was removed on purpose.
After the problem with the 3rd party Firewall driver was addressed, the issue was resolved. Also thanks to Ignacio J who driven this case from a technical perspective and shared the resolution details with us.
Hope this helps in your similar problems.