The Case of the Unexplained Server Hang

Recently, I received a call from one of my customers reporting a new server deployed in their environment became unresponsive or‘hung’ almost daily.  The customer explained that the server was deployed using their standard build process and wanted to collaborate on troubleshooting the issue.  
I asked a few scoping questions:

What Operating System and Service Pack? Windows 2003 Enterprise SP2 X86

Was RDP possible during unresponsiveness? No

Does server return to normal operation after some amount of time? No

Does the server respond to ping? Yes

Do they have Poolmon data for review? Yes

Did they manually capture a crash dump that we could review? No. Suggested to customer Windows feature lets you generate a memory dump file by using the keyboard

Did they collect MPS Reports for review? Yes. Microsoft Product Support Reports 

At this point I have some data to review and giving the customer a hard time about deploying an operating system in extended support with 32 bit architecture was unlikely to be well received. 

[System Event Log - beginning with clean start of system]

System 7/14/2011
1:04  Information        6009       EventLog                             POOLMONEXAMPLE                      Microsoft (R) Windows (R) 5.02. 3790 Service Pack 2 Multiprocessor Free.

System 7/14/2011
8:29  Error                  2019       Srv                                       POOLMONEXAMPLE                      The server was unable to allocate from the system nonpaged pool because the pool was empty.

So, luckily enough, we have something interesting only seven hours and twenty-five minutes later and after clean boot, kernel nonpaged pool appears to be depleted?  Why? We have a time frame that we should be able to correlate with Poolmon logs.  You may already be familiar with Poolmon but by definition the tool is designed to “display data that the operating system collects about memory allocations from the system paged and nonpaged kernel pools and about the memory pool”.  Microsoft CTS Support has a very simple script to put the collection in place if the KB by itself is not sufficient. 

Ok, so on to Poolmon, let’s start with when the system was clean booted for baseline information. 

[Clean boot happened at 7/14/2011 1:04]

First sample from Poolmon is shown below.  The N: stands for nonpaged pool and P: stands for Paged Pool.  The error above stated that the system ran out of nonpaged pool so we may want to keep a close eye on that number which is currently 26 MB.  It’s important to note that 32 bit architecture is severely limited by the amount of paged and nonpaged pool accessible at startup. This is not a Windows limitation, per se, but a mathematical one.  Having said that, what is the maximum amount of nonpaged pool which Windows can use on this platform and SKU? Roughly 256MB. Our friend and Windows expert Mark Russinovich has written a definitive article on this subject: Pushing the Limits of Windows: Paged and Nonpaged Pool

Next sample shows that nonpaged pool has grown at an alarming rate.

Now we are getting somewhere.  We know what is happening, next step is to determine root cause by identifying offending driver.  At first glance this appears too difficult to read, but for what we are after its relatively straight forward.  Thankfully, poolmon sorts the drivers based on tags and the amount of pages consumed. When drivers in kernel space need memory, they ask the memory manager to assign a page of memory.  There is nothing wrong for systems under load to take memory, but it is important to free that memory when load diminishes.  Some of the key fields are: Memory/Available memory – no problem there, Pool N or nonpaged is maxed out which is our problem, paged pool is fine. Looking at the tags listed, you see a tag/driver called “Tidt”which has consumed 214 MB which is almost all available nonpaged pool where the next highest consuming driver only used 35 MB.  We need to know what developer owns TidT tag?  A few searches on the internet, unfortunately, did not yield any promising clues.  A teammate shared a trick where you can parse all the files in system32 directory to locate a matching tag name. It is common practice that tag names may be in the header of the image file.  This is the command I used:

C:\WINDOWS\system32\drivers>findstr TidT *.*
LhmonProxy.sys:...LhmonIoCompletion...c:\winstaps\v81\tap\win\src\lhmonproxy\ioc
ompletion.c...LhmonProxy.sys..................ï Uï.â.?í$Ç..Sï]?Vïp(3.ëE.ëE°èC!ä.
Wëunt.ïC`ÇH..SF[.  SF.²  SF_.  ï}.à .ä....ìN. ..p..ï.ïw.ë.ëq.ïMnèQ.ä.t.ïw.à÷ïW?ë
U?t. A?d.ïu.è.â.. ..p..à÷.ä....èC$ä.t.èG.ä.u.ïC.à.|?èO.ä.u?à.}_èG.ä.tXSh.O..j.hÆ
...h.P..h(P..h`P..FU...ïE?ïM.â..PSQ .ShpP..j.hö...höP..h¿P..haP..ëE°F#...â..d.ïE
?à.t    j.P .Lp..ïunâ..ï. ..p..ïMn I?è.ï. ..p..WFl...ïE°_^[ïs].?.ïG?à.t j.P .Lp.
.WFJ...ïE°_^[ïs].?...............ï Uï.Vïu?èF!ä.t.ïF`ÇH..VF..  VFìn  VF.n  3.^].?
.................ï Uï.â..í$Ç..SVïp(ì^.Wï.ë]° ..p..ì~.ï7;.êE t!3.ÉïF?;.ë^.t?SP .L
p..ë^?ï6;.usï]°èU ï. ..p.._^[ïs].................ï Uï.SVWï°í$Ç..ïX(F9...ï.à÷tZïM
.ïU?èE.ëN.èM.ëV?èU.êN.êF.êV..G..Q..ëw .G.aì{.ï. ..p..ìK.è.ïA.ë.ëF.ë0ëq.ï. ..p.._
^3.[]..._^.....[]................ï Uï.ïM.ïU.ïE.ï@`QïM.RïU.QïM?RQFL   ]..........
ïM.PëH? .tp..]........ï Uï.QVï. ..p..ìw.êE 96t.Sï.Lp..ï.ï.j.ë.Pëq. .96ue[èU 3.ï.
ëG?ëG.ëG.ëG. ..p..^ïs]................ï SWï. ..p..ïN.è.ìF.;.t.ï.ë.ëB.ïF?ï.Hè.ï.ë
F? ..p..ï._[.ïF.ïN.PQj. ..p..è.ï.ï° ..p..ï._[.........ï Vï5(p..Wh`í.. ..dí..3 h@
í..úhí..údí..ë=lí....tí..(...ë=pí....xí..xtCE ..Dí..haá..úHí..úDí..ë=Lí....Tí...
...ë=Pí....Xí..gToI ..Sá..h.í..úFá..úSá..ë=.á.....á..X...ë=.á....°á..TidT ...í..
h í..ú.í..ú.í..ë=?í.....í..P...ë=.í.....í..CidT ..$í..h.á..ú(í..ú$í..ë=,í....4í.
.....ë=0í....8í..gTcA ...á..ë=.á..ë=.á.._ú.á..ú.á.....á..?......á..gTcD^........
......ï W.`í..F.²  <.@í..F.²>  .aá..F.²  ..í..F.²  . í..F½²  ..á..Fí²  _..........
......ï V.`í..Fp²  à.^t!3.ï.ë

This points us to the lhmonproxy.sys driver written by 3rd party vendor.  The customer, armed with this information, was able to obtain a patch from the vendor to resolve the problem.  I hope this blog is helpful providing a real world example of troubleshooting with poolmon and provides further encouragement to move towards X64 platform.