A word of caution to those of you that like endings: this isn’t over yet.
I’m running a rather sad and noisy X64 desktop as a server at home. Once a proud warrior, actually, no, wait, it was never any good. It’s just a Virtual Server host (it’s not quite Hyper-V capable; next one will be). SBS 2003, an IIS and an ISA Server all exist(ed) happily in there at one point. (Did I mention I virtualized my work desktop machine the other day? So liberating!)
I blatted Windows Server 2008 onto it at RTM, and it’s been happily puttering along doing the RRAS internet access and Virtual Server thing for me ever since.
But I’ve had to reset it from unresponsive-no-mouse-no-capslock situations on about four occasions over the last two weeks, and as the problem wasn’t getting any better, I figured I’d take a look at what I could do to try to diagnose it.
My guess was that I had a kernel-mode memory leak (a user mode memory leak shouldn’t ever trash the box to that extent), but it didn’t seem to correspond with any driver upgrades or software installations… something else had changed, sometime.
Perfmon (the new, shiny version) or more specifically the Reliability Monitor confirmed my suspicions:
(happy, everything-used-to-be-so-nice side on the left, then the gradual decline due to Disruptive Shutdowns towards the right). Note the quite-regular interval of red things on the bottom row. (Does it happen more when I’m at home, he wondered?)
As I had a theory in mind, I thought I’d create a Perfmon BLG (log file with lots of counters in it; lots of people seem to like CSV, but BLG is faster, and I’m never going to be opening it in Excel anyway).
How to do that? Things have changed: now, I create a "Data Collector Set", it seems. Oh yeah, reading manuals and/or following basic instruction: not my thing.
I created a new one based on the System Performance collector set, which matches my needs nicely because it contains all the Process counters and Memory counters. Between that lot, I should easily be able to spot a memory leak.
Started the collector set, and made a mental note to check in tonight.
After a little fiddling, I worked out that the animated "Data Collection In Progress" screen wasn’t generating a report, and that I’d have to stop the data collector set to view it. Right on!
So, one stopped data collector set later, the Reports view is what I’m interested in.
Remember your training – you’re interested in patterns that have slopes or steps. One counter leapt out at me, which I moused over and found was….
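If you'd rather not eyeball every counter, the "look for slopes" idea can be roughed out in a few lines of Python. This is a minimal sketch, not what Perfmon does internally: it assumes you've already got each counter's samples into a plain list (say, from a CSV export of the log), and the sample numbers and helper names (`slope`, `relative_growth`, `rank_by_growth`) are made up for illustration.

```python
from statistics import mean


def slope(values):
    """Least-squares slope of the samples against their sample index."""
    n = len(values)
    if n < 2:
        return 0.0
    xs = range(n)
    x_bar = mean(xs)
    y_bar = mean(values)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


def relative_growth(values):
    """Fitted growth across the whole log, as a fraction of the mean value.

    Dividing by the mean lets counters with very different units
    (bytes, megabytes, percent) be compared on one scale.
    """
    avg = mean(values)
    if avg == 0:
        return 0.0
    return slope(values) * (len(values) - 1) / avg


def rank_by_growth(counters):
    """Return counter names, steadiest climbers first."""
    return sorted(counters, key=lambda name: relative_growth(counters[name]),
                  reverse=True)


# Made-up samples, purely illustrative.
samples = {
    "Process(_Total)\\Pool Nonpaged Bytes": [100, 110, 121, 130, 142, 150],
    "Memory\\Available MBytes": [900, 905, 898, 902, 899, 901],
    "Processor(_Total)\\% Processor Time": [5, 40, 3, 22, 8, 10],
}
ranked = rank_by_growth(samples)
print(ranked[0])
```

A straight-line fit catches slopes well enough; steps would need a different test (comparing the mean of the first and second half of the log, for example), so treat this as a first pass only.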
Process (_Total) Pool Nonpaged Bytes
So, yep, there’s a memory leak, and it’s in one or more of the processes tracked by the Process counters. So let’s add the Pool Nonpaged Bytes counters for <All Instances> (so I can see all the processes).
So add them all, and there’s a counter that matches the slope, but at a different scale. Click it in the display to select it, and it’s SVCHOST#10. Hide all the other counters I’ve just added (multi select, right click, hide all), and then right-click it and choose Scale Selected Counter.
Whop! Matches the curve almost exactly.
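The "same shape, different scale" test I did by eye is more or less a correlation check. Here's a hedged sketch of that idea: Pearson correlation ignores scale and offset, so the per-process series that tracks the total most closely scores nearest 1. The instance names and numbers below are invented for illustration, and `pearson` and `best_match` are my own helper names, not anything Perfmon exposes.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length series (0.0 if either is flat)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def best_match(total, instances):
    """Name of the instance whose curve follows the total most closely."""
    return max(instances, key=lambda name: pearson(total, instances[name]))


# Made-up samples: the _Total series and a few per-process series.
total = [100, 110, 121, 130, 142, 150]
instances = {
    "svchost#10": [10, 20, 31, 40, 52, 60],  # climbs along with the total
    "svchost#3": [5, 5, 5, 5, 5, 5],         # flat
    "lsass": [7, 6, 8, 7, 6, 7],             # noisy, no trend
}
culprit = best_match(total, instances)
print(culprit, round(pearson(total, instances[culprit]), 3))
```

It only picks the single best match, so if two processes were leaking together you'd want to look at the whole ranking rather than just the winner.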
So, now I know it’s a service host, but I don’t know which one (they all look alike to me). I assume it’s probably still running, too. How do I find that out now?
Easy: Add the "ID Process" counter for svchost#10 (#9 pictured, artistic license)
And then click the counter in the list to see the value it has (the plotted line is flat across the graph, meaning it didn’t change at any point). I get PID 1348.
TASKLIST /SVC tells me everything I need to know (well, not everything obviously, but enough to take corrective action).
Yep – it’s the DHCP Server instance of SVCHost that’s apparently leaking NPP, a kernel resource.
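If you need to do the PID-to-service lookup more than once, it scripts easily. This sketch assumes you've captured the output of `TASKLIST /SVC /FO CSV` as text and that it carries the three columns shown in the sample; the sample rows themselves are invented for illustration (only PID 1348 comes from my box), and `services_for_pid` is a hypothetical helper name.

```python
import csv
import io


def services_for_pid(tasklist_csv, pid):
    """Return the service names hosted by the process with the given PID.

    Expects CSV text with "Image Name", "PID" and "Services" columns,
    where multiple services share one comma-separated field.
    """
    reader = csv.DictReader(io.StringIO(tasklist_csv))
    for row in reader:
        if row["PID"] == str(pid):
            return [s.strip() for s in row["Services"].split(",") if s.strip()]
    return []


# Illustrative sample only; real output will list many more processes.
sample = '''"Image Name","PID","Services"
"System Idle Process","0","N/A"
"svchost.exe","1032","DcomLaunch,PlugPlay"
"svchost.exe","1348","DHCPServer"
'''

print(services_for_pid(sample, 1348))
```

On a live machine you'd feed it the captured command output instead of the `sample` string.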
Why!? And why now!?
The graph tells me the times at which this happened, but the Event Logs are very, very quiet around then. So I’ll need to use tracing or logging or some other technique to actually track down the cause of the problem.
I right-clicked the SVCHOST instance with PID 1348 and chose Create Dump File (awesome feature, mentioned that before), for archival/root cause purposes – it may well not be possible to see the cause of the leak after the fact from a hangdump, but it’s worth grabbing just in case – and then restarted the DHCP Server Service.
Taskman memory use dropped by about 100MB straight away. This is not a busy network, and NPP isn’t typically used as cache by user mode programs (he giggled (in a manly way)). Something weird is going on there.
I restarted my performance logging, and I’ll check in again tomorrow to see if there’s any further indication of a memory leak (I haven’t done anything to fix it, so I assume there will be). Now, time to look for logging and diagnostic options…
A word on Perfmon in Windows Vista and 2008: USE IT!
If you’re doing any level of performance analysis of Perfmon logs, you need to try out the new, improved Perfmon in Vista. It runs rings around the old one. It’s fantastic (at least by comparison). It’s worth the cost of the upgrade alone. Seriously, if you do any sort of work with perfmon logs, try doing it on a Vista box and see whether your life is 1000% easier! I’m not saying it’s perfect, but by comparison with the last version in XP/2003…