SCCM Server hangs due to memory pressure (Kernel Cache Manager)


Got a request the other day to have a look at a full memory dump of a SCCM server that was hanging due to physical memory pressure (has 32GB RAM installed).

As it is the case usually in these situations, the issue was not caused by any SCCM component, but rather something else 😉

It's a good opportunity though to discuss a little bit about the Windows Kernel Cache Manager. Everything will be done with Public Symbols as usual and also the heavy use of the MEX Debugger Extension (showing it's power as well 😀 ).

Note: Even if I am using Public Symbols, I am using WinDe dbg extension which is available for some customers under NDA. Some MEX dbg extensions also will not work because they rely on WinDe dbg extension. One of these commands is !mex.mem. I will use this one for simplicity's sake as the output is structured and easy to see. If these are not available, you will need to use !kdexts.memusage or even better, the !kdexts.vm command. I will use WinDe/MEX because the output is much nicer to see :p But, you can achieve the same with the build-in commands mentioned.

 

First pf all, let's start by checking the memory status of the server - we can do this by running !mex.mem

5: kd> !mex.mem
Page File: \??\C:\pagefile.sys
Current: 10794940 Kb Free Space: 6592416 Kb ( 10.29 GB)
Minimum: 10794940 Kb Maximum: 10794940 Kb ( 10.29 GB)

Physical Memory: 8385237 ( 33540948 Kb) ( 31.99 GB)
Available Pages: 27177 ( 108708 Kb) ( 106.16 MB)
ResAvail Pages: 8193973 ( 32775892 Kb) ( 31.26 GB)
Locked IO Pages: 0 ( 0 Kb) ( 0)
Free System PTEs: 33514422 ( 134057688 Kb) ( 127.85 GB)
Modified Pages: 4073135 ( 16292540 Kb) ( 15.54 GB)
Modified PF Pages: 3 ( 12 Kb) ( 12.00 KB)
Modified No Write Pages: 0 ( 0 Kb) ( 0)
NonPagedPool 0: 8326 ( 33304 Kb) ( 32.52 MB)
NonPagedPool 1: 6844 ( 27376 Kb) ( 26.73 MB)
NonPagedPool 2: 0 ( 0 Kb) ( 0)
NonPagedPool 3: 0 ( 0 Kb) ( 0)
NonPagedPool Usage: 36694 ( 146776 Kb) ( 143.34 MB) Current pool size: 981.94 MB
NonPagedPool Max: 6263793 ( 25055172 Kb) ( 23.89 GB)
PagedPool 0: 107157 ( 428628 Kb) ( 418.58 MB)
PagedPool 1: 12274 ( 49096 Kb) ( 47.95 MB)
PagedPool 2: 6081 ( 24324 Kb) ( 23.75 MB)
PagedPool 3: 0 ( 0 Kb) ( 0)
PagedPool 4: 38 ( 152 Kb) ( 152.00 KB)
PagedPool Usage: 125550 ( 502200 Kb) ( 490.43 MB) WorkingSet: 286.58 MB, PeakWorkingSet: 517.27 MB
PagedPool Maximum: 33554432 ( 134217728 Kb) ( 128.00 GB)
Processor Commit: 3751 ( 15004 Kb) ( 14.65 MB)
Session Commit: 12901 ( 51604 Kb) ( 50.39 MB)
Syspart SharedCommit 0
Shared Commit: 17255 ( 69020 Kb) ( 67.40 MB)
Special Pool: 0 ( 0 Kb) ( 0)
Kernel Stacks: 13455 ( 53820 Kb) ( 52.56 MB)
Pages For MDLs: 1859 ( 7436 Kb) ( 7.26 MB)
Pages For AWE: 0 ( 0 Kb) ( 0)
NonPagedPool Commit: 0 ( 0 Kb) ( 0)
PagedPool Commit: 125678 ( 502712 Kb) ( 490.93 MB)
Driver Commit: 4240 ( 16960 Kb) ( 16.56 MB)
Boot Commit: 0 ( 0 Kb) ( 0)
System PageTables: 0 ( 0 Kb) ( 0)
VAD/PageTable Bitmaps: 3882 ( 15528 Kb) ( 15.16 MB)
ProcessLockedFilePages: 0 ( 0 Kb) ( 0)
Pagefile Hash Pages: 0 ( 0 Kb) ( 0)
Sum System Commit: 183021 ( 732084 Kb) ( 714.93 MB)
Total Private: 5012200 ( 20048800 Kb) ( 19.12 GB)
Misc/Transient Commit: 170655 ( 682620 Kb) ( 666.62 MB)
Committed pages: 5365876 ( 21463504 Kb) ( 20.47 GB)
Commit limit: 11083507 ( 44334028 Kb) ( 42.28 GB)

*** Low Available Pages (Standby+Zeroed+Free). This can cause performance issues. ***

Virtual Memory Physical Memory File Cache Cache Writes

Notice these 2 interesting things from the output:

  • Available Pages: 27177 ( 108708 Kb)   ( 106.16 MB)           => so we only have 106.16 MB free physical memory (pages) ...
  • Modified Pages: 4073135 ( 16292540 Kb)   ( 15.54 GB)      => from 32 GB total physical memory, we have 15.54 GB (half of the memory) in Modified Pages

What is the MiModifiedPageWriter (Modified Page Writer) thread (running in System context) doing - is it blocked by anything?

To switch to the System process, we need to find it's address - we can do this by using !mex.tl and passing it the name of the process we are interested in:

5: kd> !mex.tl system
PID Address Name
=========== ================ =========================================
0x4 0n4 fffffa8019a6b040 System
0x1b8 0n440 fffffa8028f02060 svchost.exe(LocalSystemNetworkRestricted)
=========== ================ =========================================
PID Address Name

Warning! Zombie process(es) detected (not displayed). Count: 2 [zombie report]

Ok, got it (fffffa8019a6b040) - let's switch to it by running !mex.p and passing it this address:

5: kd> !mex.p fffffa8019a6b040
Name Address Ses PID User Name Create Time Up Time Mods Handle Act Thrd Z Thrd Parent
====== ================ ==== ======= =============== ========================== ============ ==== ====== ======== ====== ==============
System fffffa8019a6b040 none 4 (0n4) DOMAIN\COMPUTERNAME$ 06/26/2017 12:11:57.995 PM 2h:03:22.004 0 1176 204 0 Idle 0 (0n0)

Memory Details:

VM Peak Work Set Commit Size
======== ======== ======== ===========
23.18 MB 27.99 MB 18.49 MB 556 KB

CPU Details:

User Kernel Total
==== ========== ==========
0s 24m:33.696 24m:33.696

Show Threads: Unique Stacks !mex.listthreads (!lt) fffffa8019a6b040 !winde.lp fffffa8019a6b040 !process fffffa8019a6b040 7

Right! We are in the context of the System process, let's find the Modified Page Writer thread - it should have this function call on the stack "nt!MiModifiedPageWriter".

Being in the context of the System process we can use MEX again to search in all available threads for a thread stack that matches what we are looking for - we can use !mex.us and pass it the string we are looking for "nt!MiModifiedPageWriter":

5: kd> !mex.us nt!MiModifiedPageWriter
1 thread [stats]: fffffa8019ab43c0
fffff80001eda6aa nt!KiSwapContext+0x7a
fffff80001edd142 nt!KiCommitThreadWait+0x1d2
fffff80001e9c03b nt!KeWaitForGate+0xfb
fffff80001e73c3a nt!MiModifiedPageWriter+0x5a
fffff80002170f12 nt!PspSystemThreadStartup+0x5a
fffff80001ec9de6 nt!KiStartSystemThread+0x16

Threads matching filter: 1 out of 204

Cool, found it! Now we want to see more information about what this thread is doing, it's state, etc. We can do this with !mex.t and passing it the thread address (fffffa8019ab43c0):

5: kd> !mex.t fffffa8019ab43c0
Process Thread CID UserTime KernelTime ContextSwitches Wait Reason Time
System (fffffa8019a6b040) fffffa8019ab43c0 4.ac 0s 1s.154 3135 WrFreePage 3s.650

WaitBlockList:
Object Name Type Other Waiters
fffff80002092e00 nt!MmModifiedPageWriterGate+0x0 Gate 0

Priority:
Current Base UB FB IO Page
17 8 0 0 2 5

# Child-SP Return Call Site
0 fffff880029fdac0 fffff80001edd142 nt!KiSwapContext+0x7a
1 fffff880029fdc00 fffff80001e9c03b nt!KiCommitThreadWait+0x1d2
2 fffff880029fdc90 fffff80001e73c3a nt!KeWaitForGate+0xfb
3 fffff880029fdce0 fffff80002170f12 nt!MiModifiedPageWriter+0x5a
4 fffff880029fdd40 fffff80001ec9de6 nt!PspSystemThreadStartup+0x5a
5 fffff880029fdd80 0000000000000000 nt!KiStartSystemThread+0x16

It's not blocked by anything, it is just waiting on nt!MmModifiedPageWriterGate which is a (nt!_KGATE) Gate object (short said, event to get triggered to start doing work) - we can have a look at the Gate object using dx and passing it the address (fffff80002092e00):

0: kd> dx -r1 ((nt!_KGATE *)0xfffff80002092e00)
((nt!_KGATE *)0xfffff80002092e00)                 : 0xfffff80002092e00 [Type: _KGATE *]
    [+0x000] Header           [Type: _DISPATCHER_HEADER]

0: kd> dx -r1 ((ntkrnlmp!_DISPATCHER_HEADER *)0xfffff80002092e00)
((ntkrnlmp!_DISPATCHER_HEADER *)0xfffff80002092e00)                 : 0xfffff80002092e00 [Type: _DISPATCHER_HEADER *]
    [+0x000] Type             : 0x7 [Type: unsigned char]
    [+0x001] TimerControlFlags : 0x1 [Type: unsigned char]
    [+0x001 ( 0: 0)] Absolute         : 0x1 [Type: unsigned char]
    [+0x001 ( 1: 1)] Coalescable      : 0x0 [Type: unsigned char]
    [+0x001 ( 2: 2)] KeepShifting     : 0x0 [Type: unsigned char]
    [+0x001 ( 7: 3)] EncodedTolerableDelay : 0x0 [Type: unsigned char]
    [+0x001] Abandoned        : 0x1 [Type: unsigned char]
    [+0x001] Signalling       : 0x1 [Type: unsigned char]
    [+0x002] ThreadControlFlags : 0x6 [Type: unsigned char]
    [+0x002 ( 0: 0)] CpuThrottled     : 0x0 [Type: unsigned char]
    [+0x002 ( 1: 1)] CycleProfiling   : 0x1 [Type: unsigned char]
    [+0x002 ( 2: 2)] CounterProfiling : 0x1 [Type: unsigned char]
    [+0x002 ( 7: 3)] Reserved         : 0x0 [Type: unsigned char]
    [+0x002] Hand             : 0x6 [Type: unsigned char]
    [+0x002] Size             : 0x6 [Type: unsigned char]
    [+0x003] TimerMiscFlags   : 0x0 [Type: unsigned char]
    [+0x003 ( 5: 0)] Index            : 0x0 [Type: unsigned char]
    [+0x003 ( 6: 6)] Inserted         : 0x0 [Type: unsigned char]
    [+0x003 ( 7: 7)] Expired          : 0x0 [Type: unsigned char]
    [+0x003] DebugActive      : 0x0 [Type: unsigned char]
    [+0x003 ( 0: 0)] ActiveDR7        : 0x0 [Type: unsigned char]
    [+0x003 ( 1: 1)] Instrumented     : 0x0 [Type: unsigned char]
    [+0x003 ( 5: 2)] Reserved2        : 0x0 [Type: unsigned char]
    [+0x003 ( 6: 6)] UmsScheduled     : 0x0 [Type: unsigned char]
    [+0x003 ( 7: 7)] UmsPrimary       : 0x0 [Type: unsigned char]
    [+0x003] DpcActive        : 0x0 [Type: unsigned char]
    [+0x000] Lock             : 393479 [Type: long]
    [+0x004] SignalState      : 0 [Type: long]
    [+0x008] WaitListHead     [Type: _LIST_ENTRY]

Hmm, what else? Well we also have another system thread that is writing modified pages (aka dirty pages) - this is being used to write memory mapped files that have been modified (these are usually being backed by the page file) - nt!MiMappedPageWriter - same as before !mex.us) and then using !mex.t to have more detailed information:

5: kd> !mex.us nt!MiMappedPageWriter
1 thread [stats]: fffffa8019ab5040
fffff80001eda6aa nt!KiSwapContext+0x7a
fffff80001edd142 nt!KiCommitThreadWait+0x1d2
fffff80001ede056 nt!KeDelayExecutionThread+0x186
fffff80001f117bc nt!MiGatherMappedPages+0x8c
fffff80001f122f8 nt!MiMappedPageWriter+0x198
fffff80002170f12 nt!PspSystemThreadStartup+0x5a
fffff80001ec9de6 nt!KiStartSystemThread+0x16

Threads matching filter: 1 out of 204
5: kd> !mex.t fffffa8019ab5040
Process Thread CID UserTime KernelTime ContextSwitches Wait Reason Time
System (fffffa8019a6b040) fffffa8019ab5040 4.b0 0s 187ms 133350 DelayExecution 15ms

Priority:
Current Base UB FB IO Page
17 8 0 0 2 5

# Child-SP Return Call Site
0 fffff88002b08910 fffff80001edd142 nt!KiSwapContext+0x7a
1 fffff88002b08a50 fffff80001ede056 nt!KiCommitThreadWait+0x1d2
2 fffff88002b08ae0 fffff80001f117bc nt!KeDelayExecutionThread+0x186
3 fffff88002b08b50 fffff80001f122f8 nt!MiGatherMappedPages+0x8c
4 fffff88002b08c50 fffff80002170f12 nt!MiMappedPageWriter+0x198
5 fffff88002b08d40 fffff80001ec9de6 nt!PspSystemThreadStartup+0x5a
6 fffff88002b08d80 0000000000000000 nt!KiStartSystemThread+0x16

Also not blocked, it's normally waiting in nt!KeDelayExecutionThread which means that after some time, it will run again and then wait again and so on (it basically runs on an interval).

But what files are being currently cached and how big are they in memory? We can have a look at this using !mex.mem -fc and we get this here:

5: kd> !mex.mem -fc
Name ControlArea FsContext Valid StandbyDirty Shared Locked Total
=================================================================================== ================ ================ ======== ============ ====== ====== ========
[...SNIPPED...]
$LogFile fffffa801fc6b990 fffffa801f656e10 860 KB 860 KB
CM_123.mdf fffffa802a3dd490 fffff8a003a85140 1.75 MB 1.75 MB
$Mft fffffa801ce9d4a0 fffffa801ce7dbb0 2.82 MB 2.82 MB
CM_123.mdf fffffa80204ffd30 fffff8a003423140 10.69 MB 15.54 GB 15.55 GB

Virtual Memory Physical Memory File Cache Cache Writes

Interesting ... there is a file called CM_123.mdf that is using 15.54GB - that is half of our physical memory !!! Ha, there we have it!

Let's have a look at more details (ex. it's path, etc.). This address (fffffa80204ffd30) to the file will point to a Control Area which is s structure the Kernel creates for the image file - we can use !kdexts.ca with this address and get more information:

5: kd> !kdexts.ca fffffa80204ffd30
ControlArea @ fffffa80204ffd30
Segment fffff8a004a8b940 Flink 0000000000000000 Blink 0000000000000000
Section Ref 1 Pfn Ref 3e30f0 Mapped Views 19f
User Ref 0 WaitForDel 0 Flush Count 0
File Object fffffa802a4144f0 ModWriteCount 0 System Views 19f
WritableRefs 0
Flags (8080) File WasPurged

\SCCMSQLBackup\123Backup\SiteDBServer\CM_123.mdf

Segment @ fffff8a004a8b940
ControlArea fffffa80204ffd30 ExtendInfo 0000000000000000
Total Ptes 1745000
Segment Size 1745000000 Committed 0
Flags (c0000) ProtectionMask

Subsection 1 @ fffffa80204ffdb0
ControlArea fffffa80204ffd30 Starting Sector 0 Number Of Sectors 800000
Base Pte fffff8a014a00000 Ptes In Subsect 800000 Unused Ptes 0
Flags d Sector Offset 0 Protection 6
Accessed
Flink fffffa801fc069d0 Blink fffffa802a3dd560 MappedViews 0

Subsection 2 @ fffffa801fc0a010
ControlArea fffffa80204ffd30 Starting Sector 800000 Number Of Sectors 800000
Base Pte fffff8a01ca00000 Ptes In Subsect 800000 Unused Ptes 0
Flags d Sector Offset 0 Protection 6
Accessed
Flink 0000000000000000 Blink 0000000000000000 MappedViews 19f

Subsection 3 @ fffffa801fc54b30
ControlArea fffffa80204ffd30 Starting Sector 1000000 Number Of Sectors 745000
Base Pte 0000000000000000 Ptes In Subsect 745000 Unused Ptes 0
Flags c Sector Offset 0 Protection 6
Flink 0000000000000000 Blink 0000000000000000 MappedViews 0

It really seems to be something related to SCCM - but which process is holding this file?

In this moment, I am guessing that we are already doing Cache Write Throttling - this is a mechanism used to improve server performance when we may delay the writes. This is something that will be triggered when nt!CcTotalDirtyPages will be bigger than nt!CcDirtyPageThreshold.

From the Windows Internals book:

"The file system and cache manager must determine whether a cached write request will affect system performance and then schedule any delayed writes. First the file system asks the cache manager whether a certain number of bytes can be written right now without hurting performance by using the CcCanIWrite function and blocking the write if necessary. For asynchronous I/O, the file system sets up a callback with the cache manager for automatically writing the bytes when writes are again permitted by calling CcDeferWrite. Otherwhise, it just blocks and waits on CcCanIWrite to continue. Once it's notified of an impending write operation, the cache manager determines how many firty pages are in the cache and how much physical memory is available. If few physical pages are free, the cache manager momentarily blocks the file system thread that's requesting to write data to the cache. The cache manager's lazy writer flushes some of the dirty pages to disk and then allows the blocked file system thread to continue. This throttling prevents system performance from degrading because of a lack of memory when a file system or network server issues a large write operation".

We have a very cool command we can use to check this which will also show us which (if any) files are currently in this state (from where we can get the file handle) - !kdexts.defwrites:

5: kd> !kdexts.defwrites
*** Cache Write Throttle Analysis ***

CcTotalDirtyPages: 4075797 (16303188 Kb)
CcDirtyPageThreshold: 4058054 (16232216 Kb)
AvailablePages: 27177 ( 108708 Kb)
ThrottleTop: 450 ( 1800 Kb)
ThrottleBottom: 80 ( 320 Kb)
ModifiedPages: 4073135 (16292540 Kb)

CcTotalDirtyPages >= CcDirtyPageThreshold, writes throttled

Check these thread(s): CcWriteBehind(LazyWriter)
Check critical workqueue for the lazy writer, !exqueue 16
Cc Deferred Write list: (CcDeferredWrites)
File: fffffa802a4144f0 Event: fffff880074906d0
File: fffffa8029c08070 Event: fffff88007968640

Yup, sure enough, we are currently doing throttling and here are the 2 file handles we have fffffa802a4144f0 fffffa8029c08070.

Nice, so now that we have the handles, we can use !kdexts.fileobj and give it the address to understand what files these are:

5: kd> !kdexts.fileobj fffffa802a4144f0

\SCCMSQLBackup\123Backup\SiteDBServer\CM_123.mdf

Device Object: 0xfffffa801f7d07c0 \Driver\volmgr
Vpb: 0xfffffa8029683ad0
Event signalled
Access: Read Write Delete

Flags: 0x43062
Synchronous IO
Sequential Only
Cache Supported
Modified
Size Changed
Handle Created

File Object is currently busy and has 0 waiters.

FsContext: 0xfffff8a003423140 FsContext2: 0xfffff8a00f5d3500
Private Cache Map: 0xfffffa802029b9c0
CurrentByteOffset: edb0f0000
Cache Data:
Section Object Pointers: fffffa8029479818
Shared Cache Map: fffffa802029b850 File Offset: ffffffffdb0f0000

Ha! Exactly the one we are looking for (using 15.54GB of RAM): \SCCMSQLBackup\123Backup\SiteDBServer\CM_123.mdf

No need to look at the other one now, it's the one we are interested in. So ok, what SCCM process is currently having a handle open on this file? We can use !mex.findhandle with the address of the file handle (fffffa802a4144f0) to get the process (you can use !kdexts.findhandle instead):

5: kd> !mex.findhandle fffffa802a4144f0
smssqlbkup.exe fffffa802935eb10
384

Ok, let's have a look at what this SCCM process (smssqlbkup.exe) is doing by switching to it now that we have the address (fffffa802935eb10):

5: kd> !mex.p fffffa802935eb10
Name Address Ses PID User Name Create Time Up Time Mods Handle Act Thrd Z Thrd Parent
============== ================ === ============ =============== ========================== ============ ==== ====== ======== ====== ========================
smssqlbkup.exe fffffa802935eb10 0 614 (0n1556) DOMAIN\COMPUTERNAME$ 06/26/2017 12:13:00.723 PM 2h:02:19.276 54 221 8 0 services.exe 264 (0n612)

Command Line: "D:\SMS_FOLDER\bin\x64\smssqlbkup.exe"

Memory Details:

VM Peak Work Set Commit Size PP Quota NPP Quota
======== ========= ======== =========== ========= =========
86.53 MB 107.85 MB 66.95 MB 12.94 MB 137.89 KB 15.38 KB

CPU Details:

User Kernel Total
====== ========= =========
2s.403 2m:21.447 2m:23.850

Show Threads: Unique Stacks !mex.listthreads (!lt) fffffa802935eb10 !winde.lp fffffa802935eb10 !process fffffa802935eb10 7

Show LPC Port information for process

Let's have a look at the list of threads and afterwards we should have a look at each thread - we can do this using !mex.lt now that we are already in the context of this process:

5: kd> !mex.lt
Process PID Thread Id Time Reason
============== === ================ ==== ======= ==============
smssqlbkup.exe 614 fffffa8029361b50 5c0 2s.371 UserRequest
smssqlbkup.exe 614 fffffa801f6a5b50 53c 43s.649 DelayExecution
smssqlbkup.exe 614 fffffa801f6a55c0 874 514ms Executive
smssqlbkup.exe 614 fffffa80293b2b50 c38 43s.649 UserRequest
smssqlbkup.exe 614 fffffa80295c0b50 7d4 43s.649 UserRequest
smssqlbkup.exe 614 fffffa801f929b50 11fc 43s.649 WrQueue
smssqlbkup.exe 614 fffffa8021324060 1af4 2s.386 WrQueue
smssqlbkup.exe 614 fffffa8029610060 10a0 2s.371 WrQueue

Thread Count: 8

Looking through each thread, we find this veeery interesting thread (fffffa801f6a55c0):

5: kd> !mex.t fffffa801f6a55c0
Process Thread CID UserTime KernelTime ContextSwitches Wait Reason Time COM-Initialized
smssqlbkup.exe (fffffa802935eb10) fffffa801f6a55c0 614.874 1s.841 2m:21.228 36298 Executive 514ms APTKIND_MULTITHREADED (MTA)

WaitBlockList:
Object Type Other Waiters
fffff880074906d0 NotificationEvent 0

# Child-SP Return Call Site
0 fffff88007490430 fffff80001edd142 nt!KiSwapContext+0x7a
1 fffff88007490570 fffff80001edf96f nt!KiCommitThreadWait+0x1d2
2 fffff88007490600 fffff80001ea2527 nt!KeWaitForSingleObject+0x19f
3 fffff880074906a0 fffff880016ebcc8 nt!CcCanIWrite+0xfffffffffffa11f7
4 fffff88007490770 fffff88001401102 Ntfs!NtfsCopyWriteA+0x68
5 fffff88007490970 fffff880014048ba fltmgr!FltpPerformFastIoCall+0xf2
6 fffff880074909d0 fffff8800142283e fltmgr!FltpPassThroughFastIo+0xda
7 fffff88007490a10 fffff800021f1bbe fltmgr!FltpFastIoWrite+0x1ce
8 fffff88007490ab0 fffff80001ed70d3 nt!NtWriteFile+0x5ad
9 fffff88007490bb0 0000000077a2bdba nt!KiSystemServiceCopyEnd+0x13
a 00000000014cce38 000007fefd73865c ntdll!NtWriteFile+0xa
b 00000000014cce40 00000000778d171a KERNELBASE!WriteFile+0xfe
c 00000000014cceb0 00000000778d1ea5 kernel32!BaseCopyStream+0x4b2
d 00000000014cd8a0 00000000778d1907 kernel32!BasepCopyFileExW+0x545
e 00000000014cdda0 00000000779556f2 kernel32!CopyFileExW+0x97
f 00000000014cde20 000000013ff6fb4a kernel32!CopyFileA+0x62
10 00000000014cde80 000000013ff81a5c smssqlbkup+0x9fb4a
11 00000000014cdf10 000000013ff7c430 smssqlbkup+0xb1a5c
12 00000000014ce280 000000013ff7bb5a smssqlbkup+0xac430
13 00000000014cf3b0 000000013ff7753f smssqlbkup+0xabb5a
14 00000000014cf400 000000013ff7987c smssqlbkup+0xa753f
15 00000000014cf6f0 000000013ff85643 smssqlbkup+0xa987c
16 00000000014cf7c0 0000000140155d55 smssqlbkup+0xb5643
17 00000000014cf9e0 0000000140155bec smssqlbkup+0x285d55
18 00000000014cfa10 00000000778d59cd smssqlbkup+0x285bec
19 00000000014cfa40 0000000077a0a561 kernel32!BaseThreadInitThunk+0xd
1a 00000000014cfa70 0000000000000000 ntdll!RtlUserThreadStart+0x1d

Well, well, well - what do we have on the stack here?

  • first of all we are clearly trying to write to a file (nt!NtWriteFile after the usermode call to CopyFile because the smssqlbkup.exe process is trying to copy the file)
  • second of all we are waiting in nt!CcCanIWrite (remember explanation of throttling from above)

So, it seems that we either have a very slow disk or we have a problem with some disk related driver. Let's check for any pending disk IRPs - we can do this by running !mex.ioq (you can use !kdexts.irpfind, but will need to parse the results manually):

5: kd> !mex.ioq
Collecting Data...This might take a moment.
Collecting Raid Data...Returned Failure
Collecting Disk Data...Added Object Disk with Filter DISK
Returned Success

================== Disk.sys Devices ==========================

Device Name DeviceObject DeviceExtension PrivateFdoData PendingIRPs
===================== ================ ================ ================ ===========
\Device\Harddisk0\DR0 fffffa801cd28060 fffffa801cd281b0 fffffa801cd29010 2
==================RaidPort Info: \Device\RaidPort2 ==========================
Device Name Miniport DeviceExtension BusType IoModel
=================================== ======== ================ =========== =========================
\Device\RaidPort2(fffffa801a668060) somestoragedriver fffffa801a6681b0 BusTypeRAID StorSynchronizeFullDuplex
RaidUnit(LUN) ClassPNPDevice DeviceState QueueState OutstandingCount QueueDepth PendingIrps
================ ================ ================== ========== ================ ========== ===========
fffffa801a6b2930 fffffa801cd28060 DeviceStateWorking Normal 2 254 2

Summary Notes:
IOQ found 80 devices, of which 2 have a total of 4 pending Irps.

Alright, so we have pending disk IRPs - let's have a look at the IRP queue now that we have the DeviceExtension address (fffffa801cd281b0) by using !mex.ioq again, but this time with the "-e" parameter with the address (fffffa801cd281b0):

5: kd> !mex.ioq -e fffffa801cd281b0

============================================
Disk Information: \Device\Harddisk0\DR0
============================================
ClassPNP Devices(Disk.sys) Information:
============================================
Device Name DeviceObject DeviceExtension PrivateFdoData PendingIRPs
===================== ================ ================ ================ ===========
\Device\Harddisk0\DR0 fffffa801cd28060 fffffa801cd281b0 fffffa801cd29010 2

Transfer Packet Current Irp Thread Wait Thread #Retries TimeOut (s) DeviceName CurrentDrvr Original Irp FileName Srb SrbStatus
================ ================ =========== ============================== ======== =========== =============== =========== ================ ============= ================ ==================
fffffa801d11a310 fffffa801d11b010 15ms fffffa80213cc430(RAMMap64.exe) 0 0 HarddiskVolume2 somestoragedriver fffffa8029b27010 \pagefile.sys fffffa801d11a430 SRB_STATUS_PENDING
fffffa801d1fdb10 fffffa801d1fd910 0s fffffa8029573b50(sqlservr.exe) 0 0 HarddiskVolume2 somestoragedriver fffffa801f8ab980 \pagefile.sys fffffa801d1fdc30 SRB_STATUS_PENDING

Wow ... stuck IRPs on writing to the page file (pagefile.sys) - remember that the memory mapped files are usually backed by the page file.

So, now it's clear and it does not matter that that we do not see our file here - this is normal because at the time the dump was taken, the we were waiting in nt!CcCanIWrite and it seems that we did not have any pending IRP for it currently.

The important thing is that we have a disk related issue here.

Let's have a quick look at both IRPs in more detail by using !mex.mirp with the "-v" parameter (for verbose) and each IRP addresses (fffffa801d11b010 & fffffa801d1fd910) (you can use !kdexts.irp instead):

5: kd> !mex.mirp -v fffffa801d11b010

Irp Details: fffffa801d11b010 [ verbose | !ddt | !winde.io | !irp ]

Mdl : fffffa801fd8a7b0
System buffer :
Issuing Process :
Thread :
Frame Count : 2
IoStatus Status : c00000bb
IoStatus Info : 0000000000000000
Requester Mode :
Cancel : 0
Cancel IRQL : 0
Apc Environment : 0
User Iosb : 0000000000000000
User Event :
APC : 0000000000000000
Completion Key : 0000000000000000
Cancel Routine :
Original File Object: 0000000000000000
Original File Name :

Irp Stack Frame(s)

# Driver Major Minor Dispatch Routine Control Code Flg Ctrl Status Completion Invoker(s) Device QueueLocation File Context Completion Routine Args
=== ================ ======================= ===== ======================== ============ === ==== ======= ====================== ================ ============= ====== ====================== ============================ ===============================================================================
1 CREATE 0 0 0 None 0000000000000000 (null) 0000000000000000 0000000000000000 0000000000000000 0000000000000000
->2 \Driver\somestoragedriver INTERNAL_DEVICE_CONTROL 0 storport!RaDriverScsiIrp 12 e1 Pending Cancel, Success, Error fffffa801a6b27e0 (null) fffffa801d11a310(CPnp) CLASSPNP!TransferPktComplete fffffa801d11a430(CPnp) 0000000000000000 0000000000000000 fffffa801a6b2930(Devi)

IO Status: 0xc00000bb (The request is not supported.)

 

5: kd> !mex.mirp -v fffffa801d1fd910

Irp Details: fffffa801d1fd910 [ verbose | !ddt | !winde.io | !irp ]

Mdl : fffffa801f5b9670
System buffer :
Issuing Process :
Thread :
Frame Count : 2
IoStatus Status : c00000bb
IoStatus Info : 0000000000000000
Requester Mode :
Cancel : 0
Cancel IRQL : 0
Apc Environment : 0
User Iosb : 0000000000000000
User Event :
APC : 0000000000000000
Completion Key : 0000000000000000
Cancel Routine :
Original File Object: 0000000000000000
Original File Name :

Irp Stack Frame(s)

# Driver Major Minor Dispatch Routine Control Code Flg Ctrl Status Completion Invoker(s) Device QueueLocation File Context Completion Routine Args
=== ================ ======================= ===== ======================== ============ === ==== ======= ====================== ================ ============= ====== ====================== ============================ ===============================================================================
1 CREATE 0 0 0 None 0000000000000000 (null) 0000000000000000 0000000000000000 0000000000000000 0000000000000000
->2 \Driver\somestoragedriver INTERNAL_DEVICE_CONTROL 0 storport!RaDriverScsiIrp 12 e1 Pending Cancel, Success, Error fffffa801a6b27e0 (null) fffffa801d1fdb10(CPnp) CLASSPNP!TransferPktComplete fffffa801d1fdc30(CPnp) 0000000000000000 0000000000000000 fffffa801a6b2930(Devi)

IO Status: 0xc00000bb (The request is not supported.)

Ok, there we go, we are stuck with these IRPs inside of the disk vendor driver (called "somestoragedriver" here for obvious purposes). The cause these are stuck is that we get the IO Status 0xc00000bb from the IOCTL, which means STATUS_NOT_SUPPORTED (The request is not supported).

Last thing - let's check some information about this driver (like how old it is) - we can do this by running lmtm (or any other lm* command) and give it the name of the driver (somestoragedriver):

5: kd> lmtm somestoragedriver
Browse full module list
start end module name
fffff880`01549000 fffff880`01559000 somestoragedriver Mon Aug 9 21:21:24 2010 (4C604724)

Wow ... this is awesome ... we are in July 2017 and this disk driver is from August 2010 ... niiice ...

 

So, what we need to do is:

  • Update the "somestoragedriver.sys" driver to the latest existing version
  • After this, monitor disk performance using Performance Monitor and check if the disk is performing well

Sure enough, after updating the "somestoragedriver.sys" driver, the issue was solved and the SCCM server stopped eating up memory until it gets into a hung state.

 

WOOT! 😀

 

PS: We can also get to the file directly by running !mex.mem -p, but this command usually takes a very long time (couple of hours sometimes, depending how big the dump is - this can also be achieved with !kdexts.memusage, but the output will need to be looked at manually - not summary for the Mapped Files).

The reason why it takes a lot of time, is that it has to scan the PFN (Page Frame Number) database and process all the data.

5: kd> !mex.mem -p
Loading PFN database... Building kernel map...this will take a while.

Top physical RAM users:

Type Name ControlArea/PID Valid Standby Dirty Shared Locked PageTables TOTAL
=========== ============== ================ ========= ======== ======== ======= ========= ========== =========
Mapped File CM_123.mdf FFFFFA80204FFD30 10.69 MB 15.54 GB 15.55 GB
Process sqlservr.exe 14866272 14.18 GB 79.96 MB 16 KB n/a n/a 35.34 MB 14.29 GB
Process ReportingServi FFFFFA80292A6B10 241.43 MB 2.62 MB 16 KB n/a n/a 2.28 MB 246.35 MB
System NonPaged Pool 184.33 MB n/a 184.33 MB 184.33 MB
Process RAMMap64.exe FFFFFA8020C87380 111.77 MB 4.94 MB n/a n/a 2.06 MB 118.77 MB
Process some_other_process FFFFFA8029411060 97.23 MB 1.47 MB n/a n/a 1.18 MB 99.88 MB
System Kernel Stacks 53.94 MB 44 KB n/a 992 KB 128 KB 54.11 MB
Process procexp64.exe FFFFFA80297EB060 36.13 MB n/a n/a 608 KB 36.72 MB
Process svchost.exe FFFFFA8028EF25D0 28.81 MB 1.97 MB n/a n/a 732 KB 31.49 MB
Process explorer.exe FFFFFA8029673A40 23.09 MB 196 KB n/a n/a 724 KB 23.99 MB
Process perfmon.exe FFFFFA802976A6C0 21.61 MB 160 KB n/a n/a 460 KB 22.21 MB
Mapped File sqlservr.exe FFFFFA8029838AC0 20.22 MB 20.22 MB
Process explorer.exe FFFFFA80296AAB10 19.06 MB 120 KB n/a n/a 664 KB 19.82 MB
Process lsass.exe FFFFFA801F106A70 18.47 MB 96 KB n/a n/a 352 KB 18.91 MB
Process WmiPrvSE.exe FFFFFA80294AF060 18.11 MB 4 KB n/a n/a 548 KB 18.64 MB
Process smsexec.exe FFFFFA801F5D0B10 17.64 MB 40 KB n/a n/a 724 KB 18.38 MB
Process procexp64.exe FFFFFA802A574B10 17.18 MB 40 KB n/a n/a 460 KB 17.66 MB
Mapped File shell32.dll FFFFFA801EF80680 13.5 MB 3.12 MB 16.62 MB
Process svchost.exe FFFFFA8028F0BB10 15.03 MB 28 KB n/a n/a 432 KB 15.48 MB
Process some_process FFFFFA8029018B10 14.19 MB 196 KB n/a n/a 432 KB 14.8 MB
=========== ============== ================ ========= ======== ======== ======= ========= ========== =========
Type Name ControlArea/PID Valid Standby Dirty Shared Locked PageTables TOTAL

Only the top 20 of 1612 rows are being shown. Show all

Virtual Memory Physical Memory File Cache Cache Writes

 

Here is how it would look like with !kdexts.memusage with all other stuff snipped out:

5: kd> !kdexts.memusage
[...SNIPPED...]
Usage Summary (in Kb):
Control Valid Standby Dirty Shared Locked PageTables name
ffffffffffffd 8384 0 0 0 0 0 AWE
fffffa8019a34010 440 0 0 184 0 0 mapped_file( winhttp.dll )
fffffa8019ac18d0 16 0 0 0 0 0 mapped_file( netutils.dll )
[...SNIPPED...]
fffffa80204ffd30 10944 0 16292096 0 0 0 mapped_file( CM_123.mdf )
[...SNIPPED...]

 

Don't forget to read-up on your Windows Internals 😉

 

Comments (0)

Skip to main content