The Curious Case of TMG Stopping Responding in Random days but always during the Morning

1. Introduction


Talking about curious case to troubleshoot, this was a very good one. First because this was a new deployment, a well planned hardware for the deployment and the amount of traffic hitting TMG was not that huge. Everything seems to be okay, other than the fact that at least twice a week TMG was stopping responding and it was necessary to restart the whole server. To make things even more bizarre, such issue was happening in random days but always within the range of 8AM and 11AM.


2. Gathering Data


On this scenario TMG administrator couldn’t really get a dump of the wspsrv.exe process because he couldn’t even log in locally.  It was necessary to prepare the machine for a kernel dump capture, for that I used the procedures from KB969028.


3. Analyzing the Data


On a scenario where the machine completely stops responding one of the most useful commands to type while analyze a kernel dump is the !locks command. By definition (windbg help), the !locks command will:


 “Display all locks held on resources by threads. A lock can be shared or exclusive, which means no other threads can gain access to that resource. This information is useful when a deadlock occurs on a system. A deadlock is caused by one non-executing thread holding an exclusive lock on a resource that the executing thread needs.”


In this case here it is the result of this command:


0: kd> !locks


KD: Scanning for held locks………………………………………………………………………………………………………………………….


Resource @ 0xfffffa800dbad748    Shared 1 owning threads

    Contention Count = 493

     Threads: fffffa800e09db63-01<*> *** Actual Thread fffffa800e09db60

KD: Scanning for held locks……


Resource @ 0xfffffa800bc2a7b8    Exclusively owned

     Threads: fffffa800e09db60-01<*>

KD: Scanning for held locks………………………………………………………………………………

7461 total locks, 2 locks currently held


Next step is to verify which thread is that:


0: kd> !thread fffffa800e09db60

THREAD fffffa800e09db60  Cid 0004.0680  Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT: (UserRequest) KernelMode Non-Alertable

    fffffa800cfc4060  SynchronizationEvent

IRP List:

    fffffa8010231bd0: (0006,0430) Flags: 00060901  Mdl: fffffa8010287000

Not impersonating

DeviceMap                 fffff8a000008b30

Owning Process            fffffa8009a3a040       Image:         System

Attached Process          N/A            Image:         N/A

Wait Start TickCount      26737768       Ticks: 3 (0:00:00:00.046)

Context Switch Count      940461            

UserTime                  00:00:00.000

KernelTime                00:13:42.562

Win32 Start Address vhdmp!VhdmpiAsyncOpThread (0xfffff880076bcdb0)

Stack Init fffff88008267db0 Current fffff88008267440

Base fffff88008268000 Limit fffff88008262000 Call 0

Priority 9 BasePriority 8 UnusualBoost 0 ForegroundBoost 0 IoPriority 2 PagePriority 5

Child-SP          RetAddr           : Args to Child                                                           : Call Site

fffff880`08267480 fffff800`01ac9992 : 00000000`0000000c fffffa80`0e09db60 fffffa80`00000000 fffffa80`10231bd0 : nt!KiSwapContext+0x7a

fffff880`082675c0 fffff800`01acbcff : 00000000`0000139c fffffa80`0cfc4060 00000000`00000000 fffffa80`0bf1a9a0 : nt!KiCommitThreadWait+0x1d2

fffff880`08267650 fffff800`01dbd1d2 : 00000000`00000000 ffffffff`00000006 00000000`00000000 00000000`00000000 : nt!KeWaitForSingleObject+0x19f

fffff880`082676f0 fffff800`01ac1993 : fffffa80`0e09db60 00000000`00200200 00000000`00000000 fffffa80`0cfc4060 : nt!NtWaitForSingleObject+0xb2

fffff880`08267760 fffff800`01abdf30 : fffff880`07694640 fffffa80`0cde80e0 fffffa80`0bc2a000 ffffffff`8000139c : nt!KiSystemServiceCopyEnd+0x13 (TrapFrame @ fffff880`08267760)

fffff880`082678f8 fffff880`07694640 : fffffa80`0cde80e0 fffffa80`0bc2a000 ffffffff`8000139c 00000000`000001d1 : nt!KiServiceLinkage

fffff880`08267900 fffff880`076b7ee2 : 00000000`00000000 fffff880`076b7e5c fffffa80`0cde80e0 fffffa80`0cde80e0 : vhdmp!VhdmpiReadVhdFileAtPassiveLevel+0x50

fffff880`08267970 fffff880`076bd9e0 : 00000000`00000000 00000000`000001d1 fffff8a0`0c2f2000 00000000`00001d10 : vhdmp!VhdmpiCopyRawSectorsSync+0x72

fffff880`082679c0 fffff880`076bdf11 : fffff8a0`01f26000 00000000`00022dbe fffff880`08267b90 fffff880`08267ac8 : vhdmp!VhdmpiRegisterWithRootEnumerator+0x220

fffff880`08267a60 fffff880`076c05e4 : 00000000`00000001 00000000`00000000 00000000`00022dbe fffffa80`0bc2a000 : vhdmp!VhdmpiShutdownNestingLevels+0x3b1

fffff880`08267b60 fffff880`076c083a : 00000000`00000000 00000000`00000001 fffffa80`0e573c00 fffffa80`0e573c00 : vhdmp!VhdmpiDoCompaction+0x124

fffff880`08267ca0 fffff880`076bce02 : ffffffff`ff676980 fffffa80`0e09db60 00000000`00000080 fffffa80`0cde80e0 : vhdmp!VhdmpiCompactThread+0x13a

fffff880`08267d00 fffff800`01d66c06 : fffffa80`0e09db60 00000000`00000080 fffffa80`09a3a040 00000000`00000001 : vhdmp!VhdmpiAsyncOpThread+0x52

fffff880`08267d40 fffff800`01aa0c26 : fffff880`009bf180 fffffa80`0e09db60 fffffa80`09a3ab60 fffff880`01452534 : nt!PspSystemThreadStartup+0x5a

fffff880`08267d80 00000000`00000000 : fffff880`08268000 fffff880`08262000 fffff880`08267450 00000000`00000000 : nt!KxStartSystemThread+0x16


Notice that this thread has an I/O request packet (IRP) and for scenarios like this (server freezes) it is very important to review the IRP for that thread that seems to be causing the issue. To do that you need to use the command !irp, here it is the result for this case:


0: kd> !irp fffffa8010231bd0

Irp is active with 12 stacks 9 is current (= 0xfffffa8010231ee0)

 Mdl=fffffa8010287000: No System Buffer: Thread fffffa800e09db60:  Irp stack trace. 

     cmd  flg cl Device   File     Completion-Context

 [  0, 0]   0  0 00000000 00000000 00000000-00000000   


                     Args: 00000000 00000000 00000000 00000000

 [  0, 0]   0  0 00000000 00000000 00000000-00000000   


                     Args: 00000000 00000000 00000000 00000000

 [  0, 0]   0  0 00000000 00000000 00000000-00000000   


                     Args: 00000000 00000000 00000000 00000000

 [  0, 0]   0  0 00000000 00000000 00000000-00000000   


                     Args: 00000000 00000000 00000000 00000000

 [  0, 0]   0  0 00000000 00000000 00000000-00000000   


                     Args: 00000000 00000000 00000000 00000000

 [  0, 0]   0  0 00000000 00000000 00000000-00000000   


                     Args: 00000000 00000000 00000000 00000000

 [  0, 0]   0  0 00000000 00000000 00000000-00000000   


                     Args: 00000000 00000000 00000000 00000000

 [  0, 0]   0  0 00000000 00000000 00000000-00000000    


                     Args: 00000000 00000000 00000000 00000000

>[  3,34]  10 e1 fffffa800a03f060 00000000 fffff88000e34180-fffffa800a04de20 Success Error Cancel pending

              \Driver\Disk  volmgr!VmpReadWriteCompletionRoutine

                     Args: 00200200 00000000 3bd4e26200 00000000

 [  3, 0]   0 e0 fffffa800a04dcd0 00000000 fffff88001792ee0-fffffa800de75450 Success Error Cancel

              \Driver\volmgr      volsnap!VspReadCompletionRoutine

                     Args: 00200200 00000000 3bd4d26200 00000000

 [  3, 0]   0 e1 fffffa800a047040 00000000 fffff8800145ef70-fffffa800dabd550 Success Error Cancel pending

              \Driver\volsnap     Ntfs!NtfsMasterIrpAsyncCompletionRoutine

                     Args: 00200200 00000000 3bd4d26200 00000000

 [  3, 0]   0  1 fffffa800a069030 fffffa800df75880 00000000-00000000    pending


                     Args: 00200200 00000000 21ed477200 00000000


The major function 3,34 has the pending status set, which appears to be stuck on \Drover\Disk (Volume Managr – volmgr). Let’s take a closer look on the value fffffa800a03f060, which we can see by using the command !devobj as shown below:


0: kd> !devobj fffffa800a03f060

Device object (fffffa800a03f060) is for:

 DR2 \Driver\Disk DriverObject fffffa8009e53e70

Current Irp 00000000 RefCount 0 Type 00000007 Flags 01002050

Vpb fffffa8009fd50b0 Dacl fffff9a10043d7f0 DevExt fffffa800a03f1b0 DevObjExt fffffa800a03f858 Dope fffffa8009feb450

ExtensionFlags (0x00000800) 

                             Unknown flags 0x00000800

AttachedDevice (Upper) fffffa800a03fb90 \Driver\partmgr

AttachedTo (Lower) fffffa8009fe9b70 \Driver\USBSTOR


This is a USB driver, which is something that I was not expecting to have a this TMG box at all, so I decided to before even move forward ask to the TMG administrator what a USB driver is doing on TMG. TMG Administrator then clarified the scenario as follow: “We have a USB external driver attached to the system to store the full backup of the system that we perform every night.”


The dump is quiet clear, in other words: server is waiting for the disk. At this point I can conclude that the server was hanging waiting for this USB driver, which for some reason was not responding, as a consequence the whole box stop responding (including TMG, but clearly not a TMG issue).


4. Conclusion


After start reviewing the logs from the backup software we could conclude that: the backup job was supposed to finish at 6AM, however some days was extending beyond that, when the backup was running ahead of 8AM and the production started on TMG the system presented this behavior. Backup software was performing a full backup of the system, including TMG cache folder (which by itself was 4GB). We do not recommend making a backup of TMG cache folder, this folder needs to be excluded from backup. Making backup of the cache folder can cause other issue of this nature, such as crashing Firewall Service as I mentioned in this post.  To resolve this problem we excluded the TMG cache folder from the backup and re-schedule the backup job to start at 11PM and let it run until 7AM in an attempt to avoid backup job to go beyond commercial business hours.


Comments (2)

  1. Anonymous says:

    it looks like some cases I´ve seen at a customer…An USB driver to store the backup??? Wow!!!

  2. Juniad says:

    Good Job Yuri..