Processes consuming 100% of a system's CPU time is such a situation everybody has faced at least once (likely much more than once :)), and normally as solution we just "kill the murderer" by stopping the culprit process. But what was really causing that process to behave like that? If you can't properly answer this question you might be convicted to repeat the "kill the murderer" process forever...
The procedures we will see below are general enough for troubleshoot any application causing the 100% CPU situation, however I will use IIS as an example here since, as a custom application's host, it is one of the products whereby this situation happens very often.
Notice: This has been classified as an intermediary level content, so it presumes you understand concepts like memory dumps, crashes, hangs, applications, processes, threads, call stacks etc, and also have some experience with debugging.
Before the problem happens...
Make sure you are familiar with and have the proper tools we will need for the further steps. We will need the System Monitor (perfmon.exe), the Debugging Tools for Windows package installed (preferably installed at the c:\debuggers folder) and if possible the proper symbol files.
You will also need some basic understanding about the IIS architecture.
Run the System Monitor (perfmon.exe), in a remote machine if possible, and create a new Counter Log: Chose any name, lets say "Performance_log". After confirming the name you've just chosen, in the General Tab, click in Add Objects and choose "Select counter objects from comupter". In the combo list, select or type the remote computer name (the one where the problem happens and which you're going to monitor)
Select the following objects::counters:
Start monitoring and wait until the server gets the high CPU utilization issue
When the problem happens...
First confirm that is an IIS process (any w3wp.exe for IIS6, inetinfo.exe or any dllhost.exe for IIS5/5.1 of inetinfo.exe or any MTX.exe - IIS4) that is consuming the CPU time. You can use even the task manager (CPU and Process columns) to confirm that.
After identifying which process is consuming the CPU time, take a note of the Process ID of such a process from task manager. You can also use other tools like tasklist.exe (builtin for Windows XP and Windows 2003) or tlist.exe from the Debugging Tools for Windows. For IIS5 servers, if you want to know what application is being hosted by the process (in case of a DLLHOST.EXE) you can use the Component Services mmc or simply run the following command from a command prompt:
c:\debuggers\>tlist -k | find /i "dllhost.exe"
It will list all the DLLHOST.EXE process and the respective applications being hosted by the processes in following output format:
C:\Debuggers>tlist -k | find /i "dllhost"
3496 dllhost.exe MyCom+ Application 01
4892 dllhost.exe MyCom+ Applicatoin 02
4900 dllhost.exe IIS Out_Of_Proc_Pooled
3840 dllhost.exe Mts: System Application
For the IIS4, the same assumption is valid except that you have MTS Explorer instead and the surrogate name will change from dllhost.exe to mtx.exe.
For the IIS6 however, you can't use Component Services (IIS6 doesn't rely on COM+ for creating the surrogates as IIS5 does) nor tlist.exe to identify the applications being hosted by the processes. You will need to use the script iisapp.vbs instead (it's built in script for Windows 2003). Run iisapp from a command prompt with no parameters and it will return a list with the w3wp.exe processes, their PID and the application pools being hosted by them.
Once you've identified the PID (process ID) the CPU consumer you're ready to generate a hang memory dump (please note that the dump will contain useful information only, and if only, it was taken while the problem was happening). To generate the dump execute the following commands from a command prompt:
C:\Debuggers\>cscript adplus.vbs -hang -p <PID> -quiet -o c:\dumps
Adplus will spawn a debugger (cdb.exe) which will attach to the process, generate the hang dump and dettach itself. If you're familiar with the Debug Diagnostics Tool package (DebugDiag), feel free to use it. It doesn't matter which tool you will use as long as you generate a good full dump.
Generally is a good practice to generate more 1 or 2 dumps repeating above procedure 1 minute apart from each other. This might be useful when analyzing the problem cause.
Once the adplus.vbs is done, stop the perfmon log. At this point you can also apply the workaround (likely kill the process) you've been applying to make your server reponsive again.
Analyzing the data...
Open the perfmon log:
Run perfmon , select System Monitor on the left navigation bar and type Ctrl + Q to open the properties window.
Click in the Source tab and select Log Files. Click add and select the c:\perflogs\Performance_log000001.blg file and click ok.
Press Ctrl + I, or press the "+" button, to add objects and counters.
Select Process object, and the %Processor Time counter, select all the IIS related processes, click add then close.
The system monitor window will show the %CPU utilization of each process. Press Ctrl+ B to change the view to histogram. Determine the highest bar and double click it.
Press Ctrl+ H to confirm you selected the highest bar on the histigram view. Determine which process is the one that owns the highest bar. It must to be the same as the one you identified on the task manager as the culprit.
Press Ctrl + E to clean the histogram window and press CTRL + I to include new counters.
Select thread object, select %Processor Time counter and select all the instances of the process you determined as the culprit one.
Repeat the same process as before to determine which thread is the one consuming CPU - tip: Histogram is the best view to do that.
Once you know the process and thread within the process that is consuming the CPU, clean the perfmon System Monitor window one more time - Ctrl + E
Press Ctrl + I and select the Thread object again. Select the ID thread counter and select the instance of the previous determined thread. Click add and close. See how to simplify the Thread and Process ID identification in the perfmon in this article.
In the System Monitor Window, determine the thread ID of thread by looking at Min, Max and Avg values. THey should be all the same. If they are not the same is because you have monitored for a longer period than the high CPU utilization one. If that happened press Ctrl + Q, go to source tab and move the Time WIndow slide in order to get only the period of high utilization CPU, then go back to System Monitor Tool and chek for thread ID again.
Convert the Thread ID value from decimal to hexadecimal and take a note of this hex value.
How about some debugging now?
Open the Windbg.exe from the c:\debuggers folder.
Press Ctrl + D and select the hang dump (.dmp file) you generated earlier by uging adplus.vbs (they should be placed in the folder c:\dumps based on the option "-o c:\dumps" used before)
Close the Disassembly window if it opens (some versions of Windbg will do it autommaticaly)
Make sure your symbol path is properly set (remember to use the command !symfix or set it manually to http://msdl.microsoft.com/download/symbols) and run the .reload /s command if you needed to change the symbol path.
On the Windbg's command prompt type ~ and press enter. A thread list will show up as the example below:
0 Id: f4.f3 Suspend: 1 Teb: 7ffde000 Unfrozen
1 Id: f4.f7 Suspend: 1 Teb: 7ffdd000 Unfrozen
2 Id: f4.10a Suspend: 1 Teb: 7ffdb000 Unfrozen
3 Id: f4.10f Suspend: 1 Teb: 7ffda000 Unfrozen
4 Id: f4.110 Suspend: 1 Teb: 7ffd9000 Unfrozen
5 Id: f4.111 Suspend: 1 Teb: 7ffd8000 Unfrozen
6 Id: f4.112 Suspend: 1 Teb: 7ffd7000 Unfrozen
7 Id: f4.115 Suspend: 1 Teb: 7ffd6000 Unfrozen
8 Id: f4.116 Suspend: 1 Teb: 7ffd5000 Unfrozen
9 Id: f4.117 Suspend: 1 Teb: 7ffd4000 Unfrozen
10 Id: f4.11a Suspend: 1 Teb: 7ffae000 Unfrozen
11 Id: f4.11b Suspend: 1 Teb: 7ffad000 Unfrozen
12 Id: f4.129 Suspend: 1 Teb: 7ffac000 Unfrozen
13 Id: f4.12a Suspend: 1 Teb: 7ffab000 Unfrozen
14 Id: f4.12e Suspend: 1 Teb: 7ffaa000 Unfrozen
The Id: identificator above is followed by the PID.TID where PID is the hex representation of the process ID and TID is the hex representation of the thread ID.
Find the one which has the thread ID you determined before. Tip: you can convert from hex to decimal by the the "?" in the debugger's command prompt. Ex.: "? f4" will generate the following output:
0:00> ? f4
Evaluate expression: 244 = 000000f4
Once you identified the thread in the list you have two options to set this thread's context the active one. You run the following command with using the TID:
Or you run the following command using the corresponding thread number (#):
Both command will produce the same result which is set the thread's context the active one in the debugger. Now you can check the thread stack to see what is causing the high CPU utilization by just typing one of stack commands (k). For example, the kb command will show you the stack with the some additional information like the three first arguments passed through the functions as below:
ChildEBP RetAddr Args to Child
07c7bc20 623016d6 60a27460 07c7bff8 00000200 oran8!nnciqdn+0xab
07c7bd3c 60b2695a 60a27460 00298880 07c7bff8 orantns8!nnftboot+0x6d6
07c7bd68 60b25881 00298880 07c7bff8 00000200 oran8!nnfgqdn+0x3a
07c7bdc0 60a10acf 60a27460 07c7e630 00000000 oran8!nnfgrne+0x2c1
07c7c418 60b6e222 60a27460 07c7e630 00000016 oranl8!nlolgobj+0x6df
07c7c458 60b27be0 60a27460 07c7e630 00298930 oran8!sntseltst+0xc072
07c7c488 60b16472 60a27460 07c7e630 000000ff oran8!nnfsn2a+0x30
07c7d600 60b60ebc 60a27460 00001803 07c7e630 oran8!niqname+0x252
07c7f76c 604b0c93 07c7e630 00000007 00001803 oran8!osncon+0x23c
07c7f94c 604b849e 07012e44 14414d22 00000007 ORACLIENT8!xaolog+0x1f533
07c7f970 604b84ea 14414df8 14414d22 00000007 ORACLIENT8!xaolog+0x26d3e
07c7f98c 604ba15d 14414df8 14414d22 00000007 ORACLIENT8!xaolog+0x26d8a
07c7fe44 604bb128 14414df8 14414cf6 0000002b ORACLIENT8!xaolog+0x289fd
07c7fe84 6047fed4 14414df8 14414cf6 ffffffff ORACLIENT8!xaolog+0x299c8
07c7febc 604614a0 14414df8 14414cf6 ffffffff ORACLIENT8!upilog+0x64
07c7fef0 10002642 14414db8 14414df8 14414cf6 ORACLIENT8!ocirlo+0xb0
07c7ff84 7800265a 14414b90 01b0ed14 7800ef03 ociw32!olog+0x62
07c7ffb8 77f04ef0 00297850 01b0ed14 7800ef03 MSVCRT!beginthreadex+0xb2
07c7ffec 00000000 780025ff 00297850 00000000 KERNEL32!lstrcmpiW+0xbe
By looking at thread stack you should be able to identify what is causing the high CPU utilization problem. You can always unassembly (u) from the intruction pointer to try digging a little bit more and obtain some additional information. Ex.:
0:065> u eip
60b1fefb 8b10 mov edx,dword ptr [eax]
60b1fefd 83c004 add eax,4
60b1ff00 52 push edx
60b1ff01 50 push eax
60b1ff02 8b4510 mov eax,dword ptr [ebp+10h]
60b1ff05 53 push ebx
60b1ff06 50 push eax
60b1ff07 56 push esi
I will post a real case analysis very soon in this blog...