PRF: Server Hang (Pre-Windows Server 2008)

PRF Header


 


Computer Hung/Unresponsive


(Pre-Windows Server 2008)


 


Description: A hang is typically defined as a condition where a machine is non-responsive over the network and/or at the console. It usually manifests as being unable to log on at the console or to a session, or as a session becoming unresponsive to input or network traffic. This is not to be confused with a crash or bugcheck, which indicates a software or kernel fault. This document is specific to instances where a machine hangs or becomes unresponsive during normal use. It does not apply to the following symptoms, which are covered elsewhere:


 


Server hang during boot


Server hang after CTRL-ALT-DEL


Server hang at Applying Computer Settings


Server hang at Shutdown


 


This document applies to:


 


Windows 2000 Service Pack 4 with Update Rollup Package 1 (Mainstream support ended 6/30/2005)


Windows Server 2003 RTM (Mainstream support ended 3/30/2007)


Windows Server 2003 Service Pack 1 (Mainstream support ended 4/14/2009)


Windows Server 2003 Service Pack 2 (Mainstream support ends 7/13/2010)


 


Note: For current support lifecycle dates, see http://support.microsoft.com/gp/lifeselect


 


  


Scoping the Issue: Define the type of hang:


 


1.     Is the console hung or is it an issue with network connectivity?


2.     Does Ctrl-Alt-Delete bring up the Windows Security dialog?


3.     Can you toggle Caps Lock or Num Lock? If you can’t, it could be a hardware or driver problem.


4.     Can you move the mouse?


5.     Is there a KVM in use?


6.     When did the issue start occurring?  DDMMYYYY, HH:MM:SS


7.     What changed?


8.     How long has the server been in production?


9.     How often does the issue occur?


10.  Under what conditions does the issue occur?


11.  What else is going on when the issue occurs?


12.  Does it happen at a particular time of day (users logging in, scheduled tasks, backup, etc.)?


13.  Is there anything you can do to make the problem occur (repro steps)?


14.   Can you ping by IP address, NetBIOS name, or fully qualified domain name (FQDN)?


15.   Can you open network shares?  Can users connect to file shares on the hung machine?  Are there any errors?


16.   Are you able to log on at the physical console?  If so, are there any errors?


17.   Are you able to log on via Remote Desktop (RDP client)?  Are there any errors?


If this is a terminal server, are you observing this behavior from a session or at the console?


18.   Are you able to open Computer Management remotely?  Are there any errors?


19.   What do you do to recover from the hang?


20.   How long have you waited before rebooting the server?


21.   What have you tried to do to fix the problem?


22.   If it’s not completely hung and we can get to Task Manager, check resources:


CPU time – is there a specific process pegging the CPU?


If so, and it is a third-party process, what happens if we end it?
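
Questions 14 through 17 can be partly answered from another machine before anyone reaches the console. The following is a minimal Python sketch, not part of the original procedure, that resolves the server name and then attempts TCP connections to port 445 (SMB file sharing) and port 3389 (Remote Desktop). The function names, the choice of ports, and the 3-second timeout are assumptions made for illustration. It does not send ICMP pings, so ping should still be run separately.

```python
import socket

# Assumed service ports: 445 = SMB file sharing, 3389 = Remote Desktop.
DEFAULT_PORTS = {"smb": 445, "rdp": 3389}


def probe_tcp(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def probe_server(host, ports=None, timeout=3.0):
    """Resolve host, then probe each named port. Returns a dict of results."""
    ports = DEFAULT_PORTS if ports is None else ports
    results = {}
    try:
        results["resolved_ip"] = socket.gethostbyname(host)
    except OSError:
        # Name resolution failed, so no port can be probed by name.
        results["resolved_ip"] = None
        for name in ports:
            results[name] = False
        return results
    for name, port in ports.items():
        results[name] = probe_tcp(results["resolved_ip"], port, timeout)
    return results
```

A call such as probe_server("server01.example.com") (a hypothetical name) returns the resolved address plus one True/False entry per service. A server that resolves and accepts TCP connections but cannot be logged on to points toward a console or session problem rather than a network connectivity problem.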


 


 


Data Gathering: One of the most useful tools in diagnosing system hangs is Performance Monitor (Perfmon) logging. Perfmon allows the user to gather performance counters for various objects relating to system health, such as: Memory, Network Interface, Physical Disk, Processor, Process, etc.


 


 


In all instances, collect:


 


1.        MPS Reports PFE version


 


Microsoft Premier Services Reporting Utility (PFE version)


http://www.microsoft.com/downloads/details.aspx?FamilyId=00AD0EAC-720F-4441-9EF6-EA9F657B5C2F&displaylang=en


 


2.       Perfmon logs should include the timeframe when the problem is happening on the system. 


You can create the log parameters manually or by using the Performance Monitor Wizard.


 


You should capture the logs remotely from another computer.


 


a.     Set up a remote binary circular performance log that captures all core OS counters:


 


·         Cache


·         Logical disk


·         Memory


·         NBT Connections


·         Network interface


·         Objects


·         Paging File


·         Physical disk


·         Process


·         Processor


·         Redirector


·         Server


·         Server Work Queues


·         System


 


The Perfmon capture interval is determined by the length of time it takes the server to go from a normal state to a problem state.


 


Please gather two concurrent Perfmon logs:


 


b.      Short interval: use a 5-second capture interval.


If the average time to issue is:    The capture interval should be:
Hourly                              5 seconds


 


And


 


c.       Long interval


Please use the table below to set the capture interval.


If the average time to issue is:    The capture interval should be:
Daily                               160 seconds
3 days                              360 seconds
1 week                              800 seconds
2 weeks                             1600 seconds
3 weeks                             2400 seconds
Monthly                             7200 seconds


 


d.      In Windows 2000, a common problem encountered when attempting to collect Perfmon logs remotely is that by default, the Performance Logs and Alerts service is started under the local computer’s “System” account. For steps on how to enable a network account to have permissions on the Performance Logs and Alerts service, please refer to Microsoft KB Article 240389: Log is not started when you try to start a log with remote counters in System Monitor.


e.      In Windows Server 2003, you can simply use the “RunAs” option when setting up the counters.
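
The capture setup in step 2 can also be scripted from the monitoring computer. The sketch below is an illustration under stated assumptions, not a required part of this procedure. It looks up the long capture interval from the table above (treating "Monthly" as 30 days, which is an assumption) and builds a logman command line for a binary circular log that reads the core counters remotely. The counter object names are the usual English names and should be verified on the target server with typeperf -q. The log name, server name, output path, and 250 MB size cap are hypothetical.

```python
# Long-interval table from this document: (average days to issue, interval in seconds).
# Assumption: "Monthly" is treated as 30 days.
LONG_INTERVALS = [(1, 160), (3, 360), (7, 800), (14, 1600), (21, 2400), (30, 7200)]
SHORT_INTERVAL_SECONDS = 5  # short-interval log from step b

# Core OS objects from step a, as (object name, has instances).
# Names are assumed English counter names; verify with "typeperf -q".
CORE_OBJECTS = [
    ("Cache", False), ("LogicalDisk", True), ("Memory", False),
    ("NBT Connection", True), ("Network Interface", True), ("Objects", False),
    ("Paging File", True), ("PhysicalDisk", True), ("Process", True),
    ("Processor", True), ("Redirector", False), ("Server", False),
    ("Server Work Queues", True), ("System", False),
]


def long_interval_seconds(days_to_issue):
    """Return the interval of the first table row that covers days_to_issue."""
    for days, seconds in LONG_INTERVALS:
        if days_to_issue <= days:
            return seconds
    return LONG_INTERVALS[-1][1]  # longer than a month: use the last row


def format_interval(seconds):
    """Format seconds as hh:mm:ss for the logman -si option."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def counter_paths(server):
    """Build wildcard counter paths that point at the remote server."""
    prefix = "\\\\" + server  # yields \\SERVER
    paths = []
    for name, has_instances in CORE_OBJECTS:
        instance = "(*)" if has_instances else ""
        paths.append(f"{prefix}\\{name}{instance}\\*")
    return paths


def logman_create_args(log_name, server, interval_seconds, output_path, max_mb=250):
    """Return the argument list for creating a binary circular counter log."""
    return [
        "logman", "create", "counter", log_name,
        "-c", *counter_paths(server),
        "-si", format_interval(interval_seconds),
        "-f", "bincirc",        # binary circular format
        "-max", str(max_mb),    # size cap in MB (assumed value)
        "-o", output_path,
    ]
```

For a server that hangs roughly once a week, logman_create_args("HangLong", "SRV01", long_interval_seconds(7), r"C:\PerfLogs\HangLong") yields a command with an 800-second interval, and a second call with SHORT_INTERVAL_SECONDS gives the concurrent short-interval log. The list can be passed to subprocess.run on the monitoring computer, after which "logman start HangLong" begins collection.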


  


 


3.       Set up the server for a complete memory dump per KB 972110.


 


Proactively, complete the following:


————————————–



  1. Check with the OEM vendor for any known issues with their hardware or updates.

  2. Update the BIOS.

  3. Update the drivers and firmware from the OEM server hardware vendor website.

  4. Update the remote management software, e.g. iLO/DRAC.

  5. Update the HBA driver and firmware.

  6. Update the storage driver and firmware.

  7. Verify that software drivers are up to date. This includes antivirus, quota management software, remote management software, etc.

  8. Verify that Windows security and reliability updates are up to date.

 


 


Troubleshooting / Resolution:


1.       In the System event log, look for Event ID 2019 (nonpaged pool exhausted) and Event ID 2020 (paged pool exhausted), both logged by source Srv.


  


2.       In Perfmon, check whether any process (Process –> NameofProcess –> Handle Count) has a value larger than 15,000.


Note: For LSASS.exe on domain controllers, a value of up to 50,000 is normal.


Note: For Store.exe on Exchange servers, a value of up to 65,000 is normal.
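
The handle check in step 2 can be automated once the Perfmon log has been converted to CSV, for example with relog logfile.blg -f csv -o logfile.csv. The sketch below is an illustration only. It assumes the usual relog CSV layout, in which the first row holds counter paths such as \\SERVER\Process(lsass)\Handle Count, and it applies the thresholds quoted above. The function names are hypothetical; the _Total instance is skipped and a "#n" suffix on duplicate instance names is ignored when picking the threshold.

```python
import csv

DEFAULT_LIMIT = 15000
# Higher limits quoted in this document for specific processes.
PROCESS_LIMITS = {"lsass": 50000, "store": 65000}

_MARKER = "\\Process("
_SUFFIX = ")\\Handle Count"


def process_instance(counter_path):
    """Return the instance name from a Process(...) Handle Count path, else None."""
    start = counter_path.find(_MARKER)
    if start == -1 or not counter_path.endswith(_SUFFIX):
        return None
    return counter_path[start + len(_MARKER):-len(_SUFFIX)]


def find_handle_outliers(csv_lines):
    """Return {instance: peak handle count} for processes above their limit.

    csv_lines is any iterable of CSV text lines; an open file works.
    """
    reader = csv.reader(csv_lines)
    header = next(reader)
    columns = {i: process_instance(h) for i, h in enumerate(header)}

    peaks = {}
    for row in reader:
        for i, name in columns.items():
            if name is None or name == "_Total" or i >= len(row):
                continue
            try:
                value = float(row[i])
            except ValueError:
                continue  # blank or non-numeric sample
            peaks[name] = max(peaks.get(name, 0.0), value)

    outliers = {}
    for name, peak in peaks.items():
        base = name.split("#")[0].lower()  # "svchost#1" -> "svchost"
        if peak > PROCESS_LIMITS.get(base, DEFAULT_LIMIT):
            outliers[name] = peak
    return outliers
```

Calling find_handle_outliers(open("logfile.csv", newline="")) returns only the processes whose peak handle count exceeded the applicable limit. A process that appears here and keeps climbing across the log is a candidate for a handle leak.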


 


 


Additional Resources:


 


972110 How to generate a kernel dump file or a complete memory dump file in Windows Server 2003


http://support.microsoft.com/?id=972110


 


177415 How to use Memory Pool Monitor (Poolmon.exe) to troubleshoot kernel mode memory leaks


http://support.microsoft.com/kb/177415


 


PoolMon Examples


http://msdn.microsoft.com/en-us/library/ms792885.aspx


 


Poolmon Overview


http://technet.microsoft.com/en-us/library/cc737099(WS.10).aspx


  


164933 How to allow Poolmon.exe to run by setting GlobalFlag value


http://support.microsoft.com/kb/164933


 


Using PoolMon to Find a Kernel-Mode Memory Leak


http://msdn.microsoft.com/en-us/library/cc267829.aspx


 


246758 How to Monitor Performance of a Remote Computer Without Logging on to It


http://support.microsoft.com/?id=246758


 


969639 Error message when you try to access the Performance Monitor (Perfmon.exe) on a remote computer: “Access Is Denied”


http://support.microsoft.com/?id=969639


 


888989 A Performance Monitor counter for the Physical Disk performance object may not be displayed in Windows 2000


http://support.microsoft.com/?id=888989


  


248993 PRB: Performance Object Is Not Displayed in Performance Monitor


http://support.microsoft.com/?id=248993