Preparing to Troubleshoot – Part One

A very common question that we are asked is, “What kind of data can I gather before I talk to a Support Engineer at Microsoft?”  In all honesty, there is no single right answer – especially where the Performance team is concerned!  It all depends on the issue.  However, that’s not to say that you can’t get ahead of the curve on some of the more common issues that we deal with and start gathering your own data.

Obvious Troubleshooting Questions:

  1. What changed?  In many cases, the problem is caused by a change in the environment, such as new software, patches or permission changes.
  2. How many machines are affected?  If the problem is only occurring on one machine, but not on other similarly configured machines, then something unique to that machine is the likely cause.
  3. How many users are affected?  Are all users on the same machine affected?  Are administrators affected?  Understanding how profiles and permissions work is key when troubleshooting application (and other) issues that only seem to affect certain users / machines.
  4. Is the machine completely up to date on drivers & patches?  It seems like an obvious thing to check, but there have been many situations where the fix for the problem turned out to be an updated driver (for Anti-virus software or a printer, for example) or a Microsoft Security patch that had already been released.
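
For question 2 in particular, one way to find what is unique to the problem machine is to compare its software and driver inventory against machines that are working.  The following is only a minimal sketch in Python: the function name, the product names and the version strings are all made up for illustration, and it assumes you have already collected each machine's inventory as a simple name-to-version mapping by whatever means you prefer.

```python
def unique_to_problem_machine(problem, healthy_machines):
    """Return the items whose version on the problem machine matches
    none of the healthy machines.

    problem          -- dict mapping item name -> version string
    healthy_machines -- list of dicts in the same shape
    """
    differences = {}
    for name, version in problem.items():
        # Versions of this item seen on the healthy machines (None = not installed).
        healthy_versions = {machine.get(name) for machine in healthy_machines}
        if version not in healthy_versions:
            seen = sorted(v for v in healthy_versions if v is not None)
            differences[name] = (version, seen)
    return differences


# Hypothetical inventories, for illustration only.
problem = {"PrintDriverX": "1.2", "AntiVirusY": "7.0", "AppZ": "3.1"}
healthy = [
    {"PrintDriverX": "1.2", "AntiVirusY": "7.1", "AppZ": "3.1"},
    {"PrintDriverX": "1.2", "AntiVirusY": "7.1", "AppZ": "3.1"},
]

for name, (bad, good) in unique_to_problem_machine(problem, healthy).items():
    print(name, "problem machine:", bad, "healthy machines:", good)
```

This sketch only looks at items present on the problem machine.  Something that is installed on every healthy machine but missing from the problem machine would need a second pass in the other direction.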

MPS Reports:

MPS Reports is one of the cornerstones of our troubleshooting process.  MPS Reports provides us with a snapshot of the machine – including event logs, network configuration, loaded drivers etc.  One of the great things about MPS Reports is that it can be used for more than just troubleshooting.  Several customers have actually added MPS Reports to their arsenal of tools for server health checks, change controls and disaster recovery scenarios.

MPS Reports can be downloaded here.  The MPSRPT_SetupPerf.exe is the version that the Performance team use most.  You should always try to gather MPS Reports for the problem machine(s) and be ready to provide them to the Microsoft Support Engineer to help cut down on the troubleshooting time.

Application Crashes:

For Application Crashes (including spooler and web browser crashes), the most important thing to collect is the dump file of the failure.  In most cases, the Dr. Watson tool provided with the Operating System can be used to capture this information.  However, many customers prefer to use the IIS Diagnostics Toolkit, which includes the DebugDiag tool, to capture this data.  Information regarding the IIS Diagnostics Toolkit can be found here.

A quick note regarding troubleshooting Application Crashes: if the problem application is a non-Microsoft application, our ability to troubleshoot is somewhat limited since we have neither the symbols nor the source code for the application.  It is always a good idea to engage the vendor of the problem application directly for assistance with non-Microsoft applications.

Server Hangs:

When troubleshooting Server Hangs, there are some things to check first:

  1. Are you able to ping the server remotely?  If not, then you may want to consult your hardware vendor to run diagnostics on the server to ensure that there is no underlying hardware issue.
  2. At the console, are you able to use the NumLock or CapsLock keys?
  3. At the console, are you able to bring up the GINA screen using Ctrl+Alt+Del?

More often than not, diagnosing Server Hangs will require generating a manual crash dump of the problem machine.  We can gather this dump using the Ctrl+Scroll Lock method outlined in KB Article 244139.

Leaks and the infamous 2019 / 2020 error messages:

The first thing to understand is what exactly a “Leak” is.  A leak is a condition whereby a process (program or service) does not release resources that it no longer needs.  As a result, the process continues to “grab” the resources for itself.  The eventual end result of this condition is that other programs cannot function.  Most people think of leaks as “Memory Leaks”, but we also see issues with handle leaks and token leaks.
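
To make the definition concrete, here is a deliberately simplified illustration in Python.  It is not taken from any real product; the class names are hypothetical.  The first worker keeps a reference to every buffer it has ever allocated, so the memory it holds grows with every request.  The second worker does the same job but lets each buffer go once it is finished with it.

```python
class LeakyWorker:
    """Keeps every buffer it allocates, even after it is done with it."""

    def __init__(self):
        self._buffers = []

    def handle(self, payload):
        buf = bytearray(payload)
        self._buffers.append(buf)  # never removed: this is the leak
        return len(buf)

    def buffers_held(self):
        return len(self._buffers)


class TidyWorker:
    """Does the same work but holds on to nothing between requests."""

    def handle(self, payload):
        buf = bytearray(payload)
        return len(buf)  # buf can be reclaimed once this call returns


leaky = LeakyWorker()
tidy = TidyWorker()
for _ in range(1000):
    leaky.handle(b"x" * 1024)
    tidy.handle(b"x" * 1024)

# The leaky worker is still holding all 1000 buffers it no longer needs.
print("buffers still held by leaky worker:", leaky.buffers_held())
```

The same pattern applies to handles and tokens: the resource is acquired for each unit of work but never given back.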

Event IDs 2019 and 2020 indicate special types of resource depletion.  Event ID 2019 refers to NonPaged Pool depletion and Event ID 2020 refers to Paged Pool depletion.  Paged Pool refers to a region of virtual memory in the System Space that can be paged in and out of physical memory (paged to disk).  NonPaged Pool consists of ranges of system virtual addresses that reside in physical memory at all times and can be accessed at any time without incurring a page fault.

OK, so now that we’ve defined a couple of terms, how do we troubleshoot these issues?  Setting aside the 2019 / 2020 errors for a moment, we can use Perfmon (Performance Monitor) to capture data on the server and identify the leak.  Setting up Perfmon on Windows 2000 / XP / Server 2003 is not a complicated process and is explained in KB Article 248345.  This article also includes the link for the Performance Monitor Wizard which provides a wizard-based method to capture perfmon logs.
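
Once you have a Perfmon log, what you are looking for is a counter that climbs steadily and never comes back down, for example Private Bytes or Handle Count for a single process.  The Python sketch below shows one simple way to check a series of samples for that pattern by fitting a straight line through them.  It assumes you have already pulled the samples for one counter out of the log into a list of numbers; the function names and the 10% threshold are our own illustrative choices, not part of Perfmon.

```python
from statistics import mean


def growth_per_sample(values):
    """Least-squares slope of the samples against their position in the log."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    xs = range(n)
    x_bar = mean(xs)
    y_bar = mean(values)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    return numerator / denominator


def looks_like_leak(values, min_growth_fraction=0.10):
    """Flag a counter whose fitted growth across the whole log is at least
    min_growth_fraction of its first sample (an arbitrary threshold)."""
    fitted_growth = growth_per_sample(values) * (len(values) - 1)
    return values[0] > 0 and fitted_growth >= min_growth_fraction * values[0]


# Hypothetical samples, for illustration only.
steady_climb = [100, 110, 120, 130, 140]
normal_noise = [100, 102, 99, 101, 100]

print("steady climb flagged:", looks_like_leak(steady_climb))
print("normal noise flagged:", looks_like_leak(normal_noise))
```

A straight-line fit is a crude test.  A counter can grow for legitimate reasons (a cache warming up, for example), so treat a flagged counter as something to investigate over a longer capture rather than as proof of a leak.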

Getting back to the 2019 and 2020 errors, Performance Monitor logs are not the only data that we gather.  We also collect Memory Pool Data and our old friend the Manual Crash Dump.  As with Perfmon, Memory Pool Data is not difficult to capture (see KB Article 177415).  When using Poolmon.exe you should ensure that you use the /n switch to log the data to a file.  A list of the switches and syntax for Poolmon.exe can be found here.
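
With pool data, the usual approach is to take snapshots some time apart and see which pool tags have grown the most in between.  The Python sketch below illustrates that comparison.  It does not parse Poolmon's output; it assumes you have already reduced each snapshot to a mapping of tag to bytes in use, and the function name and tag names are placeholders.

```python
def pool_growth(before, after, top=5):
    """Return up to `top` (tag, bytes_grown) pairs, largest growth first.

    before, after -- dicts mapping pool tag -> bytes in use
    """
    growth = []
    for tag, bytes_after in after.items():
        # A tag missing from the earlier snapshot is treated as starting at 0.
        delta = bytes_after - before.get(tag, 0)
        if delta > 0:
            growth.append((tag, delta))
    growth.sort(key=lambda item: item[1], reverse=True)
    return growth[:top]


# Placeholder tags and sizes, for illustration only.
first_snapshot = {"AAAA": 1000, "BBBB": 5000, "CCCC": 200}
later_snapshot = {"AAAA": 9000, "BBBB": 5100, "CCCC": 150, "DDDD": 300}

for tag, grown in pool_growth(first_snapshot, later_snapshot):
    print(tag, "grew by", grown, "bytes")
```

The tag at the top of the list is the first candidate to investigate, but as with Perfmon data, compare more than two snapshots before drawing conclusions.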

Well, that’s all for the moment.  Part Two will be coming soon!

 – CC Hameed