Servers locking up and crashing...

Today I had the "hardware problem bit" set to 1!

 

First, I worked on an issue where a server was randomly rebooting with a *BSOD*.  It would create a memory dump each time.  The problem was that each crash showed a different stop code and error code.  This renders any single memory dump pretty much useless, because a dump only describes the particular crash that produced it, and different error codes point to different immediate causes.  Typically with random BSODs (random meaning different stop codes with no clear "cause and effect") there is an underlying hardware issue.  In this case, the server started having problems after installing Windows Server 2003 Service Pack 1.  The Service Pack was removed via the Recovery Console and *normal operation* resumed.  So what is the problem?  Not sure, to be honest.  The recommendation was to make sure that the hardware (motherboard, SCSI controllers, etc.) had the latest BIOS/firmware updates from the vendor and then reapply the Service Pack.
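
To make the "random versus consistent" distinction concrete, here is a minimal Python sketch of that first triage step. The function name, the labels, and the sample stop codes are all made up for illustration; they are not the codes from this server, and the rule of thumb is only the one described above, not a diagnostic tool.

```python
def classify_crash_history(stop_codes):
    """Label a list of stop codes (hex strings) taken from successive crashes.

    Hypothetical helper: "random" means every crash had a different stop
    code, which is the pattern that usually makes us suspect hardware.
    """
    # Normalize so that "0xA" and "0x0000000A" compare as the same code.
    codes = [int(code, 16) for code in stop_codes]
    if len(codes) < 2:
        return "insufficient data"
    distinct = len(set(codes))
    if distinct == 1:
        return "consistent"   # same stop code every time: the dump is worth analyzing
    if distinct == len(codes):
        return "random"       # never the same code twice: suspect hardware/firmware
    return "mixed"            # some repeats: look at the repeated code first


# Illustrative values only, not taken from the actual case.
history = ["0x0000000A", "0x00000050", "0x0000007F", "0x0000008E"]
print(classify_crash_history(history))
```

With four different stop codes in four crashes, the sketch lands in the "random" bucket, which is where this server was.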

 

Next, it was a server randomly "hard locking".  Hard locking meaning the mouse and keyboard become unresponsive at the server console.  It was not known whether NUMLOCK and/or CAPSLOCK still worked (meaning the little lights on the keyboard responded to pressing the keys).  The only way to recover was to power the server off and back on.  In the Event Viewer, there was absolutely no indication of a problem.  This would occur about once every 2-3 months.  The action on this was to check whether NUMLOCK/CAPSLOCK work the next time it locks up.  If they do not, it is a *true* hard lock, and troubleshooting becomes extremely difficult without a cause/effect...basically start removing hardware and testing.  If they DO, we could implement https://support.microsoft.com/?id=244139 (generating a memory dump from the keyboard) and get a memory dump to hopefully get a clue as to what is happening.  Note we could have also set up some performance counters with perfmon, but typically we would get some indication in Event Viewer if we had a resource leak, like event ID 2019 and/or 2020 (nonpaged and paged pool depletion).  Maybe if we can get a dump we will then utilize perfmon and/or pool tagging to get more granular.
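
For reference, the keyboard-initiated dump from that KB article comes down to one registry value. Below is a small Python sketch that only builds the text of a .reg file for it; it does not touch the registry. The key path and the `CrashOnCtrlScroll` value name are as I recall them from the article for PS/2 keyboards (the i8042prt driver), so verify them against the article before importing anything. The function name is made up for this example.

```python
# Assumption: key and value name as recalled from KB 244139 (PS/2 keyboards).
KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\i8042prt\Parameters"


def build_crash_on_ctrl_scroll_reg(enable=True):
    """Return .reg file text that turns the keyboard crash feature on or off."""
    value = 1 if enable else 0
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{KEY}]",
        f'"CrashOnCtrlScroll"=dword:{value:08x}',
        "",
    ]
    # .reg files conventionally use Windows (CRLF) line endings.
    return "\r\n".join(lines)


print(build_crash_on_ctrl_scroll_reg())
```

After the value is in place and the server has been restarted, holding the right CTRL key and pressing SCROLL LOCK twice should force the crash and the dump. This only helps if the keyboard is still being serviced during the lockup, which is exactly why the NUMLOCK/CAPSLOCK test comes first.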

 

petergal