When Windows encounters a problem in the kernel (memory corruption, a null pointer dereference, an explicit call to KeBugCheckEx, etc.), the "Blue Screen of Death" (BSoD) is displayed.
This, we all know.
The BSoD is actually your friend – it’s there to halt the system as soon as a problem is detected so that further damage is avoided, and in an ideal world Windows is also set up to produce a memory dump which can be run through a debugger to try to figure out what happened.
But even if the memory dump option is not set, enough information is recorded in the page file for Windows to write an entry to the event log during the restart, indicating (roughly) what the problem was.
The STOP (or "bugcheck") code along with its parameters is logged with the event so we know why the server restarted.
But sometimes we get only a 6008 event: "the previous shutdown was unexpected" and no further information to indicate what happened… why would this be?
Four main causes spring to mind:
A) The power button was pressed (and possibly had to be held in for 4 seconds)
B) The power supply was interrupted (brownout, UPS failure, power cable yanked, etc.)
C) The physical disk holding the boot volume vanished
D) An ASR occurred
Hopefully you would be aware of the first scenario, and the second would either affect multiple machines or be detected by some other health-monitoring system.
The disk "vanishing" at the hardware level would leave Windows unable to satisfy hard page fault requests, which probably ends in a STOP 0x77 (KERNEL_STACK_INPAGE_ERROR) or similar. If this were purely a software (driver) issue then we should still get a memory dump, or at least access to the disk, as the device is accessed without drivers when a bugcheck occurs.
If a bugcheck is occurring but no memory dump is being produced (and the settings indicate there should be one), clear the “Automatically restart” option under “Startup and Recovery” on the Advanced tab of System Properties.
Now if a bugcheck occurs, regardless of whether or not we wrote a dump file, the blue screen will remain displayed until the server is reset manually.
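For checking this setting on more than one server, it can be read from the registry instead of the dialog. The sketch below assumes the usual location, the AutoReboot and CrashDumpEnabled values under HKLM\SYSTEM\CurrentControlSet\Control\CrashControl; the function names and the dump-type table are my own for illustration, and the value meanings should be verified against your Windows version.

```python
"""Interpret the CrashControl values behind the "Startup and Recovery" dialog."""

try:
    import winreg  # standard library, but only present on Windows
except ImportError:
    winreg = None

CRASH_CONTROL_KEY = r"SYSTEM\CurrentControlSet\Control\CrashControl"

# Assumed meanings of CrashDumpEnabled; check them for your Windows version.
DUMP_TYPES = {
    0: "none",
    1: "complete memory dump",
    2: "kernel memory dump",
    3: "small memory dump",
    7: "automatic memory dump",
}


def read_crash_control():
    """Read AutoReboot and CrashDumpEnabled from the local registry (Windows only)."""
    if winreg is None:
        raise OSError("winreg is not available on this platform")
    values = {}
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, CRASH_CONTROL_KEY) as key:
        for name in ("AutoReboot", "CrashDumpEnabled"):
            try:
                values[name], _value_type = winreg.QueryValueEx(key, name)
            except FileNotFoundError:
                pass  # value missing; reported as unknown below
    return values


def describe_crash_control(values):
    """Turn raw registry values into a small, readable summary."""
    auto = values.get("AutoReboot")
    dump = values.get("CrashDumpEnabled")
    return {
        "auto_restart": None if auto is None else bool(auto),
        "dump_type": "unknown" if dump is None
        else DUMP_TYPES.get(dump, f"unrecognised ({dump})"),
        # True only when "Automatically restart" is known to be cleared.
        "blue_screen_stays_up": auto == 0,
    }
```

On a Windows server, describe_crash_control(read_crash_control()) gives the summary; the interpretation step is kept separate so it can be fed values gathered remotely by some other means.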
This brings me to the topic I wanted to discuss…
ASR is a watchdog feature offered by some server hardware: a countdown timer which an agent running in the OS must keep resetting. The countdown ticks away constantly, and so long as the agent is getting CPU cycles it is able to reset it; but if the OS hangs, the CPU load sits at 100% for a period of time, or the agent has a fault, then the countdown can hit zero.
The ASR feature then effectively emulates a press of the reset button: it assumes the server is in a hung state and attempts to recover from the situation.
Because ASR relies on the agent running in the OS, a bugcheck is a problem: everything else is effectively frozen while physical memory is dumped to the page file, so the agent can no longer reset the countdown. The reboot can then arrive in the middle of this process and leave us with a corrupt (and useless) dump file.
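The interaction between the countdown, the agent and a bugcheck can be made concrete with a toy model. This is only a sketch: the tick is an arbitrary unit, the function names are hypothetical, and real ASR implementations will differ in how the timer is reset and what happens at zero.

```python
def first_reset_tick(timeout, agent_ran):
    """Return the tick at which the watchdog fires, or None if it never does.

    timeout   -- number of ticks the countdown starts from (arbitrary unit)
    agent_ran -- iterable of booleans, one per tick: did the agent get to
                 reset the countdown during that tick?
    """
    remaining = timeout
    for tick, ran in enumerate(agent_ran, start=1):
        if ran:
            remaining = timeout   # agent resets the countdown
        else:
            remaining -= 1        # countdown keeps ticking
            if remaining == 0:
                return tick       # hardware emulates a reset-button press
    return None


def dump_survives(timeout, dump_ticks):
    """After a bugcheck the agent never runs again; does the dump finish first?

    Returns True only if the dump needs fewer ticks than the countdown allows.
    """
    return first_reset_tick(timeout, [False] * dump_ticks) is None


# A healthy agent keeps the watchdog quiet indefinitely.
healthy = first_reset_tick(3, [True] * 100)

# The agent is starved for three consecutive ticks: the watchdog fires.
starved = first_reset_tick(3, [True, True, False, False, True, False, False, False])
```

In this model a dump that takes longer than the countdown is always cut short, which is the corrupt dump file case described above.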
ASR events are often logged in the OS by the agent and/or recorded in the BIOS/EFI – management tools such as the IBM Director or HP Insight Manager can query the agents remotely to display these events.
So if you have a server which is getting "unexpected shutdowns" with no record of a bugcheck occurring, it is definitely worth considering disabling any ASR feature (consult your server documentation) so that the "real" problem is uncovered.
If the server really is hanging then we want to know about it, not sweep the symptom under the carpet – and if it’s only a massive CPU load for a period of time (e.g. a backup or large batch job) then interrupting this with a warm reset could be devastating.
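To find the servers worth a closer look, the pattern from earlier (a 6008 event with no bugcheck record alongside it) can be picked out of exported event log data. The sketch below works on a hypothetical list of dictionaries rather than any real event log API, and it assumes the bugcheck record appears as event 1001 from source "BugCheck"; the source name and ID vary between Windows versions, so adjust BUGCHECK_EVENTS to suit.

```python
from datetime import datetime, timedelta

UNEXPECTED_SHUTDOWN = ("EventLog", 6008)
# Assumption: how the bugcheck record is logged differs by Windows version.
BUGCHECK_EVENTS = {("BugCheck", 1001)}


def unexplained_shutdowns(events, window=timedelta(minutes=10)):
    """Return the 6008 events that have no bugcheck event logged within `window`.

    events -- list of dicts with "time" (datetime), "source" and "id" keys;
              this record shape is made up for the example.
    """
    bugcheck_times = [e["time"] for e in events
                      if (e["source"], e["id"]) in BUGCHECK_EVENTS]
    unexplained = []
    for e in events:
        if (e["source"], e["id"]) != UNEXPECTED_SHUTDOWN:
            continue
        # Both records are written during the restart, so they should be close.
        if not any(abs(t - e["time"]) <= window for t in bugcheck_times):
            unexplained.append(e)
    return unexplained


sample = [
    # Restart after a bugcheck: 6008 plus a bugcheck record two minutes later.
    {"time": datetime(2024, 1, 5, 3, 0), "source": "EventLog", "id": 6008},
    {"time": datetime(2024, 1, 5, 3, 2), "source": "BugCheck", "id": 1001},
    # Restart with no explanation: a candidate for power loss or ASR.
    {"time": datetime(2024, 1, 9, 22, 15), "source": "EventLog", "id": 6008},
]
suspects = unexplained_shutdowns(sample)
```

Here suspects contains only the second 6008 event; those are the restarts where causes A to D are worth investigating.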