Almost everyone who has ever used Windows has either heard of or experienced a bugcheck: the infamous “Blue Screen of Death.” A system may bugcheck for different reasons, but the bottom line is that the operating system has experienced a catastrophic fault that prevents the system from continuing to run. We’re going to cover some basic information about why a server may crash, explain how to configure and capture crash dumps, and review some basic debugging of a crash dump.
Before we get started, however, remember that there is a difference between a bugcheck and an application crash. A bugcheck is a kernel-mode crash, whereas an application crash is a user-mode event. We covered the differences between kernel- and user-mode memory in our Memory Management 101 post several months ago. So what are some common reasons why you may experience a bugcheck?
- A device driver or operating system function that runs in the kernel-mode space experiences an exception that it does not know how to handle (an unhandled exception). This would include trying to write to memory to which it does not have access, or trying to read an address that is not mapped and therefore invalid
- A kernel support routine is called that results in a reschedule when the Interrupt Request Level (IRQL) is Deferred Procedure Call (DPC) / dispatch level or higher. An IRQL is the priority ranking of an interrupt, and the IRQL at which a piece of kernel-mode code executes determines its hardware priority. A DPC is a mechanism that lets a processor that is executing a critical task defer less critical work until later, when the IRQL drops below dispatch level.
- A device driver or operating system function explicitly crashes the system because it detects that there is either corruption or some other situation indicating that the system cannot continue to function without risking data corruption
- Faulty hardware may also cause a bugcheck
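The IRQL rule in the second bullet is easier to see in a toy model. The sketch below is ordinary user-mode Python, not kernel code: the `wait_for_event` routine and the `SimulatedBugcheck` exception are hypothetical names made up for illustration, and only the three IRQL names and values mirror the real Windows definitions.

```python
from enum import IntEnum


class Irql(IntEnum):
    """The three lowest IRQLs, using the same values Windows defines."""
    PASSIVE_LEVEL = 0
    APC_LEVEL = 1
    DISPATCH_LEVEL = 2


class SimulatedBugcheck(Exception):
    """Hypothetical stand-in for the system halting; a real bugcheck never returns."""


def wait_for_event(current_irql: Irql) -> str:
    """Toy model of a kernel routine that blocks and therefore forces a reschedule."""
    if current_irql >= Irql.DISPATCH_LEVEL:
        # The scheduler itself runs at dispatch level, so it cannot take over
        # from code that is already running at that level or higher.
        raise SimulatedBugcheck(f"reschedule requested at {current_irql.name}")
    return "thread blocked, scheduler picks another thread"
```

Calling `wait_for_event(Irql.PASSIVE_LEVEL)` returns normally, while `wait_for_event(Irql.DISPATCH_LEVEL)` raises the simulated bugcheck, which is roughly the situation the bullet describes.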
OK, so if Windows knows that something is wrong, why does it crash? Wouldn’t it be better to ignore the failure and carry on working? In some cases, there is a possibility that the problem is isolated and that the failing component will recover on its own. However, it is more likely that there is a deeper issue, such as memory corruption or a hardware failure. If the system simply ignored these issues and continued to run, then the risk of further errors and data corruption would increase, and that risk is too high to take.
An analogy would be the “Check Engine Light” in your car suddenly coming on. When this light comes on, you don’t immediately know how serious the problem is. It could be something as simple as a gas cap that has not been tightened properly. In this instance, pulling over and tightening the gas cap would resolve the issue. However, there could be a far more serious issue that you won’t be able to resolve until you have the diagnostic trouble codes in your car’s on-board computer memory reviewed. In either case, it would be inadvisable to ignore the “Check Engine Light.”
So what actually happens on a system when it bugchecks? The Windows DDK documents a function called KeBugCheckEx, which brings down the system in a controlled manner. After this function masks out all interrupts on all processors on the system, it switches the display into VGA mode, paints the blue background and displays the STOP code, along with four parameters that are interpreted based on the nature of the STOP code. There may also be text displayed that provides standard suggestions for the user.
Windows XP Service Pack 1 and Windows Server 2003 introduced a new function, KeRegisterBugCheckReasonCallback. Drivers use this function to register routines that execute during a system bugcheck. These routines may append driver-specific data to the crash dump or write crash dump data to alternate devices.
Although there are over one hundred unique STOP codes, a few common ones account for the majority of bugchecks on Windows systems. The Help file included with the Windows Debugging Tools describes the different STOP codes and can assist you in interpreting the errors; however, it may be necessary to review the crash dump file that is created when the system bugchecks.
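To make the "STOP code plus four parameters" idea concrete, here is a minimal Python sketch that pairs a code with its symbolic name and prints it in a blue-screen-like form. The table holds only a handful of well-known codes, the `describe_stop` helper and the fallback label are made up for this post, and the parameter values in the example are invented.

```python
# A few well-known STOP codes; the debugger help file has the full list.
COMMON_STOP_CODES = {
    0x0000000A: "IRQL_NOT_LESS_OR_EQUAL",
    0x0000001E: "KMODE_EXCEPTION_NOT_HANDLED",
    0x00000050: "PAGE_FAULT_IN_NONPAGED_AREA",
    0x0000007F: "UNEXPECTED_KERNEL_MODE_TRAP",
    0x000000D1: "DRIVER_IRQL_NOT_LESS_OR_EQUAL",
}


def describe_stop(code, params):
    """Format a STOP code and its four parameters as a single line."""
    if len(params) != 4:
        raise ValueError("a bugcheck carries exactly four parameters")
    # The fallback label is a placeholder for this sketch, not a real STOP name.
    name = COMMON_STOP_CODES.get(code, "(not in this table)")
    args = ", ".join(f"0x{p:08X}" for p in params)
    return f"STOP: 0x{code:08X} ({args}) {name}"


# Hypothetical parameter values, for illustration only.
print(describe_stop(0xD1, (0x10, 0x2, 0x0, 0xF8A5C3D0)))
```

What the four parameters mean depends entirely on the STOP code, which is why the sketch leaves them as raw hexadecimal values rather than trying to interpret them.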
Bugchecks most often occur after a change has been made to the system – for example the installation of new software or hardware. If you have just added a driver, rebooted the machine and the system bugchecks early during the system initialization process, then using the Last Known Good Configuration option can sometimes bring the system back online so that troubleshooting can be performed, and the offending driver removed (if necessary). This is because the installation of a new driver creates the associated registry entries that determine the driver startup type and file path. Until the system reboots successfully after this installation, the entry is not committed to the ControlSet number referenced in the LastKnownGood value in the HKLM\System\Select key. However, this same troubleshooting does not work if you update an existing driver because the associated registry entries that call for that driver to be loaded will already be present on the system as a result of the last successful boot. Since the actual files have changed, the Last Known Good Configuration option will not work.
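The relationship between the Select key and the control sets can be sketched in a few lines of Python. This is a simplified model rather than real registry access: the value names (Current, Default, Failed, LastKnownGood) are the ones found under HKLM\System\Select, but the numbers assigned to them and the helper function names are assumptions chosen to illustrate the new-driver scenario described above.

```python
def control_set_key(number):
    """Map a Select value such as 2 to its registry key name, 'ControlSet002'."""
    return f"ControlSet{number:03d}"


def boot_control_set(select_values, use_last_known_good):
    """Pick the control set a boot would use, given the Select key's values."""
    value = "LastKnownGood" if use_last_known_good else "Default"
    return control_set_key(select_values[value])


# Hypothetical state after installing a new driver that then fails at boot:
# the new driver's entries live in control set 1, while control set 2 is the
# last configuration that booted successfully and knows nothing about it.
select = {"Current": 1, "Default": 1, "Failed": 0, "LastKnownGood": 2}

print(boot_control_set(select, use_last_known_good=False))
print(boot_control_set(select, use_last_known_good=True))
```

In this model a normal boot picks the control set that contains the new driver's entries, while Last Known Good picks the older one. It also hints at why an updated driver is different: both control sets already reference that driver, so switching between them changes nothing about which file gets loaded.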
And that brings us to the end of our quick look at Understanding Bugchecks. In our next post on this topic, we will cover the properties of Crash Dump Files. Until next time …