Quick Reference: Recovery Options for Post-Mortem Debugging for Windows and Virtual Machines

Hi everyone, Robert Smith here to talk to you today a bit about crash dump configurations and options. With the widespread adoption of virtualization, large database servers, and other systems that may have a large amount of RAM, pre-configuring systems to capture debugging information optimally can be vital in debugging and other efforts. Ideally a stop error or system hang never happens. But if something does happen, having the system configured optimally the first time can reduce the time to root cause determination.

The information in this article applies equally to physical and virtual computing devices. You can apply this information to a Hyper-V host, or to a Hyper-V guest. You can apply this information to a Windows operating system running as a guest in a third-party hypervisor. If you have never gone through this process, or have never reviewed the knowledge base article on configuring your machine for a kernel or complete memory dump, I highly suggest going through the article along with this blog.

Why worry about Crashdump settings in Windows?

When a Windows system encounters an unexpected situation that could lead to data corruption, the Windows kernel calls a routine named KeBugCheckEx to halt the system and save the contents of memory, to the extent possible, for later debugging analysis. During KeBugCheckEx, Windows writes diagnostic information to the paging file and sets a flag noting that the paging file contains the information. On the next reboot, Windows writes the diagnostic information to a memory “dump” file, normally called “memory.dmp”.

The problem arises on large memory systems that are handling large workloads. One of the dump types, called “kernel”, was created for this situation. Even if you have a very large memory device, Windows can save just the kernel-mode memory space, which usually results in a reasonably sized memory dump file. But with the advent of 64-bit operating systems and very large virtual and physical address spaces, even just the kernel-mode memory output could result in a very large memory dump file.

When the Windows kernel calls KeBugCheckEx, execution of all other running code is halted, and then some or all of the contents of physical RAM are copied to the paging file. On the next restart, Windows checks a flag in the paging file that tells Windows that there is debugging information in the paging file. If there is sufficient free disk space in the location specified under ‘Recovery’ options, Windows will attempt to write the debugging information into a file normally called ‘Memory.dmp’. NOTE: For Windows 7 and Windows Server 2008 R2, a hotfix is available to allow a memory dump to occur without a paging file. Please see KB2716542 for more information on this hotfix.

Herein lies the problem. One of the Recovery options is memory dump file type. There are a number of memory.dmp file types, to accommodate the current environment. For reference, here are the types of memory dump files that can be configured in Recovery options:

  • Small memory dump
    • Every current Windows OS
    • 128 KB on 64-bit systems
    • Contains exception thread only, module list, and basic system info
  • Kernel memory dump
    • Every current Windows OS
    • > 2 GB on 32-bit systems, 2+ GB on 64-bit, usually < 10 GB
    • Very little user-mode address space available
    • Sufficient for majority of diagnostic needs
  • Automatic memory dump
    • Windows 8 and later including Windows Server 2012 and later
    • > 2 GB on 32-bit systems, 2+ GB on 64-bit, usually < 10 GB
    • Very little user-mode address space available
    • Increases paging file size automatically if needed
  • Active memory dump
    • Windows 10 and later including Windows Server 2016 and later
    • Kernel-mode + “active” memory pages
    • Size unknown, but at least the size of a kernel or automatic dump, and possibly substantially larger.
  • Complete memory dump
    • Every current Windows OS
    • Memory dump size is equal to size of physical RAM, or configured RAM with “Maxmem” parameter
    • Output files larger than 32 GB can be very difficult to work with in the debugging tools.
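For readers who audit these settings by script, the dump types above correspond to values stored under the CrashControl registry key. The following is a minimal sketch of that mapping in Python. The numeric values and the FilterPages convention come from Microsoft's public documentation rather than from this article, so treat them as assumptions to verify on your Windows version. The function name is hypothetical.

```python
# Registry location of the Recovery options (per Microsoft documentation;
# this path and the values below are not stated in the article itself).
CRASH_CONTROL_KEY = r"HKLM\SYSTEM\CurrentControlSet\Control\CrashControl"

# CrashDumpEnabled values and the dump type each one selects.
DUMP_TYPES = {
    0: "None",
    1: "Complete",
    2: "Kernel",
    3: "Small",
    7: "Automatic",
}


def dump_type_name(crash_dump_enabled: int, filter_pages: int = 0) -> str:
    """Translate CrashDumpEnabled (and FilterPages) into a dump type name."""
    # An Active dump is stored as a Complete dump with page filtering enabled.
    if crash_dump_enabled == 1 and filter_pages == 1:
        return "Active"
    return DUMP_TYPES.get(crash_dump_enabled, "Unknown")


if __name__ == "__main__":
    for value, pages in [(7, 0), (1, 1), (1, 0)]:
        print(value, pages, dump_type_name(value, pages))
```

The function only interprets values you have already read, for example from the output of “reg query”, so it can be run on any machine when reviewing an exported configuration.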

On systems with 32 GB or less of physical RAM, it is feasible to obtain a Complete memory dump. Anything larger is impractical. First, the memory dump file itself consumes a great deal of disk space, which can be at a premium. Second, moving the memory dump file from the server to another location, including transferring it over a network, can take considerable time. The file can be compressed, but compression also requires free disk space. Memory dump files usually compress very well, and it is recommended to compress them before copying externally or sending to Microsoft for analysis.

On systems with more than about 32 GB of RAM, the only feasible memory dump types are kernel, automatic, and active (where applicable). Kernel and automatic are essentially the same; the only difference is that with the automatic type Windows can adjust the paging file during a stop condition, which in many cases allows a memory dump file to be captured successfully the first time.

The ‘Active’ crash dump type, which is new to Windows 10 and Server 2016, is the ideal memory dump type where you need to get kernel-mode and user-mode memory the first time, but have too much memory to configure a complete memory dump. The Active dump type is designed for Hyper-V, SQL, Exchange, or any server that is running a large workload and has a relatively large amount of RAM, say 32 GB or more. Even with the ‘Active’ memory dump type, a server with, say, 1 TB of RAM could generate a memory dump file of 50 GB or more. A file of that size is hard to work with, and can be difficult or impossible to examine in the debugging tools.
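The guidance in this section can be summarized as a small decision helper. This is only a sketch of the rule of thumb described above (Complete is practical up to about 32 GB of RAM, Active requires Windows 10 or Server 2016 and later). The function and parameter names are hypothetical, and the Small dump type is left out because it rarely carries enough information.

```python
from typing import List

# Rule of thumb from this article: Complete dumps are practical up to ~32 GB.
COMPLETE_DUMP_RAM_LIMIT_GB = 32


def feasible_dump_types(ram_gb: float, supports_active: bool) -> List[str]:
    """Return the dump types that are practical for a system of this size.

    supports_active should be True on Windows 10 / Server 2016 or later.
    """
    types = ["Kernel", "Automatic"]
    if supports_active:
        types.append("Active")
    if ram_gb <= COMPLETE_DUMP_RAM_LIMIT_GB:
        types.append("Complete")
    return types


if __name__ == "__main__":
    print(feasible_dump_types(16, supports_active=True))
    print(feasible_dump_types(1024, supports_active=True))
    print(feasible_dump_types(256, supports_active=False))
```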

Why bother with changing automatic recovery options?

In most cases, the Windows default recovery options are optimal for debugging. The purpose of this article is to convey settings that cover the few cases where more than a kernel memory dump is needed the first time. Nobody wants to hear that they need to reconfigure the computing device, wait for the problem to happen again, then get another memory dump either automatically or through a forced method.

The problem comes from the fact that Windows has two main areas of memory: user-mode and kernel-mode. User-mode memory is where applications and user-mode services operate. Kernel-mode memory is where system services and drivers operate. This explanation is extremely simplistic. More information on user-mode and kernel-mode memory can be found at this location on the Internet:

User mode and kernel mode

Scenarios where non-default Recovery options may be needed

What happens if we have a system with a large amount of memory, we encounter or force a crash, examine the resulting memory dump file, and determine we need user-mode address space to continue analysis? This is the scenario we did not want to encounter. We have to reconfigure the system, reboot, and wait for the abnormal condition to occur again.

We need a ‘Complete’ memory dump file

If the ‘Kernel’ or ‘Automatic’ dump file types are not yielding sufficient debugging information, the options are ‘Active’ and ‘Complete’ dump file types.

‘Active’ memory dump file option

Let’s say that debugging analysis shows we need user-mode address space. This is the case where we could try the ‘Active’ memory dump type. The first problem is that we don’t know how large the paging file needs to be. The second problem is that we must have sufficient free disk space available. If we have a secondary local drive, we can redirect the memory dump file to that location, which could solve the second problem. The first problem, having a large enough paging file, remains.

The problem is we won’t know until the next crash occurs after changing to ‘Active’ memory dump type. If the paging file is not large enough, or the output file location does not have enough disk space, or the process of writing the dump file is interrupted, we will not obtain a good memory dump file. In this case we will not know until we try.
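The free disk space condition, at least, can be checked before the next crash. The sketch below compares the free space on a destination drive against an expected dump size. The expected size is a guess you supply (for example the 50 GB figure mentioned earlier), and the 10 percent headroom is an arbitrary assumption, not a Microsoft recommendation. The function names are hypothetical.

```python
import shutil

GB = 1024 ** 3


def free_bytes(path: str) -> int:
    """Return the free space, in bytes, on the volume that holds path."""
    return shutil.disk_usage(path).free


def has_room_for_dump(free: int, expected_dump: int, headroom: float = 0.10) -> bool:
    """True if free space covers the expected dump size plus some headroom."""
    return free >= expected_dump * (1 + headroom)


if __name__ == "__main__":
    # "." stands in for the dump destination, for example a secondary drive.
    available = free_bytes(".")
    print("Free GB:", available // GB)
    print("Room for a 50 GB dump:", has_room_for_dump(available, 50 * GB))
```

This does nothing for the paging file question, which still cannot be answered until a crash has actually occurred.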

‘Complete’ memory dump file option

Wait, we already covered this. ‘Complete’ is not an option with large RAM systems, right? With some additional configuration, we can obtain a ‘Complete’ memory dump file of reasonable size. The trick is that we have to temporarily limit the amount of physical RAM available to Windows. We can do this easily with the ‘System Configuration’ tool, which you can start by running “msconfig” from the Start Menu.

  1. Click the ‘Boot’ tab
  2. Click the ‘Advanced Options’ button
  3. Click the ‘Maximum memory’ box
  4. Change the number, in MB to the desired size.

    For example, 16 GB would be “16384”. The number does not have to be an exact power of 2. You could simply type “20000” for approximately 20 GB.

We can choose a reasonable amount of RAM such as something between 16 GB and 32 GB. We also ensure the paging file is set to at least RAM plus several MB. To be safe, set the paging file to RAM plus 100 MB. The last condition we have to meet is to ensure the output location has enough free disk space to write out the memory dump file.
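The arithmetic behind these settings is simple enough to script. The following sketch converts a target RAM limit in GB to the MB figure entered in the ‘Maximum memory’ box, and derives the matching paging file size using the “RAM plus 100 MB” margin suggested above. The function names are hypothetical.

```python
MB_PER_GB = 1024


def max_memory_mb(ram_gb: int) -> int:
    """Value to type into the 'Maximum memory' box, in MB."""
    return ram_gb * MB_PER_GB


def paging_file_mb(ram_gb: int, margin_mb: int = 100) -> int:
    """Minimum paging file size for a Complete dump: limited RAM plus a margin."""
    return max_memory_mb(ram_gb) + margin_mb


if __name__ == "__main__":
    for gb in (16, 32):
        print(gb, "GB ->", max_memory_mb(gb), "MB max memory,",
              paging_file_mb(gb), "MB paging file")
```

For a 16 GB limit this gives 16384 MB of maximum memory and a paging file of at least 16484 MB. The output location needs at least as much free space as the limited RAM size, because a Complete dump is as large as the RAM available to Windows.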

Once the configurations have been set, restart the system and then either start the issue reproduction efforts, or wait for the abnormal conditions to occur through the normal course of operation. Note that with reduced RAM, the ability of the system to serve workloads will be greatly reduced. Once the debugging information has been obtained, the previous settings can be restored to put the system back into normal operation.

This is a lot of effort to go through and is certainly not automatic. But in the case where user-mode memory is needed, this could be the only option. The following are illustrations of the System Configuration (MSCONFIG) tool to configure maximum memory option:

Figure 1: System Configuration Tool

Figure 2: Maximum memory boot configuration

Figure 3: Maximum memory set to 16 GB

Once maximum memory is configured, click the “OK” button and restart the computer. The operating system will then be limited to the amount of memory configured, in this case 16 GB. With a reduced amount of physical RAM, there may now be sufficient disk space available to capture a complete memory dump file. To reverse the maximum memory configuration, run “MSCONFIG”, click the ‘Boot’ tab, click ‘Advanced Options’, uncheck the “Maximum memory” option, click “OK”, and restart. After the restart, the memory configuration will be as it was before running “MSCONFIG”.

What about Windows guest OS running in hypervisors?

In the majority of cases, a bugcheck in a virtual machine results in the successful collection of a memory dump file. The common problem with virtual machines is disk space required for a memory dump file. The default Windows configuration (Automatic memory dump) will result in the best possible memory dump file using the smallest amount of disk space possible. The main factors preventing successful collection of a memory dump file are paging file size, and disk output space for the resulting memory dump file after the reboot.

There are several situations that make even “normal” Crashdump collecting very difficult.

  1. Virtual machines (VMs) with virtual drives presented via CIFS or SMB to the VM, though configured as a local disk.
  2. Non-persistent Virtual Desktop Infrastructure (VDI) VMs.

Virtual machines (VMs) with virtual drives presented via CIFS or SMB

There are currently hypervisor technologies that use Common Internet File System (CIFS) or Server Message Block (SMB) file shares to host virtual disk files. These drives may be presented to the VM as a local disk that can be configured as the destination for a paging file or crashdump file. The problem occurs when a Windows virtual machine calls KeBugCheckEx and the location for the crashdump file is a virtual disk hosted on a file share. Depending on the exact method of disk presentation, the virtual disk may not be available when needed to write to either the paging file or the location configured to save a crashdump file.

It may be necessary to change the crashdump file type to kernel to limit the size of the crashdump file. Either that, or temporarily add a local virtual disk to the VM and then configure that drive to be the dedicated crashdump location. Great information about the dedicated dump file settings can be found in this article (NOTE: This measure is not necessary in Windows 7/Windows Server 2008 R2 and beyond):

How to use the DedicatedDumpFile registry value to overcome space limitations on the system drive when capturing a system memory dump

The important point is to ensure that any disk used for the paging file, or as a crashdump destination, is available at the beginning of the operating system startup process.

Non-persistent Virtual Desktop Infrastructure (VDI) Virtual Machines (VMs)

Virtual Desktop Infrastructure is a technology that presents a desktop to a computer user, with most of the compute requirements residing in the back-end infrastructure, as opposed to the user requiring a full-featured physical computer. Usually the VDI desktop is accessed via a kiosk device, a web browser, or an older physical computer that may otherwise be unsuitable for day-to-day computing needs.

Non-persistent VDI means that any changes to the desktop presented to the user are discarded when the user logs off. The nature of non-persistent VDI is that a read-only copy of an operating system image is paired with a “write cache” virtual disk. From the time the VDI desktop OS starts, all writes are redirected to the write cache disk. Even writes to the paging file are redirected to the write cache disk.

Typically the write cache disk is sized for normal day-to-day computer use. VDI users are often required to log off at the end of the work day, so the write cache may be sized large enough to handle several days of “normal” computer use.

The problem is that, in the event of a bugcheck, the paging file may no longer be accessible. Even if the pagefile is accessible, the location for the memory dump would ultimately be the write cache disk. Even if the pagefile on the write cache disk could save the output of the bugcheck data from memory, that data may be discarded on reboot. And even if it is not discarded, the write cache drive may not have sufficient free disk space to save the memory dump file.

What then are the options for saving memory dump files on non-persistent VDI desktops?

  • Configure the memory dump type to “small”, and configure the write cache drive as the destination for the memory dump file. In this situation, having a small memory dump file is better than no dump file, for runtime bugcheck stop errors.
  • For situations where a problem leads to a virtual machine “hang” (non-responsive to user input), there are some options:
    • Create a new pool of virtual machines, then put only one virtual machine into that pool. The idea is that we know the name of the virtual machine.
    • Change the configuration of the virtual machine from read-only to “read-write”. Reproduce the issue, copy the dump file to another location if needed, then return the VM to the normal pool, if needed.
    • Temporarily attach another virtual disk to the VM, of sufficient size to save the intended memory dump type. Reproduce the issue, save the memory dump, and then revert changes as needed.

Dealing with computer devices that go non-responsive

In the event a Windows operating system goes non-responsive, additional steps may need to be taken to capture a memory dump.

  • Depending on the type of hang, user-mode memory space may be needed in addition to kernel-mode memory. Therefore, either the Active or Complete memory dump types may be needed.
  • Some configuration changes may need to be made and a reboot required.
    • Memory dump type
    • Paging file configuration
    • Output location for memory dump file
  • A registry setting called “CrashOnCtrlScroll” may need to be set, which requires a restart.

CrashOnCtrlScroll bugcheck

Setting a registry value called CrashOnCtrlScroll provides a method to force a kernel bugcheck using a keyboard sequence. The right CTRL key is held and the SCROLL LOCK key is pressed twice. This will trigger the bugcheck code, and should result in saving a memory dump file. A restart is required for the registry value to take effect. The CrashOnCtrlScroll feature works only where a keyboard with a right CTRL key is available. Not all keyboards have one; the Surface Pro keyboard, for example, does not.

In the event a different key sequence is needed, other than right CTRL + SCROLL LOCK, keyboard keys can be remapped using information from this article. This may also help when accessing a virtual computer where a right CTRL key is not available.
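As an illustration of the CrashOnCtrlScroll setting, the sketch below generates the text of a .reg file that enables it. The registry paths are the keyboard driver keys given in Microsoft's documentation for forcing a crash from the keyboard (i8042prt for PS/2, kbdhid for USB, hyperkbd for Hyper-V guests). They are not listed in this article, so confirm them for your environment before importing anything. The function name and the short driver labels are hypothetical.

```python
# Keyboard driver parameter keys (per Microsoft documentation; verify first).
SERVICES = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services"
KEYBOARD_DRIVER_KEYS = {
    "ps2": SERVICES + r"\i8042prt\Parameters",
    "usb": SERVICES + r"\kbdhid\Parameters",
    "hyperv": SERVICES + r"\hyperkbd\Parameters",
}


def crash_on_ctrl_scroll_reg(drivers=("ps2", "usb")) -> str:
    """Build .reg file text that sets CrashOnCtrlScroll=1 for each driver."""
    lines = ["Windows Registry Editor Version 5.00", ""]
    for driver in drivers:
        lines.append("[" + KEYBOARD_DRIVER_KEYS[driver] + "]")
        lines.append('"CrashOnCtrlScroll"=dword:00000001')
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    # Print the text for review; save and import it only after checking it.
    print(crash_on_ctrl_scroll_reg(("ps2", "usb", "hyperv")))
```

Generating the text rather than writing to the registry directly keeps the change reviewable. As noted above, a restart is still required after the value is set.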

NMI bugcheck

For server-class hardware, and possibly some high-end workstations, there is a method called Non-Maskable Interrupt (NMI) that can lead to a kernel bugcheck. The NMI can often be triggered remotely through a management interface card with its own network connection, which allows access to the server even when the operating system is not running.

Forcing a non-responsive VM to bugcheck from the hypervisor

In the case of a virtual machine that is non-responsive, and cannot otherwise be restarted, there is a PowerShell method available. The PowerShell command “Debug-VM” has a parameter called “-InjectNonMaskableInterrupt”. This command can be issued to the virtual machine from the Windows hypervisor host (Hyper-V) that is currently running that VM.
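If the NMI injection is driven from an automation script, the command line can be assembled ahead of time. The sketch below builds the argument list for launching PowerShell with the Debug-VM command described above. The VM name “SQL01” is a made-up example, the function name is hypothetical, and the -Force switch (which suppresses the confirmation prompt) is taken from the Debug-VM documentation rather than from this article.

```python
from typing import List


def debug_vm_nmi_command(vm_name: str) -> List[str]:
    """Build an argv list that injects an NMI into the named Hyper-V guest."""
    # Single quotes inside a PowerShell single-quoted string are doubled.
    escaped = vm_name.replace("'", "''")
    script = "Debug-VM -Name '" + escaped + "' -InjectNonMaskableInterrupt -Force"
    return ["powershell.exe", "-NoProfile", "-Command", script]


if __name__ == "__main__":
    # Only prints the command. Run it on the Hyper-V host, elevated, with
    # subprocess.run(...) when you really intend to bugcheck the guest.
    print(debug_vm_nmi_command("SQL01"))
```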

Other methods to force a bugcheck from a non-responsive computer (physical or virtual)

The big challenge in the cloud computing age is accessing a non-responsive computer that is in a datacenter somewhere, when your only access method is over the network. In the case of a physical server, there may be an interface card with a network connection that can provide console access over the network. In other cases, such as virtual machines, it can be impossible to connect to a non-responsive machine when the network is your only access path.

Forcing a crashdump from a non-responsive computer using “NotMyFault.exe”

NotMyFault.exe is a Sysinternals tool that, with elevation, can create a Windows bugcheck on demand. The trick, though, is to be able to run NotMyFault.exe when the system is otherwise non-responsive. If you know that you are going to see a non-responsive state within a reasonable amount of time, an administrator can open an elevated command prompt and run the command line version of NotMyFault.exe through an interactive logon session. There is also a GUI version of NotMyFault.exe that can be opened elevated and left running. If you are able to get back to NotMyFault.exe, you may be able to force a bugcheck.

Some other methods such as starting a scheduled task, or using PSEXEC to start a process remotely probably will not work, because if the system is non-responsive, this usually includes the networking stack.

Summary and conclusions

  • Root cause analysis of unusual OS conditions often requires a memory dump file for debugging analysis.
    • In the majority of cases, a “kernel” memory dump (mostly kernel-mode memory), is sufficient.
    • In some cases user-mode memory will be needed as well as kernel-mode. On large memory servers, there are two choices:
      • “Active” memory dump type (Windows 10, Server 2016, or later)
      • Limit active Windows memory for a reproduction cycle using “MSCONFIG”(System Configuration tool).
        • Limit server to 32 GB memory or less temporarily
        • Change memory dump type to “Complete” dump.
        • Set “CrashOnCtrlScroll” registry setting to enable ability to crash the machine with the keyboard
  • The “Active” memory dump type could be an important option for large memory servers.
  • Memory dumps larger than 30 or so GB can be very difficult to analyze with the debugging tools. This is why it is necessary to limit collection to a “kernel” dump, an “active” dump, or a “complete” dump with limited memory.

Hopefully this will help you with your crash dump configurations and collecting the data you need to resolve your issues. Thanks for reading!