My system is randomly freezing / hanging without any explanation. What can I do?

Hello, my name is Mircea - Alexandru Popescu, and I am a Support Engineer with the Windows Core Performance EMEA team.

We see this issue come in a lot, so I wanted to provide some useful guidelines about systems entering an unresponsive state and the data Microsoft engineers need to start troubleshooting.

What is an OS hanging state?

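An OS is in a hang state when it stops responding to input without crashing or rebooting on its own. A few causes of a system hang can be: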
- resource exhaustion (CPU, memory, disk)
- the OS is too busy working exclusively on something (high priority threads, spinlocks, waiting on an event, etc.)
- hardware error

So, what can I do to find out what is happening?

Usually, to bring the business back up as fast as possible and minimize the impact, the first thing a customer will do is hard reboot the OS.
If your goal is only to minimize the impact and have the server up and running as fast as possible, this is OK. However, if you would also like to find out the cause of the hang, then there are a few things that need to be done prior to rebooting the server.

So, what can we do before rebooting the OS to ensure Microsoft support has all the correct data to start the troubleshooting?
Well, to see what is going on, we need to look at the server’s memory. This will give us an idea of what was happening at the time of the hang. To do this, we need to obtain a memory dump of the server in the hang state.

Depending on whether the server is physical or virtual, we can do the following:

For physical systems, we need to configure some registry keys either before the issue occurs or after we have rebooted the server for recovery. If this is done after the recovery reboot, we will need to wait for another occurrence to gather the necessary data.

1) Start Registry Editor

2) Locate the following key in the registry:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\i8042prt\Parameters

3) On the Edit menu, click Add Value, and then add the following registry value:
Name: CrashOnCtrlScroll
Type: REG_DWORD
Data: 1

4) Locate the following key in the registry:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\kbdhid\Parameters

5) On the Edit menu, click Add Value, and then add the following registry value:
Name: CrashOnCtrlScroll
Type: REG_DWORD
Data: 1

6) Locate the following key in the registry:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\CrashControl

7) On the Edit menu, click Add Value, and then add the following registry value:
Name: NMICrashDump
Type: REG_DWORD
Data: 1

8) Quit Registry Editor & reboot
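For those who prefer not to click through Registry Editor on every server, the three values from the steps above can also be written into a .reg file and imported. The following is only a minimal sketch: the helper name `render_reg_file` and the settings list are made up for illustration, and the output uses the standard "Windows Registry Editor Version 5.00" text format. A reboot is still needed after importing.

```python
# Hypothetical helper (not a Microsoft tool): renders the registry values
# from the steps above as .reg file text that Registry Editor can import.
CRASH_DUMP_SETTINGS = [
    (r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\i8042prt\Parameters",
     "CrashOnCtrlScroll", 1),
    (r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\kbdhid\Parameters",
     "CrashOnCtrlScroll", 1),
    (r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\CrashControl",
     "NMICrashDump", 1),
]


def render_reg_file(settings):
    """Return .reg file text for a list of (key, value name, DWORD data)."""
    lines = ["Windows Registry Editor Version 5.00", ""]
    for key, name, data in settings:
        lines.append(f"[{key}]")
        # REG_DWORD values are written as eight hexadecimal digits.
        lines.append(f'"{name}"=dword:{data:08x}')
        lines.append("")
    return "\r\n".join(lines)


if __name__ == "__main__":
    # Print the text; save it as e.g. crashdump.reg and import it on the server.
    print(render_reg_file(CRASH_DUMP_SETTINGS))
```

Review the generated text before importing it, since it changes the same keys as the manual steps.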

When the system hang occurs, hold down the right-hand CTRL key and hit SCROLL LOCK twice to bugcheck the machine (STOP 0xE2).
* Taking the dump too soon may not reveal anything, because the wait times may not be long enough to "raise a flag". We recommend waiting at least 20 minutes or so before creating the memory dump.

** The keyboard combination will only work if the keyboard was attached to the system before the issue occurred.

A keyboard connected to the server through a KVM may also not work. We recommend connecting the keyboard directly to the machine.

***
a/ Ensure that the system is set to create a “Kernel memory dump” or a “Complete memory dump”.
“Complete memory dump” is the recommended setting, because it contains all the memory loaded at the time the dump is created. On systems with a lot of memory, however, this can be a problem because of the disk space required.

For more details about how to configure the memory dump and how much disk space is required, see the following article: https://support.microsoft.com/en-us/help/969028/how-to-generate-a-kernel-or-a-complete-memory-dump-file-in-windows-ser
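As a quick sanity check of the dump setting, the sketch below maps the CrashDumpEnabled registry value to a dump type. Note the assumptions: CrashDumpEnabled is not mentioned in this post; it is the value under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\CrashControl that stores the dump type, and the function name is made up for illustration. Only the two types asked for above are treated as sufficient here.

```python
# Documented CrashDumpEnabled values under
# HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\CrashControl.
# (Assumption: this value is not named in the post itself.)
DUMP_TYPES = {
    0: "None",
    1: "Complete memory dump",
    2: "Kernel memory dump",
    3: "Small memory dump",
    7: "Automatic memory dump",
}

# The post asks for a kernel or a complete memory dump.
REQUESTED_TYPES = {1, 2}


def describe_dump_setting(crash_dump_enabled):
    """Return (dump type name, True if it is one of the requested types)."""
    name = DUMP_TYPES.get(crash_dump_enabled, "Unknown")
    return name, crash_dump_enabled in REQUESTED_TYPES


if __name__ == "__main__":
    for value in (1, 2, 3):
        print(value, describe_dump_setting(value))
```

If the setting is anything else, change it as described in the article above before waiting for the next occurrence.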

b/ For physical systems that have a management console and where the keyboard combination cannot be used, we can use the NMI trigger.

The changes above configure the server to accept both the key combination (right CTRL + SCROLL LOCK twice) and the NMI. The NMI sends an interrupt request with the highest priority, which in the end triggers the memory dump.

For virtual machines, collecting the data is much simpler.

I/ Hyper-V (2008 / 2008 R2 / 2012 / 2012 R2):

a/ Open Hyper-V manager and select the affected VM
b/ Create a checkpoint of the VM while the hanging state is present
* If possible, wait approximately 15 – 20 minutes before creating the checkpoint.

c/ Browse to the appropriate folder that corresponds to your VM and save the .VSV and .BIN files

***Now, why do we ask for the .VSV and .BIN files, and what do we do with them?

Using internal tools, we convert the files into a memory dump. Unfortunately, the tool is no longer available for download. As an alternative, if you prefer not to send the files to Microsoft support or if you would like to debug locally, you can use LiveKD, which is a free Sysinternals tool.

For more details about how to use LiveKD and where to download it, see:
https://blogs.msdn.microsoft.com/vimalsdesk/2014/11/23/taking-a-dump-of-a-vm-running-on-hyper-v/

II/ Hyper-V 2016:

a/ Open Hyper-V manager and select the affected VM
b/ Create a checkpoint of the VM while the hanging state is present
* If possible, wait approximately 20 – 25 minutes before creating the checkpoint.

c/ Browse to the appropriate folder that corresponds to your VM and save the following files to a different location:

.vmcx
.vmrs
.avhdx
.avhdx.mrt
.avhdx.rct
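Picking these files out of the VM folder by hand is easy to get wrong, so step c/ could be scripted roughly as follows. This is only a sketch under a few assumptions: the function name and folder arguments are hypothetical, the destination is outside the VM folder, and files are copied flat, so two files with the same name in different subfolders would overwrite each other.

```python
import shutil
from pathlib import Path

# File extensions listed above for Hyper-V 2016 checkpoints.
HYPERV_2016_SUFFIXES = (".vmcx", ".vmrs", ".avhdx", ".avhdx.mrt", ".avhdx.rct")


def collect_vm_files(vm_folder, destination, suffixes=HYPERV_2016_SUFFIXES):
    """Copy files under vm_folder whose names end with one of the suffixes.

    Returns the names of the copied files. The destination is assumed to be
    outside vm_folder.
    """
    source = Path(vm_folder)
    target = Path(destination)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    # sorted() lists all files first, so copying does not affect the walk.
    for path in sorted(source.rglob("*")):
        if path.is_file() and path.name.lower().endswith(suffixes):
            shutil.copy2(path, target / path.name)
            copied.append(path.name)
    return copied
```

A call would look like `collect_vm_files(vm_folder, destination)`, with both paths replaced by the real VM folder and a location that has enough free space.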

III/ VMware VM (ESX / ESXi under version 6.0)

  1. Log in to the vCenter Server or ESXi host using vSphere Client or vSphere Web Client
  2. Select the Virtual Machine
  3. Create a Snapshot of the VM or, alternatively, place the VM into Suspend State.
  4. Based on the option selected above, in the virtual machine directory, you will find a .vmsn (snapshot) or a .vmss (suspend state) file.
  5. Save the files to a different location

***

As in the Hyper-V scenario, the files can be converted into a memory dump: VMware provides a tool called vmss2core that converts the snapshot / suspend state. Again, the conversion can be done locally, if preferred.
More details about the vmss2core tool: https://labs.vmware.com/flings/vmss2core
VMware article: https://kb.vmware.com/s/article/2003941

IV/ For ESX / ESXi version 6.0 and above:

The steps are the same as for ESXi under version 6.0; the only difference is in the files needed for the conversion into a memory dump. From ESXi version 6.0 onward, an additional file, .VMEM, is required for the conversion to work.
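To make the difference between the two VMware cases concrete, here is a small sketch that lists which file types to save and reports what is still missing. The function names and the version tuple are assumptions made for illustration; the rules themselves come from the steps above (.vmsn for a snapshot, .vmss for a suspended VM, plus .vmem from 6.0 onward).

```python
def vmware_required_suffixes(esxi_version, used_snapshot):
    """Return the file extensions to save for a given ESXi version.

    esxi_version is a (major, minor) tuple, for example (5, 5) or (6, 0).
    used_snapshot is True for a snapshot, False for a suspended VM.
    """
    suffixes = [".vmsn" if used_snapshot else ".vmss"]
    if esxi_version >= (6, 0):
        # From 6.0 onward the memory file is also required.
        suffixes.append(".vmem")
    return suffixes


def missing_suffixes(file_names, esxi_version, used_snapshot):
    """Return the required extensions that no saved file name ends with."""
    names = [name.lower() for name in file_names]
    required = vmware_required_suffixes(esxi_version, used_snapshot)
    return [s for s in required if not any(n.endswith(s) for n in names)]
```

For example, a snapshot taken on 6.5 where only the .vmsn file was saved would report `.vmem` as missing.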

 

***

Additional details that can help in troubleshooting and can provide a faster resolution:

  1. What are the recent changes (application updates, hardware, network, etc.)?
  2. When and how did you first notice this?
  3. Have you noticed any pattern?
  4. Have you updated all binaries (Microsoft and third-party) on the affected machine, as well as the firmware / BIOS, to the latest version?
  5. What was the system doing at the time of the issue (idle, backup, normal work, under heavy load, etc.)?

 

-- Alex