In this post, I walk you through troubleshooting high non-paged pool usage using free tools available in the Windows Sysinternals suite and the Windows Performance Toolkit (WPT). I also show you how to resolve the issue by properly configuring the Cluster Shared Volume (CSV) cache using PowerShell cmdlets available in the Failover Cluster module.
Here’s the issue
Imagine that you receive the error message below when you attempt to move a virtual machine (VM) from a stand-alone Hyper-V host to a Failover Cluster in order to make it highly available:
This would be unexpected, especially when you know that the destination host runs only a few VMs whose configured memory totals about 20 GB of the 64 GB of installed RAM. Looking at the Use Counts tab in RAMMap on the destination host, you see that:
- Running VMs consume roughly 20 GB of RAM (Driver Locked)
- The non-paged pool is touching around 36 GB (Nonpaged Pool)
Non-paged pool usage is over 50% of total RAM and warrants an investigation. We would rather have VMs consume this memory and maximize density. So, what is this non-paged pool? The non-paged pool consists of kernel virtual memory addresses that always reside in RAM. Mark Russinovich discusses pools in detail here, where he also covers tracking pool leaks using Poolmon, Strings and the Debugging Tools for Windows.
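To put the RAMMap figures in perspective, here is a quick back-of-the-envelope calculation in Python. It uses only the approximate numbers quoted above (64 GB installed, about 20 GB for VMs, about 36 GB of non-paged pool); the variable names are purely illustrative.

```python
# Approximate figures read from RAMMap on the destination host (in GB).
total_ram_gb = 64
vm_memory_gb = 20        # "Driver Locked" row
nonpaged_pool_gb = 36    # "Nonpaged Pool" row

# Share of installed RAM held by the non-paged pool.
nonpaged_share = nonpaged_pool_gb / total_ram_gb

# What is left for the host OS, other workloads and any additional VMs.
remaining_gb = total_ram_gb - vm_memory_gb - nonpaged_pool_gb

print(f"Non-paged pool share of RAM: {nonpaged_share:.0%}")
print(f"RAM left for everything else: {remaining_gb} GB")
```

With these rounded inputs the pool accounts for roughly 56% of RAM, leaving about 8 GB for everything else on the host.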
Troubleshooting tools and techniques
Let us see how WPT can provide some clues about what is causing this condition. To collect a trace, copy both xperf.exe and perfctrl.dll from a WPT installation folder to a temporary working folder on the affected machine. You don't want to install WPT on a production server; the toolset is usually installed on workstations, generally where the analysis will be performed. To start and stop the trace, run the following commands:
- xperf -on PROC_THREAD+LOADER+POOL -stackwalk PoolAlloc+PoolFree+PoolAllocSession+PoolFreeSession -BufferSize 1024 -MinBuffers 256 -MaxBuffers 256 -MaxFile 256 -FileMode Circular
- xperf -stop -d pool.etl
For this scenario, running the trace for about 30 seconds should be adequate.
Once pool.etl (or whatever name you chose) has been generated, copy it to a machine with WPT installed and open it with the Windows Performance Analyzer (WPA). Load symbols and add the "Outstanding Size by Paged, Tag" graph to the analysis view. This immediately shows which tag was used for the allocations.
To get an idea of which driver or kernel-mode component uses a particular tag for pool allocations, have a look in the pooltag.txt file. Pooltag.txt is installed with Debugging Tools for Windows and with the Windows DDK.
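A lookup like this can also be scripted. The sketch below is a minimal Python example that assumes pooltag.txt entries follow the usual "Tag - binary - description" layout, with comment lines starting with "rem"; the function names are hypothetical and the sample lines are only there to exercise the parser.

```python
def parse_pooltag_lines(lines):
    """Build a {tag: (binary, description)} map from pooltag.txt-style lines.

    Assumes each entry looks like 'Tag - binary.sys - description' and that
    comment lines start with the word 'rem'. Anything else is skipped.
    """
    tags = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.split()[0].lower() == "rem":
            continue
        # Split on the first two ' - ' separators only, so a description
        # that itself contains ' - ' stays intact.
        parts = [part.strip() for part in line.split(" - ", 2)]
        if len(parts) != 3:
            continue
        tag, binary, description = parts
        tags[tag] = (binary, description)
    return tags


def lookup_tag(tags, tag):
    """Return (binary, description) for a tag, or None if it is not listed."""
    return tags.get(tag)


# Illustrative sample in the assumed format, not a copy of the real file.
sample = [
    "rem Pool tag list",
    "",
    "AcpA - acpi.sys     -     ACPI arbiter data",
    "8042 - i8042prt.sys - PS/2 kb and mouse",
]
known = parse_pooltag_lines(sample)
print(lookup_tag(known, "AcpA"))
print(lookup_tag(known, "RDrc"))
```

To run it against the real file, pass the lines of pooltag.txt (for example, via `open(path, errors="replace")`) to `parse_pooltag_lines`. A `None` result means the tag is not documented there and another approach is needed.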
Unfortunately, the tag we are after in this case (RDrc) is not listed in Pooltag.txt. Drivers, however, are typically found in C:\Windows\System32\drivers, and searching that directory for RDrc with Sysinternals’ Strings points to csvvbus.sys.
Sysinternals' Sigcheck can be used to get more information on the driver in question. It can be seen from the description in the output that csvvbus.sys is a Cluster Volume Bus Driver. The cool thing about the Sigcheck tool is that it also shows other valuable information such as company and publisher.
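If Strings is not at hand, the search for the tag can be approximated with a few lines of Python. This is a rough sketch rather than a replacement for Strings: it simply checks whether the ASCII bytes of the tag appear anywhere in each .sys file, and the function name is hypothetical.

```python
from pathlib import Path


def find_tag_in_drivers(tag, drivers_dir):
    """Return the names of .sys files whose raw bytes contain the pool tag."""
    needle = tag.encode("ascii")
    hits = []
    for path in sorted(Path(drivers_dir).glob("*.sys")):
        try:
            data = path.read_bytes()
        except OSError:
            # Skip files that cannot be read (locked or access denied).
            continue
        if needle in data:
            hits.append(path.name)
    return hits
```

On the affected host this would be called as `find_tag_in_drivers("RDrc", r"C:\Windows\System32\drivers")` from an elevated prompt. A raw byte match can produce false positives, so treat the result as a list of candidates to confirm with Sigcheck and the call stacks in the trace.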
With this kind of information, taking a look back in the trace makes more sense. The key to effective data analysis is to sort columns appropriately. Folks at the NT Debugging Blog explain this concept here. This is how I had mine set up, and we can see that RDrc comes up under AIFO (allocated inside, freed outside):
The stack shows that csvvbus.sys makes a call to allocate pool with tag (ntoskrnl.exe!ExAllocatePoolWithTag).
How do we fix this?
Having dealt with a lot of Hyper-V clusters in the field, what immediately came to mind after seeing this was the CSV cache! There are also some clues when you look at the call stack in the trace. For context, the CSV cache provides caching at the block level of read-only unbuffered I/O operations by allocating system memory (RAM) as a write-through cache. This document recommends enabling and properly configuring CSV cache for all clustered Hyper-V and Scale-Out File Server deployments.
Use PowerShell to check BlockCacheSize, for example with (Get-Cluster).BlockCacheSize. In this case, it is configured with a maximum of 1 TB instead of the 1 GB or 2 GB I come across in most deployments. This contributes to the high usage of the non-paged pool we observed.
What do we do to fix this? Elden Christensen has some good guidance on how to enable CSV cache. To set the BlockCacheSize to 1 GB, run (Get-Cluster).BlockCacheSize = 1024. The value is in MB, which explains why you may see values such as 1 TB or 1 PB when administrators are not sure whether the property is expressed in bytes, kilobytes, or megabytes. After configuring the cache limit, the non-paged pool usage immediately drops and physical RAM becomes available for other uses, such as accommodating more VMs. Shared-nothing live migration succeeded in my case.
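The unit mix-up is easy to see with a small conversion helper. The Python sketch below relies only on the fact stated above, that BlockCacheSize is expressed in MB; the function name is hypothetical.

```python
def describe_block_cache_size(value_mb):
    """Render a BlockCacheSize value (in MB) in the largest fitting unit."""
    units = ["MB", "GB", "TB", "PB"]
    size = float(value_mb)
    for unit in units:
        # Stop at the first unit where the number drops below 1024,
        # or at the last unit we know about.
        if size < 1024 or unit == units[-1]:
            return f"{size:g} {unit}"
        size /= 1024


# 1 GB entered correctly in MB.
print(describe_block_cache_size(1024))
# 1 GB entered as if the unit were KB (1024 * 1024).
print(describe_block_cache_size(1024 * 1024))
# 1 GB entered as if the unit were bytes (1024 ** 3).
print(describe_block_cache_size(1024 ** 3))
```

An administrator who means 1 GB but types the kilobyte figure ends up with a 1 TB limit, and typing the byte figure yields 1 PB, which matches the oversized values seen in the field.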
As a side note, you can also run Get-ClusterSharedVolume "<CSV Name>" | Get-ClusterParameter to confirm whether the EnableBlockCache private property is set to true (1) for each CSV. I bring this up because I've seen folks try Get-ClusterSharedVolume | FL * and come away unimpressed with the results!
In this post, I demonstrated how free tools that are available in the Sysinternals suite and the Windows Performance Toolkit can quickly help you troubleshoot issues that may not be easy to catch in Windows and/or Windows Server otherwise. In this scenario, I also covered concepts such as the CSV cache and how this feature could have a negative impact on your system if not properly configured. Till next time…