PRF: Memory Management (Large System Cache Issues)





Description:  The file system cache resides in kernel address space.  It is used to buffer access to the much slower hard drive.  The file system cache will map and unmap sections of files based on access patterns, application requests and I/O demand.  The file system cache operates like a process working set.  You can monitor the size of your file system cache’s working set using the Memory\System Cache Resident Bytes performance monitor counter.  This value will only show you the system cache’s current working set.  Once a page is removed from the cache’s working set it is placed on the standby list.  You should consider the standby pages from the cache manager as a part of your file cache.  You can also consider these standby pages to be available pages.  This is what the pre-Vista Task Manager does.  Most of what you see as available pages is probably standby pages for the system cache.


On 32-bit systems, the kernel can address at most 2 GB of virtual memory.  This address range is shared and divided among the many resources that the system needs, one of which is the System File Cache’s working set.  On 32-bit systems the theoretical limit for the cache’s working set is almost 1 GB; however, when a page is removed from the working set it ends up on the standby page list.  Therefore the system can cache more than the 1 GB limit if there is available memory.  The working set itself, however, is limited to what can be allocated within the kernel’s 2 GB virtual address range.  Since most modern systems have more than 1 GB of physical RAM, the size of the System File Cache’s working set typically isn’t a problem on a 32-bit system.


With 64-bit systems, the kernel virtual address space is very large and is typically larger than physical RAM on most systems.  On these systems the System File Cache’s working set can be very large and is typically about equal to the size of physical RAM.  If applications or file sharing perform a lot of sustained cached read I/O, the System File Cache’s working set can grow to take over all of physical RAM.  If this happens, process working sets are paged out and there is contention for physical pages, resulting in performance degradation.


The only way to mitigate this problem is to use the provided APIs, GetSystemFileCacheSize() and SetSystemFileCacheSize().  The blog post “Too Much Cache” contains sample code and a compiled utility that can be used to manually set the System File Cache’s working set size.
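
As a rough illustration of the query side of these APIs, the following Python sketch calls GetSystemFileCacheSize() through ctypes and decodes the result.  It is not the sample code from the blog post.  The helper names query_file_cache_limits and describe_limits are hypothetical, the flag values are taken from the Windows SDK headers, and the query simply returns None when it is not run on Windows.  SetSystemFileCacheSize() is deliberately not called here, because it changes system-wide behavior and, per Microsoft's documentation, requires the SeIncreaseQuotaPrivilege privilege to be enabled.

```python
import ctypes
import sys

# Flag bits as defined in the Windows SDK headers.
FILE_CACHE_MAX_HARD_ENABLE = 0x1
FILE_CACHE_MIN_HARD_ENABLE = 0x4


def query_file_cache_limits():
    """Return (min_bytes, max_bytes, flags), or None when not on Windows.

    Hypothetical helper: a thin wrapper over kernel32!GetSystemFileCacheSize.
    The call may fail if the caller lacks sufficient privileges.
    """
    if sys.platform != "win32":
        return None
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    get_size = kernel32.GetSystemFileCacheSize
    get_size.argtypes = [
        ctypes.POINTER(ctypes.c_size_t),  # lpMinimumFileCacheSize
        ctypes.POINTER(ctypes.c_size_t),  # lpMaximumFileCacheSize
        ctypes.POINTER(wintypes.DWORD),   # lpFlags
    ]
    get_size.restype = wintypes.BOOL

    minimum = ctypes.c_size_t()
    maximum = ctypes.c_size_t()
    flags = wintypes.DWORD()
    ok = get_size(ctypes.byref(minimum), ctypes.byref(maximum), ctypes.byref(flags))
    if not ok:
        raise ctypes.WinError(ctypes.get_last_error())
    return minimum.value, maximum.value, flags.value


def describe_limits(limits):
    """Turn a (min_bytes, max_bytes, flags) tuple into readable values."""
    minimum, maximum, flags = limits
    return {
        "min_mb": minimum / (1024 * 1024),
        "max_mb": maximum / (1024 * 1024),
        "hard_max": bool(flags & FILE_CACHE_MAX_HARD_ENABLE),
        "hard_min": bool(flags & FILE_CACHE_MIN_HARD_ENABLE),
    }
```

On a Windows server, describe_limits(query_file_cache_limits()) gives a quick view of whether a hard maximum is currently enforced on the cache's working set.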


 


Scoping the Issue:  Although we normally see this issue on 64-bit file servers, backup servers, and Microsoft Data Protection Manager (DPM) servers, we do at times see it on 32-bit machines as well.  The system, through its use of cache, consumes all available memory until it becomes resource starved and cannot satisfy any new requests for physical memory.  This can appear as a system hang: RDP/RDC sessions cannot be established, new connections to shares are refused, and existing connections can and often do stop responding.


 


Data Gathering:  In all instances, collecting either MPS Reports with the General, Internet and Networking, Business Networks and Server Components diagnostics, or a Performance-oriented MSDT manifest must be done.  Additional data required may include the following:



  • Performance Monitor logs that include the timeframe when the Working Set Trimming occurred.  Ideally, the capture interval should not exceed 10 seconds.  You can create the log parameters manually, or by using the Performance Monitor Wizard.  Required counters include:

    • All Memory Counters / All Instances

    • All Process Counters / All Instances

    • All Disk Counters / All Instances

  • Pool Monitor (PoolMon) logs that include the timeframe when the Working Set Trimming occurred.  Ideally, the capture interval should not exceed 10 seconds.
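
If you create the counter log from the command line rather than through the wizard, the settings above can be sketched as a logman invocation.  The Python helper below only builds the argument list; it does not run anything.  The function name, collector name, and output path are hypothetical, and the counter paths are one reasonable reading of "all Memory, Process, and Disk counters".

```python
def build_logman_command(name, output_path, interval_seconds=10):
    """Build a 'logman create counter' argument list for the counters above.

    Hypothetical helper: the collector name and output path are up to you.
    """
    if interval_seconds > 10:
        # The guidance above is that the capture interval should not exceed 10 seconds.
        raise ValueError("capture interval should not exceed 10 seconds")
    counters = [
        r"\Memory\*",           # all Memory counters
        r"\Process(*)\*",       # all Process counters, all instances
        r"\PhysicalDisk(*)\*",  # all physical disk counters, all instances
        r"\LogicalDisk(*)\*",   # all logical disk counters, all instances
    ]
    return [
        "logman", "create", "counter", name,
        "-c", *counters,
        "-si", str(interval_seconds),  # sample interval in seconds
        "-o", output_path,
    ]


# Example: print the command line so it can be reviewed before it is run.
print(" ".join(build_logman_command("CacheTrace", r"C:\PerfLogs\CacheTrace")))
```

The resulting collector still has to be started (for example with "logman start CacheTrace") before the problem window.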

 


Troubleshooting / Resolution: After you have gathered this data, review the following:



  • MPS Reports

    • Outdated drivers & firmware – in particular for the NIC and Disk / Storage subsystems as well as Anti-virus

    • Event IDs: look for the Event IDs listed above, and also for any 2019 or 2020 events.  These events indicate nonpaged pool (2019) or paged pool (2020) depletion.

  • Performance Monitor Logs

    • Look for evidence of high cache bytes.

    • Also look for evidence of a particular process’ Working Set growing at the time of the flush, as this could indicate why the trim occurred.  Common catalysts include large file copy operations and backup jobs.

    • If there is evidence of a leaking process, test removing or disabling the product to see if the issue goes away. If so, contact the product vendor for a resolution.

  • PoolMon logs

    • Look for trending increase of paged pool or non-paged pool memory which may indicate a leak

    • If there is evidence of a leaking pool tag, research what it correlates to. If possible test removing or disabling the product to see if the issue goes away. If so, contact the product vendor for a resolution
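
The two log reviews above (high cache bytes, and a steadily growing pool) can be roughed out in a few lines once the counter values have been exported to plain lists of numbers.  The sketch below is only an illustration: the function names are hypothetical, the 50% threshold is an arbitrary assumption rather than a documented limit, and a positive slope is a hint to investigate further, not proof of a leak.

```python
def high_cache_samples(cache_resident_bytes, physical_ram_bytes, threshold=0.5):
    """Return the indexes of samples where the cache working set is at or
    above `threshold` (a fraction) of physical RAM.

    The 0.5 default is an assumed cut-off chosen for illustration only.
    """
    return [
        i for i, value in enumerate(cache_resident_bytes)
        if value / physical_ram_bytes >= threshold
    ]


def slope_per_sample(values):
    """Least-squares slope of `values` against the sample index.

    A persistently positive slope for paged or nonpaged pool bytes (or for a
    single pool tag) suggests a growth trend worth researching.
    """
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


# Example with made-up numbers: cache samples against 8 GB of RAM, and a
# nonpaged pool series that grows by roughly 1 MB per sample.
GB = 1024 ** 3
MB = 1024 ** 2
print(high_cache_samples([2 * GB, 5 * GB, 7 * GB], 8 * GB))
print(slope_per_sample([100 * MB, 101 * MB, 102 * MB, 103 * MB]) / MB)
```

With a 10 second capture interval, multiplying the slope by 360 gives an approximate growth rate per hour.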

That brings us back to the only provided solution: use the provided APIs.  While this isn’t an ideal solution, it does work, but with the limitations mentioned above.  To help address these limitations, the SetCache utility has been updated to the Microsoft Windows Dynamic Cache Service.  While this service does not completely address the limitations above, it does provide some additional relief.


The Microsoft Windows Dynamic Cache Service uses the provided APIs and centralizes the management of the System File Cache’s working set size.  With this service, you can define a list of processes that you want to prioritize over the System File Cache.  The service monitors the working set sizes of those processes and backs off the System File Cache’s working set size accordingly.  It runs continuously in the background, monitoring and dynamically adjusting the System File Cache’s working set size.  The service also provides options such as adding slack space for each process’ working set or backing off during a low memory event.
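
To make the "back off" idea concrete, here is a minimal sketch of one possible sizing rule: reserve room for each prioritized process' working set plus its slack, and give the cache what is left, down to a floor.  This is an assumed policy written for illustration, not the service's actual algorithm; the function name, the parameters, and the 100 MB floor are all hypothetical.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def target_cache_max(physical_ram_bytes, process_working_sets,
                     slack_bytes_per_process=0, floor_bytes=100 * MB):
    """Suggest a maximum size for the System File Cache's working set.

    Assumed policy: physical RAM minus the working set (plus slack) of every
    prioritized process, never dropping below `floor_bytes`.
    """
    reserved = sum(ws + slack_bytes_per_process for ws in process_working_sets)
    return max(floor_bytes, physical_ram_bytes - reserved)


# Example: 8 GB of RAM, two prioritized processes using 2 GB and 1 GB,
# each given 256 MB of slack.
limit = target_cache_max(8 * GB, [2 * GB, 1 * GB], slack_bytes_per_process=256 * MB)
print(limit // MB, "MB")
```

A real service would recompute this value periodically and pass it to SetSystemFileCacheSize(); setting the floor too low risks the cache page churn that very small cache working sets can cause.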


Please note that this service is experimental and includes sample source code and a compiled binary.  Anyone is free to re-use this code in their own solution.  You may experience some performance side effects while using this service, as it cannot possibly address all usage scenarios, and some edge usage scenarios may be negatively impacted.  The service only attempts to improve the situation given the current limitations.  Please report any bugs or observations on the blog post.  While we may not be able to fix every usage problem, we will try to offer best-effort support.


One side effect you may experience is Cache Page churn.  If the System File Cache’s working set is too low and there is sustained cached read I/O, the memory manager may not be able to properly age pages.  When forced to remove some pages in order to make room for new cache pages, the memory manager may inadvertently remove the wrong pages.  This could result in cached page churn and decreased disk performance for all applications.


Additional Resources: