NOTE: This blog walks through a specific issue in order to show the steps involved in troubleshooting this type of problem. The process name(s) referenced in this content, except for System, could be any process, not just the one used as an example in this post. There is no expectation, nor is it implied, that this particular process will cause you any problems!
Hey everyone, Konstantin Chernyi here. I’m a Premier Field Engineer at Microsoft Russia and today I’m gonna tell you a real-world story that happened recently. Long story short, a customer asked me: “How do I understand why the System process is consuming 100% of a single CPU core on my machine?”
Whenever I see a description or request like this, my first step is to collect an ETW trace. In the past, I would send long instructions on how to install the Windows Performance Toolkit https://docs.microsoft.com/en-us/windows-hardware/test/wpt/ and how to use xperf with the appropriate kernel flags to collect data, but these days, thanks to PG, I don’t need to do that anymore. Since the very first release of Windows 10 and Windows Server 2016, WPR (Windows Performance Recorder) has shipped with the OS, along with a lot of predefined profiles. So all we need to do is collect a short trace at the exact moment the problem is occurring. In this case we used the CPU profile:
Wpr -start CPU
<wait 10-15 seconds, so we have enough information>
Wpr -stop C:\temp\trace.etl
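If you want to hand the customer something they can just run, the same three steps can be wrapped in a small script. This is only a sketch: the helper names wpr_commands and capture_trace are made up for illustration, the output path is the one from the example above, and it assumes wpr is on the PATH and the script is started from an elevated prompt on the affected machine.

```python
import subprocess
import time


def wpr_commands(profile, output_path):
    """Build the start and stop command lines for a WPR capture."""
    start = ["wpr", "-start", profile]
    stop = ["wpr", "-stop", output_path]
    return start, stop


def capture_trace(profile="CPU", output_path=r"C:\temp\trace.etl", seconds=15):
    """Record a short trace while the problem is happening (Windows only)."""
    start, stop = wpr_commands(profile, output_path)
    subprocess.run(start, check=True)
    try:
        # Keep recording long enough to gather a useful number of samples.
        time.sleep(seconds)
    finally:
        # Always stop the recording, otherwise WPR keeps tracing.
        subprocess.run(stop, check=True)
```

Calling capture_trace() with the defaults mirrors the commands above: CPU profile, roughly 15 seconds, trace saved to C:\temp\trace.etl.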
As soon as the customer provided the trace, I opened it in WPA (Windows Performance Analyzer).
The CPU is busy indeed:
Top CPU consumer – System:
The System process has multiple threads, but only one TID (#76) is very active and consuming CPU time:
With public symbols https://docs.microsoft.com/en-us/windows/desktop/dxtecharts/debugging-with-symbols we can go deeper and review the functions called in this thread:
ntoskrnl.exe!KeBalanceSetManager, huh. Time to remember what I read in Windows Internals back in the day. On page 188 of Part 2 of Windows Internals, 6th Edition, you can find an explanation of this function:
The balance set manager (KeBalanceSetManager, priority 16). It calls an inner routine, the
working set manager (MmWorkingSetManager), once per second as well as when free
memory falls below a certain threshold. The working set manager drives the overall memory
management policies, such as working set trimming, aging, and modified page writing.
Hmm, it looks like they are facing memory-related problems, but they didn’t mention that, and the initial request was about high CPU consumption by the System process. Let’s look at the memory info that we have in the trace. There isn’t much of it, since we used the CPU profile, but let’s at least give it a try.
According to the trace, this box has 4.9GB in the Zero and Free lists and 2.7GB in the Standby lists, which gives us 7.6GB of available memory. Confusing, isn’t it? It looks like this box has plenty of available memory, yet the system is calling the KeBalanceSetManager routine every second:
Also, if you take a closer look at memory utilization, you can see that the Paged Pool Commit is around 11.9GB, which is a lot:
Here are some good articles to read on this area:
Total installed RAM on this box – 32GB:
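To put those numbers side by side, here is a quick back-of-the-envelope calculation. The figures are the ones read from the trace above; the variable names are just for illustration.

```python
# Figures read from the trace, in GB.
total_ram_gb = 32.0
zero_and_free_gb = 4.9
standby_gb = 2.7
paged_pool_commit_gb = 11.9

# Available memory is what sits on the Zero, Free and Standby lists.
available = zero_and_free_gb + standby_gb
available_share = available / total_ram_gb
paged_pool_share = paged_pool_commit_gb / total_ram_gb

print(f"Available memory: {available:.1f} GB ({available_share:.0%} of RAM)")
print(f"Paged pool commit: {paged_pool_commit_gb:.1f} GB ({paged_pool_share:.0%} of RAM)")
```

In other words, about a quarter of RAM looks available, while paged pool commit alone accounts for more than a third of the installed memory.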
It looks like we need a better look at what’s happening in memory, and the best way to get that is a memory dump. In this case we decided to try generating a mirror memory dump via livekd.exe.
So with that in mind, I asked the customer to grab a dump by executing livekd -ml -o C:\temp\m.dmp
Next, I’ll use WinDbgX, aka WinDbg Preview https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/debugging-using-windbg-preview and the Mex extension https://www.microsoft.com/en-us/download/details.aspx?id=53304
Let’s start by reviewing virtual memory states:
From here we can see that the PagedPool Commit = 11.95GB, but the PagedPool usage is zero. I assume that’s because we used a mirror dump. Anyway, from this data we can clearly see that we have a memory issue, since Available pages = 2.87MB and there are a lot of pool allocation failures. First, let’s see memory usage by process:
Heh, SynTPEnh.exe consumed 9.48GB of RAM, very well. Now let’s see what’s in Paged Pool:
Hmm, a bunch of Token objects. Let’s shed some light and dump them all:
Again SynTPEnh.exe. Let’s count all the token handles:
Keep in mind that every handle produces about 8 rows of output, so we need to divide the row count by 8. That gives about 160k handles in total, which is a lot, and almost all of them belong to SynTPEnh.exe.
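The division is trivial, but here it is spelled out. Note that the row count below is a hypothetical round number picked so that it lands on the roughly 160k figure; the exact count from the debugger output is not reproduced here, and estimate_handles is just an illustrative helper.

```python
def estimate_handles(output_rows, rows_per_handle=8):
    """Estimate the handle count from the number of rows the debugger printed."""
    return output_rows // rows_per_handle


# Hypothetical row count, chosen only to illustrate the arithmetic.
rows_in_output = 1_280_000
handles = estimate_handles(rows_in_output)
print(f"About {handles:,} token handles")
```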
On the customer machine, the application event log was full of events like this one:
So the next step is to check if there are any “zombie processes” https://randomascii.wordpress.com/2018/02/11/zombie-processes-are-eating-your-memory/, and we can see a lot of them:
The majority of the zombie processes are in session 1, which is not in use in this case, since the customer is using RDP to connect to this machine:
The customer said that this tower had no pointing devices other than a mouse, so it was safe to uninstall the software behind SynTPEnh.exe and check. After the uninstall, memory consumption immediately went down, and with nothing left for the System process to do, it went almost idle.
So what was our root cause?
For some reason, the process SynTPEnh.exe was being created every 4 seconds, did some work for about 1 second, and then crashed. The token handle held by the parent service that started this process was not released, which led to a memory leak and high CPU consumption. Here is an example from the trace showing SynTPEnh.exe processes coming and going all the time:
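If you export the process lifetimes from a trace like this one, spotting that kind of churn is easy to automate. The sketch below is an assumption-heavy illustration: churn_summary is a made-up helper, and the sample start/end times are invented to match the "every 4 seconds, alive for about 1 second" pattern described above rather than taken from the customer trace.

```python
from statistics import mean


def churn_summary(lifetimes):
    """Summarize how often a process is started and how long each instance lives.

    lifetimes: list of (start_seconds, end_seconds) tuples, sorted by start time.
    """
    starts = [start for start, _ in lifetimes]
    gaps = [later - earlier for earlier, later in zip(starts, starts[1:])]
    durations = [end - start for start, end in lifetimes]
    return {
        "instances": len(lifetimes),
        "mean_gap_s": mean(gaps),
        "mean_lifetime_s": mean(durations),
    }


# Invented sample data: a new instance roughly every 4 s, each living about 1 s.
sample = [(0.0, 1.0), (4.0, 5.1), (8.0, 8.9), (12.0, 13.0)]
summary = churn_summary(sample)
print(summary)
```

A short mean lifetime combined with a steady gap between starts is the signature to look for: something keeps relaunching a process that never stays up.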