Author: DJ Ball, Senior Escalation Engineer, Skype for Business
Recently I worked on a couple of cases where the administrators were reporting higher than average CPU consumption on their Director pool servers. They reported seeing sustained 80 to 90% CPU consumption during peak business hours. This was most noticeable around the top of each hour. Then, a few hours before the end of their day, the CPU would begin to fall back to their normal 20 to 30% average (normal for these customers, every customer should have their own baseline!).
As we began to troubleshoot the issue over several days, we noticed that only two or three servers in the pool would have high CPU consumption on a given day. We were able to confirm that every server in the pool had high CPU consumption at some point, so this problem was definitely affecting all members of the pool (just not all at the same time) .
Watching Task Manager was enough to figure out that RTChost.exe was the top consumer of CPU time. Now we needed to determine what was causing the problem. Was it load not well balanced among servers in the pool? Was something different on the problem servers (or problem servers on problem days)? Was there were any increase in users or devices on problem days?
A custom perfmon counter log was needed to dig deeper and understand why this service was consuming more CPU. Here is the Logman command line that allowed the customer to easily create the counter log on each server. I have provided the Performance Counter text file that contains all the counters that we used.
logman -create counter SFBPERF -f bin -v mmddhhmm -cf PerformanceCounters.txt -o %systemdrive%\Perflog\%COMPUTERNAME%.LOG -y -cnf 24:00:00
logman start SFBPERF
Logman stop SFBPERF
I had the customer run these perfmon logs on each server on issue and non-issue days (so we could compare problematic vs. non-problematic). Once I had this data, it was a time-consuming task to pick it apart.
In reviewing the perfomns, I started off adding these two counters. They showed that the RTCHost.exe process trended up exactly as the total CPU usage. Rtchost was using ~20% of the Processor time\_Total.
Process\% processor Time\RTCHost
Processor\% processor Time\_Total
Then I overlaid these additional counters to look at user load:
LS:SIP protocol\SIP - Incoming Messages /Sec
LS:SIP - Load Management\SIP - Average Holding Time For Incoming Messages
It was very clear that SIP - Incoming Messages /Sec went from an average of 3080, and jumped to 4380. That is about a 40% jump in traffic over the course of ~3 minutes. SIP - Average Holding Time For Incoming Messages also rose from basically 0, to 13.9 just at this same time. But when I compared these peaks against other servers in the pool, they were no higher than other servers that were not having high CPU. I had established was that the 10:00 am hour was a peak time for users joining meetings.
What is RTChost doing when it is consuming so much CPU? Next was to add these counters to the view:
.Net CLR Memory\% Time in GC
Private Bytes counter showed that RtcHost process grew from consuming about 1Gb of memory to a peak just over 13GB in the span of 9 minutes. Available Mbytes counter showed that total system memory went from averaging ~14GB free, then dropped to 3.6GB free over that same period. % Time In GC is a counter that shows .Net Garbage Collection that is occurring for that process. Our jump in user load is what caused the process to consume much more memory, which causes GC to start kicking into overdrive, which drove up the CPU usage.
Now that we knew GC was our bottleneck, I discovered the customer was still running the old .Net 4.0 framework. .Net 4.6.2 release has improved memory management performance and Skype for Business Server has supported .Net 4.6.2 since the February 2017 update. We do not support .Net 4.7 version as it has not been fully tested. The 4.6.2 version can be found here.
The .Net Garbage Collector serves as the automatic memory manager for applications written in .Net. While GC is running, the other worker threads are blocked until GC finishes. The more often GC is running, the less often other work can be done. As a process becomes busier, GC will run more often and for longer periods of time.
Garbage Collection has two modes, Server and a Workstation. The Rtchost process is configured to use workstation mode by default. Workstation mode will have 1 thread to perform GC, and 1 memory heap, where as Server mode will have 1 heap per logical CPU core and 1 GC thread per CPU core. These differences can cause a process to consume as much as 2.5 times the amount of memory. You need to check the Memory\Available Mbytes counter closely to ensure you have enough system memory to handle this change. For a deep dive on GC, the Fundamentals of Garbage Collection is a great resource and the Exchange Team Blog has this excellent post.
Once the servers were updated with .Net 4.6.2, I had the customer enable server mode GC with concurrency in the Rtchost config file as shown below. You should make a backup of this file before adding the two lines to the <runtime> section. This change does require reboot to be picked up.
Default path - "C:\Program Files\Skype for Business Server 2015\Server\Core\RtcHost.Exe.config"
<?xml version="1.0" encoding="utf-8" ?>
If you think this change may help your environment, you need to consider the following caveats:
- Per Server requirements for Skype for Business Server 2015, Director role servers are recommended to have 16GB of memory. You need to closely monitor the Memory\Available Mbytes counter before and after making this change. You should have at least 1.5GB free during peak times.
- Future Cumulative updates may overwrite your custom RtcHost.Exe.config. You will need to check this setting after each update. This is a custom configuration that needs to be set for each environment.
Thanks for reading!