Help! My Director is consuming all my resources!


Author: DJ Ball, Senior Escalation Engineer, Skype for Business

Recently I worked on a couple of cases where administrators were reporting higher-than-average CPU consumption on their Director pool servers. They reported sustained 80 to 90% CPU consumption during peak business hours, most noticeable around the top of each hour. Then, a few hours before the end of their day, CPU would begin to fall back to its normal 20 to 30% average (normal for these customers; every customer should establish their own baseline!).

As we began to troubleshoot the issue over several days, we noticed that only two or three servers in the pool would have high CPU consumption on a given day. We were able to confirm that every server in the pool had high CPU consumption at some point, so the problem was definitely affecting all members of the pool (just not all at the same time).

Watching Task Manager was enough to figure out that RtcHost.exe was the top consumer of CPU time. Now we needed to determine what was causing the problem. Was the load not well balanced among servers in the pool? Was something different on the problem servers (or on the problem servers on problem days)? Was there an increase in users or devices on problem days?

A custom perfmon counter log was needed to dig deeper and understand why this service was consuming more CPU. Here is the logman command line that allowed the customer to easily create the counter log on each server. I have provided the performance counter text file that contains all the counters we used.

[Attachment: PerformanceCounters1 (PerformanceCounters.txt)]

 

Create command:

logman -create counter SFBPERF -f bin -v mmddhhmm -cf PerformanceCounters.txt -o %systemdrive%\Perflog\%COMPUTERNAME%.LOG -y -cnf 24:00:00

Start command:

logman start SFBPERF

Stop command:

logman stop SFBPERF
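For reference, the counter file passed to -cf could look something like the following. This is a reconstruction assembled from the counters discussed later in this post, not the exact attachment; counter paths use logman's \Object(Instance)\Counter syntax, and the instance names may need adjusting for your environment.

```
\Processor(_Total)\% Processor Time
\Process(RTCHost)\% Processor Time
\Process(RTCHost)\Private Bytes
\Memory\Available MBytes
\.NET CLR Memory(RTCHost)\% Time in GC
\LS:SIP - Protocol\SIP - Incoming Messages /Sec
\LS:SIP - Load Management\SIP - Average Holding Time For Incoming Messages
```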

 

I had the customer run these perfmon logs on each server on both issue and non-issue days, so we could compare problematic against non-problematic behavior. Once I had this data, picking it apart was a time-consuming task.
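As one way to speed up that picking-apart: a binary .blg counter log can be converted to CSV with relog (e.g. relog SERVER.blg -f csv -o SERVER.csv), and a short script can then flag the sample intervals where a counter spikes. Here is a minimal sketch in Python, with an inline sample standing in for the real export; the column name below is hypothetical, since real relog headers use full counter paths.

```python
import csv
import io

# Inline sample standing in for a relog'd CSV export; the header is hypothetical --
# real relog output uses full counter paths such as "\\SERVER\Process(RTCHost)\% Processor Time".
SAMPLE = """time,rtchost_cpu_pct
09:55,22.1
10:00,31.4
10:02,78.9
10:05,91.2
10:20,35.0
"""

def spikes(csv_text, column, threshold):
    """Return (time, value) pairs where the counter exceeds the threshold."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["time"], float(row[column]))
            for row in reader if float(row[column]) > threshold]

# Flag the intervals where RtcHost CPU exceeded 70%
high_cpu = spikes(SAMPLE, "rtchost_cpu_pct", 70.0)
```

The same scan can be pointed at any of the counters in the log, which makes it easy to line up spikes across servers and days.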

In reviewing the perfmon logs, I started off by adding these two counters. They showed that the RtcHost.exe process trended up exactly with the total CPU usage; RtcHost was using ~20% of Processor\% Processor Time\_Total.

Process\% Processor Time\RtcHost

Processor\% Processor Time\_Total

[Image: DJBlog1 - RtcHost % Processor Time overlaid on _Total]

 

Then I overlaid these additional counters to look at user load:

LS:SIP - Protocol\SIP - Incoming Messages /Sec

LS:SIP - Load Management\SIP - Average Holding Time For Incoming Messages

 

It was very clear that SIP - Incoming Messages /Sec jumped from an average of 3080 to 4380, about a 40% increase in traffic over the course of ~3 minutes. SIP - Average Holding Time For Incoming Messages also rose from essentially 0 to 13.9 at the same time. But when I compared these peaks against other servers in the pool, they were no higher than on servers that were not experiencing high CPU. What I had established was that the 10:00 AM hour was a peak time for users joining meetings.
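Spelled out, the arithmetic behind that jump (figures from the counter data above):

```python
# Values read from the SIP - Incoming Messages /Sec counter
average_rate = 3080  # messages/sec before the spike
peak_rate = 4380     # messages/sec at the peak

increase_pct = (peak_rate - average_rate) / peak_rate * 100 if False else (peak_rate - average_rate) / average_rate * 100
print(f"Traffic increase: {increase_pct:.0f}%")  # ~42%, i.e. "about a 40% jump"
```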

 

[Image: DJBlog2 - SIP Incoming Messages /Sec and Average Holding Time overlaid on CPU]

 

So what is RtcHost doing when it is consuming so much CPU? The next step was to add these counters to the view:

Process\Private Bytes\RtcHost

Memory\Available MBytes

.NET CLR Memory\% Time in GC

The Private Bytes counter showed that the RtcHost process grew from consuming about 1 GB of memory to a peak of just over 13 GB in the span of 9 minutes. The Available MBytes counter showed that free system memory dropped from an average of ~14 GB to 3.6 GB over that same period. % Time in GC shows how much of the process's time is being spent in .NET garbage collection. The jump in user load caused the process to consume much more memory, which pushed GC into overdrive, which in turn drove up the CPU usage.
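Putting numbers on that growth (all values from the counters above):

```python
# RtcHost Private Bytes growth observed in the perfmon log
start_gb, peak_gb, window_min = 1.0, 13.0, 9

growth_per_min = (peak_gb - start_gb) / window_min  # ~1.33 GB per minute
free_drop_gb = 14.0 - 3.6                           # free memory consumed over the same window

print(f"RtcHost grew ~{growth_per_min:.2f} GB/min; free memory fell {free_drop_gb:.1f} GB")
```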

 

[Image: DJBlog3 - RtcHost Private Bytes, Available MBytes, and % Time in GC]

 

Now that we knew GC was our bottleneck, I discovered the customer was still running the old .NET Framework 4.0. The .NET 4.6.2 release has improved memory management performance, and Skype for Business Server has supported .NET 4.6.2 since the February 2017 update. We do not support .NET 4.7, as it has not been fully tested. .NET 4.6.2 can be found here.

The .NET Garbage Collector serves as the automatic memory manager for applications written in .NET. While a GC is running, the other worker threads are blocked until it finishes; the more often GC runs, the less often other work can be done. As a process becomes busier, GC will run more often and for longer periods of time.

Garbage collection has two modes, Server and Workstation. The RtcHost process uses Workstation mode by default. Workstation mode has 1 GC thread and 1 memory heap, whereas Server mode has 1 heap and 1 GC thread per logical CPU core. These differences can cause a process to consume as much as 2.5 times the memory. You need to watch the Memory\Available MBytes counter closely to ensure you have enough system memory to handle this change. For a deep dive on GC, Fundamentals of Garbage Collection is a great resource, and the Exchange Team Blog has this excellent post.

Once the servers were updated to .NET 4.6.2, I had the customer enable Server mode GC with concurrency in the RtcHost config file as shown below. You should make a backup of this file before adding the two lines to the <runtime> section. This change does require a reboot to be picked up.

 

Default path - "C:\Program Files\Skype for Business Server 2015\Server\Core\RtcHost.Exe.config"

<?xml version="1.0" encoding="utf-8" ?>

<configuration>

<runtime>

<generatePublisherEvidence enabled="false"/>

<gcServer enabled="true"/>

</runtime>

<system.serviceModel>

<services>
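If you would rather script the edit than make it by hand, a sketch along these lines backs up the file and inserts the gcServer element only when it is missing. This is standard-library Python, the path is the default quoted above, and the function name is purely illustrative:

```python
import shutil
import xml.etree.ElementTree as ET

# Default path from this post; adjust for your install drive.
CONFIG_PATH = r"C:\Program Files\Skype for Business Server 2015\Server\Core\RtcHost.Exe.config"

def enable_server_gc(path):
    """Back up RtcHost.Exe.config, then ensure <gcServer enabled="true"/> exists under <runtime>."""
    shutil.copy2(path, path + ".bak")      # keep a backup, as recommended above
    tree = ET.parse(path)
    root = tree.getroot()
    runtime = root.find("runtime")
    if runtime is None:                    # create <runtime> if the file lacks one
        runtime = ET.SubElement(root, "runtime")
    if runtime.find("gcServer") is None:   # only add the element when it is missing
        ET.SubElement(runtime, "gcServer", enabled="true")
    tree.write(path, encoding="utf-8", xml_declaration=True)
```

Note that ElementTree rewrites the file without its original comments and formatting, so a careful hand edit may still be preferable on production servers; either way, a reboot is still required afterwards.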

 

If you think this change may help your environment, you need to consider the following caveats:

  1. Per the server requirements for Skype for Business Server 2015, servers holding the Director role are recommended to have 16GB of memory. You need to closely monitor the Memory\Available MBytes counter before and after making this change; you should have at least 1.5GB free during peak times.
  2. Future cumulative updates may overwrite your custom RtcHost.Exe.config, so you will need to re-check this setting after each update. This is a custom configuration that needs to be evaluated for each environment.
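To sanity-check caveat #1 before flipping the switch, here is a back-of-the-envelope estimate. The 2.5x factor and 1.5GB floor come from this post; the baseline figures are placeholders to replace with your own perfmon readings:

```python
# Example baseline values -- substitute your own perfmon readings.
total_ram_gb = 16.0          # recommended memory for a Director (caveat 1)
peak_private_bytes_gb = 3.0  # RtcHost peak Private Bytes under Workstation GC
other_usage_gb = 6.0         # everything else on the box at peak

# Worst case from this post: Server mode GC can use up to ~2.5x the memory.
projected_rtchost_gb = peak_private_bytes_gb * 2.5
projected_free_gb = total_ram_gb - other_usage_gb - projected_rtchost_gb

# The post recommends at least 1.5 GB free at peak after the change.
safe = projected_free_gb >= 1.5
print(f"Projected free at peak: {projected_free_gb:.1f} GB (safe: {safe})")
```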

 

Thanks for reading!

DJ.

Comments (4)

  1. soder says:

    Just 1 stupid question from me: how the hell does a director have such high resource utilization, if the director is only used to redirect the user to its home pool during initial sign-in, and the director does not run conferencing services or any other long-persistent session with the client? And also clients normally shouldn't even touch the director after their first successful logon, as their pool FQDNs and IPs are cached for subsequent logins. So I cannot imagine why such a high load was even possible (if the .NET issue wasn't happening)

    1. Simon Gardner says:

      Directors also have to look up meeting details and redirect to the appropriate pool on a meeting join. I’ve seen high CPU on directors at the top of the hour in larger environments because of this.

      1. soder says:

        Hi Simon:
        I know the director has 2 main roles: 1) terminating simple URL HTTPS traffic to resolve the meeting ID and then redirect the user to the real conf pool, and 2) initial user sign-in and redirection to their real home pool. But that's all it really does. Both should happen in under a split-second, and then the director is no longer working on anything. The director does not manage any actual conferencing workload, so the former 2 should not require such a hugely powerful server. Or should they? You must be running a huge environment (40,000-80,000 users or even more?) if your director is so strongly stressed.
        By the way, did adding the director into your topology add any performance benefit on the other real pools, or what was the reason for having it deployed anyway? Would be happy if you'd answer that, I am really curious.

    2. Ajit says:

      I believe the director role here refers to any enterprise FE pool where initial autodiscover, meet, dial-in, admin, etc. URL requests hit
