Help! My Director is consuming all my resources!

Author: DJ Ball, Senior Escalation Engineer, Skype for Business

Recently I worked on a couple of cases where administrators were reporting higher-than-average CPU consumption on their Director pool servers. They reported sustained 80 to 90% CPU consumption during peak business hours, most noticeable around the top of each hour. Then, a few hours before the end of their day, the CPU would begin to fall back to its normal 20 to 30% average (normal for these customers; every environment should have its own baseline!).

As we began to troubleshoot the issue over several days, we noticed that only two or three servers in the pool would have high CPU consumption on a given day. We were able to confirm that every server in the pool had high CPU consumption at some point, so the problem was definitely affecting all members of the pool (just not all at the same time).

Watching Task Manager was enough to figure out that RtcHost.exe was the top consumer of CPU time. Now we needed to determine what was causing the problem. Was the load not balanced well across the servers in the pool? Was something different on the problem servers (or on problem servers on problem days)? Was there any increase in users or devices on problem days?

A custom perfmon counter log was needed to dig deeper and understand why this service was consuming more CPU. Here is the Logman command line that allowed the customer to easily create the counter log on each server. I have provided the Performance Counter text file that contains all the counters that we used.


Create command:

logman -create counter SFBPERF -f bin -v mmddhhmm -cf PerformanceCounters.txt -o %systemdrive%\Perflog\%COMPUTERNAME%.LOG -y -cnf 24:00:00

Start command:

logman start SFBPERF

Stop command:

logman stop SFBPERF
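The Performance Counter text file itself is not reproduced in this post. As a starting point, a minimal PerformanceCounters.txt covering just the counters discussed below might look like the following — note this is my reconstruction, written in the \Object(Instance)\Counter form logman expects, and the instance and LS:SIP object names should be verified on your own servers (for example with typeperf -q) before use:

```text
\Processor(_Total)\% Processor Time
\Process(RtcHost)\% Processor Time
\Process(RtcHost)\Private Bytes
\Memory\Available MBytes
\.NET CLR Memory(RtcHost)\% Time in GC
\LS:SIP - Protocol(*)\SIP - Incoming Messages /Sec
\LS:SIP - Load Management(*)\SIP - Average Holding Time For Incoming Messages
```

The file the author actually attached contained a larger counter set; the list above is only the subset analyzed in this article.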


I had the customer run these perfmon logs on each server on both issue and non-issue days, so we could compare problematic against non-problematic captures. Once I had this data, picking it apart was a time-consuming task.

In reviewing the perfmon logs, I started off by adding these two counters. They showed that the RtcHost.exe process trended up exactly in step with total CPU usage; RtcHost was using ~20% of Processor Time\_Total.

Process\% Processor Time\RtcHost

Processor\% Processor Time\_Total



Then I overlaid these additional counters to look at user load:

LS:SIP protocol\SIP - Incoming Messages /Sec

LS:SIP - Load Management\SIP - Average Holding Time For Incoming Messages


It was very clear that SIP - Incoming Messages /Sec jumped from an average of 3,080 to 4,380. That is about a 40% jump in traffic over the course of ~3 minutes. SIP - Average Holding Time For Incoming Messages also rose from essentially 0 to 13.9 at the same time. But when I compared these peaks against other servers in the pool, they were no higher than on servers that were not experiencing high CPU. What I had established was that the 10:00 am hour was a peak time for users joining meetings.
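For clarity, the size of that traffic spike works out as follows (a quick sanity check of the numbers above):

```python
avg_rate = 3080   # SIP - Incoming Messages /Sec, baseline average
peak_rate = 4380  # SIP - Incoming Messages /Sec, during the spike

# Percentage increase relative to the baseline rate.
jump = (peak_rate - avg_rate) / avg_rate * 100
print(f"{jump:.1f}% increase")  # prints "42.2% increase"
```

So "about a 40% jump" is, if anything, slightly understated.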




What is RtcHost doing when it is consuming so much CPU? The next step was to add these counters to the view:

Process\Private Bytes\RtcHost

Memory\Available MBytes

.NET CLR Memory\% Time in GC

The Private Bytes counter showed that the RtcHost process grew from consuming about 1GB of memory to a peak of just over 13GB in the span of 9 minutes. The Available MBytes counter showed that free system memory dropped from an average of ~14GB to 3.6GB over the same period. % Time in GC shows how much of a process's time is being spent in .NET garbage collection. The jump in user load caused the process to consume much more memory, which caused GC to kick into overdrive, which in turn drove up the CPU usage.




Now that we knew GC was our bottleneck, I discovered the customer was still running the old .NET Framework 4.0. The .NET 4.6.2 release has improved memory management performance, and Skype for Business Server has supported .NET 4.6.2 since the February 2017 update. We do not support the .NET 4.7 version, as it has not been fully tested. The 4.6.2 version can be found here.

The .NET Garbage Collector serves as the automatic memory manager for applications written in .NET. While a GC is running, the other worker threads are blocked until it finishes, so the more often GC runs, the less often other work can get done. As a process becomes busier, GC runs more often and for longer periods of time.
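The article is describing the .NET GC, but the underlying principle — a higher allocation rate forces the collector to run more often — holds for most garbage-collected runtimes. As a rough cross-language illustration (a Python sketch of the general idea, not the .NET GC itself):

```python
import gc

# How many generation-0 collections the runtime has performed so far.
gen0_before = gc.get_stats()[0]["collections"]

# Simulate a burst of load: keep allocating container objects, the way
# a busy server process would while handling a spike in requests.
live_objects = []
for _ in range(100_000):
    live_objects.append({"payload": [0] * 4})

gen0_after = gc.get_stats()[0]["collections"]

# The allocation burst forced the collector to run repeatedly.
print(gen0_after > gen0_before)
```

The same cause-and-effect showed up in the perfmon data: Private Bytes climbing is the allocation burst, and % Time in GC climbing is the collector running more often in response.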

Garbage collection has two modes, Server and Workstation. The RtcHost process is configured to use Workstation mode by default. Workstation mode has one thread to perform GC and one memory heap, whereas Server mode has one heap and one GC thread per logical CPU core. This difference can cause a process to consume as much as 2.5 times the amount of memory, so you need to watch the Memory\Available MBytes counter closely to ensure you have enough system memory to handle the change. For a deep dive on GC, the Fundamentals of Garbage Collection is a great resource, and the Exchange Team Blog has this excellent post.

Once the servers were updated with .NET 4.6.2, I had the customer enable server mode GC with concurrency in the RtcHost config file as shown below. You should make a backup of this file before adding the two lines to the <runtime> section. This change requires a reboot to be picked up.


Default path - "C:\Program Files\Skype for Business Server 2015\Server\Core\RtcHost.Exe.config"

<?xml version="1.0" encoding="utf-8" ?>
<configuration>
  <runtime>
    <generatePublisherEvidence enabled="false"/>
    <gcServer enabled="true"/>
  </runtime>
</configuration>





If you think this change may help your environment, you need to consider the following caveats:

  1. Per the server requirements for Skype for Business Server 2015, Director role servers are recommended to have 16GB of memory. You need to closely monitor the Memory\Available MBytes counter before and after making this change; you should have at least 1.5GB free during peak times.
  2. Future cumulative updates may overwrite your custom RtcHost.Exe.config, so you will need to check this setting after each update. This is a custom configuration that needs to be evaluated for each environment.


Thanks for reading!


Comments (5)
  1. soder says:

    Just 1 stupid question from me: how the hell does a director have such high resource utilization, if the director is only used to redirect the user to its home pool during initial sign-in, and the director does not run conferencing services or any other long-persistent session with the client? And also clients normally shouldn’t even touch the director after their first successful logon, as their pool FQDNs and IPs are cached for subsequent logins. So I cannot imagine why such a high load was even possible (if the DOTNET issue wasn’t happening)

    1. Simon Gardner says:

      Directors also have to look up meeting details and redirect to the appropriate pool on a meeting join. I’ve seen high CPU on directors at the top of the hour in larger environments because of this.

      1. soder says:

        Hi Simon:
        I know the director has 2 main roles: 1) terminating simple URL HTTPS traffic to resolve the meeting ID and then redirect the user to the real conf pool, and 2) initial user sigin-in and redirection to their real home pool. But thats all what it really does. Both should happen under a split-second, and then the director is no longer working on anything. Director does not manage any actual conferencing workload, so the former 2 should not require such a huge powerful server. Or should it? You must be running a huge environment (40-80.000 users or even more?) if your director is so strongly stressed.
        By the way, adding the director into your topology added any performance benefit on the other real pools, or what was the reason of having it deployed anyway? Would be happy if you’d answer that, I am really curious.

        1. Simon Gardner says:

          In this case it was a fairly large pool pair (over 20k users) with a director pool in front, comprising two director servers. They were VMs but otherwise to recommended spec. You’re correct that on a conference join, the director has to look up which pool the conference will be on and redirect to there, and this should happen very quickly. However, you tend to get big spikes of conference joins on the hour, and particularly at 10am and 2pm. User behaviour is often to snooze an upcoming meeting until the time it starts, so when you hit 10am you have thousands of people hitting the join button within a window of 30s or so. In bigger environments, particularly those that do a lot of conferencing, monitor the metrics from the article and they go nuts during that window. It’s just a matter of lots and lots of requests to process all at once.

          It was never really a problem, everything ticked along nicely, but if you’d lost one of the directors the remaining one would have struggled a bit with those peaks. It was an eye-opener, as the director conversation is usually around security and they’re not typically thought of as doing much “real” work.

          Haven’t been able to compare with/without a director, although I have seen a case where a “proof of concept” SE was deployed, simple URLs all pointed there, then a large EE pool was deployed later but the simple URLs were overlooked – that SE was working very hard to deal with all the authentication and conference joins by itself!

    2. Ajit says:

      I believe the director role here refers to any enterprise FE pool where the initial autodiscover, meet, dial-in, admin, etc. URL requests hit

Comments are closed.
