Determining Health and Wellness of an OCS Deployment – IM and Presence

As an IT Admin, how do you know when end user experience will start to suffer and which Performance Monitor counters should you be monitoring to ensure your users continue to have a quality experience? Also, how would you predict degradation of user experience proactively?

Author: Stu Osborn

Publication date: July 2008

Product version: Office Communications Server 2007 R2

My colleague Pauline already has an excellent UC blog on this subject. Great stuff... She concentrates on the Front end server role and its interaction with the pool’s SQL Back end server. But there are hundreds of separate Performance Monitor counters for Office Communications Server 2007, and most deployments include several other server roles besides Front end and Back end. Current guidance on this subject from the product team includes administration guides, deployment guides, planning guides, technical reference guides and the like. So what am I offering that is new here?

Well, this blog has new information about how to determine server health. In addition to listing Perfmon counters as recommended by the product team, I identify certain thresholds so you can see when health is degrading and when to take action. I also recommend a three-pronged approach to this task: “polling”, “monitoring” and taking “remedial actions”.

Below are the recommended perf counters with thresholds that should trigger action on the part of an Administrator. The resource utilization, user load and server health counters below are directly applicable to IM/Presence functionality. But you as an IT Admin will need to run resource utilization and user load baseline tests during medium load first to determine what is “normal” for your deployment. Then once you have your baseline numbers, you can add health monitoring counters to your overall monitoring scheme and go from there.

Recommended baseline counters to test and monitor resource utilization:

Processor; % Processor Time (_Total) [should operate at less than 80% during peak load]

Process; % Processor Time (RtcSrv)

Process; % Processor Time (IMMcuSvc)

Memory; Pages/sec ---

Network Interface; Bytes Total/sec ([your NIC]) [should operate at less than 80% capacity of the NIC]

(No baseline rules for individual process or memory utilization)
Pages/sec indicates the total “pressure” on the server’s available memory
Network Interface example: a 100 Mbit/sec NIC has a capacity of 12.5 MBytes/sec, so it should stay below 80% x 12.5 MBytes/sec = 10 MBytes/sec
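
The NIC arithmetic above can be sketched in a few lines of Python. This is only an illustration of the 80% rule of thumb; the function name and the default utilization value are assumptions chosen for the example and are not part of any OCS tool.

```python
def nic_alert_threshold_bytes_per_sec(link_speed_mbit: float,
                                      max_utilization: float = 0.80) -> float:
    """Return the Bytes Total/sec value above which the NIC is over the limit.

    link_speed_mbit is the nominal NIC speed in megabits per second.
    max_utilization is the fraction of capacity considered acceptable
    (0.80 mirrors the 80% guidance in the text).
    """
    # Convert megabits/sec to bytes/sec: 1 Mbit = 1,000,000 bits, 8 bits per byte.
    capacity_bytes_per_sec = link_speed_mbit * 1_000_000 / 8
    return capacity_bytes_per_sec * max_utilization


if __name__ == "__main__":
    for speed in (100, 1000):
        limit = nic_alert_threshold_bytes_per_sec(speed)
        print(f"{speed} Mbit/sec NIC: keep Bytes Total/sec below {limit:,.0f}")
```

For a 100 Mbit/sec NIC this reproduces the roughly 10 MBytes/sec ceiling from the example; the same calculation applies to a gigabit NIC.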

Recommended baseline counters to test and monitor user load:

LC:SIP – 01 - Peers; SIP - 028 - Incoming Requests/sec (_Total)
LC:SIP - 01 – Peers; SIP – 001 – TLS Connections Active (_Total)
LC:SIP – 01 – Peers; SIP – 000 – Connections Active (_Total) [should be less than 15,000 connections per Front end]
LC:SIP – 02 – Protocol; SIP - 001 - Incoming Messages/sec ----
LC:ImMcu – 00 - IMMcuSvc Conferences; IMMCU – 000 - Active Conferences ----

LC:ImMcu – 00 - IMMcuSvc Conferences; IMMCU – 001 – Connected Users ----

LC:USrv – 00 – DBStore; Usrv – 002 – Queue Latency (msec) [healthy is less than 100 msec]
(server health decreases as latency increases to 12 sec when server throttling begins)

LC:USrv – 00 – DBStore; Usrv – 004 – Sproc (Stored Procedure) Latency (msec) [healthy is less than 100 msec]
(server health decreases as latency increases to 12 sec when server throttling begins)
Queue Latency = the time a request spent in the queue to the Back end server
Sproc Latency = the time it took the Back end server to process the request
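
As a rough sketch of how the two DBStore latency thresholds could be turned into a simple check, the Python below classifies a sampled latency value. The two cut-offs (100 msec and 12 sec) come from the guidance above; the function name and the three labels are assumptions made for illustration only.

```python
HEALTHY_BELOW_MS = 100       # "healthy is less than 100 msec"
THROTTLING_AT_MS = 12_000    # server throttling begins at about 12 sec


def classify_dbstore_latency(latency_ms: float) -> str:
    """Classify a Queue Latency or Sproc Latency sample (in milliseconds)."""
    if latency_ms < HEALTHY_BELOW_MS:
        return "healthy"
    if latency_ms < THROTTLING_AT_MS:
        # Health decreases as latency climbs toward the throttling point.
        return "degrading"
    return "throttling"


if __name__ == "__main__":
    for sample in (40, 250, 12_000):
        print(sample, "msec ->", classify_dbstore_latency(sample))
```

The same check applies to both counters, since they share the same thresholds.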

Recommended counters to monitor for server health:

(These counters will indicate negative trends as well as overall server health)
LC:SIP – 01 - Peers; SIP - 024 – Flow-controlled Connections Dropped (_Total)

LC:SIP – 01 - Peers; SIP - 025 – Average Flow-Control Delay (_Total)

LC:SIP – 07 – Load Management; SIP – 000 – Average Holding Time For Incoming Messages ----

LC:ImMcu – 02 – MCU Health And Performance; IMMCU – 005 – MCU Health State ----

LC:USrv – 20 – Https Transport; USrv – 002 – Number of failed connection attempts ----

LC:USrv – 20 – Https Transport; USrv – 002 – Number of failed connection attempts / Sec ----
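
The counters in this post are written as "Object; Counter (instance)". Performance Monitor and command-line tools such as typeperf expect the path form \Object(Instance)\Counter. The small Python helper below is a sketch of that conversion; the function name is an assumption for this example, and the object and counter names you pass in must match exactly what Perfmon shows on your own servers.

```python
from typing import Optional


def counter_path(obj: str, counter: str, instance: Optional[str] = None) -> str:
    # Build a path of the form \Object(Instance)\Counter.
    # The instance part is omitted for counters that have no instance.
    instance_part = f"({instance})" if instance else ""
    return f"\\{obj}{instance_part}\\{counter}"


if __name__ == "__main__":
    # Resource utilization counters from the baseline list above.
    print(counter_path("Processor", "% Processor Time", "_Total"))
    print(counter_path("Process", "% Processor Time", "RtcSrv"))
    print(counter_path("Memory", "Pages/sec"))
```

Collecting the generated paths in a text file makes it easier to keep the same counter set across baseline runs and ongoing monitoring.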

OCS 2007 MOM Pack thresholds from the documentation:

IMMCU - 020 - Throttled Sip Connections (Sample) (number of connections at which new SIP requests are refused)
Sample Interval is 15 minutes. The current health of the MCU. 0 = Normal. 1 = Loaded. 2 = Full. 3 = Unavailable.
Causes: MCU is overloaded, Back end server is slow to respond, or there is a network problem.
Resolutions: This could happen if too many conferences are assigned to this MCU. [should be no more than 500 sessions per MCU]
(Normal = healthy; Loaded = marginal; Unavailable = maximum reached)

IMMCU - 020 - Throttled Sip Connections (Warning) (Error) (number of throttled Sip connections total)
Sample Interval is 15 minutes
Numeric Threshold Rule triggered when the sampled value is greater than 10.
Causes: Peer is not processing requests in a timely fashion.
Resolutions: This can happen if the peer machine is overloaded.
(“Peer” = connected servers, adjacent Front end servers or MCUs in the same EE Pool; the same set of counters applies)
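
The two MOM Pack rules above can be expressed as a small Python sketch. The state values (0 to 3) and the "greater than 10" threshold are taken from the documentation quoted above; the function names are hypothetical and are not part of MOM or OCS.

```python
# MCU health state values as listed in the MOM Pack description.
MCU_HEALTH_STATES = {0: "Normal", 1: "Loaded", 2: "Full", 3: "Unavailable"}

# The numeric threshold rule fires when the sampled value is greater than 10.
THROTTLED_SIP_CONNECTIONS_THRESHOLD = 10


def describe_mcu_health(state: int) -> str:
    """Translate a numeric MCU health state into its documented label."""
    return MCU_HEALTH_STATES.get(state, "Unknown")


def throttled_connections_alert(sampled_value: int) -> bool:
    """Return True when the throttled SIP connections sample should alert."""
    return sampled_value > THROTTLED_SIP_CONNECTIONS_THRESHOLD


if __name__ == "__main__":
    print(describe_mcu_health(1))
    print(throttled_connections_alert(11))
```

Note that the threshold comparison is strict: a sample of exactly 10 does not trigger the rule.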

There are three phases of determining overall deployment health and wellness in a strategic monitoring plan:

Phase I: Start by polling your environment

  • Run OCS Best Practice Analyzer (BPA) to perform a comprehensive inventory of servers and server-side settings. Among other things, BPA will flag incorrect settings and unsupported collocation of server roles and will even tell you if all the required hot fixes are installed, per server role.
  • After performing your server inventory, compare your topology to recommended guidelines by using the Planning Tool for Office Communications Server 2007. This new tool can be very useful if used as a companion with the OCS Planning Guide. It’s an OCS deployment planning tool that uses a wizard to ask questions and then shows a graphical representation of the recommended topology based on profiles originated from the PG (5,000 users; 5-30K; 30-50K; 50-125K) using the recommended hardware.
  • Review OCS Setup logs and OCS Application logs upon first run of the servers just after setup completes. Make a point of checking Application Logs regularly. Also make it a routine practice to click “Show Logs” after OCS setup finishes. The HTML-based hierarchical logs can then be expanded to show errors and the resulting cascading effect on the services.
  • Run Validation Wizards for each server role as they are deployed to diagnose issues upon first run and to review informational and error messages relating to missing configurations or services not started. Those expandable HTML-based logs are very handy for tracking down exactly what’s wrong.
  • Plan to repeat these on a rotating schedule:
    1. BPA – run every month; update BPA every week
    2. Planning Tool – run for major topology changes
    3. Application logs – check logs on all servers every day
    4. Validation wizards – run for every new server deployed

Phase II: Follow a comprehensive plan to monitor your environment

  • Think about minimizing downtime and be proactive: catch and fix issues before they interrupt the services. Use Microsoft Operations Manager (MOM) 2005. You can install the OCS 2007 MOM Pack to create alerts and set the thresholds that trigger them, and you can monitor an operation over time, using reporting to graph weekly, monthly and seasonal usage. IT Admins worth their salt have already determined baselines for average and peak usage periods to ensure there is enough server headroom during predictable usage spikes, and they constantly update this information.
  • Consider using Performance Monitor or MOM Alerts set to page IT Administrators. MOM calls attention to critical events that require administrator intervention. MOM offers info about root causes and suggests solutions from its knowledge database. Guarding SQL against over-usage of CPU, Disk, and Memory and understanding when to add a Front end server is critical to being proactive as your user base grows.
  • Use the Admin tools. OCS has some good out-of-the-box tools for monitoring servers. In the status pane of the Microsoft Management Console, you’ll see status for ‘General Settings’, an Event Log tab and some of the recommended Performance Monitor counters already loaded up.
  • Employ SQL Performance Dashboard to monitor SQL. That veteran team has worked long and hard developing this tool. For the Back end server, problems are likely to boil down to over-using the resources of the machine (disk, CPU or memory). With all the information out there about SQL Server and which Performance Monitor counters to watch, you can likely solve any over-usage problem if you know what to look for.
  • Use Archiving and Call Detail Records to capture data for all sessions on your servers. Then use this information to monitor usage across your entire environment, including usage of specific functionalities, duration of specific sessions and per-user usage of specific features. Then you will understand how your end-users are making use of which OCS features and when. Using Archiving/CDR, you can capture details about how many users are sending IM to whom, when and how often. This will provide more insight about baseline usage of your deployment, not only for IM and multi-party IM but for other functionalities too. Determine usage spikes by analyzing the reports.

Phase III: Take quick and decisive remedial actions

  • Take the proper steps to remedy the most common OCS issues seen because of decaying health of the servers before services are interrupted. Being PROACTIVE is really what you want but if you have to be REACTIVE, you want to strike at the heart of the developing issue. Take advantage of the OCS 2007 Resource Kit and its great set of troubleshooting tools to react properly.
  • Develop an action plan using the OCS Administration Guide and follow it consistently. Even better, change it over time as your user base grows and usage changes. Train and encourage users to gather and upload their logs. For troubleshooting an OCS Director, ask the user to manually populate their server logon with the pool FQDN to rule out operator error or other client issues. Once you’ve confirmed there are no issues logging in directly to the pool, have the user set the logon back to automatic and gather Communicator logs. Generally, those logs are enough to find out what’s happening without going server side.
  • OCS Logger is the tool to do server-side logging. It is documented in the Admin Guide. Network Monitor is also a very useful tool. Armed with both server-side and client side traces, you’ll know what’s up and more importantly, what’s down!
  • Consider adding another Front end server in an expanded topology as thresholds are approached during peak load, but realize there will be a declining return on hardware investment, especially in a consolidated topology. Adding another server will definitely help, but scaling will not be linear. So would a new Front end facilitate an additional 5,000 active users? It’s not out of the question that another server will spread the load, but it’s a false expectation to assume that you can facilitate another 15,000-20,000 active users every time another Front end is added.
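
To make "as thresholds are approached" concrete, here is a minimal Python sketch that compares the Connections Active counter against the 15,000 connections per Front end guidance given earlier. The 80% warning fraction and the function names are assumptions for illustration; pick a warning level that fits your own baseline.

```python
MAX_CONNECTIONS_PER_FRONT_END = 15_000  # guidance from the user load counters
WARN_FRACTION = 0.80                    # assumption: warn at 80% of the limit


def connection_headroom(active_connections: int) -> int:
    """Connections remaining before the per-Front end guidance is reached."""
    return MAX_CONNECTIONS_PER_FRONT_END - active_connections


def approaching_connection_limit(active_connections: int,
                                 warn_fraction: float = WARN_FRACTION) -> bool:
    """Return True once active connections reach the warning level."""
    return active_connections >= MAX_CONNECTIONS_PER_FRONT_END * warn_fraction


if __name__ == "__main__":
    for active in (5_000, 13_000):
        print(active, connection_headroom(active),
              approaching_connection_limit(active))
```

This only flags when one Front end is getting close to its limit. It does not predict how many users a new server will absorb, since scaling is not linear.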

TechNet resources on Troubleshooting IM and Presence issues:

For an in-depth resource on Office Communications Server 2007, including detailed troubleshooting tips, refer to the Office Communications Server 2007 Resource Kit, especially Chapter 13: “Monitoring,” available from MS Press at:

Stu prepared the content for this post prior to transferring to Unify2
