Author: Pauline Batthish
Publication date: September 2007
Product version: Office Communications Server 2007 R2
Edit July 2009: This post was written for OCS 2007. You may use it with R2, but you may find counters that no longer exist or no longer provide data because of product updates.
One of the most common questions I am asked is "How do I know my server is healthy?" How can you tell if your server performance is adequate? There are a few key counters that are good indicators of overall health from the front end server. This is by no means a comprehensive list and is not meant to identify root cause. These counters will give you the ability to do a quick check on your server health. I recommend verifying these counters on each of the servers in the pool. It's important to know what these counter values are when your server is healthy. A baseline is crucial to understanding what changed when the user experience is degraded.
The first counter to check on your front end server is Processor\% Processor Time. This should be less than 80%. If it is higher, then you need to determine if you have more users connected than usual or if there has been some other change that may result in higher load.
The front end server can indicate problems that may be due to bottlenecks elsewhere in the system, which makes it the best place to start when looking at overall system health. Two of the first counters I always check are LC:USrv - 00 - DBStore\USrv - 002 - Queue Latency (msec) and LC:USrv - 00 - DBStore\USrv - 004 - Sproc Latency (msec). The queue latency counter represents the time a request spent in the queue to the back end, and the sproc latency represents the time it took the back end to process the request. If the back end is in trouble for any reason (disk, memory, network, processor, etc.), the queue latency counter will be high. It can also be high if there is high network latency between the front end and the back end. The next question is "how high is too high?" At 12 seconds the front end servers start throttling requests to the back end, which means they start returning "Server too busy" errors (503) to the client. I expect a healthy server to have DBStore queue latencies below 100 msec at steady state. When a server has just come online and users are all logging in at the same time, however, that counter can be quite high, and you may even see it hit multiple seconds.
The servers will be quite loaded after services are restarted. Performing maintenance during off hours will help mitigate the performance impact, as users will not all be competing to get back in at the same time. Also, if your load balancer is configured for the least number of connections and one of the front end servers is restarted, then all users that attempt to reconnect will be pointed to that server, since it will have fewer connections than the other servers in the pool. It may therefore be overloaded while the other servers in the pool are fine.
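As a quick illustration, the two latency thresholds above can be turned into a small triage helper. This is only a sketch under stated assumptions: the function name `classify_queue_latency` and the three labels are hypothetical, not part of OCS; only the 100 msec and 12 second figures come from the text.

```python
# Thresholds taken from the text; everything else here is illustrative.
HEALTHY_MS = 100        # steady-state queue latency should stay below this
THROTTLE_MS = 12_000    # at 12 seconds the front end starts returning 503s


def classify_queue_latency(latency_ms: float) -> str:
    """Label a DBStore Queue Latency (msec) sample as healthy, elevated, or throttling."""
    if latency_ms < 0:
        raise ValueError("latency cannot be negative")
    if latency_ms >= THROTTLE_MS:
        return "throttling"   # clients will see 'Server too busy' (503)
    if latency_ms >= HEALTHY_MS:
        return "elevated"     # may be normal right after a restart; compare to baseline
    return "healthy"
```

An "elevated" reading is not automatically a problem. As noted above, multi-second values can appear while users are logging back in, so compare it against your baseline before acting.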
If the LC:USrv - 00 - DBStore\USrv - 002 - Queue Latency (msec) or the LC:USrv - 00 - DBStore\USrv - 004 - Sproc Latency (msec) counters are high, the most likely bottleneck is the SQL back end. Is the CPU too high (>80%) on your SQL server? Is the disk latency high? In an ideal world you have enough RAM to hold the entire RTC and RTCDYN databases in memory. Then the only reason the server would access the disk is to write to the log files and flush to the databases. Our tests have shown that 12GB of RAM is sufficient for 100K user deployments. This is based on the assumption that the combined size of the RTC and RTCDYN databases is less than 12GB. If your databases are larger than that, you may find you need more memory. You can tell if you need more RAM by looking at MSSQL Buffer Manager\Page life expectancy; a value less than 3600 indicates memory pressure. Also, you should see little to no reads on your DB drive if you have enough memory, as SQL should only be writing to the database.
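The back end checks described above could be sketched as follows. The `SqlSample` type and the `sql_backend_warnings` function are hypothetical names used for illustration; the 80% CPU and 3600 page life expectancy limits are the ones given in the text. How the samples are collected is left out on purpose.

```python
from dataclasses import dataclass
from typing import List

CPU_LIMIT_PERCENT = 80.0        # SQL server CPU should stay at or below this
MIN_PAGE_LIFE_EXPECTANCY = 3600  # lower values indicate memory pressure


@dataclass
class SqlSample:
    """One reading of the SQL back end counters (hypothetical container)."""
    cpu_percent: float
    page_life_expectancy: float


def sql_backend_warnings(sample: SqlSample) -> List[str]:
    """Return a human-readable warning for each threshold the sample violates."""
    warnings = []
    if sample.cpu_percent > CPU_LIMIT_PERCENT:
        warnings.append("SQL CPU above 80%")
    if sample.page_life_expectancy < MIN_PAGE_LIFE_EXPECTANCY:
        warnings.append("Page life expectancy below 3600: possible memory pressure")
    return warnings
```

An empty list means neither threshold was crossed. It does not rule out other causes such as disk or network latency.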
Let's get back to the front end. There is another throttling mechanism in the front end server. The DBStore latency throttling only kicks in if the latency to the SQL server is high; this second mechanism kicks in if the processing time on the front end itself is high. One cause of this type of throttling is a front end server that is CPU bound. The way it works is that if the average processing time (LC:SIP - 07 - Load Management\SIP - 000 - Average Holding Time For Incoming Messages) on the server exceeds 6 seconds, the server goes into throttling mode and only allows one outstanding transaction per client connection. Once the processing time drops down to 3 seconds, the server drops out of throttling mode and allows up to 20 outstanding transactions per client connection. Whenever the number of transactions on a specific connection exceeds the applicable threshold, the connection is marked as flow controlled, the server does not post any receives on it, and the LC:SIP - 01 - Peers\Flow Controlled Connections counter is incremented. If a connection stays in a flow controlled state for more than one minute, the server closes it. It does so lazily: when it has a chance to check the connection, it determines whether the connection has been throttled for more than one minute and closes it if so.
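The enter-at-6-seconds, exit-at-3-seconds behavior is a form of hysteresis, and a small model may make it easier to follow. This is a sketch of the behavior as described in this post, not the server's actual implementation; the class and function names are hypothetical.

```python
ENTER_THROTTLE_S = 6.0   # average holding time above this enters throttling mode
EXIT_THROTTLE_S = 3.0    # dropping to this value leaves throttling mode
THROTTLED_LIMIT = 1      # outstanding transactions per connection while throttled
NORMAL_LIMIT = 20        # outstanding transactions per connection otherwise
FLOW_CONTROL_TIMEOUT_S = 60.0  # flow-controlled connections are closed after this


class FrontEndThrottle:
    """Tracks throttling mode from successive average holding time samples."""

    def __init__(self) -> None:
        self.throttled = False

    def update(self, avg_holding_time_s: float) -> int:
        """Apply one sample and return the per-connection transaction limit."""
        if not self.throttled and avg_holding_time_s > ENTER_THROTTLE_S:
            self.throttled = True
        elif self.throttled and avg_holding_time_s <= EXIT_THROTTLE_S:
            self.throttled = False
        return THROTTLED_LIMIT if self.throttled else NORMAL_LIMIT


def should_close(flow_controlled_for_s: float) -> bool:
    """True once a connection has been flow controlled for more than one minute."""
    return flow_controlled_for_s > FLOW_CONTROL_TIMEOUT_S
```

Note that samples between 3 and 6 seconds do not change the mode. A server that entered throttling at 7 seconds stays throttled at 5 and 4 seconds and only recovers once the average reaches 3 seconds.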
So, now you know about the two throttling mechanisms. There is one counter that summarizes what, if any, throttling the server is doing. It is LC:SIP - 04 - Responses\SIP - 051 - Local 503 Responses/sec. The term "Local" in this counter means locally generated responses. The 503 code corresponds to server unavailable. You should not see any 503s on a healthy server at steady state. Again, during ramp up, just after a server is brought online, you may see some 503s. But as all the users get back in and the server returns to a stable state, there should not be any more 503s.
The LC:SIP - 04 - Responses\SIP - 053 - Local 504 Responses/sec counter indicates connectivity issues with other servers. It can indicate failures to connect or delays. If you are seeing 504s, one more counter that is good to check is LC:SIP - 01 - Peers\SIP - 017 - Sends Outstanding. This counter indicates the number of requests and responses that are queued outbound, which means that if it is high, the problem is probably not on this server. This counter can be high if there are network latency issues. It could also be a problem with the local NIC, but it is more likely to be due to a problem on a remote server. I have seen this counter be high on a director server when the pool it is attempting to contact is overloaded. The key with this counter is to look at the instances, not just the total. That will help you isolate the target.
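To illustrate looking at instances rather than the total, the sketch below ranks per-instance Sends Outstanding values. It assumes you have already exported the counter into a dictionary keyed by instance name. The function name `busiest_peers` is hypothetical, and the `_Total` key follows the usual Windows performance counter convention for the aggregate instance.

```python
from typing import Dict, List, Tuple


def busiest_peers(sends_outstanding: Dict[str, float], top: int = 3) -> List[Tuple[str, float]]:
    """Return the instances with the most queued outbound messages, highest first."""
    # Drop the aggregate instance so it does not hide the individual peers.
    per_instance = {
        name: value for name, value in sends_outstanding.items() if name != "_Total"
    }
    ranked = sorted(per_instance.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top]
```

For example, given readings of 812 for one pool and single digits for every other peer, the first entry in the result points at the pool that is slow to accept traffic.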
For more technical information and resources for evaluating, deploying and maintaining Office Communications Server 2007, please visit our TechCenter.
Lync Server Resources
- Lync Server 2010 documentation in the TechNet Library
- DrRez blog
- Lync Server and Communications Server resources