Since I occasionally troubleshoot performance problems, I thought I’d write up the basics of the process I use. Due to the nature of this topic, this isn’t going to be comprehensive coverage of all possible performance problems. I just want to focus on the main process, and list a few of the things I look at first to break the problem down.
As I see it, there are two main symptoms of a server performance problem:
- High average RPC latencies or RPC requests
- Large or rising queues
This blog will focus on identifying possible server bottlenecks related to the first symptom, although the steps for rising queues are pretty much the same.
In general, the causes of a performance problem fall into two categories:
1. problems due to increased load
2. problems with a resource bottleneck
When troubleshooting, I always attempt to differentiate between these two categories of causes first. It is important to note that an increase in load will cause a resource bottleneck, unless the increase is sufficiently small that no symptom of poor performance is observed. Thus, if an increase in load is found, I attempt to identify the cause for the increase in load first, and then identify the resource which is bottlenecked.
Resource bottlenecks may occur on any of the following resources:
- Network (between client and server, or server and any other server)
- Active Directory server
- Other server resources
Increased load can be caused by a change or increase in user activity, or by other applications using server resources. In this blog, I’ll only talk about load caused by MAPI clients, though the principles apply to any type of load.
First, find out when and for how long the problem repros. Check to see if you are reproing the symptom now. To verify that the server is exhibiting the reported RPC performance problem, check the following:
- See if the MSExchangeIS\RPC latency is higher than 50 ms (this is considered a problem)
- See if the number of outstanding RPC requests (MSExchangeIS\RPC Requests) is higher than 50.
- Look at the MSExchangeIS\RPC Operations per second. Is it higher than you expect? This number depends on the number of users. At Microsoft, we usually see about 0.20 operations per second per user on the server.
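The three checks above can be sketched as a small function. This is only a minimal sketch, not part of any Exchange tool: the `RpcSample` class and its field names are hypothetical, and it assumes you have already read the counter values out of Performance Monitor yourself. The thresholds are the ones quoted above.

```python
from dataclasses import dataclass
from typing import List

# Thresholds quoted in the text above.
MAX_RPC_LATENCY_MS = 50
MAX_RPC_REQUESTS = 50
TYPICAL_OPS_PER_USER = 0.20


@dataclass
class RpcSample:
    """One reading of the MSExchangeIS RPC counters (hypothetical container)."""
    latency_ms: float    # RPC latency in milliseconds
    requests: int        # outstanding RPC requests
    ops_per_sec: float   # RPC operations per second for the whole server
    users: int           # number of users on the server


def rpc_symptoms(sample: RpcSample) -> List[str]:
    """Return the symptoms this sample exhibits (empty list if none)."""
    symptoms = []
    if sample.latency_ms > MAX_RPC_LATENCY_MS:
        symptoms.append("high RPC latency")
    if sample.requests > MAX_RPC_REQUESTS:
        symptoms.append("high outstanding RPC requests")
    if sample.users > 0 and sample.ops_per_sec / sample.users > TYPICAL_OPS_PER_USER:
        symptoms.append("RPC operations per user above the typical rate")
    return symptoms
```

An empty result doesn't prove the server is healthy; it only means the symptom isn't reproing in that particular sample.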
If the server isn’t exhibiting a performance issue at the time you are investigating, you may not be able to find out what is going wrong. Nonetheless, even if these counters appear to be healthy, if users are complaining, I usually continue to investigate anyway – but keep in mind it’s best to investigate while the server is unhealthy. If the server is healthy, I use the information I gather as a baseline to compare against when the server exhibits poor performance. Now you’re ready for the next step: identifying high or increased load.
After I find out if the problem is reproing, the next thing I look at is the sources of server load. There are many sources of load such as incoming mail rates, MAPI operations, POP3 requests, or even 3rd party software running on the server. What you look at first will depend on what you know about the system. For a typical back-end server that hosts Outlook users, I look for high rates of RPC Operations per second. If RPC rates haven’t changed, yet the health of the server has declined, I may also look for other sources of increased load, such as the incoming message rates. If I find an increase in load, I attempt to identify the cause.
In the case of high RPC load, I look to see if more than 20-30% of the load is due to a single user, or if the load is distributed across many users. I also check if the number of logons per user is excessive (greater than 4 per user) for each database. Finally, I see if the average mailbox size is high on that server, or if individual folders are excessively large (more than 5000 items in a user’s inbox, sent items, deleted items or calendar folder). Large folders and mailbox growth can lead to increased CPU and I/O load.
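The logon and folder-size checks are simple enough to script once you have the numbers. Here is a rough sketch; the function names and the shape of the input (a dict of folder name to item count for one mailbox) are my own assumptions for illustration, and gathering the counts is left to whatever reporting you already have.

```python
from typing import Dict, List

MAX_LOGONS_PER_USER = 4   # more than 4 logons per user is excessive
MAX_FOLDER_ITEMS = 5000   # more than 5000 items in a key folder is excessive
WATCHED_FOLDERS = ("inbox", "sent items", "deleted items", "calendar")


def excessive_logons(logons: int, users: int) -> bool:
    """True if the average logons per user on a database exceeds the limit."""
    return users > 0 and logons / users > MAX_LOGONS_PER_USER


def oversized_folders(folder_counts: Dict[str, int]) -> List[str]:
    """Return the watched folders of one mailbox that hold too many items."""
    return [
        name
        for name, count in folder_counts.items()
        if name.lower() in WATCHED_FOLDERS and count > MAX_FOLDER_ITEMS
    ]
```

For example, `oversized_folders({"Inbox": 12000, "Calendar": 300})` flags only the inbox.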
As I mentioned, at Microsoft, our users average 0.2 ops/sec (MSExchangeIS RPC Operations per second divided by the number of mailboxes on the server) at our peak busy time (around 9-11 am). If the whole server is even 10% higher than that for a sustained period, I suspect we’ve had an increase in load. Normally a 10% fluctuation wouldn’t be noticeable, but we keep a history of this value, so I know what is normal for our load profile. For other systems, I have to guess, which usually works fine too. I tend to get very concerned when RPC rates are higher than 0.4 ops/sec per user, though problems can occur at lower rates. If I don’t have a baseline for the unhealthy server, I compare with other servers in the same company that are healthy. Are the per-user rates higher on the server that is having trouble? If you don’t know your load profile and have nothing to compare against, you can use 0.20 ops/sec as an approximate baseline for active users.
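To make the arithmetic concrete: a server with 2,000 mailboxes showing 460 RPC operations per second works out to 460 / 2000 = 0.23 ops/sec per user, which is more than 10% above a 0.20 baseline. A minimal sketch of that comparison follows. The function names and the three labels are mine, and the 0.20 default is only the fallback baseline mentioned above; substitute your own history if you have it.

```python
def ops_per_user(rpc_ops_per_sec: float, mailboxes: int) -> float:
    """RPC Operations per second divided by the number of mailboxes."""
    return rpc_ops_per_sec / mailboxes


def classify_load(rate: float, baseline: float = 0.20) -> str:
    """Compare a per-user rate against a baseline.

    More than 0.4 ops/sec per user is very concerning; more than 10%
    above the baseline (sustained) suggests an increase in load.
    """
    if rate > 0.40:
        return "very concerning"
    if rate > baseline * 1.10:
        return "increased"
    return "normal"
```

Keep in mind this only makes sense for sustained readings at your peak busy time; a single spike is not an increase in load.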
If the rates are high, I run Exmon.
- Use Exmon to determine if a single user is responsible for more than 20% of the server’s RPC load (on a server with more than 200 users) or 40% of the load (on a server with 200 or fewer users). I usually collect 1 minute of data every 5-10 minutes, to look for users that are consistently consuming a lot of CPU. It’s normal for some common operations to cause a user to hog CPU for a short period of time – ignore this. You’re looking for the guy that is at the top of the list most of the time. This can be more of an art than a science at times… use your best judgment.
- If the RPC rate is high (for a single user or for everyone), find out if users have desktop search, 3rd party client plug-ins or BlackBerry devices. Consider investigating these applications as the source of high load, and trying to reduce the load (by removing plug-ins or verifying that they are being used in an optimal fashion).
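Finding "the guy that is at the top of the list most of the time" can be sketched like this. It assumes you have copied each collection's per-user share of the load (as a fraction from 0 to 1) into a dict by hand; nothing here reads Exmon output directly. The `min_fraction` cutoff of one half is my own stand-in for "most of the time", so adjust it to taste.

```python
from collections import Counter
from typing import Dict, List


def share_threshold(user_count: int) -> float:
    """20% on servers with more than 200 users, 40% otherwise."""
    return 0.20 if user_count > 200 else 0.40


def consistent_top_users(
    samples: List[Dict[str, float]],
    user_count: int,
    min_fraction: float = 0.5,  # assumption: "most of the time"
) -> List[str]:
    """Return users over the share threshold in at least min_fraction of samples."""
    threshold = share_threshold(user_count)
    hits = Counter()
    for sample in samples:
        for user, share in sample.items():
            if share > threshold:
                hits[user] += 1
    needed = min_fraction * len(samples)
    return sorted(user for user, count in hits.items() if count >= needed)
```

A user who spikes in one sample out of many drops out of the result, which matches the advice to ignore short bursts.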
Note: even though I generally sort by the %CPU usage, this doesn’t mean I am expecting a CPU bottleneck. Actually, disk bottlenecks are the most common bottlenecks that Exchange servers encounter. I look at %CPU usage in Exmon because it is fast, and because high CPU usage will usually translate to high I/O. Some people prefer to sort on the Read Pages and PreRead columns as a more precise way to find out which users are causing the most I/O reads, and the Dirty Pages column to find which users are causing the most I/O writes.
To find the bottleneck, I usually look at most of the performance counters that are described in the “Troubleshooting Exchange Server 2003 Performance” whitepaper, though I always start with CPU and disk. For nearly all cases, I use the thresholds from the whitepaper.
With disk, I’m mainly looking for read latencies on the database drives, and write latencies on the transaction log drives. I’m not going to go into all the counters and thresholds because it’s all laid out in the whitepaper. If the latencies are high, or other counters indicate a problem, the server has a disk bottleneck.
Check if the processor is healthy. Mostly, I check that the CPU is below 80%, and that most of the CPU is coming from the store process (on a back-end machine). If CPU is higher, I know we have a CPU bottleneck. If it’s not coming from store.exe, I find out what process is hogging CPU.
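As a rough sketch of that CPU check: the function below takes the total CPU percentage and a dict of per-process CPU percentages that you have collected yourself. The function name, the input shape, and the process names in the example are hypothetical, and treating "the top consumer is store.exe" as equivalent to "most of the CPU is coming from the store process" is a simplification on my part.

```python
from typing import Dict, List

CPU_HEALTHY_BELOW = 80  # percent; at or above this I call it a CPU bottleneck


def check_cpu(total_cpu_percent: float, per_process_percent: Dict[str, float]) -> List[str]:
    """Return findings about CPU health on a back-end server."""
    findings = []
    if total_cpu_percent >= CPU_HEALTHY_BELOW:
        findings.append("CPU bottleneck")
    if per_process_percent:
        # Find the single largest consumer and see whether it is the store.
        top = max(per_process_percent, key=per_process_percent.get)
        if top.lower() != "store.exe":
            findings.append(f"top CPU consumer is {top}, not store.exe")
    return findings
```

For example, `check_cpu(92, {"store.exe": 30, "backup.exe": 55})` reports both a bottleneck and the process to chase down.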
There are many other things that can impact performance. I always recommend running the ExBPA tool to ensure the server is well configured. No one can remember the thousands of configuration details to check; let ExBPA do it for you. Here are a few other things you may want to check to verify that the server configuration encourages good performance:
- Are any maintenance tasks still running, or have they run recently? Make sure all maintenance tasks run during non-busy hours.
- Are the transaction log drives shared with any other resource?
- Are the database files, temp file, tmp file, SMTP server or system drives shared with any other resource?
- Is RegTrace enabled? (leaving RegTrace enabled can cause performance issues)
- Is there less than 10% free disk space on any drive used by the Exchange Server?
- Is there less than 20MB free on any drive used by the Exchange Server?
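The last two checks in the list above are easy to automate. Here is a minimal sketch using Python's standard `shutil.disk_usage`; the function names are mine, and you would still need to supply the list of drives the Exchange Server actually uses.

```python
import shutil
from typing import List

MIN_FREE_FRACTION = 0.10            # less than 10% free is a warning
MIN_FREE_BYTES = 20 * 1024 * 1024   # less than 20 MB free is a warning


def free_space_warnings(total_bytes: int, free_bytes: int) -> List[str]:
    """Apply both free-space checks to one drive's totals."""
    warnings = []
    if total_bytes > 0 and free_bytes / total_bytes < MIN_FREE_FRACTION:
        warnings.append("less than 10% free")
    if free_bytes < MIN_FREE_BYTES:
        warnings.append("less than 20 MB free")
    return warnings


def check_path(path: str) -> List[str]:
    """Look up the drive that holds `path` and apply the checks."""
    usage = shutil.disk_usage(path)
    return free_space_warnings(usage.total, usage.free)
```

Calling `check_path` once per database, log and system drive gives you a quick pass over all of them.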
Occasionally, hardware is unhealthy, and that is the cause of a resource bottleneck. You’re just going to have to make that judgment based on the individual circumstances. Are disk latencies high even though the throughput is low? Maybe something just isn’t performing well. If you suspect hardware isn’t living up to spec, swap it out if you can.
Once I know what’s going on, I can start working on suggested resolutions. I usually get the whole picture before making any changes, because most systems will exhibit many problems simultaneously. It is easy to focus on the first problem that is found, and miss another bigger problem. So, don’t act on these resolutions until you can answer both of these questions:
1) Is the problem caused by increased load?
2) Which resources are my bottlenecks?
If a performance problem is due to increased load, you have a couple of options. First, if you have identified the source of the high load, you might be able to reduce the load – perhaps by asking users to install fixes for some of their client applications, or to stop using certain expensive applications. That’s the most obvious, but it’s not always an option. Next, you may want to restrict mailbox sizes, and instruct users to archive items out of folders – this also reduces load. Finally, you may decide to spread the load between servers by moving users. For example, if some users have a lot of email-intensive applications, you may want to avoid putting them on the same database or same server. On the other hand, sometimes you may do the opposite – move the extra heavy users to their own server and let them duke it out between themselves for server resources. Either way, the original server, and the rest of the users, are happier.
Resolving performance problems due to a bottleneck
If you can’t reduce the load on a server, your options are to improve the capacity of the resource that is bottlenecked, modify the configuration of the server when applicable, replace malfunctioning hardware, or move users off to another server. Increasing the capacity of a resource usually means adding more hardware. Sometimes you can increase the capacity by offloading some server work to another server, such as removing optional applications that are running on the server.
If any of the disks are unhealthy, and there hasn’t been an increase in load, I first check to see if the disks are used by anything else (are they shared with another Exchange server, or disk-intensive application like SQL?). Do performance problems only occur when the disks are also being accessed by the other server/program? Exchange doesn’t do well when disks are bottlenecked, and in my experience, disks that are shared often get bottlenecks. The transaction log drives, in particular, should not be shared by any other resource.
If disks are unhealthy, you have a few basic choices: move users to another server or to another database hosted on a different drive system, or increase the number of spindles for the current disk array.
The resolutions for a CPU bottleneck are simple: increase the processor capacity by increasing the number of CPUs or turning on hyper-threading when applicable, move users to another server, or remove any optional applications on the server that are consuming CPU.
If the kernel memory is unhealthy, and the number of logons per user is high, I recommend removing users from the server or, if this is an option, reducing the number of logons per user. You can reduce logons per user by turning off 3rd party plugins, or reducing the number of client applications per user.
There are many more details that I’ve left out of this blog, but I’ve covered the basics.
Here is the process in a nutshell:
1. Identify the symptoms.
   - Find out when the problem occurs.
2. Identify if there has been an increase in load.
   - Try to identify the cause of the increase.
   - Reduce the load if you can.
3. Identify the hardware bottlenecks.
4. Check if the bottleneck is due to a misconfiguration.
   - Fix the misconfiguration if present.
5. Check if the bottleneck is due to malfunctioning hardware.
   - Replace the hardware if necessary.
6. Finally, if 2-5 don’t resolve the problem, remove the resource bottleneck.
   - Increase the capacity of the bottlenecked hardware, or
   - Move users to another server that doesn’t have a bottleneck.
Sometimes you will have to iterate through steps 3-6 a few times, as resolving one bottleneck may expose another. And yeah, I feel your pain - a lot of resolutions do involve moving users to other servers. It’s just a fact of life that if the user load increases, you’ve got to move over to make room for it. In this case, it often means more hardware, unless you can convince your users to stop any excessive activity.
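If it helps to see the process above as a decision flow, here is one way to sketch it. This is purely illustrative: the function and its parameters are hypothetical, and each flag stands in for a judgment you have already made by hand using the steps described earlier.

```python
from typing import List, Optional


def next_actions(
    load_increased: bool,
    load_reducible: bool,
    bottleneck: Optional[str],   # e.g. "disk" or "CPU"; None if no bottleneck found
    misconfigured: bool,
    hardware_faulty: bool,
) -> List[str]:
    """Turn the findings from one pass of troubleshooting into an ordered to-do list."""
    actions = []
    if load_increased:
        actions.append("identify the cause of the increased load")
        if load_reducible:
            actions.append("reduce the load")
    if bottleneck is None:
        return actions
    if misconfigured:
        actions.append(f"fix the {bottleneck} misconfiguration")
    if hardware_faulty:
        actions.append(f"replace the malfunctioning {bottleneck} hardware")
    # Adding capacity or moving users is the fallback when nothing cheaper applies.
    if not (load_reducible or misconfigured or hardware_faulty):
        actions.append(f"increase {bottleneck} capacity or move users to another server")
    return actions
```

After acting on the list, measure again and run through it a second time, since fixing one bottleneck may expose another.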
Unfortunately, as I mentioned at the start of this blog, there are billions and billions of specific details (this is my Carl Sagan impression) that I have left out, but hopefully this has provided a little structure to the troubleshooting-and-problem-resolution process.
Finally, I’ll take a moment to plug my latest project – I mean, I’d like to mention that there is a new tool that is designed to take away some of the tedium of troubleshooting performance. Look for the Exchange Performance Troubleshooter Analyzer (EXPTA) 1.0 release in a few months. The tool is based on the same technology as ExBPA, and will walk you through the steps of identifying high load and bottlenecks.