How To Tackle Performance Issues (For System Center products like ConfigMgr / OpsMgr):
The intention of this blog is to help troubleshoot 'performance issues' on System Center products or at least identify where the bottle neck is from.
First of all, don't get carried away by the name 'performance issues'. The term performance can be interpreted in many ways.
Explain what he / she sees as performance issues. Always get clarity on which action is producing the so called 'performance issues'.
Mostly, performance issues should revolve around local server or SQL.
For Configuration Manager (ConfigMgr), there is a huge dependency for WMI as well. Normally there is an indication about performance issues with backlogging in the inboxes.
For Operations Manager (OpsMgr), just evaluate local server and SQL. Normally there is an indication about performance issues with Events like 2115.
Let's drill a bit down to local server first.
A server can experience sluggishness because of primarily the following 4 parameters.
1) Disk bottleneck
2) Memory starving
3) CPU over utilized
4) Network congestion
Other than this, you could also look at the possibility of WMI repository bloating. (chances are very less for this case though).
A quick check of the size of the repository folder could let you know. If you see that to be like 300 MB or greater than that, it's likely that something is not right there.
Engage windows performance experts if that's the case. (Do NOT blindly rebuild the WMI repository).
Also, look at the WMI memory quota if that is getting exhausted (especially for ConfigMgr provider / console).
Mark Ghazai had written an excellent post about that. - http://blogs.technet.com/b/askperf/archive/2008/09/16/memory-and-handle-quotas-in-the-wmi-provider-service.aspx
We will explore each of the above 4 parameters one by one here (Again, no intention of deep troubleshooting here).
All of these can be covered by taking a perfmon trace while the issue is being reproduced.
'How to take perfmon?'
Go to Start, Run and type "perfmon" (without quotation marks) and press OK.
Expand Data Collector Sets | User Defined
Right-click "User Defined" and select New | Data Collector Set
Give it a Name and click Next (make sure "Create manually (Advanced)" is selected)
Choose “Create data logs” and enabled the checkbox for “performance counter”
Click on the “ADD” button and choose all the below ones.
Please make sure that “All Instances” are selected.
OpsMgr* - - For SCOM
SMS* - - For SMS
SQLServer:Buffer Manager - - For SQL server.
SQLServer:Memory Manager - - For SQL server.
In the “Sample Interval”, please choose “10 seconds” and click Next.
Choose the location where we want the file to be saved and click Next.
Choose “Start this data collector set now” option and click finish.
Always keep in mind that perfmon logs are good for long-term trend analysis only.
That means, a live perfmon may not really be the proper way to catch something. Keep it running for a prolonged time and then do the analysis.
Once you have the perfmon logs, you can do a quick analysis by yourself to see if you can catch something obvious. I have pasted some sample perfmon graphs (not all) for better understanding.
There are a couple of counters that are very important to look at when you suspect disk. The complete analysis / review of all of them is much beyond the scope of this session, however I've tried to put some of those to begin with.
Look at logical volumes versus physical disks. Having multiple logical volumes on a single physical disk doesn't do any good as far as performance is concerned.
Review the %Idle time for the disk. The greater the idle time, the merrier. 🙂
Typically it has to be 70% or higher.
There are many issues that we have seen where the disk idle time would be at 0% for a prolonged period of time.
Take a look at the disk E:\ idle time. This is almost touching zero ALL the time. Typically it has around 90.
This explains that this disk is super busy.
Review the Avg Disk Queue Length counter.
Typically, this will be zero. However, you may end up seeing higher values. It is not a problem if this value goes high and comes back to zero. Prolonged higher values means that those many requests are still *waiting* to access the disk.
The above is the combined graphs of Disk Q length + Read Q length + Write Q length.
Note that the curves for all three graphs are in sync.
The values are not visible in perfmon because, multiple counters are selected. However, the Q length value had an avg of 135 and max of 1268.
That means those many requests are waiting to access the disk.
Typically this value has to be less than double the number of spindles on the disk.
Review the Avg Disk sec / Transfer counter.
This counter reflects what an application is seeing because this tells the time taken for each i/o completion.
Typically, this has to be around 0.00x or atleast 0.01x.
Minute analysis of this requires you to know more details (write cache enabled controller, type of disk like SAN etc). That is beyond the scope of this.
For system center engineers, just look at this value and see if this is high.
In the above graph, Disk sec / transfer value is avg 0.2 and max of 2.25
This is definitely a very poor value. Clearly, the disk is not capable enough.
Review the Split IO/Sec counter.
Normally, this will be close to zero. If not, that disk could be fragmented / low free space or NTFS block size was small.
OpsMgr does not create many small files, however this could be acceptable for around 5 or anything less than 10 for ConfigMgr servers because there will be too many small files under the inboxes.
As you can see above this has an avg of 101 and max of 274. This volume could use a defragmentation to be run.
Review the Cache Bytes counter.
See if there is a fluctuation in cache bytes. That typically is a sign of disk bottleneck.
Review the Available MB counter.
This way you can find out how much of RAM is available still.
No hard and fast rule on this, but typically we need 5% of total RAM to be free.
On 64 bit servers, you would need atleast 250 MB of this ALL the time, but starving condition is around 100 MB.
Starving condition is 4 MB on 32 bit servers.
Here is a sample to help understand how the available MB is proportional to the Total Working Set of processes.
As you can see when the working set goes up, the available MB goes down as expected.(Due to perfmon scaling, the top line is for working set and the bottom line is for available MB)
Rest of the memory counters are not generally needed by System Center engineers unless going in-depth.
You can also take memory dumps of offending process or the entire system if need be for further debugging.
Review the %processor Queue length counter.
This shows the number of threads in the processor queue. A sustained processor queue of greater than two threads generally indicates processor congestion.
Look for longer durations with higher queue values.
Review which process is taking the maximum spike and determine next steps from there.
You can use Process Explorer to drill a bit down on what is happening with the process. Further troubleshooting is beyond the scope of this blog.
Review the %processor time counter.
This counter lets you know the usage of processor time. You need to consider any sustained CPU time over 90%.
Sometimes, you may want to add the ‘Threads’ counter in perfmon in some cases. Do it ONLY in case of troubleshooting CPU spike issues, not otherwise.(Data will be extremely huge if you take this).
If you note the above graph, the top most line in black is the %Processor Time. As you can see, it’s always at 100%.
Also note that there are gaps in between. The reason is because perfmon is a non-invasive process, so it would not collect the data if the system is extremely busy with CPU cycles.
Review the %Privileged time counter.
Privileged time is the time spent doing kernel work.
Sample scenario of smsexec.exe process consuming CPU cycles.
In this graph, I’m painting both the SYSTEM and SMSEXEC’s privileged time usage. The reason you should evaluate SYSTEM as well is because SMSEXEC is a user mode application. SYSTEM is the only process that you would see in the task manager that actually runs in the kernel. and oh! It always will have the PID of 4. (just fyi).
Review the Output Queue Length counter.
Generally this would need investigation if this value is greater than 5.
Review the Packets Received Discarded counter and the Packets Received Errors counter.
Generally this suggests hardware problems or driver issues.
You should be able to at least identify where the bottleneck is from based on the above by now.
Let’s now get back to the SQL server.
Remember to take the normal perfmon counters + include two more SQL counters (MSSQL$:memory manager, MSSQL$:buffer manager) as given above.
SQL performance is again a very vast topic and will go way beyond the scope of this blog.
There are a couple of very basic things that we need to be aware of....
SQL server will consume almost ALL the memory that the operating system has.
This is not a bug, this is normal.
We should do the capping properly (of course, leave memory of OS as well. There is no pre-defined rule as far as I know, but will need to be observed based on perfmon baselining. On the other hand, it is generally safe to cap SQL with 6 GB max memory if the OS has total 8 GB of RAM considering there is no other heavy application).
The Operating system is also designed to trim the working set of SQL to gain back memory if there is a demand to the OS.
This will have a direct impact on SQL server performance. (page faults).
Use the following query to get information about SQL System Configuration:
Select * from sys.configuration
For SQL server as well, do the perfmon analysis like you did for the server.
Additional SQL specific things to consider for basic troubleshooting is as below....
Review the Page Life Expectancy counter.
This tells the time in seconds that a page will remain the buffer pool.
Higher values are good. Typically, this should not drop below 300.
Review the Total Server Memory and Target Server Memory counters.
These tells how much memory SQL Server is using for its cache.
Review the Buffer cache hit ratio counter.
This shows the pages in percentage that is in the buffer pool (ie.. No need to do a read from disk).
100% is excellent, but anything less than 80% is not good enough.
It simply means that SQL needs more memory.
Check the contig level using the following query:
Dbcc showcontig with fast
Once you have the output of the above query, look at Scan Density. Ideally this has to be greater than 80. Also look at Logical Fragmentation for that.
Troubleshooting techniques typically involve looking at the large tables and then identify the scan density and logical fragmentation of that table.
Kevin Holman has written a very great article on 'Useful SQL queries'. I normally use this to copy the queries while troubleshooting.
For instance, in ConfigMgr if you see issues regarding one particular component (like state messages) then evaluate tables associated for that.
For instance, in OpsMgr, if you see issues with one type of workflow (like Alert), then evaluate tables associated for that.
Once you identify the table which is fragmented, You can choose to re-index just that one table or the entire database depending on the output.
This can really have a *significant* performance boost.
Sample example output of the query for a table named 'Test': D
BCC showcontig ('dbo.Test')
DBCC SHOWCONTIG scanning 'Test' table...
Table: 'Test' (546100986); index ID: 1, database ID: 8
TABLE level scan performed.
- Pages Scanned................................: 12360831
- Extent Switches..............................: 6491845
- Scan Density [Best Count:Actual Count].......: 23.80% [1545104:6491846]
- Logical Scan Fragmentation ..................: 50.04%
From the above, we can see that the scan density and the logical fragmentation values are poor.
This clearly indicates that a re-index of this table is needed.
Sometimes updates stats may also be needed. That improves the query plan and ensures that query is with latest statistics, but on the expense of having to recompile the query.
Query to update statistics:
exec sp_msforeachtable 'update statistics ? with fullscan'
Query to rebuild indexes:
Exec sp_MSForEachtable 'DBCC DBREINDEX (''?'')'
Another major parameter which is often neglected is to review the Maximum Degree Of Parallelism (maxdop).
This setting is to tell SQL how many processors to use for query execution.
MAXDOP setting of “0” means all available processors will be used. A setting of 0 may be fine for servers with eight or fewer processors.
For servers with more than eight processors, the time it takes SQL to coordinate the use of all processors may be counterproductive. So for servers with more than 8 processors, don't leave it to 0. Change it to 8.
For Configuration Manager, we don't do much of parallel processing. So, keeping this value to 1 or the total number of *physical* processors should be fine.
For Operations Manager, we may need parallel processing especially for Data Warehouse database. So, keeping this value to 8 or the total number of *physical* processors should be fine.
Note that we are *not limiting* SQL to use only one CPU here instead of all the CPU’s, but this is a mechanism to tell SQL how many CPU to use for executing any given single query.
You could also use the following query to find out what is the optimal maxdop setting for SQL.
when cpu_count / hyperthread_ratio > 8 then 8
else cpu_count / hyperthread_ratio end as optimal_maxdop_setting
What other than these???
Process Explorer is an excellent tool for troubleshooting CPU spike issues. You can look at which thread is consuming CPU cycles on a process and even look at the stack of that thread.
While troubleshooting performance issues, we take memory dumps as well for digging in-depth should the need arise.
Often, many admins come up with a question on when to take memory dumps? What criteria?
To understand about it, a bit of how windows works in the background is beneficial.
To keep it short, just understand, all that a process sees is the Virtual Memory. Windows does the manipulation of physical memory to virtual memory and vice-versa.
When a memory dump is captured for a process, that is the actual virtual memory of that process.
Two types of memory dumps you may want to know about. (hang dumps and crash dumps).
When you suspect something is leaking or if you suspect the application is not doing what it is supposed to do, you tend to go for a hang dump.
When a process is terminating abnormally, you tend to go for a crash dump.
Use tools like adplus or procdump (my personal choice) to generate memory dumps of the suspect process, so that it could be debugged later.
BTW, if you feel perfmon analysis is complex, don't worry. Tools make life easier. Find PAL (http://pal.codeplex.com)
Hope this helps!!
System Center Support.