How does the ‘Total CPU Utilization Percentage’ monitor work?

I wanted to start this year off right with wishes to everyone for a happy, healthy and prosperous year 2011!

One thing I’d like to start doing more of this year is writing about some of the more common questions that come up in the TechNet Forums.  I’m also open to any suggestions for topics you’d like to learn more about.  You can send me your questions and topics using the contact form on my blog.  If I have some knowledge to share, I’ll make an effort to write about it.

Ustad raised this question recently in the forums, so I figured this would be a good reference point for those interested in learning exactly how this monitor is implemented.  This is an intermediate level overview of this fairly complex monitor, so not all details are covered.

Note: The content in this article refers to the Windows Server management packs up to version 6.0.6794.0.

If we take a look at the monitor type, we can see that this monitor is composed of four modules.

[Image: the four modules that make up the monitor type]

Two data sources

- DS1 uses the native performance data source module to sample Processor \ % Processor Time \ _Total.
- ProbeActionDS uses a script-based probe as a data source to sample ProcessorQueueLength from Win32_PerfRawData_PerfOS_System (WMI).

Two expression filters

- FilterOK filters elements that indicate a healthy condition.
- FilterNotOK filters elements that indicate an unhealthy condition.

Take a look at the order of execution of these modules.

Healthy detection
[Image: order of execution for healthy detection, ending at FilterOK]

Unhealthy detection
[Image: order of execution for unhealthy detection, ending at FilterNotOK]

We can see that DS1 is first in the order of execution and always passes its output to ProbeActionDS, the second module.  The probe returns a property bag whose state value is either good or bad, and that value determines whether we match the FilterOK or the FilterNotOK expression filter, respectively.

By default, this monitor changes to a critical state and generates an alert when the FilterNotOK expression filter is matched.  The monitor changes back to healthy, and the alert is resolved, when the FilterOK expression filter is matched.

Default thresholds

- % Processor Time \ _Total > 95%
- Processor Queue Length > 15

So far, so good.  But often we do not see a monitor state change event or an alert raised even though these thresholds appear to have been exceeded.  Why?

Under the hood of the ProbeActionDS, there is a calculation performed on the Processor Queue Length threshold value.  The default threshold is multiplied by the number of processors.

 ' Excerpt from the probe script: the queue length threshold scales with the processor count
 If nQueueLength > CLng(CPU_QUEUELEN_THRESHOLD) * lNumProcessors Then
     ReturnResults "BAD", nQueueLength & "", CPU_USAGE & ""

This calculation changes the effective threshold for any computer that has more than one processor.  For example, if a computer has 8 processors, the effective threshold for Processor Queue Length is actually 15 × 8 = 120.
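
The arithmetic is easy to check by hand.  The following is a minimal Python sketch of it, not the management pack's code.  The function name effective_queue_threshold is made up for illustration; only the multiplication itself comes from the script excerpt above.

```python
def effective_queue_threshold(configured_threshold: int, num_processors: int) -> int:
    """Return the queue length the probe actually compares against.

    Mirrors the excerpt's CLng(CPU_QUEUELEN_THRESHOLD) * lNumProcessors.
    The function name is hypothetical; it does not exist in the management pack.
    """
    return configured_threshold * num_processors


# Default threshold of 15 on machines with 1, 4 and 8 processors.
for processors in (1, 4, 8):
    print(processors, effective_queue_threshold(15, processors))
```

On the 8-processor machine, the queue length has to exceed 120 before the probe returns BAD.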

So the next time you believe an alert should have been raised for a particular computer where the processor appeared to be stressed, check processor queue length and do the math to reveal why the monitor evaluated to a healthy state.  The simple fix is to change the default threshold for processor queue length to something that makes more sense for your implementation.
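
If you would rather think in terms of a total queue length for the whole machine, you can work backwards to the override value.  Here is a rough Python sketch, assuming the override only takes whole numbers (an assumption, not something confirmed here).  The helper override_for_total is hypothetical and not part of the management pack.

```python
def override_for_total(desired_total_queue: int, num_processors: int) -> int:
    """Largest whole-number override whose effective threshold
    (override * num_processors) does not exceed the desired total."""
    if num_processors < 1:
        raise ValueError("num_processors must be at least 1")
    return desired_total_queue // num_processors


# To alert once the total queue length passes roughly 16 on an 8-processor box:
override = override_for_total(16, 8)
print(override, override * 8)
```

Because the result is rounded down, the effective threshold can end up slightly lower than the total you asked for, so the monitor errs on the side of alerting sooner.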

Check ProcessorQueueLength using PowerShell

Get-WmiObject -Class Win32_PerfRawData_PerfOS_System | Select-Object ProcessorQueueLength

Note that if we cannot query Win32_PerfRawData_PerfOS_System from WMI, the monitor will fail.  In this case, you may need to rebuild the WMI repository.  If the WMI query takes too long because of issues impacting computer performance, the script may time out, which also causes the monitor to fail.  In that case, you can adjust the TimeoutSeconds override parameter.  If you adjust the TimeoutSeconds parameter, I suggest also adjusting the Interval parameter so that the interval stays longer than the timeout and you avoid runtime conflicts.

You don’t like how this monitor works? Here’s a trick!

We can configure the monitor to use only the % Processor Time threshold.  If the CPU Queue Length override parameter is set to -1, the monitor changes to an unhealthy state as soon as the % Processor Time threshold alone is exceeded, because the Processor Queue Length value is never negative and so is always greater than -1 multiplied by the number of processors.
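
To see why -1 works, here is a minimal sketch of the decision in Python.  It assumes the monitor goes unhealthy only when both thresholds are exceeded, which is what the behavior described above implies; the real script is VBScript and its structure may differ.  The function name, the parameter names and the "GOOD" return value are assumptions for illustration; only "BAD" and the multiplication appear in the script excerpt.

```python
def evaluate(cpu_percent: float, queue_length: int, num_processors: int,
             cpu_threshold: float = 95, queue_threshold: int = 15) -> str:
    """Return "BAD" when both conditions hold, otherwise "GOOD" (assumed logic)."""
    cpu_exceeded = cpu_percent > cpu_threshold
    # The queue threshold is multiplied by the processor count, as in the excerpt.
    queue_exceeded = queue_length > queue_threshold * num_processors
    return "BAD" if cpu_exceeded and queue_exceeded else "GOOD"


# 8 processors with a queue length of 40.
print(evaluate(99, 40, 8))                      # default thresholds
print(evaluate(99, 40, 8, queue_threshold=-1))  # queue check disabled with -1
print(evaluate(50, 40, 8, queue_threshold=-1))  # CPU below its threshold
```

Under these assumptions the first call stays GOOD because 40 is not greater than 120, the second returns BAD because 40 is greater than -8, and the third stays GOOD because the CPU threshold still applies.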

 

And there you have it, the Total CPU Utilization Percentage monitor…exposed.