The business need:
It is a very common request to monitor a process on a given set of servers, and collect that data for reporting, or monitor it for a given threshold.
One thing you might notice when trying to monitor some performance counters is that not all perf counters in perfmon behave the way you might assume.
For instance, I want to monitor “how much CPU a process is using”. Perhaps we wish to monitor the sqlservr.exe process on our SQL servers?
This is easy, because Perfmon already has a Performance Object, Counter, and Instance for that. In perfmon, we would use:
Process > % Processor Time > sqlservr
So, we can quite easily create a performance threshold monitor and a performance collection rule using this. Let’s say we set the monitor to alert any time the sqlservr.exe process consumes more than 80% of the CPU, sustained for 5 minutes.
However, quite quickly we might notice erratic behavior from our monitor and rule. The monitor is generating TONS of alerts from almost all our SQL servers, and then quickly closing them… essentially flip-flopping. When we check the performance data we have collected, we see the process is using up to 800% CPU!!! So – thinking something is wrong with OpsMgr – we inspect a busy SQL server in perfmon directly… but observe the exact same behavior:
As you can see – this process is using almost 400% CPU. Why? How is this possible?
This is because the Process counters in Windows are not multi-CPU aware. When a server has 4 CPUs (like the one above), a process can use more than one CPU at a time, provided it is running multiple threads, so it can use up to 100% of each CPU or core (logical processor). The counter adds those per-processor percentages together, so a process on a 4-processor server can register up to 400%. If a process is really only consuming 20% of the total CPU, that shows up as 80% on a 4-core machine. Think about today’s hardware… many boxes have up to 16 cores these days, which would register as 320% processor utilization for something really only using 20% of the total CPU.
As you can see – this causes a BIG problem for monitoring processes. As an IT Pro – you need to know when a process is consuming more than (x) percent of the *total system resources*…. and every server will likely have a different number of processors.
In OpsMgr R2, a new XML-based function was created to help resolve this challenge. This is known as <ScaleBy>.
The <ScaleBy> function essentially gives you the ability to take the monitoring data collected by something (that is an integer), and divide by something else (integer).
I can input a fixed value here, in integer form, or I can input a variable. For the variable, I can actually pull data from discovered properties of monitoring classes. This is GREAT in this instance, because we already discover the number of Processors a Windows Computer has. We can use this discovered data, along with this <ScaleBy> function, to fix our monitors and collection rules that need a little massaging to the data we get from perfmon.
Here are the Windows Computer class properties:
Let’s walk through an example using the authoring console.
- Open the Authoring console.
- Create a new empty management pack.
- Go to Health Model, Monitors, right click and create a new monitor.
- Windows Performance > Static Thresholds > Consecutive Samples.
- Give your workflow an ID, Display Name, and choose a good target class which will contain your process. I will use Windows Server Operating System for example purposes, but you want to always try and choose a target class that will have your process counter in perfmon.
- Select System.Health.PerformanceState as the parent Monitor:
- Browse a SQL server for the process object you will need, or type in the relevant data. I will set my samples for the monitor to inspect every minute. For a monitor, this data is not collected and inserted into the database; the samples are kept on the agent and checked against the threshold there… so we can monitor the process with a MUCH higher sample rate than we would ever use for a performance collection rule.
- I set my monitor to change state when 5 consecutive samples have all been over 80% CPU:
- Click finish – then open the properties of the monitor you just created. Go to the configuration tab. Here are all the typical configurable items in a performance monitor workflow.
- However – we need to add one more – the <ScaleBy> function.
We have to do this in XML, as no UI was added for this capability. Click “Edit” on the configuration tab, which will pop up the XML of this configuration.
We are going to add a single line after <Frequency> which will be this line:
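A sketch of that line, assuming the monitor's target class is hosted by Windows Computer (Windows Server Operating System is) and that your management pack references the Windows library under the alias Windows. Both are assumptions, so adjust the $Target path to match your own pack:

```xml
<!-- Assumed: the target is hosted by Windows Computer, and "Windows" is the MP reference alias -->
<ScaleBy>$Target/Host/Property[Type="Windows!Microsoft.Windows.Computer"]/LogicalProcessors$</ScaleBy>
```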
What this does – is tell the workflow to take the numeric value received from perfmon, and then divide by the numeric value that is a property of the Windows Computer class for number of logical processors. Then take THIS calculated output and use that for collection or threshold evaluation.
Here is my finished XML snippet:
<CounterName>% Processor Time</CounterName>
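For reference, a complete configuration for this monitor would look roughly like the sketch below. It assumes the Consecutive Samples static threshold monitor type, a one-minute sample interval, the 80% threshold and 5 samples chosen above, and the alias Windows for the Windows library. The instance name sqlservr is SQL Server's process as perfmon lists it (no .exe). Element names other than CounterName, Frequency, and ScaleBy are recalled from that monitor type rather than taken from this walkthrough, so compare them with what the console shows you:

```xml
<!-- Sketch only: values and the "Windows" alias are assumptions, verify against your own monitor -->
<ComputerName>$Target/Host/Property[Type="Windows!Microsoft.Windows.Computer"]/NetworkName$</ComputerName>
<CounterName>% Processor Time</CounterName>
<ObjectName>Process</ObjectName>
<InstanceName>sqlservr</InstanceName>
<AllInstances>false</AllInstances>
<!-- Sample every 60 seconds -->
<Frequency>60</Frequency>
<!-- Divide each sample by the computer's logical processor count -->
<ScaleBy>$Target/Host/Property[Type="Windows!Microsoft.Windows.Computer"]/LogicalProcessors$</ScaleBy>
<!-- Change state after 5 consecutive scaled samples above 80 -->
<Threshold>80</Threshold>
<Direction>greater</Direction>
<NumSamples>5</NumSamples>
```

With this in place, a raw reading of 320 on a 16-core server is evaluated as 20, which is well under the threshold.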
Now – the authoring console was not updated to fully understand this new function, so you might see an error for this. Simply hit ignore.
Your new monitor configuration now looks like this:
You can do the exact same operation on a performance collection rule as well to “normalize” this counter into something that makes more sense for reporting.
Some other uses of this might be for situations where a counter is in bytes…. and you want it reported in megabytes. You could hard code a <ScaleBy> of 1000000 (one million). That way, if you wanted to report on how many megabytes a process was consuming over time… instead of representing this as 349,000,000 on a chart (bytes) you can represent this as a simple 349 megabytes. That XML would simply be:
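Following the description above, with the divisor hard-coded as one million:

```xml
<!-- Fixed divisor: bytes to (decimal) megabytes -->
<ScaleBy>1000000</ScaleBy>
```

A sample of 349,000,000 bytes would then be recorded as 349.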
Ok… I hope this made some sense…. this is a valuable method to normalize some perfmon data that might not be in what I call “human format”. Keep in mind – you can ONLY use this XML functionality on an R2 management group, and it will only be understood by an R2 agent.
You can quickly go back to your previously written process monitors, and add this single line of XML really easily, using your XML editor of choice.
One last thing I want to point out….. some of the previously delivered MPs that Microsoft shipped might be impacted by this issue. For instance, in the current ADMP version 6.0.7065.0 there is a monitor “AD_CPU_Overload.Monitor” (AD Processor Overload (lsass) Monitor) which does not take into account the number of logical processors. This is often one of the MOST noisy monitors in my customer environments, especially on a busy domain controller. This is simply because MOST DCs have more than one CPU, which keeps this monitor from working as intended. The issue is that they could not add this <ScaleBy> functionality to this MP, because that would make the ADMP R2-only… which we don't want to do.
You have two workarounds for SP1 management groups: monitor processes using a script that queries WMI for the number of CPUs and handles the math itself (ugly), OR create groups of Windows Computers based on their number of logical processors (easy) and then override these monitor thresholds with values appropriate to each processor count.
For R2 customers – I recommend disabling this monitor in the ADMP – and replacing it with a custom one that utilizes the <ScaleBy> functionality.