First – let me warn you. The way SCOM monitors Processor time is *incredibly* complicated. If you don’t like it – there is *NOTHING* wrong with nuking this from orbit (disabling it via override) and simply creating your own very simple consecutive-samples (or average) monitor. That said, while complicated and difficult to understand, it is very powerful and useful, and limits “noise”.
Ok, all warnings aside – let’s figure out how this works.
In the Windows Server 2016 OS Management Pack, there is a built-in monitor which evaluates Processor load. This monitor (Total CPU Utilization Percentage, or Microsoft.Windows.Server.10.0.OperatingSystem.TotalCPUUtilization) targets the “Windows Server 2016 Operating System” class.
It runs every 15 minutes, and evaluates after 3 samples. The samples are not consecutive samples as the product knowledge states – they are AVERAGE samples.
Like previous versions of the CPU monitor, this is often misunderstood. This monitor does not use a native perfmon module, it runs a PowerShell script. The script evaluates TWO DIFFERENT perfmon counters:
Processor Information / % Processor Time / _Total (default threshold 95)
System / Processor Queue Length (default threshold 15)
BOTH of the above thresholds must be met before we will create a monitor state change/alert. This means that even if your server is stuck at 100% CPU utilization, it will not generate an alert most of the time.
The default threshold of “15” is multiplied by the number of logical CPUs for the server. So on a typical VM with 4 virtual CPUs, this means the value of SYSTEM\Processor Queue Length must be greater than (15*4) = 60. Not only that, but the value must be above 60 for the average of any three consecutive samples. This is incredibly high.
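To make the AND condition concrete, here is a minimal PowerShell sketch of the logic. To be clear – this is NOT the actual management pack script, just an illustration using the counters and default thresholds above (the real monitor samples every 15 minutes; the interval here is shortened so you can run it interactively):

```powershell
# Minimal sketch of the evaluation logic - NOT the shipped MP script.
# Default thresholds from the monitor: 95 for % Processor Time, 15 for queue length.
$cpuThreshold   = 95
$queueThreshold = 15

# The queue threshold is scaled by the number of logical processors.
$logicalCPUs = (Get-CimInstance Win32_ComputerSystem).NumberOfLogicalProcessors
$effectiveQueueThreshold = $queueThreshold * $logicalCPUs   # e.g. 15 * 4 = 60

# Average three samples of each counter (interval shortened for demonstration).
$cpu   = (Get-Counter '\Processor Information(_Total)\% Processor Time' -SampleInterval 1 -MaxSamples 3 |
          ForEach-Object { $_.CounterSamples.CookedValue } | Measure-Object -Average).Average
$queue = (Get-Counter '\System\Processor Queue Length' -SampleInterval 1 -MaxSamples 3 |
          ForEach-Object { $_.CounterSamples.CookedValue } | Measure-Object -Average).Average

# BOTH averaged conditions must be true before the monitor changes state.
# Note how an overridden queue threshold of 0 would make the second condition trivial.
if ($cpu -gt $cpuThreshold -and $queue -gt $effectiveQueueThreshold) {
    Write-Output 'Unhealthy: high CPU AND high processor queue'
}
```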
What this means is that it is VERY unlikely this monitor will ever trigger, unless your system is absolutely HAMMERED. If you like this, great! If you don’t, you have two options.
1) Write your own monitor, and make it a very simple consecutive- or average-samples threshold performance monitor.
2) Override the default monitor – but set the “CPU Queue Length” threshold to “zero” as in the picture below:
This will result in the equation ignoring the CPU queue length requirement, making the monitor consider “% Processor Time” only. If you find this too noisy, you can keep the CPU queue length check, but use a lower value than the default of 15. Another thing to keep in mind: this is a PowerShell script based monitor, so if you want to run it VERY frequently (the default is every 15 minutes), consider replacing it with a less impactful native perfmon based monitor.
The default monitor has a recovery that will output the top consuming processes to the health explorer state change context:
Note – the numbers are not exactly correct – my “ProcessorHog” process was consuming 100% of the CPU…. but this server has 32 cores, so it looks like you need to multiply by the number of cores to understand the ACTUAL utilization consumed by a process. This is a typical Windows problem in how Windows looks at processes, not a SCOM issue.
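The root cause is that the \Process(*)\% Processor Time counter is measured against a single logical processor, so its theoretical maximum is 100 × the number of logical CPUs. Here’s a hedged sketch – my illustration, not the shipped recovery script – that pulls the top consumers and shows both the raw counter value and a normalized percentage of total capacity:

```powershell
# Rough approximation of what a "top consumers" recovery gathers - NOT the actual
# recovery script. \Process(*)\% Processor Time is based on a SINGLE logical
# processor, so one pegged core on a 32-core server reports ~100 (ceiling = 100 * cores).
$cores = (Get-CimInstance Win32_ComputerSystem).NumberOfLogicalProcessors

(Get-Counter '\Process(*)\% Processor Time').CounterSamples |
    Where-Object { $_.InstanceName -notin '_total','idle' } |
    Sort-Object CookedValue -Descending |
    Select-Object -First 5 @{n='Process';e={$_.InstanceName}},
                           @{n='RawPct';e={[math]::Round($_.CookedValue,1)}},
                           @{n='PctOfTotal';e={[math]::Round($_.CookedValue / $cores,1)}}
```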
Ok – so that covers the basic monitoring of the CPU, from an _Total perspective.
What about monitoring individual *logical processors*, like virtual CPUs or actual cores on physical servers? Can we do that?
Yes, yes we can.
First – let me start by saying – I DON’T recommend you do this. In fact, I recommend AGAINST this. This type of monitoring is INCREDIBLY detailed, and creates a huge instance space in SCOM that will only serve to slow down your environment and console, and increase config and monitoring load. It should only be leveraged where you have a very specific need to monitor individual logical processing cores for very specific reasons, which should be rare.
There is a VERY specific scenario where this type of monitoring might be useful…. that is when an individual single-threaded process “runs away” on CPU 0, core 0. This has been seen on Skype servers and will impact server performance. So if you MUST monitor for this condition, you can consider discovering these individual CPUs. I still don’t recommend it, and certainly not across the board.
Ok, all warnings aside – let’s figure out how this works.
There is an optional discovery (disabled by default) in the Windows Server 2016 Operating System (Discovery) management pack to discover individual CPUs: “Discover Windows CPUs” (Microsoft.Windows.Server.10.0.CPU.Discovery). This discovery runs once a day, and calls the Microsoft.Windows.Server.10.0.CPUDiscovery.ModuleType datasource. This datasource runs a PowerShell script that discovers two object types:
1. Microsoft.Windows.Server.10.0.Processor (Windows Server 2016 Processor)
2. Microsoft.Windows.Server.10.0.LogicalProcessor (Windows Server 2016 Logical Processor)
If you enable this discovery – you will discover both types:
Let’s start with “Windows Server 2016 Processor”. This class represents actual physical or virtual processors in sockets, as they are exposed to the OS by the physical hardware or the virtualization layer. See example below:
By contrast – the “Windows Server 2016 Logical Processor” class shows instances of physical or virtual “Logical Processors”: virtual processors on a VM, or logical CPUs exposed at the physical layer – either actual cores or hyper-threaded cores:
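If you want to see the distinction on a given server outside of SCOM, the counts are easy to pull from WMI/CIM (this just illustrates sockets vs. logical processors – the actual discovery script’s queries may differ):

```powershell
# Sockets (what "Windows Server 2016 Processor" discovers) vs. logical
# processors (what "Windows Server 2016 Logical Processor" discovers).
$cs = Get-CimInstance Win32_ComputerSystem
"Sockets:            $($cs.NumberOfProcessors)"
"Logical processors: $($cs.NumberOfLogicalProcessors)"

# Each Win32_Processor instance is one socket; its properties show how many
# logical CPUs (cores / hyper-threads / vCPUs) that socket exposes to the OS.
Get-CimInstance Win32_Processor |
    Select-Object DeviceID, NumberOfCores, NumberOfLogicalProcessors
```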
The former is how all our previous monitoring worked for individual CPU monitoring, which is pretty much worthless. If we need to monitor cores, we generally don’t care about “sockets”.
The latter is new for the Windows Server 2016 management pack, and actually discovers individual logical CPUs as seen by the OS.
Now – let’s look at the monitoring provided out of the box.
IF you enable the individual CPU discovery, there are three monitors targeting the “Windows Server 2016 Processor” class, one of which is enabled out of the box. This is “CPU percentage Utilization”. It runs every three minutes, evaluates 5 samples, with a threshold of “10”. It is also a PowerShell script based monitor.
Comments on above:
1. Monitoring for individual “socket” utilization seems really silly to me, and not useful at all. You probably should not use this.
2. The default threshold of “10” is WAY too low…. I have no idea why we would use that.
3. The monitor uses the “Processor” perfmon object instead of the newer “Processor Information” object. The reason this isn’t a simple change is that the “Performance Monitor Instance Name” class property doesn’t match the newer counter’s instance values – you can see the mismatch in the comparison below.
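You can see the instance-name mismatch for yourself with Get-Counter:

```powershell
# The two perfmon objects name their instances differently, which is why the
# "Performance Monitor Instance Name" property can't simply point at the new object.

# "Processor" instances look like: 0, 1, 2, ... _Total
(Get-Counter -ListSet 'Processor').PathsWithInstances | Select-Object -First 3

# "Processor Information" instances look like: 0,0  0,1  0,2 ... ("group,number"), plus _Total
(Get-Counter -ListSet 'Processor Information').PathsWithInstances | Select-Object -First 3
```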
Additionally, there are three rules to collect perfmon data, one of which is enabled. You should disable this collection rule as well, IF you just HAVE to discover individual CPUs.
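If you want to inventory what is actually targeting this class in your own management group before overriding anything, the OperationsManager PowerShell module (available on a console or management server install) can list it – a quick sketch, assuming the class name above:

```powershell
# List the monitors and rules targeting the Processor class, with enabled state,
# so you can see what needs an override. Requires the OperationsManager module.
Import-Module OperationsManager

$class = Get-SCOMClass -Name 'Microsoft.Windows.Server.10.0.Processor'
Get-SCOMMonitor -Target $class | Select-Object DisplayName, Enabled
Get-SCOMRule    -Target $class | Select-Object DisplayName, Enabled
```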
Ok, now let’s move on to the Windows Server 2016 Logical Processor.
This is more useful, as it will monitor individual COREs (or virtual CPUs) to look for runaway single-threaded processes.
There are three monitors out of the box targeting this class and NONE of these are enabled by default.
The one for CPU utilization, Microsoft.Windows.Server.10.0.LogicalProcessor.CPUUtilization, is a native perfmon monitor for consecutive samples. I like this WAY better than complicated and heavy-handed script based monitors.
HOWEVER – this will potentially be VERY noisy – as a server will have multiple logical CPUs, and each of these will alarm whenever the server is heavily utilized overall (the same condition that trips the _Total monitor). This means duplication of alerts when a server is heavily utilized. That said – if only a SINGLE logical processor is spiked, but the overall CPU utilization is low, this will let you know that is happening.
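If you want to eyeball this condition before enabling the monitor, you can sample the counter per instance. I’m assuming here the “Processor Information” object is what the native monitor reads – check the monitor’s configuration in your MP version:

```powershell
# Show each logical processor's busy percentage, hottest first. A single pegged
# instance alongside a low _Total is the runaway single-threaded process scenario.
(Get-Counter '\Processor Information(*)\% Processor Time').CounterSamples |
    Sort-Object CookedValue -Descending |
    Select-Object InstanceName, @{n='PctBusy';e={[math]::Round($_.CookedValue,1)}}
```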
In summary:

1. CPU monitoring at the OS level is complex, script based, and uses multiple perf counters before it triggers. Be aware, and be proactive in managing this.
2. The individual CPUs can be discovered, but I DON’T recommend it as a general rule.
3. The default rules and monitors enabled for individual CPU monitoring focus on SOCKETS, aren’t very useful, and should be disabled.
4. The new Logical Processor class in the Server 2016 MP is more useful, as it monitors cores/logical CPUs, but all of its monitoring is disabled by default.