Occasionally, you might see a single MonitoringHost.exe process on an agent consuming 100% CPU (or 50%, or 25%, depending on how many cores you have). The process stays at this level and does not recover on its own. If you restart the OpsMgr HealthService, the problem goes away, and might not return for days or even weeks.
This particular symptom might be due to an XML spinlock issue. This is a core Windows OS issue, and there is a hotfix available, which I have on my HOTFIX LINK
The KB is 968967:
“The CPU usage of an application or a service that uses MSXML 6.0 to handle XML requests reaches 100% in Windows Server 2008, Windows Vista, Windows XP Service Pack 3, or other systems that have MSXML 6.0 installed”
I have seen that most customers are affected by this issue from time to time. I have seen it very commonly in my lab, on Server 2008 domain controllers and on my Server 2008 Hyper-V hosts.
A note on patching Server 2008:
When you go to download this hotfix for a Server 2008 machine, it is not at all obvious which hotfix to get. Here is the list of all available fixes:
For patching Server 2008, you need to download the “Windows Vista” hotfix, in either x86 or x64, depending on your OS architecture:
Monitoring for this condition:
You can easily write a threshold monitor targeting the agent or HealthService to track the Process \ % Processor Time counter for the MonitoringHost instances, and set it to alert when multiple consecutive samples are above a defined threshold.
Here is an example of creating this monitor:
Authoring Pane > Monitors > New Unit Monitor > Windows Performance Counters > Static Thresholds > Single Threshold > Consecutive Samples over Threshold.
Give it a custom name that follows your documented custom Monitor naming standard, target “Health Service”, and put this under Performance rollup.
Hit the “Select” button (in SP1, select “Browse”). In the perf counter picker, choose a server with an installed agent, choose the Object “Process”, the Counter “% Processor Time”, and the Instance “MonitoringHost”, and click OK.
Since there are multiple MonitoringHost processes, we will add a wildcard to the Instance name in the monitor. This will monitor ANY MonitoringHost process for high CPU. Set the Interval to every 1 minute.
The number of consecutive samples and the threshold are up to you. For me, if I detect a single MonitoringHost process using more than 50% CPU across 5 consecutive samples (5 minutes), then I consider that bad.
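To make the "consecutive samples over threshold" logic concrete, here is a minimal Python sketch of the idea. It is not how OpsMgr implements the monitor internally. The class name, the instance name "MonitoringHost#1", and the sample readings are all made up for illustration; only the 50% threshold and the 5 samples come from the settings above.

```python
from collections import defaultdict


class ConsecutiveThresholdMonitor:
    """Hypothetical helper: flags an instance once it has exceeded the
    threshold for N samples in a row. Any sample at or below the
    threshold resets that instance's streak to zero."""

    def __init__(self, threshold=50.0, samples_required=5):
        self.threshold = threshold
        self.samples_required = samples_required
        self._streaks = defaultdict(int)  # instance name -> current streak

    def observe(self, instance, cpu_percent):
        """Record one sample; return True if the instance is now unhealthy."""
        if cpu_percent > self.threshold:
            self._streaks[instance] += 1
        else:
            self._streaks[instance] = 0
        return self._streaks[instance] >= self.samples_required


# One reading per minute for a single (made-up) MonitoringHost instance.
monitor = ConsecutiveThresholdMonitor(threshold=50.0, samples_required=5)
readings = [97.0, 99.0, 30.0, 98.0, 99.0, 100.0, 99.0, 98.0]
states = [monitor.observe("MonitoringHost#1", r) for r in readings]
print(states)
```

The dip to 30% in the third sample resets the streak, so only the final reading, the fifth in a row above 50%, reports an unhealthy state. Tracking streaks per instance mirrors the wildcard on the Instance name: each MonitoringHost process is evaluated on its own.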
At this point, you can simply alert on the condition, or even try adding a recovery script that will bounce the health service. Generally, bouncing the HealthService when one of the processes is using all the CPU is not 100% reliable, especially from a “NET STOP & NET START” type command. I have found it more reliable to just kill the MonitoringHost process in this condition and allow it to respawn, but your mileage may vary.
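As a rough illustration of the "kill the process" option, the sketch below builds a Windows taskkill command for one specific PID. It is written in Python purely for illustration; an actual OpsMgr recovery would normally be a script or command defined in the console. The function names and the dry_run flag are hypothetical, and how you identify the PID of the runaway MonitoringHost.exe is left out and assumed to be known.

```python
import subprocess


def build_kill_command(pid):
    """Return the taskkill command line that force-terminates one PID."""
    return ["taskkill", "/PID", str(pid), "/F"]


def kill_process(pid, dry_run=True):
    """Kill the given PID, or just return the command when dry_run is True.

    Targeting the PID, rather than the image name, avoids killing the
    healthy MonitoringHost.exe processes alongside the runaway one.
    """
    command = build_kill_command(pid)
    if not dry_run:
        # Raises CalledProcessError if taskkill reports a failure.
        subprocess.run(command, check=True)
    return command


# Dry run with a made-up PID: nothing is terminated.
command = kill_process(4321)
print(" ".join(command))
```

Leaving dry_run at its default lets you confirm the command before letting it run unattended on an agent.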