Do you randomly see a MonitoringHost.exe process consuming lots of CPU?


Randomly, you might see a single MonitoringHost.exe process on an agent consuming 100% CPU (or 50%, or 25%, depending on how many cores you have). The process stays at this level and does not recover. If you restart the OpsMgr HealthService, the problem goes away, and might not return for days or even weeks.
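The 100%, 50%, and 25% figures follow from one busy thread pinning one core. Here is a minimal sketch of that arithmetic in Python, assuming the spinning process is effectively single-threaded and that you are reading a whole-machine view such as Task Manager (the function name is made up for illustration):

```python
def pinned_process_cpu_percent(logical_cores: int) -> float:
    """Whole-machine CPU % seen when one thread saturates one logical core."""
    if logical_cores < 1:
        raise ValueError("logical_cores must be at least 1")
    return 100.0 / logical_cores


# One saturated thread on 1, 2, and 4 cores.
for cores in (1, 2, 4):
    print(f"{cores} core(s): {pinned_process_cpu_percent(cores):.0f}%")
```

So a flat line at 25% on a four-core agent can be the same stuck process as a flat line at 100% on a single-core one.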

 

This particular symptom might be due to an MSXML spinlock issue. This is a core Windows OS issue, and there is a hotfix available, which I have on my HOTFIX LINK

 

The KB is 968967:

“The CPU usage of an application or a service that uses MSXML 6.0 to handle XML requests reaches 100% in Windows Server 2008, Windows Vista, Windows XP Service Pack 3, or other systems that have MSXML 6.0 installed”

I have seen that most customers are affected by this issue from time to time. I have seen it very commonly in my lab, on Server 2008 domain controllers and on my Server 2008 Hyper-V hosts.

 

 

A note on patching Server 2008:

 

When you go to download this hotfix for a Server 2008 machine, it is not obvious which hotfix to get. Here is the list of all available fixes:

 

image

 

To patch Server 2008, you need to download the “Windows Vista” hotfix, in either x86 or x64 depending on your OS architecture:

 

image

 

 

 

Monitoring for this condition:

You can easily write a threshold monitor, targeting Agent or Health Service, that tracks the Process \ % Processor Time counter for MonitoringHost, and set it to alert when it has multiple consecutive samples above a defined threshold.

 

Here is an example of creating this monitor:

Authoring Pane > Monitors > New Unit Monitor > Windows Performance Counters > Static Thresholds > Single Threshold > Consecutive Samples over Threshold.

 

image

 

Give it a custom name that follows your documented custom Monitor naming standard, target “Health Service”, and put this under Performance rollup.

 

image

 

Click the “Select” button (in SP1, click “Browse”). In the perf counter picker, choose a server with an installed agent, choose the Object “Process”, the Counter “% Processor Time”, and the Instance “MonitoringHost”, and click OK.

 

image

 

Since there are multiple MonitoringHost processes, we will add a wildcard to the Instance name in the monitor. This will monitor ANY MonitoringHost process for high CPU. Set the Interval to every 1 minute.
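To see why the wildcard matters: Windows names same-named processes in the Process performance object `MonitoringHost`, `MonitoringHost#1`, `MonitoringHost#2`, and so on. Below is a rough Python sketch of the matching. It uses `fnmatch` only as a stand-in for the monitor's own wildcard handling, and both the `MonitoringHost*` pattern and the instance list are assumptions made up for illustration:

```python
from fnmatch import fnmatchcase

# Sample instance names as they might appear under the Process object.
instances = [
    "HealthService",
    "MonitoringHost",
    "MonitoringHost#1",
    "MonitoringHost#2",
    "_Total",
]


def matching_instances(pattern, names):
    """Return the instance names that match a shell-style wildcard pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]


# Exact name: only the first MonitoringHost process is covered.
print(matching_instances("MonitoringHost", instances))

# Wildcard: every MonitoringHost process is covered.
print(matching_instances("MonitoringHost*", instances))
```

Without the wildcard, a spinlock in the second or third MonitoringHost process would go unnoticed.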

 

image

 

The number of consecutive samples and the threshold are up to you. For me, if I detect a single MonitoringHost process using more than 50% CPU across all 5 consecutive samples (5 minutes), then I consider that bad:

 

image

image

 

image
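The “more than 50% for 5 consecutive samples” rule can be sketched in a few lines of Python. This illustrates the idea only and is not a model of how the OpsMgr module is implemented. In particular, tracking a separate streak per instance is my assumption here, and the class name and sample readings are made up:

```python
from collections import defaultdict

THRESHOLD = 50.0   # percent CPU, the value chosen in the text
CONSECUTIVE = 5    # samples in a row, one per minute


class ConsecutiveSamplesMonitor:
    """Flags an instance once enough samples in a row exceed the threshold."""

    def __init__(self, threshold=THRESHOLD, required=CONSECUTIVE):
        self.threshold = threshold
        self.required = required
        self._streaks = defaultdict(int)  # instance name -> current streak

    def add_sample(self, instance, cpu_percent):
        """Record one sample; return True while the instance is unhealthy."""
        if cpu_percent > self.threshold:
            self._streaks[instance] += 1
        else:
            self._streaks[instance] = 0  # any good sample resets the streak
        return self._streaks[instance] >= self.required


monitor = ConsecutiveSamplesMonitor()
readings = [98, 99, 100, 97, 99]
states = [monitor.add_sample("MonitoringHost#1", r) for r in readings]
print(states)  # [False, False, False, False, True]
```

A single sample under the threshold resets the count, which is what keeps short CPU bursts from raising an alert.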

 

At this point, you can simply alert on the condition, or even try adding a recovery script that will bounce the HealthService. Bouncing the HealthService when one of its processes is using all the CPU is not always reliable, especially from a “NET STOP & NET START” type of command. I have found it more reliable to just kill the MonitoringHost process in this condition and allow it to respawn, but your mileage may vary.

http://blogs.technet.com/kevinholman/archive/2008/03/26/using-a-recovery-in-opsmgr-basic.aspx
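As a rough sketch of the “kill the process and let it respawn” idea, the Python below builds a `taskkill` command for one process ID. It only illustrates the logic and is not a ready-made OpsMgr recovery: the function names and the PID are hypothetical, a real recovery would have to find the PID of the offending MonitoringHost instance first, and the example defaults to a dry run so nothing is actually killed:

```python
import subprocess


def build_kill_command(pid):
    """Return the taskkill command line that force-terminates one PID."""
    if pid <= 0:
        raise ValueError("pid must be a positive integer")
    return ["taskkill", "/PID", str(pid), "/F"]


def kill_process(pid, dry_run=True):
    """Build the command and, unless dry_run is set, execute it (Windows only)."""
    cmd = build_kill_command(pid)
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd


# 4321 is a made-up PID; with dry_run left on, this only shows the command.
print(kill_process(4321))
```

Targeting a single PID, rather than the image name, leaves the healthy MonitoringHost processes alone.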

Comments (35)

  1. Kevin Holman says:

    Interesting point on that one. In that case it would be better to make two or three rules, one for each possible instance of MH.

    From most of what I have worked with – this might cause the monitor to trigger – but flip flop when another instance is not above the threshold.  Good point.

  2. Kevin Holman says:

    Mike – don’t get MAD.  🙂

    Using _total is fine as well.

    One reason you might not be getting alerts – could be the counter you are using.  I discussed this in the following article:  blogs.technet.com/…/how-to-monitor-a-process-on-a-multi-cpu-agent-using-scaleby.aspx

    You can always just use the process monitoring template if needed.

  3. Anonymous says:

    We are experiencing the issue as described in KB968967. The hotfix has been applied to a Windows 2008 server successfully (we have seen no issue in 2 days). I have a Windows 2003 R2 Enterprise x64 Edition Service Pack 2 server running SQL that has the monitoringhost.exe issue. Its MSXML 6 is SP2 (version 6.20.2003.0), noted with (KB973686). If I try to apply the hotfix, it states that it is an older version. Does it need to be applied anyway? Doesn’t the SP2 version have the spinlock fix? In the meantime, I am having to rename the health service state daily. Our SQL admin would like to set a priority on the monitoringhost.exe process. Could that be another workaround?

  4. Kevin Holman says:

    As you can see from the straight line – this means no monitoring data was available.  The process that was in a spinlock was also the process that was collecting data… which appears to have died.

  5. Kevin Holman says:

    @Sunil – If your monitoring solution can’t handle the load of what you wish, then you aren’t getting the monitoring you desire. 🙂 I have not run across this before, but I’d anticipate the issue is with several hundred queues discovered as instances, with workflows attached. This can create a great load on the agent.

    First, I’d make sure you have set your healthservice limits very high, to keep the agents from restarting: http://blogs.technet.com/b/kevinholman/archive/2013/02/21/healthservice-restarts-still-a-challenge-in-opsmgr-2012.aspx

    Second, I’d disable ALL of the 16 rules, and the 13 monitors, that target the "Windows 2008 Print Services Printer Queue" class. Then I’d see if the problem goes away. Next, I’d enable ONLY the performance collection rules that you deem critical, one by one, by deleting your custom overrides. After they are all enabled, bounce the healthservice and make sure it initializes and works fine. Next, do the same for the monitors dealing with performance that you deem critical.

    There are several event based rules and monitors that I consider poorly written, as they utilize regular expressions unnecessarily, and they target a multi-instance object (printer queue) unnecessarily. Technically these should all cook down, but it is possible that they don’t cook down correctly. If you are seeing a resource utilization issue, and IF you determine these are the root cause, I would personally disable all the event based monitors, make simple event based alert rules, and target the print server role for these alerts.

  6. Kevin Holman says:

    Why on EARTH are you running that old conversion MP?  That thing is BAD news and no amount of tuning will make up for the enhancements in the SP1 native MP.

  7. Kevin Holman says:

    @Sunil – that condition is not normal. That hotfix is for a VERY specific issue, the spinlock, which isn’t terribly common. For Server 2008 R2 SP1, I’d say you have some other issue at hand, likely a poorly written MP that is doing something bad. I’d put certain classes into maintenance mode and see if you can identify it. For instance, does it go to 100% immediately, or is it random? If you put the print server role into MM, does this not happen? You can use tools like ETL tracing when this occurs to try to determine root cause, or open a case with Microsoft.

  8. Anonymous says:

    Thanks Kevin. Yes, we are seeing this and currently have a PSS case open. Only problem is KB968967 requires SP2, we are only on SP1 and can’t upgrade at the moment. So, threshold monitor it is. But, I’m having some problems choosing the Object and Counter in trying to target the Monitoringhost.exe process. If you could provide a little more specific guidance for creating this monitor, it would help us out greatly.

    Thanks,

    Tom

  9. Kevin Holman says:

    There aren’t any common issues where a MH process spikes to 100% and stays there, unless it is the MSXML spinlock.

    The only other situations I am aware of are when you have some bad MPs or really sick machines.

    When you say "high cpu utilization" you need to be more specific.  Which process (or processes) is spiking, is it going to 100%, how long does it stay there, what’s in the OpsMgr event logs, etc.  What is the OS of the agent, what MPs are loaded against it, what is the machine’s role in life, etc.

    A dump of the process can be analyzed when it is in this condition to determine what’s eating the CPU.

  10. Kevin Holman says:

    Why/How is KB968967 for SP2 only?

    There appears to be a version for Server 2008 RTM and/or SP2.

  11. Kevin Holman says:

    @Sunil – What print server MP’s are you using?

  12. tom says:

    Apparently the SP1 version wasn’t available for download, even though it said it was. Our pss case engineer was able to get this resolved on the Microsoft side. Thanks

  13. tom says:

    Hey Kevin, the monitor is working correctly, meaning it’s generating the alerts, but I’m not receiving the email alerts via our subscription. In my subscription I have the following parameters selected:

    Check mark on "Raised by any instance of a specific class" to this I have added "Health Service".

    Check mark on "created by specific rules or monitors (e.g. sources)" to this I have added the newly created Rule "MonitoringHost.exe process CPU monitor"

    Any idea why this subscription is not working? Have I missed something in the configurations?

    Thanks,

    Tom

  14. AlanZ says:

    Hi Kevin, I think the System.Performance.ConsecutiveSamplesCondition does not work well with multiple instances. From my experience with SCOM 2007 R2, ALL the instances have to be over the threshold.

  15. AlanZ says:

    Actually I have a script that does the monitoring of CPU for multiple instances and that I use when I know multiple instances exist. It allows me to do the monitoring with only 1 monitor and without discovery.

  16. Tim Sneath says:

    AlanZ –

    would you mind sharing how you’re monitoring for CPU utilization for multiple instances of MonitoringHost.exe?    

    Is it a vbscript? powershell? We’re running into the issue described and want to get monitoring in place for this behavior until we can apply the hotfix.  

    Thanks

    -TimS

  17. Tim Sneath says:

    looks like this will point me in the right direction for now

    http://www.leeholmes.com/blog/AccessingPerformanceCountersInPowerShell.aspx

    Thanks!

  18. JohnS says:

    Hi, I see that loads of people are having the same issue with high CPU utilization. I have the same problem, and when I check the solutions, they all refer to updating MSXML6 to SP1. The problem is that I already have SP2 loaded and am still getting the same issue. The workaround I have put in place for now is to set process affinity, so that the server at least responds in the interim until I find a better solution. Has anyone had a similar issue?

  19. NickG says:

    Hello Kevin,

    I am new to OpsMgr, so I don’t understand the suggestion to create 2 or 3 rules, one per MH instance. Do I need to reconfigure the original monitor I built? Will this monitor run against all servers in my environment? Last question: I have Windows 2003 servers with MSXML 6 SP2. Does KB 968967 apply in this case?

    Thanks,

    Nick

  20. T0by says:

    We had this issue even running the MSXML6 hotfix and VBScript 5.7 while still running the converted Exchange MP.  (CPU Spikes were 8-20 seconds long, but causing CPU resource contention on the monitored systems and caused service impact.)  After much debugging, I got the issue down to the Execute: Test-Mailflow* cmdlets.  As soon as I turned both internal and remote execution off, the CPU spikes stopped completely.

  21. Mike Kulikov says:

    Hello all – I tried to create a monitor for this condition, but it doesn't work for me! I'm using SCOM 2007 R2 with CU2. I can see 90-100% CPU load in Task Manager on some Windows Server 2003 R2 SP1 and SP2 virtual machines (the host is Virtual Server 2005 R2 SP2), but I can't see any alerts (new or closed) for monitoringhost.exe in the SCOM console! It's very strange to me, because when I create a similar monitor for the performance counter instance _Total, it works and shows alerts. So, my question is: where did I go wrong? It makes me MAD that I cannot see alerts…

  22. Mike Kulikov says:

    Thanks for the quick answer, Kevin. I already read your article and used "scaleby" and the process monitoring template, without any success. No alerts, and a healthy process state view… I'll try again, of course.

  23. Mike Kulikov says:

    This is the performance view for the MonitoringHost instances (object: Process, counter: % Processor Time) from one of our servers (process monitoring template is on): postimage.org/…/1gzdfjy84. Here you can see there is no graph for the critical period. When I connected to the server at 8:00 a.m., MonitoringHost CPU stayed between 90% and 100% until I killed the process, and only after that does the graph appear (you can see it in the screenshot). Why can't SCOM detect it when it is critical?

  24. susaa says:

    Kevin, Is this hotfix applicable for Windows 2003 server also? I could see this is applicable for Windows 2008 and Vista; also I could see, it is applicable for "other systems that have MSXML 6.0 installed". So can I install KB 968967 on Windows 2003 server?

  25. Naveed says:

    Hi Kevin,

    Can you please help me in this regard?

    I am seeing cscript.exe running on my Citrix server machine and causing high CPU.

    windows server 2008 R2 Sp1

    XenApp 6.5

  26. Sunil says:

    Hi Kevin, As best I can determine, the hotfix in KB968967 does not apply to Windows 2008 R2 SP1, which is the OS we are running on our print servers.

    The versions of msxml6.dll and msxml6r.dll that the hotfix installs:
      msxml6.dll V6.20.5001.0
      msxml6r.dll V6.0.3883.0

    The existing versions of these two files on the server:
      msxml6.dll V6.30.7601.17988
      msxml6r.dll V6.30.7600.16385

    Do we have any solution in this case? Monitoringhost.exe spikes to 100% and stays there until I restart the healthservice. The machine's role is print server.

  27. Sunil says:

    Hi Kevin, As you suggested, I put the print server role in MM, and it is not happening; server CPU performance is stable. But I cannot keep this role in MM, as I need to monitor it. What could be the next course of action to fix this issue?

  28. Sunil says:

    Hi Kevin, it's the Windows Server Print Server 2008 Management Pack.

  29. Sunil says:

    Yes Kevin, the version is 6-0-7004-0. We have a large print server with several hundred print queues, and I have kept the four performance collection rules below enabled:

      Performance Measuring: Print QueueJobs
      Performance Measuring: Print QueueJobs Spooling
      Performance Measuring: Print QueueTotal Jobs Printed
      Performance Measuring: Print QueueTotal Pages Printed

    This may be the cause of the high resource utilization. We are using SCOM specifically to monitor the performance of the print server, so we cannot disable these rules. Do we have any solution or workaround in this case?

  30. Sunil says:

    I had kept only the 4 rules and 12 monitors mentioned below in an enabled state, and I am still facing this issue. Never mind, I will follow the steps as you suggested and share the result with you. Thanks a lot for your help :)

    Rules:
      Performance Measuring: Print QueueJobs
      Performance Measuring: Print QueueJobs Spooling
      Performance Measuring: Print QueueTotal Jobs Printed
      Performance Measuring: Print QueueTotal Pages Printed

    Monitors:
      LPD Service: Service Status Monitor
      Print Spooler: Check Windows resources
      Print Spooler: Check Windows resources (Windows Server 2008 R2)
      Print Spooler: The print spooler failed to complete a task
      Print Spooler: The print spooler failed to complete a task (Windows Server 2008 R2)
      Print Spooler: Restart the Print Spooler service
      Print Spooler: Restart the Print Spooler service (Windows Server 2008 R2)
      Print Spooler: Restart the server or troubleshoot hardware problems
      Print Spooler: Restart the server or troubleshoot hardware problems (Windows Server 2008 R2)
      Print Spooler: Service Status Monitor
      Print Server 2008 Queue Job Errors
      Print Server 2008 Queue Not Ready Errors

  31. Ijaz says:

    Hi Kevin, we have installed SCOM 2012 UR2. We pushed the agent from the SCOM console to the client servers. On domain controllers running Windows Server 2012 Standard, Monitoringhost.exe consumes high CPU. Is kb/974051 applicable to this issue, and if so, which hotfix do we need to install on Windows Server 2012 domain controllers?

  32. sam says:

    Can we get a patch for Windows Server 2008 R2 servers for the same issue?

  33. Alex says:

    Hey

    I’m not sure if something went wrong or you actually were active just a while ago, let me ask you here this question: how can I effectively trace SCOM agent that is connected directly to OpInsights workplace (bypassing any local SCOM infrastructure)? It hugs all CPU once in a while and I really don’t have a low-end desktop that’s being monitored (or is 8*4GHz cores not enough? 🙂 )

    Thanks!

  34. James says:

    I have a question on this. I just upgraded to SCOM 2016, and we are seeing monitoringhost.exe accessing the security log not a couple hundred times a day, but thousands of times an hour. Before the upgrade there were only a few hundred accesses, but now, after the upgrade, running the Splunk query on a single machine for a 4 minute search, there were 2,508 accesses. I do not think this is normal. The CPU usage is not running out of control; it is within normal parameters.
