HealthService restarts – still a challenge in OpsMgr 2012.


This is probably the single biggest issue I find in 100% of customer environments. 

Way back in the day I wrote about this issue, where the SCOM agent can in some cases consume more memory, handles, etc. than is typical.  When this occurs, we restart the agent to kill any “runaway” processes.  Read about this here:


One thing I have noticed is that on many of my servers these thresholds are breached on a regular basis, mostly because the monitoringhost.exe processes need more than the default of 300 MB of RAM (private bytes).


The issue is that you will likely have NO idea this is happening.  We don’t generate any alerts for this by default; we simply “fix the problem” by creating a state change, then running a response script to bounce the agent.  The REALLY bad part is that you could have agents in a constant restart loop.

Customers often have hundreds of agents in a constant restart loop, filling the SCOM DB with state change events while barely monitoring the systems, because the agents are always restarting.  Additionally, the agent eventually fails to start back up, resulting in a heartbeat failure.


In SCOM 2012 – I recommend making the following changes via overrides:  Open the “Operations Manager > Agent Details > Agents by Version” view in the console:




Open health explorer for one of the agents – and here is an example of an agent that has been bouncing on a regular basis:



I recommend the following:

Private bytes monitors should be set to a threshold of 943718400 bytes (900 MB, up from the default of 300 MB)

Handle Count monitors should be set to 30,000 (the default of 6,000 is WAY low)

In addition, on each monitor:

Override Generate Alert to True (to generate alerts)

Override Auto-Resolve to False (even though the default is False, this must be set explicitly to keep these alerts from auto-closing, so you can see them and their repeat count)

Override Alert severity to Information (to keep from ticketing on these events)


Override EACH monitor for “all objects of another class” and choose the “Agent” class.
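The private bytes thresholds are entered in bytes, which makes them easy to mistype. As a quick sanity check, here is a minimal sketch of the arithmetic, assuming binary megabytes (1 MB = 1024 * 1024 bytes). The helper names are made up for illustration and are not part of SCOM.

```python
# Sketch: convert between megabytes and the byte values used in the overrides.
# Assumes binary megabytes (1 MB = 1024 * 1024 bytes); helper names are hypothetical.

BYTES_PER_MB = 1024 * 1024


def mb_to_bytes(megabytes: int) -> int:
    """Return the byte value to type into a private bytes threshold override."""
    return megabytes * BYTES_PER_MB


def bytes_to_mb(value: int) -> float:
    """Return a byte threshold expressed in megabytes, for reading existing overrides."""
    return value / BYTES_PER_MB


if __name__ == "__main__":
    print(mb_to_bytes(300))        # default threshold: 314572800
    print(mb_to_bytes(900))        # recommended threshold: 943718400
    print(bytes_to_mb(943718400))  # 900.0
```

If a group of servers needs a higher value, the same conversion gives the number to enter in that group's override.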




This is a good configuration:











As a refresher – this will be common on any monitored systems that discover a large number of instances – such as Exchange, DNS, SQL servers, SCVMM, large web servers, etc.

Comments (9)

  1. Kevin Holman says:

    @Stooney –

    Not sure I understand your question – if you read my blog post above – I *AM* recommending these as system-wide changes, and I picture the values I prefer as a base. No – management servers should not be changed, which is why I chose the Agent class to modify.
    Exchange and Hyper-V servers might need even higher values since they host so many objects… and you can add another override for those scoped to a group, which will win in conflict since group is more specific than class.

  2. Krish says:

    Hope you are doing good.

    I am facing the above problem with my management servers, and the health state of the servers is critical because of this alert.
    Steps taken for this alert:

    Applied overrides for the classes Management Server and Management Server Agent.

    Parameter name – Agent performance monitor type (Consecutive Samples) – Threshold – default value – 314572800 Effective Value – 1610612736.

    Even after changing the threshold values, the state change is still happening. Could you please help me fix this issue?

  3. stooney says:

    Glad I landed on this post, we are seeing a wide range of servers affected by this issue, particularly with Exchange and Hyper-V. After reading the post, I am wondering if it would be safe/recommended to implement a system wide change with these values (600MB and 15000 handles) to a 2,000 server environment? If so, should Management server values be modified as well? Thank you

  4. stooney says:

    Yes, it does answer my question. Thank you very much. I just wanted some assurance since your recommended values are such a jump from the defaults. It just makes me wonder why they are not the defaults to start with.

  5. lucy says:

    Great post from your hands again. I loved the complete article.
    By the way nice writing style you have. I never felt like boring while reading this article.

    I will come back & read all your posts soon. Regards, Lucy.

  6. Tommy says:

    Hi Kevin,

    Since SCOM 2012 R2 U3 the Health Service Handle Count Threshold for Management Server (Agent) has been changed;
    The update threshold for monitor "Health Service Handle Count Threshold" is reset to 30,000. You can see this issue in the environment, and the Health Service Handle Count Threshold monitor is listed in the critical state.

    However, Monitoring Host Handle Count Threshold is still set to 10.000. Whilst you advise to increase the threshold for Agents to 15.000, how about Management Server (Agent)?

    Currently i have:

    Health Service Handle Count Threshold
    Management Server – 30.000 (default since UR3)
    Management Server Agent – 30.000 (default since UR3)
    Agent – Severity: Information, Generates Alerts: True, Auto-Resolve Alert: False, Threshold: 15.000

    Monitoring Host Handle Count Threshold
    Management Server – 10.000 (default)
    Management Server Agent – 10.000 (default)
    Agent – Severity: Information, Generates Alerts: True, Auto-Resolve Alert: False, Threshold: 15.000

  7. Tommy says:

    Also, in addition to Krish, We increased the value for management server (agent) to 4294967296 (bytes) because our health services on management servers easily consume 3G or more.

    1. rob1974 says:

      I don’t like the auto resolve, although I understand the reason.

      What i’ve done is create a view for “information” alerts with the Microsoft.SystemCenter.Agent.% name. The view is sorted on “created” and grouped on “source”. This gives an overview per agent