Stop Healthservice restarts in SCOM 2016



Comments (19)
  1. Steve says:

    One thing I noticed in both this article and your older article for SCOM 2012 R2 is that you say not to auto-resolve the alerts. I understand why (so that you can see which ones are firing), however that means we would need to close monitor-generated alerts, which is supposedly against best practice. The OpsMgr 2012 Self Maintenance MP fires an alert when someone does this, which points to articles explaining why. Are we safe to go against that recommended best practice in this scenario, given that the alerts restart the agent, which clears the Health State (which would now be green again anyway)?

    I actually overrode these to auto-resolve instead but wrote a report that shows which agents have triggered the alerts.

    Also, I notice the figures for the Private Bytes Threshold have gone up (by about a third) in the old article since I set them back in 2016.

    Thanks Kevin


    1. Kevin Holman says:

      I would argue that closing monitor-based alerts is NOT against “best practices”. It only creates a problem for monitors that remain in an unhealthy state. In this case – the monitor goes back to healthy immediately upon the recovery action being taken. We want the alert to remain – otherwise we would auto-resolve it and never see it… and this would just create alert noise. The whole point is to leave the alert in the console and just increment the repeat count, effectively treating the workflow like a rule.

      1. Steve says:

        I understand what you are saying in this case regarding manually closing these monitor-generated alerts, however I am basing the “best practice” comment off Microsoft articles that have told us not to do this (they even use the words “best practice”), and therefore I have enforced this as something not to do in our environment.

        1. Kevin Holman says:

          Those links are poorly written. 🙂

          In seriousness…. they make “assumptions” that the monitor auto-resolution is set to true. While that is the default and applicable to probably 99% of monitors, it is not valid 100% of the time. So technically this is an edge case. To be a “best practice” it would have to apply only to monitors which have auto-resolve set to true. Sometimes, I really, really, hate monitors.

          1. rob1974 says:

            See my remark below. You don’t need to set “auto close” to false. It’s just a matter of how you organise your view.

  2. Bo Lucas says:

    Out of curiosity, has MSFT updated the alert description yet? You would think they would have changed the description to show the actual values of the perf data instead of having you go look at the Health Explorer every time. The default description is pretty worthless. If they have not, might it be a better idea to disable these monitors and create your own, with a proper alert description that actually dumps the data from the perf collection?

    1. Kevin Holman says:

      No – because that is how SCOM works. When a monitor is included in a sealed MP and is NOT configured to alert – there is no alert description beyond the default. The monitor must be configured to alert in order to provide one. That is why the alert description is of such poor quality. I will work with the PG to try and get this default behavior changed.

  3. Jaco says:

    Just a thought: increasing the Health Service thresholds will increase the IO requests on the tempdb and data mount points, as well as require a bit more drive space, and as a result you will have extra pressure on the memory SQL uses. This becomes troublesome when you are using a cloud-based solution where you have a credit limit to take into account. For our environment, which is based on a cloud solution, increasing drive space increases the credit limit, which solves this quite effectively.

  4. Noah says:

    Hi Kevin,

    Do you have any updated recommendations for SCOM 2016, as far as Management Servers and Gateways go, for the (4) monitors above? I’m curious whether they are still set too low as well.

    NOTE: It is CRITICAL that we choose the “Agent” class for our overrides, because we do not want to impact thresholds already set on Management Servers or Gateways.


    1. Kevin Holman says:

      @Noah –

      The cool thing about these monitors for the management servers – is that even if they DO trip, I don’t care as much, because we do NOT restart the healthservice as a recovery action on management servers. That would be disastrous.

      Now – to answer your question – it is common for management servers to use a LOT more than the default limits (10,000 handles and 1.6GB of memory). I typically set handles to 60,000 and private bytes to 4GB on large customer environments. The idea is to figure out where your management servers typically are, then set appropriate thresholds that would be “actionable” and let you know something is wrong, as in someone imported a management pack that created a large amount of instances or workflows (or both) that spiked utilization on management servers way above normal.
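      Kevin’s “baseline, then pad” approach can be sketched roughly like this (an illustrative Python sketch only; the sample values and the 50% headroom factor are assumptions, not SCOM tooling or official guidance):

```python
# Illustrative sketch: derive an "actionable" override threshold from
# observed baseline samples of a Health Service counter (Handle Count or
# Private Bytes). The sample data and the 50% headroom are assumptions.

def suggest_threshold(samples, headroom=1.5):
    """Return a threshold comfortably above the observed peak."""
    return int(max(samples) * headroom)

# Hypothetical hourly Handle Count samples from a busy management server:
handle_samples = [22_000, 25_500, 31_000, 28_400, 30_200]

# Hypothetical Private Bytes samples (in bytes):
private_bytes_samples = [1.9e9, 2.2e9, 2.6e9, 2.4e9]

print(suggest_threshold(handle_samples))        # 46500
print(suggest_threshold(private_bytes_samples)) # 3900000000 (~3.9 GB)
```

      The point of the padding is that the override should only trip when something genuinely abnormal happens (a new MP spiking instance counts, for example), not during routine peaks.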

      1. Noah says:

        Thanks Kevin!

        We’ve got a pretty large environment here (40 Management Servers, 4 Gateways, UNIX, Linux, Network, Windows – you get the picture), so I’ve been trying to determine good settings since we just finished upgrading to SCOM 2016, as I’ve seen these trip a bit more now than I remember them doing in the past.

        Our Management Servers all have 16GB Mem and 8 vCPUs, so I’ll try to determine some baselines and use your recommendations as a starting point. Thanks again – much appreciated.

        1. Kevin Holman says:

          Understood. The issue is, you are at the mercy of how .NET garbage collection works, based on whatever version of .NET you have. When you have management servers with LOTS of memory, you will have less memory pressure, and hence less garbage collection, so you will see SCOM processes using more memory or handles naturally. That does not mean there is a problem. So it is best to take your own baselines. Customers doing a LOT of agentless (Linux/URL/Network) monitoring will also see much higher process utilization, and that is also normal. Like so many perf counters, the customer needs to figure out a good baseline for an actionable threshold, due to variables in customer environments.

  5. rob1974 says:

    I never liked the “do not auto-close” option. If you skip that override and instead create a view where you group on “source”, you can still see the alerts with their repeat count without them being visible in the active alert views.

    1. Kevin Holman says:

      The thing I don’t like about allowing the monitor to auto-resolve – is you end up with THOUSANDS (or tens of thousands) of alerts, which is more harmful to SCOM as a general rule, than fewer alerts with incremented repeat counts. Yes – you can create a view and scope by source to get counts of closed alerts, but in general this is less efficient. If you stay on top of this issue and resolve those “problem agents” then this methodology would be ok. But my experience is that customers rarely do that.
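      The storage difference Kevin alludes to can be modeled in a few lines (a toy model only, not the real OperationsManager schema; the agent names and firing counts are made up):

```python
# Toy model of the two alerting behaviors discussed above.
# Not the real OperationsManager schema - just an illustration of row growth.

def auto_resolve_rows(firings):
    # With auto-resolve on, each fire/resolve cycle closes the alert,
    # so every subsequent firing inserts a brand-new alert row.
    return [{"agent": a, "repeat": 0} for a in firings]

def repeat_count_rows(firings):
    # With auto-resolve off, there is one open row per source; repeated
    # firings just increment that row's repeat counter.
    rows = {}
    for a in firings:
        if a in rows:
            rows[a]["repeat"] += 1
        else:
            rows[a] = {"agent": a, "repeat": 0}
    return list(rows.values())

firings = ["agentA"] * 500 + ["agentB"] * 300   # hypothetical restart storm
print(len(auto_resolve_rows(firings)))   # 800 alert rows
print(len(repeat_count_rows(firings)))   # 2 rows with high repeat counts
```

      Either way the firings happen; the difference is whether they accumulate as thousands of closed rows or as two open rows you can triage at a glance.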

      1. rob1974 says:

        I had this discussion with 2 of your colleagues (Richard Usher and Brian McDermott) years ago and they assured me that wasn’t true. A repeat has just as much impact as a new alert. Although that was more about SQL performance and not so much about SCOM console performance.

        1. Kevin Holman says:

          Richard and Brian are absolutely gurus, so I will defer to them.

          However, a repeated alert consumes one line in the DB and increments a number from that point on, whereas there is a new line in the DB for each individual alert. That in itself tells me they aren’t comparable. But this may be splitting hairs at this point. 🙂

  6. Yasser says:

    I applied the overrides as mentioned, but the issue disappeared for 3 days and then came back. What should I do now?

  7. It would be nice to know what the suggested maximum value is for these thresholds. Even after these changes we still have the problem.

    1. Kevin Holman says:

      There is no maximum value.

      You establish a baseline that is more realistic.

      Next, you determine the systems for which the baseline is not acceptable (via the alerts created by these overrides)

      Next, you determine WHY those agents need more memory or handles (perhaps they discover a huge number of objects, such as SQL, Exchange, or Lync instances, or perhaps they have a massive amount of memory, so the .NET garbage collection never runs, etc.)

      Then – you determine whether to just ignore those agents (assuming they are not restarting very often), disable the monitoring for them (as the SCOM agent is not causing any harm), or give them a much higher threshold.

      There are always edge cases, but these are outliers, and not the norm.
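      The triage steps above can be sketched as a small decision function (the field names, agent names, and cutoffs are illustrative assumptions, not anything SCOM exposes directly):

```python
# Rough sketch of the triage described above for agents that trip the
# new baselines. Field names and example data are illustrative only.

def triage(agent):
    # Order follows the steps above: raise the threshold for legitimately
    # big agents, ignore quiet outliers, and otherwise disable the monitor
    # where the SCOM agent is doing no harm.
    if agent["large_discovery"]:          # e.g. hosts SQL/Exchange/Lync instances
        return "raise threshold"
    if agent["restarts_per_week"] == 0:   # over baseline, but not restarting
        return "ignore"
    return "disable monitor"

agents = [
    {"name": "SQL01", "restarts_per_week": 0, "large_discovery": True},
    {"name": "APP01", "restarts_per_week": 0, "large_discovery": False},
    {"name": "WEB01", "restarts_per_week": 4, "large_discovery": False},
]
for a in agents:
    print(a["name"], "->", triage(a))
```

      In practice the inputs come from the alerts these overrides generate plus your own perf baselines; the point is that each outlier gets a deliberate decision rather than a blanket threshold bump.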

Comments are closed.
