Stop HealthService restarts in SCOM 2016


 


 

This is probably the single biggest issue I find in 100% of customer environments.

YOU ARE IMPACTED.  Trust me.

 

SCOM monitors itself to ensure the SCOM processes aren't using too much memory or too many handles.  If we detect that the SCOM agent is using an unexpected amount of memory or handles, we will forcibly KILL the agent, and restart it.

That sounds good right?

In theory, yes.  In reality, however, this is KILLING your SCOM environment, and you probably aren't even aware it is happening.
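
Want to prove this in your own environment?  When the recovery kills and restarts the agent, an event is written to the Operations Manager event log on that system.  Here is a quick sketch to count them over the last week – the event ID (6024) is what I typically see for the restart recovery, but verify it in your environment:

```powershell
# Look for forced HealthService restarts on this system in the last 7 days.
# Event ID 6024 ('LaunchRestartHealthService') is an assumption - confirm it
# matches what the restart recovery logs in your environment.
Get-WinEvent -FilterHashtable @{
    LogName   = 'Operations Manager'
    Id        = 6024
    StartTime = (Get-Date).AddDays(-7)
} -ErrorAction SilentlyContinue |
    Select-Object TimeCreated, Message |
    Format-Table -AutoSize -Wrap
```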

 

The problem?

1.  The default thresholds are WAY out of touch with reality.  They were set almost 10 years ago, when systems used FAR fewer resources than modern operating systems do today.  This is MUCH worse if you choose to MULTIHOME.  Multi-homed agents can use twice as many resources as non-multi-homed agents, and this restart can be issued from EITHER management group, but will affect BOTH.  (You can inspect the current defaults with the script just below this list.)

2.  We don’t generate an alert when this happens, so you are blind to the fact that this is impacting you.
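
If you want to see the thresholds you are running with today, you can pull the monitor configuration with the OperationsManager PowerShell module.  A minimal sketch – the display names below are how the four monitors appear in my console (verify yours match), and the server name is a placeholder:

```powershell
Import-Module OperationsManager
# Connect to a management server ('SCOM-MS01' is a placeholder).
New-SCOMManagementGroupConnection -ComputerName 'SCOM-MS01'

# The four agent resource monitors, by display name.
$monitorNames = @(
    'Health Service Private Bytes Threshold',
    'Health Service Handle Count Threshold',
    'Monitoring Host Private Bytes Threshold',
    'Monitoring Host Handle Count Threshold'
)

# Dump each monitor's configuration XML, which contains the <Threshold> value.
Get-SCOMMonitor -DisplayName $monitorNames |
    Select-Object DisplayName, Configuration |
    Format-List
```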

 

We need to change these in the product.  Until we do, a simple override is the solution.

 

Why is this so bad?

This is bad because of two impacts:

1.  You are hurting your monitored systems by restarting the agent over and over, causing the startup workflows to run in loops and actually consuming additional resources.  You are also going periods of time without any monitoring, because while the agent is killed and restarting, there is a window where all monitoring is unloaded.

2.  You are filling SCOM with state change events.  Every time the monitors initialize, each one sends a “new” state change event upon initialization.  You are hammering SCOM with useless state data.

 

What can I do about it?

Well, I am glad you asked!  We simply need to override 4 monitors, to give them realistic agent thresholds, and set them to generate an informational alert.  I will also include a view for these alerts so we can see if anyone is still generating them.  I will wrap all this in a sample management pack for you to download.

 

In the console, go to Authoring, Monitors, and change the scope to “Agent”.

[Screenshot: the Monitors view in Authoring, scoped to the “Agent” target]

 

We will override each one:

Private Bytes monitors should be set to a threshold of 943718400 (900 MB – triple the default of 300 MB)

Handle Count monitors should be set to 30000  (the default of 6000 is WAY too low)

Override Generate Alert to True (to generate alerts)

Override Auto-Resolve to False (even though the default is false, this must be set explicitly, to keep these alerts from auto-closing so you can see them and their repeat count)

Override Alert severity to Information (to keep from ticketing on these events)

 

 

Override EACH monitor for “all objects of class”, and choose the “Agent” class.

[Screenshot: the override properties dialog, targeting “For all objects of class: Agent”]

 

NOTE: It is CRITICAL that we choose the “Agent” class for our overrides, because we do not want to impact thresholds already set on Management Servers or Gateways.
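
If you prefer to script this rather than click through the console, the SCOM SDK exposes monitor overrides directly.  Below is a minimal sketch for ONE of the threshold overrides – it assumes you have already created an unsealed management pack to store the overrides, and the management pack and server names are placeholders:

```powershell
Import-Module OperationsManager
New-SCOMManagementGroupConnection -ComputerName 'SCOM-MS01'  # placeholder server name

# Unsealed MP that will store the overrides (assumed to already exist).
$mp = Get-SCOMManagementPack -DisplayName 'SCOM Agent Threshold Overrides' |
      Where-Object { -not $_.Sealed }

# Target the Agent class so Management Servers and Gateways are NOT affected.
$agentClass = Get-SCOMClass -Name 'Microsoft.SystemCenter.Agent'

# One of the four monitors - repeat this pattern for the other three.
$monitor = Get-SCOMMonitor -DisplayName 'Health Service Private Bytes Threshold' |
           Select-Object -First 1

# Configuration override raising the threshold to 900 MB (943718400 bytes).
$override = New-Object Microsoft.EnterpriseManagement.Configuration.ManagementPackMonitorConfigurationOverride($mp, 'Override.HealthService.PrivateBytes.Agent')
$override.Monitor     = $monitor
$override.Parameter   = 'Threshold'
$override.Value       = '943718400'
$override.Context     = $agentClass
$override.DisplayName = 'Health Service Private Bytes Threshold - Agent Override'

$mp.Verify()          # throws if the override is malformed
$mp.AcceptChanges()   # commits the override to the management group
```

The alert settings (Generate Alert, Alert Severity, Auto-Resolve) are property overrides rather than configuration overrides – the same pattern using ManagementPackMonitorPropertyOverride objects – or you can simply set them in the console as described above.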

 

This is a good configuration:

[Screenshots: the completed override settings for each of the four monitors]

 

Ok – those are much more reasonable defaults.

 

What else should I do?

Create an alert view that shows alerts with a name matching “Microsoft.SystemCenter.Agent.%”

This will show you if you STILL have some agents restarting on a regular basis.  You should review the ones with high repeat counts on a weekly basis, and either adjust their agent-specific thresholds or investigate why they are consuming so much, so often.  An occasional agent restart (one or fewer per day) is totally fine and probably not worth the time to investigate.
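
If you prefer PowerShell to a console view, the same data is one query away.  A small sketch – the name pattern mirrors the view criteria above, and the server name is a placeholder:

```powershell
Import-Module OperationsManager
New-SCOMManagementGroupConnection -ComputerName 'SCOM-MS01'  # placeholder server name

# Open agent-restart alerts, worst repeat offenders first.
Get-SCOMAlert -Criteria "Name LIKE 'Microsoft.SystemCenter.Agent.%' AND ResolutionState = 0" |
    Sort-Object RepeatCount -Descending |
    Select-Object MonitoringObjectDisplayName, Name, RepeatCount, LastModified |
    Format-Table -AutoSize
```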

 

[Screenshot: the alert view showing agent restart alerts with their repeat counts]

 

I am including a management pack with these overrides and the alert view; you can download it below if you prefer not to make your own.

 

Download:

https://gallery.technet.microsoft.com/SCOM-Agent-Threshold-b96c4d6a


Comments (16)

  1. Steve says:

    One thing I noticed on both this article and your older article for SCOM 2012 R2 is that you mention not to auto-resolve the alerts. I understand why (so that you can see which ones are firing); however, that means we would need to close monitor-generated alerts, which supposedly is against best practice. The OpsMgr 2012 Self Maintenance MP fires an alert when someone does this, which points to articles explaining why. Are we safe to go against that recommended best practice in this scenario, because the alerts restart the agent, which clears the Health State (which would now be green again anyway)?

    I actually overrode these to auto-resolve instead but wrote a report that shows which agents have triggered the alerts.

    Also, I notice the figures for the Private Bytes Threshold have gone up (by about a third) on the old article since I set them back in 2016.

    Thanks Kevin

    Steve

    1. Kevin Holman says:

      I would argue that closing monitor-based alerts is NOT against “best practices”. It only creates a problem with monitors that remain in an unhealthy state. In this case – the monitor goes back to healthy immediately once the recovery action is taken. We want the alert to remain – otherwise we would auto-resolve it and never see it… and this would just create alert noise. The whole point of the alert is to leave it in the console, and just increment the repeat count, effectively treating the workflow like a rule.

      1. Steve says:

        I understand what you are saying in this case in regards to manually closing these monitor-generated alerts, however I am basing the “best practice” comment off Microsoft articles that have told us not to do this (they even use the words “best practice”) – and therefore I have enforced this as something not to do in our environment.
        https://technet.microsoft.com/en-us/library/hh212689(v=sc.12).aspx
        https://support.microsoft.com/en-nz/help/979388/alerts-that-are-raised-by-the-monitors-should-not-be-manually-resolved-in-operations-manager
        https://technet.microsoft.com/en-us/library/hh212903.aspx

        1. Kevin Holman says:

          Those links are poorly written. 🙂

          In seriousness… they make “assumptions” that the monitor auto-resolution is set to true. While that is the default, and applicable to probably 99% of monitors, it is not valid 100% of the time. So technically this is an edge case. To be a “best practice” it would only apply to monitors which have auto-resolve set to true. Sometimes, I really, really hate monitors.

          1. rob1974 says:

            See my remark below. You don’t need to set “auto close” to false. It’s just a matter of how you organise your view.

  2. Bo Lucas says:

    Out of curiosity, has MSFT updated the alert description yet? You would think they would have changed the description to show the actual values of the perf data instead of having you go look at the Health Explorer every time. The default description is pretty worthless. If they have not, then it might be a better idea to disable these monitors and create your own, where you can put a proper alert description – one that actually dumps the data from the perf collection?

    1. Kevin Holman says:

      No – because that is how SCOM works. When a monitor is included in a sealed MP and is NOT configured to alert – there is no alert description beyond the default. The monitor must be configured to alert in order to be able to provide one. That is why the alert description is of such poor quality. I will work with the PG to try and get this default behavior changed.

  3. Jaco says:

    Just a thought: increasing the Health Service thresholds will increase the IO requests on the tempdb and data mount points, as well as require a bit more drive space, and as a result you will have extra pressure on the memory SQL uses. This becomes troublesome when you are using a cloud-based solution where you have a credit limit to take into account. For our environment, which is based on a cloud solution, increasing drive space increases the credit limit, which solves this quite effectively.

  4. Noah says:

    Hi Kevin,

    Do you have any updated recommendations for SCOM 2016, as far as Management Servers and Gateways go, for the (4) monitors above? I’m curious if those are still set too low as well.

    NOTE: It is CRITICAL that we choose the “Agent” class for our overrides, because we do not want to impact thresholds already set on Management Servers or Gateways.

    Thanks!

    1. Kevin Holman says:

      @Noah –

      The cool thing about these monitors on management servers is that even if they DO trip, I don’t care as much, because we do NOT restart the healthservice as a recovery action on management servers. That would be disastrous.

      Now – to answer your question – it is common for management servers to use a LOT more than the default limits (10,000 handles and 1.6GB of memory). I typically set handles to 60,000 and private bytes to 4GB on large customer environments. The idea is to figure out where your management servers typically are, then set appropriate thresholds that would be “actionable” and let you know something is wrong, as in someone imported a management pack that created a large amount of instances or workflows (or both) that spiked utilization on management servers way above normal.
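
      A quick way to take those baselines is to sample the process counters directly. A minimal sketch – the server name is a placeholder, and the counter paths assume the default SCOM process names:

      ```powershell
      # Sample HealthService and MonitoringHost resource usage once a minute
      # for an hour, to establish a baseline ('SCOM-MS01' is a placeholder).
      Get-Counter -ComputerName 'SCOM-MS01' -Counter @(
          '\Process(HealthService)\Private Bytes',
          '\Process(HealthService)\Handle Count',
          '\Process(MonitoringHost*)\Private Bytes',
          '\Process(MonitoringHost*)\Handle Count'
      ) -SampleInterval 60 -MaxSamples 60
      ```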

      1. Noah says:

        Thanks Kevin!

        We’ve got a pretty large environment here (40 Management Servers, 4 Gateways, UNIX, Linux, Network, Windows – you get the picture), so I’ve been trying to determine good settings. We just finished upgrading to SCOM 2016, and I’ve seen these trip a bit more than I remember them doing in the past.

        Our Management servers all have 16GB of memory and 8 vCPUs, so I’ll try to determine some baselines and use your recommendations as a starting point. Thanks again – much appreciated.

        1. Kevin Holman says:

          Understood. The issue is, you are at the mercy of how .NET garbage collection works, based on whatever version of .NET you have. When you have management servers with LOTS of memory, you will have less memory pressure, and hence less garbage collection, so you will see SCOM processes naturally using more memory or handles. That does not mean there is a problem. So it is best to take your own baselines. Customers doing a LOT of agentless (Linux/URL/Network) monitoring will also see much higher process utilization, and that is also normal. Like so many perf counters, the customer needs to figure out a good baseline for an actionable threshold, due to variables in customer environments.

  5. rob1974 says:

    I never like the “do not auto-close” option. If you don’t make this override and instead create a view where you group on “source”, you can still see the alerts with their repeat count, without them being visible in the active alert views.

    1. Kevin Holman says:

      The thing I don’t like about allowing the monitor to auto-resolve is that you end up with THOUSANDS (or tens of thousands) of alerts, which as a general rule is more harmful to SCOM than fewer alerts with incremented repeat counts. Yes – you can create a view and scope by source to get counts of closed alerts, but in general this is less efficient. If you stay on top of this issue and resolve those “problem agents”, then this methodology would be ok. But my experience is that customers rarely do that.

      1. rob1974 says:

        I had this discussion with 2 of your colleagues (Richard Usher and Brian mcDermot) years ago, and they assured me that wasn’t true – a repeat has just as much impact as a new alert. Although that was more about SQL performance and not so much about SCOM Console performance.

        1. Kevin Holman says:

          Richard and Brian are absolutely gurus, so I will defer to them.

          However, a repeated alert consumes one line in the DB and just increments a number from that point on, while each individual alert gets its own new line in the DB. That in itself tells me they aren’t comparable. But this may be splitting hairs at that point. 🙂
