Health Service Restarts on Service Manager Servers With SCOM Agents

With the release of System Center Service Manager (SCSM) 2012 SP1, a SCOM agent was added to the SCSM management servers. This is a very welcome addition, as many management packs do not support agentless monitoring, which left the SCSM management servers poorly monitored.

However, this introduces a risk. In the Microsoft.SystemCenter.2007 Management Pack (Friendly Name: Microsoft System Center Core Management Pack, English Display String: System Center Core Monitoring) there is a monitor that watches whether the health service and its related processes are using too many handles or too much memory (private bytes). It has a recovery task that restarts the SCOM agent when a threshold is exceeded.

You will know you are encountering the issue if you see any of these events in the Operations Manager event log on your SCSM servers:

Log Name: Operations Manager
Source: Health Service Script
Date: 10/17/2013 6:48:59 PM
Event ID: 6024
Task Category: None
Level: Warning
Keywords: Classic
User: N/A
Computer: aaa.bbb.com
Description:
LaunchRestartHealthService.js : Launching Restart Health Service. Health Service exceeded Process\Handle Count or Private Bytes threshhold.

Log Name: Operations Manager
Source: Health Service Script
Date: 10/17/2013 6:49:59 PM
Event ID: 6060
Task Category: None
Level: Warning
Keywords: Classic
User: N/A
Computer: aaa.bbb.com
Description:
RestartHealthService.js : Restarting Health Service. Error: Failed to Terminate Health Service

Log Name: Operations Manager
Source: Health Service Script
Date: 10/17/2013 6:23:07 PM
Event ID: 6062
Task Category: None
Level: Information
Keywords: Classic
User: N/A
Computer: aaa.bbb.com
Description:
RestartHealthService.js : Restarting Health Service. Service successfully restarted.
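If you export the Operations Manager log to text, a quick scan for these three event IDs can tell you whether a server is affected. The following is a minimal sketch, assuming the export uses the same "Event ID: nnnn" layout as the entries above. The function name, the sample text, and event ID 1234 are made up for illustration.

```python
import re

# Event IDs taken from the three sample events above.
RESTART_EVENTS = {
    6024: "restart launched (threshold exceeded)",
    6060: "restart failed to terminate Health Service",
    6062: "restart succeeded",
}

# Matches lines such as "Event ID: 6024" in an exported text log.
EVENT_ID_RE = re.compile(r"^Event ID:\s*(\d+)\s*$", re.MULTILINE)


def find_restart_events(log_text):
    """Return the restart-related event IDs found in log_text, in order."""
    ids = (int(m.group(1)) for m in EVENT_ID_RE.finditer(log_text))
    return [i for i in ids if i in RESTART_EVENTS]


# Hypothetical export; 1234 is a placeholder for an unrelated event.
sample = """Log Name: Operations Manager
Source: Health Service Script
Event ID: 6024
Level: Warning

Log Name: Operations Manager
Source: Health Service Script
Event ID: 1234
Level: Information

Log Name: Operations Manager
Source: Health Service Script
Event ID: 6062
Level: Information
"""

for event_id in find_restart_events(sample):
    print(event_id, RESTART_EVENTS[event_id])
```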

(The Microsoft.SystemCenter.2007 management pack exists in SCOM 2012; don't let the 2007 in the name fool you.)

The monitor with the recovery task to restart the health service is Microsoft.SystemCenter.HealthService.ServiceStateRollup.  It targets Microsoft.SystemCenter.HealthService.

 

It is an aggregate monitor with two unit monitors, one for the handle count and one for the private bytes of HealthService.exe.

 

Microsoft.SystemCenter.Agent adds two additional unit monitors.

 

Since the recovery task is on the aggregate monitor, any of the child unit monitors can trigger it.  In my case, it was the private bytes of Monitoring Host that was causing the recovery task to run on my SCSM servers.
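To make the rollup behaviour concrete, here is a small sketch of an aggregate that turns critical as soon as any child does. It is a toy model, not SCOM code: the worst-of rollup policy is my assumption, the threshold and sample values are invented, and "Monitoring Host handle count" is a guessed name for the fourth unit monitor.

```python
from dataclasses import dataclass


@dataclass
class UnitMonitor:
    name: str
    value: float
    threshold: float  # illustrative number, not the real default

    def is_critical(self):
        return self.value > self.threshold


def aggregate_is_critical(children):
    """Assumed worst-of rollup: critical if any child is critical."""
    return any(child.is_critical() for child in children)


def tripped(children):
    """Names of the child monitors that are over their threshold."""
    return [c.name for c in children if c.is_critical()]


# All values below are made up for the example.
monitors = [
    UnitMonitor("Health Service handle count", 1200, 6000),
    UnitMonitor("Health Service private bytes", 90e6, 300e6),
    UnitMonitor("Monitoring Host handle count", 2500, 6000),
    UnitMonitor("Monitoring Host private bytes", 450e6, 300e6),
]

if aggregate_is_critical(monitors):
    print("recovery would run; tripped by:", tripped(monitors))
```

Only one child is over its threshold here, yet the aggregate goes critical, which is the situation I hit with Monitoring Host.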

The class structure for Health Service can be a little confusing.

  • Microsoft.SystemCenter.HealthService
    • Microsoft.SystemCenter.Agent
      • Microsoft.SystemCenter.Agent.ManagementServer
    • Microsoft.SystemCenter.ManagementServer
      • Microsoft.SystemCenter.GatewayManagementServer

SCOM discovers SCSM Management Servers as instances of the Microsoft.SystemCenter.Agent.ManagementServer class.

There is already an override that disables the recovery task for Microsoft.SystemCenter.ManagementServer. Since Microsoft.SystemCenter.GatewayManagementServer is a child class of Microsoft.SystemCenter.ManagementServer, the override applies to both of them. However, Microsoft.SystemCenter.Agent.ManagementServer is not a child of Microsoft.SystemCenter.ManagementServer, so the override does not apply to it. To me this is silly, as I wouldn't want the recovery task restarting the health service of a management server regardless of whether or not it was in a different management group.

Our solution is actually quite simple.  We just make an override that disables the recovery for that class too.  Overrides do not inherit up the tree, so this will not impact Microsoft.SystemCenter.Agent.
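The inheritance rules are easier to see in a toy model. The sketch below hard-codes the class tree listed earlier and checks whether an override whose context is one class reaches an instance of another. The helper names are mine, and the model ignores everything about override resolution except class inheritance.

```python
# Parent of each class, copied from the hierarchy listed above.
PARENT = {
    "Microsoft.SystemCenter.HealthService": None,
    "Microsoft.SystemCenter.Agent": "Microsoft.SystemCenter.HealthService",
    "Microsoft.SystemCenter.Agent.ManagementServer": "Microsoft.SystemCenter.Agent",
    "Microsoft.SystemCenter.ManagementServer": "Microsoft.SystemCenter.HealthService",
    "Microsoft.SystemCenter.GatewayManagementServer": "Microsoft.SystemCenter.ManagementServer",
}


def ancestry(cls):
    """Return cls followed by its parents, up to the root of the tree."""
    chain = []
    while cls is not None:
        chain.append(cls)
        cls = PARENT[cls]
    return chain


def override_applies(context, instance_class):
    """An override reaches its context class and that class's descendants."""
    return context in ancestry(instance_class)


MS = "Microsoft.SystemCenter.ManagementServer"
GW = "Microsoft.SystemCenter.GatewayManagementServer"
AGENT = "Microsoft.SystemCenter.Agent"
AMS = "Microsoft.SystemCenter.Agent.ManagementServer"

print(override_applies(MS, GW))      # existing override reaches gateways
print(override_applies(MS, AMS))     # but not SCSM management servers
print(override_applies(AMS, AMS))    # the new override covers them
print(override_applies(AMS, AGENT))  # and leaves ordinary agents alone
```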

Here is the XML of the override that I made.

      <RecoveryPropertyOverride ID="Microsoft.SystemCenter.Agent.RestartHealthService.HealthServicePerfCounterThreshold.Enabled.AMS.Override" Context="SC!Microsoft.SystemCenter.Agent.ManagementServer" Enforced="true" Recovery="MSSC2007!Microsoft.SystemCenter.Agent.RestartHealthService.HealthServicePerfCounterThreshold" Property="Enabled">
        <Value>false</Value>
      </RecoveryPropertyOverride>