Health Service Heartbeat Failure, Diagnostics and Recoveries

I’ve seen plenty of questions come up in the forums and from customers regarding the Health Service Heartbeat Failure monitor, and its associated diagnostics and recoveries.  I spent a little time digging further into these workflows and thought I’d share what I found here.  Hope this helps those curious about what’s happening under the hood.

Communication Channel Basics

After an Operations Manager Agent is installed on a Windows computer, and after it is approved to establish a communication channel with an Operations Manager 2007 management group, the communication channel is maintained by the Health Service.  If this communication channel is interrupted or dropped between the Agent and its primary Management Server (MS) for any reason, the Agent will make three attempts to re-establish communication with its primary MS, by default.

If the Agent is not able to re-establish the channel to its primary MS, it fails over to the next available MS.  Failover configuration and the order of failover is another topic, and will not be covered here.

While the Agent is failed over to a secondary MS, it will attempt to re-establish communication with its primary MS every 60 seconds, by default.  As soon as the Agent can establish communication with its primary MS again, it will disconnect from the secondary MS and fail back to its primary MS.
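
The reconnect and failback behavior described above can be sketched in a few lines of Python. This is only an illustration of the logic, not the Agent's actual implementation: the function names, the can_connect callback and the server names are hypothetical, and only the defaults of three attempts and 60 seconds come from the text.

```python
from typing import Callable, List, Optional

RECONNECT_ATTEMPTS = 3      # default retries against the primary MS (from the text)
PRIMARY_RETRY_SECONDS = 60  # default interval for re-checking the primary MS


def choose_server(primary: str,
                  failover_order: List[str],
                  can_connect: Callable[[str], bool]) -> Optional[str]:
    """Return the MS the agent ends up on after losing its channel, or None."""
    # Try the primary MS up to RECONNECT_ATTEMPTS times before giving up on it.
    for _ in range(RECONNECT_ATTEMPTS):
        if can_connect(primary):
            return primary
    # Primary unreachable: walk the failover list in order (hypothetical order).
    for server in failover_order:
        if can_connect(server):
            return server
    return None


def on_retry_timer(current: Optional[str],
                   primary: str,
                   can_connect: Callable[[str], bool]) -> Optional[str]:
    """Run once per PRIMARY_RETRY_SECONDS while failed over; fail back if possible."""
    if current != primary and can_connect(primary):
        return primary
    return current
```

In this sketch an Agent whose primary stays down lands on the first reachable server in the failover list, and the timer callback moves it back as soon as the primary answers again.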

Health Service Heartbeat Failure Monitor

To briefly summarize the Heartbeat process, two configurable settings control Heartbeat behavior: the Heartbeat interval and the number of missed Heartbeats allowed.  If the MS fails to receive a Heartbeat from an Agent computer for more than the specified number of intervals, the Health Service Heartbeat Failure monitor will change to a critical state and generate an alert.
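
As a rough illustration of how the two settings combine, here is a small Python sketch. The function and parameter names are hypothetical, and the exact comparison (strictly more than interval times allowed misses) is my reading of the behavior, not something taken from the management pack.

```python
def heartbeat_failed(seconds_since_last: float,
                     interval_seconds: float,
                     missed_allowed: int) -> bool:
    """True when the silence exceeds the allowed number of heartbeat intervals."""
    if interval_seconds <= 0 or missed_allowed < 0:
        raise ValueError("interval must be positive and missed_allowed non-negative")
    # Assumption: the monitor trips only once MORE than the allowed
    # number of intervals has elapsed without a heartbeat.
    return seconds_since_last > interval_seconds * missed_allowed
```

For example, with an illustrative 60 second interval and three allowed misses, 200 seconds of silence would trip the monitor while 120 seconds would not.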

Read more about Heartbeat and configuration here.

Diagnostic and Recovery Tasks

Two diagnostic tasks run when the Health Service Heartbeat Failure monitor changes to a critical state: Ping Computer on Heartbeat Failure and Check If Health Service Is Running.

Ping Computer on Heartbeat Failure

This diagnostic is defined in the Operations Manager 2007 Agent Management Library and is enabled by default. This workflow uses the Automatic Agent Management Account, which runs under the context of the Management Server Action Account by default, to execute a probe action named WmiProbe, which is defined in the Microsoft System Center Library.

This probe is initiated on the Health Service Watcher. Since the Health Service Watcher is a perspective class hosted by the Root Management Server, this is where the WMI query is executed when the Health Service Heartbeat Failure monitor changes to a critical state. Even though the agent may be reporting to another MS, it is the RMS that sends the ICMP packet to the agent.

Unlike the traditional Ping.exe program we are all accustomed to, which sends four ICMP packets to the target host by default, the WMI query is executed only once and sends a single ICMP packet, so there is no calculation of percentage of lost packets one would expect to see with Ping.exe.

Following is the WMI query executed on the RMS.

SELECT * FROM Win32_PingStatus WHERE Address = '$Config/NetworkTargetToPing$'
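
The $Config/NetworkTargetToPing$ token in the query is a placeholder that is replaced with the target computer's address when the workflow runs. A minimal Python sketch of that substitution follows; the function name and the sample host name are hypothetical, and rejecting quote characters is my own simplification, not part of the workflow.

```python
PING_QUERY_TEMPLATE = (
    "SELECT * FROM Win32_PingStatus "
    "WHERE Address = '$Config/NetworkTargetToPing$'"
)


def build_ping_query(network_target: str) -> str:
    """Substitute the ping target into the WQL query used by the diagnostic."""
    # Simplification: refuse anything that could break out of the quoted literal.
    if not network_target or any(ch in network_target for ch in "'\"\\"):
        raise ValueError("unsupported characters in network target")
    return PING_QUERY_TEMPLATE.replace("$Config/NetworkTargetToPing$",
                                       network_target)
```

The resulting string can be pasted into any WMI query tool on a Windows machine to reproduce the single ping by hand.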

To verify the number of ICMP packets sent, I ran a traditional Ping.exe test and the WMI query used in this workflow and traced these using Netmon.  The first two entries in the image below were captured from the WMI query, and the last eight entries captured were from a Ping.exe test using default parameters (four packets).

[Image: WMI query vs. Ping.exe]

The WMI query results are passed to condition detection modules, which filter on StatusCode and execute the appropriate write action. If StatusCode <> 0, the write action ComputerDown will set state to reflect that the computer is down. If StatusCode = 0, the write action ComputerUp will set state to reflect that the computer is up.

The condition detection modules that filter StatusCode are actually the recovery tasks shown in the Health Service Heartbeat Failure monitor. These are the reserved recoveries Reserved (Computer Not Reachable - Critical) and Reserved (Computer Not Reachable - Success), for a nonzero and a zero StatusCode respectively.

Under the covers, these reserved recoveries are actually setting state of the Computer Not Reachable monitor, which is defined in the System Center Core Monitoring MP. Ultimately, if StatusCode <> 0, the Computer Not Reachable monitor will change to a critical state and generate the Failed to Connect to Computer alert.
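
The chain from ping result to alert can be summarized in a short Python sketch. The write action names come from the workflow described above; the function name, the tuple layout and the "Healthy" label are my own, and treating a missing StatusCode as nonzero is an assumption.

```python
from typing import Optional, Tuple


def route_ping_result(status_code: Optional[int]) -> Tuple[str, str, bool]:
    """Return (write_action, computer_not_reachable_state, raises_alert)."""
    if status_code == 0:
        # Ping succeeded: ComputerUp, no Failed to Connect to Computer alert.
        return ("ComputerUp", "Healthy", False)
    # Any nonzero code (or, by assumption, no code at all) means the ping
    # failed: ComputerDown, and the monitor goes critical and alerts.
    return ("ComputerDown", "Critical", True)
```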

Since this is a diagnostic task which runs during a degraded state change event, the Agent will only be pinged once when the Health Service Heartbeat Failure monitor changes to a critical state. If there are any network related problems after this monitor has changed to critical and the diagnostic task has run, there will be no further monitoring of the ping status of this Agent and no “Failed to Connect to Computer” alert will be generated.

We can understand the root cause better based on whether the Health Service Heartbeat Failure alert was generated along with the Failed to Connect to Computer alert. If the Health Service Heartbeat Failure alert was generated without the Failed to Connect to Computer alert, logic would tell us that the issue is not related to a loss of network connectivity, and that the server has not shut down or become unresponsive. Both alerts together generally indicate that the server is completely unreachable due to a network outage, or that the server is down or unresponsive.
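
That reasoning amounts to a small decision table, sketched here in Python. The function name and the wording of the returned hints are mine; treat the output as a starting point for troubleshooting rather than a definitive diagnosis.

```python
def likely_cause(heartbeat_failure: bool, failed_to_connect: bool) -> str:
    """Map the combination of the two alerts to a troubleshooting hint."""
    if heartbeat_failure and failed_to_connect:
        return "computer unreachable: network outage, or server down or unresponsive"
    if heartbeat_failure:
        # The one-time ping succeeded, so the computer answered on the network.
        return "computer reachable: look at the Health Service or agent itself"
    # The ping diagnostic only runs after a heartbeat failure.
    return "no heartbeat failure reported"
```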

Check if Health Service is Running

This diagnostic is defined in the Operations Manager 2007 Agent Management Library and is enabled by default.  This workflow uses the Automatic Agent Management Account, which runs under the context of the Management Server Action Account by default, to initiate a probe action named QueryRemoteHS, which is defined in the Operations Manager 2007 Agent Management Library.

Specifically, this probe is initiated on the Health Service Watcher and queries Health Service state and configuration on the Agent, when the Health Service Heartbeat Failure monitor changes to a critical state.  This probe module type is further defined in the Windows Core Library.  It takes computer name and service name as configuration, and passes the query results through an expression filter and returns the startup type and current state of the Health Service.

If the service doesn't exist or the computer cannot be contacted, state will reflect this.  Depending on output of the diagnostic task, optional recovery workflows may be initialized (i.e., reinstall agent, enable and start Health Service, and continue Health Service if paused), but these recoveries are not enabled by default.
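
To make the relationship between the diagnostic output and the optional recoveries more concrete, here is a hedged Python sketch. The recovery names are taken from the list above, but which output triggers which recovery is my assumption, as are the function name, the parameters and the state strings.

```python
from typing import Optional


def suggest_recovery(computer_reachable: bool,
                     service_exists: bool,
                     startup_type: str,
                     state: str) -> Optional[str]:
    """Pick the recovery that would plausibly match the diagnostic output."""
    if not computer_reachable:
        return None  # nothing can be done remotely
    if not service_exists:
        return "reinstall agent"
    if state == "Paused":
        return "continue Health Service"
    if startup_type == "Disabled" or state == "Stopped":
        return "enable and start Health Service"
    return None  # service is present and running
```

Remember that none of these recoveries run unless they have been enabled.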