In OpsMgr 2007, when an agent experiences a heartbeat failure, several things happen. Diagnostics run, and possibly recoveries. Alerts go out, and possibly notifications.
But what happens if my Operations team misses one of these alerts? What can I do to "spot check" agents with issues?
Well, any time an agent has a heartbeat failure, we gray out the state icon showing the agent's last known state in each state view.
However, you CAN create a State View that will turn Red or Yellow just like any other state view. Simply create a new State View, and scope the class to Health Service Watcher (Agent).
I called mine Heartbeat State View:
This view will show us when any of the agent Health Service Watcher monitors are unhealthy. In my case, OWA and EXCH1 have issues: OWA is DOWN, while the agent HealthService on EXCH1 is stopped.
However, here is the issue: this view shows us when ANY monitor rolls up an unhealthy state. That includes heartbeat failures AND computer unreachable (server IP stack is down):
What if I want a State View to show me ONLY computers that are DOWN, as in not heartbeating AND not responding to PING? Most customers consider this their "most critical situation". Well, I haven’t found an easy way to do that, so I wrote a report which handles it. This report queries the OpsDB for the state of the "Computer Not Reachable" monitor, and displays only those servers where that monitor is in a critical state. It is based on the following query:
SELECT bme.DisplayName, s.LastModified AS LastModifiedUTC, DATEADD(hh,-5,s.LastModified) AS 'LastModifiedCST (GMT-5)'
FROM State AS s, BaseManagedEntity AS bme
WHERE s.BaseManagedEntityId = bme.BaseManagedEntityId AND s.MonitorId
IN (SELECT MonitorId FROM Monitor WHERE MonitorName = 'Microsoft.SystemCenter.HealthService.ComputerDown')
AND s.HealthState = '3' AND bme.IsDeleted = '0'
ORDER BY s.LastModified DESC
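The query above hard-codes a five-hour offset from UTC, which only suits one time zone. As a rough sketch of one alternative (not what the attached report uses), you could let SQL Server work out its own current offset from UTC and apply that instead. The column alias LastModifiedLocal is my own name for illustration, and this assumes the SQL server sits in the time zone you want to see. It also applies today's offset to every row, so timestamps from the other side of a daylight saving change will be off by an hour.

```sql
-- Sketch only: same tables and filters as the report query,
-- but the local-time column uses the SQL server's current UTC offset
-- instead of a hard-coded -5 hours.
SELECT bme.DisplayName,
       s.LastModified AS LastModifiedUTC,
       -- Offset in minutes between server local time and UTC, applied to the UTC timestamp
       DATEADD(mi, DATEDIFF(mi, GETUTCDATE(), GETDATE()), s.LastModified) AS LastModifiedLocal  -- hypothetical alias
FROM State AS s, BaseManagedEntity AS bme
WHERE s.BaseManagedEntityId = bme.BaseManagedEntityId
  AND s.MonitorId IN (SELECT MonitorId
                      FROM Monitor
                      WHERE MonitorName = 'Microsoft.SystemCenter.HealthService.ComputerDown')
  AND s.HealthState = '3'   -- critical state
  AND bme.IsDeleted = '0'   -- skip deleted objects
ORDER BY s.LastModified DESC
```

If you run this ad hoc in a query window against the OpsDB, the result set should list the same servers as the report query; only the local-time column differs.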
You can import this report if you have created a data source as shown in my previous post:
Import this report into your custom folder and run it. If you like the output, you can schedule it so that you receive it first thing every day:
***** Update 6-30-08: I removed a section of the original query relating to maintenance mode. We found that if a down server had never been in maintenance mode, it would not show up in the report. The query and report download have been updated to address this.
Report is attached below: