We frequently hear from customers using OpsMgr 2007 or OpsMgr 2012 who report that they are not getting alerts on agent availability when a server has been shut down. Often what they actually mean is that they are not being notified, so you first need to determine whether the alerts are being generated before troubleshooting the notifications. Here is a quick rundown of how agent availability monitoring and notifications work, along with some troubleshooting tips.
There is a distinct difference between not receiving an alert and not receiving a notification on an alert (e.g. e-mail, text, instant message). Let’s start with troubleshooting the alerts, since we will not receive a notification if we don’t first receive an alert. This is where most of the work happens, and a lot goes into generating the alerts; if you are already getting these alerts, you can skip this section and go on to troubleshooting the notification problems.
As you probably already know, there are some global settings that determine whether or not an agent is heartbeating and these are found in the Administration workspace of the console in the Settings node. The settings are as follows:
- Agent-> Heartbeat: tells the agent how often to send a heartbeat, and the management server how often to expect one from the agents it is monitoring (e.g. 60 seconds)
- Server-> Heartbeat: tells the management server how many heartbeats can be missed before it pings the agent to test availability (e.g. 3)
These are the default settings, so in this case the management server tolerates three missed heartbeats; when it registers the fourth, an alert (Health Service Heartbeat Failure) is generated against the health service on the agent computer, indicating that it is no longer available. In this example we need to wait at least 4 minutes (60 seconds x 4) before expecting the heartbeat failure alert. The management server then attempts to diagnose the problem by pinging the agent computer. If the ping is unsuccessful, another alert is generated, indicating that the computer is no longer reachable (Computer Not Reachable). If the initial diagnostic ping is successful, no further action is taken.
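The timing and alert sequence above can be worked through in a few lines. This is a minimal Python sketch for illustration only, not anything OpsMgr itself runs; the function names `minimum_wait_seconds` and `expected_alerts` are made up here, and the values are the example settings listed above.

```python
def minimum_wait_seconds(heartbeat_interval_s, missed_heartbeats_allowed):
    """Earliest time to expect the Health Service Heartbeat Failure alert.

    The alert is raised once one more heartbeat than the allowed
    number has been missed.
    """
    return heartbeat_interval_s * (missed_heartbeats_allowed + 1)


def expected_alerts(ping_succeeded):
    """Alerts to expect once the missed-heartbeat threshold is exceeded."""
    alerts = ["Health Service Heartbeat Failure"]
    # The diagnostic ping only adds a second alert when it fails.
    if not ping_succeeded:
        alerts.append("Computer Not Reachable")
    return alerts


if __name__ == "__main__":
    wait = minimum_wait_seconds(60, 3)
    print(wait, "seconds before the heartbeat failure alert is expected")
    print(expected_alerts(ping_succeeded=False))
```

With the example values (60 seconds, 3 missed heartbeats) the wait works out to 240 seconds, which is the 4 minutes mentioned above.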
There are two monitors that we are concerned with at this point. To see them you should be in the Authoring workspace of the console in the Management Pack Objects->Monitors node and be scoped to the Health Service Watcher. The monitors are as follows with their corresponding paths:
- Health Service Watcher > Entity Health > Availability > Computer Not Reachable
- Health Service Watcher > Entity Health > Availability > Health Service Heartbeat Failure
Note: Although the heartbeat interval and the number of missed heartbeats are configured at a global level and thus affect every agent and management server in the management group, the number of missed heartbeats can be overridden at the management server level and the heartbeat interval can be overridden at the agent level. To check for overrides, open the properties of the management server or agent in the Device Management node of the Administration workspace. Also, both of these monitors are disabled by default for client computers (e.g. XP, Vista), but in most cases the missing alert or notification concerns a server computer.
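One way to picture how these overrides interact with the global settings is sketched below. This is hedged, illustrative Python only: the dictionary layout and the names `effective_wait_seconds`, `agent_interval_override`, and `server_missed_override` are assumptions made for this example and do not correspond to any OpsMgr API. The wait time is computed as heartbeat interval x (number of missed heartbeats plus 1).

```python
# Example global values from the Settings node (assumed defaults).
GLOBAL_SETTINGS = {"heartbeat_interval_s": 60, "missed_heartbeats": 3}


def effective_wait_seconds(global_settings,
                           agent_interval_override=None,
                           server_missed_override=None):
    """Minimum wait before a heartbeat failure alert, honoring overrides.

    An agent-level override replaces the global heartbeat interval;
    a management-server-level override replaces the global number of
    missed heartbeats. None means "no override, use the global value".
    """
    interval = (agent_interval_override
                if agent_interval_override is not None
                else global_settings["heartbeat_interval_s"])
    missed = (server_missed_override
              if server_missed_override is not None
              else global_settings["missed_heartbeats"])
    return interval * (missed + 1)


if __name__ == "__main__":
    print(effective_wait_seconds(GLOBAL_SETTINGS))
    print(effective_wait_seconds(GLOBAL_SETTINGS, agent_interval_override=120))
```

The point of the sketch is that an override on a single agent or management server can lengthen the wait well beyond what the global settings suggest, which is worth ruling out before assuming an alert is missing.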
If you are not receiving the Health Service Heartbeat Failure alert after waiting the minimum time (heartbeat interval x (number of missed heartbeats plus 1)), there are a few things you can check. When you stop the health service on an agent, both its management server and the RMS log a 20022 event in the Operations Manager event log, and a Health Service Heartbeat Failure alert is raised. The agent also appears grayed out in the Agent Managed node of the Administration workspace, and the Health Service Watcher shows Critical. At this point, open the Monitoring workspace of the console and click the Discovered Inventory node. In the Action Pane on the far right, choose Change Target Type-> View All Targets and select Health Service Watcher. The Discovered Inventory node now displays the health state of all discovered instances of the Health Service Watcher class, which is what the monitors above are targeted to. Find your agent in the list; it should show a status of Critical. If you click on it and choose Health Explorer, you will see the critical status for the monitors above. If all of this looks as it should but you are still not receiving the Health Service Heartbeat Failure alert, check the following:
1) Make sure the global heartbeat settings are not set so high that you are expecting an alert before enough time has passed to trigger one. Confirm these settings by reviewing the following TechNet article:
2) The monitor that triggers this alert is in the System Center Core Monitoring MP and is targeted to the Health Service Watcher class. It has a default override to not generate alerts for Windows client computers (XP, Vista...) but alerts should be triggered by default for Windows server agents. If there are overrides other than this one you should consider those as a possible reason for not receiving the alerts.
3) Always check Discovered Inventory with the target set to the Health Service Watcher class. If the watcher for the agent isn’t being monitored (it is not there or still shows healthy), this may be why you’re not getting the alert.
4) If your RMS is clustered, you must ensure that you have the “Use Network Name for Computer Name” option checked on the parameters tab of each of the clustered services. After checking this you should move the group to the other node to restart the services.
Typically, if you are getting the Health Service Heartbeat Failure alerts, you should also receive the Computer Not Reachable alert once the minimum time based on your heartbeat settings has passed, provided the agent computer does not answer the ping. This is a basic ping test and will alert if a ping is not returned successfully.
If you are receiving the alerts in the Operations Manager console but are not being notified consistently, the issue may lie with the channel in use (e.g. e-mail, text or IM) rather than with the notification workflows in OpsMgr 2007 itself. That is, OpsMgr 2007 is working, but for some other reason the e-mail didn’t get sent properly. In OpsMgr 2007 we can test notifications without relying on e-mail or instant messaging, to make sure that some internal process isn’t failing, by using a command as our notification channel. In the steps below we will use a command channel that writes each notification to a text file.
1) In the Settings node of the Administration workspace, select Notification and then the Command tab. Add a new Notification Command channel.
2) In Name, type any name. In Full path, type cmd.exe
3) In Command line parameters, type the following:
/c date /T >>c:\notification.log & Time /T >> c:\notification.log & ECHO SCOM notification >> c:\notification.log
This writes the current date, the current time and the text "SCOM notification" to a text file. Note that you can also use SCOM variables to insert alert details into the text file.
4) In Initial directory, type c:
5) Click Apply, then OK.
6) In the notification node create a new recipient.
7) On the General tab, under Display name, type any name for the recipient. Select Always send notifications.
8) Under Notification devices, add a new device.
9) In Notification Channel, select the command channel created previously.
10) In the Delivery address for the selected channel, type any letter. It won’t be used in this case.
11) Select Always send notifications
12) Type a name for the device
13) Create a new subscription and select the recipient created earlier.
14) Click Next and accept the defaults until you get to the Alert Criteria. Here you select everything you want to generate a notification. In our case we selected all boxes in order to receive notifications quickly. Then click Next twice and Finish.
15) After this you can restart the SCOM services to quickly generate alerts. It may take a few minutes for the log text file to be created and the lines inserted.
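Once the command channel has had a chance to fire, a quick way to confirm it worked is to count the marker lines in the log. The sketch below is illustrative Python and not part of OpsMgr; it assumes the log path and the "SCOM notification" text from the command line in step 3, and the helper name `count_notifications` is hypothetical.

```python
from pathlib import Path

MARKER = "SCOM notification"  # text echoed by the command in step 3


def count_notifications(log_text, marker=MARKER):
    """Count log lines that contain the marker written by the command channel."""
    return sum(1 for line in log_text.splitlines() if marker in line)


if __name__ == "__main__":
    log_path = Path(r"c:\notification.log")  # file written by the command in step 3
    if log_path.exists():
        text = log_path.read_text(errors="replace")
        print(f"{count_notifications(text)} notification(s) logged")
    else:
        print(f"{log_path} not found yet; the channel may not have run")
```

If the count grows each time an alert matching the subscription is raised, the notification workflow inside Operations Manager is doing its job and attention can shift to the external channel.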
Hopefully this gets you to a point where you at least know whether you are dealing with an alerting issue or a notification issue, and gives you an idea of how to approach troubleshooting each. If you determine that this is only a notification problem and the test notification works consistently, the next step is to determine why the specific notification channels outside of Operations Manager are not sending the notifications out.
Dan Johnson | System Center Support Escalation Engineer