OpsMgr 2007: Monitoring Health Service Availability

In OpsMgr 2007, the object-oriented design used in management packs sometimes complicates monitoring. It's not easy to know which class of entity will generate a specific alert or produce data. Therefore, it's a common mistake to target the wrong class in a report or within a notification subscription. This may result in reports returning no data. And it may cause you to overlook an alert because your alert view or subscription does not include the right class.

One especially confusing scenario is the detection of agent heartbeat failure. Often, when you create a notification subscription that is scoped to a specific group or class of computers, you don't receive notifications when agents in that group or class fail to heartbeat. For example, if you create a notification subscription that only applies to the SQL 2005 Computers group, you don't receive a notification when a SQL 2005 computer is shut down or when its Health Service is stopped.

This issue may also affect reporting. If you run the Availability report and you target a specific group or class of computer, the report may not accurately reflect agent downtime. For example, if you add the SQL 2005 Computers group to the report objects list, the report does not reflect downtime accurately if a SQL computer was shut down or if its Health Service was not running for a period of time.

This occurs because Health Service availability is tracked by the Health Service Watcher nodes. These entities are used by management servers to track when an agent stops sending heartbeats. The Health Service Watcher nodes are entirely separate from the agents themselves and from the computer, application and hardware entities associated with the agents.

By design, when an agent stops sending heartbeats, the Health Service Watcher node for the agent turns from Healthy to Warning or Critical, depending on whether the agent computer responds to a ping. All other entities associated with the agent are placed into a "gray state." Gray state means that the health state for an entity is unknown.

While this design is a departure from MOM 2005 and is not intuitive for users, there's a logical reason for the design. Specifically, if the Health Service on an agent stops sending heartbeats, no assumptions can be made about the state of the applications that are hosted by the agent. For example, the Health Service may have stopped or the link between the agent and its management server may have failed, but SQL Server components may still be healthy and available to users.

To notify and report on Health Service availability, you can create custom groups that contain the Health Service Watcher nodes. You then can use these Health Service Watcher groups in notification subscriptions or in Availability reports.

Keep in mind that an Availability report targeted at a Health Service Watcher group will only reflect Health Service availability. Availability for the operating system or the other applications installed on the agent will not be calculated. Also, keep in mind that you cannot use a Health Service Watcher group as the target for a performance report because Health Service Watcher nodes do not have performance counters.

Unfortunately, there are three significant limitations when creating groups of Health Service Watcher nodes:

1. The first limitation is that the Health Service Watcher class only has three properties available for dynamic group membership rules: Health Service ID, Health Service Name and Display Name. The Health Service ID is a unique identifier that is used in the database. Since unique identifiers are random, there is no way to use them effectively in dynamic membership rules. Additionally, the Health Service Name and Display Name will both match the FQDN of the associated agent. Therefore, dynamic membership rules for Health Service Watcher nodes only have the agent FQDN available to query in the rule formula.

2. The second limitation is that rule formulas using the Health Service Name and Display Name properties are case sensitive. This complicates dynamic rule formulas even further because you cannot always know or control whether uppercase or lowercase characters are used in an agent's FQDN.

3. The third limitation is that there is no way to extend the Health Service Watcher class to include more properties. For example, you cannot create a registry attribute and extend the Health Service Watcher node to include the registry attribute. This is because the Health Service Watcher nodes are not related to the computer or operating system entities for the associated agents. A registry attribute for a Health Service Watcher node doesn't make sense because this type of entity doesn't have a registry. Only entities such as Windows Computer, Windows Server or Windows Operating System have registries.

As a result of these limitations, it is important to have naming conventions for agents based on server role, location, domain, organizational boundaries, etc. Without a naming convention, you would have to add Health Service Watcher nodes to groups using explicit membership rules. In most organizations, this would be unmanageable.

If you do have a naming convention, regular expression formulas provide the greatest accuracy and flexibility when creating dynamic membership rules for Health Service Watcher groups. With regular expressions, you can create matches based on any part of an FQDN or based on any character within a NetBIOS name. You can also use a modifier in the regular expression to ignore case.
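The effect of the ignore-case modifier can be tried outside the console before you build a rule. The following is a minimal sketch that uses Python's re module as a stand-in for the engine that evaluates group formulas; the agent names are hypothetical, and the pattern uses only syntax that the examples in this article also rely on.

```python
import re

# Hypothetical agent FQDNs; the mixed case mimics names you cannot control.
names = ["SQLPROD01.ad.com", "sqlprod02.ad.com", "web01.ad.com"]

# Without a modifier, the match is case sensitive.
case_sensitive = re.compile(r"^sql")

# The (?i:...) group turns on ignore-case for everything inside it.
ignore_case = re.compile(r"(?i:^sql)")

sensitive_hits = [n for n in names if case_sensitive.search(n)]
insensitive_hits = [n for n in names if ignore_case.search(n)]

print(sensitive_hits)    # ['sqlprod02.ad.com']
print(insensitive_hits)  # ['SQLPROD01.ad.com', 'sqlprod02.ad.com']
```

The case-sensitive pattern misses the agent that registered with an uppercase name, which is the situation the modifier is meant to handle.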

To create a group of Health Service Watcher nodes using a regular expression, use the following steps:

1. Open Authoring in the SCOM 2007 Console, right-click Groups, and then click "Create a new Group" to run the Create Group Wizard.

2. Name the group and select a management pack to save it in.

3. On the Dynamic Members page, click "Create/Edit rules".

4. For the desired class, select Health Service Watcher and click Add.

Note: When you click Add, the wizard creates an "AND" group by default. This means that all expressions within a group of expressions must be met for inclusion. If you want agents to be included if they match any single expression within a group of expressions, right-click the header for the AND group and click "Switch to Or Group".

5. For the Property, select Health Service Name.

6. For the Operator, select "Matches regular expression" to include all agents that match a regular expression. Select "Does not match regular expression" to exclude all agents that match a regular expression.

7. For Value, enter the regular expression. See examples below for more information about regular expressions.

8. Click the Insert button or the drop-down on the Insert button to add any combination of additional expressions, AND groups or OR groups.

9. Click Formula to see and verify the expanded formula. A single AND group with one expression will look similar to the following:

( Object is Health Service Watcher AND ( Health Service Name Matches regular expression (<regular expression>) ) AND True )

10. Click through the rest of the wizard.

11. After the group is created, right-click the group and click "View group members" to confirm that it returns the expected set of Health Service Watcher nodes.
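The way the operators and the AND/OR groups in the steps above combine can be modeled in a few lines. This is only a sketch of the semantics described in this article, not how OpsMgr implements group calculation; the function names, the sample expressions and the agent names are all hypothetical, and Python's re module stands in for the real engine.

```python
import re

def expression_holds(name, pattern, should_match=True):
    """Model one wizard expression.

    should_match=True  -> "Matches regular expression"
    should_match=False -> "Does not match regular expression"
    """
    found = re.search(pattern, name) is not None
    return found == should_match

def in_group(name, expressions, mode="AND"):
    """Model an AND group (all expressions hold) or an OR group (any holds)."""
    results = [expression_holds(name, p, m) for p, m in expressions]
    return all(results) if mode == "AND" else any(results)

# Hypothetical rule: name contains "sql" and is not a dev/QA name.
expressions = [
    (r"(?i:sql)", True),
    (r"(?i:[dq][0-9]{2}[.])", False),
]

names = ["sqlserver01.ad.com", "sqlserverQ07.ad.com", "web01.ad.com"]

and_members = [n for n in names if in_group(n, expressions, "AND")]
or_members = [n for n in names if in_group(n, expressions, "OR")]

print(and_members)  # ['sqlserver01.ad.com']
print(or_members)   # all three names
```

In this sample, the OR group admits every agent because each name satisfies at least one of the two expressions, so it is worth checking the group members after switching an AND group to an OR group.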

Example 1:

A company has agents in their development and QA environments that they want to exclude from Availability reports. These agents are monitored using the same management group as the production servers. The non-production servers use the letter "d" or the letter "q" in the third-to-last character of the NetBIOS name followed by any two-digit number. For example, non-production SQL servers are named sqlserverQ07.ad.com or sqlserverD23.ad.com. They want to create a Health Service Watcher group that explicitly excludes these agents. To do so, they use the "Does not match regular expression" operator with the following regular expression:

(?i:^[a-z0-9_-]+[dq][0-9]{2}[.])

- The ?i: modifier sets the option to ignore case. This is the workaround for case sensitivity in the Health Service Name and Display Name properties.

- The caret ^ anchor matches the start of the string. This may be necessary if you only want to search the NetBIOS name in an FQDN.

- The [a-z0-9_-] character set matches any letter, any number and the underscore _ and dash - characters.

- The plus + quantifier means that the preceding character set [a-z0-9_-] is matched one or more times. Therefore, the regular expression matches a string of any length that includes any combination of letters, numbers, underscores or dashes.

- The [dq] character set matches the letter "d" or the letter "q".

- The [0-9] character set matches any single digit.

- The {2} quantifier means that the preceding character set [0-9] is matched exactly two times. Therefore, the regular expression matches any two-digit number after the letter "d" or the letter "q".

- The [.] character set matches the period that separates the NetBIOS name from the domain name in an FQDN. Since a dot can have another meaning in a regular expression, enclose a dot in brackets to match a period.

- Since the regular expression is anchored to the start of the string by the caret character and ends with [.], the expression searches only the NetBIOS name in an FQDN.
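To check the expression before entering it in the wizard, you can run it against a list of sample names. The sketch below uses Python's re module as a stand-in for the engine that evaluates the group formula; apart from the two names given in the example, the agent names are hypothetical.

```python
import re

# The regular expression from Example 1.
NON_PROD = re.compile(r"(?i:^[a-z0-9_-]+[dq][0-9]{2}[.])")

names = [
    "sqlserverQ07.ad.com",   # QA, from the example
    "sqlserverD23.ad.com",   # development, from the example
    "sqlserver01.ad.com",    # hypothetical production name
    "webfarm12.ad.com",      # hypothetical production name
]

# "Does not match regular expression" keeps the names the pattern rejects.
members = [n for n in names if not NON_PROD.search(n)]

print(members)  # ['sqlserver01.ad.com', 'webfarm12.ad.com']
```

Keep in mind that the pattern only looks at the third-to-last character of the NetBIOS name. A name such as sqlprod01.ad.com would also be excluded because "prod01" ends in the letter "d" followed by two digits, so the naming convention has to rule out such names.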

After they create the Health Service Watcher group, they can select it as the target for an Availability report to track Health Service availability for the corresponding agents. If they check Warning in the "Down Time" box for the report parameters, the down time statistic includes periods of time when the Health Service on an agent stopped sending heartbeats, but the agent responded to ping. If they check only Unplanned Maintenance, the down time statistic includes only periods of time when the Health Service on an agent stopped sending heartbeats and the agent did not respond to ping. In most cases, this includes only periods of time when a server is not running, such as during a blue screen or unplanned shutdown.

Example 2:

A company uses "sql" in the computer names for their SQL servers. The "sql" string can be anywhere in the NetBIOS name and can use uppercase and/or lowercase letters. They want to create a Health Service Watcher group that includes only these SQL servers, so they can create a notification subscription to e-mail database administrators if a SQL server in the management group fails to heartbeat. To do so, they use the "Matches regular expression" operator with the following regular expression:

(?i:^[a-z0-9_-]*sql[a-z0-9_-]*[.])

- The ?i: modifier sets the option to ignore case. This is the workaround for case sensitivity in the Health Service Name and Display Name properties.

- The caret ^ anchor matches the start of the string. This may be necessary if you only want to search the NetBIOS name in an FQDN.

- The [a-z0-9_-] character set matches any letter, any number and the underscore _ and dash - characters.

- The asterisk * quantifier means that the preceding character set [a-z0-9_-] is matched zero or more times. Therefore, the regular expression matches an empty string or a string of any length that includes any combination of letters, numbers, underscores or dashes.

- The sql characters match "sql" only.

- The [a-z0-9_-] character set matches any letter, any number and the underscore _ and dash - characters.

- The asterisk * quantifier means that the preceding character set [a-z0-9_-] is matched zero or more times. Therefore, the regular expression matches an empty string or a string of any length that includes any combination of letters, numbers, underscores or dashes.

- The [.] character set matches the period that separates the NetBIOS name from the domain name in an FQDN. Since a dot can have another meaning in a regular expression, enclose a dot in brackets to match a period.

- Since the regular expression is anchored to the start of the string by the caret character and ends with [.], the expression searches only the NetBIOS name in an FQDN.
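The same kind of check works for this expression. The following sketch again uses Python's re module as a stand-in for the engine that evaluates the group formula, and all of the agent names are hypothetical.

```python
import re

# The regular expression from Example 2.
SQL_NAME = re.compile(r"(?i:^[a-z0-9_-]*sql[a-z0-9_-]*[.])")

names = [
    "SQLprod01.ad.com",      # "sql" at the start, uppercase
    "nyc-mssql-02.ad.com",   # "sql" in the middle
    "web01.sql.ad.com",      # "sql" only in the domain part
    "exch01.ad.com",         # no "sql" at all
]

# "Matches regular expression" keeps the names the pattern accepts.
members = [n for n in names if SQL_NAME.search(n)]

print(members)  # ['SQLprod01.ad.com', 'nyc-mssql-02.ad.com']
```

The third name is rejected even though its FQDN contains "sql", because the character sets do not include a period and the match therefore cannot extend past the NetBIOS name.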

After they create the Health Service Watcher group, they can create a notification subscription that is scoped only to the Health Service Watcher group and can select the database administrators as recipients. This way, database administrators are notified if any SQL server in the management group fails to heartbeat.

Example 3:

A company's administrative structure aligns with their Active Directory domain structure. Therefore, administrators for DomA are responsible only for servers in DomA, and administrators for DomB are responsible only for servers in DomB. The administrators in DomA want to create a Health Service Watcher group that includes only agents in DomA. To do so, they use the "Matches regular expression" operator with the following regular expression:

(?i:^[a-z0-9_-]+[.]doma)

- The ?i: modifier sets the option to ignore case. This is the workaround for case sensitivity in the Health Service Name and Display Name properties.

- The caret ^ anchor matches the start of the string. This may be necessary if you only want to search the NetBIOS name in an FQDN.

- The [a-z0-9_-] character set matches any letter, any number and the underscore _ and dash - characters.

- The plus + quantifier means that the preceding character set [a-z0-9_-] is matched one or more times. Therefore, the regular expression matches a string of any length that includes any combination of letters, numbers, underscores or dashes.

- The [.] character set matches the period that separates the NetBIOS name from the domain name in an FQDN. Since a dot can have another meaning in a regular expression, enclose a dot in brackets to match a period.

- The doma characters match "doma" only.
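A quick test of this expression against sample names shows what it does and does not include. As before, this is a sketch that uses Python's re module as a stand-in for the engine that evaluates the group formula; the agent names and the domain suffixes are hypothetical.

```python
import re

# The regular expression from Example 3.
DOMA = re.compile(r"(?i:^[a-z0-9_-]+[.]doma)")

names = [
    "web01.DomA.com",      # DomA, mixed case
    "sql02.doma.com",      # DomA, lowercase
    "web01.domb.com",      # DomB
    "app03.domain2.com",   # a different domain that starts with "doma"
]

members = [n for n in names if DOMA.search(n)]

print(members)  # ['web01.DomA.com', 'sql02.doma.com', 'app03.domain2.com']
```

Because the expression stops right after "doma", it also accepts any domain label that merely begins with those letters. If that is a concern in your environment, one possible tightening (an assumption, not part of the original example) is to end the expression with another [.], as in (?i:^[a-z0-9_-]+[.]doma[.]), which requires the domain label to be exactly "doma" followed by a period.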

After they create the Health Service Watcher group, they can use the group in notification subscriptions and in Availability reports to include only computers from DomA.

Additional resources:

Regular Expression Reference

https://www.regular-expressions.info/reference.html

Regular Expression Examples

https://www.regular-expressions.info/examples.html

Regular Expression Workbench (use to experiment and test the results of regular expressions)

https://code.msdn.microsoft.com/RegexWorkbench/Release/ProjectReleases.aspx?ReleaseId=406

Michael Sadoff | Support Escalation Engineer