Managed Availability Probes

The_Exchange_Team · ‎Aug 11 2014

Probes are one of the three critical parts of the Managed Availability framework (monitors and responders are the other two). As I wrote previously, monitors are the central components, and you can query monitors to find an up-to-the-minute view of your users’ experience. Probes are how monitors obtain accurate information about that experience.

There are three major categories of probes: recurrent probes, notifications, and checks.

Recurrent Probes

The most common probes are recurrent probes. Each probe runs every few minutes and checks some aspect of service health. They may transmit an e-mail to a monitoring mailbox using Exchange ActiveSync, connect to an RPC endpoint, or establish CAS-to-Mailbox server connectivity. All of these probes are defined in the Microsoft.Exchange.ActiveMonitoring\ProbeDefinition event log channel each time the Exchange Health Manager service is started. The most interesting properties for these events are:

Name: The name of the Probe. This will begin with the SampleMask of the Probe’s Monitor.
TypeName: The code object type of the probe that contains the probe’s logic.
ServiceName: The name of the Health Set for this Probe.
TargetResource: The object this Probe is validating. This is appended to the Name of the Probe when it is executed to become a Probe Result ResultName
RecurrenceIntervalSeconds: How often this Probe executes.
TimeoutSeconds: How long this Probe should wait before failing.

On a typical Exchange 2013 multi-role server, there are hundreds of these probes defined. Many probes are per-database, so this number will increase quickly as you add databases. In most cases, the logic in these probes is defined in code, and not directly discoverable. However, there are two probe types that are common enough to describe in detail, based on the TypeName of the probe:

Microsoft.Exchange.Monitoring.ActiveMonitoring.ServiceStatus.Probes.GenericServiceProbe: Determines whether the service specified by TargetResource is running.
Microsoft.Exchange.Monitoring.ActiveMonitoring.ServiceStatus.Probes.EventLogProbe: Logs an error result if the event specified by ExtensionAttributes.RedEventIds has occurred in the ExtensionAttributes.LogName. Success results are logged if the ExtensionAttributes.GreenEventIds is logged. These probes will not work if you override them to watch for a different event.

The basics of a recurrent probe are as follows: start every RecurrenceIntervalSeconds and check (or probe) some aspect of component health. If the component is healthy, the probe passes and writes an informational event to the Microsoft.Exchange.ActiveMonitoring\ProbeResult channel with a ResultType of 3. If the check fails or times out, the probe fails and writes an error event to the same channel. A ResultType of 4 means the check failed and a ResultType of 1 means that it timed out. Many probes will re-run if they timeout, up to the MaxRetryAttempts property.

The ProbeResult channel gets very busy with hundreds of probes running every few minutes and logging an event, so there can be a real impact on the performance of your Exchange server if you perform expensive queries against this event channel in a production environment.

Notifications

Notifications are probes that are not run by the health manager framework, but by some other service on the server. These services perform their own monitoring, and then feed data into the Managed Availability framework by directly writing probe results. You will not see these probes in the ProbeDefinition channel, as this channel only describes probes that are run within the Managed Availability framework.

For example, the ServerOneCopyMonitor Monitor is triggered by Probe results written by the MSExchangeDagMgmt service. This service performs its own monitoring, determines whether there is a problem, and logs a probe result. Most Notification probes have the capability to log both a red event that turns the Monitor Unhealthy and a green event that make the Monitor healthy once more.

Checks

Checks are probes that only log events when a performance counter passes above or below a defined threshold. They are really a special type of Notification probe, as there is a service monitoring the performance counters on the server and logging events to the ProbeResult channel when the configured threshold is met.

To find the counter and threshold that is considered unhealthy, you can look at Monitor Definitions with a Type property of:

· Microsoft.Office.Datacenter.ActiveMonitoring.OverallConsecutiveSampleValueAboveThresholdMonitor or

· Microsoft.Office.Datacenter.ActiveMonitoring.OverallConsecutiveSampleValueBelowThresholdMonitor

This means that the probe the Monitor watches is a Check probe.

How this works with Monitors

From the Monitor’s perspective, all three probe types are the same as they each log to the ProbeResult channel. Every Monitor has a SampleMask property in its definition. As the Monitor executes, it looks for events in the ProbeResult channel that have a ResultName that matches the Monitor’s SampleMask. These events could be from recurrent probes, notifications, or checks. If the Monitor’s thresholds are reached or exceeded, it becomes Unhealthy.

It is worth noting that a single probe failure does not necessarily indicate that something is wrong with the server. It is the design of Monitors to correctly identify when there is a real problem that needs fixing versus a transient issue that resolves itself or was anomalous. This is why many Monitors have thresholds of multiple probe failures before becoming Unhealthy. Even many of these problems can be fixed automatically by Responders, so the best place to look for problems that require manual intervention is in the Microsoft.Exchange.ManagedAvailability\Monitoring crimson channel. These events sometimes also include the most recent probe error message (if the developers of that Health Set view it as relevant when they get paged with that event’s text in Office 365).

There are more details on how Monitors work, and how they can be overridden to use different thresholds in the Managed Availability Monitors article.

Abram Jackson
Program Manager, Exchange Server

Products (50)

Special Topics (27)

Video Hub (462)

Most Active Hubs