Managed Availability Responders


Responders are the final critical part of Managed Availability. Recall that Probes are how Monitors obtain accurate information about the experience your users are receiving. Responders are what the Monitors use to attempt to fix the situation. Once a Responder passes throttling, it launches a recovery action such as restarting a service, resetting an IIS app pool, or anything else the developers of Exchange have found often resolves the symptoms. Refer to the Responder Timeline section of the Managed Availability Monitors article for information about when the Responders are executed.

Definitions and Results

Just like Probes and Monitors, Responders have an event log channel for their definitions and another for their results. The definitions can be found in Microsoft-Exchange-ActiveMonitoring/ResponderDefinition. Some of the important properties are:

  • TypeName: The full code name of the recovery action that will be taken when this Responder executes.
  • Name: The name of the Responder.
  • ServiceName: The HealthSet this Responder is part of.
  • TargetResource: The object this Responder will act on.
  • AlertMask: The Monitor for this Responder.
  • ThrottlePolicyXml: How often this Responder is allowed to execute. I’ll go into more details in the next section.

The results can be found in Microsoft-Exchange-ActiveMonitoring/ResponderResult. Responders output a result on a recurring basis whether or not the Monitor indicates they should take a recovery action. If a ResponderResult event has a RecoveryResult of 2 and IsRecoveryAttempted of 1, the Responder attempted a recovery action. Usually, you will want to skip the Responder results and go straight to Microsoft-Exchange-ManagedAvailability/RecoveryActionResults, but let’s first discuss the events in the Microsoft-Exchange-ManagedAvailability/RecoveryActionLogs event log channel.

Throttling

When a recovery action is attempted by a Responder, it is first checked against throttling limits. This will result in one of two events in the RecoveryActionLogs channel: 2050, throttling has allowed the operation, or 2051, throttling rejected the operation. Here’s a sample of a 2051 event:

[Image: sample 2051 throttling event]

In the details, you will see:

  • ActionId: RestartService
  • ResourceName: MSExchangeRepl
  • RequesterName: ServiceHealthMSExchangeReplEndpointRestart
  • ExceptionMessage: Active Monitoring Recovery action failed. An operation was rejected during local throttling. (ActionId=RestartService, ResourceName=MSExchangeRepl, Requester=ServiceHealthMSExchangeReplEndpointRestart, FailedChecks=LocalMinimumMinutes, LocalMaxInDay)
  • LocalThrottleResult: <LocalThrottlingResult IsPassed="false" MinimumMinutes="60" TotalInOneHour="1" MaxAllowedInOneHour="-1" TotalInOneDay="1" MaxAllowedInOneDay="1" IsThrottlingInProgress="true" IsRecoveryInProgress="false" ChecksFailed="LocalMinimumMinutes, LocalMaxInDay" TimeToRetryAfter="2015-02-11T14:29:57.9448377-08:00"> <MostRecentEntry Requester="ServiceHealthMSExchangeReplEndpointRestart" StartTime="2015-02-10T14:29:55.9920032-08:00" EndTime="2015-02-10T14:29:57.9448377-08:00" State="Finished" Result="Succeeded" /> </LocalThrottlingResult>
  • GroupThrottleResult: <not attempted>
  • TotalServersInGroup: 0
  • TotalServersInCompatibleVersion: 0

Hopefully, you recognize the first few fields. This is the RestartService recovery action, which restarts a service. The ResourceName is used by the recovery action to pick a target; for the RestartService recovery action, it is the name of the service to restart. The RequesterName is the name of the Responder, as listed in the ResponderDefinition or ResponderResult channels.

The LocalThrottleResult property is more interesting. Recovery actions are throttled at two levels: per server, where the same recovery action cannot run too often on the same server, and per group, where the same recovery action cannot run too often in the same DAG (for the Mailbox role) or AD site (for the Client Access role). If a value is -1, that limit is not used; for example, MaxAllowedInOneHour is not interesting if only 1 action is allowed per day. In this example, the MSExchangeRepl resource had already been the target of this recovery action once in the previous 24 hours, which is all that MaxAllowedInOneDay permits, and fewer than 60 minutes had passed since that attempt, so the action failed both the LocalMaxInDay and LocalMinimumMinutes checks. As this recovery action attempt was blocked by local throttling, the group throttling was not attempted. This table has a description of each of the limits mentioned in this event:

| ThrottlingResult attribute | Local throttle config attribute name | Group throttle config attribute name | Description |
|---|---|---|---|
| IsPassed | | | True if throttling will allow the recovery action. Otherwise, false. |
| MinimumMinutes, LocalMinimumMinutes, GroupMinimumMinutes | LocalMinimumMinutesBetweenAttempts | GroupMinimumMinutesBetweenAttempts | The time that must elapse before this recovery action may act upon the same resource on this server or in this group. |
| TotalInOneHour | | | The number of times this recovery action has acted upon this resource on this server or in this group in the last hour. |
| MaxAllowedInOneHour, LocalMaxInHour | LocalMaximumAllowedAttemptsInOneHour | n/a | The number of times this recovery action is allowed to act upon this resource on this server or in this group in one hour. |
| TotalInOneDay | | | The number of times this recovery action has acted upon this resource on this server or in this group in the last 24 hours. |
| MaxAllowedInOneDay, LocalMaxInDay, GroupMaxInDay | LocalMaximumAllowedAttemptsInADay | GroupMaximumAllowedAttemptsInADay | The number of times this recovery action is allowed to act upon this resource on this server or in this group in 24 hours. |
| IsRecoveryInProgress, RecoveryInProgress, GroupRecoveryInProgress | | | Whether this recovery action is already acting upon this resource and has not completed. If True, the new action will be aborted. |
| TimeToRetryAfter | | | The time after which this recovery action would be allowed to act on this resource on this server or in this group. |

The GroupThrottleResult has the same fields, and also gives details about the recovery actions that have taken place on the other servers in the group.
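If you have copied a LocalThrottleResult value out of a 2051 event, a few lines of script can pull out the fields that matter. The following is a minimal offline sketch in Python, not an Exchange tool: it parses the XML from the sample event above and reports which checks failed and when a retry would be allowed. The function names are made up for illustration, and the timestamp trimming is only there because the event carries seven fractional-second digits.

```python
import re
import xml.etree.ElementTree as ET
from datetime import datetime

# LocalThrottleResult value copied from the sample 2051 event in this article.
LOCAL_THROTTLE_RESULT = (
    '<LocalThrottlingResult IsPassed="false" MinimumMinutes="60" '
    'TotalInOneHour="1" MaxAllowedInOneHour="-1" TotalInOneDay="1" '
    'MaxAllowedInOneDay="1" IsThrottlingInProgress="true" '
    'IsRecoveryInProgress="false" '
    'ChecksFailed="LocalMinimumMinutes, LocalMaxInDay" '
    'TimeToRetryAfter="2015-02-11T14:29:57.9448377-08:00"> '
    '<MostRecentEntry Requester="ServiceHealthMSExchangeReplEndpointRestart" '
    'StartTime="2015-02-10T14:29:55.9920032-08:00" '
    'EndTime="2015-02-10T14:29:57.9448377-08:00" '
    'State="Finished" Result="Succeeded" /> </LocalThrottlingResult>'
)


def parse_timestamp(value):
    """Parse an event timestamp, keeping at most 6 fractional-second digits."""
    trimmed = re.sub(r"(\.\d{6})\d+", r"\1", value)
    return datetime.fromisoformat(trimmed)


def summarize_throttle_result(xml_text):
    """Return the key facts from a LocalThrottlingResult XML string."""
    root = ET.fromstring(xml_text)
    recent = root.find("MostRecentEntry")
    checks = root.get("ChecksFailed", "")
    return {
        "passed": root.get("IsPassed", "").lower() == "true",
        "failed_checks": [c.strip() for c in checks.split(",") if c.strip()],
        "retry_after": parse_timestamp(root.get("TimeToRetryAfter")),
        "last_requester": recent.get("Requester") if recent is not None else None,
        "last_end": parse_timestamp(recent.get("EndTime")) if recent is not None else None,
    }


summary = summarize_throttle_result(LOCAL_THROTTLE_RESULT)
print(summary["failed_checks"])   # ['LocalMinimumMinutes', 'LocalMaxInDay']
print(summary["retry_after"] - summary["last_end"])   # 1 day, 0:00:00
```

In the sample, the retry time is exactly 24 hours after the previous action ended, which is consistent with the one-per-day limit being the last check to clear.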

If the action is not throttled, event 500 will be logged in the Microsoft-Exchange-ManagedAvailability/RecoveryActionResults channel, indicating that the recovery action is beginning. If it succeeds, event 501 is logged. This is the most common case and where you’ll usually want to start. These events also have details about the recovery action that was taken and the throttling it passed. Recovery actions that start and then fail are still counted against throttling limits. For more information about recovery actions, read the What Did Managed Availability Just Do to This Service? article.
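To make the local limits concrete, here is a rough model of the checks described above, written in Python. It is only a sketch of the behavior this article describes, not the Exchange implementation: the function name is hypothetical, the exact boundary handling is an assumption, and the list of attempts is assumed to include actions that started and then failed, since those still count against throttling.

```python
from datetime import datetime, timedelta


def failed_local_checks(config, attempts, now):
    """Return the names of the local checks that would reject an attempt.

    config   -- dict keyed by the local throttle config attribute names
    attempts -- times of earlier attempts on the same resource and server,
                including attempts that started and then failed
    now      -- the time of the new attempt
    A limit of -1 means that check is not used.
    """
    failed = []

    min_minutes = config.get("LocalMinimumMinutesBetweenAttempts", -1)
    if min_minutes != -1 and attempts:
        if now - max(attempts) < timedelta(minutes=min_minutes):
            failed.append("LocalMinimumMinutes")

    windows = (
        ("LocalMaximumAllowedAttemptsInOneHour", "LocalMaxInHour", timedelta(hours=1)),
        ("LocalMaximumAllowedAttemptsInADay", "LocalMaxInDay", timedelta(hours=24)),
    )
    for attribute, check, window in windows:
        limit = config.get(attribute, -1)
        if limit == -1:
            continue  # this limit is not used
        recent = sum(1 for t in attempts if timedelta(0) <= now - t < window)
        if recent >= limit:
            failed.append(check)

    return failed


# Limits matching the sample 2051 event: 60 minutes apart, at most 1 per day.
config = {
    "LocalMinimumMinutesBetweenAttempts": 60,
    "LocalMaximumAllowedAttemptsInOneHour": -1,
    "LocalMaximumAllowedAttemptsInADay": 1,
}
attempts = [datetime(2015, 2, 10, 14, 29, 57)]

print(failed_local_checks(config, attempts, datetime(2015, 2, 10, 15, 0)))
# ['LocalMinimumMinutes', 'LocalMaxInDay']
print(failed_local_checks(config, attempts, datetime(2015, 2, 10, 16, 0)))
# ['LocalMaxInDay']
print(failed_local_checks(config, attempts, datetime(2015, 2, 11, 15, 0)))
# []
```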

Viewing Throttling Limits

So what is the best way to find out what recovery action throttling is in place? You could wait for the Responder to begin a recovery action and view the throttling settings in the RecoveryActionLogs channel, but there are two places that will be more timely. The first is the Microsoft-Exchange-ManagedAvailability/ThrottlingConfig event log channel. The second is the Microsoft-Exchange-ActiveMonitoring/ResponderDefinition channel, introduced in the first section of this article. The advantage of the ThrottlingConfig channel is that you can see all the Responders that can take a particular recovery action grouped together, instead of having to check every Responder definition. Here’s a sample event from the ThrottlingConfig event log channel:

  • Identity: RestartService/Default/*/*/msexchangefastsearch
  • RecoveryActionId: RestartService
  • ResponderCategory: Default
  • ResponderTypeName: *
  • ResponderName: *
  • ResourceName: msexchangefastsearch
  • PropertiesXml: <ThrottleConfig Enabled="True" LocalMinimumMinutesBetweenAttempts="60" LocalMaximumAllowedAttemptsInOneHour="-1" LocalMaximumAllowedAttemptsInADay="4" GroupMinimumMinutesBetweenAttempts="-1" GroupMaximumAllowedAttemptsInADay="-1" />

The Identity of a throttling configuration is a concatenation of the next five fields, so let’s discuss each. The RecoveryActionId is the Responder’s throttling type. You can find this as the name of the node under ThrottleEntries in the Responder definition’s ThrottlePolicyXml property. The ResponderCategory is unused and is always Default right now. The ResponderTypeName is the Responder’s TypeName property. The ResponderName is the Responder’s Name property; in this example both it and the ResponderTypeName are *, so the entry is not limited to a single Responder. The ResourceName is the object the Responder acts on. In this example, Responders that use the RestartService recovery action to restart the MSExchangeFastSearch service are allowed to do so on any server up to 4 times a day, as long as it has been 60 minutes since this recovery action last restarted it on that server. The group throttling is not used.
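The Identity string is easy to split back into its five fields if you are sorting through many ThrottlingConfig events. This Python sketch shows one way to do it. The helper names are hypothetical, and two behaviors are assumptions drawn only from the sample event: that * matches any value, and that resource names compare case-insensitively (the sample shows msexchangefastsearch here and MSExchangeFastSearch in the Responder definition).

```python
from collections import namedtuple

ThrottleIdentity = namedtuple(
    "ThrottleIdentity",
    "recovery_action_id responder_category responder_type_name "
    "responder_name resource_name",
)


def parse_identity(identity):
    """Split a ThrottlingConfig Identity into its five fields."""
    parts = identity.split("/", 4)
    if len(parts) != 5:
        raise ValueError("expected five '/'-separated fields: %r" % identity)
    return ThrottleIdentity(*parts)


def field_matches(pattern, value):
    """Assumed matching rule: '*' matches anything, otherwise ignore case."""
    return pattern == "*" or pattern.lower() == value.lower()


def applies_to(identity, action_id, type_name, name, resource):
    """Check whether a throttling entry would cover a given Responder."""
    return (
        field_matches(identity.recovery_action_id, action_id)
        and field_matches(identity.responder_type_name, type_name)
        and field_matches(identity.responder_name, name)
        and field_matches(identity.resource_name, resource)
    )


entry = parse_identity("RestartService/Default/*/*/msexchangefastsearch")
print(entry.recovery_action_id)   # RestartService
print(entry.resource_name)        # msexchangefastsearch

# "AnyResponderType" is a placeholder, not a real TypeName.
print(applies_to(entry, "RestartService", "AnyResponderType",
                 "SearchIndexFailureRestartSearchService",
                 "MSExchangeFastSearch"))   # True
```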

The second method to view throttling limits is through the Microsoft-Exchange-ActiveMonitoring/ResponderDefinition events. These will include any overrides you have in place. Here is the value of the ThrottlePolicyXml property from a ResponderDefinition event:

<ThrottleEntries> <RestartService ResourceName="MSExchangeFastSearch"> <ThrottleConfig Enabled="True" LocalMinimumMinutesBetweenAttempts="60" LocalMaximumAllowedAttemptsInOneHour="-1" LocalMaximumAllowedAttemptsInADay="4" GroupMinimumMinutesBetweenAttempts="-1" GroupMaximumAllowedAttemptsInADay="-1" /> </RestartService> </ThrottleEntries>

You can see that these attribute names and values match the ThrottlingConfig event’s PropertiesXml values.
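If you would rather not compare the two XML values by eye, the check can be scripted. This Python sketch parses the two sample values shown in this article and confirms that the ThrottleConfig attributes are identical. The function names are illustrative only.

```python
import xml.etree.ElementTree as ET

# PropertiesXml from the sample ThrottlingConfig event.
PROPERTIES_XML = (
    '<ThrottleConfig Enabled="True" LocalMinimumMinutesBetweenAttempts="60" '
    'LocalMaximumAllowedAttemptsInOneHour="-1" '
    'LocalMaximumAllowedAttemptsInADay="4" '
    'GroupMinimumMinutesBetweenAttempts="-1" '
    'GroupMaximumAllowedAttemptsInADay="-1" />'
)

# ThrottlePolicyXml from the sample ResponderDefinition event.
THROTTLE_POLICY_XML = (
    '<ThrottleEntries> <RestartService ResourceName="MSExchangeFastSearch"> '
    '<ThrottleConfig Enabled="True" LocalMinimumMinutesBetweenAttempts="60" '
    'LocalMaximumAllowedAttemptsInOneHour="-1" '
    'LocalMaximumAllowedAttemptsInADay="4" '
    'GroupMinimumMinutesBetweenAttempts="-1" '
    'GroupMaximumAllowedAttemptsInADay="-1" /> '
    '</RestartService> </ThrottleEntries>'
)


def config_from_properties(xml_text):
    """Return the ThrottleConfig attributes from a PropertiesXml value."""
    return dict(ET.fromstring(xml_text).attrib)


def configs_from_policy(xml_text):
    """Map (recovery action, resource name) to its ThrottleConfig attributes."""
    configs = {}
    for action in ET.fromstring(xml_text):
        throttle_config = action.find("ThrottleConfig")
        if throttle_config is not None:
            key = (action.tag, action.get("ResourceName"))
            configs[key] = dict(throttle_config.attrib)
    return configs


from_channel = config_from_properties(PROPERTIES_XML)
from_definition = configs_from_policy(THROTTLE_POLICY_XML)

print(from_definition[("RestartService", "MSExchangeFastSearch")] == from_channel)
# True
```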

Changing Throttling Limits

There may be times when you want recovery actions to occur more frequently or less frequently. For example, you have a customer report of an outage and you find that a service restart would have fixed it but was throttled, or you have a third-party application that does particularly poorly with application pool resets. To change the throttling configuration, you can use the same Add-ServerMonitoringOverride and Add-GlobalMonitoringOverride cmdlets that work for other Managed Availability overrides. The Customizing Managed Availability article gives a good summary of how to use these cmdlets. For the PropertyName parameter, the cmdlet supports a special syntax for modifying the throttling configuration. Instead of specifying the entire XML blob as the override (which will work, but will be harder to read later), you can use ThrottleAttributes.LocalMinimumMinutesBetweenAttempts, or the other properties, as the PropertyName. Here’s an example:

Add-GlobalMonitoringOverride -ItemType Responder -Identity Search\SearchIndexFailureRestartSearchService -PropertyName ThrottleAttributes.LocalMinimumMinutesBetweenAttempts -PropertyValue 240 -ApplyVersion "15.00.1044.025"

To only allow app pool resets by the ActiveSyncSelfTestRestartWebAppPool Responder every 2 hours instead of 1, you could use the command:

Add-GlobalMonitoringOverride -ItemType Responder -Identity ActiveSync.Protocol\ActiveSyncSelfTestRestartWebAppPool -PropertyName ThrottleAttributes.LocalMinimumMinutesBetweenAttempts -PropertyValue 120 -ApplyVersion "Version 15.0 (Build 1044.25)"

If you want your servers to reboot when the MSExchangeIS service crashes and cannot start, with every server allowed to reboot once a day but no more than one server in the DAG rebooting every 60 minutes, you could use the commands:

Add-GlobalMonitoringOverride -ItemType Responder -Identity Store\StoreServiceKillServer -PropertyName ThrottleAttributes.GroupMinimumMinutesBetweenAttempts -PropertyValue 60 -ApplyVersion "15.00.1044.025"

Add-GlobalMonitoringOverride -ItemType Responder -Identity Store\StoreServiceKillServer -PropertyName ThrottleAttributes.GroupMaximumAllowedAttemptsInADay -PropertyValue -1 -ApplyVersion "15.00.1044.025"

The LocalMaximumAllowedAttemptsInADay value is already 1, so each server would still reboot at most once per day. If the override was entered correctly, the ResponderDefinition event’s ThrottlePolicyXml value will be updated, and there will be a new entry in the ThrottlingConfig channel.
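To get a feel for what the updated ThrottlePolicyXml should look like after an override, you can preview the change offline. The Python sketch below is purely an illustration of the ThrottleAttributes syntax described above: it sets one attribute on every ThrottleConfig node of a copied ThrottlePolicyXml value. How Exchange applies the override internally is not shown here, the function name is hypothetical, and pairing the sample XML from this article with a 240-minute override is an assumption made for the example.

```python
import xml.etree.ElementTree as ET

# ThrottlePolicyXml copied from the sample ResponderDefinition event.
THROTTLE_POLICY_XML = (
    '<ThrottleEntries> <RestartService ResourceName="MSExchangeFastSearch"> '
    '<ThrottleConfig Enabled="True" LocalMinimumMinutesBetweenAttempts="60" '
    'LocalMaximumAllowedAttemptsInOneHour="-1" '
    'LocalMaximumAllowedAttemptsInADay="4" '
    'GroupMinimumMinutesBetweenAttempts="-1" '
    'GroupMaximumAllowedAttemptsInADay="-1" /> '
    '</RestartService> </ThrottleEntries>'
)

PREFIX = "ThrottleAttributes."


def preview_override(policy_xml, property_name, property_value):
    """Return policy_xml with one throttle attribute replaced (preview only)."""
    if not property_name.startswith(PREFIX):
        raise ValueError("expected a ThrottleAttributes.<name> property")
    attribute = property_name[len(PREFIX):]
    root = ET.fromstring(policy_xml)
    for throttle_config in root.iter("ThrottleConfig"):
        throttle_config.set(attribute, str(property_value))
    return ET.tostring(root, encoding="unicode")


updated = preview_override(
    THROTTLE_POLICY_XML,
    "ThrottleAttributes.LocalMinimumMinutesBetweenAttempts",
    240,
)
print(updated)
```

Comparing this preview with the ThrottlePolicyXml value in the next ResponderDefinition event is one way to confirm that the override was entered as intended.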

These may be poor examples, but it is hard to pick good ones as the Exchange developers pick values for the throttling configuration based on our experience running Exchange in Office 365. We don’t expect that changing these values is going to be something you’ll want to do very often, but it is usually a better idea than disabling a monitor or a recovery action altogether. If you do have a scenario where you need to keep a throttling limit override in place, we would love to hear about it.

Abram Jackson
Program Manager, Exchange Server
