The high availability capabilities of the lagged database copy are enhanced in the upcoming release of Exchange 2016 Cumulative Update 1.
As you may recall, lagged copies can care for themselves by invoking automatic log replay to play down the log files in certain scenarios:
- When a low disk space threshold (10,000MB) is reached
- When the lagged copy has physical corruption and needs to be page patched
- When there are fewer than three available healthy HA copies for more than 24 hours
Play down based on healthy copy status requires ReplayLagManager to be enabled. Beginning with Exchange 2016 CU1, ReplayLagManager is enabled by default. You can disable it via the following command:
Set-DatabaseAvailabilityGroup <DAGName> -ReplayLagManagerEnabled $false
Deferred Lagged Copy Play Down
When one of the above conditions is triggered, the Replication Service will initiate a play down event for the lagged database copy. However, there are times when this may not be ideal. For example, consider the scenario where there are four database copies on a disk: one passive, one lagged, and two active. Initiating a play down event on the lagged copy has the potential to impact any active copies on that disk – replaying log files generates IO and introduces disk latency as the disk head moves, which impacts users accessing their data on the active copies.
To address this concern, beginning with Cumulative Update 1 for Exchange 2016, the lagged copy’s play down activity is tied to the health of the disk by evaluating the disk’s IO latency:
- If the disk’s read IO latency is above 35ms, the play down event is deferred. In the event that there is a disk capacity concern, the disk latency deferral will be ignored and the lagged copy will play down.
- Once the disk’s read IO latency drops below 25ms, the play down event is resumed.
As a result, deferred lagged copy play down reduces the IO burstiness of lagged copy play down events and ensures that local active copies on the lagged copy’s disk are not affected. IO sizing of a lagged database copy does not change with this feature (nor does it affect the IO sizing of an active copy); you still must ensure there is available IO headroom in the event that the lagged copy becomes active.
Consider the following example:
The y axis is disk latency, measured in milliseconds. The x axis is a 24-hour period.
As you can see from the graph, between 1am and 9am the disk IO latency is below 25ms, meaning that lagged copy replay is allowed. At 10am, the latency exceeds 35ms and stays high until about 2pm; during this time period, lagged copy replay is deferred. At 2pm, the latency drops below 25ms and lagged copy replay resumes. Latency increases again at 4pm and the process repeats itself.
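The defer and resume behavior walked through above can be modeled as a simple two-threshold (hysteresis) check. The following is a minimal sketch, not Exchange's actual implementation: the function name Get-PlayDownState and the sample latency values are hypothetical, and it assumes that a latency between 25ms and 35ms leaves the current state unchanged. It also leaves out the disk capacity override and the maximum deferral time.

```powershell
# Hypothetical model of the deferral decision; NOT the Replication Service's code.
function Get-PlayDownState {
    param(
        [double[]]$LatencyMs,           # one disk read-latency sample per interval
        [double]$DeferAboveMs  = 35,    # defer threshold described in this post
        [double]$ResumeBelowMs = 25     # resume threshold described in this post
    )

    $deferred = $false
    foreach ($sample in $LatencyMs) {
        if (-not $deferred -and $sample -gt $DeferAboveMs) {
            # Latency rose above the upper threshold: defer play down.
            $deferred = $true
        }
        elseif ($deferred -and $sample -lt $ResumeBelowMs) {
            # Latency fell below the lower threshold: resume play down.
            $deferred = $false
        }
        # Assumption: samples between the two thresholds keep the prior state.
        if ($deferred) { 'Deferred' } else { 'Replaying' }
    }
}

# Made-up samples, in milliseconds.
Get-PlayDownState -LatencyMs 10, 20, 40, 30, 38, 24, 20, 36
# Emits: Replaying, Replaying, Deferred, Deferred, Deferred, Replaying, Replaying, Deferred
```

Using two thresholds rather than one keeps replay from flapping on and off when latency hovers around a single cutoff.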
By default, the maximum amount of time that a play down event can be deferred is 24 hours. You can adjust this via the following command:
Set-MailboxDatabaseCopy <database name\server> -ReplayLagMaxDelay:<value in the format of 00:00:00>
If you want to disable deferred play down, you can set the ReplayLagMaxDelay value to ([TimeSpan]::Zero).
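As an illustration, the two cases might look like the following. DB1\EX1 is a placeholder for a database copy in your environment, and the 12-hour value is only an example, not a recommendation.

```powershell
# DB1\EX1 is a hypothetical <database name\server> identity.
# Cap the deferral at 12 hours instead of the default 24 hours.
Set-MailboxDatabaseCopy DB1\EX1 -ReplayLagMaxDelay 12:00:00

# Disable deferred play down for this copy.
Set-MailboxDatabaseCopy DB1\EX1 -ReplayLagMaxDelay ([TimeSpan]::Zero)
```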
The following events are recorded in the Microsoft-Exchange-HighAvailability/Monitoring crimson channel when log replay is deferred or resumed:
- Event 750 – Replay Lag Manager requested activating replay lag delay (suspending log replay) for database copy ‘%1\%2’ after a suppression interval of %4. Delay Reason: %6
- Event 751 – Replay Lag Manager successfully activated replay lag delay (suspended log replay) for database copy ‘%1\%2’. Delay Reason: %4
- Event 752 – Replay Lag Manager failed to activate replay lag delay (suspend log replay) for database copy ‘%1\%2’. Error: %4
- Event 753 – Replay Lag Manager requested deactivating replay lag (resuming log replay) for database copy ‘%1\%2’ after a suppression interval of %4. Reason: %5
- Event 754 – Replay Lag Manager successfully deactivated replay lag (resumed log replay) for database copy ‘%1\%2’. Reason: %4
- Event 755 – Replay Lag Manager failed to deactivate replay lag (resume log replay) for database copy ‘%1\%2’. Error: %4
- Event 756 – Replay Lag Manager will attempt to deactivate replay lag (resume log replay) for database copy ‘%1\%2’ because it has reached the maximum allowed lag duration. Detailed Reason: %5
The following events are recorded in the Microsoft-Exchange-HighAvailability/Operational crimson channel when log replay is deferred or resumed:
- Event 748 – Log Replay suspend/resume state for database ‘%1’ has changed. (LastSuspendReason=%3, CurrentSuspendReason=%4, CurrentSuspendReasonMessage=%5)
- Event 2050 – Suspend log replay requested for database guid=%1, reason=‘%2’.
- Event 2051 – Suspend log replay for database guid=%1 succeeded.
- Event 2052 – Suspend log replay for database guid=%1 failed: %2.
- Event 2053 – Resume log replay requested for database guid=%1.
- Event 2054 – Resume log replay for database guid=%1 succeeded.
- Event 2055 – Resume log replay for database guid=%1 failed: %2.
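One way to review these events is with the standard Get-WinEvent cmdlet, run on the Mailbox server that hosts the lagged copy. This is a sketch: the 50-event limit is arbitrary, and the selected output columns are just a suggestion.

```powershell
# Event IDs listed in this post, grouped by crimson channel.
$channels = @{
    'Microsoft-Exchange-HighAvailability/Monitoring'  = 750..756
    'Microsoft-Exchange-HighAvailability/Operational' = @(748) + (2050..2055)
}

foreach ($channel in $channels.Keys) {
    $filter = @{ LogName = $channel; Id = $channels[$channel] }

    # SilentlyContinue suppresses the error raised when no matching events exist.
    Get-WinEvent -FilterHashtable $filter -MaxEvents 50 -ErrorAction SilentlyContinue |
        Select-Object TimeCreated, Id, LogName, Message
}
```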
The changes discussed above continue our work in improving the Preferred Architecture by ensuring that users have the best possible experience on the Exchange platform.
As always, we welcome your feedback.
Principal Program Manager
Office 365 Customer Experience