In a previous blog post Ross Smith IV had explained what the Replay Lag Manager is and what it does. It's a great feature that's somewhat underappreciated. We've seen a few support cases that seemed to have been opened out of the misunderstanding of what the Replay Lag Manager is doing. I wanted to cover a real world scenario I had dealt with recently with a customer that I believe will clarify some things.
What is a Replay Lag Manager?
In a nutshell, Replay Lag Manager provides higher availability for Exchange through the automatic invocation of a lagged database copy. To further explain, a lagged database copy is a database that Exchange delays committing changes to for a specified period of time. The Replay Lag Manager was first introduced in Exchange 2013 and is actually enabled by default beginning with Exchange 2016 CU1.
To understand what it is let's look at the Preferred Architecture (PA) in regards to a database layout. The PA uses 4 database copies like the following:
As you can see the 4th copy is a lagged copy. Even though we're showing it in a secondary site, it can exist in any site where a node in the same DAG resides.
The Replay Lag Manager will constantly watch for any of the three things to happen to the copies of DB1. Ross Smith's post does a wonderful job of explaining them and how Exchange will take other factors (i.e. disk IO) into consideration before invoking the lagged copy. In general, a log play down will occur:
- When a low disk space threshold (10,000MB) is reached
- When the lagged DB copy has physical corruption and needs to be page patched
- When there are fewer than three available healthy HA copies for more than 24 hours
A log "play down" essentially means that Replay Lag Manager is going to force that lagged database copy to catch up on all of the changes to make that copy current. By doing this it ensures that Exchange maintains at least 3 copies of each database.
When things are less than perfect…
In the real world we don't always see Exchange setup according to our Preferred Architecture because of environment constraints or business requirements. There was a recent case that was the best example of Lag Replay Manager working in the real world. The customer had over 100 DB's, all with 6 copies each. There were 3 copies in the main site and 3 copies in the Disaster Recovery site with one of those copies at each site being lagged. The DB copies were configured like this for all databases.
As you can see in this particular instance the lagged copy at Site A was being forced to play down while the other copy showed a Replay Queue Length (RQL) of 4919. This case was opened due to the fact that the lagged DB copy at Site A was not lagging.
The customer stated that the DB was lagging fine until recently. However, after a quick check of the Replay Queue Length counter in the Daily Performance Logs it didn't appear to have ever lagged successfully for this copy.
So, what we're seeing is the database has 6 copies, 2 lagged but 1 of those lagged copies isn't lagging. Naturally, you may try removing the lag by setting the -ReplayLagTime to 0 then changing back to 7 (or what it was before). You may even try recreating the database copy thinking something was wrong with it. These still don't cause Exchange to lag this copy.
The next step is to check if it's actually the Replay Lag Manager causing the log play down. You can quickly see this by running the following command specifying the lagged DB\Server Name. On this example will use SERVER3 as the server hosting the lagged copy of DB1.
Get-MailboxDatabaseCopyStatus DB1\SERVER3 | Select Id,ReplayLagStatus
Id : DB1\SERVER3
ReplayLagStatus : Enabled:False; PlayDownReason:LagDisabled; ReplaySuspendReason:None;
Percentage:0; Configured:7.00:00:00; MaxDelay:1.00:00:00; Actual:00:01:22
What we see is that the ReplayLagStatus is actually disabled and the PlayDownReason is LagDisabled. That tells us it's disabled but it doesn't really give us more detail as to why..
We can dig further by looking at the Microsoft-Exchange/HighAvailability log and we see a pattern of 3 events. The first event we encounter is the 708 but it doesn't give us any more information than the previous command does.
Time: 11/31/2017 3:32:55 PM
Message: Log Replay for database 'DB1' is replaying logs in the replay lag range. Reason: Replay lag has been disabled. (LogFileAge=00:06:00.8929066, ReasonCode=LagDisabled)
The second event we see has a little more information. At this point we know for sure it's the Replay Lag Manger because of its FastLagPlaydownDesired status.
Time: 11/31/2017 3:32:55 PM
Message: Database scanning during passive replay is disabled on 'DB1'. Explanation: FastLagPlaydownDesired.
On the third event we see the 738 which actually explains what's going on here.
Time: 11/30/2017 1:50:15 PM
Message: Replay Lag Manager suppressed a request to disable replay lag for database copy 'DB1\SERVER3' after a suppression interval of 1.00:00:00. Disable Reason: There were database availability check failures for database 'DB1' that may be lowering its availability. Availability Count: 3. Expected Availability Count: 3. Detailed error(s):
Server 'server4.domain.com' has database copy auto activation policy configuration of 'Blocked'.
Server 'server5.domain.com' has database copy auto activation policy configuration of 'Blocked'.
Server 'server6.domain.com' has database copy auto activation policy configuration of 'Blocked'.
The "Availability Count: 3. Expected Availability Count: 3." is a tad confusing but the heart the issue is in the detailed errors below that…
It's Replay Lag Manager doing it… but why?
The entire reason for this blog post comes out of the fact that we've seen the Replay Lag Manager blamed for not letting a lagged copy lag. So, the next step someone will do is to disable it. Please don't do that! It only wants to help!
Let's look at how we can resolve the our above example. The logs are showing that it's expecting 3 copies but there aren't 3 available. How can that be? They have at least 4 copies of this database available?!? If we run the following command we see a hint at culprit.
Get-mailboxdatabasecopystatus DB1 | Select Identity,AutoActivationPolicy
DB1\SERVER3 Unrestricted - Lagged Copy (Not lagging)
DB1\SERVER6 Blocked - Lagged Copy (Working)
There it is! There are 6 database copies, however, the copies in Site B are all blocked due to the AutoActivationPolicy. Now things are starting to make sense. In the eyes of the Replay Lag Manager those copies in Site B are not available because Exchange cannot activate them automatically. So, what's happening is the Replay Lag Manager only sees the 2 copies (in the green square below) as available. Therefore, it forces a play down of the logs on the lagged copy to maintain it's 3 available copies.
That explains why the lagged copy at Site A isn't lagging but why is the lagged copy at Site B working fine? This is because from the perspective of that database there are 3 available copies in Site A once that lagged copy was played down.
That's cool… how do I fix it?
There are essentially two ways to resolve this example and allow that lagged copy at Site A to properly lag.
The first way is to revisit the decision to block Auto Activation at Site B. The mindset in this particular instance was that their other site was actually for Disaster Recovery. They wanted some manual intervention if databases needed to fail over to the DR site. That's all well and good but it doesn't allow for a lagged copy at Site A to work properly due to the Replay Lag Manager. The customer did actually end up allowing 1 copy at the DR site (site B in our example) for Auto Activation. To do this you can run the following command:
Set-Mailboxdatabasecopystatus SERVER4\DB1 -AutoActivationPolicy Allowed
The other option here would be to create another database copy at Site A. Obviously, that's going to require a lot more effort and storage. However, doing this would allow for the Replay Lag Manager to resume lagging on the lagged database copy.
I hope this post clarifies some things in regards to the Replay Lag Manager. It's a great feature that will provide some automation in keeping your Exchange databases highly available.