Exchange 2010: The mystery of the 9223372036854775766 copy queue…

Database Copy Self-Protection Mode


What would you say if I told you that a copy queue length (CQL) of 9 quintillion log files was not a bad thing? In case you are wondering, that works out to a CQL of 9,223,372,036,854,775,766, a number so big that it cannot be entered into Windows Calculator, and so big that if it represented actual transaction log files, it would require 8 yottabytes of storage. Sadly, as of this year, no storage or network system on the planet has reached even one thousandth of a yottabyte (which would be a zettabyte, by the way) of information.
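That storage figure is easy to sanity-check; here is a quick sketch, assuming the standard 1 MiB Exchange 2010 transaction log file size and binary (2^80-byte) yottabytes:

```python
# Sanity check of the storage claim above: 9,223,372,036,854,775,766
# transaction logs at 1 MiB each is about 8 binary yottabytes (YiB).
logs = 9_223_372_036_854_775_766
log_size_bytes = 2**20            # Exchange 2010 log files are 1 MiB each
total_bytes = logs * log_size_bytes
yib = total_bytes / 2**80         # 1 YiB = 2**80 bytes
print(f"{yib:.2f} YiB")           # -> 8.00 YiB
```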


But fear not: you don't need to start collecting all of the storage on the planet. This number is, however, a real and valid value for a passive copy's CQL, and in recent weeks I've worked with a few customers who have encountered it.




During regular operations, the Microsoft Exchange Information Store service (store.exe) and the Microsoft Exchange Replication service (msexchangerepl.exe) on DAG members hosting an active database copy write two values to the cluster registry at HKLM\Cluster\ExchangeActiveManager\LastLog:


  • DatabaseGUID with a decimal value representing the last log generated by the active copy
  • DatabaseGUID_TIME with the system time of when that log file was generated


Here is an example of these entries in the cluster registry:
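(The screenshot that originally appeared here is not reproduced. The following is a hypothetical rendering of the two values, using the database GUID from the example below; the data values are invented for illustration.)

```
HKLM\Cluster\ExchangeActiveManager\LastLog
    2abcac37-1b5d-4b9c-8472-e33c65379698         1157
    2abcac37-1b5d-4b9c-8472-e33c65379698_TIME    2012-05-29 10:15:02
```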




To decipher these entries, you can use Get-MailboxDatabase to get the GUID for a mailbox database:


[PS] C:\Windows\system32>Get-MailboxDatabase dag-db0 | fl name,*guid*

Name : DAG-DB0
Guid : 2abcac37-1b5d-4b9c-8472-e33c65379698

These values are written to the cluster registry on the server hosting the active copy, and native cluster registry replication is used to propagate this information to all other DAG members. DAG members that host a passive copy of the database use this information (the last log file generated by the active copy), along with information about the last log file replicated to the passive copy, to determine the CQL for the database copy. Thus, it is critical that all DAG members have up-to-date values, because Active Manager uses the CQL to evaluate whether or not to mount a database in response to a failover.
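Conceptually, the CQL computation is just a subtraction; here is a minimal sketch (the function and variable names are mine, not Exchange's):

```python
def copy_queue_length(last_log_generated: int, last_log_copied: int) -> int:
    """CQL = logs the active copy has generated that the passive copy
    has not yet received."""
    return last_log_generated - last_log_copied

# If the active copy is on log generation 1157 and the passive copy has
# received everything through generation 1149, eight logs are queued.
print(copy_queue_length(1157, 1149))  # -> 8
```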


A Larger-than-Life CQL


In Exchange 2010 SP1, we changed how we determine the CQL. In addition to tracking the last log generated by the active copy, we also track the last time that log file was generated. This was done specifically to prevent situations in which a passive copy is not aware of log files generated by the active copy and makes automount decisions based on stale data.


We use the DatabaseGUID_TIME entry for this purpose. If the difference between the timestamp recorded in the cluster registry and the system time on the server hosting the passive copy exceeds 12 minutes, the Microsoft Exchange Replication service on that server places the database copy into a self-protection mode. It does this by setting the CQL for the passive copy to 9223372036854775766. Because a passive copy can be activated and automatically mounted only when its CQL is less than or equal to the configured value of AutoDatabaseMountDial, this has the effect of preventing the passive copy from ever mounting automatically. After all, a value of 9223372036854775766 will always be higher than any possible value of AutoDatabaseMountDial.
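The staleness check can be sketched as follows (the names and the simplified logic are mine; this is an illustration of the behavior described above, not the product's code):

```python
from datetime import datetime, timedelta

MAX_STALENESS = timedelta(minutes=12)
SELF_PROTECT_CQL = 9223372036854775766  # the self-protection sentinel

def effective_cql(last_log_generated: int, last_log_copied: int,
                  last_log_time: datetime, passive_system_time: datetime) -> int:
    """If the registry timestamp for the active copy is more than
    12 minutes adrift of the passive node's clock, report the
    self-protection sentinel instead of the real copy queue length."""
    if abs(passive_system_time - last_log_time) > MAX_STALENESS:
        return SELF_PROTECT_CQL
    return last_log_generated - last_log_copied

now = datetime(2012, 5, 29, 12, 0)
print(effective_cql(1157, 1149, now - timedelta(minutes=5), now))   # -> 8
print(effective_cql(1157, 1149, now - timedelta(minutes=20), now))  # -> 9223372036854775766
```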


Where Did the Time Go?


So why would the time discrepancy exceed 12 minutes in the first place? This can actually happen for a few reasons:


  • The Cluster service on the server hosting the active copy might be having a problem writing updates even though the node remains in cluster membership.
  • The Cluster service on the server hosting the passive copy might be having a problem receiving updates even though the node remains in cluster membership.
  • The Information Store service and the Exchange Replication service could be stopped on the server hosting the active copy. (Remember that "active" simply signifies which node owns the copy, not the actual mount/dismount state of the database.)
  • A datacenter switchover is being performed and more than 12 minutes have elapsed between the time when the failed DAG members were stopped and when the remote DAG members were activated.


What to Do When CQL Reaches 9223372036854775766?


To recover from this situation, an administrator can simply perform a database switchover. Note, though, that the switchover must be performed from the Exchange Management Shell, because the administrator needs to force the move using multiple switches and the MountDialOverride parameter. Because the following command skips all of the built-in safety checks for a passive copy, use it only when you know that the copy being activated was healthy before it entered self-protection mode.


Attempted move without overrides:


Move-ActiveMailboxDatabase DAG-DB0 -ActivateOnServer DAG-2

Are you sure you want to perform this action?
Moving mailbox database "DAG-DB0" from server "" to server "".
[Y] Yes  [A] Yes to All  [N] No  [L] No to All  [?] Help (default is "Y"): y

Identity        ActiveServerAtS ActiveServerAtE Status     NumberOfLogsLost   RecoveryPoint MountStatus MountStatus
                tart            nd                                            Objective     AtMoveStart AtMoveEnd
--------        --------------- --------------- ------     ----------------   ------------- ----------- -----------
DAG-DB0         dag-1           dag-1           Failed                                      Dismounted  Dismounted
An Active Manager operation failed. Error The database action failed. Error: An error occurred while trying to validate the specified database copy for possible activation. Error: Database copy 'DAG-DB0' on server '' has a copy queue length of 9223372036854725486 logs, which is too high to enable automatic recovery. You can use the Move-ActiveMailboxDatabase cmdlet with the -SkipLagChecks and -MountDialOverride parameters to move the database with loss. If the database isn't mounted after successfully running Move-ActiveMailboxDatabase, use the Mount-Database cmdlet to mount the database.. [Database: DAG-DB0, Server:]
    + CategoryInfo          : InvalidOperation: (DAG-DB0:ADObjectId) [Move-ActiveMailboxDatabase], AmDbActionWrapperException
    + FullyQualifiedErrorId : 3F936D4B,Microsoft.Exchange.Management.SystemConfigurationTasks.MoveActiveMailboxDatabase


Successful move with overrides:


Move-ActiveMailboxDatabase DAG-DB0 -ActivateOnServer DAG-2 -SkipHealthChecks -SkipActiveCopyChecks -SkipClientExperienceChecks -SkipLagChecks -MountDialOverride:BESTEFFORT

Are you sure you want to perform this action?
Moving mailbox database "DAG-DB0" from server "" to server "".
[Y] Yes  [A] Yes to All  [N] No  [L] No to All  [?] Help (default is "Y"): y

Identity        ActiveServerAtS ActiveServerAtE Status     NumberOfLogsLost   RecoveryPoint MountStatus MountStatus
                tart            nd                                            Objective     AtMoveStart AtMoveEnd
--------        --------------- --------------- ------     ----------------   ------------- ----------- -----------
DAG-DB0         dag-1           dag-2           Succeeded  922337203685472... 5/29/2012 ... Dismounted  Dismounted


At this point, with the active copy moved, the database can be mounted on the remote server (for example, with Mount-Database).

Of course, the ultimate question is: why 9223372036854725486? The value for CQL must be a 64-bit integer that cannot be null; therefore, we chose a value close to maxInt64 that is so large that it prevents a potentially out-of-date copy from being activated.
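For the curious, a quick check of how these values relate to maxInt64. (Reading the reported value as the sentinel minus the generation of the last log copied is my inference from the CQL arithmetic described earlier, not something stated by the product.)

```python
max_int64 = 2**63 - 1                 # 9223372036854775807
sentinel = 9223372036854775766        # the self-protection CQL
reported = 9223372036854725486        # value shown in the error message above

print(max_int64 - sentinel)   # -> 41
print(max_int64 - reported)   # -> 50321
print(sentinel - reported)    # -> 50280 (logs already copied, per my inference)
```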

Comments (16)
  1. TIMMCMIC says:


    Yes – the same self-protection mechanisms exist in Exchange 2013.


  2. TIMMCMIC says:


    To my knowledge there’s no immediate way to determine who has the lock.


  3. Christian Campbell says:

    Great article. It just saved my hide during a service outage. Many thanks.

  4. TIMMCMIC says:


    Excellent – glad this was helpful.


  5. Nayan.S.A says:

    Hi TIM,
    Thanks for posting such an Informative Article!
    Now I know why that quintillion number shows up 🙂
    I think in my case, the cluster service is not taking updates in a timely manner. One more query: if there are stale entries in the NETFT table and as a result a node is not able to join the cluster group, do we have any quick option to point out the culprit node (other than
    reading the cluster logs for all nodes)? 😉

  6. Michael B. Smith says:

    50,321 delta from [int64]::maxvalue? where the heck did THAT come from? 🙂

  7. raihan says:

    How long does it take to run the command for a 60GB database? It's taking a very long time.

  8. @Raihan... says:

    Database size has nothing to do with this.

    Assuming a copy queue of zero and a replay queue of zero, the command should complete fairly quickly.

    If there is any replay queue, then we end up waiting for log replay to finish before the database can be mounted. In that case, the time depends on the number of logs left to replay.


  9. bill says:

    I have the same situation as described. I have attempted to move the active DB several times with the above switches, but the shell freezes and will not move past "A remote procedure call has been issued to database…". I ran it verbose and it doesn't move past "Resolved current organization", which is right after I confirm the action. I let it sit for about 8 hours with no activity in the shell or in the event viewer. This is present on all DBs (28) in a two-server DAG, and the DBs are mounted, with healthy copy status, healthy content indexes, and a 0 replay queue length. We have half active on one server and half active on the other. Our latency is less than 1 ms, as both servers are located on our property, on a local network with a 100Mbit connection, about 1 mile apart from each other. Any ideas on how to proceed?

  10. james says:

    Great article and a good story.

    Just one thing — the number is equivalent to 9 quadrillion, not 9 quintillion.

  11. TIMMCMIC says:


    Thanks for posting.  I'm glad you found the story good.

    If my research is correct, a quintillion is a 1 followed by 18 zeros (or 1 × 10 to the 18th power).

    For example:

    1 000 000 000 000 000 000

    If you match up the digit placement of a quintillion to the value mentioned above:

    9 223 372 036 854 775 766

    1 000 000 000 000 000 000

    That places the 9 in the positional equivalent of a quintillion.

    By contrast a quadrillion is X followed by 15 zeros.

    1 000 000 000 000 000

    The following references are what I utilized for this:…/Names_of_large_numbers…/zeros.htm

    If expanded, I believe this would be 9 quintillion, 223 quadrillion, 372 trillion, 36 billion, 854 million, 775 thousand, 766.

    I'm sure we'll become more familiar with quintillion as the national debt continues to grow.  Until then – hopefully this will suffice.


  12. kev says:

    Hi, i have the same issue as Bill, any chance someone over at MS can help out?


  13. TIMMCMIC says:


    If you are running into this issue while the databases are ONLINE, and not during a datacenter switchover, it is usually due to the cluster service not being able to take updates.

    When you issue the move, the PAM attempts to read the cluster registry as well as write to it, and is equally unsuccessful, for the same reasons that the Information Store service cannot update the current log file.

    You most likely will not be able to move the database.

    In this case you'll end up killing the cluster service – as it will most likely not respond to restarting the service with service control manager.


  14. kev says:


    Thanks! On closer inspection, it was the cluster service that had failed; the node that failed held the IP and Name resources too. When I killed the service, it did essentially stop all clustering, which was no better than before, so I just ended up shutting them all down and bringing them up one at a time, and that fixed it!

    cheers, this is good to know


  15. Gabriel Abarca H says:

    Thank you very much, it worked wonderfully for me.

  16. stephane says:

    Hi Tim,

    We encountered this issue following a Broadcom teaming bug (both adapters marked as deactivated).
    The Cluster service restarted and MSExchangeIS crashed.
    The problem was resolved by rebooting the impacted servers.
    My customer asked me whether this problem would have occurred with Exchange 2013.
    I answered yes, because I guess this protection system also exists in Exchange 2013 DAGs.
    Can you confirm?


