Recommended Windows Hotfix for Database Availability Groups running Windows Server 2008 R2


In early August of this year, the Windows SE team released the following Knowledge Base (KB) article and accompanying software hotfix regarding an issue in Windows Server 2008 R2 failover clusters:

KB2550886 – A transient communication failure causes a Windows Server 2008 R2 failover cluster to stop working

This hotfix is strongly recommended for all database availability groups that are stretched across multiple datacenters. It is also worth installing for DAGs that are not stretched across multiple datacenters. The article describes a race condition and cluster database deadlock that can occur when a Windows failover cluster encounters a transient communication failure. The race condition is in the reconnection logic of the cluster nodes and shows up when the cluster has communication failures. When it occurs, the cluster database hangs, and the failover cluster loses quorum.

As described on TechNet, a database availability group (DAG) relies on specific cluster functionality, including the cluster database. In order for a DAG to be able to operate and provide high availability, the cluster and the cluster database must also be operating properly.

Microsoft has encountered scenarios in which a transient network failure occurs (a failure of network communications for about 60 seconds) and, as a result, the entire cluster is deadlocked and all databases within the DAG are dismounted. Because it is not easy to determine which cluster node is actually deadlocked, the only available course of action when a failover cluster deadlocks as a result of the reconnect logic race is to restart every member of the cluster.

The problem typically manifests itself in the form of cluster quorum loss due to an asymmetric communication failure (when two nodes cannot communicate with each other but can still communicate with other nodes). If there are delays among other nodes in the receiving of cluster regroup messages from the cluster’s Global Update Manager (GUM), regroup messages can end up being received in unexpected order. When that happens, the cluster loses quorum instead of invoking the expected behavior, which is to remove one of the nodes that experienced the initial communication failure from the cluster.
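The definition of an asymmetric communication failure given above can be made concrete with a small sketch. The Python below is only an illustration: the function name, the node names, and the way broken links are represented are all hypothetical and are not part of the cluster service.

```python
from itertools import combinations

def asymmetric_failures(nodes, broken_links):
    """Return node pairs that cannot reach each other but can each
    still reach at least one other node (hypothetical helper)."""
    broken = {frozenset(link) for link in broken_links}

    def can_talk(a, b):
        return frozenset((a, b)) not in broken

    result = []
    for a, b in combinations(sorted(nodes), 2):
        if can_talk(a, b):
            continue  # this pair is healthy
        others = [n for n in nodes if n not in (a, b)]
        # Asymmetric: the pair is cut off from each other only,
        # while both ends still see some other node.
        if any(can_talk(a, n) for n in others) and any(can_talk(b, n) for n in others):
            result.append((a, b))
    return result

# Four illustrative DAG members with one broken link between two of them.
nodes = ["MBX1", "MBX2", "MBX3", "MBX4"]
print(asymmetric_failures(nodes, [("MBX1", "MBX3")]))
```

In this example the pair MBX1 and MBX3 is reported, which is the situation in which the cluster is expected to remove one of the two nodes rather than lose quorum.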

Generally, this bug manifests when there is asymmetric latency (for example, where half of the DAG members have latency of 1 ms, while the other half of the DAG members have 30 ms latency) for two cluster nodes that discover a broken connection between the pair. If the first node detects a connection loss well before the second node, a race condition can occur:

  • The first node will initiate a reconnect of the stream between the two nodes. This will cause the second node to add the new stream to its data.
  • Adding the new stream tears down the old stream and sets its failure handler to ignore. In the failure case, the old stream is the failed stream that has not been detected yet.
  • When the connection break is detected on the second node, the second node will initiate a reconnect sequence of its own. If the connection break is detected in the proper race window, the failed stream’s failure handler will be set to ignore, and the reconnect process will not initiate a reconnect. It will, however, issue a pause for the send queue, which stops messages from being sent between the nodes. When the messages are stopped, this prevents GUM from operating correctly and forces a cluster restart.
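The sequence above can be sketched as a toy model. This is not the cluster service's code: the `Stream` and `Node` classes, their fields, and the `simulate` function are hypothetical names, and the model assumes that a reconnect which does run resumes the send queue.

```python
from dataclasses import dataclass

@dataclass
class Stream:
    name: str
    failed: bool = False
    ignore_failure: bool = False  # failure handler set to "ignore"

@dataclass
class Node:
    name: str
    stream: Stream
    send_queue_paused: bool = False
    reconnects_started: int = 0

    def accept_new_stream(self, new_stream):
        """The peer initiated a reconnect: tear down the old stream and
        set its failure handler to ignore."""
        self.stream.ignore_failure = True
        self.stream = new_stream

    def on_failure_detected(self, stream):
        """This node notices that a stream is broken."""
        self.send_queue_paused = True  # the pause is issued either way
        if stream.ignore_failure:
            return  # reconnect is skipped, so nothing resumes the queue
        self.reconnects_started += 1
        self.send_queue_paused = False  # assumption: reconnect resumes sending

def simulate(second_node_detects_late):
    """Return True if the second node ends up with a paused send queue."""
    old = Stream("old", failed=True)
    second = Node("node2", stream=old)
    if second_node_detects_late:
        # The first node's reconnect arrives before the second node
        # has noticed that the old stream failed.
        second.accept_new_stream(Stream("new"))
    second.on_failure_detected(old)
    return second.send_queue_paused

print(simulate(second_node_detects_late=True))   # True: queue stays paused
print(simulate(second_node_detects_late=False))  # False: normal reconnect
```

In the late-detection case the send queue stays paused, which corresponds to the stopped messages that prevent GUM from operating.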

If this issue does occur, the consequences for a DAG are severe: the cluster loses quorum and every database in the DAG is dismounted. As a result, we recommend that you deploy this hotfix to all of your Mailbox servers that are members of a DAG, especially if the DAG is stretched across datacenters. This hotfix can also benefit environments running Exchange 2007 Single Copy Clusters and Cluster Continuous Replication.

In addition to fixing the issue described above, KB2550886 includes other important Windows Server 2008 R2 hotfixes that are also recommended for DAGs:

Comments (22)
  1. William Holmes says:

    This helpful article comes about 3 weeks too late.  We experienced this issue and have in fact installed the hotfixes.  In addition to these fixes you may want to examine other aspects of your networking recommendations.  For instance, see support.microsoft.com/…/951037: the features mentioned in this KB all contributed to triggering the problems that the hotfixes address.  Disabling the features mentioned improved the stability and responsiveness of our entire Exchange Organization.

  2. daliu says:

    I take it from the kb's these are "Windows" clustering hotfixes & therefore won't be rolled up into Exchange 2010 SP2 later this year, correct?

  3. Marcus L says:

    This is a question for William Holmes, when you say "Disabling the features mentioned improved stability", which features exactly, all of them?

  4. Martijn says:

    Will this info be part of the Installation Guide Template – DAG Member? Then it would be clear which hotfixes to install along with the latest Windows 2008 R2 & Exchange 2010 Service Packs and Update Rollups.

  5. Rob A says:

    MSFT needs to update ExBPA so that we don't have to comb through articles like this for obscure fixes and optimizations.  ExBPA makes life easier for us and for PSS.  I don't think I have seen an update for ExBPA in a very long time.

  6. Brian Day [MSFT] says:

    @Rob A, ExBPA updates are released in Service Packs and Update Rollups. If you want to make sure you have the latest ExBPA ruleset in place then install the latest SP and rollup on the machine you are running the ExBPA from.

  7. Eugene says:

    In our environment, using latest drivers available for IBM x3550 M2 servers and firmware, we can only stabilize a high-throughput server by disabling NetDMA in each and every case.

  8. Eugene says:

    In fact, IBM has documented recommendations for many of their products to disable NetDMA. But since our drivers are the latest available you’d think we’d expect a feature so heavily recommended by Microsoft perf. tuning guides to fundamentally work, which it fundamentally doesn’t.

    www-304.ibm.com/…/docview.wss

  9. Serhad MAKBULOĞLU says:

    Thanks.

  10. andy says:

    tried to request the hotfix but got below:

    The system is currently unavailable. Please try back later, or contact support if you want immediate assistance

    When will the hotfix be available from WSUS?  We need some quality assurance from Microsoft in order to get it approved on production environment.  

  11. William Holmes says:

    For Marcus: Yes all of them.  NetDMA in particular seems to have caused cluster communications to be disrupted. This in turn caused a number of exchange problems as might be expected.

  12. Shabarinath says:

    Thanks for sharing this.

  13. Ryan says:

    Just checking but I assume the process of installing this KB is to put one node of the DAG in Database Maintenance, install KB, reboot server, stop Database Maintenance, and then repeat process for existing nodes?  Thanks in advance.

  14. Ross Smith IV says:

    @Ryan, yes, that is a good approach. Of course, prior to implementing the maintenance, it would be a good idea to verify that replication to all copies is healthy and up to date. If not, resolve the issues prior to undergoing maintenance.

    Ross

  15. Ryan says:

    After installing KB2550886 on my first DAG node, I am now receiving an error when trying to take this node out of DAG maintenance:

    [PS] D:\Program Files\Microsoft\Exchange Server\V14\Scripts>.\StopDagServerMaintenance.ps1

    cmdlet StopDagServerMaintenance.ps1 at command pipeline position 1
    Supply values for the following parameters:
    serverName: server01
    WARNING: [17:49:39.493 UTC] Call-ClusterExe: cluster.exe did not succeed, but 5 was not a retry-able error code. Not attempting any other servers. This may be an expected error by the caller.
    Log-Error : [17:49:39.493 UTC] Start-DagServerMaintenance: Failed to resume the ability of the server seacevs01 to host the Primary Active Manager, 'cluster.exe /cluster:server01 node server01 /resume' returned 5.
    At D:\Program Files\Microsoft\Exchange Server\V14\Scripts\StopDagServerMaintenance.ps1:125 char:13
    +             Log-Error <<<<  ($StopDagServerMaintenance_LocalizedStrings.res_0008 -f $serverName,$serverName,$shortServerName,$LastExitCode,"Start-DagServerMaintenance") -stop
        + CategoryInfo          : NotSpecified: (:) [Write-Error], WriteErrorException
        + FullyQualifiedErrorId : Microsoft.PowerShell.Commands.WriteErrorException,Log-Error

    Unable to move forward until I am able to Stop DAG maintenance on this server.  Any help would be appreciated.

    Thanks,

    Ryan

  16. Included in SP2? says:

    Is this patch included in Exchange 2010 SP2?

  17. the patch broke MSExchange Active Manager Client perf counter says:

    on all servers where the hotfix was applied

    social.technet.microsoft.com/…/58692b89-d83e-4f3a-b991-9bb38b8ccad0

    This drill fixed the issues

  18. HotFix says:

    Might I recommend someone update the link support.microsoft.com/…/2545685 to instruct users to install KB2550886 in lieu of KB2552040 if it supersedes it.

  19. BLynk says:

    Can you tell me the following? My environment has plenty of W2K8 R2 SP1 Enterprise servers. Clustering is not switched on or required for ALL of these servers. But I want to make this HF part of all future W2K8 R2 SP1 Enterprise server builds – regardless of whether the Clustering service is turned on, post build – does the installed HF still update or fix the Clustering service to the levels documented in this MS KB article? Thank you kindly!

  20. theNerdGoddess says:

    Has this fix been included in SP2? I have several clients affected by this and since this interim update will have to be uninstalled before patching I wanted to make sure it's included in SP2.

    Thanks.

  21. theNerdGoddess says:

    I looked into this further and since this deals with clustering components, it would be included in a Windows RU update I am assuming.  Do you know which update this will be in?

    Thanks.

  22. Phil Lovatt says:

    This will not be included in Exchange 2010 SP2 as they are Windows patches and not Exchange Patches.  Hope this helps.
