Recently, on two separate occasions, I had to help resolve an issue where members of an Exchange 2010 database availability group (DAG) failed to participate in the DAG's cluster communications and were therefore unable to bring any of their databases online. In both instances, this occurred after the server was rebooted. While each issue had a slightly different resolution, I am fairly confident that they are related. Since it took a while to isolate and resolve these issues, I thought I would share the experience.
Before I begin: in neither scenario did we lose quorum of the DAG. The symptoms of both scenarios were nearly identical:
- Viewing these servers in Failover Cluster Manager shows them with a STATUS of DOWN.
- Network connections for these members are listed as UNAVAILABLE.
- The Cluster service starts on these servers; however, the following event is logged in the System event log:
Log Name: System
Event ID: 1572
Task Category: Cluster Virtual Adapter
Description: Node 'SERVER' failed to join the cluster because it could not send and receive failure detection network messages with other cluster nodes. Please run the Validate a Configuration wizard to ensure network settings. Also verify the Windows Firewall 'Failover Clusters' rules.
- Attempting to view the Exchange DAG status or networks returns this error:
A server-side administrative operation has failed. 'GetDagNetworkConfig' failed on the server. Error: The NetworkManager has not yet been initialized. Check the event logs to determine the cause. [Server: SERVER5.Contoso.inc]
+ CategoryInfo : NotSpecified: (0:Int32) [Get-DatabaseAvailabilityGroup], DagNetworkRpcServerException
+ FullyQualifiedErrorId : A6AA817A,Microsoft.Exchange.Management.SystemConfigurationTasks.GetDatabaseAvailabilityGroup
- The cluster log shows:
WARN [API] s_ApiOpenGroupEx: Group Cluster Group failed, status = 70
DBG [HM] Connection attempt to SERVER01 failed with error WSAETIMEDOUT(10060): Failed to connect to remote endpoint 126.96.36.199:~3343~.
INFO [JPM] Node 7: Selected partition 33910(1 2 3 4 5 6 9 10 11 12 13 14) as a target for join
WARN [JPM] Node 7: No connection to node(s) (10 12). Cannot join yet
- The cluster validation report shows:
Node SERVER01.Contoso.inc is reachable from Node SERVER5.Contoso.inc by only one pair of interfaces. It is possible that this network path is a single point of failure for communication within the cluster. Please verify that this single path is highly available or consider adding additional networks to the cluster.
The following are all pings attempted from network interfaces on node SERVER5.Contoso.inc to network interfaces on node SERVER05.Contoso.inc.
- A network trace showed that cluster communication was in fact getting through to all other nodes on port 3343, and that responses were returned.
- There was no change in the errors even after disabling Windows Firewall and removing file-level antivirus and security products from the servers.
- Removing NIC teaming from the server did not help either.
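When a cluster log is long, it can help to pull out just the lines that say which peers a node could not reach. The following is a minimal sketch, assuming the log lines look like the samples quoted above; the regular expressions and the function name `summarize_cluster_log` are illustrative and not part of any Microsoft tooling, and other log formats may need different patterns.

```python
import re

# Patterns modeled on the sample cluster log lines quoted above (an assumption;
# real logs carry timestamps and thread IDs, which search() simply skips over).
TIMEOUT_RE = re.compile(r"Connection attempt to (\S+) failed with error (\w+)\((\d+)\)")
NO_CONN_RE = re.compile(r"Node (\d+): No connection to node\(s\) \(([\d ]+)\)")


def summarize_cluster_log(lines):
    """Collect failed connection targets and unreachable node IDs from log lines."""
    failed_targets = {}  # server name -> error name
    unreachable = {}     # joining node ID -> set of node IDs it cannot reach
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            failed_targets[match.group(1)] = match.group(2)
            continue
        match = NO_CONN_RE.search(line)
        if match:
            node = int(match.group(1))
            ids = {int(x) for x in match.group(2).split()}
            unreachable.setdefault(node, set()).update(ids)
    return failed_targets, unreachable


sample = [
    "WARN [API] s_ApiOpenGroupEx: Group Cluster Group failed, status = 70",
    "DBG [HM] Connection attempt to SERVER01 failed with error WSAETIMEDOUT(10060): "
    "Failed to connect to remote endpoint 126.96.36.199:~3343~.",
    "INFO [JPM] Node 7: Selected partition 33910(1 2 3 4 5 6 9 10 11 12 13 14) as a target for join",
    "WARN [JPM] Node 7: No connection to node(s) (10 12). Cannot join yet",
]
failed, unreachable = summarize_cluster_log(sample)
print(failed)
print(unreachable)
```

Run against the sample lines, this reports SERVER01 as a timed-out target and nodes 10 and 12 as unreachable from node 7, which is a quick way to see which peers to look at first.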
The first scenario occurred in our lab, which runs on Hyper-V. Based on Hyper-V's network summary output, I could see that the servers really were not communicating properly. Yes, they could ping each other and they could authenticate with the domain, but cluster communication was failing.
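Because a successful ping says nothing about a specific port, one quick sanity check is to attempt a TCP connection to port 3343 on each peer, the same kind of connection that timed out in the cluster log. This is only a sketch: the helper name `can_connect` is made up for illustration, the peer names are whatever you pass on the command line, and a successful connect covers just this one TCP path, not every kind of cluster traffic.

```python
import socket
import sys


def can_connect(host, port=3343, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and name resolution failures.
        return False


if __name__ == "__main__":
    # Peer names come from the command line, e.g.: python probe.py SERVER01 SERVER05
    for peer in sys.argv[1:]:
        state = "reachable" if can_connect(peer) else "NOT reachable"
        print(f"{peer}:3343 {state}")
```

Running it from the affected node against every other DAG member gives a simple per-peer picture to compare with the cluster log.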
The resolution was to configure the network settings consistently on all DAG members and to reset the Hyper-V network properties. This meant:
- Confirmed that the networks were identically configured on all DAG members (e.g. REPL / MAPI networks, TCP/IP settings, binding order, driver versions, etc.)
- Disabled IPv6 on the servers [NOTE: It is recommended to leave IPv6 enabled, even if you do not have an IPv6-enabled network! In most scenarios, disabling IPv6 on an Exchange 2010 server should be a last resort.]
- Edited the Hyper-V network properties page for this VM
- Once rebooted, all was working fine.
The second scenario occurred in production. Ultimately, we decided to change the IP address of the 'broken' DAG member and reboot the server again. This allowed the server to properly register its network connections with the cluster database (ClusDB), after which all other nodes were able to communicate with it. The DAG member then rejoined the DAG, and all databases were able to mount and/or replicate their copies successfully.
We also found that not all of the production DAG members had identical network settings (two DAG members did not have a REPL network configured). Per http://technet.microsoft.com/en-us/library/dd638104.aspx#NR, "each DAG member must have the same number of networks". We fixed the networks and updated the servers to include the recommended hotfixes: http://blogs.technet.com/b/dblanch/archive/2012/02/27/a-few-hotfixes-to-consider.aspx
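A mismatch like this is easy to miss across a dozen servers, so it may be worth comparing the settings mechanically. The sketch below assumes you have already collected the DAG network names per member by hand or with your own tooling; the inventory, the server names in it, and the function `find_inconsistent_members` are all hypothetical. It treats the most common configuration as the baseline and reports which networks each deviating member is missing.

```python
from collections import Counter


def find_inconsistent_members(dag_networks):
    """Map each member that deviates from the most common network set
    to the networks it is missing relative to that baseline."""
    as_sets = {member: frozenset(nets) for member, nets in dag_networks.items()}
    # Assumption: the configuration shared by most members is the intended one.
    baseline, _ = Counter(as_sets.values()).most_common(1)[0]
    return {
        member: sorted(baseline - nets)
        for member, nets in as_sets.items()
        if nets != baseline
    }


# Hypothetical, hand-entered inventory for illustration only.
inventory = {
    "SERVER01": ["MAPI", "REPL"],
    "SERVER02": ["MAPI", "REPL"],
    "SERVER03": ["MAPI"],
    "SERVER04": ["MAPI"],
    "SERVER05": ["MAPI", "REPL"],
}
print(find_inconsistent_members(inventory))
```

The same comparison could be extended to the other settings mentioned earlier (TCP/IP settings, binding order, driver versions) by adding them to the per-member data.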
Why did changing the IP address of the DAG member work? We are not exactly sure, but we believe that either a stale TCP route or something in the ClusDB was preventing any server with that IP address from joining the cluster.
Did you reboot all of the DAG member servers before or after changing the IP address? No, we did not want to risk losing another server within the DAG (we had already lost 2 of the 12 members). We did, however, reboot all of the servers in the lab scenario.
Did you ever lose quorum of the DAG? Nope.
Do you think that you could have prevented this? Maybe. If we had applied all of the recommended hotfixes and confirmed that all network settings were identical on all DAG members, the servers might not have run into this issue. There may be other causes, but it is always recommended to resolve the known issues first.