Exchange 2010: Collapsing DAG Networks


As a post-configuration step in an Exchange 2010 Database Availability Group installation, the administrator may need to collapse Database Availability Group networks.  Unfortunately this configuration step is commonly missed, and the result is log files replicating in an unexpected manner.

 

Let’s take a look at the following Exchange installation.

 

[Image: the example Exchange installation - four subnets across two data centers]

 

In this case we are dealing with a total of four subnets: two assigned to hosts in the primary data center and two assigned to hosts in the secondary data center.  Each of the MAPI networks is routable via default gateway settings.  Each of the replication networks is routable via appropriately established static routes.

 

When the Database Availability Group is established, the Failover Clustering services are leveraged for certain functions.  One of these functions is the enumeration of networks on nodes.  When the cluster service starts, the IP address bindings of each network card are reviewed and the subnets determined.  Failover Clustering then creates a cluster network for each subnet.  Nodes that have an IP address in a cluster network then have their network interface placed in the appropriate cluster network.  In this example there are four subnets; therefore Failover Clustering will enumerate four cluster networks.  Each of the individual cluster networks will contain two network interfaces, since each node has at least one network interface assigned to each subnet.
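These enumerated cluster networks can also be listed from a PowerShell session on one of the nodes. A minimal sketch (the FailoverClusters module ships with the Failover Clustering feature on Windows Server 2008 R2):

```powershell
# Load the failover clustering cmdlets.
Import-Module FailoverClusters

# One cluster network is created per discovered subnet.
Get-ClusterNetwork | Format-Table Name,Address,AddressMask,Role -AutoSize

# Show which node interfaces landed in each cluster network.
Get-ClusterNetworkInterface | Format-Table Node,Network,Address -AutoSize
```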

 

Here is an example of the cluster network enumeration as seen in Failover Cluster Manager.

 

[Image: cluster network enumeration in Failover Cluster Manager]

 

Here is an example of the network ports placed into a cluster network.

 

[Image: network interfaces placed into a cluster network]

 

The Exchange Replication Service enumerates the cluster networks as reported by the cluster service and establishes an initial set of Database Availability Group networks.  You can view the default Database Availability Group networks in the Exchange Management Console.  Since Failover Clustering reports four cluster networks, the default set of DAG networks is also four.  Here is an example:

 

[Image: the default four DAG networks in the Exchange Management Console]
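The same view is available from the Exchange Management Shell. A sketch, assuming the DAG is named DAG1:

```powershell
# List the DAG networks the replication service has enumerated,
# including their subnets, member interfaces, and replication setting.
Get-DatabaseAvailabilityGroupNetwork -Identity DAG1 |
    Format-List Name,Subnets,Interfaces,ReplicationEnabled
```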

 

In this example you can see the default four DAG networks.  Each DAG network, like each cluster network, has a network interface assigned from each host.  DAG networks are how the replication service determines what connectivity is available for log shipping activities.  Based on this DAG network topology the replication service knows the following about DAG node communications:

 

192.168.0.3 <-> 192.168.0.4

10.0.0.1 <-> 10.0.0.2

10.0.1.1 <-> 10.0.1.2

192.168.1.3 <-> 192.168.1.4

 

What is missing here is any relationship between the 192.168.0.X and 192.168.1.X subnets, as well as between the 10.0.0.X and 10.0.1.X subnets.  As of now the replication service has no idea how a node in 192.168.0.X can communicate with a remote node: can it do so on 192.168.1.X or 10.0.1.X?  In this situation we do not want DAG communications to fail, so we resort to DNS name resolution.  For example, when the server MBX-4 wants to replicate log files that are hosted on MBX-2, it looks at the DAG networks and determines that no network contains both MBX-4 and MBX-2; therefore the replication service cannot make a direct TCP connection to the known IP address for MBX-2.  Rather than fail replication, we issue a DNS query.  The DNS query should always return an IP address that corresponds to a MAPI network (replication networks should not be registered in DNS).  Therefore, the final connection from MBX-4 to MBX-2 is performed on IP address 192.168.0.3.  The replication network IS NEVER USED.

This behavior is different, though, for communications from MBX-2 to MBX-3.  If MBX-3 needs to pull log files from MBX-2, the replication service knows that 10.0.0.X can be used, since DAGNetwork02 contains both network ports.  Therefore, the replication service can bypass DNS name resolution and make a direct IP connection from 10.0.0.2 to 10.0.0.1 to pull logs from MBX-2 to MBX-3.

 

The administrator can correct this condition by appropriately collapsing the DAG networks.  In this example we know that the underlying routing topology allows for the following:

192.168.0.X <-> 192.168.1.X

10.0.0.X <-> 10.0.1.X

At this point we need to re-assign subnets to the appropriate DAG networks.  In this example we will take the 10.0.1.X subnet from DAGNetwork05 and move it to DAGNetwork02.  This will leave an empty DAGNetwork05, which can be deleted.  We will also take the 192.168.1.X subnet from DAGNetwork04 and move it to DAGNetwork01.  This will leave an empty DAGNetwork04.  The following example shows the desired final DAG network layout.

 

[Image: the desired final DAG network layout]
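The same subnet moves can be made from the Exchange Management Shell. A sketch, assuming the DAG is named DAG1, the subnets are /24s, and the empty networks carry the names from this example (adjust all of these to match your environment):

```powershell
# Move 10.0.1.X into DAGNetwork02 alongside 10.0.0.X.
Set-DatabaseAvailabilityGroupNetwork -Identity DAG1\DAGNetwork02 `
    -Subnets 10.0.0.0/24,10.0.1.0/24

# Move 192.168.1.X into DAGNetwork01 alongside 192.168.0.X.
Set-DatabaseAvailabilityGroupNetwork -Identity DAG1\DAGNetwork01 `
    -Subnets 192.168.0.0/24,192.168.1.0/24

# Remove the now-empty DAG networks.
Remove-DatabaseAvailabilityGroupNetwork -Identity DAG1\DAGNetwork04
Remove-DatabaseAvailabilityGroupNetwork -Identity DAG1\DAGNetwork05
```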

 

Once this is done we will disable replication on the MAPI network, allowing only the replication network to service log shipping activities initially.  Why disable the MAPI network from log shipping activities?  Remember that if no other network exists in a DAG to replicate log files, we will utilize the MAPI network for log shipping.  If the MAPI network is replication enabled, then when the replication service is choosing a network to perform log shipping it considers the MAPI network at the same weight as identified replication networks.  By disabling the MAPI network it is no longer considered at the same weight, and therefore all initial log shipping activities are balanced between the enumerated replication networks.
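Disabling replication on the MAPI network is a one-liner in the shell. A sketch, again assuming the DAG is named DAG1 and that DAGNetwork01 is the MAPI network:

```powershell
# Leave the MAPI network in place for cluster and client traffic,
# but take it out of consideration for log shipping.
Set-DatabaseAvailabilityGroupNetwork -Identity DAG1\DAGNetwork01 `
    -ReplicationEnabled:$false
```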

 

You can run Get-MailboxDatabaseCopyStatus * -ConnectionStatus | fl Name,OutgoingConnections,IncomingLogCopyingNetwork to view the networks that are being utilized for inbound and outbound operations.

 

[Image: Get-MailboxDatabaseCopyStatus output showing connections on DAGNetwork02]

 

In this example you can see that all incoming and outgoing connections are occurring on DAGNetwork02.

You can also review netstat -an output and see that log copying activities are occurring on the 10.0.0.X network utilizing port 64327 (the default DAG replication port).
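To narrow the listing to just the replication port, the output can be filtered; for example:

```powershell
# Show only sockets on the default DAG replication port.
netstat -an | findstr 64327
```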

 

[Image: netstat -an output showing connections on port 64327]

 

By collapsing DAG networks you can ensure that the replication service functions in an optimized fashion.


Comments (29)

  1. Anonymous says:

    @CCP:

    This network can be safely removed.  I recommend using the Exchange Management Console, under org management -> mailbox -> database availability group tab.

    TIMMCMIC

  2. Anonymous says:

    @Vish…

    At this point you really need to upgrade to a minimum of SP2.

    TIMMCMIC

  3. Anonymous says:

    @Bernie…

This is a great question.  In general, if there's no additional redundancy provided by having the additional network, it probably provides little if any value in the overall installation.

    TIMMCMIC

  4. Anonymous says:

    very nice indeed

  5. Anonymous says:

    @EPM…

Sorry for the delay in responding.  At this time there should not be any IPv6 DAG networks.  Keep in mind that DAG networks do not control how clients actually connect to Exchange.

In most cases you end up with a generic IPv6 network when you've made some underlying network port change.  It can be safely deleted.

    TIMMCMIC

  6. Anonymous says:

Hec, as Tim said in his previous comments, the easiest is to do it via the EMC:

go to the properties of the subnet, edit and add the other subnet, apply and OK.

It will automatically bring up the network interfaces of the other node.

  7. Anonymous says:

    @Gangaiyan…

    The easiest thing to do is use the Exchange Management Console.

    Timmcmic

  8. Anonymous says:

    Hi Tim,

    Thanks for sharing it, its very nice article.

    I had the same configuration, we have 4 sites each site have dedicated NIC for Replication and MAPI. Here my issue is log shipping happening through MAPI network.

    I am planning to reconfigure the DAG network as per your recommendation, so do we need to configure using set-databaseavailabilitynetwork command or is there any other way to configure.

    Please help me.

    Gangaiyan

  9. Anonymous says:

    @TurboMCP:

    I'm glad you enjoyed it.  The command is set-databaseavailabilitygroup -identity <DAGNAME> -discoverNetworks.

    TIMMCMIC

  10. Anonymous says:

    @Mail Maven…

    You can adjust the DAG networks at anytime.  The actual change does not take effect until you suspend and resume replication.

    TIMMCMIC

  11. Anonymous says:

    This Paragraph:

    At this point we need to re-assign subnets to the appropriate DAG networks.  In this example we will take the 10.0.1.X subnet from DAGNetwork05 and move it to DAGNetwork02.  This will leave an empty DAGNetwork05 which can be deleted.  We will also take the 192.168.1.X from DAGNetwork02 and move it to DAGNetwork01.  This will leave an empty DAGNetwork02.  The following example shows the desired final DAG network layout.

    Should Read:

    At this point we need to re-assign subnets to the appropriate DAG networks.  In this example we will take the 10.0.1.X subnet from DAGNetwork05 and move it to DAGNetwork02.  This will leave an empty DAGNetwork05 which can be deleted.  We will also take the 192.168.1.X from DAGNetwork04 and move it to DAGNetwork01.  This will leave an empty DAGNetwork04.  The following example shows the desired final DAG network layout.

    Thanks for an awesome blog Tim.

  12. turbomcp says:

    Great article(as always:))

just one question: I remember there is a switch or command to make the DAG re-enumerate all the networks again (let's say you deleted them all and want to re-enumerate).

Do you remember what that command or switch is?

    Thanks

  13. CCP says:

    Awesome article (like all your others).

    When our environment was built they had IPv6 on so, there is a DAG Network (#3) that is not used and says this:  DAGNetwork03   {{fe80::/64,Unknown}}.

    Can it be deleted?, any possible issue?  PS command to do that?

    Thank you.

  14. paul says:

    When I do this in one of my data centers, the separate sites cannot replicate to each other until i put the replication networks into separate networks, and then the issue resolves itself.

  15. vish says:

Your article is great and explains things in depth. I have completed 2 DAGs on the primary site and established a 3rd one on a DR site separated by a WAN. The Exchange replication service crashes continuously on this with a 4999 error along with a 2060 error. Tried several things but nothing is working; we are on SP1 and Microsoft support suggests SP2 could resolve the crash issue. We would like to avoid the upgrade at this point in time because of the production downtime involved. Any suggestions, please

    Regards,

    Vish

  16. vish says:

The crash is only on the DR site; the DAGs on the primary site are working fine.

  17. Bernie says:

    What's the value of a replication network if it's not got its own physical fabric?

  18. Bernie says:

    (we're looking at a set-up like your top picture, but there's only one link between our DCs)

  19. Mail Maven says:

I just came across this article while troubleshooting replication using only MAPI networks.  Thank you so much for posting this information!  I am so surprised to discover that this step to collapse networks was completely missed during original configuration. Can these steps to collapse networks and run set-databaseavailabilitygroup be done at any time without impact to the production environment?

  20. MIM says:

    Tim

    Thanks for the great article, it helped me a lot.

    I have configured the DAGNetwork just like what you have said in the article.

    And for the Disaster Recovery simulation, I have shut down all the servers in our Primary site.

I made the DR site active by following the procedure provided on TechNet (DAC mode is on).

    After switching to DR site, I encountered one problem.

I have run "Test-ReplicationHealth" and the status of "Cluster Network" is "Failed". It says as follows:

    subnet "192.168.0.X" on network "DAGNetwork01" is not up. Current Status is unknown

    subnet "10.0.0.X" on network "DAGNetwork02" is not up. Current Status is unknown.

I think it is because I have put the Primary site subnet and the DR site subnet into one "DAGNetwork", both for the MAPI and Replication networks.

    I have already switched to DR site.

    Could you tell me what I should do from here?

    Should I delete these subnets from "DAGNetwork" when DR site is Active?

  21. Hec says:

Great article, but you failed to explain how to reassign the subnets.

  22. EPM says:

Do we need an IPv6 network for DirectAccess clients?  In this case is it OK to collapse the IPv6 network into IPv4?

  23. funky brano says:

    Thanks !

  24. Harry says:

    Hi Tim, great post. However, after collapsing my replication network, I see lots of the following errors by running the following command:

    Get-MailboxDatabaseCopyStatus -ConnectionStatus -server note1 |ft databasename, incominglogcopyingnetwork -AutoSize -Wrap

    the errors we got are:

    DB2 {node2,DAGNetwork04,Communication was terminated by server ‘node2’: Data could not be read because the communication channel was closed.}

    DB4 {node2,DAGNetwork04,An error occurred while communicating with server ‘node2’. Error: Unable to read data from the transport connection: A connection attempt failed because the connected party did not properly respond after a period of time, or established
    connection failed because connected host has failed to respond.}

    …..

    sometimes it happened to lots of DBs, sometimes, just some of the DBs. sometimes, DB copy and replay have higher queues, sometimes there are almost no queues.

    any idea what happened and how to troubleshoot and fix the issue?

    the following technet post suggest that we should disable replication network as a workaround? I am confused.

    http://blogs.technet.com/b/samdrey/archive/2014/09/10/exchange-2010-dag-replication-issue-msexchangerepl-2153-potentially-an-error-while-trying-to-update-databasecopy.aspx

    Thanks,
    Harry

    Setup

    Exchange 2010 SP3 rollup7 on Windows 2008 R2 SP1 with Windows patches running quarterly
    one DAG with 4 DAG members, a witness server (HUB) and an alternate witness server (HUB).

Both node 1 and node 2 are located on a local switched (10 GB) LAN.
Node 3 and node 4 are stretched across a WAN with 1 GB bandwidth; ping latency is less than 9 ms across the WAN connection and ping between node 1 and node 2 is less than 1 ms.

  25. TIMMCMIC says:

    @Harry…

    If this is the first time that you’ve collapsed the networks it’s probably the first time that you’re actually using them cross datacenters.

    Disabling the replication network is really not a workaround or a solution (that’s more like just ignoring a problem actually exists).

When you run -ConnectionStatus and you see an error there, what you are seeing is the last error that interrupted replication. The text means literally what happened: the communications channel was closed. This usually means an intermediary network device, firewall, or something else has lost the connection between the two sites. Thankfully this is self-healing and replication will just re-establish the connection, but nonetheless this probably points to an underlying networking issue based on my experience.

It is becoming more common overall, though, for people to go to a single-network DAG. Having a separate network for log shipping really does not add enough value to justify the additional complexity.

    TIMMCMIC

  26. Harry says:

    Hi Tim,

    Thank you for quick reply.

Yes, this is the first time I collapsed the replication network and disabled replication on the MAPI interface. By the way, should we collapse the MAPI network as well, since the MAPI networks are routable?

However, both node 1 and node 2 are on a local switched LAN; there is no firewall at all. I will run a packet capture to see what's going on between the servers' networks.

    Thanks,
    Harry

  27. Harry says:

    How about this error

    DB5 {Note2,DAGNetwork04,An error occurred while communicating with server ‘node2’. Error:
    Unable to read data from the transport connection: An existing connection was forcibly closed by
    the remote host.}

Why was an existing connection forcibly closed by the remote host?

    Thanks again

  28. TIMMCMIC says:

    @Harry…

This would usually indicate a networking issue in general. It could also be things like file-level antivirus, outdated NIC drivers, etc.

    TIMMCMIC