Failover Cluster can delay DNS Registration for 10 minutes

Given the following scenario:

  • Two Windows 2008 Servers running as an Active/Passive Exchange 2007 CCR Cluster.
  • Each node is part of a multi-site cluster, is located in a different subnet, and points to a different DNS server.
  • The IP address needs to change in DNS each time a node fails over so that clients obtain the correct information when they query DNS.

When the Exchange Service is first failed over, the Client Access Point (IP Address and Network Name) updates DNS with the correct IP address for the new active node. However, each subsequent failover may delay the DNS registration update for up to 10 minutes. These delays can cause clients to lose connectivity with Exchange, and we all know that means trouble.

The details of the behavior are as follows. When you bring the Client Access Point online for the first time, it registers itself with a DNS server listed in the local machine’s TCP/IP properties. It also records a timestamp for the successful registration under the private properties of the Network Name.

C:\> cluster res "Cluster Name" /priv

Resource         Name                  Value
------------     -----------------     ----------------------
Cluster Name     LastDNSUpdateTime     10/20/2008 11:46:39 AM
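
If you want to work with this timestamp in a script, the value can be pulled out of the command output. The following is only a minimal sketch: the function name parse_last_dns_update is made up for illustration, and it assumes the output layout and the US-style date format shown above, which may differ on systems with other regional settings.

```python
import re
from datetime import datetime

# Sample output copied from the listing above; real output may vary by locale.
SAMPLE = """\
Resource         Name                  Value
------------     -----------------     ----------------------
Cluster Name     LastDNSUpdateTime     10/20/2008 11:46:39 AM
"""


def parse_last_dns_update(output: str) -> datetime:
    """Return the LastDNSUpdateTime value found in the command output."""
    # Assumption: month/day/year date followed by a 12-hour time with AM/PM.
    match = re.search(
        r"LastDNSUpdateTime\s+"
        r"(\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}:\d{2} [AP]M)",
        output,
    )
    if match is None:
        raise ValueError("LastDNSUpdateTime not found in output")
    return datetime.strptime(match.group(1), "%m/%d/%Y %I:%M:%S %p")


print(parse_last_dns_update(SAMPLE))
```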

If a failover (or an offline/online cycle) occurs, the Cluster Service checks this timestamp. If the failover falls within one hour (60 minutes) of the last registration, the Client Access Point on the node that is becoming active waits 10 minutes before registering the new IP address in DNS. Unlike previous versions, Windows 2008 does not write to a Cluster Log file as it runs. You can generate a Cluster Log with the command cluster log /gen, which creates one on each node of the cluster in the C:\WINDOWS\CLUSTER\REPORTS folder. In this CLUSTER.LOG, you will see an entry similar to this:

Client Access Point registers when comes online
2008/07/16-21:33:14.405 INFO [RES] Network Name <2008-Cluster>: 
    Bringing resource online...
2008/07/16-21:33:14.405 INFO [RES] Network Name <2008-Cluster >: 
    TimerQueueTimer rescheduled to fire after 600 secs
2008/07/16-21:33:20.254 INFO [RES] Network Name <2008-Cluster >: 
    Re-registering DNS records time period (4 secs ) between last 
    registration and now is greater than 86400
2008/07/16-21:33:26.593 INFO [RES] Network Name <2008-Cluster >: 
    Network Name 2008-Cluster is now online

The registration above succeeded when the resource came online, so the LastDNSUpdateTime value was updated with the current time. If you then move the Exchange Service Application to the other node, you will see the delay, because the Network Name checks the LastDNSUpdateTime value and postpones registration if it falls within the time period:

Client Access Point delays 10 minutes
2008/07/16-21:53:19.084 INFO [RES] Network Name <2008-Cluster>: 
    Bringing resource online...
2008/07/16-21:53:19.084 INFO [RES] Network Name <2008-Cluster >: 
    TimerQueueTimer rescheduled to fire after 600 secs
2008/07/16-21:53:26.174 INFO [RES] Network Name <2008-Cluster >: 
    Postponing DNS registrations to post online...
2008/07/16-21:53:32.302 INFO [RES] Network Name <2008-Cluster >: 
    Network Name JOHNGROUP is now online
*** 10 minutes later ***
2008/07/16-22:03:19.074 INFO [RES] Network Name <2008-Cluster >: 
    Re-registering DNS records time period (4 secs ) between last 
    registration and now is greater than 86400
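
To spot these postponed registrations in a generated log without reading it line by line, you can search for the "Postponing DNS registrations" message. The sketch below is illustrative only: the function name find_postponed_registrations is hypothetical, and it assumes the timestamp and message formats match the excerpts above. It works on text you have already read from the log, and it accepts the message either on the timestamped line or on a wrapped line beneath it.

```python
import re
from datetime import datetime
from typing import List

# Message text as it appears in the log excerpt above.
POSTPONE_MARKER = "Postponing DNS registrations to post online"
# Assumption: entries start with a timestamp like 2008/07/16-21:53:26.174
TIMESTAMP = re.compile(r"^(\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s")

SAMPLE_LOG = """\
2008/07/16-21:53:19.084 INFO [RES] Network Name <2008-Cluster>:
    Bringing resource online...
2008/07/16-21:53:26.174 INFO [RES] Network Name <2008-Cluster >:
    Postponing DNS registrations to post online...
"""


def find_postponed_registrations(log_text: str) -> List[datetime]:
    """Return the timestamp of each entry that postponed DNS registration."""
    hits = []
    current = None  # timestamp of the entry currently being read
    for line in log_text.splitlines():
        match = TIMESTAMP.match(line)
        if match:
            current = datetime.strptime(match.group(1), "%Y/%m/%d-%H:%M:%S.%f")
        # The message may be on the timestamped line or on a wrapped line.
        if POSTPONE_MARKER in line and current is not None:
            hits.append(current)
    return hits


for stamp in find_postponed_registrations(SAMPLE_LOG):
    print("DNS registration postponed at", stamp)
```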

In previous versions of Windows Cluster Server, a Network Name registered with DNS every time it came online. When the resource goes offline and online repeatedly, this can become very “chatty” with the DNS servers. If the registration process is slow, it also delays the Network Name from coming online. Because of this, Windows 2008 was designed to cut down on this DNS traffic and to make the name online process a little more streamlined.

The postponed registration could cause quite a problem if you had issues with nodes failing over on a regular basis, but for the most part, failing over more than once an hour is not the norm.

Windows 2008 Failover Clusters allow nodes to be placed in different subnets to better support multi-site configurations. With the newer “streamlined” Network Name online process, this can cause a delay for clients that need to find the new IP address. Even after the Client Access Point changes the IP address in DNS, client machines will not be able to find it until their DNS cache flushes the old IP address. There is a Knowledge Base article that addresses this side of the issue:

Description of what to consider when you deploy Windows Server 2008 failover cluster nodes on different, routed subnets

The server-side delay of 10 minutes for re-registration is by design and is not configurable. You will have to factor this in when you fail over nodes for maintenance, etc. Looking at the bigger picture, if your Service/Application is failing over more than once per hour, you probably have bigger issues than registration in DNS.
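
When planning a maintenance failover, the timing rule described in this article can be turned into a quick estimate. This is a rough sketch, not how the Cluster Service itself is implemented: the function name latest_dns_registration and both constants are illustrative, the 60-minute and 10-minute values come from the description above, and whether a failover at exactly 60 minutes counts as "within" the window is an assumption.

```python
from datetime import datetime, timedelta

# Values taken from the behavior described in this article (assumptions,
# not read from any cluster API).
RECENT_WINDOW = timedelta(minutes=60)   # failovers this soon are postponed
POSTPONE_DELAY = timedelta(minutes=10)  # maximum postponement of registration


def latest_dns_registration(last_update: datetime, failover_at: datetime) -> datetime:
    """Estimate the latest time DNS is re-registered after a failover."""
    if failover_at - last_update <= RECENT_WINDOW:
        # Too soon after the previous registration: expect up to a 10-minute wait.
        return failover_at + POSTPONE_DELAY
    # Outside the window: registration happens as the name comes online.
    return failover_at


last_update = datetime(2008, 10, 20, 11, 46, 39)
print(latest_dns_registration(last_update, datetime(2008, 10, 20, 12, 0, 0)))
print(latest_dns_registration(last_update, datetime(2008, 10, 20, 13, 0, 0)))
```

With the LastDNSUpdateTime shown earlier (11:46:39 AM), a failover at 12:00 could leave DNS stale until 12:10, while a failover at 1:00 PM falls outside the window and should register right away.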

– Steven Martin