DNS Clients and Timeouts (part 2)


In the first part of this blog post I described the behavior of the DNS client when there are multiple entries in the DNS servers list. In this second part I will try to explain how the Windows DNS Client works when dealing with timeouts and retries.

Note: From now on when I refer to DNS Client I am referring to the Windows implementation of this service. I will also highlight the variances in different versions of Windows when applicable.

How long is each timeout and how many times do we retry a query?

Instead of a single fixed timeout for every attempt, the DNS client uses an array in which each element defines the timeout for one attempt, so the timeout is not necessarily the same from one attempt to the next. The query is retried as many times as there are elements in the array, and the process stops when the elements run out.

For example, suppose that the Timeout array has these values: [1, 1, 2, 4, 4]. The first attempt to resolve a query will time out after 1 second (first element of the array), the second attempt will time out after 1 second (second element of the array), the third after 2 seconds, the fourth after 4 seconds, and so on. As there are 5 elements in the array, we will stop retrying after 5 attempts.
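The array-driven retry described above can be pictured with a small Python sketch. This is only an illustration of the idea, not the actual Windows code: `resolve_with_retries` and the `send_query` callback are hypothetical names, and the callback is assumed to return None when an attempt times out.

```python
def resolve_with_retries(send_query, timeouts=(1, 1, 2, 4, 4)):
    """Try the query once per array element; return (answer, attempts_used).

    send_query is a hypothetical callback: send_query(attempt, timeout)
    returns an answer, or None if nothing arrived within `timeout` seconds.
    """
    for attempt, timeout in enumerate(timeouts, start=1):
        answer = send_query(attempt, timeout)
        if answer is not None:
            return answer, attempt
    # No elements left in the array: give up and report an error.
    raise TimeoutError(f"no answer after {len(timeouts)} attempts")


if __name__ == "__main__":
    # Fake transport that only answers on the third attempt.
    def fake_send(attempt, timeout):
        print(f"attempt {attempt}: waiting up to {timeout}s")
        return "192.0.2.10" if attempt == 3 else None

    print(resolve_with_retries(fake_send))
```

The number of retries is never stated separately; it falls out of the length of the array.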

The pre-defined timeout arrays for the currently supported versions of Windows are (values shown are in seconds):


OS                                     Timeout
                                       [0]  [1]  [2]  [3]  [4]
Windows XP                              1    1    2    4    7
Windows Server 2003                     1    1    2    4    4
Windows Vista & Windows Server 2008     1    1    2    4    4
Windows 7 & Windows Server 2008 R2      1    1    2    4    4

These timeouts can be customized using the registry value HKLM\System\CurrentControlSet\Services\dnscache\Parameters\DNSQueryTimeouts. This value does not exist by default, in which case the pre-defined array just mentioned is used. If the value is defined, it should have a type of REG_MULTI_SZ (multi-line string), with each line containing one value of the array and the last line containing a 0 to indicate the end of the list.

image

If the value in any line is higher than 30, a value of 30 is used for that line instead. If the sum of the values is higher than 120 (2 minutes), the list is shortened from the bottom up, removing values until the total timeout is less than 120.
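As a rough illustration of these rules, the following Python sketch turns the lines of a DNSQueryTimeouts value into the array that would be used. The function name `normalize_timeouts` is made up for this example. How a total of exactly 120 is treated is an assumption here, since the description above does not settle that boundary.

```python
MAX_PER_ATTEMPT = 30   # seconds: cap applied to each line
MAX_TOTAL = 120        # seconds: cap applied to the sum of all lines


def normalize_timeouts(lines):
    """Turn REG_MULTI_SZ lines such as ["1", "1", "2", "0"] into a timeout list."""
    timeouts = []
    for line in lines:
        value = int(line)
        if value == 0:              # a 0 marks the end of the list
            break
        timeouts.append(min(value, MAX_PER_ATTEMPT))
    # Shorten the list from the bottom up while the total is too high.
    # Assumption: a total of exactly 120 is kept as-is.
    while timeouts and sum(timeouts) > MAX_TOTAL:
        timeouts.pop()
    return timeouts


if __name__ == "__main__":
    print(normalize_timeouts(["1", "1", "2", "4", "4", "0"]))
    print(normalize_timeouts(["50", "50", "50", "25", "25", "0"]))
```

In the second call every 50 is first capped at 30, giving a total of 140, so the last value is dropped and four attempts remain.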

Which DNS Servers do we query in each attempt?

The DNS Client queries the following DNS servers in each attempt:

  1. In the first attempt we query the preferred DNS server of the preferred network adapter only. The preferred DNS server is the first server listed for that adapter. The adapters are sorted based on their binding order and the preferred adapter is the one at the top of the list. You can change this binding order by opening ncpa.cpl and going to the menu Advanced/Advanced Settings…

    image

    Adapters that are disabled, disconnected, do not have TCP/IP enabled or have no DNS servers listed are ignored.
    In this attempt we query one DNS server only.
  2. If the previous attempt times out, retry the query with the next best DNS server for all the adapters. The next best DNS server for an adapter is the next on its list that has not already been queried and timed out.
    Note that the lists are managed as circular lists: once we reach the end of the list for an adapter, the next best server for that adapter will be the first on its list.
    In this attempt we query one DNS server per adapter.
  3. If the previous attempt times out (because none of the DNS servers queried in the previous step answered in the expected time), retry the query with the next best DNS server for all the adapters.
    In this attempt we query one DNS server per adapter.
  4. If the previous attempt times out, retry the query with all the possible DNS servers in all the adapters. This includes even servers that have already timed out in the previous steps.
    In this attempt we query all the DNS servers in all the adapters.
  5. Repeat step (4) until we have run out of attempts. If there are no more attempts then return an error to the caller.
    In this attempt we query all the DNS servers in all the adapters.

It is important to clarify something for steps 2 to 5 as multiple servers are queried in those steps: as long as one of the servers queried in that attempt responds, with either a positive or negative answer, then the query is considered resolved. It is OK if the other servers queried do not respond as we already have an answer which is what we wanted.

Using the information about the default Timeout array and the retry logic just described, we can see that each attempt will, by default, time out after:

  • First attempt (step 1): Preferred DNS server on the preferred adapter: times out after 1 second
  • Second attempt (step 2): Next best server for all the adapters: times out after 1 second
  • Third attempt (step 3): Next best server for all the adapters: times out after 2 seconds
  • Fourth attempt (step 4): All DNS servers in all the adapters: times out after 4 seconds
  • Fifth attempt (step 5): Repeat of step (4): times out after 4 seconds (7 seconds in Windows XP)

After the fifth attempt times out there are no more elements in the array to use, so we stop the query and return an error to the caller. As our default Timeout array has 5 elements, we try to resolve the query in 5 attempts. The total waiting time before we give up on the query is 12 seconds (15 seconds in XP), which is the sum of all the values in the array.
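The arithmetic can be checked with a few lines of Python. The arrays are the defaults listed earlier; the helper name `attempt_start_offsets` is only illustrative.

```python
from itertools import accumulate

DEFAULT_TIMEOUTS = {
    "Windows XP": [1, 1, 2, 4, 7],
    "Windows Server 2003": [1, 1, 2, 4, 4],
    "Windows Vista & Windows Server 2008": [1, 1, 2, 4, 4],
    "Windows 7 & Windows Server 2008 R2": [1, 1, 2, 4, 4],
}


def attempt_start_offsets(timeouts):
    """Seconds after the first query at which each attempt is sent."""
    if not timeouts:
        return []
    # Each attempt starts once all the earlier timeouts have elapsed.
    return [0, *accumulate(timeouts[:-1])]


if __name__ == "__main__":
    for os_name, timeouts in DEFAULT_TIMEOUTS.items():
        print(f"{os_name}: attempts start at {attempt_start_offsets(timeouts)} s, "
              f"give up after {sum(timeouts)} s")
```

For the non-XP defaults the attempts start at 0, 1, 2, 4 and 8 seconds and the caller gets the error at 12 seconds. On Windows XP the start times are the same but the error arrives at 15 seconds.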

Important note: the DNS servers list is kept in memory by the dnscache service. The next best server is determined based on a priority. All the servers start with the same priority and are sorted for each adapter in the order in which they were configured. Each time a server times out its priority is reduced, and when a server answers its priority is boosted (error conditions also modify the priority of a server). The next best server for an adapter is the one with the highest priority; if more than one server has the same priority, the next best is the one that is higher in the precedence list.
It is important to note that this prioritized list is kept across queries: the priorities are not reset after each query but are reused. The idea is that if a server timed out on a recent query, the next query will go first to another server with a higher priority. The effect is that the preferred DNS server might not be the first to get the next query if it recently timed out.
These priorities are reset to the initial default values after an interval named ServerPriorityTimeLimit, defined in the registry. See http://support.microsoft.com/kb/320760 for more information about this value.
An example of this behavior is a client pointing to two DNS servers: DNS1 and DNS2. The client tries to resolve a name and DNS1 times out but DNS2 answers. The next query that this client tries to resolve is going to go to DNS2 first before being retried on DNS1, because DNS2 now has a higher priority than DNS1.
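A minimal Python model of this bookkeeping, using the DNS1/DNS2 example, might look like the sketch below. The class name, the method names and the size of each priority adjustment (one step up or down) are assumptions made for illustration. The real service's values are not described here, and error conditions are left out.

```python
class AdapterServers:
    """Toy model of the per-adapter priority list kept between queries."""

    def __init__(self, servers):
        self.servers = list(servers)                 # configured (precedence) order
        self.priority = {s: 0 for s in servers}      # all servers start equal

    def next_best(self):
        # max() returns the first maximal item, so a tie goes to the server
        # that is higher in the configured order.
        return max(self.servers, key=lambda s: self.priority[s])

    def record_timeout(self, server):
        self.priority[server] -= 1                   # assumed step size

    def record_answer(self, server):
        self.priority[server] += 1                   # assumed step size

    def reset(self):
        # Stands in for the reset after the ServerPriorityTimeLimit interval.
        self.priority = {s: 0 for s in self.servers}


if __name__ == "__main__":
    nic = AdapterServers(["DNS1", "DNS2"])
    print(nic.next_best())        # DNS1: equal priorities, first in the list
    nic.record_timeout("DNS1")
    nic.record_answer("DNS2")
    print(nic.next_best())        # DNS2: it now has the higher priority
    nic.reset()
    print(nic.next_best())        # DNS1 again after the reset
```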

Making sense of all this information with an example

Suppose we have a computer named CLIENT1 running Windows Server 2003 that has 4 NICs with the binding order NIC1, NIC2, NIC3 and NIC4. The DNS servers list in CLIENT1 is:

NIC1            NIC2            NIC3            NIC4
→ 10.110.1.1    → 10.120.1.1    → 10.130.1.1    → 10.140.1.1
  10.110.1.2                      10.130.1.2      10.140.1.2
  10.110.1.3                      10.130.1.3
  10.110.1.4

The next best server for each adapter is indicated by the “→” symbol. As we are just starting, the next best DNS server for each adapter is the first in its list. Instead of showing how the priorities are modified after each timeout, we are going to use a circular list to select the next best server (the effect is the same, and the example is easier to follow).

We also have a Timeout array in CLIENT1 that looks like this:

Timeout    → 1     1     2     4     4

The “→” symbol indicates the value to use as the timeout for our next attempt. As we are just starting to resolve a query, we are at the first element of the array.

A process in CLIENT1 needs to resolve a name. Assume that none of the configured DNS servers are reachable, so all the attempts time out.

  1. First attempt: Send query to 10.110.1.1 (best for NIC1, which is the preferred adapter). Wait at most 1s (the current value in the Timeout array) for an answer.
    After this attempt times out, our DNS table and Timeout array will look like this (the “→” markers have moved to reflect the new state):

    NIC1            NIC2            NIC3            NIC4
      10.110.1.1    → 10.120.1.1    → 10.130.1.1    → 10.140.1.1
    → 10.110.1.2                      10.130.1.2      10.140.1.2
      10.110.1.3                      10.130.1.3
      10.110.1.4

    Timeout      1   → 1     2     4     4

  2. Second attempt (or first retry): Send query to 10.110.1.2 (next best for NIC1), 10.120.1.1 (next best for NIC2), 10.130.1.1 (next best for NIC3) and 10.140.1.1 (next best for NIC4). Wait at most 1s (the current value in the Timeout array) for an answer from any of the servers queried.
    After this attempt times out, the DNS table and Timeout array will look like this:

    NIC1            NIC2            NIC3            NIC4
      10.110.1.1    → 10.120.1.1      10.130.1.1      10.140.1.1
      10.110.1.2                    → 10.130.1.2    → 10.140.1.2
    → 10.110.1.3                      10.130.1.3
      10.110.1.4

    Timeout      1     1   → 2     4     4

  3. Third attempt: Send query to 10.110.1.3, 10.120.1.1 (NIC2 has just one DNS server listed, so this server is always the best server for it), 10.130.1.2 and 10.140.1.2. Wait at most 2s for an answer from any of the servers queried.
    After this attempt times out, the DNS table and Timeout array will look like this:

    NIC1            NIC2            NIC3            NIC4
      10.110.1.1    → 10.120.1.1      10.130.1.1    → 10.140.1.1
      10.110.1.2                      10.130.1.2      10.140.1.2
      10.110.1.3                    → 10.130.1.3
    → 10.110.1.4

    Timeout      1     1     2   → 4     4

  4. Fourth attempt: Send query to all DNS servers in all the adapters (including those that timed out in previous attempts). Wait at most 4s for an answer from any of the servers queried.
    At the end of the waiting time for this attempt the Timeout array will look like this (the DNS list table is not included as we will not use it again in this example):

    Timeout      1     1     2     4   → 4

  5. Fifth attempt: Send query to all DNS servers in all the adapters. Wait at most 4s for an answer from any of the servers queried.
    After this attempt times out we have run out of values in the Timeout array, so we give up and return an error to the caller.
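The five attempts above can be reproduced with a short Python simulation. It uses the same simplification as the example (a circular list per adapter instead of real priorities) and assumes that every attempt times out. The function name `plan_attempts` is invented for this sketch.

```python
ADAPTERS = {  # binding order: NIC1 is the preferred adapter
    "NIC1": ["10.110.1.1", "10.110.1.2", "10.110.1.3", "10.110.1.4"],
    "NIC2": ["10.120.1.1"],
    "NIC3": ["10.130.1.1", "10.130.1.2", "10.130.1.3"],
    "NIC4": ["10.140.1.1", "10.140.1.2"],
}


def plan_attempts(adapters, timeouts):
    """Return [(timeout, servers_queried)], assuming every attempt times out."""
    nics = list(adapters)
    cursor = {nic: 0 for nic in nics}      # position of the "→" marker per adapter
    plan = []
    for attempt, timeout in enumerate(timeouts, start=1):
        if attempt >= 4:
            # Attempts 4 and later: every server on every adapter.
            queried = [s for nic in nics for s in adapters[nic]]
        else:
            # Attempt 1: preferred adapter only. Attempts 2-3: all adapters.
            targets = nics[:1] if attempt == 1 else nics
            queried = []
            for nic in targets:
                servers = adapters[nic]
                queried.append(servers[cursor[nic]])
                cursor[nic] = (cursor[nic] + 1) % len(servers)   # circular list
        plan.append((timeout, queried))
    return plan


if __name__ == "__main__":
    for number, (timeout, servers) in enumerate(plan_attempts(ADAPTERS, [1, 1, 2, 4, 4]), 1):
        print(f"attempt {number} (wait {timeout}s): {len(servers)} queries -> {servers}")
```

With the CLIENT1 configuration this sends 1, 4, 4, 10 and 10 queries across the five attempts, for 29 queries in total before the error is returned.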

Where is the Network Trace?

You can see the behavior of the previous example in a network trace:

image

  • Frame #3 shows the first attempt: preferred DNS server of the preferred adapter. We use one frame only for this attempt.
  • Frames #9 to #15 show the second attempt: next best DNS server in all the adapters. Notice how these frames have a time delta of 1s after the first attempt. We have 4 frames because we are querying 1 DNS server for each adapter and we have 4 NICs.
  • Frames #17 to #23 show the third attempt: next best DNS server in all the adapters. Notice how these frames have a time delta of 1s after the second attempt. We have 4 frames again because we are querying 1 DNS server for each adapter.
  • Frames #25 to #43 show the fourth attempt: all the DNS servers in all the adapters. Notice how these frames have a time delta of 2s after the third attempt. We have 10 frames because we are querying all the DNS servers in all the adapters, and we have a total of 10 servers to query: 4 for NIC1 + 1 for NIC2 + 3 for NIC3 + 2 for NIC4.
  • Frames #45 to #63 show the fifth attempt: all the DNS servers in all the adapters. Notice how these frames have a time delta of 4s after the fourth attempt. We have 10 frames for this attempt too.
  • We have no more frames because we make 5 attempts (remember the Timeout array has 5 elements by default). After the fifth attempt times out, which takes 4s (recheck the Timeout array if you don’t remember where this value comes from), we return an error to the caller.

Conclusion

Hopefully after reading this two-part blog post you have a better understanding of how the Windows DNS client works and the logic it follows when it deals with timeouts.

Based on this information, keep in mind these best practices:

  1. Configure the clients to point to more than one DNS server for fault-tolerance. Do not list more than one server to overcome disjoint DNS namespaces, and if you are going to do so, understand the risks and consequences.
  2. Try to have the DNS list in the clients ordered based on the “closeness” (in network terms) to the DNS servers to avoid retries due to timeouts.
  3. Try to have clients use DNS servers that have the information that they are going to query more often; in the case of domain members these would be the DNS servers that have the client domain’s zone.
  4. Maintain an internal DNS infrastructure and hierarchy where names can be resolved independently of the internal DNS server that is queried. For DNS implementations that support multi-domain AD environments, make sure that any DNS server can resolve any name no matter the domain where the name is registered.
    Note: saying that any DNS server can resolve any name does not mean that all of them have a copy of all the zones in the environment. What it means is that all of them have a way to find the name in the DNS hierarchy because the forwarders, stub zones, delegations and secondary zones are properly configured.

Comments (17)

  1. This is fantastic content.  Keep it coming.

  2. Anonymous says:

    For satellite users, like on Exede, users have experienced DNS issues, over and over.  Most likely due in part to longer latencies, which range from 600+ms to over 1 second, depending upon users' activities and load upon ground based DNS server.

    There have been numerous posts at WildBlue/Exede World Forums  on this topic, for instance,

    DNS issues back???

    http://www.wildblueworld.com/…/showthread.php

    Sadly, there is no widget for a typical user to adjust these values.

    Waiting one second for DNSQuery is a long time for land-based wired ISPs, but for satellite users, if their DNS servers are overloaded with requests, ETC., then customers with standard OS settings may/will get page load errors.

    Thanks for the overview…but this issue needs attention for satellite users.

    1. Sean Mullin says:

      I would think this would be an issue for the Satellite companies to address with their users, not for Microsoft.

  3. karammasri says:

    @SteveNZ: this is the behavior explained in
    http://support.microsoft.com/kb/320760 (I have a reference in the post about it). The list is reset to the default order every 15 min. No idea where you got the round-robin reset though.

  4. karammasri says:

    @vancoq: press the "Alt" key

  5. vancoq says:

    I cannot find menu Advanced/Advanced Settings… on Windows 7 from ncpa.cpl. Where is it?

  6. Grim_Fandango says:

    Fantastic! Searched for ages for how to increase the DNS Client timeout on Windows 7 and finally found this page. All the other sites I found referenced HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\DNSQueryTimeouts for XP and 2k, which made no difference.

    Thanks a million karammasri!

  7. AndréH. says:

    Really nice content! Helps step more to understand how Windows "ticks". Thank you 🙂

  8. renato says:

    To overcome this, and to better deal with other clients (non-MS OS), I was thinking about exposing the DNS IP address as a virtual (NLB) address. Thus, all clients would have a single server set, which would actually be made of three or four NLB members. Makes sense?

  9. readerX says:

    Sadly this is not true – when a DNS server for an interface responds with "no such name", Windows asked immediately the next DNS server on the next interface. Why? It has its answer.

  10. Brian Bohanna says:

    Another cool note: I ran into this issue as well for queries of external DNS namespaces without a forwarder set up. What finally fixed it was to make sure Bind Secondaries was selected on the 2008 R2 DNS server. Guess the root hints don't like the timeout intervals set from Windows Servers.

  11. SteveNZ says:

    Here's a question then – I've seen reference to the DNS server that gets queried changes in a round-robin fashion every 15 minutes – is this the case, and if so, how does this affect the above?

  12. Zahid Kalwar says:

    Very well written and thanks – I have question please reply – I have home router which has DHCP and provide itself DNS to my machine and we have multiple suffix appended to my adapter when I ping with first in the list it is quicker and when ping host
    in second suffix it takes 9 seconds to come back with reply. In wireshark traces it shows that local DNS is reply four time with same response which I am getting from my corporate DNS via VPN and when I change the DNS to public or none response is in sub seconds.
    I also manually confgured same DNS and with dummy alternative DNS then it is quicker I do see in traces that queries goe to both but for each suffix response is once and same from both DNS on my local Wireless Adapter – have you seen this issue before.

  13. 성재우 says:

    This was very helpful. Thank you.

  14. OPlenty says:

    Hello Mr karammasri

    I read this article with great interest.
    We have similar problems in our environment. Resolving a CNAME eg "app.newdomain.net" with value "server.olddomain.net" (later we will switch app.newdomain.net to server.newdomain.net) can be sometimes resolved, sometimes not by our clients.

    Because newdomain.net and olddomain.net are namespaces with different authorative servers I suspect also a time out problem (FYI (DNS server newdomain.net contain stub zone olddomain.net)

    If I replace the cname app.newdomain.net with A record app.newdomain.net pointing to the IP of server.olddomain.net, it works ?!

    Is it so that when the Win7 client tries to resolve cname app.newdomain.net the timeouts descbribed above are in effect? Clients wait 1-1-2-4-4 seconds on DNS server to give a result, if not "ping request could not find host…" is returned?

    BR

  15. Tobi says:

    @karammasri
    The round-robin thing was mentioned here:
    https://web.archive.org/web/20131116075859/http://blogs.technet.com/b/ajayr/archive/2011/12/14/who-does-dns-client-prefer-preferred-or-alternate.aspx
    But it’s deleted now, perhaps because it’s not true!? Can you clarify this?

  16. Great post! This was something I enjoyed reading and always wanted to know. It is one of those topics that are hard to search because the search engines return millions of hits. This will be in my favorites list. Keep it up. Now off to do this in my lab and learn. Thanks.
