I was recently on site with a customer performing an ADRAP when we found that several domain controllers were missing certain generic SRV records from DNS. The environment had around one hundred DCs, and thirty of them were missing records. Unsure why this was inconsistent, we started to investigate, first by restarting the netlogon service on one of the domain controllers in question. Restarting the netlogon service is one way to force that DC to register its SRV records. We refreshed the zone and found that the missing records, _kerberos._udp.contoso.com and _kpasswd._udp.contoso.com, were present. A few minutes later, after allowing some time for replication to occur, we re-ran our test. The records had vanished once more.
Examining the list of DCs in the environment that had missing SRV records, we found that most of them were in remote Active Directory sites, some a few hops from the main hub. Further examination of the DNS configuration showed that the Active Directory namespace in DNS (the contoso.com zone) was configured for aging and that the no-refresh interval was set to two hours, while the refresh interval was set to seventy hours. A single DC was set to scavenge records every three days and all DCs pointed to themselves for primary DNS.
Things clicked once we looked at the aging settings on the zone. Let’s start by briefly reviewing what the refresh and no-refresh intervals mean to DNS clients (or you can read this great post for a longer explanation).
When a client initially registers a record with its DNS server, and aging is enabled on the zone, the record starts off in the “no refresh” period. During this administrator-defined period of time, which defaults to seven days, the DNS record timestamp cannot be updated by the client. If the client IP address changes, the record may be updated and the timestamp will be written. Again, the no-refresh period just prevents the timestamp from being updated.
After this interval has expired, the record enters the “refresh” period. During this period, a client that is still at the same IP address is able to update the timestamp on the record. Once the record is timestamped again, it re-enters the no-refresh period. This cycle continues as long as the record is consistently updated.
If the record passes through both the no-refresh interval and the refresh interval without being updated, it becomes eligible to be scavenged. Scavenging is a process that is generally carried out by a small number of DNS servers. The scavenging server checks each zone for which it is authoritative for any records that have aged beyond the no-refresh + refresh period. Records whose last update is older than this combined interval are scavenged.
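The aging cycle described above can be sketched as a small function. This is only an illustration, not how the DNS server implements it: the function name `aging_state` and the exact boundary handling (a record leaves the no-refresh period the moment its age equals the interval) are assumptions, and the intervals used in the example are the two-hour and seventy-hour values from the customer zone.

```python
from datetime import datetime, timedelta


def aging_state(timestamp, now, no_refresh, refresh):
    """Classify a record by the age of its timestamp.

    Hypothetical helper for illustration only; boundary handling is assumed.
    """
    age = now - timestamp
    if age < no_refresh:
        # Timestamp-only refreshes are ignored during this period.
        return "no-refresh"
    if age < no_refresh + refresh:
        # The client may refresh the timestamp, restarting the cycle.
        return "refresh"
    # Older than no-refresh + refresh: eligible for scavenging.
    return "stale"


# Intervals from the zone in this post: 2 hours no-refresh, 70 hours refresh.
NO_REFRESH = timedelta(hours=2)
REFRESH = timedelta(hours=70)

registered = datetime(2024, 1, 1, 13, 0)
for hours_later in (1, 3, 72):
    now = registered + timedelta(hours=hours_later)
    print(hours_later, aging_state(registered, now, NO_REFRESH, REFRESH))
```

With these settings, a record that goes 72 hours without a timestamp update is fair game the next time the scavenging server runs.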
Client A (host) records are updated by the DHCP Client service on Server 2003/XP, or on Server 2008/Vista and later by the DNS Client service, every 24 hours. On domain controllers, SRV records are updated by the netlogon service. These updates occur, by default, every 24 hours on Server 2003 DCs, or hourly on Server 2008 and later DCs. On any version of Windows Server, the records are also registered when netlogon starts.
Back to the vanishing SRV records…
The DNS zone in question had set the no-refresh interval to just two hours. As previously mentioned, the netlogon service in Windows Server 2008 and later will attempt to register SRV records every hour, regardless of zone aging settings. Let’s look at how DNS records are stored in AD integrated zones.
DNS records are stored in the directory in a dnsNode object that corresponds to the name of the record. For example, _kerberos._udp.corp.milt0r.com would actually be a single object viewable in ADSIEdit at this path, assuming the replication scope is forest-wide: “DC=_kerberos._udp,DC=corp.milt0r.com,CN=MicrosoftDNS,DC=ForestDNSZones,DC=corp,DC=milt0r,DC=com”
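To make the naming pattern concrete, here is a minimal sketch that builds that distinguished name from a record name and its zone. The function `dns_node_dn` and its parameters are hypothetical names for this example, and it assumes a forest-wide replication scope, as in the path above.

```python
def dns_node_dn(record_fqdn, zone, forest_root, partition="ForestDNSZones"):
    """Build the DN of the dnsNode object for a record (illustrative helper).

    Assumes the zone is stored in the forest-wide DNS application partition.
    """
    suffix = "." + zone.lower()
    if not record_fqdn.lower().endswith(suffix):
        raise ValueError(f"{record_fqdn} is not inside zone {zone}")
    # The node name is the record name relative to the zone, e.g. _kerberos._udp
    relative_name = record_fqdn[: -len(suffix)]
    forest_dn = ",".join(f"DC={label}" for label in forest_root.split("."))
    return f"DC={relative_name},DC={zone},CN=MicrosoftDNS,DC={partition},{forest_dn}"


print(dns_node_dn("_kerberos._udp.corp.milt0r.com", "corp.milt0r.com", "corp.milt0r.com"))
```

Note that the zone name stays as one dotted component (DC=corp.milt0r.com), while the forest root is split into one DC= component per label.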
Each of the individual SRV records you see listed in the DNS management console is stored in an attribute of that dnsNode object. This is a multi-value attribute called dnsRecord.
This dnsRecord attribute stores values that represent information about the SRV record (weight, priority, port, hostname). When an SRV record is created, an entry is added to the dnsRecord attribute. When an existing record is updated, the appropriate value is updated. For example, if the _kpasswd._udp dnsNode object has four values under the dnsRecord attribute, we would see four individual SRV records when examining the zone in the DNS management console. Additionally, because dnsRecord is not a linked attribute, it replicates as a single unit: a change to a single value results in replication of the entire attribute. That brings us to the next logical question…
What happens if two DCs, pointing to themselves for DNS, register their own SRV records for the same dnsNode object at the same time?
Let’s say DC1 and DC2 both have no SRV record registered for _kerberos._udp.corp.milt0r.com. At 1:00PM, you restart netlogon on both domain controllers, forcing the records to update. Netlogon will attempt to register all of its necessary SRV records, including the missing _kerberos._udp record. In this case, since the records don’t already exist, we would expect registration to succeed on each DC’s local copy of the zone.
A little while later, replication takes place. You have two DCs, both with a dnsNode object for _kerberos._udp, but with different values in the dnsRecord attribute. Whichever domain controller wrote the change last will win. If we examine the zone on either DC after replication, we should see that the SRV record for only one of the DCs was successfully registered. When replication occurred, the copies of the object on the two DCs were found to have conflicting information. To resolve the conflict, the DC receiving the inbound change either kept its own copy or accepted the replication partner’s copy, depending on which was most current.
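The effect of that conflict can be shown with a toy model. This is a simplified sketch, not the directory’s actual code: the class and function names are made up for the example, the DC host names are hypothetical, and the tie-break order (version number, then write time, then originating DC) is a stand-in for the attribute metadata that Active Directory compares. The key point is that the whole dnsRecord attribute is treated as one unit, so the losing DC’s value disappears entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttributeVersion:
    """One copy of the dnsRecord attribute plus simplified replication metadata."""
    values: frozenset      # SRV targets stored in the attribute
    version: int           # bumped on every originating write
    written_at: int        # seconds after 1:00PM, for illustration
    originating_dc: str    # stand-in for the originating DC's identity


def resolve(local, inbound):
    """Keep whichever whole-attribute copy is most current."""
    def key(copy):
        return (copy.version, copy.written_at, copy.originating_dc)
    return inbound if key(inbound) > key(local) else local


# Both DCs start from the same replicated state (hypothetical host names).
shared = frozenset({"dc3.corp.milt0r.com", "dc4.corp.milt0r.com"})

# Each DC adds only its own record to its local copy, seconds apart.
dc1_copy = AttributeVersion(shared | {"dc1.corp.milt0r.com"}, 2, 0, "DC1")
dc2_copy = AttributeVersion(shared | {"dc2.corp.milt0r.com"}, 2, 5, "DC2")

# After replication, both DCs converge on the same winner.
winner_on_dc1 = resolve(dc1_copy, dc2_copy)
winner_on_dc2 = resolve(dc2_copy, dc1_copy)
print(sorted(winner_on_dc1.values))
```

Both DCs end up holding DC2’s copy of the attribute, which never contained DC1’s record, so DC1’s registration is silently lost until it registers again.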
What’s that got to do with a 2 hour no-refresh interval?
With the no-refresh interval set to such a small span of time, every domain controller will successfully update its timestamp every other hour. In a small environment with just a few DCs and a relatively low convergence time, administrators may never notice a problem. In a large environment with many domain controllers across many sites, where those DCs all point to themselves for primary DNS, you’ll begin to see replication conflicts as DCs register records and those registrations overlap replication intervals. In an environment with over 100 DCs where the no-refresh period is just two hours, it stands to reason that you’d have multiple replication conflicts within any given one-hour period. Depending on the number of sites, the topology, and the replication intervals, you could have ten conflicts within the same 15-minute interval. Eventually, some DCs are going to “lose” repeatedly and have their records scavenged, since those records will end up being seen as stale by the scavenging DNS server. This assumes that netlogon hasn’t been manually restarted at certain points.
How do I prevent this weird scenario?
Luckily, there’s a pretty easy fix. Set your DNS no-refresh period to an interval that will allow all DCs in the forest to experience consecutive registration failures or replication conflicts, but to still have a chance during that period to successfully register before other DCs enter the refresh period. Changing the interval to something like 24 hours means that, while netlogon will try to update the record each hour, it will only succeed once a day. This gives all of your DCs a longer window to register without the risk of experiencing multiple replication conflicts.
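Some rough arithmetic shows why the longer interval helps. The sketch below counts how many hourly netlogon attempts actually result in a write to the directory over one day. It is a simplification under stated assumptions: the function name is made up, the first registration counts as a write, and an attempt succeeds as soon as the record’s age reaches the no-refresh interval.

```python
def timestamp_writes(no_refresh_hours, attempt_interval_hours=1, window_hours=24):
    """Count the attempts in the window that actually write to the record."""
    writes = 0
    last_write = None
    for hour in range(0, window_hours, attempt_interval_hours):
        # An attempt only writes once the no-refresh interval has elapsed.
        if last_write is None or hour - last_write >= no_refresh_hours:
            writes += 1
            last_write = hour
    return writes


DC_COUNT = 100  # roughly the size of the environment in this post

for no_refresh in (2, 24):
    per_dc = timestamp_writes(no_refresh)
    print(f"{no_refresh}h no-refresh: {per_dc} writes per DC, "
          f"{per_dc * DC_COUNT} writes across {DC_COUNT} DCs per day")
```

Under these assumptions, a two-hour no-refresh interval produces 12 writes per DC per day (1,200 across 100 DCs) against the same handful of dnsNode objects, while a 24-hour interval produces one write per DC per day. Fewer originating writes means far fewer chances for two DCs to change the same attribute within one replication interval.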
I hope that this post gives you a better understanding of the DNS refresh intervals, how DNS records are stored in the directory, and how a low no-refresh interval can impact SRV records and replication. As always, leave questions or comments below!
Once I’d finished writing this post, I discovered a pretty similar post on the subject (though with a different root cause) on the AD blog. That article is located here: http://blogs.technet.com/b/ad/archive/2008/08/08/a-complicated-scenario-regarding-dns-and-the-dc-locator-srvs.aspx That particular post goes into a bit more detail in some areas, but covers a case where the cause was the sheer number of DCs running Server 2003, as opposed to a very low no-refresh interval. I recommend reading that as well.
– Tom Moser