The story of the Mysteriously Malfunctioning Mail Router (AKA EDNS and Exchange Escapades)

A small anecdote to illustrate how changes outside the local administrators' control can adversely affect the internal infrastructure:

A colleague of mine on the Exchange team came to me with an issue: the customer's Exchange server had suddenly stopped routing incoming mail for one specific domain.
Mail for all other domains was being routed without any problems; only that one domain was affected.
The DNS results we were getting back from our DNS server for that domain looked highly suspicious: 'Server Failure'.

Looking at the Exchange logs from the mail router we saw the following:

No change had been made to anything at the customer site, yet mail routing stopped working in the middle of the night, with no local admin reported anywhere near a keyboard or a server.

After examining this we determined the following:

- The mail routing for the specific domain made a DNS query to determine the IP addresses of the target servers. The query would normally return no records, because the servers aren't registered externally, and the mail routing code would then fall back to a local file containing the servers' IP addresses.

- For all other domains, a conditional forwarder routed the DNS queries to an internal DNS server.

- No conditional forwarder was set up for the failing domain, so all DNS queries for that domain were sent either to one of the root DNS servers on the Internet or to the ISP's DNS servers.

Essentially, a change had been made outside the customer's network. That change prevented the DNS response from getting back to us, so our DNS server reported 'Server Failure'. Faced with that error rather than an empty answer, the mail routing code abandoned the routing attempt instead of falling back to the local file.

In other words, a firewall or router en route was dropping the EDNS packets, and that was the root cause.
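The difference between "no records" and "Server Failure" is what made this fault so disruptive, so here is a minimal Python sketch of the decision as we understood it. It is not the actual Exchange routing code: the names `DnsOutcome`, `choose_targets` and `LOCAL_FILE_IPS`, and the example addresses, are all hypothetical and only model the behaviour described above.

```python
from enum import Enum


class DnsOutcome(Enum):
    """Hypothetical summary of what the DNS lookup returned."""
    ANSWER = "answer"              # records were returned
    NO_RECORDS = "no_records"      # lookup completed, nothing registered
    SERVER_FAILURE = "servfail"    # resolver could not complete the lookup


def choose_targets(outcome, dns_ips, local_file_ips):
    """Return the IPs to deliver to, or None if routing is abandoned."""
    if outcome is DnsOutcome.ANSWER:
        return list(dns_ips)
    if outcome is DnsOutcome.NO_RECORDS:
        # Normal case for this domain: fall back to the local file.
        return list(local_file_ips)
    # A server failure is treated as an error, so no fallback happens.
    return None


# Assumed contents of the local file (documentation-range addresses).
LOCAL_FILE_IPS = ["192.0.2.10", "192.0.2.11"]

print(choose_targets(DnsOutcome.NO_RECORDS, [], LOCAL_FILE_IPS))
print(choose_targets(DnsOutcome.SERVER_FAILURE, [], LOCAL_FILE_IPS))
```

The first call models the healthy state (empty answer, local file used); the second models the night of the outage, where the lookup itself failed and the fallback was never reached.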

The case was ultimately resolved by simply adding a conditional forwarder for the affected domain, which bypassed the external device that was causing the problem.
Disabling EDNS probes on the Windows Server 2003/2008 DNS server would also have been an option (dnscmd /config /enableEDNSprobes 0).

This is described in KB 828263, which also applies to Windows Server 2008/R2. The difference is that EDNS probing is not enabled by default in Windows Server 2003, whereas it is the default behaviour in Windows Server 2008 and later.
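To make concrete what an EDNS-enabled query adds on the wire, the Python sketch below builds the same DNS question twice, once as a plain query and once with the EDNS OPT pseudo-record appended. This is an illustration only: the helper names `encode_name` and `build_query`, the name `example.com` and the advertised payload size of 4096 are assumptions of mine, not values taken from the customer's servers, and nothing is sent over the network.

```python
import struct


def encode_name(name):
    """Encode a domain name as length-prefixed DNS labels."""
    out = b""
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        out += struct.pack("!B", len(raw)) + raw
    return out + b"\x00"  # root label terminates the name


def build_query(name, qtype=1, edns_payload_size=None, query_id=0x1234):
    """Build a DNS query; add an OPT record when edns_payload_size is set."""
    arcount = 1 if edns_payload_size is not None else 0
    # Header: ID, flags (recursion desired), QDCOUNT, ANCOUNT, NSCOUNT, ARCOUNT
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, arcount)
    question = encode_name(name) + struct.pack("!HH", qtype, 1)  # class IN
    packet = header + question
    if edns_payload_size is not None:
        # OPT pseudo-record: root name, TYPE 41, CLASS carries the UDP
        # payload size, TTL carries extended flags (all zero), RDLENGTH 0.
        packet += b"\x00" + struct.pack("!HHIH", 41, edns_payload_size, 0, 0)
    return packet


plain = build_query("example.com")
edns = build_query("example.com", edns_payload_size=4096)
print(len(plain), len(edns))  # prints: 29 40
```

The only difference is the 11-byte OPT record and the additional-record count in the header. A device that does not understand that record, or that refuses the larger UDP responses it invites, can silently drop the traffic, which is the kind of behaviour we ran into here.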

 

Further details:

DNS query responses do not travel through a firewall in Windows Server 2003
http://support.microsoft.com/default.aspx?scid=kb;EN-US;828263