Ran into an interesting case today where mail flow had quit working in one direction between sites. The environment was on Exchange 2010. Basically, Site A and Site B could not send mail to an Exchange server in Site C. They could send to each other, and mail was flowing out from Site C, but inbound mail to Site C was queueing up.
The error in the Application event logs looked as follows:
Log Name: Application
Date: 8/24/2016 1:21:43 PM
Event ID: 1035
Task Category: SmtpReceive
Inbound authentication failed with error UnexpectedExchangeAuthBlob for Receive connector Default SiteCServer. The authentication mechanism is ExchangeAuth. The source IP address of the client who tried to authenticate to Microsoft Exchange is [IP of Site A Exchange server (or Site B)].
<Provider Name="MSExchangeTransport" />
<TimeCreated SystemTime="2016-08-24T17:21:43.000000000Z" />
<Data>[IP of Site A Exchange server (or Site B)]</Data>
There was no time skew in the environment, and no back pressure events (Event IDs 15001-15007) were showing up in the Application event logs.
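When several servers are logging Event ID 1035, it can help to pull the connector name and source IP out of each message to see which sites are failing. The snippet below is a minimal sketch that assumes the message text matches the wording quoted above; the function name `parse_event_1035` and the sample IP address are made up for illustration.

```python
import re

# Matches the message text of Event ID 1035 as quoted in this post.
EVENT_1035 = re.compile(
    r"Inbound authentication failed with error (?P<error>\S+) "
    r"for Receive connector (?P<connector>.+?)\. "
    r"The authentication mechanism is (?P<mechanism>\S+?)\. "
    r"The source IP address of the client who tried to authenticate "
    r"to Microsoft Exchange is \[(?P<source_ip>[^\]]+)\]\."
)


def parse_event_1035(message):
    """Return the error, connector, mechanism and source IP, or None if no match."""
    match = EVENT_1035.search(message)
    return match.groupdict() if match else None


if __name__ == "__main__":
    # Hypothetical sample; 10.0.1.25 stands in for the Site A server's IP.
    sample = (
        "Inbound authentication failed with error UnexpectedExchangeAuthBlob "
        "for Receive connector Default SiteCServer. The authentication "
        "mechanism is ExchangeAuth. The source IP address of the client who "
        "tried to authenticate to Microsoft Exchange is [10.0.1.25]."
    )
    print(parse_event_1035(sample))
```

Grouping the parsed results by `source_ip` makes it easy to confirm that the failures come only from the Site A and Site B servers.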
Normally, I would spend a great deal of time working through protocol logs and network traces to figure out what was happening here, but luckily for me, I have an engineer named Miguel Ortiz on my team and he had seen this before.
In this particular case, the Kerberos token from Site A and Site B destined for Site C had become corrupted. Site C had suffered a severe outage, and we believe the Kerberos tokens were corrupted during that time. We went through the following steps to remediate the issue:
- On the Site C domain controllers, we restarted the Kerberos Key Distribution Center (KDC) service.
- On the Site C Exchange server, we purged the Kerberos tickets using the klist purge command from an administrative command prompt.
- Next, we restarted the Netlogon service on the Site C Exchange server.
  - Net Stop Netlogon && Net Start Netlogon
- Then we restarted the Transport service on the Site C Exchange server.
  - Net Stop MSExchangeTransport && Net Start MSExchangeTransport
- After transport came back up on Site C, we repeated the ticket purge, the Netlogon restart, and the Transport restart (steps 2 through 4) on the Site A and Site B servers.
This process basically clears out the Kerberos tickets and forces Site A and Site B to re-send their Kerberos request to Site C. In other words, the servers discard the corrupted token and request a new one.
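The per-server part of the remediation (purge tickets, restart Netlogon, restart Transport) can be sketched as a small script. This is only an illustration of the order of operations, not something we ran during the case. The names `EXCHANGE_SERVER_STEPS` and `remediate` are hypothetical, and the KDC restart on the domain controllers is left out because it runs on different machines. The script defaults to a dry run that only prints the commands; running it for real requires an elevated prompt on the Exchange server.

```python
import subprocess

# Commands taken from the steps in this post, in the order we ran them.
EXCHANGE_SERVER_STEPS = [
    ["klist", "purge"],
    ["net", "stop", "Netlogon"],
    ["net", "start", "Netlogon"],
    ["net", "stop", "MSExchangeTransport"],
    ["net", "start", "MSExchangeTransport"],
]


def remediate(steps=EXCHANGE_SERVER_STEPS, dry_run=True):
    """Run each command in order and stop at the first failure.

    With dry_run=True (the default) nothing is executed; the commands
    are only printed. Returns the list of commands that were processed.
    """
    processed = []
    for command in steps:
        label = "Would run:" if dry_run else "Running:"
        print(label, " ".join(command))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code.
            subprocess.run(command, check=True)
        processed.append(command)
    return processed


if __name__ == "__main__":
    remediate(dry_run=True)
```

Run it on the Site C Exchange server first, confirm transport is back up, and then repeat on the Site A and Site B servers, matching the order described above.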
Once this process completed, mail flow returned to normal and no further 1035 events were thrown.
Thanks to Miguel Ortiz for helping me get through this case quickly and working out the steps to resolve it.