Tales of the Weird: Unable to Send TLS-Encrypted email to Office 365

Here is another case from the "Tales of the Weird and Unusual" bin at Microsoft Exchange Support.  This article includes a little bit of information from both the Office 365 and the Exchange On-Premise worlds, which is a little bit different as most of this blog focuses in solely on Exchange On-Premise.

Over the last week or so we have seen an uptick in cases where our customers are having difficulties sending emails to Office 365 from their Exchange On-Premise environment when using TLS.  This does not seem to affect just one version of Exchange which also can complicate matters.

Where we are seeing this happen is right after we send over the SSL certificate.  Most customers will see this as their mail queue building up, and most likely it'll be triggered by either a call to your help desk or your SCOM alerts.

Typically the first thing you will do is check your mail queue via the Get-Queue command and you'll see something similar to below:

RunspaceId : RunSpaceID_Would_Go_Here
DeliveryType : DnsConnectorDelivery
NextHopDomain : Contoso.mail.onmicrosoft.com
TlsDomain :
NextHopConnector : 00000000-0000-0000-0000-000000000000
Status : Active
MessageCount : MessagesInQueueForOffice365
LastError : [{LED=451 4.4.397 Error communicating with target host. -> 421 4.4.2 Connection dropped due to TimedOut};{MSG=};Contoso-mail-onmicrosoft-com.mail.protection.outlook.com};{IP=Office365IPAddressHere};{LRT=8/20/2018 12:00:39 AM}]

When I duplicated this using a SmartHost in my lab we noted the following:

Note, I have changed/removed the RunSpaceID's and NextHopConnectors to protect the innocent.  You would see your RunSpaceId's, etc in your own environments.

Now, the next step in troubleshooting is to enable our verbose protocol logging via the following command:

Set-SendConnector "Outbound to Office 365" -ProtocolLoggingLevel verbose

This will allow us to see what is happening from the Send Connector. Most often for this command, your send connector will be "Outbound to Office 365", however for some organizations it will be different.

Once you have the Send Connector Protocol Logging enabled you can grab your Protocol Logs which by default in Exchange 2013/2016 will be in %ExchangeInstallPath%TransportRoles\Logs\Hub\ProtocolLog\SmtpSend.  Again, you can customize this to suit your needs.  * 

Once you have the logs, you can easily import the logs into Excel to review as they are just .csv files.  When you import them, select delimited, comma format, with your import starting at line 5 and you will be able to filter down your connections.

Now, depending upon your error, you may see different behavior.  However, for what we are focusing in on today is the "TLS negotiation failed with error TimedOut".  Here's a snippit from the log in my lab:

date-time connector-id session-id sequence-number local-endpoint remote-endpoint event data context
2018-08-12-T12:4612.310Z Outbound to Office 365 1 2 192.168.1.1:45874 52.164.999.999:25 +
2018-08-12-T12:4612.342Z Outbound to Office 365 1 3 192.168.1.1:45874 52.164.999.999:25 < "220 mail.protection.outlook.com Microsoft ESMTP MAIL Service ready at Sun, 12 Aug 2018 12:46:19 +0000"
2018-08-12-T12:4612.342Z Outbound to Office 365 1 4 192.168.1.1:45874 52.164.999.999:25 > EHLO mail.contoso.com
2018-08-12-T12:4612.408Z Outbound to Office 365 1 5 192.168.1.1:45874 52.164.999.999:25 < 250 mail.protection.outlook.com Hello [192.168.1.1]
2018-08-12-T12:4612.409Z Outbound to Office 365 1 6 192.168.1.1:45874 52.164.999.999:25 > STARTTLS
2018-08-12-T12:4612.530Z Outbound to Office 365 1 7 192.168.1.1:45874 52.164.999.999:25 < 220 2.0.0 SMTP server ready
2018-08-12-T12:4612.530Z Outbound to Office 365 1 8 192.168.1.1:45874 52.164.999.999:25 * " CN=mail.contoso.com.com, OU=MyOU, OU=Domain Control Validated CN=MyCNand CA, O=Contoso, " Sending certificate Subject Issuer name Serial number Thumbprint Not before Not after Subject alternate names
2018-08-12-T12:4631.547Z Outbound to Office 365 1 9 192.168.1.1:45874 52.164.999.999:25 * TLS negotiation failed with error TimedOut
2018-08-12-T12:4631.648Z Outbound to Office 365 1 10 192.168.1.1:45874 52.164.999.999:25 - Remote

Now, it looks like our connection was just timing out - we were sending our SSL certificate, however we were never getting a reply from Office 365.

Then, our next step was we tried a telnet from our Exchange server.  We should see:

Trying 52.164.999.999...
Connected 52.164.999.999.
Escape character is '^]'.
220 mail.protection.outlook.com Microsoft ESMTP MAIL Service ready at Fri, 24 Aug 2018 14:12:07 +0000
EHLO contoso.com
250 mail.protection.outlook.com Hello [192.168.1.1]
STARTTLS
220 2.0.0 SMTP server ready

But instead, we saw nothing.

We ended up taking a network capture, and in our network capture we were seeing us connect and send our certificate.  However, we would never see a reply back from the service.

Many times in cases like this, we do end up taking concurrent network captures - one from Exchange, and then two from our firewall device - one capture on the internal interface, and then another on the external interface.  What we were seeing is the traffic was coming back from Office 365, but our firewall device instead was dropping the packets and never sending the reply back to Exchange.  This would show up sometimes as a [RST, ACK, CWR] flag in our network traces.  (The key being the congestion window reduced.)

What we found was that in some of these cases the firewall was doing what is commonly referred to as SMTP inspection, along with the TLS component can sometimes trip up certain firewalls.  In our case we were seeing Exchange attempt to negotiate TLS and then we immediately send a number of TCP retransmissions.  Then, we see the connection reset.  When we checked the logs in Office 365, we were never seeing the client certificate logged, which was then indicating the failure is below the application layer.

Once we either bypassed the firewall device or disabled SMTP and TLS inspection, the issue was resolved.

* NOTE: Make sure you disable your protocol logging via the Set-SendConnector "Outbound to Office 365" -ProtocolLoggingLevel none command when you are done troubleshooting, otherwise you could see your disk space usage jump dramatically quickly.

Special thanks to the following for their assistance in writing this blog:
Nick Ulmer, Exchange Support Escalation Engineer
Shawn Bakken, Exchange Support Escalation Engineer
Tom Kern, Exchange Online Senior Support Escalation Engineer