This is a new series of articles for this blog, inspired by Mark Russinovich’s Case of the Unexplained series. Each article tells the story of an Exchange Online scenario that initially made no sense, walks through the troubleshooting steps, and ends with the root cause. I have a couple of hopes here.
- If you have experienced the same or similar issue, you can jump right to the root cause without needing to troubleshoot yourself.
- You can learn more about Exchange Online from seeing how I approach solving a problem as I’ll be detailing not only what I did, but why I did it.
Now let’s bring in our first contestant.
Environment & Problem
Inbound mail flow looks as follows.
Inbound mail is arriving properly at EOP, but the problem occurs when EOP tries to deliver the message to the on-premises mail environment. An EOP Message Trace shows a number of inbound messages “Pending.” Messages stuck in the “Pending” state are eventually delivered on-premises, but usually only after a delay of several hours.
Let’s first start with what we know.
- This problem isn’t consistent: messages are eventually delivered from EOP to on-premises, just after a long delay.
- The problem most likely isn’t connector-related, because messages do eventually go through. In a complete mail-down scenario I would start with the connector, but that isn’t the case here.
For mail flow problems like this, the best place to start is with a message trace. While the regular message trace can sometimes give you enough information to figure out what’s going on, I usually prefer the detailed message trace as this will provide you with everything you could ever want. So let’s start there.
Detailed Message Trace
A detailed message trace is delivered as a CSV file that shows an extremely detailed view of a message’s hops as it passes through Exchange Online. For instructions on obtaining a detailed message trace, see Run a Message Trace and View Results. Before submitting the request, always check the box “Include message events and routing details”; this ensures we get the most data back.
In this case, the detailed trace validates the problem we’re seeing.
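When a detailed trace export runs to thousands of rows, it helps to filter the CSV down to messages whose most recent event is still pending. Here is a minimal Python sketch of that filtering step; the column names, event names, and sample rows below are hypothetical stand-ins, since real exports use different headers and many more columns.

```python
import csv
import io

# Hypothetical excerpt of a detailed message trace CSV. Real exports have
# many more columns and different header names.
trace_csv = """message_id,date,event,detail
<msg1@contoso.com>,2015-06-01T10:00:00,RECEIVE,Inbound from internet
<msg1@contoso.com>,2015-06-01T10:00:05,PENDING,Delivery to on-premises deferred
<msg2@contoso.com>,2015-06-01T10:01:00,RECEIVE,Inbound from internet
<msg2@contoso.com>,2015-06-01T10:01:03,DELIVER,Delivered to on-premises
"""

def pending_messages(csv_text):
    """Return the IDs of messages whose most recent event is PENDING."""
    last_event = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Rows are in chronological order, so the last row per ID wins.
        last_event[row["message_id"]] = row["event"]
    return [mid for mid, ev in last_event.items() if ev == "PENDING"]

print(pending_messages(trace_csv))  # ['<msg1@contoso.com>']
```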
Unfortunately, the SMTP error we see here is very generic and doesn’t immediately tell us what the problem is. It usually indicates a problem on the wire, and these problems are most often caused by network equipment that performs packet inspection. At this point we can’t say whether the problem is with a networking device, so we’re going to need to collect more data.
When in doubt, grab a network capture! We obtained a capture from the edge of the customer’s network, and that location matters: if the problem is with a network device, a capture taken on the Exchange server wouldn’t show it, because the problem would be further upstream in the network.
Here’s what our edge network capture looks like.
I’m using the following filter in Wireshark to hide all the information that we don’t need to see.
tcp.port == 25 and not tcp.analysis.retransmission and not tcp.analysis.duplicate_ack
The first clause shows only traffic on port 25. Because this capture is from a router, it contains duplicate packets (one when a packet arrives, and a second when it leaves). I don’t need to see the second copy, so the other two clauses remove the duplicates from the capture.
Unfortunately, this network capture doesn’t help us. Because traffic is sent over TLS, we can’t see what’s going on here. This gives us two options going forward.
- Disable TLS on the on-premises spam filtering device. Remember that EOP uses opportunistic TLS, so merely not forcing TLS on-premises isn’t enough; we actually need to turn TLS off, i.e., stop offering the STARTTLS verb.
- Investigate protocol logs to see if they give any clues as to what’s happening. If the problem is indeed a network device, on-premises protocol logs won’t show it, because Exchange sits behind the device (in this case, an on-premises spam filtering device). Protocol logs from Office 365, however, would show a problem, since Office 365 talks directly to the on-premises networking device.
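For the first option, it’s easy to verify from the outside whether a server still advertises STARTTLS, since opportunistic TLS only kicks in when the verb appears in the EHLO reply. A minimal Python sketch using the standard smtplib, plus an offline helper for parsing a reply; the host name and sample reply are hypothetical.

```python
import smtplib

def offers_starttls(host, port=25, timeout=10):
    """Probe a live SMTP server: does its EHLO reply advertise STARTTLS?"""
    with smtplib.SMTP(host, port, timeout=timeout) as smtp:
        smtp.ehlo()
        return smtp.has_extn("starttls")

def ehlo_extensions(ehlo_reply):
    """Offline helper: parse EHLO reply text (status codes already
    stripped) into a set of extension keywords."""
    lines = ehlo_reply.strip().splitlines()
    # The first line is the server greeting; the rest list extensions.
    return {line.split()[0].upper() for line in lines[1:] if line.strip()}

# Offline example with a made-up EHLO reply:
reply = """mail.contoso.com Hello [203.0.113.10]
SIZE 36700160
STARTTLS
8BITMIME"""
print("STARTTLS" in ehlo_extensions(reply))  # True
```

Calling `offers_starttls("mail.contoso.com")` would perform the live check against a server you control.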
I didn’t want the customer to make any on-premises changes unless I was absolutely sure of the problem, and so I looked at protocol logs from the Exchange Online side.
Exchange Online protocol logs
I obtained protocol logs from the Exchange Online side and saw the following entry, which is always followed by the connection being dropped.
TLS protocol SP_PROT_TLS1_0_CLIENT negotiation succeeded using bulk encryption algorithm CALG_RC4 with strength 128 bits,…
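With thousands of protocol log lines to sift through, a quick scan for the cipher string narrows things down fast. A small Python sketch; the sample lines are invented, and real Exchange protocol logs are CSV records with many more fields.

```python
def rc4_negotiations(log_lines):
    """Return only the log lines that mention the RC4 bulk cipher."""
    return [line for line in log_lines if "CALG_RC4" in line]

# Invented sample entries loosely modeled on the log format above.
sample = [
    "TLS protocol SP_PROT_TLS1_2_CLIENT negotiation succeeded using bulk encryption algorithm CALG_AES_256 with strength 256 bits",
    "TLS protocol SP_PROT_TLS1_0_CLIENT negotiation succeeded using bulk encryption algorithm CALG_RC4 with strength 128 bits",
]
print(len(rc4_negotiations(sample)))  # 1
```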
Seeing CALG_RC4 was finally my “AH HA!!!!” moment. At the time of this problem, Office 365 was in the process of disabling the RC4 cipher on its cloud servers, but we were only about halfway complete.
The on-premises spam filtering device was trying to negotiate with Office 365 using the TLS 1.0 RC4 cipher, which at the time was enabled on only about half of the Office 365 servers, and fewer every day as we gradually turned it off. When an EOP server that had RC4 disabled tried to send the message on-premises, delivery would fail because that server rejected the RC4 negotiation. We retry approximately every 15 minutes, so when a retry was performed by a server that had not yet had this cipher disabled, the message was delivered successfully on-premises.
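This retry behavior also explains why the delays ran to hours. If a fraction p of EOP servers still offered RC4 and each ~15-minute retry happened to land on a random server, the expected number of attempts until success is 1/p, so the expected extra delay is (1/p − 1) retry intervals. A back-of-the-envelope Python sketch; this is an illustrative model, not how EOP actually selects servers.

```python
def expected_delay_minutes(fraction_rc4_enabled, retry_interval_min=15):
    """Expected extra delay before a retry lands on an RC4-capable server.

    Models each attempt as an independent trial with success probability
    p (the fraction of EOP servers still offering RC4). The expected
    number of attempts until success is 1/p, so the expected extra delay
    is (1/p - 1) retry intervals.
    """
    p = fraction_rc4_enabled
    expected_attempts = 1 / p
    return (expected_attempts - 1) * retry_interval_min

for p in (0.5, 0.25, 0.1):
    print(p, expected_delay_minutes(p))
# With half the servers RC4-capable, expect ~15 extra minutes;
# at 10%, expect ~135 extra minutes -- hours of delay.
```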
See the recent blog post Enhancing mail flow security for Exchange Online for more information about the ciphers we support. This post also mentions support for RC4 being withdrawn.
Testing a solution
The customer I was working with had approximately 200 messages queued up in EOP. They disabled TLS 1.0 on their on-premises spam filtering server and rebooted the system. Once the spam filtering server came back, the EOP queue drained to zero in less than 15 minutes. This spam filtering server turned out to be a Windows 2003 server that didn’t support the newer TLS protocols. The plan was to upgrade this box to a 2008 R2 server, and then eventually pull it out of the mail flow loop once its rules had been migrated to EOP.
I hope this write-up has helped you learn more about troubleshooting EOP mail flow problems. I really enjoy documenting step-by-step troubleshooting cases and will try to make this series a regular feature on this blog.