O365: On-Premises Transport Queuing Lessons Learned from CBL

I recently had a case where on-premises Transport had Queues that were 'stuck' going to the cloud. Specifically, they were going to Exchange Online Protection for routing from on-premises to cloud mailboxes. The on-premises transport servers were generating errors, in a unique way. For example, this is what we saw from the Queues in PowerShell:

(Get-TransportServer | Get-Queue | Where{$_.NextHopDomain -ilike '<cloud domain>'} | Select LastError -Unique).LastError

451 4.4.0 Primary target IP address responded with: "451 5.7.3 Must issue a STARTTLS command first." Attempted failover to alternate host, but that did not succeed. Either there are no alternate hosts, or delivery failed to all alternate hosts.

This error is highly deceptive because it could be inferred that the client is not issuing a STARTTLS to the remote endpoint in the cloud; however, when we looked at the SMTP Send protocol logs, we could clearly see the inverse was true:

<Server>.mail.protection.outlook.com Microsoft ESMTP MAIL Service ready at Fri, 6 Dec 2013 21:54:51 +0000",
2013-12-06T21:54:51.891Z,Outbound to Office 365,08D0C0DE40C1BF74,3,x.x.x.x:64526,x.x.x.x:25,>,EHLO <on-premises domain>,
2013-12-06T21:54:52.001Z,Outbound to Office 365,08D0C0DE40C1BF74,4,x.x.x.x:64526,x.x.x.x:25,<,250-<Server>.mail.protection.outlook.com Hello [External I.P. Address of Client],
2013-12-06T21:54:52.001Z,Outbound to Office 365,08D0C0DE40C1BF74,5,x.x.x.x:64526,x.x.x.x:25,<,250-SIZE 157286400,
2013-12-06T21:54:52.001Z,Outbound to Office 365,08D0C0DE40C1BF74,6,x.x.x.x:64526,x.x.x.x:25,<,250-PIPELINING,
2013-12-06T21:54:52.001Z,Outbound to Office 365,08D0C0DE40C1BF74,7,x.x.x.x:64526,x.x.x.x:25,<,250-DSN,
2013-12-06T21:54:52.001Z,Outbound to Office 365,08D0C0DE40C1BF74,8,x.x.x.x:64526,x.x.x.x:25,<,250-ENHANCEDSTATUSCODES,
2013-12-06T21:54:52.001Z,Outbound to Office 365,08D0C0DE40C1BF74,9,x.x.x.x:64526,x.x.x.x:25,<,250-AUTH,
2013-12-06T21:54:52.001Z,Outbound to Office 365,08D0C0DE40C1BF74,10,x.x.x.x:64526,x.x.x.x:25,<,250-8BITMIME,
2013-12-06T21:54:52.001Z,Outbound to Office 365,08D0C0DE40C1BF74,11,x.x.x.x:64526,x.x.x.x:25,<,250-BINARYMIME,
2013-12-06T21:54:52.001Z,Outbound to Office 365,08D0C0DE40C1BF74,12,x.x.x.x:64526,x.x.x.x:25,<,250 CHUNKING,
2013-12-06T21:54:52.001Z,Outbound to Office 365,08D0C0DE40C1BF74,13,x.x.x.x:64526,x.x.x.x:25,*,,Connector is configured to send mail only over TLS connections and remote doesn't support TLS
2013-12-06T21:54:52.001Z,Outbound to Office 365,08D0C0DE40C1BF74,14,x.x.x.x:64526,x.x.x.x:25,>,QUIT,

Ah, so the STARTTLS command is not available from the remote endpoint. Since TLS is required for this send connector, this is why we're queuing.

Let's double-check our math and our sanity, really quick. I use nslookup to query for the MX record of the cloud domain and telnet to port 25 on one of the I.P. addresses. So far, so good. I issue 'EHLO <on-premises domain>' and get this response:

<I.P. Address> - 220 <Server>.mail.protection.outlook.com Microsoft ESMTP MAIL Service ready at Fri, 6 Dec 2013 22:01:59 +0000
EHLO <on-premises domain> 250-<Server>.mail.protection.outlook.com Hello [External I.P. Address of Client]
250-SIZE 157286400
250-PIPELINING
250-DSN
250-ENHANCEDSTATUSCODES
250-AUTH
250-8BITMIME
250-BINARYMIME
250 CHUNKING

Sure enough, 'STARTTLS' isn't an available ESMTP verb. O.k., but why?

Well, as it turns out, we use public block lists and this company was cited by CBL for RWBL and XBL; specifically, Pushdo was being used to perform a DDoS against port 80 on webservers from an infected machine behind their public I.P. address.

As soon as they were listed, we removed STARTTLS (and other ESMTP verbs).

The customer discovered it was probably a visitor (CBL lists time of last activity) and as it had been over 8 hours, delisted themselves. Immediately, TLS was available for use, again.