Recently, a customer called in an issue in which the transport service was failing on all of their Exchange 2013 servers, usually within minutes of each other. Transport would restart on its own and then, 15 minutes to an hour later, crash again. The following errors were showing up in the event logs:
Log Name: Application
Date: 4/14/2017 7:41:23 AM
Event ID: 17017
Task Category: Storage
Transport Mail Database: Quota was exceeded while performing a database operation. The Microsoft Exchange Transport service is shutting down. Exception details: Microsoft.Isam.Esent.Interop.EsentTransactionTooLongException: Too many outstanding generations between JetBeginTransaction and current generation.
at Microsoft.Isam.Esent.Interop.Api.JetSetColumn(JET_SESID sesid, JET_TABLEID tableid, JET_COLUMNID columnid, Byte[] data, Int32 dataSize, Int32 dataOffset, SetColumnGrbit grbit, JET_SETINFO setinfo)
at Microsoft.Exchange.Transport.Storage.DataStreamImmediateWriter.Write(Int64 position, Byte[] data)
Log Name: Application
Source: MSExchange Common
Date: 4/14/2017 7:41:24 AM
Event ID: 4999
Task Category: General
Watson report about to be sent for process id: 32864, with parameters: E12N, c-rtl-AMD64, 15.00.1104.005, edgetransport.exe, KERNELBASE.dll, 8b9c, c0020001, 3cb8, 6.3.9600.17031 (winblue_gdr.140221-1952).
Log Name: Application
Date: 4/14/2017 7:38:19 AM
Event ID: 15004
Task Category: ResourceManager
The resource pressure increased from Medium to High.
The following resources are under pressure:
Version buckets = 2585 [High] [Normal=1750 Medium=2000 High=2500]
The following components are disabled due to back pressure:
Inbound mail submission from Hub Transport servers
Inbound mail submission from the Internet
Mail submission from Pickup directory
Mail submission from Replay directory
Mail submission from Mailbox server
Mail delivery to remote domains
Mail resubmission from the Message Resubmission component.
Mail resubmission from the Shadow Redundancy Component
The following resources are in normal state:
Queue database and disk space ("D:\TransportDB\data\Queue\mail.que") = 66% [Normal] [Normal=95% Medium=97% High=99%]
Queue database logging disk space ("D:\TransportDB\data\Queue\") = 66% [Normal] [Normal=95% Medium=97% High=99%]
Private bytes = 4% [Normal] [Normal=71% Medium=73% High=75%]
Physical memory load = 67% [limit is 94% to start dehydrating messages.]
Submission Queue = 0 [Normal] [Normal=2000 Medium=10000 High=15000]
Temporary Storage disk space ("D:\TransportDB\data\Temp") = 66% [Normal] [Normal=95% Medium=97% High=99%]
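To make the thresholds in the 15004 event easier to read at a glance, here is a minimal Python sketch that parses the resource lines of the event text and lists the ones that have reached their High threshold. The function names are hypothetical, and the regular expression assumes the line layout shown in the event above; lines with a different shape (such as the physical memory line) are skipped.

```python
import re

# Pattern for resource lines in the 15004 event text, for example:
#   Version buckets = 2585 [High] [Normal=1750 Medium=2000 High=2500]
RESOURCE_RE = re.compile(
    r"^(?P<name>.+?) = (?P<value>\d+)%? \[(?P<state>\w+)\] "
    r"\[Normal=(?P<normal>\d+)%? Medium=(?P<medium>\d+)%? High=(?P<high>\d+)%?\]$"
)


def parse_resource_line(line):
    """Return a dict for one resource line, or None if the line has another shape."""
    match = RESOURCE_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "name": fields["name"],
        "state": fields["state"],
        "value": int(fields["value"]),
        "normal": int(fields["normal"]),
        "medium": int(fields["medium"]),
        "high": int(fields["high"]),
    }


def resources_at_or_above_high(event_text):
    """List the parsed resources whose current value has reached the High threshold."""
    parsed = (parse_resource_line(line) for line in event_text.splitlines())
    return [r for r in parsed if r is not None and r["value"] >= r["high"]]


# A few lines copied from the 15004 event shown above.
event_text = r"""Version buckets = 2585 [High] [Normal=1750 Medium=2000 High=2500]
Private bytes = 4% [Normal] [Normal=71% Medium=73% High=75%]
Physical memory load = 67% [limit is 94% to start dehydrating messages.]
Submission Queue = 0 [Normal] [Normal=2000 Medium=10000 High=15000]"""

for resource in resources_at_or_above_high(event_text):
    print(resource["name"], resource["value"], "High threshold:", resource["high"])
```

In this event, version buckets is the only resource past its High threshold; disk space, memory and the submission queue are all in the normal range.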
In the SmtpSend protocol logs, we could see several 452 4.3.1 status codes, which mean insufficient system resources. All of them showed up on Shadow Copy traffic or XProxyFrom traffic.
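If you want to find those 452 4.3.1 responses without reading the logs by eye, a small Python sketch along these lines could help. It assumes the comma-separated protocol log layout with the field order listed in FIELDS (check it against the "#Fields:" header in your own logs). The sample lines, connector names, session IDs and IP addresses are invented for illustration and are not from the customer's environment.

```python
import csv
from collections import Counter

# Assumed field order; verify against the "#Fields:" line in your own logs.
FIELDS = ["date-time", "connector-id", "session-id", "sequence-number",
          "local-endpoint", "remote-endpoint", "event", "data", "context"]


def count_452_by_connector(lines):
    """Count '452 4.3.1' replies per connector, skipping '#' header lines."""
    rows = csv.DictReader(
        (line for line in lines if line.strip() and not line.startswith("#")),
        fieldnames=FIELDS,
    )
    hits = Counter()
    for row in rows:
        data = row["data"] or ""  # short rows leave missing fields as None
        if data.startswith("452 4.3.1"):
            hits[row["connector-id"]] += 1
    return hits


# Invented sample lines (hypothetical connectors, sessions and addresses).
sample_log = [
    "#Fields: date-time,connector-id,session-id,sequence-number,local-endpoint,remote-endpoint,event,data,context",
    "2017-04-14T12:38:25.101Z,ExampleConnectorA,SESSION0001,5,192.0.2.11:25001,192.0.2.12:2525,<,452 4.3.1 Insufficient system resources,",
    "2017-04-14T12:38:26.440Z,ExampleConnectorA,SESSION0002,5,192.0.2.11:25002,192.0.2.13:2525,<,452 4.3.1 Insufficient system resources,",
    "2017-04-14T12:38:27.012Z,ExampleConnectorB,SESSION0003,5,192.0.2.11:25003,192.0.2.14:2525,<,250 2.6.0 Queued mail for delivery,",
    "2017-04-14T12:38:29.700Z,ExampleConnectorB,SESSION0004,5,192.0.2.11:25004,192.0.2.12:2525,<,452 4.3.1 Insufficient system resources,",
]

for connector, count in count_452_by_connector(sample_log).most_common():
    print(connector, count)
```

To run it against a real log, pass an open file object instead of the sample list; grouping by connector makes it easier to see which traffic is being pushed back.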
One of the first things we noticed was that the Back Pressure event (15004) fired about 3 minutes before the 17017 event every time, so we knew the Microsoft Exchange Transport service would crash roughly 3 minutes after each 15004 event. To see what was killing the transport service, we used ProcDump. We noted the PID of the edgetransport.exe process and then waited for the next 15004 event. At that point, we ran the following command from an elevated command prompt:
procdump -ma -s 60 -n 3 <edgetransport.exe pid>
This captured three dumps, 60 seconds apart (since we knew the process would crash in about 3 minutes).
We also took a second kind of dump to catch the process at the moment it crashed, using the following command (you can change the path at the end of the command to something relevant for your system):
procdump.exe -e 1 -ma -f *EsentTransactionTooLongException* edgetransport.exe -accepteula C:\dumps\crashdump.dmp
We sent those dumps to our Escalation Engineer, who ultimately came back saying that a large message, over 1.5GB, was trying to come in. It was getting through the receive connector and then blowing up the Transport service when Transport tried to proxy it to the backend servers. He identified the message and told us the name and IP of the server that sent it, the name and IP of the Exchange server it was being proxied to, the IP of the Exchange server that had received it from the sender, and the name of the connector that was proxying it (not the receive connector the message came in under).
The sending server identified in the dump was an application server trying to deliver the message through the relay connectors in Exchange. The Escalation Engineer then provided us with the message ID and sender of the problematic message.
So, we had a really large message that had come into the Exchange environment and was wreaking havoc on it.
How did that 1.5GB message get into my Exchange environment anyway?
This is the question that hounded the customer the most. How did this message get in? I mean, their relay connectors and their Default Connectors all had 20MB message limits. Why did that not stop the message from even making it into transport?
Well, to answer that, we looked through all of the relay connectors to see which one the application server's IP was using. We found it on Relay3. Taking a closer look at that relay receive connector, we found that, under the Security tab, it had been set up with Externally Secured as the authentication mechanism and Exchange servers as the permission group.
Externally Secured requires you to set the permission group to Exchange servers because that is how it treats those connections. Anything coming in under this connector is treated like traffic from an internal Exchange server: the sender is assumed to be 100% trusted and no checks are applied to it, including the check on message size. So that is how the 1.5GB message got in.
Now, because the message had gotten in and Transport was crashing before the message could be delivered, the sending server, a Unix server running Sendmail, never got a response from Exchange saying that the message had been delivered, nor did it get an error. All it got was a severed connection when the transport service crashed. So it waited a few minutes and then resubmitted the message. Depending on the configuration of the Sendmail server, it could potentially do that indefinitely, which was the case in our scenario. The cycle of transport crashes could have gone on forever if left unchecked.
We spent a lot of time trying to find the message and remove it from the queues before we realized it was being resubmitted after every crash. We finally looked at the receive connector and realized the issue was how it was configured. For limits such as the maximum receive message size to apply to a receive connector, it simply could not be set up as Externally Secured. So how do we set it up so that the application can still send messages through the relay to both internal and external recipients? It is easier than it sounds.
First, clear all of the authentication mechanisms and permission groups on the current relay connector, and then select the Anonymous users permission group. Save that, then run the following from the Exchange Management Shell to give the relay receive connector the permission it needs to relay to both internal and external recipients:
Get-ReceiveConnector "Server1\Relay 3" | Add-ADPermission -User "NT AUTHORITY\ANONYMOUS LOGON" -ExtendedRights "Ms-Exch-SMTP-Accept-Any-Recipient"
For good measure, I would then restart the MSExchange Transport service.
The next time the Sendmail server tries to send that message into Exchange, Exchange will check its size and reject it immediately, sending a response over the SMTP connection telling the sending server that the message was rejected for being too large. This would show up in the Sendmail server's mail logs as a 5.3.4 status saying "Message too big for system" or a 5.2.3 status stating "Message too large". Since the Sendmail server received the rejection response, it would discard the message and not try to send it in again.
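The difference between the crash loop and the clean rejection comes down to how a sending mail server reacts to each outcome. The following Python sketch models the standard SMTP behavior in a simplified way: a dropped connection or a 4xx reply is a temporary failure and the message stays queued for retry, while a 5xx reply is permanent and the message is returned to the sender. The function names are hypothetical, and 552 stands in for any permanent 5xx reply; the actual retry schedule of a real Sendmail server depends on its configuration.

```python
def next_action(reply_code):
    """Decide what a sending mail server does after one delivery attempt.

    reply_code is the three-digit SMTP reply as an int, or None when the
    connection dropped before any reply arrived.
    """
    if reply_code is None:
        return "retry"   # no answer at all: handled like a temporary failure
    if 200 <= reply_code < 300:
        return "done"    # accepted by the receiving server
    if 400 <= reply_code < 500:
        return "retry"   # transient failure: keep the message queued
    if 500 <= reply_code < 600:
        return "bounce"  # permanent failure: stop retrying
    raise ValueError("unexpected SMTP reply code: %r" % (reply_code,))


def attempts_until_settled(replies):
    """Return the attempt number on which the message leaves the sender's
    queue, or None if every attempt in the list ends in a retry."""
    for attempt, reply in enumerate(replies, start=1):
        if next_action(reply) != "retry":
            return attempt
    return None


# Externally Secured relay: every attempt ends in a severed connection.
print(attempts_until_settled([None, None, None, None]))

# Anonymous relay with a size limit: the next attempt is rejected outright.
print(attempts_until_settled([None, 552]))
```

In the first case the message never leaves the sender's queue, which matches the endless resubmission we saw. In the second case the first permanent rejection ends the cycle.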
Information on Anonymous versus Externally Secured permission groups can be found in the following article as well as the steps to set up either of the two types of relays: https://technet.microsoft.com/en-us/library/mt668454(v=exchg.160).aspx
If your distribution groups are set to accept mail from "Only senders inside my organization", then anonymous relays will not work for them. For an application to send as an internal sender so that it can successfully send to one of those distribution groups, its mail has to come in as Externally Secured. The other option is to set the distribution group to "Senders inside and outside my organization". Then anonymous relays would work fine and still give you the protection of the size limits and other rules.
There is another scenario where Externally Secured is needed for a relay. If the application is trying to send from on-premises to a distribution group that lives in O365 and is set to accept messages only from authenticated users, the message needs to appear to be internal, even though the O365 tenant has the same namespace as the on-premises environment. When it comes through an anonymous relay, the header is stamped with X-MS-Exchange-Organization-AuthAs: Anonymous and the message is rejected because it is not authorized to send to that distribution group. If the relay is set as Externally Secured, the message comes in as X-MS-Exchange-Organization-AuthAs: Internal and O365 accepts it.
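To check which way a given message was stamped, you can look for that header in the raw message headers. Here is a minimal Python sketch using the standard email module. The two sample messages and their addresses are invented, the function names are hypothetical, and the second function is only a simplified model of the acceptance rule described above, not of everything O365 evaluates. It also assumes the header is still present in the copy of the message you are inspecting.

```python
from email import message_from_string

AUTH_AS_HEADER = "X-MS-Exchange-Organization-AuthAs"


def auth_as(raw_message):
    """Return the AuthAs value stamped on a raw message, or None if absent."""
    value = message_from_string(raw_message)[AUTH_AS_HEADER]
    return value.strip() if value is not None else None


def passes_authenticated_only_group(raw_message):
    """Simplified model: only mail stamped as Internal reaches the group."""
    return auth_as(raw_message) == "Internal"


# Invented sample messages (hypothetical addresses).
via_anonymous_relay = (
    "From: app@example.com\n"
    "To: group@example.com\n"
    "X-MS-Exchange-Organization-AuthAs: Anonymous\n"
    "Subject: nightly report\n"
    "\n"
    "body\n"
)
via_externally_secured_relay = via_anonymous_relay.replace(
    "AuthAs: Anonymous", "AuthAs: Internal"
)

for label, raw in [("anonymous", via_anonymous_relay),
                   ("externally secured", via_externally_secured_relay)]:
    print(label, auth_as(raw), passes_authenticated_only_group(raw))
```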
In a nutshell, only use the Externally Secured option when the application is 100% trusted to abide by the policies you have enforced on the Exchange environment. Make sure the application developers know that you have a size limit and make them code with that in mind. Otherwise, make them use the anonymous connectors and inform them they will not be able to send to any Distribution groups that are locked down to only receive from internal senders.
Thanks to Kary Wall, our friendly Escalation Engineer, Tom Kern and Stephen Gilbert, our Exchange Connector SMEs, and Sainath Vijayaraghavan, our local team lead, for their help in coming to these conclusions and for being a sounding board for me as I worked this case.