I recently had another very interesting case that I thought I would share. This is a follow-up to a previous blog post I wrote on this issue, which is located here.
The symptom, the cause of the behavior on the iPhone, and the steps for diagnosing the issue were identical, but the root cause and resolution were different.
Your Exchange data may unexpectedly reload on some or all of the iOS devices in your organization. This includes Exchange email, contacts, and calendar information.
Environment: Exchange 2007, Exchange 2010, or Exchange 2013 published externally using Internet Security and Acceleration Server (ISA 2004 or ISA 2006), Forefront Threat Management Gateway 2010 (TMG 2010), or Unified Access Gateway 2010 (UAG 2010), as well as F5 BIG-IP.
This issue is seen when the iOS devices receive an HTTP 500 error in response to consecutive requests. The ping that the client initiates is critical to the Direct Push technology that Exchange ActiveSync relies on for determining when there is new information to be pushed to the client.
For more information on Direct Push and how it works see the below links:
Diagnosing this issue:
Apple has addressed this issue on their support forum here and has some recommendations for diagnosing this at the client. It is my understanding that Apple support has to pull the ActiveSync logs off the iOS devices.
From the ISA/TMG standpoint, diagnosing this involves a couple of things. Start Live Logging on ISA/TMG and filter for traffic that is using the ActiveSync rule for your organization. If there are multiple requests that are showing as Failed and an Error 64 is indicated as the cause then we need to investigate further.
A Microsoft Support Engineer can gather application level tracing using our Best Practices Analyzer and more specifically the Data Packager.
In the ISA/TMG tracing, which will need to be converted by Microsoft, we will typically see ActiveSync conversations failing with error code 64 (ERROR_NETNAME_DELETED), followed by a response to the client of HTTP/1.1 500. UAG tracing would not show the ERROR_NETNAME_DELETED, but you should still see an HTTP/1.1 500 being returned to the client.
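As a rough illustration of the pattern to look for, here is a minimal Python sketch that flags clients with several failures in a row. It assumes the log has already been exported and parsed into records; the field names (`client`, `result_code`, `http_status`), the sample addresses, and the threshold are all hypothetical and do not reflect the actual ISA/TMG log schema.

```python
from collections import defaultdict

ERROR_NETNAME_DELETED = 64  # Win32 error code named in the tracing above


def clients_with_consecutive_failures(entries, threshold=2):
    """Return clients whose log shows `threshold` or more failures in a row.

    `entries` is an iterable of dicts in chronological order with the
    hypothetical keys "client", "result_code", and "http_status".
    """
    current_run = defaultdict(int)  # failures in a row, per client
    longest_run = defaultdict(int)  # longest run seen so far, per client
    for entry in entries:
        client = entry["client"]
        failed = (entry["result_code"] == ERROR_NETNAME_DELETED
                  or entry["http_status"] == 500)
        # A success resets the run; a failure extends it.
        current_run[client] = current_run[client] + 1 if failed else 0
        longest_run[client] = max(longest_run[client], current_run[client])
    return sorted(c for c, run in longest_run.items() if run >= threshold)


# Made-up sample records, for illustration only.
sample_log = [
    {"client": "10.0.0.5", "result_code": 64, "http_status": 500},
    {"client": "10.0.0.6", "result_code": 0, "http_status": 200},
    {"client": "10.0.0.5", "result_code": 64, "http_status": 500},
    {"client": "10.0.0.6", "result_code": 64, "http_status": 500},
    {"client": "10.0.0.6", "result_code": 0, "http_status": 200},
]
print(clients_with_consecutive_failures(sample_log))
```

A single failed request is not unusual on its own, so the sketch only reports clients whose failures are back to back, which matches the consecutive HTTP 500 responses described earlier.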
Resolution and Root Cause:
To determine the root cause and the ultimate resolution to this issue, I took the same approach that I have in all of these cases. First, I wanted to eliminate my product (TMG in this case) as the culprit. Out of the box, ISA and TMG will publish your Exchange ActiveSync services without an issue. I have never seen a case where either one was contributing to this issue, and I have worked quite a few of these over the last few years.
To eliminate TMG as the culprit, I had the customer gather data using our TMG Data Packager (Web Proxy and Web Publishing scenario) while reproducing the issue. Thankfully, this issue consistently occurred in this environment after exactly 5 minutes (300 seconds). Another thing that worked in our favor was that the customer was able to recreate this in their test environment, which closely mimicked their production environment.
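A failure that always appears after the same interval usually points to a timer somewhere in the path. If you want to check that from your own measurements, something along these lines could work. It is only a sketch: the function name, the tolerance, and the sample durations are made up for illustration and are not taken from the customer's data.

```python
from statistics import mean, pstdev


def looks_like_fixed_timeout(durations_s, tolerance_s=2.0):
    """Report whether connection lifetimes cluster around one value.

    `durations_s` holds the seconds between a request being sent and the
    reset being observed. Returns (is_fixed, average_seconds).
    """
    if len(durations_s) < 2:
        raise ValueError("need at least two observations")
    average = mean(durations_s)
    spread = pstdev(durations_s)  # population standard deviation
    return spread <= tolerance_s, average


# Hypothetical measurements, in seconds.
observed = [300.1, 299.8, 300.3, 300.0]
is_fixed, average = looks_like_fixed_timeout(observed)
print(is_fixed, round(average))
```

If the lifetimes are tightly grouped, the average is the timeout value to go looking for in the configuration of each device along the path.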
The data flow was essentially Client ======> F5 ======> TMG ======> F5 ======> Exchange CAS
Based on the tracing I gathered from TMG, I was able to determine that the RST was happening somewhere between TMG and the CAS Server. This led me to strongly suspect that F5 was to blame. A network trace from the CAS Server showed the RST was happening between TMG and CAS as well.
In previous troubleshooting for this issue the customer had discovered that F5 defaults to 300 second timeouts for many of its settings. This falls in line with what we were seeing in terms of a time frame. These timeouts were adjusted and the customer even bypassed F5 as a load balancer and yet the issue persisted.
Digging a little further with the help of the customer's network team, we found out that even technically bypassing F5 was not really bypassing it, because it was also being used for “in-line” traffic routing. In other words, the default gateway of TMG was using a VLAN that the F5 LTM was hosting. That functionality is described here. The customer consulted their F5 expert, who was able to find out there is a 300 second idle timeout enforced on that particular VLAN. Here was the smoking gun!
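The reasoning here is that a Direct Push request sits open and silent while the client waits for new data, so any device in the path with an idle timeout shorter than that wait can reset the connection. A minimal sketch of that comparison follows. The 300 second VLAN value comes from this case; the heartbeat interval, the other timeout values, and the hop labels are assumptions chosen only for illustration.

```python
def hops_that_break_direct_push(heartbeat_s, idle_timeouts_s):
    """Return the hops whose idle timeout is not longer than the heartbeat.

    `idle_timeouts_s` maps a hop label to its idle timeout in seconds.
    A hop is flagged when it would give up on a silent connection before
    (or exactly when) the heartbeat interval elapses.
    """
    return [hop for hop, timeout in idle_timeouts_s.items()
            if timeout <= heartbeat_s]


# Assumed heartbeat interval; the real value is negotiated by the client.
heartbeat_s = 900

# Only the 300 second VLAN timeout is from this case; the rest are made up.
path = {
    "F5 (client side)": 1800,
    "TMG": 1800,
    "F5 VLAN (default gateway of TMG)": 300,
    "F5 (load balancer)": 1800,
}
print(hops_that_break_direct_push(heartbeat_s, path))
```

Listing every hop explicitly, including ones that only route traffic, is the point of the exercise: the VLAN in this case was not on anyone's list of devices with a timeout until we asked.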
In these types of issues, always examine the data and ask whether you are really seeing the whole picture. In this particular case, we had to question every device in the path of the data before we ultimately found the root cause. In summary, the pain was being caused by a combination of the client device not handling resets well and a low default idle timeout set on the VLAN.
Author: Keith Abluton, Sr. Support Escalation Engineer