Outbound Proxy and SecureNAT requests stop working intermittently on TMG 2010. Restarting the Firewall Service seems to resolve the issue temporarily.

I worked on this case a few months back. Since it was a very interesting issue and a lot of work had happened before it came to me, I thought of sharing it with everyone. The issue was with Web Proxy and SecureNAT clients: they would stop working intermittently, and it required a restart of the Firewall service to resolve the issue.

When this case came to me, we already had data collected at the time of the issue.

Data analysis

So I had to look at the data and find out whether what had already been collected pointed to anything that could resolve the issue. The data was a TMG Data Packager collection taken on the TMG server. I found the client IP in the case details and then started looking at the network capture collected by the TMG Data Packager on the internal NIC, since it was an internal client.

In the network capture I found some very strange behavior, which we can see below.

 

From here we can see that TMG was resetting the connection as soon as it received the SYN packet from the client machine; it was not even letting the TCP handshake complete. All traffic from the client met with the same treatment from TMG. This was really strange. To find out why TMG was resetting this client's connections before the TCP handshake could complete, I looked at the ISA tracing log collected by the TMG Data Packager.

There I saw the same behavior; however, the reason was still not clear from the tracing either.
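The pattern described above (a client SYN answered straight away by a RST) can also be picked out of a capture programmatically. The following is only a minimal sketch: the packet record layout (source, destination, set of TCP flags) and the function name are my own assumptions for illustration, not something produced by the TMG Data Packager or any capture tool.

```python
def find_reset_handshakes(packets):
    """Return (client, server) pairs whose SYN was answered by a RST.

    `packets` is a list of (src, dst, flags) tuples in capture order, where
    src and dst are "ip:port" strings and flags is a set such as {"SYN"}.
    This record layout is hypothetical, not a real capture format.
    """
    pending = set()  # (client, server) pairs with a SYN awaiting a reply
    reset = []
    for src, dst, flags in packets:
        if "SYN" in flags and "ACK" not in flags:
            # Initial SYN from a client: remember it until the server replies
            pending.add((src, dst))
        elif (dst, src) in pending:
            # First reply from the server side to a pending SYN
            if "RST" in flags:
                reset.append((dst, src))
            pending.discard((dst, src))
    return reset
```

A flow that shows up in the result never got as far as a SYN-ACK, which is the behavior seen for this client.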

 

 

I also looked at the netstat logs collected by the TMG Data Packager and found a huge number of connections in the CLOSE_WAIT state.
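If the netstat output is available as text, a quick tally of connection states makes a pile-up like this easy to spot. This is a minimal sketch that assumes the usual Windows `netstat -an` / `netstat -ano` row layout (Proto, Local Address, Foreign Address, State, optional PID); the function name is mine.

```python
from collections import Counter


def count_tcp_states(netstat_text):
    """Count TCP connection states in `netstat -an` style output."""
    states = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # TCP rows look like: TCP <local> <foreign> <STATE> [PID]
        # UDP rows have no state column and are skipped.
        if len(fields) >= 4 and fields[0].upper() == "TCP":
            states[fields[3].upper()] += 1
    return states
```

For example, `count_tcp_states(open("netstat.txt").read())["CLOSE_WAIT"]` would give the number of CLOSE_WAIT connections, where `netstat.txt` is a placeholder file name.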

 

I did further research and found that this could happen because of what is described in KB https://support.microsoft.com/kb/2577795, i.e. under the Cause section:

"

This issue occurs because of a race condition in the Ancillary Function Driver for WinSock (Afd.sys) that causes sockets to be leaked. With time, the issue that is described in the "Symptoms" section occurs if all available socket resources are exhausted.

"

Solution

So the solution was to update afd.sys. I checked the OS version on the TMG server; it was Windows Server 2008 R2 Standard. Since afd.sys is a networking component, I engaged our networking team to find the best way to get the latest afd.sys. Their suggestion was to first install Windows Server 2008 R2 SP1 and then run Windows Update. We noted the file versions of afd.sys and tcpip.sys, then followed the networking team's suggestion. Afterwards we noted the file versions again to make sure they had been updated to a level equal to or above the version mentioned in the KB. After the upgrade, we monitored to see whether the issue came back. It never did.
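The before-and-after version check can be done by eye, but comparing dotted file versions as plain strings is easy to get wrong ("6.1.10.0" sorts before "6.1.9.0" as text). A minimal sketch of a numeric comparison follows; the function names are mine, and any version numbers you pass in should come from the file properties and from the KB article, not from this example.

```python
def parse_version(text):
    """Turn a dotted file version such as '6.1.7601.100' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def meets_minimum(installed, required):
    """Return True if the installed file version is equal to or above the required one."""
    return parse_version(installed) >= parse_version(required)
```

Calling `meets_minimum(installed_version, kb_version)` for each file after the update confirms that the driver is at or above the level listed in the KB.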