Windows 2012 R2 server fails to establish outbound connections

Hi there,

It's been a long while since I last blogged here, and it's time to come back and continue sharing our field experiences with the IT community, hoping to shed some light on similar problems.

I was asked to look into a customer problem where end users were reporting symptoms such as "cannot access the file server, getting authentication prompts", while the IT admins were seeing related problems on the server: GPOs were not being applied properly, the Netlogon service was complaining about DC access issues, and so on. At times they were even able to reproduce the issue manually by running a "telnet DC-IP 389" command from the affected server.
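The telnet check can also be scripted so that the exact socket error is recorded. The following is only a minimal sketch in Python, not something used in the original case: the function name `can_connect` and the default timeout are my own assumptions, and the host and port are whatever you would have passed to telnet.

```python
import socket

def can_connect(host, port, timeout=5.0):
    """Try one outbound TCP connection; return (ok, error_text)."""
    try:
        # create_connection resolves the host, binds a local ephemeral
        # port and performs the TCP handshake, like "telnet host port".
        with socket.create_connection((host, port), timeout=timeout):
            return True, ""
    except OSError as exc:
        # Covers refused connections, timeouts, name resolution errors
        # and local failures such as being unable to acquire a port.
        return False, str(exc)
```

Calling `can_connect("DC-IP", 389)` with the real DC address in place of the placeholder returns the error text on failure, which is worth keeping alongside the traces collected below.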

There could be many reasons behind such symptoms, so I decided to collect the following logs while the issue was being reproduced:

a) TCPIP ETL trace:

You can collect it with the commands below from an elevated command prompt on a Windows client/server:

netsh trace start capture=yes scenario=internetclient

<<repro>>

netsh trace stop

b) Network trace:

This can be collected in different ways, for example with the netsh command above (capture=yes also records the network packets), Wireshark, Network Monitor or Message Analyzer.

c) Handle outputs

This can be collected with the Sysinternals Handle tool as follows:

Note: The Handle tool (v4.1 at the time of writing) can be downloaded from the following link: https://technet.microsoft.com/en-us/sysinternals/handle.aspx

handle.exe -a -u >> %computername%_handledetails.txt

handle.exe -s >> %computername%_handlesummary.txt

ANALYSIS:

========

The logs were collected while reproducing the issue with the telnet command on the server. After the logs were shared with us, I checked various things to understand why the outbound connection might be failing. (By the way, the file server failing to authenticate incoming users was a side effect of the same issue, since the file server could not verify the client credentials over the Netlogon secure channel.)

1) I first checked the network traces, but there were no outgoing connection attempts (no TCP SYNs sent to the target server), which means the issue is local to the server itself.

2) Then I checked the TCPIP ETL trace and observed the root cause:

Note: You can open the ETL file generated by the netsh command in Network Monitor or Message Analyzer.

[0]03E0.5214::01/04/18-15:07:37.5237622 [Microsoft-Windows-TCPIP/Diagnostic] TCP: endpoint (sockaddr=0.0.0.0) bind failed: port-acquisition status = The transport address could not be opened because all the available addresses are in use..

[0]58F0.4558::01/04/18-15:07:51.8242042 [Microsoft-Windows-TCPIP/Diagnostic] TCP: endpoint (sockaddr=0.0.0.0) bind failed: port-acquisition status = The transport address could not be opened because all the available addresses are in use..

[0]04D8.072C::01/04/18-15:07:52.0110322 [Microsoft-Windows-TCPIP/Diagnostic] TCP: endpoint (sockaddr=0.0.0.0) bind failed: port-acquisition status = The transport address could not be opened because all the available addresses are in use..

...

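That clearly explained why the outbound connections were failing: PORT EXHAUSTION. The TCP/IP stack could not acquire a free local port to bind the outgoing socket, so no SYN ever left the server.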
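If the trace has been exported to text in the form quoted above, the bind failures can be pulled out with a small script. This is a rough sketch only: it assumes one event per line in exactly that layout, and the names `BIND_FAILED` and `find_bind_failures` are made up for illustration.

```python
import re

# Matches the text form of the TCPIP bind-failure events quoted above and
# captures the timestamp that follows the "::" separator.
BIND_FAILED = re.compile(
    r"::(?P<ts>\d{2}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"\[Microsoft-Windows-TCPIP/Diagnostic\]"
    r".*bind failed: port-acquisition status"
)

def find_bind_failures(lines):
    """Return the timestamps of all port-acquisition bind failures."""
    return [m.group("ts") for line in lines if (m := BIND_FAILED.search(line))]
```

Feeding it the lines of the exported trace gives a list of timestamps, which makes it easy to see when the failures started and how often they occur.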

3) The main reason behind the port exhaustion was a socket leak caused by an outdated 3rd party AV software: (from the handle.exe output)

Note: The process name was deliberately changed

92355 ABC.exe pid: 1148 NT AUTHORITY\SYSTEM

92517   144: File  (---)   \Device\Afd

92519   148: File  (---)   \Device\Afd

92627   220: File  (---)   \Device\Afd

92629   224: File  (---)   \Device\Afd

92633   22C: File  (---)   \Device\Afd

92635   230: File  (---)   \Device\Afd

92689   29C: File  (---)   \Device\Afd

92701   2B4: File  (---)   \Device\Afd

92703   2B8: File  (---)   \Device\Afd

92705   2BC: File  (---)   \Device\Afd

92707   2C0: File  (---)   \Device\Afd

92743   308: File  (---)   \Device\Afd

92755   320: File  (---)   \Device\Afd

92761   32C: File  (---)   \Device\Afd

92767   338: File  (---)   \Device\Afd

92771   340: File  (---)   \Device\Afd

92773   344: File  (---)   \Device\Afd

92779   350: File  (---)   \Device\Afd

92881   420: File  (---)   \Device\Afd

92897   440: File  (---)   \Device\Afd

92899   444: File  (---)   \Device\Afd

92927   47C: File  (---)   \Device\Afd

92929   480: File  (---)   \Device\Afd

92933   488: File  (---)   \Device\Afd

92935   48C: File  (---)   \Device\Afd

92941   498: File  (---)   \Device\Afd

92977   4E0: File  (---)   \Device\Afd

92993   500: File  (---)   \Device\Afd

93053   578: File  (---)   \Device\Afd

93073   5A0: File  (---)   \Device\Afd

93075   5A4: File  (---)   \Device\Afd

93077   5A8: File  (---)   \Device\Afd

93079   5AC: File  (---)   \Device\Afd

93093   5C8: File  (---)   \Device\Afd

93113   5F0: File  (---)   \Device\Afd

93145   630: File  (---)   \Device\Afd

93165   658: File  (---)   \Device\Afd

93167   65C: File  (---)   \Device\Afd

93175   66C: File  (---)   \Device\Afd

93195   694: File  (---)   \Device\Afd

93199   69C: File  (---)   \Device\Afd

93217   6C0: File  (---)   \Device\Afd

93219   6C4: File  (---)   \Device\Afd

93227   6D4: File  (---)   \Device\Afd

93239   6EC: File  (---)   \Device\Afd

93249   700: File  (---)   \Device\Afd

93253   708: File  (---)   \Device\Afd

93265   720: File  (---)   \Device\Afd

93269   728: File  (---)   \Device\Afd

93271   72C: File  (---)   \Device\Afd

93273   730: File  (---)   \Device\Afd

93275   734: File  (---)   \Device\Afd

93277   738: File  (---)   \Device\Afd

93281   740: File  (---)   \Device\Afd

93283   744: File  (---)   \Device\Afd

93285   748: File  (---)   \Device\Afd

93297   760: File  (---)   \Device\Afd

93299   764: File  (---)   \Device\Afd

93301   768: File  (---)   \Device\Afd

93305   770: File  (---)   \Device\Afd

93307   774: File  (---)   \Device\Afd

93313   780: File  (---)   \Device\Afd

93317   788: File  (---)   \Device\Afd

93321   790: File  (---)   \Device\Afd

93323   794: File  (---)   \Device\Afd

93327   79C: File  (---)   \Device\Afd

93329   7A0: File  (---)   \Device\Afd

93331   7A4: File  (---)   \Device\Afd

93333   7A8: File  (---)   \Device\Afd

93335   7AC: File  (---)   \Device\Afd

93339   7B4: File  (---)   \Device\Afd

93343   7BC: File  (---)   \Device\Afd

93355   7D4: File  (---)   \Device\Afd

93357   7D8: File  (---)   \Device\Afd

93359   7DC: File  (---)   \Device\Afd

93361   7E0: File  (---)   \Device\Afd

93365   7E8: File  (---)   \Device\Afd

93373   7F8: File  (---)   \Device\Afd

93383   810: File  (---)   \Device\Afd

93389   81C: File  (---)   \Device\Afd
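A listing like this is easier to judge once the \Device\Afd (socket) handles are counted per process. The sketch below is one way to do that in Python from the text written by "handle.exe -a -u". It assumes the layout shown above (a "name pid: N user" header followed by handle lines) and tolerates a leading line-number column; the function name `count_afd_handles` is hypothetical.

```python
import re
from collections import Counter

# Process header, e.g. "ABC.exe pid: 1148 NT AUTHORITY\SYSTEM",
# optionally preceded by a line number.
HEADER = re.compile(r"^\s*(?:\d+\s+)?(?P<name>\S+)\s+pid:\s+(?P<pid>\d+)")
# Handle line of type File that points at the AFD (Winsock) device.
AFD = re.compile(r":\s+File\b.*\\Device\\Afd\b")

def count_afd_handles(lines):
    """Count \\Device\\Afd handles per (process name, pid)."""
    counts = Counter()
    current = None
    for line in lines:
        header = HEADER.match(line)
        if header:
            # Every following handle line belongs to this process.
            current = (header.group("name"), int(header.group("pid")))
        elif current is not None and AFD.search(line):
            counts[current] += 1
    return counts
```

Calling `count_afd_handles(open(path, errors="replace")).most_common(5)` on the handle details file lists the processes holding the most sockets, so a leaking process stands out immediately.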

 

RESOLUTION:

===========

So we advised the customer to update the 3rd party AV software. Apart from that, you can take the following actions to avoid possible port exhaustion issues:

a) Please make sure that the Windows OS runs with the latest rollups/security updates

b) Please make sure that all 3rd party software is up to date (including firewall, AV, backup or any other software that might have to establish outbound connections frequently)

c) Finally, you may consider extending the dynamic port range on busy servers that have to establish many outbound connections very frequently. The commands below set a range close to the maximum (the start port cannot be lower than 1025 and the last port cannot exceed 65535), but you may extend the range in phases instead of maxing it out at the very beginning: (from an elevated command prompt)

netsh int ipv4 set dynamicport tcp start=1025 num=64500

netsh int ipv4 set dynamicport udp start=1025 num=64500
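To see which ports a given start/num pair actually covers before applying it, the arithmetic can be sketched in a few lines of Python. This is just an illustration: the helper names are my own, and the only limits it checks are the 1025 minimum start port and the 65535 maximum port number.

```python
MIN_START = 1025   # lowest start port accepted for the dynamic range
MAX_PORT = 65535   # highest valid TCP/UDP port number

def dynamic_port_range(start, num):
    """Return (first_port, last_port) covered by a start/num pair."""
    if start < MIN_START:
        raise ValueError(f"start must be at least {MIN_START}")
    last = start + num - 1
    if last > MAX_PORT:
        raise ValueError(f"range would end at {last}, above {MAX_PORT}")
    return start, last

def netsh_commands(start, num):
    """Build the netsh commands for TCP and UDP after validating the range."""
    dynamic_port_range(start, num)
    return [
        f"netsh int ipv4 set dynamicport {proto} start={start} num={num}"
        for proto in ("tcp", "udp")
    ]
```

With the values above, `dynamic_port_range(1025, 64500)` gives ports 1025 through 65524, compared with 49152 through 65535 for the Windows default of start=49152 num=16384.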

You can also decrease the TcpTimedWaitDelay registry value on the servers (you may lower it to 30 seconds):

TcpTimedWaitDelay: https://technet.microsoft.com/en-us/library/cc757512(v=ws.10).aspx

The TcpTimedWaitDelay value determines the length of time that a connection stays in the TIME_WAIT state when being closed. While a connection is in the TIME_WAIT state, the socket pair cannot be reused. This is also known as the 2MSL state because the value should be twice the maximum segment lifetime on the network. To adjust the TcpTimedWaitDelay settings, you have to modify/create the registry settings as listed below:

 

Key: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters
Value: TcpTimedWaitDelay
Data Type: REG_DWORD
Range: 30-300 (decimal)
Default value: 0x78 (120 decimal)
Recommended value: 30
Value exists by default? No, needs to be added.

Note: This change requires a server reboot
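If the value has to be rolled out to several servers, a .reg file is a convenient way to carry the settings listed above. The following Python sketch only generates the file text and does not touch the registry; the function name `timed_wait_reg` is hypothetical, and the 30-300 range check simply mirrors the range given above.

```python
KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"

def timed_wait_reg(seconds=30):
    """Return .reg file text that sets TcpTimedWaitDelay to `seconds`."""
    if not 30 <= seconds <= 300:
        raise ValueError("TcpTimedWaitDelay must be between 30 and 300")
    return (
        "Windows Registry Editor Version 5.00\n"
        "\n"
        f"[{KEY}]\n"
        # REG_DWORD values are written as 8 hexadecimal digits.
        f'"TcpTimedWaitDelay"=dword:{seconds:08x}\n'
    )
```

Saving the output of `timed_wait_reg(30)` to a file and importing it on a server creates the value with data 0x1e (30 decimal); the reboot requirement noted above still applies.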

 

Please note that the same techniques can be applied to virtually any Windows version from Windows 7/Windows 2008 R2 onwards.

 

Hope this helps

Thanks,

Murat