Domain not available when trying to TS onto a Windows 2003 server.

An issue came in this week where attempting to log on to a server failed: the server would not authenticate the request and returned a message indicating the "domain is not available".  If you tried logging on with your UPN instead, you got a slightly different error message indicating that "there is not enough storage to complete this operation".

After ruling out DNS and routing, I had the person run nltest /sc_query:BRADFOREST to see which DC the server was pointing at. It turned out the server did not have a secure channel to a DC, which would explain why we couldn't authenticate to it. :) When we tried to reset the secure channel, it failed with error code 8 (ERROR_NOT_ENOUGH_MEMORY). So we cranked up Netlogon debug logging and I repro'd the issue again.  We could then see this in the Netlogon debug log:

08/14 22:55:06 [SESSION] BRADFOREST: NlSetServerClientSession: New DC is an NT 5 DC: \\brad-dc-01.bradforest.local
08/14 22:55:06 [SESSION] BRADFOREST: NlSetServerClientSession: New DC is in closest site: \\brad-dc-01.bradforest.local
08/14 22:55:06 [SESSION] BRADFOREST: NlSetServerClientSession: New DC runs the time service: \\brad-dc-01.bradforest.local
08/14 22:55:06 [SESSION] BRADFOREST: NlSetServerClientSession: New discovery flags: 0x1dc; Old flags: 0x0
08/14 22:55:06 [SESSION] BRADFOREST: NlDiscoverDc: Found DC \\brad-dc-01.bradforest.local
08/14 22:55:06 [SESSION] BRADFOREST: NlStartApiClientSession: Bind to server \\brad-dc-01.bradforest.local (TCP) 0 (Retry: 0).
08/14 22:55:06 [MAILSLOT] Going to wait on mailslot. (Timeout: 45000)
08/14 22:55:06 [CRITICAL] NlPrintRpcDebug: Dumping extended error for I_NetServerReqChallenge with 0xc0000017
08/14 22:55:06 [CRITICAL] [0] ProcessID is 780 <-------------------------LSASS.exe
08/14 22:55:06 [CRITICAL] [0] System Time is: 8/14/2007 21:55:6:372
08/14 22:55:06 [CRITICAL] [0] Generating component is 8
08/14 22:55:06 [CRITICAL] [0] Status is 14
08/14 22:55:06 [CRITICAL] [0] Detection location is 313
08/14 22:55:06 [CRITICAL] [0] Flags is 0
08/14 22:55:06 [CRITICAL] [0] NumberOfParameters is 0
08/14 22:55:06 [CRITICAL] [1] ProcessID is 780
08/14 22:55:06 [CRITICAL] [1] System Time is: 8/14/2007 21:55:6:372
08/14 22:55:06 [CRITICAL] [1] Generating component is 8
08/14 22:55:06 [CRITICAL] [1] Status is 10055
08/14 22:55:06 [CRITICAL] [1] Detection location is 311
08/14 22:55:06 [CRITICAL] [1] Flags is 0
08/14 22:55:06 [CRITICAL] [1] NumberOfParameters is 3
08/14 22:55:06 [CRITICAL] Long val: 1025
08/14 22:55:06 [CRITICAL] Pointer val: 0
08/14 22:55:06 [CRITICAL] Pointer val: 0
08/14 22:55:06 [CRITICAL] [2] ProcessID is 780
08/14 22:55:06 [CRITICAL] [2] System Time is: 8/14/2007 21:55:6:372
08/14 22:55:06 [CRITICAL] [2] Generating component is 8
08/14 22:55:06 [CRITICAL] [2] Status is 10055
08/14 22:55:06 [CRITICAL] [2] Detection location is 315
08/14 22:55:06 [CRITICAL] [2] Flags is 0
08/14 22:55:06 [CRITICAL] [2] NumberOfParameters is 0
08/14 22:55:06 [CRITICAL] BRADFOREST: NlSessionSetup: Session setup: cannot I_NetServerReqChallenge 0xc0000017
08/14 22:55:06 [MISC] Eventlog: 5719 (1) "BRADFOREST" 0xc0000017 c0000017 ....

 

There are some interesting things to look at here. First off, what is 0xc0000017?  We can use err.exe to see what that translates to.

C:\Windows\system32>err 0xc0000017
# for hex 0xc0000017 / decimal -1073741801
STATUS_NO_MEMORY
# {Not Enough Quota}
# Not enough virtual memory or paging file quota is available
# to complete the specified operation.
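As an aside, the decimal value err.exe prints is the same 32 bits read as a signed integer, and the top two bits of an NTSTATUS value carry its severity. The following is a minimal Python sketch of that arithmetic; the function names are made up for illustration and are not part of err.exe or any Windows API.

```python
def ntstatus_to_signed(code: int) -> int:
    """Reinterpret an unsigned 32-bit status code as a signed 32-bit integer."""
    code &= 0xFFFFFFFF
    # If the high bit is set, the signed interpretation is negative.
    return code - 0x100000000 if code & 0x80000000 else code


def ntstatus_severity(code: int) -> str:
    """Return the severity encoded in the top two bits of an NTSTATUS value."""
    names = ("success", "informational", "warning", "error")
    return names[(code >> 30) & 0x3]


if __name__ == "__main__":
    status = 0xC0000017  # STATUS_NO_MEMORY from the Netlogon log
    print(ntstatus_to_signed(status))  # -1073741801
    print(ntstatus_severity(status))   # error
```

The leading 0xC is why the value shows up as a large negative number in decimal and why it is treated as a hard failure rather than a warning.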
 

Well, that pretty much jibes with what I was seeing when trying to log on via UPN.  We can also see two status codes being returned during the secure channel setup: 14 and 10055.

C:\Windows\system32>err /winerror.h 14
# winerror.h selected.
# for decimal 14 / hex 0xe
ERROR_OUTOFMEMORY
# Not enough storage is available to complete this operation. <-- This is what I was getting when trying to TS via UPN.

C:\Windows\system32>err /winerror.h 10055
# winerror.h selected.
# for decimal 10055 / hex 0x2747
WSAENOBUFS <--------------------HMMMMM?
# An operation on a socket could not be performed because the
# system lacked sufficient buffer space or because a queue
# was full.

Now that is interesting. The next thing I did was run netstat -s and look at the statistics, but nothing obvious jumped out. I then added the Handles column in Task Manager and noticed that their custom application had 17,000 handles open.  It turned out that most of those handles were for outgoing connections, which had used up all of the ephemeral ports.  We had to set the MaxUserPort value in the registry to allow more ports to be used; once we did that, everything returned to normal.
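This is not what we did on the case, but one way to get a quick picture of port usage is to tally TCP connections by state from the output of netstat -ano. The sketch below is a minimal Python example that parses such output; the sample text is invented for illustration (it is not from the customer's server), and the function name is hypothetical.

```python
from collections import Counter


def count_tcp_states(netstat_output: str) -> Counter:
    """Count TCP rows per connection state in `netstat -ano` style text."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # TCP rows look like: TCP <local> <remote> <state> <pid>
        if len(fields) >= 4 and fields[0].upper() == "TCP":
            states[fields[3]] += 1
    return states


# Invented sample, shaped like Windows `netstat -ano` output.
SAMPLE = """
  Proto  Local Address          Foreign Address        State           PID
  TCP    10.0.0.5:3389          10.0.0.20:51234        ESTABLISHED     1234
  TCP    10.0.0.5:4998          10.0.0.9:1433          TIME_WAIT       0
  TCP    10.0.0.5:4999          10.0.0.9:1433          TIME_WAIT       0
  TCP    10.0.0.5:5000          10.0.0.9:1433          ESTABLISHED     2468
  UDP    10.0.0.5:123           *:*                                    880
"""

if __name__ == "__main__":
    for state, count in count_tcp_states(SAMPLE).most_common():
        print(state, count)
```

A large count of connections with local ports bunched up near the top of the ephemeral range would point in the same direction as the handle count did.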

Ephemeral Ports

The number of user-accessible ephemeral ports that can be used to source outbound connections is configurable using the MaxUserPort registry parameter. By default, when an application requests any socket from the system to use for an outbound call, a port between the values of 1024 and 5000 is supplied. The MaxUserPort parameter can be used to set the value of the uppermost port that the administrator chooses to allow for outbound connections. For instance, setting this value to 10,000 (decimal) would make approximately 9000 user ports available for outbound connections.
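To make the arithmetic concrete, here is a small Python sketch that computes the size of the ephemeral range for a given MaxUserPort value. It assumes the range starts at 1024 and is inclusive at both ends, as described above; the function name is made up for illustration.

```python
FIRST_EPHEMERAL_PORT = 1024  # lower bound of the range, per the text above


def ephemeral_port_count(max_user_port: int = 5000) -> int:
    """Number of ports available for outbound connections, bounds inclusive."""
    if max_user_port < FIRST_EPHEMERAL_PORT:
        raise ValueError("max_user_port must be at least 1024")
    return max_user_port - FIRST_EPHEMERAL_PORT + 1


if __name__ == "__main__":
    print(ephemeral_port_count())       # 3977 with the default of 5000
    print(ephemeral_port_count(10000))  # 8977, the "approximately 9000" above
```

With the default setting there are fewer than 4,000 ports to go around, which is why an application holding many outbound connections can starve everything else on the box, Netlogon included.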

Here is the KB article for the issue: https://support.microsoft.com/kb/196271

Another setting worth reading about is the TCP TIME-WAIT delay (TcpTimedWaitDelay), which controls how long a closed connection's port hangs around before it is released completely (4 minutes by default).  This can also cause issues: an app that makes many outbound connections in a short time may use up all available ports before they can be recycled.
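As a rough back-of-the-envelope check, the number of ephemeral ports divided by the TIME-WAIT delay gives an upper bound on how many new outbound connections per second a machine can sustain before it runs out of ports. The Python sketch below does that division. It assumes every connection is closed from the local side and holds its port for the full delay, which is a simplification; the function name is hypothetical.

```python
def max_sustained_connection_rate(port_count: int,
                                  time_wait_seconds: float = 240.0) -> float:
    """Rough ceiling on new outbound connections per second.

    Assumes each closed connection keeps its local port for the whole
    TIME-WAIT delay before the port can be handed out again.
    """
    if time_wait_seconds <= 0:
        raise ValueError("time_wait_seconds must be positive")
    return port_count / time_wait_seconds


if __name__ == "__main__":
    # 3977 ports = default range of 1024..5000; 8977 = range of 1024..10000.
    print(f"{max_sustained_connection_rate(3977):.1f}")  # 16.6
    print(f"{max_sustained_connection_rate(8977):.1f}")  # 37.4
```

Under those assumptions, the default settings top out at well under 20 new outbound connections per second, so a chatty application does not need to be especially badly behaved to hit the limit.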
