Netlogon: Cross-Forest Delayed Authentication Requests Cause Subsequent (and Continuous) Authentication Failures

One of the longest debugging experiences I've had so far in Exchange was caused by a bug in the Netlogon code. I hope to cover what this bug was, how it manifested, and the fix that the Windows developer implemented to resolve the issue. So, this is going to be a long one....

Picture (Worth 1,000 Words)

Basics
Netlogon sessions use RPC (remote procedure call) sessions with domain controllers to communicate authentication requests to domains. In the case of cross-forest authentication requests, regardless of the type of trust created (e.g., one-way transitive, one-way intransitive, etc.), the requests are forwarded (via the trust) to the responsible domain. In this case, the responsible domains exist in the customer's on-premises environments.

So, in the above example, you'll see that the client communicates with the Café server. During authentication, the Café passes the request to the managed domain controller. The managed domain controller will see the trust and communicate the authentication request across the trust and receive only an NT status response back for the request from the customer's domain controllers.

When Repro Cometh
The condition that triggers the repro is when the local domain controller (in this case, the managed domain controller in the illustration above) is awaiting a response from the customer's domain controller for an authentication request. In the authentication pipeline, if this request times out, it's considered a retriable exception, which is important for later. Because this exception is retriable, the Café server doesn't consider the authentication request to have failed. Also, during this same time, the Café server may build a new Netlogon session with a new domain controller, which is where our problem begins to surface.

The Netlogon code has a single object reference for the name of the domain controller that the current Netlogon session, on the current RPC session, should be using. (If you're familiar with native/unmanaged code, the reference to the domain controller's name is a pointer to a wchar_t string.) But remember: we've not disposed of the previous session because the timeout is considered retriable. So, since Netlogon can only communicate on one session per RPC channel, we now have two Netlogon sessions with two RPC channels. The session that was not disposed of is in red and the new session is in green in the illustration above.

The Bug
The bug is that all subsequent authentication requests traverse the red authentication path but use the domain controller's name that was obtained when the green authentication path was created (the domain controller's name is supplied in the authentication request, as defined in the specifications). This causes all subsequent authentication requests to fail, no matter the destination forest, because the domain controller receives a request that it should not process.
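To make the mismatch concrete, here is a toy model of the condition. It is only a sketch: the class names, field names, and DC names are hypothetical and are not taken from the Netlogon source, and the name check on the domain controller is a simplification of "a request that it should not process."

```python
# Toy model only: every name here is hypothetical, not a Netlogon internal.
from dataclasses import dataclass

STATUS_SUCCESS = 0x00000000
STATUS_INVALID_COMPUTER_NAME = 0xC0000122


@dataclass
class DomainController:
    name: str

    def logon(self, target_server_name: str) -> int:
        # Simplification: a DC only processes requests addressed to its own name.
        if target_server_name != self.name:
            return STATUS_INVALID_COMPUTER_NAME
        return STATUS_SUCCESS


@dataclass
class ClientState:
    channel: DomainController  # the RPC channel that requests actually travel on
    server_name: str           # the single shared reference to the DC's name


def send_logon(state: ClientState) -> int:
    # The request goes out on the channel but carries the shared name.
    return state.channel.logon(state.server_name)


dc_red = DomainController("DC-RED")
dc_green = DomainController("DC-GREEN")

# Healthy state: channel and name agree.
state = ClientState(channel=dc_red, server_name=dc_red.name)
status_before = send_logon(state)

# A request on the red path times out (retriable), so the red channel is kept,
# but building the green session overwrites the shared name.
state.server_name = dc_green.name
status_after = send_logon(state)

print(hex(status_before), hex(status_after))
```

In this model, every request after the overwrite is rejected, which mirrors the "continuous" part of the failure: nothing resets the stale channel, so the mismatch never heals on its own.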

Verifying Repro
The best way to verify the repro of this bug is to look at the Netlogon logs. If you see 0xc0000122 (STATUS_INVALID_COMPUTER_NAME), then you've hit this specific condition. In Exchange, this will bubble up via the app pool in IIS as a 401 Unauthorized (which makes chasing the bug a bit more complicated).
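If you have a lot of log to get through, a small filter helps. The following is a minimal sketch that assumes you already have the Netlogon log available as lines of text; it makes no assumption about the log's line format and simply matches the status code wherever it appears. The function name and the sample lines are made up for illustration.

```python
import re
from typing import Iterable, List

# NTSTATUS value for STATUS_INVALID_COMPUTER_NAME, matched case-insensitively
# because logs may print the hex digits in either case.
PATTERN = re.compile(r"0xc0000122", re.IGNORECASE)


def find_invalid_computer_name(lines: Iterable[str]) -> List[str]:
    """Return every line that mentions 0xC0000122, with newlines stripped."""
    return [line.rstrip("\r\n") for line in lines if PATTERN.search(line)]


# Hypothetical sample lines; these are NOT real Netlogon log output.
sample = [
    "hypothetical entry: logon request returns 0xC0000122\n",
    "hypothetical entry: logon request returns 0x0\n",
    "hypothetical entry: logon request returns 0xc0000122\n",
]

hits = find_invalid_computer_name(sample)
for hit in hits:
    print(hit)
```

To run it against a real log, pass an open file handle in place of `sample`. A single hit is not necessarily this bug; the pattern to look for is the status repeating on every request after the first occurrence.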

The Fix
Windows dev determined that the best way to fix this was to tear down both the Netlogon and RPC sessions, regardless of current status. This has been verified as working in RS3 builds of Windows 10/Server 2016 and is currently being tested in RS1 builds of Windows 10/Server 2016.
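The idea behind the fix can be sketched in the same toy terms. This is an illustration of the principle only, not the actual Windows change: the class, its fields, and the `on_timeout` helper are hypothetical. The point is that the channel and the name are dropped together, so they can never disagree.

```python
# Illustration of the principle only; all names are hypothetical.
from dataclasses import dataclass
from typing import Optional


@dataclass
class SessionState:
    rpc_channel: Optional[str] = None  # DC that the RPC channel is bound to
    server_name: Optional[str] = None  # DC name placed in each request

    def connect(self, dc_name: str) -> None:
        # Channel and name are always set as a pair.
        self.rpc_channel = dc_name
        self.server_name = dc_name

    def teardown(self) -> None:
        # Drop both pieces together, regardless of current status.
        self.rpc_channel = None
        self.server_name = None

    def is_consistent(self) -> bool:
        return self.rpc_channel == self.server_name


def on_timeout(state: SessionState, next_dc: str) -> None:
    # Tear down even though the timeout is considered retriable,
    # then build a fresh session against the next DC.
    state.teardown()
    state.connect(next_dc)


state = SessionState()
state.connect("DC-RED")
on_timeout(state, "DC-GREEN")
print(state.rpc_channel, state.server_name, state.is_consistent())
```

The trade-off is that a retriable request loses its in-flight session and has to start over, which is cheaper than leaving a stale channel behind.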