The Case of the Process Startup Delays

I’ve been extremely busy here at Microsoft and so haven’t had time to blog until now, but plan on getting back to posting regularly. Before I start with a look at a technical problem I ran into recently, I’m pleased to report that the Sysinternals integration is proceeding smoothly and that Bryce and I will unveil an exciting new tool when the site moves to its new home under Microsoft TechNet in late October.

 

I don’t use my laptop much when I’m not traveling, but I occasionally read email on it in the living room. Like most Windows users, I’m frustrated by occasional unexplained delays when I perform routine tasks like start a program or open a web page. Since joining my laptop to an internal Microsoft domain, I began experiencing regular delays when starting processes. With my Sysinternals tools arsenal in hand, I set out to investigate the root cause, suspecting that recently joining the laptop to the Microsoft domain played a role.

 

I began my research by first noticing that, after a delay of a few seconds starting a new process, processes I started within the following 30 seconds launched instantly. I therefore started Process Explorer, waited for 30 seconds, and then executed Notepad from Explorer’s Run dialog. Notepad didn’t appear in Process Explorer’s process tree during the expected delay, which implied the Explorer thread starting Notepad was experiencing the pause, not Notepad’s startup.

 

A look at the stack of the launching Explorer thread might give me a hint about the cause, but I was too impatient to look at each of Explorer’s over a dozen threads and so attached the Windbg debugger from the Microsoft Debugging Tools for Windows to Process Explorer, launched Notepad with Process Explorer’s Run dialog, and broke into the debugger. I opened the thread list in Windbg by selecting Processes and Threads from the View menu, selected the first one displayed, and then revisited the View menu to open the Call Stack dialog:

 

 

 

Stacks display with the most recently called function at the top, so the ZwWaitForSingleObject frame at the top meant that the thread was waiting on some object to become signaled. The stack frames further up the stack are in the RPCRT4 (Remote Procedure Call Runtime Version 4) DLL and the reference to the OpenLpcPort function told me that the thread was trying to initiate a RPC with another process on the same system.  It looked like the wait might be due to the GetMachineAccountSid call highlighted in the screenshot. Just as for domain user accounts, computers belonging to a domain have accounts and GetMachineAccountSid’s name implies that the function returns the Security Identifier (SID) of the computer’s domain account.

 

I set a breakpoint on the return from the call to GetMachineAccountSid in the OpenLpcPort function and after a short pause consistent with the startup delays, the debugger’s command prompt activated. The x86 calling convention is for function return values to be passed in the EAX register so I examined its value:

 

 

After translating the value to decimal with the “?” command I searched for 1789 in the global error definitions file, WinError.h, of the Platform SDK:

 

 

I scoured MSDN documentation and the Web and found essentially no information on the underlying cause for that error code. However, the term “trusted relationship failure” implies that the domain the computer is connecting to doesn’t trust the domain of the computer. But under the circumstances the error didn’t make sense, because I was disconnected from the network and even if the computer was trying to connect a domain, the only one it would connect is the one it belonged to.

 

On a hunch, though, I opened a command prompt and ran PsGetSid to see what error it would get when trying to look up the computer’s domain SID (a computer’s domain account name is the computer name with a “$” appended to it):

 

 

Sure enough, it experienced the same delay, which must be a network timeout, and got the same error. Then I used remote access to connect to the domain and ran the command again:

 

 

 

Further, after connecting to the domain I no longer experienced the startup delays. I disconnected, but continued to have delay-free process launches, even after 30 seconds. After rebooting and not connecting, though, the delays reoccurred.

 

At that point I decided to investigate the internals of GetMachineAccountSid. The stack trace showed that it calls into the Netlogon DLL, which performs its own RPC to a function called NetrLogonGetTrustRid. I knew that the Netlogon service runs inside of the Local Security Authority Subsystem (LSASS):

 

 

I attached Windbg to LSASS and set a breakpoint on NetrLogonGetTrustRid. After launching a new process I hit the breakpoint and saw that if a particular field in a data structure is NULL, Netlogon tries to connect to a domain controller, but if the connection fails for any reason, it blindly returns error 1789. However, when I connected to the domain the call succeeded and the value in the data structure filled in with the SID of the computer account, which persists even after a disconnect. That explained the change in behavior after I connected to the domain.

 

Turning my attention back to GetMachineAccountSid, I found that it caches the results of an error for 30 seconds before asking NetLogon to attempt to connect to a domain controller again. That explained the 30 second quick-start periods. A look through the code flow in the debugger also revealed that OpenLpcPort queries the computer SID as part of a check to see if it matches the SID passed as one of its parameters. If so, OpenLpcPort changes the SID to the SID of the Local System account before calling NtSecureConnectPort, mapping the domain SID to a local one. NtSecureConnectPort takes a SID as a parameter and will only connect to the specified Local Procedure Call (LPC) port if the port was created by the account that matches the SID.

 

I’d answered a number of questions, but the big one remained: why was an RPC happening during a process launch at all? The initial stack trace only went up as far as the NegotiateTransferSyntax frame, but there were obviously other frames that the symbol engine couldn’t determine. The stack display went further when I had hit the breakpoint I set in OpenLpcPort, though:

 

 

Near the bottom you can see the call to ShellExecCmdLine that the CRunDlg class, which is responsible for the Run dialog implementation, calls. That eventually results in what looks like the execution of shell execute hook extensions, and the one that makes the RPC call is implemented by the MpShHook DLL. I didn’t know off hand what that DLL was, but Process Explorer’s DLL view showed that it’s part of Windows Defender:

 

 

I suspected that the hook is part of Windows Defender’s real-time protection, which the Windows Defender team confirmed. Autoruns reports that Windows Defender registers the shell execute hook:

 

 

The mystery was solved!  Putting it all together:

  1. Explorer’s Run dialog calls ShellExecuteCmdLine
  2. ShellExecuteCmdLine calls out to shell execute hooks
  3. Windows Defender’s hook for real-time protection, MpShHook.Dll, calls RPC to communicate with the Windows Defender service, passing the SID of the service as an argument
  4. The RPC library calls GetMachineAccountSid to see if the SID matches the computer’s domain SID, in which case it would map the SID to the local system account SID
  5. GetMachineAccountSid performs an RPC to the Netlogon service to get the computer account’s SID
  6. If the computer account’s SID hasn’t been obtained already, Netlogon tries to connect to a domain controller
  7. If the domain controller connection fails after a timeout (the delay), Netlogon returns a trust-relationship failure error
  8. The Windows Defender RPC proceeds using the unmapped SID
  9. Windows Defender’s service performs real-time checks and then process launches

 

A little more research led me to conclude that the delay only happens under very specific circumstances where:

 

  • The system is running Windows XP 64-bit for x64 or Windows Server 2003 SP1
  • Windows Defender Beta 2 is active
  • The system is domain joined, but has not connected to the domain in the current boot session

32-bit Windows XP doesn’t perform the SID mapping in OpenLpcPort and Windows Defender doesn’t use a shell execute hook on Windows Vista. The Windows Defender team is looking at workarounds for the next release, but now that I understand the delay I can work around it.