Update: The Active Directory team has released useful guides for troubleshooting slow logon issues:
Emails containing troubleshooting cases keep arriving in my inbox. I’ve received many cases that start with a seemingly unsolvable problem and end a few steps later with a solution or - often just as useful - a workaround. I’ve amassed several hundred such cases that I’ve captured in over 400 PowerPoint slides, giving me great material from which to draw for my blog and the Case of the Unexplained talk series I’ve delivered at a number of major industry conferences
I’m always looking for fresh cases, use of obscure tool features, and unique troubleshooting techniques, so please keep them coming. This time, I’m sharing a fascinating case that highlights two useful techniques: comparing Sysinternals Process Monitor logs from working and problematic systems, and using Sysinternals PsExec to capture activity during a logon.
The case begins when a systems administrator at a large company got multiple end-user complaints that logon was taking over three minutes. The users didn’t encounter any problems once logged on, but the delays were understandably frustrating. Many other users running with the same software configuration weren’t experiencing issues, however. Looking for commonalities, the administrator queried the network configuration database and, sure enough, saw that all the systems with complaints were Dell Precision 670 workstations. He thought he had a major clue until he looked he saw that the systems running without issue included seemingly identical 670 workstations.
Looking for clues more directly, his next step was to try to analyze the logon process of the delayed systems. He used PsExec to run Process Explorer in the Local System account so that it would survive a logoff and be active at the next logon. Because the systems were running Windows XP, the command-line he used was the following (see the end of the post for how to do this on Windows Vista and higher):
psexec –sid c:\sysint\procexp.exe
The “-s” directs PsExec to launch the process in the Local System account, “–i” to connect the process with the interactive desktop so that its windows are visible, and “-d” to return immediately instead of waiting for the process to terminate. Note that if you have Fast User Switching enabled and you are not logged into session 0, do not log out, but instead switch users, login to the problematic account, and then switch back to the session from which you started PsExec.
At the subsequent logon, he noticed that Lisa_client_220.127.116.11.exe, the company’s own system inventory line-of-business (LoB) application, consumed CPU for a short time, went idle for three minutes, then exited, after which the logon process would continue as normal:
David Solomon coined a phrase back before Process Monitor replaced Filemon and Regmon that still applies when updated: “When in doubt, run Process Monitor!” (I follow this advice religiously, even having my daughter run Process Monitor when she comes to me with a homework question). This case is a great example of that philosophy put into practice because it seems unlikely on the surface that Process Monitor would reveal the cause for a process hang, but the administrator turned to the tool nonetheless.
After launching Process Monitor with PsExec and capturing a logon trace, he scrolled to the beginning of the captured data and started his analysis. Because of what he saw in Process Explorer, the Lisa_client process was the obvious suspect, so he right-clicked on its process name in one of the trace lines and selected the Include quick-filter menu item to remove from the display entries related to activity from other processes:
When troubleshooting a hang with Process Monitor, you should first see if there are any gaps in operation time stamps that match the hang duration. You can look for lengthy operations by adding the Duration column to the display and then making sure to filter out operations that commonly don’t immediately complete, like directory change notifications. That can be useful when you don’t see a significant time gap between operations because the process has multiple threads, some of which continue to operate while the one causing the hang is dormant.
To his pleasant surprise, he soon found an event that not only preceded a gap of exactly three minutes, but that had an unusual result code, IO DEVICE ERROR:
It appeared that the Lisa_client process performed a SCSI pass-through command to the disk hosting the C: volume that timed-out after three minutes with a hardware error. Wondering what the result of the command was on one of the 670’s that logged on promptly, he captured a trace from one and saw that the corresponding operation took less than a millisecond and was successful:
The evidence clearly pointed at a hardware issue with the disks installed on a subset of the 670’s, so he gathered disk type data from all the 670’s, correlated them with the reports of slow logons, and found that all of the slow systems had Seagate disks and the others had Fujitsu disks.
His company was obviously not going to replace disks just to avoid an issue being caused by its own LoB application, so he had to figure out a workaround. He notified the Lisa_client development team of the issue, who reported that they could remove the command without loss of functionality, but that it would take at least several days for the update to go through their internal release process. Having a few days where system information wouldn’t be collected for a subset of systems was less important than end-user productivity, so in the meantime he wrote a WMI logon script to query the system disk and launch Lisa_client only if it wasn’t a Seagate model.
Without Process Monitor’s help he would have probably determined that the disks were the key hardware difference, but it’s not clear he would have discovered the root cause and been able to work around it rather than resort to replacing disks. This is yet another case solved with the help of Process Monitor and insightful detective work.
In closing, I mentioned that I would provide steps for configuring an application to survive logoff and logon on Windows Vista, Windows Server 2008 and higher. The PsExec command I supplied for Windows XP won’t work on newer operating systems because Windows Vista introduced Session 0 Isolation, requiring a couple of extra commands to make the launched application accessible after a logon. First, start the utility in session 0 with this PsExec command in an elevated command prompt:
psexec –sd –i 0 c:\sysint\procmon.exe
You’ll see a window titled “Interactive services dialog detection” flash in the taskbar, indicating that a process is running with a window on the hidden session 0 desktop. Click on the taskbar window to restore the notification dialog and then on the Show Me the Message button to switch to that desktop:
The utility you launched will be visible there and you can configure it with desired settings (it’s running in the Local System account so won’t have your own account’s defaults). When done, click on the Return button to get back to the main desktop. You can now logoff and log back on to reproduce the problem you’re investigating. After logging on again, execute the following commands in an elevated command prompt to cause the doorway to the session 0 desktop to reappear:
net stop ui0detect
net start ui0detect
Go back to the session 0 desktop to look at the captured information and close the tool.
One last thing I want to leave you with is a reminder that I’ve documented many other troubleshooting cases in this blog and you can find them in the blog index here. You can also watch recordings of my Case of the Unexplained sessions from TechEd here and be sure to come to TechEd US this June in New Orleans, where I’ll be delivering it again with all new cases.