The magical 2 minute logon delay mystery

Some time ago I had an interesting escalation where the problem description was as following:

I'm running a Citrix Metraframe Presentation Server farm with around 20 servers in it, after about 2 days my users start getting a logon delay of *exactly* 2 minutes.
If I reboot the server everything returns back to normal but the problem resurfaces within 48 hours, sometimes earlier.

Perfmon analysis showed that no CPU load was present on the server during that time and no resource leak either that we could see.
We asked the customer to install a checked build of Winlogon.exe and enable Winlogon debug logging, this showed us that a timeout condition was being hit.

So, taking a page out of Mark Russinovich's book I asked the customer to run Procmon during a logon attempt.  After filtering through the massive amount of data I found that Winlogon.exe was calling mpnotify.exe which then finished after exactly 2 minutes.

Looking through the Specs for Winlogon I determined that mpnotify was being called by Winlogon to talk to each of the registered Network Providers on the machine.  If one of the NP's doesn't respond Winlogon will be sitting around until the timeout of 120 seconds is hit, and then move on.

After this, we looked at the registered NP's on one of the Citrix boxes with the intent of removing them one by one until we found the culprit.  As it turned out, the first one we removed gave immediate results and the user was able to logon without a delay once the reference to the NP had been removed from the registry.

To summarize; here's what was happening:

On any platform that has installed the Lotus Notes client with the 'Single Sign-on' option chosen during the setup, a network provider called npnotes is added to the list of NP's that Winlogon should notify during logon and password changes. When this NP failed (usually within 48 hours on a busy TS), Winlogon started timing out waiting for an answer from it (default timeout is 2 minutes).

The workaround is to remove the npnotes entries from the Network Provider registry keys which prevents it from loading (no reboot required).  In this case the npnotes entry was the problem and removing it resolved the issue.

The following registry key is taken from a TS running Citrix Presentation Server 4.5:
[HKEY_LOCAL_MACHINESYSTEMCurrentControlSetControlNetworkProviderHwOrder]
"ProviderOrder"="Cwbnetnt,CdmService,RDPNP,LanmanWorkstation,WebClient,PnSson,npnotes"

 [HKEY_LOCAL_MACHINESYSTEMCurrentControlSetControlNetworkProviderOrder]
"ProviderOrder"="Cwbnetnt,CdmService,RDPNP,LanmanWorkstation,WebClient,PnSson,npnotes"

Using a checked build of Winlogon, the following is seen when the problem occurs:
15:05:46.881: 11472.21432> Winlogon-Trace-Timeout: Enabling timeout after 120 seconds
15:07:46.895: 11472.21432> Winlogon-Trace-Timeout: Input timer went off, sending TIMEOUT

Using Procmon you see Winlogon spawn an instance of mpnotify.exe to notify the NP's, which would normally terminate immediately after notifying all the NP's.

When the problem occurs mpnotify.exe terminates after 2 minutes and the logon proceeds normally. Additional symptoms are that the logon is normal for a period of time after a reboot, however when the problem occurs (usually within 48 hours) all logons on that server are affected (since NPNotes stops responding).

On a Citrix box the logon session will be waiting for 120 seconds while the "Checking your credentials" prompt is displayed within the TS client (with the Citrix GINA installed). Replacing the GINA changes the message but doesn't otherwise make any difference (since the problem is with the faulty NP).

Note that this may also affect normal workstations, however the symptoms will be much more apparent on a Terminal Server since it is less likely to be rebooted regularly and has a greater number of users logging on to it.