Troubleshooting High LSASS CPU Utilization on a Domain Controller (Part 2 of 2)

Last time I discussed troubleshooting the most common high CPU scenario within LSASS, which is the server being beaten up by a remote machine. Now let’s talk about the much less common, but still possible, scenario:

You find that the problem is coming from the DC itself.

As I said in the previous post, this is a super rare situation these days. If you are on Windows 2000 Server SP4 or Windows Server 2003 SP1/SP2, we really don’t have any known issues where we can simply hand you a hotfix and send you on your way. The most likely cause is something foreign to the operating system – an add-on security package, a custom password synchronizer, a service running something security-related, etc. A very down and dirty way to check these is:

Examine this registry key on the affected Windows Server 2003 machine (it will be slightly different on Win2000):

HKEY_LOCAL_MACHINE\system\CurrentControlSet\Control\Lsa

  • Do you see anything in the Authentication Packages value except msv1_0?
  • Do you see anything in the Security Packages value except Kerberos msv1_0 schannel wdigest?
  • Do you see anything in the Notification Packages value except RASSFM KDCSVC WDIGEST scecli?
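If you would rather script the three checks above than eyeball them, here is a minimal Python sketch. The baseline lists are taken straight from the bullets above; the function names (`find_unexpected`, `read_lsa_values`) are made up for illustration, and the sketch assumes Python is available on a machine that can read the key. Treat it as a starting point, not a supported tool.

```python
# Baseline entries as listed in this post for Windows Server 2003
# (stored lowercase so the comparison is case-insensitive).
BASELINE = {
    "Authentication Packages": {"msv1_0"},
    "Security Packages": {"kerberos", "msv1_0", "schannel", "wdigest"},
    "Notification Packages": {"rassfm", "kdcsvc", "wdigest", "scecli"},
}


def find_unexpected(values):
    """Return {value name: [entries that are not in the baseline]}."""
    unexpected = {}
    for name, allowed in BASELINE.items():
        extras = [
            entry for entry in values.get(name, [])
            if entry.strip() and entry.strip().lower() not in allowed
        ]
        if extras:
            unexpected[name] = extras
    return unexpected


def read_lsa_values():
    """Read the three values from the local registry (Windows only)."""
    import winreg  # standard library, but only present on Windows

    path = r"SYSTEM\CurrentControlSet\Control\Lsa"
    values = {}
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, path) as key:
        for name in BASELINE:
            try:
                data, _value_type = winreg.QueryValueEx(key, name)
            except FileNotFoundError:
                data = []  # value not present on this machine
            if isinstance(data, str):
                data = [data]  # tolerate a single-string value
            values[name] = list(data)
    return values


# On the affected DC you would run something like:
#   print(find_unexpected(read_lsa_values()))
```

An empty result means the values match the lists above. Anything that is returned is only a lead to investigate, not proof of a problem.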

Anything else in these values on Windows Server 2003 may be suspect, as something or someone has injected non-standard libraries into LSASS. It may be there intentionally, with the DLL simply malfunctioning or misconfigured. It may be malicious. Find the file (it will nearly always be a DLL in the %windir%\system32 directory) and take a look at its properties:

  • Who made it?
  • Is it new?
  • Is it only on the machines having a problem and never on the ones that don’t?
  • Are there different versions between working and non-working machines?
  • Do any of your colleagues recognize it?
  • Is it documented online?
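For the "is it new?" and "different versions between machines?" questions, a small script can collect comparable facts from each server. This is a rough sketch that assumes Python is available and uses a made-up function name (`file_fingerprint`). It records size, last-modified time, and a SHA-256 hash. It does not read the vendor or version resource, so "who made it?" still means a trip to the file's Properties dialog.

```python
import hashlib
import os
import time


def file_fingerprint(path):
    """Return size, last-modified time, and SHA-256 hash for one file."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    info = os.stat(path)
    return {
        "path": path,
        "size": info.st_size,
        "modified": time.strftime(
            "%Y-%m-%d %H:%M:%S", time.localtime(info.st_mtime)
        ),
        "sha256": digest.hexdigest(),
    }
```

Run it against the same DLL on a working and a non-working machine and compare the output. Matching hashes mean the files are byte-for-byte identical, so the difference lies elsewhere.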

Once you think you have a handle on it, get a backup of your server and this registry key and remove the entry, then restart the server in a change control window when users are least affected. Does the high CPU utilization come back? It almost never does, trust me…

If there was nothing of interest there, another good technique is to use MSCONFIG to identify and potentially disable applications that have been added on to the server.

By checking the ‘Hide All Microsoft Services’ box you can see System Services that were added to the machine and did not ship with the operating system (technically speaking, you may still see some services that are from us, such as Exchange). You can then temporarily set them to ‘Disabled’ and restart the server to test for the performance problem. The same can be done with the ‘Startup’ section, for apps that live in the RUN key of the registry. By using the ‘divide by half’ rule (where you disable half and test, disable the other half and test, then narrow down by halves until you find your culprit), you can usually get to the bad guy pretty quickly.
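The ‘divide by half’ rule is just a binary search, and it can be sketched in a few lines of Python. Everything here is illustrative: `find_culprit` and the `problem_occurs` callback are hypothetical names, and in real life the "callback" is you leaving only that group of services enabled, restarting, and watching LSASS. The sketch also assumes exactly one add-on is responsible.

```python
def find_culprit(services, problem_occurs):
    """Narrow a list of add-on services down to the one causing trouble.

    problem_occurs(enabled) stands in for the manual test: leave only the
    services in `enabled` turned on, restart, and return True if the high
    CPU utilization comes back. Assumes a single culprit.
    """
    candidates = list(services)
    while len(candidates) > 1:
        middle = len(candidates) // 2
        first_half = candidates[:middle]
        if problem_occurs(first_half):
            candidates = first_half           # culprit is in the tested half
        else:
            candidates = candidates[middle:]  # culprit is in the other half
    return candidates[0] if candidates else None
```

With 16 add-on services this takes four restarts instead of up to sixteen, which matters when every test needs a change control window.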

You can see all of this info using the Microsoft Product Support Reporting Tools (MPSRPT_DirSvc.exe) as well.

Notes

This blog post is not about debugging. Yes, some of the techniques I use above can be replaced with attaching WINDBG to LSASS, syncing symbols, and going to town to see what’s specifically wrong under the covers. This post is for folks looking for remediation, not code-level root cause. And let’s be honest: we debug things like this every day on customer request. After all the work is done (and the billing against the customer’s contract, ouch!), we still have the same answer: please contact your vendor about this malfunctioning code, as only they can fix it. If you’d like a quick primer on using a debugger to see which modules are loaded into LSASS and might be suspect, please let us know and we’ll blog it up.