Conficker causes LSASS to consume CPU Time on Domain Controllers

Hi, Gautam here. I wanted to blog about a high-impact problem we have been seeing recently.

The problem has to do with LSASS consuming a lot of CPU time on your Domain Controllers (DCs). The cause of this high CPU turns out to be Conficker-infected computers throwing bad passwords against the DCs. The rate of bad passwords can be as high as 10,000 per minute from multiple clients.

Technical information on Conficker can be found here.

The problem can manifest itself in many ways, including:

  1. Slow authentication and logons being reported by users
  2. Slow mail flow
  3. Slow resource access (resources could be file shares, printers, and more), or even complete failure of resource access

Some of the above problems take time to narrow down. You will typically have to work through a few other pieces before you narrow it down to the domain controllers being bogged down with high CPU time.

Background:

CPU usage on the domain controllers stays very high (I'm calling 70% and above "high", as long as that is not normal for the DC). On looking closer, you find LSASS.EXE eating up all of this CPU. Perfmon reports show the CPU usage staying more or less consistent throughout the day; it doesn't drop during off-peak hours.
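If you want to confirm quickly that it really is LSASS (and log its usage over time), you can watch the process counter yourself. Here is a minimal sketch in Python using the third-party psutil package; Python and psutil are my own choice for illustration, the actual investigation used Perfmon:

import time
import psutil

# Find the lsass.exe process (there is exactly one on a Windows box).
lsass = next((p for p in psutil.process_iter(["name"])
              if (p.info["name"] or "").lower() == "lsass.exe"), None)
if lsass is None:
    raise SystemExit("lsass.exe not found - run this on the DC itself")

# cpu_percent() measures usage since the previous call, so prime it once.
lsass.cpu_percent()
while True:
    time.sleep(5)
    # Divide by the core count to get a whole-machine percentage,
    # comparable to the CPU column in Task Manager.
    print("lsass.exe CPU: %.1f%%" % (lsass.cpu_percent() / psutil.cpu_count()))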

As you can imagine, this high CPU usage affects other workloads that are AD-dependent, including Exchange, SharePoint, authentication, and so on.

If you temporarily pull the network cable from the DC and wait a few minutes, LSASS drops back down to ~1%, or whatever value is normal in your setup. Ned Pyle described the logic behind pulling the network cable in detail in a previous post.

In this case as well, pulling the network cable brought LSASS CPU usage back within normal limits. Plugging it back in made LSASS shoot right back up to 80-90% CPU.

If you follow the steps Ned documented in that post, network traces will show a HUGE number of authentication requests coming into the DCs. Now, it's not always easy to differentiate between bad and good traffic when you are looking at 100 MB worth of network traffic.

In this case, however, what you are bound to see is something like the frames below. I highly recommend using Netmon 3.x; its Conversations feature is ideal for working through the large traces you will inevitably collect on a DC.

09:54:16.593 192.168.0.1 DC01.CONTOSO.COM KerberosV5 KerberosV5:AS Request Cname: User1 Realm: CONTOSO.COM Sname: krbtgt/CONTOSO.COM

09:54:16.625 DC01.CONTOSO.COM 192.168.0.1 KerberosV5 KerberosV5:KRB_ERROR - KDC_ERR_PREAUTH_FAILED (24)

OR

09:54:16.531 192.168.0.2 DC01.CONTOSO.COM TCP TCP:Flags=......S., SrcPort=4614, DstPort=Microsoft-DS(445), PayloadLen=0, Seq=3314092510, Ack=0, Win=65535 ( ) = 65535

09:54:16.531 DC01.CONTOSO.COM 192.168.0.2 TCP TCP:Flags=...A..S., SrcPort=Microsoft-DS(445), DstPort=4614, PayloadLen=0, Seq=1831638666, Ack=3314092511, Win=17520 ( Scale factor not supported ) = 17520

09:54:16.531 192.168.0.2 DC01.CONTOSO.COM TCP TCP:Flags=...A...., SrcPort=4614, DstPort=Microsoft-DS(445), PayloadLen=0, Seq=3314092511, Ack=1831638667, Win=65535 (scale factor 0x0) = 65535

09:54:16.531 192.168.0.2 DC01.CONTOSO.COM SMB SMB:C; Negotiate, Dialect = PC NETWORK PROGRAM 1.0, LANMAN1.0, Windows for Workgroups 3.1a, LM1.2X002, LANMAN2.1, NT LM 0.12

09:54:16.531 DC01.CONTOSO.COM 192.168.0.2 SMB SMB:R; Negotiate, Dialect is NT LM 0.12 (#5), SpnegoNegTokenInit

09:54:16.578 192.168.0.2 DC01.CONTOSO.COM SMB SMB:C; Session Setup Andx, NTLM NEGOTIATE MESSAGE

09:54:16.578 DC01.CONTOSO.COM 192.168.0.2 SMB SMB:R; Session Setup Andx, NTLM CHALLENGE MESSAGE - NT Status: System - Error, Code = (22) STATUS_MORE_PROCESSING_REQUIRED

09:54:16.593 192.168.0.2 DC01.CONTOSO.COM TCP TCP:Flags=...A...F, SrcPort=4614, DstPort=Microsoft-DS(445), PayloadLen=0, Seq=3314092888, Ack=1831639470, Win=64732 (scale factor 0x0) = 64732

09:54:16.593 DC01.CONTOSO.COM 192.168.0.2 TCP TCP:Flags=...A...F, SrcPort=Microsoft-DS(445), DstPort=4614, PayloadLen=0, Seq=1831639470, Ack=3314092889, Win=17143 (scale factor 0x0) = 17143

09:54:16.593 192.168.0.2 DC01.CONTOSO.COM TCP TCP:Flags=...A...., SrcPort=4614, DstPort=Microsoft-DS(445), PayloadLen=0, Seq=3314092889, Ack=1831639471, Win=64732 (scale factor 0x0) = 64732

Now, in the above examples of network traffic, the first one, with the Kerberos KDC_ERR_PREAUTH_FAILED, is a sure-shot bad password attempt. The other frames aren't necessarily always bad authentication attempts, but they show the kind of data connections to LSARPC that I saw in three of the four recent cases I had with this issue.
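If you save the frame summaries from Netmon out to a text file, a few lines of script will count the pre-auth failures per client and hand you a top-talkers list. Below is a minimal sketch; the file name trace_summary.txt and the assumption of one frame per line in the shape shown above are mine:

from collections import Counter

failures = Counter()

with open("trace_summary.txt", encoding="utf-8", errors="replace") as f:
    for line in f:
        if "KDC_ERR_PREAUTH_FAILED" not in line:
            continue
        # The error frame travels DC -> client, so the client is the
        # destination: the third column in the summaries above.
        fields = line.split()
        if len(fields) >= 3:
            failures[fields[2]] += 1

print("Top clients by Kerberos pre-auth failures:")
for client, count in failures.most_common(10):
    print("%8d  %s" % (count, client))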

SPA (Server Performance Advisor) reports will show a high number of calls to SAMSRV or LSARPC. Tim Springston, who runs his own excellent AD-related blog, has discussed the use of SPA here.

With the top users obtained from both SPA and the network traces, we explored three of the top client computers. We pulled MPSReports (an often-used PSS data collection tool) from these client computers. The first thing that stood out in the event logs was all the Logon/Logoff failure audits, Event ID 529, in the Security event logs.

Note: by default, only Success auditing for Logon/Logoff and Account Logon is enabled, and in this case the Domain Controllers were running with the defaults. The client computers, however, had Failure auditing for Logon/Logoff enabled.


This of course led us to...

  1. Checking this customer's account lockout policy; we saw they did not have account lockouts enabled.
  2. Enabling Failure auditing for Account Logon in a policy that applied to the Domain Controllers as well.

No sooner had the failure-audit policy applied to the DCs than the Security event logs filled with Account Logon failure audits, Event ID 675. Here is an example of a 675 event:

Event Type: Failure Audit
Event Source: Security
Event Category: Account Logon
Event ID: 675
Date: 3/23/2009
Time: 3:03:57 AM
User: NT AUTHORITY\SYSTEM
Computer: DC01
Description:
Pre-authentication failed:

    User Name: User1
    User ID: %{S-1-5-21-xxxxxxxxxx-xxxxxxxxxx-xxxxxxxxx-xxxxx}
    Service Name: krbtgt/CONTOSO.COM
    Pre-Authentication Type: 0x2
    Failure Code: 0x18
    Client Address: 192.168.0.100  <-- IP of the computer that is throwing the bad credentials

Using EVENTCOMBMT to pull the relevant event IDs from the various DCs (namely 529, 644, 675, 676, and 681), plus a little bit of Office Excel magic, I quickly had a list of ~100 computers sending bad passwords within a 30-minute time frame. The total number of failed logons was enough to drive up the LSASS.EXE CPU usage. LSASS, of course, was only doing its job: keeping up with the load and failing the bad authentication attempts.
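If you'd rather skip the Excel step, the same aggregation is easy to script. Here is a minimal sketch that counts bad-password events per client address; the file name eventcomb_out.txt and the exact layout of the EventCombMT export are assumptions on my part, so adjust the parsing to match your output:

import re
from collections import Counter

# The 675 description embeds "Client Address: x.x.x.x", as shown above.
CLIENT_RE = re.compile(r"Client Address:\s*(\d{1,3}(?:\.\d{1,3}){3})")

per_client = Counter()

with open("eventcomb_out.txt", encoding="utf-8", errors="replace") as f:
    for line in f:
        match = CLIENT_RE.search(line)
        if match:
            per_client[match.group(1)] += 1

print("Clients throwing bad passwords, worst first:")
for addr, count in per_client.most_common():
    print("%6d  %s" % (count, addr))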

Putting it all together:

The pattern (multiple user logons from a single computer) and the rate (hundreds of attempts per minute per computer) at which these clients were throwing bad passwords were a pretty sure sign of malware activity. A few more client computers, which we picked up from the SPA and Netmon reports, revealed traces of Conficker. With the Microsoft PSS Security team and the customer's own antivirus vendor involved, they were able to patch, scan, and clean their computers, and this effort brought the LSASS CPU usage on the DCs down dramatically.

So: from high LSASS CPU, to network traces leading to the top client computers, to client security events, to DC security events, and back to the client computers! As you can imagine, nailing this down took some time the first time around. The second, third, and fourth cases were traced to unpatched, Conficker-infected computers much more quickly.

I hope this blog post saves someone a LOT of time and effort when they face such an issue.

  • Gautam Anand