How to frisk a DC when people are complaining of "Authentication Issues".

At Microsoft we do quite a bit of dogfooding (imagine that), and in doing so we run into issues in the infrastructure that often crop up as "authentication issues".  For example, users can't get to a website, a share, e-mail, etc.  The symptoms vary, but the outcome is the same: angry people at your door (sometimes literally).

So in these situations how can you find out if a DC is misbehaving or give the "all clear" for the directory service and tell them to go look elsewhere?  Well here are some of the common things I check and do in this situation.

 

1) Portqry.exe the ports 389 (LDAP), 3268 (GC), 445 (Microsoft-DS), 139 (NetBIOS), and 88 (Kerberos).  This should give us an idea whether the basics are working, as far as the DC listening where it should.  Sometimes I throw in an nbtstat -A against the IP of the DC for good measure.
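If Portqry.exe isn't handy, the same basic "is it listening?" check can be roughed out in a few lines of Python.  This is only a sketch: it tests TCP connects and nothing else (Portqry can also probe UDP and query the LDAP service), and the function name check_tcp_ports is my own invention, not part of any tool.

```python
import socket

# TCP ports from the checklist above; the labels are only for display.
DC_PORTS = {389: "LDAP", 3268: "GC", 445: "Microsoft-DS", 139: "NetBIOS", 88: "Kerberos"}


def check_tcp_ports(host, ports=DC_PORTS, timeout=2.0):
    """Return {port: True/False} depending on whether a TCP connect succeeds."""
    results = {}
    for port in ports:
        try:
            # create_connection resolves the name and attempts the connect.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            # Covers refused connections, timeouts, and name resolution failures.
            results[port] = False
    return results


# Example (BRAD-DC-01 is the sample DC name used in this post):
# for port, ok in check_tcp_ports("BRAD-DC-01").items():
#     print(port, DC_PORTS[port], "LISTENING" if ok else "NOT REACHABLE")
```

A port that shows False here only tells you the connect failed from where you ran it; a firewall in between looks the same as a dead service.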

2) Use Nltest and Tail.exe in tandem.  I like to do this remotely so I never actually have to TS into the box (actually that goes for pretty much everything on this site).  I run nltest /server:BRAD-DC-01 /dbflag:2080FFFF.  This turns on netlogon debug logging.  You can turn it on for the server where the authentication issues are happening as well as the DC it's pointed to.  Then use tail.exe (part of the resource kit tools) to watch the file in real time.  Now you can watch all the stuff go by in the netlogon log, or you can use findstr to look only for errors.

C:\Localbin>tail -f \\brad-dc-01\admin$\Debug\Netlogon.log |findstr /i Critical
12/01 18:45:10 [CRITICAL] BRADDOM: NlGetIncomingPassword: server.bradddom.brad.com: cannot LsarQueryTrustedDomainInfoByName 0xc0000034
12/01 18:45:13 [CRITICAL] BRADDOM: NlGetIncomingPassword: server.bradddom.brad.com: cannot LsarQueryTrustedDomainInfoByName 0xc0000034
12/01 18:45:23 [CRITICAL] BRADDOM: NlGetIncomingPassword: Can't NlSamOpenNamedUser for machine3$ 0xc0000064.
12/01 18:45:23 [CRITICAL] BRADDOM: NetrServerAuthenticate: Can't NlGetIncomingPassword for machine3$ 0xc0000064.
12/01 18:45:23 [CRITICAL] Ping from Brad-DC-01 for domain brad-dc-01.BRADDOM.brad.com (null) for (null) on <Local> is invalid since we don't host the named domain.
12/01 18:45:26 [CRITICAL] BRADDOM: NlGetIncomingPassword: Can't NlSamOpenNamedUser for Baller$ 0xc0000064.
12/01 18:45:26 [CRITICAL] BRADDOM: NetrServerAuthenticate: Can't NlGetIncomingPassword for Baller$ 0xc0000064.
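When the log is busy, it can help to tally which status codes keep showing up on the critical lines.  Here is a minimal Python sketch of that idea.  It assumes the line layout shown in the excerpt above (a [CRITICAL] tag and an 8-digit hex code somewhere on the line), and critical_status_counts is a made-up helper name.

```python
import re
from collections import Counter

# Matches an NTSTATUS-style hex code such as 0xc0000064.
STATUS_RE = re.compile(r"0x[0-9a-fA-F]{8}")


def critical_status_counts(lines):
    """Count hex status codes seen on [CRITICAL] lines of a Netlogon.log."""
    counts = Counter()
    for line in lines:
        if "[CRITICAL]" not in line.upper():
            continue  # skip everything that is not tagged critical
        for code in STATUS_RE.findall(line):
            counts[code.lower()] += 1
    return counts


# Example, reading the same file the tail command above watches:
# with open(r"\\brad-dc-01\admin$\Debug\Netlogon.log", errors="replace") as log:
#     for code, hits in critical_status_counts(log).most_common():
#         print(code, hits)
```

Unlike tail -f this reads the file once rather than following it, so it is better suited to a quick summary after the fact.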

We can then use err.exe (which I mentioned in my blog earlier) to look up some error codes of interest...

C:\Debuggers>err 0xc0000064
# for hex 0xc0000064 / decimal -1073741724
STATUS_NO_SUCH_USER ntstatus.h

ERROR_DUP_NAME winerror.h
# You were not connected because a duplicate name exists on
# the network. Go to System in Control Panel to change the
# computer name and try again.
# 2 matches found for "0xc0000034"
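For reference, the other code in the log, 0xc0000034, is STATUS_OBJECT_NAME_NOT_FOUND in ntstatus.h.  Its low bits are 0x34 (decimal 52), which is presumably why ERROR_DUP_NAME, Win32 error 52, appears as a second match.  If err.exe isn't around, a tiny Python lookup covers the hex-to-decimal part.  The table below holds only the two codes from this post and the helper names are my own; it is nowhere near a replacement for err.exe.

```python
# Only the two NTSTATUS codes that appear in the log excerpt; extend as needed.
KNOWN_NTSTATUS = {
    0xC0000034: "STATUS_OBJECT_NAME_NOT_FOUND",
    0xC0000064: "STATUS_NO_SUCH_USER",
}


def to_signed32(value):
    """Interpret a 32-bit value as a signed integer, the way err.exe prints it."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def describe(code_text):
    """Turn a string like '0xc0000064' into a one-line description."""
    code = int(code_text, 16)  # int() accepts the 0x prefix when base is 16
    name = KNOWN_NTSTATUS.get(code, "unknown")
    return f"{code_text} / decimal {to_signed32(code)}: {name}"


print(describe("0xc0000064"))
print(describe("0xc0000034"))
```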

3) Running net view against the server is a good idea; depending on the error code, it might point to different things.  For instance, if the DC is pinging but not responding to net view, it could be the firewall that you should look into, or perhaps IPsec.

4) repadmin commands.  I use a few of these to get a feel for if the DC is in sync and everything's cool from a replication standpoint. 

C:\Localbin>repadmin /replsum brad-dc-01 /bysrc /bydest /sort:error
Replication Summary Start Time: 2006-12-01 18:52:35

Beginning data collection for replication summary, this may take awhile:
....

Source DC largest delta fails/total %% error
brad-dc-15 01d.11h:07m:03s 8 / 13 61 (1256) The remote system is not available. For information about network troubleshooting, see Windows Help. //DC offline
Sonja-DC-04 11h:56m:04s 4 / 11 36 (1256) The remote system is not available. For information about network troubleshooting, see Windows Help. //DC offline
brad-dc-14 42m:39s 0 / 5 0
brad-dc-02 42m:39s 0 / 13 0
brad-dc-03 42m:39s 0 / 13 0
brad-dc-12 50m:51s 0 / 13 0
brad-dc-19 42m:39s 0 / 13 0
brad-dc-25 42m:39s 0 / 13 0
CORP-DC-07 42m:39s 0 / 11 0

Destination DC largest delta fails/total %% error
brad-dc-01 01d.11h:07m:03s 12 / 105 11 (1256) The remote system is not available. For information about network troubleshooting, see Windows Help.
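If you save the /replsum output to a file, pulling out just the rows with failures is easy to script.  The Python sketch below assumes rows look like the ones above (name, largest delta as one token, fails / total, percent, then an optional error code in parentheses); other delta formats would need a looser pattern.  failing_rows is a hypothetical helper name.

```python
import re

# One summary row: <dc> <delta> <fails> / <total> <pct> [(<code>) <message>]
ROW_RE = re.compile(
    r"^\s*(?P<dc>\S+)\s+(?P<delta>\S+)\s+(?P<fails>\d+)\s*/\s*(?P<total>\d+)"
    r"\s+(?P<pct>\d+)(?:\s+\((?P<code>\d+)\)\s*(?P<msg>.*))?$"
)


def failing_rows(replsum_text):
    """Return (dc, fails, total, error_code) for every row with failures."""
    rows = []
    for line in replsum_text.splitlines():
        m = ROW_RE.match(line)
        # Header and banner lines do not match the pattern and are skipped.
        if m and int(m["fails"]) > 0:
            code = int(m["code"]) if m["code"] else None
            rows.append((m["dc"], int(m["fails"]), int(m["total"]), code))
    return rows
```

Rows from both the source and destination tables come back in one list, in the order they appear in the output.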

C:\Localbin>repadmin /queue brad-dc-01
Queue contains 0 items.

C:\Localbin>repadmin /showoutcalls brad-dc-01
brad-dc-01 has 1 outgoing DRS RPC calls in progress:

Call type: DRS_CALL_REPLICA_SYNC
Target server: 32577452-1d08-467b-8dd7-2384458f93232._msdcs.bradddom.brad.com //Outgoing call to sync with this DC; let's see which one it is below
Handle info: bound 1 FromCache 1 InCache 1
Client thread id: 1252
Time call started: 2006-12-01 18:51:15
Call timeout: 5 minutes
Call duration: 1 minutes and 42 seconds

C:\Localbin>ping 32577452-1d08-467b-8dd7-2384458f93232._msdcs.bradddom.brad.com

Pinging Sonja-DC-04 [0000:4898:dc05:32:3456:8a45:4588:4323] from 0000:4898:dc05:23:3456:618c:3cc4:1234 with 32 bytes of data:

Reply from 0000:4898:dc05:32:3456:8a45:4588:4323: time<1ms // It's Sonja-DC-04, the DC from the repadmin report above that's offline for whatever reason.  We should probably look into that server.
Reply from 0000:4898:dc05:32:3456:8a45:4588:4323: time<1ms

5) Check DNS!  It seems to come back and bite us once in a while.  If SRV records get scavenged or something else is messed up, you can see some weird behavior.  For instance, you could see a few DCs pegged at 100% CPU while the others are not loaded, or clients going to DCs outside of their site.
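As a starting point for the DNS check, here is a small Python sketch that builds the standard DC locator SRV record names for a domain (and optionally a site) so you can feed them to nslookup -type=SRV.  It only builds names and does no DNS queries itself.  The function name and the site name in the example are made up for illustration; the domain is the sample one from the log above.

```python
def dc_locator_srv_names(domain, site=None):
    """Build the LDAP and Kerberos SRV names clients use to locate DCs."""
    names = [
        f"_ldap._tcp.dc._msdcs.{domain}",
        f"_kerberos._tcp.dc._msdcs.{domain}",
    ]
    if site:
        # Site-specific records; if these are missing, clients may end up
        # talking to DCs outside of their own site.
        names.append(f"_ldap._tcp.{site}._sites.dc._msdcs.{domain}")
        names.append(f"_kerberos._tcp.{site}._sites.dc._msdcs.{domain}")
    return names


# "Example-Site" is a hypothetical site name.
for name in dc_locator_srv_names("bradddom.brad.com", site="Example-Site"):
    print(f"nslookup -type=SRV {name}")
```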

6) Use SPA.  If you have it installed :)  This will give you an idea whether it's something network related that is causing load on the DC, if that's your problem.

7) nltest /sc_query:domain /server:server.  I should have mentioned this first.  When a call comes in about authentication issues with a particular resource, you should find out which DC that resource has for its secure channel and then start the frisking!

 

Now of course there are many other things you can check, but this will at least give you peace of mind that a DC is healthy.  Other tools include:  eventvwr, dcdiag, netdiag, replmon, NTDS diagnostics, Netmon, etc.

 

So what to do when you don't know what DC is messed up?

Well, that's a bit more tricky.  Usually the easiest method is to start with the resource that is affected instead of looking for the needle in the haystack.  Again, though, repadmin is QUITE useful with error codes and could give you some clues.

Easiest way to use repadmin to check every DC in the forest?

REPADMIN /REPLSUM * /BYSRC /BYDEST /SORT:ERROR

Give it a shot and see if your AD infrastructure is online and healthy! 

As always your comments are welcome...

 
