Troubleshooting KCC Event Log Errors

My name is David Everett and I’m a Support Escalation Engineer on the Directory Services Support team.

I’m going to discuss a recent trend I’ve seen where Active Directory replication appears to be fine, but a single DC in one or more sites begins logging Knowledge Consistency Checker (KCC) Warning and Error events in the Directory Service event log. Sample events are included below.

For those not familiar with the KCC, it is a distributed application that runs on every domain controller. The KCC is responsible for creating the connection objects between domain controllers; collectively, these connections form the replication topology. The KCC uses Active Directory data to determine where (from what source domain controller to what destination domain controller) to create these connections.

In some cases these errors are logged all the time. In others they are logged at regular intervals, clear on their own, and then reappear like clockwork. Typically other DCs in the same site(s), perhaps even in the whole forest, report no KCC errors at all. In some cases the DC logging these errors has a small number of connection objects compared with its peer DCs in the same site:

Event Type: Warning
Event Source: NTDS KCC
Event Category: (1)
Event ID: 1566
Date: 5/14/2008
Time: 1:51:23 PM
User: NT AUTHORITY\ANONYMOUS LOGON
Computer: DC1X
Description:
All domain controllers in the following site that can replicate the directory partition over this transport are currently unavailable.

Site: CN=SITEY,CN=Sites,CN=Configuration,DC=contoso,DC=com
Directory partition: CN=Configuration,DC=contoso,DC=com
Transport: CN=IP,CN=Inter-Site Transports,CN=Sites,CN=Configuration,DC=contoso,DC=com

-AND-

Event Type: Error
Event Source: NTDS KCC
Event Category: (1)
Event ID: 1311
Date: 5/14/2008
Time: 1:51:23 PM
User: NT AUTHORITY\ANONYMOUS LOGON
Computer: DC1X
Description:
The Knowledge Consistency Checker (KCC) has detected problems with the following directory partition.

Directory partition: CN=Configuration,DC=contoso,DC=com

There is insufficient site connectivity information in Active Directory Sites and Services for the KCC to create a spanning tree replication topology. Or, one or more domain controllers with this directory partition are unable to replicate the directory partition information. This is probably due to inaccessible domain controllers.

User Action
Use Active Directory Sites and Services to perform one of the following actions:
- Publish sufficient site connectivity information so that the KCC can determine a route by which this directory partition can reach this site. This is the preferred option.
- Add a Connection object to a domain controller that contains the directory partition in this site from a domain controller that contains the same directory partition in another site.

If neither of the Active Directory Sites and Services tasks correct this condition, see previous events logged by the KCC that identify the inaccessible domain controllers.

In some cases this event is also seen; it suggests name resolution is working but a network port is blocked:

Event Type: Warning
Event Source: NTDS KCC
Event Category: (1)
Event ID: 1865
Date: 5/14/2008
Time: 1:51:23 PM
User: NT AUTHORITY\ANONYMOUS LOGON
Computer: DC1X
Description:
The Knowledge Consistency Checker (KCC) was unable to form a complete spanning tree network topology. As a result, the following list of sites cannot be reached from the local site.

Sites: CN=SITEY,CN=Sites,CN=Configuration,DC=contoso,DC=com

If you encounter this issue, the DC logging the errors may be hosting the Intersite Topology Generator (ISTG) role for its site. The ISTG is responsible for maintaining all of the inter-site connection objects for the site. It polls each DC in its site for connection objects that have failed, and if the peer DCs report failures, the ISTG logs these events to indicate that something is not right with connectivity.

For those wondering what these events mean, here is a quick rundown:

  • The 1311 event indicates the KCC couldn't connect up all the sites.
  • The 1566 event indicates the DC could not replicate from any server in the site identified in the event description.
  • When logged, the 1865 event contains secondary information about the failure to connect the sites and tells which sites are disconnected from the site where the KCC errors are occurring.

OK, I’ll get to the point and explain how to identify the root cause and correct this. These errors point to a topology or a connectivity issue. Either there are not enough site links to connect all the sites or, more likely, network connectivity between specific DCs is failing.

If your network is not fully routed (fully routed means any DC in the forest can perform an RPC bind to every other DC in the forest), make certain Bridge All Site Links (BASL) is unchecked. If BASL is unchecked, Site Links and/or Site Link Bridges must be configured. Site Links and Site Link Bridges provide the KCC with the information it needs to build connections over existing network routes. If the network is fully routed and you have BASL checked, fine.
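To make the "not enough site links" case concrete, here is a minimal sketch that treats sites as nodes and site links as edges and reports which sites cannot be reached from a starting site. It is a simplified model of my own, not how the KCC is implemented: it ignores costs, schedules, site link bridges, and which partitions each DC hosts. The site link name "X-Z" and the function name are hypothetical.

```python
from collections import defaultdict, deque

def unreachable_sites(sites, site_links, start):
    """Return the sites that cannot be reached from `start` over the site links.

    sites      -- iterable of site names
    site_links -- dict of site link name -> list of member sites
    """
    # Every site in a site link is a neighbour of every other site in that link.
    neighbours = defaultdict(set)
    for members in site_links.values():
        for site in members:
            neighbours[site].update(s for s in members if s != site)

    # Breadth-first search outward from the starting site.
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in neighbours[current]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return sorted(set(sites) - seen)

# Hypothetical topology: SiteY has not been added to any site link.
sites = ["SiteX", "SiteY", "SiteZ"]
site_links = {"X-Z": ["SiteX", "SiteZ"]}
print(unreachable_sites(sites, site_links, "SiteX"))  # ['SiteY']
```

A site that shows up in the result is one the KCC has no published route to, which is the condition the 1311 event text describes.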

Even where the network routes exist, the ports Active Directory needs for replication must not be blocked.

This post assumes these errors continue to be logged even though the site listed in the 1566 event has been added to a site link object and the AD topology is correctly configured.

To locate the source of the KCC events and identify the root cause, you need to execute the following commands while the KCC events are being logged.

1) Identify the ISTG covering each site by running this command:

repadmin /istg

The output will list all sites in the forest and the ISTG for each site:

repadmin running command /istg against server localhost

Gathering topology from site Default-First-Site-Name (DC1.contoso.com):

              Site              ISTG
================== =================
             SiteX              DC1X
             SiteY              DC1Y

NOTE: Determine from the output if the DC logging these events (DC1X) is the ISTG or not.
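If you have many sites, checking the table by eye gets tedious. Here is a small sketch that parses saved "repadmin /istg" output into a site-to-ISTG mapping. It assumes the two-column layout shown above (a row of "=" characters followed by one "Site ISTG" pair per line); the function names are my own, and the exact column spacing may differ on your version of repadmin.

```python
def parse_istg(output):
    """Map each site name to its ISTG from saved `repadmin /istg` text."""
    istg_by_site = {}
    in_table = False
    for line in output.splitlines():
        stripped = line.strip()
        if stripped.startswith("="):
            in_table = True  # separator row: data rows follow
            continue
        if in_table and stripped:
            parts = stripped.split()
            if len(parts) >= 2:
                istg_by_site[parts[0]] = parts[1]
    return istg_by_site

def is_istg(istg_by_site, dc_name):
    """Return True if dc_name is listed as the ISTG for any site."""
    return dc_name.lower() in (name.lower() for name in istg_by_site.values())

sample = """\
repadmin running command /istg against server localhost

Gathering topology from site Default-First-Site-Name (DC1.contoso.com):

              Site              ISTG
================== =================
             SiteX              DC1X
             SiteY              DC1Y
"""

mapping = parse_istg(sample)
print(mapping)                   # {'SiteX': 'DC1X', 'SiteY': 'DC1Y'}
print(is_istg(mapping, "DC1X"))  # True
```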

2) If the DC logging the events is the ISTG, any one of the DCs in the same site as this ISTG could have connectivity issues to the site identified in the 1566 event. To find which DC(s) are failing to replicate from that site, run repadmin /failcache against all DCs in the ISTG's site. For example, DC1X is logging the events and it is the ISTG for siteX. To identify which DCs in siteX are failing to replicate from siteY, run this command:

repadmin /failcache site:siteX >siteX-failcache.txt

The failcache output shows two DCs in siteX:

repadmin running command /failcache against server DC1X._msdcs.contoso.com
==== KCC CONNECTION FAILURES ============================
(none)

==== KCC LINK FAILURES ==================================
SiteY\DC1Y
DC object GUID: 7c2eb482-ad81-4ba7-891e-9b77814f7473
No Failures.

repadmin running command /failcache against server DC2X._msdcs.contoso.com
==== KCC CONNECTION FAILURES ============================
(none)

==== KCC LINK FAILURES ==================================
SiteY\DC1Y
DC object GUID: 7c2eb482-ad81-4ba7-891e-9b77814f7473
46 consecutive failures since 2008-08-12 22:14:39.
SiteZ\DC1Z
DC object GUID: fh3h8bde-a928-466a-97b0-39a507acbe54
No Failures.

The output above identifies DC2X in siteX as the Destination DC that is failing to inbound replicate from siteY. In some cases the DC name is not resolved and appears as a GUID-based name (s9hr423d-a477-4285-adc5-2644b5a170f0._msdcs.contoso.com). If the DC name is not resolved, determine the hostname of the Destination DC by pinging the fully qualified CNAME:

ping s9hr423d-a477-4285-adc5-2644b5a170f0._msdcs.contoso.com

NOTE: DC2X may or may not be logging Error events in its Directory Service event log the way DC1X, the ISTG, is.
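When the failcache file from step 2 covers a large site, a short script can pull out just the failing links. The sketch below is an illustration only: it assumes link entries appear in the usual Site\DC form and that failures are reported as "N consecutive failures since <timestamp>", as in the output above. The function name and regular expressions are mine, not part of repadmin.

```python
import re

SERVER_RE = re.compile(r"against server (\S+)")
SOURCE_RE = re.compile(r"([^\s\\]+\\[^\s\\]+)")  # Site\DC
FAILURE_RE = re.compile(
    r"(\d+) consecutive failures since (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"
)

def failing_links(output):
    """Return (destination, source, failure_count, since) for each failing link."""
    results = []
    server = source = None
    for line in output.splitlines():
        m = SERVER_RE.search(line)
        if m:
            # A new "against server" banner starts the next destination DC.
            server, source = m.group(1), None
            continue
        m = SOURCE_RE.search(line)
        if m:
            source = m.group(1)
        m = FAILURE_RE.search(line)
        if m:
            results.append((server, source, int(m.group(1)), m.group(2)))
    return results

sample = r"""
repadmin running command /failcache against server DC1X._msdcs.contoso.com
==== KCC LINK FAILURES ==================================
SiteY\DC1Y
No Failures.

repadmin running command /failcache against server DC2X._msdcs.contoso.com
==== KCC LINK FAILURES ==================================
SiteY\DC1Y
46 consecutive failures since 2008-08-12 22:14:39.
SiteZ\DC1Z
No Failures.
"""

for dest, src, count, since in failing_links(sample):
    print(f"{dest} <- {src}: {count} failures since {since}")
```

Each tuple names the Destination DC first, which is the machine you need to log on to in the next step.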

3) Log on to the Destination DC identified in the previous step and determine if RPC connectivity from the Destination DC to the Source DC (DC1Y) is working.

repadmin /bind DC1Y.contoso.com

  • If “repadmin /bind DC1Y” from the Destination DC succeeds:

Run “repadmin /showrepl <Destination DC>” and examine the output to determine if Active Directory Replication is blocked. The reason for replication failure should be identified in the output. Take the appropriate corrective action to get replication working.

  • If “repadmin /bind DC1Y” from the Destination DC fails:

Verify firewall rules are not interfering with connectivity between the Destination DC and the Source DC. If the port blockage between the Destination DC and the Source DC cannot be resolved, configure the other DCs in the site where the errors are logged to be Preferred Bridgeheads and force the KCC to build new connection objects with the Preferred Bridgeheads only.

NOTE: Running “repadmin /bind DC1Y” from the ISTG logging the KCC errors may reveal no connectivity issues to DC1Y in the remote site. As noted earlier, the ISTG is responsible for maintaining inter-site connectivity and may not be the DC having the problem. For this reason the command must be run from the Destination DC that repadmin /failcache identified as failing to inbound replicate.

A successful bind looks similar to this:

C:\>repadmin /bind DC1Y
Bind to DC1Y succeeded.
NTDSAPI V1 BindState, printing extended members.
bindAddr: DC1Y
Extensions supported (cb=48):
BASE : Yes
ASYNCREPL : Yes
REMOVEAPI : Yes
MOVEREQ_V2 : Yes
GETCHG_COMPRESS : Yes
DCINFO_V1 : Yes
RESTORE_USN_OPTIMIZATION : Yes
KCC_EXECUTE : Yes
ADDENTRY_V2 : Yes
LINKED_VALUE_REPLICATION : Yes
DCINFO_V2 : Yes
INSTANCE_TYPE_NOT_REQ_ON_MOD : Yes
CRYPTO_BIND : Yes
GET_REPL_INFO : Yes
STRONG_ENCRYPTION : Yes
DCINFO_VFFFFFFFF : Yes
TRANSITIVE_MEMBERSHIP : Yes
ADD_SID_HISTORY : Yes
POST_BETA3 : Yes
GET_MEMBERSHIPS2 : Yes
GETCHGREQ_V6 (WHISTLER PREVIEW) : Yes
NONDOMAIN_NCS : Yes
GETCHGREQ_V8 (WHISTLER BETA 1) : Yes
GETCHGREPLY_V5 (WHISTLER BETA 2) : Yes
GETCHGREPLY_V6 (WHISTLER BETA 2) : Yes
ADDENTRYREPLY_V3 (WHISTLER BETA 3): Yes
GETCHGREPLY_V7 (WHISTLER BETA 3) : Yes
VERIFY_OBJECT (WHISTLER BETA 3) : Yes
XPRESS_COMPRESSION : Yes
DRS_EXT_ADAM : No
Site GUID: stn45bf5-f33f-4d53-9b1b-e7a0371f9a3d
Repl epoch: 0
Forest GUID: idk4734-eeca-11d2-a5d8-00805f9f21f5
Security information on the binding is as follows:
SPN Requested: LDAP/DC1Y
Authn Service: 9
Authn Level: 6
Authz Service: 0
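If the bind fails and you suspect a firewall, a quick TCP probe from the Destination DC can help narrow things down. The sketch below is my own illustration, not a replacement for repadmin /bind: it only checks whether a TCP connection can be opened. I picked TCP 135 (the RPC endpoint mapper) and TCP 389 (LDAP) as examples; replication also uses a dynamically assigned RPC port unless you have fixed it, so open results here do not prove replication traffic can pass.

```python
import socket

def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False

def probe(host, ports=(135, 389)):
    """Return {port: reachable} for the given host.

    The default ports are an assumption for illustration:
    135 = RPC endpoint mapper, 389 = LDAP.
    """
    return {port: tcp_port_open(host, port) for port in ports}

# Example call, run from the Destination DC:
#   probe("DC1Y.contoso.com")
```

A port that reports False while name resolution works is a good reason to go look at the firewall rules between the two DCs.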

4) If these events occur at specific times of the day or week and then resolve on their own, verify DNS Scavenging is not set too aggressively. Scavenging that is too aggressive can purge valid SRV, A, CNAME and other records from DNS, causing name resolution between DCs to fail. If this is the behavior you are seeing, verify scavenging settings on these DNS zones:

  • _msdcs.forestroot.com
  • forestroot.com
  • Scavenging settings need to be checked on child domains if the Source or Destination DCs are in child domains.

Example: if Scavenging is set this way, the outage will occur every 24 hours:

Non-refresh period: 8 hours
Refresh period: 8 hours
Scavenging period: 8 hours

To correct this, change the Refresh and Non-refresh periods to 1 day each and set scavenging to 3 days. See “Managing the aging and scavenging of server data” on TechNet to configure these settings for the DNS Server and/or zones.
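The arithmetic behind the 24-hour example can be sketched in a few lines. This is a simplified model under my own assumptions: the record is never refreshed after it is written, it becomes eligible for scavenging once the Non-refresh and Refresh periods have both elapsed, and the next scavenging pass arrives at most one scavenging period later. The function name is hypothetical.

```python
def scavenging_window_hours(no_refresh_h, refresh_h, scavenging_period_h):
    """Return (eligible_after, removed_by) in hours for an unrefreshed record.

    eligible_after -- age at which the record may be scavenged
    removed_by     -- worst-case age by which a scavenging pass has removed it
    """
    eligible_after = no_refresh_h + refresh_h
    removed_by = eligible_after + scavenging_period_h
    return eligible_after, removed_by

# The aggressive settings from the example: 8 + 8 + 8 hours.
print(scavenging_window_hours(8, 8, 8))     # (16, 24)

# The suggested settings: 1 day + 1 day, scavenging every 3 days.
print(scavenging_window_hours(24, 24, 72))  # (48, 120)
```

With the 8-hour settings, a record that misses its refresh can be gone within a day, which lines up with errors that come and go on a daily cycle.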

Hopefully this clears up the mysterious KCC errors on that one DC.

- David Everett