Monitor Active Directory Replication Status with OMS

Active Directory is a key component of an enterprise IT environment. To ensure high availability and high performance, each domain controller has its own copy of the Active Directory database. Domain controllers replicate with each other to propagate changes across the enterprise. Failures in this replication process can cause a host of problems, so staying on top of replication status is an important task for any Active Directory administrator.

To help with this task, we’ve recently released a new solution for Operations Management Suite: AD Replication Status. This solution gathers information about replication failures throughout your AD environment and surfaces it on your OMS dashboard.

Getting started with the AD Replication Status solution

If you don’t have an OMS workspace yet, you can create one for free.

Then you’ll need to connect at least one of your domain controllers to OMS. For details, see the documentation on how to connect a machine to OMS.

If you’d prefer to run the solution from an OMS-connected member server in your domain, rather than from a domain controller, you’ll need to set the following registry key on the member server and then restart the HealthService:
Key: HKLM\SOFTWARE\Microsoft\AzureOperationalInsights\Assessments_Targets
Value: ADReplication

Once you have connected at least one domain controller (or a member server with the registry key set) to your OMS workspace, go to the Solutions Gallery from the main OMS dashboard, then click the AD Replication Status solution.

adrepl1

Note: The AD Replication Status solution is currently limited to evaluating Active Directory forests with a maximum of 300 domain controllers.

Using the AD Replication Status solution

When you add this solution to your workspace, you’ll start to see statistics on replication errors in your Active Directory environment, right on your OMS dashboard:

adrepl2

(The “critical” replication error number refers to errors that are over 75% of tombstone lifetime, or TSL. If you’re not familiar with TSL, we’ll talk about it more in just a minute.)

This tile will update automatically every few days, so you’ll always be able to see the latest information on replication errors in your environment.

Clicking this tile will take you into the AD Replication Status dashboard screen, which has more detailed information about the errors that were detected:

adrepl3

Let’s take a closer look at the specific blades that show on this screen.

Destination Server Status and Source Server Status

adrepl4

These blades show the destination servers and source servers, respectively, that are experiencing replication errors. The number after each domain controller name is the number of replication errors on that domain controller.

We show the errors by both source server and destination server because some problems are easier to troubleshoot from the source server perspective, and others from the destination server perspective. In this example, you can see that many destination servers have roughly the same number of errors, but there’s one source server (ADDC35) that has many more errors than all the others. It’s likely that there is some problem on ADDC35 that is causing it to fail to send data to its replication partners, and so fixing the problems on ADDC35 will likely resolve many of the errors that appear in the destination server blade.
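The same comparison can be sketched offline. The snippet below is a minimal illustration, assuming the replication errors have been exported as records with hypothetical `source` and `destination` keys (these names are not taken from the solution’s schema, and every server name other than ADDC35 is made up). It tallies errors from both perspectives so that an outlier on the source side stands out.

```python
from collections import Counter

# Hypothetical export of replication errors. The "source" and "destination"
# keys, and all server names except ADDC35, are illustrative assumptions.
errors = [
    {"source": "ADDC35", "destination": "ADDC01"},
    {"source": "ADDC35", "destination": "ADDC02"},
    {"source": "ADDC35", "destination": "ADDC03"},
    {"source": "ADDC07", "destination": "ADDC01"},
]


def count_by(records, key):
    """Count replication errors per server for the given perspective."""
    return Counter(record[key] for record in records)


by_source = count_by(errors, "source")
by_destination = count_by(errors, "destination")

# The source view exposes the outlier; the destination view is nearly flat.
print(by_source.most_common(1))
print(by_destination.most_common())
```

In this made-up data the destination counts are all close together, while one source server accounts for most of the errors, which mirrors the pattern described above.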

If you click on a domain controller name, you will drop into the search screen, where you can see more detailed information on the errors on that specific domain controller.

adrepl5

Of course, all the great features of the OMS search screen are available to help you drill into the root cause of the problem. Here, we’ve filtered the results down to just the replication errors involving the Schema partition.

adrepl6

We can see that this source server is failing to replicate this same partition with 19 different destination servers, and at least the three shown here are failing with the same error (8451 – The replication operation encountered a database error). Again, this indicates that we can most likely focus our troubleshooting efforts on this single source server (ADDC35) and expect that a single fix will address multiple errors.

The search screen also displays a “HelpLink” for each error. Unfortunately, clicking on this link currently does not work properly, but you can copy/paste it into your browser window to view documentation on TechNet that has more information on the error and how to troubleshoot it. As an example, here’s a clickable link to the help on 8451 errors: http://go.microsoft.com/fwlink/?LinkId=228631

Replication Error Types

adrepl7

This blade gives you information on the types of errors detected throughout your enterprise. Each error has a unique numerical code, as well as a message that can help you determine the root cause of the error.

The donut at the top gives you an idea of which errors appear most and least frequently in your environment. In this example, we can see that the most frequent error codes were 8451 (152 occurrences), 1256 (93 occurrences), 1908 (22 occurrences), and 1722 (21 occurrences).

The list shows the error codes identified, along with the associated message. Again, you can click on an error in the list to drop into the search screen and see more detailed information on each occurrence of that particular error code, across all domain controllers in your enterprise. Here’s an example filtering down to just occurrences of error code 1908:

adrepl8

Tombstone Lifetime

adrepl9

The tombstone lifetime determines how long a deleted object (called a “tombstone”) is retained in the Active Directory database. Once a deleted object passes the tombstone lifetime, a garbage collection process automatically removes it from the Active Directory database.

The default tombstone lifetime is 180 days for most recent versions of Windows, but it was 60 days on older versions, and it can be changed explicitly by an Active Directory administrator.

It’s important to know if you’re having replication errors that are approaching or are past the tombstone lifetime. If two domain controllers experience a replication error that persists past the tombstone lifetime, then replication will be disabled between those two DCs, even if the underlying replication error is fixed.

The “Tombstone Lifetime” blade helps you identify places where this is in danger of happening. In the example shown above, you can see that there are 64 errors that are over 100% of tombstone lifetime (the orange arc in the donut). Each of these errors represents a partition that has not replicated between its source and destination server for at least the tombstone lifetime for the forest. Again, you can click on the “Over 100% TSL” text to drill into details of these errors. Here is one example:

adrepl10

In this case, the data was collected by OMS on December 29, 2015 (TimeGenerated field). The last successful synchronization (LastSuccessfulSync field) was on January 27, 2015 – 11 months earlier. Clearly, this is way past the tombstone lifetime!
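To make the arithmetic concrete, here is a minimal sketch that computes how far past the tombstone lifetime this error is. The two dates come from the example; the 180-day tombstone lifetime is an assumption (the forest’s actual value is not stated here), and the function name `percent_of_tsl` is hypothetical.

```python
from datetime import date


def percent_of_tsl(last_successful_sync, time_generated, tsl_days):
    """Return the time since the last successful sync as a percentage of TSL."""
    elapsed_days = (time_generated - last_successful_sync).days
    return 100.0 * elapsed_days / tsl_days


# Dates from the example; 180 days is an assumed tombstone lifetime.
last_sync = date(2015, 1, 27)
collected = date(2015, 12, 29)

elapsed = (collected - last_sync).days
pct = percent_of_tsl(last_sync, collected, tsl_days=180)
print(elapsed, round(pct, 1))  # 336 186.7
```

With a 60-day tombstone lifetime, the same 336 days would work out to 560% of TSL.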

In this situation, simply fixing the replication error will not be enough. At a minimum, you’ll need to do some manual investigation to identify and clean up lingering objects before you can restart replication. You may even need to decommission a domain controller.

In addition to identifying any replication errors that have persisted past the tombstone lifetime, you’ll also want to pay attention to any errors falling into the “50-75% TSL” or “75-100% TSL” buckets. These are errors that are clearly lingering, not transient, so they likely need your intervention to resolve. The good news is that they have not yet reached the tombstone lifetime. If you fix these problems promptly, before they reach the tombstone lifetime, replication can restart with minimal manual intervention.

As noted earlier, the dashboard tile for the AD Replication Status solution shows the number of “critical” replication errors in your environment. These are errors that are over 75% of tombstone lifetime, including errors that are over 100% of TSL. Strive to keep this number at 0.
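The bucketing and the “critical” threshold can be sketched as follows. This is only an illustration of the rules described in this post, not the solution’s actual implementation: the function names are hypothetical, the “Under 50% TSL” label is an assumed name for the remaining bucket, and the handling of exact boundary values (50, 75, 100) is a guess.

```python
def tsl_bucket(percent):
    """Map a percentage of tombstone lifetime to a dashboard-style bucket."""
    if percent > 100:
        return "Over 100% TSL"
    if percent > 75:
        return "75-100% TSL"
    if percent >= 50:
        return "50-75% TSL"
    return "Under 50% TSL"  # assumed label; not named in this post


def is_critical(percent):
    """Critical means over 75% of TSL, including errors past 100%."""
    return percent > 75


# Hypothetical percentages of TSL for five replication errors.
sample = [12.0, 60.0, 80.0, 100.0, 186.7]
buckets = [tsl_bucket(p) for p in sample]
critical_count = sum(is_critical(p) for p in sample)

print(buckets)
print(critical_count)  # 3
```

Here three of the five sample errors count as critical: the two in the 75-100% range and the one past 100%.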

Note: All the tombstone lifetime percentage calculations are based on the actual tombstone lifetime for your Active Directory forest, so you can trust that those percentages are accurate even if you have set a custom tombstone lifetime value.

Replication problems are one of the top call generators for Microsoft’s Active Directory support team. We hope this new OMS solution helps you stay on top of your replication errors and fix them quickly when they occur. For more information on Active Directory replication, please see the Active Directory Replication Technologies topic on TechNet.