Monitor Active Directory Replication Status with OMS

Summary: This post explains the new Active Directory Replication Status solution for Microsoft Operations Management Suite.

Hi everyone, J.C. Hornbeck here. Active Directory is a key component of an enterprise IT environment. To ensure high availability and high performance, each domain controller has its own copy of the Active Directory database. Domain controllers replicate with each other to propagate changes across the enterprise. Failures in this replication process can cause a host of problems across the enterprise, so staying on top of replication status is an important task for any Active Directory administrator.

To help with this task, we’ve recently released a new solution for Microsoft Operations Management Suite: AD Replication Status. This solution gathers information about replication failures throughout your Active Directory environment and surfaces it on your OMS dashboard.

Getting started with the AD Replication Status solution

If you don’t have an OMS workspace yet, you can create one for free (see Create New Workspace).

Then you’ll need to connect at least one of your domain controllers to OMS. For details, see Connect Windows computers directly to OMS.

If you’d prefer to run the solution from an OMS-connected member server in your domain, rather than from a domain controller, you’ll need to set the following registry key on the member server, and then restart the Health Service:

HKLM\SOFTWARE\Microsoft\AzureOperationalInsights\Assessments_Targets
Value: ADReplication

After you have connected at least one domain controller (or member server with the registry key set) to your OMS workspace, simply go to the Solutions Gallery from the main OMS dashboard, and click the AD Replication Status solution.

Image of menu

Using the AD Replication Status solution

When you add this solution to your workspace, you’ll start to see statistics on your OMS dashboard about replication errors in your Active Directory environment, for example:

Image of menu

Note   The “critical” replication error number refers to errors that are over 75% of tombstone lifetime, or TSL. If you’re not familiar with TSL, I’ll talk about it more shortly.

This tile updates automatically every few days, so you’ll always be able to see the latest information about replication errors in your environment. Clicking this tile takes you into the AD Replication Status dashboard screen, which has more detailed information about the errors that were detected:

Image of menu

Let’s take a closer look at the specific blades that show on this screen.

Destination Server Status and Source Server Status

Image of menu

These blades show the status of destination servers and source servers that are experiencing replication errors. The number after each domain controller name indicates the number of replication errors on that domain controller.

The errors for both destination servers and source servers are shown because some problems are easier to troubleshoot from the source server perspective and others from the destination server perspective.

In the previous example, you can see that many destination servers have roughly the same number of errors, but there’s one source server (ADDC35) that has many more errors than all the others. It’s likely that there is some problem on ADDC35 that is causing it to fail to send data to its replication partners. Fixing the problems on ADDC35 will likely resolve many of the errors that appear in the destination server blade.
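To make the source-versus-destination idea concrete, here is a minimal Python sketch that counts the same set of errors from both perspectives. The record layout, the field names (source, destination), and every server name other than ADDC35 are assumptions made up for illustration; this is not how the solution stores or queries its data.

```python
from collections import Counter

# Hypothetical replication-error records. Field names and all server
# names except ADDC35 are assumptions for illustration only.
errors = [
    {"source": "ADDC35", "destination": "ADDC01"},
    {"source": "ADDC35", "destination": "ADDC02"},
    {"source": "ADDC35", "destination": "ADDC03"},
    {"source": "ADDC07", "destination": "ADDC01"},
]

def count_by_role(records):
    """Return (errors per source server, errors per destination server)."""
    by_source = Counter(r["source"] for r in records)
    by_destination = Counter(r["destination"] for r in records)
    return by_source, by_destination

by_source, by_destination = count_by_role(errors)

# The source server with the most errors is a good first troubleshooting target.
worst_source, worst_count = by_source.most_common(1)[0]
print(f"Source with most errors: {worst_source} ({worst_count})")
```

In this made-up data, the errors are spread fairly evenly across destinations, but three of the four share one source, which is the same pattern the blades make visible.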

If you click a domain controller name, you will drop in to the Search screen, where you can see more detailed information about the errors on that specific domain controller.

Image of menu

Of course, all the great features of the OMS Search screen are available so that you can drill in to the root cause of the problem. Here, we’ve filtered down the results to look at replication errors involving the Schema partition.

Image of menu

This source server is failing to replicate the same partition with 19 different destination servers, and at least the three shown here are failing with the same error (8451 – The replication operation encountered a database error). This indicates that we can most likely focus our troubleshooting efforts on this single source server (ADDC35) and expect that a single fix will address multiple errors.

The Search screen also displays a Help link for each error. Unfortunately, clicking this link currently does not work properly, but you can copy it and paste it into your browser to view documentation on TechNet that has more information about the error and how to troubleshoot it. As an example, here’s a link to the Help for 8451 errors:

Replication error 8451: The replication operation encountered a database error

Replication error types

Image of menu

This blade gives you information about the types of errors detected throughout your enterprise. Each error has a unique numerical code and a message that can help you determine the root cause of the error.

The donut at the top gives you an idea of which errors appear more and less frequently in your environment. In this example, the most frequent error codes were:

  • 8451 (152 occurrences)
  • 1256 (93 occurrences)
  • 1908 (22 occurrences)
  • 1722 (21 occurrences)
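As a rough illustration of what the donut conveys, the following Python sketch turns the four counts listed above into relative shares. It assumes only these four codes, so the percentages are shares of the listed errors, not of every error in the environment; the helper name shares is made up for this example.

```python
from collections import Counter

# Occurrence counts quoted in the example above (top four codes only).
top_errors = Counter({8451: 152, 1256: 93, 1908: 22, 1722: 21})

def shares(counts):
    """Return each code's fraction of the total across the given counts."""
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()}

for code, share in shares(top_errors).items():
    print(f"{code}: {share:.1%} of the listed errors")
```

Seen this way, error 8451 alone accounts for more than half of the listed occurrences, which is why it is a sensible place to start.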

The list shows the error codes identified, along with the associated message. Again, you can click an error message in the list to drop in to the Search screen and see more detailed information about each occurrence of that particular error code, across all domain controllers in your enterprise. Here’s an example that filters down to only occurrences of error code 1908:

Image of menu

Tombstone lifetime

Image of menu

The tombstone lifetime determines how long a deleted object (called a “tombstone”) is retained in the Active Directory database. When a deleted object passes the tombstone lifetime, a garbage collection process automatically removes it from the Active Directory database.

The default tombstone lifetime is 180 days for most recent versions of Windows, but it was 60 days on older versions, and it can be changed explicitly by an Active Directory administrator.

It’s important to know if you’re having replication errors that are approaching or are past the tombstone lifetime. If two domain controllers experience a replication error that persists past the tombstone lifetime, replication will be disabled between those two domain controllers, even if the underlying replication error is fixed.

The Tombstone Lifetime blade helps you identify places where this is in danger of happening. In the previous example, you can see that there are 64 errors that are over 100% of the tombstone lifetime (the orange arc in the donut). Each of these errors represents a partition that has not replicated between its source and destination server for at least the tombstone lifetime for the forest. Again, you can click the Over 100% TSL text to drill in to details of these errors. Here is one example:

Image of menu

In this case, the data was collected by OMS on December 29, 2015 (TimeGenerated field). The last successful synchronization (LastSuccessfulSync field) was on January 27, 2015 (11 months earlier). Clearly, this is way past the tombstone lifetime!
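The arithmetic behind that conclusion can be sketched in a few lines of Python. The two dates come from the example; the tombstone lifetime of the forest in the example is not stated, so the sketch simply tries both default values mentioned earlier (180 and 60 days) as an assumption.

```python
from datetime import date

time_generated = date(2015, 12, 29)       # TimeGenerated in the example
last_successful_sync = date(2015, 1, 27)  # LastSuccessfulSync in the example

# Number of days the partition has gone without a successful sync.
days_without_sync = (time_generated - last_successful_sync).days

# Assumption: the forest uses one of the two default tombstone lifetimes.
for tsl_days in (180, 60):
    percent = 100 * days_without_sync / tsl_days
    print(f"{days_without_sync} days = {percent:.0f}% of a {tsl_days}-day TSL")
```

Either way, 336 days without a successful sync is well over 100% of the tombstone lifetime.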

In this situation, simply fixing the replication error will not be enough. At a minimum, you’ll need to do some manual investigation to identify and clean up lingering objects before you can restart replication. You may even need to decommission a domain controller.

In addition to identifying any replication errors that have persisted past the tombstone lifetime, you’ll also want to pay attention to any errors falling into the “50-75% TSL” or “75-100% TSL” buckets.

These are errors that are clearly lingering, not transient, so they likely need your intervention to resolve. The good news is that they have not yet reached the tombstone lifetime. If you fix these problems promptly and before they reach the tombstone lifetime, replication can restart with minimal manual intervention.

As noted earlier, the dashboard tile for the AD Replication Status solution shows the number of “critical” replication errors in your environment, which are defined as errors that are over 75% of tombstone lifetime (including errors that are over 100% of TSL). Strive to keep this number at 0.

Note  All the tombstone lifetime percentage calculations are based on the actual tombstone lifetime for your Active Directory forest, so you can trust that those percentages are accurate, even if you have a custom tombstone lifetime value set.
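If it helps to see the bucketing spelled out, here is a minimal Python sketch of how an error could be placed into a TSL bucket and flagged as critical. It is an illustration under stated assumptions, not the solution's actual logic: the function names are made up, the handling of exact boundary values (for example, exactly 75%) is a guess, and the "Under 50% TSL" label is hypothetical because the post does not name that bucket. The tombstone lifetime is passed in as a parameter to reflect that the percentages depend on the forest's actual value.

```python
from datetime import datetime

SECONDS_PER_DAY = 86400

def tsl_percent(last_successful_sync, time_generated, tombstone_lifetime_days):
    """Time since the last successful sync, as a percentage of the TSL."""
    elapsed = time_generated - last_successful_sync
    return 100 * elapsed.total_seconds() / (tombstone_lifetime_days * SECONDS_PER_DAY)

def tsl_bucket(percent):
    """Map a TSL percentage to a bucket label (boundary handling is assumed)."""
    if percent > 100:
        return "Over 100% TSL"
    if percent > 75:
        return "75-100% TSL"
    if percent > 50:
        return "50-75% TSL"
    return "Under 50% TSL"  # hypothetical label, not named in the post

def is_critical(percent):
    """'Critical' means over 75% of TSL, including errors over 100%."""
    return percent > 75

# Usage: 100 days without a sync against a custom 120-day tombstone lifetime.
pct = tsl_percent(datetime(2016, 1, 1), datetime(2016, 4, 10), 120)
print(f"{pct:.1f}% -> {tsl_bucket(pct)}, critical: {is_critical(pct)}")
```

The same 100-day gap would land in the "50-75% TSL" bucket with a 180-day tombstone lifetime, which is why the calculation has to use your forest's real value.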

Replication problems are one of the top call generators for the Microsoft Active Directory support team. We hope this new OMS solution helps you stay on top of your replication errors and fix them quickly when they occur.

For more information about Active Directory replication, please see the Active Directory Replication Technologies site on TechNet.

J.C. Hornbeck, senior solution asset PM
Microsoft Enterprise Cloud Group