Understanding DFS Replication "limits"

[Updated on 11/27/06 to clarify areas where customers commonly have questions] 

The most frequent DFS Replication questions are about replication limits. The answer is, as you might guess, not as simple as some arbitrary hard-coded limit. And just so we’re on the same page, I’m talking about the new replication engine in Windows Server 2003 R2, not File Replication Service. 🙂

Let’s start with the published tested DFS Replication limits:

  • Each server can be a member of up to 256 replication groups.
  • Each replication group can contain up to 256 replicated folders.
  • Each server can have up to 256 connections (for example, 128 incoming connections and 128 outgoing connections).
  • On each server, the result of the following formula should be kept to 1024 or fewer:

    (number of replicated folders in replication group x * number of simultaneously replicating connections in replication group x) + (number of replicated folders in replication group y * number of simultaneously replicating connections in replication group y) + ... + (number of replicated folders in replication group n * number of simultaneously replicating connections in replication group n)

  • A replication group can be arbitrarily large, scaling to several thousands of members. However, each member can be connected to, at most, 256 partners. (See our blog post for more about this 256-member recommendation.)
  • A volume can contain up to 8 million replicated files, and a server can contain up to 1 terabyte of replicated files. (See our blog post for more about this 1-TB recommendation.)

Bullet 4 is a bit difficult to read. I’ll try to clarify it.

For one replication group, multiply the number of replicated folders by the number of simultaneously replicating connections. Repeat this for each replication group on the server. Then, add the results for all replication groups. The result should be kept to 1024 or fewer. (If the replication schedule is staggered, you do not need to count the connections that are not replicating due to a closed schedule.)
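The per-server calculation described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not a DFS Replication tool: the function names and the (folders, active connections) tuple layout are assumptions made for this sketch, and the sample numbers are arbitrary.

```python
from typing import Iterable, Tuple

# Tested per-server limit quoted in this post.
RECOMMENDED_MAX = 1024


def server_load(groups: Iterable[Tuple[int, int]]) -> int:
    """Sum, over every replication group on ONE server, of
    replicated folders * simultaneously replicating connections."""
    return sum(folders * active for folders, active in groups)


def within_tested_limit(groups: Iterable[Tuple[int, int]]) -> bool:
    """True if the server's result is 1024 or fewer."""
    return server_load(groups) <= RECOMMENDED_MAX


# Hypothetical server: one group with 2 folders and 6 active connections,
# another with 10 folders and 3 active connections.
example = [(2, 6), (10, 3)]
print(server_load(example))          # 2*6 + 10*3 = 42
print(within_tested_limit(example))  # True
```

Connections that are idle because of a closed schedule are simply left out of the "active" count for their group.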

When using this formula, remember that it applies to each individual server, not to a DFS Replication deployment as a whole. Here’s an example: Assume you have a replication group that contains 25 members and 50 replicated folders. It might be tempting to multiply 25 and 50 and bam, you’ve exceeded 1,024 without even figuring in the connections. But this isn’t how to use the formula. Instead, you need to look at the server with the largest number of connections. For example, assume one of the servers is a hub server, so it will be replicating with every other server in the replication group. This is the extreme case, because your spoke servers will likely only replicate with the hub server.

The hub will have 24 incoming and 24 outgoing connections to the 24 other servers, so you’ll have 48 connections total. However, it is unlikely that you’ll have 24×7 replication across all of those connections. You will likely have staggered replication windows set up, so say only half of the spoke servers are replicating with the hub server at any time. Therefore, you’ll have 24 active connections at a time. To complete the formula for the hub server, you will have:

(50) replicated folders * (24) actively replicating connections = 1,200. This is slightly higher than the recommended 1,024, but should still yield acceptable replication performance if you’ve followed our other to-be-published guidelines for optimizing throughput.

For any given spoke server, the formula will have a different result:

(50) replicated folders * (2) actively replicating connections = 100.
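To make the hub-versus-spoke difference concrete, here is a small Python sketch of the example above. It assumes each spoke has exactly one incoming and one outgoing connection with the hub; the function and parameter names are made up for this illustration.

```python
def hub_and_spoke_load(folders: int, active_spokes: int) -> tuple:
    """Return (hub result, spoke result) for one replication group.

    Assumes one incoming plus one outgoing connection between the hub
    and each spoke, and that only `active_spokes` spokes have an open
    schedule at any one time.
    """
    hub_active_connections = 2 * active_spokes  # in + out per active spoke
    spoke_active_connections = 2                # in + out with the hub only
    return (folders * hub_active_connections,
            folders * spoke_active_connections)


# 25 members means 24 spokes; half of them (12) replicate at a time.
hub, spoke = hub_and_spoke_load(folders=50, active_spokes=12)
print(hub, spoke)  # 1200 100
```

The same replication group produces very different results depending on which server you evaluate, which is why the formula is applied per server.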

Let’s take another example. Imagine a hub and spoke topology where the hub server is performing both collection (separate data is coming from each spoke to the hub) and publication (data originates at the hub and goes to all spokes). If the hub server collects data from 128 partners, then the resulting calculation for collection is (2) connections * (1) replicated folder, which equals 2. Since you will have 128 replication groups, you can multiply 2*128, which equals 256. 

So you have 256 as the result, but you need to factor in the data publication calculation as well to make sure you are below 1024. Doing the math, you’ll find that you can create at most 3 replicated folders for publication, because the resulting calculation for publication will be (256) connections * (3) replicated folders equals 768, making the combined result 1024. 
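The collection-plus-publication budget can be checked with a short Python sketch. It reads the 256 publication connections as 128 incoming plus 128 outgoing, which is an interpretation of the numbers above; the variable names are illustrative only.

```python
LIMIT = 1024   # tested per-server limit
SPOKES = 128

# Collection: one replication group per spoke, each with
# 2 connections * 1 replicated folder.
collection = SPOKES * (2 * 1)                 # 256

# Publication: a single group reaching every spoke,
# assumed here to be 128 incoming + 128 outgoing connections.
publication_connections = 2 * SPOKES          # 256

# How many publication folders still fit under the limit?
remaining = LIMIT - collection                # 768
max_publication_folders = remaining // publication_connections

total = collection + max_publication_folders * publication_connections
print(max_publication_folders, total)  # 3 1024
```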

In the list of tested limits, the formula in bullet 4 comes from the scenario where the hub and all 128 spokes are replicating simultaneously in both directions (inbound and outbound), which is an extreme stress scenario. Typically, in the data collection scenario, the effective throughput with low bandwidth/high latency networks will be lower than what you can achieve in principle because DFS Replication internally limits the number of concurrent downloads to 4. Similarly, in the data publication scenario, DFS Replication limits the maximum number of files that can be served simultaneously to 5. You can therefore go beyond the 128 fan-out at the hub and the 1024 product number without any performance degradation.

Here is another example of how to use the formula. Say for a given server you have two replication groups, RG1 and RG2. RG1 has 3 replicated folders, and RG2 has 5 replicated folders. RG1 has 4 simultaneously active connections on the server, and RG2 has 10.

What is the correct equation?

(3 * 4) + (5 * 10) = 62
2 * (3 + 5) * (4 + 10) = 224

The first formula is correct. The resulting products for each replication group are added together to reach the final number. The second formula is incorrect because the number of replication groups (2) is being used as a multiplier when in fact the number of replication groups does not have an impact on resource consumption. Instead, resource consumption for DFS Replication is based on the number of active sessions, and that in turn depends on the number of active replicated folders times the partners for each of those replicated folders.
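One way to see that the number of replication groups does not matter is to regroup the same folders and check that the result is unchanged. The Python sketch below does this; the three-group split is a hypothetical rearrangement invented for the illustration.

```python
def load(groups):
    """Sum of folders * active connections over (folders, connections) pairs."""
    return sum(folders * connections for folders, connections in groups)


# RG1: 3 folders, 4 active connections; RG2: 5 folders, 10 active connections.
as_two_groups = [(3, 4), (5, 10)]

# Hypothetical regrouping: RG1's 3 folders split across two groups that
# each keep the same 4 active connections. RG2 is unchanged.
as_three_groups = [(1, 4), (2, 4), (5, 10)]

print(load(as_two_groups), load(as_three_groups))  # 62 62
```

The total tracks folder-and-connection pairs, so it stays at 62 whether the folders live in two groups or three.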

Let’s also look at bullet 6:

  • A volume can contain up to 8 million replicated files, and a server can contain up to 1 terabyte of replicated files.

This is not a hard limit on the number of files that can be replicated. DFS Replication maintains an internal database per volume for all metadata information. Theoretically, the Jet database can grow to as large as 32 TB. If you assume that one quarter of the database is used for ID records, you have 8 TB to store ID records. Assuming a worst-case ID record size of 1 KB, the most files you can replicate on the volume is 8*10^9. By way of example, we have a pair of servers that have replicated in excess of 50 million files. The database takes 20 GB (the file data fills a little more than 1 TB).
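The back-of-the-envelope estimate above works out as follows in Python. The sketch uses decimal units (1 TB = 10^12 bytes) so that the result matches the 8*10^9 figure. The bytes-per-file value at the end is derived only from the two numbers quoted for our test servers and covers the whole database, not just ID records.

```python
KB = 10**3
GB = 10**9
TB = 10**12

max_database_size = 32 * TB            # theoretical maximum quoted above
id_record_space = max_database_size // 4   # one quarter for ID records: 8 TB
id_record_size = 1 * KB                # assumed worst-case record size

max_files = id_record_space // id_record_size
print(max_files)                       # 8000000000, i.e. 8*10^9

# Observed pair of servers: about 50 million files in a 20 GB database.
observed_bytes_per_file = (20 * GB) / 50_000_000
print(observed_bytes_per_file)         # 400.0
```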

As for the size of the data replicated, DFS Replication has no restrictions on the size of files replicated. As long as the staging space is appropriately sized (you’ll have to read a future blog post on this) and DFS Replication does not hit any space issues, it can replicate files of any size.

It is possible to exceed these limits, perhaps many times over, and get acceptable replication performance. For example, our IT department here at Microsoft is using DFS Replication, and they’ve far exceeded these tested limits. How DFS Replication will perform in your organization, either below or above our recommended limits, depends on many variables, including:

  • The rate of change happening in a schedule window
  • The bandwidth throttling settings
  • The speed of the network
  • The ability to compress changes using RDC and whether cross-file RDC is used
  • The size of the staging folder quota
  • The speed of the disk subsystem
  • Whether you have optimized the servers by placing the replicated folder and staging folders on separate disks

When replicating large amounts of data, there are some performance tuning options that you can use. For example, when replicating large data files on a LAN or a very high-speed WAN, disabling RDC may be beneficial; with RDC disabled, the files are not staged either.

One final point. If you have deployed DFS Replication (or are going to), we are aware of some known Remote Procedure Call (RPC) issues that can cause replication to fail, possibly without corresponding failure events in the DFS Replication event log. An RPC fix will be available as an official GDR package on Windows Update, as a recommended update, by early January. If you are seeing replication backlogs increasing (in the Health Report) over a period of days, with no associated error messages in the event logs (i.e., no disk space constraints or other issues), we recommend that you install the update on all production servers participating in DFS Replication after initial lab testing.

We have much more to say on the subject of DFS Replication performance. We’re in the process of creating more blog entries, and eventually this content will all live in the documentation. Stay tuned!

Jill and Shobana