In this blog post, let me touch on a topic that I frequently get questions on from customers. Most namespace scalability questions are along one of the following lines:
- I want to put x number of DFS folders in one DFS namespace, does this work?
- Does it matter whether I put these x DFS folders into one namespace or spread them across several?
- What factors should I consider in designing a scalable DFS namespace?
When you start thinking about scalability of your DFS Namespace with the desired number of DFS folders, you should consider at least the following four critical questions:
- Is my namespace server resourced to do a full sync on the namespaces without being brought to its knees?
- Can my DFS namespace keep up with the throughput/frequency of operations performed by management applications?
- Can my clustered DFS namespace fail over quickly enough? (this applies only to stand-alone namespaces)
- Can my namespace server come up quickly enough after a reboot?
Let us discuss each one of the questions in turn. I suspect most of you reading this blog post have probably used DFS namespaces for a while, so the namespace concepts and terms in the following discussion should be very familiar. In case you need to refresh your memory, Overview of DFS Namespaces is a good one to refer to. In the rest of the blog post, unless otherwise stated, I will use the term “DFS folder” to represent a DFS folder with one or more folder targets (as opposed to a DFS Folder that has no folder targets of its own, but which simply acts as a container for other DFS Folders).
1. Is my namespace server resourced to do a full sync on the namespaces?
What is Full sync? First, let us briefly review key related concepts. The DFSN service performs two kinds of “sync” operations for a domain-based namespace: Full sync and Periodic sync. For a stand-alone namespace, the DFSN service performs just a Full sync. The objective behind any sync operation is to synchronize the “working” namespace metadata on the DFS namespace server with the most recent authoritative information. The “working” metadata includes both the in-memory caches and the on-disk folder structure on the root SMB file share.
Let’s look at the easier one first. For domain-based namespaces, Periodic sync – also known as the hourly sync as that’s the default – looks for and performs delta sync for changed portions of the namespace metadata from the DC. Periodic sync is usually a faster operation: when there are no changes, the periodic sync is effectively a “no op”.
A Full sync may happen by itself under certain conditions such as when a management operation on the namespace (e.g. add a folder path) fails and the DFSN service decides that it is best to fully synchronize its working data set with the latest namespace metadata. This would be from the Active Directory domain controller (DC) for domain-based namespaces, and local registry for stand-alone namespaces. Full sync also occurs whenever a DFSN server switches from one DC to another DC as its source for authoritative namespace metadata. Non-reachability of the first DC is an example of an event that could cause such a DC switch.
Full sync by its very nature is a potentially time-consuming and resource-intensive process because it synchronizes all the namespace metadata from the DC. While a Full sync is in progress, all other management operations on the namespace pause until the Full sync completes. A Full sync operation also causes the DFSN service to (re)create the reparse points for every DFS folder on the SMB root file share, and to re-apply all the previously configured ACLs on the reparse points. So a Full sync can cause spikes in network, CPU, memory and disk resource usage on the namespace server. In many deployments, DCs double as namespace servers, so you should carefully monitor and understand these resource usage spikes before you go into production. Windows Server 2012 and later versions also include a key performance optimization: they do not recreate reparse points if valid reparse points already exist.
How do I confirm server is sized for Full sync? The best way to confirm that your namespace server is properly resourced is to kick off a Full sync manually using the command “dfsutil root forcesync \\RootServer1.Contoso.com\PublicDocs” while monitoring the DFS performance counters (see the discussion of performance counter specifics a little further down in this blog post) and using Resource Monitor to watch disk, CPU and network usage. Note that this command works only on stand-alone namespaces and Windows Server 2008 mode domain-based namespaces. Notice also that I am forcing a Full sync not on the namespace, but on the desired namespace root server. Since there can be multiple root servers for a namespace, you have to specify the specific root server that you want to fully synchronize.
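Because the Full sync has to be forced on one root server at a time, it can help to generate the command line for each root server up front. The following is a minimal Python sketch that only builds the command strings and does not run them. The second server name (RootServer2.Contoso.com) is a hypothetical example, not something from this post.

```python
def forcesync_commands(root_servers, root_share):
    """Build one 'dfsutil root forcesync' command line per namespace root server."""
    return [
        f"dfsutil root forcesync \\\\{server}\\{root_share}"
        for server in root_servers
    ]


# RootServer2 is a hypothetical second root server used for illustration.
commands = forcesync_commands(
    ["RootServer1.Contoso.com", "RootServer2.Contoso.com"],
    "PublicDocs",
)
for command in commands:
    print(command)
```

Run each command separately and watch the counters on the corresponding server, so that the resource spikes can be attributed to a single root server.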
In Windows Server 2012 and later, the DFSN service now logs events to the ‘Applications and Services Logs\Microsoft\Windows\DFSN-Server\Admin’ event channel both at the time a Full sync is initiated (Event ID: 516), and at the time the Full sync is completed (Event ID: 517).
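These two events make it possible to measure how long each Full sync took. Here is a minimal Python sketch that pairs each Event ID 516 with the next Event ID 517 and reports the elapsed time. It assumes you have already exported the events as (event ID, timestamp) records sorted by time, and that Full syncs on a given server do not overlap. The sample records are hypothetical, not real log data.

```python
from datetime import datetime, timedelta

FULL_SYNC_STARTED = 516    # logged when a Full sync is initiated
FULL_SYNC_COMPLETED = 517  # logged when a Full sync is completed


def full_sync_durations(events):
    """Pair each 516 event with the next 517 event and return the durations.

    `events` is an iterable of (event_id, timestamp) tuples sorted by time.
    """
    durations = []
    started_at = None
    for event_id, timestamp in events:
        if event_id == FULL_SYNC_STARTED:
            started_at = timestamp
        elif event_id == FULL_SYNC_COMPLETED and started_at is not None:
            durations.append(timestamp - started_at)
            started_at = None
    return durations


# Hypothetical sample records for illustration only.
sample_events = [
    (516, datetime(2024, 1, 1, 9, 0, 0)),
    (517, datetime(2024, 1, 1, 9, 2, 30)),
    (516, datetime(2024, 1, 1, 14, 0, 0)),
    (517, datetime(2024, 1, 1, 14, 1, 0)),
]
durations = full_sync_durations(sample_events)
print([d.total_seconds() for d in durations])
```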
Does my DC scale for DFSN Full sync? One interesting knob to consider in thinking about Full sync is root scalability for domain-based namespaces. When you enable it through “dfsutil property rootscalability Enable \\Contoso.com\PublicDocs”, DFSN will generally access the nearest DC instead of the Primary Domain Controller Emulator (PDCE) for the periodic sync. Additionally, when this property is enabled, namespace servers do not send change notifications to other namespace servers. Root scalability is thus a good option if the namespace is relatively static. Notice further that unlike “forcesync”, I am enabling root scalability on the namespace as a whole, not on a single root server. Enabling this mode has the potential to significantly decrease the resource usage on the PDCE to support DFSN Full sync operations. Root scalability is designed for usage with a large number of namespace servers, and is especially attractive when the namespace server is running right on a DC itself. However, you should be aware of the potential for transient stale data when using this mode. When you make a namespace change, your nearest namespace server sends these namespace metadata changes on to the PDCE via LDAP calls. Then the changes get replicated out to other DCs via AD replication mechanisms. Until the local DC has fully synchronized with the PDCE, there is a small window for stale data.
2. Can my DFS namespace keep up with the throughput/frequency of operations performed by management applications?
Management applications? First off, let me elaborate on what I mean by “management applications” here. The DFS Namespaces feature ships with a set of management tools in Windows, e.g. dfsutil, dfscmd, dfsdiag, DFS Management UI, DFS BPA. We have also added a set of new DFS Namespaces Windows PowerShell cmdlets in Windows Server 2012 and later. File Server Resource Manager provides sophisticated file classification/quota management and reporting capabilities. Additionally, System Center Operations Manager (SCOM) ships a File Services Management Pack as well as a Windows Server 2012 R2 File Services Management Pack that includes extensive DFS Namespaces monitoring capabilities. All these management tools consume a set of “NetDfs*” APIs under the covers that directly or indirectly can generate management traffic on a namespace server. So you want to size your namespace server to keep up with this management operation traffic.
Let us look at a couple of related API considerations. NetDfs APIs cover a gamut of capabilities including add, delete, update, and enumerate operations of various types. Compared to the DFSN service design in Windows Server 2008 R2 which serializes a portion of all these management operations without distinction, the DFSN service design in Windows Server 2012 lends itself to better scalability and concurrency in tasks requiring read operations (e.g. NetDfsGetInfo, NetDfsEnum).
Note that usage of NetDfs APIs has an interesting implication on root scalability that we have earlier discussed. In all versions of the OS, NetDfs APIs always cause the DFSN server to switch to sync from the PDCE for authoritative namespace metadata, even if root scalability is configured on that namespace. Further, a DFSN server is forced to do a full sync of the namespace whenever such a DC switch occurs. So you should be aware of the following parameters regarding this intersection:
- NetDfs-consuming applications (Microsoft – see previous examples, or 3rd party – check with your software vendor) and their site distributions across your environment: Ideally, you want to consolidate such applications to a single site with DFSN site-costing such that only the DFSN server in that single site is relying on PDCE while all other DFSN servers continue to exhibit root scalability
- Rate of management operations: Ideally, eliminate any typical routine “state refreshes” that such applications might perform if there’s no need for that refresh, as a state refresh typically involves additional calls to NetDfs APIs
- Latency from your namespace servers to the PDCE: Ideally, the site running such applications should be the same site as PDCE or the closest to it, to avoid doing full synchronization of namespace metadata over long-distance WAN links
- Total load on the PDCE: Size your PDCE to have enough head room to support the load generated by such unexpected namespace metadata full sync requests from DFSN servers servicing NetDfs calls
How do I select the namespace server that I want to test? If your DFSN deployment tends to be relatively static (i.e. you do not add, change or delete your namespaces, namespace folders or their folder targets all that often), chances are that your namespace server is just fine handling the light management operations traffic. If you tend to make a significant number of namespace metadata changes on a daily or periodic basis, it is very important for you to simulate that burst of namespace activity and make sure that your namespace server is holding up well under that load. For domain-based namespaces with multiple namespace servers, it becomes a little tricky to identify or choose the specific namespace server that will service the management operations initiated by a client computer, as the best root target is automatically chosen by the DFSN client based on site-costing and other factors. Fortunately, however, the dfsutil command supports just the required functionality:
Look up the current “active” root target in the client-local cache for the desired namespace root.
PS C:\Windows\system32> dfsutil cache referral
Expires in 0 seconds
UseCount: 0 Type:0x81 ( REFERRAL_SVC DFS )
0:[\Server1\Public] AccessStatus: 0xc00000be ( TARGETSET )
1:[\Server2\Public] AccessStatus: 0 ( ACTIVE )
In this example, if Server2 is what you want to test, you are already set as it is the current active target in the DFSN client cache. Go ahead with generating that burst of namespace activity. Let’s say you instead want to test Server1. You can change the active target to Server1 using the following:
PS C:\Windows\system32> dfsutil client property state active \\Contoso.com\Public \\Server1\Public
Done processing this command.
Alternatively, you can also set the active server by selecting the desired server and clicking “Set Active” button under the “DFS” tab in the File Explorer properties window for the DFS folder. In either case, you can confirm that the desired namespace server is selected by checking out the updated contents of cache.
PS C:\Windows\system32> dfsutil cache referral
Expires in 0 seconds
UseCount: 0 Type:0x81 ( REFERRAL_SVC DFS )
0:[\Server1\Public] AccessStatus: 0xc00000be ( ACTIVE TARGETSET )
1:[\Server2\Public] AccessStatus: 0
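If you repeat this check often, for example in a test harness that must confirm the right server is active before generating load, the cache output can be parsed programmatically. The following Python sketch assumes target lines of the form index:[\Server\Share] followed by optional flags in parentheses, as in the listings above. Real dfsutil output may contain additional lines or indentation that this simple pattern does not account for.

```python
import re

# Matches target lines such as:
#   0:[\Server1\Public] AccessStatus: 0xc00000be ( ACTIVE TARGETSET )
TARGET_LINE = re.compile(
    r"^\s*(?P<index>\d+):\[(?P<target>[^\]]+)\](?P<rest>.*)$"
)


def active_target(referral_output):
    """Return the target path flagged ACTIVE, or None if there is none."""
    for line in referral_output.splitlines():
        match = TARGET_LINE.match(line)
        if match and "ACTIVE" in match.group("rest").split():
            return match.group("target")
    return None


# Sample text copied from the listing above (assumed format).
sample_output = r"""
Expires in 0 seconds
UseCount: 0 Type:0x81 ( REFERRAL_SVC DFS )
0:[\Server1\Public] AccessStatus: 0xc00000be ( ACTIVE TARGETSET )
1:[\Server2\Public] AccessStatus: 0
"""
print(active_target(sample_output))
```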
How do I confirm the server is “holding up”? The DFSN service in Windows Server 2012 provides the following DFS Performance counter sets:
1. DFS Namespace Service API Requests. Shows performance information about requests (such as creating a namespace) made to the DFS Namespace service.
2. DFS Namespace Service Referrals. Shows performance information about various referral requests that are processed by the DFS Namespace service.
While management operations or Full sync is in progress, you should confirm that the namespace server is cranking through its work queues and being responsive – you can do this by confirming the following:
a) DFS Namespace Service API Requests – Requests Processed/sec – <All instances> counter should hold steady, or ramp up, and,
b) DFS Namespace Service API Requests – Requests Processed – <All instances> counter should steadily increase with time.
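The two checks above can be applied to counter samples collected at regular intervals. Here is a minimal Python sketch under stated assumptions: the samples are plain lists of numbers you have already exported, and the 10% tolerance used to decide whether a rate is "holding steady" is an arbitrary illustrative threshold, not Microsoft guidance. The sample values are hypothetical.

```python
def is_steadily_increasing(samples):
    """True if every cumulative counter sample exceeds the one before it."""
    return all(later > earlier for earlier, later in zip(samples, samples[1:]))


def is_holding_steady_or_ramping(rates, tolerance=0.10):
    """True if no rate sample drops more than `tolerance` below the previous one.

    The default tolerance is an assumption chosen for illustration.
    """
    return all(
        later >= earlier * (1 - tolerance)
        for earlier, later in zip(rates, rates[1:])
    )


# Hypothetical samples of the two counters, taken at a fixed interval.
requests_processed = [1200, 1850, 2500, 3175]
requests_per_sec = [65, 65, 68, 66]

print(is_steadily_increasing(requests_processed))
print(is_holding_steady_or_ramping(requests_per_sec))
```

A cumulative counter that stops growing while operations are still being submitted, or a rate that keeps dropping, suggests the server is not keeping up with the load.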
Note: In previous releases of Windows Server, you see an additional performance counter set:
3. DFS Namespace Service API Queue. Shows the number of requests (made using the NetDfs API) in the queue for the DFS Namespace service to process.
This indicates the number of RPC threads waiting in queue to acquire internal locks. This lock contention has been eliminated in Windows Server 2012 (yet another reason to upgrade!), so this counter set does not exist in Windows Server 2012 and later. When working with previous releases of Windows Server, confirm that this queue does not keep growing while activity is at its peak level.
3. Can my clustered stand-alone DFS namespace fail over quickly enough?
Do I need a stand-alone namespace? First off, I strongly recommend you use domain-based namespaces so you do not have to worry about this aspect of scalability. For a domain-based namespace, multiple namespace servers can concurrently be “active”, so any failure of one namespace server would cause a smooth failover to another namespace server with virtually no downtime from a DFSN client perspective.
How do I test and what do I tweak? Assuming that you have to use a stand-alone namespace for a strong reason, you will then need to use a clustered namespace server for high availability of DFS namespaces. One important consideration is to confirm that on the failure of the primary, the namespaces can rapidly fail over to a secondary node in the failover cluster. Failover time of a namespace is directly proportional to the number of DFS folders in the namespace (in this blog post, I specifically mean the DFS folders with folder targets). The cluster node taking over the namespace ownership sets the last-modified time attribute and the ACLs on each of the DFS folders, which are NTFS reparse points under the covers; see How DFS Works to learn about how DFSN uses reparse points. Prior to Windows Server 2012, a failover would always cause all the reparse points to be re-created. As stated earlier, Windows Server 2012 includes an optimization to not create reparse points if valid ones already exist. Failover can still take a non-trivial time, potentially running into several minutes for very large namespaces (e.g. tens of thousands of folder targets). Refer to the related Microsoft TechNet guidance on the recommended maximum number of DFS folders for a namespace. However, factors such as server performance, desired failover latency and the number of ACLs would all influence the practical count in the namespace. The biggest positive impact on DFSN failover latency is most likely realized in practice by using solid state disks (SSD) for the volume hosting the namespace root share.
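Since failover time is described above as proportional to the number of DFS folders, a measurement from a staging cluster can be scaled to get a first rough estimate for the planned folder count. The Python sketch below does only that linear scaling. It ignores the number of ACLs, disk performance and the Windows Server 2012 reparse-point optimization, so treat the result as a starting point for testing rather than a prediction. The numbers are hypothetical.

```python
def estimate_failover_seconds(measured_seconds, measured_folders, planned_folders):
    """Scale a measured failover time linearly with the DFS folder count."""
    if measured_folders <= 0:
        raise ValueError("measured_folders must be positive")
    return measured_seconds * planned_folders / measured_folders


# Hypothetical staging measurement: 40 seconds to fail over 5,000 DFS folders.
estimate = estimate_failover_seconds(40.0, 5_000, 30_000)
print(f"Estimated failover time: {estimate:.0f} seconds")
```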
4. Can my namespace server come up quickly enough after a reboot?
Should I worry about cold start latency? The answer to this question is substantially similar to that of question 3. After a reboot, the DFSN service re-creates the referral folder structure on the SMB share hosting the namespace. This includes creating the DFS folders with folder targets, and applying any ACLs and Access-based Enumeration (ABE) settings on the reparse points. So this process is also inherently time-consuming. Note that unlike the failover latency, which is specific to stand-alone namespaces, the cold start latency is a key consideration for both domain-based and stand-alone DFS namespaces, albeit a bit less so for domain-based namespaces, as you likely have other namespace servers in that case to provide referrals when one is rebooting. For either type of namespace, the key point to remember is that the longer the DFSN service cold start takes, the longer your DFS namespace has one less namespace server to provide referrals. And that has a direct bearing on the high availability of the namespace. Based on the previous discussion, not surprisingly, the number of DFS folders plays a big role in the cold start latency.
What is the Microsoft guidance and data? Multi-core CPUs and SSDs offer the best chance to accelerate the DFSN cold start process. This is again something you should test in your staging deployment up front, as many factors can influence the cold start latency. Jose, a previous DFSN PM, published a detailed blog post with nice charts on the DFSN service cold start latencies relative to the number of links for Windows Server 2008 R2, under a set of environmental assumptions. While the core of that discussion applies just as well to Windows Server 2012, we have made some nice performance improvements in this release around parallelizing management operations within a single namespace.
Finally, let me wrap this DFSN scalability discussion with a reminder about DFSN compatibility with the new Scale-Out clustered file server first added in Windows Server 2012. DFSN folder targets can be SMB shares on a Scale-Out File Server; however the namespace root cannot be hosted on a Scale-Out File Server.
Hope this discussion added to your understanding of DFS Namespaces, especially regarding their scalability considerations.
- Review the detailed Dfsutil command reference page on TechNet. Jose also published a nice blog post about it some time ago: Using the Windows Server 2008 DFSUTIL.EXE command line to manage DFS-Namespaces
- Check “How Root Scalability Mode Works” section in How DFS Works
- Learn how DFSN uses NTFS reparse points for its link folders in “Root and Link Folders” section under How DFS Works: DFS Physical Structures and Caches
- Be sure to check out the new Scale-Out File Server for Application Data in Windows Server 2012