Access-Based Enumeration (ABE) Troubleshooting (part 2 of 2)

Hello everyone! Hubert from the German Networking Team here again with part two of my little Blog Post Series about Access-Based Enumeration (ABE). In the first part I covered some of the basic concepts of ABE. In this second part I will focus on monitoring and troubleshooting Access-based enumeration.
We will begin with a quick overview of Windows Explorer’s directory change notification mechanism (Change Notify), and how that mechanism can lead to performance issues before moving on to monitoring your environment for performance issues.

Change Notify and its impact on DFSN servers with ABE

Let’s say you are viewing the contents of a network share while a file or folder is added to the share remotely by someone else. Your view of this share will be updated automatically with the new contents of the share without you having to manually refresh (press F5) your view.
Change Notify is the mechanism that makes this work in all SMB Protocols (1,2 and 3).
The way it works is quite simple:

  1. The client sends a CHANGE_NOTIFY request to the server indicating the directory or file it is interested in. Windows Explorer (as an application on the client) does this by default for the directory that is currently in focus.
  2. Once there is a change to the file or directory in question, the server will respond with a CHANGE_NOTIFY Response, indicating that a change happened.
  3. This causes the client to send a QUERY_DIRECTORY request (in case it was a directory or DFS Namespace) to the server to find out what has changed.

    QUERY_DIRECTORY is the thing we discussed in the first post that causes ABE filter calculations. Recall that it’s these filter calculation that result in CPU load and client-side delays.
    Let’s look at a common scenario:
  4. During login, your users get a mapped drive pointing at a share in a DFS Namespace.
  5. This mapped drive causes the clients to connect to your DFSN Servers
  6. The client sends a Change Notification (even if the user hasn’t tried to open the mapped drive in Windows Explorer yet) for the DFS Root.

    Nothing more happens until there is a change on the server-side. Administrative work, such as adding and removing links, typically happens during business hours, whenever the administrators find the time, or the script that does it, runs.

    Back to our scenario. Let’s have a server-side change to illustrate what happens next:
  7. We add a Link to the DFS Namespace.
  8. Once the DFSN Server picks up the new link in the namespace from Active directory, it will create the corresponding reparse point in its local file system.
    If you do not use Root Scalability Mode (RSM) this will happen almost at the same time on all of the DFS Servers in that namespace. With RSM the changes will usually be applied by the different DFS servers over the next hour (or whatever your SyncInterval is set to).
  9. These changes trigger CHANGE_NOTIFY responses to be sent out to any client that indicated interest in changes to the DFS Root on that server. This usually applies to hundreds of clients per DFS server.
  10. This causes hundreds of Clients to send QUERY_DIRECTORY requests simultaneously.

What happens next strongly depends on the size of your namespace (larger namespaces lead to longer duration per ABE calculation) and the number of Clients (aka Requests) per CPU of the DFSN Server (remember the calculation from the first part?)

As your Server does not have hundreds of CPUs there will definitely be some backlog. The numbers above decide how big this backlog will be, and how long it takes for the server to work its way back to normal. Keep in mind that while pedaling out of the backlog situation, your server still has to answer other, ongoing requests that are unrelated to our Change Notify Event.
Suffice it to say, this backlog and the CPU demand associated with it can also have negative impact to other jobs.  For example, if you use this DFSN server to make a bunch of changes to your namespace, these changes will appear to take forever, simply because the executing server is starved of CPU Cycles. The same holds true if you run other workloads on the same server or want to RDP into the box.

So! What can you do about it?
As is common with an overloaded server, there are a few different approaches you could take:

  • Distribute the load across more servers (and CPU cores)
  • Make changes outside of business hours
  • Disable Change Notify in Windows Explorer

Approach

Method

Distribute the load / scale up

An expensive way to handle the excessive load is to throw more servers/CPU cores into the DFS infrastructure. In theory, you could increase the number of Servers and the number of CPUs to a level where you can handle such peak loads without any issues, but that can be a very expensive approach.

Make changes outside business hours

Depending on your organizations structure, your business needs, SLAs and other requirements, you could simply make planned administrative changes to your Namespaces outside the main business hours, when there are less clients connected to your DFSN Servers.

Disable Change Notify in Windows Explorer

You can set:
NoRemoteChangeNotify
NoRemoteRecursiveEvents
See https://support.microsoft.com/en-us/kb/831129
to prevent Windows Explorer from sending Change Notification Requests.
This is however a client-side setting that disables this functionality (change notify) not just for DFS shares but also for any fileserver it is working with. Thus you have to actively press F5 to see changes to a folder or a share in your Windows Explorer. This might or might not be a big deal for your users.

Monitoring ABE

As you may have realized by now, ABE is not a fire and forget technology—it needs constant oversight and occasional tuning. We’ve mainly discussed the design and “tuning” aspect so far. Let’s look into the monitoring aspect.

Using Task Manager / Process Explorer

This is a bit tricky, unfortunately, as any load caused by ABE shows up in Task Manager inside the System process (as do many other things on the server). In order to correlate high CPU utilization in the System process to ABE load, you need to use a tool such as Process Explorer and configure it to use public symbols. With this configured properly, you can drill deeper inside the System Process and see the different threads and the component names. We need to note, that ABE and the Fileserver both use functions in srv.sys and srv2.sys. So strictly speaking it’s not possible to differentiate between them just by the component names. However, if you are troubleshooting a performance problem on an ABE-enabled server where most of the threads in the System process are sitting in functions from srv.sys and srv2.sys, then it’s very likely due to expensive ABE filter calculations. This is, aside from disabling ABE, the best approach to reliably prove your problem to be caused by ABE.

Using Network trace analysis

Looking at CPU utilization shows us the server-side problem. We must use other measures to determine what the client-side impact is, one approach is to take a network trace and analyze the SMB/SMB2 Service Response times. You may however end up having to capture the trace on a mirrored switch port. To make analysis of this a bit easier, Message Analyzer has an SMB Service Performance chart you can use.

clip_image002

You get there by using a New Viewer, like below.

smbserviceperf

Wireshark also has a feature that provides you with statistics under Statistics -> Service Response Times -> SMB2. Ignore the values for ChangeNotify (its normal that they are several seconds or even minutes). All other response times translate into delays for the clients. If you see values over a second, you can consider your files service not only to be slow but outright broken.
While you have that trace in front of you, you can also look for SMB/TCP Connections that are terminated abnormally by the Client as the server failed to respond to the SMB Requests in time. If you have any of those, then you have clients unable to connect to your file service, likely throwing error messages.

Using Performance Monitor

If your server is running Windows Server 2012 or newer, the following performance counters are available:

Object

Counter

Instance

SMB Server Shares

Avg. sec /Data Request

<Share that has ABE Enabled>

SMB Server Shares

Avg. sec/Read

‘’

SMB Server Shares

Avg. sec/Request

‘’

SMB Server Shares

Avg. sec/Write

‘’

SMB Server Shares

Avg. Data Queue Length

‘’

SMB Server Shares

Avg. Read Queue Length

‘’

SMB Server Shares

Avg. Write Queue Length

‘’

SMB Server Shares

Current Pending Requests

‘’

clip_image005

Most noticeable here is Avg. sec/Request counter as this contains the response time to the QUERY_DIRECTORY requests (Wireshark displays them as Find Requests). The other values will suffer from a lack of CPU Cycles in varying ways but all indicate delays for the clients. As mentioned in the first part: We expect single digit millisecond response times from non-ABE Fileservers that are performing well. For ABE-enabled Servers (more precisely Shares) the values for QUERY_DIRECTORY / Find Requests will always be higher due to the inevitable length of the ABE Calculation.

When you reached a state where all the other SMB Requests aside of the QUERY_DIRECTORY are constantly responded to in less than 10ms and the QUERY_DIRECTORY constantly in less than 50ms you have a very good performing Server with ABE.

Other Symptoms

There are other symptoms of ABE problems that you may observe, however, none of them on their own is very telling, without the information from the points above.

At a first glance a high CPU Utilization and a high Processor Queue lengths are indicators of an ABE problem, however they are also indicators of other CPU-related performance issues. Not to mention there are cases where you encounter ABE performance problems without saturating all your CPUs.

The Server Work Queues\Active Threads (NonBlocking) will usually raise to their maximum allowed limit (MaxThreadsPerQueue ) as well as the Server Work Queues\Queue Length increasing. Both indicate that the Fileserver is busy, but on their own don’t tell you how bad the situation is. However, there are scenarios where the File server will not use up all Worker Threads allowed due to a bottleneck somewhere else such as in the Disk Subsystem or CPU Cycles available to it.

See the following should you choose to setup long-term monitoring (which you should) in order to get some trends:

Number of Objects per Directory or Number of DFS Links
Number of Peak User requests (Performance Counter: Requests / sec.)
Peak Server Response time to Find Requests or Performance Counter: Avg. sec/Request
Peak CPU Utilization and Peak Processor Queue length.

If you collect those values every day (or a shorter interval), you can get a pretty good picture how much head-room you have left with your servers at the moment and if there are trends that you need to react to.

Feel free to add more information to your monitoring to get a better picture of the situation.  For example: gather information on how many DFS servers were active at any given day for a certain site, so you can explain if unusual high numbers of user requests on the other servers come from a server downtime.

ABELevel

Some of you might have heard about the registry key ABELevel. The ABELevel value specifies the maximum level of the folders on which the ABE feature is enabled. While the title of the KB sounds very promising, and the hotfix is presented as a “Resolution”, the hotfix and registry value have very little practical application.  Here’s why:
ABELevel is a system-wide setting and does not differentiate between different shares on the same server. If you host several shares, you are unable to filter to different depths as the setting forces you to go for the deepest folder hierarchy. This results in unnecessary filter calculations for shares.

Usually the widest directories are on the upper levels—those levels that you need to filter.  Disabling the filtering for the lower level directories doesn’t yield much of a performance gain, as those small directories don’t have much impact on server performance, while the big top-level directories do.  Furthermore, the registry value doesn’t make any sense for DFS Namespaces as you have only one folder level there and you should avoid filtering on your fileservers anyway.

While we are talking about Updates

Here is one that you should install:
High CPU usage and performance issues occur when access-based enumeration is enabled in Windows 8.1 or Windows 7 – https://support.microsoft.com/en-us/kb/2920591

Furthermore you should definitely review the Lists of recommended updates for your server components:
DFS
https://support.microsoft.com/en-us/kb/968429 (2008 / 2008 R2)
https://support.microsoft.com/en-us/kb/2951262 (2012 / 2012 R2)

File Services
https://support.microsoft.com/en-us/kb/2473205 (2008 / 2008 R2)
https://support.microsoft.com/en-us/kb/2899011 (2012 / 2012 R2)

Well then, this concludes this small (my first) blog series.
I hope you found reading it worthwhile and got some input for your infrastructures out there.

With best regards
Hubert