SharePoint: Common NTLM Authentication Issues, aka: Consider Ditching NTLM

NTLM authentication is not great.

It’s not the fastest. In most cases, that honor would go to Kerberos.

It’s not the most secure. Again, Kerberos.

It’s not all that flexible. For example, it doesn’t work well for extranets or anything cross-firewall. In those scenarios, Trusted Provider auth (SAML / WS-Fed) works well.  See: AD FS.

It doesn't work well with mobile clients, especially iPhone, iPad, etc.  -- Just search the Interwebs for "ios ntlm prompt" and you'll see what I mean --  Some of this is due to the fact that those devices are not joined to the Active Directory domain, and some of it is because NTLM is a Microsoft technology and others are not great at implementing it client-side.  Regardless, the best solution is to use Trusted Provider authentication, which is usually cookie-based and works well for all clients. -- If you're apprehensive about changing your authentication scheme within SharePoint just to appease your "mobile" users, you could use a Web Application Proxy (WAP) front-end as described here.  In that case, authentication is cookie-based between the client and WAP, but still uses Windows Integrated (Kerberos in this case) between WAP and SharePoint, meaning you don't have to do any user migration within SharePoint.

 

So why do so many still use it?

It’s the old stand-by. It works well enough, and there’s typically nothing extra you need to configure to get it to work. You just turn it on and it works. Unless it doesn’t, which is what this post is about.

 

Problems with NTLM usually manifest themselves in one of two ways:

1. Users cannot log in at all. They receive authentication prompts and then a 401 – Access Denied.

2. Users receive (seemingly) random authentication prompts when browsing SharePoint sites.

One thing to keep in mind when troubleshooting NTLM issues with SharePoint is that the problem is almost always external to SharePoint. Aside from turning it on or off, there’s not really anything you can configure inside of SharePoint to make NTLM work better or worse. To enable NTLM, this is all you do within Central Administration | Manage Web Applications | <Your web app> | Authentication Providers:

And this is the resulting configuration in IIS Manager | <Your Site> | Authentication | Windows Authentication | Providers:

Here are some known issues with NTLM in no particular order:

Issue #1:

The network load balancer (NLB) is bouncing the client between web-front-ends (WFEs) in the middle of the "NTLM Handshake".

Note: See "other troubleshooting tips" section below for details on the "NTLM Handshake".

I know there’s some documentation out there that suggests that session persistence / affinity / "sticky sessions" is no longer required with the advent of Distributed Cache in SharePoint 2013 and above. However, that is not the case, at least not as long as you’re using NTLM. Staying on the same WFE is vital to any challenge / response authentication process (like NTLM).

Clearly, if the NTLM challenge comes from one WFE, but we send the response to another, that’s not going to work.

See this: <https://en.wikipedia.org/wiki/Challenge–response_authentication> “A more interesting challenge–response technique works as follows. Say, Bob is controlling access to some resource. Alice comes along seeking entry. Bob issues a challenge, perhaps "52w72y". Alice must respond with the one string of characters which "fits" the challenge Bob issued. The "fit" is determined by an algorithm "known" to Bob and Alice. (The correct response might be as simple as "63x83z" (each character of response one more than that of challenge), but in the real world, the "rules" would be much more complex.) Bob issues a different challenge each time, and thus knowing a previous correct response (even if it isn't "hidden" by the means of communication used between Alice and Bob) is of no use. A part of Alice's response might convey that it is Alice who is seeking authentication.”

Now consider the above "Bob and Alice" scenario without session persistence (sticky sessions). Bob issues the challenge. Alice sends the response to Fred, who has no idea what she’s talking about. Authentication fails.
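To make the Bob / Alice / Fred scenario concrete, here is a minimal Python sketch. It is a toy model, not the real NTLM algorithm: the Server class, the HMAC-SHA256 "fit" rule, the shared secret, and the client id are all assumptions made for illustration. The only point is that the server that issued the challenge is the only one that can check the response.

```python
import hashlib
import hmac
import secrets


class Server:
    """Toy server that remembers the challenges it has issued (hypothetical model)."""

    def __init__(self, name, user_secrets):
        self.name = name
        self.user_secrets = user_secrets  # user name -> shared secret
        self.pending = {}                 # client id -> outstanding challenge

    def issue_challenge(self, client_id):
        challenge = secrets.token_bytes(8)  # a different challenge every time
        self.pending[client_id] = challenge
        return challenge

    def verify(self, client_id, user, response):
        challenge = self.pending.pop(client_id, None)
        if challenge is None:
            return False  # this server never challenged this client
        expected = hmac.new(self.user_secrets[user], challenge, hashlib.sha256).digest()
        return hmac.compare_digest(expected, response)


def respond(secret, challenge):
    """Alice's side: compute the one response that 'fits' the challenge."""
    return hmac.new(secret, challenge, hashlib.sha256).digest()


users = {"alice": b"correct horse"}  # made-up secret
bob = Server("WFE1", users)
fred = Server("WFE2", users)

challenge = bob.issue_challenge("client-1")
answer = respond(users["alice"], challenge)

sent_to_fred = fred.verify("client-1", "alice", answer)  # Fred has no pending challenge
sent_to_bob = bob.verify("client-1", "alice", answer)    # Bob can check his own challenge
```

Even with a perfectly correct answer, the response sent to Fred is rejected, while the same response sent to Bob is accepted.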

To verify whether or not this is happening, I would suggest using HTTP Response Headers with Fiddler as I detailed in a previous post.

 

Solution #1:

Configure your NLB for "sticky sessions" so that a given client stays on a given WFE, at least throughout the authentication process.
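As a rough illustration of what affinity buys you, the Python sketch below compares a round-robin picker with a source-IP hash. The server names and the hashing scheme are assumptions; real load balancers implement persistence in their own ways (source IP, cookies, and so on), so treat this as a picture of the idea rather than of any particular product.

```python
import hashlib
from itertools import cycle

WFES = ["WFE1", "WFE2", "WFE3"]  # hypothetical server names


def pick_by_source_ip(client_ip, servers=WFES):
    """Map a client IP to one server, the same one every time."""
    digest = hashlib.sha256(client_ip.encode("ascii")).digest()
    return servers[int.from_bytes(digest[:4], "big") % len(servers)]


# Three consecutive requests from one client, e.g. the three legs of a handshake.
round_robin = cycle(WFES)
without_affinity = [next(round_robin) for _ in range(3)]
with_affinity = [pick_by_source_ip("10.87.68.93") for _ in range(3)]
```

Without affinity the three requests land on three different WFEs; with the hash they all land on the same one.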

 

Issue #2:

Users are denied access due to settings in the local security policy on the WFEs.

Reproduce the problem and take a look at the Security Event Log on the WFE. You may see a logon failure event like this:

 Log Name: Security
Source: Microsoft-Windows-Security-Auditing
Event ID: 4625
Task Category: Logon
Level: Information
Keywords: Audit Failure
Computer: WFE1.contoso.com
Description:
An account failed to log on. 
 
Subject:
 Security ID: S-1-0-0
 Account Name: -
 Account Domain: -
 Logon ID: 0x0
 
Logon Type: 3
 
Account For Which Logon Failed:
 Security ID: S-1-0-0
 Account Name: user1
 Account Domain: contoso
 
Failure Information:
 Failure Reason: The user has not been granted the requested logon type at this machine. 
 Status: 0xc000015b
 Sub Status: 0x0
 
Detailed Authentication Information:
 Logon Process: NtLmSsp 
 Authentication Package: NTLM

A logon type of “3” is a network logon. The failure reason tells us that there is something in the local security policy (possibly set by Group Policy) that is not allowing the user to log on.

 

Solution #2:

Run SecPol.msc from the Run prompt or command line. Check Local Policies | User Rights Assignment. These two policies should be your focus:

  • Access this computer from the network
  • Deny access to this computer from the network

Check all group memberships for your problem user(s) to make sure they are allowed access from the network and not explicitly denied via those two policies.

By default, there are no users or groups listed in "Deny access to this computer from the network", and the following groups normally have the "Access this computer from the network" privilege:
- Administrators
- Backup Operators
- Everyone
- Users
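The check described above boils down to set logic, sketched here in Python. The function name and the "Contractors" group are hypothetical, and the sketch assumes you have already expanded the user's full group membership (it does not resolve nested groups or SIDs for you). A deny entry wins over an allow entry.

```python
def network_logon_allowed(user_groups, allow_policy, deny_policy):
    """Return True if the user may log on from the network under the two policies."""
    groups = {g.lower() for g in user_groups}
    allow = {g.lower() for g in allow_policy}
    deny = {g.lower() for g in deny_policy}
    if groups & deny:
        return False  # an explicit deny takes precedence
    return bool(groups & allow)


# Default "Access this computer from the network" entries listed above.
DEFAULT_ALLOW = ["Administrators", "Backup Operators", "Everyone", "Users"]

user1_groups = ["Everyone", "Users", "Contractors"]  # "Contractors" is made up

default_result = network_logon_allowed(user1_groups, DEFAULT_ALLOW, [])
denied_result = network_logon_allowed(user1_groups, DEFAULT_ALLOW, ["Contractors"])
```

With the default policy the user gets in. Add one of the user's groups to the deny policy and the same user is refused, which is the situation behind the 4625 event shown earlier.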

 

Issue #3:

No one agrees on which version of NTLM to use.

There are different versions of NTLM, and additional security options within them. If the client, WFE, and Domain Controller (DC) can’t find common ground, the authentication will fail. Reference: https://technet.microsoft.com/en-us/library/2006.08.securitywatch.aspx

From a Fiddler / IIS Log / data capture perspective, this one can be difficult to diagnose.
IIS logs may just show 401.0, 401.1, 401.1, with the last 401.1 showing a "sc-win32-status" of "2148074252", meaning "The logon attempt failed", which is not overly helpful.
However, if you go look at the registry or group policy editor on the applicable machines as described below, it should be easy to spot a problem.

 

Solution #3:

Check the LmCompatibilityLevel Registry key for client, WFE, and DCs. Make sure the value is compatible between the three.

Reference: https://technet.microsoft.com/en-us/library/cc960646.aspx

LmCompatibilityLevel is located here: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Lsa

Note: This setting can be controlled by Group Policy (GPO), so you should check that to make sure any registry changes you make do not get reverted the next time group policy is applied.  If you run gpedit.msc, you'll find it under Computer Configuration | Windows Settings | Security Settings | Local Policies | Security Options:

If these are being set by GPO, you'll need to change that on the domain controller and reapply group policy.

Important: You may have to reboot before changes take effect.
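Once you've collected the values, a small helper can make the comparison less error-prone. The Python sketch below is an illustration only: the function names are mine, read_local_level only works on Windows, and obvious_mismatch flags just the one combination I'm confident about (a client that never sends NTLMv2 talking to a DC that accepts nothing else). The level descriptions follow the policy setting names.

```python
LM_LEVELS = {
    0: "Send LM and NTLM responses",
    1: "Send LM and NTLM; use NTLMv2 session security if negotiated",
    2: "Send NTLM response only",
    3: "Send NTLMv2 response only",
    4: "Send NTLMv2 response only; refuse LM",
    5: "Send NTLMv2 response only; refuse LM and NTLM",
}


def read_local_level():
    """Read LmCompatibilityLevel on this machine (Windows only)."""
    import winreg  # imported here so the rest of the sketch runs anywhere

    path = r"SYSTEM\CurrentControlSet\Control\Lsa"
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, path) as key:
        try:
            value, _value_type = winreg.QueryValueEx(key, "LmCompatibilityLevel")
            return value
        except FileNotFoundError:
            return None  # value not set, so the OS default applies


def obvious_mismatch(client_level, dc_level):
    """True when the client only sends LM/NTLM but the DC refuses both."""
    return client_level <= 2 and dc_level >= 5


old_client_vs_strict_dc = obvious_mismatch(1, 5)
modern_client_vs_strict_dc = obvious_mismatch(3, 5)
```

Run read_local_level on the client, the WFE, and a DC, then compare the numbers against the table.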

 

Issue #4:

DNS / Domain Trust problems.

This is most likely to occur for users that are in a remote domain or trusted forest. If DNS is not configured properly, the SharePoint WFE will not be able to get the proper IP address for a remote domain controller.

This one is a little harder to nail down. It can take a network trace with Netmon or Wireshark to fully diagnose. However, a good indication of the problem may lie in your IIS logs.

Check the IIS log for the problem SharePoint site. You may see that the final request that includes the whole NTLM token receives a 401.1 with a particular sc-win32-status of 2148074257.

For example:

 10.87.68.93 GET /sites/Pages/allitems.aspx 443 - 192.168.56.21 Mozilla/4.0+(compatible;+MSIE+7.0;+Windows+NT+6.1;+WOW64;+Trident/7.0;+SLCC2;+.NET+CLR+2.0.50727;+.NET+CLR+3.5.30729;+.NET+CLR+3.0.30729;+Media+Center+PC+6.0;+.NET4.0C;+.NET4.0E;+InfoPath.3) https://teams.contoso.com/sites/team1/pages/default.aspx 401 1 2148074257 470 2787 31 

A “sc-win32-status” of “2148074257” means "SEC_E_NO_AUTHENTICATING_AUTHORITY", i.e., we can't find a domain controller that is authoritative for that domain. Reference: https://msdn.microsoft.com/en-us/library/windows/desktop/aa375512(v=vs.85).aspx

For a good walkthrough of how to find the proper IIS log for your SharePoint web app, see this: https://blog.bugrapostaci.com/2012/04/12/how-to-collect-iis-logs-for-a-sharepoint-web-application/
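The decimal sc-win32-status values are easier to look up once converted to hex. Here is a small Python sketch that does that and pulls the 401s out of a W3C-format log. It assumes the log contains the usual "#Fields:" header line naming the columns; the function names and the two-entry lookup table (covering only the two codes mentioned in this post) are mine.

```python
KNOWN_STATUS = {
    0x8009030C: "SEC_E_LOGON_DENIED (the logon attempt failed)",
    0x80090311: "SEC_E_NO_AUTHENTICATING_AUTHORITY",
}


def describe_status(sc_win32_status):
    """Turn a decimal sc-win32-status into hex plus a name, if we know it."""
    code = int(sc_win32_status)
    name = KNOWN_STATUS.get(code, "not in this small table")
    return f"0x{code:08X} {name}"


def failed_auth_rows(lines):
    """Yield (uri, substatus, status text) for every 401 in a W3C log."""
    fields = []
    for line in lines:
        if line.startswith("#Fields:"):
            fields = line.split()[1:]  # column names for the rows that follow
            continue
        if line.startswith("#") or not line.strip() or not fields:
            continue
        row = dict(zip(fields, line.split()))
        if row.get("sc-status") == "401":
            yield (
                row.get("cs-uri-stem"),
                row.get("sc-substatus"),
                describe_status(row.get("sc-win32-status", "0")),
            )


# Cut-down sample; a real log has many more columns.
sample = [
    "#Fields: cs-method cs-uri-stem sc-status sc-substatus sc-win32-status",
    "GET /sites/team1/pages/default.aspx 401 1 2148074257",
    "GET /sites/team1/pages/default.aspx 200 0 0",
]
rows = list(failed_auth_rows(sample))
```

To use it on a real file, pass the open file object in place of sample.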

 

Solution #4:

Fix your DNS so that the SharePoint servers get the proper IPs for remote domain controllers. You should also verify your domain and forest trusts.
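A quick way to see what a WFE actually gets back from DNS is to resolve the names from the WFE itself. The Python sketch below does only a plain name-to-address lookup; the helper names are mine, and any DC host names you feed it are yours to supply. Domain controller location also relies on SRV records, which this sketch does not query, so treat a clean result as necessary but not sufficient.

```python
import socket


def resolve(hostname):
    """Return the sorted IP addresses a name resolves to, or [] on failure."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return []  # the name did not resolve from this machine
    return sorted({info[4][0] for info in infos})


def report(hostnames):
    """Resolve several names and return {name: [addresses]}."""
    return {name: resolve(name) for name in hostnames}


loopback = resolve("127.0.0.1")  # a literal IP resolves to itself
```

Run report with the remote domain controllers' names on each WFE and compare the addresses with what the AD admins expect. An empty list or an unexpected IP is your lead.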

 

Issue #5:

MaxConcurrentApi

This is a bit of a complicated topic, but you can sum it up like this: There is a finite number of Netlogon process threads available for NTLM authentication on both the SharePoint WFEs and the domain controllers. When that number is exceeded, authentication requests can fail. This typically happens in large environments with heavy NTLM traffic, and especially when that authentication occurs across domain trusts.

Reference: https://support.microsoft.com/en-us/help/975363/you-are-intermittently-prompted-for-credentials-or-experience-time-out

 

Solution #5:

Switch SharePoint (and other applications) to use Kerberos authentication.

This cuts down significantly on Netlogon service traffic, in most cases relieving the bottleneck. However, keep in mind that Kerberos authentication can still be impacted by MaxConcurrentApi if there is a significant amount of it requiring PAC verification, or if NTLM authentication for other applications is saturating available threads.

Reference: https://support.microsoft.com/en-us/help/2688798/how-to-do-performance-tuning-for-ntlm-authentication-by-using-the-maxc

Another option is cutting down authentication traffic by making more resources available anonymously.

For example, within an out-of-box SharePoint site, all supporting files (CSS, JS, images, etc) are stored on the file system and are available anonymously (most are in the _layouts folder). However, some customizations and branding may store supporting files within a document library where an authentication request must occur for each file request. The result can be a dozen or more NTLM authentication requests for each page load. Moving those supporting files to their own folder in _layouts, or otherwise making them anonymously accessible, will drastically reduce total authentication traffic when browsing the site.

 

Other troubleshooting tips:

Test it outside of SharePoint:

This is a good isolation technique.  The idea is to see if NTLM is working at all on your SharePoint web-front-ends.

Create a file share on the WFE. From a client machine that is having problems authenticating to SharePoint, try to access the file share using the WFE's IP address. Example: \\192.168.0.33\Share. You must use the IP and not the server name to force NTLM. If you use the server name, Kerberos will normally be used to authenticate to the share, which is not the test we're going for. Does accessing the share by IP work? If you get prompted for credentials and can't authenticate, you should probably leave your SharePoint admins alone and start talking to your AD admins.

Note: This test may not be conclusive on Windows Server 2016 or other platforms where accessing a file share by IP is prohibited.

 

Use your tools:

As we saw in the above sections, IIS logs, the Security Event Log in Event Viewer, and Network traces can assist in diagnosing these problems. In this section, I’d like to walk you through using Fiddler to view the authentication traffic.  The purpose is to show what a successful NTLM authentication looks like.  You can use that to compare to your own trace of a failure.

 

NTLM authentication is done in a three-step process known as the "NTLM Handshake".

The first request is normally made anonymously. This is true of Kerberos as well.

The site requires authentication, so the WFE responds with a 401 – Unauthorized and a “WWW-Authenticate: NTLM” header.  That header is how the server tells the client which authentication methods to try.

 

 

The client makes a second request for the same page. This time it includes half of the NTLM token. The server again responds with a 401 (unauthorized) and issues an NTLM challenge.

 

 

The client makes a third request with the whole NTLM token, is successfully authenticated, and receives a 200-ok for home.aspx.

 

 

Note: The NTLM Handshake is not really a half-token / full-token situation, but for the purposes of simplifying the NTLM Handshake process, I find that explanation works well enough. I think it helps to differentiate the second request (notice the client NTLM authorization header is fairly short) from the final request (NTLM header is much longer). If you see your client send the full NTLM token, but the server still responds with 401 - unauthorized, then you need to look closer at the known issues described above.
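If you'd rather not judge the tokens by length, you can decode them. An NTLM token is base64 text that starts with the signature "NTLMSSP" followed by a null byte and a 4-byte little-endian message type: 1 (negotiate), 2 (challenge), or 3 (authenticate). The Python sketch below reads that type out of a header value copied from Fiddler. The function names are mine, and fake_token builds stripped-down tokens purely for the demonstration; real tokens carry much more data after the type field.

```python
import base64
import struct

MESSAGE_TYPES = {1: "NEGOTIATE", 2: "CHALLENGE", 3: "AUTHENTICATE"}


def ntlm_message_type(header_value):
    """Return the NTLM message name carried in a header value, or None."""
    scheme, _, token = header_value.strip().partition(" ")
    if scheme.upper() != "NTLM" or not token:
        return None  # e.g. the bare "NTLM" offer in the first 401
    try:
        raw = base64.b64decode(token)
    except ValueError:
        return None  # not valid base64
    if len(raw) < 12 or raw[:8] != b"NTLMSSP\x00":
        return None  # missing the NTLMSSP signature
    (msg_type,) = struct.unpack_from("<I", raw, 8)
    return MESSAGE_TYPES.get(msg_type)


def fake_token(msg_type):
    """Build a minimal header value for the demo (signature + type only)."""
    raw = b"NTLMSSP\x00" + struct.pack("<I", msg_type)
    return "NTLM " + base64.b64encode(raw).decode("ascii")


handshake = [ntlm_message_type(fake_token(n)) for n in (1, 2, 3)]
first_401 = ntlm_message_type("NTLM")
```

In a healthy trace the client's second request carries NEGOTIATE, the server's second 401 carries CHALLENGE, and the client's third request carries AUTHENTICATE. As a shortcut when eyeballing Fiddler, those tokens begin with TlRMTVNTUAAB, TlRMTVNTUAAC, and TlRMTVNTUAAD respectively.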