SharePoint: User Profile web service failures and the dreaded 8313 error

This post is about how a simple web service failure, caused by a networking or Active Directory issue can take your site down.

I've come across this a few different ways. The behavior is almost always intermittent, making it hard to track down.

 

Possible Symptoms:

  • Users intermittently receive a "Something Went Wrong" message when trying to access a SharePoint page that makes a call to the User Profile Service Application.
  • Intermittent user profile web service timeout failures (8313, EndpointFailure)
    • To be clear, the 8313 event id is pretty generic. It just means that a web service request failed for some reason. The presence of 8313 errors in your event logs do not necessarily indicate you have this issue. In fact, this issue is pretty rare, so it's more likely caused by something else. Only an in-depth review of your SharePoint ULS logs can determine the cause.
  • Users intermittently receive "Sorry, this site has not been shared with you" (Access Denied) errors.

That last one is interesting. I'll expand on that:

– Normally you'd say that authorization / permissions have nothing to do with user profiles, and 99% of the time, you'd be right, but there was one interesting scenario that I came across where the two were indirectly linked:

  • User browses a site that has some kind of user profile-related web part or control on it. For example, the "My Links" web part.
  • The MyLinks web part attempts to render and tries to access the profile property cache.
  • SharePoint sends a WCF call to the UPA web service, which can hit any box in the farm running the User Profile Service.  For example: https://AppServer1:32844/3361d4a6050c411982ab7f19989ef1c5/ProfilePropertyService.svc
  • It fails with timeout on the client-side (WFE) after 20 seconds:
    • Exception occured while connecting to WCF endpoint: System.TimeoutException: The request channel timed out while waiting for a reply after 00:00:20. Increase the timeout value passed to the call to Request or increase the SendTimeout value on the Binding. The time allotted to this operation may have been a portion of a longer timeout. ---> System.TimeoutException: The HTTP request to 'https:// AppServer1 :32844/3361d4a6050c411982ab7f19989ef1c5/ProfilePropertyService.svc' has exceeded the allotted timeout of 00:00:20. The time allotted to this operation may have been a portion of a longer timeout.
  • Since we fail to construct the profile property cache, calls to the cache start failing with the "UserProfileApplicationNotAvailableException" error.
    • Failure retrieving application ID for User Profile Application Proxy 'User Profile Service Application': Microsoft.Office.Server.UserProfiles.UserProfileApplicationNotAvailableException: UserProfileApplicationNotAvailableException_Logging :: UserProfileApplicationProxy.ApplicationProperties ProfilePropertyCache does not have 1fa58dc1-81ff-423e-bc58-61b59566c9ad
  • The mylinks web part on default.aspx returns the above error after trying to access the profile property cache.
  • Since the web part fails, default.aspx tries to redirect the user to the web part maintenance page (/_layouts/15/spcontnt.aspx)
  • If the user does not have enough permission to see that page, they get redirected to AccessDenied.aspx.
  • If the user does have enough permission to manage web parts, they see the web part maintenance page, which is also a problem.

 

Cause:

An Active Directory configuration or performance issue causes a delay in resolving the user and group accounts listed as Administrators of the User Profile Service Application.

I'll try to explain that:

  • The users and groups listed as Administrators of the User Profile Service Application (UPA) are cached.
  • That cache expires every 5 minutes.
  • When the next web service call comes into the UPA, those accounts must be resolved again and re-cached.
  • Normally, if there's a local domain controller, that account resolution will take less than a second.
  • However, the following factors can make that account resolution take significantly longer:
    • The larger the list of users or groups, the longer it will take.
    • If there are accounts listed from trusted domains that don't have a local domain controller (DC), locating and accessing the remote domain controllers will take longer, or may fail entirely due to firewall configuration.
    • If Active Directory Sites and Services are not configured correctly, the user account resolution may intermittently be sent to a remote domain controller on the other side of a slow WAN link.
  • The Web-Front-End that issued the web service call has a default timeout of 20 seconds. If the UPA administrator account resolution takes longer than that, the web service call is abandoned, and whatever initiated the call will fail.

 

Solution:

You should start by trimming the list of accounts in UPA Administrators.

Service accounts are typically not required here.  This should only be used for admin accounts where you want the user to be able to administer the UPA, but have no access to other parts of Central Administration.  Reference:

https://technet.microsoft.com/en-us/library/ee806877(v=office.14).aspx

"An administrator of a Microsoft SharePoint Foundation 2010 service application can assign additional administrators to that service application. These users are granted security-trimmed access to the SharePoint Central Administration Web site and can manage settings related to the service application. You can similarly remove administrators from a service application.

Note: By default, members of the Farm Administrators group have permissions to manage all service applications."

 

Then look at your AD configuration.

We need to make sure that SharePoint can quickly, and consistently communicate with all domain controllers within the same Active Directory "Site".

To start with, make sure you can get a domain controller within the same site.

Run this command from your App servers that run the User Profile Service: nltest /dsgetdc:YourDomainNameHere

In the example above, I can see that I get a domain controller within the same Active Directory (AD) site as my SharePoint server. That's a good start.

However, an AD "Site" is not necessarily the same as a physical site. It may consist of more than one physical data center.

You need to make sure that all of your App servers running the User Profile Service can quickly access all of the domain controllers within that Site.

To get an entire list, run nltest /dclist:YourDomainNameHere

This is not a great example since I only have one DC, but in most cases, you would see several listed. Pick out all the DCs in the same "Site" as your SharePoint servers and start testing connectivity.

 

Do some testing.

The Ping command can be a starting place to provide a crude test of basic network latency between the boxes. If you're getting anything above a few milliseconds between any SharePoint server and any DC within the same Site, that would be a potential problem.

You can also use the following PowerShell to test the time it takes to complete the UPA Administrator account resolution:

Add-PSSnapin microsoft.sharepoint.powershell -erroraction silentlycontinue

$upa = get-spserviceapplication | ? {$_.TypeName -match ("User Profile")}

$starttime = [system.DateTime]::Now

$acl = $upa.GetAdministrationAccessControl()

$accessRules = $acl.accessRules

$endTime = [system.DateTime]::Now

$elapsedTime = $endTime.subtract($startTime).Totalseconds

Write-Host -ForegroundColor green "Duration: $elapsedTime Seconds"

 

And this PowerShell has a similar purpose, but lists each account it resolved and calls directly into the SharePoint API (GetFullNameFromLoginEx) used to resolve the accounts:

#Loop through each UPA account, try to resolve them, and list total time taken to resolve entire list

Add-PSSnapin *sharepoint*

$accounts = @()

[System.Reflection.Assembly]::LoadWithPartialName("Microsoft.SharePoint.Utilities.SPUtility")

$ServiceApps = Get-SPServiceApplication | ? {$_.typename -match "profile"}

foreach($serviceApp in $ServiceApps)

{

Write-host "UPA: " $serviceApp.name "`n"

$accounts += (Get-SPServiceApplicationSecurity $serviceApp -Admin).AccessRules | foreach{$_.name}

$accounts +=  (Get-SPServiceApplicationSecurity $serviceApp).AccessRules | foreach{$_.name}

$stopwatch =  [system.diagnostics.stopwatch]::StartNew()

foreach($account in $accounts)

{

write-host "Account: " $account

if($account -match "i:0#")

{

$caccount = $account.split("|")

$account=$caccount[1]

}

write-host "Resolved to: " ([Microsoft.SharePoint.Utilities.SPUtility]::GetFullNameFromLoginEx($account, [ref]$false))

}

$stopwatch.stop()

Write-host -ForegroundColor Green "Total Milliseconds: " $stopwatch.Elapsed.Milliseconds

Write-host -ForegroundColor Green "Total Seconds: " $stopwatch.Elapsed.Seconds

Write-host "--------------------------- `n"

}

 

You can take a Netmon or WireShark trace while running the above PowerShell over and over. If you intermittently see some runs that take several seconds (and certainly anything over 20 seconds), you would want to evaluate the network traffic during that run to identify which domain controller it used, and what the bottleneck was.

 

More symptom details / error messages:

There are a lot of errors that can be thrown in this situation, both from W3WP.exe and OWSTIMER.exe because both processes call the UPA web services (ProfilePropertyService.svc, ProfileDBCacheService.svc, and FeedCacheService.svc) in various scenarios.  Generally, you should see an 8313 error like this in the Application event log when the web service call fails:

Log Name:      Application
Source:        Microsoft-SharePoint Products-SharePoint Foundation
Event ID:      8313
Task Category: Topology
Level:         Error
Keywords:     
Computer:      WFE1
Description:
A failure was reported when trying to invoke a service application: EndpointFailure
Process Name: OWSTIMER
Process ID: 13224
AppDomain Name: DefaultDomain
AppDomain ID: 1
Service Application Uri: urn:schemas-microsoft-com:sharepoint:service:2e3fdb97663a4af88419ccd0ede35aad#authority=urn:uuid:baf342c531074999b685ecbe8310af8f&authority=https://WFE1:32844/Topology/topology.svc
Active Endpoints: 1
Failed Endpoints:1
Affected Endpoint: https://AppServer1:32843/2e3fdb97663a4af88419ccd0ede35aad/ProfileService.svc

-- When the "Something Went Wrong" message occurs for a user, you should see an error like this on the web-front-end (WFE) that serviced the request:

w3wp.exe (0x1B4C) 0x1BF8 SharePoint Portal Server User Profiles acruz High Exception occured while connecting to WCF endpoint: System.TimeoutException: The request channel timed out while waiting for a reply after 00:00:20. Increase the timeout value passed to the call to Request or increase the SendTimeout value on the Binding. The time allotted to this operation may have been a portion of a longer timeout. ---> System.TimeoutException: The HTTP request to 'https://AppServer1:32843/2e3fdb97663a4af88419ccd0ede35aad/ProfilePropertyService.svc' has exceeded the allotted timeout of 00:00:20. The time allotted to this operation may have been a portion of a longer timeout. ---> System.Net.WebException: The operation has timed out

-- If you take the correlation id from the timeout error on the WFE and filter the ULS log on the server identified in the error (in this case "AppServer1"), you should see that it received the WCF request for ProfilePropertyService.svc/GetProfileProperties and processed it without error. However, it took more than 20 seconds to do it, so the WFE times out and gives up on the web service call.

- Here's a sample of what that looks like:
-The App server receives the request exactly 20 seconds before the timeout is thrown on the WFE:
w3wp.exe (0x184C) 0x1E74 SharePoint Foundation Topology e5mb Medium WcfReceiveRequest: LocalAddress: 'https://AppServer1:32843/2e3fdb97663a4af88419ccd0ede35aad/ProfilePropertyService.svc' Channel: 'System.ServiceModel.Channels.ServiceChannel' Action: 'https://Microsoft.Office.Server.UserProfiles/GetProfileProperties' MessageId: 'urn:uuid:84ed123a-69c6-4f28-b901-787f553ebeb9'

- GetProfileProperties calls userProfileApplication.GetProperties to get properties of the UPA:
w3wp.exe (0x184C) 0x1E74 SharePoint Portal Server User Profiles ajk4c Medium UserProfileProperty_WCFLogging :: ProfilePropertyService.GetProfileProperties - Call userProfileApplication.GetProperties

- Here's the call to get the UPA properties:
w3wp.exe (0x184C) 0x1E74 SharePoint Portal Server User Profiles ajk33 Medium UserProfileProperty_WCFLogging :: Begin UserProfileApplication.GetProperties

-This comes in almost a minute later:
w3wp.exe (0x184C) 0x1E74 SharePoint Portal Server User Profiles ajk34 Medium UserProfileProperty_WCFLogging :: End UserProfileApplication.GetProperties

-Total execution time for the web service call is 56 seconds.  All of that time was spent in UserProfileApplication.GetProperties.
w3wp.exe (0x184C) 0x1E74 SharePoint Foundation Monitoring b4ly Medium Leaving Monitored Scope (ExecuteWcfServerOperation). Execution Time=56136.7765316542

-- If you have a Netmon trace from the App server that serviced the request for ProfilePropertyService.svc/GetProfileProperties that timed out on the WFE, you should see a sequence like this:

-WFE sends a POST to the App server:
Http: Request, POST /2e3fdb97663a4af88419ccd0ede35aad/ProfilePropertyService.svc

-There's a bunch of SAMR traffic to domain controllers.  You can use a display filter like this:
"HTTP or SAMR"

-A while later (in my case it was 56 seconds), you will see the App server send the 200-ok back to the WFE for the web service call:
Http: Response, HTTP/1.1, Status: Ok, URL: /2e3fdb97663a4af88419ccd0ede35aad/ProfilePropertyService.svc

-So the web service call completed successfully. The only problem is that it took too long. It timed out on the WFE-side after 20 seconds.