Clarifying: Retry Behavior for Distribution Points

11/25/2008

We had a lot of customers ask what happens when a branch distribution point that’s running on a workstation-class computer exceeds the maximum number of 10 concurrent sessions. The 10 concurrent sessions is a limitation of the operating system, and not something under the control of Configuration Manager, so this wasn’t originally tested. There was an assumption that clients would retry the same branch distribution point for 8 hours, known as “the 8 hour retry loop”.

Because we had so many customers ask this question, our SE (Sustained Engineering) team valiantly took up the challenge, and at this point I would like to pay tribute to QianDong Ni for all his help and patience in testing, verifying, and then explaining both the tests and the results to me. I’d also like to thank Hugo Wu for his assistance with these tests. What started off as a relatively simple question then spawned many others – hence the delay in getting this information to you.

Here are the results, divided up into question/answer sections for easier consumption – although they all interrelate with one another to some extent:

Question: What happens when the 11th connection is attempted – and there are other distribution points available?

Answer: When an 11th connection is attempted to a branch distribution point on a workstation computer, the client will try one more time and then, if it has another distribution point to try (either a branch distribution point or a standard distribution point), it immediately tries that next distribution point. There is no “8 hour retry loop” before it tries the next distribution point in the list.

This result means that you can scale out branch distribution points in locations where you cannot run a server (for licensing or administration reasons) and still get a good end user experience. For example, if you have 15 clients but you don’t know whether they all need to download content at the same time, install two branch distribution points. Which of the branch distribution points a client will select is nondeterministic, so there is a built-in element of load balancing to begin with. But if a client selects a branch distribution point that already has the maximum of 10 concurrent sessions, it will automatically try the next distribution point without a noticeable delay.
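To make the 15-client example concrete, here is a rough simulation sketch. It is not how Configuration Manager selects distribution points internally; it simply assumes that each client tries the distribution points in a random order, keeps its session open for the whole run, and moves on to the next distribution point when the current one already has 10 sessions. The names `assign_clients`, `BDP1`, and `BDP2` are made up for illustration.

```python
import random

MAX_SESSIONS = 10  # concurrent session limit on a workstation OS, per the article


def assign_clients(num_clients, dp_names, seed=0):
    """Place each client on the first distribution point with a free session.

    Assumptions (not from Configuration Manager itself): every client holds
    its session for the whole run, and the order in which a client tries the
    distribution points is a random shuffle.
    """
    rng = random.Random(seed)
    sessions = {name: 0 for name in dp_names}
    unplaced = 0
    for _ in range(num_clients):
        order = list(dp_names)
        rng.shuffle(order)
        for name in order:
            if sessions[name] < MAX_SESSIONS:
                sessions[name] += 1
                break
        else:
            # Every distribution point was full; this client would keep retrying.
            unplaced += 1
    return sessions, unplaced


sessions, unplaced = assign_clients(15, ["BDP1", "BDP2"])
print(sessions, "clients left retrying:", unplaced)
```

With two branch distribution points there are 20 sessions in total, so all 15 clients find a slot under these assumptions, whichever one they happen to try first.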

Question: What happens when the 11th connection is attempted – and there are no other distribution points available?

Answer: In this scenario, if there are no more available distribution points, clients will go into a continuous retry loop for the same branch distribution point. The client twice retries the single branch distribution point every hour. There was an expectation that the retries would give up after 8 hours, but this wasn’t the case with our tests. We haven’t yet found a timeout value. The last test ran for 5 days and the client was still retrying every hour. This result surprised everybody.

Question: What happens when you have multiple workstation branch distribution points and they all have their maximum number of sessions in use?

Answer: In this scenario, clients continue to cycle through the list of branch distribution points, trying each twice every hour. This retry cycle continues until a connection is successful.

Question: Is this client retry behavior different for standard distribution points when a selected distribution point is not available?

Answer: Yes. The retry behavior is different depending on whether Configuration Manager deems the problem to be “recoverable” or “unrecoverable”. In the case of a workstation branch distribution point running out of concurrent sessions, this is deemed an unrecoverable error. Incorrect NTFS permissions on the folder or package is another example of an unrecoverable error. However, most scenarios are deemed recoverable. These include the server running the distribution point being turned off, the server name failing to resolve, or the network being down. In these cases, clients retry the same distribution point for 8 hours before trying the next distribution point on the list. Clients retry with exponentially increasing delays, starting with 30 seconds, then 1 minute, then 2 minutes, and so on, until they try once an hour.
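To get a feel for what that recoverable-error schedule looks like, here is a small sketch. It assumes the delay simply doubles after each attempt, is capped at one hour, and that the client stops once the next wait would run past the 8-hour window. Only the 30 second, 1 minute, 2 minute, and hourly values come from our tests; the intermediate values and the function name `retry_delays` are assumptions for illustration.

```python
def retry_delays(initial=30, cap=3600, window=8 * 3600):
    """Return the waits (in seconds) between retries of one distribution point.

    Assumed model: each wait doubles, is capped at `cap`, and the retries stop
    once the next wait would push the total past `window`.
    """
    delays = []
    elapsed = 0
    delay = initial
    while elapsed + delay <= window:
        delays.append(delay)
        elapsed += delay
        delay = min(delay * 2, cap)
    return delays


schedule = retry_delays()
print(schedule)
print("retries:", len(schedule), "total seconds waited:", sum(schedule))
```

Under this model the waits reach the one-hour cap after roughly the first hour, so most of the 8-hour window is spent retrying once an hour.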

Question: Why don't I see clients trying another distribution point when I test this retry behavior by turning off my protected distribution point?

Answer: You have to be careful testing retry behavior to ensure that you are testing what you think you are testing. Because the selection of equal distribution points is nondeterministic, you can force a client to try one distribution point over another by making one protected and the other unprotected.

If you enable the “fallback” option in the advertisement, and then turn off the protected distribution point, you might expect the client to try first the protected distribution point and then the unprotected distribution point. However, this is a flawed test because the client will never try the unprotected distribution point if the protected distribution point is unavailable. Why not? The clue is in the full name of the fallback option: “Allow clients to fallback to unprotected distribution points when the content is not available on the protected distribution point.” In this scenario, the content was on the protected distribution point, so clients would never be given an unprotected distribution point to try. The result you see is that the client sticks with the protected distribution point, keeps retrying to download content from it for 8 hours, and does not try the unprotected distribution point.

The resulting behavior of protected and unprotected distribution points is covered in the topic About Protected Distribution Points, and particularly in the table that outlines the different scenarios. Our tests verified all the scenarios and the documented outcomes.

Question: How can you find out which distribution points are given to clients and which one was used to download the content?

Answer: The client log LocationServices.log displays the list of available distribution points (search for “Calling back with the following distribution points”). You will see these listed together with DPType=BRANCH or DPType=SERVER.

Use DataTransferService.log to check which distribution point was used to download the content, and whether it was over HTTP (from a standard distribution point with BITS enabled) or SMB (from a branch distribution point – or a standard distribution point that used SMB). Example log file entries:

DTSJob {D700E3F2-51F5-4AFC-9836-CEB8060152A0} created to download from 'https://SERVER1.CONTOSO.COM/SMS_DP_SMSPKGC$/TQL00006' to 'C:\WINDOWS\system32\CCM\Cache\TQL00006.2.System'.

DTSJob {EA322807-EF1E-4E3E-A15B-618BD80A8738} created to download from 'file:\\SERVER5\SMSPKGC$\AQN00003' to 'C:\WINDOWS\system32\CCM\Cache\AQN00003.6.System'.
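If you need to check more than a handful of entries, a short script can pull out the source and the transport for you. The sketch below is built only from the two example entries above, so it assumes every job line follows that same “created to download from '...' to '...'” pattern. The function name `parse_dts_job` and the way the host is extracted are illustrative choices.

```python
import re
from urllib.parse import urlparse

# Pattern derived from the two example entries in this post.
DTS_JOB = re.compile(
    r"DTSJob \{(?P<job>[0-9A-Fa-f-]+)\} created to download from "
    r"'(?P<source>[^']+)' to '(?P<dest>[^']+)'"
)

SAMPLE_LINES = [
    r"DTSJob {D700E3F2-51F5-4AFC-9836-CEB8060152A0} created to download from "
    r"'https://SERVER1.CONTOSO.COM/SMS_DP_SMSPKGC$/TQL00006' to "
    r"'C:\WINDOWS\system32\CCM\Cache\TQL00006.2.System'.",
    r"DTSJob {EA322807-EF1E-4E3E-A15B-618BD80A8738} created to download from "
    r"'file:\\SERVER5\SMSPKGC$\AQN00003' to "
    r"'C:\WINDOWS\system32\CCM\Cache\AQN00003.6.System'.",
]


def parse_dts_job(line):
    """Return job ID, transport, and host for a DTSJob line, or None if no match."""
    match = DTS_JOB.search(line)
    if match is None:
        return None
    source = match.group("source")
    lowered = source.lower()
    if lowered.startswith(("http://", "https://")):
        transport = "HTTP"
        host = urlparse(source).hostname  # urlparse lowercases the host name
    elif lowered.startswith("file:"):
        transport = "SMB"
        # Drop the "file:" prefix and leading backslashes, keep the server name.
        host = source[len("file:"):].lstrip("\\").split("\\")[0]
    else:
        transport, host = "unknown", None
    return {"job": match.group("job"), "transport": transport,
            "host": host, "source": source}


for line in SAMPLE_LINES:
    print(parse_dts_job(line))
```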

Question: Does Configuration Manager continue to hand out distribution points as available when they are not?

Answer: Yes. Although Configuration Manager periodically monitors site systems and therefore knows when they are not responding, it continues to hand out distribution points to clients even if it detects that they are not responding. It’s up to the administrator to monitor the site systems manually (using the Site System Status home page) or automatically using Operations Manager or equivalent, and then either correct any problems or delete the failed distribution point.

Question: Can you distinguish between a client trying the next distribution point because there were no more concurrent sessions on the branch distribution point vs. the distribution point was switched off?

Answer: Yes, you can identify the specific condition of a workstation branch distribution point running out of concurrent sessions. Check the client DataTransferService.log file for the error “Error retrieving manifest (0x80070047)”. In comparison, when the distribution point (branch or standard) is offline or unreachable when the client attempts to connect to it, the error code for the same error message is 0x80070035. We’ve also seen 0x800705aa when the client connected and started the download but then the distribution point was turned off in the middle of the transfer.

Unfortunately, you would have to check each client for these errors – there is no current method to log this condition on the branch distribution point. And even if there were, it probably would not be useful unless you know that it was the last available distribution point.
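Since the check has to happen client by client, a small script run against a copy of each client's DataTransferService.log can at least make it less tedious. The sketch below looks for the three error codes mentioned above anywhere in a line and reports what each one indicated in our tests. The sample lines are made up for illustration; the layout of a real log line beyond the quoted error text is an assumption, as is the function name `classify`.

```python
import re

# Error codes and meanings as observed in the tests described in this post.
KNOWN_CODES = {
    0x80070047: "branch distribution point has no free concurrent sessions",
    0x80070035: "distribution point offline or unreachable",
    0x800705AA: "distribution point turned off in the middle of the transfer",
}

HEX_CODE = re.compile(r"0x[0-9A-Fa-f]{8}")


def classify(line):
    """Return (code, meaning) for the first known error code in a log line."""
    for token in HEX_CODE.findall(line):
        meaning = KNOWN_CODES.get(int(token, 16))
        if meaning is not None:
            return token.lower(), meaning
    return None


# Hypothetical sample lines; only the quoted error text comes from the post.
sample_log = [
    "Error retrieving manifest (0x80070047)",
    "Error retrieving manifest (0x80070035)",
    "Download started for job",
]

for line in sample_log:
    result = classify(line)
    if result is not None:
        print(result[0], "->", result[1])
```

To use it on a real log, read the file line by line and pass each line to `classify`, keeping in mind that the codes only tell you what happened on that one client.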

Question: What needs changing in the documentation as a result of these findings?

Answer: There are 2 areas of documentation that were identified as affected by these results, and these will be in our next documentation update:

We will include the behavior of the 11th attempted connection in the topic About Standard and Branch Distribution Points, where it talks about the limitations of 10 simultaneous client connections on a workstation computer. This was the question most frequently asked by customers.

The behavior of retrying the next distribution point in the list if the client fails to connect to the first distribution point will be corrected in the topic Configuration Manager and Content Location (Package Source Files) where it incorrectly stated that only transient errors after the client connected would result in clients retrying the same distribution point for 8 hours.

We hope that these clarifications help you to plan for your distribution points and give you a better understanding of how they work. If you have further questions or feedback, let us know through SMSDocs@Microsoft.com.

- Carol

This posting is provided AS IS with no warranties and confers no rights.
