ConfigMgr 2012 R2 - Multiple SUP scenario: Clients not failing over to the other SUP


Hi Folks,

For those who are not aware of how multiple SUP selection and failover work in ConfigMgr 2012 SP1 onwards, please refer to the post below:

http://blogs.technet.com/b/configmgrteam/archive/2013/03/27/software-update-points-in-cm2012sp1.aspx

It says:

So what determines a scan failure, and how does the client react to these conditions?  Scans can fail with a number of retry, and non-retry error codes.  For failover, software update points only retry on retry error codes, and there are eleven of those we use from the Windows Update Agent and WinHTTP to determine that a scan has failed.  These errors will cause the client to retry its scan, and if necessitated by the number of failures (4), switch software update points.  The error codes themselves aren’t something important for you to worry about, but the high level conditions that scan failed are typically because the WSUS server couldn’t be reached, or it’s temporarily overloaded.  The retry error-codes are all variations on these two high-level themes.  In any case, scan just didn’t work, in which case we do the following:

  • Client scans at its scheduled time, or as initiated client-side through the control panel or SDK.  If scan fails, then it waits 30 minutes to try again, using the same software update point.
  • The client will minimally retry four times at 30 minute intervals, and after fourth failure, and after two more minutes, it will move to the next software update point in its list.
  • The same process is completed on this software update point until we get a successful scan.  Once scan succeeds against a software update point, the client will persist affinity with that software update point until it fails to scan against it, and then only if it fails to scan four times in 30 minute intervals.
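To make the timing above concrete, here is a minimal Python sketch (not ConfigMgr code). It assumes that "four times" means four failed scans in total, counting the first one; the constants and the function name are illustrative only. It also shows the consequence of an error code that is not on the retry list: no failover at all.

```python
from typing import Optional, Set

RETRY_INTERVAL_MIN = 30      # wait between failed scans, per the quoted post
FAILURES_BEFORE_SWITCH = 4   # assumption: four failed scans in total
SWITCH_DELAY_MIN = 2         # extra delay after the fourth failure


def minutes_until_failover(error_code: int, retry_codes: Set[int]) -> Optional[int]:
    """Minutes from the first failed scan until the client switches SUP.

    Returns None when the error code is not a retry code, i.e. the
    client never moves to the next SUP in its list.
    """
    if error_code not in retry_codes:
        return None
    last_failure = (FAILURES_BEFORE_SWITCH - 1) * RETRY_INTERVAL_MIN
    return last_failure + SWITCH_DELAY_MIN


# Two codes that appear in the default retry list (as hex), for illustration.
retry_codes = {0x80072EFD, 0x80072EFE}

print(minutes_until_failover(0x80072EFD, retry_codes))  # 92
print(minutes_until_failover(0x80072EE2, retry_codes))  # None
```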

Concern: if a client is failing with a very common error code and that error is not in the retry list, it will never fail over to the other SUP. For example, error 0x80072ee2 is a common network timeout error, and a client hitting it would be unable to scan against the SUP\WSUS until the underlying issue is fixed. The example environment below illustrates the scenario.

Environment
============
There are 3 forests: A, B, and C.
1. A is trusted by B, but B is not trusted by A.
2. C does not have a trust relationship with A or B.

Topology
=============
1 primary site with 3 software update points.
2 SUPs are in the same domain (A). They share the same SUSDB.
1 SUP is in the untrusted domain (C).

Problem
=======
Some clients in domain B are scanning against the SUP in domain C and fail with error code 0x80072ee2 (WinHTTP timeout), because the clients in B do not have network access to the SUP in domain C.
Per the blog post above, we would expect the client to fail over to the SUP in domain A after several scan failures, since the client can reach that SUP. But even after waiting for days, the client keeps scanning against the SUP in domain C, because the error is not in the retry list.

So what are these retry codes? Run the following query against the site database:

 
select * from SC_Component_Property PROP 
join SC_SiteDefinition SCDEF on SCDEF.SiteNumber = Prop.SiteNumber
where Prop.Name = 'WSUS Scan Retry Error Codes'
and SCDEF.SiteCode = 'PR1'
--Where PR1 is the primary site code.

Check Value2:

{2149842970, 2147954429, 2149859352, 2149859362, 2149859338, 2149859344, 2147954430, 2147747475, 2149842974, 2149859342, 2149859372, 2149859341, 2149904388, 2149859371, 2149859367, 2149859366, 2149859364, 2149859363, 2149859361, 2149859360, 2149859359, 2149859358, 2149859357, 2149859356, 2149859354, 2149859353, 2149859350, 2149859349, 2149859340, 2149859339, 2149859332, 2149859333, 2149859334, 2149859337, 2149859336, 2149859335}

 

These are the WSUS scan retry error codes, expressed as decimal values.
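Because the list is stored in decimal, it is hard to compare against the hex codes that show up in the logs. The following Python sketch decodes a short excerpt of the list above to hex and checks whether 0x80072ee2 is present. The helper names are illustrative, and the brace-delimited format is assumed from the value shown above.

```python
from typing import List


def parse_value2(value2: str) -> List[int]:
    """Parse a brace-delimited, comma-separated list of decimal codes."""
    body = value2.strip().strip("{}")
    return [int(item) for item in body.split(",") if item.strip()]


def to_hex(code: int) -> str:
    """Format a 32-bit code the way it usually appears in logs."""
    return f"0x{code:08X}"


# Excerpt of the default list shown above (not the full list).
excerpt = "{2149842970, 2147954429, 2149859352, 2147954430, 2147747475}"
codes = parse_value2(excerpt)

for code in codes:
    print(code, to_hex(code))

# The timeout error from the scenario is not among them.
print(0x80072EE2 in codes)  # False
```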

The same list can be found in WMI, under this namespace:

root\SMS\site_PR1

Query - select * from SMS_SCI_SCProperty where propertyname like '%scan retry error codes%'

SMS_WSUS_CONFIGURATION_MANAGER -> WSUS Scan Retry Error Codes

Workaround: add the error code to the retry list. Note that the code has to be converted to decimal before it is added (for example, with a scientific calculator): 0x80072ee2 -> 2147954402. Follow the steps below.

1. Run “wbemtest” with an administrator account.

2. Connect to root\sms\site_<sitecode>

3. Run the query: select * from sms_sci_component where componentname='SMS_WSUS_CONFIGURATION_MANAGER'

4. Double-click the object.

5. Double-click the “Props” property.

6. Click View Embedded.

7. Double-click the query results to find the one whose PropertyName is “WSUS Scan Retry Error Codes”.

8. Double-click Value2. Add 2147954402 to the list.

9. Click “Save Property”.

10. Click “Save Object”.

11. Click “Close” and click “Save Property”.

12. Click “Save Object”.

This can also be done with the ConfigMgr SDK.
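As a cross-check on the conversion and on the edited Value2 string, here is a small Python sketch. It only manipulates text and does not touch WMI. The function name is hypothetical, and the brace-delimited format is assumed from the value shown earlier.

```python
def add_retry_code(value2: str, hex_code: str) -> str:
    """Return Value2 with hex_code appended in decimal, if not already present."""
    body = value2.strip().strip("{}")
    codes = [int(item) for item in body.split(",") if item.strip()]
    new_code = int(hex_code, 16)  # accepts an optional 0x prefix
    if new_code not in codes:
        codes.append(new_code)
    return "{" + ", ".join(str(code) for code in codes) + "}"


print(int("0x80072ee2", 16))  # 2147954402
print(add_retry_code("{2149842970, 2147954429}", "0x80072ee2"))
# {2149842970, 2147954429, 2147954402}
```

Running it twice on the same string leaves the list unchanged, which avoids adding a duplicate entry by accident.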

We also found where the clients store these error codes, so we could verify that the codes came down properly after a policy refresh.

 
To verify the codes on the client machine:

1) Open an elevated command prompt.

2) Type in wbemtest.

3) When WMI Tester comes up, give it the namespace root\ccm\policy\machine\actualconfig, then hit Connect.

4) Hit Enum Classes. It will pop up a new box; select the Recursive radio button and then hit OK.

5) Find CCM_UpdateSource in the list that is presented and double-click on it. It will pop up a new box.

6) Hit "Instances". It will pop up a new box; then double-click on CCM_UpdateSource.UniqueID = "{.....ID}"

7) In the middle of the box, there is a scroll bar. Scroll until you find "ScanFailureRetryErrorCodes", then double-click on it.

8) You should be presented with the list in the middle of the box that pops up.
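When comparing the client-side list with the site-side list, keep in mind that some tools may show the same 32-bit code as a signed integer. One of the VBScript examples in the comments, for instance, writes 0x80072ee2 as -2147012894. The following Python sketch converts between the two representations; the helper names are illustrative.

```python
def to_signed32(code: int) -> int:
    """Interpret an unsigned 32-bit value as a signed 32-bit integer."""
    return code - (1 << 32) if code >= (1 << 31) else code


def to_unsigned32(code: int) -> int:
    """Interpret a signed 32-bit integer as an unsigned 32-bit value."""
    return code & 0xFFFFFFFF


print(to_signed32(2147954402))     # -2147012894
print(to_unsigned32(-2147012894))  # 2147954402
```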

Hope it helps!

Umair Khan

Support Escalation Engineer | Microsoft System Center Configuration Manager 

Disclaimer: This posting is provided "AS IS" with no warranties and confers no rights.

Comments (25)

  1. NoDowt says:

    Hi Umair,
    Can you please confirm a few things on this...
    -On which system(s) does this need to be edited? (presume just the primary sever?)
    -What is required to make this change effective on client systems?

    Also, is it possible to force a client system to use a particular SUP? or atleast speed up the automatic retry/failover process?
    In our infrastructure every client only has network access to 1 of our 3 SUPs (which 1 of the 3 depends the clients location & security requirements); and its always a sore point installing a new system & seeing it take up to 4 hours to reach the valid SUP.

    Thanks in advance!

  2. @NoDowt: Yes Primary site server WMI should be enough. For second question, No it is as of now not possible to force a client to a specific SUP.

    @Mike: I have added client side troubleshooting for this. See if the client has got the policy change and has the code in the WMI. If not a policy reset can help.

  3. Chris says:

    If you copy and paste the query select * from sms_sci_component where componentname=”SMS_WSUS_CONFIGURATION_MANAGER” make sure to delete and re-type the quotation marks or the query will fail.

  4. J-me says:

    I had to use single quotes instead of double quotes.

  5. mike says:

    This is great if it works. But I can't seem to get it to work because I has been more than one week and the client computers are not switching SUP! Is there a way to manually force the SUP switching?

  6. Brian says:

    I've implemented the '0x80072ee2' failure code in my environment where there are two SUPs - one in the DMZ and one in the Internal network; DMZ clients seem to refuse to fail to the DMZ SUP and stay there - they do occasionally seem to flip to it, but then drop off it within a few minutes.

    The listed behaviour of half hourly checks and failover after four does not apply, initially there are checks every half an hour but it doesn't fail over after four as it should. Has anyone else had this?

  7. Spruce says:

    @Brian I see this behavior to in a similar setup, it doesn't fail over.
    Do someone have a solution for this?

  8. Dave says:

    @Brian and @Spruce - I am seeing exactly the same issue. I have added the error code but the clients dont seem to failover.
    Anyone know of a solution?

  9. Neil says:

    Same thing is happening at our site as well. Applied WMI, it shows up on the client. The client will swap over, but then randomly it will swap back to the other side. This caused an issue when we had a bunch of updates to install: it wouldn't stay on the correct side for long enough and then failed the installation, as the WSUS server was on the wrong one again.

  10. Timur says:

    Hi guys!
    Was suffering from the same thing, added new err code on the Primary, waited 2-3 days and had nothing on clients. Then did a couple of experiments on clients and ended up with a VBS script which I deployed across the whole infrastructure and now I'm waiting for results and it looks like they will be successful. I just add the err code to actual config, because it seems that in my infrastructure it isn't rewritten by primary or anything else.

    Set objWMIService = GetObject("WinMgmts:{impersonationLevel=impersonate,AuthenticationLevel=pktprivacy}!root\ccm\policy\machine\actualconfig")
    Set objItems = objWMIService.InstancesOf("CCM_UpdateSource")

    For Each objItem in objItems
        For Each prop in objItem.Properties_
            If prop.Name = "ScanFailureRetryErrorCodes" Then
                'WScript.Echo prop.Name & ": " & Join(prop.Value,", ")
                prop.Value(0) = -2147012894
                objItem.Put_
            End If
        Next
    Next

    Keep in mind that I change the first err code in the array, because I haven't figured out how to add new member to the array.

    I tried ReDim but it didn't help.
    Hope it will help somebody.

    Cheers!

  11. Ilia Martinov says:

    Hello, Umair
    Is there any way to turn this feature off?
    In our network design, if some network problems occur, SUP failover feature causing clients to switch to SUP in remote network location, but we don't want it. So, can we "fasten" only one SUP per boundary group, even if its down or unreachable for some time?
    Thank you!

  14. adam says:

    LLia, I totally agree with you I don't think there is but I wish there was away to "fasten" to a SUP based on boundary group.

  15. David says:

    A Powershell script to add the code to the clients (can be used with DCM, thanks for Thomas Kurth)

    $updateConfig = Get-WmiObject -Namespace Root\ccm\Policy\Machine\ActualConfig -Class CCM_UpdateSource

    $updateConfig.ScanFailureRetryErrorCodes += 2147954402
    $updateConfig.put()

  16. SCCM User says:

    @Ilia Martinov , cant turn it off

  17. Zap B. says:

    This is great info.. except the part about clients receiving the policy and updating their ScanFailureRetryErrorCodes does not seem to work. Is there anything new about how to update the clients without doing a mass script deployment?

    1. Aurimas N says:

      Same here, added the code, verified it is present on the client, but clients never seem to fail over to another SUP.

  18. Cam says:

    The ability to manually switch clients to a new software update point is in the product sheet for 1606.
    Has anyone tried this to see if it works?

    Cam
    PS, I followed the instruction above and added the additional return codes to my 1602 environment and haven't seen a change yet. The clients are aware of the code but they are still not failing over to the proper SUP. I've been struggling with this issue for two months now and it doesn't seem to be resolved.

    1. rbud says:

      This seems to be an issue since SystemCenter vNext. All my Clients hosted in an untrusted forest stopped pulling updates right after we upgraded from 2012 R2 to vNext. Automatic Failover worked perfectly fine in 2012 R2 without any Patches or Workarounds. And in vNext it does not, no matter what I do. Even the Failover trigger we have now in the Collections does not do anything. This just seems to be broken 🙁

  19. Juerg Koller says:

    Hi Umair
    Since there is no option to direct a Client to a specific SUP, is there at least a possibility to trigger the “switch to next Software Update Point” client notification. It is not very helpful to trigger this on the collection level. If we can trigger the switch via a script, it would be a good option to make a compliance baseline script, watch the WUAHandler.log for specific error codes and trigger the switch to next Software Update Point.

    1. 1702 has boundary based SUP now. So if you can move to CB then it would be much simple to manage.

  20. Anthony says:

    Hi Umair,

    Thanks for the article. I am facing the same issue for one of my customers. We have a Primary site and 57 secondary sites. On most of the sites, endpoints failing to scan with error '0x80072EE2 - Network connection: Windows Update Agent encountered transient network connection-related errors;. And immediately they are reaching out to Primary SUP. As a workaround, we reinstalled CCM clients and this resolved. But during software deployment, we are not giving fall back as per customer requirement to avoid network issue. Hence, scan failed endpoints are getting patched. We are having tough time. Please suggest how can we overcome this issue.

    Thanks,
    Anthony

    1. Anthony says:

      Hi Umair,

      Sorry, there was typo... scanned failed machines are actually not getting the software updates. Since it is very big environment having tough time.

      Thanks,
      Anthony

    2. For secondary sites not reaching the Primary SUP we do have a workaround where we have to make a small edit in the MP_GetWSUSServerLocations stored procedure on the secondary database. Please open up a CSS case and we can help you.
