Tweaking SCOM 2012 Management Servers for large environments




Comments (36)
  1. Kevin Holman says:

    Hi Diane –

We had an issue in SCOM 2012 RC where this was recommended as the fix… then once RTM shipped, this adjustment was no longer needed. Then the blogs started posting NEVER to change this and that it was not recommended… which is absolutely true for 98% of
    the deployments out there.

    In general, you should never change these settings unless advised to by Microsoft support, or without fully understanding the ramifications. However, this IS a valid setting, and IS recommended in VERY SPECIFIC cases where the default settings are not long
    enough and the result is resource pool suicide. If you aren’t experiencing this problem, then in general it doesn’t need to be changed.

    You can have the same conversation about the default observer, which is the database. In large environments, it is possible the default observer will be very slow due to I/O load, and there are specific scenarios where it makes sense to remove the default observer
    and let the management servers make the decisions for resource pool quorum. HOWEVER, it should not be removed (generally speaking) unless the customer experiences this specific issue which can only be determined via tracelogs. There are tradeoffs to making
    this process longer. The primary tradeoff is that this increases the chances of duplicating workflows on different management servers, and potentially longer recovery times in the event of a real outage. Another tradeoff is that ALL management servers MUST
    have the same settings. If any MS gets installed with the default settings, you will have constant resource pool flapping because the communication expectations differ across MS’s.
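    To Kevin’s point that every management server must carry identical pool settings, a consistency check can be sketched in PowerShell. The registry path, value name, and server names below are illustrative assumptions, not values from this post — verify the exact key and value names in your own environment before relying on this.

    ```powershell
    # Sketch only: compare a PoolManager registry value across all management servers.
    # Key path, value name, and server names are assumptions - confirm against your environment.
    $servers   = 'MS01','MS02','MS03'                      # hypothetical management server names
    $keyPath   = 'SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\PoolManager'
    $valueName = 'PoolLeaseRequestPeriodSeconds'           # example value name

    $results = foreach ($server in $servers) {
        $value = Invoke-Command -ComputerName $server -ScriptBlock {
            param($path, $name)
            (Get-ItemProperty -Path "HKLM:\$path" -Name $name -ErrorAction SilentlyContinue).$name
        } -ArgumentList $keyPath, $valueName
        [pscustomobject]@{ Server = $server; Value = $value }
    }

    # Any mismatch here means pool members disagree - the flapping scenario described above.
    $results | Group-Object Value | Where-Object Count -ne $servers.Count
    ```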

    So no – I don’t recommend making changes to the pool manager registry, unless you have a large environment, and you are experiencing resource pool failure far too frequently. And in those cases, we should examine the default observer behavior as well. But saying
    "never" change it? I disagree.

  2. Kevin Holman says:

    @ Ted T Hacker –

    Great question. Yes, there is. "Command Timeout Seconds" has to do with regular stored procedure calls from a SCOM workflow to the DW, such as maintenance operations/aggregations. "Deployment Command Timeout Seconds" is different – this value has to do with
    scripts that are called during a major update, such as a version update, service pack, or update rollup. Changing the latter is rarer; however, I have seen issues reported where these scripts got caught up in blocking and took a LONG time to complete, so
    rather than fail due to a timeout, we had the customer set a very long timeout to get them to complete. It isn’t a common occurrence and generally I’d only change that one under advisement from support, like you did. All good.
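    As a concrete illustration of the distinction Kevin draws, both values live under the Data Warehouse registry key mentioned later in this thread. This is a sketch, not a recommendation: the 86400 figure is simply the one-day value Ted reports using, and per Kevin’s advice these timeouts should only be changed under guidance from support.

    ```powershell
    # Sketch: inspect (and optionally set) the two Data Warehouse timeout values.
    # Only change these under advisement from Microsoft support.
    $dwKey = 'HKLM:\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Data Warehouse'

    # Read the current values (absent values simply return nothing).
    Get-ItemProperty -Path $dwKey |
        Select-Object 'Command Timeout Seconds', 'Deployment Command Timeout Seconds'

    # Example from the thread: a one-day deployment timeout used during an SP1-to-R2 upgrade.
    Set-ItemProperty -Path $dwKey -Name 'Deployment Command Timeout Seconds' -Value 86400 -Type DWord
    ```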

  3. Kevin Holman says:

    Brett – where did you get this 500 group limitation? The product group tested up to 1000 groups when performance testing SCOM 2007. It was recommended not to go over 1000 groups simply because we didn’t test beyond that. However, using groups that don’t
    rollup health state, and using simple group memberships, heavily affected this scalability concern. Now that SCOM 2012 has a distributed model for config and group population, I have not heard of any limitation such as this, nor have I heard what we test up to;
    I’d assume likely the same 1000 groups for testing. However, I have customers beyond this and they don’t have any issues with group population.

  4. Ted T Hacker says:

    Is the "Microsoft.SystemCenter.DataWarehouse.CollectEntityHealthStateChange" workflow affected by the "Command Timeout Seconds" value? How can I tell whether the workflow is timing out based on the registry value or is running to completion? Should the workflow
    always write an event (31572?) whenever it finishes without issue? The monitor for this, named "Data Warehouse Object Health State Data Collection Writer Periodic Data Maintenance Recovery State", has an override for the "Interval". Does changing the monitor
    override only affect how long the monitor waits for a 31572 before alerting?

    What is the downside to setting the "Command Timeout Seconds" to a longer time frame? I suppose at some value you just want to be alerted that it is taking a long time to process lots of data. I assume you don’t want to mask the fact that there may be a monitor
    state change or collection rule going nuts.

  5. Ted T Hacker says:

    Kevin,

    Is there a difference between the HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Data Warehouse
    "Command Timeout Seconds"
    and "Deployment Command Timeout Seconds" values? A MS Engineer during an incident advised creating the second value. Do these values conflict with each other or are they complementary? We have the "Deployment Command Timeout Seconds" set to 86400 (1 day). At
    that point we were having problems upgrading from 2012 SP1 to R2. Thanks. Ted.

  6. Kevin Holman says:

    @JT – you are correct, I do not dictate any changes to gateways because, to date, I have not experienced any changes that are proactively needed on the gateway role, regardless of management group size. They seem to handle things quite well out of the box.

  7. Kevin Holman says:

    @Jesse – Actually – Bulk Insert Command Timeout was a new registry control available with UR1. It wasn’t added in UR5. I don’t have a recommendation for adjusting this – it simply opened the capability to adjust this if needed. I only recommend changing
    that one if directed to by Microsoft Support to resolve a problem with bulk inserts to the warehouse, which is a rare condition. I have never worked with a customer who needed this modified from the default.

  8. Diane says:

    Hi, Kevin,
    I have seen a lot of blogs that recommend against setting the PoolManager key. They indicate that this was required when SCOM 2012 was first released but has since been fixed – implementing it now can actually degrade performance. Can you confirm this is still
    required for large environments?

  9. Larry says:

    Hey Kevin,

    Good stuff, as always! 🙂

    It would be of great value if, for each of the above registry settings, associated monitoring instrumentation could be identified in order to help administrators determine whether the corresponding registry setting update should be considered for their environment.

    Examples might include:
    1) evaluating a particular PerfMon counter against a specific threshold, or
    2) the presence of specific event log entries

    One might even wonder if/why this is not already included as monitors in the OpsMgr based MPs…

  10. Zohar says:

    Very helpful, good article. I wonder why Microsoft does not publish an article with those registry settings.

  11. brett says:

    SCOM 2007 recommended not having more than 500 groups. Does SCOM 2012 have a recommended limit?


  13. jesse says:

    UR5 introduces a new registry value: Bulk Insert Command Timeout.
    http://support.microsoft.com/kb/3029227 Do you have any guidance around using this value as well?

  14. JT says:

    You mention that these changes are recommended on management servers but make no mention if they are required on a gateway server. What is Microsoft’s stance on tweaking registry settings on them?

  15. JT says:

    Thanks for the quick response. I didn’t think so either but I wanted to be 100% certain. Thanks again for taking the time to respond to everyone’s questions and keep this site updated. It is appreciated!

  16. Kevin Holman says:

    @Jasper –
    I don’t know offhand. MOST registry changes require a service restart unless there is code to check the registry on an interval or to be notified of a reg change. I doubt we would do this and my assumption is that a restart of services is required. I’d have
    to tracelog to be sure.

  17. Igor says:

    Hello. I have 4 VM (8 GB RAM) and 2 physical (24 GB RAM) SCOM servers.

    700 Windows agents
    60 Linux agents.
    Lots of MPs, including heavy custom ones, like Progress Sonic monitoring
    349 groups
    270 network devices

    The health service on the physical servers uses ~3–4 GB.

    The virtual servers sometimes say that they are dropping data.

    Should I make the registry edits with those parameters?

  18. Hi Kevin,

    Thank you for sharing.

    Are these registry keys also applicable to Gateways Servers or just for Management Servers only?

    Marlon

  19. jean says:

    Hi Kevin, Are those keys also valid for a large environment with OpsMgr 2007 R2?
    thank you

  20. Kevin Holman says:

    @Marlon – SOME of these are potential candidates for gateways, but I generally don’t recommend any changes on GW’s unless you are specifically experiencing a problem. On management servers – I set these on all my customers, regardless of size.

  21. jean says:

    Thank you!

  22. Thanks Kevin for the confirmation! We are still observing the changes in our SCOM environment.

  23. Anonymous says:

    This is a common practice for rotating old physical servers coming off lease, or when moving VM based


  25. Ashish says:

    Hi Kevin,

    Does these settings require server restart?

    Thanks
    Ashish

  26. Tommy says:

    I do miss the parameters
    Maximum Global Pending Data Count
    and
    Persistence Version Store Maximum
    and
    Persistence Cache Maximum
    in this blog.

    With regards to the latter;
    In this PDF ( http://download.microsoft.com/download/8/2/8/828C05A2-E6A0-436A-9AE1-704A8005E355/9780735695825.pdf ) they say;

    Another important setting is Persistence Cache Maximum of type DWORD. This setting controls the amount of memory in pages used by Persistence Manager for its data store on the local database. The default value for this is 262144 (decimal), which is also the recommended value. If you are running an older version of Operations Manager on management servers that manage a large number of objects, you should change the value to 262144 (decimal).

    The default and recommended values are the same here; I think that is a mistake, but for now I’ve set it to 262144.

    Suggestions are welcome!

  27. Chris Gibson says:

    I love this article: a deep explanation, followed by an exec summary, followed by an “OK, I realise you’re lazy, so…” with the actual commands. Genius.

  28. Arthur says:

    Kevin,

    When I restart the healthservice on one management server, the All Management Server Pool Unavailable alert is raised, and my NOC dashboard created with widgets changes its state to gray. Then I wait 15 minutes for it to go back to normal. I think this is related to the pool process version. Is there any configuration to increase the pool unavailable value?

    1. Kevin Holman says:

      How many management servers do you have?

      1. Arthur Silvany says:

        Kevin,

        I have 3 management servers in the pool

        1. Kevin Holman says:

          3 MS should have no issue with keeping pool available when one MS is stopped. It sounds like you have some pool members not available at the time you stop it – which could cause a pool suicide. Look for 15000-15003 events in the event log and see that the pool is stable first.
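          A quick way to follow Kevin’s advice here is to pull the recent pool lease events from the Operations Manager event log on each management server. This is a Windows-only diagnostic sketch; the 15000–15003 event IDs are the ones Kevin cites above.

          ```powershell
          # Sketch: list recent resource pool events (15000-15003) from the Operations Manager log.
          Get-WinEvent -FilterHashtable @{
              LogName = 'Operations Manager'
              Id      = 15000, 15001, 15002, 15003
          } -MaxEvents 50 -ErrorAction SilentlyContinue |
              Sort-Object TimeCreated |
              Format-Table TimeCreated, Id, Message -AutoSize
          ```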

  29. Hi Kevin,

    Do the same recommendations apply to SCOM 2016?

  30. JFarthing says:

    Hi Kevin,

    I have an environment where I am regularly seeing Resource Pools fail, from the SCOM console itself we see a heartbeat failure for the Pool and going by Event Logs I see that members are not acknowledging the lease request and are unloading all workflows.

    As the environment isn’t in live use, I’ve tested applying the Pool Manager fixes mentioned in the blog, and this appears to resolve or heavily reduce the issue. I was wondering whether you could provide more information about what these settings grant the pool more time to complete, as the only time-related recommendation I remember seeing in the official documentation is ~10ms latency between servers.

    In the case of my Resource Pools failing, is it almost certainly going to come down to workflows taking too long to complete despite apparent low load on the servers, or could it also be network issues causing the Pool to fail?

    As a side note, in the case of a pool being in a failed state would you expect SNMP traps (sent to all servers in a pool) to still get processed and alert within SCOM? The Pools in question are only used for SNMP monitoring, so if traps are still processed and it is just SNMP polling being impacted I’d be less concerned!

    Many thanks.

    James

    1. Kevin Holman says:

      I *never* recommend editing the registry for pool timeouts – because they often just mask the real problem and are almost never a good solution. We have also seen issues where they create pool instability by changing them.

      The most common issue with pool stability is network latency between management servers or MS to Database. The second most common issue is overloaded management groups, where we see blocking in the SQL database, or just too many objects hosted by the pools on the management servers.
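      Given that network latency is the most common culprit Kevin names, a simple first check is to measure round-trip times from each management server to its peers and to the SQL server; the ~10ms figure JFarthing quotes from the documentation makes a reasonable yardstick. A sketch, with placeholder server names (the ResponseTime property assumes Windows PowerShell 5.1):

      ```powershell
      # Sketch: measure MS-to-MS and MS-to-SQL round-trip latency with ICMP ping.
      # Server names below are placeholders for your environment.
      $targets = 'MS02','MS03','SQL01'

      foreach ($target in $targets) {
          $pings = Test-Connection -ComputerName $target -Count 5 -ErrorAction SilentlyContinue
          $avg   = ($pings | Measure-Object -Property ResponseTime -Average).Average
          # Sustained averages well above ~10 ms between pool members warrant a closer look.
          '{0}: avg {1} ms' -f $target, $avg
      }
      ```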

  31. Hi Kevin –

    We updated the registry on all our management servers and encountered no issues for 2 years. Recently, we went to create a new SCOM group via the console and are experiencing very long load times, and sometimes it does not load at all.

    Can you give us insight into which setting we need to check or adjust? Per the PS command, we have a total of 1442.

    Thanks!

Comments are closed.
