Tweaking SCOM 2012 Management Servers for large environments


 

There are many articles on tweaking certain registry settings for SCOM agents, gateways, and management servers, for many reasons: large deployments, custom 3rd party MPs, and monitoring Exchange 2010, to name a few.  Matt Goedtel has a good list on his blog:  http://blogs.technet.com/b/mgoedtel/archive/2010/08/24/performance-optimizations-for-operations-manager-2007-r2.aspx

 

Below, I’d like to post some settings that I change on management servers when monitoring large environments.  What does “large” mean?  Well, I’d characterize that as a management group with a significant agent count (>1000), or a very large instance space (lots of management packs deployed, both MS and 3rd party, plus custom MPs which don’t always behave well).  Perhaps you have a very large number of groups, or groups with complex expressions.  It could be that you are monitoring a large number of “agentless” items, such as Linux servers, network devices, or URLs.

These settings are very common, and I recommend them for all environments, with documented caveats below.

 

1.  Key:    HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters\
REG_DWORD Decimal Value:        Persistence Checkpoint Depth Maximum = 104857600
SCOM 2012 default existing registry value = 20971520

All management servers that host a large number of agentless objects, which results in the MS running a large number of workflows (network/URL/Linux/3rd party/Veeam):  This is an ESE DB setting which controls how often ESE writes to disk.  A larger value will decrease disk I/O caused by the SCOM healthservice but increase ESE recovery time in the case of a healthservice crash.

2.  Key:    HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters\
REG_DWORD Decimal Value:        State Queue Items = 20480
SCOM 2012 default existing registry value: not present.  Value must be created.  Default code value = 10240

All management servers in a large management group:  This sets the maximum size of the healthservice internal state queue.  It should be equal to or larger than the number of monitor-based workflows running in a healthservice.  Too small a value, or too many workflows, will cause state change loss.  http://blogs.msdn.com/b/rslaten/archive/2008/08/27/event-5206.aspx

3.  Key:    HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters\PoolManager\
REG_DWORD Decimal Value: 
    PoolLeaseRequestPeriodSeconds = 600
    PoolNetworkLatencySeconds = 120
SCOM 2012 existing registry value:  not present (must create PoolManager key and both values).  Default code values = 120 and 30 seconds, respectively

All management servers that participate in any resource pool and run a large number of workflows:  This is VERY RARELY changed, and in general I only recommend changing this under advisement from a support case.  The resource pools work quite well on their own, and I have worked with very large environments that did not need these to be modified.  A change is more common when you are dealing with a rare condition, such as a management group spread across datacenters with high latency links, DR sites, a MASSIVE number of workflows running on management servers, etc.

4.  Key:     HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\
REG_DWORD Decimal Value:       GroupCalcPollingIntervalMilliseconds = 900000
SCOM 2012 existing registry value:  not present (must create value).  Default code value = 30000 (30 seconds)

All management servers that participate in the All Management Servers resource pool, that have a large agent count or large number of groups:  This setting will slow down how often group calculation runs to find changes in group memberships.  Group calculation can be very expensive, especially with a large number of groups, large agent count, or complex group membership expressions.  Slowing this down will help keep groupcalc from consuming all the healthservice and database I/O.

5.  Key:    HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Data Warehouse\
REG_DWORD Decimal Value:    Command Timeout Seconds = 1200
SCOM 2012 existing registry value: not present (must create "Data Warehouse" key and value).  Default in code value = 300

All management servers in a management group:  This helps with dataset maintenance, as the default timeout is often too short.  Setting this to a longer value helps reduce the 31552 events you might see with standard database maintenance.  This is a very common issue.   http://blogs.technet.com/b/kevinholman/archive/2010/08/30/the-31552-event-or-why-is-my-data-warehouse-server-consuming-so-much-cpu.aspx

6.  Key:    HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Data Warehouse\
REG_DWORD Decimal Value:    Deployment Command Timeout Seconds = 86400
SCOM 2012 existing registry value: not present (must create "Data Warehouse" key and value).  Default in code value = 10800 seconds (3 hours)

All management servers in a management group:  This helps with deployment of the heavy-handed scripts that are applied during version upgrades and cumulative updates.  Customers often see blocking on the DW database while indexes are created, and this prevents the script from being deployed within the default of 3 hours.  Setting this value to allow one full day to deploy the script resolves most customer issues.  Setting this to a longer value helps reduce the 31552 events you might see with standard database maintenance after a version upgrade or UR deployment.  This is a very common issue in large environments with very large warehouse databases.

 

7.  Key:    HKLM\SOFTWARE\Microsoft\System Center\2010\Common\DAL\
REG_DWORD Decimal Value:
    DALInitiateClearPool = 1
    DALInitiateClearPoolSeconds = 60
SCOM 2012 existing registry value:   not present – code default – 30 seconds?

All management servers in ANY management group:  This setting configures the SDK service to attempt a reconnection to SQL Server on a regular basis after a disconnection.  Without these settings, an extended SQL outage can leave a management server unable to reconnect to SQL when SQL comes back online.  Per http://support.microsoft.com/kb/2913046/en-us, all management servers in a management group should get the registry change.
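
The time-based values above mix milliseconds and seconds, so it can help to see them in friendlier units. Here is a minimal Python sketch that does the conversion. The dictionary simply restates the recommended values from this article; the helper name `describe` is my own invention for illustration and is not part of any SCOM tooling.

```python
# Recommended time-based values from the list above, with the unit that
# each registry value name implies (milliseconds or seconds).
RECOMMENDED = {
    "GroupCalcPollingIntervalMilliseconds": (900000, "ms"),
    "Command Timeout Seconds": (1200, "s"),
    "Deployment Command Timeout Seconds": (86400, "s"),
    "DALInitiateClearPoolSeconds": (60, "s"),
}


def describe(value, unit):
    """Return a human-readable duration for a value given in 'ms' or 's'."""
    seconds = value / 1000 if unit == "ms" else value
    if seconds % 3600 == 0:
        return f"{int(seconds // 3600)} h"
    if seconds % 60 == 0:
        return f"{int(seconds // 60)} min"
    return f"{seconds:g} s"


for name, (value, unit) in RECOMMENDED.items():
    print(f"{name} = {value} -> {describe(value, unit)}")
```

Read this way, group calculation polling moves from the 30 second default to 15 minutes, and the deployment timeout moves from 3 hours to 24 hours.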

 

To summarize:

Registry Key | Reg DWORD Value Name | Reg DWORD Decimal Value
HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters\ | Persistence Checkpoint Depth Maximum | 104857600
HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters\ | State Queue Items | 20480
HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters\PoolManager\ | PoolLeaseRequestPeriodSeconds | 600
HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters\PoolManager\ | PoolNetworkLatencySeconds | 120
HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\ | GroupCalcPollingIntervalMilliseconds | 900000
HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Data Warehouse\ | Command Timeout Seconds | 1200
HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Data Warehouse\ | Deployment Command Timeout Seconds | 86400
HKLM\SOFTWARE\Microsoft\System Center\2010\Common\DAL\ | DALInitiateClearPool | 1
HKLM\SOFTWARE\Microsoft\System Center\2010\Common\DAL\ | DALInitiateClearPoolSeconds | 60

 

****NOTE:

On modifying the following:

    HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters\PoolManager\
REG_DWORD Decimal Value: 
    PoolLeaseRequestPeriodSeconds = 600
    PoolNetworkLatencySeconds = 120

This should NOT be done unless you are guided to by Microsoft support, generally speaking.  If you make changes to this setting, the same change must be made on ALL management servers, otherwise the resource pools will constantly fail.  All management servers must have identical settings here.  If you add a management server in the future, this setting must be applied immediately if you modified it on other management servers, or you will see your resource pools constantly committing suicide and failing over to other management servers, reinitializing all workflows in a loop.   All the other settings in this article are generally beneficial.  This specific one for PoolManager should receive great scrutiny before changing, due to the risks.
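
Since the PoolManager values must be identical on every management server, a quick consistency check is worth having. The following is only a sketch in Python: it assumes you have already collected the values from each server by some means of your own, and the server names (`MS01` and so on) and the helper `find_mismatches` are hypothetical.

```python
# The two PoolManager value names discussed in the note above.
POOL_VALUE_NAMES = ("PoolLeaseRequestPeriodSeconds", "PoolNetworkLatencySeconds")


def find_mismatches(settings_by_server):
    """Return {value_name: {server: value}} for values that differ across servers.

    A value that is missing on a server is reported as None, meaning that
    server is still running with the code default.
    """
    mismatches = {}
    for name in POOL_VALUE_NAMES:
        per_server = {
            server: settings.get(name)
            for server, settings in settings_by_server.items()
        }
        if len(set(per_server.values())) > 1:
            mismatches[name] = per_server
    return mismatches


# Hypothetical inventory: made-up server names, values as collected.
inventory = {
    "MS01": {"PoolLeaseRequestPeriodSeconds": 600, "PoolNetworkLatencySeconds": 120},
    "MS02": {"PoolLeaseRequestPeriodSeconds": 600, "PoolNetworkLatencySeconds": 120},
    "MS03": {},  # newly added server, registry values never created
}

for name, per_server in find_mismatches(inventory).items():
    print(f"MISMATCH {name}: {per_server}")
```

In this made-up inventory, MS03 is the newly added server that never received the change, so both values are flagged.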

 

 

Below are some simple reg add statements you can run to make setting these easy:

reg add "HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters" /v "State Queue Items" /t REG_DWORD /d 20480 /f
reg add "HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters" /v "Persistence Checkpoint Depth Maximum" /t REG_DWORD /d 104857600 /f
reg add "HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0" /v "GroupCalcPollingIntervalMilliseconds" /t REG_DWORD /d 900000 /f
reg add "HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Data Warehouse" /v "Command Timeout Seconds" /t REG_DWORD /d 1200 /f
reg add "HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Data Warehouse" /v "Deployment Command Timeout Seconds" /t REG_DWORD /d 86400 /f
reg add "HKLM\SOFTWARE\Microsoft\System Center\2010\Common\DAL" /v "DALInitiateClearPool" /t REG_DWORD /d 1 /f
reg add "HKLM\SOFTWARE\Microsoft\System Center\2010\Common\DAL" /v "DALInitiateClearPoolSeconds" /t REG_DWORD /d 60 /f
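
If you would rather keep the settings in one table and generate the statements from it, something like this Python sketch could work. It only builds the command text and does not touch the registry; the function name `reg_add_command` and the DWORD range check are my own additions, and the table restates the seven statements above.

```python
# Settings restated from the reg add statements above:
# (registry key, value name, decimal data).
SETTINGS = [
    (r"HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters", "State Queue Items", 20480),
    (r"HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters", "Persistence Checkpoint Depth Maximum", 104857600),
    (r"HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0", "GroupCalcPollingIntervalMilliseconds", 900000),
    (r"HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Data Warehouse", "Command Timeout Seconds", 1200),
    (r"HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Data Warehouse", "Deployment Command Timeout Seconds", 86400),
    (r"HKLM\SOFTWARE\Microsoft\System Center\2010\Common\DAL", "DALInitiateClearPool", 1),
    (r"HKLM\SOFTWARE\Microsoft\System Center\2010\Common\DAL", "DALInitiateClearPoolSeconds", 60),
]


def reg_add_command(key, value_name, data):
    """Build a 'reg add' command line for a REG_DWORD value."""
    # A DWORD is an unsigned 32-bit integer, so reject anything outside that range.
    if not 0 <= data <= 0xFFFFFFFF:
        raise ValueError(f"{value_name}: {data} does not fit in a REG_DWORD")
    return f'reg add "{key}" /v "{value_name}" /t REG_DWORD /d {data} /f'


for key, value_name, data in SETTINGS:
    print(reg_add_command(key, value_name, data))
```

The printed lines can be pasted into an elevated command prompt or saved as a batch file.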


Comments (27)

  1. Kevin Holman says:

    Hi Diane –

    We had an issue in SCOM 2012 RC where this was recommended as the fix. Once RTM shipped, this no longer needed to be adjusted. Then the blogs started posting NEVER to change this and that it was not recommended, which is absolutely true for 98% of the deployments out there.

    In general, you should never change these settings unless advised to by Microsoft support, or without fully understanding the ramifications. However, this IS a valid setting, and IS recommended in VERY SPECIFIC cases where the default settings are not long enough and the result is resource pool suicide. If you aren’t experiencing this problem, then in general it doesn’t need to be changed.

    You can have the same conversation about the default observer, which is the database. In large environments, it is possible the default observer will be very slow due to I/O load, and there are specific scenarios where it makes sense to remove the default observer and let the management servers make the decisions for resource pool quorum. HOWEVER, it should not be removed (generally speaking) unless the customer experiences this specific issue, which can only be determined via tracelogs. There are tradeoffs to making this process longer. The primary tradeoff is that this increases the chances of duplication of workflows on different management servers, and potentially longer recovery times in the event of a real outage. Another tradeoff is that ALL management servers MUST have the same settings. If any MS gets installed with the default settings, you will have constant resource pool flapping because the communication expectations are different across MSs.

    So no, I don’t recommend making changes to the pool manager registry, unless you have a large environment and you are experiencing resource pool failure far too frequently. And in those cases, we should examine the default observer behavior as well. But saying "never" change it? I disagree.

  2. Kevin Holman says:

    @ Ted T Hacker –

    Great question. Yes, there is. "Command Timeout Seconds" has to do with regular stored procedure calls from a SCOM workflow to the DW, such as maintenance operations and aggregations. "Deployment Command Timeout Seconds" is different: this value has to do with scripts that are called during a major update, such as a version update, service pack, or update rollup. Changing the latter is rarer; however, I have seen issues reported where these scripts got caught up in blocking and took a LONG time to complete, so rather than fail due to a timeout, we had the customer set a very long time to get them to complete. It isn’t a common occurrence and generally I’d only change that one under advisement from support, like you did. All good.

  3. Kevin Holman says:

    Brett – where did you get this 500 group limitation? The product group tested up to 1000 groups when performance testing SCOM 2007. It was recommended not to go over 1000 groups simply because we didn’t test beyond that. However, using groups that don’t roll up health state, and using simple group memberships, heavily affected this scalability concern. Now that SCOM 2012 has a distributed model for config and group population, I have not heard of any limitations such as this, nor have I heard what we test up to. I’d assume likely the same 1000 groups for testing. However, I have customers beyond this and they don’t have any issues with group population.

  4. Ted T Hacker says:

    Is the "Microsoft.SystemCenter.DataWarehouse.CollectEntityHealthStateChange" workflow affected by the "Command Timeout Seconds" value? How can I tell whether the workflow is timing out based on the registry value or is running to completion? Should the workflow always write an event (31572?) whenever it finishes without issue? The monitor for this, named "Data Warehouse Object Health State Data Collection Writer Periodic Data Maintenance Recovery State", has an override for the "Interval". Does changing the monitor override only affect how long the monitor waits for a 31572 before alerting?

    What is the downside to setting "Command Timeout Seconds" to a longer time frame? I suppose at some value you just want to be alerted that it is taking a long time to process lots of data. I assume you don’t want to mask the fact that there may be a monitor state change or collection rule going nuts.

  5. Ted T Hacker says:

    Kevin,

    Is there a difference between the HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Data Warehouse "Command Timeout Seconds" and "Deployment Command Timeout Seconds" values? An MS engineer advised creating the second value during an incident. Do these values conflict with each other or are they complementary? We have "Deployment Command Timeout Seconds" set to 86400 (1 day). At that point we were having problems upgrading from 2012 SP1 to R2. Thanks. Ted.

  6. Anonymous says:

    @JT – you are correct, I do not dictate any changes to gateways because, to date, I have not experienced any changes that are proactively needed on the gateway role, regardless of management group size. They seem to handle things quite well out of the box.

  7. Anonymous says:

    @Jesse – Actually, Bulk Insert Command Timeout was a new registry control available with UR1. It wasn’t added in UR5. I don’t have a recommendation for adjusting this; it simply opened the capability to adjust this if needed. I only recommend changing that one if directed to by Microsoft Support to resolve a problem with bulk inserts to the warehouse, which is a rare condition. I have never worked with a customer who needed this modified from the default.

  8. Diane says:

    Hi, Kevin,
    I have seen a lot of blogs that recommend against setting the PoolManager key. They indicate that this was required when SCOM 2012 was first released but has since been fixed, and that implementing it now can actually degrade performance. Can you confirm this is still required for large environments?

  9. Larry says:

    Hey Kevin,

    Good stuff, as always! 🙂

    It would be of great value if, for each of the above registry settings, associated monitoring instrumentation could be identified in order to help administrators determine whether the corresponding registry setting update should be considered for their environment.

    Examples might include:
    1) evaluating a particular PerfMon counter against a specific threshold, or
    2) the presence of specific event log entries

    One might even wonder if/why this is not already included as monitors in the OpsMgr based MPs…

  10. Zohar says:

    Very helpful, good article. I wonder why Microsoft does not publish an article with those registry settings.

  11. brett says:

    SCOM 2007 recommended not having more than 500 groups. Does SCOM 2012 have a recommended limit?

  13. jesse says:

    UR5 introduces a new registry value: Bulk Insert Command Timeout.
    http://support.microsoft.com/kb/3029227 Do you have any guidance around using this value as well?

  14. JT says:

    You mention that these changes are recommended on management servers but make no mention if they are required on a gateway server. What is Microsoft’s stance on tweaking registry settings on them?

  15. JT says:

    Thanks for the quick response. I didn’t think so either but I wanted to be 100% certain. Thanks again for taking the time to respond to everyone’s questions and keep this site updated. It is appreciated!

  16. Kevin Holman says:

    @Jasper –
    I don’t know offhand. MOST registry changes require a service restart unless there is code to check the registry on an interval or to be notified of a reg change. I doubt we would do this, and my assumption is that a restart of services is required. I’d have to tracelog to be sure.

  17. Igor says:

    Hello. I have 4 VM (8 RAM) and 2 physical (24 RAM) SCOM servers.

    700 Windows agents
    60 Linux agents
    Lots of MPs, including heavy custom ones, like Progress Sonic monitoring
    349 groups
    270 network devices

    Health service on the physical servers: ~3-4 GB

    The virtual servers sometimes say that they are dropping data.

    Should I make a reg edit with those params?

  18. Hi Kevin,

    Thank you for sharing.

    Are these registry keys also applicable to Gateway Servers, or only to Management Servers?

    Marlon

  19. jean says:

    Hi Kevin, Are those keys also valid for a large environment with OpsMgr 2007 R2?
    thank you

  20. Kevin Holman says:

    @Marlon – SOME of these are potential candidates for gateways, but I generally don’t recommend any changes on GWs unless you are specifically experiencing a problem. On management servers, I set these for all my customers, regardless of size.

  21. jean says:

    Thank you!

  22. Thanks Kevin for the confirmation! We are still observing the changes in our SCOM environment.

  23. Anonymous says:

    This is a common practice for rotating old physical servers coming off lease, or when moving VM based

  25. Ashish says:

    Hi Kevin,

    Do these settings require a server restart?

    Thanks
    Ashish

  26. Tommy says:

    I do miss the parameters
    Maximum Global Pending Data Count
    and
    Persistence Version Store Maximum
    and
    Persistence Cache Maximum
    in this blog.

    With regards to the latter;
    In this PDF ( http://download.microsoft.com/download/8/2/8/828C05A2-E6A0-436A-9AE1-704A8005E355/9780735695825.pdf ) they say;

    Another important setting is Persistence Cache Maximum of type DWORD. This setting controls the amount of memory in pages used by Persistence Manager for its data store on the local database. The default value for this is 262144 (decimal), which is also the recommended value. If you are running an older version of Operations Manager on management servers that manage a large number of objects, you should change the value to 262144 (decimal).

    The default and recommended values are the same here. I think that is a mistake, but for now I’ve set it to 262144.

    Suggestions are welcome!