We had an issue in SCOM 2012 RC – where this was recommended as the fix…. then once RTM shipped this was no longer needed to be adjusted. Then the blogs started posting NEVER to change this and it was not recommended…. which is absolutely true for 98% of
the deployments out there.
In general, you should never change these settings unless advised to by Microsoft support, or without fully understanding the ramifications. However, this IS a valid setting, and IS recommended in VERY SPECIFIC cases where the default settings are not long
enough and the result is resource pool suicide. If you aren’t experiencing this problem, then in general it doesn’t need to be changed.
You can have the same conversation about the default observer, which is the database. In large environments, it is possible the default observer will be very slow due to I/O load, and there are specific scenarios where it makes sense to remove the default observer
and let the management servers make the decisions for resource pool quorum. HOWEVER, it should not be removed (generally speaking) unless the customer experiences this specific issue which can only be determined via tracelogs. There are tradeoffs to making
this process longer. The primary tradeoff is this increases the chances of duplication of workflows on different management servers, and potentially longer recovery times in the event of a real outage. Another tradeoff, is that ALL management servers MUST
have the same settings. If any MS gets installed with the default settings, you will have constant resource pool flapping because the communication expectations are different across MS’s.
So no – I don’t recommend making changes to the pool manager registry, unless you have a large environment, and you are experiencing resource pool failure far too frequently. And in those cases, we should examine the default observer behavior as well. But saying
"never" change it? I disagree.
Great question. Yes, there is. "Command Timeout Seconds" has to do with regular stored procedure calls from a SCOM workflow to the DW. Such at maintenance operations/aggregations. "Deployment Command Timeout Seconds" is different – this value has to do with
scripts that are called during a major update, such as a version update, service pack, or update rollup. Changing the latter is more rare, however I have seen issues reported where these scripts got caught up blocking, and took a LONG time to complete, so
rather than fail due to a timeout – we had the customer set a very long time to get them to complete. It isn’t a common occurrence and generally I’d only change that one under advisement from support, like you did. All good.
Brett – where did you get this 500 group limitation? The product group tested up to 1000 groups when performance testing SCOM 2007. It was recommended not to go over 1000 groups simply because we didn’t test beyond that. However, using groups that don’t
rollup health state, and using simple group memberships heavily affected this scalability concern. Now that SCOM 2012 has a distributed model for config and group population, I have not heard any limitations such as this, nor have I heard what we test up to,
I’d assume likely the same 1000 groups for testing. However, I have customers beyond this and they don’t have any issues with group population.
Is the "Microsoft.SystemCenter.DataWarehouse.CollectEntityHealthStateChange" workflow affected by the "Command Timeout Seconds" value? How can I tell if the workflow is timing out based on the registry value or is running to completion. Should the workflow
always write an event (31572?) whenever it finished without issue? The monitor for this named "Data Warehouse Object Health State Data Collection Writer Periodic Data Maintenance Recovery State" has an override for the "Interval". Does changing the monitor
override only affect how long the monitor waits for a 31572 before alerting?
What is the downside to setting the "Command Timeout Seconds" to a longer time frame? I suppose at some value you just want to be alerted that it is taking a long time to process lots of data. I assume you don’t want to mask the fact that there may be a monitor
state change or collection rule going nuts.
Is there a difference between the HKLMSOFTWAREMicrosoftMicrosoft Operations Manager3.0Data Warehouse
"Command Timeout Seconds"
and "Deployment Command Timeout Seconds" values? A MS Engineer during a incident advised creating the second value. Do these values conflict with each other or are they complimentary? We have the "Deployment Command Timeout Seconds" set to 86400 (1 day). At
that point we were having problems upgrading from 2012 SP1 to R2. Thanks. Ted.
@JT – you are correct, I do not dictate any changes to gateways because to date, I have no experienced any changed that are proactively needed on the gateway role, regardless of management group size. They seem to handle things quite well out of the box.
@Jesse – Actually – Bulk Insert Command Timeout was a new registry control available with UR1. It wasn’t added in UR5. I don’t have a recommendation for adjusting this – it simply opened the capability to adjust this if needed. I only recommend changing
that one if directed to by Microsoft Support to resolve a problem with bulk inserts to the warehouse, which is a rare condition. I have never worked with a customer who needed this modified from the default.
Hi, Kevin,
I have seen a lot of blogs that recommend against setting the PoolManager key. They indicate that this was required when SCOM 2012 was first released but has since been fixed – implementing it now can actually degrade performance. Can you confirm this is still
required for large environments?
It would be of great value if, for each of the above registry settings, associated monitoring instrumentation could be identified in order to help administrators determine whether the corresponding registry setting update should be considered for their environment.
Examples might include:
1) evaluating a particular PerfMon counter against a specific threshold, or
2) the presence of specific event log entries
One might even wonder if/why this is not already included as monitors in the OpsMgr based MPs…
UR5 introduces a new registry value: Bulk Insert Command Timeout. http://support.microsoft.com/kb/3029227 Do you have any guidance around using this value as well?
You mention that these changes are recommended on management servers but make no mention if they are required on a gateway server. What is Microsoft’s stance on tweaking registry settings on them?
Thanks for the quick response. I didn’t think so either but I wanted to be 100% certain. Thanks again for taking the time to respond to everyone’s questions and keep this site updated. It is appreciated!
@Jasper –
I don’t know offhand. MOST registry changes require a service restart unless there is code to check the registry on an interval or to be notified of a reg change. I doubt we would do this and my assumption is that a restart of services is required. I’d have
to tracelog to be sure.
@Marlon – SOME of these are potential candidates for gateways, but I generally don’t recommend any change3s on GW’s unless you are specifically experiencing a problem. On management servers – I set these on all my customers, regardless of size.
Another important setting is Persistence Cache Maximum of type DWORD. This setting controls the amount of memory in pages used by Persistence Manager for its data store on the local database. The default value for this is 262144 (decimal), which is also the recommended value. If you are running an older version of Operations Manager on management servers that manage a large number of objects, you should change the value to 262144 (decimal).
The default and recommended values are the same here, i think that is a mistake, but for now I’ve set it to 262144.
When I restart the healthservice from one management server, the All Management Server Pool Unavailable alert is raised and my Noc dashboard created by widged changes his state to gray. Then I wait for 15 minutes to go back to normal. I think this is a pool process version related. Is there any configuration to increase the value of pool unavailable?
3 MS should have no issue with keeping pool available when one MS is stopped. It sounds like you have some pool members not available at the time you stop it – which could cause a pool suicide. Look for 15000-15003 events in the event log and see that the pool is stable first.
I have an environment where I am regularly seeing Resource Pools fail, from the SCOM console itself we see a heartbeat failure for the Pool and going by Event Logs I see that members are not acknowledging the lease request and are unloading all workflows.
As the environment isn’t in live use I’ve tested applying the Pool Manager fixes mentioned in the blog and this appears to resolve or heavily reduce the issue. I was wondering whether you could provide more information around what the settings grant more time for the pool to complete, as the only time related recomendation I remember seeing in the official documentation is ~10ms latency between servers.
In the case of my Resource Pools failing, is it almost certainly going to come down to workflows taking too long to complete despite apparant low load on the servers, or could it also be network issues causing the Pool to fail?
As a side note, in the case of a pool being in a failed state would you expect SNMP traps (sent to all servers in a pool) to still get processed and alert within SCOM? The Pools in question are only used for SNMP monitoring, so if traps are still processed and it is just SNMP polling being impacted I’d be less concerned!
I *never* recommend editing the registry for pool timeouts – because they often just mask the real problem and are almost never a good solution. We have also seen issues where they create pool instability by changing them.
The most common issue with pool stability is network latency between management servers or MS to Database. The second most common issue is overloaded management groups, where we see blocking in the SQL database, or just too many objects hosted by the pools on the management servers.
We had updated the registry of all our management servers and no issues encountered for 2 years. Recently, we are going to create a new SCOM Group via console and we are experiencing long time loading and sometimes not loading at all.
Can you give us insights what setting to we need to check or need to adjust? As per PS command, we have a total of 1442.
Hi Diane –
We had an issue in SCOM 2012 RC – where this was recommended as the fix…. then once RTM shipped this was no longer needed to be adjusted. Then the blogs started posting NEVER to change this and it was not recommended…. which is absolutely true for 98% of
the deployments out there.
In general, you should never change these settings unless advised to by Microsoft support, or without fully understanding the ramifications. However, this IS a valid setting, and IS recommended in VERY SPECIFIC cases where the default settings are not long
enough and the result is resource pool suicide. If you aren’t experiencing this problem, then in general it doesn’t need to be changed.
You can have the same conversation about the default observer, which is the database. In large environments, it is possible the default observer will be very slow due to I/O load, and there are specific scenarios where it makes sense to remove the default observer
and let the management servers make the decisions for resource pool quorum. HOWEVER, it should not be removed (generally speaking) unless the customer experiences this specific issue which can only be determined via tracelogs. There are tradeoffs to making
this process longer. The primary tradeoff is this increases the chances of duplication of workflows on different management servers, and potentially longer recovery times in the event of a real outage. Another tradeoff, is that ALL management servers MUST
have the same settings. If any MS gets installed with the default settings, you will have constant resource pool flapping because the communication expectations are different across MS’s.
So no – I don’t recommend making changes to the pool manager registry, unless you have a large environment, and you are experiencing resource pool failure far too frequently. And in those cases, we should examine the default observer behavior as well. But saying
"never" change it? I disagree.
@ Ted T Hacker –
Great question. Yes, there is. "Command Timeout Seconds" has to do with regular stored procedure calls from a SCOM workflow to the DW. Such at maintenance operations/aggregations. "Deployment Command Timeout Seconds" is different – this value has to do with
scripts that are called during a major update, such as a version update, service pack, or update rollup. Changing the latter is more rare, however I have seen issues reported where these scripts got caught up blocking, and took a LONG time to complete, so
rather than fail due to a timeout – we had the customer set a very long time to get them to complete. It isn’t a common occurrence and generally I’d only change that one under advisement from support, like you did. All good.
Brett – where did you get this 500 group limitation? The product group tested up to 1000 groups when performance testing SCOM 2007. It was recommended not to go over 1000 groups simply because we didn’t test beyond that. However, using groups that don’t
rollup health state, and using simple group memberships heavily affected this scalability concern. Now that SCOM 2012 has a distributed model for config and group population, I have not heard any limitations such as this, nor have I heard what we test up to,
I’d assume likely the same 1000 groups for testing. However, I have customers beyond this and they don’t have any issues with group population.
Is the "Microsoft.SystemCenter.DataWarehouse.CollectEntityHealthStateChange" workflow affected by the "Command Timeout Seconds" value? How can I tell if the workflow is timing out based on the registry value or is running to completion. Should the workflow
always write an event (31572?) whenever it finished without issue? The monitor for this named "Data Warehouse Object Health State Data Collection Writer Periodic Data Maintenance Recovery State" has an override for the "Interval". Does changing the monitor
override only affect how long the monitor waits for a 31572 before alerting?
What is the downside to setting the "Command Timeout Seconds" to a longer time frame? I suppose at some value you just want to be alerted that it is taking a long time to process lots of data. I assume you don’t want to mask the fact that there may be a monitor
state change or collection rule going nuts.
Kevin,
Is there a difference between the HKLMSOFTWAREMicrosoftMicrosoft Operations Manager3.0Data Warehouse
"Command Timeout Seconds"
and "Deployment Command Timeout Seconds" values? A MS Engineer during a incident advised creating the second value. Do these values conflict with each other or are they complimentary? We have the "Deployment Command Timeout Seconds" set to 86400 (1 day). At
that point we were having problems upgrading from 2012 SP1 to R2. Thanks. Ted.
@JT – you are correct, I do not dictate any changes to gateways because to date, I have no experienced any changed that are proactively needed on the gateway role, regardless of management group size. They seem to handle things quite well out of the box.
@Jesse – Actually – Bulk Insert Command Timeout was a new registry control available with UR1. It wasn’t added in UR5. I don’t have a recommendation for adjusting this – it simply opened the capability to adjust this if needed. I only recommend changing
that one if directed to by Microsoft Support to resolve a problem with bulk inserts to the warehouse, which is a rare condition. I have never worked with a customer who needed this modified from the default.
Hi, Kevin,
I have seen a lot of blogs that recommend against setting the PoolManager key. They indicate that this was required when SCOM 2012 was first released but has since been fixed – implementing it now can actually degrade performance. Can you confirm this is still
required for large environments?
Hey Kevin,
Good stuff, as always! 🙂
It would be of great value if, for each of the above registry settings, associated monitoring instrumentation could be identified in order to help administrators determine whether the corresponding registry setting update should be considered for their environment.
Examples might include:
1) evaluating a particular PerfMon counter against a specific threshold, or
2) the presence of specific event log entries
One might even wonder if/why this is not already included as monitors in the OpsMgr based MPs…
Very helpful , good article. I wonder why Microsoft do not publish an article with those registry settings.
SCOM 2007 recommended not having more than 500 groups. Does SCOM 2012 have a recommended limit?
SCOM 2007 recommended not having more than 500 groups. Does SCOM 2012 have a recommended limit?
UR5 introduces a new registry value: Bulk Insert Command Timeout.
http://support.microsoft.com/kb/3029227 Do you have any guidance around using this value as well?
You mention that these changes are recommended on management servers but make no mention if they are required on a gateway server. What is Microsoft’s stance on tweaking registry settings on them?
Thanks for the quick response. I didn’t think so either but I wanted to be 100% certain. Thanks again for taking the time to respond to everyone’s questions and keep this site updated. It is appreciated!
@Jasper –
I don’t know offhand. MOST registry changes require a service restart unless there is code to check the registry on an interval or to be notified of a reg change. I doubt we would do this and my assumption is that a restart of services is required. I’d have
to tracelog to be sure.
Hallo. I have 4vm (8ram) and 2 phys (24ram) scom servers.
700 win agents
60 linux agents.
lot of MP`s. also custom heavy ones, like progess sonic monitoring
349 groups
270 network devices
health service on phys server ~3-4 gb
virtual servers sometimes says, that they dropping data..
should i made s reg edit with those params?
Hi Kevin,
Thank you for sharing.
Are these registry keys also applicable to Gateways Servers or just for Management Servers only?
Marlon
Hi Kevin, Are those keys also valid for a large environment with OpsMgr 2007 R2?
thank you
@Marlon – SOME of these are potential candidates for gateways, but I generally don’t recommend any change3s on GW’s unless you are specifically experiencing a problem. On management servers – I set these on all my customers, regardless of size.
@Jean – no – not all these keys are valid for SOCM 2012R2 – as some don’t apply and some have moved.
For 2007R2 I will defer to Matt:
http://blogs.technet.com/b/mgoedtel/archive/2010/08/24/performance-optimizations-for-operations-manager-2007-r2.aspx and I also strongly recommend
http://blogs.technet.com/b/kevinholman/archive/2011/02/07/a-new-feature-in-r2-cu4-reconnecting-to-sql-server-after-a-sql-outage.aspx
Thank you!
Thanks Kevin for the confirmation! We are still observing the changes in our SCOM environment.
This is a common practice for rotating old physical servers coming off lease, or when moving VM based
Hi Kevin,
Does these settings require server restart?
T
Hi Kevin,
Does these settings require server restart?
Thanks
Ashish
I do miss the parameters
Maximum Global Pending Data Count
and
Persistence Version Store Maximum
and
Persistence Cache Maximum
in this blog.
With regards to the latter;
In this PDF ( http://download.microsoft.com/download/8/2/8/828C05A2-E6A0-436A-9AE1-704A8005E355/9780735695825.pdf ) they say;
Another important setting is Persistence Cache Maximum of type DWORD. This setting controls the amount of memory in pages used by Persistence Manager for its data store on the local database. The default value for this is 262144 (decimal), which is also the recommended value. If you are running an older version of Operations Manager on management servers that manage a large number of objects, you should change the value to 262144 (decimal).
The default and recommended values are the same here, i think that is a mistake, but for now I’ve set it to 262144.
Suggestions are welcome!
I love this article, deep explantion followed by a exec summary, followed by a “ok, i realise your lazy so…” the actual commands. Genius
Kevin,
When I restart the healthservice from one management server, the All Management Server Pool Unavailable alert is raised and my Noc dashboard created by widged changes his state to gray. Then I wait for 15 minutes to go back to normal. I think this is a pool process version related. Is there any configuration to increase the value of pool unavailable?
How many management servers do you have?
Kevin,
I have 3 management servers in the pool
3 MS should have no issue with keeping pool available when one MS is stopped. It sounds like you have some pool members not available at the time you stop it – which could cause a pool suicide. Look for 15000-15003 events in the event log and see that the pool is stable first.
Hi Kevin,
Does the same recommandations apply for SCOM 2016?
Hi Kevin,
I have an environment where I am regularly seeing Resource Pools fail, from the SCOM console itself we see a heartbeat failure for the Pool and going by Event Logs I see that members are not acknowledging the lease request and are unloading all workflows.
As the environment isn’t in live use I’ve tested applying the Pool Manager fixes mentioned in the blog and this appears to resolve or heavily reduce the issue. I was wondering whether you could provide more information around what the settings grant more time for the pool to complete, as the only time related recomendation I remember seeing in the official documentation is ~10ms latency between servers.
In the case of my Resource Pools failing, is it almost certainly going to come down to workflows taking too long to complete despite apparant low load on the servers, or could it also be network issues causing the Pool to fail?
As a side note, in the case of a pool being in a failed state would you expect SNMP traps (sent to all servers in a pool) to still get processed and alert within SCOM? The Pools in question are only used for SNMP monitoring, so if traps are still processed and it is just SNMP polling being impacted I’d be less concerned!
Many thanks.
James
I *never* recommend editing the registry for pool timeouts – because they often just mask the real problem and are almost never a good solution. We have also seen issues where they create pool instability by changing them.
The most common issue with pool stability is network latency between management servers or MS to Database. The second most common issue is overloaded management groups, where we see blocking in the SQL database, or just too many objects hosted by the pools on the management servers.
Hi Kevin –
We had updated the registry of all our management servers and no issues encountered for 2 years. Recently, we are going to create a new SCOM Group via console and we are experiencing long time loading and sometimes not loading at all.
Can you give us insights what setting to we need to check or need to adjust? As per PS command, we have a total of 1442.
Thanks!