How to detect and troubleshoot frequent configuration changes in Operations Manager 2007

Here’s a new Knowledge Base article we published today on SCOM 2007. This one talks about configuration churn, what can cause it, and how you can address it if you see it in your environment:

=====

Configuration Overview

The System Center Management Configuration service is responsible for calculating the configuration of every health service in the Operations Manager 2007 Management Group. The configuration of a health service consists of the rules, monitors, discoveries and tasks for the health service and for all the instances monitored by the health service. In order to calculate all the required configurations for each health service, the Management Configuration service needs to have a list of all instances of all monitored classes, the hosting relationships between instances, the rules, monitors, discoveries and other workflows assigned to the monitored classes, and the health services responsible for monitoring the instances. In addition, the Management Configuration service also needs to read the membership of all instance groups in the Management Group and apply any overrides for rules and monitors that have been targeted at these groups, classes or individual instances.

Objects in a management group will be defined as instances of monitored classes based on discovery data submitted by discovery workflows. If a key property of an object changes, that object may be added as a new instance of a monitored class, or no longer be considered an instance of that class. As the list of classes the object is determined to be a member of changes, the configuration for the health service that monitors that object will also change as rules, monitors, discoveries, tasks and overrides are added or removed from the previous configuration.

Configuration Churn

If a large amount of discovery data is submitted to the Management Configuration service, or submitted too fast for the Management Configuration service to process before more discovery data is submitted, agents may not be able to receive a stable configuration, as it will always be in the process of being calculated. The frequent submission of discovery data, also known as configuration churn, can cause some health services to run under old configurations, or cause the configuration of management servers to become stale, subsequently causing them to appear gray in the Operations console.

Discovery data is submitted by a health service when a discovery workflow runs. Introduction of a new Management Pack to a Management Group can cause several discovery workflows to run on each agent, and as new instances are discovered, additional discoveries may be run on some agents. Changes to groups, overrides and other workflows can cause discovery workflows to run on agents, and introduction of new agents can also cause the Management Configuration service to update the instance space with the new agent's configuration.

When a discovery workflow is configured to run too often, or the properties discovered by the workflow change each time it is run, the Management Configuration service will be forced to recalculate the health service configuration often. If this happens for many agents, or the Root Management Server (RMS) is already under heavy workload, the Management Configuration service may not be able to keep up with the rate of change and configuration churn may occur.

Identifying Configuration Churn via the RMS event log

The following event in the Operations Manager event log on the RMS indicates that the Management Group configuration has changed due to new discovery data.

Log Name: Operations Manager
Source: OpsMgr Connector
Event ID: 21024
Level: Information
Computer: <RMS Name>
Description: OpsMgr's configuration may be out-of-date for management group <ManagementGroupName>, and has requested updated configuration from the Configuration Service. The current(out-of-date) state cookie is "3A B0 1E 5C 81 F3 12 F5 56 B7 8A EF F8 01 BA 09 86 55 06 48 "

The following event indicates that the Management Configuration service has finished processing the new discovery data and calculated any changes required to the Management Group configuration based on the new data.

Log Name: Operations Manager
Source: OpsMgr Connector
Event ID: 21025
Level: Information
Computer: <RMS Name>
Description: OpsMgr has received new configuration for management group <ManagementGroupName> from the Configuration Service. The new state cookie is "34 FA 11 61 4D B8 03 59 3D 1D 66 B7 83 F3 C0 AA 7A 6F 1A 3B "

In a typical environment, every 21024 event should be followed by a 21025 event. If the discovery data did not cause any configuration data to change, the event ID will be 21026 instead. In a large Management Group, pairs of 21024 and 21025/21026 events should be expected to occur several times per hour. Long strings of 21024 events with no corresponding 21025/21026 event are a sign of configuration churn. In addition, the event log may show the following event, indicating churn has been detected.

Log Name: Operations Manager
Source: OpsMgr Config Service
Event ID: 29202
Level: Warning
Computer: <RMS Name>
Description: OpsMgr Config Service could not retrieve a consistent state from the OpsMgr database due to too frequent database changes. This could be due to a normal and temporary increase of discovery data; however check the most recent changes to determine if this increase is unexpected.
Most recent monitoring object change:
Instance = %1
Class = %2
Modified time = %3
Most recent monitoring relationship change:
Relationship instance = %4
Source instance = %5
Target instance = %6
RelationshipClass = %7
Modified time = %8

The Data Access Layer must read multiple tables when querying for changes. If one of the tables is modified after it is read but before all tables have been read, the Data Access Layer will log the above event and retry. If an entity or relationship instance was read during this time, information about these is included in the event fields; otherwise, these fields are left empty.
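
The pairing rule described earlier in this section can be checked mechanically once the Operations Manager event log on the RMS has been exported. The following Python sketch is one possible way to do that and is not part of the product. It takes event IDs in chronological order and reports any run of 21024 events that was not answered by a 21025 or 21026 event. The function name and the threshold of three unanswered requests are assumptions chosen for illustration; a suitable threshold depends on the environment.

```python
from typing import Iterable, List, Tuple

REQUESTED = 21024            # configuration may be out of date, update requested
COMPLETED = {21025, 21026}   # new configuration received, or no change required


def unanswered_runs(event_ids: Iterable[int], threshold: int = 3) -> List[Tuple[int, int]]:
    """Return (start_index, length) for each run of 21024 events of at least
    `threshold` (an assumed value) that is not broken by a 21025/21026 event.
    start_index is the position of the first 21024 of the run in the input."""
    runs = []
    start, length = None, 0
    for index, event_id in enumerate(event_ids):
        if event_id == REQUESTED:
            if length == 0:
                start = index
            length += 1
        elif event_id in COMPLETED:
            if length >= threshold:
                runs.append((start, length))
            length = 0
        # Any other event ID (for example 29202) does not end a run.
    if length >= threshold:
        runs.append((start, length))
    return runs


# Hypothetical, chronologically ordered event IDs exported from the RMS.
sample = [21024, 21025, 21024, 21024, 21024, 21024, 29202, 21024, 21026, 21024]
print(unanswered_runs(sample))  # [(2, 5)]
```

In the sample, the first 21024/21025 pair is normal, while the five 21024 events that follow are only answered once, which is the pattern to look for.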

Identifying Potential Causes of Configuration Churn via the Operations Manager Datawarehouse

In management groups where the Operations Manager Reporting component has been installed, several SQL queries can be used to identify workflows that are submitting frequent changes. These queries should be run in SQL Server Management Studio against the data warehouse database.

Total Changes Submitted by Discovery Workflows in Last 24 Hours:

select ManagedEntityTypeSystemName, DiscoverySystemName, count(*) As 'Changes'
from (
select distinct
MP.ManagementPackSystemName,
MET.ManagedEntityTypeSystemName,
PropertySystemName,
D.DiscoverySystemName,
D.DiscoveryDefaultName,
MET1.ManagedEntityTypeSystemName As 'TargetTypeSystemName',
MET1.ManagedEntityTypeDefaultName 'TargetTypeDefaultName',
ME.Path,
ME.Name,
C.OldValue,
C.NewValue,
C.ChangeDateTime
from dbo.vManagedEntityPropertyChange C
inner join dbo.vManagedEntity ME on ME.ManagedEntityRowId=C.ManagedEntityRowId
inner join dbo.vManagedEntityTypeProperty METP on METP.PropertyGuid=C.PropertyGuid
inner join dbo.vManagedEntityType MET on MET.ManagedEntityTypeRowId=ME.ManagedEntityTypeRowId
inner join dbo.vManagementPack MP on MP.ManagementPackRowId=MET.ManagementPackRowId
inner join dbo.vManagementPackVersion MPV on MPV.ManagementPackRowId=MP.ManagementPackRowId
left join dbo.vDiscoveryManagementPackVersion DMP on DMP.ManagementPackVersionRowId=MPV.ManagementPackVersionRowId
AND CAST(DefinitionXml.query('data(/Discovery/DiscoveryTypes/DiscoveryClass/@TypeID)') AS nvarchar(max)) like '%'+MET.ManagedEntityTypeSystemName+'%'
left join dbo.vManagedEntityType MET1 on MET1.ManagedEntityTypeRowId=DMP.TargetManagedEntityTypeRowId
left join dbo.vDiscovery D on D.DiscoveryRowId=DMP.DiscoveryRowId
where ChangeDateTime > dateadd(hh,-24,getutcdate())
) As #T
group by ManagedEntityTypeSystemName, DiscoverySystemName
order by count(*) DESC

This query will display three columns. The first column is the class of object at which the workflow is targeted. The second column indicates the internal name of the discovery workflow. The third column indicates the total number of property changes for all instances of this class submitted by the workflow in the last 24 hours. The total number of changes, for all classes, represents the number of times the Management Configuration service must recompute the configuration for an agent health service.

The number of changes for some classes of objects, even in a stable environment, may not ever reach zero. Any change, such as adding or removing a property, agents being added or decommissioned, server roles being added or changed, etc. will be reflected in the numbers returned. In environments where configuration churn is experienced, one or several workflows will likely show a significantly larger value than other workflows.
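
One way to apply this guidance is to compare each workflow's change count with the typical count in the result set. The Python sketch below assumes the rows returned by the query above have been exported as (class, discovery, changes) tuples. The function name, the sample class and discovery names, and the factor of ten times the median are all hypothetical choices for illustration, not thresholds taken from this article.

```python
from statistics import median


def flag_outliers(rows, factor=10):
    """Return the rows whose change count is at least `factor` times the
    median change count. `factor` is an assumed, tunable value."""
    counts = [changes for _, _, changes in rows]
    if not counts:
        return []
    # Guard against a median of zero so quiet environments do not flag everything.
    baseline = max(median(counts), 1)
    return [row for row in rows if row[2] >= factor * baseline]


# Hypothetical result rows: (ManagedEntityTypeSystemName, DiscoverySystemName, Changes)
rows = [
    ("Contoso.ClassA", "Contoso.DiscoveryA", 5),
    ("Contoso.ClassB", "Contoso.DiscoveryB", 8),
    ("Contoso.ClassC", "Contoso.DiscoveryC", 4200),
    ("Contoso.ClassD", "Contoso.DiscoveryD", 3),
]
print(flag_outliers(rows))  # [('Contoso.ClassC', 'Contoso.DiscoveryC', 4200)]
```

A relative comparison is used here because, as noted above, the counts rarely reach zero even in a stable environment, so an absolute cutoff would be hard to choose.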

Properties Changed in the Last 24 Hours:

select distinct
MP.ManagementPackSystemName,
MET.ManagedEntityTypeSystemName,
PropertySystemName,
D.DiscoverySystemName,
D.DiscoveryDefaultName,
MET1.ManagedEntityTypeSystemName As 'TargetTypeSystemName',
MET1.ManagedEntityTypeDefaultName 'TargetTypeDefaultName',
ME.Path,
ME.Name,
C.OldValue,
C.NewValue,
C.ChangeDateTime
from dbo.vManagedEntityPropertyChange C
inner join dbo.vManagedEntity ME on ME.ManagedEntityRowId=C.ManagedEntityRowId
inner join dbo.vManagedEntityTypeProperty METP on METP.PropertyGuid=C.PropertyGuid
inner join dbo.vManagedEntityType MET on MET.ManagedEntityTypeRowId=ME.ManagedEntityTypeRowId
inner join dbo.vManagementPack MP on MP.ManagementPackRowId=MET.ManagementPackRowId
inner join dbo.vManagementPackVersion MPV on MPV.ManagementPackRowId=MP.ManagementPackRowId
left join dbo.vDiscoveryManagementPackVersion DMP on DMP.ManagementPackVersionRowId=MPV.ManagementPackVersionRowId
AND CAST(DefinitionXml.query('data(/Discovery/DiscoveryTypes/DiscoveryClass/@TypeID)') AS nvarchar(max)) like '%'+MET.ManagedEntityTypeSystemName+'%'
left join dbo.vManagedEntityType MET1 on MET1.ManagedEntityTypeRowId=DMP.TargetManagedEntityTypeRowId
left join dbo.vDiscovery D on D.DiscoveryRowId=DMP.DiscoveryRowId
where ChangeDateTime > dateadd(hh,-24,getutcdate())
ORDER BY MP.ManagementPackSystemName, MET.ManagedEntityTypeSystemName

This query can identify which properties have changed in the last 24 hours. Combined with the previous query, this query can show what the old and new values were for the property, which agents submitted the change, the workflow that conducted the discovery, and the management pack it was contained in.
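
To see which properties are volatile, the rows returned by this query can be grouped by instance and property. The following Python sketch assumes the result set has been exported as a list of dictionaries keyed by the query's column names (Path, Name, PropertySystemName). The function name, the sample values, and the minimum of two changes are assumptions made for illustration.

```python
from collections import Counter


def volatile_properties(changes, min_changes=2):
    """Count changes per (Path, Name, PropertySystemName) and return the
    combinations that changed at least `min_changes` times, busiest first."""
    counts = Counter(
        (row["Path"], row["Name"], row["PropertySystemName"]) for row in changes
    )
    return [(key, n) for key, n in counts.most_common() if n >= min_changes]


# Hypothetical sample rows; real rows would come from the query above.
sample = [
    {"Path": "srv01.contoso.com", "Name": "C:", "PropertySystemName": "FreeSpace"},
    {"Path": "srv01.contoso.com", "Name": "C:", "PropertySystemName": "FreeSpace"},
    {"Path": "srv01.contoso.com", "Name": "C:", "PropertySystemName": "FreeSpace"},
    {"Path": "srv02.contoso.com", "Name": "srv02", "PropertySystemName": "IPAddress"},
]
print(volatile_properties(sample))
# [(('srv01.contoso.com', 'C:', 'FreeSpace'), 3)]
```

A property that changes several times on the same instance within 24 hours is a strong hint that the discovery is collecting a value that does not belong in discovery data.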

Reducing Configuration Churn

Older management packs introduced discovery workflows that submitted property changes too frequently. The current versions of most management packs have modified these discovery workflows to submit data less frequently, or to avoid querying volatile properties that change frequently. Serious consideration should be given to upgrading any management pack with workflows that show up frequently in the previous query. New versions of the management pack can be downloaded from the management pack catalog: https://systemcenter.pinpoint.microsoft.com/en-US/applications/search/operations-manager-d11?q=

If a new version of the management pack is not available, or cannot be deployed at the time, the discovery interval can be adjusted via override to run less often. In some situations, the discovery responsible for the configuration churn can be entirely disabled by override. If the discovery is disabled for several weeks, the objects discovered by the workflow may be groomed out of the database, but disabling the discovery can provide a short-term solution to eliminate configuration churn before this occurs. The workflow can also be enabled for short intervals to rediscover the objects prior to them being groomed.

Some of the workflows in these older management packs are highlighted in the following blog:

https://blogs.technet.com/b/kevinholman/archive/2009/10/05/what-is-config-churn.aspx

If the workflow is from a custom discovery that targets a volatile property, such as free disk space, the discovery should be re-written to not target a property that changes often. Discovery workflows should not target instances with a short lifetime (a few weeks or less), nor collect properties of those instances that change often (more than once a month). Rules that collect performance data should be used for volatile data, as that is not considered in calculating configuration.

Additional Performance Tuning

In large management groups (> 1000 agents), the RMS may become very busy with operations that would not normally cause a problem in smaller management groups. In this situation, even a small rate of property changes could cause frequent churn, due to the length of time required to process the changes. There are a number of configuration changes that can be implemented to reduce the operational overhead of the RMS and allow it to process a normal rate of property changes quickly enough to avoid configuration churn. These configuration changes are highlighted in the following blog:

https://blogs.technet.com/b/mgoedtel/archive/2010/08/24/performance-optimizations-for-operations-manager-2007-r2.aspx

Forcing Configuration Change for the Management Group

If configuration churn for the management group is occurring constantly, any changes made to reduce the frequency of, or disable, the problem workflows will never be propagated to agents. In this case, the flow of inbound discovery data will need to be blocked to allow the System Center Management Configuration service to calculate a current configuration with the modified or disabled workflow.

Discovery data is submitted to the OperationsManager database via the System Center Data Access Service (DAS). The data is first submitted to the DAS by the System Center Management service on the RMS. The RMS gets this data from agents or other management servers. Using the Windows Firewall or some other networking means to block inbound connections to the RMS on port 5723 will prevent discovery data from being submitted to the OperationsManager database just long enough for the Management Configuration service to calculate the current configuration for the agents submitting the data.

The System Center Management service and the System Center Data Access Service on the RMS should not be stopped or disabled during this process. The System Center Management Configuration service requires a running and healthy System Center Management service on the RMS in order to complete calculation of the management group configuration. It also requires the System Center Data Access Service to communicate with the database. In addition, some data may become backlogged on the agents and other management servers during this process, so the firewall rule or port block should be lifted as soon as event 21025 is seen in the Operations Manager event log on the RMS, indicating that the Management Configuration service has calculated the new configuration for the management group with the disabled or modified workflows.
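
The block-and-release procedure above can be sketched as follows. This Python example only builds the firewall commands and decides when the block should be lifted; it does not run anything. It assumes the RMS runs Windows Server 2008 or later with Windows Firewall enabled, where the `netsh advfirewall firewall` context is available (older systems use a different syntax), and that port 5723 is reached over TCP. The rule name and function names are hypothetical.

```python
RULE_NAME = "Temporary block OpsMgr 5723 inbound"  # hypothetical rule name


def block_command(port=5723, name=RULE_NAME):
    """Build the netsh command that blocks inbound TCP connections on `port`."""
    return ["netsh", "advfirewall", "firewall", "add", "rule",
            f"name={name}", "dir=in", "action=block",
            "protocol=TCP", f"localport={port}"]


def unblock_command(name=RULE_NAME):
    """Build the netsh command that removes the temporary rule again."""
    return ["netsh", "advfirewall", "firewall", "delete", "rule", f"name={name}"]


def should_unblock(event_ids_since_block):
    """The block should be lifted as soon as event 21025 has been logged."""
    return 21025 in event_ids_since_block


print(" ".join(block_command()))
print(should_unblock([21024, 21024]))   # False: configuration not yet calculated
print(should_unblock([21024, 21025]))   # True: lift the block now
print(" ".join(unblock_command()))
```

After reviewing the printed commands, they could be run from an elevated prompt on the RMS, for example by passing the lists to `subprocess.run`. Keeping the rule under a distinct name makes it easy to remove exactly that rule once event 21025 appears.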

Identifying Potential Causes of Configuration Churn via Operations Manager Reporting

New reports were introduced with version 6.1.7599.0 of the Operations Manager 2007 R2 Management Pack. These reports provide insight into the overall volume of data being processed by the management group. These reports can be used to establish a standard baseline and to identify opportunities for tuning object discovery workflows. Once configuration churn has been identified and addressed, these reports can be used for long-term planning to prevent recurrences of churn.

The management pack can be downloaded from here: https://www.microsoft.com/download/en/details.aspx?displaylang=en&id=23081

  • Data Volume by Management Pack report
    The Data Volume by Management Pack report compiles information on the volume of data generated by management packs. The report lists the number of occurrences per management pack for the following data types:
    • Discoveries
    • Alerts
    • Performance (number of instances submitted for performance counters collected by management pack)
    • Events
    • State changes
  • Data Volume by Workflow and Instance report
    The Data Volume by Workflow and Instance report compiles information on the volume of data generated, broken down by workflows (discoveries, rules, monitors, etc.) as well as by instances.
    There are two ways to access this report:
    • In the Data Volume by Management Pack report, click one of the counts cells in the table at the top of the report to open the Data Volume by Workflow and Instance report for the management packs.
    • Run the report directly from the Reporting section in the Operations console. If you run the Data Volume by Workflow and Instance report directly, you should set the parameters of the report to customize the results; this report is designed to provide details for information in the Data Volume by Management Pack report and so the default parameter settings may not provide the information you are looking for.

=====

For the most current version of this article please see the following:

2603913 : How to detect and troubleshoot frequent configuration changes in Operations Manager 2007

J.C. Hornbeck | System Center Knowledge Engineer

