How to troubleshoot Operations Manager DB performance problems

FIXIn System Center 2012 Operations Manager (OpsMgr 2012 or OpsMgr 2012 R2), one of the more common performance concerns surrounds Operations Manager Database and Data Warehouse insertion times. Here are some tips to help identify and troubleshoot problems concerning Database and Data Warehouse data insertion.

A typical sign of Operations Manager DB performance problems is the presence of Event ID 2115 events in the Operations Manager Event log. These events typically indicate that performance issues exist on the Management Server or the Microsoft SQL Server that is hosting the OperationsManager or OperationsManager Data Warehouse databases.

First a little background

Database and Data Warehouse write action workflows run on the Management Servers and these workflows first retain the data received from the Agents and Gateway Servers in an internal buffer. They then gather this data from the internal buffer and insert it into the Database and Data Warehouse. When the first data insertion has completed, the workflows will then create another batch.

The size of each batch of data depends on how much data is available in the buffer when the batch is created, however there is a maximum limit on the size of the data batch of up to 5000 data items. If the data item incoming rate increases, or the data item insertion throughput to the Operation Manager and Data Warehouse databases throughput is reduced, the buffer will then accumulate more data and the batch size will grow larger.

There are several write action workflows that run on a Management Server. These workflows handle data insertion to the Operations Manager and Data Warehouse databases for different data types. For example:

  • Microsoft.SystemCenter.DataWarehouse.CollectEntityHealthStateChange
  • Microsoft.SystemCenter.DataWarehouse.CollectPerformanceData
  • Microsoft.SystemCenter.DataWarehouse.CollectEventData
  • Microsoft.SystemCenter.CollectAlerts
  • Microsoft.SystemCenter.CollectEntityState
  • Microsoft.SystemCenter.CollectPublishedEntityState
  • Microsoft.SystemCenter.CollectDiscoveryData
  • Microsoft.SystemCenter.CollectSignatureData
  • Microsoft.SystemCenter.CollectEventData

When a Database or Data Warehouse write action workflow on a Management Server experiences slow data batch insertion (e.g. times in excess of 60 seconds), it will begin logging Event ID 2115 to the Operations Manager Event log. This event is logged every one minute until the data batch is inserted into the Database or Data Warehouse, or the data is dropped by the write action workflow module. As a result, Event ID 2115 will be logged due to the latency inserting data into the Database or Data Warehouse. Below is an example Event logged due to data dropped by the write action workflow module:

Event Type: Error
Event Source: HealthService

Event Category: None

Event ID: 4506

Computer: <RMS NAME>

Description:

Data was dropped due to too much outstanding data in rule "Microsoft.SystemCenter.OperationalDataReporting.SubmitOperationalDataFailed.Alert" running for instance <RMS NAME> with id:"{F56EB161-4ABE-5BC7-610F-4365524F294E}" in management group <MANAGEMENT GROUP NAME>.

Taking a deeper look

Event ID 2115 contains 2 significant pieces of information. First, the name of the workflow that is experiencing the problem, and second, the elapsed time since the workflow began inserting the last batch of data. For example:

Log Name: Operations Manager
Source: HealthService

Event ID: 2115

Level: Warning

Computer: <RMS NAME>

Description:

A Bind Data Source in Management Group <MANGEMENT GROUP NAME> has posted items to the workflow, but has not received a response in 300 seconds. This indicates a performance or functional problem with the workflow.

Workflow Id : Microsoft.SystemCenter.CollectPublishedEntityState

Instance : <RMS NAME>

Instance Id : {88676CDF-E284-7838-AC70-E898DA1720CB}

This particular Event ID 2115 message indicates that the workflow Microsoft.SystemCenter.CollectPublishedEntityState, which writes Entity State data to the Operations Manager database, is trying to insert a batch of Entity State data and it started 300 seconds ago. In this example, the insertion of the Entity State data has not yet finished. Normally inserting a batch of data should complete within 60 seconds.

If the Workflow ID contains Data Warehouse then the problem concerns the Operations Manager Data Warehouse. Otherwise, the problem concerns inserting data into the Operations Manager database!

Common causes

Insertion Problems
As the description of Event ID 2115 states, this may indicate a database performance problem or too much data incoming from the agents. Event ID 2115 simply indicates there is a backlog inserting data into the Database (either Operations Manager or Operations Manager Data Warehouse). These events can originate from a number of possible causes. For example, there could be a sudden, large amount of Discovery data, there could be a database connectivity issue, the database may be full, or possibly there is a disk or network related constraint.

In Operations Manager, discovery data insertion is a relatively labor intensive process. It can also come in bursts, where there is a significant amount of data being received by the Management Server. These bursts can cause temporary instances of Event IDs 2115, but if Event ID 2115 consistently appears for discovery data collection this can indicate either a Database or Data Warehouse insertion problem or Discovery rules in a Management Pack collecting too much discovery data.

Operations Manager configuration updates caused by Instance Space changes or Management Pack imports have a direct effect on CPU utilization on the Database Server and this can impact Database insertion times. Following a Management Pack import or a large instance space change, it is expected to see Event ID 2115 messages. For more information on this topic please see the following:

2603913 – How to detect and troubleshoot frequent configuration changes in Operations Manager (http://support.microsoft.com/kb/2603913)

In Operations Manager, expensive user interface queries can also impact resource utilization on the database which can lead to latency in Database insertion times. When a user is performing an expensive User Interface operation it is possible to see Event ID 2115 messages logged.

Databases full or offline

If the Operations Manager or Operations Manager Data Warehouse databases are out of space or offline, it is expected that the Management Server will continue to log Event ID 2115 messages to the Operations Manager Event log and the pending time will grow higher. 

If the write action workflows cannot connect to the Operations Manager or Operations Manager Data Warehouse databases, or they are using invalid credentials to establish their connection, the data insertion will be blocked and Event ID 2115 messages will be logged accordingly until this situation is resolved.  

Configuration and environmental issues

Event ID 2115 messages can also indicate a performance problem if the Operations Manager Database and Data Warehouse databases as well as the supporting environment are not properly configured. Some possible causes include the following:

  • The SQL Log or TempDB database is too small or out of space.
  • The Network link from the Operations Manager and Data Warehouse database to the Management Server is bandwidth constrained or the latency is large. In this scenario we recommend to Management Server to be on the same LAN as the Operations Manager and Data Warehouse server.
  • The data disk hosting the Database, logs or TempDB used by the Operations Manager and Data Warehouse databases is slow or experiencing a function problem. In this scenario we recommend leveraging RAID 10 and we also recommend enabling battery backed Write Cache on the Array Controller.
  • The Operations Manager Database or Data Warehouse server does not have sufficient memory or CPU resources.
  • The SQL Server instance hosting the Operations Manager Database or Data Warehouse is offline.

It is also recommend that the Management Server reside on the same LAN as the Operations Manager and Data Warehouse database server.

Event ID 2115 messages can also occur if the disk subsystem hosting the Database, logs or TempDB used by the Operations Manager and Data Warehouse databases is slow or experiencing a functional problem. We recommend leveraging RAID 10 and we also recommend enabling battery backed Write Cache on the Array Controller.

Troubleshooting

The first step in troubleshooting the performance related Event ID 2115 messages is to identify what data items are being returned within the event. For example, the Workflow ID indicates what type of data items (e.g. discovery, alerts, event, perf) as well as what database is involved. If the Workflow ID contains the term DataWarehouse, then the focus should be towards the Operations Manager Data Warehouse. In other instances the focus would be towards the Operations Manager Database.

Scenario 1

In the example below, the problem concerns the workflow Microsoft.SystemCenter.CollectSignatureData.

Event Type: Warning
Event Source: HealthService
Event Category: None
Event ID: 2115
Computer: <RMS NAME>
Description:
A Bind Data Source in Management Group <MANGEMENT GROUP NAME> has posted items to the workflow, but has not received a response in 300 seconds. This indicates a performance or functional problem with the workflow.
Workflow Id : Microsoft.SystemCenter.CollectSignatureData
Instance : <RMS NAME>
Instance Id : {F56EB161-4ABE-5BC7-610F-4365524F294E}

Resolution

We can identify the Performance Signature Data Collection Rules in this example by executing the following SQL query. This query should be executed in SQL Management Studio against the Operations Manager database.

— Return all Performance Signature Collection Rules
Use OperationsManager

select

managementpack.mpname,
rules.rulename
from performancesignature with (nolock)

inner join rules with (nolock)

on rules.ruleid = performancesignature.learningruleid

inner join managementpack with(nolock)

on rules.managementpackid = managementpack.managementpackid

group by managementpack.mpname, rules.rulename

order by managementpack.mpname, rules.rulename

This query will return all Performance Signature Collection Rules and their respective Management Pack name. A column is returned for Management Pack name and Rule name.

The following Performance Monitor Counters on a Management Server will provide information concerning Database and Data Warehouse write action insertion batch size and time. If the batch size is growing larger, for example the default batch size is 5000 items, this indicates either the Management Server is slow inserting the data to the Database or Data Warehouse, or is receiving a burst of Data Items from the Agents or Gateway Servers.

· OpsMgr DB Write Action Modules(*)\Avg. Batch Size
· OpsMgr DB Write Action Modules(*)\Avg. Processing Time
· OpsMgr DW Writer Module(*)\Avg. Batch Processing Time, ms
· OpsMgr DW Writer Module(*)\Avg. Batch Size

From the Database and Data Warehouse write action account Average Processing Time counter, we can understand how long it takes on average to write a batch of data to the Database and Data Warehouse. Depending upon the amount of time it takes to write a batch of data to the Database, this may present an opportunity for tuning.

Scenario 2

If the SQL Server instance hosting the Operations Manager Database or Data Warehouse is offline, Event ID 2115 as well as Event ID 29200 will appear within the Operations Manager Event log. For example:

Log Name: Operations Manager
Source: HealthService

Date:

Event ID: 2115
Level: Warning

Description:

A Bind Data Source in Management Group MSFT has posted items to the workflow, but has not received a response in 60 seconds. This indicates a performance or functional problem with the workflow.
Workflow Id : Microsoft.SystemCenter.CollectEventData
Instance : name.contoso.local

Instance Id : {88676CDF-E284-7838-AC70-E898DA1720CB}

=====

Log Name: Operations Manager
Source: OpsMgr Config Service

Event ID: 29200

Level: Error

Description:

OpsMgr Config Service has lost connectivity to the OpsMgr database, therefore it can not get any updates from the database. This may be a temporary issue that may be recovered from automatically. If the problem persists, it usually indicates a problem with the database. Reason:
A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 – Could not open a connection to SQL Server)

Resolution

To resolve the issue in this scenario, complete the following:

  1. Connect to the server hosting the Operations Manager database.
  2. Open the Services applet.
  3. Verify that the SQL Server (MSSQLSERVER) service is started and running.
  4. If the SQL Server (MSSQLSERVER) service is not started and running, start the service.

Once database connectivity is restored workflows should resume successfully storing data within the respective database. Event ID 31554 validates that the information has been written successfully:

Log Name: Operations Manager
Source: Health Service Modules

Event ID: 31554

Task Category: Data Warehouse

Level: Information

Description:

Workflow succeeded storing data in the Data Warehouse

One or more workflows were affected by this.
Workflow name: Microsoft.SystemCenter.DataWarehouse.CollectEventData
Instance name: name.contoso.local
Instance ID: {88676CDF-E284-7838-AC70-E898DA1720CB}

Scenario 3

Event ID 2115 caused by invalid RunAs credentials.

Resolution

Examine the Operations Manager Event log for the following events. These events typically indicate that the Data Warehouse SQL Server Authentication Account may have incorrect credentials.

Log Name: Operations Manager
Source: HealthService

Event ID: 7000

Task Category: Health Service

Level: Error

Description:

The Health Service could not log on the RunAs account <ACCOUNT NAME> for management group <MANGEMENT GROUP NAME>. The error is Logon failure: unknown user name or bad password.(1326L). This will prevent the health service from monitoring or performing actions using this RunAs account

=====

Log Name: Operations Manager
Source: HealthService

Event ID: 7015

Task Category: Health Service

Level: Error

Description:

The Health Service cannot verify the future validity of the RunAs account <ACCOUNT NAME> for management group <MANGEMENT GROUP NAME>. The error is Logon failure: unknown user name or bad password.(1326L).


To resolve the issue in this scenario, complete the following:

1. Open the Operations Manager console.
2. Select Administration.
3. Select Run As Configuration\Accounts.
4. Configure the appropriate credentials for the Data Warehouse SQL Server Authentication Account.

Scenario 4

Event ID 2115 caused by disk performance issues. This can occur when the data disk hosting the database, logs or TempDB used by the Operations Manager and Data Warehouse databases is slow or experiencing a problem. In this scenario we recommend leveraging RAID 10 and we also recommend enabling battery backed Write Cache on the Array Controller.

Resolution

First, capture these Physical Disk counters for all drives that contain SQL data or log files:

  • % Idle Time: How much disk idle time is being reported. Anything below 50 percent could indicate a disk bottleneck.
  • Avg. Disk Queue Length: This value should not exceed 2 times the number of spindles on a LUN. For example, if a LUN has 25 spindles, a value of 50 is acceptable. However, if a LUN has 10 spindles, a value of 25 is too high. You could use the following formulas based on the RAID level and number of disks in the RAID configuration:
    • RAID 0: All of the disks are doing work in a RAID 0 set
    • Average Disk Queue Length <= # (Disks in the array) *2
    • RAID 1: half the disks are “doing work”; therefore, only half of them can be counted toward Disks Queue
    • Average Disk Queue Length <= # (Disks in the array/2) *2
    • RAID 10: half the disks are “doing work”; therefore, only half of them can be counted toward Disks Queue
    • Average Disk Queue Length <= # (Disks in the array/2) *2
    • RAID 5: All of the disks are doing work in a RAID 5 set
    • Average Disk Queue Length <= # (Disks in the array/2) *2
    • Avg. Disk sec/Transfer: The number of seconds it takes to complete one disk I/O
    • Avg. Disk sec/Read: The average time, in seconds, of a read of data from the disk
    • Avg. Disk sec/Write: The average time, in seconds, of a write of data to the disk
    • Disk Bytes/sec: The number of bytes being transferred to or from the disk per second
    • Disk Transfers/sec: The number of input and output operations per second (IOPS)

NOTE The last three counters in this list should consistently have values of approximately .020 (20 ms) or lower and should never exceed.050 (50 ms). The following are the thresholds that are documented in the SQL Server performance troubleshooting guide:

    • Less than 10 ms: Very good
    • Between 10 – 20 ms: Good
    • Between 20 – 50 ms: Slow, needs attention
    • Greater than 50 ms: Serious I/O bottleneck
    • Disk Bytes/sec: The number of bytes being transferred to or from the disk per second
    • Disk Transfers/sec: The number of input and output operations per second (IOPS)

When % Idle Time is low (10 percent or less), this means that the disk is fully utilized. In this case, the last two counters in this list (Disk Bytes/sec and Disk Transfers/sec) provide a good indication of the maximum throughput of the drive in bytes and in IOPS, respectively. The throughput of a SAN drive is highly variable, depending on the number of spindles, the speed of the drives and the speed of the channel. The best bet is to check with the SAN vendor to find out how many bytes and IOPS the drive should support. If % Idle Time is low, and the values for these two counters do not meet the expected throughput of the drive, engage the SAN vendor for further troubleshooting assistance.

The following links provide deeper insight into troubleshooting SQL Server performance issues:

Scenario 5

Event ID 2115 is logged and a Management Server generates an "unable to write data to the Data Warehouse" alert in System Center Operations Manager. You also experience the following symptoms on the Management Server computer:

  • The Management Server generates one or more alerts that resemble the following:

Performance data collection process unable to write data to the Data Warehouse

Performance data collection process unable to write data to the Data Warehouse. Failed to store data in the Data Warehouse. Exception 'SqlException': Management Group with id '9069F7BD-55B8-C8E8-1CF9-4395F45527E2' is not allowed to access Data Warehouse under login 'DOMAIN\Action_Account' One or more workflows were affected by this. Workflow name: Microsoft.SystemCenter.DataWarehouse.CollectPerformanceData Instance name: dataWarehouseServer.domain.com Instance ID: {48936EE3-4E3E-BEE1-8C09-AFDAB8ECF236} Management group: Management Group Name.

  • The following event is logged in the Operations Manager event log on the Management Server:

Event Type: Warning
Event Source: HealthService
Event Category: None
Event ID: 2115
Date: date

Time: time

User: N/A

Computer: ManagementServerName
Description: A Bind Data Source in Management Group Management Group Name has posted items to the workflow, but has not received a response in 1712 seconds. This indicates a performance or functional problem with the workflow. Workflow Id : Microsoft.SystemCenter.DataWarehouse.CollectPerformanceData Instance : ManagementServerName.domain.com Instance Id : {C7FDDE2A-E0AA-4B80-70DE-1D50D9965221}

Resolution

This issue can occur if the Management Server does not have accounts that are specified for its data warehouse "Run As" profiles. This issue is more likely to affect a secondary Management Server. To resolve this problem complete the following:

  1. On the computer that is running System Center Operations Manager, open the Operations Console.
  2. In the navigation pane, click Administration.
  3. Expand Security, and then click Run As Profiles.
  4. In the Run As Profiles view, double-click Data Warehouse Account.
  5. In the Run As Profile Properties – Data Warehouse Account properties dialog box, click the Run As Accounts tab, and then click New.
  6. In the Run As Account list, click Data Warehouse Action Account.
  7. In the Name list, click the Management Server that generated the alert.
  8. Click OK two times.
  9. Follow steps 4 through 8 to assign the appropriate Run As account to the following profiles:
    – Data Warehouse Configuration Synchronization Reader Account
    – Data Warehouse Report Deployment Account
    – Data Warehouse SQL Server Authentication Account
  10. For each profile, select the Run As account that matches the name of the Run As profile. For example, make the following assignments:
    – Assign the Data Warehouse Configuration Synchronization Reader Account to the Data Warehouse Configuration Synchronization Reader Account profile.
    – Assign the Data Warehouse Report Deployment Account to the Data Warehouse Report Deployment Account profile.
    – Assign the Data Warehouse SQL Server Authentication Account to the Data Warehouse SQL Server Authentication Account profile.
  11. On the Management Server that generated the alert, restart the OpsMgr Health Service.
  12. In the Operations Manager event log on the Management Server, verify that event ID 31554 events are logged. Event ID 31554 indicates that the monitor state has changed to Healthy. This change resolves the alert.

Scenario 6

Event ID 2115 occurs on a server running HP MPIO FF DSM XP v3.01 to which there are no LUNs presented. When the user opens Performance Monitor and attempts to add a counter, Performance Monitor will hang and the handle count for this application increases rapidly.

Resolution

There are two workarounds for this issue:

  1. Rename the HPPerfProv.dll file and reboot Windows. Performance Monitor will work without issue when the file is renamed and not loaded.
  2. Have at least 1 LUN present on the system.

For more information on this issue please see the following HP support document:

Windows MPIO DSM for XP v3.01.00 – Will Have Windows Perfmon Hung In "No LUN Presented" Environment

 J.C. Hornbeck | Solution Asset PM | Microsoft GBS Management and Security Division

Get the latest System Center news on Facebook and Twitter:

clip_image001 clip_image002

System Center All Up: http://blogs.technet.com/b/systemcenter/

Configuration Manager Support Team blog: http://blogs.technet.com/configurationmgr/ 
Data Protection Manager Team blog: http://blogs.technet.com/dpm/ 
Orchestrator Support Team blog: http://blogs.technet.com/b/orchestrator/ 
Operations Manager Team blog: http://blogs.technet.com/momteam/ 
Service Manager Team blog: http://blogs.technet.com/b/servicemanager 
Virtual Machine Manager Team blog: http://blogs.technet.com/scvmm

Microsoft Intune: http://blogs.technet.com/b/microsoftintune/
WSUS Support Team blog: http://blogs.technet.com/sus/
The RMS blog: http://blogs.technet.com/b/rms/
App-V Team blog: http://blogs.technet.com/appv/
MED-V Team blog: http://blogs.technet.com/medv/
Server App-V Team blog: http://blogs.technet.com/b/serverappv
The Surface Team blog: http://blogs.technet.com/b/surface/
The Application Proxy blog: http://blogs.technet.com/b/applicationproxyblog/

The Forefront Endpoint Protection blog : http://blogs.technet.com/b/clientsecurity/
The Forefront Identity Manager blog : http://blogs.msdn.com/b/ms-identity-support/
The Forefront TMG blog: http://blogs.technet.com/b/isablog/
The Forefront UAG blog: http://blogs.technet.com/b/edgeaccessblog/

System Center 2012 Operations Manager System Center 2012 R2 Operations Manager SCOM 2012 R2 SCOM2012R2