Standard Dataset Maintenance troubleshooter for System Center Operations Manager 2007

What is Standard Data Set Maintenance?

Standard Data Set Maintenance is a workflow that runs against the Data Warehouse to aggregate, optimize, and groom data. The workflow runs on the RMS and is triggered every 60 seconds. It has a hard-coded timeout and under many circumstances can fail, resulting in 31552 events in the RMS Operations Manager event log. The process is actually a group of stored procedures (SPs) called by the parent SP, StandardDataSetMaintenance. Many SPs make up this group, but the primary ones are:

StandardDataSetMaintenance – Parent, Calls the other primary sp’s

StandardDataSetAggregate – Called for State and Perf data aggregation

StandardDataSetOptimize – Updates indexes and statistics

StandardDataSetGroom – Grooms old data based on age

[Image: the StandardDataSet stored procedures]

The above shows the StandardDataSet SPs, but omits the data-set-specific SPs they call, such as the performance-specific SPs:

PerformanceAggregate

PerformanceAggregationDelete

PerformanceGroom

PerformanceProcessStaging

Each data set type will have its own specific set of stored procedures similar to the above.
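If you want to see which data-set-specific procedures are wired up in your environment, a query along these lines can list them. This is a sketch: the BuildAggregationStoredProcedureName and DeleteAggregationStoredProcedureName column names are from a typical OpsMgr 2007 R2 DW schema, so verify them against your own StandardDatasetAggregation table before relying on the query.

```sql
-- List the data-set-specific procedures each standard data set uses.
-- Column names assume a typical OpsMgr 2007 R2 DW schema; verify in your environment.
SELECT sd.SchemaName,
       sda.BuildAggregationStoredProcedureName,
       sda.DeleteAggregationStoredProcedureName,
       sda.GroomStoredProcedureName
FROM StandardDataset sd WITH (NOLOCK)
JOIN StandardDatasetAggregation sda WITH (NOLOCK)
    ON sda.DatasetId = sd.DatasetId
ORDER BY sd.SchemaName
```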

What causes things to go wrong?

There are a few different things that can cause this workflow to fail. Here are the five that will be the focus of this document as they are the most common:

1 – SQL Standard

SQL Server Standard Edition does not allow online index operations on the database. Every night the maintenance SP triggers the StandardDataSetOptimize stored procedure, which tries to perform indexing on the data set tables. On Standard Edition this process fails, so every night when it runs we see corresponding 31552 events indicating the failure.
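A quick way to confirm which edition is hosting the Data Warehouse is to query SERVERPROPERTY on that instance:

```sql
-- Returns the edition and version of the SQL Server instance hosting the DW;
-- online index operations require Enterprise (or equivalent) Edition
SELECT SERVERPROPERTY('Edition') AS Edition,
       SERVERPROPERTY('ProductVersion') AS ProductVersion
```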

2 – SQL Performance

Poor SQL performance can cause 31552 events, but typically not those events alone. When this is the cause you will usually see a wide variety of symptoms, such as write failures, reports that are slow or time out, scheduled reports failing, and possibly 2115 events. Disk performance is the most common culprit here, which is where we will start in the troubleshooting steps later.

3 – Data Flood (Perf and/or State)

Since the maintenance process has to aggregate, copy, and groom State and Performance data, an influx of data can cause it to fail. A single bad monitor can wreak havoc on the DW database if it is changing state quickly enough. It does not necessarily have to be a “bad” monitor; it can simply be an issue in the environment, for instance an R2 upgrade that caused a flood of new state changes, or a power outage that impacted enough objects to initiate a flood. Performance collections that make it fail are not necessarily bad either: there may simply be too many collection rules, too many objects that the rules run against, or a collection interval set a little too aggressively. A few quick queries, shown below, will give us our answer here.

4 – Large MEP tables (ManagedEntityProperty)

This is mostly seen in SP1 environments. There was an issue in SP1 where we did not groom these properties as we should, so this table just kept growing. This is fixed in R2, but it is listed here because we still support SP1 and could still see it in an upgraded R2 management group. If this table is too large, it can take a long time to process the managed entity objects needed for the maintenance process, which leads to the 31552 event. We can manually groom this table if needed, but caution is required when performing those steps.
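A quick read-only check of the table size can tell you whether this cause is even plausible (table name as referenced above; what counts as “too large” varies by environment, but a count in the millions on an SP1 or upgraded management group is a strong hint):

```sql
-- Row count of the ManagedEntityProperty table in the DW
SELECT COUNT(*) AS MEPRows
FROM ManagedEntityProperty WITH (NOLOCK)
```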

5 – SQL Permissions

This is the most unlikely of the scenarios and will not be discussed in detail here. It is primarily caused by incorrectly configured Run As Profiles/Accounts and tends to manifest elsewhere as well. We would normally see more failures than just 31552 events in this case, and troubleshooting is simply a matter of verifying the accounts, so I will not spend much time on it here.

So now that we know a little about the event and its primary causes, let’s look at how to troubleshoot some of the above scenarios. A key piece of information when troubleshooting these events is the data set named in the event. Take the event below:

Failed to store data in the Data Warehouse.
Exception 'SqlException': Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
One or more workflows were affected by this.
Workflow name: Microsoft.SystemCenter.DataWarehouse.StandardDataSetMaintenance
Instance name: State data set
Instance ID: {8FBE35E8-41E7-F8F8-2DDC-AFC4A44D7522}
Management group: Operations Management

The Instance name line above tells you which data set is experiencing the issue. This is important because not all data sets apply to every scenario listed above. In the troubleshooting section I will note any restrictions.

So how do we troubleshoot these events?

The first thing we need to do is isolate the cause. So as stated above we need to know what is impacted. Here is a list of things to check in the beginning:

What data sets are impacted?

If it is only State or Perf (sometimes both), then we can rule out SQL permissions, the MEP table, and the SQL edition. That lets us focus on performance or noisy rules/monitors. If it is all data sets, then we can eliminate data flood, as a flood only impacts the data set that is flooding the database. Now that we know the data sets, we move on to the next question.

When are the events happening?

If the events are only happening at night, then we want to check the SQL edition first, as mentioned above. This should be the first thing checked in the all-data-set scenario, since it is the quickest to eliminate. If the events occur during the day as well, that quickly eliminates this cause.

Are there any correlating events?

Sometimes we will see 31553 events or events indicating permissions issues with SQL. We also can check the SQL logs for login/permission issues. If we see these they are generally related to the SQL accounts defined in the console as seen below:

[Image: Data Warehouse SQL accounts in the Operations console]

During the DW/Reporting install these accounts are created with spaces for the account name and password. If we know they are not supposed to be defined, a quick fix is to “reset” them by opening each account and putting a space in each box so it looks like this:

[Image: Run As account with blank (space) credentials]

Once the above change is made, we can restart the RMS Health Service and see if we still get the events. Correlating events like 31553s can also indicate a couple of other issues that are outside the scope of this document, so keep that in mind when troubleshooting. A 31553 event indicates that something went wrong during the maintenance procedure, and it will generally identify the specific failure. For example:

Log Name: Operations Manager
Source: Health Service Modules
Date:
Event ID: 31553
Task Category: Data Warehouse
Level: Error
Keywords: Classic
User: N/A
Computer: RMS.corp.scom.com
Description:
Data was written to the Data Warehouse staging area but processing failed on one of the subsequent operations.
Exception 'SqlException': Sql execution failed. Error 777971002, Level 16, State 1, Procedure StandardDatasetAggregate, Line 424, Message: Sql execution failed. Error 777971002, Level 16, State 1, Procedure PerformanceAggregate, Line 149, Message: Sql execution failed. Error 8115, Level 16, State 1, Procedure -, Line 1, Message: Arithmetic overflow error converting IDENTITY to data type int.

The above error indicates an issue where the table needs to be reseeded. Our partition tables have an identity column of type int, which means they have a maximum row count of 2147483647. We normally never reach this, and we usually see this error because the identity value is higher than the actual row count. I will not get into how we end up in this situation, but to detect and resolve it we need to do the following:

Check the current identity value:

DBCC CHECKIDENT ('TableName')

If the above verifies the value is too high, run this:

SELECT COUNT(*) FROM TableName -- tells us the value to use in the next command

DBCC CHECKIDENT ('TableName', RESEED, 3) -- the "3" here is the result from the last query

The above steps get the total number of rows currently in the table and reset the identity so that the next row inserted gets the correct value.

Now let’s get back to our problem for this document.

Take the below scenario:

[Image: Operations Manager event log showing repeated 31552 events]

As you can see there are quite a few events, and we are going to assume two situations, one at a time.

All workflows are impacted and we think it is SQL Performance

SQL performance can be very tricky to troubleshoot, and I suggest you get a SQL engineer involved to assist with the diagnosis. Before you do, here are some things you should check.

First, let’s check some performance counters. Instead of listing the counters individually, here is a screenshot of what we should check:

[Image: Perfmon counter list for checking SQL performance]

When looking at these counters we need to verify where the database files are located. If the database files are on the E and F drives, then we need the counters for those drives in Perfmon. Most of the above counters are open to interpretation, but PLE (Page Life Expectancy) is more specific: it should never drop below 300. This counter indicates how long a page stays in memory before SQL cycles it out to make room for a new page. If it gets too low, our maintenance process can take too long to process data and time out, resulting in the event. Manoj Parvathaneni discusses SQL performance in his Grey Agent Troubleshooter located here. You can reference that for what to look for in the counters listed in this document. To save time, here is the section from his guide relevant to our investigation:

· MSSQL$<instance>: Buffer Manager: Page Life expectancy – How long pages persist in the buffer pool. If this value is below 300 seconds, it may indicate that the server could use more memory. It could also result from index fragmentation.
· MSSQL$<instance>: Buffer Manager: Lazy Writes/sec – Lazy writer frees space in the buffer by moving pages to disk. Generally, the value should not consistently exceed 20 writes per second. Ideally, it would be close to zero.
· Memory: Available Mbytes - Values below 100 MB may indicate memory pressure. Memory pressure is clearly present when this amount is less than 10 MB.
· Process: Private Bytes: _Total – This is the amount of memory (physical and page) being used by all processes combined.
· Process: Working Set: _Total – This is the amount of physical memory being used by all processes combined. If the value for this counter is significantly below the value for Process: Private Bytes: _Total, it indicates that processes are paging too heavily. A difference of more than 10% is probably significant.
Counters to identify disk pressure: Capture these Physical Disk counters for all drives containing SQL data or log files:
· % Idle Time – How much disk idle time is being reported. Anything below 50% could indicate a disk bottleneck.
· Avg. Disk Queue Length – This value should not exceed 2 times the number of spindles on a LUN. For example, if a LUN has 25 spindles, a value of 50 is acceptable. However, if a LUN has 10 spindles, a value of 25 is too high. You could use the following formulas based on the RAID level and number of disks in the RAID configuration
RAID 0 – All of the disks are doing work in a RAID 0 set
Average Disk Queue Length <= # (Disks in the array) *2
RAID 1 – half the disks are “doing work” so only half of them can be counted toward Disks Queue
Average Disk Queue Length <= # (Disks in the array/2) *2
RAID 10 – half the disks are “doing work” so only half of them can be counted toward Disks Queue
Average Disk Queue Length <= # (Disks in the array/2) *2
RAID 5 – All of the disks are doing work in a RAID 5 set
Average Disk Queue Length <= # Disks in the array *2
· Avg. Disk sec/Transfer – The number of seconds it takes to complete one disk I/O.
· Avg. Disk sec/Read – The average time, in seconds, of a read of data from the disk.
· Avg. Disk sec/Write – The average time, in seconds, of a write of data to the disk.
The above three counters should be around .020 (20 ms) or below consistently, and never exceed .050 (50 ms). Here are the thresholds documented in the SQL performance troubleshooting guide:
Less than 10 ms – very good
Between 10 - 20 ms – okay
Between 20 - 50 ms – slow, needs attention
Greater than 50 ms – Serious I/O bottleneck
· Disk Bytes/sec – The number of bytes being transferred to or from the disk per second.
· Disk Transfers/sec – The number of input and output operations per second (IOPS).
When % Idle Time is low (10% or less) – which means that the disk is fully utilized – the above two counters will provide a good indication of the maximum throughput of the drive in bytes and in IOPS, respectively. The throughput of a SAN drive is highly variable, depending on the number of spindles, the speed of the drives and the speed of the channel. The best bet is to check with the SAN vendor to find out how many bytes and IOPS the drive should support. If % Idle Time is low and the values for these two counters do not meet the expected throughput of the drive, engage the SAN vendor to troubleshoot.
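As a shortcut before standing up a Perfmon collection, a couple of DMV queries can give a quick first read on PLE and file latency directly from SQL Server. This is a sketch: it assumes SQL Server 2005 or later, VIEW SERVER STATE permission, and a DW database named OperationsManagerDW (adjust the name to your environment).

```sql
-- Current Page Life Expectancy as reported by the buffer manager
SELECT cntr_value AS PageLifeExpectancy
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Page life expectancy'
  AND object_name LIKE '%Buffer Manager%'

-- Average per-file I/O latency (ms) for the DW database since instance start
SELECT DB_NAME(vfs.database_id) AS DbName,
       mf.physical_name,
       vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS AvgReadMs,
       vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS AvgWriteMs
FROM sys.dm_io_virtual_file_stats(NULL, NULL) vfs
JOIN sys.master_files mf
    ON mf.database_id = vfs.database_id
   AND mf.file_id = vfs.file_id
WHERE DB_NAME(vfs.database_id) = 'OperationsManagerDW'
```

Note these latency figures are cumulative averages since the instance last started, so they smooth over spikes; Perfmon is still the right tool for catching the nightly window itself.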
The following links are great resources for getting deeper insight into troubleshooting SQL performance:
Troubleshooting Performance Problems in SQL Server 2005:
https://technet.microsoft.com/en-us/library/cc966540.aspx
Troubleshooting Performance Problems in SQL Server 2008:

https://msdn.microsoft.com/en-us/library/dd672789(SQL.100).aspx

Only the State Data Set is in the 31552 events

A lot of this is covered in my Blank Reports troubleshooter located here. That document covers some additional steps that are not listed here because they are not relevant to this scenario. When only one data set is present in the events, we can suspect a data flood or simply excessive data. Both are investigated the same way. The first thing we need to do is find out how much data is being inserted into the DW. To see that, we can run the following:

SELECT datepart(dd,[DateTime]) as day
, count( * ) as Count_of_Events
FROM [State].[vStateRaw] s with (NOLOCK)
WHERE ([DateTime] BETWEEN
CONVERT(datetime, '2011-07-01 00:00:00', 120)
AND CONVERT(datetime, '2011-07-30 00:00:00', 120))
group by datepart(dd,[DateTime])
order by 1 asc

The above query lists the number of State data insertions per day during the timeframe configured in the query. An example of the result is:

day Count_of_Events

1 59336

2 29516

3 17585

4 15396

5 15023

6 14792

7 17538

8 61737

9 2687722

10 2017917

11 1857917

12 20904

13 30034

14 29353

15 30565

……..

As you can see above, we had a flood on the 9th, 10th, and 11th of the month, which probably caused our issue. We can diagnose this further by checking how many insertions occurred per hour and which monitor was likely causing the flood.

--to show what items per hour per day in state are logged

SELECT datepart(hh,[DateTime]) as hr, count( * ) as Count_of_Events FROM [State].[StateRaw_<guid>] s with (NOLOCK) WHERE ([DateTime] BETWEEN
CONVERT(datetime, 'yyyy-mm-dd hh:mm:ss', 120) AND CONVERT(datetime, 'yyyy-mm-dd hh:mm:ss', 120))
group by datepart(hh,[DateTime])
order by 1 asc

--example of the above but with times added

SELECT datepart(hh,[DateTime]) as hr, count( * ) as Count_of_Events FROM [State].[StateRaw_561ADF0271A34D38AAA027F790BEDF82] s with (NOLOCK) WHERE ([DateTime] BETWEEN
CONVERT(datetime, '2011-04-01 00:00:00', 120) AND CONVERT(datetime, '2011-04-01 23:59:59', 120))
group by datepart(hh,[DateTime])
order by 1 asc

-- Noisiest monitors during the hours of interest

select datepart(hh,[DateTime]) as hr, ManagedEntityMonitorRowId,
count(StateRawRowid) as Count_Of_Entries from [State].[StateRaw_561ADF0271A34D38AAA027F790BEDF82] s with (NOLOCK)
WHERE ([DateTime] BETWEEN CONVERT(datetime, '2011-03-31 00:00:00', 120)
AND CONVERT(datetime, '2011-03-31 23:59:59', 120))
group by datepart(hh,[DateTime]),ManagedEntityMonitorRowId
having count(StateRawRowid) > 100 -- for now ignore anything with less than 100 entries
order by 1,3 desc

The above is useful if, on the days of the flood, we had an extremely large number of insertions. We may have to flip the DirtyInd rows for specific timeframes to 0 to allow the remaining data to aggregate. If this is the case, I recommend you work with a senior engineer to make the changes. You can also look into the monitor causing the flood and make any changes necessary.
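As a rough sketch of what flipping DirtyInd for a flood window might look like (the date range here is hypothetical, taken from the example flood above; do not run anything like this without a senior engineer reviewing it first):

```sql
-- CAUTION: sketch only. Marks aggregations inside the flood window as already
-- processed so maintenance can catch up; raw data in that window will NOT be
-- aggregated. Review with a senior engineer before running.
DECLARE @DatasetId uniqueidentifier
SELECT @DatasetId = DatasetId FROM StandardDataset WHERE SchemaName = 'State'

UPDATE StandardDatasetAggregationHistory
SET DirtyInd = 0
WHERE DatasetId = @DatasetId
  AND DirtyInd = 1
  -- hypothetical flood window; substitute the dates found with the earlier queries
  AND AggregationDateTime BETWEEN '2011-07-09' AND '2011-07-12'
```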

Now that we have this information, we can check how many DirtyInd rows there are for the data set. To check that, run this:

DECLARE @DatasetId uniqueidentifier
SELECT
@DatasetId = DatasetId
FROM Dataset d
WHERE (d.DatasetDefaultName = 'State data set')
Select AggregationDateTime, AggregationTypeId
From StandardDatasetAggregationHistory
Where DatasetId = @DatasetId
And
DirtyInd = 1

This outputs the DirtyInd rows we have to deal with. We should expect to see at least 2 entries: 1 for the current hour and a second for the current day. Anything under 5-10 rows can probably be ignored, but in our example here it returned 550 rows. With a result this high, it will take quite some time to catch up. The amount of data we can move at one time is restricted by the MaxRowsToGroom column in the StandardDatasetAggregation table. For State data, this is set to 50000 rows by default. We can increase it during our tests to, say, 100000 rows by running the following:

Update StandardDatasetAggregation
Set MaxRowsToGroom = 100000
Where GroomStoredProcedureName = 'StateGroom'

This lets our next step move more data at one time. Before running it, we need to make sure we have enough space in the transaction log and in tempdb. We also need to override the maintenance rule in the console so that the workflow does not run while we run it manually.

Once we have verified this, we can run the below query:

DECLARE @DataSet uniqueidentifier
SET @DataSet = (SELECT DatasetId FROM StandardDataset WHERE SchemaName = 'State')
EXEC standarddatasetmaintenance @DataSet

The above query executes maintenance for the specified data set manually. Sometimes one pass is not enough (if we have too many DirtyInd rows), so we need to run it more than once. You can run the below to execute it in a loop:

declare @i int
set @i=1
while(@i<=500)
begin
DECLARE @DataSet uniqueidentifier
SET @DataSet = (SELECT DatasetId FROM StandardDataset WHERE SchemaName = 'State')
EXEC standarddatasetmaintenance @DataSet
set @i=@i+1
Waitfor delay '00:00:05'
End

After running this query, return to the DirtyInd query to see how many rows are left. We can also force the loop to run until we get below a defined number of rows (not recommended, but useful for situations where the customer does not want to babysit the query) by running this instead:

DECLARE @DataSet uniqueidentifier
SET @DataSet = (SELECT DatasetId FROM StandardDataset WHERE SchemaName = 'State')
WHILE (SELECT COUNT(*) FROM StandardDatasetAggregationHistory WHERE DirtyInd = 1 AND DatasetId = @DataSet) > 10
BEGIN
EXEC StandardDataSetMaintenance @DataSet
WAITFOR DELAY '00:00:05'
END

Once the above finishes, all the data should be aggregated. I recommend resetting the MaxRowsToGroom column back to its default and then removing the override on the maintenance rule. Then we can monitor the event log to see whether the events return.
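Resetting the column is just the inverse of the earlier update, using the 50000-row State default cited above:

```sql
-- Restore the default grooming batch size for the State data set
UPDATE StandardDatasetAggregation
SET MaxRowsToGroom = 50000
WHERE GroomStoredProcedureName = 'StateGroom'
```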

Fred Lee  | System Center Support Escalation Engineer

