Bad OM Performance because State Change Events are not groomed out


IMPORTANT: Always perform a FULL Backup of the database before doing anything to it !!!

ALSO: It is advised that you open a case at Microsoft before doing this - directly editing the database is not supported and you may find yourself in an unsupported state if anything goes wrong.

 

So I have encountered another interesting issue where we had a bad performance in SC Operations Manager 2012 (also applies to 2007 R2) because of the fact that the StateChangeEvent table was very large - the largest table actually in the database.

First of all every investigation should start with tuning the "flip-flopping" Monitors and manually grooming the "old" entries: http://blogs.technet.com/b/kevinholman/archive/2009/12/21/tuning-tip-do-you-have-monitors-constantly-flip-flopping.aspx

The important thing to know here from the start is the following: the "standard" OM Grooming Job p_StateChangeEventGrooming will *only* delete the StateChangeEvent entries which are older than the configured State Grooming Threshold *AND* for which the corresponding Monitors are currently in Healthy state.

The modified procedure from Kevin's Article is actually the "same" procedure p_StateChangeEventGrooming *but* it has a small, but big impact change, to delete *ALL* StateChangeEvents which are older that the configured State Grooming Threshold regardless of what state the corresponding Monitors are in.

 

Ok, should be basically pretty straight forward right? Good! But there are some situations when you tune and you tune and then all of the sudden you can observe that you have no more "noisy" Monitors and we have a "stable" normal number of StateChangeEvents per day. However, after a couple of days we can see that even with these tuning actions which we have done, we can still see that the StateChangeEvent table is again the biggest or one of the biggest top tables in the database. But why is that?!

1. Let's start our investigation by running this SQL Query on the OperationsManager database to see how many StateChangeEvent entries we have, how many should get cleaned out by the "standard" Grooming and how many will *not* be cleaned out by the "standard" Grooming even if they are older than the State Grooming Threshold:

NOTE: Don't worry if you also get results for StateChangeEvents that should get groomed out by the standard Grooming - this means that on the next standard Grooming run, these *will* get cleaned out - you can even force this by running this Query: EXEC p_StateChangeEventGrooming.

 DECLARE
    @GroomingThresholdLocal DATETIME,
    @GroomingThresholdUTC DATETIME
 SELECT @GroomingThresholdLocal = dbo.fn_GroomingThreshold(DaysToKeep, GETDATE())
 FROM dbo.PartitionAndGroomingSettings
 WHERE ObjectName = 'StateChangeEvent'
 EXEC dbo.p_ConvertLocalTimeToUTC @GroomingThresholdLocal, @GroomingThresholdUTC OUT
 SELECT 'Should be cleaned', COUNT(*) AS 'Count'
 FROM dbo.StateChangeEvent AS SCE WITH(NOLOCK)
 INNER JOIN dbo.State AS S WITH(NOLOCK)
    ON SCE.StateId = S.StateId
 WHERE
    SCE.TimeGenerated < @GroomingThresholdUTC AND
    S.HealthState = 1
 UNION
 SELECT 'Should NOT be cleaned', COUNT(*) AS 'Count'
 FROM dbo.StateChangeEvent AS SCE WITH(NOLOCK)
 INNER JOIN dbo.State AS S WITH(NOLOCK)
    ON SCE.StateId = S.StateId
 WHERE
    SCE.TimeGenerated < @GroomingThresholdUTC AND
    S.HealthState != 1
 UNION
 SELECT 'Total StateChanges', COUNT(*) AS 'Count'
 FROM StateChangeEvent 

 

2. WOW! So from the above we could see that about 80% (2.255.838 to be exact) of the existing StateChangeEvent entries are *older* than the configured State Grooming Threshold but are still not cleaned-out by the standard Grooming and will not be - the only reason for this is that it seams that the corresponding Monitors are in Warning or Critical state - to get an idea of which these Monitors are, please run this Query:

 DECLARE
    @GroomingThresholdLocal DATETIME,
    @GroomingThresholdUTC DATETIME
 SELECT @GroomingThresholdLocal = dbo.fn_GroomingThreshold(DaysToKeep, GETDATE())
 FROM dbo.PartitionAndGroomingSettings
 WHERE ObjectName = 'StateChangeEvent'
 EXEC dbo.p_ConvertLocalTimeToUTC @GroomingThresholdLocal, @GroomingThresholdUTC OUT
 SELECT
    DSV.DisplayName
 FROM dbo.StateChangeEvent AS SCE WITH(NOLOCK)
 INNER JOIN dbo.State AS S WITH(NOLOCK)
    ON SCE.StateId = S.StateId
 INNER JOIN Monitor AS M
    ON S.MonitorId = M.MonitorId
 INNER JOIN BaseManagedEntity AS BME
    ON S.BaseManagedEntityId = BME.BaseManagedEntityId
 INNER JOIN DisplayStringView AS DSV
    ON M.MonitorId = DSV.LTStringId
 WHERE
    SCE.TimeGenerated < @GroomingThresholdUTC AND
    S.HealthState != 1 AND
    DSV.LanguageCode = 'ENU' AND
    M.IsUnitMonitor = 1
 GROUP BY DSV.DisplayName 

 

3. Now we encounter a pretty common issue - we should check for this - usually I bet that with this very high number of "unhealthy" Monitors we should have a looooot of open Alerts which in a production environment would not look pretty good so we can bet here that the corresponding Alerts were closed manually (or automated somehow) *BUT* the corresponding Monitors were *NOT* - to check how many these are, run this Query:

 DECLARE
    @GroomingThresholdLocal DATETIME,
    @GroomingThresholdUTC DATETIME
 SELECT @GroomingThresholdLocal = dbo.fn_GroomingThreshold(DaysToKeep, GETDATE())
 FROM dbo.PartitionAndGroomingSettings
 WHERE ObjectName = 'StateChangeEvent'
 EXEC dbo.p_ConvertLocalTimeToUTC @GroomingThresholdLocal, @GroomingThresholdUTC OUT
 SELECT COUNT(*)
 FROM Alert AS A
 WHERE A.RuleId IN (
    SELECT
       M.MonitorId
    FROM dbo.StateChangeEvent AS SCE WITH(NOLOCK)
    INNER JOIN dbo.State AS S WITH(NOLOCK)
    ON SCE.StateId = S.StateId
    INNER JOIN Monitor AS M
    ON S.MonitorId = M.MonitorId
    INNER JOIN BaseManagedEntity AS BME
       ON S.BaseManagedEntityId = BME.BaseManagedEntityId
    WHERE
       SCE.TimeGenerated < @GroomingThresholdUTC AND
       S.HealthState != 1 AND
    M.IsUnitMonitor = 1
 ) AND A.ResolutionState = 255 

 

4. WOW! So in this case we had about 2000 results! So to solve this issue, we can execute this PowerShell Script on a Management Server under OM PowerShell:

NOTE: Keep in mind that after running this PS Script you will have get additional new StateChangeEvent entries because a Monitor Reset is also a state change, these will be groomed out as soon as they are older than the configured State Grooming Threshold.

 $alerts = Get-SCOMAlert -ResolutionState 255 | where { $_.IsMonitorAlert -eq $true }
 foreach ($alert in $alerts) {
    $monitor = Get-SCOMMonitor | where { $_.Id -eq $alert.MonitoringRuleId }
    $monitoringobject = Get-SCOMMonitoringobject -Id $alert.MonitoringObjectId | where { $_.HealthState -ne 0 -and $_.HealthState -ne 1 }
    if ($monitoringobject -ne $null) { $monitoringobject.ResetMonitoringState($monitor) }
 } 

 

5. After this, we should also run Kevin's modified Grooming procedure again to clean this up:

NOTE: If you are curious about the change in the original, standard Grooming procedure, I have highlighted the change in the Query below.

 SET ANSI_NULLS ON
 GO
 SET QUOTED_IDENTIFIER ON
 GO
 BEGIN
    SET NOCOUNT ON
    DECLARE
    @Err INT,
    @Ret INT,
    @DaysToKeep TINYINT,
    @GroomingThresholdLocal DATETIME,
    @GroomingThresholdUTC DATETIME,
    @TimeGroomingRan DATETIME,
    @MaxTimeGroomed DATETIME,
    @RowCount INT
 SET @TimeGroomingRan = GETUTCDATE()
 SELECT @GroomingThresholdLocal = dbo.fn_GroomingThreshold(DaysToKeep, GETDATE())
 FROM dbo.PartitionAndGroomingSettings
 WHERE ObjectName = 'StateChangeEvent'
 EXEC dbo.p_ConvertLocalTimeToUTC @GroomingThresholdLocal, @GroomingThresholdUTC OUT
 SET @Err = @@ERROR
 IF (@Err <> 0) BEGIN
    GOTO Error_Exit
 END
 SET @RowCount = 1
 SELECT @MaxTimeGroomed = MAX(TimeGenerated)
 FROM dbo.StateChangeEvent
 WHERE TimeGenerated < @GroomingThresholdUTC
 IF @MaxTimeGroomed IS NULL
    GOTO Success_Exit
 DELETE MJS
 FROM dbo.MonitoringJobStatus MJS
 JOIN dbo.StateChangeEvent SCE
    ON SCE.StateChangeEventId = MJS.StateChangeEventId
 JOIN dbo.State S WITH(NOLOCK)
    ON SCE.[StateId] = S.[StateId]
 WHERE SCE.TimeGenerated < @GroomingThresholdUTC
 AND S.[HealthState] in (0, 1, 2, 3)
 SELECT @Err = @@ERROR
 IF (@Err <> 0) BEGIN
    GOTO Error_Exit
 END
 WHILE (@RowCount > 0) BEGIN
    DELETE TOP (10000) SCE
    FROM dbo.StateChangeEvent SCE
    JOIN dbo.State S WITH(NOLOCK)
       ON SCE.[StateId] = S.[StateId]
    WHERE TimeGenerated < @GroomingThresholdUTC
    AND S.[HealthState] in (0,1,2,3)
    SELECT @Err = @@ERROR, @RowCount = @@ROWCOUNT
    IF (@Err <> 0) BEGIN
       GOTO Error_Exit
    END
 END
 UPDATE dbo.PartitionAndGroomingSettings
 SET GroomingRunTime = @TimeGroomingRan,
    DataGroomedMaxTime = @MaxTimeGroomed 
 WHERE ObjectName = 'StateChangeEvent'
 SELECT @Err = @@ERROR, @RowCount = @@ROWCOUNT
 IF (@Err <> 0) BEGIN
    GOTO Error_Exit
 END
    GOTO Success_Exit
 Error_Exit:
    PRINT 'ERROR!!!'
 Success_Exit:
 END 

 

To avoid this in the future:

  • Make sure that you always solve a problem before "closing" the issue (close Alert and Reset Monitor) because if you don't, then you will end up with another Alert and another unhealthy state of that Monitor.
  • If you don't want to solve an issue and usually just "ignore" the state of a Monitor or it's Alert, then it would be best to disable that Monitor
  • For Monitors which don't have AutoResolve option or which don't have a "Healthy State Check" always make sure that you reset the Monitor properly: http://technet.microsoft.com/en-us/library/hh212816.aspx
  • Basically make sure somehow, that you *ALWAYS* also reset a Monitor if you close it's Alert - a good idea to constantly monitor this and make sure you don't have to worry about that, is to implement some kind of automation for this, like in this great "AS IS" example using SC Orchestratorhttp://blog.scomfaq.ch/2012/05/05/reset-monitor-using-scom-2012-and-orchestrator-a-must-have-runbook/

 

Try to keep your environment healthy and always be with your eyes open for new tuning! 😉

 

Comments (0)

Skip to main content