Tuning tip: Do you have monitors constantly “flip flopping” ?


 

This is something I see in almost all clients when we perform a PFE Health Check.  The customer will have lots of data being inserted into the OpsDB from agents, about monitors that are constantly changing state.  This can have a very negative effect on overall performance of the database – because it can be a lot of data, and the RMS is busy handling the state calculation, and synching this data about the state and any alert changes to the warehouse.

Many times the OpsMgr admin has no idea this is happening, because the alerts appear, and then auto-resolve so fast, you never see them – or don’t see them long enough to detect there is a problem.  I have seen databases where the statechangeevent table was the largest in the database – caused by these issues.

 

Too many state changes are generally caused by one or both, of two issues:

1.  Badly written monitors that flip flop constantly.  Normally – this happens when you target a multi-instance perf counter incorrectly.  See my POST on this topic for more information.

2.  HealthService restarts.  See my POST on this topic.

 

How can I detect if this is happening in my environment?

 

That is the right question!  For now – you can run a handful of SQL queries, which will show you the most common state changes going on in your environments.  These are listed on my SQL query blog page in the State section:

 

Noisiest monitors in the database: (Note – these will include old state changes – might not be current)

select distinct top 50 count(sce.StateId) as NumStateChanges, m.MonitorName, mt.typename AS TargetClass
from StateChangeEvent sce with (nolock)
join state s with (nolock) on sce.StateId = s.StateId
join monitor m with (nolock) on s.MonitorId = m.MonitorId
join managedtype mt with (nolock) on m.TargetManagedEntityType = mt.ManagedTypeId
where m.IsUnitMonitor = 1
group by m.MonitorName,mt.typename
order by NumStateChanges desc

 

The above query will show us which monitors are flipping the most in the entire database.  This includes recent, and OLD data.  You have to be careful looking at this output – as you might spent a lot of time focusing on a monitor that had a problem long ago.  You see – we will only groom out old state changes for monitors that are CURRENTLY in a HEALTHY state, AT THE TIME that grooming runs.  We will not groom old state change events if the monitor is Disabled (unmonitored), in Maintenance Mode, Warning State, or Critical State.

What?

This means that if you had a major issue with a monitor in the past, and you solved it by disabling the monitor, we will NEVER, EVER groom that junk out.  This doesn’t really pose a problem, it just leaves a little database bloat, and messy statechangeevent views in HealthExplorer.  But the real issue for me is – it makes it a bit tougher to only look at the problem monitors NOW. 

To see if you have really old state change data leftover in your database, you can run the following query:

SELECT DATEDIFF(d, MIN(TimeAdded), GETDATE()) AS [Current] FROM statechangeevent

You might find you have a couple YEARS worth of old state data.

So – I have taken the built in grooming stored procedure, and modified the statement to groom out ALL statechange data, and only keep the number of days you have set in the UI.  (The default setting is 7 days).  I like to run this “cleanup” script from time to time, to clear out the old data, and whenever I am troubleshooting current issues with monitor flip-flop.  Here is the SQL query statement:

 

To clean up old StateChangeEvent data for state changes that are older than the defined grooming period, such as monitors currently in a disabled, warning, or critical state.  By default we only groom monitor statechangeevents where the monitor is enabled and healthy at the time of grooming.

USE [OperationsManager]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
BEGIN

    SET NOCOUNT ON

    DECLARE @Err int
    DECLARE @Ret int
    DECLARE @DaysToKeep tinyint
    DECLARE @GroomingThresholdLocal datetime
    DECLARE @GroomingThresholdUTC datetime
    DECLARE @TimeGroomingRan datetime
    DECLARE @MaxTimeGroomed datetime
    DECLARE @RowCount int
    SET @TimeGroomingRan = getutcdate()

    SELECT @GroomingThresholdLocal = dbo.fn_GroomingThreshold(DaysToKeep, getdate())
    FROM dbo.PartitionAndGroomingSettings
    WHERE ObjectName = ‘StateChangeEvent’

    EXEC dbo.p_ConvertLocalTimeToUTC @GroomingThresholdLocal, @GroomingThresholdUTC OUT
    SET @Err = @@ERROR

    IF (@Err <> 0)
    BEGIN
        GOTO Error_Exit
    END

    SET @RowCount = 1  

    — This is to update the settings table
    — with the max groomed data
    SELECT @MaxTimeGroomed = MAX(TimeGenerated)
    FROM dbo.StateChangeEvent
    WHERE TimeGenerated < @GroomingThresholdUTC

    IF @MaxTimeGroomed IS NULL
        GOTO Success_Exit

    — Instead of the FK DELETE CASCADE handling the deletion of the rows from
    — the MJS table, do it explicitly. Performance is much better this way.
    DELETE MJS
    FROM dbo.MonitoringJobStatus MJS
    JOIN dbo.StateChangeEvent SCE
        ON SCE.StateChangeEventId = MJS.StateChangeEventId
    JOIN dbo.State S WITH(NOLOCK)
        ON SCE.[StateId] = S.[StateId]
    WHERE SCE.TimeGenerated < @GroomingThresholdUTC
    AND S.[HealthState] in (0,1,2,3)

    SELECT @Err = @@ERROR
    IF (@Err <> 0)
    BEGIN
        GOTO Error_Exit
    END

    WHILE (@RowCount > 0)
    BEGIN
        — Delete StateChangeEvents that are older than @GroomingThresholdUTC
        — We are doing this in chunks in separate transactions on
        — purpose: to avoid the transaction log to grow too large.
        DELETE TOP (10000) SCE
        FROM dbo.StateChangeEvent SCE
        JOIN dbo.State S WITH(NOLOCK)
            ON SCE.[StateId] = S.[StateId]
        WHERE TimeGenerated < @GroomingThresholdUTC
        AND S.[HealthState] in (0,1,2,3)

        SELECT @Err = @@ERROR, @RowCount = @@ROWCOUNT

        IF (@Err <> 0)
        BEGIN
            GOTO Error_Exit
        END
    END   

    UPDATE dbo.PartitionAndGroomingSettings
    SET GroomingRunTime = @TimeGroomingRan,
        DataGroomedMaxTime = @MaxTimeGroomed
    WHERE ObjectName = ‘StateChangeEvent’

    SELECT @Err = @@ERROR, @RowCount = @@ROWCOUNT

    IF (@Err <> 0)
    BEGIN
        GOTO Error_Exit
    END 
Success_Exit:
Error_Exit:   
END

 

Once this is cleaned up – you can re-run the DATEDIFF query – and see you should only have the same number of days as set in your UI retention setting for database grooming.

Now – you can run the “Most common state changes” query – and identify which monitors are causing the problem.

 

Look for monitors at the top with MUCH higher numbers than all others.  This will be “monitor flip flop” and you should use Health Explorer to find that monitor on a few instances – and figure out why it is changing state so much in the past few days.  Common conditions for this one are badly written monitors that target a single instance object, but monitor a multi-instance perf counter.  You can read more on that HERE.  Also – just poor overall tuning can cause this – or poorly written custom script based monitors.

If you see a LOT of similar monitors at the top, with very similar state change counts, this is often indicative of HealthService restarts.  The Health service will submit new state change data every time it starts up.  So if the agent is bouncing every 10 minutes, that is a new state change for ALL monitors on that agent, every 10 minutes.  You can read more about this condition at THIS blog post.

Comments (23)

  1. Kevin Holman says:

    @himcdonough –

    Not at all. The DW grooms based on time retention only. The daily aggregations will be small. The hourly will generally consume more space. The good thing about aggregations is that they really aren’t any bigger with a flood of samples… only the RAW data will be.

  2. Anonymous says:

    I'd have to look at the code to see.  The concept is still true, and the grooming is still true…. but I wouldn't recommend running this cleanup on 2012 until I have a chance to validate against the current stored proc.

  3. Anonymous says:

    Is there a similar procedure for the SCOM Datawarehouse? Cause all the grooming procedure works fine on my datawarehouse ( debuglevel raised to 3) but I still get 40 days of state change (in vStateDaily/Hourly/Raw tables) which are older than the retention period of these tables..

  4. Anonymous says:

    It has value only in that you might groom out context for a monitor currently in a bad state.  My thought process is – that if nobody has reviewed the monitor in 7 days, or whatever your retention is…. then it likely isn't that important.  Groom it out – reset the monitor, and let it re-detect the issue and raise an alert if you really care about the problem.

  5. Anonymous says:

    Ok – I checked these out between this version and 2012.  The cleanup above will work in 2012 just like it did in 2007.  In 2012 – they DID modify the built in stored procedure to use a temp table and delete using that, instead of a direct delete statement with joins, but the result is the same.  I'll assume the change they made uses less I/O's or is faster, or creates a smaller transaction, but I don't know.  I'd have to test to see which is faster, etc.  If someone has a problem getting the cleanup to even complete (I haven't seen that happen) then let me know and I'll post a 2012 updated version that uses the temp table.

  6. Matthew Rogers says:

    I have been trying to track state changes but have always been discouraged because the results seemed so off.  This article enabled me to clean up and identify some real problem monitors!

    Thanks Kevin for another great article that helps keep Opsmgr running!

  7. This was just great work. .

    Thanks Kevin..

  8. Vincentk says:

    Kevin,

    Is this still applicable for scom 2012?

    Thnx 🙂

  9. VincentK says:

    Thanks. I will await for your response before I ruin my scom db 🙂

  10. LAMcG says:

    Re the modified SP:

    Do you recommend we drop the existing SP and install this new one, or just run this one in isolation?

  11. John Bradshaw says:

    So, what records are being set?…..I just found out a 715 for Datediff ! Not bad eh??

    Running the "cleanup" is taking a while… 🙂

  12. du5tE says:

    I can beat that… mine is at 1372!  Is there any reason I might not want to set this up to run a cleanup periodically, maybe once a week?  Does this old state data have any value at all?

  13. Steve says:

    I ran this on 2012 and although State grooming is set to default (7 days) it always returns 8 as the value from the DATEDIFF query.

    Any ideas why this doesn't show 7 days?  I rerun it and it stays on 8.

    Cheers

  14. Keithk2 says:

    Kevin,

    Any update if this is ok to run on OM2012?  I am in a situation where I am in need of running it.

    Thanks,

    Keith

  15. Keithk2 says:

    Confirmed.  Worked great in 2012.  Thanks *again* Kevin

  16. himcdonough says:

    curious, want to turn down my data retention for data warehouse and specifically state data (raw hourly and daily). does the same rule apply that grooming the dw will never remove the data if the generating monitor is disabled or in an unhealthy state.

  17. TonyM says:

    Could you please post an 2012 updated version that uses the temp table? the version posted here does not seem to run for me without an error:

    “Msg 208, Level 16, State 1, Line 3
    Invalid object name ‘StateChangeEvent’.”

  18. Craig Pero says:

    I upated the Query below since we have a lot of custom monitors and the name field is not friendly. The query becomes less efficient using the monitorview though.

    select distinct top 50 count(sce.StateId) as NumStateChanges, m.displayname, mt.typename AS TargetClass

    from StateChangeEvent sce with (nolock)
    join state s with (nolock) on sce.StateId = s.StateId
    join monitorview m with (nolock) on s.MonitorId = m.Id
    join managedtype mt with (nolock) on m.TargetMonitoringClassId = mt.ManagedTypeId

    where m.IsUnitMonitor = 1
    group by m.displayname,mt.typename
    order by NumStateChanges desc

  19. tony says:

    Once you find the monitor name how do you find the actual computer that is having issues, sorry I am very green with SCOM2012

  20. Anonymous says:

    This is designed as a high-level overview of the data the comes into SCOM and how to "tune" it. By tuning

  21. Venkman says:

    I’ve been cleaning up SCOM I inherited and this was very helpful. My top flapper is Microsoft.SystemCenter.ACS.Collector.DBConnectionState. I’m now to SCOm and would like to fix this. Using your query I see 2611232 state changes. Any ideas? Thanks

  22. SM says:

    Since Microsoft upgraded the Technet blogs to a new design, copy and pasting the SQL queries no longer parse correctly in SQL Management Studio. We have to find and replace all the apostrophes and commas in a plain text editor first.

  23. Mirza says:

    What are the modifications to be done for clearing Discovery data?