Performance data "missing" in the OpsMgr warehouse and Health Service Modules event 31553

We had an odd issue happen, which resulted in many relatively new servers not having their performance data in OpsMgr data warehouse. Alongside this symptom, we kept seeing the following errors raised in the Operations Manager event log on our management servers:

Log Name: Operations Manager
Source: Health Service Modules
Date: 8/8/2014 9:19:12 AM
Event ID: 31553
Task Category: Data Warehouse
Level: Error
Keywords: Classic
User: N/A
Computer: <<Management Server Name>>

Description:
Data was written to the Data Warehouse staging area but processing failed on one of the subsequent operations.

Exception 'SqlException': Sql execution failed. Error 2627, Level 14, State 1, Procedure ManagedEntityChange, Line 368, Message: Violation of UNIQUE KEY constraint 'UN_ManagedEntityProperty_ManagedEntityRowIdFromDAteTime'. Cannot insert duplicate key in object 'dbo.ManagedEntityProperty'. The duplicate key value is (17512, Apr 26 2013 9:50AM).

One or more workflows were affected by this.
Workflow name: Microsoft.SystemCenter.DataWarehouse.Synchronization.ManagedEntity
Instance name: c12a765d-2be4-404b-93fd-8ebc915d0dcc
Instance ID: {D31B8D2C-1C84-1D9D-D22D-983C985AA572}
Management group: <<MGName>>

The combination of event ID, event source and the specific exception test are important and anything else I talk about only applies to this particular combination. I say this because event ID 31553 can be raised for numerous exceptions and each one requires different steps.

After reading over this post on SystemCenterCentral.com from Stanislav Zhelyazkov, I was fairly convinced we were up against the same problem and seeing as I didn’t mind losing one or two rows of property changes, I decided to go along with the steps he provided and see if it helped. Well it did and it didn’t. It did help, in that we did not see the same exception again, but it didn’t help in that another exception of the same type (same ID, same SQL message, but a different duplicate key value) came up.

In short, we had the problem that Stanislav described, but with an unforeseeable number of rows in the ManagedEntityProptery table. Ugh!

So what does any of it have to do with getting performance data into the warehouse? In order to answer that question, I had to do some digging into how the warehouse works and here is what I learned:

  • Performance data can’t be inserted into the warehouse for manage entities that don’t exist in the warehouse yet

  • Much like the perf or state data sets, the managed entity data is put into a staging table, before it is officially processed into the warehouse ([dbo].[ManagedEntityStage])

  • We had a huge backlog of data in the staging table and nothing had been added to the live tables for months

    • SELECT COUNT(*) FROM [dbo].[ManagedEntityStage] WITH(NOLOCK) -- backlog in the stage table

    • SELECT MAX(DWCreatedDateTime) FROM vManagedEntity -- most recent record in the live table

 

So how does one fix this issue? Call CSS is the official answer, but you can take Stanislav’s steps and they will work. The side effect of deleting the row is that the given managed entity row property change will be replaced so if 100% data fidelity is a priority for you, then this is not a process you want to pursue. If you get into a situation like we did where there are numerous rows that are violating the key constraint though, then you’re signing up for a lot of tedium, so calling CSS is really the best option as they can help and it will get the attention of the System Center Operations Manager product group if this does end up being a fairly common issue that CSS gets called about.

What should we do to avoid this problem in the future? It’s not clear how to prevent the root problem, but you can look for this problem and react before it gets out of hand. In short, the process that is responsible for moving the data from the staging to the live table is encountering an unexpected issue that isn’t handled by the stored procedure ( [dbo].[ManagedEntityChange] ) and as a result the whole process rolls back. This will happen over and over again and nothing (that I’ve seen at least) outside of manual intervention will fix it.

So…

  • Step 1: Set up some monitoring for event ID 31553 on your management servers and look specifically for the “Violation of UNIQUE KEY constraint 'UN_ManagedEntityProperty_ManagedEntityRowIdFromDAteTime' ” exception text

  • Step 2: If you see this event, check to see how big your ManagedEntityStage backlog is (query above)

  • Step 3: If the backlog is small and you have a reliable backup of your warehouse and you can accept the data fidelity concerns this raises, then consider the steps that Stanislav has outlined.  Note: Any manual deletions will work but are entirely not supported so keep that in mind

  • Step 4: Call CSS, if budget allows

Cheers,

Cory