The case of the Dell (Detailed) MP – beware of large environments


 

This article is not just a warning about the Dell (Detailed) MP, but the danger of importing ANY management pack into your environment without fully understanding the intended scope, scalability, and any known/common issues.

I recently worked with a customer who had an interesting issue.  They had a very large agent-based monitoring environment (greater than 10,000 agents).  While performing a supportability review, we noticed that config generation was failing.  This was evidenced by the Config monitors showing red in the console, alerts being generated, events logged in the Operations Manager event logs on the management servers, and most notably by the fact that agents were not getting updated config in a timely fashion.

Events were similar to:

Log Name:      Operations Manager
Source:        OpsMgr Management Configuration
Event ID:      29181
Computer:      managementserver.domain.com
Description:
OpsMgr Management Configuration Service failed to execute 'SnapshotSynchronization' engine work item due to the following exception

Microsoft.EnterpriseManagement.ManagementConfiguration.DataAccessLayer.DataAccessException: Data access operation failed
   at Microsoft.EnterpriseManagement.ManagementConfiguration.DataAccessLayer.DataAccessOperation.ExecuteSynchronously(Int32 timeoutSeconds, WaitHandle stopWaitHandle)
   at Microsoft.EnterpriseManagement.ManagementConfiguration.SqlConfigurationStore.ConfigurationStore.ExecuteOperationSynchronously(IDataAccessConnectedOperation operation, String operationName)
   at Microsoft.EnterpriseManagement.ManagementConfiguration.SqlConfigurationStore.ConfigurationStore.EndSnapshot(String deltaWatermark)
   at Microsoft.EnterpriseManagement.ManagementConfiguration.Engine.SnapshotSynchronizationWorkItem.EndSnapshot(String deltaWatermark)
   at Microsoft.EnterpriseManagement.ManagementConfiguration.Engine.SnapshotSynchronizationWorkItem.ExecuteSharedWorkItem()
   at Microsoft.EnterpriseManagement.ManagementConfiguration.Interop.SharedWorkItem.ExecuteWorkItem()
   at Microsoft.EnterpriseManagement.ManagementConfiguration.Interop.ConfigServiceEngineWorkItem.Execute()
-----------------------------
System.Data.SqlClient.SqlException (0x80131904): Timeout expired.  The timeout period elapsed prior to completion of the operation or the server is not responding. —> System.ComponentModel.Win32Exception (0x80004005): The wait operation timed out
   at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at System.Data.SqlClient.SqlCommand.InternalEndExecuteReader(IAsyncResult asyncResult, String endMethod)
   at System.Data.SqlClient.SqlCommand.EndExecuteReaderInternal(IAsyncResult asyncResult)
   at System.Data.SqlClient.SqlCommand.EndExecuteReader(IAsyncResult asyncResult)
   at Microsoft.EnterpriseManagement.ManagementConfiguration.DataAccessLayer.ReaderSqlCommandOperation.SqlCommandCompleted(IAsyncResult asyncResult)
ClientConnectionId:724196c1-d9ec-4f29-8807-b16cab05fcc6

 

Our initial issue was due to the fact that the management servers were running Windows Server 2012 RTM with .NET 4.5.  There is a known issue with this combination, and we needed to install .NET 4.5.1 to resolve these timeouts.  This got us past the initial Snapshot config failures.

Next – we saw that Delta Config started failing:

Log Name:      Operations Manager
Source:        OpsMgr Management Configuration
Event ID:      29181
Computer:      managementserver.domain.com
Description:
OpsMgr Management Configuration Service failed to execute 'DeltaSynchronization' engine work item due to the following exception

Microsoft.EnterpriseManagement.ManagementConfiguration.DataAccessLayer.DataAccessException: Data access operation failed
   at Microsoft.EnterpriseManagement.ManagementConfiguration.DataAccessLayer.DataAccessOperation.ExecuteSynchronously(Int32 timeoutSeconds, WaitHandle stopWaitHandle)
   at Microsoft.EnterpriseManagement.ManagementConfiguration.CmdbOperations.CmdbDataProvider.GetConfigurationDelta(String watermark)
   at Microsoft.EnterpriseManagement.ManagementConfiguration.Engine.TracingConfigurationDataProvider.GetConfigurationDelta(String watermark)
   at Microsoft.EnterpriseManagement.ManagementConfiguration.Engine.DeltaSynchronizationWorkItem.TransferData(String watermark)
   at Microsoft.EnterpriseManagement.ManagementConfiguration.Engine.DeltaSynchronizationWorkItem.ExecuteSharedWorkItem()
   at Microsoft.EnterpriseManagement.ManagementConfiguration.Interop.SharedWorkItem.ExecuteWorkItem()
   at Microsoft.EnterpriseManagement.ManagementConfiguration.Interop.ConfigServiceEngineWorkItem.Execute()
-----------------------------
System.Data.SqlClient.SqlException (0x80131904): Timeout expired.  The timeout period elapsed prior to completion of the operation or the server is not responding. —> System.ComponentModel.Win32Exception (0x80004005): The wait operation timed out
   at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
   at System.Data.SqlClient.SqlDataReader.TryReadInternal(Boolean setTimeout, Boolean& more)
   at System.Data.SqlClient.SqlDataReader.Read()
   at Microsoft.EnterpriseManagement.ManagementConfiguration.CmdbOperations.EntityChangeDeltaReadOperation.ReadManagedEntitiesProperties(SqlDataReader reader)
   at Microsoft.EnterpriseManagement.ManagementConfiguration.CmdbOperations.EntityChangeDeltaReadOperation.ReadData(SqlDataReader reader)
   at Microsoft.EnterpriseManagement.ManagementConfiguration.DataAccessLayer.ReaderSqlCommandOperation.SqlCommandCompleted(IAsyncResult asyncResult)
ClientConnectionId:9d9ec759-e9bf-4c1e-a958-581377c630b3

We run a snapshot config every 24 hours by default.  We run a delta config every 30 seconds by default.  These are controlled via the ConfigService.config file located in the \Program Files\Microsoft System Center 2012 R2\Operations Manager\Server\ directory.  Delta config timing out was odd.  There can be many reasons for this, so the next step was to take a SQL trace and see what expensive queries were running.

If you want to see these in more clarity – the Config service logs these jobs to the CS.WorkItem table:

SELECT * FROM cs.workitem
ORDER BY WorkItemRowId DESC

You can filter these by Delta Sync or the daily Snapshot sync as well:

SELECT * FROM cs.workitem
WHERE WorkItemName like '%delta%'
ORDER BY WorkItemRowId DESC

SELECT * FROM cs.workitem
WHERE WorkItemName like '%snap%'
ORDER BY WorkItemRowId DESC

WorkItemStateId is the outcome of the job.  It is normal to see some failures; for instance, when multiple management servers try to execute the same job, some of those attempts will fail, by design.

1    Running
10    Failed
12    Abandoned
15    Timed out
20    Succeeded
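
To get a quick success/failure breakdown per work item type, you can aggregate on these state IDs.  This is a sketch against the OperationsManager database; the CASE labels simply mirror the table above, and column names may vary slightly between SCOM versions:

```sql
-- Summarize Config service work items by name and outcome.
-- Run against the OperationsManager database.
SELECT WorkItemName,
       CASE WorkItemStateId
            WHEN 1  THEN 'Running'
            WHEN 10 THEN 'Failed'
            WHEN 12 THEN 'Abandoned'
            WHEN 15 THEN 'Timed out'
            WHEN 20 THEN 'Succeeded'
       END AS Outcome,
       COUNT(*) AS WorkItemCount
FROM cs.WorkItem
GROUP BY WorkItemName, WorkItemStateId
ORDER BY WorkItemName, WorkItemStateId
```

A pattern of Delta Sync jobs moving from Succeeded to Timed out is a good early indicator of the config churn described below.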

What we found was that one of the MPs – the Dell Hardware MP – was consuming a large amount of SQL Server CPU time just to query some standard Managed Type views in the database, with many of these queries lasting over 10 minutes.

When we researched further, we found that the “Dell Windows Server (Detailed Edition)” management pack had been imported, and the documentation made no mention of scalability limitations.  However, we found that in a much older (4.x) version of the documentation, Dell specifically states that the Detailed MP is recommended only for small environments, where the monitored server count is less than 300 agents!  We had already discovered and were monitoring over 5,000 Dell servers.

This massive discovery data influx was also causing config churn, with workflow binding delays showing up as 2115 events for discovery data:

Log Name:      Operations Manager
Source:        HealthService
Event ID:      2115
Computer:      managementserver.domain.com
Description:
A Bind Data Source in Management Group Production has posted items to the workflow, but has not received a response in 1510 seconds.  This indicates a performance or functional problem with the workflow.
Workflow Id : Microsoft.SystemCenter.CollectDiscoveryData
Instance    : managementserver.domain.com
Instance Id : {B3FA7F2F-3D4A-236D-D3FD-119B3E01C3E3}

So, just delete the MP, right?

Well, let’s talk about what must happen when we delete an MP.  When you right-click an MP in the console to delete it, we must first delete any discovered instances of any classes defined in that MP (such as an instance of “Dell Server BIOS”).  In order to delete an instance of a class, we must first also delete ALL monitoring data associated with that instance.  And I don’t mean simply marking it as “deleted” in the database – it must actually be deleted transactionally from the tables.  This means all alerts, all monitor-based state changes, all events, all performance data, etc.  This can be MASSIVE overhead.

What we actually experienced was the console locking up.  We could track the SQL statements trying to delete the management pack and all the instance data, but these would eventually time out and never return anything to the console.  The operation would simply go away, all the while our MP still existed.

So what can we do?

Well, we do have a possible solution: the Remove-SCOMDisabledClassInstance PowerShell cmdlet.  This cmdlet allows us to delete the discovered instance data methodically, and slowly.  What this cmdlet does is delete any discovered instances in the management group whose discovery is explicitly disabled via override.

So – we find all the discoveries in the Dell Detailed MP, and we create a new override MP to store a disable override for each discovery.  Then, we run Remove-SCOMDisabledClassInstance.  This will run and run and run… seemingly forever, until it returns with no errors.  In many cases, even this cmdlet will time out or crash with an exception, which can be normal when deleting a massive amount of data.
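
As a sketch of that first step, assuming the OperationsManager PowerShell module is available and that the display names below match your environment (the name patterns are illustrative, not exact):

```powershell
# Retrieve the Dell Detailed MP and a pre-created override MP,
# then explicitly disable every discovery in the Dell MP,
# storing the overrides in the override MP.
Import-Module OperationsManager

$dellMP     = Get-SCOMManagementPack -DisplayName "*Dell*Detailed*"   # adjust pattern to your MP
$overrideMP = Get-SCOMManagementPack -DisplayName "Dell Detailed Disable Overrides"

Get-SCOMDiscovery -ManagementPack $dellMP |
    Disable-SCOMDiscovery -ManagementPack $overrideMP -Enforce

# With the discoveries explicitly disabled, the instances become
# candidates for deletion:
Remove-SCOMDisabledClassInstance
```

The -Enforce switch marks the overrides as enforced so they win over any competing enable overrides; verify the override MP contents in the console before running the removal.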

One trick to help with this process – is to set your state, performance, and event retention in the OpsDB to ONE day, then run grooming.  This will greatly reduce the amount of data we must delete transactionally.
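
One way to do that, assuming the SCOM 2012 grooming cmdlets, is sketched below.  Record your current retention values first so you can restore them afterward:

```powershell
# Temporarily reduce OpsDB retention to 1 day to shrink the data
# that must be deleted transactionally.  Note your current settings first!
Import-Module OperationsManager
Get-SCOMDatabaseGroomingSetting | Format-List

Set-SCOMDatabaseGroomingSetting -EventDaysToKeep 1 `
                                -PerformanceDataDaysToKeep 1 `
                                -StateChangeEventDaysToKeep 1

# Grooming normally runs on its own schedule; to run it immediately,
# execute the partitioning and grooming procedure in the OpsDB:
#   EXEC p_PartitioningAndGrooming
```

Remember to set the retention values back to your standard configuration once the cleanup is complete.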

Then – just keep running Remove-SCOMDisabledClassInstance.  In this specific case, because the amount of data was so large, it actually took over a day and probably over 100 executions, before the instances were all removed.  You can track the instances being removed, by creating a query that counts the records in the Managed Type tables you are deleting from.  Here is part of the one I crafted for this MP:

SELECT SUM(TCount) AS TotalCount
FROM
(
SELECT COUNT(*) AS TCount
FROM MT_Dell$WindowsServer$Server
UNION ALL
SELECT COUNT(*) AS TCount
FROM MT_Dell$WindowsServer$BIOS
UNION ALL
SELECT COUNT(*) AS TCount
FROM MT_Dell$WindowsServer$Detailed$MemoryUnit
UNION ALL
SELECT COUNT(*) AS TCount
FROM MT_Dell$WindowsServer$Detailed$ProcUnit
UNION ALL
SELECT COUNT(*) AS TCount
FROM MT_Dell$WindowsServer$Detailed$PSUnit
UNION ALL
SELECT COUNT(*) AS TCount
FROM MT_Dell$WindowsServer$EnclosurePhysicalDisk
UNION ALL
SELECT COUNT(*) AS TCount
FROM MT_Dell$WindowsServer$ControllerConnector
) AS T

As you run the Remove-SCOMDisabledClassInstance command, you will see these instance counts slowly eroding.  You just have to keep running it until it completes without a timeout or an exception.

Once the instance count gets to zero… you can delete the MP.  We found that this time the MP deleted in seconds!

Now that this MP was gone, the expensive queries stopped, and we saw the binding on discovery data go back to a more reasonable occurrence count and time value.

 

The lesson to learn here is: be careful when importing MPs.  A badly written MP, or an MP designed for small environments, might wreak havoc in larger ones.  Sometimes the recovery from this can be long and quite painful.  An MP that tests out fine in your dev SCOM environment might have issues that won’t be seen until it moves into production.  You should always monitor for changes to a production SCOM deployment after a new MP is brought in, to ensure that you don’t see a negative impact.  Check the management server event logs, management server CPU utilization, database size, and disk/CPU performance to see if there is a big change from your established baselines.

If you are designing a large agent deployment that nears our maximum scalability (currently 15,000 agents), great consideration must go into the management packs in scope.  If you require management packs that discover a large instance space per agent, and/or have a large number of workflows, you might find that you cannot achieve the maximum scale.


Comments (9)

  1. Kevin Holman says:

    @Andre –

    There shouldn’t be a reason to run Remove-SCOMDisabledClassInstance on a daily or even regular basis, as this will only be needed when creating an explicit disable via override to a discovery. However, at the same time, it should not hurt anything to do that.

  2. sathya prakash says:

    hi sir,

    Article was excellent. Thanks for sharing it.

    regards
    starsathya03@gmail.com

  3. Retep says:

    Nice article – I have the same problem after updating to .NET 4.5.2 – with the "A Bind Data Source in Management Group Production has posted items to the workflow, but has not received a response in 1510 seconds. This indicates a performance or functional problem with the workflow.
    Workflow Id : Microsoft.SystemCenter.CollectDiscoveryData
    Instance : managementserver.domain.com
    Instance Id : {B3FA7F2F-3D4A-236D-D3FD-119B3E01C3E3}"
    but you say: "What we found – was one of the MP’s – the Dell Hardware MP – was consuming a large amount of SQL server CPU time, just to queries some standard Managed Type views in the database, many of these lasting over 10 minutes." – how did you find out about this? Are there some queries or SQL Profiler-specific settings?

    Thanks for posting 🙂

  4. Retep says:

    I see that you are receiving a few 2115s. Did you try checking the connection between MS & the SQL Server hosting OpsDB?
    You can try a UDL file test -> Create a file by name TestCon.udl in the management server. Open it and try connecting to the SQL server and check if the connection looks okay.

  5. Andre Prins says:

    thanks,

    we have the same issues, and trying to get it resolved.

    I saw somewhere another way of running the Remove-SCOMDisabledClassInstance command, and I built that into a loop, so just kick it off, and it will go on and on until it finally runs without exception:

    [Reflection.Assembly]::Load("Microsoft.EnterpriseManagement.Core, Version=7.0.5000.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35")
    [Reflection.Assembly]::Load("Microsoft.EnterpriseManagement.OperationsManager, Version=7.0.5000.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35")

    $exit = $false
    $i = 0
    do
    {
        write-host ""
        write-host "Attempt $i ***************************************"
        write-host ""
        $i += 1
        try
        {
            $mg = [Microsoft.EnterpriseManagement.ManagementGroup]::Connect("localhost")
            $mg.EntityObjects.DeleteDisabledObjects()
            "completed successfully so time to exit the loop"
            $exit = $true
        }
        catch { write-host $error[0].exception }
    } until ($exit -eq $true)

    I intend to create a scheduled task which I execute every day, just to keep SCOM clean… and with the above script it always runs till it has finished cleaning…

  6. WalterGomez says:

    good day,

    I have the same problem, errors 2115 and 29181, and the Dell management pack is not installed

  7. JF_83 says:

    Thanks Kevin! We ran into the same issue with the dell MP …

