Event ID 2115 – A Bind Data Source in Management Group


I see this event a lot in customer environments.  I am not an expert on troubleshooting it, but I saw this post in the MS newsgroups and felt it was worth capturing.

My experience has been that it is MUCH more common to see these when a management pack collects far too much discovery data than because of any real performance problem with the data warehouse.  In most cases, if the issue started right after bringing in a new MP, deleting that MP solves the problem.  I have seen this repeatedly after importing the Cluster MP or the Exchange 2007 MP, but I haven't been able to fully investigate the root cause yet:

 

In a nutshell: if they are happening just a couple of times an hour, and the time in seconds is fairly low (under a few minutes), then this is normal.

If they are happening very frequently – like every minute – and the times are increasing, then there is an issue that needs to be resolved.

 

Taken from the newsgroups:

——————————————-

In OpsMgr 2007, one of the main performance concerns is DB/DW data insertion. Here is a description of how to identify and troubleshoot problems with DB/DW data insertion.

Symptoms:

DB/DW write action workflows run on a Management Server. They first keep data received from an Agent/Gateway in an internal buffer, then they create a batch of data from the buffer and insert that batch into the DB/DW. When insertion of the first batch finishes, they create another batch and insert it.  The size of a batch depends on how much data is available in the buffer when the batch is created, but there is a maximum limit: a batch can contain up to 5000 data items.  If the incoming data item throughput (from Agent/Gateway) increases, or the data item insertion throughput (to DB/DW) decreases, the buffer will accumulate more data and the batch size will tend to grow.  There are different write action workflows running on an MS; they handle data insertion to the DB/DW for different types of data:

  • Microsoft.SystemCenter.DataWarehouse.CollectEntityHealthStateChange
  • Microsoft.SystemCenter.DataWarehouse.CollectPerformanceData
  • Microsoft.SystemCenter.DataWarehouse.CollectEventData
  • Microsoft.SystemCenter.CollectAlerts
  • Microsoft.SystemCenter.CollectEntityState
  • Microsoft.SystemCenter.CollectPublishedEntityState
  • Microsoft.SystemCenter.CollectDiscoveryData
  • Microsoft.SystemCenter.CollectSignatureData
  • Microsoft.SystemCenter.CollectEventData

When a DB/DW write action workflow on a Management Server notices that the insertion of a single data batch is slow (i.e., slower than 1 minute), it will start logging a 2115 event to the Operations Manager event log once every minute until the batch is inserted into the DB/DW or is dropped by the DB/DW write action module.  So you will see 2115 events in the Management Server's "Operations Manager" event log when it is slow to insert data to the DB/DW.  You might also see 2115 events when there is a burst of data items coming to the Management Server and the number of data items in a batch is large.  (This can happen when a large amount of discovery data is being inserted – from a freshly imported or noisy management pack.)
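If you suspect a noisy management pack, a query along the following lines, run against the data warehouse, can show which classes are churning the most discovery properties.  This is a sketch based on the ManagedEntityProperty/ManagedEntity/ManagedEntityType tables in the OperationsManagerDW database; adjust the date filter for your environment:

select top 20 met.ManagedEntityTypeSystemName, count(*) as PropertyChanges
from ManagedEntityProperty mep
   join ManagedEntity me on (mep.ManagedEntityRowId = me.ManagedEntityRowId)
   join ManagedEntityType met on (me.ManagedEntityTypeRowId = met.ManagedEntityTypeRowId)
where mep.FromDateTime > dateadd(dd, -1, getutcdate())   -- property changes in the last day
group by met.ManagedEntityTypeSystemName
order by PropertyChanges desc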

2115 events carry two important pieces of information: the name of the workflow that has the insertion problem, and the pending time since the workflow started inserting the last data batch.  Here is an example of a 2115 event:

————————————

A Bind Data Source in Management Group OpsMgr07PREMT01 has posted items to the workflow, but has not received a response in 3600 seconds.  This indicates a performance or functional problem with the workflow.

Workflow Id : Microsoft.SystemCenter.CollectSignatureData

Instance    : MOMPREMSMT02.redmond.corp.microsoft.com

Instance Id : {6D52A6BB-9535-9136-0EF2-128511F264C4}

——————————————

This 2115 event is saying that the DB write action workflow "Microsoft.SystemCenter.CollectSignatureData" (which writes performance signature data to the DB) is trying to insert a batch of signature data into the DB; it started inserting 3600 seconds ago but the insertion has not finished yet. Normally, insertion of a batch should finish within 1 minute.

Normally, there should not be many 2115 events on a Management Server. If they happen less than 1 or 2 times every hour (per write action workflow), it is not a big concern, but if they happen more often than that, there is a DB/DW insertion problem.

The following performance counters on the Management Server give information about DB/DW write action insertion batch size and insertion time. If the batch size is growing (by default the maximum batch size is 5000), it means the Management Server is either slow in inserting data to the DB/DW or is getting a burst of data items from an Agent/Gateway. From the DB/DW write action's Avg. Processing Time, you can see how much time it takes to write a batch of data to the DB/DW.

  • OpsMgr DB Write Action Modules(*)\Avg. Batch Size
  • OpsMgr DB Write Action Modules(*)\Avg. Processing Time
  • OpsMgr DW Writer Module(*)\Avg. Batch Processing Time, ms
  • OpsMgr DW Writer Module(*)\Avg. Batch Size

Possible root causes:

  • In OpsMgr, discovery data insertion is relatively expensive, so a discovery burst (a short period of time when a lot of discovery data is received by the Management Server) can cause 2115 events complaining about slow insertion of discovery data.  Discovery insertion should not happen frequently, so if you consistently see 2115 events for discovery data collection, you either have a DB/DW insertion problem or some discovery rules in an MP are collecting too much discovery data.
  • An OpsMgr config update caused by an instance space change or an MP import will impact CPU utilization on the DB and will have an impact on DB data insertion.  After importing a new MP, or after a big instance space change in a large environment, you will probably see more 2115 events than normal.
  • Expensive UI queries can impact resource utilization on the DB and could have an impact on DB data insertion. When a user is running an expensive UI operation, you will probably see more 2115 events than normal.
  • When the DB/DW is out of space or offline, you will find the Management Server keeps logging 2115 events to the event log, and the pending time grows higher and higher.
  • Sometimes an invalid data item sent from an Agent/Gateway will cause a DB/DW insertion error, which ends up as a 2115 event complaining about slow DB/DW insertion. In this case, check the OpsMgr event log for relevant error events. This is more common in the DW write action workflows.
  • If the DB/DW hardware is not configured properly, there could be a performance issue, and it could cause slow data insertion to the DB/DW. The problem could be:
    • The network link between the DB/DW and the MS is slow (either bandwidth is low or latency is high; as a best practice we recommend the MS be on the same LAN as the DB/DW).
    • The data / log / tempdb disks used by the DB/DW are slow (we recommend separating data, log, and tempdb onto different disks; we recommend RAID 10 instead of RAID 5; we also recommend turning on the write cache of the array controllers).
    • The OpsDB tables are too fragmented (this is a common cause of DB performance issues).  Reindexing the affected tables will resolve this; see the sketch after this list.
    • The DB / DW does not have enough memory.
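
As called out in the fragmentation bullet above, a quick check and rebuild might look like the following.  This is a minimal sketch using the standard SQL Server DMV, run against the OperationsManager database; the table and index names in the rebuild statement are placeholders, so substitute whichever indexes the first query flags:

-- Find heavily fragmented indexes in the current database
select object_name(ips.object_id) as TableName, i.name as IndexName,
       ips.avg_fragmentation_in_percent, ips.page_count
from sys.dm_db_index_physical_stats(db_id(), null, null, null, 'LIMITED') ips
   join sys.indexes i on ips.object_id = i.object_id and ips.index_id = i.index_id
where ips.avg_fragmentation_in_percent > 30 and ips.page_count > 1000
order by ips.avg_fragmentation_in_percent desc

-- Rebuild one of the affected indexes (placeholder names)
alter index [IX_SomeIndex] on [dbo].[SomeTable] rebuild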

 

Now – that is the GENERAL synopsis and how to attack them.  Next – we will cover a specific issue we are seeing with a specific type of 2115 Event:

———————————————–

It appears we may be hitting a cache resolution error we have been trying to catch for a while. This concerns the CollectEventData workflow.  The error is very hard to catch, and we are including a fix in SP2 to avoid it.  There are two ways to resolve the problem in the meantime.  Since the error happens very rarely, you can just restart the Health Service on the Management Server that is affected.  Or you can prevent it from blocking the workflow by creating overrides in the following way:

———————————————–


1) Launch the Console, switch to the Authoring space, and click "Rules"
2) In the top right-hand side of the screen, click "Change Scope"
3) Select "Data Warehouse Connection Server" in the list of types, then click "OK"
4) Find the "Event data collector" rule in the list of rules
5) Right-click the "Event data collector" rule and select Overrides / Override the Rule / For all objects of type…
6) Set Max Execution Attempt Count to 10
7) Set Execution Attempt Timeout Interval Seconds to 6
That way, if the DW event writer fails to process an event batch for about a minute, it will discard the batch.  2115 events related to DataWarehouse.CollectEventData should go away after you apply these overrides.  By the way, while you're at it, you may want to override "Max Batches To Process Before Maintenance Count" to 50 if you have a relatively large environment.  We think 50 is a better default setting than SP1's 20 in this case, and we'll switch the default to 50 in SP2.

————————————————-

 

Essentially, to know if you are affected by the specific 2115 issue described above, here are the criteria:

 

1.  You are seeing 2115 bind events in the OpsMgr event log of the RMS or MS, and they are recurring every minute.

2.  The events have a Workflow ID of:  Workflow Id : Microsoft.SystemCenter.DataWarehouse.CollectEventData

3.  The “has not received a response” time is increasing, and growing to be a very large number over time.

 

Here is an example of an MS with the problem.  Note the consecutive events, from the CollectEventData workflow, occurring every minute, with the time being a large number and increasing:

 

Event Type:      Warning
Event Source:   HealthService
Event Category:            None
Event ID:          2115
Date:                5/5/2008
Time:                2:37:06 PM
User:                N/A
Computer:         MS1
Description:
A Bind Data Source in Management Group MG1 has posted items to the workflow, but has not received a response in 706594 seconds.  This indicates a performance or functional problem with the workflow.
Workflow Id : Microsoft.SystemCenter.DataWarehouse.CollectEventData
Instance    : MS1.domain.com
Instance Id : {646486D0-E366-03CA-38E7-79A0D6F34F82}

 

Event Type:      Warning
Event Source:   HealthService
Event Category:            None
Event ID:          2115
Date:                5/5/2008
Time:                2:36:05 PM
User:                N/A
Computer:         MS1
Description:
A Bind Data Source in Management Group MG1 has posted items to the workflow, but has not received a response in 706533 seconds.  This indicates a performance or functional problem with the workflow.
Workflow Id : Microsoft.SystemCenter.DataWarehouse.CollectEventData
Instance    : MS1.domain.com
Instance Id : {646486D0-E366-03CA-38E7-79A0D6F34F82}

 

Event Type:      Warning
Event Source:   HealthService
Event Category:            None
Event ID:          2115
Date:                5/5/2008
Time:                2:35:03 PM
User:                N/A
Computer:         MS1
Description:
A Bind Data Source in Management Group MG1 has posted items to the workflow, but has not received a response in 706471 seconds.  This indicates a performance or functional problem with the workflow.
Workflow Id : Microsoft.SystemCenter.DataWarehouse.CollectEventData
Instance    : MS1.domain.com
Instance Id : {646486D0-E366-03CA-38E7-79A0D6F34F82}

Comments (36)

  1. Anonymous says:

    That is used by the warehouse for all maintenance jobs… including aggregation.  

    To reduce the impact by that job… you can focus on:

    1.  Increasing the disk I/O and server resources for the data warehouse database.

    2.  Reduce the amount of data going into the warehouse, and reduce the retention.

    You really need to find out exactly what is causing the blocking when this runs… to determine the best course of action.  A SQL DBA with SQL Profiler in hand should be able to identify the major causes…

    How big is your warehouse?  Agent count?

  2. Anonymous says:

    Kevin –

    I am experiencing the problems you describe for the CollectEventData workflows. If I follow your workaround, will I be preventing specific types of event collection? Can you provide more detail on what is causing this?

    Thanks!

    Megan

  3. Anonymous says:

    The IIS MP has some frequent and noisy discoveries.  Many run every hour.  I like to modify the frequency of those to once per day.

  4. Anonymous says:

    That is from the Exchange 2007 conversion MP – unfortunately – the event is created from script – if you turn off that workflow – you will also turn off the script.  🙁

    This is not a problem in the new native MP coming out with R2.  That event will be a top consumer – but should not flood the database – it just will be at the top of the list.

  5. Anonymous says:

    I had the same symptoms (and pretty much all of them) as discussed above.  My problem seemed to be that I had inserted an account into the ‘Data Warehouse SQL Server Authentication Account’ ‘Run As Account’ where I should have had a space…. as noted here:

    http://www.eggheadcafe.com/conversation.aspx?messageid=30315729&threadid=30282292

    This stopped the 2115 errors immediately.

  6. Anonymous says:

    No – we will simply drop batches of events that get stuck and hold up the queue.

  7. Anonymous says:

    So – that hotfix – 969130 – simply allows dropping of old event tables.  Their existence will not really impact event insertion into the DW – so that is why that didn't work.  Also – that could only possibly affect the DW.CollectEventData 2115, and no others.

    The MEP table query dealt with discovery data.  This can be a problem when management packs run discoveries that constantly update discovery data with properties that change frequently.  If your only 2115 is from DW.CollectDiscoveryData – then a deeper analysis of discovered properties is in order.

    The best queries I have seen for that are here:

    http://nocentdocent.wordpress.com/2009/05/23/how-to-get-noisy-discovery-rules/

  8. Anonymous says:

    You should not make any overrides for this.

    The overrides in this article handled a very specific issue with events, and it is NOT applicable to perf collection.

    If you continuously have issues with perf insertion into the data warehouse – your warehouse is likely not performing well.  Look for blocking, and for avg disk sec/write values.
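
    For example, a quick way to check for blocking and per-file disk latency from the SQL side (a minimal sketch using standard SQL Server DMVs, nothing OpsMgr-specific):

    -- Requests that are currently blocked, and who is blocking them
    select r.session_id, r.blocking_session_id, r.wait_type, r.wait_time, t.text
    from sys.dm_exec_requests r
       cross apply sys.dm_exec_sql_text(r.sql_handle) t
    where r.blocking_session_id <> 0

    -- Average read/write latency per database file (rough equivalent of avg disk sec/read and sec/write)
    select db_name(vfs.database_id) as DatabaseName, mf.physical_name,
           vfs.io_stall_read_ms / nullif(vfs.num_of_reads, 0) as avg_read_ms,
           vfs.io_stall_write_ms / nullif(vfs.num_of_writes, 0) as avg_write_ms
    from sys.dm_io_virtual_file_stats(null, null) vfs
       join sys.master_files mf on vfs.database_id = mf.database_id and vfs.file_id = mf.file_id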

  9. Anonymous says:

    @Seth – I have heard about this… scalability issues with Unix monitoring.  Some good rules of thumb:

    1.  Dedicate the management server for cross platform monitoring and do not assign Windows agents to it.

    2.  Place the Management server health service queue on very fast disk with large IOPS capability and very low latency (RAID10 with 4 spindles, 15k drives, Tier 1 SAN, etc..)

    3.  Use physical hardware for the MS when scaling at maximum, with more than the minimum hardware requirements available for memory, CPU, etc…

    4.  Be very careful with what you write as far as custom workflows against the cross platform systems, as these can add additional load and will affect scale.

    My understanding is that we can scale up to 500 cross platform agents per MS when the above are met, and when using the built-in MPs for base-level Xplat monitoring.

  10. Anonymous says:

    That override was not a fix to address all 2115’s.  It was only to address a specific situation with 2115 events of a Workflow Id : Microsoft.SystemCenter.DataWarehouse.CollectEventData.

    1.  Are ALL (at least 99%) your 2115 events coming from the above workflow?  If they are – then apply this override – and bounce the healthservice on your affected management server (in a cluster – take offline and then back online)  and you might consider clearing out the old healthservice cache.

    2.  If they are NOT all from DataWarehouse.CollectEventData, and are from random sources, the next step is to see if they are all from a DataWarehouse workflow ID, or if some are and some are not.  In either case, this is typically SQL database performance related.  Bounce the healthservice and see if they come back immediately, or if they take some time before you see them.

  11. Anonymous says:

    So – here is what I look at with 2115’s.

    1.  Look for a pattern – do the 2115’s happen at a specific time or random?  If a pattern – look for other jobs that might be running at that time, like a backup – or SQL DBA maintenance plans.

    2.  Look at the 2115’s… do they come from a single datasource/workflow… or multiple?  The workaround I posted only applies if they are ALL from the collectevent and data warehouse workflow.

    3.  Random 2115’s with LOW times… (under 300 seconds) are normal… as long as we recover.  If they have longer times associated with them… that is indicative of a SQL perf issue, or blocking on the DB.  SQL perf is caused by keeping too much data in the DB, bad disk/memory I/O, not enough spindles for the DB, DB and logs not being in distinct volumes/spindle sets, poor SAN performance, too many agents in the management group, other jobs stepping on the DB, too many consoles open, etc….

  12. Anonymous says:

    The gateway approval tool failed with following error:

    “Unhandled Exception: System.IO.FileNotFoundException: Could not load file or assembly ‘Microsoft.Mom.DataAccessLayer, Version=6.0.4900.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35′ or one of its dependencies. The system cannot find the file specified.

    File name: ‘Microsoft.Mom.DataAccessLayer, Version=6.0.4900.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35′

      at GatewayInsertionTool.Program.Main(String[] args)”

    Any help will be appreciated.

    Ashutosh

  13. Anonymous says:

    One other question. Can this problem cause agents to receive the event "Alert generated by Send Queue % Used Threshold"?

  14. Anonymous says:

    You are correct – I did.  This was a copy/paste from a newsgroup posting.

  15. Anonymous says:

    I believe so – if the management server queue is also blocked.  One customer I worked with had a lot of send queue % alerts…. and these cleared up when we implemented this change.

  16. Anonymous says:

    Thank you! It appears this fix has cleared up my issue as well including the Send Queue Alerts. Thanks!

  17. Anonymous says:

    When did you put in these overrides?

    If this just started – and the overrides have been in place for some time…. and SQL I/O performance is good… and this is ONLY coming from the warehouse collect event data source – then I would look for:

    1.  Blocking on the SQL server processes – check Activity monitor…. if performance counters look good on SQL – we can still have an insert problem if something is causing blocking.

    2.  Something is flooding events.  Run the most common event query from my SQL query blog – and see if you can determine the source.
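
    Something along these lines should surface the flood (a sketch assuming the EventAllView view in the OperationsManager database; verify the view and column names against your version):

    select top 20 Number as EventID, PublisherName as EventSource, count(*) as TotalEvents
    from EventAllView with (nolock)
    group by Number, PublisherName
    order by TotalEvents desc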

  18. Rich says:

    I think you missed  

    Microsoft.SystemCenter.CollectPerformanceData

    from your workflows.

  19. sb says:

    I have applied all 3 overrides:

    – Max Execution Attempt Count to 10

    – Execution Attempt Timeout Interval Seconds to 6

    – Max Batches To Process Before Maintenance Count to 50

    but 2115 warnings on my RMS server are still arriving.  Note that I have 2 physical RMS nodes and a virtual RMS (cluster).

    regards,

  20. Thomas says:

    Thanks Kevin for this great blog.

    I have been experiencing this error for a week now. After about 12 hours of working fine, the 2115 warnings start appearing. After some minutes, there are only 2115 errors in the log and the RMS turns gray. The Workflow IDs include all possibilities; they seem to appear in a cyclic pattern relative to each other. The workaround didn't fix it, unfortunately. I suspect a SQL performance problem. Before migrating the whole system to a new, better server, I just wanted to make sure that this might be the problem.

  21. Mike Ory says:

    I just started getting these every 5 – 10 minutes.

    All coming from Microsoft.SystemCenter.DataWarehouse.CollectEventData

    They come back right away after bouncing the service. No MP’s have been added/deleted in about 3 months.

    The performance of the SQL server looks great. The db and logs are on different volumes.

    I’ve set the 3 overrides as described above. I’ve rebooted my SQL server and then the RMS.

    Any other ideas?

  22. Mike Ory says:

    Thanks Kevin.

    Yes the overrides have been in place for a long time actually. I’m not seeing any blocking, but using your queries I found that a ton of events from our Exchange servers may be the problem.

    I’m going to disable the event collection rule: "Execute: Test-ServiceHealth diagnostic cmdlet", which seems to be one of the major contributors, and see what happens.

    And by the way, you say that you are not an expert on this subject? I have to respectfully disagree.  😉

  23. Mike Ory says:

    Turns out we DO have blocking.

    This is causing the issue, but I’m not sure what can be done about it:

    exec StandardDatasetMaintenance

  24. Dennis says:

    Hey

    In our environment we have the same problem, but nearly all 2115 errors are coming from the workflow Microsoft.SystemCenter.DataWarehouse.CollectPerformanceData, with times ranging from 1 minute to 5 minutes before the cache is written to the DW. I tried making overrides on the Performance Data Collector analogous to the Event Data Collector, although the values aren't exactly the same.

    I’ve put these settings for now:

    Maximum number of data items batches to process before maintenance: 50

    Subsequent execution attempt timeout interval in seconds: 6

    Maximum write operation execution attempt count: 10

    This still generates a lot of 2115 errors, together with constant alerts of the type "Performance data collection process unable to write data to the Data Warehouse", which get closed within the next few minutes.

    Anything you can recommend?

  25. Mike Ory says:

    I’ve got a case open with Microsoft on this. I can tell you that one of the things they had me do (which didn’t work for me, but may work for you) is to install this hotfix:

    http://support.microsoft.com/kb/969130

    They also had me run this query, which I think tells me where most of my DW writes are coming from?

    select top 20 met.ManagedEntityTypeSystemName, count(*)
    from ManagedEntityProperty mep
       join ManagedEntity me on (mep.ManagedEntityRowId = me.ManagedEntityRowId)
       join ManagedEntityType met on (me.ManagedEntityTypeRowId = met.ManagedEntityTypeRowId)
    where mep.FromDateTime > '2009-01-01'
    group by met.ManagedEntityTypeSystemName
    having (count(*)) > 5
    order by 2 desc

  26. Mike Ory says:

    Ok, that’s good stuff. I ran the ‘Discovered Objects in the last 4 hours’ query and found that Microsoft.Windows.InternetInformationServices.2003.FTPSite has 36 changes.

    I can tell you for sure that we have not added any FTP sites in quite a while…

  27. Augusto says:

    Great blog.

    Just wondering where I can find the ‘Discovered Objects in the last 4 hours’ query.

    Thanks.

  28. John says:

    A Bind Data Source in Management Group ABC has posted items to the workflow, but has not received a response in 122 seconds.  This indicates a performance or functional problem with the workflow.

    Workflow Id : Microsoft.SystemCenter.DataWarehouse.CollectEntityHealthStateChange

    Instance    : MS4.ABC.local

    Instance Id : {43BE45BE-573D-AD34-B4333-3673F673BE32}

    These come 4 times within a couple of minutes (first 61 sec, then 122, 183, and then 245), but they also come 20 times in an hour. I have a clustered DB with 64 GB RAM, a clustered RMS with 16 GB, and 6 MSs with 8 GB each. There are (for now) 32 agents connected to one MS, and even when I move agents to another MS, the events appear on the "new" MS. It shouldn't be a performance issue – any ideas?

  29. mrbsmallz says:

    Worked like a charm for our OpsMgrR2 on VMWARE virtualized environment that is supporting 250 agents!  Thank you.

  30. Seth says:

    I just battled this issue on the phone with MSFT for hours… it turns out our 181 UNIX servers reporting to a single SCOM MS were causing this.  The UNIX boxes have been reporting to this SCOM MS for 2-3 months, so the MSFT Performance team is going to contact me tomorrow to run some tests… apparently UNIX is a disk I/O hog.  1 UNIX server = 10 Windows servers in disk I/O.

  31. Paul Stonehewer says:

    Hi Kevin, I had event 2115 every 60 seconds covering all the Microsoft.SystemCenter.xxxxxx workflows. I looked at your fix and applied it; this just stopped the errors, it did not fix my problem, which was the RMS server (the only MS in the group) showing greyed out.

    I reviewed all the changes over the last couple of days since the problem occurred and noticed that the Management Server Action Account that had been in the Default Action Account profile for the RMS server had been changed to Local System.

    I changed this back, all my errors stopped, and the server's health was OK. Lesson learned…

    Keep up the good work

    Paul

  32. Prabhu V says:

    Hi Kevin,

    Not sure if this is really strange or if I am missing something here.

    In my environment I have been having a lot of 2115 errors, many of them for "CollectPublishedEntityState" and "CollectAlerts" data.

    Upon looking through some blogs and the relevant workflow IDs, the majority are "Workflow Id: Microsoft.SystemCenter.CollectDiscoveryData", though I do find some for "Workflow Id: Microsoft.SystemCenter.DataWarehouse.CollectEventData".

    I haven't been able to crack the first workflow ID. Any help on this, please?

  33. Shiva says:

    Hi Kevin,

    I need your help with this issue. We are literally getting the alert "Management server reached the quota limit" and are also unable to discover any TFS build servers. I would request you to help me on this if you have created any blogs for it.

  34. Daya Ram says:

    I am getting 2115 events from Workflow Id: Microsoft.SystemCenter.DataWarehouse.CollectEntityHealthStateChange only, and they are generated about every hour. Data warehouse SQL performance is also good. Please suggest how to troubleshoot these.

  35. Kevin Holman says:

    @Daya – if that is the only one you are receiving, that's odd; usually there will be other events that help us understand what's wrong. I'd recommend opening a support case IF these are values that are incrementing. If the times stay low (less than 5 minutes), you might just be overloaded in the warehouse during aggregations, and you just need to do some tuning or get better disk IO.