Tuning tip – turning off some over-collection of events


We often think of tuning OpsMgr in terms of tuning “Alert Noise”: disabling rules that generate alerts we don’t care about, or modifying thresholds on monitors to make the alerts more actionable for our specific environment.

However – one area of OpsMgr that often goes overlooked is event overcollection.  This has a cost, because these events are collected and create LAN/WAN traffic, agent overhead, OpsDB size bloat, and especially DataWarehouse size bloat.  I have worked with customers who had a data warehouse that was over one third event data, and they had ZERO requirement for this, nor did they want it.  They were paying for disk storage and backup expense, plus added time and resources on the framework, all for data they cared nothing about.

MOST of these events are enabled out of the box, and come from default OpsMgr collection rules in the “System Center Core Monitoring” MP.  These events are items like “config requested”, “config delivered”, and “new config active”.  They might be interesting, but there is no advanced analysis included that uses them to detect a problem.  In small environments, they are not usually a big deal.  But in environments with a large agent count, these events can account for a LOT of data, and provide little value unless you are doing something advanced in analyzing them.  I have yet to see a customer who did that.

 

At a high level – here is how I like to review these events:

  1. Run the Most Common Events query against your OpsDB and review the results.
  2. Create a “My Workspace” view for each event that has a HIGH event count.
  3. Examine the event details for value to YOU.
  4. View the rule that collected the event.
    1. Does the rule also alert or do anything special, or does it simply collect the event?
    2. Do you think the event is required for any special reporting you do?
  5. Create an Override, in an Override MP for the rule source management pack, to disable the rule.
  6. Continue to the next event in the query output, and evaluate it.

 

So, what I like to do – is to run the “Most Common Events” query against the OpsDB, and examine the top events, and consider disabling these event collection rules:

Most common events by event number and event publishername:

-- Run against the Operations database (OpsDB), not the data warehouse
SELECT TOP 20 Number AS EventID, COUNT(*) AS TotalEvents, Publishername AS EventSource
FROM EventAllView eav WITH (NOLOCK)
GROUP BY Number, Publishername
ORDER BY TotalEvents DESC

The trick is to run this query periodically, and to examine the most common events for YOUR environment.  The easiest way to view these events and determine their value is to create a new Events view in My Workspace for each event, and then look at the event data and the rule that collected it.  (I will use a common event, 21024, as an example.)

 

image

 

image

 

What we can see – is that this is a very typical event, and there is likely no real value for collecting and storing this event in the OpsDB or Warehouse.

Next – I will examine the rule.  I will look at the Data Source section, and the Response section.  The purpose here is to get a good idea of where this collection rule is looking, what events it is collecting, and if there is also an alert in the response section.  If there is an alert in the response section – I assume this is important, and will generally leave these rules enabled.

If the rule simply collects the event (no alerting), is not used in any reports that I know about (it is rare for one to be), and I have determined the event provides little to no value to me, I disable it.  You will find you can disable most of the top consumers in the database.
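
The review logic in the last two paragraphs can be written down as a small checklist function.  This is only a sketch in Python: the `CollectionRule` class and its field names are hypothetical stand-ins for what you read off the rule’s Data Source and Response sections by hand, and are not part of any OpsMgr API.

```python
from dataclasses import dataclass


@dataclass
class CollectionRule:
    """Hypothetical record of what was observed while reviewing one rule."""
    name: str
    generates_alert: bool   # is there an alert in the Response section?
    used_in_reports: bool   # does any report you know of rely on the event?
    valuable_to_me: bool    # did the event details show real value?


def should_disable(rule: CollectionRule) -> bool:
    """Return True when the rule only collects and nothing depends on it."""
    if rule.generates_alert:
        # An alert in the response section suggests the rule is important.
        return False
    return not rule.used_in_reports and not rule.valuable_to_me


# Hypothetical review notes for the rule that collects event 21024.
example = CollectionRule(
    name="hypothetical collection rule for event 21024",
    generates_alert=False,
    used_in_reports=False,
    valuable_to_me=False,
)
print(should_disable(example))
```

The function only records the decision; the override itself is still created in the console as described in this post.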

 

Here is why I consider it totally cool to disable these uninteresting event collection rules:

  • If they are really important – there will be a different alert-generating rule to fire an alert.
  • They add unimportant information to the databases, agent queues, agent load, and network traffic.
  • While troubleshooting a real issue – we would examine the agent event log – we wouldn’t search through the database for collected events.
  • Reporting on events is really slow – because we cannot aggregate them, so views and reports don’t work well with events.
  • If we find we do need one later – simply remove the override.

 

Here is an example of this one:

image

 

So – I create an override in my “Overrides – System Center Core” MP, and disable this rule “for all objects of class”.

 

Here are some very common event IDs whose corresponding event collection rules I generally end up disabling:

 

1206
1210
1215
1216
10102
10401
10403
10409
10457
10720
11771
21024
21025
21402
21403
21404
21405
29102
29103
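
To see quickly which of your top events are on this list, the query output can be compared against it.  A minimal sketch in Python, assuming you paste the rows in as (EventID, TotalEvents, EventSource) tuples; the function name `split_by_known_ids` is made up for illustration, and the sample rows are the first five from one reader’s results in the comments below.

```python
# Event IDs from the list above.
COMMONLY_DISABLED = {
    1206, 1210, 1215, 1216, 10102, 10401, 10403, 10409, 10457, 10720,
    11771, 21024, 21025, 21402, 21403, 21404, 21405, 29102, 29103,
}


def split_by_known_ids(rows):
    """Split query rows into (on the list, needs a closer look)."""
    known = [r for r in rows if r[0] in COMMONLY_DISABLED]
    unknown = [r for r in rows if r[0] not in COMMONLY_DISABLED]
    return known, unknown


# (EventID, TotalEvents, EventSource)
rows = [
    (1206, 1155157, "HealthService"),
    (117, 136169, "nworksSource"),
    (21024, 38788, "OpsMgr Connector"),
    (29102, 15032, "OpsMgr Config Service"),
    (29103, 14846, "OpsMgr Config Service"),
]

known, unknown = split_by_known_ids(rows)
print("On the list:", [r[0] for r in known])
print("Review by hand:", [r[0] for r in unknown])
```

Being on the list is only a hint.  As the 1206 discussion in the comments shows, a very high count can point to a real problem that is worth fixing before the collection rule is turned off.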

 

I don’t recommend everyone disable all of these rules… I recommend you periodically view your top 10 or 20 events… and then review them for value.  Just knocking out the top 10 events will often free up 90% of the space they were consuming.
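
How much of the event volume the top few events account for is easy to check from the query output.  The sketch below is in Python; the function name `top_share` is an assumption for illustration, the sample counts are made up, and it measures share of event count, which is only a rough proxy for space consumed since event sizes differ.

```python
def top_share(rows, n):
    """Fraction of all counted events that the n largest rows account for.

    rows: (EventID, TotalEvents, EventSource) tuples from the query output.
    """
    counts = sorted((r[1] for r in rows), reverse=True)
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum(counts[:n]) / total


# Rows pasted from the query output (the counts here are illustrative only).
rows = [
    (21024, 6000, "OpsMgr Connector"),
    (21025, 3000, "OpsMgr Connector"),
    (10102, 1000, "Health Service Modules"),
]
print(f"Top 2 events: {top_share(rows, 2):.0%} of the counted events")
```

With only the TOP 20 rows as input, the result is a share of those rows, not of the whole event table.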

The above events are the ones I run into at most of my customers… and I generally turn these off, as we get no value from them.  You might find you have some other events as your top consumers.  I recommend you review them in the same manner as above – methodically.  Then revisit this every month or two to see if anything has changed.

I’d also love to hear if you have other events among your top consumers that aren’t on my list above… SOME events are created from script (conversion MPs) and unfortunately you cannot do much about those, because you would have to disable the script to fix them.  I’d be happy to give feedback on those, or add any new ones to my list.

Comments (32)

  1. Kevin Holman says:

    To:  Dinesh

    That is because this query is not for the warehouse database.  It is for the Operations database.

  2. Kevin Holman says:

    Re: DHCP

    Yes – that’s a very old MP.  That makes sense now.

    I would normally say go upgrade that MP…. but if you are happy with the monitoring it provides – you might just keep it.  The current updated Native DHCP MP 6.0.6452.0 has some significant monitoring limitations, due to some advanced monitoring that it performs, and I am not 100% sure those limitations are present in the conversion MP.  I just don’t know.  Like I said – if you are happy, I’d probably stick with it.

  3. Dominique says:

    Hello,

    I found this …

    http://technet.microsoft.com/en-us/library/cc655729.aspx#BKMK_ApplicationProviderPath

    For these rules to work, you need to create the %SMS_INSTALL_DIR_PATH% environment variable on your site server

    with the installation path that was specified for your site installation. The environment variable path should not end in a backslash.

    Each Configuration Manager 2007 server with a sender must be a managed computer.

    http://technet.microsoft.com/en-us/library/cc755616(WS.10).aspx

    Is that the ONLY thing that is missing?

    Thanks,

    Dom

  4. Anonymous says:

    Great post Kevin, thanks. I made a slight change that may be useful to others: I added the Channel field which – in SCOM 2012 – shows the event log the events are from. Saves a lot of time digging 🙂 The query on my side now is:

    SELECT top 20 Number as EventID, COUNT(*) AS TotalEvents, Publishername as EventSource, Channel as EventLogName

    FROM EventAllView eav with (nolock)

    GROUP BY Number, Publishername, Channel

    ORDER BY TotalEvents DESC

  5. Kevin Holman says:

    From the SCCM MP guide:

    Defining the SMS Environment Variable to Support Log-Based Rules

    A number of rules in the Configuration Manager Management Pack read Configuration Manager-based log files to check for errors.

    The following rules under ConfigMgr Site Servers – Common are based on the sender.log, distmgr.log, and policypv.log files, respectively:

    • ConfigMgr 2007 Component: The sender cannot connect to remote site over the LAN (Standard Security)

    • ConfigMgr 2007 Component: The sender cannot connect to remote site over the RAS connection

    • ConfigMgr 2007 Component: The sender cannot connect to remote site over the LAN (Advanced Security)

    • ConfigMgr 2007 Component: Distribution Manager failed to process a package

    • ConfigMgr 2007 Component: Distribution Manager failed to insert an SMS Package because SDM Type Content is not present in the CI_Contents table

    • ConfigMgr 2007 Component: Policy Provider failed to get new software update policies from the SMS Site Database

    • ConfigMgr 2007 Component: Policy Provider failed to create new software update policy

    • ConfigMgr 2007 Component: Policy Provider failed to get new compliance policies from the SMS Site Database

    • ConfigMgr 2007 Component: Policy Provider failed to create new compliance policy

    • ConfigMgr 2007 Component: Policy Provider failed to notify Hierarchy Manager of a policy change

    In order to monitor these logs, the location of the Configuration Manager installation folder must be specified. To do so, create the %SMS_INSTALL_DIR_PATH% system environment variable on a site server so that the MOM Agent running under Local System or a local administrator user context has access to the log files in the %SMS_INSTALL_DIR_PATH%\Logs directory. For more information about setting system environment variables, see the system environment variable Web page (http://go.microsoft.com/fwlink/?LinkId=92316).

    In order for the Operations Manager Health Agent to use this system environment variable, the Configuration Manager Site Server may need to be restarted.

  6. Kevin Holman says:

    Thanks Marco –

    The 31707 is a known issue – from you not configuring your SMS MP according to the guide.  There is a variable for the SMS logs path in the MP – and you need to set this variable on ALL your SMS servers.  I would STRONGLY recommend you set this up correctly – otherwise you aren’t monitoring your logs, and you are flooding OpsMgr with these events.

    I don’t have any 1501 events – what are they when you create the view to look at those?

    The others are known issues – and I would disable them.

  7. Kevin Holman says:

    I searched the XML of all the current DHCP MP’s – and 1501 is not in them.  What DHCP MP are you using, what OS version is your DHCP server, and what is the EXACT rule or monitor name, and target, that is responsible for inserting the 1501?

  8. Kevin Holman says:

    Re:  Serge

    So here is an example where collecting too many events might be a good thing.  🙁   You are hammered with event 1206.  This is bad.  However – we don’t have any good alerting to "detect" this condition… so analyzing your event flooding might be the only way to detect this.  A 1206 is:  Rule/Monitor "%2", running for instance "%3" with id:"%4" failed, got unloaded and reached the failure limit that prevents automatic reload. Management group "%1".   A completely healthy management group will have ZERO 1206 events.

    You should create a view for this event – and try to determine if you have a systemic problem with an MP, a rule, or just sick machines all over the place.  This isn’t good – but might just be a badly written event.  I have never seen that one so high before.  So – it STILL isn’t valuable to collect the 1206 event… as it simply fills up your DB – but you DO want to fix the root cause of it…. so I would not turn this off until you are no longer seeing it happen so much.  Or – create an alert-generating rule for this event and enable alert suppression.

    Re:  117 – I would determine if Nworks really needs this event.

    Re:  21024, 21025, 29102, 29103, 1210…. I would turn those off.

  9. Kevin Holman says:

    Generally – if that event is high – you have a problem that requires investigation – either a bad MP or some very sick agents.

    That said – you cannot disable the event collection for a single event when the rule collecting it has multiple events in its data source.  The only way to do that is to disable the rule, then recreate it, and leave out any IDs you don’t want.  That said – if you aren’t using events in troubleshooting on a regular basis – why not just turn off the whole rule?  As long as it doesn’t also ALERT, that is.

  10. Kevin Holman says:

    Dom:

    Which MP are you running – the SCCM MP, or the SMS MP?  Or both?

  11. martit01 says:

    Hello Kevin,

    Looks like we have a busy DNS mp. We have the most current version of DNS MP.  I'm guessing it should be ok to disable these since they don't have alerts configured and of course no Product Knowledge.  

    Thanks,

    Tom

    Event ID: 1161

    Total Events: 113602

    Generating Rule: Collect Script Trace Events

    Management Pack: Microsoft Windows DNS Server Library

    Description:

    DNS-NslookupAllTests.js :

    Duration: 0.157

    Start: 17:14:28.239

    End: 17:14:28.396

    Event ID: 1162

    Total Events: 37868

    Generating Rule: Collect Script Trace Events

    MP: Microsoft Windows DNS Server Library

    Description:

    DNS-NslookupAllTests.js : Final Summary:

    SuccessCount: 1

    NonAuthoritativeCount: 1

    FailureCount: 0

    BestHost: 107.107.38.in-addr.arpa.

    BestServer: xxx.xxx.xxx.xxx

    BestTime: 0.094

    WorstHost: 107.107.38.in-addr.arpa.

    WorstServer: xxx.xxx.xxx.xxx

    WorstTime: 0.094

    FailingPairs:

    Event ID: 1199

    Total Events: 37868

    Generating Rule: Collect Script Trace Events

    Management Pack:  Microsoft Windows DNS Server Library

    Description:

    DNS-NslookupAllTests.js : Exiting normally. NslookupAllTests Duration 0.172 seconds.

  12. Kevin Holman says:

    Dom:

    As for the other events – you should follow exactly what the blog post says – create views for them in "My Workspace" – look at them and see if this is indicative of a big problem – or something you just wanna turn off.  

    Several of the ones in your list are ones I turn off collection rules for.  The others that are not in my list… I would investigate.

  13. Kevin Holman says:

    If the TEST event is your largest event – you just don’t have anything going on yet.

    🙂

    (but yeah – I’d disable it if it was my top event and showed me no value)

  14. Marco says:

    Hi Kevin,

    thanks for sharing this Information. To give you some feedback on the Events i see in our Environment:

    Top1 (1.8 Million! Events) – EventID 31707 (Error monitoring parent directory. Directory = %SMS_INSTALL_DIR_PATH%)

    followed by Event 1501, 10409, 21024, 10403 with about 200k each. So maybe 31707 is an issue for other environments too.

    Regards Marco

  15. Marco says:

    Event 1501 is from the DHCP Scope Monitoring, collecting the address status. From the Product Knowledge of the Rule:

    Summary

    This rule collects the following DHCP related information:

    DHCP superscopes and scopes

    DHCP superscope and scope relationships

    DHCP superscopes and scope utilization

    Caution:

    Disabling this rule prevents the DHCP server superscope and scope monitoring and reports from functioning.

  16. B-Serge says:

    I ran your little query and this is the result:

    TotalEvents EventID EventSource

    1155157 1206 HealthService

    136169 117 nworksSource

    38788 21024 OpsMgr Connector

    15032 29102 OpsMgr Config Service

    14846 29103 OpsMgr Config Service

    14481 21025 OpsMgr Connector

    13551 1210 HealthService

    13144 74 nworksSource

    12354 77 nworksSource

    10575 10378 Health Service Modules

    9824 72 nworksSource

    9737 68 nworksSource

    6154 89 nworksSource

    5689 10376 Health Service Modules

    5614 10403 Health Service Modules

    4505 1102 HealthService

    3783 10102 Health Service Modules

    2355 31901 Health Service Modules

    2248 6022 Health Service Script

    2225 31902 Health Service Modules

    The Top 5 matches your favorites 🙂

    The nworksSource is from the VMware MP by Veeam, will start checking these out.

    Cheers,

    Serge

  17. B-Serge says:

    Hi Kevin,

    I’ve checked and figured out the 1206.

    Apparently 1 (ONE!) server was going ballistic a couple of days ago. Unfortunately it was an nWorks Virtual Infrastructure Collector. These servers collect all info on VM Hosts & Guests. Typically I saw all kinds of Events like this one:

    Rule/Monitor "nworks.VMware.VEM.VC2Alarm.VMGUEST.CPU.toRed", running for instance "_Total" with id:"{C5AC8DDB-DE26-A276-9177-1D9E5D854400}" failed, got unloaded and reached the failure limit that prevents automatic reload.

    The 117 is also an interesting one 🙂

    According to Veeam: This is intended as an update "hint" to the mom/scom MP. This event drives the performance data consumer in the MP.

    The description contains this kind of info: SV110 Performance data for ‘VMDiskProperties’ class published in WMI

    Guess I’m gonna drop the guys at nworks a couple of questions.

    Cheers,

    Serge

  18. Marco says:

    Hi Kevin,

    regarding the 1501 Events. We currently have about ~170 DHCP Servers included in our SCOM Monitoring, running Windows Server 2003.

    The exact Rule Name is "DHCP Scope Monitoring", the Rule Target is "Microsoft Windows 2003 DHCP Servers Installation". The MP is V6.0.5000.33, probably a rather old Version.

  19. dinesh says:

    When I run this query against the DW, I receive the error message below:

    Msg 208, Level 16, State 1, Line 1

    Invalid object name ‘EventAllView’.

  20. Ravi says:

    We see a high number of Event ID 10409 events, which are generated by the rule "Collect WMI Probe Module Events". This rule also collects many other events. How can I disable event collection for Event ID 10409 only?

  21. Dominique says:

    Hello Kevin,

    I ran your query and got:

    31707 62549 Health Service Modules

    11771 46004 Health Service Modules

    1199 34200 Health Service Script

    10401 29589 Health Service Modules

    7000 21774 Service Control Manager

    1112 15782 Health Service Script

    21024 13758 OpsMgr Connector

    10409 12846 Health Service Modules

    29103 11987 OpsMgr Config Service

    29102 11985 OpsMgr Config Service

    21025 11942 OpsMgr Connector

    1210 11012 HealthService

    9100 10634 Health Service Modules

    6001 7777 DNS

    10403 6918 Health Service Modules

    1740 4352 ConfigMgr 2007 Monitor State Message Summary Tasks

    1077 3870 W3SVC

    10375 2904 Health Service Modules

    1135 2896 Health Service Script

    21405 2467 Health Service Modules

    So I have 31707 on top, like Marco. What exactly are you referring to when you say "The 31707 is a known issue – from you not configuring your SMS MP according to the guide.  There is a variable for the SMS logs path in the MP – and you need to set this variable on ALL your SMS servers.  I would STRONGLY recommend you set this up correctly – otherwise you aren’t monitoring your logs, and you are flooding OpsMgr with these events."

    I will review the documentation but already three of us did it and could not find the issue … it was the holidays so maybe we are too tired… 🙂 otherwise what should I do with other Events?

    Thanks

    Dom

  22. Dominique says:

    Hello,

    Which SMS MP guide are you referring to?

    I have checked the Microsoft System Center Configuration Manager 2007 Guide unsuccessfully, as I could not see any SMS MP name … should I install one on top of my existing configuration? Does it have another name in SCOM 2007?

    Thanks,

    Dom

  23. sam says:

    Dom,

    Try this :-

    Variable:

    SMS_INSTALL_DIR_PATH

    Value:

    your installation Drive:\SMS

    my case :

    F:\SMS

  24. Bryce says:

    Most common event in my OpsDB so far (which is not that long) is:

    Source: Health Service Script

    Generating Rule: Collect Distributed Workflow Test Event

    Event Number: 6022

    Level: Information

    Description:  LogEndToEndEvent.js : This event is logged to the Windows Event Log periodically to test a event collection.

    Seems like a decent candidate for disabling but I didn’t see it on anyone’s list here.  

    Thoughts?

  25. jeremy says:

    Hello,

    So I’m trying to create overrides for some of the top events in our database. For instance, Event 21402 and 21403 which you also list in your list of common events to disable. This rule (Collect Batch Response Module Events) is targeted to Health Service and when I right click the event to create an override it shows "Override the rule… For all objects of class: Health Service".  I select this target, put a check mark in the "override" field, change Override Value to "False", select a custom MP to store the override in and click OK.  I can even view the override using the Summary link.  But, the events keep coming in… several days later, so it’s not that I’m just not waiting long enough for the new configuration to take effect.  I’ve found that overrides I create for rules that target something other than "Health Service" work fine… but they seem to never work for Health Service.

    Is there something I’m missing here? Should I be targeting a different class?

    Thanks.

  26. Kitaab says:

    We have these:

    EventID TotalEvents EventSource

    10457 23390 Health Service Modules

    11771 9127 Health Service Modules

    10409 5113 Health Service Modules

    10403 4468 Health Service Modules

    6522 4212 DNS

    1063 2443 Microsoft-Windows-DHCP-Server

    31717 2173 Health Service Modules

    11903 2055 Health Service Modules

    7001 1909 Service Control Manager

    21402 1356 Health Service Modules

    21403 1299 Health Service Modules

    1206 1009 HealthService

    7036 825 Service Control Manager

    31552 636 Health Service Modules

    4618 579 AdtServer

    6022 570 Health Service Script

    1103 547 MetaFrameEvents

    31709 546 Health Service Modules

    11052 496 Health Service Modules

    7031 418 Service Control Manager

  27. sonia says:

    Hi, we are getting the following error on our SCOM management servers. No cause is given; it looks like it’s cut off after "Cause", and there are no additional details in the XML view:

    Log Name:      Operations Manager

    Source:        Health Service Script

    Date:          9/15/2013 3:09:22 PM

    Event ID:      3000

    Description: AgentMinRequiredVersionCheck.vbs : An error occurred while reading the registry. Cause:

    Would appreciate any ideas!

  28. dinesh says:

    Is there any method to delete the collected events from the OperationsManager DB?

  29. rahul says:

    Hi Kevin,

    One of my servers is in the Not Monitored state. I checked the connection and the server responds to ping. In the event log I found a warning with event ID 1207. I did a cache flush, but the server is still in the Not Monitored state.

  30. Kash says:

    Hi
    I have 5,45,320 entries of Event ID 7001. Is it OK to disable it for the entire “windows Operating System” Class ??

    Thanks

  31. Arindam C says:

    Hi Kevin

    In our environment I see most of the events are like below.
    EventID TotalEvents EventSource
    4009 634880 Apm PerfCounterMonitor
    1118 634573 Apm Agent
    6398 8085 Microsoft-SharePoint Products-SharePoint Foundation
    2159 4919 Microsoft-SharePoint Products-SharePoint Foundation
    1318 3232 Apm Agent
    4139 3063 Apm Agent
    4140 3055 Apm Agent

    Do you think our APM settings have gone for a toss? Any suggestions please?

    Thanks
