We often think of tuning OpsMgr in terms of reducing “alert noise”: disabling rules that generate alerts we don't care about, or modifying thresholds on monitors to make the alerts more actionable for our specific environment.
However, one area of OpsMgr that often goes overlooked is event overcollection. This has a cost, because collected events create LAN/WAN traffic, agent overhead, OpsDB size bloat, and especially Data Warehouse size bloat. I have worked with customers whose data warehouse was over one third event data... and they had ZERO requirement for this data, nor did they want it. They were paying for disk storage and backups, plus added time and resources on the framework, all for data they cared nothing about.
MOST of these events are collected out of the box by default OpsMgr collection rules in the “System Center Core Monitoring” MP. They are events like “config requested”, “config delivered”, and “new config active”. They might be interesting, but no advanced analysis is included that uses them to detect a problem. In small environments they are usually not a big deal. But in environments with a large agent count, these events can account for a LOT of data, and they provide little value unless you are doing advanced analysis on them. I have yet to see a customer who did that.
At a high level, here is how I like to review these events:
- Run the Most Common Events query against your OpsDB.
- Create a “My Workspace” view for each event that has a HIGH event count.
- Examine the event details for value to YOU.
- View the rule that collected the event.
- Does the rule also alert or do anything special, or does it simply collect the event?
- Do you think the event is required for any special reporting you do?
- Create an override to disable the rule, saved in an override MP dedicated to the rule's source management pack.
- Continue to the next event in the query output, and evaluate it.
So, what I like to do is run the “Most Common Events” query against the OpsDB, examine the top events, and consider disabling the rules that collect them:
Most common events by event number and event publisher name:

```sql
SELECT TOP 20 Number AS EventID, COUNT(*) AS TotalEvents, PublisherName AS EventSource
FROM EventAllView eav WITH (NOLOCK)
GROUP BY Number, PublisherName
ORDER BY TotalEvents DESC
```
The trick is to run this query periodically and examine the most common events in YOUR environment. The easiest way to view these events and determine their value is to create a new Events view in My Workspace for each event, then look at the event data and at the rule that collected it. I will use a common event, 21024, as an example:
What we can see is that this is a very typical event, and there is likely no real value in collecting and storing it in the OpsDB or the Data Warehouse.
Next, I examine the rule, looking at the Data Source section and the Response section. The purpose here is to get a good idea of where this collection rule is looking, what events it is collecting, and whether there is also an alert in the Response section. If there is an alert in the Response section, I assume the rule is important and will generally leave it enabled.
If the rule simply collects the event (no alerting), is not used in any reports that I know of (a rare condition), and I have determined that the event provides little to no value to me, I disable it. You will likely find that you can disable most of the top consumers in the database.
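If you also want to look at the event from the database side, a query along these lines shows which agents log it most often and a sample of the raw rows. This is a sketch rather than part of the procedure above: it assumes EventAllView exposes LoggingComputer and TimeAdded columns (as commonly used OpsDB queries do; verify against your schema), and it hard-codes event 21024 from the example, so substitute your own event ID.

```sql
-- Sketch: drill into one high-volume event (21024, from the example above).
-- Assumes EventAllView exposes LoggingComputer and TimeAdded; verify in your OpsDB.

-- Which agents log this event most often?
SELECT TOP 20 LoggingComputer AS ComputerName, COUNT(*) AS TotalEvents
FROM EventAllView WITH (NOLOCK)
WHERE Number = 21024
GROUP BY LoggingComputer
ORDER BY TotalEvents DESC;

-- A sample of the most recently added rows, to eyeball the event data.
SELECT TOP 20 *
FROM EventAllView WITH (NOLOCK)
WHERE Number = 21024
ORDER BY TimeAdded DESC;
```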
Here is why I consider it totally cool to disable these uninteresting event collection rules:
- If they are really important, there will be a different, alert-generating rule to fire an alert.
- They fill the databases and agent queues, and add agent load and network traffic, all with unimportant information.
- When troubleshooting a real issue, we would examine the agent's event log; we wouldn’t search through the database for collected events.
- Reporting on events is really slow because event data cannot be aggregated, so views and reports don't work well with events.
- If we find we do need one later, we simply remove the override.
Here is an example of this one:
So I create an override in my “Overrides – System Center Core” MP and disable this rule “For all objects of class”.
Here are some very common event IDs whose corresponding event collection rules I generally end up disabling:
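Once agents pick up the new configuration, the daily count for the event should drop off. A minimal sketch to watch for that, again assuming the TimeAdded column on EventAllView and using event 21024 as the example; older days only appear until event data is groomed out of the OpsDB.

```sql
-- Sketch: daily row count for one event ID, newest day first.
-- Assumes EventAllView exposes TimeAdded. Style 120 yields 'yyyy-mm-dd hh:mi:ss',
-- so VARCHAR(10) keeps only the date part.
SELECT CONVERT(VARCHAR(10), TimeAdded, 120) AS DayAdded,
       COUNT(*) AS EventsPerDay
FROM EventAllView WITH (NOLOCK)
WHERE Number = 21024
GROUP BY CONVERT(VARCHAR(10), TimeAdded, 120)
ORDER BY DayAdded DESC;
```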
I don't recommend that everyone disable all of these rules. I recommend you periodically view your top 10 or 20 events and then review them for value. Just knocking out the top 10 events will often free up around 90% of the space your collected events were consuming.
The events above are the ones I run into at most of my customers, and I generally turn these off, as we get no value from them. You might find other events among your top consumers. I recommend you review them methodically, in the same manner as above. Then revisit this every month or two to see if anything has changed.
I’d also love to hear about other events you see as top consumers that aren't on my list above. SOME events are created from scripts (conversion MPs), and unfortunately you cannot do much about those, because you would have to disable the script to fix them. I’d be happy to give feedback on those, or add any new ones to my list.
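To put a number on how concentrated your event volume is, the Most Common Events query can be extended with each event's share of all event rows. This is a sketch that uses row count as a rough proxy for space (storage per row varies with event size), relying on a windowed SUM(...) OVER (), which SQL Server supports for aggregate functions.

```sql
-- Sketch: top 10 events plus each one's share of all event rows in the OpsDB.
-- The window SUM runs after GROUP BY and before TOP, so the denominator
-- covers every event group, not just the 10 returned.
-- Row count is only a proxy for disk space; event sizes vary.
SELECT TOP 10
       Number        AS EventID,
       PublisherName AS EventSource,
       COUNT(*)      AS TotalEvents,
       CAST(100.0 * COUNT(*) / SUM(COUNT(*)) OVER () AS DECIMAL(5, 2)) AS PercentOfAllEvents
FROM EventAllView WITH (NOLOCK)
GROUP BY Number, PublisherName
ORDER BY TotalEvents DESC;
```

Adding up the PercentOfAllEvents column gives a quick sense of how much of the total the top 10 represent in your environment.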