Sky Blue–How to keep monitors from going unnoticed after closing an alert


Scenario: A 3-state logical disk monitor targets a disk that meets the warning threshold. A warning alert is raised.

The Operations Team sees the alert, consults SOP guidance that says to ignore warnings, and manually closes the alert. The alert is gone, but the underlying monitor is still unhealthy, with a state of warning. A day later the disk space used reaches the critical threshold, and the monitor state changes to match.

The problem is that the alert is already closed, and the monitor won’t raise a new one until it has returned to healthy and become unhealthy again.

So the one alert that should now be critical is sitting in the closed state. Operations doesn’t look at closed alerts. And so the disk reaches capacity, potentially bad things happen, and ultimately someone asks, “Why didn’t we get an alert?”

Well, we did, and we closed it.

Let’s consider how alerts are generated from monitors. We often speak in terms of an alert being generated when a threshold is met, but in the case of monitors that oversimplifies the process and leaves out an important step. The thresholds trigger a condition detection path that leads to a state change. That state change can then prompt an alert. In the case of 3-state monitors, we only generate one alert. When the state changes from warning to critical, alert severity can be (and often is) configured to match the monitor state.

The not-so-intuitive part of this model reveals itself when an operator sees a non-actionable alert, because the default behavior is to close it. Closing the alert does not affect the state of the monitor. That’s how we end up in situations where the state of a monitor is warning or critical, but there is no corresponding alert: the state changed so long ago that the alert has been groomed after someone closed it.
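To make the model concrete, here is a toy simulation in plain PowerShell. It does not touch SCOM at all: the `$monitor` and `$alerts` variables and the `Set-ToyMonitorState` function are hypothetical stand-ins, and the transition rules are a simplified reading of the behavior described above, not the product’s actual implementation.

```powershell
# Toy model only: plain PowerShell objects standing in for a monitor and its alerts.
$monitor = @{ State = 'Healthy' }
$alerts  = New-Object System.Collections.ArrayList

function Set-ToyMonitorState {
    param([ValidateSet('Healthy', 'Warning', 'Critical')][string]$NewState)

    $oldState = $monitor.State
    $monitor.State = $NewState
    $open = @($alerts | Where-Object { $_.ResolutionState -ne 255 })

    if ($NewState -eq 'Healthy') {
        # Back to healthy: any open alert is resolved.
        foreach ($a in $open) { $a.ResolutionState = 255 }
    }
    elseif ($oldState -eq 'Healthy') {
        # Healthy -> unhealthy is the only transition that raises a new alert.
        [void]$alerts.Add([pscustomobject]@{ Severity = $NewState; ResolutionState = 0 })
    }
    else {
        # Warning <-> Critical: an open alert follows the state; a closed one stays closed.
        foreach ($a in $open) { $a.Severity = $NewState }
    }
}

Set-ToyMonitorState Warning          # alert raised with severity Warning
$alerts[0].ResolutionState = 255     # operator closes it manually
Set-ToyMonitorState Critical         # state changes, but no new alert is raised

$openCount = @($alerts | Where-Object { $_.ResolutionState -ne 255 }).Count
"Monitor state: $($monitor.State); open alerts: $openCount"
```

In this sketch the monitor ends up Critical with zero open alerts, which is exactly the “Why didn’t we get an alert?” situation.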

So how do we avoid these situations?

Well, first we can always submit product feedback, such as:

http://systemcenterom.uservoice.com/forums/293064-general-operations-manager-feedback/suggestions/9356712-a-way-to-reset-monitors-automatically-when-alert-I

But until then we can address the behavior with a PowerShell-driven Management Pack I created called Sky Blue. Thanks to Peter Lem for the name, a play on Green Machine and on the fact that management asking about “missing alerts” always feels like a rainy day.

Now, the ask in the above feedback link is somewhat different from the customer ask that originally drove this workaround. Instead of providing a task or a way to replicate Green Machine’s resetting of monitor states, the approach with Sky Blue is to enforce good operator behavior in the SCOM console. Sky Blue protects against accidental alert closures by re-opening closed alerts where the monitor state is not healthy. This is done via a rule that runs every five minutes (override-able) against the All Management Servers (All MS) Resource Pool. The rule runs a PowerShell script that queries for all monitor-generated alerts that were closed within the past two intervals, checks the state of those monitors, and re-opens (sets the resolution state to 0) any alert where the monitor is not healthy and is not in maintenance mode.

Or more specifically, albeit condensed:

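# Interval between rule runs, in seconds
$interval = 300
# Look back two intervals, in UTC
$d = (Get-Date).ToUniversalTime().AddSeconds($interval * -2)
# Recently closed (255) monitor alerts whose monitored object is not healthy (1 = healthy)
$alerts = Get-SCOMAlert -Criteria "ResolutionState = 255 And LastModified > '$d' And IsMonitorAlert = 'True' And MonitoringObjectHealthState <> 1" | Sort-Object -Descending TimeRaised
foreach ($alert in $alerts) {
    # Look for an alert from the same monitor on the same object that is not closed;
    # @() guarantees an array so .Count is reliable for zero or one result
    $alertCheck = @(Get-SCOMAlert -Criteria "ResolutionState <> 255 And IsMonitorAlert = 'True' And RuleId = '$($alert.RuleId)' And MonitoringObjectId = '$($alert.MonitoringObjectId)'")
    if ($alertCheck.Count -eq 0) {
        # Nothing open: set the alert back to New (0) and record why in the alert history
        $alert.ResolutionState = 0
        $alert.Update('Sky Blue found a closed alert from a monitor that is not healthy. Changing Resolution State to New.')
    }
}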
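The inner check in that loop is easy to miss: a closed alert is only re-opened when no other non-closed alert exists for the same monitor on the same object. The sketch below isolates that decision with mock objects so it can be run without a management group. `Select-ToyAlertToReopen` and the sample alert data are hypothetical and exist only for illustration.

```powershell
# Hypothetical helper operating on plain objects, not on live SCOM alerts.
function Select-ToyAlertToReopen {
    param([object[]]$ClosedAlerts, [object[]]$OpenAlerts)

    foreach ($candidate in $ClosedAlerts) {
        # Does the same monitor (RuleId) already have an open alert on the same object?
        $match = @($OpenAlerts | Where-Object {
            $_.RuleId -eq $candidate.RuleId -and
            $_.MonitoringObjectId -eq $candidate.MonitoringObjectId
        })
        # Only emit the closed alert when nothing open covers it.
        if ($match.Count -eq 0) { $candidate }
    }
}

# Mock data: two closed alerts, one of which already has an open counterpart.
$closedSample = @(
    [pscustomobject]@{ Name = 'Disk C: low space'; RuleId = 'm1'; MonitoringObjectId = 'o1' }
    [pscustomobject]@{ Name = 'Service stopped';   RuleId = 'm2'; MonitoringObjectId = 'o2' }
)
$openSample = @(
    [pscustomobject]@{ Name = 'Service stopped';   RuleId = 'm2'; MonitoringObjectId = 'o2' }
)

$toReopen = @(Select-ToyAlertToReopen -ClosedAlerts $closedSample -OpenAlerts $openSample)
$toReopen | ForEach-Object { "Would reopen: $($_.Name)" }
```

Here only the disk alert is selected, since the service monitor already has an open alert on the same object. A dry run along these lines, listing candidates instead of calling `Update`, may be a reasonable way to preview the effect before importing the MP.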

Better yet, head over to GitHub for the whole VSAE project with the PowerShell wrapped up in a rule with a scheduler. Pull Requests and feedback welcome.
https://github.com/cchamp-msft/SkyBlue

Or if you want an unsealed copy of the built MP:
https://github.com/cchamp-msft/SkyBlue/blob/master/SkyBlue/bin/Release/SkyBlue.xml

The configuration of the rule is as follows:

  • IntervalSeconds – Default 300 – How often to run the rule (300 is fairly aggressive; 3600 is still effective).
  • NoReopenSeconds – Default 0 – Useful with sub-60-second intervals so that the rule won’t re-open an alert that auto-closed before the monitor has fully reconciled. This setting is not for production use; if I recall correctly, I set it to 30 when testing on a 30-second interval.
  • TimeoutSeconds – Default 60 – Timeout for the PowerShell script; 60 should suffice.
  • GenerateAlert – Default true – Whether or not to generate an alert on an error condition.
  • Debug – Default false – Whether or not to make a mess of the Operations Manager event log on the MS running the rule in the All MS Resource Pool.
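As a rough illustration of how IntervalSeconds and NoReopenSeconds could interact, the sketch below filters mock closed alerts by time. The lower bound follows the two-interval lookback from the condensed script above. The upper bound is an assumption about how NoReopenSeconds might be applied (skipping alerts closed more recently than that many seconds ago), so check the script in the GitHub project for the real behavior. All variable names and sample data are hypothetical.

```powershell
# Assumption: NoReopenSeconds acts as a cool-off period for recently closed alerts.
$IntervalSeconds = 30
$NoReopenSeconds = 30
$now = [datetime]::UtcNow

$windowStart = $now.AddSeconds(-2 * $IntervalSeconds)   # two-interval lookback, as in the post
$windowEnd   = $now.AddSeconds(-1 * $NoReopenSeconds)   # hypothetical upper bound

# Mock closed alerts; only LastModified matters for this sketch.
$closedAlerts = @(
    [pscustomobject]@{ Name = 'Closed 10s ago'; LastModified = $now.AddSeconds(-10) }
    [pscustomobject]@{ Name = 'Closed 45s ago'; LastModified = $now.AddSeconds(-45) }
    [pscustomobject]@{ Name = 'Closed 90s ago'; LastModified = $now.AddSeconds(-90) }
)

# Keep alerts inside the lookback window but older than the cool-off period.
$eligible = @($closedAlerts | Where-Object {
    $_.LastModified -gt $windowStart -and $_.LastModified -le $windowEnd
})
$eligible | ForEach-Object { "Eligible for re-open check: $($_.Name)" }
```

With both values at 30, only the alert closed 45 seconds ago falls inside the window: the one closed 10 seconds ago is still cooling off, and the one closed 90 seconds ago is outside the two-interval lookback.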

Please note this is provided as-is and contains no warranty. Thank you Michael Bullwinkle for kindly reviewing and editing this post.


Comments (3)

  1. Dennis says:

    Excellent MP!
    In the past I had created a similar script, but instead of re-opening the alert I reset the monitor. Re-opening is definitely the better solution.
    Although it is only cosmetic, I just wanted to let you know that on line 88 there’s a version number mentioned, which is shown in the alert history. Maybe something that was left over from development, idk. It’s easily fixed, so it’s no big deal. Thank you for sharing!

  2. Charles Champion says:

    Thanks Dennis, I’m glad you are finding it useful. I thought about the monitor reset method too, but I had limited success with it and never got back around to seeing it through. I updated the version on git with your feedback. Good catch. Thanks again,
    Charles.

  3. George says:

    Hi Charles, I deployed Sky Blue into my environment, but I noticed that on some occasions it would re-open an alert generated from a monitor when the issue was already resolved. E.g., the alert from a monitor that checks whether a service is started was reopened when the service was already running and the health state was healthy.
