The Anatomy of a Good SCOM Alert Management Process – Part 1: Why is alert management necessary?

I’ve had the luxury of doing SCOM work for several years now, across many different client types and infrastructures, and one of the constants I see in almost every environment I’ve worked in is a lack of planning around how SCOM will be used.  There are a number of reasons for this, ranging from a lack of understanding of how the tool works, to internal political issues, to administrative problems such as lazy or secretive administrators.  The solutions to these problems are not always easy, and they require a bit more than just tossing in a piece of technology so that we can check a box and say “we have monitoring.”  The reality, though, is that this is precisely how many SCOM environments are designed.

We at Microsoft have often been very good at solving technical problems, and the SCOM community as a whole has a number of fantastic blogs, ranging from frequent bloggers such as Kevin Holman, Stefan Stranger, and Marnix Wolf to folks such as myself who are far less intelligent and do far less blogging.  In all, if you need to solve a technical problem in SCOM, it’s usually not hard, as someone has almost certainly solved it already.  Unfortunately, I think this is where we can often fall short: we leave it up to our customers to use our products, and they can have some very interesting ways of using them.  SCOM is no different.  The biggest problem with SCOM that I see is that organizations never address the people or processes surrounding the technology they purchased.

First, let’s start with the obvious: SCOM is very good at telling users that something is wrong.  It’s not hard to spin up, and after tossing in a few management packs, you will quickly start seeing alerts ranging from simple noise to real problems.  SCOM engineers quickly realize that there are lots of really cool Microsoft and non-Microsoft MPs out there, along with some really good ideas (and bad ones).  It does not take long for a customer to deploy a bunch of agents, import some management packs, and the next thing they know, their alert screen is full of red and yellow alerts indicating that something may be wrong.

What I find from here, though, is that the alert management process pretty much stops at this point.  Yes, there are organizations that truly do try to have an end-to-end alert lifecycle, but just about every organization I’ve visited is stuck here, even while thinking they aren’t.  These orgs have a SCOM administrator, who often wears multiple hats, and perhaps a tier 1 staff watching the active alerts to some extent.  Usually, tier 2 and 3 are completely disengaged, never touching the SCOM console, or perhaps going so far as actively resisting monitoring attempts.  In an attempt to bring monitoring issues to light, orgs decide to send emails or generate tickets whenever an alert is raised.  Generating tickets usually frustrates the help desk, as SCOM can quite literally generate thousands of alerts a day, which essentially turns SCOM into our own private spam server.  Administrators create mail rules and file SCOM alerts into folders, and in the end nothing gets changed: the same alert generates tens, hundreds, or even thousands of emails that go unanswered, and the problem is never actually solved.
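If you want to see how bad the noise is in your own environment, a quick look at the open alerts usually makes the point.  Here is a minimal PowerShell sketch using the OperationsManager module (run on, or already connected to, a management server); the grouping and the “top 10” cutoff are just illustrative, not a prescribed report.

```powershell
# Minimal sketch: list the 10 noisiest open alerts in the management group.
# Assumes the OperationsManager PowerShell module is installed and a connection
# to a management server already exists.
Import-Module OperationsManager

Get-SCOMAlert -ResolutionState 0 |             # 0 = "New" (not yet closed)
    Group-Object -Property Name |              # group identical alerts together
    Sort-Object -Property Count -Descending |  # noisiest first
    Select-Object -First 10 Count, Name |      # top 10 offenders
    Format-Table -AutoSize
```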

The problem is never solved because of a fundamental lack of understanding of what needs to be done and why.  There are a few reasons why alert management is necessary:

  1. There are technical reasons within the product itself that require it.  State changes are not groomed from the operational database while the associated object remains in an unhealthy state.  A failure to manage alerts and fix the issues behind them leaves data in the database beyond its grooming window.  Likewise, state and performance data can be stored in the Data Warehouse for a long period of time.  Failure to manage alerts can lead to a very large DW, often containing lots of data that the customer couldn’t care less about, and eventually to performance issues if it is not managed.
  2. All environments are different.  This should go without saying, but it means that it is impossible for SCOM to meet the exact needs of your organization OUT OF THE BOX.  Thresholds for alerts may be too high in one organization and too low in another.  In some orgs, a monitor or rule simply is not applicable, and in some cases what they really want to monitor is turned off by default.  As such, the SCOM administrator’s primary job is to tune alerts (a simple example of what that tuning looks like follows this list).
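Tuning usually comes down to overrides.  As a rough illustration (not a prescription), here is a PowerShell sketch that disables a hypothetical noisy monitor for every instance of its target class and stores the override in an unsealed management pack; the display names are placeholders for whatever actually exists in your environment.

```powershell
# Minimal tuning sketch: disable a noisy monitor via an override.
# The display names below are hypothetical - substitute your own.
Import-Module OperationsManager

$monitor = Get-SCOMMonitor        -DisplayName "Some Noisy Monitor"
$mp      = Get-SCOMManagementPack -DisplayName "Custom Overrides"   # must be an unsealed MP
$class   = Get-SCOMClass -Id $monitor.Target.Id                     # the monitor's target class

# Disable the monitor for every instance of its target class,
# saving the override into the unsealed override management pack.
Disable-SCOMMonitor -Monitor $monitor -Class $class -ManagementPack $mp -Enforce
```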

Tuning, while it sounds like a simple job, requires teamwork.  It would be nice if your SCOM administrator were a technology guru, but the reality is that this engineer likely knows bits and pieces about AD, Platforms, Clustering, Skype, DNS, IIS, SharePoint, Azure, Exchange, PKI, Cisco, SAN, and whatever else you happen to have in your environment.  He or she will likely not know these products in detail, and as such relies on tier 1 and 2 to investigate issues, as well as on tier 3’s input as problems are uncovered.  The problem is further complicated by processes that need to change, since actions by other IT administrators and engineers can lead to additional alerts for reasons as simple as not putting an object into maintenance mode before rebooting it.
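That last example is also one of the easiest process fixes to automate: the team doing the reboot can put the server into maintenance mode first.  A minimal PowerShell sketch, again assuming the OperationsManager module and using a placeholder server name:

```powershell
# Minimal sketch: put a server into maintenance mode for 30 minutes ahead of a
# planned reboot so the reboot does not generate alerts.
# "SERVER01.contoso.com" is a placeholder name.
Import-Module OperationsManager

$instances = Get-SCOMClassInstance -DisplayName "SERVER01.contoso.com"

Start-SCOMMaintenanceMode -Instance $instances `
                          -EndTime (Get-Date).AddMinutes(30) `
                          -Reason PlannedOther `
                          -Comment "Planned reboot for monthly patching"
```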

As such, an alert management lifecycle is necessary to handle the end-to-end life of an alert, whether that is creation through resolution in the case of a real problem or the tuning of alerts to reduce noise.

Part 2:  Process Blockers to Good Alert Management.

Part 3:  Completing the Alert Management Life Cycle.


Comments (5)

  1. anonymouscommenter says:

    Sometimes – this is almost a dirty word in some companies. It is applying an ITSM process around monitoring

  2. Nathan Gau says:

    yep, and that can be easier said than done I'd add too.

  3. BENNY says:

    Hi Nathan - Good article, I have been raising the fact that there is no organisational best practise with my TAM for a long time. If some of your best practise advice could be reflected in the interface and product development then you will be on to a winner. Do the product team ever think about making those interfaces more usable, faster, management packs easier to understand or the whole Boolean nature of the system a bit easier to follow in order to attract the end users in? Without increasing the stickiness of the console, those email alerts will continue to sit in the folders marked "ignore" and SCOM will continue to be the reason why system failure went unnoticed.

  4. Nathan Gau says:

    Hi Benny,

    I think the biggest best practice is that you have to have all of your teams in the console and hold them accountable for their stuff. The SCOM administrators can make their lives a bit easier to the extent that we can create custom views for them so that they aren't seeing the entire mess, so to speak. But as for best practices, here's what I recommend:
    1) Your admins aren't SCOM admins by default. Create custom roles for them based on their responsibilities and scope views to just what they need to see. That part is easy.
    2) Get them in it and using it. They become accountable for what shows up in their views. You can work tier 1/2 in based on your org structure.
    3) Establish a feedback process that can be used to get rid of noise. The SharePoint guy will know which of those app pools are supposed to be stopped. That information is fed back to the SCOM admin and an override is created.

    That's the high level. I try to save emails for just the really important items; otherwise, all we've really done is create a spam server. I covered this in a bit more detail in the other parts. That said, I don't think the product team is going to be changing the UI. I've seen TP4 of SC2016 and it's pretty much the same.

    The management packs are a bit different. They aren't written by the SCOM product team, but by the other product teams (and some I believe are outsourced as well, but I'm not sure on that). You will see changes to MP design from time to time (for the better or for the worse).

    As for massive changes to it, I'm not seeing it.

  5. Bob Hyatt says:

    Great series! It's good to know that we are not alone.
