SCOM Event Remediation

Caution
Test the script(s), processes and/or data file(s) thoroughly in a test environment, and customize them to meet the requirements of your organization before attempting to use them in a production capacity. (See the legal notice here)

 

Note: The workflow sample mentioned in this article can be downloaded from the Opalis project on CodePlex:  https://opalis.codeplex.com

 

Overview

The SCOM Event Remediation workflow is a simple sample that shows how one could orchestrate a repair process for an event detected in Microsoft System Center Operations Manager. The workflow is built around the following use case:

  1. Watch for an alert in Operations Manager indicating that the “Automatic Update” service has been started. This is relevant because, in this (fictional) use case, company policy states that the service must be stopped and disabled. The choice of service is arbitrary and for illustrative purposes only.

  2. If the alert fires, verify that the service is in fact up and running.

  3. If it’s up and running, try to stop it. Wait for 20 seconds and verify it has stopped.

  4. If the service is still up after waiting 20 seconds, disable the service, stop it, wait 20 seconds and check whether the service has now stopped.

  5. If the service is STILL up, the workflow recognizes that enforcing the policy is not possible and records this in the Operations Manager alert.
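
The five steps above can be sketched in code to make the control flow explicit. This is only an illustrative Python sketch, not the Opalis workflow itself: the `ServiceOps` hooks, the `remediate` function and the returned status strings are all hypothetical names standing in for the corresponding Opalis activities, and the behavior when the service turns out not to be running in step 2 is an assumption.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class ServiceOps:
    """Hypothetical hooks; in the real workflow these are Opalis activities."""
    is_running: Callable[[], bool]        # query the service state
    stop: Callable[[], None]              # ask the service to stop
    disable: Callable[[], None]           # set the service start type to disabled
    update_alert: Callable[[str], None]   # write a note to the Operations Manager alert


def remediate(ops: ServiceOps, wait_seconds: float = 20.0,
              sleep: Callable[[float], None] = time.sleep) -> str:
    """Multi-phase remediation: stop, then disable and stop, then give up."""
    # Step 2: verify the service really is running (assumed: otherwise do nothing).
    if not ops.is_running():
        return "not_running"

    # Line-of-sight: tell console users that remediation is in progress.
    ops.update_alert("Automated remediation started.")

    # Step 3: try a plain stop, wait, then verify.
    ops.stop()
    sleep(wait_seconds)
    if not ops.is_running():
        ops.update_alert("Service stopped.")
        return "stopped"  # branch out early on success

    # Step 4: disable the service, stop it again, wait, then verify.
    ops.disable()
    ops.stop()
    sleep(wait_seconds)
    if not ops.is_running():
        ops.update_alert("Service disabled and stopped.")
        return "disabled_and_stopped"

    # Step 5: the policy could not be enforced; record that in the alert.
    ops.update_alert("Remediation failed: service is still running.")
    return "failed"
```

Passing `sleep` as a parameter is a design choice for the sketch only: it lets the 20-second waits be replaced with a no-op when trying the logic out.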

The sample highlights a few key features associated with orchestrating such a process:

  1. The workflow is a classic example of “Run Book Automation” in that it takes operations procedures that would normally be carried out by people and replaces that work with automation and integration.

  2. The workflow shows how a remediation process can interact with Operations Manager to provide line-of-sight remediation. This means that it updates Operations Manager so that people looking at the Operations Manager console can see that Opalis has initiated a remediation process and allow that process to complete before taking additional action.

  3. The workflow shows a multi-phase remediation process, meaning that it tries to resolve the issue in a number of different ways before giving up.

  4. Branching is used to terminate the workflow as soon as the remediation process succeeds in one of the earlier steps.

Workflow Walk-Through

The workflow itself is simple and, with a moderate amount of tweaking, should work in most environments. Some key things to note in the workflow:

  1. Links that say “20s Delay” actually insert a delay into the workflow. A time delay is configured within a link condition. Recall that link conditions are simply the terms under which one activity initiates the next activity in a workflow. A time delay is one such term, which is to say that both logical and timing terms govern the flow of execution from one activity to another.

  2. There is no Foundation Object to disable a service on a Windows server. Notice how a Windows “Run Command” activity is used to accomplish this task. Foundation Activities can frequently be used in a generic way to achieve the same result as a pre-built activity for a given task.
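
To illustrate the kind of command a “Run Command” activity might execute here, the sketch below builds `sc.exe` command lines for disabling and stopping a service. The article does not show the actual command used in the sample, so this is an assumption: the helper names are hypothetical, `wuauserv` is assumed to be the service name behind “Automatic Update”, and `SRV01` is a made-up server name.

```python
from typing import List, Optional


def _sc_base(server: Optional[str]) -> List[str]:
    """Start an sc.exe argument list, optionally targeting a remote server."""
    cmd = ["sc"]
    if server:
        cmd.append("\\\\" + server)  # sc expects the form \\ServerName
    return cmd


def sc_disable_command(service: str, server: Optional[str] = None) -> List[str]:
    """Arguments that set the service start type to disabled."""
    # sc requires "start=" and its value to be separate tokens.
    return _sc_base(server) + ["config", service, "start=", "disabled"]


def sc_stop_command(service: str, server: Optional[str] = None) -> List[str]:
    """Arguments that ask the service to stop."""
    return _sc_base(server) + ["stop", service]


if __name__ == "__main__":
    # Assumed service and server names, for illustration only.
    print(" ".join(sc_disable_command("wuauserv", "SRV01")))
    print(" ".join(sc_stop_command("wuauserv", "SRV01")))
```

Joined with spaces, each list gives a single command line that could be pasted into a command prompt for testing before it is used in a workflow.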


 

 
