SCOM Three-Tier Application Remediation

Article
05/05/2010

Caution

Test the script(s), process(es) and/or data file(s) thoroughly in a test environment, and customize them to meet the requirements of your organization before attempting to use them in a production capacity. (See the legal notice here)

Note: The workflow sample mentioned in this article can be downloaded from the Opalis project on CodePlex: https://opalis.codeplex.com

Overview

The SCOM Three-Tier Application Remediation workflow is a sample of a complex remediation process. It is meant to demonstrate a classic “Run Book Automation” scenario in which a complex set of procedures (often done manually) is replaced by automation. In this use case, an event in SCOM is identified as one for which a complex application repair procedure exists. The application in question is a “3-tier application”, and accordingly there are separate remediation procedures for the web, application and database tiers. This example shows how all three procedures could be run at the same time (in parallel) rather than one after another. A serial process is also possible; a simple rearrangement of the workflow would support it.

“1. SC Operations Manager Three Tier Remediation”

This is the primary workflow (the “top-line” process). It watches Operations Manager for incoming alerts that match a specific pattern known to be associated with an application outage. Normally, someone would manually execute procedures that attempt to “fix” each tier of the application. Opalis automates this process.

First, the alert is annotated to indicate that remediation is proceeding. This is important to people watching the alert, since it lets them know that action is being taken on the issue. Then the service is “re-tested” several times to rule out a false alarm or an intermittent issue. If the application remains down after retesting, the remediation process begins. The “Get Server Names” activity is a Map Publish Data foundation activity. It looks at the name of the application and returns the web, MS SQL Server and application servers associated with it. In some environments this might be replaced with a query to a CMDB. Once the servers to be targeted for remediation are identified, remediation starts in parallel on each tier.
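The retest and server-lookup steps described above can be sketched in Python to make the control flow concrete. This is only an illustration under stated assumptions, not the Opalis implementation: the names `SERVER_MAP`, `get_server_names` and `confirm_outage`, the sample application "OrderEntry" and its server names are all hypothetical, and the probe is passed in as a callable rather than tied to any real monitoring API.

```python
import time
from typing import Callable, Dict

# Hypothetical lookup table standing in for the "Get Server Names" mapping.
# In a real environment this data might come from a CMDB query instead.
SERVER_MAP: Dict[str, Dict[str, str]] = {
    "OrderEntry": {"web": "web01", "app": "app01", "database": "sql01"},
}


def get_server_names(application: str) -> Dict[str, str]:
    """Return the web, application and database servers for an application."""
    if application not in SERVER_MAP:
        raise LookupError(f"no server mapping for application {application!r}")
    return SERVER_MAP[application]


def confirm_outage(probe: Callable[[], bool], attempts: int = 3,
                   delay_seconds: float = 0.0) -> bool:
    """Re-test the service; return True only if every attempt fails."""
    for attempt in range(attempts):
        if probe():
            # The service answered: treat the alert as a false alarm
            # or an intermittent issue and skip remediation.
            return False
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return True
```

A caller would run `confirm_outage` first and only look up the servers and start remediation when it returns `True`.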

The Junction activity guarantees that the post-remediation re-test won’t be run until all three tiers have finished the remediation process. Once remediation has finished for each tier (successfully or otherwise), the application is re-tested. If the retest fails, a new alert isn’t created. Rather, we update the existing alert to indicate that manual intervention is required.
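The Junction behaviour is essentially a "wait for all" barrier, which a short Python sketch can illustrate. It assumes each tier's remediation is a callable that returns True on success; `remediate_all_tiers` and `post_remediation_action` are hypothetical names, and the "resolved" outcome for a passing retest is an assumption, since the article only describes what happens when the retest fails.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def _run_safely(action: Callable[[], bool]) -> bool:
    """Run one tier's remediation; a raised exception counts as a failure."""
    try:
        return bool(action())
    except Exception:
        return False


def remediate_all_tiers(remediations: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run every tier's remediation in parallel and wait for all of them."""
    with ThreadPoolExecutor(max_workers=max(1, len(remediations))) as pool:
        futures = {tier: pool.submit(_run_safely, action)
                   for tier, action in remediations.items()}
        # Collecting every result plays the role of the Junction: nothing
        # after this line runs until all tiers have finished.
        return {tier: future.result() for tier, future in futures.items()}


def post_remediation_action(retest: Callable[[], bool]) -> str:
    """Re-test after all tiers finish, whether or not they succeeded."""
    if retest():
        return "resolved"  # assumption: the article does not name this case
    # No new alert is raised; the existing one is updated instead.
    return "update_existing_alert"
```

Running the tiers serially instead would only require replacing the executor with a plain loop, which mirrors the article's remark that a serial arrangement is a simple rearrangement of the workflow.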

The remediation process for each tier is a child workflow. There are actually four (4) child workflows.

“2. Remediate Database”

The remediation steps for a database are so specific to each environment that this example cannot encapsulate every possible scenario. Hence most of this workflow would need to be edited in order for it to be useful. Look instead at the overall form and structure. This workflow would reflect “captured knowledge” from a DBA team: “what do you do when you need to fix the database?” In this case the DBA team might have said that they run a test query (“select count (*) from…”, for example) and, if this query is slow, the indices on a handful of tables are dropped and re-created. Once that is done, the MS SQL Server service is restarted, and once the database is back, one final test is performed to see whether performance has returned to normal. If not, a Service Desk ticket is created for the database tier.
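The shape of that DBA procedure can be sketched as follows. This is a structural illustration only: every step is an injected callable, so no real database or service-control API is implied, and the names `DatabaseSteps`, `query_is_slow`, `remediate_database` and the two-second threshold are hypothetical. Returning without action when the first test query is fast is also an assumption; the article does not describe that branch.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class DatabaseSteps:
    """Environment-specific actions supplied by the DBA team (hypothetical)."""
    run_test_query: Callable[[], None]
    rebuild_indexes: Callable[[], None]   # drop and re-create the indices
    restart_service: Callable[[], None]   # restart the SQL Server service
    create_ticket: Callable[[str], None]


def query_is_slow(run_test_query: Callable[[], None], threshold_seconds: float) -> bool:
    """Time the test query and compare it with the allowed threshold."""
    started = time.perf_counter()
    run_test_query()
    return time.perf_counter() - started > threshold_seconds


def remediate_database(steps: DatabaseSteps, threshold_seconds: float = 2.0) -> str:
    if not query_is_slow(steps.run_test_query, threshold_seconds):
        return "healthy"  # assumption: nothing to do when the query is fast
    steps.rebuild_indexes()
    steps.restart_service()
    # One final test to see whether performance has returned to normal.
    if query_is_slow(steps.run_test_query, threshold_seconds):
        steps.create_ticket("Database tier still slow after remediation")
        return "ticket_created"
    return "remediated"
```

Keeping the steps as callables reflects the point made above: the sequence is reusable, while the individual actions have to be replaced for each environment.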

“3. Remediate IIS”

This very simple workflow checks that IIS (the W3SVC service) is up and running. If it is not, several attempts are made to restart it. If those attempts fail, the failure is noted in a Service Desk ticket sent to the Web Tier.
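The check, retry and ticket pattern described here can be sketched as a small function. The status check, restart and ticket actions are injected callables, so the sketch makes no claim about how Opalis or Windows performs them; `remediate_service` and the default of three attempts are hypothetical choices.

```python
from typing import Callable


def remediate_service(is_running: Callable[[], bool],
                      restart: Callable[[], None],
                      create_ticket: Callable[[str], None],
                      max_attempts: int = 3,
                      service_name: str = "W3SVC") -> bool:
    """Return True if the service is running, restarting it if needed."""
    if is_running():
        return True
    for _ in range(max_attempts):
        restart()
        if is_running():
            return True
    # Every restart attempt failed: hand the problem to the Web Tier.
    create_ticket(f"{service_name} could not be restarted "
                  f"after {max_attempts} attempts")
    return False
```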

“4. Remediate Application”

This workflow tests a web service associated with an application (by calling the service directly via a SOAP call). If the web service fails, the Windows service that supports it is restarted, perhaps via a PowerShell script or a Windows command shell execution. If the restart fails, a ticket is created with the Application Support team.

“5. Create Service Desk Incident”

This is a shell workflow. One would add to it the Opalis activities needed to create a Service Desk ticket. This keeps the ticketing step modular, so the workflow can be extended to support a wide range of ticketing and notification solutions.
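The modularity described here resembles a pluggable interface: the tier workflows call one entry point, and the ticketing system behind it can be swapped. A minimal Python sketch of that idea follows. `IncidentBackend`, `InMemoryBackend`, `create_service_desk_incident` and the "INC-" numbering are hypothetical and do not correspond to any particular Service Desk product.

```python
from typing import List, Protocol, Tuple


class IncidentBackend(Protocol):
    """Anything that can open an incident and return its identifier."""
    def create_incident(self, tier: str, summary: str) -> str: ...


class InMemoryBackend:
    """Stand-in backend for testing; a real one would call a ticketing system."""
    def __init__(self) -> None:
        self.incidents: List[Tuple[str, str]] = []

    def create_incident(self, tier: str, summary: str) -> str:
        self.incidents.append((tier, summary))
        return f"INC-{len(self.incidents):04d}"


def create_service_desk_incident(backend: IncidentBackend,
                                 tier: str, summary: str) -> str:
    """Single entry point used by every tier's remediation workflow."""
    return backend.create_incident(tier, summary)
```

Supporting another ticketing or notification tool then means writing one more class with a `create_incident` method, without touching the callers.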
