Writing a service recovery script – Cluster service example


 

I had a customer request the ability to monitor the cluster service on clusters, and ONLY alert when a recovery attempt failed.

This is a fairly standard request for service monitoring when we use recoveries – we generally don’t want an alert to be generated from the Service Monitor, because that will be immediate upon service down detection.  We want the service monitor to detect the service down, then run a recovery, and then if the recovery fails to restore service, generate an alert.

Here is an example of that.

The cluster service monitor is unique in that it already has a built-in recovery.  However, it is too simple for our needs, as it only runs NET START.

image

 

So the first thing we need to do is create an override disabling this built-in recovery:

image

 

Next – override the “Cluster service status” monitor to not generate alerts:

image

 

Now we can add our own script-based recovery to the monitor:

image

 

image

 

Then paste in the script.  Here it is:

'==========================================================================
'
' COMMENT: This is a recovery script to recover the Cluster Service
'
'==========================================================================
Option Explicit
SetLocale("en-us")

Dim StartTime,EndTime,sTime

'Capture script start time
StartTime = Now

'Time that the script starts so that we can see how long it has been watching to see if the service stops again.
Dim strTime
strTime = Time

Dim oAPI
Set oAPI = CreateObject("MOM.ScriptAPI")
Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3750,0,"Service Recovery script is starting")

Dim strComputer, strService, strStartMode, strState, objCount, strClusterService

'The script will always be run on the machine that generated the monitor error
strComputer = "."
strClusterService = "ClusSvc"

'Record the current state of each service before recovery in an event
Dim strClusterServicestate
ServiceState(strClusterService)
strClusterServicestate = strState
Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3751,0,"Current service state before recovery is: " & strClusterService & " : " & strClusterServicestate)

'Stop script if all services are running
If (strClusterServicestate = "Running") Then
  Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3752,2,"All services were found to be already running, recovery should not run, ending script")
  Wscript.Quit
End If

'Check to see if a specific event has been logged previously that means this recovery script should NOT run if event is present
'This section optional and not commonly used
Dim dtmStartDate, iCount, colEvents, objWMIService, objEvent
' Const CONVERT_TO_LOCAL_TIME = True
' Set dtmStartDate = CreateObject("WbemScripting.SWbemDateTime")
' dtmStartDate.SetVarDate dateadd("n", -60, now)' CONVERT_TO_LOCAL_TIME
'
' iCount = 0
' Set objWMIService = GetObject("winmgmts:" _
'   & "{impersonationLevel=impersonate,(Security)}!\\" _
'   & strComputer & "\root\cimv2")
' Set colEvents = objWMIService.ExecQuery _
'   ("Select * from Win32_NTLogEvent Where Logfile = 'Application' and " _
'   & "TimeWritten > '" & dtmStartDate & "' and EventCode = 100")
' For Each objEvent In colEvents
'   iCount = iCount+1
' Next
' If iCount >= 1 Then
'   EndTime = Now
'   sTime = DateDiff("s", StartTime, EndTime)
'   Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3761,2,"script found event which blocks execution of this recovery. Recovery will not run. Script ending after " & sTime & " seconds")
'   WScript.Quit
' ElseIf iCount < 1 Then
'   Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3762,0,"script did not find any blocking events. Script will continue")
' End If

'At least one service is stopped to cause this recovery, stopping all three services so we can start them in order
'You would only use this section if you had multiple services and they needed to be started in a specific order
' Call oAPI.LogScriptEvent("ServiceRecovery.vbs",3753,0,"At least one service was found not running. Recovery will run. Attempting to stop all services now")
' ServiceStop(strService1)
' ServiceStop(strService2)
' ServiceStop(strService3)

'Check to make sure all services are actually in stopped state
' Optional Wait 15 seconds for slow services to stop
' Wscript.Sleep 15000
ServiceState(strClusterService)
strClusterServicestate = strState

'Stop script if all services are not stopped
If (strClusterServicestate <> "Stopped") Then
  Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3754,2,"Recovery script found service is not in stopped state. Manual intervention is required, ending script. Current service state is: " & strClusterService & " : " & strClusterServicestate)
  Wscript.Quit
Else
  Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3755,0,"Recovery script verified all services in stopped state. Continuing.")
End If

'Start services in order.
Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3756,0,"Attempting to start all services")
Dim errReturn

'Restart Services and watch to see if the command executed without error
ServiceStart(strClusterService)
Wscript.sleep 5000

'Check service state to ensure all services started
ServiceState(strClusterService)
strClusterServicestate = strState

'Log success or fail of recovery
If (strClusterServicestate = "Running") Then
  Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3757,0,"All services were successfully started and then found to be running")
Else
  Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3758,2,"Recovery script failed to start all services. Manual intervention is required. Current service state is: " & strClusterService & " : " & strClusterServicestate)
End If

'Check to see if this recovery script has been run three times in the last 60 minutes for loop detection
Set dtmStartDate = CreateObject("WbemScripting.SWbemDateTime")
dtmStartDate.SetVarDate dateadd("n", -60, now)' CONVERT_TO_LOCAL_TIME

iCount = 0
Set objWMIService = GetObject("winmgmts:" _
  & "{impersonationLevel=impersonate,(Security)}!\\" _
  & strComputer & "\root\cimv2")
Set colEvents = objWMIService.ExecQuery _
  ("Select * from Win32_NTLogEvent Where Logfile = 'Operations Manager' and " _
  & "TimeWritten > '" & dtmStartDate & "' and EventCode = 3750")
For Each objEvent In colEvents
  iCount = iCount+1
Next
If iCount >= 3 Then
  EndTime = Now
  sTime = DateDiff("s", StartTime, EndTime)
  Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3759,2,"script restarted " & strClusterService & " service 3 or more times in the last hour, script ending after " & sTime & " seconds")
  WScript.Quit
ElseIf iCount < 3 Then
  EndTime = Now
  sTime = DateDiff("s", StartTime, EndTime)
  Call oAPI.LogScriptEvent("ClusterServiceRecovery.vbs",3760,0,"script restarted " & strClusterService & " service less than 3 times in the last hour, script ending after " & sTime & " seconds")
End If

Wscript.Quit

'==================================================================================
' Subroutine: ServiceState
' Purpose: Gets the service state and startmode from WMI
'==================================================================================
Sub ServiceState(strService)
  Dim objWMIService, colRunningServices, objService
  Set objWMIService = GetObject("winmgmts:" _
    & "{impersonationLevel=impersonate}!\\" & strComputer & "\root\cimv2")
  Set colRunningServices = objWMIService.ExecQuery _
    ("Select * from Win32_Service where Name = '"& strService & "'")
  For Each objService in colRunningServices
    strState = objService.State
    strStartMode = objService.StartMode
  Next
End Sub

'==================================================================================
' Subroutine: ServiceStart
' Purpose: Starts a service
'==================================================================================
Sub ServiceStart(strService)
  Dim objWMIService, colRunningServices, objService, colServiceList
  Set objWMIService = GetObject("winmgmts:" _
    & "{impersonationLevel=impersonate}!\\" & strComputer & "\root\cimv2")
  Set colServiceList = objWMIService.ExecQuery _
    ("Select * from Win32_Service where Name='"& strService & "'")
  For Each objService in colServiceList
    errReturn = objService.StartService()
  Next
End Sub

'==================================================================================
' Subroutine: ServiceStop
' Purpose: Stops a service
'==================================================================================
Sub ServiceStop(strService)
  Dim objWMIService, colRunningServices, objService, colServiceList
  Set objWMIService = GetObject("winmgmts:" _
    & "{impersonationLevel=impersonate}!\\" & strComputer & "\root\cimv2")
  Set colServiceList = objWMIService.ExecQuery _
    ("Select * from Win32_Service where Name='"& strService & "'")
  For Each objService in colServiceList
    errReturn = objService.StopService()
  Next
End Sub

 

Here it is inserted into the UI.  I provide a 3-minute timeout for this one:

 

image

 

Here is how it will look once added:

image

 

Now – we need to generate an alert when the script detects that it failed to start the service:

image

 

Provide a name and we will target the same class as the service monitor:

image

 

For the expression, the event ID comes from the failure event generated by the recovery script (3758 in the script above), and the string search makes sure we only alert on a Cluster service recovery.  If we reuse the script for other services, we need to be able to tell their events apart:

image
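To sanity-check what the rule will match before relying on it, something like the following can be run by hand with cscript on a cluster node. It is only a sketch under assumptions: it assumes the rule looks for event 3758 in the Operations Manager log and that the string searched for is the service name "ClusSvc" (adjust both to whatever the rule actually uses). The 60-minute window and the constant names are arbitrary.

```vbscript
Option Explicit

' Assumptions: 3758 is the failure event logged by the recovery script,
' and "ClusSvc" is the string the alert rule searches for.
Const FAILURE_EVENT_ID = 3758
Const SERVICE_MARKER = "ClusSvc"

Dim objWMIService, colEvents, objEvent, dtmStart, iCount

' Look back 60 minutes (arbitrary window for this sketch)
Set dtmStart = CreateObject("WbemScripting.SWbemDateTime")
dtmStart.SetVarDate DateAdd("n", -60, Now)

Set objWMIService = GetObject("winmgmts:" _
    & "{impersonationLevel=impersonate}!\\.\root\cimv2")
Set colEvents = objWMIService.ExecQuery( _
    "Select * from Win32_NTLogEvent Where Logfile = 'Operations Manager'" _
    & " and EventCode = " & FAILURE_EVENT_ID _
    & " and TimeWritten > '" & dtmStart.Value & "'")

iCount = 0
For Each objEvent In colEvents
    ' Mimic the rule's string search on the event description
    If Not IsNull(objEvent.Message) Then
        If InStr(1, objEvent.Message, SERVICE_MARKER, vbTextCompare) > 0 Then
            iCount = iCount + 1
            WScript.Echo objEvent.TimeWritten & "  " & objEvent.Message
        End If
    End If
Next

WScript.Echo iCount & " matching failure event(s) in the last 60 minutes"
```

If this finds nothing after a forced failure, the event ID or search string in the rule probably does not line up with what the script logs.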

 

 

Let's test!

If we simply stop the Cluster Service, the recovery kicks in, and we see evidence in the state changes and in the event log:

 

image

 

I like REALLY verbose logging in the scripts I write.  More is MUCH better than less, especially when troubleshooting, and recoveries should not run often enough to clog up the logs.

image

image

image

image

 

image

image
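With this much logging, a small wrapper can keep the event calls readable. The following is a minimal sketch and not part of the downloadable MP: `LogStep`, the named constants, and event ID 3799 are made up for illustration, and it assumes the SCOM agent is installed so that `MOM.ScriptAPI` is available. The severity values (0 information, 1 error, 2 warning) are the ones `LogScriptEvent` accepts.

```vbscript
Option Explicit

' Severity values accepted by MOM.ScriptAPI LogScriptEvent
Const EVENT_INFO = 0
Const EVENT_ERROR = 1
Const EVENT_WARNING = 2

' Script name shown in the event description (same one the recovery script uses)
Const SCRIPT_NAME = "ClusterServiceRecovery.vbs"

Dim oAPI
Set oAPI = CreateObject("MOM.ScriptAPI")

' Hypothetical helper: one place to log, so every step gets an event
Sub LogStep(iEventId, iSeverity, strMessage)
    Call oAPI.LogScriptEvent(SCRIPT_NAME, iEventId, iSeverity, strMessage)
End Sub

' 3799 is an arbitrary ID chosen so this test does not collide with the
' IDs the recovery script and the alert rule depend on (3750 to 3762)
LogStep 3799, EVENT_INFO, "Logging helper test: informational event"
LogStep 3799, EVENT_WARNING, "Logging helper test: warning event"
```

Keeping test events on an unused ID matters here, because 3750 feeds the loop detection count and the failure event drives the alert rule.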

 

 

If the recovery fails to start the service – the script detects this – drops a very specific event, and then an alert is generated for the service being down and manual intervention required:

 

image

 

image
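The script decides success or failure from the service state alone. `errReturn` is captured from `StartService()` but never logged. If more detail is wanted in the failure event, the return code can be translated along these lines. This is a sketch under assumptions, not the author's method: it really does attempt to start ClusSvc when run, so try it only on a test node, and it maps only a few of the documented `Win32_Service.StartService` return codes.

```vbscript
Option Explicit

Dim objWMIService, colServiceList, objService, errReturn

Set objWMIService = GetObject("winmgmts:" _
    & "{impersonationLevel=impersonate}!\\.\root\cimv2")
Set colServiceList = objWMIService.ExecQuery( _
    "Select * from Win32_Service where Name = 'ClusSvc'")

If colServiceList.Count = 0 Then
    WScript.Echo "Service ClusSvc was not found on this computer"
    WScript.Quit
End If

For Each objService In colServiceList
    ' Note: this call actually requests a service start
    errReturn = objService.StartService()
    Select Case errReturn
        Case 0
            WScript.Echo "Start request accepted"
        Case 2
            WScript.Echo "Access denied"
        Case 10
            WScript.Echo "Service is already running"
        Case 14
            WScript.Echo "Service is disabled"
        Case Else
            WScript.Echo "StartService returned code " & errReturn
    End Select
Next
```

In the recovery script itself, the same text could be appended to the message of the failure event instead of being echoed.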

 

 

There we have it – we only get alerts if the service is not recoverable.  This makes SCOM more actionable.  If we want a record of this for reporting – we can collect the events for recovery starting, and then report on those events.

You can download this example MP at:

https://gallery.technet.microsoft.com/Cluster-Service-Recovery-270ca2cd


Comments (2)

  1. Olek says:

    Thanks for a nice roundup. Is there a chance for a better out-of-box tools to get this done easily? Now we need to create custom scripts using SCOM API to make SCOM running properly and often this is not worth the hassle, especially in scenarios where
    Infrastructure team doesn’t have a 100% dedicated SCOM engineer. I fill up that job in my company for example, however this comes as tertiary role / responsibility hence no ‘professional’ monitoring cannot be done within this scope.

  2. Peter Zdovc says:

    Hi Kevin, this is super example.

    Another approach to this could be similar to how it’s done in Exchange 2013 Management Pack. Service monitoring in Exchange 2013 MP is using multiple samples approach. So, alert will only be raised if the monitor detects service down after N samples.

    Peter
