Monitoring all services on all computers (service outage report)

Occasionally customers will ask how we can monitor all services that are configured to automatically start on all agent-managed computers. This is a good question and one that does have some merit. In reality, coming up with the right solution can be a challenge.

 

Getting the state of every service running on every monitored computer would increase the management group instance space significantly, because we would need to discover every service on every computer. I would stay away from this, especially if you are already reaching capacity limits in your environment, since it adds considerably more load. So leveraging Microsoft.Windows.Win32ServiceInformationProbe would not be the best solution in this case.

 

Let’s say we only care about generating an alert and are not concerned with unique service state. This is possible by using the Microsoft.Windows.WMIProvider module. I did some testing with the WMI provider and for some reason it periodically returned incorrect results, showing a service as not running when it actually was. I didn’t dig deep into why this was happening, but it could be very noisy if it occurred frequently in an environment of any size. If I were to use this solution to generate alerts, I would certainly use a consecutive sample module to reduce the likelihood of generating a high volume of false alerts.
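To illustrate why consecutive sampling helps, here is a minimal Python sketch of the idea. It is not the SCOM module itself; the function name and the threshold of 3 are assumptions chosen for illustration only.

```python
from typing import Iterable, List


def consecutive_stopped_alerts(samples: Iterable[bool], threshold: int = 3) -> List[int]:
    """Return the sample indexes at which an alert would fire.

    samples   -- one boolean per polling interval, True when the service
                 was reported as not running
    threshold -- hypothetical number of consecutive "not running" samples
                 required before alerting
    """
    alerts = []
    streak = 0
    for index, stopped in enumerate(samples):
        if stopped:
            streak += 1
            # Fire once, at the moment the streak reaches the threshold.
            if streak == threshold:
                alerts.append(index)
        else:
            # A single healthy sample resets the count, so an isolated
            # false "not running" result never reaches the threshold.
            streak = 0
    return alerts


# Made-up readings: one spurious result at index 1, a real outage from index 4.
readings = [False, True, False, False, True, True, True, True]
print(consecutive_stopped_alerts(readings))  # prints [6]
```

The isolated bad reading at index 1 never alerts, while the sustained outage alerts once, on its third consecutive sample.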

 

Another alternative to generate alerts is using a script-based solution. I know there are some customers that have successfully implemented a script-based solution, and I haven’t heard any major complaints. If you need to generate alerts about service outage without discovering unique instances (no state), I think a script-based data source is the way to go.

 

With these options in mind, I’m still not convinced that a catch-all approach, alerting on any service whether or not it affects application availability, covers any particular monitoring requirement. A scenario like this is likely to generate a high volume of alerts about services you might not really care about. As management pack authors and SCOM administrators, we work diligently to avoid this, because it directly affects the monitoring experience for the customer, usually the tier 1 operator.

 

Is there a way we can proactively monitor all services on all computers and keep the noise out of the console? A collection and reporting solution would solve both problems.

  

The idea would be to create a script-based data source that returns the service name and an integer value for each service that is not running. Map these properties to a performance data type and store the samples in the database. Now we can use this performance data to develop a service availability report. It might sound strange to use performance data for an availability report, but there is nothing stopping us from being creative with the data we collect. How this works will become clearer by the end.

  

If we create a collection rule to sample once per hour, we could create a report based on this performance data that would count the number of hours the service was not running each day. This would give us visibility into any services that had stopped, when the service had stopped, and also offers additional information about outage duration.
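The counting step can be sketched in a few lines of Python. This is only an illustration of the aggregation idea, not the query used by the sample report; the function name and the sample data are made up, and each sample is assumed to represent one hour in which the service was found not running.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, Iterable, Tuple


def outage_hours_per_day(
    samples: Iterable[Tuple[datetime, str]]
) -> Dict[Tuple[str, str], int]:
    """Count hourly 'not running' samples per (date, service name)."""
    counts: Counter = Counter()
    for timestamp, service in samples:
        # With one sample per hour, each sample stands for one outage hour.
        counts[(timestamp.date().isoformat(), service)] += 1
    return dict(counts)


# Hypothetical samples collected a few minutes past each hour.
collected = [
    (datetime(2024, 1, 1, 2, 5), "Spooler"),
    (datetime(2024, 1, 1, 3, 5), "Spooler"),
    (datetime(2024, 1, 1, 4, 5), "Spooler"),
    (datetime(2024, 1, 1, 10, 5), "W32Time"),
    (datetime(2024, 1, 2, 0, 5), "Spooler"),
]
for (day, service), hours in sorted(outage_hours_per_day(collected).items()):
    print(day, service, hours)
```

The timestamps also show roughly when each outage began and ended, which is where the outage duration information comes from.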

  

In this particular case of monitoring all services, I believe this is more useful than an alert, because now we can determine whether there are persistent, prolonged issues with a particular service. A historical report empowers you to make a more informed decision to solve a problem, rather than restarting a service that might continue to fail regularly.

  

I’ve attached a sample management pack that has a script-based collection rule and performance data mapper. The management pack also includes a sample report as a starting point for your service outage monitoring needs. I’ll briefly go over the sample MP now.

Stopped Service Collection Rule

 

 

image

 

The collection rule is composed of the script-based data source, which maps the returned properties to the performance data type, and a write action module that writes to the data warehouse only. Writing only to the data warehouse database makes sense here, since the requirement is to generate reports and is not operational in nature.

Data Source Configuration

 

 

image

 

The data source module on the collection rule has three important settings.

 

Interval: This should remain at 3600 seconds (1 hour). Because the report counts each sample as one hour of outage, hourly sampling maps cleanly onto a 24-hour day and keeps the resulting report accurate.

 

SyncTime: This should be set at least a few minutes away from the top of the hour, on either side. If it is too close to the hour, we could potentially collect more than one sample in that hour, which would skew the report.

 

ServicesToExclude: There may be services you always want to exclude. Services like SPTimerV4 and TBS are good candidates, since they are expected to stop on occasion even though they are configured to start automatically. You can specify a comma-delimited list of services to exclude from collection and reporting.
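As a rough illustration of how a comma-delimited exclusion list might be applied, here is a small Python sketch. The helper names are hypothetical, and trimming whitespace and ignoring case are assumptions; the script in the sample MP may handle the list differently.

```python
from typing import Set


def parse_exclusions(services_to_exclude: str) -> Set[str]:
    """Split a comma-delimited list into a set of normalized service names."""
    return {
        name.strip().lower()
        for name in services_to_exclude.split(",")
        if name.strip()  # skip empty entries such as a trailing comma
    }


def is_excluded(service_name: str, exclusions: Set[str]) -> bool:
    """Assumption: service names are compared case-insensitively."""
    return service_name.strip().lower() in exclusions


exclusions = parse_exclusions("SPTimerV4, TBS")
print(is_excluded("tbs", exclusions))      # prints True
print(is_excluded("Spooler", exclusions))  # prints False
```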

Script Performance Data Provider (data source)

 

This is the actual data source type configuration. I provide this for visualization only – there is no need to modify this module.

 

 

image

 

In the above performance data provider, we are collecting the following:

Object: StoppedServicesCollection (static property created in the script)

Counter: NotRunningInd (static property created in the script)

Instance: Service Name (variable set to service name in the script)

Sampled Value: 1 (only store this value if service is not running)

 

If the service is running, the script will not return any properties for it. Only services that are not running produce a sample, which keeps the database space requirements low for this collection rule.
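The mapping above can be sketched in Python as follows. This is not the script from the sample MP; the `PerfSample` type and the dictionary keys are hypothetical, with the `"Auto"` and `"Running"` values modeled on the StartMode and State properties of the WMI Win32_Service class.

```python
from typing import Dict, Iterable, List, NamedTuple


class PerfSample(NamedTuple):
    object_name: str
    counter_name: str
    instance_name: str
    value: int


OBJECT_NAME = "StoppedServicesCollection"  # static object name
COUNTER_NAME = "NotRunningInd"             # static counter name


def stopped_service_samples(services: Iterable[Dict[str, str]]) -> List[PerfSample]:
    """Return one sample per automatic-start service that is not running."""
    samples = []
    for svc in services:
        if svc["start_mode"] != "Auto":
            continue  # only automatic-start services are of interest
        if svc["state"] == "Running":
            continue  # running services return nothing, saving space
        samples.append(PerfSample(OBJECT_NAME, COUNTER_NAME, svc["name"], 1))
    return samples


# Made-up service list for illustration.
services = [
    {"name": "Spooler", "start_mode": "Auto", "state": "Stopped"},
    {"name": "W32Time", "start_mode": "Auto", "state": "Running"},
    {"name": "BITS", "start_mode": "Manual", "state": "Stopped"},
]
for sample in stopped_service_samples(services):
    print(sample)
```

Only Spooler produces a sample here: W32Time is running, and BITS is not set to start automatically.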

Stopped Services Collection and Reporting Reports Folder

 

If you import the sample MP, you should see the Stopped Services Collection and Reporting report folder. Inside that folder are the main Service Outage Report and the Stopped Services Detail report. It is only necessary to launch the Service Outage Report, as the detail report is a drill-through embedded in the main report.

 

image

Service Outage Report Sample

 

Here is an example of the report. Clicking on the Total Unique Services hyperlink in the line detail will drill into more specific information about which services were stopped and for how many hours on that day.

image

Cumulative Outage Duration Hours is the combined outage, in hours, of all services that were stopped on that day.

Cumulative Daily Outage is a formula to arrive at a percentage outage time for the day, as follows:

 

Round((cumulative outage duration / (number of unique services that were stopped * 24)) * 100, 1)
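As a worked example, here is the calculation in Python. It reads the formula as cumulative outage hours divided by the hours available to the stopped services (unique services times 24), expressed as a percentage and rounded to one decimal place. That grouping is an assumption about the intent, and the function name is hypothetical.

```python
def cumulative_daily_outage(cumulative_outage_hours: float,
                            unique_stopped_services: int) -> float:
    """Percentage of the day lost across the services that stopped."""
    if unique_stopped_services <= 0:
        return 0.0  # nothing stopped, so there is no outage to report
    # Each stopped service could have been down for at most 24 hours.
    possible_hours = unique_stopped_services * 24
    return round(cumulative_outage_hours / possible_hours * 100, 1)


# 3 services stopped for a combined 18 hours: 18 / (3 * 24) * 100
print(cumulative_daily_outage(18, 3))  # prints 25.0
# 2 services stopped for a combined 5 hours: 5 / (2 * 24) * 100
print(cumulative_daily_outage(5, 2))   # prints 10.4
```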

Service Outage Detail Drill-Through Report

 

Clicking on the Total Unique Services hyperlink in the line detail generates the drill-through report, providing details of each service that was stopped on that day and for how many hours.

 

 

image

 

There you have it – using performance data for availability reports! Give this a try in your test environments and see if this sparks any new ideas.

 

The attached management pack and included script are provided “as is” with no warranties, and confer no rights. As always, carefully inspect and test any community management pack before installing it in your environment.

Stopped.Services.Collection.Reporting.xml