New metric measurement alert rule type in Public Preview!

Hi folks, Anurag here. Today I want to talk about a powerful new alerting capability for Operations Management Suite (OMS): metric measurement alerts.

Traditionally, OMS alerts have only used the number of results returned from Log Search to provide alerting capabilities. With metric measurement alerts, we now allow broad alert rule definitions across a group of objects with the ability to evaluate a threshold and raise alerts on single objects. This new capability also comes with more granular trigger conditions such as single or consecutive breaches.

To provide more context about this new feature, the following table shows which alert type applies to each scenario.

Examples of alerts

Type of alert       Scenario
Number of results   Send Alert if Computer A’s CPU is ever above 90%
Number of results   Send Alert if Computer A’s average CPU is ever above 90%
Metric measurement  Send Alert if Computer A’s average CPU goes above 90% twice over 2 hours
Metric measurement  Send Alert if Computer A’s average CPU goes above 90% twice in a row over 2 hours
Number of results   Send Alert if Computer Group A’s average CPU is ever collectively above 90%
Metric measurement  Send Alert PER Computer if any Computer in Computer Group A’s average CPU is above 90%
Metric measurement  Send Alert PER Computer if any Computer in Computer Group A’s average CPU is above 90% 3 times over 2 hours
Metric measurement  Send Alert PER Computer if any Computer in Computer Group A’s average CPU is above 90% 2 times in a row over 2 hours

Create a metric measurement alert

You create a metric measurement alert through the same workflow as a traditional number of results alert. Any of the on-demand aggregation queries used for performance metrics also work. See the "On-demand metric aggregation and visualization in OMS" blog post for information about how to craft these on-demand aggregation queries.

Requirements

The main difference when you define a metric measurement alert is that the query must meet the following two requirements:

  • “measure” statement – Metric measurement alerts require a grouping on a field to indicate what object to alert on.

Ex: Type=Event | measure count() by Computer interval 5 minute

Ex: Type=Perf ObjectName=Processor CounterName= | measure avg(CounterValue) by Computer interval 2minute

  • “interval” statement – This sets the sampling interval over which your metric data is aggregated.

Ex: Type=Perf ObjectName=Process CounterName="% Processor Time" | measure avg(CounterValue) by InstanceName interval 3minute

Ex: Type=W3CIISLog | measure avg(TimeTaken) by Computer interval 30minute
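To make the roles of the two statements concrete, here is a minimal Python sketch of the idea behind "measure avg(...) by ... interval ...": raw records are bucketed by the grouping field and by the interval they fall into, and each bucket becomes one sample. This is only an illustration, not how OMS implements it; the function name, the record layout (minute offset, computer, value), and the computer names are hypothetical.

```python
from collections import defaultdict


def measure_avg(records, interval_minutes):
    """Average the value per (group, interval bucket).

    records: iterable of (minute_offset, group, value) tuples.
    Returns a dict mapping (group, bucket_start_minute) to the average value.
    """
    totals = defaultdict(lambda: [0.0, 0])  # (group, bucket) -> [sum, count]
    for minute, group, value in records:
        # Floor the timestamp to the start of its sampling interval.
        bucket = (minute // interval_minutes) * interval_minutes
        entry = totals[(group, bucket)]
        entry[0] += value
        entry[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Hypothetical raw CPU readings: (minute offset, computer, counter value).
records = [
    (0, "web01", 80.0), (1, "web01", 90.0), (2, "web01", 100.0),
    (3, "web01", 40.0), (0, "web02", 10.0), (4, "web02", 30.0),
]

# Comparable in spirit to: measure avg(CounterValue) by Computer interval 3minute
samples = measure_avg(records, interval_minutes=3)
for (computer, bucket), average in sorted(samples.items()):
    print(computer, bucket, average)
```

Each (computer, interval) pair yields one aggregated sample, and those samples are what the threshold and trigger conditions are evaluated against.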

Additional steps

  1. Switch Alert Type toggle from Number of Results to Metric Measurement.

Select the Metric Measurement alert type

  2. Set the threshold.

The threshold is based on the metric aggregation from the query. For example, if you are using Memory as the metric and want to alert if Memory is less than 1 GB, set the threshold to Less Than and the value to 1000 (assuming the metric is reported in MB).

Pro-Tip: Open two tabs, one with Log Search and the metric chart, and the other with the Alert Creation page. In the future, we plan to integrate visualizations straight into the alert creation process.

  3. Choose Trigger conditions.

Metric measurement alerts let you define trigger conditions at a granular level. The two options are Total Breaches and Consecutive Breaches.

Total Breaches: The alert fires when X out of Y samples exceed the threshold. For example, if the sampling interval is 15 minutes and the time window is 60 minutes, there are 60/15 = 4 samples to evaluate. If we set the trigger condition to greater than 2 total breaches, an alert fires if at least 3 of the 4 samples are greater than the threshold.

On the following chart, if the threshold is set to 15 and the trigger condition is set to greater than 1 total breach, the alert fires because there are two violating points in the specified time window.

Graph that shows two violating points in the specified time window
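The Total Breaches logic can be sketched in a few lines of Python. This is an illustrative sketch rather than the service's actual code; the function names and the sample values are made up (the values are not read from the chart), chosen so that two points exceed a threshold of 15.

```python
def samples_in_window(window_minutes, interval_minutes):
    """Number of aggregated samples available in one time window."""
    return window_minutes // interval_minutes


def total_breaches_fire(samples, threshold, more_than):
    """Fire when the count of samples above the threshold is greater than `more_than`."""
    breaches = sum(1 for value in samples if value > threshold)
    return breaches > more_than


# A 60-minute window with a 15-minute interval gives 4 samples.
sample_count = samples_in_window(60, 15)

# Hypothetical aggregated values for one object over the window.
window = [12.0, 18.0, 9.0, 21.0]

fires_gt_1 = total_breaches_fire(window, threshold=15, more_than=1)
fires_gt_2 = total_breaches_fire(window, threshold=15, more_than=2)
print(sample_count, fires_gt_1, fires_gt_2)
```

With two violating points, "greater than 1 total breach" fires, while "greater than 2 total breaches" does not.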

Consecutive Breaches: The alert fires when X consecutive samples exceed the AggregateValue threshold. The time window is less important in this case. For example, if my trigger condition is greater than 2 consecutive breaches, an alert is raised if the last 3 samples are all greater than the AggregateValue threshold.

In the following graph, if I set my threshold to 10 and my trigger condition to greater than 5 consecutive breaches, the alert would fire for the given time window because there are 6 consecutive violations.

Graph that shows 6 consecutive violations when threshold is 10 and trigger condition is greater than 5 consecutive breaches
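A comparable Python sketch for Consecutive Breaches follows. It assumes the condition is met by the longest run of breaching samples anywhere in the evaluated window; the service's exact evaluation may differ. The function names and sample values are hypothetical, chosen to give 6 consecutive violations of a threshold of 10.

```python
def longest_breach_run(samples, threshold):
    """Length of the longest run of consecutive samples above the threshold."""
    longest = current = 0
    for value in samples:
        # Extend the run on a breach, reset it otherwise.
        current = current + 1 if value > threshold else 0
        longest = max(longest, current)
    return longest


def consecutive_breaches_fire(samples, threshold, more_than):
    """Fire when the longest breach run is greater than `more_than`."""
    return longest_breach_run(samples, threshold) > more_than


# Hypothetical aggregated values: six samples in a row exceed 10.
window = [8, 12, 14, 11, 13, 15, 12, 9]

run_length = longest_breach_run(window, threshold=10)
fires_gt_5 = consecutive_breaches_fire(window, threshold=10, more_than=5)
fires_gt_6 = consecutive_breaches_fire(window, threshold=10, more_than=6)
print(run_length, fires_gt_5, fires_gt_6)
```

A single non-breaching sample resets the run, which is what distinguishes this condition from Total Breaches.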

Because metric measurement alerts are evaluated for each unique object in the grouping, we get a unique alert for each object. This also means that actions such as email, runbook, or webhook are initiated each time an alert fires.

Additionally, you can group by the specific computer field. This field is then available in the alert record in search.

View an alert for a specific computer
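The per-object behavior described above can be sketched as follows: each grouped object carries its own series of samples, the trigger condition is checked per series, and every object that breaches produces its own alert record. This is a rough Python sketch under assumptions; the function name, the record fields ("object", "breaches"), and the computer names are hypothetical and do not reflect the actual alert record schema.

```python
def alerts_per_object(series_by_object, threshold, more_than):
    """Evaluate a total-breaches condition separately for each grouped object.

    Returns one alert record per object whose breach count is greater
    than `more_than`, ordered by object name.
    """
    alerts = []
    for name, samples in sorted(series_by_object.items()):
        breaches = sum(1 for value in samples if value > threshold)
        if breaches > more_than:
            # In the service, actions such as email would run once per alert.
            alerts.append({"object": name, "breaches": breaches})
    return alerts


# Hypothetical average CPU samples per computer over one time window.
series = {
    "web01": [95, 97, 60, 92],
    "web02": [40, 45, 50, 42],
    "db01": [91, 30, 93, 94],
}

alerts = alerts_per_object(series, threshold=90, more_than=2)
for alert in alerts:
    print(alert)
```

One rule defined over the whole group yields separate alerts for db01 and web01 here, while web02 stays quiet.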

Example queries for alerts

Each alert rule description below is followed by its query.

Alert if any computer talks to a malicious IP X times:
MaliciousIP=* | measure count() by Computer interval 1minute

Alert if any Windows or Linux CPU % is greater than X:
Type:Perf ObjectName=Processor CounterName="% Processor Time" | measure avg(CounterValue) by Computer, InstanceName interval 5minutes

Alert if any Windows or Linux Memory Used % is greater than X:
Type:Perf ObjectName=Memory (CounterName="% Used Memory") | measure avg(CounterValue) by Computer interval 5minutes

Alert if any Windows or Linux agent has missing Security Updates:
Type:Update AND Classification="Security Updates" UpdateState=Needed Optional=false | measure count() by Computer interval 12hours