State Based Vs. Stateless monitoring

This is more of a philosophical post I wanted to write based on some recent conversations I have had in my team when trying to educate some people on SCOM and management packs. There are different opinions on this and these are only mine. Feel free to disagree and feel free to post some comments if you do.

Within SCOM you have two overarching choices for monitoring – state based and stateless. Let’s look at each, with its benefits and drawbacks, and then I will get to the crux of things: why I think you should not try to force state based monitoring on something whose state you cannot really model.

Both of these approaches rely on some model of an application and discovery of that application and its components. I am not going to focus on this end to end except where it can impact your choice of stateless or state based monitoring. Also, since notification can be used in both cases, I will leave that out of the argument.

Stateless Monitoring

I will start with stateless. In this model you are going to be using rules to generate alerts. The model here is that a rule responds to an interesting event and raises an alert to an operator. The rule can optionally:

  • Suppress or limit future alerts while the original alert is still open
  • Run some action such as a script as well as generate the alert (note this cannot be added to a rule that was shipped in a sealed MP)

Rules are simple and were with us long before any type of state based monitoring was provided (and I am sure they will be with us a lot longer). It is simple – I see something I care about and I raise an alert.

As with all monitoring in SCOM, stateless monitoring is targeted at a class. You may or may not go to a granular level of monitoring with stateless monitoring, and in many ways deep monitoring is less interesting since you are not going to be controlling state. For example, suppose I have a simple management pack and I model an application X with one or more components. My instance space may look like this.

[Image: instance space with the App X class and its App X Component class]

Now suppose I want to target some stateless monitoring at the App X class. This is simple: I create some rules and target them at the App X class. This ensures that they only run where I have discovered an instance of App X.

For the App X component class I have a couple of choices with stateless monitoring. I could either model things as above and target my rules at App X component. I would then use some property of the App X component class as criteria for my monitoring so I can tell which instrumentation is for which instance e.g.

  • Rule target: App X Component
  • Rule criteria: Event 101 from event source Application X where Param1 is Component ID
  • Suppression: Suppress on workflow

I will have at most one alert active for any instance of App X component although I could choose to have more by changing the suppression logic in my rule.

However since I am not monitoring the state of the Application X component you could ask what this deeper modeling is actually buying you other than more logic in your discovery. You could equally simplify your model to this and do away with your component class:

[Image: simplified model with only the App X class]

Now you could create the same rule as follows:

  • Rule target: App X
  • Rule criteria: Event 101 from event source Application X where Param1 is Component ID
  • Suppression: Suppress on Param1 (component ID)

As with the previous case I will still have at most one alert active for any instance of App X component at a time because I handled this in my suppression. However, I have not modeled the App X component.
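
To make the two suppression choices concrete, here is a minimal Python sketch. It is not SCOM code: the `Event` type and the `suppress_alerts` function are hypothetical names invented for illustration, and suppression is modeled simply as a key. An event whose key matches an open alert is counted against that alert instead of raising a new one.

```python
from collections import namedtuple

# Hypothetical event shape: the managed object that saw it, event ID, Param1.
Event = namedtuple("Event", ["target", "event_id", "param1"])


def suppress_alerts(events, key_fn):
    """Return open alerts as {suppression key: number of matching events}.

    An event whose key matches an open alert is counted against that
    alert instead of raising a new one.
    """
    open_alerts = {}
    for event in events:
        if event.event_id != 101:  # rule criteria: only event 101 matters
            continue
        key = key_fn(event)
        open_alerts[key] = open_alerts.get(key, 0) + 1
    return open_alerts


# Deep model: the rule targets App X Component, so each component instance
# runs its own workflow and suppression is per workflow (per target).
deep = [
    Event("Comp1", 101, "Comp1"),
    Event("Comp2", 101, "Comp2"),
    Event("Comp1", 101, "Comp1"),
]
per_workflow = suppress_alerts(deep, key_fn=lambda e: e.target)

# Shallow model: the rule targets App X, so every event arrives at the same
# target and suppression is keyed on Param1 (the component ID) instead.
shallow = [
    Event("AppX", 101, "Comp1"),
    Event("AppX", 101, "Comp2"),
    Event("AppX", 101, "Comp1"),
]
per_param1 = suppress_alerts(shallow, key_fn=lambda e: e.param1)

print(per_workflow)
print(per_param1)
```

In both cases there is one open alert per component, which is the point of the comparison: the alerting outcome is the same, and only the model behind it differs.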

While the monitoring is effectively the same for this case there are other benefits to modeling to a deeper level. Some of these are as follows:

  • You can pivot to alerts / events / performance about a specific object in the console
  • You can report on monitoring data at a more granular level
  • You can create tasks against a component that use component properties to run

So you have the option of what you want to do and achieve. The point about stateless monitoring is that you are not forced to monitor to a deep level. Hold this thought as I will show you that with state based monitoring you may actually be forced to model deeper than you want.

With stateless monitoring, once an alert is generated, there is no understanding of when the problem is resolved. A user (or an automated system like a connector) must close the alert. Depending on the suppression settings of individual rules, new alerts may or may not be generated until this is done. You are pushing the diagnosis and resolution of problems onto the operator of the system, which may indeed be valid if your application has no understanding of its own state.

State Based Monitoring

With state based monitoring you are using monitors as your primary method. As you likely know monitors are very different to rules. Some basic characteristics of normal monitors:

  • Monitors have two or three states
  • States can be automatically determined (instrumentation / timer) or manually set
  • Monitors can optionally alert based on state
  • Monitors can optionally close alerts based on state
  • A maximum of one alert from a single monitor can be active
  • Monitors can roll up to other monitors

When thinking about state based and stateless monitoring choices there are a few areas I want to focus on.

The idea of a monitor is that at any point in time it should know the state of a part of your model. There should be no doubt. When a monitor shows red in the console it should mean there is a problem right now. This is very different to an alert from a rule, which shows only that at some point there was a problem. While this idea sounds great, in practice there are a number of monitor types that break this concept:

  • Timer based reset monitors – the unhealthy state of a monitor can be detected but there is no way to detect a good state. Instead a timer is set so that the monitor will reset after a specified period of time.
  • Manual reset monitors – the unhealthy state of a monitor can be detected but there is no way to detect a good state. The user must manually reset the monitor when this is known to be true.

To me this breaks the whole concept of state based monitoring, and you should have a very good reason for using these types of monitors (there are a few). You don’t understand the state of the system at any given time: in one case you are predicting that the problem may have gone away, and in the other you are forcing a user to manually intervene. Logically both of these should probably have been done by a stateless rule. Monitors do have some benefits over rules though:

  • You can report on availability over time
  • You can populate state views in the console
  • You can run diagnostics and recovery workflows based on a state change
  • Diagnostics and recoveries are extensible by customers even against sealed MPs
  • You can roll up health through a health model on the same object or across discovered relationships

So when you think about these monitors, consider whether the benefits above outweigh the fact that you are not accurately representing the state, and make sure customer expectations are set on how these monitors behave.

To be honest I have seen lots of examples of state based monitors that use manual / timer reset monitors that could actually be rewritten to properly determine the good state. This may involve some sort of polling of system state for the good state. While this may require the definition of a new monitor type and a bit more development time, the benefits to the customer definitely outweigh the effort it will take you to do this.
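
As a rough illustration of the difference, the Python sketch below compares a timer reset monitor with one that polls for the good state. Everything in it is hypothetical (the class names, the `app_is_ok` probe, and the arbitrary time units), and it only models the idea rather than how SCOM implements either monitor type.

```python
class TimerResetMonitor:
    """Goes unhealthy on a bad event and back to healthy after a fixed delay."""

    def __init__(self, reset_after):
        self.reset_after = reset_after
        self.unhealthy_since = None

    def on_bad_event(self, now):
        self.unhealthy_since = now

    def state(self, now):
        if (self.unhealthy_since is not None
                and now - self.unhealthy_since < self.reset_after):
            return "unhealthy"
        return "healthy"


class PolledMonitor:
    """Asks a probe for the real condition every time state is evaluated."""

    def __init__(self, probe):
        self.probe = probe

    def state(self, now):
        return "healthy" if self.probe(now) else "unhealthy"


def app_is_ok(now):
    # Hypothetical ground truth: the application is broken from t=10 to t=100.
    return not (10 <= now < 100)


timer = TimerResetMonitor(reset_after=30)
timer.on_bad_event(now=10)
polled = PolledMonitor(app_is_ok)

for t in (20, 50, 120):
    print(t, timer.state(t), polled.state(t))
```

At t=50 the timer has expired, so the timer reset monitor reports healthy while the application is still broken. The polled monitor stays unhealthy until the probe actually sees the good state.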

In terms of modeling, using monitors may force you to model classes deeper than you want. This may not be a bad thing but at some point you will stop. For example when monitoring a database server you may want to model down to the database level but modeling to the table or even row level with classes is not where you want to go. Using the example above, let’s consider the simple model:

[Image: the simple model with only the App X class]

Now I want to alert on the App X component as before, but assume I do not want to model this level in my application. If I know 101 is the bad event and 102 is the good event, you might assume you can create a monitor like this:

  • Monitor type: 2 state event based
  • Monitor target: App X
  • State 1 (unhealthy): Event 101 from event source Application X
  • State 2 (healthy): Event 102 from event source Application X

Monitors cannot define alert suppression since you get a maximum of one alert from a given monitor.

This monitor will have a major problem. If you have two components (Comp1 and Comp2) consider this:

  • Comp1 throws event 101 – monitor goes unhealthy
  • Comp2 throws event 101 – no state change since monitor is already unhealthy
  • Comp2 throws event 102 – monitor goes healthy

In this flow we are broken. Comp1 is still in a bad state and we are showing healthy for Application X. You may think this is obvious but I have seen this many times in MPs and this is what drove me to write this post.
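
The flow above is easy to replay in a few lines of Python. This is only a sketch: the `replay` function is a hypothetical name, and the `broken` set stands in for the real condition of each component, which the monitor itself cannot see.

```python
BAD_EVENT, GOOD_EVENT = 101, 102


def replay(events):
    """Replay (component, event_id) pairs.

    Returns the state of a single monitor targeted at App X that ignores
    which component sent the event, plus the set of components that are
    really still broken.
    """
    monitor_state = "healthy"
    broken = set()
    for component, event_id in events:
        if event_id == BAD_EVENT:
            monitor_state = "unhealthy"
            broken.add(component)
        elif event_id == GOOD_EVENT:
            monitor_state = "healthy"
            broken.discard(component)
    return monitor_state, broken


state, broken = replay([("Comp1", 101), ("Comp2", 101), ("Comp2", 102)])
print(state, broken)
```

The monitor ends up healthy while Comp1 is still in the broken set, which is exactly the mismatch described above.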

If you are using a monitor you have to be able to aggregate what you are doing to the class the monitor targets. In the above example you would want to consider how you would know, across all components, whether App X is healthy or not healthy. You may not be able to do this using events. The other option is of course to model deeper:

[Image: deeper model with the App X class and its App X Component class]

Now the monitor would be defined as follows:

  • Monitor type: 2 state event based
  • Monitor target: App X Component
  • State 1 (unhealthy): Event 101 from event source Application X where parameter 1 matches Component ID
  • State 2 (healthy): Event 102 from event source Application X where parameter 1 matches component ID

Optionally you might define a dependency monitor on the App X class to roll up health from the components. Now the monitor targeted at the component class above is just responsible for a single component and uses event parameters to filter out events about other components.
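
Here is the same event sequence run against the deeper model, again as a hedged Python sketch rather than real management pack code. The function names are hypothetical, and the rollup assumes a worst-of policy, where the parent is unhealthy if any child is.

```python
BAD_EVENT, GOOD_EVENT = 101, 102


def component_states(components, events):
    """Keep one two-state monitor per component, filtered on Param1."""
    states = {name: "healthy" for name in components}
    for param1, event_id in events:
        if param1 not in states:
            continue  # event is about a component that was not discovered
        if event_id == BAD_EVENT:
            states[param1] = "unhealthy"
        elif event_id == GOOD_EVENT:
            states[param1] = "healthy"
    return states


def rollup_worst_of(states):
    """Dependency-style rollup: the parent is unhealthy if any child is."""
    return "unhealthy" if "unhealthy" in states.values() else "healthy"


states = component_states(
    ["Comp1", "Comp2"],
    [("Comp1", 101), ("Comp2", 101), ("Comp2", 102)],
)
app_x = rollup_worst_of(states)
print(states, app_x)
```

Comp2 returns to healthy on its own good event, Comp1 stays unhealthy, and App X correctly stays unhealthy through the rollup.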

Note that monitors have another characteristic to be very aware of when you are choosing between a monitor and a rule. The alert generated from a monitor is effectively separated from the monitor state. Subsequent changes to the alert will not affect the monitor state. The major concern here is when the alert is resolved in the console. This act will not currently reset the monitor at all. If the monitor was unhealthy and the alert is resolved, the monitor will still be unhealthy until the healthy event is seen. Critically, this means no future alerts will be generated from this monitor instance until either the monitor is manually reset using Health Explorer or the object returns to a healthy state by itself and then goes back to a bad state. Using manual reset monitors makes this worse – once a user resolves the alert it is never coming back until someone goes and resets the monitor, which is a very bad experience.
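
The separation between alert and monitor state can be sketched as below. The `Monitor` class and its method names are hypothetical, and the sketch assumes the monitor is configured to close its own alert when it returns to healthy.

```python
class Monitor:
    """Two-state monitor that alerts only on a healthy -> unhealthy change."""

    def __init__(self):
        self.state = "healthy"
        self.alerts_raised = 0
        self.alert_open = False

    def on_event(self, event_id):
        if event_id == 101 and self.state == "healthy":
            self.state = "unhealthy"
            self.alerts_raised += 1
            self.alert_open = True
        elif event_id == 102 and self.state == "unhealthy":
            self.state = "healthy"
            self.alert_open = False  # assumed: monitor closes its own alert

    def resolve_alert_in_console(self):
        # Closing the alert does not touch the monitor state.
        self.alert_open = False


m = Monitor()
m.on_event(101)               # first alert raised
m.resolve_alert_in_console()  # operator closes it; monitor stays unhealthy
m.on_event(101)               # problem still there: no state change, no alert
after_resolve = (m.state, m.alerts_raised, m.alert_open)

m.on_event(102)               # object returns to healthy
m.on_event(101)               # goes bad again: second alert raised
print(after_resolve, m.alerts_raised)
```

After the operator resolves the alert, the repeated bad event produces nothing because there is no state change. A new alert only appears once the monitor has gone back to healthy and then unhealthy again.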

Conclusion

So should you use state based or stateless monitoring? The answer is both. There are reasons to use each of them, and the day we can get away from stateless monitoring is a long way away. As a result you are very likely to be using both in your management packs. My advice would be:

  • Make sure you understand the capabilities and differences of monitors / rules
  • Model to an appropriate level in your management pack – don’t ship hundreds of classes that are not of interest to customers just so you can use monitors as you want to
  • Use monitors where you can accurately determine all states via instrumentation
  • Use rules when you cannot determine state but there are important alerts to raise
  • Use rules or aggregated monitors to monitor levels of an application you do not want to model down to
  • Limit your use of manual reset / timer based reset monitors. When you use them be sure you understand the impact to customers

You should definitely strive for state based monitoring where possible but my key message out of all of this is not to be afraid of using rules still. Do not force state on something that is inherently stateless. 

Finally, if you own the application you are building a management pack for then think about how you can improve your instrumentation to move towards being able to use monitors more by understanding and exposing application state better.