Are Web application aggregate rollup monitor alerts not very useful?

In my last post, I described how to create an actionable alert for a specific unit monitor, the status code monitor. You can do the same for all the other unit monitors. In response to that post, John Curtiss wrote, ‘Availability aggregate rollups for the web application monitors are pretty useless’. John is right. The rollup alert simply says that something is wrong in this web app and that it is down. To understand why, let us look at the monitor tree. Below is most of the monitor tree that forms the health of a web application. The leaf nodes are the unit monitors, and their health is rolled up to the aggregate monitors. Unit monitors can be numeric, content match, or security certificate related.

[Image: monitor tree for a web application]

The alert generated by the aggregate monitor of the web application (Web app- URL) does not contain a precise description that identifies the exact cause of the problem. A web application alert can result from multiple failures. An alert is raised to indicate a problem, and ideally there is one alert per problem. For example, if a status code error causes the status code monitor to go to Error, and that in turn causes the Web app monitor to go to Error, the alert is generated because of the status code error. Suppose that in the meantime the status code problem is fixed but another failure occurs, say an expired certificate. The Web app monitor is still in Error, but due to a different problem. Because the Web app monitor never left the Error state, the alert stays in the same resolution state (New), and no new alert is generated. If the alert description mentioned the first problem, the status code, it could mislead the user into thinking that the status code was the cause rather than the certificate expiration.

An alert is only an indication of a problem; it does not assist in diagnosing it. Diagnosis is a complex process that may require additional data collection, which is why going to the Health Explorer is the preferable method. At the aggregate level, the problem may have been triggered by multiple causes, whereas at the unit monitor level we have a precise indication of the problem. Hence unit monitors can carry precise descriptions, whereas for aggregate monitors it is harder to write one. If you think that the majority of your problems are due to status codes, I would recommend using the alert description given in the feedback thread, but it is hard to generalize an alert description at the aggregate level. An alert is not intended to be the mechanism for live problem diagnostics.
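To make the scenario above concrete, here is a small toy model of it. It is only a sketch: the `AggregateMonitor` class, its worst-of rollup rule, and the alert text are illustrative assumptions of mine, not the actual Operations Manager implementation or SDK.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Dict, List


class Health(IntEnum):
    SUCCESS = 0
    WARNING = 1
    ERROR = 2


@dataclass
class Alert:
    description: str
    resolution_state: str = "New"


@dataclass
class AggregateMonitor:
    """Hypothetical worst-of rollup over unit monitors (illustration only)."""
    children: Dict[str, Health] = field(default_factory=dict)
    state: Health = Health.SUCCESS
    alerts: List[Alert] = field(default_factory=list)

    def report(self, unit_monitor: str, health: Health) -> None:
        """Record a unit monitor's health and roll it up."""
        self.children[unit_monitor] = health
        new_state = max(self.children.values())  # worst child wins
        if new_state == self.state:
            return  # no transition at the aggregate level, so no new alert
        old_state, self.state = self.state, new_state
        if new_state == Health.SUCCESS:
            # Back to healthy: close whatever alert was open.
            for alert in self.alerts:
                if alert.resolution_state == "New":
                    alert.resolution_state = "Closed"
        elif old_state == Health.SUCCESS:
            # First departure from healthy: raise one alert naming the trigger.
            self.alerts.append(
                Alert(f"Web app unhealthy: {unit_monitor} is {health.name}")
            )

    def failing(self) -> List[str]:
        """Names of unit monitors that are currently not healthy."""
        return [n for n, h in self.children.items() if h != Health.SUCCESS]


def run_scenario() -> AggregateMonitor:
    web_app = AggregateMonitor()
    web_app.report("status code", Health.ERROR)    # alert is raised here
    web_app.report("certificate", Health.ERROR)    # aggregate already in Error
    web_app.report("status code", Health.SUCCESS)  # first problem is fixed
    return web_app


if __name__ == "__main__":
    monitor = run_scenario()
    print(monitor.alerts[0].description, "|", monitor.failing())
```

After the three reports there is still exactly one alert in the New state, and its description names the status code monitor, while the only unit monitor actually failing is the certificate one. That gap is what makes a fixed description at the aggregate level misleading.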

Another factor to take into consideration is reducing the number of outstanding alerts in the system. Alerting at the aggregate level is meant to generate one alert at the application level instead of one alert for each problem. Constant generation of alerts may be undesirable in most cases. Hence, by default, we have disabled alerting at the unit monitor level. Users still have the option of enabling the alert on every unit monitor they need. Alerts for monitors in sealed management packs can be enabled using overrides. One could develop a tool using the SDK that automatically applies the appropriate overrides for a large number of web applications.

At the implementation level, there are optimizations in the monitoring infrastructure that intentionally avoid updating monitor state on every state change notification unless the state is truly going to change from one state to another. In the above example, if the monitor goes to Error due to the status code and then remains in Error due to another problem, there is no need to update the state from Error to Error or to generate another alert at the aggregate level. If we did that for every event, it would generate a lot of state update notifications, which could create other performance and scalability problems.
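The idea behind that optimization can be sketched in a few lines. This is a simplified illustration under my own assumptions (states as plain strings, a hypothetical `effective_transitions` helper), not the product's code: a notification that reports the state the monitor is already in produces no state update.

```python
from typing import Iterable, List, Tuple


def effective_transitions(
    notifications: Iterable[str], initial: str = "Success"
) -> List[Tuple[str, str]]:
    """Return only the (old, new) pairs where the state really changes."""
    current = initial
    transitions: List[Tuple[str, str]] = []
    for reported in notifications:
        if reported == current:
            continue  # e.g. Error -> Error: nothing to write, nothing to alert on
        transitions.append((current, reported))
        current = reported
    return transitions


if __name__ == "__main__":
    # Status code error, then certificate error, then another failure, then recovery.
    stream = ["Error", "Error", "Error", "Success"]
    print(effective_transitions(stream))
```

Four notifications arrive in this example, but only two state updates are recorded: Success to Error, and Error back to Success.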

We are looking at options for making the aggregate monitor alerts more usable in one of our next releases. The following questions may help me refine the proposal:

– Would it be okay if the alert description indicated the first error condition, the one that put the monitor into Error/Warning and created the alert, but was not updated for subsequent state change events?

– What if the alert description is not updated after the alert is created, but the alert history is updated with subsequent changes?

– What does the user need in order to determine the cause of an error after the error has gone away and the alert has been resolved?

I would like to hear thoughts from the readers.

Next, let me look into bulk editing of configuration of the monitors.