Monitoring tools and the path of least resistance

At a client recently, I watched a well-designed technical solution fall partly short of its desired outcome because of an unanticipated people factor.  Adjusting the technical approach to account for this factor will hopefully resolve the issue – time will tell!

Here is an example inspired by this client.  Wishing to move towards a more proactive approach to incident handling (for servers), an organization designs and implements a monitoring tool that alerts technicians not only when things go wrong, but also when servers become degraded, allowing them to get out in front of problems and (ideally) resolve them before they become, well, issues.  You guessed it - it was SCOM!

With this sort of tool, however, spurious alerts must be managed, or else technicians may start to view the alerts as “noise” and ignore them altogether.  For this reason, tuning the tool is an important task during the implementation, but there must also be a process change for performing routine work on the servers: placing the devices in maintenance mode first.  If a device is rebooted without first suspending its monitoring, spurious alerts are likely to result; placing the device in maintenance mode is the way to prevent them.

So…the tool is implemented and tuned, and procedures are modified to require technicians to access the monitoring console and place devices in maintenance mode prior to rebooting them.  And yet, they do not do so.  For some reason, and I haven’t yet divined it, some folks just feel it’s easier to close out alerts the next day than it is to place the devices in maintenance mode (I think it may be a path of least resistance thing – logging onto the console is too much to ask).  Technicians patch and reboot servers without maintenance mode, generating a host of spurious alerts.  Reports contain inaccurate data, unnecessary tickets get created, and technicians who are unaware of the maintenance waste time researching the alerts.

A few of my teammates and I were chatting about this over lunch one day, and I floated the idea of eliminating the “extra” step of accessing the monitoring console by giving the server technicians a way to place a device in maintenance mode right from the device itself.  The result of this collaborative effort may be found here – a method for placing devices into maintenance mode remotely.  (Thank you, Lynne, Jeramy, and Paul!)
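
To make the idea concrete, here is a minimal sketch of the "one step instead of two" approach.  It is not the method my teammates built, and it does not call any real SCOM API: the InMemoryMonitoring class, its method names, and the reboot_with_maintenance helper are all hypothetical stand-ins I made up for illustration.  The point is only that the technician runs a single command, and the maintenance window is requested automatically before the reboot happens.

```python
from datetime import datetime, timedelta, timezone


class InMemoryMonitoring:
    """Hypothetical stand-in for the monitoring server (not a real SCOM API)."""

    def __init__(self):
        # Device name -> time (UTC) at which its maintenance window ends.
        self.windows = {}

    def start_maintenance(self, device, minutes, reason):
        # Record a timed window; alerts for the device would be suppressed until it ends.
        self.windows[device] = datetime.now(timezone.utc) + timedelta(minutes=minutes)

    def in_maintenance(self, device):
        end = self.windows.get(device)
        return end is not None and end > datetime.now(timezone.utc)


def reboot_with_maintenance(monitoring, device, reboot, minutes=30):
    """Request a maintenance window, then reboot, as one step for the technician.

    The window is timed, so it expires on its own even though the script
    that requested it does not survive the reboot.
    """
    monitoring.start_maintenance(device, minutes, reason="Planned reboot")
    reboot(device)


if __name__ == "__main__":
    monitoring = InMemoryMonitoring()
    # A fake reboot that only reports whether alerts are suppressed at that moment.
    reboot_with_maintenance(
        monitoring,
        "srv01",
        reboot=lambda d: print(d, "in maintenance:", monitoring.in_maintenance(d)),
    )
```

In a real environment the stand-in class would be replaced by whatever interface your monitoring tool exposes, and the fake reboot by the actual restart command.  The design choice worth noting is the timed window: the technician never has to remember to take the device back out of maintenance mode.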

For Operations Managers and Event Management process owners, the reduction of those (say it with me) spurious alerts is a critical success factor – we want our folks focused on actual issues, rather than chasing rabbits down holes.  We may design our tools and processes in such a manner that we believe we have provided everything needed to realize our goals, but do not forget the human factor!