Is anyone watching the health of your Multi-tier Web Application?

Do you know whether each tier of your application is healthy? This is especially challenging when you have a multi-tier application (e.g., web site, web service, and database). When there are problems on the web site, you need to be able to quickly identify which tier is causing the problem.

One of the main challenges an operations team faces is monitoring applications. When an application is deployed, the operations team needs to know that things are working, and for that the application needs to have monitoring set up. There are many monitoring solutions in the industry that can help check application health. However, these monitoring solutions do not automatically know what needs to be monitored. Our team uses Microsoft Operations Manager for event collection and a custom-built tool for the application monitoring pages.

A good monitoring best practice starts with identifying the critical features & dependencies of an application. Ideally this is done while the application is being developed, which allows the development team to create “monitoring hooks” around these features & dependencies. These built-in monitoring hooks enable faster issue resolution when they are documented by project teams & configured for monitoring by Operations.

Best Practices: Application Monitoring Pages & Reporting

  1. A monitoring page should test for the “success” of a critical feature (or features) or dependency & report the result via an HTTP status code: 200 for success & a code greater than 599 for an application-specific failure.
  2. A monitoring page can be used to create a multi-step test where each step tests a specific piece of functionality. Each step can then return an HTTP status code of 200 upon success or a status code greater than 599 upon failure of that step.
  3. Document each of the “monitoring test” steps in the monitoring page, the related HTTP status codes & the action to be taken when a specific non-200 status code is returned by the monitoring page.
  4. Apart from returning a non-200 status code, monitoring pages can write events into the event logs (see Event-logging best practices) to provide more specific information about the error, which can be used for further troubleshooting.
  5. The monitoring pages can also be used to render more detailed error information that can be viewed by the operations engineer.
  6. A monitoring page representing critical features & dependencies can also be used to report on the overall availability of the system, if required.
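
As an illustration of items 1 and 2, here is a minimal Python sketch of a multi-step monitoring page. It is not the custom tool our team uses. The step names, the 601/602 codes, the `make_handler` and `serve` helpers, and the port are all assumptions made for the example. Codes above 599 fall outside the standard HTTP ranges, so it is worth confirming that your monitoring tool accepts them.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, List, Tuple

# (step name, status code to return if the step fails, check function).
# The names and 6xx codes used with this type are illustrative assumptions.
Step = Tuple[str, int, Callable[[], bool]]


def run_monitoring_steps(steps: List[Step]) -> Tuple[int, str]:
    """Run steps in order; return 200 if all pass, else the first failing step's code."""
    for name, failure_code, check in steps:
        try:
            ok = check()
        except Exception as exc:  # a crashing check counts as a failed step
            return failure_code, f"{name}: error: {exc}"
        if not ok:
            return failure_code, f"{name}: check failed"
    return 200, "all steps passed"


def make_handler(steps: List[Step]):
    """Build a request handler class that serves the monitoring page."""

    class MonitoringPage(BaseHTTPRequestHandler):
        def do_GET(self) -> None:
            status, detail = run_monitoring_steps(steps)
            body = detail.encode("utf-8")
            # The reason phrase is passed explicitly because 6xx codes have no standard one.
            self.send_response(status, "OK" if status == 200 else "Monitoring Failure")
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)  # detailed text for the operations engineer

    return MonitoringPage


def serve(steps: List[Step], port: int = 8080) -> None:
    """Serve the monitoring page until interrupted (hypothetical entry point)."""
    HTTPServer(("", port), make_handler(steps)).serve_forever()
```

Calling `serve([("web service", 601, check_service), ("database", 602, check_db)])` with your own check functions would expose the page. The status code tells the monitoring tool which step failed, and the response body carries the detail described in item 5.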

Best Practices: Event-logging & Reporting

  1. The default should be to write only actionable events to the event log. Anything informational or warning-level should not be written as an *error* into the logs.
  2. Where additional non-actionable info needs to be collected, allow for a switch that can be set on demand to turn on information/warning entries.
  3. Ensure the combination of Event ID & Event source is always unique.
  4. Ensure Event sources have the App name as a qualifier (Event source=“ApplicationName_ErrorReadData”) in order to avoid conflicts with other applications on the same server that have the same feature.
  5. Ensure event text is descriptive enough so appropriate action can be taken.
  6. Document the Event IDs & Event sources used for an application and provide troubleshooting steps to resolve these errors. This is the key to monitoring & resolving issues with quick turnaround times for that application.
  7. In situations where an error is generated many times, have the application write only every 10th occurrence of that error to the event logs (the interval should be configurable through a config file). Note: the first time an error happens it is always written to the log; subsequent similar errors increment a counter. This ensures the event log does not fill up with the same error, which would result in the loss of other valuable information.
  8. Use unique (custom) event logs per application where required.

Do not log PII (personally identifiable information) or password information into the event logs.
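
The throttling described in item 7 can be sketched roughly as follows. This is one interpretation, assuming that occurrences 1, 10, 20, and so on are written. It uses Python's standard `logging` module in place of the Windows event log, and the `ThrottledEventLogger` class and its parameters are hypothetical names for the example.

```python
import logging
from collections import defaultdict
from typing import DefaultDict, Tuple


class ThrottledEventLogger:
    """Write the first occurrence of an error, then only every Nth repeat."""

    def __init__(self, logger: logging.Logger, every_nth: int = 10) -> None:
        if every_nth < 1:
            raise ValueError("every_nth must be at least 1")
        self._logger = logger
        self._every_nth = every_nth  # would normally come from a config file
        # Occurrence counter per (event source, event ID) pair.
        self._counts: DefaultDict[Tuple[str, int], int] = defaultdict(int)

    def error(self, source: str, event_id: int, message: str) -> bool:
        """Count the error; return True if this occurrence was written to the log."""
        key = (source, event_id)
        self._counts[key] += 1
        count = self._counts[key]
        if count == 1 or count % self._every_nth == 0:
            self._logger.error(
                "%s [%d] %s (occurrence %d)", source, event_id, message, count
            )
            return True
        return False
```

Keying the counter on the event source and Event ID pair relies on that combination being unique, as item 3 recommends. Keep the message text free of PII and passwords, since the throttle does nothing to filter content.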

Some additional things to keep in mind:

· Make sure that you are monitoring each tier of the application.

· It IS possible to over-monitor an application. If you have too many monitors on the system, you can start taking system resources away from the application. You have to find the right balance between monitors and system resources.

· Treat monitoring as a feature in the application development process. This helps ensure that it is documented and “Monitoring hooks” can be created.

· Any monitoring tool that can check the status code returned in the header of a web page can be used for these application monitoring pages.
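
To show how little a monitoring tool needs in order to consume these pages, here is a rough Python sketch of the polling side using only the standard library. The `fetch_status` and `classify` helpers, the 601/602 codes, and the documented actions are made-up examples, not values from a real application.

```python
import urllib.error
import urllib.request

# Hypothetical documentation table: failure code -> action for the operator.
DOCUMENTED_ACTIONS = {
    601: "Web service step failed: check the service tier.",
    602: "Database step failed: check the database tier.",
}


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code of the monitoring page at url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for non-2xx responses; the code is still available.
        return err.code


def classify(status: int) -> str:
    """Turn a status code into a message for the operations engineer."""
    if status == 200:
        return "healthy"
    if status > 599:
        # Application-specific failure: look up the documented action.
        return DOCUMENTED_ACTIONS.get(status, f"undocumented failure code {status}")
    return f"HTTP-level problem ({status}): the monitoring page itself may be unreachable"
```

A scheduler could call `classify(fetch_status(url))` for each tier's monitoring page. Connection failures and timeouts raise `urllib.error.URLError` here and would need their own handling.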