Top 10 Topics for MSCOM Ops…Monitoring and Reporting (oh yeah!)

This came in as the second most popular topic that you are interested in. Huh, imagine that. You actually have to report out to somebody on what the heck the environment you manage is doing. So do we. Daily, weekly, monthly, quarterly, yearly. Oh yeah, we feel the same pain that you do.

We took this topic and shot it to our Tools Team. Yep, we do have the luxury of having a team of crack developers, most of whom used to be working, fully functional Systems Engineers. The real-world experience these folks have gives them a unique perspective on how to create Operations-specific tools that begin to address the thorny problem space around Monitoring and Reporting. Since these folks are charged with (among other things) automating the collection and reporting of the monitoring that we do, we thought they should get first crack at the answer. Here is their response:

As you might imagine, we collect a ton of data. To put it in real terms, on an average day we collect over 60,000 event log events and 185,000,000 performance counters, perform over 11,500,000 availability tests, parse 1.7 TB of IIS logs, collect asset and configuration information on 2,200 servers, and gather database statistics on over 2,000 databases. While that’s an impressive amount of data, hopefully you’re asking one very important question: “What the heck do you do with all that stuff?” I’ll answer that two ways.

On one hand, we do a lot with the data. In addition to using the constant stream of detail data as it comes in for real-time monitoring, we aggregate the data from all of these sources into a large data warehouse nightly. From this, we provide daily availability reports (both internal and external availability), asset management and performance trend reports, and application event-level reporting. By taking nightly snapshots of the relationships of servers to clusters to sites, etc., we can answer questions like “How many servers were associated with x application on y date? What were they, how were they performing, what events and errors were they experiencing, and what was the overall availability?” Likewise, we can do this over time, all the while maintaining the date-specific context.
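To make the snapshot idea concrete, here is a minimal sketch of how date-stamped relationship snapshots let you answer “as of” questions. The table name, column names, and sample data are all hypothetical, for illustration only; they are not our actual warehouse schema.

```python
import sqlite3

# Toy stand-in for the warehouse: each nightly run captures the
# server-to-application relationships as they exist on that date.
# Schema and data are illustrative assumptions, not the real system.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE server_app_snapshot (
        snapshot_date TEXT,   -- date the relationship was captured
        server        TEXT,
        application   TEXT
    )
""")

conn.executemany(
    "INSERT INTO server_app_snapshot VALUES (?, ?, ?)",
    [
        ("2006-05-01", "web01", "AppX"),
        ("2006-05-01", "web02", "AppX"),
        ("2006-05-02", "web01", "AppX"),  # web02 rotated out on the 2nd
    ],
)

# "How many servers were associated with x application on y date,
# and what were they?" -- answered against that date's snapshot.
rows = conn.execute(
    """
    SELECT server FROM server_app_snapshot
    WHERE application = ? AND snapshot_date = ?
    ORDER BY server
    """,
    ("AppX", "2006-05-01"),
).fetchall()
print([r[0] for r in rows])  # → ['web01', 'web02']
```

Because every row carries its snapshot date, the same join pattern extends naturally to performance, event, and availability data while preserving the date-specific context.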

On the other hand, we don’t do nearly as much with the data as I’d like to. We have an incredible opportunity to not only learn from the data, but to have the data itself actually “teach” us what is interesting about it. As you might have guessed, there’s much more to that last sentence. In the interest of keeping this post to a reasonable length, I’ll reserve that as the subject for another time.

We encourage you to follow up with more in-depth questions.