MSCOM WebCast Week…Q & A (Part 4) Monitor and Manage an Enterprise Platform with Microsoft.com Operations

The webcast series started with High Availability Architecture, Configuration Management of Web Farms, and Change and Release Management of Web Farms. We told you about our architecture, how we script and manage configurations, and how we partner with various teams to get the bits into production. Thursday’s session was all about how we Monitor and Manage an Enterprise. Scott Gaskins re-introduced the enterprise scale of MSCOM with some astounding numbers:

 

  • 60,000 alerts

  • 185,000,000 performance counters

  • 11,500,000 availability tests

  • 1.7 terabytes (TB) of IIS logs parsed

COLLECTED OR PARSED PER DAY!!

Are the Microsoft products and technologies that we use on Microsoft.com ready for the enterprise? We think so. You be the judge.

After the intro, Scott jumped into how we Monitor and Report Today with a discussion about Asset Management and Platform and Application Monitoring, including the tools we use for both reactive monitoring and proactive testing. The Data Collection and Reporting section explained our processes and the key areas that we report on, such as availability, perf trending and IIS error trends, to highlight just a few.

He then went on to Monitor and Report Tomorrow, a look into what the MSCOM Ops Tools Team is currently working on for the future. One of the key goals in the Configuration Monitoring and Management section was how to get a deeper understanding of the current configurations, both of the platform and of specific applications. The Application Instrumentation section went into the common logging classes that these new tools will use, which are based on the logging and instrumentation block, and the common methods that are used.

All of this monitoring is ultimately used for Problem Management. Scott’s team has realized the importance of learning from all the data that is collected. Hey, it does you no real good to just collect data; you obviously have to be able to make sense of it and, hopefully, use it predictively to manage this environment intelligently. The session wrapped up with a discussion of the “Lights Out” Datacenter and the benefits of Integrated Lights Out (iLO) for remotely managing our servers.

Thursday’s topic was: Monitor and Manage an Enterprise Platform with Microsoft.com Operations

The replay link: https://msevents.microsoft.com/CUI/EventDetail.aspx?EventID=1032283908&Culture=en-US

Here are the questions that you asked:

I am starting to look into MOM to monitor various servers that we have. What self-study (hands-on labs/webcasts/online info/blogs) do you recommend I do first? Thank you.
Good question. If you have already been reviewing MOM and MOM requirements, then you should start looking into the technical aspects of the product. Here is a link that may help you on your way: https://www.microsoft.com/technet/community/events/mom/tnt1-112.mspx

Our first objective is to use MOM to report the status of the servers and send alerts (via Exchange, for example) to support engineers when the servers have issues. Is there a specific link that covers the steps?
What you are referring to is "Notification". This is a base feature of the MOM product and is covered thoroughly in the MOM product documentation.

Any discussion on how SMS is being incorporated in the monitoring and management aspects (e.g. asset management) of your enterprise?
Currently, our team is not using SMS, although our data center management groups do use SMS for host data management and collection.

How do you purge the old data from your reporting DB? How is the size of that DB maintained for performance?
Aged data is first moved into a long-term warehouse for storage, and then it is simply deleted from the primary reporting database. This allows us to keep a very clean, fast database up front while still retaining as much data as possible for long-term reports.
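
To make the archive-then-purge idea in the answer above concrete, here is a minimal sketch in Python using the standard sqlite3 module. It is not the MSCOM implementation: the table names (reporting, warehouse), the columns, and the function name archive_and_purge are all hypothetical, and a real deployment would most likely keep the warehouse in a separate database.

```python
import sqlite3


def archive_and_purge(conn, cutoff):
    """Copy rows older than `cutoff` into the warehouse table, then delete them.

    Table and column names are hypothetical. Timestamps are ISO-8601 strings,
    so plain string comparison orders them correctly.
    """
    with conn:  # one transaction: both steps commit together or roll back together
        conn.execute(
            "INSERT INTO warehouse (collected_at, counter, value) "
            "SELECT collected_at, counter, value FROM reporting "
            "WHERE collected_at < ?",
            (cutoff,),
        )
        cur = conn.execute(
            "DELETE FROM reporting WHERE collected_at < ?", (cutoff,)
        )
    return cur.rowcount  # number of rows removed from the reporting table


def make_demo_db():
    """Build an in-memory database with three sample rows (illustrative data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE reporting (collected_at TEXT, counter TEXT, value REAL)")
    conn.execute("CREATE TABLE warehouse (collected_at TEXT, counter TEXT, value REAL)")
    conn.executemany(
        "INSERT INTO reporting VALUES (?, ?, ?)",
        [
            ("2005-09-01T00:00:00", "cpu", 41.0),
            ("2005-10-01T00:00:00", "cpu", 38.5),
            ("2005-12-01T00:00:00", "cpu", 52.0),
        ],
    )
    conn.commit()
    return conn
```

Doing the copy and the delete inside one transaction means a failure between the two steps cannot lose data or leave duplicates behind.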

Is the AI system to analyze the problems already built, or is it in the design phase? Please give updates on your blog also.

We are in the early design phase, but it is unfortunately behind a couple of other projects in priority. I expect that we will be in an active design phase again in the second half of 2006. We will keep you posted on our blog site.

Would it be possible to get a document on the MOM alerts and Events you are using? Are you using the Web Services MP?
We do not use any of the default MPs as shipped, but rather a subset of all the available MPs. First we bring the MP into a lab environment and evaluate all of the rules and scripts in that MP. Once we have determined which pieces we would like to use in production, we create a custom MP and move it into our standard environment.

This may be off the subject, but is Microsoft working on a way to perform patch management without having to reboot the server?
Yes. Microsoft is actively pursuing methods to patch without reboots. Future versions of SMS are specifically targeting this scenario.

Do you use any third-party or custom-developed tools for data mining? If so, which ones?
We have some custom-written tools, and we are working on incorporating SQL Server Analysis Services with our repository next year.

Do you run a special build of Cluster Sentinel?
Yes, we have continued development on the tool for internal use. It no longer closely resembles the released version.

“Building manifest of system, dlls and Apps”… How do you do this? What tools are you using, and are they custom? Can you explain a bit more?
We are using custom tools to allow for extensible verification. The manifest is created with those custom tools and handled as part of our release process.
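
The custom tools themselves were not described in the session, so the following is only a rough sketch of what building and verifying a manifest could look like. It assumes (this is our assumption, not something stated in the webcast) that a manifest maps each file's relative path to a SHA-256 hash. The function names build_manifest and verify are hypothetical.

```python
import hashlib
from pathlib import Path


def build_manifest(root):
    """Map each file under `root` (by relative path) to its SHA-256 hex digest."""
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            manifest[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return manifest


def verify(root, manifest):
    """Compare the files currently under `root` against a stored manifest."""
    actual = build_manifest(root)
    return {
        # In the manifest but no longer on disk
        "missing": sorted(set(manifest) - set(actual)),
        # On disk but never recorded in the manifest
        "unexpected": sorted(set(actual) - set(manifest)),
        # Present in both, but the content hash differs
        "changed": sorted(
            p for p in manifest.keys() & actual.keys() if manifest[p] != actual[p]
        ),
    }
```

In a release process like the one described, the manifest would be built once from the approved bits and verify would then be run against each server to flag drift.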

Do all applications in MSCOM use your Instrumentation class to log events, or do any of them use custom logging? If so, how do you manage the difference between the two?
Not currently; we are in the process of standardizing to the next release of the 2.0 EIF logging application block, but currently our applications use custom eventing and logging objects.

AD crawls? What objects are you looking for, and what diff tool do you use?
We scan a set of OUs using ADSI, and then collect information from the individual servers using WMI. The differences are resolved in a SQL database.
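
The answer above says the differences are resolved in SQL. Purely as an illustration of the comparison step, here is a small Python sketch that diffs two inventory snapshots. The snapshot shape (server name mapped to a dictionary of attributes) and the function name diff_snapshots are hypothetical, and the ADSI and WMI collection steps are not shown.

```python
def diff_snapshots(previous, current):
    """Return (server, attribute, old_value, new_value) for every difference.

    Each snapshot maps a server name to a dict of collected attributes.
    A server or attribute that exists on only one side shows up with None
    on the other side.
    """
    changes = []
    for server in sorted(set(previous) | set(current)):
        old = previous.get(server, {})
        new = current.get(server, {})
        for key in sorted(set(old) | set(new)):
            if old.get(key) != new.get(key):
                changes.append((server, key, old.get(key), new.get(key)))
    return changes
```

For example, if web01 goes from 2048 to 4096 MB of RAM between crawls and web02 appears for the first time, the result contains one tuple for the RAM change on web01 and one tuple per attribute collected for web02.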

EIF? Please let me know what is that?
Sorry, the name has changed slightly: https://msevents.microsoft.com/cui/WebCastEventDetails.aspx?EventID=1032282442&EventCategory=5&culture=en-us&CountryCode=US

How do you sync metabase? Do you use iiscnfg, appcenter or do you populate metabase on each server using custom tools?
Currently we use a completely custom sync application.

How do you check MOM availability?
Do you mean the availability of the MOM servers?
Yes.
We monitor the Management Servers in two steps. First, we use Cluster Sentinel to check the availability of the servers themselves. Then we use a custom stored procedure on the MOM database servers to check the last reported heartbeat of the Agent on each Management Server. If the heartbeat is more than x minutes old, we get an alert and we investigate.
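
The stored procedure itself is not shown here, but the staleness rule it applies is simple enough to sketch. The Python below is an illustration only: the function name stale_agents, the shape of the input (server name mapped to last heartbeat time), and the threshold passed in are all hypothetical, since the actual value of x was not given.

```python
from datetime import datetime, timedelta


def stale_agents(last_heartbeats, now, max_age):
    """Return the servers whose last heartbeat is older than `max_age`.

    `last_heartbeats` maps a server name to the datetime of its most recent
    heartbeat; `max_age` is a timedelta standing in for "x minutes".
    """
    return sorted(
        name for name, seen in last_heartbeats.items() if now - seen > max_age
    )
```

A monitoring job would call this on a schedule with the current time and raise an alert for every name returned.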