Who really needs to gather crash information and what do they need to do with it?

by anandeep on March 07, 2007 01:06am

I just got back from Cambridge (the one in the United Kingdom, not the one by the Charles River), where I attended the Microsoft Research / Technische Universität Darmstadt “Reliability Analysis of System Failure Data” conference.

Microsoft Research has a very nice lab near Cambridge University. This was my first visit to Cambridge, and I was able to drink in (literally!) some of the local color. I went to the pub (the “Eagle”) where Watson & Crick had their “aha!” moment about the double-helix structure of DNA. I also visited the pub next to Queens’ College called the “Anchor”, which Pink Floyd’s Syd Barrett used to frequent.

But it was not all pub-crawling; we had some serious stuff to deal with at the conference. The objective was to bring academia and industry together on a central problem in the field of reliability analysis: industry (there were representatives from Sun, Cisco, IBM and of course Microsoft) has the failure data but NOT the models and techniques for solving the problem overall, while academia has the models and techniques but NOT the kind of failure data it needs. The conference was an attempt to get the two sides together and find a solution to this conundrum.

After we had presented our position papers (our lab’s paper is here), we split into workshops on Data Collection, Data Repositories and Data Analysis.  The idea was to come up with the “next step” in taking reliability analysis of system failure data forward. 

In the Data Collection workshop, we were stuck for a bit. We were looking for compelling reasons for end-users (rather than software makers like Microsoft, Sun or IBM) to collect failure data. What reason would an IT department have for implementing mechanisms to collect failure data?

At first the reasons seemed obvious: to monitor for failures and to correct defects, of course. But then the representatives from the software makers spoke up and said that they already collect failure data and use it for exactly that purpose. Vince Orgovan from Microsoft stated in his paper that almost 400 million PCs provide data to Microsoft. Not only is the data available, but Windows XP supports corporate error reporting in exactly the same way that it supports error reporting to Microsoft; redirecting reports to an internal server takes only a few registry changes. The Windows debugger command “!analyze” can then be run on the collected error data, much as Microsoft uses it internally.
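For the curious, here is roughly what that redirection involves. Consider this a sketch from memory rather than an authoritative reference: the exact policy values varied across versions of Corporate Error Reporting, and the server share below is a made-up placeholder.

    Windows Registry Editor Version 5.00

    ; Hypothetical example: point XP-era Corporate Error Reporting at an
    ; internal file share instead of Microsoft. Key and value names are
    ; from memory; check the CER documentation before relying on them.
    [HKEY_LOCAL_MACHINE\SOFTWARE\Policies\Microsoft\PCHealth\ErrorReporting\DW]
    "DWFileTreeRoot"="\\\\cerserver\\crashreports"
    "DWNoExternalURL"=dword:00000001

Once the dumps land on your share, triage looks much like it does in Redmond; for example (with a placeholder dump path):

    cdb -z \\cerserver\crashreports\mini0307-01.dmp -c "!analyze -v; q"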

This took the wind out of our collective sails. If shipping software already provides all these mechanisms, what could we suggest as a next step that had compelling value? Most corporations would rather leave the job of correcting defects to the software makers (proprietary or open source) anyway! The software makers are in a much better position to look across many deployments and correct defects in the software.

The only compelling reason we could come up with for building a data collection mechanism on the user side was to help with deployment. The mechanism would collect failure data during preliminary testing, and that data would then be fed into a model that judges the maturity level of the deployment: a kind of CMM (Capability Maturity Model) for reliability. We even suggested building an ITIL management practice around this, which would potentially allow the ITIL model not only to give good qualitative measures, as it does today, but also to quantify the reliability of a deployment.
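To make the idea concrete, here is a toy sketch of what such a quantitative maturity score might look like. Everything in it is hypothetical: the maturity thresholds, the 1-5 scale, and the use of MTBF as the sole input are my own illustration, not something we formally proposed at the workshop.

    from datetime import datetime

    def mtbf_hours(failure_times):
        """Mean time between failures, in hours, estimated from a sorted
        list of failure timestamps collected during preliminary testing."""
        if len(failure_times) < 2:
            return float("inf")  # too few failures to estimate a rate
        gaps = [(b - a).total_seconds() / 3600.0
                for a, b in zip(failure_times, failure_times[1:])]
        return sum(gaps) / len(gaps)

    def maturity_level(mtbf):
        """Map an MTBF estimate onto a CMM-style 1-5 scale.
        The thresholds below are invented for illustration only."""
        for level, threshold_hours in [(5, 10000), (4, 1000), (3, 100), (2, 10)]:
            if mtbf >= threshold_hours:
                return level
        return 1

    # Failure timestamps gathered during a pilot deployment (made up):
    failures = [datetime(2007, 3, 1, 9, 30),
                datetime(2007, 3, 2, 14, 0),
                datetime(2007, 3, 5, 8, 15)]
    print(maturity_level(mtbf_hours(sorted(failures))))  # prints 2

A real model would need far more than MTBF (failure severity, workload, how representative the test period was), but even a crude score like this would let an ITIL practice report a number instead of a narrative.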

This in itself is a very useful thing to have, but I cannot believe it is the only reason to collect failure data at the end-user level. Let me know if you think of any others.

Open source faces much the same issues, except that there is no central organization collecting all this failure data. The situation may be the reverse of that for the proprietary software makers: failure data is collected at the IT-organization level, not centrally. How does that failure data actually result in code defect corrections? I would guess it is either pre-analyzed and submitted as a bug report, or people patch their own instances of the source code. But my opinion is that eventually open source software systems will have to build central repositories of failure data, in much the same way that commercial software vendors have built them.