Article written by Ehsan Youssef, Premier Field Engineer
Many times as a field engineer, I have been engaged to help resolve issues that had been ongoing for an extended period and whose root cause was elusive. Most of these issues were performance-related, but service outages also came up. On the Microsoft Exchange platform, these issues were incredibly visible and politically charged. Quick resolution was a must, but it was not enough. When onsite with customers, my mandate was not only to put out the fire but also to help ensure we knew the root cause to avoid future issues.
Root Cause Analysis Strategy
My 3-step strategy is simple:
- Deploy the required performance baseline and data capture tools
- Educate the operations team on the tools and the decision-making process (sometimes with a nice Visio diagram posted at their cube! During my PSS days, it was called a cube note)
- Monitor closely and react appropriately upon recurrence
With the appropriate data in hand and the required Subject Matter Expert from Microsoft Premier Support Escalation Services or the Microsoft Product Group, there wasn't a single issue that could not be identified and resolved swiftly.
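As a rough illustration of how a baseline supports the "monitor closely and react" step, here is a minimal Python sketch. It assumes counter samples have already been exported from Performance Monitor into plain lists of numbers; the counter names, the function names, and the three-standard-deviation threshold are all hypothetical choices for the example, not a recommendation from any Microsoft tool.

```python
from statistics import mean, stdev


def build_baseline(samples):
    """Summarize healthy-period samples per counter as (mean, standard deviation)."""
    return {name: (mean(values), stdev(values)) for name, values in samples.items()}


def find_deviations(baseline, current, threshold=3.0):
    """Return counters whose current value is more than `threshold`
    standard deviations away from the baseline mean."""
    flagged = []
    for name, value in current.items():
        if name not in baseline:
            continue  # no baseline was captured for this counter, so nothing to compare
        avg, sd = baseline[name]
        if sd == 0:
            # A perfectly flat baseline: any change at all is a deviation.
            if value != avg:
                flagged.append(name)
        elif abs(value - avg) / sd > threshold:
            flagged.append(name)
    return flagged


# Hypothetical counter names and values, for illustration only.
healthy = {
    "cpu_percent": [10, 12, 11, 13, 9],
    "disk_queue_length": [1, 1, 2, 1, 2],
}
baseline = build_baseline(healthy)
print(find_deviations(baseline, {"cpu_percent": 95, "disk_queue_length": 1}))
```

A flagged counter would be the cue for the operations team to follow the decision-making process and start capturing data, rather than an automatic diagnosis.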
Preventing Issues From Happening
So you might think, why wait until there is an issue to go through the above process? Why not prepare your environment and staff proactively to minimize the recurrence of service outages? Is this possible? Absolutely.
As part of the planning and engineering phase of any project, we usually think about monitoring and backup as part of the day-to-day requirements of a technology or solution, but is that enough? What about the support activities? What about knowing the common types of issues that arise in the environment and which tools are used to identify and resolve them?
This is where the engineering team, in collaboration with operations, can build the required tools into the deployment plans and make them part of the “gold image” of the server solution. With these tools in place and the appropriate readiness training delivered to the operations staff, the data required for root cause analysis can be captured upon the first occurrence, instead of waiting for two or more occurrences to capture the needed data and send it to Microsoft for analysis.
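One way to keep a gold image honest is a small check that reports which expected tools are missing from a server. The Python sketch below is only an illustration: the executable names in `REQUIRED_TOOLS` are assumptions about how the tools might be installed on your image, and `missing_tools` is a hypothetical helper, so adjust both to match your own build.

```python
import shutil

# Assumed executable names; verify these against your own gold image.
REQUIRED_TOOLS = ["perfmon", "adplus", "userdump"]


def missing_tools(required, locate=shutil.which):
    """Return the tools from `required` that cannot be found on the PATH.

    `locate` defaults to shutil.which and can be swapped out for testing.
    """
    return [tool for tool in required if locate(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing from this server:", ", ".join(missing))
    else:
        print("All required tools were found.")
```

Note that this only finds tools reachable through the PATH; tools installed to a fixed folder would need a path-based check instead.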
What tools are we talking about here? Here are some tools I have used for supporting Exchange and Active Directory:
- The basics: Windows Support Tools & Windows Resource Kit tools
- Windows Debugging Tools: ADPlus and User Dumper are critical tools
- Sysinternals tools: these are some absolutely awesome tools
- Performance Monitor
- Network Monitor
There are obviously more that I haven’t included, so please chime in if you have any personal favorites that you’d like to share.