As a Premier Field Engineer, I am responsible for several on-call shifts a year. Last week was one such shift, and during it I learned the importance of updating drivers. Allow me to elaborate.
I got a call around 2 AM on Wednesday about a customer who needed help recovering the business after accidentally deleting the OU that held all the user accounts in the organisation. I assumed it would be a simple case of performing an authoritative restore and accepted the case. After turning up onsite, I learned that authoritative restores had already been performed, but when they ran the LDIF file to recover group membership, the destination servers hung. Without running the LDIF file, they had managed to get all the users to replicate out successfully, but the DCs spread geographically across the country had inconsistent group membership information.
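To make "inconsistent group membership" concrete, here is a minimal Python sketch of how one might compare membership snapshots taken from several DCs. The DC names, group names, users and the `find_membership_drift` helper are all hypothetical, and the sketch assumes the membership lists have already been exported from each DC by some other means; it is not what we ran onsite.

```python
from collections import defaultdict


def find_membership_drift(memberships_by_dc):
    """Return {group: {dc: missing_members}} for groups that differ between DCs.

    memberships_by_dc maps a DC name to {group_name: set_of_member_names}.
    """
    # Build the union of members seen for each group on any DC.
    all_members = defaultdict(set)
    for groups in memberships_by_dc.values():
        for group, members in groups.items():
            all_members[group] |= set(members)

    # Report, per group, which members each DC is missing relative to the union.
    drift = {}
    for group, union in all_members.items():
        missing = {}
        for dc, groups in memberships_by_dc.items():
            gap = union - set(groups.get(group, ()))
            if gap:
                missing[dc] = gap
        if missing:
            drift[group] = missing
    return drift


# Hypothetical snapshot: DC-WEST never received bob's Finance membership.
snapshot = {
    "DC-EAST": {"Finance": {"alice", "bob"}, "IT": {"carol"}},
    "DC-WEST": {"Finance": {"alice"}, "IT": {"carol"}},
}
print(find_membership_drift(snapshot))
```

An empty result would mean every DC agrees on every group; anything else points at the groups and DCs that still need attention.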
As the LDIF file could not be replicated out and the customer was desperate for a resolution, we resorted to rebuilding all the DCs using the source DC backup and performing IFM-based (install from media) promotions. This served as the interim resolution while we attempted root cause analysis. A colleague of mine joined me onsite and we recreated the problem in an isolated lab network. Analysis of the memory dump revealed issues with the storage-related drivers, specifically HpCISSs2.sys.
We then tested the DCs (all HP DL360 G5) after updating the following:
- Controller Firmware (1.82 as per KB969550)
- Disk Firmware (version HPDA for DH036ABAA5 although customer had DH036ABAA6 disks)
- Controller Drivers (as per KB969550 we installed 184.108.40.206 from a SmartStart CD)
- Storport.sys (KB957910)
We could no longer reproduce the issue, so a fix was now available. Yay! The customer decided to keep the DCs that were recovered using IFM online, turn off the remainder, and perform metadata and DNS cleanup. They now plan to rebuild the remaining DCs from scratch later, using the latest HP SmartStart CD and Windows Server 2003 R2 SP2 CDs. The online DCs will also be gradually brought up to date with the updates listed above.
Lessons learnt were as follows.
- Importance of preventing accidental deletions of key OUs.
- Importance of applying all relevant Windows Server service packs/hotfixes (disable SNP, hotfixes for LVR, ntdsutil etc.)
- Importance of updating hardware firmware/drivers
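The last lesson lends itself to a routine check. Below is a minimal Python sketch, assuming you already have an inventory of installed versions per server; the `parse_version` and `outdated_components` helpers, the component names and the installed values are hypothetical. Only the 1.82 controller firmware level comes from this post. It also assumes versions are plain dotted numbers that compare correctly part by part, which will not hold for every vendor's numbering scheme.

```python
def parse_version(text):
    """Turn a dotted version string such as '1.82' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def outdated_components(installed, required):
    """Return the sorted names of components missing or below the required version."""
    return sorted(
        name
        for name, minimum in required.items()
        if name not in installed
        or parse_version(installed[name]) < parse_version(minimum)
    )


# Required levels: 1.82 is the controller firmware mentioned above;
# "example_driver" and its version are placeholders for illustration only.
required = {"controller_firmware": "1.82", "example_driver": "2.0.0.1"}

# Hypothetical inventory of one server.
installed = {"controller_firmware": "1.66", "example_driver": "2.0.0.1"}

print(outdated_components(installed, required))
```

Running something like this against every DC on a schedule would flag servers that have fallen behind before a bad day forces the issue.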
I have tried to keep this post short by skipping the long hours and lost sleep, the time spent on the trial-and-error parts of the resolution, the challenges with third-party DNS that had to be circumvented by temporarily using Windows DNS during recovery, and the pain to the customer while they were down for 3 days. But I assure you that failing to learn the lessons highlighted above will be very painful if this happens to you.