This week has been a little strained, hence quiet on the blogging front. Apart from a hectic week at work (more to follow on that shortly), the reason was a “disaster” which happened late last Sunday evening – everything was working at home one moment, and dead the next.
Since the move over from the UK, I'm still in temporary accommodation. To save space, although my servers were couriered over, I didn't bring a monitor, as all the monitors I owned only ran on 240v. The servers arrived a little shaken, but not too stirred – a few cards were loose, but no failed disks. A little prodding into place and they came back perfectly, humming away for six weeks or so without fault.
If you've ever tried to figure out why a machine won't boot without a monitor attached, I know where you're coming from. The short answer is, it's next to impossible. It also happened that this machine was not just any machine, but a Domain Controller. And not just any domain controller – the one holding all the FSMO roles for my home domain. It will probably come as no surprise that it was also running a further five virtual machines, including my website hosting, ISA and Exchange. So yes, it was somewhat of a disaster.
On Monday morning, I took the machine into the office and, with a monitor attached, it was obvious it was continually rebooting (off both plexes in the boot mirror) before the GUI portion of the boot came up. Safe Mode and Last Known Good Configuration gave the same symptoms. Similarly, boot logging didn't help, as the boot log isn't written to disk until the GUI part of the boot starts.
I borrowed another disk from Ben, plugged it in and installed XP SP2 (only 32 bit OS immediately to hand). However, during the first boot, it blue-screened. Sure enough, there was a problem with the hardware – either motherboard or memory.
Running a memory tester showed something wrong with one or more of the (expensive!) ECC memory modules. I saw a big bill coming :(. It was a tedious process of elimination, swapping DIMMs around until the failed module or modules were identified. That at least got XP booting. Attempting to boot the original installation with the failed DIMM removed (actually a pair, as the system needs matched pairs) gave the same symptoms as before. Back in XP, I then discovered XP had no drivers for the RAID SCSI controller hosting the system boot disk and, worse, none were available. Onto plan B for recovery.
I re-installed Windows Server 2003 on the loan disk with the Recovery Console enabled to attempt to see what was going on. Chkdsk showed the SCSI disks were corrupt and the mirror needed repair. Fixing those still didn't get past the text-mode part of the boot.
Not being one to give up, I took the machine home on Monday night. During the day, my wife had bought a second hand 17″ monitor for $20.00 – given it’s in next to perfect condition, I thought that was pretty good value.
From the Recovery Console of the Windows Server installation on the loan disk, I spent two very long and tedious evenings disabling drivers one by one in the hope I'd find the driver failing to load – every time, the same 0x0000007B stop code with 0xC000007B in the parameter list: INACCESSIBLE_BOOT_DEVICE.
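For anyone stuck in the same spot, the loop looked roughly like this in the Recovery Console (the driver name here is just a placeholder – LISTSVC shows the real ones on your system):

```
C:\> LISTSVC
C:\> DISABLE suspectdriver
      (reboot and retest; same 0x7B stop)
C:\> ENABLE suspectdriver SERVICE_BOOT_START
      (no luck - put it back, move on to the next one)
```

Note that ENABLE needs the original start type, so it's worth writing down what LISTSVC reports before you disable anything.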
Well, two days later I did give up. In some ways I’m glad I did – when I took the decision to blow away the machine for real, I discovered the disks were also corrupt in some way – both of them. Blue screens on reinstall. Possibly the RAID controller? Nope, tried a spare one too 🙁 Anyway, I’ve more disks on order and more memory on order – at least they’re much cheaper in the US than in the UK.
In the meantime, with reduced RAM, I at least got the ISA server and the Exchange server back running on the loan disk. Cleaning up AD and seizing the FSMO roles held by the previous installation is easy enough (http://support.microsoft.com/?id=255504). They're now safely on a virtual domain controller running on another server.
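For reference, the seizure described in that KB article boils down to an ntdsutil session along these lines (the server name is a placeholder for your surviving DC):

```
C:\> ntdsutil
ntdsutil: roles
fsmo maintenance: connections
server connections: connect to server gooddc.domain.com
server connections: quit
fsmo maintenance: seize schema master
fsmo maintenance: seize domain naming master
fsmo maintenance: seize infrastructure master
fsmo maintenance: seize PDC
fsmo maintenance: seize RID master
fsmo maintenance: quit
ntdsutil: quit
```

Each seize first attempts a graceful transfer and only seizes when the old role holder doesn't respond – which, with the machine dead, is exactly what you want.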
However, there was one interesting side effect relating to DFS in Windows Server 2003 R2. Yes, the machine was also a file server, replicating to another server with RDC over domain-based DFS. Some of the DFS roots had the now-decommissioned server as the preferred target. What this unfortunately means is that when you go into the DFS console from another machine (either another server or an XP machine with the console installed) and examine the DFS root, you get the error below:

\\domain.com\share: The namespace cannot be queried. The RPC server is unavailable.
This only happens on roots which were configured with the failed server as the preferred target. Clients were still OK accessing the still-working server, as they failed over automatically.
So, from the File Server Management console, you're stuck – you can't remove the failed server. However, you can use the command-line utility dfsutil to forcibly remove it.
First, run:

dfsutil /root:\\domain.com\share /export:share.txt

Share.txt will look something like:
<Root Name="\\DOMAIN\Share" State="1" Timeout="300" >
<Target Server="FAILEDSERVER" Folder="Share" State="2"/>
<Target Server="GOODSERVER" Folder="Share" State="2"/>
</Root>
To delete the failed server – and remember, this is a last-ditch measure – run:

dfsutil /unmapftroot /root:\\domain\share
You're now close. To make this work, you must have access to the share on a good server. You must also bounce (at least I had to) the DFS Replication service on the good server AND restart the File Server Management console. Once done, everything will be good again. I just need to re-introduce the new server once the new disks arrive.
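If it helps, the bounce itself on the good server is just (DFSR being the short name of the R2 DFS Replication service):

```
C:\> net stop dfsr
C:\> net start dfsr
```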
So now you know one reason why it’s been a quiet week of blogging!