A Qual lab in the wild...

Hi All,

It's time for another post from 'A Certified Master'. This one is a bit different from the other entries, as Nikolai van Luijtelaar of C2iCT in the Netherlands gives an account of troubleshooting and BBQs :-). Nik attended the last 'Ranger' rotation in May last year, has just passed his lab exam, and joins our community at last. Well done Nik!

Without any further ado....

 

A few weeks ago I was asked to help with an urgent e-mail problem at a customer of one of my business relations in the field. The request came as a bit of a surprise, and after working hours. After talking to my relation for a while, I suggested he come over so we could look at it together, because at that stage he was staring at IPv6 ping responses (the wrong place) and had gotten a little stuck. As the problem was urgent enough for him, he was kind enough to accept my invitation.

After firing up his laptop and investigating the infrastructure for a bit, it became clear that, from my perspective, this was a sort of small qual lab with real negative business impact. His customer runs a 24/7 business, and its public-facing services depend on real-time correspondence mailboxes being available. Employees tend to work in the evening, sometimes overnight, and depend heavily on Outlook as well. A fast business, so to speak.

To give a short overview of my findings: in Exchange (2007 <> 2000 in coexistence, single site) there was a long line of pending inbound messages in the SMTP to Legacy Routing Group queue. This was why people were reporting e-mail problems. Active Directory NTDS and SYSVOL replication had stalled, and a zombie DC was still registered as alive. Where there had once been several DNS servers, the only one remaining ran on a Windows 2008 DC, and it was heavily polluted with stale A and NS records. The legacy Exchange 2000 cluster nodes were both Windows 2000 GCs and were still pointing to themselves for DNS. Unfortunately, they were no longer DNS servers. A few other boxes were still pointing to unreachable DNS servers, or to machines that turned out to be workstations. Finally, a single Exchange 2007 server (multi-role) was largely a red box, complaining a lot about name resolution, with Exchange services partly failing.
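The "who points where for DNS" part of these findings is easy to lose track of by hand. Below is a minimal Python sketch of the cross-check, assuming you have already collected each machine's configured resolvers into a dictionary. The host names and addresses are made up for illustration and are not the customer's.

```python
def find_bad_resolver_settings(client_config, dns_servers):
    """Return {host: [configured resolvers that are not real DNS servers]}."""
    live = set(dns_servers)
    bad = {}
    for host, resolvers in client_config.items():
        wrong = [r for r in resolvers if r not in live]
        if wrong:
            bad[host] = wrong
    return bad


# Hypothetical inventory: the only real DNS server is the W2K8 DC.
dns_servers = ["192.0.2.5"]
client_config = {
    "exch2000-node1": ["127.0.0.1"],             # points to itself, no DNS service
    "exch2000-node2": ["127.0.0.1"],
    "exch2007": ["192.0.2.5"],                   # correct
    "fileserver": ["192.0.2.77", "192.0.2.5"],   # first entry is a workstation
}

bad = find_bad_resolver_settings(client_config, dns_servers)
for host, wrong in sorted(bad.items()):
    print(f"{host}: not a DNS server -> {', '.join(wrong)}")
```

The check is deliberately offline: it only compares lists, so it is safe to run against an inventory export without touching the production boxes.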

A picture tells a thousand words. The customer situation is shown here:

A vital lesson, and one that is key to passing the qual lab, is to fix only the things that are preventing Exchange (or the application in scope) from operating correctly. Well, actually I should say *correctly enough*, when working towards one explicitly required goal and nothing else... In this case that goal was to get bidirectional mail flow going a.s.a.p. As I suspected a DNS-related issue, I first took the following steps:

1. In the Exchange 2007 Queue Viewer I found that the last reported error was “DNS lookup failed” for the Exchange 2000 cluster name, which was the remote bridgehead.

2. In DNS, the AD-integrated zone for the AD domain showed a correct host record for the cluster public address, but the private addresses of the cluster nodes (on a different network) were also registered. I cleaned those up, keeping only the 3 valid public host records (nodes and cluster).

3. Then I pointed the Exchange 2000 cluster nodes to the W2K8 DC for DNS queries, instead of to localhost, and tested lookups. (Someone must have removed the DNS server service from those nodes very recently. However, this was not yet the time to interrogate my relation.)

At this point I would have expected mail flow to pick up again, but that was not the case…

4. Then I noticed that the host record for the Exchange 2000 cluster had been created manually in DNS. In cluster manager I verified that DNS registration must complete for the cluster to come online… OK, so I removed the manually created record and failed the cluster over to the other node, which was looking healthy enough. And tada… e-mail was flowing again. We did some confirming tests and yes: primary objective completed :-)
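The cleanup in step 2 boils down to separating host records on the expected network from those that leaked in from another one. A rough Python sketch of that check follows, assuming the zone's A records have been exported as (name, address) pairs. The record names, the addresses and the `192.0.2.0/24` network are placeholders, not the real environment.

```python
import ipaddress


def split_host_records(records, valid_network):
    """Partition (name, address) A records by membership of valid_network."""
    net = ipaddress.ip_network(valid_network)
    keep, stale = [], []
    for name, address in records:
        if ipaddress.ip_address(address) in net:
            keep.append((name, address))
        else:
            stale.append((name, address))
    return keep, stale


# Hypothetical zone export: three valid records plus two private-network leftovers.
records = [
    ("exch2000-cluster", "192.0.2.10"),
    ("exch2000-node1", "192.0.2.11"),
    ("exch2000-node2", "192.0.2.12"),
    ("exch2000-node1", "10.10.10.1"),
    ("exch2000-node2", "10.10.10.2"),
]

keep, stale = split_host_records(records, "192.0.2.0/24")
for name, address in stale:
    print(f"candidate for removal: {name} -> {address}")
```

The sketch only reports candidates. Deleting the records is still a manual decision in the DNS snap-in, which fits the "fix only what blocks the goal" approach.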

Time for a break, but for a master it doesn’t really end here. My relation confirmed 100% that the zombie DC could be removed, so we used NTDSUtil, Sites and Services and the DNS snap-in to remove it permanently, and made sure AD replication was working. Now that I had learned the true DNS topology, I also took some time to walk through all DNS zones and remove invalid NS records, and did some routine Exchange server checkups. So much for the fixing, stop right there! Going any further at that point would have been unwise. It was getting late and time to wrap up.

Although we had mail flow going again, I asked my relation what his migration plans were, and why he had chosen to implement a single Exchange 2007 MLT, moving away from a redundant setup. He couldn’t really give a sound answer. As my fellow masters know, during training we are taught 1001 lessons on high availability, migration paths and everything else that matters for solid and fitting Exchange implementations. That night I showed him the tip of the iceberg: measured against the business needs, the current Exchange 2007 infrastructure was not a good fit and the migration was not well planned. Giving such feedback at the right time is vital to create an opportunity for improvement. Now that he was happy, I could take that step.

The next morning I was very pleasantly surprised with a brand new BBQ set as a gift for my help. On top of that, we are now looking into a partnership, in order to truly help this customer. A lot of trust was won that night in one effective session, and I very much hold Greg and his Ranger/Master team responsible for this personal success story. Big thanks to all of you.