Sloppiness within Design and Implementation.

Mark Arnold

Background.
It might seem an obvious point to make, but it is all too often overlooked: setting your servers up in a structured and repeatable manner is essential to maintaining the stability of your operating environment. Too many server teams build their servers from a paper-based “Server Build Document”, and that is not an approach guaranteed to bring you success. Your design documents should be reviewed not only technically but also with an eye on the capabilities and personalities within your server team; that is vital if you want to eliminate the likelihood of events such as those described below.

The Story.
Recently I was asked to look at a customer's Exchange implementation as part of a general health check, and to see whether the implementation conformed to the design, which had been produced by a third party. The results were not pretty. The design initially called for a pair of two-node Active/Passive (A/P) clusters; it did so because the third party engaged to produce it did not have the force behind them to steer the (non-technical) customer towards a more efficient configuration. There was a post-design (but pre-procurement and pre-implementation) change to form a single three-node cluster (A/A/P) and save on the fourth server. In fact the server was still to be procured and used for another purpose, but that's a different story. Unfortunately, the server team chose to work from an old version of the design document sitting in their email accounts and implemented the solution as originally designed. Luckily for all concerned, the second cluster had not been put into full production at the time of the review, so the necessary steps could be taken to break it down and bring the second Exchange Virtual Server (EVS) into the single cluster; truly a case for SharePoint and document version control if ever there was one.

The servers themselves were built from a paper script, and unfortunately various items of Windows component software were present on one server but not the other. Differing security policies had been applied to the two production nodes checked. The simplest way to resolve this situation is to create an OU structure within Active Directory containing suitably configured Group Policies and drop the servers into the correct OU, as sketched below. Again, had the server team been better trained, the mismatch would not have arisen. The design was itself flawed too, in that it stated neither which security policies the systems were to be subject to nor what the OU structure was to be.
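As a rough sketch of that fix, assuming the standard Windows Server 2003 directory tools (the OU and server names here are illustrative, not from the actual design), the OU can be created and each node's computer account moved into it from the command line:

    rem Create a dedicated OU for the Exchange cluster nodes
    dsadd ou "OU=Exchange Servers,DC=example,DC=com" -desc "Exchange 2003 cluster nodes"

    rem Move each node out of the default Computers container into the new OU
    dsmove "CN=EXCHNODE1,CN=Computers,DC=example,DC=com" -newparent "OU=Exchange Servers,DC=example,DC=com"
    dsmove "CN=EXCHNODE2,CN=Computers,DC=example,DC=com" -newparent "OU=Exchange Servers,DC=example,DC=com"

The suitably configured Group Policy Object is then linked to that OU (at the time, typically through the Group Policy Management Console). Once both nodes sit in the same OU, any policy linked there applies to both identically, which is precisely the consistency the paper script failed to deliver.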

The anti-virus package needed to be installed at the server console, so the server team installed it remotely using VNC to reach the console; unfortunately they only did so on one server. Better training would have told the team that “mstsc /console” was the right way of gaining remote access to the server console. Again, a better design would have clearly stated that the package had to be installed at the console, and that this was to be done either physically at the console or with the /console switch if done remotely.
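For the remote half of that, the Remote Desktop client of the day could attach directly to the physical console session; something along these lines (server name hypothetical):

    rem Connect to the console session, not a virtual Terminal Services session
    mstsc /v:EXCHNODE1 /console

Console-only installers behave there exactly as they would for someone standing at the keyboard, without a third-party remote control package in the mix.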

The design did not state whether the MSDTC was to be installed into its own group or into the core cluster group. The server team took the middle path, installing it into its own group on one cluster and into the core cluster group on the other. Luckily, one cluster was being decommissioned (see above), and the entirely separate SQL cluster that the server team were about to build had not yet reached that stage, so a standard could be maintained, albeit by good luck rather than good management.
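For reference, on a Windows Server 2003 cluster the "own group" variant can be scripted with cluster.exe; this is a sketch only, and the group, disk and network name resources shown are illustrative:

    rem Create the MSDTC resource in its own group and tie it to its disk and network name
    cluster res "MSDTC" /create /group:"MSDTC Group" /type:"Distributed Transaction Coordinator"
    cluster res "MSDTC" /adddep:"MSDTC Disk"
    cluster res "MSDTC" /adddep:"MSDTC Network Name"
    cluster res "MSDTC" /online

Whichever placement is chosen, the point is that the design should choose it, not the engineer on the day.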

The Active Directory accounts all resided in an Active Directory child domain. This happened because, when Active Directory was designed, the decision was to create a new forest root rather than select an NT4 domain to upgrade. Whilst this is a perfectly acceptable approach, the Exchange design document made no reference to it, and as a consequence the server team upgraded the Exchange NT4 domain (which held accounts and resources) to Windows 2000 and placed it as a child of the single root domain. The business wished to collapse the domains, but this was not possible with the accounts residing in the child. Whilst not a technical problem, no provision had been made for the procurement of a migration solution such as Quest Domain Migration or the Microsoft Active Directory Migration Tool. A better design would have collapsed the child domain into the root before bringing Exchange 2003 into the equation.

The environment had only one Global Catalogue (GC) in the main server room, plus one GC in each of the remote locations where the users were. The design did not refer to introducing a second GC into the server room, but this was captured as part of another track of work. That second GC was implemented, but the server had not been rebooted after being made a GC and consequently was not listening on the Global Catalogue port as it should. The design failed in that it did not specifically highlight the requirement for GC resiliency.
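Checking that a newly promoted GC is actually answering takes moments; for example (domain name hypothetical):

    rem Ask the locator to find a Global Catalogue for the domain
    nltest /dsgetdc:example.com /GC

    rem Confirm the server is listening on the Global Catalogue port
    netstat -an | find "3268"

Had either check been part of the build verification, the silent GC would have been caught long before the health check.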

The cluster nodes themselves had 3.5GB of memory against a design specifying 4GB, and the /3GB switch had not been implemented. The switch was not stated in the design, and the server team either did not know about it or had not seen fit to query the situation. Cluster failback was configured so that a group would fail back between the hours of midnight and 1AM. This is against best practice for a number of reasons, but the failing here was that the design did not clearly state that no failback was to be permitted except by manual process.
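By way of a sketch of both fixes (the boot.ini entry and group name below are illustrative, not from the actual build): the memory switch belongs in boot.ini, together with the /USERVA tuning Microsoft recommended alongside /3GB for Exchange 2003 servers, and failback can be disabled per group with cluster.exe:

    multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Windows Server 2003" /fastdetect /3GB /USERVA=3030

    rem Prevent automatic failback; failing back becomes a deliberate, manual step
    cluster group "EVS1" /prop AutoFailbackType=0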

The servers' regional settings were set to “English – United States” rather than the local standard. This has a higher impact on Exchange servers than on most, because of the regional templates in use within Exchange.
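A quick way to audit the setting for the logged-on account is to read the locale identifier from the registry; 00000409 is English (United States) and 00000809 is English (United Kingdom):

    reg query "HKCU\Control Panel\International" /v Locale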

In Exchange, the storage group and store names bore no relation to the design and weren't particularly understandable from an outsider's perspective. This is always an important point to consider, because a naming scheme that makes perfect sense to one person isn't a lot of good when that person finds himself glad he listened to his mother's advice about clean underpants as he falls under a bus one morning.

The Moral?
Make sure your design is comprehensive and takes into account the skills of the implementation team. Where the design is done by a third party, make sure that you properly validate it to take account of local knowledge that the third party might not have picked up on.
Make sure that you do the health checks before you migrate users onto the solution. In this case the situation was resolvable with evening reboots, and there were no major issues requiring extensive reverse engineering.