Have you ever tried to restore a server? What about a Production server? How about in the middle of the night? It never goes smoothly. Your cellphone never stops ringing. Often, the only thing that gets the server recovered is your own ingenuity and rock-star efforts. Let’s spend some cycles and try to get ahead of this.
As PFEs, one of our major roles and responsibilities is to help our customers realize “the gaps” and assist them in addressing them proactively. After an eye-opening conference call discussing recovery plans, or lack thereof, I felt even more compelled to create a post with some DR considerations. Hopefully, this will stir some thoughts and discussions (and ACTIONS!) around the matter of recovery.
Recovery can be defined as (among other things):
- To return to health
- To return to normal state
- To gain back something which was lost
In our world of IT, we could be doing any or all of these actions during what we often refer to as “Disaster Recovery,” or “DR.”
It could be from a natural or man-made disaster or other large-scale event.
- Terrorism or war
- Facility malfunction
It could be a rogue admin or disgruntled employee. Often, it’s simply an IT Pro making an innocent mistake – either small or large-scale.
- Even with the confirmation prompts of most actions within Windows, people are still, well, ‘human.’
- Anyone been on the recovery end of a script running with Admin-level credentials but not behaving as expected? Whoa daddy. That’s likely the time when you discover that backups have been failing. Since the spring. Of 2008.
Consider the statement: We do full backups of the ‘whole’ server, so in order to recover after an outage, we would simply do a full recovery of the box and be done.
Many times, a ‘full’ server backup doesn’t get key files – such as files that are in use. DBs, transaction logs, application .exe files, etc., are often not backed up by backup jobs with default settings or without special agents. We usually don’t realize this until we’re in dire straits. Or perhaps there is a Scheduled Task that is supposed to pause/quiesce the app/DB so the backup can grab a copy of the proper flat file(s)? However, the Task isn’t being monitored and it hasn’t run for 9 months (since the svc acct got locked out and we’re not monitoring it with SCOM). Also, since that last good backup 9 months ago, the app owner has upgraded the app two versions.
Consider the statement: We test recovery of our systems at the annual/recurring DR exercise/effort/mtg (you do have one of those, don’t you?)
However, as a “year in the life” passes for a system or server, it gets patched, service packed, drivers updated, settings changed (or drift), etc. Sometimes, the steps that enabled you to recover the system during the last DR exercise no longer work and the recovery suffers an epic failure.
BE PREPARED – as much as you can. Like many things, DR is always a work in progress and always changing as our systems evolve, get patched, updated or otherwise changed. Be vigilant! Be disciplined! Add Recovery to your normal work routine so it doesn’t catch you off-guard. Consider recovery before a system is even deployed. Make sure it is part of the design. Test the recovery design prior to deployment and again at regular intervals.
One tip is to add recovery testing to your own day-to-day work items.
- Consider using Outlook and Recurring Appointments with Reminders
- Monthly – test recovery of a test OU and its test contents
- Quarterly – test recovery of a complete test server and its test applications/services
- Isolated or other offline environment
- Every six months – test recovery of an entire Domain Controller (a test DC or otherwise non-production-impacting)
- Annually – perform a more formal shared DR exercise
- The Outlook Calendar method helps by blocking out Calendar time for this
- You can also Invite others to these Outlook events
- The Outlook Calendar method makes it all just a bit more official and formal
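If you’d rather script those reminders than click them together, the Outlook COM object model can create them for you. A rough sketch in PowerShell – it assumes Outlook is installed on the machine you run it from, and the subject, start time, and reminder interval are just example values:

```powershell
# Create a recurring "DR test" appointment via the Outlook COM object model.
$outlook = New-Object -ComObject Outlook.Application
$appt = $outlook.CreateItem(1)                        # 1 = olAppointmentItem
$appt.Subject = "Monthly DR test: restore test OU"    # example subject
$appt.Start = (Get-Date).AddDays(1).Date.AddHours(9)  # example: tomorrow, 9 AM
$appt.Duration = 60                                   # minutes
$appt.ReminderSet = $true
$appt.ReminderMinutesBeforeStart = 30
$pattern = $appt.GetRecurrencePattern()
$pattern.RecurrenceType = 2                           # 2 = olRecursMonthly
$appt.Save()
```

Swap the subject and recurrence type per the schedule above (weekly, quarterly, etc.), and use the appointment’s invite capability to pull teammates in.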
Now for a few DR pointers. Much of this is obvious and self-evident. It is painful, though, how often we neglect or forget the obvious.
Document. Document. DOCUMENT!
- Have two or more locations for documentation, so that a disaster to the system(s) storing your docs doesn’t leave you completely scrambling.
- Don’t underestimate the value of a hard copy – even if it’s a bit dated, it’s better than nothing
- Make sure there are application-specific docs that get tested/reviewed
- Often, the app was installed 6 years ago and no one on the current team even knows where the install bits are stored. The woman who knew the app left the company and took to a life of wandering the forests; she hasn’t been heard from since the spring of 2004.
- Application pre-requisites/details
- .NET versions?
- Service accounts? (local or Domain-based)
- Specific or non-standard NTFS or registry permissions?
- Non-standard User Rights or other local Policies or Group Policy settings
- Track application service releases/updates/etc – so you’re able to get back to where you are via clean install + updates, if needed
- Have a selection of these accessible:
- CD/DVD blanks
- USB thumb and bigger drives
- 3 ½” floppy disks – if you need one of these, they can be very hard to find these days
- Some folks have mature “Configuration Management Database” (CMDB) systems to track server/application personality information and settings
- SCCM can help automate a great deal of this personality information via Inventory jobs
- CMDBs are extremely helpful but many times, they are not running on a ‘highly available’ system and during a DR (exercise or real) might not be available. Examine your environment to see if you’ve painted yourself into a corner like this
- Again, don’t be afraid of hard-copy – just be sure to secure it. There’s nothing better than a big ol’ DR binder when you need it.
- Consider storing the following info as a good start
- HDD sizes (especially C:)
- C:\Windows or C:\WINNT?
- Service Pack levels
- x86 vs x64?
- Windows Firewall – custom ports/settings
- Custom or non-standard Local Policies, reg entries, GPOs
- Local Admin password (hopefully managed as part of an on-going process)
- TCP/IP info
- Static routes
- NIC settings and info
- Don’t forget NIC speed/duplex
- Hardware config/info
- Driver versions
- BIOS versions and custom settings (i.e. virtualization, power mgmt, etc)
- Storage/array configs/logical drive layout
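Much of that per-server “personality” can be captured with a short PowerShell/WMI script and dropped into your doc store on a schedule. A rough sketch – the output path is a placeholder, and exact property names vary a bit by OS version:

```powershell
# Capture basic server "personality" info to a text file (run locally).
$out  = "C:\DR\$($env:COMPUTERNAME)-personality.txt"  # placeholder path
$os   = Get-WmiObject Win32_OperatingSystem
$bios = Get-WmiObject Win32_BIOS
$cs   = Get-WmiObject Win32_ComputerSystem
"OS:    $($os.Caption) SP$($os.ServicePackMajorVersion)" | Out-File $out
"Arch:  $($os.OSArchitecture)"                 | Out-File $out -Append
"BIOS:  $($bios.SMBIOSBIOSVersion)"            | Out-File $out -Append
"Model: $($cs.Manufacturer) $($cs.Model)"      | Out-File $out -Append
Get-WmiObject Win32_LogicalDisk -Filter "DriveType=3" | ForEach-Object {
    "Disk:  $($_.DeviceID) $([math]::Round($_.Size/1GB)) GB"
} | Out-File $out -Append
ipconfig /all | Out-File $out -Append          # NIC/TCP-IP details
route print   | Out-File $out -Append          # static routes
```

It’s no CMDB, but a scheduled run of something like this gives you a dated snapshot you can print for the DR binder.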
- For AD-specific recovery, consider the following as a start:
- GPOs – are you backing up your GPOs?
- Consider PowerShell and/or GPMC scripts
- OU information along with GPO link information
- Note, GPMC backups do not back up the GPO links (they’re an aspect of the OU, not the GPO itself), but the link information is recorded in the GPO report within the backup
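As a sketch, the GroupPolicy module (on a box with GPMC/RSAT) can cover both the GPO backup and a per-OU record of the links; the backup share and file names below are placeholders:

```powershell
Import-Module GroupPolicy
Import-Module ActiveDirectory
# Back up every GPO in the domain to a (placeholder) backup share
Backup-GPO -All -Path "\\backupsrv\GPOBackups" -Comment "Weekly GPO backup"
# Record which GPOs are linked where -- link info lives on the OU,
# not inside the GPO backup itself
Get-ADOrganizationalUnit -Filter * |
    ForEach-Object { Get-GPInheritance -Target $_.DistinguishedName } |
    Select-Object -ExpandProperty GpoLinks |
    Export-Csv "\\backupsrv\GPOBackups\gpo-links.csv" -NoTypeInformation
```

Store both outputs with the rest of your DR docs so a GPO restore comes with a map of where everything was linked.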
- OU permissions/delegations
- Consider PowerShell and/or a DSACLS script
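A quick way to snapshot those delegations is simply to dump DSACLS output for every OU to a file – crude, but exactly the kind of thing you’ll wish you had mid-recovery. The report path is a placeholder:

```powershell
Import-Module ActiveDirectory
# Dump the ACL of every OU in the domain to one text file
$report = "C:\DR\ou-permissions.txt"   # placeholder path
Get-ADOrganizationalUnit -Filter * | ForEach-Object {
    "===== $($_.DistinguishedName) =====" | Out-File $report -Append
    dsacls $_.DistinguishedName           | Out-File $report -Append
}
```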
- Directory Services Restore Mode (DSRM) Password
- This is set on EACH DC independently and is very often poorly managed (if at all)
- However, this can now be sync’d to a Domain account
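The sync itself is done per-DC with ntdsutil (Server 2008 or later, with the DSRM-sync update applied); the account name below is a placeholder, and the account exists only to carry the password, so keep it disabled and tightly controlled:

```powershell
# Sync this DC's DSRM password to a (placeholder) domain account's password.
# Run on each DC -- a scheduled task keeps it current after password changes.
ntdsutil "set dsrm password" "sync from domain account DsrmSyncAcct" q q
```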
- Current, accurate location of servers
- In a large datacenter, simply finding the right physical server can be a maddening and high-calorie-burn endeavor
- Virtual servers have their own set of ‘hide and seek’ issues
- Tested recovery/boot CDs for pwd reset, dead server revival/data-harvesting/etc
- Many times, the storage drivers on these need to be updated or they won’t ‘see’ the drives and can’t find the Windows installation
- The Microsoft DaRT tool can help in this regard
Hopefully, the information here reminds you of DR, gets you thinking about DR, brings up an idea or two about DR, or even stirs you to set up some Outlook appointments.
Now, take action and be at least a little better prepared.