After a long hiatus, Pond’s Laws of System Design (or, How to Be a Competent Professional in an Irrational World) returns. Thanks to Jimmy May for a kick in the pants on this topic several months ago; I hope he’ll understand why I just had to wait until right now to publish this post (the clue lies in a very careful examination of the datetime stamp on this post vis-à-vis our topic).
I know you’re asking yourself: why would I practice looking for an impossible poker hand in front of my (great-(great-(great-)))grandboss?
Perhaps I should explain further. This has nothing to do with poker, but everything to do with how the reliability of your service is perceived at the highest levels of your corporate food chain.
“Five nines” is high-availability shorthand for 99.999% system availability, a holy grail of design, development, configuration, and operations. Five nines implies less than five and a half minutes of downtime a year, and you only get that kind of resiliency if both the physical and logical components of the solution are designed for it, from the ground up, and if your processes and procedures and your execution of them are all sufficiently mature to support such an ambitious environment.
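To make the arithmetic behind that figure concrete, here is a minimal sketch that converts a count of nines into a yearly downtime budget. The function name and the 365-day year are my own assumptions for illustration; a leap year shifts the numbers very slightly.

```python
# Downtime budget per year for a given number of "nines".
# Assumes a 365-day year (525,600 minutes); names are illustrative only.
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(nines: int) -> float:
    """Return the minutes of downtime allowed per year at N nines of availability."""
    unavailability = 10 ** -nines  # e.g. 5 nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability


for n in range(2, 6):
    availability_pct = 100 * (1 - 10 ** -n)
    budget = downtime_minutes_per_year(n)
    print(f"{n} nines ({availability_pct:.3f}%): {budget:.2f} minutes/year")
```

Five nines works out to roughly 5.26 minutes a year, which is where the "less than five and a half minutes" figure comes from. Each nine you give up multiplies the budget by ten.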
This is no small thing. If you can provide five nines availability on a mission-critical system, in many cases you can build a fine career for yourself. Even if the hardware side of five nines is too expensive in your environment, the consumers of your services will benefit if your designs and processes are as comprehensive as you can make them. And you’ll be one giant leap closer to being “a competent professional in an irrational world,” which is, after all, the goal of this series.
So how do you get to five nines? You design for contingencies and then you test your designs. In this respect I instinctively recall the Boy Scout motto from my youth: “Be prepared.” This simple admonition has served me very well over the years.
It’s not tough to get the design part just about right. There’s a great deal of literature pertinent to designing software and hardware for five nines-level availability. My blogroll (at the left; scroll up/down to the last populated area) offers a host of places to start.
Once you’re prepared, though, the datacenter team needs practice! Take the Kim Tripp maxim, “if you haven’t tested the restore, you don’t know you have a backup,” and apply it to every component in your stack. Run your restores, run your cutovers, test connectivity to your failover sites. CHECK EVERYTHING. In sequence and in the same session, if you can. There’s some good advice in the comments on the post that inspired Jimmy’s pants-kick.
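One way to keep a drill honest is to script the checklist so every step runs, in order, in a single session, and nothing gets skipped because an earlier step failed. The sketch below is purely illustrative: the step names and the placeholder checks are hypothetical stand-ins for whatever your shop actually runs (a real restore, a real cutover, a real connectivity probe).

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class DrillStep:
    name: str
    check: Callable[[], bool]  # returns True if the step succeeded


def run_drill(steps: List[DrillStep]) -> List[Tuple[str, bool]]:
    """Run every step in sequence; a failure or exception is recorded, not fatal."""
    results = []
    for step in steps:
        try:
            ok = bool(step.check())
        except Exception:
            ok = False  # a step that blows up counts as a failed step
        results.append((step.name, ok))
    return results


# Hypothetical checklist; replace each lambda with a real check.
drill = [
    DrillStep("restore latest backup to scratch server", lambda: True),
    DrillStep("cut over to standby", lambda: True),
    DrillStep("connect to failover site", lambda: True),
]

for name, ok in run_drill(drill):
    print(f"{'PASS' if ok else 'FAIL'}  {name}")
```

The design choice worth noting is that the runner keeps going after a failure: the point of the exercise is to find every weak spot in one pass, not just the first one.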
This is where many shops fail, and when disaster strikes it’s much worse than it needs to be, because the team executing the fix is simultaneously executing a learning curve. However, if you schedule exercises quarterly and rotate half your staff through the process each time, then they won’t be trying to get comfortable with their approach when things hit the fan. These insights lead directly to the establishment of Pond’s Twelfth Law:
Don’t practice in front of the CIO. A professional prepares ahead of time.
When I worked at the oil company, we had a weekly maintenance window; my partner and I switched off every week. We each were exposed to the entire spectrum of the maintenance and upgrade process for our entire product stack. One particular Sunday morning (of course it was my Sunday morning!), a database software upgrade went south. Because I had been through upgrades and restarts many times, when it became necessary I was very confident making the “Houston, we have a problem” calls to both our management and the database vendor’s critical situation team.
I worked a forty-hour day troubleshooting that issue (yes, I was much younger then). It ultimately turned out that we’d uncovered a hardware-specific bug in the database upgrade of which the vendor was unaware.
By Tuesday night we had a hotfix for our platform from the vendor, and our system was back up. On Thursday, we received a letter of apology and thanks from the vendor’s Vice President of Development, saying among other things that “it was very helpful to have someone of Ward’s skill and qualifications working your side of the problem.”
I don’t cite these events to entice your compliments on my dedication and brilliance. It’s important to note that I wasn’t unduly brilliant in this instance; I can very comfortably say, “I was just following orders.”
Our shop had practiced disaster recovery, both on-site and off-site, many times, and as a team we had been through upwards of six database upgrades without incident. Our processes and their regular repetition had prepared me to the point where, when things went wrong that Sunday morning, I was instinctively able to respond quickly and appropriately.
Our team practiced in front of each other. When it was time to go on-stage in front of our CIO, I was ready. I had rehearsed, enough that I hit pretty much every note in my “performance.”
Jimmy says, “the time to learn to put out a fire isn’t when your home is burning down.” Different metaphor, same lesson. The oil company’s corporate headquarters was in a Los Angeles skyscraper; back in the ’80s and ’90s, I was part of the building’s Emergency Response Team for earthquakes and such. We practiced four times a year, and the LAFD told us we were pretty good for a bunch of desk jockeys. It was a point of pride for everyone on the team; none of us wanted to be unprepared if, heaven forbid, we were ever called upon to perform. We each had a clearly defined responsibility, and we regularly practiced our roles.
It should be the same with your datacenter’s “emergency response team.” They need to practice for things that you dearly wish won’t happen and take pride in their ability to address them.
If you don’t expend this effort, instead of showing your CIO an impossible poker hand, you’ll be on-stage singing in front of her without rehearsal.
Don’t be that guy or gal. Don’t practice in front of the CIO.
Have you ever practiced in front of the CIO? Have you ever been well-prepared and happy about it? Have you figured out that datetime stamp/post topic puzzler from the first paragraph? Please leave a comment and share your thoughts.
this copyrighted material was originally posted at http://blogs.technet.com/wardpond.
the author and his employer are pleased to provide this content for you at that site, and via rss, free of charge and without advertising.
the author welcomes and appreciates links to and citations of his work. however, if you are viewing the full text of this article at any other website, be aware that its author does not endorse and is not compensated by any advertising or access fees you may be subjected to outside the original web and rss sites