Like most Ops teams, the MSCom Operations team manages production servers: 1800+ of them, hosting over 120 different web properties for internal Microsoft groups. We also play an important role in helping ship Microsoft system software: we deploy several Microsoft system software products (OS, IIS, SQL, etc.) before they ship to the marketplace. In fact, we are often running all of www.microsoft.com on the next version of an OS or Service Pack one year before it releases.
The MSCom Operations team is challenged to maintain a known and consistent software platform across our managed enterprise. Our internal customers need to know what platform (technology versions) they should be developing and testing against, and to understand when and how those technology versions will be deployed into production environments. The challenge is balancing two sometimes conflicting missions –
1) provide a known and reliable hosting platform with a predictable change cycle and
2) run new software before it ships in order to prove its features and capabilities, and to provide feedback to product teams.
To address this challenge, we partnered with our internal customer groups to revitalize an old practice of running a regular platform update. Several years ago, we had such a process, but over time we slowly drifted away from it. The drift was partially caused by the rapid growth in the number of servers and sites supported and by the belief that tight control of the platform constrained what our hosted customers needed (or wanted) to do. We never totally abandoned the process, but we became less formal about it, and over time the standard platform became less standard and the number of software version combinations increased significantly. That variability created problems for our customers (how do I match my dev/test platform to the target production platform?) and increased the cost to Operations, both in time to resolve issues and in maintaining knowledge of what was fixed in what version.
The first step to solving this problem was agreeing there was a problem. The next step was gaining commitment from our customers and our own team that this was a shared problem and that both ‘groups’ needed to work together for a common solution that allowed both to meet individual goals.
In finding a solution we’ve had to consider and address the following aspects:
Platform: The first step was to agree on what we meant by platform – what software was to be included in the definition, and at what level of detail we were trying to define and control it. We decided to start slowly and agreed to define and track the versions of key pieces of the platform – the OS, IIS, SQL Server, the .NET Framework and a few tools/services required to manage and monitor the environment. Over time, we will extend this definition to include additional application components as well as new tools we may put in place.
Environments: We have six defined server environments: test, performance, pre-production, beta, staging, and production. As an application moves through the SDLC, the code moves from one environment to the next. Each successive environment has a higher level of control and manageability. The defined platform should apply to all environments. For now, Operations deploys and audits the platform only in the environments it is directly responsible for – pre-production, beta, staging and production. Lab managers handle platform deployment and auditing for the test and performance environments.
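As a rough illustration of the environment split described above, here is a minimal Python sketch. The names `ENVIRONMENTS`, `OPERATIONS_MANAGED`, `platform_owner` and `next_environment` are made up for this example, and the promotion order is assumed to match the order in which the environments are listed; this is not our actual tooling.

```python
# Hypothetical model of the six environments; promotion order is assumed
# to follow the order in which they are listed in the text.
ENVIRONMENTS = [
    "test",
    "performance",
    "pre-production",
    "beta",
    "staging",
    "production",
]

# Environments where Operations deploys and audits the platform.
OPERATIONS_MANAGED = {"pre-production", "beta", "staging", "production"}


def platform_owner(environment):
    """Return who deploys and audits the platform in the given environment."""
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment!r}")
    if environment in OPERATIONS_MANAGED:
        return "Operations"
    return "Lab managers"


def next_environment(environment):
    """Return the next environment in the assumed promotion order, or None."""
    index = ENVIRONMENTS.index(environment)
    if index + 1 < len(ENVIRONMENTS):
        return ENVIRONMENTS[index + 1]
    return None


if __name__ == "__main__":
    for env in ENVIRONMENTS:
        print(f"{env}: platform managed by {platform_owner(env)}")
```

Keeping the ownership rule in one place like this makes it easy to change if Operations later takes on the test and performance environments as well.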
Frequency: Initially, Operations and its customer groups decided on a quarterly platform update. That frequency seemed to strike the right balance and led to the process name ‘Quarterly Platform Update’.
Requesting and Approving Change to the platform: What is added to the platform, and who approves the changes? Anyone can propose a change to the platform. All proposed changes go through a two-phase review process. First, an internal change review board consisting only of Operations personnel performs an initial review and sanity check. Then, 5-10 days later, the list of proposed changes is reviewed by the external change review board, composed of Operations representatives and 1-2 representatives from each customer group we host applications for. The external change review board makes the final decision on what will or will not be included in each quarterly platform update.
Communication and Scheduling: Working with our customer groups, we have defined and now maintain a high-level quarterly schedule (process milestones) for the next 5 quarters and a rolling detailed schedule for the next 2 quarters. As a result, we generally have a detailed view of the specific changes scheduled for deployment 2 quarters in advance. As part of the process we maintain a regular communication schedule. Details of approved changes are sent out in global communications (to all members of Operations and its customer groups) 6 weeks prior to the beginning of the quarter. Once deployment starts, weekly status reports help Operations maintain focus by communicating progress and sharing information on any issues that arise. Individual web and database engineers work with the specific customer groups they support to create detailed release plans for each system/application, synchronized with the quarterly platform update’s master schedule. Clear and consistent communication throughout the entire process is a must, not a nice-to-have.
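To make the 6-week lead time concrete, here is a small Python sketch that works out when the approved-change communication would go out for the next quarter. It assumes quarters start on the first day of January, April, July and October; the function names `next_quarter_start` and `announcement_date` are invented for this example.

```python
from datetime import date, timedelta


def next_quarter_start(today):
    """First day of the quarter that follows `today`.

    Assumption: quarters begin on Jan 1, Apr 1, Jul 1 and Oct 1.
    """
    current_quarter = (today.month - 1) // 3  # 0..3
    next_quarter = current_quarter + 1        # 1..4, where 4 wraps to next year
    year = today.year + (1 if next_quarter == 4 else 0)
    month = (next_quarter % 4) * 3 + 1
    return date(year, month, 1)


def announcement_date(quarter_start, lead_weeks=6):
    """Date on which details of approved changes are due to be sent out."""
    return quarter_start - timedelta(weeks=lead_weeks)


if __name__ == "__main__":
    start = next_quarter_start(date.today())
    print("Next quarter starts:", start)
    print("Approved changes announced by:", announcement_date(start))
```

For example, for a quarter starting on July 1, the announcement date works out to May 20 of the same year.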
Deployment and Auditing the platform: This topic requires more than 1-2 paragraphs and hence will be a future blog posting. To get a brief idea of how to deploy changes to a large environment, see the recent blog post Scripting Patch Management of Enterprise Web Clusters on Microsoft.com. Needless to say, deployment and auditing are not trivial efforts.
Future Roadmap / Change Schedule: We’ve defined a forward-looking change schedule for major portions of the platform. This benefits hosted customers: they have a clearer understanding of what the platform will be several quarters into the future and can more easily incorporate platform changes into their application roadmaps. You may be wondering how MSCom handles the adoption of new technologies. We have a separate program for that, which ties directly into the quarterly platform update program. This program, internally named MOTAP (Microsoft Operations Technology Adoption Program), defines how the Operations team works with product groups and customers to adopt new technologies, and who is involved. We will describe MOTAP further in a future blog post.
Lessons Learned: The quarterly update process is helping us rejuvenate shared goals across all customer groups. We have defined a standard software platform, a schedule and process for regularly updating our managed environments to meet that standard and a process for advancing/changing the platform definition. The platform defines the ‘minimum bar’ for software versions, but we still support deploying more current versions if the customer and Operations are in agreement. The minimum bar keeps the platform current and reduces the number of system software combinations to support. Updating the platform across 1800+ servers each quarter is a lot of work, but it’s the right thing to do.
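The ‘minimum bar’ idea can be sketched in a few lines of Python. This is only an illustration, not our audit tooling: the component keys and all version numbers below are placeholders, and `parse_version` and `audit_server` are hypothetical helpers.

```python
def parse_version(text):
    """Turn a dotted version string such as '2.1.0' into a comparable tuple.

    Assumption: both versions being compared have the same number of parts.
    """
    return tuple(int(part) for part in text.split("."))


def audit_server(installed, minimum_bar):
    """Return {component: (installed_version, required_version)} for every
    component that is missing or below the minimum bar.

    Versions above the bar are allowed, matching the rule that more current
    versions can be deployed when the customer and Operations agree.
    """
    findings = {}
    for component, required in minimum_bar.items():
        have = installed.get(component)
        if have is None or parse_version(have) < parse_version(required):
            findings[component] = (have, required)
    return findings


if __name__ == "__main__":
    # Placeholder version numbers, for illustration only.
    minimum_bar = {
        "os": "2.0.0",
        "iis": "2.0",
        "sql_server": "3.0.0",
        "dotnet_framework": "1.1",
    }
    server = {"os": "2.1.0", "iis": "1.9", "sql_server": "3.0.1"}
    for component, (have, required) in audit_server(server, minimum_bar).items():
        print(f"{component}: installed={have}, minimum={required}")
```

In this example the server would be flagged for IIS (below the bar) and for the .NET Framework (not installed), while the newer OS and SQL Server versions pass.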