Intro and Description of Our Environment
The Monitoring & Management (M&M) team inside the Cloud and Datacenter Management Product Group runs a System Center-based monitoring platform that is leveraged by about 300 engineers and support personnel across three different divisions and eight different business groups. They rely on our Service Manager (SM) system for ticketing and escalation and Operations Manager (OM) system for server and application alerting. We run these components of System Center so they can concentrate on application development and change/release.
We have a good-sized SM implementation with just under 40,000 incidents in the system and an average of 2,400 new incidents/day. SM is fed by an 8,000-server System Center – Operations Manager (OM) management group, plus eight other small OM instances via the out-of-box OM->SM connectors. We have our incident form customized through the Forms Extension Management Pack (MP), with custom fields we pump through to the Data Warehouse (SMDW). We run (3) management servers and (2) portal servers as virtual machines. Our (1) database server and (1) data warehouse server are both physical servers.
Why upgrade from System Center 2012 – Service Manager to System Center 2012 SP1 – Service Manager? There are some fixes to SM as well as OM SP1 compatibility improvements. SM SP1 is a prerequisite for upgrading to OM SP1. Because there aren’t notable customer-facing SM changes in this release that are relevant to our business, keeping consistency with existing open tickets as well as minimizing any downtime are important goals. We see the upgrade of SM to 2012 SP1 as the first of two steps; the second being the OM SP1 upgrade.
The remainder of this blog post describes the preparation and process we went through to upgrade to 2012 SP1.
Prerequisites and Preparation
To get ready for SP1, we upgraded our test environment about ten times to ensure we understood the impact of the upgrade and could validate that core ticketing and the DW/reporting components both worked as expected after the upgrade. This meant building out SM and connecting it to our production OM to simulate load and configuration as closely as possible. Granted, we started the testing with early RC builds, but even with RTM SP1 bits, we’d recommend at least a couple of full tests.
One thing we noticed right away is that some of our custom PowerShell commands stopped working after the SP1 upgrade. This was traced back to a missed reference change. The fix will be in a future Cumulative Update (CU), but in the meantime, if you are leveraging PowerShell in your SM implementation, you’ll need to do the following. The MonitoringHost.exe.config file in C:\Program Files\Microsoft System Center 2012\Service Manager needs to be updated so that the binding redirect for the PowerShell module assembly points to the latest version. See below for what our MonitoringHost.exe.config looks like after being updated to 7.0.5000.0:
<assemblyIdentity name="Microsoft.EnterpriseManagement.Modules.PowerShell" publicKeyToken="31bf3856ad364e35" />
<bindingRedirect oldVersion="6.0.4900.0" newVersion="7.0.5000.0" />
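If you want to confirm the redirect is in place on each management server after editing the file, a small read-only check can help. The following is a minimal sketch in Python, not part of our actual process. It assumes the two elements above sit inside a standard .NET `dependentAssembly` element under `configuration/runtime/assemblyBinding`; the surrounding structure in `SAMPLE` and the function name `find_redirect` are illustrative assumptions, so compare against your own file before relying on it.

```python
import xml.etree.ElementTree as ET

# Assembly name taken from the config fragment shown above.
MODULE = "Microsoft.EnterpriseManagement.Modules.PowerShell"


def local_name(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def find_redirect(config_xml, module=MODULE):
    """Return (oldVersion, newVersion) for the module's redirect, or None."""
    root = ET.fromstring(config_xml)
    for elem in root.iter():
        if local_name(elem.tag) != "dependentAssembly":
            continue
        identity = None
        redirect = None
        for child in elem:
            name = local_name(child.tag)
            if name == "assemblyIdentity":
                identity = child
            elif name == "bindingRedirect":
                redirect = child
        if (identity is not None and redirect is not None
                and identity.get("name") == module):
            return redirect.get("oldVersion"), redirect.get("newVersion")
    return None


# Hypothetical surrounding structure (assumption); only the two inner
# elements come from our actual MonitoringHost.exe.config.
SAMPLE = """<configuration>
  <runtime>
    <assemblyBinding xmlns="urn:schemas-microsoft-com:asm.v1">
      <dependentAssembly>
        <assemblyIdentity name="Microsoft.EnterpriseManagement.Modules.PowerShell" publicKeyToken="31bf3856ad364e35" />
        <bindingRedirect oldVersion="6.0.4900.0" newVersion="7.0.5000.0" />
      </dependentAssembly>
    </assemblyBinding>
  </runtime>
</configuration>"""

if __name__ == "__main__":
    print(find_redirect(SAMPLE))
```

The check only reads the file contents, so it is safe to run against a copy of the production config; a result of `None` means no redirect for the PowerShell module was found.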
We leveraged an early morning maintenance window to complete the SM upgrade. Testing demonstrated that we should expect the SM primary Management Server to take ~32 minutes to run through the install, during which time the workflows (including the OM connector and Exchange connector) would not be processing OM alerts and e-mails into SM. To minimize impact, we chose a 6am->7:30am window. This was the best combination of low incident and service request volume, and it also set us up to have our engineering team & partners “all hands on deck” if something went sideways. An after-hours evening update would be more disruptive if partners had to communicate and mitigate a longer alert->incident outage.
Below is a table of our upgrade steps and the timeline. It’s essentially in three phases:
- First, back up the DW, then take it offline and upgrade it. We lined up disabling the DW jobs as the first customer-impacting event at the start of our maintenance window.
- Second, upgrade Management servers.
- Third, re-enable DW and final validation.
During the entire upgrade window, we had our Tier 1 as well as the bulk of our Engineering team watching the entire system, so when it came to validation at the end, we were able to see incidents flow through the system in real time and quickly call success.
Net impact to our partners: Overall, we completed the upgrade within the communicated maintenance window with lower than expected total impact, since incident processing was only completely offline for 18 minutes of the 90-minute window. Specifically, from 6:21am to 6:39am there was no incident routing or email processing because workflows were stopped on the primary management server during the upgrade. Incident routing started flowing again at 6:39am, but it ran about 10 minutes behind until 7:30am, when we applied the MonitoringHost update. Another thing to be aware of is that, after the upgrade, the DW jobs take a while to run and sync for the first time. All DW jobs were finishing successfully by around 12 noon. Because this data isn’t used for “real time” analysis, this was acceptable.
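To put those numbers in perspective, here is a rough back-of-the-envelope calculation in Python. The times and the 2,400 incidents/day average come from this post; the helper name `minutes_between` and the assumption that incidents arrive evenly around the clock are ours for illustration only. Since we deliberately picked a low-volume window, the real number of delayed incidents was likely lower than this estimate.

```python
from datetime import datetime


def minutes_between(start, end):
    """Whole minutes between two same-day 'HH:MM' clock times."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Times taken from the post.
window_minutes = minutes_between("06:00", "07:30")   # maintenance window
outage_minutes = minutes_between("06:21", "06:39")   # workflows stopped

# Share of the window with no incident routing at all.
outage_share = outage_minutes / window_minutes

# Assumption: the 2,400/day average is spread evenly over 24 hours.
AVG_INCIDENTS_PER_DAY = 2400
estimated_delayed = AVG_INCIDENTS_PER_DAY * outage_minutes / (24 * 60)

if __name__ == "__main__":
    print(f"Window: {window_minutes} min, outage: {outage_minutes} min")
    print(f"Outage share of window: {outage_share:.0%}")
    print(f"Incidents delayed at the daily average rate: {estimated_delayed:.0f}")
```

Under that even-arrival assumption, the full outage covered one fifth of the window and would have held back roughly 30 incidents until routing resumed.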
What would we do differently? We cut it pretty close with 90 minutes to complete all work. The actual SM downtime (from an incident processing perspective) was small, but the final steps bumped right up against the 7:30 deadline. The main reason is that we ran the steps serially through a single engineer to force accountability and make sure everything went smoothly. Given that things did go without a hiccup, next time we’ll take a little more coordination risk and divide the DW and SM upgrade work to buy some more time within the window. I’d also have us run the console and portal updates concurrently with the secondary management servers to leave less to do at the validation phase.
As a summary, the SM upgrade went well and we are going to let the system bake for the next two weeks and ensure everything stays stable. In the meantime, we are coordinating with our partners on the right next maintenance window to update OM to SP1. From a feature/impact perspective, this is a really exciting update that brings in Web Availability, Application Performance Monitoring and custom dashboards to our monitoring offering. We will do a similar blog post on our OM upgrade experience in the coming weeks.