Zero-Downtime Patch & Update Orchestration on the Microsoft Cloud Platform System

We are privileged to have a guest blogger on Building Clouds, Justin Incarnato.  Justin is a Program Manager in Microsoft’s Enterprise Cloud Solutions Group where he has been responsible for developing the Patch and Update process for the Microsoft Cloud Platform System.

Who should read this blog?

If you are interested in learning how Microsoft addressed the challenge of updating converged systems without disrupting tenant workloads, this post is for you.  It describes Microsoft’s investment in a zero-downtime, orchestrated patch and update engine designed for our Cloud Platform System.  The engine was designed to give customers a simple, reliable and non-invasive process for updating all software components on a stamp to the most secure and functional levels.

What is the Microsoft Patch and Update Engine?

The Microsoft Cloud Platform System Patch & Update Engine is an extensible update workflow capable of collecting a full inventory and automatically updating all Microsoft software on a Cloud Platform System stamp.  The engine was designed to update one to four racks in a Cloud Platform System stamp without disrupting tenant workloads.  Understanding the inter-dependencies across the various components, as well as the intra-dependencies within components, is key to the success of the engine and its capabilities.

What do I need to run the P&U Engine?

The dependencies of the engine itself are self-contained and intentionally minimal, and require no outside tooling to run.  The only requirements are an elevated account in Active Directory, environment setup, and access to the “P&U Update Packages”, which are themselves self-contained.  Launching the update process is as simple as executing a single PowerShell script.  The P&U packages provide both the manifest, which expresses the installation sequence, and the bill of materials (BOM), which contains the actual updates for all software components on the stamp.  Components updated include Windows Server 2012 R2, System Center 2012 R2 and the Windows Azure Pack.  Including the updates in a self-contained package allows customers to keep current with the latest security and functional updates and to operate in a disconnected environment if required.
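To make the manifest and BOM split a little more concrete, here is a minimal Python sketch. The real package format is not described in this post, so the field name `sequence`, the payload file names and the `missing_payloads` helper are all hypothetical. The sketch only illustrates one check a self-contained package makes possible: confirming, before anything runs, that every step in the installation sequence has a matching payload in the BOM.

```python
# Hypothetical, simplified stand-ins for a P&U package.
# The real manifest and BOM formats are not shown in this post.
manifest = {
    "sequence": [
        "Virtual Machine Manager",
        "Service Provider Foundation",
        "Windows Azure Pack",
    ],
}
bom = {
    "Virtual Machine Manager": "vmm-update.bin",      # made-up payload names
    "Service Provider Foundation": "spf-update.bin",
    "Windows Azure Pack": "wap-update.bin",
}


def missing_payloads(manifest, bom):
    """Return the manifest steps that have no payload in the BOM, in sequence order."""
    return [step for step in manifest["sequence"] if step not in bom]


missing = missing_payloads(manifest, bom)
print(missing or "package is self-contained")
```

If the list comes back empty, nothing has to be downloaded during the run, which is the property that matters in a disconnected environment.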

Are the update packages validated?

Microsoft ensures all updates contained in the P&U packages are tested and validated internally on both Test and Production CPS racks before providing them to customers. 

Does the P&U Engine collect inventory?

Yes, during the update process the P&U Engine takes a full inventory of the stamp, including all software, firmware, drivers, services, roles and hotfixes, as well as properties of the BIOS, NICs, disks and enclosures, to name a few.  The inventory runs both before and after the update process, and its well-structured output is available for reporting or compliance.
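Because the inventory is taken both before and after the update, the two snapshots can be compared to see what a run actually changed. The sketch below assumes, purely for illustration, that each snapshot can be reduced to a mapping of item name to version; the item names, version strings and the `diff_inventory` helper are made up and do not reflect the engine's real output format.

```python
def diff_inventory(before, after):
    """Compare two inventory snapshots (item name -> version) and report the differences."""
    shared = before.keys() & after.keys()
    return {
        "changed": {k: (before[k], after[k]) for k in shared if before[k] != after[k]},
        "added": {k: after[k] for k in after.keys() - before.keys()},
        "removed": {k: before[k] for k in before.keys() - after.keys()},
    }


# Illustrative snapshots only; not real inventory data.
pre_update = {"hotfix-A": "1.0", "nic-driver": "2.1", "bios": "1.4"}
post_update = {"hotfix-A": "1.0", "nic-driver": "2.2", "bios": "1.4", "hotfix-B": "1.0"}

report = diff_inventory(pre_update, post_update)
print(report)
```

A report like this is the kind of artifact that could be attached to a change record for compliance purposes.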

How does it actually update my stamp?

During the actual servicing of a component, the engine removes the workload from behind the load balancer, places the object into maintenance mode in System Center Operations Manager, and then updates the component.  Next, the engine validates the component, exits maintenance mode, and returns the component to service behind the load balancer.  Technologies such as Cluster-Aware Updating (CAU) are used when updating physical cluster nodes to ensure tenant virtual machines are live migrated from one node to another, seamlessly for the end user and their respective service(s).  Some physical node updates are run in parallel to deliver the updates across the solution as quickly and safely as possible.
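The servicing sequence described above can be sketched as a fixed list of steps run in order. This is a minimal illustration in Python, not the engine's implementation (which is launched from PowerShell); the step names, the `service_component` function and the stand-in actions are all hypothetical. In this sketch, a step that raises an exception aborts servicing, so a component that fails its update or validation is never returned to the load balancer.

```python
# Step names are hypothetical labels for the sequence described in the text.
STEPS = (
    "remove_from_load_balancer",
    "enter_maintenance_mode",
    "apply_update",
    "validate",
    "exit_maintenance_mode",
    "return_to_load_balancer",
)


def service_component(name, ops):
    """Run every servicing step for one component, in order.

    `ops` maps each step name to a callable that takes the component name.
    An exception from any step propagates and stops the remaining steps.
    Returns the list of steps that completed.
    """
    completed = []
    for step in STEPS:
        ops[step](name)
        completed.append(step)
    return completed


# Stand-in actions that only record what was called; real ones would talk to
# the load balancer, Operations Manager, and the component being updated.
calls = []
ops = {step: (lambda component, step=step: calls.append((step, component))) for step in STEPS}

service_component("WAP Tenant Site 1", ops)
print(calls)
```

Passing the actions in as callables keeps the ordering logic separate from whatever performs each step, which also makes the sequence easy to test.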

The illustration below provides an example of the update process for the Windows Azure Pack Tenant and Administration virtual machines, which run on the Compute and Management clusters respectively.  Each node in the compute cluster serves tenant functions and sits behind the load balancer.  You will notice there are redundant instances of each service, which are updated independently to ensure service availability.  Similarly, the virtual machines for the administrative functions on the management cluster sit behind the load balancer as well.  Each of these is also redundant.  Lastly, there are pairs of both SQL Server and MySQL resource providers that are updated in the same way to ensure availability.  These intra-dependent services must be updated in the order indicated by the numbers in parentheses in the diagram below, starting with “(1) AFDS WinAuth Instance 1”.

You will also see both the Service Provider Foundation service and the Virtual Machine Manager service in the diagram below.  These inter-dependent services are updated first to ensure compatibility between, for example, the Update Rollups that are delivered across the Windows Azure Pack and System Center components.

Visual representation of the Windows Azure Pack update sequence.
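One way to think about this ordering is as a dependency graph that is walked in topological order, with redundant instances of each service updated one at a time. The Python sketch below (Python 3.9 or later, for `graphlib`) is only an illustration: the edges encode the statement that Service Provider Foundation and Virtual Machine Manager are updated before the Windows Azure Pack services, while the service labels, the instance counts and the `update_plan` function are assumptions and not the engine's actual manifest.

```python
from graphlib import TopologicalSorter

# service -> services that must be updated before it (illustrative edges only)
must_follow = {
    "Service Provider Foundation": set(),
    "Virtual Machine Manager": set(),
    "Windows Azure Pack Admin": {"Service Provider Foundation", "Virtual Machine Manager"},
    "Windows Azure Pack Tenant": {"Service Provider Foundation", "Virtual Machine Manager"},
}

# Number of redundant instances per service (assumed values for the sketch).
instances = {service: 2 for service in must_follow}


def update_plan(must_follow, instances):
    """Return one update step per instance, respecting the dependency order.

    Instances of the same service are listed one after another, so that the
    other instance can keep serving while one is being updated.
    """
    plan = []
    for service in TopologicalSorter(must_follow).static_order():
        for number in range(1, instances.get(service, 1) + 1):
            plan.append(f"{service} instance {number}")
    return plan


for step in update_plan(must_follow, instances):
    print(step)
```

`TopologicalSorter` also raises `CycleError` if the dependencies contradict each other, which is a useful sanity check on any hand-written ordering.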


How will I know when it’s done?

Expect the Console VM to restart when the update process is complete.  After the Console VM restarts, you can check the status of the Console service in the Virtual Machine Manager Management console. When it has finished updating successfully, the template release will be an updated version and the status of the service will be “OK”.

Does the Engine provide logging for troubleshooting?

Yes, the P&U Engine provides extensive logging should the update process halt due to an environmental or functional issue with one or more components in the stamp.  The engine is designed to stop or prevent servicing when a stamp becomes “unhealthy”.
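The stop-when-unhealthy behavior can be pictured as a health gate that is checked before each component is serviced. The following Python sketch is an assumption-laden illustration: the `run_updates` function, the `is_healthy` callback, the logger name and the sample components are hypothetical, and the post does not say how the real engine evaluates stamp health.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("pnu-sketch")  # hypothetical logger name


def run_updates(components, is_healthy, update):
    """Update components in order, halting as soon as the stamp reports unhealthy.

    Returns a tuple (updated, skipped) so the operator can see where the run stopped.
    """
    updated = []
    for index, component in enumerate(components):
        if not is_healthy():
            log.error("Stamp unhealthy; halting before %s", component)
            return updated, list(components[index:])
        log.info("Updating %s", component)
        update(component)
        updated.append(component)
    return updated, []


# Simulated run: updating "B" leaves the stamp unhealthy, so "C" is never touched.
state = {"healthy": True}


def fake_update(component):
    if component == "B":
        state["healthy"] = False


updated, skipped = run_updates(["A", "B", "C"], lambda: state["healthy"], fake_update)
print(updated, skipped)
```

Returning the skipped components alongside the log output gives a clear starting point for troubleshooting and for resuming once the stamp is healthy again.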

Are all components updated automatically?

The P&U Engine will automatically update and inventory all Microsoft software on the stamp, including the Windows Server physical nodes, virtual machines, System Center workloads, Windows Azure Pack workloads and SQL Server instances.  The update packages may include firmware and driver updates, but the engine does not yet update the hardware components automatically.  This capability is planned for a future release of the extensible engine.

In the interim, Microsoft has provided documentation in the Cloud Platform System Administrator’s Guide, available to existing Cloud Platform System customers, which details the hardware update process using familiar technologies such as Cluster-Aware Updating (CAU) integrated with Dell DUP and SUU packages.  Leveraging these technologies ensures that even when physical nodes are updated manually, tenant virtual machines are live migrated from one node to another, seamlessly for the user.

For more information about the Dell DUP (Dell Update Package) and SUU (Server Update Utility) packages, refer to Dell’s documentation here.

Is the update process disruptive to my running workloads?

Customers can expect the P&U Engine to be non-disruptive to running workloads and tenant virtual machines.  Management functions that fail over from an active to a passive node, such as the Virtual Machine Manager Management Console or the Windows Azure Pack management portal, see only minimal disruption during the brief failover.  For more information, see the “Expected downtime per component during update process:” section in the Appendix.

What can I expect from Microsoft?

Microsoft will release P&U Update Packages on a predictable quarterly cadence to allow for planning and ensure customers have all applicable security and functional updates.  Only updates that are specifically designed and required for CPS will be included in the Update Packages, allowing the solution to be uniform across stamps, easily supportable and more reliable.

Will the Patch and Update Engine run on my non-CPS systems?

No, the current implementation is designed to run exclusively on the Microsoft Cloud Platform System.

In summary…

Microsoft developed a “Patch and Update” (P&U) Engine that allows Cloud Platform System customers to easily and reliably update the software on their stamps without disrupting tenant workloads.  The updates delivered to customers are tested and validated internally before being distributed on a predictable cadence.  The update packages are built to allow customers to operate in a disconnected environment, and the engine provides comprehensive logging and complete inventory output.  The P&U Engine understands the inter- and intra-dependencies across components, enters and exits maintenance mode when servicing objects and, lastly, validates components after servicing to ensure a smooth transition back into the management, storage or compute stack.  Hardware and firmware update workflows are to be added to the P&U Engine in future releases.

Expected downtime per component during update process:

| Component (instances) | Dependencies | Tenant Downtime (minutes) | Fabric Downtime (minutes) |
|---|---|---|---|
| P&U | Pre-Update Inventory | N/A | N/A |
| System Center 2012 R2 – Service Provider Foundation (x2) | VMM, WSUS | N/A | 0, redundant instances |
| System Center 2012 R2 – Service Reporting (x1) | VMM, WSUS | N/A | 5, single instance |
| System Center 2012 R2 – Operations Manager (x3) | VMM, WSUS, SMA | N/A | 0, redundant instances |
| Windows Server 2012 R2 – Console Machines (x4) | Post-Update Inventory | N/A | 0, redundant instances |
| System Center 2012 R2 – Virtual Machine Manager (x2) | N/A | N/A | 0.5, active to passive node failover |
| System Center 2012 R2 – Service Management Automation (x3) | VMM, WSUS | N/A | 0, redundant instances |
| SQL Server (DPM) (x1) | VMM, DPM, WSUS | N/A | 0, redundant instances |
| Management Cluster (x6) | VMM, DPM, WSUS | N/A | 0, redundant instances |
| Storage Cluster (x4) | VMM, DPM, WSUS | 0, redundant instances | 0, redundant instances |
| Compute Cluster (x24) | VMM, DPM, WSUS | 0, redundant instances | 0, redundant instances |
| Edge Cluster (x2) | VMM, WSUS | N/A | 0, redundant instances |
| Gateway Cluster (x18) | VMM, Edge Cluster, WSUS | N/A | 0, redundant instances |
| Directory Services (x3) | VMM, WSUS | N/A | 0, redundant instances |
| System Center 2012 R2 – Data Protection Manager (x8) | VMM, SCOM, SMA, WSUS | N/A | 0, redundant instances |
| Windows Azure Pack – Admin/Management (x6) | VMM, WSUS | N/A | 0, redundant instances |
| Windows Azure Pack – Public/Tenant (x6) | VMM, WSUS | N/A | 0, redundant instances |
| SQL Server (Management) (x4) | VMM, WSUS | N/A | 0, redundant instances |
| MySQL (x2) | VMM, WSUS | N/A | 0, redundant instances |
| Windows Server Update Services (x1) | VMM | N/A | 5, single instance |
| Windows Deployment Services (x1) | VMM, WSUS | N/A | 5, single instance |
| Offline Image Update – Field Replaceable Unit (FRU) | VMM | N/A | N/A |
| Pre-Update Inventory | N/A | N/A | N/A |
| Post-Update Inventory | P&U | N/A | N/A |
| Field Replaceable Unit (FRU) | SMA | N/A | N/A |

| Hardware Object | Type, Make, Model | Tenant Downtime (minutes) | Fabric Downtime (minutes) |
|---|---|---|---|
| Compute (Dell C6220) for Management, Compute and Edge Clusters | Motherboard BIOS | 0, redundant instances | 0, redundant instances |
| Compute (Dell C6220) for Management, Compute and Edge Clusters | Motherboard BMC | 0, redundant instances | 0, redundant instances |
| Compute (Dell C6220) for Management, Compute and Edge Clusters | Fan Control Board | 0, redundant instances | 0, redundant instances |
| Storage (Dell R620) for Storage Clusters | Motherboard BIOS | 0, redundant instances | 0, redundant instances |
| Storage (Dell R620) for Storage Clusters | Motherboard BMC | 0, redundant instances | 0, redundant instances |
| Storage (Dell R620) for Storage Clusters | Lights-Out Management (LOM) 10GbE | 0, redundant instances | 0, redundant instances |
| Storage (Dell R620) for Storage Clusters | LSI 9207-8E SAS | 0, redundant instances | 0, redundant instances |
| Storage (Dell R620) for Storage Clusters | Life Cycle Controller | 0, redundant instances | 0, redundant instances |
| Switches | S4810P (Aggregate, Datacenter, Tenant) | 0, redundant instances | 0, redundant instances |
| Switches | S55 (Management) | 0 | <5, S4810P and/or BMC unavailable during S55 reload |
| Disks | MD3060e (x4), SSD (x16), HDD (x48) | 0, redundant | 0, redundant |
| Load Balancers | F5 Viprion 2400 | N/A | 0, redundant instances |