As a savvy Service Management Automation (SMA) user, you probably know that much of what makes SMA great is that it is essentially “PowerShell as a Service.” With SMA, you get all that PowerShell goodness plus the benefits of a service architecture: high availability, reliability, scalability, and manageability via a REST API, among other things.
While it’s great that SMA gives you these “as a service” benefits, notice that all of these perks end with “ability.” SMA gives you the ability to scale it (scalability), the ability to rely on it (reliability), the ability to keep it available (availability), and the ability to manage it programmatically (manageability). In other words, SMA makes all of these things possible, but you need to know how to properly architect, manage, and use the service to take advantage of them! In this blog post, we’ll dive into two of these abilities, scalability and high availability, and how you can architect and manage SMA to get the most out of them.
Any good discussion on scalability and high availability starts with an architecture diagram, so let’s begin there as well.
As you can see, SMA is composed of four types of components:
- Web Service
  - Takes in requests over HTTP and talks to the database to create, read, update, and delete data
  - Returns data to the client that made the request to the web service
- Runbook Worker
  - Polls the database for runbook job start, stop, suspend, and resume requests
  - Returns job execution data to the database, including checkpoints and output stream records
- SQL Database
  - Stores all information related to SMA, including:
    - Job checkpoint data
    - Job output stream records
- SMA Client
  - Some program that makes requests to the SMA web service and returns the data to be consumed programmatically or by a user
  - Examples include:
    - Windows Azure Pack Service Admin portal
    - SMA PowerShell Module
    - A custom C# application written to interface with SMA over OData
When it comes to scalability and high availability, the Runbook Worker, Web Service, and SQL Database are the components you need to worry about. While SMA clients are important, they aren’t part of the Service Management Automation “service” like the Runbook Worker, Web Service, and the SQL Database are. However, SMA clients talk to the SMA service, and are actually the reason the service needs to be scalable and highly available in the first place. If I am using an SMA client for a mission critical task, I need to be able to use the SMA service regardless of failures on the hosts containing the components of the SMA service (high availability). Similarly, if I suddenly need to interact with the SMA service much more frequently, or others want to also interact with this SMA service at the same time as me, I need to be able to raise and lower the capacity of my SMA service to meet this demand (scalability).
Now that we know the different components of the SMA service, and why you’d want scalability and high availability for your SMA service, let’s go into the details of how to attain these things for SMA.
High Availability and Scalability for the SMA Web Service
It’s important that your SMA service can process web requests in a fault tolerant, scalable way. Since the SMA web service is stateless, high availability is easy to attain – just deploy multiple SMA web services, each on its own host, and connect a load balancer. All SMA client requests should be directed at the load balancer, and the load balancer should direct traffic to the various SMA web services. As more or less scale is needed, SMA web services can be provisioned / deprovisioned and added / removed from the load balancer to meet the required capacity. The recommendation is 3 VMs, each holding an SMA web service. See http://technet.microsoft.com/en-us/library/dn458366.aspx for more details on scaling / HA for the SMA web service.
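To make the idea concrete, here is a minimal Python sketch of what the load balancer is doing for you. It is a toy model, not how any particular load balancer product works: the round-robin policy, the `RoundRobinBalancer` class, and the host names are all assumptions made up for illustration. The point is simply that stateless web services can be added to or removed from rotation without clients noticing.

```python
from itertools import count


class RoundRobinBalancer:
    """Toy model of a load balancer in front of stateless SMA web services."""

    def __init__(self, hosts):
        self.hosts = list(hosts)   # web service hosts currently in rotation
        self._counter = count()    # running request number

    def add(self, host):
        """Scale out: put a newly provisioned web service into rotation."""
        self.hosts.append(host)

    def remove(self, host):
        """Scale in, or drop a failed host from rotation."""
        self.hosts.remove(host)

    def route(self):
        """Return the host that should serve the next request."""
        if not self.hosts:
            raise RuntimeError("no SMA web service available")
        return self.hosts[next(self._counter) % len(self.hosts)]


# Hypothetical host names, used only for illustration.
balancer = RoundRobinBalancer(["sma-web-01", "sma-web-02", "sma-web-03"])
first_three = [balancer.route() for _ in range(3)]

# Simulate losing a host: the remaining two keep serving requests.
balancer.remove("sma-web-02")
after_failure = {balancer.route() for _ in range(4)}

print(first_three)
print(sorted(after_failure))
```

Because no web service holds state that a client depends on, any host in rotation can answer any request, which is why losing one of the three hosts costs capacity but not availability.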
High Availability and Scalability for the SMA Runbook Worker
Similarly, it’s important that your SMA service is able to process runbook jobs regardless of individual node failures, and be capable of scaling up or down to meet demand. Unlike the SMA web service, the SMA runbook worker is not fully stateless – each runbook worker is configured to monitor a specific partition of the job submission queue (part of the SMA database), so that all jobs are fulfilled but none are fulfilled by multiple workers at the same time. The mechanism for distributing jobs between runbook workers is called a “runbook worker deployment.” A runbook worker deployment contains all of the runbook workers that are designated to pick up jobs. Each runbook worker is responsible for
1 / (number of runbook workers in the deployment)
of all jobs, where none of these jobs overlap with the jobs any other runbook worker in the deployment is responsible for.
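The 1/N split is easy to picture with a small Python sketch. Note the assumptions: this post does not describe how SMA actually assigns queue partitions, so the modulo scheme below is only a stand-in, and the `owner` function, worker host names, and job numbers are hypothetical.

```python
from collections import Counter


def owner(job_number, workers):
    """Map a job to exactly one worker (modulo scheme, assumed for illustration)."""
    return workers[job_number % len(workers)]


# Hypothetical worker host names and job numbers.
workers = ["sma-worker-01", "sma-worker-02", "sma-worker-03"]
job_numbers = range(3000)

# A dict has one value per key, so each job has exactly one owner: no overlap.
assignment = {job: owner(job, workers) for job in job_numbers}
share = Counter(assignment.values())

for worker in workers:
    fraction = share[worker] / len(assignment)
    print(f"{worker}: {share[worker]} jobs ({fraction:.3f} of the total)")
```

With three workers in this toy deployment, each one ends up with 1000 of the 3000 jobs, and every job belongs to exactly one worker.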
To add or remove runbook workers from the runbook worker deployment, you need to create a new runbook worker deployment using the New-SmaRunbookWorkerDeployment cmdlet. Never adjust the runbook worker deployment while any runbook worker in the deployment is still running. As a result, scaling the runbook worker deployment requires temporarily taking SMA’s job processing service offline while you modify the deployment. Maintaining high availability, however, does not necessarily require this. If a host running a runbook worker dies, and that host is in the runbook worker deployment, it will continue to be allocated jobs that it obviously cannot process. At this point, you have two options:
1. Remove this runbook worker host from the runbook worker deployment, optionally adding a replacement runbook worker host, to maintain the same scale. This is not recommended, as it means you need to temporarily take the job processing service of SMA offline, so you can modify the runbook worker deployment.
2. If the runbook worker host is a VM, deploy a replacement VM with the same hostname. Since runbook worker deployments track runbook workers by hostname, the runbook worker deployment won’t need to be updated, so the service won’t need to be taken offline. Before doing this, just make sure the VM you are replacing is truly gone for good, because if it comes back online while its replacement is running, bad things will happen (for example, the same job being run twice, once by each runbook worker).
From an availability perspective, option 2 is a much better way of doing things, so this is the recommended option. As a best practice, make sure to use VMs to hold your runbook workers, so you can easily deploy a replacement VM with the same hostname in the event of a VM failure.
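Here is a rough Python sketch of why option 2 works. It reuses the same kind of toy model as before: the modulo assignment, the `stranded_jobs` helper, and the host names are assumptions for illustration only, not SMA internals. The one thing taken from the behavior described above is that the deployment identifies workers by hostname.

```python
def stranded_jobs(deployment, online_hosts, job_numbers):
    """Return the jobs whose owning worker (by host name) is not online."""
    stranded = []
    for job in job_numbers:
        owner = deployment[job % len(deployment)]  # assumed assignment scheme
        if owner not in online_hosts:
            stranded.append(job)
    return stranded


# Hypothetical deployment; it is never modified in this scenario.
deployment = ["sma-worker-01", "sma-worker-02", "sma-worker-03"]
job_numbers = range(9)

# sma-worker-02 dies: its share of the jobs has nobody to run it.
online = {"sma-worker-01", "sma-worker-03"}
before = stranded_jobs(deployment, online, job_numbers)

# Option 2: a replacement VM comes up with the same host name.
online.add("sma-worker-02")
after = stranded_jobs(deployment, online, job_numbers)

print("stranded while the host is down:", before)
print("stranded after the replacement is up:", after)
```

The deployment list never changes in this sketch. The replacement simply answers to the name the deployment already expects, so the dead worker's share of the jobs gets picked up again without taking anything offline.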
Again, the recommendation is 3 VMs, each holding an SMA runbook worker, all of which are in the runbook worker deployment. These can be the same VMs that hold your SMA web services. See http://technet.microsoft.com/en-us/library/dn530618(v=sc.20).aspx for more details on scaling / HA for SMA runbook workers.
High Availability and Scalability for the SMA Database
The SMA database holds all of SMA’s data – including runbooks, jobs, and assets. Since the SMA runbook workers and web services require this data to function, every SMA web service and runbook worker communicates with the SMA database. As a result, it is very important that the database stays available and can scale to meet the demand placed on it, or none of the other SMA components will be able to function. SQL Server clustering or SQL Server Always On can be used to keep the database highly available.
You should now have the knowledge to keep your SMA service up and running regardless of spikes in demand or host failures. SMA gives you the ability to scale it and keep it available, and now you know how to take advantage of these capabilities to truly run an enterprise-grade orchestration service.
Until next time – Keep calm and automate on.