Modern Datacenter Architecture Patterns – Offsite Batch Processing Tier

The Offsite Batch Processing Tier design pattern details the Azure features and services required to deliver backend data processing that is both fault tolerant and scalable. These services are realized as worker roles in cloud services on Azure, which currently can be deployed to any Azure data center.


Table of Contents

1 Overview

   1.1 Pattern Requirements and Service Description

2 Architecture Pattern

   2.1 Pattern Dependencies

   2.2 Azure Services

   2.3 Pattern Considerations

3 Interfaces and End Points

4 Availability and Resiliency

5 Scale and Performance

6 Cost

   6.1 Cost Factors

   6.2 Cost Drivers

7 Operations

8 Architecture Anti-Patterns


Prepared by:
Cale Teeter – Microsoft
Tom Shinder – Microsoft
Joel Yoker – Microsoft


Cloud Platform Integration Framework Overview and Patterns:

Cloud Platform Integration Framework – Overview and Architecture

Modern Datacenter Architecture Patterns-Hybrid Networking

Modern Datacenter Architectural Patterns-Azure Search Tier

Modern Datacenter Architecture Patterns-Multi-Site Data Tier

Modern Datacenter Architecture Patterns – Offsite Batch Processing Tier

Modern Datacenter Architecture Patterns-Global Load Balanced Web Tier


1 Overview

The Offsite Batch Processing Tier design pattern details the Azure features and services required to deliver backend data processing that is both fault tolerant and scalable. These services are realized as worker roles in cloud services on Azure, which currently can be deployed to any Azure data center. The full list of services provided at each Azure data center can be found on the Microsoft Azure documentation site.

Batch processing workloads are unique in that they typically provide little or no user interface. An example of this type of workload on premises would be a Windows Service running on Windows Server. When considering this type of workload in a cloud environment, it would be wasteful to deploy an entire server to run a workload, when what is really required is compute, storage and network connectivity. The worker role is the implementation of this on Azure.

By definition, a batch processing job that is run in Azure is a workload that connects to a resource, applies some business logic (compute) and produces some output. The input and output resources are defined by the user and can range from flat files and blobs in Azure blob storage to NoSQL or relational databases.

The business logic is implemented in an Azure worker role, typically by defining the required business logic in a .NET library. While deployment of a worker role to Azure is a simple operation, deploying a worker role that is fault tolerant and scalable requires a design that takes into consideration how the service is executed and maintained within Azure. This pattern details such a design and describes how these requirements can be implemented.
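To make this concrete, the following is a minimal sketch of a worker role entry point using the classic Azure service runtime; the class name and the ProcessNextBatch placeholder are illustrative, and the business logic would normally live in a separate .NET library.

```csharp
using System;
using System.Diagnostics;
using System.Net;
using System.Threading;
using Microsoft.WindowsAzure.ServiceRuntime;

// Hypothetical batch processing worker role; the batch business logic is
// represented by the ProcessNextBatch placeholder.
public class BatchWorkerRole : RoleEntryPoint
{
    public override bool OnStart()
    {
        // Raise the default connection limit for outbound calls to storage or other services.
        ServicePointManager.DefaultConnectionLimit = 12;
        return base.OnStart();
    }

    public override void Run()
    {
        Trace.TraceInformation("BatchWorkerRole entry point called");

        while (true)
        {
            try
            {
                ProcessNextBatch();   // placeholder for the batch business logic
            }
            catch (Exception ex)
            {
                // Log and continue; an unhandled exception would recycle the role instance.
                Trace.TraceError("Batch processing failed: {0}", ex);
            }

            // Pause between polling cycles so the instance does not spin at full CPU.
            Thread.Sleep(TimeSpan.FromSeconds(30));
        }
    }

    private void ProcessNextBatch()
    {
        // Read from the source data store, apply business logic, write to the output store.
    }
}
```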

1.1 Pattern Requirements and Service Description

The Offsite Batch Processing Tier pattern is a design that considers the requirements for security, reliability, and scalability. The target service level is 99.95% scheduled uptime; this target accounts for both planned and unplanned outages across the Azure infrastructure. As this is a foundational architectural pattern, it is expected that the processing tier outlined in this document could fit into a larger solution architecture or design pattern.

A foundational set of design principles have been employed to help make sure that this type of application workload is optimized to deliver a robust set of capabilities and functionalities with the highest possible uptime. The following design principles are the basis for this foundational pattern.

  • High Availability: The Offsite Batch Processing Tier pattern is designed to deliver 99.95% scheduled uptime using a standard Microsoft Azure Service. By utilizing cloud services with specific attention to upgrade and fault domains, Azure can help make sure this Service Level Target (SLT) is met. The ability to meet this SLT also depends on the customer’s internal operational processes, tools and the reliability of the worker role code itself.
  • Scalability: Best practices from Microsoft Azure architectures are incorporated into the design to handle scaling the worker role instances either manually or through automation.
  • Management and Monitoring: Microsoft System Center and Azure management capabilities will provide the platform capabilities to support both management automation and monitoring of the health of services in Azure.
  • Administration: Using Microsoft Windows PowerShell (with Azure cmdlets), the Microsoft Azure Portal and Visual Studio, administrators and developers can perform the necessary tasks on services hosted in Azure.

2 Architecture Pattern

This document describes a pattern for offsite batch processing utilizing worker role instances contained within a cloud service in Azure. The critical components of this design are shown below. The diagram illustrates the minimum required instances to achieve fault tolerance. Additional instances can be deployed to increase performance of the service. Additionally, auto-scaling can be enabled to assist in scaling the instances by time of day or by server metrics.

[Diagram: Offsite Batch Processing Tier – worker role instances within a cloud service]

2.1 Pattern Dependencies

As this is a foundational pattern, there are no dependencies on outside components to utilize the design described in this document. This pattern can fit into a larger solution architecture with other components, but it should not take inherent dependencies on those components. While this service does appear to have a dependency on a source and an output data store, the service should be constructed to handle these data sources being unavailable and to log that condition for exposure to the monitoring components.

2.2 Azure Services

The Offsite Batch Processing Tier architectural pattern is comprised of the following Azure services:

  • Cloud Services
  • Upgrade Domains
  • Fault Domains
  • Worker Roles
  • Auto-scaling

This pattern consists of a cloud service (containing instances of worker roles) which will in turn contain the business logic for the batch processing job. From a compute perspective, a minimum of two worker roles in the same cloud service must be deployed to support local availability. While this strategy guards against failures inside the local datacenter, it does nothing to provide geo-redundancy for the service, which is required to protect service availability in the event an entire Azure datacenter is unavailable.

A key part of the cloud service is the construct of domains, which allow control over how updates to the cloud service are orchestrated and define how cloud services are separated inside the local Azure datacenter. This helps make sure that a hardware failure in one area of the datacenter does not affect the availability of the application or batch processing node. Auto-scaling is also a key component of the Offsite Batch Processing Tier pattern, as it can be used to provision and de-provision instances of the worker role based on criteria ranging from time of day to server metrics.

2.3 Pattern Considerations

There are some specific considerations for running this type of workload on Azure. First, in most cases there is a decoupling of the source data and output data source from the actual service that is providing the processing resources (in this case the worker role). This decoupling can introduce some complications when implementing geo-redundant high availability and disaster recovery for the service.

As discussed above, increasing the instance count of the worker role will provide fault tolerance within the local Azure datacenter. However, replicating the service across geographies to another datacenter in the same manner is not natively supported. Unlike other roles and services within Azure, this type of service does not have an interface that administrators would natively interact with. As part of this pattern, simply deploying the cloud service to an additional location (Azure datacenter) is a typical strategy for supporting disaster recovery of this tier. To support this level of redundancy, the source and destination data sources would also need to be highly available and allow communications to and from the worker role from both locations.

Another area of consideration is the number of cores that are supported per subscription and within a given cloud service. Currently, the total number of cores by default is a maximum of 20. This can become a constraint as the service is scaled to increase instance count or as larger compute units are provisioned (A0 – A9), since these multiply core use; for example, ten A4 instances at eight cores each would consume 80 cores, well above the default limit.

It is recommended to consider these factors when planning or utilizing this pattern. The maximum number of cores can be increased with a request to the Azure support team; the absolute maximum at this time is 10,000 cores. There are also specific limits on the number of worker role instances deployed to a single cloud service: the limit is 25 per deployment, and each cloud service can have a maximum of two deployments, one for production and one for staging. Currently this is not a limit which can be increased.

For these reasons, it is recommended that compute scale considerations for the entire tier be evaluated when implementing this pattern, and that a request to raise the number of cores available to the configured cloud service and subscription be made prior to any production implementation.

3 Interfaces and End Points

While the worker role can expose endpoints for HTTP or TCP network communications, the offsite batch processing use case typically involves polling a source data store, with the output directed to a variety of output data sources. Due to how offsite batch processing solutions are architected, no endpoint or load balancing features are required for this tier. The incoming load is balanced by adding or removing instances of the worker role, which all pull from a single data source. This also means that a proper locking construct must be used to support this model; an ideal configuration avoids having multiple worker role instances processing the same data.
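One way to implement such a locking construct is a lease on an Azure storage blob, so that only one worker role instance operates on a given unit of data at a time. The sketch below is illustrative only and assumes the classic Azure storage client library; the container name, blob name and lease duration are placeholders.

```csharp
using System;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

public static class BatchLock
{
    // Attempts to take a 60-second lease on a marker blob; returns the lease id on
    // success, or null if another worker role instance already holds the lease.
    public static string TryAcquire(string connectionString, string containerName, string blobName)
    {
        CloudStorageAccount account = CloudStorageAccount.Parse(connectionString);
        CloudBlobContainer container = account.CreateCloudBlobClient().GetContainerReference(containerName);
        container.CreateIfNotExists();

        CloudBlockBlob lockBlob = container.GetBlockBlobReference(blobName);
        if (!lockBlob.Exists())
        {
            lockBlob.UploadText(string.Empty);   // create the marker blob once
        }

        try
        {
            return lockBlob.AcquireLease(TimeSpan.FromSeconds(60), proposedLeaseId: null);
        }
        catch (StorageException)
        {
            return null;   // the lease is held by another instance
        }
    }

    // Releases a previously acquired lease so another instance can take over.
    public static void Release(string connectionString, string containerName, string blobName, string leaseId)
    {
        CloudStorageAccount account = CloudStorageAccount.Parse(connectionString);
        CloudBlockBlob lockBlob = account.CreateCloudBlobClient()
            .GetContainerReference(containerName)
            .GetBlockBlobReference(blobName);

        lockBlob.ReleaseLease(AccessCondition.GenerateLeaseCondition(leaseId));
    }
}
```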

4 Availability and Resiliency

Microsoft provides clearly defined Service Level Agreements (SLAs) for each service provided within Azure. Each architectural pattern is comprised of one or more Azure services and details about each individual Azure service can be found on the Microsoft Azure Service Level Agreement website.

For the Offsite Batch Processing Tier architectural pattern, the Azure services required carry the following SLAs:

Azure Service                                                               Service Level Agreement
Cloud Service                                                               99.95%
Virtual Machines (deployed as worker roles in the same availability set)    99.95%

The composite Service Level Agreement (SLA) of the Offsite Batch Processing Tier architectural pattern is 99.95%. In order to provide high availability and resiliency to hardware failure, the configuration of the cloud service must include clear definitions for upgrade domains. Fault domains will be handled by the Azure fabric which will alternate fault domains for each role instance provisioned.

Upgrade domains are logical units, which ultimately determine how a service is upgraded. Upgrades include updates to the code in the cloud service or updates to the underlying operating system that is running the virtual machine supporting the service instance.

Fault domains are physical units of failure and are closely related to the physical infrastructure in the Azure datacenter; essentially, a fault domain corresponds to a server rack inside Azure. If there is a physical failure of a component in a server rack, having multiple fault domains insulates the application from a service outage.

To illustrate this point, listed below are examples of how to partition the domains inside Azure.

 

                     FAULT DOMAIN 1    FAULT DOMAIN 2
UPGRADE DOMAIN 1     Instance 1        -
UPGRADE DOMAIN 2     -                 Instance 2

In this configuration, there are two instances of the role provisioned. Each instance is on a separate fault and upgrade domain (complete isolation). In this case, if you upgrade the application (deploy an updated version), and choose upgrade as your deployment type, Upgrade Domain 1 will be updated first, followed by Upgrade Domain 2. If a failure occurs in Fault Domain 1, the fabric controller will notice this and move Instance 1 to another fault domain (rack) and re-provision. Since there are multiple fault domains, the application will still function.
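Where it is useful to confirm this placement at run time, the Azure service runtime exposes the domains assigned to an instance. A brief, illustrative sketch:

```csharp
using System.Diagnostics;
using Microsoft.WindowsAzure.ServiceRuntime;

public static class DomainInfo
{
    // Writes the update (upgrade) domain and fault domain assigned to this
    // worker role instance to the trace log, for example during start-up diagnostics.
    public static void TraceDomains()
    {
        RoleInstance instance = RoleEnvironment.CurrentRoleInstance;
        Trace.TraceInformation(
            "Instance {0} is in update domain {1} and fault domain {2}",
            instance.Id, instance.UpdateDomain, instance.FaultDomain);
    }
}
```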

Two additional, more complex examples are provided below.

Scenario 1:

 

                     FAULT DOMAIN 1    FAULT DOMAIN 2
UPGRADE DOMAIN 1     Instance 1        -
UPGRADE DOMAIN 2     -                 Instance 2
UPGRADE DOMAIN 3     Instance 3        -

Scenario 2:

 

                     FAULT DOMAIN 1    FAULT DOMAIN 2    FAULT DOMAIN 3
UPGRADE DOMAIN 1     Instance 1        -                 -
UPGRADE DOMAIN 2     -                 Instance 2        -
UPGRADE DOMAIN 3     -                 -                 Instance 3

In either of these configurations, depending on the state of the cluster where the deployment resides, the service could be deployed in a few different ways. In Scenario 1, two instances are in one fault domain, yet in two different upgrade domains. This means that if that fault domain failed, only one third of the application's capacity would remain available while replacement instances were being created in other fault domains. In Scenario 2, each instance is isolated to its own upgrade and fault domain, making this the more ideal scenario of the two.

5 Scale and Performance

As stated, in order for Microsoft to maintain at least 99.95% uptime, a minimum of two instances of the batch worker role must be deployed to the cloud service. The Azure fabric will automatically place the two (or more) instances into separate fault domains. Additionally, by default, the instances will be in separate upgrade domains (up to five by default). If more than five upgrade domains are desired, the maximum limit of 20 can be configured through the service definition. This configuration will provide high availability to the cloud service and the roles hosted within it.

Performance targets are another case to consider with instances deployed within the cloud service. One of the key tenets when designing a cloud service for batch processing is to determine the performance requirements your processing will require. Capabilities such as Azure auto-scale can assist with loads that vary over time. This specific pattern is more tolerant of these changes, as the source data essentially acts as a queue for the processing on the worker role.
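As an example of this queue-like behavior, the sketch below polls an Azure storage queue and backs off when it is empty. The queue name (batch-input), the connection string handling and the visibility timeout are assumptions for illustration, and the classic storage client library is assumed.

```csharp
using System;
using System.Threading;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Queue;

public class QueuePoller
{
    private readonly CloudQueue _queue;

    public QueuePoller(string connectionString)
    {
        CloudStorageAccount account = CloudStorageAccount.Parse(connectionString);
        _queue = account.CreateCloudQueueClient().GetQueueReference("batch-input");
        _queue.CreateIfNotExists();
    }

    // Polls the queue, processes one message at a time and backs off when the queue is empty.
    public void Poll()
    {
        CloudQueueMessage message = _queue.GetMessage(visibilityTimeout: TimeSpan.FromMinutes(5));
        if (message == null)
        {
            Thread.Sleep(TimeSpan.FromSeconds(15));   // queue is empty; back off
            return;
        }

        ProcessMessage(message.AsString);             // placeholder for the batch business logic
        _queue.DeleteMessage(message);                // remove only after successful processing
    }

    private void ProcessMessage(string payload)
    {
        // Apply business logic and write the results to the output data source.
    }
}
```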

6 Cost

An important consideration when deploying any solution within Microsoft Azure is the cost of ownership. Costs related to on-premises cloud environments typically consist of up-front investments in compute, storage and network resources, while costs related to public cloud environments such as Azure are based on the granular consumption of the services and resources found within them.

Costs can be broken down into two main categories:

  • Cost factors
  • Cost drivers

Cost factors consist of the specific Microsoft Azure services which have a unit consumption cost and are required to compose a given architectural pattern.

Cost drivers are a series of configuration decisions for these services within a given architectural pattern that can increase or decrease costs.

Microsoft Azure costs are divided by the specific service or capability hosted within Azure and are continually updated to keep pace with market demand. Costs for each service are published publicly on the Microsoft Azure pricing calculator. It is recommended that costs be reviewed regularly during the design, implementation and operation of this and other architectural patterns.

6.1 Cost Factors

Cost factors within this pattern include Azure compute resources. When using Azure cloud services, the primary factor that impacts cost is the size of the cloud service instances. Microsoft Azure provides a predefined set of available cloud service sizes which offer an array of CPU and memory configurations within the service. Additional cost factors include optional services such as Application Insights for monitoring the worker application hosted within the pattern. Finally, while ingress network traffic is included in the Azure service, egress network traffic from the Azure datacenter carries a cost.

6.2 Cost Drivers

As stated earlier, cost drivers consist of the configurable options of the Azure services required when implementing an architectural pattern which can impact the overall cost of the solution. These configuration choices can have either a positive or a negative impact on the cost of ownership of a given solution within Azure; however, they may also potentially impact the overall performance and availability of the solution depending on the selections made by the organization. Cost drivers can be categorized by their level of impact (high, medium and low).

Cost drivers for the Offsite Batch Processing Tier architectural pattern are summarized in the table below.

Level of Impact: High
Cost Driver: Size (and type) of Azure cloud service instances
Description: A consideration for Azure compute costs is the cloud service instance size (and type). Instances range from low CPU and memory configurations to CPU- and memory-intensive sizes. Higher memory and CPU core allocations carry higher per-hour operating costs. Options include using fewer large instances versus a larger number of small instances to address performance requirements.

Level of Impact: Low
Cost Driver: Number of Traffic Manager DNS queries
Description: One pricing metric for Azure Traffic Manager is the number of DNS queries Traffic Manager must load balance. Costs are defined by the number of DNS queries handled per month, measured in billions, with one rate for the first billion queries and another for queries above this amount in a given month. This cost is the same across all Azure Traffic Manager load-balancing methods.

7 Operations

Cloud Platform Integration Framework (CPIF) extends the operational and management functions of Microsoft Azure, System Center and Windows Server to support managed cloud workloads. As outlined in CPIF, Microsoft Azure architectural patterns support deployment, business continuity and disaster recovery, monitoring and maintenance as part of the operations of the Offsite Batch Processing Tier architectural pattern.

Deployment of this pattern can be achieved through the standard Azure Management Portal, Visual Studio and Azure PowerShell. For deployment using PowerShell, several examples are provided at the Azure Script Center. More information on Azure automation capabilities, documentation and source code can be found in the Azure SDK Tools. There are a few different deployment techniques to deploy the worker role instances and cloud service.

Within the context of this pattern, deployment can be performed by the following methods:

  • Visual Studio – The tooling built into the Visual Studio integrated development environment (IDE) can be used to deploy updated instances of the cloud services and upgrade an existing deployment. There are options for deploying to each deployment slot, either staging or production. However, it should be noted that “swapping” of deployments cannot be performed through Visual Studio.
  • PowerShell – Another recommended deployment method is the use of PowerShell. The entire service offering in Azure is exposed through the Service Management API, and PowerShell cmdlets can be used to access the API. PowerShell cmdlets can be found in the article Azure PowerShell and documentation on their use in the article PowerShell Cmdlet Reference. The use of PowerShell is optimal for two different scenarios:
    – Operational run books
    – Disaster recovery
    First, for run books built for operations, PowerShell can be used to automate the installation and reduce operational mistakes when updates and new deployments are rolled out. Second, for the disaster recovery scenario discussed above, a deployment must be extended to additional Azure datacenters in different locations in a systematic way.
  • Azure Portal – The web-based Azure portal provides another deployment method for cloud services and role instances within Azure. The Azure portal is built on top of the same Service Management API mentioned previously. Because it is exposed as a simple web site, users can upload their cloud service package and choose which deployment slot to target.

Monitoring of this pattern and associated resources can be achieved using Microsoft System Center and Microsoft Azure. To help make sure the pattern detailed in this document is supportable and the SLA is maintained, a proper monitoring solution should be part of any deployment.

System Center Operations Manager and custom automation can be used to consume the operation logs for each service deployed on Azure. The worker role can also take advantage of the standard tracing facilities in process. These trace logs can be written to Azure blob storage and accessed through the Service Management API. Additionally, Azure Application Insights can be integrated into the worker role and used to surface both application level details (business logic type monitors) and server level details.

The use of Application Insights can provide telemetry into the underlying virtual machine that is hosting the role instances within the cloud service. Application Insights is integrated into Visual Studio, allowing for ease of deployment and setup within this pattern. The results are available through a web portal once Application Insights is added to the project (worker role).
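A minimal sketch of surfacing application-level (business logic) telemetry from the worker role through the Application Insights SDK is shown below; the event and metric names are illustrative, and the instrumentation key is assumed to be supplied through the project's Application Insights configuration.

```csharp
using System;
using Microsoft.ApplicationInsights;

public class BatchTelemetry
{
    private readonly TelemetryClient _telemetry = new TelemetryClient();

    // Records a business-level event and metrics for a completed batch.
    public void TrackBatchCompleted(int itemsProcessed, TimeSpan duration)
    {
        _telemetry.TrackEvent("BatchCompleted");
        _telemetry.TrackMetric("BatchItemsProcessed", itemsProcessed);
        _telemetry.TrackMetric("BatchDurationSeconds", duration.TotalSeconds);
    }

    // Records a failure so it is visible alongside server-level telemetry.
    public void TrackBatchFailure(Exception ex)
    {
        _telemetry.TrackException(ex);
    }
}
```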

Maintenance of the pattern falls into two categories:

  • Maintenance of the worker role
  • Maintenance of the platform

For the update and maintenance of the worker role application, an integrated development environment (IDE) product and associated source control repository can be leveraged. Some IDE products such as Visual Studio and Visual Studio Online can natively integrate with Microsoft Azure subscriptions. Details about this integration can be found at the following link.

Platform updates for the worker role will be handled by Azure. This simplifies the platform maintenance cycle; however, it comes with a risk to availability if services are not deployed with this in mind. As discussed previously, structuring the worker role configuration so that the application spans update and fault domains allows the hosted worker role application to be resilient to outages caused by Azure platform updates.
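Where the role needs to observe such platform or configuration changes rather than simply being recycled, it can subscribe to the service runtime events. A brief sketch, with illustrative logging only:

```csharp
using System.Diagnostics;
using Microsoft.WindowsAzure.ServiceRuntime;

public static class RoleChangeHandling
{
    // Call once from OnStart to observe configuration and topology changes
    // raised by the Azure fabric, for example during platform updates or scaling.
    public static void Register()
    {
        RoleEnvironment.Changing += (sender, e) =>
        {
            // Setting e.Cancel = true would recycle the instance instead of applying
            // the change in place; here the pending changes are only logged.
            foreach (RoleEnvironmentChange change in e.Changes)
            {
                Trace.TraceInformation("Role environment change pending: {0}", change.GetType().Name);
            }
        };
    }
}
```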

8 Architecture Anti-Patterns

This document has focused on the recommended architecture for batch processing with Azure worker roles; the inverse of these recommendations are referred to as anti-patterns. It is important to understand some common anti-patterns in order to avoid the issues that can result from them. Anti-patterns for Azure worker roles include:

  • Single instance deployment
  • Hosting web sites on worker roles
  • Running long-running, CPU-intensive processes on the main worker role thread
  • Running the worker role with elevated privileges
  • Adding the worker role to an Active Directory domain

As part of the deployment of the underlying worker role that will service the batch processing logic, steps will need to be taken to help make sure the application is highly available. As described in this document, in order for Microsoft to provide an SLA around the uptime of the application, at least two instances of the role will need to be deployed into the environment. There is nothing to prevent the deployment of a single instance, though warnings are given that the application does not meet the requirement to be highly available.

Another configuration that is possible, but not recommended, is hosting web sites on a worker role. While this is technically possible (Azure does not “prevent” or block this type of deployment), it is not recommended. For applications hosted in Internet Information Services (IIS) as a web server, it is advised to use web roles rather than worker roles for these types of workloads.

Worker roles can essentially be thought of as Windows Services in the cloud: a loop that runs some business logic continuously. Taking this into consideration, there is yet another anti-pattern with regard to how code runs in this loop. Workloads, or code, that run in this loop should be designed to avoid running for long periods of time while utilizing maximum CPU.

The issue that can result from doing this is that the instance status can be shown inaccurately within the Azure portal. This behavior is caused by the internal agent not getting enough of a time slice to perform its health check, the same concept as with Windows Services on Windows Server. It is important to help make sure that each task completes in a short enough period of time that the process is not held in a long processing state, and that the thread is paused when no work is pending.
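A hedged sketch of this guidance follows: work is processed in short, bounded chunks and the loop yields between cycles, backing off further when there is nothing to process. The chunk size and sleep intervals are illustrative.

```csharp
using System;
using System.Threading;

public class ResponsiveBatchLoop
{
    // Processes work in short, bounded chunks and pauses between cycles so the
    // instance's host agent is not starved of CPU time.
    public void Run(CancellationToken cancellation)
    {
        while (!cancellation.IsCancellationRequested)
        {
            bool didWork = ProcessOneChunk();   // placeholder: a short, bounded unit of business logic

            // Yield between chunks; back off further when there is nothing to process.
            Thread.Sleep(didWork ? TimeSpan.FromSeconds(1) : TimeSpan.FromSeconds(30));
        }
    }

    private bool ProcessOneChunk()
    {
        // Read a small batch from the source data store, process it, write the output,
        // and return true if any work was done.
        return false;
    }
}
```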

Running code on Azure worker roles that requires elevated rights can be accommodated, but is generally not recommended. A typical reason one might consider this is when a third-party component used in the role requires elevated rights. From a security perspective, unless there is a very clear reason, processes within worker roles should run without elevated privileges.

Finally, when migrating workloads to worker roles on Azure, it is common that the on-premises version of the application was utilizing Active Directory membership to facilitate secure communications to other servers (using Windows Integrated Authentication). This should not be expected to function correctly in a Platform as a Service (PaaS) application (such as worker roles) by default. “By default” means that the instances of the worker role(s) executing on Azure will be maintained automatically: instances will be rebooted and/or moved inside the Azure datacenter for availability purposes.

While automation could be used to add the worker role instance to a customer’s Active Directory domain infrastructure hosted in Azure, this configuration would be highly complex to maintain and would require the maintenance of privileged credentials (such as administrative user names and passwords) within the automation to have the necessary rights to perform this operation. For these reasons, it’s recommended to use alternative authentication methods rather than relying on Active Directory machine membership.

