Cloud Services Foundation Reference Architecture - Principles, Concepts, and Patterns

Published:  7/23/2013
Version: 1.0
Abstract: This article lists architectural principles, concepts, and patterns that were identified as commonly-applied best practices across providers of cloud services.


Table of Contents

1.0 Introduction
2.0 Principles
3.0 Concepts
4.0 Patterns
5.0 Summary



 

1.0 Introduction

This article is one of several articles that are included in the Cloud Services Foundation Reference Architecture (CSFRA) article set. An article set is a collection of interrelated articles that are intended to be read as a whole, like the chapters of a book. This article assumes that you have already read both the Overview and Reference Model articles in the set, so if you have not, please read those articles before reading this one. The terms provider and consumer are used throughout this article.  

If you are a member of an enterprise information technology (IT) department, then your department is probably the provider within your organization, and the consumers are typically employees of the same organization.  If you are a member of an organization that provides cloud services to consumers outside of your organization, then your customers are the consumers.  Though most of the information in this article is relevant to any type of service provider, some of the information is more relevant to enterprise IT service providers than it is to public service providers.  Additionally, though most of the information in this article is applicable to any type of cloud service, it's most directly-applicable when planning infrastructure services.

This article provides principles for a cloud infrastructure, concepts that support these principles, and patterns for applying the principles and concepts discussed in the article.

2.0  Principles

The principles outlined in this section provide general rules and guidelines to support a cloud services foundation. They are enduring, seldom amended, and inform and support the way a cloud services foundation fulfills its mission and goals. They also strive to be compelling and aspirational, as achieving many of them takes time and change. These principles are often interdependent and together form the basis on which a cloud services foundation is planned, designed, implemented, and operated. 

2.1 Achieve Business Value through Measured Continual Improvement

Statement:  
The productive use of technology to deliver business value should be measured via a process of continual improvement. 

Rationale:  
All investments in IT services need to be clearly and measurably related to delivering business value. Often the returns on major investments in strategic initiatives are managed closely in the early stages but then tail off, resulting in diminishing returns. Continuously measuring the business value delivered by a service makes it possible to identify and prioritize improvements that sustain and increase that value. This ensures that evolving technology is used to the productive benefit of the consumer and the efficiency of the provider. Adhered to successfully, this principle results in a constant evolution of IT services that provides the agile capabilities necessary for organizations to attain and maintain a competitive advantage. 

Implications:  
The main implication of this principle is the requirement to constantly calculate the current and future return from investments. This governance process needs to determine if there is still value being returned to the business from the current service architecture and, if not, determine which element of the strategy needs to be adjusted.  All components of each service must be measured and improved, where applicable.  This is true whether the component is managed by the internal IT department, by a public service provider, or by a combination of both.

2.2 Perception of Infinite Capacity

Statement:  
From the consumer’s perspective, a cloud service should provide capacity on demand, only limited by the amount of capacity the consumer is willing to pay for. 

Rationale:  
IT services have historically been designed to meet peak demand, which resulted in underutilized resources that the consumer nonetheless had to pay for. Likewise, once capacity was reached, service providers made monumental investments of time, resources, and money to expand their existing capacity, which often negatively impacted consumers. The consumer wants “utility” services where they pay for what they use and can scale capacity up or down on demand. 

Implications:  
A highly-mature capacity management strategy must be employed by the service provider in order to deliver capacity on demand. Predictable units of network, storage, and compute should be pre-defined as scale units. The procurement and deployment times for each scale unit must be well understood and planned for. Therefore, systems management tools must be programmed with the intelligence to understand scale units, procurement and deployment times, and current and historical capacity trends that may trigger the need for additional scale units. Finally, the service provider must work closely with its consumers to understand new and changing business initiatives that may change historical capacity trends. The process of identifying changing business needs and incorporating these changes into the capacity plan is critical to the provider's capacity management processes.  A provider may use capacity that it owns and manages, capacity from external service providers, or some combination of both to achieve this principle.

2.3 Perception of Continuous Service Availability

Statement:  
From the consumer’s perspective, a cloud service should be available on demand from anywhere, on any device, and at any time. 

Rationale:  
Traditionally, IT service providers have been challenged by consumers' availability demands. Technology limitations, architectural decisions, cost, and lack of process maturity all led to increased likelihood and duration of availability outages. High availability services were offered, but only after a tremendous investment in redundant infrastructure.  Consumers often expected 100 percent availability, but were unwilling or unable to pay for it.  

Implications:  
In order to achieve cost-effective, highly-available services, service providers must create a resilient infrastructure and reduce hardware redundancy wherever possible. Resiliency can only be achieved through highly-automated fabric management and a high degree of IT service management maturity. In a highly-resilient environment, it is expected that hardware components will fail. A robust and intelligent fabric management tool is needed to detect early signs of imminent failure so that workloads can be quickly moved off of failing components, ensuring that the consumer continues to experience service availability. Legacy applications may not be designed to execute in a resilient infrastructure.  These applications may remain hosted on heterogeneous infrastructure outside of the cloud infrastructure, or they may be redesigned or replaced so that they can be hosted on the cloud infrastructure.  In addition to the availability levels provided by services, services must also be available on the variety of device types that consumers use to access them. When designing services for continuity, a provider can either use multiple data centers that it owns and manages, or use a combination of data centers it owns and manages with data centers owned and managed by external providers. 

2.4 Take a Service Provider’s Approach

Statement:  
Public cloud service providers inherently take this approach; enterprise IT organizations that provide cloud services must also think and behave like a public cloud service provider toward their consumers.

Rationale:  
Public cloud service providers provide services to their consumers with the following clearly-defined attributes:

  • Compute, network, and storage resources shared by multiple consumers
  • Description of the functionality provided by the service
  • The service levels at which the service is provided to the consumer
  • Consumption-based usage tracking and cost
  • Their responsibilities in providing the service to their consumers, as well as their consumers' responsibilities when using the service

Most enterprise IT service providers traditionally did not provide clearly-defined attributes of the services they provided to their consumers. While this has improved over time, many enterprise IT service providers still do not provide the same level of clarity for the services that they provide as public service providers do.

Different groups within enterprise IT departments often support different organizational units, each with its own budget and priorities.  While network and storage resources are often shared across the organization, individual organizational unit budgets and priorities frequently caused IT to purchase servers against each organizational unit's budget, often resulting in sub-optimal server utilization for the organization as a whole.  Further, individuals within IT were often responsible for the hardware and software that supported an individual organizational unit's resources, rather than one department being responsible for shared server hardware and hypervisors while other departments were responsible for the virtual machines running on that shared hardware. 

Cloud services are provided on shared, homogeneous infrastructure.  Within an enterprise, the shared infrastructure will often support several cloud services shared by many different departments.  The shared infrastructure often allows for higher utilization of the infrastructure as a whole, due to the diversification of resource requirements of the individual service components that share the infrastructure. The components used to provide application services such as email or collaboration often run in virtual machines hosted on a shared infrastructure service.  This allows for, and requires, clear separation of the responsibilities of both the providers and consumers of each service running on the shared server hardware.  Rather than a department within IT being responsible for the physical servers, and the software that executes on them, now a department may be responsible for the shared infrastructure services fabric, while other groups or departments are responsible for virtualized servers used to provide application services.

Implications:  
Taking a service provider’s approach requires a high degree of IT service delivery and operations maturity. Enterprise IT must have a clear understanding of the service levels they can achieve and must consistently meet these targets. IT must also have a clear understanding of the true cost of providing a service and must be able to communicate to consumers the cost of consuming the service. There must be a robust capacity management strategy to ensure demand for the service can be met without disruption and with minimal delay. IT must also have health models for each service and have automated systems management tools to monitor and respond to failing components quickly and proactively so that there is no disruption to services.

2.5 Optimization of Resource Usage

Statement:  
The cloud should automatically make efficient and effective use of infrastructure resources. 

Rationale:  
Resource optimization drives efficiency and cost reduction and is primarily achieved through shared resource pools and virtualization. Enabling multiple consumers to share resources results in higher resource utilization and a more efficient and effective use of the infrastructure. Optimization through virtualization enables many of the other principles in this article, and ultimately helps drive down costs and improve agility. 

Implications:  
The service provider needs to clearly understand service requirements to ensure the requirements can be met by the cloud infrastructure. 

The level of efficiency and effectiveness will vary depending on the time, cost, and quality drivers for a cloud services infrastructure. At one extreme, an infrastructure may be built to minimize cost, in which case the design and operation will maximize efficiency through a high degree of sharing. At the other extreme, the business driver may be agility, in which case the design focuses on the time it takes to respond to changes and will therefore likely trade efficiency for effectiveness.  Data center space and hardware resources within most organizations are limited.  By evaluating both its on-premises resources and resources from external providers, an organization can optimize its resource usage across all available resources. 

2.6 Take a Holistic Approach to Availability Design

Statement:  
The availability design for both a cloud services infrastructure and all cloud services should incorporate holistic mitigation strategies for potential failure conditions of all of the hardware and software utilized to provide the service.

Rationale:  
Historically, consumers experienced service unavailability if a server hardware component failed while they were using an application that was not written to withstand such a failure. To combat this, service providers historically implemented redundancy for many of the components in physical servers, such as power supplies, disks, and network interface cards.  Additionally, they may have implemented failover server clusters to mitigate the risk of processor, memory, or other component failures. Redundancy of components and servers comes at a premium price in capital as well as operational costs.

Designing for service resiliency, rather than designing for hardware redundancy, is often a much more cost-effective strategy. This strategy assumes that hardware components will fail.  Instead of implementing redundancy for each component, designing for resiliency addresses all failure conditions for the service, and then implements holistic strategies to mitigate them.  For example, consider a stateless software application, such as a web application.  Assume it is implemented across four virtualized web servers to support scale, as well as to provide high availability for the application.  These four virtualized web servers execute on four different physical servers.  If any of the components in any of the physical servers were to fail, the virtualized web server that had been executing on the failed physical server could be automatically moved by the fabric manager to a different, operable physical server.  Since the web application is stateless, the consumer would experience little to no downtime.  This design doesn't require redundant hardware components, and in a compute fabric comprised of many physical servers, the loss of a single server can potentially have no noticeable impact to services.  This holistic approach to availability required both an application that wasn't dependent on redundant components for availability, and an infrastructure that didn't require redundant hardware components.

Implications:  
Since designing for resiliency assumes hardware components will fail, the service design needs to holistically address all failure conditions. Existing applications not designed to withstand hardware component failures may not be good candidates for migration to a resilient infrastructure.

2.7 Minimize Human Involvement

Statement:  
The day-to-day operations of a cloud should have minimal human involvement. 

Rationale:  
A significant amount of automation is necessary for cloud services to exhibit the on-demand self-service characteristic, which is one of the cloud characteristics detailed in the Overview article of this article set.  Further, automation is a key enabler for many of the principles discussed in this article.  Automation not only improves predictability of services, it allows for activities such as provisioning and problem resolution to happen more consistently and quickly than when those activities are performed manually. 

Implications:  
Automated fabric management requires specific architectural patterns to be in place, which are described later in this article. The fabric management system must have an awareness of these architectural patterns and have a deep understanding of the health model of the infrastructure as a whole. This requires a high degree of customization of any automated workflows in the environment.  When evaluating external service providers, one evaluation criterion that supports this principle is the ability to automate tasks against the external provider's services.

2.8 Drive Predictability

Statement:  
A cloud infrastructure must be predictable, as the consumer expects consistency in the quality and functionality of the services they consume. 

Rationale:  
Historically, IT services were provided with inconsistent levels of quality. This inconsistency decreased consumer confidence in the services. As public cloud services have emerged, organizational units within enterprises are choosing to utilize services provided by public service providers, because they often find that the services provided externally deliver a higher level of quality and consistency. Enterprise IT service providers must provide a predictable service on par with public offerings in order to remain a viable option for their consumers to choose. 

Implications:  
To provide predictable services, enterprise IT service providers must deliver an underlying infrastructure that assures a consistent experience to the hosted workloads. This consistency is achieved through minimizing human involvement in daily operations, but also through the homogenization of underlying physical servers, network devices, and storage systems. In addition to homogenization of infrastructure, a very high level of IT service delivery and operations maturity is also required to achieve predictability. Well-managed change, configuration, and release management processes must be adhered to and must be highly effective.  Highly-automated incident and problem management processes must also be in place.

2.9 Incentivize Desired Behavior

Statement:  
Enterprise IT service providers must ensure that their consumers understand the cost of the IT resources that they consume so that the organization can optimize its resources and minimize its costs.

Rationale:
In many organizations, the IT department is a cost center, usually with a fixed annual budget for operating its infrastructure. It attempts to prioritize the needs of organizational units and allocates resources as best it can to meet those needs. In some cases, organizational units use more than their "fair share" of IT's fixed resources; in other cases, organizational units don't get their "fair share" of the resources. Though enterprise IT service providers do want their consumers to have the perception of infinite capacity, they also need to ensure that their consumers understand that capacity doesn't come without a cost.

All cloud services provided within the organization should have clearly-defined consumption costs, based on the cost to the organization to provide the service. All resource consumption by the organization's consumers should be tracked on a regular basis, such as monthly. The consumption costs for each reporting period can then be shown back to the organizational units, or cross-charged back to the organizational units. This helps each organizational unit, and each consumer, clearly understand the costs to consume the organization's resources. This knowledge will generally incent more conservative use of resources than if consumption were not measured. This incentive, of course, inherently exists in the services provided by public service providers, where consumption is always tracked and customers are billed for their consumption.
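To make this concept concrete, the following is a minimal Python sketch of a showback calculation. The record format, service names, and unit costs are illustrative assumptions, not prescribed values; an actual provider would draw this data from its consumption tracking system and service catalog.

from collections import defaultdict

# Hypothetical consumption records for one reporting period (for example, a month).
# Each record is (organizational unit, service, units consumed).
consumption_records = [
    ("Sales",   "Virtual Machine - Standard", 120),
    ("Sales",   "Storage (GB-month)",         500),
    ("Finance", "Virtual Machine - Standard",  40),
]

# Hypothetical unit costs taken from the provider's service catalog.
unit_costs = {
    "Virtual Machine - Standard": 35.00,   # cost per VM-month
    "Storage (GB-month)":          0.10,   # cost per GB-month
}

# Aggregate each organizational unit's consumption cost for showback or chargeback.
showback = defaultdict(float)
for org_unit, service, units in consumption_records:
    showback[org_unit] += units * unit_costs[service]

for org_unit, cost in sorted(showback.items()):
    print(f"{org_unit}: {cost:.2f} for the reporting period")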

Implications:  
The enterprise IT service provider needs to identify the behaviors they want to incent. Examples of behaviors an organization may want to incent include better resource utilization across the organization, definition of more cost-effective availability requirements, and reduction of helpdesk calls. Measuring improvement in any behavior requires well-implemented service management capabilities such as resource consumption tracking and reporting, offering different service classes for services, and enterprise IT taking more of a service provider's approach with its consumers. Incenting the desired behaviors begins when requirements for new services are defined.

2.10 Create a Seamless User Experience

Statement:  
Within an organization, consumers should be oblivious as to who the provider of a cloud service is, and should have similar experiences with all services provided to them. 

Rationale:  
Enterprise IT service providers increasingly look to integrate services from multiple public providers with their own services to achieve the most cost-effective solution for the organization. As more of the services provided to consumers include those from different providers, the potential for degraded user experiences increases as business transactions cross different providers' services. The fact that a hybrid service is provided to a consumer by aggregating service functionality from multiple cloud service providers should be completely opaque to the consumer. An example to illustrate this is a user who is using a business service to check the status of a purchase order. The user may look at the order through the on-premises order management service and, within a screen of the application, may click on a link to view more detailed information about the purchaser.  This information may be held in a customer relationship management system provided as a public cloud service. In crossing the boundary between the on-premises service and the public cloud service, the user should have no idea that they crossed a boundary. There should be no requests for additional authentication, they should not notice any differences in the user interface, and performance should be consistent across the whole experience. This is one example of the application of this principle in the design of a hybrid service.

Implications:  
The enterprise IT service provider needs to identify potential causes of disruption to the activities of consumers across a hybrid service. Security and identity systems may need to be federated to allow for seamless traversal of systems; data transformation may be required to ensure consistent representation of business records; and consistent styling may need to be applied to give the consumer confidence that they are working within a single environment. 

The area where this may have the most implications is in the resolution of incidents raised by consumers. As issues occur, their cause often isn't immediately obvious.  Determining the root cause of the issue may require complex incident and problem management processes across multiple service providers. The consumer should be oblivious to the issue resolution effort across providers and should receive resolution to their issue from their organization's service desk.  

3.0 Concepts

The following concepts are abstractions or strategies that support the principles and facilitate the composition of a cloud infrastructure. They are guided by, and directly support, one or more of the principles.

3.1 Favor Resiliency Over Redundancy

Achieving the Perception of Continuous Service Availability principle requires a holistic service design approach. Historically, availability has been the primary measure of the success of IT service delivery and is defined through service level targets that measure the percentage of availability. However, defining the success of the service solely through availability targets creates the false perception of “the more nines the better” and does not account for how much availability the consumers actually need. 

There are two fundamental assumptions behind using availability as the measure of success. First, that any service outage will be significant enough in length that the consumer will be aware of it and second, that there will be a significant negative impact to the business every time there is an outage. It is also a reasonable assumption that the longer it takes to restore the service, the greater the impact on the business. 

There are two main factors that affect availability:

  1. Reliability: Measured by mean time between failures (MTBF). This measures the time between service outages.
  2. Resiliency: Measured by mean time to restore service (MTRS). MTRS measures the total elapsed time from the start of a service outage to the time the service is restored.

The fact that human intervention was historically required to detect and respond to incidents limited how much MTRS could be reduced. Therefore, organizations have traditionally focused on MTBF to measure availability targets. Achieving higher availability through greater reliability typically requires investment in redundant hardware at significant cost to implement and maintain. 
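A commonly used steady-state approximation relates these two factors as Availability ≈ MTBF / (MTBF + MTRS). The short Python sketch below, using assumed example values, illustrates why improving resiliency (reducing MTRS through automation) can raise availability at least as effectively as investing in greater reliability (raising MTBF).

def availability(mtbf_hours: float, mtrs_hours: float) -> float:
    # Steady-state approximation: Availability = MTBF / (MTBF + MTRS).
    return mtbf_hours / (mtbf_hours + mtrs_hours)

# Doubling reliability (MTBF) while restoration still takes 4 hours:
print(f"{availability(2000, 4):.5f}")    # 0.99800
print(f"{availability(4000, 4):.5f}")    # 0.99900

# Same reliability, but automated recovery restores service in 15 minutes:
print(f"{availability(2000, 0.25):.5f}") # 0.99988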

The following technical capabilities are key enablers to designing for resiliency:

  1. Virtualization: Enables the de-coupling of an operating system from a physical server, thereby increasing the portability of that operating system, and any software applications that execute on it. 
  2. Service Monitoring: Through definition of a health model for each service, the service monitoring capability can monitor the service and raise alerts when a condition occurs that degrades the health of the service. As many of these alerts as possible should be resolved automatically by the fabric manager or other automation mechanisms.
  3. Stateless applications:  Though a virtual machine can be moved from a failed hypervisor server to an operable server, the user may experience downtime during the move, depending on how the application the user was interacting with was designed.  Stateless application designs minimize or eliminate this downtime, because no session state is lost when the application's virtual machines are restarted elsewhere.

While hardware redundancy is rarely necessary for servers in a cloud infrastructure, it may still be required for storage and network resources. Holistically designing for resiliency will significantly reduce the costs of a cloud infrastructure, and may also significantly reduce the amount of service disruption experienced by consumers.

3.2 Homogenization of Physical Hardware

Homogenization of the physical hardware is a key enabler for the predictability of cloud services. Homogenization of hardware typically lowers hardware acquisition costs due to volume discounts, lowers operational costs by minimizing the number of different technologies to learn and automate, and makes automation easier than it is for heterogeneous infrastructure.  Homogenization is achieved through the use of standardized servers, network, and storage.

Abstraction of services from the hardware through virtualization makes “server stock-keeping units (SKU) differentiation” a logical, rather than a physical construct. This eliminates the need for differentiation at the physical server level. Greater homogenization of compute components results in a greater reduction in variability. This reduction in variability increases the predictability of the infrastructure which, in turn, improves service quality.

The goal is to ultimately homogenize the compute, storage, and network resources to the point where there is no differentiation between the resources. In other words, every server has the same processor and random access memory (RAM); every server connects to the same storage resources, and every server connects to the same networks. This means that any virtual machine executes identically on any physical server in the cloud infrastructure.  This enables the virtual machine to be dynamically relocated from a failing or failed physical server to another physical server, seamlessly, without any change in service behavior.

It is understood that full homogenization of the physical infrastructure may not be feasible. While it is recommended that homogenization be the strategy, where this is not possible, the compute components should at least be standardized to the fullest extent possible. Whether or not the provider homogenizes their compute components, homogeneity in storage and network connections is crucial so that a resource pool may be created to host virtualized services.

It should be noted that homogenization has the potential to allow for a focused vendor strategy that achieves economies of scale. Without this scale, however, homogenizing hardware could have a negative impact on cost, because it detracts from the buying power that a multi-vendor strategy can provide through competitive pricing between vendors.

3.3 Pool Compute Resources

Leveraging a shared pool of compute resources is key to maximizing resource utilization. A resource pool is a collection of shared compute, storage, and network resources that, when managed by a fabric manager, provide the fabric for services. Subsets of these resources are allocated to consumers as needed and, conversely, returned to the pool when they are not needed. Ideally, the resource pool is homogeneous. However, as previously mentioned, the realities of a provider's current infrastructure may not allow for a fully homogenized pool of resources, and even if they do initially, the pool may not remain homogeneous over time as older hardware becomes unavailable from suppliers or is no longer the most cost-effective option.  The figure below illustrates a collection of servers that represents a shared compute resource pool.

3.4 Virtualized Infrastructure

Virtualization is the abstraction of hardware components into logical entities. Although virtualization occurs differently in each infrastructure component (server, network, and storage), the benefits are generally the same including lesser or no downtime during resource management tasks, enhanced portability, simplified management of resources, and the ability to share resources. While virtualization isn't required to provide services with cloud characteristics, it is a key enabler for an elastic infrastructure, for the partitioning of shared resources, and for pooling compute resources.  As a result, a virtualized infrastructure should be treated as an integral part of any cloud infrastructure.

3.5 Fabric Management

Fabric is the term applied to the collection of compute, network, and storage resources, when managed by a fabric management capability. Fabric management is a level of abstraction above virtualization; in the same way that virtualization abstracts physical hardware, fabric management abstracts services from specific hypervisors and network switches. Fabric management can be thought of as an orchestration engine, which is responsible for managing the lifecycle of a consumer’s workload. In a cloud infrastructure, fabric management responds to service requests, systems management events, and service operations policies. Traditionally, servers, network, and storage have been managed separately, often on a project-by-project basis. To ensure resiliency, a cloud must be able to automatically detect if a hardware component is operating at a diminished capacity or has failed. This requires an understanding of all of the hardware components that work together to deliver a service, and the interrelationships between these components. Fabric management understands the interrelationships between components to determine which services are impacted by a component failure. This enables the fabric management system to determine if an automated response action is needed to prevent an outage, or to quickly restore a virtual machine that failed because of a physical server failure onto an operable physical server within the fabric. The fabric management system is aware of the upgrade domains, physical fault domains, reserve capacity, resource decay, and health model of the cloud infrastructure.

3.6 Elastic Infrastructure

The concept of an elastic infrastructure enables the perception of infinite capacity. An elastic infrastructure allows resources to be allocated on demand and more importantly, returned to the resource pool on-demand, when the resources are no longer needed. The ability to scale down when capacity is no longer needed is often overlooked or undervalued, resulting in server sprawl and lack of optimization of resource usage. It is important to use consumption-based pricing to incent consumers to be responsible in their resource usage. Automated or customer request-based triggers determine when compute resources are allocated or reclaimed.

Achieving an elastic infrastructure requires close alignment between the provider and their consumers, as peak usage and growth rate patterns need to be well understood and planned for as part of a capacity plan.

3.7 Partitioning of Shared Resources

Sharing resources to optimize resource utilization is a key concept; however, it is also important to understand when shared resources need to be partitioned. While a fully-shared infrastructure may provide the greatest optimization of cost and agility, there may be regulatory requirements, business drivers, or issues of multi-tenancy that require various levels of resource partitioning. Partitioning strategies can occur at many layers, such as physical isolation or network partitioning. Much like redundancy, the lower in the stack this isolation occurs, the more expensive it is. Additional hardware and reserve capacity may be needed for partitioning strategies such as the separation of resource pools. Ultimately, the provider will need to balance the risks and costs associated with partitioning strategies and the cloud infrastructure will need the capability of providing a secure method of isolating the infrastructure and network traffic while still benefiting from the optimization of shared resources.

3.8 Resource Decay

Treating infrastructure resources as a single resource pool allows the infrastructure to experience small hardware failures without significant impact on the overall capacity. Traditionally, hardware is serviced using an incident model, where the hardware is fixed or replaced as soon as there is a failure. By leveraging the concept of a resource pool, hardware can be serviced using a maintenance model. A percentage of the resource pool can fail because of “decay” before services are impacted and an incident occurs. Failed resources are replaced on a regular maintenance schedule or when the resource pool reaches a certain threshold of decay instead of a server-by-server replacement.

The maintenance model requires the provider to determine the amount of “decay” they are willing to accept before infrastructure components are replaced. This allows for a more predictable maintenance cycle and reduces the costs associated with urgent component replacement. 

For example, a provider with a resource pool containing 100 servers may determine that up to three percent of the resource pool may decay before an action is taken. This will mean that three servers can be completely inoperable before an action is required, as illustrated in the figure below.
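The maintenance-model decision described above can be expressed as a simple threshold check. The sketch below uses the example values from this section; the function name and signature are illustrative only.

def maintenance_needed(total_servers: int, failed_servers: int,
                       decay_threshold: float) -> bool:
    # True when the failed share of the pool reaches the decay threshold
    # the provider is willing to accept before servicing hardware.
    return failed_servers / total_servers >= decay_threshold

# The example above: a 100-server pool with a three percent decay threshold.
print(maintenance_needed(100, 2, 0.03))  # False - within accepted decay
print(maintenance_needed(100, 3, 0.03))  # True - schedule hardware maintenance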

3.9 Service Classification

Service classification is an important concept for driving predictability and incentivizing desired consumer behavior. When providing different classes of a service, the service components are implemented on different service class partition resource pools. Each service class for each service is defined in the provider’s service catalog, describing service levels for availability, resiliency, reliability, and performance. Each service class of each service is offered at a different cost to the consumer. Service classification allows consumers to select the service they consume with the attributes they require at a cost that they're willing and able to pay. Service classification also allows the provider to adopt a standardized approach to delivering a service that reduces complexity and improves predictability, thereby resulting in higher-quality service delivery. 
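As an illustration only, the sketch below shows how two classes of the same service might be represented as service catalog entries. The class names, availability targets, pool names, and prices are assumptions for the example, not values prescribed by this architecture.

# Illustrative service catalog entries for two classes of the same service.
service_catalog = {
    "Virtual Machine": {
        "Gold":   {"availability": "99.99%", "resource_pool": "pool-gold",
                   "monthly_cost": 50.00},
        "Silver": {"availability": "99.9%",  "resource_pool": "pool-silver",
                   "monthly_cost": 35.00},
    }
}

def describe(service: str, service_class: str) -> str:
    entry = service_catalog[service][service_class]
    return (f"{service} ({service_class}): {entry['availability']} availability "
            f"on {entry['resource_pool']} at {entry['monthly_cost']:.2f} per month")

print(describe("Virtual Machine", "Gold"))
print(describe("Virtual Machine", "Silver"))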

3.10 Cost Transparency

Cost transparency is a fundamental concept for taking a service provider’s approach to providing cloud infrastructure. In a traditional data center, it may not be possible to determine what percentage of a shared resource such as infrastructure, is consumed by a particular service. This makes benchmarking services against other providers of similar services an impossible task. By defining the cost of infrastructure through service classification and consumption-based pricing, a more accurate picture of the true cost of utilizing shared resources can be gained. This enables enterprise consumers to make fair comparisons of the cost and benefits of private cloud services to those offered by public service providers.

Cost transparency through service classification allows consumers to make informed decisions when buying or building new applications. Applications designed for resiliency will typically be provided at some of the lowest costs to consumers, whereas applications not designed for resiliency will often be provided at higher costs, due to the hardware redundancy often required to support them.

Finally, cost transparency incents service owners to think about service retirement. In a traditional data center, consumers may discontinue using some services, and there is little to no consideration on how to retire an unused service. The cost of ongoing support and maintenance for an under-utilized service may be hidden in the cost model of the data center. In a private cloud, monthly consumption costs for each service can be provided to the business, incenting service owners to retire unused services and reduce their cost.

3.11 Consumption-Based Pricing

This is the concept of paying for what you use as opposed to a fixed cost, irrespective of the amount consumed. In a traditional pricing model, the consumer’s cost is based on flat costs derived from the capital cost of hardware and software and the expenses to operate the service. In this model, services may be over or underpriced based on actual usage. In a consumption-based pricing model, the consumer’s cost reflects their usage more accurately.

The unit of consumption is defined in the service catalog and should reflect, as accurately as possible, the true cost of consuming infrastructure services, the amount of reserve capacity needed to ensure availability SLAs, and the user behaviors that are being incented.
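As a simple, assumed example of how a unit of consumption might reflect reserve capacity, the sketch below derives a per-unit price by spreading the total cost of a resource pool over only the capacity that can be sold. The cost figures are illustrative, not actual pricing guidance.

def unit_price(total_monthly_cost: float, total_capacity_units: int,
               reserve_fraction: float) -> float:
    # Spread the pool's full cost over only the capacity that can be sold,
    # so that the price of each consumed unit also recovers the cost of
    # the reserve capacity held back from the pool.
    sellable_units = total_capacity_units * (1 - reserve_fraction)
    return total_monthly_cost / sellable_units

# Illustrative figures: a 100-server pool costing 80,000 per month to own
# and operate, with 15 percent of the pool held as reserve capacity.
print(f"{unit_price(80_000, 100, 0.15):.2f} per server-month")  # 941.18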

3.12 Security and Identity

Security for the cloud is founded on the following paradigms:

  • Protected infrastructure takes advantage of security and identity technologies to ensure that hosts, information, and applications are secured across all scenarios in the data center, including the physical (on-premises) and virtual (on-premises and cloud) environments.
  • Application access helps ensure that enterprise IT service providers can extend vital applications to internal users as well as to important business partners and cloud users.
  • Network access uses an identity-centric approach to ensure that users—whether they’re based in the central office or in remote locations—have more secure access no matter what device they’re using. This helps ensure that productivity is maintained and that business gets done the way it should.

Most important from a security standpoint, the secure data center makes use of integrated technology to assist users in gaining simple access using a common identity. Management is integrated across physical, virtual, and cloud environments so that organizations can take advantage of all capabilities without the need for significant additional financial investments.

3.13 Multitenancy

Multi-tenancy refers to the ability of the infrastructure to be logically subdivided and provisioned to different organizations or organizational units. The traditional example is a public service provider that provides services to multiple customer organizations. Increasingly, this is also a model being utilized by enterprise IT service providers when providing services to multiple organizational units within a single organization, treating each as a customer or tenant.

4.0 Patterns

Patterns are specific, reusable ideas that have been proven solutions to commonly occurring problems. The following sections describe patterns that are useful for enabling the cloud computing principles and concepts described in this article. Guidance for implementing these patterns with Microsoft products and technologies is available separately at the Cloud and Datacenter Solutions Hub.

4.1 Resource Pooling

Problem: When dedicated infrastructure resources are used to support each service independently, their capacity is typically underutilized.  This leads to higher costs for both the provider and the consumer.

Solution: Aggregate infrastructure resources into shared resource pools to more effectively utilize their capacity.  When pooling resources, different resource pools may be necessary to meet different requirements in areas such as service classes, security policies, consumers, systems management, and capacity management. Resource pools exist either for storage, or for compute and network combined. This de-coupling of resources reflects that storage is consumed at one rate, while compute and network are collectively consumed at another rate.  The following types of solutions are commonly used when pooling compute and network resources.

4.1.1  Service Class Partitions

Different service class partitions are often defined to differentiate unique security policies, performance, and availability characteristics, as well as to support unique requirements of different consumers such as separate organizations or organizational units. Each of these service classifications is often a separate resource pool.

  

An example of different service class partition resource pools is one compute resource pool that provides 99.99% availability of virtual machines and another pool that provides 99.9% availability. Virtual machines hosted on the 99.99% availability resource pool would have a higher cost than those hosted on the 99.9% pool. The application of this pattern should also take into account the Application patterns that will be supported by the infrastructure.  Different types of applications may only execute properly on specific service class partitions.  A service can be offered to consumers with different service classes by deploying the same service components to multiple service class partition resource pools.

4.1.2  Systems Management Partitions

Systems management tools depend on defined boundaries to function. For example, deployment, provisioning, and automated failure recovery (virtual machine movement) depend on the tools' knowledge of which servers are available to host virtual machines. Resource pools define these boundaries and allow automation of management tool activities.

4.1.3  Capacity Management Partitions

To perform capacity management it is necessary to know the total amount of resources available to a datacenter. A resource pool can represent the total data center compute, storage, and network resources that a provider has. Resource pools allow this capacity to be partitioned; for example, to represent different budgetary requirements or to represent the power capacity of a particular uninterruptible power supply (UPS).

4.2 Physical Fault Domain

Problem: Groups of servers often fail together as a result of a shared infrastructure component such as a network switch or UPS.  This can cause service degradation or unavailability if not incorporated into the overall cloud infrastructure design.

Solution: Define physical fault domains to support resiliency in the cloud infrastructure. It is important to understand how a fault impacts the resource pool, and therefore the resiliency of the virtual machines. A datacenter is resilient to small outages such as single server failure or local direct-attached storage (DAS) failure. Larger faults have a direct impact on the datacenter’s capacity, so it becomes important to understand the impact of a non-server hardware component’s failure on the size of the available resource pool. 

To understand the failure rate of the key hardware components, select the component that is most likely to fail and determine how many servers will be impacted by that failure. This defines the pattern of the physical fault domain. The number of “most-likely-to-fail” components sets the number of physical fault domains. 

For example, the figure below represents ten racks with ten servers in each rack. Assume that the racks have two network switches and an uninterruptible power supply (UPS). Also assume that the component most likely to fail is the UPS. When that UPS fails, it will cause all ten servers in the rack to fail. In this case, those ten servers become the physical fault domain. If we assume that there are nine other racks configured identically, then there are a total of ten physical fault domains.

From a practical perspective, it's not always possible to know which component will have the highest fault rate. Therefore, when determining fault domains, it's recommended that you start with the MTBF information provided by the manufacturer for each resource pool component, or utilize any other historical MTBF data you might have yourself, or that you may have obtained from the IT community.  Over time, you'll be able to gather more accurate data and adjust your fault domains as appropriate. The usefulness of this pattern is only realized however, if the virtual machines that support consumer services are spread across multiple physical fault domains.  If at least two virtual machines support each tier within a consumer service, fabric management can ensure that virtual machines are spread across separate physical fault domains for this purpose.
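Fabric management tools typically implement this spreading through anti-affinity or availability-set rules. The sketch below is a simplified, hypothetical illustration of round-robin placement of a service tier's virtual machines across fault domains; the rack and host names are assumptions for the example.

from itertools import cycle

# Hypothetical hosts labeled with the physical fault domain (rack) they belong to.
hosts_by_fault_domain = {
    "rack-01": ["r01-s01", "r01-s02"],
    "rack-02": ["r02-s01", "r02-s02"],
    "rack-03": ["r03-s01", "r03-s02"],
}

def place_across_fault_domains(vm_names, hosts_by_fd):
    # Spread a service tier's virtual machines round-robin across fault
    # domains so that a single domain failure cannot take down the tier.
    # For simplicity, each VM lands on the first host of its domain.
    placements = {}
    domains = cycle(sorted(hosts_by_fd))
    for vm in vm_names:
        fault_domain = next(domains)
        placements[vm] = (fault_domain, hosts_by_fd[fault_domain][0])
    return placements

print(place_across_fault_domains(["web-vm-1", "web-vm-2", "web-vm-3"],
                                 hosts_by_fault_domain))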

4.3 Upgrade Domain

Problem: Over time, all hardware and software must be upgraded. Upgrades can cause service degradation or unavailability if not incorporated into the overall cloud infrastructure design.

Solution: Define upgrade domains to support resiliency in the cloud infrastructure. This pattern applies to all three categories of datacenter resources; compute, network, and storage. 

Although the virtual machine creates an abstraction from the physical server, it doesn’t obviate the requirement of an occasional update or upgrade of the physical server. This pattern can be used to accommodate this without disrupting service delivery by dividing the resource pool into small groups called upgrade domains. All servers in an upgrade domain are maintained simultaneously, and each upgrade domain is targeted in turn. This allows workloads to be migrated away from the upgrade domain during maintenance and migrated back after completion. 

Ideally, an upgrade would follow the pseudocode algorithm below:

For each UpgradeDomain in ResourcePool;

    Migrate workloads off of the upgrade domain;
    Update hardware;
    Reinstall OS;
    Return servers to the resource pool;

Next;

In the figure below, the two yellow boxes represent an upgrade domain in a resource pool.

The same concept applies to network. Because the datacenter design is based on a redundant network infrastructure, an upgrade domain could be created for all primary switches (or a subset) and another upgrade domain for the secondary switches (or subset). The same applies for the storage network. The usefulness of this pattern is only realized however, if the virtual machines that support consumer services are spread across multiple upgrade domains.  If at least two virtual machines support each tier within a consumer service, fabric management can ensure that virtual machines are spread across separate upgrade domains for this purpose.
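For illustration, the rotation described above can also be sketched in runnable form. The helper function below stands in for whatever automation the provider's fabric management tooling exposes; it is an assumption for the example, not a specific product capability.

def step(action: str, domain: str) -> None:
    # Stand-in for a fabric manager operation; real tooling would do the work.
    print(f"{action} {domain}")

upgrade_domains = ["ud-01", "ud-02", "ud-03", "ud-04", "ud-05"]

for domain in upgrade_domains:
    step("Migrating workloads off of", domain)        # drain to the remaining domains
    step("Updating hardware in", domain)
    step("Reinstalling the operating system in", domain)
    step("Returning to the resource pool:", domain)
    # The domain rejoins the pool before the next one is taken out of service,
    # so only one upgrade domain's capacity is unavailable at any time.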

4.4 Reserve Capacity

Problem: When resources have failed, have decayed, or are being upgraded, the cloud infrastructure's aggregate capacity diminishes, potentially causing performance degradation to services.

Solution:  Always have enough excess cloud infrastructure capacity to minimize or eliminate performance degradation in the event of failed or decayed resources, or resources that are being upgraded. The advantage of a homogenized resource pool-based approach is that all virtual machines will run the same way on any server in the pool. This means that during a fault, any virtual machine can be relocated to any physical host as long as there is capacity available for that virtual machine. Determining how much capacity needs to be reserved is an important part of designing a cloud infrastructure. This pattern combines the concept of Resource Decay with the Physical Fault Domain and Upgrade Domain patterns to determine the amount of reserve capacity that a resource pool should maintain. 

To compute reserve capacity, utilize the following approach:

TOTALSERVERS = the total number of servers in a resource pool

ServersInFD = the number of servers in a fault domain 
ServersInUD = the number of servers in an upgrade domain 
ServersInDecay = the maximum number of servers that can decay before maintenance

So, the formula is: Reserve capacity = (ServersInFD + ServersInUD + ServersInDecay) / TOTALSERVERS. This formula makes the following assumptions:

  1. It assumes that only one fault domain will fail at a time. A service provider may elect to base their reserve capacity on the assumption that more than one fault domain may fail simultaneously; this, however, leaves more capacity unused on an ongoing basis.
  2. If reserve capacity is sized for only one fault domain, the failure of multiple fault domains might trigger a disaster recovery plan, rather than a fault management plan.
  3. It assumes a situation where a fault domain fails while some servers are in decay and other servers are down for upgrade.
  4. It is based on no over-subscription of capacity.

In the formula, the number of servers in the fault domain is a constant. The size of reserve capacity must be balanced with SLAs and cost structures to ensure that resources are used most efficiently. For example, if an upgrade domain is too large, the reserve capacity will be too high; if it is too small, upgrades will take a longer time to cycle through the resource pool. A resource decay percentage that is too small may require frequent maintenance of the resource pool, while a resource decay percentage that's too large means that the reserve capacity is higher than necessary.

There is no “correct” answer to determining the “right” amount of reserve capacity for a cloud infrastructure. The right amount will allow the infrastructure to most efficiently meet both its SLAs and cost structures. An example calculation of reserve capacity that builds upon the values defined in the physical fault domain and upgrade domain sections of this article is detailed below.

TOTALSERVERS = 100 
ServersInFD = 10 
ServersInUD = 2 
ServersInDecay = 3

In this example, when using the formula Reserve capacity = (ServersInFD + ServersInUD + ServersInDecay) / TOTALSERVERS, reserve capacity equals fifteen percent.  The figure below illustrates the allocation of 15 percent of the resource pool for reserve capacity.  This means that fifteen percent of resources can be idle at any given time.  This reserve capacity should be factored into the cost model of delivering the service to its specified SLAs, as the cost of the reserve capacity is necessary to provide the service at its specified SLA.
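The same calculation, expressed as a small Python function using the example values above:

def reserve_capacity_fraction(total_servers: int, servers_in_fd: int,
                              servers_in_ud: int, servers_in_decay: int) -> float:
    # Reserve capacity = (ServersInFD + ServersInUD + ServersInDecay) / TOTALSERVERS
    return (servers_in_fd + servers_in_ud + servers_in_decay) / total_servers

# The worked example above: 100 servers, fault domains of 10 servers,
# upgrade domains of 2 servers, and up to 3 servers allowed to decay.
print(f"Reserve capacity: {reserve_capacity_fraction(100, 10, 2, 3):.0%}")  # 15%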

4.5 Scale Unit

Problem: Purchasing individual servers, storage arrays, network switches, and other cloud infrastructure resources requires procurement, installation, and configuration overhead for each individual resource.

Solution:  Purchase pre-configured collections of cloud infrastructure resources, or scale units, to minimize individual resource overhead. At some point, the amount of total utilized capacity will approach the total available capacity, where total available capacity is equal to the total servers minus the reserve capacity.  When this point is reached, new capacity will need to be added to the datacenter. Historically, capacity was added by purchasing individual servers, which needed to be individually racked, cabled, and configured.  To minimize the amount of individual server configuration overhead however, capacity should be increased in standardized increments of more than a single server.  With this approach, multiple servers could be purchased pre-racked, for example, and even pre-configured, requiring only a single connection to network, power, and cooling when installed into the data center.  Pre-assembled and pre-configured units such as this are commonly available from a variety of hardware manufacturers.  This pattern is applied to help strike the "right" balance between adding necessary capacity and minimizing unused capacity.  

The scale unit represents a standardized unit of capacity that is added to a datacenter. Common types of scale units are compute scale units, which include servers and network, and storage scale units, which include storage components such as disks and input/output (I/O) interface cards. Scale units increase capacity in a predictable, consistent way, allow standardized designs, and support capacity plans.

Like reserve capacity, scale units should be defined such that they effectively meet increased capacity needs, while not leaving too much capacity unused. 
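As an illustration, a compute scale unit might be described by a simple, standardized definition like the sketch below. The specific quantities and lead time shown are assumptions; each provider defines its own scale units based on its hardware standards and capacity plan.

# An assumed definition of a standardized compute scale unit.
compute_scale_unit = {
    "servers": 10,                    # pre-racked, pre-cabled servers
    "network_switches": 2,            # redundant top-of-rack switches
    "usable_vm_capacity": 80,         # standard-sized VMs added per scale unit
    "procurement_lead_time_days": 45, # time from order to usable capacity
}

print(f"One compute scale unit adds {compute_scale_unit['usable_vm_capacity']} VMs "
      f"of capacity after a {compute_scale_unit['procurement_lead_time_days']}-day lead time.")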

4.6 Capacity Plan

Problem: Eventually every cloud infrastructure runs out of physical capacity. This can cause performance degradation of services, the inability to introduce new services, or both.

Solution: Define a capacity plan that incorporates capacity forecasting, procurement, installation, and configuration lead time to ensure that new capacity is introduced into the cloud infrastructure before existing capacity is exhausted. This pattern encompasses many of the other patterns in this article to achieve the Perception of Infinite Capacity principle.  The provider creates and updates the capacity plan by regularly reviewing consumer capacity requirement forecasts. The capacity plan for private cloud service providers must account for peak capacity requirements of the business, such as holiday shopping season for an online retailer. It must account for typical, as well as accelerated growth patterns of the business, such as business expansion, mergers and acquisitions, and development of new markets.  Public cloud service providers often apply this pattern with diversification of capacity requirements from multiple customers, each with unique requirements.

The capacity plan must account for current available capacity and include defined triggers for when additional scale units must be acquired. These triggers should be defined by the amount of capacity each scale unit provides, but also take into account the lead time required for purchasing, obtaining, and installing each scale unit.
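A minimal sketch of such a trigger is shown below, assuming illustrative values for growth rate, lead time, and reserve capacity; an actual capacity plan would draw these from consumption forecasts and procurement history.

def order_scale_unit(current_used: float, total_capacity: float,
                     monthly_growth: float, lead_time_months: float,
                     reserve_fraction: float) -> bool:
    # Trigger a scale unit purchase when the demand forecast over the
    # procurement lead time would eat into the reserve capacity.
    forecast_used = current_used + monthly_growth * lead_time_months
    usable_capacity = total_capacity * (1 - reserve_fraction)
    return forecast_used >= usable_capacity

# Illustrative values: a 100-server pool with 15 percent reserve capacity,
# demand growing by 5 servers' worth per month, and a 2-month lead time.
print(order_scale_unit(70, 100, 5, 2, 0.15))  # False: forecast 80 < usable 85
print(order_scale_unit(78, 100, 5, 2, 0.15))  # True:  forecast 88 >= usable 85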

A well-designed capacity plan cannot exist without a high degree of service management maturity and close monitoring of capacity utilization and forecasts.

4.7 Health Model

Problem: If any component used to provide a service fails, it can cause performance degradation or unavailability of services.

Solution:  Define a health model for each service that incorporates all failure conditions for all service components, as well as the actions that should be taken for each failure condition. To ensure resiliency, service management tools must be able to automatically detect when service components are operating at a diminished capacity, or have failed.  Achieving this requires a deep understanding of each service component, as well as the interrelationships between these components. It is this understanding of the interrelationships that enables service management tools to take the necessary action when a service is degraded. From a broader perspective, the service management tools must be able to classify degradation as resource decay, physical fault domain failure, failure due to upgrades occurring within upgrade domains, or a broad failure that requires the system to trigger the provider's disaster recovery response.

When creating the health model, it is important to consider the connections between all service components, including connections to power and network. For example, if a virtual machine cannot connect to its storage, the service may fail, or operate at a diminished capacity. Another example is where the network may be saturated at greater than 80 percent utilization.  This may impact performance SLAs, requiring the system management tools to take the appropriate action to resolve the condition. It is important to understand how to proactively determine both the health and failed states of all components that provide a service.
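As a simplified illustration, part of a health model can be represented as a mapping from failure or warning conditions to the affected components and the automated response. The condition names, components, thresholds, and responses below are assumptions based on the examples in this section, not a prescribed schema.

# Illustrative fragment of a health model: each condition maps to the
# components it affects and the automated response the fabric manager takes.
health_model = {
    "ups_a_failure": {
        "affects": ["Server 1", "Server 2", "Server 3", "Server 4"],
        "state": "failed",
        "response": "restart affected VMs on hosts in an operable fault domain",
    },
    "network_utilization_above_80_percent": {
        "affects": ["Network A"],
        "state": "warning",
        "response": "rebalance or migrate network-heavy VMs before SLAs are impacted",
    },
}

def respond(condition: str) -> str:
    entry = health_model[condition]
    return f"[{entry['state']}] {condition}: {entry['response']}"

print(respond("ups_a_failure"))
print(respond("network_utilization_above_80_percent"))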

The diagrams below show typical system interconnections and demonstrate how the Health Model pattern is used to provide resiliency. In the figure below, power is a single point of failure, while network connections and connections to the storage area network (SAN) are redundant.

As illustrated below, when "UPS A" fails, it causes a loss of power to "Server 1," "Server 2," "Server 3," and "Server 4." It also causes a loss of power to "Network A" and "SAN Access A," but because network and SAN access are redundant, only one physical fault domain fails. The fault domain connected to "UPS B" is operating at degraded capacity, as it loses its redundancy, but it is still available.

As illustrated below, the service management tools detect the physical fault domain failure and restart the virtual machines from the servers in the failed physical fault domain on the servers within the operable physical fault domain, which is connected to "UPS B."

While the concept of a health model is not unique to cloud infrastructure, its importance becomes even more critical there. To achieve the necessary resiliency, failure states (an indication that a failure has occurred) and warning states (an indication that a failure may soon occur) need to be thoroughly understood for the cloud infrastructure. The detect-and-respond scenarios for each state must also be understood, documented, and automated. Only after defining these states and scenarios can the benefits of resiliency be fully realized.

The fabric manager must be able to automatically move virtual machines around the fabric in response to health warning states. Applications designed for greater resiliency should also have robust, high-fidelity health models that can be managed by the service management tools.
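
As a sketch of how the documented detect-and-respond scenarios might be automated, the following maps each class of degradation described above to an automated response. The state names and actions are illustrative assumptions, not a prescribed fabric manager interface:

    # Illustrative mapping of detected health states to automated responses.
    def evacuate_fault_domain(scope):
        print(f"Restarting or live-migrating virtual machines out of {scope}")

    def pause_placements(scope):
        print(f"Excluding {scope} from new virtual machine placements during upgrade")

    def trigger_disaster_recovery(scope):
        print(f"Initiating the disaster recovery response for {scope}")

    RESPONSES = {
        "resource_decay":        lambda scope: print(f"Retiring decayed resource {scope}"),
        "fault_domain_failure":  evacuate_fault_domain,
        "upgrade_domain_active": pause_placements,
        "broad_failure":         trigger_disaster_recovery,
    }

    def respond(state, scope):
        RESPONSES[state](scope)

    respond("fault_domain_failure", "the fault domain connected to UPS A")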

4.8 Application

Problem: Not all applications are optimized for cloud infrastructures, and some may not be able to be hosted on a cloud infrastructure at all.

Solution:  Classify applications and determine which can run on your cloud infrastructure and which cannot. For the applications that cannot run on your cloud infrastructure, host them on whatever separate, specialized infrastructure they require. Application patterns are useful both in designing new applications optimized for cloud infrastructure and in determining which existing applications are able to be hosted on cloud infrastructure. Consider the following application patterns; a simple classification sketch follows the list:

  1. Stateless applications: In this pattern, application components are designed for resiliency, and are not dependent on redundant infrastructure or any specific hardware.  As a result, the application components typically execute in virtual machines, and failed virtual machines executing application components can be restarted by the fabric manager on operable computers, typically without a noticeable break in service to consumers. The cloud infrastructure to support this pattern is generally the lowest cost due to the ability to virtualize the machines that execute the application components and because the application components do not require redundant hardware.  Valuable content for authoring cloud-optimized applications can be found at the Cloud Development website on MSDN.
  2. Stateful applications: In this pattern, application components assume some level of redundancy within the underlying infrastructure, because the failure of a single application server would otherwise cause a break in service to the consumer. This pattern often relies on failover clustering technologies to achieve high levels of availability. Aside from the dependency on hardware redundancy, this pattern has no other dependencies on any specific hardware, and it is able to run on virtualized computers. The cost of the cloud infrastructure to support this pattern is generally higher than the cost to support stateless applications because of the need for hardware redundancy. It is generally lower than the cost to support legacy applications, however, because the pattern still utilizes homogeneous hardware in shared resource pools.
  3. Legacy applications: In this pattern, application components often have specific or unique hardware requirements that are not included in the homogeneous cloud infrastructure design. Oftentimes, application components cannot execute in virtual machines due to their hardware requirements. The infrastructure that supports these applications is often not the same infrastructure used to host cloud services. This infrastructure is generally the most expensive to support, as it doesn't benefit from the economies of scale provided by the homogeneous infrastructure, virtualization, and significant automation found in cloud infrastructure.
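
A classification like the one above can be captured in a simple decision rule. The following is a minimal sketch; the attribute names and the sample portfolio are illustrative assumptions, not part of the reference architecture:

    # Illustrative triage of an application portfolio against the three patterns above.
    def classify_application(app):
        if app.get("requires_specific_hardware"):
            return "legacy"     # host on separate, specialized infrastructure
        if app.get("requires_infrastructure_redundancy"):
            return "stateful"   # relies on infrastructure redundancy, e.g. failover clustering
        return "stateless"      # resilient by design; lowest cost to host

    portfolio = [
        {"name": "web frontend", "requires_specific_hardware": False,
         "requires_infrastructure_redundancy": False},
        {"name": "finance system", "requires_specific_hardware": True},
    ]

    for app in portfolio:
        print(app["name"], "->", classify_application(app))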

4.9 Cost Model

Problem: Consumers tend to use more resources than they really need if there's no cost to them for doing so.

Solution:  Define cost models for each class of each service so that consumers consume only what they're willing to pay for. Cost models represent the cost of providing services in the cloud and the consumer behavior that the provider wishes to encourage. Cost models should account for the deployment, operations, and maintenance costs of delivering each service class of each service, as well as the capacity plan requirements for peak usage and future growth. Cost models must also define the units of consumption for a service. The units of consumption will likely incorporate some measurement of the compute, storage, and network capacity provided to each service component by each class of each service, combined into a unit of consumption that the consumer understands. The consumption unit definitions can then be used as part of a consumption-based reporting or billing model. Enterprise IT service providers that do not bill their consumers for services can still track and report the cost of each organizational unit's consumption. The cost model encourages desired behavior in the following two ways (a sketch of a consumption-based model follows this list):

  1. By charging (or reporting on) consumers based on their consumption, the provider encourages them to request only the amount of resources they need. If consumers temporarily scale up their consumption, they will likely give back the extra resources when those resources are no longer needed.
  2. By leveraging different cost models for different classes of service, the provider encourages consumers to build or buy applications that can be hosted on the most cost-effective service class possible.
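
As referenced above, the following is a minimal sketch of a consumption-based cost model. The service class names and unit rates are illustrative assumptions, not published pricing; the point is that compute, storage, and network consumption are combined into chargeable units per class of service:

    # Illustrative consumption-based cost model; all rates are hypothetical.
    RATES = {
        "gold":   {"vm_hour": 0.12, "gb_stored_month": 0.10, "gb_transferred": 0.02},
        "bronze": {"vm_hour": 0.05, "gb_stored_month": 0.04, "gb_transferred": 0.01},
    }

    def monthly_charge(service_class, vm_hours, gb_stored, gb_transferred):
        """Combine compute, storage, and network consumption into a single charge."""
        r = RATES[service_class]
        return (vm_hours * r["vm_hour"]
                + gb_stored * r["gb_stored_month"]
                + gb_transferred * r["gb_transferred"])

    # Example showback report for one organizational unit's monthly consumption.
    print(f"Gold class:   ${monthly_charge('gold', 720, 500, 200):.2f}")
    print(f"Bronze class: ${monthly_charge('bronze', 720, 500, 200):.2f}")

The same calculation supports either chargeback (billing) or showback (reporting) models.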

 

5.0 Summary

This article detailed the principles, concepts, and patterns that can be applied to the design of any cloud services foundation infrastructure. To better understand how to apply the information in this article to each of the Cloud Services Foundation Reference Model subdomains, you're encouraged to read solutions content found at the Cloud and Datacenter Solutions Hub. You may also want to return to the Cloud Services Foundation Reference Architecture - Overview article.  

6.0 Change Log

 

Version    Date         Change Description
1.0        7/23/2013    Initial posting and editing.