This post is part of a series that outlines the Context, Principles, and Concepts that form the basis of a holistic Private Cloud approach. Each post frames the next, so they are best read in order.
The following concepts are abstractions or strategies that support the principles and facilitate the composition of a (Microsoft) Private Cloud. They are guided by and directly support one or more of the principles.
Holistic Approach to Availability
In order to achieve the perception of continuous availability, a holistic approach must be taken in the way availability is achieved. Traditionally, availability has been the primary measure of the success of IT service delivery and is defined through service level targets that measure the percentage of uptime (e.g. 99.99 percent availability). However, defining service delivery success solely through availability targets creates the false perception of “the more nines the better” and does not account for how much availability the consumers actually need.
There are two fundamental assumptions behind using availability as the measure of success. First, that any service outage will be significant enough in length that the consumer will be aware of it and, second, that there will be a significant negative impact to the business every time there is an outage. It is also a reasonable assumption that the longer it takes to restore the service, the greater the impact on the business.
There are two main factors that affect availability. The first is reliability, which is measured by Mean-Time-Between-Failures (MTBF). This measures the time between service outages. The second is resiliency, which is measured by Mean-Time-to-Restore-Service (MTRS). MTRS measures the total elapsed time from the start of a service outage to the time the service is restored. The fact that human intervention is normally required to detect and respond to incidents limits how much MTRS can be reduced. Therefore, organizations have traditionally focused on MTBF to achieve availability targets. Achieving higher availability through greater reliability requires increased investment in redundant hardware and an exponential increase in the cost of implementing and maintaining this hardware.
In a traditional data center, the MTRS may average well over an hour while a dynamic data center can recover from failures in a matter of seconds. Combined with the automation of detection and response to failure and warn states within the infrastructure, this can reduce the MTRS (from the perspective of IaaS) dramatically. Thus a significant increase in resiliency makes the reliability factor much less important. Availability (minutes of uptime/year) is no longer the primary measure of the success of IT service delivery. The perception of availability and the business impact of unavailability become the measures of success.
Using the holistic approach, higher levels of availability and resiliency are achieved by replacing the traditional model of physical redundancy with software tools.
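To make the trade-off between reliability and resiliency concrete, the sketch below uses the common steady-state approximation availability = MTBF / (MTBF + MTRS). That formula is not stated in the text above, and the function names and all figures are illustrative assumptions rather than measurements.

```python
def availability(mtbf_hours: float, mtrs_hours: float) -> float:
    """Steady-state availability: the fraction of time the service is up."""
    return mtbf_hours / (mtbf_hours + mtrs_hours)


def downtime_minutes_per_year(avail: float) -> float:
    """Expected minutes of downtime in a 365-day year."""
    return (1.0 - avail) * 365 * 24 * 60


# Hypothetical figures: one failure every 1,000 hours in both cases,
# so only the time to restore service differs.
traditional = availability(mtbf_hours=1000, mtrs_hours=1.5)    # 90-minute restore
dynamic = availability(mtbf_hours=1000, mtrs_hours=30 / 3600)  # 30-second restore

for label, value in (("traditional", traditional), ("dynamic", dynamic)):
    print(f"{label}: {value:.5%} available, "
          f"{downtime_minutes_per_year(value):.1f} min downtime/year")
```

With MTBF held constant, shortening MTRS alone raises availability, which is the sense in which greater resiliency makes the reliability factor less important.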
Homogenization of Physical Hardware
Homogenization of the physical hardware is a key concept for driving predictability. The underlying infrastructure must provide a consistent experience to the hosted workloads in order to achieve predictability. This consistency is attained through the homogenization of the underlying servers, network, and storage.
Abstraction of services from the hardware layer through virtualization makes “server SKU differentiation” a logical rather than a physical construct. This eliminates the need for differentiation at the physical server level. Greater homogenization of compute components results in a greater reduction in variability. This reduction in variability increases the predictability of the infrastructure which, in turn, improves service quality.
The goal is to ultimately homogenize the compute, storage, and network layers to the point where there is no differentiation between servers. In other words, every server has the same processor and memory; every server connects to the same storage resources and to the same networks. This means that any virtualized service runs and functions identically on any physical server and so it can be relocated from a failing or failed physical server to another physical server seamlessly without any change in service behavior.
It is understood that full homogenization of the physical infrastructure may not be feasible. While it is recommended that homogenization be the strategy, where this is not possible, the compute components should at least be standardized to the fullest extent possible.
Resource Pooling
Leveraging a shared pool of compute resources is key. This Resource Pool is a collection of shared compute, storage, and network resources that creates the fabric that hosts virtualized workloads. Subsets of these resources are allocated to customers as needed and, conversely, returned to the pool when they are no longer needed. Ideally, the Resource Pool should be as homogenized and standardized as possible.
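The allocate-and-return behavior of a Resource Pool can be sketched in a few lines. This is a minimal model that assumes homogenized capacity, so every unit is interchangeable; the `ResourcePool` class and its method names are hypothetical and do not correspond to any product API.

```python
class ResourcePool:
    """A pool of interchangeable capacity units (hypothetical model)."""

    def __init__(self, total_units):
        self.total_units = total_units
        self.allocated = {}  # consumer name -> units currently held

    @property
    def free_units(self):
        return self.total_units - sum(self.allocated.values())

    def allocate(self, consumer, units):
        """Hand a subset of the pool to a consumer."""
        if units <= 0:
            raise ValueError("units must be positive")
        if units > self.free_units:
            raise RuntimeError("insufficient capacity in pool")
        self.allocated[consumer] = self.allocated.get(consumer, 0) + units

    def release(self, consumer, units):
        """Return capacity to the pool when it is no longer needed."""
        held = self.allocated.get(consumer, 0)
        if units <= 0 or units > held:
            raise ValueError("cannot release more than is held")
        if units == held:
            del self.allocated[consumer]
        else:
            self.allocated[consumer] = held - units


pool = ResourcePool(total_units=100)
pool.allocate("finance", 30)
pool.allocate("hr", 20)
pool.release("finance", 10)
print(pool.free_units)
```

Because the units are identical, the pool only needs to track counts, not which physical server backs each allocation.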
Virtualization
Virtualization is the abstraction of hardware components into logical entities. Although virtualization occurs differently in each infrastructure component (server, network, and storage), the benefits are generally the same, including little or no downtime during resource management tasks, enhanced portability, simplified management of resources, and the ability to share resources. Virtualization is the catalyst for the other concepts, such as Elastic Infrastructure, Partitioning of Shared Resources, and Resource Pooling. The virtualization of infrastructure components needs to be seamlessly integrated to provide a fluid infrastructure that is capable of growing and shrinking on demand and that provides global or partitioned resource pools of each component.
Fabric Management
Fabric is the term applied to the collection of Resource Pools. Fabric Management is a level of abstraction above virtualization; in the same way that virtualization abstracts physical hardware, Fabric Management abstracts services from specific hypervisors and network switches. Fabric Management can be thought of as an orchestration engine, which is responsible for managing the life cycle of a consumer’s workload (one or more VMs which collectively deliver a service). Fabric Management responds to service requests (e.g. to provision a new VM or set of VMs), Systems Management events (e.g. moving/restarting VMs as a result of a warning or failure), and Service Management policies (e.g. adding another VM to a consumer workload in response to load).
Traditionally, servers, network and storage have been managed separately, often on a project-by-project basis. To ensure resiliency we must be able to automatically detect if a hardware component is operating at a diminished capacity or has failed. This requires an understanding of all of the hardware components that work together to deliver a service, and the interrelationships between these components. Fabric Management provides this understanding of interrelationships to determine which services are impacted by a component failure. This enables the Fabric Management system to determine if an automated response action is needed to prevent an outage, or to quickly restore a failed service onto another host within the fabric.
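The idea of mapping a component failure to the services it affects can be illustrated with a small dependency lookup. This is only a sketch: the dependency map, the component and service names, and both functions are hypothetical, and a real Fabric Management system would discover these relationships rather than hard-code them.

```python
# Hypothetical dependency map: service name -> components it relies on.
SERVICE_DEPENDENCIES = {
    "payroll": {"host-01", "switch-a", "lun-7"},
    "intranet": {"host-02", "switch-a", "lun-9"},
    "reporting": {"host-03", "switch-b", "lun-9"},
}


def impacted_services(failed_component, dependencies):
    """Return the services that depend on the failed component, sorted by name."""
    return sorted(
        service
        for service, components in dependencies.items()
        if failed_component in components
    )


def response_action(failed_component, dependencies):
    """Choose a coarse response: nothing to do, or restore the affected services."""
    affected = impacted_services(failed_component, dependencies)
    if not affected:
        return "no action: no service depends on " + failed_component
    return "restore on another host: " + ", ".join(affected)


print(response_action("switch-a", SERVICE_DEPENDENCIES))
print(response_action("host-99", SERVICE_DEPENDENCIES))
```

Keeping the relationships in one place is what lets a single hardware event be translated into a service-level decision.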
From a provider’s point of view, the Fabric Management system is key in determining the amount of Reserve Capacity available and the health of existing fabric resources. This also ensures that services are meeting the defined service levels required by the consumer.
Elastic Infrastructure
The concept of an elastic infrastructure enables the perception of infinite capacity. An elastic infrastructure allows resources to be allocated on demand and, more importantly, returned to the Resource Pool when no longer needed. The ability to scale down when capacity is no longer needed is often overlooked or undervalued, resulting in server sprawl and lack of optimization of resource usage. It is important to use consumption-based pricing to incent consumers to be responsible in their resource usage. Automated or customer-request-based triggers determine when compute resources are allocated or reclaimed.
Achieving an elastic infrastructure requires close alignment between IT and the business, as peak usage and growth rate patterns need to be well understood and planned for as part of Capacity Management.
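An automated trigger of the kind described above can be sketched as a simple threshold rule. The function name, the utilization thresholds, and the VM limits below are all assumptions chosen for illustration; real policies would come from Capacity Management.

```python
def desired_vm_count(current_vms, utilization,
                     scale_up_at=0.80, scale_down_at=0.30,
                     min_vms=1, max_vms=10):
    """Return the VM count a workload should have after one evaluation.

    utilization is the average load across the current VMs, from 0.0 to 1.0.
    """
    if utilization > scale_up_at:
        # Under pressure: add one VM, but never exceed the allowed maximum.
        return min(current_vms + 1, max_vms)
    if utilization < scale_down_at:
        # Idle capacity: return one VM to the Resource Pool, keeping the minimum.
        return max(current_vms - 1, min_vms)
    return current_vms


for load in (0.95, 0.50, 0.10):
    print(load, desired_vm_count(current_vms=3, utilization=load))
```

The scale-down branch matters as much as the scale-up branch: without it, capacity is never returned to the pool and the workload sprawls.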
Partitioning of Shared Resources
Sharing resources to optimize usage is a key principle; however, it is also important to understand when these shared resources need to be partitioned. While a fully shared infrastructure may provide the greatest optimization of cost and agility, there may be regulatory requirements, business drivers, or issues of multi-tenancy that require various levels of resource partitioning. Partitioning strategies can occur at many layers, such as physical isolation or network partitioning. Much like redundancy, the lower in the stack this isolation occurs, the more expensive it is. Additional hardware and Reserve Capacity may be needed for partitioning strategies such as separation of resource pools. Ultimately, the business will need to balance the risks and costs associated with partitioning strategies, and the infrastructure will need the capability of providing a secure method of isolating the infrastructure and network traffic while still benefiting from the optimization of shared resources.
Resource Decay
Treating infrastructure resources as a single Resource Pool allows the infrastructure to experience small hardware failures without significant impact on the overall capacity. Traditionally, hardware is serviced using an incident model, where the hardware is fixed or replaced as soon as there is a failure. By leveraging the concept of a Resource Pool, hardware can be serviced using a maintenance model. A percentage of the Resource Pool can fail because of “decay” before services are impacted and an incident occurs. Failed resources are replaced on a regular maintenance schedule or when the Resource Pool reaches a certain threshold of decay instead of a server-by-server replacement.
The Decay Model requires the provider to determine the amount of “decay” they are willing to accept before infrastructure components are replaced. This allows for a more predictable maintenance cycle and reduces the costs associated with urgent component replacement.
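The threshold check at the heart of the Decay Model is straightforward to express. In this sketch the 5 percent threshold, the pool size, and the function names are illustrative assumptions; the acceptable level of decay is a decision each provider makes for itself.

```python
def decay_fraction(total_servers, failed_servers):
    """Fraction of the Resource Pool that has failed and not yet been replaced."""
    if total_servers <= 0 or not 0 <= failed_servers <= total_servers:
        raise ValueError("invalid server counts")
    return failed_servers / total_servers


def maintenance_due(total_servers, failed_servers, decay_threshold=0.05):
    """True once accumulated decay reaches the provider's accepted threshold."""
    return decay_fraction(total_servers, failed_servers) >= decay_threshold


# Hypothetical pool of 200 servers with a 5 percent decay threshold.
for failed in (3, 9, 10):
    print(failed, maintenance_due(total_servers=200, failed_servers=failed))
```

Individual failures below the threshold trigger no incident; they simply accumulate until the scheduled or threshold-driven maintenance window.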
Service Classification
Service classification is an important concept for driving predictability and incenting consumer behavior. Each service class will be defined in the provider’s service catalog, describing service levels for availability, resiliency, reliability, performance, and cost. Each service must meet pre-defined requirements for its class. These eligibility requirements reflect the differences in cost when resiliency is handled by the application versus when resiliency is provided by the infrastructure.
The classification allows consumers to select the service they consume at the price and quality point that is appropriate for their requirements. The classification also allows the provider to adopt a standardized approach to delivering a service, which reduces complexity and improves predictability, thereby resulting in a higher level of service delivery.
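One way to picture a service catalog is as a small table of classes from which the consumer picks the cheapest entry that meets their requirements. Everything in this sketch is hypothetical: the class names, availability targets, prices, and the `cheapest_class` helper are invented for illustration and are not taken from any real catalog.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceClass:
    name: str
    availability_target: float        # e.g. 0.999 for "three nines"
    infrastructure_resiliency: bool   # False: the application handles resiliency
    price_per_vm_hour: float


# Hypothetical catalog; real classes and prices are set by the provider.
CATALOG = [
    ServiceClass("bronze", 0.99, False, 0.05),
    ServiceClass("silver", 0.999, True, 0.09),
    ServiceClass("gold", 0.9999, True, 0.15),
]


def cheapest_class(catalog, required_availability, needs_infrastructure_resiliency):
    """Return the lowest-priced class meeting the requirements, or None."""
    candidates = [
        c for c in catalog
        if c.availability_target >= required_availability
        and (c.infrastructure_resiliency or not needs_infrastructure_resiliency)
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda c: c.price_per_vm_hour)


choice = cheapest_class(CATALOG, required_availability=0.99,
                        needs_infrastructure_resiliency=True)
print(choice.name if choice else "no class meets the requirements")
```

An application that handles its own resiliency qualifies for the cheaper class, which is how the catalog reflects the cost difference described above.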
Cost Transparency
Cost transparency is a fundamental concept for taking a service provider’s approach to delivering infrastructure. In a traditional data center, it may not be possible to determine what percentage of a shared resource, such as infrastructure, is consumed by a particular service. This makes benchmarking services against the market an impossible task. By defining the cost of infrastructure through service classification and consumption modeling, a more accurate picture of the true cost of utilizing shared resources can be gained. This allows the business to make fair comparisons of internal services to market offerings and enables informed investment decisions.
Cost transparency also incents service owners to think about service retirement. In a traditional data center, services may fall out of use, but often no consideration is given to how to retire an unused service. The cost of ongoing support and maintenance for an under-utilized service may be hidden in the cost model of the data center. Monthly consumption costs for each service can be provided to the business, incenting service owners to retire unused services and reduce their cost.
Consumption Based Pricing
This is the concept of paying for what you use as opposed to paying a fixed cost irrespective of the amount consumed. In a traditional pricing model, the consumer’s cost is based on flat costs derived from the capital cost of hardware and software and the expenses to operate the service. In this model, services may be over- or underpriced based on actual usage. In a consumption-based pricing model, the consumer’s cost reflects their usage more accurately.
The unit of consumption is defined in the service class and should reflect, as accurately as possible, the true cost of consuming infrastructure services, the amount of Reserve Capacity needed to ensure continuous availability, and the user behaviors that are being incented.
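The difference between the two pricing models shows up clearly in a small worked example. The sketch below assumes a unit of consumption of one VM-hour and a monthly infrastructure cost that already includes Reserve Capacity; the consumer names, usage figures, and function names are hypothetical.

```python
def flat_charges(total_monthly_cost, usage_by_consumer):
    """Traditional model: every consumer pays an equal share, whatever they use."""
    share = total_monthly_cost / len(usage_by_consumer)
    return {consumer: share for consumer in usage_by_consumer}


def consumption_charges(total_monthly_cost, usage_by_consumer):
    """Consumption model: cost is spread per unit actually consumed."""
    total_units = sum(usage_by_consumer.values())
    if total_units == 0:
        raise ValueError("no consumption to charge against")
    unit_rate = total_monthly_cost / total_units
    return {consumer: units * unit_rate
            for consumer, units in usage_by_consumer.items()}


# Hypothetical month: 9,000 in total cost and usage measured in VM-hours.
usage = {"finance": 100, "hr": 500, "sales": 900}
print(flat_charges(9000.0, usage))
print(consumption_charges(9000.0, usage))
```

Under the flat model the lightest consumer subsidizes the heaviest; under the consumption model each charge tracks usage, which is what gives consumers a reason to release capacity they no longer need.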
(Thanks to authors Kevin Sangwell, Laudon Williams & Monte Whitbeck (Microsoft) for allowing me to revise and share)