Microsoft’s approach evolved from its own experience of operating large data centers around the world. In its capacity as a data center operator, Microsoft’s Global Foundation Services (GFS) provides the infrastructure used for Microsoft’s internal Enterprise IT Services and the infrastructure used for external services like Bing™, Azure™, Hotmail™, Xbox Live®, and the Business Productivity Online Suite (BPOS).
To achieve the cost and service quality targets, Microsoft GFS has focused on understanding system health, automating the “detect and respond” capability, improving resiliency, and using strong Service Management processes to challenge many deeply held beliefs and best practices. GFS is truly delivering Infrastructure as a Service (IaaS) at the highest scale and is duty bound to use its learning and experience to help Microsoft’s partners and customers.
The principles outlined in this section provide general rules and guidelines to support Private Cloud computing. They are enduring, seldom amended, and inform and support the way a Private Cloud fulfills its mission and goals. These principles are often interdependent and together form the basis on which a Private Cloud is created. (Let us not forget the context in which these principles arise, which is described in the previous post.)
Perception of Infinite Capacity
From a consumer’s perspective, Cloud Services appear to have infinite capacity. The consumer can use as much or as little of the service as needed. Using the “electric utility provider” as a metaphor, consumers consume as much of the service as they need. This utility mindset means that Capacity Planning is of paramount importance and must be done proactively so that requests can be serviced on demand. Reactive capacity planning done in isolation leads to inefficient use of resources and avoidable costs. Combined with other principles, such as incenting desired consumer behavior, this principle allows for a balance between the cost of unused capacity and the desire for agility.
In a Private Cloud scenario, which is offered to only one organization or business, this can be problematic. With an internal IT Cloud provider, the business unit is the consumer. In this case capacity is, of course, very costly and not infinite, nor should it be infinitely expandable (VM sprawl, anyone?). IT Infrastructure capacity should reflect an appropriate ratio of cost (investment) to value (return), based on what that business unit provides to the organization. This puts great emphasis on later principles such as incenting the desired consumer behavior, but in this case it will also test the maturity and strength of IT’s relationship with the business and will typically require that relationship to evolve rapidly.
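To make the idea of proactive capacity planning a little more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the function name `plan_capacity`, the straight-line growth forecast, the procurement lead time, and the 20% reserve are all hypothetical choices, not anything prescribed by the principle or used by GFS.

```python
from dataclasses import dataclass


@dataclass
class CapacityPlan:
    forecast_demand: float  # projected demand at the end of the lead time
    headroom: float         # installed capacity minus forecast demand
    order_now: bool         # True if more capacity should be ordered today


def plan_capacity(installed, demand_history, lead_time_periods, reserve_fraction=0.2):
    """Project demand forward and decide whether to order capacity now.

    Assumption: demand grows by its average historical change per period.
    A real planner would use a richer forecast; this is a sketch only.
    """
    if len(demand_history) < 2:
        raise ValueError("need at least two demand samples")
    # Average change in demand per period over the observed history.
    growth = (demand_history[-1] - demand_history[0]) / (len(demand_history) - 1)
    # Demand expected by the time newly ordered capacity could arrive.
    forecast = demand_history[-1] + growth * lead_time_periods
    # Keep a reserve on top of the forecast so requests can be met on demand.
    required = forecast * (1 + reserve_fraction)
    return CapacityPlan(forecast, installed - forecast, required > installed)


plan = plan_capacity(installed=100, demand_history=[50, 60, 70], lead_time_periods=2)
print(plan.order_now)
```

The reserve fraction is where the trade-off between the cost of unused capacity and the desire for agility shows up: a larger reserve buys agility, a smaller one saves money.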
Predictability
Predictability is a fundamental principle for a Cloud from all perspectives, whether you are a consumer or a provider. From the vantage point of the consumer, Cloud Services should be consistent; they should have the same quality and functionality any time they are used.
A provider must deliver an underlying infrastructure that assures a consistent experience to the hosted workloads in order to achieve this predictability. This consistency is achieved through the homogenization of underlying physical servers, network devices, and storage systems.
From the provider’s Service Management perspective, this predictability is driven through the standardization of service classes in the service catalog as well as standardization of the processes. The principle of predictability is necessary for driving service quality.
Service Provider’s Approach to Delivering Infrastructure
Historically, whenever IT was asked to deliver a service to the business, they would purchase the necessary components and then build an infrastructure specific to the service requirements. This frequently results in longer times to market, increased costs (due to duplicate infrastructure), and inability to meet the business expectations of agility and cost reduction. Further compounding the problem, this model is often used when an existing service needs to be expanded or upgraded.
The principle of taking a Service Provider’s approach to delivering infrastructure transforms IT’s approach. If the infrastructure is provided as a service, IT can leverage a shared resource (multi-tenancy) model that achieves economies of scale and, when combined with the other principles, greater agility.
Resiliency over Redundancy Mindset
Traditionally, IT has provided highly available services through redundancy. If a necessary component for providing the service were to fail, a redundant component would be standing by to pick up the workload. Redundancy is often applied at all layers of the stack, as each layer does not trust that the layer below will be highly available. This redundancy, particularly at the Infrastructure Layer, comes at a premium price in capital as well as operational costs.
A key principle of a Private Cloud is to provide highly available services through resiliency. Instead of designing for failure prevention, the design accepts and expects that components will eventually fail, and focuses on mitigating the impact of failure and rapidly restoring service when a failure occurs. Through virtualization, real-time detection, and automated response to health states, workloads can be moved off failing infrastructure components, often with no perceived impact on the service. If redundancy is handled at the application layer, it can be removed from the Infrastructure Layer, thereby saving substantial costs.
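As a rough illustration of "detect and respond", the following Python sketch moves workloads off hosts flagged as unhealthy and onto healthy hosts with free capacity. Everything here is hypothetical: the `Host` class, the boolean `healthy` flag, and the "most free slots first" placement rule are assumptions made for the example, not a description of any real virtualization or management product.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    name: str
    capacity: int                      # maximum number of workloads
    healthy: bool = True               # assumed to be set by a health monitor
    workloads: list = field(default_factory=list)


def evacuate_unhealthy(hosts):
    """Move workloads from unhealthy hosts to healthy hosts with free slots.

    Returns a list of (workload, source, target) moves. Workloads that
    cannot be placed stay where they are.
    """
    moves = []
    healthy = [h for h in hosts if h.healthy]
    for src in hosts:
        if src.healthy:
            continue
        for workload in list(src.workloads):
            candidates = [h for h in healthy if len(h.workloads) < h.capacity]
            if not candidates:
                break  # no spare capacity anywhere; leave the rest in place
            # Placement rule (an assumption): pick the host with the most free slots.
            target = max(candidates, key=lambda h: h.capacity - len(h.workloads))
            src.workloads.remove(workload)
            target.workloads.append(workload)
            moves.append((workload, src.name, target.name))
    return moves


hosts = [
    Host("host-a", 2, healthy=False, workloads=["web", "db"]),
    Host("host-b", 2, workloads=["cache"]),
    Host("host-c", 2),
]
print(evacuate_unhealthy(hosts))
```

Note that the sketch only works if the healthy hosts have spare capacity, which ties resiliency back to capacity planning.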
Automation and Orchestration
A core principle of a Cloud is its capability to minimize human involvement throughout the entire IT life cycle of the environment. A well-designed Private Cloud has the capability to perform operational tasks dynamically, detect and respond automatically to failure conditions in the environment, and elastically add or reduce capacity as workloads require. It is important to note that there is a continuum between manual and automated intervention that must be defined.
A manual process is one in which every step requires human intervention. A mechanized process is one in which some steps are automated, but some human intervention is still required (such as detecting that a process should be initiated or starting a script). To be truly automated, no aspect of a process, from detection through to response, should require any human intervention. Minimizing human involvement through automation is necessary for realizing many of the other principles.
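The three points on this continuum can be expressed as a small classification rule. The Python sketch below is just one way to write it down; the names `AutomationLevel` and `classify_process`, and the idea of describing a process as a mapping from step name to "needs a human", are assumptions made for illustration.

```python
from enum import Enum


class AutomationLevel(Enum):
    MANUAL = "manual"
    MECHANIZED = "mechanized"
    AUTOMATED = "automated"


def classify_process(steps):
    """Classify a process from a mapping of step name -> needs_human (bool)."""
    if not steps:
        raise ValueError("a process needs at least one step")
    needs_human = list(steps.values())
    if all(needs_human):
        return AutomationLevel.MANUAL        # every step needs a person
    if not any(needs_human):
        return AutomationLevel.AUTOMATED     # detection through response is hands-off
    return AutomationLevel.MECHANIZED        # some steps scripted, some not


# A script does the repair, but a person still has to notice the problem
# and start it, so the process is only mechanized.
patching = {"detect": True, "start_script": True, "apply_fix": False}
print(classify_process(patching))
```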
Optimization of Resource Usage
Resource optimization drives efficiency and cost reduction and is primarily achieved through resource sharing. Having a shared services platform is frequently referred to as multi-tenancy. In some cases these tenants should already be sharing infrastructure services but, because of departmental silos, do not today. In other cases they normally would not share infrastructure services, but the new platform’s strong isolation and management now allow them to. Allowing multiple consumers to share resources results in higher resource utilization and a more efficient and effective use of the infrastructure. Optimization through abstraction enables many of the other principles and ultimately helps drive down costs and improve agility.
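A small worked example may help show why sharing raises utilization. In the Python sketch below, each tenant's demand peaks in a different period, so a shared pool sized for the combined peak is much smaller than the sum of dedicated silos each sized for its own peak. The tenant names and demand figures are invented for illustration, and the sketch assumes every tenant's demand series covers the same periods.

```python
def dedicated_capacity(demand_by_tenant):
    """Capacity needed if every tenant gets its own silo sized for its own peak."""
    return sum(max(series) for series in demand_by_tenant.values())


def shared_capacity(demand_by_tenant):
    """Capacity needed if all tenants share one pool sized for the combined peak."""
    return max(sum(period) for period in zip(*demand_by_tenant.values()))


def average_utilization(demand_by_tenant, capacity):
    """Fraction of the given capacity that is actually used, averaged over all periods."""
    series = list(demand_by_tenant.values())
    periods = len(series[0])
    total_demand = sum(sum(s) for s in series)
    return total_demand / (capacity * periods)


# Hypothetical demand per period for three business units.
demand = {
    "finance": [8, 2, 2, 2],
    "hr":      [2, 8, 2, 2],
    "sales":   [2, 2, 8, 2],
}

silo = dedicated_capacity(demand)   # 8 + 8 + 8
pool = shared_capacity(demand)      # busiest combined period
print(silo, average_utilization(demand, silo))
print(pool, average_utilization(demand, pool))
```

With these made-up numbers, the shared pool needs half the capacity of the silos and runs at twice the average utilization. The benefit shrinks if all tenants peak at the same time.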
Incentivize Desired Consumer Behavior
Many of the fundamental principles of a Private Cloud may encourage consumer behaviors that have a negative impact on some or all of its goals. The perception of infinite capacity may result in consumers using capacity as a replacement for effective workload management at the software level. For example, virtualization technologies can often lead to server sprawl, where VMs are created on demand, but there are no mechanisms for removing VMs when they are no longer needed. This may be perceived as an improvement in the quality and agility of a service, but negatively impacts the cost of the infrastructure required to achieve these goals. Therefore, encouraging desired consumer behavior toward service consumption is a key principle in achieving the desired cost savings. In the electrical utility example, consumers are encouraged to use less, and are charged a lower multiplier when utilization is below an agreed threshold. If they reach the upper bounds of the threshold, a higher multiplier kicks in as additional resources are consumed.
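The utility-style pricing described above can be sketched in a few lines of Python. This is a hedged illustration only: the function name `tiered_charge`, the base rate, and the multiplier values of 1.0 and 1.5 are assumptions, and a real chargeback model would be agreed between IT and the business.

```python
def tiered_charge(units_used, threshold, base_rate,
                  low_multiplier=1.0, high_multiplier=1.5):
    """Charge units up to the agreed threshold at the lower multiplier
    and any units beyond it at the higher multiplier."""
    if units_used < 0 or threshold < 0:
        raise ValueError("units_used and threshold must be non-negative")
    within = min(units_used, threshold)        # billed at the lower multiplier
    above = max(units_used - threshold, 0)     # billed at the higher multiplier
    return base_rate * (within * low_multiplier + above * high_multiplier)


# A consumer who stays under the threshold pays the lower rate throughout;
# one who exceeds it pays more for each additional unit.
print(tiered_charge(units_used=80, threshold=100, base_rate=2.0))
print(tiered_charge(units_used=120, threshold=100, base_rate=2.0))
```

Because only the units above the threshold attract the higher multiplier, the consumer has a continuing incentive to release resources, such as idle VMs, that they no longer need.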
Again, cost transparency is key. Many enterprises may actually have much stricter controls on resource consumption, in which case simply influencing behavior through metering and chargeback will not be adequate. In these cases the relationship management element is paramount.
(Thanks to authors Kevin Sangwell, Laudon Williams & Monte Whitbeck (Microsoft) for allowing me to revise and share)
p.s. I was writing this at 10:10, 10/10/10