The Importance of SLAs - Enterprise Grade Cloud

Going to the cloud can be a big decision for some companies. Whether you are a small, home-based business or a Fortune 100 brand, someone in the business likely made time to read the Service Level Agreement (SLA) provided by your cloud provider.

Microsoft has approached the cloud with an understanding that commercial customers have high expectations for SLAs. In essence, an “IT Social Contract” emerges in which the relationship between customer and cloud provider becomes a mutually reinforcing partnership for success. SLAs are one attempt to reflect on paper the spirit of that partnership. (Roadmaps and the ability to opt in to updates are related topics, but those are for future posts.)

Yesterday’s news of the widespread Google Docs outage got me thinking about the differences in our respective approaches to ‘support’ and ‘SLA’ for enterprises. First, let me state for the record that there is ZERO element of schadenfreude here. I am not a Google hater, although I did sort of like the Dennis Miller reference. I hear the Catskills are lovely this time of year. (Yes, that’s a joke. I try to have at least one.) I just think there needs to be an honest discussion about how we differ.

For the cloud to be successful for everyone, all vendors must continue to push the web delivery model forward. And because it’s all software, at some point it might go down. That means Microsoft, Amazon, SalesForce.com, Google, et al. have had outages that affected users and will at some point have them again. The cloud does not make points of failure (hardware, network, software) go away, but it does provide economies of scale to mitigate them. This is why you hear so much discussion across the industry about building for scale.

Two things made me write this post.

First, the outage appears to have affected users of the ‘new’ version released last week to mixed reviews (not mixed personally). This means the experimental, ‘preview’ version, which was released into the mission-critical environment of every paying customer, took the system down. This feels eerily similar to the outage that occurred a year ago, when Google deployed “rogue code” into their data centers and caused a massive Gmail outage. “That’s just not acceptable,” said Matt Cain, an analyst at Gartner. “It was poor thinking-through of a code change. In a corporate environment, you can’t just tell your CEO it was bad luck.”

The second thing is Google’s lack of support for anything released into Google Labs. All end users can enable these experimental features in their mission-critical environment, but Google will not support the code. This is why Google was able to pull Google Gears from the market without blinking an eye. It was a “Labs” feature and therefore apparently not subject to Google’s SLA or terms of service.

Both point to a culture of experimentation on the customer’s watch. It’s one thing to innovate and deliver new features. It’s another to do it in the mission-critical environment and take down the system OR simply say ‘don’t call us’. Google’s 20% innovation time is widely known, but how much of what they build are they actually going to support for your organization?

Why does this matter?

For customers, it comes down to how we each build our SLAs for commercial use. The chart below is my attempt to compare them side by side.

For example, why does Google have a design goal of ‘zero’ RTO but then give themselves a 10-minute grace period for any outage? That’s weird. (See the TechCrunch article for more.) Microsoft Online does not have such a limit. Our clock begins immediately upon any outage. That is partnership for the enterprise.
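To make the effect of a grace period concrete, here is a minimal Python sketch. The outage log, the 30-day month, and the function name are all made up for illustration, and the rule that an outage counts only once it reaches 10 minutes is my reading of "less than 10 minutes does not count", not a quote from either SLA.

```python
# Hypothetical outage log for one 30-day month: length of each outage in minutes.
# These numbers are invented for illustration only.
outages_min = [4, 8, 9, 30]

MINUTES_IN_MONTH = 30 * 24 * 60  # assumes a 30-day month


def uptime_pct(outages, grace_min=0):
    """Monthly uptime percentage, ignoring outages shorter than grace_min minutes."""
    counted = sum(d for d in outages if d >= grace_min)
    return 100.0 * (MINUTES_IN_MONTH - counted) / MINUTES_IN_MONTH


# Clock starts immediately: every minute of every outage counts.
immediate = uptime_pct(outages_min)

# 10-minute grace period: the three short outages are not counted at all.
with_grace = uptime_pct(outages_min, grace_min=10)

print(f"immediate clock: {immediate:.3f}%")
print(f"10-minute grace: {with_grace:.3f}%")
```

With this made-up log, the same month falls below 99.9% when the clock starts immediately and stays above it when short outages are excluded, which is exactly why the definition of downtime matters as much as the headline percentage.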

SLA penalties are also very different. Google claims ‘financially backed’ but in essence extends your service. If monthly uptime falls below 95%, the maximum credit is about 2 weeks of service tacked on to the end of your contract, so you can benefit from more innovation? That really seems to benefit Google more than the customer. Microsoft Online, on the other hand, provides money back to the customer, up to 100%. That is partnership for the enterprise. If we don’t do our job, you shouldn’t pay. That benefits the customer, not the vendor.

There are other areas to explore, but I’d like your take on these differences. Do SLAs matter for the IT Pro who owns the solution for the business?

Andrew Kisslo - https://twitter.com/akisslo

SLA term                          | Microsoft Online          | Google
----------------------------------+---------------------------+------------------------------------
Downtime Limitations              | Immediate upon outage     | Less than 10 minutes does not count
Credit by Monthly Uptime          | Service Credit ($)        | Service Credit (days of service)
  Percentage                      |   < 99.9%: 25%            |   < 99.9% to ≥ 99.0%: 3
                                  |   < 99%: 50%              |   < 99.0% to ≥ 95.0%: 7
                                  |   < 95%: 100%             |   < 95.0%: 15
Planned Downtime Notification     | 5 days                    | 5 days
Maximum Yearly Planned Downtime   | 10 Hours                  | 12 Hours
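
For anyone who wants to plug in their own numbers, the credit tiers in the chart above can be read as a simple lookup. This is only a sketch: the function names are mine, the money-back schedule is assumed to be the Microsoft Online column and the days-of-service schedule the Google column (as described in the post), and neither function reflects the full terms of either SLA.

```python
def money_back_credit_pct(uptime_pct):
    """Credit as a percentage of the fee, per the money-back tiers in the chart."""
    if uptime_pct < 95.0:
        return 100
    if uptime_pct < 99.0:
        return 50
    if uptime_pct < 99.9:
        return 25
    return 0  # SLA met, no credit


def service_days_credit(uptime_pct):
    """Credit as extra days of service, per the days-of-service tiers in the chart."""
    if uptime_pct < 95.0:
        return 15
    if uptime_pct < 99.0:
        return 7
    if uptime_pct < 99.9:
        return 3
    return 0  # SLA met, no credit


# Compare the two schedules for a few sample monthly uptime figures.
for uptime in (99.95, 99.5, 97.0, 94.0):
    print(
        f"uptime {uptime}%: "
        f"{money_back_credit_pct(uptime)}% money back vs. "
        f"{service_days_credit(uptime)} extra days of service"
    )
```

The thresholds line up between the two schedules; what differs is the currency of the credit, which is the point of the comparison.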