Where, Oh Where Did All of the Microsoft.com SQL Clusters Go?

As the Group Manager for the database engineers who run Microsoft.com, I've been asked repeatedly about Microsoft Clustering Services (MSCS) and where we're using it. After all, Microsoft.com is one of the most visited web sites on the internet, and with SQL Server 2000 failover clustering was the top-of-the-line high availability solution for databases. Yet of the 2516 user databases we run, none are in an MSCS cluster. I've had the opportunity to talk with Microsoft customers, and many of them use MSCS to increase availability for their SQL Servers, so why isn't Microsoft.com also using this HA technology?

The fact of the matter is that, up until a couple of years ago, we did have Microsoft.com database systems running on MSCS clusters. So what happened to them? The short answer is that we determined MSCS did not solve the architectural problem we were trying to solve. Before I explain why, I want to be crystal clear that I am in no way saying MSCS is not a viable HA solution; without question it is. However, I would contend that MSCS is not the right HA fit for every system.

What are the benefits of MSCS? I don't intend to offer a comprehensive drill-down into clustering here; if you want more background on SQL Server 2000 clustering, check out this TechNet article (SQL Server 2000 Failover Clustering). In a nutshell, clustering protects you from hardware failure in the server heads (motherboard, CPU, memory, etc.) and lets you patch the operating system one node at a time. There are more scenarios, but again it's not my intention to list them all. When the cluster fails over to another node, it is essentially the same as stopping and starting SQL Server: transactions that were in flight are rolled forward or rolled back, depending on whether they had committed. So when a server fails, the cluster detects it and automatically fails over to another node, without an administrator being involved. Your application takes a small availability hit during the failover, but as soon as that's complete you're back online.

Sounds great, doesn't it? And for many SQL Server 2000 systems this is a perfect fit. I mentioned that this solution didn't solve the problem we were trying to solve. Why not? The answer is geographical redundancy and the business we're in. We take the availability of our web applications very seriously at Microsoft.com, and we serve content globally from multiple datacenters, so we do not have the luxury of scheduled maintenance windows to bring a system down. We also do not put all of our eggs in one basket by locating our database servers in a single datacenter. Geoclustering is an option, and yes, we have considered going down this path. However, the cost of purchasing hardware capable of geoclustering, along with the connectivity requirements, makes it a very costly solution. You may be thinking, "Come on, you work at Microsoft and you're telling me it costs too much money?" While it is expensive, that's not what I'm telling you. If we were in a different business (banking, medical, the list goes on), the expense would completely make sense. For our business we chose a less costly solution that we think scales nicely, and we've mitigated the risks with system engineering, process, and monitoring.

How do we do it? I'll first talk about SQL Server 2000, then about how we're using SQL Server 2005. We have a few basic architectures, but for simplicity I'm going to focus on two. The first scenario involves read-only database requests from web applications. Since we're running systems behind a web site, there are many applications where users on the internet are pulling read-only data from SQL Servers. Updates do occur, but they are initiated by someone inside Microsoft who wants to update content on their database-driven web site or application. So when the publisher for a site or application on Microsoft.com makes changes to a database that faces our corporate network, we replicate those changes to multiple read-only servers in multiple datacenters. These read-only database servers are configured in Network Load Balancing (NLB) clusters, which gives us a virtual IP that our web servers point to and that never changes. If a server fails, or we need to do maintenance on one, we remove it from the NLB cluster and, from an internet user's perspective, everything keeps working. If we lose a datacenter, our Web Operations team can direct internet traffic to other datacenters that have read-only copies of the same database. If we lose the database, server, or datacenter where the publishing database is located, a site publisher is sitting on their hands until service is restored, but the key is that this does not impact internet users.
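To make that publishing flow a little more concrete, here is a minimal sketch of what one-way transactional replication from a corporate-facing publishing database out to the read-only subscribers could look like in T-SQL. The database, publication, and server names are hypothetical, the sketch assumes a distributor is already configured, and snapshot initialization of the subscribers is omitted; it is not our exact script.

    -- Hypothetical corporate-facing publishing database (ContentDB).
    -- Assumes the distributor is already configured.
    EXEC sp_replicationdboption
        @dbname  = N'ContentDB',
        @optname = N'publish',
        @value   = N'true';

    EXEC sp_addpublication
        @publication = N'ContentDB_ReadOnly',
        @repl_freq   = N'continuous',
        @status      = N'active';

    -- Publish the tables the web site reads from (one example article shown).
    EXEC sp_addarticle
        @publication   = N'ContentDB_ReadOnly',
        @article       = N'PageContent',
        @source_object = N'PageContent';

    -- One push subscription per read-only server behind NLB, in each datacenter.
    EXEC sp_addsubscription
        @publication       = N'ContentDB_ReadOnly',
        @subscriber        = N'DC1RODB01',
        @destination_db    = N'ContentDB',
        @subscription_type = N'Push';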

The second scenario involves systems that are writing information from internet users. This is where we once used MSCS. When we looked at availability within a single datacenter, MSCS was beautiful; however, when we stepped back and weighed the technical pros and cons against how much the solution would cost, we chose another route. To date, we have had very few servers completely fail without warning. Most of our system downtime is the result of planned maintenance, by which I mean work done in the facilities that host our servers, server hardware maintenance, OS patches, SQL Server patches, or application updates. Of those five, MSCS was only going to protect us from two: OS patching and server hardware maintenance (unless it was maintenance to the shared disk, of course). Geoclustering would cover up to three of the five: OS patching, server hardware maintenance, and facility maintenance. I'm making generalizations, and of course there are ways to mitigate some of these scenarios.

At this point we had to ask ourselves whether log shipping behind NLB might be a better solution. By pairing these two technologies we had a poor man's cluster. We could take two standard servers, with no special hardware at all, configure them in an NLB cluster together, converge one server into the cluster, and log ship to the unconverged server. We got the benefit of a single virtual IP to point our web applications at, nothing was shared, the technologies involved are fairly simple to use, and it was inexpensive. What we lost was automatic failover, but we've mitigated that by adding process and monitoring. Those of you who have used NLB may be thinking, "Hey, wait a second. NLB clusters don't span datacenters." You're right; our NLB clusters do not span datacenters. But you can log ship a copy of a database to more than one server, and those servers can be located anywhere you want. If we lose a datacenter, we have our Web Operations team update connection strings to point to another datacenter, where we recover a log shipped database. Is it automatic? No, not in our case. We also have the potential for a small amount of data loss during unplanned events, which does not exist with MSCS. We typically back up and copy the transaction log every minute, and restores occur every 5 minutes on the log shipping secondaries. During planned maintenance we do have a small outage while we fail over to the log shipping secondary, but remember that you take a small outage with MSCS as well. This architecture has worked very well for the last several years. It's easy to support, it's inexpensive, and it's flexible.
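For anyone who hasn't set up log shipping by hand, the moving parts are just scheduled transaction log backups, copies, and restores. The following is a bare-bones sketch of those steps with hypothetical database names and paths; in our environment the equivalent work runs as scheduled SQL Agent jobs on the cadence described above.

    -- On the primary: back up the transaction log to a share the secondaries can reach.
    -- (Runs roughly every minute in our environment.)
    BACKUP LOG OrdersDB
        TO DISK = N'\\logshipshare\OrdersDB\OrdersDB_tlog.trn';

    -- On each secondary: restore the copied log, leaving the database ready for more logs.
    -- (The secondary was originally seeded with a full backup restored WITH NORECOVERY.)
    RESTORE LOG OrdersDB
        FROM DISK = N'D:\LogShipCopy\OrdersDB_tlog.trn'
        WITH NORECOVERY;

    -- Planned failover: finish recovery on the secondary, then repoint connection strings.
    RESTORE DATABASE OrdersDB WITH RECOVERY;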

Moving forward with SQL Server 2005, our availability story is getting even better with peer-to-peer transactional replication and database mirroring. We now have a system in production with two servers behind NLB in one datacenter using peer-to-peer replication to another pair of servers behind NLB in another datacenter. Any one of these servers can accept data modifications, and replication takes care of moving the data to the other servers. This model gives each datacenter autonomy, with web servers connecting only to their local database servers and peer-to-peer replication doing its magic on the back end. There is latency between servers, and peer-to-peer does not handle conflict resolution, so it doesn't fit every system.
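If you're curious what the peer-to-peer piece looks like, here is a rough sketch of enabling a peer-to-peer publication on one node (SQL Server 2005 SP1 or later); the same publication is created on every peer, and each peer subscribes to the others. The server, database, and publication names are hypothetical, and initialization details are omitted.

    -- On each peer: create the same publication, flagged for peer-to-peer.
    EXEC sp_addpublication
        @publication     = N'OrdersDB_P2P',
        @enabled_for_p2p = N'true',
        @repl_freq       = N'continuous',
        @status          = N'active';

    -- Subscribe this peer to the matching publication on the other datacenter's node.
    -- 'replication support only' assumes the peers were initialized from a
    -- consistent backup before replication started.
    EXEC sp_addsubscription
        @publication       = N'OrdersDB_P2P',
        @subscriber        = N'DC2SQL01',
        @destination_db    = N'OrdersDB',
        @subscription_type = N'Push',
        @sync_type         = N'replication support only';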

The other SQL Server 2005 technology we're embracing is database mirroring, which is replacing many of our log shipping pairs. If you'd like more information about either of these technologies, or about clustering, here's a webcast you may want to view (TechNet Webcast: Building Highly Available Database Systems that Scale with Peer-to-Peer Replication (Level 200)). Geoclustering is a reality too, and if you want more information about it, here is a webcast for you (TechNet Webcast: How You Can Achieve Greater Availability with Failover Clustering Across Multiple Sites (Level 300)).
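As a point of reference, establishing a mirroring session is only a couple of statements once the mirroring endpoints exist and the mirror has been seeded from the principal WITH NORECOVERY. A minimal sketch, with hypothetical server and database names (dc1sql01 as the principal, dc2sql01 as the mirror):

    -- On the mirror server (database already restored WITH NORECOVERY):
    -- point it at the principal's mirroring endpoint.
    ALTER DATABASE OrdersDB
        SET PARTNER = 'TCP://dc1sql01.corp.example.com:5022';

    -- On the principal server: point it at the mirror's endpoint.
    ALTER DATABASE OrdersDB
        SET PARTNER = 'TCP://dc2sql01.corp.example.com:5022';

    -- Optional: asynchronous (high-performance) mode to keep latency down between datacenters.
    ALTER DATABASE OrdersDB SET PARTNER SAFETY OFF;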

Do we have any plans for clustering in the future? Possibly. Availability is the goal, and we're always looking to improve our architecture. If we determine that MSCS fits the bill, we'll use it.