Where, Oh Where Did All of the Microsoft.com SQL Clusters Go?

As the Group Manager for the database engineers that run Microsoft.com I’ve been repeatedly asked about Microsoft Clustering Services (MSCS) and where we’re using it. After all, Microsoft.com is one of the most visited web sites on the internet, and with SQL Server 2000 this was the top of the line high availability solution for databases. Of the 2516 user databases, none are running in an MSCS cluster. I’ve had the opportunity to talk with Microsoft customers and many of them utilize MSCS to increase availability for their SQL Servers, so why isn’t Microsoft.com also utilizing this HA technology?


The fact of the matter is that up until a couple of years ago we did have database systems running on MSCS clusters in Microsoft.com. So what happened to the MSCS clusters? The short answer is that we determined that MSCS clusters did not solve the architecture problem we were trying to solve. Before I explain why I want to be crystal clear that in no way am I saying that MSCS is not a viable HA solution, without question it is. However, I would contend that MSCS is not the right HA fit for every system.


What are the benefits of MSCS? I don’t intend to offer this as a comprehensive drill down into clustering, however if you want more background information about SQL Server 2000 clustering check out this TechNet article (SQL Server 2000 Failover Clustering). In a nutshell Clustering protects you from hardware failure in the server heads (motherboard, CPU, memory, etc.), and Operating system patching. There are more scenarios, but again it’s not my intention to list all of them. When the cluster fails over to another node it is essentially the same as stopping and starting SQL Server. Transactions that are in flight are rolled forward, or back, depending on whether they were committed or not. So in the scenario where a server fails the cluster will detect this and automatically fail over to another node, and this can occur without an administrator being involved. You’re application takes a small availability hit during the failover, but as soon as that’s complete you’re back online.


Sounds great, doesn’t it? And for many SQL Server 2000 systems this is a perfect fit. I mentioned that this solution didn’t solve the problem we were trying to solve, why not? The answer is geographical redundancy and the business we’re in. We take availability of our web applications very seriously at Microsoft.com and we have multiple datacenters serving content globally, and as a result we do not have the luxury of scheduled maintenance windows to bring a system down. We also do not load all of our eggs in one basket by having our database servers located in a single datacenter. GeoClustering is an option, and yes we have considered going down this path. However, the cost of purchasing the hardware capable of geoclustering, along with connectivity requirements, makes this a very costly solution. You may be thinking, “Come on, you’re working at Microsoft and you’re telling me that it costs too much money?”. While it is expensive, that’s not what I’m telling you. If we were in a different business, banking, medical, the list goes on, the expense completely makes sense. For our business we took a less costly solution that we think scales nicely, and we’ve attempted to mitigate risks with system engineering, processes, and monitoring.


How do we do it? I’ll first talk about SQL Server 2000, then how we’re utilizing SQL Server 2005. We have a few basic architectures, but for simplicity I’m going to focus on two. The first scenario involves read-only database requests from web applications. Since we’re running systems behind a website there are many applications where users from the internet are pulling read-only data from SQL Servers. Updates do occur, but they are initiated from someone inside of Microsoft when they want to update content on their database driven website or application. So when the publisher for a site, or application, on Microsoft.com interacts with a database that faces our corporate network we can replicate those changes to multiple read-only servers in multiple datacenters. These read-only database servers are configured in Network Load Balancing clusters (NLB). This way we have a virtual IP that our web servers point to that never changes. If a server fails, or we need to do maintenance on a server, we remove it from the NLB cluster and from an internet users perspective everything keeps working. If we lose a datacenter our Web Operations Team can direct internet traffic to other datacenters that have read-only copies of the same database. If we lose the database/server/datacenter where the publishing database is located, a site publisher is sitting on their hands until service is restored, but the key is that this would not impact internet users.


The second scenario involves systems that are writing information from internet users. This is the scenario where we once utilized MSCS. When we looked at availability in a single datacenter MSCS was beautiful; however, when we stepped back and looked at the technical pros and cons  vs. how much the solution would cost we chose another route. To date, we have had very few servers completely fail without warning. Most of our system downtime is a result of planned maintenance. By planned maintenance I mean work done in the facilities that host our servers, server hardware maintenance, OS patches, SQL Server patches, or application updates. Of those five, MSCS was only going to protect us from two scenarios; OS patching, or server hardware maintenance (unless it was to the shared disk of course). GeoClustering would take up to 3 of 5; OS patching, server hardware maintenance, and facility maintenance. I’m making generalizations, and of course there are ways to mitigate some of these simple scenarios.


At this point we had to ask ourselves whether log shipping behind NLB might be a better solution. By pairing these two technologies together we had a poor mans cluster. We could take two standard servers, no special hardware at all, configure them in an NLB cluster together and converge one server and log ship to the un-converged server in NLB. We would get the benefits of a single virtual IP that we could point our web applications to, nothing would be shared, these are fairly simple to use technologies, and it was inexpensive. What we lost was the automatic failover capabilities, but we’ve mitigated that by adding process and monitoring. For those that have used NLB you may be thinking “Hey, wait a second. NLB clusters don’t span datacenters”, and you’re right our NLB clusters do not span datacenters. You can log ship a copy of a database to more than one server, and those servers can be located anywhere you want. If we lose a datacenter we have our Web Operations team update connection strings to another datacenter where we recover a log shipped database. Is it automatic? No, not in our case. We also have the potential for a small amount of data loss for unplanned events, which does not exist with MSCS. We typically backup and copy the transaction log every minute, restores occur every 5 minutes on the log shipping secondaries. During planned maintenance we do have a small outage while we fail over to the log shipping secondary, but you have to remember that you have a small outage with MSCS as well. This architecture has worked very well for the last several years. It’s easy to support, it’s inexpensive, and it’s flexible.


Moving forward with SQL Server 2005 our availability story is getting even better with peer-to-peer transactional replication, and database mirroring. We now have a system in production that has two servers behind NLB in a datacenter that are using peer-to-peer replication to another pair of servers behind NLB in another datacenter. Any one of these servers can accept data modifications and replication takes care of moving the data the other servers. This model allows datacenter autonomy with web servers only connecting to their local database servers, and letting peer-to-peer replication do it’s magic on the backend.  There is latency between servers, and peer-to-peer does not handle conflict resolution, so it doesn’t fit every system.


The other technology from SQL Server 2005 we’re embracing is database mirroring, and this is replacing many of our log shipping pairs. If you’d like to get some more information about either of these technologies, or clustering, here’s a webcast you may want to view (TechNet Webcast: Building Highly Available Database Systems that Scale with Peer-to-Peer Replication (Level 200)). GeoClustering is a reality and if you want more information about it here is a webcast for you (TechNet Webcast: How You Can Achieve Greater Availability with Failover Clustering Across Multiple Sites (Level 300)).


Do we have any plans for clustering in the future? Possibly, availability is the goal and we’re always looking to improve our architecture. If we determine that MSCS fits the bill, we’ll use it.


Comments (81)

  1. Anonymous says:

    As the word about TTB gets around we’ve seem more blogs added, and  a lot of posts about TTB itself….

  2. Anonymous says:

    Readers of this blog might have noticed (or been puzzled by) the variety of subject matter that we present.

  3. Anonymous says:

    Readers of this blog might have noticed (or been puzzled by) the variety of subject matter that we present….

  4. Anonymous says:

    A great example of how not every tool fits every scenario. It gives a lot of insight into the ‘why’ and ‘why nots’ of various technologies.


    Their blog is usually a good read anyway. If

  5. Anonymous says:

    I was sent a link recently on the Microsoft.com SQL Aerver Architecture, this is really interesting and…

  6. Anonymous says:

    I was sent a link recently on the Microsoft.com SQL Aerver Architecture , this is really interesting

  7. Anonymous says:

    I was sent a link recently on the Microsoft.com SQL Aerver Architecture, this is really interesting and…

  8. Anonymous says:

    A really candid and excellent article from the "Group Manager for database engineers" at Microsoft on…

  9. Anonymous says:

    I was flicked this blog article from Kevin Jones at work.  I must admit it was nice to read the…

  10. Tim Toennies says:

    Wow… this is a really great article.  I gotta give you props for being so candid about how you use (and in this case don’t use) features of MS’ products within MS IT.  But, I also have to kinda dig you.  This is a blog for microsoft.com, so I don’t know the extent you speak for all of MS’ operations division but I really hope that someone is running a GeoCluster on SQL 2005 to support a production system at MS!!!  If so can you please have them add a comment to your blog & also a link on how they’ve been successful?  If this is not the case you guys are missing a tremendous dog-fooding opportunity.


  11. Colin Leversuch-Roberts says:

    I always did think that clustering ( for SQL Server ) sometimes was a double edged sword. I’ve always strongly supported log shipping, right from 6.0 ( when you had to roll your own ). My experiance with clustering is that it often creates mpore problems than its worth and you have a shed load of expensive hardware sitting around twiddling its thumbs!! Roll on load balanced SQL Server.

  12. Bobby says:

    This article raises several questions –

    (1) Does NLB here refer to the silly built-in Win2k3 Enterprise software NLB solution, or, does it refer to a real hardware LB solution (i.e. Foundry, F5 etc.) – and does it matter (potentially not?) Which one is MS using internally (hw or sw)?

    (2) if you use 1-way transactional log shipping behind NLB virtual IP in the case of a user-updated database (like a blog system) it would seem you must be ok with 10-20 minute downtimes if there is unexpected error. in our MSCS systems we see about a 15-30 second failover time, so the acceptable downtime for updateable systems within MS seems surprisingly high!

    (3) Also – in the case of user-updated databases I presume you are in fact not using NLB (since only one db has the latest info) – or – are you using NLB with one node marked as "off" and then you switch the nodes out after recovery of the secondary?

    Hopefully you guys will have a moment to reply 🙂

  13. MSCOM says:

    I’m to see there have been a few questions to my post, and I’ll do my best to check in and provide clarity or answers.

    First, I’d like to address Tim’s question regarding geo clustering. Micorosoft.com Operations is definitely a much smaller group than the operations teams that support our internal systems (MSIT), and MSN Operations. We’re not running clusters, but I know that both MSIT and MSN are. Are they running geo clusters? Great question, and I don’t know the answer to that. I have a request out to other internal Microsoft operations teams to see if I can find someone who is willing to comment here on their experiences.

    Now to Bobby’s questions:

    1.) Am I talking about NLB that is part of Windows Server? Yep, that’s exactly what I’m talking about and it works great too! To be clear, I’m talking specifically about our backend SQL Servers. This is my question to those who ask why we’re not considering a hardware solution. Why spend the money on hardware when a software solution is already there just waiting to be turned on? Network Load Balancing is included in all Windows Server versions in 2000 and 2003, and it’s very easy for someone who has never used it before to get up to speed quickly. Sure it’s not as sexy as a hardware solution, and no, it does not do some of the cool things you can with hardware, but it does meet our needs and works flawlessly.

    2.) Great question about how we use read-only clusters with a single publishing SQL Server. Yes, we can not fail over as quickly as MSCS. No question about it. Latency, or stale data, as you point out, is a huge concern for us in these systems. Here’s how we have tackled this. First, we have to monitor the publishing server closely. If the server, the database or replication fail, and we don’t know about it….lets just say that I would have some tough questions to answer back to our business owners. We typically have planned work that will cause the data to stop flowing, and in this case we have tried to fix the problem with processes. The publishing server will have a log shipping secondary, and NLB will be configured between the two with only the publishing server converged in the cluster. We’ll stop the application from accepting writes, let log shipping sync, and make sure replication has completed to all subscribers. We speed this up, by running these jobs more frequently before we start the work. Once we’re certain we haven’t missed any data we recover the secondary, create the replication publication with the no sync option (all the data is the same at the publisher and subscribers, so there’s no reason to synchronize), then turn NLB off on the old publisher, and ON on the new publisher. Turn the app back on, and setup log shipping the other direction and we’re done. We can do this in about 5 minutes. Most people ask if we have a script to do it, and the unfortunate answer is no. Moving forward we will be replacing most of our log shipping with database mirroring, and this will reduce failover time substantially.

    3.) And your last question about using NLB with user updated databases. As I mentioned above, we are indeed using NLB and log shipping. One of the biggest benefits is that there is a single IP or name that our web servers have to point to. This way we don’t have to update connection strings while we do maintenance.

    Hopefully this clears things up a little, but if not let me know what you think and I’ll do my best to explain further.

  14. MSCOM says:

    Just to clarify one point in the first sentence of my last post. I’m GLAD to see question and comments. -Brad

Skip to main content