SQL Server 2005 Multi-Site Clustering with Windows Server 2008

I was working recently with a customer who was looking to deploy a SQL Server 2005 cluster across 2 geographically dispersed sites using Windows Server 2008. They were looking to utilise the new clustering improvements in Windows Server 2008 to build a highly available SQL Server solution. The customer required automatic failover between the sites in the event of a disaster, but their current solution required manual intervention by an administrator in order to failover to the disaster recovery site. Automatic failover would increase application availability, and reduce the complexity of the solution. Each site has it's own SAN storage and the customer planned to replicate data between each site using SRDF replication.

This led me to do further research into clustering SQL Server in this type of environment. Windows Server 2008 introduces greater flexibility in the choice of Quorum configuration. The concept of quorum moves away from the requirement of a shared storage resource, but now refers to the number of votes needed to establish a majority. All nodes and a witness resource  can get a vote, which removes the disk as the single point of failure as in previous clustering models. The 4 Quorum Models available are:

  • Node and Disk Majority
  • Disk Only
  • Node Majority
  • Node and File Share Majority

As there is no shared storage between the nodes in a multi-site cluster, 2 of these Quorum models are suitable for multi-site clustering: Node Majority and Node and File Share Majority. Node and Disk Majority and Disk Only should only be used in a multi-site cluster if specifically directed by your storage vendor as your disk replication software needs to support these configurations.

Node and File Share Majority:

This allows the creation of up to 16 nodes with no shared disks. A file share acts as a witness, meaning that a 2 node cluster will have have 3 votes, so connectivity can be lost by either one of the nodes or the witness and the cluster can still continue to function.

A cluster quorum configured to use a node-and-file-share majority is a great solution for multi-site clusters. The file share witness can reside at a third site independent of either site hosting a cluster node for high disaster resilience. A single file server can serve as a witness to multiple clusters (with each cluster using a separate file share witness on the file server).

image

This configuration gives the highest resilience as the cluster can automatically recover from the loss of any one site without manual intervention.

The File Share Witness (FSW) needs to be in the same forest as the nodes and be running Windows Server 2003 or Windows Server 2008. For maximum resilience, it is best to locate the FSW at a 3rd site separate from the cluster nodes. The FSW does not need to be attached to shared storage and should NOT be a node in the same cluster.

Node Majority:

A node-majority cluster consists of 3 or more nodes without shared storage. Each of the nodes has a vote and there is no shared disk vote. A majority of votes are necessary to operate the cluster if 2 nodes fail in a 3 node cluster, then the remaining node drops out of the cluster. An administrator can manually over-ride this and force the remaining node to start. When the other nodes come back, majority quorum is achieved again and the cluster comes back online seamlessly.

This configuration works best with an odd number of cluster nodes as it is not enough to have half the cluster nodes functioning in this model. If four nodes were set up in a node-majority configuration, the cluster would continue to operate with the loss of one node but not with the loss of two nodes. You should use an odd number of nodes with Node Majority as 4 nodes can only survive 1 failure, which is the same as 3 nodes.

image

The node-majority quorum configuration can work when there is more than one cluster node at each site. Consider a multi-site cluster consisting of five nodes, three of which reside at Site A and the remaining two at Site B. With a break in connectivity between the two sites, Site A can still communicate with three nodes (which is greater than 50 percent of the total), so all of the nodes at Site A stay up. The nodes in Site B are able to communicate with each other, but no one else. Since the two nodes at Site B cannot communicate with the majority, they drop out of cluster membership. (Were Site A is to go down in this case, in order to bring up the cluster at Site B, it would require manual intervention to override the non-majority.)

image

As a result, the Node Majority configuration does not give automatic failover between sites as nodes 4 and 5 cannot achieve quorum. In this situation, you would need to manually force a failover.

SQL Server Networking Considerations:

Windows Server 2008 now allows nodes in the same cluster to reside in different network subnets and communicate across network routers. However, be aware that SQL Server 2005 and 2008 still require all cluster nodes to reside in the same network subnet, so you will still need to set up virtual local area networks (VLANs) to connect geographically separated cluster nodes. This can have some benefits with regard to client response times though, as DNS replication may impact client re-connection times in the event of a failover from one site to another. VLAN's allow DNS names to stay the same, so can increase availability.

Storage Considerations:

As there is no shared storage between the nodes in a multi-site cluster, the main consideration is how to keep the data replicated between the sites. The choice of 3rd-party replication solution is important and can have a major effect on how you deploy your cluster. As such, you should work closely with your storage vendor from an early stage in the design process.

Synchronous replication results in no data loss, but requires shorter distances between nodes and higher bandwidth to avoid write latency from impacting performance. Asynchronous allows you to stretch cluster nodes across longer distances, however there is a potential for data loss in the event of a failure. Asynchronous data replication also assumes a large enough network bandwidth to keep up with data changes and does not significantly impact application performance.

Conclusion:

As data replication is key in a multi-site SQL Server cluster, work with your storage vendor from an early stage to ensure they support your cluster configuration. Multi-site clustering allows you to achieve high availability and disaster recovery, however it can be a costly and complex solution. You should evaluate your business requirements first and then decide on the best technology to meet these. It could be the case that Database Mirroring, for example, could be used to give you the required level of resilience across geographical sites.

In this case, the customer chose to implement a 2-node, 3-site solution using the Node and File Share Majority quorum model, with a File Share Witness located in the 3rd site. This gives site-level resilience in the event of a disaster and also allows automatic failover between the cluster nodes without having to re-write client applications, meeting the business requirements.

Additional Resources:

Windows Server 2008 Multi-Site Clustering Whitepaper

TechNet Webcast: Geographically Dispersed Failover Clustering in Windows Server 2008

TechNet Webcast: Failover Clustering and Quorum in Windows Server 2008 Enterprise

Support Webcast: Microsoft SQL Server 2005 Failover Clustering on Windows Server 2008

How to: Create a New SQL Server 2005 Failover Cluster (Setup)