Failover Cluster Testing Methods

06/17/2009

1.1 System Failover Testing

During system testing, we will gather as much information as possible about the potential outcomes of system failures. We will not, however, test many component failures, such as a motherboard failing, the loss of a processor, or a cooling fan failing. These are outages that have been planned for: many are protected against by Cluster Server, and others are covered by the fault tolerance of the systems themselves (such as redundant power supply modules and cooling fans). We will test certain events that can cause failover within the cluster. These tests are listed below, each with its procedure and expected result.

1.1.1 Disk Failure

Purpose: The purpose of testing a disk failure is to ensure that the RAID configuration continues to operate without interruption. We will also check that, in the event of a disk failure, the hot spare takes over for the failed drive.

Test Procedure: Pull out one of the hard drives in the SAN array while that drive is operational and in use by one of the nodes within the cluster.

Expected Result: Uninterrupted service. Windows should not detect any problems at all. The RAID management software should report the loss of a drive, substitute the hot spare, and rebuild the drive array. Disk performance might be significantly reduced during the rebuild.

1.1.2 Power Failure

Purpose: This test will verify that in the event of a server losing power, the opposite node in the cluster will bring all resources in the cluster online and resume operations.

Test Procedure: Pull all power plugs from one node while that node is operational and is hosting groups within the cluster.

Expected Result: The cluster groups hosted by the “failed” node should automatically fail over to the passive node. Service interruption should be in the range of 0-2 minutes.
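The 0-2 minute interruption window can be measured from a client machine rather than estimated by eye. The following is a minimal sketch, not part of the original test plan: it assumes the clustered service answers on a TCP port (for example, the SQL Server virtual server name on port 1433, both of which are assumptions to be replaced with the real values), and the function names `probe`, `watch`, and `longest_outage` are hypothetical.

```python
import socket
import time

MAX_INTERRUPTION_S = 120.0  # upper end of the 0-2 minute range in the test plan


def probe(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def watch(host, port, duration_s, interval_s=1.0):
    """Probe the service repeatedly and return (timestamp, reachable) samples."""
    samples = []
    end = time.monotonic() + duration_s
    while time.monotonic() < end:
        samples.append((time.monotonic(), probe(host, port)))
        time.sleep(interval_s)
    return samples


def longest_outage(samples):
    """Return the longest outage in seconds found in time-ordered samples.

    An outage runs from the first failed probe to the next successful one.
    An outage still open at the end is counted up to the last sample.
    """
    longest = 0.0
    outage_start = None
    last_ts = None
    for ts, reachable in samples:
        last_ts = ts
        if not reachable and outage_start is None:
            outage_start = ts
        elif reachable and outage_start is not None:
            longest = max(longest, ts - outage_start)
            outage_start = None
    if outage_start is not None:
        longest = max(longest, last_ts - outage_start)
    return longest


def within_limit(samples, limit_s=MAX_INTERRUPTION_S):
    """Return True if the longest outage does not exceed the limit."""
    return longest_outage(samples) <= limit_s
```

To use it, start `watch("sql-virtual-name", 1433, duration_s=600)` on the client (the host name here is a placeholder), pull the power during that window, and then pass the returned samples to `longest_outage`. The measurement is only as fine as the probe interval and timeout, so it slightly overstates short outages.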

1.1.3 Network Adapters

Purpose: Testing the network adapters serves two purposes. We will test the functionality of the heartbeat and the ability of cluster heartbeat communications to be routed over the public network. We will also test the failover scenario in which both public network adapters (members of the network team) lose their connections to the network.

Test Procedure: First, we will test the heartbeat interconnect by disconnecting the private network adapter and confirming that cluster communications continue over the public network without interruption. Second, we will unplug one of the two public network adapters and confirm that the other adapter continues to communicate with the network as usual. Then we will unplug the remaining network adapter, which at this point is carrying all network communication, including the heartbeat. After this series of tests is complete, we will return the system to its normal configuration and test the public network adapters by removing their connections to the network. In this last test, however, we will leave the heartbeat interconnect in place.

Expected Result: When the private adapter is disconnected, Windows is expected to switch internal cluster communications to a public adapter automatically. There should be no service interruption. When one of the public network cards is disconnected, the network team driver is expected to switch to the other network card automatically. There should be no service interruption. When all network adapters are disconnected, the cluster is expected to initiate failover once it discovers that the active node is unavailable. Service interruption should be in the range of 0-2 minutes.
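Because this test has several steps with different expectations, it can help to write the expected results down as data and compare each observation against them. The sketch below is illustrative only: the step names and the `check` function are hypothetical, "no service interruption" is encoded as a limit of 0 seconds, and only the three outcomes stated explicitly above are included.

```python
from dataclasses import dataclass

MAX_INTERRUPTION_S = 120  # upper end of the 0-2 minute range


@dataclass(frozen=True)
class Expected:
    failover: bool            # should the cluster initiate failover?
    max_interruption_s: int   # longest acceptable service interruption


# Expected outcomes for the network adapter steps, taken from the text above.
EXPECTED = {
    "private_adapter_disconnected": Expected(False, 0),
    "one_public_adapter_disconnected": Expected(False, 0),
    "all_adapters_disconnected": Expected(True, MAX_INTERRUPTION_S),
}


def check(step, failover_observed, interruption_s):
    """Return True if the observation for a step matches the expected result."""
    expected = EXPECTED[step]
    return (failover_observed == expected.failover
            and interruption_s <= expected.max_interruption_s)
```

For example, `check("all_adapters_disconnected", True, 95)` passes, while a failover observed after disconnecting only the private adapter would be flagged as unexpected.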

1.1.4 Fiber Channel Components

Purpose: These tests will document what to expect from the redundancy of the fiber channel components within the HBA cards, the servers, and the CLARiiON SAN. We will record the results of unplugging certain components, simulating power losses, and other failures that affect the cluster.

Test Procedure: During this test, we will disconnect the redundant fiber connections.

Expected Result: When a single fiber connection is disconnected, the system should automatically switch to the redundant path. There should be no service interruption. When both fiber cables are disconnected, so that the cluster node completely loses communication with the SAN storage, cluster failover should be initiated. Service interruption should be in the range of 0-2 minutes.

1.2 Windows 2003 and SQL Server Failover Testing

Microsoft Cluster Server ensures that application services continue running within the cluster if failures in Windows 2003 prevent the application from operating properly, or if SQL Server itself ceases to function properly. The cluster can detect these failures and fail the application over to a passive node. During these tests, it is important to note that the single point of failure within Windows 2003 and SQL Server is the database(s). If a database itself becomes corrupt or experiences some other catastrophic failure, the only solution is to restore that database from a backup copy.

The expected result in all tests is for the cluster to initiate failover. Service interruption should be in the range of 0-2 minutes.

1.2.1 SQL Server Services

Purpose: The purpose of simulating service failures is to ensure that failover occurs and to monitor the activity that takes place during failover. We will record the time required for failover and verify that failover completes properly and that dependencies are brought online properly.

Procedure: The best approach to testing a clustered service is to stop the service from the Services snap-in within the Computer Management console. A service that has become a clustered resource should be managed only through Cluster Administrator, so stopping it through the Services snap-in appears to the cluster as a failure and therefore simulates a service failure. We will attempt to fail the following services:

SQL Server service

SQL Server Agent service

MS DTC service
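The three things this test looks for (failover time, a proper failover, and dependencies coming online) can be recorded per service in a small structure so that results are comparable across runs. This is a sketch under assumptions: the class and field names are hypothetical, the values are entered by the tester from observation, and "proper failover" is interpreted here as the group ending up on a different node than it started on.

```python
from dataclasses import dataclass


@dataclass
class ServiceFailoverResult:
    service: str               # e.g. "SQL Server service"
    failover_seconds: float    # observed time until the service was back
    owner_before: str          # node hosting the group before the test
    owner_after: str           # node hosting the group after the test
    dependencies_online: bool  # did all dependent resources come online?

    def problems(self, limit_s=120.0):
        """Return a list of deviations from the expected result (empty if none)."""
        issues = []
        if self.failover_seconds > limit_s:
            issues.append("interruption exceeded %.0f seconds" % limit_s)
        if self.owner_after == self.owner_before:
            issues.append("group did not move to another node")
        if not self.dependencies_online:
            issues.append("dependencies did not all come online")
        return issues
```

A run for the SQL Server service might be recorded as `ServiceFailoverResult("SQL Server service", 45.0, "NODE1", "NODE2", True)`, where the node names are placeholders; an empty list from `problems()` means the run matched the expected result.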

1.2.2 Windows 2003 Failure

Purpose: The purpose here is to simulate a failure of Windows 2003 in order to demonstrate that Cluster Server can detect that Windows 2003 is no longer running on one of the clustered nodes and initiate failover.

Procedure: A Windows 2003 failure is difficult to simulate by any means other than simply choosing Shut Down. This stops all services on the node being shut down, and the node ceases to participate in the cluster. The cluster service on the opposite node will be notified, and failover of the application will occur.

1.2.3 Cluster Service Failure

Purpose: Cluster Service is responsible for maintaining cluster membership, monitoring resources and managing the clustered node. If this service were to fail, all clustered groups of resources would be forced to move to another cluster node. We will simulate a Cluster Service failure in this test and monitor the failover activity.

Procedure: Stop the Cluster Service from the Services snap-in and record results.

1.2.4 Quorum Failure

Purpose: The quorum serves as a log for changes that occur while one node of a cluster is offline and as a tiebreaker in the event that all heartbeat communications are lost. This test will observe the cluster behavior when the quorum is lost. For the results of the quorum serving as a tiebreaker, see the results under “System Failover Testing” above.

Procedure: Use the SAN configuration utility to unpresent the quorum drive from the active node. This shows the results that would be seen if the active node lost access to the quorum drive. Failover should be initiated.

Use the RAID configuration utility to mark the logical drive of the quorum as offline. This shows the results that would be seen if the drive that contains the quorum were to fail.
