Why does my cluster resource keep timing out when I try to bring it online?

 

I saw this on a Windows Server 2003 SP2 cluster that couldn’t bring a 2 TB drive online after a failover. The resource was timing out before it had finished coming online, so the cluster marked it as failed. After I increased the pending timeout, the resource came online.

 

Symptoms:

The System Event Log will contain an event suggesting that the pending timeout may be too short, and cluster.log will show the resource hitting its pending timeout and being set to failed.

 

Event Type: Error

Event Source: ClusSvc

Event Category: Resource Monitor

Event ID: 1145

Date: 4/28/2009

Time: 4:14:20 PM

User: N/A

Computer: <Server>

Description:

Cluster resource <Resource> timed out. If the pending timeout is too short for this resource, consider increasing the pending timeout value.

 

For more information, see Help and Support Center at https://go.microsoft.com/fwlink/events.asp.

Data:

0000: b4 05 00 00 ´...
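The Data field of the event can be decoded by hand. As a minimal sketch, assuming the four bytes are a single little-endian DWORD (the usual layout for an event's data word), they work out to 1460, which is the Win32 error code ERROR_TIMEOUT:

```python
import struct

# Bytes copied from the Data field of the event above.
raw = bytes.fromhex("b4 05 00 00")

# Interpret them as one little-endian unsigned 32-bit integer (a Windows DWORD).
(code,) = struct.unpack("<I", raw)

print(code)       # 1460
print(hex(code))  # 0x5b4
```

The trailing "´..." in the event is just the same bytes shown as characters, so it carries no extra information.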

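Cluster log entry (cluster.log timestamps are in GMT, so 23:14:20 here lines up with the 4:14:20 PM local time of the event above):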

00001160.00001ac8::2009/04/28-23:14:20.186 WARN [RM] RmpTimerThread: Resource <Resource> pending timed out, CP 0 - setting state to failed.
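If the cluster.log is large, a short script can pull out every pending timeout. This is only a sketch: the pattern is modeled on the single log line shown above, and the function name find_pending_timeouts is made up for illustration, so other log versions may need a different expression.

```python
import re

# Layout assumed from the sample line:
# process.thread::YYYY/MM/DD-HH:MM:SS.mmm LEVEL [component] message
LINE_RE = re.compile(
    r"^(?P<pid>[0-9a-f]{8})\.(?P<tid>[0-9a-f]{8})::"
    r"(?P<stamp>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>\w+)\s+\[(?P<component>\w+)\]\s+(?P<message>.*)$"
)
TIMEOUT_RE = re.compile(r"Resource (?P<resource>.+?) pending timed out")


def find_pending_timeouts(lines):
    """Return (timestamp, resource name) for each pending timeout entry."""
    hits = []
    for line in lines:
        entry = LINE_RE.match(line.strip())
        if not entry:
            continue  # not a log line in the assumed layout
        timeout = TIMEOUT_RE.search(entry.group("message"))
        if timeout:
            hits.append((entry.group("stamp"), timeout.group("resource")))
    return hits


sample = (
    "00001160.00001ac8::2009/04/28-23:14:20.186 WARN [RM] RmpTimerThread: "
    "Resource <Resource> pending timed out, CP 0 - setting state to failed."
)
print(find_pending_timeouts([sample]))
# [('2009/04/28-23:14:20.186', '<Resource>')]
```

To scan a real log, pass the open file object instead of the one-line list.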

 

Why does this happen?

 

Bringing a resource online is a race against the pending timeout. If the timeout expires while the resource is still in the OnlinePending state, the resource is treated as failed.

1. Resource Monitor calls the Online entry point of the first resource DLL and returns the result to Failover Manager.

· If the entry point returns ERROR_IO_PENDING, the resource state changes to OnlinePending. Resource Monitor starts a timer that waits for the resource either to go online or to fail. If the amount of time specified for the pending timeout passes and the resource is still pending (has not entered either the Online or Failed state), the resource is treated as a failed resource and Failover Manager is notified.

· If the Online call fails or the Online entry point does not move the resource into the Online state within the time specified in the resource DLL, the resource enters the Failed state, and Failover Manager uses Resource Monitor to try to restart the resource, according to the policies defined for the resource in its DLL.
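The timer behavior described above can be illustrated with a toy model. This is not how Resource Monitor is implemented; it only shows the outcome of the race, and seconds_needed is a hypothetical figure for how long the resource takes to finish coming online after its Online entry point returns ERROR_IO_PENDING.

```python
from enum import Enum


class State(Enum):
    ONLINE = "Online"
    FAILED = "Failed"


def state_after_timer(seconds_needed, pending_timeout_seconds=180):
    """Return where a pending resource ends up once the timer has run.

    The 180-second default mirrors the 3-minute default pending timeout.
    """
    if seconds_needed <= pending_timeout_seconds:
        # The resource finished before the timer expired.
        return State.ONLINE
    # Still OnlinePending when the timer expired: treated as failed.
    return State.FAILED


print(state_after_timer(240))       # State.FAILED
print(state_after_timer(240, 600))  # State.ONLINE
```

A resource that needs four minutes fails under the default, and the same resource succeeds once the timeout is raised to ten minutes, which is what happened with the 2 TB disk.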

 

Here is a Doc

 

How to fix:

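Increasing the pending timeout may give the resource enough time to come online.

To configure the pending timeout for a clustered service or application (these steps use the Failover Cluster Management snap-in from Windows Server 2008; on Windows Server 2003 the setting is on the Advanced tab of the resource's properties in Cluster Administrator):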

1. In the Failover Cluster Management snap-in, if the cluster you want to configure is not displayed, in the console tree, right-click Failover Cluster Management, click Manage a Cluster, and select or specify the cluster you want.

2. If the console tree is collapsed, expand the tree under the cluster that you want to configure.

3. Expand Services and Applications.

4. Click the clustered service or application that you want to configure the pending timeout for.

5. In the center pane, right-click the resource for the service or application, click Properties, and then click the Policies tab.

6. Under Pending timeout, specify the length of time, in minutes and seconds, that the resource can take to change states between Online and Offline before the Cluster service puts the resource in the Failed state.

The default timeout value is 3 minutes.
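The snap-in takes minutes and seconds, but the underlying PendingTimeout resource property is stored in milliseconds, so the 3-minute default is 180000. The sketch below does the conversion and prints a cluster.exe command line for setting the property from a prompt. The resource name "Disk Q:" is a placeholder, and the command syntax is written from memory as an assumption, so check it against cluster res /? before relying on it.

```python
def pending_timeout_ms(minutes, seconds=0):
    """Convert a minutes-and-seconds timeout to milliseconds."""
    if minutes < 0 or seconds < 0:
        raise ValueError("timeout cannot be negative")
    return (minutes * 60 + seconds) * 1000


print(pending_timeout_ms(3))   # 180000 (the default)
print(pending_timeout_ms(10))  # 600000

# Hypothetical resource name; command syntax is an assumption to verify.
resource = "Disk Q:"
print(f'cluster res "{resource}" /prop PendingTimeout={pending_timeout_ms(10)}')
```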

Here is a Doc