Resource Hosting Subsystem (RHS) In Windows Server 2008 Failover Clusters

In this blog, I would like to explore some of the inner workings of the Resource Hosting Subsystem (RHS), which is responsible for monitoring the health of the various cluster resources being provided as part of highly available services in a Failover Cluster. A Windows Server 2008 Failover Cluster is capable of providing highly available services using a variety of resources, some of which are included as part of the Failover Cluster feature and others of which are provided by ‘cluster-aware’ applications like SQL and Exchange. Resources are designed to work together and are typically organized in Resource Groups (Figure 1). For example, a group of resources supporting a highly available File Server may consist of one or more of the following types of resources – Client Access Point (IP Address(es) + Network Name resource), Physical Disk (Storage), and a File Server. A highly available SQL Instance could contain the following resources – Client Access Point (IP Address + Network Name resource), Physical Disk (Storage), SQL Server and SQL Server Agent. Cluster resources are supported by special ‘plugins’, or resource Dynamic Link Libraries (DLLs), that include code allowing them to integrate and interoperate properly with the cluster service.

image

Figure 1

A Windows Server 2008 Failover Cluster is capable of hosting a large number of resources. The management of these resources is the responsibility of the Resource Control Manager (RCM) and the Resource Hosting Subsystem (RHS), which provide this functionality as part of the Cluster Service itself (Figure 2).

image

Figure 2

The Resource Control Manager (RCM) is part of the overall cluster architecture and is responsible for implementing failover mechanisms and policies for the cluster service as well as establishing and maintaining the dependency tree (Figure 3) for each resource (e.g. a File Server resource requires a dependency on a Client Access Point and a Storage resource).

image

Figure 3
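
To make the idea of a dependency tree a little more concrete, here is a minimal Python sketch. It is not how the Resource Control Manager is implemented; it only shows that once dependencies are known, a valid order for bringing resources Online can be derived. The resource names and the dependency map are illustrative assumptions for a File Server group.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: each key depends on the resources in its value.
dependencies = {
    "File Server": {"Network Name", "Physical Disk"},
    "Network Name": {"IP Address"},
    "IP Address": set(),
    "Physical Disk": set(),
}

def online_order(deps):
    """Return an order in which every resource follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())

order = online_order(dependencies)
print(order)
```

Taking the group Offline would simply walk the same list in reverse, so that nothing is stopped while another resource still depends on it.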

The Resource Control Manager maintains the state for individual resources (Online, Offline, Failed, Online Pending, and Offline Pending) as well as for Resource Groups (Online, Offline, Partial Online, and Failed). The Resource Control Manager can execute the following actions on a group of resources – Move, Failover and Failback. Which action is executed depends on several factors including the current ‘health’ of resources in the group, administrative actions taken on the group (e.g. Move Group), or the current policies in effect for the group. Here is an example (Figure 4) of Failover and Failback Group Policies –

image

Figure 4

Individual resources have policies (Figure 5) that apply to them as well.

image image

Figure 5

The Resource Hosting Subsystem (RHS) is responsible for initially hosting all resources that come Online in the cluster in one default process – rhs.exe (Resource Host Monitoring process) (Figure 6).

image

Figure 6

Note: The rhs.exe *32 process supports 32-bit resource DLLs running in the cluster.

In previous versions of Microsoft clustering, this process was called the resource monitor (resrcmon.exe) (Figure 7).

image

Figure 7

There is one exception to this rule, which was implemented in the Windows Server 2008 R2 Failover Clustering feature. In Windows Server 2008 R2, the Cluster Group (which consists of the Cluster Network Name resource, one or more associated IP address resources and a ‘witness’ resource) and the Available Storage group are considered to be ‘critical’ cluster resource groupings and are hosted in an rhs.exe process separate from all the other cluster resources.

The Resource Hosting Subsystem (RHS) conducts periodic health checks of all cluster resources to ensure they are functioning properly. This is accomplished by calling the IsAlive and LooksAlive entry points, whose behavior is specific to the type of resource. Examples of these are documented in the following KB article –

KB 914458 - Behavior of the LooksAlive and IsAlive functions for the resources that are included in the Windows Server Clustering component of Windows Server 2003.

How often health checks are conducted is determined by the specific resource DLL or by a policy set by the cluster administrator. An example of this policy is shown in Figure 5. Should a resource fail to respond to a low-level LooksAlive check, a more in-depth IsAlive check is conducted. If a resource fails an IsAlive check, additional policies are executed until it is determined that the resource cannot run on a particular node in the cluster. When that point has been reached, RHS notifies the Resource Control Manager, which reports the resource as Failed to the cluster service, and a Failover is executed to move the Resource Group to another node in the cluster, provided the default policy (Figure 8) is in effect.

image

Figure 8
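
The escalation just described (LooksAlive, then IsAlive, then local recovery, then Failover) can be sketched in a few lines of Python. This is only a model of the flow under my own assumptions: the `Resource` class, the `max_restarts` value and the returned strings are hypothetical and do not correspond to real cluster APIs or policy names.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Resource:
    name: str
    looks_alive: Callable[[], bool]  # cheap, low-level check
    is_alive: Callable[[], bool]     # more in-depth check
    restart: Callable[[], bool]      # local recovery attempt; True on success
    max_restarts: int = 1            # hypothetical restart policy value
    restarts_used: int = 0

def health_check(res: Resource) -> str:
    """Run one polling cycle and return the resulting action."""
    if res.looks_alive():
        return "healthy"
    # LooksAlive failed, so escalate to the more thorough check.
    if res.is_alive():
        return "healthy"
    # IsAlive failed: try local recovery until the policy is exhausted.
    while res.restarts_used < res.max_restarts:
        res.restarts_used += 1
        if res.restart():
            return "restarted"
    # The resource cannot run on this node; the group should move.
    return "failover"

broken = Resource("File Server", lambda: False, lambda: False, lambda: False)
print(health_check(broken))
```

A resource that fails LooksAlive but passes IsAlive is still reported as healthy, which is why a slow LooksAlive alone does not trigger a Failover in this model.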

There are times when a cluster administrator will choose not to implement the default policy shown in Figure 8 for specific ‘non-critical’ resources. This reduces instability in the cluster which could adversely impact clients connected to highly available service(s).

The IsAlive and LooksAlive health monitoring functions are but a small part of what can be done with cluster resources. Figure 9 shows a listing of additional Resource DLL Entry-Point functions.

image

Figure 9

Note: Information on the Failover Cluster APIs can be found on MSDN.

Failure of an IsAlive call into a resource is but one way resources can become unavailable in the cluster. Other ways include:

  • Deadlocks in a resource DLL
  • Crashes in a resource DLL
  • RHS process itself terminates in the cluster
  • Cluster service fails on the node
  • Operating system failures (e.g. resource exhaustion)

Most of us who have been working with clusters for a long period of time understand what happens if a resource fails a critical health check. I want to spend a little time discussing resource deadlocks.

What is a resource ‘deadlock’? Basically, there are two common reasons for instability within a resource DLL: the resource DLL itself crashes (e.g. an access violation in the resource DLL), or the resource fails to respond to a command in a timely fashion. Every time a call is made into a resource, a timer is started. If a response is not received within a specific, configurable period of time, the resource is considered to be deadlocked. The RHS process hosting that resource is terminated, and the resource is placed in a newly created RHS process, thereby isolating it from all the other resources running in the default rhs.exe process. When a deadlock happens, the Failover Cluster service registers an event in the cluster log. Here is an example of a deadlock occurring in the ‘Cluster Name’ resource –

000008c8.00002528::2009/06/17-20:07:57.900 WARN [RCM] ResourceControl(GET_NETWORK_NAME) to Network Name (email) returned 5910.

00000f1c.00000f28::2009/06/17-20:07:58.009 ERR [RHS] RhsCall::DeadlockMonitor: Call LOOKSALIVE timed out for resource 'Cluster Name'.

00000f1c.00000f28::2009/06/17-20:07:58.009 ERR [RHS] Resource Cluster Name handling deadlock. Cleaning current operation and terminating RHS process.

000008c8.00001cc4::2009/06/17-20:07:58.009 INFO [RCM] HandleMonitorReply: FAILURENOTIFICATION for 'Cluster Name', gen(0) result 4.

000008c8.00001cc4::2009/06/17-20:07:58.009 WARN [RCM] rcm::RcmResource::HandleMonitorReply: Resource 'Cluster Name' has crashed or deadlocked; marking it to run in a separate monitor.

Figure 10
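
When a cluster log is large, it can help to pull out just the deadlock entries. The following Python sketch parses lines shaped like the ones in Figure 10 and reports the resource and the entry point that timed out. The regular expressions are my own and are based only on the sample lines above; I am also assuming that the first hexadecimal field is the process ID and the second is the thread ID. Other log versions may need adjustments.

```python
import re

# <pid>.<tid>::<timestamp> <LEVEL> [<COMPONENT>] <message>
LINE = re.compile(
    r"^(?P<pid>[0-9a-fA-F]{8})\.(?P<tid>[0-9a-fA-F]{8})::"
    r"(?P<timestamp>\S+)\s+(?P<level>[A-Z]+)\s+"
    r"\[(?P<component>[A-Z]+)\]\s+(?P<message>.*)$"
)
DEADLOCK = re.compile(
    r"Call (?P<entry_point>\w+) timed out for resource '(?P<resource>[^']+)'"
)

def find_deadlocks(lines):
    """Return one dict per RHS deadlock-monitor timeout found in the lines."""
    hits = []
    for line in lines:
        m = LINE.match(line.strip())
        if not m or m["component"] != "RHS":
            continue
        d = DEADLOCK.search(m["message"])
        if d:
            hits.append({
                "rhs_pid": int(m["pid"], 16),  # hex in the log, decimal here
                "timestamp": m["timestamp"],
                "resource": d["resource"],
                "entry_point": d["entry_point"],
            })
    return hits

sample = [
    "00000f1c.00000f28::2009/06/17-20:07:58.009 ERR [RHS] RhsCall::DeadlockMonitor: "
    "Call LOOKSALIVE timed out for resource 'Cluster Name'.",
    "000008c8.00001cc4::2009/06/17-20:07:58.009 INFO [RCM] HandleMonitorReply: "
    "FAILURENOTIFICATION for 'Cluster Name', gen(0) result 4.",
]
print(find_deadlocks(sample))
```

Converting the process ID to decimal makes it easier to compare against the PID column in Task Manager.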

Entries are also made in the Windows System Event Log. Here is an example –

06/17/2009 04:07:58 PM Error Server1.contoso.com. 1230 Microsoft-Windows-FailoverCluste Resource Control NT AUTHORITY\SYSTEM Cluster resource 'Cluster Name' (resource type '', DLL 'clusres.dll') either crashed or deadlocked. The Resource Hosting Subsystem (RHS) process will now attempt to terminate, and the resource will be marked to run in a separate monitor.

06/17/2009 04:07:58 PM Critical Server1.contoso.com. 1146 Microsoft-Windows-FailoverCluste Resource Control NT AUTHORITY\SYSTEM The cluster resource host subsystem (RHS) stopped unexpectedly. An attempt will be made to restart it. This is usually due to a problem in a resource DLL. Please determine which resource DLL is causing the issue and report the problem to the resource vendor.

Figure 11

Information on these specific Failover Cluster error messages can be found on TechNet. The information for the two events shown in Figure 11 is shown in Figure 12.

image

Figure 12

In Windows Server 2008 R2, RHS events are registered with Windows Error Reporting. These events can be viewed in the Action Center under Control Panel. All RHS issues will be listed under the category ‘Failover Cluster Resource Host Subsystem.’

Examining the properties of a cluster resource highlights some of the information we have been discussing. Figure 13 points out some of the pertinent properties of a resource.

image

Figure 13

MonitorProcessID: Indicates the Process Identifier (PID) in Task Manager of the rhs.exe process associated with this resource. If multiple resources have been placed in their own RHS process, it can be difficult to discern which process is associated with which resource. Examining the properties of the specific resource can help.

Note: The Process ID is not displayed by default in Task Manager. To add the column, select View in the menu bar, choose Select Columns from the drop-down list, and check the box for PID (Process Identifier).

SeparateMonitor: Indicates if the resource has been placed in a separate monitor (0:No, 1:Yes).

IsAlivePollInterval: Default is as shown indicating it is using the default setting for this specific resource type.

LooksAlivePollInterval: Default is as shown indicating it is using the default setting for this specific resource type.

DeadlockTimeout: Default setting indicating 5 minutes.
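
To illustrate what a deadlock timeout means in practice, here is a small Python sketch of the "start a timer on every call" idea discussed earlier. It is only an analogy under my own assumptions: `call_with_deadlock_timeout` is a hypothetical helper, and the timeouts are fractions of a second rather than the 5-minute default so the example finishes quickly.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as CallTimedOut

def call_with_deadlock_timeout(entry_point, timeout_seconds):
    """Call entry_point; report 'deadlock' if it does not return in time."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(entry_point)
    try:
        return "ok", future.result(timeout=timeout_seconds)
    except CallTimedOut:
        # The worker thread may still be stuck inside entry_point.
        return "deadlock", None
    finally:
        pool.shutdown(wait=False)  # do not block waiting for a hung call

def slow_looks_alive():
    time.sleep(0.5)  # stands in for a resource DLL that stops responding
    return True

print(call_with_deadlock_timeout(lambda: True, 1.0))
print(call_with_deadlock_timeout(slow_looks_alive, 0.05))
```

Notice that the stuck thread cannot be forcibly stopped from the outside. That limitation is a useful way to think about why the whole RHS process is terminated and the resource is moved to a separate monitor, rather than just cancelling the one call.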

Resource deadlock detection was actually introduced in Windows Server 2003 clusters; however, it was not turned on by default. Figure 14 illustrates this.

image

Figure 14

Deadlock detection is turned on by default in Windows Server 2008 (RTM + R2) and cannot be disabled.

So, what is the moral of this story? It is important to understand that cluster resource deadlocks are a symptom of a larger problem. The deadlock itself is not the problem; the cluster is a victim of a problem that can exist either internal to the cluster node itself or somewhere external to the cluster. Applying a logical troubleshooting methodology can help you understand where the problem may exist. But doing that requires a few pieces of knowledge –

  1. Which specific resource is deadlocked?
  2. What is the entry point that is failing?
  3. What is the entry point trying to do?

Using the example provided in Figures 10 and 11, we can see there was a deadlock in the Cluster Name resource during a call to the LooksAlive entry point. Understanding what a LooksAlive check evaluates for a Network Name resource may help identify the problem, which could end up being local to the node or could perhaps involve connectivity to a DNS server on the network. Referring back to KB 914458, the cluster resource DLL (ClusRes.dll) is responsible for Network Name resource health checking (IsAlive\LooksAlive tests). Some of the tests that are conducted include:

  • Determining if the Network Name (NetBIOS Name) is still registered on the network stack on the node. Opening a command prompt on a node and running the nbtstat -n command to view the local NetBIOS name table will show the registrations for cluster Network Name resources. Here is an example of a Network Name supporting a Client Access Point for a File Server –

image

    Inspecting the Parameter data for the resource in the cluster registry hive confirms the information –

image

  • Determining the result of a DNS registration attempt (dynamic DNS is required for this test).
  • If the Require DNS property is set and registration fails, then the IsAlive\LooksAlive test fails.

If all DNS registrations fail and the NetBIOS name is no longer registered locally on the node, the Network Name is no longer considered reachable and the resource is placed in a Failed state. Recovery processes are initiated by the cluster service on the local node first. If local recovery fails, the Group containing the Failed Network Name resource could be moved to another node in the cluster.
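
The Network Name checks above boil down to a small decision, which can be sketched in Python as follows. This is my reading of the rules as stated here, not the actual logic in ClusRes.dll. The function and parameter names are hypothetical, and I am assuming that "registration fails" means no DNS registration attempt succeeded.

```python
def network_name_is_healthy(netbios_registered, dns_results, require_dns):
    """
    netbios_registered: True if the NetBIOS name is still registered locally.
    dns_results: list of booleans, one per DNS registration attempt.
    require_dns: True if the Require DNS property is set on the resource.
    """
    any_dns_ok = any(dns_results)
    # With Require DNS set, a failed DNS registration fails the check outright.
    if require_dns and not any_dns_ok:
        return False
    # Otherwise the name is unreachable only when both mechanisms have failed.
    return netbios_registered or any_dns_ok

print(network_name_is_healthy(True, [False, False], require_dns=False))
print(network_name_is_healthy(False, [False, False], require_dns=False))
```

Writing it out this way makes it clear why a DNS connectivity problem external to the node can be enough to fail a Network Name resource when Require DNS is set.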

What are some things that can be done to help avoid, or at least mitigate, situations where a deadlock may occur? While not set in stone, here are some of my personal recommendations:

  1. Make sure the operating system (OS) is running with the latest service pack plus any post-service pack updates that pertain to Failover Cluster, networking or storage connectivity.
  2. If running highly available Microsoft applications like SQL or Exchange, ensure they are updated as well.
  3. Consult with the storage vendor and ensure the shared storage is updated and configured correctly to work in a Microsoft Failover Cluster. Most storage vendors maintain a current support matrix.
  4. Ensure there are reliable and redundant communications paths between all nodes in the cluster.
  5. Ensure there is reliable connectivity between all nodes in the cluster and Active Directory.
  6. Document all third-party products that are running in the cluster and ensure they are fully updated. Third-party products that interact with storage or network connectivity are always potential suspects.
  7. Use the cluster validation process to help troubleshoot issues seen in a cluster.
  8. If you are a Cluster Administrator, you must be aware of all changes being implemented in the corporate infrastructure to determine potential impacts on highly available services.

Hopefully, you will find this information useful. Thanks again and please come back.

Additional References:

https://blogs.msdn.com/clustering/archive/2009/06/27/9806160.aspx

Chuck Timon
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support