Cluster Installation Time Out Issues

Hello, my name is Steven Andress and I am a Support Escalation Engineer (SEE) in the Platforms Support group here at Microsoft. One of the technologies I support is Windows Server Failover Clustering (WSFC). I’d like to take a minute to provide some information on an issue we are seeing with some frequency in Windows Server 2008 and Windows Server 2008 R2 Failover Clusters. The issue surfaces when creating a cluster or adding nodes to an existing cluster.

One of the causes described later in this post is a duplicate account name in Active Directory. Keep in mind that, to count as a duplicate, the name must match completely, including any symbols. For example, I have a user called StevenAndress and a machine called StevenAndress. This is not a duplicate. The reason is that in Active Directory, all computer accounts have a dollar sign ($) appended to the name. So from an Active Directory standpoint, I have a user object called StevenAndress and a computer object called StevenAndress$. These are not duplicates. A duplicate would exist only if the user name were also StevenAndress$.
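To make the comparison concrete, here is a minimal Python sketch of the rule described above. The function name is hypothetical, and the case-insensitive comparison is an assumption made for this illustration; nothing here comes from a cluster or Active Directory API.

```python
def collides_with_node(account_name: str, node_name: str) -> bool:
    """Return True if account_name equals the node's computer account name.

    The computer account name is the node name with "$" appended.
    Comparing case-insensitively is an assumption of this sketch.
    """
    computer_account = node_name + "$"
    return account_name.casefold() == computer_account.casefold()


# A user named StevenAndress does not collide with a machine named StevenAndress.
print(collides_with_node("StevenAndress", "StevenAndress"))   # False
# A user named StevenAndress$ does.
print(collides_with_node("StevenAndress$", "StevenAndress"))  # True
```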

If a new cluster is being created and multiple nodes are specified in the Create Cluster Wizard, the creation process fails due to a timeout. Creating a single-node cluster with any one of the servers that will participate in the cluster, on the other hand, succeeds. If you then try to add a node using the Add Node Wizard, that process times out as well.

If the Create Cluster Wizard is used to create the cluster, the following output is displayed:

Configuring node ‘name’

---------------------------------------

 12% Validating cluster state on node ‘name’.

 25% Getting current node membership of cluster ‘name’.

 37% Adding node ‘name’ to Cluster configuration data.

 50% Validating installation of the Microsoft Failover Cluster Virtual Adapter on node ‘name’.

 62% Validating installation of the Cluster Disk Driver on node ‘name’.

 75% Configuring Cluster Service on node ‘name’.

 87% Starting Cluster Service on node ‘name’

100% Waiting for notification that node ‘name’ is a fully functional member of the cluster. This phase has failed for Cluster object 'name' with an error status of 1460 (0x000005B4).

Cleaning up ‘name’.

 


If you use the Cluster.exe /ADD [NODE] command from an elevated command prompt to add a node, you will see the following error:

"System error 1460 has occurred (0x000005b4) This operation returned because the timeout period expired"

You will see the same error if you use the Add-ClusterNode [[-Name] <cname>] cmdlet from an elevated Windows PowerShell Modules prompt to add a node.
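Both forms of the message carry the same code, once in decimal and once in hexadecimal. As a rough illustration, the Python sketch below pulls the code out of either message and checks that the two representations agree. The function name and the one-entry lookup table are made up for this example; 1460 is the Win32 code ERROR_TIMEOUT, which matches the "timeout period expired" text above.

```python
import re

# Hypothetical one-entry lookup table for this example only.
KNOWN_CODES = {1460: "ERROR_TIMEOUT"}

# A decimal number, followed later by the same code in hex inside parentheses.
CODE_PATTERN = re.compile(r"(\d+)\D*\((0x[0-9A-Fa-f]+)\)")


def extract_error_code(message):
    """Return (code, symbolic_name) found in message, or None if no code is present."""
    match = CODE_PATTERN.search(message)
    if match is None:
        return None
    decimal_code = int(match.group(1))
    hex_code = int(match.group(2), 16)
    if decimal_code != hex_code:
        raise ValueError("decimal and hexadecimal codes disagree")
    return decimal_code, KNOWN_CODES.get(decimal_code, "unknown")


wizard_text = "with an error status of 1460 (0x000005B4)."
cli_text = "System error 1460 has occurred (0x000005b4)"
print(extract_error_code(wizard_text))
print(extract_error_code(cli_text))
```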

 

NOTE: We recommend using Windows PowerShell cmdlets, based on the following statement from https://technet.microsoft.com/en-us/library/dd443539(WS.10).aspx:

 

If you have scripts based on Cluster.exe, you can continue to use them in Windows Server 2008 R2, but we recommend that you rewrite them with Windows PowerShell cmdlets. In future releases, Windows PowerShell will be the only command-line interface available for failover clusters.

 

To learn more about how Cluster.exe commands map to Windows PowerShell cmdlets, please visit the following link:

https://blogs.technet.com/b/josebda/archive/2010/09/23/mapping-cluster-exe-commands-to-windows-powershell-cmdlets-for-failover-clusters-extended-edition.aspx

 

Based on our current case data, we’ve identified two common causes for this behavior.

The first cause is a duplicate account name in Active Directory (AD) that matches a node name. The node name is the name of a server in the cluster. When a duplicate exists, the process finds the non-computer account and tries to join it. Because that account is not the actual node, the connection times out.

In most cases the duplicate account is created by an application that is AD integrated. The most effective way to find the duplicate name, if it exists, is to use LDIFDE.exe from a command prompt run as Administrator. Here is the command to run:

ldifde -f output.ldf -r "(samAccountName=W2K8-R2*)"

In the example above:

-f = the name of the file to write to

-r = the LDAP search filter

W2K8-R2* = match every account whose name starts with W2K8-R2, which is the node name in this example.

This creates the file output.ldf in the current directory, which you can open in Notepad. If an entry in the file is a computer account, you will see the following information:

objectclass: computer

servicePrincipalName: StevenAndress$

If it is a user or service account, it will not have the above, but would have:

userPrincipalName: StevenAndress$

The file also shows the OU in which the object currently resides. To allow the node to join, you must rename the user or service account: go to the OU listed, find the object, and rename it.
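Reviewing the file by hand works fine for one or two entries. If the search returns many, a short script can do the sorting. The Python sketch below is only an illustration under several assumptions: it handles the basic "attribute: value" LDIF layout, it classifies an entry as a computer by looking for "computer" among its objectClass values, and it derives the container by a naive split of the dn. All function names and the sample data (including the example.com domain) are invented.

```python
def parse_ldif(text):
    """Split LDIF text into entries: lowercase attribute name -> list of values."""
    entries, current, last_attr = [], {}, None
    for raw in text.splitlines():
        if not raw.strip():                      # a blank line ends the current entry
            if current:
                entries.append(current)
            current, last_attr = {}, None
        elif raw.startswith("#"):                # comment line
            continue
        elif raw.startswith(" ") and last_attr:  # folded continuation line
            current[last_attr][-1] += raw[1:]
        else:
            attr, sep, value = raw.partition(":")
            if not sep:
                continue
            if value.startswith(":"):            # "attr:: ..." is base64; left undecoded
                value = value[1:]
            last_attr = attr.strip().lower()
            current.setdefault(last_attr, []).append(value.strip())
    if current:
        entries.append(current)
    return entries


def describe(entry):
    """Summarize one entry: account name, kind, and the container it lives in."""
    classes = {c.lower() for c in entry.get("objectclass", [])}
    kind = "computer" if "computer" in classes else "non-computer"
    dn = entry.get("dn", [""])[0]
    # Naive split: assumes the first RDN contains no escaped comma.
    container = dn.split(",", 1)[1] if "," in dn else ""
    return {"sam": entry.get("samaccountname", [""])[0], "kind": kind, "container": container}


def conflicting_accounts(ldif_text, node_name):
    """Return non-computer entries whose sAMAccountName equals node_name + '$'."""
    target = (node_name + "$").casefold()
    return [d for d in map(describe, parse_ldif(ldif_text))
            if d["kind"] == "non-computer" and d["sam"].casefold() == target]


# Invented sample data, shaped like an LDIF export.
SAMPLE_LDIF = """\
dn: CN=W2K8-R2-N1,CN=Computers,DC=example,DC=com
changetype: add
objectClass: top
objectClass: user
objectClass: computer
sAMAccountName: W2K8-R2-N1$

dn: CN=AppService,OU=Service Accounts,DC=example,DC=com
changetype: add
objectClass: top
objectClass: user
sAMAccountName: W2K8-R2-N1$
"""

for hit in conflicting_accounts(SAMPLE_LDIF, "W2K8-R2-N1"):
    print(hit["sam"], "found in", hit["container"])
```

To run it against a real export, read output.ldf into a string first; the file encoding depends on how the export was produced, so that step is left out here.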

The second common cause of this issue is Anti-Virus/Firewall (Security) applications. These applications appear to be closing the required network endpoints. You can determine whether this is the likely cause by generating a cluster log file and reviewing it. Generate the log from an administrator command prompt using the following syntax:

CLUSTER [[/CLUSTER:]cluster-name] LOG <options>

<options> = /G[EN[ERATE]] [/COPY[:"directory"]] [/NODE:"node-name"] [/SPAN[MIN[UTE[S]]]:min]

 

Cluster log /gen generates a cluster log and places it in the %systemroot%\Cluster\Reports directory.

Once you have generated the log, open it and go to the bottom. Search upward for the word "graceful". You may see entries similar to the following:

00000e40.00000c24::2011/04/27-16:01:04.513 INFO [CHANNEL 1.1.1.1:~52099~] graceful close, status (of previous failure, may not indicate problem) ERROR_SUCCESS(0)

00000e40.00000c24::2011/04/27-16:01:04.513 WARN mscs::ListenerWorker::operator (): GracefulClose(1226)' because of 'channel to remote endpoint 1.1.1.1:~52099~ is closed

Note: IP Address changed to protect the innocent
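If the log is large, the bottom-up search can be scripted. The Python sketch below is a minimal illustration, assuming the endpoint is always written as address:~port~ the way it appears in the two entries above. The function name is hypothetical, and the sample lines are copied from this post.

```python
import re

# Matches an endpoint written as 1.1.1.1:~52099~ (format assumed from the entries above).
ENDPOINT = re.compile(r"(\d{1,3}(?:\.\d{1,3}){3}):~(\d+)~")


def graceful_closes(lines):
    """Walk the log from the bottom up and collect lines that mention 'graceful'.

    Returns a list of (line_number, endpoint), where endpoint is (ip, port)
    or None if no endpoint could be recognized on that line.
    """
    hits = []
    for index in range(len(lines) - 1, -1, -1):
        line = lines[index]
        if "graceful" not in line.lower():
            continue
        match = ENDPOINT.search(line)
        endpoint = (match.group(1), int(match.group(2))) if match else None
        hits.append((index + 1, endpoint))
    return hits


SAMPLE_LOG = [
    "00000e40.00000c24::2011/04/27-16:01:04.513 INFO [CHANNEL 1.1.1.1:~52099~] graceful close, status (of previous failure, may not indicate problem) ERROR_SUCCESS(0)",
    "00000e40.00000c24::2011/04/27-16:01:04.513 WARN mscs::ListenerWorker::operator (): GracefulClose(1226)' because of 'channel to remote endpoint 1.1.1.1:~52099~ is closed",
]

for line_number, endpoint in graceful_closes(SAMPLE_LOG):
    print(line_number, endpoint)
```

For a real log, read the file into a list of lines first and pass that list in. Seeing the same remote endpoint repeated across many hits is a hint about which connection to examine, not proof of the cause.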

These entries indicate that an endpoint has been closed at the application layer and, as a result, cluster communications fail. The only way to conclusively determine that an Anti-Virus/Firewall (Security) application is the culprit is to fully uninstall it. Disabling the service(s) will not suffice, because kernel-level drivers may still be loaded in memory even with the service(s) disabled. If removing the Anti-Virus/Firewall (Security) application resolves the failure, you should contact the application vendor.

While there can be other causes of this issue, the ones described here are the most frequent and easiest causes to eliminate.

Hopefully this information will help get you past this issue quickly so you can move on to more pressing needs.

 

Steven Andress

Senior Support Escalation Engineer

Microsoft Enterprise Platforms Support