Clusrunning with Windows HPC Server 2008

09/17/2008

One of our most popular features in the Compute Cluster Pack was clusrun (known to you GUI users as “Remote Command Execution”), which allowed you to run a command line command across a set of cluster nodes in parallel, with their output piped back to you on the client. Not content to rest on our laurels, we’ve made some additions to clusrun’s capabilities in Windows HPC Server 2008. I’ll dig into some of them below.

But First, Clusrun Basics

At a basic level, clusrun runs a job with one task in it for each node that you specify. This job completely bypasses the queue so that it starts right away, and the tasks pipe their output back to the client machine. For this to work, a few requirements must be met:

· All of the target machines must be nodes in the cluster (with the HPC pack installed and able to communicate with the head node), but they don’t have to be in the “Online” state

· Your compute nodes must be able to write to a file share on the client computer; you can test this by logging into a node and attempting to connect to \\client\c$

· Your job scheduler needs to be working

Assuming these requirements are met, you can run a clusrun command either from the command line (using the clusrun command) or from the HPC Cluster Manager (by right-clicking some nodes and selecting “Run Command . . .”). As a simple example, try running clusrun /all hostname.exe; each of the nodes in your cluster will print its name to your client:

PS> clusrun /all hostname.exe

Enter the password for 'REDMOND\jbarnard' to connect to 'JBarnardHN':

Remember this password? (Y/N)Y

-------------------------- JBARNARDCN01 returns 0 --------------------------

JBARNARDCN01

-------------------------- JBARNARDCN03 returns 0 --------------------------

JBARNARDCN03

-------------------------- JBARNARDHN returns 0 --------------------------

JBarnardHN

-------------------------- JBARNARDCN02 returns 0 --------------------------

JBARNARDCN02

-------------------------- Summary --------------------------

4 Nodes succeeded

0 Nodes failed
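If you want to post-process this output in a script, the per-node headers make it fairly easy to split apart. The following is a minimal Python sketch, assuming the output looks exactly like the transcript above (a dashed "NODE returns N" header before each node's lines, and a dashed "Summary" header at the end). The function name parse_clusrun_output is made up for this illustration and is not part of the HPC Pack.

```python
import re

# Header lines look like: "------ JBARNARDCN01 returns 0 ------"
HEADER = re.compile(r"^-+\s*(\S+) returns (-?\d+)\s*-+$")
# The final tally is introduced by: "------ Summary ------"
SUMMARY = re.compile(r"^-+\s*Summary\s*-+$")


def parse_clusrun_output(text):
    """Return {node_name: (exit_code, [output lines])}.

    Assumes the default clusrun layout shown in the transcript above.
    Blank lines are dropped, and anything printed before the first
    node header (such as the password prompt) is ignored.
    """
    results = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if SUMMARY.match(line):
            break  # what follows is the succeeded/failed tally
        match = HEADER.match(line)
        if match:
            current = match.group(1)
            results[current] = (int(match.group(2)), [])
        elif current is not None:
            results[current][1].append(line)
    return results
```

Feeding it the transcript above would give one entry per node, so finding the nodes that failed becomes a matter of filtering on a non-zero exit code.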

So What’s New?

There are a lot of new options for clusrun in HPCS 2008. These include:

New Formatting Options: Sorted or Interleaved Output

By default, clusrun returns output as each node completes the command. But you can override this by using either the /sorted or /interleaved flags.

/Sorted prints node output in alphabetical order, making it easier to find a specific node. /Interleaved prints out lines of output as they come back, which is great for processing with a script or for determining just where things are going wrong.
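To make the difference between the two modes concrete, here is a small Python sketch that models them on a list of (node, line) pairs in arrival order. It is only a conceptual illustration, not how clusrun is implemented, and the "node: line" prefix used for the interleaved view is a hypothetical format chosen for the example rather than clusrun's actual output.

```python
from collections import defaultdict


def sorted_view(events):
    """Group lines by node and list nodes alphabetically (the idea behind /sorted)."""
    by_node = defaultdict(list)
    for node, line in events:
        by_node[node].append(line)
    # Plain string sort; node names are assumed to share the same letter case.
    return [(node, by_node[node]) for node in sorted(by_node)]


def interleaved_view(events):
    """Emit every line in arrival order, tagged with its node (the idea behind /interleaved).

    The "node: line" tag is a made-up format for this sketch.
    """
    return [f"{node}: {line}" for node, line in events]
```

In the sorted view you give up arrival order to get a predictable layout, while the interleaved view keeps arrival order so you can see which node printed what, and when.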

Picking Your Nodes: Exclude, Job, Task

We’ve got some great new options for picking your nodes, including the ability to exclude a set of nodes with the /exclude flag. So the command “clusrun /all /exclude:Node14 ipconfig” will return the IP configuration of every node other than Node14.

Next up are the /job and /task options, which are my personal favorites! They allow you to run a clusrun command against all of the nodes which are (or were) assigned to a particular job or task. For example, “clusrun /task:10.4 del /q SomeFile.txt” will delete SomeFile.txt from every node that ran task #10.4.
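If you drive clusrun from a script, it can help to assemble the argument list in one place. The sketch below is a hypothetical Python helper (build_clusrun_args is not part of the HPC Pack) that only uses the options mentioned in this post. The /exclude:NAME and /task:ID forms come from the examples above; the /job:ID form is an assumption made by analogy with /task.

```python
def build_clusrun_args(command, all_nodes=False, exclude=None, job=None, task=None):
    """Build an argument list for clusrun without running anything.

    command   -- list of strings: the program to run on each node plus its arguments
    all_nodes -- add /all
    exclude   -- value for /exclude:, e.g. "Node14"
    job       -- value for /job: (assumed to mirror the /task: syntax)
    task      -- value for /task:, e.g. "10.4"
    """
    args = ["clusrun"]
    if all_nodes:
        args.append("/all")
    if exclude is not None:
        args.append(f"/exclude:{exclude}")
    if job is not None:
        args.append(f"/job:{job}")
    if task is not None:
        args.append(f"/task:{task}")
    args.extend(command)
    return args
```

For example, build_clusrun_args(["ipconfig"], all_nodes=True, exclude="Node14") yields the same arguments as the /exclude example above. On a machine where clusrun is available, the resulting list could be handed to subprocess.run.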

History Tracking

Clusrun jobs now live in the database just like regular jobs, making it easier to track what you’ve done and to uncover failures. You can easily find them from the command line by running job list /jobname:"Remote command", or in the HPC Cluster Manager by selecting the “Clusrun Commands” node in the navigation pane. Each node in the run will have a separate task (including exit code, error message, etc.), allowing you to dig into the causes of failures more easily.

Happy Clusrunning!

-Josh