Fun and interesting ways to run MPI jobs on CCS

posted Tuesday, October 03, 2006 6:07 PM by elantz

MS-MPI is identical to MPICH2 in most respects, but one area of difference is MS-MPI’s secure execution and its interaction with the CCS Job Scheduler. This post shows how you can use the CCS Job Scheduler to run your MPI jobs exactly the way you like to run them.

 Note: The examples in this posting use the CCS Job Scheduler’s Command Line Interface, but they all work equally well in the Job Manager (the Scheduler’s graphical user interface) using the following tabs:

· Processors tab: Set the number of processors required (equivalent of /numprocessors: )

· Tasks tab: Enter your mpiexec command

· Advanced tab: Select the specific nodes on which to run the job (equivalent of /askednodes: )

 The Basics

The command to submit an MPI job via the scheduler is of the form:

            job submit [specify cluster resources] mpiexec [mpi arguments] Application [app arguments]

 Specifying Cluster Resources

Specify the cluster resources for your MPI job via one of the following CCS Scheduler Command Line Interface (CLI) arguments:

 Resource Specification

Use for Running MPI Jobs

/numprocessors:

Specify the number of processors to be used for the job. Scheduler will choose nodes to satisfy this processor requirement from either the available compute nodes in the cluster OR the list of compute nodes in /askednodes:. This is the fastest way to get your job running because your job won’t be waiting on specific nodes: you’ve given Scheduler the freedom to run it on the first available processors.

NOTE: The default value is 1. 

 

/askednodes:

The Scheduler treats /askednodes: as a “pool” of possible nodes to draw from until it reaches the number of processors required for the job.

NOTE: If specified without also specifying /numprocessors:, the job will run on a single processor on a single compute node because the default value of /numprocessors: is 1!

 

/numprocessors: AND /askednodes:

Scheduler will select /numprocessors: processors from the available compute nodes in the /askednodes: list. This is usually the best way to run an MPI job on specific compute nodes.

 

mpiexec recognizes several command line arguments, and one of the most interesting for the purposes of this discussion is -hosts. The -hosts argument is of the form:

-hosts <number of hosts> <host1 name> <number of processes on host1> ... <hostN name> <number of processes on hostN>

and is a means of specifying the resources to be used for the MPI job. Under normal circumstances you will not specify -hosts, because the CCS Scheduler automatically creates this argument in an environment variable (CCP_NODES), which MS-MPI mpiexec uses by default. If you specify a -hosts argument, mpiexec will use the one you specify and ignore CCP_NODES. In general, you do NOT want that to happen because it prevents the Scheduler from load balancing your cluster. There are, however, a few cases where -hosts is useful, some of which are illustrated in the examples below.

In general, do NOT use mpiexec’s -hosts argument: it prevents the Scheduler from load balancing the cluster, and it tends to cause authentication errors if the cluster resources requested for the job do not match the node list in the -hosts argument.
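To make the format concrete, here is a minimal Python sketch that parses a node list in the -hosts layout described above and formats it back. It assumes, as the text states, that CCP_NODES holds the same fields as a -hosts argument. The function names are hypothetical, and Python is used only for illustration (it is not part of CCS).

```python
def parse_node_list(value):
    """Parse "2 Node1 2 Node2 1" into [("Node1", 2), ("Node2", 1)]."""
    fields = value.split()
    if not fields:
        raise ValueError("empty node list")
    count = int(fields[0])      # first field: number of hosts
    rest = fields[1:]           # then name/process-count pairs
    if len(rest) != 2 * count:
        raise ValueError("expected %d name/count pairs" % count)
    return [(rest[i], int(rest[i + 1])) for i in range(0, len(rest), 2)]


def format_node_list(nodes):
    """Inverse of parse_node_list: build the text that follows -hosts."""
    parts = [str(len(nodes))]
    for name, procs in nodes:
        parts += [name, str(procs)]
    return " ".join(parts)
```

For example, parse_node_list("2 Node1 2 Node2 1") gives [("Node1", 2), ("Node2", 1)], and format_node_list turns that list back into the original string.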

 

The Simplest Possible MPI Job

             job submit /numprocessors:4 mpiexec MyApp

 This command runs the MPI application, MyApp, on the first 4 available processors in the cluster. By default the job runs in exclusive mode, so the Scheduler will not attempt to run any other jobs on the compute nodes it chooses. Note that the default for tasks is non-exclusive, so other tasks from the same job can run on a given node. This is why a listing of this job’s environment variables would include CCP_EXCLUSIVE=false: CCP_EXCLUSIVE refers to task exclusivity.

 Run Your MPI Job Just the Way You Like (Some Examples)

1. Fastest, easiest way to run MyApp.exe on any 5 processors in the cluster and send the output to a shared folder (named “fileshare”) on the headnode
Use /numprocessors: in job submit

a.     job submit /numprocessors:5 /stdout:\\headnode\fileshare\out.txt mpiexec MyApp.exe

2.     I’ve got to run MyApp on 2 specific compute nodes where I’ve installed special DLLs
Use /askednodes: & /numprocessors: in job submit

a.     job submit /askednodes:Node1,Node2 /numprocessors:3 /stdout:\\headnode\fileshare\out.txt mpiexec MyApp.exe

which, assuming each node has 2 procs, will run MyApp.exe as 2 processes on Node1 and 1 process on Node2. Other jobs will not be able to use the remaining processor on Node2 because jobs are marked exclusive by default.

3.     I’ve got to run MyApp on 2 specific compute nodes where I’ve installed special DLLs and control the number of processes run on each node
Use /askednodes: and /numprocessors: in job submit and -hosts in mpiexec command

a.     job submit /askednodes:Node1,Node2,Node3 /numprocessors:6 /stdout:\\headnode\fileshare\out.txt mpiexec -hosts 3 Node1 2 Node2 2 Node3 1 MyApp.exe

which, assuming each node has 2 procs, will run MyApp.exe as 2 processes each on Node1 and Node2 and 1 process on Node3. Note that the total number of MPI processes (5) does not equal /numprocessors: (6), and it doesn’t have to. The goal here is to choose a /numprocessors: value large enough that the Scheduler is forced to use all of the /askednodes:.

            Or,

            Use a VBScript (or equivalent) to run MyApp

b. job submit /askednodes:Node1,Node2 /numprocessors:4 /stdout:\\headnode\fileshare\out.txt cscript //Nologo MyScript MyApp.exe

where MyScript will

· Get the name of the application it is to run as a command line argument (MyApp.exe in this case)

· Get the list of nodes that Scheduler assigned to the job by grabbing the CCP_NODES environment variable

· Create and execute a shell command such as “mpiexec -hosts 2 Node1 2 Node2 1 MyApp.exe”

· Pipe STDOUT, STDERR from the child process back to STDOUT, STDERR of the script

· Pass MyApp.exe’s error code back to the error code of the script
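The steps above can be sketched as follows. This is an illustration in Python rather than the VBScript the post uses, and it assumes CCP_NODES uses the same field layout as -hosts. The names build_command, main, and the overrides mapping (per-node process counts) are hypothetical.

```python
import subprocess
import sys


def build_command(app, ccp_nodes, overrides=None):
    """Build the mpiexec argument list from the scheduler's node list.

    overrides maps a node name to the process count to use on that node;
    nodes not listed keep the count the scheduler assigned.
    """
    overrides = overrides or {}
    fields = ccp_nodes.split()
    hosts = fields[1::2]     # node names
    counts = fields[2::2]    # scheduler-assigned process counts
    args = ["mpiexec", "-hosts", str(len(hosts))]
    for host, count in zip(hosts, counts):
        args += [host, str(overrides.get(host, count))]
    return args + [app]


def main(argv, environ):
    """Run argv[1] under mpiexec and return its exit code."""
    if len(argv) < 2:
        sys.stderr.write("usage: myscript Application\n")
        return 2
    ccp_nodes = environ.get("CCP_NODES")
    if not ccp_nodes:
        sys.stderr.write("CCP_NODES is not set\n")
        return 2
    # The child inherits this script's STDOUT and STDERR, so no explicit
    # piping is needed; call() returns the child's exit code.
    return subprocess.call(build_command(argv[1], ccp_nodes))
```

A script built this way would end with sys.exit(main(sys.argv, os.environ)) (after importing os) so that the exit code of mpiexec becomes the exit code of the script. With CCP_NODES set to "2 Node1 2 Node2 2" and overrides of {"Node2": 1}, build_command produces the arguments for "mpiexec -hosts 2 Node1 2 Node2 1 MyApp.exe".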

 

4. I’ve got to run MyApp on any 2 compute nodes but I need to run a single process per node because MyApp is multi-threaded and will consume all the procs on the nodes where it runs.
Use /numprocessors: in job submit and either:

a.     Submit with a bat file that uses string substitution to replace the number of procs with “1” for each node in the CCP_NODES environment variable. For example, if all the compute nodes have 4 processors, the following simple substitution will work (note the space before the ‘4’):

mpiexec -hosts %CCP_NODES: 4= 1% MyApp.exe

Note: This is a hack, but it works pretty well for simple node naming schemes.

b.     The non-hack solution: submit with a script file like MyScript above, which replaces the number of procs with “1” for each node in the CCP_NODES environment variable and runs mpiexec

· See the attached file for an example script that does all this and more
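As a rough illustration of the non-hack approach (not the attached script, which is not reproduced here), the rewrite of CCP_NODES could look like this in Python. It assumes CCP_NODES uses the -hosts field layout; the function name is hypothetical.

```python
def one_process_per_node(ccp_nodes):
    """Rewrite "2 Node1 4 Node2 4" as "2 Node1 1 Node2 1"."""
    fields = ccp_nodes.split()
    hosts = fields[1::2]            # keep the node names, drop the counts
    parts = [str(len(hosts))]
    for host in hosts:
        parts += [host, "1"]        # exactly one MPI process per node
    return " ".join(parts)


# Contrast with the bat-file substitution, which only rewrites " 4":
mixed = "2 Node1 2 Node2 4"
hacked = mixed.replace(" 4", " 1")  # "2 Node1 2 Node2 1": Node1 still gets 2
```

Because it works on fields rather than on text patterns, this version does not depend on every node having the same number of processors or on how the nodes are named.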

 

5.     I want to run MyApp in a process with custom environment variables [MyEnvironmentVariable is set to “CCS_Rocks” and MyEnvVar2 is set to “and Rolls”]
Use the -env argument in mpiexec

a.     job submit /numprocessors:4 /stdout:\\headnode\fileshare\out.txt mpiexec -env MyEnvironmentVariable CCS_Rocks -env MyEnvVar2 "and Rolls" MyApp.exe

6.     I want to start 2 MPI applications on the same MPI global communicator

a.     job submit /askednodes:Node1,Node2 /numprocessors:4 /stdout:\\headnode\fileshare\out.txt mpiexec -hosts 1 Node1 2 MyApp1.exe : -hosts 1 Node2 2 MyApp2.exe

Note1: Assuming 2 procs per node, this command line will start 2 MyApp1 processes on Node1 and 2 MyApp2 processes on Node2, whose ranks will be 0 through 3 in the same MPI_COMM_WORLD.
Note2: You’ll need to launch mpiexec with a script if you want the Scheduler to be free to load balance the job to any compute node, because you must tailor the two -hosts arguments to the contents of the CCP_NODES environment variable.
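The kind of script Note2 calls for might split the scheduler's node list between the two applications along these lines. This is a sketch under the assumption that CCP_NODES uses the -hosts field layout; split_hosts and its parameters are hypothetical names.

```python
def split_hosts(ccp_nodes, app1, app2, nodes_for_app1):
    """Give the first nodes_for_app1 nodes to app1 and the rest to app2."""
    fields = ccp_nodes.split()
    pairs = list(zip(fields[1::2], fields[2::2]))   # (name, count) pairs
    if not 0 < nodes_for_app1 < len(pairs):
        raise ValueError("each application needs at least one node")

    def section(group, app):
        # One "-hosts n name count ... app" section of the command line.
        args = ["-hosts", str(len(group))]
        for host, count in group:
            args += [host, count]
        return args + [app]

    first = section(pairs[:nodes_for_app1], app1)
    second = section(pairs[nodes_for_app1:], app2)
    return ["mpiexec"] + first + [":"] + second
```

With a node list of "2 Node1 2 Node2 2" and nodes_for_app1 set to 1, joining the result with spaces reproduces the mpiexec command in the example above.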

A Good Way to Get Into Trouble

1. Specify both /askednodes: in the job submit command and -hosts in the mpiexec command
You’ll have to manually keep the two node lists in sync and any errors will cause an authentication failure. 
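If you do need both lists, a quick check before submitting can catch a mismatch. The following Python sketch (hypothetical function name, not a CCS tool) reports any node named in a -hosts argument that is missing from the /askednodes: list. It compares names without regard to case, on the assumption that Windows host names are case-insensitive.

```python
def hosts_outside_asked_nodes(asked_nodes, hosts_arg):
    """Return the -hosts node names that are not in the /askednodes: list.

    asked_nodes is the comma-separated value given to /askednodes:,
    hosts_arg is the text that follows -hosts on the mpiexec command line.
    """
    asked = {n.strip().lower() for n in asked_nodes.split(",") if n.strip()}
    hosts = hosts_arg.split()[1::2]     # node names are every other field
    return [h for h in hosts if h.lower() not in asked]
```

For example, checking "Node1,Node2" against "2 Node1 2 Node3 1" returns ["Node3"], while a matching pair of lists returns an empty list.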

 Documentation of MS-MPI mpiexec Arguments

Full documentation on MS-MPI mpiexec arguments is here:  https://technet2.microsoft.com/WindowsServer/en/library/7876c216-b704-473c-b97f-e8a15c67551b1033.mspx?mfr=true