An issue came to our attention after Windows HPC Server 2008 shipped regarding the way we set affinity on the processes within a job. There are actually two places where we set affinity:
1. The Node Manager (the service on each node responsible for starting jobs and tasks) sets processor affinity on each task to prevent that task from using processors which are not assigned to it.
2. MPIEXEC (which is used to start MS-MPI applications) can, when the -affinity flag is provided, set affinity on all ranks within the MPI application.
The problem we encountered is this: because of the way affinity works on Windows Job Objects (which we use to run tasks) and the processes within them, you cannot set affinity at both layers. That means that for MPI tasks which are allocated less than an entire node, the -affinity flag on the MPIEXEC command line ends up being ignored, since the scheduler has already set affinity and it cannot be set in two places. This caused problems for some applications, especially those developed against the Compute Cluster Pack (which didn't set affinity at all).
The problem is particularly serious for jobs which specify the -exclusive option. When a job specifies -exclusive, it is allocated an entire node, but the scheduler still sets affinity on the tasks within the job. So an exclusive job with a 4-core task that is assigned an 8-core node would cause the scheduler to affinitize the task to only 4 cores. This leaves the other 4 cores idle if there are no other tasks in the job, and is awfully confusing for some people and applications! Such a job would also not have MPI rank affinity, even if the -affinity flag was specified.
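To make the 4-of-8 example concrete, here is a minimal sketch of the affinity mask arithmetic. It is only an illustration: the helper name `affinity_mask` and the choice of cores 0 through 3 are assumptions made for this example, not what the Node Manager actually does internally.

```python
def affinity_mask(core_ids):
    """Build a processor-affinity bitmask with one bit set per allowed core."""
    mask = 0
    for core in core_ids:
        mask |= 1 << core
    return mask


node_cores = 8
task_cores = [0, 1, 2, 3]  # hypothetical: the four cores picked for the task

task_mask = affinity_mask(task_cores)         # cores the task may run on
node_mask = affinity_mask(range(node_cores))  # every core on the node
idle_mask = node_mask & ~task_mask            # cores the task can never use

print(f"task mask: {task_mask:08b}")  # task mask: 00001111
print(f"idle mask: {idle_mask:08b}")  # idle mask: 11110000
```

With no other tasks in the exclusive job, nothing is ever scheduled onto the four cores in the idle mask.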
Our solution is to introduce a new cluster parameter called AffinityType. AffinityType has three possible settings which work as follows:
· AllJobs: When AffinityType is set to AllJobs, the Node Manager will set affinity on any task that isn't allocated an entire node. This is the behavior described above, and is probably the best choice for applications which may run multiple instances per node (e.g. Parameter Sweeps and SOA Jobs) and want these instances to be isolated from each other.
· NonExclusiveJobs (Default): With this setting, the Node Manager will not set affinity on jobs which are marked as exclusive. This is the ideal choice for jobs with only 1 task, since that task will be able to take advantage of all cores on the nodes it is assigned. We've made this the new default since it provides what is generally the preferred behavior for MPI tasks, which are the most likely to be sensitive to affinitization. With this setting, MPI tasks in exclusive jobs can take advantage of MPIEXEC's -affinity flag even if they are not allocated an entire node.
· NoJobs: With this setting, the Node Manager will never set affinity on any task. This is an excellent choice for those running MPI tasks who want to make sure they can take advantage of MPIEXEC's -affinity flag even when jobs may share nodes. This is also useful for applications which want to set their own affinity.
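The three settings can be summarized as a small decision function. This is a sketch of the rules as described in this post, not the Node Manager's actual code; the Python names are hypothetical, and it assumes that under NonExclusiveJobs a non-exclusive job is treated exactly as it would be under AllJobs.

```python
from enum import Enum


class AffinityType(Enum):
    """The three values of the AffinityType cluster parameter."""
    ALL_JOBS = "AllJobs"
    NON_EXCLUSIVE_JOBS = "NonExclusiveJobs"  # the default
    NO_JOBS = "NoJobs"


def node_manager_sets_affinity(setting, job_is_exclusive, task_cores, node_cores):
    """Return True if the Node Manager would set affinity on the task."""
    if setting is AffinityType.NO_JOBS:
        return False
    if setting is AffinityType.NON_EXCLUSIVE_JOBS and job_is_exclusive:
        return False
    # Assumption: in every remaining case, affinity is set only when the
    # task is allocated less than the entire node.
    return task_cores < node_cores


def mpi_affinity_flag_honored(setting, job_is_exclusive, task_cores, node_cores):
    """The -affinity flag only takes effect if affinity was not already set."""
    return not node_manager_sets_affinity(
        setting, job_is_exclusive, task_cores, node_cores
    )


# The scenario from above: an exclusive job with a 4-core task on an 8-core node.
for setting in AffinityType:
    pinned = node_manager_sets_affinity(setting, True, 4, 8)
    print(f"{setting.value}: Node Manager sets affinity = {pinned}")
```

For that exclusive 4-core task, only AllJobs results in the Node Manager setting affinity; under the other two settings the task can use all 8 cores and the -affinity flag is honored.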
Note that Windows Server 2008 R2 will allow process affinity to be set at both the Windows Job Object level and the process level simultaneously. So hopefully in v3 of the HPC Pack the conflict between these two settings will no longer be an issue.
You can learn more about how the new setting works (it is documented under the name AffinityMode in the scheduler API reference) here: http://msdn.microsoft.com/en-us/library/microsoft.hpc.scheduler.properties.affinitymode(vs.85).aspx