As you may already know, with the release of Windows Compute Cluster Server 2003 (CCS) we included the Microsoft Message Passing Interface (MS‑MPI), an implementation fully compatible with the reference MPICH2. CCS also integrates with Active Directory, enabling role-based security for administrators and users, and uses the Microsoft Management Console (MMC) to provide a familiar administrative and scheduling interface.
Microsoft CCS can use Gigabit Ethernet (GbE), InfiniBand (IB), Myrinet, Quadrics, or other specialty high-speed fabrics as interconnects for high-performance computing. The majority of HPC clusters use GbE, but more and more customers these days prefer the high bandwidth and low latency of interconnects such as InfiniBand. Our implementation of CCS supports all WSD-compatible fabrics.
This is one of those topics where you wake up some days wondering, “How does this thing actually work?” It seems a simple question, but after a couple of discussions with the developers you realize that the answer is not very clear, or that “some magic is happening somewhere!” If you are trying to find answers to the following questions, then listen up…
- What magic happens during MPI initialization?
- What are business cards, and how do MPI apps get them for other nodes?
- How does the MPI network work without name resolution?
What’s more interesting is that when we checked our test clusters with IB cards, we found that the DNS and default gateway settings were not configured on the IB network interface cards (NICs); there was no name resolution mechanism on the MPI network at all. So how is MPI traffic forced onto that network using the MPICH subnet mask, without name resolution? Here is the sequence:
- A user submits an MPI job.
- The Job Scheduler allocates the number of nodes (or processors) requested for the job.
- The first allocated node runs mpiexec with all of the required parameters passed by the Job Scheduler (ccp_nodes, ccp_mpi_network, etc.).
- mpiexec kicks off and forms a tree: it first talks to the msmpi service running on the same node, which spawns the smpd manager, and the smpd manager then talks to the msmpi services running on the other allocated compute nodes. This is the one place where name resolution is needed, because the smpd manager on the first node must reach the msmpi services/smpds on the allocated nodes.
- Each MPI application starts up and queries all the LOCAL addresses for its node, then registers this information as a “business card” in a shared database inside the smpd tree. The business card lists every available interface on the node.
- When MPI rank x running on node X needs to connect to MPI rank y running on node Y, it gets y’s business card from the smpd tree and connects directly to y, using the address list in the card.
- Rank x filters y’s addresses using the MPICH_NETMASK environment variable. This variable is set by mpiexec from CCP_MPI_NETMASK, which in turn is set as a cluster-wide variable by the CCP management services. That variable is set when you select the networks in the To Do List; it points to the MPI network if one is selected, or to the private network if no MPI network is present.
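The address filtering in that last step can be sketched as follows. This is a minimal illustration, not MS‑MPI’s actual code: the function name, the address lists, and the netmask value are all made up, and the `subnet/mask` string format is an assumption about how the variable is expressed.

```python
import ipaddress
import os

def filter_addresses(addresses, netmask):
    """Keep only the addresses that fall inside the given subnet.

    `netmask` is assumed to be in "network/mask" form,
    e.g. "10.0.0.0/255.255.0.0".
    """
    net_addr, mask = netmask.split("/")
    network = ipaddress.ip_network(f"{net_addr}/{mask}", strict=False)
    return [a for a in addresses if ipaddress.ip_address(a) in network]

# Hypothetical business card for node Y: every interface it registered.
business_card_y = ["192.168.1.20", "10.0.5.20", "172.16.0.20"]

# Netmask as mpiexec might set it from CCP_MPI_NETMASK (made-up value).
os.environ["MPICH_NETMASK"] = "10.0.0.0/255.255.0.0"

usable = filter_addresses(business_card_y, os.environ["MPICH_NETMASK"])
print(usable)  # only the address on the 10.0.x.x (MPI) network remains
```

Note that the filtering works on raw IP addresses only; no hostname ever needs to be resolved.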
So the bottom line is that we do not need name resolution on the MPI network, as long as the nodes can resolve each other’s names through the private or public network.
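Putting it together, the business-card exchange can be modeled as a simple shared dictionary. This is purely illustrative; the real smpd protocol and database are more involved, and every name, rank, and address below is invented for the example.

```python
import ipaddress

# Toy stand-in for the shared business-card database in the smpd tree.
business_cards = {}

def register(rank, addresses):
    """A rank publishes every local interface address of its node."""
    business_cards[rank] = list(addresses)

def lookup_address(dst_rank, netmask):
    """Return dst_rank's first address that lies on the MPI subnet."""
    net_addr, mask = netmask.split("/")
    network = ipaddress.ip_network(f"{net_addr}/{mask}", strict=False)
    for addr in business_cards[dst_rank]:
        if ipaddress.ip_address(addr) in network:
            return addr
    raise RuntimeError(f"rank {dst_rank} has no address on {netmask}")

# Each rank registers its card at startup (made-up addresses,
# one private-network interface and one MPI-network interface each).
register(0, ["192.168.1.10", "10.0.5.10"])
register(1, ["192.168.1.20", "10.0.5.20"])

# Rank 0 picks the address it would dial to reach rank 1: only raw
# IPs from the card, filtered by the netmask -- no DNS lookup at all.
print(lookup_address(1, "10.0.0.0/255.255.0.0"))  # 10.0.5.20
```

The design point is that the smpd tree, not DNS, is the directory service for the MPI ranks; name resolution is only needed on the private or public network where the smpds find each other.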
Senior Support Engineer
Microsoft Enterprise Platforms Support