HPC Server 2008 MPI Diagnostic Fails with the "Eager Message, No Business Card" Error

An HPC Server 2008 user reported that his cluster was up and running and all nodes could ping each other over all networks, but the built-in MPI diagnostic was failing with the uninformative message "Failed To Run".

He had Topology 3: the head node connected to the Enterprise network, and all compute nodes connected to the head node via Ethernet as the Private network and InfiniBand as the Application network.

Please be aware that "Failed To Run" is a separate category from "Failure", so when a test doesn't succeed you may have to check both places in the Test Results branch of the Diagnostics tree. Once you find this tab you don't get much information beyond the result "Failed To Run". If you click on the line marked with the red !, the bottom pane lights up, but it still only says "Test Failed to Run". Look to the right side of that banner and you will see a bright red "Result" followed by a v in a circle. Click on the v and you get a little more information about the failure:

The test did not run. Please navigate to 'Progress of the test' to view log and error messages.

So where is this "Progress of the test" to be found? Well, if like me you often don't have the Actions pane open, you'd better click on the Actions tab near the top of the console. Near the top of the Actions pane you will see the link to "Progress of the Test". This is progress, of a sort. You'll likely see just a single line with the red ! and a State of "Reverted". Now click on that line.

Oh, boy, we're rockin' now. Here are the real error messages. This is so information rich it's almost embarrassing.

Time Message
6/29/2009 10:16:53 AM Reverted
6/29/2009 10:16:53 AM The operation failed due to errors during execution.
6/29/2009 10:16:53 AM The operation failed and will not be retried.
6/29/2009 10:16:53 AM ---- error analysis -----
6/29/2009 10:16:53 AM 
6/29/2009 10:16:53 AM mpi has detected a fatal error and aborted mpipingpong.exe
6/29/2009 10:16:53 AM [2] on NODE-03
6/29/2009 10:16:53 AM 
6/29/2009 10:16:53 AM ---- error analysis -----
6/29/2009 10:16:53 AM 
6/29/2009 10:16:53 AM [3-6] terminated
6/29/2009 10:16:53 AM 
6/29/2009 10:16:53 AM Check the local NetworkDirect configuration or set the MPICH_ND_ENABLE_FALLBACK environment variable to true.
6/29/2009 10:16:53 AM There is no matching NetworkDirect adapter and fallback to the socket interconnect is disabled.
6/29/2009 10:16:53 AM CH3_ND::CEnvironment::Connect(296): [ch3:nd] Could not connect via NetworkDirect to rank 1 with business card (port=58550 description="10.1.0.2 192.168.0.28 192.168.0.39 NODE-02 " shm_host=NODE-02 shm_queue=3204:428 nd_host="10.1.0.2:157 " ).
6/29/2009 10:16:53 AM MPIDI_CH3I_VC_post_connect(426)...: MPIDI_CH3I_Nd_connect failed in VC_post_connect
6/29/2009 10:16:53 AM MPIDI_CH3_iSendv(239).............:
6/29/2009 10:16:53 AM MPIDI_EagerContigIsend(519).......: failure occurred while attempting to send an eager message
6/29/2009 10:16:53 AM MPIC_Sendrecv(120)................:
6/29/2009 10:16:53 AM MPIR_Allgather(487)...............:
6/29/2009 10:16:53 AM MPI_Allgather(864)................: MPI_Allgather(sbuf=0x00000000001FF790, scount=128, MPI_CHAR, rbuf=0x0000000000B70780, rcount=128, MPI_CHAR, MPI_COMM_WORLD) failed
6/29/2009 10:16:53 AM Fatal error in MPI_Allgather: Other MPI error, error stack:
6/29/2009 10:16:53 AM [2] fatal error
6/29/2009 10:16:53 AM 
6/29/2009 10:16:53 AM [0-1] terminated
6/29/2009 10:16:53 AM 
6/29/2009 10:16:53 AM [ranks] message
6/29/2009 10:16:53 AM job aborted:
6/29/2009 10:16:53 AM  

First it tells us that Node-03 had a problem. Then it tells us to check the local NetworkDirect configuration on Node-03. Then it tells us that the environment is set not to fall back to Winsock Direct or TCP/IP. That's deliberate: falling back when people are expecting NetworkDirect performance can make applications run very slowly and is hard to diagnose. Trust me. I missed sleep over that one.
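
If you just need MPI traffic to limp along over the sockets while you sort out NetworkDirect, you can take the error message's own advice and allow the fallback temporarily. Here's a minimal sketch using standard mpiexec syntax; MyMpiApp.exe is just a placeholder for whatever you are launching, and the variable name comes straight from the error text. Clear it again once NetworkDirect is healthy, for exactly the performance reasons above.

REM Temporarily allow MS-MPI to fall back to the socket interconnect
REM when no matching NetworkDirect adapter is found.
REM MyMpiApp.exe is a placeholder for your own MPI executable.
mpiexec -env MPICH_ND_ENABLE_FALLBACK true -n 2 MyMpiApp.exe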

Then we have several lines of MPI error messages, which I generally summarize as the 'eager message, no business card' error. The "business card" is just the set of addresses a rank publishes (the port, IP addresses, shm_host, and nd_host you can see in the log above) so its peers know how to reach it. You can ignore the rest of the message, but keep in mind that whenever you see the eager message, no business card error you should suspect a problem on your MPI network.

So, let's follow the advice at the beginning of the error messages and check the InfiniBand status on Node-03. I use the Run Command feature of the Management Console to run ndinstall -l on the nodes. To make life easier, I copy the tool's .exe to all of the nodes as C:\Windows\System32\ndinstall.exe. The tool is usually installed by the .msi install of the drivers on the head node. Search your system drive after you install the drivers and find it, put it on a head node share the compute nodes can see, and use clusrun or the Run Command GUI to copy it to all the compute nodes (a clusrun sketch of these steps follows the listings below). Here's the output from Node-03 (bad node, no business card) and Node-02 (good node, pat pat).

Node-03

0000001001 - MSAFD Tcpip [TCP/IP]
0000001002 - MSAFD Tcpip [UDP/IP]
0000001003 - MSAFD Tcpip [RAW/IP]
0000001004 - MSAFD Tcpip [TCP/IPv6]
0000001005 - MSAFD Tcpip [UDP/IPv6]
0000001006 - MSAFD Tcpip [RAW/IPv6]
0000001007 - RSVP TCPv6 Service Provider
0000001008 - RSVP TCP Service Provider
0000001009 - RSVP UDPv6 Service Provider
0000001010 - RSVP UDP Service Provider

Node-02

0000001001 - MSAFD Tcpip [TCP/IP]
0000001002 - MSAFD Tcpip [UDP/IP]
0000001003 - MSAFD Tcpip [RAW/IP]
0000001004 - MSAFD Tcpip [TCP/IPv6]
0000001005 - MSAFD Tcpip [UDP/IPv6]
0000001006 - MSAFD Tcpip [RAW/IPv6]
0000001007 - RSVP TCPv6 Service Provider
0000001008 - RSVP TCP Service Provider
0000001009 - RSVP UDPv6 Service Provider
0000001010 - RSVP UDP Service Provider
0000001011 - OpenIB Network Direct Provider

Notice there is no 0000001011 - OpenIB Network Direct Provider on Node-03. So, actually, the diagnostic got it right immediately; I just took a long time to prove it. Now let's run ndinstall -i on Node-03. Again with the Run Command, eh? All we get back is "Finished". Then run ndinstall -l again and verify that we get the OpenIB Network Direct Provider line. Yes we do, but don't be confused if, like me, you see it as "0000001012 - OpenIB Network Direct Provider". The sequence number is not important. (The same steps done with clusrun are sketched below.)
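
For the record, here's roughly what the copy-and-install dance looks like from a command prompt on the head node using clusrun instead of the Run Command GUI. This is just a sketch: \\HEADNODE\Tools is a placeholder for whatever share you staged ndinstall.exe on, and NODE-03 is the broken node from the listings above.

REM Push ndinstall.exe from a head node share out to every compute node.
REM \\HEADNODE\Tools is a placeholder share name.
clusrun /all xcopy \\HEADNODE\Tools\ndinstall.exe C:\Windows\System32\ /y

REM Register the NetworkDirect provider on the node that is missing it.
clusrun /nodes:NODE-03 ndinstall -i

REM List the Winsock providers again and look for the OpenIB Network Direct Provider line.
clusrun /nodes:NODE-03 ndinstall -l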

And finally let's run the diagnostic again and look in Diagnostics->Test Results->Success. Ah, now that's sweet!

Test Name                     Result    Test Suite     Target     Last Updated
MPI Ping-Pong: Quick Check    Success   Performance    7 nodes    6/29/2009 10:54:58 AM

That's it for now, "Transfer fast and prosper."

 Frankie