Home

Platform MPI User's Guide - Platform Cluster Manager

1. count N status return status of receive operation status IN atatype datatype of each receive buffer entry handle UT count number of received entries integer n Pl _Get_elementsL MPI Status status MPI Datatype datatype MPI Aint count N status return status of receive operation status IN atatype datatype used by receive operation handle OUT oun number of received basic elements integer P _PackL void inbuf MPI_Aint incount MPI Datatype datatype void outbuf MPI Aint outsize MPI Aint position MPI Comm comm inbuf input buffer start choice IN incount number input data items IN datatype datatype of each input ata item handle OUT outbuf output buffer start choice IN outsize output buffer size in bytes OUT position current position in buffer in bytes comm communicator for packed message handle t MPI Pack _externalL char datarep void inbuf MPI Aint incount P Datatype datatype void outbuf MPI Aint outsize MPI Aint position datarep data representation string i nbuf input buffer start choice incount number of input data items datatype datatype of each input data item handle T out buf output buffer start choice outsize output buffer size in bytes OUT position Current position in buffer in bytes P _ Pack_sizeL MPI_Aint incount MPI Datatype datatype P Comm comm MPI _Aint size incount count argument to packing call IN datatype atatype argument to packing call
2. export MPI_REMSH rsh x optional If LSF is being used create an appfile such as this h hostA np 1 path to pp x h hostB np 1 path to pp x h hostC np 1 path to pp x h hostZ np 1 path to pp x Then run one of the following commands bsub pam mpi MPI_ROOT bin mpirun prot f appfile bsub pam mpi MPI_ROOT bin mpirun prot f appfile 1000000 When using LSF the host names in the appfile are ignored If thesrun command is available run a command like this MPI_ROOT bin mpirun prot srun N 8 n 8 path to pp x MPI_ROOT bin mpirun prot srun N 8 n 8 path to pp x 1000000 Replacing 8 with the number of hosts Or if LSF is being used then the command to run might be this bsub I n 16 MPI_ROOT bin mpirun prot srun path to pp x Platform MPI User s Guide 179 Debugging and Troubleshooting bsub I n 16 MPI_ROOT bin mpirun prot srun path to pp x 1000000 If theprun command is available use the same commands as above for s run replacingsr un with prun In each case above the firstmpi r un command uses 0 bytes per message and verifies latency The second mpi run command uses 1000000 bytes per message and verifies bandwidth Example output might look like Host 0 ip 192 168 9 10 ranks 0 Host 1 ip 192 168 9 11 ranks 1 Host 2 ip 192 168 9 12 ranks 2 Host 3 ip 192 168 9 13 ranks 3 O0 hostA ping pong 0 bytes 0 bytes 4 24 usec msg 1 hostB ping pong
3. s em handle L MPI Aint count MPI Datatype oldtype tion count atype handle atype handle Ca a a dexedL MPI Aint count MPI Aint P _ Aint array_of_displacements newt ype MPI Datatype number of blocks number of elements in each block byte displacement of each block hs oldtype old datatype newtype new datatype t MPI Type create hvectorL MPI Aint count MPI Aint blocklength Pl Aint stride MPI Datatype oldtype MPI Datatype newtype count number of blocks blocklength number of elements in each block stride number of bytes between start of each block oldtype old datatype handle newt ype new datatype handle t MPI Type create indexed blockL MPI Aint count MPI Aint locklength MPI Aint array_of_displacements MPI Datatype oldtype P Datatype newt ype count length of array of displacements blocklength size of block in array_of displacements array of displacements oldtype old datatype handle I newt ype new datatype handle int MPI _Type_ create structL MPI Aint count MPI Aint array_of blocklengths MPI Aint array_of displacements MPI Datatype array_of types MPI Datatype newtype N k ount 226 Platform MPI User s Guide number of blocks N array of blocklength number o N array_of displacements byte dis N array _of types type of array o OUT newt ype new data int MPI Type hindexedL MPI Aint count array_of_blocklengths MPI_Aint arra oldtype
4. bsub I n 4 MPI_ROOT bin mpirun TCP netaddr 123 456 0 0 e MPIROOT MPI_ROOT Isf hello_world Job lt 189 gt is submitted to default queue lt nor mal gt lt lt Waiting for dispatch gt gt lt lt Starting on example platform com gt gt Hello world m0O of 4 on n01 Hello world m2 of 4 on moil Hello world mi1 of 4 on n01 Hello world I m 3 of 4 on n0l Options for pr un users prun Enables start up with Elan usage Only supported when linking with shared libraries Some features like mpi run stdio processing are unavailable The np option is not allowed with prun Arguments on the mpi r un command line that follow prun are passed to theprun command Options for SLURM users srun Enables start up on HP XC clusters Some features likempi run stdio processing are unavailable The np option is not allowed with srun Arguments on the mpi r un command linethat follow srun arepassed to thes r un command Start up directly from thesrun command is not supported Remote shell launching f appfile Specifies the appfile that mpi r un parses to get program and process count information for therun hostfile lt filename gt Launches the same executable across multiple hosts File name is a text file with host names separated by spaces or new lines Can be used with the np option hostlist lt list gt Platform MPI User s Guide 105 Understanding Platform MPI Launches the same executable across
5. gt job add 4288 numprocessors 1 env MPI_ROOT MPI_ROOT exclusive true stdout node path to a shared file out stderr nod path to a shared file err path submission_script vbs Wheresubmission_script vbs contains code such as Option Explicit Dim sh ojob JobNewOut appfile Rsrc fs Set sh WScript CreateObj ect WScript Shell Set fs CreateObject Scripting FileSystemObj ect Set oJ ob sh exec MP ROOT bin mpi nodes exe JobNewOut oJ ob StdOut Readal Set appfile fs CreateTextFile lt path gt appfile True Rsrc Split JobNewOut For LBound Rsrc 1 to UBound Rsrc Step 2 appfile WriteLine h Rsrc np Rsrc I l _ POS SEN NSE OO OKEE E Next appfile Close Set oJob sh exec MPI ROOT bin mpirun exe TCP f lt path gt appfile wscript Echo oJob StdQut Readal 3 Submit thejob as in the previous example C gt job submit id 4288 Theaboveexampleusings ub mission _script vbs isonly an example Other scripting languages can be used to convert the output ofmpi _nodes exe into an appropriate appfile Building an MPI application with Visual Studio and using the property pages To build an MPI application in C or C with V S2008 usethe property pages provided by Platform M PI to help link applications Two pages are included with Platform M PI and are located at the installation location
6. if comm_rank eq 0 then do dest 1 comm size 1l cal t ierr enddo else cal mstat ierr endi f Computation Sum up in each column Each MPI process or a rank The column block number stored in cbs cb number Subscripts of the row block rb and cbhe cb 202 Platform MPI User s Guide mpi_type_extent mpi double prec mpi _type_free adtype rb 1 rbe 0 comm si ze 1 dtype 0 comm size 1 twdtype 0 comm size 1 computes blocks that it is assigned in the variable cb starting and ending subscripts of the column block cb respectively is assigned in the variable rb are stored in ly spaced nrow cbs 0 comm si ze 1 k rbe bl k 1 blk cbe bl k 1 nrow rr ined as a set of fixed length is defined as a struct of such 0 comm size 1 sion dsize ierr mpi _type_vector cbe cb cbs cb 1 rbe rb rbs rb 1 i nrow mpi_double_ precision adtype rb EREN AARE 1 nrow dsi ze en adisp adtype terr err data with using derived datatypes defined above it and MPI _recv layout of the data from those datatypes programs to manually pack unpack the data system for optimal will find out the This saves application and more importantly communication mpi _send array 1 twdtype dest dest 0 mpi_comm world mpi_recv array 1 twdtype comm_rank 0 0 mpi_comm world is assigned The are The row block The starting and ending r
7. warm up loop ial ror i Op Tf lt Se ias 4 MPI_Send buf nbytes MPI_CHAR 1 1 MPI Recv buf nbytes MPI CHAR 1 1 MPI COMM WORLD amp status Ee timing loop start MPI _ Wti me or i Oe i lt NLOOPS i 4 ifdef CHECK or j 0 j lt nbytes Jad buf j char fendi Pl _Send buf nbytes MPI _CHAR 1 1000 i ifdef CHECK memset buf 0 nbytes endi Pl Recv buf nbytes MPI CHAR 1 2000 i P COMM WORLD amp status ifdef CHECK or j 0 j lt nbytes j hi OO fe enari i lt i printf error buf d d not d n j Mt iy I 1 break endi stop MPI _Wti me 186 Platform MPI User s Guide Example Applications rintf d bytes 2f usec msg n bytes stop start NLOOPS 2 1000000 f nbytes gt 0 rintf d bytes 2f MB sec n nbytes nbytes 1000000 stop start NLOOPS 2 warm up loop of i s 0 i lt 5 bes i _Recv buf nbytes MPI CHAR 0 1 MPI COMM_WORLD amp status _Send buf nbytes MPI CHAR 0 1 MPI COMM_WORLD Seac r i esw i lt MOORS iat 4 _Recv buf nbytes MPI CHAR 0 1000 i _COMM_WORLD amp status _Send buf nbytes MPI CHAR 0 2000 i MPI COMM_WORLD To 0O ping_pong output The output from running the ping_pong executable is shown below The application was run with np2 ping pong 0 bytes 0 bytes 1 03 usec msg ping_pong_ring c Linux Often a cluster might
8. I Everyone except numRanks 1 sends its rightEnd right if myRank numRanks 1 sendVal aBlock gt getRightEnd getValue MPI Send amp sendVal 1 MPI_INT myRank 1 1 MPI COMM WORLD 214 Platform MPI User s Guide Example Applications if myRank 0 MPI _Wait amp sortRequest amp status aBlock gt setLeftShadow Entry recvVal m II Have each rank fix up its entries ll aBlock gt singleStepOddEntries aBlock gt singleStepEvenEntries I Print and verify the result if myRank intsendVal aBlock gt printEntries myRank aBlock gt verifyEntries myRank I NT_MIN sendVal aBlock gt getRightEnd getValue if numRanks gt 1 MPI Send amp sendVal 1 MPI INT 1 2 MPI COMM WORLD else int recvVal Pl Status Status PI Recv amp recvVal 1 MPI_INT myRank 1 2 MP _COMM WORLD amp Status aBl ock gt printEntries myRank aBlock gt verifyEntries myRank recvVal if myRank numRanks 1 recvVal aBlock gt getRightEnd getValue MPI _Send amp recvVal 1 MPI_INT myRank 1 2 MPI _COMM_ WORLD delete aBlock MPI Finalize exit 0 sort C output The output from running the sort executable is shown below The application was run with np4 Rank 0 998 996 996 993 567 563 544 7543 Rank 1 535 2928 2528 90 Platform MPI User s Guide 215 Example Applications 90 84 84 Rank 2 78 70
9. QLogic PSM NIC Version QHT7140 QLE7140 Driver PSM 1 0 2 2 1 2 2 About This Guide Platform Interconnect Operating System HP XC6000 Clusters TCP IP HP XC Linux QsNet Elan4 InfiniBand HP Cluster Platforms Note OFED 1 0 1 1 1 2 1 3 1 4 uDAPL 1 1 1 2 2 0 QLogic PSM NIC Version QHT7140 QLE7140 Driver PSM 1 0 2 2 1 2 2 TCP IP and InfiniBand Microsoft Windows HPCS 2008 The last release of HP MPI for HP UX was version 2 2 5 which is supported by Platform Computing This document is for Platform MPI 8 0 which is only being release on Linux and Windows Platform MPI User s Guide 9 About This Guide Documentation resources Documentation resources include 1 2 3 oN SO Ul 9 Platform M PI productinformation availableat http www platform com cluster computing platform mpi M PI The Complete Reference 2 volume se MIT Press MPI 1 2 and 2 0 standards available at http www mpi forum org 1 MPI A Message Passing Interface Standard 2 MPI 2 Extensions to the M essage Passing Interface TotalView documents available at http www totalviewtech com 1 TotalView Command Line Interface Guide 2 TotalView User s Guide 3 TotalView Installation Guide Platform MPI release notes available at http my platform com Argonne National Laboratory s implementation of M PI 1 0 at http www unix mcs anl gov romio University of N otre Dame s LAM implementation of
10. ll i IIiFunction Adjust the even entries I void BI ockOfEntries singleStepEvenEntries Corlim ile i lt mumo entitles 2 lie enmir tesih gt a bestia Entry temp entries i 1 entries i l entries i entries i temp Hf il 212 Platform MPI User s Guide Example Applications BlockOfEntries verifyEntries H Function Verify that the block of entries for rank myRank II is sorted and each entry value is greater than or equal to argument baseLine ll void BlockOfEntries verifyEntries int myRank int baseLi ne roOrtipi Sle lt numo EMTs 2s iy 4 if entries i gt getValue lt baseline cout lt lt Rank lt lt myRank lt lt wrong answer i lt lt j lt lt baseline lt lt baseline lt lt value lt lt entries i gt getVal ue lt lt endl MPI Abort MPI_ COMM_WORLD MPI ERR_OTHER if entries i gt entries i l 4 cout lt lt Rank lt lt myRank lt lt wrong answer i lt lt ji 46 walweli lt lt entries i gt get Val ue lt lt value i l lt lt entries itl gt get Val ue lt lt endl MPI Abort MPI COMM WORLD MPI _ERR_ OTHER Hf if BlockOfEntries printEntries I Function Print myRank s entries to stdout ie void BlockOfEntries printEntries int myRank cout lt lt endl cout lt lt Rank lt lt myRank lt lt endl for int i 1 i
11. run my 32 bit executable on my AM D64 or Intel R 64 system dlopen for MPI_ICLIB_IBV__ BV_MAIN could not open libs in list libibverbs so ibibverbs so cannot open shared objec ile No such file or directory x Rank 0 0 init ibv_resolve_entrypoints failed x Rank 0 0 MPI_Init Can t initialize RDMA device x Rank 0 0 MPI Init MPI BUG Cannot initialize RDMA protocol dlopen for PI ICLIB_ IBV__IBV_MAIN could not open libs in list libibverbs so ibibverbs so cannot open shared objec ile No such file or directory x Rank 0 1 MPI Toit ibv_resolve_entrypoints failed x Rank 0 1 MPI Init Can t initialize RDMA device x Rank 0 1 MPI Init MPI BUG Cannot initialize RDMA protoco PI Application rank 0 exited before MPI _Init with status 1 PI Application rank 1 exited before MPI _Init with status 1 ANSWER Not all messages that say Can t initialize RDM A device are caused by this problem This message can show up when running a 32 bit executable on a 64 bit Linux machine The 64 bit daemon used by Platform M PI cannot determine the bitness of the executable and thereby uses incomplete information to determine the availability of high performance interconnects To work around the problem use flags T CP V API etc to explicitly specify the network to use Or with Platform M PI 2 1 1 and later use the mpi32 flag to mpi run QUESTION Where does Platform M PI look for the shared libraries
12. 21 MPI_BIND_MAP 61 126 MPI_BOTTOM 122 MPI_Bsend 18 MPI_Cancel 123 MPI_Comm_disconnect 111 MPI_Comm_rank 15 MPI_COMMD 131 MPI_COPY_LIBHPC 143 MPI_CPU_AFFINITY 61 126 MPI_CPU_SPIN 61 126 Platform MPI User s Guide 257 MPI_DEBUG_CONT 170 MPI_DLIB_FLAGS 127 MPI_ERROR LEVEL 128 MPI_FAIL_ON_TASK_FAILURE 143 MPI_Finalize 15 MPI_FLAGS 121 162 MPI_FLUSH_FCACHE 61 126 MPI_GLOBMEMSIZE 136 MPI_IB_CARD_ORDER 134 MPI_IB_MULTIRAIL 131 MPI_IB_PKEY 134 MPI_IB_PORT_GID 132 MPI_lbsend 19 MPI_IBV_QPPARAMS 135 MPI_IC_ORDER 130 MPI_IC_SUFFIXES 131 MPI Init 15 MPI_INSTR 128 156 MPI_lIrecv 19 MPI_Irsend 19 MPI_Isend 19 20 MPI_Issend 19 MPI_LOCALIP 138 MPI_Lookup _name 153 MPI_MAX_REMSH 138 MPI_MAX_WINDOW 127 MPI_MT_FLAGS 125 MPI_NETADDR 138 MPI_NO_MALLOCLIB 136 MPI_NOBACKTRACE 128 MPI_NRANKS 144 MPI_PAGE_ALIGN_MEM 136 MPI_PHYSICAL_MEMORY 136 MPI_PIN_ PERCENTAGE 137 MPI_PROT_BRIEF 141 MPI_PROT_MAX 141 MPI_PRUNOPTIONS 141 MPI_Publish name 153 MPI_RANKID 144 MPI_RANKMEMSIZE 136 MPI_RDMA_INTRALEN 139 MPI_RDMA_MSGSIZE 139 MPI_RDMA_NENVELOPE 140 MPI_RDMA_NFRAGMENT 140 MPI_RDMA_NONESIDED 140 MPI_RDMA_NSRQRECV 140 MPI_Recv 15 18 high message bandwidth 167 258 Platform MPI User s Guide low message latency 167 MPI_Reduce 22 MPI_REMSH 138 MPI_ROOT 126 237 MPI_Rsend 18 MPI_Rsend convert to MPI_Ssend 125 MPI_SAVE_TASK_OUTPUT 143 MPI_Scatter 21 MPI_Send 15 18 high message bandwidth 167 low message latency 167 MPI_Send app
13. 32 bit and 64 bit versions of the library are shipped with these systems however you cannot mix 32 bit and 64 bit executables in the same application Platform M PI includes mpi32 and mpi64 options for the compiler wrapper script on Opteron and Intel64 systems U se these options to indicate the bitness of the application to be invoked so that the availability of interconnect libraries can be properly determined by the Platform M PI utilities mpi run and mpi d The default is mpi64 Windows Platform M PI supports 32 bit and 64 bit versions running Windows on AMD Opteron or Intel64 32 bit and 64 bit versions of the library are shipped with these systems however you cannot mix 32 bit and 64 bit executables in the same application Platform M PI includes mpi32 and mpi64 options for the compiler wrapper script on Opteron and Intel64 systems These options are only necessary for the wrapper scripts so the correct li bpcmpi 32 d orl ibpcmpi 64 dl 1 fileis linked with the application It is not necessary when invoking the application 58 Platform MPI User s Guide Understanding Platform MPI Thread compliant library Platform M PI provides athread compliant library By default the non thread compliant library libmpi isused when running Platform M PI jobs Linking to the thread compliant library is required only for applications that have multiple threads making M PI calls simultaneously In previous releases linking to the thread
14. 69 69 383 383 386 386 Rank 3 386 393 393 397 950 965 987 987 compute_pi_spawn f This example computes pi by integrating f x 4 1 x 2 using M PI_ Spawn It starts with one process and spawns a new world that does the computation along with the original process Each newly spawned process receives the of intervals used calculates the areas of its rectangles and synchronizes for a global summation The original process 0 prints the result and the time it took program mainprog include mpif h double precision PI25DT parameter PI25DT 3 141592653589793238462643d0 double precision mypi pi h sum x f a integer Ma myid numprocs terr integer parenticomm spawnicomm mergedcomm high C C Function to integrate C a 2 a fF i wl ee call MPI_INIT ierr call MP COMM RANK MP COMM WORLD myid ierr call MP _COMM_SI ZE MP COMM WORLD numprocs ierr call MP _COMM_GET_PARENT parenticomm ierr if parenticomm eq MPI COMM NULL then print Original Process myid of numprocs ar n ie alive ca PI COMM_SPAWN compute_pi_ spawn MPI ARGV_NULL 3 P INFO NULL 0 MPI COMM WORLD spawni comm P ERRCODES IGNORE err Ca PI _INTERCOMM MERGE spawnicomm 0 mergedcomm ierr ca PI COMM_FREE spawnicomm ierr else print Spawned Process myid of numprocs 7 1S alive ca P _INTERCOMM MERGE parenticomm 1 mergedcomm ierr Ca PI COMM FREE p
15. BAL IBAL on InfiniBand MX M yrinet Express TCP TCP IP MP D daemon comm unication mode SHM shared memory intra host only If a host shows considerably worse performance than another it can often indicate a bad card or cable If the run aborts with an error message Platform M PI might have incorrectly determined which interconnect was availabl compute_pi f This Fortran 77 example e computes pi by integrating f x 4 1 x x Each process 1 Receives the number of intervals used in the approximation 2 Calculates the areas of its rectangles 3 Synchronizes for a global summation Process 0 prints the result of the calculation program main include mpif h double precision Pl 25DT parameter PI25DT 3 141592653589793238462643d0 double precision mypi pi h sum x f a integer n myid numprocs terr C C Function to integrate C ol 64 00 7 i 00 2 a6 call MPI_INIT ierr call MPI COMM_RANK MP _COMM WORLD myid ierr call MP COMM SI ZE MPI COMM WORLD numprocs ierr prim Process mid of MlMprocs bs alive Sizetype 1 Sumtype 2 if myid eq 0 then n 100 endi call MP _BCAST n 1 MPI INTEGER 0 MPI COMM WORLD ierr C C Calculate the interval size C h 1 0d0 n sum 0 0d0 do 20 i myid 1 n numprocs x h dble i 0 5d0 Sum sum f x 20 continue mypi h sum C C Collect all the partial sums C call MPI REDUCE mypi pi 1 MPI DOUBL
16. C gt job add JOBID numprocessors 1 env MPI_ROOT shared alternate location Platform MPI User s Guide 35 Getting Started Compiling and running your first application To quickly becomefamiliar with compiling and running Platform M PI programs start with theC version of the familiar hello_world program This program is called hello_world c and prints out the text string Hello world I m r of son host wherer is a process s rank sis the size of the communicator and host is the host where the program is run The source code for hello_world c is stored in MPI _ROOT hel p Command line basics The utility MP1 ROOT bi n mpi cc isincluded to aid in command line compilation To compile with this utility set M PI_CC to the path of the command line compiler you want to use Specify mpi32 or mpi64 to indicate if you are compiling a 32 or 64 bit application Specify the command line options that you normally pass to the compiler on the mpi cc command line Thempi cc utility adds additional command line options for Platform M PI include directories and libraries The show option can be specified to mpi cc to display the command generated without executing the compilation command See the manpage mpicc 1 for more information To construct the desired compilation command the mpi cc utility needs to know what command line compiler is to be used the bitness of the executable that compiler will produce and the syntax accepte
17. C WI NDOWS Syst em32 Woem HOMEPATH HOME PAT H TEMP C WI NDOWS TEMP CMD Finished successfully You can view directories accessible from the remote machine when authenticated by the user X Demo gt MPI_ROOT bin mpidiag s winbl16 dir mpiccp1 scratch user1 Directory File list Searching for path mpiccpl scratch userl Platform MPI User s Guide 97 Understanding Platform MPI Directory mpiccpl scratch userl BaseRel Beta HPMP BuildTests DDR2 Testing dir pl exportedpath reg Hilesi ini hl xml Hel oWorl d HP64 2960 1l err Hel oWorl d HP64 2960 1 out Hel oWorl d HP64 2961 1 err Hel oWorl d HP64 2961 1 out mpidiag tool for Windows 2008 and Platform MPI Remote Launch Service Platform M PI for Windows 2008 includes the mpi di ag diagnostic tool Itislocated in MPI ROOT bin mpi daig exe This tool is useful to diagnose remote service access without running mpi r un To use the tool run mpi di ag with s lt remote node gt lt options gt where options include help Show the options to mpi di ag s lt remote node gt Connect to and diagnose the remote service of this node Authenticates with the remote service and returns the remote authenticated user s name Authenticates with remote service and returns service status et lt echo string gt Authenticates with the remote service and performs a simple echo test returning the string SyS Authe
18. C gt job add 4288 numprocessors 1 exclusive true stdout node path to a shared file out stderr node path to a shared file err MPI_ROOT bin mpirun hpc node path to hello_world exe Submit the job The machine resources are allocated and the job is run C gt job submit id 4288 Multiple Program Multiple Data MPMD Torun Multiple Program M ultiple Data M PM D applications or other more complex configurations that require further control over the application layout or environment dynamically create an appfile Platform MPI User s Guide 39 Getting Started within the job using the utility MPI ROOT bi n mpi _nodes exe asin thefollowing example To create the executable perform Steps 1 through 3 from the previous section Then continue with 1 Create anew job C gt job new numprocessors 16 exclusive true Job queued ID 4288 2 Submit a script Verify MPI_ ROOT is set in the environment See the mpi r un manpage for more information C gt job add 4288 numprocessors 1 env MPI_ROOT MPI_ROOT exclusive true stdout node path to a shared file out stderr node path to a shared file err path submission_script vbs Wheresubmission_script vbs contains code such as Option Explicit Dim sh ojob JobNewOut appfile Rsrc fs Set sh WScript CreateObj ect WScri pt Shell Set fs CreateObject Scripting FileSystemObj ect Set oJob sh exec MPI ROOT bin mpi
19. Linux Enterprise Server 9 and 10 CentOS 5 Red Hat Enterprise Linux AS 4 0 and 5 0 SUSE Linux Enterprise Server 9 and 10 CentOS 5 Red Hat Enterprise Linux AS 4 0 and 5 0 SUSE Linux Enterprise Server 9 and 10 CentOS 5 Red Hat Enterprise Linux AS 4 0 and 5 0 SUSE Linux Enterprise Server 9 and 10 CentOS 5 Platform MPI User s Guide 7 About This Guide Platform Interconnect Operating System Intel 64 Myrinet GM 2 and MX Red Hat Enterprise Linux AS 4 0 and 5 0 SUSE Linux Enterprise Server 9 and 10 CentOS 5 TCP IP Red Hat Enterprise Linux AS 4 0 and 5 0 SUSE Linux Enterprise Server 9 and 10 CentOS 5 InfiniBand Red Hat Enterprise Linux AS 4 0 and 5 0 SUSE Linux Enterprise Server 9 and 10 CentOS 5 OFED 1 0 1 1 1 2 1 3 1 4 Red Hat Enterprise Linux AS 4 0 and 5 0 SUSE Linux Enterprise Server 9 and 10 CentOS 5 uDAPL 1 1 1 2 2 0 Red Hat Enterprise Linux AS 4 0 and 5 0 SUSE Linux Enterprise Server 9 and 10 CentOS 5 QLogic PSM Red Hat Enterprise Linux AS 4 0 and 5 0 SUSE NIC Version QHT7140 Linux Enterprise Server 9 and 10 CentOS 5 QLE7140 Driver PSM 1 0 2 2 1 2 2 HP XC3000 Clusters Myrinet GM 2 and MX HP XC Linux TCP IP InfiniBand OFED 1 0 1 1 1 2 1 3 1 4 uDAPL 1 1 1 2 2 0 QLogic PSM NIC Version QHT7140 QLE7140 Driver PSM 1 0 2 2 1 2 2 HP XC4000 Clusters QsNet Elan4 HP XC Linux 8 Platform MPI User s Guide Myrinet GM 2 and MX TCP IP InfiniBand OFED 1 0 1 1 1 2 1 3 1 4 uDAPL 1 1 1 2 2 0
20. MPI Datatype newtype Large message APIs elements in each block placement of each block elements in each block handles to datatype objects ype handle PI Aint y_of_displacements MPI Datatype N count number of blocks N array_of_blocklengths number of elements in each block N array_of_displacements byte displacement of each block N oldtype old datatype handle OUT newt ype new datatype handle int MPI _Type_hvectorL MPI Aint count MPI Aint blocklength MPI Aint stride MPI Datatype oldtype MPI Datatype newtype N count number of blocks N blockl ength number of elements in each block N stride number of bytes between start of each block N oldtype old datatype handle OUT newt ype new datatype handle One sided communication int MPI _Win_createL void base MPI Aint size MPI Aint disp _unit Pl_Info info MPI Comm comm MPI_WIN win base initial address of window choice size size of window in bytes disp_unit ocal unit size for displacements in bytes info info argument handle comm communicator handle OUT win window object returned by the ca handle int MPI _GetL void origin_addr MPI Aint origin_ count MPI Datatype origin_datatype int target_rank MPI Aint target disp MPI Aint arget_count MPI Datatype target datatype MPI W N win OUT origin_addr initial address of origin buffer choice origin count number of entries in origin buffer or
21. Multilevel parallelism By default processes in an M PI application can only do one task at a time Such processes are single threaded processes This means that each process has an address space with a single program counter a set of registers and a stack A process with multiple threads has one address space but each process thread has its own counter registers and stack M ultilevel parallelism refersto M PI processesthat havemultiplethreads Processes become multithreaded through calls to multithreaded libraries parallel directives and pragmas or auto compiler parallelism M ultilevel parallelism is beneficial for problems you can decompose into logical parts for parallel execution for example alooping construct that spawns multiple threads to do acomputation and joins after the computation is complete Themulti_par f example program is an example of multilevel parallelism Advanced topics This chapter provides a brief introduction to basic M PI concepts Advanced M PI topics include Error handling Process topologies User defined data types 24 Platform MPI User s Guide Introduction Process grouping Communicator attribute caching TheM PI profiling interface To learn more about the basic concepts discussed in this chapter and advanced M PI topics see M PI The Complete Reference and M PI A M essage Passing Interface Standard Platform MPI User s Guide 25 Introduction 26 Platform MPI User s Guide G
22. On remote hosts set DISPLAY to point to your console In addition use xhost to allow remote hosts to redirect their windows to your console 3 Run your application When your application enters M PI_ Init Platform M PI starts one debugger session per process and each debugger session attaches to its process 4 Optional Set a breakpoint anywhere following M PI_ Init in each session 5 Set the global variable M PI_ DEBUG_CONT to 1 using each session s command line interface or graphical user interface The syntax for setting the global variable depends upon which debugger you use adb mpi_debug_cont w 1 dde set mpi_debug_cont 1 xdb print MPI_DEBUG_CONT 1 wdb set MPLDEBUG_CONT 1 gdb set MPlLDEBUG_CONT 1 6 Issue the relevant debugger command in each session to continue program execution Each process runs and stops at the breakpoint you set after M PI_ Init 7 Continue to debug each process using the relevant commands for your debugger 170 Platform MPI User s Guide Debugging and Troubleshooting Using a multiprocess debugger Platform M PI supports theT otalV iew debugger on Linux T hepreferred method when you run T otalV iew with Platform M PI applications is to use the mpi r un run time utility command For example MPI_ROOT bin mpicc myprogram c g MPI_ROOT bin mpirun tv np 2 a out In this example myprogram c is compiled using the Platform M PI compiler utility for C programs The execut
23. SIGILL SIGBUS SIGSEGV SIGSYS If a signal is not caught by a user signal handler Platform M PI shows a brief stack trace that can be used to locate the signal in the code Signal 10 bus error PROCEDURE TRACEBACK 0 0x0000489c bar 0xc lof fs OMe 1 0x000048c4 foo Oxlc lady f Ge OU 2 0x000049d4 main 0xa4 Lolei Ge out eo 0xc013750c start 0xa8 usr lib libc 2 4 0x0003b50 START 0x1a0 Lla Ole This feature can be disabled for an individual signal handler by declaring a user level signal handler for the signal To disable for all signals set the environment variable M P NOBACKTRACE setenv MPI_NOBACKTRACE MPI_INSTR MPI_INSTR enables counter instrumentation for profiling Platform M PI applications The MPI_INSTR syntax is a colon separated list no spaces between options as follows prefixl nc off api 128 Platform MPI User s Guide Understanding Platform MPI where prefix Specifies the instrumentation output file prefix The rank zero process writes the application s measurement data to prefix instr in ASCII If the prefix does not represent an absolute pathname the instrumentation output file is opened in the working directory of the rank zero process when M PI_Init is called l Locks ranks to CPU s and uses the CPU s cycle counter for less invasive timing If used with gang scheduling the l is ignored nc Specifies no clobber If the instrumentation output file exists M PI_
24. Section 3 12 Derived Datatypes in the M PI 1 0 standard describes the construction and use of derived datatypes The following is a summary of the types of constructor functions available in M PI Contiguous MPI_Type_ contiguous Allows replication of a datatype into contiguous locations Vector M PI_ Type _ vector Allows replication of a datatype into locations that consist of equally spaced blocks Indexed M PI_T ype indexed Allows replication of a datatype into a sequence of blocks where each block can contain a different number of copies and havea different displacement Structure MPI_Type struct Allowsreplication of adatatypeinto a sequence of blocks so each block consists of replications of different datatypes copies and displacements After you create a derived datatype you must commit it by calling MPI_Type_commit Platform M PI optimizes collection and communication of derived datatypes Section 3 13 Pack and unpack in the M PI 1 0 standard describes the details of the pack and unpack functions for M PI Used together these routines allow you to transfer heterogeneous data in a single message thus amortizing the fixed overhead of sending and receiving a message over the transmittal of many elements For adiscussion of this topic and examples of construction of derived datatypes from the basic datatypes using the M PI constructor functions see Chapter 3 User Defined Datatypes and Packing in M PI The Complete Reference
25. The default Platform MPI library does not carry this information due to overload but the Platform MPI diagnostic library DLIB does To link with the diagnostic library use ldmpi on the link line Use the Platform MPI collective routines instead of implementing your own with point to point routines The Platform MPI collective routines are optimized to use shared memory where possible for performance To ensure portability the Platform MPI implementation does not take stdargs For example in C the user routine should be a C function of type MPI_handler_function defined as void MPI_Handler_function MPI_Comm int The Platform MPI MPI_FINALIZE behaves as a barrier function so that the return from MPI_FINALIZE is delayed until all potential future cancellations are processed Platform MPI provides a thread compliant library mt mpi which only needs to be used for applications where multiple threads make MPI calls simultaneously MPI_THREAD_MULTIPLE Use Imtmpi on the link line to use the i bmt mpi Platform MPI I O supports a subset of the MPI 2 standard using ROMIO a portable implementation developed at Argonne National Laboratory No additional file information is necessary in your file name string mpirun Using Implied prun or srun Implied prun Platform M PI provides an implied pr un mode Theimplied pr un mode allows the user to omit the prun argument from the mpi r un command line with the use of
26. Understanding Platform MPI License release regain on suspend resume Platform M PI supports the release and regain of license keys when a job is suspended and resumed by a job scheduler T his featureis recommended for useonly with a batch job scheduler To enable this feature add HPM PI_ALLOW_LICENSE_RELEASE 1 to the mpi run command line When mpi run receives a S GTSTP the licenses that are used for that job are released back to the license server Those released licenses can run another Platform M PI job while the first job remains suspended W hen a suspended mpi run job receives aS GCONT the licenses are reacquired and the job continues If the licenses cannot be reacquired from the license server the job exits When ajob is suspended in Linux any memory that is pinned is not swapped to disk and is not handled by the operating system virtual memory subsystem Platform M PI pins memory that is associated with RDMA message transfers By default up to 20 of the system memory can be pinned by Platform M PI at any onetime The amount of memory that is pinned can be changed by two environment variables MPI_PHYSICAL_MEMORY andMPI_PIN_ PERCENTAGE default 20 The dd option to mpi run displays the amount of physical memory that is detected by Platform M PI If the detection is wrong the correct amount of physical memory should be set with MPI_PHYSICAL_MEM ORY in bytes This memory is only returned to the operating system for use by other proc
27. args For example C gt MPI_ROOT bin mpiexec cores 8 myprog x 1 2 3 creates an 8 rank M PI job on the local host consisting of 8 copies of the program myprog x each with the command line arguments 1 2 and 3 It also allows arguments likea M P _Comm_spawn_ multiple call with a colon separated list of arguments where each component is like the form above For example C gt MPI_ROOT bin mpiexec cores 4 myprog x host host2 cores 4 path to myprog x creates a M PI job with 4 ranks on the local host and 4 on host 2 Finally the third form allows the user to specify a file containing lines of data like the arguments in the first form mpiexec configfile file For example C gt MPI_ROOT bin mpiexec configfile cfile gives the same results as in the second example but using the configfile option assuming the cfile filecontains cores 4 myprog x host host2 cores 4 wdir some path myprog x The following mpi exec options are those whose contexts affect the whole command line cores number Ranks per host to use if not specified elsewhere This applies when processing the ghosts gmachinefile hosts and machi nefile options affinity Enables Platform MPl s cpu_bi nd option Platform MPI User s Guide 99 Understanding Platform MPI gpath path path7 Prepends file paths to the PATH environment variable lines Enables Platform M PI S st di o p option genv variable value or genv va
28. file file Ignored in Platform M PI This last option is used separately from the options above configfile file Specify a file of lines containing the above options mpi exec doesnot supportprun orsrun start up mpijob mpi j ob lists the Platform M PI jobs running on the system mpi j ob can only be used for jobs started in appfile mode Invoke mpi j ob on the same host as you initiated mpi run Thempi j ob syntax is Platform MPI User s Guide 79 Understanding Platform MPI mpijob help a u j id id id where help Prints usage information for the utility a Lists jobs for all users u Sorts jobs by user name j id Provides process status for job id Y ou can list a number of job Dsin aspace separated list When you invoke mpi j ob it reports the following information for each job JOB Platform M PI job identifier USER User name of the owner NPROCS N umber of processes PROGNAME Program names used in the Platform M PI application By default your jobs are listed by job ID in increasing order However you can specify the a and u options to change the default behavior An mpi j ob output using the a and u options is shown below listing jobs for all users and sorting them by user name 22623 charlie 12 home watts 22573 keith 14 home richards 22617 mick 100 home j agger 22677 ron 4 home wood When you specify the j option mpi j ob reports the following for each job RANK Rank for each process i
29. following form mpiexec n maxprocs soft ranges host host arch arch wdir dir path dirs file file command args For example MPI_ROOT bin mpiexec n 8 myprog x 1 2 3 78 Platform MPI User s Guide Understanding Platform MPI creates an 8 rank M PI job on the local host consisting of 8 copies of the program myprog x each with the command line arguments 1 2 and 3 It also allows arguments like aM PI_Comm_spawn_multiple call with a colon separated list of arguments where each component is like the form above For example MPI_ROOT bin mpiexec n 4 myprog x host host2 n 4 path to myprog x creates a M PI job with 4 ranks on the local host and 4 on host 2 Finally the third form allows the user to specify a file containing lines of data like the arguments in the first form mpiexec configfile file For example MPI_ROOT bin mpiexec configfile cfile gives the same results as in the second example but using the configfile option assuming the file cfile contains n 4 myprog x host host2 n 4 wdir some path myprog x where mpi exec options are n maxprocs Creates maxprocs M PI ranks on the specified host soft range list Ignored in Platform M PI host host Specifies the host on which to start the ranks arch arch Ignored in Platform M PI wdir dir Specifies the working directory for the created ranks path dirs Specifies the PATH environment variable for the created ranks
30. mpi r un Command uses 1000000 bytes per messageand verifies bandwidth Example output might look like ost OSS mo LA Otay Ss FANS Ww bost oo jf Wy UG LSU AS oo FAME il Host 2 ip 172 16 150 24 ranks 2 host 0 1 2 0 SHM IBAL BAL 1 BAL SHM BAL 2 BAL BAL SHM 0 mpiccp3 ping pong 1000000 bytes 1000000 bytes 1089 29 usec msg 1000000 bytes 918 03 MB sec 1 mpiccp4 ping pong 1000000 bytes 1000000 bytes 1091 99 usec msg 1000000 bytes 915 76 MB sec 2 mpiccp5 ping pong 1000000 bytes 1000000 bytes 1084 63 usec msg 1000000 bytes 921 97 MB sec The table showing SH M IBAL is printed because of the prot option print protocol specified in the mpi run command It could show any of the following settings BAL IBAL on InfiniBand MX M yrinet Express TCP TCP IP MP D daemon communication mode SHM shared memory intra host only If one or more hosts show considerably worse performance than another it can often indicate a bad card or cable If the run aborts with somekind of error message it is possible that Platform M PI incorrectly determined which interconnect was available 182 Platform MPI User s Guide Example Applications This appendix provides example applications that supplement the conceptual information in this book about M PI in general and Platform M PI in particular The example codes are also included in the MPI_ROOT hel p subdirectory of your Plat
31. read mode 18 receive mode 17 18 send mode 17 standard mode 18 synchronous mode 18 blocking receive 18 broadcast 20 21 buf variable 18 21 buffered send mode 17 build applications 66 examples 184 MPI on Linux cluster using appfiles 29 MPI on HP XC cluster 30 MPI on multiple hosts 75 MPI on single host Linux 29 MPI with Visual Studio 42 problems with Windows 174 run HPCS 39 MPMD on HPCS 41 multihost on HPCS 40 single host on Windows 38 Windows 2003 XP using appfiles 44 Windows 2008 using appfiles 43 Windows with Visual Studio 42 Platform MPI User s Guide 253 C C bindings 237 C examples io c 206 ping_pong_ring c 187 ping_pong c 185 thread_safe c 207 C 237 bindings 54 compilers 54 examples cart C 196 sort C 210 profiling 159 cache option 115 cart C 183 ccp option 113 ccpblock option 114 ccpcluster option 114 ccpcyclic option 114 ccperr option 113 ccpin option 113 ccpout option 114 ccpwait option 114 change execution location 126 135 ck option 107 clean up 240 clearcache option 115 code a blocking receive 18 a broadcast 21 anonblocking send 20 ascatter 21 error conditions 178 collective communication 20 reduce 22 collective operations 20 communication 20 computation 22 synchronization 23 comm variable 18 23 commd option 103 communication hot spots 77 improving interhost 76 communicator determine number of processes 17 communicator c 183 254 Platform MPI User s Guide compilation utilities 31
32. significant only at root N count number of elements in send buffer IN datatype data type of elements of send buffer handle IN op reduce operation handle IN root rank of root process IN comm communi cat or handle int MPI Reduce _scatterL void sendbuf void recvbuf MPI Aint recvcounts MPI Datatype datatype MPI Op op MPI_Comm comm N sendbuf starting address of send buffer choice OUT recvbuf starting address of receive buffer choice IN recvcounts array specifying the number of elements in result distributed to each process IN datatype data type of elements of input buffer handle IN op operation handle IN comm communicator handle int MPI _ScanL void sendbuf void recvbuf MPI_Aint count MPI Datatype datatype MPI Op op MPI Comm comm IN sendbuf starting address of send buffer choice OUT recvbuf starting address of receive buffer choice IN count number of elements in input buffer IN datatype data type of elements of input buffer handle IN op operation handle IN comm communicator handle int MPI_ExscanL void sendbuf void recvbuf MPI Aint count MPI Datatype datatype MPI Op op MPI Comm comm IN sendbuf starting address of send buffer choice OUT recvbuf starting address of receive buffer choice IN count number of elements in input buffer IN datatype data type of elements of input buffer handle IN op operation handle IN comm intracommuni cat or handle int MPI ScatterL void send
33. 0 bytes 0 bytes 4 26 usec msg 2 hostC ping pong 0 bytes 0 bytes 4 26 usec msg 3 hostD ping pong 0 bytes 0 bytes 4 24 usec msg The table showing SH M V API is printed because of the prot option print protocol specified in the mpi run command It could show any of the following settings VAPI VAPI on InfiniBand UDAPL uDAPL on InfiniBand BV IBV on InfiniBand PSM PSM on InfiniBand MX M yrinet M X BAL on InfiniBand for Windows only T 1T API on InfiniBand GM M yrinet GM 2 ELAN Quadrics Elan4 TCP TCP IP MP I D daemon communication mode SHM shared memory intra host only If the table shows TCP for hosts the host might not have correct network drivers installed If a host shows considerably worse performance than another it can often indicate a bad card or cable Other possible reasons for failure could be A connection on the switch is running in 1X mode instead of 4X mode A switch has degraded a port to SDR assumes DDR switch cards A degraded SDR port could be dueto using anon DDR cable If the run aborts with an error message Platform M PI might have incorrectly determined what interconnect was available One common way to encounter this problem isto run a 32 bit application on a 64 bit machine like an Opteron or Intel64 It s not uncommon for some network vendors to provide only 64 bit libraries 180 Platform MPI User s Guide Debugging and Troubleshooting Platform M PI de
34. 169 Debugging Platform MPI applications cccccceeeceeeeeeeeeeeeeeeeeaeeeseeeeeseseeesneaeeeeees 170 Troubleshooting Platform MPI applications 0 ccccceeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeaees 174 Appendix A Example Applications uccnariarencui e a 183 SONG ECCIVE D Correaa a 184 PING DONG E 185 ping pong Ting c MINUX iritsia aa aie nine 187 PING PONG _TING C Windows wecccceciessccescesencetissencoesaunedeasscccesadsttaanascenedeetaaannesseeuaaienteoes 191 compute DUT saceces shiccadactt ei accecezaces a a tconateen teaeectaedudeassasantdiadadeaasiusee 194 master WOMRKErTQO saci tendicaceeairsverenidiata dutldoeidecuesianeaeen ai a a deans 195 CA E E EA E ese a serena aterm ca O A oe ES E E A E A E T A E cata 196 COMMUNICALONNG sceicwesctecinacncecce nin a ESSE EENE E E E 198 TMT DARE osei a shin cntecgde bats cate ttagtattepestasenttan tetas 199 IOO aar ag snsteceedin oucheeeec ad vaaceewaena doumade E nh Maced becadenieceeesiageues 206 thread SAIC Cts ccerttieth adecsee s a Vencavetauisbecdauetbiathadne ses ES 207 SOMIG aree scenes deast a E a E rundenesaeent E 210 COMPUTE pi SpaWN f atzssssceccetstissnnecestacttnn mia a E E ea 216 Appendix B Large message APIS eeeecccceeeeeeneeeeeeeeeeeeeeeeeaeeeeeeeeaaeeeeeeeeaaeeeeeeenaeeeeeseenaeeeeenenaaes 219 Appendix C Standard Flexibility in Platform MPI ccecceeeeeeceeeeeceeeeeeeeeeeeaeeseceeeeeeeeeesneaeeeeeees 229 Platform MPI implementatio
35. 172 20 0 13 ranks 5 nest 1 2 3 4 f 0 SHM VAPI VAPI VAPI VAPI VAP 1 VAPI SHM VAPI VAPI VAPI VAP 2 VAPI VAPI SHM VAPI VAPI VAP 3 VAPI VAPI VAPI SHM VAPI VAP 4 VAPI VAPI VAPI VAPI SHM VAP Platform MPI User s Guide 69 Understanding Platform MPI 5 VAPI VAPI VAPI VAPI VAPI SHM Hello world m0O of 6 on n Hello world m 3 of 6 on nll Hello world m5 of 6 on n13 Hello world I m 4 of 6 on nl2 Hello world m2 of 6 on n10 Hello world m1 of 6 on n9 Use LSF on non HP XC systems On non HP XC systems to invoke the Parallel Application M anager PAM feature of LSF for applications where all processes execute the same program on the same host bsub lt sf_options gt pam mpi mpirun lt mpirun_options gt program lt args gt In this case LSF assigns a host to the M PI job For example bsub pam mpi MPI_ROOT bin mpirun np 4 compute_pi requests a host assignment from LSF and runsthecomput e_pi application with four processes The load sharing facility LSF allocates hosts to run an M PI job In general LSF improves resource usage for M PI jobs that run in multihost environments LSF handles the job scheduling and the allocation of the necessary hosts and Platform M PI handles the task of starting the application s processes on the hosts selected by LSF By default mpi r un starts the M PI processes on the hosts specified by the user in effect handling the direct mapping of host names to IP address
36. 2 gt Host 3 ip 192 168 9 13 ranks 3 gt gt host 0123 gt 0 SHMVAPI VAPI VAP 1 VAPI SHMVAPI VAP 2 VAPI VAPI SHMVAP 3 VAPI VAPI VAPI SHM gt gt gt gt gt gt 0 hostA ping pong 0 bytes gt 0 bytes 4 57 usec msg gt 1l hostB ping pong 0 bytes gt 0 bytes 4 38 usec msg gt 2 hostC ping pong 0 bytes gt 0 bytes 4 42 usec msg gt 3 hostD ping pong 0 bytes gt 0 bytes 4 42 usec msg The table showing SH M V API is printed because of the prot option print protocol specified in the mpi run command In general it could show any of the following settings VAPI InfiniBand UDAPL InfiniBand BV InfiniBand PSM InfiniBand MX M yrinet M X BAL InfiniBand on Windows only T 1T API on InfiniBand GM M yrinet GM 2 ELAN Quadrics Elan4 190 Platform MPI User s Guide Example Applications TCP TCP IP MP I D commd SHM Shared M emory intra host only If the table shows TCP IP for hosts the host might not have the correct network drivers installed If a host shows considerably worse performance than another it can often indicate a bad card or cable If the run aborts with an error message Platform M PI might have determined incorrectly which interconnect was available O ne common way to encounter this problem isto run a 32 bit application on a 64 bit machine like an Opteron or Intel64 It isnot uncommon for
37. 2 5 Platform M PI can be used in MPICH mode by compiling using mpi cc mpi ch and running using mpi run mpi ch Thecompiler script mpi cc mpi ch uses an include file that defines the interfaces the sameasM PICH 1 2 5 and at link timeitlinks against i b mpi ch so whichistheset of wrappers defining M PICH 1 2 5 compatible entry points for the M PI functions The mpi run mpi ch takes the same arguments as the traditional Platform M PI mpi run but sets LD_LIBRARY_PATH so that i bmpi ch so isfound An example of using a program with Intel Trace Collector export MPlLROOT opt platform_mpi MPI_ROOT bin mpicc mpich o prog x MPI_ROOT help communicator c L path to itc lib IVT lvtunwind ldwarf Insi Im lelf lpthread MPI_ROOT bin mpirun mpich np 2 prog x Here the program communicator ciscompiled with M PICH compatibleinterfaces and islinked to Intel s Trace Collector i bVT a first from the command line option followed by Platform M PI s i bmpich so andthen i bmpi so which areadded by thempi cc mpi ch compiler wrapper script Thus i bVT a sees only the M PICH compatible interface to Platform M PI In general object files built with Platform M PI s M PICH mode can be used in an MPICH application and conversely object files built under M PICH can be linked into a Platform MPI application using MPICH mode H owever using M PICH compatibility modeto produce a single executable to run under MPICH and Platform MPI can
38. 22 recvtype 22 req 20 root 23 sendbuf 21 22 sendcount 21 sendtype 21 tag 20 vector constructor 24 version 34 237 version option 107 viewing ASCII profile 157 Ww WDB 121 Windows 2003 XP command line options 115 CCP command line options 113 115 getting started 27 35 X XDB 121 170 xrc 108 xrc option 108 Y yield spin logic 123 zero buffering 125 Platform MPI User s Guide 263
39. 8 1 MPI does not mandate what an MPI process is MPI MPI processes are UNIX or Win32 console processes and does not specify the execution model for each process can be multithreaded a process can be sequential or multithreaded See MPI 1 2 Section 2 6 Platform MPI User s Guide 229 Standard Flexibility in Platform MPI Reference in MPI Standard The Platform MPI Implementation MPI does not provide mechanisms to specify the initial allocation of processes to an MPI computation and their initial binding to physical processes See MPI 1 2 Section 2 6 MPI does not mandate that an I O service be provided but does suggest behavior to ensure portability if it is provided See MPI 1 2 Section 2 8 The value returned for MP _HOST gets the rank of the host process in the group associated with MPI_COMM_WORLD MPI_PROC_NULL is returned if there is no host MPI does not specify what it means for a process to be a host nor does it specify that a HOST exists MPI provides MPI_GET_PROCESSOR_NAME to return the name of the processor on which it was called at the moment of the call See MPI 1 2 Section 7 1 1 The current MPI definition does not require messages to carry data type information Type information might be added to messages to allow the system to detect mismatches See MPI 1 2 Section 3 3 2 Vendors can write optimized collective routines matched to their architectures or a complete library of collective communication
40. E smal E E 57 EEE E E T N TE E E T E A A A 58 Thread compliant library spssoseienisnnerdea a E Ra ERE 59 CPRODINAING muinean e E a A A e A E 60 MPICH object compatibility for LINUX 2 eee eeeeeeeceeeeceeeeeeeeeeeeeeeaaeeeeeaaaeesecaeeeeeeeeeeeetaeees 63 MPICH2 compatibility wiscscattecasseandevettvavescuestaceaestesnuienarettenas bdavedacseonegardeaacciesuaaeaiternsaners 65 Examples of building On LINUX eeececeeeeeeeeeeeeeeeeeeeeeeeaeeeceaaaeeeeeeeeeseaeeeeseaeeeseeeeeeseneees 66 Running applications ON LINUX 2 eee e eee eeee eee e eee sas aieia aaa NEE E AEE ERES 67 Running applications ON WiINGOWS eceeceeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeneaeeeeeeteeeeaeeees 88 MPIFUM OPTIONS stsccccccsteuetiact Qecennturensedesennngiraaacneteursactiedscusuvagcesadeuavedgans decsteiuiiaaeceeseenniaace 102 Runtime environment variables ccccceeeceeeeeeeeeeeeeeeeeeeeeeeeeeeeeseaaeeeseaeeeeeeteeeetenees 118 List of runtime environment variables ccceceeeceeeeeeeeeeeeeeeeeeeeeeeeaaeeseccaeeeeeeeeeeeeeees 121 SCAUADINGY sseceeteveagneactd cchavesan tech i a aa teeteadener ees hateastecedadd 145 Dynamic PIOCESS OS 42 iccecece ei ntecedicneungnantndsathee ns genacentacadchacne iodedsnanaatacansansimencatvad Secehareines 147 SINGISTOM AUMCHUING cesi n gair faaeccetstianedeceg is peenaned G 148 License release regain ON SUSPENC TESUME 00 0 sees eeeeeeee cece eee eetneeeeeeteeenaaaeeeeeeeeeeaaaees 149 Improved deregistr
41. M PI calls For example in thetwo calls below the send operation sends an integer but the matching receive operation receives a floating point number if rank 1 then MPI Send amp bufl 1 MPI INT 2 17 MPI COMM WORLD else if rank 2 MPI _Recv amp buf2 1 MPI_ FLOAT 1 17 MPI_COMM WORLD amp status MPI object space corruption Detects attempts to write into objects such as MPI_Comm MPI_Datatype M PI_ Request MPI_Group and MPI_Errhandler M ultiple buffer writes Detects whether the data type specified in a receive or gather operation causes MPI to write to a user buffer more than once To disable these checks or enable formatted or unformatted printing of message data to a file set the MPI_DLIB_FLAGS environment variable options appropriately To use the diagnostics library specify the Idmpi option to the build scripts when you compile your application This option is supported on Linux and Windows Note 172 Platform MPI User s Guide Debugging and Troubleshooting Using DLIB reduces application performance Also you cannot use DLIB with instrumentation Enhanced debugging output Platform M PI provides the stdio option to allow improved readability and usefulness of M PI processes stdout and stderr Options have been added for handling standard input Directed Input is directed to a specific M PI process Broadcast Input is copied to the stdin of all processes Ignore Input is ignored T
42. M PICH and Platform M PI is more problematic and is not recommended For more information see M PICH object compatibility for Linux on page 63 Installation and setup QUESTION How areranks launched Or why do I get the message remshd Login incorrect or Permission denied ANSWER Thereareanumber of ways that Platform M PI can launch ranks but some way must be made available 1 Allow passwordlessrs h access by settingup hosts equiv and or rhost files to allow the mpi run machineto user sh to access the execution nodes 2 Allow passwordlesss s h access from the mpi r un machine to the execution nodes and set the environment variable M PI_REM SH to the full path ofs sh 3 UseSLURM srun by using the srun option with mpi run 4 Under Quadrics useRMS pr un by using the prun option with mpi run For Windows see the W indows FAQ section QUESTION How can I verify that Platform M PI is installed and functioning optimally on my system ANSWER A simplehe o_world testis availablein MPI ROOT hel p hello_world c that can validate basic launching and connectivity Other more involved tests are there as well including a simple ping_pong_ring c test to ensure that you are getting the bandwidth and latency you expect The Platform MPI for Linux library includes a lightweight system check API that does not require a separate license to use This functionality allows customers to test the basic installation and setup of Platf
43. PI license key by hostname and hostid The hostid isthe MAC address of the ethO network interface The eth0 MAC address is used even if that network interface is not configured The hostid can be obtained by typing the following command if Platform M PI isinstalled on the system opt platform_mpi bin licensing arch Imutil Imhostid The amp h0 MAC address can be found using the following command sbin ifcontig egrep eth0 awk print 5 sed s g The hostname can be obtained by entering the command host name To request a three server redundant license key for Platform M PI for Linux contact the Platform Computing Version identification To determine the version of a Platform M PI installation usethei dent orr pm command on Linux For example mpirun version or rpm qa grep platform_mpi 34 Platform MPI User s Guide Getting Started Getting started using Windows Configuring your environment The default install directory location for Platform M PI for Windows is one of the following directories On 64 bit Windows C Program Files x86 Platform Computing Platform M PI On 32 bit Windows C Program Files Platform Computing Platform M PI The default install defines the system environment variable M Pl_ ROOT but does not put MP _ ROOT bi nin the system path or your user path If you choose to move the Platform M PI installation directory from its default location 1 Change the system en
44. Sets the environment variable var for the program and gives it the value val if provided Environment variable substitutions for example FO 0 are supported in the val argument To append settings to a variable V AR can be used Sets the target shell PATH environment variable to paths Search paths are separated by acolon Special Platform MPI mode option ha Eliminates an M PI teardown when ranks exit abnormally Further communications involved with ranks that are unreachable return error class M PI_ERR_EXITED but the communications do not forcethe application to teardown if theM PI_Errhandler is set toMPI_ERRORS RETURN This mode never uses shared memory for inter process communication Platform M PI high availability mode is accessed by using the ha option to mpi run Platform MPI User s Guide 109 Understanding Platform MPI To allow users to select the correct level of high availability features for an application the ha option accepts a number of additional colon separated options which may be appended to the ha command line option For example mpirun ha option 1j option2 L Table 16 High availability options Options Descriptions ha Basic high availability protection When specified with no options ha is equivalent to ha noteardown detect ha i Use of lightweight instrumentation with ha ha infra High availability for infrastructure mpi run mpi d ha detect Detection of failed communica
45. amount of prepinned memory Platform M PI uses can be adjusted using several tunables such as MPI_RDMA_MSGSIZE MPI_RDMA_NENVELOPE MPI_RDMA NSRQRECV andMPI_RDMA_NFRAGMENT By default when the number of ranks is less than or equal to 512 each rank prepins 256 Kb per remote rank thus making each rank pin up to 128 M b If the number of ranks is above 512 but less than or equal to 1024 then each rank only prepins 96 Kb per remote rank thus making each rank pin up to 96 M b If the number of ranks is over 1024 then the shared receiving queue option is used which reduces the amount of prepinned memory used for each rank to a fixed 64 M b regardless of how many ranks are used Platform M PI also has safeguard variables M PI_PHYSICAL_M EMORY and MPI_PIN_PERCENTAGE which set an upper bound on thetotal amount of memory aPlatform M PI job will pin An error isreported during start up if this total is not large enough to accommodate the prepinned memory 146 Platform MPI User s Guide Understanding Platform MPI Dynamic processes Platform M PI provides support for dynamic process management specifically the spawning joining and connecting of new processes MP _Comm_spawn starts M PI processes and establishes communication with them returning an intercommunicator MPI_Comm_spawn_multiple starts several binaries or the same binary with different arguments placing them in the same comm_world and returning an intercommunicator The M PIl_C
46. and MPI_Rsend but it is dependent on message size Deadlock situations can occur when your code uses standard send operations and assumes buffering behavior for standard communication mode Nonblocking communication M PI provides nonblocking counterparts for each of the four blocking send routines and for the receive routine The following table lists blocking and nonblocking routine calls Table 3 MPI blocking and nonblocking calls Blocking Mode Nonblocking Mode MPI_Send MPI_Isend MPI_Bsend MPI_lIbsend MPI_Ssend MPI_Issend MPI_Rsend MPI_Irsend MPI_Recv MPI_lIrecv Platform MPI User s Guide 19 Introduction Nonblocking calls have the same arguments with the same meaning as their blocking counterparts plus an additional argument for a request To codea standard nonblocking send use MPI_Isend void buf int count MPI_datatype dtype intdest int tag MPI_Comm comm MPI_ Request req where req Specifies the request used by a completion routine when called by the application to complete the send operation To completenonblocking sends and receives you can use M PI_W ait or MPI_ Test The completion of a send indicates that the sending process is free to access the send buffer The completion of areceive indicates that the receive buffer contains the message the receiving process is free to access it and the status object that returns information about the received message is set Collective operations Applic
47. applications linked with the thread compliant library Imtmpi Instrumentation is not supported for applications linked with the diagnostic library Idmpi Creating an instrumentation profile Counter instrumentation is a lightweight method for generating cumulative run time statistics for M PI applications When you create an instrumentation profile Platform M PI creates an ASCII format file containing statistics about the execution Instrumentation is not supported for applications linked with the diagnostic library Idmpi The syntax for creating an instrumentation profile is mpirun i prefix nc off where prefix Specifies the instrumentation output file prefix The rank zero process writes the application s measurement data to prefix instr in ASCII If the prefix does not represent an absolute pathname the instrumentation output file is opened in the working directory of the rank zero process when M PI_Init is called Locks ranks to CPUs and uses the CPU s cycle counter for less invasive timing If used with gang scheduling the l is ignored nc Specifies no clobber If the instrumentation output file exists M PI_Init aborts off Specifies that counter instrumentation is initially turned off and only begins after all processes collectively call M PIH P_Trace_on For example to create an instrumentation profile for an executable called compute pi MPI_ROOT bin mpirun i compute_pi np 2 compute_pi This invocati
48. are interrupted As a result some applications might benefit from it others might experience a decrease in performance As part of tuning the performance of an application you can control the behavior of the heartbeat signals by changing their time period or by turning them off This is accomplished by setting the time period of the s option in the M PI_FLAGS environment variable for example s600 Timeisin seconds You can use thes a p option with the thread compliant library as well as the standard non thread compliant library Settings a p for the thread compliant library has the same effect as setting MP _MT_FLAGS ct when you use a value greater than 0 for The default value for the thread compliant library iss p0 MPI_MT_FLAGS ct takes priority over the default M P _ FLAGS sp0 Set M P _ FLAGS sa1 to guarantee that M PI_Cancel works for canceling sends To use gprof on HP XC systems set these environment variables MPI_FLAGS s0 GMON_OUT_PREFIX tmp app name These options are ignored on Platform M PI for Windows Enables spin yield logic is the spin value and is an integer between zero and 10 000 The spin value specifies the number of milliseconds a process should block waiting for a message before yielding the CPU to another process H ow you apply spin yield logic depends on how well synchronized your processes are For example if you havea process that wastes CPU time blocked waiting for messages you can use
49. basic information about the machine you used 0 number of CPUs processor type e t c 0 Q sesssesssssssssssssssssssssssssssssssssssssssssssssss 0 MPI Rank User seconds System seconds 0 0 4 95 2 36 0 1 5 16 LA 0 2 4 82 2 43 0 3 5 20 1 18 aE E a ee 0 Total 20 12 7 13 sr un iS supported on HP XC systems with SLURM Using the srunargument from the mpi r un command lineis still supported 236 Platform MPI User s Guide Frequently Asked Questions General QUESTION Where can get the latest version of Platform M PI ANSWER Customers can go to my platform com QUESTION Can I use Platform MPI in my C application ANSWER Yes Platform M PI provides C classes for M PI bindings T he classes provided are an inlined interface class to M PI C bindings Although most classes are inlined asmall portion is a prebuilt library Thislibraryisg ABI compatible BecausesomeC compilers arenot g ABI compatible weprovide the source files and instructions on how to build this library with your C compiler if necessary For moreinformation see C bindings for Linux on page 54 QUESTION How can tell what version of Platform M PI I m using ANSWER Try one of the following 1 mpirun version 2 on Linux rpm qalgrep platform_mpi For Windows see the W indows FAQ section QUESTION What Linux distributions does Platform M PI support ANSWER See the release note for your product for this informat
50. be problematic and is not advised You can compile communicator c under Platform M PI M PICH compatibility mode as export MPlLROOT opt platform_mpi MPI_ROOT bin mpicc mpich o prog x MPIl_ROOT help communicator c and run the resulting prog x under M PICH However some problems will occur First the M PICH installation must be built to include shared libraries and a soft link must be created for i bmpi ch so because their libraries might be named differently Next an appropriateLD_ LIBRARY PATH setting must be added manually because M PICH expects the library path to be hard coded into the executable at link time via rpath Finally although the resulting executable can run over any supported interconnect under Platform MPI it will not under M PICH due to not being linked to libgm libelan etc Similar problems would be encountered if linking under M PICH and running under Platform M PI s MPICH compatibility M PICH s use of rpath to hard code the library path at link time keeps the Platform MPI User s Guide 63 Understanding Platform MPI executable from being able to find the Platform M PI M PICH compatibility library via Platform M PI s LD_LIBRARY_PATH setting C bindings are not supported with M PICH compatibility mode MPICH compatibility mode is not supported on Platform M PI V 1 0 for Windows 64 Platform MPI User s Guide Understanding Platform MPI MPICH2 compatibility MPICH compatibility mode supports a
51. buffer choice IN sendcount number of elements in send buffer IN sendtype data type of send buffer elements handle OUT recvbuf address of receive buffer choice significant only at root IN recvcount number of elements for any single receive significant only at root IN recvtype data type of recv buffer elements significant only at root handle IN root rank of receiving process integer IN comm communicator handle nt MPI GathervL void sendbuf MPI Aint sendcount MPI Datatype sendtype void recvbuf MPI_Aint recvcounts MPI _Aint displs P Datatype recvtype int root MPI Comm comm N sendbuf starting address of send buffer choice IN sendcount number of elements IN send buffer non negative integer IN sendtype data type of send buffer elements handle OUT recvbuf address of receive buffer choice significant only at root IN recvcounts array equal to the group size specifying the number of elements that can be received from each rank IN displs array of displacements relative to recvbuf IN recvtype data type of recv buffer elements significant only at root handle IN root rank of receiving process integer IN comm communicator handle int MPI _ReduceL void sendbuf void recvbuf MPI_Aint count Platform MPI User s Guide 223 Large message APIs MPI Datatype datatype MP Op op int root MPI_Comm comm N sendbuf address of send buffer choice OUT recvbuf address of receive buffer choice
52. buffers are aligned and offset from each other toavoid cache conflicts caused by direct process to process byte copyoperations To run this example 1 Definethe CHECK macro to check data integrity 2 Increase the number of bytes to at least twice the cache size to obtain representative bandwidth measurements include lt stdio h gt include lt stdlib h gt Platform MPI User s Guide 185 Example Applications include lt math h gt include lt mpi h gt ALIGN ign buffers and displace themin the cache to avoid collisions 1 1 amp ALIGN 1 MPI COMM WORLD i a Ns MPI COMM WORLD define NLOOPS 1000 define ALIGN 4096 main argc argv in argc char argv ne n lige ORS double start stop in nbytes 0 in rank size Pl Status status char huf Pl _Init amp argc amp argv Pl Comm rank MP COMM WORLD amp rank PI Comm size MPI COMM WORLD amp si ze ii eize te 2 i if rank printf ping pong must have two processes n Pl Finalize exit 0 nbytes argc gt 1 atoi argv 1 0 if nbytes lt 0 nbytes 0 Page al buf char malloc nbytes 524288 ALIGN if buf 0 MPI _Abort MP COMM WORLD MPI _ERR_ BUFFER exit 1 buf char C msi amen long buf if rank 1 buf 524288 memset buf 0 nbytes Ping pong a if rank 0 printf ping pong d bytes n nbytes E
53. compliant library was required for multithreaded applications even if only one thread was making aM PI call at atime To link with the thread compliant library on Linux systems specify the libmtmpi option to the build scripts when compiling the application To link with the thread compliant library on Windows systems specify the Imtmpi option to the build scripts when compiling the application Application types that no longer require linking to the thread compliant library include Implicit compiler generated parallelism Thread parallel applications using the HP M LIB math libraries OpenMP applications pthreads Only if no two threads call M PI at the same time Otherwise use the thread compliant library for pthreads Platform MPI User s Guide 59 Understanding Platform MPI CPU binding Thempi run option cpu_bind binds a rank to an Idom to prevent a process from moving to a different Idom after start up The binding occurs before the M PI application is executed To accomplish this a shared library is loaded at start up that does the following for each rank Spins for a short time in a tight loop to let the operating system distribute processes to CPUs evenly This duration can be changed by setting the MPI_CPU_SPIN environment variable which controls the number of spinsin the initial loop Default is 3 seconds Determines the current CPU and Idom Checks with other ranks in the M PI job on the host for oversubscription by
54. create the block of entries I BlockOfEntries BlockOfEntries int numOfEntries_p int myRank ll Initialize the random number generator s seed based on the caller s rank thus each rank should but might not get different random values I srand unsigned int myRank numOfEntries NUM OF ENTRIES PER RANK numOfEntries_p numOfEntries iiaiai l Add in the left and right shadow entries numOf Entries 2 H II Allocate space for the entries and use rand to initialize the values ll entries new Entry numOfEntries for int i 1 i lt numOfEntries 1 i entries i new Entry COM ASI rane SalOOM e rane se Wir a a a Platform MPI User s Guide 211 Example Applications H if Il Initialize the shadow entries Ht if entries 0 new Entry MI NENTRY entries numOfEntries 1 new Entry MAXENTRY H if BlockOfEntries BlockOfEntries Function delete the block of entries Ll BlockOfEntries BlockOfEntries for int 1 i lt numOfEntries 1 lete entries i delete entries 0 delete entries numOfEntries 1 delete entries Function Adjust the odd entries ac ee void BlockOfEntries singleStepOddEntries BlockOfEntries singleStepOddEntries ror iw i 0 lt Pee TES ite it E entries i entries i 1 Entry temp ont eeti 1 entries i tl entries i entries i temp BlockOfEntries singleStepEvenEntries
55. ien ien ien ien ien ien ien ien ien ien ien ien ien ien ien ient ien ien ien ien ien ien ien ien ien ien ien ien ien ien ien ien ien ient ien ien ien ien ien ien ien ien ien ien ien ien ien ien ien ien ien ient ien ien ien ien ien ien ien ien ien ien ien ien reer reo oreo reer reer reo reer 200 2 AD OR OR ODOFrHOOCOr 200 2D OrOOOC SO 2Or OF 2Or OF a O H ooror Example Applications Platform MPI User s Guide 209 Example Applications server 1 processed request 0 for client 0 server 1 processed request 0 for client 0 server 1 processed request 0 for client 0 server 0 processed request 0 for client 0 server 1 processed request 0 for client 0 server 0 processed request 0 for client 0 server 1 processed request 1 for client 1 server 1 processed request 1 for client 1 server 1 processed request 1 for client 1 server 0 processed request 1 for client 1 server 1 processed request 1 for client 1 server 0 processed request 1 for client 1 server 0 total service requests 38 server 1 total service requests 42 sort C Thisprogram doesasimpleinteger sortin parallel Thesortinputis built usingthe rand random number generator The program is self checking and can run with any number of ranks define NUM OF ENTRIES PER_RANK100 stdio h gt st
56. in MP1 _ ROOT hel p HPMPI vsprops and MPI ROOT hel p HPMP 64 vsprops 1 Go to VS Project gt View gt Property Manager and expand the project This displays the different configurations and platforms set up for builds Include the appropriate property page HP MPI vsprops for 32 bit applications HP MPI 64 vsprops for 64 bit applications in Configuration gt Platform 2 Select this page by either double clicking the page or by right clicking on the page and selecting Properties Go to theUser Macros section Set MPI ROOT to the desired location for example the installation location of Platform M PI This should be set to the default installation location Platform MPI User s Guide 89 Understanding Platform MPI ProgramFiles x86 Platform Computing Platform MPI Note This is the default location on 64 bit machines The location for 32 bit machines is ProgramFil es Platform Computing Platform MPI 3 TheMPI application can now be built with Platform M PI The property page sets the following fields automatically but can also be set manually if the property page provided is not used 1 C C Additional Include Directories Set to MPI ROOT i nclude 32 64 2 Linker Additional Dependencies Set tol i bpcmpi 32 1ib orl i bpcmpi 64 11 b depending on the application 3 Additional Library Directories Set to MMP ROOT Ii b Building and running on a Windows 2008 cluster using appfiles T
57. in the job allocation For additional options see the release note for your specific version QUESTION How do install in a non standard location on Windows ANSWER To install Platform M PI on Windows double click setup exe and follow the instructions Oneof theinitial windowsisthe Select Directory window which indicates where to install Platform M PI If you are installing using command line flags use DIR lt path gt to change the default location QUESTION Which compilers does Platform M PI for Windows work with ANSWER Platform M PI works well with all compilers W e explicitly test with Visual Studio Intel and Portland compilers Platform M PI strives not to introduce compiler dependencies QUESTION What libraries do need to link with when build 242 Platform MPI User s Guide Frequently Asked Questions ANSWER Werecommend using thempi cc and mpi f 90 scriptsin MPI ROOT bin to build If you do not want to build with these scripts use them with the show option to see what they are doing and use that as a starting point for doing your build The show option prints out the command to be used for the build and not execute Because these scripts are readable you can examine them to understand what gets linked in and when If you are building a project using Visual Studio IDE we recommend adding the provided PMPI vsprops for 32 bit applications or PMP 64 vs props for 64 bit applications to the property pa
58. job Specifies use of the shared receiving queue protocol when OFED M yrinet GM ITAPI M ellanox VAPI or uUDAPL V1 2 interfaces are used This protocol uses less prepinned memory for short message transfers Extended Reliable Connection XRC is a feature on ConnectX InfiniBand adapters Depending on thenumber of application ranks that are allocated to each host XRC can 108 Platform MPI User s Guide Understanding Platform MPI significantly reducetheamount of pinned memory that is used by theI nfiniBand driver Without XRC the memory amount is proportional to the number of ranks in the job With XRC the memory amount is proportional to the number of hosts on which the job is being run The xrc option is equivalent to srq e M PI_IBV_XRC 1 OFED version 1 3 or later is required to use XRC MPI 2 functionality options 1sided spawn Enables one sided communication Extends the communication mechanism of Platform M PI by allowing oneprocess to specify all communication parameters for the sending side and the receiving side The best performance is achieved if an RDM A enabled interconnect like InfiniBand is used With this interconnect the memory for the one sided windows can comefrom MPI_Alloc_mem or from malloc If TCP IP is used the performance will be lower and the memory for the one sided windows must come from M PI_Alloc_mem Enables dynamic processes Environment control options e var val sp paths
59. mpif90 pgf90 hello _f90 f90 hello f90 f90 mpirun a out Hello world am 0 of 1 C command line basics for Windows The utility MP1 ROOT bi n mpi cc isincluded to aid in command line compilation To compile with this utility set the M PI_CC environment variable to the path of the command line compiler you want to use Specify mpi32 or mpi64 to indicate if you are compiling a 32 bit or 64 bit application Specify the command line options that you would normally pass to the compiler on thempi cc command line Thempi cc utility adds command line options for Platform M PI include directories and libraries Y ou can specify the show option to indicate that mpi cc should display the command generated without executing the compilation command For more information see the mpi cc manpage To construct the compilation command the mpi cc utility must know what command line compiler is to be used the bitness of the executable that compiler will produce and the syntax accepted by the compiler These can be controlled by environment variables or from the command line Table 11 mpi cc utility Environment Variable Value Command Line MPI_CC desired compiler default cl mpicc lt value gt MPI_BITNESS 32 or 64 no default mpi32 or mpi64 MPI_WRAPPER_SYNTAX windows or unix default mpisyntax lt value gt windows For example to compile hello_world c with a 64 bit cl contained in your PATH use the following command beca
60. n02 Running with a hostfile using HPCS 1 Perform Steps 1 and 2 from Building and Running on a Single Host 2 Change to a writable directory on a mapped drive The mapped drive must be to a shared folder for the cluster 3 Create a file hostfile containing the list of nodes on which to run n01 n02 n03 n04 4 Submit the job to HPCS X demo gt MPIROOT bin mpirun hpc hostfile hfname np 8 hello_world exe Nodes are allocated in the order that they appear in the hostfile Nodes are scheduled cyclically so if you have requested more ranks than there are nodes in the hostfile nodes are used multiple times 5 Analyzehello_world output Platform M PI prints the output from running thehello_world executable in non deterministic order The following is an example of the output Hello world m5 of 8 on n02 Hello world m0O of 8 on n01 Hello world I m 2 of 8 on n03 Hello world m6 of 8 on n03 Hello world m1 of 8 on n02 Hello world I m 3 of 8 on n04 Hello world m4 of 8 on n01 Hello world m7 of 8 on n04 Running with a hostlist using HPCS Perform Steps 1 and 2 from Building and Running on a Single H ost 1 Changeto a writable directory on a mapped drive The mapped drive should beto a shared folder for the cluster 2 Submit the job to HPCS including the list of nodes on the command line X demo gt MPI_ROOT bin mpirun hpc hostlist n01 n02 n03 n04 np 8 hello_world exe 92 Platform MPI User s G
61. nodes exe JobNewOut oJ ob StdOut Readal Set appfile fs CreateTextFile lt path gt appfile True Rsrc Split JobNewOut For LBound Rsrc 1 to UBound Rsrc Step 2 appfile WriteLine h Rsrc np Rsrc l tl _ lt path gt foo exe Next appfile Close Set oJob sh exec MPI ROOT bin mpirun exe TCP f lt path gt appfile wscript Echo ojob StdOut Reada 3 Submit the job as in the previous example C gt job submit id 4288 The above exampleusings ubmi ssion_script vbs isonly an example Other scripting languages can be used to convert the output of mpi _nodes exe into an appropriate appfile Building and running multihost on Windows HPCS clusters The following isan exampleof basic compilation and run steps to execute hello_world c on a cluster with 16 way parallelism To build and run hello_world c on an HPCS cluster 1 Changeto a writable directory on a mapped drive The mapped drive should beto a shared folder for the cluster 2 Open aVisual Studio command window This example uses a 64 bit version so a Visual Studio x64 command window opens 3 Compile the hello_world executable file X Demo gt MPI_ROOT bin mpicc mpi64 MPI_ROOT help hello_world c Microsoft C C Optimizing Compiler Version 14 00 50727 42 for x64 Copyright Microsoft Corporation All rights reserved hello _world c Microsoft Incremental Linker Version 8 00 50727 42 Copyright Microsoft Corporation A
62. of those strings is searched for individually in the output from sbi n smod Thecarrotin the search pattern is used to signify the beginning of aline but the rest of regular expression syntax is not supported In many cases if a system has a high speed interconnect that is not found by Platform M PI dueto changes in library names and locations or module names the problem can be fixed by simple edits to the pcmpi conf file Contacting Platform M PI support for assistance is encouraged Protocol specific options and information TCP IP IBAL This section briefly describes the available interconnects and illustrates some of the more frequently used interconnects options TCP IP is supported on many types of cards M achines often have more than one IP address and a user can specify the interface to be used to get the best performance Platform M PI doesnotinherently know which IP address correspondsto the fastest availableinterconnect card By default IP addresses are selected based on the list returned by gethostbyname The mpi r un option netaddr can be used to gain more explicit control over which interface is used IBAL is only supported on Windows Lazy deregistration is not supported with IBAL Platform MPI User s Guide 83 Understanding Platform MPI IBV Platform M PI supports OpenFabrics Enterprise Distribution OFED through V 1 4 Platform M PI can use both the verbs 1 0 and 1 1 interface To use OFED on Linux the m
63. on the ELAN interconnect However some applications may experience resource problems 162 Platform MPI User s Guide Tuning Message latency and bandwidth Latency is thetime between the initiation of the data transfer in the sending process and the arrival of the first byte in the receiving process Latency often depends on the length of messages being sent An application s messaging behavior can vary greatly based on whether a large number of small messages or a few large messages are sent M essagebandwidth isthereciprocal of thetimeneeded to transfer a byte Bandwidthisnormally expressed in megabytes per second Bandwidth becomes important when message sizes are large To improve latency bandwidth or both Reduce the number of process communications by designing applications that have coarse grained parallelism Use derived contiguous data types for dense data structures to eliminate unnecessary byte copy operations in some cases U se derived data types instead of M PI_ Pack and MPI_Unpack if possible Platform M PI optimizes noncontiguous transfers of derived data types Use collective operations when possible This eliminates the overhead of using M PI_ Send and M PI_Recv when oneprocess communicates with others Also usethe Platform M PI collectives rather than customizing your own Specify the source process rank when possible when calling M PI routines Using MPI_ANY_SOURCE can increase latency Double word align data
64. printed to stdout C gt job new numprocessors 16 exclusive true Job queued ID 4288 5 AddasingleCPU mpi r un task to the newly created job Thempi r un job creates more tasks filling the rest of the resources with the compute ranks resulting in a total of 16 compute ranks for this example C gt job add 4288 numprocessors 1 exclusive true stdout node path to a shared file out stderr node path to a shared file err MPI_ROOT bin mpirun hpc node path to hello_world exe 6 Submit the job The machine resources are allocated and the job is run C gt job submit id 4288 Run multiple program multiple data MPMD applications Torun Multiple Program M ultiple Data M PM D applications or other more complex configurations that require further control over the application layout or environment dynamically create an appfile within thejob using the utility MPI ROOT bi n mpi_nodes exe asin the following example The 88 Platform MPI User s Guide Understanding Platform MPI environment variable CCP_NODES cannot beused for this purpose becauseit only contains thesingle CPU resource used for the task that executes the mpi r un command To create the executable perform Steps 1 through 3 from the previous section Then continue with 1 Create anew job C gt job new numprocessors 16 exclusive true Job queued ID 4288 2 Submit a script Verify M PI_ROOT is set in the environment C
65. properly determined by the Platform M PI utilities mpi run and mpi d The default is mpi64 Debugging and informational options help Prints usage information for mpi run version 106 Platform MPI User s Guide prot d p V i spec Understanding Platform MPI Prints the major and minor version numbers Printsthecommunication protocol between each host e g TCP IP or shared memory The exact format and content presented by this option is subject to change as new interconnects and communication protocols are added to Platform M PI Behaves likethe p option but supports two additional checks of your M PI application it checks if the specified host machines and programs are available and also checks for access or permission problems This option isonly supported when using appfile mode Debug mode Prints additional information about application launch Prints the Platform MPI job ID Turns on pretend mode The system starts a Platform M PI application but does not create processes Thisis useful for debugging and checking whether the appfileis set up correctly This option is for appfiles only Turns on verbose mode Enablesrun timeinstrumentation profiling for all processes spec specifies options used when profiling The options are the same as those for the environment variable M PI_INSTR For example the following is valid MPI_ROOT bin mpirun i mytrace l nc f appfile Lightweight instrumen
66. routines can be written using MPI point to point routines and a few auxiliary functions See MPI 1 2 Section 4 1 Error handlers in MPI take as arguments the communicator in use and the error code to be returned by the MPI routine that raised the error An error handler can also take stdargs arguments whose number and meaning is implementation dependent See MPI 1 2 Section 7 2 and MPI 2 0 Section 4 12 6 MPI implementors can place a barrier inside MPIL_FINALIZE See MPI 2 0 Section 3 2 2 MPI defines minimal requirements for thread compliant MPI implementations and MPI can be implemented in environments where threads are not supported See MPI 2 0 Section 8 7 The format for specifying the file name in MPI_FILE_OPEN is implementation dependent An implementation might require that file name include a string specifying additional information about the file See MPI 2 0 Section 9 2 1 230 Platform MPI User s Guide Platform MPI provides the mpi run np utility and appfiles as well as start up integrated with other job schedulers and launchers See the relevant sections in this guide Each process in Platform MPI applications can read and write input and output data to an external drive Platform MPI sets the value of MP _HOST to MPLPROC_NULL If you do not specify a host name to use the host name returned is that of gethostname If you specify a host name using the h option to mpi run Platform MPI returns that host name
67. run time option is specified the default is mul t MPI_ROOT MPI_ROOT indicates the location of the Platform M PI tree If you move the Platform MPI installation directory from its default location in opt platform mpi for Linux set theMPI_ROOT environment variable to point to the new location MPI_WORKDIR MPI_WORKDIRchanges the execution directory This variableisignored when sr un or prun isused CPU bind environment variables MPL_BIND_MAP MPI_BIND_MAP allowsspecification of theinteger CPU numbers Idom numbers or CPU masks These are a list of integers separated by commas MPI_CPU_AFFINITY MPI_CPU_AFFINITY isan alternative method to using c pu_bi nd on thecommand line for specifying binding strategy The possible settings are LL RANK MAP_CPU MASK_CPU LDOM CYCLIC BLOCK RR FILL PACKED SLURM and MAP_LDOM MPI_CPU_SPIN MPI_CPU_SPIN allows selection of spin value The default is 2 seconds MPI_FLUSH_FCACHE MPI_FLUSH_FCACHE clears the file cache buffer cache If you add eMPI_FLUSH_FCACHE x to thempi r un command line the file cache is flushed before the code starts where x is an optional percent of memory at which to flush If the memory in the file cache is greater than x the memory is flushed The default value is 0 in which case a flush is always performed Only the lowest rank on each host flushes the file cache limited to one flush per host job Setting this environment variable saves time if
68. show option to compiler wrappers When compiling by hand run mpi cc show and a line prints showing what the job would do and skipping the build Fortran 90 To use the mpi Fortran 90 module you must create the module file by compiling themodul e F file infopt platform mpi incl ude 64 module F for 64 bit compilers For 32 bit compilers compile themodule F filein opt platform mpi incl ude 32 module F Note Each vendor e g PGI Qlogic Pathscale Intel Gfortran etc has a different module file format Because compiler implementations vary in their representation of a module file a PGI module file is not usable with Intel and so on Additionally forward compatibility might not be the case from older to newer versions of a specific vendor s compiler Because of compiler version compatibility and format issues we do not build module files In each case you must build just once themodulethat correspondsto mpi with thecompiler you intend to use 50 Platform MPI User s Guide Understanding Platform MPI For example with pl at form mpi bin andpgi bininpath pgf90 c opt platform mpi include 64 module F cat gt hello f90 f90 program main use mpi implicit none integer ss lert rank size call MPI _INIT ierr call MPI COMM RANK MP COMM WORLD rank ierr call MPI COMM SI ZE MP COMM WORLD size ierr print Hello world am a wank OF Size call MPI _FINALI ZE ierr End mpif90
69. suffix The MPI_ICLIB_ variables specify names of libraries to be called by d open TheMPI_ICMOD variables specify regular expressions for names of modules to search for 82 Platform MPI User s Guide Understanding Platform MPI An example is the following two pairs of variables for PSM MPI_ICLIB_PSM_PSM_MAIN libpsm_infinipath so 1 MPI_ICMOD_PSM_PSM_MAIN 4ib_ipath and MPI_ICLIB_PSM__PSM_PATH usr lib64 libpsm_infinipath so 1 MPI_ICMOD_PSM__PSM_PATH 4ib_ipath The suffixes PSM MAIN and PSM _PATH are arbitrary and represent two attempts that are made when determining if the PSM interconnect is available The list of suffixes is in the variable M PI_IC_SU FFIXES which is also setin thepc mpi conf file So when Platform M PI is determining the availability of the PSM interconnect it first looks at MPI_ICLIB_ PSM__PSM_ MAIN MPI_ICMOD_PSM__PSM_MAIN for the library to used open and the module name to look for Then if that fails it continues on to the next pair MPI_ICLIB_ PSM __PSM_PATH MPI_ICMOD_PSM_ PSM PATH which in this case specifies a full path to the PSM library TheMPI_ICMOD_ variables allow relatively complex values to specify the module names to be considered as evidence that the specified interconnect is available Consider the example MPI_ICMOD_VAPI__VAPI_MAIN 4mod_vapi mod_vip ib_core This means any of those three names will be accepted as evidence that VAPI is available Each
70. supports automatic interconnect selection If a valid InfiniBand network is found IBAL is selected automatically It is no longer necessary to explicitly specify i bal 1 BAL TCP Specifies that TCP IP should be used instead of another high speed interconnect If you have multiple TCP IP interconnects use netaddr to specify which interconnect to use Use prot to see which interconnect was selected Example MPI_ROOT bin mpirun TCP hostlist host1 4 host2 4 np8 a out commd Routes all off host communication through daemons rather than between processes Not recommended for high performance solutions Local host communication method intra mix Use shared memory for small messages The default is 256 KB or what is set by MPI_RDMA_INTRALEN For larger messages the interconnect is used for better bandwidth This same functionality is available through the environment variable M PI_INTRA which can beset to shm nic or mix This option does not work with TCP Elan MX or PSM intra nic Use the interconnect for all intrahost data transfers Not recommended for high performance solutions intra shm Use shared memory for all intrahost data transfers This is the default TCP interface selection netaddr Platform MPI User s Guide 103 Understanding Platform MPI This option is similar to subnet but allows finer control of the selection process for TCP IP connections M PI has two main sets of connections th
71. tag MP _Comm comm where buf Specifies the starting address of the buffer count Indicates the number of buffer elements dtype Denotes the data type of the buffer elements dest Specifies the rank of the destination process in the group associated with the communicator comm tag Denotes the message label comm Designates the communication context that identifies a group of processes To codea blocking receive use MPI_Recv void buf int count MPI_datatype dtype int source int tag MPI_Comm comm MPI_Status status where buf 18 Platform MPI User s Guide Introduction Specifies the starting address of the buffer count Indicates the number of buffer elements dtype Denotes the data type of the buffer elements source Specifies the rank of the source process in the group associated with the communicator comm tag Denotes the message label comm Designates the communication context that identifies a group of processes Status Returns information about the received message Status information is useful when wildcards are used or the received message is smaller than expected Status may also contain error codes Thesend receive f ping_ pong c andmaster worker f90 examples all illustrate the use of standard blocking sends and receives Note You should not assume message buffering between processes because the MPI standard does not mandate a buffering strategy Platform MPI sometimes uses buffering for MPI_Send
72. tate ARRAYSIZE result ARRAYS ZE integer kind 4 numfail ierr call MPI _Init ierr call MPI Comm rank MPI COMM WORLD taskid ierr call MPI Comm size MPI COMM WORLD numtasks ierr numworkers numtasks 1 chunksize ARRAYSIZE numworkers arraymsg 1 indexmsg 2 real 4 numf a KKK hi te data index 1 do dest 1 numworkers call MPI _Send index 1 MPI_INTEGER dest 0 MPI COMM WORLD ierr call MPI Send data index chunksize MPI REAL dest 0 amp PI COMM WORLD ierr KEKKK KK KKK KKKKKKKKKKKEKS Master task KKKKK KKK KKK KKK KKK KKK KKK KKK KKKK KKK kid eq MASTER then 0 0 index index chunksize end do do i 1 numworkers Source call MPI Recv index 1 MPI INTEGER source 1 MPI _COMM_ WORLD amp status ierr call MPI _Recv result index chunksize MPI_REAL source 1 amp MPI COMM WORLD status ierr end do do i 1 numworkers chunksi ze if result i ne i 1 then Platform MPI User s Guide 195 Example Applications codeph gt print element i expecting itl actual is result 1 numfail numfail 1 endi enddo if numfail ne 0 then print out of ARRAYSIZE elements numfail wrong answers else primit a Correct results endi end KKK KKKKKK KKK KKK KKK KKK KKK KKK KKK Worker task KKK KK KKKKK KKK KK KKK KKK KKK KKK KKK if askid gt MASTER then ca Pi _Recv index 1 MPI_INTEGER MASTER 0
73. the environment by using a command such as setenv MPI_REMSH ssh x The tool specified with M PI_REM SH must support a command line interface similar to the standard utilitiesrsh remsh andssh The n option is one of the arguments mpi r un passes to the remote shell command If the remote shell does not support the command line syntax Platform M PI uses write a wrapper script such as pat h t o myremsh to change the arguments and set the M PI_ REMSH variable to that script Platform M PI supports setting M PI_REM SH using the eoption to mpi run MPI_ROOT bin mpirun e MPI_REMSH ssh lt options gt f lt appfile gt Platform M PI also supports setting M PI_REM SH to a command that includes additional arguments for example ssh x But if this is passed to mpi run with e M PI_REM SH then the parser in Platform M PI V 2 2 5 1 requires additional quoting for the value to be correctly received by mpi run MPI_ROOT bin mpirun e MPl_REMSH ssh x lt options gt f lt appfile gt When using ssh besureitis possibleto uses s h from thehost wherempi r un isexecuted to other nodes withouts sh requiring interaction from the user Also ensures sh functions between the worker nodes because thes s h calls used to launch the job are not necessarily all started by mpi run directly a tree of ssh calls is used for improved scalability Compiling and running your first application To quickly becomefamiliar with compiling and running Pl
74. the info argument to the spawn calls host W eaccept standard host domain strings and start the ranks on the specified host Without this key the default is to start on the same host as the root of the spawn call wdir Weaccept some directory strings path Weaccept some directory some other directory A mechanism for setting arbitrary environment variables for the spawned ranks is not provided Platform MPI User s Guide 147 Understanding Platform MPI Singleton launching Platform M PI supportsthecreation of asinglerank without the useof mpi r un called singleton launching Itisonly valid to launch an M PI_COMM_ WORLD ofsizeoneusingthisapproach T hesinglerank created in thisway is executed asif it werecreated with mpi run np 1 lt executable gt Platform M PI environment variables can influence the behavior of the rank Interconnect selection can be controlled using the environment variable M PI_IC_ORDER Many command line options that would normally be passed to mpi r un cannot be used with singletons Examples include but are not limited to cpu_bind d prot ndd srq and T Some options such as i are accessible through environment variables MPI_INSTR and can still be used by setting the appropriate environment variable before creating the process Creating a singleton using fork and exec from another M PI process has the same limitations that OFED places on fork and exec 148 Platform MPI User s Guide
75. the initial InfiniBand connection setup In this second example the default connection cannot be made The following is the first node configuration ibv_devinfo v hca_id mthca0 fw_ver C a node_gui d 0008 f104 0396 62b4 max pkeys 64 local ca dek delay 15 port 1 state PORT_ACTIVE 4 max mtu 2048 4 phys_state LINK_UP 5 GID 0 fe80 0000 0000 0001 0008 f104 0396 62b5 port 2 state PORT_ACTIVE 4 max mtu 2048 4 phys state LINK _UP 5 GID 0 fe80 0000 0000 0000 0008 f104 0396 62b6 The following is the second node configuration ibv_devinfo v hca_id mthca0 fwover 4 7 0 node guid 0008 104 0396 6270 max pkeys 64 local _ca_ack_del ay 15 port 1 state PORT ACTIVE 4 max_ mtu 2048 4 phys state LINK UP 5 GID 0 fe80 0000 0000 0000 0008 f104 0396 6271 port 2 state PORT_ACTIVE 4 max_ mtu 2048 4 phys state LINK _UP 5 GIDL 0 fe80 0000 0000 0001 0008 f104 0396 6272 In this case the subnet with prefix fe80 0000 0000 0001 includes port 1 on the first node and port 2 on thesecond node Thesecond subnet with prefix fe80 0000 0000 0000 includes port 2 on thefirst node and port 1 on the second To make the connection using the fe80 0000 0000 0001 subnet pass this option ot mpi run e MPI_IB_PORT_GID fe80 0000 0000 0001 Platform MPI User s Guide 133 Understanding Platform MPI If the MPI_IB_PORT_GID environmen
76. to cbe j are the indices c of the columns belonging to the j th block of columns These E indices specify a portion the j th portion of a row and the c datatype cdtype j is created as an MPI vector datatype to refer c to the j th portion of a row ote this a vector datatype Platform MPI User s Guide 201 Example Applications AaAANAAA AQAA AAA AADQDAAAA AAA because adjacent elements in a row are actua elements apart in memory allocate rbs 0 comm si ze 1 che 0 comm size 1 r t cdtype 0 comm_size 1 do blk 0 comm size 1 call blockasgn 1 nrow comm size b call mpi_type_conti guous rbe blk b mpi _ double precision rdtype call mpi _ type_commit rdtype bik call blockasgn 1 ncol comm size b call mpi_type vector cbe blk cbs t mpi double_ precision cdtype call mpi_type_commit cdtype blk enddo Compose MPI datatypes for gather sca Each block of the partitioning is de vectors Each process es partition blocks allocate adtype 0 commsize 1 adisp abl en 0 comm_size 1 cal do rank 0 comm size 1 do rb 0 commsize 1 cb mod rb rank comm size cal call mpi_type_commit adtype rb adisp rb rbs rb 1 cbs cb abl en rb 1 enddo call mpi _ type struct comm size ab twdtype rank ierr call mpi_type_commit twdtype rank do rb 0 comm size 1 cal enddo enddo deallocate adtype adisp abl en Scatter initial for the partitioning MPI _send gives opportunities to the MP Strategies
77. tool run mpi di ag with s lt remote node gt lt options gt where options include help Platform MPI User s Guide 95 Understanding Platform MPI Show the options to mpi di ag S lt remote node gt Connect to and diagnose this node s remote service Authenticates with theremoteserviceand returnstheremote authenticated user sname Authenticates with remote service and returns service status et lt echo siring gt Authenticates with the remote service and performs a simple echo test returning the string SyS Authenticates with the remote service and returns remote system information including node name CPU count and username ps username Authenticates with theremoteservice and lists processesrunningon theremotesystem If a username is included only that user s processes are listed dir lt path gt Authenticates with the remote service and lists the files for the given path This isa useful tool to determine if access to network shares are available to the authenticated user sdir lt path gt Same as dir but lists a single file No directory contents are listed Only the directory is listed if accessible kill lt pid gt Authenticates with remote service and terminates the remote process indicated by the pid The process is terminated as the authenticated user If the user does not have permission to terminate the indicated process the process is not terminated mpi di ag authentication o
78. using a shm segment created by mpi r un and alock to communicate with other ranks If no oversubscription occurs on the current CPU then lock the process to the Idom of that CPU If a rank is reserved on the current CPU find anew CPU based on least loaded free CPUs and lock the process to the Idom of that CPU Similar results can be accomplished using mpsched but the procedure outlined above is a more load based distribution and works well in psets and across multiple machines Platform M PI supports CPU binding with a variety of binding strategies see below The option cpu_bind is supported in appfile command line and sr un modes mpirun cpu_bind _mt v option v np 4 a out Where_mtimplies thread aware CPU binding v and v request verbose information on threads binding to CPUs and option is one of rank Schedule ranks on CPUs according to packed rank ID map_cpu Schedule ranks on CPUsin cyclic distribution through MAP variable mask_cpu Schedule ranks on CPU masks in cyclic distribution through MAP variable Il least loaded II Bind each rank to the CPU itis running on For NUMA based systems the following options are also available Idom Schedule ranks on Idoms according to packed rank ID cyclic Cyclic dist on each Idom according to packed rank ID block Block dist on each Idom according to packed rank ID rr round robin rr Same as cyclic but consider Idom load average fill Same as b
79. wait for the HPC job to finish before returning to the command prompt when starting a job through automatic job submittal feature of Platform M PI By default mpi r un automatic jobs will not wait This flag has no effect if used for an existing H PC job hpcblock Uses block scheduling to place ranks on allocated nodes N odes are processed in the order they were allocated by the scheduler with each node being fully populated up to the total number of CPUs before moving on to the next node Only valid when the hpc option is used Cannot be used with the f hostfile or hostlist options hpccluster lt headnode gt Specifies the headnode of the H PC cluster that should be used to run thejob Assumed to bethe local host if omitted hpccyclic Uses cyclic scheduling to place ranks on allocated nodes N odes are processed in the order they were allocated by the scheduler with one rank allocated per node on each cyclethrough thenodelist Thenodelist istraversed as many times as necessary to reach thetotal rank count requested Only valid when the hpc option is used Cannot be used with the f hostfile or hostlist options headnode lt headnode gt Indicates the head node to submit the mpi run job on Windows H PC If omitted local host is used This option can only be used as acommand line option when using the mpi run automatic submittal functionality hosts Allows you to specify a node list to Platform M P lon Windows H
80. was run with lO C npl nitializing 1000 Start computation Computation took 4 088211059570312E 02 R LOON Birra seconds In this C example each process writes to a separate file called iodatax wherex represents each process rank in turn Then the data in iodatax is read lt stdio h gt lt string h gt lt stdlib h gt lt mpi h gt include include include include define SIZE 65536 define FILENAME iodata Each process writes to separate file iodata and the process rank is append main argc argv Mak Anhe chat Faiou nt buf i rank nints len char filename Pl Pile wine Pl Status status Pl_Init amp argc amp argv PI Comm rank MPI COMM WORLD buf int wl oc SIZE nints S ZE sizeof int or i 0 i lt nints i buf i each process opens a separa ilename char malloc str sprintf filename s d FIL PI File_open MPI COMM SELF f PI MODE CREATE P INFO NULL amp PI File set_view fh MPI Off P _ NFO NULL Pl File write fh buf nints PI File close amp fh reopen the file and read th or i 0 i lt nints i buf i PI File _open MP COMM SELF f P MODE CREATE PI INFO NULL amp Pl_File_ set _view fh MPI Off MPI INFO NUL P _File_read fh buf nints PI File close amp fh check if the data read is C 206 Platform MPI User s Guide back s and reads them back The file name is Ge Wo Ti fl
81. wide file descriptors Thenumber of sockets used by Platform M PI can be reduced on some systems at the cost of performance by using daemon communication In this case the processes on a host use shared memory to send messages to and receive messages from the daemon The daemon in turn uses a socket connection to communicate with daemons on other hosts Using this option the maximum number of sockets opened by any Platform M PI process grows with the number of hosts used by theM PI job rather than the number of total ranks nections Ele Edt Yew Favorites Tools Advanced tiop Qw O F Psa P races F yh Loca woa Connocton 2 Enabled Tak Minicom Myrinet Adapter Local Area Connection 2 E LAN or High Speed internet Platform MPI User s Guide 145 Understanding Platform MPI To use daemon communication specify the commd option in thempi r un command After you set the commd option you can use the MPI_COM MD environment variable to specify the number of shared memory fragments used for inbound and outbound messages Daemon communication can result in lower application performance Therefore it should only be used to scale an application to alargenumber of ranks when it is not possible to increase the operating system file descriptor limits to the required values Resource usage of RDMA communication modes When using InfiniBand or GM some memory is pinned which means it is locked to physical memory and cannot be paged out The
82. xxx xxx xxx xxx Specifies the host IP address MPI MAX_REMSH MPI_MAX_REMSH N Platform M PI includes a start up scalability enhancement when using the f option to mpi r un This enhancement allows a large number of Platform M PI daemons mpid to be created without requiring mpi run to maintain a large number of remote shell connections When running with a very largenumber of nodes the number of remote shells normally required to start all daemons can exhaust available file descriptors To create the necessary daemons mpi r un uses the remote shell specified with M PI_ REM SH to create up to 20 daemons only by default This number can bechanged using theenvironment variableM PI_M AX_REM SH When thenumber of daemons required is greater than MPI_MAX_REMSH mpi run creates only MPI_MAX_REMSH number of remote daemons directly The directly created daemons then create the remaining daemons using an n ary tree where n is the value of MPI_ MAX_REMSH Although this process is generally transparent to the user the new start up requires that each nodein the cluster can usethe specified M PI_REM SH command e g rsh ssh to each node in the cluster without a password The value of M PI_M AX_REM SH is used ona per world basis Therefore applications that spawn a large number of worlds might need to use a small value for MPI_MAX_REMSH MPI_MAX_REMSH isonly relevant when using the f option to mpi run The default value is 20 MPI_NETADDR Allows
83. you invoke mpi r un use one of the following mpirun mpirun_options f appfile extra_args for _appfile bsub Ilsf_options pam mpi mpirun mpirun_options f appfile extra_args_for_appfile The extra_args for_appfile option is placed at the end of your command line after appfile to add options to each line of your appfile Caution Arguments placed after are treated as program arguments and are not processed by mpi r un Use this option when you want to specify program arguments for each line of the appfile but want to avoid editing the appfile For example suppose your appfile contains h voyager np 10 send receive argl arg2 h enterprise np 8 compute pi If you invoke mpi r un using the following command line mpirun f appfile arg3 arg4 arg5 Thesend_recei ve command line for machine voyager becomes send_receive argl arg2 arg3 arg4 args Thecomput e_pi command line for machine enterprise becomes compute pi arg3 arg4 arg5 When you usethe extra_args for_appfile option it must be specified at the end of the mpi run command line Setting remote environment variables To set environment variables on remote hosts use the e option in the appfile For example to set the variable M PIl_FLAGS h remote host e MP _FLAGS val np program args Assigning ranks and improving communication The ranks of the processes in MP _COMM_WORLD are assigned and sequentially ordered according to the order
84. 0 B6060BA Date Mon Apr 01 15 59 10 2002 Processes 2 User time 6 57 MPI time 93 43 Overhead 93 43 Blocking 0 00 lication Summary by Rank second k Proc CPU Time User Portion System Portion 0 0 040000 0 010000 25 00 0 030000 75 00 Platform MPI User s Guide 157 Profiling 1 0 030000 0 010000 33 33 0 020000 66 67 Rank Proc Wall Time User PI 0 0 U2GI35 0 008332 6 60 0 118003 93 40 1 0 126355 0 008260 6 54 0 118095 93 46 Rank Proc MPI Ti me Overhead 0 0 118003 0 118003 100 00 0 000000 0 00 1 0 118095 0 118095 100 00 0 000000 0 00 Routine Summary by Rank Rank Routine Statistic Calls Overhead ms Bl ocki ng ms 0 Pl _Bcast 1 5 397081 0 000000 Pl Finalize 1 1 238942 0 000000 PI Init 1 107 195973 0 000000 PI Reduce 1 4 171014 0 000000 1 Pl Beast 1 5 388021 0 000000 PI Finalize 1 1 325965 0 000000 PI Init 1 107 228994 0 000000 PI Reduce 1 4 152060 0 000000 essage Summary by Rank Pair SRank DRank Messages minsize maxsize bin Total bytes 0 1 1 4 4 4 1 0 64 4 1 0 1 8 8 8 1 0 64 8 158 Platform MPI User s Guide Blocking Profiling Using the profiling interface TheM PI profiling interface provides a mechanism by which implementors of profiling tools can collect performance information without access to the underlying M PI implementation source code Because Platform M PI provides several options for profiling your applications you might not
85. 00E CHKBUF stop MPI Wti me sprintf amp str strlen str d bytes 2f usec msg n nbytes stop start NLOOPS 2 1024 1024 if nbytes gt 0 Sprintf amp str strlen str d bytes 2f MB sec n nbytes nbytes 1024 1024 stop start NLOOPS 2 fflush stdout else if rank root 1l size E warm up loop ef partner root ror i e Oe a lt Se e a RECV 1 SEND 1 for i 0 lt NLOOPS i CLRBUF RECV 1000 i CHKBUF SETBUF SEND 2000 i MPI_Bcast str 1024 MPI_CHAR root MPI _COMM_WORLD if rank 0 Or mw Ys Str free obuf MPI _Finalize exit 0 ping_pong_ring c output Example output might look like this Host 0 ip 172 16 159 3 ranks 0 Host 1 ip 172 16 150 23 ranks 1 Host 2 ip 172 16 150 24 ranks 2 host 0 1 2 0 SHM IBAL BAL 1 BAL SHM I BAL 2 BAL IBAL SHM 0 mpiccp3 ping pong 1000000 bytes 1000000 bytes 1089 29 usec msg 1000000 bytes 918 03 MB sec 1 mpiccp4 ping pong 1000000 bytes 1000000 bytes 1091 99 usec msg 1000000 bytes 915 76 MB sec 2 mpiccp5 ping pong 1000000 bytes 1000000 bytes 1084 63 usec msg 1000000 bytes 921 97 MB sec The table showing SH M IBAL is printed because of the prot option print protocol specified in the mpi run command Platform MPI User s Guide 193 Example Applications It could show any of the following settings
86. 110 hostfile option 105 hostlist option 106 hosts assigning using LSF 70 multiple 75 option 114 i option 107 1 0 229 ibal 83 ibal option 103 ibv 84 ibv option 102 implement barrier 23 implement reduction 22 implied prun 231 prun mode 74 srun 232 srun mode 75 improve coding Platform MPI 167 256 Platform MPI User s Guide interhost communication 76 network performance 164 InfiniBand card failover 84 port failover 84 informational options 106 initialize MPI environment 15 installation 239 installation directory Linux 28 Windows 35 instrumentation ASCII profile 157 counter 156 multihost 78 output file 156 interconnects command line options 81 selection 81 241 selection examples 86 selection options 102 supported 6 testing 179 interruptible collectives 112 intra mix option 103 intra nic option 103 intra shm option 103 io c 184 iscached option 115 itapi option 103 J j option 107 job launcher options 104 job scheduler options 104 jobid option 114 L language interoperability 122 large message APIs 219 latency 16 163 167 launch spec options 104 launching ranks 238 LD_LIBRARY_PATH appending 118 Idom 166 libraries to link 239 licenses 237 installing on Linux 33 installing on Windows 47 Linux 32 merging on Linux 33 release 149 testing on Linux 33 testing on Windows 47 Windows 45 46 lightweight instrumentation 107 111 129 linking thread compliant library 59 Linux get
87. 17 Example Applications 218 Platform MPI User s Guide Large message APIs The current M PI standard allows the data transferred using standard API calls to be greater than 2 GB For example if you call MPI_Send that contains a count of 1024 elements that each havea size of 2049 KB the resulting message size in bytes is greater than what could be stored in a signed 32 bit integer Additionally some users working with extremely large data sets on 64 bit architectures need to explicitly pass a count that is greater than the size of a 32 bit integer The current M PI 2 1 standard does not accommodate this option Until the standards committee releases a new API that does Platform M PI providesnew A PIsto handlelargemessage counts Thesenew A Plsareextensionsto theM PI 2 1 standard and will not be portable across other M PI implementations These new APIs contain a trailing L For example to pass a 10 GB count to an MPI send operation MPI_SendL must be called not MPI_ Send Important These interfaces will be deprecated when official APIs are included in the MPI standard The other API through which large integer counts can be passed into Platform M PI calls is the Fortran autodouble i8interface which is also nonstandard This interface has been supported in previous Platform M PI releases but historically had the limitation that the values passed in must still fit in 32 bit integers because the large integer input arguments were ca
88. 3 on opte7 Hello world I m 2 of 3 on opte8 This uses TCP IP over the Elan subnet using the TCP option with the netaddr option for the Elan interface 172 22 x x user optel0 user bsub I n3 ext SLURM nodes 3 MPI_ROOT bin mpirun prot TCP netaddr 172 22 0 10 srun a out Job lt 59307 gt is submitted to default queue lt normal gt lt lt Waiting for dispatch gt gt lt lt Starting on Isfhost localdomain gt gt Host 0 ip 172 22 0 2 ranks 0 Host 1 ip 172 22 0 3 ranks 1 Host 2 ip 172 22 0 4 ranks 2 host 01 2 86 Platform MPI User s Guide Understanding Platform MPI 0 SHM TCP TCP 1 TCP SHM TCP 2 TCP TCP SHMHell o world m0 of 3 on opte2 Hello world I m1 of 3 on opte3 Hello world I m 2 of 3 on opte4 Elan interface user optel0 user sbin ifconfig eipO eipO Link encap Ethernet HWaddr 00 00 00 00 00 OF inet addr 172 22 0 10 Bcast 172 22 255 255 Mask 255 255 0 0 UP BROADCAST RUNNING MULTICAST MTU 65264 Metric 1 RX packets 38 errors 0 dropped 0 overruns 0 frame 0 TX packets 6 errors 0 dropped 3 overruns 0 carrier 0 collisions 0 txqueuel en 1000 RX bytes 1596 1 5 Kb TX bytes 252 252 0 b GigE interface user opte10 user sbin ifconfig ethO ethO Link encap Ethernet HWaddr 00 00 1A 19 30 80 inet addr 172 20 0 10 Bcast 172 20 255 255 Mask 255 0 0 0 UP BROADCAST RUNNING MULTICAST MTU 1500 Metric 1 RX packets 133469120 errors 0 dropped 0 overrun
89. 4 1000 41 82 41 91 41 86 0 18 8 1000 41 46 41 49 41 48 0 37 234 Platform MPI User s Guide gt O gt O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O mpirun Using Implied prun or srun 0 74 1 47 2 89 5 72 10 83 20 41 36 46 52 37 77 00 94 06 106 42 87 37 102 76 104 08 103 59 99 82 101 28 105 86 117 79 118 30 16 1000 41 19 41 27 41 21 32 1000 41 44 41 54 41 51 64 1000 42 08 42 17 42 12 128 1000 42 60 42 70 42 64 256 1000 45 05 45 08 45 07 512 1000 47 74 47 84 47 79 1024 1000 53 47 53 57 53 54 2048 1000 74 50 74 59 74 55 4096 1000 101 24 101 46 101 37 8192 1000 165 85 166 11 166 00 16384 1000 293 30 293 64 293 49 32768 1000 714 84 715 38 715 05 65536 640 1215 00 1216 45 1215 55 131072 320 2397 04 2401 92 2399 05 262144 160 4805 58 4826 59 4815 46 524288 80 9978 35 10017 87 9996 31 1048576 40 19612 90 19748 18 19680 29 2097152 20 36719 25 37786 09 37253 01 4194304 10 67806 51 67920 30 67873 05 8388608 5 135050 20 135244 61 135159 04 5222252222 Thanks for using PMB2 2 The Pallas team kindly requests that you give us as much feedback for PMB as possible It would be very helpful when you sent the output tables of your run s of PMB to pmb pallas com You might also add personal information institution motivation Platform MPI User s Guide 235 mpirun Using Implied prun or srun 0 for using PMB 0
90. 50 Windows 46 compiler options autodouble 56 18 56 r16 56 r8 56 show 50 DD64 58 32 and 64 bit library 58 compilers 239 default 50 compiling applications 50 Windows 36 completing Platform MPI 178 completion functions 113 completion routine 17 computation 22 compute_pi_spawnf 184 compute_pi f 183 configuration files 31 configure environment Linux 28 Windows environment 35 connectx 108 constructor functions contiguous 24 structure 24 vector 24 context communication 19 context switching 165 contiguous and noncontiguous data 23 contiguous constructor 24 count variable 18 21 counter instrumentation 128 156 cpu binding 60 cpu_bind 166 cpu_bind option 106 create appfile 75 ASCII profile 156 instrumentation profile 156 D d option 107 daemons multipurpose 78 number of processes 78 dbgspin option 108 dd option 108 DDE 121 170 debug Platform MPI 121 debuggers 170 debugging options 106 debugging Windows tutorial 173 default compilers 50 deferred deregistration 108 deregistration 150 derived data types 23 dest variable 18 20 determine group size 15 number of processes in communicator 17 rank of calling process 15 directory structure Windows 45 download Platform MPI 237 dtype variable 18 21 23 dump shmem configuration 125 dynamic processes 147 eadb 170 edde 121 170 egdb 121 170 elan 85 elan option 103 environment control options 109 environment variables MPI_2BCOPY 127 M
91. A rank binding on a clustered system uses the number of ranks and the number of nodes combined with the rank count to determine CPU binding Cyclic or blocked launch is taken into account Platform MPI User s Guide 61 Understanding Platform MPI On acell based system with multiple users the LL strategy is recommended rather than RANK LL allows the operating system to schedule computational ranks Then the cpu_bind capability locks the ranks to the CPU as selected by the operating system scheduler 62 Platform MPI User s Guide Understanding Platform MPI MPICH object compatibility for Linux The MPI standard specifies the function prototypes for M PI functions but does not specify types of M PI opaque objects like communicators or the values of M PI constants As a result an object file compiled using one vendor s M PI generally does not function if linked to another vendor s M PI library There are some cases where such compatibility would be desirable For instance a third party tool such as Intel trace collector might only be available using the M PICH interface To allow such compatibility Platform M PI includesalayer of M PICH wrappers Thisprovidesan interface identical to M PICH 1 2 5 and translates these calls into the corresponding Platform M PI interface This MPICH compatibility interface is only provided for functions defined in M PICH 1 2 5 and cannot be used by an application that calls functions outside the scope of MPICH 1
92. Dump to a file c E if comm_rank eq 0 then c print Dumping to adi out open 8 file adi out g write 8 array c close 8 status keep c endi f c c Free the resources iL Fe e nr mpi_recv array 1 twdtype src src 0 mpi_ seconds Example Applications lel if the to compute the row of the block that the next block being computed depends on 1 cdtype cb dest type ncb src 0 rows and columns are type rb dest i SC Oy comm world mpi_send array 1 twdtype comm_rank 0 0 mpi_comm world Platform MPI User s Guide 203 Example Applications do rank 0 comm size 1 call mpi_type_free twdtype rank ierr enddo do bl k 0 comm size 1 call mpi_type_free rdtype blk ierr call mpi_type_free cdtype blk ierr enddo deallocate rbs rbe cbs cbe rdtype cdtype twdt ype Finalize the MPI system call mpi_finalize ierr TEREE EEE EEEE EEEE EERE ER EEE RARER EEEE EE RRA RRR EEEE EEE EEE EEEE E RRA EEK EEEE EE C e E a E a e e G Subroutine blockasgn subs sube bl ockcnt nth blocks bl ocke This subroutine is given a range of subscript and the total number of blocks in which the range is to be divided assigns a subrange to the caller hat is n th member of the blocks mplicit none nteger subs k in subscript start nteger sube l ilar Subscript end integer blockcnt in block count integer nth agian my block begin from 0 nteger blocks out assign
93. E PRECISION MPI SUM 0 MPI COMM WORLD ierr C C Process 0 prints the result C if myid eq 0 then write 6 97 pi abs pi P 25DT 97 format pi is approximately F18 16 Error ise 9 Pie 16 endi f 194 Platform MPI User s Guide Example Applications call MPI _FINALIZE ierr stop end compute_pi output The output from running the compute _pi executable is shown below The application was run with np 10 Process 0 of 10 is alive Process 10 10 is alive Process 2 of 10 is alive Process 3 of 10 is alive Process 40 10 is alive Process 5 of 10 is alive Process 6 of 10 is alive Process 70 10 is alive Process 8 of 10 is a n Process 9 of 10 is a pi is approximately 3 1416009869231249 Error is 0 0000083333333318 master_worker f90 In this Fortran 90 example a master task initiates numtasks 1 number of worker tasks The master distributes an equal portion of an array to each worker task Each worker task receives its portion of the array and sets the value of each element to the element s index 1 Each worker task then sends its portion of the modified array back to the master program array manipulation include mpif h integer kind 4 status MPI STATUS SIZE integer kind 4 parameter ARRAYSIZE 10000 MASTER 0 integer kind 4 numtasks numworkers taskid dest index integer kind 4 arraymsg indexmsg source chunksize int4 real4 real kind 4 zz
94. Enterprise Distribution OFED The open source software stack developed by OFA that provides a unified solution for the two major RDM A fabric technologies InfiniBand and iW ARP also known as RDMA over Ethernet parallel efficiency An increase in speed in the execution of a parallel application point to point communication Communication where data transfer involves sending and receiving messages between two processes This is the simplest form of data transfer in a message passing model polling M echanism to handleasynchronousevents by actively checking to determineif an event has occurred process Address space together with a program counter a set of registers and a stack Processes can be single threaded or multithreaded Single threaded processes can only perform onetask at atime M ultithreaded processes can perform multiple tasks concurrently as when overlapping computation and communication race condition Platform MPI User s Guide 249 Glossary Situation in which multiple processes vie for the same resource and receive it in an unpredictable manner Race conditions can lead to cases where applications do not run correctly from one invocation to the next rank Integer between zero and number of processes 1 that defines the order of a process in a communicator Determining the rank of a process is important when solving problems wherea master process partitions and distributes work to slave processes The slav
95. H 154 no clobber 129 156 nodex option 114 nonblocking communication 17 19 buffered mode 19 MPI_lbsend 19 MPI_lIrecv 19 MPI_Irsend 19 MPI_Isend 19 MPI_Issend 19 ready mode 19 receive mode 19 standard mode 19 synchronous mode 19 nonblocking send 20 noncontiguous and contiguous data 23 nonportable code uncovering 125 nopass option 115 np option 106 number of MPI library routines 15 0 object compatibility 63 ofed 82 84 102 one sided option 109 op variable 23 OPENMP block partitioning 200 operating systems supported 6 optimization report 124 option 109 options password authentication 115 Windows 2003 XP 115 Windows CCP 113 P p option 107 package option 115 packing and unpacking 23 parent process 20 pass option 115 PATH setting 28 pcmpi conf 118 performance communication hot spots 77 latency bandwidth 162 163 permanent license 239 ping_pong_clustertest c 184 ping_pong_ring c 179 183 ping_pong c 183 Platform MPI User s Guide 259 pk option 115 Platform MPI change behavior 121 completing 178 debug 169 jobs running 79 specify shared memory 136 starting 73 starting Linux 174 utility files 45 platforms supported 6 PM PI prefix 159 point to point communications overview 16 portability 15 prefix for output file 156 MPI 159 PMPI 159 problems external input and output 177 message buffering 176 performance 162 167 run time 177 shared memory 176 with Windows build 174 process multith
96. I ERR BUFFER exii buf char unsigned long buf ALIGN 1 amp SAGN 1 if rank gt 0 buf 524288 memset buf 0 nbytes es Ping pong for root 0 root lt size root if rank root partner root 1 size sprintf str d S ping pong d bytes n root myhost nbytes e warm up loop ror hk s We i lt He iaiki 4 SEND 1 RECV 1 P timing oop za start MPI _Wti me for i s 0 T lt MOORS tee SETBUF SEND 1000 i CLRBUF RECV 2000 i CHKBUF stop MPI _Wti me sprintf amp str strlen str d bytes 2f usec msg n nbytes stop start NLOOPS 2 1024 1024 ly ioyees gt h y sprin amp str strlen str d bytes 2f MB sec n nbytes nbytes 1024 1024 stop stari if INEOORS i Zi fflush stdout else if rank root 1 size Platform MPI User s Guide 189 Example Applications l warm up loop s partner root for i 6 e i lt be iar i RECV 1 SEND 1 for 1 2 0 i lt NLO OPSe ise 1 CLRBUF RECV 1000 i CHKBUF SETBUF SEND 2000 i MPI_Bcast str 1024 MPI_CHAR root MPI _COMM_WORLD if rank 0 Orit is Stine free obuf MPI _Finalize exit 0 ping_pong_ring c output Example output might look like this gt Host 0 ip 192 168 9 10 ranks 0 gt Host 1 ip 192 168 9 11 ranks 1 gt Host 2 ip 192 168 9 12 ranks
97. I User s Guide Debugging and Troubleshooting All remote hosts are listed in your rhosts file on each machine and you can r e ms h to the remote machines Thempi r un command has the ck option which you can use to determine whether the hosts and programs specified in your M PI application are available and whether there are access or permission problems M PI r ems h can be used to specify other commands to be used such ass sh instead of r ems h Application binaries areavailableon thenecessary remotehostsand areexecutableon thosemachines The sp option is passed to mpi r un to set the target shell PATH environment variable You can set this option in your appfile The cshrc filedoes not containt t y commandssuch ass t t y if you areusinga bi n cs h based shell Starting on Windows When starting multihost applications using Windows H PCS Don t forget the ccp flag UseUNC paths for your file names Drives are usually not mapped on remote nodes If using the AutoSubmit feature make sure you arerunning from a mapped network drive and don t specify file paths for binaries Platform M PI converts the mapped drive to aUNC path and set MPI_WORKDIR to your current directory If you are running on a local drive Platform M PI cannot map this to aUNC path Don t submit scripts or commands that require a command window These commands usually fail when trying to change directory to a UNC path Don t forget to use quotation marks for file
98. I applications depend on knowing the number of processes and the process rank in a given communicator There are several communication management functions two of the more widely used areM PI_Comm_sizeand MPI_Comm_rank The process rank is a unique number assigned to each member process from the sequence 0 through size 1 where size is the total number of processes in the communicator To determine the number of processes in a communicator use the following syntax MPI_Comm_size MPI_Comm comm int size where comm Represents the communicator handle size Represents the number of processes in the group of comm To determine the rank of each process in comm use MPI_Comm_rank MPI_Comm comm int rank where comm Represents the communicator handle rank Represents an integer between zero and size 1 A communicator is an argument used by all communication routines The C code example displays the use of MPI_Comm_dup one of the communicator constructor functions and M PI_Comm_free the function that marks a communication object for deallocation Sending and receiving messages There are two methods for sending and receiving data blocking and nonblocking In blocking communications the sending process does not return until the send buffer is available for reuse In nonblocking communications the sending process returns immediately and might have started the message transfer operation but not necessarily completed it The ap
99. IBRARY gt option A default g ABI compatible library is provided for each architecture except Alpha Non g ABI compatible C compilers The C library provided by Platform MPI 1 i bmpi CC a was built with g If you are using a C compiler that is not g ABI compatible eg Portland Group Compiler you must build your own i bmpi CC a andincludethisin your build command Thesourcesand M akefilesto build an appropriate library arelocated in opt platform mpi lib ARCH mpiCCsrc To build aversion of i bmpi CC a and includeit in the builds using mpi CC do the following Note This example assumes your Platform MPI installation directory is opt pl atform_mpi It also assumes that the pgCC compiler is in your path and working properly 1 Copy the file needed to build i bmpi CC a into a working location setenv MPI_ROOT opt platform_mpi cp r MPI_ROOT lib linux_amd64 mpiCCsrc cd mpiCCsrc 2 Compileand create the i bmpi CC a library make CXX pgCC MPI_ROOT MPI_ROOT pgCC c intercepts cc l opt platform mpi include DHPMP_BUILD_CXXBI NDING PGCC W 0155 Nova_start seen intercepts cc 33 PGCC x86 Li nux x86 64 6 2 3 compilation completed with warnings pgCC c mpicxx cc I opt platform_mpi include DHPMP_ BUILD CXXBINDING ar rcs li bmpiCC a intercepts o mpicxx o 3 Using a test case verify that the library works as expected mkdir test cd test cp MPI_ROOT help sort C MPI
100. Init aborts off Specifies that counter instrumentation is initially turned off and only begins after all processes collectively call M PIH P_Trace_on api Theapi option to M PI_INSTR collects and prints detailed information about the M PI Application Programming Interface API This option prints a new section in the instrumentation output file for each M PI routine called by each rank It contains the MPI datatype and operation requested along with message size call counts and timing information The following is sample output from i lt file gt api on the examplecompute_pi f TITTLE Titer iti Titiritiiiitititiieititititeiii i api api Detailed MPI Reduce routine information api api AHHH HHEE HEHEHE BHEE HE HEE EEE HH HH api api we eee eee ee eee eee api Rank MPI Op MPI _ Datatype Num Calls Contig Non Contig Message Sizes Total Bytes api ce eee eee eee api R 0 sum fortran double precision 1 1 0 8 8 8 api api Num Calls Message Sizes Total Bytes Ti me ms Bytes Time s api Pe ee eee ee eee eee bee eee eee eee eee api i 0 64 8 1 0 008 api api api ee eee eee eee api Rank MPI Op MPI_ Datatype Num Calls Contig Non Contig Message Sizes Total Bytes api we ee eee eee eee api R 1 sum fortran double precision 1 1 0 8 gt B 8 api api Num Calls Message Sizes Total Bytes Ti me ms Bytes Time s api Sc seerereerrrerres api 1 0 64 8 0 0 308 api api Lightweight instrumentation can be turned on by using e
101. Linux Often clusters might have Ethernet and someform of higher speed interconnect such as nfiniBand This section describes how to usethe ping_pong_ring c example program to confirm that you can run using the desired interconnect Running atest like this especially on a new cluster is useful to ensure that the relevant network drivers are installed and that the network hardware is functioning If any machine has defective network cards or cables this test can also be useful at identifying which machine has the problem To compile the program set the M PI_ROOT environment variable not required but recommended to avalue such as opt platform mpi for Linux and then run export MPI_CC gcc using whatever compiler you want MPI_ROOT bin mpicc o pp x MPI_ROOT help ping_pong_ring c Although mpi cc performs a search for the compiler to use if you don t specify MPI_CC itis preferable to be explicit If you havea shared filesystem itis easiest to put theresultingpp x executable there otherwise you must explicitly copy it to each machinein your cluster Use the start up relevant for your cluster Y our situation should resemble one of the following If no job scheduler such assrun prun or LSF is available run a command like this MPI_ROOT bin mpirun prot hostlist hostA hostB hostZ pp x You might need to specify the remote shell command to use the default is ssh by setting the MPI_REMSH environment variable For example
102. M PI at http www lam mpi org Intel Trace Collector A nalyzer product information formally known as Vampir at http www intel com software products cluster tcollector index htmand http www intel com software products cluster tanalyzer index htm LSF product information at http www platform com 10 HP Windows H PC Server 2008 product information at http www microsoft com hpc en us product information aspx 10 Platform MPI User s Guide About This Guide Credits Platform M PI is based on M PICH from Argonne National Laboratory and LAM from the U niversity of Notre Dameand Ohio Supercomputer Center Platform MPI includes ROM IO a portable implementation of M PI 1 0 developed at the Argonne National Laboratory Platform MPI User s Guide 11 About This Guide 12 Platform MPI User s Guide Introduction Platform MPI User s Guide 13 Introduction The message passing model Programming models are generally categorized by how memory is used In the shared memory model each process accesses a shared address space but in the message passing model an application runsasa collection of autonomous processes each with its own local memory In the message passing model processes communicate with other processes by sending and receiving messages W hen data is passed in a message the sending and receiving processes must work to transfer the data from the local memory of one to the local memory of the other M essag
103. M nodes 4 MPI_ROOT bin mpirun stdio bnone f appfile pingpong Job lt 369848 gt is submitted to default queue lt normal gt lt lt Waiting for dispatch gt gt lt lt Starting on Isfhost ocal domain gt gt fopt platform mpi bin mpirun unset MPI_USESRUN opt platform_mpi bin mpirun srun pallas x npmin 4 pingpong TCP environment variables MPL_TCP_CORECVLIMIT Theinteger valueindicatesthenumber of simultaneous messageslarger than 16K B that can betransmitted to a single rank at once via TCP IP Setting this variable to a larger value can allow Platform M PI to use more parallelism during its low level message transfers but can greatly reduce performance by causing switch congestion Setting M PI_TCP_CORECVLIMIT to zero does not limit the number of simultaneous messages a rank can receive at once The default value is 0 MPI_SOCKBUFSIZE Specifies in bytes the amount of system buffer space to allocate for sockets when using TCP IP for communication Setting MPI_SOCKBUFSIZE resultsin callstosetsockopt SOL SOCKET 142 Platform MPI User s Guide Understanding Platform MPI SO_SNDBUF andsetsockopt SOL_SOCKET SO RCVBUF If unspecified the system default which on many systems is 87380 bytes is used Elan environment variables MPI_USE_LIBELAN By default when Elan isin use the Platform M PI library uses Elan s native collective operations for performing M PI_Bcast and MPI_ Barrier operationson
104. MPI_COMM WORLD amp Status lerr ca PI _Recv result index chunksize MPI REAL MASTER 0 amp MP _COMM_WORLD status ierr do index index chunksize 1 resulti ef i end do ca Pl Send index 1 MPI INTEGER MASTER 1 MPI COMM_WORLD ierr ca Pl Send result index chunksize MPI REAL MASTER 1 amp MPI COMM WORLD ierr end ca P _Finalize ierr end program array _mani pul ation master_worker output The output from running the master_worker executable is shown below The application was run with np 2 correct results cart C This C program generates a virtual topology The class Node represents anodein a 2 D torus Each process is assigned a node or nothing Each node holds integer data and the shift operation exchanges the data with its neighbors Thus north east south west shifting returns the initial data include lt stdio h gt include lt mpi h gt define NDI MS 2 ypedef enum NORTH SOUTH EAST WEST Direction A node in 2 D torus class Node private P _ Comm comm in di ms NDI MS coords NDI MS in grank Irank in data public ode void Node void void profile void void print void void shift Direction W A constructor Node Node void int i nnodes periods NDI MS Create a balanced distribution MPI Comm size MPI COMM WORLD amp nnodes for i 0 i lt NDIMS i dims i 0 MPI _Dims_create nnode
105. MPI_COMM_WORLD sized communicators To change this behavior set M PI_USE_LIBELAN to false or 0 If changed these operations are implemented using point to point Elan messages To turn off export MPI_USE_LIBELAN 0 MPI_USE_LIBELAN_SUB The use of Elan s native collective operations can be extended to include communicators that are smaller than MPI_COMM_WORLD by setting theM PI_USE_LIBELAN_SUB environment variable to a positive integer By default this functionality is disabled because libelan memory resources are consumed and can eventually cause run time failures when too many subcommunicators are created export MPI_USE_LIBELAN_SUB 10 MPI_ELANLOCK By default Platform M PI only provides exclusive window locks via Elan lock when using the Elan interconnect To usePlatform M PI shared window locks theuser must turn off Elan lock and usewindow locks via shared memory In this way exclusive and shared locks are from shared memory To turn off Elan locks set MPI_ELANLOCK to zero export MPI_ELANLOCK 0 Windows HPC environment variables MPI SAVE TASK OUTPUT Saves the output of the scheduled HP CCP Ser vi ce task to afileuniquefor each node This option is useful for debugging startup issues This option is not set by default MPL_FAIL_ON_TASK_FAILURE Sets the scheduled job to fail if any task fails The job will stop execution and report as failed if a task fails The default is set to true 1 To turn off set to 0 MPL C
106. N Tn a Ns break else define SETBUF define CLRBUF define CHKBUF endifint main argc argv int argc char e oe int ie ifdef CHECK in j fendi f double Start stop intn bytes 0 in rank size int root int partner P Status status char otiia obluit char myhost MP MAX PROCESSOR NAME in len char Sit LO Aes MPI _Init amp argc amp argv PI Comm rank MPI COMM WORLD amp rank PI Comm size MPI COMM WORLD amp size PI Get_processor_name myhost amp len if size lt 2 if rank printf rping must have two processes n Pl _Finalize exit 0 nbytes argc gt 1 atoi argv 1 0 if nbytes lt 0 nbytes 0 Page align buffers and displace themin the cache to avoid collisions e buf char malloc nbytes 524288 ALIGN 1 obuf buf if buf 0 PI Abort MP COMM WORLD MPI _ ERR BUFFER eniti buf char unsigned long buf ALIGN 1 amp ALIGN 1 if rank gt 0 buf 524288 memset buf 0 nbytes Ping pong e for root 0 root lt size root if rank root partner root 1 size Sprintf str d s ping pong d bytes n root myhost nbytes ii warm up loop uy for i S i lt Se ay 4 SEND 1 RECV 1 qh timing loop 192 Platform MPI User s Guide Example Applications start MPI_Wti me for i 0 i lt NLOOPS i SETBUF SEND 1000 i CLRBUF RECV T2 0
107. NSWER Platform M PI uses several mechanisms to clean up job files All processes in your application must call M PI_ Finalize 1 When acorrect Platform M PI program that is one that calls MPI_ Finalize exits successfully the root host deletes the job file 2 If you usempi run it deletes thejob filewhen the application terminates whether successfully or not 3 When an application calls M PI_A bort M PI_A bort deletes the job file 4 Ifyou usempi j 0b j to get moreinformation on a job and the processes of that job have exited mpi j ob issues a warning that the job has completed and deletes the job file QUESTION My MPI application hangs at MPI_ Send Why ANSWER Deadlock situations can occur when your code uses standard send operations and assumes buffering behavior for standard communication mode Do not assume message buffering between processes because the M PI standard does not mandatea buffering strategy Platform M PI sometimes uses buffering for MPI_ Send and MPI_Rsend but it depends on message size and is at the discretion of the implementation QUESTION How can tell if the deadlock is because my code depends on buffering ANSWER To quickly determine whether the problem is dueto your code being dependent on buffering set thez option for MPI_FLAGS MPI_FLAGSmodifies the general behavior of Platform M PI and in this caseconvertsM PI_SendandMPI_Rsendcallsin your codeto M PI_Ssend without you needingto rewrite your cod
108. OPY_LIBHPC Controls when mpi run copies i bhpc dl 1 to the first node of HPC job allocation Values 0 Don t copy 1 default Use cached i bhpc on compute node 2 Copy and overwrite cached version on compute nodes Platform MPI User s Guide 143 Understanding Platform MPI Rank identification environment variables Platform M PI sets several environment variables to let the user access information about the M PI rank layout prior to calling MPI_Init These variables differ from the others in this section in that the user doesn t set these to provide instructions to Platform M PI Platform M PI sets them to give information to the user s application HPMPI 1 This is set so that an application can conveniently tell if it is running under Platform MPI MPI_NRANKS This is set to the number of ranks in the M PI job MPI_RANKID This is set to the rank number of the current process MPI_LOCALNRANKS This is set to the number of ranks on the local host MPI_LOCALRANKID This is set to the rank number of the current process relative to the local host 0 MPI_LOCALNRANKS 1 These settings are not available when running under sr un or pr un However similar information can be gathered from variables set by those systems such as SLURM_NPROCS and SLURM_PROCID 144 Platform MPI User s Guide Understanding Platform MPI Scalability Interconnect support of MPI 2 functionality Platform M PI has been tested on InfiniBand clu
109. OS 5 About This Guide Platform Interconnect Operating System Intel Itanium based TCP IP Red Hat Enterprise Linux AS 4 0 and 5 0 SUSE Linux Enterprise Server 9 and 10 CentOS 5 Windows HPCS QsNet Elan4 Red Hat Enterprise Linux AS 4 0 and 5 0 SUSE Linux Enterprise Server 9 and 10 CentOS 5 InfiniBand Red Hat Enterprise Linux AS 4 0 and 5 0 SUSE AMD Opteron based Myrinet GM 2 and MX OFED 1 0 1 1 1 2 1 3 1 4 uDAPL 1 1 1 2 2 0 QLogic PSM NIC Version QHT7140 QLE7140 Driver PSM 1 0 2 2 1 2 2 TCP IP Myrinet GM 2 and MX InfiniBand QsNet Elan4 OFED 1 0 1 1 1 2 1 3 1 4 uDAPL 1 1 1 2 2 0 QLogic PSM NIC Version QHT7140 QLE7140 Driver PSM 1 0 2 2 1 2 2 Linux Enterprise Server 9 and 10 Cent OS 5 Windows HPCS Red Hat Enterprise Linux AS 4 0 and 5 0 SUSE Linux Enterprise Server 9 and 10 CentOS 5 Red Hat Enterprise Linux AS 4 0 and 5 0 SUSE Linux Enterprise Server 9 and 10 CentOS 5 Red Hat Enterprise Linux AS 4 0 and 5 0 SUSE Linux Enterprise Server 9 and 10 CentOS 5 Red Hat Enterprise Linux AS 4 0 and 5 0 SUSE Linux Enterprise Server 9 and 10 CentOS 5 Red Hat Enterprise Linux AS 4 0 and 5 0 SUSE Linux Enterprise Server 9 and 10 CentOS 5 Red Hat Enterprise Linux AS 4 0 and 5 0 SUSE Linux Enterprise Server 9 and 10 CentOS 5 Red Hat Enterprise Linux AS 4 0 and 5 0 SUSE Linux Enterprise Server 9 and 10 CentOS 5 Red Hat Enterprise Linux AS 4 0 and 5 0 SUSE
110. PC Ranks are scheduled according to the host list The nodes in the list must bein the job allocation or a scheduler error occurs The Platform M PI program MP1 ROOT bin mpi _nodes exe returns a string in the proper hosts format with scheduled job resources jobid lt job id gt Schedules a Platform M PI job as a task to an existing job on Windows H PC It submits thecommandasasingleCPU mpi r un task to theexisting job indicated by the parameter job id This option can only be used as acommand line option when using the mpi run automatic submittal functionality nodex In addition to hpc indicates that only one rank is to be used per node regardless of the number of CPUs allocated with each host This flag is used on Windows HPC 114 Platform MPI User s Guide Understanding Platform MPI Windows 2003 XP The following are specific mpi r un command line options for Windows 2003 XP users package lt package name gt pk lt package name gt When Platform M PI authenticates with the Platform M PI Remote Launch service it authenticates using an installed Windows security package eg Kerberos NTLM N egotiate etc By default Platform M PI negotiates the packageto usewith theservice and no interaction or package specification is required by the user If a specific installed Windows security package is preferred use this flag to indicate the security package on the client This flag is rarely necessary because the clien
111. PCS using appfile mode Create an appfile such as this h hostA np 1 node share path to pp x h hostB np 1 node share path to pp x h hosiC np 1 node share path to pp x Submit the command to the scheduler using Automatic scheduling from a mapped share drive gt MPI_ROOT bin mpirun ccp prot f appfile gt MPI_ROOT bin mpirun ccp prot f appfile 1000000 If running on Windows H PCS using automatic scheduling Submit the command to thescheduler but includethetotal number of processes needed on the nodes as the np command This is not the rank count when used in this fashion Also include the nodexflag to indicate only one rank node Assuming 4 CPUs nodes in this cluster the command would be gt MPI_ROOT bin mpirun ccp np 12 nodex prot ping_ping_ring exe gt MPI_ROOT bin mpirun ccp np 12 nodex prot ping_ping_ring exe 1000000 If running on Windows 2003 XP using appfile mode Create an appfile such as this h hostA np 1 node share path to pp x h hostB np 1 node share path to pp x h hosiC np 1 node share path to pp x Platform MPI User s Guide 181 Debugging and Troubleshooting Submit the command to the scheduler using Automatic scheduling from a mapped share drive gt MPI_ROOT bin mpirun ccp prot f appfile gt MPI_ROOT bin mpirun ccp prot f appfile 1000000 In each case above the first mpi r un command uses 0 bytes per message and verifies latency The second
112. PI 66 Platform MPI User s Guide Understanding Platform MPI Running applications on Linux This section introduces the methods to run your Platform M PI application on Linux Using an mpi run method is required The examples below demonstrate six basic methods For all thempi r un command line options refer to the mpi r un documentation Platform M PI includes mpi32 and mpi64 optionsfor thelaunch utility mpi r un on Opteron and Intel64 Use these options to indicate the bitness of the application to be invoked so that the availability of interconnect libraries can be correctly determined by the Platform M PI utilities mpi run and mpi d The default is mpi64 You can use one of six methods to start your application depending on what the system you are using Usempi run with the np option and thename of your program For example MPI_ROOT bin mpirun np 4 hello_world starts an executable file named he o_wor d with four processes This is the recommended method to run applications on a single host with a single executable file Usempi run with an appfile For example MPI_ROOT bin mpirun f appfile where f appfile specifies a text file appfile that is parsed by mpi r un and contains process counts and alist of programs Although you can use an appfile when you run a single executable file on a single host it is best used when a job isto be run across a cluster of machines that does not have a dedicated launching method su
113. PI license file is 46 Platform MPI User s Guide Getting Started MPIROOT licenses If the license must be placed in a location that would not be found by the above search the you can set the environment variable LM _LICENSE_FILE to specify the location of the license file Installing license files A valid license file contains the system host ID and the associated license key License files can be named li cense dat or any name with extension of i c For example mpi i c The license file must be copied to theinstallation directory defaultC Program Files x86 Platform MPI licenses on all run time systems and to the license server The command to run the license server is MPI_ROOT bin licensing lt i86_n3 gt lmgrd c mpi lic License testing To check for a license build and run the hello_world program in MP _ ROOT help hello_world c If your system is not properly licensed you will receive the following error message MPI BUG Valid MPI license not found in search path Platform MPI User s Guide 47 Getting Started 48 Platform MPI User s Guide Understanding Platform MPI This chapter provides information about the Platform M PI implementation of M PI Platform MPI User s Guide 49 Understanding Platform MPI Compilation wrapper script utilities Platform M PI provides compilation utilities for the languages shown in the following table In general if a specific compiler is neede
114. PI routines MPI routine Description MPI Init Initializes the MPI environment MPI Finalize Terminates the MPI environment MPI_Comm_rank Determines the rank of the calling process within a group MPI_Comm_ size Determines the size of the group MPI_Send Sends messages MPI_Recv Receives messages You must call M PI_Finalizein your application to conform to the M PI Standard Platform M PI issues a warning when a process exits without calling MPI_ Finalize Caution Platform MPI User s Guide 15 Introduction Do not place code before MPI_Init and after MPI_ Finalize Applications that violate this rule are nonportable and can produce incorrect results Asyour application growsin complexity you can introduce other routines from thelibrary For example MPI_Bcastisan often used routine for sending or broadcasting data from one process to other processes in asingle operation Use broadcast transfers to get better performance than with point to point transfers The latter use M PI_Send to send data from each sending process and M PI_Recv to receive it at each receiving process The following sections briefly introducethe concepts underlying M PI library routines For more detailed information see M PI A M essage Passing Interface Standard Point to point communication Point to point communication involves sending and receiving messages between two processes This is the simplest form of data transfer in a message passing model and is d
115. PI_BIND_MAP 126 MPI_CC50 MPI_COMMD 131 MPI_CPU_AFFINITY 126 MPI_CPU_SPIN 126 MPI_CXX 50 MPI_DLIB_FLAGS 127 MPI_ERROR LEVEL 128 MPI_F7750 MPI_F90 50 MPI_FLAGS 121 MPI_FLUSH_FCACHE 126 MPI_GLOBMEMSIZE 136 MPI_IB_CARD_ORDER 134 MPI_IB_MULTIRAIL 131 MPI_IB_PKEY 134 MPI_IB_ PORT GID 132 MPI_IBV_QPPARAMS 135 MPI_IC_ORDER 130 MPI_IC_SUFFIXES 131 MPI_INSTR 128 156 MPI_LOCALIP 138 MPI_MAX_REMSH 138 MPI_MAX_WINDOW 127 MPI_MT_FLAGS 125 MPI_NETADDR 138 MPI_NO_MALLOCLIB 136 MPI_NOBACKTRACE 128 MPI_PAGE_ALIGN_MEM 136 MPI_PHYSICAL_MEMORY 136 MPI_PIN_ PERCENTAGE 137 MPI_RANKMEMSIZE 136 MPI_REMSH 138 MPI_ROOT 126 MPI_VAPI_QPPARAMS 135 MPI_WORKDIR 126 MPIRUN_OPTIONS 121 NLSPATH 154 setting in appfiles 76 setting in pcmpi conf file 118 setting on Linux 118 setting on Windows 119 setting with command line 76 118 TOTALVIEW 130 error checking enable 125 error conditions 178 ewdb 121 example applications cart C 196 io c 206 ping_pong_ring c 187 exdb 121 external input and output 177 F foption 105 failure detection 112 failure recover 111 file descriptor limit 177 Platform MPI User s Guide 255 Fortran 90 50 examples master_worker f90 195 functions MPI 57 G gather 20 GDB 121 170 getting started Linux 27 Windows 27 35 gm 85 gm option 102 gprof on HP XC 123 H h option 106 ha option 109 header files 30 45 headnode option 114 hello_world c 184 help option 106 highly available infrastructure
116. PI_Comm_set_errhandler function When an error is detected in a communication the error class MPI_ERR_EXITED is returned for the affected communication Shared memory is not used for communication between processes Only IBV and TCP aresupported This mode cannot be used with the diagnostic library Clarification of the functionality of completion routines in high availability mode Requests that cannot be completed because of network or process failures result in the creation or completion functionsreturningwith theerror codeM PI_ ERR_EXITED When waitingor testing multiple requests using M PI_Testany MPI_Testsome MPI_Waitany or MPI_Waitsome a request that cannot be completed because of network or process failures is considered a completed request and these routines return with the flag or outcount argument set to non zero If some requests completed successfully and some requests completed because of network or process failure the return value of the routineisM PI_ERR_IN_STATUS The status array elements contain MPI_ERR_EXITED for those requests that completed because of network or process failure Important When waiting on a receive request that uses MP _ANY_SOURCE on an intracommunicator the request is never considered complete due to rank or interconnect failures because the rank that created the receive request can legally match it For intercommunicators after all processes in the remote group are unavailable the request is c
117. PI_SUM or MPI_PROD in op as described in the M PI_Reducesyntax below To implement a reduction use MPI_Reduce void sendbuf void recvbuf int count MPI_Datatype dtype MPI_Op op int root MPI_Comm comm where sendbuf Specifies the address of the send buffer recvbuf Denotes the address of the receive buffer 22 Platform MPI User s Guide Introduction count Indicates the number of elements in the send buffer dtype Specifies the datatype of the send and receive buffers op Specifies the reduction operation root Indicates the rank of the root process comm Designates the communication context that identifies a group of processes For examplecompute_pi f usesMPI_REDUCE to sum the elements provided in the input buffer of each processin MPI _ COMM WORLD using MPI_SUM and returns the summed value in the output buffer of the root process in this case process 0 Synchronization Collective routines return as soon as their participation in a communication is complete However the return of thecalling process doesnot guarantee that thereceiving processes havecompleted or even started the operation To synchronizetheexecution of processes call M PI_Barrier MPI_ Barrier blocks the calling process until all processes in the communicator have called it This is a useful approach for separating two stages of a computation so messages from each stage do not overlap To implement a barrier use MPI_Barrier MPI_Comm co
118. Platform MPI User s Guide Platform MPI Version 8 0 Release date June 2010 Last modified June 17 2010 Platform seit ll Copyright We d like to hear from you Document redistribution and translation Internal redistribution Trademarks Third party license agreements Third party copyright notices 1994 2010 Platform Computing Inc Although the information in this document has been carefully reviewed Platform Computing Corporation Platform does not warrant it to be free of errors or omissions Platform reserves the right to make corrections updates revisions or changes to the information in this document UNLESS OTHERWISE EXPRESSLY STATED BY PLATFORM THE PROGRAM DESCRIBED IN THIS DOCUMENT IS PROVIDED ASIS AND WITHOUT WARRANTY OF ANY KIND EITHER EXPRESSED OR IMPLIED INCLUDING BUT NOT LIMITED TO THEIM PLIED WARRANTIESOF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE IN NO EVENT WILL PLATFORM COMPUTING BE LIABLE TO ANYONE FOR SPECIAL COLLATERAL INCIDENTAL OR CONSEQUENTIAL DAMAGES INCLUDING WITHOUT LIMITATION ANY LOST PROFITS DATA OR SAVINGS ARISING OUT OF THE USE OF OR INABILITY TO USE THISPROGRAM You can help us make this document better by telling us what you think of the content organization and usefulness of theinformation If you find an error or just want to make a suggestion for improving this document please address your comments to doc platform com Your comments sho
119. The minimum Receiver Not Ready RNR NAK timer After this time an RNR NAK is sent back to the sender V alues 1 0 01ms 31 491 52ms 0 655 36ms The default is 24 40 96ms d RNR retry count before an error is issued Minimum is 0 Maximum is 7 Default is 7 infinite e The max inline data size Default is 128 bytes MPL_VAPI_QPPARAMS MPI_VAPI_QPPARAM S a b c d Specifies time out setting for VAPI where a Timeout value for VAPI retry if there is no response from the target Minimum is 1 Maximum is 31 Default is 18 b The retry count after a time out before an error is issued Minimum is 0 Maximum is 7 Default is 7 c Platform MPI User s Guide 135 Understanding Platform MPI The minimum Receiver Not Ready RNR NAK timer After this time an RNR NAK is set back to the sender V alues 1 0 01ms 31 491 52ms 0 655 36ms The default is 24 40 96ms RNR retry count before an error is issued Minimum is 0 Maximum is 7 Default is 7 infinite Memory usage environment variables MPI_GLOBMEMSIZE MPI_GLOBM EMSIZE e Where eis the total bytes of shared memory of the job If the job sizeis N each rank has e N bytes of shared memory 12 5 is used as generic 87 5 is used as fragments The only way to change this ratio isto useMPI_SHMEMCNTL MPI_NO_MALLOCLIB When set MPI_NO_MALLOCLIB avoids using Platform M PI s ptmalloc implementation and instead usesthe standard libc implementation or perhaps amallocimp
120. WORLD using MPI_Comm_dup include lt stdio h gt include lt stdlib h gt i nclude lt mpi h gt int main argc argv int argc ne argv int rank size data Pl Status status P _ Comm i bcomm Pl _Init amp argc amp argv Pl Comm rank MPI COMM WORLD amp rank P _Comm_size MP COMM WORLD amp si ze iv size ts 2 4 if rank printf communicator must have two processes n 198 Platform MPI User s Guide Example Applications PI Finalize exit 0 PI Comm PE COMM WORLD amp l i bcomm iy rank ss w 4 data 12345 PI Send amp data 1 MPI_INT 1 5 MPI COMM WORLD data 6789 Pl Send amp data 1 MPI_INT 1 5 Ii bcomm else PI Recv amp data 1 MPI_INT 0 5 libcomm amp status printf received libcomm data Ydin data PI Recv amp data 1 MPI INT 0 5 MPI COMM WORLD amp status printf received data hain data Pl Comm free amp l i bcomm Pl _ Finalize eturn 0 j communicator output The output from running the communicator executable is shown below The application was run with np 2 received libcomm data 6789 received data 12345 multi_par f The Alternating Direction Iterative ADI method is often used to solve differential equations In this example multi_par f a compiler that supports OPEN MP directives is required in order to achieve multi level parallelism multi_par f implements the following logic
121. _ROOT bin mpiCC HPMPI_CC pgCC sort C mpiCClib libmpiCC a sort C MPI_ROOT bin mpirun np 2 a out Rank 0 980 980 54 Platform MPI User s Guide 965 Understanding Platform MPI Platform MPI User s Guide 55 Understanding Platform MPI Autodouble functionality Platform M PI supports Fortran programs compiled 64 bit with any of the following options some of which are not supported on all Fortran compilers For Linux i8 Set default KIND of integer variables is 8 r8 Set default size of REAL to 8 bytes r16 Set default size of REAL to 16 bytes autodouble Same as r8 The decision of how Fortran arguments are interpreted by the MPI library is made at link time If the mpi f 90 compiler wrapper is supplied with one of the above options at link time the necessary object files automatically link informing M PI how to interpret the Fortran arguments Note This autodouble feature is supported in the regular and multithreaded MPI libraries but not in the diagnostic library For Windows integer_size 64 1418 i8 real_size 64 4R8 Qautodouble r8 If these flags are given to thempi f 90 bat script at link time the application is linked enabling Platform M PI to interpret the data type M PI_REAL as 8 bytes etc as appropriate at run time However if your application is written to explicitly handle autodoubled datatypes e g if a variable is declared real the code is compiled
122. ability of an interconnect is determined based on whether the relevant libraries can use dl open shl_ oad and on whether a recognized module is loaded in Linux If either condition is not met the interconnect is determined to be unavailable Interconnects specified in the command line or in the MPI_IC_ORDER variable can be lower case or upper case Lower casemeans theinterconnect is used if available U pper case options arehandled slightly differently between Linux and Windows On Linux theupper case option instructs Platform M PI to abort if the specified interconnect is determined to be unavailable by the interconnect detection process On Windows the upper case option instructs H P M PI to ignore the results of interconnect detection and simply try to run using the specified interconnect irrespective of whether it appears to be available or not On Linux the names and locations of the libraries to be opened and the names of the recognized interconnect modulenamesare specified by a collection of environment variablesthat arein MP _ROOT etc pcmpi conf Thepcmpi conf filecan be used for any environment variables but arguably its most important use is to consolidate environment variables related to interconnect selection The default value of M PI_IC_ORDER is specified there along with a collection of variables of the form MPI_ICLIB_ XXX__YYY MPI_ICMOD_XXX__YYY where XXX is one of the interconnects IBV VAPI etc and YYY isan arbitrary
123. able 19 20 sp option 109 spawn 147 spawn option 109 spin yield logic 123 SPMD 251 srq option 108 srun 74 75 arguments 232 233 examples 68 execution 74 implied 232 MPI_SRUNOPTIONS 141 option 105 with mpirun 68 74 ssh 28 138 standard send mode 17 Platform MPI User s Guide 261 starting HP XC cluster applications 30 Linux cluster applications using appfiles 29 multihost applications 70 Platform MPI Linux 28 174 Platform MPI Windows 175 singlehost applications on Linux 29 status 18 status variable 19 stdin 177 stdio 177 229 stdio option 106 stdout 177 structure constructor 24 subdivision of shared memory 137 139 140 synchronization 23 synchronous send mode 17 system test 238 T T option 108 tag variable 18 20 tcp interface options 104 TCP option 103 tcp ip 83 terminate MPI environment 15 test system 238 thread multiple 24 thread_safe c 184 thread compliant library 59 03 59 Oparallel 59 tk option 115 token option 115 TotalView 130 troubleshooting Fortran 90 176 MPI_Finalize 178 Platform MPI 174 tunable parameters 162 tv option 108 twisted data layout 200 U udapl 85 udapl option 102 UNIX open file descriptors 177 262 Platform MPI User s Guide unpacking and packing 23 using counter instrumentation 156 multiple network interfaces 164 profiling interface 159 V v option 107 vapi 85 vapi option 102 variables buf 20 comm 20 count 20 dest 20 dtype 20 op 23 recvbuf 22 23 recvcount
124. able file is compiled with source lineinformation and then mpi run runsthea out MPI program g Specifies that the compiler generate the additional information needed by the symbolic debugger np 2 Specifies the number of processes to run 2 in this case Specifies that the M PI ranks are run under TotalV iew Alternatively usempi r un to invoke an appfile MPI_ROOT bin mpirun tv f my_appfile Specifies that the M PI ranks are run under TotalV iew f appfile Specifies that mpi r un parses appfile to get program and process count information for therun By default mpi r un searches for TotalView in your PATH You can also define the absolute path to TotalView using the TOTALVIEW environment variable setenv TOTALVIEW opi totalview bin totalview totalview options The TOTALVIEW environment variable is used by mpi run Note When attaching to a running MPI application that was started using appfiles attach to the MPI daemon process to enable debugging of all the MPI ranks in the application You can identify the daemon process as the one at the top of a hierarchy of MPI jobs the daemon also usually has the lowest PID among the MPI jobs Limitations The following limitations apply to using T otalV iew with Platform M PI applications All executable files in your multihost M PI application must reside on your local machine that is the machine on which you start T otalV iew Platform MPI User s Guide 171 Debugg
125. ag amp rank rank 100000 i e file called FILENAME myrank en FILENAME ENAME rank 10 l ename MPI MODE_RDWR h set 0 MPI INT MPI_ LNU native MPI INT amp status a back 0 name P MODE RDWR _UNT MPIL_INT native _INT amp status orrect Example Applications flag 0 cor iat Tenim is if buff i rank 100000 i printf Process d error read d should be d n rank buf i rank 100000 i flag 1 lv tile 4 printf Process d data read back is correct n rank MPI File delete filename MP _ NFO NULL free buf free filename MPI Finalize exit 0 io Output The output from running the io executable is shown below The applicat ion was run with np 4 Process 0 data read back is correct Process 1 data read back is correct Process 2 data read back is correct Process 3 data read back is correct thread_safe c In thisC example N clientsloop MAX_WORK times As part of a single work item a client must request service from one of N servers at random Each server keeps a count of the requests handled and prints a log of the requests to stdout After all clients finish the servers are shut down include lt stdio h gt include lt mpi h gt include lt pthread h gt define MAX WORK 40 define SERVER TAG 88 define CLIENT TAG 99 define REQ SHUTDOWN 1 static int service_cnt 0 int
126. ag generic where nenv Specifies the number of envelopes per process pair The default is 8 frag Denotes the sizein bytes of the message passing fragments region The default is 87 5 of shared memory after mailbox and envelope allocation generic Specifies the size in bytes of the generic shared memory region The default is 12 5 of shared memory after mailbox and envelope allocation The generic region is typically used for collective communication MPI_SHMEMCNTL lt a b c where a The number of envelopes for shared memory communication The default is 8 b The bytes of shared memory to be used as fragments for messages c Platform MPI User s Guide 137 Understanding Platform MPI The bytes of shared memory for other generic use such as M PI_Alloc_mem call MPIL_USE_MALLOPT_AVOID_MMAP Instructs theunderlying malloc implementation to avoid mmaps and instead use sbrk to get all memory used The default is MPI_USE_MALLOPT_AVOID_MMAP 0 Connection related environment variables MPIL_LOCALIP MPI_LOCALIP specifies the host IP address assigned throughout a session Ordinarily mpi r un determines the IP address of the host it is running on by calling gethostbyaddr H owever when a host uses SLIP or PPP the host s IP address is dynamically assigned only when the network connection is established In this case gethostbyaddr might not return the correct IP address TheM PI_LOCALIP syntax is as follows XXX XXX XXX XXX where
127. ank 3 uses 3 0 1 2 The order isimportant in SRQ mode because only the first card is used for short messages But in short RDM A mode all the cards are used in a balanced way MPI_IB_PORT_GID If acluster has multiple nfiniBand cardsin each node connected physically to separated fabrics Platform M PI requires that each fabric has its own subnet ID When the subnet ID s are the same Platform M PI cannot identify which ports are on the same fabric and the connection setup is likely to be less than desirable If all the fabrics have a unique subnet ID by default Platform M PI assumes that the ports are connected based on thei bv_ devi nfo output port order on each node All the port 1s are assumed to be connected to fabric 1 and all the port 2s are assumed to be connected to fabric 2 If all the nodes in the cluster have the first InfiniBand port connected to the same fabric with the same subnet ID Platform M PI can run without any additional fabric topology hints If the physical fabric connections do not follow the convention described above then the fabric topology information must be supplied to Platform MPI Thei bv_devi nfo v utility can be used on each node within the cluster to get the port GID If all the nodesin the cluster are connected in the same way and each fabric has a unique subnet ID thei bv_devi nf o command only needs to be done on one node TheM PI_IB_PORT_GID environment variable is used to specify which InfiniBand fa
128. arenticomm ierr endi f call MPI COMM_RANK mergedcomm myid ierr call MPI COMM_SIZE mergedcomm numprocs ierr print Process myid of numprocs in merged commis alive 216 Platform MPI User s Guide Example Applications sizetype 1 Sumtype 2 if myid eq 0 then n 100 endi f call MPI _BCAST n 1 MPI INTEGER 0 mergedcomm ierr C C Calculate the interval size Cc h 1 0d0 n sum 0 0d0 do 20 i myid 1 n numprocs x h dble i 0 5d0 sum sum f x 20 continue mypi h sum C C Collect all the partial sums C call MPI REDUCE mypi pi 1 MPI DOUBLE PRECISION PI SUM 0 mergedcomm ierr C C Process 0 prints the result C if myid eq 0 then write 6 97 pi abs pi PI25DT 97 format pi is approximately F18 16 frror ise Faw 16 endif call MP COMM FREE mergedcomm ierr call MPI _FINALI ZE ierr stop end compute_pi_spawn f output The output from running the compute_pi_spawn executable is shown below The application was run with np1 and with the spawn option Original Process 0 of 1 is alive Spawned Process 0 of 3 is alive Spawned Process 2 of 3 is alive Spawned Process 1 of 3 is alive Process 0 of 4 in merged commis alive Process 2 of 4 in merged commis alive Process 3 of 4 in merged commis alive Process lof 4 in merged commis alive pi is approximately 3 1416009869231254 Error is 0 0000083333333323 Platform MPI User s Guide 2
129. atform M PI at run time are described in the following sections categorized by the following functions General CPU bind Miscellaneous Interconnect InfiniBand Memory usage Connection related RDMA prun srun TCP Elan Rank ID General environment variables MPIRUN_OPTIONS MPIRUN_OPTIONS is a mechanism for specifying additional command line arguments to mpi run If this environment variable is set the mpi r un command behaves as if the arguments in MPIRUN_OPTIONS had been specified on the mpi r un command line For example export MPIRUN_OPTIONS v prot MPI_ROOT bin mpirun np 2 path to program x is equivalent to running MPI_ROOT bin mpirun v prot np 2 path to program x When settings are supplied on the command line in the MPIRUN_ OPTIONS variable and in an pcmpi conf file the resulting command functions as if thepc mpi conf settings had appeared first followed by the M PIRUN_OPTIONS followed by the command line Because the settings are parsed left to right this means the later settings have higher precedence than the earlier ones MPL FLAGS edde exdb MPI FLAGS modifiesthegeneral behavior of Platform MPI ThemP FLAGS syntaxisacomma separated list as follows edde exdb egdb eadb ewdb 1 f i slalpll llyl 4 Ilo E2 C D E T z The following is a description of each flag Starts the application under the dde debugger The debugger must bein the command sea
130. atform M PI programs start with theC version of afamiliar hel 0 world program The source file for this program is calledhel o_world c The program prints out the text string H ello world I m r of son host wherer is a process s rank sis the size of thecommunicator and host isthe host wherethe program is run The processor nameis thehost name for this implementation The processor name is the host name for this implementation Platform M PI returns the host name for M PI_Get_processor_name 28 Platform MPI User s Guide Getting Started The source code for hel l o_ worl d c isstoredin MPI ROOT hel p and is shown below include lt stdio h gt include mpi h void main argc argv int argc char argv int rank size len char name MPI MAX PROCESSOR NAME MPI _Init amp argc amp argv MPI Comm rank MPI COMM WORLD amp rank MPI Comm size MP COMM WORLD amp size MPI Get_processor_name name amp len printf Hello world l m d of d on s n rank size name MPI Finalize exit 0 Building and running on a single host This example teaches you the basic compilation and run steps to executehel o_ world c on your local host with four way parallelism To build and run hel o_world c onalocal host named j awbone 1 2 Change to awritable directory Compilethehel o_ world executable file MPI_ROOT bin mpicc o hello_world MPI_ROOT help hello_world c Runthehel o_ w
131. ation using ptmalloc LINUX only oo eee eee eet eee eeeeteeeeeeeeneeeeee 150 Signal propagation LINUX ONLY ceeeeceeeceeeeeeeeceeeeeeeeeeeeeeeeeeeeeeeeeeeeaeeeesaaeeesenaeesee 151 MPI 2 name publishing Suppor eee eee a 153 Native language SUDPOM sciseisccciiveetede deceresiensd ieseinad soidieae E AA aiaa aai EREE 154 Platform MPI User s Guide 3 Profiling eereenerteereerrereerererce rem eer errer ra a per tr creer ntrrretrrcep ey rrerceree tric freer errc cera 155 Using counter instrumentation c cccceeeeceeceeeeeeeeeeeeeeeeeeeeseaeeeseaeeeeeeeeeeeseneeeesiaeesees 156 Using the profiling interface ecccceeeeeeeeeeeeeeeeeeeeeeeeecaeeececeeeeeeceeeeeeseeeeeesaeeesenaeesee 159 TUNNG kocsira a ree cree errr trreererer a Per rere erererererrerrrrd Crerere rar rere ene rere 161 Tunable Parameters wecsceseeueles of eee EAEN E 162 Message latency and bandwidth ce eceeceeeeenneeeeeeeenaeeeeeeeaaaeeeeeeeaeeeeeeenaaeeeeesenaees 163 Multiple network interfaces ccccceeeeeceeeeseeeeeeeeeaeeeceaaaeeseaaeeseceeeeseeeaeeeeeeaeeeesiaaeeeee 164 Processor SUDSEMPUOM i eis ciasas cei cdepesdeeveey rea aenaran is ct sstwalert nineascaceuadawast eeeaey seen ees 165 Processor locality Ganeman eane aa A leaden canard ie deine 166 MIPITOUtINGe SGIGCHONcsescezeccents n A OA OEE E AE ON 167 Debugging and Troubleshooting cccccccceceeeeeeeeeeeeeeeeeeeeeeaeeeeseceeesecneeeeseeeeeeseeneeeessiaeesee
132. ations may require coordinated operations among multiple processes For example all processes must cooperate to sum sets of numbers distributed among them MPI provides a set of collective operations to coordinate operations among processes T hese operations are implemented so that all processes call the same operation with the same arguments Thus when sending and receiving messages onecollectiveoperation can replacemultiplesends and receives resulting in lower overhead and higher performance Collective operations consist of routines for communication computation and synchronization These routines all specify a communicator argument that defines the group of participating processes and the context of the operation Collective operations are valid only for intracommunicators Intercommunicators are not allowed as arguments Communication Collectivecommunication involvestheexchangeof dataamongprocessesin agroup Thecommunication can be one to many many to one or many to many The single originating process in the one to many routines or the single receiving process in the many to one routines is called the root Collective communications have three basic patterns Broadcast and Scatter Root sends data to all processes including itself Gather Root receives data from all processes including itself Allgather and Alltoall Each process communicates with each process including itself 20 Platform MPI User s Guide buf coun
133. ator might not be completely error free H owever the two ranks in the original communicator that were unable to communicate before the call are not included in a communicator generated by MP _Comm_dup Communication failures can partition ranks into two groups A and B so that no rank in group A can communicate to any rank in group B and vice versa A call to MPI_Comm_dup can behave similarly toacalltoMPI_Comm_split returning different legal communicators to different callers W hen a larger communicator exists than the largest communicator the rank can join it returns M PI_COMM_NULL Platform MPI User s Guide 111 Understanding Platform MPI However extensive communication failures such as a failed switch can make such knowledge unattainable to a rank and result in splitting the communicator If thecommunicator returned by rank A contains rank B then either the communicator return by ranks A and B will be identical or rank B will return MP _COMM_ NULL and any attempt by rank A to communicate with rank B immediately returns M P _ERR_EXITED Therefore any legal use of communicator return by MPI_Comm_dup should not result in a deadlock M embers of the resulting communicator either agree to membership or are unreachable to all members Any attempt to communicate with unreachable members results in a failure Interruptible collectives When afailure host process or interconnect that affects a collective operation occurs at least
134. better performance than interhost communication Platform MPI User s Guide 77 Understanding Platform MPI a Fast communication t process 0 process 2 process 1 ri Multipurpose daemon process Platform M PI incorporates a multipurpose daemon process that provides start up communication and termination services The daemon operation is transparent Platform M PI sets up one daemon per host or appfile entry for communication Note Because Platform MPI sets up one daemon per host or appfile entry for communication when you invoke your application with np x Platform MPI generates x 1 processes Generating multihost instrumentation profiles When you enable instrumentation for multihost runs and invoke mpi r un on a host where at least one M PI processisrunning or on ahostremotefrom M PI processes Platform M PI writestheinstrumentation output file prefix instr to theworkingdirectory on thehost thatisrunningrank 0 when instrumentation for multihost runs is enabled W hen using ha the output fileis located on the host that is running the lowest existing rank number at the time the instrumentation data is gathered during M PI_FINALIZE Mplexec The M PI 2 standard defines mpi exec asa simple method to start M PI applications It supports fewer features than mpi run but itis portable mpi exec syntax has three formats mpi exec offers arguments similar to aM Pl_Comm_spawn call with arguments as shown in the
135. bric subnet should be used by Platform M PI to make the initial InfiniBand connection between the nodes For example if the user runs Platform M PI on two nodes with thefollowingi bv_devinfo v output on the first node ibv_devinfo v hca_id mthca0 fwover ATRO node_gui d 0008 f104 0396 62b4 max pkeys 64 local co ack delav 15 port 1 state PORT_ACTIVE 4 max_ mtu 2048 4 phys state LINK UP 5 GID 0 fe80 0000 0000 0000 0008 f104 0396 62b5 port 2 state PORT_ACTIVE 4 max_ mtu 2048 4 phys state LINK UP 5 GID 0 fe80 0000 0000 0001 0008 f104 0396 62b6 The following is the second node configuration ibv_devinfo v hca_id mthca0 fw_ ver AO node guid 0008 f104 0396 a56c max pkeys 64 local _ca_ack delay 15 port 1 state PORT_ACTIVE 4 132 Platform MPI User s Guide Understanding Platform MPI max mtu 2048 4 phys state LINK_UP 5 GID 0 fe80 0000 0000 0000 0008 f104 0396 a56d port 2 state PORT_ACTIVE 4 max mtu 2048 4 phys_state LINK_UP 5 GID 0 fe80 0000 0000 0001 0008 f104 0396 a56e The subnet ID is contained in the first 16 digits of the GID The second 16 digits of the GID are the interface ID In this example port 1 on both nodes is on the same subnet and has the subnet prefix fe80 0000 0000 0000 By default Platform M PI makes connections between nodes using the port 1 This port selection is only for
136. bs rb and E rbe rb respectively as well src mod comm_rank 1 comm size dest mod comm_rank 1 comm_size comm size ncb comm_rank do rb 0 comm_size 1 cb ncb Compute a block The function wil compiler supports OPENMP directives go thread para aaAAA call compcolumn nrow ncol array i rbs rb rbe rb cbs cb cbe cb if rb lt comm size 1 then Send the last row of the block to the rank that is block next to the computed block Receive the las AAANAA nrb rbtl ncb mod nrb comm rank comm si call mpi _sendrecv array rbe r t 0 array rbs nrb 1 cbs n x mpi comm world mstat ier endi f enddo aao ze b cbs c CONI a by G i e b Sum up in each row The same logic as the loop above except switched ara e A ae E e E src mod comm_ rank 1 comm size comm Si ze dest mod comm rank 1 comm size do cbh 0 comm_size 1 rb mod cb comm_rank comm size comm size call comprow nrow ncol array t rbs rb rbe rb cbs cb cbe cb if cb le Comm size 1 then ncb cbtl nrb mod ncb comm_rank comm si call mpi_sendrecv array rbs r 0 array rbs nrb cbs ncb i mpi _comm_ world mstat ier endi f enddo o i 7 b r E Gather computation results call mpi_barrier MPI_ CO endt mpi _wti me M_WORLD ierr if comm rank eq 0 then do src 1 comm size 1 cal mstat ierr enddo elapsed endt start write 6 else cal i jerr endif Computation took elapsed c G
137. buf MPI Aint sendcount MPI Datatype Sendtype void recvbuf MPI Aint recvcount MPI Datatype recvtype int root MPI Comm comm IN sendbuf address of send buffer choice significant only at root IN sendcount number of elements sent to each process significant only at root IN sendtype data type of send buffer elements significant only at root handle OUT recvbuf address of receive buffer choice IN recvcount number of elements in receive buffer IN recvtype data type of receive buffer elements handle IN root rank of sending process IN comm communicator handle int MPI _ScattervL void sendbuf MPI _ Aint sendcounts MPI Aint displs MPI Datatype sendtype void recvbu PI Aint recvcount MPI Datatype recvtype int root MPI Comm comm IN sendbuf address of send buffer choice significant only at root IN sendcounts array specifying the number of elements to send to each processor IN displs Array of displacements relative to sendbuf IN sendtype data type of send buffer elements handle OUT recvbuf address of receive buffer choice IN recvcount number of elements in receive buffer recvtype data type of receive buffer elements handle IN root rank of sending process IN comm communi cator handle Data types communication int MPl_Get_countL MPI_ Status status MPI_Datatype datatype MPI_Aint 224 Platform MPI User s Guide d 0 d E d E in Large message APIs 5 in
138. buffers if possible This improves byte copy performance between sending and receiving processes because of double word loads and stores UseM PI_Recv_init and M PI_Startall instead of a loop of M PI_Irecv calls in cases where requests might not complete immediately For example suppose you write an application with the following code section j ou for i 0 i lt size i if i rank continue MPI _lrecv buf i count dtype i 0 comm amp requests j MPI _Waitall size 1 requests statuses Suppose that one of the iterations through M PI_Irecv does not complete before the next iteration of the loop In this case Platform M PI tries to progress both requests This progression effort could continue to grow if succeeding iterations also do not completeimmediately resulting in a higher latency However you could rewrite the code section as follows ow for ae 1 lt Sives Tah 4 if i rank continue MPI_Recv_init buf i count dtype i 0 comm amp requests j t MPI Startall size 1 requests MPI _Waitall size 1 requests statuses In this case all iterationsthrough M PI_ Recv_init areprogressed just oncewhen M PI_Startall iscalled This approach avoids the additional progression overhead when using M PI_Irecv and can reduce application latency Platform MPI User s Guide 163 Tuning Multiple network interfaces You can use multiple network interfaces for interhost communication while still hav
139. calls MP _Comm_disconnect Receive calls that cannot be satisfied by a buffered message fail on the remote processes after the local processes have called MP _Comm_disconnect Send calls on either side of the intercommunicator fail after either side has called MPI_Comm_disconnect Instrumentation and high availability mode Platform M PI lightweight instrumentation is supported when using haand singletons In the event that some ranks terminate during or before M PI_Finalize then the lowest rank id in MPI_COMM_WORLD produces the instrumentation output file on behalf of the application and instrumentation data for the exited ranks is not included Failure Recover ha recover Fault tolerant MPl_Comm_dup that excludes failed ranks W hen using ha recover the functionality of MP _Comm_dup enables an application to recover from errors Important The MPI_Comm_dup function is not standard compliant because a call to MPI_Comm_dup always terminates all outstanding communications with failures on the communicator regardless of the presence or absence of errors When oneor more pairs of ranks within a communicator are unable to communicate because a rank has exited or the communication layers have returned errors a call to MPI_Comm_dup attempts to return the largest communicator containing ranks that were fully interconnected at some point during the MPI_Comm_dup call Because new errors can occur at any time the returned communic
140. cbs cbe c G This subroutine c does summations of rows in a thread g implicit none integer nrow of rows integer ncol of columns double precision array nrow ncol compute region integer rbs row block start subscript integer rbe row block end subscript integer cbs column block start subscript integer cbe column block end subscript c g Local variables G integer i j c c The OPENMP directives below allow the compiler to split the E values for i between a number of threads while j moves g orward lock step between the threads By making j shared c and i private all the threads work on the same column j at G any given time but they each work on a different portion i E of that column G g This is not as efficient as found in the compcolumn subroutine c but is necessary due to data dependencies G C OMP PARALLEL PRI VATE i do j max 2 cbs cbe C OMP DO do i rbs rbe array i j array i j 1 array i j enddo C OMP END DO enddo C OMP END PARALLEL end CEER EERE EKER KERR ERE ERE ER KERR KK ERE E RRR ER ERR EKER REE ERE REE ER EKER EKER EEE EHH Subroutine getdata nrow ncol array c c Enter dummy data c integer nrow ncol double precision array nrow ncol c do j l neol do i 1 nrow array i j j 1 0 ncol enddo enddo end Platform MPI User s Guide 205 Example Applications multi_par f output The output from running the multi_par f executable is shown below The application
141. ch ass run or pr un described below or when using multiple executables Usempi run with prun using the Quadrics Elan communication processor on Linux For example MPI_ROOT bin mpirun mpirun options prun lt prun options gt lt program gt lt args This method is only supported when linking with shared libraries Some features likempi run stdio processing are unavailable Rank assignments within Platform M PI are determined by the way pr un chooses mapping at run time The np option is not allowed with prun The following mpi run options are allowed with prun MPI_ROOT bin mpirun help version jv i lt spec gt universe_size sp lt paths gt T prot spawn 1sided tv e var val prun lt prun options gt lt program gt lt args gt For moreinformation on prun usage man prun The following examples assume the system has the Quadrics Elan interconnect and is a collection of 2 CPU nodes MPI_ROOT bin mpirun prun N4 a out will runa out with 4ranks one per node Ranks are cyclically allocated n00 ranki n01 rank2 n02 rank3 n03 rank4 MPI_ROOT bin mpirun prun n4 a out Platform MPI User s Guide 67 Understanding Platform MPI assuming nodes have 2 processors cores each will runa out with 4 ranks 2 ranks per node ranks are block allocated Two nodes are used n00 ranki n01 rank2 n02 rank3 n03 rank4 Other forms of usage include allocating the nodes yo
142. ch other and use MPI_Recvto receivemessages the results areunpredictable f the messages are buffered communication works correctly However if the messages are not buffered each process hangs in M PI_ Send waiting for MPI_Recv to take the message For example a sequence of operations labeled D eadlock as illustrated in the following tablewould result in such a deadlock Thistablealso illustrates the sequenceof operations that would avoid code deadlock Table 19 Non buffered messages and deadlock Deadlock No Deadlock Process 1 Process 2 Process 1 Process 2 MPI_Send 2 MPI_Send 1 MPI_Send 2 MPI_Recv 1 MPI_Recv 2 MPI_Recv 1 MPI_Recv 2 MPI_Send 1 Propagation of environment variables When working with applications that run on multiple hosts using an appfile if you want an environment variable to be visible by all application ranks you must usethe e option with an appfile or as an argument to mpi run One way to accomplish this is to set the e option in the appfile h remote_host e var val np program args On HP XC systems the environment variables are automatically propagated by s r un Environment variables are established withs et env orexport and passed to M PI processesbytheSLURM sr un utility Thus on HP XC systems itis not necessary to usethe ename value approach to passing environment variables Although the e name value also work
143. cktrace information The default behavior is to print a stack trace Backtracing can be turned off entirely by setting the environment variable M PI NOBACKTRACE Debugging tutorial for Windows A browser based tutorial is provided that contains information on how to debug applications that use Platform MPI in the Windows environment The tutorial provides step by step procedures for performing common debugging tasks using Visual Studio 2005 The tutorial is located in the MP1 ROOT hel p subdirectory Platform MPI User s Guide 173 Debugging and Troubleshooting Troubleshooting Platform MPI applications This section describes limitations in Platform M PI common difficulties and hints to help you overcome those difficulties and get the best performance from your Platform M PI applications Check this information first when you troubleshoot problems The topics covered are organized by development task and also include answers to frequently asked questions To get information about the version of Platform M PI installed usethempi run versi on command The following is an example of the command and its output mpirun version MPI_ROOT bin mpicc MPI_ROOT bin mpicc Platform MPI 02 01 01 00 dd mm yyyy B6060BA This command returns the Platform M PI version number the release date Platform M PI product numbers and the operating system version For Linux systems use ident MPI_ROOT bin mpirun or rpm qa grep
144. cluster Y our situation should resemble one of the following 1 If running on Windows HPCS using automatic scheduling Submit the command to thescheduler but includethetotal number of processes needed on the nodes as the np command This is not the rank count when used in this fashion Also include the nodexflag to indicate only one rank node Assume 4 CPUs nodes in this cluster The command would be gt MPI_ROOT bin mpirun ccp np 12 IBAL nodex prot ping_ping_ring exe gt MPI_ROOT bin mpirun ccp np 12 IBAL nodex prot ping_ping_ring exe 10000 In each case above the first mpi r un command uses 0 bytes per message and verifies latency The second mpi r un command uses 1000000 bytes per message and verifies bandwidth include lt stdio h gt include lt stdlib h gt ifndef WI N32 include lt unistd h gt endi f Platform MPI User s Guide 191 Example Applications include lt string h gt i nclude lt math h gt include lt mpi h gt define NLOOPS 1000 define ALIGN 4096 define SEND t MPI Send buf nbytes MPI CHAR partner t MPI COMM WORLD define RECV t MPI _Recv buf nbytes MPI CHAR partner t MPI _COMM_WORLD amp status if def CHECK define SETBUF for j 0 lt nbytes j NOM cuer i i define CLRBUF memset buf 0 nbytes define CHKBUF for j 0 j lt nbytes j li Chm 1s ehnar i e i 4 printf error buf d d not d n Jo MO
145. control of the selection process for TCP IP connections The same functionality can be accessed by using the netaddr option to mpi r un For more information refer to the mpi r un documentation MPI_REMSH By default Platform M PI attempts to uses s h on Linux Werecommend thats sh users set Strict Host KeyChecking no intheir ssh config To user sh on Linux instead run the following script as root on each node in the cluster opt pempi etc mpi remsh default 138 Platform MPI User s Guide Understanding Platform MPI Or to user sh on Linux use the alternative method of manually populating thefiles etc profile d pempi csh and etc profile d pcmpi sh with the following settings respectively setenv MPI_REMSH rsh export MPI_REMSH rsh On Linux M PI_REMSH specifies a command other than the default r ems h to start remote processes Thempirun mpij ob and mpi cl ean utilities support M PI_REM SH For example you can set the environment variable to use a secure shell setenv MPI_REMSH bin ssh Platform M PI allows users to specify the remote execution tool to use when Platform M PI must start processes on remote hosts The tool must have a call interface similar to that of the standard utilities rsh remsh andssh An alternate remote execution tool such ass sh can be used on Linux by setting the environment variable M PI_REM SH to the name or full path of the tool to use export MPI_REMSH ssh MPI_ROOT bin mpirun lt opti
146. d by the compiler These can be controlled by environment variables or from the command line Table 6 mpicc Utility Environment Variable Value Command Line MPI_CC desired compiler default cl mpicc lt value gt MPIL_BITNESS 32 or 64 no default mpi32 or mpi64 MPL_WRAPPER_SYNTAX windows or unix default windows mpisyntax lt va lue gt For example to compilehel o_ world c using a 64 bit cl contained in your PATH could be done with the following command since cl and the W indows syntax are defaults C gt MPLROOT bin mpicc mpi64 hello_world c link out hello_world_cl64 exe Or use the following example to compile using the PGI compiler which uses a more U N1X like syntax C gt MPI_ROOT bin mpicc mpicc pgcc mpisyntax unix mpi32 hello_world c o hello_world_pgi32 exe To compile C code and link against Platform M PI without utilizing the mpi cc tool start a command prompt that has the appropriate environment settings loaded for your compiler and use it with the compiler option A MPI_ROOT include lt 32 64 gt and the linker options ibpath MPI_ROOT lib subsystem console lt li bhpmpi64 lib libhpmpi32 lib gt The above assumes the environment variable M PI_ROOT is set For example to compilehe o_ world c from the H elp directory using Visual Studio from a Visual Studio 2008 command prompt window cl hello_world c I MPI_ROOT include 64 link out hello_world exe libpa
147. d set the related environment variable such as M PI_CC Without such a setting the utility script searches the PATH and afew default locations for possible compilers Although in many environments this search produces the desired results explicitly setting the environment variable is safer Command line options take precedence over environment variables Table 10 Compiler selection Language Cc C Fortran 77 Fortran 90 Wrapper Script Environment Variable Command Line mpi cc MPI_CC mpicc lt compiler gt mpi CC MPI_CXxX mpicxx lt compiler gt mpi f 77 MPI_F77 mpif77 lt compiler gt mpi f 90 MPI_F90 mpif90 lt compiler gt Compiling applications The compiler you use to build Platform M PI applications depends on the programming language you use Platform M PI compiler utilities are shell scripts that invoke the correct native compiler Y ou can pass the pathname of the M PI header files using the I option and link an MPI library for example the diagnostic or thread compliant library using the W 1 L or I option By default Platform M PI compiler utilities include a small amount of debug information to allow the TotalV iew debugger to function H owever some compiler options are incompatible with this debug information U sethe notv option to exclude debuginformation The notv option also disablesT otalV iew usage on the resulting executable The notv option applies to archive libraries only Platform M PI offers a
148. d and resumed using SIGTSTP and SIGCONT The Platform M PI library also changes the default signal handling properties of the application in a few specific cases When using the ha option to mpi r un SIGPIPE is ignored W hen using MPI_FLAGS U an MPI signal handler for printing outstanding message status is established for SIGU SR1 When using M PI_FLAGS sa an M PI signal handler used for message propagation is established for SIGALRM W hen using M PI_ FLAGS sp an M PI signal handler used for message propagation is established for SIGPROF Platform MPI User s Guide 151 Understanding Platform MPI In general Platform M PI relies on applications terminating when they are sent SIGTERM Applications that intercept SIGTERM might not terminate properly 152 Platform MPI User s Guide Understanding Platform MPI MPI 2 name publishing support Platform M PI supports the M PI 2 dynamic process functionality MPI_Publish_name MPI_Unpublish_name M PI_Lookup_name with the restriction that a separate nameserver must be started up on a server The service can be started as MPI_ROOT bin nameserver and prints out an IP and port When running mpi r un the extra option nameserver with an IP address and port must be provided MPI_ROOT bin mpirun spawn nameserver lt IP ports gt Thescopeover which namesare published and retrieved consists of all mpi r un commands that arestarted using the same IP port for the nameserver Platf
149. d memory segments mpi clean doesnot support prun orsrun start up mpi cl ean isnot available on Platform M PI V 1 0 for Windows Interconnect support Platform M PI supports a variety of high speed interconnects Platform M PI attempts to identify and use the fastest available high speed interconnect by default Thesearch order for theinterconnectis determined by theenvironment variableM PI_IC_ORDER which isacolon separated list of interconnect names and by command line options which take higher precedence Platform MPI User s Guide 81 Understanding Platform MPI Table 15 Interconnect command line options Command Line Option Protocol Specified Os ibv IBV Linux vapi VAPI VAPI Mellanox Verbs API Linux udapl UDAPL uDAPL InfiniBand and some others Linux psm PSM PSM QLogic InfiniBand Linux mx MX MX Myrinet Linux Windows gm GM GM Myrinet Linux elan ELAN Quadrics Elan3 or Elan4 Linux itapi ITAPI ITAPI InfiniBand Linux deprecated ibal IBAL IBAL Windows IB Access Layer Windows TCP TCP IP All Theinterconnect names used in M PI_IC_ORDER arelikethe command line options above but without the dash On Linux the default value of MPI_IC_ORDER is psm ibv vapi udapl itapi mx gmelan tcp If command line options from the above table are used the effect is that the specified setting isimplicitly prepended to the MPI_IC_ORDER list taking higher precedence in the search The avail
150. de Understanding Platform MPI exclusiveto M PI_SH M EM CNTL IfM PI_SH M EM CNTL isset the user cannot set the other two and vice versa MPI_PIN_PERCENTAGE MPI_PIN_PERCENTAGE communicates the maximum percentage of physical memory see MPI_PHYSICAL_MEMORY that can be pinned at any time The default is 20 export MPI_PIN_PERCENTAGE 30 The above example permits the Platform M PI library to pin lock in memory up to 30 of physical memory The pinned memory is shared between ranks of the host that were started as part of the same mpi run invocation Running multiple M PI applications on the same host can cumulatively cause more than one application s M PI_PIN_ PERCENTAGE to bepinned Increasing M PI_PIN_ PERCENTAGE can improve communication performance for communication intensive applications in which nodes send and receive multiple large messages at a time which is common with collective operations Increasing MPI_PIN_ PERCENTAGE allows morelarge messages to be progressed in parallel using RD M A transfers however pinning too much physical memory can negatively impact computation performance MPI_PIN_ PERCENTAGE and MPI_PHYSICAL_MEMORY areignored unless InfiniBand or M yrinet GM isin use MPI_SHMEMCNTL MPI_SHMEMCNTL controls the subdivision of each process s shared memory for point to point and collective communications It cannot be used with MPI_GLOBMEM SIZE TheMPI_SHMEMCNTL syntax is a comma separated list as follows nenv fr
151. der 1 int numtasks rank source dest outbuf i tag l inbuf 4 MPI _ PROC NULL MPI _PROC_NULL MPI PROC NULL MPI PROC_NULL nbrs 4 dims 2 4 4 periods 2 0 0 reorder 1l coords 2 For example fopt platform_mpi bin mpirun np 16 e MPI_FLAGS o a out Reordering ranks for the cal MPI_Cart_create comm size 16 ndi ms 2 dims 4 4 periods false false reorder true Default mapping of processes would result communication paths i 0 between hosts between subcompl exes bet ween hypernodes between CPUs within a per nodet SIP Reordered mapping results communication paths between hosts between subcompl exes between hypernodes between CPUs within a niger node sine Reordering will not reduce overall communi cati Void the optimization and adopted unreordered rank 2 coords 0 2 neighbors u d r 1 6 rank 0 coords 0 0 neighbors u d r 1 4 rank 1 coords 0 1 neighbors u d r 1 5 rank 10 coords 2 2 nei ghbors u d l r 6 14 rank 2 imour Gl fe ol G a rank 6 coords 1 2 neighbors u d r 2 10 rank 7 coords 1 3 neighbors u d r 3 11 rank 4 coords 10 neighbors u d r 0 8 rank 0 inbuf u d l r 1 4 rank 5 coords 11 neighbors u d r 1 9 4 rank 11 coords 2 3 neighbors u d r 7 15 rank 1 imour u Gl fie 1 5 0 rank 14 coords 3 2 neighbors u d l r 10 rank 9 coords 2 1 neighbors u d r 5 13 rank 13 coords 3 1 neighbors u d r 9 1 rank 15 coords 3 3 n
152. der of a set of events does not vary from run to run domain decomposition Breaking down an M PI application s computational space into regular data structures such that all computation on these structures is identical and performed in parallel executable 246 Platform MPI User s Guide Glossary A binary filecontaining a program in machine language which is ready to be executed run explicit parallelism Programming style that requires you to specify parallel constructs directly Using the MPI library is an example of explicit parallelism functional decomposition Breaking down an MPI application s computational space into separate tasks such that all computation on these tasks is performed in parallel gather M any to one collective operation where each process including the root sends the contents of its send buffer to the root granularity M easure of the work done between synchronization points Fine grained applications focus on execution at the instruction level of a program Such applications are load balanced but suffer from alow computation communication ratio Coarse grained applications focus on execution at the program level where multiple programs may be executed in parallel group Set of tasks that can be used to organize M PI applications M ultiple groups are useful for solving problems in linear algebra and domain decomposition intercommunicators Communicators that allow only processes in tw
153. dlib h gt iostream h gt mpi h gt ncl ude nclude ncl ude ude IN AN nclude lt limits h gt nclude lt i ostream h gt nclude lt fstream h gt as cae i uS aaa gt z declarations ee a on ey class Entry private int value public Entry Li w value Entry int x value x Entry const Entry Ge value e getValue Entry amp operator value const Entry amp e e getValue return this int getValue const return value int operator gt const Entry amp e const return value gt e getValue Va class BlockOfEntries private Entry entries int numOf Entries public 210 Platform MPI User s Guide Example Applications BlockOfEntries int numOfEntries_p int offset BlockOfEntries int getnumOf Entries return numOfEntries void setLeftShadow const Entry amp e entries 0 e void setRightShadow const Entry amp e entries numOfEntries 1 e const Entry amp getLeftEnd return entries 1 const Entry amp getRightEnd return entries numOfEntries 2 void singleStepOddEntries void singleStepEvenEntries void verifyEntries int myRank int baseline void printEntries int myRank Ht ih Class member definitions Hil const Entry MAXENTRY INT_MAX const Entry MINENTRY INT_MIN ll BlockOfEntries BlockOfEntries ll Function
154. dulefiles platform mpi totheMODULEPATH environment variable Some useful module related commands are module avail Lists modules that can be loaded module load platform mpi Loads the Platform M PI module module list Lists loaded modules module unload platform mpi 72 Platform MPI User s Guide Understanding Platform MPI Unloads the Platform M PI module M odules are only supported on Linux Note On HP XC Linux the Platform MPI module is named mpi hp default and can be abbreviated as mpi Run time utility commands Platform M PI provides a set of utility commands to supplement M PI library routines mpirun Thissection includes a discussion of mpi r un syntax formats mpi r un options appfiles the multipurpose daemon process and generating multihost instrumentation profiles ThePlatform M PI start up mpi r un requiresthatM PI beinstalled in thesamedirectory on every execution host The default is the location where mpi run is executed This can be overridden with the MPI_ROOT environment variable Set the MPI_ ROOT environment variable prior to starting mpi run mpi run syntax has six formats Single host execution Appfile execution prun execution sr un execution LSF on HP XC systems LSF on non HP XC systems Single host execution To run on a single host you can use the np option to mpi run For example MPI_ROOT bin mpirun np 4 a out will run 4 ranks on the local host Appfile executio
155. e M PI_ Ssend guarantees synchronous send semantics that is a send can be started whether or not a matching receive is posted H owever the send completes successfully only if a matching receive is posted and the receive operation has begun receiving the message sent by the synchronous send 240 Platform MPI User s Guide Frequently Asked Questions If your application still hangs after you convert MPI_ Send and MPI_Rsendcallsto MPI_Ssend you know that your code is written to depend on buffering Rewrite it so that MPI_ Send and MPI_Rsend do not depend on buffering Alternatively use non blocking communication calls to initiate send operations A non blocking send start call returns before the message is copied out of the send buffer but a separate send complete call is needed to complete the operation For information about blocking and non blocking communication see Sending and receiving messages on page 17 For information about M PI_ FLAGS options see General environment variables on page 121 QUESTION How dol turn on MPI collection of message lengths want an overview of M PI message lengths being sent within the application ANSWER Theinformation is available through Platform M PI s instrumentation feature Basically including i lt filename gt on thempi run command line will create lt filename gt with a report that includes number and sizes of messages sent between ranks Network specific QUESTION I get an error when
156. e MPI_FLAGS environment variable Platform M PI also supports the multithread multiprocess debugger TotalView on Linux In addition to theuse of debuggers Platform M PI provides a diagnostic library D LIB for advanced error checking and debugging Platform M PI also provides options to the environment variable M P _ FLAGS that report memory leaks I force M PI errors to be fatal f print the M PI job ID j and other functionality This section discusses single and multi process debuggers and the diagnostic library Using a single process debugger Because Platform M PI creates multiple processes and ADB DDE XDB WDB GDB and PATHDB only handle single processes Platform M PI starts one debugger session per process Platform M PI creates processes in M PI_Init and each process instantiates a debugger session Each debugger session in turn attaches to the process that created it Platform M PI provides MPI_DEBUG_CONT to control the point at which debugger attachment occurs MP _DEBUG_ CONT is a variable that Platform M PI uses to temporarily halt debugger progress beyond M PI_ Init By default MP _ DEBUG CONT is se to 0 and you must reset it to 1 to allow the debug session to continue past M PI_ Init Complete the following when you use a single process debugger 1 Set the eadb exdb edde ewdb egdb or epathdb option in theM PI_ FLAGS environment variable to use the ADB XDB DDE WDB GDB or PATH DB debugger respectively 2
157. e lt math h gt include lt mpi h gt define NLOOPS 1000 define ALIG 4096 define SEND t MPI Send buf nbytes MPI CHAR partner t MPI COMM_WORLD define RECV t MPI Recv buf nbytes MPI _CHAR partner t ifdef CHECK define SETBUF fo u 0 j lt nbytes j S chami i se m define CLRBUF memset buf 0 nbytes define CHKBUF for j 0 j lt nbytes j iy one fs ear Gy 2 iy 4 printf error buf d d not d n fo DUNN Yo 1 break else 188 Platform MPI User s Guide MPI COMM WORLD amp status Example Applications 0 define SETBUF define CLRBUF define CHKBUF endifint main argc argv int argc char aN lid n i ifdef CHECK in j endi f double Siart SEOD intn bytes 0 n rank size n root n partner P Status status char Wi A a char myhost MPI MAX_PROCESSOR_NAME in len char str 1024 MPI Init amp argc amp argv PI Comm rank MPI COMM WORLD amp rank P Comm size MPI COMM WORLD amp size PI Get_processor_name myhost amp len lv Size lt 2 4 if rank printf rping must have two processes n Pl _Finalize exit 0 nbytes argc gt 1 atoi argv 1 if nbytes lt 0 nbytes 0 Page align buffers and displace themin the cache to avo a d collisions buf char malloc nbytes 524288 ALIGN 1 obuf buf ie ow se PI Abort MP COMM WORLD MP
158. e buffer size in bytes in Pl_Buffer_detachL void buf_address MPI _Aint size OUT buffer_addr initial buffer address choice OUT size buffer size in bytes in Pl_IbsendL void buf MPI Aint count MPI Datatype datatype in dest int tag MP Comm comm MPI Request request N buf initial address of send buffer choice IN count number of elements in send buffer IN datatype datatype of each send buffer element handle IN dest rank of destination IN tag message tag IN comm communicator handle OUT request communication request handle int MPIl_IrecvL void buf MPI Aint count MPI Datatype datatype int Source int tag MPI Comm comm MPI Request request OUT buf initial address of receive buffer choice IN count number of elements in receive buffer IN datatype datatype of each receive buffer element handle IN source rank of source IN tag message tag IN comm communicator handle OUT request communication request handle int MPIl_IrsendL void buf MPI Aint count MPI _Datatype datatype int dest int tag MPI Comm comm MPI Request request IN buf initial address of send buffer choice IN count number of elements in send buffer IN datatype datatype of each send buffer element handle IN dest rank of destination IN tag message tag IN comm communi cator handle OUT request communication request handle int MPI _IsendL void buf MPI Aint count MPI Datatype datatype int dest int tag MPI_Com
159. e implied s r un mode The contents of the appfile are passed along except for np and hwhich are discarded Some arguments are pulled from the appfile and others after the H ereis the appfile np 1 h foo e MPI_FLAGSS T pallas x npmin 4 setenv MPI_SRUNOPTION label These are required to use the new feature setenv MPI _USESRUN 1 bsub I n4 MPI_ROOT bin mpirun f appfile sendrecv Platform MPI User s Guide 233 mpirun Using Implied prun or srun Job lt 2547 gt is submitted to default queue lt normal gt lt lt Waiting for dispatch gt gt lt lt Starting on localhost gt gt Bee Ne Bees teed Sec Pe Sete Shee OS pay Fee a ac ge ee teyedeeey ence E Date Thu Feb 24 14 24 56 2005 Machine 1a64 System Linux Release 2 4 21 15 11hp XCs mp Version 1 SMP Mon Oct 25 02 21 29 EDT 2004 Minimum message length in bytes 0 Maximum message length in bytes 8388608 MPI Datatype MPI BYTE MPI Datatype for reductions MPI FLOAT MPI_Op gt MPI_SUM List of Benchmarks to run Sendrecv Benchmarking Sendrecv processes 4 gt O gt O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O PE sonene is Sup cane epee ate E Ae ee E NE eats E es DEn iS Sum i ene 2s cei EE E Gae see tesa rene bytes repetitions t_ min t_ max t_avg Mbytes sec 0 1000 35 28 35 40 35 34 0 00 1 1000 42 40 42 43 42 41 0 04 2 1000 41 60 41 69 41 64 0 09
160. e output Hello world m1 of 4 n01 Hello world I m 3 of 4 n02 Hello world m0O of 4 n01 Hello world m2 of 4 n02 Building and running on an HP XC cluster using srun The following is an example of basic compilation and run steps to execute hello_world c on an HP XC cluster with 4 way parallelism To build and run hello_world con an HP XC cluster assuming LSF isnot installed 1 2 Change to awritable directory Compile the hello_world executable file MPI_ROOT bin mpicc o hello_world MPI_ROOT help hello_world c Runthehel o_ worl d executable file MPI_ROOT bin mpirun srun n4 hello_world where n4 specifies 4 as the number of processes to run from SLURM Analyzehello_worl d output Platform M PI prints the output from running the hello_world executablein nondeterministic order The following is an example of the output l m 1 of 4 n01 Hello world m 3 of 4 n02 Hello world l m 0 of 4 n01 Hello world l m2 of 4 n02 Hello world Directory structure for Linux Platform M PI files are stored in the opt platform mpi directory for Linux If you move the Platform M PI installation directory from its default location in opt pl at form mpi settheM PI_ROOT environment variableto point to thenew location Thedirectory structureis organized as follows Table 4 Directory structure for Linux Contents Subdirectory bin etc help include Command files for the Platform MPI utilities gather_inf
161. e passing is used widely on parallel computers with distributed memory and on clusters of servers The advantages of using message passing include Portability M essage passing is implemented on most parallel platforms Universality The model makes minimal assumptions about underlying parallel hardware M essage passing libraries exist on computers linked by networks and on shared and distributed memory multiprocessors Simplicity The model supports explicit control of memory references for easier debugging However creating message passing applications can require more effort than letting a parallelizing compiler produce parallel applications In 1994 representatives from thecomputer industry government labs and academe developed a standard specification for interfaces to a library of message passing routines This standard is known as MPI 1 0 MPI A M essage Passing Interface Standard After this initial standard versions 1 1 June 1995 1 2 July 1997 and 2 0 July 1997 have been produced Versions 1 1 and 1 2 correct errors and minor omissionsof M PI 1 0 M PI 2 M PI 2 Extensionsto theM essage Passing nterface addsnew functionality to MPI 1 2 You can find both standards in HTML format at http www mpi forum org M PI 1 compliance means compliancewith M PI 1 2 M PI 2 compliance means compliance with M PI 2 0 Forward compatibility is preserved in the standard That is a valid M PI 1 0 program isa valid M PI 1 2 program and a
162. e type of elements in send buffer handle IN dest rank of destination IN endtag send tag OUT recvbuf initial address of receive buffer choice IN recvcount number of elements in receive buffer IN recvtype type of elements in receive buffer handle IN e d S source rank of source IN recvtag receive tag IN comm communi cator handle OUT status status object status int MPl_Sendrecv_replaceL void buf MPI_Aint count MPI _Datatype datatype int dest int sendtag int source int recvtag MPI _Comm comm MPI_Status status NOUT buf initial address of send and receive buffer choice IN count number of elements in send and receive buffer IN datatype type of elements in send and receive buffer handle IN dest rank of destination N sendtag send message tag IN Source rank of source Platform MPI User s Guide 221 Large message APIs N recvtag receive message tag IN comm communi cator handle OUT status status object status int MPI _SsendL void buf MPI _ Aint count MPI Datatype datatype int dest int tag MPI Comm comm N buf initial address of send buffer choice IN count number of elements in send buffer IN datatype datatype of each send buffer element handle IN dest rank of destination IN tag message tag IN comm communicator handle int MPI Ssend_initL void buf MPI Aint count MPI Datatype datatype int dest int tag MPI Comm comm MPI Request request N buf initial address of send buffer cho
163. e userl 2952 0 046875 5488 explorer exe userl 1468 1 640625 17532 reader_sl exe user1 28151000825 3912 cmd exe userl 516 0 031250 2112 ccApp exe userl 2912 0 187500 7580 mpid exe userl 3048 0 125000 5828 Pallas exe userl 604 0 421875 13308 CMD Finished successfully The processes by the current user user1 runs on winbl16 Two of the processes are M PI jobs mpid exe andPallas exe If these are not supposed to be running use mpi di ag to kill the remote process X Demo gt MPI_ROOT bin mpidiag s winbl16 kill 604 CMD Finished successfully X Demo gt MPI_ROOT bin mpidiag s winbl16 ps Process List ProcessName Username PID CPU Time Me mor y rdpclip exe userl 2952 0 046875 5488 explorer exe userl 1468 1 640625 1532 reader_sl exe userl 2856 0 078125 3912 cmd exe userl 516 0 031250 Bille ccApp exe userl 2912 0 187500 7580 CMD Finished successfully Pallas exe was killed and Platform M PI cleaned up the remaining Platform M PI processes Another useful command is a short system info command indicating the machine name system directories CPU count and memory X Demo gt MPI_ROOT bin mpidiag s winbl16 sys System nfo Computer name WI NBL16 User name user System Directory C WINDOWS system32 Wi ndows Directory C W NDOWS CPUs 2 Total Memory 2146869248 Small selection of Environment Variables 0S Windows NT PAN c C8 Peel il C WENDOWS syst em32 C WI NDOWS
164. ed block start subscript nteger blocke out assigned block end subscript integer dil ml dl sube subs l bl ockcnt ml mod sube substl bl ockcnt blocks nth dl subs mi n nth ml bl ocke bl ocks d1 1 ese ea en CEPREEERREREREER EERE EERE EEE EEE EEE E EEE EEE EEE EEEE EEEE EEE EEEE EE EEE E EEEE EE E e E e ae AAQDAAAAAA C OMP subroutine compcolumn nrow ncol array rbs rbe cbs che This subroutine does summations of columns in a thread implicit none integer nrow of rows integer ncol of columns double precision array nrow ncol compute region integer rbs row block start subscript integer rbe row block end subscript integer cbs column block start subscript integer cbe column block end subscript Local variables integer i j The OPENMP directive below allows the compiler to split the values for j between a number of threads By making i and j private each thread works on its own range of columns j and works down each column at its own pace i Note no data dependency problems arise by having the threads al working on different columns simultaneously PARALLEL DO PRIVATE i j do j cbs cbe do i max 2 rbs rbe 204 Platform MPI User s Guide Example Applications array i j array i 1 j array i j enddo enddo C OMP END PARALLEL DO end CEPREEERREREREER EE RRE REAR RAE R KERR EEE EEE EEE ERA ER RARER EEEE EEE EEE EEE EEEE Subroutine comprow nrow ncol array rbs rbe
165. eighbors u d l r 11 rank 10 inbuf u d r 6 14 rank 12 coords 3 0 nei ghbors u d irj 8 1 rank 8 coords 2 0 neighbors u d l r 4 12 rank 3 coords 0 3 neighbors ud r 1 7 124 Platform MPI User s Guide 5 6 1 0 0 24 0 0 0 24 on cost mapping 9 11 3 7 meo E2 E onl off Understanding Platform MPI rank 6 monitu dl rje 2 10 7 rank 7 now Gl pe 3 dd G od rank 4 mani Ceh aa 0 8 oil rank 5 Ewou eaea tO amp rank 11 invi O Gd tthe 7 a5 A od rank 14 iaou l f s 10 2 Wa ws rank 9 iMour t f 8 5 ds amp Lo rank 13 DO G lra O ad da da rank 15 iaoa 0 0 s ta oil We oil rank 8 inbuf u d l r 4 12 1 9 rank 12 Mou wt Go whe B od of Ae rank 3 ioui Gl ie 2 We x Sets 1 as the value of TRUE and 0 as the value for FALSE when returning logical values from Platform M PI routines called within Fortran 77 applications Disables ccN U MA support Allows you to treat the system as a symmetric multiprocessor SM P Dumps shared memory configuration information Use this option to get shared memory values that are useful when you want to set the M PI_SH M EM CNTL flag Turnsfunction parameter error checkingon or off thedefault Checking can beturned on by the setting M PI_ FLAGS Eon Prints the user and system times for each M PI rank Enables zero buffering mode Set this flag to convert M PI_Send and M PI_Rsend calls in your code t
166. emory size for locking must be specified see below It is controlled by the etc security limits conf filefor Red Hatand the etc syscnt conf file for SUSE soft memlock 4194303 hard memlock 4194304 The example above uses the maximum locked in memory address space in KB units The recommendation is to set the value to half of the physical memory on the machine Platform M PI tries to pin up to 20 of the machine s memory see M PI_PHYSICAL_MEMORY and MPI_PIN_ PERCENTAGE and fails if it is unable to pin the desired amount of memory Machines can have multiple InfiniBand cards By default each Platform M PI rank selects one card for its communication and the ranks cycle through the available cards on the system so the first rank uses the first card the second rank uses the second card etc The environment variable M PI_IB_CARD_ORDER can be used to control which card the ranks select Or for increased potential bandwidth and greater traffic balance between cards each rank can be instructed to use multiple cards by using the variable M PI_IB_ MULTIRAIL Lazy deregistration is a performance enhancement used by Platform M PI on several high speed interconnects on Linux This option is turned on by default and requires the application to be linked in such a way that Platform M PI can intercept calls to mal oc mun map etc M ost applications are linked that way but if oneis not then Platform M PI s lazy deregistration can be turned off wit
167. endbuf starting address of send buffer choice OUT recvbuf starting address of receive buffer choice IN count number of elements in send buffer IN datatype data type of elements of send buffer handle IN op operation handle IN comm communicator handle int MPI _AlltoallL void sendbuf MPI Aint sendcount MPI Datatype sendtype void recvbuf MPI Aint recvcount MPI Datatype recvtype MPI Comm comm IN sendbuf starting address of send buffer choice IN sendcount number of elements sent to each process IN sendtype data type of send buffer elements handle OUT recvbuf address of receive buffer choice IN recvcount number of elements received from any process IN recvtype data type of receive buffer elements handle IN comm communi cator handle int MPI_AlltoallvL void sendbuf MPI _ Aint sendcounts MPI Aint 222 Platform MPI User s Guide Large message APIs sdispls MPI Datatype sendtype void recvbuf MPI Aint recvcounts MPI Aint rdispls MPI Datatype recvtype MPI Comm comm IN sendbuf starting address of send buffer choice IN Sendcounts array equal to the group size specifying the number of elements to send to each rank IN sdispls array of displacements relative to sendbuf IN sendtype data type of send buffer elements handle OUT recvbuf address of receive buffer choice IN recvcounts array equal to the group size specifying he number of elements that can be received from each rank N rdispls a
168. es When you use LSF to start M PI applications the host names specified to mpi r un orimplicit when the h option isnot used aretreated as symbolic variables that refer to the IP addresses that LSF assigns U se LSF to do this mapping by specifying a variant of mpi run to execute your job To invoke LSF for applications that run on multiple hosts bsub sf_options pam mpi mpirun mpirun_options f appfile extra_args_for_appfile In this case each host specified in the appfile is treated as a symbolic name referring to the host that LSF assigns to the M PI job For example bsub pam mpi MPI_ROOT bin mpirun f my_appfile runs an appfilenamed my_appfile and requests host assignments for all remote and local hosts specified in my_appfile If my_appfile contains the following items h voyager np 10 send receive h enterprise np 8 compute _p Host assignments are returned for the two symbolic links voyager and enterprise When requesting a host from LSF be sure that the path to your executable file is accessible by all machines in the resource pool More information about appfile runs This example teaches you how to run thehel o_ world c application that you built on HP and Linux above using two hosts to achievefour way parallelism For this example thelocal host isnamed jawbone and a remote host is named wizard To runhello_worlI d c on two hosts use the following procedure replacing jawbone and wizard with the
169. es if a password is stored in the user password cache and stops execution The MPI application will not launch if this option is included on the command line clearcache Clears the password cacheand stops The M PI application will not launch if this option is included on the command line Password authentication The following are specific mpi r un command line options for password authentication Platform MPI User s Guide 115 Understanding Platform MPI pwcheck V alidates the cached user password by obtaining a login token locally and verifying the password A pass fail message is returned before exiting To check password and authentication on remote nodes use the at flag with mpidiag Note The mpi r un pwcheck option along with other Platform MPI password options run with Platform MPI Remote Launch Service and do not refer to Windows HPC user passwords When running through Windows HPC scheduler with hpc you might need to cache a password through the Windows HPC scheduler For more information see the Windows HPC job command package lt package name gt and pk lt package name gt When Platform M PI authenticates with the Platform M PI Remote Launch service it authenticates using an installed Windows security package for example K erberos NTLM Negotiate and more By default Platform M PI negotiates the package to use with the service and no interaction or package specification is required If a spec
170. es perform some computation and return the result to the master as the solution ready send mode Form of blocking send wherethe sending process cannot start until a matching receive is posted The sending process returns immediately reduction Binary operations such as addition and multiplication applied globally to all processes in a communicator These operations are only valid on numeric data and are always associative but may or may not be commutative scalable Ability to deliver an increase in application performance proportional to an increasein hardware resources normally adding more processors scatter One to many operation wherethe root s send buffer is partitioned into n segments and distributed to all processes such that theith processreceivestheith segment n represents the total number of processes in the communicator Security Support Provider Interface SSPI A common interface between transport level applications such as Microsoft Remote Procedure Call RPC and security providers such as Windows Distributed Security SSP allows a transport application to call one of several security providers to obtain an authenticated connection Thesecallsdo not requireextensive knowledge of thesecurity protocol s details send modes Point to point communication in which messages are passed using oneof four different types of blocking sends The four send modes include standard mode M PI_ Send buffered mode M PI_Bsend
171. escribed in Chapter 3 Point to Point Communication in the M PI 1 0 standard The performance of point to point communication is measured in terms of total transfer time The total transfer time is defined as total_transfer_time latency message_size bandwidth where latency Specifies the time between the initiation of the data transfer in the sending process and the arrival of the first byte in the receiving process message_size Specifies the size of the message in M B bandwidth Denotes the reciprocal of the time needed to transfer a byte Bandwidth is normally expressed in M B per second Low latencies and high bandwidths lead to better performance Communicators A communicator is an object that represents a group of processes and their communication medium or context These processes exchange messages to transfer data Communicators encapsulate a group of processes so communication is restricted to processes in that group The default communicators provided by MPI areMPI_COMM_WORLD and MPI_COMM_SELF MPI_COMM_WORLD contains all processes that are running when an application begins execution Each process is the single member of its own MPI_COMM_SELF communicator Communicators that allow processes in a group to exchange data are termed intracommunicators Communicators that allow processes in two different groups to exchange data are called intercommunicators 16 Platform MPI User s Guide Introduction M any MP
172. eservers can create a redundant license server network For a license checkout request to be granted at least two servers must berunning and ableto communicate with each other Thisavoidsasingle license server failure which would prevent new Platform M PI jobs from starting With three server redundant licensing the full number of Platform M PI licenses can be used by a single job W hen selecting redundant license servers use stable nodes that arenot rebooted or shutdown frequently The redundant license servers exchange heartbeats Disruptions to that communication can cause the license servers to stop serving licenses The redundant license servers must be on the same subnet as the Platform M PI compute nodes T hey do not have to be running the same version of operating system as the Platform M PI compute nodes but it is recommended Each server in the redundant network must be listed in the Platform M PI license key by hostname and hostid The hostid is the MAC address of the ethO network interface The eth0 MAC address is used even if that network interface is not configured The hostid can be obtained by typing the following command if Platform M PI is installed on the system opt platform_mpi bin licensing arch Imutil Imhostid The ethO MAC address can be found using the following command sbin ifconfig egrep ethO awk print 5 sed s g The hostname can be obtained by entering the command host name To request a three se
173. esources U sethis option 116 Platform MPI User s Guide Understanding Platform MPI if you are running locally This option also suppresses the no password cached warning This is useful when no password is desired for SM P jobs iscached Indicates if a password is stored in the user password cache and stops execution The M PI application does not launch if this option is included on the command line clearcache Clearsthe password cacheand stops TheM PI application doesnotlaunch if thisoption isincluded on the command line Platform MPI User s Guide 117 Understanding Platform MPI Runtime environment variables Environment variables are used to alter the way Platform M PI executes an application The variable settings determine how an application behaves and how an application allocates internal resources at run time M any applications run without setting environment variables H owever applications that use a large number of nonblocking messaging requests require debugging support or must control process placement might need a more customized configuration Launching methods influence how environment variables are propagated To ensure propagating environment variables to remote hosts specify each variable in an appfile using the e option Setting environment variables on the command line for Linux Environment variables can be set globally on the mpi r un command line Command line options take precedence over envi
174. esses after the job resumes and exits Platform MPI User s Guide 149 Understanding Platform MPI Improved deregistration using ptmalloc Linux only To achievethe best performance on RDM A enabled interconnects like InfiniBand and M yrinet the M PI library must be aware when memory is returned to the system in malloc and free calls To enablemore robust handling of that information Platform M PI contains a copy of the ptmalloc implementation and uses it by default For applications with specific needs there are a number of available modifications to this default configuration To avoid using Platform M PI s ptmalloc implementation and instead usethe standard libc implementation or perhaps a malloc implementation contained in the application set the environment variableM PI_NO_MALLOCLIB at run time If the above option is applied so that the ptmalloc contained in Platform M PI is not used there isa risk of M PI not beinginformed when memory isreturned to thesystem This can bealleviated with thesettings MPI_USE MALLOPT_SBRK_PROTECTION and MPI_USE MALLOPT AVOID_MMAP atrun time which essentially results in the libc malloc implementation not returning memory to the system There are cases where these two settings cannot keep libc from returning memory to the system specifically when multiple threads call malloc free at the same time In these cases the only remaining option is to disable Platform M PI s lazy deregistration by giving
175. etting Started This chapter describes how to get started quickly using Platform M PI The semantics of building and running a simple M PI program are described for single and multiple hosts You learn how to configure your environment before running your program You become familiar with the file structure in your Platform MPI directory The Platform M PI licensing policy is explained The goal of this chapter is to demonstrate the basics to getting started using Platform M PI It is separated into two major sections Getting Started Using Linux and Getting Started Using Windows Platform MPI User s Guide 27 Getting Started Getting started using Linux Configuring your environment Setting PATH If you move the Platform MPI installation directory from its default location in opt pl at for m_mpi for Linux Set the M PI_ROOT environment variable to point to the location where M PI is installed Add MP _ROOT bin to PATH Add MPI ROOT share man toMANPATH MPI must be installed in the same directory on every execution host Setting up remote shell By default Platform M PI attempts to uses s h on Linux Werecommend thats s h users set StrictHostKeyChecki ng no intheir ssh config To use a different command such as r s h for remote shells set the M PI_REM SH environment variable to the desired command The variable is used by mpi r un when launching jobs as well as by the mpi job and mpi cl ean utilities Set it directly in
176. ettings By default the HPM PI_SYSTEM_CHECK API cannot be used if M PI_Init has already been called and the API will call MPI_Finalize before returning QUESTION Can have multiple versions of Platform M PI installed and how can switch between them ANSWER You can install multiple Platform M PI s and they can be installed anywhere as long as they arein the same place on each host you plan to run on Y ou can switch between them by setting MPI_ROOT For moreinformation on M PI_ROOT refer to General on page 237 QUESTION How do install in a non standard location ANSWER Two possibilities are rpm prefix wherever you want ivh pempi XXXXX XXX rpm Or you can basically useunt ar for an rpm using rpm2cpio pempi XXXXX XXX rpm cpio id For Windows see the Windows FAQ section QUESTION How do install a permanent license for Platform M PI ANSWER You can install the permanent license on the server it was generated for by running Imgrd c lt full path to license file gt Building applications QUESTION Which compilers does Platform M PI work with ANSWER Platform M PI works well with all compilers W e explicitly test with gcc Intel PathScale and Portland Platform M PI strives not to introduce compiler dependencies For Windows see the Windows FAQ section QUESTION What MPI libraries do need to link with when I build ANSWER Werecommend using thempi cc mpi f 90 and mpi 77 scriptsin MP1 ROOT bin to bui
177. example thehello_world c program is copied to simulate a server and client program in an MPMD application The print statement for each is modified to indicate server orclient program so theMPMD application can be demonstrated 1 Changeto a writable directory on a mapped drive The mapped drive should beto a shared folder for the cluster 2 Open a Visual Studio command window This example uses a 64 bit version so a Visual Studio x64 command window is opened 3 Copythehello_world csourceto server c and client c Then edit each fileto changetheprint statement and include server and client in each X Demo gt copy MPI_ROOT help hello_world c server c X Demo gt copy MPI_ROOT help hello_world c server c Edit each to modify the print statement for both c files to include server or client in the print so the executable being run is visible 4 Compilethe server c and client c programs X Demo gt MPI_ROOT bin mpicc mpi64 server c Microsoft R C C Optimizing Compiler Version 14 00 50727 762 for x64 Copyright C Microsoft Corporation All rights reserved Server c Microsoft R Incremental Linker Version 8 00 50727 762 Copyright C Microsoft Corporation All rights reserved out server exe Llibpath C Program Files x86 Platform MPI 1lib subsystem console libhpcmpi 64 lib libmpio64 lib Server obj Platform MPI User s Guide 41 Getting Started X Demo gt MPI_ROOT bin mpicc mpi64 clien
178. executable in non deterministic order The following is an example of the output Hello world m1 of 4 on n01 Hello world m3 of 4 on n02 Hello world m0O of 4 on n01 Hello world m2 of 4 on n02 Running with an appfile using HPCS Using an appfile with H PCS has been greatly simplified in this release of Platform M PI The previous method of writing a submission script that usesmpi _ nodes exe to dynamically generate an appfile based on the H PCS allocation is still supported H owever the preferred method is to allow mpi run exe to determine which nodes are required for the job by reading the user supplied appfile request those nodes from the H PCS scheduler then submit the job to HPCS when the requested nodes have been allocated The user writes a brief appfile calling out the exact nodes and rank counts needed for the job For example Perform Steps 1 and 2 from Building and Running on a Single H ost 1 Create an appfile for running on nodes n01 and n02 as h n01 np 2 hello_world exe h n02 np 2 hello_world exe 2 Submit the job to HPCS with the following command X demo gt mpirun hpc f appfile 3 Analyze hello_world output Platform M PI prints the output from running thehello_world executable in non deterministic order The following is an example of the output Hello world m2 of 4 on n02 Hello world m1 of 4 on n01 Hello world m0O of 4 on n01 Hello world I m 3 of 4 on n02 Building and runn
179. f processors and active processes on a host The following table lists possible subscription types Table 18 Subscription types Subscription type Description Under subscribed More processors than active processes Fully subscribed Equal number of processors and active processes Over subscribed More active processes than processors When a host is over subscribed application performance decreases because of increased context switching Context switching can degrade application performance by slowing the computation phase increasing message latency and lowering message bandwidth Simulations that use timing sensitive algorithms can produce unexpected or erroneous results when run on an over subscribed system Platform MPI User s Guide 165 Tuning Processor locality Thempi r un option cpu_bind binds arank to alocality domain Idom to prevent aprocess from moving to a different Idom after start up The binding occurs before the M PI application is executed Similar results can be accomplished using mpsched but this has the advantage of being amore load based distribution and works well in psets and across multiple machines Binding ranks to Idoms cpu_bind On SMP systems processes sometimes moveto adifferent dom shortly after start up or during execution Thisincreasesmemory latency and can causeslower performancebecausetheapplication isnow accessing memory across cells Applications that are very memory latency sensi
180. f the library The options and variables of interest to performance tuning include the following MP _FLAGS y This option can be used to control the behavior of the Platform M PI library when waiting for an event to occur such as the arrival of a message MPI_TCP_CORECVLIMIT Setting this variable to a larger value can allow Platform M PI to use more parallelism during its low level message transfers but it can greatly reduce performance by causing switch congestion MPI_SOCKBUFSIZE Increasing this value has shown performance gains for some applications running on TCP networks cpu_ bind MPI _ BIND _ MAP MPI_CPU_AFFINITY MPI_CPU_SPIN The cpu_bind command lineoption and associated environment variables can improvetheperformance of many applications by binding a process to a specific CPU intra The intra command line option controls how messages are transferred to local processes and can impact performance when multiple ranks execute on a host MPI RDMA_INTRALEN MPI RDMA MSGSIZE MPI RDMA_NENVELOPE These environment variables control aspects of the way message traffic is handled on RDMA networks Thedefault settings have been carefully selected for most applications H owever someapplications might benefit from adjusting these values depending on their communication patterns For more information see the corresponding manpages MPI_USE_LIBELAN SUB Setting this environment variable may provide some performance benefits
181. fault Platform M PI installation location includes spaces C Program Files x86 Platform MPI bin mpi run or AMPI ROOT bi n mpi run Platform MPI User s Guide 175 Debugging and Troubleshooting Running on Linux and Windows Run time problems originate from many sources and may include the following Shared memory When an M PI application starts each M PI daemon attempts to allocate a section of shared memory This allocation can fail if the system imposed limit on the maximum number of allowed shared memory identifiers is exceeded or if the amount of available physical memory is not sufficient to fill the request After shared memory allocation is done every M PI process attempts to attach to the shared memory region of every other process residing on the same host This shared memory allocation can fail if the system is not configured with enough availableshared memory Consult with your system administrator to change system settings Also M PI_GLOBM EM SIZE is available to control how much shared memory Platform M PI tries to allocate Message buffering According to the M PI standard message buffering may or may not occur when processes communicate with each other using MPI_ Send MPI_ Send buffering is at the discretion of the M PI implementation Therefore take care when coding communications that depend upon buffering to work correctly For example when two processes use M PI_ Send to simultaneously send a message to ea
182. for a 2 dimensional compute region Do AE 1 J MAX DO 2 RCL T EATS AL tad END ENDDO DO J 2 MAX DO 1 MAX ACI A 1 J A 1 ENDDO ENDDO There are loop carried dependencies on the first dimension array s row in the first innermost DO loop and the second dimension array s column in the second outermost DO loop A simple method for parallelizing the fist outer loop implies a partitioning of the array in column blocks while another for the second outer loop implies a partitioning of the array in row blocks With message passing programming such a method requires massive data exchange among processes because of the partitioning change Twisted data layout partitioning is better in this case because the partitioning used for the parallelization of the first outer loop can accommodate the other of the second outer loop Platform MPI User s Guide 199 Example Applications Select Network Service m Click the Network Service that you want to install then click OK Network Service Myricom Myrinet MX Mapper Daemon _ This driver is not digitally signed Tell me why driver signing is important Figure 2 Array partitioning In this sample program the rank nprocess is assigned to the partition n at distribution initialization Because these partitions are not contiguous memory regions M PI s derived datatype is used to define the partition layout to the M PI system Each process starts wi
183. for example the file cache is currently using 8 of the memory and x is set to 10 In this case no flush is performed 126 Platform MPI User s Guide Understanding Platform MPI Example output MPI FLUSH FCACHE set to 0 fcache pct 22 attempting to flush fcache on host opteron2 MPI FLUSH_FCACHE set to 10 fcache pct 3 no fcache flush required on host opteron2 Memory is allocated with mma p then it is deallocated with mu n ma p afterwards Miscellaneous environment variables MPI_2BCOPY Point to point bcopy is disabled by setting MPI_2BCOPY to 1 Valid on PA RISC only MPL_MAX_WINDOW MPI_MAX_ WINDOW is used for one sided applications It specifies the maximum number of windows a rank can have at the same time It tells Platform M PI to allocate enough table entries The default is 5 export MPlL_MAX_WINDOW 10 The above example allows 10 windows to be established for one sided communication Diagnostic debug environment variables MPI_DLIB_FLAGS ns strict nmsg nwarn MPI_DLIB_FLAGS controls run time options when you use the diagnostics library The MPI_DLIB_FLAGS syntax is a comma separated list as follows ns h strict nmsg nwarn dump prefix dumpf prefix xNUM where Disables message signature analysis Disables default behavior in the diagnostic library that ignores user specified error handlers The default considers all errors to be fatal Enables M PI object space corrup
184. for the high performance networks it supports ANSWER For information on high performance networks see Interconnect support on page 81 QUESTION How can I control which interconnect is used for running my application ANSWER The environment variable M PI_IC_ORDER instructs Platform M PI to search in a specific order for the presence of an interconnect The contents are a colon separated list For a list of default contents see Interconnect support on page 81 Or mpi run command line options can be used that take higher precedence than M PI_IC_ORDER Lowercase selections imply to use if detected otherwise keep searching U ppercase selections demand the interconnect option be used and if it cannot be selected the application terminates with an error For a list of command line options see Interconnect support on page 81 An additional issue is how to select a subnet when TCP IP is used and multiple TCP IP subnets are available between the nodes This can becontrolled by using the netaddroption to mpi r un For example Platform MPI User s Guide 241 Frequently Asked Questions mpirun TCP netaddr 192 168 1 1 f appfile This causes TCP IP to be used over the subnet associated with the network interface with IP address 192 168 1 1 For more detailed information and examples see Interconnect support on page 81 For Windows see the Windows FAQ section Windows specific QUESTION What versions of Windows does Platform M PI supp
185. form M PI product Table 20 Example applications shipped with Platform MPI Name send_receive f ping_pong c ping_pong_ring c compute_pi f master_worker f90 cart C communicator c multi_par f Language Fortran 77 Fortran 77 Fortran 90 C Fortran 77 Description Illustrates a simple send and receive operation Measures the time it takes to np Argument np gt 2 np 2 send and receive data between two processes Confirms that an application can np gt 2 run using the specified interconnect Computes pi by integrating f x np gt 1 4 14x x Distributes sections of an array np gt 2 and does computation on all sections in parallel Generates a virtual topology Copies the default communicator MPLCOMM_WORLD Uses the alternating direction np 4 np 2 np gt 1 iterative ADI method on a two dimensional compute region Platform MPI User s Guide 183 Example Applications Name Language Description np Argument i0 c C Writes data for each process to np gt 1 a separate file called iodatax wherex represents each process rank in turn Then the data in iodatax is read back thread_safe c Cc Tracks the number of client np gt 2 requests handled and prints a log of the requests to stdout sort C C Generates an array of random np gt 1 integers and sorts it compute_pi_spawn f Fortran 77 A single initial rank spa
186. g and running multihost on Windows HPCS clusters The following is an example of basic compilation and run steps to executehe o_ world c on a cluster with 16 way parallelism To build and run hello_world c onaHPCS cluster L 2 3 Changeto a writable directory on a mapped drive Share the mapped drive to a folder for the cluster Open a Visual Studio command window This example uses a 64 bit version so a Visual Studio 64 bit command window is opened Compilethe hello_world executable file X demo gt set MPI_CC cl X demo gt MPI_ROOT bin mpicc mpi64 MPI_ROOT help hello_world c Microsoft C C Optimizing Compiler Version 14 00 50727 42 for 64 bit Copyright Microsoft Corporation All rights reserved hello_world Microsoft Incremental Linker Version 8 00 50727 42 Copyright Microsoft Corporation All rights reserved out hello_world exe Eli bpath C Program Files x86 Platform MPI 1lib bsystem console pcmpi 64 lib mpi o64 hello_world obj a Soc fer Create anew job requesting the number of CPUs to use Resources are not yet allocated but the job is given aJOBID number which is printed to stdout C gt job new numprocessors 16 exclusive true Job queued ID 4288 AddasingleCPU mpi run task to the newly created job Thempi r un job creates more tasks filling the rest of the resources with the compute ranks resulting in a total of 16 compute ranks for this example
187. g the nodes you want to use which creates a subshell Then jobsteps can be launched within that subshell until the subshell is exited srun A n4 This allocates 2 nodes with 2 ranks each and creates a subshell MPI_ROOT bin mpirun srun a out Thisruns on the previously allocated 2 nodes cyclically n00 ranki n01 rank2 n02 rank3 n03 rank4 68 Platform MPI User s Guide Understanding Platform MPI UseHP XC LSF and Platform MPI Platform M PI jobs can be submitted using LSF LSF uses the SLURM sr un launching mechanism Because of this Platform M PI jobs must specify the srun option whether LSF is used or sr un is used bsub I n2 MPI_ROOT bin mpirun srun a out LSF creates an allocation of 2 processors and sr un attaches to it bsub I n12 MPIl_ROOT bin mpirun srun n6 N6 a out LSF creates an allocation of 12 processors ands r un uses 1 CPU per node 6 nodes H ere we assume 2 CPUS per node LSF jobs can be submitted without the I interactive option An alternative mechanism for achieving the one rank per node which uses the ext option to LSF bsub l n3 ext SLURM nodes 3 MPI_ROOT bin mpirun srun a out The ext option can also be used to specifically request a node The command line would look something like the following bsub I n2 ext SLURM nodelist n10 mpirun srun hello_world Job lt 1883 gt is submitted to default queue lt interactive gt lt lt Waiting for dispatch g
188. ges by using Visual Studio s Property M anager Add this property page for each M PI project in your solution QUESTION How do I specifically build a 32 bit application on a 64 bit architecture ANSWER On Windows open the appropriate compiler command window to get the correct 32 bit or 64 bit compilers When using mpi cc or mpi f 90 scripts include the mpi32 or mpi64 flag to link in the correct M PI libraries QUESTION How can I control which interconnect is used for running my application ANSWER The default protocol on Windows is TCP Windows does not have automatic interconnect selection To use InfiniBand you have two choices W SD or IBAL WSD uses thesame protocol as TCP Y ou must select the relevant IP subnet specifically the PolB subnet for InfiniBand drivers To select a subnet use the netaddr flag For example R gt mpirun TCP netaddr 192 168 1 1 ccp np 12 rank exe This forces TCP IP to be used over the subnet associated with the network interface with the IP address 192 168 1 1 To use the low level InfiniBand protocol use the IBAL flag instead of TCP For example R gt mpirun IBAL netaddr 192 168 1 1 ccp np 12 rank exe The use of netaddr is not required when using IBAL but Platform M PI still uses this subnet for administration traffic By default it uses the TCP subnet available first in the binding order This can be found and changed by going to the Network Connections gt Advanced Sett
189. gram In this example mpi r un runsthehello_world program with two processes on the local machine jawbone and two processes on the remote machine wizard as dictated by the np 2 option on each line of the appfile 6 Analyzehello_world output HP MPI prints the output from running the hello_world executable in nondeterministic order The following is an example of the output Hello world m2 of 4 on wizard Hello world m0 of 4 on jawbone Hello world m3 of 4 on wizard Hello world m1 of 4 on jawbone Processes 0 and 1 run on jawbone the local host while processes 2 and 3 run on wizard HP M PI guarantees that the ranks of the processes in MPI COM M_WORLD are assigned and sequentially ordered according to the order the programs appear in the appfile The appfilein this example my_appfile describes the local host on the first line and the remote host on the second line Running MPMD applications A multiple program multiple data M PM D application uses two or more programs to functionally decompose a problem This style can be used to simplify the application source and reduce the size of spawned processes Each process can execute a different program Platform MPI User s Guide 71 Understanding Platform MPI MPMD with appfiles To run an MPMD application thempi r un command must reference an appfile that contains the list of programs to berun and thenumber of processes to be created for each program A simpleinv
190. h the command line ndd Some applications decline to directly link to i bmpi and instead link to a wrapper library that is linked tol i bmpi In this case it is still possible for Platform M Pl smal oc etc interception to be used by supplying the auxiliary option to the linker when creating the wrapper library by using a compiler flag such as WI auxiliary libmpi so Dynamic linking is required with all InfiniBand use on Linux Platform M PI does not use the Connection M anager CM library with OFED InfiniBand card failover When InfiniBand has multiple paths or connections to thesamenode Platform M PI supports nfiniBand card failover This functionality is always enabled An InfiniBand connection is setup between every card pair During normal operation short messages are alternated among the connectionsin round robin manner Long messages are striped over all the connections W hen one of the connections is broken a warning is issued but Platform M PI continues to use the rest of the healthy connections to transfer messages If all the connections are broken Platform M PI issues an error message InfiniBand port failover A multi port InfiniBand channel adapter can use automatic path migration APM to provide network high availability APM is defined by theI nfiniBand Architecture Specification and enables Platform M PI to recover from network failures by specifying and using thealternatepathsin thenetwork Thel nfiniBa
191. handle comm ommuni cator argument to packing call handle OUT ze upper bound on size of packed message in bytes P Raek external sizeL char datarep Me Aint incount P Datatype datatype MPI Aint size datarep data representation string incount number of input data items datatype datatype of each input data item handle UT size output buffer size in bytes t MPI Type_indexedL MPI Aint count MPI Aint array_of_blocklengths PI Aint array_of_displacements MPI Datatype oldtype MPI Datatype newt ype count number of blocks array_of blocklengths number of elements per block array of displacements displ acemen or each block in multiples of oldtype extent oldtype old datatype handle UT newt ype new datatype handle t MPI _Type_sizeL MPIl Datatype datatype MPI Aint size datatype datatype handle OUT size datatype size t MPI_Type_structL MPI Aint count MPI_Aint array_of_blocklengths PI Aint array_of displacements MPI Datatype array_of types MPI Datatype newt ype count number of blocks integer IN rray_of_blocklength number of elements in each block IN rray_of_displacements byte displacement of each block array_of_types type of elements in each block Platform MPI User s Guide 225 Large message APIs OU int MPI _Type_ create _h array_of_blocklengths OU array of handles N count new datatype int MPI _Type_vectorL MPI Aint number of blocks nonnegative integer blocklength number of old data
192. have regular Ethernet and some form of higher speed interconnect such as InfiniBand This section describes how to use the ping_pong_ring c example program to confirm that you can run using the desired interconnect Running a test like this especially on a new cluster is useful to ensure that the relevant network drivers areinstalled and that the network hardware is functioning If any machine has defective network cards or cables this test can also be useful to identify which machine has the problem To compile the program set the M PI_ROOT environment variable not required but recommended to avaluesuchas opt platform_mpi Linux and then run export MPI_CC gcc whatever compiler you want MPI_ROOT bin mpicc o pp x MPI_ROOT help ping_pong_ring c Although mpi cc will perform asearch for what compiler to useif you don t specify M PI_CC itispreferable to be explicit If you have a shared filesystem it is easiest to put theresultingpp x executablethere otherwise you must explicitly copy it to each machinein your cluster Therearea variety of supported start up methods and you must know which is relevant for your cluster Your situation should resemble one of the following 1 Nosrun prun or HPCS job scheduler command is available For this case you can create an appfile with the following h hostA np 1 path to pp x h hostB np 1 path to pp x h hostC np 1 path to pp x Platform MPI User s Guide 187 Example A
193. he default behavior when using stdio is to ignore standard input Additional options are available to avoid confusing interleaving of output Line buffering block buffering or no buffering Prepending of processes ranks to stdout and stderr Simplification of redundant output This functionality is not provided when using srun or prun Refer to the label option of sr un for similar functionality Backtrace functionality Platform M PI handles several common termination signals on PA RISC differently than earlier versions of Platform M PI If any of the following signals are generated by an MPI application a stack trace is printed prior to termination SI GBUS bus error SI GSEGV segmentation violation SI GI LL illegal instruction SI GSYS illegal argument to system call The backtrace is helpful in determining where the signal was generated and the call stack at the time of theerror If asignal handler is established by theuser code before calling M PI_ Init no backtraceis printed for that signal type and the user s handler is solely responsible for handling the signal Any signal handler installed after M PI_Init also overrides the backtrace functionality for that signal after the point it is established If multiple processes cause a signal each of them prints a backtrace In some cases the prepending and buffering options available in Platform M PI standard 1O processing are useful in providing more readable output of the ba
194. he example teaches you the basic compilation and run steps to executehel o_worl d c on acluster with 4 way pllelism To build and run hel o_ world c on a cluster using an appfile Perform Steps 1 and 2 from Building and Running on a Single H ost Note Specify the bitness using mpi64 or mpi32 for mpi cc to link in the correct libraries Verify you are in the correct bitness compiler window Using mpi64 in a Visual Studio 32 bit command window does not work 1 Create a file appfile for running on nodes n01 and n02 as C gt h n01 np 2 node01 share path to hello_world exe h n02 np 2 node01 share path to hello_world exe 2 For thefirst run of the hello_world executable use cache to cache your password C gt MPI_ROOT bin mpirun cache f appfile Password for M PI runs When typing the password is not echoed to the screen The Platform M PI Remote Launch service must be registered and started on the remote nodes mpi run will authenticated with the service and create processes using your encrypted password to obtain network resources If you do not provide a password the password is incorrect or you use nopass remote processes are created but do not have access to network shares In the following example thehel o0_world exe file cannot be read 3 Analyze hello_world output 90 Platform MPI User s Guide Understanding Platform MPI Platform M PI prints the output from running thehello_world
195. host1 host2 or file name Request a specific list of hosts x exclude host1 host2 or file name Request that a specific list of hosts not be included in the resources allocated to this job label Prepend task number to lines of stdout err For moreinformation on pr un arguments see thepr un manpage Using the prun argument from the mpi r un command line is still supported Implied srun Platform M PI provides an implied s run mode Theimplieds run mode allows the user to omit the srun argument from the mpi r un command line with the use of the environment variable MPI_USESRUN Set the environment variable setenv MPIL_USESRUN 1 Platform M PI inserts the srunargument The following arguments are considered to besr un arguments n N m w X any argument that starts with and is not followed by a space np is translated to n srun is accepted without warning Theimplied sr un mode allows the use of Platform M PI appfiles Currently an appfile must be homogenousin its arguments except for h and np The h and npargumentsin theappfilearediscarded All other arguments are promoted to the mpi r un command line Additionally arguments following are also processed Additional environment variables provided MPI_SRUNOPTIONS Allows additional sr un options to be specified such as label setenv MPlSRUNOPTIONS lt option gt MPI_USESRUN_IGNORE_ARGS Provides an easy way to modify arguments in an a
196. ice IN count number of elements sent IN datatype type of each element handle IN dest rank of destination IN tag message tag IN comm communicator handle OUT request communication request handle Collective communication int MPI_AllgatherL void sendbuf MPI Aint sendcount MPI Datatype Sendtype void recvbuf MPI Aint recvcount MPI Datatype recvtype MPI Comm comm IN sendbuf starting address of send buffer choice IN sendcount number of elements in send buffer IN sendtype data type of send buffer elements handle OUT recvbuf address of receive buffer choice IN recvcount number of elements received from any process IN recvtype data type of receive buffer elements handle IN comm communicator handle int MPI_AllgathervL void sendbuf MPI Aint sendcount MPI Datatype sendtype void recvbuf MPI Aint recvcounts int displs MPI Datatype recvtype MPI Comm comm IN sendbuf starting address of send buffer choice IN sendcount number of elements in send buffer IN sendtype data type of send buffer elements handle OUT recvbuf address of receive buffer choice IN recvcounts Array containing the number of elements that are received from each process IN displs Array of displacements relative to recvbuf IN recvtype data type of receive buffer elements handle IN comm communi cator handle int MPI_AllreduceL void sendbuf void recvbuf MPI _ Aint count MPI Datatype datatype MPI Op op MPI Comm comm IN s
197. ific installed W indows security package is preferred use this flag to indicate that security package on the client This flag is rarely necessary as the client mpi r un and the server Platform M PI Remote Launch service negotiates the security package to be used for authentication token lt token name gt and tg lt token name gt Authenticates to this token with the Platform M PI Remote Launch service Some authentication packages require a token name T he default is no token pass Prompts for adomain account password U sed to authenticate and create remote processes A password is required to allow the remote process to access network resources such as file shares The password provided is encrypted using SSP for authentication The password is not cached when using this option cache Prompts for a domain account password U sed to authenticate and create remote processes A password is required to allow the remote process to access network resources such as file shares The password provided is encrypted using SSP for authentication The password is cached so that future mpi r un commands uses the cached password Passwords are cached in encrypted form using Windows Encryption APIs nopass Executes the mpi run command with no password If a password is cached it is not accessed and no password is used to create the remote processes U sing no password results in the remote processes not having access to network r
198. igin datatype datatype of each entry in origin buffer handle target_rank rank of target nonnegative integer target_disp displacement from window start to the beginning of the target buffer target count number of entries in target buffer target datatype datatype of each entry in target buffer handle win window object used for communication handle int MPI_PutL void origin_addr MPI_Aint origin_count Pl Datatype origin datatype int target_rank MPI Aint arget disp MPI Aint arget_count MPI Datatype target datatype MPI WI N win origin_addr initial address of origin buffer choice origin count number of entries in origin buffer origin datatype datatype of each entry in origin buffer handle target _rank rank of target target disp displacement fromstart of window to target buffer target coun number of entries in target buffer target datatype datatype of each entry in target buffer handle win window object used for communication handle int MPI AccumulateL void origin_addr MPI Aint origin count Pl Datatype origin datatype int target_rank MPI Aint target disp PI Aint target_count MPI Datatype target_datatype MPI Op op PI WIN win origin_addr initial address of buffer choice origin count number of entries in buffer origin_datatype datatype of each buffer entry handle Platform MPI User s Guide 227 Large message APIs IN target _rank rank of target IN target_disp displacement from start of window to beginning of target buffe
199. in appfile mode on the hosts specified in the environment variable Isb_hosts Launches the same executable across multiple hosts U ses the list of hosts in the environment variable LSB_H OSTS Can be used with the np option Isb_mcpu_hosts Launches the same executable across multiple hosts U ses the list of hosts in the environment variable LSB_M CPU_HOSTS Can be used with the np option Isf Launches the same executable across multiple hosts U ses the list of hosts in the environment variable LSB_MCPU_HOSTS and sets M PI_REM SH to useLSF s ssh replacement bl aunch Note 104 Platform MPI User s Guide Understanding Platform MPI blaunch requires LSF 7 0 6 and up Platform M PI integrates features for jobs scheduled and launched through Platform LSF These features require Platform LSF 7 0 6 or later Platform LSF 7 0 6 introduced theb aunch command asan ssh likeremote shell for launching jobs on nodes allocated by LSF Usingb aunch to start remote processes allows for better job accounting and job monitoring through LSF W hen submitting an mpirun job to LSF bs ub either add the sf mpi run command lineoption or set thevariable eM PI_USELSF y in thejob submission environment Thesetwo options are equivalent Setting either of the options automatically sets both the sb_mcpu_hosts mpirun command line option and the M PI_REM SH blaunch environment variablein thempirun environment when thejob is executed Example
200. include 32 64 Linker Additional Dependencies Set tol i bpcmpi 32 1ib orl i bpcmpi 64 1i b depending on the application Additional Library Directories Set to MPI_ ROOT 1 ib Building and running on a Windows 2008 cluster using appfiles The example teaches you the basic compilation and run steps to executehe l o_ world c on acluster with 4 way parallelism Note Specify the bitness using mpi64 or mpi32 for mpi cc to link in the correct libraries Verify you are in the correct bitness compiler window Using mpi64 in a Visual Studio 32 bit command window does not work 1 Create a file appfile for running on nodes n01 and n02 as h n01 np 2 node01 share path to hello_world exe h n02 np 2 node01 share path to hello_world exe 2 For thefirst run of the hello_world executable use cache to cache your password MPI_ROOT bin mpirun cache f appfile Password for MPI runs When typing the password is not echoed to the screen The Platform M PI Remote Launch service must be registered and started on the remote nodes mpi run will authenticated with the service and create processes using your encrypted password to obtain network resources If you do not provide a password the password is incorrect or you use nopass remote processes are created but do not have access to network shares In the following example thehel o_ world exe file cannot be read 3 Analyzehello_world output Platform M PI pr
201. ing and Troubleshooting TotalView multihost example The following example demonstrates how to debug a typical Platform M PI multihost application using TotalV iew including requirements for directory structure and file locations TheM PI application isrepresented by an appfile named my_appf i e which contains the following two lines h local_host np 2 path to programl h remote_host np 2 path to program2 my _appfile resides on the local machine local_host in the work mpi apps total directory To debug this application using T otalV iew do the following In this example T otalV iew is invoked from the local machine 1 Place your binary files in accessible locations path to programl exists on local_host path to program2 exists on remote host To run the application under T otalView the directory layout on your local machine with regard to the M PI executable files must mirror the directory layout on each remote machine Therefore in this case your setup must meet the following additional requirement path to program2 exists on local_host 2 Inthe work mpi apps total directory on local_host invoke T otalV iew by passing the tv option to mpi run MPI_ROOT bin mpirun tv f my_appfile Using the diagnostics library Platform M PI provides a diagnostics library DLIB for advanced run time error checking and analysis DLIB provides the following checks M essagesignature analysis D etectstype mismatchesin
202. ing intrahost exchanges In thiscase theintrahost exchangesuseshared memory between processes mapped to different same host IP addresses To use multiple network interfaces you must specify which M PI processes are associated with each IP address in your appfile For example when you havetwo hosts host 0 and host 1 each communicating using two Ethernet cards ethernet 0 and ethernet 1 you have four host names as follows host0 ethernet0 host0 ethernet host1 ethernetO host1 ethernet1 If your executable is called work exe and uses 64 processes your appfile should contain the following entries h hostO0 ethernetO np 16 work exe h host0 ethernet1 np 16 work exe p h hostl ethernet0 np 16 work exe h hostl ethernetl np 16 work exe Now when the appfile is run 32 processes run on host 0 and 32 processes run on host 1 Select Network Component Type S Click the type of network component you want to install El Client gt Service 3 Protocol Description Services provide additional features such as file and printer sharing Figure 1 Multiple network interfaces H ost 0 processes with rank 0 15 communicate with processes with rank 16 31 through shared memory shmem H ost 0 processes also communicate through the host 0 ethernet 0 and the host 0 ethernet 1 network interfaces with host 1 processes 164 Platform MPI User s Guide Tuning Processor subscription Subscription refers to the match o
203. ing on a Windows 2008 cluster using hostlist Perform Steps 1 and 2 from the previous section Building and Running on a Single H ost 1 Runthe cache password if this is your first run of Platform M PI on thenodeand in this user account Use the hostlist flag to indicate which hosts to run X demo gt MPI_ROOT bin mpirun cache hostlist n01 2 n02 2 hello_world exe Password for MPI runs This example uses the hostlist flag to indicate which nodes to run on Also note that the MPI_WORKDIR isset to your current directory If thisis not a network mapped drive Platform M PI is unable to convert this to aU niversal Naming Convention UNC path and you must specify the full UNC path forhello_ world exe 2 Analyze hello_world output Platform MPI User s Guide 91 Understanding Platform MPI Platform M PI prints the output from running thehello_world executable in non deterministic order The following is an example of the output Hello world m1 of 4 on n01 Hello world I m 3 of 4 on n02 Hello world m0O of 4 on n01 Hello world m2 of 4 on n02 3 Any future Platform M PI runs can now use the cached password Any additional runs of ANY Platform MPI application from the same node and same user account will not require a password X demo gt MPI_ROOT bin mpirun hostlist n01 2 n02 2 hello_world exe Hello world m1 of 4 on n01 Hello world I m 3 of 4 on n02 Hello world m0O of 4 on n01 Hello world m2 of 4 on
204. ings windows IBAL is the desired protocol when using InfiniBand IBAL performance for latency and bandwidth is considerably better than WSD For moreinformation see nterconnect support on page 81 QUESTION When I use mpirun ccp np 2 nodex rank exe l only get one node not two Why ANSWER When usingtheautomaticjob submittal featureof mpi r un np X isused to request thenumber of CPUs for the scheduled job Thisis usually equal to the number of ranks However when using nodex to indicate only one rank node the number of CPUs for the job is greater than the number of ranks Because compute nodes can have different CPUs on each node and mpi r un cannot determine the number of CPUs required until the nodes are allocated to the job the user must provide the total number of CPUs desired for the job Then the nodex flag limits the number of ranks scheduled to just one node In other words np X isthenumber of CPUs for the job and nodex is telling mpi run to only useone CPU node QUESTION WhatisaUNC path Platform MPI User s Guide 243 Frequently Asked Questions ANSWER A Universal Naming Convention UNC path is a path that is visible as a network share on all nodes The basic format is node name exported share folder paths UNC paths are usually required because mapped drives might not be consistent from node to node and many times don t get established for all logon tokens QUESTION 1 am using mpirun automatic job
205. instr Whether mpi r un is invoked on a host where at least one M PI process is running or on a host remote from all M PI processes Platform M PI writes the instrumentation output fileprefix instr tothe working directory on thehostthatisrunningrank 0 when instrumentation for multihost runsis enabled When using ha the output file is located on the host that is running the lowest existing rank number at the time the instrumentation data is gathered during M PI_Finalize The ASCII instrumentation profile provides the version the date your application ran and summarizes information according to application rank and routines The information available in the prefix instr file includes Overhead time The time a process or routine spends inside M PI for example the time a process spends doing message packing or spinning waiting for message arrival Blocking time Thetimeaprocessor routineis blocked waiting for amessageto arrive beforeresuming execution Note Overhead and blocking times are most useful when using e MPI_FLAGS y0 Communication hot spots The processesin your application for which the largest amount of timeis spent in communication M essage bin The range of message sizes in bytes The instrumentation profile reports the number of messages according to message length The following displays the contents of the example report compute _pi instr ASCII Instrumentation Profile Version Platform MPI 01 08 00 0
206. ints the output from running thehello_world executable in non deterministic order The following is an example of the output Hello world m1 of 4 on n01 Hello world m3 of 4 on n02 Hello world m0O of 4 on n01 Hello world m2 of 4 on n02 Platform MPI User s Guide 43 Getting Started Running with an appfile using HPCS Using an appfile with H PCS has been greatly simplified in this release of Platform M PI The previous method of writing a submission script that uses mpi _ nodes exe to dynamically generate an appfile based on the H PCS allocation is still supported H owever the preferred method is to allow mpi run exe to determine which nodes are required for the job by reading the user supplied appfile request those nodes from the H PCS scheduler then submit the job to H PCS when the requested nodes have been allocated U sers write a brief appfile calling out the exact nodes and rank counts needed for the job For example 1 Change to a writable directory 2 Compilethehe o_world executable file MPI_ROOT bin mpicc o hello_world MPI_ROOT help hello_world c 3 Create an appfile for running on nodes n01 and n02 as h n01 np 2 hello_world exe h n02 np 2 hello_world exe 4 Submit thejob to H PCS with the following command X demo gt mpirun hpc f appfile 5 Analyzehello_worl d output Platform M PI prints the output from running thehe I o_wor I d executable in non deterministic order The followi
207. ion Generally we test with the current distributions of RedH at and SuSE Other versions might work but are not tested and are not officially supported QUESTION WhatisMPI_ROOT that see referenced in the documentation ANSWER MPI_ROOT isan environment variablethat Platform M PI mpi r un uses to determine where Platform M PI isinstalled and therefore which executables and libraries to use It is especially helpful when you have multiple versions of Platform M PI installed on a system A typical invocation of Platform M PI on systems with multiple M P _ROOTs installed is setenv MPI_ROOT scratch test platform mpi 2 2 5 MPI_ROOT bin mpirun Or Platform MPI User s Guide 237 Frequently Asked Questions export MPI_ROOT scratch test platform mpi 2 2 5 MPI_ROOT bin mpirun If you only have one copy of Platform M PI installed on the system and itisin opt pl at form_ mpi or opt mpi you do not need to set MPI_ROOT For Windows see the Windows FAQ section QUESTION Can you confirm that Platform M PI is include file compatible with M PICH ANSWER Platform M PI can beused in what werefer to asM PICH compatibility mode In general object files built with the Platform M PI M PICH mode can be used in an MPICH application and conversely object files built under M PICH can belinked into a Platform M PI application using MPICH mode However using M PICH compatibility mode to produce a single executable to run under both
208. ion 73 Windows 36 with and appfile HPCS 44 run MPI on multiple hosts 70 run time utilities Windows 46 utility commands mpiclean 81 mpijob 79 mpirun 73 run time environment variables 118 MPI_2BCOPY 127 MPI_BIND_MAP 126 MPI_COMMD 131 MPI_CPU_AFFINITY 126 MPI_CPU_ SPIN 126 MPI_DLIB_FLAGS 127 MPI_ERROR LEVEL 128 MPI_FLAGS 121 MPI_FLUSH_FCACHE 126 MPI_IB_CARD_ORDER 134 MPI_IB_MULTIRAIL 131 MPI_IB_PKEY 134 MPI_IB_ PORT GID 132 MPI_IBV_QPPARAMS 135 MPI_IC_ORDER 130 MPI_IC_SUFFIXES 131 MPI_INSTR128 MPI_LOCALIP 138 MPI_MAX_REMSH 138 MPI_MAX_WINDOW 127 MPI_MT_FLAGS 125 MPI_NETADDR 138 MPI_NOBACKTRACE 128 MPI_RDMA_INTRALEN 139 MPI_RDMA_MSGSIZE 139 MPI_RDMA_NENVELOPE 140 MPI_REMSH 138 MPI_ROOT 126 MPI_SHMCNTL 125 MPI_SHMEMCNTL 137 MPI_USE_MALLOPT_AVOID_MMAP 138 MPI_VAPIL_QPPARAMS 135 MPI_WORKDIR 126 MPIRUN_OPTIONS 121 run time utility commands 73 95 run time utilities 32 S scalability 145 scan 22 scatter 20 21 secure shell 28 138 select reduction operation 22 send buffer data type of 23 send_receive f 183 sendbuf variable 21 sendcount variable 21 sending data in one operation 16 sendtype variable 21 setting PATH 28 shared libraries 241 shared memory control subdivision of 137 default settings 125 MPI_SHMEMCNTL 137 MPI_SOCKBUFSIZE 142 shared memory default settings 125 shell setting 28 signal propagation 151 single host execution 73 single threaded processes 24 singleton launching 148 sort C 184 source vari
209. j lt numOfEntries 1 j cout lt lt entries i gt getValue lt lt endl int main int argc char argv int myRank numRanks MPI _Init amp argc amp argv MPI Comm rank MP COMM WORLD amp my Rank MPI Comm _size MP COMM WORLD amp numRanks ll Have each rank build its block of entries for the global sort Platform MPI User s Guide 213 Example Applications int numEntries BlockOfEntries aBlock new BlockOfEntries amp numEntries myRank II Compute the total number of entries and sort them numEntries numRanks for int 0 j lt numEntries 2 j ll II Synchronize and then update the shadow entries i MPI Barrier MPI_ COMM_WORLD int recvVal sendVal MPI Request sortRequest MPI Status status ll II Everyone except numRanks 1 posts a receive for the right s rightShadow ll if myRank numRanks 1 MPI Irecv amp recvVal 1 MPI _ INT myRank 1 MPI ANY_TAG MPI _ COMM_WORLD amp sortRequest Everyone except 0 sends its leftEnd to the left HI if myRank 0 sendVal aBlock gt getLeftEnd getValue MPI Send amp sendVal 1 MPI INT myRank 1 1 MPI COMM WORLD if myRank numRanks 1 MPI _Wait amp sortRequest amp status aBlock gt setRightShadow Entry recvVal i II Everyone except 0 posts for the left s leftShadow if myRank 0 MPI Irecv amp recvVal 1 MPI_INT myRank 1 MPI ANY TAG MPI COMM WORLD amp sortRequest I
210. lan4 protocols for Quadrics By default Platform M PI uses Elan collectives for broadcast and barrier If messages are outstanding at the time the Elan collective is entered and the other side of the message enters a completion routine on the outstanding message before entering the collective call it is possible for the application to hang due to lack of message progression while inside the Elan collective This isan uncommon situation in real Platform MPI User s Guide 85 Understanding Platform MPI applications If such hangs are observed disable Elan collectives using the environment variable MPI_USE_LIBELAN 0 Interconnect selection examples The default M PI_IC_ORDER generally results in the fastest available protocol being used The following example uses the default ordering and supplies a netaddr setting in caseT CP IP is theonly interconnect available echo MPILIC_ORDER ibv vapi udapl psm mx gmelan tcp export MPIRUN_SYSTEM_OPTIONS netaddr 192 168 1 0 24 export MPIRUN_OPTIONS prot MPI_ROOT bin mpirun srun n4 a out The command line for the above appears to mpi run aS MP _ROOT bin mpirun netaddr 192 168 1 0 24 prot srun n4 a out and the interconnect decision looks for IBV then VAPI etc down to TCP IP If TCP IP is chosen it uses the 192 168 1 subnet If TCP IP is needed on a machine where other protocols are available the T CP option can be used This exampleis like the previous except TCP i
211. ld If you do not want to build with these scripts we recommend using them with the show option to see what they are doing and usethat as a starting point for doing your build The showoption prints out the command it uses to build with Because these scripts are readable you can examine them to understand what gets linked in and when For Windows see the Windows FAQ section QUESTION How do build a 32 bit application on a 64 bit architecture Platform MPI User s Guide 239 Frequently Asked Questions ANSWER On Linux Platform M PI contains additional libraries in a 32 bit directory for 32 bit builds MPI_ROOT I1i b linux_ia32 Use the mpi32 flag with mpi cc to ensurethat the 32 bit libraries are used Y our specific compiler might require a flag to indicate a 32 bit compilation For example On an Opteron system using gcc you must instruct gcc to generate 32 bit via the flag m32 The mpi 32 is used to ensure 32 bit libraries are selected setenv MPI_ROOT opt platform_mpi setenv MPI_CC gcc MPI_ROOT bin mpicc hello_world c mpi32 m32 file a out a out ELF 32 bit LSB executable Intel 80386 version 1 SYSV for GNU Linux 2 2 dynamically linked uses shared libraries not stripped For more information on running 32 bit applications see N etwork specific on page 241 For Windows see the W indows FAQ section Performance problems QUESTION How does Platform M PI clean up when something goes wrong A
212. lection to use uUDAPL The lowercase and uppercase options are analogous to the Elan options Dynamic linking is required with UDAPL Do not link static psm PSM Explicit command lineinterconnect selection to use QLogic InfiniBand Thelowercase and uppercase options are analogous to the Elan options mx MX Explicit command line interconnect selection to use M yrinet M X The lowercase and uppercase options are analogous to the Elan options gm GM Explicit command line interconnect selection to use M yrinet GM The lowercase and uppercase options are analogous to the Elan options elan ELAN 102 Platform MPI User s Guide Understanding Platform MPI Explicit command line interconnect selection to use Quadrics Elan The lowercase option is taken as advisory and indicates that the interconnect should be used if it is available The uppercase option is taken as mandatory and instructs M PI to abort if the interconnect is unavailable The interaction between these options and the related MPI_IC_ORDER variable is that any command line interconnect selection here is implicitly prepended to MPI_IC_ORDER itapi ITAPI Explicit command line interconnect selection to use TAPI The lowercase and uppercase options are analogous to the Elan options ibal IBAL Explicit command lineinterconnect selection to usethe W indows B Access Layer The lowercase and uppercase options are analogous to the Elan options Platform M PI for Windows
213. lementation contained in theapplication MPL_PAGE_ALIGN_MEM MPI_PAGE_ALIGN_MEM causes the Platform M PI library to page align and page pad memory Thisis for multithreaded InfiniBand support export MPI_PAGE_ALIGN_MEM 1 MPI_PHYSICAL_MEMORY MPI_PHYSICAL_MEMORY allows the user to specify the amount of physical memory in MB available on the system M PI normally attempts to determine the amount of physical memory for the purpose of determining how much memory to pin for RDM A message transfers on InfiniBand and M yrinet GM The value determined by Platform M PI can be displayed using the dd option If Platform M PI specifies an incorrect value for physical memory this environment variable can be used to specify the value explicitly export MPILPHYSICAL_MEMORY 1024 The above example specifies that the system has 1 GB of physical memory MPI_PIN_PERCENTAGE and MPI_PHYSICAL_MEMORY areignored unless InfiniBand or M yrinet GM isin use MPL_RANKMEMSIZE MPI_RANKMEMSIZE d W hered isthetotal bytes of shared memory of the rank Specifies the shared memory for each rank 12 5 is used as generic 87 5 is used as fragments The only way to change this ratio is to use MPI_SHMEMCNTL MPI_RANKM EM SIZE differsfrom M PI_GLOBM EM SIZE which isthetotal shared memory across all ranks on the host M PI_RANKMEM SIZE takes precedence over MPI_GLOBM EMSIZE if both are set M PI_RANKM EMSIZE and MPI_GLOBMEMSIZE are mutually 136 Platform MPI User s Gui
214. lication hangs 240 MPI_Send convert to MPI_Ssend 125 MPI_SHMCNTL 125 MPISHMEMCNTL 137 MPI_SOCKBUFSIZE 142 MPI_SPAWN_SRUNOPTIONS 141 MPI_SRUNOPTIONS 141 MPI_Ssend 18 MPI_TCP_CORECVLIMIT 142 MPI_THREAD_ AFFINITY 61 MPI_THREAD_IGNSELF 61 MPI_Unpublish _name 153 MPI_USE_LIBELAN 143 MPI_USE_LIBELAN_SUB 143 MPI_USE_MALLOPT AVOID_MMAP 138 MPI_USEPRUN 141 MPI_USEPRUN_IGNORE_ARGS 142 MPI_USESRUN 142 MPI_VAPI_QPPARAMS 135 MPI_WORKDIR 126 MPI 2 options 109 mpi32 option 106 mpi64 option 106 mpicc mpich 63 on Windows 51 utility 50 mpiCC utility 50 MPICH object compatibility 63 MPICH compatibility 238 MPICH2 compatibility 65 mpiclean 81 178 mpidiag tool 95 mpiexec 78 99 command line options 79 99 mpif77 utility 50 mpif90 on Windows 52 mpif90 utility 50 MPIHP_Trace_off 157 MPIHP_Trace_on 157 mpijob 79 mpirun 73 95 appfiles 75 mpirun version command 174 MPIRUN_OPTIONS 121 mpirun mpich 63 mpiview file 156 MPMD applications 71 with appfiles 72 with prun 72 with srun 72 multi_par f 183 multilevel parallelism 24 multiple hosts 70 assigning ranks in appfiles 76 communication 76 multiple network interfaces 164 diagram of 164 improve performance 164 using 164 multiple threads 24 167 multiple versions 239 mx option 102 name publishing 153 Native Language Support NLS 154 ndd option 108 netaddr option 104 network high availability 112 network interfaces 164 network selection options 102 NLS 154 NLSPAT
215. ll rights reserved out hello world exe Tlibpath C Program Files x86 Platform MPI 1lib subsystem console 40 Platform MPI User s Guide Getting Started li bpcmpi64 lib li bmpio64 lib hello_world obj 4 Createajob requesting thenumber of CPU sto use Resources are not yet allocated but thejob is given aJOBID number that is printed to stdout gt job new numprocessors 16 Job queued ID 4288 5 Addasingle CPU mpi run task to thenewly created job mpi r un creates moretasks filling the rest of the resources with the compute ranks resulting in a total of 16 compute ranks for this example gt job add 4288 numprocessors 1 stdout nodelpathitola sharead file out stderr node path to a shared file err MPI_ROOT bin mpirun ccp nodelpath to hello_world exe 6 Submit the job The machine resources are allocated and the job is run gt job submit id 4288 Building and running MPMD applications on Windows HPCS Torun Multiple Program M ultiple Data M PM D applications or other more complex configurations that require further control over the application layout or environment use an appfile to submit the Platform M PI job through the H PCS scheduler Create the appfile indicating the node for the ranks using the h lt node flag and the rank count for the given node using the n X flag Ranks are laid out in the order they appear in the appfile Submit the job usingmpi run ccp f lt appfile gt For this
216. lock but consider Idom load average packed Bind all ranks to sameldom as lowest rank slurm slurm binding Il least loaded II Bind each rank to Idoms it is running on map_ldom Schedule ranks on Idoms in cyclic distribution through M AP variable To generate the current supported options mpirun cpu_bind help a out Environment variables for CPU binding 60 Platform MPI User s Guide Understanding Platform MPI MPI_BIND_MAP allows specification of the integer CPU numbers Idom numbers or CPU masks These are alist of integers separated by commas MPI_CPU_AFFINITY isan alternativemethod to using cpu_bindon thecommand linefor specifying binding strategy The possible settings are LL RANK MAP_CPU MASK_CPU LDOM CYCLIC BLOCK RR FILL PACKED SLURM AND MAP_LDOM MPI_CPU_SPIN allows selection of spin value The default is 2 seconds This valueis used to let busy processes spin so that the operating system schedules processes to processors The processes bind themselves to the relevant processor or core or Idom For example the following selects a 4 second spin period to allow 32 M PI ranks processes to settle into place and then bind to the appropriate processor core Idom mpirun e MPl_CPU_SPIN 4 cpu_bind np 32 linpack MPI_FLUSH_FCACHE can beset to a threshold percent of memory 0 100 which if the file cache currently in use meets or exceeds initiates a flush attempt after binding and essentially befo
217. ly pinned down for each process sending or receiving The maximum number of fragments that can be pinned down for a process is 2 N The default value of N is 128 MPLRDMA_NONESIDED MPI_RDMA_NONESIDED N Specifies the number of one sided operations that can be posted concurrently for each rank regardless of the destination The default is 8 MPL RDMA_NSRQRECV MPI_RDMA_NSRQRECV K Specifies the number of receiving buffers used when the shared receiving queue is used where K is the number of receiving buffers If N is the number of off host connections from a rank the default value is calculated as the smaller of the values N x8 and 2048 In the above example the number of receiving buffers is calculated as 8 times the number of off host connections If this number is greater than 2048 the maximum number used is 2048 140 Platform MPI User s Guide Understanding Platform MPI prun srun environment variables MPI_PROT_BRIEF Disables the printing of the host name or IP address and the rank mappings when prot is specified in the mpi r un command line In normal cases thatis when all of theon nodeand off noderankscommunicateusingthesameprotocol only two lines are displayed otherwise the entire matrix displays This allows you to see when abnormal or unexpected protocols are being used MPL_PROT_MAX Specifies the maximum number of columns and rows displayed in the pr ot output table This number corresponds to the number of
218. m where sendbuf Specifies the starting address of the send buffer sendcount Specifies the number of elements sent to each process sendtype Denotes the datatype of the send buffer Platform MPI User s Guide 21 Introduction recvbuf Specifies the address of the receive buffer recvcount Indicates the number of elements in the receive buffer recvtype Indicates the datatype of the receive buffer elements root Denotes the rank of the sending process comm Designates the communication context that identifies a group of processes Computation Computational operations perform global reduction operations such assum max min product or user defined functions across members of a group Global reduction functions include Reduce Returns the result of a reduction at one node All reduce Returns the result of a reduction at all nodes Reduce Scatter Combines the functionality of reduce and scatter operations Scan Performs a prefix reduction on data distributed across a group Section 4 9 Global Reduction Operationsin the M PI 1 0 standard describes each function in detail Reduction operations are binary and are only valid on numeric data Reductions are always associative but might or might not be commutative You can select a reduction operation from a defined list seesection 4 9 2 in the M PI 1 0 standard or you can define your own operation The operations are invoked by placing the operation name for example M
219. m comm MPI Request request IN buf initial address of send buffer choice IN count number of elements in send buffer IN datatype datatype of each send buffer element handle IN dest rank of destination IN tag message tag IN comm communicator OUT request communication request int MPI RecvL void buf MPI Aint count MPI Datatype datatype int Source int tag MPI Comm comm MPI Status status OUT buf initial address of receive buffer choice IN count number of elements in receive buffer IN datatype datatype of each receive buffer element handle IN source rank of source IN tag message tag IN comm 220 Platform MPI User s Guide Large message APIs communicator handle OUT status status object Status int MPI_Recv_initL void buf MPI_Aint count MPI_Datatype datatype int source int tag MPI_Comm comm MPI Request t reques OUT buf initial address of receive buffer choice IN count number of elements received non negative integer IN datatype type of each element handle IN source rank of source or MPI ANY_SOURCE integer IN tag message tag or MPI ANY_TAG integer IN comm communicator handle OUT request communication request handle int MPI _RsendL void buf MPI Aint count MPI Datatype datatype int dest int tag MPI_Comm comm N buf initial address of send buffer choice IN count number of elements in send buffer IN datatype datatype of each send buffer element handle IN dest rank of destina
220. mbed or generatempi r un invocations deeply in theapplication Platform MPI User s Guide 141 Understanding Platform MPI MPI_USEPRUN_IGNORE_ARGS Providesan easy way to modify thearguments contained in an appfileby supplying alist of space separated arguments that mpi run should ignore setenv MPI_USEPRUN_IGNORE_ARGS lt option gt MPLUSESRUN Platform M PI provides the capability to automatically assume that s r un is the default launching mechanism T hismodeof operation automatically classifies argumentsintos run andmpi run arguments and correctly places them on the command line The assumed s r un mode also allows appfiles to be interpreted for command line arguments and translated into s r un mode Theimplieds run method of launching is useful for applications that enbed or generate their mpi run invocations deeply within the application This allows existing ports of an application from a Platform M PI supported platform to H P XC MPI_USESRUN_IGNORE_ARGS Providesan easy way to modify thearguments contained in an appfileby supplyingalist of space separated arguments that mpi run should ignore setenv MPIL_USESRUN_IGNORE_ARGS lt option gt Intheexamplebelow thecommand linecontainsareferenceto stdio bnonewhich isfiltered out because it is set in the ignore list setenv MPILUSESRUN_VERBOSE 1 setenv MPILUSESRUN_IGNORE_ARGS stdio bnone setenv MPI_USESRUN 1 setenv MPI_SRUNOPTION label bsub I n4 ext SLUR
221. mm where comm Identifies a group of processes and a communication context For example cart usesMPI_ Barrier to synchronize data before printing MPI data types and packing You can use predefined datatypes for example MPI_INT in C to transfer data between two processes using point to point communication This transfer is based on the assumption that the data transferred is stored in contiguous memory for example sending an array in aC or Fortran application To transfer data that is not homogeneous such as astructure or to transfer data that is not contiguous in memory such as an array section you can use derived datatypes or packing and unpacking functions Derived datatypes Specifies a sequence of basic datatypes and integer displacements describing the data layout in memory Y ou can use user defined datatypes or predefined datatypes in M PI communication functions Packing and unpacking functions Platform MPI User s Guide 23 Introduction Provides MPI_Pack and MPI_Unpack functions so a sending process can pack noncontiguous data into a contiguous buffer and a receiving process can unpack data received in a contiguous buffer and store it in noncontiguous locations Using derived datatypes is more efficient than using MPI_Pack and MPI_Unpack H owever derived datatypes cannot handlethecasewherethedatalayout varies and isunknown by thereceiver for example messages that embed their own layout description
222. mp argv PI Comm rank MPI COMM WORLD amp rank P Comm size MPI COMM WORLD amp size rtn pthread_create amp mtid 0 server void amp rank ii rim te 0 4 printf pthread create failed n MPI Abort MPI COMM WORLD MPI _ERR_ OTHER client rank size shut down_servers rank rtn pthread join mtid 0 iy rim te O48 printf pthread join failed n MPI Abort MPI COMM WORLD MPI ERR_ OTHER MPI Finalize exit 0 208 Platform MPI User s Guide thread_safe output The output from running the thread_safe executable is shown below The application was run with np 2 Senven Sievi en SIEVE SHCHVICS Siemen Ba eir Senven Sieve SaChaViest Sienen SHCHVICS STeVe Senven Senven OL Weir Sieiniven SICHAVICH STEEVE Seaven Senven SiCiaVies SHCIVICH SKEIVE STEEVE Senven Seven SiCiaViea SICH ViCsT SICHAVICS SLEEVE Senven Sieve Sam weir SHCHEVICS SHCHVICS SECiaviesh Sieve Senven SIEVE SHCUVECST SICHAVICH a eir Senven Sienven SiCiavies SHCHVICS SICHVICS STEEVE Senven Senven SiChaVicah SICH VCH Siemen Ba veir Senven Sieve Sam weir Sienen Sienen STeVe Senven Senven SECHaVics SCHAVICS SICHAVICH STeVe Seven 2Or OF 2D Or OF OF 200 DODOR PR OPR POR OOOOOOOCrH 2Or OF a O ooooco processed processed processed processed processed processed processed processed processed processed processed processed processed pr
223. mpids that the job uses which is typically the number of hosts when block scheduling is used but can be up to the number of ranks if cyclic scheduling is used Regardless of size the pr ot output tableis always displayed when not all of theinter nodeor intra node communications use the same communication protocol MPI_PRUNOPTIONS Allows pr un specific options to be added automatically to the mpi r un command line For example export MPlLPRUNOPTIONS m cyclic x host0 mpirun prot prun n2 a out is equivalent to mpirun prot prun m cyclic x host0 n2 a out MPI_SPAWN_PRUNOPTIONS Allows pr un options to be implicitly added to the launch command when SPAWN functionality is used to create new ranks with prun MPI_SPAWN_SRUNOPTIONS Allowssrun options to be implicitly added to the launch command when SPAWN functionality is used to create new rankswithsrun MPI_SRUNOPTIONS Allows additional sr un options to be specified such as label setenv MPI_SRUNOPTIONS lt option gt MPI_USEPRUN Platform M PI provides the capability to automatically assume that pr un is the default launching mechanism T hismodeof operation automatically classifies argumentsintoprun andmpi run arguments and correctly places them on the command line The assumed pr un mode also allows appfiles to be interpreted for command line arguments and translated into pr un mode Theimplied pr un method of launching is useful for applications that e
224. mpute_pi f using Intel Fortran without utilizing the mpi f 90 tool from a command prompt that has the appropriate environment settings loaded for your Fortran compiler C gt ifort compute_pi f I MPI_ROOT include 64 link out compute_pi exe libpath MPI_ROOT lib subsystem console libhpmpi64 lib Note Intel compilers often link against the Intel run time libraries When running an MPI application built with the Intel Fortran or C C compilers you might need to install the Intel run time libraries on every node of your cluster We recommend that you install the version of the Intel run time libraries that correspond to the version of the compiler used on the MPI application Autodouble automatic promotion Platform M PI supports automatic promotion of Fortran datatypes using any of the following arguments some of which are not supported on all Fortran compilers 1 integer_size 64 1418 i8 real_size 64 4R8 Qautodouble 7 r8 DNPWN If these flags are given to thempi f 90 bat script at link time then the application will be linked enabling Platform M PI to interpret the datatype M PI_REAL as 8 bytes etc as appropriate at runtime However if your application is written to explicitly handle the autodoubled datatypes for example if a variable is declared real and the code is compiled r8 and corresponding M PI calls are given MPI_DOUBLE for the datatype then the autodouble related command line argume
225. multiple hosts Can be used with the np option This host list can be delimited with spaces or commas H osts can be followed with an optional rank count which is delimited from the host name with aspace or colon If spaces are used as delimiters in the host list it might be necessary to place the entire host list inside quotes to prevent the command shell from interpreting it as multiple options hhost Specifies a host on which to start the processes default is local_ host Only applicable when running in single host mode mpi run np Seethe hostlist option which provides more flexibility np Specifies the number of processes to run Generally used in single host mode but also valid with hostfile hostlist Isb_hosts and Isb_mcpu_hosts stdio options Specifies standard 1O options This applies to appfiles only Process placement cpu_bind Binds arank to an Idom to prevent aprocessfrom moving to adifferent Idom after start up Application bitness specification mpi32 Option for running on Opteron and Intel64 Should be used to indicate the bitness of the application to beinvoked so that the availability of interconnect libraries can be properly determined by the Platform M PI utilities mpi run and mpi d The default is mpi64 mpi64 Option for running on Opteron and Intel64 Should be used to indicate the bitness of the application to beinvoked so that the availability of interconnect libraries can be
226. myserver 0014c2c1f34a DAEMON HPQ INCREMENT platform mpi Isf_ld 1 0 permanent 8 9A40ECDE2A38 NOTICE License Number AAAABBBB1111 SI GN E5CEDE3E5626 SERVER myserver 0014c2c1f34a DAEMON HPQ INCREMENT platform mpi Isf_ld 1 0 permanent 16 BE468B74B592 NOTI CE Li cense Number AAAABBBB2222 SI GN 9AB4034C6CB2 The result is a valid license for 24 ranks Redundant license servers Platform M PI supports redundant license servers using the FLEX net Publisher licensing software Three servers can create a redundant license server network For a license checkout request to be granted at least two servers must be running and able to communicate with each other This avoids a single license server failure which would prevent new Platform M PI jobs from starting With three server redundant licensing the full number of Platform M PI licenses can be used by a single job W hen selecting redundant license servers use stable nodes that arenot rebooted or shutdown frequently The redundant license servers exchange heartbeats Disruptions to that communication can cause the license servers to stop serving licenses Platform MPI User s Guide 33 Getting Started The redundant license servers must be on the same subnet as the Platform M PI compute nodes They do not have to be running the same version of operating system as the Platform M PI compute nodes but it is recommended Each server in the redundant network must be listed in the Platform M
227. n For applications that consist of multiple programs or that run on multiple hosts here is a list of common options For a complete list see the mpi r un manpage mpirun help version djpv ck t spec i spec commd tv f appfile extra_args_for_appfile Where extra_args for_appfile specifies extra arguments to be applied to the programs listed in the appfile This is a space separated list of arguments Use this option at the end of a command lineto append extra arguments to each line of your appfile These extra arguments also apply to spawned applications if specified on the mpi r un command line In this case each program in the application is listed in a file called an appfile Platform MPI User s Guide 73 Understanding Platform MPI For example MPI_ROOT bin mpirun f my_appfile runs using an appfile named my_appfile that might have contents such as h hostA np 2 path to a out h hostB np 2 path to a out which specify that two ranks are to run on host A and two on host B prun execution Use the prun option for applications that run on the Quadrics Elan interconnect When using the prun option mpi r un sets environment variables and invokes pr un utilities The prun argument to mpi r un specifies that the pr un command isto be used for launching All arguments following prun are passed unmodified to thepr un command MPI_ROOT bin mpirun lt mpirun options gt prun lt prun o
228. n M PI_INIT allowing time for the user to log in to each node running the M PI application and attach a debugger to each process Setting the global variable mpi_debug_cont to a non zero value in the debugger will allow that process to continue This is similar to the debugging methods described in thempidebug 1 manpage except that dbgspin requirestheuser to launch and attach the debuggers manually Specifies that the application runs with the T otalV iew debugger RDMA control options dd ndd rdma srq xrc Uses deferred deregistration when registering and deregistering memory forRDMA message transfers The default is to use deferred deregistration Note that using this option also produces a statistical summary of the deferred deregistration activity when MPI_Finalizeis called The option is ignored if the underlying interconnect does not usean RDM A transfer mechanism or if the deferred deregistration is managed directly by the interconnect library Occasionally deferred deregistration is incompatible with an application or negatively impacts performance U se ndd to disable this feature Deferred deregistration of memory on RDMA networks is not supported on Platform MPI for Windows Disables the use of deferred deregistration For more information see the dd option Specifies the use of envelope pairs for short message transfer The prepinned memory increases proportionally to the number of off host ranks in the
229. n at the end of each outer loop For distributed shared memory architectures the mix of the two methods can be effective The sample program implements the twisted data layout method with M PI and the row and column block 200 Platform MPI User s Guide Example Applications partitioning method with OPEN MP thread directives In the first case the data dependency is easily satisfied because each thread computes down a different set of columns In the second case we still want to computedown thecolumns for cachereasons butto satisfy the data dependency each thread computes a different portion of the same column and the threads work left to right across the rows together implicit none include mpif h integer nrow of rows nteger ncol of columns parameter nrow 1000 ncol 1000 double precision array nrow ncol compute region integer blk block iteration counter nteger rb row block number nteger cb column block number nteger nrb next row block number nteger ncb next column block number nteger rbs row block start subscripts nteger rbe row block end subscripts nteger cbs column block start subscripts nteger cbhe column block end subscripts integer rdtype row block communication datatypes integer cdtype column block communication datatypes nteger twdtype twisted distribution datatypes nteger ablen array of block lengths nteger adisp array of di
230. n of standard flexibility cceeceeeeeeeeeeeseeeeeeeseeeeeesees 229 Appendix D mpirun Using Implied prun or SPUN o oo eee eeeeee eee eenee eee eeeetaeeeeeeeeaaeeeeeeeeaaeeeeeeeeaaeeeeenenaaes 231 IMPS CN PUN ca temas aueatcsaienpaass auuueenad ay laciebee ts vinicsine seats vateseesaws ainincelesensainingdeteaaah abnes 231 Implied Srur ocmi n a terre cect rersrrereeerr errr cre 232 Appendix E Frequently Asked Questions cceeeceeeeeeeeeeeeeeeceeeeeesaaeeeeeaeeeeseseeeseaaeeseeaeeeesaeeeeeaes 237 Generali arso eetri anaie a aE wales dasect aladeteasah alias 237 Instalatom and SOtup arenae ON ANR 238 B illding ApPlIGAtlONS sccesssiensceleesenessidetassatvssectcesssedeaacleteunsancdaaedias evetttidenaniesiaetceaaentriaiss 239 Performance problems serried inen a EEEE E 240 INGIWONMCSPOCINIC sinioreso ianuae ietcnseabiadecanseduinedenbanieibaceatoed dedestenaitacedacansiwes 241 WINDOWS SPOGCITIC siirduski a R a a a aaa E 242 Appendix F GIOSSAny ennei E R 245 4 Platform MPI User s Guide About This Guide This guide describes the Platform M PI implementation of the M essage Passing Interface M PI standard This guide helps you use Platform M PI to develop and run parallel applications You should have experience developing UNIX applications You should also understand the basic concepts behind parallel processing be familiar with M PI and with the M PI 1 2 and M PI 2 standards MPI A M essage Passing In
231. n the job HOST H ost where the job is running PID Process identifier for each process in the job LIVE 80 Platform MPI User s Guide Understanding Platform MPI Whether the process is running an x is used or has been terminated PROGNAME Program names used in the Platform M PI application mpi j ob doesnot support pr un orsrun start up mpi j ob isnot available on Platform MPI V 1 0 for Windows mpiclean help V M j id mpi cl ean kills processes in Platform M PI applications started in appfile mode Invoke mpiclean on the host where you initiated mpirun The MPI library checks for abnormal termination of processes while your application is running In some cases application bugs can cause processes to deadlock and linger in the system W hen this occurs you can use mpijob to identify hung jobs and mpi cl ean to kill all processes in the hung application mpi clean syntax has two forms 1l mpiclean help v j id id id 2 mpiclean help v m where Prints usage information for the utility Turns on verbose mode Cleans up shared memory segments Kills the processes of job number ID Y ou can specify multiple job Dsin a space separated list Obtain the job ID using the j option when you invoke mpirun You can only kill jobs that are your own The second syntax is used when an application aborts during M PI_ Init and the termination of processes does not destroy the allocated share
232. names of your machines 1 Editthe rhosts fileon hosts j awbone 70 Platform MPI User s Guide Understanding Platform MPI and wizard Add an entry for wizard inthe rhosts fileon j awbone and an entry for j awbone inthe rhosts fileon wizard In addition to theentriesin the r hosts file besurethe correct commands and permissions are set up in the ssh shell configuration file on all hosts so you can start your remote processes 2 Configure ssh so you can ssh into the machines without typing a password 3 Besurethe executable is accessible from each host by placing it in a shared directory or by copying it to a local directory on each host 4 Create an appfile An appfileis a text file that contains process counts and alist of programs In this example create an appfilenamed my_appf il e containing the following lines h jawbone np 2 path to hello_world h wizard np 2 path to hello_world The appfile file should contain a separate line for each host Each line specifies the name of the executable file and the number of processes to run on thehost The h option is followed by thename of the host where the specified processes must be run Instead of using the host name you can useits IP address 5 Runthehello_worl d executable file MPI_ROOT bin mpirun f my_appfile The f option specifies the file name that follows it is an appfile mpi run parses the appfile line by line for the information to run the pro
233. names or commands with paths that have spaces The default Platform M PI installation location includes spaces C Program Files x86 Platform MPI bin mpi run or AMPI ROOT bi n mpi run When starting multihost applications using appfiles on Windows 2003 XP verify the following Platform M PI Remote Launch service is registered and started on all remote nodes Check this by accessing thelist of W indowsservicesthrough Administrator T ools gt Services Look for the Platform M PI Remote Launch service Platform M PI isinstalled in the same location on all remote nodes All Platform M PI libraries and binaries must bein the same M PI_ROOT Application binaries are accessible from remote nodes If the binaries are located on a file share use theUNC path i e node share pat h to refer to thebinary becausethese might not be properly mapped to a drive letter by the authenticated logon token If a password is not already cached use the cache option for your first run or use the pass option on all runs so the remote service can authenticate with network resources W ithout these options or using nopass remote processes cannot access network shares If problems occur when trying to launch remote processes use the mpi di ag tool to verify remote authentication and access Also view the event logs to see if the service is issuing errors Don t forget to use quotation marks for file names or commands with paths that have spaces The de
234. nce For tasks that require collective operations use the relevant M PI collective routine Platform M PI takes advantage of shared memory to perform efficient data movement and maximize your application s communication performance Multilevel parallelism Consider the following to improve the performance of applications that use multilevel parallelism UsetheM PI library to providecoarse grained parallelism and a parallelizing compiler to providefine grained that is thread based parallelism A mix of coarse and fine grained parallelism provides better overall performance Assign only onemultithreaded process per host when placing application processes T his ensures that enough processors are available as different process threads become active Coding considerations The following are suggestions and items to consider when coding your M PI applications to improve performance UsePlatform M PI collectiveroutinesinstead of coding your own with point to point routines because Platform M PI s collective routines are optimized to use shared memory where possible for performance Use commutative M PI reduction operations UsetheM PI predefined reduction operations whenever possible because they are optimized W hen defining reduction operations makethem commutative CommutativeoperationsgiveM PI more options when ordering operations allowing it to select an order that leads to best performance UseM PI derived data types when you excha
235. nd subnet manager defines one of the server s links as primary and one as redundant alternate W hen the primary link fails the channel adapter automatically redirects traffic to the redundant path when a link failure is detected This support is provided by the InfiniBand driver availablein OFED 1 2 and later 84 Platform MPI User s Guide Understanding Platform MPI releases Redirection and reissued communications are performed transparently to applications running on the cluster The user has to explicitly enable APM by setting the environment variable MPI_HA_NW_PORT_FAILOVER 1 asin the following example opt platform_mpi bin mpirun np 4 prot e MPL_HA_NW_PORT_FAILOVER 1 hostlist nodea nodeb nodec noded my dir hello_world When the M PI_HA_NW_PORT_FAILOVER environment variable is set Platform M PI identifies and specifies the primary and the alternate paths if available when it sets up the communication channels between the ranks It also requests the InfiniBand driver to load the alternate path for a potential path migration if anetwork failureoccurs W hen a network failureoccurs the InfiniBand driver automatically transitions to the alternate path notifies Platform M PI of the path migration and continues the network communication on the alternate path At this point Platform M PI also reloads the original primary path as the new alternate path If this new alternate path is restored this will allow for the InfiniBand driver
236. ndows man pages The manpages are located in the MPI _ROOT man subdirectory for W indows They can be grouped into three categories general compilation and run time One general manpage M P1 1 is an overview Platform MPI User s Guide 45 Getting Started describing general features of Platform M PI The compilation and run time manpages describe Platform MPI utilities The following table describes the three categories of manpages in the man1 subdirectory that comprise manpages for Platform M PI utilities Table 9 Windows man page categories Category manpages Description General MPI 1 Describes the general features of Platform MPI ilati lao D i h ilabl ilati iliti Compilation mpif90 1 escribes the available compilation utilities mpidebug 1 mpienv 1 Run time mpimtsafe 1 Describes run time utilities environment variables debugging mpirun 1 thread safe and diagnostic libraries mpistdio 1 autodbl 1 Licensing policy for Windows Platform M PI for Windows uses FlexN et Publishing formerly FLEXIm licensing technology A license isrequired to usePlatform M PI for Windows Licenses can becan beacquired from Platform Computing Platform M PI hasan Independent SoftwareV endor ISV program that allows participating ISV sto freely distribute Platform M PI with their applications When the application is part of the Platform M PI ISV program there is no licensing requirement for the user The ISV provides a licensed co
237. need the profiling interfaceto write your routines Platform M PI makes use of M PI profiling interface mechanisms to provide the diagnostic library for debugging In addition Platform M PI provides tracing and lightweight counter instrumentation The profiling interface allows you to intercept calls made by the user program to the M PI library For example you might want to measure the time spent in each call to a specific library routine or to create alog file You can collect information of interest and then call the underlying M PI implementation through an alternate entry point as described below Routines in the Platform M PI library begin with the M PI_ prefix Consistent with the Profiling Interface section of the M PI 1 2 standard routines are also accessible using the PM PI_ prefix for example MPI_Send and PMPI_Send access the same routine To use the profiling interface write wrapper versions of the M PI library routines you want the linker to intercept These wrapper routines collect data for some statistic or perform some other action The wrapper then calls the M PI library routine using the PM PI_ prefix Fortran profiling interface When writing profiling routines do not call Fortran entry points from C profiling routines and visa versa To profile Fortran routines separate wrappers must be written For example include lt stdio h gt include lt mpi h gt int MPI Send void buf int count MPI Datatype type in
238. network vendors for InfiniBand and others to only provide 64 bit libraries for their network Platform M PI makes its decision about what interconnect to use beforeit knows the application s bitness To haveproper network selection in that case specify if the application is 32 bit when running on Opteron and Intel64 machines MPI_ROOT bin mpirun mpi32 ping_pong_ring c Windows Often clusters might have Ethernet and some form of higher speed interconnect such as InfiniBand This section describes how to use the ping_pong_ring c example program to confirm that you can run using the interconnect Running atest like this especially on a new cluster is useful to ensure that the correct network drivers are installed and that network hardware is functioning properly If any machine has defective network cards or cables this test can also be useful for identifying which machine has the problem To compile the program set theM PI_ROOTenvironment variable to the location of Platform MPI The defaultis C Program Files x86 Platform MPI for 64 bit systems and C Program Files Platform MPI for 32 bit systems This might already be set by the Platform M PI installation Open a command window for the compiler you plan on using This includes all libraries and compilers in path and compile the program using the mpi cc wrappers gt MPI_ROOT bin mpicc mpi64 out pp exe MPI_ROOT help ping_ping_ring c Use the start up for your
239. ng is an example of the output Hello world m2 of 4 on n02 Hello world mi1 of 4 on n01 Hello world m0O of 4 on n01 Hello world I m 3 of 4 on n02 M ore information about using appfiles is available in Chapter 3 of the Platform M PI User s Guide Building and running on a Windows 2003 XP cluster using appfiles The following example shows the basic compilation and run steps to execute hello_world c on a cluster with 4 way parallelism To build and run hel o_ worl d c on a cluster using an appfile Note Specify the bitness using mpi64 or mpi32 for mpi cc to link in the correct libraries Verify you are in the correct bitness compiler window Using mpi64 in a Visual Studio 32 bit command window will not work 1 Change to a writable directory 2 Compilethehe o_wor d executable file gt MPI_ROOT bin mpicc o hello_world MPIl_ROOT help hello_world c 3 Create thefileappf ile for running on nodes n01 and n02 as h n01 np 2 node01 share path to hello_world exe h n02 np 2 node01 share path to hello_world exe 4 For thefirst run of the hello_world 44 Platform MPI User s Guide Getting Started executable use cache to cache your password gt MPI_ROOT bin mpirun cache f appfile Password for MPI runs When typing the password is not echoed to the screen The Platform M PI Remote Launch service must be registered and started on the remote nodes mpi r un authenticates with the service and crea
240. ng message fragment size If the message is greater than b the message is fragmented into pieces up to cin length or actual length if less than c and the corresponding piece of the user s buffer is pinned directly The default is 4194304 bytes but on M yrinet GM and BAL the default is 1048576 bytes When deferred deregistration is used pinning memory is fast Therefore the default setting for MPI_RDMA_MSGSIZE is 16384 16384 4194304 which means any message over 16384 bytes is pinned for direct usein RDMA operations However if deferred deregistration is not used ndd then pinning memory is expensive In that case the default setting for MPI_RDMA_MSGSIZE is 16384 262144 4194304 which means messages larger than 16384 and smaller than or equal to 262144 bytes are copied into pre pinned memory using Platform M PI middle message protocol rather than being pinned and used in RDMA operations directly The middle message protocol performs better than the long message protocol if deferred deregistration isnot used For more information seethe M PI_RDMA_MSGSIZE section of thempi env manpage MPLRDMA_NENVELOPE MPI_RDMA_NENVELOPE N Specifies the number of short message envelope pairs for each connection if RDMA protocol is used whereN isthe number of envelope pairs The default is from 8 to 128 depending on the number of ranks MPL RDMA_NFRAGMENT MPI_RDMA_NFRAGMENT N Specifies the number of long message fragments that can be concurrent
241. ngeseveral small size messages that haveno dependencies Minimize your use of M PI_Test polling schemes to reduce polling overhead Code your applications to avoid unnecessary synchronization Avoid M PI_Barrier calls Typically an application can be modified to achieve the same result using targeted synchronization instead of collective calls For example in many cases a token passing ring can be used to achieve the same coordination as a loop of barrier calls Using HP Caliper HP Caliper is a general purpose performance analysis tool for applications processes and systems HP Caliper allows you to understand the performanceand execution of an application and identify ways to improve run time performance Note Platform MPI User s Guide 167 Tuning When running Platform MPI applications under HP Caliper on Linux hosts it might be necessary to set the HPMPI_NOPROPAGATE_SUSP environment variable to prevent application aborts setenv HPMPI_NOPROPAGATE_SUSP 1 export HPMPI_NOPROPAGATE_SUSP 1 168 Platform MPI User s Guide Debugging and Troubleshooting This chapter describes debugging and troubleshooting Platform M PI applications Platform MPI User s Guide 169 Debugging and Troubleshooting Debugging Platform MPI applications Platform M PI allows you to use single process debuggers to debug applications The available debuggers areADB DDE XDB WDB GDB and PATHDB To access these debuggers set options in th
242. ngle M PI process is placed to the standard out of mpi run after bytes of output have been accumulated Platform MPI User s Guide 177 Debugging and Troubleshooting bnone gt 0 The same as b except that the buffer is flushed when itis full and when it is found to contain data Essentially provides no buffering from the user s perspective bline gt 0 Displays the output of a process after a line feed is encountered or if the byte buffer is full The default value of in all cases is 10 KB The following option is available for prepending Enables prepending The global rank of the originating process is prepended to stdout and stderr output Although this mode can be combined with any buffering mode prepending makes the most sense with the modes b and bline The following option is available for combining repeated output r gt 1 Combines repeated identical output from the same process by prepending a multiplier to the beginning of the output At most maximum repeated outputs are accumulated without display This option is used only with bline The default value of is infinity The following options are available for using file settings files Specifies that the standard input output and error of each rank is to be taken from the files specified by the environment variables M P _STDIO_INFILE MPI_STDIO_OUTFILE andMPI_STDIO_ERRFILE If theseenvironment variables are not set dev nul OrNUL is used In addition the
243. nticates with the remote service and returns remote system information including node name CPU count and username ps username Authenticates with theremote service and lists processes running on theremotesystem If ausername is included only that user s processes are listed dir lt path gt Authenticates with the remote service and lists the files for the given path Thisisa useful tool to determineif access to network shares are available to the authenticated user sdir lt path gt 98 Platform MPI User s Guide Understanding Platform MPI Same as dir but lists a single file No directory contents are listed Only the directory is listed if accessible kill lt pid gt Authenticates with remote service and terminates the remote process indicated by the pid The process is terminated as the authenticated user So if the user does not have permission to terminate the indicated process the process will not be terminated Note mpi diag authentication options are the same as mpi r un authentication options These include pass cache clearcache iscached token tg package pk mpiexec The M PI 2 standard defines mpi exec asa simple method to start M PI applications It supports fewer features than mpi run but itis portable mpi exec syntax has three formats mpi exec offers arguments similar to aM P _Comm_spawn call with arguments as shown in the following form mpiexec mpiexec options command command
244. nts should not be passed to mpi f 90 bat atlink time becausethat would cause thedatatypes to be automatically changed Building and running on a single host The following example describes the basic compilation and run stepsto executehe o_world c on your local host with 4 way parallelism To build and run hel o_ world c on a local host named banachl 1 Changeto a writable directory and copy hello_world c from the heap directory C gt copy MPI_ROOT help hello_world c 2 Compilethe hello_world executable file In a proper compiler command window for example Visual Studio command window use mpi cc to compile your program C gt MPIROOT bin mpicc mpi64 hello_world c Note Specify the bitness using mpi64 or mpi32 for mpi cc to link in the correct libraries Verify you are in the correct bitness compiler window Using mpi64 in a Visual Studio 32 bit command window does not work 38 Platform MPI User s Guide 3 Getting Started Run the hello_world executable file C gt MPI_ROOT bin mpirun np 4 hello_world exe where np 4 specifies 4 as the number of processors to run Analyze hello_world output Platform M PI prints the output from running thehello_world executable in non deterministic order The following is an example of the output Hello world m1 of 4 on banachl Hello world m3 of 4 on banachl Hello world m0O of 4 on banachl Hello world m2 of 4 on banachl Buildin
245. o MPI_Ssend without rewriting your code MPI_MT_FLAGS MPI_MT_FLAGScontrolsrun timeoptions when you usethethread compliant version of Platform M PI TheMPI_MT_FLAGS syntax is a comma separated list as follows ct single fun serial mult The following is a description of each flag ct single fun Creates a hidden communication thread for each rank in the job W hen you enable this option do not oversubscribe your system For example if you enablect for a 16 process application running on a 16 way machine the result is a 32 way job Asserts that only one thread executes Platform MPI User s Guide 125 Understanding Platform MPI Asserts that a process can be multithreaded but only the main thread makes M PI calls that is all calls are funneled to the main thread serial Asserts that a process can be multithreaded and multiple threads can make M PI calls but calls are serialized that is only one call is made at atime mult Asserts that multiple threads can call M PI at any time with no restrictions Setting M PI_MT_FLAGS ct has thesame effect as setting M P _ FLAGS s a p when the value of that is greater than 0 MPI_MT_FLAGS ct takes priority over the default M PI_FLAGS sp0 setting Thesingle fun serial and mult options are mutually exclusive For example if you specify the serial andmul t optionsinMPI_MT_FLAGS only the last option specified is processed in this case the mu t option If no
246. o different groups to exchange data intracommunicators Communicators that allow processes within the same group to exchange data instrumentation Cumulative statistical information collected and stored in ASCII format Instrumentation is the recommended method for collecting profiling data latency Time between the initiation of the data transfer in the sending process and the arrival of the first bytein the receiving process load balancing M easure of how evenly the work load is distributed among an application s processes When an application is perfectly balanced all processes share the total work load and complete at the same time locality Degree to which computations performed by a processor depend only upon local data Locality ismeasured in several waysincludingtheratio of local to nonlocal data accesses Platform MPI User s Guide 247 Glossary locality domain Idom Consists of arelated collection of processors memory and peripheral resources that compose a fundamental building block of the system All processors and peripheral devices in a given locality domain have equal latency to the memory contained within that locality domain mapped drive In anetwork drive mappings reference remote drives and you have the option of assigning the letter of your choice For example on your local machine you might map S to refer to drive C on a server Each time S is referenced on the local machine the drive on the
247. o script Configuration files for example pmpi conf Source files for the example programs Header files 30 Platform MPI User s Guide Getting Started Subdirectory Contents lib pa2 0 lib pa20_64 lib linux_ia32 lib linux_ia64 lib linux_amd64 modules MPICH1 2 MPICH2 0 newconfig sbin share man man1 share man man3 doc licenses Platform MPI PA RISC 32 bit libraries Platform MPI PA RISC 64 bit libraries Platform MPI Linux 32 bit libraries Platform MPI Linux 64 bit libraries for Itanium Platform MPI Linux 64 bit libraries for Opteron and Intel64 OS kernel module files MPICH 1 2 compatibility wrapper libraries MPICH 2 0 compatibility wrapper libraries Configuration files and release notes Internal Platform MPI utilities manpages for Platform MPI utilities manpages for Platform MPI library Release notes License files Linux man pages The manpages arein thes MP _ROOT share man man1 subdirectory for Linux They can be grouped into three categories general compilation and run time One general manpage M P1 1 is an overview describing general features of Platform M PI The compilation and run time manpages describe Platform MPI utilities The following table describes the three categories of manpages in the man1 subdirectory that comprise manpages for Platform M PI utilities Table 5 Linux man page categories Category manpages Description General MPI 1 Describes the general features
248. ocation of an MPMD application looks like this MPI_ROOT bin mpirun f appfile where appfile is the text file parsed by mpi run and contains a list of programs and process counts Suppose you decompose the poisson application into two source files poisson_ master uses a single master process and poisson_child uses four child processes The appfile for the example application contains the two lines shown below np 1 poisson_master np 4 poisson child To build and run the example application use the following command sequence MPI_ROOT bin mpicc o poisson_master poisson_master c MPI_ROOT bin mpicc o poisson_child poisson_child c MPI_ROOT bin mpirun f appfile MPMD with prun pr un also supports running applications with M PM D using procfiles See the pr un documentation at http www quadrics com MPMD with srun MPMD isnot directly supported with s r un However users can write custom wrapper scripts to their application to emulate this functionality This can be accomplished by using the environment variables SLURM_PROCID and SLURM_NPROCSas keys to selecting the correct executable Modules on Linux M odules area convenient tool for managing environment settings for packages Platform M PI for Linux providesan Platform MPI moduleat opt platform mpi modulefiles platform mpi which sets MPI_ROOT and addsto PATH and MANPATH To useit copy the file to a system wide module directory or append opt platform mpi mo
249. ocessed processed processed processed processed processed processed processed processed processed processed processed processed processed processed processed processed processed processed processed processed processed processed processed processed processed processed processed processed processed processed processed processed processed processed processed processed processed processed processed processed processed processed processed processed processed processed processed processed processed processed processed processed processed request request reques reques reques reques reques reques reques reques reques reques reques request reques reques reques reques reques reques reques reques reques reques reques reques reques reques reques reques reques reques reques reques reques reques reques reques reques reques reques reques reques reques reques reques reques reques reques reques reques reques reques reques reques reques reques reques reques reques reques reques reques reques reques reques request omer oreo oreo oreo reer oreo reer oreo O Or 2 AD ORORPOORHOOOF 200 2D OrROOOC Or Oner 2DOO OF 2Or OF ce ooror AADADADADATDATDAAADAADATDAAADADADATD AD AND AD ADA AANA TAN ITD TNA TNA ITN ITDN ITN IAIN IAIN IN INIT IT ITN INI INIT ITN ITN IN ITN IAIN ITAA A ient ient ien
250. of Platform MPI mpicc 1 ilati m D i h ilabl ilati iliti Compilation mpif77 1 escribes the available compilation utilities mpif90 1 Platform MPI User s Guide 31 Getting Started Category manpages Description mpiclean 1 mpidebug 1 mpienv 1 mpiexec 1 Runtime mpijob 1 mpimtsafe 1 mpirun 1 mpistdio 1 autodbl 1 Describes run time utilities environment variables debugging thread safe and diagnostic libraries Licensing policy for Linux Platform M PI for Linux uses FlexN et Publisher formerly FLEXIm licensing technology A license file can benamed i cense dat or any filenamewith an extension of i c Thelicense file must be placed in the installation directory default location opt pcmpi 1 icenses on all run time systems Platform M PI hasan Independent SoftwareV endor ISV program that allows participating ISV sto freely distribute Platform M PI with their applications When the application is part of the Platform M PI ISV program there is no licensing requirement for the user The ISV provides a licensed copy of Platform M PI Contact your application vendor to find out if they participate in the Platform M PI ISV program The copy of Platform M PI distributed with a participating ISV works with that application A Platform M PI license is required for all other applications Licensing for Linux Platform M PI now supports redundant license servers using the FLEX net Publisher licensing software Thre
251. ollectively as a parallel machine collective communication Communication that involves sending or receiving messages among a group of processes at the same time The communication can be one to many many to one or many to many The main collective routines are M PI_Bcast M PI_ Gather and MPI_ Scatter communicator Global object that groups application processes together Processes in acommunicator can communicate with each other or with processes in another group Conceptually communicators define a communication context and a static group of processes within that context context Internal abstraction used to define a safe communication space for processes Within a communicator context separates point to point and collective communications data parallel model Design model wheredataispartitioned and distributed to each processin an application Operations are performed on each set of data in parallel and intermediate results are exchanged between processes until a problem is solved derived data types User defined structures that specify a sequence of basic data types and integer displacements for noncontiguous data Y ou create derived data types through the use of type constructor functions that describe the layout of sets of primitive types in memory Derived types may contain arrays as well as combinations of other primitive data types determinism A behavior describing repeatability in observed parameters The or
252. omm_spawn and MPI_Comm_spawn_multiple routines provide an interface between M PI and the runtime environment of an M PI application MPI_Comm_accept and MP I_Comm_connect along with MPI_Open_port and MPI_Close port allow two independently run M PI applications to connect to each other and combine their ranks into a single communicator MPI_Comm_join allows two ranks in independently run M PI applications to connect to each other and form an intercommunicator given a socket connection between them Processes that are not part of the same M PI world but are introduced through calls to MPI_Comm_connect MPI_Comm_accept MPI_Comm_spawn orMPI_Comm_spawn_multiple attempt to use InfiniBand for communication Both sides need to have InfiniBand support enabled and use the same InfiniBand parameter settings otherwise TCP will be used for the connection Only OFED IBV protocol is supported for these connections W hen theconnection is established through oneof these M PI calls aTCP connection is first established between the root process of both sides TCP connections are set up among all the processes Finally IBV InfiniBand connections are established among all process pairs and the TCP connections are closed Spawn functions supported in Platform M PI MPI_Comm_get_parent MPI_Comm_spawn MPI_Comm_spawn_multiple MPl_Comm_accept MPI_Comm_connect MPI_Open_port MPI_Close_port MPI_Comm_join Keys interpreted in
253. on creates an ASCII file named compute_pi instr containing instrumentation profiling Although i is the preferred method of controlling instrumentation the same functionality is also accessible by setting the M PI_INSTR environment variable Specifications you make using mpi run i override specifications you make using the M PI_INSTR environment variable 156 Platform MPI User s Guide Profiling MPIHP_Trace_on and MPIHP_Trace_ off By default the entire application is profiled from M PI_Init to MPI_Finalize H owever Platform M PI provides the nonstandard MPIHP_Trace_on and MPIHP_Trace off routines to collect profile information for selected code sections only To use this functionality 1 Insert the MPIH P_Trace_on and MPIHP_Trace_off pair around code that you want to profile 2 Build theapplication and invokempi r un with the i lt prefix gt off option i lt index gt off specifies that counter instrumentation is enabled but initially turned off Data collection begins after all processes collectively call M PIH P_Trace_on Platform M PI collects profiling information only for code between MPIHP_Trace_on and MPIHP_Trace off Viewing ASCII instrumentation data The ASCII instrumentation profile is a text file with the instr extension For example to view the instrumentation file for thec ompute_pi f application you can print the prefix instr file If you defined prefix for the file as compute_pi you would print compute _pi
254. onerank calling the collective returns with an error The application must initiate a recovery from those ranks by calling M PI_Comm_dup on the communicator used by the failed collective This ensures that all other ranks within the collective also exit the collective Some ranks might exit successfully from a collective call whileother ranks do not Ranks which exit with M Pl_ SUCCESS will havesuccessfully completed their rolein the operation and any output buffers will be correctly set Thereturn valueof MPI_ SUCCESS does not indicate that all ranks have successfully completed their rolein the operation After a failure one or more ranks must call MPI_Comm_dup All future communication on that communicator results in failure for all ranks until each rank has called MPI_Comm_dup on the communicator After all ranks havecalled MPI_Comm_dup the parent communicator can be used for point to point communication M Pl_Comm_dup can becalled successfully even after afailure Because the results of a collective call can vary by rank ensure that an application is written to avoid deadlocks For example using multiple communicators can be very difficult as the following code demonstrates a Crh S MI Beast ouiier lem KWH Foot Commi ii erm 4 MPl_Error_class err amp class if class MPI _ERR_EXITED err MPI Comm dup commA amp new comms if err MPI SUCCESS cleanup_and_exit MPI _Comm_free comma commA new comms er
255. ons gt f lt appfile gt Platform M PI also supports setting M PI_REM SH using the eoption to mpi run MPI_ROOT bin mpirun e MPl_REMSH ssh lt options gt f lt appfile gt Platform M PI also supports setting M PI_REM SH to a command that includes additional arguments MPI_ROOT bin mpirun e MPI_REMSH ssh x lt options gt f lt appfile gt When usings sh besure that it is possible to uses s h from the host where mpi r un is executed without ssh requiring interaction from the user RDMA tunable environment variables MPL_RDMA_INTRALEN eMPI_RDMA_INTRALEN 262144 Specifies the size in bytes of the transition from shared memory to interconnect when intra mix is used For messages less than or equal to the specified size shared memory is used For messages greater than that size the interconnect is used TCP IP Elan MX and PSM do not have mixed mode MPL RDMA_MSGSIZE MPI_RDMA_MSGSIZE a b c Specifies message protocol length where Short message protocol threshold If themessagelength is bigger than this value middle or long message protocol is used The default is 16384 bytes M iddle message protocol If the message length is less than or equal to b consecutive short messages are used to send the whole message By default b is set to 16384 bytes the same as a to effectively turn off middle message protocol On IBAL the default is 131072 bytes Platform MPI User s Guide 139 Understanding Platform MPI Lo
256. onsidered complete and the MPI_ERROR field of the MPI_Status structure indicates MPI_ERR_EXITED M PI_Waitall waits until all requests are complete even if an error occurs with some requests If some requests fail MPI_IN_STATUSisreturned Otherwise M PI_SUCCESSisreturned In thecaseof an error the error code is returned in the status array Windows HPC The following are specific mpi r un command line options for Windows HPC users hpc Indicates that the job is being submitted through the Windows H PC job scheduler launcher This is the recommended method for launching jobs and is required for all HPC jobs hpcerr lt filename gt Assigns the job s standard error file to the file name when starting a job through the Windows H PC automatic job scheduler launcher feature of Platform M PI This flag has no effect if used for an existing HPC job hpcin lt filename gt Assigns the job s standard input file to the given file name when starting a job through theW indowsH PC automaticjob scheduler launcher featureof Platform M PI This flag has no effect if used for an existing HPC job hpcout lt filename gt Platform MPI User s Guide 113 Understanding Platform MPI Assigns thejob s standard output file to the given filename when starting ajob through theW indowsH PC automaticjob scheduler launcher feature of Platform M PI This flag has no effect if used for an existing HPC job hpcwait Causes the mpi run command to
257. orl d executable file MPI_ROOT bin mpirun np 4 hello_world where np 4 specifies 4 as the number of rocesses to run Analyzehel o_ world output Platform MPI prints the output from running thehe o_ world executablein nondeterministic order The following is an example of the output Hello world m1 of 4 on jawbone Hello world m3 of 4 on jawbone Hello world m0O of 4 on jawbone Hello world m2 of 4 on jawbone Building and running on a Linux cluster using appfiles The following is an example of basic compilation and run steps to executehe o_ world c on acluster with 4 way parallelism To build and run hel o_ world c on a cluster using an appfile 1 2 Changeto a writable directory Compilethehel o_ world executable file MPI_ROOT bin mpicc o hello_world MPI_ROOT help hello_world c Createthefileappf il e for running on nodes n01 and n02 as h n01 np 2 path to hello_world h n02 np 2 path to hello_world Run thehel o_ worl d executable file MPI_ROOT bin mpirun f appfile Platform MPI User s Guide 29 Getting Started 5 By default mpi r un willrsh rems h to theremote machines n01 and n02 If desired the environment variable M PI_REM SH can be used to specify a different command such as usr bin sshor ssh x k Analyzehello_worl d output Platform M PI prints the output from runningthehel o_ world executablein nondeterministic order The following is an example of th
258. orm M PI application with alarge amount of off host processes can quickly reach thefile descriptor limit Ask your system administrator to increase the limit if your applications frequently exceed the maximum External input and output i b gt 0 You can use stdin stdout and stderr in applicationsto read and write data By default Platform M PI does not perform processing on stdin or stdout The controlling tty determines stdio behavior in this case This functionality is not provided when using srun or prun If your application depends on the mpi run option stdio I to broadcast input to all ranks and you are using SLURM ssrun onan HP XC system then a reasonable substitute is stdin all For example mpirun srun stdin all For similar functionality refer to the label option of sr un Platform M PI does provide optional stdio processing features stdin can be targeted to aspecific process or can be broadcast to every process stdout processing includes buffer control prepending M PI rank numbers and combining repeated output Platform M PI standard 10 options can be set by using the following options to mpi run mpirun stdio bline gt 0 bnone gt 0 b gt 0 p r gt 7 i files none where Broadcasts standard input to all M PI processes Directs standard input to the process with the global rank The following modes are available for buffering Specifies that the output of a si
259. orm M PI without the prerequisite of a license The MPI ROOT hel p system check cfilecontains an example of how this API can be used This test can be built and run as follows MPI_ROOT bin mpicc o system_check x MPI_ROOT help system_check c MPI_ROOT bin mpirun system_check x ppr_ message size Any valid options can be listed on the mpi r un command line During the system check the following tests are run 1l hello_world 2 ping_pong_ring These tests are similar to the code found in MPI ROOT help hello_ worid c and MPI_ROOT hel p ping_pong_ring c Theping_pong_ ring testinsystem check c defaults to a messagesize 238 Platform MPI User s Guide Frequently Asked Questions of 4096 bytes An optional argument to the system check application can be used to specify an alternate message size The environment variable HPM PI_ SYSTEM_CHECK can beset to run a single test Valid values of HPMPI_SYSTEM_CHECK are 1 all Runs both tests the default value 2 hw Runsthehe l o_ world test 3 ppr Runstheping_pong_ring test Ifthe HPM PI_SYSTEM_CHECK variableis set during an application run that application runsnormally until M PI_Init is called Before returning from M PI_Init the application runs the system check tests W hen the system checks are completed the application exits This allows the normal application launch procedureto beused during thetest including any job schedulers wrapper scripts and local environment s
260. orm MPI User s Guide Understanding Platform MPI The local context version of gdi r env variable value The local context version of genv envlist var1 var2 The local context version of genvlist path path path7 The local context version of gpath The following are additional options for M PI quiet_hpmpi By default Platform M PI displays a detailed account of the types of M PI commands that aretranslated to assist in determining if theresult is correct T hiscommand disables these messages mpi exec doesnot support prun orsr un start up Platform MPI User s Guide 101 Understanding Platform MPI mpirun options This section describes options included in lt mpirun_options gt for all of the preceding examples They are listed by category Interconnect selection Launching specifications Debugging and informational RDMA control M PI 2 functionality Environment control Special Platform M PI mode Windows CCP Windows 2003 X P Interconnect selection options Network selection ibv IBV Explicit command line interconnect selection to useOFED InfiniBand The lowercase and uppercase options are analogous to the Elan options vapi VAPI Explicit command line interconnect selection to use M ellanox Verbs API The lowercase and uppercase options are analogous to the Elan options Dynamic linking is required with VAPI Do not link static udapl UDAPL Explicit command line interconnect se
261. orm MPI User s Guide 153 Understanding Platform MPI Native language support By default diagnostic messages and other feedback from Platform M PI are provided in English Support for other languages is available through the use of the N ative Language Support NLS catalog and the internationalization environment variable NLSPATH The default NLS search path for Platform M PI is NLSPATH For NLSPATH usage see the environ 5 manpage When an MPI language catalog is available it represents Platform M PI messages in two languages The messages are paired so that thefirst in the pair is always the English version of a message and the second in the pair is the corresponding translation to the language of choice For moreinformation about Native Language Support see the hpnls 5 environ 5 and lang 5 manpages 154 Platform MPI User s Guide Profiling This chapter provides information about utilities you can use to analyze Platform M PI applications Platform MPI User s Guide 155 Profiling Using counter instrumentation Counter instrumentation is a lightweight method for generating cumulative run time statistics for M PI applications When you create an instrumentation profile Platform M PI creates an output filein ASCII format You can create instrumentation profiles for applications linked with the standard Platform M PI library For applications linked with Platform M PI V 2 1 or later you can also create profiles for
262. ort ANSWER Platform M PI for Windows V 1 0 supports Windows HPC Platform M PI for Windows V1 1 supports W indows 2003 and W indows XP multinoderunswith the Platform M PI RemoteLaunch service running on the nodes This service is provided with V 1 1 The service is not required to run in an SMP mode QUESTION WhatisMPI_ROOT that see referenced in the documentation ANSWER MPI_ROOT isan environment variablethat Platform M PI mpi r un uses to determine where Platform M PI isinstalled and therefore which executables and libraries to use It is especially helpful when you have multiple versions of Platform M PI installed on a system A typical invocation of Platform M PI on systems with multiple M PI_ROOT variables installed is gt set MPI_ROOT nodex share test platform mpi 2 2 5 gt MPI_ROOT bin mpirun When Platform M PI is installed in Windows it sets M PI_ROOT for the system to the default location The default installation location differs between 32 bit and 64 bit Windows For 32 bit Windows the default is C Program Files Platform MPI For 64 bit Windows the default is C Program Files x86 Platform MPI QUESTION How areranks launched on Windows ANSWER On Windows HPC ranks are launched by scheduling Platform M PI tasks to the existing job These tasks are used to launch theremote ranks Because CPU s must be available to schedule these tasks the initial mpi r un task submitted must only use a single task
263. ose between ranks and or daemons where real message traffic occurs and connections between mpi run and the daemons where little traffic occurs but is still necessary The netaddr option can be used to specify asingle P mask to use for both purposes or specify them individually The latter might be needed if mpi run happens to be run on aremote machine that doesn t have access to the same Ethernet network as the rest of thecluster To specify both the syntax is netaddr P specification mask To specify them individually thesyntax is netaddr mpirun spec rank spec Thestring aunch can be used in place of mpi run ThelP specification can beanumeric P address like 172 20 0 1 or it can bea host name If ahost nameis used thevalue is thefirst P address returned by gethostbyname The optional mask can be specified as a dotted quad or as a number representing how many bits are to be matched For example a mask of 11 is equivalent to a mask of 255 224 0 0 If an IP and mask are given then it is expected that one and only one IP will match at each lookup An error or warning is printed as appropriate if there are no matches or too many If no mask is specified then thel P matching will simply bedoneby thelongest matching prefix This functionality can also be accessed using the environment variable MPI_NETADDR Launching specifications options Job launcher scheduler Options for LSF users These options launch ranks as found
264. ount of memory space between data elements wheretheelementsarestored noncontiguously Strided data are sent and received using derived data types synchronization Bringing multiple processes to the same point in their execution before any can continue For example M Pl_ Barrier isacollectiveroutinethat blocksthecalling process until all receiving processes have called it This is a useful approach for separating two stages of a computation so messages from each stage are not overlapped synchronous send mode Form of blocking send where the sending process returns only if a matching receiveis posted and the receiving process has started to receive the message tag Integer label assigned to a message when it is sent M essage tags are one of the synchronization variables used to ensure that a message is delivered to the correct receiving process task Uniquely addressable thread of execution thread Smallest notion of execution in a process All M PI processes have one or more threads Multithreaded processes have one address space but each process thread contains its own counter registers and stack This allows rapid context switching because threads require little or no memory management thread compliant Platform MPI User s Guide 251 Glossary An implementation where an M PI process may be multithreaded If it is each thread can issue MPI calls However the threads themselves are not separately addressable trace Informa
265. out_fragsis64 ncreasingthenumber of fragments for applications with a large number of processes improves system throughput in_frags Specifies the number of 16 KB fragments available in shared memory for inbound messages Inbound messages are sent from processes on hosts to processes on a given host using the communication daemon Thedefault valuefor in_fragsis 64 Increasing thenumber of fragments for applications with a large number of processes improves system throughput W hen commd is used M PI_COM M D specifies daemon communication fragments InfiniBand environment variables MPI_IB_MULTIRAIL Supports multi rail on OpenFabric This environment variable is ignored by all other interconnects In multi rail mode a rank can useall the node cards but only if its peer rank uses the samenumber of cards M essages are striped among all the cards to improve bandwidth By default multi card messagestripingis off Specify eM PI_IB_ MULTIRAIL N whereN isthenumber of cards used by arank If N lt 1 then message striping is not used If N is greater than the maximum Platform MPI User s Guide 131 Understanding Platform MPI number of cards M on that node then all M cards areused If 1 lt N lt M message striping is used on N cards or less On ahost all ranks select all the cards in aseries For example if there are 4 cards and 4 ranks on that host rank 0 uses cards 0 1 2 3 rank 1 uses 1 2 3 0 rank 2 uses 2 3 0 1 r
266. ovided to thoseranks U sing ha infra indicates that thempi run and mpi d processes normally used to support the application ranks are terminated after all ranks have called M PI_Init Thisoption implies stdio none To record stdout and stderr consider using the stdio files option when using ha infra 110 Platform MPI User s Guide Understanding Platform MPI Becausethempi r un and mpi d processes do not persist for thelength of theapplication run somefeatures are not supported with ha infra These include spawn commd 1sided U sing ha infra does not allow a convenient way to terminate all ranks associated with the application It isthe responsibility of the user to havea mechanism for application teardown Using MPl_Comm_connect and MPl_Comm_accept MPI_Comm_connect and MPI_Comm_accept can be used without the spawn option to mpi run This allowsapplications launched usingthe ha infra option to call these routines W hen using high availability mode these routines do not deadlock even if the remote process exits before during or after the call Using MPlL_Comm_ disconnect In high availability mode MP _Comm_disconnect is collective only across the local group of the calling process This enables a process group to independently break a connection to the remote group in an intercommunicator without synchronizing with those processes U nreceived messages on theremoteside are buffered and might be received until the remote side
267. ow Using mpi64 in a Visual Studio 32 bit command window does not work 2 Request aH PC allocation of sufficient size to run the required application s Add the rununtilcanceled option to have HPC maintain the allocation until it is explicitly canceled gt job new numcores 8 rununtilcanceled true Job queued ID 4288 3 Submit the job to HPC without adding tasks gt job submit id 4288 Job 4288 has been submitted 4 Run the applications as a task in the allocation optionally waiting for each to finish before starting the following one gt MPI_ROOT bin mpirun hpc hpewait jobid 4288 node share hello_world exe mpirun Submitting job to hpc scheduler on this node mpirun HPMPI Job 4288 submitted to cluster mpiccpl mpirun Waiting for HPMPI Job 4288 to finish mpirun HPMPI Job 4288 finished Platform MPI User s Guide 93 Understanding Platform MPI Note Platform MPI automatic job submittal converts the mapped drive to a UNC path which is necessary for the compute nodes to access files correctly Because this example uses HPCS commands for submitting the job the user must explicitly indicate a UNC path for the MPI application i e hel o0_ world exe orinclude the workdir flag to set the shared directory as the working directory 5 Repeat Step 4 until all required runs are complete 6 Explicitly cancel the job freeing the allocated nodes gt job cancel 4288 Remote launch service for Wind
268. ows Remote Launch service is available for W indows 2003 X P V ista 2008 W indows 7 system The Platform M PI Remote Launch serviceis located in MPI _ROOT sbi n PCMPI Wi n32Service exe MPI_ROOT must be located on a local disk or the service does not run properly Torun theservice manually you must register and start the service To register the service manually run the service executable with the i option To start the service manually run the service after it is installed with the start option The service executable is located at MPI _ ROOT sbi n PCMPI Wi n32Service exe For example C gt MPI_ROOT sbin PCMPIWin32Service exe i Creating Event Log Key PCMPI Installing service Platform MPI SMPID OpenSCManager OK CreateService Succeeded Service installed C gt MPIROOT sbin PCMPIWin32Service exe start Service started The Platform M PI Remote Launch service runs continually as a Windows service listening on a port for Platform M PI requests from remotempi run exe jobs This port must bethesame port on all machines and is established when the service starts The default TCP port is 8636 Ifthisportis not available or to changethe port includea port number asa parameter to i Asan example to install the service with port number 5004 CA gt MPI_ROOT sbin PCMPIWin32Service exe i 5004 Or you can stop the service then set the port key and start the service again For example u
269. pecify the command line options that you normally pass to the compiler on the mpi f 90 command line Thempi f 90 utility adds additional command line options for Platform M PI include directories and libraries The show option can be specified to mpi f 90 to display the command generated without executing the compilation command See the manpage mpif90 1 for more information To construct the desired compilation command the mpi f 90 utility needs to know what command line compiler is to be used the bitness of the executable that compiler will produce and the syntax accepted by the compiler These can be controlled by environment variables or from the command line Table 7 mpif90 utility Environment Variable Value Command Line MPI_F90 desired compiler default ifort mpif90 lt value gt MPI_BITNESS 32 or 64 no default mpi32 or mpi64 MPIL_WRAPPER_SYNTAX windows or unix default windows mpisyntax lt value gt For example to compilecompute_pi f usinga64 bit ifort contained in your PATH could bedonewith the following command since ifort and the Windows syntax are defaults C gt MPLROOT bin mpif90 mpi64 compute_pi f link out compute_pi_ifort exe Or use the following example to compile using the PGI compiler which uses a more U N1X like syntax C gt MPILROOT bin mpif90 mpif90 pgf90 mpisyntax unix mpi32 compute_pi f o compute_pi_pgi32 exe Platform MPI User s Guide 37 Getting Started Tocompileco
270. pempi For Windows systems use gt MPI_ROOT bin mprun version mpirun Platform MPI 01 00 00 00 Windows 32 major version 100 minor version 0 Building on Linux You can solvemost build timeproblems by referring to thedocumentation for thecompiler you areusing If you use your own build script specify all necessary input libraries To determine what libraries are needed check the contents of the compilation utilities stored in the Platform MPI MPI _ROOT bin subdirectory Platform M PI supports a 64 bit version of the M PI library on 64 bit platforms Both 32 bit and 64 bit versions of the library are shipped on 64 bit platforms Y ou cannot mix 32 bit and 64 bit executables in the same application Platform M PI does not support Fortran applications that are compiled with the following option autodblpad Fortran 77 programs Building on Windows M ake sure you are running the build wrappers i e mpi cc mpi f 90 in a compiler command window This window is usually an option on the Start gt All Programs menu Each compiler vendor provides a command window option that includes all necessary paths for compiler and libraries On Windows the Platform M PI libraries include the bitness in the library name Platform M PI provides support for 32 bit and 64 bit libraries The lib files are located in MP1 ROOT 1 ib Starting on Linux When starting multihost applications using an appfile make sure that 174 Platform MP
271. plication might not safely reuse the message buffer after a nonblocking routine returns until M PI_W ait indicates that the message transfer has completed In nonblocking communications the following sequence of events occurs 1 Thesending routine begins the message transfer and returns immediately 2 Theapplication does some computation 3 Theapplication calls a completion routine for example MPI_ Test or MPI_Wait to test or wait for completion of the send operation Blocking communication Blocking communication consists of four send modes and one receive mode The four send modes are Platform MPI User s Guide 17 Introduction Standard MPI_ Send The sending process returns when the system can buffer the message or when the message is received and the buffer is ready for reuse Buffered MPI_Bsend The sending process returns when the message is buffered in an application supplied buffer Avoid using the M PI_Bsend mode It forces an additional copy operation Synchronous MPI_Ssend The sending process returns only if a matching receive is posted and the receiving process has started to receive the message Ready MPI_Rsend The message is sent as soon as possible You can invoke any mode by using the correct routine name and passing the argument list Arguments are the same for all modes For example to code a standard blocking send use MPI_Send void buf int count MPI _Datatype dtype int dest int
272. ppfile by supplying a list of space separated arguments that mpi run should ignore setenv MPlLUSESRUN_IGNORE_ARGS lt option gt 232 Platform MPI User s Guide mpirun Using Implied prun or srun In the example below the appfile contains a reference to stdio bnone which is filtered out because it is set in the ignore list setenv MPI_USESRUN_VERBOSE 1 setenv MPLUSESRUN_IGNORE_ARGS stdio bnone setenv MPI _USESRUN 1 setenv MPlLSRUNOPTION label bsub l n4 ext SLURM nodes 4 MPI_ROOT bin mpirun stdio bnone f appfile pingpong Job lt 369848 gt is submitted to default queue lt normal gt lt lt Waiting for dispatch gt gt lt lt Starting on Isfhost localdomain gt gt fopt platform mpi bin mpirun unset MPI _USESRUN opt platform mpi bin mpirun srun pallas x npmin 4 pingpong srun arguments n ntasks ntasks Specify the number of processes to run N nodes nnodes Request that nnodes nodes be allocated to this job m distribution block cyclic Specify an alternate distribution method for remote processes w nodelist host1 host2 or filename Request a specific list of hosts x exclude host1 host2 or filename Request that a specific list of hosts not be included in the resources allocated to this job label Prepend task number to lines of stdout err For moreinformation on sr un arguments see thes r un manpage The following is an example using th
273. pplications h hostZ np 1 path to pp x And you can specify what remote shell command to use Linux default is ssh in the MPI_REM SH environment variable For example you might use export MPI_REMSH rsh x optional Then run MPI_ROOT bin mpirun prot f appfile MPI_ROOT bin mpirun prot f appfile 1000000 If LSF is being used the host names in the appfile wouldn t matter and the command to run would be bsub pam mpi MPI_ROOT bin mpirun prot f appfile bsub pam mpi MPI_ROOT bin mpirun prot f appfile 1000000 2 Thesrun command is available For this case then you would run a command like this MPI_ROOT bin mpirun prot srun N 8 n 8 path to pp x MPI_ROOT bin mpirun prot srun N 8 n 8 path to pp x 1000000 Replacing 8 with the number of hosts If LSF is being used the command to run might be this bsub I n 16 MPI_ROOT bin mpirun prot srun path to pp x bsub I n 16 MPI_ROOT bin mpirun prot srun path to pp x 1000000 3 Thepr un command is available This case is basically identical to thes r un case with the change of using pr un in placeofsrun In each case above the first mpi r un uses 0 bytes of data per message and is for checking latency The second mpi r un uses 1000000 bytes per message and is for checking bandwidth include lt stdio h gt include lt stdlib h gt ifndef _W N32 include lt unistd h gt endi f include lt string h gt includ
274. pplications and libraries that use the M PICH 2 implementation M PICH2 isnot a standard but rather a specific implementation of the M PI 2 1 standard Platform M PI provides M PICH 2 compatibility with the following wrappers Table 14 MPICH wrappers MPICH1 MPICH2 mpirun mpich mpi cc mpi ch mpif 77 mpich mpi f90 mpich mpirun mpi ch2 mpicc mpich2 mpif 77 mpi ch2 mpif 90 mpi ch2 Object files built with Platform M PI M PICH compiler wrappers can be used by an application that uses theM PICH implementation Y ou must relink applications built using M PICH compliant libraries to use Platform MPI in MPICH compatibility mode Note Do not use MPICH compatibility mode to produce a single executable to run under both MPICH and Platform MPI Platform MPI User s Guide 65 Understanding Platform MPI Examples of building on Linux This example shows how to build hello_world c prior to running 1 Change to a writable directory that is visible from all hosts the job will run on 2 Compilethe hello_world executable file MPI_ROOT bin mpicc o hello_world MPI_ROOT help hello_world c This example uses shared libraries which is recommended Platform M PI also includesarchivelibrariesthat can beused by specifying thecorrect compiler option Note Platform MPI uses the dynamic loader to interface with interconnect libraries Therefore dynamic linking is required when building applications that use Platform M
275. process request request int request if request REQ SHUTDOWN return request void server args void args int rank request MPI Status status rank int args while 1 MPI _Recv amp request 1 MPI service cnt INT MPI ANY SOURCE SERVER TAG MPI COMM WORLD amp status if process request request break MPI Send amp rank 1 MP Bain REQ SHUTDOWN status MPI SOURCE Platform MPI User s Guide 207 Example Applications CLIENT TAG MPI_COMM WORLD printf server d processed request d for client d n rank request status MPI SOURCE printf server d total service requests d n rank service cnt return void 0 void client rank size int rank int size int w server ack P Status status or w 0 w lt MAX WORK wtt Server rand si ze Pl _ Sendrecv amp rank 1 MPI_INT server SERVER TAG Gack 1 MPI_INT server CLIENT TAG MPI COMM WORLD amp status if ack server printf server failed to process my request n MP _Abort MPI COMM WORLD MPI _ERR_OTHER void shutdown servers rank int rank int request_shutdown REQ SHUTDOWN MPI Barrier MPI COMM WORLD MPI Send amp request_ shutdown 1 MPI_INT rank SERVER TAG MPI COMM WORLD main argc argv int argc char argv int rank size rtn pthread_t mtid Pl Status Status int my_value his value Pl_Init amp argc a
276. processes call MPI_Finalize ierr stop endi f prime Process rank oF Size Le allive dest size 1 src 0 if rank eq src then to dest count 10 tag 2001 do i l 10 data i 1 enddo call MPI Send data count MPI DOUBLE PRECISI ON to tag MPI COMM WORLD ierr endi if rank eq dest then tag MPI ANY_TAG count 10 from MPI_ANY SOURCE Ca MPI Recv data count MPI DOUBLE PRECISION iH from tag MPI COMM WORLD status ierr ca MPI Get Count status MPI DOUBLE PRECISI ON st_count ierr st_source status MPI SOURCE st_tag status MPI TAG print Status info source st _source tAn amp St tag CONME amp St Comme peint rank recelwed atalii i sil 10 endif ca MPI _Finalize ierr stop end send_receive output The output from running the send_ receive executable is shown below The application was run with np 10 Process 0 of 10 is alive Process 1 of 10 is alive Process 2 of 10 is alive Process 3 of 10 is alive Process 4 of 10 is alive Process 5 of 10 is alive Process 6 of 10 is alive Process 7 of 10 is alive Process 8 of 10 is alive Process 9 of 10 is alive Status info source 0 tag 2001 count 10 O OCITNO i i ie i it ty il l il a ping _pong c This C example is used as a performance benchmark tomeasure the amount of time it takes to send and receive data betweentwo processes The
277. ptions arethesameas mpi r un authentication options These include pass cache clearcache iscached token tg package pk For detailed descriptions of these options refer to these options in the mpi r un documentation Thempi di ag tool can be very helpful in debugging issues with remote launch and access to remote systems through the Platform M PI Remote Launch service To use the tool you must always supply a server with the s option Then you can use various commands to test access to the remote service and verify a limited number of remote machine resources For example to test if machine winbl16 Platform M PI remote launch serviceis running use the at flag X Demo gt MPI_ROOT bin mpidiag s winbl16 at connect failed 10061 Cannot establish connection with server 96 Platform MPI User s Guide Understanding Platform MPI SendCmd send sent a different number of bytes than expected 10057 Themachinecannot connect to theserviceon therenote machine After checking and finding theservice was not started the service is restarted and the command is run again X Demo gt MPI_ROOT bin mpidiag s winbl16 at Message received from Service userl Now the service responds and authenticates correctly To verify what processes are running on a remote machine use the ps command X Demo gt MPI_ROOT bin mpidiag s winbl16 ps Process List ProcessName Username PID CPU Ti me Me mor y rdpclip ex
278. ptions gt The np option is not allowed with pr un Some features like mpi run st dio processing are unavailable MPI_ROOT bin mpirun prun n 2 a out launchesa out on two processors MPI_ROOT bin mpirun prot prun n 6 N 6 a out turns on the print protocol option protis an mpi run option and therefore is listed before prun and runs on 6 machines one CPU per node Platform M PI also providesimplied pr un mode Theimplied pr un mode allows the user to omit the prun argument from the mpi r un command line with the use of the environment variable MPI_USEPRUN srun execution Applications that run on HP XC clusters require the srun option start up directly from s r un isnot supported When using this option mpi r un sets environment variables and invokessr un utilities The srun argument to mpi run specifies that thes r un command is to be used for launching All arguments following srun are passed unmodified to thes r un command MPI_ROOT bin mpirun lt mpirun options gt srun lt srun options gt The np option is not allowed with sr un Some features likempi run stdi o processing are unavailable MPI_ROOT bin mpirun srun n 2 a out launchesa out on two processors MPI_ROOT bin mpirun prot srun n 6 N 6 a out turns on the print protocol option prot is an mpi r un option and therefore is listed before srun and runs on 6 machines one CPU per node 74 Platform MPI User s Guide Unde
279. py of Platform M PI Contact your application vendor to find out if they participate in the Platform M PI ISV program The copy of Platform M PI distributed with a participating ISV only works with that application A Platform M PI license is required for all other applications Licensing for Windows Platform M PI for Windows uses FlexN et Publisher formerly FLEXIm licensing technology A license file can benamed icense dat or any filename with an extension of i c The license file must be placed in the installation directory defaultC Program Files x86 Platform MPI licenses on all run time systems and on the license server Platform M PI for Windows optionally supports redundant license servers The Platform M PI License Certificate includes space for up to three license servers Either one license server or three license servers are listed on the certificate To use a single license server follow the directions below To use three redundant license servers repeat the steps below for each license server You must provide the host name and host ID number of the system where the FlexN et daemon for Platform M PI for Windows will run The host ID can be obtained by entering the following command if Platform M PI is already installed on the system MPI_ROOT bin licensing i86_nAlmutil Imhostid To obtain the host name use the control panel by using Control Panel gt System gt Computer Name The default search path used to find an M
280. r MPI _Sendrecv_replace buffer2 len2 type2 src tagl dest tag2 commbB amp status if err In thiscase someranks exit successfully from theM P _Bcast and moveonto theM PI_Sendrecv_replace operation on a different communicator The ranks that call MP _Comm_dup only cause operations on commA to fail Some ranks cannot return from the M Pl_Sendrecv_replace call on comms if their partners are also members of commA and arein the call to MPI_Comm_dup call on commA This demonstrates the importance of using care when dealing with multiple communicators In this example if the intersection of commA and commB isMPI_COMM_SELF itissimpler to write an application that does not deadlock during failure Network high availability ha net Thenet option to haturnson any network high availability N etwork high availability attemptsto insulate an application from errorsin the network In this release ha net is only significant on IBV for OFED 1 2 or later where Automatic Path Migration is used This option has no effect on TCP connections Failure detection ha detect W hen using the ha detect option a communication failure is detected and prevents interference with the application s ability to communicate with other processes that have not been affected by the failure 112 Platform MPI User s Guide Understanding Platform MPI In addition to specifying ha detect M PI_Errhandler must beset to MPI_ ERRORS RETURN using the M
281. r IN target_count number of entries in target buffer IN target datatype datatype of each entry in target buffer handle IN op reduce operation handle IN win window object handle 228 Platform MPI User s Guide Standard Flexibility in Platform MPI Platform MPI implementation of standard flexibility Platform MPI contains a full M PI 2 standard implementation There are items in the M PI standard for which the standard allows flexibility in implementation This appendix identifies the Platform M PI implementation of many of these standard flexible issues The following table displays references to sections in the M PI standard that identify flexibility in the implementation of an issue Accompanying each reference is the Platform M PI implementation of that issue Table 21 Platform MPI implementation of standard flexible issues Reference in MPI Standard The Platform MPI Implementation MPI implementations are required to define the MPI_Abortkills the application c omm is ignored and uses behavior of MPI_Abort at least for a c o mm of MPL_COMM_WORLD MPI_COMM_WORLD MPI implementations can ignore the comm argument and act as if comm was MPI_COMM_WORLD See MPI 1 2 Section 7 5 An implementation must document the implementation Fortran is layered on top of C and profile entry points are of different language bindings of the MPI interface if given for both languages they are layered on top of each other See MPI 1 2 Section
282. r8 and corresponding M PI calls are given MP _ DOUBLE for the datatype then the autodouble related command line arguments should not be passed to mpi f 90 bat at link time because that causes the datatypes to be automatically changed 56 Platform MPI User s Guide Understanding Platform MPI MPI functions The following M PI functions accept user defined functions and require special treatment when autodouble is used MPI_Op_create MPI_Errhandler_create MPI_Keyval_create MPI_Comm_create_errhandler MPI_Comm_create_keyval MPI_Win_create_errhandler MPI_Win_create_keyval The user defined callback passed to these functions should accept normal sized arguments T hese functions are called internally by the library where normally sized data types are passed to them Platform MPI User s Guide 57 Understanding Platform MPI 64 bit support Platform M PI provides support for 64 bit libraries as shown below M oreinformation about Linux and Windows systems is provided in the following sections Table 13 32 bit and 64 bit support OS Architecture Supported Libraries Default Notes Linux IA 32 32 bit 32 bit Linux Itanium2 64 bit 64 bit Linux Opteron amp Intel64 32 bit and 64 bit 64 bit Use mpi32 and appropriate compiler flag For 32 bit flag see the compiler manpage Windows 32 bit and 64 bit N A Linux Platform M PI supports 32 bit and 64 bit versions running Linux on AMD Opteron or Intel64 systems
283. rch path Platform MPI User s Guide 121 Understanding Platform MPI egdb eadb ewdb epathdb s alp A Starts the application under the xdb debugger The debugger must bein the command search path Starts the application under the gdb debugger The debugger must bein the command search path Starts the application under adb the absolute debugger The debugger must be in the command search path Starts the application under the wdb debugger The debugger must bein the command search path Starts the application under the path debugger The debugger must bein the command search path Reports memory leaks caused by not freeing memory allocated when a Platform M PI jobisrun For example when you createa communicator or user defined datatype after you call mP _I ni t you must free the memory allocated to these objects before you call MP _Finalize InC thisis analogous to making calls to malloc and free for each object created during program execution Setting the option can decrease application performance Forces MPI errors to be fatal Using thef option sets the MPI_ERRORS ARE_FATAL error handler ignoring the programmer s choice of error handlers Thisoption can help you detect nondeterministic error problemsin your code If your code has a customized error handler that does not report that an M PI call failed you will not know that a failure occurred Thus your application could be catching an error with a u
284. rd 2 rank 7 to card O all on host 1 mpirun hostlist np 8 host0 host1 This creates ranks 0 through 7 alternating on host 0 host 1 host 0 host 1 etc It assigns rank 0 to card 0 rank 2 to card 1 rank 4 to card 2 rank 6 to card 0 all on host 0 It assigns rank 1 to card 0 rank 3 to card 1 rank 5 to card 2 rank 7 to card 0 all on host 1 MPL_IB_PKEY Platform M PI supports IB partitioning via M ellanox VAPI and OFED Verbs API 134 Platform MPI User s Guide Understanding Platform MPI By default Platform M PI searches the unique full membership partition key from the port partition key table used If no such pkey is found an error is issued If multiple pkeys are found all related pkeys are printed and an error message is issued If the environment variable M P _IB_PKEY has been set to a value in hex or decimal the value is treated as the pkey and the pkey table is searched for the same pkey If the pkey is not found an error message is issued When a rank selects a pkey to use a verification is made to make sure all ranks are using the same pkey If ranks are not using the same pkey and error message is issued MPI_IBV_QPPARAMS MPI_IBV_QPPARAM S a b c d e Specifies QP settings for IBV where a Timeout value for IBV retry if there is no response from target Minimum is 1 Maximum is 31 Default is 18 b The retry count after a time out before an error is issued Minimum is 0 Maximum is 7 Default is 7 c
285. re the user s M PI program starts MPI_THREAD_AFFINITY controls thread affinity Possible values are none Schedule threads to run on all cores Idoms This is the default cyclic Schedule threads on Idoms in cyclic manner starting after parent cyclic_cpu Schedule threads on cores in cyclic manner starting after parent block Schedule threads on Idoms in block manner starting after parent packed Schedule threads on same Idom as parent empty No changes to thread affinity are made MPI_THREAD_IGNSELF when set to yes does not include the parent in scheduling consideration of threads across remaining cores Idoms This method of thread control can be used for explicit pthreads or OpenM P threads Three cpu_bind options require the specification of a map mask description This allows for explicit binding of ranks to processors The three options are map_Idom map_cpu and mask_cpu Syntax cpu_bind map_ldom map_cpu mask_cpu lt settings gt lt settings gt e MPI_BIND_MAP lt settings gt Examples cpu_bind MAP_LDOM e MPI_BIND_MAP 0 2 1 3 map rank 0 to Idom 0 rank 1 to Idom 2 rank 2 to Idom1 and rank 3 to Idom 3 cpu_bind MAP_LDOM 0 2 3 1 manp rank 0 to Idom 0 rank 1 to Idom 2 rank 2 to Idom 3 and rank 3 to Idom 1 cpu_bind MAP_CPU 0 6 5 manp rank 0 to cpu 0 rank 1 to cpu 6 rank 2 to cpu 5 cpu_bind MASK_CPU 1 4 6 manp rank 0 to cpu 0 0001 rank 1 to cpu 2 0100 rank 2 to cpu 1 or 2 0110
286. readed 24 process placement multihost 77 process placement options 106 process rank of root 23 process rank of source 19 process single threaded 24 processor locality 166 subscription 165 profiling interface 159 progression 163 prot option 107 prun 74 implied 231 with mpirun 67 prun execution 74 prun MPI on Elan interconnect 67 prun option 105 psm option 102 pthreads 59 ptmalloc 150 260 Platform MPI User s Guide R rank of calling process 15 rank of source process 19 rank reordering 124 rdma option 108 RDMA options 108 ready send mode 17 receive buffer data type of elements 19 number of elements in 19 starting address 19 receive message information 19 methods 17 recvbuf variable 22 recvcount variable 22 recvtype variable 22 reduce scatter 22 reduction 22 reduction operation 23 release notes 31 45 remote launch service 94 remote shell 70 launching options 105 remsh command 138 174 secure 28 138 remsh 28 70 reordering rank 124 req variable 20 rhosts file 70 174 root process 20 root variable 21 23 routine selection 167 rsh 28 run appfiles 70 LSF on non HP XC systems 70 MPI on Linux cluster using appfiles 29 MPI application 176 MPI Linux application 67 MPI on Elan interconnect 67 MPI on HP XC 75 MPI on HP XC cluster 30 MPI on multiple hosts 75 MPI on non HP XC Linux 75 MPI on single host Linux 29 MPI with appfiles 73 MPI with prun 74 MPI with srun 74 single host execut
287. riable value Uses Platform M Pl s e variable value option genvlist var1 var2 This option issimilar to genv but uses mpi run S current environment for the variable values gdir directory or dir directory Uses Platform M PI S e MPI _WORKDI R directory option gwdir directory or wdir directory Uses Platform M PI S e MPI _WORKDI R directory option ghost host_name Each portion of the command line wherea host or hosts are not explicitly specified is run under the default context ghost host_name sets this default context to host_name with np 1 ghosts num hostA numA hostB numB This option issimilar to ghost but sets the default context to the specified list of hosts and np settings Unspecified np settings are either 1 or whatever was specified in cores number if used gmachinefile file Thisoption issimilarto ghosts butthehostx numx settings areread from thespecified file The following options are those whose contexts only affect the current portion of the command line np number Specifies the number of ranks to launch onto whatever hosts are represented by the current context host host_name Sets the current context to host_name with np 1 hosts num hostA numA hostB numB This option is similar to ghosts and sets the current context machinefile file Thisoption issimilar to hosts but thehostx numx settings are read from the specified file wdir dir 100 Platf
288. rint Dimensions at first l rank 0 intf Dimensions d d n dims 0 dims 1 o Oa ee ig ee eee MPI Barrier comm Each process prints ite pProvile printf global rank d cartesian rank d coordinate d d n grank Irank coords 0 coords 1 I Program body 1 Define a torus topology and demonstrate shift operations void body void Node node node profile node print node shift NORTH node print Platform MPI User s Guide 197 Example Applications node shift EAST node print node shi ft SOUTH node print node shi ft WEST node print Ii Main program it is probably a good programming practice to cal MPI _Init and MPI Finalize here i int main int argc char argv MPI _Init amp argc amp argv body MPI Finalize cart output The output from running the cart executable is shown below The application was run with np 4 Dimensions 2 2 rank 0 cartesian rank 0 coordinate 0 rank 1 cartesian rank 1 coordinate 0 rank 3 cartesian rank 3 coordinate 1 rank 2 cartesian rank 2 coordinate 1 holds holds toy o Te v ov 0 1 5 0 ooo e a ea WQ Q Q PRPOORP POOH OPROrRHOOHrHOOD oQ o a wn POR ORPORPOR POOH OHROOFRHo gt o Q wn WRNYODNDOWHOHPNYWHWONMYHWNOD communicator c This C example shows how to make a copy of the default communicator MP _ COMM
289. ronment variables For example on Linux MPI_ROOT bin mpirun e MPl_FLAGS y40 f appfile In the above example if an M PI_ FLAGS setting was specified in theappfile then the global setting on the command line would override the setting in the appfile To add to an environment variable rather than replacing it use V AR asin the following commana MPI_ROOT bin mpirun e MPI_FLAGS MPI_FLAGS y f appfile In the above example if the appfile specified M PI_FLA GS z then the resulting M P _ FLAGS seen by the application would bez y MPI_ROOT bin mpirun e LD_LIBRARY_PATH LD_LIBRARY_PATH path to third party lib f appfile In the above example the user is appending to LD_LIBRARY_PATH Setting environment variables in an pcmpi conf file Platform M PI supports setting environment variables in an pc mpi conf file These variables are read by mpi run and exported globally as if they had been included on the mpi r un command line as e VAR VAL settings Thepc mpi conf file search is performed in three places and each one is parsed which allows the last one parsed to overwrite values set by the previous files The locations are MPI_ROOT etc pcmpi conf fetc pcmpi conf HOME pcmpi conf This feature can be used for any environment variable and is most useful for interconnect specifications A collection of variables is available that tells Platform M PI which interconnects to search for and which libraries and module
290. rray of displacements relative to recvbuf N recvtype data type of receive buffer elements handle IN comm communicator handle int MPI_Alltoallwlk void sendbuf MPI Aint sendcounts MPI Aint sdispls MPI Datatype sendtypes void recvbuf MPI Aint recvcounts MPI Aint rdispls MPI Datatype recvtypes MP Comm comm N sendbuf starting address of send buffer choice IN Sendcounts array equal to the group size specifying the number of elements to send to each rank IN sdispls array of displacements relative to sendbuf IN sendtypes array of datatypes with entry j specifying the type of data to send to process j OUT recvbuf address of receive buffer choice IN recvcounts array equal to the group size specifying the number of elements that can be received from each rank IN rdispls array of displacements relative to recvbuf IN recvtypes array of datatypes with entry j specifying the type of data recieved from process j IN comm communicator handle int MPI BcastL void buffer MPI Aint count MPI Datatype datatype int root MPI Comm comm INOUT buffer starting address of buffer choice IN count number of entries in buffer IN datatype data type of buffer handle IN root rank of broadcast root IN comm communicator handle int MPI GatherL void sendbuf MPI Aint sendcount MPI Datatype sendtype void recvbuf MPI Aint recvcount MPI Datatype recvtype int root MPI Comm comm IN sendbuf starting address of send
291. rstanding Platform MPI Platform M PI also provides implied s run mode Theimplied sr un mode allows the user to omit the srun argument from the mpi r un command line with the use of the environment variable MPI_USESRUN LSF on HP XC Systems Platform M PI jobscan besubmitted using LSF LSF usestheSLURM s r un launching mechanism Because of this Platform M PI jobs must specify the srun option whether LSF is used or sr un is used bsub I n2 MPI_ROOT bin mpirun srun a out LSF on Non HP XC Systems On non HP XC systems to invoketheParallel A pplication M anager PAM featureof LSF for applications where all processes execute the same program on the same host bsub lt sf_options gt pam mpi mpirun lt mpirun_options gt program lt args gt Apptfiles An appfileis a text file that contains process counts and a list of programs When you invoke mpi run with the name of the appfile mpi r un parses the appfile to get information for the run Creating an appfile The format of entries in an appfile is line oriented Lines that end with the backslash character are continued on thenextline formingasinglelogical line A logical linestarting with the pound character is treated as a comment Each program along with its arguments is listed on a separate logical line The general form of an appfile entry is h remote host e var val sp paths np program args where h remote_host Specifiestheremo
292. run aS MP _ROOT bin mpirun prot srun n4 a out and theinterconnect decision looks for the presence of Elan and uses it if found Otherwise interconnects are tried in the order specified by MPI_IC_ORDER 130 Platform MPI User s Guide Understanding Platform MPI The following is an example of using TCP over GigE assuming GigE isinstalled and 192 168 1 1 corresponds to the Ethernet interface with GigE The implicit use of netaddr 192 168 1 1 is required to effectively get TCP over the proper subnet export MPI_IC_ORDER ibv vapi udapl psm mx gm elan itapi TCP export MPIRUN_SYSTEM_OPTIONS netaddr 192 168 1 1 MPI_ROOT bin mpirun prot TCP srun n4 a out MPI_IC_SUFFIXES When Platform M PI is determining the availability of a given interconnect on Linux it tries to open libraries and find loaded modules based on a collection of variables The use of interconnect environment variables MPI_ICLIB_ELAN MPI_ICLIB_GM MPI_ICLIB_ITAPI MPI_ICLIB MX MPI_ICLIB_ UDAPL MPI_ICLIB_VAPI and MPI_ICLIB_VAPIDIR has been deprecated MPI_COMMD MPI_COMMD routes all off host communication through daemons rather than between processes The MPI _COMMD syntax is as follows out_frags in_frags where out_frags Specifies the number of 16 KB fragments available in shared memory for outbound messages Outbound messages are sent from processes on a given host to processes on other hosts using the communication daemon Thedefault valuefor
293. rver redundant license key for Platform M PI 7 1 for Linux contact Platform Computing For moreinformation see your license certificate 32 Platform MPI User s Guide Getting Started Installing license files A valid license file contains the system host ID and the associated license key License files can be named asl icense dat oranynamewith extension of i c for example mpi i c Copy thelicensefile under thedirectory opt platform mpi licenses The command to run the license server is MPI_ROOT bin licensing lt arch gt Imgrd c mpi lic License testing To check for a license build and run the hello_world programin MP ROOT hel p hello_world c If your system isnot properly licensed you receive the following error message MPI BUG Valid MPI license not found in search path Merging licenses Newer Platform M PI licenses usethe INCREMENT featurewhich allows separate Platform M PI licenses to be used in combination by concatenating files For example License 1 SERVER myserver 0014c2c1f34a DAEMON HPQ INCREMENT platform mpi Isf_ld 1 0 permanent 8 9A40ECDE2A38 NOTICE License Number AAAABBBB1111 SI GN E5CEDE3E5626 License 2 SERVER myserver 0014c2c1f34a DAEMON HPQ INCREMENT platform mpi Isf_ld 1 0 permanent 16 BE468B74B592 NOTI CE Li cense Number AAAABBBB2222 SI GN 9AB4034C6CB2 Here License lisfor 8 ranks and License2 isfor 16 ranks T hetwo licenses can be combined into a single file SERVER
294. s NDIMS dims Establish a cartesian topology communicator 196 Platform MPI User s Guide Example Applications or i 0 i lt NDIMS i periods i 1 Pl Cart _create MPI COMM WORLD NDIMS dims periods 1 amp comm Initialize the data PI Comm rank MP COMM WORLD amp grank if comm MPI COMM NULL rank MPI _PROC_NULL data 1 else Pl _Comm_rank comm amp l rank data rank Pl _ Cart _coords comm Irank NDIMS coords A destructor ode Node void if comm MPI COMM NULL P Comm _free amp comm Shift function void Node shift Direction dir i comm MPI COMM NULL return ini direction GSM SC desire i dir NORTH direction 0 disp 1 else if dir SOUTH direction 0 disp 1 else if dir EAST direction 1 disp 1 YY else i direction 1 disp 1 MPI_Cart_shift comm direction disp amp src amp dest MPI Status stat MPI Sendrecv_replace amp data 1 MPI_INT dest 0 src 0 comm amp stat II Synchronize and print the data being held void Node print void if comm MPI COMM NULL MPI Barrier comm if lrank 0 puts line feed MPI _Barrier comm printf d d holds d n coords 0 coords 1 data Print object s profile id Node profile void om Non member does nothing comm MPI COMM NULL return P
295. s 0 frame 0 TX packets 135950325 errors 0 dropped 0 overruns 0 carrier 0 collisions 0 txqueuel en 1000 RX bytes 24498382931 23363 4 Mb TX bytes 29823673137 28442 0Mb Interrupt 31 Platform MPI User s Guide 87 Understanding Platform MPI Running applications on Windows Building and running multihost on Windows HPCS clusters The following is an example of basic compilation and run steps to executehe o_ world c on a cluster with 16 way pllelism To build and run hel o_worl d c on a HPCS cluster 1 Change to a writable directory on a mapped drive Share the mapped drive to a folder for the cluster 2 Open a Visual Studio command window This example uses a 64 bit version so a Visual Studio 64 bit command window is opened 3 Compilethehello_world executable file X demo gt set MPI_CC cl X demo gt MPIROOT bin mpicc mpi64 MPI_ROOT help hello_world c Microsoft C C Optimizing Compiler Version 14 00 50727 42 for 64 bit Copyright Microsoft Corporation All rights reserved hello _world c Microsoft Incremental Linker Version 8 00 50727 42 Copyright Microsoft Corporation All rights reserved out hello world exe Tlibpath C Program Files x86 Platform Computing Platform MPI Iib subsystem console li bpcmpi64 lib libmpio64 lib hello_world obj 4 Create anew job requesting the number of CPUs to use Resources are not yet allocated but the job is given aJOBID number which is
296. s and other countries Other products or services mentioned in this document are identified by the trademarks or service marks of their respective owners http www platform com Company third part license htm http www platform com Company T hird Party Copyright htm About TMS GUIDG garinas ara aa E situcteseheaneca iden nicatnenmn earned sumierien 5 Platforms SUPPOMEO cisrianis onana ne a E aaa AAE REE 6 DOCUMENTATION FESOURCES rennarin r a EE a 10 Credits arns A E Gules Creatine tig 11 IMEFOCUCTION aiai ANAA E AE E NA EA 13 The message passing Model cceceeccceeeeeeeeeceeeeeeeteeceeeeeenesaecaeeeeeeaeaeeeeeeesseeaaeenenteeed 14 WIP TL CONCEDIS esac tiaron aE RAAE EEEE O AE EE EAEE 15 Getting Started mooien aE ia aa a Ar nectsancteteauieds 27 Getting started USING LINUX ssssecesesncsseesscessunsteeesagetonneed aieeaa a aidia aaa 28 Getting started USING WINGOWS 0 cccceecceeeeceeeeeeeeeeeeeeeeeeeceaaeeesecaeeeeeeeeeeeenaeeeesinaeeeeeaes 35 Understanding Platform MPI ccccccccecesceeeseeeeeeeeceeeeseaeeeseceeeeeseceeeeseaaeeeeeaaeeseecaeeeseneeeseseaeees 49 Compilation wrapper Script utilities 02 2 cece eceeeeeeceeeeeeeeeeeeeeeeeeeeeeeseaeeeeesaaeeesenaaeeeee 50 C DINdINGS for LINUX sescesacstceetecesiccesseected sent esaatcorensatyersbbenadgeediideued ante an 54 Autodouble functionality ccccccsceeeeeeeeeeeee cece essences eeeaeeseeeeeeeeaaeeseceeeeseaaeeseeaeeeeeaaeeee 56 MPI TUNCHOMNS zrenia
297. s on HP XC systems using SLURM ssrun Fortran 90 programming features TheM PI 1 1 standard defines bindings for Fortran 77 but not Fortran 90 176 Platform MPI User s Guide Debugging and Troubleshooting Although most Fortran 90 M PI applications work using the Fortran 77 M PI bindings some Fortran 90 features can cause unexpected behavior when used with Platform M PI In Fortran 90 an array is not always stored in contiguous memory When noncontiguous array data is passed to a Platform M PI subroutine Fortran 90 copies the data into temporary storage passes it to the Platform M PI subroutine and copies it back when the subroutine returns As a result Platform M PI is given the address of the copy but not of the original data In some cases this copy in and copy out operation can cause a problem For anonblocking Platform M PI call the subroutine returns immediately and the temporary storage is deallocated W hen Platform M PI tries to access the already invalid memory the behavior is unknown Moreover Platform M PI operates close to the system level and must know the address of the original data H owever even if the address is known Platform M PI does not know if the data is contiguous or not UNIX open file descriptors UNIX imposes a limit to the number of file descriptors that application processes can have open at one time When running a multihost application each local process opens a socket to each remote process A Platf
298. s searched for and found first TCP should always be available So TCP IP is used instead of IBV or Elan etc MPI_ROOT bin mpirun TCP srun n4 a out The following example output shows three runs on an Elan system first using Elan as the protocol then using TCP IP over GigE then using TCP IP over the Quadrics card This runs on Elan user optel0 user bsub I n3 ext SLURM nodes 3 MPI_ROOT bin mpirun prot srun a out Job lt 59304 gt is submitted to default queue lt normal gt lt lt Waiting for dispatch gt gt lt lt Starting on Isfhost localdomain gt gt Host 0 ELAN node 0 ranks 0 Host 1 ELAN node 1 ranks 1 Host 2 ELAN node 2 ranks 2 host 0 1 2 s 0 SHM ELAN ELAN 1 ELAN SHM ELAN 2 ELAN ELAN SHM Hello world I m O of 3 on opte6 Hello world I m 1 of 3 on opte7 Hello world I m 2 of 3 on opte8 This runs on TCP IP over the GigE network configured as 172 20 x x on ethO user optel0 user bsub I n3 ext SLURM nodes 3 MPI_ROOT bin mpirun prot TCP srun a out Job lt 59305 gt is submitted to default queue lt nor mal gt lt lt Waiting for dispatch gt gt lt lt Starting on Isfhost ocaldomain gt gt Host 0 ip 172 20 0 6 ranks 0 Host 1 ip 172 20 0 7 ranks 1 Host 2 jp 172 20 0 8 ranks 2 host 0 1 2 0 SHM TCP TCP 1 TCP SHM TCP 2 TCP TCP SHMHello world I m0O of 3 on opte6 Hello world I m 1 of
299. s that multihost applications do profiling in a consistent manner Counter instrumentation and trace file generation are mutually exclusive profiling techniques Note When you enable instrumentation for multihost runs and invoke mpi run ona host where an MPI process is running or on a host remote from all MPI processes Platform MPI writes the instrumentation output file prefix instr to the working directory on the host that is running rank 0 or the lowest rank remaining if ha is used TOTALVIEW When you use the T otallV iew debugger Platform M PI uses your PATH variable to find T otalView You can also set the absolute path and T otalView options in the TOTALVIEW environment variable This environment variable is used by mpi run setenv TOTALVIEW opi totalview bin totalview Interconnect selection environment variables MPI_IC_ORDER MPI_IC_ORDER is an environment variable whose default contents are ibv vapi udapl psm mx gm elan itapi T CP and instructs Platform M PI to search in a specific order for the presence of an interconnect Lowercase selections imply use if detected otherwise keep searching Anuppercase option demands that theinterconnect option beused if it cannot beselected the application terminates with an error For example export MPI_IC_ORDER ibv vapi udapl psm mx gm elan itapi TCP export MPIRUN_OPTIONS prot MPI_ROOT bin mpirun srun n4 a out The command line for the above appears to mpi
300. s to look for with each interconnect These environment variables are the primary use of pcmpi conf Syntactically single and double quotesinpcmpi conf can be used to create values containing spaces If avalue containing a quote is needed two adjacent quotes are interpreted as a quote to be included in the value When not contained in quotes spaces are interpreted as element separatorsin alist and are stored as tabs 118 Platform MPI User s Guide Understanding Platform MPI This explanation of the pc mpi conf file is provided only for awareness that this functionality is available Making changes to the pc mpi conf file without contacting Platform MPI support is strongly discouraged ae environment variables on Windows for HPC jobs For Windows HPC jobs environment variables can be set from the GUI or on the command line From the GUI use the Task Properties window Environment tab to set an environment variable Platform MPI User s Guide 119 Understanding Platform MPI Task Properties PMPI_ROOT Msharedsaltemate location Note These environment variables should be set on the mpirun task Environment variables can also be set using the flag env For example gt job add JOBID numprocessors 1 env MPI_ROOT shared alternate location 120 Platform MPI User s Guide Understanding Platform MPI List of runtime environment variables The environment variables that affect the behavior of Pl
301. se file specifications can include the substrings h p and r which are expanded to host name process id and rank number in MPI _COMM_WORLD Thefiles option causes the stdio options p r and to beignored none This option is equivalent to setting stdio files with MPI_STDIO_INFILE MPI_STDIO_OUTFILE and MPI_STDIO_ERRFILE all set to dev null orNuL Completing In Platform M PI MPI_Finalizeis abarrier like collective routinethat waits until all application processes have called it before returning If your application exits without calling MPI_ Finalize pending requests might not complete When running an application mpi run waits until all processes have exited If an application detects an MPI error that leads to program termination it calls M PI_A bort instead You might want to code your error conditions using M PI_A bort which cleans up the application Each Platform M PI application is identified by a job ID unique on the server where mpi r un is invoked If you use the j option mpi run prints thejob ID of the application that it runs Then you can invoke mpi j ob with thejob ID to display the status of your application 178 Platform MPI User s Guide Debugging and Troubleshooting If your application hangs or terminates abnormally you can use mpi cl ean to kill lingering processes and shared memory segments mpi cl ean usesthejobID from mpirun j to specify the application to terminate Testing the network on
302. ser written error handler or with MPI_ERRORS_RETURN that masks a problem Turns on language interoperability for the MPI_BOTTOM constant MPI_BOTTOM Language Interoperability Previous versions of Platform M PI were not compliant with Section 4 12 6 1 of the M PI 2 Standard which requires that sends receives based at MPI_ BOTTOM on a data type created with absolute addresses must access the same data regardless of the language in which the data type was created For compliance with the standard set M PI_ FLAGS i to turn on language interoperability for the MPI_BOTTOM constant Compliance with the standard can break source compatibility with some M PICH code 122 Platform MPI User s Guide yi Understanding Platform MPI Selects signal and maximum time delay for guaranteed message progression Thes a option selectsSI GALRM Thesp option selects s GP ROF The option isthe number of seconds to wait before issuing a signal to trigger message progression Thedefault value for the M PI library iss p0 which never issues a progression related signal If the application uses both signals for its own purposes you cannot enable the heartbeat signals This mechanism can be used to guarantee message progression in applications that use nonblocking messaging requests followed by prolonged periods of timein which Platform M PI routines are not called Generating aU NIX signal introduces a performance penalty every timethe application processes
303. server is substituted behind the scenes The mapping may also be set up to refer only to a specific folder on the remote machine not the entire drive message bin A message bin stores messages according to message length Y ou can define a message bin by defining the byte range of the message to be stored in the bin use the M PI_INSTR environment variable message passing model M odel in which processes communicate with each other by sending and receiving messages Applications based on message passing are nondeterministic by default H owever when one process sends two or more messages to another the transfer is deterministic as the messages are always received in the order sent MIMD M ultipleinstruction multiple data Category of applicationsin which many instruction streams are applied concurrently to multiple data sets MPI M essage passing interface Set of library routines used to design scalable parallel applications These routines provide a wide range of operations that include computation communication and synchronization M PI 2 is the current standard supported by major vendors MPMD M ultipledata multipleprogram mplementations of Platform M PI that usetwo or more separate executables to construct an application Thisdesign stylecan beused to simplify the application source and reduce the size of spawned processes Each process may run a different executable multilevel parallelism Refers to multithreaded proces
304. ses that call M PI routinesto perform computations This approach isbeneficial for problemsthat can bedecomposed into logical parts for parallel execution for example alooping construct that spawns multiple threads to perform a computation and then joins after the computation is complete multihost 248 Platform MPI User s Guide Glossary A modeof operation for an M PI application whereacluster is used to carry out a parallel application run nonblocking receive Communication in which the receiving process returns before a message is stored in the receive buffer N onblocking receives are useful when communication and computation can be effectively overlapped in an M PI application Use of nonblocking receives may also avoid system buffering and memory to memory copying nonblocking send Communication in which the sending process returns before a message is stored in the send buffer N onblocking sends are useful when communication and computation can be effectively overlapped in an M PI application non determinism A behavior describing non repeatable parameters A property of computations which may have more than one result The order of a set of events depends on run time conditions and so varies from run to run OpenFabrics Alliance OFA A not for profit organization dedicated to expanding and accelerating the adoption of Remote Direct Memory Access RDM A technologies for server and storage connectivity OpenFabrics
305. sing port 5004 C gt MPIROOT sbin PCMPIWin32Service exe stop Service stopped C gt MPIROOT sbin PCMPIWin32Service exe setportkey 5004 Setting Default Port key PCMPI Port Key set to 5004 C gt MPLROOT sbin PCMPIWin32Service exe start Service started 94 Platform MPI User s Guide Understanding Platform MPI For additional Platform M PI Remote Launch service options use help Usage pempiwin32service exe cmd pm where cmd can be one of the following commands h help show command usage S status show service status k removeeventkey remove service event log key r removeportkey remove default port key t setportkey lt port gt remove default port key i install lt port gt remove default port key start start an installed service stop stop an installed service restart restart an installed service Note All remote services must use the same port If you are not using the default port make sure you select a port that is available on all remote nodes Run time utility commands Platform M PI provides a set of utility commands to supplement M PI library routines mpidiag tool for Windows 2003 XP and Platform MPI Remote Launch Service Platform M PI for Windows 2003 XP includes the mpi di ag diagnostic tool It is located in MP ROOT bi n mpi daig exe This tool is useful to diagnose remote service access without running mpi r un To use the
306. spin yield to ensure that the process relinquishes the CPU to other processes Do thisin your appfile by settingy toy 0 for theprocessin question This specifies zero milliseconds of spin that is immediate yield If you arerunningan application stand alone on a dedicated system the default setting M PI _FLAGS y allows M PI to busy spin improving latency To avoid unnecessary CPU consumption when using more ranks than cores consider using a setting such as MPI_FLAGS y40 Specifying y without a spin value is equivalent to M PI_FLAGS y10000 which is the default Platform MPI User s Guide 123 Understanding Platform MPI Note Except when using sr un orprun to launch if the ranks under a single mpid exceed the number of CPUs on the node and a value of MPI_FLAGS y is not specified the default is changed to MPI_FLAGS y0 If the time a process is blocked waiting for messages is short you can possibly improve performance by setting a spin value between 0 and 10 000 that ensures the process doesnot relinquish theC PU until after themessageisreceived thereby reducinglatency Thesystem treats a nonzero spin valueas a recommendation only lt does not guarantee that the value you specify is used Writes an optimization report to stdout MPI_Cart_createand MPI_Graph_create optimize the mapping of processes onto the virtual topology only if rank reordering is enabled set reorder 1 In the declaration statement below see reor
307. splacements nteger adtype array of datatypes allocatable rbs rbe cbs cbe rdtype cdtype twdtype ablen adisp adtype integer rank l rank iteration counter nteger comm_size number of MPI processes nteger comm rank sequential ID of MPI process nteger ierr MPI error code nteger mstat mpi_status_size MPI function status nteger src source rank nteger dest destination rank integer dsize size of double precision in bytes double precision startt endt elapsed time keepers external compcolumn comprow subroutines execute in threads G c MPI initialization G call mpi_init ierr call mpi_ comm size mpi_ comm world comm size ierr call mpi_ comm rank mpi_ comm world comm rank ierr c c Data initialization and start up c if comm rank eq 0 then wiee A aa n FON aC AAM call getdata nrow ncol array write 6 Start computati on endif call mpi_barrier MPI_ COMM_WORLD ierr startt mpi_wti me G c Compose MPI datatypes for row column send receive c c Note that the numbers from rbs i to rbe i are the indices c of the rows belonging to the i th block of rows These indices E specify a portion the i th portion of a column and the c datatype rdtype i is created as an MPI contiguous datatype c to refer to the i th portion of a column Note this is a E contiguous datatype because fortran arrays are stored c column wise c E For a range of columns to specify portions of rows the situation c is similar he numbers from cbs j
308. st down to 32 bit values For Platform MPI that restriction is removed To enable Platform M PI support for these extensions to the M PI 2 1 standard non standard ext must be added to the command line of the Platform M PI compiler wrappers mpi CC mpicc mpif 90 mpi f 77 asin the following example opt platform_mpi bin mpicc non standard ext large_count_test c The non standard ext flag must be passed to the compiler wrapper during the link step of building an executable The following is a complete list of large message interfaces supported Point to point communication MPI BsendL void buf MPI Aint count MPI Datatype datatype dest int tag MPI Comm comm in in IN buf initial address of send buffer Platform MPI User s Guide 219 Large message APIs count number of elements in send buffer datatype datatype of each send buffer element dest rank of destination tag message tag comm communi cat or nt MPI Bsend initL void buf MPI Aint count MPI Datatype datatype int dest int tag MPI _ Comm comm MPI Request request N buf initial address of send buffer choice IN count number of elements sent non negative integer IN datatype type of each element handle IN dest rank of destination integer IN tag message tag integer IN comm communicator handle OUT request communication request handle in P _Buffer_attachL void buf MPI_Aint size IN buffer initial buffer address choice IN siz
309. sters with as many as 2048 ranks usingtheV API protocol M ost Platform M PI features function in a scalable manner H owever the following are still subject to significant resource growth as the job size grows Table 17 Scalability Feature Affected Interconnect Scalability Impact Protocol spawn All Forces use of pairwise socket connections between all mpid s typically one mpid per machine one sided shared lock All except VAPI and IBV Only VAPI and IBV provide low level calls to efficiently unlock implement shared lock unlock All other interconnects require mpid s to satisfy this feature one sided exclusive All except VAPI IBV and VAPI IBV and Elan provide low level calls that allow lock unlock Elan Platform MPI to efficiently implement exclusive lock unlock All other interconnects require mpid s to satisfy this feature one sided other TCP IP All interconnects other than TCP IP allow Platform MPI to efficiently implement the remainder of the one sided functionality Only when using TCP IP are mpid s required to satisfy this feature Resource usage of TCP IP communication Platform M PI has been tested on large Linux TCP IP clusters with as many as 2048 ranks Because each Platform M PI rank creates a socket connection to each other remote rank the number of socket descriptors required increases with thenumber of ranks O n many Linux systems this requiresincreasing the operating system limit on per process and system
310. submittal to schedule my job whilein C tmp but thejob won t run Why ANSWER The automatic job submittal sets the current working directory for the job to the current directory equivalent to using e M PI_WORKDIR lt path gt Because the remote compute nodes cannot access local disks they need aUNC path for the current directory Platform M PI can convert the local drive to a UNC path if the local drive is a mapped network drive So running from the mapped drive instead of the local disk allows Platform M PI to set a working directory to avisible UNC path on remote nodes QUESTION I run a batch script before my MPI job but it fails W hy ANSWER Batch files run in acommand window W hen the batch file starts Windows first starts a command window and tries to set the directory to the working directory indicated by the job This is usually a UNC path so all remote nodes can see this directory But command windows cannot change a directory to a UNC path One option is to use V BScript instead of bat files for scripting tasks 244 Platform MPI User s Guide Glossary application In the context of Platform M PI an application isoneor more executable programs that communicate with each other via M PI calls asynchronous Communication in which sending and receiving processes placeno constraints on each other in terms of completion Thecommunication operation between thetwo processes may also overlap with computation bandwid
311. synchronous mode M PI_Ssend and ready mode MPI_Rsend The modes are all invoked in a similar manner and all pass the same arguments shared memory model M odel in which each process can access a shared address space Concurrent accesses to shared memory are controlled by synchronization primitives SIMD Single instruction multiple data Category of applications in which homogeneous processes execute the same instructions on their own data 250 Platform MPI User s Guide Glossary SMP Symmetric multiprocessor A multiprocess computer in which all the processors have equal access to all machine resources Symmetric multiprocessors have no manager or worker processes spin yield Refers to a Platform M PI facility that allows you to specify the number of milliseconds a process should block spin waiting for a message before yielding the CPU to another process Specify a spin yield valuein the M PI_ FLAGS environment variable SPMD Single program multiple data Implementations of Platform M PI wherean application is completely contained in a single executable SPM D applications begin with the invocation of a single process called the master The master then spawns some number of identical child processes The master and the children all run the same executable standard send mode Form of blocking send where the sending process returns when the system can buffer the message or when the message is received stride Constant am
312. t dtype root comm Introduction The syntax of the M PI collective functions is designed to be consistent with point to point communications but collective functions are more restrictive than point to point functions Important restrictions to keep in mind are The amount of data sent must exactly match the amount of data specified by the receiver Collective functions come in blocking versions only Collective functions do not use a tag argument meaning that collective calls are matched strictly according to the order of execution Collective functions comein standard mode only For detailed discussions of collective communications see Chapter 4 Collective Communication in the MPI 1 0 standard The following examples demonstrate the syntax to code two collective operations a broadcast and a scatter To codea broadcast use MPI_Bcast void buf int count MPI_Datatype dtype int root MPI_Comm comm where Specifies the starting address of the buffer Indicates the number of buffer entries Denotes the datatype of the buffer entries Specifies the rank of the root Designates the communication context that identifies a group of processes For example compute_pi f usesMPI_BCAST to broadcast oneinteger from process 0 to every process inMPI_COMM_WORLD To codea scatter use MPL Scatter void sendbuf int sendcount MP _Datatype sendtype void recvbuf int recvcount MPI_Datatype recviype int root MPI_Comm com
313. t gt lt lt Starting on Isfhost ocaldomai n gt gt Hello world m0O of 2 on n10 Hello world I m 1 of 2 on n10 Including and excluding specific nodes can be accomplished by passing argumentsto SLURM as well For example to make sure a job includes a specific node and excludes others use something like the following In this case n9 is a required node and n10 is specifically excluded bsub l n8 ext SLURM nodelist n9 exclude n10 mpirun srun hello_world Job lt 1892 gt is submitted to default queue lt interactive gt lt lt Waiting for dispatch gt gt lt lt Starting on Isfhost ocaldomai n gt gt Hello world m0O of 8 on n8 Hello world m1 of 8 on n8 Hello world m6 of 8 on nl2 Hello world m2 of 8 on n9 Hello world m4 of 8 on nll Hello world m7 of 8 on nl2 Hello world m3 of 8 on n9 Hello world m5 of 8 on nll In addition to displaying interconnect selection information the mpi r un prot option can be used to verify that application ranks have been allocated in the required manner bsub I n12 MPl_ROOT bin mpirun prot srun n6 N6 a out Job lt 1472 gt is submitted to default queue lt interactive gt lt lt Waiting for dispatch gt gt lt lt Starting on Isfhost ocaldomai n gt gt hosu WM co iy yA AO W 8 oo aS Host 1 I de 20 0 9 2 felis d Host 2 se ip 172 20 0 10 ranks 2 host ce iy 172 20 0 1 co ranks Host 4 ip 172 20 0 12 ranks 4 Host 5 ip
314. t mpi r un and the server Platform M PI Remote Launch service negotiate the security package to be used for authentication token lt token name gt tg lt token name gt Authenticates to this token with the Platform M PI Remote Launch service Some authentication packages require a token name T he default is no token pass Prompts the user for his domain account password U sed to authenticate and create remote processes A password is required to allow the remote process to access network resources such as file shares The password provided is encrypted using SSP for authentication The password is not cached when using this option cache Prompts the user for the domain account password U sed to authenticate and create remote processes A password is required to allow the remote process to access network resources such as file shares The password provided is encrypted using SSP for authentication Thepassword is cached so that futurempi r un commands usethecached password Passwords are cached in encrypted form using Windows Encryption APIs nopass Executes the mpi run command with no password If a password is cached it is not accessed and no password is used to create the remote processes U sing no password results in the remote processes not having access to network resources This option also suppresses the no password cached warning Thisisuseful when no password iswanted for SM P jobs iscached Indicat
315. t c Microsoft R C C Optimizing Compiler Version 14 00 50727 762 for x64 Copyright C Microsoft Corporation All rights reserved client c Microsoft R Incremental Linker Version 8 00 50727 762 Copyright C Microsoft Corporation All rights reserved out client exe Tlibpath C Program Files x86 Platform MPI 1ib subsystem console li bpcmpi64 lib libmpio64 lib client obj 5 Create an appfile that uses your executables For example create the following appfile appfile txt np 1 h nodel server exe np 1 h nodel client exe np 2 h node2 client exe np 2 h node3 client exe This appfile runs one server rank on nodel and 5 client ranks one on nodel two on node2 and two on node3 6 Submit the job using appfile mode X wor k gt MPIROOT bin mpirun ccp f appfile txt This submits the job to the scheduler allocating the nodes indicated in the appfile Output and error files defaults to appfile lt j OBI D gt lt TASKI D gt out andappfile lt j OBI D gt lt TASKI D gt err respectively These file names can be altered using the ccpout and ccperr flags 7 Check your results Assuming the job submitted was job ID 98 thefileappfile 98 1 out was created The file content is X Demo gt type appfile 98 1 out Hello world Client I m 2 of 6 on node2 Hello world Client I m 1 of 6 on nodel Hello world Server I m 0O of 6 on nodel Hello world Client I m 4 of 6 on node3 Hello
316. t to int tag MPI Comm comm printf Calling C MPI _Send to d n to return PMPI Send buf count type to tag comm pragma weak mpi_ send mpi_send ola mol Sena Vold Wii Imi COUN Tie yoe Wink to rit cag tik Comm Wile Tern printf Calling Fortran MPI Send to d n to pmpi_send buf count type to tag comm ierr C profiling interface The Platform M PI C bindings are wrappers to C calls No profiling library exists for C bindings To profile the C interface write the equivalent C wrapper version of the M PI library routines you want to profile For details on profiling the C M PI libraries see the section above Platform MPI User s Guide 159 Profiling 160 Platform MPI User s Guide Tuning This chapter provides information about tuning Platform M PI applications to improve performance The tuning information in this chapter improves application performance in most but not all cases Use this information together with the output from counter instrumentation to determine which tuning changes are appropriate to improve your application s performance When you develop Platform M PI applications several factors can affect performance T hese factors are outlined in this chapter Platform MPI User s Guide 161 Tuning Tunable parameters Platform M PI provides a mix of command line options and environment variables that can be used to influence the behavior and performance o
317. t variable is not supplied to mpi r un Platform M PI checks the subnet prefix for the first port it chooses determines that the subnet prefixes do not match prints the following message and exits pp x Rank 0 1 MPI_Init The IB ports chosen for IB connection setup do not have the Same subnet prefix Please provide a port GID that all nodes have IB path to it by MPI_ 1B_PORT_GID pp x Rank 0 1 MPI_Init You can get port GID using ibv_devinfo v MPI_IB_CARD_ORDER Defines mapping of ranks to IB cards setenv MPI_IB_CARD_ORDER lt card gt port where card Ranges from 0 to N 1 port Ranges from 0 to 1 Card port can be a comma separated list that drives the assignment of ranks to cards and ports in the cards Platform M PI numbers the ports on a card from 0 to N 1 whereas utilities such asvst at display ports numbered 1 to N Examples To usethe second IB card mpirun e MPL_IB_CARD_ORDER 1 To use the second port of the second card mpirun e MPLIB_CARD_ORDER 1 1 To use the first B card mpirun e MPLIB_CARD_ORDER2 0 To assign ranks to multiple cards mpirun e MPL_IB_CARD_ORDER 0 1 2 This assigns the local ranks per nodein order to each card mpirun hosilist host0 4 host1 4 This creates ranks 0 3 on host 0 and ranks 4 7 on host 1 It assigns rank 0 to card 0 rank 1 to card 1 rank 2 to card 2 rank 3 to card 0 all on host 0 It also assigns rank 4 to card 0 rank 5 to card 1 rank 6 to ca
318. tation can be turned on by using either the i option to mpi run or by setting the environment variable M PI_INSTR Instrumentation data includes someinformation on messages sent to other M PI worlds formed using M PI_Comm_accept MPI_Comm_connect or MPI_Comm_join All off world message data is accounted together using the designation of f w regardless of which off world rank was involved in the communication Platform M PI provides an API that enables users to access the lightweight instrumentation dataon aper process basis beforethe application calling M PI_ Finalize The following declaration in C is necessary to access this functionality extern int hpmp_instrument_runtime int reset A call to hpmp_instrument_runtime 0 populates the output file specified by the i option to mpi run or the M PI_INSTR environment variable with the statistics available at thetimeof thecall Subsequent callsto homp_instrument_runtime or M PI_ Finalize will overwrite the contents of the specified file A call to h gt mp_instrument_runtime Platform MPI User s Guide 107 Understanding Platform MPI T dbgspin 1 populates the filein the same way but also resets the statistics If instrumentation is not being used the call to homp_instrument_runtime has no effect For an explanation of i options refer to the mpi r un documentation Prints user and system times for each M PI rank Causes each rank of the M PI application to spin i
319. tax unix mpi32 compute_pi f o compute_pi_pgi32 exe Tocompilecompute_pi f usingIntel Fortran without usingthempi f 90 tool from a command prompt that has the relevant environment settings loaded for your Fortran compiler ifort compute_pi f I MPI_ROOT include 64 link out compute_pi exe libpath MPI_ROOT lib subsystem console libpcmpi64 lib Note Intel compilers often link against the Intel runtime libraries When running an MPI application built with the Intel Fortran or C C compilers you might need to install the Intel run time libraries on every node of your cluster We recommend that you install the version of the Intel run time 52 Platform MPI User s Guide Understanding Platform MPI libraries that correspond to the version of the compiler used on the MPI application Platform MPI User s Guide 53 Understanding Platform MPI C bindings for Linux Platform M PI supports C bindings as described in the M PI 2 Standard If you compile and link with the mpi CC command no additional work is needed to include and use the bindings You can include mpi h ormpi CC h in your C source files Thebindings provided by Platform M PI arean interfaceclass callingthe equivalent C bindings To profile your application you should profile the equivalent C bindings If you build without the mpi CC command include ImpiCC to resolve C references To use an alternate i bmpi CC a with mpi CC usethe mpiCClib lt L
320. tehost wherearemoteexecutablefileisstored Thedefault isto search the local host remote_host is a host nameor an IP address e var val Sets the environment variable var for the program and gives it the value val The default isnot to set environment variables W hen you use ewith the h option theenvironment variable is set to val on the remote host sp paths Sets the target shell PATH environment variable to paths Search paths are separated by a colon Both sp path and e PATH path do the same thing If both are specified the e PATH path setting is used np Specifies the number of processes to run The default value for is 1 program Specifies the name of the executable to run mpi r un searches for the executable in the paths defined in the PATH environment variable Platform MPI User s Guide 75 Understanding Platform MPI args Specifies command lineargumentsto the program Optionsfollowingaprogram name in your appfile are treated as program arguments and are not processed by mpi run Adding program arguments to your appfile W hen you invokempi r un using an appfile arguments for your program are supplied on each line of your appfile Platform M PI also provides an option on your mpi run command line to provide additional program arguments to thosein your appfile Thisis useful if you want to specify extra arguments for each program listed in your appfile but do not want to edit your appfile To use an appfile when
321. terface Standard and M PI 2 Extensions to the M essage P assing Interface respectively You can access HTM L versions of the M PI 1 2 and 2 standards at http www mpi forum org This guide supplements the material in the M PI standards and M PI The Complete Reference Some sections in this book contain command line examples to demonstrate Platform M PI concepts These examples use the bin csh syntax Platform MPI User s Guide 5 About This Guide Platforms supported Table 1 Supported platforms interconnects and operating systems Platform Interconnect Operating System Intel IA 32 TCP IP Myrinet GM 2 and MX InfiniBand RDMA Ethernet OFED 1 0 1 1 1 2 1 3 1 4 uDAPL 1 1 1 2 2 0 QLogic PSM NIC Version QHT7140 QLE7140 Driver PSM 1 0 2 2 1 2 2 6 Platform MPI User s Guide Red Hat Enterprise Linux AS 4 0 and 5 0 SUSE Linux Enterprise Server 9 and 10 CentOS 5 Red Hat Enterprise Linux AS 4 0 and 5 0 SUSE Linux Enterprise Server 9 and 10 CentOS 5 Red Hat Enterprise Linux AS 4 0 and 5 0 SUSE Linux Enterprise Server 9 and 10 CentOS 5 Red Hat Enterprise Linux AS 4 0 and 5 0 SUSE Linux Enterprise Server 9 and 10 CentOS 5 Red Hat Enterprise Linux AS 4 0 and 5 0 SUSE Linux Enterprise Server 9 and 10 CentOS 5 Red Hat Enterprise Linux AS 4 0 and 5 0 SUSE Linux Enterprise Server 9 and 10 CentOS 5 Red Hat Enterprise Linux AS 4 0 and 5 0 SUSE Linux Enterprise Server 9 and 10 Cent
322. termines which interconnect to use before it knows the application s bitness To have proper network selection in that case specify if the application is 32 bit when runningon Opteron Intel64 machines MPI_ROOT bin mpirun mpi32 Testing the network on Windows Often clusters might have Ethernet and some form of higher speed interconnect such asInfiniBand This section describes how to use the ping_pong_ring c example program to confirm that you can run using the desired interconnect Running atest like this especially on a new cluster is useful to ensure that relevant network drivers are installed and that the network hardware is functioning If any machine has defective network cards or cables this test can also be useful for identifying which machine has the problem To compilethe program set theM PIl_ ROOT environment variable to the location of Platform M PI The defaultis C Program Files x86 Platform Computing Platform MPI for 64 bit systems and C Program Files Platform Computing Platform MPI for 32 bit systems This may already be set by the Platform M PI installation Open a command window for the compiler you plan on using This includes all libraries and compilers in the path and compiles the program using the mpi cc wrappers MPI_ROOT bin mpicc mpi64 out pp exe MPI_ROOT help ping_ping_ring c Use the start up for your cluster Y our situation should resemble one of the following If running on Windows H
323. tes processes using your encrypted password to obtain network resources If you do not provide a password if the password is incorrect or if you use nopass remote processes are created but do not have access to network shares In this example the remote process cannot read thehello_ world exe file 5 Analyze hello_world output Platform MPI prints the output from running the hello_world executable in nondeterministic order The following is an example of the output Hello world m1 of 4 on n01 Hello world I m 3 of 4 on n02 Hello world m0O of 4 on n01 Hello world m2 of 4 on n02 Directory structure for Windows All Platform M PI for Windows files are stored in the directory specified at installation The default directoryisC Program Files x86 Platform MPI If you move the Platform M PI installation directory from its default location set theM PI_ROOT environment variableto point to the new location The directory structure is organized as follows Table 8 Directory structure for Windows Subdirectory Contents bin Command files for Platform MPI utilities help Source files for example programs and Visual Studio Property pages include 32 32 bit header files include 64 64 bit header files lib Platform MPI libraries man Platform MPI manpages in HTML format devtools Windows Platform MPI services licenses Repository for Platform MPI license file doc Release notes and the Debugging with Platform MPI Tutorial Wi
324. th MPI_ROOT lib subsystem console libhpmpi64 lib 36 Platform MPI User s Guide mpicc bat Getting Started The PGI compiler uses amore U NIX like syntax From a PGI command prompt pgcc hello_world c I MPI_ROOT include 64 o hello_world exe L MPI_ROOT lib Ihpmpi64 Thempicc bat script links by default using the static run time libraries MT This behavior allows the application to be copied without any side effects or additional link steps to embed the manifest library When linking with M D dynamic libraries you must copy the generated lt filename gt exe manifest along with the exe dl fileor the following run time error will display This application has failed to start because MSVCR90 dIl1 was not found Re installing the application may fix this problem To embed the manifest fileinto exe dl usethemt tool For moreinformation see the M icrosoft Visual Studio mt exe tool The following example shows how to embed a mani f est fileinto an application C gt MPI_ROOT bin mpicc bat mpi64 MD hello_world c C gt mt manifest hello_world exe manifest outputresource hello_world exe 1 Fortran command line basics The utility MP1 ROOT bi n mpi f 90 isincluded to aid in command line compilation To compile with this utility set M PI_F90 to the path of the command line compiler you want to use Specify mpi32 or mpi64 to indicate if you are compiling a 32 or 64 bit application S
325. th Data transmission capacity of acommunications channel The greater a channel s bandwidth the more information it can carry per unit of time barrier Collective operation used to synchronize the execution of processes M PI_ Barrier blocks the calling process until all receiving processes have called it This is a useful approach for separating two stages of a computation so messages from each stage are not overlapped blocking receive Communication in which the receiving process does not return until its data buffer contains the data transferred by the sending process blocking send Communication in which the sending process does not return until its associated data buffer is availablefor reuse Thedatatransferred can becopied directly into thematching receive buffer or atemporary system buffer broadcast Oneto many collective operation where the root process sends a message to all other processes in the communicator including itself buffered send mode Platform MPI User s Guide 245 Glossary Form of blocking send wherethe sending process returns when the message is buffered in application supplied space or when the message is received buffering Amount or act of copying that asystem uses to avoid deadlocks A large amount of buffering can adversely affect performance and make M PI applications less portable and predictable cluster Group of computers linked together with an interconnect and software that functions c
326. th computing summations in row wise fashion For example the rank 2 process starts with the block that is on the Oth row block and 2nd column block denoted as 0 2 The block computed in the second step is 1 3 Computing the first row elements in this block requires the last row elements in the 0 3 block computed in the first step in the rank 3 process Thus the rank 2 process receives the data from the rank 3 process at the beginning of the second step Therank 2 process also sends the last row elements of the 0 2 block to the rank 1 process that computes 1 2 in the second step By repeating these steps all processes finish summations in row wise fashion the first outer loop in theillustrated program The second outer loop the summations in column wise fashion is done in the same manner For example at the beginning of thesecond step for thecolumn wisesummations the rank 2 process receives data from therank 1 process that computed the 3 0 block The rank 2 process also sends the last column of the 2 0 block to the rank 3 process Each process keeps the same blocks for both of the outer loop computations This approach is good for distributed memory architectures where repartitioning requires massive data communications that are expensive H owever on shared memory architectures the partitioning of the compute region does not imply data distribution The row and column block partitioning method requires just one synchronizatio
327. the ndd flag to mpi run In the default case where the ptmalloc contained in Platform M PI is used the above cases are avoided and lazy deregistration workscorrectly asis So the abovetunables are only recommended for applications with special requirements concerning their malloc free usage 150 Platform MPI User s Guide Understanding Platform MPI Signal propagation Linux only Platform M PI supports the propagation of signals from mpi r un to application ranks Thempi run executable traps the following signals and propagates them to the ranks SIGINT SIGTERM SIGABRT SIGALRM SIGFPE SIGHUP SIGILL SIGPIPE SIGQUIT SIGSEGV SIGUSR1 SIGUSR2 SIGBUS SIGPROF SIGSYS SIGTRAP SIGURG SIGVTALRM SIGPOLL SIGCONT SIGTSTP Ifprun srun is used for launching the application then mpi r un sends the signal to the responsible launcher and relies on the signal propagation capabilities of the launcher to ensure that the signal is propagated to theranks W hen usingpr un SIGTTIN isalso intercepted by mpi r un butisnot propagated W hen using an appfile Platform M PI propagates these signals to remote Platform M PI daemons mpid and local ranks Each daemon propagates the signal to the ranks it created An exception is the treatment of SIGTSTP When a daemon receives an SIGTSTP signal it propagates S GST OP to the ranks it created and then raises SIGSTOP on itself This allows all processes related to a Platform M PI execution to be suspende
328. the environment variable MPI_USEPRUN Set the environment variable setenv MPI_USEPRUN 17 Platform M PI will insert the prunargument The following arguments are considered to be pr un arguments n N m w xX eMPI_WORKDIR path will be translated to the pr un argument chdir path any argument that starts with and is not followed by a space np will be translated to n prun will be accepted without warning Theimplied pr un mode allows the use of Platform M PI appfiles Currently an appfile must be homogenousin itsarguments except for h and np The h and np argumentsin the appfilearediscarded All other arguments are promoted to the mpi r un command line Additionally arguments following are also processed Additional environment variables provided MPI_PRUNOPTIONS Allows additional pr un options to be specified such as label setenv MPI_PRUNOPTIONS lt option gt MPI_USEPRUN_IGNORE_ARGS Provides an easy way to modify the arguments in an appfile by supplying a list of space separated arguments that mpi run should ignore setenv MPI_USEPRUN_IGNORE_ARGS lt option gt prun arguments Platform MPI User s Guide 231 mpirun Using Implied prun or srun n ntasks ntasks Specify the number of processes to run N nodes nnodes Request that nnodes nodes be allocated to this job m distribution block cyclic Specify an alternate distribution method for remote processes w nodelist
329. the programs appear in the appfile For example if your appfile contains h voyager np 10 send receive h enterprise np 8 compute p 76 Platform MPI User s Guide Understanding Platform MPI Platform M PI assigns ranks 0 through 9 to the 10 processes running send_receive and ranks 10 through 17 to the 8 processes runningcompute pi You can use this sequential ordering of process ranks to your advantage when you optimize for performance on multihost systems Y ou can split process groups according to communication patterns to reduce or remove interhost communication hot spots For example if you have the following A multihost run of four processes Two processes per host on two hosts Higher communication traffic between ranks 0 2 and 1 3 You could use an appfile that contains the following h hosta np 2 programl h hostb np 2 program2 However this places processes 0 and 1 on host a and processes 2 and 3 on host b resulting in interhost communication between the ranks identified as having slow communication lt Slow communication process 0 process 1 ro A more optimal appfile for this example would be h hosta np 1 programl h hostb np 1 program2 h hosta np 1 programl h hostb np 1 program2 This places ranks 0 and 2 on host aand ranks 1 and 3 on host b This placement allows intrahost communication between ranks that are identified as communication hot spots Intrahost communication yields
330. ther the i option to mpi r un or by setting the environment variable M PI_INSTR Instrumentation data includes some information on messages sent to other M PI worlds formed using MPI_Comm_accept MPI_Comm_connect or MPI_Comm_join All off world message data is Platform MPI User s Guide 129 Understanding Platform MPI accounted together using the designation of f w regardless of which off world rank was involved in the communication Platform M PI provides an API that enables users to access the lightweight instrumentation data on a per process basis before the application calling M PI_Finalize The following declaration in C is necessary to access this functionality extern int hpmp_instrument_runtime int reset A call to hpmp_instrument_runtime 0 populates the output file specified by the i option to mpi run or theM PI_INSTR environment variable with the statistics available at the time of the call Subsequent calls to hpmp_instrument_runtime or M Pl_Finalize will overwrite the contents of the specified file A call to hpmp_instrument_runtime 1 populates the file in the same way but also resets the statistics If instrumentation is not being used the call to hb mp_instrument_runtime has no effect Even though you can specify profiling options through the M PI_INSTR environment variable the recommended approach is to use the mpi r un command with the i option instead Using mpi r un to specify profiling options guarantee
331. ting started 27 support 237 local host interconnect options 103 logical values in Fortran77 125 Isb_hosts option 104 Isb_mcpu_hosts option 104 LSF load sharing facility 70 LSF load sharing facility invoking 70 75 LSF non HP XC systems 70 75 LSF on HP XC 69 M manpages 31 compilation utilities Windows 46 general Windows 46 Linux 31 Platform MPI utilities 45 run time 32 Windows 45 46 master_worker f90 183 messages bandwidth achieve highest 167 buffering problems 176 label 19 latency achieve lowest 167 latency bandwidth 162 163 lengths 241 passing advantages 14 status 18 mode option 109 module F 50 modules 72 MPI allgather operation 20 alltoall operation 20 application starting on Linux 28 broadcast operation 20 build application on HP XC cluster 30 on Linux cluster using appfiles 29 on single host Linux 29 build application with Visual Studio 42 change execution source 126 135 clean up 240 functions 57 gather operation 20 initialize environment 15 library routines MPI_Comm_rank 15 MPI_Finalize 15 MPI init 15 MPI_Recv 15 MPI_Send 15 number of 15 prefix 159 routine selection 167 run application on HP XC cluster 30 on Linux cluster using appfiles 29 on single host Linux 29 run application Linux 28 run application on Elan interconnect 67 run application on Linux 67 scatter operation 20 terminate environment 15 MPI run application on multiple hosts 70 MPI_2BCOPY 127 MPI_Barrier 23 167 MPI_Bcast 16
332. tion IN tag message tag IN comm communicator handle int MPI_Rsend_initL void buf MPI _Aint count MPI Datatype datatype int dest int tag MPI Comm comm MPI Request request N buf initial address of send buffer choice IN count number of elements sent IN datatype type of each element handle IN dest rank of destination IN tag message tag IN comm communicator handle OUT request communication request handl e int MPI _SendL void buf MPI Aint count MPI Datatype datatype int dest int tag MPI Comm comm N buf initial address of send buffer choice IN count number of elements in send buffer IN datatype datatype of each send buffer element handle IN dest rank of destination IN tag message tag IN comm communicator handl e int MPI Send_initL void buf MPI Aint count MPI _ Datatype datatype int dest int tag MPI Comm comm MPI Request request N buf initial address of send buffer choice IN count number of elements sent IN datatype type of each element handle IN dest rank of destination IN tag message tag IN comm communicator handle OUT request communication request handl e int MPI SendrecvL void sendbuf MPI Aint sendcount MPI Datatype sendtype int dest int sendtag void recvbuf MPI Aimi recvcount MPI Datatype recvtype int source int recvtag MPI Comm comm MPI Status status IN sendbuf initial address of send buffer choice IN sendcount number of elements in send buffer IN sendtyp
333. tion collected during program execution that you can use to analyze your application Y ou can collect traceinformation and storeit in afilefor later useor analyze it directly when running your application interactively UNC A Universal Naming Convention UNC path isa path that is visible as a network share on all nodes T he basic format is node name exported share folder paths UNC paths are usually required because mapped drives may not be consistent from node to node and many times don t get established for all logon tokens yield See spin yield 252 Platform MPI User s Guide opt mpi doc 31 opt mpi help 30 opt mpi include 30 opt mpi lib pa2 0 31 opt mpi newconfig 31 1sided option 109 32 bit applications 239 32 bit error 241 64 bit support 58 A ADB 170 all reduce 22 allgather 20 app bitness spec options 106 appfile adding program arguments 76 assigning ranks in 76 creating 75 execution 73 improving communication on multihost systems 76 runs 70 setting remote environment variables in 76 with mpirun 67 appfile description of 71 application hangs 240 argument checking enable 125 array partitioning 199 ASCII instrumentation profile 157 asynchronous communication 15 autodouble 56 Linux 56 Windows 56 B backtrace 173 Index bandwidth 16 163 167 barrier 23 167 binding ranks to Idoms 166 blocking communication 17 buffered mode 18 MPI_Bsend 18 MPI_Recv 18 MPI_Rsend 18 MPI Send 18 MPI_Ssend 18
334. tion connections ha recover Recovery of communication connections after failures ha net Enables Automatic Port Migration ha noteardown ha all Note mpi run and mpi d exist they should not tear down an application in which some ranks have exited after MPI_Init but before MPI_Finalize lf ha infra is specified this option is ignored ha all is equivalent to ha infra noteardown recover detect net which is equivalent to ha infra recover net If a process uses ha detect then all processes it communicates with must also use ha detect Likewise if a process uses ha recover then all processes it communicates with must also use ha recover Support for high availability on InfiniBand Verbs You can usethe haoption with the BV option W hen using ha automatic network selection isrestricted to TCP and IBV Be aware that ha no longer forces the use of TCP If TCP is desired on a system that has both TCP and IBV available it is necessary to explicitly specify TCP on thempi run command line All high availability features are available on both TCP and IBV interconnects Highly available infrastructure ha infra The ha option allows M PI ranks to be more tolerant of system failures H owever failures can still affect the mpi run and mpi d processes used to support Platform M PI applications When the mpi run mpi d infrastructure is affected by failures it can affect the application ranks and the services pr
335. tion detection Setting this option for applications that make calls to routines in the M PI 2 standard can produce false error messages Disables detection of multiple buffer writes during receive operations and detection of send buffer corruptions Disables the warning messages that the diagnostic library generates by default when it identifies a receive that expected more bytes than were sent dump prefix Platform MPI User s Guide 127 Understanding Platform MPI Dumps unformatted sent and received messages to prefix msgs rank whererank is the rank of a specific process dumptf prefix Dumps formatted sent and received messages to prefix msgs rank where rank is the rank of a specific process xNUM Defines a type signature packing size NUM is an unsigned integer that specifies the number of signature leaf elements For programs with diverse derived datatypes the default value may be too small If NUM is too small the diagnostic library issues a warning during the M PI_ Finalize operation MPLERROR_LEVEL Controls diagnostic output and abnormal exit processing for application debugging where 0 Standard rank label text and abnormal exit processing Default 1 Adds hostname and process id to rank label 2 Adds hostname and process id to rank label Also attempts to generate core file on abnormal exit MPI_NOBACKTRACE On PA RISC systems a stack trace is printed when the following signals occur in an application
336. tive can show large performance degradation when memory access is mostly off cell To solve this problem ranks must reside in the same Idom where they were created To accomplish this Platform M PI provides the cpu_bind flag which locks down arank to a specific Idom and prevents it from moving during execution To accomplish this the cpu_bind flag preloads a shared library at start up for each process which does the following 1 Spins for a short timein atight loop to let the operating system distribute processes to CPUs evenly 2 Determines the current CPU and Idom of the process If no oversubscription occurs on the current CPU it locks the process to the Idom of that CPU This evenly distributes the ranks to CPUs and prevents the ranks from moving to a different Idom after the M PI application starts preventing cross memory access For more information see refer to cpu_bind in the mpi run documentation 166 Platform MPI User s Guide Tuning MPI routine selection To achievethe lowest message latencies and highest message bandwidths for point to point synchronous communications use the M PI blocking routines M PI_Send and MPI_Recv For asynchronous communications use the M PI nonblocking routines M PI_Isend and MPI_Irecv When using blockingroutines avoid pendingrequests M PI must advancenonblocking messages so calls to blocking receives must advance pending requests occasionally resulting in lower application performa
337. to automatically migrate to it in case of future failures on the new primary path However if the new alternate path is not restored or if alternate paths are unavailable on the same card future failures will force Platform M PI to try to failover to alternate cards if available All of these operations are performed transparent to the application that uses Platform M PI If the environment has multiple cards with multiple ports per card and hasAPM enabled Platform M PI gives InfinBand port failover priority over card failover InfiniBand with MPL_Comm_connect and MPl_Comm_accept VAPI UDAPL GM Elan Platform M PI supports M PI_Comm_connect and MPI_Comm_accept over InfiniBand processes using the IBV protocol Both sides must have InfiniBand support enabled and use the same InfiniBand parameter settings MPI_Comm_connect and M Pl_Comm_accept need a port name which isthe P and port at the root process of the accept side First aT CP connection is established between the root process of both sides Next TCP connections are setup among all the processes Finally IBV InfiniBand connections are established among all process pairs and the TCP connections are closed TheM PI_IB_CARD_ORDER card selection option and the ndd option described above for IBV applies to VAPI The ndd option described above for IBV applies to uDAPL The ndd option described above for IBV applies to GM Platform M PI supports the Elan3 and E
338. to the path of the command line compiler you want to use Specify mpi32 or mpi64 to indicate if you are compiling a 32 bit or 64 bit application Specify thecommand lineoptionsthat you would normally passto thecompiler on themp i f 90 command line Thempi f 90 utility adds additional command line optionsfor Platform M PI include directories and libraries Y ou can specify the show option toindicatethatmpi f 90 should display thecommand generated without executing the compilation command For more information see the mpi f 90 manpage To construct the compilation command thempi f 90 utility must know what command line compiler is to be used the bitness of the executable that compiler will produce and the syntax accepted by the compiler These can be controlled by environment variables or from the command line Table 12 mpi f 90 utility Environment Variable Value Command Line MPI_F90 desired compiler default ifort mpif90 lt value gt MPI_BITNESS 32 or 64 no default mpi32 or mpi64 MPI_WRAPPER_SYNTAX windows or unix default mpisyntax lt value gt windows For example to compilecompute_pi f with a 64 bit ifort contained in your PATH usethe following command because ifort and the W indows syntax are defaults MPI_ROOT bin mpif90 mpi64 compute_pi f link out compute_pi_ifort exe Or use the following example to compile using the PGI compiler which uses a more UN IX like syntax MPI_ROOT bin mpif90 mpif90 pgf90 mpisyn
339. type handl e int MPI UnpackL void inbuf of e to datatype objects handle MPI Aint blocklength MPI Aint MPI Datatype newtype OUT newtype count stride MPI Datatype oldtype IN in each block IN stride number elements between start of each block IN oldtype handle OUT newtype new datatype lements MPI Aint insize MPI Aint position void outbuf MPI Aint outcount MPI Datatype datatype MPI Comm OUT i i p 0 0 d nbuf in nsize si osition cu ut buf ou utcount nu atatype da omm co t MPI _ Unpack_external PI Aimi position atatype OUT T t MPI Type_contiguo P Datatype newtype T d i i p 0 0 d Cc 0 n oldtype UT UT C array_of_blockleng array_of_displacements voi atarep data nbuf i npu nsize i npu osition curr ut buf outp ut count numb atatype data us ount repli ldtype old d ewtype new d in MP Datatype ount comm put buffer start choice ze of input buffer in bytes rrent position in bytes put buffer start choice mber of items to be unpacked atype of each output data item handle mmunicator for packed message handle L char datarep void inbuf MPI_Aint insize d outbuf MPI Aint outcount MPI Datatype representation st buffer start cho buffer size in bytes ent position in buffer ut buffer start choice er of output data item ype of output data it ring ice in bytes
340. u want to use which creates a subshell Then jobsteps can be launched within that subshell until the subshell is exited MPI_ROOT bin mpirun prun A N6 This allocates 6 nodes and creates a subshell MPI_ROOT bin mpirun prun n4 m block a out This uses 4 ranks on 4 nodes from the existing allocation N ote that we asked for block n00 ranki n0l rank2 n02 rank3 n03 rank4 Usempi run with srun on HP XC LSF clusters For example MPI_ROOT bin mpirun lt mpirun options gt srun lt srun options gt lt program gt lt args gt Some features likempi run stdi o processing are unavailable The np option is not allowed with srun The following options are allowed with srun MPI_ROOT bin mpirun help version jv i lt spec gt universe_size sp lt paths gt T prot spawn tv 1sided e var val srun lt srun options gt lt program gt lt args gt For moreinformation ons run usage man srun Thefollowing examples assume the system has the Quadrics Elan interconnect SLU RM is configured to use Elan and the system is a collection of 2 CPU nodes MPI_ROOT bin mpirun srun N4 a out ill runa out with 4 ranks one per node Ranks are cyclically allocated n00 ranki n0l rank2 n02 rank3 n03 rank4 MPI_ROOT bin mpirun srun n4 a out will runa out with 4 ranks 2 ranks per node ranks are block allocated Two are nodes used Other forms of usage include allocatin
341. uide Understanding Platform MPI Nodes are allocated in the order that they appear in the hostlist Nodes are scheduled cyclically so if you have requested more ranks than there are nodes in the hostlist nodes are used multiple times 3 Analyze hello_world output Platform M PI prints the output from running thehello_world executable in non deterministic order The following is an example of the output Hello world m5 of 8 on n02 Hello world m0O of 8 on n01 Hello world m2 of 8 on n03 Hello world m6 of 8 on n03 Hello world m1 of 8 on n02 Hello world I m 3 of 8 on n04 Hello world m4 of 8 on n0l Hello world m7 of 8 on n04 Performing multi HPC runs with the same resources In some instances such as when running performance benchmarks it is necessary to perform multiple application runs using thesameset of HPC nodes Thefollowing exampleisonemethod of accomplishing this 1 Compilethehe o_ world executable file a Changeto a writable directory and copy hello_world c from the help directory C gt copy MPIROOT help hello_world c Compile the hello_world executable file oO z In a proper compiler command window for example Visual Studio command window use mpi cc to compile your program C gt MPI_ROOT bin mpicc mpi64 hello_world c Note Specify the bitness using mpi64 or mpi32 for mpi cc to link in the correct libraries Verify you are in the correct bitness compiler wind
342. uld pertain only to Platform documentation For product support contact support platform com This document is protected by copyright and you may not redistribute or translate it into another language in part or in whole You may only redistribute this document internally within your organization for example on an intranet provided that you continue to check the Platform W eb site for updates and update your version of the documentation Y ou may not make it available to your organization over the Internet LSF is aregistered trademark of Platform Computing Corporation in the United States and in other jurisdictions ACCELERATING INTELLIGENCE PLATFORM COM PUTING PLATFORM SYMPHONY PLATFORM JOB SCHEDULER PLATFORM ISF PLATFORM ENTERPRISE GRID ORCHESTRATOR PLATFORM EGO and the PLATFORM and PLATFORM LSF logos aretrademarks of Platform Computing Corporation in the U nited States and in other jurisdictions UNIX is a registered trademark of The Open Group in the U nited States and in other jurisdictions Linux is the registered trademark of Linus Torvaldsin the U S and other countries Microsoft is either a registered trademark or a trademark of Microsoft Corporation in the United States and or other countries Windows is a registered trademark of Microsoft Corporation in the U nited States and other countries Intel Itanium and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United State
343. use cl and the Windows syntax are defaults MPI_ROOT bin mpicc mpi64 hello_world c link out hello_world_cl64 exe Or use the following example to compile using the PGI compiler which uses a more UN1X like syntax MPI_ROOT bin mpicc mpicc pgcc mpisyntax unix mpi32 hello_world c o hello_world_pgi32 exe To compile C code and link with Platform M PI without using thempi cc tool start a command prompt that has the relevant environment settings loaded for your compiler and useit with the compiler option I MPI_ROOT include 32 64 and the linker options libpath MPI_ROOT lib subsystem console libpcmpi64 lib libpcmpi32 lib Specify bitness where indicated The above assumes the environment variable M P _ ROOT is set Platform MPI User s Guide 51 Understanding Platform MPI For example to compilehello_world cfrom the MP ROOT hel p directory using Visual Studio from a Visual Studio 2005 command prompt window cl hello_world c I MPI_ROOT include 64 link out hello_world exe libpath MPI_LROOT lib subsystem console libpcmpi64 lib The PGI compiler uses amore UNIX like syntax From a PGI command prompt pgcc hello_world c I MPI_ROOT include 64 o hello_world exe L MPI_ROOT lib Ihpmpi64 Fortran command line basics for Windows The utility MP1 ROOT bi n mpi f 90 isincluded to aid in command line compilation To compile with this utility set the M PI_F90 environment variable
344. valid M PI 2 program 14 Platform MPI User s Guide Introduction MPI concepts The primary goals of M PI are efficient communication and portability Although several message passing libraries exist on different systems M PI is popular for the following reasons Support for full asynchronous communication Process communication can overlap process computation Group membership Processes can be grouped based on context Synchronization variables that protect process messaging W hen sending and receiving messages synchronization is enforced by source and destination information message labeling and context information Portability All implementations are based on a published standard that specifies the semantics for usage An MPI program consists of a set of processes and a logical communication medium connecting those processes An M PI process cannot directly access memory in another M PI process Interprocess communication requires calling M PI routines in both processes M PI defines a library of routines that M PI processes communicate through The MPI library routines provide a set of functions that support the following Point to point communications Collective operations Process groups Communication contexts Process topologies Datatype manipulation Although the M PI library contains a large number of routines you can design a large number of applications by using the six routines Table 2 Six commonly used M
345. vironment variable M PI_ROOT to reflect the new location 2 You may need to add MP ROOT bin mpirun exe MPI ROOT bin mpid exe MPI ROOT bin mpidiag exe and MPI ROOT bi n mpisrvutil exe to the firewall exceptions depending on how your system is configured Platform M PI must be installed in the same directory on every execution host To determinetheversion of aPlatform M PI installation usethe version flag with thempi r un command C gt MPIROOT bin mpirun version Setting environment variables Environment variables can be used to control and customizethe behavior of a Platform M PI application The environment variables that affect the behavior of Platform M PI at run time are described in the mpienv 1 manpage In all run modes Platform M PI enables environment variables to be set on the command line with the e option For example C gt MPI_ROOT bin mpirun e MPI_FLAGS y40 f appfile See the Platform M PI User s Guidefor more information on setting environment variables globally using the command line On Windows 2008 environment variables can be set from the GUI or on the command line From the GUI select New Job gt Task List from the left menu list and select an existing task Set the environment variable in the Task Properties window at the bottom Note Set these environment variables on the mpi run task Environment variables can also be set using the flag env For example
346. wns 3 np gt 1 new ranks that all perform the same computation as in compute_pi f ping_pong_clustertest c C Identifies slower than average np gt 2 links in your high speed interconnect hello_world c Cc Prints host name and rank np gt 1 These examples and the makefile are located in the MPI _ROOT hel p subdirectory The examples are presented for illustration purposes only They might not necessarily represent the most efficient way to solve a problem To build and run the examples use the following procedure 1 Change to a writable directory 2 Copy all files from the help directory to the current writable directory cp MPI_ROOT help 3 Compileall examples or a single example To compileand run all examplesin the he p directory at the prompt enter make To compile and run the thread_safe c program only at the prompt enter make thread_safe send_receive f In this Fortran 77 example process 0 sends an array to other processes in the default communicator MPI_COMM_WORLD program maininclude mpif h integer rank size to from tag count i ierr integer src dest integer st_source st_tag st count integer status MPI STATUS SIZE double precision data 100 call MPI _Init ierr call MP _Comm_rank MPI _COMM WORLD rank ierr call MP _Comm_size MPI_COMM WORLD size ierr 184 Platform MPI User s Guide Example Applications if size eq 1 then print must have at least 2
347. world Client m5 of 6 on node3 Hello world Client I m 3 of 6 on node2 Building an MPI application on Windows with Visual Studio and using the property pages To build an MPI application on Windowsin C or C with V S2008 use the property pages provided by Platform M PI to help link applications Two pages are included with Platform M PI and are located at the installation location MPI ROOT in hel p PMPI vsprops andPMP 64 vsprops Go to VS Project gt View gt Property Manager Expand the project This shows configurations and platforms set up for builds Include the correct property page PMP vs props for 32 bit apps PMPI 64 vsprops for 64 bit apps in the Configuration Platform section Select this page by double clicking the page or by right clicking on the page and selecting Properties Go to the User Macros section Set MP _ ROOT to the desired location i e the installation location of Platform M PI This should be set to the default installation location ProgramFiles x86 Platform MPI Note 42 Platform MPI User s Guide Getting Started This is the default location on 64 bit machines The location for 32 bit machines is ProgramFil es Platform MPI The MPI application can now be built with Platform M PI The property page sets the following fields automatically but they can be set manually if the property page provided is not used C C Additional Include Directories Set to MPI ROOT

Platform MPI User's Guide - Platform Cluster Manager

Contents

Download Pdf Manuals

Related Search

Related Contents