LAM/MPI User's Guide Version 7.1.3
Contents
1. Contents (excerpt)
7.2 The lamcheckpoint Command
7.3 The lamclean Command
7.4 The lamexec Command
7.5 The lamgrow Command
7.6 The lamhalt Command
7.7 The laminfo Command
7.8 The lamnodes Command
7.9 The lamrestart Command
7.10 The lamshrink Command
7.11 The mpicc, mpiCC, and mpif77 Commands
7.11.1 Deprecated Names
7.12 The mpiexec Command
7.12.1 General Syntax
7.12.2 Launching MPMD Processes
7.12.3 Launching MPI Processes with No Established LAM Universe
7.13 The mpimsg Command (Deprecated)
7.14 The mpirun Command
7.14.1 Simple Examples
7.14.2 Controlling Where Processes Are Launched
7.14.3 Per-Process Controls
7.14.4 Ability to Pass Environment Variables
7.14.5 Current Working Directory Behavior
7.15 The mpitask Command
7.16 The ...
2. ... "Automatically starting $file_component"
    dgo
}

Append this function to TotalView's image load callbacks so that TotalView runs this program automatically:

dlappend TV::image_load_callbacks auto_run_starter

Note that when using this method, mpirun is actually running in the debugger while you are debugging your parallel application, even though it may not be obvious. Hence, when the MPI job completes, you'll be returned to viewing mpirun in the debugger. This is normal: all MPI processes have exited; the only process that remains is mpirun. If you click "Go" again, mpirun will launch the MPI job again.

2. Do not create the $HOME/.tvdrc file with the auto-run functionality described in the previous item, but instead simply click the "Go" button when TotalView launches. This runs the mpirun command with the command line arguments, which will eventually launch the MPI programs and allow attachment to the MPI processes.

When TotalView initially attaches to an MPI process, you will see the code for MPI_INIT or one of its sub-functions (which will likely be assembly code unless LAM itself was compiled with debugging information). You probably want to skip past the rest of MPI_INIT. In the Stack Trace window, click on the function which called MPI_INIT (e.g., main) and set a breakpoint on the line following the call to MPI_INIT. Then click "Go".

10.2.3 Limitations

The following limitations are currently ...
3. ... or not specifying to use a specific port, will tell LAM to check the range of ports to find any available port. Note that in all cases, if LAM cannot acquire a valid port for every MPI process in the job, the entire job will be aborted.

Be wary of forcing a specific port to be used, particularly in conjunction with the MPI dynamic process calls (e.g., MPI_COMM_SPAWN). For example, when attempting to spawn a child process on a node that already has an MPI process in the same job, LAM will try to use the same specific port, which will result in failure because the MPI process already on that node will have already claimed that port.

Choosing an HCA ID

The HCA ID is the Mellanox Host Channel Adapter ID (for example, InfiniHost0). It is usually unnecessary to specify which HCA ID to use: LAM/MPI will search for all available HCAs and select the first one that is available. If you want to use a fixed HCA ID, you can specify it with the rpi_ib_hca_id SSI parameter.

Adjusting Message Lengths

The ib RPI uses two different protocols for passing data between MPI processes: tiny and long. Selection of which protocol to use is based solely on the length of the message. Tiny messages are sent (along with tag and communicator information) in one transfer to the receiver. Long messages use a rendezvous protocol: the envelope is sent to the destination, the receiver responds with an ACK when it is ready, and then the sender sends another envelope ...
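A hedged example of supplying these parameters on the mpirun command line (the HCA ID and tiny-message length shown are illustrative values taken from the discussion above, not recommendations):

shell$ mpirun C -ssi rpi ib -ssi rpi_ib_hca_id InfiniHost0 -ssi rpi_ib_tinymsglen 4096 my_mpi_program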
4. 4. It is always a good idea to let mpirun finish before you rerun or exit TotalView.

5. TotalView will not be able to attach to MPI programs when you execute mpirun with the -s option. This is because TotalView will not get the source code of your program on nodes other than the source node. We advise you to either use a common filesystem or copy the source code and executable to all nodes when using TotalView with LAM, so that you can avoid the use of mpirun's -s flag.

10.2.4 Message Queue Debugging

The TotalView debugger can show the sending, receiving, and unexpected message queues for many parallel applications. Note the following:

- The MPI-2 function for naming communicators (MPI_COMM_SET_NAME) is strongly recommended when using the message queue debugging functionality. For example, MPI_COMM_WORLD and MPI_COMM_SELF are automatically named by LAM/MPI. Naming communicators makes it significantly easier to identify communicators of interest in the debugger. Any communicator that is not named will be displayed as "unnamed".
- Message queue debugging of applications is not currently supported for 64-bit executables. If you attempt to use the message queue debugging functionality on a 64-bit executable, TotalView will display a warning before disabling the message queue options.
- The lamd RPI does not support the message queue debugging functionality.
- LAM/MPI does not currently provide debugging support for dynamic ...
5. The smp module assumes that there are only two levels of latency between all processes. As such, it will only allow itself to be available for selection when there are at least two nodes in a communicator and there are at least two processes on the same node.

Only some of the collectives have been optimized for SMP environments. Table 9.8 shows which collective functions have been optimized, which were already optimal (from the lam_basic module), and which will eventually be optimized.

Special Notes

Since the SMP-optimized algorithms attempt to take advantage of data locality, it is strongly recommended to maximize the proximity of MPI_COMM_WORLD rank neighbors on each node. The C nomenclature to mpirun can ensure this automatically.

Also, as a result of the data locality exploitation, the coll_base_associative parameter is highly relevant: if it is not set to 1, the smp module will fall back to the lam_basic reduction algorithms.

9.4.5 The shmem Module

Module Summary: Name: shmem; Kind: coll; Default SSI priority: 50; Checkpoint/restart: yes

The shmem module is designed to facilitate fast collective communication among processes on a single node. Processes on an N-way SMP node can take advantage of shared memory for message passing. The module will be selected only if the communicator spans only a single node and all the processes in the communicator can successfully attach the shared memory region to their addr...
6. attribute_val = 1;
MPI_Comm_set_attr(comm, keyval, &attribute_val);

/* Get the value */
attribute_val = -1;
MPI_Comm_get_attr(comm, keyval, &attribute_val, &flag);
if (flag == 1)
    printf("Got the attribute value: %d\n", attribute_val);

6.4 Dynamic Shared Object (DSO) Modules

LAM has the capability of building SSI modules statically as part of the MPI libraries, or as dynamic shared objects (DSOs). DSOs are discovered and loaded into LAM processes at run time. This allows adding or removing functionality from an existing LAM installation without the need to recompile or re-link user applications.

The default location for DSO SSI modules is <prefix>/lib/lam. If otherwise unspecified, this is where LAM will look for DSO SSI modules. However, the SSI parameter base_module_path can be used to specify a new colon-delimited path to look for DSO SSI modules. This allows users to specify their own location for modules, if desired.

Note that specifying this parameter overrides the default location. If users wish to augment their search path, they will need to include the default location in the path specification:

shell$ mpirun C -ssi base_module_path <prefix>/lib/lam:$HOME/my_lam_modules ...

6.5 Selecting Modules

As implied by the previous sections, modules are selected at run time by examining (in order) user-specified parameters, run-time calculations, and compiled-in defaults. The sel...
7. ... -lutil -pthread

Two notable sub-flags are:

-showme:compile: Show only the compile flags, suitable for substitution into CFLAGS:

shell$ mpicc -O -c main.c -showme:compile
-I/usr/local/lam/include -pthread

-showme:link: Show only the linker flags (which are actually LDFLAGS and LIBS mixed together), suitable for substitution into LIBS:

shell$ mpicc -O -o main main.o foo.o bar.o -showme:link
-L/usr/local/lam/lib -llammpio -lpmpi -llamf77mpi -lmpi -llam -lutil -pthread

When compiling a user MPI application, the -lpmpi argument is used to indicate that MPI profiling support should be included. The wrapper compiler may alter the exact placement of this argument to ensure that proper linker dependency semantics are preserved.

Neither the compiler nor the linker flags can be overridden at run time. The back-end compiler, however, can be. Environment variables can be used for this purpose:

- LAMMPICC (deprecated name: LAMHCC): Overrides the default C compiler in the mpicc wrapper compiler.
- LAMMPICXX (deprecated name: LAMHCP): Overrides the default C++ compiler in the mpiCC wrapper compiler.
- LAMMPIF77 (deprecated name: LAMHF77): Overrides the default Fortran compiler in the mpif77 wrapper compiler.

For example, for Bourne-like shells:

shell$ LAMMPICC=cc
shell$ export LAMMPICC
shell$ mpicc ...
8. For example, whenever a communicator name is available, LAM will use it in relevant error messages; when names are not available, communicators (and windows and types) are identified by index number, which, depending on the application, may vary between successive runs. The TotalView parallel debugger will also show communicator names (if available) when displaying the message queues.

10.2 TotalView Parallel Debugger

TotalView is a commercial debugger from Etnus that supports debugging MPI programs in parallel. That is, with supported MPI implementations, the TotalView debugger can automatically attach to one or more MPI processes in a parallel application.

LAM now supports basic debugging functionality with the TotalView debugger. Specifically, LAM supports TotalView attaching to one or more MPI processes, as well as viewing the MPI message queues in supported RPI modules.

This section provides some general tips and suggested uses of TotalView with LAM/MPI. It is not intended to replace the TotalView documentation in any way. Be sure to consult the TotalView documentation for more information and details than are provided here.

Note: TotalView is a licensed product provided by Etnus. You need to have TotalView installed properly before you can use it with LAM.

10.2.1 Attaching TotalView to MPI Processes

LAM/MPI does not need to be configured or compiled in any special way to allow TotalView to attach to MPI processes. Y...
9. ... Main installation prefix
bindir: Where the LAM/MPI executables are located
libdir: Where the LAM/MPI libraries are located
incdir: Where the LAM/MPI include files are located
pkglibdir: Where dynamic SSI modules are installed (dynamic SSI modules are not supported in LAM/MPI 7.0, but will be supported in future versions)
sysconfdir: Where the LAM/MPI help files are located

-version: Paired with two additional arguments, display the version of either LAM/MPI or one or more SSI modules. The first argument identifies what to report the version of, and can be any of the following:

lam: Version of LAM/MPI
boot: Version of all boot modules
boot <module>: Version of a specific boot module
coll: Version of all coll modules
coll <module>: Version of a specific coll module
cr: Version of all cr modules
cr <module>: Version of a specific cr module
rpi: Version of all rpi modules
rpi <module>: Version of a specific rpi module

full: Display the entire version number string
major: Display the major version number
minor: Display the minor version number
release: Display the release version number
alpha: Display the alpha version number
beta: Display the beta version number
svn: Display the SVN version number

Multiple options can be combined to query several attributes at once. The second argument specifies the scope of the version number to display, i.e., whether to show the entire version ...
10. ... binomial algorithms are used. No attempt is made to determine the locality of processes; the lam_basic module effectively assumes that there is equal latency between all processes. All reduction operations are performed in a strictly-defined order; associativity is not assumed.

Collectives for Intercommunicators

As of now, only the lam_basic module supports intercommunicator collectives according to the MPI-2 standard. These algorithms are built over the point-to-point layer, and they also make use of intra-communicator collectives via the intra-communicator corresponding to the local group. The mapping between the intercommunicator and the corresponding local intracommunicator is managed separately in the lam_basic module. The basic algorithms are the same that have been included in LAM/MPI since at least version 6.2.

9.4.4 The smp Module

Module Summary: Name: smp; Kind: coll; Default SSI priority: 50; Checkpoint/restart: yes

The smp module is geared towards SMP nodes in a LAN. Heavily inspired by the MagPIe algorithms [6], the smp module determines the locality of processes before setting up a dynamic structure in which to perform the collective function. Although all communication is still layered on MPI point-to-point functions, the algorithms attempt to maximize the use of on-node communication before communicating with off-node processes. This results in lower overall latency for the collective operation.
12. ... process than lamhalt, and is typically not necessary.

Chapter 5: Supported MPI Functionality

This chapter discusses the exact levels of MPI functionality that are supported by LAM/MPI.

5.1 MPI-1 Support

LAM 7.1.3 has support for all MPI-1 functionality.

5.1.1 Language Bindings

LAM provides C, C++, and Fortran 77 bindings for all MPI-1 functions, types, and constants. Profiling support is available in all three languages (if LAM was configured and compiled with profiling support). The laminfo command can be used to see if profiling support was included in LAM/MPI.

Support for optional Fortran types has now been added. Table 5.1 lists the new datatypes. Note that MPI_INTEGER8 and MPI_REAL16 are listed even though they are not defined by the MPI standard; support for these types is included per request from LAM/MPI users.

Supported datatypes (Table 5.1: Supported optional Fortran datatypes): MPI_INTEGER1, MPI_INTEGER2, MPI_INTEGER4, MPI_INTEGER8, MPI_REAL4, MPI_REAL8, MPI_REAL16.

5.1.2 MPI_CANCEL

MPI_CANCEL works properly for receives, but will almost never work on sends. MPI_CANCEL is most frequently used with unmatched MPI_IRECVs that were made "in case" a matching message arrived. This simply entails removing the receive request from the local queue, and is fairly straightforward to implement. Actually canceling a send operation is much more difficult, because some meta information about a message ...
14. ... MPI-2 datatypes natively, ROMIO cannot presently handle this case. This will hopefully be fixed in some future release of ROMIO. The ROMIO test programs coll_test, fcoll_test, large_array, and coll_perf will fail because they use the MPI-2 datatype MPI_DARRAY.

Please see the sections "ROMIO Users Mailing List" and "Reporting Bugs" in romio/README for how to submit questions and bug reports about ROMIO that do not specifically pertain to LAM (http://www.mcs.anl.gov/romio).

5.2.7 Language Bindings

LAM provides C, C++, and Fortran 77 bindings for all supported MPI-2 functions, types, and constants. LAM does not provide a Fortran 90 module. However, it is possible to use the Fortran 77 bindings with a Fortran 90 compiler by specifying the F90 compiler as your Fortran compiler when configuring/compiling LAM/MPI. See the LAM Installation Guide [14] for more details.

The C++ bindings include support for the C++-only MPI::BOOL, MPI::COMPLEX, MPI::DOUBLE_COMPLEX, and MPI::LONG_DOUBLE_COMPLEX datatypes.

Note that there are some issues with using MPI and Fortran 90 together. See the F90 / C++ chapter in the MPI-2 standard [2] for more information on using MPI with Fortran 90.

As mentioned in Section 5.1.1, profiling support is available in all three languages (if LAM was compiled with profiling support). The laminfo command can be used to see if profiling support was included in LAM/MPI.

Chapter 6: System Services ...
16. ... 23:50, 2 users, load average: 0.87, 0.81, 0.80

Most of the parameters and options that are available to mpirun are also available to lamexec. See the mpirun description in Section 7.14 for more details.

7.5 The lamgrow Command

The lamgrow command adds a single node to the LAM universe. It must use the same boot module that was used to initially boot the LAM universe. lamgrow must be run from a node already in the LAM universe. Common parameters include:

-v: Verbose mode.
-d: Debug mode; enables a lot of diagnostic output.
-n <nodeid>: Assign the new host the node ID nodeid. nodeid must be an unused node ID. If -n is not specified, LAM will find the lowest node ID that is not being used.
-no-schedule: Has the same effect as putting "no_schedule=yes" in the boot schema. This means that the C and N expansion used in mpirun and lamexec will not include this node.
-ssi <key> <value>: Pass in SSI parameter key with the value value.
<hostname>: The name of the host to expand the universe to.

For example, the following adds the node blinky to the existing LAM universe using the rsh boot module:

shell$ lamgrow -ssi boot rsh blinky.cluster.example.com

Note that lamgrow cannot grow a LAM universe that only contains one node that has an IP address of 127.0.0.1 (e.g., if lamboot was run with the default boot schema that only contains the name localhost). In this case, lamgrow ...
17. ... the Globus boot module will never be automatically selected by LAM; it must be selected manually with the boot SSI parameter with the value globus:

shell$ lamboot -ssi boot globus hostfile

Tunable Parameters

Table 8.2 lists the SSI parameters that are available to the globus module:

boot_globus_priority (default: 3): Default priority level. (Table 8.2: SSI parameters for the globus boot module)

8.1.7 The rsh Module (including ssh)

The rsh/ssh boot SSI module is typically the "least common denominator" boot module. When not in an otherwise "special" environment (such as a batch scheduler), the rsh/ssh boot module is typically used to start the LAM run-time environment.

Minimum Requirements

In addition to the minimum requirements listed in Section 8.1.2, the following additional conditions must also be met for a successful lamboot using the rsh/ssh boot module:

1. The user must be able to execute arbitrary commands on each target host without being prompted for a password.

2. The shell's start-up script must not print anything on standard error. The user can take advantage of the fact that rsh/ssh will start the shell non-interactively. The start-up script can exit early in this case, before executing many commands relevant only to interactive sessions (and likely to generate output). This has now been changed in version 7.1: if the SSI parameter boot_rsh_ignore_stderr ...
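A quick, hedged way to sanity-check both requirements above is to run a trivial non-interactive command over ssh (or rsh) and confirm that no password prompt appears and nothing is printed to standard error; the hostname here is only an illustration:

shell$ ssh node1.cluster.example.com true
shell$ echo $?
0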
18. ... IMPI processes, and the IMPI server. See the IMPI section of the LAM FAQ on the LAM web site (http://www.lam-mpi.org/faq/) for definitions of these terms. For more information about IMPI and the IMPI Standard, see the main IMPI web site (http://impi.nist.gov/).

Note that the IMPI standard only applies to MPI-1 functionality. Using non-local MPI-2 functions on communicators with ranks that live on another MPI implementation will result in undefined behavior (read: kaboom). For example, MPI_COMM_SPAWN will certainly fail, but MPI_COMM_SET_NAME works fine (because it is a local action).

12.6.2 Current IMPI Functionality

LAM currently implements a subset of the IMPI functionality:

- Startup and shutdown
- All MPI-1 point-to-point functionality
- Some of the data-passing collectives: MPI_ALLREDUCE, MPI_BARRIER, MPI_BCAST, MPI_REDUCE

LAM does not implement the following on communicators with ranks that reside on another MPI implementation:

- MPI_PROBE and MPI_IPROBE
- MPI_CANCEL
- All data-passing collectives that are not listed above
- All communicator constructor/destructor collectives (e.g., MPI_COMM_SPLIT, etc.)

12.6.3 Running an IMPI Job

Running an IMPI job requires the use of an IMPI server. An open source, freely available server is available. As described in the IMPI standard, the first step is to launch the IMPI server with the number of expected clients. The open source server from above requi...
19. Just like with the Checkpoint and Continue functions, no MPI functions can be invoked during the Restart function.

Troubleshooting

The most common cause of incorrect checkpoints using the self module is having LAM look for the wrong symbol names at any of the Checkpoint, Continue, or Restart phases. To verify what function names are being looked up at run time, the cr_verbose SSI parameter can be set. For example:

shell$ mpirun C -ssi rpi crtcp -ssi cr self -ssi cr_verbose level:1000 my_mpi_program

This will output debug-level information that clearly shows the function names that LAM is looking for, and whether it is able to find them or not. If you find that LAM is looking for the right function names but is still somehow not finding the functions at run time, ensure that you linked your application with the appropriate flag to export symbols (e.g., with GCC-based compilers, use the --export-dynamic flag as shown in the example above).

Known Issues

- Since a checkpoint request is initiated by invoking lamcheckpoint with the PID of mpirun, it is not possible to checkpoint MPI jobs that were started using the -nw option to mpirun, or directly from the command line without using mpirun.

Chapter 10: Debugging Parallel Programs

LAM/MPI supports multiple methods of debugging parallel programs. The following notes and observations generally apply to debugging in parallel:

- Note that most debuggers require that ...
20. ... MPI programs. It is similar to, but slightly different than, mpirun. Although mpiexec is simply a wrapper around other LAM commands (including lamboot, mpirun, and lamhalt), it ties their functionality together and provides a unified interface for launching MPI processes. Specifically, mpiexec offers two features from command line flags that require multiple steps when using other LAM commands: launching MPMD MPI processes, and launching MPI processes when there is no existing LAM universe. (The reason that there are two methods to launch MPI executables is that the MPI-2 standard suggests the use of mpiexec and provides standardized command line arguments. Hence, even though LAM already had an mpirun command to launch MPI executables, mpiexec was added to comply with the standard.)

7.12.1 General Syntax

The general form of mpiexec commands is:

mpiexec [global_args] local_args1 [: local_args2 ...]

Global arguments are applied to all MPI processes that are launched. They must be specified before any local arguments. Common global arguments include:

-boot: Boot the LAM RTE before launching the MPI processes.
-boot-args <args>: Pass <args> to the back-end lamboot. Implies -boot.
-machinefile <filename>: Specify <filename> as the boot schema to use when invoking the back-end lamboot. Implies -boot.
-prefix <lam/install/path>: Use the LAM/MPI installation specified ...
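As a hedged illustration of the general form shown above, a single command that boots a temporary universe from a boot schema and launches an MPMD job might look like the following (manager and worker are placeholder executable names, and the -n count and ":" separator are assumed from the MPI-2 standardized mpiexec arguments rather than quoted from this excerpt):

shell$ mpiexec -machinefile hostfile -n 1 manager : -n 4 worker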
21. ... a temporary, per-user session directory of the following form:

<tmpdir>/lam-<username>@<hostname>[-<session_suffix>]

Each of the components is described below.

<tmpdir>: LAM will set the prefix used for the session directory based on the following search order:
1. The value of the LAM_MPI_SESSION_PREFIX environment variable
2. The value of the TMPDIR environment variable
3. /tmp

It is important to note that, unlike LAM_MPI_SESSION_SUFFIX, the environment variables for determining <tmpdir> must be set on each node (although they do not necessarily have to have the same value). <tmpdir> must exist before lamboot is run, or lamboot will fail.

<username>: The user's name on that host.

<hostname>: The hostname.

<session_suffix>: LAM will set the suffix (if any) used for the session directory based on the following search order:
1. The value of the LAM_MPI_SESSION_SUFFIX environment variable.
2. If running under a supported batch system, a unique session ID (based on information from the batch system) will be used.

LAM_MPI_SESSION_SUFFIX and the batch information only need to be available on the node from which lamboot is run. lamboot will propagate the information to the other nodes.

12.9 Signal Catching

LAM/MPI now catches the signals SEGV, BUS, FPE, and ILL. The signal handler terminates the application. This is useful in batch jobs to help ensure ...
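A minimal illustration of steering the session directory to a scratch filesystem before booting, for Bourne-like shells (the path is hypothetical, and, as noted above, must already exist on every node):

shell$ LAM_MPI_SESSION_PREFIX=/scratch/$USER
shell$ export LAM_MPI_SESSION_PREFIX
shell$ lamboot hostfile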
22. ... all such posts are automatically rejected; only the LAM Team can post to this list.

11.1.2 General Discussion / User Questions

BEFORE YOU POST TO THIS LIST: please check all the other resources listed in this chapter first. Search the mailing list archives (http://www.lam-mpi.org/faq/) to see if anyone else had a similar problem before you did. Re-read the error message that LAM displayed to you; LAM can sometimes give incredibly detailed error messages that tell you exactly how to fix the problem. This, unfortunately, does not stop some users from cut-and-pasting the entire error message, verbatim (including the solution to their problem), into a mail message, sending it to the list, and asking "How do I fix this problem?" So please: think (and read) before you post. (Our deep apologies if some of the information in this section appears to be repetitive and condescending. Believe us when we say that we have tried all other approaches; some users simply either do not read the information provided, or only read the ...)

This list is used for general questions and discussion of LAM/MPI. Users can post questions, comments, etc. to this list. Due to recent increases in spam, only subscribers are allowed to post to the list. If you are not subscribed to the list, your posts will be discarded.

To subscribe or unsubscribe from the list, visit the list information page: http://www.lam-mpi.org/mailman/listinfo.cgi/lam

After you have subscribed ...
23. ... and promiscuous connections must be enabled. By default, LAM's base boot SSI startup protocols disable promiscuous connections. However, this behavior can be overridden when LAM is configured and at run time: if the SSI parameter boot_base_promisc is set to an empty value, or set to the integer value 1, promiscuous connections will be accepted when the LAM RTE is booted.

8.1.5 The bproc Module

The Beowulf Distributed Process Space (BProc) project (http://bproc.sourceforge.net/) is a set of kernel modifications, utilities, and libraries which allow a user to start processes on other machines in a Beowulf-style cluster. Remote processes started with this mechanism appear in the process table of the front-end machine in the cluster.

LAM/MPI functionality has been tested with BProc version 3.2.5. Prior versions had a bug that affected at least some LAM/MPI functionality; it is strongly recommended to upgrade to at least version 3.2.5 before attempting to use the LAM/MPI native BProc capabilities.
24. ... are discussed in detail in Chapter 9. Arbitrary user arguments can also be passed to the user program. mpirun will attempt to parse all options (looking for LAM options) until it finds a --. All arguments following -- are directly passed to the MPI application.

Pass three command line arguments to every instance of my_mpi_program:

shell$ mpirun -ssi rpi usysv C my_mpi_program arg1 arg2 arg3

Pass three command line arguments, escaped from parsing:

shell$ mpirun -ssi rpi usysv C my_mpi_program -- arg1 arg2 arg3

7.14.2 Controlling Where Processes Are Launched

mpirun allows for fine-grained control of where to schedule launched processes. Note that LAM uses the term "schedule" extensively to indicate which nodes processes are launched on; LAM does not influence operating system semantics for prioritizing processes or binding processes to specific CPUs. The boot schema file can be used to indicate how many CPUs are on a node, but this is only used for scheduling purposes. For a fuller description of CPU counts in boot schemas, see Sections 4.4.1 and 8.1.1 (pages 26 and 65, respectively).

LAM offers two main scheduling nomenclatures: by node and by CPU. For example, N means "all schedulable nodes in the universe" ("schedulable" is defined in Section 7.1.2). Similarly, C means "all schedulable CPUs in the universe". More fine-grained control is also possible: nodes and CPUs can be individually identified, or iden...
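A brief, hedged illustration of the nomenclature described above (the ranged n0-1 form is shown as an assumption of the individual-node notation; the surrounding sections give the authoritative syntax):

shell$ mpirun N my_mpi_program      # one process on every schedulable node
shell$ mpirun C my_mpi_program      # one process per schedulable CPU
shell$ mpirun n0-1 my_mpi_program   # only on nodes n0 and n1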
25. ... changed at run time. Many of these parameters are identical to their sysv counterparts and are not re-described here.

Table 9.7: SSI parameters for the usysv RPI module:

- rpi_tcp_short (default: 65535): Maximum length (in bytes) of a "short" message for sending via TCP sockets (i.e., off-node).
- rpi_tcp_sockbuf (default: -1): Socket buffering in the OS kernel (-1 means use the short message size).
- rpi_usysv_pollyield (default: 1): Same as sysv counterpart.
- rpi_usysv_priority (default: 40): Default priority level.
- rpi_usysv_readlockpoll (default: 10,000): Number of iterations to spin before yielding the processor while waiting to read.
- rpi_usysv_shmmaxalloc (default: from configure): Same as sysv counterpart.
- rpi_usysv_shmpoolsize (default: from configure): Same as sysv counterpart.
- rpi_usysv_short (default: 8192): Same as sysv counterpart.
- rpi_usysv_writelockpoll (default: 10): Number of iterations to spin before yielding the processor while waiting to write.

9.4 MPI Collective Communication

MPI collective communication functions have their basic functionality outlined in the MPI standard. However, the implementation of this functionality can be optimized and/or implemented in different ways. As such, LAM provides modules for implementing the MPI collective routines that are targeted for different environments:

- Basic algorithms
- SMP-optimized algorithms
- Shared-memory algorithms

These modules are discussed in detail below.
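A specific coll module can be requested at run time with the coll SSI parameter; a hedged example (whether the module is actually usable still depends on the selection rules described in the surrounding sections):

shell$ mpirun C -ssi coll smp my_mpi_program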
26. ... cluster.example.com)
n0<1234> ssi:boot:base:linear: booting n3 (node4.cluster.example.com)
n0<1234> ssi:boot:base:linear: finished

The parameters passed to lamboot in the example above are as follows:

-v: Make lamboot be slightly verbose.
-ssi boot rsh: Ensure that LAM uses the rsh/ssh boot module to boot the LAM universe. Typically, LAM chooses the right boot module automatically (and therefore this parameter is not typically necessary), but to ensure that this tutorial does exactly what we want it to do, we use this parameter to absolutely ensure that LAM uses rsh or ssh to boot the universe.
hostfile: Name of the boot schema file.

Common causes of failure with the lamboot command include (but are not limited to):

- The user does not have permission to execute on the remote node. This typically involves setting up a $HOME/.rhosts file (if using rsh), or properly configured SSH keys (if using ssh). Setting up .rhosts and/or SSH keys for password-less remote logins is beyond the scope of this tutorial; consult local documentation for rsh and ssh, and/or internet tutorials on setting up SSH keys.
- The first time a user uses ssh to execute on a remote node, ssh typically prints a warning to the standard error. LAM will interpret this as a failure. If this happens, lamboot will complain that something unexpectedly appeared on stderr, and abort. One solution is to manually ssh to each node in the boot ...
27. ... followed by the data of the message. The message lengths at which the different protocols are used can be changed with the SSI parameter rpi_ib_tinymsglen, which represents the maximum length of tiny messages. LAM defaults to 1,024 bytes for the maximum length of tiny messages.

It may be desirable to adjust these values for different kinds of applications and message passing patterns. The LAM Team would appreciate feedback on the performance of different values for real-world applications.

Posting Envelopes to Receive (Scalability)

Receive buffers must be posted to the IB communication hardware/library before any receives can occur. LAM/MPI uses envelopes that contain MPI signature information and, in the case of tiny messages, also hold the actual message contents. The size of each envelope is therefore the sum of the size of the headers and the maximum size of a tiny message (controlled by the rpi_ib_tinymsglen SSI parameter). LAM pre-posts 64 envelope buffers per peer process by default, but this can be overridden at run time with the rpi_ib_num_envelopes SSI parameter.

These two SSI parameters can have a large effect on scalability. Since LAM pre-posts a total of (num_processes - 1) x num_envelopes x tinymsglen bytes, this can be prohibitive if num_processes grows large. However, num_envelopes and tinymsglen can be adjusted to help keep this number low, although they may have an effect on run-time performance. Changing the number of ...
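A rough worked instance of the pre-posted buffer total given above, under assumed values: with the defaults of 64 envelopes and 1,024-byte tiny messages, a 65-process job pre-posts (65 - 1) x 64 x 1024 = 4,194,304 bytes (4 MB) of receive buffers in each process, in addition to the envelope headers.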
28. ... for GCC-based compilers, it is --export-dynamic.

For example, with a GCC-based compiler, when linking the final executable with the appropriate MPI wrapper compiler (e.g., mpicc, mpiCC, or mpif77), use the export switch as follows:

shell$ mpicc main.c -c
shell$ mpicc restart_functions.c -c
shell$ mpicc main.o restart_functions.o -o my_mpi_application --export-dynamic

This will result in an MPI application that properly exports its symbols, such that LAM can find the Checkpoint, Continue, and Restart functions at run time.

(Fortran compilers typically mangle function names in one of four ways: make the name all lower case; make the name all lower case and add one underscore; make the name all lower case and add two underscores; or make the name all upper case.)

Running a Checkpoint/Restart-Capable MPI Job

Even though MPI library state is not used with the self module, a checkpoint-capable RPI must be used for the MPI application. For example, the crtcp RPI module can be selected along with the self module:

shell$ mpirun C -ssi rpi crtcp -ssi cr self my_mpi_program

Failing to use a checkpoint-capable RPI will result in undefined behavior.

Checkpointing and Restarting

Once a checkpoint-capable job is running, the LAM command lamcheckpoint can be used to invoke a checkpoint. Running lamcheckpoint with the PID of mpirun will cause the user-defined Checkpoint function to be invoked. Although not typically useful in the sel...
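A hedged sketch of invoking a checkpoint from the shell, using the cr SSI parameter and the -pid flag described elsewhere in this guide (the PID shown is a placeholder for the actual PID of mpirun, e.g., as reported by ps):

shell$ lamcheckpoint -ssi cr self -pid 12345

See the lamcheckpoint description in Chapter 7 for the authoritative list of arguments.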
29. ... the framework itself, as well as tuning the run-time performance of individual modules by passing parameters to each module.

Although the specific usage of each SSI module parameter is defined by either the framework or the module that it is passed to, the value of most parameters will be resolved by the following:

1. If a valid value is provided via a run-time SSI parameter, use that.
2. Otherwise, attempt to calculate a meaningful value at run time, or use a compiled-in default value.

As such, it is typically possible to set a parameter's default value when LAM is configured/compiled, but use a different value at run time. (Note that many SSI modules provide configure flags to set compile-time defaults for "tweakable" parameters; see [14].)

6.3.1 Naming Conventions

SSI parameter names are generally strings containing only letters and underscores, and can typically be broken down into three parts. For example, the parameter boot_rsh_agent can be broken into its three components:

- SSI module type: the first string of the name. In this case, it is boot.
- SSI module name: the second string of the name, corresponding to a specific SSI module. In this case, it is rsh.
- Parameter name: the last string in the name. It may be an arbitrary string, and may include multiple underscores. In this case, it is agent.

Although the parameter name is technically only the last part of the string, it is only proper to refer to it within its overall ...
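As a hedged illustration of how such a parameter is typically supplied at run time (the ssh agent value is only an example, and the LAM_MPI_SSI_<parameter> environment-variable form is stated as an assumption of LAM's usual convention rather than quoted from this excerpt):

shell$ lamboot -ssi boot_rsh_agent "ssh -x" hostfile

or, equivalently, for Bourne-like shells:

shell$ LAM_MPI_SSI_boot_rsh_agent="ssh -x"
shell$ export LAM_MPI_SSI_boot_rsh_agent
shell$ lamboot hostfile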
30. ... (e.g., a desktop workstation and a set of worker nodes). In this case, it is desirable to mark the desktop workstation as "non-schedulable" so that LAM will not launch executables there by default. Consider the following boot schema:

# Mark my workstation as non-schedulable
my_workstation.office.example.com schedule=no
# All the other nodes are, by default, schedulable
node1.cluster.example.com
node2.cluster.example.com
node3.cluster.example.com
node4.cluster.example.com

Booting with this schema allows the convenience of:

shell$ mpirun C my_mpi_program

which will only run my_mpi_program on the four cluster nodes (i.e., not the workstation). Note that this behavior only applies to the C and N designations; LAM will always allow execution on any node when using the nX or cX notation:

shell$ mpirun c0 C my_mpi_program

which will run my_mpi_program on all five nodes in the LAM universe.

7.2 The lamcheckpoint Command

The lamcheckpoint command is provided to checkpoint an MPI application. One of the arguments to lamcheckpoint is the name of the checkpoint/restart module (which can be either one of blcr and self). Additional arguments to lamcheckpoint depend on the selected checkpoint/restart module. The name of the module can be specified by passing the cr SSI parameter.

Common arguments that are used with the lamcheckpoint command are:

- -ssi: Just like with mpirun, the -ssi flag can be ...
31. ... get/put/accumulate operations are restricted to being basic datatypes, or single-level contiguous vectors of basic datatypes. The implementation of the one-sided operations is layered on top of the point-to-point functions, and will thus perform no better than them. Nevertheless, it is hoped that providing this support will aid developers in developing and debugging codes using one-sided communication.

While LAM provides the required MPI_MODE constants, they are ignored by the present implementation.

Table 5.7 lists the functions related to one-sided communication that have been implemented (Table 5.7: Supported MPI-2 one-sided functions): MPI_ACCUMULATE, MPI_GET, MPI_PUT, MPI_WIN_CREATE, MPI_WIN_POST, MPI_WIN_FENCE, MPI_WIN_FREE, MPI_WIN_WAIT, MPI_WIN_COMPLETE, MPI_WIN_GET_GROUP.

5.2.4 Extended Collective Operations

LAM implements the new MPI-2 collective functions MPI_EXSCAN and MPI_ALLTOALLW for intracommunicators. Intercommunicator collectives are implemented for all the functions listed in Table 5.8. Notably, intercommunicator collectives are not defined for MPI_SCAN (because the MPI standard does not define it), MPI_ALLGATHERV, and MPI_EXSCAN.

Supported functions (Table 5.8): MPI_ALLTOALLV, MPI_REDUCE, MPI_SCATTERV, MPI_ALLGATHER, MPI_REDUCE_SCATTER, MPI_ALLGATHERV, MPI_ALLTOALL, MPI_ALLTOALLW, MPI_GATHER, MPI_GATHERV, MPI_BCAST, MPI_SCATTER, MPI_BARRIER.
32. in a set of integers However for operations involving the combination of floating point numbers associativity and commutativity matter An Advice to Implementors note in MPI 1 section 4 9 1 114 20 states It is strongly recommended that MPILREDUCE be implemented so that the same result be obtained whenever the function is applied on the same arguments appearing in the same or der Note that this may prevent optimizations that take advantage of the physical location of processors Some implementations of the reduction operations may specifically take advantage of data locality and therefore assume that the reduction operator is associative As such LAM will always take the conserva tive approach to reduction operations and fall back to non associative algorithms e g lam_basic for the reduction operations unless specifically told to use associative SMP optimized algorithms by setting the SSI parameter coll_base_associative tol 9 4 3 Thelam_basic Module Module Summary Name lambasic Kind coll Default SSI priority 0 Checkpoint restart yes The lam_basic module provides simplistic algorithms for each of the MPI collectives that are layered on top of point to point functionality It can be used in any environment Its priority is sufficiently low that it will be chosen if no other Coll module is available Many of the algorithms are twofold for N or less processes linear algorithms are used For more than N processes
33. ... in the range [-1, 100], with -1 indicating that the module should not be considered for selection, and 100 being the highest priority. Ties will be broken arbitrarily by the SSI framework.

A module's priority can be set at run time through the normal SSI parameter mechanisms (i.e., environment variables or the -ssi parameter). Every module has an implicit priority SSI parameter of the form <type>_<module_name>_priority.

For example, a system administrator may set environment variables in system-wide shell setup files (e.g., /etc/profile, /etc/bashrc, or /etc/csh.cshrc) to change the default priorities.

6.5.3 Selection Algorithm

For each component type, the following general selection algorithm is used:

- A list of all available modules is created. If the user specified one or more modules for this type, only those modules are queried to see if they are available. Otherwise, all modules are queried.
- The module with the highest priority (and potentially meeting other selection criteria, depending on the module's type) will be selected.

Each SSI type may define its own additional selection rules. For example, the selection of coll, cr, and rpi modules may be interdependent, and may depend on the supported MPI thread level. Chapter 9 (page 75) details the selection algorithm for MPI SSI modules.

Chapter 7: LAM/MPI Command Quick Reference

This section is intended to provide a quick reference of the major ...
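For instance (a hedged sketch, not quoted from this guide), lowering the default priority of the tcp RPI module system-wide could be done by exporting its implicit priority parameter from a shell setup file, assuming LAM's usual LAM_MPI_SSI_<parameter> environment-variable convention:

export LAM_MPI_SSI_rpi_tcp_priority=20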
34. ... machines which have IB (VAPI) shared libraries but not the IB hardware, and when LAM is compiled with IB support, you may see error messages like "can't open device file" when trying to use LAM/MPI, even when you are not using the ib module. This error message pertains to the IB (VAPI) shared libraries and does not come from within LAM/MPI. It results because, when LAM/MPI tries to query the shared libraries, VAPI tries to open the IB device during the shared library init phase, which is not proper.

- Heterogeneity between big- and little-endian machines is not supported.
- The ib RPI is not supported with IMPI.
- Mixed shared memory / IB message passing is not yet supported; all message passing is through Infiniband.
- XMPI tracing is not yet supported.
- The ib RPI module is designed to run in environments where the number of available processors is greater than or equal to the number of MPI processes on a given node. The ib RPI module will perform poorly (particularly in blocking MPI communication calls) if there are fewer processors than processes on a node.

9.3.5 The lamd Module (Daemon-Based Communication)

Module Summary: Name: lamd; Kind: rpi; Default SSI priority: 10; Checkpoint/restart: no

The lamd RPI module uses the LAM daemons for all interprocess communication. This allows for true asynchronous message passing (i.e., messages can progress even while the user's program is executing), albeit at the cost of a significan...
36. ... parameters at run time. See Section 9.3.4 (page 81) for more details.

3.3 Usage Notes

3.3.1 Operating System Bypass Communication: Myrinet and Infiniband

The gm and ib RPI modules require an additional memory manager in order to run properly. On most systems, LAM will automatically select the proper memory manager, and the system administrator / end user does not need to know anything about this. However, on some systems and/or in some applications, extra work is required.

The issue is that OS-bypass networks such as Myrinet and Infiniband require virtual pages to be "pinned" down to specific hardware addresses before they can be used by the Myrinet/Infiniband NIC hardware. This allows the NIC communication processor to operate on memory buffers independently of the main CPU, because it knows that the buffers will never be swapped out (or otherwise relocated in memory) before the operation is complete.

LAM performs the pinning operation behind the scenes; for example, if an application MPI_SENDs a buffer using the gm or ib RPI modules, LAM will automatically pin the buffer before it is sent. However, since pinning is a relatively expensive operation, LAM usually leaves buffers pinned when the function (e.g., MPI_SEND) completes. This typically speeds up future sends and receives because the buffer does not need to be re-pinned. However, if the user frees this memory, the buffer must be unpinned before it is given back to the operating system ...
37. ... pre-posted envelopes effectively controls how many messages can be simultaneously flowing across the network; changing the tiny message size affects when LAM switches to a rendezvous sending protocol instead of an eager send protocol. Relevant values for these parameters are likely to be application-specific; keep this in mind when running large parallel jobs.

Modifying the MTU Value

The Maximum Transmission Unit (MTU) value to be used for Infiniband can be configured at runtime using the rpi_ib_mtu SSI parameter. It can take the values 256, 512, 1024, 2048, and 4096, corresponding to the MTU256, MTU512, MTU1024, MTU2048, and MTU4096 Infiniband MTU values, respectively. The default value is 1024 (corresponding to MTU1024).

Pinning Memory

The Infiniband communication library can only communicate through registered (sometimes called "pinned") memory. LAM/MPI handles this automatically by pinning user-provided buffers when required. This allows for good message passing performance, especially when re-using buffers to send/receive multiple messages.

Note that since LAM/MPI manages all pinned memory, LAM/MPI must be aware of memory that is freed so that it can be properly unpinned before it is returned to the operating system. Hence, LAM/MPI must intercept calls to functions such as sbrk() and munmap() to effect this behavior. To this end, support for additional memory allocation packages is included in LAM ...
38. ... processes (e.g., MPI_COMM_SPAWN).

10.3 Serial Debuggers

LAM also allows the use of one or more serial debuggers when debugging a parallel program.

10.3.1 Launching Debuggers

LAM allows the arbitrary execution of any executable in an MPI context, as long as an MPI executable is eventually launched. For example, it is common to mpirun a debugger (or a script that launches a debugger) on some nodes and directly run the application on other nodes, since the debugger will eventually launch the MPI process.

However, one must be careful when running programs on remote nodes that expect the use of stdin: stdin on remote nodes is redirected to /dev/null. For example, it is advantageous to export the DISPLAY environment variable and run a shell script that invokes an xterm with gdb (for example) running in it on each node. For example:

shell$ mpirun C -x DISPLAY xterm-gdb.csh

Additionally, it may be desirable to only run the debugger on certain ranks in MPI_COMM_WORLD. For example, with parallel jobs that include tens or hundreds of MPI processes, it is really only feasible to attach debuggers to a small number of processes. In this case, a script may be helpful to launch debuggers for some ranks in MPI_COMM_WORLD and directly launch the application in others.

The LAM environment variable LAMRANK can be helpful in this situation. This variable is placed in the environment before the target application is executed. Hence, it is v...
39. ... read the release notes entitled "Operating System Bypass Communication: Myrinet and Infiniband" in the LAM/MPI Installation Guide for notes about memory management with Myrinet. Specifically, it deals with LAM's automatic overrides of the malloc(), calloc(), and free() functions.

Overview

In general, using the gm RPI module is just like using any other RPI module: MPI functions will simply use native GM message passing for their back-end message transport.

Although it is not required, users are strongly encouraged to use the MPI_ALLOC_MEM and MPI_FREE_MEM functions to allocate and free memory (instead of, for example, malloc() and free()).

The gm RPI module is marked as "yes" for checkpoint/restart support, but this is only true when the module was configured and compiled with the --with-rpi-gm-get configure flag. This enables LAM to use the GM 2.x function gm_get(). Note that enabling this feature (with the rpi_gm_cr SSI parameter) slightly decreases the performance of the gm module, because of the additional bookkeeping that is necessary, which is why it is disabled by default. The performance difference is actually barely measurable; it is well below one microsecond. It is not the default behavior simply on principle.

At the time of this writing, there still appeared to be problems with gm_get(), so this behavior is disabled by default. It is not clear whether the problems with gm_get() are due to a problem with Myricom's ...
40. ... required, users are strongly encouraged to use the MPI_ALLOC_MEM and MPI_FREE_MEM functions to allocate and free memory used for communication (instead of, for example, malloc() and free()). This avoids the need to pin the memory during communication time, and hence saves on message passing latency.

Tunable Parameters

Table 9.3 shows the SSI parameters that may be changed at run time; the text below explains each one in detail.

Table 9.3: SSI parameters for the ib RPI module:

- rpi_ib_hca_id (default: none): The string ID of the Infiniband hardware HCA to be used.
- rpi_ib_num_envelopes (default: 64): Number of envelopes to be pre-posted per peer process.
- rpi_ib_port (default: -1): Specific IB port to use (-1 indicates none).
- rpi_ib_priority (default: 50): Default priority level.
- rpi_ib_tinymsglen (default: 1024): Maximum length (in bytes) of a "tiny" message.
- rpi_ib_mtu (default: 1024): Maximum Transmission Unit (MTU) value to be used for IB.

Port Allocation

It is usually unnecessary to specify which Infiniband port to use; LAM/MPI will automatically attempt to acquire ports greater than 1. However, if you wish LAM to use a specific Infiniband port number, you can tell LAM which port to use with the rpi_ib_port SSI parameter. Specifying which port to use has precedence over the port range check: if a specific port is indicated, LAM will try to use that, and not check a range of ports. Specifying to use port -1 ...
41. ... shared library, it may be necessary to use the LD_LIBRARY_PATH environment variable to specify where it can be found. Specifically, if the BLCR library is not in the default path searched by the linker, errors will occur at run time because it cannot be found. In such cases, adding the directory where the libcr.so file(s) can be found to the LD_LIBRARY_PATH environment variable on all nodes where the MPI application will execute will solve the problem. Note that this may entail editing users' "dot" files to augment the LD_LIBRARY_PATH variable. (Be sure to see Section 4.1.1 for details about which shell startup files should be edited. Also note that shell startup files are only read when starting the LAM universe; hence, if you change values in shell startup files, you will likely need to re-invoke the lamboot command to put your changes into effect.) For example:

# edit user's shell startup file to augment LD_LIBRARY_PATH
shell$ lamboot hostfile
shell$ mpirun C -ssi rpi crtcp -ssi cr blcr my_mpi_program

Alternatively, the -x option to mpirun can be used to export the LD_LIBRARY_PATH environment variable to all MPI processes. For example, for the Bourne shell and derivatives:

shell$ LD_LIBRARY_PATH=/location/of/blcr/lib:$LD_LIBRARY_PATH
shell$ export LD_LIBRARY_PATH
shell$ mpirun C -ssi rpi crtcp -ssi cr blcr -x LD_LIBRARY_PATH my_mpi_program

For the C shell and derivat...
42. ... SSI parameters for the gm RPI module
SSI parameters for the ib RPI module
SSI parameters for the lamd RPI module
SSI parameters for the sysv RPI module
SSI parameters for the tcp RPI module
SSI parameters for the usysv RPI module
Listing of MPI collective functions, indicating which have been optimized for SMP environments
Listing of MPI collective functions, indicating which have been implemented using shared memory
SSI parameters for the shmem coll module
Valid values for the LAM_MPI_THREAD_LEVEL environment variable

Chapter 1: Don't Panic! (Who Should Read This Document?)

This document probably looks huge to new users. But don't panic! It is divided up into multiple, relatively independent sections that can be read and digested separately. Although this manual covers a lot of relevant material for all users, the following guidelines are suggested for various types of users. If you are:

- New to MPI: First read Chapter 2 for an introduction to MPI and LAM/MPI. A good reference on MPI programming is also strongly recommended; there are several books available, as well as excellent on-line tutorials (e.g., [3, 4, 5, 9]). When you're comfortable with the concepts of MPI, move on to "New to LAM/MPI".

- New ...
43. ... to LAM/MPI: If you're familiar with MPI but unfamiliar with LAM/MPI, first read Chapter 4 for a mini-tutorial on getting started with LAM/MPI. You'll probably be familiar with many of the concepts described, and will simply learn the LAM terminology and commands. Glance over (and use as a reference) Chapter 7 for the rest of the LAM/MPI commands. Chapter 11 contains some quick tips on common problems with LAM/MPI.

Assuming that you've already got MPI codes that you want to run under LAM/MPI, read Chapter 5 to see exactly what MPI-2 features LAM/MPI supports.

When you're comfortable with all this, move on to "Previous LAM user".

- Previous LAM user: As a previous LAM user, you're probably already fairly familiar with all the LAM commands; their basic functionality hasn't changed much. However, many of them have grown new options and capabilities, particularly in the area of run-time tunable parameters. So be sure to read Chapter 6 to learn about LAM's System Services Interface (SSI), Chapters 8 and 9 (LAM and MPI SSI modules), and finally Chapter 12 (miscellaneous LAM/MPI information, features, etc.).

If you're curious to see a brief listing of new features in this release, see the release notes in Chapter 3. This isn't really necessary, but when you're kicking the tires of this version, it's a good way to ensure that you are aware of all the new features.

Finally, even for the seasoned MPI and LAM/MPI veteran, be sure to check out Chapter 10 fo...
transparent, heterogeneous communication. For example, MPI codes which have been written on the RS/6000 architecture running AIX can be ported to a SPARC architecture running Solaris with little or no modification.

2.2 About LAM/MPI

LAM/MPI is a high-performance, freely available, open-source implementation of the MPI standard that is researched, developed, and maintained at the Open Systems Lab at Indiana University. LAM/MPI supports all of the MPI-1 Standard and much of the MPI-2 standard. More information about LAM/MPI, including all the source code and documentation, is available from the main LAM/MPI web site.

LAM/MPI is not only a library that implements the mandated MPI API, but also the LAM run-time environment: a user-level, daemon-based run-time environment that provides many of the services required by MPI programs. Both major components of the LAM/MPI package are designed as component frameworks, extensible with small modules that are selectable (and configurable) at run time. This component framework is known as the System Services Interface (SSI). The SSI component architectures are fully documented in [8, 10, 11, 12, 13, 14, 15].

http://www.mpi-forum.org/
http://www.lam-mpi.org/

Chapter 3

Release Notes

This chapter contains release notes as they pertain to the run-time operation of LAM/MPI. The Installation Guide contains additional release notes on the configuration, compilation, and installation of LAM/MPI.
true asynchronous message passing.

9.3.6 The sysv Module (Shared Memory Using System V Semaphores)

Module Summary
    Name: sysv
    Kind: rpi
    Default SSI priority: 30
    Checkpoint / restart: no

The sysv RPI is one of two combination shared-memory / TCP message passing modules. Shared memory is used for passing messages to processes on the same node; TCP sockets are used for passing messages to processes on other nodes. System V semaphores are used for synchronization of the shared memory pool.

Be sure to read Section 9.3.1 (page 77) on the difference between this module and the usysv module.

Overview

Processes located on the same node communicate via shared memory. One System V shared segment is shared by all processes on the same node. This segment is logically divided into three areas. The total size of the shared segment (in bytes) allocated on each node is:

    (2 x C) + (N x (N - 1) x (S + C)) + P

where C is the cache line size, N is the number of processes on the node, S is the maximum size of short messages, and P is the size of the pool for large messages.

The first area (of size 2 x C) is for the global pool lock. The sysv module allocates a semaphore set (of size six) for each process pair communicating via shared memory. On some systems, the operating system may need to be reconfigured to allow for more semaphore sets if running tasks with many processes communicating via shared memory.

The second area is for postboxes, used for short-message transfers between process pairs.
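As a worked illustration of the sizing formula above (with purely hypothetical values, not LAM defaults): for a cache line size C = 64 bytes, N = 4 processes on the node, a maximum short-message size S = 8192 bytes, and a large-message pool P = 1048576 bytes, the shared segment would be (2 x 64) + (4 x 3 x (64 + 8192)) + 1048576 = 128 + 99072 + 1048576 = 1147776 bytes, or roughly 1.1 MB.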
used to pass key/value pairs to LAM. Indeed, it is required to pass at least one SSI parameter: cr, indicating which cr module to use for checkpointing.

- -pid: Indicate the PID of mpirun to checkpoint.

Notes:

- If the blcr cr module is selected, the name of the directory for storing the checkpoint files and the PID of mpirun should be passed as SSI parameters to lamcheckpoint.

- If the self cr module is selected, the PID of mpirun should be passed via the -pid parameter.

See Section 9.5 for more detail about the checkpoint/restart capabilities of LAM/MPI, including details about the blcr and self cr modules.

7.3 The lamclean Command

The lamclean command is provided to clean up the LAM universe. It is typically only necessary when MPI processes terminate badly, and potentially leave resources allocated in the LAM universe (such as MPI-2 published names, processes, or shared memory). The lamclean command will kill all processes running in the LAM universe, and free all resources that were associated with them (including unpublishing MPI-2 dynamically published names).

7.4 The lamexec Command

The lamexec command is similar to mpirun, but is used for non-MPI programs. For example:

    shell$ lamexec N uptime
     5:37pm  up 21 days, 23:49,  5 users,  load average: 0.31, 0.26, 0.25
     5:37pm  up 21 days, 23:49,  2 users,  load average: 0.01, 0.00, 0.00
     5:37pm  up 21 days, 23:50,  3 users,  load average: 0.01, 0.00, 0.00
     5:37pm  up 21 days,
47. with LAM s configure script that will force LAM to zero out all memory before it is used This will eliminate the read from uninitialized types of warnings that memory checking debuggers will identify deep inside LAM This option can only be specified when LAM is configured it is not possible to enable or disable this behavior at run time Since this option invokes a slight overhead penalty in the run time performance of LAM it is not the default 109 110 Chapter 11 Troubleshooting Although LAM is a robust run time environment and its MPI layer is a mature software system errors do occur Particularly when using LAM MPI for the first time some of the initial per user setup can be confusing e g setting up rhosts or SSH keys for password less remote logins This section aims to identify a few common problems and solutions Much more information can be found on the LAM FAQ on the main LAM web site 11 1 The LAM MPI Mailing Lists There are two mailing lists one for LAM MPI announcements and another for questions and user discus sion of LAM MPI 11 1 1 Announcements This is a low volume list that is used to announce new version of LAM MPI important patches etc To subscribe to the LAM announcement list visit its list information page you can also use that page to unsubscribe or change your subscription options http www lam mpi org mailman listinfo cgi lam announce NOTE Users cannot post to this list
10.2.2 Suggested Use

Since TotalView support is started with the mpirun command, TotalView will, by default, start by debugging mpirun itself. While this may seem to be an annoying drawback, there are actually good reasons for this:

- While debugging the parallel program, if you need to re-run the program, you can simply re-run the application from within TotalView itself. There is no need to exit the debugger to run your parallel application again.

- TotalView can be configured to automatically skip displaying the mpirun code. Specifically, instead of displaying the mpirun code and enabling it for debugging, TotalView will recognize the command named mpirun and start executing it immediately upon load. See below for details.

There are two ways to start debugging the MPI application:

1. The preferred method is to have a $HOME/.tvdrc file that tells TotalView to skip past the mpirun code and automatically start the parallel program. Create or edit your $HOME/.tvdrc file to include the following:

    # Set a variable to say what the MPI starter program is
    set starter_program mpirun

    # Check if the newly loaded image is the starter program
    # and start it immediately if it is
    proc auto_run_starter {loaded_id} {
        global starter_program
        set executable_name [TV::symbol get $loaded_id full_pathname]
        set file_component [file tail $executable_name]
        if {[string compare $file_component $starter_program] == 0} {
            puts
7.1.3 reports its MPI version as 1.2 through the function MPI_GET_VERSION.

Datatype constructor MPI_TYPE_CREATE_INDEXED_BLOCK: The MPI function MPI_TYPE_CREATE_INDEXED_BLOCK is not supported by LAM/MPI.

Treatment of MPI_Status: Although LAM supports the constants MPI_STATUS_IGNORE and MPI_STATUSES_IGNORE, the function MPI_REQUEST_GET_STATUS is not provided.

Error class for invalid keyval: The error class for invalid keyvals, MPI_ERR_KEYVAL, is fully supported.

Committing a committed datatype: Committing a committed datatype is fully supported; its end effect is a no-op.

Allowing user functions at process termination: Attaching attributes to MPI_COMM_SELF that have user-specified delete functions will now trigger these functions to be invoked as the first phase of MPI_FINALIZE. When these functions are run, MPI is still otherwise fully functional.

Determining whether MPI has finished: The function MPI_FINALIZED is fully supported.

The Info object: Full support for MPI_Info objects is provided. See Table 5.2.

    Supported functions:
    MPI_INFO_CREATE    MPI_INFO_FREE         MPI_INFO_GET_NTHKEY
    MPI_INFO_DELETE    MPI_INFO_GET          MPI_INFO_GET_VALUELEN
    MPI_INFO_DUP       MPI_INFO_GET_NKEYS    MPI_INFO_SET

    Table 5.2: Supported MPI-2 info functions

Memory allocation: The MPI_ALLOC_MEM and MPI_FREE_MEM functions will return "special" memory that enables fast memory passing in RPIs that support it. These functions are simply wrappers around malloc() and free().
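The C sketch below illustrates the intended usage pattern of these memory allocation functions; the 1 MB size and the communication in the middle are arbitrary placeholders:

    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        void *buf;

        MPI_Init(&argc, &argv);

        /* Request memory that the underlying RPI may be able to
           send/receive from more efficiently (e.g., pre-pinned). */
        MPI_Alloc_mem(1048576, MPI_INFO_NULL, &buf);

        /* ... use buf as a source/target of MPI communication ... */

        MPI_Free_mem(buf);
        MPI_Finalize();
        return 0;
    }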
50. 8 MPI_TYPE_DELETE_ATTR 40 MPI_TYPE_DUP 40 MPI_TYPE_F2C 37 MPI_TYPE_FREE_KEYVAL 40 MPI_TYPE_GET_ATTR 40 MPITYPE_GET_ CONTENTS 40 MPILTYPE_GET_ENVELOPE 40 MPITYPE_GET_EXTENT 38 40 MPI_TYPE_GET_NAME 40 MPITYPE_GET _TRUE _EXTENT 38 40 MPITYPE_SET_ATTR 40 MPI_TYPE_SET_NAME 40 MPI_UNPACK 38 MPILUNPACK_EXTERNAL 38 MPIUNPUBLISH_NAME 38 117 MPI_WAIT 115 MPI_WIN_C2F 37 MPI_WIN_COMPLETE 39 MPI_WIN_CREATE 39 127 MPIWIN_CREATE_ERRHANDLER 37 40 MPIWIN_CREATE KEYVAL 40 MPIWIN_DELETE_ATTR 40 MPI_WIN_F2C 37 MPI_WIN_FENCE 39 MPI_WIN_FREE 39 MPILWIN_FREE KEYVAL 40 MPI_WIN_GET_ATTR 40 MPIWIN_GET_ERRHANDLER 37 40 MPIWIN_GET_GROUP 39 MPI_WIN_GET_NAME 40 MPI_WIN_POST 39 MPI_WIN_SET_ATTR 40 MPIWIN_SET_ERRHANDLER 37 40 MPI_WIN_SET_NAME 40 MPI_WIN_START 39 MPI_WIN_WAIT 39 MPI_BARRIER 94 MPIL_COMM_SPAWN 38 MPIO_TEST 115 MPIO_TESTANY 115 MPIO_WAIT 115 MPIO_WAITANY 115 MPI types MPI BOOL 41 MPI COMPLExX 41 MPI DOUBLE_COMPLExX 41 MPI LONG_DOUBLE_COMPLExX 41 MPI_File 37 MPI_Info 37 38 MPI_Request 115 MPI_Status 36 39 MPIO_Request 115 MPI 2 I O support see ROMIO 4 no i_hostmap SSI parameter 75 pict command 20 29 40 56 np 1CC command 20 29 30 40 56 npicc command 20 29 30 40 56 piexec command 16 32 36 58 104 pif77 command 29 31 40 56 np imsg command 60 npirun command 31 60 66 72 98 104 107 116 120 npitas
51. ATTR MPI TYPE DELETE ATTR MPILWIN_DELETE_ATTR MPILCOMM_GET_ATTR MPILTYPE_GET_ATTR MPIWIN_GET_ATTR MPICOMM_SET_ATTR MPILTYPE SET _ATTR MPIWIN_SET_ATTR Table 5 10 Supported MPI 2 external interface functions grouped by function 5 2 6 T O MPI IO support is provided by including the ROMIO package from Argonne National Labs version 1 2 5 1 The LAM wrapper compilers mpicc mpiCC mpict and mpif77 will automatically pro vide all the necessary flags to compile and link programs that use ROMIO function calls Although the ROMIO group at Argonne has included support for LAM in their package there are still a small number of things that the LAM Team had to do to make ROMIO compile and install properly with LAM MPI As such if you try to install the ROMIO package manually with LAM MPI you will experience some difficulties There are some important limitations to ROMIO that are discussed in the romio README file One limitation that is not currently listed in the ROMIO README file is that atomic file access will not work with AFS This is because of file locking problems with AFS i e AFS iteself does not support file locking The ROMIO test program atomicity will fail if you specify an output file on AFS Additionally ROMIO does not support the following LAM functionality e LAM MPI 2 datatypes cannot be used with ROMIO ROMIO makes the fundamental assumption that MPI 2 datatypes are built upon MPI 1 datatypes LAM builds MPI
CPU that was listed in the boot schema. The C notation is therefore convenient shorthand notation for launching a set of processes across a group of SMPs.

Another method for running in parallel is:

    shell$ mpirun N hello

The N option has a different meaning than C: it means "launch one copy of hello on every node in the LAM universe." Hence, N disregards the CPU count. This can be useful for multi-threaded MPI programs.

Finally, to run an absolute number of processes (regardless of how many CPUs or nodes are in the LAM universe):

    shell$ mpirun -np 4 hello

This runs 4 copies of hello. LAM will "schedule" how many copies of hello will be run in a round-robin fashion on each node by how many CPUs were listed in the boot schema file. For example, on the LAM universe that we have previously shown in this tutorial, the following would be launched:

- hello would be launched on n0 (named node1)
- hello would be launched on n1 (named node2)
- 2 hellos would be launched on n2 (named node3)

Note that any number can be used. If a number is used that is greater than how many CPUs are in the LAM universe, LAM will "wrap around" and start scheduling starting with the first node again. For example, using -np 10 would result in the following schedule:

- 2 hellos on n0 (1 from the first pass, and then a second from the wrap-around)
- 2 hellos on n1 (1 from the first pass, and then a second from the wrap-around)
GM library or a problem in LAM itself, the --with-rpi-gm-get configure option is provided as a "hedging our bets" solution; if the problem does turn out to be with the GM library, LAM users can enable checkpoint support (and slightly lower long-message latency) by using this switch.

Tunable Parameters

Table 9.2 shows the SSI parameters that may be changed at run time; the text below explains each one in detail.

    SSI parameter name   Default value   Description
    rpi_gm_cr            0               Whether to enable checkpoint/restart support or not.
    rpi_gm_fast          0               Whether to enable the "fast" algorithm for sending short messages. This is an unreliable transport and is not recommended for MPI applications that do not continually invoke the MPI progression engine.
    rpi_gm_maxport       32              Maximum GM port number to check during MPI_INIT when looking for an available port.
    rpi_gm_nopin         0               Whether to let LAM/MPI register ("pin") arbitrary buffers or not.
    rpi_gm_port          -1              Specific GM port to use (-1 indicates none).
    rpi_gm_priority      50              Default priority level.
    rpi_gm_tinymsglen    1024            Maximum length (in bytes) of a "tiny" message.

    Table 9.2: SSI parameters for the gm RPI module

Port Allocation

It is usually unnecessary to specify which Myrinet (GM) port to use; LAM/MPI will automatically attempt to acquire ports greater than 1. By default, LAM will check for any available port between 1 and 8. If your Myrinet hardware has more than 8 ports, the rpi_gm_maxport SSI parameter can be used to raise the upper limit of the range that LAM checks.
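For example, run-time values for these parameters can be passed on the mpirun command line with the -ssi switch; the sketches below use arbitrary values:

    shell$ mpirun C -ssi rpi gm -ssi rpi_gm_port 4 my_mpi_program
    shell$ mpirun C -ssi rpi gm -ssi rpi_gm_tinymsglen 4096 my_mpi_program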
    SSI parameter name       Default value    Description
    boot_rsh_agent           From configure   Remote shell agent to use.
    boot_rsh_fast            0                If nonzero, assume that the shell on the remote node is the same as on the origin (i.e., do not check).
    boot_rsh_ignore_stderr   0                If nonzero, ignore output from stderr when booting; don't treat it as an error.
    boot_rsh_priority        10               Default priority level.
    boot_rsh_no_n            0                If nonzero, don't use "-n" as an argument to the boot agent.
    boot_rsh_no_profile      0                If nonzero, don't attempt to run ".profile" for Bourne-type shells.
    boot_rsh_username        None             Username to use, if different than the login name.

    Table 8.3: SSI parameters for the rsh boot module

Usage

SLURM allows running jobs in multiple ways. The slurm boot module is only supported in some of them:

- "Batch" mode, where a script is submitted via the srun command and is executed on the first node from the set that SLURM allocated for the job. The script runs lamboot, mpirun, etc., as is normal for a LAM/MPI job. This method is supported, and is perhaps the most common way to run LAM/MPI automated jobs in SLURM environments.

- "Allocate" mode, where the -A option is given to srun, meaning that the shell where lamboot runs is likely to not be one of the nodes that SLURM has allocated for the job. In this case, LAM daemons will be launched on all nodes that were allocated by SLURM, as well as the origin (i.e., the node where lamboot was run).
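A plausible interactive "allocate mode" session is sketched below; the processor count is a placeholder, and the exact srun options depend on the installed SLURM version:

    shell$ srun -n 4 -A             # allocate 4 processors, start a sub-shell
    shell$ lamboot                  # slurm boot module uses the SLURM allocation
    shell$ mpirun C my_mpi_program
    shell$ lamhalt
    shell$ exit                     # release the SLURM allocation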
55. I_COMM_SET_NAME 40 107 117 MPI_COMM_SPAWN 16 38 79 82 107 117 MPI COMM_SPAWN_MULTIPLE 38 MPI_COMM_SPLIT 118 MPI_EXSCAN 39 93 95 MPI_FINALIZE 16 36 44 96 113 114 MPI_FINALIZED 37 MPI_ FREE_MEM 37 78 80 82 MPI_GATHER 39 93 95 MPI_GATHERV 39 93 95 MPI_GET 39 MPI_GET_ADDRESS 38 MPI_GET_VERSION 36 MPI GROUP_C2F 37 MPIGROUP_F2C 37 MPI_INFO_C2F 37 MPLINFO CREATE 37 MPLINFO_DELETE 37 MPI_INFO_DUP 37 MPI_INFO_F2C 37 MPI_INFO_FREE 37 MPLINFO_ GET 37 MPLINFO_GET_NKEYS 37 MPLINFO_GET_NTHKEY 37 MPLINFO_GET_VALUELEN 37 MPLINFO_SET 37 MPL_INIT 17 18 36 44 63 79 90 96 99 101 106 113 115 117 120 MPI_INIT_THREAD 40 75 97 115 116 MPLIPROBE 118 MPLIRECV 35 MPI_IS_THREAD_MAIN 40 MPILLOOKUP_NAME 38 MPI_OPEN_PORT 38 MPI_PACK 38 MPI_PACK EXTERNAL 38 MPI_PACK_EXTERNAL_SIZE 38 MPI_PROBE 118 MPI_PUBLISH_NAME 38 117 MPI_PUT 39 MPIQUERY_THREAD 40 MPI_RECV 63 MPI REDUCE 39 91 93 95 118 MPIL REDUCE_SCATTER 39 93 95 MPI_LREQUEST_C2F 37 MPIL REQUEST_F2C 37 MPIL REQUEST_GET_STATUS 36 MPI_SCAN 39 93 95 MPI_SCATTER 39 93 95 MPI_SCATTERV 39 93 95 MPI_SEND 18 MPI_STATUS_C2F 37 MPI_STATUS_F2C 37 MPI_TEST 115 MPI_TYPE_C2F 37 MPI_TYPE_CREATE_DARRAY 38 MPI_TYPE_CREATE_HINDEXED 38 MPITYPE_CREATE_HVECTOR 38 MPITYPE_CREATE_INDEXED_BLOCK 36 MPI_TYPE_CREATE_KEYVAL 40 MPI_TYPE_CREATE_RESIZED 38 MPITYPE_CREATE STRUCT 38 MPITYPE_CREATE_SUBARRAY 3
56. LAM MPI User s Guide Version 7 1 3 The LAM MPI Team Open Systems Lab http www lam mpi org pervasivetecinologylabs AT INDIANA UNIVERSITY February 14 2007 Copyright 2001 2004 The Trustees of Indiana University All rights reserved Copyright 1998 2001 University of Notre Dame All rights reserved Copyright 1994 1998 The Ohio State University All rights reserved This file is part of the LAM MPI software package For license information see the LICENSE file in the top level directory of the LAM MPI source distribution The ptmalloc package used in the gm RPI SSI module is Copyright 1999 Wolfram Gloger Contents 1 Don t Panic Who Should Read This Document 2 Introduction to LAM MPI 21 ABOUEMIPL sossa we i ew eS BR a a e a Be ee 22 About LAMMET er aaa eed ok et A Ew eRe ws ble bai tek Gi Sandi Gee ew BR A Release Notes 3 1 New Feature Overview o oc 56 06422 ba ek ee ee ee ee ed es 22 SDOMIISSUES lt os ora A AAA A 3 2 1 mpirun and MPI Application cr Module Disagreement 3 2 2 Checkpoint Support Disabled for Spawned Processes o o 3 2 3 BLCR Support Only Works When Compiled Statically 320 Infniband rpi Module oao ee eee ee ee Re OG OE Re a 33 Usage NOS oa o coc caa mor a ee bee ee ee ea eae a ok d 3 3 1 Operating System Bypass Communication Myrinet and Infiniband 34 Platiorm Spectiic Notes 2 0 oa ke a a ee es BAL Provided RPM
Lusk, and Rajeev Thakur. Using MPI-2: Advanced Features of the Message Passing Interface. MIT Press, 1999.

Thilo Kielmann, Henri E. Bal, and Sergei Gorlatch. Bandwidth-efficient collective communication for clustered wide area systems. In International Parallel and Distributed Processing Symposium (IPDPS 2000), pages 492-499, Cancun, Mexico, May 2000. IEEE.

Message Passing Interface Forum. MPI: A Message Passing Interface. In Proc. of Supercomputing '93, pages 878-883. IEEE Computer Society Press, November 1993.

Sriram Sankaran, Jeffrey M. Squyres, Brian Barrett, and Andrew Lumsdaine. Checkpoint-restart support system services interface (SSI) modules for LAM/MPI. Technical Report TR578, Indiana University, Computer Science Department, 2003.

Marc Snir, Steve W. Otto, Steve Huss-Lederman, David W. Walker, and Jack Dongarra. MPI: The Complete Reference. MIT Press, Cambridge, MA, 1996.

Jeffrey M. Squyres, Brian Barrett, and Andrew Lumsdaine. Boot system services interface (SSI) modules for LAM/MPI. Technical Report TR576, Indiana University, Computer Science Department, 2003.

Jeffrey M. Squyres, Brian Barrett, and Andrew Lumsdaine. MPI collective operations system services interface (SSI) modules for LAM/MPI. Technical Report TR577, Indiana University, Computer Science Department, 2003.

Jeffrey M. Squyres, Brian Barrett, and Andrew Lumsdaine. Request progression interface (RPI) system services
58. M MPT_SESSION_SUFFIX 17 50 69 119 LAM MP I_SOCKET_SUFF IX deprecated 17 LAM MPI_THREAD_LEVEL 75 116 LAMHCC deprecated 57 LAMHCP deprecated 57 LAMHF 77 deprecated 57 LAMHOME 67 LAMMP ICC 57 LAMMP ICXX 57 LAMMP IF 77 57 LAMRANK 107 LAMRSH deprecated 17 LD_LIBRARY_PATH 97 98 LD_PRELOAD 99 PATH 69 TMPDIR 16 50 119 TVDSVRLAUNCHCMD 106 lt SSE SSS bash_1ogin 24 bash profile 24 bashrc 24 Ccshrc 24 Login 24 profile 24 rhosts 111 tcshrc 24 SHOME tvdrc 105 Ssysconf lam hostmap 75 libcr so 99 filesystem notes AFS 20 case insensitive filesystems 20 NES 20 Fortran compilers Absoft 21 fortran process names 115 globus boot SSI module 69 globus 3ob run command 69 GLOBUS_LOCATION environment variable 69 hcc command deprecated 58 hcp command deprecated 58 hf 77 command deprecated 58 hostfile see boot schema T O support see ROMIO IMPI 117 running jobs 118 server 118 supported functionality 117 IMPI HOST NAME environment variable 118 Infiniband release notes 18 Interoperable MPI see IMPI LAM_MP I_PROCESS_NAME environment variable 116 LAM MPI SESSION PREFIX environment vari able 50 119 LAM_MPI_SESSION_SUFFIX environment vari able 17 50 69 119 LAM_MPI_SOCKET_SUFFIX environment variable deprecated 17 LAM MP I_THREAD_LEVEL environment variable 75
59. MPI and will automatically be used on platforms that support arbitrary pinning These memory allocation managers allow LAM MPI to intercept the relevant functions and ensure that memory is unpinned before returning it to the operating system Use of these managers will effectively overload all memory allocation functions e g malloc calloc free etc for all applications that are linked against the LAM MPI libraries potentially regardless of whether they are using the ib RPI module or not See Section 3 3 1 page 18 for more information on LAM s memory allocation managers Memory Checking Debuggers When running LAM s ib RPI through a memory checking debugger see Section 10 4 a number of Read from unallocated RUA and or Read from uninitialized RFU errors may appear pertaining to VAPI These RUA RFU errors are normal they are not actually reads from unallocated sections of memory The Infiniband hardware and kernel device driver handle some aspects of memory allocation and therefore the operating system debugging environment is not always aware of all valid memory As a result a memory checking debugger will often raise warnings even though this is valid behavior Known Issues T ais As of LAM 7 1 3 the following issues remain in the ib RPI module e The ib rpi will not scale well to large numbers of processes See the section entitled Posting En L aiz velopes to Receive Scalability above e On
MPI applications were compiled with debugging support enabled. This typically entails adding -g to the compile and link lines when building your MPI application.

- Unless you specifically need it, it is not recommended to compile LAM with -g. This will allow you to treat MPI function calls as atomic instructions.

- Even when debugging in parallel, it is possible that not all MPI processes will execute exactly the same code. For example, "if" statements that are based upon a communicator's rank of the calling process, or other location-specific information, may cause different execution paths in each MPI process.

10.1 Naming MPI Objects

LAM/MPI supports the MPI-2 functions MPI_<TYPE>_SET_NAME and MPI_<TYPE>_GET_NAME, where <TYPE> can be: COMM, WIN, or TYPE. Hence, you can associate relevant text names with communicators, windows, and datatypes (e.g., "6x13x12 molecule datatype", "Local group reduction intracommunicator", "Spawned worker intercommunicator"). The use of these functions is strongly encouraged while debugging MPI applications. Since they are constant-time, one-time setup functions, using these functions likely does not impact performance, and may be safe to use in production environments, too.

The rationale for using these functions is to allow LAM (and supported debuggers, profilers, and other MPI diagnostic tools) to display accurate information about MPI communicators, windows, and datatypes.
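For example, the short C sketch below names a duplicated communicator so that supported debuggers and diagnostic tools can display something more meaningful than an internal identifier; the communicator and the name string are arbitrary:

    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        MPI_Comm work_comm;

        MPI_Init(&argc, &argv);

        /* Duplicate MPI_COMM_WORLD and give the new communicator a
           human-readable name for debuggers and diagnostic tools. */
        MPI_Comm_dup(MPI_COMM_WORLD, &work_comm);
        MPI_Comm_set_name(work_comm, "Local group reduction intracommunicator");

        /* ... use work_comm ... */

        MPI_Comm_free(&work_comm);
        MPI_Finalize();
        return 0;
    }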
61. MS 19 List of common shells and the corresponding environment setup files for interactive shells 24 List of common shells and the corresponding environment setup files for non interactive shells 24 Supported optional fortran datatypes ooo ee 35 Supported MPI 2 info functions o o ee 37 Supported MPI 2 handle conversion functions 00000 eee 37 Supported MPI 2 error handler functions 0 0 e eee eee 37 Supported MPI 2 new datatype manipulation functions 38 Supported MPI 2 dynamic functions oa ee 38 Supported MPI 2 one sided functions 2 2 ee 39 Supported MPI 2 intercommunicator collective functions 39 Major topics in the MPI 2 chapter External Interfaces and LAM s level of support 39 Supported MPI 2 external interface functions grouped by function 40 SSI module types and their corresponding scopes o e 0000 00s 44 SSI parameters forthe bproc boot module 6 645564 6 cae eee eee ewes 69 SSI parameters for the globus boot module 0 020000 e eee 70 SSI parameters forthe rsh boot module 2 0 4 66 ek ee ER ee Ee 72 SSI parameters forthe slurm boot module o o lt o s 73 SSI parameters for the tm boot module o o e 74 SSI parameters for the crtcp RPI module o ooo o 78 SSI parameters for
62. PI 2 intercommunicator collective functions 5 2 5 External Interfaces The external interfaces chapter lists several different major topics LAM s support for these topics is sum marized in Table 5 9 and the exact list of functions that are supported is listed in 5 10 Supported Description no no yes no no yes yes yes yes yes Generalized requests Associating information with MPI_Status Naming objects Error classes Error codes Error handlers Decoding a datatype MPI and threads New attribute caching functions Duplicating a datatype Table 5 9 Major topics in the MPI 2 chapter External Interfaces and LAM s level of support These two functions were unfortunately overlooked and forgotten about when LAM MPI v7 1 was frozen for release 39 MPI_WIN_START MPLALLREDUCE Supported Functions MPLCOMM_SET_NAME MPILTYPE_SET_NAME MPIWIN_SET_NAME MPICOMM_GET NAME MPI_TYPE_GET_NAME MPI_WIN_GET_NAME MPILCOMM_CREATE_ERRHANDLER MPILWIN CREATE_ERRHANDLER MPILCOMM_GET_ERRHANDLER MPIWIN_GET_ERRHANDLER MPICOMM_SET_ERRHANDLER MPIWIN_SET _ERRHANDLER MPLTYPE_GET_CONTENTS MPI_INIT_THREAD MPITYPE_GET_ ENVELOPE MPILQUERY_THREAD MPLTYPE_GET_EXTENT MPI_IS_THREAD_MAIN MPITYPE_GET_TRUE_EXTENT MPI_TYPE_DUP MPILCOMM_CREATE_KEYVAL MPILTYPE_CREATE_KEYVAL MPILWIN CREATE_KEYVAL MPIL COMM_FREE_KEYVAL MPLTYPE_FREEKEYVAL MPILWIN_FREE KEYVAL MPICOMM_DELETE
63. PI 2 portable MPI process startup command mpiexec see Section 7 12 page 58 for more details e Full documentation for system administrators users and developers 8 10 11 12 13 14 15 e Various MPI enhancements C bindings are provided for all supported MPI functionality Upgraded the included ROMIO package 16 17 to version 1 2 5 1 for MPI I O support Per MPI 2 4 8 free the MPI COMM_SELF communicator at the beginning of MPI_FINALIZE allowing user specified functions to be automatically invoked Formal support for MPITHREAD_SINGLE MPILTHREAD_FUNNELED and MPI THREAD_ SERIALIZED MPILTHREAD_MULTIPLE is not supported see Section 12 4 page 116 for more details Significantly increased the number of tags and communicators supported in most RPIs Enhanced scheduling capabilities for MPI COMM_SPAWN e Various LAM run time environment enhancements New laminfo command that provides detailed information about a given LAM MPI installa tion Use TMPDIR environment variable for LAM s session directory Restore the original umask when creating MPI processes 16 Allow Fortran MPI processes to change how their name shows up in mpitask Better SIGTERM support in the LAM daemon catch the signal and ensure that all sub processes are killed and resources are released e Deprecated functionality may disappear in future releases of LAM MPI LAMRSH The LAMRSH environme
64. PI supports the following underlying networks for MPI communication including several run time tunable parameters for each see Section 9 3 page 76 for more details TCP IP using direct peer to peer sockets Myrinet using the native gm message passing library Infinband using the Mellanox VAPI mVAPI message passing library Shared memory using either spin locks or semaphores 15 Ton Loans Tas L 7 1 LAM Daemon mode using LAM s native run time environment message passing e LAM s run time environment can now be natively executed in the following environments see Section 8 1 page 65 for more details BProc clusters Globus grid environments beta level support Traditional rsh ssh based clusters OpenPBS PBS Pro Torque batch queue jobs Lan SLURM batch queue systems T a T a e Improvements to collective algorithms Several collective algorithms have now been made SMP aware exhibiting better performance when enabled and executed on clusters of SMPs see Section 9 4 page 89 for more details Several collective now use shared memory collective algorithms not based on MPI point to point communication when all processes in a communicator are on the same node Collectives on intercommunicators are now supported Lan e Full support of the TotalView parallel debugger see Section 10 2 page 103 for more details e Support for the M
65. RIER Only some of the collectives have been optimized for SMP environments Table 9 9 shows which collec tive functions have been optimized which were already optimal from the lam_basic module and which will eventually be optimized List of Algorithms Only some of the collectives have been implemented using shared memory Table 9 9 shows which collective functions have been implemented and which uses lam_basic module Tunable Parameters Table 9 10 shows the SSI parameters that may be changed at run time Each of these parameters were discussed in the previous sections Special Notes LAM provides sysv and usysv RPI for the intranode communication In this case the collective com munication also happens through the shared memory but indirectly in terms of Sends and Recvs Shared Memory Collective algorithms avoid all the overhead associated with the indirection and provide a minimum blocking way for the collective operations The shared memory is created by only one process in the communicator and rest of the processes simply attach the shared memory region to their address space The process which finalizes last hands back the shared memory region to the kernel while processes leaving before simply detach the shared memory region Lanp from their address space 94 MPI function Status MPLALLGATHER Implemented using shared memory MPLALLGATHERV Uses lam_basic algorithm MPILALLREDUCE Implemented us
66. S o oe aa aa a oe eee BA eee be eA ee 24 2 IE 3 4 3 Dynamic Embedded Environments 0002 eee Sa BANDS cs bs 2 AA A A a EE 3 4 5 Mac OS X Absoft Fortran Compilers o e e e e 3 4 6 Microsoft Windows TM Cygwin o o o o o ee SAT SOS laca on be GG be OE died A Wek ae oS ie we Gee ok a Getting Started with LAM MPI 4 1 One Time Seuip o eaaa Pb Re eee hee hee A wae lel SEE the POI fe ha ed dw a oe ee oe Bee Dae be Bates 4 1 2 Finding the LAM Manual Pages o o 000000002 ee 42 Sysiem Services Interface SS nee a a ee a RR a 4 3 What Does Your LAM MPI Installation Support o ooo o 4 4 Booting the LAM Run Time Environment e e o 4 4 1 The Boot Schema File a k a Hostfile Machinefile 44 2 The la mboot Command s se s cepa a a de a 443 The lamnodes Command o se se suce e a oso eras a a 45 Compiling MPL Programs osc bs oe a a A RAR 11 13 13 13 15 15 17 17 17 17 17 18 18 19 19 20 21 21 21 21 22 4 6 4 7 45 1 Sample MPI Program IMC actions da bn ba eda eee wh baa A 4 5 2 Sample MPI PrograminC 2 0 0 2 nee 4 5 3 Sample MPI Program in Fortran o e ee Rumin MPL Pera rara A ee A dol Thenpirun Command cc Soe A ech a ca a A S 462 Themprexec Command ce s ee soss a a we wa ee 40 Thempitask Commmand cios ia ia eee ee ee O A Thelamclean Command gt c c
67. Solaris 7 The default amount of shared memory available on Solaris is fairly small It may need to be increased to allow running more than a small number of processes on a single Solaris node using the sysv or USySV RPI modules For example if running the LAM test suite on a single node some tests run several instances of the executable e g 6 which may cause the system to run out of shared memory and therefore cause the La test to fail Increasing the shared memory limits on the system will allow the test to pass See http sunsite uakom sk sunworldonline swol 09 1997 swol 09 insidesolaris html for a good examplantion of Solaris shared memory 22 Chapter 4 Getting Started with LAM MPI This chapter provides a summary tutorial describing some of the high points of using LAM MPI It is not intended as a comprehensive guide the finer details of some situations will not be explained However it is a good step by step guide for users who are new to MPI and or LAM MPI Using LAM MPI is conceptually simple e Launch the LAM run time environment RTE e Repeat as necessary Compile MPI program s Run MPI program s e Shut down the LAM run time environment The tutorial below will describe each of these steps 4 1 One Time Setup This section describes actions that usually only need to be performed once per user in order to setup LAM to function properly 4 1 1 Setting the Path One of the main requirements fo
The ramifications of the cpu key will be discussed later. The location of this text file is irrelevant; for the purposes of this example, we'll assume that it is named hostfile and is located in the current working directory.

4.4.2 The lamboot Command

The lamboot command is used to launch the LAM run-time environment. For each machine listed in the boot schema, the following conditions must be met for LAM's run-time environment to be booted correctly:

- The machine must be reachable and operational.
- The user must be able to non-interactively execute arbitrary commands on the machine (e.g., without being prompted for a password).
- The LAM executables must be locatable on that machine, using the user's shell search path.
- The user must be able to write to the LAM session directory (usually somewhere under /tmp).
- The shell's start-up scripts must not print anything on standard error.
- All machines must be able to resolve the fully-qualified domain name (FQDN) of all the machines being booted (including itself).

Once all of these conditions are met, the lamboot command is used to launch the LAM run-time environment. For example:

    shell$ lamboot -v -ssi boot rsh hostfile

    LAM 7.0/MPI 2 C++/ROMIO - Indiana University

    n0<1234> ssi:boot:base:linear: booting n0 (node1.cluster.example.com)
    n0<1234> ssi:boot:base:linear: booting n1 (node2.cluster.example.com)
    n0<1234> ssi:boot:base:linear: booting n2 (node3.cluster.example.com)
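For reference, a boot schema consistent with the LAM universe used in this tutorial might look like the following sketch (the host names and cpu counts are placeholders):

    # Example boot schema ("hostfile")
    node1.cluster.example.com
    node2.cluster.example.com
    # node3 and node4 each have two CPUs
    node3.cluster.example.com cpu=2
    node4.cluster.example.com cpu=2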
69. _TRUE_EXTENT MPI_TYPE CREATE RESIZED Table 5 5 Supported MPI 2 new datatype manipulation functions New predefined datatypes Support has been added for the MPI LONG_LONG_INT MPI UNSIGNED LONG_LONG and MPI_WCHAR basic datatypes Canonical MPI_PACK MPILUNPACK Support is not provided for MPI_PACK_EXTERNAL MPI_ UNPACK_EXTERNAL or MPILPACK_EXTERNAL_SIZE 5 2 2 Process Creation and Management LAM MPI supports all MPI 2 dynamic process management Table 5 6 lists all the supported functions Supported Functions MPI_ CLOSE_PORT MPI_COMM_GET_PARENT MPILOOKUP_NAME MPICOMM_ACCEPT MPIL COMM JOIN MPILOPEN_PORT MPI_COMM_SPAWN MPI_COMM_CONNECT MPI_PUBLISH_NAME MPICOMM_DISCONNECT MPICOMM_SPAWN_MULTIPLE MPILUNPUBLISH NAME Table 5 6 Supported MPI 2 dynamic functions As requested by LAM users MPICOMM_SPAWN and MPI_COMM_SPAWN_MULTIPLE supports some MPI_Info keys for spawning MPMD applications and for more fine grained control about where chil dren processes are spawned See the MPT_Comm_spawn 3 man page for more details These functions supersede the MPIL_COMM_SPAWN function that LAM MPI introduced in version 6 2b Hence MPIL_COMM_SPAWN is no longer available 5 2 3 One Sided Communication Support is provided for get put accumulate data transfer operations and for the post wait start complete and fence synchronization operations No support is provided for window locking The datatypes used in the
70. a a a aa Shutting Down the LAM Universe 2 eee Supported MPI Functionality 5 1 5 2 PIPL SOPDOEE ag bh a a A A Re ee e e S 1 1 Language Bindings o o c cs eos ceca a a S12 MPILCANCGEL hr ok oe ee ee we bee ew Pee RE BS Oo Se Beer MPL 2 SUPP o c ce ee dara A A Oe a ek ee eS 32A Miscellany o cc 2 24004 bk Bee bk EAA Ae ea ee ed bs 5 2 2 Process Creation and Management 2 02000 eee 5 2 3 One Sided Communication o o se a ss sesa ewe parae ee 5 2 4 Extended Collective Operations o cs ecra cecce cas es Son External IMEE ase a a hed ara Bae we da e oo da a a OO CO ei ae Sk eee eh A ee ee eS ea ea a ci S27 Language BINNES o scoe eassa e a eh ee ee we a System Services Interface SSI Overview 6 1 6 2 6 3 6 4 6 5 Types and Modules 2 5585 POA eR RA he ee ee eee A G o IEEE da A ee Ae er ee Bw ee ee Gl Naming CONNOR A a we a dee 6 3 2 Setting Parameter Values se ee ee e es Dynamic Shared Object DSO Modules o o e e Sekon Modules e o sw e ee we e Ae we ee A a a a 65 1 Specifying Modules ec 6 kee eR he ee a ee ba es OSa SUMS Portes se casca i anna te ew eed a te ae bid we ea Ro 63 3 Selection Algontht so e moe aon eie a e a LAM MPI Command Quick Reference Tal ie Ta 7 4 7 5 7 6 Tat 7 8 The lame Command e aaie A dea 7 1 1 Multiple Sessions on the Same Node o e e 7 1 2 Avoiding Running on Specific Nodes o o
71. a single process This process can either run by itself or spawn or connect to other MPI processes and become part of a larger MPI jobs using the MPI 2 dynamic function calls A LAM RTE must be running on the local node as with jobs started with mpirun 12 2 MPI 2 I O Support MPI 2 I O support is provided through the ROMIO package 16 17 Since support is provided through a third party package its integration with LAM MPI is not complete Specifically everywhere the MPI 2 standard specifies an argument of type MPI_Request ROMIO s provided functions expect an argument of type MPIO_Request Note too that the MPIO_Request types cannot be used with LAM s standard MPI_TEST and MPI WAIT functions ROMIO s MPIO_TEST and MPIO_WAIT functions must be used instead There are no array versions of these functions e g MPIO_TESTANY MPIO_WAITANY etc do not exist C MPI applications wanting to use MPI 2 I O functionality can simply include mpi h Fortran MPI applications however must include both mpif h and mpiof h Finally ROMIO includes its own documentation and listings of known issues and limitations See the README file in the ROMIO directory in the LAM distribution 12 3 Fortran Process Names Since Fortran does not portably provide the executable name of the process similar to the way that C programs get an array of argv the mpitask command lists the name LAM MPI Fortran program by default for MPI programs that use
agement is unnecessary on Solaris. (The details are too lengthy for this document.)

"external" specifically indicates that if the gm or ib RPI modules are used, the application promises to invoke the internal LAM function for unpinning memory as required. Note that this function is irrelevant (but harmless) when any other RPI module is used. The function that must be invoked is prototyped in <mpi.h>:

    void lam_handle_free(void *buf, size_t length);

For applications that must use this functionality, it is probably safest to wrap the call to lam_handle_free() in the following preprocessor conditional:

    #include <mpi.h>

    int my_sbrk(/* ... */)
    {
        /* ... sbrk() functionality ... */

    #if defined(LAM_MPI)
        lam_handle_free(buffer, length);
    #endif

        /* ... rest of sbrk() functionality ... */
    }

Note that when LAM is configured this way, all MPI applications that use the gm or ib RPI modules must invoke this function as required. Failure to do so will result in undefined behavior.

3.4 Platform-Specific Notes

3.4.1 Provided RPMs

If you install LAM/MPI via an official RPM from the LAM/MPI web site (or one of its mirrors), you may not have all the SSI modules that are described in Chapters 8 and 9. The modules that are shipped in 7.1.3 are listed in Table 3.1. If you need modules that are not provided in the RPMs, you will likely need to download and install the source LAM/MPI tarball.
73. ails on this file The SSI parameter mpi_hostmap can be used to specify an alternate hostmap file For example shell mpirun C ssi mpi_hostmap my_hostmap txt my_mpi_application This tells LAM to use the hostmap my_hostmap txt instead of sysconf lam hostmap txt The special filename none can also be used to indicate that no address remapping should be performed L a 9 2 MPI Module Selection Process The modules used in an MPI process may be related or dependent upon external factors For example the gm RPI cannot be used for MPI point to point communication unless there is Myrinet hardware present in the node The blcr checkpoint restart module cannot be used unless thread support was included And so on As such it is important for users to understand the module selection algorithm 1 Set the thread level to be what was requested either via MPIINIT_THREAD or the environment variable LAM_MPI_THREAD LEVEL 75 2 Query relevant modules and make lists of the resulting available modules Relevant means either a specific module or set of modules if the user specified them through SSI parameters or all modules if not specified 3 Eliminate all modules who do not support the current MPI thread level 4 If no rpi modules remain try a lower thread support level until all levels have been tried If no thread support level can provide an rpi module abort 5 Select the highest priority rpi module Reset the
74. and received a confirmation e mail you can send mail to the list at the following address You must be subscribed in order to post to the list lam lam mpi org You must be subscribed in order to post to the list Be sure to include the following information in your e mail e The config log file from the top level LAM directory if available please compress e The output of laminfo al1 e A detailed description of what is failing The more details that you provide the better E mails saying My application doesn t work will inevitably be answered with requests for more information about exactly what doesn t work so please include as much detailed information in your initial e mail as possible NOTE People tend to only reply to the list if you subscribe post and then unsubscribe from the list you will likely miss replies Also please be aware that the list goes to several hundred people around the world it is not uncommon to move a high volume exchange off the list and only post the final resolution of the problem bug fix to the list This prevents exchanges like Did you try X Yes I tried X and it did not work Did you try Y etc from cluttering up peoples inboxes 11 2 LAM Run Time Environment Problems Some common problems with the LAM run time environment are listed below 11 2 1 Problems with the Lamboot Command Many first time LAM users do not have their environment pr
75. be noted that the choice of RPI usually does not affect the boot SSI module hence the lamboot command requirements on hostnames specified in the boot schema is not dependent upon the RPI For example if the gm RPI is selected Lamboot may still require TCP IP hostnames in the boot schema not Myrinet hostnames Also note that selecting a particular module does not guarantee that it will be able to be used For example selecting the gm RPI module will still cause a run time failure if there is no Myrinet hardware present The available modules are described in the sections below Note that much of this information particu T aos larly the tunable SSI parameters is also available in the lamssi_rpi 7 manual page 76 9 3 1 Two Different Shared Memory RPI Modules The sysv Section 9 3 6 page 86 and the usysv Section 9 3 8 page 88 modules differ only in the mech anism used to synchronize the transfer of messages via shared memory The sysv module uses System V semaphores while the Usysv module uses spin locks with back off Both modules use a small number of System V semaphores for synchronizing both the deallocation of shared structures and access to the shared pool The blocking nature of the sysv module should generally provide better performance than usysv on oversubscribed nodes i e when the number of processes is greater than the number of available processors System V semaphores will effectively force processes yield to other pr
bles to the remote nodes, as in the following example:

    shell$ LAM_MPI_FOO="green eggs and ham"
    shell$ export LAM_MPI_FOO
    shell$ mpirun C -x DISPLAY,SEUSS=author samIam

This will launch the samIam application on all available CPUs. The LAM_MPI_FOO, DISPLAY, and SEUSS environment variables will be created in each process's environment before the samIam program is invoked.

Note that the parser for the -x option is currently not very sophisticated; it cannot even handle quoted values when defining new environment variables. Users are advised to set variables in the environment prior to invoking mpirun, and only use -x to export the variables to the remote nodes (not to define new variables), if possible.

7.14.5 Current Working Directory Behavior

Using the -wd option to mpirun allows specifying an arbitrary working directory for the launched processes. It can also be used in application schema files to specify working directories on specific nodes and/or for specific applications.

If the -wd option appears both in an application schema file and on the command line, the schema file directory will override the command line value. -wd is mutually exclusive with -D.

If neither -wd nor -D are specified, the local node will send the present working directory name from the mpirun process to each of the remote nodes. The remote nodes will then try to change to that directory. If they fail (e.g., if the directory does not exist on that node), they will start with the user's home directory.
77. cess Although 1amboot typically prints detailed messages when errors occur users are strongly encouraged to read Section 8 1 for the details of the boot module that they will be using Additionally the d switch should be used to examine exactly what is happening to determine the actual source of the problem many problems with Lamboot come from the operating system or the user s shell setup not from within LAM itself The most common 1 amboot example simply uses a hostfile to launch across an rsh s sh based cluster of nodes the ssi boot rsh is not technically necessary here but it is specified to make this example correct in all environments shell lamboot v ssi boot rsh hostfile LAM 7 0 MPI 2 C ROMIO Indiana University n0 lt 1234 gt ssi boot base linear booting n0 nodel cluster example com n0 lt 1234 gt ssi boot base linear booting n1 node2 cluster example com n0 lt 1234 gt ssi boot base linear booting n2 node3 cluster example com n0 lt 1234 gt ssi boot base linear booting n3 node4 cluster example com n0 lt 1234 gt ssi boot base linear finished 7 1 1 Multiple Sessions on the Same Node In some cases such as in batch regulated environments it is desirable to allow multiple universes owned by the same on the same node The TMPDIR LAM_MPI_SESSION_PREFIX and LAM MPI_SESSION_ SUFFIX environment variables can be used to effect this behavior The main issue is the location of LAM s session d
78. command can be used to determine exactly which modules are supported in your installation see Section 7 7 page 53 8 1 1 Boot Schema Files a k a Hostfiles or Machinefiles Before discussing any of the specific boot SSI modules this section discusses the boot schema file com monly referred to as a hostfile or a machinefile Most but not all boot SSI modules require a boot schema and the text below makes frequent mention of them Hence it is worth discussing them before getting into the details of each boot SSI A boot schema is a text file that in 1ts simplest form simply lists every host that the LAM run time environment will be invoked on For example 65 Tan a1 as This is my boot schema inky cluster example com pinky cluster example com blinkly cluster example com clyde cluster example com Lines beginning with are treated as comments and are ignored Each non blank non comment line must at a minimum list a host Specifically the first token on each line must specify a host although the definition of how that host is specified may vary differ between boot modules However each line can also specify arbitrary key value pairs A common global key is cpu This key takes an integer value and indicates to LAM how many CPUs are available for LAM to use If the key is not present the value of 1 is assumed This number does not need to reflect the physica
79. context Hence it is correct to say the boot_rsh_agent parameter as well as the agent parameter to the rsh boot module Note that the reserved string base may appear as a module name referring to the fact that the parameter applies to all modules of a give type 6 3 2 Setting Parameter Values SSI parameters each have a unique name and can take a single string value The parameter value pairs can be passed by multiple different mechanisms Depending on the target module and the specific parameter mechanisms may include e Using command line flags when LAM was configured e Setting environment variables before invoking LAM commands e Using the ssi command line switch to various LAM commands e Setting attributes on MPI communicators Users are most likely to utilize the latter three methods Each is described in detail below Listings and explanations of available SSI parameters are provided in Chapters 8 and 9 pages 65 and 75 respectively categorized by SSI type and module Environment Variables SSI parameters can be passed via environment variables prefixed with LAM_MP1_SSI For example se lecting which RPI module to use in an MPI job can be accomplished by setting the environment variable LAM MPI_SSI_rpi to a valid RPI module name e g tcp Note that environment variables must be set before invoking the corresponding LAM MPI commands that will use them ssi Command Line Switch LAM MPI commands that inte
d wrapper compilers because all they do is add relevant compiler and linker flags to the command line before invoking the real back-end compiler to actually perform the compile/link. Most command line arguments are passed straight through to the back-end compiler without modification. Therefore, to compile an MPI application, use the wrapper compilers exactly as you would use the real compiler. For example:

    shell$ mpicc -O -c main.c
    shell$ mpicc -O -c foo.c
    shell$ mpicc -O -c bar.c
    shell$ mpicc -O -o main main.o foo.o bar.o

This compiles three C source code files and links them together into a single executable. No additional -I, -L, or -l arguments are required.

The main exceptions (i.e., flags that are not simply passed through to the back-end compiler) are:

- -showme: Used to show what the wrapper compiler would have executed. This is useful to see the full compile/link line that would have been executed. For example (your output may differ from what is shown below, depending on your installed LAM/MPI configuration):

    shell$ mpicc -O -c main.c -showme
    gcc -I/usr/local/lam/include -pthread -O -c main.c

(The output line shown below is word-wrapped in order to fit nicely in the document margins.)

    shell$ mpicc -O -o main main.o foo.o bar.o -showme
    gcc -I/usr/local/lam/include -pthread -O -o main main.o foo.o bar.o
        -L/usr/local/lam/lib -llammpio -lpmpi -llamf77mpi -lmpi -llam
81. d successfully the saved context files are renamed to their respective target filenames otherwise the checkpoint files are discarded e Checkpoints can only be performed after all processes have invoked MPI_INIT and before any process has invoked MPI_FINALIZE 9 5 1 Selecting a cr Module The cr framework coordinates with all other SSI modules to ensure that the entire MPI application is ready to be checkpointed before the back end system is invoked Specifically for a parallel job to be able to checkpoint and restart all the SSI modules that it uses must support checkpoint restart capabilities All coll modules in the LAM MPI distribution currently support checkpoint restart capability because they are layered on MPI point to point functionality as long as the RPI module being used supports check point restart so do the coll modules However only one RPI module currently supports checkpoint restart crtcp Attempting to checkpoint an MPI job when using any other rpi module will result in undefined behavior 9 5 2 cr SSI Parameters The cr SSI parameter can be used to specify which cr module should be used for an MPI job An error will occur if a Cr module is requested and an rpi or Coll module cannot be found that supports checkpoint restart functionality 96 Additionally the cr_blcr_base_dir SSI parameter can be used to specify the directory where check point file s will be saved If it is not set and no default value was prov
82. d the Fortran binding for MPI_INIT or MPI_LINIT_THREAD 115 The environment variable LAM MP T_PROCESS_NAME can be used to override this behavior Setting this environment variable before invoking mpirun will cause mpitask to list that name instead of the default title This environment variable only works for processes that invoke the Fortran binding for MPI_INIT or MPI_INIT_ THREAD 12 4 MPI Thread Support LAM currently implements support for MPI_THREAD_SINGLE MPILTHREAD_FUNNELED and MPI_ THREAD SERIALIZED The constant MPI_THREAD MULTIPLE is provided although LAM will never return MPILTHREAD_MULTIPLE in the provided argument to MPI_INIT_THREAD LAM makes no distinction between MPI THREAD_SINGLE and MPI_THREAD_ FUNNELED When MPI_THREAD_SERIALIZED is used a global lock is used to ensure that only one thread is inside any MPI function at any time 12 4 1 Thread Level Selecting the thread level for an MPI job is best described in terms of the two parameters passed to MPI_ INIT_THREAD requested and provided requested is the thread level that the user application requests while provided is the thread level that LAM will run the application with e If MPLINIT is used to initialize the job requested will implicitly be MPITHREAD_SINGLE However if the LAM_MPI_THREAD_LEVEL environment variable is set to one of the values in Ta ble 12 1 the corresponding thread level will be used for requested If MPI_INIT_THREAD is used to init
debugger (e.g., TotalView), mpitask can be run from any node in the LAM universe.

4.6.4 The lamclean Command

The lamclean command completely removes all running programs from the LAM universe. This can be useful if a parallel job crashes and/or leaves state in the LAM run-time environment (e.g., MPI-2 published names). It is usually run with no parameters:

    shell$ lamclean

lamclean is typically only necessary when developing / debugging MPI applications, i.e., programs that hang, messages that are left around, etc. Correct MPI programs should terminate properly, clean up all their messages, unpublish MPI-2 names, etc.

4.7 Shutting Down the LAM Universe

When finished with the LAM universe, it should be shut down with the lamhalt command:

    shell$ lamhalt

In most cases, this is sufficient to kill all running MPI processes and shut down the LAM universe.

However, in some rare conditions, lamhalt may fail. For example, if any of the nodes in the LAM universe crashed before running lamhalt, lamhalt will likely timeout and potentially not kill the entire LAM universe. In this case, you will need to use the lamwipe command to guarantee that the LAM universe has shut down properly:

    shell$ lamwipe -v hostfile

where hostfile is the same boot schema that was used to boot LAM (i.e., all the same nodes are listed). lamwipe will forcibly kill all LAM/MPI processes and terminate the LAM universe. This is a slower
    Boot      Collective    Checkpoint/Restart    RPI
    globus    lam_basic     self                  crtcp
    rsh       smp                                 lamd
    slurm     shmem                               sysv
                                                  tcp
                                                  usysv

    Table 3.1: SSI modules that are included in the official LAM/MPI RPMs

This is for multiple reasons:

- If provided as a binary, each SSI module may require a specific configuration (e.g., a specific version of the back-end software that it links to / interacts with). Since each SSI module is orthogonal to other modules, and since the back-end software systems that each SSI module interacts with may release new versions at any time, the number of combinations that would need to be provided is exponential. The logistics of attempting to provide pre-compiled binaries for all of these configurations is beyond the capability of the LAM Team. As a direct result, significant effort has gone into making building LAM/MPI from the source distribution as simple and all-inclusive as possible.

- Although LAM/MPI is free software (and freely distributable), some of the systems that its modules can interact with are not. The LAM Team cannot distribute modules that contain references to non-freely-distributable code.

The laminfo command can be used to see which SSI modules are available in your LAM/MPI installation.

3.4.2 Filesystem Issues

Case-insensitive filesystems: On systems with case-insensitive filesystems (such as Mac OS X with HFS+, Linux with NTFS, or Microsoft Windows/Cygwin), the mpicc and mpiCC commands will both
85. e a rendezvous protocol the envelope is sent to the destination the receiver responds with an ACK when it is ready and then the sender sends another envelope followed by the data of the message The message lengths at which the different protocols are used can be changed with the SSI parameter rpi_gm_tinymsglen which represent the maximum length of tiny messages LAM defaults to 1 024 bytes for the maximum lengths of tiny messages It may be desirable to adjust these values for different kinds of applications and message passing pat terns The LAM Team would appreciate feedback on the performance of different values for real world applications Pinning Memory The Myrinet native communication library gm can only communicate through registered sometimes called pinned memory In most operating systems LAM MPI handles this automatically by pinning user provided buffers when required This allows for good message passing performance especially when re using buffers to send receive multiple messages However the gm library does not have the ability to pin arbitrary memory on Solaris systems auxiliary buffers must be used Although LAM MPI controls all pinned memory this has a detrimental effect on performance of large messages LAM MPI must copy all messages from the application provided buffer to an auxiliary buffer before it can be sent and vice versa for receiving messages As such users are strongly encouraged to us
the MPI_ALLOC_MEM and MPI_FREE_MEM functions instead of malloc() and free(). Using these functions will allocate "pinned" memory such that LAM/MPI will not have to use auxiliary buffers and an extra memory copy.

The rpi_gm_nopin SSI parameter can be used to force Solaris-like behavior. On Solaris platforms, the default value is 1, specifying to use auxiliary buffers as described above. On non-Solaris platforms, the default value is 0, meaning that LAM/MPI will attempt to pin and send/receive directly from user buffers.

Note that since LAM/MPI manages all pinned memory, LAM/MPI must be aware of memory that is freed so that it can be properly unpinned before it is returned to the operating system. Hence, LAM/MPI must intercept calls to functions such as sbrk() and munmap() to effect this behavior. (Since gm cannot pin arbitrary memory on Solaris, LAM/MPI does not need to intercept these calls on Solaris machines.)

To this end, support for additional memory allocation packages is included in LAM/MPI and will automatically be used on platforms that support arbitrary pinning. These memory allocation managers allow LAM/MPI to intercept the relevant functions and ensure that memory is unpinned before returning it to the operating system. Use of these managers will effectively overload all memory allocation functions (e.g., malloc(), calloc(), free(), etc.) for all applications that are linked against the LAM/MPI libraries, potentially regardle
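The following is a minimal C sketch of the allocation pattern recommended above; the buffer size and its use as a message buffer are illustrative only.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    double *buf;

    MPI_Init(&argc, &argv);

    /* Ask MPI for memory that the implementation can register ("pin")
       with the interconnect, instead of calling malloc() directly. */
    MPI_Alloc_mem(1024 * sizeof(double), MPI_INFO_NULL, &buf);

    /* ... use buf as a send / receive buffer ... */

    /* Release the memory through MPI so that it can be unpinned properly. */
    MPI_Free_mem(buf);

    MPI_Finalize();
    return 0;
}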
ection process involves a flexible negotiation phase, which can be both tweaked and arbitrarily overridden by the user and system administrator.

6.5.1 Specifying Modules

Each SSI type has an implicit SSI parameter corresponding to the type name, indicating which module(s) are to be considered for selection. For example, to specify that the tcp RPI module should be used, the SSI parameter rpi should be set to the value tcp. For example:

shell$ mpirun C -ssi rpi tcp my_mpi_program

The same is true for the other SSI types (boot, cr, and coll), with the exception that the coll type can be used to specify a comma-separated list of modules to be considered as each MPI communicator is created (including MPI_COMM_WORLD). For example:

shell$ mpirun C -ssi coll smp,shmem,lam_basic my_mpi_program

indicates that the smp, shmem, and lam_basic modules will potentially all be considered for selection for each MPI communicator.

6.5.2 Setting Priorities

Although typically not useful to individual users, system administrators may use priorities to set system-wide defaults that influence the module selection process in LAM/MPI jobs.

Each module has an associated priority, which plays a role in whether a module is selected or not. Specifically, if one or more modules of a given type are available for selection, the modules' priorities will be at least one of the factors used to determine which module will finally be selected. Priorities are
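As a concrete illustration of the mechanism, a module's priority can be overridden on the mpirun command line just like any other SSI parameter; the rpi_tcp_priority parameter is the one listed later in Table 9.6, and the value 50 is arbitrary, so treat this as a sketch rather than a recommended setting:

shell$ mpirun C -ssi rpi tcp -ssi rpi_tcp_priority 50 my_mpi_program

(SSI parameters can equivalently be passed through environment variables, as noted elsewhere in this guide.)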
88. ed in the lt lam install path gt where lt lam install path gt is the top level directory where LAM MPI is installed This is typically used when a user has multiple LAM MPI installations and want to switch between them without changing the dot files or PATH environment variable This option is not compatible with LAM MPI versions prior to 7 1 Lan e ssi lt key gt lt value gt Pass the SSI lt key gt and lt value gt arguments to the back end mpi run command Local arguments are specific to an individual MPI process that will be launched They are specified along with the executable that will be launched Common local arguments include e n lt numprocs gt Launch lt numprocs gt number of copies of this executable e arch lt architecture gt Launch the executable on nodes in the LAM universe that match this architecture An architecture is determined to be a match if the lt architecture gt matches any subset of the GNU Autoconf architecture string on each of the target nodes the Lamin fo command shows the GNU Autoconf configure string e lt other arguments gt When mpiexec first encounters an argument that it doesn t recognize the remainder of the arguments will be passed back to mpirun to actually start the process The following example launches four copies of the my_mpi_ program executable in the LAM universe using default scheduling patterns shell mpiexec n 4 my_mpi_program i 7 12 2 Launch
ently imposed when debugging LAM/MPI jobs in TotalView:

1. Cannot attach to scripts. You cannot attach TotalView to MPI processes if they were launched by scripts instead of mpirun. Specifically, the following won't work:

shell$ mpirun -tv C script_to_launch foo

But this will:

shell$ mpirun -tv C foo

For that reason, since mpiexec is a script, you cannot launch mpiexec itself under TotalView (although the -tv switch works with mpiexec, because mpiexec will eventually invoke mpirun).

2. TotalView needs to launch the TotalView server on all remote nodes in order to attach to remote processes. The command that TotalView uses to launch remote executables might be different than what LAM/MPI uses. You may have to set this command explicitly, and independently of LAM/MPI. For example, if your local environment has rsh disabled and only allows ssh, then you likely need to set the TotalView remote server launch command to ssh. You can set this internally in TotalView, or with the TVDSVRLAUNCHCMD environment variable (see the TotalView documentation for more information on this, and the example after this list).

3. The TotalView license must be able to be found on all nodes where you expect to attach the debugger. Consult with your system administrator to ensure that this is set up properly. You may need to edit your dot files (e.g., .profile, .bashrc, .cshrc, etc.) to ensure that relevant environment variable settings exist on all nodes when you lamboot.
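For example, assuming a site where only ssh is allowed, the remote server launch command could be set through the environment before starting the debugger; the variable name is the one mentioned in item 2 above, and the use of ssh is an assumption about your environment:

# C shell and derivatives
shell$ setenv TVDSVRLAUNCHCMD ssh

# Bourne-style shells
shell$ TVDSVRLAUNCHCMD=ssh
shell$ export TVDSVRLAUNCHCMD

# Then launch as usual
shell$ mpirun -tv C foo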
90. eprecated 58 hcp deprecated 58 hf 77 deprecated 58 lam lam lam lam lam lam boot 27 49 55 65 71 74 97 112 119 checkpoint 51 clean 34 52 117 exec 52 grow 52 halt 34 53 lam info 16 20 26 35 41 53 65 76 89 112 lamnodes 28 55 lamrestart 55 lamshrink 56 lamwipe 34 64 npic 20 29 40 56 picc 20 29 30 40 56 picc 20 29 30 40 56 npiexec 16 32 36 58 104 pif77 29 31 40 56 pimsg 60 ey 5 a 5 p a 3 3 3 3 3 120 mpitask 17 33 63 115 pbs_demux 74 recon 63 rsh 65 srun 72 ssh 65 tping 64 wipe deprecated 64 compiling MPI programs 28 configure flags with cr file dir 97 with debug 55 with memory manager 21 with purify 55 109 with rpi gm get 78 with rsh 20 cr SSI parameter 96 cr_blcr_base_dir SSI parameter 97 98 cr_blcr_context_file SSI parameter 56 cr_checkpoint command 98 cr_restart command 98 cr_restart_args SSI parameter 56 debuggers 103 109 attaching 108 pirun 31 60 66 72 98 104 107 116 124 launching 107 memory checking 108 serial 107 TotalView 104 DISPLAY environment variable 107 dynamic environments 21 dynamic name publishing see published names e mail lists 111 environment variables files DISPLAY 107 GLOBUS_LOCATION 69 IMPI_HOST_NAME 118 LAM MPI_PROCESS_NAMB 116 LAM MPI_SESSION_PREFIX 50 119 LA
91. ervices interface SSI modules for LAM MPI Technical Report TR579 Indiana University Computer Science Department 2003 Jeffrey M Squyres Brian Barrett and Andrew Lumsdaine The system services interface SSI to LAM MPI Technical Report TR575 Indiana University Computer Science Department 2003 The LAM MPI Team LAM MPI Installation Guide Open Systems Laborator Pervasive Technology Labs Indiana University Bloomington IN 7 0 edition May 2003 The LAM MPI Team LAM MPI User s Guide Open Systems Laborator Pervasive Technology Labs Indiana University Bloomington IN 7 0 edition May 2003 Rajeev Thakur William Gropp and Ewing Lusk Data sieving and collective I O in ROMIO In Proceedings of the 7th Symposium on the Frontiers of Massively Parallel Computation pages 182 189 IEEE Computer Society Press February 1999 Rajeev Thakur William Gropp and Ewing Lusk On implementing MPI IO portably and with high performance In Proceedings of the 6th Workshop on I O in Parallel and Distributed Systems pages 23 32 ACM Press May 1999 122 Index bash_login file 24 bash profile file 24 bashrc file 24 cshrc file 24 Login file 24 profile file 24 rhosts file 111 tcshrc file 24 SHOME tvdra file 105 Ssysconf lam hostmap file 75 Absoft Fortran compilers 21 AFS filesystem 20 base module path SSI parameter 46 batch queue systems 119 OpenPBS PBS Pro Torque TM boot SSI
92. es 86 9 3 7 The tcp Module TCP Communication e e e 87 9 3 8 The usysv Module Shared Memory Using Spin Locks 88 94 MPI Collective Communication 2 co ea ee ek crega a 89 2941 Selecting acoll Module io a csa es bs ke bee eR RRR EES Ge ew eS 89 10 11 12 942 oll SS Praia a hk we ha RA aE wl oe Eh RG AE a 9 4 3 Thelam basic Module lt lt lt 24 ee eee eR Ree Ee 944 The siip Module 2 eeu Be ek bee obs ee ee Ge a aed 645 The shmem Module 66 c 4 064 04 25 Be ea HSH a ERE ee 9 5 Checkpomt Restatof MPI Jobs 22 sse cc ee ee a udo O51 Selechmea Ch Module ica hw ee Bee a a a 932 ol POrameter 0 0220 a ia eh OR a Re a dk O53 The bler Module oce coe ee ee ee ee wee ew Gh we ee a a Sa These Mod le so cha ho aod we a RR ad ee Debugging Parallel Programs 101 Namme DAPT ODOC lt o o aoia aca a a a Re ee a A 10 2 TotalView Parallel Debugger naaa ee 10 2 1 Attaching TotalView to MPI Processes 0 00000 eee 10 22 Suggested USE o a o e c we ke we A a Di 102 3 Limitation ooo e aoi aa RAR a a e ed bs 10 2 4 Message Queue Debugging aa a 10 3 Sepal De uggers ooe aceae o pa ee ay Se ee ee a 10 3 1 Lau hing Debuggers oo ooe ea caaan ee ee e es 10 3 2 Attaching Debuppers oe ba ga eed vara Dawe date a a 10 4 Memory Checking Debuggers ee ee ee Troubleshooting 11 1 The LAMIMPI Mailing Lists cesc ese e ap a ee a ee ILLI AoC GIRS
es:

shell$ setenv LD_LIBRARY_PATH <location of blcr lib>:$LD_LIBRARY_PATH
shell$ mpirun C -ssi rpi crtcp -ssi cr blcr -x LD_LIBRARY_PATH my_mpi_program

Checkpointing and Restarting

Once a checkpoint-capable job is running, the BLCR command cr_checkpoint can be used to invoke a checkpoint. Running cr_checkpoint with the PID of mpirun will cause a context file to be created for mpirun, as well as a context file for each running MPI process. Before it is checkpointed, mpirun will also create an application schema file to assist in restoring the MPI job. These files will all be created in the directory specified by LAM/MPI's configured default, the cr_blcr_base_dir SSI parameter, or the user's home directory if no default is specified.

The BLCR cr_restart command can then be invoked with the PID and context file generated from mpirun, which will restore the entire MPI job.

Tunable Parameters

There are no tunable parameters to the blcr cr module.

Known Issues

- BLCR has its own limitations (e.g., BLCR does not yet support saving and restoring file descriptors); see the documentation included in BLCR for further information. Check the project's main web site to find out more about BLCR.

- Since a checkpoint request is initiated by invoking cr_checkpoint with the PID of mpirun, it is not possible to checkpoint MPI jobs that were started using the -nw option to mpirun, or directly from the command line without using mpirun.

- While the tw
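A rough sketch of the checkpoint and restart sequence described above is shown below. The PID (12345) is illustrative, and the exact name of the context file that BLCR writes for mpirun may differ on your installation (look in the checkpoint directory described above), so treat the file name as an assumption:

# Find the PID of the running mpirun (12345 below is only an example)
shell$ ps x | grep mpirun

# Ask BLCR to checkpoint mpirun; LAM propagates the checkpoint to every MPI process
shell$ cr_checkpoint 12345

# Later, restore the entire MPI job from mpirun's context file
shell$ cr_restart context.12345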
94. es Interface SSI Overview The System Services Interface SSI makes up the core of LAM MPI It influences how many commands and MPI processes are executed This chapter provides an overview of what SSI is and what users need to know about how to use it to maximize performance of MPI applications 6 1 Types and Modules SSI provides a component framework for the LAM run time environment RTE and the MPI communica tions layer Components are selected from each type at run time and used to effect the LAM RTE and MPI library There are currently four types of components used by LAM MPI e boot Starting the LAM run time environment used mainly with the Lamboot command e coll MPI collective communications only used within MPI processes e cr Checkpoint restart functionality used both within LAM commands and MPI processes e rpi MPI point to point communications only used within MPI processes The LAM MPI distribution includes instances of each component type referred to as modules Each module is an implementation of the component type which can be selected and used at run time to provide services to the LAM RTE and MPI communications layer Chapters 8 and 9 list the modules that are available in the LAM MPI distribution 6 2 Terminology Available The term available is used to describe a module that reports at run time that it is able to run in the current environment For example an RPI module may check to see if supporting n
es TCP sockets for MPI point-to-point communication.

Tunable Parameters

Two different protocols are used to pass messages between processes: short and long. Short messages are sent eagerly and will not block unless the operating system blocks. Long messages use a rendezvous protocol; the body of the message is not sent until a matching MPI receive is posted. The crossover point between the short and long protocol defaults to 64KB, but can be changed with the rpi_tcp_short SSI parameter, an integer specifying the maximum size (in bytes) of a short message. 87

Additionally, the amount of socket buffering requested of the kernel defaults to the size of short messages. It can be altered with the rpi_tcp_sockbuf parameter. When this value is -1, the value of the rpi_tcp_short parameter is used. Otherwise, its value is passed to the setsockopt(2) system call to set the amount of operating system buffering on every socket that is used for MPI communication.

SSI parameter name   Default value   Description
rpi_tcp_priority     20              Default priority level
rpi_tcp_short        65535           Maximum length (in bytes) of a "short" message
rpi_tcp_sockbuf      -1              Socket buffering in the OS kernel (-1 means use the short message size)

Table 9.6: SSI parameters for the tcp RPI module.

9.3.8 The usysv Module (Shared Memory Using Spin Locks)

Module Summary
Name: usysv
Kind: rpi
Default SSI priority: 40
Checkpoin
96. ess space The shared memory region consists two disjoint sections First section of the shared memory is used for synchronization among the processes while the second section is used for message passing Copying data into and from shared memory The second section is known as MESSAGE _POOL and is divided into N equal segments Default value of N is 8 and is configurable with the coll_base_shmem_num_segments SSI parameter The size of As a direct result smp will never be selected for MPI CCOMM_SELF 92 MPI function Status MPI ALLGATHER Optimized for SMP environments MPILALLGATHERV Optimized for SMP environments MPIALLREDUCE Optimized for SMP environments MPI_ALLTOALL Identical to lam_basic algorithm already optimized for SMP environments MPI_ALLTOALLV Identical to lam_basic algorithm already optimized for SMP environments MPI_ALLTOALLW Ibid MPI_BARRIER Optimized for SMP environments MPI_BCAST Optimized for SMP environments MPLEXSCAN Ibid MPI_GATHER Identical to lam_basic algorithm already optimized for SMP environments MPI_GATHERV Identical to lam_basic algorithm already optimized for SMP environments MPI REDUCE Optimized for SMP environments MPI REDUCE SCATTER Optimized for SMP environments MPI SCAN Optimized for SMP environments MPI SCATTER Identical to lam_basic algorithm already optimized for SMP environ
essages.

- "One of the processes started by mpirun has exited with a nonzero exit code." This means that at least one MPI process has exited after invoking MPI_INIT, but before invoking MPI_FINALIZE. This is therefore an error, and LAM will abort the entire MPI application. The last line of the error message indicates the PID, node, and exit status of the failed process. 113

- "MPI_<function>: process in local group is dead (rank <N>, MPI_COMM_WORLD)". This means that some MPI function tried to communicate with a peer MPI process and discovered that the peer process is dead. Common causes of this problem include attempting to communicate with processes that have failed (which, in some cases, won't generate the "One of the processes started by mpirun has exited..." messages), or that have already invoked MPI_FINALIZE. Communication should not be initiated that could involve processes that have already invoked MPI_FINALIZE. This may include using MPI_ANY_SOURCE or collectives on communicators that include processes that have already finalized. 114

Chapter 12

Miscellaneous

This chapter covers a variety of topics that don't conveniently fit into other chapters.

12.1 Singleton MPI Processes

It is possible to run an MPI process without the mpirun or mpiexec commands; simply run the program as one would normally launch a serial program:

shell$ my_mpi_program

Doing so will create an MPI_COMM_WORLD with
98. etwork hardware is present before reporting that it is available or not Chapters 8 and 9 list the modules that are included in the LAM MPI distribution and detail the requirements for each of them to indicate whether they are available or not Selected The term selected means that a module has been chosen to be used at run time Depending on the module type zero or more modules may be selected 43 Scope Each module selection has a scope depending on the type of the module Scope refers to the duration of the module s selection Table 6 1 lists the scopes for each module type Type Scope description boot A module is selected at the beginning of lamboot or recon and is used for the duration of the LAM universe coll A module is selected every time an MPI communicator is created including MPILCOMM_WORLD and MPILCOMM_SELF It re mains in use until that communicator has been freed cr Checkpoint restart modules are selected at the beginning of an MPI job and remain in use until the job completes rpi RPI modules are selected during MPI_INIT and remain in use until MPI_FINALIZE returns Table 6 1 SSI module types and their corresponding scopes 6 3 SSI Parameters One of the founding principles of SSI is to allow the passing of run time parameters through the SSI frame work This allows both the selection of which modules will be used at run time by passing parameters to the SSI
99. etworking domains it may necessary to override the hostname that IMPI uses for connectivity 1 e use something other that what is returned by the hostname command In this case the IMP 1_HOST_NAME can be used If set this variable is expected to contain a resolvable name or IP address that should be used http www osl iu edu research impi 118 12 7 Batch Queuing System Support LAM is now aware of some batch queuing systems Support is currently included for PBS LSF and Clubmask based systems There is also a generic functionality that allows users of other batch queue sys tems to take advantages of this functionality e When running under a supported batch queue system LAM will take precautions to isolate itself from other instances of LAM in concurrent batch jobs That is the multiple LAM instances from the same user can exist on the same machine when executing in batch This allows a user to submit as many LAM jobs as necessary and even if they end up running on the same nodes a lamclean in one job will not kill MPI applications in another job e This behavior is only exhibited under a batch environment Other batch systems can easily be sup ported let the LAM Team know if you d like to see support for others included Manually setting the environment variable LAM MP T_SESSION_SUFFIX on the node where lamboot is run achieves the same ends 12 8 Location of LAM s Session Directory By default LAM will create
100. f module the Continue function is invoked after the Checkpoint function completes to be symmetrical with other modules It is common to either not provide a Continue function or supply a function that does nothing Once these functions return process control is returned to the application Note that no MPI functions are allowed to be invoked in the Checkpoint or Continue functions Although the Lamrestart command can be used to restart self checkpointed applications its invoca tion is quite bulky and inconvenient it is frequently simpler to use mpirun itself Remember with self checkpointed application there is no possibility of actually restarting the application because no MPI library state was saved The application must be completely restarted i e start over from the top of main The self module does provide some assistance however if the cr_self_do_restart SSI parameter is set Specifically self will invoke the Restart function during MPIINIT if cr_self_do_restart is set to 1 For example shell mpirun C ssi rpi crtcp ssi cr self ssi cr_self_do_restart 1 my_mpi_program The typical model for a Restart function is to load previously saved data and to set some global variables indicating that a restart is in progress When MPI_INIT returns the application can see the global variables and continue performing whatever actions are necessary to effect a restart e g jump to a different point in the application
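To make the model above concrete, here is a hedged C sketch of what the three user-supplied functions for the self module might look like, using the foo prefix convention described later in this chapter. The int(void) signatures and the return values are assumptions made for the sake of the sketch; consult the LAM/MPI documentation for the exact required prototypes.

#include <stdio.h>

static int restarted = 0;   /* examined by the application after MPI_INIT */

/* Invoked when a checkpoint is requested: save application state. */
int foo_checkpoint(void)
{
    /* write application data to stable storage here */
    return 0;
}

/* Invoked after the checkpoint completes; frequently a no-op. */
int foo_continue(void)
{
    return 0;
}

/* Invoked during MPI_INIT when cr_self_do_restart is set: reload state. */
int foo_restart(void)
{
    /* read previously saved data and set a flag for main() to act on */
    restarted = 1;
    return 0;
}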
101. h_ignore_stderr is nonzero any output on standard error will not be treated as an error 70 Section 4 page 23 provides a short tutorial on using the rsh ssh boot module including tips on setting up dot files setting up password less remote execution etc Usage Using rsh ssh or other remote execution agent is probably the most common method for starting the LAM run time execution environment The boot schema typically lists the hostnames CPU counts and an optional username if the user s name is different on the remote machine The boot schema can also list an optional prefix which specifies the LAM MPI installatation to be used on the particular host listed in the boot schema This is typically used if the user has mutliple LAM MPI installations on a host and want to switch between them without changing the dot files or PATH environment variables or if the user has LAM MPI installed under different paths on different hosts If the prefix is not specified for a host in the boot schema file then the LAM MPI installation which is available in the PATH will be used on that host or if the prefix lt lam install path gt option is specified for lamboot the lt lam install path gt installation will be used The prefix option in the boot schema file however overrides any prefix option specified on the 1 amboot command line for that host For example rsh boot schema inky cluster example com cpu 2 pinky c
102. hardware has more than 8 possible ports you can change the upper port number that LAM will check with the rpi_gm_ maxport SSI parameter However if you wish LAM to use a specific GM port number and not check all the ports from 1 maxport you can tell LAM which port to use with the rpi_gm_port SSI parameter Specifying which port to use has precedence over the port range check if a specific port is indicated LAM will try to use that and not check a range of ports Specifying to use port 1 or not specifying to use a specific port will tell LAM to check the range of ports to find any available port Note that in all cases if LAM cannot acquire a valid port for every MPI process in the job the entire job will be aborted Be wary of forcing a specific port to be used particularly in conjunction with the MPI dynamic process calls e g MPICOMM_SPAWN For example attempting to spawn a child process on a node that already has an MPI process in the same job LAM will try to use the same specific port which will result in failure because the MPI process already on that node will have already claimed that port 79 Adjusting Message Lengths The gm RPI uses two different protocols for passing data between MPI processes tiny and long Selection of which protocol to use is based solely on the length of the message Tiny messages are sent along with tag and communicator information in one transfer to the receiver Long messages us
he MPI-2 functions MPI_PUBLISH_NAME and MPI_UNPUBLISH_NAME for publishing and unpublishing names, respectively. Published names are stored within the LAM daemons, and are therefore persistent, even when the MPI process that published them dies.

As such, it is important for correct MPI programs to unpublish their names before they terminate. However, if stale names are left in the LAM universe when an MPI process terminates, the lamclean command can be used to clean all names from the LAM RTE. (A short usage sketch follows at the end of this section.)

12.6 Interoperable MPI (IMPI) Support

The IMPI extensions are still considered experimental, and are disabled by default in LAM. They must be enabled when LAM is configured and built (see the Installation Guide for details).

12.6.1 Purpose of IMPI

The Interoperable Message Passing Interface (IMPI) is a standardized protocol that enables different MPI implementations to communicate with each other. This allows users to run jobs that utilize different hardware, but still use the vendor-tuned MPI implementation on each machine. This would be helpful in situations where the job is too large to fit in one system, or when different portions of code are better suited for different MPI implementations.

IMPI defines only the protocols necessary between MPI implementations; vendors may still use their own high-performance protocols within their own implementations.

Terms that are used throughout the LAM / IMPI documentation include: IMPI clients, IMPI hosts,
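The sketch below shows the basic publish / lookup / unpublish pattern in C. The service name "my_service" and the use of MPI_OPEN_PORT to obtain something worth publishing are illustrative only; they are not taken from this guide.

#include <mpi.h>

int main(int argc, char *argv[])
{
    char port_name[MPI_MAX_PORT_NAME];

    MPI_Init(&argc, &argv);

    /* Obtain a connectable port name and publish it under a well-known name. */
    MPI_Open_port(MPI_INFO_NULL, port_name);
    MPI_Publish_name("my_service", MPI_INFO_NULL, port_name);

    /* ... a client could now call MPI_Lookup_name("my_service", ...) and
       MPI_Comm_connect() to establish communication ... */

    /* Unpublish before exiting so that no stale name is left in the LAM RTE. */
    MPI_Unpublish_name("my_service", MPI_INFO_NULL, port_name);
    MPI_Close_port(port_name);

    MPI_Finalize();
    return 0;
}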
104. he following line set path usr local lam bin path 4 1 2 Finding the LAM Manual Pages LAM includes manual pages for all supported MPI functions as well as all of the LAM executables While this step is not necessary for correct MPI functionality it can be helpful when looking for MPI or LAM specific information Using Tables 4 1 and 4 2 find the right dot file to edit Assuming again that LAM was installed to usr local lam open the appropriate dot file in a text editor and follow the general directions listed below e For the Bash Bourne and Bourne related shells add the following lines MANPATH usr local lam man MANPATH export MANPATH e For the C shell and related shells such as tcsh add the following lines if MANPATH 0 then setenv MANPATH usr local lam man else setenv MANPATH usr local lam man MANPATH endif 4 2 System Services Interface SSD LAM MPI is built around a core of System Services Interface SSI plugin modules SSI allows run time selection of different underlying services within the LAM MPI run time environment including tunable parameters that can affect the performance of MPI programs 25 While this tutorial won t go into much detail about SSI just be aware that you ll see mention of SSI in the text below In a few places the tutorial passes parameters to various SSI modules through either environment variables and or the ssi command li
105. hey will start from the user s home directory All directory changing occurs before the user s program is invoked it does not wait until MPI_INIT is called 7 15 Thempitask Command The mpitask command shows a list of the processes running in the LAM universe and a snapshot of their current MPI activity It is usually invoked with no command line parameters thereby showing summary details of all processes currently running Since mpitask only provides a snapshot view it is not advisable to use mpitask as a high resolution debugger see Chapter 10 page 103 for more details on debugging MPI programs Instead mpitask can be used to provide answers to high level questions such as Where is my program hung and Is my program making progress The following example shows an MPI program running on four nodes sending a message of 524 288 integers around in a ring pattern Process 0 is running i e not in an MPI function while the other three are blocked in MPILRECV shell mpitask TASK G L FUNCTION PEER ROOT TAG COMM COUNT DATATYPE 0 ring lt running gt 1 1 ring Recv 0 0 201 WORLD 524288 INT 2 2 ring Recv 1 1 201 WORLD 524288 INT 3 3 ring Recv 2 2 201 WORLD 524288 INT 7 16 The recon Command The recon command is a quick test to see if the user s environment is setup properly to boot the LAM RTE It takes most of the same parameters as the Lamboot command Although it does not boot the RTE and does not definit
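For example, a quick sanity check of a boot schema before attempting to boot the LAM RTE might look like the following; the hostfile name is illustrative, and -v simply requests verbose output:

shell$ recon -v hostfile

If recon reports success, running lamboot with the same boot schema is likely (although not guaranteed) to succeed.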
106. i tcp sockbuf SSI parameter 87 89 rpi_usysv_pollyield SSI parameter 89 rpi_usysv_priority SSI parameter 89 rpi_usysv_readlockpol1 SSI parameter 89 rpi usysv_shmmaxalloc SSI parameter 89 rpi_usysv_shmpoolsize SSI parameter 89 rpi usysv_short SSI parameter 89 rpi usysv writelockpo11 SSI parameter 89 RPMs 19 rsh ssh boot SSI module 70 rsh command 65 running MPI programs 31 sample MPI program C 29 C 30 Fortran 31 serial debuggers 107 session directory 119 shell setup Bash Bourne shells 25 C shell and related 25 signals 120 slurm boot SSI module 71 srun command 72 ssh command 65 SSI module types 43 overview 43 46 parameter overview 44 SSI boot modules see boot SSI modules SSI collective modules see collective SSI mod ules SSI parameters base_module_path 46 boot 68 72 74 bproc value 68 globus value 69 70 rsh value 71 slurm value 72 tm value 74 boot_base_promisc 67 boot_bproc_priority 69 boot_globus_priority 70 boot_rsh_agent 17 72 boot_rsh_fast 72 boot_rsh_ignore_stderr 70 72 boot_rsh_no_n 72 boot_rsh_no_profile 72 boot_rsh_priority 72 boot_rsh_username 72 boot_slurm_ priority 73 boot _tm_priority 74 col1 90 coll_base associat ive 90 92 coll_base_shmem_message_pool_size 05 coll_base_shmem_num_segments 95 coll_crossover 90 coll_reduce_crossover 90 coll_base_shmem_message_pool _size 94 coll_base_shmem_num_segments 92 cr 96
ialized the job, the requested thread level is the first thread level that the job will attempt to use. There is currently no way to specify lower or upper bounds to the thread level that LAM will use.

The resulting thread level is largely determined by the SSI modules that will be used in an MPI job; each module must be able to support the target thread level. A complex algorithm is used to attempt to find a thread level that is acceptable to all SSI modules. Generally, the algorithm starts at requested and works backwards towards MPI_THREAD_SINGLE, looking for an acceptable level. However, any module may increase the thread level under test if it requires it. At the end of this process, if an acceptable thread level is not found, the MPI job will abort.

Value       Meaning
undefined   MPI_THREAD_SINGLE
0           MPI_THREAD_SINGLE
1           MPI_THREAD_FUNNELED
2           MPI_THREAD_SERIALIZED
3           MPI_THREAD_MULTIPLE

Table 12.1: Valid values for the LAM_MPI_THREAD_LEVEL environment variable.

Also note that certain SSI modules require higher thread support levels than others. For example, any checkpoint/restart SSI module will require a minimum of MPI_THREAD_SERIALIZED, and will attempt to adjust the thread level upwards as necessary (if that CR module will be used during the job). 116
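For example, to request MPI_THREAD_SERIALIZED (value 2 in Table 12.1) for a job started with mpirun, the environment variable can be set before launching. The Bourne-shell syntax below and the use of -x to export the variable to the MPI processes are shown as a sketch; both steps may not be necessary on every system:

shell$ LAM_MPI_THREAD_LEVEL=2
shell$ export LAM_MPI_THREAD_LEVEL
shell$ mpirun C -x LAM_MPI_THREAD_LEVEL my_mpi_program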
108. ible for all process and kernel scheduling 32 Running MPMD Programs For example to run a manager worker parallel program where two different executables need to be launched 1 e manager and worker the following can be used a sheng mpiexec n 1 manager worker This runs one copy of manager and one copy of worker for every CPU in the LAM universe Running Heterogeneous Programs Since LAM is a heterogeneous MPI implementation it supports running heterogeneous MPI programs For example this allows running a parallel job that spans a Sun SPARC machine and an IA 32 Linux machine even though they are opposite endian machines Although this can be somewhat complicated to setup remember that you will first need to Lamboot successfully which essentially means that LAM must be correctly installed on both architectures the mpiexec command can be helpful in actually running the resulting MPI job Note that you will need to have two MPI executables one compiled for Solaris e g he Llo solaris and one compiled for Linux e g he Llo linux Assuming that these executables both reside in the same directory and that directory is available on both nodes or the executables can be found in the PATH on their respective machines the following command can be used shell mpiexec arch solaris hello solaris arch linux hello linux This runs the hello solaris command on all nodes in the LAM universe that ha
109. icrosoft Windows 21 MPI and threads see threads and MPI MPI attribute keyvals LAM_MPI_SSI_COLL 90 MPI collective modules see collective SSI mod ules MPI constants MPI ANY SOURCE 114 MPI_COMM_SELF 16 36 44 90 92 107 MPI COMM WORLD 18 44 47 59 62 90 92 104 107 108 115 MPILERR_KEYVAL 36 MPI_STATUS_IGNORE 36 MPI_STATUSES IGNORE 36 MPI THREAD FUNNELED 16 116 MPILTHREAD MULTIPLE 16 116 MPI_ THREAD SERIALIZED 16 96 97 99 116 MPI THREAD SINGLE 16 116 117 MPI datatypes MPI_DARRAY 40 MPLINTEGER1 35 MPLINTEGERZ2 35 MPLINTEGERA 35 MPLINTEGERE8 35 MPILONG_LONG INT 38 MPI_REAL16 35 MPI_REAL4 35 MPI_REAL8 35 MPILUNSIGNED_LONG_LONG 38 MPI_WCHAR 38 MPI functions MPIACCUMULATE 39 MPILALLGATHER 39 93 95 MPIALLGATHERV 39 93 95 MPI_ALLOC_MEM 37 78 80 82 MPI_ALLREDUCE 39 93 95 118 MPI_ALLTOALL 39 93 95 MPI_ALLTOALLY 39 93 95 MPI_ALLTOALLW 39 93 95 MPI_BARRIER 39 93 95 118 MPI_BCAST 39 93 95 118 MPI CANCEL 35 36 118 MPI_CLOSE_PORT 38 MPI_COMM_ACCEPT 38 MPI_COMM_C2F 37 MPI_CCOMM_CONNECT 38 MPI COMM CREATE ERRHANDLER 37 40 MPI_COMM_CREATE_KEYVAL 40 MPI_COMM_DELETE_ATTR 40 MPI COMM DISCONNECT 38 MPI COMM F2C 37 MPI_COMM_FREE KEYVAL 40 126 MPI_COMM_GET_ATTR 40 MPI COMM_GET ERRHANDLER 37 40 MPI_COMM_GET_NAME 40 MPI_COMM_GET_PARENT 38 MPI_COMM_JOIN 38 MPIL COMM_SET_ATTR 40 MPIL COMM_SET_ERRHANDLER 37 40 MP
110. ided when LAM MPI was configured with the with cr file dir flag the user s home directory is used 9 5 3 The blcr Module Module Summary Name blcr Kind cr Default SSI priority 50 Checkpoint restart yes Berkeley Lab s Checkpoint Restart BLCR 1 single node checkpointer provides the capability for checkpointing and restarting processes under Linux The blcr module when used with checkpoint restart SSI modules will invoke the BLCR system to save and restore checkpoints Overview The blcr module will only automatically be selected when the thread level is MPI_THREAD_SERIALIZED and all selected SSI modules support checkpoint restart functionality see the SSI module selection algo rithm Section 9 2 page 75 The blcr module can be specifically selected by setting the cr SSI parameter to the value blcr Manually selecting the blcr module will force the MPI thread level to be at least MPI_ THREAD_SERIALIZED Running a Checkpoint Restart Capable MPI Job There are multiple ways to run a job with checkpoint restart support e Use the cricp RPI and invoke MPI_INIT_THREAD with a requested thread level of MPI THREAD_ SERIALIZED This will automatically make the blcr module available shell mpirun C ssi rpi crtcp my_mpi program e Use the crtcp RPI and manually select the bler module shell mpirun C ssi rpi crtcp ssi cr bler my_mpi_program Depending on the location of the BLCR
ight (C) 2002 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

4.5.1 Sample MPI Program in C

The following is a simple "hello world" C program: 29

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello, world!  I am %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}

This program can be saved in a text file and compiled with the mpicc wrapper compiler:

shell$ mpicc hello.c -o hello

4.5.2 Sample MPI Program in C++

The following is a simple "hello world" C++ program:

#include <iostream>
#include <mpi.h>

using namespace std;

int main(int argc, char *argv[])
{
    int rank, size;

    MPI::Init(argc, argv);
    rank = MPI::COMM_WORLD.Get_rank();
    size = MPI::COMM_WORLD.Get_size();
    cout << "Hello, world!  I am " << rank << " of " << size << endl;
    MPI::Finalize();
    return 0;
}

This program can be saved in a text file and compiled with the mpiCC wrapper compiler (or mpic++ if on case-insensitive filesystems such as Mac OS X's HFS+):

shell$ mpiCC hello.cc -o hello

30
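If you are curious about what the wrapper compilers actually invoke, LAM's wrapper compilers accept a -showme option that prints the underlying compiler command line (including the LAM include and library flags) without executing it. The example assumes the hello.c file from above:

shell$ mpicc -showme hello.c -o hello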
112. ing MPMD Processes The separator can be used to launch multiple executables in the same MPI job Specifically each pro cess will share a common MPI_COMM_WORLD For example the following launches a single manager process as well as a worker process for every CPU in the LAM universe shell mpiexec n 1 manager C worker Paired with the arch flag this can be especially helpful in heterogeneous environments 59 N shell mpiexec arch solaris sol program arch linux linux program Even only slightly heterogeneous environments can run into problems with shared libraries different compilers etc The arch flag can be used to differentiate between different versions of the same operating system shell mpiexec arch solaris2 8 sol2 8 program arch solaris2 9 sol2 9_ program 7 12 3 Launching MPI Processes with No Established LAM Universe The boot boot args and machinefile global arguments can be used to launch the LAM RTE run the MPI process es and then take down the LAM RTE This conveniently wraps up several LAM commands and provides one shot execution of MPI processes For example shell mpiexec machinefile hostfile C my_mpi_program Some boot SSI modules do not require a hostfile specifying the boot argument is sufficient in these cases shell mpiexec boot C my_mpi_program When mpiexec is used to boot the LAM RTE it will do i
113. ing shared memory MPI_ALLTOALL Implemented using shared memory MPI_ALLTOALLV Uses lam_basic algorithm MPI_ALLTOALLW Uses lam_basic algorithm MPI BARRIER Implemented using shared memory MPI_BCAST Implemented using shared memory MPLEXSCAN Uses lam_basic algorithm MPI_GATHER Implemented using shared memory MPIGATHERV Uses lam_basic algorithm MPILREDUCE Implemented using shared memory MPI_LREDUCE_SCATTER Uses lam_basic algorithm MPI_SCAN Uses lam_basic algorithm MPI_SCATTER Implemented using shared memory MPISCATTERV Uses lam_basic algorithm Table 9 9 Listing of MPI collective functions indicating which have been implemented using Shared Mem ory SSI parameter name Default value Description coll_base_shmem_ 16384 x 8 Size of the shared memory pool for the messages message pool size coll_base_shmem_num_ 8 Number of segments in the message pool section segments Table 9 10 95 SSI parameters for the shmem coll module T an Lan 9 5 Checkpoint Restart of MPI Jobs LAM supports the ability to involuntarily checkpoint and restart parallel MPI jobs Due to the asynchronous nature of the checkpoint restart design such jobs must run with a thread level of at least MPI THREAD SERIALIZED This allows the checkpoint restart framework to interrupt the user s job for a checkpoint regardless of whether i
114. irectory each node in a LAM universe has a session directory in a well known location in the filesystem that identifies how to contact the LAM daemon on that node Multiple LAM universes can simultaneously co exist on the same node as long as they have different session directories LAM recognizes several batch environments and automatically adapts the session directory to be specific to a batch job Hence if the batch scheduler allocates multiple jobs from the same user to the same node LAM will automatically do the right thing and ensure that the LAM universes from each job will not collide Sections 12 7 and 12 8 starting on page 119 discuss these issues in detail 50 7 1 2 Avoiding Running on Specific Nodes Once the LAM universe is booted processes can be launched on any node The mpirun mpiexec and lamexec commands are most commonly used to launch jobs in the universe and are typically used with the N and C nomenclatures see the description of mpi run in Section 7 14 for details on the N and C nomenclature which launch jobs on all schedulable nodes and CPUs in the LAM universe respectively While finer grained controls are available through mpirun etc it can be convenient to simply mark some nodes as non schedulable and therefore avoid having mpirun etc launch executables on those nodes when using N and C nomenclature For example it may be convenient to boot a LAM universe that includes a controller node e
115. irun C ssi rpi crtcp ssi cr self ssi cr_self_user_prefix foo my_mpi_program will look for functions named foo_checkpoint foo_continue and foo_restart respectively 2 To specify unique names of the Checkpoint Restart and Continue functions use the cr_self_ user_checkpoint cr_self_user_restart and the cr_self_user_continue SSI param eters respectively For example shell mpirun C ssi rpi crtcp ssi cr self ssi cr_self_user_checkpoint save_my_stuff ssi cr_self_user_continue do_nothing ssi cr_self_user_restart load_my_stuff my_mpi_program will look for functions named save my stuff do nothing and load_my_stuff re spectively Note that if both the cr_ssi_user prefix and any of the above three parameters are specified these parameters are given higher preference Note that LAM will make no special interpretation for Fortran functions Hence if you want to have LAM call fortran functions for any of the three phases you must specify the mangled name to the cr_ self_user_ checkpoint continue restart SSI parameters Compiling self Checkpointable Applications It is critically important to compile self checkpointable applications with the appropriate linker flags to export the symbols for the Checkpoint Continue and Restart functions This allows LAM to look up these symbols at run time Each compiler linker s flags for this are different but
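As one illustration of the kind of linker flag involved (this assumes a GNU toolchain; other compilers and linkers use different options, so treat it purely as a sketch):

shell$ mpicc my_app.c -o my_app -Wl,--export-dynamic

Here --export-dynamic asks the GNU linker to place the program's global symbols in the dynamic symbol table, so that the Checkpoint, Continue, and Restart functions can be looked up at run time.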
isible to shell scripts as well as the target MPI application. It is erroneous to alter the value of this variable.

Consider the following script:

#!/bin/csh -f

# Which debugger to run
set debugger=gdb

# On MPI_COMM_WORLD rank 0, launch the process in the debugger.
# Elsewhere, just launch the process directly.
if ("$LAMRANK" == "0") then
    echo Launching $debugger on MPI_COMM_WORLD rank $LAMRANK
    $debugger $*
else
    echo Launching MPI executable on MPI_COMM_WORLD rank $LAMRANK
    $*
endif

# All done
exit 0

This script can be executed via mpirun to launch a debugger on MPI_COMM_WORLD rank 0, and to directly launch the MPI process in all other cases.

10.3.2 Attaching Debuggers

In some cases, it is not possible or desirable to start debugging a parallel application immediately. For example, it may only be desirable to attach to certain MPI processes whose identity may not be known until run time. In this case, the technique of attaching to a running process can be used (this functionality is supported by many serial debuggers). Specifically, determine which MPI process you want to attach to. Then login to the node where it is running, and use the debugger's "attach" functionality to latch on to the running process.

10.4 Memory Checking Debuggers

Memory-checking debuggers are an invaluable tool when debugging software, even parallel software. They can provide detailed reports about me
117. ively guarantee that Lamboot will succeed it is a good tool for testing while setting up first time LAM MPI users recon will display a message when it has completed indicating whether it succeeded or failed 63 Tay an 7 17 Thetping Command The tping command can be used to verify the functionality of a LAM universe It is used to send a ping message between the LAM daemons that constitute the LAM RTE It commonly takes two arguments the set of nodes to ping expressed in N notation and how many times to ping them Similar to the Unix ping command if the number of times to ping is not specified tping will continue until it is stopped usually by the user hitting Control C The following example pings all nodes in the LAM universe three times shell tping N c 3 1 byte from 3 remote nodes and 1 local node 0 002 secs 1 byte from 3 remote nodes and 1 local node 0 001 secs 1 byte from 3 remote nodes and 1 local node 0 001 secs 3 messages 3 bytes 0 003K 0 005 secs 1 250K sec roundtrip min avg max 0 001 0 002 0 002 7 18 The lamwipe Command The Lamwipe command used to be called wipe The name wipe has now been deprecated and although it still works in this version of LAM MPI will be removed in future versions All users are encouraged to start using lamwipe instead The lamwipe command is used as a last resort command and is typically only necessary if lamhalt fails This usually only occurs in er
118. k command 17 33 63 115 fortran process names 115 Myrinet release notes 18 name publising see published names NFS filesystem 20 no schedule boot schema attribute 51 OpenPBS see batch queue systems PATH environment variable 69 PBS see batch queue systems PBS Pro see batch queue systems pbs_demux command 74 Portable Batch System see batch queue systems published names 117 recon command 63 release notes 15 22 ROMIO 115 rpi SSI parameter 76 rpi_crtcp_priority SSI parameter 78 rpi_crtcp_short SSI parameter 78 rpi_crtcp_sockbuf SSI parameter 78 rpi_gm_cr SSI parameter 79 rpi_gm_fast SSI parameter 79 rpi_gm_maxport SSI parameter 79 rpi_gm_nopin SSI parameter 79 rpi_gm_port SSI parameter 79 rpi_gm_priority SSI parameter 79 rpi_gm_tinymsglen SSI parameter 79 80 rpi_ib_hca_id SSI parameter 82 rpi_ib_mtu SSI parameter 82 83 rpi_ib_num_envelopes SSI parameter 82 83 rpi_ib_port SSI parameter 82 rpi_ib_priority SSI parameter 82 rpi_ib_tinymsglen SSI parameter 82 83 rpi_lamd_priority SSI parameter 85 rpi_ssi_sysv_shmmaxalloc SSI parameter 86 rpi ssi sysv_shmpoolsize SSI parameter 86 rpi _ssi _sysv_short SSI parameter 86 rpi_sysv_pollyield SSI parameter 87 rpi _ sysv_priority SSI parameter 87 rpi_sysv_shmmaxalloc SSI parameter 87 rpi _sysv_shmpoolsize SSI parameter 87 rpi_sysv_short SSI parameter 87 rpi tcp priorit y SSI parameter 88 rpi_tcp_short SSI parameter 87 89 rp
119. l back to a different boot module e g rsh ssh It is suggested that the hostfile file contain hostnames in the style that BProc prefers integer numbers For example host file may contain the following 1 UN Re which boots on the BProc front end node 1 and four slave nodes 0 1 2 3 Note that using IP hostnames will also work but using integer numbers is recommended Tunable Parameters Table 8 1 lists the SSI parameters that are available to the bproc module Special Notes After booting LAM will by default not schedule to run MPI jobs on the BProc front end Specifically LAM implicitly sets the no schedule attribute on the 1 node in a BProc cluster See Section 7 1 page 49 68 SSI parameter name Default value Description boot_bproc_priority 50 Default priority level Table 8 1 SSI parameters for the bproc boot module for more detail about this attribute and boot schemas in general and 7 1 2 page 51 8 1 6 The globus Module LAM MPI 7 1 3 includes beta support for Globus Specifically only limited types of execution are possible The LAM Team would appreciate feedback from the Globus community on expanding Globus support in LAM MPI Minimum Requirements LAM MPI jobs in Globus environment can only be started on nodes using the fork job manager for the Globus gatekeeper Other job managers are not yet supported Usage Starting the LAM run time en
120. l is provided with the coll SSI parameter Its value is a comma separated list of coll module names If this parameter is supplied only these modules will be queried at run time effectively de termining the set of modules available for selection on all communicators If this parameter is not supplied all coll modules will be queried The second level is provided with the MPI attribute LAM_MPI_SSI_COLL This attribute can be set to the string name of a specific Coll module on a parent communicator before a new communicator is created If set the attribute s value indicates the only module that will be queried If this attribute is not set all available modules are queried Note that no coordination is done between the SSI frameworks in each MPI process to ensure that the same modules are available and or are selected for each communicator Although mpirun allows different environment variables to be exported to each MPI process and the value of an MPI attribute is local to each process LAM s behavior is undefined if the same SSI parameters are not available in all MPI processes 9 4 2 coll SSI Parameters There are three parameters that apply to all coll modules Depending on when their values are checked they may be set by environment variables command line switches or attributes on MPI communicators e coll base associative The MPI standard defines whether reduction operations are commu tative or not but makes no provisions for whethe
121. l number of CPUs it can be smaller then equal to or greater than the number of physical CPUs in the machine It is solely used as a shorthand notation for mpirun s C notation meaning launch one process per CPU as specified in the boot schema file For example in the following boot schema inky cluster example com cpu 2 pinky cluster example com cpu 4 blinkly cluster example com cpu 4 clyde doesn t mention a cpu count and is therefore implicitly 1 clyde cluster example com issuing the command mpirun C foo would actually launch 11 copies of foo 2 on inky 4 on pinky 4on blinky and 1 on clyde Note that listing a host more than once has the same effect as incrementing the CPU count The following boot schema has the same effect as the previous example i e CPU counts of 2 4 4 and 1 respectively inky has a CPU count of 2 inky cluster example com inky cluster example com pinky has a CPU count of 4 pinky cluster example com pinky cluster example com pinky cluster example com pinky cluster example com blinky has a CPU count of 4 blinkly cluster example com blinkly cluster example com blinkly cluster example com blinkly cluster example com clyde only has 1 CPU clyde cluster example com Other keys are defined on a per boot SSI module and are described below 66 8 1 2 Minimum Requirements In order to successfully launch a process on a remote node several requireme
122. l return the corresponding names of the specified nodes For example shell lamnodes N will return the node that each CPU is located on the hostname of that node the total number of CPUs on each and any flags that are set on that node Specific nodes can also be queried shell lamnodes n0 3 vo will return the node hostname number of CPUs and flags on nO and n3 Command line arguments can be used to customize the output of Lamnodes These include e c Suppress printing CPU counts e i Print IP addresses instead of IP names e n Suppress printing LAM node IDs 7 9 The lamrestart Command The lamrestart can be used to restart a previously checkpointed MPI application The arguments to lamrestart depend on the selected checkpoint restart module Regardless of the checkpoint restart mod ule used invoking lamrestart results in a new mpirun being launched 55 The SSI parameter cr must be used to specify which checkpoint restart module should be used to restart the application Currently only two values are possible blcr and self e If the blcr module is selected the SSI parameter cr_blcr_context_file should be used to pass in the filename of the context file that was created during a pevious successful checkpoint For example shell lamrestart ssi cr bler ssi cr_bler_context file filename e If the self module is selected the SSI parameter cr_restart_args must be passed with the a
123. l simply be the hostname If the contact string contains whitespace the entire contact string must be enclosed in quotes i e not just the values with whitespaces For example if your contact string is host1 port1 0 xxx 0U yyy CN aaa bbb ccc Then you will need to have it listed as host1l portl 0 xxx OU yyy CN aaa bbb ccc The following will not work host1l port1 0 xxx OU yyy CN aaa bbb ccc Each host in the boot schema must also have a lam_install_path key indicating the absolute directory where LAM MPI is installed This value is mandatory because you cannot rely on the PATH 69 T 7 05 L 7 05 Tay aa environment variable in Globus environment because users dot files are not executed in Globus jobs and therefore the PATH environment variable is not provided Other keys can be used as well Lam_ install_path is the only mandatory key Here is a sample Globus boot schema Globus boot schema inky mycluster 12853 0 MegaCorp OU Mine CN HPC Group prefix opt lam cpu 2 pinky yourcluster 3245 0 MegaCorp OU Yours CN HPC Group prefix opt lam cpu 4 blinky hiscluster 23452 0 MegaCorp OU His CN HPC Group prefix opt lam cpu 4 clyde hercluster 82342 0 MegaCorp OU Hers CN HPC Group prefix software lam Once you have this boot schema the 1amboot command can be used to launch it Note however that unlike the other boot SSI modules the
le MPI Program in Fortran

The following is a simple "hello world" Fortran program:

      program hello

      include 'mpif.h'
      integer rank, size, ierr

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierr)
      print *, "Hello, world!  I am ", rank, " of ", size
      call MPI_FINALIZE(ierr)
      stop
      end

This program can be saved in a text file and compiled with the mpif77 wrapper compiler:

shell$ mpif77 hello.f -o hello

4.6 Running MPI Programs

Once you have successfully established a LAM universe and compiled an MPI program, you can run MPI programs in parallel. In this section, we will show how to run a Single Program, Multiple Data (SPMD) program. Specifically, we will run the hello program from the previous section in parallel. The mpirun and mpiexec commands are used for launching parallel MPI programs, and the mpitask command can be used to provide crude debugging support. The lamclean command can be used to completely clean up a failed MPI program (e.g., if an error occurs).

4.6.1 The mpirun Command

The mpirun command has many different options that can be used to control the execution of a program in parallel. We'll explain only a few of them here.

The simplest way to launch the hello program across all CPUs listed in the boot schema is:

shell$ mpirun C hello

The C option means "launch one copy of hello on every
125. list see Section 11 1 on page 111 Most of the output fields are self explanitory two that are worth explaining are e Debug support This indicates whether your LAM installation was configured with the with debug option It is generally only used by the LAM Team for development and maintenance of LAM itself it does not indicate whether user s MPI applications can be debugged specifically user s MPI appli cations can always be debugged regardless of this setting This option defaults to no users are dis couraged from using this option See the Install Guide for more information about with debug e Purify clean This indicates whether your LAM installation was configured with the with purify option This option is necessary to prevent a number of false positives when using memory checking debuggers such as Purify Valgrind and bcheck It is off by default because it can cause slight performance degredation in MPI applications See the Install Guide for more information about with purify 7 8 The lamnodes Command LAM was specifically designed to abstract away hostnames once lamboot has completed successfully However for various reasons usually related to system administration concerns and or for creating human readable reports it can be desirable to retrieve the hostnames of LAM nodes long after Lamboot The command lamnodes can be used for this purpose It accepts both the N and C syntax from mpirun and wil
126. luster example com cpu 4 prefix home joe lam7 1 install blinky cluster example com cpu 4 clyde cluster example com user jsmith The rsh ssh boot module will usually run when no other boot module has been selected It can however be manually selected even when another module would typically automatically be selected by specifying the boot SSI parameter with the value of rsh For example shell lamboot ssi boot rsh hostfile Tunable Parameters Table 8 3 lists the SSI parameters that are available to the rsh module 8 1 8 The slurm Module As its name implies the Simple Linux Utility for Resource Management SLURM package is commonly used for managing Linux clusters typically in high performance computing environments SLURM con tains a native system for launching applications across the nodes that it manages When using SLURM rsh ssh is not necessary to launch jobs on remote nodes Instead the slurm boot module will automati cally use SLURM s native job launching interface to start LAM daemons The advantages of using SLURM s native interface are e SLURM can generate proper accounting information for all nodes in a parallel job e SLURM can kill entire jobs properly when the job ends e lamboot executes significantly faster when using SLURM as compared to when it uses rsh ssh http www llnl gov linux slurm 71 Ton Los Tan 7 1 Tan SS
127. ments MPI_SCATTERV Identical to lam_basic algorithm already optimized for SMP environments Table 9 8 Listing of MPI collective functions indicating which have been optimized for SMP environments 93 the MESSAGE_POOL can be also configured with the coll_base_shmem_message_pool_size SSI parameter Default size of the MESSAGE_POOL is 16384 x 8 The first section is known as CONTROL_SECTION and it is logicallu divided into 2 x N 2 seg ments NV is the number of segments in the MESSAGE_POOL section Total size of this section is 2x N 2 xCxS Where C is the cache line size S is the size of the communicator Shared variabled for synchronization are placed in different CACHELINE for each processes to prevent trashing due to cache invalidation General Logic behind Shared Memory Management Each segment in the MESSAGE POOL corresponds to TWO segments in the CONTROL_SECTION Whenever a particular segment in MESSAGE_POOL is active its corresponding segments in the CON TROL_SECTION are used for synchronization Processes can operate on one segment Copy the mes sages Set appropriate synchronizattion variables and can continue with the next message segment This ap proach improves performance of the collective algorithms All the process need to complete a MPI_ BARRIER at the last Default 8th segment to prevent race conditions The extra 2 segments in the CONTROL_SECTION are used exclusively for explicit MPILBAR
128. mory leaks, bad memory accesses, duplicate/bad memory management calls, etc. Some memory-checking debuggers include (but are not limited to) the Solaris Forte debugger (including the bcheck command-line memory checker), the Purify software package, and the Valgrind software package.

LAM can be used with memory-checking debuggers. However, LAM should be compiled with special support for such debuggers. This is because, in an attempt to optimize performance, there are many structures used internally to LAM that do not always have all memory positions initialized. For example, LAM's internal struct nmsg is one of the underlying message constructs used to pass data between LAM processes. But since the struct nmsg is used in so many places, it is a generalized structure and contains fields that are not used in every situation.

By default, LAM only initializes relevant struct members before using a structure. Using a structure may involve sending the entire structure (including uninitialized members) to a remote host. This is not a problem for LAM; the remote host will also ignore the irrelevant struct members (depending on the specific function being invoked). More to the point, LAM was designed this way to avoid setting variables that will not be used; this is a slight optimization in run-time performance. Memory-checking debuggers, however, will flag this behavior with "read from uninitialized" warnings.

The --with-purify option can be used
129. n 3 3 1 page 18 for more details This can cause problems when running MPI processes as dynamically loaded modules For example when running a LAM MPI program as a MEX function in a Matlab environment normal Unix linker semantics create situations where both the default Unix and the memory management systems are used This typically results in process failure Note that this only occurs when LAM MPI processes are used in a dynamic environment and an addi tional memory manager is included in LAM MPI This appears to occur because of normal Unix semantics the only way to avoid it is to use the with memory manager parameter to LAM s configure script specifying either none or external as its value See the LAM MPI Installation Guide for more details 3 4 4 Linux LAM MPI is frequently used on Linux based machines IA 32 and otherwise Although LAM MPI is generally tested on Red Hat and Mandrake Linux systems using recent kernel versions it should work on other Linux distributions as well Note that kernel versions 2 2 0 through 2 2 9 had some TCP IP performance problems It seems that version 2 2 10 fixed these problems if you are using a Linux version between 2 2 0 and 2 2 9 LAM may exhibit poor TCP performance due to the Linux TCP IP kernel bugs We recommend that you upgrade to 2 2 10 or the latest version See http www lam mpi org linux for a full discussion of the problem 3 4 5 Mac OS X Absoft Fortran Compiler
130. n number string, or just one component of it.

- param: Paired with two additional arguments, display the SSI parameters for a given type and/or module. The first argument can be any of the valid SSI types, or the special name "base", indicating the SSI framework itself. The second argument can be any valid module name. Additionally, either argument can be the wildcard "any", which will match any valid SSI type and/or module.

shell$ laminfo -parsable -arch -version lam major -version rpi:tcp full -param rpi tcp
version:lam:7
ssi:boot:rsh:version:ssi:1.0
ssi:boot:rsh:version:api:1.0
ssi:boot:rsh:version:module:7.0
arch:i686-pc-linux-gnu
ssi:rpi:tcp:param:rpi_tcp_short:65536
ssi:rpi:tcp:param:rpi_tcp_sockbuf:-1
ssi:rpi:tcp:param:rpi_tcp_priority:20

(The SVN version value will either be 0 (not built from SVN), 1 (built from a Subversion checkout), or a date encoded in the form YYYYMMDD (built from a nightly tarball on the given date).)

Note that three version numbers are returned for the tcp module. The first (ssi) indicates the overall SSI version that the module conforms to, the second (api) indicates what version of the rpi API the module conforms to, and the last (module) indicates the version of the module itself.

Running laminfo with no arguments provides a wealth of information about your LAM/MPI installation; we ask for this output when reporting problems to the LAM/MPI general user's mailing
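Because the parsable output is line-oriented, it is easy to pick out a single value in a script. For example (assuming a Unix shell with grep available, and an installation that includes the tcp rpi module), one could extract just the short-message threshold shown above:

shell$ laminfo -parsable -param rpi tcp | grep rpi_tcp_short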
131. n of LAM/MPI.

3.1 New Feature Overview

A full, high-level overview of all changes in the 7 series (and previous versions) can be found in the HISTORY file that is included in the LAM/MPI distribution.

This documentation was originally written for LAM/MPI v7.0. Changebars are used extensively throughout the document to indicate changes, updates, and new features in the versions since 7.0. The changebars indicate a version number in which the change was introduced.

Major new features specific to the 7 series include the following:

- LAM/MPI 7.0 is the first version to feature the System Services Interface (SSI). SSI is a "pluggable" framework that allows for a variety of run-time selectable modules to be used in MPI applications. For example, the selection of which network to use for MPI point-to-point message passing is now a run-time decision, not a compile-time decision.

- SSI modules can be built as part of the MPI libraries that are linked into user applications, or as standalone dynamic shared objects (DSOs). When compiled as DSOs, all SSI modules are installed in $prefix/lib/lam; new modules can be added to or removed from an existing LAM installation simply by putting new DSOs in that directory (there is no need to recompile or relink user applications).

- When used with supported back-end checkpoint/restart systems, LAM/MPI can checkpoint parallel MPI jobs (see Section 9.5, page 96, for more details).

- LAM/M
132. n this case, the transport will fall back to using the postbox area to transfer the message. Performance will be degraded, but the application will progress.

Tunable Parameters

Table 9.5 shows the SSI parameters that may be changed at run time. Each of these parameters was discussed in the previous sections.

SSI parameter name      Default value    Description
rpi_sysv_priority       30               Default priority level.
rpi_sysv_pollyield      1                Whether or not to force the use of yield to yield the processor.
rpi_sysv_shmmaxalloc    From configure   Maximum size of a large message atomic transfer. The default value is calculated when LAM is configured.
rpi_sysv_shmpoolsize    From configure   Size of the shared memory pool for large messages. The default value is calculated when LAM is configured.
rpi_sysv_short          8192             Maximum length (in bytes) of a "short" message for sending via shared memory (i.e., on-node). Directly affects the size of the allocated "postbox" shared memory area.
rpi_tcp_short           65535            Maximum length (in bytes) of a "short" message for sending via TCP sockets (i.e., off-node).
rpi_tcp_sockbuf         -1               Socket buffering in the OS kernel (-1 means use the short message size).

Table 9.5: SSI parameters for the sysv RPI module.

9.3.7 The tcp Module (TCP Communication)

Module Summary
Name: tcp
Kind: rpi
Default SSI priority: 20
Checkpoint / restart: no

The tcp RPI module us
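Any of these parameters can be overridden on the mpirun command line with the -ssi switch, in the same way other SSI parameters are set. For example, the following selects the sysv RPI and enlarges its short-message threshold (the value shown is purely illustrative, not a tuning recommendation):

shell$ mpirun C -ssi rpi sysv -ssi rpi_sysv_short 16384 my_mpi_program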
133. ne parameter to several LAM commands See other sections in this manual for a more complete description of SSI Chapter 6 page 43 how it works and what run time parameters are available Chapters 8 and 9 pages 65 and 75 respectively Also the lamssi 7 lamssi_boot 7 lamssi_coll 7 lamssi_cr 7 and lamssi_rpi 7 man ual pages each provide additional information on LAM s SSI mechanisms 4 3 What Does Your LAM MPI Installation Support LAM MPI can be installed with a large number of configuration options It depends on what choices your system network administrator made when configuring and installing LAM MPI The laminfo command is provided to show the end user with information about what the installed LAM MPI supports Running laminfo with no arguments prints a list of LAM s capabilities including all of its SSI modules Among other things this shows what language bindings the installed LAM MPI supports what under lying network transports it supports and what directory LAM was installed to The parsable option prints out all the same information but in a conveniently machine parsable format suitable for using with scripts 4 4 Booting the LAM Run Time Environment Before any MPI programs can be executed the LAM run time environment must be launched This is typically called booting LAM A successfully boot process creates an instance of the LAM run time environment commonly referred to as the LAM uni
134. nments with multiple TCP networks SLURM may be configured to use a network that is specifically designated for commodity traffic another network may exist that is specifically allocated for high speed MPI traffic By default LAM will use the same hostnames that SLURM provides for all of its traffic This means that LAM will send all of its MPI traffic across the same network that SLURM uses However LAM has the ability to boot using one set of hostnames addresses and then use a second set of hostnames addresses for MPI traffic As such LAM can redirect its TCP MPI traffic across a secondary network It is possible that your system administrator has already configured LAM to operate in this manner If a secondary TCP network is intended to be used for MPI traffic see the section entitled Separating LAM and MPI TCP Traffic in the LAM MPI Installation Guide Note that this functionality has no effect on non TCP rpi modules such as Myrinet Infiniband etc Tunable Parameters Table 8 4 lists the SSI parameters that are available to the slurm module SSI parameter name Default value Description boot_slurm_priority 50 Default priority level Table 8 4 SSI parameters for the slurm boot module Special Notes Since the slurm boot module is designed to work in SLURM jobs it will fail if the slurm boot module is manually specified and LAM is not currently running in a SLURM job The slurm module does not
135. nodes and associated CPU counts to LAM Using lamboot is therefore as simple as shell lamboot The tm boot modules works in both interactive and non interactive batch jobs Note that in environments with multiple TCP networks PBS Torque may be configured to use a net work that is specifically designated for commodity traffic another network may exist that is specifically allocated for high speed MPI traffic By default LAM will use the same hostnames that the TM interface provides for all of its traffic This means that LAM will send all of its MPI traffic across the same network that PBS Torque uses However LAM has the ability to boot using one set of hostnames addresses and then use a second set of hostnames addresses for MPI traffic As such LAM can redirect its TCP MPI traffic across a secondary network It is possible that your system administrator has already configured LAM to operate in this manner If a secondary TCP network is intended to be used for MPI traffic see the section entitled Separating LAM and MPI TCP Traffic in the LAM MPI Installation Guide Note that this has no effect on non TCP rpi modules such as Myrinet Infiniband etc Tunable Parameters Table 8 5 lists the SSI parameters that are available to the tm module SSI parameter name Default value Description boot_tm_priority 50 Default priority level Table 8 5 SSI parameters for the tm boot module S
136. nsure that mpirun returns if an application process dies. To disable the catching of signals, use the -nsigs option to mpirun.

12.10 MPI Attributes

Discussion item: Need to have a discussion of built-in attributes here, such as MPI_UNIVERSE_SIZE, etc. Should specifically mention that MPI_UNIVERSE_SIZE is fixed at MPI_INIT time (at least it is as of this writing; who knows what it will be when we release 7.1). This whole section is for 7.1. (End of discussion item.)

Bibliography

[1] Jason Duell, Paul Hargrove, and Eric Roman. The Design and Implementation of Berkeley Lab's Linux Checkpoint/Restart, 2002.

[2] Al Geist, William Gropp, Steve Huss-Lederman, Andrew Lumsdaine, Ewing Lusk, William Saphir, Tony Skjellum, and Marc Snir. MPI-2: Extending the Message-Passing Interface. In Luc Bouge, Pierre Fraigniaud, Anne Mignotte, and Yves Robert, editors, Euro-Par '96 Parallel Processing, number 1123 in Lecture Notes in Computer Science, pages 128-135. Springer Verlag, 1996.

[3] William Gropp, Steven Huss-Lederman, Andrew Lumsdaine, Ewing Lusk, Bill Nitzberg, William Saphir, and Marc Snir. MPI: The Complete Reference, Volume 2, the MPI-2 Extensions. MIT Press, 1998.

[4] William Gropp, Ewing Lusk, and Anthony Skjellum. Using MPI: Portable Parallel Programming with the Message Passing Interface. MIT Press, 1994.

[5] William Gropp, Ewing
137. nt variable has been deprecated in favor of the boot_rsh_agent parameter to the rsh SSI boot module.

LAM_MPI_SOCKET_SUFFIX: The LAM_MPI_SOCKET_SUFFIX environment variable has been deprecated in favor of the LAM_MPI_SESSION_SUFFIX environment variable.

3.2 Known Issues

3.2.1 mpirun and MPI Application cr Module Disagreement

Due to ordering issues in LAM's MPI_INIT startup sequence, it is possible for mpirun to believe that it can checkpoint an MPI application when the application knows that it cannot be checkpointed. A common case of this is when an un-checkpointable RPI module is selected for the MPI application, but checkpointing services are available. In this case, even though there is a mismatch between mpirun and the MPI application, there is no actual harm. Regardless of what mpirun believes, attempting to checkpoint the MPI application will fail.

3.2.2 Checkpoint Support Disabled for Spawned Processes

Checkpointing support is only enabled for MPI-1 processes; spawned processes will have checkpointing support explicitly disabled, regardless of the SSI parameters passed and the back-end checkpointing support available.

3.2.3 BLCR Support Only Works When Compiled Statically

Due to linker ordering issues, BLCR checkpointing support only works when the blcr modules are compiled statically into LAM. Attempting to use the blcr modules as dynamic shared objects will result in errors when compiling MPI applications; the error will c
138. nts must be met Although each of the boot modules have different specific requirements all of them share the following conditions for successful operation 1 Each target host must be reachable and operational 2 The user must be able to execute arbitrary processes on the target 3 The LAM executables must be locatable on that machine This typically involves using the shell s search path the LAMHOME environment variable or a boot module specific mechanism 4 The user must be able to write to the LAM session directory typically somewhere under tmp see Section 12 8 page 119 5 All hosts must be able to resolve the fully qualified domain name FQDN of all the machines being booted including itself 6 Unless there is only one host being booted any host resolving to the IP address 127 0 0 1 cannot be included in the list of hosts If all of these conditions are not met Lamboot will fail 8 1 3 Selecting a boot Module Only one boot module will be selected it will be used for the life of the LAM universe As such module priority values are the only factor used to determine which available module should be selected 8 1 4 boot SSI Parameters On many kinds of networks LAM can know exactly which nodes should be making connections while booting the LAM run time environment and promiscuous connections i e allowing any node to connect are discouraged However this is not possible in some complex network configurations
139. o-phase commit protocol that is used to save checkpoints provides a reasonable guarantee of consistency of saved global state, there is at least one case in which this guarantee fails. For example, the renaming of checkpoint files by mpirun is not atomic; if a failure occurs when mpirun is in the process of renaming the checkpoint files, the collection of checkpoint files might result in an inconsistent global state.

(See http://ftg.lbl.gov/ for more information about BLCR.)

- If the BLCR module(s) are compiled dynamically, the LD_PRELOAD environment variable must include the location of the libcr.so library. This is to ensure that libcr.so is loaded before the PThreads library.

9.5.4 The self Module

Module Summary
Name: self
Kind: cr
Default SSI priority: 25
Checkpoint / restart: yes

The self module, when used with checkpoint/restart SSI modules, will invoke the user-defined functions to save and restore checkpoints. It is simply a mechanism for user-defined functions to be invoked at LAM's Checkpoint, Continue, and Restart phases. Hence, the only data that is saved during the checkpoint is what is written in the user's checkpoint function; no MPI library state is saved at all.

As such, the model for the self module is slightly different than, for example, the blcr module. Specifically, the Restart function is not invoked in the same process image of the process that was checkpointed. The Restart phase is invoked during MPI_INIT of a new instance of the ap
140. oc and free, respectively, in RPI modules that do not take advantage of "special" memory. These functions can be used portably for potential performance gains.

Language interoperability: Inter-language interoperability is supported. It is possible to initialize LAM/MPI from either C or Fortran and mix MPI calls from both languages. Handle conversions for inter-language interoperability are fully supported. See Table 5.3.

Supported Functions
MPI_COMM_F2C      MPI_COMM_C2F
MPI_GROUP_F2C     MPI_GROUP_C2F
MPI_TYPE_F2C      MPI_TYPE_C2F
MPI_REQUEST_F2C   MPI_REQUEST_C2F
MPI_INFO_F2C      MPI_INFO_C2F
MPI_WIN_F2C       MPI_WIN_C2F
MPI_STATUS_F2C    MPI_STATUS_C2F

Table 5.3: Supported MPI-2 handle conversion functions.

Error handlers: Communicator and window error handler functions are fully supported; this functionality is not yet supported for MPI_File handles. See Table 5.4.

Supported Functions
MPI_COMM_CREATE_ERRHANDLER   MPI_WIN_CREATE_ERRHANDLER
MPI_COMM_GET_ERRHANDLER      MPI_WIN_GET_ERRHANDLER
MPI_COMM_SET_ERRHANDLER      MPI_WIN_SET_ERRHANDLER

Table 5.4: Supported MPI-2 error handler functions.

New datatype manipulation functions: Several new datatype manipulation functions are provided. Table 5.5 lists the new functions.

Supported Functions
MPI_GET_ADDRESS            MPI_TYPE_CREATE_SUBARRAY
MPI_TYPE_CREATE_DARRAY     MPI_TYPE_CREATE_STRUCT
MPI_TYPE_CREATE_HINDEXED   MPI_TYPE_GET_EXTENT
MPI_TYPE_CREATE_HVECTOR    MPI_TYPE_GET
141. ocesses allowing at least some degree of fair regular scheduling In non oversubscribed environments i e where the number of processes is less than or equal to the number of available processors the usysv RPI should generally provide better performance than the sysv RPI because spin locks keep processors busy waiting This hopefully keeps the operating system from suspending or swapping out the processes allowing them to react immediately when the lock becomes available 9 3 2 The crtcp Module Checkpoint able TCP Communication Module Summary Name crtcp Kind rpi Default SSI priority 25 Checkpoint restart yes The crtcp RPI module is almost identical to the tcp module described in Section 9 3 7 TCP sockets are used for communication between MPI processes Overview The following are the main differences between the tcp and crtcp RPI modules e The crtcp module can be checkpointed and restarted It is currently the only RPI module in LAM MPI that supports checkpoint restart functionality e The cricp module does not have the fast message passing optimization that is in the tcp module As result there is a small performance loss in certain types of MPI applications All other aspects of the crtcp module are the same as the tcp module Checkpoint Restart Functionality The crtcp module is designed to work in conjunction with a cr module to provide checkpoint restart func tionality See Section 9 5 for a descrip
142. omplain that libpthread must be listed after libcr.

3.2.4 Infiniband rpi Module

The Infiniband (ib) module implementation in LAM/MPI is based on the IB send/receive protocol for tiny messages and the RDMA protocol for long messages. Future optimizations include allowing tiny messages to use RDMA for potential latency performance improvements for tiny messages.

The ib rpi has been tested with Mellanox VAPI thca-linux-3.2-build-024. Other versions of VAPI, including OpenIB and versions from other vendors, have not been well tested. Whichever Infiniband driver is used, it must include support for shared completion queues. Mellanox VAPI, for example, did not include support for this feature until mVAPI v3.0. If your Infiniband driver does not support shared completion queues, the LAM/MPI ib rpi will not function properly. Symptoms will include LAM hanging or crashing during MPI_INIT.

Note that the 7.1.x versions of the ib rpi will not scale well to large numbers of nodes, because they register a fixed number of buffers (M bytes in total) for each process peer during MPI_INIT. Hence, for an N-process MPI_COMM_WORLD, the total memory registered by each process during MPI_INIT is (N - 1) x M bytes. This can be prohibitive as N grows large. This effect can be limited, however, by decreasing the number and size of buffers that the ib rpi module uses via SSI
143. on temporary directory typically located under tmp see Section 12 8 page 119 LAM must be able to read write in this session directory check permissions in this tree e LAM is unable to find the current host in the boot schema Solution LAM can only boot a universe that includes the current node If the current node is not listed in the hostfile or is not listed by a name that can be resolved and identified as the current node lamboot and friends will abort e LAM is unable to resolve all names in the boot schema Solution All names in the boot schema must be resolvable by the boot SSI module that is being used This typically means that there end up being IP hostnames that must be resolved to IP addresses Resolution can occur by any valid OS mechanism e g through DNS local file lookup etc Note that the name localhost or any address that resolves to 127 0 0 1 cannot be used in a boot schema that includes more than one host otherwise the other nodes in the resulting LAM universe will not be able to contact that host 11 3 MPI Problems For the most part LAM implements the MPI standard similarly to other MPI implementations Hence most MPI programmers are not too surprised by how LAM handles various errors etc However there are some cases that LAM handles in its own unique fashion In these cases LAM tries to display a helpful message discussing what happened Here s some more background on a few of the m
144. 11.1.2 General Discussion / User Questions
11.2 LAM Run-Time Environment Problems
11.2.1 Problems with the lamboot Command
11.3 MPI Problems
12 Miscellaneous
12.1 Singleton MPI Processes
12.2 MPI-2 I/O Support
12.3 Fortran Process Names
12.4 MPI Thread Support
12.4.1 Thread Level
12.5 MPI-2 Name Publishing
12.6 Interoperable MPI (IMPI) Support
12.6.1 Purpose of IMPI
12.6.2 Current IMPI functionality
12.6.3 Running an IMPI Job
12.6.4 Complex Network Setups
12.7 Batch Queuing System Support
12.8 Location of LAM's Session Directory
12.9 Signal Catching
12.10 MPI Attributes

Discussion Item

List of Tables

4.1, 4.2, 5.1, 5.2, 5.3, 5.4, 5.5, 5.6, 5.7, 5.8, 5.9, 5.10, 6.1, 8.1, 8.2, 8.3, 8.4, 8.5, 9.1, 9.2, 9.3, 9.4, 9.5, 9.6, 9.7, 9.8, 9.9

SSI modules that are included in the official LAM/MPI RP
145. operly configured for LAM to boot properly. Refer to Section 4.4.2 for the list of conditions that LAM requires to boot properly. User problems with lamboot typically fall into one of the following categories:

- rsh/ssh is not set up properly for password-less logins to remote nodes.

  Solution: Set up rsh/ssh properly for password-less remote logins. Consult local documentation or internet tutorials for how to set up $HOME/.rhosts and SSH keys. Note that the LAM Team STRONGLY discourages the use of "+" in .rhosts or hosts.equiv files.

- rsh/ssh prints something on stderr.

  Solution: Clean up system or user "dot" files so that nothing is printed on stderr during a remote login.

- A LAM daemon is unable to open a connection back to lamboot.

  Solution: Many Linux distributions ship with firewalls enabled. LAM/MPI uses random TCP ports to communicate, and therefore firewall support must be either disabled or opened between machines that will be using LAM/MPI.

- LAM is unable to open a session directory.

  Solution: LAM needs to use a per-user, per-sessi

(Footnote, continued: ...e-mail address to send help e-mails to. It is our hope that big, bold print will catch some people's eyes and enable them to help themselves, rather than having to wait for their post to distribute around the world and then further wait for someone to reply telling them that the solution to their problem was already printed on their screen. Thanks for your time in reading all of this!)
146. ostboxes, or short message passing. A postbox is used for communication one way between two processes. Each postbox is the size of a short message, S, plus the length of a cache line. There is enough space allocated for N x (N - 1) postboxes. The maximum size of a short message is configurable with the rpi_ssi_sysv_short SSI parameter.

The final area in the shared memory area (of size P) is used as a global pool from which space for long message transfers is allocated. Allocation from this pool is locked. The default lock mechanism is a System V semaphore, but it can be changed to a process-shared pthread mutex lock. The size of this pool is configurable with the rpi_ssi_sysv_shmpoolsize SSI parameter. LAM will try to determine P at configuration time if none is explicitly specified. Larger values should improve performance, especially when an application passes large messages, but will also increase the system resources used by each task.

Use of the Global Pool

When a message larger than 2S is sent, the transport sends S bytes with the first packet. When the acknowledgment is received, it allocates (message length - S) bytes from the global pool to transfer the rest of the message.

To prevent a single large message transfer from monopolizing the global pool, allocations from the pool are actually restricted to a maximum of rpi_ssi_sysv_shmmaxalloc bytes. Even with this restriction, it is possible for the global pool to temporarily become exhausted. I
147. ot schema once in order to eliminate the stderr warning and then try Lamboot again Another is to use the boot_rsh_ignore_stderr SSI parameter We haven t discussed SSI parameters yet so it is probably easiest at this point to manually ssh to a small number of nodes to get the warning out of the way If you have having problems with lamboot try using the d option to Lamboot which will print enormous amounts of debugging output which can be helpful for determining what the problem is Addi tionally check the Lamboot 1 man page as well as the LAM FAQ on the main LAM web site under the section Booting LAM for more information 4 4 3 The lamnodes Command An easy way to see how many nodes and CPUs are in the current LAM universe is with the lamnodes command For example with the LAM universe that was created from the boot schema in Section 4 4 1 running the Lamnodes command would result in the following output shell lamnodes nO node1 cluster example com 1 origin this_node nl node2 cluster example com 1 n2 node3 cluster example com 2 n3 node4 cluster example com 2 The n number on the far left is the LAM node number For example n3 uniquely refers to node4 Also note the third column which indicates how many CPUs are available for running processes on that node In this example there are a total of 6 CPUs available for running processes This information is from the cpu key that was used in the ho
148. ote that the sections below each assume that support for these modules have been compiled into LAM MPI The 1aminfo command can be used to determine exactly which modules are supported in your installation see Section 7 7 page 53 9 4 1 Selecting a coll Module coll modules are selected on a per communicator basis Most users will not need to override the Coll se lection mechanisms the coll modules currently included in LAM MPI usually select the best module for each communicator However mechanisms are provided to override which coll module will be selected on a given communicator 89 When each communicator is created including MPILCOMM_WORLD and MPI_COMM_SELP all available Coll modules are queried to see if they want to be selected A coll module may therefore be in use by zero or more communicators at any given time The final selection of which module will be used for a given communicator is based on priority the module with the highest priority from the set of available modules will be used for all collective calls on that communicator Since the selection of which module to use is inherently dynamic and potentially different for each communicator there are two levels of parameters specifying which modules should be used The first level specifies the overall set of coll modules that will be available to all communicators the second level is a per communicator parameter indicating which specific module should be used The first leve
149. ou can attach TotalView to MPI processes started by mpirun mpiexec in following ways 1 Use the tv convenience argument when running mpirun or mpiexec this is the preferred method S shell mpirun tv other mpirun arguments For example shell mpirun tv C my_mpi_program argl arg2 arg3 2 Directly launch mpi run in Total View you cannot launch mpiexec in Total View shell totalview mpirun a mpirun arguments For example 7 shell totalview mpirun a C my_mpi_program arg arg2 arg3 Note the a argument after mpirun This is necessary to tell Total View that arguments following a belong to mpirun and not Total View Also note that the tv convenience argument to mpirun simply executes totalview mpirun a so both methods are essentially identical Total View can either attach to all MPI processes in MPICOMM_ WORLD or a subset of them The controls for partial attach are in TotalView not LAM In TotalView 6 0 0 analogous methods may work for earlier versions of Total View see the TotalView documentation for more details you need to set the parallel launch preference to ask In the root window menu 1 Select File Preferences 2 Select the Parallel tab 3 In the When a job goes parallel box select Ask what to do 4 Click on OK Refer to http www etnus com for more information about Total View 10
150. pecial Notes Since the tm boot module is designed to work in PBS Torque jobs it will fail if the tm boot module is manually specified and LAM is not currently running in a PBS Torque job The tm module does not start a shell on the remote node Instead the entire environment of lamboot is pushed to the remote nodes before starting the LAM run time environment Also note that the Altair provided client RPMs for PBS Pro do not include the pbs_demux command which is necessary for proper execution of TM jobs The solution is to copy the executable from the server RPMs to the client nodes Finally TM does not provide a mechanism for path searching on the remote nodes so the 1amd exe cutable is required to reside in the same location on each node to be booted 74 Chapter 9 Available MPI Modules There are multiple types of MPI modules 1 rpi MPI point to point communication also known as the LAM Request Progression Interface RPI 2 coll MPI collective communication 3 cr Checkpoint restart support for MPI programs Each of these types and the modules that are available in the default LAM distribution are discussed in detail below 9 1 General MPI SSI Parameters Ton The default hostmap file is Ssysconf lam hostmap typically Sprefix etc lam hostmap txt This file is only useful in environments with multiple TCP networks and is typically populated by the system administrator see the LAM MPI Installation Guide for more det
151. plication (i.e., it starts over from main). This is described in detail below.

Overview

The self module can be specifically selected by setting the cr SSI parameter to the value self. Manually selecting the self module will force the MPI thread level to be at least MPI_THREAD_SERIALIZED.

At each of the Checkpoint, Continue, and Restart phases, LAM will make a callback to a user-specified function to do whatever is required for that phase (e.g., save or load application-level data). LAM does this by dynamically looking up functions by name at run time. The following function names are, by default, looked up and invoked at each phase:

- Checkpoint phase: int lam_cr_self_checkpoint(void)
- Continue phase: int lam_cr_self_continue(void)
- Restart phase: int lam_cr_self_restart(void)

To be absolutely clear: these functions are to be provided by the application; they are not included in the LAM library. If one of these functions cannot be found at run time, the self module will skip that phase's invocation.

The default function names can be overridden in two ways:

1. Use the cr_self_user_prefix SSI parameter to specify a prefix for all three functions. This will cause LAM to assume that the Checkpoint, Restart, and Continue functions are named prefix_checkpoint, prefix_restart, and prefix_continue, respectively, where prefix is the string value of the cr_self_user_prefix SSI parameter. For example:
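A hedged illustration follows (the program name and the prefix string "my_app" are made up for this example; the -ssi mechanism and parameter names are the ones described above):

shell$ mpirun C -ssi cr self -ssi cr_self_user_prefix my_app my_mpi_program

With that prefix, LAM would look for my_app_checkpoint, my_app_restart, and my_app_continue in the application. A minimal sketch of such user-provided callbacks is shown below; what each function saves or restores is entirely up to the application, and the nonzero return value is assumed here purely for illustration.

/* Sketch of user-provided callbacks for the self cr module.     */
/* These are looked up by name at run time; they are not part of */
/* the LAM library and must be supplied by the application.      */
#include <stdio.h>

int my_app_checkpoint(void)
{
    /* Write whatever application-level state is needed to stable storage. */
    printf("checkpoint: saving application state\n");
    return 1;
}

int my_app_continue(void)
{
    /* Invoked in the original process after a successful checkpoint. */
    return 1;
}

int my_app_restart(void)
{
    /* Invoked during MPI_INIT of the restarted instance; reload state here. */
    printf("restart: reloading application state\n");
    return 1;
}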
152. ps This is useful in environments where the same hostname may resolve to different IP addresses on different nodes e g clusters based on Finite Neighborhood Networks Ton e prefix lt lam install path gt Use the LAM MPI installation specified in the lt lam install path gt where lt lam install path gt is the top level directory where LAM MPI is installed This is typically used when a user has multiple LAM MPI installations and want to switch between them without changing the dot files or PATH environment variable This option is not compatible with LAM MPI versions prior to 7 1 Lan See http www aggregate org for more details 49 e s Close the stdout and stderr of the locally launched LAM daemon they are normally left open This is necessary when invoking Lamboot via a remote agent such as rsh or ssh e v Print verbose output This is useful to show progress during lamboot s progress Unlike d v does not forward output to a file or syslog e x Run the LAM RTE in fault tolerant mode e lt filename gt The name of the boot schema file Boot schemas while they can be as simple as a list of hostnames can contain additional information and are discussed in detail in Sections 4 4 1 and 8 1 1 pages 26 and 65 respectively Booting the LAM RTE is where most users particularly first time users encounter problems Each boot module has its own specific requirements and prerequisites for suc
153. r LAM MPI commands Each command also has its own manual page which typically provides more detail than this document 7 1 The lamboot Command The lamboot command is used to start the LAM run time environment RTE lamboot is typically the first command used before any other LAM MPI command notable exceptions are the wrapper compilers which do not require the LAM RTE and mpiexec which can launch its own LAM universe lamboot can use any of the available boot SSI modules Section 8 1 details the requirements and operations of each of the boot SSI modules that are included in the LAM MPI distribution Common arguments that are used with the Lamboot command are e b When used with the rsh boot module the fast boot algorithm is used which can noticeably speed up the execution time of lamboot It can also be used where remote shell agents cannot provide output from remote nodes e g in a Condor environment Specifically the fast algorithm assumes that the user s shell on the remote node is the same as the shell on the node where lamboot was invoked e d Print debugging output This will print a lot of output and is typically only necessary if Lamboot fails for an unknown reason The output is forwarded to standard out as well as either tmp or syslog facilities The amount of data produced can fill these filesystems leading to general system problems e 1 Use local hostname resolution instead of centralized looku
154. r LAM MPI to function properly is for the LAM executables to be in your path This step may vary from site to site for example the LAM executables may already be in your path consult your local administrator to see if this is the case NOTE If the LAM executables are already in your path you can skip this step and proceed to Sec tion 4 2 In many cases if your system does not already provide the LAM executables in your path you can add them by editing your dot files that are executed automatically by the shell upon login both interactive and non interactive logins Each shell has a different file to edit and corresponding syntax so you ll need to know which shell you are using Tables 4 1 and 4 2 list several common shells and the associated files that are typically used Consult the documentation for your shell for more information 23 Shell name Interactive login startup file sh or Bash profile named sh csh cshrc followed by login tcsh tcshrc if it exists cshrc if it does not followed by login bash bash profile if it exists or bash login if it exists or profile if it exists in that order Note that some Linux dis tributions automatically come with bash_profile scripts for users that automatically execute bashrc as well Consult the bash manual page for more information Table 4 1 List of common shells and the corresponding environmental setup files commonly
155. r an operator is associative or not This parameter 1f defined to 1 asserts that all reduction operations on a communicator are assumed to be associative If undefined or defined to 0 all reduction operations are assumed to be non associative This parameter is examined during every reduction operation See Commutative and Associative Reduction Operators below e coll crossover If set define the maximum number of processes that will be used with a linear algorithm More than this number of processes may use some other kind of algorithm This parameter is only examined during MPI_INIT e coll reduce crossover For reduction operations the determination as to whether an algo rithm should be linear or not is not based on the number of process but rather by the number of bytes to be transferred by each process If this parameter is set it defines the maximum number of bytes transferred by a single process with a linear algorithm More than this number of bytes may result in some other kind of algorithm This parameter is only examined during MPI_INIT 90 Commutative and Associative Reduction Operators MPI 1 defines that all built in reduction operators are commutative User defined reduction operators can specify whether they are commutative or not The MPI standard makes no provisions for whether a reduction operation is associative or not For some operators and datatypes this distinction is largely irrelevant e g find the maximum
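Like other SSI parameters, these can be supplied on the mpirun command line. For example (the values below are purely illustrative; whether they help at all depends entirely on the application and platform):

shell$ mpirun C -ssi coll_crossover 4 -ssi coll_reduce_crossover 4096 my_mpi_program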
156. r infor mation about debugging MPI programs in parallel e System administrator Unless you re also a parallel programmer you re reading the wrong docu ment You should be reading the LAM MPI Installation Guide 14 for detailed information on how to configure compile and install LAM MPI 11 12 Chapter 2 Introduction to LAM MPI This chapter provides a summary of the MPI standard and the LAM MPI implementation of that standard 2 1 About MPI The Message Passing Interface MPI 2 7 is a set of API functions enabling programmers to write high performance parallel programs that pass messages between processes to make up an overall parallel job MPI is the culmination of decades of research in parallel computing and was created by the MPI Forum an open group representing a wide cross section of industry and academic interests More information including the both volumes of the official MPI standard can be found at the MPI Forum web site MPI is suitable for big iron parallel machines such as the IBM SP SGI Origin etc but it also works in smaller environments such as a group of workstations Since clusters of workstations are readily available at many institutions it has become common to use them as a single parallel computing resource running MPI programs The MPI standard was designed to support portability and platform independence As a result users can enjoy cross platform development capability as well as
157. ract with SSI modules accept the -ssi command line switch. This switch expects two parameters to follow: the name of the SSI parameter and its corresponding value. For example:

shell$ mpirun C -ssi rpi tcp my_mpi_program

runs the my_mpi_program on all available CPUs in the LAM universe using the tcp RPI module.

Communicator Attributes

Some SSI types accept SSI parameters via MPI communicator attributes (notably the MPI collective communication modules). These parameters follow the same rules and restrictions as normal MPI attributes. Note that for portability between 32- and 64-bit systems, care should be taken when setting and getting attribute values. The following is an example of portable attribute C code:

int flag, attribute_val;
void *set_attribute;
void **get_attribute;
MPI_Comm comm = MPI_COMM_WORLD;
int keyval = LAM_MPI_SSI_COLL_BASE_ASSOCIATIVE;

/* Set the value */
set_attribute = (void *) 1;
MPI_Comm_set_attr(comm, keyval, &set_attribute);

/* Get the value */
get_attribute = NULL;
MPI_Comm_get_attr(comm, keyval, &get_attribute, &flag);
if (flag == 1) {
    attribute_val = (int) *get_attribute;
    printf("Got the attribute value: %d\n", attribute_val);
}

Specifically, the following code is neither correct nor portable:

int flag, attribute_val;
MPI_Comm comm = MPI_COMM_WORLD;
int keyval = LAM_MPI_SSI_COLL_BASE_ASSOCIATIVE;

/* Set the value */
158. ram. Note that no additional compiler and linker flags are required for correct MPI compilation or linking. The resulting my_mpi_program is ready to run in the LAM run-time environment. Similarly, the other two wrapper compilers can be used to compile MPI programs for their respective languages:

shell$ mpiCC -O c++_program.cc -o my_c++_mpi_program
shell$ mpif77 -O f77_program.f -o my_f77_mpi_program

Note, too, that any other compiler/linker flags can be passed through the wrapper compilers (such as -g and -O); they will simply be passed to the back-end compiler.

Finally, note that giving the -showme option to any of the wrapper compilers will show both the name of the back-end compiler that will be invoked and also all the command line options that would have been passed for a given compile command. For example (line breaks added to fit in the documentation):

shell$ mpiCC -O c++_program.cc -o my_c++_program -showme
g++ -I/usr/local/lam/include -pthread -O c++_program.cc -o my_c++_program \
    -L/usr/local/lam/lib -llammpio -llammpi++ -lpmpi -llamf77mpi -lmpi -llam \
    -lutil -pthread

Note that the wrapper compilers only add all the LAM/MPI-specific flags when a command line argument that does not begin with a dash ("-") is present. For example:

shell$ mpicc
gcc: no input files
shell$ mpicc --version
gcc (GCC) 3.2.2 (Mandrake Linux 9.1 3.2.2-3mdk)
Copyr
159. re are less processors than processes on a node L 005 on e Fast support is available and slightly decreases the latency for short gm messages However it is unreliable and is subject to timeouts for MPI applications that do not invoke the MPI progression engine often and is therefore not the default behavior e Support for the gm_get function in the GM 2 x series is available starting with LAM MPI 7 1 but is disabled by support See the Installation Guide for more details e Checkpoint restart support is included for the gm module but is only possible when the gm module was compiled with support for gm_get Lan 9 3 4 The ib Module Infiniband Tan Module Summary Name ib Kind rpi Default SSI priority 50 Checkpoint restart no The ib RPI module is for native message passing over Infiniband networking hardware The ib RPI provides low latency high bandwidth message passing performance Be sure to also read the release notes entitled Operating System Bypass Communication Myrinet and Infiniband in the LAM MPI Installation Guide for notes about memory management with Infiniband Specifically it deals with LAM s automatic overrides of the malloc calloc and free func tions 81 Overview In general using the ib RPI module is just like using any other RPI module MPI functions will simply use native Infiniband message passing for their back end message transport Although it is not
160. recon Command ... 63
7.17 The tping Command ... 64
7.18 The lamwipe Command ... 64

8 Available LAM Modules ... 65
8.1 Booting the LAM Run-Time Environment ... 65
8.1.1 Boot Schema Files (a.k.a. "Hostfiles" or "Machinefiles") ... 65
8.1.2 Minimum Requirements ... 67
8.1.3 Selecting a boot Module ... 67
8.1.4 boot SSI Parameters ... 67
8.1.5 The bproc Module ... 67
8.1.6 The globus Module ... 69
8.1.7 The rsh Module (including ssh) ... 70
8.1.8 The slurm Module ... 71
8.1.9 The tm Module (OpenPBS / PBS Pro / Torque) ... 73

9 Available MPI Modules ... 75
9.1 General MPI SSI Parameters ... 75
9.2 MPI Module Selection Process ... 75
9.3 MPI Point-to-point Communication (Request Progression Interface / RPI) ... 76
9.3.1 Two Different Shared Memory RPI Modules ... 77
9.3.2 The crtcp Module (Checkpoint-able TCP Communication) ... 77
9.3.3 The gm Module (Myrinet) ... 78
9.3.4 The ib Module (Infiniband) ... 81
9.3.5 The lamd Module (Daemon-Based Communication) ... 85
9.3.6 The sysv Module (Shared Memory Using System V Semaphor
161. refer to the same executable This obviously makes distinguishing between the mpicc and mpiCc wrapper compilers impossible LAM will attempt to determine if you are building on a case insensitive filesystem If you are the C wrapper compiler will be called mpic Otherwise the C compiler will be called mpicc although mpic will also be available NFS shared tmp The LAM per session directory may not work properly when hosted in an NFS di rectory and may cause problems when running MPI programs and or supplementary LAM run time en vironment commands If using a local filesystem is not possible e g on diskless workstations the use of tmpfs or tinyfs is recommended LAM s session directory will not grow large it contains a small amount of meta data as well as known endpoints for Unix sockets to allow LAM MPI programs to contact the local LAM run time environment daemon AFS and tokens permissions AFS has some peculiarities especially with file permissions when using rsh ssh Many sites tend to install the AFS rsh replacement that passes tokens to the remote machine as the default rsh Similarly most modern versions of ssh have the ability to pass AFS tokens Hence if you are using the rsh boot module with recon or lamboot your AFS token will be passed to the remote LAM daemon automatically If your site does not install the AFS replacement rsh as the default consult the documentation on with rsh to see how to set the path to
162. res at least one authentication mechanism to be specified none or key For simplicity these instructions assume that the none mechanism will be used Only one IMPI server needs to be launched per IMPI job regardless of how many clients will connect For this example assume that there will be 2 IMPI clients client 0 will be run in LAM MPI and client 1 will be run elsewhere shell export IMPLAUTH_NONE shell impi_server server 2 auth 0 10 0 0 32 9283 The IMPI server must be left running for the duration of the IMPI job The string that the IMPI server gives as output 10 0 0 32 9283 in this case must be given to mpirun when starting the LAM process that will run in IMPI shell mpirun client 0 10 0 0 32 9283 C my_mpi program This will run the MPI program in the local LAM universe and connect it to the IMPI server From there the IMPI protocols will take over and join this program to all other IMPI clients Note that LAM will launch an auxiliary helper MPI program named impid that will last for the duration of the IMPI job It acts as a proxy to the other IMPI processes and should not be manually killed It will die on its own accord when the IMPI job is complete If something goes wrong it can be killed with the lamclean command just like any other MPI process 12 6 4 Complex Network Setups In some complex network configurations particularly those that span multiple private n
163. rgu ments to be passed to mpi run to restart the application For example shell lamrestart ssi cr self ssi cr_restart_args args to mpirun See Section 9 5 for more detail about the checkpoint restart capabilities of LAM MPI including details about the blcr and self cr modules 7 10 The lamshrink Command The lamshrink command is used to remove a node from a LAM universe shell lamshrink n3 removes node n3 from the LAM universe Note that all nodes with ID s greater than 3 will not have their ID s reduced by one n3 simply becomes an empty slot in the LAM universe mpirun and lamexec will still function correctly even when used with C and N notation they will simply skip the n3 since there is no longer an operational node in that slot Note that the 1amgrow command can optionally be used to fill the empty slot with a new node 7 11 The mpicc mpiCC mpic and mpif77 Commands Compiling MPI applications can be a complicated process because the list of compiler and linker flags required to successfully compile and link a LAM MPI application not only can be quite long it can change depending on the particular configuration that LAM was installed with For example if LAM includes native support for Myrinet hardware the 1gm flag needs to be used when linking MPI executables To hide all this complexity wrapper compilers are provided that handle all of this automatically They are calle
164. ror conditions such as if a node fails The Lamwipe command takes most of the same parameters as the Lamboot command it launches a process on each node in the boot schema to kill the LAM RTE on that node Hence it should be used with the same or an equivalent boot schema file as was used with lamboot 64 Chapter 8 Available LAM Modules There is currently only type of LAM module that is visible to users boot which is used to start the LAM run time environment most often through the lamboot command The lamboot command itself is discussed in Section 7 1 page 49 the discussion below focuses on the boot modules that make up the back end implementation of Lamboot 8 1 Booting the LAM Run Time Environment LAM provides a number of modules for starting the 1amd control daemons In most cases the 1amds are started using the lamboot command In previous versions of LAM MPI lamboot could only use rsh or ssh for starting the LAM run time environment on remote nodes In LAM MPI 7 1 3 it is possible to use a variety of mechanisms for this process startup The following mechanisms are available in LAM MPI 7 1 3 e BProc e Globus beta level support e rsh ssh e OpenPBS PBS Pro Torque using the Task Management interface e SLURM using its native interface These mechanisms are discussed in detail below Note that the sections below each assume that support for these modules have been compiled into LAM MPI The 1aminfo
165. s To use the Absoft Fortran compilers with LAM MPI on OS X you must have at least version 9 0 EP En hancement Pack Contact mailto support absoft com for details 3 4 6 Microsoft Windows Cygwin LAM MPI is supported on Microsoft Windows ay Cygwin 1 5 5 Currently tcp sysv usysv and tcp RPIs are supported ROMIO is not suported In Microsoft Windows Cygwin IPC services are provided by the CygIPC module Hence in stallation and use of the sysv and usysv RPIs require this module Specifically sysv and usysv RPIs are installed if and only if the library Libcygipc a is found and ipc daemon2 exe is running when configuring LAM MPI Furthermore to use these RPIs it is necessary to have ipc daemon2 exe run ning on all the nodes For detailed instructions on configuring these RPIs please refer to the LAM MPI Installation Guide Since there are some issues with the use of the native Cygwin terminal for standard IO redirection it is advised to run MPI applications on xterm For more information on getting X services for Cygwin please see the XFree86 web site http www cygwin com 21 T aag L 7 12 Ton Although we have tried to port the complete functionality of LAM MPI to Cygwin because of some La outstanding portability issues execution of LAM MPI applications on Cygwin may not always be reliable 3 4 7 Solaris T a The gm RPI will fail to function properly on versions of Solaris older than
166. s starting with cO and incrementing the CPU index number Note that unless otherwise specified LAM schedules processes by CPU vs scheduling by node For example using mpirun s np switch to specify an absolute number of processes schedules on a per CPU basis 7 14 3 Per Process Controls mpirun allows for arbitrary per process controls such as launching MPMD jobs passing different com mand line arguments to different MPI COMM_WORTLD ranks etc This is accomplished by creating a text file called an application schema that lists one per line the location relevant flags user executable and command line arguments for each process For example lines beginning with are comments Start the manager on cO with a specific set of command line options c0 manager manager argl manager_arg2 manager_arg3 Start the workers on all available CPUs with different arguments C worker worker_argl worker_arg2 worker_arg3 Note that the ssi switch is not permissible in application schema files ssi flags are considered to be global to the entire MPI job not specified per process Application schemas are described in more detail in the appschema 5 manual page 7 14 4 Ability to Pass Environment Variables All environment variables with names that begin with LAM_MP1 _ are automatically passed to remote notes unless disabled via the nx option to mpirun Additionally the x option enables exporting of specific environment varia
167. sage is usually sent immediately. As such, the message is usually at least partially sent before an MPI_CANCEL is issued. Trying to chase down all the particular cases is a nightmare, to say the least.

As such, the LAM Team decided not to implement MPI_CANCEL on sends, and instead concentrate on other features. But, in true MPI Forum tradition, we would be happy to discuss any code that someone would like to submit that fully implements MPI_CANCEL.

5.2 MPI-2 Support

LAM 7.1.3 has support for many MPI-2 features. The main chapters of the MPI-2 standard are listed below, along with a summary of the support provided for each chapter.

5.2.1 Miscellany

Portable MPI Process Startup: The mpiexec command is now supported. Common examples include:

# Runs 4 copies of the MPI program my_mpi_program
shell$ mpiexec -n 4 my_mpi_program

# Runs my_linux_program on all available Linux machines, and runs
# my_solaris_program on all available Solaris machines
shell$ mpiexec -arch linux my_linux_program : -arch solaris my_solaris_program

# Boot the LAM run-time environment, run my_mpi_program on all available
# CPUs, and then shut down the LAM run-time environment
shell$ mpiexec -machinefile hostfile my_mpi_program

See the mpiexec(1) man page for more details on supported options as well as more examples.

Passing NULL to MPI_INIT: Passing NULL as both arguments to MPI_INIT is fully supported.

Version Number: LAM
168. ss of whether they are using the ib RPI module or not See Section 3 3 1 page 18 for more information on LAM s memory allocation managers Memory Checking Debuggers When running LAM s gm RPI through a memory checking debugger see Section 10 4 a number of Read from unallocated RUA and or Read from uninitialized RFU errors may appear originating from func tions beginning with gm_ or lam_ssi_rpi_gm_ These RUA RFU errors are normal they are not 80 actually reads from unallocated sections of memory The Myrinet hardware and gm kernel device driver handle some aspects of memory allocation and therefore the operating system debugging environment is not always aware of all valid memory As a result a memory checking debugger will often raise warnings even though this is valid behavior Known Issues As of LAM 7 1 3 the following issues still remain in the gm RPI module e Heterogeneity between big and little endian machines is not supported e The gm RPI is not supported with IMPI e Mixed shared memory GM message passing is not yet supported all message passing is through Myrinet GM e XMPI tracing is not yet supported T 7 03 e The gm RPI module is designed to run in environments where the number of available processors is greater than or equal to the number of MPI processes on a given node The gm RPI module will perform poorly particularly in blocking MPI communication calls if the
start a shell on the remote node. Instead, the entire environment of lamboot is pushed to the remote nodes before starting the LAM run-time environment.

8.1.9 The tm Module (OpenPBS / PBS Pro / Torque)

Both OpenPBS and PBS Pro (both products of Altair Grid Technologies, LLC) contain support for the Task Management (TM) interface. Torque, the open source fork of the OpenPBS product, also contains the TM interface. When using TM, rsh/ssh is not necessary to launch jobs on remote nodes.

The advantages of using the TM interface are:

- PBS/Torque can generate proper accounting information for all nodes in a parallel job.
- PBS/Torque can kill entire jobs properly when the job ends.
- lamboot executes significantly faster when using TM as compared to when it uses rsh/ssh.

Usage

When running in a PBS/Torque batch job, LAM will automatically detect that it should use the tm boot module; no extra command line parameters or environment variables should be necessary. Specifically, when running in a PBS/Torque job, the tm module will report that it is available, and artificially inflate its priority relatively high in order to influence the boot module selection process. However, the tm boot module can be forced by specifying the boot SSI parameter with the value of tm.

Unlike the rsh/ssh boot module, you do not need to specify a hostfile for the tm boot module. Instead, PBS/Torque itself provides the list of nodes (and associated CPU counts) to LAM.
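A brief sketch of this usage (it assumes the commands are run inside a PBS/Torque batch job; the -ssi form mirrors the document's other examples):

# Inside a PBS/Torque job: no hostfile is needed
shell$ lamboot

# Or force the tm boot module explicitly
shell$ lamboot -ssi boot tm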
stfile, and is helpful for running parallel processes (see below). Finally, the "origin" notation indicates which node lamboot was executed from; "this_node" obviously indicates which node lamnodes is running on.

4.5 Compiling MPI Programs

Note that it is not necessary to have LAM booted to compile MPI programs.

Compiling MPI programs can be a complicated process:

- The same compilers should be used to compile/link user MPI programs as were used to compile LAM itself.
- Depending on the specific installation configuration of LAM, a variety of -I, -L, and -l flags (and possibly others) may be necessary to compile and/or link a user MPI program.

LAM/MPI provides wrapper compilers to hide all of this complexity. These wrapper compilers simply add the correct compiler/linker flags and then invoke the underlying compiler to actually perform the compilation/link. As such, LAM's wrapper compilers can be used just like "real" compilers.

The wrapper compilers are named mpicc (for C programs), mpiCC and mpic++ (for C++ programs), and mpif77 (for Fortran programs). For example:

shell$ mpicc -g -c foo.c
shell$ mpicc -g -c bar.c
shell$ mpicc -g foo.o bar.o -o my_mpi_program

[Footnote: As of this writing, a Google search for "ssh keys" turned up several decent tutorials; including any one of them here would significantly increase the length of this already tremendously long manual.]
[Footnote: http://www.lam-mpi.org/faq/]
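The other wrappers behave the same way; for example (the source file names are illustrative, and the -showme option is assumed here to be the flag that prints the underlying compiler command without running it; check the mpicc(1) man page for this release):

shell$ mpiCC -g -c solver.cc
shell$ mpiCC -g solver.o -o my_cxx_program
shell$ mpif77 -O -c compute.f

# Assumed option: show the underlying compiler command without executing it
shell$ mpicc -showme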
Checkpoint / restart: no

The usysv RPI is one of two combination shared-memory / TCP message passing modules. Shared memory is used for passing messages to processes on the same node; TCP sockets are used for passing messages to processes on other nodes. Spin locks with back-off are used for synchronization of the shared memory pool (a System V semaphore or pthread mutex is also used for access to the per-node shared memory pool).

The nature of spin locks means that the usysv RPI will perform poorly when there are more processes than processors (particularly in blocking MPI communication calls). If no higher-priority RPI modules are available (e.g., Myrinet/gm) and the user does not select a specific RPI module through the rpi SSI parameter, usysv may be selected as the default, even if there are more processes than processors. Users should keep this in mind; in such circumstances, it is probably better to manually select the sysv or tcp RPI modules.

Overview

Aside from synchronization, the usysv RPI module is almost identical to the sysv module. The usysv module uses spin locks with back-off. When a process backs off, it attempts to yield the processor. If the configure script found a system-provided yield function, it is used. If no such function is found, then select() on NULL file descriptor sets with a timeout of 10us is used.

[Footnote: Such as yield() or sched_yield().]

Tunable Parameters

Table 9.7 shows the SSI parameters that may be changed at run-time.
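For example, when a node will be over-subscribed, the selection can be made explicitly with the same -ssi syntax used elsewhere in this document:

# Avoid usysv's spin locks when there are more processes than processors
shell$ mpirun -ssi rpi sysv C my_mpi_program
shell$ mpirun -ssi rpi tcp C my_mpi_program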
it is performing message passing functions or not, in the MPI communications layer. LAM does not provide checkpoint/restart functionality itself; cr SSI modules are used to invoke back-end systems that save and restore checkpoints. The following notes apply to checkpointing parallel MPI jobs:

- No special code is required in MPI applications to take advantage of LAM/MPI's checkpoint/restart functionality, although some limitations may be imposed (depending on the back-end checkpointing system that is used).
- LAM's checkpoint/restart functionality only involves MPI processes; the LAM universe is not checkpointed. A LAM universe must be independently established before an MPI job can be restored.
- LAM does not yet support checkpointing/restarting MPI-2 applications. In particular, LAM's behavior is undefined when checkpointing MPI processes that invoke any non-local MPI-2 functionality (including dynamic functions and I/O).
- Migration of restarted processes is available on a limited basis; the crtcp RPI will start up properly regardless of what nodes the MPI processes are re-started on, but other system-level resources may or may not be restarted properly (e.g., open files, shared memory, etc.).
- Checkpoint files are saved using a two-phase commit protocol that is coordinated by mpirun. mpirun initiates a checkpoint request for each process in the MPI job by supplying a temporary context file name. If all the checkpoint requests complete
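A hedged sketch of launching a checkpointable job: the crtcp RPI (which supports checkpoint/restart) is paired with a cr back-end module. The module name blcr below is an assumption; check laminfo for the cr modules actually built into your installation.

# Assumes a "blcr" cr module is available; verify with laminfo
shell$ mpirun -ssi rpi crtcp -ssi cr blcr C my_mpi_program

The checkpoint itself is then driven externally (for example, by the lamcheckpoint and lamrestart commands); consult their man pages for the exact options.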
the rsh that LAM will use. Once you use the replacement rsh or an AFS-capable ssh, you should get a token on the target node when using the rsh boot module. This means that your LAM daemons are running with your AFS token, and you should be able to run any program that you wish, including those that are not system:anyuser accessible. You will even be able to write into AFS directories where you have write permission (as you would expect).

Keep in mind, however, that AFS tokens have limited lives, and will eventually expire. This means that your LAM daemons (and user MPI programs) will lose their AFS permissions after some specified time unless you renew your token (with the klog command, for example) on the originating machine before the token runs out. This can play havoc with long-running MPI programs that periodically write out file results: if you lose your AFS token in the middle of a run, and your program tries to write out to a file, it will not have permission to, which may cause Bad Things to happen.

[Footnote: If you are using a different boot module, you may experience problems with obtaining AFS tokens on remote nodes.]

If you need to run long MPI jobs with LAM on AFS, it is usually advisable to ask your AFS administrator to increase your default token life time to a large value, such as 2 weeks.

3.4.3 Dynamic / Embedded Environments

In LAM/MPI version 7.1.3, some RPI modules may utilize an additional memory manager mechanism (see Section
then a second from the wrap-around)
- 4 hellos on n2 (2 from the first pass, and then 2 more from the wrap-around)
- 2 hellos on n3

The mpirun(1) man page contains much more information about mpirun and the options available. For example, mpirun also supports Multiple Program, Multiple Data (MPMD) programs, although it is not discussed here. Also see Section 7.14 (page 60) in this document.

4.6.2 The mpiexec Command

The MPI-2 standard recommends the use of mpiexec for portable MPI process startup. In LAM/MPI, mpiexec is functionally similar to mpirun. Some options that are available to mpirun are not available to mpiexec, and vice versa. The end result is typically the same, however; both will launch parallel MPI programs. Which you should use is likely simply a personal choice.

That being said, mpiexec offers more convenient access in three cases:

- Running MPMD programs
- Running heterogeneous programs
- Running "one-shot" MPI programs (i.e., boot LAM, run the program, then halt LAM)

The general syntax for mpiexec is:

shell$ mpiexec <global_options> <cmd1> : <cmd2> : ...

[Footnote: Note that the use of the word "schedule" does not imply that LAM has ties with the operating system for scheduling purposes (it doesn't). LAM "schedules" on a per-node basis, so selecting a process to run means that it has been assigned and launched on that node. The operating system is solely responsible for all process and kernel scheduling.]
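As a small MPMD illustration (the executable names are made up; the colon-separated syntax is the one shown above):

# One manager process plus four workers
shell$ mpiexec -n 1 manager : -n 4 worker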
thread level, if necessary, to be at least the lower bound of thread levels that the selected rpi module supports.

6. Eliminate all coll and cr modules that cannot operate at the current thread level.

7. If no coll modules remain, abort. Final selection of coll modules is discussed in Section 9.4.1 (page 89).

8. If no cr modules remain and checkpoint/restart support was specifically requested, abort. Otherwise, select the highest priority cr module.

9.3 MPI Point-to-point Communication (Request Progression Interface / RPI)

LAM provides multiple SSI modules for MPI point-to-point communication. Also known as the Request Progression Interface (RPI), these modules are used for all aspects of MPI point-to-point communication in an MPI application. Some of the modules require external hardware and/or software (e.g., the native Myrinet RPI module requires both Myrinet hardware and the GM message passing library). The laminfo command can be used to determine which RPI modules are available in a LAM installation.

Although one RPI module will likely be the default, the selection of which RPI module is used can be changed through the SSI parameter rpi. For example:

shell$ mpirun -ssi rpi tcp C my_mpi_program

runs the my_mpi_program executable on all available CPUs using the tcp RPI module, while:

shell$ mpirun -ssi rpi gm C my_mpi_program

runs the my_mpi_program executable on all available CPUs using the gm RPI module.

It should
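The same parameter can also be set through the environment instead of on each command line; the variable form below assumes LAM's LAM_MPI_SSI_<param> naming convention, so verify it against the SSI documentation for this release:

# Assumed environment-variable equivalent of "-ssi rpi tcp"
shell$ export LAM_MPI_SSI_rpi=tcp
shell$ mpirun C my_mpi_program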
tified by ranges. The syntax for these concepts is n<range> and c<range>, respectively. <range> can specify one or more elements by listing integers separated by commas and dashes. For example:

- n3: The node with an ID of 3.
- c2: The CPU with an ID of 2.
- n2,4: The nodes with IDs of 2 and 4.
- c2,4-7: The CPUs with IDs of 2, 4, 5, 6, and 7. Note that some of these CPUs may be on the same node(s).

Integers can range from 0 to the highest numbered node/CPU. Note that these nomenclatures can be mixed and matched on the mpirun command line:

shell$ mpirun n0 C manager-worker

will launch the manager-worker program on n0 as well as on every schedulable CPU in the universe (yes, this means that n0 will likely be over-subscribed).

When running on SMP nodes, it is preferable to use the C / c<range> nomenclature (with appropriate CPU counts in the boot schema) to the N / n<range> nomenclature because of how LAM will order ranks in MPI_COMM_WORLD. For example, consider a LAM universe of two four-way SMPs: n0 and n1 both have a CPU count of 4. Using the following:

shell$ mpirun C my_mpi_program

will launch eight copies of my_mpi_program, four on each node. LAM will place as many adjoining MPI_COMM_WORLD ranks on the same node as possible: MPI_COMM_WORLD ranks 0-3 will be scheduled on n0 and MPI_COMM_WORLD ranks 4-7 will be scheduled on n1. Specifically, C schedules processes
tion of how LAM's overall checkpoint/restart functionality is used.

The crtcp module's checkpoint/restart functionality is invoked when the cr module indicates that it is time to perform a checkpoint. The crtcp module then quiesces all "in-flight" MPI messages and then allows the checkpoint to be performed. Upon restart, TCP connections are re-formed and message passing processing continues. No additional buffers or "rollback" mechanisms are required, nor is any special coding required in the user's MPI application.

Tunable Parameters

The crtcp module has the same tunable parameters as the tcp module (maximum size of a short message and amount of OS socket buffering), although they have different names: rpi_crtcp_short, rpi_crtcp_sockbuf.

SSI parameter name    Default value   Description
rpi_crtcp_priority    25              Default priority level
rpi_crtcp_short       65535           Maximum length (in bytes) of a "short" message
rpi_crtcp_sockbuf     -1              Socket buffering in the OS kernel (-1 means use the short message size)

Table 9.1: SSI parameters for the crtcp RPI module.

9.3.3 The gm Module (Myrinet)

Module Summary
Name: gm
Kind: rpi
Default SSI priority: 50
Checkpoint / restart: yes

The gm RPI module is for native message passing over Myrinet networking hardware. The gm RPI provides low latency, high bandwidth message passing performance. Be sure to also
significantly higher latency and lower bandwidth.

Overview

Rather than send messages directly from one MPI process to another, all messages are routed through the local LAM daemon, the remote LAM daemon (if the target process is on a different node), and then finally to the target MPI process. This potentially adds two hops to each MPI message.

Although the latency incurred can be significant, the lamd RPI can actually make message passing progress "in the background." Specifically, since LAM/MPI is a single-threaded MPI implementation, it can typically only make progress passing messages when the user's program is in an MPI function call. With the lamd RPI, since the messages are all routed through separate processes, message passing can actually occur when the user's program is not in an MPI function call. User programs that utilize latency-hiding techniques can exploit this asynchronous message passing behavior, and therefore actually achieve high performance in spite of the high overhead associated with the lamd RPI.

Tunable Parameters

The lamd module has only one tunable parameter: its priority.

SSI parameter name    Default value   Description
rpi_lamd_priority     10              Default priority level

Table 9.4: SSI parameters for the lamd RPI module.

[Footnote: Several users on the LAM/MPI mailing list have mentioned this specifically: even though the lamd RPI is slow, it provides significantly better performance because it can provide asynchronous message passing progress.]
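A minimal C sketch of the latency-hiding pattern referred to above: post non-blocking operations, do useful work, then complete them. Nothing here is lamd-specific; the lamd RPI simply allows more of the transfer to proceed while compute() runs.

#include <mpi.h>

extern void compute(void);   /* placeholder for application work */

void exchange(int peer, double *sendbuf, double *recvbuf, int n)
{
    MPI_Request reqs[2];

    /* Post the communication early... */
    MPI_Irecv(recvbuf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ...overlap it with computation... */
    compute();

    /* ...and complete it when the results are actually needed. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}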
arbitrary processes on the target deserves a BProc-specific clarification: BProc has its own internal permission system for determining if users are allowed to execute on specific nodes. The system is similar to the user/group/other mechanism typically used in many Unix filesystems. Hence, in order for a user to successfully lamboot on a BProc cluster, he/she must have BProc execute permissions on each of the target nodes. Consult the BProc documentation for more details.

Usage

In most situations, the lamboot command (and related commands) should automatically know to use the bproc boot SSI module when running on the BProc head node; no additional command line parameters or environment variables should be required. Specifically, when running in a BProc environment, the bproc module will report that it is available, and artificially inflate its priority relatively high in order to influence the boot module selection process. However, the BProc boot module can be forced by specifying the boot SSI parameter with the value of bproc.

Running lamboot on a BProc cluster is just like running lamboot in a "normal" cluster. Specifically, you provide a boot schema file (i.e., a list of nodes to boot on) and run lamboot with it. For example:

shell$ lamboot hostfile

Note that when using the bproc module, lamboot will only function properly from the head node. If you launch lamboot from a client node, it will likely either fail outright or fall
its best to take down the LAM RTE, even if errors occur, either during the boot itself, or if an MPI process aborts, or the user hits Control-C.

7.13 The mpimsg Command (Deprecated)

The mpimsg command is deprecated. It is only useful in a small number of cases (specifically, when the lamd RPI module is used), and may disappear in future LAM/MPI releases.

7.14 The mpirun Command

The mpirun command is the main mechanism to launch MPI processes in parallel.

7.14.1 Simple Examples

Although mpirun supports many different modes of execution, most users will likely only need to use a few of its capabilities. It is common to launch either one process per node or one process per CPU in the LAM universe (CPU counts are established in the boot schema). The following two examples show these two cases:

# Launch one copy of my_mpi_program on every schedulable node in the LAM universe
shell$ mpirun N my_mpi_program

# Launch one copy of my_mpi_program on every schedulable CPU in the LAM universe
shell$ mpirun C my_mpi_program

The specific number of processes that are launched can be controlled with the -np switch:

# Launch four my_mpi_program processes
shell$ mpirun -np 4 my_mpi_program

The -ssi switch can be used to specify tunable parameters to MPI processes:

# Specify to use the usysv RPI module
shell$ mpirun -ssi rpi usysv C my_mpi_program

The available modules and their associated parameters
run. The origin will be marked as "no-schedule," meaning that applications launched by mpirun and lamexec will not be run there unless specifically requested (see Section 7.1, page 49, for more detail about this attribute and boot schemas in general). This method is supported, and is perhaps the most common way to run LAM/MPI interactive jobs in SLURM environments.

- "srun" mode, where a script is submitted via the srun command and is executed on all nodes that SLURM allocated for the job. In this case, the commands in the script (e.g., lamboot, mpirun, etc.) will be run on all nodes simultaneously, which is most likely not what you want. This mode is not supported.

When running in any of the supported SLURM modes, LAM will automatically detect that it should use the slurm boot module; no extra command line parameters or environment variables should be necessary. Specifically, when running in a SLURM job, the slurm module will report that it is available, and artificially inflate its priority relatively high in order to influence the boot module selection process. However, the slurm boot module can be forced by specifying the boot SSI parameter with the value of slurm.

Unlike the rsh/ssh boot module, you do not need to specify a hostfile for the slurm boot module. Instead, SLURM itself provides a list of nodes (and associated CPU counts) to LAM. Using lamboot is therefore as simple as:

shell$ lamboot

Note that in environments
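In batch mode, the whole life cycle typically lives in the job script; a sketch (the script contents are illustrative, not taken from the manual):

#!/bin/sh
# SLURM batch script: SLURM supplies the node/CPU list, so no hostfile is needed
lamboot
mpirun C my_mpi_program
lamhalt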
used with each for interactive startups (e.g., normal login). All files listed are assumed to be in the $HOME directory.

Shell name       Non-interactive login startup file
sh (or Bash      This shell does not execute any file automatically, so LAM will
named "sh")      execute the .profile script before invoking LAM executables on
                 remote nodes
csh              .cshrc
tcsh             .tcshrc if it exists, .cshrc if it does not
bash             .bashrc if it exists

Table 4.2: List of common shells and the corresponding environmental setup files commonly used with each for non-interactive startups (e.g., normal login). All files listed are assumed to be in the $HOME directory.

You'll also need to know the directory where LAM was installed. For the purposes of this tutorial, we'll assume that LAM is installed in /usr/local/lam. And to re-emphasize a critical point: these are only guidelines; the specifics may vary depending on your local setup. Consult your local system or network administrator for more details.

Once you have determined all three pieces of information (what shell you are using, what directory LAM was installed to, and what the appropriate "dot" file to edit is), open the "dot" file in a text editor and follow the general directions listed below:

- For the Bash, Bourne, and Bourne-related shells, add the following lines:

PATH=/usr/local/lam/bin:$PATH
export PATH

- For the C shell and related shells (such as tcsh), add the following line:

set path = (/usr/local/lam/bin $path)
have the string "solaris" anywhere in their architecture string, and hello-linux on all nodes that have "linux" in their architecture string. The architecture string of a given LAM installation can be found by running the laminfo command.

One-Shot MPI Programs

In some cases, it seems like extra work to boot a LAM universe, run a single MPI job, and then shut down the universe. Batch jobs are good examples of this: since only one job is going to be run, why does it take three commands? mpiexec provides a convenient way to run "one-shot" MPI jobs:

shell$ mpiexec -machinefile hostfile hello

This will invoke lamboot with the boot schema named hostfile, run the MPI program hello on all available CPUs in the resulting universe, and then shut down the universe with the lamhalt command (which we'll discuss in Section 4.7, below).

4.6.3 The mpitask Command

The mpitask command is analogous to the sequential Unix command ps. It shows the current status of the MPI program(s) being executed in the LAM universe, and displays primitive information about what MPI function each process is currently executing (if any).

[Footnote: Note that in normal practice, the mpimsg command only gives a snapshot of what messages are flowing between MPI processes, and therefore is usually only accurate at that single point in time. To really debug message passing traffic, use a tool such as a message passing analyzer (e.g., XMPI) or a parallel debugger.]
universe.

LAM's run-time environment can be executed in many different environments. For example, it can be run interactively on a cluster of workstations (even on a single workstation, perhaps to simulate parallel execution for debugging and/or development). Or LAM can be run in production batch scheduled systems.

This example will focus on a traditional rsh / ssh-style workstation cluster (i.e., not under batch systems), where rsh or ssh is used to launch executables on remote workstations.

4.4.1 The Boot Schema File (a.k.a. "Hostfile", "Machinefile")

When using rsh or ssh to boot LAM, you will need a text file listing the hosts on which to launch the LAM run-time environment. This file is typically referred to as a "boot schema", "hostfile", or "machinefile". For example:

# My boot schema
node1.cluster.example.com
node2.cluster.example.com
node3.cluster.example.com cpu=2
node4.cluster.example.com cpu=2

Four nodes are specified in the above example by listing their IP hostnames. Note also the "cpu=2" that follows the last two entries. This tells LAM that these machines each have two CPUs available for running MPI programs (e.g., node3 and node4 are two-way SMPs). It is important to note that the number of CPUs specified here has no correlation to the physical number of CPUs in the machine. It is simply a convenience mechanism telling LAM how many MPI processes we will typically launch on that node.
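Once a boot schema like this exists, it is handed to lamboot; a brief sketch (the file name is illustrative, and -v is assumed to be the usual verbosity option):

shell$ lamboot -v hostfile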
vironment in a Globus environment makes use of the Globus Resource Allocation Manager (GRAM) client, globus-job-run. The Globus boot SSI module will never run automatically; it must always be specifically requested by setting the boot SSI parameter to globus. Specifically, although the globus module will report itself available if globus-job-run can be found in the PATH, the default priority will be quite low, effectively ensuring that it will not be selected unless it is the only module available (which will only occur if the boot parameter is set to globus).

LAM needs to be able to find the Globus executables. This can be accomplished either by adding the appropriate directory to your path, or by setting the GLOBUS_LOCATION environment variable.

Additionally, the LAM_MPI_SESSION_SUFFIX environment variable should be set to a unique value. This ensures that this instance of the LAM universe does not conflict with any other concurrent LAM universes that are running under the same username on nodes in the Globus environment. Although any value can be used for this variable, it is probably best to have some kind of organized format, such as <your_username>-<some_long_random_number>.

Next, create a boot schema to use with lamboot. Hosts are listed by their Globus contact strings (see the Globus manual for more information about contact strings). In cases where the Globus gatekeeper is running as an inetd service on the node, the contact string will
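Putting the pieces together, a hedged sketch of booting in a Globus environment (the install path, the session-suffix value, and the hostfile name are all placeholders):

# Make globus-job-run findable and give this universe a unique name
shell$ export GLOBUS_LOCATION=/opt/globus
shell$ export PATH=$GLOBUS_LOCATION/bin:$PATH
shell$ export LAM_MPI_SESSION_SUFFIX=myuser-123456789

# hostfile.globus lists one Globus contact string per line
shell$ lamboot -ssi boot globus hostfile.globus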
lamgrow will print an error and abort without adding the new node.

7.6 The lamhalt Command

The lamhalt command is used to shut down the LAM RTE. Typically, lamhalt can simply be run with no command line parameters, and it will shut down the LAM RTE. Optionally, the -v or -d arguments can be used to make lamhalt be verbose or extremely verbose, respectively.

There are a small number of cases where lamhalt will fail. For example, if a LAM daemon becomes unresponsive (e.g., the daemon was killed), lamhalt may fail to shut down the entire LAM universe. It will eventually timeout (and therefore complete in finite time), but you may want to use the last-resort lamwipe command (see Section 7.18).

7.7 The laminfo Command

The laminfo command can be used to query the capabilities of the LAM/MPI installation. Running laminfo with no parameters shows a pretty-print summary of information. Using the -parsable command line switch shows the same summary information, but in a format that should be relatively easy to parse with common Unix tools such as grep, cut, awk, etc.

laminfo supports a variety of command line options to query for specific information. The -h option shows a complete listing of all options. Some of the most common options include:

- -arch: Show the architecture that LAM was configured for.
- -path: Paired with a second argument, display various paths relevant to the LAM/MPI installation. Valid second arguments include prefix
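A couple of illustrative queries (the grep filter is just an example of post-processing the parsable output):

# Human-readable summary
shell$ laminfo

# Machine-parsable output, filtered for architecture information
shell$ laminfo -parsable | grep arch

# Show only the architecture LAM was configured for
shell$ laminfo -arch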
y_application.c -o my_application

For csh-like shells:

shell$ setenv LAMPICC cc
shell$ mpicc my_application.c -o my_application

All this being said, it is strongly recommended to use the wrapper compilers (and their default underlying compilers) for all compiling and linking of MPI applications. Strange behavior can occur in MPI applications if LAM/MPI was configured and compiled with one compiler and then user applications were compiled with a different underlying compiler, to include: failure to compile, failure to link, seg faults, and other random bad behavior at run-time.

Finally, note that the wrapper compilers only add all the LAM/MPI-specific flags when a command line argument that does not begin with a dash is present. For example:

shell$ mpicc
gcc: no input files
shell$ mpicc --version
gcc (GCC) 3.2.2 (Mandrake Linux 9.1 3.2.2-3mdk)
Copyright (C) 2002 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

7.11.1 Deprecated Names

Previous versions of LAM/MPI used the names hcc, hcp++, and hf77 for the wrapper compilers. While these command names still work (they are simply symbolic links to the real wrapper compilers mpicc, mpiCC/mpic++, and mpif77, respectively), their use is deprecated.

7.12 The mpiexec Command

The mpiexec command is used to launch MPI programs.
system. This is where the additional memory manager comes in. LAM will, by default, intercept calls to malloc(), calloc(), and free() by use of the ptmalloc, ptmalloc2, or Mac OS X dynlib functionality (note that C++ new and delete are not intercepted). However, this is actually only an unfortunate side effect: LAM really only needs to intercept the sbrk() function in order to catch memory before it is returned to the operating system. Specifically, an internal LAM routine runs during sbrk() to ensure that all memory is properly unpinned before it is given back to the operating system. There is, sadly, no easy, portable way to intercept sbrk() without also intercepting malloc() et al.

In most cases, however, this is not a problem: the user's application invokes malloc() and obtains heap memory just as expected, and the other memory functions also function as expected.

However, there are some applications that do their own intercepting of malloc() et al. These applications will not work properly with a default installation of LAM/MPI. To fix this problem, LAM allows you to disable all memory management, but only if the top-level application "promises" to invoke an internal LAM handler function when sbrk() is invoked, before the memory is returned to the operating system. This is accomplished by configuring LAM with the following switch:

shell$ ./configure --with-memory-manager=external

Surprisingly, this memory man