
Message Passing Toolkit (MPT) User's Guide


Contents

1. stdin/stdout/stderr
In this implementation, stdin is enabled only for those MPI processes with rank 0 in the first MPI_COMM_WORLD (which does not need to be located on the same host as mpirun). stdout and stderr results are enabled for all MPI processes in the job, whether launched via mpirun or via one of the MPI-2 spawn functions.
MPI_Get_processor_name
The MPI_Get_processor_name function returns the Internet host name of the computer on which the MPI process invoking this subroutine is running.
Programming Optimizations
This section describes ways in which the MPI application developer can best make use of optimized features of SGI's MPI implementation. Following the recommendations in this section might require modifications to your MPI application.
Using MPI Point-to-Point Communication Routines
MPI provides a number of different routines for point-to-point communication. The most efficient ones in terms of latency and bandwidth are the blocking and nonblocking send/receive functions (MPI_Send, MPI_Isend, MPI_Recv, and MPI_Irecv). Unless required for application semantics, the synchronous send calls (MPI_Ssend and MPI_Issend) should be avoided. The buffered send calls (MPI_Bsend and MPI_Ibsend) should also usually be avoided, as these double the amount of memory copying on the sender side. The ready send routines (MPI_Rsend and MPI_Irsend) are treated a…
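The recommendation above favors the plain blocking and nonblocking send/receive calls. As a minimal sketch (not taken from the manual), the following C program pairs up ranks and exchanges a buffer with MPI_Isend/MPI_Irecv, completing both requests with MPI_Waitall; the buffer size, tag, and pairing scheme are arbitrary choices for illustration.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, peer, i;
        double sendbuf[1024], recvbuf[1024];   /* arbitrary example payload */
        MPI_Request reqs[2];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        peer = rank ^ 1;                       /* pairs ranks 0-1, 2-3, ...; run with an even -np */

        for (i = 0; i < 1024; i++)
            sendbuf[i] = rank + i;

        /* Post the receive first, then the send; neither call blocks. */
        MPI_Irecv(recvbuf, 1024, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(sendbuf, 1024, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

        /* Useful computation could overlap here before completing the requests. */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

        MPI_Finalize();
        return 0;
    }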
2. …4 profile.pl -s1 -c4,5 -N 1000 a.out
histx
histx is a small set of tools that can assist with performance analysis and bottleneck identification.
General formats for histx (Histogram) and lipfpm (Linux IPF Performance Monitor):
    % mpirun -np 4 histx [histx options] a.out
    % lipfpm [lipfpm_options] mpirun -np 4 a.out
Examples:
    % mpirun -np 4 histx -f -o histx.out a.out
    % lipfpm -f -e LOADS_RETIRED -e STORES_RETIRED mpirun -np 4 a.out
Profiling Interface
You can write your own profiling by using the MPI-1 standard PMPI_ calls. In addition, either within your own profiling library or within the application itself, you can use the MPI_Wtime function call to time specific calls or sections of your code. The following example is actual output for a single rank of a program that was run on 128 processors, using a user-created profiling library that performs call counts and timings of common MPI calls. Notice that for this rank most of the MPI time is being spent in MPI_Waitall and MPI_Allreduce.
    Total job time 2.203333e+02 sec
    Total MPI processes 128
    Wtime resolution is 8.000000e-07 sec
    activity on process rank 0
    comm_rank calls 1     time 8.800002e-06
    get_count calls 0     time 0.000000e+00
    ibsend calls 0        time 0.000000e+00
    probe calls 0         time 0.000000e+00
    recv calls 0          time 0.000000e+00  avg datacnt 0  waits 0  wait time 0.0000…
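The profiling interface described above lets you interpose your own wrappers on MPI entry points. A minimal sketch of such a wrapper (not from the manual, and written against the MPI-1-era prototypes without const qualifiers): the intercepted MPI_Allreduce is timed with MPI_Wtime and forwarded to PMPI_Allreduce, and the totals are reported from a wrapped MPI_Finalize.

    #include <stdio.h>
    #include <mpi.h>

    /* Illustrative counters kept by the profiling wrapper. */
    static int    allreduce_calls = 0;
    static double allreduce_time  = 0.0;

    /* Intercept MPI_Allreduce; the real work is done by PMPI_Allreduce. */
    int MPI_Allreduce(void *sendbuf, void *recvbuf, int count,
                      MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
    {
        double t0 = MPI_Wtime();
        int rc = PMPI_Allreduce(sendbuf, recvbuf, count, datatype, op, comm);
        allreduce_time += MPI_Wtime() - t0;
        allreduce_calls++;
        return rc;
    }

    /* Report per-rank totals when the application shuts down. */
    int MPI_Finalize(void)
    {
        int rank;
        PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("rank %d: allreduce calls %d  time %e\n",
               rank, allreduce_calls, allreduce_time);
        return PMPI_Finalize();
    }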
3. …7 on the first host on CPUs 8 through 15; place MPI processes 8 through 15 on CPUs 16 to 23 on the second host.
Note that the process rank is the MPI_COMM_WORLD rank. The interpretation of the CPU values specified in MPI_DSM_CPULIST depends on whether the MPI job is being run within a cpuset. If the job is run outside of a cpuset, the CPUs specify cpunum values beginning with 0 and up to the number of CPUs in the system minus one. When running within a cpuset, the default behavior is to interpret the CPU values as relative processor numbers within the cpuset. The number of processors specified should equal the number of MPI processes that will be used to run the application. The number of colon-delineated parts of the list must equal the number of hosts used for the MPI job. If an error occurs in processing the CPU list, the default placement policy is used.
MPI_DSM_DISTRIBUTE
Use the MPI_DSM_DISTRIBUTE shell variable to ensure that each MPI process will get a physical CPU and memory on the node to which it was assigned. If this environment variable is used without specifying an MPI_DSM_CPULIST variable, it will cause MPI to assign MPI ranks starting at logical CPU 0 and incrementing until all ranks have been placed. Therefore, it is recommended that this variable be used only if running within a cpumemset or on a dedicated system.
MPI_DSM_PPM
The MPI_DSM_PPM shell variable allows you to specify the number of MPI proce…
4. …not completed. An unmatched request is any blocking send for which a corresponding recv is never posted. An incomplete request is any nonblocking send or recv request that was never freed by a call to MPI_Test, MPI_Wait, or MPI_Request_free. Common examples are applications that call MPI_Isend and then use internal means to determine when it is safe to reuse the send buffer. These applications never call MPI_Wait. You can fix such codes easily by inserting a call to MPI_Request_free immediately after all such isend operations, or by adding a call to MPI_Wait at a later place in the code, prior to the point at which the send buffer must be reused.
I keep getting error messages about MPI_REQUEST_MAX being too small, no matter how large I set it
There are two types of cases in which the MPI library reports an error concerning MPI_REQUEST_MAX. The error reported by the MPI library distinguishes these:
"MPI has run out of unexpected request entries; the current allocation level is: XXXXXX"
The program is sending so many unexpected large messages (greater than 64 bytes) to a process that internal limits in the MPI library have been exceeded. The options here are to increase the number of allowable requests via the MPI_REQUEST_MAX shell variable, or to modify the application.
"MPI has run out of request entries; the current allocation level…"
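A minimal sketch of the fix described above (not from the manual): if the application tracks send-buffer reuse by its own means, freeing the request right after MPI_Isend keeps incomplete requests from accumulating. The helper name is hypothetical.

    #include <mpi.h>

    /* Hypothetical helper: the caller determines buffer reuse on its own,
     * so the request is released immediately instead of being left incomplete. */
    void post_send(double *buf, int count, int dest, MPI_Comm comm)
    {
        MPI_Request req;

        MPI_Isend(buf, count, MPI_DOUBLE, dest, 0, comm, &req);

        /* Without this call (or a later MPI_Wait), every send leaks a request
         * and the job eventually exhausts its request entries. */
        MPI_Request_free(&req);
    }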
5. …or stderr not appearing, 37; TotalView, 17
6. …the MPI library is single-copy optimization, avoiding the use of shared memory buffers. However, as discussed in "Buffering" on page 10, some incorrectly coded applications might hang because of buffering assumptions. For this reason, this optimization is not enabled by default for MPI_Send, but it can be turned on by the user at run time by using the MPI_BUFFER_MAX environment variable. The following steps can be taken by the application developer to increase the opportunities for use of this unbuffered pathway:
• The MPI data type on the send side must be a contiguous type.
• The sender and receiver MPI processes must reside on the same host or, in the case of a partitioned system, the processes may reside on any of the partitions.
• The sender data must be globally accessible by the receiver. The SGI MPI implementation allows data allocated from the static region (common blocks), the private heap, and the stack region to be globally accessible. In addition, memory allocated via the MPI_Alloc_mem function or the SHMEM symmetric heap (accessed via the shpalloc or shmalloc functions) is globally accessible.
Certain run-time environment variables must be set to enable the unbuffered single-copy method. For more details on how to set the run-time environment, see "Avoiding Message Buffering: Enabling Single Copy" on page 27.
Note: With the Intel 7.1 compiler, ALLOCATABLE arrays are not eligible for single copy, since they do not reside in a g…
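As a brief illustration of the "globally accessible" condition in the list above (a sketch, not from the manual): allocating the transfer buffer with MPI_Alloc_mem places it in one of the regions the excerpt names as eligible, and the contiguous MPI_DOUBLE type satisfies the datatype condition. The single-copy path would still require the run-time variables (for example MPI_BUFFER_MAX) to be set as described later.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;
        double *buf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Memory from MPI_Alloc_mem is globally accessible to other processes,
         * one of the listed conditions for the unbuffered (single-copy) path. */
        MPI_Alloc_mem(100000 * sizeof(double), MPI_INFO_NULL, &buf);

        if (rank == 0)
            MPI_Send(buf, 100000, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(buf, 100000, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

        MPI_Free_mem(buf);
        MPI_Finalize();
        return 0;
    }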
7. …the kernel node, you can avoid this effect. Additionally, by restricting system daemons to run on the kernel node, you can also deliver an additional percentage of each application CPU to the user.
• Avoid interference with other applications. You can use cpusets or cpumemsets to address this problem also. You can use cpusets to effectively partition a large, distributed-memory host in a fashion that minimizes interactions between jobs running concurrently on the system. See the Linux Resource Administration Guide for information about cpusets and cpumemsets.
• On a quiet, dedicated system, you can use dplace or the MPI_DSM_CPULIST shell variable to improve run-time performance repeatability. These approaches are not as suitable for shared, nondedicated systems.
• Use a batch scheduler, for example LSF from Platform Computing or PBSpro from Veridian. These batch schedulers use cpusets to avoid oversubscribing the system and possible interference between applications.
Tuning MPI Buffer Resources
By default, the SGI MPI implementation buffers messages whose lengths exceed 64 bytes. Longer messages are buffered in a shared memory region to allow for exchange of data between MPI processes. In the SGI MPI implementation, these buffers are divided into two basic pools:
• For messages exchanged between MPI processes within the same host, or between partitioned systems when using the XPMEM driver, buffers from the per-…
8. …virtual memory usage. The ps(1) command's SIZE statistic is telling you the amount of virtual address space being used, not the amount of memory being consumed. Even if all of the pages that you could reference were faulted in, most of the virtual address regions point to multiply-mapped (shared) data regions, and even in that case, actual per-process memory usage would be far lower than that indicated by SIZE.
What does "MPI could not run executable" mean?
This message means that something happened while mpirun was trying to launch your application, which caused it to fail before all of the MPI processes were able to handshake with it. With Array Services 3.2 or later and MPT 1.3 or later, many scenarios that generated this error message are now improved to be more descriptive. Prior to Array Services 3.2, no diagnostic information was directly available; this was due to the highly decoupled interface between mpirun and arrayd. mpirun directs arrayd to launch a master process on each host and listens on a socket for those masters to connect back to it. Since the masters are children of arrayd, arrayd traps SIGCHLD and passes that signal back to mpirun whenever one of the masters terminates. If mpirun receives a signal before it has established connections with every host in the job, it knows that something has gone wrong.
How do I combine MPI with (insert favorite tool here)?
In general, the rule to follow is to run mpirun on your tool and then t…
9. …with SGI MPI
Portability is one of the main advantages MPI has over vendor-specific message-passing software. Nonetheless, the MPI Standard offers sufficient flexibility for general variations in vendor implementations. In addition, there are often vendor-specific programming recommendations for optimal use of the MPI library. This chapter addresses topics that are of interest to those developing or porting MPI applications to SGI systems.
Job Termination and Error Handling
This section describes the behavior of the SGI MPI implementation upon normal job termination. Error handling and characteristics of abnormal job termination are also described.
MPI_Abort
In the SGI MPI implementation, a call to MPI_Abort causes the termination of the entire MPI job, regardless of the communicator argument used. The error code value is returned as the exit status of the mpirun command. A stack traceback is displayed that shows where the program called MPI_Abort.
Error Handling
Section 7.2 of the MPI Standard describes MPI error handling. Although almost all MPI functions return an error status, an error handler is invoked before returning from the function. If the function has an associated communicator, the error handler associated with that communicator is invoked; otherwise, the error handler associated with MPI_COMM_WORLD is invoked. The SGI MPI implementation provides the following predefined error handlers: MPI_ERRORS…
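As a small illustration of the termination behavior described above (a sketch, not from the manual): any single rank calling MPI_Abort brings down the whole job, and the error code becomes the exit status of mpirun.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* One rank hits a fatal condition: the entire job is terminated,
         * regardless of the communicator passed, and mpirun exits with 2. */
        if (rank == 3)
            MPI_Abort(MPI_COMM_WORLD, 2);

        MPI_Finalize();
        return 0;
    }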
10. …0e+00
    irecv calls 22039     time 9.76185e-01   datacnt 23474032  avg datacnt 1065
    send calls 0          time 0.000000e+00
    ssend calls 0         time 0.000000e+00
    isend calls 22039     time 2.950286e+00
    wait calls 0          time 0.000000e+00  avg datacnt 0
    waitall calls 11045   time 7.73805e+01   # of Reqs 44078  avg data cnt 137944
    barrier calls 680     time 5.133110e+00
    alltoall calls 0      time 0.0e+00       avg datacnt 0
    alltoallv calls 0     time 0.000000e+00
    reduce calls 0        time 0.000000e+00
    allreduce calls 4658  time 2.072872e+01
    bcast calls 680       time 6.915840e-02
    gather calls 0        time 0.000000e+00
    gatherv calls 0       time 0.000000e+00
    scatter calls 0       time 0.000000e+00
    scatterv calls 0      time 0.000000e+00
    activity on process rank 1 …
MPI Internal Statistics
MPI keeps track of certain resource utilization statistics. These can be used to determine potential performance problems caused by lack of MPI message buffers and other MPI internal resources. To turn on the displaying of MPI internal statistics, use the MPI_STATS environment variable or the -stats option on the mpirun command. MPI internal statistics are always being gathered, so displaying them does not cause significant additional overhead. In addition, one can sample the MPI statistics counters from within an application, allowing for finer-grain measurements. For information about the…
11. …ARE_FATAL: The handler, when called, causes the program to abort on all executing processes. This has the same effect as if MPI_Abort were called by the process that invoked the handler.
• MPI_ERRORS_RETURN: The handler has no effect.
By default, the MPI_ERRORS_ARE_FATAL error handler is associated with MPI_COMM_WORLD and any communicators derived from it. Hence, to handle the error statuses returned from MPI calls, it is necessary to associate either the MPI_ERRORS_RETURN handler or another user-defined handler with MPI_COMM_WORLD near the beginning of the application.
MPI_Finalize and Connect Processes
In the SGI implementation of MPI, all pending communications involving an MPI process must be complete before the process calls MPI_Finalize. If there are any pending send or recv requests that are unmatched or not completed, the application will hang in MPI_Finalize. For more details, see section 7.5 of the MPI Standard.
If the application uses the MPI-2 spawn functionality described in Chapter 5 of the MPI-2 Standard, there are additional considerations. In the SGI implementation, all MPI processes are connected. Section 5.5.4 of the MPI-2 Standard defines what is meant by connected processes. When the MPI-2 spawn functionality is used, MPI_Finalize is collective over all connected processes. Thus, all MPI processes, both launched on the command line or subsequen…
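To act on returned error codes as described above, the application replaces the default fatal handler near startup. A minimal sketch (not from the manual), using the MPI-1 style MPI_Errhandler_set call that this era of the standard provides (later MPI versions spell it MPI_Comm_set_errhandler):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rc, rank, value = 42;
        char msg[MPI_MAX_ERROR_STRING];
        int msglen;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Return error codes to the caller instead of aborting the whole job. */
        MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        rc = MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (rc != MPI_SUCCESS) {
            MPI_Error_string(rc, msg, &msglen);
            fprintf(stderr, "rank %d: MPI_Bcast failed: %s\n", rank, msg);
        }

        MPI_Finalize();
        return 0;
    }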
12. …Copy
For message transfers between MPI processes within the same host, or transfers between partitions, it is possible under certain conditions to avoid the need to buffer messages. Because many MPI applications are written assuming infinite buffering, the use of this unbuffered approach is not enabled by default for MPI_Send. This section describes how to activate this mechanism by default for MPI_Send. For MPI_Isend, MPI_Sendrecv, MPI_Alltoall, MPI_Bcast, MPI_Allreduce, and MPI_Reduce, this optimization is enabled by default for large message sizes. To disable this default single-copy feature used for the collectives, use the MPI_DEFAULT_SINGLE_COPY_OFF environment variable.
Using the XPMEM Driver for Single Copy Optimization
MPI takes advantage of the XPMEM driver to support single-copy message transfers between two processes within the same host or across partitions. Enabling single-copy transfers may result in better performance, since this technique improves MPI's bandwidth. However, single-copy transfers may introduce additional synchronization points, which can reduce application performance in some cases. The threshold for message lengths beyond which MPI attempts to use this single-copy method is specified by the MPI_BUFFER_MAX shell variable. Its value should be set to the message length in bytes beyond which the single-copy method should be tried. In general, a value of 2000 or higher is beneficial for many…
13. Message Passing Toolkit (MPT) User's Guide, 007-3773-003.
CONTRIBUTORS: Julie Boney, Steven Levine, Jean Wilson. Illustrations by Chrystie Danzer. Edited by Susan Wilkening. Production by Karen Jacobson.
COPYRIGHT © 1996, 1998, 2005, Silicon Graphics, Inc. All rights reserved; provided portions may be copyright in third parties, as indicated elsewhere herein. No permission is granted to copy, distribute, or create derivative works from the contents of this electronic documentation in any manner, in whole or in part, without the prior written permission of Silicon Graphics, Inc.
LIMITED RIGHTS LEGEND: The software described in this document is "commercial computer software" provided with restricted rights (except as to included open/free source) as specified in the FAR 52.227-19 and/or the DFAR 227.7202, or successive sections. Use beyond license provisions is a violation of worldwide intellectual property laws, treaties and conventions. This document is provided with limited rights as defined in 52.227-14.
TRADEMARKS AND ATTRIBUTIONS: Silicon Graphics, SGI, the SGI logo, IRIX, and Origin are registered trademarks and Altix, CASEVision, NUMAlink, OpenMP, Performance Co-Pilot, ProDev, SGI ProPack, SHMEM, and SpeedShop are trademarks of Silicon Graphics, Inc., in the United States and/or other countries worldwide. Intel is a registered trademark of Intel Corporation. Kerberos is a trademark of Massachusetts Institute of Technology. Linux i…
14. …PI-2 Spawn Functions to Launch an Application
    Compiling and Running SHMEM Applications
3. Programming with SGI MPI
    Job Termination and Error Handling
        MPI_Abort
        Error Handling
        MPI_Finalize and Connect Processes
    Signals
    Buffering
    Multithreaded Programming
    Interoperability with the SHMEM Programming Model
    Miscellaneous Features of SGI MPI
        stdin/stdout/stderr
        MPI_Get_processor_name
    Programming Optimizations
        Using MPI Point-to-Point Communication Routines
        Using MPI Collective Communication Routines
        Using MPI_Pack/MPI_Unpack
        Avoiding Derived Data Types
        Avoiding Wild Cards
        Avoiding Message Buffering: Single Copy Methods
        Managing Memory Placement
        Using Global Shared Memory
        Additional Programming Model Considerations
4. Debugging MPI Applications
    MPI Routine Argument Checking
    Using TotalView with MPI Programs
    Using idb and gdb with MPI Programs
5. Profiling MPI Applications
    Using Profiling Tools with MPI Applications
        profile.pl
        histx
    Profiling Interface
    MPI Internal Statistics
    Performance Co-Pilot (PCP)
    Third Party Products
6. Run-time Tuning
    Reducing Run-time Variability
    Tuning MPI Buffer Resources
    Avoiding Mess…
15. …age 17
• Chapter 5, "Profiling MPI Applications" on page 19
• Chapter 6, "Run-time Tuning" on page 25
• Chapter 7, "Troubleshooting and Frequently Asked Questions" on page 35
1. Introduction
MPI Overview
MPI was created by the Message Passing Interface Forum (MPIF). MPIF is not sanctioned or supported by any official standards organization. Its goal was to develop a widely used standard for writing message-passing programs. SGI supports implementations of MPI that are released as part of the Message Passing Toolkit. The MPI Standard is documented online at the following address: http://www.mcs.anl.gov/mpi
MPI-2 Standard Compliance
The SGI MPI implementation is compliant with the 1.0, 1.1, and 1.2 versions of the MPI Standard specification. In addition, the following MPI-2 features (with section numbers from the MPI-2 Standard specification) are provided:
• MPI-2 parallel I/O
• A subset of MPI-2 one-sided communication routines (put/get model)
• MPI spawn functionality
• MPI_Alloc_mem, MPI_Free_mem
• Transfer of handles
• MPI-2 replacements for deprecated MPI-1 functions
• Extended language bindings for C++ and partial Fortran 90 support
• Generalized requests
• New attribute caching functions
(Sections: 9; 6; 5.3; 4.11; 4.12.4; 4.14.1; 10.1, 10.2; 4.4; 5.2; 8.8)
MPI Components
The MPI library is provided as a…
16. …age Buffering: Enabling Single Copy
    Using the XPMEM Driver for Single Copy Optimization
    Memory Placement and Policies
        MPI_DSM_CPULIST
        MPI_DSM_DISTRIBUTE
        MPI_DSM_PPM
        MPI_DSM_VERBOSE
    Using dplace for Memory Placement
    Tuning MPI/OpenMP Hybrid Codes
    Tuning for Running Applications Across Multiple Hosts
    Suspending MPI Jobs
7. Troubleshooting and Frequently Asked Questions
    What are some things I can try to figure out why mpirun is failing?
    My code runs correctly until it reaches MPI_Finalize and then it hangs
    I keep getting error messages about MPI_REQUEST_MAX being too small, no matter how large I set it
    I am not seeing stdout and/or stderr output from my MPI application
    How can I get the MPT software to install on my machine?
    Where can I find more information about the SHMEM programming model?
    The ps(1) command says my memory use (SIZE) is higher than expected
    What does "MPI could not run executable" mean?
    How do I combine MPI with (insert favorite tool here)?
    Must I use MPIO_Wait and MPIO_Test?
    Must I modify my code to replace calls to MPIO_Wait with MPI_Wait and recompile?
    Why do I see stack traceback information when my MPI job aborts?
    I…
17. …applications.
During job startup, MPI uses the XPMEM driver (via the xpmem kernel module) to map memory from one MPI process to another. The mapped areas include the static (BSS) region, the private heap, the stack region, and optionally the symmetric heap region of each process.
Memory mapping allows each process to directly access memory from the address space of another process. This technique allows MPI to support single-copy transfers for contiguous data types from any of these mapped regions. For these transfers, whether between processes residing on the same host or across partitions, the data is copied using a bcopy process. A bcopy process is also used to transfer data between two different executable files on the same host, or two different executable files across partitions. For data residing outside of a mapped region (a /dev/zero region, for example), MPI uses a buffering technique to transfer the data.
Memory mapping is enabled by default. To disable it, set the MPI_MEMMAP_OFF environment variable. Memory mapping must be enabled to allow single-copy transfers, MPI-2 one-sided communication, support for the SHMEM model, and certain collective optimizations.
Memory Placement and Policies
The MPI library takes advantage of NUMA placement functions that are available. Usually the default placement is adequate. Under certain circumstances, however, you might want to modify thi…
18. …dynamic shared object (DSO), a file with a name that ends in .so. The basic components that are necessary for using MPI are the libmpi.so library, the include files, and the mpirun command. Profiling support is included in the libmpi.so library. Profiling support replaces all MPI_Xxx prototypes and function names with PMPI_Xxx entry points.
The SGI MPI implementation offers a number of significant features that make it the preferred implementation to use on SGI hardware:
• Data transfer optimizations for NUMAlink, including single-copy data transfer
• Use of hardware fetch operations (fetchops), where available, for fast synchronization and lower latency for short messages
• Optimized MPI-2 one-sided commands
• Interoperability with the SHMEM (LIBSMA) programming model
• High-performance communication support for partitioned systems via XPMEM
Chapter 2: Getting Started
This chapter provides procedures for building MPI applications. It provides examples of the use of the mpirun(1) command to launch MPI jobs. It also provides procedures for building and running SHMEM applications.
Compiling and Linking MPI Programs
The default locations for the include files, the .so files, the .a files, and the mpirun command are pulled in automatically. Once the MPT RPM is installed as default, the commands to build an MPI-based application using the .so files are as follows. To compile using GNU compilers, choose one of the following co…
19. …emon, setting MPI_DAEMON_DEBUG_ATTACH sleeps the daemon briefly while you attach to it.
Chapter 5: Profiling MPI Applications
This chapter describes the use of profiling tools to obtain performance information. Compared to the performance analysis of sequential applications, characterizing the performance of parallel applications can be challenging. Often it is most effective to first focus on improving the performance of MPI applications at the single-process level. It may also be important to understand the message traffic generated by an application. A number of tools can be used to analyze this aspect of a message-passing application's performance, including Performance Co-Pilot and various third-party products. In this chapter, you can learn how to use these various tools with MPI applications.
Using Profiling Tools with MPI Applications
Two of the most common SGI profiling tools are profile.pl and histx. The following sections describe how to invoke these tools. Performance Co-Pilot (PCP) tools and tips for writing your own tools are also included.
profile.pl
You can use profile.pl to obtain procedure-level profiling as well as information about the hardware performance monitors. For further information, see the profile.pl(1) and pfmon(1) man pages.
General format:
    % mpirun mpirun_entry_object [mpirun_entry_object ...] profile.pl [profile.pl_options] executable
Example:
    % mpirun -np…
20. …es these tools generate can be enormous for longer-running or highly parallel jobs. This causes a program to run more slowly, but even more problematic is that the tools to analyze the data are often overwhelmed by the amount of data.
Chapter 6: Run-time Tuning
This chapter discusses ways in which the user can tune the run-time environment to improve the performance of an MPI message-passing application on SGI computers. None of these ways involve application code changes.
Reducing Run-time Variability
One of the most common problems with optimizing message-passing codes on large shared-memory computers is achieving reproducible timings from run to run. To reduce run-time variability, you can take the following precautions:
• Do not oversubscribe the system. In other words, do not request more CPUs than are available, and do not request more memory than is available. Oversubscribing causes the system to wait unnecessarily for resources to become available and leads to variations in the results and less than optimal performance.
• Avoid interference from other system activity. The Linux kernel uses more memory on node 0 than on other nodes (node 0 is called the kernel node in the following discussion). If your application uses almost all of the available memory per processor, the memory for processes assigned to the kernel node can unintentionally spill over to nonlocal memory. By keeping user applications off…
21. …g models are not currently supported.
Interoperability with the SHMEM Programming Model
You can mix SHMEM and MPI message passing in the same program. The application must be linked with both the SHMEM and MPI libraries. Start with an MPI program that calls MPI_Init and MPI_Finalize. When you add SHMEM calls, the PE numbers are equal to the MPI rank numbers in MPI_COMM_WORLD. Do not call start_pes() in a mixed MPI and SHMEM program.
When running the application across a cluster, some MPI processes may not be able to communicate with certain other MPI processes when using SHMEM functions. You can use the shmem_pe_accessible and shmem_addr_accessible functions to determine whether a SHMEM call can be used to access data residing in another MPI process. Because the SHMEM model functions only with respect to MPI_COMM_WORLD, these functions cannot be used to exchange data between MPI processes that are connected via MPI intercommunicators returned from MPI-2 spawn-related functions.
SHMEM get and put functions are thread safe. SHMEM collective and synchronization functions are not thread safe unless different threads use different pSync and pWork arrays.
For more information about the SHMEM programming model, see the intro_shmem man page.
Miscellaneous Features of SGI MPI
This section describes other characteristics of the SGI MPI implementation that might be of interest to application developers.
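A small sketch of the mixing described above (not from the manual; the SHMEM calls assume SGI's libsma interface with the mpp/shmem.h header and -lsma at link time): the program initializes only MPI, treats its MPI_COMM_WORLD rank as the SHMEM PE number, and does a one-sided put into a symmetric (static) array.

    #include <mpi.h>
    #include <mpp/shmem.h>   /* SGI SHMEM (libsma) header; link with -lsma */

    /* Static arrays are symmetric across PEs, so they are valid SHMEM targets. */
    static long src[8], dst[8];

    int main(int argc, char **argv)
    {
        int rank, nranks, i;

        /* Initialize MPI only; do not call start_pes() in a mixed program. */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        for (i = 0; i < 8; i++)
            src[i] = rank * 100 + i;

        /* PE numbers equal MPI_COMM_WORLD ranks: put into the next rank. */
        shmem_put64(dst, src, 8, (rank + 1) % nranks);
        shmem_barrier_all();

        MPI_Finalize();
        return 0;
    }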
22. …he tool on your application. Do not try to run the tool on mpirun. Also, because of the way that mpirun sets up stdio, seeing the output from your tool might require a bit of effort. The most ideal case is when the tool directly supports an option to redirect its output to a file. In general, this is the recommended way to mix tools with mpirun. Of course, not all tools (for example, dplace) support such an option. However, it is usually possible to make it work by wrapping a shell script around the tool and having the script do the redirection, as in the following example:
    > cat myscript
    #!/bin/csh
    setenv MPI_DSM_OFF
    dplace -verbose a.out 2> outfile
    > mpirun -np 4 myscript
    hello world from process 0
    hello world from process 1
    hello world from process 2
    hello world from process 3
    > cat outfile
    there are now 1 threads
    Setting up policies and initial thread.
    Migration is off.
    Data placement policy is PlacementDefault.
    Creating data PM.
    Data pagesize is 16k.
    Setting data PM.
    Creating stack PM.
    Stack pagesize is 16k.
    Stack placement policy is PlacementDefault.
    Setting stack PM.
    there are now 2 threads
    there are now 3 threads
    there are now 4 threads
    there are now 5 threads
Must I use MPIO_Wait and MPIO_Test?
Beginning with MPT 1.8, MPT has unified the I/O requests generated from nonblocking I/O routines such as MPI_File_iwri…
23. …ine syntax, see the mpirun(1) man page. This section summarizes the procedures for launching an MPI application.
Launching a Single Program on the Local Host
To run an application on the local host, enter the mpirun command with the -np argument. Your entry must include the number of processes to run and the name of the MPI executable file. The following example starts three instances of the mtest application, which is passed an argument list (arguments are optional):
    % mpirun -np 3 mtest 1000 arg2
Launching a Multiple Program, Multiple Data (MPMD) Application on the Local Host
You are not required to use a different host in each entry that you specify on the mpirun command. You can launch a job that has multiple executable files on the same host. In the following example, one copy of prog1 and five copies of prog2 are run on the local host. Both executable files use shared memory.
    % mpirun -np 1 prog1, 5 prog2
Launching a Distributed Application
You can use the mpirun command to launch a program that consists of any number of executable files and processes, and you can distribute the program to any number of hosts. A host is usually a single machine, or it can be any accessible computer running Array Services software. For available nodes on systems running Array Services software, see the /usr/lib/array/arrayd.conf file.
24. …ions.
MPI Routine Argument Checking
By default, the SGI MPI implementation does not check the arguments to some performance-critical MPI routines, such as most of the point-to-point and collective communication routines. You can force MPI to always check the input arguments to MPI functions by setting the MPI_CHECK_ARGS environment variable. However, setting this variable might result in some degradation in application performance, so it is not recommended that it be set except when debugging.
Using TotalView with MPI Programs
The syntax for running SGI MPI with Etnus TotalView is as follows:
    % totalview mpirun -a -np 4 a.out
Note that TotalView is not expected to operate with MPI processes started via the MPI_Comm_spawn or MPI_Comm_spawn_multiple functions.
Using idb and gdb with MPI Programs
Because the idb and gdb debuggers are designed for sequential, non-parallel applications, they are generally not well suited for use in MPI program debugging and development. However, the use of the MPI_SLAVE_DEBUG_ATTACH environment variable makes these debuggers more usable. If you set the MPI_SLAVE_DEBUG_ATTACH environment variable to a global rank number, the MPI process sleeps briefly in startup while you use idb or gdb to attach to the process. A message is printed to the screen, telling you how to use idb or gdb to attach to the process. Similarly, if you want to debug the MPI da…
25. …ions that you can consider when trying to improve application performance.
Systems can use the XPMEM interconnect to cluster hosts as partitioned systems, or use the Voltaire InfiniBand (IB) interconnect or TCP/IP as the multihost interconnect. When launched as a distributed application, MPI probes for these interconnects at job startup; for details of launching a distributed application, see "Launching a Distributed Application" on page 6. When a high-performance interconnect is detected, MPI attempts to use this interconnect if it is available on every host being used by the MPI job. If the interconnect is not available for use on every host, the library attempts to use the next slower interconnect until this connectivity requirement is met. Table 6-1 on page 31 specifies the order in which MPI probes for available interconnects.
Table 6-1: Inquiry Order for Available Interconnects
    Interconnect   Default Order of Selection   Environment Variable to Require Use
    XPMEM          1                            MPI_USE_XPMEM
    InfiniBand     2                            MPI_USE_IB
    TCP/IP         3                            MPI_USE_TCP
The third column of Table 6-1 on page 31 also indicates the environment variable you can set to pick a particular interconnect other than the default. In general, to insure the best performance of the application, you should allow MPI to pick the fastest available interconnect. In addition to the choice of interconnect, you should know that multihost jobs may use different buffers from th…
26. …is MPI_REQUEST_MAX: XXXXX"
You might have an application problem. You almost certainly are calling MPI_Isend or MPI_Irecv and not completing or freeing your request objects. You need to use MPI_Request_free, as described in the previous section.
I am not seeing stdout and/or stderr output from my MPI application
All stdout and stderr is line-buffered, which means that mpirun does not print any partial lines of output. This sometimes causes problems for codes that prompt the user for input parameters but do not end their prompts with a newline character. The only solution for this is to append a newline character to each prompt. You can set the MPI_UNBUFFERED_STDIO environment variable to disable line buffering. For more information, see the MPI(1) and mpirun(1) man pages.
How can I get the MPT software to install on my machine?
MPT RPMs are included in ProPack releases. In addition, you can obtain MPT RPMs from the SGI Support website at http://support.sgi.com under "Downloads".
Where can I find more information about the SHMEM programming model?
See the intro_shmem(3) man page.
The ps(1) command says my memory use (SIZE) is higher than expected
At MPI job start-up, MPI calls the SHMEM library to cross-map all user static memory on all MPI processes to provide optimization opportunities. The result is large…
27. …lobally accessible memory region. This restriction does not apply when using the Intel 8.0/8.1 compilers.
Managing Memory Placement
SGI systems have a ccNUMA memory architecture. For single-process and small multiprocess applications, this architecture behaves similarly to flat memory architectures. For more highly parallel applications, memory placement becomes important. MPI takes placement into consideration when laying out shared memory data structures and the individual MPI processes' address spaces. In general, it is not recommended that the application programmer try to manage memory placement explicitly. There are a number of means to control the placement of the application at run time, however. For more information, see Chapter 6, "Run-time Tuning" on page 25.
Using Global Shared Memory
The MPT software includes the Global Shared Memory (GSM) feature. This feature allows users to allocate globally accessible shared memory from within an MPI or SHMEM program. The GSM feature can be used to provide shared memory access across partitioned Altix systems and additional memory placement options within a single-host configuration. User-callable functions are provided to allocate a global shared memory segment, free that segment, and provide information about the segment. Once allocated, the application can use this new global shared memory segment via standard loads and stores, ju…
28. …ls get the messages. Nevertheless, program logic like this is not valid by the MPI Standard. Programs that require this sequence of MPI calls should employ one of the buffered MPI send calls, MPI_Bsend or MPI_Ibsend.
Table 3-1: Outline of Improper Dependence on Buffering
    Process 1          Process 2
    MPI_Send(2,....)   MPI_Send(1,....)
    MPI_Recv(2,....)   MPI_Recv(1,....)
By default, the SGI implementation of MPI uses buffering under most circumstances. Short messages (64 or fewer bytes) are always buffered. Longer messages are also buffered, although under certain circumstances buffering can be avoided. For performance reasons, it is sometimes desirable to avoid buffering. For further information on unbuffered message delivery, see "Programming Optimizations" on page 13.
Multithreaded Programming
SGI MPI supports hybrid programming models, in which MPI is used to handle one level of parallelism in an application, while POSIX threads or OpenMP processes are used to handle another level. When mixing OpenMP with MPI, for performance reasons it is better to consider invoking MPI functions only outside parallel regions, or only from within master regions. When used in this manner, it is not necessary to initialize MPI for thread safety. You can use MPI_Init to initialize MPI. However, to safely invoke MPI functions from any OpenMP process, or when using Posix threads, MPI must be initialized with MPI_Init_thread.
Note: Multithreaded programmin…
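A minimal sketch (not from the manual) of the thread-safe initialization mentioned above, for the case where MPI calls may be made from any thread rather than only outside parallel regions:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int provided, rank;

        /* Request full thread support because MPI calls may come from any thread. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (provided < MPI_THREAD_MULTIPLE && rank == 0)
            fprintf(stderr, "warning: library provides thread level %d only\n",
                    provided);

        /* ... hybrid MPI/OpenMP or MPI/pthreads work goes here ... */

        MPI_Finalize();
        return 0;
    }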
29. …mmands:
    % g++ -o myprog myprog.C -lmpi++ -lmpi
    % gcc -o myprog myprog.c -lmpi
    % g77 -I/usr/include -o myprog myprog.f -lmpi
To compile programs with the Intel compiler, use the following commands:
    % efc -o myprog myprog.f -lmpi       (Fortran, version 7.1)
    % ecc -o myprog myprog.c -lmpi       (C, version 7.1)
    % ifort -o myprog myprog.f -lmpi     (Fortran, version 8)
    % icc -o myprog myprog.c -lmpi       (C, version 8)
The libmpi.so library is compatible with code generated by g++ 3.0 or later compilers, as well as Intel C++ 8.0 or later compilers. If compatibility with previous g++ or C++ compilers is required, the libmpi.so released with MPT 1.9 or earlier must be used.
Note: You must use the Intel compiler to compile Fortran 90 programs.
To compile Fortran programs with the Intel compiler, enabling compile-time checking of MPI subroutine calls, insert a USE MPI statement near the beginning of each subprogram to be checked and use one of the following commands:
    % efc -I/usr/include -o myprog myprog.f -lmpi      (version 7.1)
    % ifort -I/usr/include -o myprog myprog.f -lmpi    (version 8)
Note: The above command line assumes a default installation; if you have installed MPT into a non-default location, replace /usr/include with the name of the relocated directory.
Using mpirun to Launch an MPI Application
You must use the mpirun command to start MPI applications. For complete specification of the command l…
30. …ndex
Figures
    Figure 5-1  mpivis Tool
    Figure 5-2  mpimon Tool
Tables
    Table 3-1  Outline of Improper Dependence on Buffering
    Table 3-2  Optimized MPI Collectives
    Table 6-1  Inquiry Order for Available Interconnects
About This Manual
This publication documents the SGI implementation of the Message Passing Interface (MPI). MPI consists of a library, which contains both normal and profiling entry points, and commands that support the MPI interface. MPI is a component of the SGI Message Passing Toolkit (MPT). MPT is a software package that supports parallel programming on large systems and clusters of computer systems through a technique known as message passing. Systems running MPI applications must also be running Array Services software, version 3.1 or later.
Related Publications and Other Sources
Material about MPI is available from a variety of sources. Some of these, particularly webpages, include pointers to other resources. Following is a grouped list of these sources.
The MPI standard:
• As a technical report: University of Tennessee report (reference 24 from Using MPI: Portable Parallel Programming with the Message Passing Interface, by Gropp, Lusk, and Skjellum)
• As online PostScrip…
31. …ned Altix systems, which use the MPI-2 MPI_Comm_spawn or MPI_Comm_spawn_multiple functions, it may be necessary to explicitly specify the partitions on which additional MPI processes may be launched. See the section "Launching Spawn Capable Jobs on Altix Partitioned Systems" on the mpirun(1) man page.
Compiling and Running SHMEM Applications
To compile SHMEM programs with a GNU compiler, choose one of the following commands:
    % g++ compute.C -lsma
    % gcc compute.c -lsma
    % g77 -I/usr/include compute.f -lsma
To compile SHMEM programs with the Intel compiler, use the following commands:
    % ecc compute.C -lsma      (version 7.1)
    % ecc compute.c -lsma      (version 7.1)
    % efc compute.f -lsma      (version 7.1)
    % icc compute.C -lsma      (version 8)
    % icc compute.c -lsma      (version 8)
    % ifort compute.f -lsma    (version 8)
You must use mpirun to launch SHMEM applications. The NPES variable has no effect on SHMEM programs. To request the desired number of processes to launch, you must set the -np option on mpirun.
The SHMEM programming model supports single-host SHMEM applications, as well as SHMEM applications that span multiple partitions. To launch a SHMEM application on more than one partition, use the multiple-host mpirun syntax, such as the following:
    % mpirun hostA, hostB -np 16 shmem_app
For more information, see the intro_shmem(3) man page.
32. …nk errors, try to run your program without mpirun. You will get the "mpirun must be used to launch all MPI applications" message, along with any rld link errors that might not be displayed when the program is started with mpirun. As a last resort, setting the environment variable LD_DEBUG to all will display a set of messages for each symbol that rld resolves. This produces a lot of output, but it should help you find the cause of the link error.
• Be sure that you are setting your remote directory properly. By default, mpirun attempts to place your processes on all machines into the directory that has the same name as $PWD. This should be the common case, but sometimes different functionality is required. For more information, see the section on MPI_DIR and/or the -dir option in the mpirun man page.
• If you are using a relative pathname for your application, be sure that it appears in $PATH. In particular, mpirun will not look in "." for your application unless "." appears in $PATH.
• Run /usr/etc/ascheck to verify that your array is configured correctly.
• Be sure that you can execute rsh (or arshell) to all of the hosts that you are trying to use without entering a password. This means that either /etc/hosts.equiv or .rhosts must be modified to include the names of every host in the MPI job. Note that using the -np syntax (i.e., no hostnames) is…
33. …ompliance, 2
    Debuggers: idb and gdb, 17
    MPI-2 spawn functions: to launch applications, 7
    Distributed applications, 6
    MPI_REQUEST_MAX too small, 37
    mpimon tool, 22
    mpirun command: to launch application, 6
    mpirun failing, 35
    Features, 3
    mpivis tool, 22
    Frequently asked questions, 35
    MPMD applications, 6
    MPT software installation, 37
    Getting started, 5
    Performance Co-Pilot (PCP), 22
    profile.pl tool, 19
    Profiling interface, 20
    Profiling tools, 19
    histx tool, 20
    histx, 19
    Jumpshot, 24
    mpimon, 22
    mpivis, 22
    profile.pl, 19
34. …ose used by jobs run on a single host. In the SGI implementation of MPI, the XPMEM interconnect uses the per-proc buffers, while the InfiniBand and TCP interconnects use the per-host buffers. The default setting for the number of buffers per proc or per host might be too low for many applications. You can determine whether this setting is too low by using the MPI statistics described earlier in this section.
When using the TCP/IP interconnect, unless specified otherwise, MPI uses the default IP adapter for each host. To use a nondefault adapter, enter the adapter-specific host name on the mpirun command line.
When using the InfiniBand interconnect, MPT applications may not execute a fork or system call. The InfiniBand driver produces undefined results when an MPT process using InfiniBand forks.
Suspending MPI Jobs
SGI's MPI software can internally use the XPMEM kernel module to provide direct access to data on remote partitions and to provide single-copy operations to local data. Any pages used by these operations are prevented from paging by the XPMEM kernel module. As of the SGI ProPack 3 Service Pack 5 and SGI ProPack 4 for Linux releases, if an administrator needs to temporarily suspend an MPI application to allow other applications to run, they can unpin these pages so they can be swapped out and made available for other applications.
Each process of an MPI application which is using the XPMEM kernel module will have a proc…
35. …ost, so a localhost entry will also be needed in one of the above two files.
• Use the -verbose option to verify that you are running the version of MPI that you think you are running.
• Be very careful when setting MPI environment variables from within your .cshrc or .login files, because these will override any settings that you might later set from within your shell, due to the fact that MPI creates the equivalent of a fresh login session for every job. The safe way to set things up is to test for the existence of MPI_ENVIRONMENT in your scripts and set the other MPI environment variables only if it is undefined.
• If you are running under a Kerberos environment, you may experience unpredictable results because, currently, mpirun is unable to pass tokens. For example, in some cases, if you use telnet to connect to a host and then try to run mpirun on that host, it fails. But if you instead use rsh to connect to the host, mpirun succeeds. (This might be because telnet is kerberized but rsh is not.) At any rate, if you are running under such conditions, you will definitely want to talk to the local administrators about the proper way to launch MPI jobs.
• Look in /tmp/arraysvcs on all machines you are using. In some cases, you might find an errlog file that may be helpful.
My code runs correctly until it reaches MPI_Finalize and then it hangs
This is almost always caused by send or recv requests that are either unmatched or…
36. …pirun command line. Each entry contains an MPI executable file and a combination of hosts and process counts for running it. This gives you the ability to start different executable files on the same or different hosts as part of the same MPI application.
The examples in this section show various ways to launch an application that consists of multiple MPI executable files on multiple hosts. The following example runs ten instances of the a.out file on host_a:
    % mpirun host_a -np 10 a.out
When specifying multiple hosts, you can omit the -np option and list the number of processes directly. The following example launches ten instances of fred on three hosts (fred has two input arguments):
    % mpirun host_a, host_b, host_c 10 fred arg1 arg2
The following example launches an MPI application on different hosts with different numbers of processes and executable files:
    % mpirun host_a 6 a.out, host_b 26 b.out
Using MPI-2 Spawn Functions to Launch an Application
To use the MPI-2 process creation functions MPI_Comm_spawn or MPI_Comm_spawn_multiple, you must specify the universe size by specifying the -up option on the mpirun command line. For example, the following command starts three instances of the mtest MPI application in a universe of size 10:
    % mpirun -up 10 -np 3 mtest
By using one of the above MPI spawn functions, mtest can start up to seven more MPI processes.
When running MPI applications on partitio…
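A small sketch (not from the manual) of the spawn usage the excerpt describes: started as "mpirun -up 10 -np 3 mtest", the three original ranks could collectively launch up to seven additional copies of a hypothetical "worker" executable.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Comm children;   /* intercommunicator to the spawned processes */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Collective over MPI_COMM_WORLD: spawn 7 processes of a hypothetical
         * "worker" program; 3 original + 7 spawned fits the universe size of 10. */
        MPI_Comm_spawn("worker", MPI_ARGV_NULL, 7, MPI_INFO_NULL,
                       0 /* root */, MPI_COMM_WORLD, &children,
                       MPI_ERRCODES_IGNORE);

        /* Parents and children can now communicate through the intercommunicator. */

        MPI_Comm_free(&children);
        MPI_Finalize();
        return 0;
    }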
37. …plication performance for certain communication patterns.
You can use the MPI_BUFS_PER_PROC shell variable to adjust the number of buffers available for the per-proc pool. Similarly, you can use the MPI_BUFS_PER_HOST shell variable to adjust the per-host pool. You can use the MPI statistics counters to determine if retries for these shared memory buffers are occurring. For information on the use of these counters, see "MPI Internal Statistics" on page 21. In general, you can avoid excessive numbers of retries for buffers by increasing the number of buffers for the per-proc pool or per-host pool. However, you should keep in mind that increasing the number of buffers does consume more memory. Also, increasing the number of per-proc buffers does potentially increase the probability for cache pollution (that is, the excessive filling of the cache with message…
38. …process pool, called the "per proc" pool, are used. Each MPI process is allocated a fixed portion of this pool when the application is launched. Each of these portions is logically partitioned into 16 KB buffers.
For MPI jobs running across multiple hosts, a second pool of shared memory is available. Messages exchanged between MPI processes on different hosts use this pool of shared memory, called the "per host" pool. The structure of this pool is somewhat more complex than the per-proc pool.
For an MPI job running on a single host, messages that exceed 64 bytes are handled as follows. For messages with a length of 16 KB or less, the sender MPI process buffers the entire message. It then delivers a message header (also called a control message) to a mailbox, which is polled by the MPI receiver when an MPI call is made. Upon finding a matching receive request for the sender's control message, the receiver copies the data out of the shared memory buffer into the application buffer indicated in the receive request. The receiver then sends a message header back to the sender process, indicating that the shared memory buffer is available for reuse. Messages whose length exceeds 16 KB are broken down into 16 KB chunks, allowing the sender and receiver to overlap the copying of data to and from shared memory in a pipeline fashion.
Because there is a finite number of these shared memory buffers, this can be a constraint on the overall ap…
39. …ry calls to request data delivery from one process to another or between groups of processes.
The MPT package contains the following components and the appropriate accompanying documentation:
• Message Passing Interface (MPI): MPI is a standard specification for a message-passing interface, allowing portable message-passing programs in Fortran and C languages.
• The SHMEM programming model: The SHMEM programming model is a distributed, shared-memory model that consists of a set of SGI-proprietary message-passing library routines. These routines help distributed applications efficiently transfer data between cooperating processes. The model is based on multiple processes having separate address spaces, with the ability for one process to access data in another process' address space without interrupting the other process. The SHMEM programming model is not a standard like MPI, so SHMEM applications developed on other vendors' hardware might or might not work with the SGI SHMEM implementation.
This chapter provides an overview of the MPI software that is included in the toolkit. This overview includes a description of the MPI-2 Standard features that are provided, a description of the basic components of MPI, and a description of the basic features of MPI. Subsequent chapters address the following topics:
• Chapter 2, "Getting Started" on page 5
• Chapter 3, "Programming with SGI MPI" on page 9
• Chapter 4, "Debugging MPI Applications" on p…
40. …s a registered trademark of Linus Torvalds, used with permission by Silicon Graphics, Inc. MIPS is a registered trademark and MIPSpro is a trademark of MIPS Technologies, Inc., used under license by Silicon Graphics, Inc., in the United States and/or other countries worldwide. PostScript is a trademark of Adobe Systems, Inc. TotalView is a trademark of Etnus, LLC. UNICOS and UNICOS/mk are registered trademarks of Cray Inc. UNIX is a registered trademark of the Open Group in the United States and other countries.
New Features in This Manual
The MPT 1.12 release supports the suspension of MPI jobs, as described in "Suspending MPI Jobs" on page 32.
Record of Revision
    Version  Description
    001      March 2004. Original printing. This manual documents the Message Passing Toolkit implementation of the Message Passing Interface (MPI).
    002      November 2004. Supports the MPT 1.11 release.
    003      June 2005. Supports the MPT 1.12 release.
Contents
    About This Manual
        Related Publications and Other Sources
        Obtaining Publications
        Conventions
        Reader Comments
    1. Introduction
        MPI Overview
        MPI-2 Standard Compliance
        MPI Components
        MPI Features
    2. Getting Started
        Compiling and Linking MPI Programs
        Using mpirun to Launch an MPI Application
            Launching a Single Program on the Local Host
            Launching a Multiple Program, Multiple Data (MPMD) Application on the Local Host
            Launching a Distributed Application
            Using M…
41. …s default behavior. The easiest way to do this is by setting one or more MPI placement shell variables. Several of the most commonly used of these variables are described in the following sections. For a complete listing of memory placement related shell variables, see the MPI(1) man page.
MPI_DSM_CPULIST
The MPI_DSM_CPULIST shell variable allows you to manually select processors to use for an MPI application. At times, specifying a list of processors on which to run a job can be the best means to insure highly reproducible timings, particularly when running on a dedicated system. This setting is treated as a comma- and/or hyphen-delineated ordered list that specifies a mapping of MPI processes to CPUs. If running across multiple hosts, the per-host components of the CPU list are delineated by colons.
Note: This feature should not be used with MPI applications that use either of the MPI-2 spawn-related functions.
Examples of settings are as follows:
    Value        CPU Assignment
    8,16,32      Place three MPI processes on CPUs 8, 16, and 32.
    32,16,8      Place the MPI process rank zero on CPU 32, one on 16, and two on CPU 8.
    8-15,32-39   Place the MPI processes 0 through 7 on CPUs 8 to 15. Place the MPI processes 8 through 15 on CPUs 32 to 39.
    39-32,8-15   Place the MPI processes 0 through 7 on CPUs 39 to 32. Place the MPI processes 8 through 15 on CPUs 8 to 15.
    8-15:16-23   Place the MPI processes 0 through 7…
42. …s standard MPI_Send and MPI_Isend in this implementation. Persistent requests do not offer any performance advantage over standard requests in this implementation.
Using MPI Collective Communication Routines
The MPI collective calls are frequently layered on top of the point-to-point primitive calls. For small process counts, this can be reasonably effective. However, for higher process counts (32 processes or more) or for clusters, this approach can become less efficient. For this reason, a number of the MPI library collective operations have been optimized to use more complex algorithms. Some collectives have been optimized for use with clusters. In these cases, steps are taken to reduce the number of messages using the relatively slower interconnect between hosts. Two of the collective operations have been optimized for use with shared memory. The barrier operation has also been optimized to use hardware fetch operations (fetchops). The MPI_Alltoall routines also use special techniques to avoid message buffering when using shared memory. For more details, see "Avoiding Message Buffering: Single Copy Methods" on page 15. Table 3-2 on page 14 lists the MPI collective routines optimized in this implementation.
Table 3-2: Optimized MPI Collectives
    Routine         Optimized for Clusters   Optimized for Shared Memory
    MPI_Alltoall    Yes                      Yes
    MPI_Barrier     Yes                      Yes
    MPI_Allreduce   Yes                      No
    MPI_Bcas…
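For reference, a sketch (not from the manual) of a typical call to one of the optimized collectives listed above; each rank contributes one integer destined for every other rank.

    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, nranks, i;
        int *sendbuf, *recvbuf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        sendbuf = malloc(nranks * sizeof(int));
        recvbuf = malloc(nranks * sizeof(int));
        for (i = 0; i < nranks; i++)
            sendbuf[i] = rank * 1000 + i;   /* one value for each destination rank */

        /* Every rank exchanges one int with every other rank. */
        MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);

        free(sendbuf);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }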
se MPI extensions, see the mpi_stats man page. These statistics can be very useful in optimizing codes in the following ways:

- To determine if there are enough internal buffers and if processes are waiting (retries) to acquire them
- To determine if single-copy optimization is being used for point-to-point or collective calls

For additional information on how to use the MPI statistics counters to help tune the run-time environment for an MPI application, see Chapter 6, "Run-time Tuning" on page 25.

Performance Co-Pilot (PCP)

In addition to the tools described in the preceding sections, you can also use the MPI agent for Performance Co-Pilot (PCP) to profile your application. The two additional PCP tools specifically designed for MPI are mpivis and mpimon. These tools do not use trace files and can be used live or can be logged for later replay. Following are examples of the mpivis and mpimon tools.

[Figure 5-1: mpivis Tool]

[Figure 5-2: mpimon Tool]

Third Party Products

Two third-party tools that you can use with the SGI MPI implementation are Vampir from Pallas (www.pallas.com) and Jumpshot, which is part of the MPICH distribution. Both of these tools are effective for smaller, short-duration MPI jobs. However, the trace fil
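As a complement to the counter-based statistics discussed earlier in this passage, a specific communication phase can be timed directly with MPI_Wtime and the interval compared against the profiling data. The following minimal sketch times a single MPI_Allreduce; the operation and message size are illustrative assumptions.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double t0, t1, in = 1.0, out = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);          /* line the ranks up before timing */
    t0 = MPI_Wtime();
    MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("allreduce took %e sec (timer resolution %e sec)\n",
               t1 - t0, MPI_Wtick());

    MPI_Finalize();
    return 0;
}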
sses to be placed on a node. Memory bandwidth intensive applications can benefit from placing fewer MPI processes on each node of a distributed memory host. On SGI Altix 3000 systems, setting MPI_DSM_PPM to 1 places one MPI process on each node.

MPI_DSM_VERBOSE

Setting the MPI_DSM_VERBOSE shell variable directs MPI to display a synopsis of the NUMA placement options being used at run time.

Using dplace for Memory Placement

The dplace tool offers another means of specifying the placement of MPI processes within a distributed memory host. The dplace tool and MPI interoperate to allow MPI to better manage placement of certain shared memory data structures when dplace is used to place the MPI job. For instructions on how to use dplace with MPI, see the dplace(1) man page.

Tuning MPI/OpenMP Hybrid Codes

Hybrid MPI/OpenMP applications might require special memory placement features. This section describes a preliminary method for achieving this memory placement. The basic idea is to space out the MPI processes to accommodate the OpenMP threads associated with each MPI process. In addition, assuming a particular ordering of library init code (see the DSO man page), this method employs procedures to ensure that the OpenMP threads remain close to the parent MPI process. This type of placement has been found to improve the performance of some hybrid applications significantly. To take partial ad
st as if it were a System V shared memory segment. For more information, see the GSM_Intro or GSM_Alloc man pages.

Additional Programming Model Considerations

A number of additional programming options might be worth consideration when developing MPI applications for SGI systems. For example, the SHMEM programming model can provide a means to improve the performance of latency-sensitive sections of an application. Usually, this requires replacing MPI send/recv calls with shmem_put, shmem_get, and shmem_barrier calls. The SHMEM programming model can deliver significantly lower latencies for short messages than traditional MPI calls.

As an alternative to shmem_get/shmem_put calls, you might consider the MPI-2 MPI_Put and MPI_Get functions. These provide almost the same performance as the SHMEM calls while providing a greater degree of portability (see the example sketch below).

Alternately, you might consider exploiting the shared memory architecture of SGI systems by handling one or more levels of parallelism with OpenMP, with the coarser-grained levels of parallelism being handled by MPI. Also, there are special ccNUMA placement considerations to be aware of when running hybrid MPI/OpenMP applications. For further information, see Chapter 6, "Run-time Tuning" on page 25.

Chapter 4. Debugging MPI Applications

Debugging MPI applications can be more challenging than debugging sequential applications. This chapter presents methods for debugging MPI applicat
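Returning to the one-sided MPI-2 alternative described earlier in this passage, the following minimal sketch has each rank deposit a value into a window exposed by rank 0 with MPI_Put. The window layout, the use of MPI_Win_fence for synchronization, and the buffer sizes are illustrative assumptions.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    int *winbuf = NULL;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Rank 0 exposes a buffer; the other ranks expose an empty window. */
    if (rank == 0)
        MPI_Alloc_mem(nprocs * sizeof(int), MPI_INFO_NULL, &winbuf);
    MPI_Win_create(winbuf,
                   rank == 0 ? nprocs * (MPI_Aint)sizeof(int) : 0,
                   sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    /* Each rank writes its rank number into element [rank] on rank 0. */
    MPI_Put(&rank, 1, MPI_INT, 0, (MPI_Aint)rank, 1, MPI_INT, win);
    MPI_Win_fence(0, win);                /* completes all puts */

    if (rank == 0)
        printf("winbuf[%d] = %d\n", nprocs - 1, winbuf[nprocs - 1]);

    MPI_Win_free(&win);
    if (rank == 0)
        MPI_Free_mem(winbuf);
    MPI_Finalize();
    return 0;
}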
t            Yes                       No

Note: These collectives are optimized across partitions by using the XPMEM driver, which is explained in Chapter 6, "Run-time Tuning". These collectives (except MPI_Barrier) will try to use single copy by default for large transfers unless MPI_DEFAULT_SINGLE_COPY_OFF is specified.

Using MPI_Pack/MPI_Unpack

While MPI_Pack and MPI_Unpack are useful for porting PVM codes to MPI, they essentially double the amount of data to be copied by both the sender and receiver. It is generally best to avoid the use of these functions by either restructuring your data or using derived data types (see the example sketch below). Note, however, that use of derived data types may lead to decreased performance in certain cases.

Avoiding Derived Data Types

In general, you should avoid derived data types when possible. In the SGI implementation, use of derived data types does not generally lead to performance gains. Use of derived data types might disable certain types of optimizations (for example, unbuffered or single copy data transfer).

Avoiding Wild Cards

The use of wild cards (MPI_ANY_SOURCE, MPI_ANY_TAG) involves searching multiple queues for messages. While this is not significant for small process counts, for large process counts the cost increases quickly.

Avoiding Message Buffering (Single Copy Methods)

One of the most significant optimizations for bandwidth-sensitive applications in
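The following minimal sketch illustrates the restructuring suggested above for avoiding MPI_Pack and MPI_Unpack: the sender keeps its values in one contiguous array and transfers it with a plain MPI_Send, avoiding the extra copy on each side. The payload size, tag, and ranks are illustrative assumptions.

#include <mpi.h>
#include <stdio.h>

#define N 100                          /* illustrative payload size */

int main(int argc, char **argv)
{
    int rank, nprocs;
    static double coords[3 * N];       /* x0,y0,z0, x1,y1,z1, ... */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (nprocs >= 2) {
        if (rank == 0) {
            coords[0] = 1.0;           /* illustrative data */
            /* Contiguous array sent directly; no MPI_Pack scratch buffer. */
            MPI_Send(coords, 3 * N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(coords, 3 * N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("received coords[0] = %f\n", coords[0]);
        }
    }

    MPI_Finalize();
    return 0;
}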
t or hypertext on the Web: http://www.mpi-forum.org
- As a journal article in the International Journal of Supercomputer Applications, volume 8, number 3/4, 1994. See also International Journal of Supercomputer Applications, volume 12, number 1/4, pages 1 to 299, 1998.
- Book: Using MPI: Portable Parallel Programming with the Message-Passing Interface, by Gropp, Lusk, and Skjellum (publication TPD-0011)
- Newsgroup: comp.parallel.mpi
- SGI manual: SpeedShop User's Guide

Obtaining Publications

You can obtain SGI documentation in the following ways:
- See the SGI Technical Publications Library at http://docs.sgi.com. Various formats are available. This library contains the most recent and most comprehensive set of online books, release notes, man pages, and other information.
- You can also view man pages by typing man title on a command line.

Conventions

The following conventions are used throughout this document:

Convention      Meaning
command         This fixed-space font denotes literal items such as commands, files, routines, path names, signals, messages, and programming language structures.
manpage(x)      Man page section identifiers appear in parentheses after man page names.
variable        Italic typeface denotes variable entries and words or concepts being defined.
user input      This bold, fixed-space font denotes literal items that the user enters in interactive sessions. Outp
te and MPI requests from nonblocking message passing routines (for example, MPI_Isend). Formerly, these were different types of request objects and needed to be kept separate: one was called MPIO_Request and the other MPI_Request. Under MPT 1.8 and later, however, this distinction is no longer necessary. You can freely mix request objects returned from I/O and MPI routines in calls to MPI_Wait, MPI_Test, and their variants.

Must I modify my code to replace calls to MPIO_Wait with MPI_Wait and recompile?

No. If you have an application that you compiled prior to MPT 1.8, you can continue to execute that application under MPT 1.8 and beyond without recompiling. Internally, MPT uses the unified requests and, for example, translates calls to MPIO_Wait into calls to MPI_Wait.

Why do I see stack traceback information when my MPI job aborts?

This is a new feature beginning with MPT 1.8. More information can be found in the MPI(1) man page in the descriptions of the MPI_COREDUMP and MPI_COREDUMP_DEBUGGER environment variables.

Index

A
Argument checking, 17

C
Code hangs, 36
Combining MPI with tools, 38
Components, 3

D

I
Internal statistics, 21

M
Memory placement and policies, 28
Memory use size problems, 38
Modifying code for MPI_Wait, 40
MPI jobs, suspending, 32
MPI launching problems, 38
MPI-2 c
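Returning to the unified request handling described at the start of this passage, the following minimal sketch waits on an I/O request and a message-passing request with a single MPI_Waitall call. The file name, offsets, tag, and ring-style exchange are illustrative assumptions.

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nprocs, data, inbuf = 0;
    MPI_File fh;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    data = rank;

    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* One request comes from a nonblocking file write, the other from a
       nonblocking receive; under MPT 1.8 and later they can share the
       same request array and a single MPI_Waitall call. */
    MPI_File_iwrite_at(fh, (MPI_Offset)(rank * sizeof(int)),
                       &data, 1, MPI_INT, &reqs[0]);
    MPI_Irecv(&inbuf, 1, MPI_INT, (rank + nprocs - 1) % nprocs, 0,
              MPI_COMM_WORLD, &reqs[1]);
    MPI_Send(&data, 1, MPI_INT, (rank + 1) % nprocs, 0, MPI_COMM_WORLD);

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}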
tly spawned synchronize in MPI_Finalize.

In the SGI implementation, MPI processes are UNIX processes. As such, the general rule regarding handling of signals applies as it would to ordinary UNIX processes.

In addition, the SIGURG and SIGUSR1 signals can be propagated from the mpirun process to the other processes in the MPI job, whether they belong to the same process group on a single host or are running across multiple hosts in a cluster. To make use of this feature, the MPI program must have a signal handler that catches SIGURG or SIGUSR1. When the SIGURG or SIGUSR1 signals are sent to the mpirun process ID, the mpirun process catches the signal and propagates it to all MPI processes.

Most MPI implementations use buffering for overall performance reasons, and some programs depend on it. However, you should not assume that there is any message buffering between processes, because the MPI Standard does not mandate a buffering strategy. Table 3-1 on page 11 illustrates a simple sequence of MPI operations that cannot work unless messages are buffered. If sent messages were not buffered, each process would hang in the initial call, waiting for an MPI_Recv call to take the message. Because most MPI implementations do buffer messages to some degree, a program like this does not usually hang. The MPI_Send calls return after putting the messages into buffer space, and the MPI_Recv cal
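The following minimal sketch shows the exchange pattern being described: a send-first, receive-second ordering that completes only if the library buffers the messages, together with the MPI_Sendrecv form that does not depend on buffering. It assumes the job is run with exactly two ranks; the tag and message size are illustrative.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, other, out, in;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    other = 1 - rank;          /* assumes exactly two ranks in the job */
    out = rank;

    /* Unsafe ordering from the discussion above: both ranks send first
     * and receive second, so the program completes only if the library
     * buffers the messages (increasingly unlikely as messages grow).
     *
     *   MPI_Send(&out, 1, MPI_INT, other, 0, MPI_COMM_WORLD);
     *   MPI_Recv(&in,  1, MPI_INT, other, 0, MPI_COMM_WORLD,
     *            MPI_STATUS_IGNORE);
     */

    /* Portable form: MPI_Sendrecv does not depend on buffering. */
    MPI_Sendrecv(&out, 1, MPI_INT, other, 0,
                 &in,  1, MPI_INT, other, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d received %d\n", rank, in);

    MPI_Finalize();
    return 0;
}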
ut is shown in nonbold, fixed-space font.
[ ]             Brackets enclose optional portions of a command or directive line.
...             Ellipses indicate that a preceding element can be repeated.

Reader Comments

If you have comments about the technical accuracy, content, or organization of this publication, contact SGI. Be sure to include the title and document number of the publication with your comments. Online, the document number is located in the front matter of the publication. In printed publications, the document number is located at the bottom of each page.

You can contact SGI in any of the following ways:
- Send e-mail to the following address: techpubs@sgi.com
- Use the Feedback option on the Technical Publications Library Web page: http://docs.sgi.com
- Contact your customer service representative and ask that an incident be filed in the SGI incident tracking system.
- Send mail to the following address: Technical Publications, SGI, 1500 Crittenden Lane, M/S 535, Mountain View, California 94043-1351

SGI values your comments and will respond to them promptly.

Chapter 1. Introduction

Message Passing Toolkit (MPT) is a software package that supports interprocess data exchange for applications that use concurrent, cooperating processes on a single host or on multiple hosts. Data exchange is done through message passing, which is the use of libra
vantage of this placement option, the following requirements must be met:

- When running the application, you must set the MPI_OPENMP_INTEROP shell variable.
- To compile the application, you must use a compiler that supports the -mp compiler option. This hybrid model placement option is not available with other compilers.

MPI reserves nodes for this hybrid placement model based on the number of MPI processes and the number of OpenMP threads per process, rounded up to the nearest multiple of 2. For example, if 6 OpenMP threads per MPI process are going to be used for a 4 MPI process job, MPI will request a placement for 24 (4 x 6) CPUs on the host machine. You should take this into account when requesting resources in a batch environment or when using cpusets.

In this implementation, it is assumed that all MPI processes start with the same number of OpenMP threads, as specified by the OMP_NUM_THREADS (or equivalent) shell variable at job startup. The OpenMP threads are not actually pinned to a CPU but are free to migrate to any of the CPUs in the OpenMP thread group for each MPI rank. The pinning of the OpenMP thread to a specific CPU will be supported in a future release. A minimal skeleton of this hybrid model is sketched below.

Tuning for Running Applications Across Multiple Hosts

When you are running an MPI application across a cluster of hosts, there are additional run-time environment settings and configurat
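The following minimal skeleton illustrates the hybrid model described above, with MPI handling the coarse-grained parallelism and OpenMP threads sharing the loop inside each rank. The loop bounds and the reduction are illustrative assumptions; launch-time settings such as MPI_OPENMP_INTEROP and OMP_NUM_THREADS are made in the shell, not in the code.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, i;
    double sum = 0.0, total = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Coarse-grained parallelism: each MPI rank owns a block of work.
       Fine-grained parallelism: OpenMP threads share the block. */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < 1000; i++)
        sum += (double)(rank * 1000 + i);

    MPI_Reduce(&sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("total = %f\n", total);

    MPI_Finalize();
    return 0;
}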
xpmem/pid file associated with it. The number of pages owned by this process which are prevented from paging by XPMEM can be displayed by concatenating the /proc/xpmem/pid file, for example:

cat /proc/xpmem/5562
pages pinned by XPMEM: 17

To unpin the pages for use by other processes, the administrator must first suspend all the processes in the application. The pages can then be unpinned by echoing any value into the /proc/xpmem/pid file, for example:

echo 1 > /proc/xpmem/5562

The echo command will not return until that process's pages are unpinned. When the MPI application is resumed, the XPMEM kernel module will prevent these pages from paging, as they are referenced by the application.

Chapter 7. Troubleshooting and Frequently Asked Questions

This chapter provides answers to some common problems users encounter when starting to use SGI's MPI, as well as answers to other frequently asked questions.

What are some things I can try to figure out why mpirun is failing?

Here are some things to investigate:
- Look in /var/log/messages for any suspicious errors or warnings. For example, if your application tries to pull in a library that it cannot find, a message should appear here. Only the root user can view this file.
- Be sure that you did not misspell the name of your application.
- To find rld dynamic li
