LSF Parallel User's Guide
Contents
1. mpif77 -c myjob.f

   This command produces myjob.o, which contains the object code for this LSF Parallel source file. To link the myjob.o object file with the LSF Parallel libraries and create an executable, enter:

   mpif77 -o myjob myjob.o

   As with most Fortran 77 compilers, the -o flag specifies that the executable produced by the linker is to be named myjob. The Fortran 77 source file can be compiled and linked in one step using the following command:

   mpif77 myjob.f -o myjob

   Building a Heterogeneous Parallel Application

   The LSF Parallel system provides a host type substitution facility that allows a heterogeneous (multiple-architecture) distributed application to be submitted to the LSF Batch system. The following steps outline how to build and deploy a heterogeneous application:

   1. Design the parallel application.
   2. Compile the application on all LSF host type architectures that will be used to support this application. Note: The binaries must either be named with valid LSF host type extensions or placed in directories named with valid LSF host type path names.
   3. Place the binaries in the appropriate shared file system, or distribute them accordingly.
   4. Use the %a notation to submit the parallel application to th
2. LSF Parallel User's Guide. LSF Version 3.2, First Edition, August 1998. Platform Computing Corporation.

   Copyright © 1998 Platform Computing Corporation. All rights reserved. This document is copyrighted. This document may not, in whole or part, be copied, duplicated, reproduced, translated, electronically stored, or reduced to machine-readable form without prior written consent from Platform Computing Corporation.

   Although the material contained herein has been carefully reviewed, Platform Computing Corporation does not warrant it to be free of errors or omissions. Platform Computing Corporation reserves the right to make corrections, updates, revisions, or changes to the information contained herein.

   UNLESS PROVIDED OTHERWISE IN WRITING BY PLATFORM COMPUTING CORPORATION, THE PROGRAM DESCRIBED HEREIN IS PROVIDED AS IS, WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. IN NO EVENT WILL PLATFORM BE LIABLE TO ANYONE FOR SPECIAL, COLLATERAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING ANY LOST PROFITS OR LOST SAVINGS, ARISING OUT OF THE USE OF OR INABILITY TO USE THIS PROGRAM.

   LSF, LSF Base, LSF Batch, LSF JobScheduler, LSF MultiCluster, LSF Make, LSF Analyzer, LSF Parallel, Platform Computing, and the Platform Computing and LSF logos are trademarks of Pl
3. Batch Execution

   The LSF Parallel system uses the features of the LSF Batch system to select the most suitable hosts and to submit and interact with parallel batch jobs. The batch job is submitted to a queue using the bsub command, as described in Submitting Batch Jobs on page 26, and the LSF Batch and LSF Parallel systems attend to the rest. Like serial batch jobs, parallel batch jobs pass through many states. See Batch Job Status on page 23.

   This part of the chapter discusses the following topics:
   - Batch Job Status
   - Submitting Batch Jobs
   - Suspending Jobs
   - Resuming Jobs
   - Monitoring Job Status
   - Terminating Jobs
   - Running Heterogeneous Parallel Applications

   Batch Job Status

   Each batch job submitted to the LSF Batch system passes through a series of states until the job completes normally (success) or abnormally (failure). The bjobs command allows the status of the batch jobs to be monitored (see Monitoring Job Status on page 30). The ability to monitor batch job status extends to the individual processes (tasks) of the parallel application.

   Figure 3. Batch Job State Diagram (figure not reproduced; surviving labels: "suitable host found", "normal completion or abnormal")
4. and c.out processes are guaranteed to run on the same host. The b.out processes may run on a different host, depending on the resources available and the LSF Batch system scheduling algorithms. For a complete list of mpirun options and environment variable controls, refer to the mpirun man page and the HP MPI User's Guide (version 1.4).

   SGI MPI

   The -mpi argument on the bsub and pam command line is a replacement for mpirun in the SGI environment. Everything after -mpi must appear exactly as it would if mpirun were being used.

   Example: To run the a.out job and have the LSF Batch system select the host, the command

   mpirun -np 4 a.out

   is entered as:

   bsub pam -mpi -np 4 a.out

   Example: To run a multihost job and have the LSF Batch system select the hosts, the command

   mpirun -f appfile

   is entered as:

   bsub pam -mpi -f appfile

   where appfile contains the following entries:

   foo -np 4 a.out
   bar -np 4 b.out
   foo -np 2 c.out

   For a complete list of mpirun options and environment variable controls, refer to the mpirun man page.

   SUN HPC MPI

   When running LSF Batch jobs on Sun platforms, you can include the Sun-specific argument -sunhpc on the bsub command line, after any other bsub arguments. The following arguments to -sunhpc provide additional control over bsub behavior in a Sun HPC environment: n proces
5. batch job named myjob to run on any two processors having either SUNSOL or RS6K architectures, the following command can be used:

   bsub -n 2 pam /user/batch/%a/myjob

   To specify SUNSOL and RS6K in an environment with other architectures, the following command is specified with the -R resource option:

   bsub -n 2 -R "type==SUNSOL || type==RS6K" pam /user/batch/%a/myjob

   For both these examples, the Parallel Application Manager (PAM) substitutes the %a notation with the correct LSF host type path name. The paths used to select the binaries are:

   /user/batch/SUNSOL/myjob
   /user/batch/RS6K/myjob

   Interactive Execution

   The LSF Parallel system uses the Parallel Application Manager (PAM) to control the execution of parallel batch jobs interactively. Batch jobs are executed interactively using the pam command (see The pam Command on page 35).

   When submitting batch jobs using the pam command, the LSF Batch system is bypassed: the jobs are not queued. Batch jobs are run immediately upon entering the command if the specified resource requirements are met. If the resources are not available, the job is not run. Since the jobs do not wait, interactive job execution is beneficial for debugging parallel applications.

   To successfully execute an interactive parallel batch job, the pam command must be reissued at a time when the resources are available. If
6. graphical tool for comprehensive workload data analysis. It processes cluster-wide job logs from LSF Batch and LSF JobScheduler to produce statistical reports on the usage of system resources by users on different hosts through various queues.

   LSF Parallel is a software product that manages parallel job execution in a production network environment.

   LSF Make is a distributed and parallel Make, based on GNU Make, that simultaneously dispatches tasks to multiple hosts.

   LSF Base is the software upon which all the other LSF products are based. It includes the network servers (LIM and RES), the LSF API, and load sharing tools.

   There are two editions of the LSF Suite:

   LSF Enterprise Edition: Platform's LSF Enterprise Edition provides a reliable, scalable means for organizations to schedule, analyze, and monitor their distributed workloads across heterogeneous UNIX and Windows NT computing environments. LSF Enterprise Edition includes all the features in LSF Standard Edition (LSF Base and LSF Batch), plus the benefits of LSF Analyzer and LSF MultiCluster.

   LSF Standard Edition: The foundation for all LSF products, Platform's Standard Edition consists of two products, LSF Base and LSF Batch. LSF Standard Edition offers users robust load sharing and sophisticated batch scheduling across distributed UNIX and Windows NT computing environments.

   Technical Assistance
7. Topics in this chapter:
   - Batch Execution
   - Batch Job Status
   - Submitting Batch Jobs
   - Suspending Jobs
   - Resuming Jobs
   - Monitoring Job Status
   - Terminating Jobs
   - Running Heterogeneous Parallel Applications
   - Interactive Execution
   - The pam Command
   - Process Status Report
   - Getting Host Information

   Job Submission Methods

   The LSF Parallel system supports batch submission of parallel applications (batch jobs) using the facilities of the LSF Batch system. Interactive execution of parallel applications is also supported, under control of the Parallel Application Manager (PAM).

   Batch Execution

   When submitting a parallel batch job, the LSF Parallel system uses the advanced features of the LSF Batch system to select, submit, and interact with the individual tasks of the parallel batch job. The batch job is submitted to a queue using the bsub command, and the LSF Batch system attends to the details.

   A parallel batch job is submitted to a queue, where it waits until it reaches the front of the queue and the appropriate resources become available. The batch job is then dispatched to the most suitable hosts for execution. This sophisticated queuing system allows batch jobs to run as soon as th
8. on the number of parallel processes.

   Running Heterogeneous Parallel Applications

   The LSF Parallel system provides an LSF host type substitution facility that allows a heterogeneous (multiple-architecture) distributed application to be submitted to the LSF Batch system.

   Assumptions:
   1. The binary will run on each specified platform, or a binary exists for each platform.
   2. The binaries for the parallel application are specified using the %a notation format (see Building a Heterogeneous Parallel Application on page 17).

   Example: Using the LSF host type extension format to specify the batch job named myjob to run on any two available processors having either Sun Solaris (SUNSOL) or RS6000 (RS6K) architectures, the following command can be used:

   bsub -n 2 pam myjob.%a

   To specify SUNSOL and RS6K in an environment with other architectures, the following command is specified with the -R resource option:

   bsub -n 2 -R "type==SUNSOL || type==RS6K" pam myjob.%a

   For both these examples, the Parallel Application Manager (PAM) substitutes the %a notation with the correct LSF host type extension. The binaries used are named:

   myjob.SUNSOL
   myjob.RS6K

   Example: Using the LSF host type path name format to specify the
9. or the LSF Administrator can resume (using the bresume command) a batch job that is in the USUSP state; the batch job state then transitions to SSUSP.

   Table 3. Batch Job State Descriptions (continued)

   SSUSP: A batch job can be suspended by the LSF Batch system after it has been dispatched. This is done if the load on the execution host or hosts becomes too high, in order to maximize host performance or to guarantee interactive response time. The LSF Batch system suspends batch jobs according to their priority, unless the scheduling policy associated with the job dictates otherwise. A batch job may also be suspended if the job queue has a time window and the current time exceeds the window. The LSF Batch system can later resume a system-suspended (SSUSP) job if the load condition on the execution host decreases or the time window of the queue opens.

   EXIT: A batch job can terminate abnormally (fail) from any state, for many reasons. Abnormal job termination can occur when the job:
   - Is cancelled using the bkill command by the owner or the LSF administrator while in the PEND, RUN, or USUSP state
   - Is aborted by LSF because it cannot be dispatched before a termination deadline
   - Fails to start successfully (e.g., the wrong executable was specified at the time of job submission)
   - Crashes during execution

   Parallel Batch Job Behavior

   - When o
10. Figure 3 shows the possible states a batch job can pass through when submitted to the LSF Batch system. The diagram also shows the activities and commands that cause the state transitions. The batch job states are described in Table 3.

    Table 3. Batch Job State Descriptions

    PEND: A batch job is pending when it is submitted (using the bsub command) and waiting in a queue. It remains pending until it moves to the head of the queue and all conditions for its execution are met. The conditions may include:
    - Start time specified by the user when the job is submitted
    - Load conditions on qualified hosts
    - Time windows during which the queue can dispatch jobs and qualified hosts can accept jobs
    - Relative priority to other users and jobs
    - Availability of the specified resources

    RUN: A batch job is running when it has been dispatched to a host.

    DONE: A batch job is done when it has normally completed its execution.

    PSUSP: The job owner or the LSF Administrator can suspend (using the bstop command) a batch job while it is pending. Also, the job owner or the LSF Administrator can resume (using the bresume command) a batch job that is in the PSUSP state; the batch job state then transitions to PEND.

    USUSP: The job owner or the LSF Administrator can suspend (using the bstop command) a batch job after it has been dispatched. Also the owner
11. to start using the LSF Parallel system. They are compiling, linking, and submitting parallel applications. The example used in this chapter is a distributed version of the Hello World program, named myjob, written in C.

    This chapter discusses the following topics:
    - Writing a Distributed Application
    - Compiling and Linking the Application
    - Running the Application

    Note: If the commands cannot be executed or the man pages cannot be viewed, the appropriate directories may need to be added to the system's path; check with the system administrator.

    Writing a Distributed Application

    This example program, written in C, is a distributed version of the Hello World program, named myjob. Use an editor to enter the code for this application. After the code is entered, save it in a file named myjob.c.

    /* File: myjob.c */
    #include <stdio.h>
    #include "mpi.h"                 /* MPI header file */

    int main(int argc, char **argv)
    {
        int myrank;                  /* Rank of this process */
        int n_processes;             /* Number of Processes */
        int srcrank;                 /* Rank of the Sender */
        int destrank;                /* Rank of the receiver */
        char mbuf[512];              /* Message buffer */
        MPI_Status mstat;            /* Return Status of an MPI operation */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
        MPI_Comm_size(MPI_COMM_WORLD, &n_processes);
        if (myrank ... 0) sprintf(mbuf
12. 3/31/98 10:31:58
    2  host2  myjob  Done  03/31/98 10:31:59
    3  host3  myjob  Done  03/31/98 10:31:59
    3  host4  myjob  Done  03/31/98 10:31:59

    Table 4. pam n Job Status

    Status                  Description
    Done                    Process successfully completed with exit code of 0.
    Exit (code)             Process unsuccessfully completed with an exit code of code.
    Signaled (signal)       Process was terminated by signal.
    Exit status unknown     Connection broken; exit status unknown.
    Killed by PAM (signal)  PAM shut down the process using signal.
    Unreachable             PAM is unable to reach the host after a broken connection. There is no way to determine the state of the process.
    Runaway                 Process is still running; it cannot be killed by PAM.
    Suspend                 Process suspended.
    Undefined               ???
    Run                     Process running.
    Local RES died          ???

    Note: Use the -t option to suppress the process status report. For example:

    pam -t -n 4 myjob

    Getting Host Information

    The lshosts command is used to display information about LSF host configurations, including name, type, model, CPU normalization factor, number of CPUs, total memory, and available resources.

    Example:

    lshosts
    HOST_NAME  type    model     cpuf  ncpus  maxmem  maxswp  server  RESOURCES
    host1      SGI64   SGI4D35   2.0   -      96M     153M    Yes     (list js irix gla)
    host99     SUNSOL  SunSparc  12.0  4      1024M   1930M   Yes     (solaris cs bigmem)
    host2      LINUX   I486_33   14.0  -      30M     64M     Yes     (linux)
    host7      SUN41   SPA
13. Assistance

    Contact your reseller for Technical Assistance. You can also contact the Technical Support Department at Platform Computing Corporation in the following ways:
    - Mail: LSF Technical Support, Platform Computing Corporation, 3760 14th Avenue, Markham, Ontario, Canada L3R 3T7
    - Telephone: 1-905-948-8448
    - Toll-free: 1-87PLATFORM (1-877-528-3676)
    - Fax: 1-905-948-9975
    - Email: support@platform.com

    Note: When contacting Platform Computing, please include the full name of your company.

    Alternatively, you can find answers to many of your questions by visiting the Platform Computing Corporation home page on the World Wide Web. Point your browser to http://www.platform.com.

    Platform welcomes your comments and suggestions on this document; send email to doc@platform.com, or contact us by mail, fax, or phone.

    1 Introduction

    This chapter describes the LSF Parallel system and its architecture. This chapter discusses the following topics:
    - What Is the LSF Parallel System?
    - How Does LSF Parallel Fit Into the LSF Batch System?
    - LSF Parallel Architecture

    What Is the LSF Parallel System?

    The LSF Parallel system is a fully supported commercial software system that supports the programming, testing, and
14. If no (n), the user must interactively specify the hosts.

    Option                        Description
    -t                            Suppresses the printing of the job task summary report to the standard output at job completion.
    -v                            Specifies that the job is to be run in verbose mode. The names of the selected hosts are displayed.
    -server_addr location         Specifies the location of the PAM server. The location is specified in the hostname:port_no format.
    -server_jobid location        Specifies the location of the PAM server. The location is specified using the jobid for the server PAM job.
    -server_jobname location      Specifies the location of the PAM server. The location is specified using the jobname for the server PAM job.
    -m host                       Specifies the list of hosts on which to run the parallel batch job tasks. The number of host names specified indicates the number of processors requested. Note: This option cannot be used with options -R or -n, and is ignored when pam is used as a bsub option.
    -n num                        Specifies the number of processors required to run the parallel job. Note: This option cannot be used with option -m, and is ignored when pam is used as a bsub option.
    -R req                        Specifies the resource requirements for host selection. The parallel job is executed on the hosts that meet these requirements. Default: r15s:pg. Note: This option is ignored when pam is used a
15. LSF Parallel Architecture

    Table 2. Description of LSF Parallel Components

    SBD: The Slave Batch Daemons are batch job execution agents residing on the execution hosts. SBD receives jobs from the MBD in the form of a job specification and starts RES to run the job according to the specification. SBD reports the batch job status to the MBD whenever the job state changes.

    PAM: The Parallel Application Manager is the point of control for the LSF Parallel system. PAM is fully integrated with the LSF system. PAM interfaces the user application with the LSF system. If PAM or its host crashes, each RES will terminate all tasks under its management. This avoids the problem of orphaned processes.

    RES: The Remote Execution Servers reside on each execution host. RES manages all remote tasks and forwards signals, standard I/O, resource consumption data, and parallel job information between PAM and the tasks.

    application task: The individual process of a parallel application.

    execution hosts: The most suitable hosts to execute the batch job, as determined by the LSF Batch system.

    first execution host: The host name at the top of the execution host list, as determined by the LSF Batch system.

    2 Getting Started

    The purpose of this chapter is to quickly introduce the concepts needed
16. Machine Dependent Layer

    MPI Library

    The Message Passing Interface (MPI) library is a message-passing library that must be linked to the parallel applications that are to be run in the LSF Batch system. The MPI library translates MPI message calls to messages for the machine-dependent layer, and it interfaces the user application to PAM. See Vendor MPI Implementations on page 41 for a description of vendor-specific MPI implementations.

    PAM

    The Parallel Application Manager (PAM) is the point of control for the LSF Parallel system. PAM is fully integrated with the LSF Batch system. PAM interfaces the user application with the LSF Batch system. For all parallel application processes (tasks), PAM:
    - Maintains the communication connection map
    - Monitors and forwards control signals
    - Receives requests to add, delete, start, and connect tasks
    - Monitors resource usage while the user application is running
    - Enforces job-level resource limits
    - Collects resource usage information and exit status upon termination
    - Handles standard I/O

    LSF Batch System

    The LSF Batch system is a sophisticated resource-based batch job scheduling system. It accepts user jobs and holds them in queues until suitable hosts are available and resource requirements are satisfied. Host selection is based on up-to-the-minute load information provided by t
17. Topics in the Preface (continued):
    - Command Notation
    - LSF Suite 3.2
    - Technical Assistance

    Audience

    This guide provides reference and tutorial material for:
    - MPI programmers who want to compile and link MPI programs for use with the LSF Parallel system
    - Users of the LSF Parallel system who want to submit, execute, monitor, and interact with parallel applications using the LSF Batch system

    The users of this guide are expected to be familiar with:
    - Programming in the C or Fortran 77 language
    - Message Passing Interface (MPI) concepts
    - The LSF Batch system

    Related Publications

    This guide focuses on using parallel applications with the LSF Suite of products, primarily the LSF Batch system. It assumes familiarity with the LSF Suite of products and the MPI standard. The following materials provide useful background about using the LSF Suite of products and MPI.

    These guides are available from Platform Computing Corporation:
    - LSF Suite Installation Guide
    - LSF Batch Administrator's Guide
    - LSF Batch Administrator's Quick Reference
    - LSF Batch User's Guide
    - LSF Batch User's Quick Reference
    - LSF JobScheduler Administrator's Guide
    - LSF JobScheduler User's Guide
    - LSF Analyzer User's Guide
    - LSF Programmer's Guide

    This document is available on the World Wide Web: MPI: A Message Passing Interfa
18. RCSLC  3.0   -  15M   29M   Yes  (sparc bsd sun41)
    host3  ALPHA   DEC5000   5.0   -  88M   384M  Yes  (cs bigmem alpha gla)
    host6  ALPHA   DEC5000   5.0   -  84M   350M  Yes  (gla)
    host4  SUNSOL  SunSparc  12.0  2  256M  733M  Yes  (solaris cs bigmem)
    host5  SGI     SGIINDIG  15.0  -  96M   300M  Yes  (irix)
    host8  SUNSOL  SunSparc  12.0  -  56M   90M   Yes  (solaris cs bigmem)

    A Vendor MPI Implementations

    HP MPI

    When you use mpirun in stand-alone mode, you provide it the host names to be used by the MPI job. To achieve better resource utilization, you can have LSF manage the allocation of hosts, coordinating the start-up phase with mpirun. This is done by preceding the regular HP MPI mpirun command with:

    bsub pam -mpi

    Example: To run a single-host job and have the LSF Batch system select the host, the command

    mpirun -np 14 a.out

    is entered as:

    bsub pam -mpi mpirun -np 14 a.out

    Example: To run a multi-host job and have the LSF Batch system select the hosts, the command

    mpirun -f appfile

    is entered as:

    bsub pam -mpi mpirun -f appfile

    where appfile contains the following entries:

    -h foo -np 8 a.out
    -h bar -np 4 b.out
    -h foo -np 2 c.out

    In this example, the hosts foo and bar are treated as symbolic names and refer to the actual hosts that the LSF Batch system allocates to the job. The a.out
19. atform Computing Corporation. Other products or services mentioned in this document are identified by the trademarks or service marks of their respective companies or organizations. Printed in Canada.

    Revision Information for the LSF Parallel User's Guide

    Edition  Description
    First    This document describes the LSF Parallel system released with LSF Suite version 3.2.

    Table of Contents

    Preface
      Audience
      Related Publications
      Typographic Conventions
      Command Notation
      LSF Suite 3.2
      Technical Assistance
    1 Introduction
      What Is the LSF Parallel System?
      How Does LSF Parallel Fit Into the LSF Batch System?
      LSF Parallel Architecture
    2 Getting Started
      Writing a Distributed Application
      Compiling and Linking the Application
      Running the Application
    3 Building Parallel Applications
      Including the Header File
      Compiling and Linking
      Building a Heterogeneous Parallel Application
    4 Submitting Parallel Applications
      Job Submission Methods
      Batch Execution
20. b

    When the parallel batch job named myjob is submitted to the LSF Batch system and dispatched to host1, host2, host3, and host4, the bjobs command will display:

    bjobs
    JOBID  USER   STAT  QUEUE  FROM_HOST  EXEC_HOST  JOB_NAME  SUBMIT_TIME
    713    user1  RUN   batch  host99     host1      myjob     Sep 12 16:30
                                          host2
                                          host3
                                          host4

    Suspending Jobs

    The bstop Command

    The bstop command is used to suspend parallel batch jobs running in the LSF Batch system. The syntax for using the bstop command in the LSF Parallel system is:

    bstop jobId

    Example: The following command suspends the parallel batch job named myjob, running in the LSF Batch system with a job ID of 713:

    bstop 713

    When the parallel batch job named myjob is suspended, the bjobs command will display the batch job state of USUSP:

    bjobs
    JOBID  USER   STAT   QUEUE  FROM_HOST  EXEC_HOST  JOB_NAME  SUBMIT_TIME
    713    user1  USUSP  batch  host99     host1      myjob     Sep 12 16:32
                                           host2
                                           host3
                                           host4

    Resuming Jobs

    The bresume Command

    The bresume command is used to resume suspended parallel batch jobs running in the LSF Batch system. The syntax for using the bresume command in the LSF Paralle
21. ce Standard. Message Passing Interface Forum, University of Tennessee, 1995. http://www.mcs.anl.gov/mpi/mpi-report-1.1/mpi-report.html

    These books should be available at your local bookstore:
    - MPI: The Complete Reference, by Marc Snir, Steve W. Otto, Steven Huss-Lederman, David W. Walker, and Jack Dongarra. MIT Press, 1995.
    - Using MPI, by William Gropp, Ewing Lusk, and Anthony Skjellum. MIT Press, 1994.
    - Parallel Programming with MPI, by Peter Pacheco. Morgan Kaufmann Publishers, Inc., 1997.
    - Designing and Building Parallel Programs, by Ian Foster. Addison-Wesley, 1995.

    Online documentation supplied by Platform Computing Corporation: man pages (accessed with the man command) for all commands and MPI functions.

    Typographic Conventions

    Table 1. The typographic conventions used in this guide

    Typeface         Meaning
    AaBbCcDdEeFfGg   The names of commands, files, and directories; on-screen computer output
    AaBbCcDdEeFfGg   What you type, contrasted with on-screen computer output
    AaBbCcDdEeFfGg   Command-line placeholder: replace with a real name or value
    AaBbCcDdEeFfGg   Book titles, new words or terms, or words to be emphasized

    Command Notation

    Table 2. The command notation used in this guide

    Notation  Meanin
22. dentify processes in the STOPPED state, issue the ps command with the -el argument:

    orpheus 215 > ps -el
    F   S  UID  PID  PPID  C  PRI  NI  ADDR     SZ  WCHAN  TTY  TIME  CMD
    19  T  0    0    0     0  0    SY  0274e38  0               0:00  sched

    Here, the sched command is in the STOPPED state, as indicated by the T entry in the S (State) column. Note that when spawning a process in the STOPPED state under LSF, the name of your program will not appear in the ps output. Instead, the stopped process will be identified as a RES daemon.

    Example: To start a 1-CPU interactive job on the PAM-enabled queue hpc in the STOPPED state:

    bsub -I -n 1 -q hpc -sunhpc -s jobname
23. e LSF Batch system.

    LSF Host Type Naming Convention

    Binaries must be compiled on the target host type architectures. The binary must be named using a valid LSF host type string, either as the extension to its name or as the name of a directory in its path. lshosts displays a list of valid LSF host types. When the %a notation is used to submit a parallel application to the LSF Batch system, the target host type string is substituted. All binaries for a specific application must be named using the same host type substitution format (i.e., binary extension or path name).

    Example: The following binaries are named with appropriate host type extensions to identify the target platform on which they are to run. These binaries are named to use Sun Solaris and RS6000 architecture machines:

    myjob.SUNSOL
    myjob.RS6K

    Example: The following binaries are named with appropriate path names to identify the target platform on which they are to run. These binaries are named to use Sun Solaris and RS6000 architecture machines:

    /user/batch/SUNSOL/myjob
    /user/batch/RS6K/myjob

    %a Notation

    After a parallel application is submitted to the LSF Batch system, the Parallel Application Manager (PAM) replaces the %a notation with the appropriate LSF host type string. PAM then launches the individual tasks of the application on the remote hosts using the c
24. e compile

    This chapter discusses the basic steps in building a parallel application. We discuss the basic structure of the application and how it is compiled and linked.

    Note: This chapter focuses on building a parallel application to make optimal use of the LSF Batch system. It assumes familiarity with the LSF Suite of products and standard MPI. Therefore, it does not discuss writing MPI programs.

    This chapter contains the following topics:
    - Including the Header File
    - Compiling and Linking
    - Building a Heterogeneous Parallel Application

    Including the Header File

    A set of PAM-aware header files is included with the LSF Parallel system installation. They are typically located in the LSF_INCLUDEDIR/lsf/mpi directory. The header files contain the MPI definitions, macros, and function prototypes necessary for using the LSF Parallel system.

    Include Syntax

    The include statement must be placed at the top of any parallel application that calls MPI routines. The include statement looks like this:

    In C applications:

    #include <mpi.h>

    In Fortran 77 applications:

    INCLUDE 'mpif.h'

    Note: If the header files are not located in the LSF_INCLUDEDIR/lsf/mpi directory, check with your system administrator.
25. e suitable host resources become available.

Note: The batch job may not run immediately; it may be queued until the appropriate resources become available.

To use the bsub command to submit a parallel batch job to the LSF Batch system, see Submitting Batch Jobs on page 26.

Interactive Execution

When a parallel batch job is executed interactively, the pam command is used to invoke PAM. Jobs started with the pam command bypass the LSF Batch system and are not queued. They run immediately upon entering the command if the specified resource requirements are met; if the resources are not available, the job is not run. Since the jobs do not wait, interactive job execution is beneficial for debugging parallel applications.

Direct interaction is supported. All input and output is handled transparently between the local and execution hosts.

To use the pam command to execute a parallel batch job interactively, see Interactive Execution on page 34.

LSF Parallel and LSF Batch Commands

The LSF Batch and LSF Parallel products provide commands and man pages for these commands. If the commands cannot be executed or the man pages cannot be viewed, the appropriate directories may need to be added to the system's path; check with the system administrator.
26. execution of parallel applications in production environments. The LSF Parallel system is fully integrated with the LSF Batch system, the de facto industry standard resource management software product, to provide load sharing in a distributed system and batch scheduling for compute-intensive jobs.

The LSF Parallel system provides support for:

• Dynamic resource discovery and allocation (resource reservation) for parallel batch job execution
• Transparent invocation of the distributed job processes
• Full control of the distributed job processes, to ensure no processes become unmanaged. This effectively reduces the possibility of one parallel job causing severe disruption to an organization's computer service.
• The standard MPI interface
• All major UNIX operating systems
• Full integration with the LSF Batch system, providing heterogeneous resource-based batch job scheduling

How Does LSF Parallel Fit Into the LSF Batch System?

The LSF Parallel system adopts a layered approach, shown in Figure 1, that is fully integrated with the LSF Batch system. In addition to the LSF Batch system resources, the following components make up the LSF Parallel system:

• The MPI Library
• The Parallel Application Manager (PAM)

Figure 1: Major Components of the LSF Parallel system (layers shown: User Application, MPI Library, PAM, LSF Batch System)
27. g

Apostrophes: Must be entered exactly as shown.
Commas: Must be entered exactly as shown.
Ellipsis: The preceding parameter can be repeated. Do not enter the ellipsis.
lower case italics: The parameter must be substituted.
OR bar: You must enter one of the items separated by the bar. You cannot enter more than one item.
Parenthesis: Must be entered exactly as shown.
Parameter in square brackets: The parameter within the brackets is optional. Do not enter the brackets.
Stacked braces: You must enter one of the items within the braces. You cannot enter more than one item. Do not enter the braces.
C shell prompt: Unless otherwise noted, the C shell prompt is used in all command examples.

LSF Suite 3.2

LSF is a suite of workload management products including the following: LSF Batch, LSF JobScheduler, LSF MultiCluster, LSF Analyzer, LSF Parallel, LSF Make, and LSF Base.

LSF Batch is a batch job processing system for distributed and heterogeneous environments, which ensures optimal resource sharing.
LSF JobScheduler is a distributed production job scheduling application that integrates heterogeneous servers into a virtual mainframe or virtual supercomputer.
LSF MultiCluster supports resource sharing among multiple clusters of computers using LSF products, while maintaining resource ownership and cluster autonomy.
LSF Analyzer is a
28. he master Load Information Manager (LIM).

LSF Batch runs user jobs on batch server hosts. It has sophisticated controls for sharing hosts with interactive users; there is no need to set aside dedicated hosts for processing batch jobs. See the LSF Batch User Guide and the LSF Batch Administrator's Guide for a detailed description of the LSF Batch system.

LSF Parallel Architecture

The LSF Parallel system fully utilizes the resources of the LSF Batch system for resource selection and for process invocation and control. The process of parallel batch job invocation and control is shown in Figure 2 and described in Table 1 on page 6. The LSF components shown in Figure 2 are described in Table 2 on page 6.

Figure 2: LSF Parallel Architecture (components shown: User Request, MBD, LIM, SBD and PAM on the first execution host, RES on each execution host)

The LSF Parallel system also supports interactive parallel job submission. The process is similar to that shown in Figure 2, except that the user request is submitted directly to PAM, which makes a simple placement query to LIM. Job queuing and resource reservations are not supported in interactive mode.

Table 1: LSF Parallel Batch Job Invocation

Stage  Description
1      User submits a paral
30. l mode
Thu Sep 12 16:39:17: Submitted from host <host99>, CWD <$HOME/Work/utopia/pass/pam>, 2-4 Processors Requested
Thu Sep 12 16:39:18: Started on 4 Hosts/Processors <host1> <host2> <host3> <host4>, Execution Home </pcc/s/user1>, Execution CWD </pcc/s/user1/Work/utopia/pass/pam>
Thu Sep 12 16:40:41: Resource usage collected. The CPU time used is 2 seconds. MEM: 281 Kbytes; SWAP: 367 Kbytes; PGIDs: 4; PIDs: 4

SCHEDULING PARAMETERS
            r15s  r1m  r15m  ut  pg  io  ls  it  tmp  swp  mem
loadSched
loadStop

            nresj
loadSched
loadStop

Terminating Jobs

The bkill Command

The bkill command is used to terminate parallel batch jobs running in the LSF Batch system. The syntax for using the bkill command in the LSF Parallel system is:

    bkill jobID options

Example: The following command terminates the parallel batch job named myjob running in the LSF Batch system with a job ID of 713:

    bkill 713

When the parallel batch job named myjob is terminated, the bjobs command will display the batch job state of EXIT:

    bjobs
    JOBID  USER   STAT  QUEUE  FROM_HOST  EXEC_HOST  JOB_NAME  SUBMIT_TIME
    713    user1  EXIT  batch  host99     host1      myjob     Sep 12 16:30
                                          host2
                                          host3
                                          host4

Note: The time taken to terminate a parallel batch job varies and depends
31. l system is:

    % bresume jobID

Example: The following command resumes the suspended parallel batch job named myjob running in the LSF Batch system with a job ID of 713:

    % bresume 713

When the parallel batch job named myjob is resumed, the bjobs command will display the batch job state of RUN or PEND:

    bjobs
    JOBID  USER   STAT  QUEUE  FROM_HOST  EXEC_HOST  JOB_NAME  SUBMIT_TIME
    713    user1  RUN   batch  host99     host1      myjob     Sep 12 16:34
                                          host2
                                          host3
                                          host4

Monitoring Job Status

The bjobs Command

The bjobs command is used to view the running status and resource usage of parallel batch jobs running in the LSF Batch system. The syntax for using the bjobs command in the LSF Parallel system is:

    bjobs options

Example: The following command displays the running status and resource usage of the jobs running in the LSF Batch system:

    bjobs
    JOBID  USER   STAT  QUEUE  FROM_HOST  EXEC_HOST  JOB_NAME  SUBMIT_TIME
    713    user1  RUN   batch  host99     host1      myjob     Sep 12 16:34
                                          host2
                                          host3
                                          host4

Example: The following command uses the -l option to display run-time resource usage (CPU, memory, and swap) as well as the running status of the jobs running in the LSF Batch system:

    bjobs -l
    Job Id 713 User Command myjob Project Status Interactiv Queue pseudo termina
32. lel batch job to the LSF Batch system.
2      MBD retrieves a list of suitable execution hosts from the master LIM.
3      MBD allocates (schedules, reserves) the execution hosts for the parallel batch job.
4      MBD dispatches the parallel batch job to the SBD on the first execution host that was allocated to the batch job.
5      SBD starts PAM on the same execution host.
6      PAM starts RES on each execution host allocated to the batch job.
7      RES starts the tasks on each execution host.

Table 2: Description of LSF Parallel Components

Part          Function
User Request  Batch job submission to the LSF Batch system using the bsub command.
MBD           The Master Batch Daemon is the policy center for the LSF Batch system. It maintains information about batch jobs, hosts, users, and queues. All of this information is used in scheduling batch jobs to hosts.
LIM           The Load Information Manager is a daemon process running on each execution host. LIM monitors the load on its host and exchanges this information with the Master LIM. The Master LIM resides on one execution host and collects information from the LIMs on all other hosts in the LSF cluster. If the Master LIM becomes unavailable, another host will automatically take over. For batch submission, the Master LIM provides this information to the MBD. For interactive execution, the Master LIM provides simple placement advice.
33. ne task exits with a nonzero return value, all the other tasks will run until they complete (DONE) or fail (EXIT).
• When one task is killed by a signal or core dumps, all the other tasks will be shut down.

Submitting Batch Jobs

The bsub Command

The bsub command is used to submit parallel batch jobs to the LSF Batch system. The syntax for using bsub when submitting parallel applications is the same as in the LSF Batch system, with the addition of the pam option:

    bsub [options] pam [options] job

The pam Option

The pam options used with the bsub command are a subset of the pam command options; see The pam Command on page 35. Since the LSF Batch system does all of the resource allocation and scheduling, the pam options -f and -n are not necessary and are ignored by the bsub command. The syntax for bsub pam is:

    pam [-h] [-V] [-t] [-v]

The bsub pam options are:

Option  Description
-h      Print command usage to standard error and exit.
-V      Print LSF version to standard error and exit.
-t      Suppress the printing of the process status summary on job completion.
-v      Specifies that the job is to be run in verbose mode. The names of the selected hosts are displayed.

Example: The following command submits a parallel batch job named myjob to the LSF Batch system and requests four processors of any type to run the job:

    bsub -n 4 pam myjo
34. orrect binaries.

Note: Use the lshosts command to determine which LSF hosts are available. For example:

    % lshosts
    HOST_NAME  type    model     cpuf  ncpus  maxmem  maxswp  server  RESOURCES
    host1      SUNSOL  SunSparc  6.0   1      64M     112M    Yes     (solaris cserver)
    host2      RS6K    IBM350    7.0   1      64M     124M    Yes     (cserver aix)

Example: To submit the myjob application from the same directory using LSF host type extensions, the following command is used:

    pam -n 2 myjob.%a

PAM will make the following substitutions for the %a notation:

    myjob.SUNSOL
    myjob.RS6K

Example: To submit the myjob application from different directories using host type path names, the following command is used:

    pam -n 2 user/batch/%a/myjob

PAM will make the following substitutions for the %a notation:

    user/batch/SUNSOL/myjob
    user/batch/RS6K/myjob

4 Submitting Parallel Applications

This chapter describes how to submit and interact with parallel applications in the LSF Batch system. An extensive and flexible set of tools is provided that allows parallel applications to be submitted through the LSF Batch system. Parallel applications can also be executed interactively under control of the Parallel Application Manager (PAM). These tools allow the specification of how, when, and where a parallel application is to be run.

This chapter discusses the following topics:

• Job Submission Methods
• Batch Executi
35. Compiling and Linking

The LSF Parallel system provides a set of scripts that help with the creation of executable objects: mpicc for C programs and mpif77 for Fortran 77 programs. These scripts provide the options and special libraries needed to compile and link MPI programs for use with the LSF Parallel system. Applications are linked to system-dependent libraries and the appropriate MPI library.

C Programs

The LSF Parallel C compiler, mpicc, is used to compile MPI C source files. It is used in a similar manner to other UNIX-based C compilers. For example, to compile the sample program contained in the file myjob.c, enter:

    mpicc -c myjob.c

This command produces the file myjob.o, which contains the object code for this LSF Parallel source file. To link the myjob.o object file with the LSF Parallel libraries to create an executable, enter:

    mpicc -o myjob myjob.o

As with most C compilers, the -o flag specifies that the name of the executable produced by the linker is to be myjob. The C source file can be compiled and linked in one step using the following command:

    mpicc myjob.c -o myjob

Fortran 77 Programs

The LSF Parallel Fortran 77 compiler, mpif77, is used to compile MPI Fortran 77 source files. It is used in a similar manner to other UNIX-based Fortran 77 compilers. For example, to compile the sample program contained in the file myjob.f, enter
36. s, which are discussed in more detail in Submitting Batch Jobs on page 26.

To view the status of the parallel batch job, enter the following command:

    bjobs
    JOBID  USER   STAT  QUEUE   FROM_HOST  EXEC_HOST  JOB_NAME  SUBMIT_TIME
    1288   user1  PEND  normal  hopper     host1      myjob     Apr 16 14:43
                                           host2
                                           host3

The bjobs command has a number of command line options, which are discussed in more detail in Monitoring Job Status on page 30.

Execute Interactively

To interactively execute the parallel application myjob on three processors, enter the following command:

    pam -n 3 myjob
    Server PAM contact address is klee:2801
    From process 1: Hello world from process 1
    From process 2: Hello world from process 2
    From process 3: Hello world from process 3

    TID  HOST_NAME  COMMAND_LINE  STATUS     TERMINATION_TIME
    1    host1      myjob         Completed  04/16/98 15:05:56
    2    host2      myjob         Completed  04/16/98 15:05:56
    3    host3      myjob         Completed  04/16/98 15:05:56

The pam command has a number of command line options, which are discussed in more detail in The pam Command on page 35.

3 Building Parallel Applications

The LSF Parallel system provides tools to help build a parallel application that takes full advantage of the LSF Batch system. Most parallel applications can be reused by simply re-linking with the PAM-aware MPI library; in some instances there may not even be a need to r
37. s a bsub option.

job [arg ...]
    The name of the parallel job to be run. Note: This must be the last argument on the pam command line.

Example: The following command executes the parallel batch job named myjob on the LSF Parallel system, requesting four processors of any type:

    % pam -n 4 myjob
    TID  HOST_NAME  COMMAND_LINE  STATUS     TERMINATION_TIME
    1    host1      myjob         Completed  03/31/98 10:31:58
    2    host2      myjob         Completed  03/31/98 10:31:59
    3    host3      myjob         Completed  03/31/98 10:31:59
    4    host4      myjob         Completed  03/31/98 10:31:58

Example: The following command uses the -m option to execute the parallel batch job named myjob on host1, host2, and host3:

    % pam -m "host1 host2 host3" myjob
    TID  HOST_NAME  COMMAND_LINE  STATUS     TERMINATION_TIME
    1    host1      myjob         Completed  03/31/98 10:31:58
    2    host2      myjob         Completed  03/31/98 10:31:59
    3    host3      myjob         Completed  03/31/98 10:31:59

Process Status Report

After a parallel batch job terminates in a successful (Done) or failed (EXIT) state, the LSF Parallel system displays the status of all the processes. For example:

    pam -n 4 myjob
    TID  HOST_NAME  COMMAND_LINE  STATUS  TERMINATION_TIME
    1    host1      myjob         Done    0
38. Batch Job Status
Submitting Batch Jobs
Suspending Jobs
Resuming Jobs
Monitoring Job Status
Terminating Jobs
Running Heterogeneous Parallel Applications
Interactive Execution
The pam Command
Process Status Report
Getting Host Information
A Vendor MPI Implementations
HP MPI
SGI MPI
SUN HPC MPI
Index

Preface

The LSF Parallel User's Guide describes how to compile and link, execute, interact with, and monitor parallel applications submitted through the LSF Suite of products. For the most part, this guide does not repeat information that is available in detail elsewhere, but focuses on what is specific to using the Platform Computing Corporation LSF Parallel system. References to more general sources are provided in Related Publications on page ix in this preface.

This preface discusses the following topics:

• Audience
• Related Publications
• Typographic Conventions
• Command
39. ses
    Specify the number of processes to run. Note that the bsub -n argument specifies the number of CPUs to be used for the job.

    Example: To start a 48-process interactive job on PAM-enabled queue hpc that will wrap over at least 4 and as many as 16 CPUs:

        % bsub -I -n 4,16 -q hpc sunhpc -n 48 jobname

    Note: Setting the minimum number of CPUs to a number greater than 1 raises the possibility that, if fewer CPUs are available than the minimum you specify, the job may fail to start. In this example, if fewer than 4 CPUs are available, the job will not start. You can avoid this potential problem by setting the minimum number of CPUs to 1. However, this introduces a potential cost to performance, because the processes may be wrapped over a smaller number of CPUs.

-P host:port
    Specify the PAM address of another job with which the new job should colocate. The PAM address is the TCP socket used for communications between the job and PAM.

    Example: To start a 4-CPU interactive job on PAM-enabled queue hpc:

        bsub -I -n 4 -q hpc sunhpc -P Athos:123 jobname

    The new job is colocated with the job whose PAM is running on host Athos using port 123.

-j job_ID
    Specify the job ID of another job with which the new job should colocate.

-J job_name
    Specify the job name of another job with which the new job should colocate.

    Specify that the job is to be spawned in the STOPPED state. To i
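The trade-off in the note about minimum and maximum CPU counts can be made concrete with a little arithmetic. The C sketch below is an illustration under stated assumptions: it assumes processes are spread as evenly as possible over the allocated CPUs and that the job receives as many CPUs as are free, up to the maximum. The guide states neither of these, and the function name procs_per_cpu is hypothetical.

```c
/* Largest number of processes that share one CPU when num_procs are
 * wrapped over the allocated CPUs.
 * min_cpus and max_cpus mirror the minimum and maximum CPU request;
 * avail_cpus is how many CPUs are currently free.
 * Returns -1 if the arguments are invalid or if fewer than min_cpus are
 * available, in which case the job would not start.
 * Assumes an even wrap, which is an illustration-only simplification. */
int procs_per_cpu(int num_procs, int min_cpus, int max_cpus, int avail_cpus)
{
    int allocated;

    if (num_procs < 1 || min_cpus < 1 || max_cpus < min_cpus)
        return -1;
    if (avail_cpus < min_cpus)
        return -1;

    allocated = avail_cpus < max_cpus ? avail_cpus : max_cpus;

    /* Ceiling division: the most heavily loaded CPU. */
    return (num_procs + allocated - 1) / allocated;
}
```

For the 48-process example, 16 free CPUs give 3 processes per CPU and 4 free CPUs give 12. With only 3 free CPUs, a minimum of 4 means the job does not start, while a minimum of 1 lets it start at 16 processes per CPU.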
40. specific resources are not requested, the LSF Parallel system will run the batch job on the least loaded hosts that meet the batch job's criteria.

Direct interaction is supported. All input and output is handled transparently between local and execution hosts. All job control signals (e.g. ctrl-x, ctrl-z, and ctrl-l) are propagated to the execution hosts; this allows interaction with the job as if it were being executed locally.

This part of the chapter discusses the following topics:

• The pam Command
• Process Status Report
• Getting Host Information

The pam Command

The pam command is used to interactively execute parallel batch jobs in the LSF Parallel system. A subset of the pam command is used as a command option for the bsub command; see The bsub Command on page 26. The syntax for using the pam command is:

    pam [-h] [-V] [-i] [-t] [-v] [-server_addr location | -server_jobid location | -server_jobname location] [-m host ...] [-R req] -n num job [arg ...]

Option  Description
-h      Print command usage to standard error and exit.
-V      Print LSF version to standard error and exit.
-i      Specifies interactive operation mode: the user will be asked if the application is to be executed on all hosts. If yes (y), the task is started on all hosts specified in the list
41. uf, "Hello from process %d", myrank);
        destrank = 0;
        MPI_Send(mbuf, strlen(mbuf) + 1, MPI_CHAR, destrank, 90, MPI_COMM_WORLD);
    } else {
        for (srcrank = 1; srcrank < n_processes; srcrank++) {
            MPI_Recv(mbuf, 512, MPI_CHAR, srcrank, 90, MPI_COMM_WORLD, &mstat);
            printf("From process %d: %s\n", srcrank, mbuf);
        }
    }
    MPI_Finalize();
}

Compiling and Linking the Application

After the example program is entered and saved as myjob.c, use the mpicc script to compile and link the application. The mpicc script is used in a similar manner to other UNIX-based C compilers. This script provides the options and special libraries needed to compile and link a parallel application for the LSF Parallel environment.

Compile and Link

To compile and link the source code in the myjob.c file in one step, enter the following command:

    mpicc myjob.c -o myjob

The binary created is called myjob.

Running the Application

Submit to the LSF Batch System

To submit the parallel application myjob to the LSF Batch system, requesting three processors, enter the following command:

    % bsub -n 3 pam myjob
    Job <1288> is submitted to default queue <normal>.

This command creates three processes, and each runs an instance of myjob. The bsub command has a number of command line option