Home

Scalasca User's Guide - Forschungszentrum Jülich

1. 00000 EPIK Closing experiment epik sor vn128 trace 00000 EPIK Flushed file epik sor vn128 trace ELG 00000 00000 EPIK Closed experiment epik sor vn128 trace S C A N Collect done S C A N Analysis start mpirun mode vn np 128 scout mpi epik sor vn128 trace SCOUT trace analyzer output S C A N Analysis done 16 2 4 A full workflow example S C A N epik sor vn128 trace complete This creates an experiment archive directory epik sor vn128 trace distinguishing it from the previous summary experiment through the suffix trace A separate trace file per MPI rank is written directly into a subdirectory when measurement is closed and the Scalasca parallel trace analyzer SCOUT is automatically launched to analyze these trace files and produce an analysis report SCOUT output includes a report of the maximum amount of memory used by any of the analysis processes which is typically two or more times larger than the largest trace buffer content nax tbc This analysis report can then be examined using the same commands and tools as the summary experiment scalasca examine epik sor vn128 trace INFO Post processing trace analysis report INFO Displaying epik sor vnl128 trace trace cube The screenshot in Figure 2 2 shows that the trace analysis result at first glance provides the same information as the summary result However the trace analysis report is en riched with additional performance
2. Function Group MPI Abort EXT MPI Accumulate RMA MPI Add error class ERR MPI Add error code ERR MPI Add error string ERR MPI Address MISC MPI Allgather COLL MPI Allgatherv COLL MPI Alloc mem MISC MPI Allreduce COLL MPL Alltoall COLL MPL Alltoallv COLL MPL Alltoallw COLL MPI Attr delete CG EXT MPI Attr get CG EXT MPI Attr put CG EXT MPI Barrier COLL MPI Bcast COLL MPI Bsend P2P MPI Bsend init P2P MPI Buffer attach P2P MPI Buffer detach P2P MPI Cancel P2P MPI Cart coords TOPO MPI Cart create TOPO MPI Cart get TOPO MPI Cart map TOPO MPI Cart rank TOPO MPI Cart shift TOPO MPI Cart sub TOPO MPI Cartdim get TOPO MPI Close port SPAWN MPI Comm accept SPAWN MPI Comm c2f CG MISC MPI Comm call errhandler CG ERR MPI Comm compare CG MPI Comm connect SPAWN MPI Comm create CG 53 Appendix A MPI wrapper affiliation MPI Comm create errhandler CG ERR MPI Comm create keyval CG EXT MPI Comm delete attr CG EXT MPI Comm disconnect SPAWN MPI Comm dup CG MPI_Comm f2c CG_MISC MPI Comm free CG MPI_Comm_free_keyval CG_EXT MPI Comm get attr CG EXT MPI Comm get errhandler CG ERR MPI Comm get name CG EXT MPI Comm get parent SPAWN MPI Comm group CG MPI_Comm_join SPAWN MPI Comm rank CG MPI_Comm_remote_group CG MPI Comm remote size CG MPI Comm set attr CG EXT MPI Comm set errhandl
3. To convert a merged trace to the VampirTrace Open Trace Format OTF use 47 Chapter 6 Additional utilities elg2otf epik title and to convert to the older VAMPIR version 3 format VTF3 use elg2vtf3 epik title file elg which stores the resulting trace in lt file gt vpt Note Newer versions of VAMPIR 7 3 and later are also able to handle Scalasca traces directly without merging and conversion via vampir epik title epik esd Experimental support is provided to convert merged EPILOG traces to the format used by the PARAVER trace visualizer from the Barcelona Supercomputing Center 3 via elg2prv epik title and to the Slog2 format used by MPE from Argonne National Laboratory 4 via elgTOslog2 epik title epik elg To visualize the resulting Slog2 file with Jumpshot use jumpshot epik title epik elg slog2 6 3 Recording user specified virtual topologies A virtual topology defines the mapping of processes and threads onto the application do main such as a weather model simulation grid In general a virtual topology is specified as a graph e g a ring or a Cartesian topology such as two or higher dimensional grids Virtual topologies can include processes threads or a combination of both depending on the programming model Virtual topologies can be useful to identify performance problems Mapping perfor mance data onto the topology can help uncover inefficient interactions between neigh
4. 2 D Becker R Rabenseifner F Wolf Implications of non constant clock drifts for the timestamps of concurrent events In Proc of the IEEE Cluster Conference Cluster 2008 pp 59 68 IEEE Computer Society September 2008 38 3 Barcelona Supercomputing Center Paraver Obtain Detailed Information from Raw Performance Traces June 2009 5 48 http www bsc es plantillaA php cat id 485 4 A Chan W Gropp E Lusk Scalable Log Files for Parallel Program Trace Data DRAFT 2003 48 ftp ftp mcs anl gov pub mpi slog2 slog2 draft pdf 5 M Geimer F Wolf B J N Wylie B Mohr Scalable Parallel Trace Based Perfor mance Analysis In Proc of the 13th European PVM MPI Users Group Meeting EuroPVM MPI LNCS 4192 pp 303 312 Springer Berlin Heidelberg Septem ber 2006 1 6 M Geimer F Wolf B J N Wylie E brah m D Becker B Mohr The Scalasca performance toolset architecture In Concurrency and Computation Practice and Experience 22 6 702 719 April 2010 1 Gesellschaft f r Wissens und Technologietransfer der TU Dresden mbH Vampir Performance Optimization June 2009 5 47 7 http vampir eu 8 J lich Supercomputing Centre CUBE User Guide Generic display for application performance data 10 40 43 http apps fz juelich de scalasca releases cube 3 4 docs CubeGuide pdf 9 i J lich Supercomputing Centre Scalasca Open Issues and Limitations 8 19 http
5. 59 Appendix A MPI wrapper affiliation MPI Type set attr TYPE EXT MPI Type set name TYPE EXT MPI Type size TYPE MPI Type struct TYPE MPI Type ub TYPE MPI Type vector TYPE MPI Unpack TYPE MPI Unpack external TYPE MPL Unpublish name SPAWN MPI Wait P2P MPI Waitall P2P MPI Waitany P2P MPI Waitsome P2P MPI Win c2f RMA MISC MPI Win call errhandler RMA ERR MPI Win complete RMA MPI Win create RMA MPI Win create errhandler RMA ERR MPI Win create keyval RMA EXT MPI Win delete attr RMA EXT MPI Win f2c RMA MISC MPI Win fence RMA MPI Win free RMA MPI Win free keyval RMA EXT MPI Win get attr RMA EXT MPI Win get errhandler RMA ERR MPI Win get group RMA MPI Win get name RMA EXT MPI Win lock RMA MPI Win post RMA MPI Win set attr RMA EXT MPI Win set errhandler RMA ERR MPI Win set name RMA EXT MPI Win start RMA MPI Win test RMA MPI Win unlock RMA MPI Win wait RMA MPI Wtick EXT MPI_Wtime EXT 60 A 4 Group to function A 4 Group to function CG Communicators and Groups MPI_Comm_compare MPI Comm create MPI Comm dup MPI Comm free MPI Comm group MPI Comm rank MPI Comm remote group MPI Comm remote size MPI Comm size MPI Comm split MPI Comm test inter MPI Group compare MPI Group difference MPI Group excl MPI Group free MPI Gr
6. however it will then only provide information about the master thread scout omp is built whenever Scalasca is configured with OpenMP support It is used to analyze event traces generated by pure OpenMP applications It can also be used to analyze event traces from serial applications scout mpi is built whenever Scalasca is configured with MPI support It is used to analyze event traces generated by pure MPI applications It can also be used on traces from hybrid MPI OpenMP applications however it will then only provide information about the master thread of each process and its MPI activities scout hyb is built if Scalasca is configured with hybrid MPI OpenMP support It is used to analyze event traces generated by hybrid MPI OpenMP applications pro viding information about all OpenMP threads of each MPI process The appropriate SCOUT variant can be explicitly executed on event traces in EPIK mea surement archives using 37 Chapter 4 Measurement collection amp analysis SMPIEXEC SMPIEXEC FLAGS lt scout type gt s epik title which produces an intermediate analysis report epik_ lt title gt scout cube Event traces collected on clusters without a synchronized clock may contain logical clock condition violations 2 such as a receive completing before the corresponding send is initiated When SCOUT detects this it reports a warning that the analysis may be inconsistent and recommends re running trace analysis
7. MPI Group free CG MPI_Group_incl CG MPI_Group_intersection CG MPI_Group_range_excl CG MPI_Group_range_incl CG MPI_Group_rank CG MPI_Group_size CG MPI_Group_translate_ranks CG MPI_Group_union CG MPI Ibsend P2P MPI Info c2f MISC MPI Info create MISC MPI Info delete MISC MPI Info dup MISC MPI Info f2c MISC MPI Info free MISC MPI Info get MISC MPI Info get nkeys MISC MPI Info get nthkey MISC MPI Info get valuelen MISC MPI Info set MISC MPI Init ENV MPI Init thread ENV MPI Initialized ENV MPI Intercomm create CG MPI Intercomm merge CG MPI_Iprobe P2P MPI Irecv P2P MPI Irsend P2P MPI Is thread main ENV MPI Isend P2P MPI Issend P2P MPI Keyval create CG EXT MPI Keyval free CG EXT MPI Lookup name SPAWN MPI Op c2f MISC MPI Op commutative MISC MPI Op create MISC 57 Appendix A MPI wrapper affiliation MPI Op f2c MISC MPI Op free MISC MPI Open port SPAWN MPI Pack TYPE MPI Pack external TYPE MPI Pack external size TYPE MPI Pack size TYPE MPI Pcontrol PERF MPI Probe P2P MPI Publish name SPAWN MPI Put RMA MPI_Query_thread ENV MPI Recv P2P MPI Recv init P2P MPI Reduce COLL MPI Reduce local COLL MPI Reduce scatter COLL MPI Reduce scatter block COLL MPI Register datarep IO MPI Request c2f MISC MPI Request f2c MISC MPI Request free P2P MPI Request get status MISC MPI Rsend P2P MPI Rsend init
8. MPI Startall MPI Test MPI Test cancelled MPI Testall MPI Testany MPI Testsome MPI Wait MPI Waitall MPI Waitany MPI Waitsome 65 Appendix A MPI wrapper affiliation PERF Profiling Interface MPI Pcontrol RMA One sided communication Remote Memory Access MPI Accumulate MPI Get MPI Put MPI Win complete MPI Win create MPI Win fence MPI Win free MPI Win get group MPI Win lock MPI Win post MPI Win start MPI Win test MPI Win unlock MPI Win wait RMA ERR Error handlers for One sided communication Remote Memory Access MPI Win call errhandler MPI Win create errhandler MPI Win get errhandler MPI Win set errhandler RMA EXT External interfaces for One sided communication Remote Memory Access MPI Win create keyval MPI Win delete attr MPI Win free keyval MPI Win get attr MPI Win get name MPI Win set attr MPI Win set name RMA MISC Miscellaneous functions for One sided communication Remote Memory Access MPI Win c2f MPI Win f2c 66 A 4 Group to function SPAWN Process spawning MPI Close port MPI Comm accept MPI Comm connect MPI Comm disconnect MPI Comm get parent MPI Comm join MPI Comm spawn MPI Comm spawn multiple MPI Lookup name MDL Open port MPI Publish name MPL Unpublish name 67 Appendix A MPI wrapper affilia
9. apps fz juelich de scalasca releases scalasca 1 4 docs OPEN_ TSSUES CXE 10 J lich Supercomputing Centre Scalasca Performance Properties 17 43 http apps fz juelich de scalasca releases scalasca 1 4 help scalasca 75 Bibliography patterns html 11 J lich Supercomputing Centre Scalasca Instrumentation Measurement Regions 15 http apps fz juelich de scalasca releases scalasca 1 4 help scalasca regions html 12 J Labarta S Girona V Pillet T Cortes L Gregoris DiP A Parallel Program Development Environment In Proc of the 2nd International Euro Par Conference LNCS 1123 pp 665 674 Springer Berlin Heidelberg August 1996 5 13 Message Passing Interface Forum MPI A Message Passing Interface Standard Version 2 2 September 2009 3 http www mpi forum org 14 W Nagel M Weber H C Hoppe K Solchenbach VAMPIR Visualization and Analysis of MPI Resources Supercomputer 12 1 pp 60 80 SARA Amsterdam January 1996 5 15 OpenMP Architecture Review Board OpenMP API specification for parallel pro gramming Version 3 1 July 2011 3 http www openmp org 16 Performance Research Lab University of Oregon ParaProf User s Manual 5 http www cs uoregon edu research tau docs newguide bk02 html 17 Performance Research Lab University of Oregon TAU User Guide chapter Selec tively Profiling an Application 28 http www cs uoregon edu research tau docs newguide bk01ch01s
10. 1 In this case experiment trace analysis is automatically initiated after measure ment is complete to quantify wait states that can t be determined with runtime summa rization You may also visualize traces with a third party graphical trace browser The scalasca analyze n preview mode can be used to show but not actually ex ecute the measurement and analysis launch commands along with various checks to determine the possible success Additional informational commentary via v may also be revealing especially if measurement or analysis was unsuccessful In case of problems which are not obvious from reported errors or warnings set the configuration variable EPK_VERBOSE 1 before executing the instrumented application to see control messages of the Scalasca measurement system This might help to track down the problem or allow a detailed problem report to be given to the Scalasca developers Since the amount of messages may be overwhelming use an execution configuration that is as small and short as possible Chapter 2 Getting started When using environment variables in a cluster environment make sure that they have the same value for all application processes on a nodes of the cluster Some cluster environments do not automatically transfer the environment when executing parts of the job on remote nodes of the cluster and may need to be explicitly set and exported in batch job submission scripts 2 3 Analysis re
11. 2 semi automatic instrumentation using POMP directives Section 3 4 and if configured automatic source code instrumentation using the PDToolkit based instrumentor Section 3 5 Additionally Scalasca provides a filter ing capability for excluding instrumented user routines from measurement Section 3 6 if automatic compiler based instrumentation is used As well as user routines and specified source regions Scalasca currently supports the following kinds of events s MPI library calls Instrumentation is accomplished using the standard MPI profiling interface PMPI To enable it the application program has to be linked against the EPIK MPI or hybrid MPI OpenMP measurement library plus MPI specific libraries Note that the EPIK libraries must be linked before the MPI library to ensure interposition will be effective OpenMP directives amp API calls The Scalasca instrumenter automatically uses the OPARI2 tool to instrument OpenMP constructs See the OPARI2 documentation for information about how it instruments OpenMP source code and manually inserted POMP directives and refer to the OPEN ISSUES document 9 for its limitations e g with respect to preprocessors In addition the application must be linked with the EPIK OpenMP or hybrid MPI OpenMP measurement library The Scalasca instrumenter command scalasca instrument automatically takes care of compilation and linking to produce an instrumented executable and should be
12. A unique directory is used for each measurement experiment which must not already exist when measurement starts measurement is aborted if the specified directory exists A default name for each measurement archive directory is created from the name of the target application executable the run configuration e g number of MPI processes and OMP NUM THREADS specified and the measurement configuration This archive name has an epik prefix deriving from the EPIK measurement library used by Scalasca and its location can be explicitly specified to Scalasca with the e path option or changed via configuration variables When the measurement has completed the measurement archive directory contains vari ous log files and one or more analysis reports By default runtime summarization is used to provide a summary report of the number of visits and time spent on each callpath by each process For MPI measurements MPI time and message and file I O statistics are included For OpenMP measurements OpenMP specific metrics are calculated Hybrid OpenMP MPI measurements contain both sets of metrics If hardware counter metrics were requested these are also included in the summary report Event trace data can also be collected as a part of the measurement producing a trace file for each process To collect event trace data as part of the measurement use the scalasca analyze t command or alternatively set the configuration variable EPK_ TRACE
13. analysis of event traces in order to find performance bottlenecks Internally performance problems are specified in terms of execution patterns that represent standard situations of inefficient behavior These patterns are used during the analysis process to recognize and quantify the inefficient behavior in the application The analysis of traces from OpenMP MPI or hybrid MPI OpenMP programs can be performed in parallel with as many processes and threads as the original application execution see Section 4 4 In addition sequential analysis of traces using the KOJAK trace analyzer is still possible see Section 4 5 although only recommended under rare circumstances Scalasca not only supports the analysis of function calls and user defined source code regions cf Chapter 3 but also the analysis of hardware performance counter metrics see section 4 3 4 1 Nexus configuration scalasca analyze lt application launch command gt scan options launchcmd launchargs target targetargs scalasca analyze mpiexec np 4 foo args gt epik foo 4 sum OMP NUM THREADS 3 scan t bar epik bar Ox3 trace OMP NUM THREADS 3 scan s mpiexec np 4 foobar epik foobar 4x3 sum The Scalasca measurement collection and analysis nexus SCAN scalasca analyze should be prefixed to the commandline used to launch and run the application executable 31 Chapter 4 Measurement collection amp analysis Arguments
14. are cases where measurement and associated analysis are degraded e g by small frequently executed and or generally uninteresting functions methods and subroutines A measurement filtering capability is therefore supported for most but not all com pilers A file containing the names of functions one per line to be excluded from measurement can be specified using the EPIK configuration variable EPK FILTER or alternatively via the f filter file option ofthe scalasca analyze command and will be archived in epik_ lt title gt epik filt as part of the experiment Filter function names can include wildcards for multiple characters and for single char acters and if name demangling is not supported then linker names must be used On the other hand if C name demangling is supported characters indicating pointer variables have to be escaped using a backslash Note Generally it is most convenient to replace instances of space characters and other 6 9 6 special characters x and with the character Whenever a function marked for filtering is executed the measurement library skips making a measurement event thereby substantially reducing the overhead and impact of such functions In some cases even this minimal instrumentation processing may be undesirable and the function should be excluded from instrumentation as described in Section 3 6 4 2 2 Selective MPI event ge
15. begin MPI File write all end MPI File write at MPI File write at all MPI File write at all begin MPI File write at all end MPI File write ordered MPI File write ordered begin MPI File write ordered end MPI File write shared MPI Register datarep 64 A 4 Group to function IO ERR Error handlers for Parallel I O MPI File call errhandler MPI File create errhandler MPI File get errhandler MPI File set errhandler IO MISC Miscellaneous functions for Parallel VO MPI File c2f MPI File f2c MISC Miscellaneous functions MPI Address MPI Alloc mem MPI Free mem MPI Get address MPI Get version MPI Info c2f MPI Info create MPI Info delete MPI Info dup MPI Info f2c MPI Info free MPI Info get MPI Info get nkeys MPI Info get nthkey MPI Info get valuelen MPI Info set MPI Op c2f MPI Op commutative MPI Op create MPI Op f2c MPI Op free MPI Request c2f MPI Request f2c MPI Request get status MPI Status c2f MPI Status f2c P2P Point to point communication MPI Bsend MPI Bsend init MPI Buffer attach MPI Buffer detach MPI Cancel MPI Ibsend MPI Iprobe MPI Irecv MPI Irsend MPI Isend MPI Issend MPI Probe MPI Recv MPI Recv init MPI Request free MPI Rsend MPI Rsend init MPI Send MPI Send init MPI Sendrecv MDL Sendrecv replace MPI Ssend MPI Ssend init MPI Start
16. definition record topology index is the pointer to the topology in C or the index to the topology in Fortran 6 EPIK F CART FREE topology index Releases the memory used by the topology its related data structures topology index is the pointer to the topology in C or the index to the topology in Fortran Note There are currectly a few restrictions that need to be obeyed when using the EPIK topology API For each manually defined topology every MPI thread has to call the creation function EPIK F CART CREATE exactly once EPIK F CART COMMIT must be called before EPIK F _COORDS_COMMIT 50 Appendix A MPI wrapper affiliation A MPI wrapper affiliation A 1 Enabling and disabling wrappers at compile time During configuration of the Scalasca build process special groups of wrappers can be enabled and disabled For these no wrappers will be generated resulting in no additional measurement overhead One of the groups listed in the help output is the MINI group which currently has no corresponding affiliation to the groups listed here This class of wrappers comprises all MPI functionality that can be expected to have very little overhead It is highly recommended to disable these wrappers completely at configure time If the standard set of wrappers are used the MINI wrappers are also disabled If the configure option enable all mpi wrappers is used you sh
17. ever hardware counter metrics can be found in the runtime summarization analysis report summary cube which is also produced by default when tracing is enabled For such measurements post processing by scalasca examine merges the trace analysis and summary reports into a combined t race HWC cube report File Display Topology Help Absolute v Absolute 7 Absolute M Metric tree Calltree Flatview System tree Box Plot Topology O EF 10 00 Time 4 EKL 4l E0 a EE 82 11 Execution L1 0 00 initialize_machine E ll 0 36 MPI 10 00 g sync Bi 0 00 Synchronization 0 01 Ip EF O 0 00 Communication 3 27 Point to point O 0 00 get i E ll 0 01 Collective CO 0 00 get f O 0 00 Early Reduce Ll 0 00 ask starting lattice O 0 00 Early Scan L 0 00 ask ending lattice D 0 00 broadcast bytes 1 37 Wait at N x N L Gio 01 MPL Bcasti 0 00 N x N Completion C 0 00 reload lattice 31 57 Init Exit D 0 00 rephase L ll 0 22 Overhead 10 00 make loop table I 6 60e7 Visits 10 00 make path table 28 Synchronizations D 0 00 normal exit EF lll 7 81e4 Communications Ll 0 00 dclock lll 3 1669 Bytes transferred amp D 0 00 update Ill 42 16 Computational imbalance CD 0 00 rephase DU 0 00 g measure ul C 0 00 f meas imp Lj S ES S S w v v a m a m a KID 0 00 0 01 0 01 118 92 0 00 0 0
18. for point to point communication functions 4 3 Measurement and analysis of hardware counter metrics If the Scalasca measurement library EPIK has been built with hardware counter support enabled see INSTALL file it is capable of processing hardware counter information as part of event handling This can be checked by running epik conf and seeing whether EPK METRICS SPEC is set Counters are processed into counter metrics during runtime summarization and recorded as part of event records in collected traces Note that the number of counters recorded determines measurement and analysis overheads as well as the sizes of measurement storage datastructures event traces and analysis reports Counter metrics recorded in event traces are currently ignored by the Scalasca parallel trace analyzer and it is gener ally recommended that they should only be specified for summarization measurements To request the measurement of certain counters set the variable EPK METRICS to a colon separated list of counter names or a predefined platform specific group Alternatively specify the desired metrics with m lt metriclist gt argument to the Scalasca measure ment collection and analysis system scalasca analyze Hardware counter mea surement is disabled by default Metric names can be chosen from the list contained in file doc METRICS SPEC or may be PAPI preset names or platform specific native counter names METR
19. handlers MPI Add error class NPL Add error code MPI Add error string MPI Errhandler create MPI Errhandler free MPI Errhandler get MPI Errhandler set MPI Error class MPI Error string EXT Common external interfaces MPI Abort MPI Get count MPI Get elements MPI Get processor name MPI Grequest complete MPI Grequest start MPI Status set cancelled MPI Status set elements MPI Wtick MPI Wtime 63 Appendix A MPI wrapper affiliation IO Parallel I O MPI File close MPI File delete MPI File get amode MPI File get atomicity MPI File get byte offset MPI File get group MPI File get info MPI File get position MPI File get position shared MPI File get size MPI File get type extent MPI File get view MPI File iread MPI File iread at MPI File iread shared MPI File iwrite MPI File iwrite at MPI File iwrite shared MPI File open MPI File preallocate MPI File read MPI File read all MPI File read all begin MPI File read all end MPI File read at MPI File read at all MPI File read at all begin MPI File read at all end MPI File read ordered MPI File read ordered begin MPI File read ordered end MPI File read shared MPI File seek MPI File seek shared MPI File set atomicity MPI File set info MPI File set size MPI File set view MPI File sync MPI File write MPI File write all MPI File write all
20. i 1 100 end do EPIK USER END r name EPIK FUNC END end subroutine foo C C include epik_user h void foo 23 Chapter 3 Application instrumentation declarations EPIK USER REG r name iteration loop EPIK FUNC START EPIK USER START r name for i 0 i lt 100 i EPIK_USER_END r_name EPIK FUNC END C only include epik user h void foo EPIK TRACER foo Region identifiers r name should be registered in each annotated function subroutine prologue before use within the associated body and should not already be declared in the same program scope For C and C function names are automatically provided by the EPIK FUNC START and EPIK FUNC END macros so don t need registering whereas annotated Fortran functions and subroutines should call EPIK FUNC REG with an appro priate name Note The source files instrumented in this way have to be compiled with DEPIK otherwise EPIK calls expand to nothing and are ignored If the Scalasca instrumenter user flag is used the EPIK symbol will be defined automatically Also note that Fortran source files instrumented this way have to be preprocessed with the C preprocessor CPP Manual routine instrumentation in combination with automatic source code instrumenta tion by the compiler or PDT leads to double instrumentation of user routin
21. measurement configuration file in the working directory or with scalasca analyze f filter file Before initiating a trace measurement experiment ensure that the filesystem where the experiment will be created is appropriate for parallel I O typically scratch or work rather than home and that there will be sufficient capacity and or quota for the expected trace of size total tbc Filtering will not prevent the function from being instrumented Hence measure ment overhead can not be completely eliminated on filtered functions when auto matic compiler based instrumentation is used When all options of the Scalasca measurement system are set in a way that measure ment overhead and space requirements are minimized a new run of the instrumented application can be performed passing the t option to scalasca analyze This will enable the tracing mode of the Scalasca measurement system Additionally the paral lel post mortem trace analyzer searching for patterns of inefficient communication and synchronization is automatically started after application completion scalasca analyze t mpirun mode vn np 128 sor x S C A N Scalasca 1 4 trace collection and analysis S C A N epik sor vn128 trace experiment archive S C A N Collect start mpirun mode vn np 128 sor x 00000 EPIK Created new measurement archive epik sor vn128 trace 00000 EPIK Activated epik sor vn128 trace 10000000 bytes Application output
22. particular installation of Scalasca may not offer all measurement configurations The kconfig command can also be used to determine the right compiler flags for spec ifying the include directory of the epik_user h or epik_user inc header files when compiling without using the Scalasca instrumenter kconfig for cflags or when the user instrumentation macros should be enabled kconfig for user cflags Scalasca supports a variety of instrumentation types for user level source routines and arbitrary regions in addition to fully automatic MPI and OpenMP instrumentation as summarized in Table 3 1 When the instrumenter determines that MPI or OpenMP are being used it automati cally enables MPI library instrumentation and OPARI2 based OpenMP instrumentation respectively The default set of instrumented MPI library functions is specified when Scalasca is installed All OpenMP parallel constructs and API calls are instrumented by default but instrumentation of classes of OpenMP synchronization calls can be selec tively disabled as described in 3 6 By default automatic instrumentation of user level source routines by the compiler is enabled equivalent to specifying comp all This can be disabled with comp none when desired such as when using PDToolkit or POMP or EPIK user API manual source annotations enabled with pdt pomp and user respectively Compiler PDToolkit POMP and EPIK user API instrumentation can all be use
23. prefixed to compile and link commands Often this only requires prefixing definitions for CC or MPICC and equivalents in Makefiles It is not necessary to prefix commands using the compiler for preprocessing as no instrumentation is done in that case When using Makefiles it is often convenient to define a preparation preposition place holder e g PREP which can be prefixed to selected compile and link commands MPICC PREP mpicc MPICXX PREP mpicxx MPIF90 PREP mpif90 These can make it easier to prepare an instrumented version of the program with 19 Chapter 3 Application instrumentation make PREP scalasca instrument while default builds without specifying PREP on the command line remain fully opti mized and without instrumentation When compiling without the Scalasca instrumenter the kconfig command can be used to simplify determining the appropriate linker flags and libraries kconfig mpi omp hybrid for user 32 64 libs The mpi omp or hybrid switch selects whether MPI OpenMP or hybrid MPI OpenMP measurement support is desired kconfig assumes a C or C program is being linked by default and Fortran applications have to be explicitly flagged with the for switch With user the EPIK manual user instrumentation API can be en abled The 32 or 64 switch selects the 32 bit or 64 bit version of the measurement libraries if necessary Note A
24. rank with out the corresponding receive send on the matching rank Generally the analyzer will deadlock The EPIK FLUSH TRACE macro can be used to explicitly request that current trace buffer contents be immediately flushed to disk and the buffer emptied ready to continue event record collection This can be employed to avoid disruptive uncoordinated auto matic flushing of trace buffers during important measurement phases It applies only to the calling thread and is not synchronized with other threads or processes Flush events are marked as TRACING regions In summary experiments EPIK FLUSH TRACE is ig nored 3 4 Semi automatic instrumentation If you manually instrument the desired user functions and regions of your application source files using the POMP INST directives described below the Scalasca instrumenter pomp flag will generate instrumentation for them POMP instrumentation directives are supported for Fortran and C C The main advantages are that being directives the instrumentation is ignored during normal compilation and this semi automatic instrumentation procedure can be used when fully automatic compiler instrumentation is not supported The INST BEGIN END directives can be used to mark any user defined sequence of state ments If this block has several exit points as is often the case for functions all but the last have to be instrumented by INST ALTEND 25 Chapter 3 Application i
25. tation is disabled with comp none Note Depending on the compiler and how it performs instrumentation insertion of in strumentation may disable inlining and other significant optimizations or inlined routines may not be instrumented at all and therefore invisible Automatic compiler based instrumentation has been tested with a number of different compilers GCC UNIX like operating systems not tested with Windows e BM xlc xlC version 7 or later IBM Blue Gene and AIX BM xlf version 9 1 or later IBM Blue Gene and AIX PGI Cray XT and Linux Intel compilers version 10 or later Cray XT and Linux not tested with Windows SUN Studio compilers Linux and Solaris Fortran only PathScale compilers Cray XT and SiCortex CCE Cray compiler Cray XT NEC compiler NEC SX Clang compiler version 3 1 or later Linux earlier versions have not been tested but might also work Open64 compilers Linux In all cases Scalasca supports automatic instrumentation of C C and Fortran codes except for the SUN Studio compilers which only provide appropriate support in their Fortran compiler and Clang which only provides C and C compilers Note The automatic compiler instrumentation might create a significant relative measure ment overhead on short function calls This can impact the overall application per formance during measurement C applications are especially prone to suffer from this depending o
26. the CUBE3 manual 8 provided with the Scalasca distribution CUBES is a generic user interface for presenting and browsing performance and debug ging information from parallel applications The underlying data model is independent 10 2 3 Analysis report examination from particular performance properties to be displayed The CUBE3 main window con sists of three panels containing tree displays or alternate graphical views of analysis reports The left panel shows performance properties of the execution the middle pane shows the call tree or a flat profile of the application and the right tree either shows the system hierarchy consisting of machines compute nodes processes and threads or a topological view of the application s processes and threads All tree nodes are labeled with a metric value and a colored box which can help identify hotspots The metric value color is determined from the proportion of the total root value or some other specified reference value A click on a performance property or a call path selects the corresponding node This has the effect that the metric value held by this node such as execution time will be further broken down into its constituents That is after selecting a performance property the middle panel shows its distribution across the call tree After selecting a call path i e a node in the call tree the system tree shows the distribution of the performance property in that call path a
27. trace analysis Integrated merged trace analysis and results presentation is provided by the command kanal epik title Or kanal lt file gt elg cube The command takes as argument either an EPIK experiment archive contain ing a merged trace a merged trace lt file gt elg or a generated analysis report lt file gt cube If lt file gt cube already exists and is newer than lt file gt elg CUBES3 is used to present it and browse the analysis If the trace lt file gt elg is newer or no analysis file exists then EXPERT is run to generate lt file gt cube before it is presented with CUBE3 Where generation of a new lt file gt cube would overwrite an existing older file with the same name a prompt will confirm whether to continue The EXPERT event trace analysis and CUBE analysis visualization can also be executed separately which is particularly appropriate when the CUBE viewer is installed on a sep arate system e g desktop from the measurement system e g a remote HPC system EXPERT analysis performance for particular trace files can be tuned via EARL environ ment variables which trade efficiency and memory requirements In order to analyze a trace file EXPERT reads the trace file once from the beginning to the end After access ing a particular event EXPERT might request other events usually from the recent past of the event or ask for state information related to one of those events Random access to even
28. with its integrated timestamp synchronization algorithm based on the controlled logical clock 1 activated this aux illiary trace processing is specified with the optional s flag to SCOUT Alternatively event trace analysis can be re initiated using the scalasca analyze command e g scalasca analyze a e epik title SMPIEXEC S MPIEXEC FLAGS where MPIEXEC is the command used to configure and launch MPI applications and is typically identical to that used to launch the user MPI application In the second case the scalasca analyze command will automatically figure out which SCOUT variant should be used and or is available To activate the integrated timestamp synchroniza tion algorithm when using the scalasca analyze command the environment variable SCAN ANALYZE OPTS needs to include s Note The number of MPI processes for SCOUT must be identical to the number of MPI processes for the original application Furthermore if SCOUT is executed on OpenMP or hybrid MPI OpenMP traces it is recommended to set the environment variable OMP NUM THREADS to the value used for the original application although SCOUT will automatically try to create the appropriate number of OpenMP threads Warning The scout omp and scout hyb analyzer require pure OpenMP and hybrid MPI OpenMP applications to use the same number of threads during all parallel regions OpenMP parallel regions that are not executed by all threads due
29. 0 0 00 0 01 0 00 0 00 100 00 0 00 t Figure 4 1 Dashed red frames guide the user in locating the call paths where the most severe instances of the wait states detected by Scalasca here Late Broad cast occured LateBcastVampir pdf LateBcastVampir Location of the worst Late Broadcast instance shown in the timeline display of Vampir It can be seen that some processes enter the MPI operation earlier than the root process leading to a wait state width 1 0 39 Chapter 4 Measurement collection amp analysis The automatic parallel event trace analyzer also supports calculating additional pattern statistics as well as tracking of the five most severe instances of each wait state pattern detected during the analysis For point to point operations the severity corresponds to the waiting time according to the pattern description In case of collective operations the severity correponds to the sum of the waiting times detected for each process involved in the operation To enable this additional trace analysis the environment variable SCAN ANALYZE OPTS needs to include i during the analysis phase In the CUBE3 browser the pattern statistics display can be opened via the Statistics entry in the metric s context menu The call paths of the most severe instances are high lighted in the call tree pane using dashed red frames see Figure 4 1 In case CUBE3 is configu
30. 03 html 18 Performance Research Lab University of Oregon TAU Reference Guide chapter TAU Instrumentation Options 28 http www cs uoregon edu research tau docs newguide bk03ch01 html 19 F Wolf B Mohr Automatic performance analysis of hybrid MPI OpenMP ap plications Journal of Systems Architecture 49 10 11 pp 421 439 Elsevier November 2003 1 76 scalasca US ee oO J JULICH G E www scalasca org FORSCHUNGSZENTRUM
31. BUTORS BE LIABLE FOR ANY DIRECT INDIRECT IN CIDENTAL SPECIAL EXEMPLARY OR CONSEQUENTIAL DAMAGES INCLUDING BUT NOT LIMITED TO PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE DATA OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY WHETHER IN CONTRACT STRICT LIABILITY OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE EVEN IF ADVISED OF THE POSSIBIL ITY OF SUCH DAMAGE ii Contents Contents 1 Introduction LI How to read this document 22s 454845 woe Bo 1 2 Performance optimization cycle oc ccd ook eR oS ad y ES L GL DUEE INE once dene ee a CP IKE N CB RR Rd 2 Getting started 21 ISICON e uuo o3 bx koe EUR ERE CR EX UR RR 2 2 Runtime measurement collection amp analysis 2 3 Analysis report examination 2s 2 o ek ARLE ORE OEE EES 2 4 A full workflow example a 3 Application instrumentation 3 1 Automatic compiler instrumentation 32 Manual region instrumentation oaoa a 3 3 Measurement control instrumentation 3 4 Semi automatic instrumentation 3 5 Automatic source code instrumentation using PDT 3 6 Selective instrumentation cs ce 9 o o OR HE OS 4 Measurement collection amp analysis 11 NexusconhentalDH SR FYR DUR Sea Foxx Pea ees 4 2 Measurement configuration 000s 4 3 Measurement and analysis of ha
32. ICS SPEC also contains specifications of groups of related counters which may conveniently be mea sured simultaneously on various platforms The installed doc METRICS SPEC specifi cation can be overridden when desired by a file named METRICS SPEC in the current working directory or specified by the EPIK configuration variable EPK METRICS SPEC If any of the requested counters are not recognized or the full list of counters cannot be recorded due to hardware resource limits measurement of the program execution will be aborted with an error message Counter metrics appear in the Performance Metrics pane of the CUBE3 browser Rela tionships between counter metrics which define hierarchies are also specified in the file METRICS SPEC those without specified relationships are listed separately Experiments with subsets of the counter metrics required for a full hierarchy could previ ously be combined into composite experiments using the cube merge utility Note that 36 4 4 Automatic parallel event trace analysis a replacement for this utility is still under development and not yet available Generally several measurement experiments are required and the groupings of counters provided in METRICS SPEC can act as a guide for these The default doc METRICS SPEC provides generic metric specifications which can be used for analysis on any platform Additional platform specific example metric speci fications are provided in t
33. MPI library and type OMP either to OpenMP regions or calls to the OpenMP API User program routines on paths that directly or indirectly call MPI or OpenMP provide valuable context for understanding the communication and synchronization behaviour of the parallel execution and are dis tinguished with the COM type from other routines that are involved with purely local computation marked USR Entries marked ANY ALL provide aggregate information for all measured routines and those marked EPK are associated with the EPIK measurement system itself For further information see the online description of Scalasca instrumen tation measurement regions 1 Routines with type USR are typically good candidates for filtering which will effectively make them invisible to measurement and analysis as if they were inlined Routines marked COM can also be filtered however this is generally undesirable since it elimi nates valuable context information Since MPI and OMP regions are required by Scalasca analyses these cannot be filtered By comparing the trace buffer requirements with the time spent in the routines of a partic ular group the initial scoring report will already indicate what benefits can be expected from filtering However to actually set up the filter a more detailed examination is required This can be achieved by examining the score report epik score flt type max tbc time region ANY 215168 11849 04 100 00 summar
34. P2P MPI Scan COLL MPI Scatter COLL MPI Scatterv COLL MPI Send P2P MPI Send init P2P MPI Sendrecv P2P MPI Sendrecv replace P2P MPI Sizeof TYPE MPI Ssend P2P MPI Ssend init P2P MPI Start P2P MPI Startall P2P MPI Status c2f MISC MPI Status f2c MISC 58 A 3 Function to group MPI Status set cancelled EXT MPI Status set elements EXT MPI Test P2P MPI Test cancelled P2P MPI Testall P2P MPI_Testany P2P MPI Testsome P2P MPI_Topo_test TOPO MPI_Type_c2f TYPE_MISC MPI_Type_commit TYPE MPI_Type_contiguous TYPE MPI_Type_create_darray TYPE MPI_Type_create_f90_complex TYPE MPI_Type_create_f90_integer TYPE MPI Type create f90 real TYPE MPI Type create hindexed TYPE MPI Type create hvector TYPE MPI Type create indexed block TYPE MPI Type create keyval TYPE EXT MPI Type create resized TYPE MPI Type create struct TYPE MPI Type create subarray TYPE MPI Type delete attr TYPE EXT MPI Type dup TYPE MPI Type extent TYPE MPI Type f2c TYPE MISC MPI Type free TYPE MPI Type free keyval TYPE EXT MPI Type get attr TYPE EXT MPI Type get contents TYPE MPI Type get envelope TYPE MPI Type get extent TYPE MPI Type get name TYPE EXT MPI Type get true extent TYPE MPI Type hindexed TYPE MPI Type hvector TYPE MPI Type indexed TYPE MPI Type Ib TYPE MPI Type match size TYPE
35. al EPILOG event trace utilities Process local EPILOG traces in EPIK experiment archives can be merged by executing elg merge epik title in order to produce a single merged trace file epik_ lt title gt epik elg Note It may take quite a long time to merge large event traces and the resulting epik elg will typically be more than three times as large as the unmerged process traces Two utility programs are provided to check the correctness and to summarize the contents of EPILOG trace files elg print lt file gt elg Prints the contents of the EPILOG trace file lt file gt elg to the standard output stream elg print creates a readable representation of the EPILOG low level record format This is mainly provided for debugging purposes to check the correct structure and content of the EPILOG trace records elg stat lt file gt elg By default e19 stat calculates and reports some very simple event statistics to standard output In addition options d definition records and e event records enable the printing of a human readable representation of the trace contents on the event level 6 2 Trace converters The following utility programs can be used to convert merged EPILOG trace file into other formats If support for the trace formats OTF and or VTF3 were included during configuration and installation merged EPILOG event traces can be converted for visual analysis with the VAMPIR trace visualizer from TU Dresden ZIH 7
36. al interfaces IO I O MISC Miscellaneous P2P Point to point communication RMA One sided communication SPAWN Process management interface aka Spawn TOPO Topology communicators TYPE MPI datatypes XNONBLOCK Extended non blocking communication events XREQTEST Test events for tests of uncompleted requests communicators groups and other internal data is unaffected and always turned on Example EPK MPI ENABLED ENV P2P This will enable event generation for environmental managment including MPI Init and MPI Finalize as well as point to point communication but will disable it for all other functions groups A shorthand to get event generation for all supported function calls is EPK MPI ENABLED ALL A shorthand to add a single group e g TYPE to the configured default is DEFAULT TYPE EPK MPI ENABLED A detailed overview of the MPI functions associated with each group can be found in Appendix A 35 Chapter 4 Measurement collection amp analysis A somehow special role play the XNONBLOCK and XREQTEST flags If XNONBLOCK is set extra attributes will be recorded for non blocking send completions and receive requests If in addition to XNONBLOCK XREQTEST is set additional events are recorded for unsuc cessful tests for request completion in MPI Waitany some and the family of MPI Test functions In any case P2P must be enabled Otherwise EPIK records no events
37. auncher specification If there is an imposter executable or script e g used to specify placement that precedes the instrumented target it may be necessary to explicitly identify the target with the environment variable SCAN TARGET If environment variables aren t automatically forwarded to MPI processes by the launcher it may be necessary to specify the syntax that the launcher requires for this as SCAN SETENV For example if an environment variable VAR with value VAL must be explicitly exported with export VAR VAL use SCAN SETENV export or use SCAN SETENV setenv for setenv VAR VAL syntax Automatic trace analysis is done with different analyzers according to availability and the type of experiment An alternate trace analyzer with path if necessary can be speci fied with SCAN TRACE ANALYZER Specifying SCAN TRACE ANALYZER none will result in automatic trace analysis being skipped though some validation checks are still done which can be used when trace analysis is intended to be done interactively or on a dif ferent platform Options to be given to the trace analyzer such as s for timestamp correction can be specified with SCAN ANALYZE OPTS Trace data can be automatically removed after successful trace analysis by setting SCAN CLEAN Where the EPIK experiment archive directory is created on a filesystem which is not synchronized between launch node and compute nodes the nexus check fo
38. bors and suggest algorithmic improvements EPIK supports the recording of n dimensional Cartesian grids as the most common case To do this the user has two options 1 using MPI Cartesian topology functions 2 manual recording using the EPIK topology API If an application uses MPI topology functions to set up a Cartesian grid EPIK automat ically includes this information in the measurement experiment In addition EPIK provides users who do not use MPI topologies with an API to define an n dimensional Cartesian topology These functions are available in C and Fortran and have corresponding include files 48 6 3 Recording user specified virtual topologies finclude epik_topol h finclude epik ftopol inc Note In Fortran the inclusion must be in the function where topologies are to be recorded Whereas in C all functions start with the prefix EPIK in Fortran they start with EPIKF Here are the signatures of these functions l EPIK F CART CREATE topology index name num dims defines a Cartesian grid topology of any number of dimensions where topology index is the pointer to the topology in C or the index to the topology in Fortran name is a string to identify this topology and e num dims is an integer describing the number of dimensions in this topology In C this function returns a pointer to a struct of the type EPIK TOPOL In For tran it returns an integer which shoul
39. cally designed for use on large scale systems including IBM Blue Gene and Cray XT but also suitable for smaller HPC platforms using MPI and or OpenMP Scalasca supports an incremental performance analysis process that integrates runtime summaries with in depth studies of concurrent behavior via event tracing adopting a strategy of successively refined measurement configurations 6 A distinctive feature of Scalasca is the ability to identify wait states that occur for example as a result of un evenly distributed workloads Especially when trying to scale communication intensive applications to large processor counts such wait states can present severe challenges to achieving good performance Compared to its predecessor KOJAK 19 Scalasca can de tect such wait states even in very large configurations of processes using a novel parallel trace analysis scheme 5 1 1 How to read this document This user guide is structured into three parts This introductory chapter gives a short introduction into performance analysis in general and the components of the Scalasca toolset in particular If you are already familiar with performance analysis of parallel scientific applications you might skip the following section and continue reading directly with Section 1 3 The next part in Chapter 2 introduces the basic steps and commands reguired for initial performance analyses of parallel applications It also includes a full example describing the Sca
40. can be given to specify whether tracing should be enabled t a filter that should be applied f filter file and hardware counters that should be included in the measurement m metric list Note Instrumented applications can still be run without using the nexus to generate Scalasca measurements however measurement configuration is then exclusively via environment variables which must be explicitly exported to MPI processes and trace analysis is not automatically started after trace collection The target executable is examined by the nexus to determine whether MPI and or OpenMP instrumentation is present and the number of MPI processes and OpenMP threads are determined from the launch environment and command line specification These are used to generate a default name for the experiment archive unless a title has been explicitly specified with e expt title or setting the EPK TITLE environment variable Where the number of processes and or threads are omitted or were otherwise not determined the number is replaced with the letter O is used to indicate this Note Configuration specified on the nexus command line takes precedence over that spec ified as environment variables or in a configuration file Environment variables with the SCAN prefix may be used to configure the nexus it self which is a serial workflow manager process as distinct from the instrumented application process or processes which will be measure
41. cross the system locations A click on the icon left to a node in each tree expands or collapses that node By expanding or collapsing nodes in each of the three trees the analysis results can be viewed on different levels of granularity To obtain the exact definition of a performance property select Online Description in the context menu associated with each performance property which is accessible using the right mouse button A brief description can be obtained from the menu option Info Further information is also available at the Scalasca website http www scalasca org CUBE3 also provides a number of algebra utilities which are command line tools that operate on analysis reports The utililties currently only work on the CUBE files within experiment archive directories not on the archives themselves Multiple analysis re ports can be averaged with cube3_mean or merged with cube3_merge The difference between two analysis reports can be calculated using cube3_diff Finally a new analy sis report can be generated after pruning specified call trees and or specifying a call tree node as a new root with cube3_cut The latter can be particularly useful for eliminating uninteresting phases e g initialization and focussing the analysis on a selected part of the execution Each of these utilities generates a new CUBE formated report as output The cube3_score utility can be used to estimate trace buffer requirements from sum mary or
42. d which are also configured via environment variables discussed in the following Section 4 2 Serial and OpenMP programs are typically executed directly whereas MPI and hybrid MPI OpenMP programs usually require a special launcher such as mpiexec which might also specify the number of processes to be created Many MPI launchers are automatically recognized but if not the MPI launcher name can be specified with the environment variable SCAN MPI LAUNCHER When the MPI launch command is being parsed unrecognized flags might be reported as ignored and unrecognized options with required arguments might need to be quoted Note Launcher specific configuration files which augment the launch command are cur rently not handled by Scalasca If the total number of MPI processes is not correctly determined by the nexus the appropriate number can be specified as SCAN MPI RANKS The specified number will also be used in the automatically generated experiment title While an experiment title with an incorrect number of processes is harmless though generally confusing the correct number is required for automatic parallel trace analysis If the target executable isn t specified as one of the launcher arguments it is expected to be the immediately following part of the command line It may be necessary to use 32 4 2 Measurement configuration a double dash specification to explicitly separate the target from the preceding l
43. d be used to address this topology on other functions EPIK F CART ADD DIM topology index size periodic name adds a new dimension to an existing topology topology index is the pointer to the topology in C or the index to the topology in Fortran sizeis the number of possible coordinates in that dimension e periodic is an integer describing periodicity in this dimension It should be zero if the dimension is not periodic or non zero if dimension is periodic e name is a string containing the name of this dimension e g X Y Z Thread or anything else EPIK F CART SET COORDS topology index coords Sets coordinates per process or thread topology index is the pointer to the topology in C or the index to the topology in Fortran coords in C is a variable number of arguments each containing an integer for the coordinates in each previously defined dimension In Fortran it is an array of integer giving the coordinates in the dimensions previously defined in the same order they were defined EPIK F CART COMMIT topology index Writes the topology definition in the definition record From there on the topology 49 Chapter 6 Additional utilities is read only topology index is the pointer to the topology in C or the index to the topology in Fortran 5 EPIK F CART COORDS COMMIT topology index Writes the topology coordinates in the
44. d simultaneously or in arbi trary combinations however it is generally desirable to avoid instrumentation duplica tion which would result if all are used to instrument the same routines 20 Type Switch Default Standard Other Runtime instrum d instrum d meas ment routines regions control MPI auto configured Sec 4 2 2 by install OpenMP auto Sec 3 6 all parallel constructs Compiler comp all allor not Sec 4 2 1 Sec 3 1 none supported PDToolkit pdt all or not Sec 3 5 selective supported POMP pomp manually manually Sec 3 4 annotated annotated EPIK API user manually manually Sec 3 2 annotated annotated Table 3 1 Scalasca instrumenter option overview Note A minimal measurement containing only information about MPI usage can be ob tained by simply using the Scalasca instrumenter when linking already compiled uninstrumented object files and libraries In this case it is recommended to explic itly disable compiler based instrumentation and specify the MPI measurement mode even when OpenMP is used i e comp none mode MPI To have verbose output from the Scalasca instrumenter showing its various processing compiling and linking steps add the v switch before the compiler linker or set envi ronment variable SKIN VERBOSE 1 This information is particularly helpful to Scalasca developers when re
45. d to ensure suf ficient memory and disk storage is configured for a subsequent trace experiment with an identical execution and measurement configuration Note Since scoring only provides estimates and the penalties for exceeding buffer or disk capacity limits are highly perturbed and or incomplete measurements it is recom mended to include a generous cushion when interpreting the reported values max tbc should be used to specify the size of trace buffers i e ELG_BUFFER_SIZE so that highly disruptive flushing of full trace buffers to disk during measurement is avoided It is also indicative of the amount of memory on each process that the trace analyzer will require to hold the trace in memory during its analysis Note Trace analysis may require more than twice as much memory as the trace size since the analyzer must also allocate additional data structures total tbc is an estimate of the disk space that would be required to store the complete trace from all processes so you can check your disk quota and filesystem capacity are sufficient Total trace size will also be a factor in how long it takes to write the trace to disk after measurement is complete and for the trace analyzer to read it back from disk Note The most efficient parallel filesystem available should be used when generating and analysing traces After analysis is complete traces can be deleted or archived as desired Although total trace size is gen
46. e ad dressed and at times a reference to later chapters with more in depth information on the corresponding topic is given Use of Scalasca involves three phases program instrumentation execution measurement collection and analysis and analysis report examination The scalasca command pro vides action options that invoke the corresponding commands skin scan and square These actions are 1 scalasca instrument is used to insert calls to the Scalasca measurement system into the application s code either automatically semi automatically or by linking with pre instrumented libraries 2 scalasca analyze is used to control the measurement environment during the application execution and to automatically perform trace analysis after measurement completion if trac ing was requested The Scalasca measurement system supports runtime summarization and or event trace collection and analyses optionally including hardware counter information 3 scalasca examine is used to postprocess the analysis report generated by the measurement runtime summarization and or post mortem trace analysis and to start Scalasca s analysis report examination browser CUBE3 To get a brief usage summary call the scalasca command with no arguments or use scalasca h to open the Scalasca Quick Reference with a suitable PDF viewer The following three sections provide a quick overview of each of these actions and how to use them during the corres
47. e application may be left with insufficient memory to run or run adversely with paging to disk Larger traces also require more disk space are correspondingly slower to write to and read back from disk and will require more memory for subsequent analyses Often it is more appropriate to reduce the size of the trace e g by specifying a shorter execution or more selective instrumentation and measurement than to increase the buffer size 13 Chapter 2 Getting started Cube 3 0 QT eplk or vn128 sumsummary cube Ee Display Topology Help Own root percent Own root percent Absolute Metric tree Call tree Flat view Systemtree DBpology 0 GJ 0 00 Time lll 95 64 Execution O 0 00 MPI C 0 00 main B O 0 00 MPI Init O 0 00 setup grid 0 00 Synchronization L B 0 12 Collective 0 00 Communication ll 0 21 Collective O 0 00 File I O W 0 15 Init Exit lll 3 11 Overhead _ 0 00 init field 3 O 0 00 sor iter H L1 0 00 init red black O 0 00 init boundary C O 0 00 get halo W 1 18 MPI lIrecv CO 0 00 MPI Barrier lll 2 37 MPI Irsend 100 00 Visits 100 00 Synchronizations 100 00 Communications 100 00 Bytes transferred 100 00 Computational imbalance HO 0 00 update black AO 0 00 update red 0 00 MPI Allreduce O 0 00 MPI Finalize 87 25 9868 49 Figure 2 1 Examining a runtime summary analysis report in CUBE3 To estimate the buffer requir
48. ed for the target application itself During the analysis Scalasca searches for characteris tic patterns indicating wait states and related performance properties classifies detected instances by category and guantifies their significance The result is a pattern analysis report similar in structure to the summary report but enriched with higher level com 1 3 Scalasca overview munication and synchronization inefficiency metrics Both summary and pattern reports contain performance metrics for every function call path and system resource which can be interactively explored in a graphical report explorer see Figure 2 1 for an example The CUBE GUI is provided with Scalasca or it can be installed separately and third party profile visualization tools such as ParaProf 16 can also present Scalasca analysis reports As an alternative to the automatic analysis the event traces can be visualized and investigated with third party trace browsers taking advantage of their powerful time line visualizations and rich statistical functionality Newer versions of Vampir can handle Scalasca traces directly or traces can be converted for JumpShot Paraver 12 3 or older versions of Vampir 14 7 Chapter 1 Introduction Chapter 2 Getting started 2 Getting started This chapter provides a hands on introduction to the use of the Scalasca toolset on the basis of the analysis of an example application The most prominent features ar
49. ed IO MPI File read ordered begin IO MPI File read ordered end IO MPI File read shared IO MPI File seek IO MPI File seek shared IO MPI File set atomicity IO MPI File set errhandler IO ERR 55 Appendix A MPI wrapper affiliation MPI File set info IO MPI File set size IO MPI File set view IO MPI File sync IO MPI File write IO MPI File write all IO MPI File write all begin IO MPI File write all end IO MPI File write at IO MPI File write at all IO MPI File write at all begin IO MPI File write at all end IO MPI File write ordered IO MPI File write ordered begin IO MPI File write ordered end IO MPI File write shared IO MPI Finalize ENV MPI Finalized ENV MPI Free mem MISC MPI Gather COLL MPI Gatherv COLL MPI Get RMA MPI Get address MISC MPI Get count EXT MPI Get elements EXT MPI Get processor name EXT MPI Get version MISC MPI Graph create TOPO MPI Graph get TOPO MPI Graph map TOPO MPI Graph neighbors TOPO MPI Graph neighbors count TOPO MPI Graphdims get TOPO MPI Grequest complete EXT MPI Grequest start EXT MPI Group c2f CG MISC MPI Group compare CG MPI_Group_difference CG MPI_Group_excl CG 56 A 3 Function to group MPI Group f2c CG MISC
50. ements for a trace measurement scalasca examine s will skip opening the GUI and instead generate a brief overview of the estimated maximal number of bytes required with a detailed score report written into the experiment archive directory scalasca examine s epik sor vn128 sum cube3 score epik sor vn128 sum summary cube Reading epik sor vn128 sum summary cube done Estimated aggregate size of event trace total tbc 25698304 bytes Estimated size of largest process trace max tbc 215168 bytes When tracing set ELG BUFFER SIZE max tbc to avoid intermediate flushes or reduce requirements using a file listing USR regions to be filtered INFO Score report written to epik sor vn128 sum epik score max tbc refers to the maximum of the trace buffer capacity requirements determined for each MPI process in bytes or for each thread in OpenMP measurements If max tbc exceeds the buffer size available for the event stream in memory intermediate flushes during measurement will occur often with undesirable measurement perturbation To prevent flushing either increase the trace buffer size or use a filter to exclude a given list of routines from measurement 14 2 4 A full workflow example To aid in setting up an appropriate filter file this scoring functionality also provides a breakdown by different categories determined for each region according to its type of call path Type MPI refers to function calls to the
51. er CG ERR MPI Comm set name CG EXT MPI Comm size CG MPI_Comm_spawn SPAWN MPI_Comm_spawn_multiple SPAWN MPI_Comm_ split CG MPI Comm test inter CG MPI Dims create TOPO MPI Dist graph create TOPO MPI Dist graph create adjacent TOPO MPI Dist graph neighbors TOPO MPI Dist graph neighbors count TOPO MPI Errhandler create ERR MPI Errhandler free ERR MPI Errhandler get ERR MPI Errhandler set ERR MPI Error class ERR MPI Error string ERR MPI Exscan COLL MPI File c2f IO MISC MPI File call errhandler IO ERR 54 A 3 Function to group MPI File close IO MPI File create errhandler IO ERR MPI File delete IO MPI File f2c IO MISC MPI File get amode IO MPI File get atomicity IO MPI File get byte offset IO MPI File get errhandler IO ERR MPI File get group IO MPI File get info IO MPI File get position IO MPI File get position shared IO MPI File get size IO MPI File get type extent IO MPI File get view IO MPI File iread IO MPI File iread at IO MPI File iread shared IO MPI File iwrite IO MPI File iwrite at IO MPI File iwrite shared IO MPI File open IO MPI File preallocate IO MPI File read IO MPI File read all IO MPI File read all begin IO MPI File read all end IO MPI File read at IO MPI File read at all IO MPI File read at all begin IO MPI File read at all end IO MPI File read order
52. erally proportional to the number of processes often the most appropriate way to reduce the size of a trace is to specify a shorter execution e g covering fewer simulation timesteps or iterations or selectively trace particular timesteps or phases of execution e g using measurement control instrumentation as described in section 3 3 The score report can also be used to identify frequently executed purely computational routines that provide little value in Scalasca summary and trace analyses in relation to their measurement overhead and possible distortion User level source program routines classified as USR which are not involved with MPI and OpenMP parallelism with 44 5 Examination options large max tbc are prime candidates to be excluded from measurement via selective in strumentation section 3 6 or possibly 3 5 or the generally more convenient runtime filtering section 4 2 1 Potential filters can be verified using scalasca examine s f filter file epik title Each Scalasca release is provided with its own performance properties analysis documen tation that gets installed with it By default however potentially revised documentation on the Scalasca download website is prefered To disable fetching documentation from the network set CUBE DISABLE HTTP DOCS 45 Chapter 5 Analysis report examination 46 Chapter 6 Additional utilities 6 Additional utilities 6 1 Addition
53. es i e usually only user region instrumentation is desired in this case For examples of how to use the EPIK user API see the test epik files in the exam ple directory of the Scalasca installation 24 3 3 Measurement control instrumentation 3 3 Measurement control instrumentation The EPIK user API also provides several macros for measurement control that can be incorporated in source files and activated during instrumentation EPIK PAUSE START can be used to temporarily pause measurement until a subsequent EPIK PAUSE END defining a synthetic region named PAUSING Just like the already covered user defined annotated regions START and corresponding END must be correctly nested Events are not recorded when measurement is PAUSING though associated definitions are resulting in smaller measurement overhead In particular traces can be much smaller and can target specific application phases e g excluding initialization and or finalization or specific iterations Since PAUSING 1s process local and effects all threads on the process it can only be initiated outside of OpenMP parallel regions PAUSING is done independently on each MPI process without synchronization Note The behaviour of the parallel trace analyzer is undefined when PAUSING skips recording MPI events on subsets of processes such as some of the ranks in collective communication or synchronization operations or a send receive on one
54. ey to productive application optimization In the evaluation phase conclusions are drawn from the presented information leading to optimization hypotheses The user proposes optimization strategies for the application which are then implemented in the following optimization phase Afterwards the effec tiveness of the optimization has to be verified by another pass through the performance optimization cycle When the user is satisfied with the application performance during evaluation and no further optimization is needed the instrumentation can be disabled and the performance of the uninstrumented application execution can be assessed 1 3 Scalasca overview Scalasca supports measurement and analysis of the MPI OpenMP and hybrid MPI OpenMP programming constructs most widely used in highly scalable HPC ap Chapter 1 Introduction plications written in C C and Fortran on a wide range of current HPC platforms Usage is primarily via the scalasca command with appropriate action flags Figure 1 2 shows the basic analysis workflow supported by Scalasca Before any performance data can be collected the target application needs to be instrumented Instrumentation means that the code must be modified to record performance relevant events whenever they occur On most systems this can be done completely automatically using compiler support On other systems a mix of manual and automatic instrumentation mechanisms is offered When execu
55. he examples directory If desired an example METRICS SPEC appropriate for the platform where the measurements will be or have been recorded can be used instead of the default doc METRICS SPEC via setting the EPK METRICS SPEC configuration variable or replacing the installed file EXPERT analysis see Section 4 5 can further be customized using additional environ ment variables EPT INCOMPLETE COMPUTATION can be set to accept metric computa tions which are missing one or more component measurement while not generally useful on its own it can allow more detailed metric hierarchies to be created when experiments are combined EPT MEASURED METRICS modifies the handling of unparented measured metrics such that they can be ignored value 0 listed separately value 1 the default or listed together with parented metrics value 2 4 4 Automatic parallel event trace analysis SCOUT is Scalasca s automatic analyzer for EPIK event traces It is used internally by the Scalasca measurement collection and analysis nexus scalasca analyze when event tracing is configured or can be explicitly executed on event traces in EPIK mea surement archives Depending on the build configuration and the capabilities of the target platform SCOUT may be available in four forms scout ser is always built It is used to analyze event traces generated by serial applica tions It can also be used to analyze event traces from pure OpenMP applications
56. ht buffer parameters 1s usually a trade off decision between access efficiency and memory requirements In particular for very long traces with many events or very wide traces with many processes or threads adjustment of these pa rameters might be recommended 42 Chapter 5 Analysis report examination 5 Analysis report examination The Scalasca analysis report explorer facilitates interactive examination of analysis re ports both arising from runtime summarization and tracing experiments Analysis report examination can only be done after measurement and analysis are com pleted and the corresponding archive directory is unlocked Parallel resources are not required and it is often more convenient to examine analysis reports on a different sys tem such as a desktop computer where interactivity is superior Scalasca analysis reports are produced in the CUBE format which can be interactively explored with the CUBE GUI and processed with the CUBE algebra utilities as previ ously outlined in section 2 3 1 and detailed in the separate CUBE manual 8 Metrics determined by Scalasca are documented in 10 5 1 Examination options The Scalasca analysis report explorer SQUARE takes as argument the name of an EPIK experiment directory containing one or more analysis reports or the name of a specific analysis report cubefile In the usual case scalasca examine epik title post processes intermediate analysis reports
57. ion prototype including return and parameter types must be given C functions also need to be marked with an extra capital C at the end e g int main int char C Example BEGIN EXCLUDE LIST Exclude C function matmult void matmult Matrix Matrix Matrix C Exclude C functions with prefix sort and a single int pointer argument void sort f int Exclude all void functions in namespace foo void foo END EXCLUDE LIST Unfortunately the hash character is also used for comments so to specify a leading wildcard place the entry in double quotes For more information on how to selectively instrument code using the PDToolkit source code instrumentor please refer to the TAU documentation 17 18 3 5 2 Limitations Since support for the PDT based source code instrumenter is a recently added feature and some parts are still work in progress a number of limitations currently exist When instrumenting Fortran 77 applications the inserted instrumentation code snippets do not yet adhere to the Fortran 77 line length limit Typically it is possible to work around this issue by supplying extra command line flags e g ffixed line length 132 or gfixed 132 to the compiler f a Fortran routine that should be instrumented uses 1en as the name of an argu ment compilation of the instrumented code will fail The instrumentation code uses the intrinsic function 1en which will be shadowed by the argume
58. lasca performance analysis workflow The remainder of this user guide in Chapters 3 to 6 provide a more in depth discus sion of the individual steps in evaluating the performance of a parallel application Chapter 1 Introduction 1 2 Performance optimization cycle Regardless of whether an application should be optimized for single core performance or for scalability the basic approach is very similar First the behavior of the application needs to be monitored and afterwards the recorded behavior can be evaluated to draw conclusions for further improvement This is an iterative process that can be described by a cycle the so called performance optimization cycle When broken down into phases it is comprised of nstrumentation Measurement Analysis Presentation Evaluation Optimization of the code Instrumentation pas N unoptimized Measurement application Optimization d Analysis optimized l application Presentation Evaluation Figure 1 1 Performance optimization cycle As shown in Figure 1 1 the user starts with the original i e unoptimized applica tion which enters the optimization cycle in the instrumentation phase Instrumentation describes the process of modifying the application code to enable measurement of per formance relevant data during the application run In the context of Scalasca this can be achieved by different mechanisms such as source code ins
59. les myprogl f90 and myprog2 f90 replace the combined compile and link command mpif90 myprogl f90 myprog2 f90 o myprog by the following command using the Scalasca instrumenter scalasca instrument options mpif90 myprogl f90 myprog2 f90 o myprog Note The instrumenter must be used with the link command However not all object files need to be instrumented and it is often sufficient to only instrument source modules containing OpenMP and or MPI references Although generally most convenient automatic function instrumentation may result in too many and or too disruptive measurements which can be addressed with selective instrumentation and measurement filtering see Sections 3 5 and 3 6 2 2 Runtime measurement collection amp analysis The Scalasca runtime measurement collection amp analysis nexus accessed through the scalasca analyze action integrates the following steps measurement configuration 2 2 Runtime measurement collection amp analysis application execution collection of measured data automatic post mortem trace analysis if configured To make a performance measurement using an instrumented executable the target appli cation execution command is prefixed with the scalasca analyze command scalasca analyze options SMPIEXEC S MPI FLAGS target target args For non MPI i e serial and pure OpenMP applications the MPI launch command and associated flags should be omitted
60. list of hard ware counter metric group defini tions Depends on installation EPK METRICS Includes hardware counters in mea surement specify a colon separated list of PAPI preset or native counter names or a predefined group EPK MPI ENABLI E Iw Activates event generation for pre defined groups of MPI routines specify a colon separated list of to kens e g COLL 10 P2P RMA Depends on installation EPK MPI HANDLES Maximum number of MPI com 64 municator group window epoch handles tracked simultaneously EPK SUMMARY Enables run time summarization if 1 non zero EPK TITLE Specifies title for experiment a archive directory without the mandatory epik_ prefix EPK TRACE Enables event trace collection if 0 non zero EPK_VERBOSE Produces lot of additional informa 0 tion during measurement 13 Appendix B Environment variables Table B 4 Scalasca measurement environment variables for EPISODE configuration paths Variable name Description Default ESD BUFFER SIZE Size of per process definitions buffers in 100 000 bytes ESD FRAMES Maximum stack frame depth of measured 32 call paths ESD MAX THREADS Maximal number of threads for OpenMP OMP NUM THREADS measurements ESD PATHS Maximum number of measured call 4096 Table B 5 Scalasca measurement environ
61. ment variables for EPILOG configuration Zero Variable name Description Default ELG BUFFER SIZE Size of per thread event trace buffers in bytes 10 000 000 ELG COMPRESSION Compression level of data in event trace files 0 9 6 or u for uncompressed ELG MERGE Automatic merges trace files if non zero 0 ELG SION FILES Number of physical SION files 0 for one file per 0 MPI process without SIONIib 1 for system or configuration dependent default number of files ELG VT MODE Generates VAMPIR compatible traces if non 0 Table B 6 KOJAK sequential trace analyzer environment variables Variable name Description Default EARL BOOKMARK DISTANCE Specifies distance of bookmarks fixed interval the state information is stored 10 000 EPT INCOMPL ETE COMPUTATION surement Accepts metric computations which are missing one or more component mea EP I MEASURED METRICS ented metrics value 2 Modifies the handling of unparented 1 measured metrics such that they can be ignored value 0 listed separately value 1 or listed together with par 74 Bibliography Bibliography 1 D Becker R Rabenseifner F Wolf J Linford Scalable timestamp synchroniza tion for event traces of message passing applications Journal of Parallel Computing 35 12 595 607 December 2009 38
62. metrics which show up as sub metrics of the sum mary properties such as the fraction of Point to point Communication Time potentially wasted due to Late Sender situations where early receives had to wait for sends to be ini tiated That is the trace analysis can isolate and quantify inefficient communication and synchronization behaviour All of the metrics determined by Scalasca are documented in the online description of performance properties 10 The filesystem requirements for an EPILOG event trace and its analysis are much higher than for a runtime summary analysis The runtime of a batch job will also increase due to additional file I O at the end of measurement writing traces and and for their subsequent analysis After a successful tracing experiment the Scalasca measurement collection and analysis nexus has created a directory containing the event trace and its analysis files In tracing mode a runtime summary report stored in summary cube is also pro duced by default in addition to a trace analysis report stored in trace cube When the summary analysis report includes hardware counter metrics that are not available in the trace analysis report the two reports are merged into a combined report stored in tracet HWC cube After successful trace analysis and before moving or storing the experiment archive the trace files can be removed by deleting the ELG subdirectory in the experiment archive 17 Chapter 2 Getting star
63. n call scalasca instrument with the pdt option e g scalasca instrument pdt mpicc c foo c This will by default instrument all routines found in foo c To avoid double instrumen tation automatic compiler instrumentation can be disabled with comp none 3 5 1 PDT selective instrumentation The PDT source code instrumentor can also be configured to selectively instrument files and routines For this you need to supply the additional option optTauSelectFile filename after the pdt option The provided selective instrumentation file needs to be a plain text file of the following syntax Empty lines are ignored comments are introduced using a hash character and reach until the end of the line Files to be excluded from instrumentation can be listed in a file exclusion section You can either list individual filenames or use the star and question mark characters as wildcards for multiple or single characters as in a shell Example BEGIN FILE EXCLUDE LIST bar c Excludes file bar c foo c 4 Excludes all C files with prefix foo 27 Chapter 3 Application instrumentation END FILE EXCLUDE LIST To exclude certain routines from instrumentation their names can be listed in a routine exclusion section You can either list individual names or use the hash ff character as a wildcard Note that for Fortran subroutine names must be given in all uppercase letters for C C the full funct
64. n application design and whether C STL functions are also in strumented by the compiler Currently it is not possible to prevent the instrumenta tion of specific functions on all platforms when using automatic compiler instrumen 22 3 2 Manual region instrumentation tation See Section 3 6 on how to manually instrument applications if you encounter significant overheads Names provided for instrumented routines depend on the compiler which may add un derscores and other decorations to Fortran and C routine names and whether name demangling has been enabled when Scalasca was installed and could be applied suc cessfully 3 2 Manual region instrumentation If the automatic compiler based instrumentation see Section 2 1 or semi automatic in strumentation see Section 3 4 procedure fails instrumentation can be done manually Manual instrumentation can also be used to augment automatic instrumentation with region or phase annotations which can improve the structure of analysis reports Gen erally the main program routine should be instrumented so that the entire execution is measured and included in the analyses Instrumentation can be performed in the following ways depending on the programming language used Fortran include epik user inc subroutine foo declarations EPIK FUNC REG foo EPIK USER REG r name iteration loop EPIK FUNC START EPIK USER START r name do
65. neration The Message Passing Interface MPI adapter of EPIK supports the tracing of most of MPI s 300 function calls MPI defines a so called profiling interface that supports the provision of wrapper libraries that can easily interposed between the user application and the MPI library calls EPIK supports selective event generation Currently this means that at start time of the application the user can decide whether event generation is turned on or off for a group of functions These groups are the listed sub modules of this adapter Each module has a short string token that identifies this group To activate event generation for a specific group the user can specify a colon seperated list of tokens in the configuration variable EPK_MPI_ENABLED Additionally special tokens exist to ease the handling by the user A complete list of available tokens that can be specified in the runtime configuration is listed in the following table Note Event generation in this context only relates to flow and transfer events Tracking of 34 4 2 Measurement configuration Token Module ALL Activate all available modules DEFAULT Activate the configured default modules of CG COLL ENV IO P2P RMA TOPO This can be used to easily activate additional modules CG Communicators and groups COLL Collective communication ENV Environmental management ERR Error handlers EXT Extern
66. nstructions may be small enough to get a fairly accurate view of the application behavior However certain application properties like frequently executed regions with extremely small temporal extent will always lead to a high perturbation Thus the measurement of those regions must be avoided The measurement data can then be analyzed after application execution If a detailed event trace has been collected more sophisticated dependencies between events occur ing on different processes can be investigated resulting in a more detailed analysis re port Especially inter process event correlations can usually only be analyzed by a post mortem analysis The information which is needed to analyze these correlations are usually distributed over the processes Transferring the data during normal application runtime would lead to a significant perturbation during measurement as it would require application resources on the network for this After analyzing the collected data the result needs to be presented in an analysis report This leads to the next phase in the performance optimization cycle namely the presen tation phase At this stage it is important to reduce the complexity of the collected performance data to ease evaluation by the user If the presented data is too abstract per formance critical event patterns might not be recognized by the user If it is too detailed the user might drown in too much data User guidance is therefore the k
67. nstrumentation Fortran subroutine foo declarations POMPS INST BEGIN foo if condition then IPOMPS INST ALTEND foo return end if POMPS INST END foo end subroutine foo C C void foo declarations pragma pomp inst begin foo if lt condition gt pragma pomp inst alt end foo return pragma pomp inst end foo At least the main program function has to be instrumented in this way and additionally one of the following should be inserted as the first executable statement of the main program Fortran program main declarations POMPS INST INIT end program main C C 26 3 5 Automatic source code instrumentation using PDT int main int argc char argv declarations pragma pomp inst init For examples of how to use the POMP directives see the test pomp files in the example directory of the Scalasca installation 3 5 Automatic source code instrumentation using PDT If Scalasca has been configured with PDToolkit support automatic source code instru mentation can be used as an alternative instrumentation method In this case the source code of the target application is pre processed before compilation and appropriate EPIK user API calls will be inserted automatically However please note that this feature is still somewhat experimental and has a number of limitations see 3 5 2 To enable PDT based source code instrumentatio
68. nt definition This issue can only be resolved by renaming the routine argument Instrumentation of Fortran PURE and ELEMENTAL routines is not supported and should be avoided via selective instrumentation Included code will currently not be instrumented This applies to C C header files and other explicit includes by the C preprocessor as well as via the Fortran include keyword Support for C templates and classes is currently only partially implemented 28 3 6 Selective instrumentation Advanced TAU instrumentation features such as static dynamic timers loop I O and memory instrumentation are not yet supported Respective entries in the se lective instrumentation file will be ignored 3 6 Selective instrumentation Scalasca experiments contain by default only summarized metrics for each callpath and process thread More detailed analyses providing additional metrics regarding wait states and other inter process inefficiencies require that event traces are collected in buffers on each process that must be adequately sized to store events from the entire execution to avoid flushes to disk during measurement that are highly disruptive Instrumented routines which are executed frequently while only performing a small amount of work each time they are called have an undesirable impact on measurement The measurement overhead for such routines is large in comparison to the execution time of the uninstrumented ro
69. ory 0 1f non zero SCAN SETENV In order to set environment variables to MPI processes by the launcher one can specify the syntax that the launcher requires for this as SCAN SETENV e g oys SCAN_TARGET If there is an imposter executable or script e g used to specify placement that precedes the in strumented target it may be necessary to explic itly identify the target executable SCAN_TRACE_ANALYZER Specifies alternative trace analyzer e g scout mpi scout hyb If none specified it skips automatic trace analysis SCAN WAIT Wait for synchronization of a distributed filesys 0 712 Table B 3 Scalasca measurement environment variables for EPIK configuration Variable name Description Default EPK CONF Specifies file with a list of EPIK configuration variables EPIK CONF EPK FILTER Specifies file with a list of compiler instrumented USR functions which should not be included in measure ment EPK GDIR Specifies the directory to contain the EPIK measurement archive EPK_LDIR Specifies a temporary location to be used as intermediate storage be fore the data is finally archived in EPK_GDIR EPK_GDIR EPK MACHINE ID Specifies a unique identifier for the current machine 0 EPK MACHINE NAME Specifies a name for the current ma chine Depends on installation EPK METRIC SPEC Specifies a file with a
70. ould to manually disable this class of wrappers again as follows o configure enable all mpi wrappers disable mpi wrappers MINI Note Currently the MINI group comprises the wrappers for MPI Comm rank MPI Comm remote size MPI Comm size MPI Group rank and MPI Group size Addi tionally wrappers for MPI_Wtick MPI_Wtime and MPI_Sizeof will never be generated regardless of any configuration option passed to configure A 2 Subgrouping or cross group enabling Some wrapper functions are affiliated with a function group that has not been described for direct user access in section 4 2 2 These groups are subgroups that contain function calls that are only enabled when both main groups are enabled The reason for this is to control the amount of events generated during measurement a user might want to turn off the measurement of non critical function calls before the measurement of the com plete main group is turned off Subgroups can either be related to MISC miscellaneous functions e g handle conversion EXT external interfaces e g handle attributes or ERR error handlers For example the functions in group CG_MISC will only generate events if both groups CG and MISC are enabled at runtime 5 Appendix A MPI wrapper affiliation A 3 Function to group 52 A 3 Function to group
71. oup incl MPI Group intersection MPI Group range excl MPI Group range incl MPI Group rank MPI Group size MPI Group translate ranks MPI Group union MPI Intercomm create MPI Intercomm merge 61 Appendix A MPI wrapper affiliation CG ERR Error handlers for Communicators and Groups MPI Comm call errhandler MPI Comm create errhandler MPI Comm get errhandler MPI Comm set errhandler CG EXT External interfaces for Communicators and Groups MPI Attr delete MPI Attr get MPI Attr put MPI Comm create keyval MPI Comm delete attr MPI Comm free keyval MPI Comm get attr MPI Comm get name MPI Comm set attr MPI Comm set name MPI Keyval create MPI Keyval free CG MISC Miscellaneous functions for Communicators and Groups MPI Comm c2f MPI Comm f2c MPI Group c2f MPI Group f2c COLL Collective communication MPI Allgather MPI Allgatherv MPI Allreduce MPI Alltoall MPI Alltoallv MPI Alltoallw MPI Barrier MPI Bcast MPI Exscan MPI Gather MPI Gatherv MPI Reduce MPI Reduce local MPI Reduce scatter MPI Reduce scatter block MPI Scan MPI Scatter MPI Scatterv 62 A 4 Group to function ENV Environmental management MPI Finalize MPI Finalized MPI Init MPI Init thread MPI Initialized MPI Is thread main MPI Query thread ERR Common error
72. perations their instrumentation may also result in excessive measurement overhead The OPARD tool can be instructed not to instrument any of the OpenMP synchronization constructs using disable sync or a comma separated list of specific constructs from atomic critical flush locks master and single e g 29 Chapter 3 Application instrumentation scalasca instrument disable atomic locks gcc fopenmp Note OPARI2 options must be concluded with preceding the compiler linker Of course when these constructs are not instrumented and subsequently don t show up in measurements and analysis the application might well still have performance prob lems due to too many OpenMP synchronization calls 30 Chapter 4 Measurement collection amp analysis 4 Measurement collection amp analysis The Scalasca measurement collection and analysis nexus manages the configuration and processing of performance experiments with an instrumented executable Many different experiments can typically be performed with a single instrumented executable without needing to re instrument by using different measurement and analysis configurations The default runtime summarization mode directly produces an analysis report for exam ination whereas event trace collection and analysis are automatically done in two steps to produce a profile augmented with additional metrics The distinctive feature of Scalasca is the automatic
73. ponding step of the performance analysis before a full workflow example is presented in Section 2 4 Chapter 2 Getting started 2 1 Instrumentation To make measurements with Scalasca user application programs need to be instru mented i e at specific important points events during the application run special measurement calls have to be inserted In addition to an almost automatic approach using compiler inserted instrumentation Section 3 1 semi automatic POMP Section 3 4 and manual instrumentation Section 3 2 approaches are also supported In addition automatic source code instrumentation by the PDToolkit instrumenter Section 3 5 can be used if Scalasca is configured accordingly For pure OpenMP or hybrid MPI OpenMP applications or when using the semi automatic POMP directive based approach the OPARD source code instrumenter is used internally Read the OPARD section in the OPEN ISSUES document 9 provided as part of the Scalasca documentation to be aware of some limitations and known prob lems All the necessary instrumentation of user OpenMP and MPI functions is handled by the Scalasca instrumenter which is accessed through the scalasca instrument com mand Therefore the compile and link commands to build the application that is to be analyzed should be prefixed with scalasca instrument e g in a Makefile For example to instrument the application executable myprog generated from the two source fi
74. port examination The results of the automatic analysis are stored in one or more reports in the experiment archive These reports can be processed and examined using the scalasca examine command on the experiment archive scalasca examine epik title Post processing is done the first time that an archive is examined before launching the CUBE3 report viewer If the scalasca examine command is executed on an already processed experiment archive or with a CUBE file specified as argument the viewer is launched immediately A textual score report can also be obtained without launching the viewer scalasca examine s epik title This score report comes from the cube3 score utility and provides a breakdown of the different types of region included in the measurement and their estimated associated trace buffer capacity requirements aggregate trace size total tbc and largest process trace size max tbc which can be used to specify an appropriate FLG BUFFER SIZE for a subsequent trace measurement The CUBE3 viewer can also be used on an experiment archive or CUBE file as shown below cube3 epik title cube3 file cube However keep in mind that no post processing is performed in this case so that only a subset of Scalasca analyses and metrics may be shown 2 3 1 Using CUBES The following paragraphs provide a very brief introduction of the CUBE3 usage To effectively use the GUI you should also consult
75. porting instrumentation issues Sometimes it is desirable to explicitly direct the Scalasca instrumenter to do nothing ex cept execute the associated compile link command and in such cases mode none can be specified Although no instrumentation is performed this can help verify that the Scalasca instrumenter correctly handles the compile link commands Alternatively the environment variable SKIN MODE none can be set for the same purpose and without needing to modify the arguments given to the Scalasca instrumenter This is often neces sary when an application s configure or build procedure doesn t provide a compile link preposition that can be selectively used for the Scalasca instrumenter and actions during configuration build are unable to handle instrumented executables Temporarily setting SKIN MODE none should allow the use of the Scalasca instrumenter to be transparently incorporated in the configure build process until instrumented executables are desired 21 Chapter 3 Application instrumentation 3 1 Automatic compiler instrumentation Most current compilers support automatic insertion of instrumentation calls at routine entry and exit s and Scalasca can use this capability to determine which routines are included in an instrumented measurement Compiler instrumentation of all routines in the specified source file s is enabled by de fault by Scalasca or can be explicitly requested with comp all Compiler instrumen
76. produced by measurement and analysis to derive additional metrics and construct a hierarchy of measured and derived metrics and then presents this final report If there is more than one analysis report in an EPIK experiment archive directory the most comprehensive report is shown by default If intermediate reports were already processed the final report is shown immediately Should it be desirable to re process intermediate reports the F flag can be given to force this Alternatively a specified analysis report can be presented immediately with Scalasca examine epik title epitome cube Since no post processing is done in this case only a subset of Scalasca analyses and metrics may be shown 43 Chapter 5 Analysis report examination It can be desirable to post process intermediate reports in an experiment archive directory immediately after measurement collection without attempting to subsequently load the final report in the GUI and this is achieved with the s flag As well as skipping starting the GUI it also scores the final analysis report with the cube3 score utility and produces a textual epik score report This report provides a breakdown of the different types of region included in the measurement and their associated trace buffer capacity requirements aggregate trace size total tbc and the largest process trace size max tbc This information can be determined from summary experiments and use
77. r a newly created directory will fail and subsequent trace analysis is therefore skipped In this case SCAN WAIT can be set to the maximum number of retries in seconds that the nexus should consider prior to aborting 4 2 Measurement configuration A number of configuration variables can be used to control the EPIK measurement run time configuration for an annotated list of configuration variables and their current settings run the epik conf command Configuration variables can be specified via en vironment variables or in a configuration file called EPIK CONF by default the current directory is searched for this file or an alternative location can be specified with the EPK CONF environment variable The values for configuration variables can contain sub strings of the form XYZ or XYZ where XYZ is the name of another configuration variable Evaluation of the configuration variable is done at runtime when measurement is initiated When tracing large scale MPI applications it is recommended to set the EPK LDIR and EPK GDIR variables to the same location as in such cases intermediate file writing is avoided and can greatly improve performance Therefore this is the default setting 33 Chapter 4 Measurement collection amp analysis 4 2 1 Compiler instrumented routine filtering When automatic compiler instrumentation has been used to instrument user level source program routines classified as USR regions there
78. rdware counter metrics 4 4 Automatic parallel event trace analysis 4 5 Automatic sequential event trace analysis 5 Analysis report examination 3 1 Examination options ulus CES UE SE eS eee OS 6 Additional utilities 6 1 Additional EPILOG event trace utilities G2 Trace OOED Ll lo ee vo Lon ce B onse ee R DG ESS 6 3 Recording user specified virtual topologies A MPI wrapper affiliation A 1 Enabling and disabling wrappers at compile time 31 31 33 36 37 40 43 43 47 47 47 48 51 51 iii Contents A 2 Subgroupingorcross groupenabling 51 A J Fonction to group o e e Ex PE EN EHS OS ae KE OE E o 52 AG Group to function 2212243449942 34 3464 34 61 B Environment variables 71 Bibliography 75 lv Chapter 1 Introduction 1 Introduction Supercomputing is a key technology of modern science and engineering indispensable to solve critical problems of high complexity As a prerequisite for the productive use of today s large scale computing systems the HPC community needs powerful and robust performance analysis tools that make the optimization of parallel applications both more effective and more efficient Jointly developed at the J lich Supercomputing Centre and the German Research School for Simulation Sciences Aachen Scalasca is a performance analysis toolset that has been specifi
79. red with external trace browser support these instances can also be shown in the timeline display of Paraver or Vampir see Figure Additional information on this topic can be found in the CUBE3 documentation 4 5 Automatic sequential event trace analysis EXPERT is an automatic serial analyzer for merged EPILOG event traces It can be manually applied to OpenMP MPI and hybrid MPI OpenMP traces in EPIK experiment archives after they have been merged via elg merge epik title to produce epik_ lt title gt epik elg Note It may take quite a long time to merge large event traces and the resulting epik elg will typically be more than three times as large as the unmerged process traces Explicit execution of EXPERT on a merged EPILOG event trace in an EPIK experiment archive via expert epik title produces an analysis report epik_ lt title gt expert cube Note Bear in mind that both merging of MPI rank traces and EXPERT analysis are se quential operations that might take a long time for large experiments Warning The EXPERT analyzer requires the event trace to represent a call tree with a single root Therefore you should instrument the entry and exit of the application s main function if necessary Also note that EXPERT requires OpenMP applications to use the same number of threads during all parallel regions A dynamically changing number of threads is not supported 40 4 5 Automatic sequential event
80. s of your system By default Scalasca uses the automatic compiler based instrumentation feature This is usually the best first approach when you don t have detailed knowledge about the application and need to identify the hotspots in your code SOR consists of only a single source file which can be compiled and linked using the following two commands scalasca instrument mpixlc c sor c scalasca instrument mpixlc sor o o sor x Now the instrumented binary sor x must be executed On supercomputing systems users usually have to submit their jobs to a batch system and are not allowed to start parallel jobs directly Therefore the call to the scalasca command has to be provided within a batch script which will be scheduled for execution when the required resources are available The syntax of the batch script differs between the different scheduling systems However common to every batch script format is a passage where all shell commands can be placed that will be executed Here the call to the Scalasca analyzer has to be placed in front of the application execution command scalasca analyze mpirun mode vn np 128 sor x Ensure that the scalasca command is accessible when the batch script is executed e g by loading an appropriate module or updating the PATH if necessary The flags mode and np are options of the mpi run command on Blue Gene P systems and other launchers may have different flags and syntax The Scalasca anal
81. scalasca US Scalasca 1 4 User Guide March 2013 The Scalasca Development Team scalasca fz juelich de Ay J LICH FORSCHUNGSZENTRUM Copyright 1998 2013 Forschungszentrum J lich GmbH Germany Copyright 2009 2013 German Research School for Simulation Sciences GmbH Germany Copyright 2003 2008 University of Tennessee Knoxville USA All rights reserved Redistribution and use in source and binary forms with or without modification are per mitted provided that the following conditions are met Redistributions of source code must retain the above copyright notice this list of conditions and the following disclaimer Redistributions in binary form must reproduce the above copyright notice this list of conditions and the following disclaimer in the documentation and or other materials provided with the distribution Neither the names of Forschungszentrum J lich GmbH the German Research School for Simulation Sciences GmbH or the University of Tennessee Knoxville nor the names of their contributors may be used to endorse or promote products derived from this software without specific prior written permission THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBU TORS AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES INCLUDING BUT NOT LIMITED TO THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED IN NO EVENT SHALL THE COPY RIGHT OWNER OR CONTRI
82. set attr MPI Type set name TYPE MISC Miscellaneous functions for datatypes MPI Type c2f MPI Type f2c 69 Appendix A MPI wrapper affiliation 70 Appendix B Environment variables B Environment variables Table B 1 Scalasca instrumenter environment variables e g MPI OpenMP MPI OpenMP or none to skip instrumentation Variable name Description Default SKIN_COMP Specifies routines that the compiler all should instrument al 1 or none or a custom instrumentation specification SKIN MODE Specifies the instrumentation mode determined automatically SKIN VERBOSE Produces additional information during instrumentation if non zero 71 Appendix B Environment variables Table B 2 Scalasca measurement collection amp analysis nexus environment variables tem after measurement completion in seconds Variable name Description Default SCAN ANALYZE OPTS Specifies trace analyzer options severest in stance tracking value i timestamp correction value s verbosity value v SCAN CLEAN Removes trace data after successful trace analy 0 sis if it is non zero SCAN MPI LAUNCHER Specifies a non standard MPI launcher name SCAN MPI RANKS Specifies the number of MPI processes SCAN OVERWRITE Removes existing experiment archive direct
83. ted Cube 3 0 QT epik sor vn128 trace trace cube Ble Display Topology Help Own root percent Own root percent i7 Absok amp e Metric tree Calltree Flat view System tree Ppology0 E C 0 00 Time C 0 00 main lll 96 50 Execution 0 00 MBI Init O 0 00 MPI 0 00 setup_grid E 0 13 Synchronization 0 00 init field 0 00 Communication 0 00 sor iter Bil 0 72 Point to point C 0 00 init red black 0 00 init boundary C 0 00 Messages in Wrong C 0 00 get halo C 0 00 Late Receiver 0 00 MPI Irecv 7 Bi 0 23 Collective 0 00 MPI Barrier 0 00 File I O 0 00 MPI Irsend lll 0 16 Init Exit L Bl 2 21 Overhead C 0 00 update black Bi 100 00 Visits C 0 00 update red 100 00 Synchronizations C 0 00 MPI Allreduce lll 100 00 Communications 0 00 MPI Finalize lll 100 00 Bytes transferred B lll 100 00 Computational imbalance Z c MU nna 0 05 ed 0 00 6 38 963 6856 Figure 2 2 Determine a Late Sender in CUBE3 18 Chapter 3 Application instrumentation 3 Application instrumentation Scalasca provides several possibilities to instrument user application code Besides the automatic compiler based instrumentation Section 3 1 it provides manual instrumen tation using the EPIK API Section 3
84. ting the instrumented code on a parallel machine the user can generate a summary report also known as profile with aggregate performance metrics for individual function call paths Furthermore event traces can be generated by record ing individual runtime events from which a profile or a time line visualization can later be produced The runtime summarization capability is useful to obtain an overview of the performance behavior and also to optimize the instrumentation for later trace gen eration Since traces tend to become very large and inappropriate instrumentation and measurement configuration will compromise the resulting analysis this step is highly recommended Optimized measurement configuration Y U Measurement Summary c library report Instr target Local event Parallel Pattern Report application traces pattern search report browser Merge N z Global Sequential Pattern event trace pattern search report Pattern trace Third party Conversion Exe trace trace browser Figure 1 2 Scalasca s performance analysis workflow When tracing is enabled each process generates a trace file containing records for all its process local events After program termination Scalasca reloads the trace files back into main memory and analyzes them in parallel using as many CPUs as have been us
85. tion TOPO Topology cartesian and graph communicators MPI Cart coords MPI Cart create MPI Cart get MPI Cart map MPI Cart rank MPI Cart shift MPI Cart sub MPI Cartdim get MPI Dims create MPI Dist graph create MPI Dist graph create adjacent MPI Dist graph neighbors MPI Dist graph neighbors count MPI Graph create MPI Graph get MPI Graph map MPI Graph neighbors MPI Graph neighbors count MPI Graphdims get MPI Topo test TYPE Datatypes MPI Pack MPI Pack external MPI Pack external size MPI Pack size MPI Type commit MPI Type contiguous MPI Type create darray MPI Type create f90 complex MPI Type create f90 integer MPI Type create f90 real MPI Type create hindexed MPI Type create hvector MPI Type create indexed block MPI Type create resized MPI Type create struct MPI Type create subarray MPI Type dup MPI Type extent MPI Type free MPI Type get contents MPI Type get envelope MPI Type get extent MPI Type get true extent MPI Type hindexed MPI Type hvector MPI Type indexed MPI Type Ib MPI Type match size MPI Type size MPI Type struct MPI Type ub MPI Type vector MPI Unpack MPI Unpack external 68 A 4 Group to function TYPE EXT External interfaces for datatypes MPI Type create keyval MPI Type delete attr MPI Type free keyval MPI Type get attr MPI Type get name MPI Type
86. to ex plicit conditional clauses or compiler optimizations or with dynamically changing number of threads are not supported and typically will result in deadlock However different numbers of threads on each process are supported Note SCOUT is typically unable to analyze hybrid MPI OpenMP traces from applica tions employing MPI THREAD SERIALIZED In such cases it may be necessary to enforce MPI THREAD FUNNELED when collecting trace experiments that should be automatically analyzed using SCOUT When running the SCOUT analyzer on back end compute nodes with a different ar chitecture to their system front end remember to specify the path to the appropriate back end version e g SCALASCA RTS scout type 38 4 4 Automatic parallel event trace analysis If your MPI library doesn t automatically support passing command line arguments to all MPI processes the name of the experiment to analyze may need to be passed in a special form e g args epik_ lt title gt orcan be specified via the EPK TITLE configura tion variable in a EPIK CONF file or set in the environment for each MPI process e g env EPK TITLE lt title gt Note SCOUT processes may reguire more than twice the memory of the largest MPI rank trace as reported as max_tbc by scalasca examine s or cube3_score to complete analysis without paging to disk Hardware counters recorded in event traces are currently ignored by SCOUT how
87. trace analysis reports If sufficient memory is physically available this can be specified in the ELG BUFFER SIZE configuration variable for a subsequent trace collec tion Detailed region output cube3 score r can also be examined to identify fre quently executed regions that may adversely impact measurement and not be considered valuable as part of the analysis Such regions without OpenMP and MPI operations may be appropriate for exclusion from subsequent experiments via selective instrumentation and measurement see Sections 3 5 and 3 6 Trace buffer capacity can be saved by elim inating certain functions from the measurement This could be done by providing a filter file which lists the function names of the functions to be excluded A potential filter file 11 Chapter 2 Getting started can be evaluated with the option f filter file 2 4 A full workflow example The previous sections introduced the general usage of Scalasca This section will guide through an example analysis of a simple solver kernel called SOR solving the Poisson equation using a red black successive over relaxation method Details of application instrumentation measurement collection and analysis and analysis report examination options will follow in the subsequent chapters The environment used in the following examples is for IBM Blue Gene P and the com mands and outputs presented in this section might differ from the commands and output
88. trumentation automatic compiler based instrumentation or linking with pre instrumented libraries Instrumenta tion on the source code level can be done by introducing additional instructions into the source code prior to compilation On most systems this process can be automated by using special features of the compiler However this approach typically does not allow a fine grained control of the instrumention The third method is to use pre instrumented libraries which contain instrumented versions of the relevant library functions The 1 3 Scalasca overview Message Passing Interface standard MPI 13 provides a special interface for this kind of instrumentation the so called PMPI interface As this interface is defined in the MPI standard its API is portable and creates an opportunity for tool developers to provide a single portable measurement library for multiple different MPI implementations In comparison the OpenMP standard 15 specifies no such standard interface for tools When the instrumented code is executed during the measurement phase performance data is collected This can be stored as a profile or an event trace depending on the desired level of information needed The additional instructions inserted during instru mentation and associated measurement storage require resources memory as well as CPU time Therefore the application execution is affected to a certain degree Pertur bation by the additional measurement i
89. ts as well as the calculation of state information is done inside the EARL event accessor library a component used by EXPERT During the analysis process EARL dynamically builds up a sparse index structure on the trace file At fixed intervals the state information is stored in so called bookmarks to speed up random access to events If a particular event is requested EARL usually needs not start reading from the beginning of the trace file in order to find it Instead the interpreter looks for the nearest bookmark and takes the state information from there which is required to correctly interpret the subsequent events from the file Then it starts reading the trace from there until it reaches the desired event The distance of bookmarks can be set using the following environment variable EARL_BOOKMARK_DISTANCE default 10000 To gain further efficiency EARL automatically caches the most recently processed events in a history buffer The history buffer always contains a contiguous subsequence of the event trace and the state information referring to the beginning of this subsequence So all information related to events in the history buffer can be completely generated from the buffer including state information The size of the history buffer can be set using another environment variable EARL HISTORY SIZE default 1000 number of processes or threads 41 Chapter 4 Measurement collection amp analysis Note Choosing the rig
90. utine resulting in measurement dilation Recording these events requires significant space and analysis takes longer with relatively little improvement in quality Filtering can be employed during measurement described in section 4 2 1 to ignore events from compiler instrumented routines Ideally such routines or regions should not be instrumented to entirely remove their impact on measurement Uninstrumented routines are still executed but become invisi ble in measurement and subsequent analyses as if inlined Excess manual annotations see Section 3 2 or POMP directives see Section 3 4 should be removed or disabled when instrumenting Automatic routine instrumentation working at the level of source modules can be by passed by selectively compiling such sources normally i e without preprocessing with the Scalasca instrumenter Note The instrumenter is however still required when linking If only some routines within a source module should be instrumented and others left uninstrumented the module can be split into separate files or compiled twice with con ditional preprocessor directives selecting the separate parts and producing separate object files Alternatively when Scalasca has been configured with the PD Toolkit a selective instru mentation specification file can be used as described in Section 3 5 For OpenMP or hybrid MPI OpenMP applications where there are very large num bers of synchronization o
91. y ALL PI 195728 147 47 1 24 summary MPI COM 9696 465 93 3 93 summary COM USR 97144 11235 64 94 82 summary USR PI 80000 2 14 0 02 MPI Irsend PI 73600 1 07 0 01 MPI Irecv PI 16040 20 77 0 18 MPI Allreduce PI 16000 14 32 0 12 MPI Barrier PI 9600 87 25 0 74 MPI Waitall COM 9600 304 28 2 57 get halo USR 4800 5432 60 45 85 update red USR 4800 5432 87 45 85 update black PI 240 0 54 0 00 MPI Gather PI 200 3 63 0 03 MPI Bcast EPK 48 368 66 3 11 TRACING USR 48 0 50 0 00 looplimits PI 24 0 52 0 00 MPI Finalize USR 24 0 54 0 00 init boundary USR 24 0 48 0 00 init red black COM 24 2 88 0 02 sor iter COM 24 156 25 1 32 init field COM 24 0 82 0 01 setup grid MPI 24 17 23 0 15 MPI Init COM 24 1 70 0 01 main 15 Chapter 2 Getting started As the maximum trace buffer required on a single process for the SOR example is ap proximately 215 KB there is no need for filtering in this case Note A potential filter file can be tested and evaluated by adding f filter file to the scalasca examine s command resulting in an updated score report detailing the routines that it filters and the effect on max tbc The flt column of the report indicates with a marker routines which matched the filter and would not appear in a filtered measurement Once the configuration of buffer sizes and or filters have been determined make sure they are specified for subsequent tracing measurements via environment variables or an EPIK CONF
92. y cube Figure 2 1 shows a screenshot of the Scalasca report browser CUBE3 with the summary analysis report of SOR opened Examination of the application performance summary may indicate several influences of the measurement on your application execution be haviour For example frequently executed short running functions may lead to signif icant pertubation and would be prohibitive to trace these need to be eliminated before further investigations using trace analysis are taken into account During trace collection information about the application s execution is recorded in so called event streams The number of events in the streams determines the size of the buffer required to hold the stream in memory To minimize the amount of memory required and to reduce the time to flush the event buffers to disk only the most relevant function calls should be instrumented When the complete event stream would be larger than the memory buffer it has to be flushed to disk during application runtime This flush significantly impacts application performance as flushing is not coordinated between processes and runtime imbalances are induced into the measurement The Scalasca measurement system uses a default value of 10 MB per process or thread for the event trace when this would not be ade quate ELG_BUFFER_S1 ZE can be adjusted to minimize or eliminate flushing of the internal buffers However if too large a value is specified for the buffers th
93. yzer will take care of certain control variables which assist in con figuring the measurement of your application The default behaviour of the Scalasca analyzer is to create a summary analysis report rather than create a detailed event trace as indicated by the initial messages from the EPIK measurement system S C A N Scalasca 1 4 runtime summarization S C A N epik sor vn128 sum experiment archive 12 2 4 A full workflow example S C A N Collect start mpirun mode vn np 128 sor x 00000 EPIK Created new measurement archive epik sor vn128 sum 00000 EPIK Activated epik sor vn128 sum NO TRACE Application output 00000 EPIK Closing experiment epik sor vn128 sum 00000 EPIK Closed experiment epik sor vn128 sum S C A N Collect done S C A N epik sor vn128 sum complete After successful execution of the job a summary analysis report file is created within a new measurement directory In this example the automatically generated name of the measurement directory is epik sor vn128 sum indicating that the job was executed in Blue Gene P s virtual node mode mode vn with 128 MPI processes np 128 The suffix sum refers to a runtime summarization experiment The summary analysis report can then be post processed and examined with the Scalasca report browser scalasca examine epik sor vn128 sum INFO Post processing runtime summarization report INFO Displaying epik sor vn128 sum summar

Scalasca User's Guide - Forschungszentrum Jülich

Contents

Download Pdf Manuals

Related Search

Related Contents

Scalasca User's Guide - Forschungszentrum J&uuml;lich

Contents

Download Pdf Manuals

Related Search

Related Contents

Scalasca User's Guide - Forschungszentrum Jülich