Open|SpeedShop User Manual
Contents
1. [Screenshot: Open|SpeedShop GUI showing a User Time experiment with the StatsPanel, Source Panel, and Command Panel. Below we see the Exclusive CPU time on highlighted lines that indicate relatively high CPU times.]

While performance tools will point out potential bottlenecks and hot areas, it is still up to the user to interpret most data in the correct context, as well as note areas of the code you may want to probe further. If the inclusive and exclusive times are similar, this means the child executions are insignificant with respect
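The inclusive/exclusive relationship above can be checked numerically. This is a minimal sketch (the timings are invented, not tool output): it computes what fraction of a function's inclusive time is spent in its callees, so a value near zero means the children are insignificant and the function body itself is the hot spot.

```python
# Hypothetical per-function timings (seconds) from a usertime-style profile.
# inclusive = time in the function plus all of its callees;
# exclusive = time spent in the function body alone.
def child_fraction(inclusive, exclusive):
    """Fraction of a function's inclusive time spent in its callees."""
    if inclusive == 0:
        return 0.0
    return (inclusive - exclusive) / inclusive

# If inclusive and exclusive times are similar, the callees are insignificant:
print(child_fraction(10.0, 9.8))   # callees account for only ~2% of the time
print(child_fraction(10.0, 1.0))   # callees account for 90%: drill down into them
```

When the fraction is large, the calltree views (expview -v calltrees) show which callees to investigate.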
2. Column Name: Column Definition
Exclusive Mem Call Time: Aggregated total exclusive time spent in the memory function corresponding to this row of data.
% of Total Time: Percentage of exclusive time relative to the total time spent in the memory function corresponding to this row of data.
Number of Calls: Total number of calls to the memory function corresponding to this row of data.
Min Request Count: The number of times the minimum bytes allocated or freed occurred during this experiment.
Min Requested Bytes: The minimum number of bytes that were allocated or freed by the corresponding memory function.
Max Request Count: The number of times the maximum bytes allocated or freed occurred during this experiment.
Max Requested Bytes: The maximum number of bytes that were allocated or freed by the corresponding memory function.
Total Requested Bytes: The total number of bytes allocated by the corresponding function. Note: this does not subtract the bytes freed; this only totals the allocation function requested bytes.

Here we show a default view of the output from the ossmem experiment run of smg2000 on a small cluster:

[jeg@localhost test]$ openss -cli -f smg2000-mem-0.openss
openss>>[openss]: The restored experiment identifier is:  -x 1
openss>>expview

   Exclusive   % of    Number   Min     Min    Max     Max    Total   Function (defining location)
   Mem Call    Total   of       Req     Req    Req     Req    Bytes
   Time(ms)    Time    Calls    Count   Bytes  Count   Bytes
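The "% of Total Time" column in this view is simply each row's exclusive time divided by the sum over all rows. A quick sketch with made-up numbers (not from the smg2000 run):

```python
# Invented exclusive memory-call times (ms) for three memory functions.
mem_times_ms = {"__libc_malloc": 4.0, "free": 0.5, "realloc": 0.5}

total = sum(mem_times_ms.values())
# "% of Total Time" column: each function's share of the total.
percent = {fn: 100.0 * t / total for fn, t in mem_times_ms.items()}
print(percent["__libc_malloc"])  # 80.0
```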
3. The second view below shows the top five time-taking POSIX thread function call paths through the application monitored.

[Screenshot: GUI call-paths view showing the top five POSIX thread function call paths.]

The third view is an event list view, which is a chronological POSIX thread call list. This view shows each POSIX thread function call in the order they occurred, showing the rank and thread the call originated from, the time spent in the POSIX thread function call event, and the percentage of the total time that represents.

[Screenshot: GUI event list view showing the chronological POSIX thread call list with rank, thread, exclusive call time, and % of total.]

8.2.1.3 Threading Specific (pthreads) experiment performance data viewing with CLI

To launch the CLI on any experiment use: openss -cli -f <database name>

openss>>[openss]: The restored experiment identifier is:  -x 1
openss>>expview

   Exclusive   % of      Number   Function (defining location)
   Pthreads    Total     of
   Call        Time      Calls
   Time(ms)
   0.3338      87.8148   160      __GI___pthread_mutex_lock (libpthread-2.17.so)
   0.0463      12.1852   160      __pthread_mutex_unlock (libpthread-2.17.so)

openss>>expview -v calltrees,fullstack pthreads3

   Exclusive
4. View performance data for all of the application's lifetime or smaller time slices.
Compare performance results between processes, threads, or ranks, or between a previous experiment and the current experiment.
Interactive CLI help facility, which lists the CLI commands, syntax, and typical usage.
Option to automatically group like-performing processes, threads, or ranks.
Create MPI traces in OTF (Open Trace Format).

1 What is Performance Analysis?

Performance Analysis, also called software profiling or performance tuning, is not only a way to measure the speed and efficiency of a program, but also a way to identify bottlenecks in parallel applications. Software developers are facing new issues when writing code for massively parallel applications. There may be issues in code that do not become apparent until it is run on thousands, tens of thousands, or hundreds of thousands of cores. Performance Analysis can be used to identify problems and tune applications for optimal speed and efficiency. There are many aspects of a program that can be measured in order to analyze its performance. You can measure the time each function takes, or the call paths within an application. There are a number of hardware counters available, like the number of floating point operations per second (FLOPS) performed, or the number of data cache misses. You can monitor the I/O operations for a program to analyze its interaction with the file system. Not only are t
5. >>>> main (lu.C.256)
>>>>> 140 in MAIN__ (lu.C.256: lu.f,46)
>>>>>> 180 in ssor_ (lu.C.256: ssor.f,4)
>>>>>>> 64 in rhs_ (lu.C.256: rhs.f,5)
>>>>>>>> 88 in exchange_3_ (lu.C.256: exchange_3.f,5)
>>>>>>>>> 893 in mpi_wait_ (mpi-mvapich-rt-offline.so: wrappers-fortran.c,893)
>>>>>>>>>> 889 in mpi_wait (mvapich-rt-offline.so: wrappers-fortran.c,885)
2798.770000 1.805823 250
>>>>>>>>>>> 51 in MPI_Wait (libmpich.so.1.0: wait.c,51)

In this experiment we did program counter sampling to get an overview of the application. We noticed that smp_net_lookup showed up in the function load balance view, which caused us to take a look at the linked object view. The load balance on the linked object showed some imbalance, so we looked at the cluster analysis view and found that rank 255 was an outlier. We then took a closer look at rank 255 and saw that the pcsamp output shows most of the time was spent in smp_net_lookup. We used the MPI experiment to determine if we could get more clues, and saw that a load balance view on the MPI experiment shows rank 255's MPI_Allreduce time is the highest of the 256 ranks. We then looked at rank 255 and a representative rank from the rest of the ranks, and noted the differences in MPI_Wait, MPI_Send, and MPI_Allreduce. We looked a
6. memory function.
Total Requested Bytes: The total number of bytes allocated by the corresponding function. Note: this does not subtract the bytes freed; this only totals the allocation function requested bytes.

The paths to each memory function call through the source are available through the call path views.

[Screenshot: Memory Traceback (call path) view in the GUI StatsPanel.]

In this C icon call path view, we see the call paths to the memory functions called in this application.

[Screenshot: call path view showing call paths to the memory functions.]

In the view below, one has chosen the LB icon and generated the load balance view. This view shows the min, max, and average time across all the ranks in the application. The ranks of the min and max time values are also shown. If there is a significant difference between the min, max, and average time, there may be load imbalance. To identify the ranks, threads, or processes that are acting out of balance, use the cluster analysis feature, activated
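The load balance view's min/max/average reduction can be illustrated with a small sketch (rank times are invented for the example). A max noticeably above the average is the signal the text describes, and the offending rank is a candidate for the cluster analysis view:

```python
# Invented per-rank times (seconds); rank 255 is deliberately an outlier.
rank_times = {0: 12.1, 1: 12.4, 2: 12.3, 255: 18.7}

# The LB view's three statistics plus the rank holding each extreme.
max_rank = max(rank_times, key=rank_times.get)
min_rank = min(rank_times, key=rank_times.get)
average = sum(rank_times.values()) / len(rank_times)

# A max/average ratio well above 1 suggests load imbalance worth a closer look.
imbalance = rank_times[max_rank] / average
print(max_rank, min_rank, round(imbalance, 2))  # 255 0 1.35
```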
7. OPENSS_RAWDATA_DIR: Used on cluster systems where a /tmp file system is unique on each node. It specifies the location of a shared file system path, which is required for O|SS to save the raw data files on distributed systems. Format: OPENSS_RAWDATA_DIR=<shared file system path>. Example: export OPENSS_RAWDATA_DIR=/lustre4/fsys/userid

OPENSS_ENABLE_MPI_PCONTROL: Activates the MPI_Pcontrol function recognition; otherwise MPI_Pcontrol function calls will be ignored by O|SS.

OPENSS_DATABASE_ONLY: When running the Open|SpeedShop convenience scripts, only create the database file and do NOT put out the default report. Used to reduce the size of the batch file output files if the user is not interested in looking at the default report.

OPENSS_RAWDATA_ONLY: When running the Open|SpeedShop convenience scripts, only gather the performance information into the OPENSS_RAWDATA_DIR directory, but do NOT create the database file and do NOT put out the default report.

OPENSS_DB_DIR: Specifies the path to where O|SS will build the database file. On a file system without file locking enabled, the SQLite component cannot create the database file. This variable is used to specify a path to a file system with locking enabled for the database file creation. This usually occurs on Lustre file systems that don't have locking enabled. Format: OPENSS_DB_DIR=<file system path>. Example: export OPENSS_DB_DIR=/opt/filesys/userid

OPENSS_MPI_IMPLEMENTATION: Specifies the MPI implementation in use by the application; only needed for the mpi, mpit, and mpiotf experiments. These are the currently supported
8. 4 4 4 4 __libc_malloc (libc-2.17.so)

9.3 Memory Analysis Tracing (mem) experiment performance data viewing with GUI

To launch the GUI on any experiment use: openss -f <database name>

The first GUI view, shown below, is the default view for the mem experiment. It shows the memory functions that were called in the application, how many times they were called, the time spent in each of the memory functions, and the percentage of the overall memory function time that was spent in each of the memory functions. This table identifies what each of the columns represents in the default GUI view for the mem experiment:

Column Name: Column Definition
Exclusive Mem Call Time: Aggregated total exclusive time spent in the memory function corresponding to this row of data.
% of Total Time: Percentage of exclusive time relative to the total time spent in the memory function corresponding to this row of data.
Number of Calls: Total number of calls to the memory function corresponding to this row of data.
Min Request Count: The number of times the minimum bytes allocated or freed occurred during this experiment.
Min Requested Bytes: The minimum number of bytes that were allocated or freed by the corresponding memory function.
Max Request Count: The number of times the maximum bytes allocated or freed occurred during this experiment.
Max Requested Bytes: The maximum number of bytes that were allocated or freed by the corresponding
9. MPI_Finalize (libmpi.so.0.0.0: pfinalize.c,35)
MPI_Isend (libmpi.so.0.0.0: pisend.c,49)
MPI_Scatterv (libmpi.so.0.0.0: pscatterv.c,40)
MPI_Irecv (libmpi.so.0.0.0: pirecv.c,39)
MPI_Gatherv (libmpi.so.0.0.0: pgatherv.c,4)

You can use the views dialog box to choose what metric to display. Use the Optional Views Dialog box to choose the performance metrics to be displayed in the StatsPanel and click OK. Clicking OK will regenerate the StatsPanel with the new metrics displayed.

[Optional Views Dialog: MPIT Experiment Custom Report Selection Dialog, with checkboxes for: MPIT Exclusive Time Values, MPIT Inclusive Time Values, MPIT Minimum Time Values, MPIT Maximum Time Values, MPIT Average Time Values, MPIT Count (Calls To Function), MPIT Exclusive Time Percentage Values, MPIT Standard Deviation Values, MPIT Message Size Values; and, for the MPIT Experiment Event List (-v trace) ONLY: MPIT Individual Event Start Times, MPIT Individual Event Stop Times, MPIT Source Rank Numbers, MPIT Destination Rank Numbers, MPIT Message Tag Values, MPIT Communicator Used Values, MPIT Message Data Type Values, MPIT Function Dependent Return Values.]

After choosing the event to view, it will then be displayed.

[Screenshot: StatsPanel regenerated with the chosen MPIT event metrics.]
10. Open|SpeedShop tool, including:
o Memory Analysis (mem) experiment, see section 9
o Lightweight I/O (iop) experiment, see section 7.4
o Lightweight MPI (mpip) experiment, see section 8.1.2
o POSIX thread analysis (pthreads) experiment, see section 8.2.1
o CUDA NVIDIA GPU analysis (cuda) experiment, see section 8.3
To use this version requires that the Open|SpeedShop CBTF version is built instead of the offline version. See the Build and Install guide for information on how to build the CBTF-based version of Open|SpeedShop. Rudimentary support for ARM-based platforms is now included in the offline version of Open|SpeedShop. Program counter sampling (pcsamp) and the hardware counter (hwc, hwcsamp) experiments are currently supported. Work in 2015 will be focused on providing a full-featured ARM-based CBTF and offline version of Open|SpeedShop.

Table of Contents
1 What is Performance Analysis ........................ 10
2 How to use Performance Analysis ..................... 11
2.1 Sequential Code Performance Analysis .............. 12
2.2 Shared Memory Applications ........................ 12
2.3 Message Passing Applications ...................... 13
3
11. Status is NonExistent. Saved database is L1-64PE-sweep3d-mpi-hwcsamp.openss.
Performance data spans 1:07.958138 (mm:ss) from 2013/03/27 22:32:45 to 2013/03/27 22:33:53.
Executables Involved: sweep3d.mpi
Currently Specified Components:
  -h ys6128 -p 2765  -t 47176895393312 -r 3  (sweep3d.mpi)
  -h ys6128 -p 2766  -t 47824321252896 -r 0  (sweep3d.mpi)
  -h ys6128 -p 2767  -t 47369830317600 -r 1  (sweep3d.mpi)
  -h ys6128 -p 2768  -t 47378742910496 -r 2  (sweep3d.mpi)
  -h ys6129 -p 22862 -t 47327259860512 -r 5  (sweep3d.mpi)
  -h ys6129 -p 22863 -t 47201888194080 -r 6  (sweep3d.mpi)
  -h ys6129 -p 22864 -t 47185544437280 -r 7  (sweep3d.mpi)
  -h ys6250 -p 11462 -t 47028080107040 -r 63 (sweep3d.mpi)
  -h ys6250 -p 11463 -t 47600632852000 -r 60 (sweep3d.mpi)
  -h ys6250 -p 11464 -t 47494028697120 -r 61 (sweep3d.mpi)
  -h ys6250 -p 11465 -t 47944527175200 -r 62 (sweep3d.mpi)
Previously Used Data Collectors: hwcsamp
Metrics: hwcsamp::exclusive_detail, hwcsamp::percent, hwcsamp::threadAverage, hwcsamp::threadMax, hwcsamp::threadMin, hwcsamp::time
Parameter Values: hwcsamp::event = PAPI_L1_DCM,PAPI_L1_ICM,PAPI_L1_TCM,PAPI_L1_LDM,PAPI_L1_STM; hwcsamp::sampling_rate = 100
Available Views: hwcsamp

5.1.3.3 osshwcsamp experiment Load Balance command and CLI view

openss>>expview -m loadbalance

   Max CPU    Rank   Min CPU    Rank   Average    Function (defining location)
   Time       of     Time       of     CPU Time
   Across     Max    Across     Min    Across
   Ranks(s)          Ranks(s)          Ranks(s)
   14.890000  28     10.950000  27     12.888594  __libc_
12. _start (IOR)
> 562 in __libc_start_main (libmonitor.so.0.0.0: main.c,541)
>> 258 in __libc_start_main (libc-2.12.so: libc-start.c,96)
>>> 517 in monitor_main (libmonitor.so.0.0.0: main.c,492)
>>>> 153 in main (IOR: IOR.c,108)
>>>>> 2004 in TestIoSys (IOR: IOR.c,1848)
>>>>>> 104 in IOR_Create_POSIX (IOR: aiori-POSIX.c,74)
>>>>>>> 670 in open64 (iot-collector-monitor-mrnet-mpi.so: wrappers.c,608)
3.419380 512
>>>>>>>> 82 in __libc_open (libc-2.12.so: syscall-template.S,82)

_start (IOR)
> 562 in __libc_start_main (libmonitor.so.0.0.0: main.c,541)
>> 258 in __libc_start_main (libc-2.12.so: libc-start.c,96)
>>> 517 in monitor_main (libmonitor.so.0.0.0: main.c,492)
>>>> 153 in main (IOR: IOR.c,108)
>>>>> 2161 in TestIoSys (IOR: IOR.c,1848)
>>>>>> 195 in IOR_Open_POSIX (IOR: aiori-POSIX.c,173)
>>>>>>> 670 in open64 (iot-collector-monitor-mrnet-mpi.so: wrappers.c,608)
4757.147988 0.157392 512
>>>>>>>> 82 in __libc_open (libc-2.12.so: syscall-template.S,82)

_start (IOR)
> 562 in __libc_start_main (libmonitor.so.0.0.0: main.c,541)
>> 258 in __libc_start_main (libc-2.12.so: libc-start.c,96)
>>> 517 in monitor_main (libmonitor.so.0.0.0: main.c,492)
>>>> 153 in main (IOR
13. >>>>>>> 168 in opal_progress (libopen-pal.so.6.2.0: opal_progress.c,150)
>>>>>>>>>>>>>>>>> 1569 in opal_libevent2021_event_base_loop (libopen-pal.so.6.2.0: event.c,1559)
>>>>>>>>>>>>>>>>>> 74 in evthread_posix_lock (libopen-pal.so.6.2.0: evthread_pthread.c,68)
>>>>>>>>>>>>>>>>>>> 237 in pthread_mutex_lock (pthreads-collector-monitor-mrnet-mpi.so: wrappers.c,199)
0.0199 5.2304 6
>>>>>>>>>>>>>>>>>>>> __GI___pthread_mutex_lock (libpthread-2.17.so)

openss>>expview -m loadbalance

   Max        Rank   Min        Rank   Average    Function (defining location)
   Exclusive  of     Exclusive  of     Exclusive
   Pthreads   Max    Pthreads   Min    Pthreads
   call time         call time         call time
   Across            Across            Across
   Ranks(ms)         Ranks(ms)         Ranks(ms)
   0.1084     3      0.0717     2      0.0835     __GI___pthread_mutex_lock (libpthread-2.17.so)
   0.0120     3      0.0112     1      0.0116     __pthread_mutex_unlock (libpthread-2.17.so)

8.3 NVIDIA CUDA Analysis

The Open|SpeedShop version with CBTF collection mechanisms supports tracing CUDA events in an NVIDIA CUDA based application. An event-by-event list of CUDA events and the event arguments are gathered and displayed.

8.3.1 NVIDIA CUDA Tracing (cuda) experiment performance data gathering

To run the
14. -n 50 50 50"
Additional arguments: default: trace all supported memory functions. <f_t_list>: Comma-separated list of exceptions to trace, consisting of one or more of: malloc, free, memalign, posix_memalign, calloc, and realloc.

15.13 osspthreads: POSIX Thread Analysis Experiment

General form: osspthreads "<command> <args>" [ default | <f_t_list> ]
Sequential job example: osspthreads "smg2000 -n 50 50 50"
Parallel job example: osspthreads "mpirun -np 128 smg2000 -n 50 50 50"
Additional arguments: default: trace all POSIX thread functions. <f_t_list>: Comma-separated list of exceptions to trace, consisting of one or more of: pthread_create, pthread_mutex_init, pthread_mutex_destroy, pthread_mutex_lock, pthread_mutex_trylock, pthread_mutex_unlock, pthread_cond_init, pthread_cond_destroy, pthread_cond_signal, pthread_cond_broadcast, pthread_cond_wait, and pthread_cond_timedwait.

15.14 osscuda: NVIDIA CUDA Tracing Experiment

General form: osscuda "<command> <args>"
Sequential job example: osscuda "eigenvalues --matrix-size=4096"
Parallel job example: osscuda "mpirun -np 64 -npernode 1 lmp_linux -sf gpu < in.lj"

15.15 Key Environment Variables

EXECUTION RELATED VARIABLES: DESCRIPTION
OPENSS_RAWDATA_DIR
OPENSS_ENABLE_MPI_PCONTROL
OPENSS_DATABASE_ONLY
OPENSS_RAWDATA_ONLY
OPENSS_DB_DIR
OPENSS_MPI_IMPLEMENTATION
15.
   Exclusive   % of    Number   Call Stack Function (defining location)
   Pthreads    Total   of
   Call        Time    Calls
   Time(ms)

_start (smg2000)
> 562 in __libc_start_main (libmonitor.so.0.0.0: main.c,541)
>> __libc_start_main (libc-2.17.so)
>>> 517 in monitor_main (libmonitor.so.0.0.0: main.c,492)
>>>> 512 in main (smg2000: smg2000.c,21)
>>>>> 69 in HYPRE_StructSMGSolve (smg2000: HYPRE_struct_smg.c,64)
>>>>>> 225 in hypre_SMGSolve (smg2000: smg_solve.c,57)
>>>>>>> 325 in hypre_SMGRelax (smg2000: smg_relax.c,228)
>>>>>>>> 225 in hypre_SMGSolve (smg2000: smg_solve.c,57)
>>>>>>>>> 327 in hypre_SMGRelax (smg2000: smg_relax.c,228)
>>>>>>>>>> 1084 in hypre_CyclicReduction (smg2000: cyclic_reduction.c,757)
>>>>>>>>>>> 405 in hypre_FinalizeIndtComputations (smg2000: computation.c,399)
>>>>>>>>>>>> 676 in hypre_FinalizeCommunication (smg2000: communication.c,662)
>>>>>>>>>>>>> 76 in PMPI_Waitall (libmpi.so.1.5.2: pwaitall.c,43)
>>>>>>>>>>>>>> 287 in ompi_request_default_wait_all (libmpi.so.1.5.2: req_wait.c,217)
>>>>>>>>>>>>>>> 77 in opal_condition_wait (libmpi.so.1.5.2: condition.h,57)
>>>>>>>>>
16. ossmpi, which will record call times, and ossmpit, which will record call times and arguments. Equal events will be aggregated to save space in the database as well as to reduce the overhead. There is one more MPI experiment that will save the full MPI traces in the Open Trace Format (OTF), with the convenience script ossmpiotf. Again we will run the experiment on the smg2000 application. The syntax for the experiment is:

>> ossmpit "srun -N 4 -n 32 smg2000 -n 50 50 50" [ default | <list MPI functions> | mpi_category ]

The default behavior is to trace all MPI functions, but a comma-separated list of MPI functions can be given if you only want to trace specific functions (e.g., MPI_Send, MPI_Recv, etc.). You can also select an mpi_category to trace: all, asynchronous_p2p, collective_com, datatypes, environment, graphs_contexts_comms, persistent_com, process_topologies, and synchronous_p2p.

The default views are designed to relate the information included in the report back to the individual calls to their corresponding MPI functions. This is the same information that would be reported if the user were to do an expview -m min,max,average. The view is a representation of the minimum, maximum, and average time values per individual calls to their corresponding MPI functions. The average time reported is the total amount of time for all the calls to a function divided
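The min/max/average reduction described above can be sketched directly: each MPI function's per-call times collapse to a minimum, a maximum, and total time divided by the number of calls. The per-call times below are invented for illustration, not from an smg2000 run.

```python
# Invented per-call times (ms) for two traced MPI functions.
calls = {
    "MPI_Send": [0.8, 1.1, 0.9],
    "MPI_Recv": [2.0, 6.5, 2.5, 2.0],
}

def summarize(times):
    """The -m min,max,average reduction over one function's call times."""
    return {"min": min(times), "max": max(times),
            "average": sum(times) / len(times)}

summary = {fn: summarize(t) for fn, t in calls.items()}
print(summary["MPI_Recv"])  # a single slow call pulls max well above average
```

A max far above the average for a function such as MPI_Recv is the cue to switch to the event (trace) view and find the individual slow call.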
17. (IOR: IOR.c,108)
>>>>> 2013 in TestIoSys (IOR: IOR.c,1848)
>>>>>> 2608 in WriteOrRead (IOR: IOR.c,2562)
>>>>>>> 244 in IOR_Xfer_POSIX (IOR: aiori-POSIX.c,224)
>>>>>>>> 321 in write (iot-collector-monitor-mrnet-mpi.so: wrappers.c,239)
316.176763 0.010461 2048
>>>>>>>>> 82 in write (libc-2.12.so: syscall-template.S,82)

7.4 Open|SpeedShop Lightweight I/O Profiling General Usage

The Open|SpeedShop iop I/O function profiling experiment wraps the most common I/O functions. It records the time spent in each I/O function, records the call path along which each I/O function was called, records the time spent along each call path to an I/O function, and records the number of times each function was called.

7.4.1 I/O Profiling (iop) experiment performance data gathering

The I/O Profiling (iop) experiment convenience script is ossiop. Use this convenience script in this manner to gather lightweight I/O profiling performance data:

ossiop "how you normally run your application"

The following is an example of how to gather data for the IOR application on the Cray platform using the ossiop convenience script:

ossiop "aprun -n 64 IOR"

7.4.2 I/O Profiling (iop) experiment performance data viewing with GUI

To launch the GUI on any experiment use: openss -f <database name>

The first image below shows the default view for t
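Conceptually, what the iop experiment's wrappers do can be sketched in a few lines. This is NOT the actual iop collector (which wraps libc I/O functions at link/run time); it is a hypothetical Python analogue showing the bookkeeping the text describes: time per function, call count, and attribution to a call path.

```python
import time

# (function name, call path) -> [accumulated time, call count]
profile = {}

def wrap(fn, call_path):
    """Return fn wrapped so each call's duration and count are recorded."""
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            entry = profile.setdefault((fn.__name__, call_path), [0.0, 0])
            entry[0] += elapsed   # time spent along this call path
            entry[1] += 1         # number of calls
    return wrapper

# Example: wrap a stand-in "write" and call it twice from one call path.
fake_write = wrap(lambda data: len(data), call_path="main->TestIoSys->write")
fake_write(b"abc")
fake_write(b"defg")
print(profile[("<lambda>", "main->TestIoSys->write")][1])  # 2 calls recorded
```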
18. SLES, SUSE, RHEL, Fedora Core, CentOS, Debian, Ubuntu, and many others. It has been installed on the IBM Blue Gene and the Cray systems. The Open|SpeedShop website contains information on special builds and usage instructions.

The source code for Open|SpeedShop is available for download at the Open|SpeedShop project home on SourceForge: http://sourceforge.net/projects/openss. CVS access is available at: http://sourceforge.net/scm?type=cvs&group_id=176777. Packages and additional information can be found on the Open|SpeedShop website: http://www.openspeedshop.org

13.1 Open|SpeedShop Cluster Install

Open|SpeedShop comes with a set of bash install scripts that will build Open|SpeedShop and any components it needs from source tarballs. First it will check to see if the correct supporting software is installed on your system; if the needed software isn't installed, it will ask to build it for you. The only thing you need to do is provide a few arguments for the install script. For a normal setup you would just specify the directory to install in, what build task you want to do, and the location of your MPI and QT installs. For example:

install-tool --build-offline-openss --prefix /opt/myoss --with-openmpi /opt/openmpi-1.8.2 --with-mvapich /opt/mvapich-1.7

After the install has successfully completed, there are a few important environment variables you need to set. Set a variable for the install location so you can reuse it. Then set th
19. SpeedShop also supports the MPI_Pcontrol function. This feature allows the user to gather performance data only for sections of their code bounded by the MPI_Pcontrol calls. The MPI_Pcontrol calls must be added to the source code of the application: MPI_Pcontrol(1) enables the gathering of performance data and MPI_Pcontrol(0) disables the gathering. You must also set the Open|SpeedShop environment variable OPENSS_ENABLE_MPI_PCONTROL to 1 in order to activate the MPI_Pcontrol call recognition; otherwise it will be ignored. Optionally, you can set the OPENSS_START_ENABLED environment variable to 1 to have performance data gathered until an MPI_Pcontrol(0) call is encountered. If OPENSS_START_ENABLED is not set, no performance data will be gathered until an MPI_Pcontrol(1) call is encountered. Note that for OPENSS_START_ENABLED to have any effect, OPENSS_ENABLE_MPI_PCONTROL must be set.

11.5 Graphical User Interface Basics

This section gives an overview of the Open|SpeedShop graphical user interface, focusing on the basic functionality of the GUI. To launch the GUI on any experiment use: openss -f <database name>

11.5.1 Basic Initial View (Default View)

Because this example usertime experiment default view has many of the icons and features of the other Open|SpeedShop experiments, it is used here for illustration purposes.

[Screenshot: usertime experiment default view, with File/Tools/Help menus, Process Control panel, and StatsPanel.]
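The MPI_Pcontrol gating logic described above can be modeled as a small state machine: samples are kept only while collection is enabled, and OPENSS_START_ENABLED decides the initial state. This is a conceptual sketch of that behavior, not O|SS code; the event encoding (1/0 for pcontrol calls, "s" for a sample) is invented.

```python
# Model of pcontrol gating: 1 = MPI_Pcontrol(1), 0 = MPI_Pcontrol(0),
# "s" = a performance-data sample taken at that point in the run.
def run(events, start_enabled=False):
    """Return how many samples survive gating; start_enabled models
    OPENSS_START_ENABLED=1."""
    enabled = start_enabled
    kept = 0
    for ev in events:
        if ev == 1:
            enabled = True       # MPI_Pcontrol(1): start gathering
        elif ev == 0:
            enabled = False      # MPI_Pcontrol(0): stop gathering
        elif enabled:
            kept += 1            # sample recorded only while enabled
    return kept

# Without start_enabled, nothing before MPI_Pcontrol(1) is recorded:
print(run(["s", "s", 1, "s", "s", "s", 0, "s"]))        # 3
# With start_enabled, the leading samples are recorded too:
print(run(["s", "s", 1, "s", "s", "s", 0, "s"], True))  # 5
```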
20. The GUI will provide some simple graphics to help you understand the results and will relate the data back to the source code when possible.

3.1.1 Common Terminology

Technical terms can have multiple and/or context-sensitive meanings; therefore this section attempts to explain and clarify the meanings of the terms used in this document, especially with respect to the Open|SpeedShop tools.

Experiment: A set of collectors and executable(s) bound together to generate performance information that can be viewed in human-readable form.

Focused Experiment: The current experiment that commands operate on. The user may run or view multiple experiments simultaneously, and unless a particular experiment is specified directly, the focused experiment will be used. Experiments are given an enumeration, called an experiment id, for identification.

Component(s): A component is a somewhat self-contained code section of the Open|SpeedShop performance tool. This section of code does a set of specific, related tasks for the tool. For example, the GUI component does all the tasks related to displaying Open|SpeedShop wizards, experiment creation, and results using a graphical user interface. The CLI component does similar functions but uses the interactive command line delivery method.

Collector: The portion of the tool containing logic that is responsible for the gathering of the performance
21. accessible from any CPU. The programming models common to shared memory applications include threads (e.g., POSIX threads) and OpenMP. The typical performance issues with shared memory applications include limited bus bandwidth, where a bottleneck occurs when many CPUs are trying to access the same resources. There can be synchronization overhead associated with thread startup. There can be problems with not balancing the workload among threads properly or most efficiently. There can be complications with Non-Uniform Memory Access (NUMA).

[Figure: shared memory architecture, multiple CPUs with attached memory sharing an L2 cache and main memory.]

2.3 Message Passing Applications

Message passing applications use a distributed memory model, with sequential or shared memory nodes coupled by a network. In this case, data is exchanged using message passing via a Message Passing Interface (MPI). The typical performance issues associated with message passing applications include long blocking times while waiting on data, or low messaging rates creating bottlenecks due to insufficient network bandwidth.

3 Introduction to Open|SpeedShop

Open|SpeedShop is an open source performance analysis tool framework. It provides the most common performance analysis steps all in one tool. It is easily extendable by writing plugins to collect and display performance data. It also comes with built-in experiments to gather and display several types of performance information
22. and experiment output follows below:

>> osspcsamp "srun -n 256 smg2000 -n 60 60 60"

[openss]: pcsamp experiment using the pcsamp experiment default sampling rate: "100".
[openss]: pcsamp experiment calling openss.
[openss]: Setting up offline raw data directory in /p/lscratchrzb/jeg/offline-oss
[openss]: Running offline pcsamp experiment using the command:
"srun -ppdebug -n 256 /collab/usr/global/tools/openspeedshop/oss-dev-x8664/oss_offline_v2.1u6/bin/ossrun -c pcsamp smg2000 -n 60 60 60"

Running with these driver parameters:
  (nx, ny, nz)    = (60, 60, 60)
  (Px, Py, Pz)    = (256, 1, 1)
  (bx, by, bz)    = (1, 1, 1)
  (cx, cy, cz)    = (1.000000, 1.000000, 1.000000)
  (n_pre, n_post) = (1, 1)
  dim             = 3
  solver ID       = 0
Struct Interface:
  wall clock time = 0.020830 seconds
  cpu clock time  = 0.030000 seconds
SMG Setup:
  wall clock time = 0.451188 seconds
  cpu clock time  = 0.460000 seconds
SMG Solve:
  wall clock time = 2.707334 seconds
  cpu clock time  = 2.720000 seconds
Iterations = 7
Final Relative Residual Norm = 1.446921e-07

[openss]: Converting raw data from /p/lscratchrzb/jeg/offline-oss into temp file X.0
[openss]: Processing raw data for smg2000
Processing processes and threads ...
Processing performance data ...
Processing symbols ...
Resolving symbols for /g/g24/jeg/demos/workshop_demos/mpi/smg2000/test/smg2000
Resolving symbols for /lib64/ld-2.12.so
Resolving symbols for /collab/usr/global/tools/openspeedshop/oss-dev-x8664/
23. any experiment use: openss -cli -f <database name>

The following example was run on the Yellowstone platform at NCAR/UCAR using the job script shown below.

5.1.3.1 Job Script and osshwcsamp command

#!/bin/csh
# LSF batch script to run an MPI application
#BSUB -P Pnnnnnnnn               # project code
#BSUB -W 00:30                   # wall-clock time (hrs:mins)
#BSUB -n 64                      # number of tasks in job
#BSUB -R "span[ptile=4]"         # run 4 MPI tasks per node
#BSUB -J sweep3d.hwcsamp         # job name
#BSUB -o sweep3d.hwcsamp.%J.out  # output file name in which %J is replaced by the job ID
#BSUB -e sweep3d.hwcsamp.%J.err  # error file name in which %J is replaced by the job ID
#BSUB -q regular                 # queue

module load openspeedshop
mkdir -p /glade/scratch/$USER/sweep3d
rm -rf /glade/scratch/$USER/sweep3d/hwcsamp
mkdir /glade/scratch/$USER/sweep3d/hwcsamp
setenv OPENSS_RAWDATA_DIR /glade/scratch/$USER/sweep3d/hwcsamp
setenv REQUEST_SUSPEND_HPC_STAT 1
echo "running on compute node: osshwcsamp"
osshwcsamp "mpirun.lsf /glade/u/home/galaro/demos/sweep3d/orig/sweep3d.mpi" PAPI_L1_DCM,PAPI_L1_ICM,PAPI_L1_TCM,PAPI_L1_LDM,PAPI_L1_STM

5.1.3.2 osshwcsamp experiment default CLI view

The table below describes information that is included in the hwcsamp experiment default view when no alternative PAPI hardware counter arguments are specified for the osshwcsamp experiment.

Column Name: Column Definition
Exclusive CPU Time: Aggregated total exclusive time sp
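Since hwcsamp is a sampling experiment, the exclusive CPU time column is derived from sample counts rather than measured directly. A sketch of that conversion, assuming the default rate of 100 samples per second used in the run above (the sample count below is invented):

```python
# At 100 samples/second, each sample attributes 10 ms of CPU time to the
# function the program counter was in when the sample fired.
SAMPLING_RATE_HZ = 100  # hwcsamp default sampling rate

def samples_to_seconds(sample_count, rate_hz=SAMPLING_RATE_HZ):
    """Convert a per-function sample count into exclusive CPU seconds."""
    return sample_count / rate_hz

print(samples_to_seconds(1489))  # 14.89 s of exclusive CPU time
```

Raising the sampling rate gives finer attribution at the cost of more overhead; lowering it does the opposite.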
24. [Screenshot: cuda experiment GUI view, showing the process control panel, StatsPanel, and ManageProcessesPanel for the CUDA trace results.]

8.3.3 NVIDIA CUDA Tracing (cuda) experiment performance data viewing with CLI

To launch the CLI on any experiment use: openss -cli -f <database name>

Here we show a trace view of the output from the osscuda experiment run. Note: the -f CUDA is required due to the fact this is a prototype; this restriction will be removed in the future. This trace shows the actions taken during the execution of the CUDA application matmul on the Titan Cray platform at ORNL.

openss>>expview -v trace -f CUDA

   Start Time(d:h:m:s)       Exclusive   % of      Call Stack Function (defining location)
                             I/O Call    Total
                             Time(ms)
2013/08/21 18:31:21.611      11.172864   1.061071  >>>>>copy 64 MB from host to device (CUDA)
2013/08/21 18:31:21.622       0.371616   0.035292  >>>>>copy 2.1 MB from host to device (CUDA)
2013/08/21 18:31:21.623       0.004608   0.000438  >>>>>cop
in PAPI. Because of this, there are situations where PAPI_FP_INS may produce fewer floating point counts than expected. In this example, PAPI_FP_OPS was multiplied by 2 to match the theoretical expected FLOP count. The formula for calculating Load Instructions was: 2 vectors * vec_length (loop) * bytes_per_word (8) * bits_per_byte / 128 bits_per_load.

What can the Hardware Counter Metrics tell us about the code performance? The set of useful metrics that can be calculated for functions are:

FLOPS / Memory Ops (FMO): We would like this to be large, which would imply good data locality. Also called Computational Intensity, or Ops/Refs.

FLOPS / Cycle (FPC): Large values for floating point intensive codes suggest efficient CPU utilization.

Instructions / Cycle (IPC): Large values suggest good balance with minimal stalls.

[Figure: simple kernels sweeping through arrays: y = alpha*x + y, y = A*x + y, DGEMV, and DGEMM.]

The following table shows single-CPU Hardware Counters for simple math kernel codes using the AMD Budapest processor; other useful hwc metrics are also shown. The codes are 3D Fast Fourier Transforms (256x256x256), Matrix Multiplication (500x500), QR Factorization (N=2350), and the HPCCG linear system solver (sparseMV, 100x100x100). The metrics shown per code are Computational Intensity (Ops/Ref), MFLOPS (papi), MFLOPS (code), Percent peak, fpOps/TLB miss, fpOps/D1 cache miss, fpOps/DC_MISS, and Ops/cycle.

6.1 Using the Hardware counter experiments to find bottlenecks

6.1.1 Exampl
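As a quick sketch of how these metrics fall out of raw counter values, the snippet below derives FMO, FPC, and IPC from hypothetical PAPI-style totals (the inputs mirror floating point operations, memory references, total cycles, and total instructions; all numbers are invented for illustration, not measured data):

```python
# Sketch: deriving the hardware-counter metrics described above from raw
# counter totals. All counter values below are invented illustrations.
def derived_metrics(fp_ops, mem_ops, tot_cyc, tot_ins):
    return {
        "computational_intensity": fp_ops / mem_ops,  # FMO: FLOPS / Memory Ops
        "flops_per_cycle": fp_ops / tot_cyc,          # FPC
        "instructions_per_cycle": tot_ins / tot_cyc,  # IPC
    }

# Hypothetical totals for one function of interest:
m = derived_metrics(fp_ops=2.0e9, mem_ops=5.0e8, tot_cyc=4.0e9, tot_ins=6.0e9)
```

A large computational intensity (here 4 FLOPS per memory reference) would suggest good data locality, per the guidance above.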
modified to reduce false sharing (cache-line aligned):

      real*4, dimension(112,100) :: c, d
!$OMP DO SCHEDULE(STATIC, 16)
      do i = 1, 100
        do j = 2, 100
          c(i,j) = c(i,j-1) + d(i,j)
        enddo
      enddo
!$OMP END DO

The two code example sources perform the same computation, but paying careful attention to alignment and independent OMP parallel cache-line chunks can have a big impact. PAPI_L3_EVICTIONS is a good measure of false sharing. [Table: Aligned vs. UnAligned results: 6.5e03 vs. 2.4e02, 1583 vs. 1422, performance penalty 175 vs. 474.]

6.1.3 Example 3 on use of PAPI: size of TLB importance (PAPI_TLB_DM), Sandia's CTH performance. Looking through performance counters, the average per-processor PAPI counter that increases the most among all the profiled functions with scale is PAPI_TLB_DM, registered under MPI. By relinking the executable with -lhugetlbfs, setting HUGETLB_MORECORE=yes, and re-running the application with aprun -m500hs, we see a significant improvement in application execution speed.

[Chart: CTH growth in MPI time.] The chart above shows the CTH execution time over the number of cores (MPI tasks).

[Chart: CTH intra-node scaling: CTH Weak, CTH Strong, CTH Ideal, plotted over # of cores (2-16).] The chart above shows the CTH application intra-node scaling, contrasting weak and strong scaling with the ideal per
27. on your applications This manual intends to give users an understanding of the general experiments available in Open SpeedShop that can be used to analyze application code There is extensive information provided about how to use the Open SpeedShop experiments and how to view the performance information in informative ways Hopefully this will allow users to start optimizing and analyzing the performance of application code Open SpeedShop is a community effort by The Krell Institute with current direct funding from the Department of Energy s National Nuclear Security Administration DOE NNSA It builds on a broad list of community provided infrastructures notably the Paradyn Project s Dyninst API and MRNet Multicast Reduction Network from the University of Wisconsin at Madison the Libmonitor profiling tool and the Performance Application Programming Interface PAPI from the University of Tennessee at Knoxville Open SpeedShop is an open source multi platform Linux performance tool which is targeted to support performance analysis of applications running on both single node and large scale IA64 IA32 EM64T AMD64 PPC ARM Blue Gene and Cray platforms Open SpeedShop is explicitly designed with usability in mind and is for application developers and computer scientists The base functionality includes e Sampling Experiments Support for Call Stack Analysis Hardware Performance Counters MPI Profiling and Tracing I O Profiling and T
28. or smg_hwc_cmp txt osscompare smg2000 pcsamp openss smg2000 pcsamp 1 openss oname mar2013_pcsamp_cmp This example will generate comparison files named using the specified oname specification 8 rw rw r 1 jeg jeg 4475 Mar 11 15 53 mar2013_pcsamp_cmp compare csv 8 rw rw r 1 jeg jeg 4841 Mar 11 15 53 mar2013_pcsamp_cmp compare txt 10 1 4 osscompare view type or granularity argument osscompare allows an optional view type argument It represents the granularity of the view Open SpeedShop allows for viewing performance data at three levels linked object level function level and at the statement level osscompare will produce output at one of those levels based on the view type argument where viewtype lt functions statements linkedobjects gt is defined as follows functions View type granularity is per function statements View type granularity is per statement linkedobjects View type granularity is per library linked object This example will produce a side by side comparison for the statement level not the default function level So this example will compare statement performance values in each of the two databases and produce a side by side comparison showing how each statement in the application differed from the two experiments osscompare smg2000 pcsamp openss smg2000 pcsamp 1 openss viewtype statements 11 Open SpeedShop User Interfaces Throughout this manual we have been using the Open Spee
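The side-by-side merge that osscompare performs can be sketched as follows. This models only the default function-level, time-metric comparison, and the per-function times below are invented example values:

```python
# Sketch: the side-by-side comparison osscompare builds from two
# experiment databases, at function-level granularity. The times here
# are invented, not values from real .openss files.
def compare(run1, run2):
    rows = {}
    # Include every function seen in either run; missing entries count as 0.
    for name in sorted(set(run1) | set(run2)):
        t1, t2 = run1.get(name, 0.0), run2.get(name, 0.0)
        rows[name] = (t1, t2, t1 - t2)  # run1 time, run2 time, difference
    return rows

run1 = {"hypre_SMGResidual": 272.12, "hypre_CyclicReduction": 195.00}
run2 = {"hypre_SMGResidual": 250.40, "hypre_CyclicReduction": 197.30}
table = compare(run1, run2)
```

Switching the granularity (functions, statements, or linked objects) changes only which names key the rows; the merge itself is the same.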
29. oss_offline_v2 1u6 lib64 openspeedshop pcsamp rt offline so Resolving symbols for collab usr global tools openspeedshop oss dev x8664 krellroot_v2 1u6 lib64 libmonitor so 0 0 0 Resolving symbols for usr local tools mvapich gnu 1 2 lib shared libmpich so 1 0 Resolving symbols for lib64 libc 2 12 so Resolving symbols for lib64 libpthread 2 12 so Resolving symbols for usr lib64 libpsm_infinipath so 1 14 Resolving symbols for usr lib64 libinfinipath so 4 0 Updating database with symbols Finished openss Restoring and displaying default view for g g24 jeg demos workshop_demos mpi smg2000 test smg2000 pcsamp openss openss The restored experiment identifier is x 1 Exclusive of CPU Function defining location CPUtime Time in seconds 272 1200 34 202 hypre_SMGResidual smg2000 smg_residual c 152 195 0000 24 509 hypre_CyclicReduction smg2000 cyclic_reduction c 757 80 0100 10 056 psm_mg_ipeek libpsm_infinipath so 1 14 70 7600 8 893 ips_ptl_poll libpsm_infinipath so 1 14 16 1300 2 027 hypre_Semilnterp smg2000 semi_interp c 126 15 5600 1 955 _psmi_poll_internal libpsm_infinipath so 1 14 14 2300 1 788 hypre_SemiRestrict smg2000 semi_restrict c 125 6 5700 0 825 hypre_SMGAxpy smg2000 smg_axpy c 27 6 0600 0 761 MPIR_Pack_Hvector libmpich so 1 0 dmpipk c 31 5 9500 0 747 ipath_dwordcpy libinfinipath so 4 0 5 7900 0 727 MPID_DeviceCheck libmpich so 1 0 psmcheck c 35 When the application completes a de
performance data, and typically the data is collected with low overhead, so profiles can provide a good overview of the performance of an application. The disadvantage of using a profile is that you are required to know beforehand how to aggregate the data collected. Also, since profiles provide more of an overview, they omit the performance details of individual events. There could also be an issue where selecting an inappropriate sampling frequency could skew the results of the profile.

Statistical Performance Analysis is a standard profiling technique: it involves interrupting the execution of the application at periodic intervals to record the location of the execution (the Program Counter value). It can also be used to collect additional data like stack traces or hardware counters. Again, the advantage of this method is its low overhead. It is good for getting an overview of the program and finding the hotspots (time-intensive areas) within the program.

4.1 Program Counter Sampling Experiment

The sampling experiments available in Open SpeedShop include Program Counter Sampling, Call Path Profiling, and Hardware Counter. The Program Counter Sampling experiment (osspcsamp) provides approximate CPU time for each line and function in the program. The Call Path Profiling experiment (ossusertime) provides inclusive vs. exclusive CPU time (see section 4.2) and also includes call stacks. There are a number of Hardware Counter experiments: osshwc, osshwctime,
that sample hardware counter overflows, and osshwcsamp, which can periodically sample up to six hardware counter events.

A flat profile will answer the basic question: Where does my code spend its time? This will be displayed as a list of code elements with varying granularity (i.e., statements, functions, and libraries/linked objects), with the time spent at each function. Flat profiling can be done through sampling, which allows us to avoid the overhead of direct measurements. We must ensure we request a sufficient number of samples (sampling rate) to get an accurate result. An example of flat profiling would be running the program counter sampling in Open SpeedShop. We will run the convenience script on our test program smg2000:

> osspcsamp "mpirun -np 256 smg2000 -n 50 50 50"

It is recommended that you compile your code with the -g option in order to see the statements in the sampling. The pcsamp experiment also takes a sampling frequency as an optional parameter; the available parameters are high (200 samples per second) and low (50 samples per second), and the default value is 100 samples per second. If we wanted to run the same experiment with the high sampling rate, we would simply issue the command:

> osspcsamp "mpirun -np 256 smg2000 -n 50 50 50" high

We can view the results of this flat profile in the Open SpeedShop GUI by using the openss -f <database filename> command.

[Screenshot: Open SpeedShop GUI displaying the flat profile results.]
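A small sketch of the arithmetic behind the rate choice: the number of samples pcsamp can collect is simply the rate times the run time, so a short run may need the high rate to gather enough samples for an accurate profile. The rates follow the text above; the 120-second run time is an invented example:

```python
# Sketch: expected sample counts for the pcsamp rates described above.
# low = 50, default = 100, high = 200 samples per second.
RATES = {"low": 50, "default": 100, "high": 200}

def expected_samples(runtime_seconds, rate="default"):
    """Approximate number of PC samples collected over the whole run."""
    return RATES[rate] * runtime_seconds

# A 120-second run at the high rate yields about 24000 samples:
n = expected_samples(120, "high")
```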
to CPU time, and it may not be useful to profile below this layer. If the inclusive time is significantly greater than the exclusive time, then you should focus your attention on the execution times of the children. The stack trace views in Open SpeedShop are similar to the well-known Unix profiling tool gprof.

[Screenshot: Stats Panel call path view showing hot call paths, including hypre_SMGSolve (smg2000: smg_solve.c), hypre_SMGRelax (smg2000: smg_relax.c, 225), hypre_FinalizeIndtComputations (smg2000: computation.c, 997), and hypre_InitializeIndtComputations (smg2000: computation.c, 372).]

5 How to Relate Data to Architectural Properties

So far we have been focusing mostly on timing. Timing information shows where your code spends its time by displaying hot functions, statements, libraries, and hot call paths, but it doesn't show you why it is spending so much time in those areas. You need to know whether the computationally intensive parts of the code are as efficient as they can be, to reduce the time spent there, or whether there are resources that are constraining the execution of the code. These answers can be very platform dependent. Areas of bottlenecks can differ from system to system, and portability issues can cause a drop in performance. There may be a need to tune your code based on the architectural parameters of the system. In order to do this, we will investigate the interaction between the application and the hardware to make sure there is an efficient use of hardware
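The inclusive-versus-exclusive guidance above can be made concrete with a small sketch (the times are invented): exclusive time is inclusive time minus the children's inclusive time, so a small remainder means the children dominate and deserve the attention:

```python
# Sketch: relating inclusive and exclusive time in a call tree.
# Inclusive time of a function includes its children; exclusive does not.
def exclusive_time(inclusive, children_inclusive):
    """Time spent directly in the function itself."""
    return inclusive - sum(children_inclusive)

# A routine with 10 s inclusive time whose children account for 9.5 s
# spends only 0.5 s itself, so per the guidance above we should focus
# on the children's execution times.
excl = exclusive_time(10.0, [6.0, 3.5])  # 0.5 s of direct work
```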
user were to do: expview -m min, max, average. The view is a representation of the minimum, maximum, and average time values per individual call to the corresponding MPI functions. The average time reported is the total amount of time for all the calls to a function divided by the total number of calls; thus it is the average time that each individual call spends in the function. As such, it is comparable to the Max (maximum) and Min (minimum) of a call to the function in the same min/max/average report.

Alternatively, if a user does expview -m ThreadMin, ThreadMax, ThreadAve, then the Max, Min, and Average in the report are related back to the individual ranks. Another way of saying it is: the average is the total amount of time for all the calls to a function divided by the total number of ranks; thus it is the average time that each rank spends in the function. As such, it is comparable to the Max and Min of a rank in the same report.

If the number of ranks is the same as the number of calls, the two different calculations should produce the same result. This would be true if all the calls were in a single thread, or if there were one call in each rank, as there is for MPI_Init.

The expview -m min, max, average view can expose load imbalance by showing when the minimum and maximum time for asynchronous MPI functions have large differences. This situation indicates that some of the MPI asynchronous
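A sketch of the two averages described above, using invented call times: the per-call average divides by the number of calls, while the per-rank (ThreadAve-style) average divides by the number of ranks, and the two agree when there is exactly one call per rank:

```python
# Sketch: the two averaging schemes described above, with invented times.
def per_call_average(call_times):
    """Total time over all calls divided by the number of calls."""
    return sum(call_times) / len(call_times)

def per_rank_average(call_times, num_ranks):
    """Total time over all calls divided by the number of ranks."""
    return sum(call_times) / num_ranks

times = [2.0, 4.0, 6.0, 8.0]           # four calls spread over two ranks
per_call = per_call_average(times)      # divides by 4 calls
per_rank = per_rank_average(times, 2)   # divides by 2 ranks

# With one call per rank (as for MPI_Init) the two results agree:
init_times = [1.0, 3.0]
assert per_call_average(init_times) == per_rank_average(init_times, 2)
```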
views are available, including the hot path (h).

hwc: Hardware events (including clock cycles, graduated instructions, instruction and data cache and TLB misses, and floating point operations) are counted at the machine instruction, source line, and function levels.

hwcsamp: Similar to hwc, except that sampling is based on time, not PAPI event overflows. Up to six events may be sampled during the same experiment.

hwctime: Similar to hwc, except that call path sampling is also included.

io: Accumulated wall-clock durations of input/output (I/O) system calls: read, readv, write, writev, open, close, dup, pipe, creat, and others. Shows call paths for each unique I/O call path.

iop: Lightweight I/O profiling. Accumulated wall-clock durations of I/O system calls (read, readv, write, writev, open, close, dup, pipe, creat, and others), but individual call information is not recorded.

iot: Similar to io, except that more information is gathered, such as bytes moved, file names, etc.

mpi: Captures the time spent in, and the number of times, each MPI function is called. Shows call paths for each unique MPI call path.

mpip: Lightweight MPI profiling. Captures the time spent in, and the number of times, each MPI function is called. Shows call paths for each unique MPI call path, but individual call information is not recorded.

mpit: Records each MPI function call event with specific data for display using a GUI or a command line interface (CLI). Trace format option displays t
[Screenshot: Stats Panel I/O trace view listing the I/O functions called by the application, with columns for the number of calls and the time spent per function.]

Here the user has chosen the C+ view icon, and the Stats Panel now shows all the call paths in the user's application. This view shows every possible call path through the source to all the I/O functions that were called during the execution of this application. From this one could validate that this is expected behavior and, if not, find where the I/O in this application is not behaving as expected.

[Screenshot: Open SpeedShop on rzmerl156 showing I/O call paths from __libc_start_main through monitor_main, main, TestIoSys (IOR.c, 216), IOR_Create_POSIX and IOR_Close_POSIX (aiori-POSIX.c), down to the open and close wrappers.]

This view is the load balance view, which gives the min, max, and average values for the I/O function call time across all the ranks in this application. In this view we are seeing some wide ranges between the min and max values for some of the I/O functions. It may be useful to see if we can identify the ranks by using the Cluster Analysis view.

[Screenshot: Open SpeedShop on rzmerl156 showing the cluster analysis selection.]
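The load balance summary can be sketched as follows (the rank times are invented); a max-to-average ratio well above 1.0 is the kind of imbalance that makes the Cluster Analysis view worth a look:

```python
# Sketch: the min/max/average-across-ranks summary the load balance view
# reports, plus a simple imbalance ratio. Rank times are invented.
def load_balance(rank_times):
    lo, hi = min(rank_times), max(rank_times)
    avg = sum(rank_times) / len(rank_times)
    return {"min": lo, "max": hi, "avg": avg, "imbalance": hi / avg}

# Three well-behaved ranks and one straggler:
stats = load_balance([10.0, 11.0, 10.5, 24.5])
# A max/avg ratio well above 1.0 flags ranks worth isolating with the
# cluster-analysis view.
```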
0 times a second. This experiment provides a low-overhead overview of the time distribution for the application. Its lightweight overview provides a good first step for analyzing the performance of an application.

The Call Path Profiling (usertime) experiment gathers both the PC sampling information and also records call stacks for each sample. This allows the later display of the call path information about the application, as well as inclusive and exclusive timing data (see section 4.2). This experiment is used to find hot call paths (call paths that take the most time) and see who is calling whom.

The Hardware Counter experiments (hwc, hwctime, hwcsamp) access data like cache and TLB misses. The experiments hwc and hwctime sample hardware counter events based on an event threshold. The default event is PAPI_TOT_CYC overflows. Please see chapter 5 for more information on PAPI and hardware counter related experiments. Instead of using a threshold, the hwcsamp experiment samples up to six events based on a sample time, similar to the usertime and pcsamp experiments. The hwcsamp experiment default events are PAPI_FP_OPS and PAPI_TOT_CYC.

3.2.4 Tracing Experiments Descriptions

The Input/Output tracing and profiling experiments (io, iot, iop), the MPI Tracing Experiments (mpi, mpip, mpit, mpiotf), Memory tracing (mem), POSIX thread tracing (pthread), and the Floating Point Exception Tracing (fpe) all use a form of tracing or wrapping of the function nam
37. 000 n 50 50 50 2 Additional arguments default event PAPI_TOT_CYC threshold 10000 lt PAPI_event gt PAPI event namek lt PAPI threshold gt PAPI integer threshold 15 8 osshwcsamp HWC Experiment General form osshwcsamp lt command gt lt args gt default lt PAPI_event_list gt lt sampling_rate gt Sequential job example osshwcsamp smg2000 n 50 50 50 21 Parallel job examples P osshwcsamp mpirun np 128 smg2000 n 50 50 50 P osshwcsamp srun N 32 n 128 sweep3d mpi PAPI_L1_DCM PAPI_L1_DCA 200 Additional arguments default events PAPI_TOT_CYC and PAPI_FP_OPS sampling_rate is 100 lt PAPI_event_list gt Comma separated PAPI event list2i lt sampling_rate gt Integer value sampling rate 15 9 ossio ossiot I O Experiments General form ossio t lt command gt lt args gt default f_t_list R Sequential job example P ossio t smg2000 n 50 50 50 2 127 Parallel job example R ossio t mpirun np 128 smg2000 n 50 50 50 2 Additional arguments default trace all I O functions lt f_t_list gt Comma separated list of I O functions to trace one or more of the following close creat creat64 dup dup2 Iseek lseek64 open open64 pipe pread pread64 pwrite pwrite64 read readv write and writev 15 10 ossmpi ossmpip ossmpit MPI Experiments General form ossmpi p t lt mpirun gt lt mpiargs gt lt command gt lt args gt defa
[Screenshot: Stats Panel view of the LU benchmark (lu.C.256) hardware counter data, with rows for functions such as buts_ (buts.f), jacld_ (jacld.f), jacu_ (jacu.f), ssor_ (ssor.f), __GI_memcpy (libc-2.5.so), and exchange_3_ (exchange_3.f).]

Next we see the load balance view based on Linked Objects (libraries).

[Screenshot: load balance view based on Linked Objects, with rows for libmpich.so.1.0 (2382.7900, 71.3395), libc-2.5.so, libmtl4-rdmav2.so, and libpthread-2.5.so.]

Here we see the cluster analysis view based on Linked Objects.

[Screenshot: cluster analysis view based on Linked Objects.]

Here is the pcsamp view of Rank 255 performance data only:

[Screenshot: pcsamp view of Rank 255, with rows for jacld_, buts_, jacu_, pthread_spin_lock (libpthread-2.5.so), odu_test_new_connection (libmpich.so.1.0: odu_user.c, 29), ssor_, __GI_memcpy, exchange_3_, and DeviceCheck (libmpich.so.1.0: viacheck.c, 254).]

Below we examine Rank 255
39. 1 650 f3 mexe m c 24 33 478 7 960 f2 mexe m c 15 17 451 4 150 f1 mexe m c 6 0 084 0 020 work mexe m c 33 To access alternative views in the GUI openss f mexe pcsamp openss loads the database file Then use the GUI toolbar to select desired views or using the CLI openss cli f mexe pcsamp openss to load the database file Then use the expview command options for desired views 15 4 osscompare Compare Database Files General form osscompare lt db_file1 gt lt db_file2 gt lt db_file gt time percent lt other metrics gt rows nn viewtype functions statements linkedobjects gt oname lt csv filename gt i Where 125 lt db_file gt represents an Open SpeedShop database file created by running an Open SpeedShop experiment on an application time percent lt other metrics gt represent the metric that the comparison will use to differentiate the performance information for each experiment database rows nn indicates how many rows of output you want to have listed viewtype functions statements linkedobjects select the granularity of the view output The comparison is either done at the function statement or library view level Function level is the default granularity oname lt csv filename gt Name the output filename when comma separated list output is requested Example osscompare smg run1 openss smg run2 openss P osscompare smg run1
40. 1000 counts expview m papi_l2_tca papi_l2_tcm Header percent of 12_tcm 12_tca Percent papi_l2_tcm papi_l2_tca To examine an example we take the default view expview command and add the capability to add the percentage that each function contributes to the total Add the header by using the Header phrase to create a header for the new data column that is being added The Percent phrase to create the arithmetic expression that divides the PAPI_L1_DCM counts count for each function by the total number of PAPI_L1_DCM counts in the application A_Add count Openss gt gt expview m count Header percent of counts Percent count A_Add count Exclusive percent Function defining location PAPI_L1_DCM of counts Counts 342000000 52 333588 hypre_SMGResidual smg2000 smg_residual c 152 207500000 31 752104 hypre_CyclicReduction smg2000 cyclic_reduction c 757 20500000 3 136955 hypre_Semilnterp smg2000 semi_interp c 126 15000000 2 295333 hypre_SemiRestrict smg2000 semi_restrict c 125 8500000 1 300689 pack_predefined_data libmpi so 0 0 3 7000000 1 071155 unpack_predefined_data libmpi so 0 0 3 Another example this one based in the hwcsamp experiment view shows the ratio between total cache accesses and total cache misses We create a header that is defined by the Header clause openss gt gt expview m papi_ 2_tca papi_l2_tcm Header percent of 12_tcm 12_tca Percent papi_l2_tcm papi_l2_tca papi_l2_t
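The arithmetic behind the Percent clause above is a simple share-of-total. The sketch below reproduces it for two of the functions from the example view; with only these two functions included, the percentages differ from the full view, which divides by the total over all functions in the application:

```python
# Sketch: the share-of-total computation behind the "Percent" metric
# expression above: each function's counter value divided by the sum
# over all functions, times 100. Counts taken from the example view.
def percent_of_total(counts):
    total = sum(counts.values())
    return {name: 100.0 * c / total for name, c in counts.items()}

counts = {
    "hypre_SMGResidual": 342000000,
    "hypre_CyclicReduction": 207500000,
}
pct = percent_of_total(counts)
```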
156820311 153053193 757322217 579127274 535841812 502551336 476065093 471343480 269841454 1507685061 215787794 149803306 146967548
20159725  PAMI::Interface::Context<PAMI::Context>::advance
14647999  LapiImpl::Context::Advance<true,true,false>
11563657  __libc_enable_asynccancel (libc-2.12.so)
12757207  _lapi_dispatcher<false> (libpami.so)
 9649598  LapiImpl::Context::TryLock<true,true,false>
 6436257  __libc_disable_asynccancel (libc-2.12.so)
 4697170  udp_read_callback (libpamiudp.so: lapi_udp.c, 538)
 9619348  __intel_ssse3_rep_memcpy (libirc.so)
 5879517  _lapi_shm_dispatcher (libpami.so: lapi_shm.c, 2283)
 3979337  LapiImpl::Context::CheckContext (libpami.so)
 3167039  LapiImpl::Context::Unlock<true,true,false>

5.2 Hardware Counter Experiment (hwc)

As an example, we will run the osshwc experiment on our test program smg2000. The convenience script for this experiment is:

> osshwc "mpirun -np 256 smg2000 -n 50 50 50" <counter> <threshold>

This is the same syntax as the osshwctime experiment. Note: if your output is empty, try lowering the <threshold> value (it is calculated by Open SpeedShop by default). You can try lowering the threshold value if there have not been enough PAPI event occurrences to record. Also see the HINT in the osshwcsamp section above: you can run osshwcsamp and use a formula to create a reasonable threshold. Any counter reported by papi_avail
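One way to make that HINT concrete: take a total event count from an osshwcsamp run and divide by the number of overflow samples you would like the hwc experiment to collect. The target sampling rate below is an assumption for illustration, not a tool default:

```python
# Sketch: deriving an osshwc <threshold> from an osshwcsamp total, per
# the HINT above. The target rate of 100 samples/sec is an assumption.
def suggest_threshold(total_events, runtime_seconds, target_samples_per_second=100):
    """Threshold = total events / desired number of overflow samples."""
    desired_samples = runtime_seconds * target_samples_per_second
    return max(1, int(total_events / desired_samples))

# 6.4e10 total cycles over a 60-second run, aiming for ~100 samples/sec:
t = suggest_threshold(6.4e10, 60)
```

A threshold that is too high yields too few samples (empty output); too low inflates overhead, so this kind of estimate is just a starting point.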
42. 30000 1 263127 287313329 3994016 291307345 281053971 4763152 libpamiudp so 22 250000 1 043616 1049603690 9037920 1058641610 1033650896 11422120 libpthread 2 12 so 1 440000 0 067542 72649683 620083 73269766 71327993 1007704 libmpich so 3 3 0 020000 0 000938 1286256 23770 1310026 1232178 5222 d 2 12 so 0 010000 0 000469 327 394 721 313 13 librt 2 12 so 2132 010000 100 000000 63623580643 574253745 64197834388 62463347297 881674029 Report Summary openss gt gt 5 1 3 5 osshwcsamp experiment only the hwcsamp PAPI events CLI view The m allEvents option prints only the PAPI event values and not the program counter sampling exclusive time and percentage values openss cli f L1 64PE sweep3d mpi hwcsamp openss openss gt gt openss The restored experiment identifier is x 1 openss gt gt expview m allEvents papi_l1_dcm papi_l1_icm papi_l1_tcm papi_l1_ldm papi_ll_stm Function defining location 8764235914 8396159476 196649065 _libc_poll libc 2 12 so 46691996441 367096209 47059092650 46247555479 281624221 sweep sweep3d mpi sweep f 2 8646497071 117738843 782716992 10680760 597583047 8038242 550761926 7569975 518605433 6979361 lapi_dispatcher c 57 488545916 6784192 479947719 6732551 275998769 3888499 1522697263 12118336 223197680 3086626 154744623 2075688 CheckParam cpp 21 151052863 2000330 libpami so Context h 204 793397752 605621289 558331901 525584794 495330108 486680270 279887268 1534815599 226284306
43. 4000 59 797 96400082 MPL inte libmonitor so 0 0 0 pmpi c 94 5 887000 PMPINGinaliz vlibmonitor so 0 0 0 pmpi c 223 4 701000 MPI_lrecy Wympieh so 1 0 irecv c 48 1 221000 MPIt_Bcast l PAu beast c 81 0 396000 MPI_Barrier libmpich RRO barrier c gt Allreduce libmpich so 1 0 allreduce c 353 810000 MPI_Wailt libmpich so 1 0 wait c 51 i DOO PMPI_Finalize libmonitor so 0 0 0 pmpi c 8 903000 MPI_Allreduce libmpich so 1 0 allreduce c 59 995000 MPI_irecv libmpich so 1 0 irecv c 48 0 438000 MPI_Barrier libmpich so 1 0 barrier c 56 0 076000 MPI_Bcast libmpich so 1 0 bcast c 81 Next we see the hot call paths for MPI_Wait on Rank 255 openss gt gt expview f 255 vcalltrees fullstack f MPI_Wait Exclusive MPI Call of Total Number of Calls Call Stack Function defining location Time ms gt gt gt gt main lu C 256 gt gt gt gt gt 140 in MAIN lu C 256 lu f 46 gt gt gt gt gt gt 180 in ssor_ hu C 256 ssor f 4 gt gt gt gt gt gt gt 213 in rhs_ hu C 256 rhs f 5 gt gt gt gt gt gt gt gt 224 in exchange_3_ lu C 256 exchange_3 f 5 gt gt gt gt gt gt gt gt gt 893 in mpi_wait_ mpi mvapich rt offline so wrappers fortran c 893 gt gt gt gt gt gt gt gt gt gt 88I in mpi_wait mvapich rt offline so wrappers fortran c 885 6010 978000 3 87 250 gt gt gt gt gt gt gt gt gt gt gt 51 in MPI Wek ierbich so 1 0 wait c 51
44. 5 further but this time using the load balance view in the Command Line Interface for Open SpeedShop openss gt gt expview m loadbalance Max MPI Call Time Rank of Max Min MPI Call Time RankofMin Average MPI Call Function defining location Across Ranks ms Across Ranks ms Time Across Ranks ms 150332 97 0 120351 97 36 131361 13 MPI_Recv libmpich so 1 0 recv c 60 17636 11 36 1103 53 0 5443 08 MPI_Send libmpich so 1 0 send c 65 16470 53 19 353 81 0 5255 33 MPI_Wait libmpich so 1 0 wait c 51 3206 45 255 3 00 17 2000 27 MPI_Allreduce libmpich so 1 0 allreduce c 59 915 17 Minit tibmonitor so 0 0 0 pmpi c 94 16 00 48 _ Finalize libmonitor so 0 0 0 pmpl c 223 9 28 230 MPI1_Irecvy libmpich so 1 0 irecv c 48 1 22 247 0 07 10 MPI_Bcast libmpich so 1 0 bcast c 81 0 51 0 41 MPI_Barrier libmpich so 1 0 barrier c 56 openss gt gt Here we look at the difference between Rank 255 and Rank 0 77 openss gt gt expview r 255 m exclusive_time openss gt gt expview r 0 m exclusive_time Exclusive MPI Call Function defining location Exclusive MPI Call Function defining location Time ms Time ms 138790 370000 MP1_Recv libmpich so 1 0 reev c 60 150332 974000 MPI_Recv libmpich so 1 0 recv c 60 8841 088000 MPI_Wait librnpich so 1 0 walt c 51 1103 539000 MPI_Send lbmpich so 1 0 send c 65 3337 7370 MPI_Send libmpich so 1 0 send c 65 807 433000 PMP1_Init libmonitor s0 0 0 0 pmpi c 94 3206 45
[Screenshot: Manage Processes panel listing the pthread identifiers for the hybrid application's ranks and their OpenMP threads, alongside Stats Panel rows for _solve_ and related functions.]

After clearing the specific rank and/or thread selections, we can click the LB (load balance) icon, and Open SpeedShop will display the min, max, and average values across all the ranks in the hybrid code. This helps decide if there is imbalance across the ranks of the hybrid application. We can focus on individual ranks to see the balance across the OpenMP threads that are in an individual rank (next example image).

[Screenshot: load balance view across all ranks of the hybrid application.]

Here we used the Manage Process panel "Focus on selected rank and underlying threads" menu options to view the load balance across the 4 OpenMP threads for the rank 0 process.

[Screenshot: load balance view across the 4 OpenMP threads of the rank 0 process.]

Please also explore the various options offered v
[Screenshot: min/max/average view of the time spent in each I/O function (__libc_open, read, close, lseek in libpthread-2.11.2.so), with the call paths through TestIoSys (IOR.c), IOR_Create_POSIX, IOR_Xfer_POSIX, WriteOrRead, and IOR_Close_POSIX (aiori-POSIX.c).]

This image shows the min, max, and average time spent in each of the I/O functions, showing the rank of the minimum value and the rank of the maximum value for each of the I/O functions. This view indicates whether there is an imbalance relative to the I/O in the application being run. This may or may not be expected.

[Screenshot: Open SpeedShop load balance view of the I/O functions (open, read, close, lseek) across the application's ranks.]

7.4.3 I/O Profiling (iop) experiment performance data viewing with CLI

To launch the CLI on any experiment, use: openss -cli -f <database name>
B NRHS P Q Fact SolveTime Error Residual
WALL 31000 31000 1616144 1842.20 1611.59 4.51E-15 1.45E-11
DEPS = 1.110223024625157E-016
sum(xsol_i)             = 30999.9999999873         0.000000000000000E+000
sum(xsol_i - x_i)       = 3.332285336962339E-006   0.000000000000000E+000
sum(xsol_i - x_i)/M     = 1.074930753858819E-010   0.000000000000000E+000
sum(xsol_i - x_i)/M/eps = 968211.548505533         0.000000000000000E+000

From the output of two separate runs using Lustre and NFS:

LU Fact time with Lustre: 1842 secs
LU Fact time with NFS: 2655 secs

From the final times we see there is an 813-second penalty (more than 30%) if you do not use a parallel file system like Lustre. Most of the run time difference, about 605 of the 813 seconds (75%), is I/O (1360.99 vs. 847.7).

[Table: NFS Run vs. Lustre Run: Min t(sec), Max t(sec), Avg t(sec) per I/O function call (libpthread-2.5.so entries).]

7.2 Lustre Striping Commands

To set or get the Lustre file system (lfs) striping information, you can use the following commands:

> lfs setstripe -s <size>[k|M|G] -c <count | -1 (all)> -i <index | -1 (round robin)> <file|directory>

Typical defaults for setstripe are -s 1M -c 4 -i -1, usually good to try first. File striping is set upon file create.

> lfs getstripe <file|directory>

Example for getstripe:

> lfs getstripe -verbose oss_lfs_strip_16 | grep stripe_count
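A sketch of how striping lays file data out: with stripe size s and stripe count c, the byte at file offset b lands on stripe index (b // s) % c. The values below mirror the -s 1M -c 4 defaults mentioned above:

```python
# Sketch: mapping a file offset to its Lustre stripe index, assuming
# stripe size 1 MB and stripe count 4 (the defaults suggested above).
def stripe_index(offset, stripe_size=1 << 20, stripe_count=4):
    """Which of the stripe_count objects holds the byte at this offset."""
    return (offset // stripe_size) % stripe_count

# The first four 1 MB chunks land on stripe objects 0..3, then wrap:
layout = [stripe_index(i * (1 << 20)) for i in range(5)]  # [0, 1, 2, 3, 0]
```

This round-robin layout is why striping spreads large sequential I/O across multiple OSTs, which is the source of the Lustre advantage measured above.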
Column Name: Column Definition
I/O Call Time: Aggregated total exclusive time spent in the I/O function corresponding to this row of data.
% of I/O Total Time: Percentage of exclusive time relative to the total time spent in the I/O function corresponding to this row of data.
Number of Calls: Total number of calls to the I/O function corresponding to this row of data.
Min Bytes Count: The number of times the minimum bytes read or written by the corresponding I/O function occurred during this experiment.
Min Bytes Read or Written: The minimum number of bytes that were read or written by the corresponding I/O function.
Max Bytes Count: The number of times the maximum bytes read or written by the corresponding I/O function occurred during this experiment.
Max Bytes Read or Written: The maximum number of bytes that were read or written by the corresponding I/O function.
Total Bytes Read or Written: The total number of bytes read or written by the corresponding function. This number only represents the totals for the number of bytes read or written based on the I/O function called.

[Screenshot: Stats Panel default view of the I/O experiment, with columns for I/O call time, % of total, number of calls, and min/max/total bytes per I/O function.]
49. I/O Base Tracing (io) experiment

The base I/O tracing (io) experiment gathers data for the following I/O functions: close, creat, creat64, dup, dup2, lseek, lseek64, open, open64, pipe, pread, pread64, pwrite, pwrite64, read, readv, write, and writev. It is a trace-type experiment that wraps the real I/O calls and records information before and after calling the real I/O functions. This base I/O experiment records the basic I/O information as stated in the introductory section, but does not record the arguments to each call; that is done in the extended (iot) experiment.

7.3.1.1 I/O Base Tracing (io) experiment performance data gathering

The base I/O tracing (io) experiment convenience script is ossio. Use this convenience script in this manner to gather base I/O tracing performance data:

> ossio "how you normally run your application" [<list of I/O function(s)>]

The following is an example of how to gather data for the IOR application on a Linux cluster platform using the ossio convenience script. It gathers performance data for all the I/O functions because there is no list of I/O functions specified after the quoted application run command:

> ossio "srun -n 512 IOR"

7.3.1.2 I/O Base Tracing (io) experiment performance data viewing with CLI

To launch the CLI on any experiment use: openss -cli -f <database name>

7.3.1.3 I/O Base Tracing (io) experiment performance data viewing with GUI

To launch the GUI on any e
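Putting the gather and view steps together, a typical session might look like the sketch below. The database file name IOR-io-0.openss is an assumption for illustration; Open|SpeedShop derives the actual name from the executable and experiment type.

```shell
# Gather base I/O tracing data for all wrapped I/O functions
ossio "srun -n 512 IOR"

# View the results in the CLI (database file name is illustrative)
openss -cli -f IOR-io-0.openss
# openss>> expview
```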
50. 3 Introduction to Open|SpeedShop ............................................. 14
3.1 Basic Concepts / Interface Workflow .......................................... 14
3.1.1 Common Terminology ......................................................... 15
3.1.2 Concept of an Experiment ................................................... 16
3.2 Performance Experiments Overview ............................................. 16
3.2.1 Individual Experiment Descriptions ......................................... 16
3.2.3 Sampling Experiments Descriptions .......................................... 18
3.2.4 Tracing Experiments Descriptions ........................................... 18
3.2.5 Parallel Experiment Support ................................................ 19
3.3 Running an Experiment ........................................................ 20
4 How to Gather and Understand Profiles .......................................... 25
4.1 Program Counter Sampling Experiment .......................................... 25
4.2 Call Path Profiling (usertime) Experiment .................................... 26
5 How to Relate Data to Architectural Properties ................................. 30
5.1 Hardw
51. Last-level cache: LAST_LEVEL_CACHE_REFERENCES
Local/nonlocal memory access: MEM_UNCORE_RETIRED:LOCAL_DRAM, READ_REQUEST_TO_L3_CACHE:ALL_CORES
L3 cache: L3_CACHE_MISSES:ALL_CORES

When selecting PAPI events, you must determine if they are a valid combination. In general, combinations that are valid will pass the test:

> papi_event_chooser PRESET event1 event2 ... eventN

The output for a valid combination will contain: event_chooser.c PASSED

Here is an example using PAPI to check if a three-event combination is valid:

> papi_event_chooser PRESET PAPI_FP_INS PAPI_LD_INS PAPI_SR_INS
PAPI Version: 4.1.2.1
Vendor string and code: GenuineIntel (1)
Model string and code: Intel Nehalem (21)
CPU Revision: 5.000000
...
PAPI_VEC_SP 0x80000069 No Single precision vector/SIMD instructions
PAPI_VEC_DP 0x8000006a No Double precision vector/SIMD instructions
Total events reported: 44
event_chooser.c PASSED

Below shows the output of the osshwcsamp experiment with the counters for Total Cycles and Floating Point Operations:

[Screenshot: osshwcsamp view for smg2000 showing Total Cycles and Floating Point Operations per function, including hypre_SMGResidual (smg2000: smg_residual.c,152) and hypre_CyclicReduction (smg2000: cyclic_reduction.c,757)]
52. MI::Interface::Context<PAMI::Context>::advance (libpami.so: ContextInterface.h,158)
55.750000 2.614903 597583047 8038242 605621289 579127274 14647999 LapiImpl::Context::Advance<true,true,false> (libpami.so: Context.h,220)
52.970000 2.484510 550761926 7569975 558331901 535841812 11563657 __libc_enable_asynccancel (libc-2.12.so)
49.850000 2.338169 518605433 6979361 525584794 502551336 12757207 _lapi_dispatcher<false> (libpami.so: lapi_dispatcher.c,57)
48.080000 2.255149 488545916 6784192 495330108 476065093 9649598 LapiImpl::Context::TryLock<true,true,false> (libpami.so: Context.h,198)
47.750000 2.239671 479947719 6732551 486680270 471343480 6436257 __libc_disable_asynccancel (libc-2.12.so)
26.680000 1.251401 275998769 3888499 279887268 269841454 4697170 udp_read_callback (libpamiudp.so: lapi_udp.c,538)
25.880000 1.213878 1522697263 12118336 1534815599 1507685061 9619348 __intel_ssse3_rep_memcpy (libirc.so)
21.960000 1.030014 223197680 3086626 226284306 215787794 5879517 _lapi_shm_dispatcher (libpami.so: lapi_shm.c,2283)
14.910000 0.699340 154744623 2075688 156820311 149803306 3979337 LapiImpl::Context::CheckContext (libpami.so: CheckParam.cpp,21)
13.990000 0.656188 151052863 2000330 153053193 146967548 3167039 LapiImpl::Context::Unlock<true,true,false> (libpami.so: Context.h,204)

5.1.3.2 osshwcsamp experiment Status command and CLI view

openss>>expstatus
Experiment definition
  ExpId is 1
53. NVIDIA CUDA experiment, use the osscuda convenience script and specify the CUDA application as an argument. If there are no arguments to the application, then no quotes are necessary, but they are placed here for consistency. The osscuda script will run the experiment by running the QTC application and will create an Open|SpeedShop database file with the results of the experiment. Viewing of the performance information can be done with the GUI or CLI.

> osscuda "QTC"

8.3.2 NVIDIA CUDA Tracing (cuda) experiment performance data viewing with GUI

This section shows the default view for the NVIDIA CUDA experiment for the QTC application. Use the following command to open the GUI to see the QTC CUDA experiment performance information. To launch the GUI on any experiment use: openss -f <database name>

> openss -f QTC-cuda.openss

[Screenshot: default CUDA experiment GUI view for QTC, showing the Stats Panel with Exclusive Call Time(ms), % of Total, and Number of Calls per CUDA function]

The view below is the statistics panel and source view panel, showing the relationship of the statistics to the actual source in the program.

[Screenshot: Stats Panel and Source Panel side by side for the QTC CUDA experiment]
54. Open SpeedShop User Manual
October 1, 2015
Version 2.2
Contributions from Krell Institute, LANL, LLNL, SNL

New Features developed and documented for FY2014-FY2015 Development

This brief section outlines the new features and updates to Open|SpeedShop developed during the last year of development (2014-2015 in this case) and newly documented in this version of the user's guide.

New and Updated Feature list:

The graphical user interface (GUI) caches, and does not regenerate, the GUI Stats Panel and metadata information views within a session. This dramatically speeds up redisplay of the Stats Panel views.

Command Line Interface (CLI) views are now saved in the Open|SpeedShop database as text. The views are thus saved across sessions and allow users to subsequently view the same CLI view. This feature needs permissions to re-write the database. It will re-write older existing database files, saving the views, so be aware of that side effect. This feature is controlled by a preference; see section 11.5.2.1 to enable or disable the save and reuse view feature.

Redesign of output for a number of the Open|SpeedShop GUI and CLI experiment views to provide more application performance information to the user. Previously, more user action, in the form of generating more GUI or CLI views, was required to display the information.

The Open|SpeedShop CBTF version usability and reliability is greatly improved. This version adds several new experiment types to the
55. [Screenshot: Stats Panel showing per-rank performance data for the bt.W application]

In the next GUI view we used the ManageProcess panel to highlight one rank, to show the performance data from all the threads that are executed under that particular rank, in order to see only that performance data in the Stats Panel view.

Note: Use the "focus on selected rank and underlying threads" Manage Process panel option to focus on all the threads within a rank. Right mouse button down on the Manage Process panel tab to see the options.

[Screenshot: Stats Panel showing performance data for all threads under the selected rank]

16.2 Clearing Focus on individual Rank to get back to default behavior

Note: Once you focus on individual or groups of ranks (i.e., venturing away from the default aggregated views), then you need to use the CL (clear auxiliary settings) icon to clear away all the optional selections and get back to looking at the aggregated results again.

[Screenshot: Stats Panel after clearing the focus, showing aggregated results across all ranks and threads again]
56. [Screenshot: MPI tracing event list for smg2000 (host localhost.localdomain, Processes/Ranks/Threads: 2), showing MPI_Waitall calls (libmpi.so: pwaitall) with their 2010-07-13 19:30:5x start and end timestamps and call paths]

8.1.1 MPI Tracing Experiments (mpi, mpit)

8.1.1.1 MPI Tracing Experiments (mpi, mpit) performance data gathering

Much of this information is described above in the main MPI Tracing Experiments section, but for completeness this is the convenience script description for running the MPI-specific tracing experiments:

> ossmpi[t] "srun -N 4 -n 32 smg2000 -n 50 50 50" [ default | <list MPI functions> | <mpi_category> ]

8.1.1.2 MPI Tracing Experiments (mpi, mpit) performance data viewing with GUI

To launch the GUI on any experiment use: openss -f <database name>

8.1.1.3 MPI Tracing Experiments (mpi, mpit) performance data viewing with CLI

To launch the CLI
57. a single hardware counter, and hwcsamp for PC sampling with multiple hardware counters. Both osshwc and osshwctime support non-derived PAPI presets; all non-derived events are reported by papi_avail -a. You can also see the available events by running the experiments osshwc or osshwctime with no arguments. The experiments include all native events for that specific architecture. Some PAPI event names are listed in sections below, but please see the PAPI documentation for the full list.

The threshold you choose depends on the application; you want to balance overhead with accuracy. Remember, a higher threshold will record fewer samples. Rare events need a smaller threshold, or that information may be lost (never triggered and recorded). Frequent events should use a larger threshold to reduce the overhead of collecting the information. Selecting the right threshold can take experience or some trial and error.

HINT: Running the sampling-based hardware counter experiment (osshwcsamp) can help you get an idea for a threshold value to try when running the osshwc and osshwctime experiments, which are threshold based. Since the ideal number of events (threshold) depends on the application and the selected counter, for events other than the default the hwcsamp experiment can be used to get an overview of counter activity.

The default threshold is set to a very large value to match the default event PAPI_TOT_CYC. For all other events it is recom
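The trial-and-error workflow above can be sketched as the following two-step session. The event name and the threshold value 50000 are purely illustrative assumptions; verify available events with papi_avail on your system and adjust the threshold to the observed counter activity.

```shell
# 1. Overview run: sample with multiple counters to gauge event activity
#    (comma-separated PAPI event list; events shown are illustrative)
osshwcsamp "mpirun -np 4 smg2000 -n 60 60 60" PAPI_L1_DCM,PAPI_TOT_CYC

# 2. Threshold-based run: pick a threshold informed by the observed counts
#    (50000 is illustrative; use a smaller value for rare events)
osshwc "mpirun -np 4 smg2000 -n 60 60 60" PAPI_L1_DCM 50000
```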
58. ace between the user interface and the cluster support and dynamic instrumentation components.

Plugin: A portion (library) of the performance tool that can be loaded and included in the tool at tool start-up time. Development of the plugin uses a tool-specific interface (API) so that the plugin and the tool it is to be included in know how to interact with each other. Plugins are normally placed in a specific directory so that the tool knows where to find the plugins.

Target: This is the application, or part of the application, one is running the experiment on. In order to fine-tune what is being targeted, Open|SpeedShop gives target options that describe file names, host names, thread identifiers, rank identifiers, and process identifiers.

3.1.2 Concept of an Experiment

Open|SpeedShop uses the concept of an experiment to describe the gathering of performance measurement data for a particular performance area of interest. Experiments consist of the collector responsible for the gathering of the measurements associated with the performance area of interest. The collector, which is a small dynamic or static object library, also contains functions that can interpret the gathered measurements (i.e., performance data) into a human-understandable form. The experiment definition also includes the application being examined and how often the data will be gathered (the sampling rate). The application's symbol information is saved into the experiment outpu
59. ad or written based on the I/O function called.

I/O Call Time(ms) | % of Total Time | Number of Calls | Call Stack Function (defining location)

1858418.863034  1055603.730633  103350.518692
_start (IOR)
> 562 in __libc_start_main (libmonitor.so.0.0.0: main.c,541)
>> 258 in __libc_start_main (libc-2.12.so: libc-start.c,96)
>>> 517 in monitor_main (libmonitor.so.0.0.0: main.c,492)
>>>> 153 in main (IOR: IOR.c,108)
>>>>> 2021 in TestIoSys (IOR: IOR.c,1848)
>>>>>> 316 in IOR_Close_POSIX (IOR: aiori-POSIX.c,315)
>>>>>>> 766 in close (iot-collector-monitor-mrnet-mpi.so: wrappers.c,685)
61.486298  512  >>>>>>>> 82 in close (libc-2.12.so: syscall-template.S,82)
_start (IOR)
> 562 in __libc_start_main (libmonitor.so.0.0.0: main.c,541)
>> 258 in __libc_start_main (libc-2.12.so: libc-start.c,96)
>>> 517 in monitor_main (libmonitor.so.0.0.0: main.c,492)
>>>> 153 in main (IOR: IOR.c,108)
>>>>> 2173 in TestIoSys (IOR: IOR.c,1848)
>>>>>> 2611 in WriteOrRead (IOR: IOR.c,2562)
>>>>>>> 251 in IOR_Xfer_POSIX (IOR: aiori-POSIX.c,224)
>>>>>>>> 223 in read (iot-collector-monitor-mrnet-mpi.so: wrappers.c,137)
34.924939  2048  >>>>>>>>> 82 in __GI__read (libc-2.12.so: syscall-template.S,82)
60. all threads.

[Screenshot: Aggregated results across all POSIX threads in the job for bt.W; top functions include the x_solve/y_solve/z_solve ._omp_fn.0 routines (bt.W: x_solve.f,45; y_solve.f,43; z_solve.f,43), compute_rhs._omp_fn.0 (bt.W: rhs.f,18), matmul_sub_ (bt.W: solve_subs.f,56), matvec_sub_ (bt.W: solve_subs.f,27), and lhsinit_ (bt.W: initialize.f,225)]

Next we see the load balance view based on functions:

[Screenshot: Load Balance view showing Min, Max, and Average exclusive CPU time across the 4 POSIX threads in the job, per function]

Then we look at a cluster analysis view based on functions:

[Screenshot: Comparative Analysis Report for executables bt.W. The view consists of comparison columns; click on the metadata icon ("I") for details. Column 1: Experiment 1 showing host localhost.localdomain for performance data type pcsamp, functions, using display option ThreadAverage. Column 2: Experiment 1 showing host localhost.localdomain (p 20275) for performance data type pcsamp, functions, using display option ThreadAverage. Columns report Average Exclusive CPU time in seconds across threads, per function]
61. anged. This is useful if you want to track how the performance varies for each new version of an application, or for understanding how a different compiler or compiler options can affect the performance of your application. This also allows you to do scalability tests to see how the performance of your application scales with the number of processors. It's also helpful just to see the progress you have made while tuning your code.

Open|SpeedShop has options to allow you to compare performance data. You can use the Custom Compare Panel (CC icon) in the GUI or the osscompare convenience script:

> osscompare "db1.openss,db2.openss" [options]

This will produce a side-by-side comparison listing; you can compare up to 8 databases at once. You can see the osscompare man page for more details. Below is an example of comparing two different pcsamp experiments on the smg2000 application:

osscompare "smg2000-pcsamp.openss,smg2000-pcsamp-1.openss"
openss: Legend: -c 2 represents smg2000-pcsamp.openss
openss: Legend: -c 4 represents smg2000-pcsamp-1.openss

-c 2, Exclusive CPU   -c 4, Exclusive CPU   Function (defining location)
time in seconds       time in seconds
3.870000000           3.630000000           hypre_SMGResidual (smg2000: smg_residual.c,152)
2.610000000           2.860000000           hypre_CyclicReduction (smg2000: cyclic_reduction.c,757)
2.030000000           0.150000000           opal_progress (libopen-pal.so.0.0.0)
1.330000000           0.100000000           mca_btl_sm_component_progress (libmp
62. are Counter Sampling (hwcsamp) Experiment ..................................... 32
5.1.1 Hardware Counter Sampling (hwcsamp) experiment performance data gathering ... 35
5.1.1.1 Hardware Counter Sampling (hwcsamp) experiment parameters ................. 35
5.1.2 Hardware Counter Sampling (hwcsamp) experiment performance data viewing ..... 35
5.1.2.1 Getting the PAPI counter as the GUI's Source Annotation Metric ............ 35
5.1.2.2 Viewing Hardware Counter Sampling Data with the GUI ....................... 37
5.1.3 Hardware Counter Sampling (hwcsamp) experiment CLI performance data viewing . 38
5.1.3.1 Job script and osshwcsamp command ......................................... 39
5.1.3.2 osshwcsamp experiment default CLI view .................................... 39
5.1.3.2 osshwcsamp experiment Status command and CLI view ......................... 41
5.1.3.3 osshwcsamp experiment Load Balance command and CLI view ................... 41
5.1.3.4 osshwcsamp experiment Linked Object command and CLI view .................. 41
5.1.3.5 osshwcsamp experiment only the hwcsamp PAPI events CLI view ............... 42
5.2 Hardware Counter Exp
63. are some examples of the performance data that can be viewed, and the commands to generate the CLI views:

> openss -cli -f IOR-iop-1.openss
openss: The restored experiment identifier is: -x 1

openss>>expview

Exclusive   Inclusive   % of        Function (defining location)
I/O call    I/O call    Total
times in    times in    Exclusive
seconds     seconds     CPU Time
38297.33    38297.33    96.46       __write (libpthread-2.11.3.so)
741.01      741.01      1.86        open64 (libpthread-2.11.3.so)
598.43      598.43      1.50        read (libpthread-2.11.3.so)
63.38       63.38       0.15        close (libpthread-2.11.3.so)
2.26        2.26        0.01        __lseek64 (libpthread-2.11.3.so)

openss>>expview -v calltrees,fullstack

Exclusive   Inclusive   % of        Call Stack Function (defining location)
I/O call    I/O call    Total
times in    times in    Exclusive
seconds     seconds     CPU Time
TestIoSys (IOR: IOR.c,1848)
> 2608 in WriteOrRead (IOR: IOR.c,2562)
>> 244 in IOR_Xfer_POSIX (IOR: aiori-POSIX.c,224)
38297.33    38297.33    96.46       >>> __write (libpthread-2.11.3.so)
TestIoSys (IOR: IOR.c,1848)
> 2611 in WriteOrRead (IOR: IOR.c,2562)
>> 251 in IOR_Xfer_POSIX (IOR: aiori-POSIX.c,224)
598.43      598.43      1.51        >>> read (libpthread-2.11.3.so)
TestIoSys (IOR: IOR.c,1848)
> 104 in IOR_Create_POSIX (IOR: aiori-POSIX.c,74)
472.14      472.14      1.19        >> open64 (libpthread-2.11.3.so)
TestIoSys (IOR: IOR.c,1848)
> 195 in IOR_Open_POSIX (IOR: aiori-POSIX.c,173)
268.88      268.88      0.68        >> open64 (libpthr
64. aths in execution time.

expview <experiment name><number>: Shows <number> of the functions from the list of the top time-consuming functions. Example: expview pcsamp2 shows the two functions taking the most time.

expview -v statements <experiment name><number>: Shows <number> of the statements from the list of the top time-consuming statements.

Remember: if you want the GUI at any time, just issue the command opengui in the CLI.

11.1.2 CLI Metric Expressions and Derived Types

Open|SpeedShop has the capability to create derived metrics from the gathered metrics by using the metric expression (math) functionality in the command line interface (CLI). One can access the overview from the CLI by typing this help CLI command:

openss>>help metric_expression
************
<metric_expression> =
  <string> |
  <constant> |
  <metric_expression> <constant> <metric_expression>

A user-defined expression that uses metrics to compute a special value for display in a report. User-defined expressions can be added to an <expMetric_list>. A functional notation is used to build the desired expression, and the following simple arithmetic operations are available:

Function (arguments): returns
Uminus (1): unary minus of the argument
Abs (1): absolute value of the argument
Add (2): summation of the arguments
Sub (2): difference of the arguments
Mult (2): product of th
65. ay Choice set of buttons allows users to choose what granularity to use for a particular display. The normal usage scenario is to choose a view choice granularity and then select a view by choosing one of the icons described in the table above. The choices, as shown in the image below, are to see the performance data displayed:

• Per Function: Display the performance information relative to each function in the program that had performance data gathered during the experiment that was run.
• Per Statement: Display the performance information relative to each statement in the program that had performance data gathered during the experiment that was run.
• Per Linked Object: Display the performance information relative to each library or linked object in the program that had performance data gathered during the experiment that was run.
• Per Loop: Display the performance information relative to each loop in the program that had performance data gathered during the experiment that was run. Note that the loop performance information is only shown for loops that actually were executed. There may be loops in the application that will not show up in the display because they were not executed or had minimal time attributed to them.

The image below illustrates that double-clicking on a line of statistical information in the Stats Panel will focus the source panel at the line of source representing the performance information and annotates t
66. aying the statements that took the most time in the application run. For this execution of BT, the statement at line 440 took the most time. By double-clicking on the statement, Open|SpeedShop focuses on the source for that line of the application source and highlights that line.

In the view below we moved the ManageProcess panel tab to the lower panel and split the upper panel using the vertical splitter icon on the far right side of the original upper panel.

Note: Left mouse down and hold on the panel tab, then slide the panel you want to move to another location on the Open|SpeedShop GUI, or off onto other parts of your display.

[Screenshot: rearranged GUI with the ManageProcess panel moved to the lower panel and the upper panel split vertically]

16.1 Focus on individual Rank to get Load Balance for Underlying Threads

In the next view below we used the ManageProcess panel to highlight one rank and an individual thread within the rank, to show only that thread's performance data in the Stats Panel view.

Note: Use the "focus on threads and processes" Manage Process panel option to focus on individual threads within a rank. Right mouse button down on the Manage Process panel tab to see the options.
67. basic performance analysis tool is the Unix time command, which can measure the CPU and wall-clock time for an application. You could also keep track of an application's performance as you vary the input parameters. This type of performance analysis is very simple, but has the disadvantage of the measurements being coarse-grain and not allowing you to pinpoint any performance bottlenecks within the application.

Another performance analysis method is code integration, or instrumentation, of performance probes. This method allows a much finer-grain analysis; however, it can be hard to maintain and requires significant beforehand knowledge of what information to measure and record.

An alternative to the simple-and-coarse-grain or complex-and-fine-grain approach is the use of performance analysis tools. Performance tools enable fine-grain analysis that can be related to the source code and work universally across applications. There are two ways performance analysis tools gather information from applications. One way is through statistical sampling, which periodically interrupts the execution of the program to record its location. Statistical distributions across all locations are reported, and data is typically aggregated over time. Time is the most common metric, but other metrics are possible. Statistical sampling is useful to get an overview of the application's performance, as it provides low and uniform overhead. Event tracing is another way fo
68. by clicking on the CA icon.

[Screenshot: cluster analysis view launched from the CA icon, showing comparison columns in the Stats Panel]

In this view, generated by clicking on the CA icon, we see that Open|SpeedShop has determined that there are four unique groups, where the aggregate time for the groups differs enough to report this to the user. The columns in the Stats Panel display show the times that are reflective of each of the ranks in the group. The information ("I") icon can be used to view which ranks, etc. are included in each of the cluster groups.

[Screenshot: Comparative Analysis Report with one column of Average Exclusive CPU time per function for each cluster group; the metadata icon shows which ranks belong to each group]

10 Advanced Analysis Techniques

Analyzing the results of a single performance experiment can be useful for debugging and tuning your code. But comparing the results of different experiments can show you how the performance of an application has ch
69. ca    papi_l2_tcm   percent of      Function (defining location)
                        l2_tcm/l2_tca
289946516   109226440   37.671237       hypre_SMGResidual (smg2000: smg_residual.c,152)
203463495   74795126    36.760956       hypre_CyclicReduction (smg2000: cyclic_reduction.c,757)
34442810    12746112    37.006597       mca_btl_vader_check_fboxes (libmpi.so.1.4.0: btl_vader_fbox.h,108)
25522126    8311723     32.566734       hypre_SemiInterp (smg2000: semi_interp.c,126)

11.2 CLI Batch Scripting

If you have a known set of commands you want to issue, you can create a plain text file with CLI commands. For example, we create a batch file that will create, run, then view the pcsamp experiment run on the application fred.

Create batch file commands:

> echo expcreate -f fred pcsamp >> input.script
> echo expgo >> input.script
> echo expview pcsamp10 >> input.script

Now to run the batch file input.script we use the -batch option to openss:

> openss -batch < input.script

Note that currently, in this context, this interface is only supported via the online version of Open|SpeedShop, so it must have been built with the OPENSS_INSTRUMENTOR mrnet options.

11.3 Python Scripting

The Open|SpeedShop python API allows users to execute the same interactive/batch commands directly through python. Users can intersperse the normal python code with commands to Open|SpeedShop. Currently this interface is only supported via the online version of Open|SpeedShop.

11.4 MPI_Pcontrol Support

Open
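The same three-line batch file can be built in one step with a heredoc; this part is plain shell and independent of Open|SpeedShop itself (the file name input.script and the application name fred follow the example in the batch scripting section above).

```shell
# Create the CLI batch script in one step
# (equivalent to the three echo commands in the example above)
cat > input.script <<'EOF'
expcreate -f fred pcsamp
expgo
expview pcsamp10
EOF

# Show what will be fed to the CLI
cat input.script

# Then run it through Open|SpeedShop (requires an O|SS install):
#   openss -batch < input.script
```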
70. cation. It is extensible through user-written plug-ins. Open|SpeedShop is also maintained and supported within the Tri-lab clusters, Blue Gene, and Cray platforms run by Lawrence Livermore, Los Alamos, and Sandia National Laboratories. It is also available at a number of other laboratories and businesses around the world. The following sections give a quick overview of what to look for in your performance analysis for different types of applications.

2.1 Sequential Code Performance Analysis

You should identify the most computationally intensive parts of your application. Find out where your application is spending most of its time: in modules or libraries, on particular statements in your code, or within certain functions. Check to make sure the most time is being spent in the computational kernels. Ask yourself if the amount of time that each section takes matches your intuition.

Explore the impact of the memory hierarchy. Check to see if your application has excessive data cache misses. Find out where your data is located. One can also assess the impact of the virtual memory Translation Lookaside Buffer (TLB) misses.

[Diagram: memory hierarchy, showing CPU, L1 cache, and L2 cache levels]

Check the interaction of your application with external resources by checking the efficiency of the I/O and looking at the time spent in system libraries.

2.2 Shared Memory Applications

Shared memory applications have a single shared storage that is a
71. ce information for each of the various performance metric types that Open|SpeedShop supports.

15.1 Suggested Workflow

We recommend an O|SS workflow consisting of two phases: first, gathering the performance data using the convenience scripts; then, using the GUI or CLI to view the data.

15.2 Convenience Scripts

Users are encouraged to use the convenience scripts that hide some of the underlying options for running experiments. The full command syntax can be found in the User's Guide. The script names correspond to the experiment types and are: osspcsamp, ossusertime, osshwc, osshwcsamp, osshwctime, ossio, ossiot, ossmpi, ossmpit, ossmpiotf, ossfpe, plus an osscompare script.

Note: Make sure to set OPENSS_RAWDATA_DIR. See the KEY ENVIRONMENT VARIABLES section for info.

When running Open|SpeedShop, use the same syntax that is used to run the application/executable outside of O|SS, but enclosed in quotes, e.g.:

• Using an MPI with mpirun: osspcsamp "mpirun -np 512 smg2000"
• Using SLURM/srun: osspcsamp "srun -N 64 -n 512 smg2000 -n 5 5 5"
• Redirection to/from files inside quotes can be problematic; see the convenience script man pages for more info.

15.3 Report and Database Creation

Running the pcsamp experiment on the sequential program named mexe:

osspcsamp "mexe"

results in a default report and the creation of a SQLite database file mexe-pcsamp.openss in the current directory. The report:

CPU Time  % CPU time  Function
48.990 1
72. ch

Experiment Control:      Misc Commands:
• expgo                  • help
• expwait                • list
• expdisable             • log
• expenable              • record
                         • playback
Experiment Storage:      • history
• expsave                • quit
• exprestore

The following is a simple example to create, run, and view data from an experiment using the CLI:

> openss -cli                                # Open the CLI
openss>>expcreate -f mutatee 2000 pcsamp     # Create an experiment using pcsamp with this application
openss>>expgo                                # Run the experiment and create the database
openss>>expview                              # Display the default view of the performance data

You can also get alternative views of the performance data within the CLI. The following is a list of some options to change the way the information is displayed:

help (or help commands): Display CLI help text
expview: Show the default view for the experiment
expview -v statements: Show time-consuming statements
expview -v loops: Show time-consuming loops
expview -v linkedobjects: Show time spent in libraries
73. d value>. The following is an example of how to gather data for the smg2000 application on a Linux cluster platform using the osshwc convenience script. It gathers performance data for the default counter, PAPI_TOT_CYC, because there is no hardware counter value specified after the quoted application run command: osshwc "mpirun -np 4 smg2000 -n 60 60 60"

5.2.2 Hardware Counter Threshold (hwc) experiment performance data viewing with GUI
To launch the GUI on any experiment, use: openss -f <database name>. This image shows the default view for the hwc experiment run with the smg2000 MPI application using PAPI_TOT_CYC as the hardware counter event. Double clicking on a line in the Stats Panel, or on the bar chart, will take the user to the source file and line represented by that line of performance information. The next image displays the output from the osshwctime experiment where the counter is the L1 cache misses. [GUI screenshot: hwctime view of per-function counter values]

5.2.3 Hardware Counter Threshold (hwc) experiment performance data viewing with CLI
To launch the CLI on any experiment, use: openss -cli -f <database name>. In this example we show three default CLI views with different granularities: function, statement, and library level.
openss -cli -f smg2000-hwc-3.openss
openss>> The restor
74. dShop GUI, we would encourage you to play around with the interface to become familiar with it. The GUI lets you peel off and rearrange any panel. There are also context sensitive menus, so you can right click on any location to access a different view or to activate additional panels. If you prefer not to use the GUI, there are three other options that all have equal functionality. First, there is the command line interface that we have also seen throughout this manual, which you can launch with the -cli option:
> openss -cli
There is also the immediate command (batch) interface. This uses the -batch flag:
> openss -batch < <openss_cmd_file>
> openss -batch -f <exe> <experiment>
Lastly, there is a Python scripting API, so you can launch Open SpeedShop commands within a Python script:
> python openss_python_script_file.py

11.1 Command Line Interface Basics
The CLI offers an interactive command line interface with command processing similar to gdb or dbx. There are several interactive commands that allow you to create experiments, provide you with process/thread control, or enable you to view experiment results. You can find the full CLI documentation at http://www.openspeedshop.org/doc/cli_doc, but here we will briefly cover some important points. Here is a quick overview of some commands; those marked with * are only available for the online version.
Experiment Creation | Result Presentation: expcreate, expview, expatta
75. defining location [Sample view output: exclusive times for OpenMP functions of the bt.W.x benchmark, including binvcrhs_, x_solve, compute_rhs, y_solve, and matmul_sub]

8.2.1 Threading Specific Experiment: pthreads
An experiment specific to tracking POSIX thread function calls and analyzing those calls is also available in Open SpeedShop. The experiment is called pthreads, and it traces several POSIX thread related functions. Like all the other tracing experiments, the number of calls, the time spent in each function, the call paths to each POSIX thread function, and an event by event trace are available. Load balance and cluster analysis features are also available.

8.2.1.1 Threading Specific pthreads experiment performance data gathering
osspthreads "mpirun -np 4 smg2000 -n 15 15 15"

8.2.1.2 Threading Specific pthreads experiment performance data viewing with GUI
To launch the GUI on any experiment, use: openss -f <database name>. Three pthreads experiment views follow. The first is the default pthreads experiment view, which lists the POSIX thread function routines that were called in the application being monitored, the number of times they were called, and the time spent in each function. [GUI screenshot: default pthreads experiment view]
76. e. We want to identify any components that are bottlenecks. We can do this by viewing the profile aggregated by shared linked objects, making sure the correct or expected modules are present, then analyzing the impact of those support and/or runtime libraries.

4.2 Call Path Profiling (usertime) Experiment
The call path profiling usertime experiment can add some information that is missing from the flat profiles. It is able to distinguish routines called from multiple callers and understand the call invocation history. This provides context for the performance data. It also gathers stack traces for each performance sample and only aggregates samples with equal stack traces. For the user, this simplifies the view by showing the caller/callee relationship. It can also highlight the hot call paths: the paths through the application that take the most time. The call path profiling experiment also provides inclusive and exclusive time. Exclusive time is the time spent inside a function only, for example function B, whereas inclusive time is the time spent inside a function and its children, for example the full chain of functions C, D, and E. The call path profiling experiment is similar to the program counter sampling experiment, since it collects program counter information, except that it also collects call stack information at every sample. There are of course tradeoffs: with that you obtain additional context i
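The inclusive/exclusive relationship described above can be made concrete with a toy calculation (this is an illustration, not Open SpeedShop code): on a given call path, a function's exclusive time is its inclusive time minus the inclusive time of its direct children.

```python
# Toy model of exclusive vs. inclusive time in a call-path profile.
# The call chain C -> D -> E mirrors the functions named in the text above;
# all numbers are made up for illustration.

tree = {"B": [], "C": ["D"], "D": ["E"], "E": []}          # direct children
inclusive = {"B": 2.0, "C": 10.0, "D": 7.0, "E": 3.0}      # seconds

def exclusive(func):
    """Exclusive time = inclusive time minus children's inclusive time."""
    return inclusive[func] - sum(inclusive[child] for child in tree[func])

print(exclusive("B"))  # 2.0 -> leaf function: exclusive equals inclusive
print(exclusive("C"))  # 3.0 -> 10.0 total minus the 7.0 spent in D (and E below it)
```

When the inclusive and exclusive times of a function are similar, its children contribute little, which matches the interpretation given elsewhere in this manual.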
77. e [CLI view residue: I/O time attributed to libc-2.15 functions such as read, write, open, and __lseek, and their syscall completions] This view, generated by choosing the CA icon, shows that there are two groups of ranks where the I/O is performing in a similar manner. For group 2, labeled c 3 below, there are two ranks, while the rest of the 512 ranks perform like group 1, labeled c 2 below. Investigation, by examining rank 312 or 317 and comparing it to one of the ranks in the other group, could shed some light on why group 2 is not similar to the rest. This may or may not be significant, but is here for illustration. [GUI screenshot: Comparative Analysis report with Average I/O Call Time columns for the two rank groups]

7.3.2.3 I/O Extended Tracing (iot) experiment performance data viewing with CLI
To launch th
78. e 1 on use of PAPI: Memory Bandwidth, LLNL Sparse Solver Benchmark AMG
Two of the reasons for on-node scaling limitations are memory bandwidth and shared L3 cache conflicts. For this example we measure the L3 cache misses (L3_CACHE_MISSES:ALL) for 1, 2, and 4 processors. The results match the expectation for strong scaling:
- Reduced data per processor
- L3 misses decreasing roughly linearly up to 4 processors
(Counts are normalized to the 1 processor count; counts are the average of the processor values.)
On the other hand, L3 evictions (L3_EVICTIONS:ALL) for 1, 2, and 4 processors similarly decrease near perfectly, but dramatically increase to 100 times at 8 processors and 170 times at 16 processors. L3 evictions are a good measure of a memory bandwidth limited performance bottleneck at a node. [Chart: AMG intra-node scaling, weak vs. strong vs. ideal, by number of cores]
NOTE: general memory bandwidth limitation remedies: blocking; removing false sharing for threaded codes.

6.1.2 Example 2 on use of PAPI: False Cache line sharing in OpenMP
To reduce false sharing, we examine two versions of this OpenMP code to illustrate reducing false sharing. This code is the unmodified code that exhibits false sharing, causing non optimal performance:

Cache line UnAligned:
      real*4, dimension(100,100) :: c, d
!$OMP PARALLEL DO
      do i = 1, 100
        do j = 2, 100
          c(i,j) = c(i,j-1) + d(i,j)
        enddo
      enddo
!$OMP END PARALLEL DO

This code has been
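A small calculation shows why the loop above invites false sharing. This sketch assumes 64-byte cache lines, 4-byte real*4 elements, Fortran column-major layout, and the parallel loop splitting iterations of i across threads; none of these specifics come from the manual.

```python
# Why the !$OMP PARALLEL DO over i causes false sharing (illustrative sketch).
# Assumptions: 64-byte cache lines, 4-byte real*4, column-major storage,
# and contiguous blocks of i assigned to each thread.

LINE = 64   # cache-line size in bytes (assumed)
ELEM = 4    # bytes per real*4 element
N = 100     # from dimension(100,100)

def cache_line(i, j):
    """Cache-line index of c(i, j): column-major offset ((j-1)*N + (i-1)) * ELEM."""
    return (((j - 1) * N + (i - 1)) * ELEM) // LINE

# If thread 0 owns i = 1..50 and thread 1 owns i = 51..100, then for any
# fixed j the boundary elements c(50, j) and c(51, j) are adjacent in
# memory and typically land on the same cache line:
print(cache_line(50, 1))  # 3
print(cache_line(51, 1))  # 3 -> same line, written by two threads: false sharing
```

Padding or aligning each thread's data to a cache-line boundary (the "remove false sharing" remedy noted above) keeps each thread's writes on its own lines.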
79. e CLI on any experiment, use: openss -cli -f <database name>. The command line interface (CLI) can provide the same data options as the graphical user interface (GUI) views. Here are some examples of the performance data that can be viewed, and the commands used to generate the CLI views. The following table describes the header and meaning of the default iot view CLI columns.

Column Name: Column Definition
I/O Call Time: Aggregated total exclusive time spent in the I/O function corresponding to this row of data
% of I/O Total Time: Percentage of exclusive time relative to the total time spent in the I/O function corresponding to this row of data
Number of Calls: Total number of calls to the I/O function corresponding to this row of data
Min Bytes Count: The number of times minimum bytes read or written by the corresponding I/O function occurred during this experiment
Min Bytes Read or Written: The minimum number of bytes that were read or written by the corresponding I/O function
Max Bytes Count: The number of times maximum bytes read or written by the corresponding I/O function occurred during this experiment
Max Bytes Read or Written: The maximum number of bytes that were read or written by the corresponding I/O function
Total Bytes Read or Written: The total number of bytes read or written by the corresponding function. This number only represents the totals for the number of bytes re
80. e OPENSS_PLUGIN_PATH for the directory where the plugins are stored. If you installed with more than one MPI version, you must specify which to use with OPENSS_MPI_IMPLEMENTATION. Lastly, add the Open SpeedShop bin directory to your PATH and the lib64 directory to your LD_LIBRARY_PATH. Examples of the necessary environment variables that need to be set are as follows:

export oss_install_dir=/opt/myoss
export OPENSS_MPI_IMPLEMENTATION=openmpi
export OPENSS_PLUGIN_PATH=$oss_install_dir/lib64/openspeedshop
export LD_LIBRARY_PATH=$oss_install_dir/lib64:$LD_LIBRARY_PATH
export DYNINSTAPI_RT_LIB=$oss_install_dir/lib64/libdyninstAPI_RT.so
export PATH=$oss_install_dir/bin:$PATH

13.2 Open SpeedShop Blue Gene Platform Install
Please reference the OpenSpeedShop 2.2 Build and Install Guide.

13.3 Open SpeedShop Cray Platform Install
Please reference the OpenSpeedShop 2.2 Build and Install Guide.

13.4 Execution Runtime Environment Setup
This section gives an example of a module file, softenv file, and dotkit that can be used to set up the Open SpeedShop execution environments.

13.4.1 Example module file
This is an example of a module file used for a cluster installation. Use module load <filename of module file> to activate the Open SpeedShop runtime environment.

#%Module1.0#####################################################
## openss modulefile
##
proc ModulesHelp { } {
    global version openss
    puts stderr "\topenss load
81. e arguments
Div (2): first argument divided by second
Mod (2): remainder of the divide operation
Min (2): minimum of the arguments
Max (2): maximum of the arguments
A_Add (1): sum of all the data samples specified for the view
A_Mult (1): product of all the data samples specified for the view
A_Min (1): minimum of all the data samples specified for the view
A_Max (1): maximum of all the data samples specified for the view
Sqrt (1): square root of the argument
Stdev (3): standard deviation calculation
Percent (2): percent the first argument is of the second
Condexp (3): C expression: first argument ? second argument : third argument
Header (2): use the first argument as a column header for the display of the second

Note: Integer and floating constants are supported as arguments, as are the metric keywords associated with the experiment view. Arguments to these functions can be <metric_expressions>, with the exception of the first argument of Header. The first argument of Header must be a character string that is preceded with and followed by a double quote. When the -v summary option is used, it is not generally possible to produce a meaningful column summary. A summary is produced for Add, Max, Min, Percent, A_Add, A_Max, and A_Min.

Examples:
expview hwc -m count, Header("percent of counts", Percent(count, A_Add(count))) -v summary
expview mpi -v butterfly -f MPI_Alltoallv -m time, Header("average time count", Div(Mult(time
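The semantics of a few of these metric-expression functions can be modeled in plain Python. This is an illustrative model based only on the definitions in the table above, not the Open SpeedShop implementation.

```python
# Minimal Python models of several <metric_expression> functions, mirroring
# the definitions in the table above (illustration only, not O|SS code).

def Div(a, b):
    return a / b                    # first argument divided by second

def Mod(a, b):
    return a % b                    # remainder of the divide operation

def Percent(a, b):
    return 100.0 * a / b            # percent the first argument is of the second

def Condexp(cond, a, b):
    return a if cond else b         # C-style: cond ? a : b

print(Div(10, 4))        # 2.5
print(Percent(25, 200))  # 12.5
print(Condexp(1, 7, 9))  # 7
```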
82. e job is initiated. This will cause ossutil to be unable to resolve the symbols. There have been recent changes to the shared library support in Open SpeedShop: dynamic shared library support is now available in newer Cray and Blue Gene operating systems, and there is support for both shared and static binaries on the Cray and on the Blue Gene Q platforms.

12.1.1 osslink Command Information
The osslink command links the OpenSpeedShop collectors and runtime libraries into the static executable and manages setting the appropriate libraries based on the collector value, which is one of the inputs to osslink. The help output for osslink follows:

osslink help
Usage: /opt/osscbtf_cmake_only_july10/bin/osslink -c collector [options] compiler file
-h (help)
-c (collector) <collector name>: Where collector is the name of the OpenSpeedShop collector to link into the application. See the openss man page for a description of the available experiments provided by OpenSpeedShop. This is a mandatory option.
-i (mpitype): For MPI experiments, set the OPENSS_MPI_IMPLEMENTATION value to the MPI implementation specified. Valid options are: mpich, mpich2, mvapich, mvapich2, openmpi, mpt, lam, lampi.
-v (verbose)

12.1.2 Cray Specific Static aprun Information
Note in the above execution of the statically linked executable that we need to add the -b option to the aprun call. The option is needed because Open SpeedShop stores informatio
83. e mpiotf experiment creates Open Trace Format (OTF) output. To obtain a more lightweight overview of application MPI usage, use the MPI profiling experiment (mpip). The lightweight MPI experiment, mpip, records the invocation of all MPI function call events, accumulating the information, but does not save individual call information like the mpi and mpit experiments do. That allows the mpip experiment database to be smaller and makes the mpip experiment faster than the mpi and mpit experiments. The Floating Point Exception Tracing experiment (fpe) is triggered by any FPE caused by the application; it can help pinpoint numerical problem areas. The POSIX thread tracing experiment (pthreads) records the invocation of all tracked POSIX thread related function calls, also referred to as events. The pthreads experiment provides aggregated and individual timings, and also provides argument information for each call.

3.2.5 Parallel Experiment Support
Open SpeedShop supports MPI and threaded codes; it has been tested with a variety of MPI implementations. The thread support is based on POSIX threads, and OpenMP is supported through POSIX threads: Open SpeedShop reports the activity of the POSIX threads that represent the OpenMP threads, but currently doesn't do any special processing for OpenMP specifically. Any Open SpeedShop experiment can be applied to any parallel application. This means you can run the program counter sampling experiment on a non parallel applicat
84. e source annotation metric will become L2_LD_PREFETCH. [Screenshot: Source Panel Annotation Dialog] The regenerated view now shows the results for only L2_LD_PREFETCH. [GUI screenshot: Stats Panel listing hypre_SMGResidual, hypre_CyclicReduction, hypre_SemiInterp, hypre_SemiRestrict, and related functions] Now, double clicking on the Stats Panel result line of choice will focus the source panel and use the PAPI or native counter that was chosen by using the Source Annotation dialog. [GUI screenshot: Source Panel annotated with the selected counter]

5.1.2.2 Viewing Hardware Counter Sampling Data with the GUI
To launch the GUI on any experiment, use: openss -f <database name>. The GUI view below represents an example of the default view for the hardware counter sampling (hwcsamp) experiment. In the default view, the first set of performance data shown is program counter exclusive time (where the program is statistically spending its time) and the percentage of time spent in each function of the program. The next information is the hardware counter event c
85. ead-2.11.3.so)
TestIoSys (IOR: IOR.c, 1848) > 316 in IOR_Close_POSIX (IOR: aiori-POSIX.c, 315)
61.587482 61.587482 0.155123 >> close (libpthread-2.11.3.so)
TestIoSys (IOR: IOR.c, 1848) > 316 in IOR_Close_POSIX (IOR: aiori-POSIX.c, 315)
1.796442 1.796442 0.004525 >> close (libpthread-2.11.3.so)
TestIoSys (IOR: IOR.c, 1848) > 2608 in WriteOrRead (IOR: IOR.c, 2562) >> 234 in IOR_Xfer_POSIX (IOR: aiori-POSIX.c, 224)
1.280113 1.280113 0.003224 >>> __lseek64 (libpthread-2.11.3.so)
TestIoSys (IOR: IOR.c, 1848) > 2611 in WriteOrRead (IOR: IOR.c, 2562) >> 234 in IOR_Xfer_POSIX (IOR: aiori-POSIX.c, 224)
0.981341 0.981341 0.002472 >>> __lseek64 (libpthread-2.11.3.so)

In the above command line interface output, the expview command with no options gives the overview, or summary, view for all the ranks and threads. One can view the performance information for individual ranks using -r <rank number>, individual threads using -t <thread number>, or individual processes using -p <process id>. One can also give a range of ranks, threads, or processes using their respective option. For the calltrees view, the display shows where the I/O function was called from in the user's application source. In this example, most of the I/O time was spent in the write I/O function, along the path shown in the first individual call path. The call path with fullstack option forces the calltrees view to n
86. ed experiment identifier is: -x 1
[jeg@localhost test]$ openss -cli -f smg2000-hwc-3.openss
openss>>
openss: The restored experiment identifier is: -x 1
openss>>expview
Exclusive PAPI_TOT_CYC Counts | % of Total PAPI_TOT_CYC Counts | Function (defining location)
23080000000 43.8283 hypre_SMGResidual (smg2000: smg_residual.c, 152)
12880000000 24.4588 hypre_CyclicReduction (smg2000: cyclic_reduction.c, 757)
3540000000 6.7224 mca_btl_vader_check_fboxes (libmpi.so.1.5.2: btl_vader_fbox.h, 106)
1420000000 2.6965 unpack_predefined_data (libopen-pal.so.6.2.0: opal_datatype_unpack.h, 41)
1220000000 2.3167 hypre_SemiInterp (smg2000: semi_interp.c, 126)
1140000000 2.1648 pack_predefined_data (libopen-pal.so.6.2.0: opal_datatype_pack.h, 38)
1020000000 1.9370 __memcpy_ssse3_back (libc-2.17.so)
740000000 1.4052 hypre_SemiRestrict (smg2000: semi_restrict.c, 125)
openss>>expview -v statements
Exclusive PAPI_TOT_CYC Counts | % of Total PAPI_TOT_CYC Counts | Statement Location (Line Number)
17800000000 36.9141 smg_residual.c, 289
3440000000 7.1340 cyclic_reduction.c, 1130
2780000000 5.7652 smg_residual.c, 238
2760000000 5.7238 cyclic_reduction.c, 910
1700000000 3.5255 cyclic_reduction.c, 999
1660000000 3.4426 btl_vader_fbox.h, 119
1180000000 2.4471 smg_residual.c, 287
960000000 1.9909 cyclic_reduction.c, 853
openss>>expview -v linkedobjects
Exclusive PAPI_TOT_CYC Counts | % of Total PAPI_TOT_CYC Counts | LinkedObject
40800000000 77.3606 sm
87. ent in POSIX thread routines (pthreads)
- Look for load imbalance (LB view) and outliers (CA view)
How do I find parallel inefficiencies in MPI applications?
- Study time spent in MPI routines (mpi, mpit, and lightweight mpip)
- Look for load imbalance (LB view) and outliers (CA view)
How do I find parallel inefficiencies in NVIDIA CUDA applications?
- Study time spent in CUDA routines and the CUDA event execution trace (cuda)

14.2 Additional Documentation
The Python scripting API documentation can be found at http://www.openspeedshop.org/docs/pyscripting_doc, or in the share/doc/packages/openspeedshop/pyscripting_doc folder in the install directory. There are also man pages for openss and every convenience script. There is also a quick start guide that you can download from http://www.openspeedshop.org. There is also an Open SpeedShop forum-style email alias where you can ask questions and read previous posts: oss-questions@openspeedshop.org. Use this URL to sign up: https://groups.google.com/a/krellinst.org/forum/?hl=en#!forum/oss-questions. There is also an email list that you can send your questions to without joining the group; the email alias is oss-contact@openspeedshop.org.

15 Convenience Script Basic Usage Reference Information
This section provides a quick overview of the convenience scripts that can be used either to compare experiment data to other experiment data or to gather performan
88. ent in the application function corresponding to this row of data
% of CPU Time: Percentage of exclusive time spent in the function corresponding to this row of data, relative to the total application exclusive time for all the application functions
PAPI_TOT_CYC: Number of hardware events corresponding to the hardware independent PAPI_TOT_CYC PAPI event. This value is based on reading the hardware counter event buffers based on sampling. This means this data may not accurately reflect where in the source these events occurred. It is an approximation of what is going on in the application, but not a mapping back to the source lines; use the hwc and hwctime experiments for that.
PAPI_TOT_INS: Number of hardware events corresponding to the hardware independent PAPI_TOT_INS PAPI event. This value is based on reading the hardware counter event buffers based on sampling. This means this data may not accurately reflect where in the source these events occurred. It is an approximation of what is going on in the application, but not a mapping back to the source lines; use the hwc and hwctime experiments for that.
TOT_INS/TOT_CYC: This is the graduated instructions per cycle, which is the ratio between the approximation of the total number of instructions and the total number of cycles.
% of TOT_CYC: The percentage of PAPI_TOT_CYC events for this function relative to the numb
89. epling [GUI screenshot: default pcsamp Stats Panel for smg2000 (Hosts: 16, Pids: 256, Ranks: 256, Threads: 256), listing exclusive CPU time, % of CPU time, and function, with hypre_SMGResidual at 772.12 seconds (34.2%), followed by hypre_CyclicReduction, psm_mq_ipeek, ips_ptl_poll, hypre_SemiInterp, psmi_poll_internal, hypre_SemiRestrict, and hypre_SMGAxpy]

We can use this information to identify the critical regions. The profile shows computationally intensive code regions by displaying the time spent per function or per statement. While viewing this, we must ask ourselves:
- Are those the functions/statements that we expected to be taking the most time?
- Does this match the computational kernels?
- Are any runtime functions taking a lot of tim
90. er of PAPI_TOT_CYC events that occurred in all the application functions.

This is a default CLI view for the hwcsamp experiment:

Exclusive CPU time in seconds | % of CPU Time | papi_tot_cyc | papi_tot_ins | tot_ins/tot_cyc | % papi_tot_cyc | Function (defining location)
74.0600 99.8786 177712237021 51989184616 0.2925 99.8787 main (nbody: nbody_mpi.c, 71)
0.0400 0.0539 95958566 28058948 0.2924 0.0539 fesetenv (libm-2.19.so)
0.0300 0.0405 71987793 21053819 0.2925 0.0405 __sqrt_finite (libm-2.19.so)
0.0100 0.0135 23864331 6996727 0.2932 0.0134 memcpy (libc-2.19.so)
0.0100 0.0135 23995616 7018006 0.2925 0.0135 fegetround (libm-2.19.so)
74.1500 100.0000 177928043327 52052312116 0.2925 100.0000 Report Summary

This is the output from an osshwcsamp experiment (non default experiment) where PAPI_L1_DCM, PAPI_L1_ICM, PAPI_L1_TCM, PAPI_L1_LDM, and PAPI_L1_STM were specified on the osshwcsamp command:

openss -cli -f L1-64PE-sweep3d.mpi-hwcsamp.openss
openss>>
openss: The restored experiment identifier is: -x 1
openss>>expview -v summary
Exclusive CPU time in seconds | % of CPU Time | papi_l1_dcm | papi_l1_icm | papi_l1_tcm | papi_l1_ldm | papi_l1_stm | Function (defining location)
824.870000 38.689781 8646497071 117738843 8764235914 8396159476 196649065 __libc_poll (libc-2.12.so)
799.300000 37.490443 46691996441 367096209 47059092650 46247555479 281624221 sweep_ (sweep3d.mpi: sweep.f, 2)
75.000000 3.517807 782716992 10680760 793397752 757322217 20159725 PA
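The tot_ins/tot_cyc column in the hwcsamp view above is simply the ratio of the two sampled counters; a quick cross-check using the counts shown for main in that view:

```python
# Cross-check of the tot_ins/tot_cyc (graduated instructions per cycle)
# column, using the counts reported for main (nbody_mpi.c) in the hwcsamp
# view above.

papi_tot_cyc = 177712237021   # PAPI_TOT_CYC counts for main, from the view
papi_tot_ins = 51989184616    # PAPI_TOT_INS counts for main, from the view

ipc = papi_tot_ins / float(papi_tot_cyc)
print(round(ipc, 4))  # 0.2925, matching the tot_ins/tot_cyc column
```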
91. eriment (hwc) .......... 42
5.2.1 Hardware Counter Threshold hwc experiment performance data gathering .......... 43
5.2.2 Hardware Counter Threshold hwc experiment performance data viewing with GUI .......... 43
5.2.3 Hardware Counter Threshold hwc experiment performance data viewing with CLI .......... 45
6 Hardware Performance Counters and Their Use .......... 46
6.1 Using the Hardware counter experiments to find bottlenecks .......... 48
6.1.1 Example 1 on use of PAPI: Memory Bandwidth, LLNL Sparse Solver Benchmark AMG .......... 48
6.1.2 Example 2 on use of PAPI: False Cache line sharing in OpenMP .......... 49
6.1.3 Example 3 on use of PAPI: Size of TLB Importance (PAPI_TLB_DM), Sandia's CTH .......... 50
7 I/O Tracing and I/O Profiling .......... 52
7.1 IOR Example .......... 52
7.2 Lustre striping commands .......... 53
7.3 Open SpeedShop I/O Tracing and I/O Profiling .......... 54
7.3.1 Open SpeedShop I/O Tracing General Usage ..........
92. es to record performance information. Tracing experiments do not use timers or thresholds to interrupt the application. Instead, they intercept the function calls of interest by using a wrapper function that records timing and function argument information, calls the original function, and then records this information for later viewing with Open SpeedShop's user interface tools. The Input/Output tracing experiments (io, iot) record the invocation of all POSIX I/O events. They both provide aggregated and individual timings, and in addition the iot experiment also provides argument information for each call. To obtain a more lightweight overview of application I/O usage, use the I/O profiling experiment: the lightweight I/O experiment (iop) records the invocation of all POSIX I/O events, accumulating the information, but does not save individual call information like the io and iot experiments do. That allows the iop experiment database to be smaller and makes the iop experiment faster than the io and iot experiments. The memory tracing experiment (mem) records the invocation of all tracked memory function calls, also referred to as events. The mem experiment provides aggregated and individual timings, and also provides argument information for each call. The MPI Tracing Experiments (mpi, mpit, mpiotf) record the invocation of all MPI routines, as well as aggregated and individual timings. The mpit experiment provides argument information for each call. Th
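The wrapper-function technique described above can be sketched in a few lines. This is an illustrative analogue in Python (Open SpeedShop's actual collectors interpose on native functions at the library level); `write_block` is a made-up stand-in for a traced POSIX call.

```python
# Illustrative sketch of the wrapper-function interception used by tracing
# experiments: record timing and argument information, call the original
# function, then store the event for later viewing. Not O|SS code.

import time

events = []  # recorded trace events

def traced(func):
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)          # call the original function
        events.append({"name": func.__name__,   # record the event afterwards
                       "args": args,
                       "seconds": time.time() - start})
        return result
    return wrapper

@traced
def write_block(nbytes):
    # Hypothetical stand-in for a POSIX I/O call such as write().
    return nbytes

write_block(4096)
print(events[0]["name"], events[0]["args"])  # write_block (4096,)
```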
93. ess patterns; pipeline stalls.

5.1 Hardware Counter Sampling (hwcsamp) Experiment
The osshwcsamp experiment supports both derived and non derived PAPI presets and is able to sample up to six counters at one time. Again, you can check the available counters by running osshwcsamp with no arguments. All native events are available, including architecture specific events listed in the PAPI documentation; native events are also reported by papi_native_avail. The hardware counter sampling experiment uses a sampling rate instead of the threshold used in the previous experiments. But, like the threshold, the sampling rate depends on the application and must be balanced between overhead and accuracy: the lower the sampling rate, the fewer samples recorded. The convenience script for this experiment is:

> osshwcsamp "mpirun -np 256 smg2000 -n 50 50 50" <event_list> <sampling_rate>

Note: if a counter does not appear in the output, there may be a conflict in the hardware counters. To find conflicts, use:

> papi_event_chooser PRESET <list_of_events>

Here is a list of some possible hardware counter combinations to use (list provided by Koushik Ghosh, LLNL). For Xeon processors:
- PAPI_FP_INS, PAPI_LD_INS, PAPI_SR_INS: load/store info, memory bandwidth needs
- PAPI_L1_DCM, PAPI_L1_TCA: L1 cache hit/miss ratios
- PAPI_L2_DCM, PAPI_L2_TCA: L2 cache hit/miss ratios
- LAST_LEVEL_CACHE_MISSES: L3 cache info
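The counter pairs listed above are usually combined into ratios after the run. A minimal sketch of the arithmetic, using made-up counts (the formula, not the numbers, is the point):

```python
# Deriving an L1 cache miss/hit ratio from the PAPI_L1_DCM / PAPI_L1_TCA
# counter pair listed above. The counts here are hypothetical.

papi_l1_dcm = 2500000      # L1 data-cache misses (hypothetical)
papi_l1_tca = 100000000    # total L1 cache accesses (hypothetical)

miss_ratio = papi_l1_dcm / float(papi_l1_tca)
hit_ratio = 1.0 - miss_ratio
print(miss_ratio)  # 0.025
print(hit_ratio)   # 0.975
```

The same pattern applies to the L2 pair (PAPI_L2_DCM / PAPI_L2_TCA).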
94. fault report will be printed on screen. The performance information gathered during execution of the experiment will be stored in a database called smg2000-pcsamp.openss. You can use the Open SpeedShop GUI to analyze the data in detail. Run the openss command to load that database file, or open the file directly using the -f option:
> openss -f smg2000-pcsamp.openss

Below we show basic examples of how to use the GUI to view the output database file created by the convenience script. [GUI screenshot: default Stats Panel view, including hypre_SMGAxpy (smg2000: smg_axpy.c, 27)]

You can choose to view data for the Function, Statement, Linked Object, or Loop level of granularity. To switch from one view type to another, first select the view granularity (Function, Statement, Linked Object, or Loop), then select the type of view. For the default views, select the D icon. [GUI screenshot: granularity selection for smg2000 (Hosts: 16, Pids: 256, Ranks: 256, Threads: 256)]

You can manipulate the windows within the GUI and double click functions or statements to see the source code directly. [GUI screenshot: Source Panel opened from the Stats Panel]

4 How to Gather and Understand Profiles
A profile is the aggregated measurements collected during the experiment. Profiles look at code sections over time. There are advantages to using profiles, since they reduce the size of
95. formance speedup. [Chart: PAPI_TLB_DM ratio for 1-4 cores/MPI tasks, normalized to 2 PE] The graph above shows the ratio of the PAPI_TLB_DM counts. The result of using the increased TLB buffer size is as follows:
- 16 PE performance improvement: 7.35%
- 128 PE performance improvement: 8.14%
- 2048 PE performance improvement: 8.23%

7 I/O Tracing and I/O Profiling
I/O can be a significant percentage of the execution time for an application, and can depend on many things, including checkpoint, analysis output, and visualization I/O frequencies. The I/O pattern in the application also matters: whether it is N-to-1 or N-to-N, and whether there are simultaneous read or write requests. Certainly the nature of the application is also important to the I/O usage: whether it is data intensive, traditional HPC with scalable data, or out of core, that is, an application that works on data that is larger than the available system memory. Also relevant are the type of file system and striping available on the cluster (NFS, Lustre, Panasas, or other Object Storage Targets, OSTs) and which I/O libraries your code is using (MPI-IO, HDF5, PLFS, or others). Also, the I/O is dependent on other jobs that are running and stressing the I/O sub systems. The obvious thing to explore first while tuning your code is to try to use a parallel file system, then optimize your code for I/O patterns. Match checkpoint I/O frequency to the Mean Time Before Interrupt (MTBI) of the system. Make sure your code
96. g mem experiment performance data gathering .......... 96
9.2 Memory Analysis Tracing mem experiment performance data viewing with CLI .......... 96
9.3 Memory Analysis Tracing mem experiment performance data viewing with GUI .......... 98
10 Advanced Analysis Techniques .......... 101
10.1 Comparison Script Argument Description .......... 102
10.1.1 osscompare metric argument .......... 102
10.1.2 osscompare rows of output argument .......... 103
10.1.3 osscompare output name argument .......... 103
10.1.4 osscompare view type or granularity argument .......... 104
11 Open SpeedShop User Interfaces .......... 104
11.1 Command Line Interface Basics .......... 105
11.1.2 CLI Metric Expressions and Derived Types .......... 106
11.2 CLI Batch Scripting .......... 108
11.3 Python Scripting .......... 108
11.4 MPI Pcontrol Support .......... 108
11.5 Graphical User Interface Basics ..........
97. g2000
6060000000   11.4903   libmpi.so.1.5.2
4160000000    7.8878   libopen-pal.so.6.2.0
1720000000    3.2613   libc-2.17.so
6 Hardware Performance Counters and Their Use
In this section we will explore the importance of simple Hardware Counter Metrics (HCM) through some easy-to-understand examples. We will also use a simple matrix multiplication example to illustrate performance optimization.
[Figure: the Memory Pyramid — CPU registers at the top, then L1 Cache, L2 Cache, Shared L3 Cache, memory, and Disk at the base]
The Memory Pyramid illustrates the impact of memory on the performance of an application. The closer the memory is to the CPU, the faster and smaller it will be; memory further away from the CPU is slower but larger. The most expensive operation is moving data. The application can only do useful work on the data at the top of the pyramid. For a given algorithm, serial performance is all about maximizing the CPU flop rate and minimizing memory operations in scientific code. The table below shows the access latencies, in clock cycles, for the Intel Nehalem processor.
[Table: access latency in clock cycles for each level of the memory hierarchy, from L1 cache down to disk]
The following example uses BLAS operations to illustrate the impact of moving data. BLAS operations are Basic Linear Algebra Subprograms that provide library function calls for vectors and matrices. We use the Flops/Ops ratio to understand how sections of the code relate to simple memory access patterns as typified by these BLAS o
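The Flops/Ops idea can be made concrete with idealized operation counts (these are textbook approximations, not measured values): a BLAS-1 daxpy performs 2n flops over about 3n memory operations, while a BLAS-3 dgemm performs 2n^3 flops over roughly 4n^2 memory operations, so its flops-per-memory-op ratio grows with n:

```shell
# Idealized flops-per-memory-operation ratios for n = 1000 (approximate
# textbook operation counts; not Open|SpeedShop output):
#   daxpy (BLAS-1): 2n flops over ~3n memory ops
#   dgemm (BLAS-3): 2n^3 flops over ~4n^2 memory ops
LC_ALL=C awk 'BEGIN { n = 1000
  printf "daxpy flops/ops = %.2f\n", (2*n) / (3*n)
  printf "dgemm flops/ops = %.2f\n", (2*n*n*n) / (4*n*n) }'
```

The much higher ratio for dgemm is why well-blocked matrix-multiply kernels can approach the peak flop rate of the machine, while BLAS-1 operations remain memory bound.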
98. [Screenshot residue: tail of a Stats Panel view showing elan_dopu* entries in libelan.so.1]
Below we see the creation of a comparison between two ranks in Open|SpeedShop.
[Screenshot: a Stats Panel comparison of the usertime experiment results for rank 104 and rank 105 of the sweep3d application; the view consists of one column per rank, and the hottest function in both ranks is sweep_ (sweep3d.mpi: sweep.f), followed by elan3_pollevent_word (libelan3.so.1) and source_ (sweep3d.mpi: source.f)]
8.1 MPI Tracing Experiments (mpi, mpit)
In this section we will go through an MPI tracing experiment with Open|SpeedShop. The experiment will be similar to the I/O tracing experiment: it will record all MPI call invocations. There are two MPI experiments and associated convenience scripts
99.
>>>>80 in ompi_datatype_create_hvector (libmpi.so.1.5.2: ompi_datatype_create_vector.c,70)
>>>>>>>>>>>>>>>>>>>>>>70 in ompi_datatype_create (libmpi.so.1.5.2: ompi_datatype_create.c,68)
>>>>>>>>>>>>>>>>>>>>>>>249 in opal_obj_new_debug (libmpi.so.1.5.2: opal_object.h,248)
>>>>>>>>>>>>>>>>>>>>>>>>476 in opal_obj_new (libmpi.so.1.5.2: opal_object.h,463)
>>>>>>>>>>>>>>>>>>>>>>>>>425 in opal_obj_run_constructors (libmpi.so.1.5.2: opal_object.h,417)
>>>>>>>>>>>>>>>>>>>>>>>>>>35 in _ompi_datatype_allocate (libmpi.so.1.5.2: ompi_datatype_create.c,33)
>>>>>>>>>>>>>>>>>>>>>>>>>>>117 in opal_pointer_array_add (libopen-pal.so.6.2.0: opal_pointer_array.c,110)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>327 in grow_table (libopen-pal.so.6.2.0: opal_pointer_array.c,309)
131072  >>>>>>>>>>>>>>>>>>>>>>>>>>>>>realloc (libc-2.17.so)
This view shows the
100. he data for each call, showing its start and end times.
mpiotf — Writes the MPI calls trace to Open Trace Format (OTF) files to allow viewing with Vampir or converting to the formats of other tools.
fpe — Finds where each floating point exception occurred. A trace collects each exception with its exception type and the call stack contents. These measurements are exact, not statistical.
mem — Captures the time spent in, and the number of times, each memory function is called. Shows call paths for each memory function's unique call path.
pthreads — Captures the time spent in, and the number of times, each POSIX thread function is called. Shows call paths for each POSIX thread function's unique call path.
cuda — Captures the NVIDIA CUDA events that occur during the application execution, and reports the time spent for each event, along with the arguments for each event, in an event-by-event trace. Only available in Open|SpeedShop using the CBTF collection mechanism, currently under development.
3.2.3 Sampling Experiments Descriptions
Program counter sampling (the pcsamp experiment), call path profiling (the usertime experiment), and the three hardware counter experiments (hwc, hwctime, hwcsamp) all use a form of sampling-based performance information gathering. Program Counter Sampling (pcsamp) is used to record the Program Counter (PC) in the user application being monitored, by interrupting the application at a user-defined time interval, with the default being 10
101. he experiment.
[Screenshot: the Stats Panel for a usertime experiment on smg2000 (2 hosts, 2 pids, 2 ranks, 2 threads), showing Exclusive CPU time, Inclusive CPU time, and % of Total Exclusive CPU time; the hottest functions are hypre_SMGResidual (smg2000: smg_residual.c), hypre_CyclicReduction (smg2000: cyclic_reduction.c), and hypre_SemiRestrict (smg2000: semi_restrict.c), followed by MPI progress routines, memcpy, and hypre_SemiInterp]
11.5.1.1 Icon ToolBar
[Screenshot: the Stats Panel toolbar, showing a Functions report]
The most used items from the Stats Panel menu, found under the Stats Panel tab, are also available in the Stats Panel ToolBar. The Stats Panel ToolBar is provided as a convenience. The following is a quick overview of the toolbar options. The contents of the toolbar vary by ex
102. he iop experiment run on a 50,000-rank IOR application job. The performance information in the default view is the time spent in I/O functions and the percentage of time spent in each I/O function.
[Screenshot: the I/O Profiling Stats Panel for the IOR run, showing Exclusive I/O call times, Inclusive I/O call times, and % of Total Exclusive CPU time for functions such as __libc_write and __libc_read (libpthread-2.11.so)]
In the image below, the hot call path view for the iop experiment run on the 50,000-rank IOR application job is displayed. The performance information in the hot call path view is the top five call paths to each of the I/O functions that took the most time: the time spent in I/O functions and the percentage of time spent in each I/O function.
[Screenshot: the hot call path view for the IOR run, with call paths through the IOR write/read and POSIX transfer routines down to the traced I/O functions]
103. he source with that information. Note the hot-to-cold color highlighting of the source: the higher the performance values are, the hotter the color. Red is the hottest color, so source highlighted in red is taking the most time in the program being profiled.
[Screenshot: the Source Panel showing a hypre BoxLoop loop in smg2000 with per-line performance values; the hottest lines are highlighted in red]
11.5.2 Preferences: How to change preferences
These preference panel views are included to outline the sequence for changing the GUI and CLI options for generating and viewing performance information with Open|SpeedShop. The first view is the main General preference panel, which allows setting the font, view field sizes, data precision, path, number of lines in the view, and many other general options.
[Screenshot: the General preference panel]
The Stats Panel preference panel lets you change preferences related to viewing the performance information in the GUI Stats Panel.
[Screenshot: the Stats Panel preference panel]
The Source Panel preference panel lets you remap paths to source files that are in a different location on the viewing platform. Use this when you ca
104. he user is choosing to only sample 45 times a second instead of the default 100 times a second. Why would you want to do this? One reason would be to save database size: a lower sampling rate may still give an accurate portrayal of the application behavior.
> osshwcsamp "mpirun -np 256 smg2000 -n 50 50 50" PAPI_L1_DCM,PAPI_L2_DCA,PAPI_L2_DCM,PAPI_L3_DCA,PAPI_L3_TCM
> osshwcsamp "mpirun -np 256 smg2000 -n 50 50 50" PAPI_L1_DCM,PAPI_L2_DCA,PAPI_L2_DCM 45
5.1.2 Hardware Counter Sampling (hwcsamp) experiment performance data viewing with GUI
To launch the GUI on any experiment, use: openss -f <database name>
5.1.2.1 Getting the PAPI counter as the GUI's Source Annotation Metric
In order to make one of the PAPI or native hardware counters the counter that will show up in the source view, one can click on the SA icon, which represents Source Annotation. This brings up an option dialog that allows you to choose the source annotation metric.
[Screenshot: the Stats Panel for the hwcsamp experiment on smg2000, with top functions hypre_CyclicReduction (smg2000: cyclic_reduction.c,757), hypre_SemiInterp (smg2000: semi_interp.c,126), hypre_SemiRestrict (smg2000: semi_restrict.c,125), and opal_progress and opal_generic_simple_unpack (libmpi.so.1.0.3)]
In this example, the native counter we want to choose is L2_LD_PREFETCH. When we click to choose that counter and click on OK, the Stats Panel view will regenerate and th
105. here are many possible things to measure about a program, there are also different ways to measure them. You can instrument your program by adding performance routines to the source code, you can have a performance tool periodically take samples from a program as it runs, or you can preload certain library functions to monitor those calls. There are a number of different performance tools that can help you measure the different performance aspects of your code. There are built-in Unix commands, like time or gprof, that can give you some basic timing information. This manual describes how to use Open|SpeedShop, a robust performance tool capable of analyzing unmodified binaries. Throughout the manual we will show real-world examples of performance analysis using Open|SpeedShop.
2 How to use Performance Analysis
Performance analysis is an essential part of the development cycle and should be included as early as possible. It can have an impact on the patterns used in message passing, the layout of the data structures used, and the algorithms themselves. Your end goal should be correct and efficient code. Typically one would measure the performance of some code and analyze the results. You then modify the code or algorithms as appropriate and repeat the measurements from before, analyzing the differences in successive runs to ensure an increase in performance.
[Figure: the performance analysis cycle — Algorithm → Code → Binary, with the goals Correct Code and Efficient Code]
The most
106. i.so.0.0.2
0.280000000   0.210000000   hypre_SemiInterp (smg2000: semi_interp.c,126)
0.280000000   0.040000000   mca_pml_ob1_progress (libmpi.so.0.0.2)
10.1 Comparison Script Argument Description
The Open|SpeedShop comparison script accepts a number of arguments. This section describes the acceptable options for those individual arguments. For a quick overview, see section 14.4, osscompare: Compare Database Files.
As described above, the osscompare script accepts at least two, and up to eight, comma-separated database file names, enclosed in quotes, as the mandatory argument. By default the compared metric is the primary metric produced by the experiment. For most experiments the metric is exclusive time; however, the hardware counter experiments use the count of the number of hardware counter overflows as the metric to be compared. These are the default, or mandatory, arguments to osscompare. The following sections describe the arguments for osscompare in more detail.
10.1.1 osscompare metric argument
The osscompare metric argument specifies the performance information type that Open|SpeedShop will use to compare against when looking at each database file in the compare database file list. To find the metric specifications that are legal and produce comparison outputs, one can open one of the database files with the Open|SpeedShop command line interface (CLI) and list the available metrics:
openss -cli -f smg2000-pcsamp.openss
openss>>list -v metrics
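Once the legal metrics are known, a comparison run can be sketched as follows (the database file names here are illustrative examples, and the tool must be installed and on PATH; two to eight names go in one quoted, comma-separated argument, optionally followed by a metric):

```shell
# Sketch only: compare two hypothetical pcsamp databases on the default
# metric, then explicitly on the "time" metric.
osscompare "smg2000-pcsamp.openss,smg2000-pcsamp-1.openss"
osscompare "smg2000-pcsamp.openss,smg2000-pcsamp-1.openss" time
```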
107. ia a panel's pull-down menu. Clicking on a colored downward-facing arrow or using the Stats Panel icons can access further options. Red icons represent view options, such as updating the data or clearing the view. The green icons correspond to different possible views of the performance data. The dark blue icons correspond to analysis options, while the light blue icon corresponds to information about the experiment. There is context-sensitive text that is shown when you hover over the icons.
108. il a that is not derived is available for use. You can also see the available counters by using the osshwc or osshwctime commands with no arguments. Native counters are listed in the PAPI documentation.

PAPI Name       Description                               Threshold
PAPI_L1_DCM     L1 data cache misses                      high
PAPI_L2_DCM     L2 data cache misses                      high/medium
PAPI_L1_DCA     L1 data cache accesses                    high
PAPI_FPU_IDL    Cycles in which FPUs are idle             high/medium
PAPI_STL_ICY    Cycles with no instruction issue          high/medium
PAPI_BR_MSP     Mispredicted branches                     medium/low
PAPI_FP_INS     Number of floating point instructions     high
PAPI_LD_INS     Number of load instructions               high
PAPI_VEC_INS    Number of vector/SIMD instructions        high/medium
PAPI_HW_INT     Number of hardware interrupts             low
PAPI_TLB_TL     Number of TLB misses                      low

Note: the Threshold indications are just for rough guidance and are dependent on the application. Also remember that not all counters will exist on all platforms; run osshwc with no arguments to see the available hardware counters. In the sections below we show the outputs from the osshwc experiment; note that the default counter is the total cycles.
5.2.1 Hardware Counter Threshold (hwc) experiment performance data gathering
The hardware counter threshold experiment convenience script is osshwc. Use this convenience script in this manner to gather counter values for one unique hardware counter:
osshwc "how you normally run your application" <papi event> <threshol
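As a concrete sketch of that usage (the launch line, event choice, and threshold value below are illustrative, not prescriptive):

```shell
# Sketch only: gather L1 data cache miss counts for an MPI run of smg2000,
# taking a PC sample every 2545 occurrences of the event (example threshold).
osshwc "mpirun -np 256 smg2000 -n 50 50 50" PAPI_L1_DCM 2545
```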
109. ing Experiments (mpi, mpit) performance data gathering .......... 81
8.1.1.2 MPI Tracing Experiments (mpi, mpit) performance data viewing with GUI .......... 81
8.1.1.3 MPI Tracing Experiments (mpi, mpit) performance data viewing with CLI .......... 81
8.1.2 MPI Tracing Experiments (mpip) .......... 84
8.2 Threading Analysis Section .......... 88
8.2.1 Threading Specific Experiment (pthreads) .......... 90
8.2.1.1 Threading Specific (pthreads) experiment performance data gathering .......... 90
8.2.1.2 Threading Specific (pthreads) experiment performance data viewing with GUI .......... 90
8.2.1.3 Threading Specific (pthreads) experiment performance data viewing with CLI .......... 92
8.3 NVIDIA CUDA Analysis Section .......... 94
8.3.1 NVIDIA CUDA Tracing (cuda) experiment performance data gathering .......... 94
8.3.2 NVIDIA CUDA Tracing (cuda) experiment performance data viewing with GUI .......... 94
8.3.3 NVIDIA CUDA Tracing (cuda) experiment performance data viewing with CLI .......... 95
9 Memory Analysis Techniques .......... 96
9.1 Memory Analysis Tracin
110. inimum time. Note that there may be more ranks that have the same maximum and minimum time per rank.
This is an example of the ability to compare performance information at the rank level in the CLI. Here we show a comparison on the exclusive time metric for rank 0 and rank 23. These ranks were shown to be the ranks that had the maximum and minimum values for MPI_Waitall above. One could also use expview -r 0 and expview -r 23 to see the times for just those ranks.
[CLI output: an expcompare of rank 0 and rank 23, showing exclusive MPI call times in seconds for functions such as PMPI_Waitall (libmpich.so.1.0: waitall.c,57), MPI_Allreduce (libmpich.so.1.0: allreduce.c,59), PMPI_Init, PMPI_Wait, and PMPI_Irecv]
Here we show the top two call paths in the program that took the most time with respect to MPI function calls.
[CLI output: an expview -v calltrees,fullstack view with columns for exclusive MPI call time, % of total, number of calls, and the minimum, maximum, and average exclusive time across ranks; both call paths start at __libc_start_main (libmonitor.so.0.0.0: main.c,541)]
111. ion, as well as an MPI or threaded application. The experiment data collectors are automatically applied to all tasks/threads. The default views aggregate (sum) the performance data across all tasks/threads, but data from individual tasks/threads is available. The MPI calls are wrapped, and MPI function elapsed time and parameter information is displayed.
3.3 Running an Experiment
First think about what parameters you want to measure, then choose the appropriate experiment to run. You may want to start by running the pcsamp experiment, since it is a lightweight experiment and will give an overview of the timing for the entire application. Once you have selected the experiment to run, you can launch it with either the wizard in the GUI or by using the command line convenience scripts. For example, say you have decided to run the pcsamp experiment on the Semi-coarsening Multigrid Solver MPI application, smg2000. On the command line you would issue the command:
> osspcsamp "mpirun -np 256 smg2000 -n 60 60 60"
where mpirun -np 256 smg2000 -n 60 60 60 is a typical MPI application launching command you would normally use to launch the smg2000 application: mpirun (an MPI driver script or executable) is used to launch smg2000 on 256 processors, and -n 60 60 60 is passed as an argument to smg2000. An example of an MPI smg2000 pcsamp experiment run from a SLURM-based system using srun as the MPI driver, along with the application
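The same launch can be sketched with a trailing argument that adjusts the sampling rate from its default (the ability to change experiment parameters is described in section 12.1.3; the rate value here is illustrative):

```shell
# Sketch only: osspcsamp wraps the quoted launch command; an optional trailing
# argument is assumed here to raise the sampling rate above the default.
osspcsamp "mpirun -np 256 smg2000 -n 60 60 60"       # default sampling rate
osspcsamp "mpirun -np 256 smg2000 -n 60 60 60" 200   # higher sampling rate
```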
112. is using the appropriate libraries.
7.1 OOCORE Example
We will examine an example using the benchmarking application OOCORE, an out-of-core solver from the Department of Defense High Performance Computing Modernization Program (DoD HPCMP). It is an out-of-core ScaLAPACK (Scalable LAPACK) benchmark from the University of Tennessee, Knoxville (UTK). It can be configured to be disk-I/O intensive. It characterizes a very important class of HPC applications involving the use of the Method of Moments (MOM) formulation for investigating electromagnetics, e.g., radar cross section and antenna design. It solves dense matrix equations by LU (lower triangular/upper triangular), QR, or Cholesky decomposition. OOCORE is used by HPCMP to evaluate I/O system scalability. For our needs, this application, or similar out-of-core dense solver benchmarks, helps to point out the important points in performance analysis, like I/O overhead minimization, and the use of a Matrix Multiply kernel, which makes it possible to achieve close to peak performance of the machine if tuned well. It can highlight blocking, which is very important to tune for deep memory hierarchies.
The following example was run on 16 cores on a Quad-Core, Quad-Socket Opteron IB cluster. We want to compare two different file systems: Lustre I/O with striping, and NFS I/O. We use the ossio convenience script:
> ossio "srun -N 1 -n 16 testzdriver-std"
Sample output from the Lustre run:
TIME M N MB N
113. [CLI output: the default mpi experiment view — exclusive and inclusive MPI call times, % of total, number of calls, and minimum, maximum, and average call times (ms) for PMPI_Allreduce, PMPI_Wait, PMPI_Bcast, PMPI_Waitany, PMPI_Waitall, PMPI_Recv, PMPI_Reduce, PMPI_Isend, and PMPI_Send in libmpich.so.3.3; PMPI_Allreduce dominates with roughly 84% of the total MPI call time]
This is an example of the GUI default view for the MPI (mpi, mpit) experiments.
[Screenshot: the GUI default view of an MPI tracing experiment, with columns for exclusive and inclusive MPI call times, % of total, number of calls, and minimum, maximum, and average call times]
The default views are designed to relate the information included in the report back to the individual calls to their corresponding MPI functions. This is the same information that would be reported if the
114. load balance view for the execution of the smg2000 application on a small cluster. Following the load balance view, there is an expcompare CLI command example where two of the program's ranks are compared against each other. This may be useful if there appears to be load imbalance when examining the -m loadbalance output. Note that min_bytes represents the smallest allocation value that was used in the memory allocation functions. This may be of interest if one is allocating a small number of bytes and doing it too often, creating unwanted overhead.

openss>>expview -m loadbalance
Max Exclusive      Rank of   Min Exclusive      Rank of   Average Exclusive   Function (defining location)
Mem call time      Max       Mem call time      Min       Mem call time
across ranks (ms)            across ranks (ms)            across ranks (ms)
      20.1452      1               14.7405      0               17.6019       __cfree (libc-2.17.so)
      11.5376      1                7.3129      0                9.0911       __libc_malloc (libc-2.17.so)
       1.7816      1                1.2193      0                1.4573       realloc (libc-2.17.so)

openss>>expcompare -r 0:3 -m max_bytes
r0 Max       r1 Max       r2 Max       r3 Max       Function (defining location)
Requested    Requested    Requested    Requested
Bytes        Bytes        Bytes        Bytes
65536        131072       131072       65536        realloc (libc-2.17.so)
2064         2064         2064         1040         __libc_malloc (libc-2.17.so)

openss>>expcompare -r 0:3 -m min_bytes
r0 Min       r1 Min       r2 Min       r3 Min       Function (defining location)
Requested    Requested    Requested    Requested
Bytes        Bytes        Bytes        Bytes
16           24           24           16           realloc (libc-2.17.so)
115. lues, or bytes that were read or written. Beware of serial I/O in applications, as illustrated in the code below (code from Mike Davis, Cray, Inc.).
Below shows the output of the Open|SpeedShop iot experiment on the serial I/O code:
[Screenshot: the iot Per Event Report. Clicking on this option gives each call to an I/O function being traced as an event-by-event list, showing timestamped calls such as __libc_write and __libc_read (libpthread-2.5.so). A graphical trace view of the same data in another tool shows the serialization of the I/O across the PEs.]
Where by default the I/O functi
116. luster analysis view groups like-performing ranks together as a means of locating groups of ranks that are outliers with respect to the other ranks.
This view shows the hot call paths in the application:
[CLI output: hot call paths starting at __libc_start_main (libmonitor.so.0.0.0: main.c,541) and leading through main (lulesh: lulesh.cc) to PMPI_Init and to MPI_Allreduce (libmpich.so.1.0: allreduce.c,59)]
8.2 Threading Analysis Section
We just did an experiment that uses MPI, but we can do a similar analysis on applications that use threads. To analyze a threaded application, first we can run the pcsamp experiment to get an overview, then look at the load balance view to detect if there are any widely varying values, and finally do cluster analysis to find any outliers. The image below shows the default view for an application with 4 threads; the information displayed is the aggregated total from
117. m counters on things like network cards and switches, or environmental sensors.
The drawback to hardware counters is that their availability differs between platforms and processor types. Even systems that allow the same counters may have slight semantic differences between platforms. In some cases, access to hardware counters may require privileged access or kernel patches. The Performance Application Programming Interface (PAPI) allows access to hardware counters through APIs and simple runtime tools. You can find more information on PAPI at http://icl.cs.utk.edu/papi. Open|SpeedShop provides three hardware counter experiments that are implemented on top of PAPI. It provides access to PAPI and native counters, like data cache misses, TLB misses, and bus accesses.
There are a few basic models to follow in hardware counter experiments. The first is thresholding, where the user selects a counter and the application runs until a fixed number of events has been reached on that counter. Then a PC sample is taken at that location, every time the counter increases by the preset fixed number. The ideal threshold (the fixed number at which to monitor) is dependent on the application. Another model is timer-based sampling, where the counters are checked at given time intervals. Open|SpeedShop provides three hardware counter experiments: hwc for flat hardware counter profiles using a single hardware counter, hwctime for profiles with stack traces using
118. mation. Open|SpeedShop provides several flexible and easy ways to interact with it. There is a GUI to launch and examine experiments, and a command line interface that provides the same access as the GUI, as well as Python scripts. There are also convenience scripts that allow you to run standalone experiments on applications and examine the results at a later time. The existing experiments for Open|SpeedShop all work on unmodified application binaries. Open|SpeedShop has been tested on a variety of Linux clusters, and supports Cray and Blue Gene systems.
3.1 Basic Concepts: Interface, Workflow
[Figure: the Open|SpeedShop workflow — the GUI, CLI, and Python interfaces sitting on top of the open source instrumentation layer]
Open|SpeedShop has three ways for the user to examine the results of a performance test (called experiments): a GUI, a command line interface, or through Python libraries. The user can also start experiments by using those three options, or by an additional method: the command-line-launched convenience scripts. For example, to launch one of the convenience scripts for the pcsamp experiment (Program Counter Sampling), the user executes the command osspcsamp "<application>", where <application> is the executable under study along with any arguments. The convenience scripts will then create a database for the results of that experiment. The user can examine any database in the GUI with the command openss -f <db file>
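A minimal end-to-end sketch of this workflow, assuming a hypothetical application a.out and the usual <application>-<experiment>.openss database naming (the actual file name may differ on your system):

```shell
# Sketch only: run the pcsamp convenience script on an application, then open
# the resulting database in the GUI or in the CLI.
osspcsamp "./a.out"
openss -f a.out-pcsamp.openss        # examine the results in the GUI
openss -cli -f a.out-pcsamp.openss   # or examine them in the CLI
```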
119. mended that the user run hwcsamp first to get an idea of how many times a particular event occurs (the count of the event) during the life of the program. A reasonable threshold can be determined from the hwcsamp data by determining the average counts per thread of execution and then setting the hwc/hwctime threshold to some small fraction of that. For example, if you see 1333333333 PAPI_L1_DCM events over the life of the program when running the hwcsamp experiment, and there were 524 processes used during the application run, then this is the formula you could use to find a reasonable threshold for the hwc and hwctime experiments when using the PAPI_L1_DCM event for the same application:
(average counts per thread) / 1000 = threshold for hwc/hwctime
In this case: 1333333333 / 524 = 2544529 average counts per thread, and 2544529 / 1000 ≈ 2545.
Using this formula, one could use 2545 as the threshold value in hwc and hwctime for PAPI_L1_DCM and expect to get a reasonable data sample of that event.
NOTE: The number of PAPI counters and their uses can be overwhelming. Ratios derived from a combination of hardware events can sometimes provide more useful information than raw metrics. Develop the ability to interpret metric ratios, with a focus on understanding:
• Instructions per cycle or cycles per instruction
• Floating point and vectorization efficiency
• Cache behaviors
• Long latency instruction impact
• Branch mispredictions
• Memory and resource acc
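The arithmetic above can be sketched as a one-liner, using the example numbers from the text (total event count and process count):

```shell
# Threshold = (total event count / number of threads) / 1000, rounded to the
# nearest integer; the numbers are the document's PAPI_L1_DCM example.
awk 'BEGIN { printf "%d\n", (1333333333 / 524 / 1000) + 0.5 }'
# → 2545
```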
120. [CLI output residue: call paths with frames __libc_start_main (libc-2.17.so: libc-start.c,96), monitor_main (libmonitor.so.0.0.0: main.c,492), main (lulesh: lulesh.cc), PMPI_Init (libmonitor.so.0.0.0), and MPI_Allreduce (libmpich.so.1.0: allreduce.c,58)]
8.1.2.3 MPI Tracing Experiments (mpip) performance data viewing with GUI
To launch the GUI on any experiment, use: openss -f <database name>
This is an example of the GUI default view for the MPI (mpip) experiment:
[Screenshot: the GUI default view of the mpip experiment on lulesh, with top entries main (libmonitor.so.0.0.0), PMPI_Waitall (libmpich.so.1.0: waitall.c,57), MPI_Allreduce, PMPI_Wait (libmpich.so.1.0: wait.c,51), PMPI_Isend (libmpich.so.1.0: isend.c,58), and PMPI_Irecv]
The following view shows the load balance for this execution of lulesh on 27 ranks.
This view shows the cluster analysis view for this run of lulesh on 27 ranks. The c
n about the executable location when it is running. Without the -b option, the executable is run in a temporary location that is not available when the raw data information is being converted into the Open SpeedShop database file.

12 1 3 Changing parameters to the experiments

Note: when executing a statically linked executable with the Open SpeedShop collectors linked in, the workflow is different. Since the more flexible convenience scripts can't be used, one must set environment variables in order to change the arguments to the experiments. Examples of the environment variables that can be changed are as follows:

Environment Variable        Experiment   What it controls
OPENSS_PCSAMP_RATE          pcsamp       Sampling rate
OPENSS_USERTIME_RATE        usertime     Sampling rate
OPENSS_HWC_THRESHOLD        hwc          How many event occurrences before a sample is taken
OPENSS_HWCSAMP_EVENTS       hwcsamp      List of PAPI or native event names
OPENSS_HWCSAMP_RATE         hwcsamp      Sampling rate
OPENSS_HWCTIME_THRESHOLD    hwctime      How many event occurrences before a sample is taken
OPENSS_IO_TRACED            io/iot       List of I/O functions to collect data for
OPENSS_FPE_EVENT            fpe          List of FPE exception names to collect data for

13 Setup and Build for Open SpeedShop

Open SpeedShop is set up to work with the AMD processors and the Intel x86/x86_64 and ARM processor architectures. It has been tested on many Linux distributions, include
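Putting the environment-variable workflow together, a job script for a statically linked binary might look like the sketch below. The numeric values, the relinked executable name, and the aprun launcher are illustrative assumptions; pick settings appropriate to your own run.

```shell
# Illustrative environment for a statically linked, osslink-relinked binary.
# All values below are examples only, not recommendations.
export OPENSS_PCSAMP_RATE=200                            # pcsamp sampling rate
export OPENSS_HWC_THRESHOLD=2545                         # hwc/hwctime event threshold
export OPENSS_HWCSAMP_EVENTS="PAPI_L1_DCM,PAPI_TOT_INS"  # hwcsamp event list
export OPENSS_RAWDATA_DIR=/scratch/$USER                 # shared raw-data location

# Then launch the relinked executable directly (no convenience script), e.g.:
# aprun -n 128 ./smg2000-pcsamp -n 50 50 50
```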
n't see the source files on the viewing machine because the executable was built on a different machine. Put the old path to the source into the "Old Path Name" text box area, and the new path for the source on the viewing machine into the "New Path Name" text box area.

[Preferences dialog screenshot omitted: Source Panel preferences showing the Old Path Name and New Path Name entry fields]

11 5 2 1 Disabling or enabling the preference for Save/Reuse views in CLI

Here we show the General preferences window scrolled down to the area of the view that shows more preference options. If you do not want the new save/reuse view active, you can disable that function by clicking on the "Save Views for Reuse in CLI and GUI" preference, pointed to by the blue line below. By clicking on that preference line you can disable or enable the feature. The same procedure works for the other preferences as well.

[Preferences dialog screenshot omitted]

12 Special System Support: Static Executables

12 1 Cray and Blue Gene

When shared library support is limited, the normal manner of running experiments in Open SpeedShop doesn't work. You must link the collectors into the static executable. Currently Open SpeedShop has static support on the Cray and Blue Gene P/Q platforms. You must relink the application with the osslink command to add support for the collectors. The osslink command is a
nformation from the call stacks, but there is now a higher overhead and a necessarily lower sampling rate. We can run the call path profiling experiment using the Open SpeedShop convenience script on our test program smg2000:

> ossusertime "mpirun -np 256 smg2000 -n 50 50 50"

Again, it is recommended that you compile your code with the -g option in order to see the statements in the sampling. The usertime experiment also takes a sampling frequency as an optional parameter; the available parameters are high (70 samples per second) and low (18 samples per second), and the default value is 35 samples per second. Note that these sample rates are lower than the pcsamp experiment because of the increased amount of data being collected. If we wanted to run the same experiment with the low sampling rate, we would simply issue the command:

> ossusertime "mpirun -np 256 smg2000 -n 50 50 50" low

We can view the results of this experiment in the Open SpeedShop GUI. The view is similar to the pcsamp view, but this time the inclusive CPU time is also shown.

[GUI screenshot omitted: usertime Stats Panel showing exclusive and inclusive CPU times per function]
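The resulting usertime database can also be inspected without the GUI. A minimal CLI session might look like the sketch below; the database file name follows the usual <executable>-<experiment>[-n].openss pattern and is an assumption here, so adjust it to the file your run actually produced.

```shell
# Open the usertime database in the O|SS CLI and print the default view
# plus the full call trees. Guarded so the sketch is safe to run anywhere.
db="smg2000-usertime-1.openss"
if command -v openss >/dev/null 2>&1; then
    printf 'expview\nexpview -vcalltrees,fullstack\nexit\n' | openss -cli -f "$db"
else
    echo "openss not installed; would run: openss -cli -f $db"
fi
```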
15 11 ossfpe FP Exception Experiment ................................ 128
15 12 ossmem Memory Analysis Experiment ............................. 129
15 13 osspthreads POSIX Thread Analysis Experiment .................. 129
15 14 osscuda NVIDIA CUDA Tracing Experiment ........................ 129
15 15 Key Environment Variables .................................... 129
16 Hybrid openMP and MPI Performance Analysis ....................... 131
16 1 Focus on individual Rank to get Load Balance for Underlying Threads
16 2 Clearing Focus on individual Rank to get back to default behavior

Why do I need Performance Analysis?

Where are the bottlenecks in my program? My parallel application works fine on 10 nodes, but on 1000 nodes it slows to a crawl. What's happening? Is my parallel program scalable? Is my program optimized for running on this new system? Are these new libraries faster than the old ones?

All these questions can be answered by using Performance Analysis.

About this Manual

This manual will provide you with a basic understanding of performance analysis. You will learn how to plan and run Open SpeedShop performance experiments
.................................................................... 109
11 5 1 Basic Initial View (Default View) ............................ 109
11 5 1 1 Icon Tool Bar .............................................. 110
11 5 1 2 Views Display Choice selection ............................. 111
11 5 2 Preferences: How to change preferences ....................... 112
11 5 2 1 Disabling or enabling the preference for Save/Reuse views in CLI ... 114
12 1 Cray and Blue Gene ............................................. 116
12 1 1 osslink command Information .................................. 117
12 1 2 Cray Specific Static aprun Information ....................... 118
12 1 3 Changing parameters to the experiments ....................... 118
13 1 Open SpeedShop Cluster Install ................................. 120
13 2 Open SpeedShop Blue Gene Platform Install ...................... 121
13 3 Open SpeedShop Cray Platform Install ........................... 121
13 4 Execution Runtime Environment Setup ............................ 121
13 4 1 Example module file .......................................... 121
13 4 2 Example softenv file ........
oad imbalance and delays to the overall job execution speed. These ranks show better performance numbers in terms of the MPI function time, but that is only because they were the last to arrive at the internal barrier point and did not have to wait as long as the MPI function calls that arrived sooner but had to wait for the other ranks to finally arrive. The image below shows the results of the MPI experiment in the default view. Next we see the MPI function call path view, shown below.

[GUI screenshot omitted: Unique Call Paths view; click the C+ icon to see the unique call paths to MPI_Waitall and other MPI functions]

Here is the default pcsamp view, based on functions, with the MPI library showing high in the list of time:

[GUI screenshot omitted: pcsamp function view listing jacld_, jacu_, ssor_, exchange_3_, and __GI_memcpy for the lu.C.256 executable]

Here is the load balance view, based on functions:

[GUI screenshot omitted: load balance view showing min, max, and average exclusive CPU time across 256 ranks of lu.C.256 on 16 hosts]
on any experiment use: openss -cli -f <database name>

The following table describes the header and column data definitions for the default MPI experiment views:

Exclusive MPI Call Time: Aggregated total exclusive time spent in the MPI function corresponding to this row of data.
% of MPI Time: Percentage of exclusive MPI time spent in the MPI function corresponding to this row of data, relative to the total MPI time for all the MPI functions.
Number of Calls: Total number of calls to the MPI function corresponding to this row of data.
Min MPI Call Time: The minimum time that an MPI call took, across all calls spent in the corresponding MPI function.
Max MPI Call Time: The maximum time that an MPI call took, across all calls spent in the corresponding MPI function.
Average MPI Call Time Across Ranks: The average time for the default view is the total amount of time for all the calls to a function divided by the total number of calls. Thus it is the average time that each MPI function call spends in the function.

This is an example of the CLI default view for the MPI (mpi, mpit) experiments:

openss>>[openss]: The restored experiment identifier is: -x 1
openss>>expview
Exclusive  % of  Number  Minimum  Maximum  Average  Function (defining location)
MPI Ca
on list to trace is "all"; the specific functions are: creat, creat64, dup, dup2, lseek, lseek64, open, open64, pipe, pread, pread64, pwrite, pwrite64, read, readv, write, writev.

Things to remember with I/O:
- Avoid writing to one file from all MPI tasks. If you need to do it, make sure the distinct offsets for each PE start at a stripe boundary, and use buffered I/O if you must do this.
- If each process writes its own file, then the parallel file system attempts to load balance the OSTs, taking advantage of the stripe characteristics.
- Metadata server overhead can often create severe I/O problems. Minimize the number of files accessed per PE, and minimize each PE doing operations like seek, open, close, and stat that involve inode information.
- I/O time is usually not measured, even in applications that keep some function profile. Open SpeedShop can shed light on time spent in I/O using the io and iot experiments.

7 3 Open SpeedShop I/O Tracing General Usage

The Open SpeedShop io and iot I/O function tracing experiments wrap the most common I/O functions: they record the time spent in each I/O function, record the call path along which each I/O function was called, record the time spent along each call path to an I/O function, and record the number of times each function was called. In addition, the iot experiment also records information about each individual I/O function call: the values of the arguments and the return value from the I/O function are recorded.

7 3 1
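As a concrete sketch of restricting the traced list: tracing only the calls you care about reduces overhead. The trailing function-list argument below mirrors the style of the other convenience scripts and is an assumption; verify the exact syntax with `man ossio` on your installation.

```shell
# Trace only the read/write family instead of the full default I/O list.
# The function-list argument form is an assumption; see `man ossio`.
io_list="read,readv,write,writev"
if command -v ossio >/dev/null 2>&1; then
    ossio "mpirun -np 32 smg2000 -n 50 50 50" "$io_list"
else
    echo "ossio not installed; would trace: $io_list"
fi
```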
onous functions ran quickly (low minimum times), but some had to wait a long time to get started (large maximum times). Many times the function calls that ran quickly were the last to arrive, and actually are from ranks that are not running as well as the others, causing load imbalance and delays to the overall job execution speed. These ranks show better performance numbers in terms of the MPI function time, but that is only because they were the last to arrive at the internal barrier point and did not have to wait as long as the MPI functions that arrived sooner but had to wait for the other ranks to finally arrive.

8 1 2 MPI Tracing Experiments mpip

8 1 2 1 MPI Tracing Experiments mpip performance data gathering

Much of this information is described above in the main MPI Tracing Experiments section, but for completeness this is the convenience script description for running the MPI-specific (mpi, mpit, mpip) tracing experiments:

> ossmpip "srun -N 4 -n 32 smg2000 -n 50 50 50" [ default | <list MPI functions> | mpi_category ]

8 1 2 2 MPI Tracing Experiments mpip performance data viewing with CLI

To launch the CLI on any experiment use: openss -cli -f <database name>

The following table describes the header and column data definitions for the default MPI experiment views:

Exclusive MPI Call Time: Aggregated total exclusive time spent in the MPI function co
openss,smg_run2.openss" percent rows=10

Please type "man osscompare" for more details.

15 5 osspcsamp Program Counter Experiment

General form: osspcsamp "<command> <args>" [ high | low | default | <sampling rate> ]

Sequential job example: osspcsamp "smg2000 -n 50 50 50"
Parallel job example: osspcsamp "mpirun -np 128 smg2000 -n 50 50 50"

Additional arguments:
high: twice the default sampling rate (samples per second)
low: half the default sampling rate
default: the default sampling rate is 100
<sampling rate>: integer value sampling rate

15 6 ossusertime Call Path Experiment

General form: ossusertime "<command> <args>" [ high | low | default | <sampling rate> ]

Sequential job example: ossusertime "smg2000 -n 50 50 50"
Parallel job example: ossusertime "mpirun -np 64 smg2000 -n 50 50 50"

Additional arguments:
high: twice the default sampling rate (samples per second)
low: half the default sampling rate
default: the default sampling rate is 35
<sampling rate>: integer value sampling rate

15 7 osshwc, osshwctime HWC Experiments

General form: osshwc[time] "<command> <args>" [ default | <PAPI_event> | <PAPI threshold> | <PAPI_event> <PAPI threshold> ]

Sequential job example: osshwc[time] "smg2000 -n 50 50 50"
Parallel job example: osshwc[time] "mpirun -np 128 smg2
ot collapse any similar sub-trees, which makes the view more explicit. Without the fullstack option the call trees would be more consolidated.

8 Applying Experiments to Parallel Codes

The ideal scenario for the execution of parallel code using pthreads or OpenMP is efficient threading, where all threads are assigned work that can execute concurrently. For MPI code, the ideal is a job that is properly load balanced, so all MPI ranks do the same amount of work and no MPI rank is stuck waiting. What are some things that can cause these ideal scenarios to fail? (Taken from the LLNL parallel processing tutorial.) MPI jobs can become unbalanced if an equal amount of work was not assigned to each rank, possibly through the number of array operations not being equal for each rank, or loop iterations not being evenly distributed. You can still have problems even if your work seems to be evenly distributed. For example, if you evenly distribute a sparsely populated array, then some ranks may end up with very little or no work while others will have a full workload. With adaptive grid models, some ranks need to redefine their mesh while others don't. With N-body simulations, some work migrates to other ranks, so those ranks will have more to do while the others have less. Performance analysis can help you with load balancing and an even distribution of work. Tools like Open SpeedShop are designed to work on parallel jobs. It supports threading and message passing and automaticall
ounts listed in columns by the hardware counter event. Column three represents the counts that were recorded for PAPI_TOT_CYC, and column four represents the counts for PAPI_TOT_INS. What this view can indicate to the viewer is whether or not the specified hardware counter events are occurring and, if they are, how prevalent they are. With this information the user could isolate down to see exactly where a particular event is occurring by using the hwc or hwctime experiment. These two experiments are threshold based, which ultimately means you can map the performance data back to the source, because the actual event triggered the recording of the counts of the event. The hwcsamp experiment is timer based, so Open SpeedShop cannot take you to the line of source exactly where the hardware counter events are happening. hwcsamp is an overview experiment that tells the user which events are occurring, and whether they are occurring in numbers that would warrant using the hwc or hwctime experiment to pinpoint where in the source the specified hardware counter event is actually occurring.

[GUI screenshot omitted: hwcsamp Stats Panel view]

5 1 3 Hardware Counter Sampling hwcsamp experiment CLI performance data viewing

To launch the CLI on
p (smg2000: smg_setup.c, 28)
>>>>>>>>>> 408 in hypre_SMGRelaxSetup (smg2000: smg_relax.c, 357)
>>>>>>>>>>> 619 in hypre_SMGRelaxSetupASol (smg2000: smg_relax.c, 540)
>>>>>>>>>>>> 643 in hypre_CyclicReductionSetup (smg2000: cyclic_reduction.c, 475)
>>>>>>>>>>>>> 384 in hypre_CycRedSetupCoarseOp (smg2000: cyclic_reduction.c, 211)
>>>>>>>>>>>>>> 609 in hypre_StructMatrixAssemble (smg2000: struct_matrix.c, 578)
>>>>>>>>>>>>>>> 144 in hypre_CommPkgCreate (smg2000: communication.c, 75)
>>>>>>>>>>>>>>>> 1367 in hypre_CommPkgCommit (smg2000: communication.c, 1354)
>>>>>>>>>>>>>>>>> 1480 in hypre_CommTypeBuildMPI (smg2000: communication.c, 1459)
>>>>>>>>>>>>>>>>>> 1561 in hypre_CommTypeEntryBuildMPI (smg2000: communication.c, 1520)
>>>>>>>>>>>>>>>>>>> 68 in PMPI_Type_hvector (libmpi.so.1.5.2: ptype_hvector.c, 43)
>>>>>>>>>>>>>>>>>>>> 71 in PMPI_Type_create_hvector (libmpi.so.1.5.2: ptype_create_hvector.c, 47)
>>>>>>>>>>>>>>>>>
pare ranks by using the Customize Stats Panel view and creating a compare column for the process groups or individual ranks. Cluster analysis is also available; it can be used to find outliers: ranks that are performing very differently than the others. From the Stats Panel toolbar or context menu you can automatically create groups of similar performing ranks or threads. Through the Stats Panel, Open SpeedShop also provides common analysis functions designed for quick analysis of MPI applications. There are load balance views that calculate min, max, and average values across ranks, processes, or threads. The image below shows the Open SpeedShop buttons for Load Balance and, next to that, Cluster Analysis.

[GUI screenshot omitted: pcsamp Stats Panel load balance view showing min, max, and average exclusive CPU time for sweep3d.mpi functions across 256 ranks, including sweep_, snd_real_, and source_ entries]
pcsamp percent, pcsamp threadAverage, pcsamp threadMax, pcsamp threadMin, pcsamp time

You can use the output of the list metrics command as an argument to the osscompare command, as shown in the examples below:

osscompare "smg2000-pcsamp.openss,smg2000-pcsamp-1.openss"
osscompare "smg2000-pcsamp.openss,smg2000-pcsamp-1.openss" percent
osscompare "smg2000-pcsamp.openss,smg2000-pcsamp-1.openss" threadMin
osscompare "smg2000-pcsamp.openss,smg2000-pcsamp-1.openss" threadMax

Some exceptions do apply. For example, some experiments such as usertime and hwctime have "details" type metrics output by the list metrics CLI command (list -v metrics). These will not work as a metric argument to osscompare.

For the hardware counter experiments hwc and hwctime, you can use the actual PAPI event name in addition to the metric names output from the list metrics command. The example database file was generated using the PAPI_TOT_CYC event:

openss -cli -f smg2000-hwc.openss
openss>>[openss]: The restored experiment identifier is: -x 1
openss>>list -v metrics
hwc overflows, hwc percent, hwc threadAverage, hwc threadMax, hwc threadMin

Here we show a couple of osscompare examples where hwc overflows can be used interchangeably with PAPI_TOT_CYC:

osscompare "smg2000-hwc.openss,smg2000-hwc-1.openss" hwc overflows
osscompare "smg2000-hwc.openss,smg2000-hwc-1.openss" PAPI_TOT_CYC

Note that for compares involving hwcsamp me
penspeedshop oss dev OSS21
dk_setenv OPENSS_PLUGIN_PATH $OPENSS_PREFIX/lib64/openspeedshop
dk_setenv OPENSS_DOC $OPENSS_PREFIX/share/doc/packages/OpenSpeedShop
dk_alter PATH $OPENSS_PREFIX/bin
dk_alter LD_LIBRARY_PATH $OPENSS_PREFIX/lib64
dk_setenv DYNINSTAPI_RT_LIB $OPENSS_PREFIX/lib64/libdyninstAPI_RT.so
dk_setenv XPLAT_RSH rsh
dk_setenv OPENSS_MPI_IMPLEMENTATION mvapich
dk_test dk_cev OPENSS_RAWDATA_DIR -eq 0 && dk_setenv OPENSS_RAWDATA_DIR /p/scratchb/$USER

14 Additional Information and Documentation Sources

14 1 Final Experiment Overview

In the table below we match up a few general questions you may ask yourself with the experiments you may want to run in order to find the answer.

Where does my code spend most of its time?
- Flat profiles (pcsamp)
- Getting inclusive/exclusive timings with call paths (usertime)
- Identifying hot call paths (usertime HP analysis)
- Measure memory performance using hardware counters (hwc)
- Compare to flat profiles (custom comparison)
- Compare multiple hardware counters (N x hwc, hwcsamp)
- Study time spent in I/O routines (io, iot, and lightweight iop)
- Compare runs under different scenarios (custom comparisons)

How to identify memory problems:
- Study time spent in memory allocation/de-allocation routines (mem)
- Look for load imbalance (LB view) and outliers (CA view)

How do I find parallel inefficiencies in OpenMP and/or threaded applications?
- Study time sp
perations. The following table shows the number of Flops and Ops for each operation, where A, B, and C are NxN matrices, x and y are Nx1 vectors, and k is a scalar.

Operation     Refs or Ops   Flops   Flops/Ops   Benchmarks
y = kx + y    3n            2n      2/3
y = Ax + y    n^2           2n^2    2           Achieved in benchmarks
C = AB + C    4n^2          2n^3    n/2         Exceeds HW MAX

Below is an example of BLAS Level 1, using the osshwc or osshwcsamp experiment to get the following PAPI counters: PAPI_FP_OPS, PAPI_TOT_CYC, PAPI_LD_INS, PAPI_ST_INS, PAPI_TOT_INS. The derived metrics of interest are GFLOPS (giga floating-point operations per second), float_ops/cycle, instructions/cycle, loads/cycle, stores/cycle, and flops per memory op.

BLAS 1 Kernel: DAXPY, y = alpha*x + y

Kernel code (n = 10000, looped 100000 times for timing purposes):

    do i = 1, n
       y(i) = alpha*x(i) + y(i)
    enddo

The following table shows the PAPI data for this example:

Calculated: n = 10000; memory refs per pass (3n) = 30000; flops per pass (2n) = 20000; loop count = 100000
Measured:   PAPI_LD_INS = 1.02E+09, PAPI_SR_INS = 5.09E+08, PAPI_FP_OPS = 1.03E+09, PAPI_TOT_CYC = 2.04E+09, PAPI_TOT_INS = 2.43E+09
Derived:    instructions/cycle = 1.190989226, FP ops/cycle = 0.505386876, loads/cycle = 0.500489716, stores/cycle = 0.249412341

Error (PAPI FLOPS): 93.80%
Error (corrected): 3.10%
Error (Mem Refs): 2.15%
PAPI GFLOPS: 3.195244288
PAPI FLOPS/OPS: 0.673937178
Calculated FLOPS/OPS: 0.6666667

The processors used in this example have a Floating Multiply Add (FMADD) instruction set. Although this instruction performs two floating-point operations, it is counted as one floating-point instruction.
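The derived ratios above can be recomputed directly from the raw counters. This sketch plugs in the measured values from the table; the factor of two in the last line applies the FMADD correction described in the text (each counted FP instruction may be two flops).

```shell
# Recompute the DAXPY example's derived metrics from the raw PAPI counts.
awk 'BEGIN {
    ld  = 1.02e9    # PAPI_LD_INS
    st  = 5.09e8    # PAPI_SR_INS
    fp  = 1.03e9    # PAPI_FP_OPS (each FMADD counted once)
    cyc = 2.04e9    # PAPI_TOT_CYC
    ins = 2.43e9    # PAPI_TOT_INS

    printf "instructions/cycle : %.3f\n", ins / cyc
    printf "loads/cycle        : %.3f\n", ld / cyc
    printf "stores/cycle       : %.3f\n", st / cyc
    printf "flops/cycle        : %.3f\n", fp / cyc
    # FMADD correction: double the counted FP instructions.
    printf "flops/memory op    : %.3f\n", (2 * fp) / (ld + st)
}'
```

Note that the uncorrected flops-per-memory-op ratio (fp / (ld + st) = 0.674) matches the table's PAPI FLOPS/OPS entry, against the calculated 2/3 for BLAS 1.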
periment, because some options don't make sense for all experiments. The following table describes the icons and the functionality they represent:

I (Information): This option shows the metadata for the experiment. Information such as the experiment type, processes, ranks, threads, hosts, and other experiment-specific information is displayed.
U (Update): This option updates the information in the Stats Panel display. This can be used to display any new data that may have come from the nodes on which the application is running.
CL (Clear auxiliary information): If the user has chosen a time segment of the performance data, or a specific function to view the data for, this option clears those settings and allows the next view selection to show data for the entire program again.
D (Default View): The default view icon shows the performance results based on the view choice granularity selection.
S (Statements per Function): Show the performance results related back to the source statements in the application for the selected function. Highlight a function in the Stats Panel and click on this icon.
C+ (Call paths w/o coalescing): Show all the calling paths in this application. Duplicate paths will not be coalesced; all of the calling paths will be shown in their entirety.
(Call paths w/o coalescing per
poll (libc-2.12.so)
14.270000  47  11.780000  51  12.489062  sweep_ (sweep3d.mpi: sweep.f, 2)
 1.620000  43   0.840000  37   1.171875  PAMI::Interface::Context<PAMI::Context>::advance (libpami.so: ContextInterface.h, 158)
 1.320000  16   0.570000   3   0.871094  LapiImpl::Context::Advance<true,true,false> (libpami.so: Context.h, 220)
 1.130000  60   0.500000   2   0.778906  _lapi_dispatcher<false> (libpami.so: lapi_dispatcher.c, 57)
 1.110000  35   0.520000  49   0.751250  LapiImpl::Context::TryLock<true,true,false> (libpami.so: Context.h, 198)
 1.030000  42   0.600000  12   0.827656  __libc_enable_asynccancel (libc-2.12.so)
 0.950000  62   0.520000  38   0.746094  __libc_disable_asynccancel (libc-2.12.so)
 0.700000   6   0.200000  59   0.343125  _lapi_shm_dispatcher (libpami.so: lapi_shm.c, 2283)
 0.630000  33   0.250000   0   0.404375  __intel_ssse3_rep_memcpy (libirc.so)
 0.600000  18   0.270000  16   0.416875  udp_read_callback (libpamiudp.so)

5 1 3 4 osshwcsamp experiment Linked Object command and CLI view

openss>>expview -v linkedobjects

Exclusive CPU time in seconds | % of CPU Time | papi_l1_dcm | papi_l1_icm | papi_l1_tcm | papi_l1_ldm | papi_l1_stm | LinkedObject
928.310000  43.541541   9818946796  133244862   9952191658   9543597734  215608918  libc-2.12.so
811.920000  38.082373  47212355914  369525459  47581881373  46596204924  441601622  sweep3d.mpi
311.490000  14.610157   3356646038   44875637   3401521675   3255300343   80090932  libpami.so
 29.640000   1.390237   1824778610   12931604   1837710214   1680978945  127174346  libirc.so
 26.9
ported MPI implementations: openmpi, lampi, mpich, mpich2, mpt, lam, mvapich, mvapich2. For Cray, IBM, and Intel MPI implementations, use mpich2.

OPENSS_MPI_IMPLEMENTATION=<MPI impl name>

Example: export OPENSS_MPI_IMPLEMENTATION=openmpi

In most cases O|SS can auto-detect the MPI in use.

16 Hybrid openMP and MPI Performance Analysis

For this example tutorial we have run an Open SpeedShop convenience script on the NPB BT-MZ program and created a database file that has 4 ranks, each of which has 4 underlying openMP threads. What this example intends to show is that you can look at hybrid performance first at the MPI level, and then look under the MPI rank to see how the threads are performing. At the MPI level you can see load balance and outliers; then focus on a rank and look at load balance and outliers for the underlying threads. Within a terminal window we enter:

openss -f bt-mz.B.4-pcsamp-1.openss

to bring up the Open SpeedShop GUI. In the GUI view below we display the aggregated results for the application at the statement level granularity. When the default view first comes up, the view is at the function level granularity. To switch to the statement level, select the Statements button in the View Display Choice section on the right hand side of the Stats Panel display, and then click the D icon for the default view. This will switch the Stats Panel view to statement level granularity. Now the Stats Panel is displ
r performance analysis tools to gather information. In this case the tool can gather and store individual application events, for example function invocations, MPI messages, or I/O calls. The events recorded are typically time stamped and provide detailed per-event information. This method can lead to huge data volumes and higher, potentially bursty, overheads. There are a number of different performance analysis tools, so how do you select the right one for your application? A tool must have the right features for what you are trying to measure. Keep in mind which questions you are looking to answer and how deeply you want to analyze the code. A tool must also match your application's workflow, and may need access to, and knowledge about, the source code and the machine environment. Other things to keep in mind when choosing a tool are having a local installation of the tool and the availability of local support for the tool. Getting started on performance analysis can be a challenging and sometimes overwhelming undertaking, so it's a good idea to have some support system in place to help you through the hard parts. Parts of this manual will focus on general performance analysis information, followed by many detailed examples using the Open SpeedShop performance analysis tool. Open SpeedShop has an easy to use GUI and command line options; it includes both sampling and tracing in a single framework and doesn't require recompilation of the appli
racing, Floating Point Exception Analysis, Memory Function Tracing, POSIX Thread Function Tracing, and NVIDIA CUDA Event Tracing. In addition, Open SpeedShop is designed to be modular and extensible. It supports several levels of plug-ins, which allow users to add their own performance experiments. Open SpeedShop development is hosted by the Krell Institute. The infrastructure and base components of Open SpeedShop are released as open source code, primarily under LGPL.

Highlights include:
- Comprehensive performance analysis for sequential, multithreaded, and MPI applications
- No need to recompile the user's application
- Supports both first analysis steps as well as deeper analysis options for performance experts
- Easy to use GUI; fully scriptable through a command line interface and Python
- Supports Linux systems and clusters with Intel and AMD processors
- Extensible through new performance analysis plugins, ensuring consistent look and feel
- In production use on all major cluster platforms at LANL, LLNL, and SNL

Features include:
- Four user interface options: batch, command line interface, graphical user interface, and Python scripting API
- Supports multi-platform single system image (SSI) and traditional clusters
- Scales to large numbers of processes, threads, and ranks
- View performance data using multiple customizable views
- Save and restore performance experiment data and symbol information for post-experiment performance analysis
rdware resources. Modern memory systems are complex: they can have deep hierarchies and explicitly managed memory. Systems can implement Non-Uniform Memory Access (NUMA) or streaming/prefetching methods. The key to memory is locality. Are you accessing the same data repeatedly, or are you accessing neighboring data? You will want to look at your code's read/write intensity, the prefetch efficiency, the cache miss rate at all levels, TLB miss rates, and the overhead from NUMA. Some system differences can affect the computational intensity, like the cycles per instruction (CPI) or the number of floating point instructions. Other architectural features that can differ between systems include branches: the number of branches taken, and the miss speculation (wrong branch prediction) results. If your code is using anything like single instruction, multiple data (SIMD), or any type of multimedia or streaming extensions, the performance of all of these things could differ greatly from system to system. General system-wide information, including I/O busses, network counters, and power or temperature sensors, could all affect the performance of your code, but it can be difficult to relate this information to your source code. Hardware performance counters are used to keep track of architectural features. Typically, most features that are packaged inside the CPU allow counting hardware events transparently, without any overhead. Newer platforms also provide syste
rmance metric. A collector is a portion of the code that is included in the experiment plugin.

Metric: The measurement which the collector (experiment) is gathering. A metric could be a time, an occurrence counter, or another entity which reflects in some way on the application's performance, and is gathered by a performance experiment at application runtime directly by the collector.

Offline: "offline" is the Open SpeedShop default mode of operation. It is a link override mechanism that allows for gathering performance data by using libmonitor to link the Open SpeedShop performance data gathering software components into the user application. For this mode of operation the application must be run from start-up to completion. The performance results may be viewed after the application terminates normally.

Param: Each collector allows the user to set certain values that control the way a collector behaves. The parameter, or param, may cause the collector to perform various operations at certain time intervals, or it may cause a collector to measure certain types of data. Although Open SpeedShop provides a standard way to set a parameter, it is up to the individual collector to decide what to do with that information. Detailed documentation about the available parameters is part of the collector's documentation.

Framework: The set of API functions that allows the user interface to manage the creation and viewing of performance experiments. It is the interf
…corresponding to this row of data.

% of MPI Time: Percentage of exclusive MPI time spent in the MPI function corresponding to this row of data, relative to the total MPI time for all the MPI functions.

Number of Calls: Total number of calls to the MPI function corresponding to this row of data.

Min MPI Call Time Across Ranks: The minimum time that a rank (or ranks), across all ranks, spent in the corresponding MPI function.

Rank of Min: The number of the rank that had the minimum time spent in the MPI function across all the ranks of the application.

Max MPI Call Time Across Ranks: The maximum time that a rank (or ranks), across all ranks, spent in the corresponding MPI function.

Rank of Max: The number of the rank that had the maximum time spent in the MPI function across all the ranks of the application.

Average MPI Call Time Across Ranks: The average, for the default view, is the total amount of time for all the calls to a function divided by the total number of ranks. Thus it is the average time that each rank spends in the function. As such, it is comparable to the Max and Min of a rank that is in the same report.

This is an example of the CLI default view for the MPI (mpi, mpip) experiments. This is an example of the CLI load balance view for the MPI (mpip) experiment. This view shows the minimum, maximum, and average time per rank for each function, and the rank that represents the maximum time and m…
…Req Count, Bytes Requested

  70.40   62.52  924256  __cfree (libc-2.17.so)
  36.36   32.24  454089  23705  4  4  2064  37138180  __libc_malloc (libc-2.17.so)
   5.82    5.17   43212   1616  16  2  131072  11611200  realloc (libc-2.17.so)

This view illustrates how to find the call path of where the largest allocation in the application took place. The max_bytes metric lists the largest allocation value, and the -v calltrees,fullstack option indicates that we want to see the complete (full) and unique call paths. The mem3 parameter indicates that we only want to see the first three call paths, starting with the largest allocation value (max_bytes).

Find the call path for the largest allocation by using metric max_bytes in the calltree view:

openss>>expview -v calltrees,fullstack -m max_bytes mem3

Max Requested Bytes  Call Stack Function (defining location)

_start (smg2000)
> 562 in __libc_start_main (libmonitor.so.0.0.0: main.c,541)
>> __libc_start_main (libc-2.17.so)
>>> 517 in monitor_main (libmonitor.so.0.0.0: main.c,492)
>>>> 502 in main (smg2000: smg2000.c,21)
>>>>> 53 in HYPRE_StructSMGSetup (smg2000: HYPRE_struct_smg.c,48)
>>>>>> 337 in hypre_SMGSetup (smg2000: smg_setup.c,28)
>>>>>>> 408 in hypre_SMGRelaxSetup (smg2000: smg_relax.c,357)
>>>>>>>> 613 in hypre_SMGRelaxSetupASol (smg2000: smg_relax.c,540)
>>>>>>>>> 337 in hypre_SMGSetu…
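Conceptually, selecting on max_bytes amounts to picking the single largest request out of the recorded allocation events and reporting the path that made it. The sketch below shows the idea on a hypothetical trace of (call path, requested bytes) pairs; the paths and sizes are placeholders, not output from a real run.

```python
# Hypothetical allocation events recorded by a memory tracer:
# (call path, requested bytes). Paths and sizes are invented.
events = [
    ("main > hypre_SMGSetup > __libc_malloc",      37_138_180),
    ("main > hypre_SMGRelaxSetup > __libc_malloc",      2_064),
    ("main > realloc",                                131_072),
]

# max_bytes idea: the single largest allocation and the path that made it.
path, max_bytes = max(events, key=lambda e: e[1])

print(max_bytes)  # prints: 37138180
print(path)
```

Asking for the top three paths (as mem3 does) would simply sort the events by size and keep the first three.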
…s the OpenSpeedShop software & application environment."
puts stderr "\n\tThis adds $oss to several of the"
puts stderr "\tenvironment variables."
puts stderr "\n\tVersion $version\n"
}
module-whatis "loads the OpenSpeedShop runtime environment"

# for Tcl script use only
set version 2.2
set oss /opt/OSS21

setenv OPENSS_PREFIX $oss
setenv OPENSS_DOC_DIR $oss/share/doc/packages/OpenSpeedShop
prepend-path PATH $oss/bin
prepend-path MANPATH $oss/share/man

set unameexe /bin/uname
if { [file exists $unameexe] } {
    set machinetype [exec /bin/uname -m]
    if { $machinetype == "x86" || $machinetype == "i386" || $machinetype == "i486" || $machinetype == "i586" || $machinetype == "i686" } {

13.4.2 Example softenv file

This is an example of a softenv file used for a Blue Gene/Q installation. Use the resoft <filename of softenv file> command to activate the Open|SpeedShop runtime environment.

13.4.3 Example dotkit file

This is an example of a dotkit file used for a 64-bit cluster platform installation; it is not generalized to support different platforms other than the 64-bit cluster it was written for. Use the use <filename of dotkit file> command to activate the Open|SpeedShop runtime environment. Note: do not include the .dk portion of the filename when using the use command.

#c performance/profile
#d Open|SpeedShop Version 2.2
dk_setenv OPENSS_PREFIX /usr/global/tools/o…
…script that will help with linking. Calls to it are usually embedded inside an application's makefile. The user generally needs to find the target that creates the actual static executable, and create a collector target that links in the selected collector. The following is an example for re-linking the smg2000 application:

smg2000: smg2000.o
	@echo "Linking"
	$(CC) -o smg2000 smg2000.o $(LFLAGS)

smg2000-pcsamp: smg2000.o
	@echo "Linking"
	osslink -v -c pcsamp $(CC) -o smg2000-pcsamp smg2000.o $(LFLAGS)

smg2000-usertime: smg2000.o
	@echo "Linking"
	osslink -v -c usertime $(CC) -o smg2000-usertime smg2000.o $(LFLAGS)

smg2000-hwcsamp: smg2000.o
	@echo "Linking"
	osslink -v -c hwcsamp $(CC) -o smg2000-hwcsamp smg2000.o $(LFLAGS)

smg2000-io: smg2000.o
	@echo "Linking"
	osslink -v -c io $(CC) -o smg2000-io smg2000.o $(LFLAGS)

smg2000-iot: smg2000.o
	@echo "Linking"
	osslink -v -c iot $(CC) -o smg2000-iot smg2000.o $(LFLAGS)

smg2000-mpi: smg2000.o
	@echo "Linking"
	osslink -v -c mpi $(CC) -o smg2000-mpi smg2000.o $(LFLAGS)

Running the re-linked executable will cause the application to write the raw data files to the location specified by the environment variable OPENSS_RAWDATA_DIR. Normally, in the cluster environment where shared (dynamic) executables are being run, the conversion from raw data to an Open|SpeedShop database is done under the hood. However, in this case you must use the o…
…….......... 122
13.4.3 Example dotkit file .......................................................... 122
14 Additional Information and Documentation Sources ................. 123
14.1 Final Experiment Overview ............................................... 123
14.2 Additional Documentation .................................................. 124
15 Convenience Script Basic Usage Reference Information ......... 125
15.1 Suggested Workflow ........................................................... 125
15.2 Convenience Scripts ........................................................... 125
15.3 Report and Database Creation ........................................... 125
15.4 osscompare: Compare Database Files ............................... 125
15.5 osspcsamp: Program Counter Experiment ......................... 126
15.6 ossusertime: Call Path Experiment ..................................... 126
15.7 osshwc, osshwctime: HWC Experiments ............................ 127
15.8 osshwcsamp: HWC Experiment .......................................... 127
15.9 ossio, ossiot: I/O Experiments ............................................ 127
15.10 ossmpi, ossmpip, ossmpit: MPI Experiments ...................…
…ssutil command to create the database file manually. Of course, you can add the ossutil command to a batch script to eliminate the step of manually issuing that command. Once you have the Open|SpeedShop database files created, you can view them normally with the GUI or CLI. Below is an example of a job script that will execute these steps for you:

#PBS -q debug
#PBS -N smg2000_pcsamp

# must have a clean raw data directory each run
rm -rf /home/$USER/smg2000/test/raw
mkdir /home/$USER/smg2000/test/raw
setenv OPENSS_RAWDATA_DIR /home/$USER/smg2000/test/raw
setenv OPENSS_DB_DIR /home/$USER/smg2000/test
cd /home/jgalaro/smg2000/test

# needs -b to have the original executable path available and match
# where the application was run when doing ossutil
aprun -b -n 16 /home/$USER/smg2000/test/smg2000-pcsamp

# creates a X.0.openss database file; please load the module pointing
# to openspeedshop before accessing ossutil
ossutil /home/jgalaro/smg2000/test/raw

Open|SpeedShop needs the executable path that is used to process symbols after the run is complete to match where the executable was run. The executable path must match the path that is in the raw data that is written to the directory represented by OPENSS_RAWDATA_DIR. If the aprun -b option is not used, then the executable is run in a temporary system directory, and the raw data reflects that directory path for the executable instead of the path where the executable is located when th…
…256,1,1 | 64,1,1>>> (CUDA)

9 Memory Analysis Techniques

The Open|SpeedShop version with CBTF collection mechanisms supports tracing memory allocation and deallocation function calls in user applications. An event-by-event list of memory function call events, and the memory function call event arguments, are listed. The Open|SpeedShop experiment name for the memory analysis experiment is mem. The high-water memory mark is not currently available, but is coming in the future.

9.1 Memory Analysis Tracing (mem) experiment performance data gathering

To run the memory analysis experiment, use the ossmem convenience script and specify the application as an argument. If there are no arguments to the application, then no quotes are necessary, but they are placed here for consistency. Using the sweep3d application as an example here, the ossmem script will apply the memory analysis experiment by running the sweep3d application with the Open|SpeedShop memory trace collector, gather the data, and will create an Open|SpeedShop database file with the results of the experiment. Viewing of the performance information can be done with the GUI or CLI.

ossmem "mpirun -np 64 sweep3d.mpi"

9.2 Memory Analysis Tracing (mem) experiment performance data viewing with CLI

To launch the CLI on any experiment, use: openss -cli -f <database name>. The following table describes the fields in the memory experiment default CLI view.
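Although the mem experiment does not yet report the high-water mark, the concept is simple to sketch: replay the chronological allocation/deallocation trace and track the peak of the running total. The event sizes below are invented for illustration.

```python
# Hypothetical chronological memory events, in bytes:
# positive = allocation, negative = free of an earlier allocation.
events = [+1024, +4096, -1024, +512, -4096, -512]

current = high_water = 0
for delta in events:
    current += delta
    high_water = max(high_water, current)

print(high_water)  # prints: 5120 (peak live bytes at any point in the run)
```

Note that the high-water mark (5120 here) can be far below the total bytes requested (5632 here), which is why the Total Requested Bytes column alone does not tell you the peak footprint.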
…t file, so that performance reports can be generated from the performance data file alone. The application itself need not be present to view the performance data at a later time.

3.2 Performance Experiments Overview

Open|SpeedShop refers to the different performance measurements as experiments. Each experiment can measure and analyze different aspects of the code's performance. The experiment type, or type of data gathered, is chosen by the user. Any experiment can be applied to any application, with the exception of MPI-specific experiments being applied to non-MPI applications. Each experiment consists of collectors and views. The collectors define specific performance data sources: for example, program counter samples, call stack samples, hardware counters, or tracing of library routines. Views specify how the performance data is aggregated and presented to the user. It is possible to implement multiple collectors per experiment.

3.2.1 Individual Experiment Descriptions

The following table provides a quick overview of the different experiment types that come with Open|SpeedShop.

Experiment: Description

pcsamp: Periodic sampling of the program counter gives a low-overhead view of where the time is being spent in the user application.

usertime: Periodic sampling of the call path allows the user to view inclusive and exclusive time spent in application routines. It also allows the user to see which routines called which routines. Several…
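The inclusive/exclusive distinction that usertime reports can be sketched as follows: exclusive time is a routine's own time, while inclusive time adds in the time of everything it calls. The call tree and timings below are hypothetical.

```python
# Hypothetical call tree with per-routine self (exclusive) time in seconds.
exclusive = {"main": 1.0, "solve": 4.0, "io": 2.0}
children = {"main": ["solve", "io"], "solve": [], "io": []}

def inclusive(fn):
    """Inclusive time = own (exclusive) time plus the inclusive time of all callees."""
    return exclusive[fn] + sum(inclusive(c) for c in children[fn])

print(inclusive("main"))   # prints: 7.0
print(inclusive("solve"))  # prints: 4.0 (a leaf: inclusive == exclusive)
```

This is also why similar inclusive and exclusive times for a routine mean its children are insignificant contributors.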
…t the call paths to MPI_Wait to determine why the wait was occurring. The mpit experiment has a performance information entry for each MPI function call. In addition to the time spent in each MPI function, information like source and destination rank, and bytes sent or received, are also available. You can selectively view the information you desire. Below we see the default event view for an MPI application.

[Screenshot: the default event view, showing per-event entries with call paths ending in MPI functions such as MPI_Waitall (libmpi.so).]

We can create our own event view with the OV button.

[Screenshot: the Stats Panel showing Exclusive MPI Call Time (ms) and % of Total columns for MPI_Waitall (libmpi.so.0.0.0: pwaitall.c,39), MPI_Allreduce (libmpi.so.0.0.0: pallreduce.c,40), MPI_Init (libmpi.so.0.0.0: pinit.c,41), and MPI_Finalize, along with the dialog used to choose the optional fields (columns) to include in the creation of a new view of the existing data.]
…tric-based databases. To compare all of the existing hardware counters from each experiment, use the allEvents metric in the osscompare command. That will compare all the events in each of the databases, and will ignore the program counter sampling data from each of the databases. The form of the osscompare command to compare all the hardware counter events is as follows:

osscompare "smg2000-hwcsamp.openss,smg2000-hwcsamp-1.openss" allEvents

10.1.2 osscompare rows of output argument

osscompare allows the user to specify how many lines of the comparison output to be output. The argument is optional, and rows=nn is defined as follows:

nn: Number of rows (lines) of performance data output.

In this example, only ten (10) lines of comparison will be shown when the osscompare command is executed. It will be the most interesting, or top, ten lines.

osscompare "smg2000-hwc.openss,smg2000-hwc-1.openss" hwc::overflows rows=10

10.1.3 osscompare output name argument

osscompare allows the user to specify the name to be used when writing out the comparison output files. The argument is optional, and oname=<output file name> is defined as follows:

output filename: Name given to the output files created for the comparison.

This argument is valid when the environment variable OPENSS_CREATE_CSV is set to 1. In this example, the comparison files created when the osscompare command is executed will be named smg_hwc_cmp.csv and…
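The effect of the rows=nn argument is simply to truncate the comparison to its nn most significant lines. A sketch of that behavior, with made-up comparison rows sorted by metric value:

```python
# Hypothetical comparison rows: (function name, metric value).
# Names and values are invented for illustration.
rows = [("hypre_SMGResidual", 12.0), ("memcpy", 3.5),
        ("hypre_CyclicReduction", 9.1), ("sqrt", 0.2)]

def top_rows(rows, nn):
    """Keep only the nn largest-valued lines, as rows=nn does."""
    return sorted(rows, key=lambda r: r[1], reverse=True)[:nn]

print(top_rows(rows, 2))
# prints: [('hypre_SMGResidual', 12.0), ('hypre_CyclicReduction', 9.1)]
```

With rows=10 and fewer than ten lines of data, the whole comparison is shown.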
…ult | f_t_list ]

Parallel job example: ossmpi "mpirun -np 128 smg2000 -n 50 50 50"

Additional arguments: default = trace all MPI functions

<f_t_list>: Comma-separated list of MPI functions to trace, consisting of zero or more of MPI_Allgather, ..., MPI_Waitsome, and/or zero or more of the MPI group categories:

MPI Category: Argument
All MPI Functions: all
Collective Communicators: collective_com
Persistent Communicators: persistent_com
Synchronous Point to Point: synchronous_p2p
Asynchronous Point to Point: asynchronous_p2p
Process Topologies: process_topologies
Groups Contexts Communicators: graphs_contexts_comms
Environment: environment
Datatypes: datatypes
MPI File I/O: fileio

15.11 ossfpe: FP Exception Experiment

General form: ossfpe "<command> <args>" [ default | f_t_list ]

Sequential job example: ossfpe "smg2000 -n 50 50 50"

Parallel job example: ossfpe "mpirun -np 128 smg2000 -n 50 50 50"

Additional arguments: default = trace all floating point exceptions

<f_t_list>: Comma-separated list of exceptions to trace, consisting of one or more of: inexact_result, division_by_zero, underflow, overflow, invalid_operation

15.12 ossmem: Memory Analysis Experiment

General form: ossmem "<command> <args>" [ default | f_t_list ]

Sequential job example: ossmem "smg2000 -n 50 50 50"

Parallel job example: ossmem "mpirun -np 128 smg2000 -n 50 50 50"
…unction in the Stats Panel and click on this Function icon. Duplicate paths will not be coalesced; all of the calling paths will be shown in their entirety.

HC (Hot Call Path): Show the call path in the application that took the most time. This is a shortcut to find the hot call path.

B (Butterfly): Show the butterfly view, which displays the callers and callees of the selected function. Highlight a function in the Stats Panel and click on this icon. Then repeat to drill down into the callers and/or callees. For the segment selected, a new performance data report is generated. Annotation source defaults are different for each experiment, but mostly time.

LB (Load Balance): Show the load balance view, which displays the min, max, and average performance values for the application. Only available on threaded or multiple-process applications.

CA (Cluster Analysis): Show the comparative analysis view, which displays the output of a cluster analysis algorithm run against the threaded or multiple-process performance analysis results for the user application. The goal of this view is to find outlying threads or processes, and report the groups of like-performing threads, processes, or ranks.

CC (Custom Compare): Raise the custom comparison panel, which provides mechanisms allowing the user to create custom views of the performance analysis results. This allows the user to supplement the provided Open|SpeedShop views.

11.5.1.2 View Display Choice Selection

The View Displ…
…….......... 58
7.3.1 I/O Base Tracing (io) experiment ................................................................. 58
7.3.1.1 I/O Base Tracing (io) experiment performance data gathering ............... 58
7.3.1.2 I/O Base Tracing (io) experiment performance data viewing with CLI ... 58
7.3.1.3 I/O Base Tracing (io) experiment performance data viewing with GUI .. 59
7.3.2 I/O Extended Tracing (iot) experiment ........................................................ 59
7.3.2.1 I/O Extended Tracing (iot) experiment performance data gathering ...... 59
7.3.2.2 I/O Extended Tracing (iot) experiment performance data viewing with GUI ... 59
7.3.2.3 I/O Extended Tracing (iot) experiment performance data viewing with CLI ... 62
7.4 Open|SpeedShop Lightweight I/O Profiling General Usage ........................... 64
7.4.1 I/O Profiling (iop) experiment performance data gathering ........................ 64
7.4.2 I/O Profiling (iop) experiment performance data viewing with GUI ............ 65
7.4.3 I/O Profiling (iop) experiment performance data viewing with CLI ............. 66
8 Applying Experiments to Parallel Codes ............................................................ 69
8.1 MPI Tracing Experiments (mpi, mpit) .............................................................. 71
8.1.1 MPI Tracing Experiments (mpi, mpit) .......................................................... 81
8.1.1.1 MPI Trac…
…unt: stripe_count: 16, stripe_size: 1048576, stripe_offset: -1

1 PE writes: BW limited
1 file per process: BW enhanced
Subset of PEs do I/O: could be most optimal

Using OOCORE I/O performance and the __libc_read time from Open|SpeedShop, the following graph shows the output of an I/O experiment used to identify optimal lfs striping, from the load balance view (max, min, and avg) for a 16-way parallel run.

[Figure: bar graph of minimum wall time in seconds for stripe counts of 1, 4, 8, and 16.]

7.3 Open|SpeedShop I/O Tracing and I/O Profiling

An example of how to use the Open|SpeedShop usertime experiment to profile I/O is shown below. This example compares Open|SpeedShop data to instrumentation data.

[Figure: program output reporting PERFORMANCE TIME IN SECONDS, including total time, matmul time (53.80486 seconds), and I/O time (45.86675 seconds).]

Open|SpeedShop also has an iot experiment for extended I/O tracing. It will record each event in chronological order, and collect additional information like function parameters and function return values. You should use the extended I/O tracing when you want to trace the exact order of events, or when you want to see the return va…
…divided by the total number of calls. Thus it is the average time that each individual call spends in the function. As such, it is comparable to the Max (maximum) and Min (minimum) of a call to the function that is in the same min/max/average report. Alternatively, if a user does an expview -m ThreadMin,ThreadMax,ThreadAve, then the Max, Min, and Average in the report are related back to the individual ranks. Another way of saying it is: the average is the total amount of time for all the calls to a function divided by the total number of ranks. Thus it is the average time that each rank spends in the function. As such, it is comparable to the Max and Min of a rank that is in the same report. If the number of ranks is the same as the number of calls, the two different calculations should produce the same result. This would be true if all the calls were in a single thread, or there were one in each rank, as there is for MPI_Init. The expview -m min,max,average view can expose load imbalance by showing when the minimum and maximum times for asynchronous MPI functions have large differences. This situation indicates that some of the MPI asynchronous functions ran quickly (low minimum times), but some had to wait a long time to get started (large maximum times). Many times, the function calls that ran quickly were the last to arrive, and actually are from ranks that are not running as well as the others, causing l…
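The two averaging definitions, and the load-imbalance signal described above, can be sketched with a hypothetical per-rank timing trace for a single MPI function (all numbers invented):

```python
# Hypothetical trace: rank -> list of times (s) for each call that rank made.
calls = {0: [0.10, 0.12], 1: [0.50], 2: [0.30, 0.28, 0.31]}

total = sum(t for ts in calls.values() for t in ts)
n_calls = sum(len(ts) for ts in calls.values())

# Default view: average over individual calls.
avg_per_call = total / n_calls
# -m ThreadMin,ThreadMax,ThreadAve style: average over ranks.
avg_per_rank = total / len(calls)

# Per-rank totals drive the min/max comparison; a large max-min gap
# across ranks is the load-imbalance signal discussed in the text.
per_rank = {r: sum(ts) for r, ts in calls.items()}
imbalance = max(per_rank.values()) - min(per_rank.values())
```

With this data the two averages differ (about 0.27 s per call versus about 0.54 s per rank); they coincide only when each rank makes exactly one call, as for MPI_Init.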
[Screenshot: GUI Stats Panel view of the same data, listing MCA/opal datatype pack and unpack routines.]

Remember that you do not always need to use the Open|SpeedShop GUI to examine the output of experiments; you can also use the command line interface to view all of the same information. For example, the same output from above can be seen on the command line.

5.1.1 Hardware Counter Sampling (hwcsamp) experiment performance data gathering

The hardware counter sampling experiment convenience script is osshwcsamp. Use this convenience script in this manner to gather counter values for up to six (6) unique hardware counters:

osshwcsamp "how you normally run your application" <papi_event_list> <sampling_rate>

5.1.1.1 Hardware Counter Sampling (hwcsamp) experiment parameters

The hwcsamp experiment is timer based, not threshold based. What that means is that a timer is used to periodically interrupt the processor. For the hwcsamp experiment, each time the timer interrupts the processor, the values of the hardware counter events specified will be read, and reset to 0 for the next timer cycle. This is repeated until the program finishes. Open|SpeedShop allows the user to control the sampling rate. The following is an example of how to gather data for the smg2000 application on a Linux cluster platform, using the osshwcsamp convenience script and specifying a specific set of PAPI hwc events. In the next example, t…
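The timer-based scheme described above can be sketched as: at each timer interrupt the counter values are read and reset, and the experiment's totals are the sums over all interrupts. The per-tick readings below are invented for illustration.

```python
# Hypothetical counter readings captured at each timer interrupt
# (each counter is reset to 0 after being read, so readings are deltas).
ticks = [
    {"PAPI_TOT_CYC": 120, "PAPI_FP_OPS": 40},
    {"PAPI_TOT_CYC":  95, "PAPI_FP_OPS": 10},
    {"PAPI_TOT_CYC": 310, "PAPI_FP_OPS": 90},
]

# Experiment totals: sum each event across all ticks.
totals = {}
for reading in ticks:
    for event, count in reading.items():
        totals[event] = totals.get(event, 0) + count

print(totals)  # prints: {'PAPI_TOT_CYC': 525, 'PAPI_FP_OPS': 140}
```

Because each tick's readings can also be attributed to the program counter at the interrupt, the per-tick deltas are what let hwcsamp relate counter events back to source lines.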
…xperiment, use: openss -f <database name>

7.3.2 I/O Extended Tracing (iot) experiment

7.3.2.1 I/O Extended Tracing (iot) experiment performance data gathering

The extended I/O tracing (iot) experiment convenience script is ossiot. Use this convenience script in this manner to gather extended I/O tracing performance data:

ossiot "how you normally run your application" <list of I/O function(s)>

The following is an example of how to gather data for the IOR application on a Linux cluster platform using the ossiot convenience script. It gathers performance data for all the I/O functions, because there is no list of I/O functions specified after the quoted application run command:

ossiot "srun -n 512 IOR"

7.3.2.2 I/O Extended Tracing (iot) experiment performance data viewing with GUI

To launch the GUI on any experiment, use: openss -f <database name>

This is the default GUI view for the iot experiment. This view gives a summary of the I/O functions that were called, how many times they were called, and the amount of time spent in each function. The percentage of the total I/O time is also attributed to each I/O function. The time is aggregated (totaled) across all the threads, ranks, or processes that were part of the application. The table below describes what the columns of data represent. The functions that called the I/O functions are available by choosing one of the call path views.

Column Name:…
…copy 16 KB from host to device (CUDA)
2013/08/21 18:31:21.623   0.003424   0.000325  >>>>set 4 KB on device (CUDA)
2013/08/21 18:31:21.623   0.003392   0.000322  >>>>set 137 KB on device (CUDA)
2013/08/21 18:31:21.623   0.120896   0.011481  >>>>compute_degrees(int, int, int, int)<<<256,1,1 | 64,1,1>>> (CUDA)
2013/08/21 18:31:21.623  13.018784   1.236375  >>>>QTC_device(float, char, char, int, int, int, float, int, int, int, int, float, int, int, int, int, bool)<<<256,1,1 | 64,1,1>>> (CUDA)
2013/08/21 18:31:21.636   0.035232   0.003346  >>>>reduce_card_device(int, int)<<<1,1,1 | 1,1,1>>> (CUDA)
2013/08/21 18:31:21.636   0.002112   0.000201  >>>>>copy 8 bytes from device to host (CUDA)
2013/08/21 18:31:21.636   1.375616   0.130640  >>>>trim_ungrouped_pnts_indr_array(int, int, float, int, char, char, int, int, float, int, int, int, int, float, int, bool)<<<1,1,1 | 64,1,1>>> (CUDA)
2013/08/21 18:31:21.638   0.001344   0.000128  >>>>>copy 260 bytes from device to host (CUDA)
2013/08/21 18:31:21.638   0.025600   0.002431  >>>>update_clustered_pnts_mask(char, char, int)<<<1,1,1 | 64,1,1>>> (CUDA)
2013/08/21 18:31:21.638  11.724960   1.113503  >>>>QTC_device(float, char, char, int, int, int, float, int, int, int, int, float, int, int, int, int, bool)<<<…
…y tracks all ranks and threads during execution. It can also store the performance info per process, rank, or thread for individual evaluation. All of the experiments for Open|SpeedShop can be run on parallel jobs; collectors are applied to all ranks on all nodes. The results of an experiment can be displayed as an aggregation across all ranks or threads, which is the default view, or you can select individual or groups of ranks or threads to view. There are also experiments specifically designed for tracing MPI function calls. Open|SpeedShop has been tested with a variety of MPI versions, including Open MPI, MVAPICH2, and MPICH2, on Intel, Blue Gene, and Cray systems. Open|SpeedShop is able to identify the MPI task rank info through the MPIR interface for the online version, or through a PMPI preload for the offline version. To run MPI code with Open|SpeedShop, just include the MPI launcher as part of the executable as normal. Below are several examples:

> ossmpi "mpirun -np 128 sweep3d.mpi"
> osspcsamp "mpirun -np 32 sweep3d.mpi"
> ossio "srun -N 4 -n 16 sweep3d.mpi"
> openss -offline -f "mpirun -np 128 sweep3d.mpi" hwctime
> openss -online -f "srun -N 8 -n 128 sweep3d.mpi" usertime

The default view for parallel applications is to aggregate the information collected across all ranks. You can manually include or exclude individual ranks, processes, or threads to view their specific results. You can also com…