
Open|SpeedShop User Manual


Contents

2.1 Sequential Code Performance Analysis

You should identify the most computationally intensive parts of your application. Find out where your application spends most of its time: in modules or libraries, on particular statements in your code, or within certain functions. Check that most of the time is being spent in the computational kernels, and ask yourself whether the amount of time each section takes matches your intuition. Explore the impact of the memory hierarchy: check whether your application has excessive data cache misses, and find out where your data is located. One can also assess the impact of virtual memory Translation Lookaside Buffer (TLB) misses. Finally, check the interaction of your application with external resources by checking the efficiency of the I/O and looking at the time spent in system libraries.

[Figure: memory hierarchy diagram showing the CPU, L1 cache, L2 cache, main memory, and I/O]

2.2 Shared Memory Applications

Shared memory applications have a single shared storage that is accessible from any CPU. The programming models common to shared memory applications include threads, e.g., POSIX threads and OpenMP. The typical performance issues with shared memory applications include limited bus bandwidth, where a bottleneck occurs when many CPUs are trying to access the same resources. There can also be synchronization overhead ...
5.1.2.1 Getting the PAPI counter as the GUI's Source Annotation Metric

In order to make one of the PAPI or native hardware counters the counter that shows up in the source view, click on the SA icon, which represents Source Annotation. This brings up an option dialog that allows you to choose the source annotation metric.

[Figure: Source Panel Annotation Dialog shown over the Stats Panel for the smg2000 run]

In this example, the native counter we want to choose is L2_LD_PREFETCH. When we click to choose that counter and click on OK, the Stats Panel view will regenerate and the source annotation metric will become L2_LD_PREFETCH. The regenerated view now shows the results for only L2_LD_PREFETCH.

[Figure: regenerated Stats Panel view showing only the L2_LD_PREFETCH metric for functions such as hypre_SMGResidual, hypre_CyclicReduction, hypre_SemiInterp, and hypre_SemiRestrict]
If each process writes its own file, the parallel file system attempts to load balance across the OSTs, taking advantage of the stripe characteristics. Metadata server overhead can often create severe I/O problems: minimize the number of files accessed per PE, and minimize each PE doing operations like seek, open, close, and stat that involve inode information. I/O time is usually not measured, even in applications that keep some function profile. Open|SpeedShop can shed light on time spent in I/O using the io and iot experiments.

7.3 Open|SpeedShop I/O Tracing: General Usage

The Open|SpeedShop io and iot I/O function tracing experiments wrap the most common I/O functions. They record the time spent in each I/O function, the call path along which each I/O function was called, the time spent along each call path to an I/O function, and the number of times each function was called. In addition, the iot experiment also records information about each individual I/O function call: the values of the arguments and the return value from the I/O function are recorded.

7.3.1 I/O Base Tracing (io experiment)

The base I/O tracing experiment gathers data for the following I/O functions: close, creat, creat64, dup, dup2, lseek, lseek64, open, open64, pipe, pread, pread64, pwrite, pwrite64, read, readv, write, and writev. It is a trace-type experiment that wraps the real I/O calls and records information before and after calling the real I/O function.
>>>>153 in main (IOR: IOR.c,108)
>>>>>2161 in TestIoSys (IOR: IOR.c,1848)
>>>>>>195 in IOR_Open_POSIX (IOR: aiori-POSIX.c,173)
>>>>>>>670 in open64 (iot-collector-monitor-mrnet-mpi.so: wrappers.c,608)  4757.147988  0.157392  512
>>>>>>>>82 in __libc_open (libc-2.12.so: syscall-template.S,82)

_start (IOR)
>562 in __libc_start_main (libmonitor.so.0.0.0: main.c,541)
>>258 in __libc_start_main (libc-2.12.so: libc-start.c,96)
>>>517 in monitor_main (libmonitor.so.0.0.0: main.c,492)
>>>>153 in main (IOR: IOR.c,108)
>>>>>2013 in TestIoSys (IOR: IOR.c,1848)
>>>>>>2608 in WriteOrRead (IOR: IOR.c,2562)
>>>>>>>244 in IOR_Xfer_POSIX (IOR: aiori-POSIX.c,224)
>>>>>>>>321 in write (iot-collector-monitor-mrnet-mpi.so: wrappers.c,239)  316.176763  0.010461  2048
>>>>>>>>>82 in write (libc-2.12.so: syscall-template.S,82)

7.4 Open|SpeedShop Lightweight I/O Profiling: General Usage

The Open|SpeedShop iop I/O function profiling experiment wraps the most common I/O functions. It records the time spent in each I/O function, the call path along which each I/O function was called, the time spent along each call path to an I/O function, and the number of times each function was called.

7.4.1 I
six (6) hardware counters:

    osshwcsamp "how you normally run your application" [<papi_event_list>] [<sampling_rate>]

5.1.1.1 Hardware Counter Sampling (hwcsamp) experiment parameters

The hwcsamp experiment is timer based, not threshold based. What that means is that a timer is used to periodically interrupt the processor. For the hwcsamp experiment, each time the timer interrupts the processor, the values of the specified hardware counter events are read and reset to 0 for the next timer cycle. This is repeated until the program finishes. Open|SpeedShop allows the user to control the sampling rate.

The following is an example of how to gather data for the smg2000 application on a Linux cluster platform using the osshwcsamp convenience script and specifying a specific set of PAPI hwc events. In the second example the user is choosing to sample only 45 times a second instead of the default 100 times a second. Why would you want to do this? One reason would be to save database size: a lower sampling rate may still give an accurate portrayal of the application behavior.

    > osshwcsamp "mpirun -np 256 smg2000 -n 50 50 50" PAPI_L1_DCM,PAPI_L2_DCA,PAPI_L2_DCM,PAPI_L3_DCA,PAPI_L3_TCM
    > osshwcsamp "mpirun -np 256 smg2000 -n 50 50 50" PAPI_L1_DCM,PAPI_L2_DCA,PAPI_L2_DCM 45

5.1.2 Hardware Counter Sampling (hwcsamp) experiment performance data viewing with GUI

To launch the GUI on any experiment, use: openss -f <database name>
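The sampling-rate tradeoff above can be put in rough numbers. The sketch below is a back-of-the-envelope estimate, not an Open|SpeedShop command; the 600-second run time is an assumption, while the rates (100 Hz default, 45 Hz reduced) and the 256 ranks come from the example:

```shell
# Rough estimate of samples recorded over a run:
#   samples = sampling_rate * wall_seconds * ranks
# Lowering the rate shrinks the recorded data (and database) proportionally.
rate_default=100   # default sampling rate (Hz)
rate_low=45        # reduced rate from the example
seconds=600        # assumed 10-minute run
ranks=256
echo $((rate_default * seconds * ranks))   # samples at 100 Hz
echo $((rate_low * seconds * ranks))       # samples at 45 Hz
```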
$(CC) -o smg2000 smg2000.o $(LFLAGS)

smg2000-pcsamp: smg2000.o
	@echo "Linking ..."
	osslink -v -c pcsamp $(CC) -o smg2000-pcsamp smg2000.o $(LFLAGS)

smg2000-usertime: smg2000.o
	@echo "Linking ..."
	osslink -v -c usertime $(CC) -o smg2000-usertime smg2000.o $(LFLAGS)

smg2000-hwcsamp: smg2000.o
	@echo "Linking ..."
	osslink -v -c hwcsamp $(CC) -o smg2000-hwcsamp smg2000.o $(LFLAGS)

smg2000-io: smg2000.o
	@echo "Linking ..."
	osslink -v -c io $(CC) -o smg2000-io smg2000.o $(LFLAGS)

smg2000-iot: smg2000.o
	@echo "Linking ..."
	osslink -v -c iot $(CC) -o smg2000-iot smg2000.o $(LFLAGS)

smg2000-mpi: smg2000.o
	@echo "Linking ..."
	osslink -v -c mpi $(CC) -o smg2000-mpi smg2000.o $(LFLAGS)

Running the re-linked executable will cause the application to write the raw data files to the location specified by the environment variable OPENSS_RAWDATA_DIR. Normally, in the cluster environment where shared executables are being run, the conversion from raw data to an Open|SpeedShop database is done under the hood. However, in this case you must use the ossutil command to create the database file manually. Of course, you can add the ossutil command to a batch script to eliminate the step of manually issuing that command. Once you have the Open|SpeedShop database files created, you can view them normally with the GUI or CLI. Below is an example of a job script that will execute these steps for you:

#PBS -q debug
#PBS -N ...
The Memory Pyramid illustrates the impact of memory on the performance of an application. The closer the memory is to the CPU, the faster and smaller it will be; memory further away from the CPU is slower but larger. The most expensive operation is moving data: the application can only do useful work on the data at the top of the pyramid. For a given algorithm, serial performance is all about maximizing the CPU flop rate and minimizing memory operations in scientific code.

[Figure: memory pyramid, from the CPU at the top down through the L1 cache, L2 cache, shared L3 cache, main memory, and disk]

The table below shows the access latencies in clock cycles for the Nehalem Intel processor.

[Table: access latency in clock cycles for each level of the memory hierarchy, L1 cache through main memory]

The following example uses BLAS operations to illustrate the impact of moving data. BLAS operations are Basic Linear Algebra Subprograms that provide library function calls for vectors and matrices. We use the Flops/Ops ratio to understand how sections of the code relate to simple memory access patterns as typified by these BLAS operations. The following table shows the number of Flops and Ops for each operation, where A, B, and C are NxN matrices, x and y are Nx1 vectors, and k is a scalar.

    Operation     Refs or Ops   Flops    Flops/Ops   Benchmarks
    y = Ax + y    n^2           2n^2     2           Achieved in benchmarks
    C = AB + C                           n/2         Exceeds HW max

Below is an example of the BLAS Level 1, using the experiment osshwc or osshwcsamp to get the following PAPI counters ...
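The ratios in the table can be checked numerically. The sketch below assumes the standard BLAS operation counts (2n^2 flops over roughly n^2 memory references for a matrix-vector update, 2n^3 flops over roughly 4n^2 references for a matrix-matrix update); n = 1000 is an arbitrary illustrative size:

```shell
# Flops/Ops ratios for BLAS-style operations at n = 1000.
# Assumed accounting: y = A*x + y does 2n^2 flops over ~n^2 refs;
#                     C = A*B + C does 2n^3 flops over ~4n^2 refs.
awk 'BEGIN {
  n = 1000
  printf "matvec flops/ops = %g\n", (2 * n^2) / n^2        # constant: 2
  printf "matmul flops/ops = %g\n", (2 * n^3) / (4 * n^2)  # grows as n/2
}'
```

The point of the exercise is the trend: Level 3 BLAS operations do ever more useful work per byte moved as n grows, which is why they can approach (or exceed, with caching) the hardware flop-rate maximum.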
-n 50 50 50"

Parallel job example: osspthread "mpirun -np 128 smg2000 -n 50 50 50"

Additional arguments: by default, all POSIX thread functions are traced. <f_t_list> is a comma-separated list of exceptions to trace, consisting of one or more of: pthread_create, pthread_mutex_init, pthread_mutex_destroy, pthread_mutex_lock, pthread_mutex_trylock, pthread_mutex_unlock, pthread_cond_init, pthread_cond_destroy, pthread_cond_signal, pthread_cond_broadcast, pthread_cond_wait, and pthread_cond_timedwait.

15.14 osscuda: NVIDIA CUDA Tracing Experiment

General form: osscuda "<command> <args>"
Sequential job example: osscuda "eigenvalues --matrix-size 4096"
Parallel job example: osscuda "mpirun -np 64 -npernode 1 lmp_linux_sf_gpu < in.j"

15.15 Key Environment Variables

EXECUTION-RELATED VARIABLES:

OPENSS_RAWDATA_DIR: Used on cluster systems where a /tmp file system is unique on each node. It specifies the location of a shared file system path, which is required for O|SS to save the raw data files on distributed systems. Usage: OPENSS_RAWDATA_DIR=<shared file system path>. Example: export OPENSS_RAWDATA_DIR=/lustre4/fsys/userid

OPENSS_ENABLE_MPI_PCONTROL: Activates the MPI_Pcontrol function recognition; otherwise MPI_Pcontrol function calls will be ignored by O|SS.

OPENSS_DATABASE_ONLY, OPENSS_RAWDATA_ONLY, OPENSS_DB_DIR, OPENSS_MPI_IMPLEMENTATION: When running the Open|SpeedSho...
recording of the counts of the event. This experiment (hwcsamp) is timer based, so Open|SpeedShop cannot take you exactly to the line of source where the hwc events are happening. hwcsamp is more of an overview experiment that tells the user which events are occurring; subsequently use hwc or hwctime to pinpoint where in the source the specified hardware counter event is occurring.

[Figure: Stats Panel for the hwcsamp experiment on smg2000, showing exclusive CPU time, % of CPU time, PAPI_TOT_CYC, and PAPI_FP_OPS per function (hypre_SMGResidual, hypre_CyclicReduction, opal_progress, hypre_SemiRestrict, hypre_SemiInterp, ...)]

5.1.3 Hard
The default threshold is set to a very large value to match the default event, PAPI_TOT_CYC. For all other events it is recommended that the user run hwcsamp first to get an idea of how many times a particular event occurs (the count of the event) during the life of the program. A reasonable threshold can be determined from the hwcsamp data by determining the average counts per thread of execution and then setting the hwc/hwctime threshold to some small fraction of that. For example, if you see 1333333333 PAPI_L1_DCMs over the life of the program when running the hwcsamp experiment, and there were 524 processes used during the application run, then this is the formula you could use to find a reasonable threshold for the hwc and hwctime experiments when using the PAPI_L1_DCM event for the same application:

    (Average counts per thread) / 1000 = Threshold for hwc/hwctime

In this case: 1333333333 / 524 = 2544529, and 2544529 / 1000 is approximately 2545.

Using this formula, one could use 2545 as the threshold value in hwc and hwctime for PAPI_L1_DCM and expect to get a reasonable data sample of that event.

5.1 Hardware Counter Sampling (hwcsamp) Experiment

The osshwcsamp experiment supports both derived and non-derived PAPI presets and is able to sample up to six counters at one time. Again, you can check the available counters by running osshwcsamp with no arguments. All native events are available, including architecture
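The threshold arithmetic above is easy to script. This sketch just reproduces the worked example with shell integer arithmetic (the counter total and process count are the figures from the text):

```shell
# Threshold for hwc/hwctime = (average event counts per thread) / 1000.
total=1333333333   # PAPI_L1_DCM count observed by hwcsamp over the whole run
threads=524        # number of processes/threads in the run
avg=$((total / threads))
threshold=$(( (avg + 500) / 1000 ))   # divide by 1000, rounding to nearest
echo "average counts per thread: $avg"
echo "hwc/hwctime threshold:     $threshold"
```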
GUI at any time; just issue the command opengui in the CLI.

11.1.2 CLI Metric Expressions and Derived Types

Open|SpeedShop has the capability to create derived metrics from the gathered metrics by using the metric expression math functionality in the command line interface (CLI). One can access the overview from the CLI by typing this help CLI command:

openss>>help metric_expression

    <metric_expression> ::= <string> | <constant> |
                            <metric_expression> <constant> <metric_expression>

    A user-defined expression that uses metrics to compute a special value for
    display in a report. User-defined expressions can be added to an
    <expMetric_list>. A functional notation is used to build the desired
    expression, and the following simple arithmetic operations are available:

    Function (arguments): returns
    Uminus (1): unary minus of the argument
    Abs    (1): absolute value of the argument
    Add    (2): summation of the arguments
    Sub    (2): difference of the arguments
    Mult   (2): product of the arguments
    Div    (2): first argument divided by second
    Mod    (2): remainder of divide operation
    Min    (2): minimum of the arguments
    Max    (2): maximum of the arguments
    A_Add  (1): sum of all the data samples specified for the view
    A_Mult (1): product of all the data samples specified for the view
    A_Min  (1): minimum of all the data samples specified for the view
    A_Max  (1): maximum of all the data samples specified for the view
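As an illustration of what a derived metric such as Div(PAPI_TOT_CYC, PAPI_TOT_INS) computes (cycles per instruction, CPI), the same arithmetic can be sketched outside the CLI. The counter values below are illustrative stand-ins, not output from a real run:

```shell
# Mimic the CLI's Div() on two hardware counter totals to get CPI
# (cycles per instruction).  Values are made-up, for illustration only.
awk 'BEGIN {
  cycles       = 11772604888   # stand-in for PAPI_TOT_CYC
  instructions = 11984869000   # stand-in for PAPI_TOT_INS
  printf "CPI = %.2f\n", cycles / instructions
}'
```

A CPI near 1 on a superscalar core usually leaves headroom; a CPI well above the core's issue-limited minimum often points at memory stalls, which is exactly the kind of question a derived metric in a report can answer at a glance.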
Gather and Understand Profiles

A profile is the aggregated measurements collected during the experiment. Profiles look at code sections over time. There are advantages to using profiles: they reduce the size of the performance data, and typically the data is collected with low overhead, so profiles can provide a good overview of the performance of an application. The disadvantage of using a profile is that you are required to know beforehand how to aggregate the data collected. Also, since profiles provide more of an overview, they omit the performance details of individual events. Selecting an inappropriate sampling frequency could also skew the results of the profile.

Statistical performance analysis is a standard profiling technique: it involves interrupting the execution of the application at periodic intervals to record the location of the execution (the program counter value). It can also be used to collect additional data like stack traces or hardware counters. Again, the advantage of this method is its low overhead. It is good for getting an overview of the program and finding the hotspots (time-intensive areas) within the program.

4.1 Program Counter Sampling Experiment

The sampling experiments available in Open|SpeedShop include Program Counter Sampling, Call Path Profiling, and Hardware Counter sampling. The Program Counter Sampling experiment (osspcsamp) provides approximate CPU time for each line and function in the
O Profiling (iop) experiment performance data gathering

The I/O Profiling (iop) experiment convenience script is ossiop. Use this convenience script in this manner to gather lightweight I/O profiling performance data:

    ossiop "how you normally run your application"

The following is an example of how to gather data for the IOR application on the Cray platform using the ossiop convenience script:

    ossiop "aprun -n 64 IOR"

7.4.2 I/O Profiling (iop) experiment performance data viewing with GUI

To launch the GUI on any experiment, use: openss -f <database name>

The first image below shows the default view for the iop experiment run on a 50000-rank IOR application job. The performance information in the default view is the time spent in I/O functions and the percentage of time spent in each I/O function.

[Figure: default Stats Panel view for the iop experiment, showing time and percentage of time for I/O functions such as open64, read, close, and lseek64 in libpthread]

In the image below, the hot call path view for the iop experiment run on a 50000-rank IOR application job is displayed. The performance information in the hot call path view is the top five call paths to each of the I/O functions that took the most time, the time spent in I/O functions, and the percentage o
a valid combination. In general, combinations that are valid will pass the test:

    > papi_event_chooser PRESET event1 event2 ... eventN

The output for a valid combination will contain: event_chooser.c PASSED

Here is an example using PAPI to check if a three-event combination is valid:

    > papi_event_chooser PRESET PAPI_FP_INS PAPI_LD_INS PAPI_SR_INS

    PAPI Version:             4.1.2.1
    Vendor string and code:   GenuineIntel (1)
    Model string and code:    Intel Nehalem (21)
    CPU Revision:             5.000000
    ...
    PAPI_VEC_SP  0x80000069  No  Single precision vector/SIMD instructions
    PAPI_VEC_DP  0x8000006a  No  Double precision vector/SIMD instructions
    Total events reported: 44
    event_chooser.c PASSED

Below shows the output of the osshwcsamp experiment with the counters for Total Cycles and Floating Point Operations.

[Figure: osshwcsamp Stats Panel output showing PAPI_TOT_CYC and PAPI_FP_OPS per function]

Remember that you do not always need to use the Open|SpeedShop GUI to examine the output of experiments; you can also use the command line interface to view all of the same information. For example, the same output from above can be seen on the command line.

5.1.1 Hardware Counter Sampling (hwcsamp) experiment performance data gathering

The hardware counter sampling experiment convenience script is osshwcsamp. Use this convenience script in this manner to gather counter values for up to
[Figure: Stats Panel view (Functions report) showing exclusive CPU time per function for the OpenMP hydra example, including opal_progress and mca_btl_sm component-progress functions]

Below we see the Exclusive CPU time on highlighted lines that indicate relatively high CPU times.

[Figure: Source Panel for the usertime experiment on the OpenMP hydra code, with the lines carrying high exclusive CPU time highlighted]

While performance tools will point out potential bottlenecks and hot areas, it is still up to the user to interpret most data in the correct context, as well as note areas of the code you may
barrier.c,56)

openss>>

Here we look at the difference between Rank 255 and Rank 0:

openss>>expview -r 255 -m exclusive_time
openss>>expview -r 0 -m exclusive_time

[Listing: side-by-side expview output for Rank 255 (left) and Rank 0 (right), showing exclusive MPI call time in ms per function. For example, MPI_Recv accounts for 138790.370 ms on Rank 255 versus 150332.974 ms on Rank 0, and MPI_Wait for 8841.088 ms versus 353.810 ms; smaller amounts are spent in MPI_Send, PMPI_Init, PMPI_Finalize, MPI_Irecv, MPI_Bcast, MPI_Barrier, and MPI_Allreduce.]

Next we see the hot call paths for MPI_Wait on Rank 255:

openss>>expview -r 255 -v calltrees,fullstack -f MPI_Wait

    Exclusive MPI Call Time (ms) | % of Total | Number of Calls | Call Stack Function (defining location)
[Listing: call paths for the lu.C.256 run through jacld_ (jacld.f,5), jacu_ (jacu.f,5), ssor_ (ssor.f,4), exchange_3_ (exchange_3.f,5), and __GI_memcpy (libc-2.5.so)]

Here is the load balance view based on functions:

[Figure: Stats Panel load balance view (min, max, average) across the 256 ranks of lu.C.256. For example, buts_ has a maximum exclusive time of 1.6000 s (on rank 17), a minimum of 0.4500 s (on rank 254), and an average of 1.2048 s; similar spreads are shown for blts_, jacld_, jacu_, ssor_, __GI_memcpy, and exchange_3_.]

[Figure: the corresponding load balance view aggregated by linked objects, with libmpich.so.1.0 consuming by far the largest share, followed by libc-2.5.so and other libraries]
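One way to quantify the imbalance such a view reveals is the ratio of the maximum to the average exclusive time; a value near 1 means well balanced. Using the buts_ numbers visible in the load balance view (max 1.6000 s, average 1.2048 s):

```shell
# Imbalance factor = max exclusive time / average exclusive time.
# Values (seconds) are the buts_ row from the load balance view.
awk 'BEGIN {
  max = 1.6000; avg = 1.2048
  printf "imbalance factor = %.2f\n", max / avg
}'
```

An imbalance factor around 1.3, as here, says the slowest rank spends roughly 30% more time in that function than the average rank, which is often a cue to look at the data decomposition.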
feature activated by clicking on the CA icon.

[Figure: Stats Panel after clicking the CA (Cluster Analysis) icon]

In this view, generated by clicking on the CA icon, we see that Open|SpeedShop has determined that there are four unique groups, where the aggregate time for the groups differs enough to report this to the user. The columns in the Stats Panel display show the times that are reflective of each of the ranks in the group. The information (I) icon can be used to view which ranks, etc., are included in each of the cluster groups.

[Figure: Comparative Analysis report showing the average time per function for each of the four cluster groups]

10 Advanced Analysis Techniques

Analyzing the results of a single performance experiment can be useful for debugging and tuning your code, but comparing the results of different experiments can show you how the performance of an application has changed. This is useful if you want to track how the perfo
file> command to activate the Open|SpeedShop runtime environment. Note: do not include the .dk portion of the filename when using the use command.

#c performance/profile
#d Open|SpeedShop Version 2.1
dk_setenv OPENSS_PREFIX /usr/global/tools/openspeedshop/oss-dev/OSS21
dk_setenv OPENSS_PLUGIN_PATH $OPENSS_PREFIX/lib64/openspeedshop
dk_setenv OPENSS_DOC $OPENSS_PREFIX/share/doc/packages/OpenSpeedShop
dk_alter PATH $OPENSS_PREFIX/bin
dk_alter LD_LIBRARY_PATH $OPENSS_PREFIX/lib64
dk_setenv DYNINSTAPI_RT_LIB $OPENSS_PREFIX/lib64/libdyninstAPI_RT.so
dk_setenv XPLAT_RSH rsh
dk_setenv OPENSS_MPI_IMPLEMENTATION mvapich
dk_test dk_cev OPENSS_RAWDATA_DIR eq 0 && dk_setenv OPENSS_RAWDATA_DIR /p/scratchb/$USER

14 Additional Information and Documentation Sources

14.1 Final Experiment Overview

In the table below we match up a few general questions you may ask yourself with the experiments you may want to run in order to find the answer.

Where does my code spend most of its time?
- Flat profiles (pcsamp)
- Getting inclusive/exclusive timings with callstacks (usertime)
- Identifying hot callpaths (usertime + HP analysis)
- Measure memory performance using hardware counters (hwc)
- Compare to flat profiles (custom comparison)
- Compare multiple hardware counters (N x hwc, hwcsamp)
- Study time spent in I/O routines (io, iot, and lightweight iop)
- Compare runs under different scenarios (custom comparison)
is shown when you hover over the icons.
launch with the -cli option:

    > openss -cli

There is also the immediate command (batch) interface. This uses the -batch flag:

    > openss -batch < <openss_cmd_file>
    > openss -batch -f "<exe>" <experiment>

Lastly, there is a Python scripting API, so you can launch Open|SpeedShop commands within a Python script:

    > python openss_python_script_file.py

11.1 Command Line Interface Basics

The CLI offers an interactive command line interface with processing like gdb or dbx. There are several interactive commands that allow you to create experiments, provide you with process/thread control, or enable you to view experiment results. You can find the full CLI documentation at http://www.openspeedshop.org/doc/cli_doc, but here we will briefly cover some important points. Here is a quick overview of some commands; those marked with * are only available for the online version.

Experiment Creation: expcreate, expattach
Experiment Control: expgo, expwait, expdisable, expenable
Experiment Storage: expsave, exprestore
Result Presentation: expview
Misc Commands: help, list, log, record, playback, history, quit

The following is a simple example to create, run, and view data from an experiment using the CLI:

openss>>expcreate -f "mutatee 2000" pcsamp    # Create an experiment using pcsamp with this application
openss>>expgo                                 # Run the experiment and create the database
openss>>expview                               # Displ
load <filename of module file> to activate the Open|SpeedShop runtime environment.

#%Module1.0#############################################################
##
## openss modulefile
##
proc ModulesHelp { } {
    global version openss
    puts stderr "\topenss - loads the OpenSpeedShop software & application environment"
    puts stderr "\n\tThis adds $oss/* to several of the"
    puts stderr "\tenvironment variables."
    puts stderr "\n\tVersion $version\n"
}

module-whatis "loads the OpenSpeedShop runtime environment"

# for Tcl script use only
set version 2.1
set oss /opt/OSS21

setenv OPENSS_PREFIX $oss
setenv OPENSS_DOC_DIR $oss/share/doc/packages/OpenSpeedShop
prepend-path PATH $oss/bin
prepend-path MANPATH $oss/share/man

set unameexe /bin/uname
if { [file exists $unameexe] } {
    set machinetype [exec /bin/uname -m]
    if { $machinetype == "x86" || $machinetype == "i386" || $machinetype == "i486"
         || $machinetype == "i586" || $machinetype == "i686" } {
        ...

13.4.2 Example softenv file

This is an example of a softenv file used for a Blue Gene/Q installation. Use the resoft <filename of softenv file> command to activate the Open|SpeedShop runtime environment.

13.4.3 Example dotkit file

This is an example of a dotkit file used for a 64-bit cluster platform installation; it is not generalized to support platforms other than the 64-bit cluster it was written for. Use the use <filename of dotkit
(ms)

>>>>main (lu.C.256)
>>>>>140 in MAIN__ (lu.C.256: lu.f,46)
>>>>>>180 in ssor_ (lu.C.256: ssor.f,4)
>>>>>>>213 in rhs_ (lu.C.256: rhs.f,5)
>>>>>>>>224 in exchange_3_ (lu.C.256: exchange_3.f,5)
>>>>>>>>>893 in mpi_wait_ (mpi-mvapich-rt-offline.so: wrappers-fortran.c,893)
>>>>>>>>>>885 in mpi_wait (mvapich-rt-offline.so: wrappers-fortran.c,885)
6010.978000  3.87  250  >>>>>>>>>>>51 in MPI_Wait (libmpich.so.1.0: wait.c,51)

>>>>main (lu.C.256)
>>>>>140 in MAIN__ (lu.C.256: lu.f,46)
>>>>>>180 in ssor_ (lu.C.256: ssor.f,4)
>>>>>>>64 in rhs_ (lu.C.256: rhs.f,5)
>>>>>>>>88 in exchange_3_ (lu.C.256: exchange_3.f,5)
>>>>>>>>>893 in mpi_wait_ (mpi-mvapich-rt-offline.so: wrappers-fortran.c,893)
>>>>>>>>>>885 in mpi_wait (mvapich-rt-offline.so: wrappers-fortran.c,885)
2798.770000  1.805823  250  >>>>>>>>>>>51 in MPI_Wait (libmpich.so.1.0: wait.c,51)

In this experiment we did program counter sampling to get an overview of the application. We noticed that smp_net_lookup showed up in the function load balance view, which caused us to take a look at the linked obj
of your code. There are built-in Unix commands like time or gprof that can give you some basic timing information. This manual describes how to use Open|SpeedShop, a robust performance tool capable of analyzing unmodified binaries. Throughout, the manual will show real-world examples of performance analysis using Open|SpeedShop.

2 How to use Performance Analysis

Performance analysis is an essential part of the development cycle and should be included as early as possible. It can have an impact on the patterns used in message passing, on the layout of the data structures used, and on the algorithms themselves. Your end goal should be correct and efficient code. Typically one would measure the performance of some code and analyze the results. You then modify the code or algorithms as appropriate and repeat the measurements from before, analyzing the differences in successive runs to ensure an increase in performance.

[Figure: development cycle from algorithm to code to binary, leading to correct code and efficient code]

The most basic performance analysis tool is the Unix time command, which can measure the CPU and wall clock time for an application. You could also keep track of the application's performance as you vary the input parameters. This type of performance analysis is very simple, but has the disadvantage of the measurements being coarse-grained and not allowing you to pinpoint any performance bottlenecks within the application. Another performance analysis
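As a minimal illustration of that coarse-grained approach, the shell's time command reports wall clock (real) and CPU (user, sys) time for a whole run; here sleep 1 merely stands in for a real application binary:

```shell
# Time an entire program run.  'real' is wall clock time;
# 'user' + 'sys' approximate the CPU time consumed.
# ('sleep 1' is a stand-in for your actual application command.)
time sleep 1
```

A large gap between real and user+sys time can itself be a first clue, e.g., time spent blocked in I/O or waiting on other processes, which is exactly what the finer-grained experiments later in this manual help pin down.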
osscompare allows the user to specify how many lines of the comparison output to be output. The argument is optional, and rows=nn is defined as follows:

    nn: number of rows (lines) of performance data to output.

In this example, only ten (10) lines of comparison will be shown when the osscompare command is executed. They will be the most interesting, or "top ten", lines:

    osscompare "smg2000-hwc.openss,smg2000-hwc-1.openss" hwc::overflows rows=10

10.1.3 osscompare output name (oname) argument

osscompare allows the user to specify the name to be used when writing out the comparison output files. The argument is optional, and oname=<output file name> is defined as follows:

    output filename: name given to the output files created for the comparison. This argument is valid when the environment variable OPENSS_CREATE_CSV is set to 1.

In this example, the comparison files created when the osscompare command is executed will be named smg_hwc_cmp.csv and/or smg_hwc_cmp.txt.

    osscompare "smg2000-pcsamp.openss,smg2000-pcsamp-1.openss" oname=mar2013_pcsamp_cmp

This example will generate comparison files named using the specified oname specification:

    -rw-rw-r-- 1 jeg jeg 4475 Mar 11 15:53 mar2013_pcsamp_cmp.compare.csv
    -rw-rw-r-- 1 jeg jeg 4841 Mar 11 15:53 mar2013_pcsamp_cmp.compare.txt

10.1.4 osscompare view type (or granularity) argument

osscompare allows an optional view type argument. It represents the granularity of the view
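The effect of the rows argument is like keeping only the top of an already-sorted report. A conceptual stand-in, using a made-up CSV in place of osscompare's real output:

```shell
# Conceptual illustration of rows=N: keep the top N lines of a report
# that is already sorted by the metric in the first column.
# (The file contents are made-up, not real osscompare output.)
printf '42.1,hypre_SMGResidual\n17.3,hypre_CyclicReduction\n9.9,opal_progress\n1.2,hypre_SemiRestrict\n' > report.csv
head -2 report.csv    # analogous to rows=2
```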
particular rank in order to see only that performance data in the Stats Panel view.

Note: Use the "focus on selected rank and underlying threads" Manage Process panel option to focus on all the threads within a rank. Right-mouse-button down on the Manage Process panel tab to see the options.

(Screenshot: Stats Panel view focused on an individual rank.)

16.2 Clearing Focus on an individual Rank to get back to default behavior

Note: Once you focus on individual ranks or groups of ranks, i.e. venturing away from the default aggregated views, you need to use the CL (clear auxiliary settings) icon to clear away all the optional selections and get back to looking at the aggregated results again.

(Screenshot: Stats Panel view listing the individual thread identifiers before the rank/thread focus is cleared.)
the right one for your application. A tool must have the right features for what you are trying to measure. Keep in mind which questions you are looking to answer and how deeply you want to analyze the code. A tool must also match your application's workflow, and may need access to, and knowledge about, the source code and the machine environment. Other things to keep in mind when choosing a tool are having a local installation of the tool and the availability of local support for the tool. Getting started on performance analysis can be a challenging and sometimes overwhelming undertaking, so it's a good idea to have some support system in place to help you through the hard parts. Parts of this manual will focus on general performance analysis information, followed by many detailed examples using the Open|SpeedShop performance analysis tool. Open|SpeedShop has an easy-to-use GUI and command line options; it includes both sampling and tracing in a single framework and doesn't require recompilation of the application. It is extensible through user-written plug-ins. Open|SpeedShop is also maintained and supported within the Tri-lab clusters, Blue Gene, and Cray platforms run by Lawrence Livermore, Los Alamos, and Sandia National Laboratories. It is also available at a number of other laboratories and businesses around the world. The following sections give a quick overview of what to look for in your performance analysis for different types of applications.
want to probe further. If the inclusive and exclusive times are similar, this means the child executions are insignificant with respect to CPU time, and it may not be useful to profile below this layer. If the inclusive time is significantly greater than the exclusive time, then you should focus your attention on the execution times of the children. The stack trace views in Open SpeedShop are similar to the well-known Unix profiling tool gprof.

(Screenshot: stack trace view showing hypre_SMGSolve (smg2000: smg_solve.c), hypre_SMGRelax (smg2000: smg_relax.c, 225), hypre_FinalizeIndtComputations (smg2000: computation.c, 997), and hypre_InitializeIndtComputations (smg2000: computation.c, 372).)

5 How to Relate Data to Architectural Properties

So far we have been focusing mostly on timing. Timing information shows where your code spends its time by displaying hot functions, statements, libraries, and hot call paths. But it doesn't show you why it is spending so much time in those areas. You need to know if the computationally intensive parts of the code are as efficient as they can be, to reduce the time spent there, or if there are resources that are constraining the execution of the code. These answers can be very platform dependent. Areas of bottlenecks can differ from system to system, and portability issues can cause a drop in performance. There may be a need to tune your code based on the architectural parameters of the system. In orde
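The rule of thumb above, descend into the children only when inclusive time dwarfs exclusive time, can be written as a tiny predicate. This is a sketch: the 1.5x cutoff is an arbitrary choice of ours for illustration, not an Open|SpeedShop constant.

```python
def should_descend(inclusive_s, exclusive_s, ratio=1.5):
    """Return True when the children account for enough time that
    profiling below this layer is worthwhile. The 1.5x ratio is an
    illustrative threshold, not a value prescribed by the tool."""
    return inclusive_s > ratio * exclusive_s

print(should_descend(8.0, 1.0))  # True: child executions dominate, descend
print(should_descend(2.1, 2.0))  # False: times are similar, stop here
```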
0 154744623 2075688 156820311 149803306 3979337 LapiImpl::Context::CheckContext (libpami.so: CheckParam.cpp, 21)
13.990000 0.656188 151052863 2000330 153053193 146967548 3167039 LapiImpl::Context::Unlock<true, true, false> (libpami.so: Context.h, 204)

5.1.3.2 osshwcsamp experiment CLI Status command and view

openss>>expstatus

Experiment definition
  ExpId is 1
  Status is NonExistent
  Saved database is L1_64PE/sweep3d.mpi-hwcsamp.openss
  Performance data spans 1:7.958138 (mm:ss) from 2013-03-27 22:32:45 to 2013-03-27 22:33:53
  Executables Involved: sweep3d.mpi
  Currently Specified Components:
    -h ys6128 -p 2765 -t 47176895393312 -r 3 sweep3d.mpi
    -h ys6128 -p 2766 -t 47824321252896 -r 0 sweep3d.mpi
    -h ys6128 -p 2767 -t 47369830317600 -r 1 sweep3d.mpi
    -h ys6128 -p 2768 -t 47378742910496 -r 2 sweep3d.mpi
    -h ys6129 -p 22862 -t 47327259860512 -r 5 sweep3d.mpi
    -h ys6129 -p 22863 -t 47201888194080 -r 6 sweep3d.mpi
    -h ys6129 -p 22864 -t 47185544437280 -r 7 sweep3d.mpi
    -h ys6250 -p 11462 -t 47028080107040 -r 63 sweep3d.mpi
    -h ys6250 -p 11463 -t 47600632852000 -r 60 sweep3d.mpi
    -h ys6250 -p 11464 -t 47494028697120 -r 61 sweep3d.mpi
    -h ys6250 -p 11465 -t 47944527175200 -r 62 sweep3d.mpi
  Previously Used Data Collectors: hwcsamp
  Metrics:
    hwcsamp::exclusive_detail
    hwcsamp::percent
    hwcsamp::threadAverage
    hwcsamp::threadMax
    hwcsamp::threadMin
    hwcsamp::time
  Parameter Values:
    hwcsamp::event = PAPI_L1_DCM, PAPI_L1_ICM, PAPI
0000 hypre_SemiInterp (smg2000: semi_interp.c, 126)
0.280000000 0.040000000 mca_pml_ob1_progress (libmpi.so.0.0.2: topo_unity_component.c, 0)

10.1 Comparison Script Argument Description

The Open|SpeedShop comparison script accepts a number of arguments. This section describes the acceptable options for those individual arguments. For a quick overview, see section 14.

osscompare Compare Database Files

As described above, the osscompare script accepts at least two, and up to eight, comma-separated database file names, enclosed in quotes, as the mandatory argument. By default, the compared metric is the primary metric produced by the experiment. For most experiments the metric is exclusive time; however, the hardware counter experiments use the count of the number of hardware counter overflows as the metric to be compared. These are the default, or mandatory, arguments to osscompare. The following sections describe the arguments for osscompare in more detail.

10.1.1 osscompare metric argument

The osscompare metric argument specifies the performance information type that Open|SpeedShop will use to compare against when looking at each database file in the compare database file list. To find the metric specifications that are legal and produce comparison outputs, one can open one of the database files with the Open|SpeedShop command line interface (CLI) and list the available metrics:

openss -cli -f smg2000-pcsamp.openss
openss>>list -v metrics
pcs
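Conceptually, comparing two database files means lining up one metric per function from each run, side by side. The sketch below shows that merge in Python; the function names and values are illustrative, and the real osscompare reads .openss databases rather than dictionaries.

```python
def side_by_side(run_a, run_b):
    """Merge two {function: metric} mappings into rows of
    (function, value_in_a, value_in_b), largest values first.
    Functions missing from one run get 0.0 in its column."""
    names = set(run_a) | set(run_b)
    rows = [(f, run_a.get(f, 0.0), run_b.get(f, 0.0)) for f in names]
    rows.sort(key=lambda row: max(row[1], row[2]), reverse=True)
    return rows

before = {"hypre_SMGResidual": 2.68, "hypre_SemiInterp": 0.28}
after = {"hypre_SMGResidual": 2.10, "hypre_SemiInterp": 0.27}
for name, a, b in side_by_side(before, after):
    print(f"{a:8.2f} {b:8.2f}  {name}")
```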
2, labeled c 3 below, there are two ranks, while the rest of the 512 ranks perform like group 1, labeled c 2 below. Investigating by examining rank 312 or 317, comparing it to one of the ranks in the other group, could shed some light on why group 2 is not similar to the rest. This may or may not be significant, but is shown here for illustration.

(Screenshot: cluster analysis view grouping the I/O call times per libc-2.12.so function, including __close, __libc_open, and __write from syscall-template.S.)

7.3.2.3 I/O Extended Tracing (iot) experiment performance data viewing with CLI

To launch the CLI on any experiment use: openss -cli -f <database name>

The command line interface (CLI) can provide the same data options as the graphical user interface (GUI) views. Here are some examples of the performance data that can be viewed, and the commands to generate the CLI views:

> openss -cli -f IOR-iot-0.openss
openss>>
The restored experiment identifier is -x 1
openss>>expview

I/O Call Time (ms) | % of Total Time | Number of Calls | Function (defining location)

1858436.714506  61.486889  2048  __close (libc-2.12.so: syscall-template.S, 82)
1055603.730633  34.924939  2048  __GI__read (libc-2.12.so: syscall-template.S, 82)
 108107.666680   3.576772  1024  __libc_open (libc-2.12.so: syscall-template.S, 82)
    335.820251   0.011111  3072  __write (libc-2.12.so: syscall-template.S, 82)
      8.756634   0.000290  4096  __GI___libc_lse
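Cluster analysis groups ranks with similar behavior and exposes stragglers like ranks 312 and 317 above. A much simpler stand-in is sketched below, flagging ranks whose I/O time falls more than two standard deviations from the mean; the timings are invented, and this is not the algorithm Open|SpeedShop itself uses.

```python
from statistics import mean, stdev

def outlier_ranks(io_time_by_rank, z=2.0):
    """Return the ranks whose I/O time sits more than z standard
    deviations from the mean across all ranks."""
    times = list(io_time_by_rank.values())
    mu, sigma = mean(times), stdev(times)
    if sigma == 0.0:
        return []
    return [r for r, t in io_time_by_rank.items() if abs(t - mu) > z * sigma]

# 510 well-behaved ranks plus two slow ones, echoing the example above.
io_times = {rank: 1.0 for rank in range(512)}
io_times[312] = 9.0
io_times[317] = 8.5
print(outlier_ranks(io_times))   # [312, 317]
```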
Now, double-clicking on the Stats Panel result line of choice will focus the source panel and use the PAPI or native counter that was chosen by using the Source Annotation dialog.

(Screenshot: Source Panel showing the hypre_StructMatrix and hypre_StructVector sources annotated with the selected counter metric.)

5.1.2.2 Viewing Hardware Counter Sampling Data with the GUI

To launch the GUI on any experiment use: openss -f <database name>

The GUI view below represents an example of the default view for the hardware counter sampling (hwcsamp) experiment. In the default view, the first set of performance data shown is program counter exclusive time: where the program is statistically spending its time and the percentage of time spent in each function of the program. The next information is the hardware counter event counts, listed in columns by the hardware counter event. Column three represents the counts that were recorded for PAPI_TOT_CYC, and column four represents the counts for PAPI_FP_OPS. What this view can indicate to the viewer is whether or not the specified hardware counter events are occurring and, if they are, how prevalent they are. With this information the user could isolate down to see exactly where a particular event is occurring by using the hwc or hwctime experiment. These two experiments are threshold based, which ultimately means you can map the performance data back to the source, because the actual event triggered the
After clearing the specific rank and/or thread selections, we can click the LB (load balance) icon, and Open SpeedShop will display the min, max, and average values across all the ranks in the hybrid code. This helps decide if there is imbalance across the ranks of the hybrid application. We can focus on individual ranks to see the balance across the OpenMP threads that are in an individual rank (next example image).

(Screenshot: load balance view showing min, max, and average values across all ranks of the hybrid application.)

Here we used the Manage Process panel "Focus on selected rank and underlying threads" menu options to view the load balance across the 4 OpenMP threads for the rank 0 process.

(Screenshot: load balance view across the four OpenMP threads of the rank 0 process.)

Please also explore the various options offered via a panel's pull-down menu. Clicking on a colored downward-facing arrow or using the Stats Panel icons can access further options. Red icons represent view options, such as updating the data or clearing the view options. The green icons correspond to different possible views of the performance data. The dark blue icons correspond to analysis options, while the light blue icon corresponds to information about the experiment. There is context-sensitive text that
71616 0.035292 >>>>> copy 2.1 MB from host to device
CUDA 2013-08-21 18:31:21.623  0.004608  0.000438 >>>>> copy 16 KB from host to device
CUDA 2013-08-21 18:31:21.623  0.003424  0.000325 >>>> set 4 KB on device
CUDA 2013-08-21 18:31:21.623  0.003392  0.000322 >>>> set 137 KB on device
CUDA 2013-08-21 18:31:21.623  0.120896  0.011481 >>>> compute_degrees(int, int, int, int)<<<(256,1,1), (64,1,1)>>>
CUDA 2013-08-21 18:31:21.623 13.018784  1.236375 >>>> QTC_device(float, char, char, int, int, int, float, int, int, int, int, float, int, int, int, int, bool)<<<(256,1,1), (64,1,1)>>>
CUDA 2013-08-21 18:31:21.636  0.035232  0.003346 >>>> reduce_card_device(int, int)<<<(1,1,1), (1,1,1)>>>
CUDA 2013-08-21 18:31:21.636  0.002112  0.000201 >>>>> copy 8 bytes from device to host
CUDA 2013-08-21 18:31:21.636  1.375616  0.130640 >>>> trim_ungrouped_pnts_indr_array(int, int, float, int, char, char, int, int, float, int, int, int, int, float, int, bool)<<<(1,1,1), (64,1,1)>>>
CUDA 2013-08-21 18:31:21.638  0.001344  0.000128 >>>>> copy 260 bytes from device to host
CUDA 2013-08-21 18:31:21.638  0.025600  0.002431 >>>> update_clustered_pnts_mask(char, char, int)<<<(1,1,1), (64,1,1)>>>
CUDA 2013-08-21 18:31:21.638 11.724960  1.113503
8.1.3 MPI Tracing Experiments performance data viewing with CLI

To launch the CLI on any experiment use: openss -cli -f <database name>

8.2 Threading Analysis Section

We just did an experiment that uses MPI, but we can do a similar analysis on applications that use threads. To analyze a threaded application, first we can run the pcsamp experiment to get an overview, then look at the load balance view to detect if there are any widely varying values, and finally do cluster analysis to find any outliers. The image below shows the default view for an application with 4 threads; the information displayed is the aggregated total from all threads.

(Screenshot: default pcsamp view for the 4-thread bt.W.x application, listing functions such as binvcrhs_, x_solve, y_solve, compute_rhs, matmul_sub_, matvec_sub_, and lhsinit_.)

Next we see the load balance view, based on functions.

(Screenshot: load balance view showing min, max, and average values per function across the 4 threads of bt.W.x.)
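The load balance step of this workflow boils down to summary statistics over per-thread times. The sketch below computes the min/max/average that the load balance view reports, plus a simple max-over-mean imbalance ratio of our own choosing; the thread timings are made up for illustration.

```python
def load_balance(time_by_thread):
    """Return (min, max, average, imbalance) over per-thread times.
    imbalance = max / average; a value near 1.0 means well balanced."""
    times = list(time_by_thread.values())
    avg = sum(times) / len(times)
    return min(times), max(times), avg, max(times) / avg

threads = {0: 10.2, 1: 9.8, 2: 10.1, 3: 14.5}  # thread 3 lags behind
lo, hi, avg, imb = load_balance(threads)
print(f"min={lo} max={hi} avg={avg:.2f} imbalance={imb:.2f}")
```

A widely varying min/max pair, or an imbalance well above 1.0, is the cue to move on to cluster analysis and find the outlier threads.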
Open|SpeedShop User Manual
February 4, 2014, Version 2.1
Contributions from Krell Institute, LANL, LLNL, SNL

Table of Contents

Why do I need Performance Analysis .................. 5
1 What is Performance Analysis .................. 8
2 How to use Performance Analysis .................. 9
2.1 Sequential Code Performance Analysis .................. 10
2.2 Shared Memory Applications .................. 10
2.3 Message Passing Applications .................. 11
3 Introduction to Open SpeedShop .................. 12
3.1 Basic Concepts Interface Workflow .................. 12
3.1.1 Terminology .................. 13
3.1.2 Concept of an Experiment .................. 14
3.2 Performance Experiments Overview ..................
Open SpeedShop allows for viewing performance data at three levels: linked object level, function level, and statement level. osscompare will produce output at one of those levels based on the view type argument, where viewtype=<functions | statements | linkedobjects> is defined as follows:

functions        View type granularity is per function.
statements       View type granularity is per statement.
linkedobjects    View type granularity is per library (linked object).

This example will produce a side-by-side comparison at the statement level, not the default function level. So this example will compare statement performance values in each of the two databases and produce a side-by-side comparison showing how each statement in the application differed between the two experiments:

osscompare "smg2000-pcsamp.openss,smg2000-pcsamp-1.openss" viewtype=statements

11 Open SpeedShop User Interfaces

Throughout this manual we have been using the Open SpeedShop GUI; we would encourage you to play around with the interface to become familiar with it. The GUI lets you peel off and rearrange any panel. There are also context-sensitive menus, so you can right-click on any location to access a different view or to activate additional panels. If you prefer not to use the GUI, there are three other options that all have equal functionality. First, there is the command line interface that we have also seen throughout this manual, which you can
PAPI_FP_OPS, PAPI_TOT_CYC, PAPI_LD_INS, PAPI_ST_INS, PAPI_TOT_INS. The derived metrics of interest are GFLOPS (giga floating-point operations per second), Float_ops/cycle, Instructions/cycle, Loads/cycle, Stores/cycle, and Flops/memory ops.

BLAS 1 Kernel: DAXPY, y = alpha*x + y

Kernel Code (n = 10,000, looped 100,000 times for timing purposes):

do i = 1, n
   y(i) = alpha * x(i) + y(i)
enddo

The following table shows the PAPI data for this example:

             n      Mem Ref (3n)   FLOPS Calc (2n)   Loop
BLAS code    10000  30000          20000             100000

PAPI_LD_INS   PAPI_SR_INS   PAPI_FP_OPS   PAPI_TOT_CYCLE   PAPI_TOT_INS
1.02E+09      5.09E+08      1.03E+09      2.04E+09         2.43E+09

6.4596E-06   3.096124 (GFLOPS)   0.505386876 (Float_ops/cycle)   1.190989226 (Instructions/cycle)   0.500489716 (Loads/cycle)   0.249412341 (Stores/cycle)

Error PAPI FLOPS   Error corrected   Error Mem Refs   PAPI_GFLOPS   PAPI FLOPS/OPS   Calc FLOPS/OPS
93.80              3.10              2.15             3.195244288   0.673937178      0.6666667

The processors used in this example have a Floating Multiply-Add (FMADD) instruction set. Although this instruction performs two floating-point operations, it is counted as one floating-point instruction in PAPI. Because of this, there are situations where PAPI_FP_INS may produce fewer floating-point counts than expected. In this example, PAPI_FP_OPS was multiplied by 2 to match the theoretical expected FLOP count. The formula for calculating Load Instructions was: (2 vectors x vec_length x loop x bytes_per_word (8) x bits_per_byte (8)) / 128 bits_per_load.

What can the Hardware Counter Metrics tel
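The derived metrics in the table above divide raw PAPI counts by cycles or seconds, with the FMADD doubling applied first. The sketch below shows that arithmetic; the counts fed in are rounded values in the spirit of the DAXPY table, not a re-derivation of its exact numbers.

```python
def derived_metrics(fp_ops, tot_cyc, tot_ins, ld_ins, st_ins, seconds, fmadd=True):
    """Compute derived rates from raw PAPI counts. On FMADD hardware each
    counted FP instruction performs two operations, so fp_ops is doubled
    (the correction described in the text)."""
    flops = fp_ops * (2 if fmadd else 1)
    return {
        "GFLOPS": flops / seconds / 1e9,
        "flops_per_cycle": flops / tot_cyc,
        "ins_per_cycle": tot_ins / tot_cyc,
        "loads_per_cycle": ld_ins / tot_cyc,
        "stores_per_cycle": st_ins / tot_cyc,
        "flops_per_mem_op": flops / (ld_ins + st_ins),
    }

m = derived_metrics(fp_ops=1.03e9, tot_cyc=2.04e9, tot_ins=2.43e9,
                    ld_ins=1.02e9, st_ins=5.09e8, seconds=1.0)
print(f"GFLOPS={m['GFLOPS']:.2f} flops/cycle={m['flops_per_cycle']:.3f}")
```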
U The image below shows the results of the MPI experiment in the default view.

(Screenshot: default MPI experiment view for smg2000 across 512 ranks, showing per MPI function the exclusive call time, average time, percentage of total, and number of calls for PMPI_Init, PMPI_Finalize, MPI_Allgatherv, MPI_Allgather, MPI_Barrier, MPI_Allreduce, MPI_Waitall, MPI_Isend, and MPI_Irecv.)

Next we see the MPI function call path view, shown below.

(Screenshot: Unique Call Paths view, reached by clicking the C+ icon, showing the unique call paths to MPI_Waitall and other MPI functions.)

Here is the default pcsamp view based on functions; the MPI library shows high in the list of time
_L1_TCM, PAPI_L1_LDM, PAPI_L1_STM
    hwcsamp::sampling_rate = 100
  Available Views: hwcsamp

5.1.3.3 osshwcsamp experiment CLI Load Balance command and view

openss>>expview -m loadbalance

Max CPU Time Across Ranks (s) | Min CPU Time Across Ranks (s) | Average CPU Time Across Ranks (s) | Function (defining location)

14.890000 (rank 28)  10.950000 (rank 27)  12.888594  __libc_poll (libc-2.12.so)
14.270000 (rank 47)  11.780000 (rank 51)  12.489062  sweep (sweep3d.mpi: sweep.f, 2)
 1.620000 (rank 43)   0.840000 (rank 37)   1.171875  PAMI::Interface::Context<PAMI::Context>::advance (libpami.so: ContextInterface.h, 158)
 1.320000             0.570000             0.871094  LapiImpl::Context::Advance<true, true, false> (libpami.so: Context.h, 220)
 1.130000             0.500000             0.778906  _lapi_dispatcher<false> (libpami.so: lapi_dispatcher.c, 57)
 1.110000             0.520000             0.751250  LapiImpl::Context::TryLock<true, true, false> (libpami.so: Context.h, 198)
 1.030000             0.600000             0.827656  __libc_enable_asynccancel (libc-2.12.so)
 0.950000             0.520000             0.746094  __libc_disable_asynccancel (libc-2.12.so)
 0.700000             0.200000             0.343125  _lapi_shm_dispatcher (libpami.so: lapi_shm.c, 2283)
 0.630000             0.250000             0.404375  __intel_ssse3_rep_memcpy (libirc.so)
 0.600000             0.270000             0.416875  udp_read_callback (libpamiudp.so)

5.1.3.4 osshwcsamp experiment CLI Linked Object command and view

openss>>expview -v linkedobjects

Exclusive | % of CPU | papi_l1_dcm | papi_l1_icm | papi_l1_tcm | pap
_l1_icm | papi_l1_tcm | papi_l1_ldm | papi_l1_stm | Function (defining location), with CPU time in seconds and % of Time:

824.870000  38.689781   8646497071   117738843   8764235914   8396159476  196649065  __libc_poll (libc-2.12.so)
799.300000  37.490443  46691996441   367096209  47059092650  46247555479  281624221  sweep (sweep3d.mpi: sweep.f, 2)
 75.000000   3.517807    782716992    10680760    793397752    757322217   20159725  PAMI::Interface::Context<PAMI::Context>::advance (libpami.so: ContextInterface.h, 158)
 55.750000   2.614903    597583047     8038242    605621289    579127274   14647999  LapiImpl::Context::Advance<true, true, false> (libpami.so: Context.h, 220)
 52.970000   2.484510    550761926     7569975    558331901    535841812   11563657  __libc_enable_asynccancel (libc-2.12.so)
 49.850000   2.338169    518605433     6979361    525584794    502551336   12757207  _lapi_dispatcher<false> (libpami.so: lapi_dispatcher.c, 57)
 48.080000   2.255149    488545916     6784192    495330108    476065093    9649598  LapiImpl::Context::TryLock<true, true, false> (libpami.so: Context.h, 198)
 47.750000   2.239671    479947719     6732551    486680270    471343480    6436257  __libc_disable_asynccancel (libc-2.12.so)
 26.680000   1.251401    275998769     3888499    279887268    269841454    4697170  udp_read_callback (libpamiudp.so: lapi_udp.c, 538)
 25.880000   1.213878   1522697263    12118336   1534815599   1507685061    9619348  __intel_ssse3_rep_memcpy (libirc.so)
 21.960000   1.030014    223197680     3086626    226284306    215787794    5879517  _lapi_shm_dispatcher (libpami.so: lapi_shm.c, 2283)
 14.910000   0.69934
(Screenshot: I/O profiling GUI view listing the libpthread-2.11.3.so I/O functions.)

7.4.3 I/O Profiling (iop) experiment performance data viewing with CLI

To launch the CLI on any experiment use: openss -cli -f <database name>

The command line interface (CLI) can provide the same data options as the graphical user interface (GUI) views. Here are some examples of the performance data that can be viewed, and the commands to generate the CLI views:

> openss -cli -f IOR-iop-1.openss
openss>>
The restored experiment identifier is -x 1
openss>>expview

Exclusive I/O call times (s) | Inclusive I/O call times (s) | % of Total Exclusive CPU Time | Function (defining location)

38297.339900  38297.339900  96.460929  __write (libpthread-2.11.3.so)
  741.019727    741.019727   1.866434  __open64 (libpthread-2.11.3.so)
  598.432332    598.432332   1.507294  __read (libpthread-2.11.3.so)
   63.383924     63.383924   0.159647  __close (libpthread-2.11.3.so)
    2.261454      2.261454   0.005696  __lseek64 (libpthread-2.11.3.so)

openss>>expview -v calltrees,fullstack

Exclusive I/O call times (s) | Inclusive I/O call times (s) | % of Total Exclusive CPU Time | Call Stack Function (defining location)

TestIoSys (IOR: IOR.c, 1848)
> 2608 in WriteOrRead (IOR: IOR.c, 2562)
>> 244 in IOR_Xfer_POSIX (IOR: aiori-POSIX.c, 224)
38297.339900  38297.339900  96.460929  >>> __write (libpthread-2.11.3.so)
TestIoSys (IOR: IOR.c, 1848) g
ach experiment consists of collectors and views. The collectors define specific performance data sources, for example program counter samples, call stack samples, hardware counters, or tracing of library routines. Views specify how the performance data is aggregated and presented to the user. It is possible to implement multiple collectors per experiment.

3.2.1 Individual Experiment Descriptions

The following table provides a quick overview of the different experiment types that come with Open SpeedShop.

Experiment   Description

pcsamp       Periodic sampling of the program counter gives a low-overhead view of where the time is being spent in the user application.

usertime     Periodic sampling of the call path allows the user to view inclusive and exclusive time spent in application routines. It also allows the user to see which routines called which routines. Several views are available, including the hot path.

hwc          Hardware events, including clock cycles, graduated instructions, instruction and data cache and TLB misses, and floating-point operations, are counted at the machine instruction, source line, and function levels.

hwcsamp      Similar to hwc, except that sampling is based on time, not PAPI event overflows. Up to six events may be sampled during the same experiment.

hwctime      Similar to hwc, except that call path sampling is also included.

io           Accumulated wall-clock durations of input/output (I/O) system calls: read, readv, write, writev, open, close, dup, pi
amp "mpirun -np 256 smg2000 -n 50 50 50" high

We can view the results of this flat profile in the Open SpeedShop GUI by using the openss -f <database filename> command.

(Screenshot: Stats Panel showing the pcsamp report, with exclusive CPU time in seconds and % of CPU time per function; hypre_SMGResidual (smg2000: smg_residual.c, 152) and hypre_CyclicReduction (smg2000: cyclic_reduction.c, 757) top the list, followed by hypre_SemiInterp, hypre_SemiRestrict, and others.)

We can use this information to identify the critical regions. The profile shows computationally intensive code regions by displaying the time spent per function or per statement. While viewing this we must ask ourselves:

- Are those the functions st
amp::percent
pcsamp::threadAverage
pcsamp::threadMax
pcsamp::threadMin
pcsamp::time

You can use the output of the list metrics command as an argument to the osscompare command, as shown in the examples below:

osscompare "smg2000-pcsamp.openss,smg2000-pcsamp-1.openss"
osscompare "smg2000-pcsamp.openss,smg2000-pcsamp-1.openss" percent
osscompare "smg2000-pcsamp.openss,smg2000-pcsamp-1.openss" threadMin
osscompare "smg2000-pcsamp.openss,smg2000-pcsamp-1.openss" threadMax

Some exceptions do apply. For example, some experiments, such as usertime and hwctime, have details-type metrics output by the list metrics CLI command (list -v metrics). These will not work as a metric argument to osscompare.

For the hardware counter experiments (hwc and hwctime), you can use the actual PAPI event name in addition to the metric names output from the list metrics command. The example database file was generated using the PAPI_TOT_CYC event:

openss -cli -f smg2000-hwc.openss
openss>>
The restored experiment identifier is -x 1
openss>>list -v metrics
hwc::overflows
hwc::percent
hwc::threadAverage
hwc::threadMax
hwc::threadMin

Here we show a couple of osscompare examples where hwc::overflows can be used interchangeably with PAPI_TOT_CYC:

osscompare "smg2000-hwc.openss,smg2000-hwc-1.openss" hwc::overflows
osscompare "smg2000-hwc.openss,smg2000-hwc-1.openss" PAPI_TOT_CYC

10.1.2 osscompare rows of output argument
ance data viewing with CLI .................. 73
8.2 Threading Analysis Section .................. 73
8.2.1 Threading Specific Experiment (pthreads) .................. 75
8.2.1.1 Threading Specific pthreads experiment performance data gathering .................. 76
8.2.1.2 Threading Specific pthreads experiment performance data viewing with GUI .................. 76
8.2.1.3 Threading Specific pthreads experiment performance data viewing with CLI .................. 76
8.3 NVIDIA CUDA Analysis Section .................. 77
8.3.1 NVIDIA CUDA Tracing (cuda) experiment performance data gathering .................. 77
8.3.2 NVIDIA CUDA Tracing (cuda) experiment performance data viewing with GUI .................. 77
8.3.3 NVIDIA CUDA Tracing (cuda) experiment performance data viewing with CLI .................. 78
9 Memory Analysis Techniques .................. 80
9.1 Memory Analysis Tracing (mem) experiment performance data gathering .................. 80
9.2 Memory Analysis Tracing (mem) experiment performance data viewing with GUI .................. 80
9.3 Memory Analysis Tracing (mem) experiment performance data viewing with CLI .................. 81
10 Advanced Analysis Techniques .................. 83
10.1 Comparison Script Argument Description ..................
atements that we expected to be taking the most time?
- Does this match the computational kernels?
- Are any runtime functions taking a lot of time?

We want to identify any components that are bottlenecks. We can do this by viewing the profile aggregated by shared linked objects, making sure the correct or expected modules are present, then analyzing the impact of those support and/or runtime libraries.

4.2 Call Path Profiling: usertime Experiment

The call path profiling (usertime) experiment can add some information that is missing from the flat profiles. It is able to distinguish routines called from multiple callers and understand the call invocation history. This provides context for the performance data. It also gathers stack traces for each performance sample and only aggregates samples with equal stack traces. For the user, this simplifies the view by showing the caller/callee relationship. It can also highlight the hot call paths: the paths through the application that take the most time. The call path profiling experiment also provides inclusive and exclusive time. Exclusive time is the time spent inside a function only, for example function B, whereas inclusive time is the time spent inside a function and its children, for example the full chain of functions C, D, and E. The call path profiling experiment is similar to the program counter sampling experiment, since it collects program counter information ex
avevie ac .................. 92
12 Cray and Blue Gene .................. 94
12.1 Cray Specific Static aprun Information .................. 95
13 Setup and Build for Open SpeedShop .................. 96
13.1 Open SpeedShop Cluster Install .................. 96
13.2 Open SpeedShop Blue Gene Platform Install .................. 97
13.3 Open SpeedShop Cray Platform Install .................. 97
13.4 Execution Runtime Environment Setup .................. 97
13.4.1 Example module file .................. 97
13.4.2 .................. 98
13.4.3 .................. 98
14 Additional Information and Documentation Sources .................. 99
14.1 Final Experiment Overview .................. 99
14.2 Additional Documentation .................. 100
15 Convenience Script Basic Usage Reference Information .................. 101
15.1 Suggested Workflow ..................
ay the default view of the performance data. You can also get alternative views of the performance data within the CLI. The following is a list of some options to change the way the information is displayed:

expview -m loadbalance
    See the load balance across all the ranks, threads, or processes in the experiment.

expcompare -r 1 -r 2 -m time
    See data for specific rank(s): compare rank 1 to rank 2 for the metric equal to time. Other metrics are allowed; this is a usage example.

list -v metrics
    See the list of optional performance data metrics.

list -v src
    See the list of source files associated with the experiment.

list -v obj
    See the list of object files associated with the experiment.

list -v ranks
    See the list of ranks associated with the experiment.

list -v hosts
    See the machine (host) names associated with the experiment.

expview -m <metric>
    See performance data for the metric specified.

expview -v calltrees,fullstack <experiment type><number>
    See <number> of call paths from the list of expensive call paths. For example, expview -v calltrees,fullstack usertime2 shows the top two call paths in execution time.

expview <experiment name><number>
    Shows <number> of the functions from the list of the top time-consuming functions. For example, expview pcsamp2 shows the two functions taking the most time.

expview -v statements <experiment name><number>
    Shows <number> of the statements from the list of the top time-consuming statements.

Remember, if you want the
ce to all the I/O functions that were called during the execution of this application. From this, one could validate that this is expected behavior and, if not, find where the I/O in this application is not behaving as expected.

[image: Open|SpeedShop Stats Panel showing the call paths to the I/O functions for the IOR application (512 ranks), including close, read, and open in libc-2.12.so.]

This view is the load balance view, which gives the min, max, and average values for the I/O function call time across all the ranks in this application. In this view we are seeing some wide ranges between the min and max values for some of the I/O functions. It may be useful to see if we can identify the ranks by using the Cluster Analysis view.

[image: Load balance view listing min, max, and average call times for __GI___read, __libc_open, write, lseek, and close from libc-2.12.so.]

This view, generated by choosing the CA icon, shows that there are two groups of ranks where the I/O is performing in a similar manner. For group
cept that it collects call stack information at every sample. There are, of course, tradeoffs: you obtain additional context information from the call stacks, but there is now a higher overhead and a necessarily lower sampling rate. We can run the call path profiling experiment using the Open|SpeedShop convenience script on our test program smg2000:

> ossusertime "mpirun -np 256 smg2000 -n 50 50 50"

Again, it is recommended that you compile your code with the -g option in order to see the statements in the sampling. The usertime experiment also takes a sampling frequency as an optional parameter; the available parameters are high (70 samples per second) and low (18 samples per second), and the default value is 35 samples per second. Note that these sample rates are lower than the pcsamp experiment's because of the increased amount of data being collected. If we wanted to run the same experiment with the low sampling rate, we would simply issue the command:

> ossusertime "mpirun -np 256 smg2000 -n 50 50 50" low

We can view the results of this experiment in the Open|SpeedShop GUI. The view is similar to the pcsamp view, but this time the inclusive CPU time is also shown.

[image: Open|SpeedShop GUI with Process Control, Stats Panel, ManageProcesses, and Source Panel tabs showing the usertime experiment results.]
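As a rough sense of the tradeoff, the number of samples collected grows linearly with the sampling rate. The sketch below estimates per-process sample counts for the three usertime rates over a hypothetical 240-second run (the runtime is invented for illustration, not taken from a real smg2000 run):

```shell
# Approximate samples per process: rate (samples/s) * runtime (s).
# 240 seconds is a made-up runtime for illustration only.
runtime=240
for rate in 70 35 18; do
  echo "rate=${rate}/s -> ~$((rate * runtime)) samples per process"
done
```

Higher rates give finer time resolution at the cost of more perturbation of the application and larger databases.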
ch smg2000 on 256 processors. -n 65 65 65 is passed as an argument to smg2000. An example of a typical MPI smg2000 pcsamp experiment run, along with the application and experiment output, follows below:

> osspcsamp "mpirun -np 2 smg2000 -n 65 65 65"

[openss]: pcsamp experiment using the pcsamp experiment default sampling rate: 100.
[openss]: Using OPENSS_PREFIX installed in /opt/OSS-mrnet
[openss]: Setting up offline raw data directory in /tmp/jeg/offline-oss
[openss]: Running offline pcsamp experiment using the command:
  mpirun -np 2 /opt/OSS-mrnet/bin/ossrun "smg2000 -n 65 65 65" pcsamp

Running with these driver parameters:
  (nx, ny, nz)    = (65, 65, 65)
  (Px, Py, Pz)    = (2, 1, 1)
  (bx, by, bz)    = (1, 1, 1)
  (cx, cy, cz)    = (1.000000, 1.000000, 1.000000)
  (n_pre, n_post) = (1, 1)
  dim             = 3
  solver ID       = 0
Struct Interface:
  wall clock time = 0.049847 seconds
  cpu clock time  = 0.050000 seconds
SMG Setup:
  wall clock time = 0.635208 seconds
  cpu clock time  = 0.630000 seconds
SMG Solve:
  wall clock time = 3.987212 seconds
  cpu clock time  = 3.970000 seconds
Iterations = 7
Final Relative Residual Norm = 1.774415e-07

[openss]: Converting raw data from /tmp/jeg/offline-oss into temp file X.0.openss
Processing raw data for smg2000
Processing processes and threads ...
Processing performance data ...
Processing functions and statements ...
[openss]: Restoring and displaying default view for /home/jeg/DEMOS/de
code primarily under LGPL. Highlights include:

• Comprehensive performance analysis for sequential, multithreaded, and MPI applications
• No need to recompile the user's application
• Supports both first analysis steps as well as deeper analysis options for performance experts
• Easy to use GUI, and fully scriptable through a command line interface and Python
• Supports Linux systems and clusters with Intel and AMD processors
• Extensible through new performance analysis plugins, ensuring a consistent look and feel
• In production use on all major cluster platforms at LANL, LLNL, and SNL

Features include:

• Four user interface options: batch, command line interface, graphical user interface, and Python scripting API
• Supports multi-platform single system image (SSI) and traditional clusters
• Scales to large numbers of processes, threads, and ranks
• View performance data using multiple customizable views
• Save and restore performance experiment data and symbol information for post-experiment performance analysis
• View performance data for all of the application's lifetime or smaller time slices
• Compare performance results between processes, threads, or ranks, or between a previous experiment and the current experiment
• Interactive CLI help facility which lists the CLI commands, syntax, and typical usage
• Option to automatically group like-performing processes, threads, or ranks
• Create MPI traces in OTF (Open Trace Format)

1 What is Performance A
ctive batch commands directly through Python. Users can intersperse normal Python code with commands to Open|SpeedShop. Currently this interface is only supported via the online version of Open|SpeedShop.

11.4 MPI_Pcontrol Support

Open|SpeedShop also supports the MPI_Pcontrol function. This feature allows the user to gather performance data only for sections of their code bounded by MPI_Pcontrol calls. The MPI_Pcontrol calls must be added to the source code of the application: MPI_Pcontrol(1) enables the gathering of performance data and MPI_Pcontrol(0) disables the gathering. You must also set the Open|SpeedShop environment variable OPENSS_ENABLE_MPI_PCONTROL to 1 in order to activate the MPI_Pcontrol call recognition; otherwise it will be ignored. Optionally, you can set the OPENSS_START_ENABLED environment variable to 1 to have performance data gathered until an MPI_Pcontrol(0) call is encountered. If OPENSS_START_ENABLED is not set, no performance data will be gathered until an MPI_Pcontrol(1) call is encountered. Note that for OPENSS_START_ENABLED to have any effect, OPENSS_ENABLE_MPI_PCONTROL must be set.

11.5 Graphical User Interface Basics

This section gives an overview of the Open|SpeedShop graphical user interface, focusing on the basic functionality of the GUI. To launch the GUI on any experiment, use:

openss -f <database name>

11.5.1 Basic Initial View (Default View)

Because this example usertime experiment default view ha
database file that has 4 ranks, each of which has 4 underlying OpenMP threads. What this example intends to show is that you can look at hybrid performance first at the MPI level, and then look under the MPI rank to see how the threads are performing. At the MPI level you can see load balance and outliers; then focus on a rank and look at load balance and outliers for the underlying threads.

Within a terminal window we enter:

openss -f bt-mz.B.4-pcsamp-1.openss

to bring up the Open|SpeedShop GUI. In the GUI view below, we display the aggregated results for the application at the statement level granularity. When the default view first comes up, the view is at the function level granularity. To switch to the statement level, select the Statements button in the View/Display Choice section on the right hand side of the Stats Panel display, and then click the D icon for the default view. This will switch the Stats Panel view to statement level granularity.

Now the Stats Panel is displaying the statements that took the most time in the application run. For this execution of BT, the statement at line 440 took the most time. By double clicking on the statement, Open|SpeedShop focuses on the source for that line of the application and highlights that line. In the view below, we moved the ManageProcess panel tab to the lower panel and split the upper panel using the vertical splitter icon on the far right side of the original upper panel. Not
do and the location of your MPI and QT installs. For example:

install-tool --build-offline --openss-prefix /opt/myoss --with-openmpi /opt/openmpi-1.5.5 --with-mvapich /opt/mvapich-1.1

After the install has successfully completed, there are a few important environment variables you need to set. Again, set OPENSS_PREFIX for the install location and OPENSS_PLUGIN_PATH for the directory where the plugins are stored; if you installed with more than one MPI version, you must specify which to use with OPENSS_MPI_IMPLEMENTATION; lastly, add the Open|SpeedShop bin directory to your PATH and the lib64 directory to your LD_LIBRARY_PATH. Examples of the necessary environment variables that need to be set are as follows:

export OPENSS_PREFIX=/opt/myoss
export OPENSS_MPI_IMPLEMENTATION=openmpi
export OPENSS_PLUGIN_PATH=$OPENSS_PREFIX/lib64/openspeedshop
export LD_LIBRARY_PATH=$OPENSS_PREFIX/lib64:$LD_LIBRARY_PATH
export PATH=$OPENSS_PREFIX/bin:$PATH

13.2 Open|SpeedShop Blue Gene Platform Install

Please reference the OpenSpeedShop 2.1 Build and Install Guide.

13.3 Open|SpeedShop Cray Platform Install

Please reference the OpenSpeedShop 2.1 Build and Install Guide.

13.4 Execution Runtime Environment Setup

This section gives an example of a module file, softenv file, and dotkit that can be used to set up the Open|SpeedShop execution environments.

13.4.1 Example module file

This is an example of a module file used for a cluster installation. Use module
e Left mouse button down and hold on the panel tab, then slide the panel you want to move to another location on the Open|SpeedShop GUI, or off onto other parts of your display.

[image: Stats Panel and Source Panel after a panel has been moved to a new location in the GUI.]

16.1 Focus on Individual Rank to get Load Balance for Underlying Threads

In the next view below, we used the ManageProcess panel to highlight one rank and an individual thread within the rank, to show only that thread's performance data in the Stats Panel view. Note: Use the "focus on threads and processes" Manage Process panel option to focus on individual threads within a rank. Right mouse button down on the Manage Process panel tab to see the options.

[image: Stats Panel showing the performance data for a single thread within one rank, with the corresponding source lines highlighted.]

In the next GUI view, we used the ManageProcess panel to highlight one rank, to show the performance data from all the threads that are executed under that
e encouraged to use the convenience scripts that hide some of the underlying options for running experiments. The full command syntax can be found in the User's Guide. The script names correspond to the experiment types and are: osspcsamp, ossusertime, osshwc, osshwcsamp, osshwctime, ossio, ossiot, ossmpi, ossmpit, ossmpiotf, and ossfpe, plus an osscompare script.

Note: Make sure to set OPENSS_RAWDATA_DIR; see the KEY ENVIRONMENT VARIABLES section for info.

When running Open|SpeedShop, use the same syntax that is used to run the application executable outside of O|SS, but enclosed in quotes, e.g.:

Using an MPI with mpirun:  osspcsamp "mpirun -np 512 smg2000"
Using SLURM srun:          osspcsamp "srun -N 64 -n 512 smg2000 -n 555"

Redirection to/from files inside the quotes can be problematic; see the convenience script man pages for more info.

15.3 Report and Database Creation

Running the pcsamp experiment on the sequential program named mexe:

osspcsamp mexe

results in a default report and the creation of a SQLite database file (mexe-pcsamp.openss) in the current directory. The report:

  % of CPU Time  CPU time in seconds  Function (defining location)
         48.990               11.650  f3 (mexe: m.c,24)
         33.478                7.960  f2 (mexe: m.c,15)
         17.451                4.150  f1 (mexe: m.c,6)
          0.084                0.020  work (mexe: m.c,33)

To access alternative views in the GUI, openss -f mexe-pcsamp.openss loads the database file; then use the GUI toolbar to select the desired views. Or, using the CLI, openss -cli -f mexe-pcsamp.openss to load the databa
e specific events listed in the PAPI documentation. Native events are also reported by papi_native_avail. The hardware counter sampling experiment uses a sampling rate instead of the threshold used in the previous experiments. But, like the threshold, the sampling rate is dependent on the application and must be balanced between overhead and accuracy; in this case, the lower the sampling rate, the fewer samples recorded. The convenience script for this experiment is:

osshwcsamp "mpirun -np 256 smg2000 -n 50 50 50" <event_list> <sampling_rate>

Note: if a counter does not appear in the output, there may be a conflict in the hardware counters. To find conflicts, use:

papi_event_chooser PRESET <list_of_events>

Here is a list of some possible hardware counter combinations to use (list provided by Koushik Ghosh, LLNL).

For Xeon processors:
  PAPI_FP_INS,PAPI_LD_INS,PAPI_SR_INS    Load/store info; memory bandwidth feeds
  PAPI_L1_DCM,PAPI_L1_TCA                L1 cache hit/miss ratios
  PAPI_L2_DCM,PAPI_L2_TCA                L2 cache hit/miss ratios
  LAST_LEVEL_CACHE_REFERENCES, MEM_UNCORE_RETIRED:LOCAL_DRAM

For Opteron processors:
  PAPI_FAD_INS,PAPI_FML_INS              Floating point add/multiply
  PAPI_FDV_INS,PAPI_FSQ_INS              Square root and divisions
  PAPI_FP_OPS,PAPI_VEC_INS               Floating point and vector instructions
  READ_REQUEST_TO_L3_CACHE:ALL_CORES     L3 cache
  L3_CACHE_MISSES:ALL_CORES

When selecting PAPI events, you must determine if they are
each event in an event-by-event trace. (Only available in Open|SpeedShop using the CBTF collection mechanism, currently under development.)

3.2.3 Sampling Experiments Descriptions

Program counter sampling (the pcsamp experiment), call path profiling (the usertime experiment), and the three hardware counter experiments (hwc, hwctime, hwcsamp) all use a form of sampling-based performance information gathering.

Program counter sampling (pcsamp) is used to record the program counter (PC) in the user application being monitored by interrupting the application at a user-defined time interval, with the default being 100 times a second. This experiment provides a low-overhead overview of the time distribution for the application. Its lightweight overview provides a good first step for analyzing the performance of an application.

The call path profiling (usertime) experiment gathers the PC sampling information and also records call stacks for each sample. This allows the later display of the call path information about the application as well as inclusive and exclusive timing data (see section 4.2). This experiment is used to find hot call paths (call paths that take the most time) and see who is calling whom.

The hardware counter experiments (hwc, hwctime, hwcsamp) access data like cache and TLB misses. The experiments hwc and hwctime sample hardware counter events based on an event threshold. The default event is PAPI_TOT_CYC ov
ead associated with thread startup. There can be problems with not balancing the workload among threads properly or most efficiently. There can be complications with Non-Uniform Memory Access (NUMA).

2.3 Message Passing Applications

Message passing applications use a distributed memory model, with sequential or shared memory nodes coupled by a network. In this case, data is exchanged using message passing via a Message Passing Interface (MPI). The typical performance issues associated with message passing applications include long blocking times while waiting on data, or low messaging rates creating bottlenecks due to insufficient network bandwidth.

3 Introduction to Open|SpeedShop

Open|SpeedShop is an open source performance analysis tool framework. It provides the most common performance analysis steps all in one tool. It is easily extendable by writing plugins to collect and display performance data. It also comes with built-in experiments to gather and display several types of performance information.

Open|SpeedShop provides several flexible and easy ways to interact with it. There is a GUI to launch and examine experiments, and a command line interface that provides the same access as the GUI, as well as Python scripting. There are also convenience scripts that allow you to run standalone experiments on applications and examine the results at a later time. The existing experiments for Open|SpeedShop all work on unmodified application bi
ect view. The load balance on the linked object showed some imbalance, so we looked at the cluster analysis view and found that rank 255 was an outlier. We then took a closer look at rank 255 and saw that the pcsamp output shows most of the time was spent in smp_net_lookup. We used the MPI experiment to determine if we could get more clues, and saw that a load balance view on the MPI experiment shows rank 255's MPI_Allreduce time is the highest of the 256 ranks. We then looked at rank 255 and a representative rank from the rest of the ranks, and noted the differences in MPI_Wait, MPI_Send, and MPI_Allreduce. We looked at the call paths to MPI_Wait to determine why the wait was occurring.

The mpit experiment has a performance information entry for each MPI function call. In addition to the time spent in each MPI function, information like source and destination rank and bytes sent or received is also available. You can selectively view the information you desire. Below we see the default event view for an MPI application.

[image: mpit default event view showing per-call entries, with call stacks ending in MPI_Waitall (libmpi.so) and their start times and durations.]

We can create our own event view with the OV button.

[image: The dialog raised by the OV button, used to select which columns appear in the custom event view.]
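The rank-to-rank comparison described above can also be done from the CLI with expcompare. A sketch follows, where the database name is hypothetical and the rank numbers follow the example (255 is the outlier; 0 stands in for a representative rank):

```shell
# Load the MPI experiment database into the interactive CLI.
openss -cli -f smg2000-mpi.openss    # hypothetical database name
# Then, at the openss>> prompt:
#   expcompare -r 255 -r 0 -m time   # outlier rank vs. representative rank
#   expview -v calltrees,fullstack   # inspect the call paths to MPI_Wait
```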
ective_com
Persistent Communicators         persistent_com
Synchronous Point to Point       synchronous_p2p
Asynchronous Point to Point      asynchronous_p2p
Process Topologies               process_topologies
Groups Contexts Communicators    graphs_contexts_comms
Environment                      environment
Datatypes                        datatypes
MPI File I/O                     fileio

15.11 ossfpe FP Exception Experiment

General form:            ossfpe "<command> <args>" [ default | <f_t_list> ]
Sequential job example:  ossfpe "smg2000 -n 50 50 50"
Parallel job example:    ossfpe "mpirun -np 128 smg2000 -n 50 50 50"
Additional arguments:
  default      Trace all floating point exceptions.
  <f_t_list>   Comma-separated list of exceptions to trace, consisting of one or more of: inexact_result, division_by_zero, underflow, overflow, invalid_operation.

15.12 ossmem Memory Analysis Experiment

General form:            ossmem "<command> <args>" [ default | <f_t_list> ]
Sequential job example:  ossmem "smg2000 -n 50 50 50"
Parallel job example:    ossmem "mpirun -np 128 smg2000 -n 50 50 50"
Additional arguments:
  default      Trace all supported memory functions.
  <f_t_list>   Comma-separated list of memory functions to trace, consisting of one or more of: malloc, free, memalign, posix_memalign, calloc, and realloc.

15.13 osspthread POSIX Thread Analysis Experiment

General form:            osspthread "<command> <args>" [ default | <f_t_list> ]
Sequential job example:  osspthread "smg2000
ek (libc-2.12.so: syscall-template.S,82)

openss>>expview -m loadbalance

  Max I/O Call    Rank   Min I/O Call    Rank   Average I/O     Function (defining location)
  Time Across     of     Time Across     of     Call Time
  Ranks (ms)      Max    Ranks (ms)      Min    Across Ranks (ms)

  4114.522156     509    2680.653110     273    3629.759208     close (libc-2.12.so: syscall-template.S,82)
  2824.349452     346       0.315392     317    2061.726036     __GI___read (libc-2.12.so: syscall-template.S,82)
   989.579445     358       5.784552     414     211.147786     __libc_open (libc-2.12.so: syscall-template.S,82)
     4.574762      65       0.424622     494       0.655899     write (libc-2.12.so: syscall-template.S,82)
     0.044708     184       0.011079     317       0.017103     __GI___libc_lseek (libc-2.12.so: syscall-template.S,82)

openss>>expview -v calltrees,fullstack

  I/O Call        % of       Number    Call Stack Function (defining location)
  Time (ms)       Total      of
                  Time       Calls

_start (IOR)
> 562 in __libc_start_main (libmonitor.so.0.0.0: main.c,541)
>> 258 in __libc_start_main (libc-2.12.so: libc-start.c,96)
>>> 517 in monitor_main (libmonitor.so.0.0.0: main.c,492)
>>>> 153 in main (IOR: IOR.c,108)
>>>>> 2021 in TestIoSys (IOR: IOR.c,1848)
>>>>>> 316 in IOR_Close_POSIX (IOR: aiori-POSIX.c,315)
>>>>>>> 766 in close (iot-collector-monitor-mrnet-mpi.so: wrappers.c,685)
  1858418.863034  61.486298  512  >>>>>>>> 82 in close (libc-2.12.so: syscall-template.S,82)
_start (IOR)
> 562 in __libc_start_main (libmonit
er experiments: hwc, for flat hardware counter profiles using a single hardware counter; hwctime, for profiles with stack traces using a single hardware counter; and hwcsamp, for PC sampling with multiple hardware counters. Both osshwc and osshwctime support non-derived PAPI presets (all non-derived events are reported by papi_avail -a). You can also see the available events by running the experiments osshwc or osshwctime with no arguments. The experiments include all native events for that specific architecture. Some PAPI event names are listed in the sections below, but please see the PAPI documentation for the full list.

The threshold you choose depends on the application; you want to balance overhead with accuracy. Remember, a higher threshold will record fewer samples. Rare events need a smaller threshold, or that information may be lost (never triggered and recorded). Frequent events should use a larger threshold to reduce the overhead of collecting the information. Selecting the right threshold can take experience, or some trial and error.

HINT: Running the sampling-based hardware counter experiment (osshwcsamp) can help you get an idea for a threshold value to try when running the osshwc and osshwctime experiments, which are threshold based. Since the ideal number of events (threshold) depends on the application and the selected counter, for events other than the default the hwcsamp experiment can be used to get an overview of counter activity.
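One way to turn an hwcsamp overview into a starting threshold, as the HINT above suggests, is to divide the total event count by the run length and by a desired overflow-sample rate. The numbers below are invented for illustration, not from a real run:

```shell
# Hypothetical hwcsamp result: ~2.0e9 PAPI_L2_DCM events over a 100 s run.
total_events=2000000000
seconds=100
target_rate=100   # aim for roughly 100 overflow samples per second

# threshold ~= (events per second) / (desired samples per second)
echo "starting threshold ~ $((total_events / seconds / target_rate))"
```

From there, adjust by trial: raise the threshold if the overhead is too high, lower it if too few samples are recorded.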
erflows. Please see chapter 5 for more information on PAPI and hardware counter related experiments. Instead of using a threshold, the hwcsamp experiment samples up to six events based on a sample time, similar to the usertime and pcsamp experiments. The hwcsamp experiment default events are PAPI_FP_OPS and PAPI_TOT_CYC.

3.2.4 Tracing Experiments Descriptions

The input/output tracing and profiling experiments (io, iot, iop), the MPI tracing experiments (mpi, mpit, mpiotf), memory tracing (mem), POSIX thread tracing (pthread), and floating point exception tracing (fpe) all use a form of tracing, or wrapping, of the function names to record performance information. Tracing experiments do not use timers or thresholds to interrupt the application. Instead, they intercept the function calls of interest by using a wrapper function that records timing and function argument information, calls the original function, and then records this information for later viewing with Open|SpeedShop's user interface tools.

The input/output tracing experiments (io, iot) record the invocation of all POSIX I/O events. They both provide aggregated and individual timings and, in addition, the iot experiment also provides argument information for each call. To obtain a more lightweight overview of application I/O usage, use the I/O profiling experiment. The lightweight I/O experiment (iop) records the invocation of all POSIX I/O events, accumulating the information, but does no
esced. All of the calling paths will be shown in their entirety.

Call Paths w/o Coalescing per Function Only
    Highlight a function in the StatsPanel and click on this icon. Duplicate paths will not be coalesced; all of the calling paths for the selected function will be shown in their entirety.

HC  Hot Call Path
    Show the call path in the application that took the most time. This is a shortcut to find the hot call path.

B   Butterfly
    Show the butterfly view, which displays the callers and callees of the selected function. Highlight a function in the StatsPanel and click on this icon. Then repeat to drill down into the callers and/or callees.

SA  Source Annotation
    With a segment selected, selects the new performance data report annotation source. Defaults are different for each experiment, but mostly time.

LB  Load Balance
    Show the load balance view, which displays the min, max, and average performance values for the application. Only available on threaded or multiple process applications.

CA  Cluster Analysis
    Show the comparative analysis view, which displays the output of a cluster analysis algorithm run against the threaded or multiple process performance analysis results for the user application. The goal of this view is to find outlying threads or processes and report the groups of like-performing threads, processes, or ranks.

CC  Custom Compare
    Raise the custom comparison panel, which provides mechanisms allowing the use
f time spent in each I/O function.

[image: Stats Panel call-path view for the IOR application (1000 hosts/ranks) showing the percentage of time spent in each I/O function, with paths through TestIoSys, WriteOrRead, IOR_Xfer_POSIX, IOR_Open_POSIX, and IOR_Close_POSIX into write, open, and close in libpthread-2.11.2.so.]

This image shows the min, max, and average time spent in each of the I/O functions, showing the rank of the minimum value and the rank of the maximum value for each of the I/O functions. This view indicates if there is an imbalance relative to the I/O in the application being run. This may or may not be expected.

[image: Load balance view listing the min, max, and average I/O call times per function across all ranks.]
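The min/max/average columns in the load balance view are simple per-rank reductions over each function's call time. The per-rank times below are invented to show the arithmetic, not taken from a real run:

```shell
# Hypothetical per-rank close() times in ms; the load balance view reports
# the minimum, maximum (each with its owning rank), and the average.
printf '2680.65\n4114.52\n3629.76\n3093.43\n' |
awk 'NR==1 { min=max=$1 }
     { if ($1 < min) min=$1; if ($1 > max) max=$1; sum+=$1 }
     END { printf "min=%.2f max=%.2f avg=%.2f\n", min, max, sum/NR }'
```

A large spread between min and max, as in the IOR example above, is the signal to follow up with the Cluster Analysis view.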
g for OpenMP specifically. Any Open|SpeedShop experiment can be applied to any parallel application. This means you can run the program counter sampling experiment on a non-parallel application as well as on an MPI or threaded application. The experiment data collectors are automatically applied to all tasks and threads. The default views aggregate (sum) the performance data across all tasks and threads, but data from individual tasks and threads is available. The MPI calls are wrapped, and MPI function elapsed time and parameter information is displayed.

3.3 Running an Experiment

First, think about what parameters you want to measure, then choose the appropriate experiment to run. You may want to start by running the pcsamp experiment, since it is a lightweight experiment and will give an overview of the timing for the entire application. Once you have selected the experiment to run, you can launch it with either the wizard in the GUI or by using the command line convenience scripts. For example, say you have decided to run the pcsamp experiment on the Semi-coarsening Multigrid Solver MPI application smg2000, a good benchmark application. On the command line you would issue the command:

> osspcsamp "mpirun -np 256 smg2000 -n 65 65 65"

where mpirun -np 256 smg2000 -n 65 65 65 is a typical MPI application launching command you would normally use to launch the smg2000 application. mpirun, an MPI driver script or executable, is here used to laun
>>>> QTC_device(float, char, char, int, int, int, float, int, int, int, int, float, int, int, int, int, bool) <<<(256,1,1), (64,1,1)>>> (CUDA)

9 Memory Analysis Techniques

The Open|SpeedShop version with CBTF collection mechanisms supports tracing memory allocation and deallocation function calls in user applications. An event-by-event list of memory function call events and the memory function call event arguments are listed. The Open|SpeedShop experiment name for the memory analysis experiment is mem. The high-water memory mark is not currently available, but is coming in the future.

9.1 Memory Analysis Tracing (mem) experiment performance data gathering

To run the memory analysis experiment, use the ossmem convenience script and specify the application as an argument. If there are no arguments to the application, then no quotes are necessary, but they are placed here for consistency. Using the sweep3d application as an example, the ossmem script will apply the memory analysis experiment by running the sweep3d application with the Open|SpeedShop memory trace collector, gather the data, and create an Open|SpeedShop database file with the results of the experiment. Viewing of the performance information can be done with the GUI or CLI.

ossmem "mpirun -np 64 sweep3d.mpi"

9.2 Memory Analysis Tracing (mem) experiment performance data viewing with CLI

To launch the CLI on any e
hardware counter: osshwc "how you normally run your application" <papi event> <threshold value> tbd

5.2.2 Hardware Counter Threshold (hwc) experiment performance data viewing with GUI

To launch the GUI on any experiment, use:

openss -f <database name>

This image shows the default view for the hwc experiment run with the smg2000 MPI application, using PAPI_TOT_CYC as the hardware counter event. Double clicking on a line in the Stats Panel or on the bar chart will take the user to the source file and line represented by that line of performance information.

[image: hwc default view; visible entries include hypre_SemiInterp (smg2000: semi_interp.c,126) at 3.104575163% and hypre_SemiRestrict (smg2000: semi_restrict.c,125) at 2.941176471%.]

The next image displays the output from the osshwctime experiment, where the counter is the L1 cache misses.

[image: hwctime view (Host: localhost.localdomain, Processes/Ranks/Threads: 2).]
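Raw counter totals such as L1 cache misses are most useful as ratios against a companion counter (see the PAPI_L1_DCM / PAPI_L1_TCA pairing listed in the osshwcsamp section). A sketch of the arithmetic, with made-up counter values rather than output from the run shown above:

```shell
# Hypothetical counter totals as an hwcsamp run might report them.
L1_DCM=133244862     # PAPI_L1_DCM: L1 data cache misses
L1_TCA=9818946796    # PAPI_L1_TCA: total L1 cache accesses

# Miss rate (%) = 100 * misses / accesses; awk does the floating point.
awk -v m="$L1_DCM" -v a="$L1_TCA" \
    'BEGIN { printf "L1 miss rate: %.2f%%\n", 100 * m / a }'
```

A rising miss rate between two runs is usually more telling than the absolute miss count, which scales with how much work the run did.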
hip of the statistics to the actual source in the program.

[image: Open|SpeedShop GUI for a CUDA experiment, with the Stats Panel call path report, the ManageProcesses panel, and the Source Panel showing the associated source.]

8.3.3 NVIDIA CUDA Tracing (cuda) experiment performance data viewing with CLI

To launch the CLI on any experiment, use:

openss -cli -f <database name>

Here we show a trace view of the output from the osscuda experiment run. Note: the -f CUDA is required due to the fact that this is a prototype; this restriction will be removed in the future. This trace shows the actions taken during the execution of the CUDA application matmul on the Titan Cray platform at ORNL.

openss>>expview -v trace -f CUDA

  Start Time (d:h:m:s)     Exclusive    % of      Call Stack Function (defining location)
                           I/O Call     Total
                           Time (ms)

  2013/08/21 18:31:21.611  11.172864    1.061071  >>>>> copy 64 MB from host to device (CUDA)
  2013/08/21 18:31:21.622  0.3
i_l1_ldm  papi_l1_stm  LinkedObject
CPU time in seconds    % of CPU Time

   928.310000   43.541541   9818946796    133244862    9952191658    9543597734   215608918   libc-2.12.so
   811.920000   38.082373  47212355914    369525459   47581881373  46596204924    441601622   sweep3d.mpi
   311.490000   14.610157   3356646038     44875637    3401521675    3255300343     80090932   libpami.so
    29.640000    1.390237   1824778610     12931604    1837710214    1680978945    127174346   libirc.so
    26.930000    1.263127    287313329      3994016     291307345     281053971      4763152   libpamiudp.so
    22.250000    1.043616   1049603690      9037920    1058641610    1033650896     11422120   libpthread-2.12.so
     1.440000    0.067542     72649683       620083      73269766      71327993      1007704   libmpich.so.3.3
     0.020000    0.000938      1286256        23770       1310026       1232178         5222   d-2.12.so
     0.010000    0.000469           32            7           394           721          313   librt-2.12.so
  2132.010000  100.000000  63623580643    574253745   64197834388   62463347297    881674029   Report Summary

openss>>

5.2 Hardware Counter Experiment (hwc)

As an example, we will run the osshwc experiment on our test program smg2000. The convenience script for this experiment is:

osshwc "mpirun -np 256 smg2000 -n 50 50 50" <counter> <threshold>

This is the same syntax as the osshwctime experiment. Note that if your output is empty, try lowering the <threshold> value (it is calculated by Open|SpeedShop by default). You can try lowering the threshold value if there have not been enough PAPI event occurrences to record. Also see the HINT in the osshwcsamp section above. You can ru
I/O can be a significant percentage of the execution time for an application, and can depend on many things, including:

- Checkpoint, analysis output, visualization, and I/O frequencies.
- The I/O pattern in the application: whether it is N-to-1 or N-to-N, and whether there are simultaneous read or write requests.
- The nature of the application: whether it is data intensive, traditional HPC with scalable data, or out-of-core (that is, an application that works on data that is larger than the available system memory).
- The type of file system and striping available on the cluster: NFS, Lustre, Panasas, or other Object Storage Targets (OSTs).
- Which I/O libraries your code is using: MPI-IO, HDF5, PLFS, or others.
- Other jobs that are running and stressing the I/O subsystems.

The obvious thing to explore first while tuning your code is to try to use a parallel file system. Then optimize your code for its I/O patterns. Match checkpoint I/O frequency to the Mean Time Before Interrupt (MTBI) of the system, and make sure your code is using the appropriate libraries.

7.1 OOCORE Example

We will examine an example using the benchmarking application OOCORE, an out-of-core solver from the Department of Defense High Performance Computing Modernization Program (DoD HPCMP). It is an out-of-core ScaLAPACK (Scalable LAPACK) benchmark from the University of Tennessee, Knoxville (UTK). It can be configured to be disk-I/O intensive…
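As a rough illustration of matching checkpoint frequency to MTBI, here is a hypothetical sketch using Young's well-known approximation (this is not something Open|SpeedShop computes; the numbers are made up):

```python
import math

def checkpoint_interval(mtbi_s, checkpoint_cost_s):
    """Young's approximation: the optimal time between checkpoints is
    sqrt(2 * (cost of one checkpoint) * (mean time before interrupt))."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbi_s)

# With a 24-hour MTBI and a 5-minute checkpoint, checkpoint roughly every 2 hours:
t = checkpoint_interval(24 * 3600, 5 * 60)
print(round(t / 3600, 2))  # ~2.0 (hours)
```

Checkpointing much more often than this wastes time writing state; much less often risks losing too much work per interrupt.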
…specified for the view:

- Sqrt (1): square root of the argument.
- Stdev (3): standard deviation calculation.
- Percent (2): percent the first argument is of the second.
- Condexp (3): C-style conditional expression: first argument ? second argument : third argument.
- Header (2): use the first argument as a column header for the display of the second.

Note: Integer and floating constants are supported as arguments, as are the metric keywords associated with the experiment view. Arguments to these functions can be <metric_expressions>, with the exception of the first argument of Header. The first argument of Header must be a character string that is preceded and followed by a double quotation mark (").

When the -v summary option is used, it is not generally possible to produce a meaningful column summary. A summary is produced for Add, Max, Min, Percent, A_Add, A_Max, and A_Min.

Examples:

expview hwc -m count, Header("percent of counts", Percent(count, A_Add(count))) -v summary
expview mpi -v butterfly -f MPI_Alltoallv -m time, Header("average time", Div(Mult(time, 1000), counts))

To examine an example, we take the default view (the expview command) and add the capability to show the percentage that each function contributes to the total. Add the header by using the Header phrase to create a header for the new data column that is being added. Use the Percent phrase to create the arithmetic expression that divides the PAPI_L1_DCM counts (count) for each function…
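As an illustration only (hypothetical Python, not part of Open|SpeedShop), the first example above composes Header, Percent, and A_Add roughly like this:

```python
# Toy emulation of: Header("percent of counts", Percent(count, A_Add(count)))
# The counter values below are made up for illustration.
counts = {
    "hypre_SMGResidual": 3630,
    "hypre_CyclicReduction": 2860,
    "hypre_SemiRestrict": 280,
}

total = sum(counts.values())  # A_Add(count): aggregate the count metric over all rows

percent_of_counts = {         # Percent(count, total): per-row percentage of the total
    func: 100.0 * c / total for func, c in counts.items()
}

for func, pct in percent_of_counts.items():
    print(f"{pct:6.2f}  {func}")  # "percent of counts" column
```

The new column sums to 100% across all rows, which is why the -v summary option can produce a meaningful total for Percent expressions.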
…illustrates that double-clicking on a line of statistical information in the Stats Panel will focus the Source Panel at the line of source representing the performance information and annotate the source with that information. Note the hot-to-cold color highlighting of the source: the higher the performance values, the hotter the color. Red is the hottest color, so source highlighted in red is taking the most time in the program being profiled.

[Screenshot: annotated Source Panel with per-line performance values; the status line reads "Process Loaded: Click on the Run button to begin the experiment".]

12 Special System Support

12.1 Cray and Blue Gene

When shared library support is limited, the normal manner of running experiments in Open|SpeedShop doesn't work: you must link the collectors into the static executable. Currently Open|SpeedShop has static support on the Cray and Blue Gene P/Q platforms. You must relink the application with the osslink command to add support for the collectors. The osslink command is a script that will help with linking; calls to it are usually embedded inside an application's makefiles. The user generally needs to find the target that creates the actual static executable and create a collector target that links in the selected collector. The following is an example for re-linking the smg2000 application:

smg2000: smg2000.o
	@echo "Linking smg2000 …"
[Screenshot: Load Balance view for sweep3d.mpi (Executable: sweep3d.mpi; Hosts/Processes/Ranks/Threads listed in the metadata). Columns: Min Exclusive CPU time in seconds, Max Exclusive CPU time in seconds, Average Exclusive CPU time, and Function (defining location); the hottest function is sweep_ (sweep3d.mpi: sweep.f), followed by elan-library progress routines.]

Below we see the creation of a comparison between two ranks in Open|SpeedShop.

[Screenshot: Stats Panel comparison view of a usertime experiment on sweep3d.mpi; the view consists of columns for rank 0 and rank 1, with sweep_ (sweep3d.mpi: sweep.f) at the top of the report.]
…and the percentage of the overall memory-function time that was spent in each of the memory functions. The paths to each memory call through the source are available through the call path views.

[Screenshot: memory experiment view listing the traced memory functions (e.g. __libc_malloc) with their times and percentages.]

In this C+ icon call path view, we see the call paths to the memory functions called in this application.

[Screenshot: call path view showing the paths through the application source to the memory functions.]

In the view below, one has chosen the LB icon and generated the load balance view. This view shows the min, max, and average time across all the ranks in the application. The ranks holding the min and max time values are also shown. If there is a significant difference between the min, max, and average time, there may be load imbalance. To identify the ranks, threads, or processes that are acting out of balance, use the cluster analysis view.
…is run in a temporary location that is not available when the raw data information is being converted into the Open|SpeedShop database file.

13 Setup and Build for Open|SpeedShop

Open|SpeedShop is set up to work with the AMD Opteron or Athlon and the Intel x86, x86-64, and Itanium 2 architectures. It has been tested on many Linux distributions, including SLES, SUSE, RHEL, Fedora Core, CentOS, Debian, Ubuntu, and many others. It has been installed on the IBM Blue Gene P/Q and the Cray XT/XE/XK systems. The Open|SpeedShop website contains information on special builds and usage instructions.

The source code for Open|SpeedShop is available for download at the Open|SpeedShop project home on SourceForge: http://sourceforge.net/projects/openss. CVS access is available at: http://sourceforge.net/scm/?type=cvs&group_id=176777. Packages and additional information can be found on the Open|SpeedShop website: http://www.openspeedshop.org.

13.1 Open|SpeedShop Cluster Install

Open|SpeedShop comes with a set of bash install scripts that will build Open|SpeedShop and any components it needs from source tarballs. First, the script will check whether the correct supporting software is installed on your system; if the needed software isn't installed, it will ask to build it for you. The only thing you need to do is provide a few arguments for the install script. For a normal setup you would just specify the directory to install in, what build task you want to
[CLI output: per-function view of an experiment on the NAS bt.W.x benchmark (functions such as solve, matmul_sub, compute_rhs, binvcrhs, exact_solution, initialize, and GOMP/libgomp progress routines); the values in this copy are too garbled to reproduce.]

openss>>

Then we look at a cluster analysis view based on functions.

[Screenshot: Stats Panel Comparative Analysis (CA) report for pcsamp data on bt.W.x, showing the Average Exclusive CPU time in seconds across the ranks/threads for each function; the Process Control bar shows "Process Loaded: Click on the Run button to begin the experiment".]
…work, while others will have a full workload. With adaptive grid models, some ranks need to redefine their mesh while others don't. With N-body simulations, some work migrates to other ranks, so those ranks will have more to do while the others have less. Performance analysis can help you with load balancing and an even distribution of work.

Tools like Open|SpeedShop are designed to work on parallel jobs. It supports threading and message passing, and automatically tracks all ranks and threads during execution. It can also store the performance information per process, rank, or thread for individual evaluation. All of the experiments for Open|SpeedShop can be run on parallel jobs; collectors are applied to all ranks on all nodes. The results of an experiment can be displayed as an aggregation across all ranks or threads (the default view), or you can select individual ranks or threads, or groups of them, to view. There are also experiments specifically designed for tracing MPI function calls.

Open|SpeedShop has been tested with a variety of MPI versions, including Open MPI, MVAPICH2, and MPICH2, on Intel, Blue Gene, and Cray systems. Open|SpeedShop is able to identify the MPI task rank information through the MPIR interface for the online version, or through a PMPI preload for the offline version. To run MPI code with Open|SpeedShop, just include the MPI launcher as part of the executable as normal; below are several examples:

> ossmpi "mpirun -np 128 sweep3d.mpi"
…tell us about the code performance. The set of useful metrics that can be calculated for functions includes:

- FLOPS / Memory Ops (FMO): we would like this to be large, which would imply good data locality. Also called Computational Intensity, or Ops/Refs.
- FLOPS / Cycle (FPC): large values for floating-point-intensive codes suggest efficient CPU utilization.
- Instructions / Cycle (IPC): large values suggest good balance with minimal stalls.

[Table: hardware-counter metrics for simple single-CPU math kernels -- y = A*x + y as an i,j loop nest, as a k,j,i loop nest, and via DGEMV, plus C = A*B via DGEMM -- on the AMD "Budapest" processor; the column values are garbled in this copy.]

The following table shows single-CPU hardware counters for simple math kernels using the AMD Budapest processor; other useful hwc metrics are also shown.

[Table: Intensity (Ops/Ref) and MFLOPS (PAPI) for four codes -- 3D Fast Fourier Transforms (256x256x256), Matrix Multiplication (500x500), QR Factorization (N=2350), and the HPCCG linear-system solver / sparse MV (100x100x100); the values are garbled in this copy.]

6.1 Using the Hardware Counter Experiments to Find Bottlenecks

6.1.1 How to find memory bandwidth bottlenecks using O|SS hwc experiments

TBD

6.1.2 How to find memory cache usage issues using O|SS hwc experiments

TBD

6.1.3 How to find load/store imbalance using O|SS hwc experiments

TBD

7 I/O Tracing and I/O Profiling
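A minimal sketch of deriving these ratios from raw counter values (the numbers below are hypothetical, not taken from the tables above; in Open|SpeedShop the inputs would come from hwc/hwcsamp counters such as PAPI_FP_OPS and PAPI_TOT_CYC):

```python
# Hypothetical raw counts for one function of interest.
fp_ops   = 2_000_000_000  # floating-point operations (e.g. PAPI_FP_OPS)
mem_refs =   500_000_000  # memory references (loads + stores)
cycles   = 4_000_000_000  # total cycles (e.g. PAPI_TOT_CYC)
instrs   = 6_000_000_000  # completed instructions (e.g. PAPI_TOT_INS)

fmo = fp_ops / mem_refs   # FLOPS / Memory Ops: computational intensity
fpc = fp_ops / cycles     # FLOPS / Cycle
ipc = instrs / cycles     # Instructions / Cycle

print(f"FMO={fmo:.2f}  FPC={fpc:.2f}  IPC={ipc:.2f}")  # FMO=4.00  FPC=0.50  IPC=1.50
```

A low FMO points at poor data locality, a low FPC at an underused floating-point unit, and a low IPC at stalls; which counter pair to collect first depends on which of these you suspect.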
[Screenshot continuation: per-rank comparison columns for functions such as elan4_pollevent_word (libelan3.so), source_ (sweep3d.mpi: source.f), elan_pollWord (libelan.so), and elan_progressFragList (libelan.so), with CPU times for each rank.]

8.1 MPI Tracing Experiment

In this section we will go through an MPI tracing experiment with Open|SpeedShop. The experiment will be similar to the I/O tracing experiment: it will record all MPI call invocations. There are two MPI experiments and associated convenience scripts: ossmpi, which will record call times, and ossmpit, which will record call times and arguments. Equal events will be aggregated to save space in the database as well as to reduce the overhead. There is one more MPI experiment that will save the full MPI traces in the Open Trace Format (OTF), with the convenience script ossmpiotf.

Again, we will run the experiment on the smg2000 application. The syntax for the experiment is:

> ossmpi[t] "srun -N 4 -n 32 smg2000 -n 50 50 50" [default | <list of MPI functions> | mpi_category]

The default behavior is to trace all MPI functions, but a comma-separated list of MPI functions can be given if you only want to trace specific functions (e.g. MPI_Send, MPI_Recv, etc.). You can also select an mpi_category to trace: all, asynchronous_p2p, collective_com, datatypes, environment, graphs_contexts_comms, persistent_com, process_topologies, and synchronous_p2p.
Below we show basic examples of how to use the GUI to view the output database file created by the convenience script.

[Screenshot: Open|SpeedShop pc Sampling GUI. The Process Control bar offers Run, Cont, Pause, Update, and Terminate; the status reads "Process Loaded: Click on the Run button to begin the experiment". The Stats Panel shows the Functions report for smg2000 (Host: localhost.localdomain; Processes/Ranks/Threads: 2): hypre_SMGResidual 3.63 s (43.06%), hypre_CyclicReduction 2.86 s (33.93%), hypre_SemiRestrict 0.28 s (3.32%), hypre_SemiInterp 0.21 s (2.49%), opal_progress 0.15 s (1.78%), and mca_btl_sm_component_progress 0.10 s (1.19%). A Command Panel with an openss>> prompt sits below the Stats Panel.]
[Screenshot: MPIT Experiment Event List dialog. With -v trace only, the optional per-event fields are: Individual Event Start Times, Individual Event Stop Times, Source Rank Numbers, Destination Rank Numbers, Message Tag Values, Communicator Used Values, Message Data Type Values, and Function-Dependent Return Values.]

After choosing the events to view, they will then be displayed.

[Screenshot: trace view showing call paths ending in MPI_Waitall (libmpi.so: pmpi_waitall) for several events.]

8.1.1 MPI Tracing Experiments performance data gathering

Much of this information is described above in the main MPI Tracing Experiments section, but for completeness, this is the convenience script description for running the MPI-specific tracing experiments:

> ossmpi[t] "srun -N 4 -n 32 smg2000 -n 50 50 50" [default | <list of MPI functions> | mpi_category]

8.1.2 MPI Tracing Experiments performance data viewing with GUI

To launch the GUI on any experiment, use: openss -f <database name>
[Screenshot continuation: the same Functions report (Host: localhost.localdomain; Processes/Ranks/Threads: 2), with columns Exclusive CPU time in seconds and % of CPU Time.]

You can choose to view data for Functions, Statements, or Linked Objects. To switch from one view type to another, first select the view granularity (Function, Statement, or Linked Object), then select the type of view. For the default views, select the D icon.

[Screenshot: Stats Panel after switching views, showing per-statement times such as 29.66 s, 10.32 s, 7.59 s, and 7.12 s.]

You can manipulate the windows within the GUI and double-click functions or statements to see the source code directly.

[Screenshot: Source Panel focused on the hypre box loop source (hypre_BoxLoop macros in smg_residual.c), annotated with per-line times.]
…method is code integration, or instrumentation of the source with performance probes. This method allows a much finer-grained analysis; however, it can be hard to maintain and requires significant advance knowledge of what information to measure and record. An alternative to the simple-and-coarse-grained or complex-and-fine-grained approaches is the use of performance analysis tools. Performance tools enable fine-grained analysis that can be related back to the source code, and they work universally across applications.

There are two ways performance analysis tools gather information from applications. One way is through statistical sampling, which periodically interrupts the execution of the program to record its location. Statistical distributions across all locations are reported, and data is typically aggregated over time. Time is the most common metric, but other metrics are possible. Statistical sampling is useful to get an overview of the application's performance, as it provides low and uniform overhead.

Event tracing is another way for performance analysis tools to gather information. In this case the tool can gather and store individual application events, for example function invocations, MPI messages, or I/O calls. The events recorded are typically time-stamped and provide detailed per-event information. This method can lead to huge data volumes and higher, potentially bursty, overhead.

There are a number of different performance analysis tools, so how do you select one?
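As an illustration of the two approaches (hypothetical toy code, unrelated to Open|SpeedShop's implementation), here is a minimal event-tracing profiler that records time-stamped function-call events using Python's sys.setprofile hook; a sampling profiler would instead wake up on a timer and record only the current program location:

```python
import sys
import time

events = []  # (timestamp, event, function name) -- grows with every call!

def tracer(frame, event, arg):
    # Event tracing: record *every* Python-level call/return, time-stamped.
    if event in ("call", "return"):
        events.append((time.time(), event, frame.f_code.co_name))

def work():
    return sum(i * i for i in range(100))

sys.setprofile(tracer)   # install the event hook
work()
sys.setprofile(None)     # remove it

calls = [name for _, ev, name in events if ev == "call"]
print("recorded", len(events), "events; calls include:", sorted(set(calls)))
```

Even this tiny run records many events (the generator expression re-enters a frame per element), which is exactly the data-volume problem the text describes; sampling would have recorded only a handful of periodic observations.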
…mos/mpi/openmpi-1.4.2/smg2000/test/smg2000-pcsamp-1.openss

[openss]: The restored experiment identifier is: -x 1

  Exclusive CPU time   % of CPU Time  Function (defining location)
  in seconds
  3.630000000          43.060498221   hypre_SMGResidual (smg2000: smg_residual.c,152)
  2.860000000          33.926453144   hypre_CyclicReduction (smg2000: cyclic_reduction.c,757)
  0.280000000           3.321470937   hypre_SemiRestrict (smg2000: semi_restrict.c,125)
  0.210000000           2.491103203   hypre_SemiInterp (smg2000: semi_interp.c,126)
  0.150000000           1.779359431   opal_progress (libopen-pal.so.0.0.0)
  0.100000000           1.186239620   mca_btl_sm_component_progress (libmpi.so.0.0.2)
  0.090000000           1.067615658   hypre_SMGAxpy (smg2000: smg_axpy.c,27)
  0.080000000           0.948991696   ompi_generic_simple_pack (libmpi.so.0.0.2)
  0.070000000           0.830367734   __GI_memcpy (libc-2.10.2.so)
  0.070000000           0.830367734   hypre_StructVectorSetConstantValues (smg2000: struct_vector.c,537)
  0.060000000           0.711743772   hypre_SMG3BuildRAPSym (smg2000: smg3_setup_rap.c,233)

When the application completes, a default report will be printed on screen. The performance information gathered during execution of the experiment will be stored in a database called smg2000-pcsamp.openss. You can use the Open|SpeedShop GUI to analyze the data in detail: run the openss command to load that database file, or open the file directly using the -f option:

> openss -f smg2000-pcsamp.openss
You can run osshwcsamp and use a formula to create a reasonable threshold. Any counter reported by "papi_avail -a" that is not derived is available for use. You can also see the available counters by using the osshwc or osshwctime commands with no arguments. Native counters are listed in the PAPI documentation.

  PAPI Name     Description                              Threshold
  PAPI_L1_DCM   L1 data cache misses                     high
  PAPI_L2_DCM   L2 data cache misses                     high/medium
  PAPI_L1_DCA   L1 data cache accesses                   high
  PAPI_FPU_IDL  Cycles in which FPUs are idle            high/medium
  PAPI_STL_ICY  Cycles with no instruction issue         high/medium
  PAPI_BR_MSP   Mispredicted branches                    medium/low
  PAPI_FP_INS   Number of floating point instructions    high
  PAPI_LD_INS   Number of load instructions              high
  PAPI_VEC_INS  Number of vector/SIMD instructions       high/medium
  PAPI_HW_INT   Number of hardware interrupts            low
  PAPI_TLB_TL   Number of TLB misses                     low

Note: the Threshold indications are just rough guidance and are dependent on the application. Also remember that not all counters exist on all platforms; run osshwc with no arguments to see the hardware counters available. In the sections below we show the output from the osshwc experiment; note that the default counter is total cycles.

5.2.1 Hardware Counter Threshold (hwc) experiment performance data gathering

The hardware counter threshold experiment convenience script is osshwc. Use this convenience script in this manner to gather counter values for one unique counter…
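The "formula" for a reasonable threshold mentioned above can be approximated as follows. This is a hypothetical sketch, not Open|SpeedShop's internal calculation: since one sample is recorded every `threshold` occurrences of the counter, pick the threshold so that the run yields a manageable number of samples.

```python
def suggest_threshold(expected_events, target_samples=10_000):
    """Rough threshold estimate: samples ~= expected_events / threshold,
    so threshold ~= expected_events / target_samples (at least 1)."""
    return max(1, expected_events // target_samples)

# e.g. roughly 2e10 L1 data-cache misses expected over the run:
print(suggest_threshold(20_000_000_000))  # 2000000
```

A counter that fires rarely (like PAPI_HW_INT in the table) thus wants a low threshold, while a counter that fires constantly (like PAPI_L1_DCA) wants a high one, which matches the guidance column above.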
Performance analysis, also called software profiling or performance tuning, is not only a way to measure the speed and efficiency of a program but also a way to identify bottlenecks in parallel applications. Software developers are facing new issues when writing code for massively parallel applications: there may be issues in code that do not become apparent until it is run on thousands, tens of thousands, or hundreds of thousands of cores. Performance analysis can be used to identify problems and tune applications for optimal speed and efficiency.

There are many aspects of a program that can be measured in order to analyze its performance. You can measure the time each function takes, or the call paths within an application. There are a number of hardware counters available, such as the number of floating-point operations performed or the number of data cache misses. You can monitor the I/O operations of a program to analyze its interaction with the file system.

Not only are there many possible things to measure about a program, there are also different ways to measure them. You can instrument your program by adding performance routines to the source code, you can have a performance tool periodically take samples from the program as it runs, or you can preload certain library functions to monitor those calls. There are a number of different performance tools that can help you measure these different performance aspects…
…Open|SpeedShop has been tested on a variety of Linux clusters, and supports Cray and Blue Gene systems.

3.1 Basic Concepts: Interface / Workflow

[Figure: Open|SpeedShop workflow -- the GUI, CLI, and Python interfaces sit on top of the open-source instrumentation layer, which runs the experiments on the application.]

Open|SpeedShop has three ways for the user to examine the results of a performance test (called an experiment): a GUI, a command-line interface, or Python libraries. The user can also start experiments using those three options, or by an additional method: the command-line-launched convenience scripts. For example, to launch the convenience script for the pcsamp experiment (Program Counter Sampling), the user executes the command osspcsamp "<application>", where <application> is the executable under study along with any arguments. The convenience script will then create a database for the results of that experiment.

The user can examine any database in the GUI with the command openss -f <db file>. The GUI will provide some simple graphics to help you understand the results and will relate the data back to the source code when possible.

3.1.1 Common Terminology

Technical terms can have multiple and/or context-sensitive meanings; therefore, this section attempts to explain and clarify the meanings of the terms used in this document, especially with respect to the Open|SpeedShop tools.

Experiment: A set of collectors and executables bound together
…are also available in the StatsPanel toolbar. The StatsPanel toolbar is provided as a convenience. The following is a quick overview of the toolbar options. The contents of the toolbar vary by experiment, because some options don't make sense for all experiments. The following list describes the icons and the functionality they represent.

- I (Information): Shows the metadata for the experiment. Information such as the experiment type, processes, ranks, threads, hosts, and other experiment-specific information is displayed.
- U (Update): Updates the information in the StatsPanel display. This can be used to display any new data that may have come from the nodes on which the application is running.
- CL (Clear auxiliary information): If the user has chosen a time segment of the performance data, or a specific function to view the data for, this option clears those settings and allows the next view selection to show data for the entire program again.
- D (Default View): Shows the performance results based on the view choice granularity selection.
- S (Statements per Function): Shows the performance results related back to the source statements in the application for the selected function. Highlight a function in the StatsPanel and click on this icon.
- C+ (Call paths w/o coalescing): Shows all the calling paths in this application. Duplicate paths will not be coalesced…
…generate performance information that can be viewed in human-readable form.

Focused Experiment: The current experiment that commands operate on. The user may run or view multiple experiments simultaneously; unless a particular experiment is specified directly, the focused experiment will be used. Experiments are given an enumeration, called an experiment id, for identification.

Component(s): A component is a somewhat self-contained code section of the Open|SpeedShop performance tool. This section of code does a set of specific, related tasks for the tool. For example, the GUI component does all the tasks related to displaying Open|SpeedShop wizards, experiment creation, and results using a graphical user interface. The CLI component does similar functions, but uses the interactive command-line delivery method.

Collector: The portion of the tool containing logic that is responsible for the gathering of the performance metric. A collector is a portion of the code that is included in the experiment plugin.

Metric: The measurement which the collector (experiment) is gathering. A metric could be a time, an occurrence counter, or another entity which reflects in some way on the application's performance; it is gathered by a performance experiment at application runtime, directly by the collector.

Offline: A link-override mechanism that allows for gathering performance data, using libMonitor to link the Open|SpeedShop performance-data-gathering software components
10.1.1 osscompare metric argument ............................................ 84
10.1.2 osscompare rows of output argument .................................... 85
10.1.3 osscompare output name argument ....................................... 85
10.1.4 osscompare view type or granularity argument .......................... 86
11 Open|SpeedShop User Interfaces ............................................ 86
11.1 Command Line Interface Basics ........................................... 86
11.1.2 CLI Metric Expressions and Derived Types .............................. 88
11.2 CLI Batch Scripting ..................................................... 89
11.3 Python Scripting ........................................................ 90
11.4 MPI_Pcontrol Support .................................................... 90
11.5 Graphical User Interface Basics ......................................... 90
11.5.1 Basic Initial View (Default View) ..................................... 90
11.5.1.1 [entry illegible in this copy] ...................................... 91
11.5.1.2 View Display Choice Selection
15.15 Key Environment Variables .............................................. 105
16 Hybrid (openMP and MPI) Performance Analysis .............................. 107
16.1 Focus on individual Rank to get Load Balance for Underlying Threads ..... 108
16.2 Clearing Focus on individual Rank to get back to default behavior ....... 110

Why do I need Performance Analysis?

- Where are the bottlenecks in my program?
- My parallel application works fine on 10 nodes, but on 1,000 nodes it slows to a crawl. What's happening?
- Is my parallel program scalable?
- Is my program optimized for running on this new system?
- Are these new libraries faster than the old version?

All these questions can be answered by using performance analysis.

About this Manual

This manual will provide you with a basic understanding of performance analysis. You will learn how to plan and run Open|SpeedShop performance experiments on your applications. This manual intends to give users an understanding of the general experiments available in Open|SpeedShop that can be used to analyze application code. There is extensive information provided about how to use the Open|SpeedShop experiments and how to view the performance information in informative ways. Hopefully this will allow users to start optimizing and analyzing the performance of application code. Open|SpeedSh…
7.3.1 I/O Base Tracing (io) experiment ....................................... 52
7.3.1.1 I/O Base Tracing (io) experiment performance data gathering .......... 52
7.3.1.2 I/O Base Tracing (io) experiment performance data viewing with CLI ... 52
7.3.1.3 I/O Base Tracing (io) experiment performance data viewing with GUI ... 53
7.3.2 I/O Extended Tracing (iot) experiment .................................. 53
7.3.2.1 I/O Extended Tracing (iot) experiment performance data gathering ..... 53
7.3.2.2 I/O Extended Tracing (iot) experiment performance data viewing with GUI .. 53
7.3.2.3 I/O Extended Tracing (iot) experiment performance data viewing with CLI .. 56
7.4 Open|SpeedShop Lightweight I/O Profiling General Usage ................... 57
7.4.1 I/O Profiling (iop) experiment performance data gathering .............. 57
7.4.2 I/O Profiling (iop) experiment performance data viewing with GUI ....... 57
7.4.3 I/O Profiling (iop) experiment performance data viewing with CLI ....... 59
8 Applying Experiments to Parallel Codes ..................................... 62
8.1 MPI Tracing Experiment ................................................... 64
8.1.1 MPI Tracing Experiments performance data gathering ..................... 73
8.1.2 MPI Tracing Experiments performance data viewing with GUI .............. 73
8.1.3 MPI Tracing Experiments performance data viewing with CLI
15.2 Convenience Scripts ..................................................... 101
15.3 Report and Database Creation ............................................ 101
15.4 osscompare: Compare Database Files ...................................... 101
15.5 osspcsamp: Program Counter Experiment ................................... 102
15.6 ossusertime: Call Path Experiment ....................................... 102
15.7 osshwc, osshwctime: HWC Experiments ..................................... 103
15.8 osshwcsamp: HWC Experiment .............................................. 103
15.9 ossio, ossiot: I/O Experiments .......................................... 103
15.10 ossmpi, ossmpit: MPI Experiments ....................................... 104
15.11 ossfpe: FP Exception Experiment ........................................ 104
15.12 ossmem: Memory Analysis Experiment ..................................... 105
15.13 osspthread: POSIX Thread Analysis Experiment ........................... 105
15.14 osscuda: NVIDIA CUDA Tracing Experiment ................................ 105
to the user application. For the Open|SpeedShop offline mode of operation, the application must be run from start-up to completion. The performance results may be viewed after the application terminates normally.

Param: Each collector allows the user to set certain values that control the way the collector behaves. The parameter, or param, may cause the collector to perform various operations at certain time intervals, or it may cause the collector to measure certain types of data. Although Open|SpeedShop provides a standard way to set a parameter, it is up to the individual collector to decide what to do with that information. Detailed documentation about the available parameters is part of each collector's documentation.

Framework: The set of API functions that allows the user interface to manage the creation and viewing of performance experiments. It is the interface between the user interface and the cluster support and dynamic instrumentation components.

Plugin: A portion (library) of the performance tool that can be loaded and included in the tool at tool start-up time. A plugin is developed against a tool-specific interface (API), so that the plugin and the tool it is to be included in know how to interact with each other. Plugins are normally placed in a specific directory, so that the tool knows where to find them.

Target: This is the application, or part of the application, on which one is running the experiment.
(Figure: the Stats Panel Per Event Report for the iot experiment, 1 host, 3 processes/ranks/threads, listing each traced I/O call, for example __libc_write and __libc_read in libpthread-2.5.so, as an individual time-stamped event. A graphical trace view of the same data, generated with another tool, shows the serialization of the I/O for each PE.)

By default, the list of I/O functions to trace is all of them; the specific functions are: creat, creat64, dup, dup2, lseek, lseek64, open, open64, pipe, pread, pread64, pwrite, pwrite64, read, readv, write, and writev.

Things to remember with I/O:
- Avoid writing to one file from all MPI tasks.
- If you need to do it, make sure the distinct offsets for each PE start at a stripe boundary, and use buffered I/O if you must do this.
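Because the traced-function list defaults to all of the functions above, restricting it only requires appending a comma-separated subset to the convenience script's arguments. The following dry-run sketch simply composes and prints the command, so nothing is executed; IOR and the srun options are illustrative.

```shell
# Compose an ossio invocation that traces only the read/write family of
# I/O functions instead of the full default list. The command line is only
# printed here for inspection, not executed.
FUNCS="read,readv,write,writev"      # subset of the traceable I/O functions
APP_LAUNCH="srun -n 512 IOR"         # illustrative MPI launch line
CMD="ossio \"${APP_LAUNCH}\" ${FUNCS}"
echo "${CMD}"
```

Running the printed command would then record events only for those four functions.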
(Figure: the StatsPanel default mpit view; the status line reads "Process Loaded: Click on the Run button to begin".)

Exclusive MPI Call  % of       Number    Function (defining location)
Time (ms)           Total      of Calls
2357.121579         70.993724  100       MPI_Waitall (libmpi.so.0.0.0: pwaitall.c,39)
733.420986          22.240219  100       MPI_Allreduce (libmpi.so.0.0.0: pallreduce.c,40)
215.049948          6.477007   2         MPI_Init (libmpi.so.0.0.0: pinit.c,41)
6.935016            0.108873   2         MPI_Finalize (libmpi.so.0.0.0: pfinalize.c,35)
1.032697            0.031193   100       MPI_Isend (libmpi.so.0.0.0: pisend.c,49)
0.589067            0.029739   2         MPI_Scatterv (libmpi.so.0.0.0: pscatterv.c,40)
0.373066            0.011236   100       MPI_Irecv (libmpi.so.0.0.0: pirecv.c,39)
0.283167            0.008529   2         MPI_Gatherv (libmpi.so.0.0.0: pgatherv.c,4)

You can use the views dialog box to choose which metrics to display: it offers a number of optional fields (columns) to include in the creation of a new view of the existing data. Use the Optional Views dialog box to choose the performance metrics to be displayed in the StatsPanel and click OK. Clicking OK will regenerate the StatsPanel with the new metrics displayed.

(Figure: the MPI Experiment Custom Report Selection dialog, with check boxes for: MPIT Exclusive Time Values, MPIT Inclusive Time Values, MPIT Minimum Time Values, MPIT Maximum Time Values, MPIT Average Time Values, MPIT Count (Calls To Function), MPIT Exclusive Time Percentage Values, MPIT Standard Deviation Values, and MPIT Message Size Values.)
of CUDA events, together with each event's arguments, can be displayed.

8.3.1 NVIDIA CUDA Tracing (cuda) experiment performance data gathering

To run the NVIDIA CUDA experiment, use the osscuda convenience script and specify the CUDA application as an argument. If there are no arguments to the application, then no quotes are necessary, but they are shown here for consistency. The osscuda script will run the experiment by running the QTC application, and will create an Open|SpeedShop database file with the results of the experiment. The performance information can then be viewed with the GUI or CLI.

osscuda "QTC"

8.3.2 NVIDIA CUDA Tracing (cuda) experiment performance data viewing with GUI

This section shows the default view for the NVIDIA CUDA experiment for the QTC application. Use the following command to open the GUI to see the QTC CUDA experiment performance information. To launch the GUI on any experiment, use: openss -f <database name>

openss -f QTC-cuda.openss

(Figure: the default cuda experiment GUI view for QTC, listing exclusive call time, percent of total, and number of calls for functions such as QTC_device.)

The view below is the statistics panel and source view panel showing the relationship
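As a sketch of the full gather-then-view cuda workflow described above, the sequence looks like the following. The QTC application and the QTC-cuda.openss database name come from the text, but the exact database file name generated on your system may differ; the commands are only printed here, not executed.

```shell
# Step 1: run the cuda experiment (quotes are needed only when the
# application takes arguments, but are kept for consistency).
GATHER='osscuda "QTC"'
# Step 2: open the resulting database file in the GUI or the CLI.
VIEW_GUI='openss -f QTC-cuda.openss'
VIEW_CLI='openss -cli -f QTC-cuda.openss'
printf '%s\n%s\n%s\n' "$GATHER" "$VIEW_GUI" "$VIEW_CLI"
```

The same two-phase pattern (gather with a convenience script, then view) applies to every experiment type in this manual.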
op is a community effort by The Krell Institute, with current direct funding from the Department of Energy's National Nuclear Security Administration (DOE NNSA). It builds on a broad list of community-provided infrastructures, notably the Paradyn Project's Dyninst API and MRNet (Multicast Reduction Network) from the University of Wisconsin at Madison, the Libmonitor profiling tool, and the Performance Application Programming Interface (PAPI) from the University of Tennessee at Knoxville.

Open|SpeedShop is an open source, multi-platform Linux performance tool targeted to support performance analysis of applications running on both single-node and large-scale IA64, IA32, EM64T, AMD64, PPC, Blue Gene, and Cray XT/XE/XK platforms. Open|SpeedShop is explicitly designed with usability in mind, and is for application developers and computer scientists. The base functionality includes:

- Sampling Experiments
- Support for Call Stack Analysis
- Hardware Performance Counters
- MPI Profiling and Tracing
- I/O Profiling and Tracing
- Floating Point Exception Analysis
- Memory Function Tracing
- POSIX Thread Function Tracing
- NVIDIA CUDA Event Tracing

In addition, Open|SpeedShop is designed to be modular and extensible. It supports several levels of plug-ins, which allow users to add their own performance experiments. Open|SpeedShop development is hosted by the Krell Institute. The infrastructure and base components of Open|SpeedShop are released as open source
or.so.0.0.0: main.c,541)
>>258 in __libc_start_main (libc-2.12.so: libc-start.c,96)
>>>517 in monitor_main (libmonitor.so.0.0.0: main.c,492)
>>>>153 in main (IOR: IOR.c,108)
>>>>>2173 in TestIoSys (IOR: IOR.c,1848)
>>>>>>2611 in WriteOrRead (IOR: IOR.c,2562)
>>>>>>>251 in IOR_Xfer_POSIX (IOR: aiori-POSIX.c,224)
>>>>>>>>223 in read (iot-collector-monitor-mrnet-mpi.so: wrappers.c,137)
1055603.730633  34.924939  2048  >>>>>>>>>82 in __GI___read (libc-2.12.so: syscall-template.S,82)

_start (IOR)
>562 in __libc_start_main (libmonitor.so.0.0.0: main.c,541)
>>258 in __libc_start_main (libc-2.12.so: libc-start.c,96)
>>>517 in monitor_main (libmonitor.so.0.0.0: main.c,492)
>>>>153 in main (IOR: IOR.c,108)
>>>>>2004 in TestIoSys (IOR: IOR.c,1848)
>>>>>>104 in IOR_Create_POSIX (IOR: aiori-POSIX.c,74)
>>>>>>>670 in open64 (iot-collector-monitor-mrnet-mpi.so: wrappers.c,608)
103350.518692  3.419380  512  >>>>>>>>82 in __libc_open (libc-2.12.so: syscall-template.S,82)

_start (IOR)
>562 in __libc_start_main (libmonitor.so.0.0.0: main.c,541)
>>258 in __libc_start_main (libc-2.12.so: libc-start.c,96)
>>>517 in monitor_main (libmonitor.so.0.0.0: main.c,492)
output of two separate runs, using Lustre and NFS:

LU Fact time with Lustre: 1842 secs
LU Fact time with NFS:    2655 secs

From the final times, we see there is an 813-second penalty (more than 30%) if you do not use a parallel file system like Lustre. Most of the run-time difference, roughly 75% of the 813 seconds (about 605 seconds), is I/O.

(Table: load-balance comparison of the NFS run and the Lustre run, showing Min t (sec), Max t (sec), and Avg t (sec) per I/O function call in libpthread-2.5.so.)

7.2 Lustre Striping Commands

To set or get the Lustre file system (lfs) striping information, you can use the following commands:

> lfs setstripe -s <size> (bytes; k, M, or G suffix) -c <count> (-1 = all) -i <index> (-1 = round-robin) <file|directory>

Typical defaults for setstripe are -s 1M -c 4 -i -1, and are usually good to try first. File striping is set upon file creation.

> lfs getstripe <file|directory>

An example of getstripe use:

> lfs getstripe --verbose oss_lfs_strip_16 | grep stripe_count
stripe_count: 16  stripe_size: 1048576  stripe_offset: -1

If 1 PE writes, bandwidth is limited; with 1 file per process, bandwidth is enhanced; having a subset of the PEs do the I/O could be most optimal.

Using OOCORE I/O performance data and the __libc_read time from Open|SpeedShop, the following graph shows the output of an I/O experiment used to identify the optimal lfs striping, taken from the load balance view (max, min, and avg) for a 16-way pa
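The striping commands above can be wrapped in a small script. The following dry-run sketch only prints the lfs commands it would run, since a real run requires a Lustre mount; the directory path is illustrative and the stripe values mirror the typical defaults suggested above.

```shell
STRIPE_SIZE="1M"    # -s: stripe size (bytes, or k/M/G suffix)
STRIPE_COUNT=4      # -c: stripe count (-1 = use all OSTs)
STRIPE_INDEX=-1     # -i: starting index (-1 = round-robin)
TARGET_DIR="/lustre/scratch/run_dir"   # illustrative path
# Compose the commands; print them rather than executing them.
SETSTRIPE_CMD="lfs setstripe -s ${STRIPE_SIZE} -c ${STRIPE_COUNT} -i ${STRIPE_INDEX} ${TARGET_DIR}"
GETSTRIPE_CMD="lfs getstripe ${TARGET_DIR}"
echo "${SETSTRIPE_CMD}"
echo "${GETSTRIPE_CMD}"
```

Remember that striping takes effect at file-creation time, so the setstripe call must happen before the application writes its files.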
osshwcsamp "smg2000 -n 50 50 50"

Parallel job examples:
osshwcsamp "mpirun -np 128 smg2000 -n 50 50 50"
osshwcsamp "srun -N 32 -n 128 sweep3d.mpi" PAPI_L1_DCM,PAPI_L1_DCA 200

Additional arguments (default: events PAPI_TOT_CYC and PAPI_FP_OPS, sampling_rate 100):
<PAPI_event_list>: comma-separated PAPI event list
<sampling_rate>: integer value sampling rate

15.9 ossio, ossiot: I/O Experiments

General form: ossio[t] "<command> <args>" [ default | <f_t_list> ]

Sequential job example:
ossio[t] "smg2000 -n 50 50 50"

Parallel job example:
ossio[t] "mpirun -np 128 smg2000 -n 50 50 50"

Additional arguments (default: trace all I/O functions):
<f_t_list>: comma-separated list of I/O functions to trace, one or more of the following: close, creat, creat64, dup, dup2, lseek, lseek64, open, open64, pipe, pread, pread64, pwrite, pwrite64, read, readv, write, and writev

15.10 ossmpi, ossmpit: MPI Experiments

General form: ossmpi[t] "<mpirun> <mpiargs> <command> <args>" [ default | <f_t_list> ]

Parallel job example:
ossmpi[t] "mpirun -np 128 smg2000 -n 50 50 50"

Additional arguments (default: trace all MPI functions):
<f_t_list>: comma-separated list of MPI functions to trace, consisting of zero or more individual functions (MPI_Allgather ... MPI_Waitsome), and/or zero or more of the MPI group categories:

MPI Category              Argument
All MPI Functions
Collective Communicators  coll
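As with the I/O scripts, the MPI tracing scripts accept an optional comma-separated function list. The following dry-run sketch only composes and prints the command; the chosen function subset is illustrative (the names are MPI functions that appear in this manual's example reports).

```shell
# Trace only a few non-blocking point-to-point calls plus MPI_Waitall,
# rather than all MPI functions (the default). Printed, not executed.
MPI_FUNCS="MPI_Isend,MPI_Irecv,MPI_Waitall"
CMD="ossmpit \"mpirun -np 128 smg2000 -n 50 50 50\" ${MPI_FUNCS}"
echo "${CMD}"
```

Narrowing the traced set this way reduces both the collection overhead and the size of the resulting database file.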
p convenience scripts, only create the database file and do NOT put out the default report. This is used to reduce the size of the batch output files if the user is not interested in looking at the default report.

When running the Open|SpeedShop convenience scripts, only gather the performance information into the OPENSS_RAWDATA_DIR directory, but do NOT create the database file and do NOT put out the default report.

OPENSS_DB_DIR: Specifies the path where O|SS will build the database file. On a file system without file locking enabled, the SQLite component cannot create the database file. This variable is used to specify a path to a file system with locking enabled for the database file creation. This is usually needed on Lustre file systems that don't have locking enabled.
    OPENSS_DB_DIR="file system path". Example: export OPENSS_DB_DIR=/opt/filesys/userid

OPENSS_MPI_IMPLEMENTATION: Specifies the MPI implementation in use by the application; only needed for the mpi, mpit, and mpiotf experiments. These are the currently supported MPI implementations: openmpi, lampi, mpich, mpich2, mpt, lam, mvapich, mvapich2. For Cray, IBM, and Intel MPI implementations, use mpich2.
    OPENSS_MPI_IMPLEMENTATION="MPI impl name". Example: export OPENSS_MPI_IMPLEMENTATION=openmpi
    In most cases O|SS can auto-detect the MPI in use.

16 Hybrid OpenMP and MPI Performance Analysis

For this example tutorial, we have run the Open|SpeedShop convenience script on the NPB-MZ BT program and created a
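A typical pre-run environment setup using the variables documented above might look like the following sketch. The paths are illustrative, and on most systems the MPI implementation is auto-detected, so the second export is often unnecessary.

```shell
# Put the database file on a file system with locking enabled (SQLite needs
# it), and name the MPI implementation explicitly for mpi/mpit/mpiotf runs.
export OPENSS_DB_DIR=/opt/filesys/userid
export OPENSS_MPI_IMPLEMENTATION=openmpi
printf 'OPENSS_DB_DIR=%s\n' "$OPENSS_DB_DIR"
printf 'OPENSS_MPI_IMPLEMENTATION=%s\n' "$OPENSS_MPI_IMPLEMENTATION"
```

These exports need to be in place before the convenience script is launched, since the collectors read them at start-up.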
osspcsamp "mpirun -np 128 smg2000 -n 50 50 50"

Additional arguments:
high: twice the default sampling rate (samples per second)
low: half the default sampling rate
default: the default sampling rate is 100
<sampling rate>: integer value sampling rate

15.6 ossusertime: Call Path Experiment

General form: ossusertime "<command> <args>" [ high | low | default | <sampling rate> ]

Sequential job example:
ossusertime "smg2000 -n 50 50 50"

Parallel job example:
ossusertime "mpirun -np 64 smg2000 -n 50 50 50"

Additional arguments:
high: twice the default sampling rate (samples per second)
low: half the default sampling rate
default: the default sampling rate is 35
<sampling rate>: integer value sampling rate

15.7 osshwc, osshwctime: HWC Experiments

General form: osshwc[time] "<command> <args>" [ default | <PAPI_event> <PAPI threshold> ]

Sequential job example:
osshwc[time] "smg2000 -n 50 50 50"

Parallel job example:
osshwc[time] "mpirun -np 128 smg2000 -n 50 50 50"

Additional arguments (default: event PAPI_TOT_CYC, threshold 10000):
<PAPI_event>: PAPI event name
<PAPI threshold>: PAPI integer threshold

15.8 osshwcsamp: HWC Experiment

General form: osshwcsamp "<command> <args>" [ default | <PAPI_event_list> <sampling_rate> ]

Sequential job example:
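To make the osshwc/osshwctime form above concrete, here is a dry-run sketch composing an overflow-sampling run with an explicit event and threshold; the values mirror the documented defaults, and the command is printed rather than executed.

```shell
PAPI_EVENT="PAPI_TOT_CYC"   # default event per the reference above
THRESHOLD=10000             # default overflow threshold
CMD="osshwc \"mpirun -np 128 smg2000 -n 50 50 50\" ${PAPI_EVENT} ${THRESHOLD}"
echo "${CMD}"
```

Lowering the threshold records samples more often (higher overhead, finer attribution); raising it does the opposite.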
pe, creat, and others. Shows call paths for each unique I/O call path.

iop: Lightweight I/O profiling. Accumulated wall-clock durations of I/O system calls (read, readv, write, writev, open, close, dup, pipe, creat, and others), but individual call information is not recorded.

iot: Similar to io, except that more information is gathered, such as bytes moved, file names, etc.

mpi: Captures the time spent in, and the number of times, each MPI function is called. Shows call paths for each unique MPI call path.

mpit: Records each MPI function call event with specific data for display using a GUI or a command-line interface (CLI). The trace format option displays the data for each call, showing its start and end times.

mpiotf: Writes MPI call traces to Open Trace Format (OTF) files, to allow viewing with Vampir or converting to the formats of other tools.

fpe: Finds where each floating point exception occurred. A trace collects each exception with its exception type and the call stack contents. These measurements are exact, not statistical.

mem: Captures the time spent in, and the number of times, each memory function is called. Shows call paths for each memory function's unique call path.

pthreads: Captures the time spent in, and the number of times, each POSIX thread function is called. Shows call paths for each POSIX thread function's unique call path.

cuda: Captures the NVIDIA CUDA events that occur during the application execution and reports the time spent for each event, along with the arguments for
plication" <list of I/O function(s)>

The following is an example of how to gather data for the IOR application on a Linux cluster platform using the ossiot convenience script. It gathers performance data for all of the I/O functions, because there is no list of I/O functions specified after the quoted application run command:

ossiot "srun -n 512 IOR"

7.3.2.2 I/O Extended Tracing (iot) experiment performance data viewing with GUI

To launch the GUI on any experiment, use: openss -f <database name>

This is the default GUI view for the iot experiment. This view gives a summary of the I/O functions that were called, how many times they were called, and the amount of time spent in each function. The percentage of the total I/O time is also attributed to each I/O function. The time is aggregated (totaled) across all the threads, ranks, or processes that were part of the application. The functions that called the I/O functions are available by choosing one of the call path views.

(Figure: the default iot Stats Panel view for the run, listing I/O functions such as __GI___read, __libc_open, __GI___write, and __libc_close in libc-2.12.so: syscall-template.S,82.)

Here the user has chosen the C+ view icon, and the Stats Panel now shows all the call paths in the user's application. This view shows every possible call path through the source
program. The Call Path Profiling experiment (ossusertime) provides inclusive vs. exclusive CPU time (see section 4.2) and also includes call stacks. There are a number of Hardware Counter experiments: osshwc and osshwctime, which sample hardware counter overflows, and osshwcsamp, which can periodically sample up to six hardware counter events.

A flat profile will answer the basic question: where does my code spend its time? This will be displayed as a list of code elements with varying granularity (i.e., statements, functions, and libraries/linked objects), with the time spent at each. Flat profiling can be done through sampling, which allows us to avoid the overhead of direct measurements. We must ensure we request a sufficient number of samples (sampling rate) to get an accurate result.

An example of flat profiling would be running the program counter sampling experiment in Open|SpeedShop. We will run the convenience script on our test program, smg2000:

> osspcsamp "mpirun -np 256 smg2000 -n 50 50 50"

It is recommended that you compile your code with the -g option, in order to see the statements in the sampling. The pcsamp experiment also takes a sampling frequency as an optional parameter; the available parameters are high (200 samples per second), low (50 samples per second), and the default value of 100 samples per second. If we wanted to run the same experiment with the high sampling rate, we would simply issue the command: > osspcs
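Following the pattern just described, the high-rate rerun presumably composes as in this dry-run sketch; the command is only printed, not executed, and the process count and application mirror the example above.

```shell
NPROCS=256
RATE="high"    # high = 200 samples/s, low = 50, default = 100
CMD="osspcsamp \"mpirun -np ${NPROCS} smg2000 -n 50 50 50\" ${RATE}"
echo "${CMD}"
```

An explicit integer (for example, 150) may be supplied in place of the high/low/default keywords when a specific rate is wanted.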
r>, or individual processes using -p <process id>. One can also give a range of ranks, threads, or processes using the respective option.

For the calltree view, the display shows where the I/O functions were called from in the user's application source. In this example, most of the I/O time was spent in the write I/O function, along the path shown in the first individual calltree. The calltree with the fullstack option forces the calltree view to not collapse any similar sub-trees, which makes the view more explicit. Without the fullstack option, the calltrees would be more consolidated.

8 Applying Experiments to Parallel Codes

The ideal scenario for the execution of parallel code using pthreads or OpenMP is efficient threading, where all threads are assigned work that can execute concurrently. For MPI code, the ideal is a properly load-balanced job, so that all MPI ranks do about the same amount of work and no MPI rank is stuck waiting.

What are some things that can cause these ideal scenarios to fail (taken from the LLNL parallel processing tutorial)? MPI jobs can become unbalanced if an equal amount of work was not assigned to each rank, possibly through the number of array operations not being equal for each rank, or loop iterations not being evenly distributed. You can still have problems even if your work seems to be evenly distributed. For example, if you evenly distribute a sparsely populated array, then some ranks may end up with very little or no work
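The CLI filtering described above can be scripted. This sketch only prints a plausible interactive session; the database name is illustrative, and the exact option spellings should be checked against the CLI's own help in your installation.

```shell
DB="IOR-iot.openss"   # illustrative database name
# Lines one might issue: open the CLI on the database, show full (uncollapsed)
# call stacks, then restrict the view to a single rank.
CLI_SESSION="openss -cli -f ${DB}
expview -v calltrees,fullstack
expview -r 0"
printf '%s\n' "${CLI_SESSION}"
```

The -t (thread) and -p (process) options filter the same way as -r does for ranks.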
r to create custom views of the performance analysis results. This allows the user to supplement the provided Open|SpeedShop views.

11.5.1.2 View Display Choice Selection

The View Display Choice set of buttons allows users to choose what granularity to use for a particular display. The normal usage scenario is to choose a view choice granularity, and then select a view by choosing one of the icons described in the table above. The choices, as shown in the image below, are to see the performance data displayed:

- Per Function: Display the performance information relative to each function in the program that had performance data gathered during the experiment that was run.
- Per Statement: Display the performance information relative to each statement in the program that had performance data gathered during the experiment that was run.
- Per Linked Object: Display the performance information relative to each library or linked object in the program that had performance data gathered during the experiment that was run.
- Per Loop: Display the performance information relative to each loop in the program that had performance data gathered during the experiment that was run. Note that the loop performance information is only shown for loops that were actually executed. There may be loops in the application that will not show up in the display, because they were not executed or had minimal time attributed to them.

The image below
r to do this, we will investigate the interaction between the application and the hardware, to make sure there is an efficient use of hardware resources. Modern memory systems are complex: they can have deep hierarchies and explicitly managed memory. Systems can implement Non-Uniform Memory Access (NUMA) or streaming/prefetching methods. The key to memory is locality: are you accessing the same data repeatedly, or are you accessing neighboring data? You will want to look at your code's read/write intensity, the prefetch efficiency, the cache miss rate at all levels, TLB miss rates, and the overhead from NUMA.

Some system differences can affect the computational intensity, like the cycles per instruction (CPI) or the number of floating point instructions. Other architectural features that can differ between systems include branches: the number of branches taken, and the mis-speculation or wrong branch prediction results. If your code is using anything like single instruction, multiple data (SIMD), or any type of multimedia or streaming extensions, the performance of all of these things could differ greatly from system to system. General system-wide information, including I/O busses, network counters, and also power or temperature sensors, all could affect the performance of your code. But it can be difficult to relate this information to your source code.

Hardware performance counters are used to keep track of architectural features. Typically most features
rallel run.

(Figure: bar chart of wall time (secs), min, avg, and max across the PEs, for stripe counts 1, 4, 8, and 16.)

7.3 Open|SpeedShop I/O Tracing and I/O Profiling

An example of how to use the Open|SpeedShop usertime experiment to profile I/O is shown below. This example compares Open|SpeedShop data to the application's own instrumentation data.

(Figure: output from the code, showing PROBLEM SIZE 5500, BLOCK SIZE 100, NUMBER OF RIGHT HAND SIDES 100, and a PERFORMANCE TIME IN SECONDS breakdown including matmul time 53.80486 seconds and I/O time 45.86675 seconds.)

Open|SpeedShop also has an iot experiment for extended I/O tracing. It will record each event in chronological order, and collect additional information like function parameters and function return values. You should use extended I/O tracing when you want to trace the exact order of events, or when you want to see the return values or the bytes that were read or written.

Beware of serial I/O in applications, as illustrated in the code below.

(Figure: serial I/O example code, from Mike Davis, Cray Inc.)

Below is the output of the Open|SpeedShop iot experiment on the serial I/O code.

(Figure: the iot event-by-event list view; clicking on an event gives its call path.)
rmance varies for each new version of an application, or understanding how a different compiler or compiler options can affect the performance of your application. This also allows you to do scalability tests, to see how the performance of your application scales with the number of processors. It is also helpful just to see the progress you have made while tuning your code.

Open|SpeedShop has options to allow you to compare performance data. You can use the Custom Compare Panel (CC icon) in the GUI, or the osscompare convenience script:

> osscompare "db1.openss,db2.openss" [options]

This will produce a side-by-side comparison listing; you can compare up to 8 databases at once. See the osscompare man page for more details. Below is an example of comparing two different pcsamp experiments on the smg2000 application:

osscompare "smg2000-pcsamp.openss,smg2000-pcsamp-1.openss"
openss: Legend: -c 2 represents smg2000-pcsamp.openss
openss: Legend: -c 4 represents smg2000-pcsamp-1.openss

-c 2, Exclusive CPU   -c 4, Exclusive CPU   Function (defining location)
time in seconds       time in seconds
3.870000000           3.630000000           hypre_SMGResidual (smg2000: smg_residual.c,152)
2.610000000           2.860000000           hypre_CyclicReduction (smg2000: cyclic_reduction.c,757)
2.030000000           0.150000000           opal_progress (libopen-pal.so.0.0.0)
1.330000000           0.100000000           mca_btl_sm_component_progress (libmpi.so.0.0.2: topo_unity_component.c,0)
0.280000000           0.21000
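When scanning a comparison listing like the one above, it can help to compute the relative change per function. This small awk sketch does that for the first two rows; the numbers are taken from the listing itself.

```shell
# Input columns: run1 seconds, run2 seconds, function name.
# Prints the percent change from run1 to run2 for each row.
printf '%s\n' \
  "3.870000000 3.630000000 hypre_SMGResidual" \
  "2.610000000 2.860000000 hypre_CyclicReduction" |
awk '{ printf "%s %+.1f%%\n", $3, 100 * ($2 - $1) / $1 }'
# Output:
#   hypre_SMGResidual -6.2%
#   hypre_CyclicReduction +9.6%
```

The same filter applies directly to osscompare's CSV output (the oname option), since the value columns stay in the same order.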
s. How to identify memory problems:
- Study the time spent in memory allocation/de-allocation routines (mem experiment).
- Look for load imbalance (LB view) and outliers (CA view).

14.2 Additional Documentation

The Python scripting API documentation can be found at http://www.openspeedshop.org/docs/pyscripting_doc, or in the share/doc/packages/OpenSpeedShop/pyscripting_doc folder in the install directory. There are also man pages for openss and every convenience script. There is also a quick start guide that you can download from http://www.openspeedshop.org. There is an Open|SpeedShop Forum where you can ask questions and read posts at http://www.openspeedshop.org/forums, and there is an email list that you can send your questions to: oss-questions@openspeedshop.org.

15 Convenience Script Basic Usage Reference Information

This section provides a quick overview of the convenience scripts, which can be used either to compare experiment data to other experiment data, or to gather performance information for each of the various performance metric types that Open|SpeedShop supports.

15.1 Suggested Workflow

We recommend an O|SS workflow consisting of two phases: first, gather the performance data using the convenience scripts; then use the GUI or CLI to view the data.

15.2 Convenience Scripts

Users ar
s. This base I/O experiment records the basic I/O information, as stated in the introductory section, but does not record the arguments to each call. That is done in the extended (iot) experiment.

7.3.1.1 I/O Base Tracing (io) experiment performance data gathering

The base I/O tracing (io) experiment convenience script is ossio. Use this convenience script in this manner to gather base I/O tracing performance data:

ossio "how you normally run your application" <list of I/O function(s)>

The following is an example of how to gather data for the IOR application on a Linux cluster platform using the ossio convenience script. It gathers performance data for all of the I/O functions, because there is no list of I/O functions specified after the quoted application run command:

ossio "srun -n 512 IOR"

7.3.1.2 I/O Base Tracing (io) experiment performance data viewing with CLI

To launch the CLI on any experiment, use: openss -cli -f <database name>

7.3.1.3 I/O Base Tracing (io) experiment performance data viewing with GUI

To launch the GUI on any experiment, use: openss -f <database name>

7.3.2 I/O Extended Tracing (iot) experiment

7.3.2.1 I/O Extended Tracing (iot) experiment performance data gathering

The extended I/O tracing (iot) experiment convenience script is ossiot. Use this convenience script in this manner to gather extended I/O tracing performance data:

ossiot "how you normally run your ap
In order to fine-tune what is being targeted, Open|SpeedShop provides target options that describe file names, host names, thread identifiers, rank identifiers, and process identifiers.

3.1.2 Concept of an Experiment

Open|SpeedShop uses the concept of an experiment to describe the gathering of performance measurement data for a particular performance area of interest. An experiment consists of the collector responsible for gathering the measurements associated with the performance area of interest. The collector, which is a small dynamic or static object library, also contains functions that can interpret the gathered measurements (i.e., performance data) into a human-understandable form. The experiment definition also includes the application being examined, and how often the data will be gathered (the sampling rate). The application's symbol information is saved into the experiment output file, so that performance reports can be generated from the performance data file alone. The application itself need not be present to view the performance data at a later time.

3.2 Performance Experiments Overview

Open|SpeedShop refers to the different performance measurements as experiments. Each experiment can measure and analyze different aspects of the code's performance. The experiment type, or type of data gathered, is chosen by the user. Any experiment can be applied to any application, with the exception of MPI-specific experiments being applied to non-MPI applications. E
s many of the icons and features of the other Open|SpeedShop experiments, it is used here for illustration purposes.

(Figure: the User Time (usertime) experiment StatsPanel for smg2000, 2 hosts, 2 ranks, 2 threads, showing Exclusive CPU time, Inclusive CPU time, and % of Total Exclusive CPU time per function; the top entries include hypre_SMGResidual (smg2000: smg_residual.c,152), hypre_CyclicReduction (smg2000: cyclic_reduction.c,757), hypre_SemiRestrict, mca_btl_sm_component_progress, opal_progress, opal_generic_simple_unpack, hypre_StructAxpy, memcpy, and hypre_SemiInterp (smg2000: semi_interp.c,126).)

11.5.1.1 Icon ToolBar

(Figure: the StatsPanel icon toolbar, including the D, TS, LB, CA, and other view icons, with the label "Showing Functions Report".)

The most-used items that can be found in the StatsPanel menu, which is found under the StatsPa
se file. Then use the expview command options for desired views.

15.4 osscompare: Compare Database Files

General form: osscompare "<db_file1>,<db_file2>[,<db_file>...]" [ time | percent | <other metrics> ] [ rows=nn ] [ viewtype=functions|statements|linkedobjects ] [ oname=<csv filename> ]

Where:

<db_file> represents an Open|SpeedShop database file, created by running an Open|SpeedShop experiment on an application.

time | percent | <other metrics> represent the metric that the comparison will use to differentiate the performance information for each experiment database.

rows=nn indicates how many rows of output you want to have listed.

viewtype=functions|statements|linkedobjects selects the granularity of the view output. The comparison is done at either the function, statement, or library view level. Function level is the default granularity.

oname=<csv filename> names the output file when comma-separated-list output is requested.

Examples:
osscompare "smg-run1.openss,smg-run2.openss"
osscompare "smg-run1.openss,smg-run2.openss" percent rows=10

Please type man osscompare for more details.

15.5 osspcsamp: Program Counter Experiment

General form: osspcsamp "<command> <args>" [ high | low | default | <sampling rate> ]

Sequential job example:
osspcsamp "smg2000 -n 50 50 50"

Parallel job example:
............................................................ 14
3.2.1 Individual Experiment Descriptions ....................................................... 14
3.2.3 Sampling Experiments Descriptions ....................................................... 16
3.2.4 Tracing Experiments Descriptions ......................................................... 16
3.2.5 Parallel Experiment Support ................................................................. 17
3.3 Running an Experiment ........................................................................... 17
4 How to Gather and Understand Profiles ....................................................... 23
4.1 Program Counter Sampling Experiment .................................................... 23
4.2 Call Path Profiling (usertime) Experiment ................................................. 25
5 How to Relate Data to Architectural Properties ............................................ 28
5.1 Hardware Counter Sampling (hwcsamp) Experiment .................................. 30
5.1.1 Hardware Counter Sampling (hwcsamp) experiment performance data gathering ... 33
5.1.1.1 Hardware Counter Sampling (hwcsamp) experiment parameters .......... 33
5.1.2 Hardware Counter Sampling (hwcsamp) experiment performance data viewing ... 33
5.1.2.1 Getting the PAPI counter as the GUI's Source Annotation Metric ......... 33
5.1.2.2 Viewing Hardware Counter Sampling Data with the GUI ....................
…smg2000-pcsamp

# must have a clean raw data directory each run
rm -rf /home/$USER/smg2000/test/raw
mkdir /home/$USER/smg2000/test/raw
setenv OPENSS_RAWDATA_DIR /home/$USER/smg2000/test/raw
setenv OPENSS_DB_DIR /home/$USER/smg2000/test
cd /home/jgalaro/smg2000/test
# needs -b to have the original executable available when doing ossutil
aprun -b -n 16 /home/$USER/smg2000/test/smg2000-pcsamp
# creates a X.0.openss database file
# please load the module pointing to openspeedshop before accessing ossutil
ossutil /home/jgalaro/smg2000/test/raw

There have been recent changes to the shared library support in Open|SpeedShop. Dynamic shared library support is now available in newer Cray and Blue Gene operating systems. There is support for both shared and static binaries on the Cray and on the Blue Gene/Q platforms. Also being worked on is a replacement mechanism that avoids having to re-link static binaries in order to insert the Open|SpeedShop collectors into the application. It will use the Dyninst binary rewriter to insert the collectors under the hood. Then you could use the same convenience scripts and interface for all types of applications.

12.1 Cray-Specific Static aprun Information

Note in the above execution of the statically linked executable that we need to add the -b option to the aprun call. The option is needed because Open|SpeedShop stores information about the executable location when it is running. Without the -b option the executable…
…[Screenshot: Stats Panel load balance view showing average and maximum exclusive CPU time in seconds per function, for functions such as x_solve, y_solve, compute_rhs, and matmul_sub in the BT OpenMP benchmark]

8.2.1 Threading-Specific Experiment: pthreads

An experiment specific to tracking POSIX thread function calls and analyzing those calls is also available in Open|SpeedShop. The experiment is called pthreads, and it traces several POSIX-thread-related functions. Like all the other tracing experiments, the number of calls, the time spent in each function, the call paths to each POSIX thread function, and an event-by-event trace are available. Load balance and cluster analysis features are also available.

8.2.1.1 Threading-Specific (pthreads) experiment performance data gathering

8.2.1.2 Threading-Specific (pthreads) experiment performance data viewing with GUI

To launch the GUI on any experiment use: openss -f <database name>

8.2.1.3 Threading-Specific (pthreads) experiment performance data viewing with CLI

To launch the CLI on any experiment use: openss -cli -f <database name>

8.2 NVIDIA CUDA Analysis Section

The Open|SpeedShop version with CBTF collection mechanisms supports tracing CUDA events in an NVIDIA CUDA-based application. An event-by-event list…
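Conceptually, a tracing experiment wraps each tracked function so that every invocation (event) records a call count and elapsed time. The sketch below shows that idea in plain Python; it is not the Open|SpeedShop collector, and the traced function is made up for the example.

```python
import functools
import time

trace = {}  # function name -> [number of calls, total time in seconds]

def traced(func):
    """Record per-call timing and call counts, as a tracing experiment does."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            entry = trace.setdefault(func.__name__, [0, 0.0])
            entry[0] += 1                               # one more event
            entry[1] += time.perf_counter() - start     # aggregated time

    return wrapper

@traced
def worker(n):            # stand-in for a POSIX-thread routine
    return sum(range(n))

for _ in range(3):
    worker(1000)
print(trace["worker"][0])  # number of calls recorded: 3
```

A real collector interposes on the library functions themselves (pthread_create, pthread_mutex_lock, and so on) rather than using a decorator, but the bookkeeping is the same.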
…sspcsamp "mpirun -np 32 sweep3d.mpi"
> ossio "srun -N 4 -n 16 sweep3d.mpi"
> openss -offline -f "mpirun -np 128 sweep3d.mpi" hwctime
> openss -online -f "srun -N 8 -n 128 sweep3d.mpi" usertime

The default view for parallel applications is to aggregate the information collected across all ranks. You can manually include or exclude individual ranks, processes, or threads to view their specific results. You can also compare ranks by using the Customize Stats Panel view and creating a compare column for the process groups or individual ranks. Cluster analysis is also available; it can be used to find outliers, that is, ranks that are performing very differently than the others. From the Stats Panel toolbar or context menu you can automatically create groups of similar-performing ranks or threads. Through the Stats Panel, Open|SpeedShop also provides common analysis functions designed for quick analysis of MPI applications. There are load balance views that calculate min, max, and average values across ranks, processes, or threads. The image below shows the Open|SpeedShop buttons for Load Balance and, next to that, Cluster Analysis.

[Screenshot: Stats Panel toolbar for a pc sampling experiment on sweep3d, highlighting the Load Balance and Cluster Analysis buttons]
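The idea behind the cluster-analysis view, finding ranks that behave very differently from the rest, can be illustrated with a simple outlier test. This is a toy sketch, not the algorithm Open|SpeedShop actually uses, and the per-rank times are invented.

```python
from statistics import mean, stdev

def outlier_ranks(times, cutoff=2.0):
    """Flag ranks whose time is more than `cutoff` standard deviations
    from the mean across ranks -- a crude stand-in for cluster analysis."""
    avg, sd = mean(times.values()), stdev(times.values())
    return sorted(r for r, t in times.items() if abs(t - avg) > cutoff * sd)

# Hypothetical exclusive times (seconds) per MPI rank.
times = {r: 10.0 for r in range(16)}
times[255] = 42.0          # one rank doing far more work
print(outlier_ranks(times))  # → [255]
```

Grouping similar-performing ranks, as the GUI buttons do, is the complementary step: everything not flagged as an outlier lands in the "similar" group.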
… 35
5.1.3 Hardware Counter Sampling (hwcsamp) experiment performance data viewing ......... 36
5.2 Hardware Counter Experiment (hwc) ......... 39
5.2.1 Hardware Counter Threshold (hwc) experiment performance data gathering ......... 40
5.2.2 Hardware Counter Threshold (hwc) experiment performance data viewing with GUI ......... 40
5.2.3 Hardware Counter Threshold (hwc) experiment performance data viewing with CLI ......... 42
6 Hardware Performance Counters and Their Use ......... 43
6.1 Using the hardware counter experiments to find bottlenecks ......... 45
6.1.1 How to find memory bandwidth bottlenecks using O|SS hwc experiments ......... 45
6.1.2 How to find memory cache usage issues using O|SS hwc experiments ......... 45
6.1.3 How to find load/store imbalance using O|SS hwc experiments ......... 45
7 I/O Analysis ......... 45
7.1 OOCORE Benchmark ......... 46
7.2 Lustre Striping Commands ......... 47
7.3 Open|SpeedShop I/O Tracing and I/O Profiling ......... 48
7.3.1 Open|SpeedShop I/O Tracing General Usage ......... …
…@ 2611 in WriteOrRead (IOR: IOR.c,2562)
>>@ 251 in IOR_Xfer_POSIX (IOR: aiori-POSIX.c,224)
598.432332  598.432332  1.507294  >>>__read (libpthread-2.11.3.so)

TestIoSys (IOR: IOR.c,1848)
>@ 104 in IOR_Create_POSIX (IOR: aiori-POSIX.c,74)
472.137142  472.137142  1.189189  >>__open64 (libpthread-2.11.3.so)

TestIoSys (IOR: IOR.c,1848)
>@ 195 in IOR_Open_POSIX (IOR: aiori-POSIX.c,173)
268.882585  268.882585  0.677245  >>__open64 (libpthread-2.11.3.so)

TestIoSys (IOR: IOR.c,1848)
>@ 316 in IOR_Close_POSIX (IOR: aiori-POSIX.c,315)
61.587482  61.587482  0.155123  >>__close (libpthread-2.11.3.so)

TestIoSys (IOR: IOR.c,1848)
>@ 316 in IOR_Close_POSIX (IOR: aiori-POSIX.c,315)
1.796442  1.796442  0.004525  >>__close (libpthread-2.11.3.so)

TestIoSys (IOR: IOR.c,1848)
>@ 2608 in WriteOrRead (IOR: IOR.c,2562)
>>@ 234 in IOR_Xfer_POSIX (IOR: aiori-POSIX.c,224)
1.280113  1.280113  0.003224  >>>__lseek64 (libpthread-2.11.3.so)

TestIoSys (IOR: IOR.c,1848)
>@ 2611 in WriteOrRead (IOR: IOR.c,2562)
>>@ 234 in IOR_Xfer_POSIX (IOR: aiori-POSIX.c,224)
0.981341  0.981341  0.002472  >>>__lseek64 (libpthread-2.11.3.so)

In the above command-line interface output, the expview command with no options gives the overview, or summary, view for all the ranks and threads. One can view the performance information for individual ranks using -r <rank number> or individual threads using -t <thread number>…
…does not save individual call information like the io and iot experiments do. That allows the iop experiment database to be smaller and makes the iop experiment faster than the io and iot experiments.

The memory tracing experiment (mem) records invocation of all tracked memory function calls, also referred to as events. The mem experiment provides aggregated and individual timings, and also provides argument information for each call.

The MPI tracing experiments (mpi, mpit, mpiotf) record invocation of all MPI routines, as well as aggregated and individual timings. The mpit experiment provides argument information for each call. The mpiotf experiment creates Open Trace Format (OTF) output.

The floating-point exception tracing experiment (fpe) is triggered by any FPE caused by the application. It can help pinpoint numerical problem areas.

The POSIX thread tracing experiment (pthreads) records invocation of all tracked POSIX-thread-related function calls, also referred to as events. The pthreads experiment provides aggregated and individual timings, and also provides argument information for each call.

3.2.5 Parallel Experiment Support

Open|SpeedShop supports MPI and threaded codes; it has been tested with a variety of MPI implementations. The thread support is based on POSIX threads, and OpenMP is supported through POSIX threads: Open|SpeedShop reports the activity of the POSIX threads that represent the OpenMP threads, but currently doesn't do any special processing…
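A memory tracing experiment like mem records, for every tracked allocation call, both timing/aggregate totals and the call's arguments (for example, the requested size). The sketch below mimics that record-keeping in plain Python; it is illustrative only and does not hook the real allocator.

```python
events = []  # one record per tracked "memory call" (event)

def tracked_malloc(size):
    """Stand-in for an interposed malloc: record each event and its argument."""
    buf = bytearray(size)                       # the "allocation" itself
    events.append({"call": "malloc", "size": size})  # per-call argument info
    return buf

for size in (64, 128, 64):
    tracked_malloc(size)

calls = len(events)                            # individual events: 3
total_bytes = sum(e["size"] for e in events)   # aggregated argument info: 256
print(calls, total_bytes)
```

The real collector interposes on malloc/free (and friends) in libc; the point here is only that tracing stores a record per event, which is also why trace databases grow faster than profile databases.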
…that are packaged inside the CPU allow counting hardware events transparently, without any overhead. Newer platforms also provide system counters on components like network cards and switches, or environmental sensors.

The drawback to hardware counters is that their availability differs between platforms and processor types. Even systems that offer the same counters may have slight semantic differences between platforms. In some cases, access to hardware counters may require privileged access or kernel patches.

The Performance Application Programming Interface (PAPI) allows access to hardware counters through APIs and simple runtime tools. You can find more information on PAPI at http://icl.cs.utk.edu/papi/.

Open|SpeedShop provides three hardware counter experiments that are implemented on top of PAPI. It provides access to PAPI and native counters, like data cache misses, TLB misses, and bus accesses.

There are a few basic models to follow in hardware counter experiments. The first is thresholding, where the user selects a counter and the application runs until a fixed number of events has been reached on that counter. Then a PC sample is taken at that location, each time the counter increases by the preset fixed number. The ideal threshold (the fixed number at which to monitor) is dependent on the application. Another model is timer-based sampling, where the counters are checked at given time intervals. Open|SpeedShop provides three hardware counter…
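The thresholding model described above, taking a PC sample each time the chosen counter advances by a fixed number of events, can be sketched as a simulation. The event stream and threshold below are invented for illustration; no real hardware counter is read.

```python
def threshold_sample(event_counts, threshold):
    """Simulate event-based sampling: walk a stream of (pc, events) pairs
    and emit a sample at the current pc each time the running counter
    crosses another multiple of `threshold`."""
    samples, counter, next_trigger = [], 0, threshold
    for pc, events in event_counts:
        counter += events
        while counter >= next_trigger:
            samples.append(pc)          # PC sample taken at this location
            next_trigger += threshold
    return samples

# Hypothetical (program counter, cache-miss count) stream.
stream = [("funcA", 700), ("funcB", 400), ("funcA", 900)]
print(threshold_sample(stream, 1000))  # → ['funcB', 'funcA']
```

This also shows why the ideal threshold is application-dependent: too high and hot spots get few samples, too low and the sampling itself perturbs the run.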
…[Screenshot: GUI Stats Panel with the Command Panel open at the openss>> prompt]

Next we see the load balance view based on Linked Objects (libraries).

[Screenshot: load balance view based on Linked Objects]

Here is the pcsamp view of Rank 255 performance data only.

[Screenshot: pcsamp view for Rank 255 of the 256-rank LU benchmark, showing time in jacld_, buts_, jacu_, pthread_spin_lock, ssor_, exchange_3_, __GI_memcpy, and MPI library routines]

Below we examine Rank 255 further, but this time using the load balance view in the command-line interface for Open|SpeedShop.

openss>>expview -m loadbalance
Max MPI Call Time   Rank    Min MPI Call Time   Rank    Average MPI Call        Function (defining location)
Across Ranks(ms)    of Max  Across Ranks(ms)    of Min  Time Across Ranks(ms)
150332.97   0    120351.97   36   131361.13   MPI_Recv (libmpich.so.1.0: recv.c,60)
17636.11    36   1103.53     0    5443.08     MPI_Send (libmpich.so.1.0: send.c,65)
16470.53    19   353.81      0    5255.33     MPI_Wait (libmpich.so.1.0: wait.c,51)
3206.45     255  3.00        17   2000.27     MPI_Allreduce (libmpich.so.1.0: allreduce.c,59)
915.17      …    …           …    …           MPI_Init (libmonitor.so.0.0.0: pmpi.c,94)
16.00       48   …           …    …           MPI_Finalize (libmonitor.so.0.0.0: pmpi.c,223)
9.28        230  …           …    …           MPI_Irecv (libmpich.so.1.0: irecv.c,48)
1.22        247  0.07        10   …           MPI_Bcast (libmpich.so.1.0: bcast.c,81)
0.51        0    …           …    0.41        MPI_Barrier (libmpich.so.1.0: barrier.c)
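The load-balance view's min/max/average columns reduce to a small computation over per-rank times. The sketch below reproduces that reduction in Python on made-up data; it is not Open|SpeedShop's implementation.

```python
def load_balance(times_by_rank):
    """Return (max time, rank of max, min time, rank of min, average),
    the columns of an `expview -m loadbalance` style row."""
    max_rank = max(times_by_rank, key=times_by_rank.get)
    min_rank = min(times_by_rank, key=times_by_rank.get)
    avg = sum(times_by_rank.values()) / len(times_by_rank)
    return (times_by_rank[max_rank], max_rank,
            times_by_rank[min_rank], min_rank, avg)

# Hypothetical MPI_Recv times (ms) for four ranks.
recv = {0: 150.0, 1: 130.0, 2: 120.0, 3: 140.0}
print(load_balance(recv))  # → (150.0, 0, 120.0, 2, 135.0)
```

A large gap between the max and min columns for the same function is the signal to drill down into the rank of max, as the Rank 255 example above does.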
…tion by the total number of PAPI_L1_DCM counts in the application.

openss>>expview -m count
Exclusive      percent     Function (defining location)
PAPI_L1_DCM    of
Counts         counts
342000000   52.333588  hypre_SMGResidual (smg2000: smg_residual.c,152)
207500000   31.752104  hypre_CyclicReduction (smg2000: cyclic_reduction.c,757)
20500000    3.136955   hypre_SemiInterp (smg2000: semi_interp.c,126)
15000000    2.295333   hypre_SemiRestrict (smg2000: semi_restrict.c,125)
8500000     1.300689   pack_predefined_data (libmpi.so.0.0.3)
7000000     1.071155   unpack_predefined_data (libmpi.so.0.0.3)

11.2 CLI Batch Scripting

If you have a known set of commands you want to issue, you can create a plain text file with CLI commands. For example, we create a batch file that will create, run, and then view the pcsamp experiment run on the application fred.

Create the batch file commands:
> echo "expcreate -f fred pcsamp" >> input_script
> echo "expgo" >> input_script
> echo "expview pcsamp10" >> input_script

Now, to run the batch file input_script, we use the -batch option to openss:
> openss -batch < input_script

Note that currently, in this context, this interface is only supported via the online version of Open|SpeedShop, so it must have been built with the OPENSS_INSTRUMENTOR=mrnet options.

11.3 Python Scripting

The Open|SpeedShop Python API allows users to execute the same interactive…
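Building the batch file above can itself be scripted. The Python sketch below writes the same three CLI commands to an input_script file; running it through openss -batch still requires an Open|SpeedShop installation, so that step is shown only as a comment.

```python
import os
import tempfile

commands = [
    "expcreate -f fred pcsamp",   # create the pcsamp experiment on `fred`
    "expgo",                      # run it
    "expview pcsamp10",           # show the top 10 lines of the view
]

path = os.path.join(tempfile.gettempdir(), "input_script")
with open(path, "w") as f:
    f.write("\n".join(commands) + "\n")

# Then, with Open|SpeedShop installed:  openss -batch < input_script
print(open(path).read())
```

Generating the script programmatically is handy when the experiment type or application name varies across a parameter sweep.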
…ve. It characterizes a very important class of HPC applications involving the use of the Method of Moments (MOM) formulation for investigating electromagnetics (e.g., radar cross section, antenna design). It solves dense matrix equations by LU (lower triangular / upper triangular), QR, or Cholesky decomposition. OOCORE is used by HPCMP to evaluate I/O system scalability. For our needs, this application (or similar out-of-core dense solver benchmarks) helps point out the important issues in performance analysis, like I/O overhead minimization, and the use of a matrix-multiply kernel, which makes it possible to achieve close to peak performance of the machine if tuned well. It can highlight blocking, which is very important to tune for deep memory hierarchies.

The following example was run on 16 cores of a quad-core, quad-socket Opteron InfiniBand cluster. We want to compare two different file systems: Lustre I/O with striping, and NFS I/O. We use the ossio convenience script:

> ossio "srun -N 1 -n 16 testzdriver-std"

Sample output from the Lustre run:

TIME   M      N      MB  NB  NRHS  P  Q  Fact     SolveTime  Error     Residual
WALL   31000  31000  16  16  1     4  4  1842.20  1611.59    4.51E-15  1.45E-11

DEPS = 1.110223024625157E-016
sum(xsol_i) = 30999.9999999873  0.000000000000000E+000
sum(xsol_i - x_i) = 3.332285336962339E-006  0.000000000000000E+000
sum(xsol_i - x_i)/M = 1.074930753858819E-010  0.000000000000000E+000
sum(xsol_i - x_i)/(M*eps) = 968211.548505533  0.000000000000000E+000

From…
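Blocking (tiling) keeps sub-blocks of the matrices resident in cache while they are reused, which is why the text singles it out for deep memory hierarchies. The sketch below is a minimal pure-Python blocked matrix multiply; the block size and matrix size are arbitrary illustrations, and a real kernel would use a tuned BLAS.

```python
def blocked_matmul(A, B, n, bs):
    """Multiply two n x n matrices (lists of lists) using bs x bs tiles,
    so each tile of A and B is reused while it is still cache-resident."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):            # tile row
        for kk in range(0, n, bs):        # tile depth
            for jj in range(0, n, bs):    # tile column
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, n)):
                        a = A[i][k]       # reused across the whole j loop
                        for j in range(jj, min(jj + bs, n)):
                            C[i][j] += a * B[k][j]
    return C

n = 4
A = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # identity
B = [[float(i * n + j) for j in range(n)] for i in range(n)]
assert blocked_matmul(A, B, n, bs=2) == B   # I * B == B
```

Python itself gains nothing from cache blocking, of course; the loop structure is the point, and it maps directly onto the C or Fortran kernels these benchmarks tune.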
…ware Counter Sampling (hwcsamp) experiment performance data viewing

To launch the CLI on any experiment use: openss -cli -f <database name>

The following example was run on the Yellowstone platform at NCAR/UCAR using the job script shown below.

5.1.3.1 Job Script and osshwcsamp Command

#!/bin/csh
# LSF batch script to run an MPI application
#BSUB -P Pnnnnnnnn                 # project code
#BSUB -W 00:30                     # wall-clock time (hrs:mins)
#BSUB -n 64                        # number of tasks in job
#BSUB -R "span[ptile=4]"           # run 4 MPI tasks per node
#BSUB -J sweep3d.hwcsamp           # job name
#BSUB -o sweep3d.hwcsamp.%J.out    # output file name, in which %J is replaced by the job ID
#BSUB -e sweep3d.hwcsamp.%J.err    # error file name, in which %J is replaced by the job ID
#BSUB -q regular                   # queue

module load openspeedshop
mkdir -p /glade/scratch/$USER/sweep3d
rm -rf /glade/scratch/$USER/sweep3d/hwcsamp
mkdir /glade/scratch/$USER/sweep3d/hwcsamp
setenv OPENSS_RAWDATA_DIR /glade/scratch/$USER/sweep3d/hwcsamp
setenv REQUEST_SUSPEND_HPC_STAT 1
echo "running on compute node: osshwcsamp"
osshwcsamp "mpirun.lsf /glade/u/home/galaro/demos/sweep3d/orig/sweep3d.mpi" PAPI_L1_DCM,PAPI_L1_ICM,PAPI_L1_TCM,PAPI_L1_LDM,PAPI_L1_STM

5.1.3.2 osshwcsamp Experiment CLI Default View

openss -cli -f L1.64PE.sweep3d.mpi-hwcsamp.openss
openss>>[openss]: The restored experiment identifier is: -x 1
openss>>expview -v summary
Exclusive  %ofCPU  papi_l1_dcm  papi…
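Raw counter totals like those in the summary view become more meaningful as derived ratios. The sketch below computes an L1 data-cache miss rate from two counter totals; the counter values are invented, and which counters you can combine depends on what the platform's PAPI installation exposes (here PAPI_L1_DCM, misses, is paired with PAPI_L1_DCA, total accesses).

```python
def miss_rate(misses, accesses):
    """L1 miss rate: PAPI_L1_DCM divided by total data-cache accesses."""
    return misses / accesses if accesses else 0.0

# Hypothetical totals gathered by an hwcsamp run.
counters = {"PAPI_L1_DCM": 342_000_000, "PAPI_L1_DCA": 6_840_000_000}
rate = miss_rate(counters["PAPI_L1_DCM"], counters["PAPI_L1_DCA"])
print(f"L1 data-cache miss rate: {rate:.1%}")  # → 5.0%
```

Comparing such ratios across functions, rather than raw counts, separates "this code touches a lot of memory" from "this code uses the cache badly."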
…experiment, use: openss -cli -f <database name>

Here we show a trace view of the output from the ossmem experiment run. It shows the default view and the load balance view for the execution of the sweep3d.mpi application on the Titan Cray platform at ORNL. The example below also contains an expcompare CLI command example in which two of the program's ranks are compared against each other. This may be useful if there appears to be load imbalance when examining the -m loadbalance output.

openss -cli -f sweep3d.mpi-mem-1.openss
openss>>[openss]: The restored experiment identifier is: -x 1
openss>>expview
Exclusive   % of       Number   Function (defining location)
Mem Call    Total      of
Time(ms)    Time       Calls
674.690825  66.448540  1132566  __libc_malloc (libc-2.11.3.so)
340.667562  33.551460  1127337  __cfree (libc-2.11.3.so)

openss>>expview -m loadbalance
Max Exclusive    Rank    Min Exclusive    Rank    Average Exclusive   Function (defining location)
Mem call time    of Max  Mem call time    of Min  Mem call time
in seconds               in seconds               in seconds
Across …                 Across …                 Across …

9.3 Memory Analysis Tracing (mem) experiment performance data viewing with GUI

To launch the GUI on any experiment use: openss -f <database name>

The first GUI view, shown below, is the default view for the mem experiment. It shows the memory functions that were called in the application, how many times they were called, and the time spent in each of the memory functions…
