
Debuggers & Performance Tools


Contents

1. [Cube screenshot: call tree of the bt-mz example (mpi_setup, exact_rhs, timer_clear, exch_qbc, adi, compute_rhs, x_solve, y_solve, z_solve with its omp parallel region, add, MPI_Barrier, timer_start, timer_stop, timer_read, verify, MPI_Reduce, print_results, MPI_Finalize) and a box plot over all 2048 system-tree elements. Selected: omp implicit barrier, z_solve.prep.f:428.]
M. Geimer, JUQUEEN Porting & Tuning Workshop, Feb 2015, Forschungszentrum Jülich

Score-P: Advanced Features
Measurement can be extensively configured via environment variables. Check the output of scorep-info config-vars for details.
This allows for targeted measurements: selective recording, phase profiling, parameter-based profiling.
Please ask us or see the user manual for details.

Why is the Bottleneck There?
This is highly application dependent and might require additional measurements: H
2. [hpcviewer screenshot: source pane showing the update loop in sor.c (around lines 670 to 682, computing delta, tmpres and tmperr), and the Calling Context View / Callers View / Flat View tabs with WALLCLOCK (us) metrics. The experiment aggregate is 4.79e+06 us (100%); most of it is attributed to main, the SOR iteration routine, and the loops at sor.c: 662 and sor.c: 673, with individual source lines sor.c: 678 and sor.c: 682 listed below them.]
3. Call-path Profile: Example (cont.)
    % less scorep_myprog_Ppnxt_sum/scorep.score
    Estimated aggregate size of event trace:                   162GB
    Estimated requirements for largest trace buffer (max_buf): 2758MB
    Estimated memory requirements (SCOREP_TOTAL_MEMORY):       2822MB
    (hint: When tracing set SCOREP_TOTAL_MEMORY=2822MB to avoid intermediate flushes
     or reduce requirements using USR regions filters.)

    flt type     max_buf[B]         visits   time[s] time[%] time/visit[us]  region
         ALL  2,891,417,902  6,662,521,083  36581.51   100.0           5.49  ALL
         USR  2,858,189,854  6,574,882,113  13618.14    37.2           2.07  USR
         OMP     54,327,600     86,353,920  22719.78    62.1         263.10  OMP
         MPI        676,342        550,010    208.98     0.6         379.96  MPI
         COM        371,930        735,040     34.61     0.1          47.09  COM

         USR    921,918,660  2,110,313,472  [illegible]                      matmul_sub
         USR    921,918,660  2,110,313,472  [illegible]                      binvcrhs
         USR    921,918,660  2,110,313,472  [illegible]                      matvec_sub
         USR     41,071,134     87,475,200  [illegible]                      lhsinit
         USR     41,071,134     87,475,200  [illegible]                      binvrhs
         USR     29,194,256     68,892,672  [illegible]                      exact_solution
         OMP      3,280,320      3,293,184     15.81     0.0           4.80  omp parallel

Call-path Profile: Filtering
In this example, the 6 most frequently called routines are of type USR.
These routines contribute around 35% of total time.
However, much of that is most likely measurement overhead: the routines are frequently executed, and their time-per-visit ratio is in the order of a few microseconds.
Avoid me
4. [Cube screenshot, continued: call tree below the omp parallel region of z_solve (implicit barrier, add, MPI_Barrier, timer_start, timer_stop, timer_read, verify, MPI_Reduce, print_results, MPI_Finalize). Selected: omp do, z_solve.prep.f:52.]

Call-path Profile: Example (cont.)
[Cube screenshot of scorep_bt-mz_C_64x32_sum/summary.cubex: metric tree with Time (Execution, MPI, OMP with Flush, Management, Synchronization (Barrier, Critical, Lock API, Ordered), Overhead, Idle threads), Visits, Synchronizations, Communications, Bytes transferred, MPI file operations, Computational imbalance, Minimum Inclusive Time, Maximum Inclusive Time, task migration loss, task migration win; Call tree and Flat view tabs.]
5. Trace Generation & Analysis w/ Scalasca
Enable trace collection and analysis using the -t option of scalasca -analyze:
    ##### In the job script #####
    module load UNITE scalasca
    export SCOREP_TOTAL_MEMORY=120MB    # Consult score report
    scalasca -analyze -f filter.txt -t \
        runjob --ranks-per-node P --np n --exe ./myprog
ATTENTION: Traces can quickly become extremely large!
Remember to use proper filtering, selective instrumentation, and Score-P memory specification.
Before flooding the file system, ask us for assistance!

Scalasca Trace Analysis Example
[Cube screenshot: metric tree with Time (Execution, OMP with Flush, Management, Synchronization (Barrier: Explicit, Implicit; Task Wait, Critical, Lock API, Ordered), Overhead, Idle threads), Visits, Synchronizations, Pair-wise synchronizations, Communications, Bytes transferred, MPI file operations, Delay costs, Wait states (propagating vs. terminal, direct vs. indirect).]
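The "Consult score report" remark above can be made concrete. The following is a minimal sketch, assuming the score report contains a line of the form shown in the scoring example of this deck; the helper name total_memory_from_score and the temporary sample file are illustrative and not part of Score-P or Scalasca. The sample numbers are the unfiltered estimates from that example, so a filtered measurement would normally report a smaller value.

```shell
#!/bin/sh
# Sketch: pull the SCOREP_TOTAL_MEMORY estimate out of a score report.
# The function name and the sample file are hypothetical helpers.

# Print the value that follows "(SCOREP_TOTAL_MEMORY):" in the given file.
total_memory_from_score() {
    sed -n 's/^Estimated memory requirements (SCOREP_TOTAL_MEMORY):[[:space:]]*\([0-9][0-9]*[A-Za-z]*\).*$/\1/p' "$1"
}

# Stand-in for <experiment directory>/scorep.score, using the header
# lines quoted in the scoring example.
report=$(mktemp)
cat > "$report" <<'EOF'
Estimated aggregate size of event trace:                   162GB
Estimated requirements for largest trace buffer (max_buf): 2758MB
Estimated memory requirements (SCOREP_TOTAL_MEMORY):       2822MB
EOF

SCOREP_TOTAL_MEMORY=$(total_memory_from_score "$report")
export SCOREP_TOTAL_MEMORY
echo "SCOREP_TOTAL_MEMORY=$SCOREP_TOTAL_MEMORY"
rm -f "$report"
```

In a real job script the function would be pointed at the scorep.score file of the scored experiment instead of the temporary sample.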
6. [Cube screenshot of scorep_bt-mz_C_64x32_trace/trace.cubex, continued: metrics Critical path, Performance impact and Computational imbalance; call tree with timer_clear, exch_qbc, omp do in z_solve.prep.f, add, MPI_Barrier, timer_start, MPI_Reduce, print_results, MPI_Finalize; box plot over all 2048 elements. Selected: OMP thread 10.]

Vampir Event Trace Visualizer
Offline trace visualization for Score-P's OTF2 trace files.
Visualization of MPI, OpenMP and application events:
  All diagrams highly customizable (through context menus)
  Large variety of displays for ANY part of the trace
http://www.vampir.eu
Advantage: detailed view of dynamic application behavior.
Disadvantages: requires event traces (huge amount of data); completely manual analysis.
7. Recipe
1. Prefix your link command with scorep --nocompiler
2. Prefix your MPI launch command with scalasca -analyze
3. After execution, examine analysis results using scalasca -examine scorep_<title>

Flat MPI Profile: Example
    % module load UNITE scorep scalasca
    % mpixlf90 -O3 -qsmp=omp -c foo.f90
    % mpixlf90 -O3 -qsmp=omp -c bar.f90
    % scorep --nocompiler \
          mpixlf90 -O3 -qsmp=omp -o myprog foo.o bar.o
    ##### In the job script #####
    module load UNITE scalasca
    scalasca -analyze \
        runjob --ranks-per-node P --np n --exe ./myprog
    ##### After job finished #####
    % scalasca -examine scorep_myprog_Ppnxt_sum

Flat MPI Profile: Example (cont.)
[Cube screenshot of scorep_bt-mz_C_64_sum/summary.cubex: metric tree (Visits, Synchronizations, Communications, Bytes transferred, MPI file operations), call tree (PARALLEL, MPI_Init_thread), flat view, system tree and box plot panes.]
8.     runjob -a --exe helloworld -p 64
    Creating LoadLeveler Job
    Submitting LoadLeveler Interactive Job for Totalview

TotalView: Execution Recipe
TotalView tries to debug runjob and shows no source code. Ignore it and press GO.
After some seconds, TotalView will detect parallel execution and ask if it should stop. Yes, it should stop.
To find the correct point (file/function) to debug, use the File > Open command.
Set your breakpoints and press GO again. The debugging session will then start.
To see a variable's contents, double-click on it in the source.

TotalView: Main Window
[TotalView screenshot: toolbar (Go, Halt, Kill, Restart, Next, Step, Out, Run To, ...), process/thread status showing rank 0 at breakpoint 1, stack trace (main, __libc_start_main, _start), stack frame for function main with argc, argv and the local variables myrank and numprocs.]
9. "scan"
    3. Examine analysis results with "square"
    For more information:
    - See ${SCALASCA_ROOT}/doc/manuals/QuickReference.pdf
    - or type "scalasca -h"
    - http://www.scalasca.org
    - mailto:scalasca@fz-juelich.de

Documentation
Use module avail to check latest status.
Websites:
  http://www.fz-juelich.de/ias/jsc/juqueen/ : User Info, Debugging, Performance Analysis
  http://www.vi-hps.org/training/material/ : Performance Tools LiveDVD image, links to tool websites and documentation, tutorial slides

Debugging on JUQUEEN
February 2015, Alexandre Strube

STAT: Stack Trace Analysis Tool
Very lightweight helper tool.
Shows merged call tree of whole program.
Useful to detect deadlocks.
Scales to millions of processes: http://www.hpcwire.com/hpcwire/2012-12-03/bug_repellent_for_supercomputers_proves_effective.html
Pinpoint individual problems.
NOT a real/full debugger.
http://www.paradyn.org/STAT/STAT.html

STAT: Main Window
[STAT screenshot: main window "STAT on juqueen1.zam.kfa-juelich.de" with File, Edit, View, Help menus.]
10. [Partner logos.]
Score-P
Scalable performance measurement infrastructure for parallel codes.
Next-generation measurement system of Scalasca 2.x, Vampir, TAU, Periscope.
Common data formats improve tool interoperability.
http://www.score-p.org

Scalasca
Collection of trace-based performance analysis tools, specifically designed for large-scale systems.
Unique features:
  Scalable, automated search for event patterns representing inefficient behavior
  Scalable identification of the critical execution path
  Delay / root-cause analysis
Based on Score-P for instrumentation and measurement.
Includes convenience / post-processing commands providing added value.
http://www.scalasca.org

What is the Key Bottleneck?
Generate flat MPI profile using Score-P/Scalasca:
  Only requires re-linking
  Low runtime overhead
Provides detailed information on MPI usage:
  How much time is spent in which operation?
  How often is each operation called?
  How much data was transferred?
Limitations: computation on non-master threads and outside of the MPI_Init/MPI_Finalize scope is ignored.

Flat MPI Profile
11. [Cube screenshot, continued: flat MPI profile with metrics Computational imbalance, Minimum/Maximum Inclusive Time, task migration loss/win; call tree listing MPI_Comm_size, MPI_Comm_rank, MPI_Comm_split, MPI_Bcast, MPI_Isend, MPI_Irecv, MPI_Barrier, MPI_Reduce, MPI_Finalize; system tree showing MPI Rank 1.]

Where is the Key Bottleneck?
Generate call-path profile using Score-P/Scalasca:
  Requires re-compilation
  Runtime overhead depends on application characteristics
  Typically needs some care setting up a good measurement configuration: filtering, selective instrumentation
Option 1 (recommended): automatic compiler-based instrumentation.
Option 2: manual instrumentation of interesting phases, routines, loops.

Call-path Profile: Recipe
1. Prefix your compile/link commands with scorep
2. Prefix your MPI launch command with scalasca -analyze
3. After execution, compare overall runtime with uninstrumented run to determine overhead
4. If overhead is too high:
   1. Score measurement using scalasca -examine -s scorep_<title>
   2. Prepare filter file
   3. Re-run measurement with filter applied using prefix scalasca -ana
12. [Cube screenshot, continued: metric tree (Communications, Bytes transferred, MPI file operations, Computational imbalance, Minimum/Maximum Inclusive Time, task migration loss/win) and system tree rooted at machine JUQUEEN; the selected Time metric totals 1.44e4 s.]

Call-path Profile: Example (cont.)
[Cube screenshot of scorep_bt-mz_C_64x32_sum/summary.cubex: call tree with mpi_setup, MPI_Bcast, env_setup, zone_setup, map_zones, zone_starts, set_constants, initialize, exact_rhs, timer_clear, exch_qbc, adi, compute_rhs, x_solve, y_solve, z_solve; flat view, system tree and box plot panes.]
13. [TotalView screenshot, continued: source pane showing function main in hello_mpi.c, a C program that includes stdio.h and mpi.h and calls MPI_Init, MPI_Comm_rank, MPI_Comm_size and MPI_Finalize; Action Points pane listing breakpoint 1 in main of hello_mpi.c.]

TotalView: Tools Menu
[TotalView screenshots: Call Graph, Data visualization, and Message queue graph (pending sends, pending receives, unexpected messages).]

Performance Analysis Tools on JUQUEEN
February 2015, Markus Geimer

Typical Performance Analysis Procedure
Do I have a performance problem at all? Time / speedup / scalability measurements.
What is the key bottleneck (computation / communication)? MPI / OpenMP / flat profiling.
Where is the key b
14. Debuggers and Performance Tools
February 2015, Markus Geimer, Alexandre Strube (Forschungszentrum Jülich)

Outline
"Make it work, make it right, make it fast." (Kent Beck)
Local module setup
Debuggers
Performance Tools: Score-P, Scalasca, Vampir, TAU, HPCToolkit

UNITE
UNiform Integrated Tool Environment.
Standardizes tool access and documentation.
Currently in use at JSC, RWTH, ZIH.
Based on the module command.
Standardized tool and version identification: <tool>/<version>-<special>
  <special>: optional indicator if the tool is specific to an MPI library, compiler, or 32/64-bit mode
Tools are only visible after module load UNITE (once per session).
Basic usage and pointer to tool documentation via module help <tool>.

Example
    % module load UNITE
    UNITE loaded
    % module help scalasca/1.4.3
    Module Specific Help for scalasca/1.4.3:
    Scalasca: Scalable Performance Analysis of Large-Scale Parallel Applications
    Version 1.4.3
    Basic usage:
    1. Instrument application with "skin"
    2. Collect & analyze execution measurement with
15. Vampir: Execution Statistics
Aggregated profiling information: execution time, number of calls, inclusive/exclusive.
Available for all/any group (activity) or all routines (symbols).
Available for any part of the trace (selectable through the time line diagram).
[Vampir Trace View screenshot of a WRF trace (64 processes): accumulated exclusive time per function group (MPI 39.185731 s, DYN 135.259444 s, PHYS 73.474569 s) and per function, led by the radiation driver, MPI_Wait, the microphysics driver and several dynamics routines.]
16. [STAT screenshot, continued: call tree whose edges are labeled with task counts and rank lists (128 tasks each for ranks 3,7,11,15,..., 1,5,9,13,17,..., 2,6,10,14,... and 0,4,8,12,16,...); context menu with Equivalence Class, Collapse, Collapse Depth, Hide, Expand, Expand All, Focus, View Source.]

STAT: Equivalence Classes (cont.)
[STAT screenshot: main with 512 tasks (ranks 0 to 511) splitting into four equivalence classes of 128 tasks each.]

STAT Recipe
Compile and link your program with debug option: -g
Load modules:
    ssh -X user@juqueen
    juqueen% module load UNITE stat
    UNITE loaded
    stat/2.1 loaded
    juqueen% STATGUI
Submit job and attach to it from GUI.
CAVEAT: Job needs to be started by the login node where the GUI is running. Add the following entry to the submission script:
    #@ requirements = (Machine == "juqueen<n>")
with <n> being the login node id.

STAT: Attaching to a Job
[STAT screenshot: attach dialog with tabs Attach, Launch, Serial Attach, Sample Options, Topology, Ad...]
17. [TAU ParaProf call graph options, continued: box width and color by Inclusive Value, Number of Calls, Number of Subroutines, Inclusive Per Call Value.]

TAU: 3D Profile View
Height and color indicate different metrics.
[ParaProf Visualizer screenshot: triangle mesh / bar plot / scatter plot of exclusive time per function and thread, with height and color metric selectors; MPI_Barrier highlighted.]

HPCToolkit
Multi-platform sampling-based call-path profiler.
Works on unmodified, optimized executables.
http://hpctoolkit.org
Advantages:
  Overhead can be easily controlled via sampling interval
  Advantageous for complex C++ codes with many small functions
  Loop-level analysis (someti
18. [TAU ParaProf screenshot: function profile for the metric WALL_CLOCK_TIME (seconds), led by MPI_Allreduce (345.5474), followed by algs::HyperbolicLevelIntegrator3 and mesh::GriddingAlgorithm3 routines and further MPI calls (MPI_Init, MPI_Bcast, MPI_Wait, MPI_Finalize, MPI_Isend, MPI_Waitall, MPI_Test, MPI_Comm_rank).]

TAU: Callgraph Profile View
[ParaProf Call Graph screenshot with display options for box width and box color (Static, Exclusive Value, ...).]
19. ardware-counter analysis: CPU utilization, cache behavior
Selective instrumentation
Manual/automatic event trace analysis

HW Counter Measurements w/ Score-P
Score-P supports both PAPI and native counters.
Available counters:
    % module load UNITE papi/5.0.1
    % less $PAPI_ROOT/doc/papi-5.0.1-avail.txt
    % less $PAPI_ROOT/doc/papi-5.0.1-native_avail.txt
    % less $PAPI_ROOT/doc/papi-5.0.1-avail-detail.txt
Specify using the SCOREP_METRIC_PAPI environment variable:
    ##### In the job script #####
    module load UNITE scalasca
    export SCOREP_METRIC_PAPI="PAPI_FP_OPS,PAPI_TOT_CYC"
    scalasca -analyze -f filter.txt \
        runjob --ranks-per-node P --np n --exe ./myprog

Automatic Trace Analysis w/ Scalasca
Idea: automatic search for patterns of inefficient behavior.
  Identification of wait states and their root causes
  Classification of behavior & quantification of significance
  Scalable identification of the critical execution path
[Diagram: low-level event trace -> analysis -> high-level result (property, call path, location).]
Advantages:
  Guaranteed to cover the entire event trace
  Quicker than manual/visual trace analysis
  Helps to identify hot-spots for in-depth manual analysis
20. asurements to reduce the overhead.
List routines to be filtered in a simple text file.

Filtering: Example
    % cat filter.txt
    SCOREP_REGION_NAMES_BEGIN
      EXCLUDE
        binvcrhs
        matmul_sub
        matvec_sub
        binvrhs
        lhsinit
        exact_solution
    SCOREP_REGION_NAMES_END
Score-P filtering files support:
  Wildcards (shell globs)
  Blacklisting
  Whitelisting
  Filtering based on filenames

Call-path Profile: Example (cont.)
To verify the effect of the filter:
    % scalasca -examine -s -f filter.txt \
          scorep_myprog_Ppnxt_sum
    ##### In the job script #####
    module load UNITE scalasca
    scalasca -analyze -f filter.txt \
        runjob --ranks-per-node P --np n --exe ./myprog
    ##### After job finished #####
    % scalasca -examine scorep_myprog_Ppnxt_sum

Call-path Profile: Example (cont.)
[Cube screenshot of scorep_bt-mz_C_64x32_sum/summary.cubex: metric tree with Time (1.44e4 s), Visits, Synchronizations, ...; call tree, flat view, system tree and box plot panes.]
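The filter file from the example above can also be generated from a script. This is a minimal sketch, assuming the six routine names listed on the slide; it writes to a temporary file rather than filter.txt so that it can be run anywhere, and the awk check is only an illustration, not a Score-P facility.

```shell
#!/bin/sh
# Sketch: write a Score-P filter file that excludes the six routines
# named in the filtering example, then sanity-check it.
filter=$(mktemp)
cat > "$filter" <<'EOF'
SCOREP_REGION_NAMES_BEGIN
  EXCLUDE
    binvcrhs
    matmul_sub
    matvec_sub
    binvrhs
    lhsinit
    exact_solution
SCOREP_REGION_NAMES_END
EOF

# Count the region names listed between EXCLUDE and the END marker.
excluded=$(awk '/EXCLUDE/ {on=1; next}
                /SCOREP_REGION_NAMES_END/ {on=0}
                on && NF {n++}
                END {print n+0}' "$filter")
echo "filter file $filter lists $excluded excluded regions"
# Remove the temporary file when done: rm -f "$filter"
```

The resulting file would then be passed with -f to scalasca -examine -s (to check the effect) and to scalasca -analyze (for the filtered run), as in the commands above.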
21. [STAT screenshots, continued: toolbar with Attach, ReAttach, Detach, Pause, Resume, Sample and Sample Multiple; list of saved call-graph files (cart_0000.2D.dot, cart_0000.3D.dot and numbered variants); merged call tree whose edges are labeled with task counts and rank lists, from 512 tasks (ranks 0 to 511) at the root down to groups of 128 tasks, with one branch ending in a PAMI context trylock/advance call.]

STAT: Zoom
[STAT screenshot: zoomed subtree below main (512 tasks, ranks 0 to 511), showing which ranks are following which call path.]

STAT: Equivalence Classes
22. Vampir: Displays
[Vampir Trace View screenshot of a WRF trace (64 processes): Timeline, Function Summary (accumulated exclusive time per function group), Communication Matrix View (number of messages), Function Legend, Process Summary and Context View; the context view shows the Function Summary entry for function group MPI.]

Vampir: Timeline Diagram
Functions are organized into groups; coloring is by group.
Message lines can be colored by tag or size.
[Vampir Trace View screenshot: timeline over processes 1 to 61 with Conte
23. [Vampir function list, continued: advection, surface driver, small-step and physics routines of the WRF trace, plus MPI_Irecv.]

Vampir: Process Summary
Execution statistics over all processes for comparison.
Clustering mode available for large process counts.
[Vampir Trace View screenshots: process summary over processes 0 to 22 on a 0 s to 30 s time axis, dominated by solve_em_.]

Vampir: Communication Statistics
[Vampir Trace View screenshot of the same trace.]
24. lyze -f <filter_file>
5. After execution, examine analysis results using scalasca -examine scorep_<title>

Call-path Profile: Example
    % module load UNITE scorep scalasca
    % scorep mpixlf90 -O3 -qsmp=omp -c foo.f90
    % scorep mpixlf90 -O3 -qsmp=omp -c bar.f90
    % scorep \
          mpixlf90 -O3 -qsmp=omp -o myprog foo.o bar.o
    ##### In the job script #####
    module load UNITE scalasca
    scalasca -analyze \
        runjob --ranks-per-node P --np n --exe ./myprog

Call-path Profile: Example (cont.)
    % scalasca -examine -s scorep_myprog_Ppnxt_sum
    scorep-score -r ./scorep_myprog_Ppnxt_sum/profile.cubex
    INFO: Score report written to ./scorep_myprog_Ppnxt_sum/scorep.score
Estimates trace buffer requirements.
Helps to identify candidate functions for filtering: computational routines with a high visit count and a low time-per-visit ratio.
Region/call-path classification:
  MPI: pure MPI library functions
  OMP: pure OpenMP functions/regions
  USR: user-level source local computation
  COM: "combined" USR + OpenMP/MPI
  ANY/ALL: aggregate of all region types
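The "low time-per-visit ratio" criterion above is simple arithmetic: total time divided by the number of visits. A minimal sketch, using the USR and OMP totals from the score report quoted elsewhere in this deck; the helper name time_per_visit_us is an illustrative assumption, not part of scorep-score.

```shell
#!/bin/sh
# Sketch: recompute the time/visit[us] column of a score report.
# Arguments: total time in seconds, number of visits.
time_per_visit_us() {
    LC_ALL=C awk -v t="$1" -v v="$2" 'BEGIN { printf "%.2f\n", t / v * 1e6 }'
}

# Totals taken from the scoring example (type USR and type OMP rows).
usr=$(time_per_visit_us 13618.14 6574882113)
omp=$(time_per_visit_us 22719.78 86353920)
echo "USR: $usr us per visit, OMP: $omp us per visit"
```

Regions whose ratio is only a few microseconds, like the USR total here, are the ones where instrumentation overhead is likely to dominate, which is why they are the usual filter candidates.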
25. [Vampir Trace View screenshot, continued.]
Byte and message count, min/max/avg message length and min/max/avg bandwidth for each process pair.
Message length statistics.
[Vampir screenshots: Communication Matrix View over processes 0 to 60, and Message Summary charts of message counts by message size (KiB) and of transfer rates (up to about 493 MiB/s) by message size.]
Available f
26. mes even individual source lines)
  Supports POSIX threads
Disadvantages:
  Statistical approach that might miss details
  MPI/OpenMP time displayed as low-level system calls

HPCToolkit: Recipe
1. Compile your code with -g -qnoipa. For MPI, also make sure your application calls MPI_Comm_rank first on MPI_COMM_WORLD.
2. Prefix your link command with hpclink. Ignore potential linker warnings.
3. Run your application as usual, specifying requested metrics with sampling intervals in environment variable HPCRUN_EVENT_LIST
4. Perform static binary analysis with hpcstruct --loop-fwd-subst=no <app>
5. Combine measurements with hpcprof -S <struct file> \ -I <path to src> <measurement dir>
6. View results with hpcviewer <hpct database>

HPCToolkit: Metric Specification
General format: name@interval, one entry per metric.
Possible sample sources:
  WALLCLOCK
  PAPI counters
  IO (use w/o interval spec)
  MEMLEAK (use w/o interval spec)
Interval given in microseconds. E.g., 10000 gives 100 samples per second.

Example: hpcviewer
[hpcviewer screenshot of the sor example.]
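The interval arithmetic in the metric specification can be checked with a few lines of shell. A minimal sketch, assuming a single WALLCLOCK metric with the 10000-microsecond interval from the slide; the helper name samples_per_second is illustrative and not an HPCToolkit command.

```shell
#!/bin/sh
# Sketch: relate a WALLCLOCK sampling interval (in microseconds) to a
# sample rate and assemble an HPCRUN_EVENT_LIST value as name@interval.
samples_per_second() {
    # One second is 1,000,000 microseconds.
    echo $(( 1000000 / $1 ))
}

interval_us=10000
rate=$(samples_per_second "$interval_us")
HPCRUN_EVENT_LIST="WALLCLOCK@${interval_us}"
export HPCRUN_EVENT_LIST
echo "HPCRUN_EVENT_LIST=$HPCRUN_EVENT_LIST ($rate samples per second)"
```

A shorter interval raises the sample rate and with it the measurement overhead, which is the trade-off the slide refers to when it says overhead is controlled via the sampling interval.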
27. TotalView: Recipe
   Compile and link your program with the debug option -g.
   Use absolute paths for source code info: -qfullpath
   In case of optimized codes (XL), keep function call parameters: -qkeepparm
   Load modules:
      ssh -X user@juqueen
      juqueen$ module load UNITE totalview
      UNITE loaded
      totalview 8.14.0-16-mrnet loaded
      juqueen$ mpixlcxx hello.cpp -qfullpath -qkeepparm -g -o helloworld

   TotalView: Interactive Startup
   Interactively call the lltv script. It creates a LoadLeveler batch script with the required TotalView parameters.
   If the user cancels the script, it cancels the debugging job (does not eat your computing quota).
   NOTE: License limited to 2048 MPI ranks (shared between all users). Attaching to a subset is recommended.

   TotalView: lltv Launch Script
      lltv -n <nodes> : -default_parallel_attach_subset=<rank-range> runjob -a --exe <program> -p <num>
   Starts <program> with <nodes> nodes and <num> processes per node, and attaches to <rank-range>:
      Rank: that rank only
      RankX-RankZ: all ranks, both inclusive
      RankX-RankZ:stride: every stride-th rank between RankX and RankZ
   Example: lltv -n 2 : -default_parallel_attach_subset=2-6
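To see which ranks a given rank range selects, the three forms listed above can be expanded by hand or with a short helper. The sketch below assumes the forms Rank, RankX-RankZ and RankX-RankZ:stride; expand_rank_range is a hypothetical name for this example and is not part of TotalView or lltv.

```python
def expand_rank_range(spec):
    """Expand 'R', 'X-Z' or 'X-Z:stride' into the list of selected ranks."""
    ranks, _, stride = spec.partition(":")
    step = int(stride) if stride else 1
    if step <= 0:
        raise ValueError("stride must be positive")

    first, dash, last = ranks.partition("-")
    start = int(first)
    end = int(last) if dash else start  # a single rank selects itself only
    if end < start:
        raise ValueError("range end must not be smaller than range start")

    # Both bounds are inclusive, hence end + 1.
    return list(range(start, end + 1, step))


print(expand_rank_range("5"))
print(expand_rank_range("2-6"))
print(expand_rank_range("0-15:4"))
```

With the license limited to 2048 MPI ranks shared between all users, expanding a range like this before launching is a quick way to confirm how many ranks a debugging session will attach to.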
28. or any part of the trace.

   Vampir: Recipe (JUQUEEN)
   1. module load UNITE vampirserver
   2. Start the Vampir server component (on the frontend) using vampirserver start smp. Check the output for port and pid.
   3. Connect to the server from a remote machine (see next slide) and analyze the trace.
   4. vampirserver stop <pid> (see step 2 above)

   Vampir: Recipe (local system)
   1. Open an SSH tunnel to JUQUEEN using ssh -L30000:localhost:<port> juqueen<r>
   2. Start the Vampir client component, for example /usr/local/zam/unite/bin/vampir
   3. Select: 1. Open other, 2. Remote file, 3. Connect (keep defaults), 4. File traces.otf2 from the Score-P trace measurement directory

   TAU
   Very portable tool set for instrumentation, measurement and analysis of parallel multi-threaded applications.
   http://tau.uoregon.edu/ (Tuning and Analysis Utilities)
   Supports:
      Various profiling modes and tracing
      Various forms of code instrumentation
      C, C++, Fortran, Java, Python
      MPI, multi-threading (OpenMP, Pthreads,
29. ottleneck? Call-path profiling, detailed basic block profiling.
   Why is it there? Hardware counter analysis; trace selected parts to keep the trace size manageable.
   Does the code have scalability problems? Load imbalance analysis; compare profiles at various sizes function by function.

   Remark: No Single Solution is Sufficient
   A combination of different methods, tools and techniques is typically needed.
      Analysis: statistics, visualization, automatic analysis, data mining
      Measurement: sampling / instrumentation, profiling / tracing
      Instrumentation: source code / binary, manual / automatic

   Critical Issues
   Accuracy
      Intrusion overhead: measurement itself needs time and thus lowers performance
      Perturbation: measurement alters program behavior, e.g., the memory access pattern
      Accuracy of timers and counters
   Granularity
      How many measurements?
      How much information / processing during each measurement?
   Tradeoff: accuracy vs. expressiveness of data

   Score-P
   Community instrumentation and measurement infrastructure, developed by a consortium of performance tool groups.
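The intrusion overhead listed under Critical Issues can be made concrete by timing the same loop with and without a per-iteration measurement. This is a toy sketch and not how Score-P or any other tool instruments code: kernel, instrumented_kernel, timed and relative_overhead are names invented for the example, and the "measurement" is simply a timestamp appended to a list.

```python
import time


def kernel(n):
    """Plain loop standing in for the application code."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def instrumented_kernel(n, events):
    """Same loop, but taking one measurement per iteration."""
    total = 0
    for i in range(n):
        events.append(time.perf_counter())  # fine-grained measurement
        total += i * i
    return total


def timed(func, *args, repeats=5):
    """Return the best wall-clock time of several runs, to damp noise."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best


def relative_overhead(baseline_s, instrumented_s):
    """Extra run time caused by measurement, as a fraction of the baseline."""
    if baseline_s <= 0:
        raise ValueError("baseline time must be positive")
    return (instrumented_s - baseline_s) / baseline_s


n = 100_000
baseline = timed(kernel, n)
instrumented = timed(instrumented_kernel, n, [])
print(f"relative overhead: {relative_overhead(baseline, instrumented):.1%}")
```

The printed value varies from run to run and from machine to machine. Measuring once per call instead of once per iteration would shrink the overhead but also the level of detail, which is the granularity tradeoff named on the slide.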
30. TAU: Instrumentation
   Flexible instrumentation mechanisms at multiple levels:
   Source code
      manual
      automatic: C, C++, F77/90/95 (Program Database Toolkit, PDT); OpenMP (directive rewriting with Opari)
   Object code
      pre-instrumented libraries (e.g., MPI using PMPI)
      statically linked and dynamically loaded (e.g., Python)
   Executable code
      dynamic instrumentation (pre-execution) (Dyninst)
      virtual machine instrumentation (e.g., Java using JVMPI)
   Support for performance mapping
   Support for object-oriented and generic programming

   TAU: Recipe
   1. module load UNITE tau (once per session)
   2. Specify the programming model by setting TAU_MAKEFILE to one of $TAU_MF_DIR/Makefile.tau-*
      MPI: Makefile.tau-bgqtimers-papi-mpi-pdt
      OpenMP/MPI: Makefile.tau-bgqtimers-papi-mpi-pdt-openmp-opari
   3. Compile and link with tau_cc.sh file.c, tau_cxx.sh file.cxx, or tau_f90.sh file.f90
   4. Execute with real input data. Environment variables control the measurement mode: TAU_PROFILE, TAU_TRACE, TAU_CALLPATH
   5. Examine results with paraprof

   [Screenshot: ParaProf window]
31. [Screenshot: STAT attach dialog with remote host selection, a process list filtered for runjob, and Refresh Process List and Attach buttons]

   STAT: Attach TotalView to Subset
   [Screenshot: STAT call graph with equivalence classes and the Attach TotalView / Attach DDT to Subset options]

   TotalView: Parallel Debugger
   UNIX symbolic debugger for C, C++, f77, f90, PGI HPF, and assembler programs.
   "Standard" debugger.
   Special, non-traditional features:
      Multi-process and multi-threaded
      C++ support (templates, inheritance, inline functions)
      F90 support (user types, pointers, modules)
      1D + 2D array data visualization
      Support for parallel debugging (MPI: automatic attach, message queues; OpenMP; pthreads)
      Scripting and batch debugging
      Memory debugging
   http://www.roguewave.com
32. xt view
   [Screenshot: Master Timeline context view. Display: Master Timeline; Type: Function; Function Name: open; Function Group: I/O; Interval Begin: 0.222 s; Interval End: 1.365 s; Duration: 1.143 s]
   Information about states, messages, collective and I/O operations is available by clicking on their representation.

   Vampir: Process and Counter Timelines
   Process timelines show call stack nesting.
   Counter timelines for hardware or software counters.
   [Screenshot: Vampir Trace View with a Process Timeline for Process 0, counter timelines for MEM_APP_ALLOC (Process 0) and ru_utime (Process 63), and a context view. Function Name: MPI_Wait; Function Group: MPI; Interval Begin: 18.121 s; Interval End: 18.121 s; Duration: 0 s]
