
HPCToolkit - Center for Scalable Application Development Software


Contents

1. [Screenshots: hpctraceviewer's Trace View with Time Range, Process Range, Cross Hair, and Mini Map panes ("Analyze Behavior over Time"), and hpcviewer's Calling Context / Callers / Flat views of S3D loop-level metrics for loops in diffflux_gen_uj.f, integrate_erk.f, variables_m.f90, rhsf.f90, thermchem_m.f90, heatflux_i_gen.f, getrates.f, and derivative_x.f90. Execution time increases 2.8x in the loop that scales worst; that loop contributes roughly a 6.9% scaling loss to the whole execution.]
2. HPCToolkit: Sampling-based Performance Tools for Leadership Computing. John Mellor-Crummey, Department of Computer Science, Rice University (johnmc@rice.edu). CScADS Leadership Computing, July 24, 2012. http://hpctoolkit.org

Acknowledgments. Funding sources: the Center for Scalable Application Development Software (cooperative agreement DE-FC02-07ER25800) and the Performance Engineering Research Institute (cooperative agreement DE-FC02-06ER25762). Project team: research staff Laksono Adhianto, Mike Fagan, and Mark Krentel; students Xu Liu, Milind Chabbi, and Karthik Murthy; collaborator Nathan Tallent (PNNL); alumni Gabriel Marin (ORNL), Robert Fowler (RENCI), and Nathan Froyd (Mozilla); summer interns Reed Landrum, Michael Franco, Sinchan Banerjee, and Philip Taffet.

Challenges for Computational Scientists. Execution environments and applications are rapidly evolving. Architecture: rapidly changing multicore microprocessor designs, increasing scale of parallel systems, growing use of accelerators. Applications: moving from MPI everywhere to threaded implementations, adding additional scientific capabilities to existing applications, maintaining multiple variants or configurations for particular problems. The result is a steep increase in the application development effort needed to attain performance, evolvability, and portability.
3. [Screenshot: hpcviewer on S3D, showing the Fortran source of the diffflux loop nest together with Calling Context / Callers / Flat views that compare single-core and multicore costs for loops in diffflux_gen_uj.f, integrate_erk.f, variables_m.f90, rhsf.f90, thermchem_m.f90, heatflux_i_gen.f, getrates.f, and derivative_x.f90.]

Outline: overview of Rice's HPCToolkit; accurate measurement; effective performance analysis; pinpointing scalability bottlenecks, both on large-scale parallel systems and when scaling on multicore processors; assessing process variability; understanding temporal behavior; using HPCToolkit; ongoing R&D.
4. [Screenshot: hpcviewer's context menu for a node in the calling context tree, offering Zoom in, Zoom out, Copy, Show source, Callsite, Show database's raw XML, and Graph PAPI_TOT_CYC / PAPI_L2_TCM as a plot graph, sorted plot graph, or histogram graph.] Note: measurement data must be analyzed with hpcprof-mpi to include thread-centric metrics in the performance database.

Radix Sort on 960 Cores: Barrier Time. [Screenshots: hpcviewer on mpbs-mpi2 with 960 cores; plot graphs of MPI_Barrier's PAPI_TOT_CYC sorted by rank and sorted by value, and a histogram of the metric's values across threads.]

Outline (as above).
5. The sampling-control API is a no-op unless the program is linked with hpclink or run under hpcrun; link against the API library (-L /home/projects/hpctoolkit/ppc64/pkgs/hpctoolkit/lib/hpctoolkit -lhpctoolkit).

HPCToolkit Capabilities at a Glance. [Screenshot: hpcviewer on MOAB's mbperf_iMesh benchmark (Barcelona 2360 SE), with TypeSequenceManager.hpp open at the SequenceCompare class that defines a less-than comparison of EntitySequence pointers by comparing the entity handles they point to. Costs are attributed to inlined procedures, loops, and function calls in full calling context, including frames inlined from stl_tree.h and TypeSequenceManager.hpp.]
6. Linux 2.6.32 or later has built-in kernel support for counters; earlier Linux kernels need a kernel patch (perfmon2 or perfctr).

HPCToolkit Documentation (http://hpctoolkit.org/documentation.html). A comprehensive user manual (http://hpctoolkit.org/manual/HPCToolkit-users-manual.pdf) covers: a quick-start guide (an essential overview that almost fits on one page); using HPCToolkit with statically linked programs (a guide for using HPCToolkit on BG/P and Cray XT); the hpcviewer and hpctraceviewer user interfaces; effective strategies for analyzing program performance with HPCToolkit (analyzing scalability, waste, and multicore performance); HPCToolkit and MPI; and HPCToolkit troubleshooting (why don't I have any source code in the viewer? hpcviewer isn't working well over the network, what can I do?). There is also an installation guide.

Using HPCToolkit. Add HPCToolkit's bin directory to your path (see the earlier slide for HPCToolkit's home directory on your system). Adjust your compiler flags if you want full attribution to source: add the -g flag after any optimization flags. Add hpclink as a prefix to your Makefile's link line (e.g., hpclink mpixlf -o myapp foo.o lib.a -lm). Decide what hardware counters to monitor: for statically linked executables (e.g., on Cray XT or BG/P), use hpclink to link your executable and launch it with the environment variable HPCRUN_EVENT_LIST set to the list of events to monitor (on BG/P, to the supported hardware counters); a sketch of this flow follows.
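To make the static-linking flow concrete, here is a minimal shell sketch; the event names, sampling periods, and process count are illustrative assumptions of mine, while hpclink, HPCRUN_EVENT_LIST, and the event@period form follow HPCToolkit's usual conventions.

```sh
# Build: prefix the normal link line with hpclink so the measurement
# library is linked into the statically linked executable.
hpclink mpixlf -o myapp foo.o lib.a -lm

# Run: choose events and sampling periods, then launch as usual.
export HPCRUN_EVENT_LIST="PAPI_TOT_CYC@3000000 PAPI_L2_TCM@400000"   # illustrative events/periods
aprun -n 256 ./myapp        # Cray; on BG/P pass the variable through qsub --env
```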
7. We have a display window of dimensions h x w, typically many more processes or threads than h, and typically many more samples (trace records) than w. The solution is to sample the samples: each sample of samples defines a pixel.

Outline (as above).

Where to Find HPCToolkit. ALCF systems: intrepid, /home/projects/hpctoolkit/ppc64/pkgs/hpctoolkit; vesta, /home/projects/hpctoolkit/pkgs/hpctoolkit; eureka, /home/projects/hpctoolkit/x86_64/pkgs/hpctoolkit. OLCF (Interlagos): /ccs/proj/hpctoolkit/pkgs/hpctoolkit-interlagos and /ccs/proj/hpctoolkit/pkgs/hpcviewer. NERSC (hopper): /project/projectdirs/hpctk/hpctoolkit and /project/projectdirs/hpctk/hpcviewer. For your local Linux systems you can download and install it; for documentation, build instructions, and software see http://hpctoolkit.org (we recommend downloading and building from svn). Important notes: using hardware counters requires downloading and installing PAPI, and it requires kernel support for hardware counters (built into Linux 2.6.32 and later).
8. Scalability Analysis Demo. Code: University of Chicago FLASH. Simulation: white dwarf detonation. Platform: Blue Gene/P. Experiment: 8192 vs. 256 processors. Scaling type: weak. [Figures courtesy of the FLASH team, University of Chicago: nova outbursts on white dwarfs, laser-driven shock instabilities, Orszag-Tang MHD vortex, Rayleigh-Taylor instability, helium burning on neutron stars, cellular detonation, and magnetic Rayleigh-Taylor.]

Scaling on Multicore Processors. Compare the performance of a single process vs. multiple processes on a multicore system. Strategy: differential performance analysis; subtract the calling context trees as before, with a unit coefficient for each.

S3D Multicore Losses at the Loop Level. [Screenshot: hpcviewer profiles of getrates.f, rhsf.f90, and diffflux_gen_uj.f, with the Fortran source of the diffflux loop nest. Execution time increases 2.8x in the loop that scales worst; that loop contributes a 6.9% scaling loss to the whole execution.]
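The loss figures in that screenshot come from the differential strategy just described. As a hedged sketch in my own notation (not the slides' exact formulation), with unit coefficients the loss charged to a calling context is simply the difference between its cost in the multicore run and its cost in the single-process baseline, normalized by the multicore total:

```latex
% Sketch of differential analysis with unit coefficients (notation is mine):
%   C_multi(c)  = cost attributed to calling context c when all cores are used
%   C_single(c) = cost attributed to c in the single-process baseline
%   T_multi     = total cost of the multicore run
\mathrm{loss}(c) = \frac{C_{\mathrm{multi}}(c) - C_{\mathrm{single}}(c)}{T_{\mathrm{multi}}}
```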
9. [Screenshot continued: hpcviewer rows for code inlined from stl_tree.h (including the loop at stl_tree.h:1388) and from TypeSequenceManager.hpp:27, with their cycle counts.]

Principal Views. Calling context tree view ("top-down", down the call chain): associate metrics with each dynamic calling context; a high-level, hierarchical view of the distribution of costs. Callers view ("bottom-up", up the call chain): apportion a procedure's metrics to its dynamic calling contexts; understand the costs of a procedure called in many places. Flat view: ignore the calling context of each sample point; aggregate all metrics for a procedure from any context; attribute costs to loop nests and lines within a procedure.

Outline (as above).

The Problem of Scaling. [Plot: ideal vs. actual parallel efficiency; the efficiency axis runs from 0.500 to 1.000.]
10. [Plot continued: the x-axis is the number of CPUs; note that higher is better.]

Goal: Automatic Scaling Analysis. Pinpoint scalability bottlenecks, guide the user to problems, quantify the magnitude of each problem, and diagnose the nature of the problem.

Challenges for Pinpointing Scalability Bottlenecks. Parallel applications: modern software uses layers of libraries, and performance is often context dependent. Monitoring: bottlenecks may arise in computation, data movement, or synchronization, and there are pragmatic constraints, namely acceptable data volume and low perturbation, so the tool can be used in production runs. [Example: a climate code skeleton.]

Performance Analysis with Expectations. You have performance expectations for your parallel code: strong scaling (linear speedup) or weak scaling (constant execution time). Putting your expectations to work: measure performance under different conditions (e.g., different levels of parallelism or different inputs); express your expectations as an equation; compute the deviation from expectations for each calling context, for both inclusive and exclusive costs; correlate the metrics with the source code; and explore the annotated call tree interactively. A sketch of one such equation appears below.

Pinpointing and Quantifying Scalability Bottlenecks. [Figure: calling context trees from runs at two scales are combined with coefficients chosen for the analysis of strong scaling.]
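One way to write the expectation as an equation, sketched here in my own notation rather than the exact coefficients shown in the slide's figure: under strong scaling from P to Q processes the aggregate work should stay constant, so work that appears only at the larger scale is excess work, and normalizing it gives a scalability-loss metric per calling context.

```latex
% Hedged sketch (notation mine):
%   C_P(c), C_Q(c) = cost attributed to calling context c in the P- and Q-process runs
%   T_Q            = total cost of the Q-process run
% Strong-scaling expectation: Q * C_Q(c) = P * C_P(c); the normalized deviation
% is the scalability loss attributed to context c.
\mathrm{loss}(c) = \frac{Q \cdot C_Q(c) - P \cdot C_P(c)}{Q \cdot T_Q}
```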
11. Outline (as above).

Parallel Radix Sort on 960 Cores. [Screenshot: hpcviewer on mpbs-mpi2 with 960 cores on hopper. Right-clicking on a node in the CCT view graphs its values across all threads; here a plot graph of usort's PAPI_TOT_CYC shows the values for all threads for the selected context. The source pane shows the entry point of uSort (around line 1323); the CCT pane shows MPI_Barrier, MPIR_Barrier_impl, psortui64, and the loop at psort-mpi2.c:801 with their summed and per-thread PAPI_TOT_CYC costs.]
12. Performance Analysis Goals. Programming-model-independent tools. Accurate measurement of complex parallel codes: large, multi-lingual programs; fully optimized code (loop optimization, templates, inlining); binary-only libraries, sometimes partially stripped; complex execution environments (dynamic loading on Linux clusters vs. static linking on Cray and Blue Gene, SPMD parallel codes with threaded node programs, batch jobs). Insightful analysis that pinpoints and explains problems: correlate measurements with code for actionable results; support analysis at the desired level, intuitive enough for application scientists and engineers, detailed enough for library developers and compiler writers. Scalable to petascale and beyond.

HPCToolkit Design Principles. Employ binary-level measurement and analysis: observe fully optimized, dynamically linked executions; support multi-lingual codes with external binary-only libraries. Use sampling-based measurement: avoid instrumentation; keep overhead controllable; minimize systematic error and avoid blind spots; enable data collection for large-scale parallelism. Collect and correlate multiple derived performance metrics: diagnosis typically requires more than one species of metric. Associate metrics with both static and dynamic context: loop nests, procedures, inlined code, calling context.
13. [Screenshot: hpcviewer flat view with data-centric metrics, attributing costs to the arrays allocated in VARIABLES_M and to REACTION_RATE_BOUNDS.] Associate Costs with Data (hpctoolkit.org).

Outline (as above).

Ongoing R&D. Available in prototype form: memory-leak detection; performance analysis of multithreaded code that can pinpoint and quantify insufficient parallelism and parallel overhead, and pinpoint and quantify idleness due to serialization at locks. Emerging capabilities: data-centric profiling, GPU support, enhanced analysis of OpenMP and multithreading. Future work: improving measurement scalability by using parallel file I/O.

Ask Me About: filtering traces; derived metrics; profiling OpenMP; profiling hybrid CPU/GPU code; data-centric performance analysis; profiling programs with recursion; the scalable trace server.
14. Support top-down performance analysis, a natural approach that minimizes the burden on developers.

Outline (as above).

HPCToolkit Workflow. [Diagram, repeated on the following slides: source code is compiled and linked; the optimized binary is run to produce call path profiles; hpcstruct performs binary analysis; hpcprof / hpcprof-mpi correlate profiles with source into a performance database; hpcviewer and hpctraceviewer present the results.] Compile and link: for dynamically linked executables on stock Linux, compile and link as you usually do; nothing special is needed. For statically linked executables (e.g., for BG/P or Cray XT), add monitoring by using hpclink as a prefix to your link line; hpclink uses linker wrapping to catch control operations such as process and thread creation, finalization, and signals. The next workflow step is to measure the execution unobtrusively.
15. Presentation perspectives: rank order by metrics to focus on what's important; compute derived metrics to help gain insight (e.g., scalability losses, waste, CPI, bandwidth); graph thread-level metrics for contexts; explore the evolution of behavior over time.

Outline (as above).

Measurement. [Workflow diagram, with the measurement step highlighted.] Call Path Profiling. Measure and attribute costs in context: sample timer or hardware-counter overflows and gather the calling context using stack unwinding. A call path sample consists of the instruction pointer plus the chain of return addresses, and samples are merged into a calling context tree. Overhead is proportional to sampling frequency, not call frequency.

Novel Aspects of Our Approach. Unwind fully optimized and even stripped code, using on-the-fly binary analysis to support unwinding.
16. Measure execution unobtrusively: launch optimized application binaries; dynamically linked applications are launched with hpcrun to measure them, while statically linked applications have the measurement library added at link time; control measurement with environment variable settings; collect statistical call path profiles of the events of interest.

Analyze the binary with hpcstruct to recover program structure: analyze the machine code, line map, and debugging information; extract loop nesting and identify inlined procedures; map transformed loops and procedures back to source.

Combine multiple profiles (multiple threads, multiple processes, multiple executions) and correlate the metrics to static and dynamic program structure.

Presentation: explore the performance data from multiple perspectives. A sketch of this end-to-end workflow for a dynamically linked application appears below.
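As a concrete illustration of the workflow for a dynamically linked application on an ordinary Linux machine, here is a minimal shell sketch; the application name, event choice, sampling period, and source path are placeholders of mine rather than values from the slides.

```sh
# 1. Measure: run the unmodified, optimized binary under hpcrun
hpcrun -e PAPI_TOT_CYC@3000000 ./myapp      # writes an hpctoolkit-myapp-measurements directory

# 2. Recover program structure from the binary
hpcstruct ./myapp                           # writes myapp.hpcstruct

# 3. Correlate measurements with source into a performance database
hpcprof -S myapp.hpcstruct -I /path/to/myapp/src hpctoolkit-myapp-measurements

# 4. Explore the database (its name may include host or job details)
hpcviewer hpctoolkit-myapp-database
```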
17. [Screenshot continued: additional madness::WorldTaskQueue::add frames and their metrics.]

Shift Blame from Symptoms to Causes. [Screenshot: hpcviewer on PFLOTRAN, showing an imbalance metric alongside the PAPI_TOT_CYC sum; the imbalance is attributed down through timestepper_module's stepper routines and loops in timestepper.F90 into SNESSolve, SNESSolve_LS, and SNESComputeJacobian.]

Assess Imbalance and Variability. [Screenshot: hpcviewer on S3D's reaction-rate code (reaction_rate_bounds); memory latency for the highlighted loop is 14.5% of the total latency in the program, and 41.2% of memory-hierarchy latency is related to it.]
18. Give hpcprof-mpi the path to your application's source with -I and name the measurement directory (hpctoolkit-your_app-measurements-jobid); the run command is aprun on Cray and qsub -q prod-devel -t 20 -n 32 --mode co on Blue Gene.

Analysis and Visualization. Use hpcviewer to open the resulting database (warning: the first time you graph any data, it will pause to combine information from all threads into one file). Use hpctraceviewer to explore traces (warning: the first time you open a trace database, the viewer will pause to combine information from all threads into one file). Try out our user interfaces before collecting your own data; example performance data for Chombo is available on hpctoolkit.org.

A Special Note About hpcstruct and xlf. IBM's xlf compiler emits machine code for Fortran that has an unusual mapping back to source. To compensate, hpcstruct needs a special option, --loop-fwd-subst=no; without this option, many nested loops will be missing in hpcstruct's output and, as a result, in hpcviewer.

Manual Control of Sampling. Why: to get meaningful results when measuring a shorter execution than would really be representative, for example when you only want to measure the solver without measuring initialization. How: set the environment variable HPCTOOLKIT_DELAY_SAMPLING=1, and bracket the region of interest with the API calls hpctoolkit_sampling_start() and hpctoolkit_sampling_stop(); the include file lives in /home/projects/hpctoolkit/ppc64/pkgs/hpctoolkit/include (#include <hpctoolkit.h>). Always link against the API library; a sketch follows.
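Here is a minimal shell sketch of that flow; the compiler invocation, source file, and launch command are illustrative assumptions of mine, while the environment variable, include and library paths, and API function names come from the slides.

```sh
# Compile against the hpctoolkit API header; in the source, bracket the region
# of interest with hpctoolkit_sampling_start() / hpctoolkit_sampling_stop().
mpicc -g -O2 -I/home/projects/hpctoolkit/ppc64/pkgs/hpctoolkit/include -c solver.c

# Link with hpclink (or plan to run under hpcrun) and against the API library,
# which is a no-op outside hpclink/hpcrun.
hpclink mpicc -o myapp solver.o \
        -L/home/projects/hpctoolkit/ppc64/pkgs/hpctoolkit/lib/hpctoolkit -lhpctoolkit

# Delay sampling until the program calls hpctoolkit_sampling_start()
export HPCTOOLKIT_DELAY_SAMPLING=1
export HPCRUN_EVENT_LIST="WALLCLOCK@5000"    # illustrative event and period
aprun -n 32 ./myapp
```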
19. For dynamically linked executables (e.g., on Linux), use hpcrun -L to learn about the counters available for profiling, or use papi_avail; you can sample any event listed as profilable.

Collecting Performance Data. Collecting traces: for dynamically linked programs, use hpcrun -t; for statically linked programs, set the environment variable HPCRUN_TRACE=1. Launching your job using HPCToolkit: on Blue Gene, qsub -q prod-devel -t 10 -n 2048 -c 8192 --env OMP_NUM_THREADS=2:HPCRUN_EVENT_LIST=WALLCLOCK@25000:HPCRUN_TRACE=1 your_app; on Cray with WALLCLOCK, setenv HPCRUN_EVENT_LIST WALLCLOCK@5000, setenv HPCRUN_TRACE 1, then aprun your_app; on Cray with hardware performance counters, setenv HPCRUN_EVENT_LIST "PAPI_TOT_CYC@3000000 PAPI_L2_MISS@2400000 PAPI_TLB_MISS@400000 PAPI_FP_OPS@2400000", setenv HPCRUN_TRACE 1, then aprun your_app.

Digesting your Performance Data. Use hpcstruct to reconstruct program structure (e.g., hpcstruct your_app creates your_app.hpcstruct). Correlate measurements to source code with hpcprof or hpcprof-mpi: run hpcprof on the front-end node to analyze a few processes (no per-thread profiles); run hpcprof-mpi on the compute nodes to analyze data in parallel, including the per-thread profiles that support the thread-centric graphical view. To digest performance data in parallel, launch hpcprof-mpi under your system's run command with -S your_app.hpcstruct; a consolidated sketch follows.
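Putting those pieces together, a hedged sketch of trace collection and parallel digestion on a Cray system follows (csh-style environment settings, as in the slides); the core counts, source path, and job-launch details are illustrative assumptions of mine.

```sh
# Collect call path profiles plus traces
setenv HPCRUN_EVENT_LIST WALLCLOCK@5000
setenv HPCRUN_TRACE 1
aprun -n 960 ./your_app        # writes hpctoolkit-your_app-measurements-jobid

# Recover program structure, then digest the data in parallel with hpcprof-mpi
hpcstruct your_app
aprun -n 32 hpcprof-mpi -S your_app.hpcstruct -I /path/to/your_app/src \
      hpctoolkit-your_app-measurements-jobid

# Explore the resulting database (its name may vary), including the traces
hpctraceviewer hpctoolkit-your_app-database
```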
20. Cope with dynamically loaded shared libraries on Linux: note as new code becomes available in the address space. Integrate static and dynamic context information in the presentation: dynamic call chains including procedures, inlined functions, loops, and statements.

Measurement Effectiveness. Accurate: PFLOTRAN on Cray XT (8192 cores), 148 unwind failures out of 289M unwinds (5e-5% errors); FLASH on Blue Gene/P (8192 cores), 212K unwind failures out of 1.1B unwinds (2e-2% errors); the SPEC2006 benchmark test suite (sequential codes, fully optimized executables from the Intel, PGI, and PathScale compilers), 292 unwind failures out of 18M unwinds on Intel Harpertown (1e-3% error). Low overhead: e.g., a PFLOTRAN scaling study on Cray XT (512 cores) measuring cycles, L2 misses, FLOPs, and TLB misses ran with 1.5% overhead; suitable for use on production runs.

Outline (as above).

Effective Analysis. [Workflow diagram, with the analysis step (hpcprof / hpcprof-mpi correlating profiles with source into the database) highlighted.]
21. Pinpoint and Quantify Scaling Bottlenecks. [Screenshot: hpcviewer on MADNESS (quantum chemistry, MPI plus pthreads) running on 16 cores, one thread per core, on a 4 x Barcelona node. The source pane shows the templated WorldTaskQueue::add member that wraps a member function and its arguments in a new TaskMemfun and returns a Future for the result; the metric pane shows an idleness metric attributed to contexts under pthread_spin_unlock and madness::Spinlock::unlock, with frames inlined from worldmutex.h, worldtask.h, and worlddep.h. Lock contention while adding futures to a shared global work queue accounts for 23.5% of execution time.]
22. Outline (as above).

Understanding Temporal Behavior. Profiling compresses out the temporal dimension: temporal patterns, e.g., serialization, are invisible in profiles. What can we do? Trace call path samples: N times per second, take a call path sample of each thread; organize the samples for each thread along a time line; view how the execution evolves left to right. What do we view? Assign each procedure a color and view a depth slice of an execution, with processes on one axis and time on the other.

Process-Time Views of PFLOTRAN. [Screenshot: hpctraceviewer views at call-stack depths 3 and 6 of an 8184-core execution on Cray XT5; the call path pane descends from main and pflotran through timestepper_module step, SNESSolve, KSPSolve, VecDotNorm2, and PMPI_Allreduce into MPICH and Portals progress routines. The trace view was rendered using hpctraceviewer on a MacBook Pro laptop; insets show a zoomed view of the marked region at different call-stack depths.]

Presenting Large Traces on Small Displays. How do we render an arbitrary portion of an arbitrarily large trace?
23. Recovering Program Structure. Analyze an application binary: identify object-code procedures and loops (decode machine instructions, construct the control flow graph from branches, identify natural loop nests using interval analysis); map object-code procedures and loops to source code (leverage the line map and debugging information, discover inlined code, account for many loop and procedure transformations). The unique benefit of this binary analysis is that it bridges the gap between lightweight measurement of fully optimized binaries and the desire to correlate low-level metrics with source-level abstractions.

Analyzing Results with hpcviewer. [Screenshot: hpcviewer on MOAB's mbperf_iMesh benchmark (Barcelona 2360 SE), with mbperf_iMesh.cpp, TypeSequenceManager.hpp, and stl_tree.h open at the SequenceCompare less-than comparison of EntitySequence pointers; PAPI_L1_DCM and PAPI_TOT_CYC costs are attributed to inlined procedures, imesh_getvtxarrcoords, MBCore::get_coords, and the loop at MBCore.cpp:681.]
24. Application developers need to assess weaknesses in algorithms and their implementations, improve the scalability of executions within and across nodes, adapt to changes in emerging architectures, and overhaul algorithms and data structures to add new capabilities. Performance tools can play an important role as a guide.

Performance Analysis Challenges. Complex architectures are hard to use efficiently: multi-level parallelism (multicore, ILP, SIMD instructions) and a multi-level memory hierarchy mean the gap between typical and peak performance is huge. Complex applications present challenges for measurement and analysis, and for understanding and tuning. Supercomputer platforms compound the complexity: unique hardware, unique microkernel-based operating systems, and multifaceted performance concerns (computation, communication, I/O).

Performance Analysis Principles. Without accurate measurement, analysis is irrelevant: avoid systematic measurement error; measure actual executions of interest, not an approximation (fully optimized production code on the target platform). Without effective analysis, measurement is irrelevant: quantify and attribute problems to source code; compute insightful metrics, e.g., scalability loss or waste rather than just cycles. Without scalability, a tool is irrelevant for supercomputing: it must handle large codes and large-scale threaded parallelism within and across nodes.
