Contents

1. --exe myprog
   ## After job finished ##
   scalasca -examine scorep_myprog_Ppnxt_sum
   M. Knobloch, SC Introduction, May 2015
   Call-path Profile: Example (cont.)
   [Screenshot: CUBE 4.3.0 showing scorep_bt-mz_C_64x32_sum/summary.cubex with metric tree, call tree, flat view, system tree and box plot tabs. Metric tree: Time 1.44e4 sec, Visits 8.77e7 occ, Synchronizations 0.00 occ, Communications 0.00 occ, Bytes transferred 9.57e9 bytes, MPI file operations 0.00 occ, Computational imbalance 2696.16 sec, Minimum Inclusive Time 0.00 sec, Maximum Inclusive Time 6.99 sec, task_migration_loss, task_migration_win. System tree root: machine JUQUEEN, 1.44e4 (100.00%).]
2. [Figure: datatype mismatch view; MPI_Sendrecv:send uses MPI_Type_contiguous (count=2) of MPI_INT, MPI_Sendrecv:recv uses MPI_BYTE]
   MUST: Deadlock Detection
   The application issued a set of MPI calls that can cause a deadlock! The graphs below show details on this situation. This includes a wait-for graph that shows active wait-for dependencies between the processes that cause the deadlock. Note that this process set only includes processes that cause the deadlock and no further processes. A legend details the wait-for graph components in addition, while a parallel call stack view summarizes the locations of the MPI calls that cause the deadlock. Below these graphs, a message queue graph shows active and unmatched point-to-point communications. This graph only includes operations that could have been intended to match a point-to-point operation that is relevant to the deadlock situation. Finally, a parallel call stack shows the locations of any operation in the parallel call stack. The leaves of this call stack graph show the components of the message queue graph that they span. The application still runs; if the deadlock manifested (e.g. caused a hang on this MPI implementation) you can attach to the involved ranks with a debugger or abort the application (if necessary).
   [Figure: wait-for graph with nodes MPI_Send 0, MPI_Send 1, MPI_Send 2 (each comm A, tag 456); legend: Active MPI Call, Sub Operation, "A waits for B and C"]
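The wait-for graph described in the MUST report can be illustrated with a small cycle search. The following is only a sketch of the general idea, not MUST's actual algorithm: it assumes every rank waits for all of the ranks listed for it, and the ring of four ranks blocked in MPI_Send is a hypothetical input chosen to resemble the figure.

```python
from typing import Dict, List, Optional


def find_wait_cycle(waits_for: Dict[int, List[int]]) -> Optional[List[int]]:
    """Return one cycle of ranks in a wait-for graph, or None if there is none.

    waits_for maps a rank to the ranks it is currently blocked on.
    """
    WHITE, GREY, BLACK = 0, 1, 2          # unvisited, on current path, finished
    color = {rank: WHITE for rank in waits_for}
    path: List[int] = []

    def visit(rank: int) -> Optional[List[int]]:
        color[rank] = GREY
        path.append(rank)
        for target in waits_for.get(rank, []):
            state = color.get(target, WHITE)
            if state == GREY:
                # target is already on the current path: a dependency cycle
                return path[path.index(target):] + [target]
            if state == WHITE:
                cycle = visit(target)
                if cycle is not None:
                    return cycle
        path.pop()
        color[rank] = BLACK
        return None

    for rank in list(waits_for):
        if color[rank] == WHITE:
            cycle = visit(rank)
            if cycle is not None:
                return cycle
    return None


# Hypothetical input: four ranks, each blocked in MPI_Send to its right neighbour.
ring = {0: [1], 1: [2], 2: [3], 3: [0]}
cycle = find_wait_cycle(ring)
print(cycle)
```

A real checker also has to handle "waits for any of" dependencies (for example wildcard receives), which a plain cycle search does not cover.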
3. [Figure residue: comm A, tag 456; MPI_Send 3; main, example.c:39, MPI_Send]
   Performance Analysis Tools
   Typical Performance Analysis Procedure
   Do I have a performance problem at all? Time / speedup / scalability measurements.
   What is the key bottleneck (computation / communication)? MPI / OpenMP / flat profiling.
   Where is the key bottleneck? Call-path profiling, detailed basic block profiling.
   Why is it there? Hardware counter analysis; trace selected parts (to keep trace size manageable).
   Does the code have scalability problems? Load imbalance analysis; compare profiles at various sizes function by function.
   Remark: No Single Solution is Sufficient!
   A combination of different methods, tools and techniques is typically needed!
   Analysis: statistics, visualization, automatic analysis, data mining.
   Measurement: sampling / instrumentation, profiling / tracing.
   Instrumentation: source code / binary, manual / automatic.
   Critical Issues
   Accuracy. Intrusion overhead: measurement itself needs time and thus lowers performance. Perturbation: measurement alters program behavior, e.g. memory access pattern. Might prevent compil
4. e Available for any part of the trace
   HPCToolkit
   Multi-platform sampling-based call-path profiler. Works on unmodified, optimized executables. http://hpctoolkit.org
   Advantages: Overhead can be easily controlled via sampling interval. Advantageous for complex C++ codes with many small functions. Loop-level analysis (sometimes even individual source lines). Supports POSIX threads.
   Disadvantages: Statistical approach that might miss details. MPI/OpenMP time displayed as low-level system calls.
   HPCToolkit: Metric Specification
   Specified via environment variable HPCRUN_EVENT_LIST.
   General format: name@interval name@interval
   Possible sample sources: WALLCLOCK; PAPI counters; IO (use w/o interval spec); MEMLEAK (use w/o interval spec).
   Interval given in microseconds. E.g. 10000 gives 100 samples per second.
   Example: hpcviewer
   [Screenshot: hpcviewer source pane showing the inner loop of sor.c around lines 670 to 682 (delta, omega, fkj, rkj, tmpres = fabs(delta))]
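The relation between sampling interval and sampling rate stated on the metric specification slide ("10000 gives 100 samples per second") is plain arithmetic. The sketch below works it out for intervals given in microseconds; the function names are hypothetical helpers for illustration and are not part of HPCToolkit.

```python
def samples_per_second(interval_us: int) -> float:
    """Sampling rate that results from one sample every interval_us microseconds."""
    if interval_us <= 0:
        raise ValueError("interval must be a positive number of microseconds")
    return 1_000_000 / interval_us


def interval_for_rate(rate_hz: float) -> int:
    """Interval in microseconds that yields roughly rate_hz samples per second."""
    if rate_hz <= 0:
        raise ValueError("rate must be positive")
    return round(1_000_000 / rate_hz)


# The value from the slide: a 10000 microsecond interval.
print(samples_per_second(10000))
print(interval_for_rate(100))
```

A shorter interval gives more samples and finer attribution, at the price of higher measurement overhead.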
5. [Screenshot: STAT call prefix tree views (2D dot and 3D dot files); toolbar buttons ReAttach, Detach, Pause, Resume, Sample, Sample Multiple; edges labeled with a task count followed by ranks, such as "512 0 511" and "128 0 4 8 12 16"; one node reads PAMI Context trylock advancev]
   STAT: Zoom
   [Screenshot: zoomed tree below "body" for 512 tasks (ranks 0 to 511), split into groups of 128 tasks; callout "Which ranks are following"]
   STAT: Equivalence Classes
   [Screenshot: groups of 128 tasks; context menu entries Equivalence Class, Collapse, Collapse Depth, Hid
6. Vampir Recipe (JUQUEEN)
   1. module load UNITE vampirserver
   2. Start Vampir server component (on frontend) using: vampirserver start smp. Check output for port and pid.
   3. Connect to server from remote machine (see next slide) and analyze the trace.
   4. vampirserver stop <pid> (see above, 2)
   Vampir Recipe (local system)
   1. Open SSH tunnel to JUQUEEN using: ssh -L30000:localhost:<port> juqueen<r>
   2. Start Vampir client component. For example: /usr/local/zam/unite/bin/vampir
   3. Select: 1. Open other, 2. Remote file, 3. Connect (keep defaults), 4. File: traces.otf2 from Score-P trace measurement directory
   HPCToolkit Recipe
   1. Compile your code with -g -qnoipa. For MPI, also make sure your application calls MPI_Comm_rank first on MPI_COMM_WORLD.
   2. Prefix your link command with hpclink. Ignore potential linker warnings.
   3. Run your application as usual, specifying requested metrics with sampling intervals in environment variable HPCRUN_EVENT_LIST.
   4. Perform static binary analysis with: hpcstruct --loop-fwd-subst=no <app>
   5. Combine measurements with: hpcprof -S <struct file> -I <
7. scorep-score -r scorep_myprog_Ppnxt_sum/profile.cubex
   INFO: Score report written to scorep_myprog_Ppnxt_sum/scorep.score
   Estimates trace buffer requirements.
   Allows to identify candidate functions for filtering: computational routines with high visit count and low time-per-visit ratio.
   Region / call-path classification:
   MPI (pure MPI library functions)
   OMP (pure OpenMP functions/regions)
   USR (user-level source local computation)
   COM ("combined" USR + OpenMP/MPI)
   ANY/ALL (aggregate of all region types)
   Call-path Profile: Example (cont.)
   less scorep_myprog_Ppnxt_sum/scorep.score
   Estimated aggregate size of event trace: 162GB
   Estimated requirements for largest trace buffer (max_buf): 2758MB
   Estimated memory requirements (SCOREP_TOTAL_MEMORY): 2822MB
   (hint: When tracing set SCOREP_TOTAL_MEMORY=2822MB to avoid intermediate flushes or reduce requirements using USR regions filters.)
   flt type      max_buf[B]         visits   time[s]  time[%]  time/visit[us]  region
       ALL    2,891,417,902  6,662,521,083  36581.51    100.0            5.49  ALL
       USR    2,858,189,854  6,574,882,113  13618.14     37.2            2.07  USR
       OMP       54,327,600     86,353,920  22719.78     62.1          263.10  OMP
       MPI          676,342        550,010    208.98      0.6          379.96  MPI
       COM          371,930        735,040     34.61      0.1           47.09  COM
                921,918,660  2,110,313,472  9 0 1  matmul_sub
                921,918,660  2,110,313,472  6 2 2  binvcrh
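The time/visit column of the score report is the region time divided by the visit count. As a minimal sketch, the snippet below recomputes that column for the per-type rows quoted above and applies the slide's filtering rule of thumb (high visit count, low time per visit). The threshold values and helper names are hypothetical and are not part of Score-P.

```python
from typing import Dict, List, Tuple

# (visits, time in seconds) per region type, copied from the score report above.
SCORE_ROWS: Dict[str, Tuple[int, float]] = {
    "ALL": (6_662_521_083, 36581.51),
    "USR": (6_574_882_113, 13618.14),
    "OMP": (86_353_920, 22719.78),
    "MPI": (550_010, 208.98),
    "COM": (735_040, 34.61),
}


def time_per_visit_us(time_s: float, visits: int) -> float:
    """Average time per visit in microseconds."""
    return time_s / visits * 1e6


def filter_candidates(rows: Dict[str, Tuple[int, float]],
                      min_visits: int,
                      max_us_per_visit: float) -> List[str]:
    """Names with many visits and little time per visit (thresholds are assumptions)."""
    return [
        name
        for name, (visits, time_s) in rows.items()
        if name != "ALL"
        and visits >= min_visits
        and time_per_visit_us(time_s, visits) <= max_us_per_visit
    ]


for name, (visits, time_s) in SCORE_ROWS.items():
    print(f"{name}: {time_per_visit_us(time_s, visits):.2f} us per visit")

candidates = filter_candidates(SCORE_ROWS, min_visits=1_000_000_000, max_us_per_visit=5.0)
print(candidates)
```

In practice the same test is applied to individual USR routines (rows such as matmul_sub), since filtering works per region rather than per type.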
8. scorep-info config-vars for details.
   Allows for targeted measurements: selective recording, phase profiling, parameter-based profiling.
   Please ask us or see the user manual for details.
   Why is the Bottleneck There?
   This is highly application dependent! Might require additional measurements: hardware-counter analysis (CPU utilization, cache behavior), selective instrumentation, manual/automatic event trace analysis.
   HW Counter Measurements w/ Score-P
   Score-P supports both PAPI preset and native counters.
   Available counters: papi_avail or papi_native_avail
   module load UNITE papi/5.0.1
   less $PAPI_ROOT/doc/papi-5.0.1-avail.txt
   less $PAPI_ROOT/doc/papi-5.0.1-native_avail.txt
   less $PAPI_ROOT/doc/papi-5.0.1-avail-detail.txt
   Specify using SCOREP_METRIC_PAPI environment variable:
   ## In the job script ##
   module load UNITE scalasca
   export SCOREP_METRIC_PAPI="PAPI_FP_OPS,PAPI_TOT_CYC"
   scalasca -analyze -f filter.txt \
   runjob --ranks-per-node P --np n --exe myprog
   Automatic Trace Analysis w/ Scalasca
   Idea: Automatic search for patterns of inefficient behavior. Identification of wait states and their root causes. Classi
9. [Screenshot: TotalView main window, rank 0 stopped at Breakpoint 1. Toolbar: Step, Out, Run To, Prev, UnStep, Caller, BackTo, Live. Stack trace pane: main, __libc_start_main, _start. Stack frame pane: function main, local variables myrank = 0 and numprocs = 4. Source pane: function main in hello_mpi.c, which includes stdio.h and mpi.h and calls MPI_Init, MPI_Comm_rank, MPI_Comm_size and MPI_Finalize. Action points pane: 1, hello_mpi.c, main.]
   TotalView: Tools Menu
   Call Graph. Data visualization.
   [Screenshot: call graph and data visualization windows]
   Message queue display options: Pending Sends, Pending Recvs, Unexpected Messages.
   [Screenshot: message queue window listing MPI_COMM_WORLD, MPI_COMM_WORLD_
10. [Screenshot: hpcviewer source and metric panes for sor.c. Views: Calling Context View, Callers View, Flat View. Metric columns: WALLCLOCK (us), inclusive and exclusive. Rows: Experiment Aggregate Metrics (4.79e+06, 100%), main, sor_iter (4.68e+06, 97.7%), loop at sor.c:344, inlined from sor.c:658, loops at sor.c:662 and sor.c:673, source lines sor.c:675, sor.c:682, sor.c:678.]
    Allinea Performance Reports
    Single page report provides quick overview of performance issues. Works on unmodified, optimized executables. Shows CPU, memory, network and I/O utilization. Supports MPI, multi-threading and accelerators. Saves data in HTML, CSV or text form.
    http://www.allinea.com/products/allinea-performance-reports
    Note: License limited t
11. Forschungszentrum Jülich
    HPC Software: Compiler and Tools
    May 2015, Michael Knobloch
    Outline
    Local module setup. Compilers. Libraries. Debuggers. Performance Tools: Score-P, Scalasca, Vampir, HPCToolkit, Performance Reports, TAU.
    "Make it work, make it right, make it fast." (Kent Beck)
    Module setup & compiler
    The Module Setup
    Tools are available through "modules". Allows to easily manage different versions of programs. Works by dynamic modification of a user's environment.
    JUQUEEN: Module setup based on UNITE. Tools only visible after a module load UNITE. User has to take care of dependencies.
    JUROPATEST: Module setup based on EasyBuild and lmod. Staged, hierarchical setup. Automatically manages dependencies via toolchains.
    Most Important Module Commands
    module
      spider                     # lmod only: show all products
      spider product             # lmod only: show product details
      avail                      # show all available products
      list                       # list loaded products
      load product(s)            # setup access to product
      unload product(s)          # release access
      swap product1 product2     # replace v1 of product with v2
      whatis product(s)          # print short description
      help
12. [Chart residue: remaining function labels of a Vampir function summary for a WRF trace (MPI_Isend, MPI_Irecv and module_* routines such as the cumulus, surface and small-step drivers)]
    Vampir: Process Summary
    Execution statistics over all processes for comparison. Clustering mode available for large process counts.
    [Screenshot: Vampir Trace View, Process Summary over a 0 s to 30 s timeline for processes 0 to 22, dominated by solve_em_]
    Vampir: Communication Statistics
    [Screenshot: Vampir Trace View
13. UNITE scalasca
    scalasca -analyze \
    runjob --ranks-per-node P --np n --exe myprog
    ## After job finished ##
    scalasca -examine scorep_myprog_Ppnxt_sum
    Flat MPI Profile: Example (cont.)
    [Screenshot: CUBE 4.3.0 showing scorep_bt-mz_C_64_sum/summary.cubex. Metric tree: Time, Visits 5.50e5 occ, Synchronizations, Communications, Bytes transferred 9.57e9 bytes, MPI file operations, Computational imbalance, Minimum Inclusive Time, Maximum Inclusive Time, task_migration_loss, task_migration_win. Flat view: PARALLEL, MPI_Init_thread, MPI_Comm_size, MPI_Comm_rank, MPI_Comm_split, MPI_Bcast, MPI_Isend, MPI_Irecv, MPI_Waitall, MPI_Barrier, MPI_Reduce, MPI_Finalize. System tree: MPI Rank 1.]
    Where is the Key Bottleneck?
    Generate call-path profile using Score-P/Scalasca. Requires re-compilation. Runtime overhead depends on application characteristics. Typically needs some care setting up a good measu
14. [Screenshot: CUBE metric tree (OpenMP time split into Flush, Management, Synchronization, Barrier 3417.33, Critical, Lock API, Ordered, Overhead, Idle threads 3908.33; Visits 8.77e7 occ; Bytes transferred 9.57e9 bytes; Computational imbalance 2696.16 sec), call tree (mpi_setup, exact_rhs, timer_clear, exch_qbc, adi, compute_rhs, x_solve, y_solve, z_solve, add, MPI_Barrier, timer_start, timer_stop, timer_read, verify, MPI_Reduce, print_results, MPI_Finalize) and system tree over all 2048 elements. Selected: omp implicit barrier in z_solve (line 428).]
    Score-P: Advanced Features
    Measurement can be extensively configured via environment variables. Check output of
15. [Screenshot title residue: Vampir Trace View of a WRF trace]
    Byte and message count, min/max/avg message length and min/max/avg bandwidth for each process pair.
    Message length statistics.
    [Screenshot: Communication Matrix View over processes 0 to 63]
    [Screenshot: Message Summary charts listing message counts by message size (in KiB) and transfer rates (scale 0 MiB/s to 1600 MiB/s; listed values between about 404 MiB/s and 493 MiB/s)]
16. alable automated search for event patterns representing inefficient behavior. Scalable identification of the critical execution path. Delay / root-cause analysis.
    Based on Score-P for instrumentation and measurement. Includes convenience / post-processing commands providing added value.
    http://www.scalasca.org
    What is the Key Bottleneck?
    Generate flat MPI profile using Score-P/Scalasca or mpiP. Only requires re-linking. Low runtime overhead.
    Provides detailed information on MPI usage: How much time is spent in which operation? How often is each operation called? How much data was transferred?
    Limitations: Computation on non-master threads and outside of MPI_Init/MPI_Finalize scope ignored.
    Flat MPI Profile: Recipe
    1. Prefix your link command with "scorep --nocompiler"
    2. Prefix your MPI launch command with "scalasca -analyze"
    3. After execution, examine analysis results using "scalasca -examine scorep_<title>"
    Flat MPI Profile: Example
    module load UNITE scorep scalasca
    mpixlf90 -O3 -qsmp=omp -c foo.f90
    mpixlf90 -O3 -qsmp=omp -c bar.f90
    scorep --nocompiler mpixlf90 -O3 -qsmp=omp -o myprog foo.o bar.o
    ## In the job script ##
    module load
17. atch script with required TotalView parameters. If user cancels the script, it cancels the debugging job (does not eat your computing quota).
    NOTE: License limited to 2048 MPI ranks (shared between all users). Attaching to subset is recommended.
    TotalView: tv Launch Script
    lltv -n <nodes> : -default_parallel_attach_subset=<rank-range> runjob -a --exe <program> -p <num>
    Starts <program> with <nodes> and <num> processes per node, attaches to <rank-range>:
    Rank: that rank only
    RankX-RankZ: all ranks, both inclusive
    RankX-RankZ:stride: every strideth between RankX and RankZ
    Example:
    lltv -n 2 : -default_parallel_attach_subset=2-6 runjob -a --exe helloworld -p 64
    Creating LoadLeveler Job
    Submitting LoadLeveler Interactive Job for Totalview
    TotalView: Execution Recipe
    TotalView tries to debug runjob and shows no source code. Ignore it and press GO.
    After some seconds, TotalView will detect parallel execution and ask if it should stop. Yes, it should stop.
    To find the correct point (file/function) to debug, use the File, Open command.
    Set your breakpoints and press GO again. Debugging session will then start.
    To see a variable's contents, double-click on it in the source
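The three rank-range forms listed for the attach subset (a single rank, an inclusive range, and a range with a stride) can be expanded into explicit rank lists. The sketch below assumes the forms are written as "Rank", "RankX-RankZ" and "RankX-RankZ:stride"; the separator characters are an assumption because the slide text has lost its punctuation, and expand_rank_range is a hypothetical helper, not part of TotalView or lltv.

```python
from typing import List


def expand_rank_range(spec: str) -> List[int]:
    """Expand 'R', 'X-Z' or 'X-Z:stride' into the list of ranks it selects.

    The '-' and ':' separators are assumed; both range ends are inclusive.
    """
    stride = 1
    if ":" in spec:
        spec, stride_text = spec.split(":", 1)
        stride = int(stride_text)
    if "-" in spec:
        first_text, last_text = spec.split("-", 1)
        first, last = int(first_text), int(last_text)
    else:
        first = last = int(spec)
    if stride < 1 or last < first:
        raise ValueError(f"invalid rank range: {spec!r}")
    return list(range(first, last + 1, stride))


print(expand_rank_range("5"))        # a single rank
print(expand_rank_range("2-6"))      # the range used in the slide's example
print(expand_rank_range("0-16:4"))   # every 4th rank between 0 and 16
```

Listing the ranks before launching a session makes it easy to check that the subset stays well below the shared license limit mentioned above.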
18. c: C, C++, F77/90/95 (Program Database Toolkit, PDT); OpenMP (directive rewriting with Opari).
    Object code: pre-instrumented libraries (e.g. MPI using PMPI); statically linked and dynamically loaded (e.g. Python).
    Executable code: dynamic instrumentation, pre-execution (Dyninst); virtual machine instrumentation (e.g. Java using JVMPI).
    Support for performance mapping. Support for object-oriented and generic programming.
    [Screenshot: TAU ParaProf profile window, counter P_WALL_CLOCK_TIME (seconds), 512 processes. Bars sorted by time, led by MPI_Allreduce (345.5474), algs::HyperbolicLevelIntegrator3 advance_bdry_fill_create (116.4951), advanceLevel (103.2566) and fill_new_level_create (59.0096), followed by further HyperbolicLevelIntegrator3 and mesh::GriddingAlgorithm3 routines, MPI_Init, MPI_Bcast and MPI_Wait.]
19. [Screenshot residue: TotalView message queue graph]
    MUST
    Next generation MPI correctness and portability checker.
    http://doc.itc.rwth-aachen.de/display/CCP/Project+MUST
    MUST reports:
    Errors: violations of the MPI standard
    Warnings: unusual behavior or possible problems
    Notes: harmless but remarkable behavior
    Further: potential deadlock detection
    Usage: Relink application with mustc, mustcxx, mustf90. Run application under the control of mustrun (requires one additional MPI process). See MUST_Output.html report.
    MUST: Datatype Mismatch
    [Report excerpt, reference 1, rank 0:] A send and a receive operation use datatypes that do not match! Mismatch occurs at (contiguous)[0](MPI_INT) in the send type and at (MPI_BYTE) in the receive type (consult the MUST manual for a description of datatype positions). A graphical representation of this situation is available as an HTML page. The send operation was started at reference 1, the receive operation was started at reference 2.
    (Information on communicator: MPI_COMM_WORLD)
    (Information on send of count 1 with type: Datatype created at reference 3 is for C, commited at reference 4, based on the following type(s): {MPI_INT}. Typemap = {(MPI_INT, 0), (MPI_INT, 4)})
    (Information on receive of count 8 with type: MPI_BYTE)
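The mismatch in this report can be reproduced on paper by comparing type signatures: the send side carries two MPI_INT entries, the receive side eight MPI_BYTE entries. The following sketch models that comparison in plain Python. It is an illustration only, not how MUST is implemented; the 4-byte size of MPI_INT is an assumption taken from the typemap shown in the report, and all helper names are hypothetical.

```python
from typing import List, Optional, Tuple

Typemap = List[Tuple[str, int]]          # (basic type name, byte displacement)

# Assumed sizes in bytes; MPI_INT = 4 matches the typemap in the report.
SIZES = {"MPI_INT": 4, "MPI_BYTE": 1}


def basic(name: str) -> Typemap:
    """Typemap of a predefined basic datatype."""
    return [(name, 0)]


def extent(typemap: Typemap) -> int:
    """Span in bytes from the lowest displacement to the end of the last entry."""
    low = min(disp for _, disp in typemap)
    high = max(disp + SIZES[name] for name, disp in typemap)
    return high - low


def contiguous(count: int, base: Typemap) -> Typemap:
    """Model of MPI_Type_contiguous: count copies of base, laid out back to back."""
    step = extent(base)
    return [(name, disp + i * step) for i in range(count) for name, disp in base]


def signature(count: int, typemap: Typemap) -> List[str]:
    """Sequence of basic types transferred for count elements of a datatype."""
    return [name for _ in range(count) for name, _ in typemap]


def first_mismatch(send_sig: List[str], recv_sig: List[str]) -> Optional[int]:
    """Index of the first position where the signatures disagree, else None."""
    for index, (sent, received) in enumerate(zip(send_sig, recv_sig)):
        if sent != received:
            return index
    if len(send_sig) > len(recv_sig):
        return len(recv_sig)             # message longer than the receive allows
    return None


send_type = contiguous(2, basic("MPI_INT"))       # as in the report
recv_type = basic("MPI_BYTE")
print(send_type)
print(first_mismatch(signature(1, send_type), signature(8, recv_type)))
```

Both sides describe 8 bytes here, so the transfer may appear to work on one system, which is why the slide lists MUST as a portability checker as well as a correctness checker.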
20. e, Expand, Expand All, Focus, View Source
    [Screenshot: STAT tree nodes labeled with groups of 128 tasks]
    STAT: Equivalence Classes (cont.)
    [Screenshot: main, 512 tasks (ranks 0 to 511), body, then four equivalence classes of 128 tasks each]
    TotalView: Parallel Debugger
    UNIX symbolic debugger for C, C++, F77, F90, PGI HPF, assembler programs.
    "Standard" debugger.
    Special, non-traditional features: multi-process and multi-threaded; C++ support (templates, inheritance, inline functions); F90 support (user types, pointers, modules); 1D + 2D array data visualization; support for parallel debugging (MPI: automatic attach, message queues; OpenMP; pthreads); scripting and batch debugging; memory debugging.
    http://www.roguewave.com
    NOTE: License limited to 2048 processes (shared between all users).
    TotalView: Main Window
    [Screenshot: TotalView main window; menus File, Edit, View, Group, Process, Thread, Action Point, Debug, Tools, Windows; toolbar: Group (Control), Go, Halt, Kill, Restart, Next
21. er optimization, e.g. function inlining. Accuracy of timers & counters.
    Granularity: How many measurements? How much information / processing during each measurement?
    Tradeoff: Accuracy vs. expressiveness of data.
    Performance Tools (status: May 2015)
    [Table of tool availability on JUQUEEN and JUROPATEST; the check marks did not survive extraction.] Tools listed: Score-P measurement system; Scalasca2 performance analyzer; Vampir[Server] trace visualizer; HPCToolkit sampling profiler; Allinea Performance Reports; TAU performance system; mpiP MPI profiling library; Extrae/Paraver tracing tool; PAPI hardware counter library.
    Score-P
    Community instrumentation and measurement infrastructure for parallel codes. Developed by a consortium of performance tool groups. Next generation measurement system of Scalasca 2.x, Vampir, TAU, Periscope. Common data formats improve tool interoperability.
    http://www.score-p.org
    Scalasca
    Collection of trace-based performance analysis tools. Specifically designed for large-scale systems. Unique features: Sc
22. [Screenshot residue: context view fields Interval End, Duration, Source File, Source Line]
    Vampir: Execution Statistics
    Aggregated profiling information: execution time, number of calls, inclusive/exclusive.
    Available for all / any group (activity) or all routines (symbols).
    Available for any part of the trace, selectable through time line diagram.
    [Screenshot: Function Summary, All Processes, Accumulated Exclusive Time per Function Group; groups MPI, DYN, PHYS; listed values 39.185731 s, 135.259444 s, 73.474569 s]
    [Screenshot: Function Summary, All Processes, Accumulated Exclusive Time per Function for a WRF trace; functions include solve_em_, the radiation driver, MPI_Wait, the microphysics driver and further module_* routines]
23. fication of behavior & quantification of significance. Scalable identification of the critical execution path.
    [Diagram: low-level event trace, analysis, high-level result (property, call path, location)]
    Advantages: Guaranteed to cover the entire event trace. Quicker than manual/visual trace analysis. Helps to identify hot-spots for in-depth manual analysis.
    Trace Generation & Analysis w/ Scalasca
    Enable trace collection & analysis using the "-t" option of "scalasca -analyze":
    ## In the job script ##
    module load UNITE scalasca
    export SCOREP_TOTAL_MEMORY=120MB   # Consult score report
    scalasca -analyze -f filter.txt -t \
    runjob --ranks-per-node P --np n --exe myprog
    ATTENTION: Traces can quickly become extremely large! Remember to use proper filtering, selective instrumentation, and Score-P memory specification. Before flooding the file system, ask us for assistance!
    Scalasca Trace Analysis Example
    [Screenshot: CUBE 4.3.0 showing scorep_bt-mz_C_64x32_trace/trace.cubex with metric tree, call tree, flat view, system tree and box plot tabs; OMP metrics Flush, Management, Synchronization
24. [Screenshot residue: remaining rows of the TAU ParaProf profile (MPI_Finalize, MPI_Isend, MPI_Waitall, MPI_Test, MPI_Comm_rank and further HyperbolicLevelIntegrator3 and GriddingAlgorithm3 routines)]
    TAU: Callgraph Profile View
    [Screenshot: ParaProf call graph window; options for box width and box color by Static, Exclusive Value, Inclusive Value, Number of Calls, Number of Subroutines, Inclusive Per Call Value]
    TAU: ParaProf Visualizer
    Height and color indicate different metrics.
    [Screenshot: 3D visualizer with Triangle Mesh, Bar Plot and Scatter Plot modes; height metric and color metric both set to exclusive time; axes Function and Thread; selected function MPI_Barrier]
25. [Screenshot residue: height value 1.2229E8 microseconds, color value 1.2229E8 microseconds; Color Scale, Render, Orientation NW, NE, SE, SW]
    Documentation
    To check latest status: JUQUEEN: use "module avail"; JUROPATEST: use "module spider".
    Websites:
    http://www.fz-juelich.de/ias/jsc/juqueen
    http://www.fz-juelich.de/ias/jsc/juropatest
    User Info: Parallel Debugging; Parallel Performance Analysis.
    http://www.vi-hps.org/training/material
    Performance Tools LiveDVD image. Links to tool websites and documentation. Tutorial slides.
    Support
    For general support: sc@fz-juelich.de
    Tool-specific support via corresponding mailing lists: Score-P: support@score-p.org; Scalasca: scalasca@fz-juelich.de
    Workshops and Trainings: Regular VI-HPS Tuning Workshops (several days; multiple tools, e.g. Score-P, Scalasca, Vampir, TAU; bring your own code), http://www.vi-hps.org/training/tws. JUQUEEN Porting and Tuning Workshop Series.
    Member of the Helmholtz Association
    Appendix: Tool recipes
    STAT Recipe
    Compile and link your program with debug option -g.
    Load modules:
    ssh -X user@juqueen
    module load UNITE stat
    UNITE loaded
    stat/2.1 loaded
    j
26. [Screenshot residue: CUBE trace analysis result for bt-mz. Metric tree entries include Barrier (Explicit, Implicit, Wait at Barrier 3343.76), Task Wait, Critical, Lock API, Ordered, Overhead, Idle threads 3594.87, Visits 8.77e7 occ, Synchronizations 128.00 occ, Communications 3.67e5 occ, Bytes transferred 9.57e9 bytes, MPI file operations, Delay costs 6582.92 sec, Wait states (propagating vs. terminal, direct vs. indirect), Critical path, Performance impact 1.52e4 sec, Computational imbalance 2737.14 sec. Call tree entries include initialize, exact_rhs, exch_qbc, z_solve, add, MPI_Barrier, timer_start, MPI_Reduce, print_results, MPI_Finalize. System tree over all 2048 elements; selected: OMP thread 10.]
    Vampir Event Trace Visualizer
    Offline trace visualization for Score-P's OTF2 trace files.
    Visualization of MPI, OpenMP and application events: Al
27. l diagrams highly customizable (through context menus)
  Large variety of displays for ANY part of the trace
http://www.vampir.eu
Advantage:
  Detailed view of dynamic application behavior
Disadvantage:
  Requires event traces (huge amount of data)
  Completely manual analysis

Vampir: Displays
[Screenshot: Vampir Trace View with timeline, function summary (all processes, accumulated exclusive time), process summary, communication matrix view (number of messages), function legend and context view]

Vampir: Timeline Diagram
[Screenshot: Vampir Trace View, master timeline for processes 1 to 10]
Functions organized into groups
co
28. lor by group
Message lines can be colored by tag or size
Information about states, messages, collective and I/O operations available through clicking on the representation
[Screenshot: master timeline for processes 13 to 61; the context view shows function open in function group I/O]

Vampir: Process and Counter Timelines
Process timeline shows call stack nesting
Counter timelines for hardware or software counters
[Screenshot: Vampir Trace View with process timelines for Process 0 and Process 63, counter timelines for MEM_APP_ALLOC and ru_utime, and a context view for function MPI_Wait in function group MPI]
29. o 512 processes with unlimited number of threads

Example Performance Reports

Summary: cp2k.popt is CPU-bound in this configuration
The total wallclock time was spent as follows:
  CPU  56.5%  Time spent running application code. High values are usually good. This is average; check the CPU performance section for optimization advice.
  MPI  43.5%  Time spent in MPI calls. High values are usually bad. This is average; check the MPI breakdown for advice on reducing it.
  I/O   0.0%  Time spent in filesystem I/O. High values are usually bad. This is negligible; there's no need to investigate I/O performance.
This application run was CPU-bound. A breakdown of this time and advice for investigating further is in the CPU section below.

CPU
A breakdown of how the 56.5% total CPU time was spent:
  Scalar numeric ops  27.7%
  Vector numeric ops  11.3%
  Memory accesses     60.9%
  Other                0.0%
The per-core performance is memory-bound. Use a profiler to identify time-consuming loops and check their cache performance.
Little time is spent in vectorized instructions. Check the compiler's vectorization advice to see why key loops could not be vectorized.

I/O
A breakdown of how the 0.0% total I/O time was spent:
  Time in reads         0.0%
  Time in writes        0.0%
  Estimated read rate   0 bytes/s
  Estimated write rate  0 bytes/s
No time is spent in I/O
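The report above nests percentages: the CPU breakdown is given relative to the 56.5% CPU share, not to the total wallclock time. The following minimal Python sketch shows how such nested figures could be rescaled to shares of the whole run. The function names are hypothetical and are not part of any reporting tool; the numbers are the ones quoted in the report.

```python
def share_of_wallclock(category_pct, breakdown_pct):
    """Rescale a within-category breakdown (percent of the category)
    to percent of total wallclock time."""
    return {name: category_pct * pct / 100.0 for name, pct in breakdown_pct.items()}


def dominant(summary_pct):
    """Return the category with the largest share of wallclock time."""
    return max(summary_pct, key=summary_pct.get)


# Figures quoted in the example report.
SUMMARY = {"CPU": 56.5, "MPI": 43.5, "I/O": 0.0}
CPU_BREAKDOWN = {
    "Scalar numeric ops": 27.7,
    "Vector numeric ops": 11.3,
    "Memory accesses": 60.9,
    "Other": 0.0,
}

if __name__ == "__main__":
    print("dominant category:", dominant(SUMMARY))
    for name, pct in share_of_wallclock(SUMMARY["CPU"], CPU_BREAKDOWN).items():
        print(f"{name}: {pct:.1f}% of wallclock")
```

Read this way, memory accesses account for roughly a third of the total run time, which is consistent with the report's advice to look at cache performance first.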
30. [Screenshot: Cube 4.3.0 showing scorep_bt-mz_C_64x32_sum/summary.cubex, continued. Metric tree: Communications, Bytes transferred, MPI file operations, Computational imbalance, Minimum Inclusive Time, Maximum Inclusive Time. Call tree: mpi_setup, MPI_Bcast, env_setup, zone_setup, map_zones, zone_starts, set_constants, initialize, exact_rhs, timer_clear, exch_qbc, adi (compute_rhs, x_solve, y_solve, z_solve, add), MPI_Barrier, timer_start, timer_stop, timer_read, verify, MPI_Reduce, print_results, MPI_Finalize. The OpenMP loop in z_solve is selected.]

Call-path Profile: Example (cont.)
[Screenshot: Cube 4.3.0 showing scorep_bt-mz_C_64x32_sum/summary.cubex with metric tree, call tree, flat view, system tree and box plot tabs; the Time metric is expanded into Execution, MPI and OMP]
31. operations. There's nothing to optimize here.

MPI
Of the 43.5% total time spent in MPI calls:
  Time in collective calls           8.2%
  Time in point-to-point calls      91.8%
  Estimated collective rate         169 Mb/s
  Estimated point-to-point rate     50.6 Mb/s
The point-to-point transfer rate is low. This can be caused by inefficient message sizes, such as many small messages, or by imbalanced workloads causing processes to wait. Use an MPI profiler to identify the problematic calls and ranks.

Memory
Per-process memory usage may also affect scaling:
  Mean process memory usage  82.5 Mb
  Peak process memory usage  89.3 Mb
  Peak node memory usage     7.4%
The peak node memory usage is low. You may be able to reduce the total number of CPU hours used by running with fewer MPI processes and more data on each process.

TAU
Very portable tool set for instrumentation, measurement and analysis of parallel multi-threaded applications
http://tau.uoregon.edu/
Tuning and Analysis Utilities
Supports:
  Various profiling modes and tracing
  Various forms of code instrumentation
  C, C++, Fortran, Java, Python
  MPI, multi-threading (OpenMP, Pthreads)
  Accelerators

TAU: Instrumentation
Flexible instrumentation mechanisms at multiple levels
  Source code: manual, automati
32. path_to_src> <measurement_dir>
6. View results with hpcviewer <hpct_database>

TAU: Recipe
1. Load TAU modules (once per session):
     module load UNITE tau
2. Specify programming model by setting TAU_MAKEFILE to one of $TAU_MF_DIR/Makefile.tau-*
3. Compile and link with:
     tau_cc.sh file.c
     tau_cxx.sh file.cxx
     tau_f90.sh file.f90
4. Execute with real input data. Environment variables control measurement mode:
     TAU_PROFILE, TAU_TRACE, TAU_CALLPATH
5. Examine results with paraprof
33. product(s)        print longer description
  show product(s)   show what settings are performed

Compiler and MPI libraries
JUQUEEN:
  IBM XL C/C++ and Fortran compiler
  GNU C/C++ and Fortran compiler
  Clang C/C++ compiler
  IBM MPI
JUROPATEST:
  Intel C/C++ and Fortran compiler
  GNU C/C++ and Fortran compiler
  Intel MPI
  Parastation MPI

Debuggers

Debugging Tools (status: May 2015)
[Table: availability on JUQUEEN and JUROPATEST of the debugging tools STAT, TotalView debugger, MUST MPI verification tool and DDT debugger; the per-system marks are not legible]

STAT: Stack Trace Analysis Tool
Very lightweight helper tool
Shows merged call tree of whole program
Useful to detect deadlocks
Scales to millions of processes
  http://www.hpcwire.com/hpcwire/2012-12-03/bug_repellent_for_supercomputers_proves_effective.html
Pinpoint individual problems
NOT a real/full debugger
http://www.paradyn.org/STAT/STAT.html

STAT: Main Window
[Screenshot: STAT main window on juqueen1.zam.kfa-juelich.de with File, Edit, View and Help menus and tabs for the loaded call-tree files (cart 0000 2D.dot and further 2D and 3D .dot files)]
34. rement configuration
  Filtering
  Selective instrumentation
Option 1 (recommended): Automatic compiler-based instrumentation
Option 2: Manual instrumentation of interesting phases, routines, loops

Call-path Profile: Recipe
1. Prefix your compile & link commands with
     scorep
2. Prefix your MPI launch command with
     scalasca -analyze
3. After execution, compare overall runtime with uninstrumented run to determine overhead
4. If overhead is too high:
   1. Score measurement using
        scalasca -examine -s scorep_<title>
   2. Prepare filter file
   3. Re-run measurement with filter applied using prefix
        scalasca -analyze -f <filter_file>
5. After execution, examine analysis results using
     scalasca -examine scorep_<title>

Call-path Profile: Example
  module load UNITE scorep scalasca
  scorep mpixlf90 -O3 -qsmp=omp -c foo.f90
  scorep mpixlf90 -O3 -qsmp=omp -c bar.f90
  scorep mpixlf90 -O3 -qsmp=omp -o myprog foo.o bar.o
In the job script:
  module load UNITE scalasca
  scalasca -analyze \
    runjob --ranks-per-node P --np n --exe myprog

Call-path Profile: Example (cont.)
  scalasca -examine -s epik_myprog_Ppnxt_sum
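The recipe above always wraps an existing launch command with a scalasca prefix, optionally adding a filter file. As a rough illustration, the Python sketch below assembles those command lines as argument lists. The helper names are hypothetical; only the scalasca options that appear in the recipe (-analyze, -examine, -s, -f) are used, and the runjob arguments and the experiment directory name are placeholders.

```python
import shlex


def analyze_command(launch_cmd, filter_file=None):
    """Prefix an MPI launch command with 'scalasca -analyze' (recipe steps 2 and 4.3)."""
    cmd = ["scalasca", "-analyze"]
    if filter_file is not None:
        cmd += ["-f", filter_file]
    return cmd + list(launch_cmd)


def examine_command(experiment_dir, score_only=False, filter_file=None):
    """Build 'scalasca -examine' for scoring (step 4.1) or for viewing results (step 5)."""
    cmd = ["scalasca", "-examine"]
    if score_only:
        cmd.append("-s")
    if filter_file is not None:
        cmd += ["-f", filter_file]
    return cmd + [experiment_dir]


if __name__ == "__main__":
    # Placeholder launch line; adapt ranks, process count and executable to the job.
    launch = ["runjob", "--ranks-per-node", "16", "--np", "64", "--exe", "myprog"]
    print(shlex.join(analyze_command(launch)))
    print(shlex.join(analyze_command(launch, filter_file="filter.txt")))
    # "scorep_<title>" stands for the experiment directory created by the measurement.
    print(shlex.join(examine_command("scorep_<title>", score_only=True)))
```

Keeping the commands as lists rather than strings avoids quoting problems if they are later passed to subprocess.run.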
35. [Table residue: tail of the scalasca -examine -s score report, listing the USR regions matvec_sub, lhsinit, binvrhs and exact_solution and one OMP parallel region, with buffer size, visits and time columns]

Call-path Profile: Filtering
In this example, the 6 most frequently called routines are of type USR
These routines contribute around 35% of total time
  However, much of that is most likely measurement overhead
    Frequently executed
    Time-per-visit ratio in the order of a few microseconds
Avoid measurements to reduce the overhead
List routines to be filtered in simple text file

Filtering: Example
  cat filter.txt
  SCOREP_REGION_NAMES_BEGIN
    EXCLUDE
      binvcrhs
      matmul_sub
      matvec_sub
      binvrhs
      lhsinit
      exact_solution
  SCOREP_REGION_NAMES_END
Score-P filtering files support:
  Wildcards (shell globs)
  Blacklisting
  Whitelisting
  Filtering based on filenames

Call-path Profile: Example (cont.)
To verify effect of filter:
  scalasca -examine -s -f filter.txt scorep_myprog_Ppnxt_sum
In the job script:
  module load UNITE scalasca
  scalasca -analyze -f filter.txt \
    runjob --ranks-per-node P --np n
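The filtering slides above select routines that are of type USR, are called very often, and cost only a few microseconds per visit. The Python sketch below applies that rule to per-region statistics and renders the result in the filter-file layout shown in the example. The RegionStats structure, the function names and both thresholds are assumptions made for illustration; the sample numbers are invented and are not taken from the score report.

```python
from dataclasses import dataclass


@dataclass
class RegionStats:
    """Per-region numbers as one might copy them from a score report (hypothetical structure)."""
    name: str
    region_type: str  # e.g. "USR", "MPI", "OMP", "COM"
    visits: int
    time_s: float     # total time attributed to the region, in seconds


def time_per_visit_us(region):
    """Average time per visit in microseconds (infinite if the region was never visited)."""
    if region.visits == 0:
        return float("inf")
    return region.time_s / region.visits * 1e6


def filter_candidates(regions, max_us_per_visit=5.0, min_visits=1_000_000):
    """Names of USR regions that are visited often and are cheap per visit.

    Both thresholds are illustrative defaults, not values prescribed by Score-P.
    """
    return [
        r.name
        for r in regions
        if r.region_type == "USR"
        and r.visits >= min_visits
        and time_per_visit_us(r) <= max_us_per_visit
    ]


def render_filter_file(names):
    """Render a region filter that excludes the given routines."""
    lines = ["SCOREP_REGION_NAMES_BEGIN", "  EXCLUDE"]
    lines += [f"    {name}" for name in names]
    lines.append("SCOREP_REGION_NAMES_END")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Invented sample values, for demonstration only.
    sample = [
        RegionStats("binvcrhs", "USR", 500_000_000, 400.0),
        RegionStats("compute_rhs", "USR", 50_000, 900.0),
        RegionStats("MPI_Waitall", "MPI", 100_000, 30.0),
    ]
    print(render_filter_file(filter_candidates(sample)), end="")
```

Restricting the selection to USR regions matters because MPI and OpenMP regions are needed for the later wait-state analysis, so a cheap-per-visit rule alone would be too aggressive.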
36. uqueen STATGUI
Submit job and attach to it from GUI
CAVEAT: Job needs to be started by login node where GUI is running
  Add the following entry to the submission script:
    #@ requirements = (Machine == "juqueen<n>")
  with <n> being the login node id

STAT: Attaching to a Job
[Screenshot: STAT attach dialog on juqueen2.zam.kfa-juelich.de with tabs Attach, Launch, Serial Attach, Sample Options, Topology and Advanced; the process list is filtered for runjob and shows one runjob process, with buttons Refresh Process List and Attach]

TotalView: Recipe for JUQUEEN
Compile and link your program with debug option: -g
Use absolute paths for source code info: -qfullpath
In case of optimized codes (XL), keep function call parameters: -qkeepparm
Load modules:
  ssh -X user@juqueen
  module load UNITE totalview
    UNITE loaded
    totalview/8.14.0-16-mrnet loaded
  mpixlcxx hello.cpp -qfullpath -qkeepparm -g -o helloworld

TotalView: Interactive Startup
Interactively call the lltv script:
  Creates a LoadLeveler b
