Home
Autotuning Tutorials - ROSE compiler infrastructure
Contents
1. The call site of the outlined function in the source code has to be further trans formed to support empirical optimization These transformations include e using dlopen to open a specified so file containing the outlined target and calling the function found in the file e adding performance measuring code to time the call of the outlined target function e transforming code to support checkpointing the execution right before dlopen opening the library source file so multiple variants of the file can be used to test optimization choices empirically when the program is restarted multiple times 3 4 1 Calling a Function Found in a so File As mentioned earlier the outlined function containing the target kernel is stored in a separated source file which will be transformed into a kernel variant and then compiled to a dynamically loadable library The original source code has 11 FIXME We are working on using liveness analysis to further eliminate unnecessary value assignments to be transformed to open the library file find the outlined function and finally call it using the right parameters An example of the resulting transformation on the function containing the outlined loop is given below header for dlopen etc include lt dlfcn h gt handle to the shared lib file void FunctionLib a prototype of the pointer to a loaded routine void OUT 1 6755 void __out_argv p RN Kn i
2. Use the value provided Harmony to find a kernel variant sprintf nextUnroll OUT__1__6755__ d x unroll printf Trying to find 96s Vn nextUnroll OUT 1 6755 void void dlsym FunctionLib nextUnroll timing the execution of a variant timel time_stamp parameter wrapping is omitted here 21 FIXME This is not an ideal way to tune the application we will explore better alternatives Iv Application smg2000Unroll 1 I5 xj unroll HES ASS TS A135 E 232527 ea E Performance function m liao localhost svnrepos benchmarks smg2000 test 75 fiaoclocalhost svnrepos projects empirical tuning active harmony han Ele Edit View Terminal Tabs Help File Edit View Terminal Tabs Help done with one iteration A updateObsGoodness smg2000Unroll 1 2 13075 5 Trying to find function OUT 1 6755 31 Entered update obsGoondess smg2000Unroll 1 2 Trying to send performance information 10 o done with one iteration isglobal smg2000Unroll 1 obsCoodness islgobal Trying to find function OUT 1 6755 29 Step 2 Trying to send performance information 3 Exit simplex method done with one iteration perf update str Trying to find function OUT 1 6755 29 updateObsGoodness sme2000Unroll 1 3 13077 Trying to send performance information 3 isglobal smg2000Unroll 1 obsGoodness islgobal done with one iteration Timestamp 1
3. 6 1 Manual Code Triage The SMG2000 kernel shown in Fig 5 that is automatically identified by our simple code triage may not be the best tuning target It is actually only a portion of a bigger computation kernel that is very hard to be automatically identified Also the bigger kernel has a very complex form which impedes most compilers or tools for further analyses and transformations With the help from Rich Vuduc who is a former postdoc with us we manually transform the code via code specification to obtain a representative kernel which captures the core computation algorithm of the benchmark We then put the outlining pragma pragma rose outline before the new kernel and invoked the ROSE outliner to separate it out into a new source file as shown below Zinclude autotuning lib h static double timel time2 void OUT 1 6119 void __out_argv typedef int hypre_MPI_Comm typedef int hypre_MPI_Datatype typedef int hypre Index 3 typedef struct hypre Box struct hypre Index imin hypre_Index imax hypre_Box typedef struct hypre BoxArray struct hypre Box boxes int size int alloc size hypre BoxArray typedef struct hypre RankLink struct int rank struct hypre RankLink struct next hypre RankLink typedef hypre RankLink hypre_RankLinkArray 3 3 3 typedef struct hypre BoxNeighbors struct hypre BoxArray boxes int procs int ids int first local int num local int num
4. The work to has a few steps that are likely not so easy to fully automate namely the manipulation of the Makefiles and bash scripts that are required to support the empirical tuning but it is likely this can be simplified in the future The work presented is also immediately focused on providing an infrastruc ture for the empirical tuning and less on the empirical tuning of any specific applications SMG2000 was selected somewhat at random since it is moder ately large and we have demonstrated for many years that we can compile it easily with ROSE 28 8 Appendix This section gives quick instructions for installing some needed software pack ages to establish the end to end autotuning system Please consult individual software releases for detailed installation guidance 8 1 Patching Linux Kernels with perfctr This is required before you can install PAPI on Linux x86 and Linux x86 64 platforms Take PAPI 3 6 2 as an example the steps to patch your kernel are Find the latest perfctr patch which matches your Linux distribution from papi 3 6 2 src perfctr 2 6 x patches For Red Hat Enterprise Linux 5 or CentOS 5 the latest kernel patch is patch kernel 2 6 18 92 el5 redhat Get the Linux kernel source rpm package which matches the perfctr kernel patch found in the previous step You can find kernel source rpm packages from one of the many mirror sites For example wget http altruistic lbl gou mirrors centos 5 2 updates
5. amp 1 build so for the kernel variant To redirect stdout to NULL is required since the search engine looks for stdout for the evaluation result make f makefile lib gt dev null 2 gt amp 1 cp OUT 1 6755 so home liao svnrepos benchmarks smg2000 struct ls generate a program to execute and timing the kernel 18 Handled by ROSE best_time 999999999 0 run the program multiple times for i 1 i lt ITER i i 4 1 do again redirecting is essential for the search engine to grab the desired stdout best time in the end cr restart home liao svnrepos benchmarks smg2000 test dump yy gt dev null 2 gt amp E if ne 0 J then echo Error cr restart finishes abnormally exit 1 else test f tmp peri result if ne 0 then echo Error The temp file storing timing informaiton doesnot exist exit 1 fi time tail 1 tmp peri result cut f 1 best_time echo time best_time awk print_ 1_ lt _ 2 _ _ 1_ _ 2 fi done echo best_time As we can see the evaluation of a kernel variant needs the cooperation of three parties 1 the transformed target application providing a performance measurement timing for the call to the variant 2 the eval_smg script choosing the best execution after several times of exe cution using the same kernel variant 3 the search engine retrieving the information returned f
6. lt sys stat h gt include fcntl h Zinclude libcr h ifndef O LARGEFILE define O LARGEFILE 0 Ffendif static int g_checkpoint_flag 0 extern cr client id t cr Ffendif int hypre SMGResidual void residual_vdata hypre StructMatrix A hypre StructVector x hypre StructVector b hypre StructVector r if BLCR CHECKPOINTING int err cr checkpoint args t cr args cr checkpoint handle t cr handle cr initialize checkpoint args t amp cr args cr args cr scope CR SCOPE PROC checkpoint an entire process cr args cr target 0 checkpoint self cr args cr signal SIGKILL kill self after checkpointing cr args cr fd open dump yy OWRONLY O CREAT O LARGEFILE 0400 if cr args cr fd lt 0 printf Error cannot open file for checkpoiting context Vn abort printf Checkpointing stopping here n err cr request checkpoint amp cr args amp cr handle if err 0 printf cannot request checkpoining _err d n err abort block self until the request is served cr enter cs cr cr leave cs cr printf Checkpointing restarting herew n FunctionLib dlopen OUT 1 6755 so RTLD LAZY ignore the code to find the outlined function and to time the execution here OUT 1 6755 out argv1 1527 timing code ignored here stop after the target finishes its execution exit 0 endif
7. syl 4 A data indices i si const double x out argv 1 0 kk hypre sz3 t kk hypre szl xp O ii jj hypre_ sy2 d kk hypre_ sz2 dxp_s si at end timing end timing double x x out arg hypre StructMatrix rp out argv 20 A int x out argv 2 kk int out argv 3 ji int x out argv 4 ii int x out argv 5 si v 18 1 As we can see the new kernel directly and indirectly depends on some user defined types The ROSE outliner was able to recursively find and copy them into the new file in a right order 6 2 Parameterized ROSE Loop Translators ROSE provides several standalone executable programs loopUnrolling loop Interchange and loopTiling under ROSE INSTALL bin for loop transforma tions So autotuning users can use them via command lines with abstract han dles to create desired kernel variants Detailed information about the parame terized loop translators can be found in Chapter 50 of the ROSE Tutorial These translators use ROSE s internal loop translation interfaces declared within the SageInterface namespace They are bool loop Unrolling SgForStatement loop size_t unrolling_factor This function needs two parameters one for the loop to be unrolled and the other for the unrolling factor bool loopInterchange SgForStatem
8. amp t 6 mm Scope v WALLCLK Experiment Aggregate Metrics Load module home liao svnrepos benchmarks smg2000 test smg2000 Y smg residual c Y hypre SMGResidual smg residual c 152 smg residual c 287 smg residual c 236 smg residual c 238 28e04 100 28e04 99 9 22e04 58 0 ee Ju 22e04 58 0 20e03 04e02 12e02 66e02 6 8 7 5 ome 07 79 67 12 96 87 80 78 72 60 59 59 51 48 46 40 40 Figure 2 Profiling results of SMG2000 using HPCToolkit 29 30 30 07 43 76 06 35 62 84 33 33 33 33 33 33 34 31 31 31 32 32 32 32 36 03 66 06 28 48 66 83 98 12 oo0oo0o000 00000000000 78 67 62 42 36 33 30 29 27 23 22 22 19 18 17 15 15 hypre CyclicReduction home liao smg2000 struct ls cyclic reduction c 1061 8054450 hypre SemiRestrict home liao smg2000 struct ls semi restrict c 262 Q0 8056a8c hypre_SemiInterp home liao smg2000 struct ls semi interp c 294 0 8055d6c hypre CyclicReduction home liao smg2000 struct ls cyclic reduction c 1133 0 8054b2f hypre CyclicReduction home liao smg2000 struct ls cyclic reduction c 912 Q0 8053126 hypre_StructAxpy home liao smg2000 struct mv struct axpy c 69 0 806642c hypre CyclicReduction home liao smg2000 struct ls cyclic reduction c 1002 8053a60 hypre SMGResidual home liao
9. argvl 1527 0 void amp hypre__nz out argv1l__1527__ 1 void amp hypre __ny out argvl1 1527 2 void amp hypre__nx out argvl1 1527 3 void amp hypre sz3 out argvl1 1527 4 void amp hypre__sy3 out argvl1 1527 5 void amp hypre__sx3 out argvl1 1527 6 void amp hypre sz2 out argvl1 1527 7 void amp hypre__sy2 out argvl1 1527 8 void amp hypre sx2 out argvl1 1527 9 void amp hypre__szl out argvl1 1527 10 void amp hypre__syl out argvl1 1527 11 void amp hypre _ _sx1 out argvl1 1527 12 void x amp loopk out argvl1 1527 13 void x amp loopj out argvl1 1527 14 void x amp loopi out argv1__1527__ 15 void x amp rp out argvl 1527 16 void amp xp out argvl1 1527 17 void x amp Ap out argvl1 1527 18 void x amp ri out argvl1 1527 19 void x amp xi out argvl1 1527 20 void x amp Ai OUT 1 6755 out argv1 1527 The outlined function generated from the target loop Saved into a separated file named OUT 1 6755 c void OUT 1 6755 void __out_argv int Ai x int __out_
10. either drop these restrictions or use ROSE to normalize the loops automatically void OUT 1 6755 void __out_argv int Ai x int __out_argv 20 int xi x int __out_argv 19 int ri x int _ _out_argv 18 double Ap double __out_argv 17 double xp double x out argv 16 double rp double x out argv 15 int loopi kx int out argv 14 int loopj int out argv 13 int loopk int out argv 12 int hypre sx1 int out argv 11 int hypre syl int out argv 10 int hypre szl int out argv 9 int hypre sx2 int out argv 8 int hypre sy2 int _ out_argv 7 int hypre sz2 int out argv 6 int hypre sx3 int _out_argv 5 int hypre sy3 int out argv 4 int hypre sz3 int out argv 3 int hypre_nx int _ out_argv 2 int hypre_ny int _ out_argv 1 int hypre nz int out argv 0 for loopk 0 loopk lt hypre nz loopk 1 for loopj 0 loopj lt hypre ny loopj 1 for loopi 0 loopi lt hypre nx loopi 1 BEGIN nestI rp ri Ap Ai xp xi Ai hypre sxl xi hypre sx2 ri hypre sx3 Ai hypre__syl hypre nx hypre sx1 xi hypre sy2 hypre nx hypre sx2 ri hypre sy3 hypre
11. elements int num ghost 6 int global size hypre CommPkg comm_pkg int ref_count hypre_Struct Matrix void OUT 1 6119 void __out_argv 24 static int counter 0 hypre_StructMatrix A int double xrp int int int int int int int int int int int int int int int int ri hypre StructMatrix out argv 20 int x out argv 19 stencil size i din s 18 UL hypre_ hypre szl1 hypre sy2 hypre sz2 hypre sy3 hypre sz3 hypre mx hypre my hypre mz si ii jj kk int sy1 int int int int const double Ap double out argv 18 int _ out_argv 17 out_argv 16 int 15UL out_argv 15 int out argv 14 int out argv 13 int out argv 12 int out argv 11 int out argv 10 int out argv 9 int __out_argv 8 int __out_argv 7 int out argv 6 out_argv 5 out Ern out argv 3 out argv 2 const double x out argv 0 const double xp_0 at_begin_timing begin timing for si 0 si lt stencil_size si for kk 0 kk lt hypre__mz kk for jj 0 jj lt hypre my jj for ii 0 rp ri ii ii lt hypre mx ii jj hypre sy3 Ap O ii jj hypre
12. 14 Only these minimum transformations are needed to build the target applica tion to support BLCR We choose the static linking method to support BLCR as follows The BLCR library is linked with the final executable smg2000 smg2000 0 echo Linking CC o smg2000 smg2000 0 LFLAGS lcr The checkpointed binary has to be executed once to generate a memory dump file process image which can then be reused multiple times to restart the execution immediately before the line where dlopen is invoked to evaluate multiple variants of the optimization target kernel liao localhost test smg2000 n 120 120 120 d 3 current process ID is 6028 Running with these driver parameters nx ny nz 120 120 120 Px Py Pz 1 1 1 bx by bz 1 1 1 cx cy cz 1 000000 1 000000 1 000000 n_pre n_post 1 1 dim 3 solver ID 0 Struct Interface wall clock time 0 550000 seconds cpu clock time 0 540000 seconds Checkpoiting stopping here Killed A context dump file named dump yy will be generated by BLCR as the checkpointing result This dump file will be used to restart the execution using the command cr restart dump yy 4 Empirical Tuning The actual empirical tuning shown in Fig 4 is conducted via the coopera tion of several components a search engine a parameterized transformation tool and the previously introduced checkpointing restarting library The basic idea is th
13. 3077 Stime 13077 Trying to find function OUT 1 6755 30 Entering simplex method Trying to send performance information 3 Entered simplex with step 2 yvalue 3 done with one iteration 32 Trying to find function OUT 1 6755 27 mg2000Unroll 1 simplex points Trying to send performance information 3 0 27 done with one iteration 1 28 Trying to find function OUT 1 6755 28 Trying to send performance information 3 weaweeneee Step 2 32 2 3 12 done with one iteration Compute P Trying to find function OUT 1 6755 30 Setting bundles to 27 25 Trying to send performance information 3 0 sign 1 done with one iteration oints 0 27 Trying to find function OUT 1 6755 28 sign 1 Trying to send performance information 2 points 1 28 done with one iteration sign 1 Trying to find function OUT_1_6755_28 Found equality 27 points 0 27 Trying to send performance information 2 Retry O 1 0 sign 1 done with one iteration Bundles set Trying to find function OUT 1 6755 29 P star 26 rl Trving to send performance information 3 Exit simplex method Figure 6 Searching using Active Harmony OUT 1 6755 out argv1 1527 time2 time stamp int perf int time2 timel 100000 printf Trying to send performance information 96d XVn perf send performance feedback to Harmony server harmony performa
14. A ROSE Based End to End Empirical Tuning System for Whole Applications Draft User Tutorial Chunhua Liao and Dan Quinlan Lawrence Livermore National Laboratory Livermore CA 94550 925 423 2668 office 925 422 6278 fax liao6 quinlan1 lInl gov Project Web Page UCRL Number for ROSE User Manual UCRL SM 210137 DRAFT UCRL Number for ROSE Tutorial UCRL SM 210032 DRAFT UCRL Number for ROSE Source Code UCRL CODE 155962 ROSE User Manual pdf ROSE Tutorial pdt ROSE HTML Reference html only July 8 2013 Contents 1 Introduction 2 1 Using HPCTToolkit 2 2 Usimggprof ss MAN C d WELT URP 3 3 Kernel Extraction Outlining 3 4 Transformations for Autotuning 4 Empirical Tuning 4 1 Parameterized Transformation Tools 4 2 Search Engines ls 4 3 An Example Search 5 Working with Active Harmony 6 User Directed Autotuning 6 1 Manual Code Triage 6 2 Parameterized ROSE Loop Translators 6 3 Connecting to the Search Engine 6 4 Results 8 1 Patching Linux Kernels with perfctr 8 2 Installing BLCR 20 23 23 25 26 28 28 1 Introduction This document describes an on going project aimed for a ROSE 10 based end to end empirical tuning also called autotuning system It is part of an SciDAC PERI 1 project to enable performance portability of DOE applications through an empirical tuning system Our goal is
15. SRPMS kernel 2 6 18 92 1 22 el5 src rpm Install the kernel source rpm package With a root privilege simply type rpm ivh kernel src rpm The command will generate a set of patch files under usr src redhat SOURCES Generate the kernel source tree from the patch files This step may require the rpm build and redhat rpm config packages to be installed first yum install rpm build redhat rpm config with the root privilege cd usr src redhat SPECS rpmbuild bp target i686 kernel 2 6 spec for x86 platforms rpmbuild bp target x86_64 kernel 2 6 spec for x86_64 platforms Copy the kernel source files and create a soft link Type cp a usr src redhat BUILD kernel 2 6 18 linux 2 6 18 1686 usr src ln s usr src linux 2 6 18 1686 usr src linux Now you can patch the kernel source files to support perfctr Type cd usr src linux path to papi 3 6 2 src perfctr 2 6 x update kernel patch 2 6 18 92 e15 redhat Configure your kernel to support hardware counters 29 cd usr src linux make clean make mrproper yum install ncurses devel make menuconfig Enable all items for performance monitoring counters support under the menu item of processor type and features e Build and install your patched kernel make j4 amp amp make modules j4 amp amp make modules install amp amp make install e Configure perfctr as a device that is automatically loaded each time you boot up your machine cd home 1iao6 downlo
16. a function s parameter of a reference type If the variable is not used by its address a temporary clone variable of the same type using TYPE clone can be introduced to substitute its uses within the outlined function The value of a clone has to be initialized properly using clone parameter before the clone participates in computation After the computation the original variable must be set to the clone s final value using parameter clone By doing this many pointer dereferences introduced by the classic algorithm can be avoided For easier interaction with other tools the outlined function is separated into a new source file usually named after the function name The ROSE outliner will recursively find and copy all dependent type declarations into the new file to make it compilable For the SMG2000 example a file named out 1 6755 orig c is generated and it only contains the function body of OUT_1 6755__ This file will be transformed by a parameterized optimizer to a kernel variant named OUT 1 6755 c and further compiled into a dynamically loadable routine A sample makefile to generate the so file is given below liao localhost smg2000 cat makefile lib 1ib QUT__1__6755__ so OUT__1__6755__ 0 0UT__1__6755__ c gcc c fpic OUT__1__6755__ c OUT__1__6755__ so QUT__1__6755__ o gcc shared lc o OUT__1__6755__ so OUT__1__6755__ o clean rm rf OUT__1__6755__ so OUT__1__6755__ o 3 4 Transformations for Autotuning
17. ad papi 3 6 2 src perfctr 2 6 x cp etc perfctr rules etc udev rules d 99 perfctr rules cp etc perfctr rc etc rc d init d perfctr chmod 755 etc rc d init d perfctr sbin chkconfig add perfctr With the kernel patched it is straightforward to install PAPI cd home liao6 download papi 3 6 2 src configure make make test make install with a root privilege 8 2 Installing BLCR BLCR the Berkeley Lab Checkpoint Restart library s installation guide can be found at We complement the guide with some Linux specific information here It is recommended to use a separate build tree to compile the library mkdir buildblcr cd buildblcr explicitly provide the Linux kernel source path blcr 0 8 2 configure with linux usr src linux 2 6 18 1686 make using a root account make install make insmod check Doublecheck kernel module installation You should find two module files blcr imports ko blcr ko ls usr local lib b1cr 2 6 18 prep To configure your system to load BLCR kernel modules at bootup 30 copy the sample service script to the right place cp blcr 0 8 2 etc blcr rc etc init d change the module path inside of it vi etc init d blcr rc module_dir MODULE_DIR module_dir usr local lib blcr 2 6 18 prep run the blcr service each time you boot up your machine chkconfig level 2345 blcr rc on manually start the service error messages like FATAL Module blcr not found can be igno
18. ample target application path APP PATH home 1iao6 svnrepos benchmarks smg2000 KERNEL VARIANT FILE OUT 1 6119 c remove previous variant of the kernel and result bin rm f APP PATH struct ls KERNEL VARIANT FILE tmp peri result APP PATH so so A tiling iy Jo konia first tiling is always needed loopTiling c APP_PATH struct_ls OUT__1__6119__perfectNest c rose loopTiling abstract_handle ForStatement lt numbering 1 gt rose loopTiling depth 4 rose loopTiling tilesize 1 rose output KERNEL_VARIANT_FILE if 1 ne O then loopTiling c KERNEL_VARIANT_FILE rose loopTiling abstract_handle ForStatement lt numbering 1 gt rose loopTiling depth 4 rose loopTiling tilesize 1 rose output KERNEL_VARIANT_FILE loopTiling c KERNEL_VARIANT_FILE rose loopTiling abstract handle ForStatement lt numbering 1 gt rose loopTiling depth 4 rose loopTiling tilesize 1 rose output KERNEL VARIANT FILE 4 interchange si k j i if 1 ne 0 then loopInterchange c KERNEL VARIANT FILE rose output KERNEL VARIANT FILE rose loopInterchange abstract handle ForStatement lt numbering 4 gt rose loopInterchange depth 4 rose loopInterchange order 2 else No tiling happens start from 1 loopInterchange c KERNEL VARIANT FILE rose output KERNEL VARIANT FILE rose loopInterchange abstract handle ForStatement lt numbering 4 gt rose loopInterchange dep
19. argv 20 int xi int __out_argv 19 int ri int __out_argv 18 double Ap double out argv 17 double xp double __out_argv 16 double rp double x out argv 15 int loopi int out argv 14 int loopj int out argv 13 int loopk int out argv 12 int hypre sx1 int out argv 11 int hypre syl int __out_argv 10 int hypre szl int __out_argv 9 int hypre sx2 int __out_argv 8 int hypre sy2 int out argv 7 int hypre sz2 int __out_argv 6 int hypre sx3 int __out_argv 5 int hypre sy3 int out argv 4 int hypre sz3 int __out_argv 3 int hypre nx int out argv 2 int hypre ny int _ out_argv 1 int hypre nz int __out_argv 0 for loopk 0 loopk lt hypre nz loopk for loopj 0 loopj lt hypre ny loopj for loopi 0 loopi lt hypre__nx loopi i rp ri Ap Ai xp xi Ai hypre sxl xi hypre sx2 ri hypre sx3 Ai hypre syl hypre nx hypre sx1 xi hypre sy2 hypre nx hypre sx2 ri hypre sy3 hypre nx hypre sx3 Ai hypre szl hypre__ny hypre syl xi hypre sz2 hypre ny hypre sy2 ri hypre sz3 hypre ny hypre sy3
20. at 1 a set of pre selected transformations and their corresponding configuration ranges are given by the code triage module and converted into an integral search space 2 the search engine evaluates points from the search space by driving the parameterized transformation tool to generate kernel vari ants and restarting the checkpointed binary to run the variants one by one 4 1 Parameterized Transformation Tools Several choices exist to generate kernel variants they include POET 12 CHiLL 6 and the ROSE loop translators We take POET as an example here 15 Kernel Variants 3 Empirical Tuning Target Kernel Compilation 4 Search Space Point evan gt Evaluation Results Figure 4 Phase 3 of the autotuning system POET Parameterized Optimizations for Empirical Tuning developed by Dr Qing Yi under contract with University of Texas at San Antonio UTSA is an efficient language and tool to express hundreds or thousands of complex code transformations and their configurations using a small set of parameters It is especially relevant to the evaluation of large scale search spaces as part of empirical tuning and is orthogonal to any specific search strategy Using command line options and a configuration file users can direct POET to apply a set of specified transformations with desired configurations on selected code portions Also the targe
21. e self time seconds seconds name 35 01 13 08 13 08 hypre SMGResidual home liao smg2000 struct ls smg residual c 289 0 804caa4 9 05 16 46 3 38 hypre CyclicReduction home liao smg2000 struct ls cyclic reduction c 1130 0 8054af4 8 40 19 60 3 14 hypre SMGResidual home liao benchmarks smg2000 struct ls smg residual c 291 804cab9 7 67 22 46 2 87 hypre CyclicReduction home liao smg2000 struct ls cyclic reduction c 910 8053191 5 97 24 70 2 23 hypre CyclicReduction home liao smg2000 struct ls cyclic reduction c 998 8053a28 5 27 26 66 1 97 hypre SMGResidual home liao smg2000 struct ls smg residual c 238 0 804d129 2 86 27 73 1 07 hypre_SMGResidual home liao smg2000 struct ls smg residual c 287 804cacb 2 28 28 59 0 85 hypre CyclicReduction home liaosmg2000 struct ls cyclic reduction c 853 0 8052bae OOOooooocoooooonennwWM File Debug define HYPRE BOX SMP PRIVATE loopk loopi loopj Ai xi ri include hypre box smp forloop h hypre_BoxLoop3For loopi loopj loopk Ai xi ri t Searh Help cyclic reduction c amp smg residual c 23 xp hypre StructVectorBoxData x i hypre BoxO0ffsetDistance x data box stencil shape si hypre BoxGetStrideSize compute box base stride hypre BoxLoop3Begin loop size hpcviewer loop size A data box start base stride Ai X data box start base stride xi r data box start base stride ri A amp
22. eckpointing and Restarting In order to efficiently evaluate hundreds or even thousands of kernel variants we use a checkpointing and restarting method to measure the time spent on calling the kernel without unnecessarily repeating the execution before and after calling the kernel This allows the full context state of the application to be used in the evaluation of the kernel performance The Berkeley Lab Checkpoint Restart BLCR library 2 was selected for its programmability portability and robustness BLCR is designed to support asynchronous checkpointing which means a running process is notified by an other process to be checkpointed first but the exact checkpointing action hap pens on an indeterministic point later on This default behavior is not desired for our purpose since we want an accurate source position to do the checkpoint ing Fortunately BLCR s library does provide some internal functions to help us initiate synchronous immediate checkpointing though not well documented After some trial and error rounds we use the following code transformation to have a synchronous checkpointing using the blcr 0 8 2 release The idea is to have a running program to notify itself to initiate a checkpointing The program is then blocked until the request is served To support BLCR we have transformed the original source code in two locations The first location is the file where the main function exists We add the necessary header
23. ent loop size_t depth size_t lexico Order The loop interchange function has three parameters the first one to specify a loop which starts a perfectly nested loop and is to be 25 interchanged the 2nd for the depth of the loop nest to be interchanged and finally the lexicographical order for the permutation e bool loopTiling SgForStatement loopNest size_t targetLevel size_t tile Size The loop tiling interface needs to know the loop nest to be tiled which loop level to tile and the tile size for the level For efficiency concerns these functions only perform the specified translations without doing any legitimacy check It is up to the users to make sure the transformations won t generate wrong code Example command lines using the programs are given below unroll a for statement 5 times The loop is a statement at line 6 within an input file loopUnrolling c inputloopUnrolling C rose loopunroll abstract handle Statement lt position 6 gt rose loopunroll factor 5 interchange a loop nest starting from the first loop within the input file interchange depth is 4 and the lexicographical order is 1 swap the innermost two levels loopInterchange c inputloopInterchange C rose loopInterchange abstract handle ForStatement lt numbering 1 gt rose loopInterchange depth 4 rose loopInterchange order 1 tile the loop with a depth of 3 within the first loop of the input file tile size is 5 loopT
24. erformance 2 Code Triage amp Transformation JL Annotated AST Dynamically Checkpointed Loadable Routine Binary Figure 3 Phase 1 and 2 of the autotuning system 3 1 Invoking Code Triage The source code for code triage is located in rose projects auto Tuning auto Tuning C It already has initial implementation to call ROSE s tool interface conduct simple code triage and finally extract kernels using ROSE s AST out liner With the input application and its performance result available users can invoke the ROSE based code triage by using the following command autoTuning c jacobi c rose hpct prof jacobi raw xml rose autotuning triage threshold 0 8 rose outline output path tests The command above provides an input source file and its corresponding XML format performance data generated by HPCToolkit It asks the code triage program to find the most time consuming loops which account for just above 80 of the total execution time The identified loops will be automatically extracted to separated source files and saved into an output path named tests Users can also enable code triage only without calling outlining The per formance data can come from GNU gprof An example is given below example command line to perform code triage only autoTuning c jacobi c rose autotuning triage only rose gprof linebyline jacobi gprof txt the output is a list of abstract handles and their corresponding execution ti
25. files and initialization codes for using BLCR For example Code addition in the file containing main if BLCR CHECKPOINTING checkpointing code Zinclude libcr h include lt sys types h gt include lt unistd h gt include lt sys stat h gt include lt fcntl h gt Zinclude libcr h ifndef OLARGEFILE define O LARGEFILE 0 Ffendif client handle used across source files cr _client_id_t cr Ffendif int main int argc char xargv if BLCR CHECKPOINTING initialize the blcr environment cr cr init cri info init Ztendif 13 FIXME need to discuss the drawbacks e g missing cache warmup and how that might be addressed in the future by pushing back the checkpoint start and transformations to the checkpointed program later The second place is the actual source line to initiate the checkpointing A checkpoint argument structure is filled out first to customize the behavior we want including the scope target memory dump file consequence and so on A blocking phase is put right after crrequest_checkpoint to have an immediate checkpointing Our goal is to stop the executable right before opening the so file so a different kernel variant can be compiled into the so file each time The execution will be restarted multiple times so multiple variants can be evaluated this way if BLCR CHECKPOINTING include lt sys types h gt include lt unistd h gt include
26. iling c inputloopTiling C rose loopTiling abstract handle ForStatement lt numbering 1 gt rose loopTiling depth 3 rose loopTiling tilesize 5 6 3 Connecting to the Search Engine We applied several standard loop optimizations to the new kernel They are in the actual order applied loop tiling on i j and k levels each level has a same tiling size from 0 to 55 and a stride of 5 loop interchange of i j k and si levels with a lexicographical permutation order ranging from 0 to 4 1 and finally loop unrolling on the innermost loop only For all optimizations a parameter value of 0 means no such optimization is applied So the total search space has 14400 12 x 4 x 50 points The bash script used by the GCO search engine to conduct point evaluation is given blow Note that a transformation command needs to consider previous transformations side effects on the kernel We also extended GCO to accept strides using ST in the script for dimensions of search space cat eval smg combined bin bash DIM 3 LB 000 UB 55 23 49 ST 51 2 number of executions to find the best result for this variant ITER 3 command line validation should have x parameters when calling this script x number of dimensions for each point if 3 then echo Fatal error not enough parameters are provided for all search dimensions exit fi 26 convert points to transformation parameters Not necessary in this ex
27. ing the metric match process When neces sary all performance metrics are also propagated from statement levels to loop function and file levels Similarly it also accepts the line by line performance data generated by GNU gprof Detailed information about ROSE HPCToolKit can be found in Chapter 44 of the ROSE Tutorial The code triage program uses the following code to invoke ROSE HPCT int main int argc char argv vector lt string gt argvList argv argvtargc Read into the XML files RoseHPCT ProgramTreeList_t profiles RoseHPCT loadHPCTProfiles argvList Create the AST tree SgProject project frontend argvList Attach metrics to AST last parameter is for verbose RoseHPCT attachMetrics profiles project project gt get_verbose gt 0 3 3 Kernel Extraction Outlining Each of the identified tuning targets often loops will be extracted from the original code to form separated functions or routines The ROSE AST out liner is invoked to generate such functions This kernel extraction step can be automatically invoked by the code triage program or manually initiated by users via the outliner s command line interface The ROSE AST outliner handles multiple input languages including C C and recently Fortran It also provides both command line and program ming interfaces for users to specify the targets to be outlined Detailed infor mation of using the ROSE outliner can be found in Chapte
28. int _ out_argv 0 hypre nz int x out argv 1 hypre__ny int x out argv 2 hypre__nx int _ out_argv 3 hypre sz3 int _ out_argv 4 hypre sy3 int _ out_argv 5 hypre sx3 int x out argv 6 hypre sz2 int x out argv 7 hypre sy2 int x out argv 8 hypre sx2 int x out argv 9 hypre sz1 int x out argv 10 hypre syl int x out argv 11 hypre sxl1 int _ out_argv 12 loopk int out argv 13 loopj int _ out_argv 14 loopi double _ out_argv 15 rp double x out argv 16 xp double x out argv 17 Ap int out argv 18 ri int x out argv 19 xi int _ out_argv 20 Ai The ROSE outliner uses a variable cloning method to avoid using pointer dereferencing within the outlined computation kernels The method is based 10 on usage of a variable that is passed by reference and accessed via pointer dereferencing by classic outlining algorithms Such a variable is used either by value or by address within an outlining target For the C language using the variable by address occurs when the address operator is used with the variable e g amp X C introduces one more way of using the variable by address associating the variable with a reference type TYPE amp Y X or using the variable as an argument for
29. ion does not exist exit 1 fi time tail 1 tmp peri result cut f 1 select the smaller one 27 best time echo time best time awk print 1 lt 2 1 2 fi remove the temp file the best time is kept by the script already bin rm f tmp peri result done report the evaluation result to the search engine echo best_time 6 4 Results The new SMG2000 kernel is invoked thousands of times during a typical execu tion So instead of using checkpointing restarting we used a counter to reduce the point evaluation time The counter was set to be 1600 which means the execution is terminated once the kernel is be called 1600 times By doing this an exhaustive search using GCO became feasible within 40 hours for an input data size of 120 x 120 x 120 The best performance was achieved at point 0 8 0 which means loop in terchange using the lexicographical number 8 corresponding to an order of k j si i improved the performance while tiling and unrolling did not help at all The best searched point achieved a 1 43x speedup for the kernel 1 18 for the whole benchmark when compared to the execution time using Intel C C compiler v 9 1 045 with option O2 on the Dell T5400 workstation 7 Summary The work presented is ongoing work and focused on the whole program empir ical tuning and the automation of all the required steps to make that work for realistic HPC applications in C C and Fortran
30. is needed for the tool to associate performance metrics with source language constructs After installation a typical session of using HPCToolkit is given below Prepare the executable with debugging information gcc g smg2000 c o smg2000 Sample one or more events for the execution use wall clock here hpcrun e WALLCLK smg2000 n 120 120 120 d 3 Convert the profiling result into a XML format hpcproftt p D home liao6 svnrepos benchmarks smg2000 smg2000 smg2000 WALLCLK tux268 11nl gov 10676 0x0 gt result xml FIXME Fig 2 shows the profiling results of SMG2000 using HPCToolkit A state TODO update the text when the ment in a loop takes more than 4696 execution time which makes the loop latest veleabe af dominant most expensive loop of the entire program HPCToolkit works on 32 bit platforms 2 2 Using gprof GNU gprof can generate line by line performance information for an application A typical session to generate such information is given below liao codes gcc g seq pi c pg liao codes a out liao codes gprof 1 L a out gmon out amp gt profile result The option 1 tells gprof to output line by line profiling information L causes gprof to output full file path information which is needed for ROSE to accurately match performance data to source code An excerpt of an output file for smg2000 looks like the following Flat profile Each sample counts as 0 01 seconds cumulativ
31. lease 5 3 Tikanga Linux x86 kernel 2 6 18 92 el5 perfctr SMP PREEMPT PAPI 3 6 2 the Rice HPCToolkit version TRUNK 4 9 0 1280 the latest release does not support the 32 bit machine we use so we use an older version ROSE compiler version 0 9 4a revision 6701 providing a tool interface outliner loop translator etc Berkeley Lab Checkpointing and Restarting library V 0 8 2 POET from Univ Texas San Antonio We use its latest CVS version actually not sure the exact release version number the GCO search engine from Univ of Tennessee UTK We got a package from UTK directly not sure the release number a C version jacobi iteration program used as a simple example input code for autotuning the SMG2000 Semicoarsening Multigrid Solver benchmark from the ASCI Purple used as an example real application 2 Preparation The preparation phase provides basic information about a target application s performance characteristics Such information can be obtained by many perfor mance tools Currently we accept performance data generated by both HPC Toolkit and GNU gprof 2 1 Using HPCToolkit The HPCToolkit 3 developed at the Rice University is an open source profile based performance analysis tool which samples the executions of optimized ap plications No code instrumentation is needed to use HPCToolkit But debug ging information by compiling with the g option if GCC is used in the binary executables
32. led to be a binary executable This binary executable calls the outlined routine collects performance metrics for the call Optionally a checkpointing restarting library can be used to shorten the execution by stopping checkpointing and restarting immediately before calling the outlined function Empirical tuning The final phase does the actual empirical tuning First of all the poten tially beneficial optimizations and their corresponding configurations are represented as points within an integral search space which can be han dled by a search engine The search engine adopts some search policy to evaluate points in the search space and search for a point corresponding a transformation strategy leading to the best performance During this phase multiple versions of the target kernel are generated by a parameterized translatorfrom the searched points and compiled into dynamically loadable library routines The performance of the kernel vari ants are measured one by one as the checkpointed binary is restarted mul tiple times and calls multiple versions of the dynamically loadable library routine We give some details about the system design and the current implementa tion status in the following sections A list of our current hardware software configurations is given below A Dell Precision T5400 workstation with two sockets each a 3 16 GHz quad core Intel Xeon X5460 processor and total 8 GB memory Red Hat Enterprise Linux re
33. me percentage The abstract handles for hot statements exceeding the threshold are Project lt numbering 1 gt SourceFile lt name home liao6 jacobi c gt ExprStatement lt position 193 9 194 76 gt 0 382 Project lt numbering 1 gt SourceFile lt name home liao6 jacobi c gt ExprStatement lt position 196 9 196 45 gt 0 3643 Project lt numbering 1 gt SourceFile lt name home liao6 jacobi c gt ExprStatement lt position 188 9 188 29 gt 0 11 The abstract handles for enclosing loops for hot statements exceeding the threshold are Project lt numbering 1 gt SourceFile lt name home liao6 jacobi c gt ForStatement lt position 190 5 198 7 gt 0 8189 The above example command identifies a list of the most time consuming statements and loops and reports them using abstract handles The report will end once the sum of execution time of the statements or loops reach or exceed a preset threshold default is 75 of the total execution time We explain some details for the implementation of code triage and autotuning related transformations in the following subsections 3 2 Tool Interface ROSE has a performance tool interface called ROSE HPCT in its distribution to accept performance results generated by external performance tools Basi cally it reads in the XML files generated from HPCToolkit and attaches perfor mance metrics to the ROSE AST representing the corresponding source code It can handle macro expansions dur
34. nce update perf get the next point of unroll from the server harmony request all printf done with one iteration Vn f Figure 6 shows a screen shot of using Active Harmony to tuning the SMG2000 benchmark The top panel gives a graphical representation of the search space one dimension named unroll and the tuning timeline with evalu ation values x axis is time y axis is evaluation values The left bottom shell window is the client application being tuned The right bottom shell windows shows the activities of the server We have found that online runtime search engines like the Active Harmony can be extremely expensive if the tuned kernel are invoked thousands of times For SMG2000 it took hours to finish the tuning using a 120x120x120 input data set The major reason is that for each call of the kernel a bidirectional communication between the client application and the server has to be finished Another reason is that the current tuning process is embedded into the thou sands of calls of the kernel so that points are always evaluated even when some of them have already been evaluated before 22 6 User Directed Autotuning While fully automated autotuning can be feasible in some cases many appli cations need users expertise to obtain significant performance improvements We revisit the SMG2000 benchmark in this section to see how one can use our autotuning system to have a user directed empirical tuning process
35. nt hypre SMGhResidual void residual vdata hypre StructMatrix A hypre StructVector x hypre StructVector xb hypre StructVector xr JO Bete an Pointer to error string const char xdlError FunctionLib dlopen OUT 1 6755 so RTLD LAZY dlError dlerror if dlError printf cannot open so file Vn exit 1 Find the first loaded function OUT 1 6755 dlsym FunctionLib OUT 1 6755 dlError dlerror if dlError printf cannot find OUT 1 6755 Vn exit 1 Call the outlined function by the found function pointer parameter wrapping statements are omitted here OUT 1 6755 out argv1 1527 3 4 2 Timing the Call The call to the outlining target kernel should be timed to get the evaluation results during the empirical optimization We instrument the call as follows to get the performance evaluation of a kernel variant More accurate and less intrusive methods based on hardware counters can also be used in the future save timing information into a temporary file remove tmp peri result timei time stampO parameter wrapping statements are omitted here OUT 1 6755 out argvi 1527 time2 time_stamp FILE pfile pfile fopen tmp peri result a if pfile NULL 1 fprintf pfile fNn time2 timei fclose pfile 12 else printf Fatal error Cannot open tmp peri result n 3 4 3 Ch
36. nx hypre sx3 H END nestI Nest Ai hypre szl hypre ny hypre syl xi hypre sz2 hypre ny hypre sy2 ri hypre sz3 hypre ny hypre sy3 some code omitted here Figure 5 the SMG 2000 kernel 4 2 Search Engines Currently we adopt the GCO Generic Code Optimization search engine 13 from University of Tennessee at Knoxville as the external search engine used in our system It has been connected with a specific version of POET not yet fully updated to the latest POET release unfortunately to explore code transformations using several popular search policies such as random search exhaustive search simulated anneal search genetic algorithm and so on The search engine interacts with the parameterized optimization tool POET via a bash script usually named as eval_xxx where xxx indicates the target application This script is manually generated currently and does the 17 FIXME TODO The generation of pt files it not yet automated currently following tasks 1 specifies the search space s dimension number and lower upper bound for each dimension 2 specifies the number of executions for each evaluation This will help exclude some executions disturbed by system noises 3 validation of the number of command line options for this script the number should match the number of dimensions of the search space so each value corresponding one dimension All the opti
37. obsGoodness 1 5000 predGoodness 1 5000 We don t use BLCR since an online tuning method is used for Ac tive Harmony The code transformation to work with Active Harmony is shown below The basic idea is to call a set of Harmony inter face functions to startup communication with the server harmony startup send a configuration file harmony application setup file define a param eter to be tuned harmony add_variable report performance feedback harmony peformance update and finally request the next set of values of the tunable parameters harmony request all include hclient h Zinclude hsockutil h static int har registration 0 static int unroll NULL int hypre SMGhResidual initialize the search engine for the first call if har registration 0 printf Starting Harmony V n harmony startup 0 printf Sending setup file Vn harmony application setup file harmony tcl printf Adding _a_ harmony variablew unroll int harmony _add _variable smg2000Unroll unroll VAR INT har registration 4 Load the so files char nextUnroll 255 if g execution flag 0 Only load the lib for the first time reuse it later on printf Opening_the_ so_file_ n FunctionLib dlopen unrolled versions so RTLD_LAZY dlError dlerror if dlError printf cannot open so file Vn exit 1 end if flag 0
38. ons together mean a valid search point within the search space 4 converts the search point into transformation parameters understandable by POET Some transformation choices are obtained by interpreting in teger values in a custom way such as the order of loop interchanging 5 generates a kernel variant by invoking POET with right parameters to conduct the corresponding transformations on the target kernel 6 compiles the generated kernel variant into a dynamically loadable shared library file a so file 7 restarts the checkpointed binary to evaluate the kernel variant This step is repeated multiple times as configured and the shortest execution time is reported as the evaluation result for this particular transformation setting a point A full example script for SMG2000 is given below bin bash DIM 1 LB 1 UB 32 number of executions to find the best result for this variant ITER 4 command line validation should have x parameters when calling this script x number of dimensions for each point if 1 J then echo 0 0 exit fi convert points to transformation parameters Not necessary in this example remove previous variant of the kernel and result bin rm f out 1 6755 c tmp peri result generate a variant of the kernel using the transformation parameters home liao download qing POET src pcg punrolll 1 L home liao download qing PO ET lib my pt gt dev null 2 gt
39. ony system 11 7 The work is done with generous help from Ananta Tiwari at University of Maryland Active Harmony allows online runtime tuning of application parameters which are critical to the application performance Domain specific knowledge is usually needed to identify those application parameters An active harmony server will be running to manage the values of different application parameters and to conduct the search Applications have to be modified to communicate with the server in order to send performance feedback for one specific set of parameter values current point and get the next set of parameter values next point Currently it supports a search strategy based on the Nelder Mead simplex method 9 For SMG2000 we generate a set of unrolled versions OUT 1 6755 X for the target kernel and treat the function suffix X as the tunable parameter 20 As a result the so file contains all unrolled versions A configuration file is needed for Active Harmony to know the parameters to be tuned and their corresponding ranges The following file is used to specify a tunable parameter named unroll with a range from 1 to 32 obsGoodness observed Goodness or performance and predGoodness predicted goodness are related to a GUI window showing during the execution They do not impact the performance tuning and the workings of the search algorithm harmonyApp smg2000Unroll harmonyBundle unroll int 1 32 1 F
40. periodic hypre RankLinkArray rank_links hypre_BoxNeighbors typedef struct hypre StructGrid struct hypre MPI Comm comm int dim hypre BoxArray boxes 23 int ide hypre BoxNeighbors neighbors int max distance hypre Box xbounding box int local size int global size hypre Index periodic int ref count hypre StructGrid typedef struct hypre StructStencil struct hypre Index xshape int size int max_offset int dim int ref_count hypre StructStencil typedef struct hypre CommTypeEntry struct hypre Index imin hypre Index imax int offset int dim int length array 4 int stride_array 4 hypre CommTypeEntry typedef struct hypre CommType struct hypre CommTypeEntry comm_entries int num entries hypre CommType typedef struct hypre CommPkg struct int num values hypre MPI Comm comm int num sends int num recvs int send procs int recv procs hypre CommType send types hypre CommType recv_types hypre MPI Datatype send mpi types hypre MPI_Datatype recv_mpi _types hypre CommType copy _from_type hypre CommType copy_to_type hypre CommPkg typedef struct hypre StructMatrix struct hypre MPI Comm comm hypre_StructGrid x grid hypre StructStencil user _stencil hypre_StructStencil stencil int num values hypre BoxArray data_space double xdata int data alloced int data size int x x data indices int symmetric int symm
41. point 31 value 49 810719 Checkpointing restarting here Case eval smg 32 Got the evaluation result 49 5221 Checkpointing restarting here Case eval smg 22 Got the evaluation result 49 6475 Checkpointing restarting here Case eval smg 6 Got the evaluation result 51 261 Checkpointing restarting here Case eval smg 30 Got the evaluation result 50 0475 Skipping already visited point 18 value Skipping already visited point 14 value Checkpointing restarting here Case eval smg 4 Got the evaluation result 51 435 Time limit reached 49 701789 49 645038 Random Search Best Result Value 49 522112 Point 32 Total Number of evaluations 13 In the sample search above a one dimension search space loop unrolling factor was examined Within the one minute time limit points were randomly chosen by the search engine and three of them were redundant Apparently the UTK search engine was able to skip redundant evaluations In the end a point 32 had the best value reciprocal of timing which means for the target smg2000 kernel unrolling 32 times generated the best performance Similarly other search policies can be used by replacing random search with exhaustive search anneal search ga search simplex search etc 5 Working with Active Harmony We describe how our end to end empirical tuning framework can be adapted to work with another search engine namely the Active Harm
42. r 27 of the You can also refer to a paper 8 for the algorithm we use to outline kernels For the code triage program the programming interface of the Outliner is used as below recognize options for outliner Dutliner commandLineProcessing argvList SgForStatement target loop findTargetLoop hot node if isOutlineable target loop outline target loop The ROSE AST outliner accepts user options to further guide its internal behavior rose outline parameter wrapper will Wrap all parameters of the outlined function into an array of pointers rose outline new file will separate the outlined function into a new source file to facilitate external tools for further handling rose outline use dlopen Will transform the code to use dlopen and disym to dy namically load the outlined function from a library For the SMG2000 example the most time consuming loop is outlined and a call to it is used to replace the loop in the original code The loop is actually expanded from a macro hypre BoxLoop3For ROSE is able to identify it after a bottom up metrics propagation phase in ROSE HPCT The kernel extraction s result is shown in the following code listing A prototype of the outlined function is inserted at the beginning of the code void OUT 1 6755 void __out_argv The target loop is replaced by a call to the outlined function with parameter wrapping statements void __out_argv1l__1527__ 21 out
43. red etc init d blcr rc restart Unloading BLCR FATAL Module blcr not found FATAL Module blcr_imports not found OK Loading BLCR FATAL Module blcr imports not found FATAL Module blcr not found OK References 1 Performance engineering research institute http www peri scidac 2 Berkeley Lab Checkpoint Restart BLCR https ftg lbl gov CheckpointRestart 2009 3 HPCToolkit http hpctoolkit org 2009 4 POET http www cs utsa edu qingyi POET 2009 5 Peter N Brown Robert D Falgout and Jim E Jones Semicoarsen ing multigrid on distributed memory machines SIAM J Sci Comput 21 5 1823 1834 2000 6 Chun Chen Jacqueline Chame and Mary Hall CHiLL A framework for composing high level loop transformations Technical report USC Com puter Science 2008 7 I H Chung and J K Hollingsworth A case study using automatic perfor mance tuning for large scale scientific programs High Performance Dis tributed Computing 2006 15th IEEE International Symposium on pages 45 56 0 0 2006 8 Chunhua Liao Daniel J Quinlan Richard Vuduc and Thomas Panas Ef fective source to source outlining to support whole program empirical op timization In The 22nd International Workshop on Languages and Com pilers for Parallel Computing 2009 31 9 J A Nelder and R Mead A simplex method for function minimization The Computer Journal 7 4 308 313 January 1965 10 Daniel J Q
44. rom eval_smg as the evaluation result of a variant and proceeding the search accordingly 4 3 An Example Search We use the random search policy of the UTK search engine to demonstrate a sample search process The search engine chooses the maximum evaluation value as the best result by default So a reciprocal of a timing result is indicated by an environment variable GCO SEARCH MODE to be the evaluation result The UTK search engine also accepts a upper time limit for a search session We use 1 minute in this example by adding 1 as the last parameter liao localhost smg2000 export GCO SEARCH MODE reciprocal liao localhost smg2000 search random search eval_smg 1 Checkpointing restarting here Case eval_smg 12 Got the evaluation result 50 7846 Checkpointing restarting here Case eval_smg 13 Got the evaluation result 49 65 Checkpointing restarting here Case eval_smg 11 Got the evaluation result 50 9502 Checkpointing restarting here Case eval_smg 31 Got the evaluation result 49 8107 Checkpointing restarting here Case eval_smg 15 Got the evaluation result 49 8703 19 Checkpointing restarting here Case eval smg 14 Got the evaluation result 49 645 Checkpointing restarting here Case eval smg 25 Got the evaluation result 50 0551 Checkpointing restarting here Case eval smg 18 Got the evaluation result 49 7018 Skipping already visited
45. smg2000 struct ls smg residual c 236 Q0 804d14b hypre CyclicReduction home liao smg2000 struct ls cyclic reduction c 1063 8054462 hypre SMGAxpy home liao smg2000 struct ls smg axpy c 69 0 8064970 hypre_CycRedSetupCoarseOp home liao smg2000 struct ls cyclic reduction c 369 8051149 hypre_SMGResidual home liao smg2000 struct ls smg residual c 240 0 804d146 hypre StructVectorSetConstantValues home liao smg2000 struct mv struct vector c 578 806fedc hypre SMGSetupInterpUp home liao smg2000 struct ls smg setup interp c 292 0 804ea04 hypre SemiInterp home liao smg2000 struct ls semi interp c 227 0 8055668 hypre CyclicReduction home liao smg2000 struct ls cyclic reduction c 855 Q0 8052bd4 hypre StructMatrixInitializeData home liao smg2000 struct mv struct matrix c 359 80678b0 3 Code Triage and Transformations The second phase shown in Fig 3 includes code triage and a set of code trans formations Code triage relies on a ROSE based tool interface to read in both source files and performance information of the input application It then con ducts various automated or user directed analyses to identify problematic code segments such as loops Finally the identified code segments are extracted also called outlined into separated routines so they can be individually optimized by empirical methods Base Execution Profiling Instrumentation Compilation Application ll 1 Preparation
46. t kernel has to be instrumented to aid POET in the process Detailed POET user instructions can be found at its official website 4 For example the SMG2000 kernel has the following format to support POET The POET configuration file my pt we use to optimize SMG2000 s kernel is shown below In this file loop unrolling is specified to be performed on the target within a source file named out 1 6755 org c The result will be saved inside a file named out 1 6755 c include Cfront code include opt pi lt parameter unrolll type 1 _ default 2 message unroll factor for innermost loop I gt lt trace target nestI gt lt input from OUT __1_ 6755__orig c lt eval INSERT nestI target UnrollLoops factor unrollI nestI Nest body nestI gt output from target to OUT __1__6755__ c gt to target parse FunctionDefn gt A default transformation parameter unrolling factor is also given in the file But this parameter is usually superseded by a command line parameter the following command line specifies unrolling 5 times home liao download qing POET src pcg punrollI 5 L home liao download qing POET lib my pt 16 Chedkpointing gt amp Restarting FIXME TODO the input is manually changed from the kernel generated by the auto Tuning translator POET expects normalized loops with special tags integer loop control variables and operator is not allowed We will discuss with Qing to
47. th 1 rose loopInterchange order 2 fi d unrolling innermost only generate a variant of the kernel using the transformation parameters unrolling the innermost level must redirect to avoid mess up search engine if 1 ne 0 then loopUnrolling c KERNEL VARIANT FILE rose loopunroll abstract handle ForStatement lt numbering 7 gt rose loopunroll factor 3 rose output KERNEL VARIANT FILE gt dev null 2 gt amp 1 else loopUnrolling c KERNEL VARIANT FILE rose loopunroll abstract handle ForStatement lt numbering 4 gt rose loopunroll factor 3 rose output KERNEL VARIANT FILE gt dev null 2 gt amp 1 fi build a so for the kernel variant To redirect stdout to NULL is required since the search engine looks for stdout for the evaluation result make f makefile lib filename OUT__1__6119__ gt dev null 2 gt amp 1 cp OUT__1__6119__ so APP_PATH struct_ls generate a program to execute and timing the kernel Handled by ROSE best time 999999999 0 run the program multiple times for is 1 i lt ITER i i 1 do The tuned kernel will write timing information into tmp peri result APP PATH test smg2000 n 120 120 120 d 3 gt dev null 2 gt amp 1 if ne 0 then echo Error program finishes abnormally exit 1 else test f tmp peri result if ne 0 then echo Error The temp file storing timing informat
48. to incorporate a set of external tools interacting with ROSE based components to support the entire life cycle of automated empirical optimization We strive to keep this document up to date for better communication among project participants This document is not meant to reflect the final design or implementation choices Currently the ROSE based autotuning system shown in Fig 1 is designed to work in three major phases Search Space ROSE based Components Figure 1 A ROSE based end to end autotuning system 1 Preparation The preparation phase uses external performance tools to collect basic performance metrics of a target application 2 Code triage kernel extraction and transformations This phase is carried out by a set of ROSE based modules A ROSE tool interface module reads in both source files of the application and the per formance data to construct an abstract syntax tree AST representation of the input code annotated with performance information Then a code triage module is followed to locate problematic targets e g loops within the application A set of potentially beneficial optimizations and or their configurations for each target is chosen manually for now based on program analysis After that a ROSE AST Outliner extracts a selected target into a stand alone kernel which will in turn be compiled into a dynamically load able library routine The application will be transformed accordingly and compi
49. uinlan et al ROSE compiler project 11 C Tapus I Hsin Chung and J K Hollingsworth Active harmony To wards automated performance tuning Supercomputing ACM IEEE 2002 Conference pages 44 44 Nov 2002 12 Q Yi K Seymour H You R Vuduc and D Quinlan Poet Parameter ized optimizations for empirical tuning In Workshop on Performance Op timization of High Level Languages and Libraries POHLL March 2007 13 H You K Seymour and J Dongarra An effective empirical search method for automatic software tuning Technical report University of Tennessee 2005 Things to Fix in Documentation that might be addressed in the future by pushing back the check point start and transformations to the checkpointed program the auto Tuning translator POET expects normalized loops with special tags integer loop control variables and operator is not allowed We will discuss with Qing to either drop these re alternatives 32
Download Pdf Manuals
Related Search
Related Contents
GV-Keyboard V3 UP SP TEACH LAPROSCOPES AOC L19W898 User's Manual Copyright © All rights reserved.
Failed to retrieve file