An LLVM Back-end for REPLICA: Code Generation for a Multi-core VLIW Processor with Chaining
Contents
[16] R. L. Graham. Bounds for certain multiprocessing anomalies. The Bell System Technical Journal, XLV, November 1966.

[17] J. Keller, C. W. Kessler, and J. Träff. Practical PRAM Programming. Wiley Series on Parallel and Distributed Computing. J. Wiley, 2001.

[18] Christoph Kessler and Andrzej Bednarski. Optimal integrated code generation for clustered VLIW architectures. In Proceedings of the Joint Conference on Languages, Compilers and Tools for Embedded Systems: Software and Compilers for Embedded Systems (LCTES/SCOPES '02), pages 102–111, New York, NY, USA, 2002. ACM.

[19] Chris Lattner. The Architecture of Open Source Applications, chapter 11: LLVM. lulu.com, June 2011. ISBN 978-1-257-63801-7.

[20] Chris Lattner and Vikram Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In Proceedings of the International Symposium on Code Generation and Optimization: Feedback-Directed and Runtime Optimization (CGO '04), page 75, Washington, DC, USA, 2004. IEEE Computer Society.

[21] R. Leupers and P. Marwedel. Time-constrained code compaction for DSPs. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 5(1):112–122, March 1997.

[22] Ming Lin, Zhenyang Yu, Duo Zhang, Yunmin Zhu, Shengyuan Wang, and Yuan Dong. Retargeting the Open64 compiler to PowerPC processor. In International Conference on Embedded Software and Systems Symposia (ICESS Symposia '08), pages 152–157, July 2008.
[Diagram labels: other front-ends, other back-ends]
Figure 2.13: LLVM compiler framework overview.

2.3.1 Clang

Clang is the LLVM-native C/C++/Objective-C front-end. The front-end is responsible for parsing the source code and generating LLVM internal representation for the later stages in the compilation process. The version of Clang used in this project is Clang 3.0.

2.3.2 LLVM passes

LLVM provides interfaces for inserting code analysis and transformation passes into the compilation pipeline. The different interfaces decide on which level the pass will work; this includes modules (file level), functions, loops, basic blocks, and more.

2.3.3 Back-ends

LLVM has several back-ends for generating machine code for different architectures, and for generating C or C++ code. LLVM provides interfaces that need to be implemented and registered with the compiler at runtime for lowering LLVM internal representation to target machine code.

2.3.4 LLVM internal representation

The LLVM internal representation is a complete, strongly typed programming language with a … [19]. Figure 2.15 shows a function that takes two integer arguments and sums them into a global variable sum. The code was generated from the C code in Figure 2.14.

    int sum = 0;
    void func(int x, int y) {
        sum = x + y;
    }

Figure 2.14: Baseline language example.

     1  @sum = common global i32 0, align 4
     2
     3  define void @func(i32 %x, i32 %y) nounwind
…instructions, because, for example, most ALU instructions have several formats but differ only in the instruction name being used. Note that we decided to encode several REPLICA sub-instructions in so-called super-instructions, so that it became easier to map REPLICA sub-instructions to LLVM IR.

     1  multiclass Instr_ALU<string OpcStr, SDNode OpNode> {
     2    let Uses = [A0], Defs = [A0], TSFlags = 0x1000 in
     3    def rr : InstRP<(outs IntRegs:$dst),
     4                    (ins IntRegs:$src1, IntRegs:$src2),
     5                    !strconcat(OpcStr, "0 $src1,$src2 WB$dst,A0"),
     6                    [(set IntRegs:$dst,
     7                          (OpNode IntRegs:$src1, IntRegs:$src2))]>;
     8    let Uses = [O0], Defs = [O0] in
     9    def ri : InstRP<(outs IntRegs:$dst),
    10                    (ins IntRegs:$src1, i32imm:$val),
    11                    !strconcat("OP0 $val ",
    12                               !strconcat(OpcStr, "0 $src1,O0 WB$dst,A0")),
    13                    [(set IntRegs:$dst,
    14                          (OpNode IntRegs:$src1, imm:$val))]>;
    15  }
    16
    17  defm ADD : Instr_ALU<"ADD", add>;

Figure 2.17: REPLICA super-instructions defined in TableGen.

Figure 2.17 defines a multiclass Instr_ALU which contains two variants of an ALU super-instruction: an ALU operation between two registers, and an ALU operation between a register and an immediate. To use this multiclass for an add instruction in these two variants, we use defm on line 17. This creates an ADDrr and an ADDri which follow the definitions on line 3 and line 9, with OpcStr replaced by "ADD" and OpNode replaced by add. As can be seen in Figure 2.17, each TableGen
• Custom: This instruction is too complex for TableGen. Instead, call a function in REPLICATargetLowering that will take care of it.

SelectionDAG optimization

The focus of the earlier passes is to generate legal code; this pass cleans up the code from the earlier passes. It also performs some optimizations on the code generated by earlier passes. An important optimization performed by LLVM at this stage is optimization of the automatically inserted sign/zero-extension code.

SelectionDAG instruction selection

This pass uses our instruction definitions, generated from TableGen, to select REPLICA instructions that match the LLVM internal representation instructions already in the DAG. LLVM uses pattern matching on subtrees of the selection DAG to pick which REPLICA instructions to use. The instruction selection pass uses REPLICAInstrInfo, whose base class was generated from TableGen. The output of the instruction selection pass is a DAG with only target-specific nodes.

SelectionDAG scheduling and formation

In the earlier passes the order between instructions has not been specified, only the dependencies between nodes in the DAG. The scheduling and formation pass in LLVM assigns an order to all instructions and transforms the DAG into a list of instructions. The DAG is destroyed after scheduling and formation is done, and the list of machine instructions is sent to the next pass.

3.1.2 SSA-based optimization

This pass allows for target-specific optimizations while the code is still in SSA form.
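To make the Custom case concrete, the following C++ sketch shows how custom lowering is typically wired up in an LLVM 3.0-era target (modeled on in-tree back-ends such as Sparc). The REPLICA class name comes from the file list in Chapter 3, but the body and the LowerGlobalAddress helper are illustrative assumptions, not the thesis implementation:

    // In the REPLICATargetLowering constructor, DAG nodes that TableGen
    // patterns cannot handle are marked Custom; during legalization LLVM
    // then calls LowerOperation for every such node.
    setOperationAction(ISD::GlobalAddress, MVT::i32, Custom);

    SDValue
    REPLICATargetLowering::LowerOperation(SDValue Op,
                                          SelectionDAG &DAG) const {
      switch (Op.getOpcode()) {
      case ISD::GlobalAddress:
        // Hypothetical helper: add the thread-private offset for private
        // globals, or return the plain label for shared/program memory.
        return LowerGlobalAddress(Op, DAG);
      default:
        llvm_unreachable("Unhandled node in LowerOperation");
      }
    }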
J. D. Ullman. Compilers: Principles, Techniques, & Tools. Pearson International Edition. Pearson Addison Wesley, 2nd edition, 2007.

[10] Mattias V. Eriksson, Oskar Skoog, and Christoph W. Kessler. Optimal vs. heuristic integrated code generation for clustered VLIW architectures. In Proceedings of the 11th International Workshop on Software & Compilers for Embedded Systems (SCOPES '08), pages 11–20, New York, NY, USA, 2008. ACM.

[11] J. A. Fisher, P. Faraboschi, and C. Young. Embedded Computing: A VLIW Approach to Architecture, Compilers and Tools. Morgan Kaufmann, 2005.

[12] M. Forsell. E, a language for thread-level parallel programming on synchronous shared memory NOCs. WSEAS Transactions on Computers, 3, July 2004.

[13] M. Forsell. A PRAM-NUMA model of computation for addressing low-TLP workloads. In Parallel & Distributed Processing Workshops and PhD Forum (IPDPSW), 2010 IEEE International Symposium on, pages 1–8, April 2010.

[14] Martti Forsell. Using parallel slackness for extracting ILP from sequential threads. In Proceedings of the SSGRR-2003s International Conference on Advances in Infrastructure for Electronic Business, Education, Science, Medicine and Mobile Technologies on the Internet, July 2003.

[15] Martti Forsell. Parallel and Distributed Computing, chapter 3: TOTAL ECLIPSE, An Efficient Architectural Realization of the Parallel Random Access Machine, pages 39–64. InTech, 2010.
MULn Xx Xy      Multiply Xx by Xy in ALU n
MULUn Xx Xy     Multiply Xx by Xy in ALU n, unsigned
DIVn Xx Xy      Divide Xx by Xy in ALU n
DIVUn Xx Xy     Divide Xx by Xy in ALU n, unsigned
MODn Xx Xy      Determine Xx modulo Xy in ALU n
MODUn Xx Xy     Determine Xx modulo Xy in ALU n, unsigned
LOGDn Xx        Determine rounddown(log2 Xx) in ALU n
LOGUn Xx        Determine roundup(log2 Xx) in ALU n
SELn Xx Xy      Select Xx or Xy according to the result of the last compare operation (Xx if result = 1, Xy if result = 0)
MAXUn Xx Xy     Determine maximum of Xx and Xy in ALU n, unsigned
MAXn Xx Xy      Determine maximum of Xx and Xy in ALU n
MINUn Xx Xy     Determine minimum of Xx and Xy in ALU n, unsigned
MINn Xx Xy      Determine minimum of Xx and Xy in ALU n
SHRn Xx Xy      Shift right Xx by Xy in ALU n
SHLn Xx Xy      Shift left Xx by Xy in ALU n
SHRAn Xx Xy     Shift right Xx by Xy in ALU n, arithmetic
RORn Xx Xy      Rotate right Xx by Xy in ALU n
ROLn Xx Xy      Rotate left Xx by Xy in ALU n
ANDn Xx Xy      And of Xx and Xy in ALU n
ORn Xx Xy       Or of Xx and Xy in ALU n
XORn Xx Xy      Exclusive or of Xx and Xy in ALU n
ANDNn Xx Xy     And-not of Xx and Xy in ALU n
ORNn Xx Xy      Or-not of Xx and Xy in ALU n
XNORn Xx Xy     Exclusive nor of Xx and Xy in ALU n
CSYNCn Xx       Set up barrier synchronization group Xx in ALU n
SEQn Xx Xy      Set result = 1 if Xx = Xy, else result = 0, in ALU n
The original image used in the image filter benchmarks.

5.3 Threshold image filter

The threshold image filter benchmark was the first larger program that we tried to implement for REPLICA. This benchmark program needs some modifications to the compiler-generated assembler code: we need to add input and output directives manually for the image. An image file in the PPM image format is read with the FILE directive, and the result is written to an output file with the SAVE directive; see Figure 5.4 for how the directives are used in the generated assembler. The benchmark program uses the MMADD multi-prefix instruction to quickly calculate the sum of pixel values in the image; each thread then divides that by the total number of pixels in the image to find the threshold value.

    DATA
    @inImage ALIGN 1
    GLOBAL _inImage_
    _inImage_ FILE im4.ppm
    DATA
    @outImage ALIGN 1
    GLOBAL _outImage_
    _outImage_ SAVE 921654 im4_threshold.ppm
    SPACE 921654

Figure 5.4: Fill memory at _inImage_ with data from a file, and write the memory content at _outImage_ to a file.

Each thread now performs the threshold filter algorithm on its group of pixels, and when the exit trap is triggered the simulator writes the filtered image to the file given in the SAVE directive. Figure 5.3 was used as input, and the produced result can be seen in Figure 5.6.

Result: the speed increase from optimizing the threshold image filter benchmark can be seen below.
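A minimal sketch of the per-thread algorithm just described, written in the style of the baseline language. Everything except _thread_id, _number_of_threads, _synchronize, _exit, and the MMADD multi-prefix sum is invented for the illustration (pixel layout, buffer names, image size, and the exact asm operands are assumptions):

    #include <replica.h>
    #include <types.h>
    #include <sync.h>

    #define NPIXELS 307200            /* assumed image size */

    uint8  _in_[NPIXELS];             /* invented grayscale buffers */
    uint8  _out_[NPIXELS];
    uint32 sum = 0;                   /* shared: all threads add into it */

    int main() {
        uint32 i, threshold;
        /* every thread adds its stride of pixels into 'sum' with the
           MMADD multi-prefix instruction described in the text */
        for (i = _thread_id; i < NPIXELS; i += _number_of_threads)
            asm("MMADD0 %0,%1" : : "r"((uint32)_in_[i]), "r"(&sum));
        _synchronize;
        threshold = sum / NPIXELS;    /* every thread computes the same value */
        for (i = _thread_id; i < NPIXELS; i += _number_of_threads)
            _out_[i] = (_in_[i] > threshold) ? 255 : 0;
        _synchronize;
        _exit;
        return 0;
    }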
3.1.3 Register allocation
3.1.4 Prolog/Epilog code insertion
3.1.5 Late machine code optimizations
3.1.6 Code emission
3.2 Target machine description
3.2.1 Sub-targets
3.3 Registers
3.4 Instructions
3.4.1 Calling conventions
3.5 Assembly printer
3.6 Clang basic target information

4 Optimization
4.1 Instruction splitter
4.2 Dependence DAG
4.3 Optimizations for instruction level parallelism
4.3.1 Data structures and algorithm details
4.3.2 Similarities to task scheduling
4.3.3 Time complexity
4.3.4 Practical example

5 Evaluation
5.1 Parallel max/min value
5.2 Parallel array sum
5.3 Threshold image filter
5.4 Blur image filter
5.5 Discussion

6 Conclusion

7 Future work

A Building and using the compiler

B Assembler Language
B.1 Memory unit sub-instructions
B.2 Write-back sub-instructions
B.3 ALU sub-
…any location in the shared memory; see Section 2.2.1.

2.2 REPLICA

REPLICA is the short name for Removing Performance and Programmability Limitations of Chip Multiprocessor Architectures [24]. The project aims to develop a complete language system, from a high-level programming language to assembler, support libraries, and hardware to run the programs on [24].

2.2.1 REPLICA Language System

The REPLICA language system is to contain a high-level parallel programming language, support libraries, a low-level baseline language, unoptimized MBTAC assembler for a minimal REPLICA configuration, and optimized MBTAC assembler for a specific REPLICA configuration [24]; see Figure 2.1.

[Diagram: REPLICA Language and Support Libraries compile to the Baseline Language and C Libraries, which compile to MBTAC Assembler and finally Optimized MBTAC Assembler]
Figure 2.1: The REPLICA language system.

REPLICA Language

The REPLICA language is supposed to have a C-style syntax with various extensions. It uses the same parallelism style as the e-language [24]; the e-language lets the programmer declare shared and private variables and synchronous control structures, and mix both a synchronous and an asynchronous programming style [12]. The REPLICA language specification is still in development, so it will not be used in this thesis.

Baseline Language

The baseline language is C with e/fork-style parallelism [24]. Code written
…be located in the shared memory space, and a text label will be located in the program memory space. No offset is needed for the shared memory and program memory; here we can use the label directly. If the variable is thread-private, then we must offset the location with the start of the private address space, which is conveniently stored in a register. Source code for this part can be found in the following files:

• REPLICAInstrInfo.td
• REPLICAInstrInfo.h
• REPLICAInstrInfo.cpp
• REPLICAISelLowering.h
• REPLICAISelLowering.cpp
• REPLICAISelectionDAGInfo.h
• REPLICAISelectionDAGInfo.cpp

3.4.1 Calling conventions

The basic calling-convention information is defined with TableGen in REPLICACallingConvention.td. That file defines how many registers we have to pass arguments in, and their types; it also defines what should happen with arguments that do not match the type of the registers used, and what should be done if there are more arguments than registers. We decided to use a calling convention where the first argument to a function is put in register R2 and any subsequent arguments are pushed onto the stack. Arguments on the stack are put in slots with a size of 4 bytes that are 4-byte aligned. The return value of a function is put in register R1. There are three methods that are responsible for handling calls and return values: LowerCall, LowerFormalArguments, and LowerReturn.

• LowerCall checks the number of arguments.
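The TableGen definitions themselves are not reproduced here. Purely as an illustration, the convention described above (first argument in R2, later arguments in 4-byte aligned stack slots, return value in R1) corresponds to a hand-written CCState handler of roughly the following shape; the function name and the RP::R2 register constant are invented for this sketch, and in the real back-end this logic is generated by TableGen:

    // Sketch only: assign the first value to R2, the rest to 4-byte
    // stack slots with 4-byte alignment.
    static bool CC_REPLICA_Sketch(unsigned ValNo, MVT ValVT, MVT LocVT,
                                  CCValAssign::LocInfo LocInfo,
                                  ISD::ArgFlagsTy ArgFlags, CCState &State) {
      if (unsigned Reg = State.AllocateReg(RP::R2)) {
        // R2 was still free: pass this argument in the register.
        State.addLoc(CCValAssign::getReg(ValNo, ValVT, Reg, LocVT, LocInfo));
      } else {
        // R2 already taken: reserve a 4-byte, 4-byte aligned stack slot.
        unsigned Offset = State.AllocateStack(4, 4);
        State.addLoc(CCValAssign::getMem(ValNo, ValVT, Offset, LocVT, LocInfo));
      }
      return false;  // value has been assigned a location
    }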
…by adding 16 to R1; the result is stored in A0 and is used to load a word. The calculated address is then copied to R1, and the loaded word is copied to R2 with a WBn sub-instruction, where n is the destination register. In the next execution step, R2 is multiplied by 2, and the result of that multiplication is stored to the memory address contained in R1. These two lines also show how chaining works, because each line is executed in one step per thread: the result from a previous sub-instruction can thus be used by the next sub-instruction in the same clock cycle.

Multi-prefix instructions are special instructions where several threads can interact with the same memory cell simultaneously. This is achieved by a so-called active memory unit, where the memory unit contains a simple ALU, so that ALU operations can be performed on the active memory cell, i.e. the address we provide to the multi-prefix instruction. An example of a multi-prefix operation is given in Figure 2.5, where all threads add 1 to the memory cell pointed to by R1.

    OP0 1 MADD0 O0 R1

Figure 2.5: An add multi-prefix operation.

There are also assembler directives available, both to control actions by the simulator and for marking code and data sections. The compiler inserts directives where needed in the generated code, except for the FILE, SAVE, DEFAULT, and RANDOM directives, which are left to the user to insert where needed; see Table 2.2 [24].

2.2.2 REPLICA Architecture

REPLICA is a TO
SelectionDAG construction

This pass constructs a DAG from the LLVM internal representation version of the program we are compiling. Most of this is done by the built-in TableGen definitions of LLVM internal representation, but the DAG constructor needs custom handling for some constructs, mainly function calls and returns. This step uses the REPLICATargetLowering class to find out what to do with function calls. REPLICATargetLowering is accessible through the main class of the REPLICA back-end, REPLICATargetMachine; see Figure 3.1.

SelectionDAG legalize types

LLVM internal representation is a very generalized machine, so data types used in LLVM internal representation may not be supported on the target machine. The sub-target class provides LLVM with a TargetData description of what types and sizes are supported on the target architecture. The code generation process uses the target data layout to convert the instructions in the DAG to only use data types supported by our version of REPLICA.

SelectionDAG legalize

This pass converts the DAG to only use instructions supported by the target architecture. This pass uses REPLICATargetLowering to determine if an instruction is legal, or whether it needs to be expanded, promoted, or needs custom handling:

• Expand: LLVM tries to generate simpler instructions that would produce the same result as the expanded instruction.
• Promote: The instruction is promoted to a more general instruction.
…private memory space. For jump addresses and shared data no offset needs to be added. See Figure 2.12 for how the memory is organized according to the compiler.

[Diagram: the shared memory space holds the shared globals, the shared memory space for groups (group shared locals, _shared_heap, _shared_stack); each thread t = 0 … T−1 has a private memory subspace holding its private globals and private locals, with the SP of each thread pointing into it]
Figure 2.12: Memory organization according to the compiler.

2.3 LLVM

LLVM (Low Level Virtual Machine) is a compiler framework that once started as a research project at the University of Illinois [20]. It has since grown to become a full-fledged compiler framework used in industry applications, and now includes front-ends for several languages and back-ends for several architectures [19]. The version of LLVM used in this project is LLVM 3.0.

LLVM front-ends parse and compile source code into LLVM's internal representation [19]. LLVM then does target-independent optimizations on this internal form [19]. It is then the task of the LLVM back-end to lower this internal representation into machine code [19].

[Diagram (Figure 2.13): the Clang front-end (C, C++, Objective-C) and the llvm-gcc front-end (C, C++, Objective-C, Ada, Fortran) feed LLVM, which feeds the Sparc, x86, and REPLICA back-ends]
…unit, a local sequencer, and R registers [24]. Most of the hardware is shared between PRAM mode and NUMA mode [24]. The processor also includes a step cache and a scratchpad to implement concurrent memory access and multi-operations.

• A step cache is an associative memory buffer. In the step cache, data is only valid until the end of an ongoing multi-threaded execution [15].
• Scratchpads are addressable memory buffers used in conjunction with step caches to implement multi-operations [15].

MBTAC has a VLIW-style instruction format, so the compiler has to issue sub-instructions to each functional unit [24]. Chaining allows us to use the result of a sub-instruction as input to the next sub-instruction [24].

Memory organization

REPLICA uses three types of memory modules: local data memory, shared data memory, and instruction data memory [24]. From the compiler's point of view this becomes a program memory space, a shared memory space, and a thread-private memory space. The shared memory module consists of an active memory unit and a standard memory module [24]. The active memory unit contains a simple ALU and a fetcher; this allows us to perform multi-prefix operations [24]. The simulator automatically copies private variables to each thread's private memory space and shared variables to the shared memory space. When executing a program, different offsets have to be used by each thread to access data stored in the thread-
uwtable {
     4  entry:
     5    %x.addr = alloca i32, align 4
     6    %y.addr = alloca i32, align 4
     7    store i32 %x, i32* %x.addr, align 4
     8    store i32 %y, i32* %y.addr, align 4
     9    %0 = load i32* %x.addr, align 4
    10    %1 = load i32* %y.addr, align 4
    11    %add = add nsw i32 %0, %1
    12    store i32 %add, i32* @sum, align 4
    13    ret void
    14  }

Figure 2.15: LLVM IR language.

The program in Figure 2.15 was compiled with -O0, so the generated code lacks most optimizations. The program works as follows:

• Line 1: Declare a global variable @sum, which is of type 32-bit integer (i32).
• Line 3: Declare the function @func, which takes two 32-bit integers as arguments and returns nothing (void).
• Lines 5–6: Allocate space on the stack for two 32-bit integer variables.
• Lines 7–8: Store the function arguments to the stack.
• Lines 9–10: Load the arguments into registers %0 and %1.
• Line 11: Add %0 and %1 into register %add.
• Line 12: Store %add to the global variable @sum.
• Line 13: Return nothing.

LLVM was chosen for this project because it is open source, has a large developer community, and is supported by industry [19]. This guarantees that the LLVM project will continue even if individual developers leave the project. LLVM also has a very modular design and has back-ends for many architectures. It has also earlier been re-targeted to a VLIW architecture [1]. This shows us that it is not impossible to re-target LLVM to a new VLIW-like architecture.
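As a concrete illustration of the pass interfaces mentioned in Section 2.3.2, the classic minimal example is a function-level pass. The sketch below is standard LLVM 3.0-era boilerplate (it is not code from the thesis back-end); it registers an analysis-only pass that counts the IR instructions of every function:

    #include "llvm/Pass.h"
    #include "llvm/Function.h"
    #include "llvm/Support/raw_ostream.h"
    using namespace llvm;

    namespace {
      // Counts the instructions of every function it is run on.
      struct CountInsts : public FunctionPass {
        static char ID;
        CountInsts() : FunctionPass(ID) {}
        virtual bool runOnFunction(Function &F) {
          unsigned N = 0;
          for (Function::iterator BB = F.begin(), E = F.end(); BB != E; ++BB)
            N += BB->size();
          errs() << F.getName() << ": " << N << " IR instructions\n";
          return false;  // analysis only: the IR is left unchanged
        }
      };
    }

    char CountInsts::ID = 0;
    static RegisterPass<CountInsts> X("countinsts",
                                      "Count LLVM IR instructions per function");

Analogous base classes exist for the other levels mentioned above (ModulePass, LoopPass, BasicBlockPass, and, on the machine-code level, MachineFunctionPass, which is what a pre-emit pass like the one in Chapter 4 builds on).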
…when accessing data in a thread's private memory space. The status register holds the result of the latest compare instruction and is used by branch instructions. Source code for this part can be found in the following files:

• REPLICARegisterInfo.td
• REPLICARegisterInfo.h
• REPLICARegisterInfo.cpp
• REPLICAFrameLowering.h
• REPLICAFrameLowering.cpp

3.4 Instructions

The REPLICA instructions are described using TableGen. The description includes output registers, input registers, the assembler string, and the DAG pattern that the instruction should be matched against. If several DAG patterns can be used for the same instruction, then each pattern needs a separate TableGen definition of the instruction. Because of the unusual instruction format of this architecture, we decided to encode several sub-instructions into super-instructions that are matched against DAG patterns. We then later split these super-instructions into the sub-instructions they encode. With this method we avoid much of the custom lowering we otherwise would have needed to correctly schedule all sub-instructions. Some instructions need custom lowering, for example function calls, returns, and global address calculation. Function calls are described in the next section. Global addresses need custom handling because of the architecture's memory organization: a private variable will be located in the thread-private memory space, a shared variable will
For this, the eliminateFrameIndex function is used. Reserved registers is a vector of IntRegs that are used for special purposes; for example, the register holding the stack pointer is a special-purpose register.

Table 3.2: Special-purpose registers. n is the total number of registers; the number inside the parentheses is the number used in our version of the processor.

Register     Back-end internal name   Description
R0           RP::R0                   A constant zero
R29          RP::SP                   Stack pointer
R30          RP::TID                  Thread ID
R31          RP::RA                   Return address
Rn−1 (32)    RP::TPA                  Thread private address space start
Rn (33)      RP::SR                   Status register

Table 3.2 gives a short list of the registers currently in the special-purpose register list. R29, the stack pointer, is used for accessing data stored on the stack, for example local variables, function arguments, or spilled registers. R30 holds the current thread ID; the ID of the current thread can also be found in the built-in variable _thread_id (see Section 2.2.1 for more information about the baseline language and built-in variables). R31 holds the return address of the previously executed jump-and-link instruction (JMPL); the return address register is used when returning from a function call. The special-purpose register holding the starting address of the thread-private memory space varies with the configuration and is used by load and store instructions
…k + (k − 1) + ⋯ + 2 + 1 + 1 + ⋯ + 1. The last part, where the ready list only has length 1, is for scheduling the remainder of the m sub-instructions. Calculating this as an arithmetic series we get k(k + 1)/2 + (m − k), which is O(k² + m). Introducing the check to see if all sub-instructions have been scheduled, we get O(n + k² + m), assuming that k > 0 and m > 0.

• m ≤ k: Again we would have to iterate through k − 1 sub-instructions to schedule the first execution step. At step m, all the m sub-instructions have been scheduled, so there are k − m sub-instructions left on the ready list. To schedule the remaining sub-instructions, the algorithm picks and schedules the first sub-instruction on the ready list and then iterates through the remaining ones, only to find that no more sub-instructions can be scheduled in this execution step. At step k the scheduling is complete. The sum would look something like k + (k − 1) + ⋯ + 1. Calculating this as an arithmetic series we get k(k + 1)/2, which is O(k²). Introducing the check to see if all sub-instructions have been scheduled, we get O(n + k²), assuming that k > 0 and m > 0.

The worst-case time is obtained with k = m; since both are at most n, this is O(n²). In the best case we would be given a basic block such that the first sub-instruction on the ready list can be scheduled in each iteration of the algorithm, and there is only one sub-instruction on the ready list in each iteration.
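To make the loop structure that this analysis counts explicit, here is a self-contained C++ sketch of such a greedy ready-list scheduler. The data structures are invented for the illustration, and it deliberately ignores chaining, the OP/WB special cases, and the delay list of the real algorithm (freed successors here only become ready in the following step):

    #include <vector>
    #include <cstddef>

    struct SubInstr {
      int unit;                        // functional-unit type this needs
      int unscheduledPreds;            // dependencies not yet scheduled
      std::vector<SubInstr*> succs;    // sub-instructions depending on us
    };

    // Greedily packs the ready sub-instructions into execution steps,
    // where each step offers unitsPerStep[u] free slots of unit type u.
    static std::size_t schedule(std::vector<SubInstr*> ready,
                                const std::vector<int> &unitsPerStep) {
      std::size_t steps = 0;
      while (!ready.empty()) {
        ++steps;                                   // open a new execution step
        std::vector<int> freeSlots = unitsPerStep;
        std::vector<SubInstr*> next;
        bool progress = false;
        for (std::size_t i = 0; i < ready.size(); ++i) {
          SubInstr *s = ready[i];
          if (freeSlots[s->unit] > 0) {            // fits in this step
            --freeSlots[s->unit];
            progress = true;
            for (std::size_t j = 0; j < s->succs.size(); ++j)
              if (--s->succs[j]->unscheduledPreds == 0)
                next.push_back(s->succs[j]);       // freed for a later step
          } else {
            next.push_back(s);                     // retry in the next step
          }
        }
        if (!progress) break;  // malformed input: no unit can take anything
        ready.swap(next);
      }
      return steps;
    }

Each outer iteration walks the whole ready list once, which is exactly the k + (k − 1) + ⋯ pattern summed above.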
…64-processor configuration. Table 5.3 contains the execution times for the array sum benchmark. We believe that these results have the same explanation as the parallel max/min value benchmark.

Table 5.3: Running time for the parallel sum benchmark.

                   Unoptimized         Optimized
Configuration      Steps   Cycles      Steps   Cycles     ILP speedup
PRAM T5 4 512      383     196096      313     160256     22.3 %
PRAM T5 16 512     155     79360       133     68096      16.5 %
PRAM T5 64 512     98      50176       88      48128      11.4 %

Table 5.4 contains the speedup for the unoptimized and optimized versions of the parallel array sum benchmark, depending on configuration. The speedup from adding more threads to the program shows a worse speed increase than the parallel max/min value benchmark, probably because the summation algorithm has only one instruction for each thread in the PRAM T5 64 512 configuration.

Table 5.4: Parallel speedup for the parallel sum benchmark.

Configuration      Unoptimized   Optimized
PRAM T5 4 512      1.0           1.0
PRAM T5 16 512     2.47          2.35
PRAM T5 64 512     3.91          3.56

For a graphical comparison of the execution times for the parallel array sum benchmark, see Figure 5.2.

[Bar chart: cycles for the unoptimized and optimized program on the PRAM T5 4 512, PRAM T5 16 512, and PRAM T5 64 512 configurations]
Figure 5.2: The parallel sum benchmark cycles comparison.

Figure 5.3:
[23] M. Lorenz and P. Marwedel. Phase coupled code generation for DSPs using a genetic algorithm. In Design, Automation and Test in Europe Conference and Exhibition, 2004. Proceedings, volume 2, pages 1270–1275, February 2004.

[24] J. M. Mäkelä, E. Hansson, and M. Forsell. REPLICA language specification. Manuscript as of December 8, 2011; to appear in the report series of VTT, 2012.

[25] Steven S. Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann Publishers, 1997.

[26] Masataka Sassa, Toshiharu Nakaya, Masaki Kohama, Takeaki Fukuoka, Masahito Takahashi, and Ikuo Nakata. Static single assignment form in the COINS compiler infrastructure: current status and background. Proceedings of JSSST Workshop on Programming and Application Systems (SPA2003), 2003.

[27] Texas Instruments. TMS320C6455 fixed-point digital signal processor. http://www.ti.com/lit/gpn/tms320c6455. Fetched 2011-09-13.
Figure 2.9: IPSMSimX86 memory contents window.

…executing that line. All threads do not have to be executing the same line of code: as the program runs and the threads become more fragmented, the numbers to the left will decrease and spread out to other lines. The global variables window (see Figure 2.11) is very helpful when checking if a program performs as expected. The memory contents window is quite hard to read, and it is hard to locate the correct location to check the content of a variable. The global variables window, on the other hand, shows the name, address, and content of the global variables, making it perfect for checking the results from a program. The only problem is that, in the case of an array, only the first couple of elements in

[Simulator code window: P0/T0 executing step 15, cycle 7680; the instruction listing from address 00000144 onward, with each line prefixed by the number of threads currently executing it]
…TheREPLICATarget);

Figure 3.3: Register the generic sub-target with the TargetRegistry.

We also need to add our assembler printer as an output method, and we have also modified the raw printing of assembler instructions to handle the possibility of chaining assembler instructions. The two code fragments in Figure 3.4 and Figure 3.5 register our assembler printer and assembler streamer.

    RegisterAsmPrinter<REPLICAAsmPrinter> X(TheREPLICATarget);

Figure 3.4: Register the assembler printer with LLVM's TargetRegistry.

    TargetRegistry::RegisterAsmStreamer(TheREPLICATarget,
                                        createREPLICAAsmStreamer);

Figure 3.5: Register the assembler streamer used by our assembler printer.

3.2.1 Sub-targets

There is currently only one sub-target for the REPLICA target machine, which is the default 32-bit REPLICA machine. This sub-target defines a machine with support for 32-bit addresses; 32-, 16-, and 8-bit integers; and no floating-point support. The data layout definition is just a string returned by the sub-target; see Figure 3.6.

    std::string getDataLayout() const {
      return std::string("E-p:32:32:32-i32:32:32-i16:16:16-i8:8:8");
    }

Figure 3.6: The sub-target's data layout definition.

The E in the data layout string tells LLVM that this architecture is big-endian. The p is the pointer information: the first value is the pointer size, and the next two values are the ABI and preferred alignment.
    D0 O0 R32 WB6 A0 ST0 R3 R6
    WB17 R3
    OP0 THREAD WB7 O0
    OP0 __absolute_thread_id ADD0 O0 R32 WB8 A0 ST0 R30 R8
    OP0 __thread_id ADD0 O0 R32 WB8 A0 ST0 R30 R8
    WB9 R30
    SUB0 R2 R1 WB3 A0
    DIVU0 R3 R7 WB3 A0
    AND0 R3 R4 WB3 A0
    MULU0 R3 R9 WB10 A0
    ADD0 R10 R1 WB10 A0
    ADD0 R10 R3 WB11 A0
    OP0 4 SUB0 R11 O0 WB11 A0
    OP0 _private_space_start ADD0 O0 R32 WB12 A0 ST0 R10 R12
    OP0 _private_space_end ADD0 O0 R32 WB13 A0 ST0 R11 R13
    WB30 R11
    OP0 8 SUB0 R30 O0 WB29 A0
    OP0 __absolute_number_of_threads ADD0 O0 R32 WB14 A0 ST0 R7 R14
    OP0 _number_of_threads ADD0 O0 R32 WB15 A0 ST0 R7 R15
    OP0 __group_id ADD0 O0 R32 WB16 A0 ST0 R17 R16
    ST0 R7 R17
    OP0 __group_table WB18 O0 ST0 R7 R18
    OP0 main JMPL O0

    DATA
    GLOBAL MEMSTART
    MEMSTART
    GLOBAL MEMSIZE
    MEMSIZE
    GLOBAL THREAD
    THREAD
    GLOBAL HEAPEND
    HEAPEND

C.3 pmaxmin

The parallel max/min value program uses REPLICA multi-prefix instructions to speed up the execution.

    #include <replica.h>
    #include <types.h>
    #include <sync.h>

    #define SIZE 32768

    uint32 _array_[SIZE];
    uint32 max = 0;
    uint32 min = 32768;

    int main() {
        uint32 i;
        for (i = _thread_id; i < SIZE; i += _number_of_threads) {
            asm("MMAXU0 %0,%1" : : "r"(_array_[i]), "r"(&max));
            asm("MMINU0 %0,%1" : : "r"(_array_[i]), "r"(&min));
        }
        _synchronize;
        _exit;
        return 0;
    }

C.4 psum

The parallel sum program uses multi-prefix instructions.
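The psum listing itself is cut off in this copy of the thesis. Modeled directly on the pmaxmin program above and on the add multi-prefix operation from Section 2.2, a plausible reconstruction looks like this; the MADD0 operands and the variable names are guesses, not the author's code:

    #include <replica.h>
    #include <types.h>
    #include <sync.h>

    #define SIZE 32768

    uint32 _array_[SIZE];
    uint32 sum = 0;

    int main() {
        uint32 i;
        /* every thread adds its stride of elements into the shared
           cell 'sum' with the multi-prefix add */
        for (i = _thread_id; i < SIZE; i += _number_of_threads)
            asm("MADD0 %0,%1" : : "r"(_array_[i]), "r"(&sum));
        _synchronize;
        _exit;
        return 0;
    }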
Linköping University
Department of Computer and Information Science (Institutionen för datavetenskap)

Master's Thesis

An LLVM Back-end for REPLICA: Code Generation for a Multi-core VLIW Processor with Chaining

by Daniel Åkesson

ISRN: LIU-IDA/LITH-EX-A--12/007--SE
Date: 2012-05-08

Supervisor: Erik Hansson
Examiner: Christoph Kessler

Linköpings universitet, SE-581 83 Linköping, Sweden
URL for electronic version: http://www.ep.liu.se

Titel (Swedish title): Ett
LLVM Back-end för REPLICA: Kodgenerering för en flerkärnig VLIW-processor med kedjade instruktioner

Title: An LLVM Back-end for REPLICA: Code Generation for a Multi-core VLIW Processor with Chaining

Författare (Author): Daniel Åkesson

Abstract

REPLICA is a PRAM-NUMA hybrid architecture with support for instruction-level parallelism as a VLIW architecture. REPLICA can also chain instructions, so that the output from an earlier instruction can be used as input to a later instruction in the same execution step. There are plans in the REPLICA project to develop a new C-based programming language, compilers, and libraries to speed up development of parallel programs. We have developed an LLVM back-end as a part of the REPLICA project that can be used to generate code for the REPLICA architecture. We have also created a simple optimization algorithm to make better use of REPLICA's support for instruction-level parallelism. Some changes to Clang, LLVM's front-end for C/C++/Objective-C, were also necessary so that we could use assembler inlining in our REPLICA programs. We use Clang to compile C code to LLVM's internal representation, and LLVM with our REPLICA back-end to transform LLVM's internal representation into MBTAC¹ assembler.

Nyckelord (Keywords): REPLICA, LLVM, PRAM, compilers, MBTAC

¹MBTAC is the processor in the REPLICA architecture.
SNEn Xx Xy     Set result = 1 if Xx ≠ Xy, else result = 0, in ALU n
SLTn Xx Xy     Set result = 1 if Xx < Xy, else result = 0, in ALU n
SLEn Xx Xy     Set result = 1 if Xx ≤ Xy, else result = 0, in ALU n
SGTn Xx Xy     Set result = 1 if Xx > Xy, else result = 0, in ALU n
SGEn Xx Xy     Set result = 1 if Xx ≥ Xy, else result = 0, in ALU n
SLTUn Xx Xy    Set result = 1 if Xx < Xy (unsigned), else result = 0, in ALU n
SLEUn Xx Xy    Set result = 1 if Xx ≤ Xy (unsigned), else result = 0, in ALU n
SGTUn Xx Xy    Set result = 1 if Xx > Xy (unsigned), else result = 0, in ALU n
SGEUn Xx Xy    Set result = 1 if Xx ≥ Xy (unsigned), else result = 0, in ALU n

B.4 Immediate operand input sub-instructions

OPn d          Input value d into operand n

B.5 Compare unit sub-instructions

SEQ Xx Xy      Set IC if Xx = Xy
SNE Xx Xy      Set IC if Xx ≠ Xy
SLT Xx Xy      Set IC if Xx < Xy
SLE Xx Xy      Set IC if Xx ≤ Xy
SGT Xx Xy      Set IC if Xx > Xy
SGE Xx Xy      Set IC if Xx ≥ Xy
SLTU Xx Xy     Set IC if Xx < Xy (unsigned)
SLEU Xx Xy     Set IC if Xx ≤ Xy (unsigned)
SGTU Xx Xy     Set IC if Xx > Xy (unsigned)
SGEU Xx Xy     Set IC if Xx ≥ Xy (unsigned)

B.6 Sequencer sub-instructions

BEQZ Ox        Branch to Ox if IC equals zero
BNEZ Ox        Branch to Ox if IC not equals zero
JMP Xx         Jump to Xx
JMPL Xx        Jump and link (PC + 1 to register RA)
TRAP Xx        Trap
JOIN Xx        Join all threads to a
NUMA bunch Xx
SPLIT Xx       Split all the current NUMA bunches back to PRAM-mode threads

Appendix C

Benchmark code

C.1 Makefile

This is the makefile we used to compile the benchmark programs:

    CC  = build/bin/clang
    LLC = build/bin/llc
    LD  = build/bin/llvm-link
    LA  = build/bin/llvm-as

    CFLAGS = -I include -S -O0 -emit-llvm \
             -ccc-host-triple replica-unknown-unknown \
             -nostdlib -nodefaultlib -nostdinc
    LLCFLAGS = -march=replica -mcpu=generic -stats -asm-verbose \
               -enable-replica-ilp

    SOURCES = prog.c
    LLVMIR  = $(SOURCES:.c=.s)
    BITCODE = $(LLVMIR:.s=.s.bc)
    LINKIR  = prog.s.bc

    all:
    	$(CC) $(CFLAGS) $(SOURCES)
    	for i in $(LLVMIR); do \
    		$(LA) $$i; \
    	done
    	$(LD) -o $(LINKIR) $(BITCODE)
    	$(LLC) $(LLCFLAGS) $(LINKIR)

C.2 Initialization function

This is the hardcoded assembler function used to initialize the simulator. The function was written by Martti Forsell.

    PROC _ProgramStart
    TEXT
    ALIGN 2
    GLOBAL _ProgramStart
    _ProgramStart
    OP0 MEMSTART WB1 O0
    OP0 MEMSIZE WB2 O0
    ADD0 R1 R2 WB3 A0
    OP0 1 SHR0 R3 O0 WB3 A0
    OP0 _alignmentMask_LF1 ADD0 O0 R32 WB4 A0
    LD0 R4 WB4 M0
    AND0 R3 R4 WB3 A0
    OP0 _shared_space_start ADD0 O0 R32 WB5 A0 ST0 R1 R5
    OP0 HEAPEND WB5 O0
    OP0 __shared_heap ADD0 O0 R32 WB6 A0 ST0 R5 R6
    WB1 R3
    OP0 4 SUB0 R3 O0 WB3 A0
    OP0 _shared_space_end ADD0 O0 R32 WB5 A0 ST0 R3 R5
    OP0 4 SUB0 R3 O0 WB3 A0
    OP0 __shared_stack AD
[Register dump: registers R0 through R33 of one thread, plus the operand registers O0/O1 and IC flags; nearly all values are zero in this example]
Figure 2.8: IPSMSimX86 register contents window.

[Memory dump: DMStart 0D9A1580, DMEnd 119A1580, DMSize 04000000, followed by rows of memory words; the dump continues in Figure 2.9]
Figure 2.10: IPSMSimX86 code window; the marked line is currently being executed by 2048 threads.

…the array are shown; if we want the whole array, we are forced to dump the array variable to a file.

[Global variables window: the name, address, and content of each global variable, e.g. HEAPEND, MEMSIZE, MEMSTART, THREAD, _absolute_number_of_threads, _absolute_thread_id, _alignmentMask_LF1, _group_id, _group_table_, _locals, _number_of_threads, …]
TAL ECLIPSE-based architecture [24] that implements the PRAM-NUMA model of computation [24]. A REPLICA can have a configurable number of processors, threads, and functional units consisting of ALUs and other functional units [24]. A REPLICA has P MBTAC processors; each processor has T threads, F functional units, and R registers. The functional units are A ALUs, M memory units (MU), one compare unit, and one sequencer¹. We have been able to test our compiler on simulated versions of the processor with the configurations given in Table 2.3. We had access to three ideal PRAM configurations and two ECLIPSE configurations. The difference between them is that the ECLIPSE configurations are not …

¹Handles program-flow-altering instructions, like the branch sub-instruction BNEZ.

Table 2.2: REPLICA assembler directives.

ALIGN n        Move the location to the next multiple of n
ASCII str      Place the string str in the data segment
BYTE n         Place an 8-bit byte with value n in the data segment
DATA           Mark the start of a data segment
DEFAULT        Define a storage area in the data segment which is filled with ordinal words [24]
HALF n         Place a 16-bit halfword with value n in the data segment
PROC name      Declare the start of procedure name
ENDPROC name   Declare the end of procedure name
RANDOM         Define a storage area in the data segment which is filled with random words [24]
SPACE n        Reserve n bytes in the data segment
MPMAXUn Xx Xy   Arbitrary multiprefix max, unsigned, Xx to active memory Xy in MU n
MPMINn Xx Xy    Arbitrary multiprefix min Xx to active memory Xy in MU n
MPMINUn Xx Xy   Arbitrary multiprefix min, unsigned, Xx to active memory Xy in MU n
BMADDn Xx Xy    Begin add multiple Xx to active memory Xy in MU n
BMSUBn Xx Xy    Begin subtract multiple Xx to active memory Xy in MU n
BMANDn Xx Xy    Begin and multiple Xx to active memory Xy in MU n
BMORn Xx Xy     Begin or multiple Xx to active memory Xy in MU n
BMMAXn Xx Xy    Begin max multiple Xx to active memory Xy in MU n
BMMAXUn Xx Xy   Begin max, unsigned, multiple Xx to active memory Xy in MU n
BMMINn Xx Xy    Begin min multiple Xx to active memory Xy in MU n
BMMINUn Xx Xy   Begin min, unsigned, multiple Xx to active memory Xy in MU n
EMADDn Xx Xy    End add multiple Xx to active memory Xy in MU n
EMSUBn Xx Xy    End subtract multiple Xx to active memory Xy in MU n
EMANDn Xx Xy    End and multiple Xx to active memory Xy in MU n
EMORn Xx Xy     End or multiple Xx to active memory Xy in MU n
EMMAXn Xx Xy    End max multiple Xx to active memory Xy in MU n
EMMAXUn Xx Xy   End max, unsigned, multiple Xx to active memory Xy in MU n
EMMINn Xx Xy    End min multiple Xx to active memory Xy in MU n
EMMINUn Xx Xy   End min, unsigned, multiple Xx to active memory Xy in MU n
BMPADDn Xx Xy   Begin arbitrary multiprefix add Xx to active memory Xy in MU n
            g += _inImage_data[xt + yt * XSIZE].g;
            b += _inImage_data[xt + yt * XSIZE].b;
        }
        if (yt >= 0 && yt < YSIZE) {
            r += _inImage_data[xt + yt * XSIZE].r;
            g += _inImage_data[xt + yt * XSIZE].g;
            b += _inImage_data[xt + yt * XSIZE].b;
        }
        j = 4 * blur_size + 1;            /* total pixels in the cross */
        _outImage_data[i].r = r / j;
        _outImage_data[i].g = g / j;
        _outImage_data[i].b = b / j;
    }

    int main() {
        unsigned int i;
        unsigned int psum = 0;
        if (_thread_id == 0)
            for (i = 0; i < 54; i++)
                _outImage_meta[i] = _inImage_meta[i];
        _synchronize;
        for (i = _thread_id; i < SIZE; i += _number_of_threads)
            blur(i);
        _synchronize;
        _exit;
        return 0;
    }

Bibliography

[1] EPICOpt project page. http://www.complang.tuwien.ac.at/epicopt/. Fetched 2011-09-13.

[2] GCC Internals Manual. http://gcc.gnu.org/onlinedocs/gccint/. Fetched 2012-02-27.

[3] GCC Manual. http://gcc.gnu.org/onlinedocs/gcc-4.6.2/gcc/. Fetched 2012-02-27.

[4] GitHub repository for LLVM-TMS320C64X. https://github.com/alexjordan/LLVM-TMS320C64X. Fetched 2011-09-02.

[5] LLVM-IA64 SourceForge project. http://sourceforge.net/projects/llvm-ia64/. Fetched 2011-09-13.

[6] LLVM-IA64 wiki. http://llvm-ia64.sourceforge.net/. Fetched 2011-09-13.

[7] ROSE user manual. http://rosecompiler.org/ROSE_UserManual/ROSE-UserManual.pdf. Draft User Manual, Version 0.9.5a, fetched 2011-09-09.

[8] Trimaran 4.0 manual. http://www.trimaran.org/docs/trimaran4_manual.pdf. Fetched 2011-09-09.

[9] A. V. Aho, M. S. Lam, R. Sethi, and
Parallel speedup for the threshold image filter benchmark:

Configuration      Unoptimized   Optimized
PRAM T5 4 512      1.0           1.0
PRAM T5 16 512     3.15          3.23
PRAM T5 64 512     6.84          7.27

Figure 5.5 shows a graphical representation of the execution times for the unoptimized and optimized versions of the program.

[Bar chart: cycles for the unoptimized and optimized program on the PRAM T5 4 512, PRAM T5 16 512, and PRAM T5 64 512 configurations]
Figure 5.5: The threshold benchmark cycles comparison.

Figure 5.6: The resulting image from the threshold benchmark.

5.4 Blur image filter

The blur image filter must load image data in the same way as the threshold image filter. The blur image filter is the most advanced benchmark program that we built for testing the compiler. The filter works by adding each separate color value from the pixels in a cross around the current pixel, and dividing by the total number of pixels added, as in the equation below (an unweighted average), where p(x, y) is a pixel in the filtered image and p_0(x, y) is a pixel in the original image:

    p(x_0, y_0) = \frac{\sum_{x=x_0-s}^{x_0+s} p_0(x, y_0) + \sum_{y=y_0-s}^{y_0+s} p_0(x_0, y) - p_0(x_0, y_0)}{T}

where s is the blur filter size and T is the total number of pixels in the cross. Note that the p_0(x_0, y_0) term is subtracted because each pixel should be added only once to the sum. We set the blur filter size to 3 in our benchmark program, so that there is some notable difference between the original and the filtered image.
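Ignoring the three color channels and the image I/O of the real benchmark (its full source is in Appendix C), the cross average above can be sketched as follows; the grayscale buffers, the image dimensions, and all names are invented for the illustration:

    #define XSIZE 640              /* assumed image dimensions */
    #define YSIZE 480
    #define S 3                    /* blur filter size, as in the benchmark */

    unsigned char in[XSIZE * YSIZE], out[XSIZE * YSIZE];

    void blur(int x0, int y0) {
        int d, sum = 0, count = 0;
        for (d = -S; d <= S; d++) {               /* horizontal arm */
            if (x0 + d >= 0 && x0 + d < XSIZE) {
                sum += in[(x0 + d) + y0 * XSIZE];
                count++;
            }
        }
        for (d = -S; d <= S; d++) {               /* vertical arm */
            /* d != 0 so the center pixel is counted only once */
            if (d != 0 && y0 + d >= 0 && y0 + d < YSIZE) {
                sum += in[x0 + (y0 + d) * XSIZE];
                count++;
            }
        }
        out[x0 + y0 * XSIZE] = sum / count;       /* unweighted average */
    }

In the interior of the image, count equals 4S + 1, matching the T of the equation; at the borders the sketch simply averages over the pixels that exist.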
…and new tasks become available [16]. If we let Graham's processors correspond to our functional units and let tasks correspond to our sub-instructions, then we can compare it to our scheduling approach for REPLICA. REPLICA has several functional units that are assigned sub-instructions by our algorithm. Our sub-instructions are ordered so that a sub-instruction I1 must be executed before sub-instruction I2 if I2 depends on I1. Functional units that are not assigned any sub-instruction can be seen as idle until a sub-instruction that matches that functional unit is found. Graham's processors can execute any task [16], while REPLICA sub-instructions are keyed to a specific type of functional unit: either operator (ALU), memory unit, compare unit, sequencer, or writeback. Hence functional units may be idle even though free sub-instructions are available. MBTAC's chaining capability lets dependent sub-instructions be executed in parallel in most cases; writebacks are an exception, where the sub-instruction dependent on the writeback needs to wait until the next execution step. Graham's tasks, on the other hand, have no chaining capability, so if two tasks are dependent, then the second task will have to wait until the first has been executed before it can be scheduled [16]. The tasks in Graham's list scheduling algorithm do not have the constraint that certain tasks must be executed in parallel [16], as REPLICA's OP sub-instructions have.

4.3.3 Time complexity
…ation, and the children of that node represent the arguments to the operation.

• Semantic analyzer: Uses the syntax tree and symbol table to check that the program is consistent with the language definition.
• Intermediate code generator: Compilers usually generate a machine-like intermediate representation of the program. The intermediate representation should be easy to produce and easy to translate into the target machine.
• Machine-independent code optimizer: In this phase the compiler tries to optimize the intermediate code. The goal is to get better target code.
• Code generator: This step translates the intermediate code into the target language.
• Machine-dependent code optimizer: In this phase the compiler tries to do further optimizations, for example, in our case, scheduling instructions that can be executed in parallel.

We have in this project used the LLVM compiler framework to construct a compiler for REPLICA.

2.1 PRAM Model of Computation

PRAM (Parallel Random Access Machine) is a generalization of the RAM (Random Access Machine) model [17]. Instead of just one processor connected to memory, we have n processors [17]. In each execution step all processors perform one instruction; this could be a memory access, an ALU operation, a comparison, or a branch instruction [17]. Because several processors can perform memory accesses simultaneously, several processors can also try to access the same memory cell simultaneously.
…attached to each processor. MBTAC, the processor in the REPLICA architecture, can have several functional units, so it is a VLIW² architecture. This creates some opportunities for instruction-level parallelism (ILP), i.e. we can run several instructions in parallel. So we have also adapted a simple optimization algorithm for finding instructions that can be executed in parallel, and relocating them so they are.

LLVM, the Low Level Virtual Machine compiler framework, was given as a recommendation to implement a compiler back-end for. We investigated alternatives, but in the end LLVM was chosen, mainly because it is popular, has many users, and has industry support, making it unlikely that development will cease and the project disappear.

First we give some background information on the REPLICA project and LLVM in Chapter 2. Then we describe the implementation in Chapter 3. How the optimization is done is described in Chapter 4. In Chapter 5 we run some benchmark programs and analyze the results. Finally, Chapter 6 contains some general thoughts on the project and possible future work.

¹Project full name: Removing Performance and Programmability Limitations of Chip Multiprocessor Architectures.
²Very Large Instruction Word.

1.1 Motivation

With a high-level programming language it is often easier to write large and/or complex programs than with a target-specific low-level language, for example an assembler language. A compiler is needed to
…available register names for Clang.
• Usable constraints: currently only register constraints (r).
• Built-in defines.

Assembler inlines in Clang follow the same style as GCC, although our back-end is fairly limited in the number of available assembler constraints. The built-in defines can be used by the programmer to check that we are compiling for REPLICA. This can be useful to separate REPLICA-specific code from platform-independent code. The defines are given in the list below and can be used with, for example, #ifdef:

• __replica__
• __ARCH_REPLICA__
• __replica_mbtac__
• __REPLICA_MBTAC__

Source code for this part can be found in the following file:

• <llvm-source-dir>/tools/clang/lib/Basic/Targets.cpp

Chapter 4

Optimization

To handle the configurable number of functional units, we decided to implement an optimization pass that works on the almost-finished machine code. By doing this division we can have the compiler generate code for a minimal configuration by default, and then let the optimization pass handle rescheduling of instructions for specific REPLICA configurations. The optimization algorithm used for rescheduling is based on the virtual ILP algorithm in [14] by M. Forsell. LLVM lets us insert new compiler passes for code transformation and analysis at different stages in the compiler. We added an optimization pass at the pre-emit stage, i.e. just before instruction printing. An advantage with this approach
…continue.

4. Check if the current sub-instruction can be scheduled in the current execution step.
   • Yes: schedule the current sub-instruction in this execution step and go to step 5.
   • No: increment the current sub-instruction to the next free sub-instruction and go to step 1.
5. Was the recently scheduled sub-instruction an OP sub-instruction?
   • Yes: add the following sub-instruction that uses the result from the OP sub-instruction to the priority list. (The resources needed by that sub-instruction were checked when the OP sub-instruction was scheduled.) Go to step 6.
   • No: go to step 6.
6. Get the sub-instructions freed by scheduling the current sub-instruction and add them to the ready list, or, if the current sub-instruction was a WB, to the delay list. Go to step 1.

4.3.2 Similarities to task scheduling

The algorithm is similar to Graham's task scheduling algorithm [16]. In Graham's task scheduling algorithm we have a system with m tasks that should be processed by n processors [16]. There can be dependencies between tasks, so if task T2 is dependent on task T1, then T2 must be processed later than T1 [16]. Tasks are kept on a list ordered according to dependencies, so, for example, T1 would come before T2 on this list [16]. A processor searches the list from beginning to end, looking for a task that can be processed. If no task is found, then the processor becomes idle until another processor finishes
…convert the high-level language into a low-level language. Optimization of the generated code is needed to utilize the potential speedup from the combination of instruction-level parallelism and thread-level parallelism possible on the REPLICA architecture. This is an interesting problem with many solutions; we have chosen a simpler, greedy algorithm for optimizing programs in this project.

Chapter 2

Background

The REPLICA project is developing a configurable emulated shared memory machine architecture [24]. As a proof of concept, an FPGA-based prototype machine, a new programming language, compilation and optimization tools, and sample programs are being developed [24]. The REPLICA language has a C-style syntax with some modifications [24]. Code written in REPLICA's language is first compiled into C, which is in turn compiled into MBTAC assembler [24].

A compiler is a computer program that reads a program in one language and translates it to another language [9]. A compiler usually works in several phases; the number of phases varies between compilers [9]. As an example, [9] uses the following compiler phases:

• Lexical analyzer: Reads the source program and outputs a stream of tokens. When the lexical analyzer encounters, for example, a variable declaration, it stores information such as name and type in the symbol table.
• Syntax analyzer: Reads the tokens and creates a syntax tree, where each interior node in the tree represents an oper-
…created an LLVM back-end for generating code for the REPLICA architecture, a PRAM-NUMA hybrid architecture with the interesting chaining VLIW feature for instructions. Thanks to LLVM's TableGen, repetitive tasks, such as defining the available instructions and how these map to LLVM's internal representation, were made simpler.

Early in the project we decided to create so-called super-instructions to find easier mappings between LLVM's internal representation and REPLICA's sub-instructions. The downside of this was that we would have to split these super-instructions to make any reordering possible, because we could have unused functional units that some later sub-instruction could use with some reordering. Therefore the instruction splitter was added. With the instruction splitter implemented, a dependence DAG needed to be constructed to find the dependencies between sub-instructions, which could be used by the later optimization pass to check which instructions were free to schedule. The optimization algorithm is based on Forsell's virtual ILP algorithm [14] and tries to fill the functional units with any free sub-instruction; a sub-instruction is counted as free if it does not depend on any unscheduled sub-instruction.

We also needed to modify Clang so that the compiler front-end knew about the REPLICA architecture and what data layout and register names it had. The register names were used in the inline assembler that was needed for using multi-prefix instructions.
…the integer registers, and implicitly define and/or use registers from other register classes. This is later addressed by our instruction splitter.

The REPLICARegisterInfo class (see Figure 3.1) implements functions for handling reserved registers, callee-saved registers, handling of pseudo-instructions generated during function-call lowering, and frame index handling.

• Callee-saved registers is a list of registers used by LLVM when emitting prolog/epilog code. If a register in this list is used inside a function, we need to generate code for saving this register in the prolog of the function. We also need to generate code for restoring this register in the epilog of the function.

• Removal of pseudo-instructions is also done by REPLICARegisterInfo. The call frame pseudo-instructions callseq_start and callseq_end were used to group the instructions belonging to a function call together, so that they are handled as a single unit by later optimization passes. These pseudo-instructions also contain the size of the arguments put on the stack; this information would be useful if we were to implement handling of variable-sized arguments. For now we only remove these instructions. This is implemented in eliminateCallFramePseudoInstructions.

• During compilation, instructions using data in a stack slot access that stack slot with an abstract offset (0, 1, 2, etc.), so before emitting assembler we need to replace these abstract offsets with the real offsets.
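The replacement itself follows the same pattern in most LLVM 3.0-era back-ends. The sketch below mirrors the in-tree Sparc version of eliminateFrameIndex; the RP::SP register constant and the operand layout are assumptions, and this is not the thesis source:

    // Rewrite the abstract frame index in MI into "SP + real offset".
    void REPLICARegisterInfo::
    eliminateFrameIndex(MachineBasicBlock::iterator II, int SPAdj,
                        RegScavenger *RS) const {
      MachineInstr &MI = *II;
      unsigned i = 0;
      while (!MI.getOperand(i).isFI()) ++i;      // locate the frame-index operand
      int FrameIndex = MI.getOperand(i).getIndex();

      MachineFunction &MF = *MI.getParent()->getParent();
      int Offset = MF.getFrameInfo()->getObjectOffset(FrameIndex) +
                   MI.getOperand(i + 1).getImm(); // add existing displacement

      MI.getOperand(i).ChangeToRegister(RP::SP, false); // assumed SP name
      MI.getOperand(i + 1).ChangeToImmediate(Offset);
    }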
…source code in multiple languages [7]. ROSE is primarily for software analysis and transformation tools and code generation tools [7].

COINS

COINS (a COmpiler INfraStructure) is a research project aiming to create a base for constructing other compilers [26]. COINS is written in Java and has two forms of intermediate representation: HIR (High-level Intermediate Representation) and LIR (Low-level Intermediate Representation) [26].

fcc

fcc is the Fork95 compiler for SB-PRAM. SB-PRAM is a realization of PRAM built at Saarbrücken University [17]. SB-PRAM has up to 4096 RISC-style processors and up to 2 GB of shared memory [17]; the shared memory is accessible in one CPU cycle by any processor. Fork95 is based on ANSI C with additional constructs for creating parallel processes, dividing groups of processors into subgroups, and managing shared and private address subspaces [17]. REPLICA is also a PRAM realization, so fcc might be an alternative to LLVM.

VEX

VEX (short for VLIW EXample) is an example architecture that accompanies the book Embedded Computing: A VLIW Approach to Architecture, Compilers and Tools by Joseph A. Fisher et al. [11]. The architecture and compiler are based on HP's and ST's Lx/ST200 family of VLIW processors and their compiler [11]. The VEX C compiler has some support for dynamic reconfiguration [11]: the available resources (ALUs, issue slots, memory ports, and multipliers), the issue-use delay (the time betw
43. the REPLICA architecture, has some built-in traps for using timers and halting the program. By naming convention these macros should start with an underscore (_).
Name  Description
_start_timer  Issues an OP 176 TRAP 00 that starts the simulator's timer
_stop_timer   Issues an OP 180 TRAP 00 that stops the simulator's timer
_exit         Issues an OP 0 TRAP 00 that halts the simulator
Library functions
Some help functions that we created during development of the back-end ended up in this library. The long-term goal is that this should be compiled into a runtime library that can be used to replace the current ECLIPSE runtime library.
types.h
This header only defines different types with more obvious size information:
• 8-, 16- and 32-bit unsigned integers: uint8, uint16, uint32
• 8-, 16- and 32-bit signed integers: int8, int16, int32
• A size type: size_t
string.h
Currently only contains an implementation of memcpy.
Name  Description
memcpy(void *, const void *, size_t)  Copies data
stdlib.h
Allocating memory with malloc does not currently work, because the statically and dynamically allocated memory collides in the shared memory space, so they end up overwriting each other. Even though they are called malloc and free, memory allocation currently only works in shared memory.
Name  Description
void init_mem()  Initializes the allocated memory list
void *malloc(size_t)  Allocates memory in the shared memory space
44. that variable as shared. See Figure 2.3 for how shared and private variables are declared.

int shared_var_ = 0;  /* A shared variable */
int private_var = 0;  /* A private variable */

Figure 2.3 Baseline language example
Each thread has a copy of a private variable, so in Figure 2.3, where private_var is set to zero, each thread running the program with this declaration will get its own version of private_var in its own private memory space, and this variable will be set to zero. The shared variable shared_var_, on the other hand, will be stored in the shared memory space, so all threads running the program with this declaration will access the same instance of the shared variable shared_var_.
Built-in variables
There are some built-in variables for accessing information such as the thread id and the total number of threads. A simple naming convention that we used is to let all built-in REPLICA-specific variables and macros start with an underscore (_). This helps us differentiate between generic library functions and machine-specific functions.
Table 2.1 Built-in variables containing information about the currently running configuration
Name  Description
_private_space_start  Start of the current thread's private memory space. Also stored in register R32
_thread_id            The current thread's id number
_number_of_threads    The total number of threads
Built-in macros
IPSMSimX86, the simulator used to simulate
45. GCC
GCC, short for GNU Compiler Collection, is a collection of compilers and tools for several languages such as C, C++, Objective-C, Objective-C++, Java, Fortran, Ada and Go [3]. GCC is a part of the GNU project and is licensed under the GNU General Public License [3]. GCC uses an internal representation called RTL (Register Transfer Language) [2]. The process is similar to LLVM: a source program is parsed and transformed into a parse tree [2]. The parse tree is then matched against instruction patterns [2]. The instruction patterns are then matched against RTL templates to generate assembler [2]. We believe that writing a back-end for LLVM will probably be simpler than writing a back-end for GCC.
2.5 Related work
EPICOpt is a project at the Vienna University of Technology that aims to develop new algorithms that try to maintain the advantages of integer linear programming code generation techniques while still being computationally feasible for real-world programs [1]. An LLVM back-end for Texas Instruments' TMS320C64X family of VLIW processors is being developed for the EPICOpt project [4]. TMS320C6455 is currently the fastest processor in the TMS320C64X family; it has eight functional units and can execute up to eight 32-bit instructions per clock cycle [27]. There is also an LLVM back-end for Intel Itanium (IA-64) [5], but the author has not been able to find more information
46. The threshold filter is much better than the earlier benchmark programs. The optimized version is 31.5% faster than the unoptimized version on the 4-processor configuration, 34.5% faster than the unoptimized version on the 16-processor configuration, and 40% faster on the 64-processor configuration. The speed-up from optimization is better maybe because the program is larger than any of the earlier benchmarks, so the initialization part of the program does not become the largest part of the compiled program. The different speed increases for the same program on different configurations are very strange; maybe it is something else in the configurations that differentiates them (see Section 5.5). Table 5.5 contains the running times for the threshold benchmark. Table 5.6 shows us that the speedup for the threshold benchmark is much larger than for the earlier benchmarks. This is probably because the data set that the program works on is also larger, so the benefit from adding more processors and threads to the simulation is better.
Table 5.5 Running time for the threshold image filter example
                 Unoptimized        Optimized
Configuration    Steps   Cycles     Steps  Cycles    ILP Speedup
PRAM T5 4 512    12382   6340044    9414   4819968   31.5%
PRAM T5 16 512   3924    2009597    2917   1494015   34.5%
PRAM T5 64 512   1811    927320     1293   662527    40%
Table 5.6 Parallel speedup for the threshold image filter benchmark
47. faster than the unoptimized version on the 4-processor configuration; on the 16-processor configuration the optimized version is 16.9% faster than the unoptimized program, and on the 64-processor configuration the optimized program is only 11.7% faster than the unoptimized. Table 5.1 contains the number of execution steps for running the whole program and the total number of clock cycles.
Table 5.1 Running time for the parallel max/min example
                 Unoptimized      Optimized
Configuration    Steps  Cycles    Steps  Cycles   ILP Speedup
PRAM T5 4 512    480    245760    394    201728   21.8%
PRAM T5 16 512   180    92160     154    78848    16.9%
PRAM T5 64 512   105    53760     94     48128    11.7%
The low gain from optimization for larger configurations of the REPLICA machine may be because the initialization part of the program, which initializes global and built-in variables, is hard-coded in assembler and might have a longer running time than the actual calculation part of the program. Note that the steps are the number of assembler lines executed, and the clock cycles are the total number of clock cycles the program was running. For a processor with 512 threads one step would equal 512 clock cycles. Table 5.2 gives the speedup for the unoptimized and optimized versions. The speedup from running the program with configurations with more threads is quite low, but a possible explanation for that could be, as earlier
48. The SSA-based optimization pass can perform target-specific optimization before register allocation. We do not use this pass, because we will insert more code in the prolog and epilog of functions, and we would like to optimize that inserted code as well.
3.1.3 Register allocation
Until this pass, instructions have used virtual registers. Now it is time to assign physical registers to these virtual registers. For this, LLVM uses a target-independent register allocation algorithm that uses information about the REPLICA registers that we defined in TableGen (see the REPLICARegisterInfo class in Figure 3.1). There are several register allocation algorithms available for the developer to choose from. By default the register allocator is chosen according to optimization level, but this can be changed with command-line options to the LLVM compiler, llc.
3.1.4 Prolog/Epilog code insertion
The prolog/epilog code insertion pass handles calculating a new stack pointer and restoring the old stack pointer when leaving the function. We must also save and restore registers that are used by the called function. This is done in C++ by the REPLICAFrameLowering class (see Figure 3.1).
3.1.5 Late machine code optimizations
The late machine code optimization pass is used to perform target-specific optimizations on the finished machine code before outputting it. This is where we perform our optimizations for instruction-level parallelism. We also implemented a debug pass that prints all machine instructions in a tree-like structure to ease debugging.
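As a sketch of how these late passes could be hooked into the code generation pipeline: the addPreEmitPass hook is LLVM's, and -enable-replica-ilp is the flag from Appendix A, but the pass-creation function names and the flag variable are assumed names, not necessarily those used in the back-end.

static cl::opt<bool> EnableREPLICAILP("enable-replica-ilp",
    cl::desc("Enable the REPLICA ILP optimization pass"), cl::init(false));

bool REPLICATargetMachine::addPreEmitPass(PassManagerBase &PM,
                                          CodeGenOpt::Level OptLevel) {
  PM.add(createREPLICAInstrSplitterPass(*this)); // 4.1, split super instructions
  if (EnableREPLICAILP)
    PM.add(createREPLICAVirtualILPPass(*this));  // 4.3, reorder sub-instructions
  return true;
}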
49. function calls. This is handled by the REPLICAFrameLowering class that is accessed through the REPLICATargetMachine class.
• Data layout: Supported data types are implemented in the sub-targets, because this could vary between sub-targets. For the moment we only have one generic sub-target. The data layout is a string describing the supported data types on the REPLICA architecture, explained in detail in the sub-target section.
• REPLICATargetLowering: Describes how LLVM internal representation instructions should be lowered and if instructions need custom handling. This is handled by the REPLICATargetLowering class, which is accessed through the REPLICATargetMachine class.
• Sub-targets: Additional information about sub-targets. In our case we only have one implemented sub-target: the generic 32-bit REPLICA machine with no additional features.
The Target also needs to be registered with the TargetRegistry so that LLVM is able to find and use our target during runtime. To make our REPLICA target available to LLVM we need to first register our target name so that it becomes selectable at compile time. The two code fragments in Figure 3.2 and Figure 3.3 are needed to register our REPLICA target machine and the generic REPLICA sub-target.

RegisterTarget<Triple::replica> X(TheREPLICATarget, "replica", "REPLICA");

Figure 3.2 Register the REPLICA target with LLVM's TargetRegistry

RegisterTargetMachine<REPLICA32TargetMachine> X(TheREPLICATarget);
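To illustrate how these pieces fit together, the following is a simplified sketch of a target machine constructor tying the generic sub-target and the data layout string together. The real constructor takes more parameters, and the data layout string shown (a big-endian machine with 32-bit pointers and 8/16/32-bit integers at their natural alignments) is an example consistent with the description above, not a string copied from the back-end.

REPLICA32TargetMachine::REPLICA32TargetMachine(const Target &T, StringRef TT,
                                               StringRef CPU, StringRef FS)
  : LLVMTargetMachine(T, TT),
    Subtarget(TT, CPU, FS),  // the single generic sub-target
    DataLayout("E-p:32:32:32-i8:8:8-i16:16:16-i32:32:32"),
    InstrInfo(Subtarget), TLInfo(*this), FrameLowering(Subtarget) {}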
50.
            outImage_.data[i].r = 255;
            outImage_.data[i].g = 255;
            outImage_.data[i].b = 255;
        } else {
            outImage_.data[i].r = 0;
            outImage_.data[i].g = 0;
            outImage_.data[i].b = 0;
        }
    }
    _synchronize();
    _exit();
    return 0;
}

C.6 blur
We also precalculated the image size and the metadata size, as in the previous program, for the blur program.

#include <replica.h>
#include <sync.h>

#define XSIZE 640
#define YSIZE 480
#define SIZE 307200

struct pixel {
    unsigned char r, g, b;
    char meta;
};

struct image {
    struct pixel data[SIZE];
};

struct image inImage_;
struct image outImage_;
unsigned int blur_size = 3;

void blur(const unsigned int i)
{
    int j;
    int x = i % XSIZE;
    int y = i / XSIZE;
    int xt;
    int yt;
    unsigned int r = 0;
    unsigned int g = 0;
    unsigned int b = 0;

    r += inImage_.data[i].r;
    g += inImage_.data[i].g;
    b += inImage_.data[i].b;

    for (j = 1; j < blur_size; j++) {
        xt = x - j;
        yt = y;
        if (xt > 0 && xt < XSIZE) {
            r += inImage_.data[xt + yt * XSIZE].r;
            g += inImage_.data[xt + yt * XSIZE].g;
            b += inImage_.data[xt + yt * XSIZE].b;
        }
        xt = x + j;
        if (xt > 0 && xt < XSIZE) {
            r += inImage_.data[xt + yt * XSIZE].r;
            g += inImage_.data[xt + yt * XSIZE].g;
            b += inImage_.data[xt + yt * XSIZE].b;
        }
        xt = x;
        yt = y - j;
        if (yt > 0 && yt < YSIZE) {
            r += inImage_.data[xt + yt * XSIZE].r;
            g += inImage_.data[xt + yt * XSIZE].g;
51. • LowerCall checks the arguments and allocates room on the stack if needed. LowerCall then creates instructions for moving arguments to their correct place and, when all the arguments are in place, generates a call instruction. LowerCall is also responsible for taking care of the return value, if any.
• LowerFormalArguments checks the number of arguments and generates instructions for moving arguments from the stack to registers.
• LowerReturn moves the return value into register R1 and then generates a return instruction.
Source code for this part can be found in the following files:
• REPLICACallingConv.td
• REPLICAISelLowering.h
• REPLICAISelLowering.cpp
3.5 Assembly printer
The printing of assembler code is split into three classes: one for low-level printing, one for handling machine instructions and one for holding settings.
• REPLICAAsmPrinter splits a machine instruction into its operands and calls the function needed to print each operand. REPLICAAsmPrinter is also responsible for printing assembler directives around functions.
• REPLICAMCAsmStreamer does the raw printing to file and is also responsible for printing assembler directives for variables.
• REPLICAMCAsmInfo defines directive texts and features available for the assembler printer.
Because MBTAC assembler is different from how assembler languages are structured in general, we decided to implement a custom assembler streamer, i.e. the class responsible for the low-level writing of assembler to file.
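As an illustration of the LowerReturn hook described above, here is a trimmed sketch. Only the copy of the return value into R1 is taken from the text; the REPLICAISD::RET_FLAG node name and the surrounding boilerplate are assumptions following the usual shape of this hook in LLVM back-ends.

SDValue
REPLICATargetLowering::LowerReturn(SDValue Chain, CallingConv::ID CC,
                                   bool isVarArg,
                                   const SmallVectorImpl<ISD::OutputArg> &Outs,
                                   const SmallVectorImpl<SDValue> &OutVals,
                                   DebugLoc dl, SelectionDAG &DAG) const {
  SDValue Flag;
  if (!OutVals.empty()) {
    // Copy the 32-bit return value into R1, as described above.
    Chain = DAG.getCopyToReg(Chain, dl, REPLICA::R1, OutVals[0], Flag);
    Flag = Chain.getValue(1);
  }
  if (Flag.getNode())
    return DAG.getNode(REPLICAISD::RET_FLAG, dl, MVT::Other, Chain, Flag);
  return DAG.getNode(REPLICAISD::RET_FLAG, dl, MVT::Other, Chain);
}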
52. the original image and the blurred image: Figure 5.3 was used as input and the produced result can be seen in Figure 5.8.
Result
The blur image filter benchmark is the largest program that we have written, and thanks to the shared memory on the REPLICA architecture the blur filter really works well with many threads. The speed-up between the unoptimized and optimized version is 23.1% for the 4-processor configuration, 23.6% for the 16-processor configuration, and 25.4% for the 64-processor configuration. Table 5.7 contains the execution times for the blur benchmark program. Table 5.8 contains the speedup for the unoptimized and optimized versions of the benchmark; because of the shared memory in the REPLICA architecture we do not lose any time sending data, so all threads always have data to work on.
Table 5.7 Running time for the blur image filter benchmark
                 Unoptimized          Optimized
Configuration    Steps    Cycles      Steps   Cycles     ILP Speedup
PRAM T5 4 512    120005   61443069    97488   49913856   23.1%
PRAM T5 16 512   31194    15971836    25237   12921852   23.6%
PRAM T5 64 512   8962     4589052     7149    3660796    25.4%
Table 5.8 Parallel speedup for the blur image filter benchmark
Configuration    Unoptimized  Optimized
PRAM T5 4 512    1.0          1.0
PRAM T5 16 512   3.85         3.86
PRAM T5 64 512   13.39        13.63
Figure 5.7 gives a graphical
53. the simulator. We solved this by copying the content of our generated assembler files into a file with the correct flags set.
Simulator
For testing programs compiled for REPLICA we use a simulator, IPSMSimX86, that was originally written for the ECLIPSE project [15] but can also be used for running REPLICA programs. The simulator view is divided into several windows showing what is going on in the simulated computer while executing our program. The command window shows us a history of commands (see Figure 2.6) that we have issued to the simulator, like loading a program or executing a program.
[Figure 2.6: the command window showing an Open psort_unoptimized.s command followed by a history of step thread (Step T) commands.]
Figure 2.6 IPSMSimX86 command window showing a history of step thread (Step T) commands
The simulator uses the output window to print out error messages about the assembler code we are trying to run. Messages are only printed when the simulator tries to execute the part of the generated assembler code that contains the error, so the simulator must run the program to be able to find any errors in it. Possible error messages are listed in Table 2.4. Figure 2.7 shows the output window of the simulator. Output messages are usually error messages, but could also be information about issued traps and other general information. The window labeled with the currently loaded configuration
54. instruction.
• Improve the current optimization algorithm to find and eliminate unnecessary sub-instructions and build longer chains of sub-instructions.
• Improve the optimization algorithm, and add another optimization algorithm that uses e.g. integer linear programming to find the optimal schedule for a given basic block.
• Create build scripts that automatically generate part of the back-end for a specific REPLICA configuration, so that we can for example compile a new compiler that generates code for a REPLICA configuration with 5 ALUs.
• Implement vararg support in the back-end so that the printf family of C functions can be compiled for the back-end.
• Implement the C standard library for REPLICA and a library with useful functions for parallelization.
• A platform-independent simulator for the REPLICA architecture.
Appendix A Building and using the compiler
This appendix contains a small guide for compiling LLVM and Clang and using it to compile programs with the REPLICA back-end. These instructions should help you build LLVM with Clang and the REPLICA back-end on Linux-based systems. First make sure that you have LLVM with the REPLICA back-end, and Clang with the basic REPLICA target in <llvm-replica-src>/tools/clang. Create a separate build directory outside <llvm-replica-src> with mkdir build. Switch to the new build directory, cd build, and run cmake with cmake <llvm-replica-src>. Run make and wait
55. 3.1.6 Code emission
The last pass in the code generation process is code emission. This is where the generated machine code is output to file; in our case that is assembler code printing. Code emission is registered separately from the back-end with LLVM (see REPLICAAsmPrinter and REPLICAMCAsmStreamer in Figure 3.1). The generated machine instructions are printed to a .s file.
3.2 Target machine description
The REPLICATargetMachine class (see Figure 3.1) is the main component of our REPLICA back-end implementation. It implements the LLVMTargetMachine interface that LLVM uses to access classes holding target-specific information, such as instructions, registers, sub-targets, target lowering, target data layout and selection DAG information.
• REPLICAInstrInfo: Instructions are described in TableGen, with some additional C++ code to handle register-to-register copy DAG nodes. This is handled by the REPLICAInstrInfo class, which is accessed through the REPLICATargetMachine class.
• REPLICARegisterInfo: LLVM needs to know what registers we have available and what data types they support. This information is described in TableGen and in the REPLICARegisterInfo class, which is accessed through the REPLICATargetMachine class.
• REPLICAFrameLowering: Takes care of prolog/epilog code insertion, saving and restoring registers at
56. comparison of the execution times of the blur image filter benchmark.
[Figure 5.7: bar chart comparing unoptimized and optimized cycle counts for the PRAM T5 4 512, PRAM T5 16 512 and PRAM T5 64 512 configurations.]
Figure 5.7 The blur benchmark cycles comparison
Figure 5.8 The resulting image from the blur benchmark
5.5 Discussion
It is possible to create fast parallel algorithms for the REPLICA architecture using the REPLICA LLVM back-end. The optimization algorithm gives varying results, but the speed-up after optimization is quite good (roughly 10% to 20%) for the smaller benchmark programs and very good (roughly 20% to 40%) for the larger benchmark programs. Note that the programs have some optimization already, because LLVM chooses super instructions when trying to find the best match for internal representation against target machine instructions. The blur image filter also shows an algorithm that, thanks to the shared memory, becomes very easy to parallelize from the sequential algorithm, and the speedup result from parallelization seems very good. The total number of elements in the datasets for the max/min value and sum benchmarks was chosen to equal the total number of threads for the largest REPLICA configuration (32768). This number should maybe be larger, so that the running time of those benchmarks becomes longer and
57. in the baseline language can be compiled with LLVM together with the REPLICA back-end [24]. The current implementation of the baseline language is basically C with assembler inlining and macros for handling REPLICA's multi-prefix instructions. Figure 2.2 shows a program that calculates the sum of an array with 8096 integers. Note that shared variables end with an underscore (_) and built-in macros and variables start with an underscore (_).

#include <replica.h>
#define SIZE 8096

int array[SIZE];  /* private array with SIZE entries */
int sum_ = 0;     /* shared variable */

int main()
{
    unsigned int i;
    _start_timer();  /* Timer used for benchmarking */
    for (i = _thread_id; i < SIZE; i += _number_of_threads)
        asm("MADD0 %0 %1" : /* no output */ : "r"(array[i]), "r"(&sum_)); /* no clobber */
    _synchronize();  /* Wait for all threads */
    _end_timer();    /* Stop the benchmarking timer */
    _exit();         /* Issue an exit trap to halt the program */
    return 0;
}

Figure 2.2 Baseline language example
Shared and private variables
The simulator IPSMSimX86 sees data labels ending with a _ as data that belongs in the shared memory, so to be able to differentiate between a shared and a private variable in a C program a simple naming convention was used: the names of shared variables need to end with a _, because then the variable's label name in the generated assembler code will also end with a _ and IPSMSimX86 will see
58. ... subinstructions  67
B.4 Immediate operand input subinstructions  69
B.5 Compare unit subinstructions  69
B.6 Sequencer subinstructions  69
C Benchmark code  70
  C.1 Makefile  70
  C.2 Initialization function  71
  C.3 pmaxmin  73
  C.4 psum  74
  C.5 threshold  75
  C.6 blur  77
Bibliography  79
Chapter 1 Introduction
This thesis project is part of the REPLICA project. The focus has been to create a compiler back-end that generates assembler code for the REPLICA architecture. This compiler back-end is only one part in what is to become a whole tool chain for developing programs for the REPLICA architecture, including a high-level language specialized for creating parallelized programs, a C-based baseline language described in this report, an assembler language that this back-end generates code in, support libraries, and a simulator for the architecture to run programs on. The REPLICA architecture is a realization of the PRAM (Parallel Random Access Machine), which is the ideal parallel computer. PRAM is based on the normal Random Access Machine, which consists of a memory and one processor, but with more than one processor sharing the same memory [17]. REPLICA does not follow the definition of a PRAM exactly, as it also has a private memory module
59. An advantage is that all generated code, like prolog/epilog stack adjustments, has already been added. A disadvantage is that all instructions are now in list form, so we need to construct a dependence graph out of this list of instructions; also, because registers have already been assigned, we could have introduced artificial dependencies between instructions. Because of the decision to map LLVM internal representation to REPLICA super instructions, we first need to divide these super instructions so that the optimizer can try to reorder them.
4.1 Instruction splitter
Before the generated code is sent to the assembler printer, it passes through the instruction splitter. The instruction splitter is an LLVM pre-emit pass where each super instruction is divided into the sub-instructions it encodes.

OP0 2 ADD0 O0 R1 WB1 A0

Figure 4.1 Instruction splitting
As an example, before instruction splitting, line 1 in Figure 4.1 is seen as one instruction by the compiler. After instruction splitting, on the other hand, OP0 2, ADD0 O0 R1 and WB1 A0 are seen as separate instructions. Source code for this part can be found in the following file:
• REPLICAInstrSplitter.cpp
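A simplified sketch of such a pre-emit splitter pass is given below. The isSuperInstruction and splitSuperInstruction helpers are hypothetical stand-ins for the per-opcode splitting logic; the pass structure itself follows LLVM's MachineFunctionPass interface.

namespace {
struct SplitterSketch : public MachineFunctionPass {
  static char ID;
  SplitterSketch() : MachineFunctionPass(ID) {}

  virtual bool runOnMachineFunction(MachineFunction &MF) {
    bool Changed = false;
    for (MachineFunction::iterator BB = MF.begin(); BB != MF.end(); ++BB)
      for (MachineBasicBlock::iterator MI = BB->begin(); MI != BB->end(); ) {
        MachineInstr *Super = MI++;          // advance before erasing
        if (isSuperInstruction(Super)) {     // assumed predicate
          splitSuperInstruction(*BB, Super); // emit the encoded sub-instructions
          Super->eraseFromParent();
          Changed = true;
        }
      }
    return Changed;
  }
};
}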
60. for the compiler to compile. The compiler and all compiler tools should end up in build/bin when the compilation is finished. Some command-line options that are useful when compiling for REPLICA are given below. First, the option to tell Clang we are compiling for REPLICA: -ccc-host-triple replica-unknown-unknown will tell Clang to compile the program as if we were running on REPLICA. -nostdinc may be useful so that Clang does not try to include the standard library from the usual path. To tell LLVM that we would like to generate assembler for REPLICA we use the following command-line options: -march=replica -mcpu=generic (our target is machine architecture REPLICA and CPU model generic). -enable-replica-ilp enables the ILP optimization pass. -disable-replica-instr-splitter disables the instruction splitter; the ILP optimization pass only works with sub-instructions, so with the instruction splitter off the ILP optimization does not work. -enable-replica-instr-printer prints the machine instructions in a more readable structure, which is useful when debugging the optimization pass.

#include <replica.h>

int main()
{
    int a = 1;
    int b = 2;
    int c = a + b;
    return c;
}

Figure A.1 Sample program to compile for REPLICA (sample.c)
Because we do not have a linker for REPLICA, the LLVM IR linker llvm-link is quite useful when compiling programs spread over several files. The program
61. for code generation that, when necessary, use parts of the target machine's back-end. To generate code for REPLICA we need to describe it to LLVM: information such as instructions, registers, calling conventions, and how LLVM IR is to be lowered to MBTAC assembler is needed. This is done with a combination of C++ and TableGen (see Section 2.3.5, LLVM's target description language). Almost all source code for our back-end is located in <llvm-source-dir>/lib/Target/REPLICA. Some modifications outside of this directory were needed: we needed to add our new back-end to <llvm-source-dir>/include/ADT/Triple.h and <llvm-source-dir>/lib/Support/Triple.cpp so that it becomes a selectable target. When adding a new back-end to LLVM we also need to register it at runtime so that LLVM knows it exists. Most classes in our LLVM back-end are accessed through the REPLICATargetMachine class, so that has to be registered with the LLVM target registry. We also need some output method from the back-end; in our case we only have the assembler printer, so that also needs to be registered with LLVM, and this is done separately from the target machine registry. See Figure 3.1 for an architectural overview of the back-end. We also modified the Clang front end so that it knows about REPLICA's register names and available inline assembler constraints.
[Figure 3.1: architectural overview of the REPLICA back-end and its classes.]
62. [Figure 2.11 shows the simulator window listing global variables, their shared/private classification, addresses and contents, e.g. _private_space_start, _private_space_end, _shared_space_start, _shared_space_end, _thread_id, sum_, array_ and out_array_, together with the active frames per processor.]
Figure 2.11 IPSMSimX86 global variables and their contents
Processor
MBTAC, short for Multi-bunched Threaded Architecture with Chaining, is a dual-mode VLIW architecture [24]. The two modes are PRAM and NUMA. In PRAM mode each MBTAC has A ALUs, M memory units, M hash address calculation units, a compare unit, a sequencer and R registers per thread [24]. On the other hand, in NUMA mode processors can be configured to execute a common instruction stream and share their state among each other [24]. Each thread bunch then has a local ALU, a local memory unit
63. Each TableGen instruction definition consists of several REPLICA sub-instructions, so-called super instructions. For example, ADDri contains three REPLICA sub-instructions: OP, ADD and WB. We decided to use this implementation style so that we could let LLVM's code generation algorithm select and order instructions without any modifications.
2.4 Alternative Compilers
If it for some reason would not be possible to use LLVM for this project, we have looked into some other compilers that could be used instead. The compilers we have studied are Open64, Trimaran, ROSE, COINS, fcc, VEX and GCC.
Open64
SGI released their MIPSpro compiler under an open-source license in 2000 under the name Pro64. This later became the Open64 compiler. The compiler supports C, C++ and Fortran95 and can generate code for many architectures such as IA-64 [22]. According to Ming Lin et al. [22], re-targeting the Open64 compiler to a new architecture is a hard and error-prone task, as no automated tools for re-targeting exist.
Trimaran
Trimaran is a compiler made for research in computer architecture and compiler optimizations [8]. Trimaran can target a wide range of architectures such as VLIW processors [8]. Trimaran's modular nature should make it relatively easy to add support for a new architecture, but as a research compiler framework it has a limited user group.
ROSE
ROSE is a source-to-source compiler infrastructure that can read and write
64. We also have support for 32-, 16- and 8-bit integers with 32-, 16- and 8-bit ABI¹ alignments. Source code for this part can be found in the following files:
• REPLICA.td
• REPLICATargetMachine.h
• REPLICATargetMachine.cpp
• REPLICASubtarget.h
• REPLICASubtarget.cpp
¹Application Binary Interface
3.3 Registers
The register set and register classes are described using LLVM's TableGen language, which is transformed to C++ during compilation of the compiler. In our version of the processor all registers are 32-bit integer registers, but there are still some differences between them: some are associated with specific functional units. Currently there are four register classes: IntRegs, ALURegs, MEMRegs and OPRegs (see Table 3.1).
Table 3.1 Register classes for REPLICA
Register class  Description
IntRegs   General-purpose 32-bit integer registers
ALURegs   Registers holding results of ALU operations. For example, the result of an add instruction in ALU 0 will be placed in ALU register 0
MEMRegs   Registers holding results from memory operations. For example, the result of a load in memory unit 0 will be placed in memory register 0
OPRegs    OPRegs can be used to insert constants in assembler code. These are volatile registers, as they don't keep their value between clock cycles
Super instructions that are generated before our ILP optimization only use the general-purpose
65. REPLICA. We have also developed a simple optimization algorithm to make better use of REPLICA's capability for instruction-level parallelism. We have also made changes in Clang, LLVM's front end for C/C++/Objective-C, so that we can use inline assembler in REPLICA programs. With Clang, C code can be compiled to LLVM's internal representation, which in turn, through LLVM and the REPLICA back-end, can be transformed into MBTAC assembler.
¹MBTAC is the processor in the REPLICA architecture.
Acknowledgments
To Erik Hansson, for supervision and comments on this thesis. To Martti Forsell, for comments on this thesis. This project was funded by VTT.
Contents
1 Introduction
  1.1 Motivation
2 Background
  2.1 PRAM Model of computation
  2.2 REPLICA
    2.2.1 REPLICA Language System
    2.2.2 REPLICA Architecture
  2.3 LLVM
    2.3.1 Clang
    2.3.2 LLVM passes
    2.3.3 Back ends
    2.3.4 LLVM internal representation
    2.3.5 TableGen
  2.4 Alternative Compilers
  2.5 Related work
3 Implementation
  3.1 Code generation
    3.1.1 Instruction selection
    3.1.2 SSA-based optimization
    3.1.3 Register allocation
66. [Figure 4.3 shows the dependence DAG built from these instructions: nodes for the R_OPi, R_ADDro, R_WBra, R_LDr, R_WBrm, R_CSGTUrr and R_BNEZ machine instructions, with edges for their register def/use/kill dependencies.]
Figure 4.3 Dependence DAG from the conditional check in a for loop
Figure 4.3 is a graphical representation of the dependencies between the instructions in a basic block. The dependencies between instructions are either register dependencies or additional dependencies; our dependence DAG construction algorithm uses the following rules for deciding if there is a dependency or not:
1. If instruction I1 defines register r and instruction I2 uses register r, then I2 depends on I1.
2. If instruction I1 uses register r and instruction I2 defines register r, then I2 depends on I1.
3. If instruction I1 defines register r and instruction I2 also defines register r, then I2 depends on I1.
4. Aside from the register conflicts, there are also other dependencies to look out for when scheduling instructions. We look for the following additional dependencies when creating our dependence DAG:
• Load/store instructions: we cannot know if the address that the load instruction loads from is the same as the one a later store instruction saves to. So if I1 is a load instruction and I2 is a store instruction, then I2 depends on I1.
• An inline assembler DAG node is given dependencies so that it is not moved during optimization.
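The register rules above can be summarized in a small self-contained sketch. The real implementation in REPLICAMCDependenceDAG.cpp works on LLVM machine instructions; here a simplified def/use summary per sub-instruction is assumed.

#include <vector>

struct SubInstr {
  std::vector<unsigned> defs, uses; // register numbers
  bool isLoad, isStore;
};

static bool touches(const std::vector<unsigned> &regs, unsigned r) {
  for (unsigned i = 0; i < regs.size(); ++i)
    if (regs[i] == r) return true;
  return false;
}

// Returns true if `later` must be scheduled after `earlier`.
static bool dependsOn(const SubInstr &earlier, const SubInstr &later) {
  for (unsigned i = 0; i < earlier.defs.size(); ++i)
    if (touches(later.uses, earlier.defs[i]) ||  // rule 1: def then use
        touches(later.defs, earlier.defs[i]))    // rule 3: def then def
      return true;
  for (unsigned i = 0; i < earlier.uses.size(); ++i)
    if (touches(later.defs, earlier.uses[i]))    // rule 2: use then def
      return true;
  // A load may alias a later store (the first additional dependency above).
  if (earlier.isLoad && later.isStore)
    return true;
  return false;
}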
67. The most important change is the handling for printing chained sub-instructions. To encode that the current sub-instruction being printed has more chained subsequent sub-instructions, we added an ending to the assembler string (see Figure 3.7). An empty delimiter instruction is used to end the chain (see line 4 in Figure 3.7).

// Branch to Xx if IC not equal to zero
def R_BNEZ : InstRP<(outs), (ins OPRegs:$brdst),
                    "BNEZ $brdst", []>;
def delimiter : Pseudo<(outs), (ins), "\n", []>;

Figure 3.7 The assembler string on line 3 ends with a chain marker so that the assembler streamer does not automatically insert a newline after the sub-instruction
The assembler printer is also currently printing the initialization function ProgramStart, which initializes built-in variables. This function ought to be moved into the runtime library in future releases of the REPLICA back-end. Source code for this part can be found in the following files:
• REPLICAAsmPrinter.cpp
• REPLICAMCAsmStreamer.h
• REPLICAMCAsmStreamer.cpp
• REPLICAMCAsmInfo.h
• REPLICAMCAsmInfo.cpp
3.6 Clang basic target information
Because our REPLICA back-end relies on inline assembler for using the special multi-prefix instructions, we need to give Clang, the compiler front end, information about the REPLICA architecture, such as:
• Available registers: To use constraints in assembler inlining we would need to define the available registers.
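As a sketch of how the register names could be exposed to Clang, the getGCCRegNames hook from Clang's TargetInfo interface is shown below; the exact register list is illustrative (the text mentions at least R1 through R32).

void REPLICATargetInfo::getGCCRegNames(const char * const *&Names,
                                       unsigned &NumNames) const {
  static const char * const GCCRegNames[] = {
    "R1", "R2", "R3", /* ... */ "R32" // general-purpose registers
  };
  Names = GCCRegNames;
  NumNames = llvm::array_lengthof(GCCRegNames);
}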
68. (see Figure 2.8) shows us the state and contents of registers in the simulated machine. When single-stepping a thread, a processor or the whole machine, this window is updated in every step; otherwise it only shows the content from when the machine was last halted.
Table 2.4 IPSMSimX86 assembler error messages
Number  Message
101  Illegal ALU number
102  Illegal memory bus number
103  Illegal memory bus number
104  Illegal register number
105  Illegal operand number
106  Illegal mnemonic
107  Missing directive
108  End of line ignored
109  Illegal label
110  Illegal label
111  Missing directive
112  End of line ignored
114  Symbol Table overflow
115  Unknown operand
116  Unknown instruction class
117  Program too large to fit in instruction memory
[Figure 2.7: the output window showing "MTAC Assembler Compiler: Compiling psort_unoptimized.s Pass 1, Compiling psort_unoptimized.s Pass 2, Opening... Done".]
Figure 2.7 IPSMSimX86 output window
The memory content window (see Figure 2.9) is a bit hard to read, but it shows the content of the complete memory in the current configuration. The memory address of the first memory cell on a line is written on the left side; each row then has 8 memory slots. The code window (see Figure 2.10) shows which instructions are currently executed; the numbers to the left of the marked line show how many threads are
69. In any given basic block there will always be at least one sub-instruction that can be scheduled in the current execution step. In a worst-case scenario we could have a basic block with n sub-instructions, which consists of first k independent sub-instructions that use the same functional unit and then m sub-instructions where each sub-instruction is dependent on the previous one. Figure 4.5 shows how a worst-case dependence DAG might look.
• At the start of the algorithm, the k sub-instructions and the first of the m sub-instructions are added to the ready list.
• Now the algorithm would first schedule the first of the k sub-instructions and then search through the k - 1 remaining sub-instructions until the last sub-instruction on the ready list, which is one of the m.
• Scheduling the last sub-instruction on the ready list would complete that execution step and add another of the m sub-instructions to the ready list, inserted last.
The check to see if all sub-instructions have been scheduled loops through all sub-instructions once in each iteration of the main optimization loop. There are two cases: m > k and m <= k.
• m > k: We would have to iterate through k + 1 sub-instructions to schedule the first execution step. For scheduling the second execution step we would have to iterate through k sub-instructions. The sum would look something like (k + 1) + k + (k - 1) + ...
70. about it than an almost empty wiki [6] and a project page at SourceForge with source code [5]. Because LLVM's IR format does not match that well with REPLICA's instruction format, a compromise was to let the back-end generate code for a default REPLICA configuration and later reschedule and optimize the generated code for the REPLICA implementation at hand. Basic block scheduling, i.e. scheduling of instructions within basic blocks, seemed like a good starting point. There is a lot of research available on basic block scheduling. Kessler et al. show in [18] a dynamic programming based algorithm which produces optimal or highly optimized code. Leupers et al. give an integer linear programming based algorithm in [21] which produces high-quality optimized code. Lorenz et al. show a genetic algorithm for code generation in [23]. Eriksson et al. show in [10] both an algorithm based on a genetic algorithm, which produces code one or two clock cycles from optimum in the cases where an optimal schedule is known, and an integer linear programming based algorithm.
²Optimal Code Generation for Explicitly Parallel Processors
Chapter 3 Implementation
Our LLVM back-end converts LLVM internal representation to MBTAC assembler for a minimal REPLICA configuration. We also have an optimization pass that tries to restructure our generated assembler so that we use the available functional units more effectively. The LLVM framework provides generic algorithms
71. An early problem we had was with stack handling in conjunction with function calls: we needed to adjust the stack pointer in a function so that it had room for function arguments and local variables. The shared stack that REPLICA can have is not implemented, due to the difficulty with multiple threads in multiple functions all using the same stack. It was more pressing for the author to generate working code than to implement more specialized features. The optimization results seem very promising; we gained a speedup ranging between 10% and 40% from simple re-ordering of sub-instructions. Using more invasive optimization algorithms may yield even better results. The compiler back-end is not complete, but it is usable. We will in the next section give some thoughts on possible improvements, aside from the usual testing and extending of the already implemented features.
Chapter 7 Future work
A compiler back-end is only one part of the tool chain for creating and compiling programs for a computer. Other parts include front ends, libraries, linkers and assemblers. They are all needed to create the complete tool chain, while this back-end can be used to compile programs for a specific REPLICA configuration. All parts of the tool chain are needed for this compiler to actually be useful.
• Add internal representation for multi-prefix instructions in both the Clang front end and LLVM, so that the multi-prefix instructions can be optimized as any other
72. LDBUn Xx   Load byte from memory n address Xx, unsigned, in MU n
LDHn Xx    Load halfword from memory n address Xx in MU n
LDHUn Xx   Load halfword from memory n address Xx, unsigned, in MU n
LDn Xx     Load word from memory n address Xx in MU n
STBn Xx Xy  Store byte Xx to memory n address Xy in MU n
STHn Xx Xy  Store halfword Xx to memory n address Xy in MU n
STn Xx Xy   Store word Xx to memory n address Xy in MU n
MADDn Xx Xy   Add multiple Xx to active memory Xy in MU n
MSUBn Xx Xy   Subtract multiple Xx to active memory Xy in MU n
MANDn Xx Xy   And multiple Xx to active memory Xy in MU n
MORn Xx Xy    Or multiple Xx to active memory Xy in MU n
MMAXn Xx Xy   Max multiple Xx to active memory Xy in MU n
MMAXUn Xx Xy  Max unsigned multiple Xx to active memory Xy in MU n
MMINn Xx Xy   Min multiple Xx to active memory Xy in MU n
MMINUn Xx Xy  Min unsigned multiple Xx to active memory Xy in MU n
MPADDn Xx Xy  Arbitrary multiprefix add Xx to active memory Xy in MU n
MPSUBn Xx Xy  Arbitrary multiprefix subtract Xx to active memory Xy in MU n
MPANDn Xx Xy  Arbitrary multiprefix and Xx to active memory Xy in MU n
MPORn Xx Xy   Arbitrary multiprefix or Xx to active memory Xy in MU n
MPMAXn Xx Xy  Arbitrary multiprefix max Xx to active memory Xy in MU n
MPMAXUn Xx Xy Arbitrary multiprefix max unsigned Xx to active memory Xy in MU n
73. void free(void *)   Frees allocated memory
MBTAC Assembler
The MBTAC processor lets us chain sub-instructions, so that we can use the output of one sub-instruction as an input operand for the next sub-instruction [24]. There are different types of sub-instructions depending on which functional unit they use. In our REPLICA configurations we have the following sub-instruction types (see Appendix B for descriptions of each sub-instruction):
• Memory unit sub-instructions: load, store and multi-prefix instructions.
• ALU sub-instructions: instructions like add and subtract etc., and also compare instructions where the result is stored in a register.
• Compare unit: compare sub-instructions where the result sets status register flags. There is always only one compare unit.
• Sequencer: program-flow-altering sub-instructions, i.e. jump and branch. There is always only one sequencer.
• Operand: one sub-instruction only. Used to load a constant or label etc. into an operand slot.
• Writeback: also only one sub-instruction. Used to copy register contents.
When writing code for MBTAC, sub-instructions that are to be issued in the same execution step are written on the same line (see the code example in Figure 2.4).

OP0 16 ADD0 O0 R1 LD0 A0 WB1 A0 WB2 M0
OP0 2 MUL0 O0 R2 ST0 A0 R1

Figure 2.4 MBTAC assembler example
Line 1 in Example 2.4 calculates an address
74. at the first sub-instruction in the block and leave the block at the last sub-instruction in the block. As an effect of this we also always know that the dependencies between sub-instructions will not have any cycles, so therefore the algorithm will always terminate.
¹Instruction-Level Parallelism
[Figure 4.4 (flowchart): build the dependence DAG, remove delimiter pseudo-instructions and find free sub-instructions to start the algorithm with; then, for each execution step, first schedule the sub-instructions on the priority list, try to schedule each free sub-instruction, and when nothing more fits in the step, finalize the sub-instructions in this step, insert a delimiter pseudo-instruction, insert all free unscheduled inline assembler nodes, move instructions from the delay list to the ready list, and repeat until all sub-instructions in the basic block are scheduled.]
75. Source code for this part can be found in the following file:
• REPLICAVirtualILP.cpp
Chapter 5 Evaluation
We created a number of benchmark programs to test our back-end and also to see how effective our optimization approach is. All of these programs were compiled with the makefile in Appendix C, with ILP optimizations both on and off. We have to use inline assembler in some parts of our benchmark programs, because that is the only way to use multi-prefix instructions. To get away from the use of inline assembler we would have to add internal representation nodes that match these instructions. The compiler was running on Ubuntu Linux 11.10 and the simulator was running on Mac OS X 10.6 for all benchmarks. Source code for the benchmarks is located in Appendix C. Note that there is no sequential case for these benchmarks; the smallest configuration used in the benchmarks has 4 processors with 512 threads each.
5.1 Parallel max/min value
The parallel max/min benchmark locates the maximum and minimum values in an array consisting of 32768 unsigned integers. This program uses the MMAXU and MMINU multi-prefix instructions so that several threads can work on the same data simultaneously.
Results
The results from the parallel max/min value benchmark show a small speedup, both using the ILP optimization algorithm from Chapter 4 and from adding more threads. The optimized version of the benchmark is roughly 21.8%
76. That is, it is dependent on all previous instructions, and all instructions following the inline assembler node are dependent on it.
• An instruction using the result of an OP instruction must be executed in the same step as the OP instruction; therefore the edge between such DAG nodes is given the special edge weight 1.
• An instruction using the result from a writeback (WB) instruction cannot be scheduled in the same step as the writeback, so the edge between such DAG nodes is given the weight 2.
Source code for this part can be found in the following file:
• REPLICAMCDependenceDAG.cpp
4.3 Optimizations for instruction-level parallelism
The REPLICA architecture has several functional units, so to effectively use the hardware we need to make sure that the utilization of the functional units is high. The ILP optimization pass tries to reschedule instructions to increase functional unit utilization. At the same time it must also take into account the dependencies between instructions, so that we do not, for example, try to use a calculated result before it is calculated. Therefore the ILP optimization pass uses the dependence DAG described earlier to find which instructions are available to schedule at a given time. The algorithm is based on M. Forsell's virtual ILP algorithm [14]; this implementation is visualized in Figure 4.4. The algorithm works on basic blocks, so we know that the program flow always will enter the block
77. law, the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.
Daniel Åkesson
78. Subsequent transfer of copyright cannot revoke this permission. All other uses of the document require the copyright owner's consent. To guarantee authenticity, security and accessibility, there are solutions of a technical and administrative nature. The author's moral right includes the right to be mentioned as the author to the extent required by good practice when the document is used as described above, as well as protection against the document being altered or presented in such a form or in such a context as is offensive to the author's literary or artistic reputation or individuality. For additional information about Linköping University Electronic Press, see the publisher's home page: http://www.ep.liu.se/.
Copyright
The publishers will keep this document online on the Internet, or its possible replacement, for a period of 25 years from the date of publication, barring exceptional circumstances. The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for his/her own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property
79. stated, that the calculation part of the program is so small compared to the initialization part.
Table 5.2 Parallel speedup for the parallel max/min example
Configuration    Unoptimized  Optimized
PRAM T5 4 512    1.0          1.0
PRAM T5 16 512   2.67         2.56
PRAM T5 64 512   4.57         4.19
For a graphical comparison between the execution times of the parallel max/min value benchmark, see Figure 5.1.
[Figure 5.1: bar chart comparing unoptimized and optimized cycle counts (roughly 250000 down to 50000) for the PRAM T5 4 512, PRAM T5 16 512 and PRAM T5 64 512 configurations.]
Figure 5.1 The parallel max/min value benchmark cycles comparison
5.2 Parallel array sum
The parallel array sum benchmark is similar to the parallel max/min value benchmark, but instead of using the MMINU and MMAXU multi-prefix instructions it uses the MADD instruction to add every element in an array to a summation variable.
Results
Maybe because of the similarities with the parallel max/min benchmark, the gain from optimization is not that visible, because of the small size of the calculation part of the program compared to the static part doing initialization. The optimized version is 22.3% faster than the unoptimized version on the 4-processor configuration, 16.5% faster than the unoptimized version on the 16-processor configuration, and only 11.4% faster than the unoptimized version on the largest configuration.
80. In such a case we would only have to go through the loop n times. Multiply that with the finish check and we get that the best-case time is O(n²). The check to see if we have scheduled all sub-instructions is very inefficient. The check can be done in constant time, thereby bringing the worst-case time complexity down to O(n²) and the best-case time down to O(n).
[Figure 4.5: a dependence DAG with k independent sub-instructions of type A followed by a chain of m dependent sub-instructions of type B.]
Figure 4.5 Example of a worst-case dependence DAG. Instructions of type A use functional unit A and instructions of type B use functional unit B
4.3.4 Practical example
As an example, the code in Figure 4.6 can be compressed using the ILP optimization algorithm to the assembler code in Figure 4.7, assuming the target REPLICA configuration has one ALU, one memory unit and two operand slots. We gained two execution steps (lines of assembler code) with this optimization.

OP0 28 ADD0 R29 O0 WB1 A0
LD0 R1 WB1 M0
OP0 20 ADD0 R29 O0 WB2 A0
LD0 R2 WB2 M0
ADD0 R2 R1 WB1 A0
OP0 12 ADD0 R29 O0 WB2 A0
ST0 R1 R2
OP0 0 WB2 O0 SLT R1 R2
OP0 _BB1_14 BNEZ O0

Figure 4.6 Unoptimized MBTAC assembler

OP0 28 ADD0 R29 O0 WB1 A0
OP0 20 ADD0 R29 O0 LD0 R1 WB1 M0 WB2 A0
LD0 R2 WB2 M0
ADD0 R2 R1 WB1 A0
OP0 12 ADD0 R29 O0 WB2 A0
OP0 0 ST0 R1 R2 WB2 O0 SLT R1 R2
OP0 _BB1_14 BNEZ O0

Figure 4.7 Optimized MBTAC assembler
81. in Figure A.1 is compiled with the command in Figure A.2:

clang -S -O0 -emit-llvm -ccc-host-triple replica-unknown-unknown -v sample.c

Figure A.2 Use clang to convert our C program to LLVM IR
The command in Figure A.2 will produce a file sample.s containing the program from Figure A.1 translated to LLVM IR. We use the commands in Figure A.3 to compile this into LLVM bit code (line 1) and link with other files (line 2) if needed:

llvm-as sample.s
llvm-link -o program.s.bc sample.s.bc other_file.s.bc

Figure A.3 Use llvm-as and llvm-link to convert our LLVM IR to LLVM bit code and to link several bit code files into one large bit code file
And finally, in Figure A.4, we compile the bit code file with our program to assembler that can be run in IPSMSimX86:

llc -march=replica -mcpu=generic -asm-verbose -enable-replica-ilp program.s.bc

Figure A.4 Use llc to compile our program bit code file to MBTAC assembler
Now our compiled program is in the file program.s.s.
Appendix B Assembler Language
This appendix contains a list of assembler sub-instructions for the first version of the REPLICA architecture, Forsell's Baseline language assembler and architecture [24]. The sub-instructions are grouped according to which functional unit they belong to. These sub-instruction descriptions were taken from Martti Forsell.
B.1 Memory unit sub-instructions
LDBn Xx   Load byte from memory n address Xx in MU n
82. Control flow goes from the first instruction in the block to the last instruction in the block with no branches in between, so the dependency graph for a basic block always becomes a DAG (Directed Acyclic Graph) [25]. The nodes in the dependence DAG are REPLICA machine instructions, and there is an edge between two nodes N1 and N2 if N2 depends on N1. The REPLICA back-end's implementation of a dependence DAG for pre-emit rescheduling of instructions transforms REPLICA machine instructions in list form to a graph with dependencies.

_BB0_1:                ; for.cond  ; This Inner Loop Header: Depth=1
OP0 0 ADD0 R29 O0 WB1 A0
OP0 9 WB2 O0 LD0 R1 WB1 M0 SGTU R1 R2
OP0 _BB0_3 BNEZ O0

Figure 4.2 Basic block for the conditional of a for loop
Figure 4.2 shows a basic block for the conditional check in a for loop. The instructions in this basic block are read from the OP0 0 on line 1 to BNEZ O0 on line 3, and from this the dependence DAG in Figure 4.3 is constructed.
83. REPLICA supports instruction-level parallelism as a VLIW architecture; REPLICA can also chain instructions so that the output from an earlier instruction can be used as input to a later instruction in the same execution step. There are plans in the REPLICA project to develop a new C-based programming language, compilers and libraries to speed up the development of parallel programs. We have developed an LLVM back-end as a part of the REPLICA project that can be used to generate code for the REPLICA architecture. We have also created a simple optimization algorithm to make better use of REPLICA's support for instruction-level parallelism. Some changes to Clang, LLVM's front end for C/C++/Objective-C, were also necessary so that we could use assembler inlining in our REPLICA programs. Clang is used to compile C code to LLVM's internal representation, and LLVM with our REPLICA back-end transforms LLVM's internal representation into MBTAC assembler.
Sammanfattning
REPLICA is a VLIW-like PRAM-NUMA architecture with the ability to chain instructions, so that results from earlier instructions can be used as input to the next instruction in the same execution step. Within the REPLICA project there are plans to develop a new C-based programming language, compilers and libraries to speed up the development of parallel programs. As part of the REPLICA project we have developed a compiler back-end for LLVM that can be used to generate code for
84. [Figure 4.4 (flowchart, continued): free sub-instructions that must go into the current step are added to the priority list, free sub-instructions with a WB delay go to the delay list, and other free sub-instructions go to the ready list; the current sub-instruction is then reset to the first sub-instruction on the ready list.]
Figure 4.4 Flowchart over the ILP optimization algorithm
4.3.1 Data structures and algorithm details
There are three important data structures in the algorithm, aside from the dependence DAG described earlier. Note that free sub-instructions are those sub-instructions that have all their dependencies fulfilled.
• Ready list: Tracks the free sub-instructions that our algorithm can try to schedule. Called can_be_scheduled_now in the source code.
• Priority list: Tracks sub-instructions that use the result from an OP sub-instruction and therefore must be inserted in the same execution step as the OP. Called must_be_scheduled_now in the source code.
• Delay list: Tracks sub-instructions that are delayed by a WB sub-instruction. These sub-instructions can be moved to the ready list when scheduling the next execution step. Called can_be_scheduled_next in the source code.
• Current sub-instruction: Not a data structure, but the index in the ready list of the sub-instruction we are currently trying to schedule.
The optimization pass is given a machine function.
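A condensed sketch of one iteration of the scheduling loop, using the three lists named above, is given below. The helper functions (takeFirst, fitsInCurrentStep, emitIntoStep, releaseSuccessors, closeStep) are assumed names standing in for the flowchart's boxes; the real loop in REPLICAVirtualILP.cpp also handles inline assembler nodes and the priority-list constraint in more detail.

while (scheduledCount < totalSubInstrs) {
  // Sub-instructions that use an OP result must go into the same step as the OP.
  while (!mustBeScheduledNow.empty())
    emitIntoStep(takeFirst(mustBeScheduledNow)); // also bumps scheduledCount

  // Try every free sub-instruction; schedule it if its functional unit is
  // still unused in the current execution step.
  for (List::iterator I = canBeScheduledNow.begin();
       I != canBeScheduledNow.end(); )
    if (fitsInCurrentStep(*I)) {
      emitIntoStep(*I);
      releaseSuccessors(*I, mustBeScheduledNow, canBeScheduledNow,
                        canBeScheduledNext);
      I = canBeScheduledNow.erase(I);
    } else
      ++I;

  closeStep(); // emit the delimiter pseudo instruction ending this step
  // WB-delayed sub-instructions become schedulable in the next step.
  canBeScheduledNow.splice(canBeScheduledNow.end(), canBeScheduledNext);
}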
DATA: Mark the start of a data segment.
TEXT: Mark the start of a code segment.
WORD n: Place a 32-bit word with value n in the data segment.
GLOBAL name: Declare a global variable or procedure with name name.
FILE filename: Fill this memory location with data from filename.
SAVE size filename: Write size number of bytes from this location to filename.

ideal PRAM, so they are a bit slower (using more cycles) than the PRAM configurations. For the PRAM configurations we had 4, 16, and 64 processors, with each processor having 512 threads, while the two ECLIPSE configurations both had 4 processors, with 4 and 512 threads respectively.

Table 2.3: REPLICA configuration names and number of processors/threads.

    Name                           Processors  Threads  ALUs  Memory units  Registers
    ECLIPSE_CRCW T5_4_de SRAM_SA4  4           4        1     1             34
    ECLIPSE_CRCW T5_4_de SRAM_SA4  4           512      1     1             34
    PRAM T5_4                      4           512      1     1             34 512
    PRAM T5_16                     16          512      1     1             34 512
    PRAM T5_64                     64          512      1     1             34 512

The memory space on the REPLICA architecture is divided between a dedicated program memory space, a shared memory space, and a thread-private memory space for each thread [24]. The simulator for the whole architecture, IPSMSimX86, currently only works on Mac OS X. We had some problems with the current simulator and file encodings: even though both files are in UTF-8, the file generated by our compiler lacks the internal file type signatures used by the simulator.
the initialization part of the benchmarks does not become such a large portion of the total execution time.

It is strange that the speedup from ILP optimization increases when adding more processors and threads to the configuration, because the program is the same independent of configuration: it is the same file with assembler code that is loaded into the simulator. The cause of this is hard to find; it might have something to do with the configurations or the simulator.

Programs using the back end are currently very primitive, because there is no standard C library support for REPLICA yet, and there is a very limited amount of REPLICA-specific library definitions and code skeletons. Other sample programs written for the compiler back end that are not shown here include programs for testing function calls, loop constructs, and other functional testing.

In conclusion, when writing a parallel program for REPLICA one should have in mind that:

- The dataset for the algorithm needs to be sufficiently large to get a good speedup from optimization.

- The same applies when adding more threads to a problem.

- Synchronization was done by hand after each loop. The synchronization function was created using inline assembler and multi-prefix instructions.

- Synchronization is needed to guarantee that the result calculated in the loop is finished when threads start to execute sub-instructions outside the loop.

Chapter 6 Conclusion

We have in this project developed an LLVM back end that can be used to generate code for the REPLICA architecture.
BMPADDn Xx, Xy: Begin arbitrary multiprefix add Xx to active memory Xy in MU n.
BMPSUBn Xx, Xy: Begin arbitrary multiprefix subtract Xx to active memory Xy in MU n.
BMPANDn Xx, Xy: Begin arbitrary multiprefix and Xx to active memory Xy in MU n.
BMPORn Xx, Xy: Begin arbitrary multiprefix or Xx to active memory Xy in MU n.
BMPMAXn Xx, Xy: Begin arbitrary multiprefix max Xx to active memory Xy in MU n.
BMPMAXUn Xx, Xy: Begin arbitrary multiprefix max unsigned Xx to active memory Xy in MU n.
BMPMINn Xx, Xy: Begin arbitrary multiprefix min Xx to active memory Xy in MU n.
BMPMINUn Xx, Xy: Begin arbitrary multiprefix min unsigned Xx to active memory Xy in MU n.

EMPADDn Xx, Xy: End arbitrary multiprefix add Xx to active memory Xy in MU n.
EMPSUBn Xx, Xy: End arbitrary multiprefix subtract Xx to active memory Xy in MU n.
EMPANDn Xx, Xy: End arbitrary multiprefix and Xx to active memory Xy in MU n.
EMPORn Xx, Xy: End arbitrary multiprefix or Xx to active memory Xy in MU n.
EMPMAXn Xx, Xy: End arbitrary multiprefix max Xx to active memory Xy in MU n.
EMPMAXUn Xx, Xy: End arbitrary multiprefix max unsigned Xx to active memory Xy in MU n.
EMPMINn Xx, Xy: End arbitrary multiprefix min Xx to active memory Xy in MU n.
EMPMINUn Xx, Xy: End arbitrary multiprefix min unsigned Xx to active memory Xy in MU n.

OMPADDn Xx, Xy: End ordered multiprefix add Xx to active memory Xy in MU n.
2.3.5 TableGen

TableGen is a language used to describe records holding domain-specific information, in this case information about CPU features, registers, instructions, and calling conventions. TableGen descriptions are used to generate C++ code that can be used in our back end later. Expansion of TableGen into C++ is done during compilation of the back end.

TableGen supports inheritance, so common information shared between all instances of the record we are trying to implement can be stored in the base class of that record type to reduce work. Figure 2.16 shows how registers in the REPLICA back end have been implemented using inheritance. The TableGen definitions look very much like C++ templates.

- Line 1: defines registers in the REPLICA back end.
- Line 8: defines the specialization integer registers (Ri) from REPLICAReg.
- Line 12: defines the integer register R0 from Ri.

     1  class REPLICAReg<string n> : Register<n> {
     2    field bits<6> Num;
     3    let Namespace = "RP";
     4  }
     5
     6  // Registers are identified with 6-bit ID numbers.
     7  // Ri - 32-bit integer registers
     8  class Ri<bits<6> num, string n> : REPLICAReg<n> {
     9    let Num = num;
    10  }
    11
    12  def R0 : Ri<0, "R0">, DwarfRegNum<[0]>;

Figure 2.16: REPLICA registers defined in TableGen.

Multiclass is a special TableGen feature: it lets us define a common name to inherit from that contains several classes. This is heavily used when defining instructions.
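When the back end is compiled, TableGen expands such records into C++. As a rough, hypothetical sketch of the kind of output this produces (in LLVM the generated register definitions land in a *GenRegisterInfo.inc file; the exact contents below are an assumption for illustration, not the real generated file):

    // Hypothetical sketch of TableGen-generated output for the register defs.
    namespace RP {               // from the Namespace field in Figure 2.16
    enum {
      NoRegister = 0,
      R0 = 1,
      // ... one enumerator per register def in the TableGen description
      NUM_TARGET_REGS
    };
    } // namespace RP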
simultaneously. There exist many PRAM variants for how such memory accesses are handled [17]:

- EREW (Exclusive Read, Exclusive Write): several processors may not read or write from/to the same memory cell simultaneously [17].

- CREW (Concurrent Read, Exclusive Write): several processors may read from the same cell simultaneously, but only one processor is allowed to write to the same memory cell in each step [17].

- CRCW (Concurrent Read, Concurrent Write): several processors may read or write to the same memory cell in each step. What happens when several processors write to the same cell varies between CRCW PRAM sub-variants [17].

NUMA (Non-Uniform Memory Access) consists of multiple processors with a local memory bank each [13]. All processors are interconnected with a communication network, so non-local memory accesses will take longer [13].

REPLICA implements the PRAM-NUMA model of computation, which has P processors grouped as PT processor groups [13, 24]. All processors are connected to a shared memory, as in a PRAM, and each group of processors is connected to its own interconnected local memory block, as in NUMA [13, 24]. REPLICA also uses an arbitrary CRCW memory model: all threads participating in a concurrent read will obtain the same result, and for all threads participating in a concurrent write, some arbitrary thread will write to the memory location [15]. REPLICA's multi-prefix instructions can be used to control concurrent operations on shared memory.
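As a toy illustration of the arbitrary CRCW rule (this is our own sketch, not REPLICA code; the step-based model and all names are assumptions): every write issued in a step is collected per cell, and one arbitrary write per cell takes effect.

    #include <cstdio>
    #include <map>
    #include <utility>
    #include <vector>

    // One PRAM execution step: each (cell, value) pair is a pending write.
    // Under the arbitrary CRCW rule, when several processors write the same
    // cell in the same step, some arbitrary one of them succeeds.
    void applyStepArbitraryCRCW(std::map<int, int> &memory,
                                const std::vector<std::pair<int, int>> &writes) {
      std::map<int, std::vector<int>> pending;
      for (const auto &w : writes)
        pending[w.first].push_back(w.second);
      for (const auto &p : pending)
        memory[p.first] = p.second.front();  // "arbitrary": here, the first one
    }

    int main() {
      std::map<int, int> memory;
      // Three processors write cell 0 concurrently; one arbitrary value wins.
      applyStepArbitraryCRCW(memory, {{0, 7}, {0, 8}, {0, 9}});
      std::printf("cell 0 = %d\n", memory[0]);
      return 0;
    }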
multi-prefix instructions to speed up adding all elements of the array to a summation variable:

    #include "replica.h"
    #include "types.h"
    #include "sync.h"

    #define SIZE 32768

    uint32 array[SIZE];
    uint32 sum;

    int main() {
      uint32 i;
      /* Each thread adds its share of the elements to sum with a
         multi-prefix add. */
      for (i = _thread_id; i < SIZE; i += _number_of_threads)
        asm("MADD0 %0, %1" : : "r"(array[i]), "r"(&sum));
      _synchronize();
      _exit();
      return 0;
    }

C.5 threshold

The threshold benchmark is hard-coded for the image used in the benchmark; the size of the image and the image metadata have been calculated beforehand, outside of the program.

    #include "replica.h"
    #include "sync.h"

    #define SIZE 307200

    struct pixel {
      unsigned char r, g, b;
    };

    struct image {
      char meta[54];
      struct pixel data[SIZE];
    };

    struct image inImage;
    struct image outImage;
    unsigned int sum = 0;
    unsigned int sum_ = 0;

    int main() {
      unsigned int i;
      unsigned int psum = 0;
      /* Thread 0 copies the image metadata to the output image. */
      if (_thread_id == 0)
        for (i = 0; i < 54; i++)
          outImage.meta[i] = inImage.meta[i];
      _synchronize();
      /* Sum the color components of all pixels with a multi-prefix add. */
      for (i = _thread_id; i < SIZE; i += _number_of_threads) {
        sum = inImage.data[i].r + inImage.data[i].g + inImage.data[i].b;
        asm("MADD0 %0, %1" : : "r"(sum), "r"(&sum_));
      }
      _synchronize();
      sum = sum_ / SIZE;
      for (i = _thread_id; i < SIZE; i += _number_of_threads) {
        psum = inImage.data[i].r + inImage.data[i].g + inImage.data[i].b;
        if (sum > psum)
Figure 3.1: Architectural overview of the REPLICA back end.

3.1 Code generation

The main objective of this project was to generate assembler code for the REPLICA architecture. LLVM provides a target-independent framework for code generation, where the target architectures only have to add handling and descriptions of target-specific features.

The code generation process starts with the program in LLVM internal representation format and ends with the finished assembler code being printed to a file. During this process the code passes through many stages, where the output of each stage is one step closer to the final assembler code. The following sections detail the code generation process and the steps needed to transform LLVM internal representation into MBTAC assembler.

3.1.1 Instruction selection

LLVM uses a selection DAG (Directed Acyclic Graph) based instruction selector to translate LLVM internal representation to target-specific code. Parts used by the instruction selector are generated from TableGen files; parts that need custom handling are written in C++.

A DAG-based instruction selector basically keeps the program's internal representation in a tree structure and then tries to match subtrees of this to the TableGen-defined instructions. If a match is found, the subtree is converted to the target-specific node of the matched instruction. SelectionDAG
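To make the matching idea concrete, here is a small self-contained sketch of our own (the Node type and the ADDrr/ADDri choice mirror the TableGen patterns in Figure 2.17, but this is not LLVM's actual SelectionDAG API):

    #include <cstdio>
    #include <memory>

    // Toy internal-representation node: a register leaf, an immediate
    // leaf, or an add with two operands.
    struct Node {
      enum Kind { Reg, Imm, Add } kind;
      int value = 0;  // register number or immediate value
      std::unique_ptr<Node> lhs, rhs;
    };

    // Match one subtree against the two add patterns: add(reg, imm)
    // selects ADDri, add(reg, reg) selects ADDrr.
    void selectAdd(const Node &n) {
      if (n.kind != Node::Add) return;
      if (n.lhs->kind == Node::Reg && n.rhs->kind == Node::Imm)
        std::printf("ADDri R%d, %d\n", n.lhs->value, n.rhs->value);
      else if (n.lhs->kind == Node::Reg && n.rhs->kind == Node::Reg)
        std::printf("ADDrr R%d, R%d\n", n.lhs->value, n.rhs->value);
    }

    int main() {
      Node add{Node::Add};
      add.lhs = std::make_unique<Node>(Node{Node::Reg, 1});
      add.rhs = std::make_unique<Node>(Node{Node::Imm, 9});
      selectAdd(add);  // prints "ADDri R1, 9"
      return 0;
    }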
The optimization pass is given a machine function containing several basic blocks, and the algorithm is called once for each basic block in the machine function. The basic block is used to build a dependence DAG, and in this dependence DAG the initial free sub-instructions are found and added to the ready list.

Figure 4.4 contains a flowchart of the algorithm; the numbers on the decision boxes correspond to the numbers in the following list (a sketch of the scheduling loop follows the list):

1. Check if all sub-instructions in the basic block have been scheduled.
   - Yes: the algorithm is finished.
   - No: we need to continue with another loop through the algorithm.

2. Are there any sub-instructions on the priority list?
   - Yes: schedule those sub-instructions. Resources for those sub-instructions were checked by the OP sub-instruction that they depend on. We know that a sub-instruction on the priority list is schedulable because its preconditions and resources were checked in the previous iteration when adding the OP sub-instruction.
   - No: continue.

3. Have we filled all functional units, or are there no more sub-instructions that can be scheduled in this execution step?
   - Yes: finalize the sub-instructions in this execution step and output them to their new places in the basic block, output any inline assembler that now has all its preconditions met, copy all sub-instructions on the delay list to the ready list, and reset the current sub-instruction to the first sub-instruction on the ready list.
   - No: continue with the next step.
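The following is a minimal sketch of this scheduling loop in C++, under the same assumptions as the earlier data-structure sketch; the helper functions are placeholders for illustration, not the pass's real code.

    #include <cstddef>
    #include <vector>

    struct SubInstr;  // sub-instruction node in the dependence DAG

    struct SchedulerState {              // repeated from the earlier sketch
      std::vector<SubInstr *> ready;     // can_be_scheduled_now
      std::vector<SubInstr *> priority;  // must_be_scheduled_now
      std::vector<SubInstr *> delay;     // can_be_scheduled_next
      std::size_t current = 0;
    };

    // Placeholder helpers, assumed for illustration only.
    bool allScheduled(const SchedulerState &s);
    bool stepIsFull(const SchedulerState &s);
    void scheduleNow(SchedulerState &s, SubInstr *si);
    void finalizeExecutionStep(SchedulerState &s);  // emit step; delay -> ready
    bool tryScheduleCurrent(SchedulerState &s);     // may fill priority/delay

    void scheduleBasicBlock(SchedulerState &s) {
      while (!allScheduled(s)) {         // step 1
        while (!s.priority.empty()) {    // step 2: same step as their OP
          scheduleNow(s, s.priority.back());
          s.priority.pop_back();
        }
        if (stepIsFull(s) || s.current >= s.ready.size()) {  // step 3
          finalizeExecutionStep(s);      // close this execution step
          s.current = 0;
          continue;
        }
        if (!tryScheduleCurrent(s))      // otherwise try the current one
          ++s.current;                   // and move on if it does not fit
      }
    }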
OMPSUBn Xx, Xy: End ordered multiprefix subtract Xx to active memory Xy in MU n.
OMPANDn Xx, Xy: End ordered multiprefix and Xx to active memory Xy in MU n.
OMPORn Xx, Xy: End ordered multiprefix or Xx to active memory Xy in MU n.
OMPMAXn Xx, Xy: End ordered multiprefix max Xx to active memory Xy in MU n.
OMPMAXUn Xx, Xy: End ordered multiprefix max unsigned Xx to active memory Xy in MU n.
OMPMINn Xx, Xy: End ordered multiprefix min Xx to active memory Xy in MU n.
OMPMINUn Xx, Xy: End ordered multiprefix min unsigned Xx to active memory Xy in MU n.

SMPADDn Xx, Xy: Send multiprefix add Xx to active memory Xy in MU n.
SMPSUBn Xx, Xy: Send multiprefix subtract Xx to active memory Xy in MU n.
SMPANDn Xx, Xy: Send multiprefix and Xx to active memory Xy in MU n.
SMPORn Xx, Xy: Send multiprefix or Xx to active memory Xy in MU n.
SMPMAXn Xx, Xy: Send multiprefix max Xx to active memory Xy in MU n.
SMPMAXUn Xx, Xy: Send multiprefix max unsigned Xx to active memory Xy in MU n.
SMPMINn Xx, Xy: Send multiprefix min Xx to active memory Xy in MU n.
SMPMINUn Xx, Xy: Send multiprefix min unsigned Xx to active memory Xy in MU n.

B.2 Write back sub-instructions

WBn Xx: Write Xx to register Rn.

B.3 ALU sub-instructions

ADDn Xx, Xy: Add Xx to Xy in ALU n.
SUBn Xx, Xy: Subtract Xx from Xy in ALU n.