User Manual
Contents
1. Jacquard Computing 2012. 4 C Code Construction

    typedef struct
    {
      int A0_in;
      int A1_in;
      int A2_in;
      int A3_in;
      int A4_in;
      int result_out;
    } FIR_t;

    FIR_t FIR(FIR_t t)
    {
      const int T[5] = { 3, 5, 7, 9, 11 };
      t.result_out = t.A0_in * T[0] + t.A1_in * T[1] + t.A2_in * T[2] +
                     t.A3_in * T[3] + t.A4_in * T[4];
      return t;
    }

Figure 38: Legacy Module Code

    typedef int ROCCC_int12;
    void Test(ROCCC_int12 a_in, ROCCC_int12& b_in)
    {
      // ...
    }

Figure 39: Declaring And Using A Twelve Bit Integer Type

4.9 Hardware Specific Optimizations

There are several features specific to ROCCC that allow you to create specific hardware and are not reflected in the software. These include bit width specification, systolic array generation, and temporal common subexpression elimination.

4.9.1 Specifying Bit Width

Every integer variable you declare in the C code can have a nonstandard bit width tailored to your application. The supported floating point bit widths are 16, 32, and 64, with the default being 32 bits. The choice of cores instantiated in the datapath will be based upon the bit width of the variables passed to them. Smaller bit width variables will be extended to take advantage of the larger cores unless no such core exists, in which case the variables will be truncated to use the largest core available. The quality and precision of the generated VHDL can vary based upon how the C is specified, so use caution ...
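To get a feel for what a nonstandard bit width means for the values a variable can hold, the following is a minimal software sketch of a twelve bit signed register. It assumes two's complement wraparound on truncation, which the excerpt does not state, and `wrap_to_bits` is a hypothetical helper written for this illustration, not part of ROCCC.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not a ROCCC function): keep the low `bits` bits of
   `value` and sign-extend them, as a two's complement register of that
   width would. Wraparound on overflow is an assumption of this sketch. */
int32_t wrap_to_bits(int32_t value, unsigned bits)
{
    uint32_t mask, low, sign;
    assert(bits >= 1 && bits <= 32);
    if (bits == 32)
        return value;
    mask = (1u << bits) - 1u;          /* e.g. 0xFFF for 12 bits */
    low = (uint32_t)value & mask;      /* truncate to the narrow width */
    sign = 1u << (bits - 1);           /* position of the sign bit */
    /* (low ^ sign) - sign sign-extends an n-bit pattern to 32 bits. */
    return (int32_t)(low ^ sign) - (int32_t)sign;
}
```

With this model, a twelve bit variable holds values from -2048 to 2047, so `wrap_to_bits(2048, 12)` yields -2048 while `wrap_to_bits(2047, 12)` is unchanged.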
2. 3 GUI

[Figure 9: Creating a New Project. Screenshot of the ROCCC menu with the entries Build, New (Project), Add, Import (Module, System), View, Manage, Generate, Settings, and Help.]

may also be set on a page by page basis by selecting the Save Values As New Defaults button.

[Figure 10: High Level Optimizations Page. Screenshot of the dialog "Select High Level Compiler Flags for BitWidthNew.c", which notes that all loops will automatically be fully unrolled. The list of High Level Compile Optimizations (MultiplyByConstElimination, DivisionByConstElimination, InlineAllModules, InlineModule, Redundancy) sits beside the Selected Flags list, with Add, Edit, and Remove buttons. The description shown for MultiplyByConstElimination reads: "Replace all integer multiplications by constants with equivalent shifts and additions." Its example rewrites a = b * 16 as a = b << 4, and a = b * 3 + a * 7 as a = ((b << 1) + b) + ((a << 2) + a + a + a). Buttons: Save Values as New Defaults, Set All To Default Value, Back, Next, Cancel, Finish.]

On the first page you can select which high level compiler optimizations to perform. Depending on whether you are compiling a module or a system, you will see a different list of available optimizations to choose from. The second page availabl...
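The shift-and-add rewrites that the MultiplyByConstElimination description refers to can be checked in ordinary C. The sketch below is only an illustration of the arithmetic identity; the function names are invented here, and unsigned operands are used because left-shifting negative values is undefined in C.

```c
#include <assert.h>

/* b * 16 as a single shift. */
unsigned times16(unsigned b) { return b << 4; }

/* b * 3 as one shift and one addition. */
unsigned times3(unsigned b) { return (b << 1) + b; }

/* a * 7 as one shift and three additions. */
unsigned times7(unsigned a) { return (a << 2) + a + a + a; }

/* Compare each rewrite against the plain multiplication for 0..limit-1. */
void check_rewrites(unsigned limit)
{
    unsigned v;
    for (v = 0; v < limit; ++v) {
        assert(times16(v) == v * 16u);
        assert(times3(v) == v * 3u);
        assert(times7(v) == v * 7u);
    }
}
```

Calling `check_rewrites(1000)` runs the comparison over the first thousand values; in hardware the benefit is that no multiplier needs to be instantiated.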
3. [Figure 31: System Code That Results in a Summation Reduction. (a) Code fragment, truncated at the start: a loop with i < 100 that accumulates A[i] into output and then assigns output to final. (b) Datapath with the input A[i], the accumulated value output, and the result final.]

...FIR system code shown in Figure 32a. Note that instead of a black box, the top level design has all of the individual operations exposed and may perform additional optimizations on this code.

4.5 Control Flow

ROCCC code supports arbitrary if statements through predication. The quality of the generated circuit is directly affected by the use of predication, so care should be taken in constructing the C code to minimize logic.

In the simplest case, an if statement that determines one of two values to store into a variable will be translated into a boolean select statement. Figure 33 shows the transformation undergone from original C code to intermediate representation and finally to the generated hardware. If statements written in exactly this way will always result in a mux in the generated hardware. All other combinations of if statements will be reduced to this form through predication.

If there are any paths through which a variable might not be initialized, the generated hardware will either choose a default value of 0 or create a feedback variable that requires an initial value. An example of this is shown in Figure 34. The variable x is only assigned if the expression value > 5 is true. In the generated hardware we must assign a value to x regardless of the expression...
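The two cases described for if statements can be sketched in plain C. This is a software analogy under the assumptions stated in the comments, not ROCCC output; all function names are hypothetical.

```c
#include <assert.h>

/* Both paths assign x: the form the text says always becomes a mux. */
int select_if(int value, int a, int b)
{
    int x;
    if (value > 5)
        x = a;
    else
        x = b;
    return x;
}

/* The same computation written as an explicit predicate plus select. */
int select_mux(int value, int a, int b)
{
    int pred = (value > 5);
    return pred ? a : b;
}

/* x is assigned on only one path; the missing path takes a default of 0. */
int partial_default_zero(int value, int a)
{
    int x = 0;
    if (value > 5)
        x = a;
    return x;
}

/* x is assigned on only one path; the missing path keeps the previous
   value, which is what makes x a feedback variable. */
int partial_feedback(int value, int a, int x_prev)
{
    return (value > 5) ? a : x_prev;
}

/* The if form and the select form agree for any inputs. */
void check_select_forms(int value, int a, int b)
{
    assert(select_if(value, a, b) == select_mux(value, a, b));
}
```

Writing both branches explicitly, as in `select_if`, avoids relying on either implicit default.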
4.
          b : IN std_logic_VECTOR(31 downto 0);
          operation_rfd : OUT std_logic;
          clk : IN std_logic;
          ce : IN std_logic;
          result : OUT std_logic_VECTOR(31 downto 0));
      END component;
      signal inv_stall : STD_LOGIC;
    begin
      inv_stall <= not stall; -- when we need to stall, we just stop enabling the clock
      U0 : fp_div_gen32 port map (a => a, b => b, clk => clk, ce => inv_stall, result => result);
    end Behavioral;

Figure 61: Wrapper for the Theoretical 32 bit Floating Point Divide

    void SystemCode(int** A, int** B)
    {
      int i;
      int j;
      int x;
      x = 5;  // Ignored
      for (i = 0; i < 10; ++i)
        for (j = 0; j < 10; ++j)
          B[i][j] = A[i][j] + x;  // Only statement translated into hardware
      x = B[9][9];  // Ignored
    }

Figure 62: System Code Sections Translated Into Hardware

6 Generated Specific Hardware Connections

Some optimizations may also create additional input scalars that do not appear in the C code. This section describes in detail how input and output scalar ports are derived from the written C code.

6.1 Basic Assumptions

When compiling systems, we only translate the body of the innermost loop (after all loop unrolling has occurred) into hardware. This means that any initialization or arbitrary code before or after the loop is ignored. For example, the code in Figure 62 will not translate the statements before or after the loop nest unless all ...
5. ...'s result, and so we assign a default value. In modules this default value is 0, while in systems the default value is the variable itself, which will introduce a feedback variable in a way that the user might not have expected.

4.6 Look Up Tables

ROCCC detects arrays declared locally in functions and translates them into hardware lookup tables. On the hardware, these lookup tables will be placed in BRAMs if any are available, or created as logic if no BRAMs are available or if the underlying platform does not support BRAMs. Depending on the available resources, the hardware actually used might be larger than what the C code implies; for example, we might round up to the nearest power of two in order to allocate actual BRAM resources.

Lookup tables should be declared as a non-const array of any type and support random run-time access, as shown in Figure 35. The small example in Figure 35 shows the declaration of a local lookup table that consists of 10 initialized elements. Every clock cycle, the generated code will read a value from the input stream A and use that value to determine which element of the lookup table to access. If the value in stream ...

    void FIRSystem(int* A, int* B)
    {
      int i;
      int tmp;
      for (i = 0; i < 10; ++i)
      {
        // Module instantiation
        FIR(A[i], A[i+1], A[i+2], A[i+3], A[i+4], tmp);
        B[i] = tmp;
      }
    }

(a) C Code ... (c) After Inlining Fig...
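Since Figure 35 itself is not reproduced in this excerpt, here is a minimal sketch of the lookup table pattern described above: a local, non-const, initialized array of 10 elements indexed by values read from the input stream A. The function name, signature, and table contents are assumptions made for illustration and are not taken from the manual.

```c
#include <assert.h>

#define TABLE_SIZE 10

/* Hypothetical system-style function; name and parameters are illustrative. */
void lookup_stream(const int *A, int *B, int n)
{
    /* Local, non-const, initialized array: the form the text says is
       detected and turned into a hardware lookup table. */
    int table[TABLE_SIZE] = { 0, 1, 4, 9, 16, 25, 36, 49, 64, 81 };
    int i;
    for (i = 0; i < n; ++i) {
        /* Software-only guard: the index must stay inside the table. */
        assert(A[i] >= 0 && A[i] < TABLE_SIZE);
        B[i] = table[A[i]];
    }
}
```

The bounds check exists only in this software model; it makes explicit that every stream value is expected to be a valid index.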
6.
    // Example code with feedback
    void MaxSystem(int N, int* A, int& final)
    {
      int i;
      int currentMax;
      for (i = 0; i < N; ++i)
      {
        if (A[i] > currentMax)
          currentMax = A[i];
        else
          currentMax = currentMax;
      }
      final = currentMax;
    }

Figure 30: (a) System Code That Contains Feedback and (b) Generated Hardware. [Panel (b) shows a datapath fed by A[i] and currentMax_init, with currentMax fed back and final as the output.]

This feedback may potentially decrease circuit throughput if the C code requires the feedback variable to be determined at the bottom of the pipeline and used at the top of the pipeline. Feedback variables are not output at the end of computation; if you wish to have the final value output, you must assign a separate output variable, as shown in Figure 30b.

4.3.4 Summation reduction

A special case of feedback variables is a summation reduction. When the feedback detected is purely performing a summation reduction, the feedback can be performed in one clock cycle and does not necessarily affect the throughput of the circuit. An example of the code recognized as a summation reduction is shown in Figure 31a. The hardware generated, as shown in Figure 31b, will contain a datapath that handles the feedback internally and can support full throughput on the data streams.

4.4 Instantiating Modules

Both module code and system code can instantiate other modules to be integrated directly i...
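The difference between general feedback and a summation reduction is easiest to see side by side. The sketch below is ordinary C with illustrative names; it models the loop-carried dependence only, not the generated pipeline.

```c
#include <assert.h>

/* General feedback: each iteration compares against the value produced by
   the previous iteration. The result is copied out through the return
   value, mirroring the separate output variable the text calls for. */
int running_max(const int *A, int n, int init)
{
    int currentMax = init;   /* initial value of the feedback variable */
    int i;
    assert(n >= 0);
    for (i = 0; i < n; ++i) {
        if (A[i] > currentMax)
            currentMax = A[i];
    }
    return currentMax;
}

/* Summation reduction: the only loop-carried operation is an addition
   into the accumulator. */
int running_sum(const int *A, int n)
{
    int output = 0;
    int i;
    assert(n >= 0);
    for (i = 0; i < n; ++i)
        output = output + A[i];
    return output;
}
```

Both loops carry one value between iterations; the second is the restricted shape the text identifies as a summation reduction.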
7. ...Figure 40. The current version of systolic array generation only transforms a precise software architecture into a specific instance of a systolic array. The code must have a single two dimensional array where the value of every cell is based upon some function of the cells located to the north, west, and northwest. Optionally, the C code may have a constant array of values based upon the outer loop bounds and a single dimensional input array based upon the loop bounds of the innermost loop, as seen in the Smith Waterman example. Any other software architecture is not currently supported for the systolic array generation optimization.

After transformation, the resulting hardware will expect a one dimensional input array A_input and produce a one dimensional output array A_output in place of the original two dimensional array. The input stream A_input should be the values of the topmost row of the original two dimensional array. The output stream A_output will generate the bottom row of the original two dimensional array. All of the intermediate values are discarded and not output in the generated hardware structure. Additionally, the first column of the original two dimensional array must be passed in as scalars to the resulting hardware.

4.9.3 Temporal Common Subexpression Elimination

Temporal common subexpression elimination (TCSE) analyzes loops and detects common code across loop iterations. For example, if the same value is calculated in l...
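As a rough software picture of the array shape described above, the sketch below fills a small two dimensional array in which every interior cell depends on its north, west, and northwest neighbours, takes the top row and first column as inputs, and returns only the bottom row. The cell function (maximum of the three neighbours plus one), the 4x4 size, and all names are assumptions chosen for the example.

```c
#include <assert.h>

#define ROWS 4
#define COLS 4

int max3(int a, int b, int c)
{
    int m = a > b ? a : b;
    return m > c ? m : c;
}

/* top_row plays the role of the input stream, first_col the scalars, and
   bottom_row the output stream. Intermediate rows are discarded. */
void wavefront(const int top_row[COLS], const int first_col[ROWS],
               int bottom_row[COLS])
{
    int A[ROWS][COLS];
    int i, j;
    /* The corner cell belongs to both the top row and the first column. */
    assert(top_row[0] == first_col[0]);
    for (j = 0; j < COLS; ++j)
        A[0][j] = top_row[j];
    for (i = 0; i < ROWS; ++i)
        A[i][0] = first_col[i];
    for (i = 1; i < ROWS; ++i)
        for (j = 1; j < COLS; ++j)
            A[i][j] = max3(A[i - 1][j], A[i][j - 1], A[i - 1][j - 1]) + 1;
    for (j = 0; j < COLS; ++j)
        bottom_row[j] = A[ROWS - 1][j];
}
```

Because each cell needs only its three already-computed neighbours, cells along an anti-diagonal are independent of each other, which is the property a wavefront or systolic implementation exploits.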
8. ROCCC generated code expects the data it receives to arrive in exactly the order requested. When data is ready and the full port is not currently high, the data must be placed on the input data port(s) and write enable must be asserted and held high for a clock cycle. As long as full remains low, write enable can be kept high and data can be put onto the data port(s) every clock cycle. The number of outstanding memory requests generated by the ROCCC generated code is unbounded; if the interfacing code can only handle a limited number of outstanding memory requests, it is up to that code to bring address stall high when that limit is reached.

If the user has specified that a given stream is a multi-channel stream, then it is necessary to set all channels of the input with valid data before asserting the write enable signal. The channels in the ROCCC generated code are numbered from 0 to N, and it is up to the user generated interfacing code to place the oldest data in channel 0, the second oldest data in channel 1, and so on. Once all channel data has been fetched, if data_full is not high, the interfacing code should set write enable high and hold it high for one clock cycle. An example of this timing protocol is shown in Figure 51.

Jacquard Computing 2012. 5 Interfacing With Generated Hardware

[Figure 51: Reading From A Stream With Multiple Channels. Timing diagram; the recoverable labels are full_out and two address channel outputs carrying addr2, addr3, and addr4.]

5.2.3 Output S...
9. ...complex numbers are represented as two 64 bit values for the real and imaginary part.

FFT4: The FFT4 module uses several instantiations of the FFT2 module in order to create 4 pairs of FFT base calculations happening in parallel.

FFT8: The FFT8 module uses several instantiations of the FFT4 module and connects them in the butterfly configuration in order to perform the fast Fourier transform on eight input complex numbers and generate eight complex output numbers.

MaxFilter: The MaxFilter module takes three input integers and returns the maximum value amongst them.

SaturatingAdd: This example module shows the usage of bit width by performing a saturating addition. The inputs and outputs to the system are eight bits long, but the internal calculations use a nine bit number in order to prevent overflow. If the result of the 8 bit addition is outside the range of an 8 bit number, then the value returned from the module will be the maximum value of an eight bit number.

SingleSWCell: This example performs the calculations necessary for a single cell of a wavefront algorithm like Smith Waterman. This code can then be used as a module in a larger systolic array generation.

System Examples

The system examples are listed here.

IntegralImage: This system calculates the integral image on a two dimensional input stream. The integral image is the sum of all the previously seen elements that lie to the north and west of the current el...
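The SaturatingAdd description can be mirrored in portable C. The sketch below is not the shipped example; it assumes unsigned eight bit operands (the excerpt does not say whether the example is signed) and uses a 16 bit intermediate to stand in for the nine bit internal value.

```c
#include <assert.h>
#include <stdint.h>

/* Add two 8-bit values in a wider intermediate, then clamp to the largest
   8-bit value if the sum does not fit. Unsigned operands are an assumption. */
uint8_t saturating_add_u8(uint8_t a, uint8_t b)
{
    uint16_t wide = (uint16_t)(a + b);   /* the true sum needs at most 9 bits */
    assert(wide <= 510u);                /* 255 + 255 is the largest possible sum */
    return wide > UINT8_MAX ? (uint8_t)UINT8_MAX : (uint8_t)wide;
}
```

For example, 200 + 100 would wrap to 44 in plain eight bit arithmetic, whereas `saturating_add_u8(200, 100)` returns 255.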
10. ...data coming out, nor are there any guarantees about the number of times a value may be output. In the previous example, it is easy to see that element B[1] was written in both the first and second iteration of the loop. Elements written multiple times in different loop iterations may actually be written to more than once, or values may be cached to eliminate redundant writes to memory. In any case, it is important not to rely on a particular behavior.

5.3.3 Systolic Arrays

After using the systolic array optimization, two input streams and a set of input registers are created as inputs. The input registers should be loaded with the first column of the matrix, and the top row of the matrix is fed in as a stream. The input array T is also fed in as a stream. Refer to Figure 59 for the relationships between the original two dimensional array and the created registers and input streams.

5.4 Pipelining

Pipelining in ROCCC is guided by user provided weights of basic operations. By varying these numbers, along with a desired clock cycle weight, the user can control the aggressiveness of pipelining. Under ROCCC, the data flow graph representing each loop body initially contains no registers. Registers are then inserted into the data flow graph until no register to register path has a total weight greater than the desired clock cycle weight. In Figure 56, the leftmost mux has a critical path of one addition operation, assuming Weight(add) > Weigh...
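The register insertion rule described above can be illustrated on the simplest possible case, a linear chain of operations. The sketch below is a greedy simplification written for this purpose; ROCCC operates on a full data flow graph and its actual algorithm is not given in this excerpt, so the function and the weights used with it are assumptions.

```c
#include <assert.h>

/* Greedy register placement along a linear chain of operations.
   reg_after[i] is set to 1 when a register is placed after operation i.
   Returns the number of registers inserted. */
int place_registers(const int *weights, int n, int cycle_weight,
                    int *reg_after)
{
    int acc = 0;     /* weight accumulated since the last register */
    int count = 0;
    int i;
    for (i = 0; i < n; ++i) {
        /* A single operation heavier than the budget cannot be split. */
        assert(weights[i] > 0 && weights[i] <= cycle_weight);
        reg_after[i] = 0;
        if (acc + weights[i] > cycle_weight) {
            reg_after[i - 1] = 1;   /* cut the path just before operation i */
            ++count;
            acc = 0;
        }
        acc += weights[i];
    }
    return count;
}
```

With illustrative weights of 23, 42, 23, and 12 and a clock cycle weight of 70, one register is placed after the second operation, leaving two stages of weight 65 and 35. Lowering the clock cycle weight yields more, shorter stages.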
11. [Figure 63: C Code That Infers Ports. The code is only partly recoverable: B is declared as an output stream; a variable that is read before a write in the innermost loop is a feedback variable; internal is written and then read in the innermost loop and is identified as an internal register; the loop is for (i = 0; i < endValue; ++i) and its body refers to internal, y, B[i], A[i], x, z, and A[i+1].]

    entity SystemCode is
    port (
      -- Default signals
      clk : in STD_LOGIC;
      rst : in STD_LOGIC;
      inputReady : in STD_LOGIC;
      outputReady : out STD_LOGIC;
      done : out STD_LOGIC;
      stall : in STD_LOGIC;
      -- Input Stream signals
      A_WClk_in : in STD_LOGIC;
      A_full_out : out STD_LOGIC;
      A_writeEn_in : in STD_LOGIC;
      A_data_channel0_in : in STD_LOGIC_VECTOR(31 downto 0);
      A_address_channel0_base_out : out STD_LOGIC_VECTOR(31 downto 0);
      A_address_channel0_count_out : out STD_LOGIC_VECTOR(31 downto 0);
      A_address_clk_in : in STD_LOGIC;
      A_address_rdy_out : out STD_LOGIC;
      A_address_stall_in : in STD_LOGIC;
      -- Output Stream signals
      B_WClk_in : in STD_LOGIC;
      B_full_out : out STD_LOGIC;
      B_writeEn_in : in STD_LOGIC;
      B_data_channel0_in : in STD_LOGIC_VECTOR(31 downto 0);
      B_address_channel0_base_out : out STD_LOGIC_VECTOR(31 downto 0);
      B_address_channel0_count_out : out STD_LOGIC_VECTOR(31 downto 0);
      B_address_clk_in : in STD_LOGIC;
      B_address_rdy_out : out STD_LOGIC;
      B_addre...
12. ...level data flow graph to minimize the number of registers and decrease the size of the generated hardware.

• FanoutTreeGeneration: Create a tree when the fanout of a register is greater than the threshold. This will slightly increase the latency and size of the generated circuit as a tradeoff for reducing the large fanout.

• MaximizePrecision: Temporary arithmetic results are truncated at every step by default, but this optimization will increase the size of the temporary arithmetic results and only perform rounding as necessary in assignments.

3.9 Add IPCores

When working on a ROCCC project, you may want to integrate hardware modules that you have access to outside of ROCCC. To use such a component, you must insert it into the ROCCC database so that the compiler can incorporate it in this and in future compilations. To do this, select Add IPCore in the ROCCC menu. A window will pop up asking for the details of the component, as shown in Figure 17.

[Figure 17: Add Component Wizard. Screenshot of the Add IPCore dialog, which reads "Please enter the IPCore information. This IPCore cannot already exist in the database." Under Component Details it has fields for Component Name and Latency and a Ports table with Direction and Size columns.]

First, specify the name and latency of the component. Next, add all of the ports for the added component. You need to specify at least one input port and one output port before you can click Finish ...
13. ... - 1) x (height + 3 - 1). To traverse a memory of size 20x20, the width and height passed in to the hardware both need to be 18 (20 + 1 - 3). When processing the code in Figure 54, both height and width will become input registers and need to be set along with inputReady. Only then is it safe to begin returning valid data to the component's request for window elements; not setting height and width to the correct values will result in the wrong addresses being generated.

5.3.2 Output Streams

The memory layout for output streams follows the same rules as the memory layout of input streams. The window size and the for loop end values will both be used to calculate the address of each value's location in memory. For the code in Figure 55, the first iteration through the loop will calculate B[i], B[i+1], and B[i+2] with i = 0, and so the outputted addresses for B[i], B[i+1], and B[i+2] will be 0, 1, and 2 respectively. On the second iteration through the loop, i = 1, so the outputted addresses for B[i], B[i+1], and B[i+2] will be 1, 2, and 3 respectively. Multi-dimensional code works similarly.

    for (i = 0; i < 5; ++i)
    {
      B[i]   = A[i]   + A[i+1];
      B[i+1] = A[i+1] + A[i+2];
      B[i+2] = A[i+2] + A[i+3];
    }

Figure 55: C Code That Writes To Three Locations In The Same Stream Each Loop Iteration

One note to make is that there are no guarantees made about the order of ...
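The arithmetic in this passage is small enough to write down directly. The helpers below are hypothetical names for the two relationships the text uses: the loop bound to pass in for a given memory dimension and window size, and the address written for an access of the form B[i + offset] in a one dimensional stream.

```c
#include <assert.h>

/* Loop bound to pass to the hardware so that a window of `window` elements
   stays inside a memory dimension of `memory` elements. */
int loop_bound(int memory, int window)
{
    assert(window >= 1 && memory >= window);
    return memory - window + 1;
}

/* Inverse relationship: memory dimension implied by a loop bound. */
int memory_dim(int bound, int window)
{
    assert(window >= 1 && bound >= 1);
    return bound + window - 1;
}

/* Address of B[i + offset] in a one-dimensional output stream. */
int output_address(int i, int offset)
{
    assert(i >= 0 && offset >= 0);
    return i + offset;
}
```

For the 20x20 memory and 3x3 window in the text, `loop_bound(20, 3)` gives 18 for both width and height, and `memory_dim(18, 3)` recovers 20.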
14. [Figure 47: Block Diagram Of A Generated Module. The block BitWidthTest has the inputs clk, rst, stall, inputReady, and A_in, and the outputs C_out, outputReady, and done.]

...valid. As long as the inputReady signal is high, input scalars will be read on every rising edge of the clock. Setting the input scalars to valid data and setting inputReady high should be the first thing done by any interfacing code.

• outputReady: The outputReady port goes high when valid data is placed on the output scalar ports of the hardware. The output data is valid simultaneously with the outputReady signal being high.

• done: The done port goes high when the hardware generated by ROCCC has finished processing all of the input it was designed to process, and remains high until the reset signal is asserted.

• stall: The stall port is used by the interfacing code to stall the pipeline of the generated hardware.

5.1.2 Input And Output Ports

In addition to the default ports, input and output data ports will be generated by ROCCC. These may correspond either to single registers or to streams.

• Registers: For each input register, a single data port will be generated. When generating modules, all inputs are treated as registers. When generating systems, any single variable that acts as input to the main loop will be treated as an input register. For each output register, a single data port will be generated. When generating modules, al...
15. [Figure 11: Low Level Optimizations Page. Screenshot listing the low level optimizations CopyReduction, FanoutTreeGeneration, MaximizePrecision, and ArithmeticBalancing. The description shown for ArithmeticBalancing reads: "Parallelizing optimization that converts chains of arithmetic operations into parallel arithmetic operations." Its example contrasts a chained operation before arithmetic balancing with the result after arithmetic balancing.]

[Figure 12: Basic Control of the Pipelining Phase. Screenshot of the "Frequency vs Area Tuning" page ("Set the values of the operations and fanout to tune the hardware to the target platform") in Basic mode, with a Pipelining Amount slider and the message "This will attempt to put on average 3.3333333 operations per pipeline stage."]

[Figure 13: Advanced Control of the Pipelining Phase. The same page in Advanced mode, with the Pipelining Amount slider running from Compacted to Pipelined and a table of Basic Operation Weights: Add 23, Sub 23, Mult 42, OR 17, Compare 21, Copy 12, Shift 12, AND 17, Mux 14, XOR 17. Max Unregistered Fanout is set to 100.]

ex...
16. Jacquard Computing Inc. ROCCC 2.0 User's Manual, Revision 0.7.4, June 1, 2012

Contents

1 Changes
1.1 Revision 0.7.4 ...
1.2 Revision 0.7 Added Features
2 Installation
3 GUI
...
3.2 Preparing the GUI for using ROCCC
3.3 GUI Menu Overview
...
3.3.2 ROCCC Toolbar
3.3.3 ROCCC Context Menu
3.4 Loading the Example Files
...
3.6 Creating a New ROCCC Project
3.7 Build to Hardware
3.8 Compiler ...
3.8.1 System Specific Optimizations
3.8.2 Optimizations for both Systems and Modules
3.8.3 Low Level Optimizations
...
3.10 Create New Module
...
17. ...
Reset ...
3.17 Testbench Generation
...
4 C Code Construction
4.1 General Code Guidelines
...
4.3.1 Windows and Generated Addresses
...
Feedback detection
Summation reduction
4.4 Instantiating Modules
...
4.7 Composed System Code
...
Legacy Module Code
Legacy System Code
4.9 Hardware Specific Optimizations
4.9.1 Specifying Bit Width
4.9.2 Systolic Array Generation
4.9.3 Temporal Common Subexpression Elimination
4.9.4 Arithmetic Balancing
4.9.5 Copy Reduction
4.9.6 Fanout Tree Generation
...
5 Interfacing
18. ...Single Write LUT

...A lies out of bounds of the lookup table in C (i.e., is greater than or equal to 10), then the hardware will return an undefined value. Multidimensional lookup tables are also supported, as shown in Figure 36.

Lookup tables generated by ROCCC support an arbitrary number of reads but currently support only one total write to each lookup table in the C code, which corresponds to one write in the generated pipeline. Care must be taken when applying optimizations, as unrolling would increase the number of writes to lookup tables, resulting in incorrect code.

4.7 Composed System Code

ROCCC now has preliminary support for the composition of system code into larger systems. The code shown in Figure 37 allows for the composition of systems in a manner very similar to constructing modules from other modules. The code itself is a system and as such may be used in larger composed systems.

As shown in Figure 37, input streams are declared and passed as normal. Output streams to composed systems must be identified by being passed by reference. Any intermediate variables or streams that need to be declared to connect two systems are not accessible outside the composed system. Figure 37 ...

    #include "roccc-library.h"

    void SystemToSystem(int* A, int*& B)
    {
      int* internal;
      PassThrough(A, 96, internal);
      PassThrough(internal, 96, B);
    }

Figure 37: Composite Systems
19. Figure 8: IP Cores View

3.6 Creating a New ROCCC Project

To start using ROCCC with your own code from scratch, you first need to create a new project. To create a new project, select ROCCC > New > Project, as shown in Figure 9. A window will pop up asking for the name of the new project. Type in the desired name of the project and press Ok. Once that is done, a new project will show up in the project explorer with the name you chose. From there, you add modules or systems either by importing them from existing files or by creating new ones from scratch. To import premade modules or systems into the project, use Import Module and Import System under the ROCCC menu. To create new modules or systems to be added to the project, use New Module and New System under the ROCCC menu.

3.7 Build to Hardware

Once a ROCCC module or system is ready to be compiled into VHDL code, use the Build command. To do this, open the desired module or system inside the Eclipse editor and select the Build command in the ROCCC menu or ROCCC toolbar. A window will then open asking which high level compiler optimizations to use, as in Figure 10. The build window consists of several pages which control different levels of compiler optimizations. The user may select Finish at any time, and any pages not modified will use the default values. The default values ...
20. ...
5 ROCCC Context Menu
6 Importing the Examples
7 The ROCCCExamples Project
8 IP Cores View
9 Creating New Project
10 High Level Optimizations Page
11 Low Level Optimizations Page
12 Basic Control of the Pipelining Phase
13 Advanced Control of the Pipelining Phase
14 Stream Accessing Management Page
15 ...
16 VHDL subdirectory Created
17 Add Component Wizard
18 New Module Wizard
19 Module Skeleton Code for ...
20 New System Wizard
21 System Skeleton Code for WithinBounds
22 ...
23 Testbench Generation
24 Dependent Files Window
25 (a) Module Code in C and (b) generated hardware
26 (a) Using a loop in module code and (b) resulting hardware
27 (a) System Code in C and (b) generated hardware
28 Accessing a ...
29 A system with a three dimensional input and output stream
30 (a) System Code That Contains Fe...
21. ...With Generated Hardware
5.1 Port Descriptions
5.1.1 Default Ports
5.1.2 Input And Output Ports
5.2 Interfacing Protocols
5.2.1 Input Registers
...
5.2.4 Output ...
...
5.3 Memory Organization
5.3.1 Input Streams
5.3.2 Output Streams
5.3.3 Systolic Arrays
...
Fanout Reduction
...
6 Generated Specific Hardware Connections
6.1 Basic Assumptions
6.2 Values created by optimizations
7 Examples Provided
7.1 Module Examples
7.2 System Examples

List of Figures

1 Copying the Plugins into Eclipse
2 Location of the ROCCC 2.0 Preferences
3 The ROCCC Preferences Page
4 ROCCC ...
22. ...ally designed for that purpose. These operations are significantly more complex than simple operations such as addition, and because there are several ways to implement division, the synthesis tool does not blindly infer a solution. In order to simulate or synthesize code generated with ROCCC that uses integer division, integer modulus, or floating point operations, it is necessary to create and include an intrinsic component in your simulation or synthesis project.

[Figure 56: Basic Dataflow. Legend: Register Boundary, Operation, Dataflow.]

[Figure 57: Medium Dataflow]

[Figure 58: High fanout (a) before registering and (b) after registering]

If you want to use floating point, it is generally necessary to find an ipcore for each of the operations you need. Xilinx has the CoreGen utility to provide ipcores, and it is probably also possible to find free ipcores on a site like http://opencore.org. Once you have found an ipcore that implements the operations you require, you will want to utilize it in your project. Traditionally, this would be done by instantiating it in the code that requires it, with each ipcore having slightly different requirements. For example, one divide core may have a reset or enab...
23. ...can take as long as O(E), where E is the number of edges in the use-def graph, although in practice a minimum is found quickly.

4.9.6 Fanout Tree Generation

When compiling high level code, the amount of parallelism that is generated in hardware may not be readily apparent. High fanout can seriously affect the clock rate or area of the generated circuit, and so we have added user control to specify the maximum allowable fanout for any register in the generated circuit. If the fanout exceeds this number, ROCCC generates a tree of registers in separate pipeline stages, increasing the latency but shortening the clock period and simplifying the routing.

    for (i = 0; i < 5; ++i)
    {
      for (j = 0; j < 5; ++j)
      {
        row1 = A[i][j]   + A[i][j+1]   + A[i][j+2];
        row2 = A[i+1][j] + A[i+1][j+1] + A[i+1][j+2];
        row3 = A[i+2][j] + A[i+2][j+1] + A[i+2][j+2];
        B[i][j] = row1 + row2 + row3;
      }
    }

Figure 43: System Code That Accesses a 3x3 Window

Figure 44: 3x3 Smart Buffer Sliding Along a 5x5 Memory

4.9.7 Smart Buffers

When generating code for systems, array accesses are analyzed for possible reuse between loop iterations. These reuse patterns can be exploited to reduce the number of off chip memory accesses. The generated hardware will contain Smart Buffers to exploit the reuse between loop iterations, which internally consist of registers that cache the portion of memor...
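The saving from window reuse can be estimated with simple counting. The sketch below compares memory reads for a window that slides one column at a time along a single row of positions, with and without keeping the overlapping columns cached. It is a back-of-the-envelope model with invented function names; it says nothing about how ROCCC's Smart Buffers are implemented, and reuse between rows is not modelled.

```c
#include <assert.h>

/* Reads needed when every window position refetches the whole window. */
int reads_without_reuse(int positions, int win_rows, int win_cols)
{
    assert(positions >= 0 && win_rows > 0 && win_cols > 0);
    return positions * win_rows * win_cols;
}

/* Reads needed when the overlapping columns stay cached: one full window
   for the first position, then one new column per step. */
int reads_with_reuse(int positions, int win_rows, int win_cols)
{
    assert(positions >= 0 && win_rows > 0 && win_cols > 0);
    if (positions == 0)
        return 0;
    return win_rows * win_cols + (positions - 1) * win_rows;
}
```

A 3x3 window has three horizontal positions in a five element wide memory, so one pass along a row costs 27 reads without reuse and 15 with it.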
24. ...apath. Again, prefer to handle full memory in the stream interface and not with the stall signal.

[Figure 53: Timing Diagram Of The End Of A System's Processing. The recoverable signal label is outputReady.]

    for (i = 0; i < height; ++i)
    {
      for (j = 0; j < width; ++j)
      {
        MAX(window[i][j],   window[i][j+1],   window[i][j+2],   maxCol1);
        MAX(window[i+1][j], window[i+1][j+1], window[i+1][j+2], maxCol2);
        MAX(window[i+2][j], window[i+2][j+1], window[i+2][j+2], maxCol3);
        // Find the maximum of the three columns
        MAX(maxCol1, maxCol2, maxCol3, finalOutput);
      }
    }

Figure 54: C Code For MaxFilterSystem, Which Uses A 3x3 Window

5.3 Memory Organization

5.3.1 Input Streams

Input streams will generate a base address and a count for each chunk of contiguous memory they access. Both one dimensional and two dimensional streams generate addresses for each requested value, and it is up to the interfacing code to decide how to treat these values.

The addresses that are generated by the system when accessing memory assume an input memory of a certain size. This assumed size is based on several factors, including both the window size of the input and the size of the for loops driving the window. For example, given the C code for MaxFilterSystem as shown in Figure 54, the window size is 3x3 and the for loop size is width x height. Given these values, the input memory size is (width + 3 ...
25. c-library.h: Open the roccc-library.h file in the default editor.

Manage
• Manage Intrinsics: Open the intrinsic window to add, edit, or delete intrinsics.

Generate
• Generate PCore Interface: Generate a PCore for a ROCCC module.
• Testbench: Generate a hardware testbench file for a ROCCC component.

Settings
• Reset Database: Reset the database back to its installation configuration.
• Preferences: Open the preference page to manage preferences.

Help
• User Manual: Opens the ROCCC user manual.
• Load Examples: Loads the ROCCC examples in an Eclipse project.
• Check for Updates: Check if a new version of ROCCC is available.
• ROCCC Webpage: Opens the ROCCC webpage.
• About ROCCC: View which version of ROCCC you are using.

Figure 4: ROCCC Toolbar
26. calars

Output scalars are driven when outputReady is driven. The number of clock cycles before outputReady goes high after driving inputReady is based on the delay of the pipeline. Code that interfaces with systems should ignore outputReady; if values are to be sampled every iteration of the loop, then a stream should be used. System code that properly uses output scalars should only be interested in a final value, which will be valid when done goes high, not when outputReady goes high.

5.2.4 Output Streams

Output streams have four ports dealing with address generation and three that deal with output data (assuming 1 channel for both): address_stall, address_rdy, address_channel_base, address_channel_count, data_empty, data_readEn, and data_channel. As long as the address_stall port is not held high, addresses will be generated until the total number of addresses has been written. When an address is being generated, the address_rdy port will be brought high and the address_channel_base port will hold the base address of the values needed. The address_channel_count port holds the number of consecutive elements from that base address that are being written; it is up to the end user to use these two values to derive the memory locations needed. The address_rdy port will only be held high for one clock cycle for each individual address being generated. If addresses are being generated in consecutive clock cycles, the address_rdy port will be continu
27. composites two PassThrough systems, which take an input stream and a number and return an output stream. The input to the composed system, A, is mapped to the input of the first internal PassThrough system. The output stream of the first PassThrough system is mapped to the internal stream internal and flows into the second PassThrough system. The output of the second PassThrough system is mapped to the output stream of the composite system, B.

Individual systems in composite systems may be specified as redundant using a label. When a system is made redundant, several copies are instantiated and each input stream must be split into separate streams for each instance. Similarly, all of the output streams from the multiple copies must be merged into a stream voter. We have added support for these two constructs as intrinsics, so it is the user's responsibility to provide ROCCC with the appropriate stream splitters and voters, and ROCCC will instantiate them as appropriate.

4.8 Legacy Code

In previous versions of ROCCC, modules and systems were coded slightly differently. We still support compilation of legacy code, although newer features such as inlining are not supported for legacy code, and mixing legacy code with new-style code may cause problems in the future as legacy code is deprecated.

4.8.1 Legacy Module Code

Legacy module code must define both an interface and implementation. The interface is described as a struct that identifie
28. d Features

• Look-up tables can now be specified and provide run-time random read and write access.
• Certain aspects of the calculation of generated addresses are now pipelined deeper, resulting in faster achievable clock speeds.
• Preliminary support for composed systems is now available.
• In composed systems, redundancy for component systems is supported.
• New intrinsics are supported for handling streams in composed systems, including stream voters and stream splitters.
• Port for Convey HC-1 now available.
• Port for Pico M501 now available.

2 Installation

Installation and execution of ROCCC has been tested on the following systems:

• 32-bit Ubuntu Linux
• 64-bit Ubuntu Linux
• 32-bit CentOS Linux
• 64-bit CentOS Linux
• 64-bit OpenSuse Linux
• Macintosh Snow Leopard
• Macintosh Lion

Other systems are not supported. ROCCC is distributed as a binary package and requires Eclipse 3.5.1 or higher. When uncompressed, the ROCCC distribution folder should have the following directories:

• Documentation: The location of this user manual and any additional manuals.
• Examples: A directory to be imported into the Eclipse framework that contains all of the example code.
• GUI: The location of the Eclipse plugin jar files.
• License: The default location where the ROCCC license file should be placed.
• LocalFiles: The location of one user's database and local information.
• ReferenceFiles: A directo
29. de for WithinBounds

3.12 Import Module

To add an existing ROCCC module C file to the project you are working on, use the Import Module command. First, have a valid project opened to import the module into. Next, click Import Module under the ROCCC menu. This opens a window asking for the file to import. Browse for the desired ROCCC module file, type the name of the module you are importing, and select which project to import the module into. Once finished, click the Finish button at the bottom; the selected module will be imported into the project and will show up in the Project Navigator view. This does not add the module to the database; it solely adds the module C code to the project.

3.13 Import System

To add an existing ROCCC system C file to the project you are working on, use the Import System command. First, have a valid project opened to import the system into. Next, click Import System under the ROCCC menu. This opens a window asking for the file to import. Browse for the desired ROCCC system file, type the name of the system you are importing, and select which project to import the system into. Once finished, click the Finish button at the bottom and the selected system will be imported into the project and
30. e when compiling asks which low-level compiler optimizations to use, as in Figure 11. These flags are the same regardless of whether you are compiling a module or a system. The third optimization page controls the extent of pipelining in the generated hardware. As shown in Figure 12, the pipelining may be controlled with a slider that adjusts the generated pipeline from fully pipelined on the left to fully compacted on the right. When fully pipelined, every operation will be placed into a separate pipeline stage, resulting in the largest area but fastest clock. When fully compacted, the compiler will attempt to put every operation into one pipeline stage, resulting in the slowest clock speed but smaller area. When fully compacting code, instantiated modules will retain their delay.

However, not all operations take the same amount of time to execute. Having the compiler naively pack operations together without considering how expensive each operation is would give inconsistent results across different components. Because of this, ROCCC allows you to specify weight values for each basic operation in the advanced mode, as shown in Figure 13. A larger weight means that operation is more
31. edback and (b) Generated Hardware
Figure 31: System Code That Results in a Summation Reduction
Figure 32: (a) Code That Instantiates a Module, (b) the Generated Hardware, and (c) Generated Hard
Figure 33: Boolean Select Control Flow (a) in the original C, (b) in the intermediate representation, and (c) in the generated hardware datapath
Figure 34: Predicated Control Flow (a) in the original C, (b) in the intermediate representation, and (c) in the generated hardware
Figure 35: Reading From A Preinitialized LUT
Figure 36: Multiple Reads, Single Write LUT
Figure 37: Composite Systems
Figure 38: Legacy Module Code
Figure 39: Declaring And Using A Twelve Bit Integer Type
Figure 40: C Code To Generate A Systolic Array
Figure 41: Block Diagram Of Max Filter System
Figure 42: Block Diagram Of Max Filter System After TCSE
Figure 43: System Code That Accesses a 3x3 Window
Figure 44: 3x3 Smart Buffer Sliding Along a 5x5 Memory
Figure 45: (a) System Code that reads from a FIFO and (b) Memory fetches when using a FIFO
Figure 46: Timing Diagram Of A System With Both Input Scalars And Input Streams
Figure 47: Block Diagram Of A Generated Module
Figure 48: Block Diagram Of A Generated System
Figure 49: Timing Diagram Of Module Use
Figure 50: Readi
32. ement. This is accomplished by instantiating a LUT and storing the accumulated values inside. The output of the system is a two-dimensional stream that represents the integral image.

• IPIntegration: This system demonstrates how to integrate external IP with ROCCC-generated code. The multiply-accumulate core is not a ROCCC example and must be added into the ROCCC library through the GUI.
• MatrixMultiplication: The matrix multiplication system takes as input two NxN two-dimensional matrices and outputs an NxN matrix that is the product of the inputs. The code is three nested loops accessing the two-dimensional arrays with different access patterns and shows some ways that ROCCC supports accessing array elements based upon loop indices.
• MaxFilterSystem: This example filters out the maximum value on each 3x3 window of a two-dimensional input stream. The module MaxFilter must be compiled before this example. If temporal common subexpression elimination is performed on this example, the amount of hardware generated will be reduced from 4 instances of MaxFilter to 2 instances of MaxFilter.
• MedianFilter: The MedianFilter example chooses the median value from each window of size 8 from a one-dimensional stream. In order to choose the median, each window must be sorted, which is accomplished through the BitonicSort8 module.
• Prewitt: This example performs Prewitt edge detection on a t
33. Figure 5: ROCCC Context Menu

3.3.2 ROCCC Toolbar

• Build: Compile the open ROCCC module or system file.
• Cancel: Stops the current compilation if any are running.
• New Module: Create the starter code for a new ROCCC module and add it to a project.
• New System: Create the starter code for a new ROCCC system and add it to a project.
• Manage Intrinsics: Open the intrinsic management window to add, edit, or delete intrinsics.

3.3.3 ROCCC Context Menu

• Build: Compile the open module or system file and run it through the ROCCC compiler.

Figure 6: Importing the Examples

3.4 Loading the Example Files

To test ROCCC on the example files, you need to load the examples that came bundled with the distribution. The first way to do this is after setting the distribution folder for the first time: ROCCC will ask if you would like the examples loaded. Selecting Yes will have the ROCCC examples loaded into a new project called ROCCCExamples. If there is already a project with that
34. eritive that data not be serialized, it is preferred to create several output streams or to create a multi-channel stream.

5.2.5 Done

The done signal works differently depending on whether it is coming from module or system code. Module code will drive the done signal high as soon as the first value is processed; this can safely be ignored by any code interfacing with a ROCCC module, as modules are stateless and can never be considered done. System code will drive the done signal high on the rising edge of the clock after the last output values are set. Figure 53 provides an example of the done signal's behavior in a typical system.

5.2.6 Stall

The stall signal allows the interfacing code to stall the datapath in both modules and systems. Stalls are not instantaneous; it takes 1 to 2 clock cycles for the stall signal to propagate all the way up the datapath to both the input and output controller. In hardware, a common use for a stall signal is when interfacing with memory that may become full. However, both input and output streams are two-way handshakes, and any stream can be stalled by simply not completing the handshake. For this reason, and because stalls are not instantaneous, stalls should be reserved for the case when there is no alternative. When the stall signal is brought high, both input and output streams will continue to interact with any interfacing code. However, the datapath will be frozen and data will not be pushed onto the dat
35. es, and types for that selected component. You can delete a compiled component from the database by clicking the component name in the IPCores view and pressing the Delete key. The component will also be removed from the roccc-library.h file. You can also use any of the components in the ROCCC database: with a valid module or system open and selected, move the cursor to where you want to insert a call to a module and double-click the desired component in the IPCores view. This will add a function call to the double-clicked component in the open ROCCC file and will add #include "roccc-library.h" to the top of the file. All that remains is to fill in the variables you wish to pass into the component function call.

Figure 7: The ROCCCExamples Project
36. fies for the stream and needs to be a factor of both the window size and the step window size, while the number of address channels will be equal to the number of address channels specified by the user and needs to be a factor of both the window size and the step window size along the dimensions other than the innermost. Figure 48 shows the block diagram of a generated system that communicates with a multi-dimensional buffer. The default ports are still generated, located on the top and bottom of the figure, as well as the interface to the streams. In addition to the ports generated for streams, input and output registers can be created as well. For output streams, several ports will be generated: an address clock, an address valid port, an address stall port, a positive number of address channel pairs, a data clock, a data empty port, a data read enable port, and a positive number of data channel ports.

5.2 Interfacing Protocols

5.2.1 Input Registers

Input registers are used by both module and system code. They need to be set when inputReady is driven and are sampled on the rising edge of the clock. Driving the input registers is the responsibility of the calling code. In modules, the input registers can be changed every clock cycle. In systems, the input registers may be set only once and must be set before passing any data to the input streams. See Figure 46 for the timing of interfacing with system code's input registers and Figure 49 for t
37. generated by ROCCC communicates with the external platform in a variety of ways, described in this section. All inputs and outputs that connect to ROCCC code are assumed to be active high.

5.1.1 Default Ports

Each hardware module and system generated by ROCCC will contain six ports by default. These default ports are clk, rst, inputReady, outputReady, done, and stall. Their use is described here.

clk: The clk port is the clock of the hardware and should be connected to a clock signal. All processes internal to ROCCC code trigger off of the rising edge of the clock. All ROCCC components and systems assume a single clock to drive all the hardware.

rst: The rst port is the reset signal to the generated hardware. Driving the reset port high resets the hardware to an initialized state. As long as the reset port is held high, the hardware will remain in the reset state regardless of the inputs. After bringing the reset port low, the hardware will begin responding to the input signals. The hardware generated by ROCCC requires the reset port to be driven high for at least ten clock cycles for initialization purposes. Not doing so may leave the component in an uninitialized state. The use of the reset signal and the initialization of hardware that contains both input registers and input streams is shown in Figure 46.

inputReady: The inputReady signal should be driven high when the signals that correspond to input scalars are
38. he timing of driving a module's input registers.

Figure 49: Timing Diagram Of Module Use

Figure 50: Reading From A Stream

5.2.2 Input Streams

The input stream address generation and the input data protocol are decoupled, allowing address generation to happen independently of incoming data. In particular, there are four ports dealing with address generation and three that deal with input data (assuming 1 channel for both): address_stall, address_rdy, address_channel_base, address_channel_count, data_full, data_writeEn, and data_channel. As long as the address_stall port is not held high, addresses will be generated. When an address is being generated, the address_rdy port will be brought high and the address_channel_base port will hold the base address of the values needed. The address_channel_count port holds the number of consecutive elements from that base address that are desired; it is up to the end user to use these two values to derive the memory locations needed. The address_rdy port will only be held high for one clock cycle for each individual address being generated. If addresses are being generated in consecutive clock cycles, the address_rdy port will be continuously high. The user-defined interfacing code needs to service memory requests in a FIFO fashion
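The base and count pair described above can be turned into individual memory locations by the interfacing code. The following is a minimal sketch in C of that expansion, not part of ROCCC itself: the struct `addr_request_t` and the function `expand_request` are hypothetical names, and the sketch assumes addresses are element indices with unit stride. A real platform may need to scale by the element size or map to a different address space.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one address request: the values sampled from
   address_channel_base and address_channel_count while address_rdy is high. */
typedef struct {
    uint32_t base;   /* first element requested */
    uint32_t count;  /* number of consecutive elements from base */
} addr_request_t;

/* Expand one request into individual element addresses.
   Assumption: addresses are element indices with unit stride.
   Writes at most `capacity` addresses and returns how many were written. */
static size_t expand_request(addr_request_t req, uint32_t *out, size_t capacity)
{
    assert(out != NULL);
    size_t n = 0;
    for (uint32_t k = 0; k < req.count && n < capacity; ++k) {
        out[n++] = req.base + k;
    }
    return n;
}
```

Because requests must be serviced in FIFO order, interfacing code built along these lines would expand and answer each request completely before moving on to the next one.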
39. ibrary.h file.

3.17 Testbench Generation

Once a module or system has been compiled with ROCCC and translated into hardware, you can create a hardware testbench for simulation by selecting Generate Testbench from the ROCCC menu. For modules, you can enter as many test sets as you wish with their corresponding expected outputs, as shown in Figure 23. For systems, you will need to enter values for the input scalars and specify files that contain the values of each input stream and output stream. The stream files must consist of a list of values separated by white space in the order in which they will be read.

Figure 23: Testbench Generation

3.18 Platform Generation

Once a module or system has been compiled with ROCCC, you can generate a Xilinx PCore from it by selecting Generate PCore Interface in the ROCCC menu. ROCCC will then generate all the necessary files and connections to make a PCore. If your component requires any dependent files, such as sub-components or netlists, a window will pop up asking for those files prior to generating the PCore files. The window will show you all the required components it is looking for and a
40. in Section 5. Data reuse is handled through the creation of smart buffers, which is detailed in Section 4.9.7. The code located in the innermost loop will be translated into a datapath that is separate from the control.

4.3.1 Windows and Generated Addresses

When generating code, we infer the size of the memory we are accessing from both the loop bounds and the size of the accessed window. For example, the loop bounds in Figure 28 suggest a 10x10 memory. However, the code inside the loop accesses a 3x3 window, so we generate code that assumes a 13x13 memory. The addresses we generate will be the same as in C; note that if run in C on a 10x10 array, the results will be undefined. When fetching the first window, we will therefore generate the offsets 0, 13, and 26 for the first column and NOT 0, 10, 20. Similarly, we will generate the offsets 1, 14, and 27 for the second column and 2, 15, and 28 for the third column of the window.

Figure 28: Accessing a 3x3 Window (system code WindowSystem with array parameters A and B and a loop nest bounded by 10)

Figure 29: A system with a three-dimensional input and output stream (system code NDimensional with array parameters A and B and loops over i, j, and k)

Additionally, we perform a normalization
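The window offsets listed above follow from ordinary row-major addressing with a row stride of 13 rather than 10. The sketch below reproduces them in plain C. It is only an illustration of the arithmetic, not code that ROCCC generates, and the helper name `window_offsets` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: compute the row-major offsets of a win x win window
   whose top-left corner is element (row, col) of a memory that is `stride`
   elements wide. Offsets are grouped column by column, matching the order
   in which the text lists them. `out` must hold win * win entries. */
static void window_offsets(size_t row, size_t col, size_t win,
                           size_t stride, size_t *out)
{
    assert(out != NULL);
    assert(col + win <= stride);  /* the window must fit within one row */
    size_t n = 0;
    for (size_t c = 0; c < win; ++c) {
        for (size_t r = 0; r < win; ++r) {
            out[n++] = (row + r) * stride + (col + c);
        }
    }
}
```

Calling `window_offsets(0, 0, 3, 13, out)` fills `out` with 0, 13, 26, then 1, 14, 27, then 2, 15, 28. Passing a stride of 10 instead would give 0, 10, 20 for the first column, which is exactly the layout the text warns against assuming.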
41. ing the module, as shown in Figure 19.

Figure 19: Module Skeleton Code for MACC (a typedef of ROCCC_int32 and a MACC function taking three ROCCC_int32 inputs and a ROCCC_int32 reference output named sum)

3.11 Create New System

To start a new system from scratch, first make sure you have a valid ROCCC project loaded or have created a new project as described in the Creating a New Project section. Once you have a valid project open, select New System under the ROCCC menu or toolbar to begin creating the new system. A new window will open asking for the details of the new system. Input the name of the system and which project to add the new system to. Lastly, select how many stream dimensions the system will have. Once everything is entered correctly, click Finish and the system will be added to the project. The new file will open in the editor with the necessary starter code to begin coding the system, as shown in Figure 21.

Figure 20: New System Wizard

WithinBounds.c skeleton listing: a loop over i from 0 to 99 that passes stream A into stream B with B[i] = A[i].

Figure 21: System Skeleton Co
42. l outputs are treated as registers. When generating systems, any single variable that acts as output to the main loop will be treated as an output register. Figure 47 provides a block diagram of a ROCCC-generated module. This module includes both the default ports, located on the top and bottom, and the user-defined ports, which may be of variable bit size, located on the left and right.

• Streams: All streams in ROCCC are split into two separate clock domains: a data clock domain with associated ports and an address clock domain with associated ports. On the data clock domain, there is a data valid port, a data read/write enable port, and a positive number of data channels. On the address clock domain, there is an address stall port, an address valid port, and a positive number of pairs of address channel ports; these pairs each consist of a base address port and a count address port.

Figure 48: Block Diagram Of A Generated System

The number of data channels will be equal to the number of data channels the user speci
43. le input while another may not. Because ROCCC has no knowledge of which IPCore you will end up using, we cannot directly instantiate it; instead, we instantiate a wrapper component. This component must be written by you and provides a standardized interface that ROCCC can instantiate. However, this component does not have to implement any logic; it can simply instantiate the IPCore to implement the logic. In this way, a standardized interface is presented to ROCCC, but any IPCore can be used to implement the actual logic. As an example, the declaration for a theoretical 32-bit floating-point divide core is shown in Figure 60, with the corresponding wrapper shown in Figure 61.

When choosing an IPCore, it is important to keep several considerations in mind. First, the core should be fully pipelined, as ROCCC assumes all subcomponents are fully pipelined. Second, the core needs to have a way to stall the component; if no core is available that has a way to stall the component, a simple solution is to gate the clock, but this results in poor performance. Third, the component must complete the calculation in a constant number of clock cycles. This number of clock cycles must be given to the GUI when importing the intrinsic into ROCCC.
44. loops are fully unrolled. Input streams are identified by array read accesses. Output streams are identified by array write accesses. Arrays may not be both read and written in the body of a loop, except in the special case of generating a systolic array. Input and output scalars must be passed in the parameter list. Local variables used in the innermost loop may also be identified as feedback variables if the local variable has a read followed by a write. Feedback variables will add an input port to the generated hardware for the initial value. Figure 63 provides an example of the assumptions we make based upon the C code. The interface we generate for this code is shown in Figure 64.

6.2 Values created by optimizations

The optimizations Temporal Common Subexpression Elimination (TCSE) and Systolic Array Generation also create input ports. TCSE will create a feedback variable and corresponding initialization port for each piece of code eliminated. Systolic Array Generation will turn the original two-dimensional array into a one-dimensional array input, which corresponds to the first row of the two-dimensional array, and will create initialization input ports for every element in the first column of the original two-dimensional array.

Figure 63 listing (truncated): SystemCode takes A as an input stream, x and endValue as input scalars, and z as an output scalar
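The feedback-variable rule described above (a local variable that is read and then written in the innermost loop) can be illustrated with a summation. The following is a plain C software model under stated assumptions, not ROCCC system code: the function name `sum_stream` is hypothetical, and the `initial_sum` parameter stands in for the initialization input port that the generated hardware would expose.

```c
#include <assert.h>
#include <stddef.h>

/* Software model of a loop with one feedback variable.
   `sum` is read and then written in every iteration, which is the pattern
   the text describes for feedback variables. `initial_sum` models the value
   that would be driven on the generated initialization port (assumption). */
static int sum_stream(const int *A, int length, int initial_sum)
{
    assert(A != NULL || length == 0);
    assert(length >= 0);
    int sum = initial_sum;
    for (int i = 0; i < length; ++i) {
        sum = sum + A[i];  /* read of sum followed by a write of sum */
    }
    return sum;
}
```

With A = {1, 2, 3, 4}, an initial value of 0 gives 10 and an initial value of 5 gives 15, which shows why the initial value has to be supplied from outside the loop.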
45. n. By default, all operations are expanded to the highest precision before being performed and then truncated, if necessary, as the last step. In the generated VHDL, an N-bit addition is stored into an (N+1)-bit value, and a multiplication between two N-bit numbers is stored into a number with 2N bits. The user may select the optimization MaintainPrecision to truncate at every step. Specifying the bit width is done by declaring a typedef at the beginning of your program. This typedef must be in the form of ROCCC_intX, where X is any positive number, as shown in Figure 39. This type can then be used to declare any variable with the appropriate size.

Figure 40: C Code To Generate A Systolic Array (a labelled loop nest over i and j, each bounded by 100, that computes A[i][j] from three neighboring elements of A)

4.9.2 Systolic Array Generation

Systolic array generation is an optimization that takes a wavefront algorithm operating on a two-dimensional array and converts it into hardware consisting of a single-dimensional array of elements that feed back to each other. The original C code must be in the form of a doubly nested for loop that calculates the value of a two-dimensional array based upon some function of the previous elements of that array. In order for systolic array generation to recognize the optimization, the outer loop must be labelled as shown in
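The bit-width growth rule stated in Section 4.9.1 (N+1 bits for an N-bit addition, 2N bits for an N-bit multiplication) can be checked with a short worked example. The sketch below uses unsigned values for simplicity, which is an assumption made here and not something the manual specifies; the helper names `bits_needed` and `max_unsigned` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Number of bits needed to represent an unsigned value (0 needs 1 bit). */
static unsigned bits_needed(uint64_t v)
{
    unsigned bits = 1;
    while (v >>= 1) {
        ++bits;
    }
    return bits;
}

/* Largest unsigned value that fits in n bits. n is limited to 1..32 so that
   the product of two such values still fits in a 64-bit integer. */
static uint64_t max_unsigned(unsigned n)
{
    assert(n >= 1 && n <= 32);
    return (UINT64_C(1) << n) - 1;
}
```

For N = 12, the largest operand is 4095. The worst-case sum, 4095 + 4095 = 8190, needs 13 bits, and the worst-case product, 4095 * 4095 = 16769025, needs 24 bits, matching the N+1 and 2N widths given in the text.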
46. Figure 14: Stream Accessing Management Page (a table of stream names with the number of parallel data channels and parallel address channels for each, and diagrams of the memory fetched per cycle with one and with three channels)

The automatic opening of this report can be controlled in the ROCCC Preferences panel.

Figure 15: Successful compilation

If the compilation finished successfully, you will see a VHDL folder in the project directory next to the file you compiled that will have the generated VHDL code for that system or module, as shown in Figure 16. The selected flags for each file are saved, so that if you recompile a file it will load the flags that were used duri
47. n the ROCCC menu. In both of these cases, ROCCC will check whether there is a new version of the compiler and of the GUI plugins. All messaging about checking for updates will show up in the Eclipse console. When a new version is available, you will be informed. Updates are available for download on the jacquardcomputing.com website.

4 C Code Construction

4.1 General Code Guidelines

ROCCC supports two styles of C programs, which we refer to as modules and systems. Modules represent concrete hardware implementations of purely computational functions. Modules can be constructed using instantiations of other modules in order to create larger components that describe a specific architecture. System code performs repeated computation on streams of data. System code consists of loops that iterate over arrays. System code may or may not instantiate modules. System code represents the topmost perspective and generates hardware that interfaces to memory systems.

4.1.1 Limitations

ROCCC is not designed to compile entire applications into hardware and has certain general restrictions on both module and system code. ROCCC is continually in development, so these restrictions may fluctuate or be eliminated entirely in future releases. ROCCC 2.0 currently does not support:

• Logical operators that perform short-circuit evaluation. The && and || operato
48. name, ROCCC will ask you for a different project name to create and import the examples into. If there is an internet connection available, ROCCC will also open the examples webpage to give explanations of how the examples work. The second way to load the examples, which can be done at any time after the distribution folder has been set, is through the ROCCC menu. Select ROCCC > Help > Load Examples and the ROCCC examples will be loaded as described above. This is shown in Figure 6. Once that is complete, the examples should be loaded into the project that was created. If you look into the project's subdirectories, you should see a src folder. Within that folder, there should be modules and systems folders, as shown in Figure 7. The GUI requires ROCCC projects to be arranged according to this directory structure. Any code located in the modules subdirectory is assumed to be module code, and similarly any code in the systems directory is assumed to be system code.

3.5 IP Cores View

ROCCC maintains its own database of compiled modules that can be viewed at any time. To view the contents of the database, click ROCCC > View > IPCores on the Eclipse menu. The ROCCC IPCores view will open and display all the inserted modules inside the database. You can view what ports are on a specific module in the database by selecting a component in the IPCores view. The neighboring table will then display all the port names, directions, port siz
49. ng From A Stream 4124 44 A bw ed e eae ee ES che ee ee ee ee ee od 43 51 Reading From A Stream With Multiple Channel 44 52 Timing Diagram of Output Streams 2 2 A5 53 Timing Diagram Of The End Of A System s Processing 46 54 C Code For MaxFilterSystem Which Uses A 3x3 Window a 46 55 C Code That Writes To Three Locations In The Same Stream Each Loop Iteration 47 00 BaSe E e pe paa e e RAS ok 1 EG 48 a Medan Dados ee IEA EAS EE l 48 58 High fanout a before registering and b after registering ooo a a 49 59 Generated Systolic Array Hardware 50 60 Theoretical Interface to a 32 bit Floating Point Divide IPCore 51 61 Wrapper for the Theoretical 32 bit Floating Point Divide 51 62 System Code Sections Translated Into Hardware e 92 0 C Code That Miers e EE 53 04 Generated Ports yaa 4658 d 446 See Dee eee Se BA ee ES 54 Jacquard Computing 2012 5 1 1 1 1 2 1 CHANGES Changes Revision 0 7 4 Bug fixes Fixed an issue with inlining where certain obscure code configurations would fail such as when you had multiple inlined modules inside of another module that contained two or more loops and the innermost module had an if then else chain greater than 4 levels deep Fixed an issue where the GUI might freeze when compiling if the compilation generated too many warnings or errors We added support for the Pico M501 platform Revision 0 7 Adde
50. ng the previous compile. The other way to compile a file is to right-click the desired file in the Project Navigator and select Build to Hardware in the ROCCC submenu, as shown in Figure 5. 3.8 Compiler Optimizations In addition to standard compiler optimizations such as dead code elimination and constant propagation, the first page of the build window allows the user to select additional high-level optimizations to perform on the code. The choice of optimizations differs depending on whether the compiled code is a module or a system. Note: When compiling a module, all loops are fully unrolled automatically. The available optimizations are: [Figure 16: VHDL Subdirectory Created. The project tree shows MyProject/src/modules/MACC containing MACC.c and a vhdl subdirectory with MACC.vhdl and ROCCChelper.vhdl.] 3.8.1 System Specific Optimizations Systolic Array Generation: Transform a wavefront algorithm that works over a two-dimensional array into a one-dimensional hardware structure with feedback at every stage, in order to increase the throughput while reducing hardware. Note: This optimization cannot be combined with other optimizations. Temporal Common Sub-Expression Elimination: Detection and removal of common code across loop iterations to reduce the size of the generated hardware. LoopFusion: Merge successive loops with the same bounds and no dependencies. LoopInterchange: Switch
51. ng the ways streams are accessed, as shown in Figure 14. From here you can select Add to add managing info for either input or output streams. A page will open asking for the stream name, the number of parallel data channels, and the number of parallel address channels. After you press Finish, the values are added to the table (input or output) on the stream management page for which you pressed Add. Once these values are in the table, you can edit them by double-clicking individual cells and changing the values. The number of stream channels must be a factor of the window the data is being accessed from for that stream and the step size of the loop. Once you have selected which optimizations to use and have set the arguments for the optimizations that require them, select Finish. This will run the ROCCC toolchain on the selected open file inside the Eclipse editor. All output from the compilation is written to the console inside Eclipse, as shown in Figure 15. Additionally, a summary report will open in the main window to report statistics of the compilation. Stream Accessing Management: Tune any of the input or output stream accessing for VectorAdd.c. Input Stream Information (Input Stream Name, Number of Parallel Data Channels, Number of Parallel Address Channels): A, 4, 1; B, 4, 1. Output Stream Informatio
52. nto the generated hardware. When a module is compiled, it is exported for use in other code. All modules have header information placed into the file roccc-library.h. These functions can be called from other ROCCC code, and each function call will be translated into a module instantiation. The system code shown in Figure 32a processes a data stream and instantiates the module that was shown in Figure 25. When compiled, the generated hardware will resemble the circuit shown in Figure 32b. IMPORTANT NOTE: Currently, array references can be used as inputs to modules, but the outputs of modules cannot be mapped to array references. To accomplish this, you must declare an intermediate temporary variable, assign the output of the module to this variable, and then assign the variable to the output array. This is shown in Figure 32a, where the output of FIR must be mapped to the variable tmp and then assigned to the output stream B. 4.4.1 Inlining Modules The user controls whether module instantiations are treated as black boxes or inlined. When inlined, the individual operations of the module are exposed to the top level design and can be optimized around, at the expense of increased compile time. As an example, Figure 32c shows the resulting circuit structure of the Example summation code: void Summation int A int B int final int i int output for i
53. occc 0 7 distribution Licenses license lic Browse Verify License File Path General Preferences J Automatically check for updates on startup Automatically open compilation report after compiling A 2 Cancel OK i Figure 3 The ROCCC Preferences Page 3 3 GUI Menu Overview This is a quick overview of all the ROCCC buttons and options located on the GUI for future reference Each of the actions the buttons do will be covered in more detail in the other sections this is merely so you can see and recognize all the buttons available Note The icons on the menus may not show up if your system preferences are set to not show Menu Icons 3 3 1 ROCCC Menu e Bud Build Compile the open modules or system file Es Project 27 Module e New lt 5 System Project Create a new ROCCC project in Eclipse Jacquard Computing 2012 10 3 GUI Module Create starter code for a new ROCCC module System Create starter code for a new ROCCC system New b Add b 27 IP Core A d d Import b IP Core Add an IP Core directly to the database for future use Import b lt Module Import Ven d Module Import an outside ROCCC module C file into a project System Import an outside ROCCC system C file into a project lt 2o IP Cores Manage el rocec library h Vi ew Generate IP Cores Opens the IP Cores view to see available cores in the ROCCC database rocc
54. oop iteration 1 and loop iteration 2, this will be detected. When generating hardware, we take advantage of this fact and create feedback variables that eliminate redundant computations. TCSE can only be performed on system code. The code does not have to be written in any special way to take advantage of TCSE. An example of the difference in generated hardware can be seen in Figures 41 and 42. These block diagrams show the original structure of the Max Filter System hardware, which contains four Max Filter modules and operates on a sliding 3x3 window, and the Max Filter System after TCSE has been performed. After TCSE, the generated hardware has only two Max Filter modules; the other two have been replaced with feedback variables. The generated hardware does require initial values for each piece of hardware eliminated, so you might have to change the way you pass data into the hardware depending on whether you perform TCSE. 4.9.4 Arithmetic Balancing The user has the choice of performing arithmetic balancing on the generated hardware. The optimization finds expressions composed of a single operator performed in serial and changes the order in which the subexpressions are calculated to minimize the time to calculate the expression. Only associative and commutative operators are balanced. Currently addition, multiplication, and bitwise AND, OR, and XOR Figu
55. ously high. The FIFO interface protocol of the output streams is similar to the FIFO interface protocol of the input streams. When the output controller has valid data from the datapath, the first element of the stream is written to the data port(s) and the address of that data element is written to the corresponding address port. The empty port is brought low, signaling that there is data in the FIFO. When the read enable signal is brought high, the first data element is put onto the data port(s) and the address port is loaded with that element's address. See Figure 52 for an example of the timing protocol for output streams. As shown in Figure 52, the data and addresses are decoupled. Jacquard Computing 2012 44 5 INTERFACING WITH GENERATED HARDWARE [Figure 52: Timing Diagram of Output Streams, showing the empty_out, data out, and readen_in signals.] Because the output controller may be serializing data calculated in parallel, the datapath must be stalled until all of the data is serialized. This happens entirely internally, but functions equivalently to bringing the stall port high: the datapath is stalled, the input controller continues to read but will not push data onto the datapath, and other output streams may run out of valid data. For this reason, it is important not to rely on a specific timing for any stream interfacing. Rather, the FIFO interface should be relied on to guarantee that data is transferred correctly. If it is imp
56. pensive in terms of execution time on the desired platform. To edit these values, click the Advanced tab at the top of the Area vs Frequency page. These weight values have no absolute meaning; they only have meaning relative to each other. For example, if our Mult operation takes twice as long as our Add, the weight value for Mult must be twice that of Add. This can be done as 100 and 50, or 50 and 25; it does not matter as long as the weights are proportional to each other. In this case, when compaction occurs, the compiler would attempt to allow two chained additions to happen together for every multiplication that is done. If all the weights have the same value, that means that they all take the same amount of execution time. Again, this can be achieved by having the weights as all 1s or even all 500s, as long as they are all the same value. The default weights distributed with ROCCC are the values we came up with for targeting 150 MHz on an LX 330. These weights, combined with the pipeline slider, give you precise control over how to tune your component in terms of area and frequency. Also available in the advanced view is control over the maximum allowable fanout. When generating a circuit, if any register has a fanout larger than the specified number, registers are inserted along the paths in order to ease routing constraints. If compiling a system, there is a fourth page in the compilation wizard for managi
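To make the role of relative weights concrete, here is a minimal sketch in plain C. It is not ROCCC's actual compaction algorithm: the function name count_stages, the enum, the weight table, and the greedy strategy are all assumptions chosen for illustration. It walks a linear chain of operations and starts a new pipeline stage whenever the accumulated weight would exceed a chosen delay.

```c
#include <stddef.h>

/* Hypothetical operation kinds and relative weights (illustrative only,
   not the defaults shipped with ROCCC). Mult costs twice as much as Add. */
enum op { OP_ADD, OP_MULT };
static const int weight[] = { 50, 100 };

/* Greedy packing of a linear chain of operations into pipeline stages:
   a new stage is opened whenever adding the next operation would push
   the stage's accumulated weight above max_delay.
   Returns the number of stages used. */
size_t count_stages(const enum op *chain, size_t n, int max_delay)
{
    size_t stages = 0;
    int used = 0;
    for (size_t i = 0; i < n; ++i) {
        int w = weight[chain[i]];
        if (stages == 0 || used + w > max_delay) {
            ++stages;   /* a register boundary would be placed here */
            used = 0;
        }
        used += w;
    }
    return stages;
}
```

With a delay of 100, two chained additions share a stage while each multiplication occupies a stage of its own. Scaling every weight and the delay by the same factor gives the same packing, which is why only the ratios matter.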
57. plugin is continually evolving and may function slightly differently in future releases. 3.1 Installing The Plugin Once you have downloaded and uncompressed Eclipse, open the resulting eclipse folder. Inside it you should see a folder named plugins. This is the folder into which the ROCCC GUI plugins must be copied, as shown in Figure 1. Any previous versions of the ROCCC plugins must also be removed from this directory. The ROCCC plugins are located inside the GUI folder of the uncompressed ROCCC distribution folder. [Figure 1: Copying the Plugins into Eclipse.] Once you have moved the ROCCC plugins into the plugins folder inside eclipse, ROCCC should be ready to run in Eclipse. The first time you run Eclipse with the ROCCC plugins installed, ROCCC will set up the perspective best suited for working with ROCCC. It will also open a page welcoming you to ROCCC 2.0 and asking if you would like to register for updates and news. 3.2 Preparing the GUI for using ROCCC Before the core functionality bundled with the GUI can be used, the user must first set the directory path to the ROCCC distribution folder. This can be done by selecting Preferences in the ROCCC menu at the top of the program, as in Figure 2. Once this is done, a prefe
58. pport processing new data every clock cycle. If a module contains a loop, it will automatically be fully unrolled. Hence, any loop inside of a module must have an end bound that can be statically determined. Figure 26a provides an example of the supported loop structure inside modules. After unrolling, constant propagation, and copy propagation, we end up with the hardware shown in Figure 26b, which is a single multiply, as we would expect. There is no loop control or other control created, as the loop has been removed.
// Example module code
// Input parameters must come before output parameters
void FIR(int A0, int A1, int A2, int A3, int A4, int& result)
{
  const int T[5] = {3, 5, 7, 9, 11};
  result = A0*T[0] + A1*T[1] + A2*T[2] + A3*T[3] + A4*T[4];
}
Figure 25: (a) Module Code in C and (b) generated hardware.
// This module contains a loop; it will automatically be fully unrolled
void Squared(int x, int& y)
{
  int total;
  int i;
  total = 1;
  for (i = 0; i < 2; ++i) { total *= x; }
  y = total;
}
Figure 26: (a) Using a loop in module code and (b) resulting hardware.
// Example system code
// Streams are passed as pointers but treated as arrays
void VectorAdd(int N, int* V1, int* V2, int* Sum) { int i; fo
59. r (i = 0; i < N; ++i) { Sum[i] = V1[i] + V2[i]; } } Figure 27: (a) System Code in C and (b) generated hardware. 4.3 System Code System code performs computation on streams of data and produces streams of data. Scalars may also be read as input and generated as output, but, as opposed to modules, input scalars are read once at the beginning of computation and output scalars are generated only once at the end of computation. Similar to module code, system code is written as a void function that takes input and output parameters. Input scalars are passed by value, output scalars are passed by reference, and both input and output streams are passed as pointers. The function definition must declare inputs before outputs. Although passed as pointers, streams must be used internally through array accesses. An example of system code is shown in Figure 27a. This code takes a single input scalar, which determines the length of the incoming streams, two input streams V1 and V2, and an output stream Sum. The computation adds the corresponding elements of the two input vectors and outputs the results to the Sum stream. The generated hardware is shown in Figure 27b. Each stream specified in the C code generates a memory interface that includes an address generator (AG) and a BRAM FIFO structure. The specifics of the hardware communication protocols are discussed
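As a further illustration of the system-code rules described here, the following is a small sketch written in portable C. The function name WindowSum and the three-element window are assumptions made for this example and do not come from the ROCCC distribution. It takes one input scalar, one input stream, and one output stream, with inputs declared before outputs, and every stream access is a constant offset from the loop induction variable.

```c
/* Hypothetical system-style function (name and window size are
   illustrative). N is an input scalar, A is an input stream and B is
   an output stream; inputs are declared before outputs. */
void WindowSum(int N, int *A, int *B)
{
    int i;
    for (i = 0; i < N; ++i) {
        /* Streams are passed as pointers but accessed as arrays, and
           each index is a constant offset from the induction variable. */
        B[i] = A[i] + A[i + 1] + A[i + 2];
    }
}
```

Run as ordinary software, the input stream must hold at least N + 2 elements so that the last window stays in bounds.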
60. re 41: Block Diagram Of Max Filter System. Figure 42: Block Diagram Of Max Filter System After TCSE. are balanced. For example, the statement a = b + c + d + e in software will be calculated serially. By performing arithmetic balancing, the statement is changed into a = (b + c) + (d + e), with (b + c) and (d + e) calculated in parallel. Because floating point operators are not strictly associative and commutative, and order of execution matters when dealing with overflow, this optimization may change the final result when using floating point values. 4.9.5 Copy Reduction ROCCC automatically inserts copy registers in between pipeline stages if a value is not used immediately after it is calculated. If an operation could correctly be calculated in several different pipeline stages, one of those stages will minimize the total bits that are copied, both coming into that operation from previous stages and leaving that operation to later stages that use the calculated value. This pass attempts to find a placement for operations that minimizes the total number of copied bits. Starting with the edge in the use-def graph that has the greatest number of bits copied, edges are tightened by moving the nodes at the edge's ends toward each other. By repeating this process and saving a snapshot of the graph whenever a minimal number of copied bits is found, eventually a local minimum is found that minimizes the number of copied bits. This c
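The warning about floating point values can be demonstrated in ordinary software. The sketch below is plain C run on a CPU, not ROCCC output, and the function names sum_serial and sum_balanced are made up for this illustration. It evaluates the same four-operand sum in serial order and in balanced order using IEEE 754 doubles.

```c
/* Serial evaluation: ((x0 + x1) + x2) + x3, as written in software. */
double sum_serial(const double x[4])
{
    return ((x[0] + x[1]) + x[2]) + x[3];
}

/* Balanced evaluation: (x0 + x1) + (x2 + x3). The two inner sums are
   independent, so hardware could compute them in parallel. */
double sum_balanced(const double x[4])
{
    return (x[0] + x[1]) + (x[2] + x[3]);
}
```

For the input {1e30, 1.0, -1e30, 1.0}, the serial form yields 1.0 because the large values cancel before the final 1.0 is added, while the balanced form yields 0.0 because each 1.0 is absorbed by a large value first. With integer operands that do not overflow, both orders give the same result.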
61. rence page will pop up asking for the ROCCC distribution path. Set the preference value to wherever you uncompressed the ROCCC distribution folder. The validity of the chosen folder can be checked by clicking the Verify ROCCC Distribution Folder button on the preference page, as shown in Figure 3. On this page you must also set the path to the license file in order to use ROCCC. If the license is not set, you will not be able to compile or use any of the ROCCC functionality. Once that is done, the ROCCC GUI should be ready to use. If you ever try to use any of the ROCCC functionality and this preference is not set or the directory is incorrect, the GUI will tell you and ask if you want to set the ROCCC distribution folder in the Preferences menu. [Figure 2: Location of the ROCCC 2.0 Preferences. The ROCCC menu lists Build, New, Add, Import, View, Manage, Generate, Help, Reset Database, Preferences, and View License Information.] Preferences dialog, ROCCC page: Distribution Directory: ROCCC Distribution Directory Users roccc roccc 0 7 distribution, Browse, Verify ROCCC Distribution Folder. License File Path: ROCCC License File Path Users roccc r
62. rs do work and should be used in place of && and ||. • Generic pointers • Non-component functions, including C library calls • Shifting by a variable amount • Non-for loops • The ternary operator • Stream accesses other than those based on a constant offset from loop induction variables 4.2 Module Code Module code represents a hardware building block to be used in larger applications. Modules are computational datapaths and are written as computational functions. All inputs to modules are passed by value, and all outputs are passed by reference. Inside the function, inputs may only be read and output ports may only be written. Writing to an output port multiple times inside the function is not supported. Modules can only process scalar values and cannot have arrays as input or output variables. Internal variables may be created but are not visible outside of the module. Figure 25a shows a simple FIR filter written as a module. This code takes five inputs and computes a single output. When compiled, the generated hardware will resemble the circuit shown in Figure 25b. The interface to the module is exactly as described by the parameter list; the integer array T is not visible outside of the module. Modules do not generate addresses or fetch values from memory, but instead have data pushed onto them and then output scalar values after all computation has been performed. They are completely pipelined and can su
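The restriction on short-circuit operators can be worked around as follows. This is a sketch in portable C, so the output is passed through a pointer; ROCCC module code passes outputs by reference, and the pointer is an assumption made only so that the example compiles as standard C. The function name InRange is hypothetical.

```c
/* Hypothetical module-style function: inputs by value, one output that
   is written exactly once. */
void InRange(int x, int lo, int hi, int *result)
{
    int r;
    /* Each comparison yields 0 or 1, so bitwise & gives the same answer
       as && here. Unlike &&, both sides are always evaluated. */
    r = (x >= lo) & (x <= hi);
    *result = r;
}
```

The substitution is only safe when both operands are already 0 or 1. For arbitrary integers the operators differ: 2 && 1 is 1, but 2 & 1 is 0.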
63. ry containing the files necessary for PCore generation. Supplementary VHDL: A directory that contains VHDL that is necessary for some ROCCC generated code to run. bin: A directory that contains the precompiled binaries for a given platform. lib: A directory that contains the precompiled libraries for a given platform. scripts: A directory that contains a script for canceling compilation and a script for creating a subordinate installation. tmp: This directory is used for temporary storage when compiling with ROCCC. Jacquard Computing 2012 7 2 INSTALLATION There may be additional directories depending on whether the installation contains a port of ROCCC to a specific machine. Included in the GUI directory of the distribution are Eclipse plugins that control access to all of the ROCCC functionality. The user is responsible for moving the files into the appropriate Eclipse plugin directory on the target system and removing any previously installed ROCCC plugins that may exist. If you experience any failures in the installation procedure, consult the troubleshooting section at the end of this document. 3 GUI The ROCCC GUI is a plugin designed for the Eclipse IDE that works on both Linux and Mac systems. The user must have at least Eclipse version 3.5.1 installed. ROCCC currently supports the C and Java versions of Eclipse. Eclipse can be downloaded for free at www.eclipse.org. The ROCCC GUI
64. s all of the inputs and outputs to the module. Input ports must be identified by adding the suffix _in, and output ports must be identified by adding the suffix _out. The implementation function must be a function that receives and returns an instance of this struct by value. Any return statements that are not at the end of the function are ignored and cannot be used as a form of control flow. All computation inside this function will be translated to hardware. The FIR filter shown in Figure 38 is written in this style. Note that the hardware generated for this code is nearly identical to the hardware generated for the same code written in Figure 25. The only difference will be in the ordering of the ports once compiled. IMPORTANT NOTE: When compiling legacy ROCCC modules, the order in which you pass the parameters is not necessarily the order in which you declared them in the struct. The order in which you pass parameters must match the order in which they appear in the struct as exported in the roccc-library.h file. If using the GUI, this ordering is available by double-clicking the module in the IPCores view. Modules written in the new style will have the parameters in the same order as written. 4.8.2 Legacy System Code Legacy system code is nearly identical to the new style system code, with the exception that parameters were not accepted. Input and output arrays and scalars are declared locally and inferred during compilation
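The legacy struct style described here can be sketched as standard C. The layout below follows the naming rules in the text (the _in and _out suffixes, and a struct passed and returned by value) and mirrors the FIR example of Figure 38. The coefficient values are an assumption, borrowed from the new-style FIR example rather than read from Figure 38 itself.

```c
/* Interface struct: every input carries the _in suffix and every
   output the _out suffix. */
typedef struct {
    int A0_in;
    int A1_in;
    int A2_in;
    int A3_in;
    int A4_in;
    int result_out;
} FIR_t;

/* Implementation function: receives and returns the struct by value.
   The single return statement sits at the end of the function. */
FIR_t FIR(FIR_t t)
{
    const int T[5] = { 3, 5, 7, 9, 11 };  /* assumed coefficients */
    t.result_out = t.A0_in * T[0] + t.A1_in * T[1] + t.A2_in * T[2]
                 + t.A3_in * T[3] + t.A4_in * T[4];
    return t;
}
```

Called as ordinary software with inputs 1 through 5, the function returns a struct whose result_out is 3 + 10 + 21 + 36 + 55 = 125.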
65. sh. If you need to edit one of the already added ports, simply double-click the field you wish to edit and you will be able to change its value. Once everything is added correctly, click Finish and the component will be added to the ROCCC database. The component will now also be found in the IPCores view. [Figure 18: New Module Wizard. The dialog asks for the module name and the project to add it to, and shows a Ports table with the columns Port Name, Direction, and Size.] 3.10 Create New Module To start a new module from scratch, first make sure you have a valid ROCCC project loaded or have created a new project as described in the Creating a New Project section. Once you have a valid project open, select New -> Module under the ROCCC menu or toolbar to begin creating the new module. A new window will open asking for the details of the new module, as shown in Figure 18. Enter the name of the module and the project to add it to. Next, add all the ports that this module will have. If you ever need to edit an already added port, simply double-click the field you wish to edit and you will be able to change its value. Once everything is added correctly, click Finish and the module will be added to the project. The new file will open in the editor with the necessary starter code to begin cod
66. sk for the necessary files for each, as in Figure 24. [Figure 24: Dependent Files Window. The dialog asks for the HDL or netlist files of the dependent components in FFTOneStage, with an optional wrapper HDL file.] You can either fill these sections out and let ROCCC handle all the moving and packaging of the files, or you can continue with the generation without specifying these and place them in the packaged folder later. Once the generation of the PCore interface is complete, a folder named either PCore will show up next to the ROCCC file in the project explorer. These folders should have all the files necessary to run the PCore on the desired hardware, as long as that hardware supports PCores. PCores can be generated for all modules, but currently not for systems. 3.19 Updating There are a few ways to keep the ROCCC toolset up to date with the most current version available. The first is to have the ROCCC GUI automatically check for updates on each startup. You can change whether ROCCC checks for updates at startup on the preference page, as in Figure 3. The other way is to check for updates manually by selecting Help -> Check for Updates i
67. ss_stall_in : in STD_LOGIC;
-- Feedback Initialization Scalars
y_init_in : in STD_LOGIC_VECTOR(31 downto 0);
-- Input Scalars
x_in : in STD_LOGIC_VECTOR(31 downto 0);
endValue_in : in STD_LOGIC_VECTOR(31 downto 0);
-- Output Scalars
z_out : out STD_LOGIC_VECTOR(31 downto 0)
);
end SystemCode;
Figure 64: Generated Ports.
7 Examples Provided Twenty different example codes are provided to demonstrate the current capabilities of ROCCC 2.0. These are located in the Examples subdirectory. The Examples subdirectory contains a directory with all of the Module examples and a directory with all of the System examples. 7.1 Module Examples The Module examples are listed here: 1. BitonicSort2: This module takes two scalar numbers and returns two sorted scalar numbers. 2. BitonicSort8: This module takes eight scalar numbers and returns eight sorted scalar numbers. BitonicSort8 instantiates many BitonicSort2 modules in order to sort the inputs. 3. ColoumbicForceCalculations: This example performs all of the Coulombic force calculations between two atoms for one timestep of a molecular dynamics simulation. The calculations performed in this module require floating point values, so floating point cores must be supplied by the user in order to run the generated code. 4. FFT2: The FFT2 module performs the base calculations for the fast Fourier transform between two complex numbers. The
68. step on the window accesses to adjust for negative offsets. If the C code accesses an array with a negative offset, for example A[i-2] and A[i-1], we normalize these values to start at location 0, meaning the previous offsets will be adjusted to A[i] and A[i+1]. After the normalization, we determine the size of the memory rows we are accessing in the same way as above. 4.3.2 N-dimensional arrays ROCCC can accept arrays of arbitrary dimension. Figure 29 shows example C code that both inputs and outputs a three-dimensional array. When declaring an N-dimensional array, the parameter must be an N-dimensional pointer. 4.3.3 Feedback detection Variables whose values are used in multiple iterations of the for loop in system code are detected and turned into feedback variables. Figure 30 shows example code that contains a feedback variable. In this code, the value of currentMax is used in the initial loop iteration and is then carried through all of the additional loop iterations. When converting the code in Figure 30a into hardware, we get a circuit that resembles that in Figure 30b. All variables that are determined to be feedback variables will have an additional hardware input port generated for the initial value of the variable. Subsequent iterations of the datapath can only be executed once the value of the feedback variable is known, which in the worst case will be at the bottom of the pipeline.
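A feedback variable of the kind described here can be sketched in portable C as a running maximum. This is not the code of Figure 30. The function name RunningMax and the explicit initialMax parameter are assumptions for this illustration; in ROCCC the initial value is supplied through the extra hardware input port that the compiler generates.

```c
/* Hypothetical system-style function. currentMax is read and written in
   every iteration, so its value is carried from one iteration to the
   next, which is the pattern detected as a feedback variable. */
void RunningMax(int N, int initialMax, int *A, int *B)
{
    int i;
    int currentMax = initialMax;   /* initial value of the feedback variable */
    for (i = 0; i < N; ++i) {
        if (A[i] > currentMax) {
            currentMax = A[i];
        }
        B[i] = currentMax;
    }
}
```

Each iteration depends on the value of currentMax produced by the previous one, so a new iteration cannot begin until that value is available.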
69. t compare while the rightmost mux has a critical path of one addition operation and one comparison. By choosing a desired delay d such that Weight(mux) + Weight(add) < d < Weight(mux) + Weight(add) + Weight(compare), registers were inserted after the leftmost mux but before the rightmost mux. This can be seen in Figure 57. When dealing with complicated multi-operation datapaths and a large pipeline depth, this sort of timing analysis is difficult and error-prone when performed by hand, and time-consuming when done at the gate level on large graphs by the synthesis tool. 5.5 Fanout Reduction A high fanout in a design can severely impact the frequency of the final hardware, especially when that high fanout is exacerbated by not having registers in between the fanout operation and the operations that use it. By specifying the maximum unregistered fanout, the user can specify at what point registers should be inserted to minimize the impact of a high fanout. As an example, the addition operation in Figure 58 has a high fanout; by inserting registers between it and the operations that use it, the impact on the frequency of the final design is minimized. 5.6 Intrinsics Unlike in C, integer division, modulus, and floating point operations are expensive to do in hardware. In fact, there is no way to specify "add two 32-bit floats" or "multiply two 16-bit floats" other than implementing the algorithm yourself or using a hardware IPCore specific
70. the loop induction variables of two nested loops. Loop Unrolling: Unroll the loop at the given C label by a specified amount. If the loop has constant bounds, the loop can be fully unrolled. Arguments: Loop Label: The loop specified by the C label in the code. Number of times to unroll: The number of times to unroll the loop. If the loop has constant bounds, you can set the value to FULLY to fully unroll the loop. If a system has all of its loops completely unrolled, it will be transformed and compiled as a module. FullyUnroll: Fully unroll all loops in the original C code. If any of the loops have variable bounds, this pass will stop compilation. 3.8.2 Optimizations for both Systems and Modules MultiplyByConstElimination: Replace all integer multiplications by constants with equivalent shifts and additions. DivisionByConstElimination: Replace all integer divisions by constants with equivalent shifts and adds. Redundancy: Enable dual or triple redundancy for a module at a given C label. InlineModule: Inline C code of specified modules as opposed to instantiating black boxes. InlineAllModules: Inline C code of all module instantiations and, if those contain any other calls, continue inlining up to the specified depth. 3.8.3 Low Level Optimizations • ArithmeticBalancing: Change long chains of sequential arithmetic into parallel trees when applicable. • CopyReduction: Rebalance the low
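The effect of MultiplyByConstElimination can be illustrated with a few hand-written equivalents in plain C. The function names are hypothetical, and the particular decompositions are one possible choice, not necessarily the ones the compiler emits. Unsigned operands are used so that the shifts are well defined for every input.

```c
/* b * 16: a single shift, since 16 is a power of two. */
unsigned mul16(unsigned b)
{
    return b << 4;
}

/* b * 3 = b * 2 + b: one shift and one addition. */
unsigned mul3(unsigned b)
{
    return (b << 1) + b;
}

/* a * 7 = a * 4 + a * 2 + a: two shifts and two additions. */
unsigned mul7(unsigned a)
{
    return (a << 2) + (a << 1) + a;
}
```

In hardware, a shift by a constant amounts to rewiring, so each of these forms replaces a multiplier with at most a small number of adders.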
71. ull T_address_clk T_address_rdy T address base B rclk 32 T_address_ count B channel T address stall B readen A wclk B emot A Channel 32 ak A writeen B address dk A full kB address rdy A address cik A address rdy B address base 32 B address _ count B address stall 32 A address base A address count A address stall e tempo init temp1_init_1 temp2 init temp3 Ini eo temp4 Init temp gt Ini temp6 Inn bits outputReady done Figure 59 Generated Systolic Array Hardware Jacquard Computing 2012 50 5 INTERFACING WITH GENERATED HARDWARE entity fp_div_gen32 is port a in STD_LOGIC_VECTOR 31 downto 0 dividend b in STD_LOGIC_VECTOR 31 downto 0 divisor clk in STD_LOGIC clok signal ce in STD_LOGIC clock enable brought low to stall the core result out STD_LOGIC_VECTOR 31 downto 0 quotient E end fp_div_gen32 Figure 60 Theoretical Interface to a 32 bit Floating Point Divide IPCore entity fp_div32 is port clk in STD_LOGIC clock signal rst in STD_LOGIC inputReady in STD_LOGIC outputReady out STD_LOGIC done out STD_LOGIC stall in STD_LOGIC a in STD_LOGIC_VECTOR 31 downto 0 b in STD_LOGIC_VECTOR 31 downto 0 result out STD_LOGIC_VECTOR 31 downto 0 end fp_div32 architecture Behavioral of fp_div32 is component fp_div_gen32 IS port a IN std_logic_VECTOR 31 downto 0
72. ure 32: (a) Code That Instantiates a Module, (b) the Generated Hardware, and (c) Generated Hardware After Inlining

(a)  if (value > 5) x = 1; else x = 2;
(b)  x = ROCCCBoolSelect(1, 2, value > 5);
(c)  [hardware datapath diagram]

Figure 33: Boolean Select Control Flow (a) in the original C, (b) in the intermediate representation, and (c) in the generated hardware datapath

(a)  if (value > 5) x = 1;
(b)  pred = (value > 5); x = ROCCCBoolSelect(1, x, pred);
(c)  [hardware datapath diagram]

Figure 34: Predicated Control Flow (a) in the original C, (b) in the intermediate representation, and (c) in the generated hardware

// Input Stream is a set of numbers between 0-10
void LUTTestSystem(int* A, int N, int* B)
{
  int i;
  int LocalLUT[10] = {1, 3, 9, 0, 5, 6, 4, 7, 8, 2};
  for (i = 0; i < N; ++i)
  {
    B[i] = LocalLUT[A[i]];
  }
}

Figure 35: Reading From A Preinitialized LUT

void LUTTestSystem(int* A, int* B, int* C, int* D, int* E)
{
  // Multidimensional LUTs are supported
  int LocalLUT[5][5] = {{1, 2, 3, 4, 5},
                        {1, 2, 3, 4, 5},
                        {1, 2, 3, 4, 5},
                        {1, 2, 3, 4, 5},
                        {1, 2, 3, 4, 5}};
  int i;
  for (i = 0; i < 10; ++i)
  {
    LocalLUT[A[i]][B[i]] = C[i];  // One write per pipeline
    D[i] = LocalLUT[A[i]][0];
    E[i] = LocalLUT[0][B[i]];
  }
}

Figure 36: Multiple Reads
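The boolean-select and predicated forms captioned in Figures 33 and 34 can be mimicked in ordinary C to see why they are equivalent to the branching originals. This is a minimal sketch under stated assumptions: bool_select is a hypothetical software stand-in, not the ROCCCBoolSelect intrinsic itself, and the constants 1 and 2 and the threshold 5 are taken from the figure text.

```c
#include <assert.h>

/* Hypothetical software stand-in for a hardware select (mux):
   returns if_true when pred is nonzero, otherwise if_false. */
int bool_select(int if_true, int if_false, int pred) {
    return pred ? if_true : if_false;
}

/* Branching originals. */
int branch_both(int value) {
    int x;
    if (value > 5) x = 1; else x = 2;
    return x;
}
int branch_predicated(int value, int x) {
    if (value > 5) x = 1;  /* x keeps its old value otherwise */
    return x;
}

/* Branch-free forms: both arms are available and a select picks one. */
int select_both(int value) {
    return bool_select(1, 2, value > 5);
}
int select_predicated(int value, int x) {
    int pred = value > 5;
    return bool_select(1, x, pred);  /* old x is the fallback arm */
}

/* Compares the branching and select forms over a small input range. */
void check_select_equivalence(void) {
    for (int v = 0; v <= 10; ++v) {
        assert(select_both(v) == branch_both(v));
        assert(select_predicated(v, 7) == branch_predicated(v, 7));
    }
}
```

The predicated case differs only in that the previous value of x is fed back as the second input of the select.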
73. will show up in the Project Navigator view. This does not create hardware code for the selected system; it solely adds the system C code to the project.

3.14 Intrinsics Manager

Certain operations in C require hardware blocks on the FPGA. These include floating point operations and integer division. By selecting Manage -> Intrinsics, the user is able to select which IP cores to use. The intrinsics manager is shown in Figure 22. By adding intrinsics and then activating or deactivating individual entries, the user controls which components are inserted into generated datapaths.

[Figure 22: Intrinsics Manager]

3.15 Open roccc-library.h

Every time a module is compiled, the interface struct and hardware function prototype are added to the roccc-library.h file. If you ever need to view the roccc-library.h file, simply select View -> roccc-library.h under the ROCCC menu. This will open the roccc-library.h file in the default editor.

3.16 Reset Compiler

To reset the ROCCC database to its distribution state, simply click Settings -> Reset Database under the ROCCC menu. This will delete any added entries in the ROCCC database and will clear all added modules under the roccc l
74. wo dimensional image by comparing each window with a set 3x3 filter. The output is a two-dimensional matrix that corresponds to a black and white image of just the edges in the original image.
• SmithWaterman: An implementation of the Smith-Waterman algorithm that can be compiled with the Systolic Array Generation optimization to create an efficient hardware solution.
• Sobel: This example performs the Sobel edge detection algorithm on a two-dimensional image by comparing each 3x3 window with a set 3x3 filter. The output is a two-dimensional matrix that corresponds to a black and white image of just the edges in the original image.
• VectorAdd: The VectorAdd example takes two streams of one-dimensional input and outputs a third stream that consists of the sum of the individual elements. This example can be used to demonstrate various stages of unrolling and stream widths.
• VectorMatrixMultiplication: This example performs the multiplication of a one-dimensional input stream by a two-dimensional input stream. The sizes of the streams should be 1xN and NxN, with the resulting output stream being 1xN.
• VectorReduction: The VectorReduction code takes a one-dimensional input stream and outputs a single scalar that consists of the sum of all the input elements.
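As a rough picture of what the VectorAdd and VectorReduction descriptions amount to, the following plain C sketch implements both behaviors. It is not the source shipped with the examples; the function names, the const qualifiers, and the explicit length parameter n are assumptions made for illustration.

```c
#include <assert.h>

/* Element-wise sum of two input streams into an output stream. */
void vector_add(const int *A, const int *B, int *C, int n) {
    for (int i = 0; i < n; ++i) {
        C[i] = A[i] + B[i];
    }
}

/* Sum of all elements of one input stream, returned as a scalar. */
int vector_reduction(const int *A, int n) {
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        sum += A[i];
    }
    return sum;
}

/* Small self-check on fixed data. */
void check_vector_examples(void) {
    const int a[4] = {1, 2, 3, 4};
    const int b[4] = {10, 20, 30, 40};
    int c[4];
    vector_add(a, b, c, 4);
    assert(c[0] == 11 && c[1] == 22 && c[2] == 33 && c[3] == 44);
    assert(vector_reduction(a, 4) == 10);
}
```

The loop in vector_add has no dependence between iterations, which is what makes it a natural candidate for experimenting with unrolling; the accumulation in vector_reduction carries a value from one iteration to the next.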
75. y reused. The code in Figure 43 requires a 3x3 window from the memory A in order to execute each loop iteration. Note that, as in the C code, the ROCCC-generated hardware will access rows 0-6 and columns 0-6 of the image even though the loop bounds are < 5.

As shown in Figure 44, code that accesses a sliding 3x3 window over a larger memory can reuse six values between loop iterations (shown with X's in the diagram). The smart buffer initially reads nine values from memory and exports all nine to the datapath for the first loop iteration; for each subsequent iteration, only three new values are read.

ROCCC analyzes the code shown in Figure 45 and determines that no reuse occurs between loop iterations. In this case a FIFO interface is generated. For each loop iteration two elements are read in, as in Figure 45, and no reuse can be exploited between consecutive loop iterations.

for (i = 0; i < 65; i += 2)

Figure 45: (a) System Code that reads from a FIFO and (b) Memory fetches when using a FIFO

[Figure 46: Timing Diagram Of A System With Both Input Scalars And Input Streams]

5 Interfacing With Generated Hardware

5.1 Port Descriptions

The VHDL
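The sliding-window reuse described for the smart buffer (nine reads for the first iteration, three for each subsequent one) can be modeled in software by counting memory fetches. The sketch below is a simplified model under stated assumptions, not the buffer ROCCC generates: it assumes a 7x7 image, loop bounds of 5 in both dimensions, reuse only along the inner loop, and a 3x3 window sum as the loop body. All names are hypothetical.

```c
#include <assert.h>

#define IMG 7    /* rows and columns 0..6 are touched */
#define STEPS 5  /* loop bounds: i < 5, j < 5 */

static int reads;  /* counts simulated memory fetches */

static int fetch(int img[IMG][IMG], int r, int c) {
    ++reads;
    return img[r][c];
}

/* Re-reads all nine window values on every iteration. */
int sweep_naive(int img[IMG][IMG], int out[STEPS][STEPS]) {
    reads = 0;
    for (int i = 0; i < STEPS; ++i)
        for (int j = 0; j < STEPS; ++j) {
            int s = 0;
            for (int di = 0; di < 3; ++di)
                for (int dj = 0; dj < 3; ++dj)
                    s += fetch(img, i + di, j + dj);
            out[i][j] = s;
        }
    return reads;
}

/* Keeps six values between iterations and fetches one new column. */
int sweep_reuse(int img[IMG][IMG], int out[STEPS][STEPS]) {
    int win[3][3];
    reads = 0;
    for (int i = 0; i < STEPS; ++i)
        for (int j = 0; j < STEPS; ++j) {
            int s = 0;
            for (int di = 0; di < 3; ++di) {
                if (j == 0) {  /* first window of a row: nine reads */
                    for (int dj = 0; dj < 3; ++dj)
                        win[di][dj] = fetch(img, i + di, dj);
                } else {       /* shift left, read one new value per row */
                    win[di][0] = win[di][1];
                    win[di][1] = win[di][2];
                    win[di][2] = fetch(img, i + di, j + 2);
                }
                s += win[di][0] + win[di][1] + win[di][2];
            }
            out[i][j] = s;
        }
    return reads;
}

/* Both sweeps must give the same sums; only the fetch counts differ. */
void check_window_reuse(void) {
    int img[IMG][IMG], a[STEPS][STEPS], b[STEPS][STEPS];
    for (int r = 0; r < IMG; ++r)
        for (int c = 0; c < IMG; ++c)
            img[r][c] = r * IMG + c;
    assert(sweep_naive(img, a) == 225);  /* 25 windows x 9 reads */
    assert(sweep_reuse(img, b) == 105);  /* 5 rows x (9 + 4 x 3) reads */
    for (int i = 0; i < STEPS; ++i)
        for (int j = 0; j < STEPS; ++j)
            assert(a[i][j] == b[i][j]);
}
```

In this model the reuse version performs fewer than half the fetches of the naive sweep while producing identical window sums.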