as a PDF - CECS - University of California, Irvine

1. Figure 3: (a) A high-level application to a hardware/software system generation; (b) Processor subsystem generation

4.4 S/W Generation Phase

This phase, shown in Figure 3(a) as a box titled "S/W generation", generates code for the target processor, taking into account the presence of AFUs. The two subtasks in the S/W generation phase are subgraph matching and subgraph replacement with ISEs. Since all possible instances of an ISE have already been enumerated by the ISE generation phase, subgraph matching simply consists of a DFG traversal that marks the constituent instructions of the ISE in the DFG.

Figure 4: The ISE here is composed of the shaded instruction nodes. (a) An example showing the LastDef point and the FirstUse point; (b1) an example where it is not possible to insert the ISE under consideration; (b2) after code restructuring; (b3) positioning of the ISE between LastDef and FirstUse

After subgraph matching, the ISE is used to replace the set of marked instructions in the DFG. We depict the ISE replacement strategy in Figure 4. An ISE can be placed anywhere between the point where its source operands have their last definition (LastDef) and the point where its destination operand has its first use (FirstUse). As shown in Figure 4(a), the shaded nodes identify the ISE under cons
2. out std_logic;
FSL0_S_DATA : in std_logic_vector(0 to 31); FSL0_S_CONTROL : in std_logic; FSL0_S_EXISTS : in std_logic;
FSL1_S_CLK : out std_logic; FSL1_S_READ : out std_logic; FSL1_S_DATA : in std_logic_vector(0 to 31); FSL1_S_CONTROL : in std_logic; FSL1_S_EXISTS : in std_logic;
FSL2_S_CLK : out std_logic; FSL2_S_READ : out std_logic; FSL2_S_DATA : in std_logic_vector(0 to 31); FSL2_S_CONTROL : in std_logic; FSL2_S_EXISTS : in std_logic;
FSL3_S_CLK : out std_logic; FSL3_S_READ : out std_logic; FSL3_S_DATA : in std_logic_vector(0 to 31); FSL3_S_CONTROL : in std_logic; FSL3_S_EXISTS : in std_logic;
FSL0_M_CLK : out std_logic; FSL0_M_WRITE : out std_logic; FSL0_M_DATA : out std_logic_vector(0 to 31); FSL0_M_CONTROL : out std_logic; FSL0_M_FULL : in std_logic;
FSL1_M_CLK : out std_logic; FSL1_M_WRITE : out std_logic; FSL1_M_DATA : out std_logic_vector(0 to 31); FSL1_M_CONTROL : out std_logic; FSL1_M_FULL : in std_logic;
AFU_en : out std_logic);
end component fsl_interface;

begin

counter_inst: counter port map (CLK => CLK, enable => count_en, counter_ticks => counter_ticks);

cut1_inst: cut1 port map (AFU_en => chip_en, data_in1 => data_in1, data_in2 => data_in2, data_in3 => data_in3, data_in4 => data_in4, data_out1 => data_out1, data_out2 => data_out2);

fsl_interface_inst: fsl_interface port map (CLK => CLK, RE
3. SRVAL A X 0 SRVAL B X 0 WRITE MODE A WRITE FIRST WRITE MODE B gt WRITE FIRST INIT_00 X C1027257FA808 102079097432013AF849F4A0FOF FF BO4A80766AFABDDF739FFD INIT 01 gt X 102040B1C337A97A6EFB77A8272040CB52497DC20EDBB9F6BCCF7FFCO9E146FF INIT 3E X 0000000000000000000000000000000000000000000000000000000000000000 INIT SF gt X 0000000000000000000000000000000000000000000000000000000000000000 This can be easily done using a simple script Make projnav projnav par do with the following content This is the script for post place and route simulation Note that VCD is destined to be generated in system vcd vmap simprim C Xilinx vhdl mti se simprim vlib work vcom 93 work work system timesim vhd vcom 93 work work testbench_par vhd vsim t ps notimingchecks sdftyp testbench uut system_timesim sdf work testbench ved file system vcd vcd add testbench uut add wave f Right click on system structure lt path gt system vhd and select Add Source Find test bench vhd in the projnav directory and click Open to add a test bench Select vhdl test bench while adding the test bench g Click on testbench behavior testbench vhd Right click on Simulate Behavioral Model in the Processes for Source partition and select properties Change the following fields e Use Custom Do File Check the selection e Use Automatic Do File Uncheck the selection e Custom Do File Click an
4. (a) A high-level application to a hardware/software system generation; (b) Processor subsystem generation
The ISE here is composed of the shaded instruction nodes. (a) An example showing the LastDef point and the FirstUse point; (b1) an example where it is not possible to insert the ISE under consideration; (b2) after code restructuring; (b3) positioning of the ISE between LastDef and FirstUse
Measuring System Power
A DP-external AFU Interface
Microblaze Processor Core with an AFU and its Interface
Communication Template for AFU Interface in Microblaze
Xilinx Multimedia Board
An ISE for ADPCM ENCODER (adpcm_e) having 4 inputs and 2 outputs; each operation node maps to a hardware component

1 Introduction

Typically, applications running on a programmable platform can be executed either as a software algorithm or on a specialized hardware unit. The software approach is the slowest but most flexible, while the hardware approach is the fastest but least flexible. Instruction-Set (IS) extensible processors comprise an emerging class of processors, especially i
5. ut std logic out std logic out std logic vector 0 to 31 out std logic in std logic ut std logic out std logic out std logic vector 0 to 31 out std logic in std logic AFU_en out std logic end fsl interface architecture behavioral of fsl interface is S IGNAL count begin FSLO M CONTROL lt FSLO_S_CLK FSL1_S_CLK FSL2_S_CLK FSL3_S_CLK FSLO_M_CLK FSLI M CLK AFU control begin if RESET count lt lt lt lt lt lt lt pr 0 FSLO S REA FSL1 S REA FSL2 S REA FSL3 S REA FSLO M WRIT QUU VU FSLO M DATA FSL1 FSL1 M WRIT M DATA lt enabling AFU operation natural range 0 to 9 ro CLK CLK CLK CLK CLK CLK ocess CLK 1 then ro ro ro ro ro others ro others E lt E lt elsif CLK event and CLK CASE WHEN count 0 FSLO S ESLI So FSL2 S ESL3 SA FSLO_M FSL1_M J count_e count lt WHEN 1 IF and FS FSLO S READ lt 1 FSL1 S READ lt 1 FSLO IS gt REA REA READ lt READ lt WRITE lt lt lt D D D D WRITE lt n lt Pss gt S EXISTS L2 S EXISTS ry is rots rts Lil Te ry s rye gt gt 117 1717 1717 Je Papst then and THEN and FSLl1 S EXISTS Initialize the counte
6. Table 1: Speedup and Code Size Reduction with the Introduction of an AFU (having 4 inputs and 2 outputs) in the Microblaze subsystem

            Core Only           Core + AFU          Code Redn
BMs         Bytes    Cycles     Bytes    Cycles     (Bytes)     Spdup
autcor      58444    264305     58452    404673     -8          0.65x
adpcm_d     12049    252688     11953    190979     96          1.32x
adpcm_e     14121    157177     13989    106821     132         1.47x
AES         16013    240613     14957    167397     1056        1.44x

Each of the operand send and result receive operations in Microblaze has a latency of 2 cycles. Consequently, the latency for transferring 6 operands is 12 cycles in the worst case and 6 cycles in the best case, i.e., if all the latencies are successfully hidden by the scheduler. The ISE generated for autcor was a chain of just three operations (a multiply, a barrel right shift and an add) having software latencies of 3, 2 and 1 cycles respectively. With the AFU operation taking just 1 cycle, the best-case latency of the ISE is 6 + 1 = 7 cycles. Thus, even the best-case performance of the ISE lags behind the worst-case performance of the corresponding software execution (3 + 2 + 1 = 6 cycles). Consequently, there was slowdown instead of speedup for autcor, owing to the communication overhead. However, some prior related work [6, 8] has shown speedup even with small-sized ISEs containing on the order of 3-4 instructions because of incurring no communication ov
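The latency argument and the speedup column of Table 1 in this excerpt can be rechecked with a few lines of Python. This is only a sketch: the function and variable names are illustrative, and it assumes that a transfer whose latency is fully hidden by the scheduler still costs one cycle, which is what reproduces the 6-cycle best case quoted for 6 operands.

```python
# Sketch of the latency and speedup arithmetic from the excerpt above.
# All names are illustrative, not taken from the report.

TRANSFER_LATENCY = 2  # cycles per operand send or result receive (from the text)


def ise_latency(n_transfers: int, afu_cycles: int, hidden: bool) -> int:
    """Cycles for one ISE invocation: operand/result transfers plus AFU time.

    Assumption: a fully hidden transfer still occupies one cycle.
    """
    per_transfer = 1 if hidden else TRANSFER_LATENCY
    return n_transfers * per_transfer + afu_cycles


# Table 1 rows: (core bytes, core cycles, core+AFU bytes, core+AFU cycles)
TABLE1 = {
    "autcor": (58444, 264305, 58452, 404673),
    "adpcm_d": (12049, 252688, 11953, 190979),
    "adpcm_e": (14121, 157177, 13989, 106821),
    "AES": (16013, 240613, 14957, 167397),
}


def speedup(benchmark: str) -> float:
    """Cycle count without the AFU divided by cycle count with it."""
    _, core_cycles, _, afu_cycles = TABLE1[benchmark]
    return core_cycles / afu_cycles


def code_reduction(benchmark: str) -> int:
    """Bytes saved by the AFU version (negative means the code grew)."""
    core_bytes, _, afu_bytes, _ = TABLE1[benchmark]
    return core_bytes - afu_bytes


software_latency = 3 + 2 + 1  # multiply, barrel right shift, add
best_case = ise_latency(6, afu_cycles=1, hidden=True)
worst_case = ise_latency(6, afu_cycles=1, hidden=False)

for name in TABLE1:
    print(f"{name:8s} speedup {speedup(name):.2f}x, code reduction {code_reduction(name)} bytes")
```

For autcor the best-case ISE latency (7 cycles) already exceeds the 6-cycle software chain, which is consistent with the slowdown reported in the table.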
7. a single instruction just replaces the set of constituent instructions. Replacing the multiply and the add with a single user-defined instruction ISE1, the resulting instruction sequence, as in Figure 4(b3), would become: 3 e 5 1 4 d ISE1 b c e 2 f a 0x2 5 g e d. However, if an ISE is represented as a set of predefined data transfer instructions (send/receive), the resulting instruction sequence after ISE replacement would appear as: 3 e 5 1 4 send b send c send e receive d 2 f a 0x2 5 g e d. After subgraph replacement with the ISE, the compiler performs scheduling, register allocation and target code generation as a back-end pass. Note that the latency of the ISE required by the scheduler is derived from the H/W generation phase, as shown in Figure 3(a).

Figure 5: Measuring System Power

4.5 Processor Subsystem Generation Phase

We show this phase in Figure 3(b). As a final step, the processor model of the target soft core, along with the AFU and its interface, is synthesized and implemented using standard synthesis and Place-and-Route tools. The executable generated in Figure 3(a) and the system synthesized in Figure 3(b) are deployed in two schemes: one for measuring speedup and the other for evaluating energy/power consumption. With the goal of measuring actual time spent in running t
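The replacement step described in this excerpt (an ISE placed between the LastDef of its source operands and the FirstUse of its result, emitted either as one instruction or as send/receive transfers) can be sketched in Python as follows. This is not the report's implementation: the three-address `Instr` form, the operation names and the example program are hypothetical, and the sketch assumes every variable is defined at most once.

```python
# Minimal sketch of ISE subgraph replacement (illustrative names throughout).
from typing import List, NamedTuple, Optional, Set, Tuple


class Instr(NamedTuple):
    dest: str
    op: str
    srcs: Tuple[str, ...]


def fmt(ins: Instr) -> str:
    return f"{ins.dest} = {ins.op}({', '.join(ins.srcs)})"


def ise_operands(instrs: List[Instr], marked: Set[int]) -> Tuple[List[str], List[str]]:
    """Inputs come from outside the ISE; outputs are consumed outside it."""
    inside = {instrs[i].dest for i in marked}
    inputs: List[str] = []
    for i in sorted(marked):
        for s in instrs[i].srcs:
            if s not in inside and s not in inputs:
                inputs.append(s)
    used_outside = {s for i, ins in enumerate(instrs) if i not in marked for s in ins.srcs}
    outputs = [instrs[i].dest for i in sorted(marked) if instrs[i].dest in used_outside]
    return inputs, outputs


def replace_with_ise(instrs: List[Instr], marked: Set[int], name: str = "ISE1",
                     as_transfers: bool = False) -> Optional[List[str]]:
    """Return the rewritten sequence, or None if no legal position exists."""
    inputs, outputs = ise_operands(instrs, marked)
    unmarked = [(i, ins) for i, ins in enumerate(instrs) if i not in marked]
    # LastDef: latest unmarked instruction defining an ISE input.
    last_def = max((i for i, ins in unmarked if ins.dest in inputs), default=-1)
    # FirstUse: earliest unmarked instruction reading an ISE output.
    first_use = min((i for i, ins in unmarked if set(ins.srcs) & set(outputs)),
                    default=len(instrs))
    if last_def >= first_use:
        return None  # the Figure 4(b1) situation: code must be restructured first
    if as_transfers:
        ise = [f"send {v}" for v in inputs] + [f"receive {v}" for v in outputs]
    else:
        ise = [f"{', '.join(outputs)} = {name}({', '.join(inputs)})"]
    before = [fmt(ins) for i, ins in unmarked if i <= last_def]
    after = [fmt(ins) for i, ins in unmarked if i > last_def]
    return before + ise + after


# Hypothetical program: the multiply (index 0) and the add (index 3) form the ISE.
program = [
    Instr("t", "mul", ("b", "c")),
    Instr("f", "shl", ("a", "0x2")),
    Instr("e", "add", ("a", "k")),
    Instr("d", "add", ("t", "e")),
    Instr("g", "sub", ("e", "d")),
]
single = replace_with_ise(program, {0, 3})
transfers = replace_with_ise(program, {0, 3}, as_transfers=True)

# An output of the ISE feeds the definition of one of its own inputs: no legal slot.
blocked = replace_with_ise(
    [Instr("t", "mul", ("b", "c")), Instr("e", "add", ("t", "k")), Instr("d", "add", ("t", "e"))],
    {0, 2},
)
```

The sketch places the ISE immediately after LastDef; any position up to FirstUse would be equally legal, and a real scheduler would choose among them.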
8. designers need to determine if this speedup comes at a price of increased power. This report shows that increased performance can also reduce both power and energy of a customizable processor in the presence of an AFU, and reports the effects on code size and area. It is predicted [17] that by 2010 over one third of all PLD/FPGA devices will have microprocessor cores, up from 15% today. Xilinx Microblaze [10] is a popular commercially available soft core. We demonstrate the use of our framework by transforming a given input application into a running Xilinx Microblaze hardware-software system. For four real-life applications from the Mediabench and EEMBC suites, we measure the real performance gain over pure software execution and also accurately evaluate energy and power consumption. Our experimental results show that significant speedup is obtained only when an ISE contains a large set of atomic operations. With only one large ISE per application, we obtained speedup of up to 1.47x over simple software execution and simultaneously up to 40% energy saving and 12% peak power reduction. To the best of our knowledge, this is also the first attempt to present the details of interfacing an AFU with a customizable soft core. The main contributions highlighted in this report are the following:

• We present a generalized interface-aware soft processor customization framework for mapping an application in C into a running processor-AFU subsystem tha
9. do f Double click Simulate Behavioral Model to run the behavioral simulation 4 Structural Timing Simulation using ModelSim a Right click on Project Project name gt and click to Mark to Initialize BRAM b Right click on Default microblaze 0 xmdstub and click to un select Mark to initialize BRAM c Invoke Tools Sim Model Generation This populates the simulation structural directory d The file system init vhd contains the memory map of the executable A part of it looks like the following configuration braml conf of braml wrapper is for STRUCTURE e for braml braml_elaborate for STRUCTURE for ramb16_s1_s1_0 ramb16 s1 s1 use entity unisim rambl6 sl sl ramb16 sl sl v generic map INIT 00 gt X C102125AF2808102049087432010AA84154A021FFCF04ACOGDE65996B4FDE57F INIT 01 X 102040B1C26DSBF87EFB72A82420409D17492DC2074FB95734CFFFE508A183FF INIT 3E gt X 0000000000000000000000000000000000000000000000000000000000000000 INIT 3F X 0000000000000000000000000000000000000000000000000000000000000000 end for end for end for end for end braml conf The corresponding section in the work directory system timesim vhd is empty Superim pose this memory section from system init vhd into work directory gt system timesim vhd so that the corresponding BRAM section of the latter looks like the following ramb16 s1 s1 2 X RAMB16 SIS generic map INIT A X 0 INIT B gt X 0
10. eq 32 ip en in std logic std logic vector 0 to 31 std logic vector 0 to 31 in std logic vector 0 to 31 in std logic vector 0 to 31 out std logic vector 0 to 31 t mux eq 32 x leq 32 ip en in std logic std logic vector 0 to 31 std logic vector 0 to 31 in std logic vector 0 to 31 in std logic vector 0 to 31 out std logic vector 0 to 31 t mux leq 32 x geq 32 ip en in std logic std logic vector 0 to 31 std logic vector 0 to 31 in std logic vector 0 to 31 in std logic vector 0 to 31 out std logic vector 0 to 31 end component mux geq 32 begin cn cn cn cn cn cn cn cn cn logic st 0 lt b 0000 0000 0000 0000 0000 0000 0000 0000 st 1 lt b 0000 0000 0000 0000 0000 0000 0000 0001 st 2 b 0000 0000 0000 0000 0000 0000 0000 0010 st 3 lt b 0000 0000 0000 0000 0000 0000 0000 0011 st 4 lt b 0000 0000 0000 0000 0000 0000 0000 0100 st 7 lt b 0000 0000 0000 0000 0000 0000 0000 0111 st 8 lt b 0000 0000 0000 0000 0000 0000 0000 1000 st 32767 lt b 0000 0000 0000 0000 0111 1111 1111 1111 st minus 32768 lt b 1111 1111 1111 1111 1000 0000 0000 0000 and_32_1 and_32 port map chip_en gt AFU_en data_inl gt data_inl data
11. in2 => cnst_7, data_out => sig1);

and_32_2: and_32 port map (chip_en => AFU_en, data_in1 => sig1, data_in2 => cnst_2, data_out => sig2);

and_32_3: and_32 port map (chip_en => AFU_en, data_in1 => sig1, data_in2 => cnst_4, data_out => sig3);

and_32_4: and_32 port map (chip_en => AFU_en, data_in1 => sig1, data_in2 => cnst_1, data_out => sig4);

brs_1: barrel_right_shifter port map (chip_en => AFU_en, data_in => data_in2, shift_amnt => cnst_3, data_out => sig5);

add_32_1: add_32 port map (chip_en => AFU_en, data_in1 => sig5, data_in2 => data_in2, data_out => sig6);

mux_eq_32_1: mux_eq_32 port map (chip_en => AFU_en, cond1 => sig3, cond2 => cnst_0, data_in1 => sig5, data_in2 => sig6, data_out => sig8);

brs_2: barrel_right_shifter port map (chip_en => AFU_en, data_in => data_in2, shift_amnt => cnst_1, data_out => sig7);

add_32_2: add_32 port map (chip_en => AFU_en, data_in1 => sig7, data_in2 => sig8, data_out => sig9);

mux_eq_32_2: mux_eq_32 port map (chip_en => AFU_en, cond1 => sig2, cond2 => cnst_0, data_in1 => sig8, data_in2 => sig9, data_out => sig11);

brs_3: barrel_right_shifter port map (chip_en => AFU_en, data_in => data_in2, shift_amnt => cnst_2, data_out => sig10);

add_32_3: add_32 port map (
12. steps. We then derive the total energy dissipated in the system from the reported power and the measured execution time. Now we apply our processor customization framework to generate a real system.

5 Communication Template for Xilinx Microblaze

Xilinx Microblaze [10] is a soft core with a DP-external AFU interface, as shown in Figure 6. We demonstrate the utility of our framework by transforming a given input application into a running Microblaze hardware-software system. Microblaze has a DP-external AFU that is connected with the processor via Fast Simplex Links (FSLs). FSLs are dedicated point-to-point unidirectional 32-bit wide FIFO interfaces. The Microblaze is capable of including a maximum of 8 input and 8 output FSLs.

Figure 7: Microblaze Processor Core with an AFU and its Interface

Microblaze is a 32-bit RISC processor with a simple 3-stage pipeline. Figure 7 shows an AFU and its interfacing with the Microblaze processor core via 8 x 8 FSL channels. The AFU interface implements the processor-AFU communication protocol and is synchronous with the Microblaze processor through a global clock, CLK. The AFU interface is also connected to a counter module to enable counting whenever required. If the count enable signal Cnt_en is 1, counting is enabled. Otherwise, the counter is reset to 0. The signals In[32] and Out[32] are used to send data to and receive data from the AFU respectively. When the AFU enable signal
13. the structural AFU model presented in Appendix E and the communication template presented in Appendix D. The AFU with its interface for the adpcm_d example is presented as follows:

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_ARITH.ALL;
use IEEE.STD_LOGIC_UNSIGNED.ALL;
library unisim;
use unisim.vcomponents.all;

entity my_fsl is
Port (CLK : in std_logic; -- System clock
RESET : in std_logic;
FSL0_S_CLK : out std_logic; FSL0_S_READ : out std_logic; FSL0_S_DATA : in std_logic_vector(0 to 31); FSL0_S_CONTROL : in std_logic; FSL0_S_EXISTS : in std_logic;
FSL1_S_CLK : out std_logic; FSL1_S_READ : out std_logic; FSL1_S_DATA : in std_logic_vector(0 to 31); FSL1_S_CONTROL : in std_logic; FSL1_S_EXISTS : in std_logic;
FSL2_S_CLK : out std_logic; FSL2_S_READ : out std_logic; FSL2_S_DATA : in std_logic_vector(0 to 31); FSL2_S_CONTROL : in std_logic; FSL2_S_EXISTS : in std_logic;
FSL3_S_CLK : out std_logic; FSL3_S_READ : out std_logic; FSL3_S_DATA : in std_logic_vector(0 to 31); FSL3_S_CONTROL : in std_logic; FSL3_S_EXISTS : in std_logic;
FSL0_M_CLK : out std_logic; FSL0_M_WRITE : out std_logic; FSL0_M_DATA : out std_logic_vector(0 to 31); FSL0_M_CONTROL : out std_logic; FSL0_M_FULL : in std_logic;
FSL1_M_CLK : out std_logic; FSL1_M_WRITE : out std_logic; FSL1_M_DATA : out std_logic_vector
14. tribute syn_noprune of <component name> : component is true; The attribute statements pertained to XST, and Synplify Pro would simply ignore them, so the black-box constraints are specified in Synplicity syntax. If the system does not have any output, the synthesis phase would prune all the components. This is prevented by using the syn_noprune attribute.
(g) Right-click on system-structure (<path>/system.vhd) and select Add Source. Find system.ucf in the data directory and click Open to add constraints.
(h) Select system-structure (<path>/system.vhd). Double-click Synthesize - Synplify Pro in the Processes for Source section to run synthesis. Alternatively, double-click Generate Programming File directly, which includes running synthesis and Place and Route.
(i) Double-click Implement Design to perform Place and Route of the design.
(j) Double-click Generate Programming File to generate the bitmap file.
(k) Go back to XPS. Select Tools -> Import from ProjNav and import the following files: (i) BIT file: <path to work directory>/projnav/system.bit and (ii) BMM file: <path to work directory>/implementation/system_bd.bmm.
(l) From the XPS menu, run Tools -> Update Bitstream.

9. Compiling the software: Run Tools -> Build All User Applications. Check whether the size of executable.elf is less than 64 KB. Recall that the memory allocated for both data and instruction was 64 KB. Also note that the maximum usable space in 56 BRAMS
15. (0 to 31);
FSL1_M_CONTROL : out std_logic;
FSL1_M_FULL : in std_logic);
end my_fsl;

architecture IMP of my_fsl is

signal count_en : std_logic; -- enabling the counter
signal chip_en : std_logic;
signal counter_ticks : std_logic_vector(0 to 1); -- signal from the counter
signal data_in1, data_in2, data_in3, data_in4 : std_logic_vector(0 to 31);
signal data_out1, data_out2 : std_logic_vector(0 to 31);

component counter
port (CLK : IN std_logic;
enable : IN std_logic;
counter_ticks : OUT std_logic_vector(0 to 1));
end component counter;

component cut1
port (AFU_en : IN std_logic;
data_in1 : IN std_logic_vector(0 to 31);
data_in2 : IN std_logic_vector(0 to 31);
data_in3 : IN std_logic_vector(0 to 31);
data_in4 : IN std_logic_vector(0 to 31);
data_out1 : OUT std_logic_vector(0 to 31);
data_out2 : OUT std_logic_vector(0 to 31));
end component cut1;

component fsl_interface
port (CLK : in std_logic;
RESET : in std_logic;
count_en : out std_logic;
counter_ticks : in std_logic_vector(0 to 1);
data_in1 : out std_logic_vector(0 to 31);
data_in2 : out std_logic_vector(0 to 31);
data_in3 : out std_logic_vector(0 to 31);
data_in4 : out std_logic_vector(0 to 31);
data_out1 : in std_logic_vector(0 to 31);
data_out2 : in std_logic_vector(0 to 31);
FSL0_S_CLK : out std_logic;
FSL0_S_READ
16. 1 had 8 instances in the critical basic block, covering more than 50% of the DFG, and overall 12 instances in the critical function. Both the large size and the large-scale reuse (as defined in [1]) of the ISE account for the significant speedup (1.44x) obtained on AES despite the overhead in sending and receiving operands. Along with the merit of speedup, AES also exhibits a 7% code size reduction, owing to replacement of a large chunk of code by an ISE in the form of a set of data transfer instructions.

7.2 Power and Energy Results

From Table 2, it is evident that both the peak power (P Pwr) and the average power (A Pwr) reduced with the introduction of the AFU. Because the presence of both core and AFU apparently indicates more circuit activity, an initial expectation is increased power with the addition of the AFU. However, because the ISE here is a multi-cycle operation interlocked with the Microblaze pipeline, the AFU operation completely overlaps with a processor pipeline stall. Consequently, we obtain an overall power reduction in the presence of AFU operation, owing to reduced overall circuit activity. As shown in Table 3, we also obtained up to 40% saving in energy on account of reduced application runtime. It is interesting to note that the trend of energy decrease or increase exactly follows that of speedup (shown again in Table 3 for the sake of comparison). This trend can be expected as a corollary to

Table 3: Energy Benefits of ISEs in the Mi
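Why the energy trend tracks speedup can be made concrete with a short calculation: energy is average power times runtime, and runtime is the cycle count divided by the clock frequency. The sketch below uses the adpcm_e and autcor cycle counts from Table 1 and the 50 MHz clock from the experimental setup; the power figures are hypothetical placeholders, since Table 2 is not reproduced in these excerpts, and the function names are illustrative.

```python
# Sketch: energy = average power x runtime, runtime = cycles / clock frequency.
# Power values below are hypothetical placeholders, not results from the report.

CLOCK_HZ = 50e6  # Microblaze clock frequency used in the experiments


def runtime_s(cycles: int, clock_hz: float = CLOCK_HZ) -> float:
    """Execution time in seconds for a measured cycle count."""
    return cycles / clock_hz


def energy_j(avg_power_w: float, cycles: int, clock_hz: float = CLOCK_HZ) -> float:
    """Energy in joules from average power and cycle count."""
    return avg_power_w * runtime_s(cycles, clock_hz)


def energy_saving(p_core_w: float, cycles_core: int,
                  p_afu_w: float, cycles_afu: int) -> float:
    """Fractional energy saving of the core+AFU system over the core alone."""
    return 1.0 - energy_j(p_afu_w, cycles_afu) / energy_j(p_core_w, cycles_core)


# adpcm_e cycle counts from Table 1; 1.00 W and 0.95 W are made-up power values.
saving_adpcm_e = energy_saving(1.00, 157177, 0.95, 106821)

# With equal power, the saving depends on speedup alone: 1 - 1/speedup.
saving_equal_power = energy_saving(1.00, 157177, 1.00, 106821)

# A slowdown (autcor) turns the saving negative unless power drops enough to compensate.
saving_autcor = energy_saving(1.00, 264305, 1.00, 404673)
```

Under these assumptions, whenever average power does not rise, any speedup above 1 yields an energy saving and any slowdown yields an energy increase.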
17. AFU_en : in std_logic;
data_in1 : in std_logic_vector(0 to 31);
data_in2 : in std_logic_vector(0 to 31);
data_in3 : in std_logic_vector(0 to 31);
data_in4 : in std_logic_vector(0 to 31);
data_out1 : out std_logic_vector(0 to 31);
data_out2 : out std_logic_vector(0 to 31));
end cut1;

architecture logic of cut1 is

signal sig1, sig2, sig3, sig4, sig5, sig6 : std_logic_vector(0 to 31);
signal sig7, sig8, sig9, sig10, sig11, sig12 : std_logic_vector(0 to 31);
signal sig13, sig14, sig15, sig16, sig17, sig18 : std_logic_vector(0 to 31);
signal cnst_0, cnst_1, cnst_2, cnst_3, cnst_4, cnst_7, cnst_8 : std_logic_vector(0 to 31);
signal cnst_32767 : std_logic_vector(0
18. AFU_en is 1, the AFU latches the output in Out[32]. In Figure 8, we present the generic communication template for Microblaze-AFU interaction as a Finite State Machine (FSM) synchronous with respect to CLK. For the sake of explanation, we call an FSL channel FSL_R when it is used for an AFU read operation, or FSL_W when it is used for an AFU write operation. Associated with every FSL_R channel is a set of three signals, namely {FSL_READ_SIG, FSL_DATA_EXISTS, FSL_IN_DATA[32]}. Another triplet, {FSL_WRITE_SIG, FSL_FIFO_FULL, FSL_OUT_DATA[32]}, is associated with every FSL_W channel. The FSM is initially in the Input Sync state, waiting for data to arrive on an FSL_R channel. When data exists on the FSL channel, the corresponding FSL_DATA_EXISTS signal goes high, causing a transition from the Input Sync state to the Input Read state. In the Input Read state, FSL_READ_SIG is set to high to cause the data in the FSL_R FIFO to be read into

[Figure 8 diagram. States and actions: Input Sync (FSL_READ_SIG <= 0, FSL_WRITE_SIG <= 0, Cnt_en <= 0); Input Read (FSL_READ_SIG <= 1, In <= FSL_IN_DATA, AFU_en <= 0); Output Sync (FSL_READ_SIG <= 0, AFU_en <= 1, Cnt_en <= 1); Output Write (FSL_OUT_DATA <= Out, FSL_WRITE_SIG <= 1, AFU_en <= 0). Transitions: Input Sync waits while FSL_DATA_EXISTS is low and moves to Input Read when it is high; Input Read moves to Output Sync unconditionally; Output Sync moves to Output Write once the count reaches Cycles and the FSL FIFO is not full, and waits otherwise; Output Write returns to Input Sync unconditionally.]
19. Figure 8: Communication Template for AFU Interface in Microblaze

In[32] using a 32-bit signal array FSL_IN_DATA. After the data has been read into In[32], the FSM transitions to the Output Sync state and waits on the AFU operation by enabling the counter. After Cycles (as evaluated in the H/W generation phase in Figure 3(a)) has elapsed, the result of the AFU operation is latched in Out[32]. If the FSL_W FIFO is not full, i.e., FSL_FIFO_FULL is low, a state transition takes place to the Output Write state. In the Output Write state, data from Out[32] is written into the FSL_W FIFO using FSL_OUT_DATA[32] by setting FSL_WRITE_SIG to high. Thus, for introducing every new AFU, only the AFU module in Figure 7 and the Cycles change in the process of H/W generation, while the communication template is reused.

6 Experiments

We first describe our experimental setup in detail and then present the experimental results.

6.1 Experimental Setup

The ISE generation algorithm ISEGEN [1] was integrated with a MACHSUIF [9] front end. The S/W generation was done with the Microblaze GCC 2.95 (mb-gcc) compiler. The Microblaze instruction set has multiple data transfer instructions for sending data to and receiving data from its FSL channels: put for sending and get for receiving data in blocking mode, and nput/nget are the corresponding instructions in non-blocking mode. We used the non-blocking send instruction nput and the blo
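The four-state communication template described for Figure 8 can be exercised with a small cycle-level model. The Python sketch below is a simplification under stated assumptions: it models a single FSL_R channel and a single FSL_W channel, represents each FIFO as a `deque`, and treats the AFU as an arbitrary Python callable; the class name, the `out_depth` parameter and the example AFU function are illustrative and do not come from the report's VHDL.

```python
# Cycle-level sketch of the Input Sync / Input Read / Output Sync / Output Write FSM.
# Single read channel and single write channel only; names are illustrative.
from collections import deque
from enum import Enum, auto
from typing import Callable, Optional


class State(Enum):
    INPUT_SYNC = auto()
    INPUT_READ = auto()
    OUTPUT_SYNC = auto()
    OUTPUT_WRITE = auto()


class AfuInterface:
    def __init__(self, afu: Callable[[int], int], cycles: int, out_depth: int = 16):
        self.afu = afu                # stands in for the AFU datapath
        self.cycles = cycles          # AFU latency from the H/W generation phase
        self.out_depth = out_depth    # capacity of the FSL_W FIFO (assumed value)
        self.fsl_r: deque = deque()   # processor -> AFU FIFO
        self.fsl_w: deque = deque()   # AFU -> processor FIFO
        self.state = State.INPUT_SYNC
        self.count = 0
        self.in_reg: Optional[int] = None
        self.out_reg: Optional[int] = None

    def tick(self) -> None:
        """Advance the FSM by one CLK edge."""
        if self.state is State.INPUT_SYNC:
            self.count = 0                      # Cnt_en = 0 resets the counter
            if self.fsl_r:                      # FSL_DATA_EXISTS is high
                self.state = State.INPUT_READ
        elif self.state is State.INPUT_READ:
            self.in_reg = self.fsl_r.popleft()  # FSL_READ_SIG = 1, In <= FSL_IN_DATA
            self.state = State.OUTPUT_SYNC
        elif self.state is State.OUTPUT_SYNC:
            if self.count < self.cycles:        # Cnt_en = 1, AFU_en = 1
                self.count += 1
            if self.count == self.cycles:
                self.out_reg = self.afu(self.in_reg)      # result latched in Out
                if len(self.fsl_w) < self.out_depth:      # FSL_FIFO_FULL is low
                    self.state = State.OUTPUT_WRITE
        elif self.state is State.OUTPUT_WRITE:
            self.fsl_w.append(self.out_reg)     # FSL_WRITE_SIG = 1
            self.state = State.INPUT_SYNC


# Usage: a 2-cycle AFU that multiplies its operand by 4 (hypothetical function).
iface = AfuInterface(afu=lambda x: x * 4, cycles=2)
iface.fsl_r.append(5)
for _ in range(5):
    iface.tick()
```

Because only `afu` and `cycles` are parameters, the sketch mirrors the point made in the text: introducing a new AFU changes the AFU module and the Cycles value, while the template itself is reused.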
20. Processor Customization on a Xilinx Multimedia Board

Partha Biswas, Sudarshan Banerjee and Nikil Dutt

CECS Technical Report 06-04
Center for Embedded Computer Systems, School of Information and Computer Science, University of California, Irvine, CA 92697, USA
Mar 12, 2006

Abstract

Performance of applications can be boosted by executing application-specific Instruction Set Extensions (ISEs) on specialized hardware coupled with a processor core. Many commercially available customizable processors have communication overheads in their interface with the specialized hardware. However, existing ISE generation approaches have not considered customizable processors that have communication overheads at their interface. Furthermore, they have not characterized the energy benefits of such ISEs. This report presents a soft processor customization framework that takes an input C application and realizes a customized processor, capturing the microarchitectural details of its interface with the specialized unit. The speedup, energy, power and code size benefits of the ISE approach were accurately evaluated on a real system implementation by applying the design flow to a popular Xilinx Microblaze soft processor core synthesized for four real-life applications. It was found that only one large ISE per application is sufficient to get an average 1.41x speedup over pure software execution in spite of incurring communication overheads. Finally, a simultane
21. SET => RESET,
count_en => count_en,
counter_ticks => counter_ticks,
data_in1 => data_in1, data_in2 => data_in2, data_in3 => data_in3, data_in4 => data_in4,
data_out1 => data_out1, data_out2 => data_out2,
FSL0_S_CLK => FSL0_S_CLK, FSL0_S_READ => FSL0_S_READ, FSL0_S_DATA => FSL0_S_DATA, FSL0_S_CONTROL => FSL0_S_CONTROL, FSL0_S_EXISTS => FSL0_S_EXISTS,
FSL1_S_CLK => FSL1_S_CLK, FSL1_S_READ => FSL1_S_READ, FSL1_S_DATA => FSL1_S_DATA, FSL1_S_CONTROL => FSL1_S_CONTROL, FSL1_S_EXISTS => FSL1_S_EXISTS,
FSL2_S_CLK => FSL2_S_CLK, FSL2_S_READ => FSL2_S_READ, FSL2_S_DATA => FSL2_S_DATA, FSL2_S_CONTROL => FSL2_S_CONTROL, FSL2_S_EXISTS => FSL2_S_EXISTS,
FSL3_S_CLK => FSL3_S_CLK, FSL3_S_READ => FSL3_S_READ, FSL3_S_DATA => FSL3_S_DATA, FSL3_S_CONTROL => FSL3_S_CONTROL, FSL3_S_EXISTS => FSL3_S_EXISTS,
FSL0_M_CLK => FSL0_M_CLK, FSL0_M_WRITE => FSL0_M_WRITE, FSL0_M_DATA => FSL0_M_DATA, FSL0_M_CONTROL => FSL0_M_CONTROL, FSL0_M_FULL => FSL0_M_FULL,
FSL1_M_CLK => FSL1_M_CLK, FSL1_M_WRITE => FSL1_M_WRITE, FSL1_M_DATA => FSL1_M_DATA, FSL1_M_CONTROL => FSL1_M_CONTROL, FSL1_M_FULL => FSL1_M_FULL,
AFU_en => chip_en);

end IMP;
22. all entity fsl interface is Port CLK in std logic System clock RESET in std logic data inl out std logic vector 0 to 31 data_in2 out data_in3 out data_in4 out data_outl in data_out2 in count en out Signal from counter ticks FSLO FSLO FSLO FSLO FSLO FSLI FSLI FSLI FSLI FSL std logic vector 0 to 31 std logic vector 0 to 31 std logic vector 0 to 31 std logic vector 0 to 31 std logic vector 0 to 31 std logic enabling the counter the counter in std logic vector 0 to 1 _S CLK out std logic 9 READ _S_DATA _S_CONTROL S EXISTS out std_logic in std logic vector 0 to 31 in std logic in std logic _S CLK out std logic 9 READ _S_DATA _S_CONTROL _S_ EXISTS FSL2 FSL2 FSL2 FSL2 FSL2 FSL3 FSL3 FSL3 FSL3 FSL3 FSLO FSLO FSLO FSLO FSLO FSLI FSLI FSLI FSLI FSLI out std_logic in std logic vector 0 to 31 in std_logic in std_logic _S CLK out std logic 9 READ _S_DATA _S_CONTROL _S_EXISTS out std_logic in std logic vector 0 to 31 in std logic in std logic _S CLK out std logic 9 READ _S_DATA _S_CONTROL _S_EXISTS _M CLK o _M WRITE _M DATA M CONTROL M FULL _M CLK o _M WRITE _M DATA M CONTROL _M FULL out std logic in std logic vector 0 to 31 in std logic in std logic
23. ation Model as Behavioral Set appropriate paths for the simulation libraries as follows Check the installation directories of the ModelSim libraries EDK Library C Xilinx vhdl mti_se edklib Xilinx Library C Xilinx vhdl mti se c Right click on Project lt Project name gt and make sure Mark to Initialize BRAM is se lected d Right click on Default microblaze 0 xmdstub and make sure Mark to initialize BRAM is un selected 3 Behavioral Simulation using ModelSim a From XPS invoke Tools Sim Model Generation which populates simulation behavioral directory Modify simulation behavioral system init vhd by commenting the last few lines as follows configuration systemN conf of system is for STRUCTURE for all braml wrapper use configura tion work bramlN conf end for end for end system conf b Now from Project Navigator add projnav testcase vhd with the following content TestBench T LIBRARY ieee emplate USE ieee std logic 1164 ALL USE ieee num ENTITY testb END testbenc eric std ALL ench IS h ARCHITECTURE behavior OF testbench IS Component Declaration COMPONENT system PORT sys clk IN std logic sys rst IN std logic END COMPONENT SIGNAL clk std logic SIGNAL rst std logic BEGIN Component Instantiation uut system PORT MAP SyS Clk gt clk sy
24. ch infrastructure. Using a simulator, the authors show speedup for applications that reuse AFUs generated for other applications in the same domain. Such reuse of AFUs across applications is possible only when the ISEs found are reasonably small in size. However, we will confirm in our experimental results that such small-sized ISEs would not generate a considerable speedup for AFUs with communication overheads. Sun et al. [6] employ a Tensilica Instruction Extension (TIE) compiler in their methodology and operate at a higher (C source code) level of abstraction. Therefore, this methodology relies more on the designer's experience for ISE identification and mapping to AFUs. The AFU in this case therefore does not have any communication overhead. Fei et al. [7] integrated a fairly accurate energy estimation engine in the same framework, but they do not report a comparison of energy before and after extending the processor. A recent work having a goal of real system implementation [8] generated application-specific instructions for the Altera Nios II processor in the presence of AFUs that do not have communication overheads. The results show a good speedup and limited area overhead, but they do not discuss energy or power consumption. Unlike [8], in this report we deal with the non-trivial details of synchronization between the processor and the AFU with the help of a generic communication template. Note that in the prior related work, the AFU in general did not have
25. chip_en => AFU_en, data_in1 => sig11, data_in2 => sig10, data_out => sig12);

mux_eq_32_3: mux_eq_32 port map (chip_en => AFU_en, cond1 => sig4, cond2 => cnst_0, data_in1 => sig11, data_in2 => sig12, data_out => sig13);

sub_32_1: sub_32 port map (chip_en => AFU_en, data_in1 => data_in3, data_in2 => sig13, data_out => sig15);

add_32_4: add_32 port map (chip_en => AFU_en, data_in1 => data_in3, data_in2 => sig13, data_out => sig14);

and_32_5: and_32 port map (chip_en => AFU_en, data_in1 => data_in1, data_in2 => cnst_8, data_out => sig16);

mux_eq_32_4: mux_eq_32 port map (chip_en => AFU_en, cond1 => sig16, cond2 => cnst_0, data_in1 => sig14, data_in2 => sig15, data_out => sig17);

mux_leq_32_1: mux_leq_32 port map (chip_en => AFU_en, cond1 => sig17, cond2 => cnst_32767, data_in1 => sig17, data_in2 => cnst_32767, data_out => sig18);

mux_geq_32_1: mux_geq_32 port map (chip_en => AFU_en, cond1 => sig18, cond2 => cnst_minus_32768, data_in1 => sig18, data_in2 => cnst_minus_32768, data_out => data_out1);

mult_32_1: mult_32 port map (chip_en => AFU_en, data_in1 => data_in1, data_in2 => cnst_4, data_out => data_out2);

end logic;

F AFU with its Interface for adpcm_d

The AFU with its interface that is captured in my_fsl glues together
26. cking receive instruction (get) for our AFU interface. Because we used two different compilers for ISE generation and S/W generation, the subgraph replacement with ISEs was done as a post-assembly pass on the assembly output of mb-gcc. After replacing the identified subgraphs with ISEs, mb-gcc was run again to generate the executable. We selected four real-life applications for demonstrating the effectiveness of our framework: autcor (auto-correlation from the EEMBC suite), adpcm e (ADPCM Encoder) and adpcm d (ADPCM Decoder) from the Mediabench suite, and AES (AES encryption). Our platform is the Xilinx Multimedia Board, which is equipped with a Virtex II XC2V2000 FPGA. Figure 9 shows a snapshot of the board. We used Xilinx Platform Studio for configuring the FPGA to include a Microblaze processor with a 64 KB (i.e., the maximum size possible) Block RAM (BRAM), two Local Memory Buses (LMBs) to interface with the BRAM (one for instruction and the other for data), one Microblaze Debugging Manager (MDM), and one Timer (both MDM and Timer on a single On-chip Peripheral Bus (OPB)).

Figure 9: Xilinx Multimedia Board

The standard inputs and outputs of an application were redirected to the MDM, and the elapsed number of cycles was evaluated using the Timer. We set the clock frequency of the Microblaze processor to 50 MHz. The tools used in the second scheme (Figure 5) for evaluating ene
27. communication overheads at its interface. Indeed, there are many commercially available processors providing such an interface. Common examples are the Altera Nios II processor [13], the LEON processor [12], etc. However, there are similarly many commercial customizable processors where AFUs incur overhead in sending and retrieving data. Some examples include the STMicroelectronics ST120 [11], the Xilinx Microblaze processor [10], etc. To the best of our knowledge, ISE generation in the context of AFUs incurring communication overheads at their interface with the core processor has not been studied yet. This is our motivation for proposing a framework that is capable of incorporating different AFU models and, in particular, targeting the Xilinx Microblaze soft core. We apply the design flow of our framework to study performance gain, energy and power consumption, code size reduction, and area overhead with the introduction of an AFU into the Microblaze subsystem.

4 Framework for Complete System Realization

Our framework takes as input a high-level application in C and generates an executable and an AFU with the appropriate interfacing protocol, as shown in Figure 2. The executable runs in the processor core as software containing ISEs for invoking the AFU operation in hardware. Our target for running the complete processor-AFU subsystem is an FPGA platform.
28. croblaze subsystem

BMs        Tot Energy (uJ)    Tot Energy (uJ)    %age Saving    Spdup
           for Core Only      for Core + AFU
autcor      2.21               3.10              -40.27         0.65x
adpcm d     8.48               5.84               31.13         1.32x
adpcm e    10.54               6.34               39.85         1.47x
AES        69.09              43.69               36.76         1.44x

a consistent power reduction shown in Table 2. Thus, contrary to conventional expectation, enhanced performance simultaneously results in reduced power and energy for the customized Microblaze soft core.

7.3 Slices Utilization

The XC2V2000 FPGA that we use as our target platform has 10752 slices. Table 4 shows the percentage utilization of the FPGA slices before and after introducing the AFU that brought the speedup in Table 1.

Table 4: Slices Utilization (out of 10752) in the absence of an AFU and in the presence of an AFU for the four applications in the XC2V2000 FPGA

BMs         No AFU    autcor    adpcm d    adpcm e    AES
Slices      1274      1609      1804       2226       2043
Util (%)    11        14        16         20         19

Note that the XC2V2000 used here is very small. The largest Virtex II chip, the XC2V8000, contains 46592 slices. If the largest FPGA is used instead of the XC2V2000, the average slices utilization reduces to only 5%, which is very reasonable. Thus, the area overhead of including an AFU in the Microblaze subsystem is also minimal.

8 Summary and Future Directions

Applications can be accelerated in a programmable processor by executing their performance cri
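The saving and utilization columns in this fragment can be cross-checked with a few lines of Python. This is a minimal sketch, assuming the energy entries are decimal values whose points were lost in extraction (for example 8.48 and 5.84 uJ for adpcm d) and that the utilization percentages are truncated rather than rounded; the dictionary and function names are illustrative and do not come from the report.

```python
# Energy in microjoules per benchmark: (core only, core + AFU),
# read from the energy table with decimal points restored (assumption).
ENERGY_UJ = {
    "autcor": (2.21, 3.10),
    "adpcm d": (8.48, 5.84),
    "adpcm e": (10.54, 6.34),
    "AES": (69.09, 43.69),
}

# Slice counts from the slices-utilization table.
SLICES = {"No AFU": 1274, "autcor": 1609, "adpcm d": 1804,
          "adpcm e": 2226, "AES": 2043}
XC2V2000_SLICES = 10752


def percent_saving(before: float, after: float) -> float:
    """Relative saving in percent; negative when consumption increased."""
    return 100.0 * (before - after) / before


def utilization_percent(used: int, total: int) -> int:
    """Utilization in whole percent, truncated (assumption that matches the table)."""
    return (used * 100) // total


savings = {name: round(percent_saving(b, a), 2)
           for name, (b, a) in ENERGY_UJ.items()}
utilization = {name: utilization_percent(n, XC2V2000_SLICES)
               for name, n in SLICES.items()}
```

Under these assumptions the computed savings reproduce the tabulated magnitudes, and the autcor entry comes out negative, consistent with its 0.65x slowdown.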
29. cts pdf datasheets se pdf
[16] Xilinx XPower Documentation. http toolbox xilinx com docsan xilinx6 books data docs dev dev0089 14 html
[17] Panelists peer into future of FPGAs. Article 60407325, EETimes, March 7, 2005.

A System Realization on Xilinx Multimedia Board

Here we present the detailed steps to realize a basic hardware/software subsystem, with the hardware consisting of the Microblaze processor, local memory bus, BRAM, timer, and MDM, and the software being the Microblaze executable.

1. Invoke Xilinx Platform Studio (XPS) 6.2i or higher.
2. Click File -> New Project -> Platform Studio. The settings for Create New Project are as follows:
   Project File: <Path to work directory>/system.xmp
   Target Device: Architecture: virtex2, Device Size: xc2v2000, Package: ff896, Speed Grade: -6 (default)
   Click OK and then answer Yes for "Do you want to start with an empty MHS File?" Then click OK for the comment "Project -> Add/Edit Cores".

Setting up the hardware: Under the System tab, right click on System BSP and select Add/Edit Cores.
a. Add the following peripherals:
   • microblaze (1)
   • bram_block (1)
   • lmb_bram_if_cntlr (2: 1 for data, 1 for instruction). Base Address: 0x00000000, High Address: 0x0000ffff. Memory allocated (both for data and instruction): 64 KB
   • opb_mdm (1). Base Address: 0xffff0400, High Address: 0xffff04ff
   • opb_timer (1). Base Address: 0xffff0800, High Address: 0xffff08ff
   Note that address ra
30. d browse for projnav projnav_par do h Double click Simulate Post Place amp Route VHDL Model to invoke the structural simula tion Run structural simulation by selecting Simulate Run All Choose an appropriate termination criterion to terminate the simulation C Creating a Custom FSL Interface A user core in the form of an AFU resides in the lt project directory gt pcores direc tory The base name for an FSL interface description follows the following naming convention core name gt lt version number gt For example my fsl 1 00 a is a valid base name for a user core called my fsl Under the project directory pcores data directory two files are created for describing the inter face and specifying the order in which the underlying modules are synthesized The respective files are my fsl 1 00 a mpd and my fsl 1 00 a pao corresponding to the chosen base name Under the lt project directory gt pcores hdl vhdl directory reside the VHDL source code for the user core and the FSL inter face D VHDL Source for the Communication Template We present in this section the simple FSL Interface used to synchronize the data transfer between the processor core and the user core or AFU The I O constraints used here is 4 inputs and 2 outputs library IEEE use IEEE STD LOGIC 1164 ALL use IEEE STD LOGIC ARITH ALL use IEEE STD LOGIC UNSIGNED ALL library unisim use unisim vcomponents
31. eration. This latency information is passed on to the scheduler in the S/W generation phase (shown with a dotted arrow in Figure 3). The evaluated number of cycles is also used to synchronize the AFU with respect to the core. Apart from the component library, the designer also creates a communication template for AFUs, which captures the communication protocol between the processor core and the AFU. The write-back of the result from the AFU to the processor is delayed by the exact number of cycles required by the AFU operation. The implementation of the communication protocol, together with synchronization with the core, completes the AFU interface synthesis. Note that the H/W generation phase can be applied to synthesize the AFU and its interface in the customized processor model presented in Figure 1.

[Figure 3(a) diagram labels: Preprocessing (compiler front end, profiling, annotation of the CFG/DFG with hardware and software latencies and execution counts); ISE generation (ISEGEN under constraints); S/W generation (replace subgraph by ISE, scheduling, register allocation, back end); H/W generation (component library binding, interface synthesis, clock period and cycle evaluation)]
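The synchronization idea in this fragment (read the operands, hold the write-back for exactly the AFU latency, then return the results) can be modeled at cycle level. The sketch below is a simplified behavioral illustration in Python, not the report's VHDL template: the class name, the three state names, and the single combined write-back are assumptions made for brevity.

```python
from typing import Callable, Optional, Sequence, Tuple


class AfuInterfaceModel:
    """Cycle-level sketch: READ operands, COMPUTE for a fixed latency, WRITE back."""

    def __init__(self, afu: Callable[..., Sequence[int]], latency_cycles: int):
        if latency_cycles < 1:
            raise ValueError("latency must be at least one cycle")
        self.afu = afu                  # combinational function of the AFU
        self.latency = latency_cycles   # cycles obtained from the critical path
        self.state = "READ"
        self.operands: Tuple[int, ...] = ()
        self.results: Tuple[int, ...] = ()
        self.ticks = 0

    def step(self, inputs_ready: bool, inputs: Sequence[int],
             output_full: bool) -> Optional[Tuple[int, ...]]:
        """Advance one clock cycle; return the results on the write-back cycle."""
        if self.state == "READ":
            if inputs_ready:            # all input channels hold data
                self.operands = tuple(inputs)
                self.ticks = 0
                self.state = "COMPUTE"
        elif self.state == "COMPUTE":
            self.ticks += 1             # count cycles while the AFU is enabled
            if self.ticks >= self.latency:
                self.results = tuple(self.afu(*self.operands))
                self.state = "WRITE"
        elif self.state == "WRITE":
            if not output_full:         # stall while the output channel is full
                self.state = "READ"
                return self.results
        return None


def run_once(model: AfuInterfaceModel, operands: Sequence[int],
             max_cycles: int = 100) -> Tuple[int, Tuple[int, ...]]:
    """Drive the model with ready inputs and a free output; report the cycle count."""
    for cycle in range(1, max_cycles + 1):
        out = model.step(True, operands, False)
        if out is not None:
            return cycle, out
    raise RuntimeError("no write-back within max_cycles")
```

With a hypothetical two-output function such as `lambda a, b, c, d: (a + b, c * d)` and a latency of 3 cycles, `run_once` reports the write-back on cycle 5: one cycle to read, three to compute, one to write.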
32. Figure 2: The Flow of our Framework

The expanded view of our framework is shown in Figure 3(a). It has five main phases: the Preprocessing phase, the ISE generation phase, the S/W generation phase, the H/W generation phase, and the Processor subsystem generation phase. The Preprocessing phase takes the input application and generates an annotated intermediate representation. The ISE generation phase generates ISEs under microarchitectural constraints. The H/W generation phase synthesizes the corresponding AFUs with their interfaces, and the S/W generation phase generates the executable. A dotted arrow between the two phases indicates that the latency of an ISE obtained in the H/W generation phase is passed on to the S/W generation phase. Finally, the Processor subsystem generation phase builds the complete running system for evaluation.

4.1 Preprocessing Input Application

This phase can be identified as the box labeled Preprocessing in Figure 3(a). A compiler front end yields the Control Flow Graph (CFG) and Data Flow Graph (DFG) of an input application and runs predication to combine a set of small basic blocks into a large basic block. The input application is then profiled, and the basic blocks are annotated with their execution counts. A component library is created containing a synthesizable combinational element corresponding to each instruction in the target instruction set. Each element in the library is synthesized for a
33. erhead in processor-AFU interface. Thus, we confirm that if the AFU interface has a communication overhead, a small ISE will only result in performance degradation.

Table 2: Power Benefits of ISEs in the Microblaze subsystem

           Core Only             Core + AFU            Pk Pwr      Avg Pwr
BMs        P Pwr      A Pwr      P Pwr      A Pwr      Redn (%)    Redn (%)
           (mW)       (mW)       (mW)       (mW)
autcor     1957       1287       1869       1229        4.5         4.5
adpcm d    1975       1317       1919       1197        2.8         9.1
adpcm e    2070       1332       2012       1178        2.8        11.6
AES        2256       1276       1982       1187       12.1         7.0

The applications adpcm d and adpcm e are the two examples where predication of several small critical basic blocks led to a large basic block. Consequently, the ISEs found for these two benchmarks are very large, containing on the order of 40 operations. This led to a significant speedup in spite of the communication overhead. Figure 10 shows the ISE of adpcm e that generated a speedup of 1.47x over pure software execution. The shaded nodes show the inputs and the outputs of the ISE.

Figure 10: An ISE for ADPCM ENCODER (adpcm e) having 4 inputs and 2 outputs; each operation node maps to a hardware component.

Appendix D, Appendix E, and Appendix F present the complete VHDL source code for the AFU and its interface for adpcm d. The last benchmark under consideration is AES, which has the largest number of instructions in its critical basic block. The generated ISE
34. f power and energy consumption The steps required for taking the design from the EDK into the Project Navigator and running the behavioral and structural simulation are as follows Creating Simulation libraries a Compiling Xilinx Simulation Libraries COMPXLIB Following are the two ways e From the Project Navigator i Open an existing project that might have been exported from Xilinx Platform Studio using the Export to ProjNav option and highlight the target device ii In the Processes for Source window under the Design Entry Utilities right click Compile HDL Simulation Libraries and select Properties Select appropriate Target Simulator ModelSim SE in our case and click OK iii Double click Compile HDL Simulation Libraries to compile the Xilinx Simulation Libraries in C Xilinx vhdl mti se directory e From Command Line shown for virtex2 board compxlib s mti se f virtex2 l vhdl Run compxlib help to choose appropriate option for the board under consideration b Compiling EDK Behavioral Simulation Libraries COMPEDKLIB Compedklib bat s mti_se o edklib X 2 Initial Set up for Simulation a Invoke Xilinx Platform Studio XPS and load the design created with XPS using ProjNav implementation flow as explained in the document titled Building a Hardware Software system using Xilinx EDK and Xilinx Multimedia Board b From XPS select Options Project Options and in the HDL and Simulation tab select Simul
35. given technology, and the corresponding instruction in the DFG is annotated with a normalized hardware latency. Each instruction in the DFG is also annotated with its software latency, obtained from the target architecture specification.

4.2 ISE Generation Phase

This phase, shown as the ISE generation box in Figure 3(a), is integrated with the compiler front end. An ISE generation algorithm takes the annotated CFG/DFG and returns subgraphs, or ISEs, that would maximize performance under microarchitectural constraints. Although any ISE generation algorithm can be used, we use ISEGEN in our framework because it identifies all the instances of an ISE, exploiting large-scale ISE reuse.

4.3 H/W Generation Phase

We show this phase in the box marked H/W generation in Figure 3(a). The two subtasks of this phase are component library binding and interface synthesis. The identified subgraph, or ISE, is isolated, and each instruction in the subgraph is replaced by the corresponding element in the component library. Figure 10 shows an example subgraph where each node maps to an element in the component library. The data dependencies between the instructions are replaced by port-to-port connections between the elements, and the resulting structure is an AFU. This structural AFU model is then synthesized to evaluate the critical path length. The critical path length divided by the clock period of the processor core gives the number of cycles needed for the AFU op
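The latency computation at the end of this fragment amounts to a single division. The sketch below shows it in Python; rounding up to a whole number of cycles is an assumption (the text only says "divided by"), and the 47.3 ns path delay is a made-up value for illustration. The 50 MHz clock is the frequency reported for the Microblaze core in the experiments.

```python
import math


def afu_latency_cycles(critical_path_ns: float, clock_period_ns: float) -> int:
    """Whole core clock cycles needed to cover the AFU critical path."""
    if critical_path_ns <= 0 or clock_period_ns <= 0:
        raise ValueError("delays must be positive")
    # Round up: a partially used cycle still delays the write-back by a full cycle.
    return math.ceil(critical_path_ns / clock_period_ns)


CLOCK_MHZ = 50.0                      # core frequency used in the experiments
CLOCK_PERIOD_NS = 1e3 / CLOCK_MHZ     # 20 ns

# Hypothetical synthesized critical path of 47.3 ns.
example_cycles = afu_latency_cycles(47.3, CLOCK_PERIOD_NS)
```

This cycle count is the value the S/W generation phase would use as the ISE latency and the interface would use to delay the write-back.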
36. he application, the scheme for Performance Measurement uses the bitmap of the synthesized system to program an FPGA fabric, which then becomes the platform for actually running the executable. The executable is downloaded into the system memory through a JTAG port, and the number of cycles for running the executable is measured using a hardware timer.

Figure 6: A DP external AFU Interface

Since there is no direct way to measure the power of a running system on the FPGA fabric, we employ a different scheme for Power/Energy Evaluation, depicted in Figure 5, for accurately evaluating the power and energy consumption of the system. Note that there are three kinds of information in the post-Place-and-Route system (Figure 3(b)): the structural model of the system, the timing information, and the routing information. We superimpose the memory image of the executable in Figure 3(a) onto the memory section of the structural model. This complete structural model, along with the timing information, is run through a cycle-accurate hardware simulator to generate a Value Change Dump (VCD) of all the signals in the structural netlist. The routing information and the VCD information together are then used by a power simulator to generate the dynamic power consumed at different time
37. ideration. Since the ISE generation phase has ensured convexity of the identified subgraphs, it is never possible to have a dependency edge from the FirstUse node to the LastDef node, because this would make the subgraph non-convex. However, it is possible to encounter a situation where a FirstUse point precedes a LastDef point in the instruction sequence. This renders the subgraph replacement impossible without code restructuring. Consider the following sequence of operations in instruction order: 1) a = b * c, 2) f = a | 0x2, 3) e = 5, 4) d = a + e, 5) g = e <op> d. Suppose the ISE under consideration is a multiply followed by an add, as identified by the nodes labeled 1 and 4 in Figure 4(b1), respectively. Figure 4(b1)-(b3) show an example of how the placement of the ISE between LastDef and FirstUse is accomplished through code restructuring. Since in this case the FirstUse point appears earlier in the instruction chain than the LastDef point, the ISE cannot be placed anywhere (Figure 4(b1)). So instruction reordering has to be done so that the LastDef point precedes the FirstUse point. This reordering is possible because there is no dependency from FirstUse to LastDef. Figure 4(b2) shows the code snippet after restructuring Figure 4(b1), i.e., swapping the positions of node 2 and node 3, and Figure 4(b3) shows the placement of the ISE between the LastDef point (node 3) and the FirstUse point (node 2). If an ISE is used as a single user defined instruction
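The LastDef/FirstUse reasoning in this fragment can be sketched in Python on the five-instruction example. This is an illustrative sketch, not the post-assembly pass used in the report: the `Instr` type and all function names are hypothetical, each variable is assumed to be defined at most once in the block, and only the dependencies of nodes 2 and 5 matter here, not their operators.

```python
from typing import List, NamedTuple, Optional, Set, Tuple


class Instr(NamedTuple):
    label: str               # node label from the example, e.g. "1"
    dest: str                # variable the instruction defines
    srcs: Tuple[str, ...]    # variables the instruction reads


def last_def_first_use(seq: List[Instr],
                       ise: Set[str]) -> Tuple[Optional[int], Optional[int]]:
    """Positions of the LastDef and FirstUse instructions outside the ISE."""
    ise_dests = {i.dest for i in seq if i.label in ise}
    ise_inputs = {s for i in seq if i.label in ise for s in i.srcs} - ise_dests
    last_def = first_use = None
    for pos, ins in enumerate(seq):
        if ins.label in ise:
            continue
        if ins.dest in ise_inputs:
            last_def = pos                    # latest definition of an ISE source
        if first_use is None and ise_dests & set(ins.srcs):
            first_use = pos                   # earliest use of an ISE destination
    return last_def, first_use


def restructure(seq: List[Instr], ise: Set[str]) -> List[Instr]:
    """Hoist LastDef instructions above FirstUse until placement is possible."""
    seq = list(seq)
    while True:
        last_def, first_use = last_def_first_use(seq, ise)
        if last_def is None or first_use is None or last_def < first_use:
            return seq
        mover = seq[last_def]
        crossed = seq[first_use:last_def]     # instructions the mover jumps over
        if last_def == first_use or any(c.dest in mover.srcs for c in crossed):
            raise ValueError("dependency from FirstUse to LastDef: not convex")
        del seq[last_def]
        seq.insert(first_use, mover)


def place_ise(seq: List[Instr], ise: Set[str]) -> List[str]:
    """Labels after restructuring, with the ISE placed right after LastDef."""
    seq = restructure(seq, ise)
    last_def, _ = last_def_first_use(seq, ise)
    labels = ["ISE"] if last_def is None else []
    for pos, ins in enumerate(seq):
        if ins.label not in ise:
            labels.append(ins.label)
        if pos == last_def:
            labels.append("ISE")
    return labels


EXAMPLE = [
    Instr("1", "a", ("b", "c")),   # a = b * c      (ISE member)
    Instr("2", "f", ("a",)),       # f reads a      (FirstUse)
    Instr("3", "e", ()),           # e = 5          (LastDef)
    Instr("4", "d", ("a", "e")),   # d = a + e      (ISE member)
    Instr("5", "g", ("e", "d")),   # g reads e and d
]
```

For the example, `last_def_first_use(EXAMPLE, {"1", "4"})` returns `(2, 1)`, confirming that FirstUse precedes LastDef, and `place_ise(EXAMPLE, {"1", "4"})` yields the order 3, ISE, 2, 5, which matches the swap of nodes 2 and 3 described in the text.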
38. is 64 KB If not it is not possible to run with only BRAMs The alternatives are out of the scope of this document 10 Running the system a Switch on the board and invoke iMPACT from Xilinx ISE Accessories b Configure devices via Boundary Scan Mode with Automatically connect to cable and iden tify Boundary Scan chain selected Select appropriate device to program e g xc2v2000 in our case c Right click on the device and select Assign New Configuration File Find download bit in lt path to work directory gt implementation directory and select Open Observe the PROG LED change color from red to green indicating success Close the iMPACT window d Create a file xmd ini in lt path to work directory gt with the following lines help mbconnect mdm dow mblaze code executable elf rst con e From the XPS menu run Tools XMD and check the output of running executable elf soft ware on the synthesized hardware B Steps for System Simulation using ModelSim A complete system simulation is intended for verifying the correctness and generating the Value Change Dump VCD for the different signals employed The correctness is ensured using both the be havioral simulation as well as the structural Post Place and Route simulation The VCD is relevant only after the flattened netlist has been generated After the VCD dump is generated by the structural simulation run XPower is employed to evaluate the system in terms o
39. n the embedded domain that permit execution of only the critical application kernels in customized units as hardware, with the rest of the application executing on the processor core as software. This speeds up the application without compromising the processor clock or modifying the architectural model of the processor, and yet preserves the flexibility of the software approach. We call such a coprocessing hardware element an Ad hoc Functional Unit (AFU). The AFU operation is triggered by an instruction or a set of instructions that we call an Instruction Set Extension, or ISE. In the past, researchers have modeled AFUs having no communication overhead. However, many commercially popular customizable processors have communication overheads in their interface with AFUs. Therefore, our goal is to consider the microarchitectural details of an AFU interface in a processor customization framework and accurately evaluate the performance and energy benefits of ISEs in a realistic processor. The efficacy of the framework lies in seamlessly considering the synchronization between the processor and the AFU in a unified manner for different applications. Minimizing power and energy consumption is as important as maximizing performance in embedded systems. High power consumption may destroy a chip completely through overheating, while high energy consumption may reduce the battery life of an embedded device. Therefore, even though ISEs can achieve high speedups
40. nges chosen are disjoint.
b. Add the following bus connections:
   • lmb_v10_v1_00_a (2): microblaze_0 dlmb (M), lmb_bram_if_cntlr_0 slmb (S); microblaze_0 ilmb (M), lmb_bram_if_cntlr_1 slmb (S)
   • opb_v20_v1_10_b (1): microblaze_0 dopb (M), microblaze_0 iopb (M), opb_mdm_0 sopb (S), opb_timer_0 sopb (S)
c. All the Clk and Rst ports: all the net names must be sys_clk or sys_rst, corresponding to the Clk and Rst ports respectively.
d. The following parameters need to be changed from their default values:
   • microblaze_0: C_DEBUG_ENABLED = 1, C_USE_BARREL = 1 (to use a barrel shifter), C_NUMBER_OF_RD_ADDR_BRK, C_NUMBER_OF_WR_ADDR_BRK = 1
   • opb_mdm_0: C_UART_WIDTH = 8
   • lmb_v10_0: C_EXT_RESET_HIGH = 0
   • lmb_v10_1: C_EXT_RESET_HIGH = 0
   • opb_v20_0: C_EXT_RESET_HIGH = 0
Click OK to register all the above changes for the hardware.

5. Setting up the software: In the Applications tab, right click on Software Projects and click Add New Project. Give a name to the project and click OK.
a. Right click on Sources and click Add File. Select all the source (.c) files and click OK.
b. Right click on Headers and click Add File. Select all the header (.h) files and click OK.
c. Right click on Default: microblaze_0_xmdstub and click to Mark to Initialize BRAM.
d. Right click on Project: <Project name> and click to un-select Mark to Initialize BRAM.
e. Right click on Project: <Project name> and select Set Compiler Options. Unde
41. ous savings in energy (up to 40%) and power (up to 12% peak power reduction) with this increased performance were observed.

Contents
1 Introduction
2 Customized Processor Model
3 Related Work
4 Framework for Complete System Realization
4.1 Preprocessing Input Application
4.2 ISE Generation Phase
4.3 H/W Generation Phase
4.4 S/W Generation Phase
4.5 Processor Subsystem Generation Phase
5 Communication Template for Xilinx Microblaze
6 Experiments
6.1 Experimental Setups
6.2 System Implementation on the Board
7 Experimental Results
7.1 Performance and Code Size
7.2 Power and Energy Results
7.3 Slices Utilization
8 Summary and Future Directions
A System Realization on Xilinx Multimedia Board
B Steps for System Simulation using ModelSim
C Creating a Custom FSL Interface
D VHDL Source for the Communication Template
E Structural AFU model for adpcm d
F AFU with its Interface for adpcm d

List of Figures
1 Target Customized Processor Subsystem
2 The Flow of our Framework
3
42. r Eq FSL3 S EX STS rp THEN FSL2 S REA FSL3 S REA data inl lt data in2 data in3 data in4 AFU en lt 0 count lt 2 END IF WHEN 2 gt FSLO_S_REA FSL1 S REA FSL2 S REA FSL3 S REA AFU en count en lt 0 IF counter ticks count lt 3 END IF WHEN 3 gt IF FSLO M FULL lt CO yf Fit s LOF rots lt D D lt D lt D lt 177 1 ry FSLO S1 FSL1 SI FSL2 S DATA DATA DATA ESL3 DATA enable counting Wy m THEN only this will vary depending on app 1 cycle before writing THEN FSLO M1 DATA lt data outl FSLO M WRITE lt AFU en lt 0 count lt 0 END IF IF FSL1 M FULL 0 THEN FSL1 M DATA lt data out2 FSL1 M WRI TE lt 1 AFU en lt 0 count lt 0 END IF WHEN OTHERS END CASE end if Popes gt NULL end process end behavioral E Structural AFU model for adpcm d The structural model of the AFU generated for adpcm d with I O Constraints of 4 inputs 2 outputs is presented in the cut1 module library IEEE IEEE STD LOG use C 1164 ALL use IEEE STD LOGIC ARITH ALL use IEEE STD LOGIC UNSIGNED ALL library unisim use unisim vcomponents all entity cutl is Port
43. r Directories tab, give a suitable path for Output ELF File, for example <Path to work directory>/output/executable.elf. If a barrel shifter is present in the Microblaze (i.e., if C_USE_BARREL = 1), then under the Advanced tab insert -mxl-barrel-shift in the Program Sources Compiler Options.

6. Select Project -> Software Platform Settings.
a. In the Processor and Driver Parameters tab, change the Current Value of xmdstub_peripheral to opb_mdm_0.
b. In the Library/OS Parameters tab, change the Current Values of both stdin and stdout to opb_mdm_0.

7. Create a User Constraints File in <path to work directory>/data/system.ucf with the following lines (for the Xilinx Multimedia Board):

   NET sys_clk LOC = AD16;
   NET sys_rst LOC = AH7;
   NET sys_clk NODELAY;
   NET sys_clk TNM_NET = clk50;
   TIMESPEC TSclk50 = PERIOD clk50 20 ns HIGH 50%;

Note that the pin mapping will change if the board is different. The clock frequency is selected to be 50 MHz with a 50% duty cycle.

8. Synthesizing the hardware (to be carried out in one of the following ways):
• Using EDK with Xilinx XST (easier option):
   a. Run Tools -> Generate Netlist
   b. Run Tools -> Generate Libraries and BSPs
   c. Run Tools -> Update Bitstream
• Using EDK with Synplicity Synplify Pro (if an XST license is unavailable):
   a. Open Options -> Project Options from the XPS menu. Select the tab Hierarchy and Flow and make the following changes:
      i. Change Synthesi
44. ral Constraints. In Proc. of DAC, 2003.
[4] P. Yu and T. Mitra. Scalable Custom Instructions Identification for Instruction Set Extensible Processors. In Proc. of CASES, 2004.
[5] N. Clark, H. Zhong, and S. Mahlke. Processor Acceleration through Automated Instruction Set Customization. In Proc. of MICRO, 2003.
[6] F. Sun, S. Ravi, A. Raghunathan, and N. K. Jha. Synthesis of Custom Processors based on Extensible Platforms. In Proc. of ICCAD, 2002.
[7] Y. Fei, S. Ravi, A. Raghunathan, and N. K. Jha. A Hybrid Energy Estimation Technique for Extensible Processors. IEEE TCAD, 2004.
[8] J. Cong, Y. Fan, G. Han, and Z. Zhang. Application-Specific Instruction Generation for Configurable Processor Architectures. In Proc. of FPGA, 2004.
[9] Machine SUIF. http www eecs harvard edu hube software software html
[10] Microblaze Processor Reference Guide. http www xilinx com ise embedded mb ref guide pdf
[11] ST100 DSP Core Architecture Overview. http www st com stonline prodpres dedicate st100 overview overview htm
[12] The Leon Processor User Manual. http www ra informatik uni stuttgart de virazela LP_Project leon 2 3 7 pdf
[13] The Nios II Processor Reference Handbook. http www altera com literature hb nios2 n2cpu_niidvl pdf
[14] SC140 DSP Core Reference Manual. http www soc napier ac uk module php3 Op getresource amp cloaking no amp resourceid 1473119
[15] ModelSim SE datasheet. http www model com produ
45. rface inside the processor subsystem or loosely coupled through an external bus. The AFU interface (or the external interface) implements the communication protocol between the AFU and the processor and thus controls synchronization of data and access to the processor register file. The function of an ISE is to transfer control to an AFU for execution. An ISE can be either a single user-defined instruction or a set of multiple pre-defined instructions. A single user-defined instruction is decoded as a special instruction which encapsulates the inputs and outputs of an AFU as source and destination operands, respectively. The decoder takes the responsibility of issuing such a special instruction to an appropriate AFU for execution. Alternatively, sending inputs to and receiving outputs from the AFU can be done at the expense of multiple data transfer instructions. Such instructions must already exist in the instruction set of the processor in the form of send-data-to-AFU and receive-data-from-AFU instructions. In this case, the AFU incurs communication overhead at its interface while sending and receiving data.

3 Related Work

Several algorithms [1, 4, 2, 3, 5, 6] have recently been proposed to identify ISEs in a given application. The speedups over simple software execution claimed in most of the approaches [1, 4, 2, 3] are estimated by assuming a typical RISC processor execution model. The methodology in [5] targets Trimaran resear
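The difference between the two ISE styles described in Section 2 (a single user-defined instruction versus explicit send and receive instructions) can be made concrete with a small cost model. The Python sketch below is a back-of-the-envelope illustration only: the one-cycle cost per transfer instruction and the operation counts are hypothetical values, not measurements from the report.

```python
def kernel_speedup(sw_cycles: int, afu_cycles: int,
                   n_inputs: int, n_outputs: int,
                   send_cost: int = 1, recv_cost: int = 1) -> float:
    """Speedup of one kernel when its operations are replaced by an ISE.

    sw_cycles  : cycles the replaced instructions take in software
    afu_cycles : cycles the AFU needs for the same computation
    send_cost  : cycles per send-data-to-AFU instruction (assumed value)
    recv_cost  : cycles per receive-data-from-AFU instruction (assumed value)
    """
    hw_cycles = n_inputs * send_cost + afu_cycles + n_outputs * recv_cost
    return sw_cycles / hw_cycles


# Hypothetical small ISE: 3 single-cycle operations, 4 inputs, 2 outputs.
small_with_overhead = kernel_speedup(3, 1, 4, 2)
# Same ISE as a single user-defined instruction (no transfer instructions).
small_no_overhead = kernel_speedup(3, 1, 4, 2, send_cost=0, recv_cost=0)
# Hypothetical large ISE: 40 single-cycle operations, 3-cycle AFU.
large_with_overhead = kernel_speedup(40, 3, 4, 2)
```

Under these assumed costs the small ISE slows the kernel down once the six transfer instructions are counted, while the large ISE still wins, which is the qualitative behavior the report's experiments describe.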
46. rgy and power are ModelSim for hardware simulation [15] and Xilinx XPower for power simulation [16]. We now detail the steps to realize a complete hardware/software subsystem using the Xilinx Multimedia Board.

6.2 System Implementation on the Board

The steps that we used to build a hardware/software system using the Xilinx Embedded Development Kit (EDK) are enumerated in Appendix A. The generated system can be simulated both behaviorally and structurally, following the steps detailed in Appendix B. Appendix C briefly explains how an AFU is introduced in the form of a user core in the system.

7 Experimental Results

We demonstrate the effectiveness of our approach using a number of front-end tools in our framework, shown in Figure 3(a).

7.1 Performance and Code Size

The code generation for the baseline configuration was done by mb-gcc with all optimizations turned on (-O2 -mno-xl-soft-mul) so that the performance is maximized in pure software execution. The Microblaze configuration was then customized for different applications by introducing an AFU with its interface, as explained in Section 6.1. The ISEs were generated with I/O constraints of a maximum of 4 inputs and 2 outputs, and the number of AFUs set to 1. Note that for each application a different Microblaze configuration is generated, and the resulting system is analyzed by applying our framework. The results in terms of code size reduction and speedup over software execution are summarized in
47. s Tool to None ii Change Implementation Tool Flow to ISE ProjNav b Run Tools Export to ProjNav A directory projnav is created that contains the exported files Note that if Xilinx Platform Studio has been installed after XST has expired an error will be reported saying ERROR Unable to set property Synthesis Tool To resolve this error run a script containing the following in the Path to work directory gt directory sed XST d npl cmdfile gt tmpfile mv tmpfile npl cmdfile pjcli v f npl_cmdfile c Invoke from Windows menu Xilinx ISE Project Navigator d Click File Open Project and open system npl to be found in the projnav directory e Double click xc2v2000 6ff896 to be found under Sources in Project to open Project Properties Change the value of Synthesis Tool to Synplify Pro VHDL Verilog and click OK f Make the following changes in system structure lt path gt system vhd found under xc2v2000 6ff896 1 il ii Comment the lines library UNISIM and use UNISIM VCOMPONENTS ALL Add the following lines in the beginning LIBRARY synplify use synplify attributes all Comment all the attribute statements For example attribute box type of bram block O wrapper component is black box for the component bram block O wrapper Instead introduce for each component the following lines attribute syn black box of component name component is true and at
48. s rst gt rst Test Bench Statements tb clk PROCESS 50 MHz clock BEGIN clk lt 1 wait for 10 ns clk lt 0 wait for 10 ns END PROCESS tb clk tb reset PROCESS BEGIN rst lt 0 wait for 1 us rst lt 1 wait END PROCESS tb reset End Test Bench END Check the system init vhd file for ensuring the correct ness of the module names configuration testbench conf of testbench is for behavior for uut system for STRUCTURE for all bram block 0 wrapper use configura tion work bram block 0 conf end for end for end for end for end testbench conf c Create a script file projnav projnav do with the following content a script for behavioral simulation cd simulation behavioral do system do vcom 93 work work system vhd vcom 93 work work projnav testbench vhd vsim Lf unisim t ps notimingchecks work testbench conf add wave d Right click on system structure path system vhd and select Add Source Find test bench vhd in the projnav directory and click Open to add test bench Select vhdl testbench while adding the test bench e Click on testbench behavior testbench vhd Right click on Simulate Behavioral Model in the Processes for Source partition and select properties Change the following fields e Use Custom Do File Check the selection e Use Automatic Do File Uncheck the selection e Custom Do File Click and browse for projnav projnav
49. t enables accurate evaluation of all the metrics deemed important in embedded system design, namely performance, energy, power, cost, and code size.
• By applying our framework to the Microblaze soft processor core, we conclude that ISEs can be simultaneously beneficial in terms of performance, energy, power, and code size.

The rest of the report is organized as follows. We present our target customizable processor model in Section 2. In Section 3, we present some related research work. We describe our framework for transforming a given application to a customized processor subsystem in Section 4. Section 6.1 presents how we use the framework to target the Xilinx Microblaze soft processor core. In Section 7, we describe our experimental results. Finally, Section 8 concludes the report.

Figure 1: Target Customized Processor Subsystem

2 Customized Processor Model

Our goal is to map a given application to the target customizable processor model shown in Figure 1. In this model, the software part of the application, stored in the program memory, is composed of base instructions to be run on the Execution Unit and ISEs to be run on the hardware part, i.e., AFUs. An AFU can be tightly coupled with the core through an AFU inte
50. tical sections in customized Ad hoc Functional Units (AFUs) as Instruction Set Extensions (ISEs). We presented an interface-aware processor customization framework that enabled us to implement a customizable soft-core microarchitecture capturing the details of interfacing with an AFU. We applied our framework to four real-life applications and realized four different processor configurations. Our results confirmed that, in the presence of communication overhead at the processor-AFU interface, significant speedup over pure software execution is possible only if the AFU function is sufficiently larger than a set of 2-3 operations. Further analysis of the synthesized systems led to the conclusion that integration of AFUs in a customizable processor can result in increased performance and reduced code size while simultaneously decreasing power and energy consumption. Our future work will investigate the advantages of ISEs in other reconfigurable platforms and commercially available processors.

References

[1] P. Biswas, S. Banerjee, N. Dutt, L. Pozzi, and P. Ienne. ISEGEN: Generation of High-Quality Instruction Set Extensions by Iterative Improvement. In Proc. of DATE, 2005.
[2] P. Biswas, V. Choudhary, K. Atasu, L. Pozzi, P. Ienne, and N. Dutt. Introduction of Local Memory Elements in Instruction Set Extensions. In Proc. of DAC, 2004.
[3] K. Atasu, L. Pozzi, and P. Ienne. Automatic Application-Specific Instruction-Set Extensions under Microarchitectu
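The conclusion's point about communication overhead at the processor-AFU interface can be illustrated with a toy cycle-count model. This is only a sketch and not the report's evaluation method: the function names `afu_speedup` and `break_even_ops`, and every cycle count used as a default (one cycle per base instruction, four cycles of operand and result transfer, one cycle of AFU latency), are hypothetical values chosen for illustration.

```python
def afu_speedup(n_ops, sw_cycles_per_op=1, transfer_cycles=4, afu_latency=1):
    """Speedup of one AFU invocation over running n_ops base instructions.

    All default cycle counts are hypothetical illustration values,
    not measurements taken from the report.
    """
    sw_cycles = n_ops * sw_cycles_per_op        # pure software execution
    hw_cycles = transfer_cycles + afu_latency   # interface overhead + AFU work
    return sw_cycles / hw_cycles


def break_even_ops(sw_cycles_per_op=1, transfer_cycles=4, afu_latency=1):
    """Smallest number of replaced operations that gives a speedup above 1."""
    hw_cycles = transfer_cycles + afu_latency
    return hw_cycles // sw_cycles_per_op + 1


if __name__ == "__main__":
    for n_ops in (2, 3, 5, 10, 20):
        print(n_ops, "ops replaced -> speedup", round(afu_speedup(n_ops), 2))
    print("break-even at", break_even_ops(), "ops")
```

Under these assumed numbers, an ISE that replaces only two or three operations is slower than software because the fixed transfer cost dominates, which is the qualitative behavior the conclusion describes. Real break-even points depend on the actual interface and AFU latencies.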
51. to 31 signal cnst_minus_32768

component barrel_right_shifter
  port (
    chip_en    : in  std_logic;
    data_in    : in  std_logic_vector(0 to 31);
    shift_amnt : in  std_logic_vector(0 to 31);
    data_out   : out std_logic_vector(0 to 31)
  );
end component barrel_right_shifter;

component add_32
  port (
    chip_en  : in  std_logic;
    data_in1 : in  std_logic_vector(0 to 31);
    data_in2 : in  std_logic_vector(0 to 31);
    data_out : out std_logic_vector(0 to 31)
  );
end component add_32;

component sub_32
  port (
    chip_en  : in  std_logic;
    data_in1 : in  std_logic_vector(0 to 31);
    data_in2 : in  std_logic_vector(0 to 31);
    data_out : out std_logic_vector(0 to 31)
  );
end component sub_32;

component and_32
  port (
    chip_en  : in  std_logic;
    data_in1 : in  std_logic_vector(0 to 31);
    data_in2 : in  std_logic_vector(0 to 31);
    data_out : out std_logic_vector(0 to 31)
  );
end component and_32;

component mult_32
  port (
    chip_en  : in  std_logic;
    data_in1 : in  std_logic_vector(0 to 31);
    data_in2 : in  std_logic_vector(0 to 31);
    data_out : out std_logic_vector(0 to 31)
  );
end component mult_32;

component mu... port ( cond1 ... cond2 ... data_in1 ... data_in2 ... data_out ... ) end component
component mu... port ( cond1 ... cond2 ... data_in1 ... data_in2 ... data_out ... ) end component
component mu... port ( cond1 ... cond2 ... data_in1 ... data_in2 ... data_out ... )
