        as a PDF - CECS - University of California, Irvine
         Contents
1. [Figure 3: (a) A high-level application to a hardware-software system generation; (b) Processor subsystem generation] 4.4 S/W Generation Phase. This phase, shown in Figure 3(a) as a box titled "S/W generation", generates code for the target processor, taking into account the presence of AFUs. The two subtasks in the S/W generation phase are subgraph matching and subgraph replacement with ISEs. Since all possible instances of an ISE have already been enumerated by the ISE generation phase, subgraph matching simply consists of a DFG traversal that marks the constituent instructions of the ISE in the DFG. [Figure 4: The ISE here is composed of the shaded instruction nodes. (a) An example showing the LastDef point and the FirstUse point; (b1) an example where it is not possible to insert the ISE under consideration; (b2) after code restructuring; (b3) positioning of the ISE between LastDef and FirstUse] After subgraph matching, the ISE is used to replace the set of marked instructions in the DFG. We depict the ISE replacement strategy in Figure 4. An ISE can be placed anywhere between the point where its source operands have their last definition (LastDef) and the point where its destination operand has its first use (FirstUse), as shown in Figure 4(a); the shaded nodes identify the ISE under cons
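The LastDef/FirstUse placement rule described in the snippet above can be illustrated with a small sketch. This is not the report's implementation: it assumes a straight-line block in which every value is defined exactly once, represents each instruction as a `(dest, sources)` tuple, and uses the hypothetical helper names `placement_window` and `can_place`.

```python
def placement_window(instrs, ise_idx):
    """Return (last_def, first_use) positions for an ISE in a straight-line block.

    instrs  : list of (dest, [sources]) tuples in program order
    ise_idx : set of positions of the instructions that make up the ISE
    """
    ise_dests = {instrs[i][0] for i in ise_idx}
    # Source operands of the ISE: values it reads but does not produce itself.
    ise_srcs = {s for i in ise_idx for s in instrs[i][1]} - ise_dests
    outside = [i for i in range(len(instrs)) if i not in ise_idx]
    # LastDef: latest outside instruction that defines an ISE source operand
    # (-1 when all source operands are live on entry to the block).
    last_def = max((i for i in outside if instrs[i][0] in ise_srcs), default=-1)
    # FirstUse: earliest outside instruction that reads a value produced by
    # the ISE (len(instrs) when no instruction in the block reads one).
    first_use = min(
        (i for i in outside if ise_dests & set(instrs[i][1])),
        default=len(instrs),
    )
    return last_def, first_use


def can_place(instrs, ise_idx):
    """The ISE fits only if its operands are all defined before its first use."""
    last_def, first_use = placement_window(instrs, ise_idx)
    return last_def < first_use


# Hypothetical block: a multiply (0) and an add (3) form the ISE.
block = [
    ("t", ["b", "c"]),   # 0: multiply, part of the ISE
    ("f", ["a"]),        # 1
    ("e", []),           # 2: defines a source operand of the ISE
    ("d", ["t", "e"]),   # 3: add, part of the ISE
    ("g", ["e", "d"]),   # 4: first use of the ISE result
]
print(placement_window(block, {0, 3}), can_place(block, {0, 3}))
```

When an outside instruction reads an ISE result before the last source operand is defined, `can_place` returns `False`, which corresponds to the situation where the code must be restructured before the ISE can be inserted.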
2. out std_logic;
    FSL0_S_DATA : in std_logic_vector(0 to 31); FSL0_S_CONTROL : in std_logic; FSL0_S_EXISTS : in std_logic;
    FSL1_S_CLK : out std_logic; FSL1_S_READ : out std_logic; FSL1_S_DATA : in std_logic_vector(0 to 31); FSL1_S_CONTROL : in std_logic; FSL1_S_EXISTS : in std_logic;
    FSL2_S_CLK : out std_logic; FSL2_S_READ : out std_logic; FSL2_S_DATA : in std_logic_vector(0 to 31); FSL2_S_CONTROL : in std_logic; FSL2_S_EXISTS : in std_logic;
    FSL3_S_CLK : out std_logic; FSL3_S_READ : out std_logic; FSL3_S_DATA : in std_logic_vector(0 to 31); FSL3_S_CONTROL : in std_logic; FSL3_S_EXISTS : in std_logic;
    FSL0_M_CLK : out std_logic; FSL0_M_WRITE : out std_logic; FSL0_M_DATA : out std_logic_vector(0 to 31); FSL0_M_CONTROL : out std_logic; FSL0_M_FULL : in std_logic;
    FSL1_M_CLK : out std_logic; FSL1_M_WRITE : out std_logic; FSL1_M_DATA : out std_logic_vector(0 to 31); FSL1_M_CONTROL : out std_logic; FSL1_M_FULL : in std_logic;
    AFU_en : out std_logic
    );
    end component fsl_interface;

    begin

    counter_inst : counter
      port map (CLK => CLK, enable => count_en, counter_ticks => counter_ticks);

    cut1_inst : cut1
      port map (AFU_en => chip_en,
                data_in1 => data_in1, data_in2 => data_in2, data_in3 => data_in3, data_in4 => data_in4,
                data_out1 => data_out1, data_out2 => data_out2);

    fsl_interface_inst : fsl_interface
      port map (CLK => CLK, RE
3. SRVAL_A => X"0", SRVAL_B => X"0", WRITE_MODE_A => "WRITE_FIRST", WRITE_MODE_B => "WRITE_FIRST",
    INIT_00 => X"C1027257FA808 102079097432013AF849F4A0FOF FF BO4A80766AFABDDF739FFD",
    INIT_01 => X"102040B1C337A97A6EFB77A8272040CB52497DC20EDBB9F6BCCF7FFCO9E146FF",
    INIT_3E => X"0000000000000000000000000000000000000000000000000000000000000000",
    INIT_3F => X"0000000000000000000000000000000000000000000000000000000000000000"

This can be easily done using a simple script.

Make projnav/projnav_par.do with the following content. This is the script for post place-and-route simulation. Note that the VCD is destined to be generated in system.vcd.

    vmap simprim C:/Xilinx/vhdl/mti_se/simprim
    vlib work
    vcom -93 -work work system_timesim.vhd
    vcom -93 -work work testbench_par.vhd
    vsim -t ps +notimingchecks -sdftyp /testbench/uut=system_timesim.sdf work.testbench
    vcd file system.vcd
    vcd add /testbench/uut/*
    add wave *

(f) Right click on system-structure (<path>/system.vhd) and select Add Source. Find testbench.vhd in the projnav directory and click Open to add a test bench. Select vhdl test bench while adding the test bench.

(g) Click on testbench-behavior (testbench.vhd). Right click on Simulate Behavioral Model in the Processes for Source partition and select properties. Change the following fields:
    - Use Custom Do File: Check the selection
    - Use Automatic Do File: Uncheck the selection
    - Custom Do File: Click an
4. List of Figures: (a) A high-level application to a hardware-software system generation, (b) Processor subsystem generation; The ISE here is composed of the shaded instruction nodes: (a) an example showing the LastDef point and the FirstUse point, (b1) an example where it is not possible to insert the ISE under consideration, (b2) after code restructuring, (b3) positioning of the ISE between LastDef and FirstUse; Measuring System Power; A DP external AFU Interface; Microblaze Processor Core with an AFU and its Interface; Communication Template for AFU Interface in Microblaze; Xilinx Multimedia Board; An ISE for ADPCM ENCODER (adpcm e) having 4 inputs and 2 outputs, where each operation node maps to a hardware component. 1 Introduction. Typically, applications running on a programmable platform can be executed either as a software algorithm or on a specialized hardware unit. The software approach is the slowest but most flexible, while the hardware approach is the fastest but least flexible. Instruction Set (IS) extensible processors comprise an emerging class of processors, especially i
5.    ut std logic   out std logic   out std logic vector 0 to 31    out std logic   in std logic        ut std logic   out std logic   out std logic vector 0 to 31    out std logic   in std logic     AFU_en        out std logic    end fsl interface     architecture behavioral of fsl interface is    S    IGNAL count       begin    FSLO M CONTROL  lt      FSLO_S_CLK  FSL1_S_CLK  FSL2_S_CLK  FSL3_S_CLK  FSLO_M_CLK  FSLI M CLK    AFU control   begin    if  RESET  count  lt       lt     lt     lt     lt     lt     lt      pr         0     FSLO S REA  FSL1 S REA  FSL2 S REA  FSL3 S REA  FSLO M WRIT       QUU VU    FSLO M DATA           FSL1  FSL1     M WRIT    M           DATA  lt         enabling AFU operation    natural range 0 to 9     ro      CLK   CLK   CLK   CLK   CLK   CLK     ocess  CLK     1     then  ro    ro    ro    ro    ro     others  ro     others    E  lt      E  lt      elsif CLK   event and CLK    CASE  WHEN    count  0    FSLO S   ESLI So  FSL2 S    ESL3 SA  FSLO_M    FSL1_M J    count_e   count  lt   WHEN 1  IF  and        FS    FSLO S READ  lt    1  FSL1 S READ  lt    1     FSLO    IS    gt   REA  REA  READ  lt    READ  lt    WRITE  lt          lt     lt         D  D  D  D    WRITE  lt      n  lt   Pss         gt   S EXISTS  L2 S EXISTS             ry is  rots  rts  Lil  Te            ry  s  rye               gt      gt     117                         1717       1717       Je   Papst                then    and  THEN and     FSLl1 S EXISTS    Initialize the counte
6. Table 1.

Table 1: Speedup and Code Size Reduction with the Introduction of an AFU having 4 inputs and 2 outputs in the Microblaze subsystem

              Core Only            Core + AFU           Code Redn
    BMs       Bytes     Cycles     Bytes     Cycles     (Bytes)     Spdup
    autcor    58444     264305     58452     404673         -8      0.65x
    adpcm d   12049     252688     11953     190979         96      1.32x
    adpcm e   14121     157177     13989     106821        132      1.47x
    AES       16013     240613     14957     167397       1056      1.44x

Each of the operand send and result receive operations in Microblaze has a latency of 2 cycles. Consequently, the latency for transferring 6 operands is 12 cycles in the worst case and 6 cycles in the best case (i.e., if all the latencies are successfully hidden by the scheduler). The ISE generated for autcor was a chain of just three operations (a multiply, a barrel right shift, and an add) with software latencies of 3, 2, and 1 cycles, respectively. With the AFU operation taking just 1 cycle, the best-case latency of the ISE is 6 + 1 = 7 cycles. Thus, even the best-case performance of the ISE lags behind the worst-case performance of the corresponding software execution (3 + 2 + 1 = 6 cycles). Consequently, autcor slowed down instead of speeding up, owing to the communication overhead. However, some prior related work [6, 8] has shown speedup even with small ISEs containing on the order of 3-4 instructions because of incurring no communication ov
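The figures in Table 1 and the autcor latency argument can be checked with a few lines of arithmetic. The sketch below only restates numbers given in the snippet above; the dictionary layout and function names are illustrative choices, not part of the report.

```python
# (core-only bytes, core-only cycles, core+AFU bytes, core+AFU cycles), from Table 1
TABLE1 = {
    "autcor":  (58444, 264305, 58452, 404673),
    "adpcm d": (12049, 252688, 11953, 190979),
    "adpcm e": (14121, 157177, 13989, 106821),
    "AES":     (16013, 240613, 14957, 167397),
}


def speedup(core_cycles, afu_cycles):
    """Speedup of the core+AFU system over pure software execution."""
    return core_cycles / afu_cycles


def code_reduction(core_bytes, afu_bytes):
    """Code size reduction in bytes (negative means the code grew)."""
    return core_bytes - afu_bytes


for name, (b0, c0, b1, c1) in TABLE1.items():
    print(f"{name:8s} reduction={code_reduction(b0, b1):5d} B  "
          f"speedup={speedup(c0, c1):.2f}x")

# autcor: best-case ISE latency versus software latency, as stated in the text.
BEST_CASE_TRANSFER = 6        # cycles for 6 operands when latencies are hidden
AFU_LATENCY = 1               # cycles for the AFU operation itself
SOFTWARE_LATENCIES = (3, 2, 1)  # multiply, barrel right shift, add

best_case_ise = BEST_CASE_TRANSFER + AFU_LATENCY
software = sum(SOFTWARE_LATENCIES)
print(f"autcor ISE best case: {best_case_ise} cycles, software: {software} cycles")
```

The negative reduction for autcor reflects the 8 extra bytes of data transfer instructions, and the 7-versus-6 cycle comparison reproduces the reason given for its slowdown.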
7. a single instruction just replaces the set of constituent instructions. Replacing the multiply and the add with a single user-defined instruction (ISE1), the resulting instruction sequence (as in Figure 4(b3)) would become: (3) e = 5; (1, 4) d = ISE1(b, c, e); (2) f = a [op] 0x2; (5) g = e [op] d. However, if an ISE is represented as a set of predefined data transfer instructions (send, receive), the resulting instruction sequence after ISE replacement would appear as: (3) e = 5; (1, 4) send b; send c; send e; receive d; (2) f = a [op] 0x2; (5) g = e [op] d. After subgraph replacement with the ISE, the compiler performs scheduling, register allocation, and target code generation as a back-end pass. Note that the latency of the ISE required by the scheduler is derived from the H/W generation phase as shown in Figure 3(a). [Figure 5: Measuring System Power] 4.5 Processor Subsystem Generation Phase. We show this phase in Figure 3(b). As a final step, the processor model of the target soft core along with the AFU and its interface are synthesized and implemented using standard synthesis and Place-and-Route tools. The executable generated in Figure 3(a) and the system synthesized in Figure 3(b) are deployed in two schemes: one for measuring speedup and the other for evaluating energy/power consumption. With the goal of measuring actual time spent in running t
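The two forms of ISE replacement discussed in the snippet above (a single user-defined instruction versus a set of send/receive data transfer instructions) can be sketched as follows. This is an illustration under stated assumptions, not the compiler pass used in the report: instructions are plain strings, the function name `replace_with_ise` is hypothetical, and the operators in the sample block are invented placeholders.

```python
def replace_with_ise(instrs, marked, insert_at, inputs, outputs, use_transfers):
    """Replace the marked instructions with an ISE.

    instrs        : list of instruction strings in program order
    marked        : set of positions of the ISE's constituent instructions
    insert_at     : position before which the ISE is emitted
    inputs        : source operands of the ISE
    outputs       : destination operands of the ISE
    use_transfers : emit send/receive instructions instead of one instruction
    """
    if use_transfers:
        ise = [f"send({v})" for v in inputs] + [f"receive({v})" for v in outputs]
    else:
        ise = [f"{', '.join(outputs)} = ISE1({', '.join(inputs)})"]

    result = []
    for i, ins in enumerate(instrs):
        if i == insert_at:
            result.extend(ise)      # the ISE is emitted at the chosen point
        if i not in marked:
            result.append(ins)      # unmarked instructions are kept as-is
    return result


# Hypothetical block; the multiply (0) and the add (3) are the marked nodes.
seq = ["t = b * c", "f = a | 0x2", "e = 5", "d = t + e", "g = e - d"]

single = replace_with_ise(seq, {0, 3}, 3, ["b", "c", "e"], ["d"], use_transfers=False)
transfers = replace_with_ise(seq, {0, 3}, 3, ["b", "c", "e"], ["d"], use_transfers=True)
print(single)
print(transfers)
```

The transfer form turns one logical instruction into four, which is why communication latency matters for small ISEs.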
8. designers need to determine if this speedup comes at a price of increased power. This report shows that increased performance can also reduce both power and energy of a customizable processor in the presence of an AFU and reports the effects on code size and area. It is predicted [17] that by 2010 over one third of all PLD/FPGA devices will have microprocessor cores, up from 15% today. Xilinx Microblaze [10] is a popular commercially available soft core. We demonstrate the use of our framework by transforming a given input application into a running Xilinx Microblaze hardware-software system. For four real-life applications (from the Mediabench and EEMBC suites), we measure the real performance gain over pure software execution and also accurately evaluate energy and power consumption. Our experimental results show that significant speedup is obtained only when an ISE contains a large set of atomic operations. With only one large ISE per application, we obtained speedup of up to 1.47x over simple software execution and simultaneously up to 40% energy saving and 12% peak power reduction. To the best of our knowledge, this is also the first attempt to present the details of interfacing an AFU with a customizable soft core. The main contributions highlighted in this report are the following: - We present a generalized interface-aware soft processor customization framework for mapping an application in C into a running processor-AFU subsystem tha
9. do

(f) Double click "Simulate Behavioral Model" to run the behavioral simulation.

4. Structural Timing Simulation using ModelSim:

(a) Right click on Project: <Project name> and click to Mark to Initialize BRAM.

(b) Right click on Default: microblaze_0_xmdstub and click to un-select Mark to initialize BRAM.

(c) Invoke Tools → Sim Model Generation. This populates the simulation/structural directory.

(d) The file system_init.vhd contains the memory map of the executable. A part of it looks like the following:

    configuration bram1_conf of bram1_wrapper is
      for STRUCTURE
        for bram1 : bram1_elaborate
          for STRUCTURE
            for ramb16_s1_s1_0 : ramb16_s1_s1
              use entity unisim.ramb16_s1_s1(ramb16_s1_s1_v)
              generic map (
                INIT_00 => X"C102125AF2808102049087432010AA84154A021FFCF04ACOGDE65996B4FDE57F",
                INIT_01 => X"102040B1C26DSBF87EFB72A82420409D17492DC2074FB95734CFFFE508A183FF",
                INIT_3E => X"0000000000000000000000000000000000000000000000000000000000000000",
                INIT_3F => X"0000000000000000000000000000000000000000000000000000000000000000"
              );
            end for;
          end for;
        end for;
      end for;
    end bram1_conf;

The corresponding section in <work directory>/system_timesim.vhd is empty. Superimpose this memory section from system_init.vhd into <work directory>/system_timesim.vhd so that the corresponding BRAM section of the latter looks like the following:

    ramb16_s1_s1_2 : X_RAMB16_S1_S1
      generic map (
        INIT_A => X"0",
        INIT_B => X"0"
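The superimposition step above amounts to copying the `INIT_xx` generics from one VHDL file into another. A minimal sketch of the extraction half is shown below; it is not the script used in the report. It assumes well-formed VHDL text in which each `INIT_xx` value is a 64-digit hexadecimal string, and the function names `extract_init` and `to_generic_lines` are hypothetical.

```python
import re

# Matches generics such as: INIT_3F => X"<64 hex digits>"
INIT_RE = re.compile(r'(INIT_[0-9A-Fa-f]{2})\s*=>\s*X"([0-9A-Fa-f]{64})"')


def extract_init(vhdl_text):
    """Return {generic name: hex string} for every INIT_xx generic found."""
    return dict(INIT_RE.findall(vhdl_text))


def to_generic_lines(init_map):
    """Render the generics back as VHDL association lines, sorted by name."""
    return [f'{name} => X"{value}",' for name, value in sorted(init_map.items())]


# Made-up sample text standing in for a fragment of system_init.vhd.
sample = (
    'generic map ( INIT_00 => X"' + "0" * 63 + '1", '
    'INIT_3F => X"' + "F" * 64 + '" )'
)

inits = extract_init(sample)
for line in to_generic_lines(inits):
    print(line)

# Each 64-digit hex string holds 256 bits of block RAM contents.
bits_per_generic = 64 * 4
print(bits_per_generic)
```

Writing the extracted lines into the matching `generic map` of system_timesim.vhd is left out here because the exact layout of that file is not shown in the snippet.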
10.  eq 32    ip en   in std logic    std logic vector 0 to 31    std logic vector 0 to 31     in  std logic vector 0  to 31    in std logic vector 0 to 31    out std logic vector 0 to 31                 t mux eq 32   x leq 32    ip en   in std logic    std logic vector 0 to 31  std logic vector 0 to 31     in std logic vector 0 to 31    in std logic vector 0 to 31    out std logic vector 0 to 31                                t mux leq 32     x geq 32    ip en   in std logic    std logic vector 0 to 31    std logic vector 0 to 31     in std logic vector 0 to 31    in std logic vector 0 to 31    out std logic vector 0 to 31                 end component mux geq 32     begin  cn  cn  cn  cn  cn  cn  cn  cn       cn                                                                                                                                     logic   st 0  lt   b 0000 0000 0000 0000 0000 0000 0000 0000     st 1  lt   b 0000 0000 0000 0000 0000 0000 0000 0001     st 2     b 0000 0000 0000 0000 0000 0000 0000 0010     st 3  lt   b 0000 0000 0000 0000 0000 0000 0000 0011     st 4  lt   b 0000 0000 0000 0000 0000 0000 0000 0100     st 7  lt   b 0000 0000 0000 0000 0000 0000 0000 0111     st 8  lt   b 0000 0000 0000 0000 0000 0000 0000 1000     st 32767  lt   b 0000 0000 0000 0000 0111 1111 1111 1111     st minus 32768  lt   b 1111 1111 1111 1111 1000 0000 0000 0000                                    and_32_1   and_32  port map    chip_en   gt  AFU_en   data_inl   gt  data_inl   data
11. in2 => cnst_7, data_out => sig1);

    and_32_2 : and_32
      port map (chip_en => AFU_en, data_in1 => sig1, data_in2 => cnst_2, data_out => sig2);
    and_32_3 : and_32
      port map (chip_en => AFU_en, data_in1 => sig1, data_in2 => cnst_4, data_out => sig3);
    and_32_4 : and_32
      port map (chip_en => AFU_en, data_in1 => sig1, data_in2 => cnst_1, data_out => sig4);
    brs_1 : barrel_right_shifter
      port map (chip_en => AFU_en, data_in => data_in2, shift_amnt => cnst_3, data_out => sig5);
    add_32_1 : add_32
      port map (chip_en => AFU_en, data_in1 => sig5, data_in2 => data_in2, data_out => sig6);
    mux_eq_32_1 : mux_eq_32
      port map (chip_en => AFU_en, cond1 => sig3, cond2 => cnst_0, data_in1 => sig5, data_in2 => sig6, data_out => sig8);
    brs_2 : barrel_right_shifter
      port map (chip_en => AFU_en, data_in => data_in2, shift_amnt => cnst_1, data_out => sig7);
    add_32_2 : add_32
      port map (chip_en => AFU_en, data_in1 => sig7, data_in2 => sig8, data_out => sig9);
    mux_eq_32_2 : mux_eq_32
      port map (chip_en => AFU_en, cond1 => sig2, cond2 => cnst_0, data_in1 => sig8, data_in2 => sig9, data_out => sig11);
    brs_3 : barrel_right_shifter
      port map (chip_en => AFU_en, data_in => data_in2, shift_amnt => cnst_2, data_out => sig10);
    add_32_3 : add_32
      port map (
12. steps. We then derive the total energy dissipated in the system from the reported power and the measured execution time. Now, we apply our processor customization framework to generate a real system. 5 Communication Template for Xilinx Microblaze. Xilinx Microblaze [10] is a soft core with a DP external AFU interface (as shown in Figure 6). We demonstrate the utility of our framework by transforming a given input application into a running Microblaze hardware-software system. Microblaze has a DP external AFU that connects to the processor via Fast Simplex Links (FSLs). FSLs are dedicated point-to-point unidirectional 32-bit wide FIFO interfaces. The Microblaze can include a maximum of 8 input and 8 output FSLs. [Figure 7: Microblaze Processor Core with an AFU and its Interface] Microblaze is a 32-bit RISC processor with a simple 3-stage pipeline. Figure 7 shows an AFU and its interfacing with the Microblaze processor core via 8 x 8 FSL channels. The AFU interface implements the processor-AFU communication protocol and is synchronous with the Microblaze processor through a global clock (CLK). The AFU interface is also connected to a counter module to enable counting whenever required. If the count enable signal (Cnt_en) is '1', counting is enabled. Otherwise, the counter is reset to '0'. The signals In[32] and Out[32] are used to send data to and receive data from the AFU, respectively. When the AFU enable signal 
13. the structural AFU model (presented in Appendix E) and the communication template (presented in Appendix D). The AFU with its interface for the adpcm d example is presented as follows:

    library IEEE;
    use IEEE.STD_LOGIC_1164.ALL;
    use IEEE.STD_LOGIC_ARITH.ALL;
    use IEEE.STD_LOGIC_UNSIGNED.ALL;

    library unisim;
    use unisim.vcomponents.all;

    entity my_fsl is
      Port (
        CLK   : in std_logic;   -- System clock
        RESET : in std_logic;

        FSL0_S_CLK     : out std_logic;
        FSL0_S_READ    : out std_logic;
        FSL0_S_DATA    : in std_logic_vector(0 to 31);
        FSL0_S_CONTROL : in std_logic;
        FSL0_S_EXISTS  : in std_logic;

        FSL1_S_CLK     : out std_logic;
        FSL1_S_READ    : out std_logic;
        FSL1_S_DATA    : in std_logic_vector(0 to 31);
        FSL1_S_CONTROL : in std_logic;
        FSL1_S_EXISTS  : in std_logic;

        FSL2_S_CLK     : out std_logic;
        FSL2_S_READ    : out std_logic;
        FSL2_S_DATA    : in std_logic_vector(0 to 31);
        FSL2_S_CONTROL : in std_logic;
        FSL2_S_EXISTS  : in std_logic;

        FSL3_S_CLK     : out std_logic;
        FSL3_S_READ    : out std_logic;
        FSL3_S_DATA    : in std_logic_vector(0 to 31);
        FSL3_S_CONTROL : in std_logic;
        FSL3_S_EXISTS  : in std_logic;

        FSL0_M_CLK     : out std_logic;
        FSL0_M_WRITE   : out std_logic;
        FSL0_M_DATA    : out std_logic_vector(0 to 31);
        FSL0_M_CONTROL : out std_logic;
        FSL0_M_FULL    : in std_logic;

        FSL1_M_CLK     : out std_logic;
        FSL1_M_WRITE   : out std_logic;
        FSL1_M_DATA    : out std_logic_vector
14. tribute syn_noprune of <component name> : component is true. The attribute statements pertained to XST, and Synplify Pro would simply ignore them. So, the black box constraints are specified in Synplicity syntax. If the system does not have any output, the Synthesis phase would prune all the components. This is prevented by using the syn_noprune attribute.

(g) Right click on system-structure (<path>/system.vhd) and select Add Source. Find system.ucf in the data directory and click Open to add constraints.

(h) Select system-structure (<path>/system.vhd). Double click Synthesize - Synplify Pro in the Processes for Source section to run synthesis. (Alternatively, double click Generate Programming File directly, which includes running synthesis and Place and Route.)

(i) Double click Implement Design to perform Place and Route of the design.

(j) Double click Generate Programming File to generate the bitmap file.

(k) Go back to XPS. Select Tools → Import from ProjNav and import the following files:
    (i) BIT file: <path to work directory>/projnav/system.bit and
    (ii) BMM file: <path to work directory>/implementation/system_bd.bmm

(l) From the XPS menu, run Tools → Update Bitstream.

9. Compiling the software: Run Tools → Build All User Applications. Check whether the size of executable.elf is less than 64 KB. (Recall that the memory allocated for both data and instruction was 64 KB. Also note that the maximum usable space in 56 BRAMS 
15. 0 to 31            FSLI M CONTROL    out std logic     FSL1 M FULL   in std logic        end my fs1           architecture IMP of my fsl is  signal count en   std logic     enabling the counter    signal chip en   std logic     signal counter ticks  nal from the counter    signal data inl  data in2  data in3  data in4  signal data outl  data out2          component counter       std logic vector 0 to 1      port   CLK   IN std logic   enable   IN std logic        counter ticks   OUT std logic vector 0 to 1           end component counter     component cutl                         port   AFU en   IN std logic   data inl   IN std logic vector 0 to 31    data  in2 IN std logic vector 0 to 31    data in3 IN std logic vector 0 to 31    data  in4 IN std logic vector 0 to 31    data outl  OUT std logic vector 0 to 31    data out2  OUT std logic vector 0 to 31       end component cutl   component fsl_interface  port    CLK   in std_logic   RESET   in std_logic   count_en   out std_logic   counter ticks   in std logic vector 0 to 1      data inl   out std 1          logic vector 0 to 3        data in2   out std 1    L         Logic_vector  0 to 31    L        Sig     std_logic_vector 0 to 31      std_logic_vector 0 to 31      data_in3  data_in4  data outl   data out2   FSLO S CLK    out std logic vector 0 to 31                         out std logic vector 0 to 31      in std logic vector 0 to 31         in std logic vector 0 to 31      out std         logic               FSLO S READ  
16. 1 had 8 instances in the critical basic block, covering more than 50% of the DFG, and 12 instances overall in the critical function. Both the large size and the large-scale reuse (as defined in [1]) of the ISE account for the significant speedup (1.44x) obtained on AES despite the overhead in sending and receiving operands. Along with the speedup, AES also exhibits a 7% code size reduction, owing to the replacement of a large chunk of code by an ISE in the form of a set of data transfer instructions. 7.2 Power and Energy Results. From Table 2, it is evident that both the peak power (P. Pwr) and the average power (A. Pwr) reduced with the introduction of the AFU. Because the presence of both core and AFU apparently indicates more circuit activity, an initial expectation is increased power with the addition of the AFU. However, because the ISE here is a multi-cycle operation interlocked with the Microblaze pipeline, the AFU operation completely overlaps with a processor pipeline stall. Consequently, we obtain an overall power reduction in the presence of AFU operation owing to reduced overall circuit activity. As shown in Table 3, we also obtained up to 40% saving in energy on account of reduced application runtime. It is interesting to note that the trend of energy decrease (or increase) exactly follows that of speedup (shown again in Table 3 for the sake of comparison). This trend can be expected as a corollary to Table 3: Energy Benefits of ISEs in the Mi
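The link between speedup and energy saving noted above follows from energy being the product of average power and execution time. The sketch below illustrates that relationship only; the power values and the 100 MHz clock are hypothetical placeholders (the report's Table 2 and Table 3 values are not reproduced in this snippet), while the cycle counts for adpcm e are taken from Table 1.

```python
def energy_joules(avg_power_w, cycles, clock_hz):
    """Energy = average power x execution time (cycles / clock frequency)."""
    return avg_power_w * cycles / clock_hz


def energy_saving(p_core, cycles_core, p_afu, cycles_afu, clock_hz):
    """Fractional energy saving of the core+AFU system over the core alone."""
    e_core = energy_joules(p_core, cycles_core, clock_hz)
    e_afu = energy_joules(p_afu, cycles_afu, clock_hz)
    return 1.0 - e_afu / e_core


CLOCK_HZ = 100e6  # hypothetical clock frequency

# Cycle counts for adpcm e from Table 1; power values are assumed.
same_power = energy_saving(1.0, 157177, 1.0, 106821, CLOCK_HZ)
lower_power = energy_saving(1.0, 157177, 0.9, 106821, CLOCK_HZ)

print(f"equal average power:  {same_power:.1%} energy saving")
print(f"10% lower avg. power: {lower_power:.1%} energy saving")
```

With equal average power the saving equals 1 - 1/speedup, so any reduction in average power adds to the saving that the shorter runtime already provides.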
17. AFU_en : in std_logic;
        data_in1  : in std_logic_vector(0 to 31);
        data_in2  : in std_logic_vector(0 to 31);
        data_in3  : in std_logic_vector(0 to 31);
        data_in4  : in std_logic_vector(0 to 31);
        data_out1 : out std_logic_vector(0 to 31);
        data_out2 : out std_logic_vector(0 to 31)
      );
    end cut1;

    architecture logic of cut1 is
      signal sig1  : std_logic_vector(0 to 31);
      signal sig2  : std_logic_vector(0 to 31);
      signal sig3  : std_logic_vector(0 to 31);
      signal sig4  : std_logic_vector(0 to 31);
      signal sig5  : std_logic_vector(0 to 31);
      signal sig6  : std_logic_vector(0 to 31);
      signal sig7  : std_logic_vector(0 to 31);
      signal sig8  : std_logic_vector(0 to 31);
      signal sig9  : std_logic_vector(0 to 31);
      signal sig10 : std_logic_vector(0 to 31);
      signal sig11 : std_logic_vector(0 to 31);
      signal sig12 : std_logic_vector(0 to 31);
      signal sig13 : std_logic_vector(0 to 31);
      signal sig14 : std_logic_vector(0 to 31);
      signal sig15 : std_logic_vector(0 to 31);
      signal sig16 : std_logic_vector(0 to 31);
      signal sig17 : std_logic_vector(0 to 31);
      signal sig18 : std_logic_vector(0 to 31);
      signal cnst_0 : std_logic_vector(0 to 31);
      signal cnst_1 : std_logic_vector(0 to 31);
      signal cnst_2 : std_logic_vector(0 to 31);
      signal cnst_3 : std_logic_vector(0 to 31);
      signal cnst_4 : std_logic_vector(0 to 31);
      signal cnst_7 : std_logic_vector(0 to 31);
      signal cnst_8 : std_logic_vector(0 to 31);
      signal cnst_32767 : std_logic_vector(0 
18. AFU_en is '1', the AFU latches the output in Out[32]. In Figure 8, we present the generic communication template for Microblaze-AFU interaction as a Finite State Machine (FSM) synchronous with respect to CLK. For the sake of explanation, we call an FSL channel FSL_R when it is used for AFU read operation or FSL_W when it is used for AFU write operation. Associated with every FSL_R channel is a set of three signals, namely {FSL_READ_SIG, FSL_DATA_EXISTS, FSL_IN_DATA[32]}. Another triplet, {FSL_WRITE_SIG, FSL_FIFO_FULL, FSL_OUT_DATA[32]}, is associated with every FSL_W channel. The FSM is initially in the "Input Sync" state, waiting for data to arrive on an FSL_R channel. When data exists on the FSL channel, the corresponding FSL_DATA_EXISTS signal goes high, causing a transition from the "Input Sync" state to the "Input Read" state. In the "Input Read" state, FSL_READ_SIG is set to high to cause the data in the FSL_R FIFO to be read into [State diagram labels: Input Sync (FSL_READ_SIG <= '0', FSL_WRITE_SIG <= '0', Cnt_en <= '0'; stays while FSL_DATA_EXISTS == low, leaves when FSL_DATA_EXISTS == high); Input Read (FSL_READ_SIG <= '1', In <= FSL_IN_DATA, AFU_en <= '0'); Output Sync (FSL_READ_SIG <= '0', AFU_en <= '1', Cnt_en <= '1'); Output Write (FSL_OUT_DATA <= Out, FSL_WRITE_SIG <= '1', AFU_en <= '0')]
19. [Figure 8: Communication Template for AFU Interface in Microblaze] In[32] using a 32-bit signal array, FSL_IN_DATA. After the data has been read into In[32], the FSM transitions to the "Output Sync" state and waits on the AFU operation by enabling the counter. After #Cycles (as evaluated in the H/W generation phase in Figure 3(a)) has elapsed, the result of the AFU operation is latched in Out[32]. If the FSL_W FIFO is not full (i.e., FSL_FIFO_FULL is low), a state transition takes place to the "Output Write" state. In the "Output Write" state, data from Out[32] is written into the FSL_W FIFO using FSL_OUT_DATA[32] by setting FSL_WRITE_SIG to high. Thus, for introducing every new AFU, only the AFU module in Figure 7 and the #Cycles change in the process of H/W generation, while the communication template is reused. 6 Experiments. We first describe our experimental setup in detail and then present the experimental results. 6.1 Experimental Setup. The ISE generation algorithm (ISEGEN) [1] was integrated with a MACHSUIF [9] front end. The S/W generation was done with the Microblaze GCC 2.95 (mb-gcc) compiler. The Microblaze Instruction Set has multiple data transfer instructions for sending data to and receiving data from its FSL channels: put for sending and get for receiving data in blocking mode, and nput/nget are the corresponding instructions in non-blocking mode. We used the non-blocking send instruction (nput) and the blo
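The four-state communication template described above can be modelled at cycle level in a few lines. The sketch below is a behavioural illustration only, not the VHDL template from the report's appendix: it assumes a single FSL_R and a single FSL_W channel, models the FIFOs as Python deques with an assumed depth, and treats the AFU as an arbitrary Python function.

```python
from collections import deque
from enum import Enum


class State(Enum):
    INPUT_SYNC = "Input Sync"
    INPUT_READ = "Input Read"
    OUTPUT_SYNC = "Output Sync"
    OUTPUT_WRITE = "Output Write"


class AfuInterface:
    """Cycle-level model of the template with one FSL_R and one FSL_W channel."""

    def __init__(self, afu, n_cycles, fifo_depth=16):
        self.afu = afu                # hypothetical AFU behaviour: In -> Out
        self.n_cycles = n_cycles      # AFU latency (#Cycles)
        self.fifo_depth = fifo_depth  # assumed depth of the FSL_W FIFO
        self.fsl_r = deque()          # processor -> AFU FIFO
        self.fsl_w = deque()          # AFU -> processor FIFO
        self.state = State.INPUT_SYNC
        self.count = 0
        self.in_reg = None
        self.out_reg = None

    def step(self):
        """Advance the FSM by one CLK edge."""
        if self.state is State.INPUT_SYNC:
            self.count = 0                      # Cnt_en = '0' resets the counter
            if self.fsl_r:                      # FSL_DATA_EXISTS is high
                self.state = State.INPUT_READ
        elif self.state is State.INPUT_READ:
            self.in_reg = self.fsl_r.popleft()  # FSL_READ_SIG = '1'
            self.state = State.OUTPUT_SYNC
        elif self.state is State.OUTPUT_SYNC:
            if self.count < self.n_cycles:
                self.count += 1                 # Cnt_en = '1', AFU_en = '1'
            if self.count >= self.n_cycles:
                self.out_reg = self.afu(self.in_reg)   # result latched in Out
                if len(self.fsl_w) < self.fifo_depth:  # FSL_FIFO_FULL is low
                    self.state = State.OUTPUT_WRITE
        elif self.state is State.OUTPUT_WRITE:
            self.fsl_w.append(self.out_reg)     # FSL_WRITE_SIG = '1'
            self.state = State.INPUT_SYNC


# Usage: an AFU that adds 1 with a 2-cycle latency.
iface = AfuInterface(lambda x: x + 1, n_cycles=2)
iface.fsl_r.append(41)
for _ in range(5):
    iface.step()
print(list(iface.fsl_w), iface.state)
```

Swapping in a different AFU only changes the `afu` function and `n_cycles`, which mirrors the point that the communication template itself is reused across AFUs.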
20. Processor Customization on a Xilinx Multimedia Board. Partha Biswas, Sudarshan Banerjee, and Nikil Dutt. CECS Technical Report 06-04, Center for Embedded Computer Systems, School of Information and Computer Science, University of California, Irvine, CA 92697, USA. Mar 12, 2006. Abstract. Performance of applications can be boosted by executing application-specific Instruction Set Extensions (ISEs) on specialized hardware coupled with a processor core. Many commercially available customizable processors have communication overheads in their interface with the specialized hardware. However, existing ISE generation approaches have not considered customizable processors that have communication overheads at their interface. Furthermore, they have not characterized the energy benefits of such ISEs. This report presents a soft processor customization framework that takes an input 'C' application and realizes a customized processor capturing the microarchitectural details of its interface with the specialized unit. The speedup, energy, power, and code size benefits of the ISE approach were accurately evaluated on a real system implementation by applying the design flow to a popular Xilinx Microblaze soft processor core synthesized for four real-life applications. It was found that only one large ISE per application is sufficient to get an average 1.41x speedup over pure software execution in spite of incurring communication overheads. Finally, a simultane
21. SET => RESET, count_en => count_en, counter_ticks => counter_ticks,
        data_in1 => data_in1, data_in2 => data_in2, data_in3 => data_in3, data_in4 => data_in4,
        data_out1 => data_out1, data_out2 => data_out2,
        FSL0_S_CLK => FSL0_S_CLK, FSL0_S_READ => FSL0_S_READ, FSL0_S_DATA => FSL0_S_DATA, FSL0_S_CONTROL => FSL0_S_CONTROL, FSL0_S_EXISTS => FSL0_S_EXISTS,
        FSL1_S_CLK => FSL1_S_CLK, FSL1_S_READ => FSL1_S_READ, FSL1_S_DATA => FSL1_S_DATA, FSL1_S_CONTROL => FSL1_S_CONTROL, FSL1_S_EXISTS => FSL1_S_EXISTS,
        FSL2_S_CLK => FSL2_S_CLK, FSL2_S_READ => FSL2_S_READ, FSL2_S_DATA => FSL2_S_DATA, FSL2_S_CONTROL => FSL2_S_CONTROL, FSL2_S_EXISTS => FSL2_S_EXISTS,
        FSL3_S_CLK => FSL3_S_CLK, FSL3_S_READ => FSL3_S_READ, FSL3_S_DATA => FSL3_S_DATA, FSL3_S_CONTROL => FSL3_S_CONTROL, FSL3_S_EXISTS => FSL3_S_EXISTS,
        FSL0_M_CLK => FSL0_M_CLK, FSL0_M_WRITE => FSL0_M_WRITE, FSL0_M_DATA => FSL0_M_DATA, FSL0_M_CONTROL => FSL0_M_CONTROL, FSL0_M_FULL => FSL0_M_FULL,
        FSL1_M_CLK => FSL1_M_CLK, FSL1_M_WRITE => FSL1_M_WRITE, FSL1_M_DATA => FSL1_M_DATA, FSL1_M_CONTROL => FSL1_M_CONTROL, FSL1_M_FULL => FSL1_M_FULL,
        AFU_en => chip_en
      );

    end IMP;
22. all     entity fsl interface is  Port    CLK   in std logic         System clock  RESET   in std logic     data inl   out std logic vector 0 to 31      data_in2   out  data_in3   out  data_in4   out  data_outl   in  data_out2   in  count en   out     Signal from  counter ticks    FSLO  FSLO  FSLO  FSLO  FSLO    FSLI  FSLI  FSLI  FSLI  FSL                    std logic vector 0 to 31    std logic vector 0 to 31    std logic vector 0 to 31    std logic vector 0 to 31    std logic vector 0 to 31    std logic     enabling the counter    the counter  in std logic vector 0 to 1      _S CLK   out std logic       9  READ  _S_DATA  _S_CONTROL  S EXISTS             out std_logic    in std logic vector 0 to 31    in std logic   in std logic        _S CLK   out std logic        9  READ  _S_DATA  _S_CONTROL  _S_ EXISTS                FSL2  FSL2  FSL2  FSL2  FSL2    FSL3  FSL3  FSL3  FSL3  FSL3    FSLO  FSLO  FSLO  FSLO  FSLO    FSLI  FSLI  FSLI  FSLI  FSLI       out std_logic    in std logic vector 0 to 31    in std_logic   in std_logic        _S CLK   out std logic          9  READ  _S_DATA  _S_CONTROL  _S_EXISTS          out std_logic    in std logic vector 0 to 31    in std logic   in std logic        _S CLK   out std logic        9  READ  _S_DATA   _S_CONTROL  _S_EXISTS             _M CLK   o  _M WRITE  _M DATA     M CONTROL   M FULL          _M CLK   o  _M WRITE  _M DATA    M CONTROL  _M FULL             out std logic   in std logic vector 0 to 31    in std logic        in std logic  
23. ation Model as Behavioral. Set appropriate paths for the simulation libraries as follows (check the installation directories of the ModelSim libraries):
        EDK Library: C:\Xilinx\vhdl\mti_se\edklib
        Xilinx Library: C:\Xilinx\vhdl\mti_se
   (c) Right-click on Project: <Project name> and make sure Mark to Initialize BRAM is selected.
   (d) Right-click on Default: microblaze_0_xmdstub and make sure Mark to Initialize BRAM is unselected.

3. Behavioral Simulation using ModelSim
   (a) From XPS, invoke Tools → Sim Model Generation, which populates the simulation/behavioral directory. Modify simulation/behavioral/system_init.vhd by commenting out the last few lines as follows:

        -- configuration system_conf of system is
        --   for STRUCTURE
        --     for all : bram1_wrapper use configuration work.bram1_conf;
        --     end for;
        --   end for;
        -- end system_conf;

   (b) Now, from Project Navigator, add projnav/testcase.vhd with the following content:

        -- TestBench Template
        LIBRARY ieee;
        USE ieee.std_logic_1164.ALL;
        USE ieee.numeric_std.ALL;

        ENTITY testbench IS
        END testbench;

        ARCHITECTURE behavior OF testbench IS
          -- Component Declaration
          COMPONENT system
            PORT (
              sys_clk : IN std_logic;
              sys_rst : IN std_logic
            );
          END COMPONENT;
          SIGNAL clk : std_logic;
          SIGNAL rst : std_logic;
        BEGIN
          -- Component Instantiation
          uut : system PORT MAP (
            sys_clk => clk,
            sy
24. ch infrastructure. Using a simulator, the authors show speedup for applications that reuse AFUs generated for other applications in the same domain. Such reuse of AFUs across applications is possible only when the ISEs found are reasonably small. However, we will confirm in our experimental results that such small ISEs do not yield a considerable speedup for AFUs with communication overheads.

Sun et al. [6] employ a Tensilica Instruction Extension (TIE) compiler in their methodology and operate at a higher (C source code) level of abstraction. This methodology therefore relies more on the designer's experience for ISE identification and mapping to AFUs. The AFU in this case therefore does not have any communication overhead. Fei et al. [7] integrated a fairly accurate energy estimation engine in the same framework, but they do not report a comparison of energy before and after extending the processor. A recent work aimed at real system implementation [8] generated application-specific instructions for the Altera Nios II processor in the presence of AFUs that do not have communication overheads. The results show a good speedup and limited area overhead, but they do not discuss energy or power consumption. Unlike [8], in this report we deal with the non-trivial details of synchronization between the processor and the AFU with the help of a generic communication template.

Note that in the prior related work, the AFU in general did not have
25. chip_en => AFU_en, data_in1 => sig11, data_in2 => sig10, data_out => sig12);

mux_eq_32_3 : mux_eq_32
  port map (chip_en => AFU_en, cond1 => sig4, cond2 => cnst_0,
            data_in1 => sig11, data_in2 => sig12, data_out => sig13);

sub_32_1 : sub_32
  port map (chip_en => AFU_en,
            data_in1 => data_in3, data_in2 => sig13, data_out => sig15);

add_32_4 : add_32
  port map (chip_en => AFU_en,
            data_in1 => data_in3, data_in2 => sig13, data_out => sig14);

and_32_5 : and_32
  port map (chip_en => AFU_en,
            data_in1 => data_in1, data_in2 => cnst_8, data_out => sig16);

mux_eq_32_4 : mux_eq_32
  port map (chip_en => AFU_en, cond1 => sig16, cond2 => cnst_0,
            data_in1 => sig14, data_in2 => sig15, data_out => sig17);

mux_leq_32_1 : mux_leq_32
  port map (chip_en => AFU_en, cond1 => sig17, cond2 => cnst_32767,
            data_in1 => sig17, data_in2 => cnst_32767, data_out => sig18);

mux_geq_32_1 : mux_geq_32
  port map (chip_en => AFU_en, cond1 => sig18, cond2 => cnst_minus_32768,
            data_in1 => sig18, data_in2 => cnst_minus_32768, data_out => data_out1);

mult_32_1 : mult_32
  port map (chip_en => AFU_en,
            data_in1 => data_in1, data_in2 => cnst_4, data_out => data_out2);

end logic;

F AFU with its Interface for adpcm d

The AFU with its interface that is captured in my_fsl glues together
26. cking receive instruction (get) for our AFU interface. Because two different compilers were used for ISE generation and S/W generation, the subgraph replacement with ISEs was done as a post-assembly pass on the assembly output of mb-gcc. After replacing the identified subgraphs with ISEs, mb-gcc was run again to generate the executable.

We selected four real-life applications for demonstrating the effectiveness of our framework: autcor (Auto-correlation) from the EEMBC suite, adpcm e (ADPCM Encoder) and adpcm d (ADPCM Decoder) from the Mediabench suite, and AES (AES encryption). Our platform is the Xilinx Multimedia Board, which is equipped with a Virtex-II XC2V2000 FPGA. Figure 9 shows a snapshot of the board.

Figure 9: Xilinx Multimedia Board

We used Xilinx Platform Studio to configure the FPGA to include a Microblaze processor with a 64 KB (i.e., the maximum size possible) Block RAM (BRAM), two Local Memory Buses (LMBs) to interface with the BRAM (one for instruction and the other for data), one Microblaze Debugging Manager (MDM) and one Timer (both MDM and Timer on a single On-chip Peripheral Bus (OPB)). The standard inputs and outputs of an application were redirected to the MDM, and the elapsed number of cycles was evaluated using the Timer. We set the clock frequency of the Microblaze processor to 50 MHz. The tools used in the second scheme (Figure 5) for evaluating ene
27. communication overheads at its interface. Indeed, there are many commercially available processors providing such an interface. Common examples are the Altera Nios II processor [13], the LEON processor [12], etc. However, there are similarly many commercial customizable processors where AFUs incur overhead in sending and retrieving data. Some examples include the STMicroelectronics ST120 [11], the Xilinx Microblaze processor [10], etc. To the best of our knowledge, ISE generation in the context of AFUs incurring communication overheads at their interface with the core processor has not been studied yet. This is our motivation for proposing a framework that is capable of incorporating different AFU models and, in particular, targeting the Xilinx Microblaze soft core. We apply the design flow of our framework to study performance gain, energy/power consumption, code size reduction and area overhead with the introduction of an AFU into the Microblaze subsystem.

4 Framework for Complete System Realization

Our framework takes as input a high-level application (in C) and generates an executable and an AFU with an appropriate interfacing protocol (as shown in Figure 2). The executable runs in the processor core as software containing ISEs for invoking the AFU operation in hardware. Our target for running the complete processor-AFU subsystem is an FPGA platform.

ISE Generation, Latency, S/W Generation, H/W Generation, Int
28. croblaze subsystem

BMs        Tot Energy (uJ)   Tot Energy (uJ)   %age      Spdup
           for Core Only     for Core+AFU      Saving
autcor      2.21              3.10             -40.27    0.65x
adpcm d     8.48              5.84              31.13    1.32x
adpcm e    10.54              6.34              39.85    1.47x
AES        69.09             43.69              36.76    1.44x

a consistent power reduction shown in Table 2. Thus, contrary to conventional expectation, enhanced performance simultaneously results in reduced power and energy for the customized Microblaze soft core.

7.3 Slices Utilization

The XC2V2000 FPGA that we use as our target platform has 10752 slices. Table 4 shows the percentage utilization of the FPGA slices before and after introducing the AFU that brought the speedup in Table 1.

Table 4: Slices Utilization (out of 10752) in the absence of an AFU and in the presence of an AFU for the four applications in the XC2V2000 FPGA

BMs        No AFU   autcor   adpcm d   adpcm e   AES
Slices     1274     1609     1804      2226      2043
Util (%)   11       14       16        20        19

Note that the XC2V2000 used here is very small. The largest possible Virtex-II chip, XC2V8000, contains 46592 slices. If the largest FPGA is used instead of the XC2V2000, the average slices utilization reduces to only 5%, which is very reasonable. Thus, the area overhead of including an AFU in the Microblaze subsystem is also minimal.

8 Summary and Future Directions

Applications can be accelerated in a programmable processor by executing their performance cri
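The percentage figures in the two tables above can be cross-checked from the raw numbers. The following is a minimal Python sketch, assuming that the saving is defined as (core only - core with AFU) / core only and that the utilization row is truncated to a whole percent; both definitions are a reading of the tables rather than something the text states, and the function names are illustrative only.

```python
# Cross-check of the energy-saving and slice-utilization figures quoted above.
# Assumption: saving (%) = 100 * (E_core_only - E_core_afu) / E_core_only.
# Assumption: utilization is truncated (not rounded) to a whole percent.

ENERGY_UJ = {            # benchmark: (core only, core + AFU), in uJ
    "autcor":  (2.21, 3.10),
    "adpcm d": (8.48, 5.84),
    "adpcm e": (10.54, 6.34),
    "AES":     (69.09, 43.69),
}

SLICES = {"No AFU": 1274, "autcor": 1609, "adpcm d": 1804,
          "adpcm e": 2226, "AES": 2043}
TOTAL_SLICES_XC2V2000 = 10752


def energy_saving_percent(core_only, core_afu):
    """Relative energy saving in percent; negative means energy went up."""
    return 100.0 * (core_only - core_afu) / core_only


def utilization_percent(slices, total=TOTAL_SLICES_XC2V2000):
    """Whole-percent slice utilization, fractional part truncated."""
    return (100 * slices) // total


for name, (e_core, e_afu) in ENERGY_UJ.items():
    print(f"{name:8s} saving = {energy_saving_percent(e_core, e_afu):7.2f} %")

for name, used in SLICES.items():
    print(f"{name:8s} util   = {utilization_percent(used):3d} %")
```

Under these assumptions the autcor row comes out negative, which matches its reported slowdown: the AFU version consumes more energy than the software-only version.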
29. cts/pdf/datasheets/se.pdf

[16] Xilinx XPower Documentation. http://toolbox.xilinx.com/docsan/xilinx6/books/data/docs/dev/dev0089_14.html

[17] Panelists peer into future of FPGAs. Article 60407325, EETimes, March 7, 2005.

A System Realization on Xilinx Multimedia Board

Here we present the detailed steps to realize a basic hardware/software subsystem, with the hardware consisting of the Microblaze processor, local memory bus, BRAM, timer and MDM, and the software being the Microblaze executable.

1. Invoke Xilinx Platform Studio (XPS) 6.2i (or higher).

2. Click File → New Project → Platform Studio. The settings for Create New Project are as follows:
        Project File: <Path to work directory>/system.xmp
        Target Device Architecture: virtex2
        Device Size: xc2v2000
        Package: ff896
        Speed Grade: -6 (default)

3. Click OK and then answer Yes for "Do you want to start with an empty MHS File". Then click OK for the comment Project → Add/Edit Cores.

4. Setting up the hardware: Under the System tab, right-click on System BSP and select Add/Edit Cores.
   (a) Add the following peripherals:
       • microblaze (1)
       • bram_block (1)
       • lmb_bram_if_cntlr (2) (1 for data, 1 for instruction): Base Address = 0x00000000, High Address = 0x0000ffff. Memory allocated (both for data and instruction) = 64 KB.
       • opb_mdm (1): Base Address = 0xffff0400, High Address = 0xffff04ff
       • opb_timer (1): Base Address = 0xffff0800, High Address = 0xffff08ff
       Note that address ra
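The address map in step 4(a) can be sanity-checked with a few lines of arithmetic. The sketch below is only an illustration in Python: the dictionary keys are labels chosen here, and the opb_mdm high address is taken to be 0xffff04ff, which is an assumption about a digit that is hard to read in the source.

```python
# Check of the address map listed above: size of each range and disjointness.
# Assumption: the opb_mdm high address is 0xFFFF04FF (one digit is unclear
# in the source; this value keeps the range clear of the timer range).

ADDRESS_MAP = {
    "lmb_bram_if_cntlr": (0x00000000, 0x0000FFFF),
    "opb_mdm":           (0xFFFF0400, 0xFFFF04FF),
    "opb_timer":         (0xFFFF0800, 0xFFFF08FF),
}


def size_bytes(base, high):
    """Number of bytes covered by an inclusive [base, high] range."""
    return high - base + 1


def overlaps(a, b):
    """True if two inclusive (base, high) ranges share any address."""
    return a[0] <= b[1] and b[0] <= a[1]


def all_disjoint(address_map):
    """True if no two ranges in the map overlap."""
    ranges = list(address_map.values())
    return not any(overlaps(ranges[i], ranges[j])
                   for i in range(len(ranges))
                   for j in range(i + 1, len(ranges)))


for name, (base, high) in ADDRESS_MAP.items():
    print(f"{name:18s} {size_bytes(base, high):6d} bytes")
print("disjoint:", all_disjoint(ADDRESS_MAP))
```

The BRAM range covers 65536 bytes, which agrees with the 64 KB stated in the list.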
30. d browse for projnav/projnav_par.do.
   (h) Double-click Simulate Post-Place & Route VHDL Model to invoke the structural simulation. Run the structural simulation by selecting Simulate → Run → All. Choose an appropriate termination criterion to terminate the simulation.

C Creating a Custom FSL Interface

A user core in the form of an AFU resides in the <project directory>/pcores directory. The base name for an FSL interface description follows the naming convention <core name>_<version number>. For example, my_fsl_1_00_a is a valid base name for a user core called my_fsl.

Under the <project directory>/pcores/data directory, two files are created for describing the interface and specifying the order in which the underlying modules are synthesized. The respective files are my_fsl_1_00_a.mpd and my_fsl_1_00_a.pao, corresponding to the chosen base name. The VHDL source code for the user core and the FSL interface resides under the <project directory>/pcores/hdl/vhdl directory.

D VHDL Source for the Communication Template

We present in this section the simple FSL interface used to synchronize the data transfer between the processor core and the user core (or AFU). The I/O constraints used here are 4 inputs and 2 outputs.

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_ARITH.ALL;
use IEEE.STD_LOGIC_UNSIGNED.ALL;

library unisim;
use unisim.vcomponents.
31. eration. This latency information is passed on to the scheduler in the S/W generation phase (shown with a dotted arrow in Figure 3). The evaluated number of cycles is also used to synchronize the AFU with respect to the core.

Apart from the component library, the designer also creates a communication template for AFUs, which captures the communication protocol between the processor core and the AFU. The write-back of the result from the AFU to the processor is delayed by the exact number of cycles required by the AFU operation. The implementation of the communication protocol, together with synchronization with the core, completes the AFU interface synthesis. Note that the H/W generation phase can be applied to synthesize the AFU and its interface in the customized processor model presented in Figure 1.

[Figure: framework flow diagram. Preprocessing (compiler front end, profile code, annotate CFG/DFG with hw/sw latencies and execution counts); ISE generation (ISEGEN, under constraints); S/W generation (replace subgraph by ISE, scheduling, register allocation, back end); H/W generation (component library binding, interface synthesis, clock period calculation, critical path evaluation)]
32. erface, System, FPGA platform

Figure 2: The Flow of our Framework

The expanded view of our framework is shown in Figure 3(a). It has five main phases: the Preprocessing phase, the ISE generation phase, the S/W generation phase, the H/W generation phase, and the Processor subsystem generation phase. The Preprocessing phase takes the input application and generates an annotated intermediate representation. The ISE generation phase generates ISEs under microarchitectural constraints. The H/W generation phase synthesizes the corresponding AFUs with their interfaces, and the S/W generation phase generates the executable. A dotted arrow between these two phases indicates that the latency of an ISE obtained in the H/W generation phase is passed on to the S/W generation phase. Finally, the Processor subsystem generation phase builds the complete running system for evaluation.

4.1 Preprocessing Input Application

This phase can be identified as the box labeled "Preprocessing" in Figure 3(a). A compiler front end yields the Control Flow Graph (CFG) and Data Flow Graph (DFG) of an input application and runs predication to combine a set of small basic blocks into a large basic block. The input application is then profiled and the basic blocks are annotated with their execution counts. A component library is created containing a synthesizable combinational element corresponding to each instruction in the target instruction set. Each element in the library is synthesized for a
33. erhead in processor-AFU interface. Thus, we confirm that if the AFU interface has a communication overhead, a small ISE will only result in performance degradation.

Table 2: Power Benefits of ISEs in the Microblaze subsystem

            Core Only                Core + AFU               % Pk    % Avg
BMs         Peak Pwr    Avg Pwr      Peak Pwr    Avg Pwr      Pwr     Pwr
            (mW)        (mW)         (mW)        (mW)         Redn    Redn
autcor      1957        1287         1869        1229          4.5     4.5
adpcm d     1975        1317         1919        1197          2.8     9.1
adpcm e     2070        1332         2012        1178          2.8    11.6
AES         2256        1276         1982        1187         12.1     7.0

The applications adpcm d and adpcm e are the two examples where predication of several small critical basic blocks led to a large basic block. Consequently, the ISEs found for these two benchmarks are very large, containing on the order of 40 operations. This led to a significant speedup in spite of the communication overhead. Figure 10 shows the ISE of adpcm e that generated a speedup of 1.47x over pure software execution. The shaded nodes show the inputs and the outputs of the ISE. Appendix D, Appendix E and Appendix F present the complete VHDL source code for the AFU and its interface for adpcm d.

Figure 10: An ISE for ADPCM ENCODER (adpcm e) having 4 inputs and 2 outputs; each operation node maps to a hardware component.

The last benchmark under consideration is AES, which has the largest number of instructions in its critical basic block. The generated ISE
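The reduction columns of Table 2 follow directly from the peak and average power values. As a hedged illustration, the Python sketch below recomputes them, assuming the reduction is defined as (core only - core with AFU) / core only and reported to one decimal place; the function name is illustrative and not taken from the report.

```python
# Recompute the power-reduction columns of Table 2 from the raw mW values.
# Assumption: reduction (%) = 100 * (P_core_only - P_core_afu) / P_core_only.

POWER_MW = {  # benchmark: (peak core only, avg core only, peak core+AFU, avg core+AFU)
    "autcor":  (1957, 1287, 1869, 1229),
    "adpcm d": (1975, 1317, 1919, 1197),
    "adpcm e": (2070, 1332, 2012, 1178),
    "AES":     (2256, 1276, 1982, 1187),
}


def reduction_percent(before_mw, after_mw):
    """Relative power reduction in percent."""
    return 100.0 * (before_mw - after_mw) / before_mw


for name, (pk_core, avg_core, pk_afu, avg_afu) in POWER_MW.items():
    peak = reduction_percent(pk_core, pk_afu)
    avg = reduction_percent(avg_core, avg_afu)
    print(f"{name:8s} peak redn = {peak:5.1f} %   avg redn = {avg:5.1f} %")
```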
34. f power and energy consumption.

The steps required for taking the design from the EDK into the Project Navigator and running the behavioral and structural simulation are as follows.

1. Creating Simulation libraries
   (a) Compiling Xilinx Simulation Libraries (COMPXLIB). Following are the two ways:
       • From the Project Navigator:
         i. Open an existing project (that might have been exported from Xilinx Platform Studio using the Export to ProjNav option) and highlight the target device.
         ii. In the Processes for Source window, under the Design Entry Utilities, right-click Compile HDL Simulation Libraries and select Properties. Select the appropriate Target Simulator (ModelSim SE in our case) and click OK.
         iii. Double-click Compile HDL Simulation Libraries to compile the Xilinx Simulation Libraries (in the C:\Xilinx\vhdl\mti_se directory).
       • From Command Line (shown for virtex2 board):
             compxlib -s mti_se -f virtex2 -l vhdl
         Run compxlib -help to choose the appropriate option for the board under consideration.
   (b) Compiling EDK Behavioral Simulation Libraries (COMPEDKLIB):
             Compedklib.bat -s mti_se -o edklib -X

2. Initial Set-up for Simulation
   (a) Invoke Xilinx Platform Studio (XPS) and load the design created with XPS (using ProjNav implementation flow, as explained in the document titled Building a Hardware/Software system using Xilinx EDK and Xilinx Multimedia Board).
   (b) From XPS, select Options → Project Options, and in the HDL and Simulation tab, select Simul
35. given technology and the corresponding instruction in the DFG is annotated with a normalized hardware latency. Each instruction in the DFG is also annotated with its software latency obtained from the target architecture specification.

4.2 ISE Generation Phase

This phase, shown as the "ISE generation" box in Figure 3(a), is integrated with the compiler front end. An ISE generation algorithm takes the annotated CFG/DFG and returns subgraphs, or ISEs, that would maximize performance under microarchitectural constraints. Although any ISE generation algorithm can be used, we use ISEGEN in our framework because it identifies all the instances of an ISE, exploiting large-scale ISE reuse.

4.3 H/W Generation Phase

We show this phase in the box marked "H/W generation" in Figure 3(a). The two subtasks of this phase are component library binding and interface synthesis. The identified subgraph or ISE is isolated and each instruction in the subgraph is replaced by the corresponding element in the component library. Figure 10 shows an example subgraph where each node maps to an element in the component library. The data dependencies between the instructions are replaced by port-to-port connections between the elements, and the resulting structure is an AFU. This structural AFU model is then synthesized to evaluate the critical path length. The critical path length divided by the clock period of the processor core gives the number of cycles needed for the AFU op
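The cycle count described above is a one-line calculation. The Python sketch below is an illustration under two assumptions: the quotient is rounded up to a whole number of cycles (the text only says "divided by"), and the critical path values used in the example are hypothetical. The 50 MHz clock is the frequency the report uses for the Microblaze core.

```python
import math


def clock_period_ns(frequency_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / frequency_mhz


def afu_latency_cycles(critical_path_ns, period_ns):
    """Whole processor cycles needed to cover the AFU critical path.

    Assumption: a partial cycle counts as a full cycle (ceiling).
    """
    if critical_path_ns <= 0 or period_ns <= 0:
        raise ValueError("critical path and clock period must be positive")
    return math.ceil(critical_path_ns / period_ns)


period = clock_period_ns(50)              # 50 MHz core clock -> 20 ns
print(afu_latency_cycles(47.3, period))   # hypothetical 47.3 ns critical path
print(afu_latency_cycles(40.0, period))   # hypothetical path that fits exactly
```

Rounding up is the conservative choice for synchronization: the result is never written back before the combinational AFU has settled.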
36. he application, the scheme for Performance Measurement uses the bitmap of the synthesized system to program an FPGA fabric, which then becomes the platform for actually running the executable. The executable is downloaded into the system memory through a JTAG port and the number of cycles for running the executable is measured using a hardware timer.

[Figure labels: Processor Subsystem, Core, Peripherals, Tightly coupled, External Memory, External AFU]

Figure 6: A DP external AFU Interface

Since there is no direct way to measure the power of a running system on the FPGA fabric, we employ a different scheme for Power/Energy Evaluation (depicted in Figure 5) for accurately evaluating the power and energy consumption of the system. Note that there are three kinds of information in the post-Place and Route system (Figure 3(b)): the structural model of the system, the timing information and the routing information. We superimpose the memory image of the executable (in Figure 3(a)) onto the memory section of the structural model. This complete structural model, along with the timing information, is run through a cycle-accurate hardware simulator to generate a Value Change Dump (VCD) of all the signals in the structural netlist. The routing information and the VCD information together are then used by a power simulator to generate the dynamic power consumed at different time
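Once a power simulator has produced power values over time, peak power, average power and total energy follow from simple arithmetic. The Python sketch below assumes a uniformly sampled trace and that total energy is the sum of power times step length; the sample values and the 20 ns step are hypothetical, and the function names are illustrative rather than part of any Xilinx tool.

```python
def energy_uj(power_mw, step_ns):
    """Integrate a uniformly sampled power trace into energy in microjoules.

    1 mW * 1 ns = 1e-3 W * 1e-9 s = 1e-12 J = 1e-6 uJ.
    """
    return sum(power_mw) * step_ns * 1e-6


def average_power_mw(power_mw):
    """Arithmetic mean of the sampled power values."""
    return sum(power_mw) / len(power_mw)


def peak_power_mw(power_mw):
    """Largest sampled power value."""
    return max(power_mw)


trace_mw = [1200, 1300, 1250, 1350]   # hypothetical samples, one per 20 ns step
print(energy_uj(trace_mw, step_ns=20))
print(average_power_mw(trace_mw), peak_power_mw(trace_mw))
```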
37. ideration). Since the ISE generation phase has ensured convexity of the identified subgraphs, it is never possible to have a dependency edge from the FirstUse node to the LastDef node, because this would make the subgraph non-convex. Nevertheless, it is possible to encounter a situation where a FirstUse point precedes a LastDef point in the instruction sequence. This renders the subgraph replacement impossible without code restructuring. Consider the following sequence of operations in instruction order:

    (1) a = b * c
    (2) f = a | 0x2
    (3) e = 5
    (4) d = a + e
    (5) g = e   d

Suppose the ISE under consideration is a multiply followed by an add, as identified by the nodes labeled 1 and 4 in Figure 4(b1). Figure 4(b1)-(b3) show an example of how the placement of the ISE between LastDef and FirstUse is accomplished through code restructuring. Since in this case the FirstUse point appears earlier in the instruction chain than the LastDef point, the ISE cannot be placed anywhere (Figure 4(b1)). So, the instructions have to be reordered so that the LastDef point precedes the FirstUse point. This reordering is possible because there is no dependency from FirstUse to LastDef. Figure 4(b2) shows the code snippet after restructuring Figure 4(b1), i.e., swapping the positions of node 2 and node 3, and Figure 4(b3) shows the placement of the ISE between the LastDef point (node 3) and the FirstUse point (node 2).

If an ISE is used as a single user-defined instruction
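The LastDef/FirstUse reasoning for the five-operation sequence above can be sketched in a few lines of Python. This is an illustration, not the framework's implementation: only the operands of each node are recorded (the operators do not affect placement), each variable is assumed to be defined once in the block, the ISE is assumed to be convex, and all names (INSTRS, place_ise, and so on) are hypothetical.

```python
# Node -> (destination, sources) for the five-operation example.
INSTRS = {
    1: ("a", ("b", "c")),
    2: ("f", ("a",)),
    3: ("e", ()),
    4: ("d", ("a", "e")),
    5: ("g", ("e", "d")),
}
ISE = {1, 4}   # the multiply (node 1) followed by the add (node 4)


def ise_operands(instrs, ise):
    """Return (external inputs, outputs) of the ISE."""
    outputs = {instrs[n][0] for n in ise}
    inputs = {s for n in ise for s in instrs[n][1]} - outputs
    return inputs, outputs


def last_def_first_use(order, instrs, ise):
    """Positions of the LastDef and FirstUse points in `order`.

    LastDef is -1 if no ISE input is defined in the block; FirstUse is
    len(order) if no ISE output is used in the block.
    """
    inputs, outputs = ise_operands(instrs, ise)
    last_def, first_use = -1, len(order)
    for pos, node in enumerate(order):
        if node in ise:
            continue
        dest, sources = instrs[node]
        if dest in inputs:
            last_def = max(last_def, pos)
        if outputs & set(sources):
            first_use = min(first_use, pos)
    return last_def, first_use


def place_ise(order, instrs, ise):
    """Reorder so that every producer of an ISE input precedes the ISE."""
    defined_by = {instrs[n][0]: n for n in order}
    inputs, _ = ise_operands(instrs, ise)
    producers, worklist = set(), list(inputs)
    while worklist:                       # transitive producers of the inputs
        node = defined_by.get(worklist.pop())
        if node is not None and node not in producers:
            producers.add(node)
            worklist.extend(instrs[node][1])
    before = [n for n in order if n in producers]
    after = [n for n in order if n not in producers and n not in ise]
    return before + ["ISE"] + after


print(last_def_first_use([1, 2, 3, 4, 5], INSTRS, ISE))  # FirstUse before LastDef
print(last_def_first_use([1, 3, 2, 4, 5], INSTRS, ISE))  # after swapping 2 and 3
print(place_ise([1, 2, 3, 4, 5], INSTRS, ISE))
```

In the original order the FirstUse position is smaller than the LastDef position, so no legal slot exists. After the swap the order is reversed and the ISE fits between node 3 and node 2.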
38. is 64 KB. If not, it is not possible to run with only BRAMs. The alternatives are out of the scope of this document.

10. Running the system
   (a) Switch on the board and invoke iMPACT from Xilinx ISE → Accessories.
   (b) Configure devices via Boundary-Scan Mode with Automatically connect to cable and identify Boundary-Scan chain selected. Select the appropriate device to program (e.g., xc2v2000 in our case).
   (c) Right-click on the device and select Assign New Configuration File. Find download.bit in the <path to work directory>/implementation/ directory and select Open. (Observe the PROG LED change color from red to green, indicating success.) Close the iMPACT window.
   (d) Create a file xmd.ini in <path to work directory>/ with the following lines:
        help
        mbconnect mdm
        dow mblaze_code/executable.elf
        rst
        con
   (e) From the XPS menu, run Tools → XMD and check the output of running executable.elf (software) on the synthesized hardware.

B Steps for System Simulation using ModelSim

A complete system simulation is intended for verifying correctness and generating the Value Change Dump (VCD) for the different signals employed. Correctness is ensured using both the behavioral simulation and the structural (Post-Place and Route) simulation. The VCD is relevant only after the flattened netlist has been generated. After the VCD dump is generated by the structural simulation run, XPower is employed to evaluate the system in terms o
39. n the embedded domain, that permit execution of only the critical application kernels in customized units (as hardware), with the rest of the application executing on the processor core (as software). This speeds up the application without compromising the processor clock or modifying the architectural model of the processor, and yet preserves the flexibility of the software approach. We call such a coprocessing hardware element an Ad-hoc Functional Unit (AFU). The AFU operation is triggered by an instruction or a set of instructions that we call an Instruction Set Extension, or ISE. In the past, researchers have modeled AFUs having no communication overhead. However, many commercially popular customizable processors have communication overheads in their interface with AFUs. Therefore, our goal is to consider the microarchitectural details of an AFU interface in a processor customization framework and accurately evaluate the performance and energy benefits of ISEs in a realistic processor. The efficacy of the framework lies in seamlessly considering the synchronization between the processor and the AFU in a unified manner for different applications.

Minimizing power and energy consumption is as important as maximizing performance in embedded systems. High power consumption may destroy a chip completely through overheating, while high energy consumption may reduce the battery life of an embedded device. Therefore, even though ISEs can achieve high speedups
40. nges chosen are disjoint.
   (b) Add the following bus connections:
       • lmb_v10_v1_00_a (2):
           microblaze_0 dlmb (M), lmb_bram_if_cntlr_0 slmb (S)
           microblaze_0 ilmb (M), lmb_bram_if_cntlr_1 slmb (S)
       • opb_v20_v1_10_b (1):
           microblaze_0 dopb (M), microblaze_0 iopb (M), opb_mdm_0 sopb (S), opb_timer_0 sopb (S)
   (c) For all the Clk and Rst ports, the net names must be sys_clk or sys_rst, corresponding to Clk and Rst ports respectively.
   (d) The following parameters need to be changed from their default values:
       • microblaze_0:
           C_DEBUG_ENABLED = 1
           C_USE_BARREL = 1 (to use a barrel shifter)
           C_NUMBER_OF_RD_ADDR_BRK
           C_NUMBER_OF_WR_ADDR_BRK = 1
       • opb_mdm_0: C_UART_WIDTH = 8
       • lmb_v10_0: C_EXT_RESET_HIGH = 0
       • lmb_v10_1: C_EXT_RESET_HIGH = 0
       • opb_v20_0: C_EXT_RESET_HIGH = 0
   Click OK to register all the above changes for the hardware.

5. Setting up the software: In the Applications tab, right-click on Software Projects and click Add New Project. Give a name to the project and click OK.
   (a) Right-click on Sources and click Add File. Select all the source (.c) files and click OK.
   (b) Right-click on Headers and click Add File. Select all the header (.h) files and click OK.
   (c) Right-click on Default: microblaze_0_xmdstub and click to Mark to Initialize BRAM.
   (d) Right-click on Project: <Project name> and click to unselect Mark to Initialize BRAM.
   (e) Right-click on Project: <Project name> and select Set Compiler Options. Unde
41. ous savings in energy (up to 40%) and power (up to 12% peak power reduction) with this increased performance were observed.

Contents

1 Introduction
2 Customized Processor Model
3 Related Work
4 Framework for Complete System Realization
    4.1 Preprocessing Input Application
    4.2 ISE Generation Phase
    4.3 H/W Generation Phase
    4.4 S/W Generation Phase
    4.5 Processor Subsystem Generation Phase
5 Communication Template for Xilinx Microblaze
6 Experiments
    6.1 Experimental Setups
    6.2 System Implementation on the Board
7 Experimental Results
    7.1 Performance and Code Size
    7.2 Power and Energy Results
    7.3 Slices Utilization
8 Summary and Future Directions
A System Realization on Xilinx Multimedia Board
B Steps for System Simulation using ModelSim
C Creating a Custom FSL Interface
D VHDL Source for the Communication Template
E Structural AFU model for adpcm d
F AFU with its Interface for adpcm d

List of Figures

1 Target Customized Processor Subsystem
2 The Flow of our Framework
3
42. r       Eq         FSL3 S EX      STS    rp    THEN    FSL2 S REA  FSL3 S REA  data inl  lt    data in2      data in3      data in4      AFU en  lt    0    count  lt   2   END IF  WHEN 2   gt   FSLO_S_REA  FSL1 S REA  FSL2 S REA  FSL3 S REA  AFU en      count en  lt    0    IF  counter ticks  count  lt   3   END IF   WHEN 3   gt   IF  FSLO M FULL                    lt            CO                          yf  Fit s  LOF   rots     lt         D  D  lt    D  lt    D  lt    177    1                     ry    FSLO S1  FSL1 SI  FSL2 S      DATA   DATA   DATA        ESL3        DATA        enable counting  Wy m      THEN       only this will vary       depending on app               1 cycle before writing      THEN    FSLO M1       DATA  lt      data outl     FSLO M WRITE  lt   AFU en  lt    0    count  lt   0   END IF   IF   FSL1 M FULL  0    THEN  FSL1 M      DATA  lt   data out2   FSL1 M WRI     TE  lt    1    AFU en  lt    0    count  lt   0   END IF   WHEN OTHERS  END CASE  end if        Popes                           gt  NULL                end process   end behavioral     E Structural AFU model for adpcm d    The structural model of the AFU generated for adpcm d with I O Constraints of 4 inputs 2 outputs is  presented in the cut1 module     library IEEE   IEEE STD LOG                    use  C 1164 ALL           use IEEE STD LOGIC ARITH ALL   use IEEE STD LOGIC UNSIGNED ALL                          library unisim   use unisim vcomponents all        entity cutl is  Port    
43. r Directories tab, give a suitable path for Output ELF File (for example, <Path to work directory>/output/executable.elf). If a barrel shifter is present in the Microblaze (i.e., if C_USE_BARREL = 1), then, under the Advanced tab, insert -mxl-barrel-shift in the Program Sources Compiler Options.

6. Select Project → Software Platform Settings.
   (a) In the Processor and Driver Parameters tab, change the Current Value of xmdstub_peripheral to opb_mdm_0.
   (b) In the Library/OS Parameters tab, change the Current Values of both stdin and stdout to opb_mdm_0.

7. Create a User Constraints File in <path to work directory>/data/system.ucf with the following lines (for Xilinx Multimedia Board):
        NET "sys_clk" LOC = "AD16";
        NET "sys_rst" LOC = "AH7";
        NET "sys_clk" NODELAY;
        NET "sys_clk" TNM_NET = "clk50";
        TIMESPEC "TSclk50" = PERIOD "clk50" 20 ns HIGH 50 %;
   Note that the pin mapping will alter if the board is different. The clock frequency is selected to be 50 MHz with a 50% duty cycle.

8. Synthesizing the hardware (to be carried out in one of the following ways):
   • Using EDK with Xilinx XST (easier option):
     (a) Run Tools → Generate Netlist
     (b) Run Tools → Generate Libraries and BSPs
     (c) Run Tools → Update Bitstream
   • Using EDK with Synplicity Synplify Pro (if XST license is unavailable):
     (a) Open Options → Project Options from the XPS menu. Select the tab Hierarchy and Flow and make the following changes:
         i. Change Synthesi
44. ral Constraints. In Proc. of DAC, 2003.

[4] P. Yu and T. Mitra. Scalable Custom Instructions Identification for Instruction Set Extensible Processors. In Proc. of CASES, 2004.

[5] N. Clark, H. Zhong and S. Mahlke. Processor Acceleration through Automated Instruction Set Customization. In Proc. of MICRO, 2003.

[6] F. Sun, S. Ravi, A. Raghunathan and N. K. Jha. Synthesis of Custom Processors based on Extensible Platforms. In Proc. of ICCAD, 2002.

[7] Y. Fei, S. Ravi, A. Raghunathan and N. K. Jha. A Hybrid Energy Estimation Technique for Extensible Processors. IEEE TCAD, 2004.

[8] J. Cong, Y. Fan, G. Han and Z. Zhang. Application Specific Instruction Generation for Configurable Processor Architectures. In Proc. of FPGA, 2004.

[9] Machine SUIF. http://www.eecs.harvard.edu/hube/software/software.html

[10] Microblaze Processor Reference Guide. http://www.xilinx.com/ise/embedded/mb_ref_guide.pdf

[11] ST100 DSP Core Architecture Overview. http://www.st.com/stonline/prodpres/dedicate/st100/overview/overview.htm

[12] The Leon Processor User Manual. http://www.ra.informatik.uni-stuttgart.de/~virazela/LP_Project/leon-2.3.7.pdf

[13] The Nios II Processor Reference Handbook. http://www.altera.com/literature/hb/nios2/n2cpu_nii5v1.pdf

[14] SC140 DSP Core Reference Manual. http://www.soc.napier.ac.uk/module.php3?op=getresource&cloaking=no&resourceid=1473119

[15] ModelSim SE datasheet. http://www.model.com/produ
45. rface inside the processor subsystem or loosely coupled through an external bus. The AFU interface or the external interface implements the communication protocol between the AFU and the processor and thus controls synchronization of data and access to the processor register file.

The function of an ISE is to transfer control to an AFU for execution. An ISE can be either a single user-defined instruction or a set of multiple pre-defined instructions. A single user-defined instruction is decoded as a special instruction, which encapsulates inputs and outputs of an AFU as source and destination operands respectively. The decoder takes the responsibility of issuing such a special instruction to an appropriate AFU for execution. Alternatively, sending inputs and receiving outputs of the AFU from the processor can be done at the expense of multiple data transfer instructions. Such instructions must already exist in the instruction set of the processor in the form of "send data to AFU" and "receive data from AFU" instructions. In this case, the AFU incurs communication overhead at its interface while sending and receiving data.

3 Related Work

Several algorithms [1, 4, 2, 3, 5, 6] have recently been proposed to identify ISEs in a given application. The speedups over simple software execution claimed in most of the approaches [1, 4, 2, 3] are estimated by assuming a typical RISC processor execution model. The methodology in [5] targets Trimaran resear
46. rgy and power are ModelSim for hardware simulation [15] and Xilinx XPower for power simulation [16]. We now detail the steps to realize a complete hardware/software subsystem using the Xilinx Multimedia Board.

6.2 System Implementation on the Board

The steps that we used to build a hardware/software system using Xilinx Embedded Development Kit (EDK) are enumerated in Appendix A. The generated system can be simulated both behaviorally as well as structurally following the steps detailed in Appendix B. Appendix C briefly explains how an AFU is introduced in the form of a user core in the system.

7 Experimental Results

We demonstrate the effectiveness of our approach using a number of front end tools in our framework shown in Figure 3(a).

7.1 Performance and Code Size

The code generation for the baseline configuration was done by mb-gcc with all optimizations turned on (-O2 -mno-xl-soft-mul) so that the performance is maximized in pure software execution. The Microblaze configuration was then customized for different applications by introducing AFU with its interface as explained in Section 6.1. The ISEs were generated with I/O constraints of maximum 4 inputs and 2 outputs and number of AFUs set to 1. Note here that for each application, a different Microblaze configuration is generated and the resulting system is analyzed by applying our framework. The results in terms of code size reduction and speedup over software execution are summarized in
47. s Tool to None.
         ii. Change Implementation Tool Flow to ISE (ProjNav).
     (b) Run Tools → Export to ProjNav. A directory projnav is created that contains the exported files. (Note that if Xilinx Platform Studio has been installed after XST has expired, an error will be reported saying "ERROR: Unable to set property 'Synthesis Tool'". To resolve this error, run a script containing the following in the <Path to work directory> directory:
         sed '/XST/d' npl_cmdfile > tmpfile
         mv tmpfile npl_cmdfile
         pjcli -v -f npl_cmdfile
         )
     (c) Invoke (from Windows menu) Xilinx ISE → Project Navigator.
     (d) Click File → Open Project and open system.npl, to be found in the projnav directory.
     (e) Double click xc2v2000-6ff896 (to be found under Sources in Project) to open Project Properties. Change the value of Synthesis Tool to Synplify Pro (VHDL/Verilog) and click OK.
     (f) Make the following changes in system-structure (<path>/system.vhd) (found under xc2v2000-6ff896):
         i. Comment the lines library UNISIM; and use UNISIM.VCOMPONENTS.ALL;
         ii. Add the following lines in the beginning:
             LIBRARY synplify;
             use synplify.attributes.all;
         iii. Comment all the attribute statements. For example, "attribute box_type of bram_block_0_wrapper: component is "black_box";" for the component "bram_block_0_wrapper". Instead, introduce for each component the following lines: "attribute syn_black_box of <component name>: component is true;" and "at
48. s_rst => rst );

    -- Test Bench Statements
    tb_clk: PROCESS -- 50 MHz clock
    BEGIN
      clk <= '1';
      wait for 10 ns;
      clk <= '0';
      wait for 10 ns;
    END PROCESS tb_clk;

    tb_reset: PROCESS
    BEGIN
      rst <= '0';
      wait for 1 us;
      rst <= '1';
      wait;
    END PROCESS tb_reset;
    -- End Test Bench
    END;

    -- Check the system_init.vhd file for ensuring the correctness of the module names
    configuration testbench_conf of testbench is
      for behavior
        for uut: system
          for STRUCTURE
            for all: bram_block_0_wrapper use configuration work.bram_block_0_conf;
            end for;
          end for;
        end for;
      end for;
    end testbench_conf;

(c) Create a script file projnav/projnav.do with the following content (a script for behavioral simulation):
    cd ../simulation/behavioral
    do system.do
    vcom -93 -work work system.vhd
    vcom -93 -work work ../../projnav/testbench.vhd
    vsim -Lf unisim -t ps -notimingchecks work.testbench_conf
    add wave *

(d) Right click on system-structure (<path>/system.vhd) and select Add Source. Find testbench.vhd in the projnav directory and click Open to add the test bench. Select vhdl testbench while adding the test bench.

(e) Click on testbench-behavior (testbench.vhd). Right click on Simulate Behavioral Model in the Processes for Source partition and select properties. Change the following fields:
    • Use Custom Do File: Check the selection
    • Use Automatic Do File: Uncheck the selection
    • Custom Do File: Click and browse for projnav/projnav
49. t enables accurate evaluation of all the metrics deemed important in embedded system design, namely, performance, energy, power, cost, and code size.

• By applying our framework to Microblaze soft processor core, we conclude that ISEs can be simultaneously beneficial in terms of performance, energy, power and code size.

The rest of the report is organized as follows. We present our target customizable processor model in Section 2. In Section 3, we present some related research work. We describe our framework for transforming a given application to a customized processor subsystem in Section 4. Section 6.1 presents how we use the framework to target Xilinx Microblaze soft processor core. In Section 7, we describe our experimental results. Finally, Section 8 concludes the report.

[Figure 1: Target Customized Processor Subsystem]

2 Customized Processor Model

Our goal is to map a given application to the target customizable processor model shown in Figure 1. In this model, the software part of the application stored in the program memory is composed of base instructions to be run on Execution Unit and ISEs to be run on the hardware part, i.e., AFUs. An AFU can be tightly coupled with the core through an AFU inte
50. tical sections in customized Ad-hoc Functional Units (AFUs) as Instruction Set Extensions (ISEs). We presented an interface-aware processor customization framework that enabled us to implement a customizable soft core microarchitecture capturing the details of interfacing with an AFU. We applied our framework to four real-life applications and realized four different processor configurations. Our results confirmed that in the presence of communication overhead at the processor-AFU interface, significant speedup over pure software execution is possible only if the AFU function is sufficiently larger than a set of 2-3 operations. Further analysis of the synthesized systems led to the conclusion that integration of AFUs in a customizable processor can result in increased performance and reduced code size, while simultaneously decreasing power and energy consumption. Our future work will investigate the advantages of ISEs in other reconfigurable platforms and commercially available processors.

References

[1] P. Biswas, S. Banerjee, N. Dutt, L. Pozzi and P. Ienne. ISEGEN: Generation of High-Quality Instruction Set Extensions by Iterative Improvement. In Proc. of DATE, 2005.
[2] P. Biswas, V. Choudhary, K. Atasu, L. Pozzi, P. Ienne and N. Dutt. Introduction of Local Memory Elements in Instruction Set Extensions. In Proc. of DAC, 2004.
[3] K. Atasu, L. Pozzi and P. Ienne. Automatic Application Specific Instruction Set Extensions under Microarchitectu
51. to 31);

    signal cnst_minus_32768 : std_logic_vector(0 to 31);

    component barrel_right_shifter
      port (
        chip_en    : in  std_logic;
        data_in    : in  std_logic_vector(0 to 31);
        shift_amnt : in  std_logic_vector(0 to 31);
        data_out   : out std_logic_vector(0 to 31)
      );
    end component barrel_right_shifter;

    component add_32
      port (
        chip_en  : in  std_logic;
        data_in1 : in  std_logic_vector(0 to 31);
        data_in2 : in  std_logic_vector(0 to 31);
        data_out : out std_logic_vector(0 to 31)
      );
    end component add_32;

    component sub_32
      port (
        chip_en  : in  std_logic;
        data_in1 : in  std_logic_vector(0 to 31);
        data_in2 : in  std_logic_vector(0 to 31);
        data_out : out std_logic_vector(0 to 31)
      );
    end component sub_32;

    component and_32
      port (
        chip_en  : in  std_logic;
        data_in1 : in  std_logic_vector(0 to 31);
        data_in2 : in  std_logic_vector(0 to 31);
        data_out : out std_logic_vector(0 to 31)
      );
    end component and_32;

    component mult_32
      port (
        chip_en  : in  std_logic;
        data_in1 : in  std_logic_vector(0 to 31);
        data_in2 : in  std_logic_vector(0 to 31);
        data_out : out std_logic_vector(0 to 31)
      );
    end component mult_32;

    component mu
      port (
        cond1
        cond2
        data_in1
        data_in2
        data_out
    end componen

    component mu
      port (
        cond1
        cond2
        data_in1
        data_in2
        data_out
    end componen

    component mu
      port (
        cond1
        cond2
        data_in1
        data_in2
        data_out
    