1. [Table 6, continued; the surviving entries of the Penalties column read "pipeline is stalled" and "NIOS II: late result".]

Regarding the NIOS II multiplier performance, it mostly depends on the hardware used, that is, whether or not the FPGA has dedicated multipliers on chip. If an instruction has a late result penalty, its result is not available until two cycles later, which matters when the result is needed by the next instruction. The penalty may be caused by a lack of data forwarding in the part of the pipeline associated with the instructions that have this penalty. If the pipeline has to be flushed, it takes four cycles to complete.

Since the NIOS II does not have any dedicated double load or store instructions, handling data types larger than a word takes at least twice as long as a single load or store.

Footnote 3: The pipeline is stalled until the load is completed.

Comparison of Synthesizable Processor Cores 5 INSTRUCTION PERFORMANCE

5.1 Branch Delay Slot vs. Dynamic Branch Prediction

This section contains a short comparison of the branch handling methods used by the two processor cores. The aim of a pipelined processor architecture is to keep the pipeline full of instructions at all times. If it is not, performance decreases because the Cycles Per Instruction (CPI) increases. When the pipeline depth increases, the cost of a taken conditional branch also increases.

Branch Delay Slot

The LEON2 uses the branch delay slot feature
2. [Table 8, continued]

                      LEON2        NIOS II
Memory Controller
  SRAM                1 MBYTE      1 MBYTE
ALU
  MULTIPLIER          SOFTWARE     SOFTWARE (12)
  DIVIDER             SOFTWARE     SOFTWARE

Footnote 12: Emulated in software.

Configuration Comments

All configurable options were chosen to consume as few gates and cache bits as possible. Multiplication and division are emulated in software, which decreases the number of gates used but affects performance negatively. The caches on both processors are direct mapped, since this is simple to implement and therefore uses fewer gates. The memory system consists of 1 MB SRAM, which is enough for the benchmarks that will be executed later on. These configurations have been used at both frequencies, 25 MHz and 50 MHz.

9.2 Synthesis Results

Table 9 below shows the synthesis results. The "Total Mem Bits" row contains both the cache bits and the register file bits that each processor core uses. The number of LEs excludes the debug unit that each processor core uses. The number inside the parentheses is the percentage of the maximum available on the device.

Table 9: Minimum Area Synthesis Results

PROCESSOR CORE    LEON2           NIOS II         LEON2           NIOS II
FREQUENCY         25 MHz          25 MHz          50 MHz          50 MHz
LEs               5189 (25 %)     2167 (10 %)     5259 (25 %)     2181 (10 %)
M4K BLOCKS        10 (15 %)       10 (15 %)       10 (15 %)       10 (15 %)
TOTAL MEM BITS    34 688 (11 %)   13 872 (4 %)    34 688
3. The arithmetic diagnosed appears to be Excellent.

FLAW: lack(s) of guard digits or failure(s) to correctly round or chop.

12 Summary

Analyzing the performance of synthesizable processor cores is not a straightforward task, since several factors depend on each other. The whole synthesis chain is complex: it begins with the processor VHDL source code and ends with the netlist once place and route has finished. Every step in the synthesis affects the overall system performance, but the largest impact on the final performance probably comes from the software compiler and the application that is going to be executed on the target.

Synthesizable processor cores are in general configurable in some way. Mainly, the instruction and data cache sizes and the multiplier size and latency can be customized, since they affect the overall performance the most. Another feature is that hardware migration is possible, which makes the cores reusable and flexible. A disadvantage is that some of the processor cores depend on both particular software tools and particular hardware, which forces one to use certain software tools and FPGAs.

In the "Minimum Area" section, each processor core has been configured to use as few gates and cache bits as possible. To save gates, multiplication and division were emulated in software. Regarding the res
4. (11 %)    13 888 (4 %)

Synthesis Result Comments

Concerning the LEON2 "Total Mem Bits", Synplify Pro and Quartus II report different numbers of memory bits used. Table 9 above uses the number that Quartus II reports. The "Total Mem Bits" row contains the memory bits used by the caches and the register file. As shown in table 9, the LEON2 is almost two and a half times bigger than the NIOS II core. The NIOS II core is vendor optimized with respect to the FPGA used. The difference in the number of LEs used by each processor at the two frequencies mostly depends on the timing criteria, which can be harder to fulfil with the same number of LEs.

Footnote 13: On-chip RAM, total 64 blocks, block size 128 x 36 bits.

9.3 Benchmarking

This section contains the benchmarking results and conclusions. As mentioned in section 8.1, all benchmark sets have been executed at two frequencies, 25 MHz and 50 MHz. It is important to note that the NIOS II caches are only 0.5 Kbytes each in this "Minimum Area" part of the complete benchmark set; therefore the numbers in this section should not be taken at face value.

9.3.1 Dhrystone

Table 10 below shows the Dhrystone results.

Table 10: Dhrystone Results, Minimum Area

PROCESSOR CORE    LEON2     NIOS II   LEON2     NIOS II
FREQUENCY         25 MHz    25 MHz    50 MHz    50 MHz
1 ITER
5. (56 %)    158 976 (53 %)    167 168 (56 %)    158 976 (53 %)

Result comments

Concerning the LEON2 "Total Mem Bits", Synplify and Quartus II report different numbers of memory bits used. The table above uses the number that Quartus II reports. The "Total Mem Bits" row includes the instruction and data cache bits and the register file. As shown in table 14 above, the LEON2 uses more than twice as many LEs as the NIOS II: the LEON2's multiplier, divider, two cache controllers and windowed register file consume a lot of LEs compared to the vendor optimized NIOS II. The number of LEs used by each processor differs between the two frequencies because of the new timing criteria to fulfil, which may affect the place and route part of the synthesis chain.

Footnote 14: On-chip RAM, total 64 blocks, block size 128 x 36 bits.

10.3 Benchmarking

This section contains the benchmark results and conclusions for the maximum performance part of the work. Both processor configurations have been executed at two frequencies, 25 MHz and 50 MHz.

10.3.1 Dhrystone

Table 15 below shows the result of executing Dhrystone on both processor cores at both frequencies.

Table 15: Dhrystone Results, Maximum Performance

PROCESSOR CORE    LEON2     NIOS II   LEON2     NIOS II
FREQUENCY         25 MHz    25 MHz    50 MHz    50 MHz
1 I
6. A branch delay slot is a single cycle delay that comes after a conditional branch instruction has begun execution. The compiler can insert an instruction in the delay slot that does not depend on the branch instruction; if that is impossible, a "no operation" instruction is inserted there. This feature improves performance by having the processor execute other instructions while waiting for the branch target and condition to be calculated.

Dynamic Branch Prediction

The NIOS II uses a dynamic branch prediction scheme based on a 2-bit branch history table. A dynamic predictor looks at the outcome of earlier branches to determine whether or not to take coming ones. The efficiency of a dynamic branch predictor depends not only on its precision but also on the cost of a branch, especially when the prediction was wrong and the pipeline has to be flushed: the longer the pipeline, the bigger the penalty. The NIOS II branch prediction cycles can be seen in section 3.3.3.

6 Quick Review

This section contains a quick summary of the configuration possibilities for both processors.

FEATURE                   LEON2          NIOS II
Integer Unit
ARCHITECTURE              32 BIT RISC    32 BIT RISC
ISA                       SPARC V8       NIOS II ISA
CUSTOM INSTRUCTIONS       Yes            Yes
PIPELINE STAGES           5              6
ENDIANESS                 BIG            LITTLE
REGISTER FILE             WINDOWED       FLAT
NR OF GLOBAL REGISTERS    8              32
REGISTERS / WINDOW        16
NR OF WINDOWS             2-32
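The 2-bit scheme described under Dynamic Branch Prediction can be illustrated with a small model. This is a minimal sketch of a generic table of 2-bit saturating counters; the table size, the indexing by low program counter bits and the initial counter state are assumptions made for illustration and are not taken from the NIOS II documentation.

```python
class TwoBitPredictor:
    """Branch history table of 2-bit saturating counters (values 0..3).

    Counter values 0-1 predict 'not taken', values 2-3 predict 'taken'.
    Table size and PC-based indexing are illustrative assumptions.
    """

    def __init__(self, entries=16):
        self.entries = entries
        self.table = [1] * entries  # assumed start state: weakly not taken

    def _index(self, pc):
        # Assumption: word-aligned instructions, indexed by low PC bits.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)


def count_mispredictions(predictor, pc, outcomes):
    """Feed a sequence of branch outcomes and count wrong predictions."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


# A loop branch that is taken 9 times and then falls through, run twice.
loop = ([True] * 9 + [False]) * 2
misses = count_mispredictions(TwoBitPredictor(), 0x100, loop)
```

In this model the loop branch is mispredicted once while the counter warms up and once at each loop exit, so the second pass through the loop costs only a single misprediction. Each misprediction corresponds to a pipeline flush, which is why the penalty grows with pipeline depth.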
7. [Stanford results table, continued; columns as in the other result tables]

PROGRAM           LEON2     NIOS II   LEON2     NIOS II   UNIT
                  25 MHz    25 MHz    50 MHz    50 MHz
INTMM             50        67        17        33        ms
MM                1183      1338      583       667       ms
PUZZLE            400       384       184       192       ms
QUICK             50        61        34        30        ms
BUBBLE            67        78        33        39        ms
TREE              367       105       183       54        ms
FFT               1483      1543      733       772       ms
COMPOSITES
NONFLOATING       184       125       93        62
FLOATING          1188      1201      589       600
COMPOSITE SUM     1372      1326      676       662

To make the comparison of the Stanford benchmark set easier, a graphical overview has been made; it can be seen in figure 4 on the next page.

[Figure 4: Maximum Performance Stanford Execution Times. Bar charts of execution time (ms) for LEON2 and NIOS II at 25 and 50 MHz; the panel on the left covers Perm, Towers, Queens, Intmm, Puzzle, Quick, Bubble and Tree.]

In figure 4 the integer program times can be seen to the left and the emulated floating-point times to the right. Notice the different time scales.

Stanford Result Comments

When the results of the first three programs (Perm, Towers and Queens) are compared with the results of the same three programs in the minimum area section 9.3.2, the difference is not as big as one might assume. This is because the caches have not been filled up yet, and because the programs do not take advantage of the multiplier or the divider which is incl
8. Paranoia was run only on the "Minimum Area" configuration at one frequency. To make the performance evaluation possible, a cross-compiler system had to be used for both processors.

Finally, the results of the performance evaluation have been discussed and evaluated. The two processors have equivalent performance in the "Minimum Area" configuration for Dhrystone and Stanford, while the LEON2 is faster on the control application. In the "Maximum Performance" configuration the NIOS II is faster than the LEON2 on Dhrystone and Stanford, while the LEON2 is faster on the control application.

Acknowledgments

I would like to thank my supervisor Jiri Gaisler, Edvin Catovic and the other Gaisler Research staff for supporting me during this thesis work.

Finally, I would also like to thank my examiner Lars Bengtsson at the Department of Computer Engineering at Chalmers for undertaking my thesis work.

Klas Westerlund
Gothenburg, August 2005

Contents

1 Initial Architecture Analysis
2 LEON2
  2.1 System Overview
  2.2 Instruction Set Architecture
  2.3 Integer Unit
    2.3.1 Pipeline Architectures
    2.3.2 Multiply and Divide Options
  2.4 Cache System
    2.4.1 Instruction Cache
    2.4.2 Data Cache
  2.5 Internal Bu
9. The LEON2 supports optional signed and unsigned MAC instructions using a 16 x 16 bit multiplier with a 40-bit accumulator; a MAC executes in one cycle but has a latency of two cycles. A program that is going to use the MAC instructions should be written in assembly language. A radix-2 non-restoring hardware divider is also available, with the following characteristics: the input data is 64/32 bits, it produces a 32-bit result, and it takes 35 cycles to compute.

2.4 Cache System

Separate multi-set instruction and data caches are provided. Each of them is configurable with 1-4 sets, 1-64 Kbyte/set and 16-32 bytes/line. Sub-blocking is implemented with one valid bit per 32-bit word. Several replacement policies are provided: LRU, LRR and random. It is possible to mix the policies, e.g. LRU on the instruction cache and random on the data cache. The instruction set provides instructions to flush the caches if necessary.

2.4.1 Instruction Cache

The instruction cache uses streaming during line refill to minimize refill latency.

Instruction cache tag layout (1 Kbyte/set, 32 bytes/line):

  ATAG [31:10]    LRR [9]    LOCK [8]    VALID [7:0]

Only the necessary bits are implemented in the cache tag, depending on the configuration. The LRR field is used to store the replacement history if the LRR replacement algorithm was chosen. LOCK indicates whether a line is locked or not.

2.4.2 Data Cache

The data cache uses write-through policy an
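The tag layout given above for a 1 Kbyte set with 32-byte lines can be made concrete with a short sketch. It splits a 32-bit address into tag, line index and word number, and decodes a tag entry into the ATAG, LRR, LOCK and VALID fields. The function names are hypothetical, and the mapping of VALID bit n to word n of the line is an assumption; the text only states that there is one valid bit per 32-bit word.

```python
def split_address(addr, set_size=1024, line_size=32):
    """Split an address for a cache set (sizes must be powers of two).

    With 1 Kbyte/set and 32 bytes/line there are 32 lines, so the byte
    offset uses bits [4:0], the line index bits [9:5] and the tag
    bits [31:10], matching the ATAG field of the layout above.
    """
    offset_bits = line_size.bit_length() - 1
    index_bits = (set_size // line_size).bit_length() - 1
    offset = addr & (line_size - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    word = offset >> 2  # which 32-bit word inside the line
    return tag, index, word


def decode_tag_entry(entry):
    """Decode a tag entry laid out as ATAG[31:10] LRR[9] LOCK[8] VALID[7:0]."""
    return {
        "atag": (entry >> 10) & 0x3FFFFF,
        "lrr": (entry >> 9) & 1,
        "lock": (entry >> 8) & 1,
        "valid": entry & 0xFF,
    }


def is_hit(entry, addr):
    """Hit if the tags match and the valid bit of the addressed word is set.

    Assumption: VALID bit n covers word n of the line.
    """
    tag, _, word = split_address(addr)
    fields = decode_tag_entry(entry)
    return fields["atag"] == tag and ((fields["valid"] >> word) & 1) == 1
```

For example, `split_address(0x40001234)` yields tag `0x100004`, line index 17 and word 5, so a tag entry for that line only produces a hit if VALID bit 5 is set. This is the effect of sub-blocking: a line can be partially valid.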
10. 2 LEON2

This section contains a description of the architecture of the processor core, the cache hierarchy, the instruction set, the available peripherals and the configuration options.

2.1 System Overview

The LEON2 [8] implements a 32-bit single issue SPARC V8 [9] compatible processor core. It is designed for embedded applications, with the following features on chip:

- Separate Instruction and Data Caches (Harvard Architecture)
- Hardware Multiply and Divide
- Flexible Memory Controller
- Parallel 16/32 Bits I/O Port
- Ethernet MAC
- PCI Interface
- Two UARTs
- Interrupt Controller
- Two 24-bit Timers
- Debug Support Unit with Trace Buffer
- Watchdog
- Power-down Function
- Fully Synthesizable VHDL Code
- Can be implemented on both FPGA and ASIC
- Support for Different Floating-Point Units (not included in this work)

A typical LEON2 system can be seen in figure 1 on the next page.

[Figure 1: LEON2 System Overview. Block diagram showing the CPU with data and instruction caches, the debug support unit with its debug serial link, Ethernet, the AMBA AHB and AMBA APB buses with the AHB controller and the AHB/APB bridge, the memory controller with its 8/16/32/64-bit memory bus (SDRAM, SRAM, I/O, PROM), the UARTs, the timers and the interrupt controller.]

2.2 Instruction Set Architecture

LEON2 is a SPARC V8 [9] (IEEE 1754) compliant single issue RISC with a simple five stage pipeline implementation. 32 Bit
11. Additional Units
    3.8.1 JTAG Debug Module
    3.8.2 Exception Controller
    3.8.3 Interrupt Controller
4 Bus Comparison
5 Instruction Performance
  5.1 Branch Delay Slot vs. Dynamic Branch Prediction
6 Quick Review
7 Development Tools
  7.1 Hardware
  7.2 Software
    7.2.1 LEON2
    7.2.2 NIOS II
  7.3 Implementation
8 Benchmarking
  8.1 Benchmarking considerations
    8.1.1 Floating point Emulation
  8.2 The Different Benchmarks Used
    8.2.1 Dhrystone
    8.2.2 Stanford
    8.2.3 Paranoia
    8.2.4 Control Application
9 Minimum Area
  9.1 Processor Configurations
  9.2 Synthesis Results
  9.3 Benchmarking
12. [Table 5, Bus Comparison, continued. The last three row labels are TIMING, WAIT STATES SUPPORT and OPERATING FREQUENCY; the labels of the earlier rows are cut off. The entries are listed per bus, in table order.]

AHB:
  MULTIPLE MASTER, MULTIPLE SLAVE
  PIPELINED, BURSTING, NON TRI-STATE, SPLIT TRANSACTIONS
  AHB -> APB
  8-128 BITS (1)
  1 OR MORE
  TIMING: SYNCHRONOUS
  WAIT STATES SUPPORT: YES
  OPERATING FREQUENCY: USER DEFINED

APB:
  SINGLE MASTER, MULTIPLE SLAVE
  UNPIPELINED, NO BURSTING, NON TRI-STATE
  8-32 BITS
  2
  TIMING: SYNCHRONOUS
  WAIT STATES SUPPORT: NO
  OPERATING FREQUENCY: USER DEFINED

Avalon:
  MULTIPLE MASTER, MULTIPLE SLAVE
  PIPELINED, STREAMING, BURST, TRI-STATE, LATENCY AWARE TRANSFERS
  AVALON -> AHB, AVALON -> TRISTATE
  8-32 BITS
  1 OR MORE
  TIMING: SYNCHRONOUS (ASYNCHRONOUS (2))
  WAIT STATES SUPPORT: YES (FIXED OR PERIPHERAL CONTROLLED)
  OPERATING FREQUENCY: USER DEFINED

Footnote 1: Recommended, max 256 bits.
Footnote 2: Asynchronous IP blocks could be connected to the bus.

5 Instruction Performance

In this section, the instruction cycle performance of each processor is evaluated. Since both processors are RISC, almost every instruction takes one cycle to execute. Some instructions have penalties associated with their execution and take several cycles to complete. Table 6 below gives a summary of the instruction performance.

Table 6: Instruction Cycle Performance

Instruction Type    Cycles on LEON2    Cycles on NIOS II    Penalties
MULTIPLY            1/2/4/5/35         1/5/11               NIOS II: LATE RESULT
DIVIDE              35                 4/66                 NIOS II: LATE RESULT
JUMP                2                  3

[Further rows, values cut off: DOUBLE LOAD, SINGLE STORE, DOUBLE STORE, ATOMIC LOAD STORE, RET, CALLR, CALL, LOAD, STORE, READ CONTROL REGISTER]
13. accuracy will also affect the execution times, especially when it predicts wrong.

There is a noticeable difference in execution time for the Tree program, which includes recursion, iteration and selection. The recursive part causes register window overflow on the LEON2, so that much time is spent in the trap routine, which has a negative impact on performance. Every time the same function is called after the first overflow has occurred, the trap function is executed. The bigger the tree is, the more time is spent in the trap routine. Deeply recursive algorithms are a disadvantage for a processor with a windowed register file compared with a processor that uses a flat register file.

The equal execution times of the sorting algorithms Quicksort and Bubblesort on the NIOS II probably depend on the small caches: the array contains 5000 random numbers, while the data cache is only 512 bytes. This combination causes a high load on the memory system and the system bus, which increases the execution times.

Concerning the floating-point programs Mm and FFT, where the floating-point arithmetic has to be emulated in software, the LEON2 execution times are roughly 30 % shorter. One obvious reason is of course the cache size difference. To find other possible reasons, the assembly code from the two compilers was compared and evaluated. The assembly code
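The register window overflow described above for the Tree program can be illustrated with a simplified model. The sketch below counts overflow and underflow traps for one chain of nested calls followed by the matching returns. It assumes that one window is always kept reserved, so that a file with N windows holds N - 1 frames, and that each trap spills or restores exactly one window; the function name and the default of 8 windows are hypothetical and not taken from the thesis configuration.

```python
def recursion_window_traps(depth, nwindows=8):
    """Count window traps for `depth` nested calls and their returns.

    Simplified model (assumptions): one window is reserved, so at most
    nwindows - 1 frames are resident; each overflow trap spills one
    window to the stack, each underflow trap restores one.
    """
    capacity = nwindows - 1
    resident = 1   # frames currently held in register windows
    active = 1     # frames on the call chain (the caller counts as one)
    overflows = 0
    underflows = 0

    for _ in range(depth):          # nested calls
        active += 1
        if resident == capacity:
            overflows += 1          # oldest window is spilled
        else:
            resident += 1

    for _ in range(depth):          # matching returns
        active -= 1
        resident -= 1
        if resident == 0 and active > 0:
            underflows += 1         # caller's window is restored
            resident = 1

    return overflows, underflows
```

In this model a recursion of depth 10 on an 8-window file causes 4 overflow and 4 underflow traps, while a depth of 6 or less causes none. Once the window file is full, every additional level of recursion costs a trap on the way down and another on the way back, which a flat register file avoids at the price of saving registers in ordinary code.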
14.     9.3.1 Dhrystone
    9.3.2 Stanford
    9.3.3 Control Application
  9.4 Minimum Area Conclusions
10 Maximum Performance
  10.1 Processor Configurations
  10.2 Synthesis Results
  10.3 Benchmarking
    10.3.1 Dhrystone
    10.3.2 Stanford
    10.3.3 Control Application
  10.4 Maximum Performance Conclusions
11 Paranoia
  11.1 Results, NIOS II
  11.2 Results, LEON2
12 Summary
13 Appendix
14 References

List of Tables

1 LEON2 Multiply Options
2 LEON2 Supported Memories and Sizes
3 NIOS II Multiply and Divide Options
4 NIOS II Branch Prediction Cycles
5 Bus Comparison
6 Instruction Cycle Performance
7 Floating point operations and their corresponding number of integer instructions when emulated in software
8 Minimum Area Processor Configurations
15. algorithm
- Bubble: Sort a random array using the Bubblesort algorithm
- Tree: Sort a random array using the Treesort algorithm
- FFT: Calculate a Fast Fourier Transform

After the execution has finished, two kinds of mean value are computed: one that includes the execution times of all eight integer programs (the non-floating composite), and a second that includes all ten execution times (the floating composite).

8.2.3 Paranoia

Paranoia is the name of a program written by William Kahan in the early 1980s. The program used in this benchmark is version 1.4, converted to C by David M. Gay and Thos Sumner.

Paranoia is designed to characterize the floating-point behavior of computer systems.

Here are some of the tests that Paranoia performs:

- Small integer operations
- Search for radix and precision
- Check normalization and guard bits in the arithmetic operations (× among others)
- Check if rounding is done correctly
- Check for sticky bit
- Test whether sqrt(X^2) = X for a number of integers, whether the square root passes monotonicity, and whether it is correctly rounded or chopped
- Testing powers Z^i for small integers Z and i
- Search for underflow threshold and smallest positive number
- Testing powers Z^Q at four nearly extreme values
- Searching for overflow threshold and saturation
- It also tries to compute 1/0 and 0/0

When all tests have been done, Paranoia prints out a detailed result summary, which tel
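The "search for radix and precision" step in the list above can be sketched with a classic probing technique that uses nothing but floating-point arithmetic. This is an illustration in the spirit of that test, not Paranoia's actual code; the function names are hypothetical, and the result is compared against Python's `sys.float_info` only as a sanity check.

```python
import sys


def find_radix():
    """Find the floating-point radix by probing with arithmetic only."""
    a = 1.0
    # Grow a until adding 1.0 is no longer represented exactly.
    while (a + 1.0) - a == 1.0:
        a *= 2.0
    b = 1.0
    # The smallest b that changes a reveals the spacing, i.e. the radix.
    while (a + b) - a == 0.0:
        b += 1.0
    return int((a + b) - a)


def find_precision(radix):
    """Count how many radix digits the significand holds."""
    digits = 0
    x = 1.0
    while (x + 1.0) - x == 1.0:
        x *= radix
        digits += 1
    return digits


radix = find_radix()
precision = find_precision(radix)
matches_platform = (radix == sys.float_info.radix
                    and precision == sys.float_info.mant_dig)
```

On a platform where Python floats are IEEE 754 double precision values, `matches_platform` is expected to be true. On a processor without a floating-point unit, as in this work, the same probing exercises the software emulation library instead of hardware.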
16. applications, a graphical overview was made to make it easy to compare their execution times; it can be seen in figure 3 on the next page.

[Figure 3: Minimum Area Stanford Execution Times. Bar charts of execution time (ms) for LEON2 and NIOS II at 25 and 50 MHz; the panel on the left covers Perm, Towers, Queens, Intmm, Puzzle, Quick, Bubble and Tree.]

In figure 3 the integer program times are to the left and the times of the emulated floating-point programs (Mm and FFT) to the right. Notice the different time scales.

Stanford Result Comments

As mentioned before, the NIOS II caches are only 0.5 Kbytes each. Regarding the first three programs, Perm, Towers and Queens, the LEON2 is the fastest due to its windowed register file, which speeds up the execution of programs containing a few function calls compared with the flat register file that the NIOS II uses.

Concerning the Intmm and Puzzle results, multiplying two matrices, or matrix and loop intensive algorithms in general, cause a lot of both instruction and data transactions. This stresses both caches, the memory system and the system bus considerably. If the processor is equipped with a branch predictor, its
17. baud rate, parity, start, stop and data bits, and optional RTS/CTS flow control signals.

3.7.2 JTAG UART

The JTAG UART core provides communication between a host PC and an Altera FPGA. Master peripherals communicate with the core by reading and writing control and data registers. The core provides bidirectional FIFOs to improve bandwidth over the JTAG connection. The FIFO depth is configurable, and the FIFOs can be implemented either in memory or with registers.

3.7.3 SPI

SPI is an industry-standard serial protocol commonly used in embedded systems to connect the processor to a variety of off-chip devices. The SPI core can implement either the master or the slave protocol. If it is configured as a master, the SPI core can control up to sixteen independent SPI slaves. The core also provides an interrupt output which can flag an interrupt whenever a transfer completes.

3.7.4 Parallel I/O Port

The parallel I/O core provides a memory mapped interface between an Avalon slave port and general purpose I/O ports. The I/O ports connect either to on-chip user logic or to external devices. Each core can provide up to thirty-two I/O ports. A bidirectional mode is available with tristate control. The core can be configured to generate an interrupt request on certain inputs.

3.8 Additional Units

3.8.1 JTAG Debug Module

The NIOS II core supports a JTAG debug module to provide a JTAG interface to software debugging too
18. cache sizes have increased, and multiplication and division are performed in hardware. The size of the LEON2 multiplier was set to 16 x 16 with a latency of 5 cycles, which gave the best timing. The data cache "bytes/line" option on the LEON2 was set to 16, since it will improve the associativity; it consumes more gates, but that is not a problem on this FPGA. Concerning the replacement policy, the LRU and the random algorithms were tested. They performed about equally, but LRU had the best performance in the Control Application part.

The NIOS II configuration options are limited by the FPGA used, since it does not have any dedicated multiplier on chip and there was only one LE based multiplier available. The cache sizes are the only part which is configurable.

10.2 Synthesis Results

The synthesis results can be seen in table 14 below. The LE row of the table covers the processor core, timer, UART and the memory controller. The debug unit that each processor uses is not included in the numbers. The number inside the parentheses is the percentage of the maximum available on the device.

Table 14: Maximum Performance Synthesis Results

PROCESSOR CORE    LEON2          NIOS II        LEON2          NIOS II
FREQUENCY         25 MHz         25 MHz         50 MHz         50 MHz
LEs               7389 (36 %)    3057 (15 %)    7554 (37 %)    3058 (15 %)
M4K BLOCKS        42 (65 %)      43 (67 %)      42 (65 %)      43 (67 %)
TOTAL MEM BITS    167168
19. controller, which acts like a slave on the AHB bus. The function of the controller is programmed through three memory configuration registers via the APB bus.

The controller decodes a 2 Gbyte address space according to table 2 below.

Table 2: LEON2 Supported Memories and Sizes

Type        Size
PROM        512 MB
I/O         512 MB
(SD)RAM     1024 MB

Burst Cycles

To improve memory bus bandwidth, accesses to sequential addresses can be performed in burst mode. Burst transfers are generated when the memory controller is accessed using an AHB burst request. These requests include instruction cache line fills, double loads and double stores.

2.6.1 SRAM

The memory controller can handle up to 1 GByte of SRAM, divided into up to five RAM banks. The bank sizes can be programmed in binary steps from 8 KByte to 256 MByte, while the fifth bank handles the upper 512 MBytes. A read access to the SRAM consists of two data cycles and zero to three wait states. A write access is similar to a read but takes at least three cycles.

2.6.2 PROM

The PROM banks can be configured to operate in 8-, 16- or 32-bit mode. Because a read access to the PROM is always done in 32-bit mode, a read access in 8- or 16-bit mode is done by bursting, in four and two cycles respectively. A write access will only write the necessary bits.

2.6.3 I/O Devices

The I/O device section can be configured to operate in 8- or 16-bit mode. An I/O device can only be acc
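The address decoding described around table 2 can be sketched as follows. The region sizes come from table 2; the order of the regions and their base addresses (contiguous, starting at address 0) are assumptions made for illustration, and all names are hypothetical. The second helper restates the PROM rule from section 2.6.2 that a 32-bit read takes four cycles in 8-bit mode and two in 16-bit mode.

```python
MB = 1 << 20

# Sizes from table 2; order and base addresses are assumptions.
REGIONS = [("PROM", 512 * MB), ("I/O", 512 * MB), ("(SD)RAM", 1024 * MB)]


def build_map(regions, base=0):
    """Lay the regions out back to back, returning (name, first, last)."""
    table = []
    for name, size in regions:
        table.append((name, base, base + size - 1))
        base += size
    return table


def decode(addr, table):
    """Return the region an address falls in, or None if it is outside."""
    for name, first, last in table:
        if first <= addr <= last:
            return name
    return None


def prom_read_cycles(width_bits):
    """Data cycles needed to assemble one 32-bit PROM read."""
    if width_bits not in (8, 16, 32):
        raise ValueError("PROM width must be 8, 16 or 32 bits")
    return 32 // width_bits


MEMORY_MAP = build_map(REGIONS)
```

With these assumptions the three regions exactly fill the 2 Gbyte space, ending at address `0x7FFFFFFF`, and anything above that is not decoded by the memory controller.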
20. register.

2.8 Additional Units and Features

The following units and features are provided.

2.8.1 Debug Support Unit

The Debug Support Unit (DSU) allows non-intrusive debugging on target hardware. The DSU makes it possible to insert breakpoints and watchpoints and to access all on-chip registers from a remote debugger. The DSU has no performance impact on the system. Communication with outside debuggers is done through a Dedicated Communication Link (DCL), e.g. a UART (RS232), or through any AHB master, e.g. Ethernet. The registers of an FPU or co-processor can also be accessed through the DSU.

2.8.2 Trace Buffer

A trace buffer is provided to trace the executed instruction flow and/or AHB traffic. A 30-bit counter is also provided and stored in the trace as a time tag. Its operation is controlled through the DSU control register and the trace buffer control register.

The default size is 128 lines (= 2 kbyte); it can be configured to 8-4096 lines.

2.8.3 Timers

The timer unit implements two 24-bit timers, one 24-bit watchdog and one 10-bit shared prescaler. The prescaler is clocked by the system clock and decremented on each clock cycle. When it underflows, the prescaler is reloaded from the prescaler register and restarted.

2.8.4 Watchdog

A 24-bit watchdog is provided on chip; it is clocked by the timer prescaler. When the watchdog reaches zero, an output signal is asserted. The signal can be used to generate a system reset.
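The prescaler behaviour described in section 2.8.3 can be modelled in a few lines. The sketch assumes that the prescaler emits one tick to the timers and the watchdog each time it underflows, which gives a division ratio of reload value + 1; the function names and the example reload value are hypothetical.

```python
def simulate_prescaler(reload_value, clock_cycles):
    """Count prescaler underflows (ticks) over a number of clock cycles.

    Assumption: the counter is decremented every cycle and, when it is
    decremented past zero, it is reloaded and one tick is generated.
    """
    if not 0 <= reload_value < (1 << 10):
        raise ValueError("the prescaler is 10 bits wide")
    counter = reload_value
    ticks = 0
    for _ in range(clock_cycles):
        if counter == 0:
            counter = reload_value  # underflow: reload and tick
            ticks += 1
        else:
            counter -= 1
    return ticks


def tick_rate_hz(clock_hz, reload_value):
    """Tick frequency under the reload + 1 division assumption."""
    return clock_hz / (reload_value + 1)


def watchdog_timeout_s(start_value, tick_hz):
    """Time until a 24-bit watchdog loaded with start_value reaches zero."""
    if not 0 <= start_value < (1 << 24):
        raise ValueError("the watchdog is 24 bits wide")
    return start_value / tick_hz
```

As a usage example under these assumptions, a reload value of 24 at a 25 MHz system clock gives a 1 MHz tick, and a watchdog loaded with its maximum value of 2^24 - 1 would then expire after roughly 16.8 seconds.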
21. [Quick Review table, continued]

[Peripheral row, start cut off]   ... PCI, RS232                 SPI, I2C, PCI

Software Tool Chain
COMPILER          GCC 3.2.3                   GCC 3.4.1
LIBRARY           NEWLIB 1.12.0               NEWLIB 1.12.0

Supported OSes    ECOS, µCLINUX,              µC/OS-II, µCLINUX, KROS,
                  SNAPGEAR LINUX,             NORTi, NUCLEUS PLUS,
                  RTEMS RTOS                  prKERNEL

7 Development Tools

This section contains a presentation of the hardware and software tools which have been used to implement each processor system on the same target FPGA.

7.1 Hardware

In this section the target hardware is presented.

Altera Cyclone Development Board

The development board on which both processor systems have been executed is based on an Altera Cyclone FPGA [14], since the NIOS II cannot be used on FPGAs other than Altera's own. The board has the following features:

FPGA:
- Cyclone EP1C20F400C7
- 20060 LEs
- On-chip RAM: 294912 bits
- Two PLLs

Memories:
- 1 Mbyte SRAM
- 16 Mbytes SDRAM
- 8 Mbytes Flash
- Compact Flash Interface

Interfaces:
- 10/100 Mbps Ethernet PHY/MAC
- 2 x Serial Ports (RS232)
- Several Expansion/Prototype Connectors
- JTAG

Miscellaneous:
- 50 MHz Oscillator
- Push-buttons
- LEDs
- 7-Segment LEDs

Footnote: An LE is equal to a Xilinx LUT.
Footnote 10: 64 blocks, block size 128 x 36 bits.

7.2 Software

In this section the different software tools are evaluated. The different program versions can be seen in Appendix A.
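The device totals listed above for the Cyclone EP1C20 make it possible to check the utilization percentages that the synthesis result tables give in parentheses. The sketch below is only a cross-check; the names are hypothetical, and the use of truncating integer division is an assumption, chosen because most (not all) of the reported percentages are consistent with truncation rather than rounding.

```python
# Device totals from the board description above.
DEVICE_LES = 20060
RAM_BLOCKS = 64
BLOCK_BITS = 128 * 36                      # one on-chip RAM block
DEVICE_RAM_BITS = RAM_BLOCKS * BLOCK_BITS  # matches the 294912 bits listed


def utilization_percent(used, available):
    """Percentage of a resource in use, truncated to a whole number.

    Truncation (not rounding) is an assumption about how the tables
    were produced.
    """
    return 100 * used // available


# Minimum area LEON2 at 25 MHz: 5189 LEs, 10 RAM blocks, 34688 memory bits.
leon2_min_area = (
    utilization_percent(5189, DEVICE_LES),
    utilization_percent(10, RAM_BLOCKS),
    utilization_percent(34688, DEVICE_RAM_BITS),
)
```

For the minimum area LEON2 at 25 MHz this gives 25, 15 and 11 percent, which agrees with the values reported in table 9. The block arithmetic also confirms that 64 blocks of 128 x 36 bits account for the full 294912 bits of on-chip RAM.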
22. 7.2.1 LEON2

The LEON2 comes with a very extensive configuration tool through which all necessary details are available, e.g. multiplier sizes and latencies, number of cache sets, set sizes, replacement policies and different memory controllers, among others. In order to run programs on the target hardware, the BCC [15], a GNU based cross-compiler system, has been used. It is based on the GNU GCC 3.2.3 compiler and uses newLib [16] 1.12.0 as the C library.

7.2.2 NIOS II

The configuration of a NIOS II based system is done through SOPC, which is an integrated part of Quartus II.

SOPC: Good, but it would have been much better if the sizes and latencies of the arithmetic options were available explicitly. As it is, the tool is like a "black box": you know that you get a hardware based multiplier or divider, but you do not know its input and output sizes, features and latencies. The NIOS II also uses a GNU based tool chain, with an Eclipse [17] based GUI. The compiler version is GNU GCC 3.4.1. The newLib [16] 1.12.0 is used as the C library.

Compiler Comments

Due to the different compiler versions [18] that the processor systems use, the NIOS II may take advantage of the higher optimization level introduced in the newer one.

7.3 Implementation

A few things have been done, based on the changes made by De Nayer Instituut [19] to the LEON2, to make it run on the development board. Technology specific RAM and the PLL were instantiated, and a new port map was created. The compiling and map
23. ATION (µs)           68.2     69.8     33.4     34.9
DHRYSTONES/SEC          14 652   14 301   29 925   28 653
DHRYSTONES/SEC/MHZ      586      572      599      573

Dhrystone Result Comments

The bigger caches on the LEON2 improve the execution time by roughly 4 % compared to the NIOS II. Since the caches are small, and despite the fixed sequence of instructions, there will be many accesses to main memory, which affects the execution time negatively. Doubling the frequency increased the per-MHz performance of the LEON2, while the NIOS II has almost the same per-MHz performance at both frequencies.

9.3.2 Stanford

The Stanford benchmark set was executed on both processor cores at two frequencies. The results can be seen in table 11 below. In this benchmark set, the execution times should be as short as possible.

Table 11: Stanford Results - Minimum Area

PROCESSOR CORE   LEON2     NIOS II   LEON2     NIOS II
FREQUENCY        25 MHz    25 MHz    50 MHz    50 MHz
PROGRAM                                                  UNIT
PERM             66        80        33        40        ms
TOWERS           116       150       66        82        ms
QUEENS           50        52        33        26        ms
INTMM            316       707       150       345       ms
MM               3633      5281      1816      2727      ms
PUZZLE           483       498       266       249       ms
QUICK            66        120       33        60        ms
BUBBLE           84        120       50        60        ms
TREE             500       198       250       98        ms
FFT              3417      5003      1734      2687      ms
COMPOSITES
NONFLOATING      270       279       137       140
FLOATING         2848      4043      1397      2129
COMPOSITE SUM    3118      4322      1534      2269

Since the Stanford benchmark set contains various types of
24. CHALMERS

Comparison of Synthesizable Processor Cores

KLAS WESTERLUND

Master's Thesis
Electrical Engineering Program

CHALMERS UNIVERSITY OF TECHNOLOGY
Department of Computer Science and Engineering
Division of Computer Engineering
Göteborg 2005

All rights reserved. This publication is protected by law in accordance with "Lagen om Upphovsrätt (1960:729)". No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior permission of the authors.

Klas Westerlund, Göteborg 2005.

Abstract

The purpose of this thesis work has been to compare two synthesizable processor cores, the LEON2 from Gaisler Research and the NIOS II provided by Altera.

The work consists of three parts:

1. Initial Core Analysis
2. Implementation on an FPGA
3. Performance Evaluation by Benchmarking

In the analysis part, the processor architecture of each core and characteristics like pipeline depth, cache subsystem and configurability have been evaluated. Both processor cores have been implemented on the same target FPGA board. In the benchmark part, four programs have been used: Dhrystone, Stanford, a typical control application and Paranoia. The first three programs have been executed on two different processor configurations, "Minimum Area" and "Maximum Performance" respectively, and at two different frequencies. Paranoia was only exe
25. ICITY INC. SYNPLIFY PRO VERSION 8.0, BUILD 189R, BUILD JAN 17, 2005
ALTERA QUARTUS II 4.2 BUILD 157 12/07/2004 SJ FULL VERSION
GAISLER RESEARCH GRMON 1.0.6 PROFESSIONAL EDITION

Appendix B - Stanford Weight Values

The non-floating point composite is calculated as the sum of the execution time for each program multiplied by each program's weight value, and divided by the number of integer programs (eight of ten). The floating point composite is calculated in the same way, but the values of all ten programs are included.

Program    Weight
PERM       1.75
TOWERS     2.39
QUEENS     1.83
INTMM      1.46
MM         2.92
PUZZLE     0.50
QUICK      1.92
BUBBLE     1.61
TREE       2.50
FFT        4.44

14 References

[1] LEON2 Processor Overview. Url: http://www.gaisler.com/products/leon2/leon.html

[2] Gaisler Research, Första Långgatan 19, Gothenburg, Sweden. Url: http://www.gaisler.com

[3] NIOS II Processor Overview. Url: http://altera.com/products/ip/processors/nios2/cores/ni2-processor_cores.html

[4] Altera Corporation, 101 Innovation Drive, San Jose, California 95134, USA. Url: http://www.altera.com

[5] The GNU LGPL License form. Url: http://www.gnu.org/copyleft/lesser.html

[6] The LEON2 Full Source Code. Url: http://www.gaisler.com/products/leon2/leon_down.html

[7] NIOS II Licensing Info. Url: http://www.altera.com/products/ip/processors/nios2/features/ni2-q_and_a.html

[8] The LEON2 Processor User's Manual XS
26. T Edition, Version 1.0.27, January 2005.

[9] The SPARC Architecture Manual, Version 8, Revision SAV080SI9308. SPARC International Inc., 535 Middlefield Road, Suite 210, Menlo Park, CA 94025, 415-321-8692. Url: http://www.sparc.org

[10] AMBA AHB and APB Specification Rev 2.0, ARM IHI0011A, 1999. Url: http://www.arm.com

[11] The OpenCores Ethernet MAC. Url: http://www.opencores.com/projects.cgi/web/ethmac/overview

[12] The NIOS II Processor Reference Handbook, NI5V1-1.2, September 2004.

[13] The Avalon Bus Specification, Reference Manual version 2.3, July 2003.

[14] Additional Development Board Info. Url: http://altera.com/products/devkits/altera/kit-nios_1c20.html

[15] BCC, A GNU-based Cross-Compiler System, used by LEON2. Url: http://www.gaisler.com/doc/libio/bcc.html

[16] Newlib, a C Library Supported by Redhat. Url: http://sources.redhat.com/newlib/

[17] The Eclipse IDE GUI. Url: http://www.eclipse.org

[18] The GNU GCC Release History and Change Logs. Url: http://gcc.gnu.org/releases.html

[19] LEON2 Changes Done by De Nayer Instituut, Belgium. Url: http://emsys.denayer.wenk.be/?project=empro&page=cases&id=14#ls

[20] GRMON, A Combined Debug Monitor and Simulator for LEON Processors. Url: http://www.gaisler.com/products/grmon/grmon.html
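The composite calculation described in Appendix B can be sketched as follows. This is a minimal illustration, not the benchmark's own code: the times are the 25 MHz columns of table 11, the weights are those of Appendix B, and MM and FFT are taken to be the two floating-point programs. Dividing both composites by ten is an assumption inferred from the printed table values; with that divisor the sketch reproduces the 25 MHz composites of table 11.

```python
# Weighted Stanford composites for the 25 MHz columns of table 11.
# Times (ms) come from table 11, weights from Appendix B.
# ASSUMPTION: both composites are divided by ten; this divisor is inferred
# from the printed composite values, it is not stated in the text.

WEIGHTS = {
    "PERM": 1.75, "TOWERS": 2.39, "QUEENS": 1.83, "INTMM": 1.46,
    "MM": 2.92, "PUZZLE": 0.50, "QUICK": 1.92, "BUBBLE": 1.61,
    "TREE": 2.50, "FFT": 4.44,
}
FLOATING_PROGRAMS = {"MM", "FFT"}  # assumed floating-point members of the set
DIVISOR = 10

TIMES_MS = {
    "LEON2 @ 25 MHz": {
        "PERM": 66, "TOWERS": 116, "QUEENS": 50, "INTMM": 316, "MM": 3633,
        "PUZZLE": 483, "QUICK": 66, "BUBBLE": 84, "TREE": 500, "FFT": 3417,
    },
    "NIOS II @ 25 MHz": {
        "PERM": 80, "TOWERS": 150, "QUEENS": 52, "INTMM": 707, "MM": 5281,
        "PUZZLE": 498, "QUICK": 120, "BUBBLE": 120, "TREE": 198, "FFT": 5003,
    },
}


def composites(times):
    """Return (non-floating, floating) composites rounded to whole numbers."""
    weighted = {name: t * WEIGHTS[name] for name, t in times.items()}
    integer_part = sum(v for name, v in weighted.items()
                       if name not in FLOATING_PROGRAMS)
    return round(integer_part / DIVISOR), round(sum(weighted.values()) / DIVISOR)


for core, times in TIMES_MS.items():
    non_floating, floating = composites(times)
    print(f"{core}: non-floating {non_floating}, floating {floating}, "
          f"sum {non_floating + floating}")
```

Because the floating composite includes all ten programs, a core that is slow on MM and FFT is penalized heavily: those two programs carry both the largest times and the largest weights.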
27. TERATION (µs)        26.8     23.6     13.1     11.8
DHRYSTONES/SEC          37 383   42 299   76 433   85 030
DHRYSTONES/SEC/MHZ      1495     1692     1529     1701

Dhrystone Result Comments

The results are shown in table 15 above. For a processor system with a big cache, running a program whose main part is a loop with a fixed sequence of instructions, the cache hit rate approaches 100 %. Increasing the cache size further will not give a better result in this benchmark. Since no main memory access is required, the result becomes more representative of the processor than of system performance. If the results are compared with the execution times in the minimum area section, one can see that the cache impact on integer programs is enormous: the execution is almost three times faster, see table 10. One interesting question is how much the compiler affects the execution times. No assembly code study has been done in this section, since evaluating compiler efficiency is a very complex and time-consuming task.

10.3.2 Stanford

The Stanford benchmark set has been executed on both processors, at both frequencies. The results can be seen in table 16 below. The shorter the execution times, the better the performance.

Table 16: Stanford Results - Maximum Performance

PROCESSOR CORE   LEON2     NIOS II   LEON2     NIOS II
FREQUENCY        25 MHz    25 MHz    50 MHz    50 MHz
PROGRAM                                                  UNIT
PERM             66        79        33        39        ms
TOWERS           100       95        50        47        ms
QUEENS           50        49        33        25        ms
28.
FEATURE                   LEON2                          NIOS II/f
TOTAL NR OF REGISTERS     40-520                         32
BRANCH HANDLING           BRANCH DELAY SLOT              BHT (note 5)
FPU SUPPORT               YES                            N/A
MMU                       YES                            N/A

Multiply Options
SIZE AND (LATENCY)        32 x 32 (1)                    32 x 32 (1+2)
                          32 x 16 (2)                    32 x 16 (5+2)
                          32 x 8 (4)                     32 x 4 (11+2)
                          16 x 16 (4)
                          16 x 16 (5)
                          ITERATIVE (35)
MAC                       YES (note 6)                   N/A (note 7)

Divide Options
TYPE                      RADIX 2                        RADIX 2
SIZE AND (LATENCY)        64/32 (35)                     32/32 (4-66) (note 8)

Cache Options
INSTRUCTION CACHE
NUMBER OF SETS            1-4                            1
SET SIZE                  1-64 KBYTE                     0.5-64 KBYTE
POSSIBLE CACHE SIZES      1-256 KBYTE                    0.5-64 KBYTE
LINE SIZE                 16-32 BYTES                    32 BYTES
WRITE POLICY              STREAMING DURING LINE REFILL   CRITICAL WORD FIRST
REPLACEMENT POLICIES      LRU, LRR, RANDOM               N/A

DATA CACHE
NUMBER OF SETS            1-4                            1
SET SIZE                  1-64 KBYTE                     0.5-64 KBYTE
POSSIBLE CACHE SIZES      1-256 KBYTE                    0.5-64 KBYTE
LINE SIZE                 16-32 BYTES                    4 BYTES
WRITE POLICIES            WRITE THROUGH, WRITE BUFFER    WRITE BACK, WRITE ALLOCATE
REPLACEMENT POLICIES      LRU, LRR, RANDOM               N/A

Supported Memory Interfaces
                          SRAM, SDRAM                    SRAM, SDRAM
                          PROM                           FLASH

Supported System Interfaces
                          MEMORY MAPPED I/O              MEMORY MAPPED I/O
                          ETHERNET, JTAG                 ETHERNET, JTAG
                          RS23

Notes:
4 Could be added as a co-processor instruction.
5 Branch History Table, dynamic prediction.
6 16 x 16 multiplier and a 40-bit accumulator.
7 Could be implemented as a custom instruction.
8 The latency depends on the hardware used.
29. 2.8.5 Interrupt Controller

The interrupt controller manages a total of fifteen (15) interrupts, originating from internal and external sources. Each interrupt can be programmed to one of two priority levels. A chained secondary controller for up to thirty-two (32) additional interrupts is also available. There are several unused interrupts that can be utilized by other IP cores and peripherals.

2.8.6 Parallel I/O Port

A partially bit-wise programmable 32-bit I/O port is provided on chip. It is split into two parts: the upper 16 bits can only be used when all areas (ROM, RAM and I/O) of the memory controller are in 8- or 16-bit mode. If the SDRAM controller is enabled, the upper 16 bits cannot be used.

2.8.7 Power-down

The processor can be powered down by writing an arbitrary value to the power-down register. The processor then enters power-down mode on the next load or store instruction. During power-down mode the Integer Unit (IU) is effectively halted. All instructions that are inside the pipeline remain there until the mode is terminated. The mode is terminated, and the IU re-enabled, when an unmasked interrupt with a higher level than the current processor interrupt level (PIL) becomes pending. All other functions and peripherals operate as normal during power-down mode.

2.9 Co-Processors

2.9.1 FPU

The LEON2 processor model provides an in
30. application has been executed on both processors. This application reveals more about their floating-point performance. The results can be seen in table 12 below.

Table 12: Control Application Results - Minimum Area

PROCESSOR CORE        LEON2     NIOS II   LEON2     NIOS II
FREQUENCY             25 MHz    25 MHz    50 MHz    50 MHz
PROGRAM                                                       UNIT
CONTROL APPLICATION   487       1250      251       620       SEC

Control Application Result Comments

Floating-point emulation in software, as mentioned in section 8.1.1, causes a lot of instructions to be executed by the integer part of the processor. In table 12 above, the LEON2 is almost 2.5 times faster than the NIOS II. This program executes more instructions than the floating-point programs included in the Stanford benchmark set do. The combination of many instructions and a relatively small cache system causes a high load on each processor and on the cache and memory system as well. In this situation, the data handling capabilities of the processor cores are revealed. In this case the LEON2 is the better one.

9.4 Minimum Area Conclusions

In this "Minimum Area" section, the two cores perform about equally on the integer benchmarks. In the floating-point part of the benchmarks the LEON2 performs better. The difference may depend on the bigger cache system and the write buffers that LEON2 uses.

A relatively small cache combined with multiplication and divide emulated in software while e
31. aside Buffer

3 NIOS II
3.1 System Overview
3.2 Instruction Set Architecture
3.3 Integer Unit
    3.3.1 Pipeline Architecture
    3.3.2 Multiply and Divide Options
    3.3.3 Branch Prediction
3.4 Cache System
    3.4.1 Instruction Cache
    3.4.2 Data Cache
3.5 Internal Buses
    3.5.1 Avalon On-chip Bus
3.6 Memory Interfaces
    3.6.1 SDRAM
    3.6.2 DMA
    3.6.3 CFI
    3.6.4 EPCS
3.7 System Interfaces
    3.7.1 UART
    3.7.2 JTAG UART
    3.7.3 SPI
    3.7.4 Parallel I/O Port
3.8
32. ating-point test program Paranoia can be seen.

11.1 Results - NIOS II

When Paranoia was executed on the NIOS II, the program reported one failure in the multiplication part of the test.

A part of the output from Paranoia on NIOS II:

    Multiplication is neither chopped nor correctly rounded.
    Sticky bit used incorrectly or not at all.

    The number of FLAWs discovered = 1.
    The arithmetic diagnosed seems Satisfactory though flawed.

Possible Failure Sources

The failure is probably caused by the code generation in the soft-float part of the compiler. There was no error when the program was compiled without optimizations. However, performance drops without optimizations, which is not satisfactory, especially when the processor is used for a lot of emulated floating-point calculations.

Result without optimizations:

    No failures, defects nor flaws have been discovered.
    Rounding appears to conform to the proposed IEEE standard P754,
    except for possibly Double Rounding during Gradual Underflow.
    The arithmetic diagnosed appears to be Excellent!

11.2 Results - LEON2

When Paranoia was executed on the LEON2, neither failures nor flaws were detected.

A part of the output from Paranoia on LEON2:

    No failures, defects nor flaws have been discovered.
    Rounding appears to conform to the proposed IEEE standard P754,
    except for possibly Double Rounding during Gradual Underflow
33. aware peripherals, streaming peripherals and multiple bus masters. The advanced features allow multiple units of data to be transferred between peripherals during a single bus transaction. Avalon masters and slaves interact with each other based on a technique called slave-side arbitration. Slave-side arbitration determines which master gains access to a slave if at least two masters attempt to access the same slave at the same time. Both the instruction and data buses are implemented as Avalon master ports. The data master port connects to both memory and peripheral components, while the instruction master port only connects to memory components.

Every peripheral mentioned in the following sections uses the Avalon bus. In figure 2 on page 10, a bus overview can be seen.

3.6 Memory Interfaces

The processor core is able to access up to 2 GBytes of external address space. Data memory, peripherals and memory-mapped I/O are all mapped into the address space of the data master port on the Avalon interface. Multibyte numbers are stored as little endian.

When memory is shared by the instruction and data master ports, the highest performance is achieved when the data master port has been assigned the higher arbitration priority on that memory.

3.6.1 SDRAM

The SDRAM controller provides an interface to off-chip SDRAM. The controller supports the standard SDRAM PC100 specification. The controller handles all SDRAM protocol requirements. T
34. contained a lot of load, branch and multiply instructions, and an emulated floating-point multiplication or divide needs some extra instructions, since the integer multiply and divide themselves have to be emulated: no hard multiplier or divider is available in these configurations. All load instructions stall the NIOS II pipeline, due to its load delay of two cycles. The small cache system causes a lot of replacement conflicts, which puts a higher load on the system memory and on the system bus as well. In this program, the branch handling capabilities have an impact on the execution performance, especially on the NIOS II: if its predictor predicts wrong, the pipeline has to be flushed. Pipeline flushing becomes time consuming if it happens too often, since the execution has to restart from the instruction that comes after the branch instruction.

Finally, their non-floating composite values are quite equal, despite the cache size differences, but the difference concerning the floating-point composite is approximately 30 %. In this case, the cache sizes and the write buffers that LEON2 uses speed up the execution, while the load delay mentioned above affects the execution times on the NIOS II.

9.3.3 Control Application

To find out how good each processor is when dealing with soft-float operations, and as a complement to the floating-point programs in the Stanford benchmark set, the control
35. cuted on the "Minimum Area" configuration at one frequency. To make the benchmark part possible, a cross-compiler tool chain for each processor system has been used.

The benchmark results are discussed and evaluated. Both processor cores perform equally on Dhrystone and Stanford on the "Minimum Area" configuration, but LEON2 is the fastest on the "Control Application". On the "Maximum Performance" configuration, NIOS II is fastest on Dhrystone and Stanford. LEON2 again performs best on the "Control Application".

Sammanfattning (Summary)

The purpose of this thesis work has been to compare two synthesizable processors: LEON2, developed by Gaisler Research, and NIOS II, provided by Altera.

The work consists of three parts:

1. Comparison of the processors
2. Implementation on an FPGA development board
3. Performance evaluation using benchmark programs

In the first part of the work, the architecture of the processors and characteristic features such as the number of pipeline stages, cache system and configurability have been compared and evaluated. Both processors have been implemented on the same development board, based on an Altera Cyclone FPGA. For the performance evaluation, four programs have been run on both processors: Dhrystone, Stanford, a typical control application and Paranoia. The first three programs have been run on two different processor configurations, "Minimum Area" and "Maximum Performance" respectively, at two different frequencies
36. d implements a double-word write buffer. It can also perform bus snooping on the AHB bus. A local scratch pad RAM can also be added to the data cache controller to allow 0-wait-state access without requiring data write-back to external memory.

Data cache tag layout (4 Kbyte set, 32 bytes/line):

ATAG [31:12] | Not Used [11:10] | LRR [9] | LOCK [8] | VALID [7:0]

Only the necessary bits are implemented in the cache tag, depending on the configuration. The LRR field is used to store the replacement history if the LRR replacement scheme has been chosen. LOCK indicates whether a line is locked or not.

Cacheable memories: PROM and RAM
Non-cacheable: I/O and internal (AHB)

Write buffer

Consists of three 32-bit registers that temporarily store data until it is sent to the destination; it acts like a FIFO.

Cache line locking

If the lock bit in the cache tag is set to "1", it prevents the cache line from being replaced by the replacement algorithm (LRR, LRU or Random).

CCR - Cache Control Register

The operation of the instruction and data caches is controlled through a common CCR. Each cache can be in one of three modes: disabled, enabled or frozen. The register is 32 bits wide.

Disabled: No caching; all load/store requests are passed to the memory controller directly.
Enabled: Both instructions and data are cached.
Frozen: As enabled, but no new lines are allocated on read mis
37. elow, a list of the approximate corresponding number of integer instructions can be seen. The numbers have been taken from the NIOS II instruction set simulator, with a hardware-based multiplier and divider available. The number of integer instructions on LEON2 may differ due to the difference in instruction sets.

Table 7: Floating-point operations and their corresponding number of integer instructions when emulated in software.

FLOATING POINT OPERATION   NR OF INTEGER INSTRUCTIONS   NR OF CYCLES
ADDITION                   350                          600
SUBTRACTION                350                          600
MULTIPLICATION             550                          1300
DIVIDE                     1550                         2000

The numbers in table 7 above show that floating-point emulation takes roughly 50-200 times longer than regular integer arithmetic. If no hardware multiplier or divider is available the number of integer instructions increases further, since the multiplication and division instructions themselves have to be emulated.

Note 11: A GNU GCC specific compilation flag.

8.2 The Different Benchmarks Used

In this section, the four different benchmarks used are presented.

8.2.1 Dhrystone

Dhrystone is a benchmark invented in 1984 by Reinhold P. Weicker. The benchmark was first published in Ada; today the C version of the benchmark is mainly used. The current version of Dhrystone, version 2.1, created in 1988, has been used to measure the integer performance on both processors. The original purp
38. eme depends not only on the accuracy, but also on the cost of a branch if the prediction was wrong. In section 5.1 a comparison of the two different branch handling methods can be seen.

Static prediction:

In the NIOS II/s core, static branch prediction is implemented using the branch offset direction.
A negative offset: predict taken
A positive offset: predict not taken

Dynamic prediction:

In the NIOS II/f core, dynamic branch prediction is implemented using a 2-bit branch history table.

Branch Cycles

In table 4 below, the NIOS II branch cycles are shown.

Table 4: NIOS II Branch Prediction Cycles

Prediction                         Cycles   Penalty
CORRECTLY PREDICTED, TAKEN         2        NO PENALTY
CORRECTLY PREDICTED, NOT TAKEN     1        NO PENALTY
MISPREDICTED                       4        PIPELINE IS FLUSHED

3.4 Cache System

The NIOS II/f processor core supports both instruction and data caches. Both caches are always enabled at run-time. Data cache bypass methods are available via software. Cache management and coherency are handled by software; the instruction set provides instructions for cache management. The core supports the bit-31 cache bypass method for accessing I/O on the data master port.

3.4.1 Instruction Cache

The instruction cache has the following features:
- Direct mapped implementation
- Critical word first
- 32 bytes (eight words) per cache line
- Configurable size: 512 bytes to 64 Kbytes

The instructi
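The two prediction schemes and the cycle counts of table 4 can be sketched in a few lines. This is a simplified model, not Altera's implementation: the cycle counts come from table 4, while the table size, the indexing by program counter and the saturating-counter encoding of the "2-bit branch history table" are assumptions made for illustration.

```python
# Sketch of the two NIOS II branch prediction schemes described above.
# Cycle counts come from table 4; the table size, the PC indexing and the
# saturating-counter encoding are assumptions made for illustration.

CYCLES_TAKEN_OK = 2      # correctly predicted, taken
CYCLES_NOT_TAKEN_OK = 1  # correctly predicted, not taken
CYCLES_MISPREDICT = 4    # mispredicted, pipeline is flushed


def static_predict(offset: int) -> bool:
    """Static scheme: a negative (backward) offset is predicted taken."""
    return offset < 0


class TwoBitPredictor:
    """Dynamic scheme modelled as a table of 2-bit saturating counters."""

    def __init__(self, entries: int = 256) -> None:  # size is hypothetical
        self.entries = entries
        self.table = [1] * entries  # 0-1: predict not taken, 2-3: predict taken

    def _index(self, pc: int) -> int:
        return (pc >> 2) % self.entries  # assumed indexing by word address

    def predict(self, pc: int) -> bool:
        return self.table[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)


def branch_cycles(predicted: bool, taken: bool) -> int:
    """Cycle cost of one branch according to table 4."""
    if predicted != taken:
        return CYCLES_MISPREDICT
    return CYCLES_TAKEN_OK if taken else CYCLES_NOT_TAKEN_OK


def loop_cost(iterations: int, predictor: TwoBitPredictor, pc: int = 0x100) -> int:
    """Branch cycles for a loop-closing branch that is taken on all but the last pass."""
    total = 0
    for i in range(iterations):
        taken = i < iterations - 1
        total += branch_cycles(predictor.predict(pc), taken)
        predictor.update(pc, taken)
    return total


print(loop_cost(10, TwoBitPredictor()))
```

In this model a ten-iteration loop pays for one misprediction while the counter warms up and one at loop exit; a static backward-taken prediction would only miss at the exit.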
39. essed in a single access in 32-bit mode.

2.6.4 SDRAM

SDRAM access is supported to two banks of PC100/PC133 compatible devices. The controller supports 64-512 MByte devices. The SDRAM controller contains a refresh function that periodically issues an AUTO-REFRESH command to both SDRAM banks; the refresh period can be programmed in the memory controller register. The SDRAM can also be write protected.

2.7 System Interfaces

2.7.1 UART

Two identical UARTs are provided for serial communications. The UARTs support data frames with 8 data bits, one start bit, one optional parity bit and one stop bit. Hardware flow control is supported through the RTSN/CTSN handshake signals. The two UARTs can be run in loop-back mode to verify a working connection.

2.7.2 Ethernet MAC

A 10/100 Mbps Ethernet MAC is available. It is based on the core from OpenCores [11], with two AHB interfaces, one master and one slave. The AHB master interface is used by the MAC DMA engine to transfer Ethernet packets to and from memory. The slave handles all configuration. Interrupts generated by the Ethernet MAC are routed to the interrupt controller.

2.7.3 PCI

Primarily used for debugging purposes; it supports DSU communications over the PCI bus, if the development board used has a PCI connector. The interface consists of one PCI memory BAR occupying 2 Mbyte of the PCI address space, and an AHB address
40. going to compare. It is important to understand how different features affect each other and how the performance is affected, both in a positive and a negative manner.

In this case, regarding the minimum area configurations, each processor core has been configured to be as small as possible, with respect to the number of LEs and the total number of cache bits used. Multiplication and division are emulated in software.

Concerning the maximum performance configurations, the idea was to use as much as possible of all available resources. The multiplier and divider were chosen to give as good timing as possible, and the number of cache bits is set to the maximum available on the FPGA.

It is important to keep in mind that benchmark performance will vary depending on the processor configuration, implementation tools, targeted FPGA architecture, device speed grade, and the software compiler and library used.

8.1.1 Floating-point Emulation

Since both processor cores are intended to be used in embedded applications, no floating-point unit (FPU) is included by default. To be able to execute programs that contain floating-point arithmetic in the high-level source code, the floating-point part has to be emulated. The compiler has to be informed about this during compilation, by using the "-msoft-float" flag. The compiler then inserts a specified sequence of integer instructions, which behaves as if the operation was done by an FPU. In table 7 b
41. guage
- Which Benchmark (Program) Version is Used
- Which Tool Chain (Compiler, Library)
- Which Optimization Level is Used
- Which Hardware is Used and How the Processor Core is Configured
- Which Processor Frequency

Regarding the benchmarks in this report, one must keep in mind that the NIOS II processor is optimized with respect to both the FPGA and the development board used. There might be some features that the LEON2 could not utilize well enough on the FPGA or on the development board used.

In the following tests, all programs have been compiled with the GCC "-O2" flag and the "-msoft-float" flag. All maximum performance executables were compiled with their respective hardware multiplication and divide specific flags. If these benchmarks are executed again on the same target hardware, it is plausible that the results may differ by about 1 %, since the processor behavior is not deterministic. All benchmark sets have been executed on both processor cores at two frequencies, 25 MHz and 50 MHz. Two different frequencies were chosen to see how the execution times are affected when the frequency is doubled. If the frequency is doubled the execution times are not always halved, depending on the new timing criteria.

When a comparison of two or more devices is to be done, one must be sure that the comparison is relevant; you must be very careful of what you are
42. he core can access SDRAM subsystems with data widths of 8, 16, 32 or 64 bits, various memory sizes and multiple chip selects. Up to 4 banks of memory are supported. Because the Avalon interface is latency-aware, pipelined read transfers are allowed.

3.6.2 DMA

The DMA controller performs bulk data transfers, reading data from a source address range and writing the data to a different address range. An Avalon master peripheral, such as the NIOS II, can hand memory transfer tasks to the DMA controller, which then performs them independently of the processor. The controller is also capable of performing streaming Avalon transactions.

3.6.3 CFI

The common flash interface core (CFI controller) provides a connection to external flash memory. The Avalon tristate bridge creates an off-chip memory bus that allows the flash chip to share address and data pins with other memory chips. Avalon master ports can perform read transfers directly from the CFI controller's Avalon port.

3.6.4 EPCS

The EPCS device controller core allows NIOS II systems to access an Altera EPCS serial configuration device. The EPCS device is able to store non-volatile program data and FPGA configuration data. Boot loading is also provided.

3.7 System Interfaces

3.7.1 UART

The UART core provides a register-mapped Avalon slave interface, which allows communication with master peripherals such as the NIOS II. It provides configurable
43. hen a more computing-intensive program is executed, it reveals a more realistic work load on the processor as well as on the memory system. As the numbers in table 17 show, the LEON2 performs about 30 % better than the NIOS II. The difference could depend on the multiplier latency, which is six cycles longer on the NIOS II, and the load delay, which is one cycle on LEON2 and two cycles on the NIOS II.

10.4 Maximum Performance Conclusions

As one could see in the result sections above, the execution times have decreased compared with the "Minimum Area" results. When a hard multiplier is available on-chip, it improves the execution speed compared to software emulation.

When "small programs" like the Stanford benchmark set are executed, a bigger cache system is not always an advantage. If it is too big, it will introduce some overhead by checking empty places while accessing the caches. If the cache is too small there will be replacement conflicts, which decrease the execution performance, since the data and the instructions have to be fetched from main memory.

In loop-intensive applications the performance is improved, as seen in the results above, since bigger caches exploit temporal and spatial locality better, so data and instructions do not have to be fetched from main memory as often.

11 Paranoia

In this section the results from flo
44. ign  Write-Back

The pipeline is stalled when one of these conditions occurs:
- Multi-cycle instructions
- Avalon instruction master-port read access
- Avalon data master-port read/write access
- Data dependencies on long-latency instructions

When a stall has occurred, no new instructions enter any stage. Only the Decode and Align stages create stalls. Up to thirteen instructions (depending on the multiplier latency) can be executed while waiting for the result from a multi-cycle instruction, if there is no data dependency between the result of the multi-cycle instruction and the other instructions.

3.3.2 Multiply and Divide Options

The processor supports a variety of multiplication and divide options, mostly depending on the FPGA, according to table 3 below. No embedded multiplier or divider is provided on the development board used in this thesis work.

Table 3: NIOS II Multiply and Divide Options

ALU option                     Details     CPI     Result latency (cycles)
NO HW MUL/DIV                  EMULATED    40      N/A
EMBEDDED (STRATIX I & II)      32 x 32     1       2
EMBEDDED (CYCLONE II)          32 x 16     5       2
LE-BASED                       32 x 4      11      2
HARDWARE DIVIDE                32/32       4-66    2

The hardware divider raises no exception on division by zero, nor on overflow.

3.3.3 Branch Prediction

The core is provided with a branch predictor to achieve better performance while avoiding stalls during execution. The effectiveness of a branch predictor sch
45. ls. The core also supports an optional enhanced interface that allows real-time trace data to be routed out of the processor and stored in an external debug probe.

3.8.2 Exception Controller

The architecture provides a simple, non-vectored exception controller to handle all exception types. All exceptions cause the processor to transfer execution to a single exception address. The handler at this address determines the cause of the exception and executes the appropriate exception routine.

3.8.3 Interrupt Controller

The architecture supports thirty-two (32) external hardware interrupts. The core has thirty-two (32) level-sensitive interrupt request (IRQ) inputs, providing a unique input for each interrupt source. The priority is determined by software. The software can enable and disable any interrupt source individually by masking the IENABLE control register.

4 Bus Comparison

This section contains a more detailed comparison of the internal buses used by each processor core, see table 5 below. It covers the AMBA AHB and AMBA APB, which the LEON2 uses, and the Avalon switch fabric, which the NIOS II uses.

Table 5: Bus Comparison

Option               AMBA AHB       AMBA APB     AVALON
PROVIDER             ARM            ARM          ALTERA
BUS VERSION          REV 2.0        REV 2.0      1.2
DATA BUS WIDTH       8-1026 BITS    8-32 BITS    8-32 BITS
ADDRESS BUS WIDTH    32 BITS        32 BITS      32 BITS
ARCHITECTURE
PROTOCOL
BRIDGING
TRANSFER SIZES
TRANSFER CYCLES
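The per-source enable and disable described in section 3.8.3 comes down to setting and clearing single bits in a 32-bit mask. The following is a minimal sketch in Python, not NIOS II code: the `ienable` value only stands in for the IENABLE control register, and all function names are hypothetical. The priority helper assumes that software simply walks a list of IRQ numbers in its preferred order.

```python
# Hypothetical model of a 32-bit interrupt-enable mask. `ienable` stands in
# for the IENABLE control register, with one bit per IRQ input (0..31).
NUM_IRQS = 32
MASK32 = 0xFFFFFFFF


def _check(irq: int) -> None:
    if not 0 <= irq < NUM_IRQS:
        raise ValueError("irq must be in the range 0..31")


def enable_irq(ienable: int, irq: int) -> int:
    """Return the mask with the bit for `irq` set."""
    _check(irq)
    return (ienable | (1 << irq)) & MASK32


def disable_irq(ienable: int, irq: int) -> int:
    """Return the mask with the bit for `irq` cleared."""
    _check(irq)
    return ienable & ~(1 << irq) & MASK32


def pending_enabled(ienable: int, asserted: int) -> int:
    """IRQ lines that are both asserted and enabled."""
    return ienable & asserted & MASK32


def highest_priority(pending: int, priority_order):
    """Software-defined priority: first pending IRQ in `priority_order`."""
    for irq in priority_order:
        if pending & (1 << irq):
            return irq
    return None
```

Because the priority is decided in software, changing the order of the list passed to `highest_priority` is all it takes to change which source is served first.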
46. ls if the processor fulfils the IEEE 754 standard or if there were any failures in the implementation.

8.2.4 Control Application

Since Paranoia does not contain any time measuring, a floating-point program that measures the execution time has been executed on both processors. The program is a kind of control application that does a lot of floating-point calculations. This program reveals the performance of the soft-float part on each processor core, both the hardware and the software.

9 Minimum Area

This section and section 10 contain the third part of the thesis work. This section contains the "Minimum Area" configurations and the results of the benchmarks mentioned in section 8.2. Each processor configuration can be seen in section 9.1 below.

9.1 Processor Configurations

Each processor configuration can be seen in table 8 below. For additional information concerning the processors, see section 6.

Table 8: Minimum Area Processor Configurations

PROCESSOR CORE        LEON2    NIOS II
OPTION                                   UNIT
CACHE
INSTRUCTION CACHE
  SIZE                1024     512       BYTES
  ASSOCIATIVITY       1        1         NR OF SETS
  CACHE LINES         32       16        LINES
  BYTES/LINE          32       32        BYTES
  SUB BLOCK SIZE      1        -         BIT / 4 BYTE WORD
  TOTAL LINE SIZE     291      287       BITS
DATA CACHE
  SIZE                1024     512       BYTES
  ASSOCIATIVITY       1        1         NR OF SETS
  CACHE LINES         32       128       LINES
  BYTES/LINE          32       4         BYTES
  SUB BLOCK SIZE      1        -         BIT / 4 BYTE WORD
  TOTAL LINE SIZE     291      55        BITS
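The "CACHE LINES" rows in table 8 follow directly from the cache size and the bytes per line, since all four caches are direct mapped. A minimal Python sketch of that arithmetic is given below; the function names are hypothetical, and only the sizes and line lengths are taken from the table.

```python
def cache_lines(size_bytes: int, bytes_per_line: int) -> int:
    """Number of lines in a direct-mapped cache."""
    if size_bytes <= 0 or bytes_per_line <= 0 or size_bytes % bytes_per_line:
        raise ValueError("size must be a positive multiple of the line length")
    return size_bytes // bytes_per_line


def index_bits(lines: int) -> int:
    """Address bits needed to select one line (lines must be a power of two)."""
    if lines <= 0 or lines & (lines - 1):
        raise ValueError("lines must be a power of two")
    return lines.bit_length() - 1


# (size in bytes, bytes per line) as listed in table 8
MINIMUM_AREA = {
    "LEON2 instruction cache": (1024, 32),
    "NIOS II instruction cache": (512, 32),
    "LEON2 data cache": (1024, 32),
    "NIOS II data cache": (512, 4),
}

for name, (size, line) in MINIMUM_AREA.items():
    n = cache_lines(size, line)
    print(f"{name}: {n} lines, {index_bits(n)} index bits")
```

The index width is not listed in the table; it is included here only to show how the line count translates into address bits.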
47. ne. Its data handling capabilities are better than those of the NIOS II.

With respect to their sizes, it is noticeable that the NIOS II core is vendor optimized. Its source is encrypted, which limits portability, since it can only be used in Altera FPGAs. The LEON2, which has no vendor restriction, fits well in a low-end FPGA like the one used in this work, even though it is not optimized for a certain technology.

To achieve the best performance in embedded systems, the hardware and software have to be designed together. On an FPGA-based platform, the whole system can be re-configured, within the capability of the FPGA, to change its characteristics if the performance is not good enough.

With respect to configurability the LEON2 is the best, providing a multi-set cache system with configurable sizes and "bytes/line", and a variety of cache replacement policies and multipliers. On the NIOS II, only the cache sizes are configurable; the multiplier option depends on the target FPGA used. One NIOS II advantage is that it supports custom instructions, which could speed up applications where a certain task is dominating.

13 Appendix

Appendix A

Program Versions

This section contains the different development tool versions used in this work.

Provider Program Version  SYNPL
48. on byte address has the following fields and sizes for an 8 Kbyte cache:

TAG       30:15
LINE      14:5
OFFSET    4:2
00        1:0

The offset field is 3 bits wide (an 8-word line); the tag and line sizes depend on the cache size. The maximum instruction address size is 31 bits. The instruction cache is permanently enabled and cannot be bypassed.

3.4.2 Data Cache

The data cache has the following features:

Direct mapped implementation
Write back
Write-allocate
4 bytes (one word) per cache line
Configurable size, 512 bytes to 64 Kbytes

The data byte address has the following fields and sizes for a 1 Kbyte cache:

TAG       22:10
LINE      9:2
OFFSET    1:0

The offset field is 2 bits wide; the tag and line sizes depend on the cache size.

In all current NIOS II cores, there is no hardware cache coherency mechanism. Therefore, if there are multiple masters accessing shared memory, software must explicitly maintain coherency across all masters.

3.5 Internal Busses

3.5.1 Avalon On-chip Bus

The Avalon [13] bus is a simple bus architecture designed to connect on-chip processor and peripherals together into a working NIOS II based system. The Avalon is an interface that specifies the port connections between master and slave components; it also specifies the timing by which these components communicate.

The Avalon bus supports advanced features, e.g. latency 
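The data cache address fields listed in section 3.4.2 for a 1 Kbyte cache (tag 22:10, line 9:2, offset 1:0) can be illustrated with a few shifts and masks. This is a sketch in Python for illustration only; the helper names are hypothetical and the field positions are the only values taken from the text.

```python
def bits(value: int, hi: int, lo: int) -> int:
    """Extract the bit field value[hi:lo], both ends inclusive."""
    if hi < lo or lo < 0:
        raise ValueError("need hi >= lo >= 0")
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def split_data_address(addr: int) -> dict:
    """Split a data byte address using the 1 Kbyte data cache layout."""
    return {
        "tag": bits(addr, 22, 10),    # compared against the stored tag
        "line": bits(addr, 9, 2),     # selects one of 256 one-word lines
        "offset": bits(addr, 1, 0),   # byte within the 4-byte word
    }


a = split_data_address(0x1234)
b = split_data_address(0x1234 + 1024)  # same line, 1 Kbyte further on
print(a, b)
```

Two addresses exactly 1 Kbyte apart select the same line and differ only in the tag, which is the situation where a direct-mapped cache has to replace one entry with the other.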
49. ose was to create a short benchmark program representative of integer programming. Its code is dominated by simple integer arithmetic, string operations, logic decisions and memory accesses, intended to behave like a typical computing application. Most of the execution time is spent in library functions.

The Dhrystone result is determined by measuring the average time a processor takes to perform many iterations of a single loop containing a fixed sequence of instructions.

The output from the benchmark is the number of Dhrystones per second and the number of iterations of the main loop per second.

8.2.2 Stanford

The Stanford suite was gathered by John Hennessy and modified by Peter Nye. The version of the suite used is 4.2.

The suite consists of three major program categories:

Recursion
Loop-intensive
Sorting algorithms

All four loop-intensive programs include multiplication; two of these include floating-point arithmetic.

All programs perform a check to make sure each program produces the right output. The time spent doing the check is included in the execution time.

The following ten programs are included:

Perm: Calculates permutations recursively
Towers: Solves the Towers of Hanoi problem
Queens: Solves the Eight Queens Problem fifty times
IntMm: Multiplies two random integer matrices
Mm: Multiplies two random real matrices
Puzzle: A compute-bound program
Quick: Sorts a random array using the Quicksort
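The way the Dhrystone figure is obtained, timing many iterations of a fixed loop and dividing, can be sketched as follows. This is an illustrative Python sketch, not the Dhrystone source: `measure`, `iterations_per_second` and the stand-in loop body are hypothetical names, and `time.perf_counter` is used as the clock.

```python
import time


def iterations_per_second(iterations: int, elapsed_seconds: float) -> float:
    """Rate reported by a loop benchmark: iterations divided by elapsed time."""
    if iterations <= 0 or elapsed_seconds <= 0:
        raise ValueError("iterations and elapsed time must be positive")
    return iterations / elapsed_seconds


def measure(loop_body, iterations: int) -> float:
    """Run `loop_body` `iterations` times and return iterations per second."""
    start = time.perf_counter()
    for _ in range(iterations):
        loop_body()
    elapsed = time.perf_counter() - start
    return iterations_per_second(iterations, elapsed)


if __name__ == "__main__":
    # Stand-in for the fixed instruction sequence of the real benchmark.
    rate = measure(lambda: sum(range(10)), 100_000)
    print(f"{rate:.0f} iterations per second")
```

Running many iterations matters because a single pass is far shorter than the timer resolution; the average over the whole run is what makes the figure reproducible.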
50. ping part of the LEON2 synthesis was done in Synplify Pro 8.0 and the place and route part was done in Quartus II.

For the NIOS II the flow was straightforward: the system was configured, and then the synthesis and "place and route" were done in Quartus II. The resulting netlist was downloaded to the target hardware through Quartus. This was also done for the LEON2.

On the LEON2, the benchmark programs were downloaded to the target hardware through GRMON [20], which connects to the DSU and allows debugging of the system. On the NIOS II, the software downloading was done through the provided IDE.

8 Benchmarking

Benchmarking should be an objective, reproducible measure of performance, for example execution-speed comparisons. It must be meaningful and test something relevant to the user. Benchmarks can also be used to monitor performance changes during development. The benchmarks in this thesis work only cover integer and emulated floating-point performance.

Two important questions should be asked of any benchmarking activity:
How accurately does the benchmark predict real-world performance?
How reliably can a comparison between competing processors be made?

8.1 Benchmarking considerations

When a set of benchmarks is to be executed on several microprocessor architectures, one must keep a few things in mind, regarding:

Which Programming Lan
51.
3.2 Instruction Set Architecture

The Instruction Set Architecture (ISA) is compatible across all NIOS II processor systems. The supported addressing modes are "register-register" and "register-immediate". There is also a possibility to add custom instructions. Multibyte numbers are stored as little endian.

When the processor issues a valid instruction that is not implemented in hardware, an unimplemented instruction exception is generated. The exception handler determines which instruction generated the exception. If the instruction is not implemented in hardware, control is passed to an exception routine that emulates the operation in software. This concerns the multiply and divide instructions.

3.3 Integer Unit

The integer unit (IU) architecture supports a flat register file, consisting of thirty-two 32-bit general purpose registers. Three control registers are also provided. The architecture is prepared for the future addition of floating-point registers.

All instructions take one or more cycles to execute. Some instructions have other penalties associated with their execution. Late result instructions have a two-cycle bubble placed between them and the instruction that uses the result. Instructions that use Avalon transfers are stalled until the transfer is completed.

3.3.1 Pipeline Architecture

The NIOS II/f core employs a 6-stage pipeline, with the following stages:

Instruction Fetch
Instruction Decode
Execute
Memory
Al
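To make the software emulation mentioned in section 3.2 more concrete, the sketch below shows one common way to multiply without a hardware multiplier: shift-and-add. It is an assumption for illustration only, written in Python; it is not the routine the NIOS II exception handler actually uses, and the function name is hypothetical.

```python
MASK32 = 0xFFFFFFFF


def mul32_shift_add(a: int, b: int) -> int:
    """Low 32 bits of a * b using only shifts, adds and bit tests."""
    a &= MASK32
    b &= MASK32
    result = 0
    while b:
        if b & 1:                      # add the shifted multiplicand
            result = (result + a) & MASK32
        a = (a << 1) & MASK32          # next partial product
        b >>= 1                        # next multiplier bit
    return result


print(mul32_shift_add(6, 7))
```

The loop runs once per significant bit of the multiplier, which is why an emulated multiplication costs tens of cycles where a hardware multiplier needs only a few.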
52. ses.

2.5 Internal Busses

2.5.1 AMBA

The processor has a full implementation of the AMBA 2.0 [10] AHB and APB on-chip buses. The APB bus is used to access on-chip registers in the peripheral functions, while the AHB bus is used for high-speed data transmission.

A flexible configuration scheme makes it simple to add new IP cores. A more detailed description of the internal buses can be seen in section 4.

2.5.2 AHB Bus

LEON2 uses the AMBA 2.0 AHB bus to connect the processor cache controllers to the memory controller and other high-speed units. In the default configuration, the processor is the only master on the bus, while there are two slaves: the memory controller and the APB bridge.

2.5.3 APB Bus

The APB bridge is connected to the AHB bus as a slave and acts as the (only) master on the APB bus. Most on-chip peripherals are accessed through the APB bus, e.g. UART, I/O, Timer, IrqCtrl.

A detailed bus overview of how the peripherals are connected can be seen in the figure on page 2.

2.6 Memory Interfaces

The memory interface provides a direct interface to PROM, memory mapped I/O devices, static RAM (SRAM) and synchronous dynamic RAM (SDRAM). The different controllers can be programmed to 8, 16, 32 or 64 bits data width. Chip select decoding is done for two PROM banks, one I/O bank, five SRAM banks and two SDRAM banks. The external memory bus is controlled by a programmable memory 
53. sses
2.5.1 AMBA
2.5.2 AHB Bus
2.5.3 APB Bus
2.6 Memory Interfaces
2.6.1 SRAM
2.6.2 PROM
2.6.3 I/O Devices
2.6.4 SDRAM
2.7 System Interfaces
2.7.1 UART
2.7.2 Ethernet MAC
2.8 Additional Units and Features
2.8.1 Debug Support Unit
Watchdog
Interrupt Controller
Parallel I/O Port
2.9 Co-processors
2.9.1 FPU
2.9.2 GRFPU
2.9.3 Generic Co-processor
2.10 Memory Management Unit
2.10.1 Translation Look 
54. terface to the GRFPU available from Gaisler Research and the Meiko FPU core from Sun Microsystems.

2.9.2 GRFPU

The GRFPU operates on single- and double-precision operands, and implements all SPARC V8 FPU instructions. It is interfaced to the LEON2 pipeline using a LEON2-specific FPU controller (GRFPC). The control unit allows FPU instructions to be executed simultaneously with integer instructions. The integer pipeline is stalled only in case of a data or resource dependency.

2.9.3 Generic Co-processor

LEON2 can be configured to provide a generic co-processor. The interface allows execution in parallel with the integer unit (IU). One co-processor instruction can be started each clock cycle if there is no data or resource dependency.

2.10 Memory Management Unit

The optional Memory Management Unit (MMU) implements a SPARC V8 reference MMU and allows usage of robust operating systems such as Linux. The MMU can have separate instruction and data Translation Look-aside Buffers (TLB), or a common one. The TLB is configurable for 2-32 fully associative entries. When the MMU is disabled the caches operate as normal. When it is enabled, the cache tags store the virtual address and also include an 8-bit context field.

2.10.1 Translation Look-aside Buffer

The MMU can be configured to use a shared TLB; the number of TLB entries can be set to 2-32. The organization of the TLB and the number of entries are not visible to the software and operating s
55. ts: processor architecture and system analysis, implementation on an FPGA, and benchmarking. Two different processor configurations have been compared and evaluated: minimum area and maximum performance. Both configurations have been executed at two different frequencies, 25 MHz and 50 MHz. The benchmarks used in this work are Dhrystone, Stanford, Paranoia and a typical control application; the execution results are discussed for each configuration.

Licensing and Availability

This section contains an evaluation of the respective license forms. The full LEON2 VHDL source code is available under the GNU LGPL [5] license, which allows free and unlimited use of the processor core and peripherals. Since it is open source, it is not restricted to a certain technology. LEON2 based systems can be implemented in both FPGA and ASIC. The full LEON2 source code is available through the Gaisler Research homepage [6].

All NIOS II development kits include a perpetual no-cost license [7] to develop and ship systems using the processor core and peripherals in an Altera FPGA. An implementation as an ASIC is also possible. The NIOS II is distributed as an encrypted VHDL file.

1 Initial Architecture Analysis

This section contains the first part of the work. In section 2 the LEON2 is described and analyzed; in section 3 the same kind of description and analysis is done for the NIOS II
56. uded in hardware in this configuration.

Regarding the next seven programs (Intmm, Mm, Puzzle, Quick, Bubble, Tree and FFT), the difference is obvious, since they all contain a lot of data to be processed, where the bigger cache system is an advantage.

Concerning the matrix multiplication programs (Intmm and Mm) and Puzzle, which are loop intensive, the speed-up is caused by the hardware multiplier and the bigger caches. With bigger caches, temporal and spatial locality are exploited better: it is more likely that the desired data or instruction is already in the respective cache, so that no fetch from the main memory is needed.

The hardware multipliers reduce the execution times almost threefold. The NIOS II multiplier in particular performs very well, even though it has a much higher latency than the LEON2 multiplier (thirteen cycles compared with five on the LEON2).

Regarding the sorting algorithms (Quick, Bubble and Tree), which also take advantage of the big cache system, the performance has increased. The equal execution times of the Quick and Bubble sort algorithms on the LEON2 at 50 MHz probably depend on the data which the random function generates; otherwise the Quicksort algorithm would be the fastest one, as it is on the NIOS II and on the LEON2 in the "Minimum Area" section.

The Tree sort algorithm is deeply recursive, which causes register window overflo
57. ults of the benchmarking, their integer performance is quite equal, despite the cache size differences. When dealing with emulated floating-point applications, the LEON2 is faster, taking advantage of its write buffers and the bigger cache system. In the NIOS II case, the load delay of two cycles affects the performance negatively by stalling the pipeline. If a stall occurs many times, especially in floating-point emulation combined with a relatively small cache system, its performance will drop.

In the "Maximum Performance" section, the aim was to configure both processor cores to achieve as high performance as possible. Multiplication and division are performed in a hardware-based multiplier and divider, respectively. Their sizes and latencies have been configured to gain as high overall performance as possible. The NIOS II has the best integer performance, especially on Dhrystone, which contains a fixed sequence of instructions where the whole sequence more or less fits in the instruction and data caches.

The bigger cache system improves the performance by improving the cache hit rate, which on such small applications as Dhrystone and Stanford is almost 100% on both processors. The benchmark results will then be representative of integer performance rather than of the overall system performance. In the emulated floating-point application part, the LEON2 once again is the fastest o
58. ve
9 Minimum Area Synthesis Results
10 Dhrystone Results - Minimum Area
11 Stanford Results - Minimum Area
12 Control Application Results - Minimum Area
13 Maximum Performance Processor Configurations
14 Maximum Performance Synthesis Results
15 Dhrystone Results - Maximum Performance
16 Stanford Results - Maximum Performance
17 Control Application Results - Maximum Performance

Objectives

Today, synthesizable processor cores are becoming common for embedded microprocessor-based applications where high performance is required. Because Field Programmable Gate Arrays (FPGAs) are becoming bigger and faster, they can contain a complete microprocessor-based system. The first synthesizable processor cores were 8-bit and showed up in the late 1990s; now there are 32-bit processor cores available. In this context, it is of interest to make a comparative analysis of synthesizable processor cores from different providers.

In this thesis work, two synthesizable processor cores have been compared: the LEON2 [1], which is a SPARC V8 compatible processor core developed by Gaisler Research [2], and the NIOS II [3], which is developed by Altera [4].

The work consists of three major par
59. w on the LEON2: most of the execution time is spent in the trap routines, whereas the NIOS II handles the recursive part very well. The difference in the FFT program is only about 4%. One reason could be the difference in multiplier latency, since FFT is not as multiplication intensive as the Mm program, where the difference is about 10%.

Overall, when comparing the composite sum, the NIOS II is roughly 30% better. The cache system is big enough to contain all necessary instructions and data, since a majority of the programs are loop-intensive integer programs. The LEON2, on the other hand, has the best floating-point performance, but its lead is not as big as in the minimum area section.

10.3.3 Control Application

As a complement to the floating-point programs in the Stanford benchmark set, the Control Application was executed to reveal differences in emulated floating-point performance. Table 17 below shows the result of the execution of the Control Application.

Table 17: Control Application Results - Maximum Performance

PROCESSOR CORE         LEON2     NIOS II    LEON2     NIOS II
FREQUENCY              25 MHz    25 MHz     50 MHz    50 MHz
PROGRAM                                                          UNIT
CONTROL APPLICATION    200       293        107       141        SEC

Control Application Result Comments

This time, the floating-point emulation performance has increased, benefiting from the bigger cache system and the hardware-based multiplier and divider. W
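A relative difference between two execution times can be expressed in two ways, and the choice changes the percentage. The Python sketch below applies both definitions to the values of table 17; the function names are hypothetical, and only the four execution times come from the table.

```python
# Execution times in seconds from table 17, keyed by clock frequency in MHz.
TABLE_17_SECONDS = {
    25: {"LEON2": 200, "NIOS II": 293},
    50: {"LEON2": 107, "NIOS II": 141},
}


def time_saving(fast: float, slow: float) -> float:
    """Fraction of the slower run time that the faster core saves."""
    return (slow - fast) / slow


def extra_time(fast: float, slow: float) -> float:
    """Fraction by which the slower core exceeds the faster run time."""
    return (slow - fast) / fast


for mhz, row in TABLE_17_SECONDS.items():
    fast, slow = row["LEON2"], row["NIOS II"]
    print(f"{mhz} MHz: LEON2 saves {time_saving(fast, slow):.1%}, "
          f"NIOS II needs {extra_time(fast, slow):.1%} more time")
```

With the table 17 values, the LEON2 time saving comes out at roughly 32% at 25 MHz and roughly 24% at 50 MHz, while the extra time needed by the NIOS II is roughly 46% and 32% respectively.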
60. wide instruction set and few addressing modes, "Register-Register" and "Register-Immediate". Multibyte numbers are stored as big endian.

2.3 Integer Unit

The LEON2 integer unit implements the full SPARC V8 standard, including all multiply and divide instructions. The implementation is focused on portability and low complexity.

The number of register windows is configurable within the limit of the SPARC standard (2-32); 8 is the default. The total number of registers by default is 136. Separate instruction and data cache interfaces are provided (Harvard Architecture). The LEON2 is provided with a branch delay slot; more information concerning the delay slot feature can be found in section 5.1.

2.3.1 Pipeline Architecture

The LEON2 integer unit uses a single instruction issue pipeline with 5 stages. The stages can be seen below:

Instruction Fetch
Instruction Decode
Execute
Memory
Write Back

The LEON2 pipeline is stalled until the operation is completed if one of these conditions occurs:

Multi-cycle instruction
Load or store from the memory (SRAM or SDRAM)

2.3.2 Multiply and Divide Options

The LEON2 has a variety of multipliers available. Table 1 below shows the LEON2 multiplier options.

Table 1: LEON2 Multiply Options

Configuration              Result latency (cycles)
32 x 32                    1
32 x 16                    2
32 x 8                     4
16 x 16                    4
16 x 16 + PIPELINE REG     5
ITERATIVE                  35
EMULATED IN SOFTWARE       40
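The default register count of 136 quoted in section 2.3 follows from the SPARC V8 register window scheme: 8 global registers plus 16 registers per window (8 locals and 8 ins, with each window's outs shared with the neighbouring window's ins). A short Python sketch of that arithmetic, with a hypothetical function name:

```python
GLOBAL_REGISTERS = 8
REGISTERS_PER_WINDOW = 16  # 8 locals + 8 ins; outs overlap the next window's ins


def total_registers(windows: int) -> int:
    """Size of a SPARC V8 integer register file with `windows` windows."""
    if not 2 <= windows <= 32:
        raise ValueError("SPARC V8 allows 2 to 32 register windows")
    return GLOBAL_REGISTERS + REGISTERS_PER_WINDOW * windows


print(total_registers(8))  # default LEON2 configuration
```

Each extra window adds 16 registers to the file, so the window count is a direct trade-off between area and how deep a call chain can go before a window overflow trap is taken.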
61. xecuting a program like the Control Application will reveal the total system performance, since the processor then has to work under a high load for a longer time.

10 Maximum Performance

This section contains the last part of the thesis work. This part contains the "Maximum Performance" configurations and the results of the benchmarks mentioned in section 8.2.

10.1 Processor Configurations

Each processor configuration can be seen in table 13 below. For additional information concerning the processors, see section 6.

Table 13: Maximum Performance Processor Configurations

PROCESSOR CORE                 LEON2         NIOS II
OPTION                                                      UNIT
CACHE
INSTRUCTION CACHE
  ASSOCIATIVITY / SET SIZE     2 / 4096      1 / 8192       NR OF SETS / KBYTES
  CACHE SIZE                   8192          8192           BYTES
  REPLACEMENT POLICY           LRU           N/A
  CACHE LINES                  256           256            LINES
  BYTES/LINE                   32            32             BYTES
  SUB BLOCK SIZE               1             -              BIT / 4 BYTE WORD
  TOTAL LINE SIZE              294           278            BITS
DATA CACHE
  ASSOCIATIVITY / SET SIZE     2 / 4096      1 / 8192       NR OF SETS / KBYTES
  CACHE SIZE                   8192          8192           BYTES
  REPLACEMENT POLICY           LRU           N/A
  CACHE LINES                  256           2048           LINES
  BYTES/LINE                   16            4              BYTES
  SUB BLOCK SIZE               1             -              BIT / 4 BYTE WORD
  TOTAL LINE SIZE              155           55             BITS
MEMORY CONTROLLER
  SRAM                         1             1              MBYTE
ALU
  MULTIPLIER SIZE (LATENCY)    16x16 (5)     32x4 (11+2)
  DIVIDER SIZE (LATENCY)       64/32 (35)    32/32 (N/A)

Configuration Comments

Both processor cores have been configured to achieve as high performance as possible. The
62. ystem; modifications are therefore not required.

3 NIOS II

There are three versions of the NIOS II [12] processor core available: one with a single pipeline stage and no cache (NIOS II/e), one with five pipeline stages and an instruction cache (NIOS II/s), and one with six pipeline stages and both instruction and data caches (NIOS II/f). In this thesis the focus is on the NIOS II/f core, since it is the most extensive of the available NIOS II cores.

The processor architecture, cache structure, Instruction Set Architecture, peripherals and configuration options are described below.

3.1 System Overview

The NIOS II processor is a general purpose, single issue RISC processor core providing:

Full 32-bit instruction set, data path and address space
32 General Purpose Registers (Flat register file)
32 External Interrupt Sources
Barrel Shifter
Avalon System Bus
Instruction and Data Cache Memories (Harvard Architecture)
Access to On-chip Peripherals, and Interfaces to Off-chip Peripherals and Memories

The core is provided as an encrypted VHDL file.

A typical NIOS II system can be seen in figure 2 below.

[Figure 2: NIOS II System Overview. Blocks shown: JTAG Debug Module, NIOS II Processor Core, General Purpose I/O, Ethernet MAC/PHY Interface, SDRAM Controller, Avalon Switch Fabric, Tristate Bridge]
    