Home
        An FPGA-based Multi-Core Platform for Testing and Analysis of
         Contents
1.   At  the time of the experiments we did not have access to more  modern Altera boards such as the DE2 115  Since then  the  ZA4800 design has been ported  with very minimal effort  to  newer systems with large DDR2 memories  thus we could  now run most of the SPEC CPU2000 and 2006 benchmarks   Nevertheless  the results here show the capabilities of a single   CPU Z4800 system  The 147 vortex benchmark did not  function correctly  so no results for it have been included    The original SPEC95 scripts did not work cleanly on  the modern Debian Linux distribution that was ported to  the FPGA  New scripts were written to control execution  of the benchmarks  These new scripts also incorporated the  necessary logic to seperate benchmark results for each distinct  hardware configuration under test  Some of the C code in the  benchmarks also needed minimal fixes to properly build with  modern GCC  version 4 4     To produce the benchmark results  z48perf was config   ured to sample the hardware counters every 5 seconds  Data  was not collected during the first polling interval of each  benchmark run  this cuts off the initial latency of program  startup over the network based root filesystem  The bench   marks were only allowed to run for up to 55 seconds  each   after which z48perf kills the benchmark process and moves  on to the next  This cuts the time for a run through all the  benchmarks down to 10 minutes     4The Cyclone II is a relatively low cost FPGA family  Experiments w
2.  Krste Asanovic   Ramp  Research accelerator for multiple processors  In In Proceedings  of Hot Chips 18  2006    Matt T  Yourst  Ptlsim  A cycle accurate full system x86 64 microar   chitectural simulator  In in ISPASS 07  2007           
3.  Register Write  RW   In the RW stage  the register file  write ports are driven  Due to internal cross port latency of the  FPGA   s embedded RAMs  data written here will not be visible  at the read ports until 2 cycles later  Bypassing is required to  hide this latency    8  Bypass  BP   The BP stage contains very little logic  Its  only purpose is to hide the cross port latency of the register  file     C  Z4800 Register File    The Z4800   s register file is somewhat difficult to implement  efficiently in an FPGA  2 write and 4 read ports are needed to  satisfy 2 MIPS instructions per cycle  Additionally  an optional  5th read port should be available for debug purposes    One obvious approach to take would be to time multiplex  the register file  With a single dual port memory block  the  clock would have to multiplied by a factor of 4 to handle the  7 total read write operations per pipeline cycle    The problem with time multiplexing is that it is difficult to  control timing on an FPGA  The address input of the RAM  must set up 4 times per pipeline cycle  and must not violate  the hold time requirement of the RAM input register  This also  means a wide mux must select from 1 of 4 addresses  at the  high speed clock rate  Further  the high speed clock must be  phase synchronized with the pipeline clock  This means that  the pipeline clock must be divided from the register clock   and the relative phase of these two clocks must be preserved  when they are routed around 
4.  compare the simulated  output to  The design uses off chip devices such as memories   which need accurate models if simulated behavior is to match  the real hardware  Using such models and verifying that the  models and stimulus are themselves accurate is not a trivial  task    Second  there was little to be gained from software simula   tion of the design  Altera   s Quartus II software has a built in  logic analyzer tool  Signaltap II  This tool makes it possible  to inspect any internal signals in the design  Hardware is  automatically generated to use FPGA resources to capture   buffer  and transfer the data to the host PC over a USB  JTAG cable  The software will draw timing diagrams that can  be directly inspected  Effectively  this tool yields the same  visibility of internal signals that the functional simulator offers   free of the concerns of simulation accuracy and speed    The debug hardware in the Z4800 design was designed to  be used in conjunction with the Signaltap II logic analyzer   The debug hardware is designed to be extremely simple and  reliable  since ultimately one must trust the debugging tools   All debug hardware is optional  it can be disabled at compile   time to save FPGA resources and slightly increase clock speed    The Z4800   s debug hardware is specifically designed in such  a way that it cannot influence the processor   s behavior except  in explicit circumstances  In other words  the debug hardware  does not influence any RTL interactions
5.  of the rest of the  processor  The only exception to this rule is the hardware  debug module   s control of the main pipeline stepping and  master reset    1  Integrated CPU Debug Module  The main debug tool is  the debug module integrated into the Z4800 core  This module  provides control of run halt  reset  and single stepping  It also  provides a single hardware breakpoint and register view    The debug module itself appears as a memory mapped I O  peripheral  The Z4800 prototype system for the DE2 70 board  has a physical I O connection for a 2nd DE2 70 board  This  allows the address spaces of both boards to be unified  The  2nd  debugger  DE2 70 board can access the debug hardware   main memory  and all peripherals on the target DE2 70 board   A block diagram of the hardware setup is shown in Figure 3    The debugger board itself contains an Altera Nios2 CPU  running a full Linux Nios2 OS  On this Nios2 CPU  a C  program called z48d is used to control all I O with the target  board and provide a user interface  The z4 8d program accepts  a few simple commands  such as run  halt  Reset  step  to                                                                                                                                                                      Nios2 CPU 2   Z4800 CPU 2       2    Z  2 z   2 2   2     i y Linux mips Trace F  Linux Nios2 kernel kernel buffer  S  Debug M  2      amp  n 2  NUMA bridge NUMA bridge 3 3 m E  z48d 4 f Zw   software  L S   ei Perfc
6.  with larger caches at the expense of FPGA resource  consumption  The design is modularized  so that individual  components can be replaced and redesigned to test new archi   tectural ideas  The design also has extensive instrumentation  a  large number of performance counters  currently 80  are used     The Z4800 implements a 32 bit subset of the MIPS   R4000 ISA  Most  MIPS implementations use R as their first letter  so we have used Z  Other  MIPS implementations include the R4000    R4400     and so on  We  arbitrarily pick the name Z4800 because it is capable of running code compiled  for R4xxx processors  including kernel code  MIPS  is a registered trademark  of MIPS Technologies  Inc   and the Z4800 is not endorsed by nor associated  with MIPS Technologies  Inc     to simultaneously measure a wide range of statistics in real   time  It also provides a fully automated hardware synthesis  and benchmarking system and an extensive debug interface  It  promotes simplicity  ease of use and adaptability    The Z4800 is synthesizable on many Altera FPGAs  on the  Altera DE2 115  a  300 FPGA board based on an inexpensive  Cyclone IV E chip  a 4 core system is quite practical  Larger  systems with 16 or more cores are possible in high end Stratix  FPGAs     A  Design Philosophy    The Z4800 takes on a somewhat unique design philosophy   The central idea is that functionality and ease of design are  more important than other design metrics  area  speed  power   etc   Instead of 
7. 0 processor features a 2 issue in order integer  pipeline  It can execute most instructions supported by the  MIPS R4000  except 64 bit and floating point operations   The Z4800 includes a full featured TLB which maintains  compatibility with the R4000 TLB  The implemented ISA is  sufficient to run unmodified GNU Linux distributions such as  Debian    The Z4800 processor aims for high IPC at moderate clock  rates  and emphasizes simplicity and modularity when rea   sonable  This is done at the expense of area  and to a lesser  extent  clock speed  The Z74800 is superscalar with symmetric  in order pipelines  The number of stages is configurable at  compile time  The overall pipeline organization is shown in  Figure 1    1  Instruction Fetch and Instruction Queue  The Z4800  includes hardware which prefetches instructions along the  predicted path  An instruction queue seperates the frontend  from the rest of the pipeline and allows the frontend to run  ahead along the predicted code path while the backend is  stalled  The frontend is capable of prefetching instructions  at a rate of two per cycle  even across cacheline and page  boundaries  This serves to hide some of the Icache and ITLB  miss penalties  The frontend pre decodes instructions and is  able to predict both taken and not taken branches with 0  penalty cycles  Fetched instructions are written into the Iqueue   instruction queue   which is drained by the DG  Decode and  Group  stage    2  Speculative Exception Hand
8. 28 entries  Varying the number of ITLB  sets between 2 and 128 costs the same amount of hardware  and runs at the same speed  so it is still reasonable to have  a large ITLB  A direct mapped 128 entry ITLB would be a  good choice    The result is slightly different for the DTLB  A 2 way  DTLB does show a performance boost of 4 0  over a direct     iqueue_16          Eee    eee  GOE  as SS  ed D  ed D    itlb_64ent_dm  p     iqueue e ny  reedice    no     e_fast miss    no_cascade    static_bp    itlb_32ent dam D   no fetch bto SSE    no_fetch_btb  no_fetch    no_clever_flush EE   iqueu    no_fast_mis     static_frontend_b    noi    variants for SPEC CPU95 integer benchmarks    mapped DTLB of the same size  LRU replacement is indis   tinguishable from Random  the difference is within the RMS  error   and it requires memory blocks  so there is no reason to  use LRU  However  since the DTLB is typically on the critical  path  the effect on clock speed should be taken into account   If the direct mapped DTLB design has at least 4 0  higher  clock speed  it is faster than a 2 way DTLB     In practice  it seems that using a direct mapped DTLB re   sults in around 5 15  higher clock speed  This value is highly  dependent on device congestion and random fitter placement  results so it is hard to quantify  This benchmark experiment  was not designed to take the influence of differing clock speeds  into account  It does  however  seem quite reasonable to use a  direct mapped DTLB  as d
9. An FPGA based Multi Core Platform for Testing  and Analysis of Architectural Techniques    Willard G  Simoneau and Resit Sendag  Department of Electrical  Computer and Biomedical Engineering  University of Rhode Island  Kingston  RI 02881  simoneau sendag  ele uri edu    Abstract   This paper covers the design and FPGA based  prototyping of a full featured multi core platform for use in  computer architecture research studies  Existing platforms for  performing studies include software simulators and hardware   assisted simulators  but there are no modular full hardware  platforms designed to measure a wide range of performance  metrics  Such a platform  using HDL synthesis onto an FPGA   can run orders of magnitude faster than software based solutions  at the cost of having less flexible processor configuration and  implementation  This paper presents an end to end solution  from  bottom level hardware design all the way to automated results  collection and analysis  which can be used with inexpensive  commodity hardware     I  INTRODUCTION    In computer architecture  there is a growing need to evaluate  the performance and cost of new  complex architectural ideas   Currently  most evaluation is done with simulations  1    2     3    9    11    20   but these simulators are typically quite  slow  It is difficult to get accurate results quickly  Conse   quently  there has been research into hardware acceleration  of these simulators using FPGAs  field programmable gate  arra
10. Emer  Hasim  Fpga based high detail multicore simulation  using time division multiplexing  In HPCA  pages 406 417  IEEE  Computer Society  2011    Michael Pellauer  Muralidaran Vijayaraghavan  Michael Adler  Arvind   and Joel Emer  Quick performance models quickly  Closely coupled  partitioned simulation on fpgas  In Proceedings of the ISPASS 2008    IEEE International Symposium on Performance Analysis of Systems and  software  pages 1 10  Washington  DC  USA  2008  IEEE Computer  Society    RM7000 microprocessor with on chip secondary cache data sheet  jan  2001  accessed 02 2011    Taeweon Suh and Hsien hsin S  Lee  Initial observations of hard   ware software co simulation using fpga  In in Architecture Research   2nd Workshop on Architecture Research using FPGA Platforms  2006   Zhangxi Tan  Andrew Waterman  Rimas Avizienis  Yunsup Lee  Henry  Cook  David Patterson  and Krste Asanovi    RAMP gold  an FPGA   based architecture simulator for multiprocessors  In Proceedings of the  47th Design Automation Conference  DAC   10  pages 463 468  New  York  NY  USA  2010  ACM    Zhangxi Tan  Andrew Waterman  Henry Cook  Sarah Bird  Krste  Asanovi    and David Patterson  A case for FAME  FPGA architecture  model execution  In Proceedings of the 37th annual international  symposium on Computer architecture  ISCA    10  pages 290 301  New  York  NY  USA  2010  ACM    John Wawrzynek  Mark Oskin  Christoforos Kozyrakis  Derek Chiou   David A  Patterson  Shih lien Lu  James C  Hoe  and
11. VAO  y piru    y    ly _   icache_dcache_16k_4wa       y  icache_4k_dam E    dceache_2c     bh ey    dcache_16k_4wa     dcache_8k_4wai    queue 28 EEX ZZ    btb_no_ta     dcache_16k_4way    dcache_8k_4wa     VON NWN _    EEE  deache_4k_dm      dtlb_64ent on EL      aigned_frontend  GS TETEOEFEOFOFO SSS    icache_16k_4wa       dcache_no_earl  icache_8k_4wa   icache_8k_4waj    icache_16k_4waj    dcache_2c  icache_dcache_16k_4wa     Fig  5  CPU Performance comparison of all    change alone reduces IPC by 3 5   If the frontend is con   figured to use only static prediction  thus removing the BTB   RAS  and 2 bit predictor from the fetch stage   the difference  in IPC is 8 5     We also can see the impact of varying Iqueue sizes  En   larging the Iqueue to 128 instructions yields a small gain of  0 70  IPC  while reducing its size to 8 instructions yields an  8 6  IPC reduction  It is probably wisest to use Iqueue depths  in the range of 16 64 in practical configurations  The baseline  configuration of 32 instructions is a good choice    It is interesting to note that the L1 ITLB size and associa   tivity has essentially no impact on performance over the tested  range  Going from the baseline 64 entry 2 way ITLB to a 32   entry direct mapped ITLB reduces IPC by only 1 0   Note  that the R4000 used only a 2 entry fully associative ITLB   the instruction stream has very good locality  Despite this  result  we should also recognize that the FPGA   s M4K memory  blocks contain up to 1
12. ding all comments and whitespace  Much of that  code is simply signal routing between modules  The core   s  logic is instantiated using only a few thousand lines of code   More details are shown in Table I  The compilation of the  design is also very fast  Using a reasonable host system  it  takes about 11 minutes to compile a fully functional system  with 1 core  15 minutes for 2 cores  and 40 minutes for 4    cores           These are worst case cold compile times including all phases  analysis   synthesis  fitting  timing analysis  targetted for the DE2 115 board  Incre   mental compilation and reduced cache TLB configurations will improve on  these compile times  The 4 core design begins pushing the limits of the FPGA   so its compile time is much longer  If a larger FPGA is used  compilation  time can be greatly decreased since less place route effort is required  The  host machine used here was a Dell Poweredge 1950 with a pair of quad core  Xeon X5460 CPUs and 32GB RAM        Frontend pipeline 394             Decode group logic 1200  Backend pipeline  pipeline  ALU mul div  Coprocessor 0  1602  LI L2 caches  cache controllers  and glue logic 2304       Replacement Algorithms  LRU  Pseudo LRU  Random  Hybrid  528                L1 TLBs  JTLB 855  Debug features  not necessary for running CPU  474  Performance counters 60  TOTAL 10090                TABLE I  Number of lines of VHDL code for Z74800  includ   ing optional debug modules     B  Microarchitecture    The Z480
13. e results for all benchmarks on each  hardware variant    Because the experiments are fully automated  no human  intervention is required  One can determine what variants are  to be tested  enter them into the parametric generation script   s  configuration file  enter a 1 line command to start everything   and walk away  In our experiments  given in Section VI   hundreds of runs completed their benchmarks without hanging    or crashing over a period of 57 hours  this is a testament to  the reliability of the Z4800 prototype design     VI  AN EXAMPLE STUDY ON THE Z4800    Several example systems have been designed for the Al   tera Terasic DE2 70 evaluation board  This board contains a  Cyclone II 2C70 FPGA and many useful peripherals   64MB  SDRAM  2MB SSRAM  10 100 Ethernet  etc  Priced at  269   it has a very rich featureset for its cost    The example systems include 1 or 2 CPUs and an optional  L2 cache  For a 2 CPU system on the Cyclone II 2C70 FPGA   the maximum CPU clock frequency achieved is typically 41   44 MHz  which makes it practical to set the main PLL for  40MHz   Turning on fitter optimizations such as physical  synthesis does not provide slack improvements and vastly  increases the compilation time  1 5 3x longer      A  Benchmark Setup    In our experiments  we used an Altera DE2 70 FPGA board  running a single CPU design  Since the DE2 70 only has  64MB RAM  we were not able to run SPEC CPU2000 or 2006  benchmarks  we used SPEC95 integer benchmarks instead
14. es  If the  exception does not match the predicted exception  the test  stops  This aggressively tests the datapath  memory hierarchy   TLBs  and pipeline flushing on exception entry exit    The test coverage of this fuzz tester is not perfect  Errors  will go undetected if they do not influence the exceptions    that are taken  However  even with imperfect coverage  we  can expect that many errors will eventually be uncovered by  random chance  In practice  a single 8MB program run can  reliably detect subtle processor bugs  Most observed failures  occured in the first 256K of the program  although rigorous  statistics were not recorded during development    The z48d program can run fuzz testing in a fully automatic  mode  looping for hours or even longer  generating a unique  instruction stream for each iteration  Late in development  the  prototype ran over 1700 8MB program fuzz tester iterations   no errors were detected     V  AUTOMATED HARDWARE SYNTHESIS AND  BENCHMARKING SYSTEM    The Z4800 provides a fully automatic experimentation sys   tem to facilitate the use of the FPGA platform  The automation  handles both hardware synthesis and benchmarking  It enables  running a large number of experiments with no human in   tervention  It also allows scaling to arbitrarily large parallel  FPGA farms    The automated hardware synthesis and benchmarking sys   tem is shown in Figure 4  The automation is started by running  a parallel make command on the host to initiate synthes
15. focusing on maximizing performance per logic  block and micro optimization  the Z4800 instead uses high   level RTL  register transfer level  constructs  Synthesized area  is traded for easier RTL  modularity  and more aggressive  design on an architectural level  The resulting design runs at  moderate clock rates  but it can easily be modified to study  various architectural enhancements     Blocks such as the cache controllers and TLBs  translation  lookaside buffers  are split into seperate VHDL  VHSIC  very  high speed integrated circuit  hardware description language   entities so that they can be rewritten and replaced indepen   dently  However  modularization does not extend much be   yond this level  This is beneficial because over modularization  would obfuscate the design  imagine trying to make sense of  a design in which every gate and flip flop was instantiated  seperately  One must consider the human writing the HDL  as part of the design process  The time and effort of hand   optimizing the HDL could be the difference between a design  that functions and one that does not  tolerating higher logic  consumption results in a bigger  slower  but still functional  design  Besides  with an appropriately modularized design   one can always return to optimize the problematic parts of  the design once absolute correctness is achieved    The ultimate proof of this design philosophy is in the results   The fully functional processor is about 10K lines of VHDL  code  inclu
16. h Keefe  and Hari Angepat   FPGA Accelerated Simulation Technologies  FAST   Fast  Full System   Cycle Accurate Simulators  In Proceedings of the 40th Annual  IEEE ACM International Symposium on Microarchitecture  MICRO 40   pages 249 261  Washington  DC  USA  2007  IEEE Computer Society   Eric S  Chung  Michael K  Papamichael  Eriko Nurvitadhi  James C   Hoe  Ken Mai  and Babak Falsafi  ProtoFlex  Towards Scalable   Full System Multiprocessor Simulations Using FPGAs  ACM Trans   Reconfigurable Technol  Syst   2 15 1 15 32  June 2009    J  Emer  P  Ahuja  E  Borch  A  Klauser  Chi Keung Luk  S  Manne   S S  Mukherjee  H  Patil  S  Wallace  N  Binkert  R  Espasa  and T  Juan   Asim  a performance model framework  Computer  35 2  68    76  feb  2002   Joe Heinrich  MIPS R4000 microprocessor user   s manual  second  edition  1994  accessed 02 2011    Milo M  K  Martin  Daniel J  Sorin  Bradford M  Beckmann  Michael R   Marty  Min Xu  Alaa R  Alameldeen  Kevin E  Moore  Mark D  Hill  and  David A  Wood  Multifacets general execution driven multiprocessor  simulator  gems  toolset  SIGARCH Comput  Archit  News  33 2005   2005    M  Pellauer  M  Vijayaraghavan  M  Adler  Arvind  and J  Emer  Quick  performance models quickly  Closely coupled partitioned simulation on  fpgas  In Performance Analysis of Systems and software  2008  ISPASS  2008  IEEE International Symposium on  pages 1    10  april 2008   Michael Pellauer  Michael Adler  Michel Kinsy  Angshuman Parashar   and Joel S  
17. host out   D hw_config_name     Processes raw data  generates  datafiles and gnuplot commands       result 0   Runs result 1     result N 1        util host z48report  benchmark0           gnuplot                   benchmarkN 1     tosis ae    z48perf                Fig  4  Diagram of automated hardware synthesis benchmarking system     The FPGA boards themselves were configured to boot  prebuilt Linux kernels from on board Flash memory  The  boards automatically mount their root filesystems over the  network  using the nfs network filesystem  and run the list  of benchmarks  Overall automation is achieved by the use of  further scripts running on a host PC     B  Hardware Variants and Automation    The Z4800 is heavily parameterized with 77 top level  configuration options  These options include pipeline options   trade clock speed vs  IPC   branch prediction options  and  cache TLB options  size  line size  associativity  replacement     47 hardware variants were parametrically auto generated to  test the impact of many interesting CPU core configuration  options  Auto generation of the HDL for every variant was  done by a script  configured by a single simple configuration  file  in less than a minute    These 47 variants were synthesized in parallel  8 at a time   on a Dell Poweredge 1950 server with 8x3 16GHz Xeon  X5460 CPUs and 32GB RAM  The command to initiate the  parallel synthesis was as simple as make  j8 due to the  level of automation and integration offered by 
18. is   The farmer script watches the output directory for completed    sof FPGA images  and feeds the filename of each completed  image to the runner program  The runner program feeds  each image to a physical FPGA via the run script whenever  a new image is available and an FPGA is idle  Each time  the FPGA   s programming completes  it automatically boots  its kernel from on board Flash memory  and then executes the  benchinit script directly  This script brings up a minimal  Unix environment  and then executes each benchmark listed in  benchlist pl  A helper program  z48perf  that wraps  the execution of the benchmark being run must read the  performance counters and output the data into result files   The result files are written using NFS  Network FileSystem   over Ethernet to the host PC  Later  the user can run the  z48report script to process the results and procedurally  generate all desired dataplots    The z48report script is written to generate arbitrary  plots of arbitrary performance measures  each derived from  any value one can compute from the performance counter val   ues  The script generates plots in PDF format using gnuplot   an open source command line plotting program  Internally it  uses a hierarchy of Perl hashes  associative arrays   Transfor   mations are applied to this tree of hashes  turning its indices  inside out  which makes it straightforward to do operations  such as generation of a synthetic    All SPEC    benchmark  representing the aggregat
19. iscussed for the ITLB  it is    free     to enlarge the TLBs to 128 sets  so a 128 entry direct mapped  DTLB would be a good choice     Overall  CPU performance seems consistently good as long  as the caches are not undersized  On the aggregate of all  benchmarks  the design achieves 0 7 IPC in the baseline  configuration  If both caches are increased to 16K 4 way LRU  it can achieve 0 85 IPC     VII  RELATED WORK    FPGAs have become a promising vehicle to bridge the  multi core modelling gap in computer architecture research   However  due to implementation complexity and difficulty in  debugging  it takes longer to develop FPGA based models  than software based models  To reduce FPGA implementa   tion complexity  current FPGA based simulators separate their  functional and timing models   a technique that has long been  used by the software simulators  1    2    3    9    11    20    FAST  7  uses a functional partition in software and a timing  model implemented in the FPGA  Others  such as HAsim   13  and RAMP Gold  17  implement both the functional  and timing models within the FPGA  ProtoFlex  8  supports  a hybrid functional model to accelerate large scale multi   core functional modelling  where rare events are executed in  software and frequently occurring ones are implemented in  hardware  While ProtoFlex does not support any timing model  on the FPGA  it provides a fast functional model that can be  fed to a timing model    FAST and HAsim include the timing model 
20. it data  to update the shadow register file    Since each entry in the buffer contains the updates made to  the register file in each commit  it is possible to reconstruct the  complete state of all 32 registers at any commit in the trace  buffer   s history  The z48d program implements this register  state reconstruction in software and provides a simple user   interface    This capability allows the user to freely step both forwards  and backwards in time  at least to the extent of the stored trace  buffer data  This is an invaluable debugging tool  especially  when combined with the hardware breakpoint  The user can  set a breakpoint  run the CPU at full speed until it is hit  and  then examine the 1024 commits before the breakpoint  The  hardware breakpoint signal can itself be used to trigger an on   chip logic analyzer  such as Altera   s SignalTap II  Effectively   this halts the processor at the trigger point  with trace history  and logic analyzer history that can be manually inspected and  correlated     3  Machine check Exceptions and Assertions  The Z4800  core also includes a small hardware module which continu   ously monitors various assertions in the processor  If any of  these assertions is violated  it can trigger the logic analyzer and  optionally stop the CPU  This hardware works even while the  CPU is running at full speed  effectively providing real time  verification    In particular  the machine check hardware checks that ex   actly the right instructi
21. ith  a Stratix II GX 2SGX90 suggest that 60MHz operation is possible  newer   more advanced devices should show further clock speed improvement    5The behavior of 147  vortex is due to software problems  not hardware  problems  The benchmark fails when run on an emulated machine with  QEMU 2  just as it does on the Z4800 hardware      It is possible to run the entire reference input set for these benchmarks  but  doing so takes several hours  We used truncated execution to greatly improve  benchmark turnaround times     N x Quartus instances     synthesize multiple HW Picks up completed  sof    wy Program FPGAs  Issues each  sof file                                                                                                               variants in parallel  FPGA images to an FPGA board      util host run         quartus_pgm   72  FPGA      make  C util host out  jN            util host farmer         util host runner        JTAG  t quartus_flow    compile     L  util host run         quartus_pgm         FPGA  i Makefile  E E E a T ee A SRE S E NSE lutil target z48perf CS   sbin init replacement  Wrapper program   brings up bare   minimum Runs list of benchmarks from samples perfcounters    UNIX environment       util target benchlist pl      Benchmarks i       kernel boots from  CFI flash       util target benchinit                   util target bench    Write results over  NFS to host                                                       Results analysis         util 
22. ling  ITLB faults are raised  speculatively on the Z4800  The decode stage rewrites in   structions that have faulted the ITLB into trap instructions   The trap instructions internally have the same opcode as the  explicit MIPS software traps  but the exception that is raised  will use the appropriate TLB exception code  If the faulting  ITLB access is a wrong path reference  it will be annulled by  a pipeline flush and therefore it will have no architecturally  visible effects    This logic cleanly handles many rather difficult corner  cases  such as a branch at the last word in a page  In this  case  the delay slot instruction in the next page may cause  a TLB exception  The decode stage will rewrite the faulting  instruction to a trap in the branch   s delay slot  resulting in  an exception raised with EPC  the exception restart address   pointing to the branch  This is correct behavior  10  Page 121            i                      Icache             Iqueue DG             1 y       predecode                         RR EX M1 RW BP                      pt  Deache    mul div  cop0    i i             Rfile                                  RR EX M1 RW BP                                          t       Fig  1  Pipeline organization    3  Decode and Group  DG   The DG stage routes machine  code from the instruction queue through the instruction de   coders and the grouper  Two candidate instructions are read  from the queue and decoded into pipeline control signals  If  the gr
23. locking  such  as the Linux spinlock primitives  These instructions require  special support in the L1 caches  Their semantics are    e Load linked  Same as normal 1w  load word   except for   additional bookkeeping  On the Z4800  this involves two  additional operations            Ifthe referenced cacheline is not in MODIFIED state   promote it  This operation may trigger a snoop to  obtain the cacheline for exclusive ownership        Track the effective physical address this 11 instruc   tion references  A later sc will fail if the physical  address does not match the address saved at this step    Store conditional  Store a word  The store only occurs if  the hardware can guarantee that the data at the effective  address has not been modified by any other CPU since  the previous 11  If the store occurs  the value   is written  to a register  0 is written otherwise  On the Z4800  the  following conditions apply        The sc must hit the cache in MODIFIED state  Since  the previous 11 will have forced the cacheline into  MODIFIED state  the sc should hit the cache  How   ever  if this is not the case  the sc will immediately  fail and return 0  This will happen if two processors  enter the 11 sc critical section at the same time        The effective physical address of the sc must match  that of the previous 11  Further  all cache operations  invalidate this stored address  any operation between  11 and sc is guaranteed to cause the sc to fail           These conditions are 
24. nd portable dynamic translator  In  Proceedings of the annual conference on USENIX Annual Techni   cal Conference  ATEC    05  pages 41 41  Berkeley  CA  USA  2005   USENIX Association     3  Patrick Bohrer  Mootaz Elnozahy  Ahmed Gheith  Charles Lefurgy   Tarun Nakra  James Peterson  Ram Rajamony  Ron Rockhold  Hazim  Shafi  Rick Simpson  Evan Speight  Kartik Sudeep  Eric Van Hensber   gen  and Lixin Zhang  Mambo a full system simulator for the powerpc  architecture  ACM SIGMETRICS Performance Evaluation Review  2004      4      5      6      7      8      9     10        11    12    13          14     15      16      17      18      19      20     D  Chiou  H  Angepat  N  Patil  and Dam Sunwoo  Accurate functional   first multicore simulators  Computer Architecture Letters  8 2  64    67   feb  2009    D  Chiou  Dam Sunwoo  H  Angepat  Joonsoo Kim  N A  Patil  W  Rein   hart  and D E  Johnson  Parallelizing computer system simulators  In  Parallel and Distributed Processing  2008  IPDPS 2008  IEEE Interna   tional Symposium on  pages 1    5  april 2008    Derek Chiou  Dam Sunwoo  Joonsoo Kim  Nikhil Patil  William H   Reinhart  D  Eric Johnson  and Zheng Xu  The fast methodology  for high speed soc computer simulation  In Proceedings of the 2007  IEEE ACM international conference on Computer aided design  ICCAD    07  pages 295 302  Piscataway  NJ  USA  2007  IEEE Press    Derek Chiou  Dam Sunwoo  Joonsoo Kim  Nikhil A  Patil  William  Reinhart  Darrel Eric Johnson  Jebedia
25. ntries map two virtually contiguous pages in a  single entry  which requires a selector mux  The Z4800   s L1  TLBs are designed to avoid decoding these entries  instead   they deal with single 4K pages  On a L1 TLB miss  the    JTLB hardware will respond with a 4K slice of the appopriate  mapping  regardless of its size  This design is similar to that  used by the RM7000  15   The JTLB back invalidates any  potentially stale L1 TLB entries as necessary     III  MULTI CORE SUPPORT    The Z4800 includes hardware support for SMP cache   coherence based on a snoopy protocol  This design gives full  Icache  Dcache  and DMA coherence with sequential consis   tency  Snoops are broadcast to all agents  CPUs and DMA  bridges   1 per cycle globally  The snoop signals are organized  into an emulated tristate bus  The arbitration for this bus is  pipelined  The overall latency is 3 cycles  with throughput of  1 request per cycle  The bus supports an arbitrary number  of agents     it is limited only by practical fan in  fan out  and  contention issues  not by any fixed tag or ID field sizes  The  snoop pipeline is in order and locks up only when an actual  conflict is detected by an agent     A  Cache coherence implementation    The Z4800 uses a version of the MESI cache coherence  protocol  Due to the internal implementation  particularly the  fact that local cache requests are seralized and atomic with  regard to external snoop requests  the cache coherence state  machine is quite sim
26. of the core in  their FPGA based simulators  RAMP Gold and ProtoFlex  on  the other hand  take a course grained approach to model large  systems and do not include detailed processor cores in their  timing models  Additionally  RAMP Gold and ProtoFlex are  open source platforms  but HAsim and FAST are not    Our Z4800 system differs from these systems because the  target is directly implemented on the FPGA      there is no  seperation of the functional and timing models  It is still very  configurable with more than 70 parameters  However  it is  also limited in terms of modeling structures that do not map  well onto FPGAs  Nevertheless  the Z4800 could be used as  an FPGA based full system functional model feeding a timing  model through the trace buffer interface     VIII  CONCLUSION    This paper has covered the design  implementation  and use  of a newly designed CPU  system  and supporting platform  for use in research studies  There is no other known platform  that is functional which provides similar speed and end to end  integration  We plan to release the entire platform under the  GNU General Public License  either version 2 or version 3  It  is hoped that this platform as well as the experiences and data  presented here will be useful to others in other performance  studies     REFERENCES     1  T  Austin  E  Larson  and D  Ernst  Simplescalar  an infrastructure for  computer system modeling  Computer  35 2  59    67  feb 2002     2  Fabrice Bellard  Qemu  a fast a
27. ons are committed by checking their  addresses  This verifies that the pipeline flushing hardware is  annulling the correct instructions  and that no instructions were  skipped  The cache controllers also contribute some logic to  verify that important assumptions regarding timing and cache  coherence are not violated    The machine check hardware does not have very wide cov   erage  since only a few conditions are checked  Many problems  can occur without triggering any of the checks  However  the  checks cover assumptions made in the implementation that  will cause incorrect behavior if they are violated  Thus  it is  still useful to verify that they never occur     B  Fuzzy Tester    The z48d program includes a fuzz testing mode  In this  mode  a linear  branch less  stream of randomly generated  instructions is dynamically generated  simulated  loaded to  the target board  and executed  The generated instructions are  limited to loads  stores  and a few ALU operations  Only 8  of the 32 registers are used to increase the probability of  value re use  No effort is made to control the effects of the  instructions  many of them will trigger exceptions due to bad  memory addresses    A minimal kernel preloaded onto the target verifies that each  exception taken while executing the code matches the excep   tion predicted by simulation  When an instruction triggers an  exception and the exception matches what was predicted  the  instruction is simply skipped and execution resum
28. ounters 3  Debugger FPGA Board 40 pin Ribbon Cable Target FPGA Board Host PC    Fig  3  Block diagram of debug setup using two Altera DE2 70 boards     control the target board  It can also display the contents of the  target   s registers and memory    2  Integrated Instruction Trace Buffer  The integrated de   bugger allows one to control the CPU only in the forward  direction  One can halt the CPU and then observe the registers  changing by single stepping  However  this ability is insuffi   cient to debug anything beyond the most trivial of programs   The Z4800 greatly enhances the basic debugger capabilities  with an integrated instruction trace buffer  which itself is  another memory mapped I O peripheral    The trace buffer contains a limited history of all committed  instructions  Included in each commit are many signals to  flag internal pipeline events  such as branch mispredicts and  exceptions   as well as the committed register data  The buffer  for this data can practically range from 64 to 1024 entries  The  Z4800 commits 2 instructions per cycle so each entry contains  2 instructions  Since the buffer is circular  each new commit  overwrites the oldest entry    The Z4800   s trace buffer has a further enhancement  a  shadow register file  The shadow register file stores the state  of all 32 architected registers just before the time of the oldest  commit in the buffer  This is implemented by reading the  oldest entry as it is being replaced  using the old comm
29. ouper decides it is safe to issue them in parallel  then  both are issued and the Iqueue advances by 2 words  If parallel  issue is not safe  then only the first instruction will issue to  pipe 0  pipe 1 will be issued a NOP  and the Iqueue will only  advance by   word  If later pipeline stages are stalled  nothing  is issued and the Iqueue does not advance    If the first of the two instructions coming from the Iqueue  is a branch  then the second instruction is its delay slot  instruction  A rather simple approach to handling branch delay  slots is used  delay slot instructions are always issued in  parallel with the branch  If the delay slot instruction is not  present  i e  Iqueue only contains one valid word  then the  branch will not issue until it becomes available    4  Register Read  RR   In the RR stage  the register file is  read to provide four operands for two instructions  Operands  from later pipeline stages are muxed with the output from  the register file  If an operand is marked as invalid  then the  data has been overwritten but is not known yet  use of such  an operand will cause this stage to stall  Invalid operands are  generated by late result instructions such as loads  this causes  RAW dependent  read after write dependent  instructions to  stall  An entire pipeline stage is dedicated to this operation  to avoid it becoming the critical path    5  Execute  EX   In the EX stage  the ALU will perform  the necessary operation  Data cache virtual addresse
30. ple  no additional transient states are  required  Figure 2 shows all possible states as well as the  transition conditions for all cases             minstate SHARED  amp  amp        Isnoop returned INVALID        refill                   rsnoop SHARED      writeback  demote                      minstate MODIFIED    promote      minstate SHARED  amp  amp        Isnoop returned INVALID      refill i    Fig  2  Z4800 MESI state machine     Isnoop    refers to a locally   initiated snoop request     rsnoop    refers to a remotely initiated  snoop request     minstate    refers to the minimum state required  to satisfy the outstanding local miss        A broadcast request response approach is used to coordinate  snoop requests  On a miss  the CPU sends a request to all    3The prototype on the Altera DE2 115 board with Ethernet has 6 agents  participating in cache coherence  4 CPUs  Ethernet DMA  and debugger  DMA     other CPUs asking to acquire a certain cacheline for a given  minimum state  either SHARED or EXCLUSIVE mode   The  other CPUs will not respond until they have updated their own  caches    state  which may require writeback of dirty data and  tag updates                    B  Atomic primitives for SMP  load linked store conditional    The MIPS load linked  11  and store conditional  sc  in   structions provide an efficient way to implement atomic read   modify write critical sections in multithreaded code  They are  typically used to implement an OS kernel   s core 
31. ploration platform  the Z4800     The Z4800 is not aimed at replacing any existing FPGA   accelerated simulation platforms  for different type of research  studies as well as educational objectives  different tools will  be appropriate  The Z4800 focuses on testing and analysis  of single core and small scale  2 16 core  multi core systems  rather than large scale systems such as those targetted by  ProtoFlex and RAMP Gold  It can run unmodified Linux based  operating systems over network filesystems such as NFS or  from local storage such as a USB drive    The main design philosophy of the Z4800 is centered  around ease of adaptation and use  It is supported by a fully  automatic synthesis and benchmarking system that provides  a mechanism to run a large number of experiments with no  human intervention  It also provides extensive run time debug  and verification interfaces  Consequently  it is an end to end  FPGA based platform that can be used not only for computer  architecture research  but also for operating system research  and for software debugging     II  Z4800  A CONFIGURABLE AND MODULAR FPGA  COMPUTER    The Z4800 is an open source multi core research platform  which can be directly synthesized for a commodity FPGA  and can run a full  unmodified Linux based operating system   It can be configured to run at up to 55MHz while achieving  0 7 average per core IPC  instructions per clock cycle  using  only a  300 FPGA board  Per core IPC of up to 0 85 can be  achieved
32. require 11 RAM blocks  a 2x5 grid of 32x32 RAMs plus  a single 32x1 RAM  This register file design  despite being  large  is capable of high speed operation  It has been used  successfully at 5SOMHz without it becoming the critical path   Its simple design  high clock frequency  and the fact that it  does not require a special clock makes it the implementation  of choice for the Z4800     D  TLBs    The Z4800 contains two level 1 TLBs  ITLB and DTLB   which are modules embedded into the Icache and Dcache  and  a main level 2 TLB  the joint TLB  JTLB   The Z4800 uses  a single cycle L1 DTLB and is VIPT  This allows the cache  tags and DTLB to be accessed in parallel  With a late way   select  we can hit the cache in 1 cycle  Cache aliasing does  not occur for way sizes less than or equal to 4K  the minimum  page size  This method does not require speculation and allows  set associative operation without any special considerations    The JTLB matches the behavior of the R4000   s TLB for  software compatibility  It is architecturally visible and is under  control by the MIPS Coprocessor 0 instruction set  Hardware  in the joint TLB  JTLB  enforces coherence between the JTLB  and the two L1 TLBs  a policy of strict inclusion is adopted  so that the L1 TLBs are architecturally invisible    The R4000   s TLB is capable of mixing entries with varying  page sizes  Translating such addresses is a slow and complex  process that is best kept away from the L1 TLBs  Further   R4000 TLB e
33. s are  calculated here as well  This stage is the first stage at which  operands are guaranteed ready  Branches are resolved as well   EX will trigger a pipeline flush if a branch mispredicts    Of particular note in this stage is the support for operand  cascading  If this option is turned on  the Z4800 can issue  pairs of instructions that are RAW dependent  A direct com   binational path is provided from the ALU in pipe 0 to the  operand inputs of pipe 1  This means that the ALUs only  have half a clock cycle each to perform their operations    It is important to note that the instruction in pipe 1 need not  be an ALU instruction  it could be any instruction  All basic  register to register instructions  other than memory accesses     effectively have zero latency and can pair with any other  dependent instruction  For example  the result of an addition or  barrel shift can be cascaded into the virtual address calculation  of a concurrently executing load or store    6  Memory  M   In the M stage  the first changes to  architecturally visible state are allowed  The Dcache  multi   ply divide unit  and coprocessor 0 interface are in this stage   With a Dcache latency of 1  this stage also contains a barrel  shifter and sign extension unit for handling non word loads   All possible exceptions are detected and resolved in this stage   By preventing any changes to visible state from occuring until  exceptions are resolved  we completely avoid the need for  rollback logic    7 
34. sufficient to guarantee that the critical  section is atomic across all CPUs  Due to the way that  cache coherence is implemented on the Z4800  this can be  easily understood  The 11 will always gain the cacheline for  exclusive ownership up front  the sc will not succeed unless  the cacheline is still owned exclusively when it executes  Any  other processor entering the critical section would invalidate  the local copy of the cacheline  which will cause the sc to  correctly return 0  Since the Z4800 has sequential consistency   it is already guaranteed that other operations surrounding  the critical section cannot be reordered  no explicit barrier  instructions are required     IV  HARDWARE DEBUGGING AND VERIFICATION    A  Debugging Features    The Z4800 processor integrates a variety of testing  debug   and verification features  These features were implemented    out of necessity  since successfully booting and running a  full featured operating system requires absolute attention to  correctness    During development  no software functional or timing simu   lations of the RTL were performed  All debugging was based  on examination of synthesis results and tests performed on  real hardware  There are a few reasons behind this potentially  surprising fact    First  one must take into account the difficulty in properly  setting up an accurate testbench for each component of the  design being tested  Correct input stimulus must be provided   and there must be some reference to
35. the FPGA  Any uncertainty or    skew in the clock signal reduces the margins of setup and or  hold timing  relative to the high speed clock  These delays  can quickly eat away the entire clock period  resulting in a  low maximum clock rate  Early versions of the Z4800 used a  2x multiplied register file clock and two dual port RAMs  this  achieved only 60MHz  which restricted the pipeline clock to  under 30MHz    Instead of time multiplexing  the Z4800 takes a simpler but  larger approach  This method can be used to create arbitrar   ily large NxM port memories without any time multiplexing   although it scales by O n   m  in area  The key idea is to  recognize that a dual port RAM can be arranged with 1 read  and 1 write port  the RAM thus contains the most up to date  register data written by that write port  We can duplicate this  structure N times  where N is the number of read ports  If the  write inputs are all tied together then every copy contains the  same data  and we now have an independent port to service  each read  By arranging these rows of RAMs  one row per  architected write port  we can now store data from every write  port  To read the most up to date data at a given address  all  RAMs in a row are read in parallel  the data from the most  recently written RAM should be selected by a mux  This mux  is controlled by a small table storing the index of the most  recent writer of each register    For the Z4800   s case of 32 32 bit registers and 2x5 ports   we 
36. the scratch   built benchmarking infrastructure  Each variant takes 11 12  minutes to synthesize and peaks out at just under 3GB of  physical RAM usage  Since synthesis of each variant is single   threaded  all 8 server CPUs were saturated  We can thus say  that effectively 1 to 2 minutes of real time were spent to  generate each   sof output file for the FPGAs  Benchmarking  began as soon as the first round of  sof files was produced   approximately 12 minutes after issuing make  j8   pick up  of each completed  sof file was completely automatic     Each 10 minute SPEC95 benchmark run was performed 5  times and the results were averaged  The FPGA was com   pletely reset and reprogrammed for each of the 5 runs  this  makes the influence of non deterministic memory allocation  decisions show up as run to run variations  The standard de   viation of these 5 run populations gives some indication as to  how consistently the results can be obtained  A total of 685 10   minute runs were performed  taking a total of approximately  114 FPGA hours  Two DE2 70 boards were used to execute  the runs in parallel  so collection of all benchmark data took  approximately 57 hours of real time  The scripts allow scaling  to arbitrarily large parallel FPGA farms  even though only two  were used here     C  Experiment Results    The benchmark experiment explores the design space of  the CPU core  There are many compile time configuration  options for the Z4800  many need to be set to specific  
37. values for correctness in a given system  but others only have  performance impacts  To explore the design space  a set of 47  hardware variants was defined  Table II   Each variant tests  the performance when changing one or two features against  the performance of a reasonable baseline configuration    Figure 5 shows the results as average IPC of all benchmarks  tested  We can see that the L1 Dcache size is a very important  parameter  Reducing the Dcache to 4K direct mapped leads  to a 25 6  degradation in overall IPC  while increasing it to  16K 4 way LRU leads to a 10 0  improvement    Other parameters of particular interest are those that  influence how instruction prefetching is performed  The  aligned_frontend variant is incapable of pre decoding  branches that are not aligned on an 8 byte boundary  this                   Icache 8K 2 way  64B  LRU   Dcache 8K 2 way  64B  LRU  1 cycle   ITLB 64 entry 2 way  Random   DTLB 64 entry 2 way  Random   JTLB 32 entry  64 page   fully assoc   Random       32 instructions   2K ent  2 bit  128 ent  BTB  8 ent  RAS  2K ent  2 bit  128 ent  BTB  8 ent  RAS  Yes   Yes    Iqueue length   Main br  pred   Frontend br  pred   Frontend unaligned br   Operand cascading                            TABLE II  Baseline CPU configuration    0 9       0 8       0 7 4       yru    VAO O    0 6 4  0 5    0 4    0 3    0 2      Arithmetic Mean IPC    0 1         base e  o  yu n    pir     a  y Piru ey  vvo    dtib_32ent_dm P    yru a  piru  piu O  
38. ys   7    6    18    17    8    5    12    19    16    4     14   Many of these techniques use a seperate functional  model  usually  7    17    14  but not always  8  software   based  paired with an FPGA based timing model  FPGA   accelerated models are several orders of magnitude faster than  their software counterparts  especially when simulating multi   core systems    Current FPGA accelerated simulation platforms have  demonstrated FPGAs as viable architecture research vehicle   However  they fall short in terms of providing tools which are  fully open source and easy to experiment with  For example   HAsim 13  and FAST 7  provide valuable insight into how  FPGA accelerated simulators can be designed  but they are not  open source  Their utility is limited to a few research groups  in academia and industry  The difficulty in designing and  verifying an FPGA accelerated simulator leads to a scarcity of  tools  The few existing open source tools  e g   RAMP Gold   17  and ProtoFlex  8  are not straightforward to use  modify  and or generate results with    The end result is that FPGA accelerated simulators have not  become popular in computer architecture research community   More FPGA platforms are needed to popularize their use  in the research community  These systems should preferably    provide a full solution  easy to use  modify  debug and verify   and an automated experiment setup  To address this need   we introduce in this paper a new FPGA based architecture  ex
    
Download Pdf Manuals
 
 
    
Related Search
    
Related Contents
Sigma FireWire 800 3-Port PCIe  OM, Flymo, HT39, HT42, HT42 S, HT45 S, HT53, HT53 S  Mini-chaîne stéréo MP 203 Manuel d`utilisation - Migros  USER MANUAL - Vir2 Instruments  FRG Series User Manual - Automated Laser Corporation  Highpoint RocketRAID 4320  Desenvolvimento e avaliação do desempenho de um equipamento  Manual del usuario    Copyright © All rights reserved. 
   Failed to retrieve file