Home
        ARCS - An Architectural Level Communication
         Contents
1.   communication events can be considered asynchronous be   cause at this level of abstraction they are not bound to any  particular cycle level implementation  Due to this  a high  level simulation framework that embraces communication  as its fundamental simulation unit would be ideal for per   forming architectural level studies  Once this communica   tion behavior is understood  a specification can be refined  into an increasingly detailed implementation    There are a number of features which we believe are im   portant in an architectural level simulator of this sort  The  most important is that the designer must be able to easily  define complex high level behaviors for system pieces and  the communication that takes place between them  The sim   ulation must model communication event timing in a way  that helps the designer understand the performance impact  of system level decisions  and should allow for variable grain  size of component descriptions both in terms of behavior and  in terms of timing  The simulation should abstract or auto   mate the communication and timing models so that the user  is not inundated with details when concentrating on high  level design  This trade off between convenience of com   ponent abstraction and timing accuracy is crucial to allow  rapid prototyping and the refinement of architectural level  designs into increasingly detailed implementations  In terms  of the design of the simulator itself  it should maintain high  performance b
2.  11   How   ever  because communication is essentially asynchronous at  the architectural level this framework is equally well suited  for exploring systems that will eventually be implemented  as traditional synchronous systems     2  ARCS FRAMEWORK    We have designed ARCS to fulfill the previously described  features resulting in a convenient architectural level simula   tion framework through which significant performance re   sults can be obtained  A major design choice while writing  ARCS was choosing Java as its implementation language   Java was chosen because it is a high level language that  allows easy specification of complex behaviors within simu   lation components while also having native support for con   currency and synchronization  features not often supported  in other languages without specialized extensions or requir   ing significant user expertise  A modern language like Java  also encourages the use of ARCS with next generation sys   tem designers for whom Java may be a language of choice    However  Java alone does not provide the needed infras   tructure for architectural simulation  Creating a working  system that has massive concurrency in any language is  not a trivial effort  and Java is no exception  Often im   plementation details can become overwhelming for even ex   perienced system designers  Because of this  ARCS uses  a CSP  12  process algebra as the logical level for model   ing processes and their communications  By moving reason   ing
3.  ACM Press   1989     3  A  Bardsley and D  Edwards  Compiling the language Balsa to  delay insensitive hardware  In C  D  Kloos and E  Cerny   editors  Hardware Description Languages and their  Applications  CHDL   pages 89 91  April 1997     KSI                   15          16     17      18     Cadence Design Systems  Verilog xl reference manual   December 1994    Mentor Graphics  Modelsim se user   s manual  2001    Doug Burger and Todd M  Austin  The simplescalar tool set   version 2 0  SIGARCH Comput  Archit  News  25 3  13 25   1997    Alain J  Martin  Compiling communicating processes into  delay insensitive VLSI circuits  Distributed Computing   1 4  226 234  1986    Erik Brunvand and Robert F  Sproull  Translating concurrent  programs into delay insensitive circuits  In Proc  ICCAD   pages 262 265  IEEE Computer Society Press  November 1989   Kees van Berkel  Joep Kessels  Marly Roncken  Ronald Saeijs   and Frits Schalij  The VLSI programming language Tangram  and its translation into handshake circuits  In Proc  European  Conference on Design Automation  EDAC   pages 384 389   1991    W  F  Richardson and E  Brunvand  An architecture for a  self timed decoupled computer  In Int  Symposium on  Asynchronous Circuits and Systems  IEEE Computer Society  Press  March 1996    W  F  Richardson and E  Brunvand  Architectural  considerations for a self timed decoupled processor  IEE  Proceedings  Computers and Digital Techniques   143 5  251 257  September 1996    C  A  R 
4.  Hoare  Communicating sequential processes   Commun  ACM  21 8  666 677  1978    E  K  Brunvand and M  Starkey  An integrated environment  for the design and simulation of self timed systems  In   A  Halaas and P  B  Denyer  editors  VLSI 91  page 4a 2  IFIP   August 1991    P H Welch and J M R Martin  Formal Analysis of Concurrent  Java Systems  In P H Welch and A W P Bakkers  editors   Communicating Process Architectures 2000  volume 58 of  Concurrent Systems Engineering  pages 275 301  WoTUG   IOS Press  Amsterdam   September 2000    Ltd  Formal Systems  Europe   Failures divergence renement   Fdr2 user manual  October 1997    P H Welch  J R Aldous  and J Foster  CSP networking for java   JCSP net   In P M A Sloot  C J K Tan  J J Dongarra  and  A G Hoekstra  editors  Computational Science   ICCS 2002   volume 2330 of Lecture Notes in Computer Science  pages  695 708  Springer Verlag  April 2002    Ivan E  Sutherland  Micropipelines  Communications of the  ACM  32 6  720 738  June 1989    E  Brunvand  M  Michell  and K  Smith  A comparison of  self timed design using FPGA  CMOS  and GaAs technologies   In Proc  ICCD  pages 76 80  IEEE Computer Society Press   October 1992        
5.  about these concurrent systems into a formal framework  such as CSP a vast body of concurrency and formal methods  literature becomes accessible    CSP style descriptions have been shown to be an effective  means of describing concurrent systems  7  8  9  3   This  prior work focused mainly on using CSP as the language  from which circuits can be automatically synthesized  al   though in some cases the CSP style program can be run  to validate functional correctness  13  3   These synthesis  systems focus on converting the CSP style descriptions into  circuits making implementation details close to the heart of  these models  exactly what ARCS is designed to avoid    ARCS moves in the opposite direction by using the Java  object hierarchy to provide increased abstraction details   convenience classes and methods  thereby facilitating rapid  development of systems by avoiding implementation details   Also  the use of CSP semantics could allow convenient con   nections to these hardware synthesis systems if an ARCS  description is refined to a small enough grain size    An additional benefit of abstracting component intercon   nects is the ability to reason about a simulation    asynchronously without concern for the implementation tech   nology  Allowing reasoning about both synchronous systems  at a high level  and asynchronous systems at any level  is a  major advantage over traditional low level simulators that  focus on the implementation  3  6      Java  Native Primitive
6.  architectural simulation asynchronous communication  Java    1  INTRODUCTION    Architectural level design exploration should allow a de   signer to simulate and evaluate system ideas at a high level of  abstraction prior to making specific implementation choices   There are many simulators that focus on the detailed    Permission to make digital or hard copies of all or part of this work for  personal or classroom use is granted without fee provided that copies are  not made or distributed for profit or commercial advantage and that copies  bear this notice and the full citation on the first page  To copy otherwise  to  republish  to post on servers or to redistribute to lists  requires prior specific  permission and or a fee    GLSVLST 04  April 26 28  2004  Boston  Massachusetts  USA    Copyright 2004 ACM 1 58113 853 9 04 0004     5 00     transistor  and gate level implementations of systems  1  2   3   and HDL simulators that can simulate both behavioral  and structural code in VHDL and Verilog  4  5   but they  model systems at much too fine a level of detail to be eas   ily used for architectural exploration  Others  such as Sim   plescalar  6   are intended for system level simulation but  are often tedious to use because of their fixed simulation  granularity based upon the notion of a clock cycle    At the architectural level the most important considera   tions are data movement  communication between system  pieces  and relative timing between those events  These
7.  functionality  the simulation designer needs only to define the execution  functionality and can rely on ARCS library calls to com   plete the rest of the steps outlined above  If the component  needs increased functionality the user must override a num   ber of functions to implement that custom functionality  In  this way ARCS optimizes the common case to provide rapid  prototyping while not limiting functionality  In all cases  however the distributed clockworking model will encapsu   late timing information critical to performance evaluation   As data is retired from the simulation it is passed into  Recorder components which aggregate the timing informa   tion included with the data for post processing  This tim   ing information can be either processed within the ARCS  framework to retrieve statistical information about execu   tion paths and timing delays among other things or it can  be exported to a file for archiving  Java   s extensive built in  library of functions  including graphics  make it an attractive  language in which to perform post processing of simulation  data  Currently ARCS post processing can only provide ba   sic statistical information or a full output of all timing data  from the simulation  More sophisticated graphical analysis  tools are being developed to allow more detailed analysis     3 1 Channelwatchers    Post processing of simulation data is often not convenient  for analysis of concurrent systems  To fulfill the need for a  real time v
8. ARCS   An Architectural Level Communication Driven  Simulator    Dave Nellans  Vamshi Krishna Kadaru  Erik Brunvand  School of Computing  University of Utah  Salt Lake City  Utah 84112   dnellans  kad  elb   cs utah edu    ABSTRACT    Simulators for digital systems operate at a variety of levels  of abstraction varying from detailed analog and switch level  modeling of the transistor to cycle based descriptions of en   tire systems  We propose an even higher level simulator   called ARCS  based on the abstraction of an asynchronous  communication event rather than of a clock cycle  Modeling  systems at this level allows architectural level exploration  of the design space before cycle level details are available   and also allows the same framework to be used to refine ar   chitectural level simulations into more detailed simulations  with increasingly fine grained notions of timing  The ARCS  simulation framework uses concurrently operating threads  in Java with communicating sequential processes  CSP  se   mantics as a natural expression of communication between  concurrent hardware  To avoid synchronization bottlenecks  ARCS models time using a communication driven clockwork  model which allows for both user configurable runtime view   ing of the simulation and post processing of complete simu   lation timing data     Categories and Subject Descriptors    C 0  General   Modeling of computer architecture    General Terms    Algorithms  Design  Experimentation    Keywords   
9. e key needs of an archi   tectural level simulator by allowing the convenient descrip   tion of complex functionality using Java  the abstraction  of communication as the fundamental unit of simulation   and the separation of component timing and functionality  through the use of a distributed clockworking timing model   Because communication is essentially asynchronous at the  architectural level this framework is equally well suited for  exploring systems that will eventually be implemented as  asynchronous systems or as traditional synchronous systems    ARCS allows rapid prototyping of systems through vari   able component granularity and by offering support for its  timing model through a set of predefined Java classes  By  realizing true concurrency during simulation through the  JCSP library it is highly scalable without significant syn   chronization bottlenecks  This allows for a single simulation  to be distributed over multiple machines to increase simula   tion performance if necessary  ARCS   s layering on top of the  JCSP library encourages leverage of a large body of work on  CSP semantics and on synthesis of CSP based descriptions   It also encourages the use of formal verification during the  development cycle of systems     6  REFERENCES     1  Meta Software Inc  Hspice user   s manual  June 1987     2  A  Salz and M  Horowitz  Irsim  an incremental mos  switch level simulator  In Proceedings of the 26th ACM IEEE  design automation conference  pages 173 178 
10. ecution Loop of ARCS Component    mance simulations  As a simulation   s granularity increases  to gain timing accuracy the number of Linux processes in   creases proportionally  This increased number of processes  allows the OS scheduler to more efficiently distribute work  across multiple processors decreasing the performance loss  typically associated with increased simulation accuracy  The  JCSP library also contains the functionality to allow CSP  channels to communicate remotely over a TCP IP session   16   This distributed communication provides simulation  scaling which can divide large simulations into discrete sec   tions of CSP processes running on difference machines in  separate JVM   s but connected through standard CSP chan   nels  Though not user transparent this small inconvenience  offers a massive increase in computing resources if desired  even across different hardware platforms     3  TIMING    What separates ARCS from a simple collection of JCSP  processes  and turns it into a powerful architectural simu   lation framework is its timing model  Because components  can vary in their functional scope it is necessary to decou   ple timing information from functionality  Each component  must be allowed to take a variable amount of simulation  time to complete while also being allowed to complete in  any order in real time    To satisfy these timing requirements ARCS utilizes a com   munication driven distributed clockworking model  Every  component has its o
11. ed by ARCS for post processing  This  clockwork timing model allows for most important timing  information  including communication delays due to con   tention  to be recorded without the need for a centralized  time queue which is often a major performance bottleneck    To further understand ARCS   s timing model we can ex   amine the execution of a general purpose component within  ARCS  Upon initialization each component automatically  reads its user defined component time from a file listing  based on its fully qualified classname  This defines the simu   lation processing time the component requires independent  of its execution complexity  After initialization has com   pleted it begins the standard communication driven execu   tion loop of read  process  write  shown in Table 1  This  generic communication driven component is applicable to a  wide variety of hardware subsystems and can be mapped  to an implementation in any number of ways  ARCS also  allows for variable single component time by permitting the  user to modify a processing time function that can return  non static time values based upon the function performed    One style of hardware component that we are particu   larly interested in is based on asynchronously communicat   ing components known as micropipelines  17   Under this  model the component begins its execution loop by perform   ing a blocking read  Read Data  on its input channels until  a channel provides data  The component then updates its  
12. iewing of simulation execution ARCS provides  ChannelWatchers  which are a specialized ARCS compo   nents  ChannelWatchers may be inserted in an active sim   ulation anywhere a channel exists  This single channel is  replaced by two channels with a ChannelWatcher compo   nent in between  ChannelWatchers differ from functional  simulation components in that they may not affect the sim   ulation or timing data that passes through them although  they may examine the data  This restriction cause Chan   nelWatchers to be completely transparent to the simulation  and have zero effect on simulation correctness    ChannelWatchers are intended to perform actions which  notify the simulation designer of an event  These events  provide a look into the simulation during execution so in   formation about simulation communications can be gleaned  in real time  Because Channelwatchers are a subset of com   ponents  defining custom functionality for Channelwatchers  is trivial  We currently have console based ChannelWatcher  implementations which output a variety of information upon  activation  Work is ongoing to provide large scale real time  visualization of data moving through a simulated system  using native Java graphics     4  TORUS ROUTING EXAMPLE    As a small but illustrative example of ARCS   s capabili   ties we implemented a Torus Routing Chip  18   The Torus  Router is an asynchronous  dimension order  wormhole  router that has been designed in several technologies  most  recen
13. nularity  and number of components as well as their interconnections  are specified by the user while defining the simulation  The  user then defines specific component behavior to affect data  which flows through the component during simulation  This  behavior is not limited allowing complex and simple behav   iors as well as not modifying the data at all  By allowing this  varying functionality of components the high level nature of  Java is leveraged to allow complex component behaviors to  be implemented easily    Running ARCS in the Sun Microsystems JDK version  1 4 2_03 b02 under the Linux 2 6 kernel  every Java thread  is a native Linux processes  This allows the JCSP processes  to be load balanced across multiple processors within a sin   gle workstation  This fundamental feature of ARCS   s de   sign helps achieve scalable  user transparent  high perfor     Function    Read Data   Update Local Time  Return Local Time  Process Data   Add Timing Info    Send Data  Update Local Time 2       Description  Reads data off an input channel     Updates localtime based on later of localtime or latest time attached to data   Returns the new localtime to the component that provided data     Perform any necessary data manipulations    Add a new timing record to the data of component name   time received  processing time  and time sending attempted   Block writing to an output channel until data is accepted   Update localtime based on time returned by data receiver     Table 1  Ex
14. s  threads  synchronization    JCSP  CSP Primitives  channels  processes  etc     ARCS   Simulation Primitives  Timed Data   Channel Watchers  Object Hierarchy  Data Processing    Circuit  Simulation       Figure 1  Layering of JCSP and ARCS on Java     2 1 JCSP    ARCS realizes its CSP model through use of the Java CSP   JCSP  library from University of Kent  14   The JCSP li   brary is an implementation of the CSP primitives on top of  Java   s native concurrency objects  threads and semaphores   The use of the JCSP library in ARCS is twofold  First  CSP   level models of concurrency allow for formal proofs of pop   erties using tools such as those from Formal Systems  15    Secondly  JCSP provides a convenient abstraction of Java   s  threads and semaphores which are tedious and error prone  to work with in quantity  Because ARCS components are  instantiated as JCSP CSP processes it is required that all  interprocess communication be channel based  thus there is  no explicitly shared memory between threads  This elimi   nates implementation level race conditions which are a ma   jor cause of deadlocks and guarantees correct startup of  threads to avoid race conditions due to thread spawning   JCSP has been shown to properly implement CSP seman   tics  14  using these native constructs  Any simulation using  only JCSP processes inherits these correctness properties    ARCS represents all components as JCSP processes which  are interconnected by one or more channels  The gra
15. stored localtime  Update Local Time  to the later of its lo   caltime or the latest time present in the current data   s ap   pended timing information  It then writes its new localtime  back to the component that provided the data  Return Local  Time  This critical step allows ARCS to accurately model  communication delays due to contention for a resource or  implement arbitrary communication delays  Without this  write back all communications would appear to occur in   stantaneously to the sender  The component then processes  the received data based on its defined functionality  Process  Data    After processing the data  the component is ready to send  the data out one of its communication channels  It first adds  an additional timing tag  Add Timing Info  to the data with  its unique component name  time the data was received   time required to process the data  and the time it attempted  to send the data  It then blocks  writing to a channel un   til it the data is accepted  Send Data  Finally it receives a  second localtime update from the receiver representing the  time the communication completed  Update Local Time 2  If  this time is prior to the sender   s localtime then communica     tion was instantaneous and its localtime remains the same   otherwise communication was delayed for some reason and  the sender   s localtime must be updated  Once the sender   s  notion of time is updated  the loop cycles back to Read Data   If components need to implement only basic
16. ted at the switch level in a 0 6 micron  CMOS process  Timings were obtained to model the re   quirements for a Router element to process both the address  words and the data words of the packet  These times repre   sent the processing paths within the transister level model  of the Router  These times were then input into ARCS   s  Router level descriptions as the base processing time  A  full 4x4 mesh was simulated in both Cadence and ARCS by  routing a 22 word packet containing two address words and  20 data words  This packet was routed from position  0 0   to position  1 1   the longest possible path in a 4x4 mesh   Cadence reported 2148ns required to complete this opera   tion while ARCS reported that only 2067ns were required   ARCS   s extrapolated full system timing was accurate within  3 77  of the switch level model  This small discrepency is  likely due to both the granularity of the model and inter   connect delays which were not modeled by ARCS    ARCS requires significantly less time to complete its sim   ulation than a transister level simulator  To simulate 100  of the aforementioned packets moving through the TRC   ARCS requires less than a second  Cadence requires approx   imately 18 seconds to perform this same task  This speedup  in simulation execution time when compared to switch level  simulations is significant when using ARCS to simulate large  systems  It allows much more thorough testing of the sys   tem because so much more data can be passed thro
17. tly in 0 6 micron CMOS at the University of Utah  At  the architectural level the overall system can be described  as a set of processors attached to a mesh connected set of  routers as in Figure 2     Using ARCS we can describe this system at any number  of levels of detail  At the highest level we can describe each  processor as a component  and the entire mesh connected  routing network as a single ARCS component  Refining that  description results in each Mesh Element in Figure 2 being  described as a separate ARCS component  Refining further   ARCS can model each Router element separately where two  such elements compose a single Mesh Element    At this level of detail the Mesh Element is implemented in  ARCS as two separate components  each representing a sin   gle Router  connected by six independent One2OneChannels   The functionality of each Router is described in just a few  lines of Java with the rest of the concurrency  channel  and  timing infrastructure inherited from the ARCS framework   Connecting these Mesh Elements into a 4x4 mesh with their  associated processors results in four Mesh Element processes   each connected to a data source and sink  the processor  model   and to four other Mesh Elements  The processors     writes and reads into the Mesh network are controlled by the  main simulation thread which writes and reads data into and  out of the simulation as specified by the user    Using the Cadence toolkit  4  the Torus Routing Chip   TRC  was simula
18. ugh the  simulation in a reasonable amount of time  Using Java as  the implementation language also allows convenient  flexible  generation and formatting of input data to the system    ARCS performs full system simulations with more than  acceptable accuracy while requiring significantly less design  time  It requires less than 100 lines of user written Java  to simulate the full 4x4 mesh that Cadence modeled using  16 856 transistors  ARCS also allows much more flexibil   ity in terms of playing    what if    games by modifying the  behavior of the routers and their communications  Modi   fying the Torus Routing Chip to use stateful packet based  routing instead of wormhole routing would be trivial using  ARCS where as it would require a complete redesign of the  Cadence model  Similarly  it is not easy to determine how  overall system performance would be impacted by speeding     gt  Mesh  Element  V A A  Mesh  Element        Mesh        Element         Processor                    Mesh  Element    Mesh  Element    Figure 2  Torus Router    up a single component using switch level modeling  Chang   ing components processing time in ARCS is trivial however   This allows easy experimentation to identify components in  which optimization would provide the greatest improvement  to total system performance     5  CONCLUSIONS    This paper has presented a framework called ARCS for  performing architectural level exploration in a convenient  and efficient manner  ARCS fulfills th
19. wn notion of the current time and each  data item that passes through a communication channel car   ries time information within it  A component   s local notion  of time is updated when it participates in a communication  action  either input or output   Because CSP communica   tions are synchronized  input and output components must  agree to communicate  When this happens  time informa   tion is passed with the data and each component updates  its own notion of local time based on this information  A  component can also increment its own local time and the  most recent time tagged against the current data  to model  time spent processing data within that component  JCSP  allows n buffered channels enabling multiple data items to  be in flight in FIFO order on a single channel  Additionally   it is trivial to model an n sized FIFO buffer in ARCS for  more accurate simulation    All simulation data in ARCS must extend the TimedData  class  Using TimedData allows components to attach tim   ing information to simulation data as it moves through the  various modules via channels  As data moves through a sim   ulation component it is automatically tagged by that com   ponent with the component   s unique simulation ID  time  received  time spent processing  and time it attempted to    send that data  When a piece of data has been retired from  the simulation it contains a full record of all components  it passed through and at what times  This data is then  recorded and aggregat
20. y avoiding centralized data structures  such  as an event queue  and allow its performance to scale with  large numbers of simulation components    Defining these features for an architectural level simulator  results in some compromises  The requirement of variable  grain size abstraction inherently decreases system timing ac   curacy  This means that timing estimates based on initial  architectural level simulation may need to be refined as more  implementation details become available  Abstraction of im   plementation details such as circuit interconnects does not  necessarily provide an easy path for automated synthesis of    circuits  although we note that there are several synthesis  systems that are targeted directly at concurrent hardware  with CSP style communication  7  8  9  3     To support all these features we present ARCS  an asyn   chronous communication driven architectural level simula   tion framework written in Java  ASIM uses CSP style com   munications between separate Java threads to implement  separate simulated system pieces  The use of Java allows  complex behaviors and timing models to be expressed eas   ily  The use of separate threads to model concurrent sys   tem pieces  CSP style communication  and a communication  based event timing model means that architectural behav   ior is effectively supported at a variety of grain sizes  One  specific interest we have is in simulating and exploring truly  asynchronous systems for which ASIM is ideal  10 
    
Download Pdf Manuals
 
 
    
Related Search
    
Related Contents
SIMATIC Field PG M3 - Service, Support  34594022 - heidenhain  Epson PowerLite 530  AiM User guide ECU connection of AiM devices via OBDII Release  767  Invacare 1090768 User's Manual  Formes galéniques administrées par voies entérales  Installation and Owner`s Manual Manual de Instrucciones y del  Catálogo de libros - Hospital Hermanos Ameijeiras  Loewe A 26 DVB-T CI User's Manual    Copyright © All rights reserved. 
   Failed to retrieve file