        Catenation and Specialization for Tcl Virtual Machine Performance
less than that of a full-blown JIT compiler, but can still improve performance. It eschews expensive optimizations, and reuses code from the interpreter, which also reduces semantic errors. The compiler uses fixed-size native code templates, and precomputes as much information as possible, and thus runs quickly.

Catenation builds on techniques used by Piumarta and Riccardi to implement their selective inlining [15]. This involves making relocated copies of the native code used by the interpreter to implement each virtual opcode. The original technique placed several constraints on which virtual opcodes could be compiled, because not all code (e.g., call instructions to pc-relative targets) is trivially relocatable. These constraints do not apply to most of the short, low-level opcodes of Forth, or of the VM used to evaluate selective inlining (OCaml). However, the constraints present a severe problem for a virtual machine such as Tcl's, which has much higher-level opcodes. We remove the constraints, so that the code for all Tcl opcodes can be moved, albeit at the expense of portability.

Catenation allows compilation of every instruction in a bytecode program into a separate unit of native code. This property, along with the infrastructure we built to relocate all code, allows us to perform several optimizations. We:

1. specialize the operands of bytecode instructions into the operands of the native instructions in the template, and employ
Dynamic native optimization of interpreters. In Proc. of the 2003 Workshop on Interpreters, Virtual Machines and Emulators.

[20] Sun Microelectronics. UltraSPARC IIi User's Manual, 1997.

[21] Tcl Core Team. Tcllib benchmarks [online], 2003. Available from: http://www.tcl.tk/software/tcllib/.

[22] B. Vitale. Catenation and Operand Specialization for Tcl Virtual Machine Performance. Master's thesis, University of Toronto, 2004.
The offset map is used to determine the native destination address for jump instructions, and for exception handling, described below.

The second pass then copies the native code for each template. For example, to compile the incr_immed opcode shown in Figure 4, the essential operation consists of looking up the starting and ending addresses of the native code for this opcode in the interpreter, by referring to the inst_table,

¹By unit of code, we mean an object in the Tcl object system whose type is code. This typically corresponds to a proc (procedure), but might also be an anonymous block of code, such as a command typed interactively at the interpreter prompt, or passed to the eval command, etc.

    typedef void *ntv_addr;

    struct {
        ntv_addr start, end;
    } inst_table[] = {
        /* "Labels as Values" */
        { &&inst_incr_immed_begin,
          &&inst_incr_immed_end },
        /* ... */
        { 0 }
    };

    #define NEXT_INSTR goto *(inst_table[*vpc].start)

    /* increment integer object at top of stack by arg */
    inst_incr_immed_begin:                    /* <-- label */
    {
        int incr_amount;
        Tcl_Obj *object = *tosPtr;            /* tos is Top Of Stack */

        /* skip opcode, extract arg */
        incr_amount = *(uchar *) ++vpc;
        object->int_val += incr_amount;
        ++vpc;                                /* advance vpc over argument */
    }
    inst_incr_immed_end:                      /* <-- label */
    NEXT_INSTR;

Figure 4: Using gcc's labels-as-values extension to delineate the interpreter case for an opcode.
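To illustrate the jump conversion concretely, the following sketch (all names, such as native_off and patch_branch_target, are hypothetical, not from the actual implementation) redirects a virtual branch to the catenated code of its virtual target using the offset map:

    #include <stddef.h>

    /* Assumed, architecture-specific helper that rewrites (or
       synthesizes) a native branch at 'branch' to go to 'target'. */
    extern void patch_branch_target(unsigned char *branch,
                                    unsigned char *target);

    void convert_virtual_jump(unsigned char *code_buf,   /* catenated unit    */
                              const size_t *native_off,  /* vpc -> offset map */
                              size_t jump_vpc,           /* branching instr   */
                              size_t target_vpc)         /* its virtual target */
    {
        /* The real system knows the branch's exact offset within its
           template; for brevity this sketch patches at the template start. */
        unsigned char *branch = code_buf + native_off[jump_vpc];
        unsigned char *target = code_buf + native_off[target_vpc];

        patch_branch_target(branch, target);
    }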
so dynamic execution paths do not look (as they do in this case) exactly like static code in memory. Catenation handles control flow by changing virtual machine jumps into native code jumps. In Figure 3d, the native code for the second virtual instruction (push 3) starts at native address 4. Thus a virtual jump to virtual pc 2 would be compiled into a native jump to address 4.

Catenation is based on the use of fixed-size templates of native code for each virtual opcode. Each template is self-contained, and does not depend on the context of the opcode being compiled. The separate cases of an interpreter largely meet this description, and indeed we generate our templates directly from the interpreter. Following Piumarta and Riccardi [15], we leverage interpreter-oriented extensions in the GNU C compiler to treat C language labels as values, delineating the boundaries of native code for each virtual opcode, as shown in Figure 4.

Catenating a unit¹ of Tcl bytecodes is a two-pass process. The first pass computes the total size of the resulting native code, by adding up the sizes of the templates for each bytecode instruction in the unit. It then allocates a native code buffer of the appropriate size. The size of a template may be slightly larger than the interpreter case, to allow room for instructions we synthesize and emit at the end of a template. These are used for jumps, debugging, and profiling instrumentation. The pass also builds a map of the native code offset for each bytecode.
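The two passes can be pictured with a much simplified sketch. It assumes hypothetical helpers template_size() and template_start(), one-byte virtual instructions (real Tcl instructions carry operand bytes), and ordinary malloc() memory (the real buffer must be executable):

    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical helpers: size and start of the native template for
       an opcode, as recorded by the initialization-time analysis. */
    extern size_t template_size(unsigned char opcode);
    extern void  *template_start(unsigned char opcode);

    /* Catenate a unit of n one-byte instructions.  native_off receives
       the offset map: native_off[vpc] is the offset of the native code
       for the virtual instruction at vpc. */
    void *catenate(const unsigned char *bytecode, size_t n, size_t *native_off)
    {
        size_t total = 0;

        /* Pass 1: total size, plus the vpc -> native offset map. */
        for (size_t vpc = 0; vpc < n; vpc++) {
            native_off[vpc] = total;
            total += template_size(bytecode[vpc]);
        }

        unsigned char *buf = malloc(total);  /* executable memory in practice */

        /* Pass 2: copy each instruction's template into place. */
        for (size_t vpc = 0; vpc < n; vpc++)
            memcpy(buf + native_off[vpc],
                   template_start(bytecode[vpc]),
                   template_size(bytecode[vpc]));

        return buf;
    }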
While it was still profitable to compile code that was executed many times, the faster technique vastly broadens the applicability of catenation.

The final patch type is UPDATE_PC. While we don't maintain the vpc during execution, the nature of catenated code makes it easy to rematerialize. UPDATE_PC is used on the right-hand side of assignments to the interpreter variable vpc, and is patched with the constant value vpc would have had during execution of the original interpreter. In catenated code there is a separate body of native code for every static bytecode instruction, and so the value of vpc is implicitly known. For example, an UPDATE_PC in the native code emitted for the virtual instruction at vpc 5 is patched to the constant 5. We set the value only in exception handling paths in the interpreter code. To conserve space, we will not describe Tcl exception handling here, except to say that stack unwinding and handler resolution is done at runtime, using a map from vpc of exception to vpc of handler. Another map allows us to report source line number diagnostics on uncaught errors. We could have translated all these maps to native addresses, entirely eliminating the vpc in our case, but the current implementation works well, and it's useful to demonstrate rematerialization of vpc, because one can imagine other interpreters or dynamic recompilation systems requiring it.
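A simplified sketch of the idea follows. The placeholder value and the word-scanning loop are illustrative inventions only; the real system rewrites the immediate fields of specific Sparc instructions at offsets precomputed per template:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical placeholder; the paper's magic values look like
       0x7bc5c5c1, but this particular constant is our own. */
    #define MAGIC_UPDATE_PC 0x7bc5c5c2u

    /* Exception paths in the template source read, in effect:
           vpc = MAGIC_UPDATE_PC;
       After catenation, the placeholder inside the copy made for the
       instruction at static_vpc is overwritten with that constant. */
    void patch_update_pc(uint32_t *words, size_t nwords, uint32_t static_vpc)
    {
        for (size_t i = 0; i < nwords; i++)
            if (words[i] == MAGIC_UPDATE_PC)  /* simplistic scan; the real
                                                 system uses precomputed
                                                 patch offsets */
                words[i] = static_vpc;
    }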
Catenation and Specialization for Tcl Virtual Machine Performance

Benjamin Vitale (bv@cs.toronto.edu)
Department of Computer Science

Tarek S. Abdelrahman (tsa@eecg.toronto.edu)
Edward S. Rogers Sr. Department of Electrical and Computer Engineering

University of Toronto, Toronto, M5S 3G4 Canada

ABSTRACT

We present techniques for eliminating dispatch overhead in a virtual machine interpreter using a lightweight just-in-time native code compilation. In the context of the Tcl VM, we convert bytecodes to native Sparc code, by concatenating the native instructions used by the VM to implement each bytecode instruction. We thus eliminate the dispatch loop. Furthermore, immediate arguments of bytecode instructions are substituted into the native code using runtime specialization. Native code output from the C compiler is not amenable to relocation by copying; fix-up of the code is required for correct execution. The dynamic instruction count improvement from eliding dispatch depends on the length, in native instructions, of each bytecode opcode implementation. These are relatively long in Tcl, but dispatch is still a significant overhead. However, their length also causes our technique to overflow the instruction cache. Furthermore, our native compilation consumes runtime. Some benchmarks run up to three times faster, but roughly half slow down, or exhibit little change.

Categories and Subject Descriptors: D.3.4 [Programming Languages]:
The machine is a Sun Microsystems SunBlade 100, with a 502 MHz UltraSPARC IIe, 640 MB RAM, a 16 KB 2-way set-associative instruction cache, and a 16 KB direct-mapped data cache. The 4-way unified Level 2 cache is 256 KB. It runs the Solaris 8 operating system. The Tcl benchmarks are: runtime evaluation of a simple arithmetic expression of constants and variables, a hash function involving many bitwise operations, the factorial function, multiplication of two 15x15 matrices, and a long if-then-else tree.

[Figure 8: Performance counter benchmarks. Bar chart of execution time for EXPR, MD5, FACT, MATRIX, and IF, in cycles relative to the bytecode interpreter, broken into work, dispatch, bytecode compile (bc_compile), native compile (ntv_compile), and I-cache stall components.]

The results are depicted in Figure 8. For each benchmark, the bar on the left shows the performance of the benchmark with the original bytecode interpreter. The bar on the right shows the same, but with our catenating compiler. The performance of each benchmark is measured by the total number of execution cycles, normalized with respect to the original bytecode interpreter. Cycles are broken down into those spent on useful work, those devoted to dispatch, those devoted to bytecode compilation and to native compilation, and those wasted in instruction cache stalls.

As intended, catenation removes all dispatch overhead
in all cases. Furthermore, the three optimizations it enables (operand specialization, virtual-to-native branch conversion, and elimination of the virtual program counter) substantially reduce the amount of work cycles in all cases. For three cases, FACT, MATRIX, and IF, the result is a significant improvement in total execution time.

Our techniques do not always reduce execution time. Sometimes it is mostly unchanged, or actually increased. There are two main problems, one of which shows up in the EXPR benchmark, and the other in MD5. While not shown, we measured dynamic instruction counts in addition to cycles. In every case besides EXPR, catenation reduces instruction counts, because it saves dispatch, and benefits from the subsequent optimizations. EXPR (unbraced), however, requires a large amount of compilation time. In fact, this benchmark is a contrived idiom to force continuous late recompilation due to Tcl semantics. This idiom is well known by experienced programmers, and generally avoided. We include it here to underscore that any workload which spends lots of time recompiling will do poorly in any JIT compiler, including ours. The JIT must regenerate native code each time bytecode is regenerated. Our JIT is relatively fast, but typically requires between 100% and 150% as much time as the bytecode compilation.

The MD5 benchmark exhibits a more serious problem. The catenated code does slightly less work than the interpreted version,
in Tcl's large opcode bodies, and that its removal via catenation significantly improves performance for some benchmarks. Thus we believe our techniques are applicable to VMs with large opcode bodies.

Furthermore, Ertl and Gregg [8] found that I-cache stalls induced by code growth due to replication are not a major problem. In contrast, in our system, we find that I-cache stalls are a problem. This is perhaps not surprising, given the larger opcodes in the Tcl VM.

Exploiting the C compiler to build templates for code copying is a clever and largely portable technique. We have extended it, allowing all opcode bodies to be moved, but sacrificed portability. Furthermore, our experience leaves us with the opinion that the approach is too brittle for general experimentation, depending excessively on the compiler and machine architecture. A more explicit code generation infrastructure would be more flexible for exploring the interesting issues surrounding native compilation and optimization of dynamic languages such as Tcl.

Catenation implicates the classic inlining tradeoff, and a more selective technique is required, perhaps preferring code expansion to dispatch overhead only where profitable. Mixed-mode execution might facilitate this, and we would like to explore it in future work. A key related question is deciding when to compile and when to interpret. A useful heuristic might be based on the potential correlation between
Alternatively, in direct-threaded code, the bytecode program represents opcodes by the native addresses of their implementations, rather than one-byte opcode numbers, saving the table lookup. This dispatch can be expressed in roughly three machine instructions. Its execution time depends largely on the time to handle the indirect branch to the next opcode, but on typical workloads it averages about 12 cycles on a modern CPU such as the UltraSPARC III we use.

On our Sparc, switch dispatch requires roughly 12 dynamic instructions, or 19 cycles. We found token threading takes about 14 cycles. If the bodies (the "real work") of the typical instruction are small, as with Forth, this overhead can dominate interpreter performance. Even with larger bodies, such as Tcl's, the overhead is significant. For example, the Tcl code to compute the factorial function shown in Figure 2 uses, on average, about 100 cycles per virtual opcode. Thus indirect threaded dispatch accounts for 16% overhead. Even direct threading, at 12 cycles, takes significant time. We thus pose the question: can all dispatch be eliminated? And, if so, is this profitable?

3. CATENATION

To remove all dispatch, we must instead execute native code. The typical approach is a full-blown compiler, including an intermediate representation and code generator, which can do much more than simply avoid dispatch. In this section, we describe a simpler approach to just avoid dispatch.
Among other things, this will reduce the need to maintain the virtual program counter. Catenated code provides the foundation for another technique that removes the fetches. Next, we describe this technique, which we call operand specialization.

The implementation of a virtual opcode in the original interpreter is generic, in the sense that it must handle any instance of that opcode, in any location in any bytecode program, and with any valid values for operands. After catenation, on the other hand, there is a separate instance of native code implementing each static bytecode instruction in the program. During code generation, we know the operands of each instruction, and can treat them as runtime constants. We specialize a template with the values of these constants. Essentially, we are compiling bytecode instruction operands into native instruction operands.

Now we will describe how to enhance the templates so they are appropriate for specialization. Given a virtual opcode with operands, one of the first tasks of the interpreter body implementing that opcode is to fetch and decode the operands. They are usually placed in variables. We remove the interpreter code that implements this fetch, and replace the variable with a magic number of an appropriate size. At specialization time, we substitute the magic number with the real operands from the bytecode program.
and thus the processor's branch predictor does poorly, because it uses the address of the branch instruction as the main key when predicting the target [8].

Interpreters can employ many other dispatch techniques [3, 6, 7]. In the "function table" approach, each opcode body is a separate function. Dispatch consists of a for loop, looking up the address of each function in a table, and an indirect function call. By making the table lookup explicit in the code, instead of letting the compiler generate it (as with switch), it can sometimes be more efficient, avoiding the spurious bounds check, for example. However, the function call, with associated stack frame setup, register saving, etc., is more expensive than the simple jump required in switch.

    # compute n!, i.e. factorial function
    proc factorial {n} {
        set fact 1
        while {$n > 1} {
            set fact [expr {$fact * $n}]
            incr n -1
        }
        return $fact
    }

Figure 2: Tcl code to compute factorial function.

Another approach is to remove the dispatch from the external for/switch loop, and instead place it at the end of each opcode body. The interpreter community refers to this as shared dispatch. This saves the jump from the end of each body to the single dispatch code. More importantly, it gives the branch predictor many static indirect jumps to work with, dramatically improving its chances of success. If switch-like table lookup dispatch is placed in each opcode body, this is known as token threading.
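For comparison with Figure 1, here is a minimal sketch of token-threaded dispatch, using the GNU C labels-as-values extension discussed in Section 5 (the opcode set and bodies are illustrative only):

    void interpret_tt(const unsigned char *program)
    {
        /* Label-address table, indexed by opcode number. */
        static void *table[] = { &&op_add, &&op_sub, &&op_halt };
        const unsigned char *pc = program;

    #define NEXT_INSTR goto *table[*pc]   /* dispatch replicated per body */

        NEXT_INSTR;

    op_add:
        /* ... body of INST_ADD ... */
        pc += 1;
        NEXT_INSTR;

    op_sub:
        /* ... body of INST_SUB ... */
        pc += 1;
        NEXT_INSTR;

    op_halt:
        return;
    }

Direct threading goes one step further: the program stores the label addresses themselves (an array of void * rather than opcode bytes), so dispatch becomes goto **dtpc++, eliminating the table lookup.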
to the research community, and is no longer actively developed.

8. CONCLUSION AND FUTURE WORK

In this paper, we presented techniques that allow us to "compile away" the bytecode program, acting as a very naive JIT compiler. However, almost all the runtime infrastructure of the interpreter remains, so it may be better to think of this system as an advanced interpretation technique, rather than a true compiler. In practice, there is a range of techniques from simple interpretation to advanced dynamic compilation, which trade off implementation size and complexity, start-up performance, and portability. We offer a new point on this spectrum, and implement it in a non-portable but transparent and complete Tcl interpreter system.

Our experimental evaluation indicates that catenation and the optimizations it enables improve the performance of several benchmarks, often significantly. On the other hand, for some benchmarks, the I-cache stalls often induced by code growth from catenation degrade performance. The overall effect, on a typical microprocessor, is that about half our benchmarks speed up, by as much as a factor of 3, but the other half slow down.

Ertl and Gregg [9] suggest that dispatch is more of a problem for small opcode bodies, and that the architecture of "inefficient" popular VMs should be improved before advanced interpretation techniques are applied. On the other hand, we find that dispatch is a source of overhead
in more detail below. Refer to Figure 5 to see the interpreter C code for the push opcode, with our conditionally compiled variation to produce a template for specialization. Figures 6 and 7 show the assembly language output by the C compiler for each variation.

In addition to removing the load of the operand from the bytecode stream, Figure 7 shows that several related instructions were also removed. In the Tcl VM, the operand of the push opcode is not the address of the object to be pushed. Rather, the operand is an unsigned integer index into the literal table, a table of objects stored along with the bytecode. We treat the index, and the table, as runtime constants at specialization time, then perform a simple constant propagation. The result is that push is reduced from twelve Sparc instructions (twenty including dispatch) to seven, including one load instruction instead of four. This is an extreme example, because push is much shorter than most Tcl opcodes, whose bodies can average hundreds of instructions. However, it is significant, because push is an extremely common opcode, both statically and dynamically. We also employ this table-lookup constant propagation on the builtin_math_func opcode, which takes a small unsigned integer argument to indicate which of several math functions should be invoked (e.g. sin, sqrt, etc.).

Catenation runs very fast, because it requires only copying native code, followed by a few fixups and linking.
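The following sketch shows the shape of this constant propagation (all names are hypothetical; patch_operand stands for the immediate-rewriting step whose output appears in Figure 7). The literal table lookup happens once, at specialization time, rather than on every execution of the push:

    #include <stddef.h>

    typedef struct Tcl_Obj Tcl_Obj;   /* opaque here */

    /* Assumed helper: rewrite the sethi/or immediate pair at the given
       address (cf. Figure 7) to load 'value'. */
    extern void patch_operand(unsigned char *insn_pair, unsigned long value);

    void specialize_push(unsigned char *copy,      /* catenated template copy  */
                         size_t operand_off,       /* precomputed patch offset */
                         Tcl_Obj **literal_table,  /* runtime constant         */
                         unsigned literal_index)   /* operand from bytecode    */
    {
        /* Fold the table lookup: performed once here, never at runtime. */
        Tcl_Obj *obj = literal_table[literal_index];

        /* Substitute the object's address into the template copy. */
        patch_operand(copy + operand_off, (unsigned long) obj);
    }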
    enum { INST_ADD, INST_SUB /* , ... */ };

    void interpret(unsigned char *program)
    {
        unsigned char opcode, *pc = &program[0];
        int sum;

        for (;;) {
            opcode = *pc;
            switch (opcode) {
            case INST_ADD:
                sum = POP() + POP();
                PUSH(sum);
                pc += 1;
                break;
            case INST_SUB:
                /* ... */
                break;
            /* ... other instruction cases ... */
            }
        }
    }

Figure 1: Simple interpreter dispatch using C for and switch statements.

Section 5 provides some details of our implementation, including coaxing the C compiler into generating templates, and subsequent object code processing. We evaluate the performance of these techniques in Section 6. Finally, in Sections 7 and 8 we discuss related and future work, and conclude.

2. TRADITIONAL INTERPRETERS

The most fundamental job of the virtual machine is to dispatch virtual instructions in sequence. In this section, we compare some of the techniques used in practice. A typical dispatch loop is shown in Figure 1. Here dispatch is implemented as a C switch statement, and kept separate from the implementation of opcode semantics. A C compiler will likely translate this to a table lookup. While efficient in principle, the compiler often includes extra instructions for a bounds check. Furthermore, there is only one copy of this table lookup code, and each opcode body must end with a jump back to it. The two loads (get opcode from program, then look up opcode in switch table) and the indirect branch required are pipeline hazards on typical microprocessors. Worse, there is only one static copy of the indirect branch to dispatch to many possible opcode targets,
To each opcode case, we append a macro for token-threaded dispatch, so that the optimizer is constrained to produce templates suitable for catenation, as discussed in Section 3. We redefine operand fetch macros to various magic constants, instead of code to perform actual fetches from the bytecode stream. We use a constant of the appropriate size and type so that the template is efficient. For example, if a virtual opcode takes a one-byte integer operand, we use a magic constant less than 256. For a four-byte operand, we use a constant that forces the compiler to generate code general enough to load any four-byte operand. On the Sparc, this is a two-instruction sequence. Occasionally, we had to manually introduce uses of the C volatile keyword, or other tricks, to force the optimizer to not propagate the magic constants too deeply into the opcode implementation. Other times, our code rewriting system, discussed below, handles the optimizer's output.

At build time, the bodies are assembled into a single C function, with each delineated by labels as shown in Figure 4. We generate C code to store all these labels in a table, indexed by opcode number. Then the function is compiled to assembly, with gcc set to optimize. The resulting assembly is largely suitable for templates, but we must post-process it to undo the work of the C optimizer in two situations. The first occurs when a native branch instruction in the middle of a body branches to the exit of the body.
target architectures, some including full-system emulation. Where we use magic numbers to identify points for specialization in our templates, QEMU stores operands in global variables, and exploits the ELF relocation metadata generated by the linker. Some unwarranted chumminess is still required to elide function prologues, etc. CPU opcode instruction bodies tend to be small, with complex implementations "outlined" to called functions, and thus catenation performs well.

Brouhaha [13] is a portable Smalltalk VM that executes bytecode using a direct-threaded interpreter built from native code templates. The templates are written in a style similar to function table dispatch, but then, like our system, after compilation the assembly is post-processed to remove prologues and epilogues, and make other transformations. Where we post-process using Tcl, Brouhaha uses sed, and does significantly more rewriting. Little runtime rewriting seems required, although neither superinstructions, replication, nor operand specialization are implemented.

DyC [10] is a dynamic compilation system based on programmer annotations of runtime constants, which are used in dynamic code generation to exploit specialization by partial evaluation. The authors motivate one of their papers using the example of a bytecode interpreter, in this case m88ksim, a simulation of a Motorola 88000 CPU. The input data for this benchmark is a bytecode program. DyC treats
a drop-in replacement for an interpreter, preserving the interactive use of a virtual machine system, especially important for scripting. Unfortunately, a JIT compiler is typically large, and complex to develop and portably deploy. Furthermore, the delay for compilation at runtime may interfere with interactive use.

In this paper, we present an implementation of a relatively new point in the design space between interpretation and JIT compilation. Our technique, which we call catenation, eliminates the overhead of instruction dispatch in the virtual machine. Catenation creates a native code version of a bytecode program by making copies of the native code used by the interpreter to execute each of the virtual instructions in the original program. The resulting native code has certain properties that enable several optimizations, in particular the removal of operand fetch code from the bytecode implementations ("bodies").

We have implemented this technique for the Tcl scripting language. Originally, Tcl interpreted scripts directly from source, but in 1995 evolved to compile scripts to bytecodes on the fly, and then interpret the bytecodes [11]. While the bytecode compiler improves performance significantly, many Tcl applications still demand more speed. Our system extends the bytecode compilation with an additional native compilation step, targeting the Sparc microprocessor.

The programming effort required in our approach is substantially
and the entire Tcl virtual machine is linked together in the traditional fashion. In the rest of this section, we describe the runtime manipulations of the templates to accomplish compilation of Tcl bytecode.

5.2 Compiler Initialization

During and after catenation, we apply many small changes to the fixed-length templates to complete the process of compilation. These changes fix the code so that it executes correctly after it is relocated, and also handle operand specialization. We call these fixups patches, and refer to the overall process as patching. For example, when a native call instruction targeting a C library subroutine (e.g. malloc) is relocated during catenation, the target will be wrong, because Sparc calls are pc-relative; that is, the target depends on the address of the call instruction itself. A patch is used to correct this problem, by adjusting the target address in the relocated instruction. This patch type actually applies to any pc-relative instruction whose target lies outside the template being moved. A few other patch types are described below.

To accelerate catenation in general, and patching in particular, we analyze the native code once at initialization time. We locate the starting and ending points of the template for each opcode, and find and store the native code offsets, types, and sizes of each patch required by each opcode. Using this information, a patch can be applied very quickly, essentially in two steps.
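As a concrete illustration, a Sparc call instruction holds a 30-bit signed word displacement [18], so the fix-up amounts to recomputing that displacement for the instruction's new address. A hedged sketch, not the actual implementation:

    #include <stdint.h>

    /* Re-target a relocated, pc-relative Sparc call instruction.
       old_at / new_at are the call's word addresses before and after
       copying; the target is encoded as a signed 30-bit word
       displacement from the call itself. */
    void patch_pc_relative_call(uint32_t *new_at, const uint32_t *old_at)
    {
        int32_t disp30 = (int32_t)(*new_at & 0x3fffffff);
        if (disp30 & 0x20000000)
            disp30 |= (int32_t) ~0x3fffffff;         /* sign-extend */

        const uint32_t *target = old_at + disp30;    /* original callee */
        int32_t new_disp = (int32_t)(target - new_at);

        *new_at = (*new_at & 0xc0000000u)            /* keep op/disp fields */
                | ((uint32_t) new_disp & 0x3fffffff);
    }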
Optimizing Indirect Branch Prediction Accuracy in Virtual Machine Interpreters. In Proc. of PLDI, 2003.

[9] M. A. Ertl and D. Gregg. The Structure and Performance of Efficient Interpreters. Journal of Instruction-Level Parallelism, 5:1-25, 2003.

[10] B. Grant, M. Mock, M. Philipose, C. Chambers, and S. J. Eggers. DyC: an expressive annotation-directed dynamic compiler for C. Theoretical Computer Science, 248(1-2):147-199, 2000.

[11] B. Lewis. An on-the-fly bytecode compiler for Tcl. In Proc. of the 4th Annual Tcl/Tk Workshop, 1996.

[12] P. S. Magnusson and F. Larsson et al. SimICS/sun4m: A Virtual Workstation. In Proc. of the Usenix Annual Technical Conference, 1998.

[13] E. Miranda. BrouHaHa: A Portable Smalltalk Interpreter. In Proc. of OOPSLA '87, pages 354-365.

[14] S. Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann, 1997.

[15] I. Piumarta and F. Riccardi. Optimizing direct-threaded code by selective inlining. In Proc. of PLDI, pages 291-300, 1998.

[16] F. Rouse and W. Christopher. A Typing System for an Optimizing Multiple Backend Tcl Compiler. In Proc. of the 5th Annual Tcl/Tk Workshop, 1997.

[17] M. Sofer. Tcl Engines [online]. Available from: http://sourceforge.net/projects/tclengine/.

[18] SPARC International, Inc. The SPARC Architecture Manual, Version 8, 1992.

[19] G. T. Sullivan, D. L. Bruening, I. Baron, T. Garnett, and S. Amarasinghe.
but the overall execution time stays the same. As the chart shows, increased I-cache misses defeat the advantages of removing dispatch and improving work efficiency. Now, in many cases, catenated code exhibits better I-cache performance, because useful code is tightly packed in memory, without intervening unused opcode implementations. However, catenation causes major code growth; on average, we measure a factor of 30. The expanded code's working set can easily exceed a typical 32 KB I-cache, and consequently I-cache stall cycles can overwhelm the savings from removing dispatch and operand fetch.

6.2 Varying I-cache Size

To further explore the effect of I-cache misses induced by code growth, we ran the entire set of 520 Tcl benchmarks, with the original and catenating interpreters, under the Simics [12] full-machine simulator, configured in separate experiments with I-caches of 32, 128, and 1024 KB. A fourth experiment simulated an infinite I-cache, by running with no latency to the external L2 cache. The benchmarks are all quite short, but some solve interesting problems, such as gathering statistics on DNA sequences.

At the realistic 32 KB I-cache size, 54% of benchmarks actually run slower using catenated code. The larger 128 and 1024 KB caches slowed 45% and 34% of benchmarks, respectively. Even with the infinite I-cache, 18% of benchmarks slow down. This is because there is not enough execution time to amortize
opcode body size and I-cache performance. To study this, we would like to apply our technique to other interpreters, including some with shorter, lower-level bodies than Tcl. Finally, we would like to explore outlining of large instruction bodies, perhaps by moving them to separate functions, and calling these from the body.

9. ACKNOWLEDGMENTS

We thank Angela Demke Brown and Michael Stumm for their ideas and encouragement. Colleagues Mathew Zaleski and Marc Berndl offered stimulating brainstorms. Virtutech Inc. provided an academic license for its excellent Simics full-system simulator. Finally, the anonymous reviewers' feedback was invaluable.

10. REFERENCES

[1] J. Aycock. Converting Python Virtual Machine Code to C. In Proc. of the 7th Intl. Python Conf., 1998.

[2] V. Bala, E. Duesterwald, and S. Banerjia. Dynamo: a transparent dynamic optimization system. In Proc. of PLDI, 2000.

[3] J. R. Bell. Threaded code. Communications of the ACM, 16:370-372, 1973.

[4] F. Bellard. QEMU x86 CPU emulator [online], 2004. Available from: http://fabrice.bellard.free.fr/qemu/.

[5] D. Cuthbert. The Kanga Tcl-to-C converter [online], 2000. Available from: http://sourceforge.net/projects/kt2c/.

[6] R. B. Dewar. Indirect threaded code. Communications of the ACM, 18:330-331, 1973.

[7] M. A. Ertl. Threaded code [online], 1998. Available from: http://www.complang.tuwien.ac.at/forth/threaded-code.html.

[8] M. A. Ertl and D. Gregg.
The normal target of the branch is one native instruction beyond the end of the case, which is precisely what is required for catenation. However, the optimizer exploits the Sparc delayed branch slot [18], schedules an instruction of the (extraneous) dispatch code after the branch, and changes the branch target to two instructions beyond the end. Suppressing the optimizer's delayed branch scheduler using a command-line switch to the C compiler was not a reasonable option, because this optimization is important to the performance of Sparc code, and we want to leave most such transformations intact.

The second problem occurs when the compiler applies a tail-merging [14] optimization to two or more opcode cases. This results in the cases sharing part of their implementation, but catenation requires that each case be self-contained, because later passes may make changes to the code, and these must be separate for each opcode.

Using text processing techniques on the assembly output, we "deoptimize" the delayed branch and tail-merging transformations, if they have been applied in the places described. Together, they affect 14 of 97 opcodes. All build-time manipulation and assembly post-processing for deoptimization is performed with Tcl scripts. These scripts, after bootstrapping, are executed by a Tcl "interpreter" which incorporates our native code compiler.

After post-processing, the templates are assembled,
Processors, Interpreters.

General Terms: bytecode interpreters

Keywords: virtual machines, just-in-time compilation, Tcl

1. INTRODUCTION

Many portable high-level languages are implemented using virtual machines. Examples include traditional programming languages such as Java, and scripting languages such as Tcl, Perl, and Python. The virtual machine may provide portability and intimate linkage with a high-level runtime system that is more sophisticated than a typical general-purpose hardware machine; for example, it may provide garbage collection, object systems, and other services.

One penalty of the virtual machine approach is lost performance. Because the virtual machine code is interpreted, it runs slower than native code. Scripting systems are usually designed for portability, flexibility, extensibility, and expressiveness, not high performance. However, they are powerful enough that they are used to implement entire software systems, and thus their performance is important. The efforts to address the VM performance problem can be divided into two main camps. Clever interpreter designs, such as threaded code, can make the process of interpretation faster. Alternatively, compilers can translate the virtual machine code into native code, and execute that directly. Just-in-time (JIT) compilers make this translation at runtime. Avoiding a separate, slow, static compilation step means that a JIT compiler can try to be
Specialization consumes more time, running for every operand of every bytecode instruction. It still can run quickly, because our templates are fixed-size, and we can precompute the offsets where bytecode operands can be specialized into native code operands. In the next section, we give details of this and other aspects of our implementation.

    add   %l6, 4, %l6             ! incr VM stack pointer
    sethi %hi(operand), %o1       ! object to push *
    or    %o1, %lo(operand), %o1  ! object to push *
    st    %o1, [%l6]              ! store to top of VM stack
    ld    [%o1], %o0              ! next 3 instructions incr
    add   %o0, 1, %o0             !   ref count of pushed object
    st    %o0, [%o1]

Figure 7: Template for push opcode, compiled from Figure 5 using the SPECIALIZE variation. Note that it is much shorter than Figure 6. The native operands of the instructions marked * are points for operand specialization. The sethi/or instruction pair is the Sparc idiom for setting a 32-bit constant in a register.

5. PREPARING AND USING TEMPLATES

We build our templates using the output of the GNU C compiler compiling a modified version of the Tcl interpreter. This assembly language output itself requires two small build-time transformations, and is then conventionally linked into the Tcl virtual machine. Finally, at runtime, our catenating compiler is divided into several phases, which include transformations on the templates. In this section, we describe each of these steps.

5.1 Building Templates
some constant propagation;

2. convert virtual branches into native branches; and

3. elide maintenance of the virtual program counter, and, if necessary, rematerialize it.

In evaluating its performance, we found that our technique improves some benchmarks significantly, but slows others down. We attribute the slowdown to more instruction cache misses, resulting from the code growth caused by catenation. As we discuss in Section 7, we feel this result is interesting in light of recent work by Ertl and Gregg [8]. They find that a replication technique similar to our catenation also causes code growth, but that the code growth rarely causes enough I-cache misses to hurt performance. The key distinction, perhaps predicted by Ertl and Gregg, is that Tcl and many other popular VMs are not "efficient interpreters". The large opcode bodies in such VMs suffer less dispatch overhead, and thus gain less from its removal. More importantly, the large bodies, when replicated, cause excessive code growth.

The rest of this paper is organized as follows. Section 2 reviews the structure and performance of some common techniques for interpreter dispatch. Section 3 presents catenation, our approach to eliminating dispatch overhead. Section 4 introduces operand specialization, which removes the overhead of bytecode operand fetch.
The second experiment tries to answer questions raised by the first, by running a larger set of benchmarks on both our catenating Tcl interpreter and the original, while varying only the size of the instruction cache. This hypothetical scenario requires a simulation infrastructure, because of the lack of variety in I-cache sizes within generations of the Sparc CPUs. Using CPUs from different generations, which have substantial architectural differences, would confound the results.

6.1 Cycle Counts and I-Cache Misses

The highest-precision clock available for our first experiment is the CPU's cycle counter, available using the Sparc performance counters [20], which are present in the sparcv8 instruction set on the UltraSPARC II and later CPUs. Two 64-bit hardware registers can each be programmed to collect any one of a wide variety of events, such as cycles, retired instructions, cache misses, branch mis-predicts, etc. To facilitate the experiment, we implemented a Tcl command to collect performance statistics while running arbitrary Tcl code, and separately track bytecode compilation time, native compilation time, and execution time. We can also choose whether to include events during execution of system code on behalf of the application. We ran our benchmarks on an otherwise unloaded machine, and exclude events incurred while the operating system was executing other programs.
If these assertions fail, the interpreter code must be retooled. This happened during development with a handful of opcodes, because our pattern matcher did not yet recognize all variations of compiler output.

The input and output types for each patch can largely be determined from the type of virtual opcode (e.g. virtual branch) or operands, or during the analysis phase (e.g. native pc-relative calls). There is a small table of exceptions, coded by hand to handle special cases. Our code contains a large number of assertions. During development, we identified the exceptions when a small number of assertions failed.

5.3 Compilation

A code unit is compiled to native code only the first time it is executed. The process runs quickly, because most of the work has been done in the initialization pass. The compilation is simply "interpretation" of the patches for the catenated program.

Each patch can be interpreted very quickly, because it requires only a load from the bytecode stream, possibly another to index through constant tables, sometimes one or two adds to handle pc-relative instructions, possibly a lookup in the virtual-to-native pc map, and a few operations for masking and shifting the data into the destination native instruction, then a store. On average, this requires about 120 µs per patch. The initialization pass requires about 4 ms. An initial version of our system did not perform the separate initialization pass.
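In outline, compilation is thus a tight loop over precomputed patches, something like the following sketch (the struct layout and helper names are hypothetical):

    #include <stddef.h>

    /* Hypothetical patch descriptor, precomputed at initialization. */
    typedef enum { IN_U1, IN_U4, IN_JUMP_TARGET, IN_LITERAL } in_type;
    typedef enum { OUT_IMM13, OUT_SETHI_OR, OUT_CALL30, OUT_BRANCH } out_type;

    struct patch {
        size_t   tmpl_off;  /* where in the template copy to write      */
        size_t   arg_off;   /* where the operand sits in the bytecode   */
        in_type  in;        /* how to fetch and transform the input     */
        out_type out;       /* which native instruction format to write */
    };

    /* Assumed helpers implementing the input/output types. */
    extern unsigned long fetch_input(const unsigned char *bytecode,
                                     const struct patch *p);
    extern void write_output(unsigned char *copy, const struct patch *p,
                             unsigned long value);

    /* "Interpret" the patches for one instruction's template copy. */
    void apply_patches(unsigned char *copy, const unsigned char *bytecode,
                       const struct patch *patches, size_t npatches)
    {
        for (size_t i = 0; i < npatches; i++)
            write_output(copy, &patches[i],
                         fetch_input(bytecode, &patches[i]));
    }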
5.4 Implementation Notes

The implementation is divided into several modules, following the structure described above. The pre-computation of templates and patches is implemented in 771 lines of C, containing 462 semicolons. The code generator, which uses the templates to implement catenation and operand specialization, is 581 lines long, or 258 semicolons. While these modules include some profiling instrumentation, an additional 535 lines (240 semicolons) of Tcl extension commands collect detailed statistics when running on real machines. Finally, for the simulation experiments described below, we created 1181 lines (513 semicolons) of statistics extensions to both Tcl and our simulator.

As described above, we also made small (typically one- or two-line) changes to several Tcl VM opcode implementations. Excluding these changes and the instrumentation code, roughly two weeks of effort were required to code the template and code generators. However, we spent considerably more time finding and resolving subtle bugs, such as the "deoptimizations" (requiring 250 lines of Tcl) described in Section 5.1.

6. PERFORMANCE EVALUATION

To measure the impact of catenation and specialization in our implementation, we constructed two sets of experiments. The first measures execution time on a small number of benchmarks from the Tcl performance benchmark suite [21], to determine if our modified interpreter actually improves performance. We capture detailed micro-architectural statistics.
the cost of native code compilation, except in four (less than 1%) of the benchmarks, due to continuous recompilation, as described earlier.

7. RELATED WORK

Ertl and Gregg [8] evaluate the performance of a large number of advanced interpretation techniques. Their focus is the high proportion of indirect branch instructions in an interpreter (Forth) with short opcode bodies, and the predictability of those branches by a pipelined micro-architecture. Their techniques include forming superinstructions and making replicas of code. Their "dynamic superinstructions across basic blocks with replication" technique is very similar to our catenation, except that they leave some dispatch in place to handle virtual branches, whereas we remove all dispatch. Instead, their key goal is to have more indirect branch instructions (one for each static bytecode instruction) whose behavior and context precisely match the behavior of the bytecode program. This yields much better branch prediction. They also find that I-cache stalls induced by code growth due to replication are not a major problem for their "efficient" Forth VM, which has short opcode bodies. On the other hand, our VM, with large opcode bodies, encounters major I-cache stalls on many benchmarks.

QEMU [4] is a CPU emulator which employs techniques very similar to our catenation and operand specialization. It is more portable than our system, supporting a plurality of host and
    #ifdef INTERPRET
    #define MAGIC_OP1_U1_LITERAL \
        codePtr->objArrayPtr[TclGetUInt1AtPtr(pc + 1)]
    #define PC_OP(x)     pc += (x)
    #define NEXT_INSTR   break

    #elif defined SPECIALIZE
    #define MAGIC_OP1_U1_LITERAL ((Tcl_Obj *) 0x7bc5c5c1)
    #define PC_OP(x)                                  /* unnecessary */
    #define NEXT_INSTR   goto *(jump_table[*pc].start)
    #endif

    case INST_PUSH1:
    {
        Tcl_Obj *objectPtr;

        objectPtr = MAGIC_OP1_U1_LITERAL;
        *++tosPtr = objectPtr;
        Tcl_IncrRefCount(objectPtr);
        PC_OP(2);
    }
    NEXT_INSTR;

Figure 5: Source for interpreter implementation of push opcode, with variation to generate template suitable for specialization.

    add   %l6, 4, %l6         ! incr VM stack ptr
    add   %l5, 1, %l5         ! inc vpc past opcode
    ldub  [%l5], %o0          ! load operand
    ld    [%fp + 0x48], %o2   ! load addr of execution context
    ld    [%o2 + 0x4c], %o1   ! load addr of literal tbl from ctx
    sll   %o0, 2, %o0         ! compute offset into table
    ld    [%o1 + %o0], %o1    ! load from literal table
    st    %o1, [%l6]          ! store to top of VM stack
    ld    [%o1], %o0          ! next 3 instructions increment
    inc   %o0                 !   reference count of pushed object
    st    %o0, [%o1]
    inc   %l5                 ! increment vpc
                              ! rest is dispatch to next instr
    sethi %hi(0x800), %o0
    or    %o0, 0x2f0, %o0
    ld    [%l7 + %o0], %o1
    ldub  [%l5], %o0
    sll   %o0, 2, %o0
    ld    [%o1 + %o0], %o0
    jmp   %o0
    nop                       ! branch delay slot unusable

Figure 6: C compiler's assembly language translation of code in Figure 5, using the INTERPRET variation.

We include several assertions in the compiler initialization code to ensure this results in a suitable template, as described
Consider the problem of interpreting the bytecode program in Figure 3a. The interpreter uses the (imaginary) native instructions in Figure 3b to implement the push and add opcodes, and instruction dispatch. In Figure 3c, we show in bold the dynamic sequence of useful (i.e., non-dispatch) native instructions executed when interpreting this bytecode program. Now we are set to understand the idea of "catenation", which is our technique for improving interpreter performance. If we simply copy into executable memory the sequence of useful work instructions (those shown in bold) we have "compiled" a native code program with exactly the same semantics as the interpreted version. This new program has no dispatch overhead.

    push 2
    push 3
    add

(a) Sample bytecode program. Note: opcode push is used twice, with different operands.

    push:     p1 p2 p3
    add:      a1 a2 a3 a4 a5
    dispatch: d1 d2 d3

(b) Definitions of virtual instruction bodies in native code. Each p, a, etc. represents one native instruction.

    d1 d2 d3 p1 p2 p3 d1 d2 d3 p1 p2 p3 d1 d2 d3 a1 a2 a3 a4 a5

(c) Dynamic sequence of native instructions executed when interpreting program in (a).

    p1 p2 p3 p1 p2 p3 a1 a2 a3 a4 a5

(d) Static sequence of native instructions emitted by catenating compiler for program in (a).

Figure 3: Compiling bytecode objects into catenated copies of native code from the interpreter avoids dispatch overhead.

Of course, most programs contain branches and loops,
and then memcpy()ing the code into the native code buffer.

There are a few things to observe about the code in Figure 4. First, the compiled code still has to fetch operands from the bytecode stream. It does this using the virtual program counter, vpc. The code uses the entire interpreter runtime, interfacing with it at the register level.

Also note that while the instruction template ends with the label inst_incr_immed_end, we still include the NEXT_INSTR code after the template proper. This code consists of a traditional token-threaded dispatch. Its purpose is to force the C optimizer to build a control flow graph which reflects the fact that any opcode case may be preceded or followed by any other opcode case, which is precisely the situation after catenation (and during normal interpretation).

However, even position-independent code output by the C compiler is not intended for this sort of relocation; indeed, considerable transformation of the output is required to make catenation work. We undertake some of these transformations at interpreter build time, and others during catenation. We describe this in some detail in Section 5.

By itself, catenation is not a true compilation. Among other things, the resulting code still refers to operands in the bytecode instruction stream. We address this problem in the next section.

4. SPECIALIZATION

We would like to eliminate the fetching of operands from the bytecode stream.
It performs some analysis of the types and locations of objects pushed onto the stack, but no use is made of this information. For Python, the similar "211" system [1] compiles extended basic blocks of Python bytecode into superinstructions by concatenating the C code for the interpreter cases, performing some peephole optimization, and then invoking the C compiler in a separate, offline step. We are sympathetic to Aycock's suggestion that a VM based on registers and lower-level bytecodes would be more amenable to compilation.

The s4 [17] project has experimented with improved dispatch techniques for the Tcl VM, such as token threading, and many of its results have been folded back into the production Tcl release, improving performance.

The ICE 2.0 Tcl Compiler project [16] created a commercial stand-alone static (ahead-of-time) Tcl version 7 to C compiler in 1995, and then a later version in 1997 that also targeted bytecode and included a conservative type inference system, setting the stage for classical compiler optimizations. The compiler offered an approximately 30% improvement in execution time over the Tcl 8.0 bytecode compiler. Both the ICE compilers were static; that is, they required a separate compile step. This precludes using the compiler as a drop-in replacement for the original interactive Tcl interpreter, an important modality for scripting languages. The source code of the ICE Compilers was never released
the entire bytecode program as a constant, and, using an optimization they call complete loop unrolling, accomplishes essentially the same effect as our catenation. The system applies traditional optimizations after partial evaluation. This process is static and quite expensive, and thus might not be appropriate for a dynamic scripting language which frequently compiles and re-compiles. At static compile time, they specialize their runtime specializer, so it is pre-loaded with most of the analysis for optimization and code generation. This is more general than, but similar to, our system of patches, which pre-computes the necessary fix-ups. They report speedups of 1.8 on m88ksim, but do not discuss the complexities of I-cache and code explosion.

Trace-based dynamic recompilation techniques [2] also have the promise to automatically accomplish effects similar to catenation (and many other optimizations). Sullivan et al. [19] show how to extend Dynamo's infrastructure, by telling it about the virtual program counter, so that it is able to perform well while executing virtual machine interpreters, whereas it had previously done poorly.

There have been several efforts to improve Tcl performance. The kt2c system [5], while unfinished, uses the bytecode for a given function to build a C file containing a huge "superinstruction" implementing all the bytecode instructions in the function. It converts virtual jumps into C goto statements.
First, some information is extracted from the bytecode stream; for example, a one-byte unsigned integer operand. Then it may be filtered or manipulated in some way. Finally, it is applied to the output native code template. Thus we store for each patch an input and output type, along with its relative offset from the beginning of the template. Input types essentially select the kind of patch necessary, and include, for example, plain operands of various sizes and signedness, destination addresses of virtual jumps, and operands which need translation through the literal table or builtin math function table.

After a patch has fetched and transformed the input data, the data must be placed into the template at the appropriate location. This means changing the operands of one or more native machine instructions. On a RISC processor like the Sparc, there are only a few formats, and indeed only a few instructions, to recognize for patching: loading a small immediate integer constant (one which fits in 13 bits), loading larger constants, calls, and a few branches. In the case of virtual branch opcodes, the patch may also synthesize, rather than rewrite, the native instruction (branch-false, branch-true, or branch-always). The analysis expects certain instruction patterns to appear a certain number of times, usually once for each operand of a given opcode. The magic constant is used to locate the pattern, and confirm it appears the correct number of times.
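For example, the two most common output formats can be patched with simple masking and shifting; the field layouts follow the Sparc architecture manual [18], though the function names here are our own illustration:

    #include <stdint.h>

    /* Write a small signed constant into an instruction's 13-bit
       immediate field (simm13). */
    void patch_imm13(uint32_t *insn, int32_t value)
    {
        *insn = (*insn & ~0x1fffu) | ((uint32_t) value & 0x1fffu);
    }

    /* Write a 32-bit constant into a sethi/or pair (cf. Figure 7):
       sethi takes the high 22 bits, or supplies the low 10. */
    void patch_sethi_or(uint32_t *sethi_insn, uint32_t *or_insn, uint32_t value)
    {
        *sethi_insn = (*sethi_insn & ~0x003fffffu) | (value >> 10);
        *or_insn    = (*or_insn    & ~0x00001fffu) | (value & 0x3ffu);
    }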
    