Home
        OpenSPARC T1 Microarchitecture Specification
         Contents
1.        Shifted in at power on reset  POR  to make the software visible  read only           CORE_AVAIL  PROC_SER_NUM         Debug Ports          Internal visibility port on each UCB interface          L2 cache visibility port input from the L2 cache  2 x 40 bits   CMP clock           Debug port A output to the debug pads  40 bits   J Bus clock           Debug port B output to the JBI  2 x 48 bits   J Bus clock      TABLE  5 1  UCB interfaces to Clusters    Cluster Block  Width from IOB to block  Width from block to IOB    CTU  4 bits  4 bits    DRAM02 and DRAM13  4 bits  4 bits    JBI PIO  64 bits  16 bits    JBI SSI  4 bits  4 bits    TAP  8 bits  8 bits    5 4 OpenSPARC T1 Microarchitecture Specification      August 2006    5 1 2  UCB Interface    FIGURE  5 2 shows the UCB interface from and to cluster  There are two    uni directional ports     one from the IOB to the cluster and one from the cluster to  the IOB  Each port consists of a valid signal  a data packet  and a stall signal     FIGURE  5 2 IOB UCB Interface to and From the Cluster    5 1 2 1  UCB Request and Acknowledge Packets    The UCB request or acknowledge  ACK  packet can have various widths     64 bits   128 bits  or 192 bits  TABLE  5 2 defines the UCB request or acknowledge packet    format     TABLE  5 2  UCB Request Acknowledge Packet format    Bits  191 128  127 64  63 55  54 15  14 12  11 10  9 4  3 0    Description  Extended  Data 63 0     Data   63 0     Reserved  Address   39 0     Size  Buffer I
2.        UCB ACK packet sent to the i2c  translated  and sent on to the CPX        IOB_INT CSR Accesses          J_INT_BUSY DATA0 DATA1 CSRs are implemented in the register file in the  c2i         Read Write ACKS sent to the i2c  by way of the int_buf register file  and sent  on to the CPX        TAP Reads Writes          TAP mastered requests are similar to the UCB Read Writes         Requests to the TAP CSRs are bounced through the iobdg_ctrl to the i2c for  IOB  gt TAP UCB         TAP requests to memory  L2 cache CSRs  and CPU ASIs  are bounced through  iobdg_ctrl on to i2c and issued as a forward request to CPX  ACK returns on  the PCX     5 10 OpenSPARC T1 Microarchitecture Specification      August 2006    5 1 6  IOB Interrupts    This section describes the various interrupts that are handled by the IOB         UCB Signalled Interrupts  Request Type   UCB_INT          Only two LSBs of the DEV ID are used         Error Interrupt  Dev_ID   1          SSI EXT_INT_L  Dev_ID   2          Signalled on the UCB interface to the i2c         Looks up Mask status in the INT_CTL Dev_ID   the CPU ID  and the vector in  the INT_MAN Dev_ID  CSR         Generate CPX interrupt packets and sends them to the CPU        Software Generated Interrupts  INT_VEC_DIS CSR          Writable by the CPU or the TAP         Sends reset  interrupt  idle  and resume signals to the selected thread         Generates UCB interrupt packets in the iobdg_ctrl         Translates to the CPX interrupt packet for
3.        x     Not used or don   t care        V     Valid        rs     Source register        rd     Destination register        T     Thread ID        FD     Forwarded data        src     Source        tar     Target    TABLE  3 1  CPX Packet Format     Part 1    Pkt  bits  No   Load    I fill  1   L2 IOB  I fill  2  L2  Strm Load  Evict Inv    Valid  144  1  V  V  V  V  V    Rtntyp  143 140  4  0000  0001  0001  0010  0011    L2 miss  139  1  V  V  0  V  x    ERR  138 137  2  V  V  V  V  x    NC  136  1  V  V  V  V  V    Shared bit  135  1  T  T  T  T  x    Shared bit  134  1  T  T  T  T  x    Shared bit  133  1  WV  WV 0  WV  WV  x    Shared bit  132  1  W  W x  W  W  x    Shared bit  131  1  W  W x  W  W  x    Shared bit  130  1  0  0  F4B  0  A  x    Shared bit  129  1  atomic  0  1  B  x    Reserved  128  1  PFL  0  0  0  0    Data  127 0  128  V  V  V  V   INV1   6 pa    112 inv      3 6 OpenSPARC T1 Microarchitecture Specification      August 2006    TABLE  3 2  CPX Packet Format     Part 2    Pkt  bits  No   Store Ack    Strm Store  Ack  Int  FP    Fwd  req    Fwd  Reply  Error    Valid  144  1  V  V  V  V  V  V  V    Rtntyp  143 140  4  0100  0101  0110  0111  1000  1001  1010  1011  1100    L2 miss  139  1  x  x  x  x  x  x  x    ERR  138 137  2  x  x  x  x  x  V  V    NC  136  1  V  V  flush  V  R  W  R  W  x    Shared bit  135  1  T  T  x  T  x  x  0    Shared bit  134  1  T  T  x  T  x  x  0    Shared bit  133  1  x  x  x  x  src  tar  x    Shared bit  132  1  x
4.       FIGURE  5 3 IOB Internal Block Diagram    CMP Clk  J BUS Clk  CMP Clk    PCX    CPX    EFC    L2    i2c    c2i    dbg    ctrl    CSRs    16x65   x4      misc    16x65   x4     CTU  DRAM 0 2  DRAM 1 3  JBI PIO  JBI SPI  TAP    CTU  DRAM 0 2  DRAM 1 3  JBI PIO  JBI SPI  TAP    JBI Mondo    16x160    16  x  160    16x65   x2     wr_ack    UCB IF    UCB Pkt IF    124    32    145    134    4  4  4 64  4  8       Dbg port B   J bus    Dbg port A    48    48    40    64 128    64 128  153    Mondo  data0 1  src busy    4  4  4 16  4  8  16    40  40    Chapter 5  Input Output Bridge 5 9    5 1 5  IOB Transactions    This section describes the various transactions processed by the IOB         UCB Reads          LOAD_RQ packet received by the c2i         Map decode address to the destination port         Translates to the UCB packet format and sends UCB_READ_REQ over the UCB         UCB_READ_ACK _NACK received by i2c         Translates to the CPX packet format and sends to CPU        UCB Writes          STORE_RQ packet received by the c2i         Map addresses to the destination port         Translates to the UCB packet format and sends UCB_WRITE_REQ over the  UCB         Send write ACK  STORE_ACK  directly to the i2c         Sends STORE_ACK packet to the CPX        IOB_MAN  IOB Management  CSR Accesses          Similar to UCB Reads Writes  except the UCB request packet is routed to the  iobdg_ctrl         CSRs  except J_INT_BUSY DATA0 DATA1  are implemented in the iobdg_ctrl  
5.      TABLE  5 8  UCB No Payload Over an 8 Bit Interface With Stalls    iob_ucb_vld  0  1  1  1  1  1  1  1  1  1  1  1  0    iob_ucb_data 7 0   X  D0  D1  D2  D2  D2  D3  D3  D4  D5  D6  D7  X    ucb_iob_stall  0  0  1  1  1  0  0  0  0  0  0  0  0    TABLE  5 9  IOB Address Map    PA 39 32   Hex   Destination  Description    0x00   0x7F  DRAM Memory  TAP only   CCX forward request    0x80  JBI PIO  JBI CSRs  J Bus 8MB Non Cached  amp  Fake DMA spaces    0x81   0x95  Reserved    0x96  CTU    0x97  DRAM  DRAM CSRs  PA 12    0 for DRAM02  PA 12    1 for DRAM13    0x98  IOB_MAN  IOB Management CSRs    0x99  TAP  TAP CSRs    0x9A   0x9D  Reserved    0x9E  CPU ASI  TAP Only   CCX forward request    0x9F  IOB_INT  IOB Mondo Interrupt CSRs    0xA0   0xBF  L2 CSRs  TAP Only   CCX forward request    0XC0   0xFE  JBI PIO  J Bus 64 GB Non Cached Spaces    0xFF  JBI SSI  SSI CSRs and Boot PROM    5 8 OpenSPARC T1 Microarchitecture Specification      August 2006    5 1 4  IOB Block Diagram    FIGURE  5 3 shows the IOB internal block diagram  The PCX requests from the CPU    are processed by a block called the CPU to I O  c2i  and it generates UCB requests  to the various blocks  UCB requests from various blocks are processed by a block  called I O to CPU  i2c  which then generates a CPX packet  Internal control status  registers  CSRs  are controlled by the CSR block  The debug block takes data from  L2 cache and sends it to debug port A  an external port  and to debug port B  to the  JBI
6.     2 8  Stream Processing Unit    Each SPARC core is equipped with a stream processing unit  SPU  supporting the  asymmetric cryptography operations  public key RSA  for up to a 2048 bit key size     The SPU shares the integer multiplier with the execution unit  EXU  for the modular  arithmetic  MA  operations  The SPU itself supports full modular exponentiation   While the SPU facility is shared among all threads of a SPARC core  only one thread  can use the SPU at a time  The SPU operation is set up by a storing a thread to a  control register and then returning to normal processing  The SPU will initiate  streaming load or streaming store operations to the level 2 cache  L2  and compute  operations to the integer multiplier  Once the operation is launched  it can operate in  parallel with SPARC core instruction execution  The completion of the operation is  detected by polling  synchronous fashion  or by interrupt  asynchronous fashion      2 8 1  ASI Registers for the SPU    All alternate space identifier  ASI  registers for the SPU are 8 bytes in length  Access  to all of the ASI registers for the SPU have hypervisor privilege  so they can only be  accessed in hypervisor mode  The following list highlights those ASI registers         Modular arithmetic physical address  MPA  register    This register carries the physical address used to access the main memory   MA_LD requests must be on the 16 byte boundary while MA_ST requests must  be on the 8 byte boundary         Mo
7.     3    5    7    12    13    8    9    10  11    4  2    1a  6    1   Reset     Trap TL gt  5     1a   Reset     Trap TL gt  5     HPSTATE red  gt 1     2   SV Trap TL lt 2     3   HV Trap 2 lt  TL lt 5     SV Trap  2 lt  TL lt 5      4   Done Retry when TSTATE TL  priv 1 amp   HTSTATE TL  priv  gt 0     PSTATE priv  gt 0    Done Retry   HTSTATE TL  priv 0 amp TSTATE TL  priv 1     HPSTATE red  gt 0   5    6   Done Retry   HTSTATE TL   red priv  01     HPSTATE red  gt 0     7   Done Retry   HTSTATE TL   red priv  00 amp TSTATE TL  priv 1      HPSTATE  red priv   gt 00  PSTATE priv 1   8   Done Retry   HTSTATE TL   red priv  00 amp TSTATE TL  priv 0      HPSTATE  red priv   gt 00  PSTATE priv 0     9   Any Trap  TL lt 5     10   SV Trap TL lt 2      Any Trap or Reset   11    12   Done Retry   HTSTATE TL  priv 0      HPSTATE priv  gt 0    PSTATE priv 0   HV Trap     SV trap 2 lt  TL lt 3   13    2 64 OpenSPARC T1 Microarchitecture Specification      August 2006    2 10 11  Content Construction for Processor State Registers    Processor state registers  PSRs  carry different content in different situations  such  as  traps  interrupts  done instructions  or retry instructions  The following list  highlights the register contents     1  On traps or interrupts     save states in the trap stack and update them    a  Update trap level  TL  and global level  GL     i  On normal traps or interrupts    TL   min TL 1  MAXTL     GL   min GL 1  MAXGL  for hypervisor    GL   min GL 1  2 
8.     CCR        ASI_REG        TTYPE    Synchronization based on the HTSTATE priv bit and the TSTATE priv bit for the  non split mode is not enforced on software writes  but synchronized while restoring  done and retry instructions     Software writes in supervisor mode to the TSTATE gl bit do not cap at two  The cap  is applied while restoring done and retry instructions     2 66 OpenSPARC T1 Microarchitecture Specification      August 2006    2 10 13  Trap  Tcc  Instructions    Traps number 0x0 to 0x7f are all SPARC V9 compliant  They can be used by user  software or by privileged software  The trap will be delivered to the supervisor if  TL   lt  MAXPTL 2   Otherwise  it will be delivered to the hypervisor     Traps number 0x80 to 0xff can only be used by privileged software  These traps are  always delivered to hypervisor  User software using trap number 0x80 to 0xff will  result in an illegal instruction trap if the condition code evaluates to true  Otherwise  it  is just a NOP     The instruction decoding and condition code evaluation of Tcc instructions are done  by the instruction fetch unit  IFU  and the seventh bit of the Trap  is checked by the  TLU     2 10 14  Trap Level 0 Trap for Hypervisor    Whenever the trap level  TL  changes from non zero to zero  and if the HPSTATE tlz  bit is set to 1  and the thread is not at Hypervisor privilege level  then a precise trap  level 0  TLZ  trap will be delivered to the hypervisor on the next following  instruction     The 
9.     on page  3 17 for more information     3 5 2  CPX Arbiters    Data and control flow inside the CPX are identical to those inside the PCX  so see  Section  3 4 2     PCX Arbiter Data Flow    on page  3 18 and Section  3 4 3     PCX Arbiter    Control Flow    on page  3 19 for more information     4 1    CHAPTER 4    Level 2 Cache    This chapter contains the following sections         Section  4 1     L2 Cache Functional Description    on page  4 1        Section  4 2     L2 Cache I O LIST    on page  4 18    4 1  L2 Cache Functional Description    The following sections describe the OpenSPARC T1 processor level 2 cache   L2 cache          Section  4 1 1     L2 Cache Overview    on page  4 1        Section  4 1 2     L2 Cache Single Bank Functional Description    on page  4 2        Section  4 1 3     L2 Cache Pipeline    on page  4 9        Section  4 1 4     L2 Cache Instruction Descriptions    on page  4 12        Section  4 1 5     L2 Cache Memory Coherency and Instruction Ordering    on    page  4 17    4 1 1  L2 Cache Overview    The OpenSPARC T1 processor L2 cache is 3 Mbytes in size and is composed of four  symmetrical banks that are interleaved on a 64 byte boundary  Each bank operates  independently of each other  The banks are 12 way set associative and 768 Kbytes in  size  The block  line  size is 64 bytes  and each L2 cache bank has 1024 sets     The L2 cache accepts requests from the SPARC CPU cores on the processor to cache  crossbar  PCX  and responds on th
10.    Chapter 2  SPARC Core 2 33    2 5  Execution Unit    The execution unit  EXU  contains these four subunits     arithmetic and logic unit   ALU   shifter  SHFT   integer multiplier  IMUL   and integer divider  IDIV      FIGURE  2 15 presents a top level diagram of the execution unit     FIGURE  2 15 Execution Unit Diagram    The execution control logic  ECL  block generates the necessary select signals that  control the multiplexors  keeps track of the thread and reads of each instruction  and  implements the bypass logic  The ECL also generates the write enables for the  integer register file  IRF   The bypass logic block does the operand bypass from the  E  M  and W stages to the D stage  Results of long latency operations such as load   mul  and div  are forwarded from the W stage to the D stage  The condition codes  are bypassed similar to the operands  and bypassing of the FP results and writes to  the status registers are not allowed     The shifter block  SHFT  implements the 0   63 bit shift  and FIGURE  2 16 illustrates    the top level block diagram of the shifter     Bypass  Logic    MUL    DIV    ALU    SHFT    Register  file    ECL    WR    rd to RF    rd to RF    data  to LSU    addr  to LSU    W  M  E    Load  data    reg addr    thread   window controls    rs1    rs3    rs1  rs2  rs2    rs3    imn    oncode    2 34 OpenSPARC T1 Microarchitecture Specification      August 2006    FIGURE  2 16 Shifter Block Diagram    The arithmetic and logic unit  ALU  consis
11.    The data valid array is dual ported with one read port and one write port  1R1W      Each line in the L1 D cache is parity protected  A parity error will cause a miss in the  L1 D cache which  in turn  will cause the correct data to be brought back from the  L2 cache     In addition to the pipeline reads  the L1 D cache can also be accessed by way of  diagnostic ASI operations  BIST operations  and RAMtest operations through the test  access port  TAP      2 4 4  Data Translation Lookaside Buffer    The data translation lookaside buffer  DTLB  is the TLB for the D cache  The DTLB  caches up to the 64 most recently accessed translation table entries  TTE  in a fully  associative array  The DTLB has one CAM port and one read write port  1 RW   All  four threads share the DTLB  The translation table entries of each thread are kept  mutually exclusive from the entries of the other threads     The DTLB supports the following 32 bit address translation operations         VA   gt  PA  virtual address  VA  to physical address  PA  translation         VA   PA  address bypass for hypervisor mode operations         RA   gt  PA  Real Address  RA  to Physical Address  PA  bypass translation for  supervisor mode operations     The TTE tag and the TTE data are both parity protected and errors are uncorrectable   TTE access parity errors for load instructions will cause a precise trap  TTE access  parity errors for store instructions will cause a deferred trap  that is  the generation  of
12.    sctag_scbuf_wbrd_wl_r0 2 0   In  SCTAG    sctag_scbuf_ev_dword_r0 2 0   In  SCTAG    sctag_scbuf_rdma_wren_s2 15 0   In  SCTAG    sctag_scbuf_rdma_wrwl_s2 1 0   In  SCTAG    jbi_sctag_req 31 0   In  JBI    jbi_scbuf_ecc 6 0   In  JBI    sctag_scbuf_rdma_rden_r0  In  SCTAG    sctag_scbuf_rdma_rdwl_r0 1 0   In  SCTAG    sctag_scbuf_ctag_en_c7  In  SCTAG    sctag_scbuf_ctag_c7 14 0   In  SCTAG    sctag_scbuf_req_en_c7  In  SCTAG    sctag_scbuf_word_c7 3 0   In  SCTAG    TABLE  4 1  SCDATA I O Signal List  Continued     Signal Name  I O  Source   Destination  Description    4 20 OpenSPARC T1 Microarchitecture Specification      August 2006    sctag_scbuf_word_vld_c7  In  SCTAG    scdata_scbuf_decc_out_c7 623 0   In  SCDATA    dram_scbuf_data_r2 127 0   In  DRAM    dram_scbuf_ecc_r2 27 0   In  DRAM    cmp_gclk  In  CTU  Clock    arst_l  In  CTU  Asynchronous reset    grst_l  In  CTU  Synchronous reset    global_shift_enable   In  CTU    cluster_cken  In  CTU    ctu_tst_pre_grst_l  In  CTU    ctu_tst_scanmode  In  CTU    ctu_tst_scan_disable  In  CTU    ctu_tst_macrotest  In  CTU    ctu_tst_short_chain  In  CTU    scbuf_sctag_ev_uerr_r5  Out  SCTAG    scbuf_sctag_ev_cerr_r5  Out  SCTAG    scbuf_jbi_ctag_vld  Out  JBI    scbuf_jbi_data 31 0   Out  JBI    scbuf_jbi_ue_err  Out  JBI    scbuf_sctag_rdma_uerr_c10  Out  SCTAG    scbuf_sctag_rdma_cerr_c10  Out  SCTAG    scbuf_scdata_fbdecc_c4 623 0   Out  SCDATA    scbuf_dram_data_mecc_r5  Out  DRAM    scbuf_dram_wr_data_r5 63 0   Out 
13.    xvi OpenSPARC T1 Microarchitecture Specification      August 2006    xvii    Tables    TABLE  2 1  SPARC Core Terminology  2   4    TABLE  2 2  SPARC Core I O Signal List  2   5    TABLE  2 3  Modular Arithmetic Operations  2   39    TABLE  2 4  Error Handling Behavior  2   42    TABLE  2 5  Supported OpenSPARC T1 Trap Types  2   54    TABLE  2 6  Privilege Levels and Thread States  2   61    TABLE  3 1  CPX Packet Format     Part 1  3   5    TABLE  3 2  CPX Packet Format     Part 2  3   6    TABLE  3 3  PCX Packet Format     Part 1  3   7    TABLE  3 4  PCX Packet Format     Part 2  3   8    TABLE  3 5  CCX I O Signal List  3   9    TABLE  4 1  SCDATA I O Signal List  4   18    TABLE  4 2  SCBUF I O Signal List  4   19    TABLE  4 3  SCTAG I O Signal List  4   21    TABLE  5 1  UCB interfaces to Clusters  5   3    TABLE  5 2  UCB Request Acknowledge Packet format  5   4    TABLE  5 3  UCB Request ACK Packet Types  5   5    TABLE  5 4  UCB Data Size  5   5    TABLE  5 5  UCB Interrupt Packet Format  5   6    TABLE  5 6  UCB Interrupt Packet Types  5   6    xviii OpenSPARC T1 Microarchitecture Specification      August 2006    TABLE  5 7  UCB No Payload Over an 8 Bit Interface Without Stalls  5   6    TABLE  5 8  UCB No Payload Over an 8 Bit Interface With Stalls  5   7    TABLE  5 9  IOB Address Map  5   7    TABLE  5 10  I O Bridge I O Signal List  5   12    TABLE  6 1  JBI I O Signal List  6   8    TABLE  7 1  OpenSPARC T1 FPU Feature Summary  7   3    TABLE  7 2  SPARC 
14.   4 22 OpenSPARC T1 Microarchitecture Specification      August 2006    scdata_sctag_scanout  In  DFT  Scan in    ctu_tst_macrotest  In  CTU  To test_stub of test_stub_bist v    ctu_tst_pre_grst_l  In  CTU  To test_stub of test_stub_bist v    ctu_tst_scan_disable  In  CTU  To test_stub of test_stub_bist v    ctu_tst_scanmode  In  CTU  To test_stub of test_stub_bist v    ctu_tst_short_chain  In  CTU  To test_stub of test_stub_bist v    efc_sctag_fuse_clk1  In  EFC    efc_sctag_fuse_clk2  In  EFC    efc_sctag_fuse_ashift  In  EFC    efc_sctag_fuse_dshift  In  EFC    efc_sctag_fuse_data  In  EFC    sctag_cpx_req_cq 7 0   Out  CCX CPX  sctag to processor request    sctag_cpx_atom_cq  Out  CCX CPX  Atomic request    sctag_cpx_data_ca 144 0   Out  CCX CPX  sctag to cpx data pkt    sctag_pcx_stall_pq  Out  CCX PCX  sctag to pcx IQ_full stall    sctag_jbi_por_req  Out  JBI    sctag_scdata_way_sel_c2 11 0   Out  SCDATA    sctag_scdata_rd_wr_c2  Out  SCDATA    sctag_scdata_set_c2 9 0   Out  SCDATA    sctag_scdata_col_offset_c2 3 0   Out  SCDATA    sctag_scdata_word_en_c2 15 0   Out  SCDATA    sctag_scdata_fbrd_c3  Out  SCDATA  From arbctl of sctag_arbctl v    sctag_scdata_fb_hit_c3  Out  SCDATA  Bypass data from Fb    sctag_scdata_stdecc_c2 77 0   Out  SCDATA    sctag_scbuf_stdecc_c3 77 0   Out  SCBUF    sctag_scbuf_fbrd_en_c3  Out  SCBUF  rd en for a fill operation or fb bypass    sctag_scbuf_fbrd_wl_c3 2 0   Out  SCBUF    sctag_scbuf_fbwr_wen_r2 15 0   Out  SCBUF    sctag_scbuf_fbwr_
15.   7 1  Functional Description    The OpenSPARC T1 floating point unit  FPU  has the following features and  supports the following functions         The FPU implements the SPARC V9 floating point instruction set with the  following exceptions          Does not implement these instructions     FSQRT s d   and all quad precision  instructions         Move type instructions executed by the SPARC core floating point frontend  unit  FFU   FMOV s d   FMOV s d cc  FMOV s d r  FABS s d   FNEG s d          Loads and stores  the SPARC core FFU executes these operations         The FPU does not support the visual instruction set  VIS    The SPARC core FFU  provides limited VIS support          The FPU is a single shared resource on the OpenSPARC T1 processor  Each of the  eight SPARC cores may have a maximum of one outstanding FPU instruction  A  thread with an outstanding FPU instruction stalls  switches out  while waiting for  the FPU result         The floating point register file  FRF  and floating point state register  FSR  are not  physically located within the FPU  The SPARC core FFU owns the register file and  FSR  The SPARC core FFU also performs odd even single precision address  handling         The FPU complies with the IEEE 754 standard     7 2 OpenSPARC T1 Microarchitecture Specification      August 2006        The FPU includes three independent execution pipelines          Floating point adder  FPA      adds  subtracts  compares  conversions         Floating point multipl
16.   CPX  and the processor cache crossbar  PCX  packet formats are described  in Section  3 1 5     CPX and PCX Packet Formats    on page  3 5     3 2 OpenSPARC T1 Microarchitecture Specification      August 2006    FIGURE  3 1 CPU Cache Crossbar  CCX  Interface    3 1 2  CCX Packet Delivery    The CCX consists of two main blocks     processor cache crossbar  PCX  and the  cache processor crossbar  CPX   The PCX block manages the communication from  any of the eight CPUs  source  to any of the four L2 cache banks  I O bridge  or FPU   destination   The CPX manages communication from any of the four L2 cache  banks  I O bridge  or FPU  source   to any of the eight CPUs  destination   FIGURE  3 2    illustrates the PCX interface and FIGURE  3 3 illustrates the CPX interface     FIGURE  3 2 Processor Cache Crossbar  PCX  Interface    CPU0    L2Bank 0 L2Bank 1 L2Bank 2 L2Bank 3  IOBridge  FPU    CPU1  CPU2  CPU3  CPU4    CCX    CPU5  CPU6  CPU7    CPU  8     L2Cache  4   IOBridge  FPU    CCX  PCX  CPX    Chapter 3  CPU Cache Crossbar 3 3    When multiple sources send a packet to the same destination  the CCX buffers each  packet and arbitrates its delivery to the destination  The CCX does not modify or  process any packet     In one cycle  only one packet can be delivered to a particular destination  The CCX  handles two types of communication requests  The first type of requests contain one  packet and it is delivered in one cycle  The second type of request contains two  packets 
17.   Generate a disrupting data_error trap on the requesting thread          Mark the line dirty  and the memory keeps the bad ECC on writeback     9 8 OpenSPARC T1 Microarchitecture Specification      August 2006        Uncorrectable errors on writeback data  DMA reads  and scrub all cause a  disrupting data_error trap on the steering thread         MA loads with uncorrectable errors  and aborts the operation in SPU         A fatal error indication is issued across the J Bus in order to request a warm_reset  of the entire chip when there is a          Parity error on any of the 12 VAD bits in the set during any access          Parity error during a directory scrub     9 4  DRAM Errors    This section lists the error registers and the error protection of the DRAM  This  section also describes the DRAM correctable and uncorrectable and addressing  errors     9 4 1  DRAM Error Registers    Each DRAM channel has its own set of error registers     1  DRAM Error Status Register         Contains the status of the DRAM errors          Not cleared on a reset     2  DRAM Error Address Register         Contains the physical address of the DRAM scrub error          DRAM access error addresses are logged by the L2 cache     3  DRAM Error Location Register         Contains the location of the bad nibble     4  DRAM Error Counter Register         16 bit counter  decrements on every 16 byte correctable error          An interrupt is sent to the IOB when the count hits 0     5  DRAM Error Injec
18.   Head and tail pointers of resumable and non resumable error queue        CPU interrupt registers         Interrupt receive register         Incoming vector register         Interrupt dispatch registers  for cross calls     2 10 2  Trap Types    Traps can be generated from the user code  the supervisor code  or from the  hypervisor code  A trap will be delivered to different trap handler levels for further  processing  namely the supervisor level  SV level  otherwise known as the privileged  level  or the hypervisor level  HV level   The way the traps are generated can help  categorize a trap into either an asynchronous trap  asynchronous to the SPARC core  pipeline operation  or a synchronous trap  synchronous to the SPARC core pipeline  operation      There are three defined categories of traps     precise trap  deferred trap  and  disrupting trap  The following paragraphs briefly describe the nature of each  category of trap     1  Precise trap    A precise trap is induced by a particular instruction and occurs before any  program visible state has been changed by the trap inducing instruction  When a  precise trap occurs  several conditions must be true          The PC saved in TPC TL  points to the instruction that induced the trap  and  NPC saved in NTPC TL  points to the instruction that was to be executed next          All instructions issued before the one that induced the trap must have  completed their execution          Any instructions issued after the one that 
19.   PADS  J Bus ACK 4    io_jbi_j_pack5 2 0   In  PADS  J Bus ACK 5    io_jbi_j_adp 3 0   In  PADS  J Bus parity for AD bus    io_jbi_j_par  In  PADS  J Bus parity for request PACK    iob_jbi_dbg_hi_data 47 0   In  IOB  Debug data high    iob_jbi_dbg_hi_vld  In  IOB  Debug data high valid    iob_jbi_dbg_lo_data 47 0   In  IOB  Debug data low    iob_jbi_dbg_lo_vld  In  IOB  Debug data low valid    jbi_ddr3_scanout18  Out  DFT  Scan out    jbi_clk_tr  Out  CTU  Debug_trigger     jbi_jbusr_so  Out  DFT  Scan out    jbi_jbusr_se  Out  DFT  Scan enable    jbi_sctag0_req 31 0   Out  SCTAG0  L2 cache request    jbi_scbuf0_ecc 6 0   Out  SCBUF0    jbi_sctag0_req_vld  Out  SCTAG0  Next cycle will be header of a new request packet     jbi_sctag1_req 31 0   Out  SCTAG1  L2 cache request    jbi_scbuf1_ecc 6 0   Out  SCBUF1    jbi_sctag1_req_vld  Out  SCTAG1  Next cycle will be header of a new request packet     jbi_sctag2_req 31 0   Out  SCTAG2  L2 cache request    jbi_scbuf2_ecc 6 0   Out  SCBUF2    jbi_sctag2_req_vld  Out  SCTAG2  Next cycle will be header of a new request packet     jbi_sctag3_req 31 0   Out  SCTAG3  L2 cache request    jbi_scbuf3_ecc 6 0   Out  SCBUF3    jbi_sctag3_req_vld  Out  SCTAG3  Next cycle will be Header of a new request packet     jbi_iob_pio_vld  Out  IOB  PIO valid    jbi_iob_pio_data 15 0   Out  IOB  PIO data    jbi_iob_pio_stall  Out  IOB  PIO stall    TABLE  6 1  JBI I O Signal List  Continued     Signal Name  I O  Source   Destination  Description    Cha
20.   could come from one of these sources     1  Branch    2  TrapPC    3  Trap NPC    4  Rollback  a thread rolled back due to a load miss     5  PC   4    The IFU tracks the PC and NPC through W stage  The last retired PC will be saved  in the trap logic unit  TLU   and  if a trap occurs  it will also be saved in the trap  stack     2 3 4  Level 1 Instruction Cache    The instruction cache is commonly referred to as the level 1 instruction cache  L1I    The L1I is physically indexed and tagged and is 4 way set associative with 16 Kbytes  of data  The cache line size is 32 bytes  The L1I data array has a single port  and the  I cache fill size is 16 bytes per access  The characteristics of cached data include      32 bit instructions  1 bit parity  and 1 bit predecode  The tag array also has a single  port     There is a separate array for valid bit  V bit   This V bit array holds the cache line  state of either valid or invalid  and the array has one read port and one write port   1R1W   The cache line invalidation only accesses the V bit array  and the cache line  replacement policy is pseudo random     2 10 OpenSPARC T1 Microarchitecture Specification      August 2006    The read access to the I cache has a higher priority over the write access  The ASI  read and write accesses to the I cache are set to lower priorities  The completion of  the ASI accesses are opportunistic  and there is fairness mechanism built in to  prevent the starvation of service to ASI accesses     Th
21.   ctu_tst_scanmode  In  CTU  Scan mode     ctu_tst_pre_grst_l  In  CTU    ctu_tst_scan_disable  In  CTU    ctu_tst_macrotest  In  CTU    ctu_tst_short_chain  In  CTU    ddr3_jbi_scanin18  In  DFT    jbusr_jbi_si  In  DFT    sctag0_jbi_iq_dequeue  In  SCTAG0  SCTag is unloading a request from its 2 request queue     sctag0_jbi_wib_dequeue  In  SCTAG0  Write invalidate buffer  size 4  is being unloaded     scbuf0_jbi_data 31 0   In  SCBUF0  Return data    scbuf0_jbi_ctag_vld  In  SCBUF0  Header cycle of a new response packet     scbuf0_jbi_ue_err  In  SCBUF0  Current data cycle has a uncorrectable error     Chapter 6  J Bus Interface 6 9    sctag0_jbi_por_req_buf  In  SCTAG0  Request for DOK_FATAL     sctag1_jbi_iq_dequeue  In  SCTAG1  SCTag is unloading a request from its 2 request queue     sctag1_jbi_wib_dequeue  In  SCTAG1  Write invalidate buffer  size 4  is being unloaded     scbuf1_jbi_data 31 0   In  SCBUF1  Return data    scbuf1_jbi_ctag_vld  In  SCBUF1  Header cycle of a new response packet     scbuf1_jbi_ue_err  In  SCBUF1  Current data cycle has a uncorrectable error     sctag1_jbi_por_req_buf  In  SCTAG1  Request for DOK_FATAL     sctag2_jbi_iq_dequeue  In  SCTAG2  SCTag is unloading a request from its 2 request queue     sctag2_jbi_wib_dequeue  In  SCTAG2  Write invalidate buffer  size 4  is being unloaded     scbuf2_jbi_data 31 0   In  SCBUF2  Return data    scbuf2_jbi_ctag_vld  In  SCBUF2  Header cycle of a new response packet     scbuf2_jbi_ue_err  In  SCBUF2  
22.   x  x  x  src  tar  x    Shared bit  131  1  x  x  x  x  src  tar  x    Shared bit  130  1  x R  A  x  x  SASI  x  x    Shared bit  129  1  atomic  x  x  x  x  x  x    Reserved  128  1  x R  0  0  0  0  0  0  0    Data  127 0  128   INV2   3 cpu    6 pa    112 inv       INV3   3 cpu    6pa    112 inv      V   V   FD   64 x     Data     x    Chapter 3  CPU Cache Crossbar 3 7    TABLE  3 3  PCX Packet Format     Part 1    Pkt  Bits  No   Load  Ifill Req  ST  CAS 1   CAS 2     Valid  123  1  V  V  V  V  V    Rqtyp  122 118  5  00000  10000  00001  00010  00011    NC  117  1  V  V  V     1        1       Cpu_idfs  116 114  3  V  V  V  V  V    Thread_id  113 112  2  V  V  V  V  V    Invalidate  111  1  V  V  0  0  0    Prefetch  110  1  V  0  BST  0  0    Block init store   Displacement  flush    109  1  DF  0  BIS BST  0  0    Rep_L1_way  108 107  2  V  V  P  V  x    Size  106 104  3  V  x  V  V  V    Address  103 64  40  V  V   V  V  V    Data  63 0  64  x  x  V  Vrs2  Vrd    3 8 OpenSPARC T1 Microarchitecture Specification      August 2006    TABLE  3 4  PCX Packet Format     Part 2    Pkt  Bits  No     SWP  Ldstb    Stream  loads    Stream  Store  Int  FP  1   FP  2     Fwd  req    Fwd  reply    Valid  123  1  V  V  V  V  V  V  V  V    Rqtyp  122 118  5  00110  00100  00101  01001  01010  01011  01100  01101  01110    NC  117  1     1        1     V  Br  x  x  R  W  R  W    Cpu_id  116 114  3  V  V  V  V  V  V  src  tar    Thread_id  113 112  2  V  V  V  V  V  V  000  x    In
23.  0   In  To ctu_dft of ctu_dft v    afi_rt_high_low  In  To ctu_dft of ctu_dft v    afi_rt_read_write  In  To ctu_dft of ctu_dft v    afi_rt_valid  In  To ctu_dft of ctu_dft v    10 16 OpenSPARC T1 Microarchitecture Specification      August 2006    afi_tsr_div 9 1   In  To ctu_dft of ctu_dft v    afi_tsr_tsel 7 0   In  To ctu_dft of ctu_dft v    cmp_gclk  In  To u_cmp_header of bw_clk_cl_ctu_cmp v    cmp_gclk_cts  In  To u_cmp_gclk_dr of bw_u1_ckbuf_40x v    ddr0_ctu_dll_lock  In  PADS  To ctu_clsp of ctu_clsp v    ddr0_ctu_dll_overflow  In  PADS  To ctu_clsp of ctu_clsp v    ddr1_ctu_dll_lock  In  PADS  To ctu_clsp of ctu_clsp v    ddr1_ctu_dll_overflow  In  PADS  To ctu_clsp of ctu_clsp v    ddr2_ctu_dll_lock  In  PADS  To ctu_clsp of ctu_clsp v    ddr2_ctu_dll_overflow  In  PADS  To ctu_clsp of ctu_clsp v    ddr3_ctu_dll_lock  In  PADS  To ctu_clsp of ctu_clsp v    ddr3_ctu_dll_overflow  In  PADS  To ctu_clsp of ctu_clsp v    dll0_ctu_ctrl 4 0   In  PADS  To ctu_clsp of ctu_clsp v    dll1_ctu_ctrl 4 0   In  PADS  To ctu_clsp of ctu_clsp v    dll2_ctu_ctrl 4 0   In  PADS  To ctu_clsp of ctu_clsp v    dll3_ctu_ctrl 4 0   In  PADS  To ctu_clsp of ctu_clsp v    dram02_ctu_tr  In  DRAM  DRAM debug trigger    dram13_ctu_tr  In  DRAM  DRAM debug trigger    dram_gclk_cts  In  DRAM  To u_dram_gclk_dr of bw_u1_ckbuf_30x v    efc_ctu_data_out  In  EFC  To ctu_dft of ctu_dft v    io_clk_stretch  In  PADS  To ctu_clsp of ctu_clsp v    io_do_bist  In  PADS  To ctu_clsp of ctu_clsp v   
24.  0x80_FFFF_FFFF  is used for the debug data        Error injection is supported in outbound and inbound J Bus traffic        BI debug info  when enabled  is placed in the transaction headers  the JBI queues  info in the upper 64 bits of the AD     6 1 6  J Bus Internal Arbitration        There are seven agents for internal arbitration          Four read return queues         PIO request queues         Mondo interrupt ACK NACK queues         Debug FIFO        In the default arbitration  the debug FIFO has the lowest priority  and there is  round robin arbitration between the other six agents        Until the FIFO is flushed  the debug FIFO has the highest priority when the  HI_WATER or MAX_WAIT limits are reached    Chapter 6  J Bus Interface 6 7    6 1 7  Error Handling in JBI        There are 19 different fatal and not correctable errors  each with a log enable   signal enable  error detected bit  and error overflow detected bit   Refer to the  UltraSPARC T1 Supplement to UltraSPARC Architecture 2005 Specification for details  on programming control bits and reading status registers          J Bus snapshot registers contain address  data  control  parity bits         J Bus requests to non existent memory causes a read to address 0 before the JBI  issues an error cycle on the J Bus         Fatal error asserts DOK on for 4 cycles  which instructs the external J Bus to PCI   Express ASIC to perform a warm reset     6 1 8  Performance Counters        There are two performance co
25.  10 4 OpenSPARC T1 Microarchitecture Specification      August 2006    10 1 1 2  Clock Dividers    A clock divider divides the output of the PLL  and supports a divide range of 2 to  24  The clock dividers are Johnson counter variants and have deterministic starts for  repeatability     Each clock domain  C  D  and J  are generated by the dividing PLL clock  and each  domain uses its own divide ratio and positive negative pairs  For the PLL bypass  mode  the divide ratios are fixed     the C clock is divided by 1  and D and J clocks are  divided by 4  Refer to the UltraSPARC T1 Supplement to the UltraSPARC 2005  Architecture Specification for the complete definitions of these clock divider ratios     Clock divider block diagram and waveforms are shown in FIGURE  10 3     FIGURE  10 3 Clock Divider Block Diagram    The clock divider and other parameters are stored in shadowed control registers   CREGs   A cold reset  or a power on reset  sets the default values in each CREG and  its shadow  Warm resets with frequency changes copies the CREG to its shadow     1div    pos    neg    pos    PLL    neg    out    pll_clk_out  dom_div    align  init_1    div_vec 14 0     Chapter 10  Clocks and Resets 10 5    TABLE  10 1 defines the various dividers for the clock domains     10 1 1 3  Clock Domain Crossings    Clock domain crossing has the following characteristics         Clock domains are ratioed synchronous  which means that after every few clock  cycles  depending on the ratio   t
26.  2 37        Section  2 8     Stream Processing Unit    on page  2 38        Section  2 9     Memory Management Unit    on page  2 43        Section  2 10     Trap Logic Unit    on page  2 50    2 2 OpenSPARC T1 Microarchitecture Specification      August 2006    2 1  SPARC Core Overview and Terminology    FIGURE  2 1 presents a high level block diagram of a SPARC core  and FIGURE  2 2    shows the general physical location of these units on an example core     FIGURE  2 1 SPARC Core Block Diagram    I Cache    External  Interface    D Cache    Decode  Strand  Scheduler    ALU    Strand  Instruction  Registers    Register  Files    Store Buffers    Chapter 2  SPARC Core 2 3    FIGURE  2 2 Physical Location of Functional Units on an OpenSPARC T1 SPARC Core    Trap    MUL    EXU    IFU    LSU  MMU    0    1    2    3    4    5    6    7    2 4 OpenSPARC T1 Microarchitecture Specification      August 2006    TABLE  2 1 defines acronyms and terms that are used throughout this chapter     FIGURE  2 3 shows the view from virtualization  which illustrates the relative    privileges of the various software layers     FIGURE  2 3 Virtualization of Software Layers    TABLE  2 1  SPARC Core Terminology    Term  Description    Thread  A thread is a hardware strand  thread and strand will be used  interchangeably in this chapter   Each thread  or strand  enjoys a  unique set of resources in support of its execution while multiple  threads  or strands  within the same SPARC core will share
27.  7 and 10 of the generic  reset sequence  skipping steps 8 and 9      The SPARC core initiates a warm reset by writing to the I O bridge  IOB  chip in  order to toggle the J_RST_L reset signal  A warm reset can be used for         Recovering from hangs        Creating a deterministic diagnostics start        Changing frequency    Chapter 10  Clocks and Resets 10 15    10 1 2 4  Debug Initialization    A debug unitization is a lightweight reset intended to create determinism with  respect to a coincident edge         Software is required to achieve a quiescent state  and          Stop all threads         Clear out arrays        A read to the CREG_DBG_INIT causes the GDBGINIT_L signals to be asserted   and then deasserted        Read data return occurs with a fixed relationship to a coincident edge    10 2  I O Signal list    TABLE  10 2 describes the I O signals for the OpenSPARC T1 processor clock and test    unit  CTU      TABLE  10 2 CTU I O Signal List    Signal Name  I O    Source   Destination  Description    afi_pll_trst_l  In  PLL Test Reset    afi_tsr_mode  In    io_j_clk 1 0   In  PADS  J clock input from PADS    afi_bist_mode  In  To ctu_dft of ctu_dft v    afi_bypass_mode  In  To ctu_dft of ctu_dft v    afi_pll_char_mode  In  To ctu_dft of ctu_dft v    afi_pll_clamp_fltr  In  To ctu_dft of ctu_dft v    afi_pll_div2 5 0   In  To ctu_dft of ctu_dft v    afi_rng_ctl 2 0   In  To ctu_dft of ctu_dft v    afi_rt_addr_data  In  To ctu_dft of ctu_dft v    afi_rt_data_in 31
28.  8 1 1  Arbitration Priority    The read requests have higher priority over write requests  but there is a starvation  counter which will enable writes to go through  Write requests that match the  pending read requests are completed ahead for ordering  The DRAM controller  should never see a read request followed by write request  The arbitration priority  order is listed as follows  with the first list item having the highest priority     1  Refresh request     2  Pending column address strobe  CAS  requests  round robin      3  Scrub row address strobe  RAS  requests     4  Write pending RAS requests  which have matching addresses  as read requests  that are picked for RAS     5  Read RAS requests from read queues  or write RAS requests from write queues  when the write starvation counter reaches its limit  round robin      6  Write RAS requests from write queues  or read RAS requests from read queues if  the write starvation counter reaches its limit     7  Incoming read RAS requests     8 4 OpenSPARC T1 Microarchitecture Specification      August 2006    8 1 2  DRAM Controller State Diagrams    FIGURE  8 2 presents a top level state diagram of the DRAM controller  Software must    initialize the DRAM controller at power on in order for it to achieve an initialized  state     FIGURE  8 2 DDR II DRAM Controller Top Level State Diagram    Idle    Software DRAM Init    Init done    Refresh GO    Issue  Refresh  command    Closed  all banks   Yes  No    Chapter 8  DRAM Contro
29.  August 2006    performs diagnostic reads from the JTAG or the L2 cache  and it sends a request to a  CPU by way of the CPX  The CPU bounces the request to the L2 cache by way of the  PCX     4 1 3 2  L2 Cache Pipeline Stages    The L2 cache access pipeline has eight stages  C1 to C8   and the following sections  describe the logic executed during each stage of the pipeline     C1        All buffers  WBB  WB and MB  are cammed  The instruction is a dependent  instruction if the instruction address is found in any of the buffers         Generate ECC for store data         Access VUAD and TAG array to establish a miss or a hit     C2        Pipeline stall conditions are evaluated  The following conditions require that the  pipeline be stalled          32 byte access requires two cycles in the pipeline          An I miss instruction stalls the pipeline for one cycle  When an I miss  instruction is encountered in the C2 stage  it stalls the instruction in the C1  stage so that it stays there for two cycles  The instruction in the C1 stage is  replayed         For instructions that hit the cache  the way select generation is completed         Pseudo least recently used  LRU  is used for selecting a way for replacement in  case of a miss         VUAD is updated in the C5 stage  However  VUAD is accessed in the C1 stage   The bypass logic for VUAD generation is completed in the C2 stage  This process  ensures that the correct data is available to the current instruction from the  pr
30.  DRAM    scbuf_dram_data_vld_r5  Out  DRAM    so  Out  DFT  Scan out    TABLE  4 2  SCBUF I O Signal List  Continued     Signal Name  I O  Source   Destination  Description    Chapter 4  Level 2 Cache 4 21    TABLE  4 3  SCTAG I O Signal List    Signal Name  I O    Source   Destination  Description    pcx_sctag_data_rdy_px1  In  CCX PCX  PCX data ready    pcx_sctag_data_px2 123 0   In  CCX PCX  PCX to sctag packet    pcx_sctag_atm_px1  In  CCX PCX  Indicates that the current packet is atomic    cpx_sctag_grant_cx 7 0   In  CCX CPX  CPX grant    scdata_sctag_decc_c6 155 0   In  SCDATA  From data of scdata_data v    scbuf_sctag_ev_uerr_r5  In  SCBUF    scbuf_sctag_ev_cerr_r5  In  SCBUF    scbuf_sctag_rdma_uerr_c10  In  SCBUF    scbuf_sctag_rdma_cerr_c10  In  SCBUF    dram_sctag_rd_ack  In  DRAM    dram_sctag_wr_ack  In  DRAM    dram_sctag_chunk_id_r0 1 0   In  DRAM    dram_sctag_data_vld_r0  In  DRAM    dram_sctag_rd_req_id_r0 2 0   In  DRAM    dram_sctag_secc_err_r2  In  DRAM    dram_sctag_mecc_err_r2  In  DRAM    dram_sctag_scb_mecc_err  In  DRAM    dram_sctag_scb_secc_err  In  DRAM    jbi_sctag_req_vld  In  JBI    jbi_sctag_req 31 0   In  JBI    arst_l  In  CTU  Asynchronous reset    grst_l  In  CTU  Synchronous reset    adbginit_l  In  CTU  Asynchronous reset    gdbginit_l  In  CTU  Synchronous reset    cluster_cken  In  CTU    cmp_gclk  In  CTU  Global clock input to cluster header    global_shift_enable  In  CTU    ctu_sctag_mbisten  In  CTU    ctu_sctag_scanin  In  CTU  
31.  J Bus  software  programmable   Read returns to the IOB may observe strict ordering with respect to  the writes to the L2 cache  software programmable      Chapter 6  J Bus Interface 6 5    6 1 3  J Bus Interrupt Requests to the IOB        A J Bus interrupt in the mondo vector format is received by the J Bus parser and  then it is stored in the interrupt queue before being sent to the IOB         A modified mondo interrupt transaction is where only the first data cycle is  forwarded to the CPU         The mondo interrupt queue is maximally sized to 16 entries  and there is no flow  control on queue         Interrupts to the IOB may observe strict ordering with respect to the writes to the  L2 cache  software programmable          An interrupt ACK NACK received from the IOB is first stored in the interrupt  ACK NACK queue  and then it is sent out on the J Bus     6 1 4  J Bus Interface Details    The J Bus interface has the following characteristics         JBI Requests the J Bus as agent 0        Masters transaction using agent ID 0 to 3        16 transaction IDs  TIDs  assigned in the least recently used order        A read TID becomes available when the read data is returned        A write TID is never marked unavailable        Responds to the addresses corresponding to agent ID 0 to 3        External J Bus arbitration          Adheres to the J Bus arbitration protocol         May arbitrate to maximize its time as the default owner  in order to  opportunistically drive the
32.  LSU is the source  of the latest traps in the pipeline     The trap logic unit  TLU  gathers traps from all functional units except the LSU  and  it then sends them to the LSU  the LSU performs the or function for all of them  plus  its own  and then it broadcasts across the entire chip     The LSU can also send a truncated flush for the internal ASI ld st to the TLU  the  MMU  and the SPU     2 4 21  LSU Error Handling    Errors can be generated from any  or all  of the following memory arrays     DCACHE   D cache   D cache tag array  DTAG   D cache valid bit array  DVA   DTLB  data fill  queue  DFQ   store buffer CAM array  SCM   and store buffer data array   STBDATA   Only the DCACHE  DTAG  and DTLB arrays are parity protected         A parity error on a load reference to the DCACHE will be corrected by way of the  reloading the correct data from the L2 cache as if there were a D cache miss         A DTAG parity error will result in a correction packet  followed by the actual load  request  to the L2 cache  The correction packet synchronizes the L2 directory and  the L1 D cache set  On the load request acknowledgement  the level 1 D cache  will be filled         A parity error on the DTLB tte data will cause an uncorrectable error trap to the  originating loads or stores         A parity error on the DTLB tte data can also cause an uncorrectable error trap for  ASI reads         A parity error on the DTLB tte tag can only cause an uncorrectable error trap for  ASI reads  
33.  Out  sparc4  CPX data ready    cpx_spc5_data_cx2 144 0   Out  sparc5  CPX SPARC data    cpx_spc5_data_rdy_cx2  Out  sparc5  CPX data ready    cpx_spc6_data_cx2 144 0   Out  sparc6  CPX SPARC data    cpx_spc6_data_rdy_cx2  Out  sparc6  CPX data ready    cpx_spc7_data_cx2 144 0   Out  sparc7  CPX SPARC data    cpx_spc7_data_rdy_cx2  Out  sparc7  CPX data ready    pcx_fp_data_px2 123 0   Out  FPU  PCX data    pcx_fp_data_rdy_px2  Out  FPU  PCX data ready    pcx_iob_data_px2 123 0   Out  IOB  PCX data    TABLE  3 5  CCX I O Signal List  Continued     Signal Name  I O  Source Destination  Description    3 12 OpenSPARC T1 Microarchitecture Specification      August 2006    pcx_iob_data_rdy_px2  Out  IOB  PCX data ready    pcx_sctag0_atm_px1  Out  L2 Bank0  PCX atomic packet    pcx_sctag0_data_px2 123 0   Out  L2 Bank0  PCX data    pcx_sctag0_data_rdy_px1  Out  L2 Bank0  PCX data ready    pcx_sctag1_atm_px1  Out  L2 Bank1  PCX atomic packet    pcx_sctag1_data_px2 123 0   Out  L2 Bank1  PCX data    pcx_sctag1_data_rdy_px1  Out  L2 Bank1  PCX data ready    pcx_sctag2_atm_px1  Out  L2 Bank2  PCX atomic packet    pcx_sctag2_data_px2 123 0   Out  L2 Bank2  PCX data    pcx_sctag2_data_rdy_px1  Out  L2 Bank2  PCX data ready    pcx_sctag3_atm_px1  Out  L2 Bank3  PCX atomic packet    pcx_sctag3_data_px2 123 0   Out  L2 Bank3  PCX data    pcx_sctag3_data_rdy_px1  Out  L2 Bank3  PCX data ready    pcx_spc0_grant_px 4 0   Out  sparc0  PCX grant to SPARC    pcx_spc1_grant_px 4 0   Out  sparc1  P
34.  PA  of an instruction that missed the  I cache  A second PA that matches the PA of an already pending I cache miss will  cause the second request to be put on hold and marked as a child of the pending  I cache miss request  The child request will be serviced when the pending I cache  miss receives its response  The MIL uses a linked list to track and service the  duplicated I cache miss request  The depth for such a linked list is four     MIL    pcxpkt  to LSU    PA    Cmp    RR arb    Chapter 2  SPARC Core 2 13    The MIL cycles through the following states     1  Make request     2  Wait for an I cache fill     3  Fill the first 16 bytes of data  The MIL sends a speculative completion notification  to the thread scheduler at the completion of filling the first 16 bytes     4  Fill the second 16 bytes of data  The MIL sends a completion notification to the  thread scheduler at the completion of filling the second 16 bytes     5  Done     An I cache miss request could be canceled because of  for example  a trap  The MIL  still goes through the motions of filling a cache line but it does not bypass it to the  thread instruction register  TIR   A pending child request must be serviced even if  the original parent I cache miss request was cancelled     When a child I cache miss request crosses with a parent I cache miss request  the  child request might not be serviced before the I cache fill for the parent request  occurs  The child instruction fetch shall be retired  rolled 
35.  a set of  common resources in support of their execution   The per thread resources include registers  a portion of I fetch data   path  store buffer  and miss buffer  The shared resources include the  pipeline registers and data path  caches  translation lookaside  buffers  TLB   and execution unit of the SPARC Core pipeline     ST  Single threaded     MT  Multi threaded     Hypervisor  HV   The hypervisor is the layer of system software that interfaces with  the hardware     Supervisor  SV   The supervisor is the layer of system software such as operation  system  OS  that executes with privilege     Long latency  instruction  LLI     LLI represents an instruction that would take more than one SPARC  core clock cycle to make its results visible to the next instruction     OS instance 1    Applications    Hypervisor    OpenSPARC T1    OS instance 2    Chapter 2  SPARC Core 2 5    2 2  SPARC Core I O Signal List    TABLE  2 2 lists and describes the SPARC Core I O signals     TABLE  2 2  SPARC Core I O Signal List    Signal Name  I O    Source   Destination  Description    pcx_spc_grant_px 4 0   In  CCX PCX  PCX to processor grant info    cpx_spc_data_rdy_cx2  In  CCX CPX  CPX data in flight to SPARC    cpx_spc_data_cx2 144 0   In  CCX CPX  CPX to SPARC data packet    const_cpuid 3 0   In  Hard wired  CPU ID    const_maskid 7 0   In  CTU  Mask ID    ctu_tck  In  CTU  To IFU of sparc_ifu v    ctu_sscan_se  In  CTU  To IFU of sparc_ifu v    ctu_sscan_snap  In  CTU  To IFU of s
36.  and Vector Interrupts    INTR   CPX  PKT  from LSU    Redirect PC NPC to IFU    Pending INTR to IFU  level     INTR cmd  m stage   with  nop  from IFU    Gen  TrapPC  Update Stack  and State Regs    Decode    State and  Ctrl Regs   Trap Stack    63               0    5           0    FFs    Interrupt Packet to LSU  HV SW write  PCX Intr  PKT  to CCX  LSU    Interrupt Receive Register    Incoming Vector  Register    5           0    CPU Interrupt  Vector Dispatch  Register    Chapter 2  SPARC Core 2 59    FIGURE  2 32 illustrates the flow of reset  idle  or resume interrupts     FIGURE  2 32 Flow of Reset or Idle or Resume Interrupts    INTR   CPX  PKT  from LSU    Redirect PC NPC to IFU    RESET INTR cmd  m   with  nop  from IFU    INTR to IFU  pulse     Gen  Trap Vector  Update TSA  and State Regs    Decode    State and  Ctrl Regs   Stack    Reset Type   Vector     FFs    2 60 OpenSPARC T1 Microarchitecture Specification      August 2006    FIGURE  2 33 illustrates the flow of software and timer interrupts     FIGURE  2 33 Flow of Software and Timer Interrupts    SV Intr  to  IFU  level     16 15 14  0    Redirect PC NPC to IFU    HV Intr  to IFU  level     INTR cmd  m  with  nop  from IFU    SW write to  SOFTINT_REG  SET_SOFTINT  CLEAR_SOFTINT     SW write    SV Intr  to  IFU  level     Resolve priority  Gen  Trap Vector  Update TSA  and State Regs              CPU_mondoQ  Head   Tail  SV   Dev_mondoQ  Head   Tail  SV   ResumableError  _mondoQ  Head   Tail  SV     State an
37.  and a data cache  complex     4  Trap logic unit  TLU  includes trap logic and trap program counters     5  Stream processing unit  SPU  is used for modular arithmetic functions for crypto     6  Memory management unit  MMU      7  Floating point frontend unit  FFU  interfaces to the FPU     Fetch  Thrd Sel  Decode  Execute  Memory  WB    ICache  Itlb  Inst  buf x 4  Crossbar  Interface  Thrd  Sel  Mux    PC logic  x 4  Thrd  Sel  Mux    Decode    Thread  select  logic    Thread selects  Instruction type  Misses  Traps and interrupts  Resource conflicts    Regfile  x 4  Crypto  Coprocessor    Alu  Mul  Shft  Div    DCache  Dtlb  Stbuf x 4    1 6 OpenSPARC T1 Microarchitecture Specification      August 2006    1 3 1 1  Instruction Fetch Unit    The thread selection policy is as follows     a switch between the available threads  every cycle giving priority to the least recently executed thread  The threads become  unavailable due to the long latency operations like loads  branch  MUL  and DIV  as  well as to the pipeline stalls like cache misses  traps  and resource conflicts  The  loads are speculated as cache hits  and the thread is switched in with lower priority     Instruction cache complex has a 16 Kbyte data  4 way  32 byte line size with a single  ported instruction tag  It also has dual ported  1R 1W  valid bit array to hold cache  line state of valid invalid  Invalidates access the V bit array  not the instruction tag   A pseudo random replacement algorithm is used 
38.  and these two packets are delivered in two cycles     The total number of cycles required for a packet to travel from the source to the  destination may be more than the number of cycles required to deliver a packet  This  issue occurs when the PCX  or the CCX  uses more than one cycle to deliver the  packet  The PCX  or the CCX  uses more than one cycle to deliver a particular packet  if multiple sources can send packets for the same destination     3 1 3  Processor Cache Crossbar Packet Delivery    The processor cache crossbar  PCX  accepts packets from a source  any of eight  SPARC CPU cores  and delivers the packet to its destination  any one of the four L2   cache banks  the I O bridge  or the FPU      A source sends a packet and a destination ID to the PCX  These packets are sent on  a 124 bit wide bus  Out of the 124 bits  40 bits are used for address  64 bits for data   and rest of the bits are used for control  The destination ID is sent on a separate 5 bit  bus  Each source connects with its own separate bus to the PCX  Therefore  there are  eight buses that connect from the CPUs to the PCX     The PCX connects to each destination by way of a separate bus  However  the FPU  and I O bridge share the same bus  Therefore  there are five buses that connect the  PCX to the six destinations  The PCX does not perform any packet processing and  therefore the bus width from the PCX to each destination is 124 bits wide  which is  identical to the PCX packet width  FIGURE  3 
39.  cache notifies the SPU of the error          If the INT bit is set in the SPU  there is an ECC_error trap on the thread  specified in SPU control register  in addition to the completion interrupt   Or  the error trap is signalled to IFU when the sync load occurs         Correctable errors detected on the writeback data  DMA read  or DMA partial  writes   lt 4B  result in a ECC_error trap on the steering thread          Errors on writeback data is fixed before writing to memory          DMA partial stores correct the L2 cache data         Correctable errors detected during a scrub are logged in the L2 cache registers          Corrected data is written to the L2 cache          ECC_error trap is taken on the steering thread         Correctable errors detected on any of the 12 tags in a set during an access causes          The hardware to correct all tags in the set          An ECC_error trap on steering thread     9 3 4  L2 Cache Uncorrectable Errors        Error information is captured in the L2 cache error status and the L2 cache error  address registers         If the L2 error enable non correctable error enable  NCEEN  bit is set          Error is also logged in the SPARC error status and the SPARC error address  registers          Erroneous data is loaded in the L1 cache with bad parity         If the SPARC error enable NCEEN bit is set  a precise trap is generated on the  requesting thread         Partial Stores  less than 4 bytes           Do not update the cache        
40.  can send an interrupt when reaching a programmed count        All correctable and uncorrectable errors are logged and sent to the L2 cache along  with the data        DRAM scrub errors are also forwarded to L2 cache independently        Error location register logs the error nibble position on correctable errors        The scrub error address is also logged in the error address register    8 1 5  Repeatability and Visibility        For repeatability          The arbiter states for the RAS and the CAS picker are reset         The scrub address is reset         The refresh counter is software programmable  it does not reset         Visibility is plenty for address data         The address can be reconstructed from the RAS and CAS address  chip select   and bank bits by knowing the configuration registers         External visible check bits have to be XORed with the address parity in order  to get the true ECC    Chapter 8  DRAM Controller 8 7    8 1 6  DDR II Addressing    The characteristics of DDR II addressing include         Burst lengths of 4 and 8 are supported        Various DRAM chips are supported and their addressing is shown in TABLE  8 1        The address bit A10 is used as auto precharge bit        The DRAM bank bits are hashed as follows     new_dimm_bank 2 0    dimm_bank 2 0    addr 20 18    addr 30 28          TABLE  8 2 shows the physical address  PA  decoding to the DIMM address  bank    address  row address  and column address      TABLE  8 1  DDR II Addres
41.  debug port data even when it has nothing to issue   software controlled          JBI starts up in multi segment arb mode  which can be change by way of  software        Flow control   address OK  AOK  and data OK  DOK          Uses only AOK off to flow control the J Bus when WDQ reaches its high  watermark         DOK off is not used for flow control         Follows J Bus protocol when other agents assert their AOKs DOKs    6 6 OpenSPARC T1 Microarchitecture Specification      August 2006    6 1 5  Debug Port to the J Bus        There are two debug first in first outs  FIFOs   each with 32 entries        Programmable to fill and dump from one or both FIFOs  If both FIFOs are  programmed  then each FIFO is alternately filled  but they are dumped both in  parallel  thus using half as many cycles         Arbitration models for using the J Bus to report debug data include          Default   when the J Bus is idle  and the JBI has no other transactions available  to issue on to the J Bus  the JBI opportunistically dumps debug data if it is the  default owner         DATA_ARB   JBI will arbitrate  arb  whenever the FIFOs are higher than the  low watermark  and the JBI is not the bus owner         AGGR_ARB   JBI arbs whenever it does not own the bus  so the bus behavior  does not change based on the quantity of the debug output        Debug data appears on the J Bus as a Read16 return cycle to the AID4 with debug  data payload on J_AD 127 0         Fake DMA range  0x80_1000_0000 to
42.  execution     FIGURE  2 30 illustrates the trap flow with respect to the hardware blocks     FIGURE  2 30 Trap Flow With Respect to the Hardware Blocks         IFU Traps  m   and Interrupts    PC  NOC from IFU    CWP  CCR from EXU    ASI_REGs from LSU    Flush to LSU    Early Trap  to EXU    FP Traps  SpillTraps    Redirect  PC NPC  to IFU    EXU Traps  m     SPU Traps  m     LSU Traps  m     MUX  MUX    Processor State   HPSTATE  TL   PSTATE  etc     Pend  Async  Traps    Trap Stack    TBA Select    TLU  detected  Traps    Resolve Priority    FFs    HTBA    TBA    Reset Vect     Update  State  Regs     Final  TType    Chapter 2  SPARC Core 2 57    2 10 4  Trap Program Counter Construction    The following list highlights the algorithm for constructing the trap program counter   TPC          Supervisor trap  SV trap     Redirect PC  lt    TBA 47 15    TL gt 0   TTYPE 8 0   5   b00000         Hypervisor trap  HV trap     Redirect PC  lt    TBA 47 14   TTYPE 8 0   5   b00000         Traps in non split mode    Redirect PC  lt    TBA 47 15    TL gt 0   TTYPE 8 0   5   b00000         Reset trap    Redirect PC  lt    RSTVAddr 47 8    TL gt 0   RST_TYPE 2 0   5   b00000          RSTVAddr   0xFFFFFFFFF0000000        Done instruction    Redirect PC  lt   TNPC TL         Retry instruction    Redirect PC  lt   TPC TL     Redirect NPC  lt   TNPC TL     2 10 5  Interrupts    The software interrupts are delivered to each virtual core using the interrupt_level_n  traps  0x41 0x4f  through 
43.  for supervisor    ii  On power on reset  POR  or warm reset    TL   MAXTL   6     GL   MAXGL   3     iii  On software write    For hypervisor     TL  lt   min wr data 2 0   MAXTL  for hypervisor    GL  lt   min wr data 3 0   MAXGL  for hypervisor    For supervisor     TL  lt   min wr data 2 0   2  for supervisor    GL  lt   min wr data 3 0   2  for supervisor    b  PC   gt  TPC TL     c  NPC   gt  TNPC TL     d   ASI_REG  CCR_REG  GL  PSTATE    gt  TSTATE TL     e  Final_Trap_Type   gt  TTYPE TL     f  HPSTATE   gt  HTSTATE TL     g  Update HPSTATE enb  red  priv  and so on  register    h  Update PSTATE priv  ie  and so on  register    Chapter 2  SPARC Core 2 65    2  On done or retry instructions     restore states from trap stack    a  Update the trap level  TL  and the global level  GL     TL  lt   TL  1    GL  lt   Restore from trap stack  TL and apply CAP    b  Restore all the registers including PC  NPC  HPSTATE  PSTATE  from the trap  stack   TL     c  Send CWP and CCR register updates to the execution unit  EXU     d  Send ASI register update to load store unit  LSU     e  Send restored PC and NPC to the instruction fetch unit  IFU     f  Decrement TL    2 10 12  Trap Stack    The OpenSPARC T1 processor supports a six deep trap stack for six trap levels  The  trap stack has one read port and one write port  1R1W   and it stores the following  registers         PC        NPC        HPSTATE  Note  The HPSTATE enb bit is not saved         PSTATE        GL        CWP    
44.  io_j_rst_l  In  PADS  To ctu_clsp of ctu_clsp v    io_pll_char_in  In  PADS  To ctu_clsp of ctu_clsp v  and so on    io_pwron_rst_l  In  PADS  To ctu_clsp of ctu_clsp v  and so on    io_tck  In  PADS  To u_tck_dr of bw_u1_ckbuf_30x v  and so on    io_tck2  In  PADS  To ctu_clsp of ctu_clsp v    io_tdi  In  PADS  To ctu_dft of ctu_dft v    io_test_mode  In  PADS  To ctu_dft of ctu_dft v    io_tms  In  PADS  To ctu_dft of ctu_dft v    TABLE  10 2 CTU I O Signal List  Continued     Signal Name  I O  Source   Destination  Description    Chapter 10  Clocks and Resets 10 17    io_trst_l  In  PADS  To ctu_dft of ctu_dft v    io_vdda_pll  In  PADS  To u_pll of bw_pll v    io_vdda_rng  In  PADS  To u_rng of bw_rng v    io_vdda_tsr  In  PADS  To u_tsr of bw_tsr v    io_vreg_selbg_l  In  PADS  To u_rng of bw_rng v    iob_clsp_data 3 0   In  IOB  To ctu_clsp of ctu_clsp v    iob_clsp_stall  In  IOB  To ctu_clsp of ctu_clsp v    iob_clsp_vld  In  IOB  To ctu_clsp of ctu_clsp v    iob_ctu_coreavail 7 0   In  IOB  To ctu_dft of ctu_dft v    iob_ctu_l2_tr  In  IOB  To ctu_clsp of ctu_clsp v    iob_ctu_tr  In  IOB  To ctu_clsp of ctu_clsp v    iob_tap_data 7 0   In  IOB  To ctu_dft of ctu_dft v    iob_tap_stall  In  IOB  To ctu_dft of ctu_dft v    iob_tap_vld  In  IOB  To ctu_dft of ctu_dft v    jbi_ctu_tr  In  JBI  To ctu_clsp of ctu_clsp v    jbus_gclk  In  JBI  To u_jbus_header of bw_clk_cl_ctu_jbus v    jbus_gclk_cts  In  JBI  To u_jbus_gclk_dr of bw_u1_ckbuf_30x v    jbus_gclk_dup  In  
45.  is a registered trademark in the U S  and other countries  exclusively licensed through X Open Company  Ltd     The Adobe logo is a registered trademark of Adobe Systems  Incorporated     Products covered by and information contained in this service manual are controlled by U S  Export Control laws and may be subject to the  export or import laws in other countries  Nuclear  missile  chemical biological weapons or nuclear maritime end uses or end users  whether  direct or indirect  are strictly prohibited  Export or reexport to countries subject to U S  embargo or to entities identi   ed on U S  export exclusion  lists  including  but not limited to  the denied persons and specially designated nationals lists is strictly prohibited     DOCUMENTATION IS PROVIDED  AS IS  AND ALL EXPRESS OR IMPLIED CONDITIONS  REPRESENTATIONS AND WARRANTIES   INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY  FITNESS FOR A PARTICULAR PURPOSE OR NON INFRINGEMENT   ARE DISCLAIMED  EXCEPT TO THE EXTENT THAT SUCH DISCLAIMERS ARE HELD TO BE LEGALLY INVALID     Copyright    2006 Sun Microsystems  Inc   4150 Network Circle  Santa Clara  California 95054  Etats Unis  Tous droits r  serv  s     Sun Microsystems  Inc  d  tient les droits de propri  t   intellectuels relatifs    la technologie incorpor  e dans le produit qui est d  crit dans ce  document  En particulier  et ce sans limitation  ces droits de propri  t   intellectuelle peuvent inclure un ou plus des brevets am  ricains list  s     l   adres
46.  prefetch that misses in  the TLB  or accesses I O space  will be treated as a NOP  The issuing thread will be  switched back in without accessing the processor to L2 cache interface  PCX      The LSU supports a total of eight outstanding prefetch instructions across all four  threads  The LSU keeps track of the number of outstanding prefetches per thread   which limits the number of outstanding prefetches     2 4 16  Floating Point BLK LD and BLK ST Instructions  Support    Floating point blk ld and blk st instructions are non TSO compliant  Only one  outstanding blk ld or blk st instruction is allowed per SPARC core  These  instructions will bypass the level 1 caches and will not allocate in the level 1 caches  either  On a level 1 D cache  L1D  hit  a blk st instruction will cause an invalidation  to the L1D  Both blk st and blk ld instructions can access the memory space and the  I O space     The LSU breaks up a the 64 byte packet of a blk ld instruction into four of 16 byte  load packets so that they can access the processor and L2 cache interface  PCX   The  Level 2 cache returns four of the 16 byte packets  which in turn  will cause eight of 8   byte data transfers to the floating point register file  FRF   Errors are reported on the  last packet  A blk ld instruction could cause a partial update to the FRF  Software  must be written to retry the instruction later     A blk st instruction will be unrolled into eight helper instructions by the floating   point function
47.  provided by the  FFU         The FFU executes all floating point move  FMOV  instructions  The FPU does not  require any conditional move information  A 2 bit FSR condition code  FCC  field  identifier  fcc0  fcc1  fcc2  fcc3  is provided to the FPU so that the floating point  compare  FCMP  target fcc field is known when the FPU result is returned to the  FFU         The FPU provides IEEE exception status flags to the FFU for each instruction  completed  The FFU determines if a software trap  fp_exception_ieee_754  is  required based on the IEEE exception status flags supplied by the FPU and the  IEEE trap enable bits located in the architected FSR         A denormalized operand or result will never generate an unfinished FPop trap to  the software  The hardware provides full support for denormalized operands and  results         Each of the five IEEE exception status flags and associated trap enables are  supported     invalid operation  zero divide  overflow  underflow  and inexact         IEEE traps enabled mode     if an instruction generates an IEEE exception when the  corresponding trap enable is set  then a fp_exception_ieee_754 trap is generated  and results are inhibited by the FFU          The destination register remains unchanged         FSR condition codes  fcc  remain unchanged         FSR aexc field remains unchanged         FSR cexc field has one bit set corresponding to the IEEE exception        All four IEEE round modes are supported in hardware         Th
48.  sequence    10  Send an interrupt to a thread in a core    a  The CTU activates a wake thread signal to the IOB    10 14 OpenSPARC T1 Microarchitecture Specification      August 2006    b  The IOB generates an interrupt packet to thread 0 of the lowest numbered  SPARC core marked enabled    c  The SPARC core starts fetching instructions from SSI interface    Cold Reset Sequence    A cold reset sequence has four steps     1  Assertion of the PWRON_RST_L reset  which performs steps 1  2  and 3 of the  preceding generic reset sequence described in Section  10 1 2 3     Reset Sequence       on page  10 11     2  Deassertion of the PWRON_RST_L reset  which performs step 4 of the generic  reset sequence     3  Assertion of the J_RST_L reset  which performs steps 5 and 6 of the generic reset  sequence     4  Deassertion of the J_RST_L reset  which performs the steps 7  8  9  and 10 of the  generic reset sequence     There are two types of the cold resets   normal and deterministic  The timing of  the J_RST_L reset assertion determines the reset type  On the tester  the  deterministic type is used     Warm Reset Sequence    Warm reset sequence has only 2 steps  and during warm reset PWRON_RST_L  remains unasserted throughout  The 2 steps are     1  Assertion of the J_RST_L reset  which performs steps 1 through 6 of the preceding  generic reset sequence described in Section  10 1 2 3     Reset Sequence    on    page  10 11     2  Deassertion of the J_RST_L reset  which performs steps
49.  such as blk ld  blk st  quad ASI   and so on        Defining an explicit context for address translation at all levels of privilege  such  as primary  secondary  as_if_user  as_if_supv  and so on        Defining special attributes  such as non_faulting and endianness  and so on        Defining address translation bypassed  such as  RA PA    VA PA   and so on   where VA stands for virtual address  PA stands for physical address  and RA  stands for real address    2 4 11  Support for Atomic Instructions  CAS  SWAP   LDSTUB     CAS is issued as a two packet sequence to the Processor to L2 cache Interface  PCX    Packet 1 contains the compare address  rs1  and the data  rs2   Packet 2 contains the  swap data  rd   Packet 1 resides in the store buffer in order to be compliant to the  TSO ordering  while Packet 2 occupies the thread   s entry into the load miss queue   LMQ      Packet 1 and Packet 2 are issued in back to back order to the PCX  An  acknowledgement to the load is returned to the CPX in response to Packet 1   This acknowledgment contains the data in memory from the address in  rs1   An  acknowledgement to the store is returned on the CPX in response to Packet 2  This  acknowledgement will cause an invalidation at address in  rs1  if the cache line is  present in the level 1 D cache     SWAP and LDSTUB are single packet requests to the PCX  and they reside in the  store buffer     Chapter 2  SPARC Core 2 29    2 4 12  Support for MEMBAR Instructions    MEMBAR instruct
50.  the errors and manages the error registers     cluster_cken  In  CTU  To spc_hdr of cluster_header v    gclk  In  CTU  To spc_hdr of cluster_header v    cmp_grst_l  In  CTU  Synchronous reset    cmp_arst_l  In  CTU  Asynchronous reset    ctu_tst_pre_grst_l  In  CTU  To test_stub of test_stub_bist v    adbginit_l  In  CTU  Asynchronous reset    gdbginit_l  In  CTU  Synchronous reset    spc_pcx_req_pq 4 0   Out  CCX PCX  processor to pcx request    spc_pcx_atom_pq  Out  CCX PCX  processor to pcx atomic request    spc_pcx_data_pa 123 0   Out  CCX PCX  processor to pcx packet    spc_sscan_so  Out  DFT  Shadow scan out    spc_scanout0  Out  DFT  Scan out    spc_scanout1  Out  DFT  Scan out    tst_ctu_mbist_done  Out  CTU  MBIST done    tst_ctu_mbist_fail  Out  CTU  MBIST fail    spc_efc_ifuse_data  Out  EFC  From IFU of sparc_ifu v    spc_efc_dfuse_data  Out  EFC  From IFU of sparc_ifu v    TABLE  2 2  SPARC Core I O Signal List  Continued     Signal Name  I O  Source   Destination  Description    Chapter 2  SPARC Core 2 7    2 3 1  SPARC Core Pipeline    There are six stages in a SPARC core pipeline         Fetch     F stage        Thread selection     S stage        Decode     D stage        Execute     E stage        Memory     M stage        Writeback     W stage    The I cache access and the ITLB access take place in fetch stage  A selected thread   hardware strand  will be picked in the thread selection stage  The instruction  decoding and register file access occur in the 
51.  the trap will be deferred to the instruction following the store instruction    However  the trap PC delivered to the system software still points to the store  instruction that encountered the parity error in the TTE access  Therefore  the  deferred action of the trap generation will still cause a precise trap from the system  software perspective     Chapter 2  SPARC Core 2 25    2 4 5  Store Buffer    The physical structure of the store buffer  STB  consists of a store buffer CAM  SCM   and a store buffer data array  STBDATA   Each thread is allocated with eight fixed  entries in the shared data structures  The SCM has one CAM port and one RW port   and the STBDATA has one read  1R  port and one write  1W  port     All stores reside in the store buffer until they are ordered following a total store  ordering  TSO  model and have updated the L1D  level 1 D cache   The lifecycle of a  TSO compliant store follows these four stages     1  Valid    2  Commit  issued to L2 cache     3  Acknowledged  L2 cache sent response     4  Invalidated or L1D updated    Non TSO complaint stores  such as blk init and other flavors of bst  block store   will  not follow the preceding life cycle  A response from the L2 cache is not required  before releasing the non TSO complaint stores from the store buffer     Atomic instructions such as CAS  LDSTUB  and SWAP  as well as flush instructions   can share the store buffer     The store buffer implements partial and full read after write  RAW  c
52.  this parameter is set  the SPU will generate a disrupting trap to the  current thread on completion of the current modular arithmetic operation  If  cleared  software can synchronize with the current modular arithmetic  operation using the MA_Sync instruction     The disrupting trap will use the implementation_dependent_exception_20 as the  modular arithmetic interrupt          Opcode     Operation code of modular arithmetic operation  see TABLE  2 3          Length     Length of modular arithmetic operations    This field contains the bits for the value of the  length   1  for the modular  arithmetic operations     TABLE  2 3  Modular Arithmetic Operations    Opcode Value  Modular Arithmetic Operation    0  Load from modular arithmetic memory    1  Store to modular arithmetic memory    2  Modular multiply    3  Modular reduction    4  Modular exponentiation loop    5 7  Reserved    2 40 OpenSPARC T1 Microarchitecture Specification      August 2006    2 8 2  Data Flow of Modular Arithmetic Operations    FIGURE  2 22 illustrates the data flow of modular arithmetic operations     FIGURE  2 22 Data Flow of Modular Arithmetic Operations    2 8 3  Modular Arithmetic Memory  MA Memory     A total of 1280 bytes of local memory with a single read write port  1RW  is used to  supply operands to modular arithmetic operations  The modular arithmetic memory   MA memory  will house 5 operands with 32 words each  which supports a  maximum key size of 2048 bits  This MA memory is parity pr
53.  to complete     10 1 2 3  Reset Sequence    There are two types of reset sequences   a cold reset and a warm reset  While there  are 10 generic steps in the reset sequence  and all 10 are done during a cold reset   steps 8 and 9 are not done in the warm reset     These 10 generic reset sequence steps are described as     1  Assert resets    a  Asynchronous resets     ARST_L and ADBGINIT_L    b  Synchronous resets     GRST_L and GDBGINIT_L    c  For cold resets  assertion of the PWRON_RST_L reset asserts all resets   Deassertion of the PWRON_RST_L reset deasserts only asynchronous ones   while the synchronous ones remain asserted     d  For warm resets  only synchronous resets are asserted     i  C and J domain resets are asserted about the same time    ii  For fchg and warm resets  CREG_CLK_CTL SRARM defines whether the rfsh  attribute is on or off    iii  If the rfsh is not on  the D domain reset is asserted at the same time    iv  If rfsh is on  the self_refresh signal to DRAM is asserted  and the D reset is  asserted about 1600 ref cycles after C and J resets    2  Turn off clock enables    a  For cold resets  this sequence is instantaneous  Assertion of the PWRON_RST_L  reset turns on clock enables for the PADs misc  jbusl  jbusr  dbg  and turns off  all others    10 12 OpenSPARC T1 Microarchitecture Specification      August 2006    b  For warm resets  the clock turn off is staggered    i  Starting cluster is 0  for sparc0     ii  Progression is in the CREG_CLK_CTL bit 
54.  value    FNEG s d   Floating point negate    Chapter 7  Floating Point Unit 7 5    7 1 2  FPU Input FIFO Queue    The OpenSPARC FPU input first in first out  FIFO  queue has the following  characteristics         Contains  16 entry x 155 bits  1R 1W ports         The input FIFO queue accepts input data from the crossbar  One source operand  per cycle is transferred  The crossbar will always provide a two cycle transfer   Single source instructions produce an invalid transfer on the second cycle         A bypass path around the FIFO is provided when the FIFO is empty         While a two source instruction requires two valid transfers  the two transfers are  merged into a single 155 bit entry prior updating or bypassing the FIFO         For single source instructions  the FPU forces rs1 to zero  within the 155 bit entry   prior to updating or bypassing the FIFO         For single precision operands  the unused 32 bit region of the 64 bit source is  forced to zero by the FFU  The 32 bits of single precision data is always contained  in the upper 32 bits of the 64 bit source         One instruction per cycle may be issued from the FIFO queue to one of the three  execution pipelines  FPA  FPM  or FPD          Prior to updating or bypassing the FIFO  five tag bits are generated per source  operand  This creates a 69 bit source operand width  64 5 69   The five tag bits  convey information about the zero fraction  the zero exponent  and the all ones  exponent         Eight FIFO ent
55. 1        x    0  for    x     0 or      or NaN    result        FSR dzc 1    result   max  or        FSR ofc 14    result   0 or    min or    denorm  FSR ufc 15  4    result IEEE6    FSR nxc 17    FiTOs  result IEEE6    FSR nxc 1    FiTOd  Cannot generate IEEE exceptions    FMOV s d   Executed in SPARC core FFU  cannot generate IEEE exceptions     FMOV s d cc  Executed in SPARC core FFU  cannot generate IEEE exceptions     FMOV s d r  Executed in SPARC core FFU  cannot generate IEEE exceptions     FMUL s d       SNaN             0  result NaN1  2    FSR nvc 1    result   max  or        FSR ofc 14    result   0 or    min or    denorm  FSR ufc 15  4    result IEEE6    FSR nxc 17    FNEG s d   Executed in SPARC core FFU  cannot generate IEEE exceptions     7 14 OpenSPARC T1 Microarchitecture Specification      August 2006    FsMULd      SNaN             0  result NaN1  2    FSR nvc 1    FSQRT s d   Unimplemented    F s d TOi      NaN               large  result max    integer3    FSR nvc 1    result IEEE6    FSR nxc 1    FsTOd      SNaN  result NaN2    FSR nvc 1    FdTOs      SNaN  result NaN2    FSR nvc 1    result   max or         FSR ofc 14    result   0 or    min or    denorm  FSR ufc 15  4    result IEEE6    FSR nxc 17    F s d TOx      NaN               large  result max    integer3    FSR nvc 1    result IEEE6    FSR nxc 1    FSUB s d       SNaN                   result NaN1  2    FSR nvc 1    result   max  or        FSR ofc 14    result   0 or    min or    denorm  FSR uf
56. 1    dram_io_cke0  Out  PADS  DRAM CKE 0    dram_io_cke1  Out  PADS  DRAM CKE 1    dram_io_clk_enable0  Out  PADS  DRAM clock enable 0    dram_io_clk_enable1  Out  PADS  DRAM clock enable 1    dram_io_cs0_l 3 0   Out  PADS  DRAM CS 0    dram_io_cs1_l 3 0   Out  PADS  DRAM CS 1    dram_io_data0_out 287 0   Out  PADS  DRAM data 0    dram_io_data1_out 287 0   Out  PADS  DRAM data 1    dram_io_drive_data0  Out  PADS  From dramctl0 of dramctl v    dram_io_drive_data1  Out  PADS  From dramctl1 of dramctl v    dram_io_drive_enable0  Out  PADS  From dramctl0 of dramctl v    dram_io_drive_enable1  Out  PADS  From dramctl1 of dramctl v    dram_io_pad_clk_inv0  Out  PADS    dram_io_pad_clk_inv1  Out  PADS    dram_io_pad_enable0  Out  PADS    dram_io_pad_enable1  Out  PADS    dram_io_ptr_clk_inv0 4 0   Out  PADS    dram_io_ptr_clk_inv1 4 0   Out  PADS    dram_io_ras0_l  Out  PADS  DRAM RAS 0    dram_io_ras1_l  Out  PADS  DRAM RAS 1    dram_io_write_en0_l  Out  PADS  DRAM write enable 0    TABLE  8 4  DRAM Controller I O Signal List  Continued     Signal Name  I O  Source   Destination  Description    8 12 OpenSPARC T1 Microarchitecture Specification      August 2006    dram_io_write_en1_l  Out  PADS  DRAM write enable 1    dram_sctag0_data_vld_r0  Out  SCTAG0    dram_sctag0_rd_ack  Out  SCTAG0    dram_sctag0_scb_mecc_err  Out  SCTAG0    dram_sctag0_scb_secc_err  Out  SCTAG0    dram_sctag0_wr_ack  Out  SCTAG0    dram_sctag1_data_vld_r0  Out  SCTAG1    dram_sctag1_rd_ack  Out  SCTAG1    dr
57. 1  for SP  and between 9 and 60 for DP         Zero results always produce a fixed execution latency of 7 SP  7 DP         Infinity or QNaN results always produce a fixed execution latency of 32 SP  61 DP         The FPD uses a shift subtract restoring algorithm generating 1 bit per cycle         The FDIV instructions execute in a dedicated datapath and are non blocking         The FPD execution datapath is implemented in seven pipeline stages  D1 through  D7    See TABLE  7 5 for details of these stages      TABLE  7 4  FPM Datapath Stages    Stage  Action    M1  Format input operands  booth recoder    M2     M4        Generate partial products using a radix 4 booth algorithm      Accumulate partial products using a Wallace tree configuration      Add the two Wallace tree outputs using a carry propagate adder    M5  Normalize    M6  Round    Chapter 7  Floating Point Unit 7 9    7 1 7  FPU Power Management    FPU power management is accomplished by way of block controllable clock gating   Clocks are dynamically disabled or enabled as needed  thus reducing clock power  and signal activity when possible     The FPU has independent clock control for each of the three execution pipelines   FPA  FPM  and FPD   Clocks are gated for a given pipeline when it is not in use  so  a pipeline will have its clocks enabled only under one of the following conditions         The pipeline is executing a valid instruction        A valid instruction is issuing to the pipeline        The reset i
58. 2 illustrates this PCX interface     Since both the FPU and the I O bridge share a destination ID  the packets intended  for each get routed to both  The FPU and I O bridge each decode the packet to  decide whether to consume or discard the packet     A source can send at most two single packet requests or one two packet request to a  particular destination  There is a 2 deep queue inside the PCX for each source   destination pair that holds the packet  The PCX sends a grant to the source after  dispatching a packet to its destination  Each source uses this handshake signal to  monitor the queue full condition     The L2 caches and the I O bridge can process a limited number of packets  When a  destination reaches its limit  it sends a stall signal to the PCX  This stall signal  prevents the PCX from sending the grant to a source  CPU core   The FPU  however   cannot stall the PCX     3 4 OpenSPARC T1 Microarchitecture Specification      August 2006    3 1 4  Cache Processor Crossbar Packet Delivery    The cache processor crossbar  CPX  accepts packets from a source  which can be one  of the four L2 cache banks  the I O bridge  or the FPU  and delivers the packet to its  destination  one any of eight SPARC CPU cores      A source sends a packet and a destination ID to the CPX  The packets are sent on a  145 bit wide bus  Out of the 145 bits  the 128 bits is used for data and the rest of the  bits are used for control     The destination ID is sent on a separate 8 bit bus  Eac
59. 4    3 1 5  CPX and PCX Packet Formats  3   5    3 2  CCX I O List  3   9    3 3  CCX Timing Diagrams  3   13    3 4  PCX Internal Blocks Functional Description  3   17    3 4 1  PCX Overview  3   17    3 4 2  PCX Arbiter Data Flow  3   18    3 4 3  PCX Arbiter Control Flow  3   19    3 5  CPX Internal Blocks Functional Description  3   20    3 5 1  CPX Overview  3   20    3 5 2  CPX Arbiters  3   20    4   Level 2 Cache  4   1    4 1  L2 Cache Functional Description  4   1    4 1 1  L2 Cache Overview  4   1    4 1 2  L2 Cache Single Bank Functional Description  4   2    4 1 2 1  Arbiter  4   4    4 1 2 2  L2 Tag  4   4    4 1 2 3  L2 VUAD States  4   4    4 1 2 4  L2 Data  scdata   4   5    4 1 2 5  Input Queue  4   5    4 1 2 6  Output Queue  4   6    viii OpenSPARC T1 Microarchitecture Specification      August 2006    4 1 2 7  Snoop Input Queue  4   6    4 1 2 8  Miss Buffer  4   6    4 1 2 9  Fill Buffer  4   7    4 1 2 10  Writeback Buffer  4   8    4 1 2 11  Remote DMA Write Buffer  4   8    4 1 2 12  L2 Cache Directory  4   8    4 1 3  L2 Cache Pipeline  4   9    4 1 3 1  L2 Cache Transaction Types  4   9    4 1 3 2  L2 Cache Pipeline Stages  4   10    4 1 4  L2 Cache Instruction Descriptions  4   12    4 1 4 1  Loads  4   12    4 1 4 2  Ifetch  4   12    4 1 4 3  Stores  4   13    4 1 4 4  Atomics  4   13    4 1 4 5  J Bus Interface Instructions  4   14    4 1 4 6  Eviction  4   16    4 1 4 7  Fill  4   16    4 1 4 8  Other Instructions  4   16    4 1 5  L2 Cache Mem
60. 9  ASI Queue and Bypass Queue  2   27    2 4 10  Alternate Space Identifier Handling in the Load Store Unit  2   28    2 4 11  Support for Atomic Instructions  CAS  SWAP  LDSTUB   2   28    2 4 12  Support for MEMBAR Instructions  2   29    2 4 13  Core to Core Interrupt Support  2   29    2 4 14  Flush Instruction Support  2   29    2 4 15  Prefetch Instruction Support  2   30    2 4 16  Floating Point BLK LD and BLK ST Instructions Support  2   30    2 4 17  Integer BLK INIT Loads and Stores Support  2   31    2 4 18  STRM Load and STRM Store Instruction Support  2   31    2 4 19  Test Access Port Controller Accesses and Forward Packets  Support  2   31    2 4 20  SPARC Core Pipeline Flush Support  2   32    2 4 21  LSU Error Handling  2   32    2 5  Execution Unit  2   33    2 6  Floating Point Frontend Unit  2   35    2 6 1  Functional Description of the FFU  2   35    2 6 2  Floating Point Register File  2   36    2 6 3  FFU Control  FFU_CTL   2   36    2 6 4  FFU Data Path  FFU_DP   2   37    2 6 5  FFU VIS  FFU_DP   2   37    2 7  Multiplier Unit  2   37    2 7 1  Functional Description of the MUL  2   37    2 8  Stream Processing Unit  2   38    2 8 1  ASI Registers for the SPU  2   38    2 8 2  Data Flow of Modular Arithmetic Operations  2   40    vi OpenSPARC T1 Microarchitecture Specification      August 2006    2 8 3  Modular Arithmetic Memory  MA Memory   2   40    2 8 4  Modular Arithmetic Operations  2   41    2 9  Memory Management Unit  2   43    2 9 1  The R
61. AT CSR shows the RO and RW status for  POR  FREQ  and WRM        Software Visibility for Efuse Data         Serial data shifted in after a power on reset  POR          CORE_AVAIL         PROC_SER_NUM         IOB_EFUSE     contains parity check results from the EFC  If non zero  the chip  is suspect with a potentially bad CORE_AVAIL or a memory array  redundancy         Power Management   Thermal Sensor         Sends an idle resume interrupt to threads specified in the TM_STAT_CTL  mask    5 1 8  IOB Errors        Accesses to non existent I O addresses  address map reserved          Drops I O writes         Sends NACK for the I O reads        IOB forwards NACKs received from the other blocks        IOB forwards error interrupts signalled by the other blocks    5 12 OpenSPARC T1 Microarchitecture Specification      August 2006    5 1 9  Debug Ports    Debug ports provide on chip support for logic analyzer data capture  The visibility  port inputs are         L2 visibility ports  2 x 40 bits   CMP clock      these are pre filtered in the CPU  clock domain for bandwidth         IOB visibility ports  J Bus clock   you can select the IOB UCB port to monitor with  raw valid stall or decoded qualified valid qualifiers     The output debug ports have separate mux select and filtering on each port  There  are two debug ports         Debug port A   dedicated debug pins  40 bits   J Bus clock         Debug port B   J Bus port  2 x 48 bits   J Bus clock          16 bytes data return to a 
62. CX grant to SPARC    pcx_spc2_grant_px 4 0   Out  sparc2  PCX grant to SPARC    pcx_spc3_grant_px 4 0   Out  sparc3  PCX grant to SPARC    pcx_spc4_grant_px 4 0   Out  sparc4  PCX grant to SPARC    pcx_spc5_grant_px 4 0   Out  sparc5  PCX grant to SPARC    pcx_spc6_grant_px 4 0   Out  sparc6  PCX grant to SPARC    pcx_spc7_grant_px 4 0   Out  sparc7  PCX grant to SPARC    rclk  Out  CCX  Clock    TABLE  3 5  CCX I O Signal List  Continued     Signal Name  I O  Source Destination  Description    Chapter 3  CPU Cache Crossbar 3 13    3 3  CCX Timing Diagrams    FIGURE  3 4 shows the timing diagram for processing a single packet request     FIGURE  3 4 PCX Packet Transfer Timing     One Packet Request    CPU0 signals the PCX that it is sending a packet in cycle PQ  CPU0 then sends a  packet in cycle PA  ARB0 looks at all pending requests and issues a grant to CPU0 in  cycle PX  ARB0 sends a data ready signal to the L2 cache Bank0 in cycle PX  ARB0  sends the packet to the L2 cache Bank0 in cycle PX2     Arbiter  control    Arbiter  data  select    spc0_pcx_req_vld_pq 0     spc0_pcx_data_pa 123 0     pcx_spc0_grant_px    pcx_sctag0_data_rdy_px1    pcx_sctag0_data_px2 123 0     PQ  PA  PX  PX2    pkt1    3 14 OpenSPARC T1 Microarchitecture Specification      August 2006    FIGURE  3 5 shows timing diagram for processing a two packet request     FIGURE  3 5 PCX Packet Transfer Timing     Two Packet Request    CPU0 signals the PCX that it is sending a packet in cycle PQ  CPU0 also a
63. Current data cycle has a uncorrectable error     sctag2_jbi_por_req_buf  In  SCTAG2  Request for DOK_FATAL     sctag3_jbi_iq_dequeue  In  SCTAG3  SCTag is unloading a request from its 2 request queue     sctag3_jbi_wib_dequeue  In  SCTAG3  Write invalidate buffer  size 4  is being unloaded     scbuf3_jbi_data 31 0   In  SCBUF3  Return data    scbuf3_jbi_ctag_vld  In  SCBUF3  Header cycle of a new response packet     scbuf3_jbi_ue_err  In  SCBUF3  Current data cycle has a uncorrectable error     sctag3_jbi_por_req_buf  In  SCTAG3  Request for DOK_FATAL     iob_jbi_pio_stall  In  IOB  PIO stall    iob_jbi_pio_vld  In  IOB  PIO valid    iob_jbi_pio_data 63 0   In  IOB  PIO data    iob_jbi_mondo_ack  In  IOB  Mondo acknowledgement    iob_jbi_mondo_nack  In  IOB  Mondo negative acknowledgement    io_jbi_ssi_miso  In  PADS  SSI Master in slave out from pad     io_jbi_ext_int_l  In  PADS  External interrupt    iob_jbi_spi_vld  In  IOB  Valid packet from IOB     iob_jbi_spi_data 3 0   In  IOB  Packet data from IOB     iob_jbi_spi_stall  In  IOB  Flow control to stop data     io_jbi_j_req4_in_l  In  PADS  J Bus request  4 input    TABLE  6 1  JBI I O Signal List  Continued     Signal Name  I O  Source   Destination  Description    6 10 OpenSPARC T1 Microarchitecture Specification      August 2006    io_jbi_j_req5_in_l  In  PADS  J Bus request  5 input    io_jbi_j_adtype 7 0   In  PADS  J Bus packet type    io_jbi_j_ad 127 0   In  PADS  J Bus address data bus    io_jbi_j_pack4 2 0   In
64. D  Thread  ID    Packet  Type    IOB  ucb_flow_     Cluster    iob_ucb_vld    iob_ucb_data N 1 0     ucb_iob_stall    ucb_iob_vld    ucb_iob_data M 1 0     iob_ucb_stall    ucb_pkt    ucb_pkt    ucb_pkt    addr 39 0     data 63 0     data 63 0     cntl    cntl    ucb_pkt    ucb_bus_out    ucb_bus_out    ucb_bus_in    ucb_bus_in    Chapter 5  Input Output Bridge 5 5    TABLE  5 3 defines the UCB request or acknowledge packet types     There is no write NACK as writes to invalid addresses are dropped  Some packet  types have data  payload  while others are without data  no payload      TABLE  5 4 defines the UCB data size parameters     The buffer ID is 00 when the master is CPU and the ID is 01 when the master is  TAP  The thread ID has two parts     CPU ID  3 bits  and Thread ID within CPU  2   bits      TABLE  5 3  UCB Request ACK Packet Types    Description  Packet Type Value  Binary     UCB_READ_NACK  0000    UCB_READ_ACK  0001    UCB_WRITE_ACK  0010    UCB_IFILL_ACK  0011    UCB_READ_REQ  0100    UCB_WRITE_REQ  0101    UCB_IFILL_REQ  0110    UCB_IFILL_NACK  0111    TABLE  5 4  UCB Data Size    Description  Size Value  Binary     UCB_SIZE_1B  000    UCB_SIZE_2B  001    UCB_SIZE_4B  010    UCB_SIZE_8B  011    UCB_SIZE_16B  111    5 6 OpenSPARC T1 Microarchitecture Specification      August 2006    5 1 2 2  UCB Interrupt Packet    The UCB interrupt packet has a fixed width of 64 bits  TABLE  5 5 describes the UCB    interrupt packet format     TABLE  5 6 defines the UCB inte
65. DRAM Uncorrectable and Addressing Errors        Error information is captured in the DRAM error status  L2 cache error status   and the L2 cache error address registers         If the L2 cache NCEEN bit is set  the error information is also captured in the  SPARC error status and SPARC error address registers  as an L2 cache error          An out of bounds error is signalled as a cache line and marked with an  uncorrectable error         For each 32 bit chunk with an error  the data is loaded into the L2 cache with  poisoned ECC         An error on the critical chunk results in a precise trap on the requesting thread     9 10 OpenSPARC T1 Microarchitecture Specification      August 2006        An error on non critical chunks results in a disrupting data_error trap to the  steering thread         If an error is on the 16 byte chunk to be written  the stores will not update the L2   cache  The line is marked as dirty  so on eviction the line is written to the memory  with a bad ECC         An uncorrectable error during a scrub is captured in the DRAM error status and  DRAM error address registers  and if the DSU bit is set in the L2 cache error  status register  a disrupting data_error trap is generated on the steering thread     10 1    CHAPTER 10    Clocks and Resets    This chapter describes the following topics         Section  10 1     Functional Description    on page  10 1        Section  10 2     I O Signal list    on page  10 15    10 1  Functional Description    The O
66. From u_rng of bw_rng v    afo_rng_data  Out  From u_rng of bw_rng v    afo_rt_ack  Out  From ctu_dft of ctu_dft v    afo_rt_data_out 31 0   Out  From ctu_dft of ctu_dft v    afo_tsr_dout 7 0   Out  From u_tsr of bw_tsr v    clsp_iob_data 3 0   Out  From ctu_clsp of ctu_clsp v    clsp_iob_stall  Out  IOB  From ctu_clsp of ctu_clsp v    clsp_iob_vld  Out  IOB  From ctu_clsp of ctu_clsp v    cmp_adbginit_l  Out  From ctu_clsp of ctu_clsp v    cmp_arst_l  Out  From ctu_clsp of ctu_clsp v    cmp_gclk_out  Out  From ctu_clsp of ctu_clsp v    cmp_gdbginit_out_l  Out  From ctu_clsp of ctu_clsp v    ctu_ccx_cmp_cken  Out  From ctu_clsp of ctu_clsp v    ctu_dbg_jbus_cken  Out  From ctu_clsp of ctu_clsp v    ctu_ddr0_clock_dr  Out  PADS  From ctu_dft of ctu_dft v    ctu_ddr0_dll_delayctr 2 0   Out  PADS  From ctu_clsp of ctu_clsp v    TABLE  10 2 CTU I O Signal List  Continued     Signal Name  I O  Source   Destination  Description    10 20 OpenSPARC T1 Microarchitecture Specification      August 2006    ctu_ddr0_dram_cken  Out  PADS  From ctu_clsp of ctu_clsp v    ctu_ddr0_hiz_l  Out  PADS  From ctu_dft of ctu_dft v    ctu_ddr0_iodll_rst_l  Out  PADS  From u_ctu_ddr0_iodll_rst_l_or2_ecobug of ctu_or2 v    ctu_ddr0_mode_ctl  Out  PADS  From ctu_dft of ctu_dft v    ctu_ddr0_shift_dr  Out  PADS  From ctu_dft of ctu_dft v    ctu_ddr0_update_dr  Out  PADS  From ctu_dft of ctu_dft v    ctu_ddr1_clock_dr  Out  PADS  From ctu_dft of ctu_dft v    ctu_ddr1_dll_delayctr 2 0   Out  PADS  From ctu_
67. I          IOB operates in both the CMP and J Bus clock domains     5 2 OpenSPARC T1 Microarchitecture Specification      August 2006    5 1 1  IOB Interfaces    FIGURE  5 1 shows the interfaces to and from the IOB to the rest of the blocks and    clusters     FIGURE  5 1 IOB Interfaces    The main interfaces to and from the IOB are         Crossbar  CCX   interface to the PCX to the CPX  both are parallel interfaces          Universal connection bus  UCB  interface is a common packetized interface to all  clusters for CSR accesses          Common width parameterized blocks in the IOB and clusters          The separate request and acknowledge interrupt paths with parameterized  widths  various blocks  and widths are defined in TABLE  5 1     IOB    JBI    SSI  Debug  J BUS    DRAM 1 3  CTU  TAP    EFC    CCX    L2    DRAM 0 2    PIO  amp   CSRs  SSI  Mondo  interrupts  Debug  port B    Chapter 5  Input Output Bridge 5 3         In most of the UCB interfaces  the IOB is master and the cluster block is a  slave  with the exception of the TAP  The TAP interface is unique     it is both  master and slave          All UCB interfaces are visible through the debug ports         J Bus Mondo Interrupt Interface          16 bit request interface and a valid bit          Header with 5 bit source and target  thread  IDs          8 cycles of data   128 bits  J Bus Mondo Data 0  amp  1           2 bit acknowledge interface   ACK   NACK         Efuse Controller  EFC      Serial Interface   
68. IGURE  2 36  PCR and PIC Layout  2   67    FIGURE  3 1  CPU Cache Crossbar  CCX  Interface  3   2    FIGURE  3 2  Processor Cache Crossbar  PCX  Interface  3   2    FIGURE  3 3  Cache Processor Crossbar  CPX  Interface  3   4    FIGURE  3 4  PCX Packet Transfer Timing     One Packet Request  3   13    FIGURE  3 5  PCX Packet Transfer Timing     Two Packet Request  3   14    FIGURE  3 6  CPX Packet Transfer Timing Diagram     One Packet Request  3   15    FIGURE  3 7  CPX Packet Transfer Timing Diagram     Two Packet Request  3   16    FIGURE  3 8  PCX and CPX Internal Blocks  3   17    FIGURE  3 9  Data Flow in PCX Arbiter  3   18    FIGURE  3 10  Control Flow in PCX Arbiter  3   19    FIGURE  4 1  Flow Diagram and Interfaces for an L2 Cache Bank  4   3    Figures  xv    FIGURE  5 1  IOB Interfaces  5   2    FIGURE  5 2  IOB UCB Interface to and From the Cluster  5   4    FIGURE  5 3  IOB Internal Block Diagram  5   8    FIGURE  6 1  JBI Functional Block Diagram  6   2    FIGURE  7 1  FPU Functional Block Diagram  7   2    FIGURE  8 1  DDR II DRAM Controller Functional Block Diagram  8   2    FIGURE  8 2  DDR II DRAM Controller Top Level State Diagram  8   4    FIGURE  8 3  DIMM Scheduler State Diagram  8   5    FIGURE  10 1  Clock and Reset Functional Block Diagram  10   2    FIGURE  10 2  PLL Functional Block Diagram  10   3    FIGURE  10 3  Clock Divider Block Diagram  10   4    FIGURE  10 4  Sync Pulses Waveforms  10   6    FIGURE  10 5  Clock Signal Distribution  10   9 
69. JBI  To u_pll of bw_pll v    jbus_grst_l  In  JBI  To u_jbus_header of bw_clk_cl_ctu_jbus v    pads_ctu_bsi  In  PADS  To ctu_dft of ctu_dft v    pads_ctu_si  In  PADS  To ctu_dft of ctu_dft v    sctag0_ctu_mbistdone  In  SCTAG0  MBIST done    sctag0_ctu_mbisterr  In  SCTAG0  MBIST error    sctag0_ctu_tr  In  SCTAG0  SCTAG debug trigger    sctag1_ctu_mbistdone  In  SCTAG1  MBIST done    sctag1_ctu_mbisterr  In  SCTAG1  MBIST error    sctag1_ctu_tr  In  SCTAG1  SCTAG debug trigger    sctag2_ctu_mbistdone  In  SCTAG2  MBIST done    sctag2_ctu_mbisterr  In  SCTAG2  MBIST error    sctag2_ctu_serial_scan_in  In  SCTAG2  Scan In    TABLE  10 2 CTU I O Signal List  Continued     Signal Name  I O  Source   Destination  Description    10 18 OpenSPARC T1 Microarchitecture Specification      August 2006    sctag2_ctu_tr  In  SCTAG2  SCTAG debug trigger    sctag3_ctu_mbistdone  In  SCTAG3  MBIST done    sctag3_ctu_mbisterr  In  SCTAG3  MBIST error    sctag3_ctu_tr  In  SCTAG3  SCTAG debug trigger    spc0_ctu_mbistdone  In  SPARC0  MBIST done    spc0_ctu_mbisterr  In  SPARC0  MBIST error    spc0_ctu_sscan_out  In  SPARC0  Scan out from SPARC    spc1_ctu_mbistdone  In  SPARC1  MBIST done    spc1_ctu_mbisterr  In  SPARC1  MBIST error    spc1_ctu_sscan_out  In  SPARC1  Scan out from SPARC    spc2_ctu_mbistdone  In  SPARC2  MBIST done    spc2_ctu_mbisterr  In  SPARC2  MBIST error    spc2_ctu_sscan_out  In  SPARC2  Scan Out from SPARC    spc3_ctu_mbistdone  In  SPARC3  MBIST done    spc3_ctu_m
70. MA requests  routing them to the appropriate L2 banks  and also  issuing PIO transactions on behalf of the processor threads and forwarding  responses back     1 12 OpenSPARC T1 Microarchitecture Specification      August 2006    1 3 8  Serial System Interface    The OpenSPARC T1 processor has a 50 Mbyte sec serial system interface  SSI  that  connects to an external application specific integrated circuit  ASIC   which in turn  interfaces to the boot read only memory  ROM   In addition  the SSI supports PIO  accesses across the SSI  thus supporting optional control status registers  CSR  or  other interfaces within the ASIC     1 3 9  Electronic Fuse    The electronic fuse  e Fuse  block contains configuration information that is  electronically burned in as part of manufacturing  including part serial number and  core available information     2 1    CHAPTER 2    SPARC Core    An OpenSPARC T1 processor contains eight SPARC cores  and each SPARC core has  several function units  These SPARC core units are described in the following  sections         Section  2 1     SPARC Core Overview and Terminology    on page  2 2        Section  2 2     SPARC Core I O Signal List    on page  2 5        Section  2 3     Instruction Fetch Unit    on page  2 6        Section  2 4     Load Store Unit    on page  2 21        Section  2 5     Execution Unit    on page  2 33        Section  2 6     Floating Point Frontend Unit    on page  2 35        Section  2 7     Multiplier Unit    on page 
71. MP  DRAM  and J Bus        The DRAM controller operates in two modes     four channel mode or two channel  mode  the mode is software programmable         The DRAM controller services L2 cache read requests from the DIMMs         Out of bound read addresses are returned with a multiple bit ECC  MECC   error         Reply zero data for L2 cache dummy read requests    8 2 OpenSPARC T1 Microarchitecture Specification      August 2006        The DRAM controller performs L2 cache writebacks to the DIMMs         Out of bound write addresses are silently dropped         Uncorrectable L2 cache data is stored by poisoning the data        The DRAM controller performs DRAM data scrubbing        DRAM controller issues periodic refreshes to the DIMMs        Supports DRAM power throttling by reducing the number of DIMM activations        To program the DRAM controller control and status registers  CSRs   the  controller uses the UCB bus as an interface to the I O buffer  IOB     FIGURE  8 1 displays a functional block diagram of the DRAM controller     FIGURE  8 1 DDR II DRAM Controller Functional Block Diagram    L2 req    cmp_clk    dram_clk    cmp_clk    dram_clk    L2 i f ctl    dram ctl    ecc gen    wr data Q 8     err det cor    dram ack    rd req Q 8     CAS req Q 8   Pad logic    To DIMMs    scrub req    refresh req    wr req Q 8     st data    dp    rd data    addr  amp  ctl    288  288    256    64    288    addr  amp  ctl    128 16    128 28    Chapter 8  DRAM Controller 8 3   
72. RC Core Errors    on page  9 3        Section  9 3     L2 Cache Errors    on page  9 5        Section  9 4     DRAM Errors    on page  9 8    9 1  Error Handling Overview    The OpenSPARC T1 processor detects  logs  and reports a number of errors to the  software  This chapter describes the error types  and how various blocks detect  log  and report these errors     There are three types of errors in the OpenSPARC T1 processor     1  Correctable errors  CE     The correctable errors are fixed by the hardware  and the hardware can generate  the disrupting traps so that the software can keep track of the error frequency or  the failed failing parts     2  Uncorrectable errors  UE     These types of errors are cannot be corrected by hardware  and hardware will  generate precise  disrupting  or deferred traps  These errors can be corrected by  software     3  Fatal errors  FE     These types of errors can create potentially unbounded damage and these types  of errors will cause a warm reset     9 2 OpenSPARC T1 Microarchitecture Specification      August 2006    9 1 1  Error Reporting and Logging        The SPARC core errors are logged in program order and they are logged only  after the instruction has exited the pipe  W stage   The rolled back and flushed  instructions do not log errors immediately  Errors are logged in the L2 cache and  DRAM error registers in the order the errors occur         Errors are reported hierarchically in the following order     DRAM  L2 cache  and  
73. RF  FFU_DP  FFU_VIS    FFU_CTL    7    78    dout  78    ST FPop  Data to  LSU    Load  data  FPU  result    din    addr  2  wen  ren    Chapter 2  SPARC Core 2 37    2 6 4  FFU Data Path  FFU_DP     This FFU data path block contains the multiplexors and the flops for the data that  has been read from  or is about to be written to  the FRF  The FFU data path also  dispatches the data for the STF and the FPops to the LSU  receives LDF from the  LSU  and receives the results from the FPops from the CPX  The FFU data path also  implements FMOV  FABS  and FNEG  checks the ECC for the data read from the  FRF  and generates the ECC for the data written to the FRF     2 6 5  FFU VIS  FFU_DP     The FFU VIS  FFU_DP  block implements a subset of the VIS graphics instructions   including partitioned addition subtraction  logical operations  and faligndata  All  the operations are implemented in a single cycle  and the data inputs and outputs  are connected to the FFU_DP     2 7  Multiplier Unit    2 7 1  Functional Description of the MUL    The SPARC multiplier unit  MUL  performs the multiplication of two 64 bit inputs   The MUL is shared between the EXU and the SPU  and it has a control block and  data path block  FIGURE  2 20 shows how the multiplier is connected to other    functional blocks     FIGURE  2 20 Multiplexor  MUL  Block Diagram    EXU  SPU    sparc_mul_top    Data In Control  Data In Control  Data Out    2 38 OpenSPARC T1 Microarchitecture Specification      August 2006
74. SPARC core  For diagnostic reasons  the L2 cache can be configured to not report  errors to the SPARC core         SPARC  L2 cache  and DRAM error registers log error details for a single error  only         Fatal and uncorrectable errors will overwrite earlier correctable error information         The error registers have bits to indicate if multiple errors occurred         Refer to the UltraSPARC T1 Supplement to UltraSPARC Architecture 2005 for  detailed information about error control and status register  CSR  definitions   including addresses  bit fields  and so on     9 1 2  Error Traps        Error trap logic is located in the SPARC core  IFU   Errors anywhere on the chip  have to be reported here         Error traps can be disabled  typically for diagnostic reasons          Correctable errors cause a disrupting corrected ECC error trap         Uncorrectable errors can cause precise  disrupting  or deferred traps         L2 cache and DRAM errors are reported through CPX packets         There is a special CPX packet type that reports errors that cannot be attributed to  a specific transaction  for example  an L2 evicted line with an UE          When IFU receives this packet  a data_error trap is taken     The following sub sections describe the errors in SPARC core  L2 cache  and DRAM   Errors in other blocks like IOB and JBI are described in their chapters     Chapter 9  Error Handling 9 3    9 2  SPARC Core Errors    This section describes the error registers  error p
75. Sun Microsystems  Inc   www sun com    Submit comments about this document at  http   www sun com hwdocs feedback    OpenSPARC    T1 Microarchitecture    Speci   cation    Part No  819 6650 10  August 2006  Revision A    Please  Recycle    Copyright    2006 Sun Microsystems  Inc   4150 Network Circle  Santa Clara  California 95054  U S A  All rights reserved     Sun Microsystems  Inc  has intellectual property rights relating to technology embodied in the product that is described in this document  In  particular  and without limitation  these intellectual property rights may include one or more of the U S  patents listed at  http   www sun com patents and one or more additional patents or pending patent applications in the U S  and in other countries     U S  Government Rights   Commercial software  Government users are subject to the Sun Microsystems  Inc  standard license agreement and  applicable provisions of the FAR and its supplements     Use is subject to license terms     This distribution may include materials developed by third parties     Sun  Sun Microsystems  the Sun logo  Solaris  OpenSPARC T1 and UltraSPARC are trademarks or registered trademarks of Sun Microsystems   Inc  in the U S  and other countries     All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International  Inc  in the U S  and other  countries  Products bearing SPARC trademarks are based upon architecture developed by Sun Microsystems  Inc     UNIX
76. V9 Single and Double Precision FPop Instruction Set  7   4    TABLE  7 3  FPA Datapath Stages  7   7    TABLE  7 4  FPM Datapath Stages  7   8    TABLE  7 5  FPD Datapath Stages  7   9    TABLE  7 6  IEEE Exception Cases  7   13    TABLE  7 7  FPU I O Signal List  7   15    TABLE  8 1  DDR II Addressing  8   7    TABLE  8 2  Physical Address to DIMM Address Decoding  8   7    TABLE  8 3  DDR II Commands Used by OpenSPARC T1 Processor  8   8    TABLE  8 4  DRAM Controller I O Signal List  8   9    TABLE  9 1  Error Protection for SPARC Memories  9   4    TABLE  9 2  Error Protection for L2 Cache Memories  9   6    TABLE  10 1  Clock Domain Dividers  10   5    TABLE  10 2  CTU I O Signal List  10   15    xix    Preface    This OpenSPARC T1 Microarchitecture Specification includes detailed functional  descriptions of the core OpenSPARC    T1 processor components  This manual also  provides the I O signal list for each component  This processor is the first chip  multiprocessor that fully implements the Sun    Throughput Computing initiative     How This Document Is Organized    Chapter  1 introduces the processor and provides a brief overview of each processor    component     Chapter  2 provides a detailed description of the functional units of a SPARC   Core     Chapter  3 describes the CPU cache crossbar  CCX  unit and includes detailed CCX    block and timing diagrams     Chapter  4 provides a functional description of the L2 cache and describes the    L2 cache pipeline and 
77. _grst_out_l  Out  JBI  Synchronous Reset    pscan_select  Out    tap_iob_data 7 0   Out  IOB    tap_iob_stall  Out  IOB    tap_iob_vld  Out  IOB    TABLE  10 2 CTU I O Signal List  Continued     Signal Name  I O  Source   Destination  Description    10 26 OpenSPARC T1 Microarchitecture Specification      August 2006    
78. _pre_grst_l  In  CTU    global_shift_enable  In  CTU    ctu_tst_scan_disable  In  CTU    ctu_tst_scanmode  In  CTU    ctu_tst_macrotest  In  CTU    ctu_tst_short_chain  In  CTU    si  In  DFT  Scan in    fp_cpx_req_cq 7 0   Out  CCX CPX  FPU result request to the CPX    fp_cpx_data_ca 144 0   Out  CCX CPX  FPU result packet to the CPX    so  Out  DFT  Scan out    7 16 OpenSPARC T1 Microarchitecture Specification      August 2006    8 1    CHAPTER 8    DRAM Controller    This chapter describes the following topics for the double data rate two  DDR II   dynamic random access memory  DRAM  controller         Section  8 1     Functional Description    on page  8 1        Section  8 2     I O Signal List    on page  8 9    8 1  Functional Description    The OpenSPARC T1 DDR II DRAM controller has the following characteristics         There are four independent DRAM controllers     each controller is connected to  one L2 cache bank and one DDR II memory channel        Supports a maximum physical address space of 37 bits  for a maximum memory  size of 128 Gbytes        64 byte cache lines are interleaved across four channels        Operational range of 125 MHz to 200 MHz with a data rate of 250 to 400 MT sec        Peak bandwidth of 23 Gbyte sec at 200 MHz         Error correction code  ECC  is based on single nibble correction and double nibble  error detection  128 bit data   16 bit ECC         Supports the chip kill feature        The DRAM controller has three clock domains     C
79. _spc4_cmp_cken  Out  SPARC4  Clock enable    ctu_spc4_mbisten  Out  SPARC4  MBIST enable    ctu_spc4_sscan_se  Out  SPARC4  Shadow scan enable    ctu_spc4_tck  Out  SPARC4  Test clock    ctu_spc5_cmp_cken  Out  SPARC5  Clock enable    ctu_spc5_mbisten  Out  SPARC5  MBIST enable    ctu_spc5_sscan_se  Out  SPARC5  Shadow scan enable    ctu_spc5_tck  Out  SPARC5  Test clock    ctu_spc6_cmp_cken  Out  SPARC6  Clock enable    ctu_spc6_mbisten  Out  SPARC6  MBIST enable    ctu_spc6_sscan_se  Out  SPARC6  Shadow scan enable    ctu_spc6_tck  Out  SPARC6  Test clock    ctu_spc7_cmp_cken  Out  SPARC7  Clock enable    ctu_spc7_mbisten  Out  SPARC7  MBIST enable    ctu_spc7_sscan_se  Out  SPARC7  Shadow scan enable    ctu_spc7_tck  Out  SPARC7  Test clock    ctu_spc_const_maskid 7 0   Out  SPARC  Mask ID    ctu_spc_sscan_tid 3 0   Out  SPARC    ctu_tst_scan_disable  Out    dram_adbginit_l  Out  DRAM  Asynchronous Reset    dram_arst_l  Out  DRAM  Asynchronous Reset    dram_gclk_out  Out  DRAM  Clock    dram_gdbginit_out_l  Out  DRAM  Synchronous Reset    TABLE  10 2 CTU I O Signal List  Continued     Signal Name  I O  Source   Destination  Description    Chapter 10  Clocks and Resets 10 25    dram_grst_out_l  Out  DRAM  Synchronous Reset    global_scan_bypass_en  Out    jbus_adbginit_l  Out  JBI  Asynchronous Reset    jbus_arst_l  Out  JBI  Asynchronous Reset    jbus_gclk_dup_out  Out  JBI  Clock    jbus_gclk_out  Out  JBI  Clock    jbus_gdbginit_out_l  Out  JBI  Synchronous Reset    jbus
80. a  hardware auto demap to prevent the threads from writing to overlapping pages   Each auto demap operation is partition specific  The sequence of an auto demap  operation is as follows     1  Schedule a write from the four entry FIFO in the LSU     2  Construct an equivalent auto demap key     3  Assert demap and complete with a handshake     4  Assert write and complete with a handshake     2 9 9  TLB Entry Replacement Algorithm    Each entry has a Used bit  An entry is picked to be a candidate for a replacement if it  is the least significant unused bit among all 64 entries     A used bit can be set on a write  or on a CAM hit  or when locked  A locked page  will have its used bit always set  An invalid entry has its used bit always cleared  All  used bits will be cleared when the TLB reaches a saturation point  that is  when all  entries have their used bit set while a new entry needs to be put in a TLB   If a TLB  remains saturated because all of the entries have been locked  the default  replacement candidate  entry 0x63  will be chosen and an error condition will be  reported     2 9 10  TSB Pointer Construction    An MMU miss will cause the write of the faulting address and the context in the tag  access  The tag access has a context 0 copy or a context non 0 copy  which is updated  depending on the context of the fault  The miss handler will read the pointer of  page size 0 or page size 1  The hardware will continue with the following sequence  in order to complete t
81. a 124 bit wide bus from each SPARC CPU core that extends out to  the five arbiters  one bus for each arbiter corresponding to a destination      ARB0 can receive packets from any of the eight CPUs for the L2 cache Bank0  and it  stores packets from each CPU in a separate queue  Therefore  ARB0 contains eight  queues  Each queue is a two entry deep FIFO  and each entry can hold one packet  A  packet is 124 bits wide and it contains the address  the data  and the control bits   ARB0 delivers packets to the L2 cache Bank0 on a 124 bit wide bus  FIGURE  3 9 shows    this data flow     FIGURE  3 9 Data Flow in PCX Arbiter    ARB1  ARB2  and ARB3 receive packets for the L2 cache Bank1  Bank2  and Bank3  respectively  ARB4 receives packets for both the FPU and the I O bridge     CPU0  CPU1  CPU2  CPU3  CPU4  CPU5  CPU6  CPU7    C0    Q1  Q0    C1  C2  C3  C4  C5  C6  C7    ARB0  ARB1    L2Cache  Bank 0  L2Cache  Bank 2  L2Cache  Bank 1  L2Cache  Bank 3  FPU  IOB    ARB2  ARB3    C0    Q1  Q0    C1  C2  C3  C4  C5  C6  C7    ARB4    124    124  124  124  124  124    124  124  124  124  124    Chapter 3  CPU Cache Crossbar 3 19    3 4 3  PCX Arbiter Control Flow    This section describes the control flow inside ARB0  the control flow is similar inside  other arbiters      ARB0 dispatches packets to the destination in the order it receives each packet   Therefore  a packet received in cycle 4 will be dispatched before a packet received in  cycle 5  When multiple sources dispatch a pack
82. a from the DRAM and sends it back to the JBI directly  read once data  only  without installing it in the L2 cache  The CTAG  the instruction identifier  and  the 64 byte data is returned to the JBI on a 32 bit interface     Write Invalidate    For a 64 byte write  the write invalidate  WRI  from the JBI   the JBI issues a 64 byte  write request to the L2 cache     When the write progresses through the pipe  it looks up the tags  If a tag hit occurs   it invalidates the entry and all primary cache entries that match  If a tag miss occurs   it does nothing  it just continues down the pipe  to maintain the order     Data is not written into the scdata cache on a miss  However  the scdata entry  and all  primary cache lines  are invalidated on a hit     The CTAG  the instruction identifier  is returned to the JBI when the processor sends  an acknowledgement to the cache line invalidation request sent over the CPX     After the instruction is retired from the pipe  64 bytes of data is written to the  DRAM     Partial Line Write    A partial line write  WR8  supports the writing of any subset of 8 bytes to the scdata  array by the JBI  However  the bytes written have to be contiguous  The JBI breaks  down any store that is not composed of contiguous bytes     When the JBI issues 8 byte writes to the L2 cache with random byte enables  the L2   cache treats them just like 8 bytes stores from the CPU   That is  it does a two pass  partial store if an odd number of byte enables are ac
83. ad requests     A prefetch instruction is issued by a CPU and is identical to a load  except for this  one difference     the results of a prefetch are not written into the L1 cache and  therefore the tags are not copied into the L2 cache directory     From an L2 cache perspective  a streaming load behaves same as a normal load  except for one difference  The L2 cache understands that it will not install the data in  the L1 cache  Therefore  the dcache entry is not created and the icache entries are not  invalidated  The L2 cache returns 128 bits of data     A forward request read returns 39 bits  32   7 ECC  of data  The data is returned  without an ECC check  Since the forward request load is not installed in the L1   cache  there is no L2 cache directory access     4 1 4 2  Ifetch    An ifetch is issued to the L2 cache in response to an instruction missing the L1  icache  The size of icache is 256 bits  The L2 cache returns the 256 bits of data in two  packets over two cycles to the requesting CPU over the CPX  The two packets are  returned as an atomic  The L2 cache then creates an entry in the icache directory and  invalidates any existing entry in the dcache directory     Chapter 4  Level 2 Cache 4 13    4 1 4 3  Stores    A store instruction to L2 cache is caused by any of the following conditions         A miss in the L1 cache by a store  block store  or a block init store instruction         A streaming store issued by the stream processing unit  SPU          A forwa
84. al unit  FFU  for a total of a 64 byte data transfer  Each 8 byte data  gets an entry of the corresponding thread in the store buffer  The blk st instructions  are non TSO compliant  so the software must do the ordering     Chapter 2  SPARC Core 2 31    2 4 17  Integer BLK INIT Loads and Stores Support    The blk init load and blk init store instructions were introduced as the substitute for  blk ld and blk st in block copy routines  They can access both the memory space and  the I O space  The blk init loads do not allocate in the level 1 D cache  On a level 1  D cache hit  the blk init stores will invalidate the level 1 D cache  L1D      The blk init load instructions must be quad word accesses  and violating this rule  will cause a trap  Like quad load instructions  blk init loads also send double pump  writes  8 byte access  to the integer register file  IRF  when a blk init load packet  reaches the head of the data fill queue  DFQ      The blk init stores are also non TSO compliant  which allows for greater write  throughput and higher performance yields for the block copy routine     Up to only eight of all non TSO compliant instructions can be allowed outstanding  for each SPARC core  The LSU keeps a counter per thread to enforce this limit     2 4 18  STRM Load and STRM Store Instruction Support    Instructions such as strm ld and strm st make requests from the stream processing  unit  SPU  to memory by way of the LSU     The Store buffer will not be looked up by the 
85. al_shift_enable  In  CTU    io_temp_trig  In  PADS    io_trigin  In  PADS    jbi_iob_mondo_data 7 0   In  JBI  UCB data    jbi_iob_mondo_vld  In  JBI  UCB valid    jbi_iob_pio_data 15 0   In  JBI  UCB data    jbi_iob_pio_stall  In  JBI  UCB stall    jbi_iob_pio_vld  In  JBI  UCB valid    TABLE  5 10 I O Bridge I O Signal List  Continued     Signal Name  I O  Source Destination  Description    5 14 OpenSPARC T1 Microarchitecture Specification      August 2006    jbi_iob_spi_data 3 0   In  JBI  UCB data    jbi_iob_spi_stall  In  JBI  UCB stall    jbi_iob_spi_vld  In  JBI  UCB valid    jbus_adbginit_l  In  CTU  Asynchronous reset    jbus_arst_l  In  CTU  Asynchronous reset    jbus_gclk  In  CTU  Clock    jbus_gdbginit_l  In  CTU  Synchronous reset    jbus_grst_l  In  CTU  Synchronous reset    l2_dbgbus_01 39 0   In  L2  Debug bus    l2_dbgbus_23 39 0   In  L2  Debug bus    pcx_iob_data_px2 123 0   In  CCX PCX  PCX packet    pcx_iob_data_rdy_px2  In  CCX PCX  PCX data ready    tap_iob_data 7 0   In  CTU TAP  UCB data    tap_iob_stall  In  CTU TAP  UCB stall    tap_iob_vld  In  CTU TAP  UCB valid    efc_iob_fuse_clk1  In  EFC    iob_scanin  In  DFT  Scan in    iob_clk_l2_tr  Out  CTU  Debug trigger    iob_clk_tr  Out  CTU  Debug trigger    iob_cpx_data_ca 144 0   Out  CCX CPX  CPX packet    iob_cpx_req_cq 7 0   Out  CCX CPX  CPX request    iob_ctu_coreavail 7 0   Out  CTU    iob_io_dbg_ck_n 2 0   Out  PADS  Debug clock N    iob_io_dbg_ck_p 2 0   Out  PADS  Debug clock P    iob_io_
86. am_sctag1_scb_mecc_err  Out  SCTAG1    dram_sctag1_scb_secc_err  Out  SCTAG1    dram_sctag1_wr_ack  Out  SCTAG1    ucb_iob_data 3 0   Out  IOB  UCB data    ucb_iob_stall  Out  IOB  UCB stall    ucb_iob_vld  Out  IOB  UCB valid    dram_sctag0_chunk_id_r0 1 0   Out  SCTAG0    dram_sctag0_mecc_err_r2  Out  SCTAG0    dram_sctag0_rd_req_id_r0 2 0   Out  SCTAG0    dram_sctag0_secc_err_r2  Out  SCTAG0    dram_sctag1_chunk_id_r0 1 0   Out  SCTAG1    dram_sctag1_mecc_err_r2  Out  SCTAG1    dram_sctag1_rd_req_id_r0 2 0   Out  SCTAG1    dram_sctag1_secc_err_r2  Out  SCTAG1    dram_scbuf0_data_r2 127 0   Out  SCBUF0    dram_scbuf0_ecc_r2 27 0   Out  SCBUF0    dram_scbuf1_data_r2 127 0   Out  SCBUF1    dram_scbuf1_ecc_r2 27 0   Out  SCBUF1    dram_local_pt0_opened_bank  Out    dram_local_pt1_opened_bank  Out    dram_pt_max_banks_open_valid  Out    dram_pt_max_time_valid  Out    TABLE  8 4  DRAM Controller I O Signal List  Continued     Signal Name  I O  Source   Destination  Description    Chapter 8  DRAM Controller 8 13    dram_pt_ucb_data 16 0   Out    dram_clk_tr  Out  CTU  Debug trigger   J Bus freq    dram_so  Out  DFT  Scan out    TABLE  8 4  DRAM Controller I O Signal List  Continued     Signal Name  I O  Source   Destination  Description    8 14 OpenSPARC T1 Microarchitecture Specification      August 2006    9 1    CHAPTER 9    Error Handling    This chapter describes the following topics         Section  9 1     Error Handling Overview    on page  9 1        Section  9 2     SPA
87. and interrupts and  forward them  FIGURE  2 28 illustrates the TLU role with respect to all other backlogs    in a SPARC core     Chapter 2  SPARC Core 2 51    FIGURE  2 28 TLU Role With Respect to All Other Backlogs in a SPARC Core    The following list highlights the functionality of the TLU         Collects traps from all units in the SPARC core        Detects some types of traps internal to the TLU        Resolves the trap priority and generates the trap vector        Sends flush pipe to other SPARC units using a set of non LSU traps         Maintains processors state registers        Manages the trap stack        Restores the processor state from the trap stack on done or retry instructions        Implements an inter thread interrupt delivery        Receives and processes all types of interrupts        Maintains tick  all tick compares  and the SOFTINT related registers        Generates timer interrupts and software interrupts  interrupt_level_n type         Maintains performance instrumentation counters  PIC     Sync Trap    Async Trap    Sync Trap    Interrupts    TrapPC    Sync Trap    Instruction    PC NPC    Sync Trap    Async Trap    Interrupt PKT    ASI_REG    Rd Ld data    Deferred Trap    Async Trap    CWP_CCR_REG    Ld St Addr  ASI regs     tcl  tpd    IFU    EXU    FFU    LSU    SPU    hyperv  tsa    TLU    intctl  intdp    pib    2 52 OpenSPARC T1 Microarchitecture Specification      August 2006    2 10 1  Architecture Registers in the Trap Logic Unit    The 
88. ase represents an exact denormalized  result         There are three types of FPU related traps tracked in the architected trap type   TT  register located in the SPARC core TLU          fp_disabled    External to the FPU  the SPARC core IFU detects the fp_disabled trap type          fp_exception_ieee_754    If an FPop generates an IEEE exception  nv  of  uf  dz  nx  when the  corresponding trap enable  TEM  bit is set  then an fp_exception_ieee_754 trap  is caused  The FFU detects this trap type          fp_exception_other    In the OpenSPARC T1 implementation  fp_exception_other trap results from an  unimplemented FPop  The FFU detects unimplemented FPops     7 12 OpenSPARC T1 Microarchitecture Specification      August 2006    7 1 8 1  Overflow and Underflow    An overflow occurs when the magnitude of what would have been the rounded  result  had the exponent range been unbounded  is greater than the magnitude of  the largest finite number of the specified precision  FPA  FPM  and FPD support all  overflow conditions     The underflow exception condition is defined separately for the trap enabled and  trap disabled states         FSR UFM   1     underflow occurs when the intermediate result is tiny        FSR UFM   0     underflow occurs when the intermediate result is tiny and there is  a loss of accuracy    A tiny result is detected before rounding  when a non zero result value is computed  as though the exponent range were unbounded and would be less in magnitude  than 
89. assignments   Once prioritized  the interrupts will be  scheduled just like the instructions     When executing in the hypervisor  HV  state  an interrupt with a supervisor  SV   privilege will not be serviced at all  An hypervisor state execution shall not be  blocked by anything with supervisor privilege     Nothing could block the scheduling of a reset  idle  or resume interrupt     Some interrupts are asserted by a level while others are asserted by a pulse  The IFU  remembers the form the interrupts were originated in order to preserve the integrity  of the scheduling     Chapter 2  SPARC Core 2 21    2 3 16  Error Checking and Logging    Parity protects the I cache data and the tag arrays  The error correction action is to  re fetch the instruction from the level 2 cache     The instruction translation lookaside buffer  ITLB  array is parity decoded without  an error correction mechanism  so all errors are fatal     All on core errors  and some of the off core errors  are logged in the per thread error  registers  Refer to the Programmer   s Reference Manual for details     The instruction fetch unit  IFU  maintains the error injection and the error enabling  registers  which are accessible by way of ASI operations     Critical states  such as program counter  PC   thread state  missed instruction list   MIL   and so on  can be snapped and scanned out on line  This process is referred  to as a shadow scan     2 4  Load Store Unit    The load store unit  LSU  processes m
90. ata_fuse_dshift  In  EFC  To efuse_hdr of scdata_efuse_hdr v    scbuf_scdata_fbdecc_c4 623 0   In  SCBUF  To periph_io of scdata_periph_io v    sctag_scdata_col_offset_c2 3 0   In  SCTAG    sctag_scdata_fb_hit_c3  In  SCTAG  To rep of scdata_rep v    sctag_scdata_fbrd_c3  In  SCTAG  To rep of scdata_rep v    sctag_scdata_rd_wr_c2  In  SCTAG  To rep of scdata_rep v    sctag_scdata_set_c2 9 0   In  SCTAG  To rep of scdata_rep v    sctag_scdata_stdecc_c2 77 0   In  SCTAG  To rep of scdata_rep v    sctag_scdata_way_sel_c2 11 0   In  SCTAG    sctag_scdata_word_en_c2 15 0   In  SCTAG    so  Out  DFT  Scan out    Chapter 4  Level 2 Cache 4 19    scdata_efc_fuse_data  Out  EFC  From efuse_hdr of scdata_efuse_hdr v    scdata_scbuf_decc_out_c7 623 0   Out  SCBUF    scdata_sctag_decc_c6 155 0   Out  SCTAG  From rep of scdata_rep v    TABLE  4 2  SCBUF I O Signal List    Signal Name  I O    Source   Destination  Description    sctag_scbuf_fbrd_en_c3  In  SCTAG  rd en for a fill operation or fb bypass    sctag_scbuf_fbrd_wl_c3 2 0   In  SCTAG    sctag_scbuf_fbwr_wen_r2 15 0   In  SCTAG    sctag_scbuf_fbwr_wl_r2 2 0   In  SCTAG    sctag_scbuf_fbd_stdatasel_c3  In  SCTAG  Select store data in OFF mode    sctag_scbuf_stdecc_c3 77 0   In  SCTAG  Store data goes to scbuf and scdata    sctag_scbuf_evict_en_r0  In  SCTAG    sctag_scbuf_wbwr_wen_c6 3 0   In  SCTAG  Write en    sctag_scbuf_wbwr_wl_c6 2 0   In  SCTAG  From wbctl    sctag_scbuf_wbrd_en_r0  In  SCTAG  Triggered by a wr_ack from DRAM 
91. ating Point Instructions    TABLE  7 2 describes the floating point instructions  including the execution latency    and the throughput for each instruction     TABLE  7 2  SPARC V9 Single and Double Precision FPop Instruction Set    Mnemonic  Description  Pipe  Execution Latency  Throughput    FADD s d   Floating point add  FPA  4  1 1    FSUB s d   Floating point subtract  FPA  4  1 1    FCMP s d   Floating point compare  FPA  4  1 1    FCMPE s d   Floating point compare  exception if  unordered     FPA  4  1 1    F s d TO d s   Convert between floating point formats  FPA  4  1 1    F s d TOi  Convert floating point to integer  FPA  4  1 1    F s d TOx  Convert floating point to 64 bit integer  FPA  4  1 1    FiTOd  Convert integer to floating point  FPA  4  1 1    FiTOs  Convert integer to floating point  FPA  5  1 1    FxTO s d   Convert 64 bit integer to floating point  FPA  5  1 1    FMUL s d   Floating point multiply  FPM  7  1 2    FsMULd  Floating point multiply single to double  FPM  7  1 2    FDIV s d   Floating point divide  FPD  32 SP  61 DP   less for zero or  denormalized  results     29 SP  58 DP   less for zero  or  denormalized  results     FSQRT s d   Floating point square root  Unimplemented    Executed in the SPARC core FFU    FMOV s d   Floating point move    FMOV s d cc  Move floating point register if condition is  satisfied    FMOV s d r  Move floating point register if integer register  contents satisfy condition    FABS s d   Floating point absolute
92. back  to the F stage to  allow it to access the I cache  This kind of case is referred to as miss fill crossover     2 3 8  Windowed Integer Register File    The integer register file  IRF  contains 5 Kbytes of storage  and has three read ports   2 write ports  and one transfer port  3R 2W 1T   The IRF houses 640 64 bit registers  that are protected by error correcting code  ECC   All read or write accesses can be  completed in one SPARC core clock cycle     FIGURE  2 8 illustrates the structure of an integer architectural register file  IARF  and    an integer working register file  IWRF      2 14 OpenSPARC T1 Microarchitecture Specification      August 2006    FIGURE  2 8 IARF and IWRF File Structure    Each thread requires 128 registers for the eight windows  with 16 registers per  window   and four sets of global registers with eight global registers per set  There  are 160 registers per thread  and there are four threads per SPARC core  There are a  total of 640 registers per SPARC core     Only 32 registers from the current window are visible to the thread  A window  change occurs in the background under thread switching while the other threads  continue to access integer register file     Please refer to OpenSPARC T1 Processor Megacell Specification for additional details on  the IRF     call    return    outs 0 7     ins 0 7     locals 0 7     outs 0 7     ins 0 7     locals 0 7     outs 0 7     ins 0 7     locals 0 7     outs 0 7     ins 0 7     locals 0 7     Archit
93. bisterr  In  SPARC3  MBIST error    spc3_ctu_sscan_out  In  SPARC3  Scan Out from SPARC    spc4_ctu_mbistdone  In  SPARC4  MBIST done    spc4_ctu_mbisterr  In  SPARC4  MBIST error    spc4_ctu_sscan_out  In  SPARC4  Scan Out from SPARC    spc5_ctu_mbistdone  In  SPARC5  MBIST done    spc5_ctu_mbisterr  In  SPARC5  MBIST error    spc5_ctu_sscan_out  In  SPARC5  Scan Out from SPARC    spc6_ctu_mbistdone  In  SPARC6  MBIST done    spc6_ctu_mbisterr  In  SPARC6  MBIST error    spc6_ctu_sscan_out  In  SPARC6  Scan Out from SPARC    spc7_ctu_mbistdone  In  SPARC7  MBIST done    spc7_ctu_mbisterr  In  SPARC7  MBIST error    spc7_ctu_sscan_out  In  SPARC7  Scan Out from SPARC    data  In    lclk  In    TABLE  10 2 CTU I O Signal List  Continued     Signal Name  I O  Source   Destination  Description    Chapter 10  Clocks and Resets 10 19    rclk  In    enable_chk  In    ctu_tst_pre_grst_l  Out    global_shift_enable  Out  From ctu_dft of ctu_dft v    ctu_tst_scanmode  Out  From ctu_dft of ctu_dft v    ctu_tst_macrotest  Out  From ctu_dft of ctu_dft v    ctu_tst_short_chain  Out  From ctu_dft of ctu_dft v    ctu_efc_read_start  Out  EFC    ctu_jbi_ssiclk  Out  JBI    ctu_dram_rx_sync_out  Out  DRAM  From ctu_clsp of ctu_clsp v    ctu_dram_tx_sync_out  Out  DRAM  From ctu_clsp of ctu_clsp v    ctu_jbus_rx_sync_out  Out  JBI  From ctu_clsp of ctu_clsp v    ctu_jbus_tx_sync_out  Out  JBI  From ctu_clsp of ctu_clsp v    cmp_grst_out_l  Out  From ctu_clsp of ctu_clsp v    afo_rng_clk  Out  
94. c 15  4    result IEEE6    FSR nxc 17    FxTO s d   result IEEE6    FSR nxc 1    1 Default response QNaN   x   7ff   fff       2 SNaN input propagated and transformed to QNaN result    3 Maximum signed integer  x   7ff   fff    or x   800   000        4 FFU will clear FSR ofc  FSR ufc  if overflow  underflow  exception traps and FSR OFM  FSR UFM  is not set and FSR NXM is set  FFU  will set FSR nxc     5 FFU will clear FSR ufc if the result is exact  FSR nxc is not set  and FSR UFM is not set  This case represents an exact denormalized result     6 Rounded or overflow  underflow  result     7 FFU will clear FSR nxc if an overflow  underflow  exception does trap because FSR OFM  FSR UFM  is set  regardless of whether  FSR NXM is set  FFU will set FSR ofc  FSR ufc      TABLE  7 6  IEEE Exception Cases  Continued     Instruction  Invalid  Divide by zero  Overflow  Underflow or  Denormalized  Inexact    Chapter 7  Floating Point Unit 7 15    7 2  I O Signal list    TABLE  7 7 describes the I O Signals for the OpenSPARC T1 floating point unit  FPU      TABLE  7 7  FPU I O Signal List    Signal Name  I O  Source Destination  Description    pcx_fpio_data_rdy_px2  In  CCX PCX  FPU request ready from the PCX    pcx_fpio_data_px2 123 0   In  CCX PCX  FPU request packet from the PCX    arst_l  In  CTU  Chip asynchronous reset     asserted low    grst_l  In  CTU  Chip synchronous reset     asserted low    gclk  In  CTU  Chip clock    cluster_cken  In  CTU  Cluster clock enable    ctu_tst
95. can out    sctag_scbuf_scanout  Out  DFT  Scan out    sctag_efc_fuse_data  Out  EFC  From red_hdr of cmp_sram_redhdr v    TABLE  4 3  SCTAG I O Signal List  Continued     Signal Name  I O  Source   Destination  Description    4 24 OpenSPARC T1 Microarchitecture Specification      August 2006    5 1    CHAPTER 5    Input Output Bridge    This chapter describes the following topics         Section  5 1     Functional Description    on page  5 1        Section  5 2     I O Bridge Signal List    on page  5 12    5 1  Functional Description    The input output bridge  IOB  is the interface between the CPU cache crossbar   CCX  and the rest of the blocks in the OpenSPARC T1 processor  The main IOB  functions include         I O address decoding          IOB maps or decodes I O addresses to the proper internal or external  destination          IOB generates control status register  CSR  accesses to the IOB  JBI  DRAM   and CTU clusters          IOB generates programmed I O  PIO  accesses to the external J Bus         Interrupts          IOB collects the interrupts from clusters  errors and EXT_INT_L  and mondo  interrupts from the J Bus          IOB forwards interrupts to the proper core and thread          IOB wakes up a single thread at reset         Interface between the read write ifill to the SSI         IOB provides test port access  TAP  access to CSRs  Memory  L2 cache  and CPU  ASIs         IOB provides debug Port functionality  both to an external debug port and to the  JB
96. chitecture Specification      August 2006    7 1 5  Floating Point Multiplier  7   7    7 1 6  Floating Point Divider  7   8    7 1 7  FPU Power Management  7   9    7 1 8  Floating Point State Register Exceptions and Traps  7   10    7 1 8 1  Overflow and Underflow  7   12    7 1 8 2  IEEE Exception List  7   13    7 2  I O Signal list  7   15    8   DRAM Controller  8   1    8 1  Functional Description  8   1    8 1 1  Arbitration Priority  8   3    8 1 2  DRAM Controller State Diagrams  8   4    8 1 3  Programmable Features  8   5    8 1 4  Errors  8   6    8 1 5  Repeatability and Visibility  8   6    8 1 6  DDR II Addressing  8   7    8 1 7  DDR II Supported Features  8   8    8 2  I O Signal List  8   9    9   Error Handling  9   1    9 1  Error Handling Overview  9   1    9 1 1  Error Reporting and Logging  9   2    9 1 2  Error Traps  9   2    9 2  SPARC Core Errors  9   3    9 2 1  SPARC Core Error Registers  9   3    9 2 2  SPARC Core Error Protection  9   4    9 2 3  SPARC Core Error Correction  9   4    9 3  L2 Cache Errors  9   5    9 3 1  L2 Cache Error Registers  9   5    Contents  xi    9 3 2  L2 Cache Error Protection  9   6    9 3 3  L2 Cache Correctable Errors  9   6    9 3 4  L2 Cache Uncorrectable Errors  9   7    9 4  DRAM Errors  9   8    9 4 1  DRAM Error Registers  9   8    9 4 2  DRAM Error Protection  9   9    9 4 3  DRAM Correctable Errors  9   9    9 4 4  DRAM Uncorrectable and Addressing Errors  9   9    10   Clocks and Resets  10   1    10 1  Fu
97. ck Diagram    OpenSPARC T1    1 4 OpenSPARC T1 Microarchitecture Specification      August 2006    1 3  OpenSPARC T1 Components    This section provides further details about the OpenSPARC T1 components     1 3 1  SPARC Core    Each SPARC core has hardware support for four threads  This support consists of a  full register file  with eight register windows  per thread  with most of the address  space identifiers  ASI   ancillary state registers  ASR   and privileged registers  replicated per thread  The four threads share the instruction  the data caches  and the  TLBs  Each instruction cache is 16 Kbytes with a 32 byte line size  The data caches  are write through  8 Kbytes  and have a 16 byte line size  The TLBs include an  autodemap feature which enables the multiple threads to update the TLB without  locking     Each SPARC core has single issue  six stage pipeline  These six stages are     1  Fetch    2  Thread Selection    3  Decode    4  Execute    5  Memory    6  Write Back    FIGURE  1 2 shows the SPARC core pipeline used in the OpenSPARC T1 Processor     Chapter 1  OpenSPARC T1 Overview 1 5    FIGURE  1 2 SPARC Core Pipeline    Each SPARC core has the following units     1  Instruction fetch unit  IFU  includes the following pipeline stages     fetch  thread  selection  and decode  The IFU also includes an instruction cache complex     2  Execution unit  EXU  includes the execute stage of the pipeline     3  Load store unit  LSU  includes memory and writeback stages 
98. clsp of ctu_clsp v    ctu_ddr1_dram_cken  Out  PADS  From ctu_clsp of ctu_clsp v    ctu_ddr1_hiz_l  Out  PADS  From ctu_dft of ctu_dft v    ctu_ddr1_iodll_rst_l  Out  PADS  From u_ctu_ddr1_iodll_rst_l_or2_ecobug of ctu_or2 v    ctu_ddr1_mode_ctl  Out  PADS  From ctu_dft of ctu_dft v    ctu_ddr1_shift_dr  Out  PADS  From ctu_dft of ctu_dft v    ctu_ddr1_update_dr  Out  PADS  From ctu_dft of ctu_dft v    ctu_ddr2_clock_dr  Out  PADS  From ctu_dft of ctu_dft v    ctu_ddr2_dll_delayctr 2 0   Out  PADS  From ctu_clsp of ctu_clsp v    ctu_ddr2_dram_cken  Out  PADS  From ctu_clsp of ctu_clsp v    ctu_ddr2_hiz_l  Out  PADS  From ctu_dft of ctu_dft v    ctu_ddr2_iodll_rst_l  Out  PADS  From u_ctu_ddr2_iodll_rst_l_or2_ecobug of ctu_or2 v    ctu_ddr2_mode_ctl  Out  PADS  From ctu_dft of ctu_dft v    ctu_ddr2_shift_dr  Out  PADS  From ctu_dft of ctu_dft v    ctu_ddr2_update_dr  Out  PADS  From ctu_dft of ctu_dft v    ctu_ddr3_clock_dr  Out  PADS  From ctu_dft of ctu_dft v    ctu_ddr3_dll_delayctr 2 0   Out  PADS  From ctu_clsp of ctu_clsp v    ctu_ddr3_dram_cken  Out  PADS  From ctu_clsp of ctu_clsp v    ctu_ddr3_hiz_l  Out  PADS  From ctu_dft of ctu_dft v    ctu_ddr3_iodll_rst_l  Out  PADS  From u_ctu_ddr3_iodll_rst_l_or2_ecobug of ctu_or2 v    ctu_ddr3_mode_ctl  Out  PADS  From ctu_dft of ctu_dft v    ctu_ddr3_shift_dr  Out  PADS  From ctu_dft of ctu_dft v    ctu_ddr3_update_dr  Out  PADS  From ctu_dft of ctu_dft v    TABLE  10 2 CTU I O Signal List  Continued     Signal Name  I O  Sou
99. croarchitecture Specification      August 2006    1 3 4  L2 Cache    The L2 cache is banked four ways  with the bank selection based on the physical  address bits 7 6  The cache is 3 Mbyte  12 way set associative with pseudo least  recently used  LRU  replacement  the replacement is based on a used bit scheme    The line size is 64 bytes  Unloaded access time is 23 cycles for an L1 data cache miss  and 22 cycles for an L1 instruction cache miss     L2 cache has a 64 byte line size  with 64 bytes interleaved between banks  Pipeline  latency in the L2 cache is 8 clocks for a load  9 clocks for an I miss  with the critical  chunk returned first  16 outstanding misses per bank are supported for a 64 total  misses  Coherence is maintained by shadowing the L1 tags in an L2 cache directory  structure  the L2 cache is a point of global visibility   DMA from the I O is serialized  with respect to the traffic from the cores in the L2 cache     The L2 cache directory shadows the L1 tags  The L1 set index and the L2 cache bank  interleaving is such that one forth of the L1 entries come from an L2 cache bank  On  an L1 miss  the L1 replacement way and set index identifies the physical location of  the tag which will be updated by the miss address  On a store  the directory will be  cammed  The directory entries are collated by set  so only 64 entries need to be  cammed  This scheme is quite power efficient  Invalidates are a pointer to the  physical location in the L1 cache  eliminating t
100. culative states     Ready    long lat rsrc conflict  Run  Wait    c    o    m    pl    et    io    n    s    c    h    e    d    ul    e    tr    ap     l    d    mi    ss    s    wi    tc    he    d    o    u    t    2 18 OpenSPARC T1 Microarchitecture Specification      August 2006    FIGURE  2 11 State Transition for a Thread in Speculative States    2 3 12  Thread Scheduling    A thread can be scheduled when it is in one of the following five states     idle  which  happens infrequently  and generally results from a reset or resume interrupt   Rdy   SpecRdy  Run  and SpecRun  The thread priority in each state is different at the time  for scheduling  The priority scheme can be characterized as follows     Idle  gt  Rdy  gt  SpecRdy  gt   Run   SpecRun     The fairness scheme for threads in the Run state or the SpecRun state is a round   robin algorithm with the least recently executed thread winning the selection     Within Idle threads  the priority scheme is as follows     T0  thread 0   gt  T1  thread 1   gt  T2  thread 2   gt  T3  thread 3     Rdy  Run    SpecRdy  SpecRun    Wait    R    e    al    ly    d    o    n    e    W    ro    n    g    s    p    e    c    W    ro    n    g    s    p    e    c    R    e    al    ly    d    o    n    e    s    c    h    e    d    ul    e    sc    he    dul    e    S    p    ec    ul    at    e    d    o    n    e    lo    ng    la    t     rs    rc    c    o    nf    li    ct    s    wi    tc    he    d    o    u    t    Chapte
101. d  Ctrl Regs   Stack    SOFTINT_REG SV     FFs    HINTP     TICK  HSTICK_CMPR   HV      TICK  STICK_CMPR   PIC_Overflow   TICK  TICK_CMPR     Level_ lt 1   15 gt    Bits 0 and  16 also map  to Level_14     Chapter 2  SPARC Core 2 61    2 10 7  Interrupt Behavior and Interrupt Masking    The following list highlights the behavior and the masking of interrupts     1  Hypervisor interrupts cannot be masked by the supervisor nor the user and can  only be masked by the hypervisor by way of the PSTATE IE bit  Such interrupts  include hardware interrupts  HINTP  and so on     2  Normal inter core or inter thread interrupts such as cross calls can be sent by  software writing to the CPU INT_VEC_DIS_REG register     3  Special inter core or inter thread interrupts  such as reset  idle  or resume  can  only be sent by software through the I O bridge  IOB  by writing to the IOB  INT_VEC_DIS_REG register     4  Hypervisor will always suspend supervisor interrupts     5  Some supervisor interrupts such as Mondo Qs can only be masked by the  PSTATE IE bit     6  Interrupts of Interrupt_level_n type can only be masked by the PIL and the  PSTATE IE bit at the supervisor or user level     2 10 8  Privilege Levels and States of a Thread    Split mode is referred to as the operating mode where hypervisor and supervisor  modes are uniquely distinguished  Otherwise  the mode is referred to as non split  mode     TABLE  2 6 illustrates the privilege levels and states of a thread     TABLE  2 6  Pr
102. d GRST_L have race through synchronizer for gclk   gt  rclk     FIGURE  10 5 displays the clock signal distribution     Chapter 10  Clocks and Resets 10 9    FIGURE  10 5 Clock Signal Distribution    PLL    CTU    sync_header    cluster_header    art_l  art_l    gclk    gclk    art_l    PWRON_RST_L    gclk    gclk    cken    art_l    gclk  gclk  rclk    gclk  rclk    grst_l    dbginit_l    arst_l    adbginit_l    rst_l    dbginit_l    art_l    gclk    rx_sync    tx_sync    rx_sync    tx_sync    art_l    rclk    10 10 OpenSPARC T1 Microarchitecture Specification      August 2006    10 1 2  OpenSPARC T1 Processor Resets    The resets of the OpenSPARC T1 processor have the following characteristics         There are three input reset signals          Power on reset  PWRON_RST_L          JTAG test access port  TAP  reset  TRST_L          J Bus reset  J_RST_L         At power on  the TRST_L and PWRON_RST_L resets must be asserted before  applying power  then deasserted after power is stable         Deasserting the TRST_L reset completes TAP reset sequence         Generally  a TAP reset and a function reset are independent  but some things may  need to be set up before the function reset is done         Deasserting the PWRON_RST_L reset proceeds with a cold sequence         The initial state of the J_RST_L reset is don   t care  though the reset needs to assert  and deassert to complete the sequence         In system  the initial state of the J_RST_L reset is asserted         In te
103. data_ca 144 0     cpx_sctag0_grant_px    cpx_spc0_data_rdy_cx21    cpx_spc0_data_cx2 144 0     CQ  CA  CX  CX2    pkt1    pkt1    3 16 OpenSPARC T1 Microarchitecture Specification      August 2006    FIGURE  3 7 CPX Packet Transfer Timing Diagram     Two Packet Request    Arbiter  control  Arbiter  control    Arbiter  data  select    Arbiter  data  select    sctag0_cpx_req_cq 0     sctag0_cpx_atom_cq    sctag0_cpx_data_ca 144 0     cpx_sctag0_grant_px    cpx_spc0_data_rdy_cx21    cpx_spc0_data_cx2 144 0     CQ  CA  CX  CX2  CX3    pkt1  pkt2    pkt1  pkt2    Chapter 3  CPU Cache Crossbar 3 17    3 4  PCX Internal Blocks Functional  Description    3 4 1  PCX Overview    The PCX contains five identical arbiter modules   one for each destination  An  arbiter stores the packets from the sources for one particular destination  The PCX  then arbitrates and dispatches packets to that destination  FIGURE  3 8 shows a block    diagram of the PCX arbitration     FIGURE  3 8 PCX and CPX Internal Blocks    CPU  8     L2Cache  4   IOBridge  FPU    CCX    arbiter0  arbiter4    arbiter1    arbiter2    arbiter3    arbiter0    arbiter1    arbiter2    arbiter3    arbiter4    arbiter5    arbiter6    arbiter7    3 18 OpenSPARC T1 Microarchitecture Specification      August 2006    3 4 2  PCX Arbiter Data Flow    The PCX contains five identical arbiter modules  While data flows similarly inside  other arbiters  this section will describe the data flow inside one of the arbiters   ARB0   There is 
104. data_ca 144 0   In  L2 Bank1  L2 CPX data    sctag1_cpx_req_cq 7 0   In  L2 Bank1  L2 CPX request    3 10 OpenSPARC T1 Microarchitecture Specification      August 2006    sctag1_pcx_stall_pq  In  L2 Bank1  PCX stall    sctag2_cpx_atom_cq  In  L2 Bank2  Atomic packet    sctag2_cpx_data_ca 144 0   In  L2 Bank2  L2 CPX data    sctag2_cpx_req_cq 7 0   In  L2 Bank2  L2 CPX request    sctag2_pcx_stall_pq  In  L2 Bank2  PCX stall    sctag3_cpx_atom_cq  In  L2 Bank3  Atomic packet    sctag3_cpx_data_ca 144 0   In  L2 Bank3  L2 CPX data    sctag3_cpx_req_cq 7 0   In  L2 Bank3  L2 CPX request    sctag3_pcx_stall_pq  In  L2 Bank3  PCX stall    spc0_pcx_atom_pq  In  sparc0  Atomic packet    spc0_pcx_data_pa 123 0   In  sparc0  SPARC PCX data address    spc0_pcx_req_pq 4 0   In  sparc0  SPARC PCX request    spc1_pcx_atom_pq  In  sparc1  Atomic packet    spc1_pcx_data_pa 123 0   In  sparc1  SPARC PCX data address    spc1_pcx_req_pq 4 0   In  sparc1  SPARC PCX request    spc2_pcx_atom_pq  In  sparc2  Atomic packet    spc2_pcx_data_pa 123 0   In  sparc2  SPARC PCX data address    spc2_pcx_req_pq 4 0   In  sparc2  SPARC PCX request    spc3_pcx_atom_pq  In  sparc3  Atomic packet    spc3_pcx_data_pa 123 0   In  sparc3  SPARC PCX data address    spc3_pcx_req_pq 4 0   In  sparc3  SPARC PCX request    spc4_pcx_atom_pq  In  sparc4  Atomic packet    spc4_pcx_data_pa 123 0   In  sparc4  SPARC PCX data address    spc4_pcx_req_pq 4 0   In  sparc4  SPARC PCX request    spc5_pcx_atom_pq  In  sparc5  Atom
105. data_in 255 0   In  PADS  I O data in    io_dram0_data_valid  In  PADS  I O data valid    io_dram0_ecc_in 31 0   In  PADS  I O ECC in    io_dram1_data_in 255 0   In  PADS  I O data in    io_dram1_data_valid  In  PADS  I O data valid    io_dram1_ecc_in 31 0   In  PADS  I O ECC in    iob_ucb_data 3 0   In  IOB  UCB data    iob_ucb_stall  In  IOB  UCB stall    iob_ucb_vld  In  IOB  UCB valid    scbuf0_dram_data_mecc_r5  In  SCBUF0    scbuf0_dram_data_vld_r5  In  SCBUF0    scbuf0_dram_wr_data_r5 63 0   In  SCBUF0  To dramctl0 of dramctl v    scbuf1_dram_data_mecc_r5  In  SCBUF1    scbuf1_dram_data_vld_r5  In  SCBUF1    scbuf1_dram_wr_data_r5 63 0   In  SCBUF1  To dramctl1 of dramctl v    sctag0_dram_addr 39 5   In  SCTAG0  To dramctl0 of dramctl v    sctag0_dram_rd_dummy_req  In  SCTAG0    sctag0_dram_rd_req  In  SCTAG0  To dramctl0 of dramctl v    sctag0_dram_rd_req_id 2 0   In  SCTAG0  To dramctl0 of dramctl v    8 10 OpenSPARC T1 Microarchitecture Specification      August 2006    sctag0_dram_wr_req  In  SCTAG0  To dramctl0 of dramctl v    sctag1_dram_addr 39 5   In  SCTAG1  To dramctl1 of dramctl v    sctag1_dram_rd_dummy_req  In  SCTAG1    sctag1_dram_rd_req  In  SCTAG1  To dramctl1 of dramctl v    sctag1_dram_rd_req_id 2 0   In  SCTAG1  To dramctl1 of dramctl v    sctag1_dram_wr_req  In  SCTAG1  To dramctl1 of dramctl v    clspine_dram_rx_sync  In  CTU  RX synchronous    clspine_dram_tx_sync  In  CTU  TX synchronous    clspine_jbus_rx_sync  In  CTU  RX synchronous    clspin
106. dbg_data 39 0   Out  PADS  Debug bus    iob_io_dbg_en  Out  PADS  Debug enable    iob_jbi_dbg_hi_data 47 0   Out  JBI  Debug data high    iob_jbi_dbg_hi_vld  Out  JBI  Debug data high valid    iob_jbi_dbg_lo_data 47 0   Out  JBI  Debug data low    iob_jbi_dbg_lo_vld  Out  JBI  Debug data high valid    TABLE  5 10 I O Bridge I O Signal List  Continued     Signal Name  I O  Source Destination  Description    Chapter 5  Input Output Bridge 5 15    iob_jbi_mondo_ack  Out  JBI  MONDO ACK    iob_jbi_mondo_nack  Out  JBI  MONDO negative ACK    iob_pcx_stall_pq  Out  CCX PCX  PCX stall    iob_clk_data 3 0   Out  CTU CLK  UCB data    iob_clk_stall  Out  CTU CLK  UCB stall    iob_clk_vld  Out  CTU CLK  UCB valid    iob_dram02_data 3 0   Out  DRAM  DRAM data    iob_dram02_stall  Out  DRAM  DRAM stall    iob_dram02_vld  Out  DRAM  DRAM valid    iob_dram13_data 3 0   Out  DRAM  DRAM data    iob_dram13_stall  Out  DRAM  DRAM stall    iob_dram13_vld  Out  DRAM  DRAM valid    iob_jbi_pio_data 63 0   Out  JBI  PIO data    iob_jbi_pio_stall  Out  JBI  PIO stall    iob_jbi_pio_vld  Out  JBI  PIO valid    iob_jbi_spi_data 3 0   Out  JBI  JBI UCB data    iob_jbi_spi_stall  Out  JBI  JBI UCB stall    iob_jbi_spi_vld  Out  JBI  JBI UCB valid    iob_tap_data 7 0   Out  CTU TAP  UCB data    iob_tap_stall  Out  CTU TAP  UCB stall    iob_tap_vld  Out  CTU TAP  UCB valid    iob_scanout  Out  DFT  Scan out    TABLE  5 10 I O Bridge I O Signal List  Continued     Signal Name  I O  Source Destination  Desc
107. dead cycle for switching commands from one rank to another  rank        A single ended DQS is used        An off chip driver  OCD  is not supported        SDRAM on die termination  ODT  is not supported        The additive latency  AL  is always zero    TABLE  8 3 lists the subset of DDR II SDRAM commands used by the OpenSPARC T1    processor     TABLE  8 3  DDR II Commands Used by OpenSPARC T1 Processor    Function    CKE  Previous  Cycle    CKE  Current  Cycle  CS_L  RAS_L  CAS_L  WE_L  Bank  Address    Mode extended  mode register set    H  H  L  L  L  L  BA  Op code    Auto refresh  H  H  L  L  L  H  X  X    Self refresh entry  H  L  L  L  L  H  X  X    Self refresh exit  L  H  H  X  X  X  X  X    L  H  H  H    Precharge all banks  H  H  L  L  H  L  X  A10 H    Bank activate  H  H  L  L  H  H  BA  Row Address    Write with auto  precharge    H  H  L  H  L  L  BA  Column address   A10 H    Read with auto  precharge    H  H  L  H  L  H  BA  Column address   A10 H    No operation  H  X  L  H  H  H  X  X    Device deselect  H  X  H  X  X  X  X  X    Chapter 8  DRAM Controller 8 9    8 2  I O Signal List    TABLE  8 4 lists the I O signals for OpenSPARC T1 DDR II DRAM controller     TABLE  8 4  DRAM Controller I O Signal List    Signal Name  I O    Source   Destination  Description    dram_other_pt_max_banks_open_v  alid    In    dram_other_pt_max_time_valid  In    dram_other_pt_ucb_data 16 0   In    dram_other_pt0_opened_bank  In    dram_other_pt1_opened_bank  In    io_dram0_
108. decode stage  The branch evaluation  takes place in the execution stage  The access to memory and the actual writeback  will be done in the memory and writeback stages  FIGURE  2 4 illustrates the SPARC    core pipeline and support structures     FIGURE  2 4 SPARC Core Pipeline and Support Structures    Fetch  Thrd Sel  Decode  Execute  Memory  WB    ICache  Itlb  Inst  buf x 4  Crossbar  Interface  Thrd  Sel  Mux    PC logic  x 4  Thrd  Sel  Mux    Decode    Thread  select  logic    Thread selects  Instruction type  Misses  Traps and interrupts  Resource conflicts    Regfile  x 4  Crypto  Coprocessor    Alu  Mul  Shft  Div    DCache  Dtlb  Stbuf x 4    2 8 OpenSPARC T1 Microarchitecture Specification      August 2006    The instruction fill queue  IFQ  feeds into the I cache  The missed instruction list   MIL  stores the addresses that missed the I cache and the ITLB  and the MIL feeds  into the load store unit  LSU  for further processing  The instruction buffer is two  levels deep  and it includes the thread instruction  TIR  and next instruction  NIR   unit  Thread selection and scheduler  S stage  resolves the arbitration among the  TIR  NIR  branch PC  and trap PC to pick one thread send it to the decode stage   D stage   FIGURE  2 5 shows the support structure for this portion of the thread    pipeline     FIGURE  2 5 Frontend of the SPARC Core Pipeline    2 3 2  Instruction Fetch    The instruction fetch unit  IFU  maintains the program counters  PC  and the next   pr
109. dest_sample  Out  EFC  From ctu_dft of ctu_dft v    ctu_efc_fuse_bypass  Out  EFC  From ctu_dft of ctu_dft v    ctu_efc_jbus_cken  Out  EFC  From ctu_clsp of ctu_clsp v    ctu_efc_read_en  Out  EFC  From ctu_dft of ctu_dft v    ctu_efc_read_mode 2 0   Out  EFC  From ctu_dft of ctu_dft v    ctu_efc_rowaddr 6 0   Out  EFC  From ctu_dft of ctu_dft v    TABLE  10 2 CTU I O Signal List  Continued     Signal Name  I O  Source   Destination  Description    10 22 OpenSPARC T1 Microarchitecture Specification      August 2006    ctu_efc_shiftdr  Out  EFC  From ctu_dft of ctu_dft v    ctu_efc_tck  Out  EFC  From ctu_dft of ctu_dft v    ctu_efc_updatedr  Out  EFC  From ctu_dft of ctu_dft v    ctu_fpu_cmp_cken  Out  FPU  From ctu_clsp of ctu_clsp v    ctu_fpu_so  Out  FPU  From ctu_dft of ctu_dft v    ctu_global_snap  Out  From ctu_dft of ctu_dft v    ctu_io_clkobs 1 0   Out  PADS  From u_pll of bw_pll v    ctu_io_j_err  Out  PADS  From ctu_clsp of ctu_clsp v    ctu_io_tdo  Out  PADS  From u_test_stub of ctu_test_stub_scan v    ctu_io_tdo_en  Out  PADS  From ctu_dft of ctu_dft v    ctu_io_tsr_testio 1 0   Out  PADS  From u_tsr of bw_tsr v    ctu_iob_cmp_cken  Out  IOB  From ctu_clsp of ctu_clsp v    ctu_iob_jbus_cken  Out  IOB  From ctu_clsp of ctu_clsp v    ctu_iob_resetstat 2 0   Out  IOB  From ctu_clsp of ctu_clsp v    ctu_iob_resetstat_wr  Out  IOB  From ctu_clsp of ctu_clsp v    ctu_iob_wake_thr  Out  IOB  From ctu_clsp of ctu_clsp v    ctu_jbi_cmp_cken  Out  JBI  From ctu_clsp of ct
110. dge  1   11    1 3 7  J Bus Interface  1   11    1 3 8  Serial System Interface  1   12    iv OpenSPARC T1 Microarchitecture Specification      August 2006    1 3 9  Electronic Fuse  1   12    2   SPARC Core  2   1    2 1  SPARC Core Overview and Terminology  2   2    2 2  SPARC Core I O Signal List  2   5    2 3  Instruction Fetch Unit  2   6    2 3 1  SPARC Core Pipeline  2   7    2 3 2  Instruction Fetch  2   8    2 3 3  Instruction Registers and Program Counter Registers  2   9    2 3 4  Level 1 Instruction Cache  2   9    2 3 5  I Cache Fill Path  2   10    2 3 6  Alternate Space Identifier Accesses  I Cache Line Invalidations   and Built In Self Test Accesses to the I Cache  2   11    2 3 7  I Cache Miss Path  2   12    2 3 8  Windowed Integer Register File  2   13    2 3 9  Instruction Table Lookaside Buffer  2   15    2 3 10  Thread Selection Policy  2   15    2 3 11  Thread States  2   16    2 3 12  Thread Scheduling  2   18    2 3 13  Rollback Mechanism  2   19    2 3 14  Instruction Decode  2   20    2 3 15  Instruction Fetch Unit Interrupt Handling  2   20    2 3 16  Error Checking and Logging  2   21    2 4  Load Store Unit  2   21    2 4 1  LSU Pipeline  2   22    2 4 2  Data Flow  2   22    2 4 3  Level 1 Data Cache  D Cache   2   23    2 4 4  Data Translation Lookaside Buffer  2   24    2 4 5  Store Buffer  2   25    Contents  v    2 4 6  Load Miss Queue  2   26    2 4 7  Processor to Crossbar Interface Arbiter  2   26    2 4 8  Data Fill Queue  2   27    2 4 
111. dular arithmetic memory addresses  MA_ADDR  register    This register carries the memory address offsets for various operands  and the  size of the exponent  FIGURE  2 21 highlights the layout of the bit fields     FIGURE  2 21 Layout of MA_ADDR Register Bit Fields        Modular arithmetic N prime value  MA_NP  register    This register is used to specify the modular arithmetic N prime value      63 48    47 40    39 32    31 24    23 16    15 8    7 0     RSVD  ES  E    X    X    M    N    N    R    M    B    N    A    A    A    MA_EXP    MA_MUL    MA_RED    Chapter 2  SPARC Core 2 39        Modular arithmetic synchronization  MA_SYNC  register    A load operation from this register is used to synchronize a thread with the  completion of asynchronous modular arithmetic operations performed by the  SPU         Modular arithmetic control parameters  MA_CTL  register    This register contains several bit filed fields that provide these control  parameters          PerrInj     Parity error injection    When this parameter is set  each operation that writes to modular arithmetic  memory will have the parity bit inverted          Thread     Thread ID for receiving interrupt    If the Int bit is set  this set of bits specifies the thread that will receive the  disrupting trap on the completion of the modular arithmetic operation          Busy     SPU is BUSY    When this parameter is set  the SPU is busy working on the specified  operation          Int     Interrupt enable    When
112. e  In the  following cases  an invalidation could be addressing anyone         A single I cache line invalidation due to store acknowledgements  or due to a  load exclusivity requiring that the invalidation of the other level 1 I caches  resulted from the self modifying code         Invalidating two I cache lines because of a cache line eviction in the level 2 cache   L2 cache          Invalidating all ways in a given set due to error conditions  such as encountering  a tag ECC error in a level 2 cache line     2 12 OpenSPARC T1 Microarchitecture Specification      August 2006    2 3 7  I Cache Miss Path    A missed instruction list  MIL  is responsible for sending the I cache miss request to  the level 2 cache  L2 cache  in order to get an I cache fill  The MIL has one entry per  thread  which supports a total of four outstanding I cache misses for all four threads  in the same SPARC core at the same time  Each entry in the MIL contains the  physical address  PA  of an instruction that missed the I cache  the replacement way  information  the MIL state information  the cacheability  the error information  and  so on  The PA tracks the I fetch progress from the indication of an I cache miss until  the I cache has been filled  The dispatch of I cache miss requests from different  threads follow a fairness mechanism based on a round robin algorithm     FIGURE  2 7 illustrates the I cache miss path     FIGURE  2 7 I Cache Miss Path    The MIL keeps track of the physical address 
113. e Error Correction    The SPARC core provides error correction for various errors as follows         Instruction Data TLB Data Parity Error         Precise trap during translation          Precise trap for ASI accesses         Instruction Data TLB Tag Parity Error         Not checked during translation          Precise trap for ASI accesses with periodic software scrubbing          DTLB parity error on a store causes a deferred trap         Instruction Data Cache Data Tag Parity Error         Two requests to the L2 cache     the first invalidates the entire set and the  second does a refill          Data cache is not accessed for stores or atomics     TABLE  9 1  Error Protection for SPARC Memories    Memory  Error Protection Type    ITLB data  Parity    ITLB tag  Parity    DTLB data  Parity    DTLB tag  Parity    Instruction cache data  Parity    Instruction cache tag  Parity    Data cache data  Parity    Data cache tag  Parity    Integer register file  IRF   ECC    Floating point register file  FRF   ECC    Modular arithmetic  MA  memory  Parity    Chapter 9  Error Handling 9 5         Data cache errors on loads cause a rollback of the instruction following the  load  from D or W stages           An instruction cache parity error on an instruction causes a rollback from the  D stage         IRF FRF Correctable Error         The instruction is rolled back from W stage and the error is corrected  The  instruction is then replayed         IRF FRF Uncorrectable Error         Ca
114. e JBI         Instructions for recycle from the fill buffer and the miss buffer        Stall signals from the pipeline  a stall condition will evaluate to true for a signal  currently in the pipeline     4 1 2 2  L2 Tag    The L2 tag block contains the sctag array and the associated control logic  Each 22 bit  tag is protected by 6 bits of SEC ECC  the L2 tag does not support double bit error  detection   sctag is a single ported array  and it supports inline false hit detection  In  the C1 stage of pipeline  the access address bits  as well the check bits  are compared   Therefore  there is never a false hit     The state of each line is maintained using valid  V   used  U   allocated  A   and  dirty  D  bits  These bits are stored in the L2 VUAD array     4 1 2 3  L2 VUAD States    The four state bits for sctags are organized in a dual ported array structure in the L2  VUAD array  The four states are     valid  V   used  U   allocated  A   and dirty  D    The used bit is not protected because a used error will not cause incorrect  functionality  VAD bits are parity protected because an error will be fatal  The L2  VUAD array has two read and two write ports     A valid bit indicates that the line is valid  The valid bit  per way  gets set when a  new line is installed in that way  It gets reset when that line gets invalidated     The used bit is a reference bit used in the replacement algorithm  The L2 cache uses  a pseudo LRU algorithm for selecting a way to be replaced  T
115. e bank          PA for loads and stores  and other indices for scrub and directory errors         L2 Error Injection Register         Injects errors into the directory only          L2 tags  valid used allocated and dirty  VUAD  array  and data array errors can  be injected through diagnostic accesses     9 3 2  L2 Cache Error Protection    All SRAMs  caches  and so on in the L2 cache have error protection using either  parity or ECC  TABLE  9 2 shows the L2 cache memories and their error protection    types     9 3 3  L2 Cache Correctable Errors        Error information is captured in the L2 cache Error Status and L2 cache Error  Address registers         If the L2 cache correctable error enable  CEEN  bit is set and the error is on the  requested data  the error is also logged in the SPARC error status and error  address registers     TABLE  9 2  Error Protection for L2 Cache Memories    Memory  Error Protection Type    L2 cache data  ECC    L2 cache tag  ECC    Directory  Parity    VAD bits  Parity    Writeback buffer  ECC    Chapter 9  Error Handling 9 7        Loads  ifetch and prefetch     if the SPARC CEEN bit is set  a disrupting ECC_error  trap is taken on the requesting thread          Hardware corrects the error on the data being returned from the L2 cache  but  it does not correct the L2 cache data itself          Partial stores  less than 4 bytes   Atomics     error is corrected and written to the  L2 cache         MA loads          If the CEEN bit is set  the L2
116. e cache to processor crossbar  CPX   The L2 cache  is also responsible for maintaining the on chip coherency across all L1 caches on the  chip by keeping a copy of all L1 tags in a directory structure  Since the OpenSPARC    4 2 OpenSPARC T1 Microarchitecture Specification      August 2006    T1 processor implements system on a chip  with single memory interface and no L3  cache  there is no off chip coherency requirement for the OpenSPARC T1 L2 cache  other than it needs to be coherent with the main memory     Each L2 cache bank has a 128 bit fill interface and a 64 bit write interface with the  DRAM controller  Each bank had a dedicated DRAM channel  and each 32 bit word  is protected by 7 bits of single error correction double error detection  SEC DED   ECC code     4 1 2  L2 Cache Single Bank Functional Description    The L2 cache is organized into four identical banks  Each bank has its own interface  with the J Bus  the DRAM controller  and the CPU cache crossbar  CCX      Each L2 cache bank interfaces with the eight SPARC CPU cores through a processor   cache crossbar  PCX   The PCX routes the L2 cache requests  loads  ifetches  stores   atomics  ASI accesses  from all of the eight CPUs to the appropriate L2 cache bank   The PCX also accepts read return data  invalidation packets  and store ACK packets  from each L2 cache banks and forwards them to the appropriate CPU s      Each L2 cache bank interfaces with one DRAM controller in order to issue reads and  evictions t
117. e five IEEE exception status flags include          Invalid  nv          Overflow  of          Underflow  uf          Division by zero  dz          Inexact  nx     Chapter 7  Floating Point Unit 7 11        The FSR contains a 5 bit field for current exceptions  FSR cexc  and a 5 bit field  for accrued exceptions  FSR aexc   Each IEEE exception status flag has a  corresponding trap enable mask  TEM  in the FSR          Invalid mask     NVM         Overflow mask     OFM         Underflow mask     UFM         Division by zero mask     DZM         Inexact mask     NXM        The FPU does not receive the FSR TEM bits  The FSR TEM bits are used within  the FFU for the following cases          fp_exception_ieee_754 trap detection  If a FPop generates an IEEE exception   nv  of  uf  dz  nx  when the corresponding trap enable  TEM  bit is set  then a  fp_exception_ieee_754 trap is caused  The FSR cexc field has one bit set  corresponding to the IEEE exception  and the FSR aexc field remains  unchanged          Clear the FSR nxc flag if an overflow  underflow  exception does a trap  because the FSR OFM  FSR UFM  mask is set  regardless of whether the  FSR NXM mask is set  Set FSR ofc  FSR ufc           Clear the FSR ofc  FSR ufc  flag if overflow  underflow  exception traps when  the FSR OFM  FSR UFM  mask is not set and the FSR NXM mask is set  Set  FSR nxc          Clear the FSR ufc flag if the result is exact  and the FSR nxc flag is not set  and  the FSR UFM mask is not set  This c
118. e maximum wait period for a write access to the I cache is 25 SPARC core clock  cycles  A wait longer than 25 clock cycles will stall the SPARC core pipeline in order  to allow the I cache write access completion     2 3 5  I Cache Fill Path    I cache fill packets come from the level 2 cache to processor interface  CPX  by way  of the load store unit  LSU   Parity and predecode bits will be calculated before the  I cache fills up  CPX packets include invalidations  invalidation packets are non   blocking   test access point  TAP  reads and writes  and error notifications  The valid  bit array in the I cache has a dedicated port for servicing the invalidation packets     FIGURE  2 6 illustrates the I cache fill path     FIGURE  2 6 I Cache Fill Path    The I cache line size is 32 bytes  and a normal I cache fill takes two CPX packets of  16 bytes each  The instruction fill queue  IFQ  has a depth of two  An I cache line  will be invalidated when the first CPX packet is delivered and filled in the I cache     IFQ    cpxpkt  from LSU    asi  bist    To    Vbit Ary  To  I Cache  Bypass  to TIR    INV    bist gt asi gt cpx    Chapter 2  SPARC Core 2 11    That cache line will be marked as valid when the second CPX packet is delivered  and filled  I cache control guarantees the atomicity of the I cache line fill action  between the two halves of the cache line being filled     An instruction fetch from the boot PROM  by way of the system serial interface  SSI    is a very slow tra
119. e of MMU in Virtualization    The OpenSPARC T1 processor provides hardware support for the virtualization  where multiple images and or instances of the operating system  OS  coexist on top  of the underlying chip multiple threading  CMT  microprocessor     FIGURE  2 26 illustrates the view of virtualization     FIGURE  2 26 Virtualization Diagram    ITLB    IFU    DTLB    LSU    MMU    OS instance 1    Applications    Hypervisor    OpenSPARC T1    OS instance 2    Chapter 2  SPARC Core 2 45    The hypervisor  HV  layer virtualizes the underlying central processing units  CPU    The multiple instances of the OS images form multiple partitions of the underlying  virtual machine  The hypervisor improves the OS portability to the new hardware  and insures that failure in one domain would not affect the operation in the other  domains  The OpenSPARC T1 processor supports up to eight partitions  and the  hardware provides 3 bits of partition ID in order to distinguish one partition from  another     The hypervisor  HV  layer uses physical addresses  PA  while the supervisor  SV   layer views real addresses  RA  where the RAs represent a different abstraction of  the underlying PAs  All applications use virtual addresses  VA  to access memory    The VA will be translated to RA and then to PA by TLBs and the MMU      2 9 2  Data Flow in MMU    The MMU interacts with TLBs to maintain the content of TLBs  The system software  manages the content of MMU by way of three kinds of operati
120. e_jbus_tx_sync  In  CTU  TX sync    dram_gdbginit_l  In  CTU  Debug init for repeatability   J Bus freq    clk_dram_jbus_cken  In  CTU  J Bus clock enable    clk_dram_dram_cken  In  CTU  DDR clock enable    clk_dram_cmp_cken  In  CTU  CMP clock enable    clspine_dram_selfrsh  In  CTU  Signal from clock to put in self refresh  J   Bus freq    global_shift_enable  In  CTU  Scan shift enable signal    dram_si  In  DFT  Scan in    jbus_gclk  In  CTU  J Bus clock    dram_gclk  In  CTU  DDR clock    cmp_gclk  In  CTU  CMP clock    dram_adbginit_l  In  CTU  Active low async reset of dbginit_l    dram_arst_l  In  CTU  Active low async reset of rst_l    jbus_grst_l  In  CTU  Active low reset signal    dram_grst_l  In  CTU  Active low reset signal    cmp_grst_l  In  CTU  Active low reset signal    ctu_tst_scanmode  In  CTU    ctu_tst_pre_grst_l  In  CTU    ctu_tst_scan_disable  In  CTU    ctu_tst_macrotest  In  CTU    TABLE  8 4  DRAM Controller I O Signal List  Continued     Signal Name  I O  Source   Destination  Description    Chapter 8  DRAM Controller 8 11    ctu_tst_short_chain  In  CTU    dram_io_addr0 14 0   Out  PADS  DRAM address 0    dram_io_addr1 14 0   Out  PADS  DRAM address 1    dram_io_bank0 2 0   Out  PADS  DRAM bank 0    dram_io_bank1 2 0   Out  PADS  DRAM bank 1    dram_io_cas0_l  Out  PADS  DRAM CAS 0    dram_io_cas1_l  Out  PADS  DRAM CAS 1    dram_io_channel_disabled0  Out  PADS  DRAM channel disable 0    dram_io_channel_disabled1  Out  PADS  DRAM channel disable 
121. ead stalls   switches out  while waiting for an FPU result         The FPU includes three parallel pipelines and these pipelines can simultaneously  have instructions at various stages of completion  FIGURE  7 1 displays an FPU    block diagram that shows these parallel pipelines     FIGURE  7 1 FPU Functional Block Diagram    From PCX  To CPX    FPU  input  FIFO  queue    FPU  output  arbitration    Divide pipeline  FPD    Multiply pipeline  FPM    Add pipeline  FPA    Chapter 7  Floating Point Unit 7 3    The following sections provide additional information about the OpenSPARC T1  FPU         Section  7 1 1     Floating Point Instructions    on page  7 4        Section  7 1 2     FPU Input FIFO Queue    on page  7 5        Section  7 1 3     FPU Output Arbitration    on page  7 6        Section  7 1 4     Floating Point Adder    on page  7 6        Section  7 1 5     Floating Point Multiplier    on page  7 7        Section  7 1 6     Floating Point Divider    on page  7 8        Section  7 1 7     FPU Power Management    on page  7 9        Section  7 1 8     Floating Point State Register Exceptions and Traps    on page  7 10    TABLE  7 1  OpenSPARC T1 FPU Feature Summary    Feature  OpenSPARC T1 Processor FPU Implementation    ISA  SPARC V9    VIS  Not available    Issue  1    Register file  In FFU    FDIV blocking  No    Full hardware  denorm support    Yes    Hardware quad  support    No    7 4 OpenSPARC T1 Microarchitecture Specification      August 2006    7 1 1  Flo
122. ecomposition Queue  which will then send Request  Header and Data to sctag of L2 cache         The following types of writes are supported  refer to the OpenSPARC T1 External  Interface Specification for details of the transaction types      1  WriteInvalidate  WRI   WriteInvalidateSelf  WRIS   NonCachedWriteComressible   NCBWR  are treated as 64 byte writes    2  NCWR is treated as 8 byte write    3  WriteMerge  WRM           WRM is similar to WRI but with 64 bit Byte enables  supporting 0 to 64 byte  writes          Multiple 8 byte write requests  WR8  to the L2 cache        Write decomposition         WRM is broken into 8 byte write requests  WR8  and sent to the L2 cache at  the head of the write decomposition queue  WDQ          Number of requests is dependent on the WRM byte enable pattern         Each WR8 request writes 1 to 8 contiguous bytes         If a run of contiguous bytes crosses an 8 byte address boundary  two WR8s are  generated         A WRM transaction can generate up to 32 WR8s to the L2 cache        Writes to the L2 cache may observe strict ordering with respect to the other writes  to the L2 cache  software programmable     6 4 OpenSPARC T1 Microarchitecture Specification      August 2006    6 1 1 2  Read Requests to the L2 Cache    A DMA read request from the J Bus is parsed by the J Bus parser and then the  information is passed to the write decomposition queue  WDQ   which will then  send the request header to the sctag of the L2 cache  Data returned 
123. ectural Set   compact sram cells   Working Set   fast RF cells     Transfer  port    Read Write  Access from pipe    w n 1     w n 1     w n     Chapter 2  SPARC Core 2 15    2 3 9  Instruction Table Lookaside Buffer    The instruction table lookaside butter  ITLB  is responsible for address translation  and tag comparison  The ITLB is always turned on for non hypervisor mode  operations  and the ITLB is always turned off for hypervisor mode operations     The ITLB contains 64 entries  The replacement policy is a pseudo least recently used   pseudo LRU  policy  which is the same policy as that for the I cache     The ITLB supports page sizes of 8 Kbytes  64 Kbytes  4 Mbytes  and 256 Mbytes   Multiple hits in the ITLB are prevented by the autodemap feature in an ITLB fill     2 3 10  Thread Selection Policy    Thread switching takes place during every SPARC core clock cycle  At the time of a  thread selection  the priority is given to the least recently executed yet available  thread  Load instructions will be speculated as cache hits and the thread executing a  load instruction will be deemed as available and allowed to be switched in with a  low priority     A thread could become unavailable due to one of these reasons     1  The thread is executing one of the long latency instructions  such as load  branch   multiplication  division  and so on     2  The SPARC core pipeline has been stalled due to one of the long latency  operations  such as encountering a cache miss  takin
124. eir respective caches  An acknowledgement to a local I flush is  treated the same way as an interrupt     Streaming stores will be completed to the D cache before the acknowledgement is  sent to the SPU     2 4 9  ASI Queue and Bypass Queue    Certain SPARC core internal alternate space identifier  ASI  accesses  such as the  long latency MMU ASI transactions and all IFU ASI transactions  are queued in the  ASI queue  The ASI queue is a FIFO that supports one outstanding ASI transaction  per thread  For all read type ASI transactions  regardless whether they originated  from the LSU or not  must have their the return data routed through the LSU and be  delivered to the register file by way of the bypass queue     The bypass queue handles all of the load reference data  other than that received  from the L2 cache  that must be asynchronously written to the integer register file   IRF   This kind of read data includes full RAW data from the store buffer  ldxa to the  internal ASI data  store data for casa  a forward packet for the ASI transactions  as  well as the pending precise traps     2 28 OpenSPARC T1 Microarchitecture Specification      August 2006    2 4 10  Alternate Space Identifier Handling in the Load  Store Unit    In addition to sourcing alternate space identifier  ASI  data to other functional units  of a SPARC core  the load store unit  LSU  decodes and supports a variety of ASI  transactions  which include         Defining the behavior of ld st ASI transactions
125. emory referencing operation codes  opcodes   such as various types of loads  various types of stores  cas  swap  ldstub  flush   prefetch  and membar  The LSU interfaces with all of the SPARC core functional  units  and acts as the gateway between the SPARC core units and the CCX  Through  the CCX  data transfer paths can be established with the memory subsystem and the  I O subsystem  the data transfers are done with packets      The threaded architecture of the LSU can process four loads  four stores  one fetch   one FP operation  one stream operation  one interrupt  and one forward packet   Therefore  thirteen sources supply data to the LSU     The LSU implements the ordering for memory references  whether locally or not   The LSU also enforces the ordering for all the outbound and inbound packets     2 22 OpenSPARC T1 Microarchitecture Specification      August 2006    2 4 1  LSU Pipeline    There are four stages in the LSU pipeline  FIGURE  2 13 shows the different stages of    the LSU pipeline     FIGURE  2 13 LSU Pipeline Graph    The cache access set up and the translation lookaside buffer  TLB  access set up are  done during the pipeline   s E stage  execution   The cache tag TLB read operations  are done in the M stage  memory access   The W stage  writeback  supports the  look up of the store buffer  the detection of traps  and the execution of the data  bypass  The W2 stage  writeback 2  is for generating PCX requests and writebacks to  the cache     2 4 2  Data F
126. en clusters    d  The D and J domain clock enables are subject to Tx_sync    Chapter 10  Clocks and Resets 10 13    7  Deassert resets    a  For cold resets  the ARST_L signals are already deasserted at the deassertion of  the PWRON_RST_L reset    b  The GRST_L signals are deasserted at the same time in all domains    c  The DLL reset is deasserted a few cycles before the GRST_L deassertion    8  Transfer the e Fuse cluster  EFC  data    Note     This step is only performed during a cold reset     a  The CTU kicks the EFC to start the data transfer    b  The EFC transfers device specific information such as SRAM repair  information to the target clusters    c  Core available information is programmed into the IOB  but it is still visible to  the CTU    d  There is no handshake to indicate the end of the operation  and the CTU just  waits a fixed number of cycles    9  Do BIST    Note     This step is only performed during a cold reset     a  At the J_RST_L reset deassert time  DO_BIST pin is sampled for eight cycles to  determine the msg  which determines     i  The DO_BIST pin tied low on system    ii  Do or do not perform a BIST action    iii  BIST vs  bi directional schematic interface  BISI     iv  Serial vs  parallel    b  If a BIST action is required  it occurs after the EFC is done    c  The CTU starts the BIST engines  enabled by EFC   and then the CTU waits for  a response from the engines    d  The status from each BIST engine is recorded  but does not affect reset 
127. ent  potential duplication in the L2 cache tags     The latencies for completing different load instructions may differ  for example  a  quad load fill will have to access integer register file  IRF  twice      The LMQ is also leveraged by other instructions  For example  the first packet of a  CAS instruction will be issued out of the store buffer while the second packet will be  issued out the LMQ     2 4 7  Processor to Crossbar Interface Arbiter    The processor to crossbar interface  PCX  is the interface between the processor and  the CCX  The arbiter takes on 13 sources to produce one arbitrated output in one  cycle  The 13 sources include     four load type instructions  four store type  instructions  one instruction cache  I cache  fill  one floating point unit  FPU  access   one stream processing unit  SPU  access  one interrupt  and one forward packet     The 13 sources are further divided into four categories of different priorities  The  I cache miss handling is one category  The load instructions  one outstanding per  thread  are in one category  The store instructions  one outstanding per thread  are in  another category  The rest of accesses are lumped into one category  and include      the FPU access  SPU access  interrupt  and the forward packet     The arbitration is done within the category first and then among the other  categories  An I cache fill is at the highest priority  while all other categories have an  equal priority  The priorities can be illus
128. entry in the TLB  For diagnostics  purposes  a single bit parity error can be injected on writes     A page may be specified as a real on write  and a page will have a partition assigned  to it on a write     2 9 6  Specifics on TLB Read Access    TLB read operations follow the same handshake protocol as TLB write operations   The ASI data access operations will read the RAM portion  that is  the TTE data    The ASI tag read access operations will read the TTE tag from the RAM  The TLB  read data will be returned to the bypass queue in the LSU  If no parity error is  detected  the LSU will forward the data  Otherwise  the LSU will take a trap     2 9 7  Translation Lookaside Buffer Demap    The system software can invalidate entries in the translation lookaside buffer  TLB   selectively using demap operations in any one of the following forms for the ITLB  and the DTLB respectively and distinctly  Each demap operation is partition specific         Demap by page real     Match VA tag and translate RA to PA        Demap by page virtual     Match VA tag and translate VA to PA        Demap by context     Match context only  has no effect on real pages         Demap all     Demap all but the locked pages    Chapter 2  SPARC Core 2 49    The system software can clear the entire TLB distinctly through an invalidate all  operation  which includes all of the locked pages     2 9 8  TLB Auto Demap Specifics    Each TLB is shared by all four threads  The OpenSPARC T1 processor provides 
129. er file  In case of aborting an MA operation  the pending ldxa to  MA_Sync is unblocked  and the SPU signals the LSU will not update the register file     FIGURE  2 23 illustrates the MA operations using a state transition diagram     FIGURE  2 23 State Transition Diagram Illustrating MA Operations    ma_op    Idle  Abort    Wait    2 42 OpenSPARC T1 Microarchitecture Specification      August 2006    The state transitions are clarified by the following set of equations     A MA_ST operation is started with a stxa to the MA_CTL register opcode equals the  MA_ST  and the length field specifies the number of words to send to the level 2  cache  L2 cache   The SPU sends a processor to cache interface  PCX  request to the  LSU and waits for an acknowledgement from the LSU prior to sending another  request  If needed  store acknowledgements  which are returned from the L2 cache  on level 2 cache to processor interface  CPX   will go to the LSU in order to  invalidate the level 1 D cache  L1D   The LSU will then send the SPU an  acknowledgement  The SPU then decrements a local counter and waits for all the  stores sent out to be acknowledged and transitioned to the done state     On a read from the MA Memory  the operation will be halted if a parity error is  encountered  The SPU waits for all posted stores to be acknowledged  If the Int bit is  cleared  Int   0   the SPU will signal the LSU and the IFU on all ldxa to the MA  registers     An MA_LD operation is started with a stxa 
130. errupt queue and an interrupt  ACK NACK queues in the JBI in order to interface to the IOB     6 2 OpenSPARC T1 Microarchitecture Specification      August 2006        There are only two sub blocks in the JBI  J Bus parser and J Bus transaction issue   specific to J Bus  All of the other blocks are J Bus independent  J Bus independent  blocks can be used for any other external bus interface implementation     FIGURE  6 1 displays the JBI block diagram     FIGURE  6 1 JBI Functional Block Diagram    The following sub sections describe the various JBI transactions and interfaces from  the JBI to the other functional blocks     IOB    IOB    Debug  FIFOs  32 x 64b  Debug  FIFOs  32 x 64b    Request Q0  Hdr and Data    Request Q1  Hdr and Data    Request Q2  Hdr and Data    Request Q3  Hdr and Data    Return Q0  16 x 138b    Return Q1  16 x 138b    Return Q2  16 x 138b    Return Q3  16 x 138b    Write  Decomp   Queue  16 x 156b    J Bus  Parser  J Bus  Txn Issue  SSI    SSI    SCTAG  Bank 0 Bank 1 Bank 2 Bank 3    L2    JBI    SCBUF  Bank 0 Bank 1 Bank 2 Bank 3    PIO Request Q  16 x 64b     PIO Return Q  16 x 128b     Interrupt Q  16 x 138b     Interrupt Ack Nack Q   16 x 10b     J BUS    Chapter 6  J Bus Interface 6 3    6 1 1  J Bus Requests to the L2 Cache    There are two types of requests from J Bus to L2     read and write     6 1 1 1  Write Requests to the L2 Cache    DMA write request from J Bus is parsed by J Bus parser and then it passes the  information to the Write D
131. es do a sequenced turn off on         After a reset  the software can turn each cluster off         CREG_CKEN has one bit per cluster  except CTU   and bits are reset to all ones  by a reset sequence         CREG_CKEN is NOT shadowed  and the effect is immediate         Turning off some clusters could be fatal  but it can be recovered with a test access  port  TAP  reset         Turning off the IOB is catastrophic  and will require a reset sequence to recover     10 1 1 5  Clock Stop    Clock stop event have the following characteristics         Clock stop events can be chosen from a wide palette         When a clock stop event occurs  a trigger is sent to the CTU         The CTU does a sequenced clock enable  CKEN  turn off          Can be staggered or instant  which is controlled by the  CREG_CLK_CTL OSSDIS          The first cluster to turn off is defined by the TAP  and the progression is in the  CREG_CLK_CTL bit order with wraparound          The time the first cluster is turned off is controlled by  CREG_CLK_DLL DBG_DLY          The gap between clusters is controlled by CREG_CLK_CTL STP_DLY         After a clock stop  you can use JTAG to do a scan dump and a macro dump         After a clock stop and JTAG dump  you need to perform a cold reset to continue     10 8 OpenSPARC T1 Microarchitecture Specification      August 2006    10 1 1 6  Clock Stretch    Clocks can be stretched by making dividers skip one PLL beat  The C  D  and J clock  domains are stretched simultaneou
132. ess of an entry in the WBB is  inserted into the miss buffer  This instruction must wait for the entry in the WBB to  write to the DRAM before entering the L2 cache pipe     The WBB is divided into a RAM portion  which stores the evicted data until it can be  written to the DRAM  and a CAM portion  which contains the address     The WBB has a 64 byte read interface with the scdata array and a 64  bit write  interface with the DRAM controller  The WBB reads from the scdata array faster than  it can flush data out to the DRAM controller     4 1 2 11  Remote DMA Write Buffer    The remote DMA  RDMA  write buffer is a four entry buffer that accommodates the  cache line for a 64 byte DMA write  The output interface is with the DRAM  controller that it shares with the WBB  The WBB has a direct input interface with the  JBI     4 1 2 12  L2 Cache Directory    Each L2 cache directory has 2048 entries  with one entry per L1 tag that maps to a  particular L2 cache bank  Half of the entries correspond to the L1 instruction cache   icache  and the other half of the entries correspond to the L1 data cache  dcache    The L2 directory participates in coherency management and it also maintains the  inclusive property of the L2 cache     The L2 cache directory also ensures that the same line is not resident in both the  icache and the dcache  across all CPUs   The L2 cache directory is written in the C5  cycle of a load or an I miss that hits the L2 cache  and is cammed in the C5 cycle of  a 
133. et in the same cycle  ARB0 follows a  round robin policy to arbitrate among packets from multiple sources     A 5 bit bus originates from each CPU  and the bit corresponding to the destination is  high while all other bits are low  Each arbiter receives one bit from the 5 bit bus from  each CPU     The arbitration scheme is implemented using a simple checkerboard as shown in  FIGURE  3 10     FIGURE  3 10 Control Flow in PCX Arbiter    The checkerboard consists of eight FIFOs  Each FIFO is sixteen entries deep  and  each entry holds a single valid bit received from its corresponding CPU  Each valid  FIFO entry represents a valid packet from a source for the L2 cache Bank0  Since  each source can send at the most two entries for the L2 cache Bank0  there can be at    CPU0  CPU1  CPU2  CPU3  CPU4  CPU5  CPU6  CPU7    C0    Q2  Q1  Q0    C1  C2  C3  C4  C5  C6  C7 Direction    ARB0  ARB1    Data select  to arbiter 0  Data select  to arbiter 2  Data select  to arbiter 3  Data select  to arbiter 1  Data select  to arbiter 4    ARB2  ARB3  ARB4     4 0      0    1    2    3    4     8  8  8  8  8    16 entries    C0    Q2  Q1  Q0    C1  C2  C3  C4  C5  C6  C7 Direction    16 entries    3 20 OpenSPARC T1 Microarchitecture Specification      August 2006    most two valid bits in each FIFO  Therefore  the entire checkerboard can have a  maximum of 16 valid bits  This maximum represents the case when the L2 cache  Bank0 is unable to process any new entry  The PCX reaches the maximum li
134. evious instructions because the C2 stage of the current instruction completes  before the C5 stage of the last instruction         The miss buffer is cammed in the C1 stage  However  the MB is written in the C3  stage  The bypass logic for a miss buffer entry generation is completed in the C2  stage  This ensures that the correct data is available to the current instruction from  previous instructions  because the C2 stage of the current instruction starts before  the C3 stage of the last instruction completes     Chapter 4  Level 2 Cache 4 11    C3        The set and way select is transmitted to scdata         An entry is created in miss buffer for instructions that miss the cache     C4        The first cycle of read or write to the scdata array for load store instructions that  hit the cache     C5        The second cycle of read or write to the scdata array for load store instructions  that hit the cache         Write into the L2 cache directory for loads  and CAM the L2 cache directory for  stores         Write the new state of line into the VUAD array  by now the new state of line has  been computed          Fill buffer bypass     If the data to service the load that missed the cache is  available in the FB  then do not wait for the data to be available in the data array   The FB provides the data directly to the pipeline     C6        128 bits of data and 28 bits of ECC are transmitted from the scdata  data array  to  the sctag  tag array      C7        Error correctio
135. following list highlights the architecture registers maintained by the trap logic  unit  TLU   Only supervisor  SV  or hypervisor  HV  privileged code can access  these registers     1  Processor state and control registers        PSTATE  processor state  register        TL  trap level  register        GL  global register window level  register        PIL  processor interrupt level  register        TBA  trap base address  register        HPSTATE  Hypervisor processor state  register        HTBA  Hypervisor trap base address  register        HINTP  Hypervisor interrupt pending  register        HSTICK_CMPR_REG  Hypervisor system tick compare  register    2  Trap stack  six deep         TPC  trap PC  register        TNPC  trap next PC  register        TTYPE  trap type  register        TSTATE  trap state  register        HTSTATE  Hypervisor trap state  register    3  Ancillary state registers        TICK_REG  tick  register        STICK_REG  system tick  register        TICK_CMPR_REG  tick compare  register        STICK_CMPR_REG  system tick compare  register        SOFTINT_REG  software interrupt  register        SET_SOFTINT  set software interrupt register  register        CLEAR_SOFTINT  clear software interrupt register  register        PERF_CONTROL_REG  performance control  register        PERF_COUNTER  performance counter  register    Chapter 2  SPARC Core 2 53    4  ASI mapped registers        Scratch pad registers  eight of them         CPU and device mondo registers      
136. from the L2   cache scbuf is then passed from the return queues to the J Bus transaction issue  and  then to the J Bus         Type of reads supported          ReadToDiscard  RDD   ReadToShare  RDS   ReadToShareAlways  RDSA    NonCachedBlockRead  NCBRD  translates to 64 byte RDDs to the L2 cache         NonCachedRead  NCRD  translates to 8 byte RDD to the L2 cache         There is a maximum of 4 outstanding reads to each L2 cache bank        Reads to the L2 cache may observe strict ordering with respect to writes to the L2   cache  software programmable     6 1 1 3  Flow Control    WDQ gives backward pressure to the J Bus when the programmable high  watermark has been reached  Credit based flow control exists between the JBI and  the L2 cache  arising from the L2 cache   s two entry snoop input buffer and the four   entry RDMA write buffer     6 1 2  I O Buffer Requests to the J Bus    Write requests  NCWR  can be 1  2  4  or 8 byte writes and those writes are aligned  to size  Write request comes from the I O buffer  IOB   gets stored in the PIO request  queue  and then goes out on the J Bus     Read requests comes from IOB  gets stored in the PIO request queue  and then goes  out on the J Bus  The data read from J Bus is then parsed by J Bus parser  and then  the data is stored in the PIO return queue which is sent to the IOB     The Read transactions  NCRD  can be 1  2  4  8  16 byte reads and are aligned to size   There is a maximum support for 1 to 4 pending reads to the
137. g a trap  or experiencing a  resource conflict     2 16 OpenSPARC T1 Microarchitecture Specification      August 2006    2 3 11  Thread States    A thread cycles through these three different states     idle  active  and halt  FIGURE  2 9    illustrates the basic transition of non active states     FIGURE  2 9 Basic Transition of Non Active States    A thread is in an idle state at power on  An active thread will only be transitioned to  an idle state after a wait mask for an I cache fill has been cleared     A thread in the idle state should not receive the resume command without a previous  reset  When a thread is violated  the integrity of the hardware behavior cannot be  guaranteed     FIGURE  2 10 illustrates the thread state transition of an active thread     Idle    Halt  Active    Id    le    in    tr    I    d    l    e    i    n    tr    Ha    lt    in    st    an    y i    ntr    re    su    m    e     re    se    t    Chapter 2  SPARC Core 2 17    FIGURE  2 10 Thread State Transition of an Active Thread    An active thread could be placed in the wait state because of any of the following  reasons     1  Wait for an I cache fill     2  Wait due to store buffer full     3  Wait due to long latency  or a resource conflict where all resource conflicts arise  because of long latency     4  Wait due to any combination of the preceding reasons     The current wait state is tracked in the IFU wait masks     FIGURE  2 11 illustrates the state transition for a thread in spe
138. h source connects with the CPX  on its own separate bus  Therefore  there are six buses that connect from the four  L2 caches  the I O bridge  and the FPU to the CPX  The CPX connects by way of a  separate bus to each destination  Therefore  there are eight buses from the PCX that  connect it to the six destinations  The CPX does not perform any packet processing   so the bus width from the CPX to each destination is 145 bits wide  which is  identical to the bus width from the source to the CPX  FIGURE  3 3 illustrates the CPX    interface     FIGURE  3 3 Cache Processor Crossbar  CPX  Interface    A source can send at most two single packet requests  or one two packet request  to  a particular destination  There is a 2 deep queue inside the CPX for each source   destination pair that holds the packet  The CPX sends a grant to the source after  dispatching a packet to its destination  Each source uses this handshake signal to  monitor the queue full condition     Unlike the PCX  the CPX does not receive a stall from any of its destinations  as each  CPU has an efficient mechanism to drain the buffer that stores the incoming packets     CPU  8     L2Cache  4   IOBridge  FPU    CCX  PCX  CPX    Chapter 3  CPU Cache Crossbar 3 5    3 1 5  CPX and PCX Packet Formats    TABLE  3 1 and TABLE  3 2 define the CPX packet format  and TABLE  3 3 and TABLE  3 4    define the PCX packet format     Note     For the next four packet format tables  the table entries are defined as  follows  
139. he  bank is 768 Kbytes in size  with each logical line 64 bytes in size  The bank allows  read access of 16 bytes and 64 bytes  and each cache line has 16 byte enables to allow  writing into each 4 byte part  However  a fill updates all 64 bytes at a time     Each scdata array bank is further subdivided into four columns  Each column  consists of six 32 Kbyte sub arrays     Any L2 cache data array access takes two cycles to complete  so no columns can be  accessed in consecutive cycles  All access can be pipelined except back to back  accesses to the same column  The scdata array has a throughput of one access per  cycle     Each 32 bit word is protected by seven bits of SEC DED ECC   Each line is 32 x  32    7 ECC    1248 bits   All sub word accesses require a read modify write operation  to be performed  and they are referred to in this chapter as partial stores     4 1 2 5  Input Queue    The input queue  IQ  is a 16 entry FIFO that queues packets arriving on the PCX  when they cannot be immediately accepted into the L2 cache pipe  Each entry in the  IQ is 130 bits wide    The FIFO is implemented with a dual ported array  The write port is used for  writing into the IQ from the PCX interface  The read port is for reading contents for  issue into the L2 cache pipeline  If the IQ is empty when a packet comes to the PCX   the packet can pass around the IQ if it is selected for issue to the L2 cache pipe  The  IQ asserts a stall to the PCX when all eleven entries are used in t
140. he FIFO  This stall  allows space for the packets already in flight     4 6 OpenSPARC T1 Microarchitecture Specification      August 2006    4 1 2 6  Output Queue    The output queue  OQ  is a 16 entry FIFO that queues operations waiting for access  to the CPX  Each entry in the OQ is 146 bits wide  The FIFO is implemented with a  dual ported array  The write port is used for writing into the OQ from the L2 cache  pipe  The read port is used for reading contents for issue to the CPX  If the OQ is  empty when a packet arrives from the L2 cache pipe  the packet can pass around the  OQ if it is selected for issue to the CPX     Multicast requests are dequeued from the FIFO only if all the of CPX destination  queues can accept the response packet  When the OQ reaches its high water mark   the L2 cache pipe stops accepting inputs from miss buffer or the PCX  Fills can  happen while the OQ is full since they do not generate CPX traffic     4 1 2 7  Snoop Input Queue    The Snoop input queue  SNPIQ  is a two entry FIFO for storing DMA instructions  coming from the JBI  The non data portion  the address  is stored in the snoop input  queue  SNPIQ   For a partial line write  WR8   both the control and the store data is  stored in snoop input queue     4 1 2 8  Miss Buffer    The 16 entry miss buffer  MB  stores instructions which cannot be processed as a  simple cache hit  These instructions include true L2 cache misses  no tag match    instructions that have the same cache line addre
141. he clock edge will align         Only C lt  gt D and C lt  gt J clock domain crossings are supported         Domain crossing is governed by the Rx Tx sync pulses  which are named with  respect to the domain  for example  dram_rx_sync means the C domain is  receiving from the D domain          Sync pulses are generated in the C domain  and are used as clock enables for the  C domain flops         Domain crossing paths are time delayed as a single cycle path in C domain         The prescribed usage allows electrical correctness  and the logical correctness is  still up to surrounding logic     FIGURE  10 4 shows a waveform for cross domain crossing Rx and Tx pulses     TABLE  10 1 Clock Domain Dividers    C Divider  D Divider  J Divider  Description    1  4  4  Power On default   PLL Bypass mode    4  16  16  Power On default   PLL Locked mode    2  14  12  Expected nominal ratios    10 6 OpenSPARC T1 Microarchitecture Specification      August 2006    FIGURE  10 4 Sync Pulses Waveforms    dram_clk  cmp_clk    dram_rx    cmp    PLL    dram    dram_rx    Yd    Y    Yc    Yd  Y  Yc    a  b    en    a  b    a  b    dram_clk  cmp_clk    cmp    PLL    dram    dram_tx    Xc    X    Xd    Xc  X  Xd    a  b    dram_tx    en    a  b    a  b    Chapter 10  Clocks and Resets 10 7    10 1 1 4  Clock Gating    Clock gating has the following characteristics         The CTU will occasionally gate an entire domain off on         Each cluster can be gated off on separately         Reset sequenc
142. he need for a tag lookup in the L1   cache     Coherency and ordering in the L2 cache are described as         Loads update directory and fill the L1 cache on return        Stores are non allocating in the L1 cache         There are two flavors of stores  total store order  TSO  and read memory order   RMO      Only one outstanding TSO store to the L2 cache per thread is permitted in  order to preserve the store ordering  There is no such limitation on RMO  stores          No tag check is done at a store buffer insert         Stores check directory and determines an L1 cache hit         Directory sends store acknowledgements or invalidates to the SPARC core         Store updates happens to D  on a store acknowledge        Crossbar orders the responses across cache banks     Chapter 1  OpenSPARC T1 Overview 1 11    1 3 5  DRAM Controller    The OpenSPARC T1 processor DRAM controller is banked four ways  with each L2  bank interacting with exactly one DRAM controller bank  a two bank option is  available for cost constrained minimal memory configurations   The DRAM  controller is interleaved based on physical address bits 7 6  so each DRAM controller  bank must have identical dual in line memory modules  DIMM  installed and  enabled     The OpenSPARC T1 processor uses DDR2 DIMMs and can support one or two ranks  of stacked or unstacked DIMMs  Each DRAM bank port is two DIMMs wide  128   bit   16 bit ECC   All installed DIMMs must be identical  and the same number of  DIMMs  tha
143. he operation     2 50 OpenSPARC T1 Microarchitecture Specification      August 2006    1  Read zero_ctxt_cfg or nonzero_ctxt_cfg to determine the page size     2  Read zero_ctxt_tsb_base_ps0 or zero_ctxt_tsb_base_ps1 on  nonzero_ctxt_tsb_base_ps0 or nonzero_ctxt_tsb_base_ps1 to get the TSB base  address and size of the TSB     3  Access tag     Software will then generate a pointer into the TSB based on the VA  the TSB base  address  the TSB size  and the Tag     2 10  Trap Logic Unit    The trap logic unit  TLU  supports six trap levels  A trap can be in one of the  following four modes     reset error debug  RED  mode  hypervisor  HV  mode   supervisor  SV  mode  and user mode  Traps will cause the SPARC core pipeline to  be flushed  and a thread switch to occur  until the trap vector  redirect PC  has been  resolved     Software interrupts are delivered to each of the virtual cores using the  interrupt_level_n trap through the SOFTINT_REG register  I O and CPU cross call  interrupts are delivered to each virtual core using the interrupt_vector trap  Up to 64  outstanding interrupts can be queued up per thread   one for each interrupt vector   Interrupt vectors are implicitly prioritized  with vector 0x63 being at the highest  priority  while vector 0x0 is at the lowest priority  Each I O interrupt source has a  hardwired interrupt number that is used as the interrupt vector by the I O bridge  block     The TLU is in a logically central position to collect all of the traps 
144. hecking  Full   RAW data will be returned to the register files from the pipe  Partial RAW hits will  force the load to access the L2 cache while interlocked with the store issued to the  CCX  Multiple hits in the store buffer will always force access to the L2 cache in  order to enforce data consistency     If a store hits any part of a quad load  16 byte access   the quad load checking will  force the serialization of the issue to the CCX  This forced serialization enforces that  there will be no bypass operation     Instructions such as a blk load  64 byte access  will not detect the potential store  buffer hit on the 64 byte boundary  The software must guarantee the data  consistency using membar instructions     2 26 OpenSPARC T1 Microarchitecture Specification      August 2006    2 4 6  Load Miss Queue    The load miss queue  LMQ  contains four entries in its physical structure  and the  queue supports up to one load miss per thread  Instructions similar to load  such as  atomics and prefetches  may also reside in the load miss queue     A load instruction speculates on a D cache miss to reduce the latency in accessing  the CCX  The load instruction may also speculate on the availability of a queue entry  in the CCX  If the speculation fails  the miss speculated load instruction can be  replayed out of LMQ     Load requests to the L2 cache from different addresses can alias to the same L2   cache line  Primary versus secondary checking will be performed in order to prev
145. her Instructions    L1 Cache Invalidation    The instruction invalidates the four primary cache entries as well as the four L2   cache directory entries corresponding to each primary cache tag entry  The  invalidation is issued whenever the CPU detects a parity error in the tags of I cache  or dcache     Interrupts    When a thread wants to send an interrupt to another thread  it sends it through the  L2 cache  The L2 cache treats the thread like a bypass  After a decode  the L2 cache  sends the instruction back to destination CPU if it is a interrupt     Chapter 4  Level 2 Cache 4 17    Flush    From the L2 cache s perspective  a flush is a broadcast  The OpenSPARC T1  processor requires this flush instruction  Whenever a self modifying code is  performed  the first instruction at the end of the self modifying sequence should  come from a new stream     An interrupt with a BR 1 is broadcast to all CPUs   Such an interrupt is issued by a  CPU in response to a flush instruction      A flush stays in the output queue until all eight receiving queues are available  This  is a total store order  TSO  requirement     4 1 5  L2 Cache Memory Coherency and Instruction  Ordering    Cache coherency is maintained using a mixture of structures in the miss buffer  fill  buffer  and the write back buffer  The miss buffer maintains a dependency list for the  access to the 64 bytes of cache lines with the same address  Responses are sent to the  CPUs in the age order of the requests for the 
146. here are 12 used bits  per set in the L2 cache  The used bit gets set when there are any store load hits   1  per way   Used bits get cleared  all 12 at a time  when there are no unused or  unallocated entries for that set     The allocate bit indicates that the marked line has been allocated to a miss  This bit  is also used in the processing of some special instructions  such as atomics and  partial stores   Because these stores do read modify writes  which involve two passes  through the pipe  the line needs to be locked until the second pass completes   otherwise  the line may get replaced before the second pass happens   The allocate    Chapter 4  Level 2 Cache 4 5    bit  therefore  acts analogous to a lock bit  The allocate bit  per way  gets set when a  line gets picked for replacement  For a load or an ifetch  the bit gets cleared when a  fill happens  and for a store when a store completes     The dirty bit indicates that L2 cache contains the only valid copy of the line  The  dirty bit  per way  gets set when a stores modifies the line  It gets cleared when the  line is invalidated     The pseudo least recently used  LRU  algorithm examines all the ways starting from  a certain point in a round robin fashion  The first unused  unallocated ways is  selected for replacement  If no unused  unallocated way is found  then the first  unallocated way is selected     4 1 2 4  L2 Data  scdata     The L2 data  scdata  array bank is a single ported SRAM structure  Each L2 cac
147. hread in Speculative States  2   18    FIGURE  2 12  Rollback Mechanism Pipeline Graph  2   19    FIGURE  2 13  LSU Pipeline Graph  2   22    FIGURE  2 14  LSU Data Flow Concept  2   23    FIGURE  2 15  Execution Unit Diagram  2   33    FIGURE  2 16  Shifter Block Diagram  2   34    FIGURE  2 17  ALU Block Diagram  2   34    xiv OpenSPARC T1 Microarchitecture Specification      August 2006    FIGURE  2 18  IDIV Block Diagram  2   35    FIGURE  2 19  Top Level FFU Block Diagram  2   36    FIGURE  2 20  Multiplexor  MUL  Block Diagram  2   37    FIGURE  2 21  Layout of MA_ADDR Register Bit Fields  2   38    FIGURE  2 22  Data Flow of Modular Arithmetic Operations  2   40    FIGURE  2 23  State Transition Diagram Illustrating MA Operations  2   41    FIGURE  2 24  Multiply Function Result Generation Sequence Pipeline Diagram  2   43    FIGURE  2 25  MMU and TLBs Relationship  2   44    FIGURE  2 26  Virtualization Diagram  2   44    FIGURE  2 27  Translation Lookaside Buffer Structure  2   46    FIGURE  2 28  TLU Role With Respect to All Other Backlogs in a SPARC Core  2   51    FIGURE  2 29  Trap Flow Sequence  2   55    FIGURE  2 30  Trap Flow With Respect to the Hardware Blocks  2   56    FIGURE  2 31  Flow of Hardware and Vector Interrupts  2   58    FIGURE  2 32  Flow of Reset or Idle or Resume Interrupts  2   59    FIGURE  2 33  Flow of Software and Timer Interrupts  2   60    FIGURE  2 34  Trap Modes Transition  2   62    FIGURE  2 35  Thread State Transition  2   63    F
148. ic packet    spc5_pcx_data_pa 123 0   In  sparc5  SPARC PCX data address    spc5_pcx_req_pq 4 0   In  sparc5  SPARC PCX request    spc6_pcx_atom_pq  In  sparc6  Atomic packet    spc6_pcx_data_pa 123 0   In  sparc6  SPARC PCX data address    spc6_pcx_req_pq 4 0   In  sparc6  SPARC PCX request    TABLE  3 5  CCX I O Signal List  Continued     Signal Name  I O  Source Destination  Description    Chapter 3  CPU Cache Crossbar 3 11    spc7_pcx_atom_pq  In  sparc7  Atomic racket    spc7_pcx_data_pa 123 0   In  sparc7  SPARC PCX data address    spc7_pcx_req_pq 4 0   In  sparc7  SPARC PCX request    iob_pcx_stall_pq  In  IOB PCX stall    ccx_scanout0  Out  DFT  Scan out 0    ccx_scanout1  Out  DFT  Scan out 1    cpx_iob_grant_cx2 7 0   Out  IOB  CPX grant    cpx_sctag0_grant_cx 7 0   Out  L2 Bank0  CPX grant    cpx_sctag1_grant_cx 7 0   Out  L2 Bank1  CPX grant    cpx_sctag2_grant_cx 7 0   Out  L2 Bank2  CPX grant    cpx_sctag3_grant_cx 7 0   Out  L2 Bank3  CPX grant    cpx_spc0_data_cx2 144 0   Out  sparc0  CPX SPARC data    cpx_spc0_data_rdy_cx2  Out  sparc0  CPX data ready    cpx_spc1_data_cx2 144 0   Out  sparc1  CPX SPARC data    cpx_spc1_data_rdy_cx2  Out  sparc1  CPX data ready    cpx_spc2_data_cx2 144 0   Out  sparc2  CPX SPARC data    cpx_spc2_data_rdy_cx2  Out  sparc2  CPX data ready    cpx_spc3_data_cx2 144 0   Out  sparc3  CPX SPARC data    cpx_spc3_data_rdy_cx2  Out  sparc3  CPX data ready    cpx_spc4_data_cx2 144 0   Out  sparc4  CPX SPARC data    cpx_spc4_data_rdy_cx2 
149. ical quad word  A load instruction waiting in the miss  buffer can enter the pipeline after the critical quad word arrives from the DRAM   the critical 16 bytes will arrive first from the DRAM   In this case  the data is  bypassed  After all four quad words arrive  the fill instruction enters the pipeline  and fills the cache  and the fill buffer entry gets invalidated      When data comes back in the FB  the instruction in the MB gets readied for reissue  and the cache line gets written into the data array  These two events are independent  and can happen in any order     For a non allocating read  for example  an I O read   the data gets drained from the  fill buffer directly to the I O interface when the data arrives  and the fill buffer entry  gets invalidated   When the FB is full  the miss buffer cannot make requests to the  DRAM     The fill buffer is divided into a RAM portion  which stores the data returned from  the DRAM waiting for a fill to the cache  and a CAM portion  which contains the  address  The fill buffer has a read interface with the DRAM controller     4 8 OpenSPARC T1 Microarchitecture Specification      August 2006    4 1 2 10  Writeback Buffer    The writeback buffer  WBB  is an eight entry buffer used to store the 64 byte evicted  dirty data line from the L2 cache  The replacement algorithm picks a line for eviction  on a miss  The evicted lines are streamed out to the DRAM opportunistically  An  instruction whose cache line address matches the addr
150. ier  FPM      multiplies         Floating point divider  FPD      divides        One instruction per cycle may be issued from the FPU input FIFO queue to one of  the three execution pipelines         One instruction per cycle may complete and exit the FPU         Support for all IEEE 754 floating point data types  normalized  denormalized   NaN  zero  infinity   A denormalized operand or result will never generate an  unfinished_FPop trap to the software  The hardware provides full support for  denormalized operands and results         IEEE non standard mode  FSR ns  is ignored by the FPU         The following instruction types are fully pipe lined and have a fixed latency   independent of operand values     add  subtract  compare  convert between  floating point formats  convert floating point to integer  convert integer to  floating point         The following instruction types are not fully pipe lined     multiply  fixed latency   independent of operand values   divide  variable latency  dependent on operand  values          Divide instructions execute in a dedicated datapath and are non blocking         Underflow tininess is detected before rounding  Loss of accuracy is detected  when the delivered result value differs from what would have been computed  were both the exponent range and precision unbounded  inexact condition          A precise exception model is maintained  The OpenSPARC T1 implementation  does not require early exception detection prediction  A given thr
151. ignore  user mode events     If the PCR ST bit is set to 1 and HPSTATE ENB is also set to 1  it counts events in  supervisor mode  Otherwise  it will ignore supervisor mode events     If the PCR ST bit is set to 1 and HPSTATE ENB is also set to 0  it counts events in  hypervisor mode  Otherwise  it will ignore hypervisor mode events     If the PCR PRIV bit is set to 1  it prevents user code access to the PIC counter   Otherwise  it allows the user code to access the PIC counter     The PIC H bits form the instruction counter  Trapped or canceled instructions will  not be counted  The Tcc instructions will be counted even if some other trap is taken  on them     PIC    PCR    Upper 32 bits  Inst  Counter  Lower 32 bits  Event Counter  H  L       OVFH OVFL     SL      UT  ST PRIV    63  10  9  8  7  6  4  3  2  1  0    SL  EVENT     b000     b001     b010     b011     b100     b101     b110     b111    Store buffer full    FP instruction count    Icache miss    Dcache_miss    ITLB miss    DTLB miss    L2 Imiss    L2 Dmiss    2 68 OpenSPARC T1 Microarchitecture Specification      August 2006    The PIC L bits form the event counter  The TLU includes only the counter control  logic  while the other functional units in the SPARC core provide the logic to signal  any event  An event counter overflow will generate a disrupting trap  while a  performance counter overflow will generate a disrupting but precise trap  of a type  level_15 interrupt  on the next following instruction and 
152. in basis  There is a simple non restoring divider  which allows for one  divide outstanding per SPARC core  Thread issuing a MUL DIV will be rolled back  and switched out if another thread is occupying the MUL DIV units     1 3 1 3  Load Store Unit    The data cache complex has an 8 Kbyte data  4 way  16 byte line size  It also has  single ported data tag  There is a dual ported  1R 1W  valid bit array to hold cache  line state of valid or invalid  Invalidates access the V bit array but not the data tag  A    Chapter 1  OpenSPARC T1 Overview 1 7    pseudo random replacement algorithm is used to replace the data cache line  The  loads are allocating  and the stores are non allocating  The data TLB operates  similarly to the instruction TLB     The load store unit  LSU  has an 8 entry store buffer per thread  which is unified  into a single 32 entry array  with RAW bypassing  Only a single load per thread  outstanding is allowed  Duplicate requests for the same line are not sent to the L2   cache  The LSU has interface logic to interface to the CPU cache crossbar  CCX   This  interface performs the following operations         Prioritizes the requests to the crossbar for floating point operation  Fpops    streaming operations  I  and D  misses  stores and interrupts  and so on         Request priority  imiss gt ldmiss gt stores  fpu stream interrupt          Assembles packets for the processor cache crossbar  PCX      The LSU handles returns from the CPX crossbar and maintains 
153. induced the trap remain unexecuted     2  Deferred trap    A deferred trap is induced by a particular instruction  However  the trap may  occur after the program visible state has been changed by the execution of either  the trap inducing instruction itself  or one or more other instructions     If an instruction induces a deferred trap  and a precise trap occurs simultaneously   the deferred trap may not be deferred past the precise trap     2 54 OpenSPARC T1 Microarchitecture Specification      August 2006    3  Disrupting trap    A disrupting trap is caused by a condition  for example  an interrupt   rather than  directly caused by a particular instruction  When a disrupting trap has been  serviced  the program execution resumes where it left off  A reset type of trap  resumes execution at the unique reset address and it is not a disrupting trap     Disrupting traps are controlled by a combination of the processor interrupt level   PIL  and the interrupt enable  IE  bit field of the processor state register   PSTATE   A disrupting trap condition is ignored when the interrupts are disabled   PSTATE IE   0  or the condition   s interrupt level is lower than that specified in  the PIL     A disrupting trap may be due to either an interrupt request not directly related to  a previously executed instruction  or to an exception related to a previously  executed instruction  Interrupt requests may be either internal or external  and  can be induced by the assertion of a signal no
154. init   sync pulses for clock  domain crossing  and built in self test  BIST  signals for blocks with memory  BIST         For debugging purposes  the CTU receives a trigger signal from the cluster         The CTU and PADS themselves are clock and reset recipients     FIGURE  10 1 displays a high level block diagram of the CTU clock and reset signals    and CTU sub blocks     FIGURE  10 1 Clock and Reset Functional Block Diagram    J_CLK 1 0     TRST_l    PWRON_RST_L    PLL_CHAR_IN    J_RST_L    clk    cken    rst init    sync    bist    dll    trigger    clsp    CTU  PAD CTU    PAD MISC    PAD JBUSR    Clusters  OpenSPARC T1    dft    PLL    RNG    TSR    Chapter 10  Clocks and Resets 10 3    10 1 1 1  Phase Locked Loop    The phase locked loop  PLL  has two modes of operation     PLL bypass and PLL  locked mode         Bypass mode     in this mode  the clk_out  clock output  follows J_CLK  VCO and  divider are set to don   t care         PLL locked mode     clk_out is OFF when ARST_L is asserted  the voltage control  oscillator  VCO  ramps up at an ARST_L deassertion  the divider is free running   and the feedback is matched to the clock tree output     FIGURE  10 2 shows the PLL block diagram including the VCO and the feedback path     FIGURE  10 2 PLL Functional Block Diagram    J_CLK 1 0     BW_PLL    PLL_CHAR_IN  CLKOBS 1 0     CTU    obs    VCO    jdup_div    pll_raw_clk_out    pll_clk_out_1    pll_clk_out    jbus_gclk_dup_out  jbus_gclk_dup    pll_bypass  pll_arst_1   
155. instruction encountering an ECC error during the  floating point register file access         Instruction s  following a load hits the store buffer and the level 1 D cache  where  the data has not been bypassed from the store buffer to the level 1 D cache         Encountering D cache parity errors         Launching an idle or resume interrupt where the machine states must be restored         An interrupt has been scheduled but not yet taken     2 3 14  Instruction Decode    The IFU decodes the SPARC V9 instructions  and the floating point frontend unit   FFU  decodes the floating point instructions  Unimplemented floating point  instructions will cause an fp_exception_other trap with a FSR ftt 3   unimplemented_FPop   These operations will be emulated by the software     The privilege is checked in D stage of the SPARC core pipeline  Some instructions  can only be executed with hypervisor privilege or with supervisor privilege     The branch condition is also evaluated in the D stage  and the decision for annulling  a delay slot is made in this stage as well     2 3 15  Instruction Fetch Unit Interrupt Handling    All interrupts are delivered to the instruction fetch unit  IFU   For each received  interrupt  the IFU shall check the bit   s pstate ie  the interrupt enable bit in the  processor state register  and hpstate  the hypervisor state  before scheduling the  interrupt  All interrupts will be prioritized  refer to the Programmer   s Reference  Manual for these priority 
156. instructions     Chapter  5 describes the processor   s input output bridge  IOB      Chapter  6 gives a functional description of the J Bus interface  JBI  block     Chapter  7 provides a functional description of the floating point unit  FPU      Chapter  8 describes the dynamic random access memory  DRAM  controller     Chapter  9 provides a detailed overview of the processor   s error handling    mechanisms     Chapter  10 gives a functional description of the processor   s clock and test unit     CTU      xx OpenSPARC T1 Microarchitecture Specification      August 2006    Using UNIX Commands    This document might not contain information about basic UNIX   commands and  procedures such as shutting down the system  booting the system  and configuring  devices  Refer to the following for this information         Software documentation that you received with your system        Solaris    Operating System documentation  which is at     http   docs sun com    Shell Prompts    Typographic Conventions    Shell  Prompt    C shell  machine name     C shell superuser  machine name     Bourne shell and Korn shell       Bourne shell and Korn shell superuser       Typeface1    1 The settings on your browser might differ from these settings     Meaning  Examples    AaBbCc123  The names of commands  files   and directories  on screen  computer output    Edit your login file   Use ls  a to list all files     You have mail     AaBbCc123  What you type  when contrasted  with on screen com
157. ions ensure that the store buffer of a thread has been drained  before the thread gets switched back in  The completion of draining the store buffer  implies that all stores prior to the MEMBAR instruction have reached a global  visibility  in compliance with TSO ordering  Before a MEMBAR is released  it ensures  that all blk init and blk st instructions have also reached global visibility  This is  accomplished by making sure that st ack counter has been cleared     There are several flavors of MEMBAR instructions  The implementation for   storestore   loadstores  and  loadload is to make them behave like NOPs  The  implementation for  storeload   memissue  and  lookaside is to make them to  behave like  sync  membar  sync is fully implemented to help enforce the compliance  to TSO ordering     A parity error on a store to the DTLB will cause a deferred trap  It will be reported  on the follow up membar  sync  The trap PC in this case will point to the store  instruction encountering the parity error when storing to the DTLB  The deferred  trap will look like a precise trap to the system software because of the way the  hardware supports the recording of the precise trap PC     2 4 13  Core to Core Interrupt Support    A core to core interrupt is initiated by a write to the interrupt dispatch register  IINT_VEC_DIS ASI  in Trap Logic Unit  TLU   It will generate a request to LSU for  access to PCX  LSU only supports one outstanding interrupt request at any time     An interr
158. ions ou reexportations vers des pays sous embargo des Etats Unis  ou  vers des entites    gurant sur les listes d   exclusion d   exportation americaines  y compris  mais de maniere non exclusive  la liste de personnes qui  font objet d   un ordre de ne pas participer  d   une facon directe ou indirecte  aux exportations des produits ou des services qui sont regi par la  legislation americaine en matiere de controle des exportations et la liste de ressortissants speci   quement designes  sont rigoureusement  interdites     LA DOCUMENTATION EST FOURNIE  EN L   ETAT  ET TOUTES AUTRES CONDITIONS  DECLARATIONS ET GARANTIES EXPRESSES  OU TACITES SONT FORMELLEMENT EXCLUES  DANS LA MESURE AUTORISEE PAR LA LOI APPLICABLE  Y COMPRIS NOTAMMENT  TOUTE GARANTIE IMPLICITE RELATIVE A LA QUALITE MARCHANDE  A L   APTITUDE A UNE UTILISATION PARTICULIERE OU A  L   ABSENCE DE CONTREFACON     iii    Contents    Preface  xix    1   OpenSPARC T1 Overview  1   1    1 1  Introducing the OpenSPARC T1 Processor  1   1    1 2  Functional Description  1   2    1 3  OpenSPARC T1 Components  1   4    1 3 1  SPARC Core  1   4    1 3 1 1  Instruction Fetch Unit  1   6    1 3 1 2  Execution Unit  1   6    1 3 1 3  Load Store Unit  1   6    1 3 1 4  Floating Point Frontend Unit  1   7    1 3 1 5  Trap Logic Unit  1   7    1 3 1 6  Stream Processing Unit  1   8    1 3 2  CPU Cache Crossbar  1   8    1 3 3  Floating Point Unit  1   9    1 3 4  L2 Cache  1   10    1 3 5  DRAM Controller  1   11    1 3 6  I O Bri
159. is 32 bits  A partial stores is  executed as a read modify write operation  In the first step the cache line is read and  merged with the write data  It is then saved in the miss buffer  The cache line is  written into the scdata array in the second pass of the instruction through the pipe     4 1 4 4  Atomics    The L2 cache processes three types of atomic instructions     load store unsigned byte   LDSTUB   SWAP  and compare and swap  CAS   These instructions require two  passes down the L2 cache pipeline     4 14 OpenSPARC T1 Microarchitecture Specification      August 2006    LDSTUB SWAP    The instruction reads a byte from memory into a register  and then it writes 0xFF  into memory in a single  indivisible operation  The value in the register can then be  examined to see if it was already 0xFF  which means that another processor got there  first  If the value is 0x00  then this processor is in charge  This instruction is used to  make mutual exclusion locks  known as mutexes  that make sure only one processor  at a time can hold the lock  The lock is acquired through the LDSTUB and cleared by  storing 0x00 back to the memory     The first pass reads the addressed cache line and returns 128 bits of data to the  requesting CPU  It also merges it with unsigned byte swap data  This merged data  is written into the miss buffer     In the second pass of the instruction  the new data is stored in the scdata array  An  acknowledgement is sent to the issuing CPU and the invalida
160. ivilege Levels and Thread States    Split  Mode  Non Split  Mode    Red  Hypervisor  Supervisor  User  Privileged  User    HPSTATE enb  X  1  1  1  0  0    HPSTATE red  1  0  0  0  0  0    HPSTATE priv  1  1  0  0  X 1   0    PSTATE priv  1  X  1  0  1  0    2 62 OpenSPARC T1 Microarchitecture Specification      August 2006    2 10 9  Trap Modes Transition    FIGURE  2 34 illustrates the mode transitions among the different levels of traps     FIGURE  2 34 Trap Modes Transition     S    V     Tr    ap     T    L lt     2      S    V     Tra    p     2 lt      T    L     lt     5          H    V T    ra    p     or         S    V         T    r    a    p         T    L     lt     2              R    e    s    e    t         o    r         T    r    a    p         T    r    a    p         T    L     lt     5     R    es    et     or     T    ra    p     T    L gt      5          R    es    et         o    r     T    ra    p         T    L     gt          5         User    Hypervisor   HV Trap  or   Trap TL gt  2      Reset  or   Trap TL gt  5     Supervisor    RED State    Chapter 2  SPARC Core 2 63    2 10 10  Thread States Transition    A thread can be in any one of these four states     RED  reset  error  debug    supervisor  SV   hypervisor  HV   or user  The privilege level is very different in  each different states  FIGURE  2 35 illustrates the state transition of a thread     FIGURE  2 35 Thread State Transition    Supervisor    User    Hypervisor    RED State    1    1
161. k enable    ctu_scdata1_cmp_cken  Out  SCDATA1  Clock enable    ctu_scdata2_cmp_cken  Out  SCDATA2  Clock enable    ctu_scdata3_cmp_cken  Out  SCDATA3  Clock enable    ctu_sctag0_cmp_cken  Out  SCTAG0  Clock enable    ctu_sctag0_mbisten  Out  SCTAG0  MBIST enable    ctu_sctag1_cmp_cken  Out  SCTAG1  Clock enable    ctu_sctag1_mbisten  Out  SCTAG1  MBIST enable    ctu_sctag2_cmp_cken  Out  SCTAG2  Clock enable    ctu_sctag2_mbisten  Out  SCTAG2  MBIST enable    ctu_sctag3_cmp_cken  Out  SCTAG3  Clock enable    ctu_sctag3_mbisten  Out  SCTAG3  MBIST enable    ctu_spc0_cmp_cken  Out  SPARC0  Clock enable    ctu_spc0_mbisten  Out  SPARC0  MBIST enable    ctu_spc0_sscan_se  Out  SPARC0  Shadow scan enable    ctu_spc0_tck  Out  SPARC0  Test clock    ctu_spc1_cmp_cken  Out  SPARC1  Clock enable    ctu_spc1_mbisten  Out  SPARC1  MBIST enable    ctu_spc1_sscan_se  Out  SPARC1  Shadow scan enable    ctu_spc1_tck  Out  SPARC1  Test clock    ctu_spc2_cmp_cken  Out  SPARC2  Clock enable    TABLE  10 2 CTU I O Signal List  Continued     Signal Name  I O  Source   Destination  Description    10 24 OpenSPARC T1 Microarchitecture Specification      August 2006    ctu_spc2_mbisten  Out  SPARC2  MBIST enable    ctu_spc2_sscan_se  Out  SPARC2  Shadow scan enable    ctu_spc2_tck  Out  SPARC2  Test clock    ctu_spc3_cmp_cken  Out  SPARC3  Clock enable    ctu_spc3_mbisten  Out  SPARC3  MBIST enable    ctu_spc3_sscan_se  Out  SPARC3  Shadow scan enable    ctu_spc3_tck  Out  SPARC3  Test clock    ctu
162. ld due to a result from that  pipeline not exiting the FPU     7 1 4  Floating Point Adder    The floating point adder  FPA  performs addition and subtraction on single and  double precision floating point numbers  conversions between floating point and  integer formats  and floating point compares     FPA characteristics include         The FPA execution datapath is implemented in four pipeline stages  A1  A2  A3   and A4          Certain integer conversions to floating point instructions require a second pass  through the final stage  see TABLE  7 3 for details          All FPA instructions are fixed latency  and independent of operand values         Follows a large exponent difference  LED  small exponent difference  SED   mantissa datapath organization         A post normalization incrementer is used for rounding  late round organization          NaN source propagation is supported by steering the appropriate NaN source  through the datapath to the result   Refer to the UltraSPARC Architecture 2005  Specification for more information      Chapter 7  Floating Point Unit 7 7    7 1 5  Floating Point Multiplier    Characteristics of the floating point multiplier  FPM  include         The FPM execution datapath is implemented in six pipeline stages  M1 through  M6    See TABLE  7 4 for details of these stages          A two pass  double pump  implementation is used for all multiply instructions   single and double precision   which produces a latency of seven cycles and a  th
163. lidity of its line         Evictions from the L2 cache     Both directories are cammed to invalidate any line  that is no longer resident in the L2 cache     The dcache directory is organized as sixteen panels with sixty four entries in each  panel  Each entry number is formed using the cpu ID  way number  and bit 8 from  the physical address  Each panel is organized in four rows and four columns  The  icache directory is organized similarly  For an eviction  all four rows are cammed     4 1 3  L2 Cache Pipeline    This section describes the L2 cache transaction types and the stages of the L2 cache  pipeline     4 1 3 1  L2 Cache Transaction Types    The L2 cache processes three main types of instructions         Requests from a CPU by way of the PCX        Requests from the I O by way of the JBI        Requests from the IOB by way of the PCX    The requests from a CPU include the following instructions     load  streaming load   Ifetch  prefetch  store  streaming store  block store  block init store  atomics   interrupt  and flush     The requests from the I O include the following instructions     block read  RD64    write invalidate  WRI   and partial line write  WR8      The requests from the I O buffer includes the following instructions     forward  request load and forward request store  these instructions are used for diagnostics    The test access port  TAP  device cannot talk to the L2 cache directly  The TAP    4 10 OpenSPARC T1 Microarchitecture Specification     
164. ller 8 5    FIGURE  8 3 displays the DIMM scheduler state diagram  The DIMM scheduler has    three main states     wait  CAS pick  and RAS pick  Whenever a CAS or a RAS request  exists and timing is met  the scheduler goes into a CAS pick or a RAS pick state     FIGURE  8 3 DIMM Scheduler State Diagram    8 1 3  Programmable Features    The DRAM chips on the DIMMs contain a number of timing parameters that need to  be controlled  These chips are controlled by programming the CSRs in the DRAM  controller  For complete list of registers and bit definitions  refer to the UltraSPARC  T1 Supplement to UltraSPARC 2005 Architecture Specification document  Here is the list  of some of the programmable parameters         RAS address width        CAS address width        CAS latency  CL         Refresh frequency        Scrub frequency        RAS to RAS delay to different bank  Trrd         RAS to RAS delay to same bank  Trc         RAS to CAS delay  Trcd     Wait    RAS  Pick  CAS  Pick    RAS request  and timing met    CAS request and  timing met    CAS  request  and  timing met    8 6 OpenSPARC T1 Microarchitecture Specification      August 2006        Write to read delay        Read to write delay        Programmable data expect cycles    8 1 4  Errors    The DRAM controller error mechanism has the following characteristics         Error injection can be done through software programming        Error registers are accessible by way of the IOB interface        Error counter registers
165. low    The LSU includes an 8 Kbyte D cache  which is a part of the level 1 cache shared by  four threads  There is one store buffer  STB  per thread  Stores are total store  ordering  TSO  ordered  that is to say that no membar  sync is required after each  store operation in order to maintain the program order among the stores  Non TSO  compliant stores include     blk store and blk init  Bypass data are reported  asynchronously  and they are supported by the bypass queue     Load misses are kept in the load miss  LSM  queue  which is shared by other  opcodes such as atomics and prefetch  The LSM queue supports one outstanding  load miss per thread  Load misses with duplicated physical addresses  PA  will not  be sent to the level 2  L2  cache     Inbound packets from the CCX are queued and ordered for distribution to other  units through the data fill queue  DFQ      The DTLB is fully associative  and it is responsible for the address translations  All  CAM RAM translations are single cycle operations     The ASI operations are serialized through the LSU  They are sequenced through the  ASI queue to the destination units on the chip     FIGURE  2 14 illustrates the LSU data flow concept     E  Cache  TLB  setup    M  Cache Tag  TLB  read    W  stb lookup  traps  bypass    W2  pcx rcq gcn   and writeback    Chapter 2  SPARC Core 2 23    FIGURE  2 14 LSU Data Flow Concept    2 4 3  Level 1 Data Cache  D Cache     The 8 Kbyte level 1  L1  D cache is 4 way set associative  and 
166. lt Address Register  SFAR          IMMU Tag Access         DMMU Tag Access         IMMU Tag Target         DMMU Tag Target        ASI Accesses to Registers as Miss Handler Support         IMMU TSB Page Size 0         IMMU TSB Page Size 1         DMMU TSB Page Size 0         DMMU TSB Page Size 1         IMMU Context 0 TSB Page Size 0         IMMU Context 0 TSB Page Size 1         DMMU Context 0 TSB Page Size 0         DMMU Context 0 TSB Page Size 1         IMMU Context non 0 TSB Page Size 0         IMMU Context non 0 TSB Page Size 1    2 48 OpenSPARC T1 Microarchitecture Specification      August 2006         DMMU Context non 0 TSB Page Size 0         DMMU Context non 0 TSB Page Size 1         IMMU Context 0 Config         DMMU Context 0 Config         IMMU Context non 0 Config         DMMU Context non 0 Config    2 9 5  Specifics on TLB Write Access    A stxa to data in or data access causes a write operation that is asynchronous to the  pipeline flow  Write requests are originated from the four entry FIFO in the LSU  The  LSU passes the write request to the MMU  which forwards it to the ITLB or the  DTLB  A handshake from the target completes the write operation  which in turn  enables the four entry FIFO in the LSU to proceed with the next entry     Write access to the data in algorithmically places the translation table entry  TTE  in  the TLB  Writes occur to the least significant unused entry  In contrast  write access  to the data access places the TTE in the specified 
167. mat in the i2c and sends them to the  CPU        J Bus Mondo Interrupts         JBI sends mondo interrupt packet to the i2c         Accumulate packet interrupts sent to the target src data0 data1         If J_INT_BUSY target  CSR BUSY   0    i  Send ACK to the JBI    ii  Send target src data0 data1 to the c2i    iii  Stores source in J_INT_BUSY target   data0 1 in J_INT_DATA0 1 target   and set J_INT_BUSY target  BUSY    iv  Generates an CPX interrupt packet to the target using J_INT_VEC CSR and  send         If J_INT_BUSY target  CSR BUSY   1    i  Send NACK to the JBI    ii  Source will re issue the INTERRUPT on the J BUS    Chapter 5  Input Output Bridge 5 11        Mondo Interrupt Handling         Mondo interrupt CSRs     i  J_INT_VEC     specifies interrupt vector for the CPX Int in order to target  thread    ii  J_INT_BUSY  count 32      source and BUSY for each target thread    iii  J_INT_DATA0  count 32      mondo data 0 for each target thread    iv  J_INT_DATA1  count 32      mondo data 1 for each target thread    v  J_INT_ABUSY  J_INT_ADATA0  J_INT_ADATA1     aliases to J_INT_BUSY   J_INT_DATA0  J_INT_DATA1 for the current thread         The interrupt handler must clear the BUSY bit in J_INT_BUSY target  to allow  future mondo interrupts to that thread    5 1 7  IOB Miscellaneous Functionality        Launches one thread after reset         Sends resume interrupt to thread 0 in the lowest available core  the EFC sends  the available cores information          RSET_ST
168. mit of  storing two packets from each source     There can be only one entry for each request  even if a request contains two packets   Such requests occupy one valid entry in the checkerboard and two FIFO entries in  the data queue  A separate bit identifies a two packet request     The direction for the round robin selection depends on the direction bit  Round   robin selection is left to right  C0   C7  if the direction bit is high  or right to left  C7    C0  if the direction bit is low  The direction bit toggles every cycle     The direction bit is low for all arbiters at a reset  The direction bit toggles for all  arbiters during every cycle  This requirement is required to maintain the TSO  ordering for invalidates sent by an L2 cache bank     ARB0 picks the first valid entry from the last row of the checkerboard every cycle   ARB0 then sends an 8 bit signal to the multiplexer at the output of the FIFOs storing  the data  as show in FIGURE  3 9   The 8 bit signal is 1 hot  and the index of the high    bit is same as the index of the entry picked in the last row  If there are multiple valid  entries  ARB0 picks them in a round robin fashion  ARB0 decides the direction for  round robin based on the direction bit     3 5  CPX Internal Blocks Functional  Description    3 5 1  CPX Overview    The CPX contains eight identical arbiter modules     one for each destination  The  arbiters inside the CPX are identical to those inside PCX  so see Section  3 4 1     PCX    Overview
169. n is done by the sctag  data array          The sctag sends the request packet to the CPX  and the sctag is the only interface  the L2 cache has with the CPX     C8        A data packet is sent to the CPX  This stage corresponds with the CQ stage of the  CPX pipeline     Cache miss instructions are reissued from the miss buffer after the data returns from  the DRAM controller  These reissued instructions follow the preceding pipeline     4 12 OpenSPARC T1 Microarchitecture Specification      August 2006    4 1 4  L2 Cache Instruction Descriptions    The following instructions follow a skewed pipeline  They do not follow the simple  pipeline like the one described in Section  4 1 3     L2 Cache Pipeline    on page  4 9     4 1 4 1  Loads    A load instruction to the L2 cache is caused by any one of the following conditions         A miss in the L1 cache  the primary cache  by a load  prefetch  block load  or a  quad load instruction         A streaming load issued by the stream processing unit  SPU         A forward request read issued by the IOB    The output of the scdata array  returned by the load  is 16 bytes in size  This size is  same as the size of the L1 data cache line  An entry is created in the dcache directory   An icache directory entry is invalidated if it exists  An icache directory entry is  invalidated for L1 cache of every CPU in which it exists     From an L2 cache perspective  a block load is the same as eight load requests  A  quad load is same as four lo
170. n to page size based fields with individual  enables  The CTXT field also has its own enable in order to allow the flexibility in  implementation  The CAM portion of the fields are for comparison purposes  RAM  consists of the following field of bits  namely  physical address  PA  and attributes   The RAM portion of the fields are for read purposes  where a read could be caused  by a software read or a CAM based 1 hot read     FIGURE  2 27 illustrates the structure of the TLB     FIGURE  2 27 Translation Lookaside Buffer Structure    pid  0    63    r  va3 va2  pa  attributes  va1 va0 ctxt    r w  r w  cam    cam rd data    ram rd data  address    Chapter 2  SPARC Core 2 47    2 9 4  MMU ASI Operations    The types of regular MMU ASI operations are as follows         Writes         IMMU Data In         DMMU Data In         IMMU Data Access         DMMU Data Access        Reads         IMMU Data In         DMMU Data In         IMMU Tag Read         DMMU Tag Read        Demap         IMMU Demap Page         DMMU Demap Page         IMMU Demap Context         DMMU Demap Context         IMMU Demap All  cannot demap locked pages          DMMU Demap All  cannot demap locked pages         Soft Reset         IMMU Invalidate All  including locked pages          DMMU Invalidate All  including locked pages         Fault Related ASI Accesses to Registers         IMMU Synchronous Fault Status Register  SFSR          DMMU Synchronous Fault Status Register  SFSR          DMMU Synchronous Fau
171. nctional Description  10   1    10 1 1  OpenSPARC T1 Processor Clocks  10   1    10 1 1 1  Phase Locked Loop  10   3    10 1 1 2  Clock Dividers  10   4    10 1 1 3  Clock Domain Crossings  10   5    10 1 1 4  Clock Gating  10   7    10 1 1 5  Clock Stop  10   7    10 1 1 6  Clock Stretch  10   8    10 1 1 7  Clock n Step  10   8    10 1 1 8  Clock Signal Distribution  10   8    10 1 2  OpenSPARC T1 Processor Resets  10   10    10 1 2 1  Power On Reset  PWRON_RST_L   10   10    10 1 2 2  J Bus Reset  J_RST_L   10   11    10 1 2 3  Reset Sequence  10   11    10 1 2 4  Debug Initialization  10   15    10 2  I O Signal list  10   15    xii OpenSPARC T1 Microarchitecture Specification      August 2006    xiii    Figures    FIGURE  1 1  OpenSPARC T1 Processor Block Diagram  1   3    FIGURE  1 2  SPARC Core Pipeline  1   5    FIGURE  1 3  CCX Block Diagram  1   9    FIGURE  2 1  SPARC Core Block Diagram  2   2    FIGURE  2 2  Physical Location of Functional Units on an OpenSPARC T1 SPARC Core  2   3    FIGURE  2 3  Virtualization of Software Layers  2   4    FIGURE  2 4  SPARC Core Pipeline and Support Structures  2   7    FIGURE  2 5  Frontend of the SPARC Core Pipeline  2   8    FIGURE  2 6  I Cache Fill Path  2   10    FIGURE  2 7  I Cache Miss Path  2   12    FIGURE  2 8  IARF and IWRF File Structure  2   14    FIGURE  2 9  Basic Transition of Non Active States  2   16    FIGURE  2 10  Thread State Transition of an Active Thread  2   17    FIGURE  2 11  State Transition for a T
172. nected through a crossbar to an on chip  unified level 2 cache  L2 cache      The four on chip dynamic random access memory  DRAM  controllers directly  interface to the double data rate synchronous DRAM  DDR2 SDRAM   Additionally   there is an on chip J Bus controller that provides an interconnect between the  OpenSPARC T1 processor and the I O subsystem     1 2 OpenSPARC T1 Microarchitecture Specification      August 2006    1 2  Functional Description    The features of the OpenSPARC T1 processor include         8 SPARC V9 CPU cores  with 4 threads per core  for a total of 32 threads        132 Gbytes sec crossbar interconnect for on chip communication        16 Kbytes of primary  Level 1  instruction cache per CPU core        8 Kbytes of primary  Level 1  data cache per CPU core        3 Mbytes of secondary  Level 2  cache     4 way banked  12 way associative shared  by all CPU cores        4 DDR II DRAM controllers     144 bit interface per channel  25 GBytes sec peak  total bandwidth        IEEE 754 compliant floating point unit  FPU   shared by all CPU cores        External interfaces          J Bus interface  JBI  for I O     2 56 Gbytes sec peak bandwidth  128 bit  multiplexed address data bus         Serial system interface  SSI  for boot PROM    FIGURE  1 1 shows a block diagram of the OpenSPARC T1 processor illustrating the    various interfaces and integrated components of the chip     Chapter 1  OpenSPARC T1 Overview 1 3    FIGURE  1 1 OpenSPARC T1 Processor Blo
173. non existent module  AID 2     5 2  I O Bridge Signal List    TABLE  5 10 describes the I O Signals for OpenSPARC T1 processor   s IOB     TABLE  5 10 I O Bridge I O Signal List    Signal Name  I O  Source Destination  Description    clk_iob_cmp_cken  In  CTU    clk_iob_data 3 0   In  CTU    clk_iob_jbus_cken  In  CTU    clk_iob_stall  In  CTU    clk_iob_vld  In  CTU    clspine_iob_resetstat 3 0   In    clspine_iob_resetstat_wr  In    clspine_jbus_rx_sync  In  RX synchronous    clspine_jbus_tx_sync  In  TX synchronous    cmp_adbginit_l  In  CTU  Asynchronous reset    cmp_arst_l  In  CTU  Asynchronous reset    cmp_gclk  In  CTU  Clock    cmp_gdbginit_l  In  CTU  Synchronous reset    Chapter 5  Input Output Bridge 5 13    cmp_grst_l  In  CTU  Synchronous reset    cpx_iob_grant_cx2 7 0   In  CCX CPX  CPX grant    ctu_iob_wake_thr  In  CTU    ctu_tst_macrotest  In  CTU    ctu_tst_pre_grst_l  In  CTU    ctu_tst_scan_disable  In  CTU    ctu_tst_scanmode  In  CTU    ctu_tst_short_chain  In  CTU    dbg_en_01  In    dbg_en_23  In    dram02_iob_data 3 0   In  DRAM  UCB data    dram02_iob_stall  In  DRAM  UCB stall    dram02_iob_vld  In  DRAM  UCB valid    dram13_iob_data 3 0   In  DRAM  UCB data    dram13_iob_stall  In  DRAM  UCB stall    dram13_iob_vld  In  DRAM  UCB valid    efc_iob_coreavail_dshift  In  EFC    efc_iob_fuse_data  In  EFC    efc_iob_fusestat_dshift  In  EFC    efc_iob_sernum0_dshift  In  EFC    efc_iob_sernum1_dshift  In  EFC    efc_iob_sernum2_dshift  In  EFC    glob
174. ns to the CPU as these are all independent accesses to different  addresses  Therefore  when a later read gets replayed from the MB down the pipe  and invalidates its slot in the MB  a new request from the pipe will take its slot in the  MB  even while an older read has not yet returned data from the DRAM     In most cases  when a data return happens  the replayed load from the MB makes it  through the pipe before the fill request can  Therefore  the valid bit of the MB entry  gets cleared  after the replayed MB instruction execution is complete in the pipe   before the fill buffer valid bit  However  if there are other prior MB instructions  like  partial stores that get picked instead of the MB instruction of concern  the fill request  can enter the pipe before the MB instruction  In these cases  the valid bit in the fill  buffer gets cleared prior to the MB valid bit  Therefore  the MB valid bit and FB valid  bits always get set in the order of MB valid bit first  and FB valid bit second   These  bits can get cleared in any order  however      4 1 2 9  Fill Buffer    The fill buffer  FB  contains a cache line wide entry to the stage data from the DRAM  before it fills the cache  Addresses are also stored for maintaining the age ordering in  order to satisfy coherency conditions     The fill buffer is an 8 entry buffer used to temporarily store data arriving from the  DRAM on an L2 cache miss request  Data arrives from the DRAM in four 16 byte  blocks starting with the crit
175. nsaction  The boot prom is a part of the I O address space  All  instruction fetches from the I O space are non cacheable  The boot PROM fetches  only one 4 byte instruction at a time  This 4 byte instruction is replicated four times  during the formation of the CPX packet  Only one CPX packet of non cacheable  instructions will be forwarded to the IFQ  The non cacheable instructions fetched  from the boot PROM will not be filled in the I cache  They will be sent to  or   bypassed to  the thread instruction register  TIR  directly     2 3 6  Alternate Space Identifier Accesses  I Cache Line  Invalidations  and Built In Self Test Accesses to  the I Cache    Alternate space identifiers  ASI  accesses to the I cache  and the built in self test   BIST  accesses to the I cache  go through the IFQ data path to the I cache  All ASI  accesses and BIST accesses will cause the SPARC core pipeline to stall  so these  accesses are serviced almost immediately     The load store unit  LSU  initiates all ASI accesses  The LSU serializes all ASI  accesses so that the second access will not be launched until the first access has been  acknowledged  ASI accesses tend to be slow  and data for an ASI read will be sent  back later     A BIST operation requires atomicity  and it assumes and accommodates no  interruptions until it completes     Level 2 cache invalidations will always undergo a CPU ID check in order to ensure  that this invalidation packet is indeed meant for the specified SPARC cor
176. o control register  and the thread returns to normal processing  The  MAU unit initiates streaming load store operations to the L2 cache through the  crossbar  and compute operations to the multiplier  Completion of the MAU can be  checked by polling or issuing an interrupt     1 3 2  CPU Cache Crossbar    The eight SPARC cores  the four L2 cache banks  the I O Bridge  and the FPU all  interface with the crossbar  FIGURE  1 3 displays the crossbar block diagram  The CPU     cache crossbar  CCX  features include         Each requester queues up to two packets per destination         Three stage pipeline     request  arbitrate  and transmit         Centralized arbitration with oldest requester getting priority         Core to cache bus optimized for address plus doubleword store         Cache to core bus optimized for 16 byte line fill  32 byte I  line fill delivered in  two back to back clocks     Chapter 1  OpenSPARC T1 Overview 1 9    FIGURE  1 3 CCX Block Diagram    1 3 3  Floating Point Unit    A single floating point unit  FPU  is shared by all eight SPARC cores  The shared  floating point unit is sufficient for most commercial applications in which typically  less than one percent of the instructions are floating point operations     L2 Bank 0  L2 Bank 3  FPU  CRI    Core 1  Core 0  Core 7    FPU CRI    Bank 1    C0  C1  C7    Bank 0    Core o    Core 1    Core 7    Bank 0  FPU  CRI    Core to L2 Cache  Shared FPU CRI    L2 Cache  FPU  CRI to Core    1 10 OpenSPARC T1 Mi
177. o the DRAM on misses in the L2 cache  A writeback gets issued 64 bits at  a time to the DRAM controller  A fill happens 128 bits at a time from the DRAM  controller to the L2 cache     The L2 cache interfaces with the J Bus interface  JBI  by way of the snoop input  queue and the RDMA write buffer     Each L2 cache bank consists of these three main sub blocks         sctag  secondary cache tag  contains the tag array  VUAD array  L2 cache  directory  and the cache controller        scbuf contains write back buffer  WBB   fill buffer  FB  and DMA buffer        scdata contains the scdata array    FIGURE  4 1 shows the various L2 cache blocks and their interfaces  The following    paragraphs provide additional details about each functional block     Chapter 4  Level 2 Cache 4 3    FIGURE  4 1 Flow Diagram and Interfaces for an L2 Cache Bank    L2Data    L2 Tag VUAD    ARB    Dir    OQ 16Q   IQ 16Q     From PCX  To PCX    MB 16L     FB 8L     WB 8L     Rdma WB 8L     CnplQ 16Q   32b    32b    36b   control    128b   control    64b    control    Jbi i f  Dram i f    4 4 OpenSPARC T1 Microarchitecture Specification      August 2006    4 1 2 1  Arbiter    The arbiter  ARB  manages the access to the L2 cache pipeline from the various  sources that request access  The arbiter gets inputs from the following         Instructions from the CCX  and from the bypass path for input queue  IQ         DMA instructions from the snoop input queue  which is the RDMA input queue  interface with th
178. ogram counters  NPC  of all live instructions executed on the OpenSPARC T1  processor  For every SPARC core clock cycle  two instructions are fetched for every  instruction issued  This two fetches per one issue relationship is intended to reduce  the I cache access in order to allow the opportunistic I cache line fill  Each thread is  allowed to have one outstanding I cache miss  and the SPARC core allows a total of  four I cache misses  Duplicated I cache misses do not induce the redundant fill  request to the level 2 cache  L2 cache      I Cache  4 x  TIR   NIR    4 x  PC    ITLB    IFQ    From  LSU  To  LSU    br pc trap pc    MIL    DEC    Schedule    Chapter 2  SPARC Core 2 9    2 3 3  Instruction Registers and Program Counter  Registers    In the instruction buffer  there are two instruction registers per thread     the thread  instruction register  TIR  and the next instruction register  NIR   The TIR contains  the current thread instruction in the thread selection stage  S stage   and the NIR  contains the next instruction  An I cache miss fill bypasses the I cache and writes  directly to the TIR  but it never writes to the NIR     The thread scheduler selects a valid instruction from the TIR  After selecting the  instruction  the valid instruction will be moved from the NIR to the TIR  If no valid  instruction exists in the TIR  a no operation  NOP  instruction will be inserted     There is one program counter  PC  register per thread  The next program counter   NPC
179. ole of MMU in Virtualization  2   44    2 9 2  Data Flow in MMU  2   45    2 9 3  Structure of Translation Lookaside Buffer  2   45    2 9 4  MMU ASI Operations  2   47    2 9 5  Specifics on TLB Write Access  2   48    2 9 6  Specifics on TLB Read Access  2   48    2 9 7  Translation Lookaside Buffer Demap  2   48    2 9 8  TLB Auto Demap Specifics  2   49    2 9 9  TLB Entry Replacement Algorithm  2   49    2 9 10  TSB Pointer Construction  2   49    2 10  Trap Logic Unit  2   50    2 10 1  Architecture Registers in the Trap Logic Unit  2   52    2 10 2  Trap Types  2   53    2 10 3  Trap Flow  2   55    2 10 4  Trap Program Counter Construction  2   57    2 10 5  Interrupts  2   57    2 10 6  Interrupt Flow  2   58    2 10 7  Interrupt Behavior and Interrupt Masking  2   61    2 10 8  Privilege Levels and States of a Thread  2   61    2 10 9  Trap Modes Transition  2   62    2 10 10 Thread States Transition  2   63    2 10 11 Content Construction for Processor State Registers  2   64    2 10 12 Trap Stack  2   65    2 10 13 Trap  Tcc  Instructions  2   66    2 10 14 Trap Level 0 Trap for Hypervisor  2   66    Contents  vii    2 10 15 Performance Control Register and Performance Instrumentation  Counter  2   66    3   CPU Cache Crossbar  3   1    3 1  Functional Description  3   1    3 1 1  CPU Cache Crossbar Overview  3   1    3 1 2  CCX Packet Delivery  3   2    3 1 3  Processor Cache Crossbar Packet Delivery  3   3    3 1 4  Cache Processor Crossbar Packet Delivery  3   
180. ons     reads  writes   and demap  All TLB entries are shared among the threads  and the consistency  among the TLB entries is maintained through auto demap  The MMU is responsible  for generating the pointers to the software translation storage buffers  TSB   and it  also maintains the fault status for the various traps     The access to the MMU is through the hypervisor managed ASI operations such as  ldxa and stxa  These ASI operations can be asynchronous or in pipe  depending on  the latency requirements  Those asynchronous ASI reads and writes will be queued  up in LSU  Some of the ASI operations can be updated through faults or by a data  access exception  Fault data for the status registers will be sent by trap logic unit   TLU  and the load and store unit  LSU      2 9 3  Structure of Translation Lookaside Buffer    The translation lookaside buffer  TLB  consists of content addressable memory   CAM  and randomly addressable memory  RAM   CAM has one compare port and  one read write port  1C1RW   and RAM has one read write port  1RW   The TLB  supports the following mutually exclusive events     1  CAM    2  Read    3  Write    4  Bypass    5  Demap    2 46 OpenSPARC T1 Microarchitecture Specification      August 2006    6  Soft reset    7  Hard reset    CAM consists of the following field of bits     partition ID  PID   real  identifies a RA   to PA translation or a VA to PA translation   context ID  CTXT   and virtual address   VA   The VA field is further broken dow
181. order    iii  The gap between clusters is defined by CREG_CLK_CTL STP_DLY    iv  The default gap is 128 chip level multiprocessor  CMP  clocks    v  The gap for the D and J domain clock enables is subject to Tx_sync     3  Turn off clock trees    a  The C and D domain trees are stopped at the divider    b  The J and dup trees are never turned off    i  The J div may be turned off  but then the J tree is fed from j dup    4  Establish PLL output    a  The PLL locking is sequenced by a simple SM on raw clocks    b  For a cold reset  the sequence is shown as     i  PLL bypass mode     reset count   128  lock count   16    ii  PLL lock mode     reset count   128  reset   lock   32000  for a cold reset     c  For a frequency change reset  a similar sequence is used    d  For other warm resets  a fake sequence is used  where the PLL reset is not  asserted and counters are shorter    5  Turn on clock trees    a  The C  D  and J domain dividers start in sync with J dup  and the result is a  common rising  AKA coincident  edge   For cycle deterministic operation   tester diagnostics tests must keep track of coincident edges      b  If the JBUS_GCLK was running from J dup  it switches to J div  in PLL bypass  mode  JBUS_GCLK is not the same frequency as J_CLK     6  Turn on clock enables    a  The cluster clock enables are turned on in a staggered way    b  The starting cluster is 0  for sparc0   and the enables progress in a CREG bit  order    c  There is a gap of 129 CMP clocks betwe
182. ory Coherency and Instruction Ordering  4   17    4 2  L2 Cache I O LIST  4   18    5   Input Output Bridge  5   1    5 1  Functional Description  5   1    5 1 1  IOB Interfaces  5   2    5 1 2  UCB Interface  5   4    5 1 2 1  UCB Request and Acknowledge Packets  5   4    5 1 2 2  UCB Interrupt Packet  5   6    5 1 2 3  UCB Interface Packet Example  5   6    5 1 3  IOB Address Map  5   7    Contents  ix    5 1 4  IOB Block Diagram  5   8    5 1 5  IOB Transactions  5   9    5 1 6  IOB Interrupts  5   10    5 1 7  IOB Miscellaneous Functionality  5   11    5 1 8  IOB Errors  5   11    5 1 9  Debug Ports  5   12    5 2  I O Bridge Signal List  5   12    6   J Bus Interface  6   1    6 1  Functional Description  6   1    6 1 1  J Bus Requests to the L2 Cache  6   3    6 1 1 1  Write Requests to the L2 Cache  6   3    6 1 1 2  Read Requests to the L2 Cache  6   4    6 1 1 3  Flow Control  6   4    6 1 2  I O Buffer Requests to the J Bus  6   4    6 1 3  J Bus Interrupt Requests to the IOB  6   5    6 1 4  J Bus Interface Details  6   5    6 1 5  Debug Port to the J Bus  6   6    6 1 6  J Bus Internal Arbitration  6   6    6 1 7  Error Handling in JBI  6   7    6 1 8  Performance Counters  6   7    6 2  I O Signal list  6   8    7   Floating Point Unit  7   1    7 1  Functional Description  7   1    7 1 1  Floating Point Instructions  7   4    7 1 2  FPU Input FIFO Queue  7   5    7 1 3  FPU Output Arbitration  7   6    7 1 4  Floating Point Adder  7   6    x OpenSPARC T1 Microar
183. otected with a 2 bit  parity for each 64 bit word     MA memory requires software initialization prior to the start of MA memory  operations  Three MA_LD operations are required to initialize all 160 words of  memory because the MA_CTL length field allows up to 64 words to be loaded into  MA memory     Write accesses to the MA memory can be on either the 16 byte boundary or the  8 byte boundary  Read accesses to the MA memory must be on the 8 byte boundary     EXU    MUL    IFU    TLU    LSU    SPU_MAMEM    SPU    CPX     bw_r_idct     SPU_MADP    SPU_CTL    Chapter 2  SPARC Core 2 41    2 8 4  Modular Arithmetic Operations    All modular arithmetic registers must be initialized prior to launching a modular  arithmetic operation  Modular arithmetic operations  MA ops  start with a stxa to  the MA_CTL register if the store buffer for that thread is empty  Otherwise  the  thread will wait until the store buffer is emptied before sending stx_ack to the LSU   An MA operation that is in progress can be aborted by another thread by way of a  stx to the MA_CTL register     An ldxa to MA registers are blocking  All except ldxa to the MA_Sync register will  respond immediately  An ldxa to the MA_Sync register will return a 0 to the  destination register upon the operation completion  The thread ID of this ldxa  should be equal to that stored in the thread ID field of the MA_CTL register   Otherwise  the SPU will respond immediately and send signals to the LSU to not  update the regist
184. parc_ifu v    ctu_sscan_tid 3 0   In  CTU  To IFU of sparc_ifu v    ctu_tst_mbist_enable  In  CTU  To test_stub of test_stub_bist v    efc_spc_fuse_clk1  In  EFC    efc_spc_fuse_clk2  In  EFC    efc_spc_ifuse_ashift  In  EFC    efc_spc_ifuse_dshift  In  EFC    efc_spc_ifuse_data  In  EFC    efc_spc_dfuse_ashift  In  EFC    efc_spc_dfuse_dshift  In  EFC    efc_spc_dfuse_data  In  EFC    ctu_tst_macrotest  In  CTU  To test_stub of test_stub_bist v    ctu_tst_scan_disable  In  CTU  To test_stub of test_stub_bist v    ctu_tst_short_chain  In  CTU  To test_stub of test_stub_bist v    global_shift_enable  In  CTU  To test_stub of test_stub_two_bist v    ctu_tst_scanmode  In  CTU  To test_stub of test_stub_two_bist v    spc_scanin0  In  DFT  Scan in    spc_scanin1  In  DFT  Scan in    2 6 OpenSPARC T1 Microarchitecture Specification      August 2006    2 3  Instruction Fetch Unit    The instruction fetch unit  IFU  is responsible for maintaining the program counters   PC  of different threads and fetching the corresponding instructions  The IFU also  manages the level 1 I cache  L1I  and the instruction translation lookaside buffer   ITLB   as well as managing and scheduling the four threads in a SPARC core  The  SPARC core pipeline resides in the IFU  which controls instruction issue and  instruction flow in the pipeline  The IFU decodes the instructions flowing through  the pipeline  schedules interrupts  and it implements the idle resume states of the  pipeline  The IFU also logs
185. penSPARC T1 processor clock and test unit  CTU  contains three main  components     clock generation and control  reset generation  and test  Because the  test functions are physical design dependent  they are not described in this  document  This chapter describes the OpenSPARC T1 processor   s clocks and resets     10 1 1  OpenSPARC T1 Processor Clocks    There are three clock domains in the OpenSPARC T1 processor     chip level  multiprocessor  CMP  in the CPU clusters  J Bus  and DRAM  Throughout this  chapter  these three clock domains are referred in this document as C for CMP  J for  J Bus  and D for DRAM         Only one phased locked loop  PLL  in the chip  which has a differential  J_CLK 1 0   is used as a reference clock for the PLL  This clock runs at 150 MHz at  power up  and then it is increased to 200 MHz  or any other target frequency  between 150 MHz to 200 MHz          Each domain  C  D  and J  has its own balanced clock distribution tree         Signals from the CTU are delivered to the cluster   s clock headers  The C clock  domain uses flop repeaters for clock distribution     10 2 OpenSPARC T1 Microarchitecture Specification      August 2006        The CTU has the following sub blocks     PLL  clock PLL   random number  generator  RNG   design For testability  DFT   clock spine  CLSP   the  temperature sensor  TSR          The CTU generates the following signals for each cluster     clock  clock enable   reset  synchronous and asynchronous   init  debug 
186. pter 6  J Bus Interface 6 11    jbi_iob_mondo_vld  Out  IOB  MONDO valid    jbi_iob_mondo_data 7 0   Out  IOB  MONDO data    jbi_io_ssi_mosi  Out  PADS  Master out slave in to pad     jbi_io_ssi_sck  Out  PADS  Serial clock to pad     jbi_iob_spi_vld  Out  IOB  Valid packet from UCB     jbi_iob_spi_data 3 0   Out  IOB  Packet data from UCB     jbi_iob_spi_stall  Out  IOB  Flow control to stop data     jbi_io_j_req0_out_l  Out  PADS  J Bus request 0    jbi_io_j_req0_out_en  Out  PADS  J Bus request 0 enable    jbi_io_j_adtype 7 0   Out  PADS  J Bus type    jbi_io_j_adtype_en  Out  PADS  J Bus type enable    jbi_io_j_ad 127 0   Out  PADS  J Bus address data    jbi_io_j_ad_en 3 0   Out  PADS  J Bus address data enable    jbi_io_j_pack0 2 0   Out  PADS  J Bus ACK  0    jbi_io_j_pack0_en  Out  PADS  J Bus ACK  0 enable    jbi_io_j_pack1 2 0   Out  PADS  J Bus ACK  1    jbi_io_j_pack1_en  Out  PADS  J Bus ACK  1 enable    jbi_io_j_adp 3 0   Out  PADS  J Bus address data Parity    jbi_io_j_adp_en  Out  PADS  J Bus address data parity enable    jbi_io_config_dtl 1 0   Out  PADS  J Bus I O DTL configuration    TABLE  6 1  JBI I O Signal List  Continued     Signal Name  I O  Source   Destination  Description    6 12 OpenSPARC T1 Microarchitecture Specification      August 2006    7 1    CHAPTER 7    Floating Point Unit    This chapter describes the following topics         Section  7 1     Functional Description    on page  7 1        Section  7 2     I O Signal list    on page  7 15  
187. puter output      su  Password     AaBbCc123  Book titles  new words or terms   words to be emphasized   Replace command line variables  with real names or values     Read Chapter 6 in the User   s Guide   These are called class options   You must be superuser to do this   To delete a file  type rm filename     Preface  xxi    Related Documentation    The documents listed as online or download are available at     http   www opensparc net     Documentation  Support  and Training    Application  Title  Part Number  Format  Location    OpenSPARC T1 instruction set  UltraSPARC   Architecture 2005  Specification    950 4895  PDF  Online    OpenSPARC T1 processor   s internal  registers    UltraSPARC T1 Supplement to the  UltraSPARC Architecture 2005    819 3404  PDF  Online    OpenSPARC T1 megacells  OpenSPARC T1 Processor Megacell  Specification    819 5016  PDF  Download    OpenSPARC T1 signal pin list  OpenSPARC T1 Processor Datasheet  819 5015  PDF  Download    OpenSPARC T1 microarchitecture  OpenSPARC T1 Microarchitecture  Specification    819 6650  PDF  Download    OpenSPARC T1 processor  J Bus and SSI interfaces    OpenSPARC T1 Processor External  Interface Specification    819 5014  PDF  Download    Sun Function  URL    OpenSPARC T1  http   www opensparc net     Documentation  http   www sun com documentation     Support  http   www sun com support     Training  http   www sun com training     xxii OpenSPARC T1 Microarchitecture Specification      August 2006    Third Par
188. r 2  SPARC Core 2 19    2 3 13  Rollback Mechanism    The rollback mechanism provides a way of recovering from a scheduling error  The  two reasons for performing a rollback include     1  All of the stall conditions  or switch conditions  were not known at the time of the  scheduling     2  The scheduling was done speculatively on purpose     For example  after issuing a load  the scheduler will speculate a level 1 D cache  hit performance reasons  If the speculation was incorrect  because of encountering  a load miss   all of the instructions after the speculative load instruction must be  rolled back  Otherwise  the performance gain would be a substantial     Rolled back instructions must be restarted from the S stage or F stage of the SPARC  core pipeline  FIGURE  2 12 illustrates the pipeline graph for the rollback mechanism     FIGURE  2 12 Rollback Mechanism Pipeline Graph    The three rollback cases include     1  E to S and D to F    2  D to S and S to F    3  W to F    The possible conditions causing a rollback case 1 or a case 2 include         Instruction s  following a load miss        Resource conflict due to long latency        Store buffer full        I cache instruction parity error        I fetch retry    F  S  D  E  M  W    2 20 OpenSPARC T1 Microarchitecture Specification      August 2006    The possible conditions causing rollback case 3 include         Encountering an ECC error during the instruction register file access         The floating point store 
189. rce   Destination  Description    Chapter 10  Clocks and Resets 10 21    ctu_ddr_testmode_l  Out  PADS  From ctu_dft of ctu_dft v    ctu_debug_clock_dr  Out  PADS  From ctu_dft of ctu_dft v    ctu_debug_hiz_l  Out  PADS  From ctu_dft of ctu_dft v    ctu_debug_mode_ctl  Out  PADS  From ctu_dft of ctu_dft v    ctu_debug_shift_dr  Out  PADS  From ctu_dft of ctu_dft v    ctu_debug_update_dr  Out  PADS  From ctu_dft of ctu_dft v    ctu_dll0_byp_l  Out  From ctu_clsp of ctu_clsp v    ctu_dll0_byp_val 4 0   Out  From ctu_clsp of ctu_clsp v    ctu_dll1_byp_l  Out  From ctu_clsp of ctu_clsp v    ctu_dll1_byp_val 4 0   Out  From ctu_clsp of ctu_clsp v    ctu_dll2_byp_l  Out  From ctu_clsp of ctu_clsp v    ctu_dll2_byp_val 4 0   Out  From ctu_clsp of ctu_clsp v    ctu_dll3_byp_l  Out  From ctu_clsp of ctu_clsp v    ctu_dll3_byp_val 4 0   Out  From ctu_clsp of ctu_clsp v    ctu_dram02_cmp_cken  Out  DRAM  From ctu_clsp of ctu_clsp v    ctu_dram02_dram_cken  Out  DRAM  From ctu_clsp of ctu_clsp v    ctu_dram02_jbus_cken  Out  DRAM  From ctu_clsp of ctu_clsp v    ctu_dram13_cmp_cken  Out  DRAM  From ctu_clsp of ctu_clsp v    ctu_dram13_dram_cken  Out  DRAM  From ctu_clsp of ctu_clsp v    ctu_dram13_jbus_cken  Out  DRAM  From ctu_clsp of ctu_clsp v    ctu_dram_selfrsh  Out  DRAM  From ctu_clsp of ctu_clsp v    ctu_efc_capturedr  Out  EFC  From ctu_dft of ctu_dft v    ctu_efc_coladdr 4 0   Out  EFC  From ctu_dft of ctu_dft v    ctu_efc_data_in  Out  EFC  From ctu_dft of ctu_dft v    ctu_efc_
190. rd request write issued by the IOB     The store instruction writes  in a granularity of  32 bits of data into the scdata array   An acknowledgment packet is sent to the CPU that issued the request  and an  invalidate packet is sent to all other CPUs  The icache directory entry for every CPU  is cammed and invalidated  The dcache directory entry of every CPU  except the  requesting CPU  is cammed and invalidated     A block store is the same as eight stores from an L2 cache perspective  A block init  store is same as a block store except for one difference     in the case of a miss for a  block init store  a dummy read request is issued to the DRAM controller  The DRAM  controller returns a line filled with all zeroes  Essentially  this line return saves  DRAM read bandwidth     The LSU treats every store as a total store order  TSO  store  The LSU waits for an  acknowledgement to arrive before processing the next store  However  block init  stores can be processed without waiting for acknowledgements     From the L2 cache s perspective  a streaming store is the same as a store     A forward request write stores 64 bits of data in the scdata  The icache and the  dcache directory entries are not cammed afterwards     The forward request write and the streaming store may stride a couple of words and  therefore may require partial stores     Partial stores  PST  perform sub 32 bit writes into the scdata array  As mentioned  earlier  the granularity of the writes into the scdata 
191. res  so the occurrence of such a trap is not synchronous with the SPARC  core pipeline operation  These traps are all precise traps in the OpenSPARC T1  processor  A trap bubble is identified in the W stage when there is no valid  instruction available  or the instruction there is taking a trap  Asynchronous traps  will be taken at the W stage when a trap bubble has been identified     Disrupting traps are associated with certain particular conditions  The TLU collects  them and forward them to the IFU  The IFU sends them down the pipeline as  interrupts instead of sending instructions down  A trap bubble is thus guaranteed at  the W stage  and the trap will be taken     FIGURE  2 29 illustrates the trap flow sequence     FIGURE  2 29 Trap Flow Sequence    D  E  M  W  W2    Reg  RD WR   DONE RETRY  Inst  from IFU  Alt  LD ST  Inst  from IFU  VA from EXU  Sync  Traps   Interrupts  from IFU  EXU   SPU and TLU  internal Traps  Synchronous  and Deferred  Traps from LSU  Resolve priority   save states in  Stack and send  TrapPC_vld  to IFU  Update States  and send  TrapPC to IFU    Async  Traps    2 56 OpenSPARC T1 Microarchitecture Specification      August 2006    All the traps from the IFU  EXU  SPU  LSU  and the TLU will be sorted through in  order to resolve the priority first  and also to determine the following     trap type   TTYPE  and trap vector  redirect PC   After these are resolved  the trap base address   TBA  will be selected to travel down the pipeline for further
192. ries are dedicated to the combined FPA FPM  and eight entries are  dedicated to FPD  The FPD has issue priority over FPA FPM         The eight FPD FIFO entries and the eight FPA FPM entries always issue in FIFO  order         The 155 bit FIFO entry format           154 150      5 bit ID  CPU and thread           149 148      2 bit round mode          147 146      2 bit fcc field          145 138      8 bit opcode          137 69      69 bit rs1  includes tag bits           68 0      69 bit rs2  includes tag bits     7 6 OpenSPARC T1 Microarchitecture Specification      August 2006    7 1 3  FPU Output Arbitration    The FPA  FPM  and FPD execution pipelines are arbitrated for the single FPU result  bus to the crossbar  Only one instruction may complete and exit the FPU per cycle   During this arbitration  the FPD pipeline has priority over the FPA and the FPM  pipelines  The FPA and FPM pipelines are prioritized in a round robin fashion     If an FPA or FPM execution pipeline is waiting for its result to exit the FPU  the  pipeline will stall at the final execution stage  If the final execution stage is not  occupied by a valid instruction  instructions within the pipeline will advance  and  the input FIFO queue may issue to the pipeline  If the final execution stage is  occupied by a valid instruction then each pipeline stage is held     The input FIFO queue will not advance if the instruction at the head of the FIFO  must issue to a pipeline  which at each stage has been he
193. ription    5 16 OpenSPARC T1 Microarchitecture Specification      August 2006    6 1    CHAPTER 6    J Bus Interface    This chapter contains the following topics about the J Bus interface  JBI  functional  block         Section  6 1     Functional Description    on page  6 1        Section  6 2     I O Signal list    on page  6 8    6 1  Functional Description    For a detailed description on the external J Bus interface  refer to OpenSPARC T1  Processor External Interface Specification  The OpenSPARC T1 J Bus interface  JBI   block generates J Bus transactions and responds to external J Bus transactions     The JBI block         Interfaces with following blocks in an OpenSPARC T1 processor          L2 cache  scbuf and sctag  to read and write data to L2 cache         I O Bridge  IOB    for programmed input output  PIO   interrupts  and debug  port         J Bus I O pads        Most of the JBI sub blocks use the J Bus clock  and remaining part runs at the  CPU core clock or cmp clk  The data transfer between the two clock domains is by  way of queues within the two clock domains  these are the Request header queues  and the Return data queues  The interface to the L2 cache is through the direct  memory access  DMA  reads and DMA writes         The IOB debug port data is stored in the debug FIFOs and then it is sent out to  the external J Bus         IOB PIO requests are stored in the PIO queue and the return data is stored in the  PIO return queue  Similarly  there is an int
194. rotection  and error correction of the  SPARC core     9 2 1  SPARC Core Error Registers    Every thread in the SPARC core has its own set of hyper privileged error registers   The error registers are described as         ASI_SPARC_ERROR_EN_REG          NCEEN  If set  it will enable uncorrectable error traps          CEEN  If set  it will enable correctable error traps          POR value is 0          Logging will occur even if error traps are disabled         ASI_SPARC_ERROR_STATUS_REG          Logs the errors that occur          Indicates if multiple errors occurred          Indicates if the error occurred at a privileged level          Not cleared on a hardware reset  so the software will need to do so          Never cleared by the hardware         ASI_SPARC_ERROR_ADDRESS_REG          Captures the address  syndrome  and so on as applicable         ASI_ERROR_INJECT_REG          Used for error injection          One per core  and shared by all threads          Can specify one error source  from among the TLBs and the register file           Error injection can be a single shot or multiple shots          Diagnostic writes can be used to inject errors into the L1 caches     9 4 OpenSPARC T1 Microarchitecture Specification      August 2006    9 2 2  SPARC Core Error Protection    All SRAMs  caches  TLBs  and so on in the SPARC core have error protection using  either parity or ECC  TABLE  9 1 shows the SPARC core memories and their error    protection types     9 2 3  SPARC Cor
195. roughput of one instruction every two cycles         All FPM instructions are fixed latency and are independent of the operand values         A post normalization incrementer is used for rounding  otherwise known as a late  round organization          NaN source propagation is supported by steering the appropriate NaN source  through the datapath to the result   Refer to the UltraSPARC Architecture 2005  Specification for more information      TABLE  7 3  FPA Datapath Stages    Stage  LED Action  SED Action    A1  Format input operands    Compare fractions    A2  Align smaller operand to larger  operand    Invert smaller operand if a logical   effective  subtraction is to be  performed    Invert smaller operand if a logical   effective  subtraction is to be  performed    Compute the intermediate result  A   B     A3  Compute the intermediate result  A   B   Leading zero detect    A4  Round  Normalize    FiTOs  FxTOs  FxTOd instructions only    A4  Round    7 8 OpenSPARC T1 Microarchitecture Specification      August 2006    7 1 6  Floating Point Divider    The floating point divider  FPD  has the following characteristics         The floating point divide  FDIV  instructions maximum execution latency is   32  single precision  SP   and 61 double precision  DP    Zero or denormalized  results have less latency          Normalized results always produce a fixed execution latency of 32 SP  61 DP         Denormalized results produce a variable execution latency of between 9 and 3
196. rrupt packet types     5 1 2 3  UCB Interface Packet Example    The UCB interface packet without payload has width of 64 bits  If the physical  interface is 8 bits  it will take 8 cycles  without a stall  to send the packet  The first  data sent  D0  is bits 7 to 0  the second data sent  D1  is bits 15 to 8  and so on     TABLE  5 7 shows the UCB no payload packet  64 bit  over an 8 bit interface without    stalls     TABLE  5 5  UCB Interrupt Packet Format    Bits  63 57  56 51  50 19  18 10  9 4  3 0    Description  Reserved  Vector  Reserved  Device ID  Thread ID  Packet Type    TABLE  5 6  UCB Interrupt Packet Types    Description  Packet Type Value  Binary   Comment    UCB_INT  1000    UCB_INT_VEC  1100  IOB Internal Use Only    UCB_RESET_VEC  1101  IOB Internal Use Only    UCB_IDLE_VEC  1110  IOB Internal Use Only    UCB_RESUME_VEC  1111  IOB Internal Use Only    TABLE  5 7  UCB No Payload Over an 8 Bit Interface Without Stalls    iob_ucb_vld  0  1  1  1  1  1  1  1  1  0    iob_ucb_data 7 0   X  D0  D1  D2  D3  D4  D5  D6  D7  X    ucb_iob_stall  0  0  0  0  0  0  0  0  0  0    Chapter 5  Input Output Bridge 5 7    TABLE  5 8 shows the UCB no payload packet  64 bit  over an 8 bit interface with    stalls     5 1 3  IOB Address Map    Refer to UltraSPARC T1 Supplement to UltraSPARC Architecture 2005 Specification for  descriptions of the detailed addresses of the registers and bit levels  TABLE  5 9    describes the high level IOB address map for the address block level
197. s active        The test mode is active    The input FIFO queue and output arbitration blocks receive free running clocks  This  eliminates potential timing issues  simplifies the design  and has only a small impact  on the overall FPU power savings     The FPU power management feature automatically powers up and powers down  each of the three FPU execution pipelines  based on the contents of the instruction  stream  Also  the pipelines are clocked only when required  For example  when no  divide instructions are executing  the FPD execution pipeline automatically powers  down  Power management is provided without affecting functionality or  performance  and it is transparent to the software     TABLE  7 5  FPD Datapath Stages    Stage  Action    D1  Format input operand rs1    D2  Leading zero detect for rs1  Format input operand rs2    D3  Pre normalize rs1  Leading zero detect for rs2    D4  Pre normalize rs2    D5  Quotient loop  if normalized result  run 55 cycles DP  26 cycles SP     D6  Determine sticky bit from remainder    D7  Round    7 10 OpenSPARC T1 Microarchitecture Specification      August 2006    7 1 8  Floating Point State Register Exceptions and Traps    The SPARC core FFU physically contains the architected floating point state register   FSR   The characteristics of the FSR  as well as exceptions and traps  include         The FFU provides FSR rd  IEEE rounding direction  to the FPU  IEEE non   standard mode  FSR ns  is ignored by the FPU  and thus is not
198. same address     The L2 cache directory maintains the cache coherency in all primary caches  The L2   cache directory preserves the inclusion property     all valid entries in the primary  cache should reside in the L2 cache as well  It also keeps the icache and the dcache  exclusive for each CPU     The read after write  RAW  dependency to the DRAM controller is resolved by  camming the write back buffer on a load miss     Mulitcast requests  for example  a flush request  are sent to the CPX only if all of the  receiving queues are available  This process is a requirement for maintaining the  total store order  TSO      4 18 OpenSPARC T1 Microarchitecture Specification      August 2006    4 2  L2 Cache I O LIST    The following tables describe the L2 cache I O signals     TABLE  4 1  SCDATA I O Signal List    Signal Name  I O    Source   Destination  Description    cmp_gclk 1 0   In  CTU  Clock    global_shift_enable  In  CTU  To data of bw_r_l2d v    si  In  DFT  Scan in    arst_l   In  CTU    cluster_cken  In  CTU    ctu_tst_pre_grst_l  In  CTU    ctu_tst_scanmode  In  CTU    ctu_tst_scan_disable  In  CTU    ctu_tst_macrotest  In  CTU    ctu_tst_short_chain  In  CTU    efc_scdata_fuse_ashift  In  EFC  To efuse_hdr of scdata_efuse_hdr v    efc_scdata_fuse_clk1  In  EFC  To efuse_hdr of scdata_efuse_hdr v  and so on     efc_scdata_fuse_clk2  In  EFC  To efuse_hdr of scdata_efuse_hdr v  and so on     efc_scdata_fuse_data  In  EFC  To efuse_hdr of scdata_efuse_hdr v    efc_scd
199. se http   www sun com patents et un ou les brevets suppl  mentaires ou les applications de brevet en attente aux Etats   Unis et dans les  autres pays     L   utilisation est soumise aux termes de la Licence     Cette distribution peut comprendre des composants d  velopp  s par des tierces parties     Sun  Sun Microsystems  le logo Sun  Solaris  OpenSPARC T1 et UltraSPARC sont des marques de fabrique ou des marques d  pos  es de Sun  Microsystems  Inc  aux Etats Unis et dans d   autres pays     Toutes les marques SPARC sont utilis  es sous licence et sont des marques de fabrique ou des marques d  pos  es de SPARC International  Inc   aux Etats Unis et dans d   autres pays  Les produits portant les marques SPARC sont bas  s sur une architecture d  velopp  e par Sun  Microsystems  Inc     UNIX est une marque d  pos  e aux Etats Unis et dans d   autres pays et licenci  e exlusivement par X Open Company  Ltd     Le logo Adobe  est une marque d  pos  e de Adobe Systems  Incorporated     Les produits qui font l   objet de ce manuel d   entretien et les informations qu   il contient sont regis par la legislation americaine en matiere de  controle des exportations et peuvent etre soumis au droit d   autres pays dans le domaine des exportations et importations  Les utilisations     nales  ou utilisateurs    naux  pour des armes nucleaires  des missiles  des armes biologiques et chimiques ou du nucleaire maritime   directement ou indirectement  sont strictement interdites  Les exportat
200. set the PCR OVFH or the  PCR OVFL bits and bit 15 of the SOFTINT_REG register     Software writes to the PCR that set one of the overflow bits  OVFH  OVFL  will also  cause a disrupting but precise trap on the instruction following the next  incrementing event     3 1    CHAPTER 3    CPU Cache Crossbar    This chapter contains the following topics         Section  3 1     Functional Description    on page  3 1        Section  3 2     CCX I O List    on page  3 9        Section  3 3     CCX Timing Diagrams    on page  3 13        Section  3 4     PCX Internal Blocks Functional Description    on page  3 17        Section  3 5     CPX Internal Blocks Functional Description    on page  3 20    3 1  Functional Description    3 1 1  CPU Cache Crossbar Overview    The CPU cache crossbar  CCX  manages the communication among the eight CPU  cores  the four L2 cache banks  the I O bridge  and the floating point unit  FPU    These functional units communicate with each by sending packets  and the CCX  arbitrates the packet delivery     Each SPARC CPU core can send a packet to any one of the L2 cache banks  the I O  bridge  or the FPU  Conversely  packets can also be sent in the reverse direction   where any of the four L2 cache banks  the I O bridge  or the FPU can send a packet  to any one of the eight CPU cores     FIGURE  3 1 shows that each of the eight SPARC CPU cores can communicate with    each of the four L2 cache banks  the I O bridge  and the FPU  The cache processor  crossbar
201. sing    Base Device  Part  Number of Banks  Bank Address  Row Address  Column Address    256 Mbyte  x4  4  BA 1 0   A 12 0   A 11  A 9 0     512 Mbyte  x4  4  BA 1 0   A 13 0   A 11  A 9 0     1 Gbyte  x4  8  BA 2 0   A 13 0   A 11  A 9 0     2 Gbyte  x4  8  BA 2 0   A 14 0   A 11  A 9 0     4 Gbyte  x4  8  BA 2 0   A 15 0   A 11  A 9 0     TABLE  8 2  Physical Address to DIMM Address Decoding    Total  Memory  Per  Channel    DIMM  Density    Type    DRAM  Component  Used  RANK    Stacked  DIMM    DIMM Bank  Address   BA     Row  Address  Column Address    1 Gbytes  512 Mbyte  unstacked    256 Mbit  PA 9 8   PA 31 19    PA 18 10  PA 5 4      2 Gbytes  with Rank    512 Mbyte  unstacked    256 Mbit  PA 32   PA 9 8   PA 31 19    PA 18 10  PA 5 4      2 Gbytes  1 Gbyte  stacked    256 Mbit  PA 32   PA 9 8   PA 31 19    PA 18 10  PA 5 4      4 Gbytes  with Rank    1 Gbyte  stacked    256 Mbit  PA 33   PA 32   PA 9 8   PA 31 19    PA 18 10  PA 5 4      4 Gbytes  2 Gbytes  unstacked    1 Gbit  PA 10 8   PA 33 20    PA 19 11  PA 5 4      8 8 OpenSPARC T1 Microarchitecture Specification      August 2006    8 1 7  DDR II Supported Features    The DRAM controller supports the following DDR II features         DIMMs with component sizes 256 Mbit to 2 Gbit are supported        Only x4 SDRAM parts are supported        DIMMs on one channel should have same timing parameters        Banks are always closed after a read or a write        Supports a burst length of 4        There is one fixed 
202. sly  however  dup is never stretched      The CREG_CLK_DLL STR_CONT bit defines if the clock stretch is in continuous or  in precise mode  In either mode  the CLK_STRETCH pin is the stretch trigger         In continuous mode  as long as the CLK_STRETCH pin is high  every third PLL  beat is skipped         In precise mode  a pulse on the CLK_STRETCH pin causes a single PLL beat to be  skipped          The exact PLL cycle depends on Tx  for example  J div          The CREG_CLK_DLL STR_DLY bit allows sweeping of the stretch cycle     10 1 1 7  Clock n Step    You can issue a specific number of clocks at speed  which can be used for an  automatic test pattern generation  ATPG  or a macro test capture cycle  Specifying  the number of clocks         Can lock the PLL with a cold reset sequence        Program n step by way of the TAP        Scan in a pattern  at TCK speed         Trigger the CTU to issue n capture clocks  at full speed         Scan out result  at TCK speed     10 1 1 8  Clock Signal Distribution    Clock signals distribution have the following characteristics         Clocks are generated in the PLL domain and are distributed as gclk         The C domain control signals are distributed through the flop repeaters         The repeaters on the gclk have an asynchronous reset         The D and J domain control signals are distributed point to point         A Cluster has one header per domain         The Cluster header does the clock gating  gclk   gt  rclk         sync an
203. ss as a previous miss or an entry in  the writeback buffer  instructions requiring multiple passes through the L2 cache  pipeline  atomics and partial stores   unallocated L2 cache misses  and accesses  causing tag ECC errors     The miss buffer is divided into a non tag portion which holds the store data  and a  tag portion which contains the address  The non tag portion of the buffer is a RAM  with 1 read and 1 write port  The tag portion is a CAM with 1 read  1 write  and 1  cam port     A read request is issued to the DRAM and the requesting instruction is replayed  when the critical quad word of data arrives from the DRAM     All entries in the miss buffer that share the same cache line address are linked in the  order of insertion in order to preserve the coherency  Instructions to the same  address are processed in age order  whereas instructions to different addresses are  not ordered and exist as a free list     When an MB entry gets picked for issue to the DRAM  such as a load  store  or ifetch  miss   the MB entry gets copied into the fill buffer and a valid bit gets set  There can  be up to 8 reads outstanding from the L2 cache to the DRAM at any point of time     Chapter 4  Level 2 Cache 4 7    Data can come from the DRAM to the L2 cache out of order with respect to the  address order  When the data comes back out of order  the MB entries get readied for  issue in the order of the data return  This means that there is no concept of age in the  order of data retur
204. sserts  spc0_pcx_atom_pq  which tells the PCX that CPU0 is sending a two packet request   The PCX handles all two packet requests atomically  CPU0 sends the first packet in  cycle PA and the second packet in cycle PX  ARB0 looks at all pending requests and  issues a grant to CPU0 in cycle PX  The grant is asserted for two cycles  The PCX  also asserts pcx_sctag0_atm_px1 in cycle PX  which tells the L2 cache Bank0 that the  PCX is sending a two packet request  ARB0 sends a data ready signal to the L2 cache  Bank0 in cycle PX  ARB0 sends the two packets to the L2 cache Bank0 in cycles PX2  and PX3     Arbiter  control  Arbiter  control    Arbiter  data  select    Arbiter  data  select    spc0_pcx_req_vld_pq 0     spc0_pcx_atom_pq    spc0_pcx_data_pa 123 0     pcx_spc0_grant_px    pcx_sctag0_data_rdy_px1    pcx_sctag0_atm_px1    pcx_sctag0_data_px2 123 0     PQ  PA  PX  PX2  PX3    pkt1  pkt2    pkt1  pkt2    Chapter 3  CPU Cache Crossbar 3 15    Note     FIGURE  3 4 and FIGURE  3 5 represent the best case scenario when there are no    pending requests     The timing for CPX transfers is similar to PCX transfers with the following  difference   the data ready signal from the CPX is delayed by one cycle before  sending the packet to its destination  FIGURE  3 6 and FIGURE  3 7 shows the CPX    packet transfer timing diagrams     FIGURE  3 6 CPX Packet Transfer Timing Diagram     One Packet Request    Arbiter  control    Arbiter  data  select    sctag0_cpx_req_cq 0     sctag0_cpx_
205. ster  the initial state of the J_RST_L reset is unasserted         The PWRON_RST_L reset is not always used asynchronously         The design guarantees that some clocks will propagate to PADs while the  PWRON_RST_L reset is asserted     10 1 2 1  Power On Reset  PWRON_RST_L         Assertion of PWRON_RST_L reset causes          All internal cluster resets to be asserted          CREGs in the CTU to be set to their defaults          All CKENs to be turned off  except J domain PADs          The C and D domain trees to be turned off  and the J and dup trees to be  turned on          The ARST_L to PLL is unasserted  allowing the PLL to toggle          The J domain tree is fed from dup divider         Deassertion of PWRON_RST_L reset causes          The asynchronous resets to be deasserted  the synchronous ones will remain  asserted           The initiation of PLL lock sequence     Chapter 10  Clocks and Resets 10 11    10 1 2 2  J Bus Reset  J_RST_L     A J_RST_L reset assertion causes all cluster clocks to be turned on at the target  frequencies         For a cold reset  the PLL is already locked          In system  the J_RST_L reset should remain asserted until the PLL is locked          In tester  the J_RST_L reset should assert after the PLL is locked         For a warm reset  a  real or fake  PLL re lock sequence is done          CKEN to all clusters are turned on          The J_RST_L reset deassertion causes all synchronous resets to be deasserted   and the reset sequences
206. store streaming store operation that hits the L2 cache  The lookup operation is  performed in order to invalidate all the SPARC L1 caches that own the line other  than the SPARC core that performed the store     The L2 cache directory is split into an icache directory  icdir  and a dcache directory   dcdir   which are both similar in size and functionality     The L2 cache directory is written only when a load is performed  On certain data  accesses  loads  stores and evictions   the directory is cammed to determine whether  the data is resident in the L1 caches  The result of this CAM operation is a set of    Chapter 4  Level 2 Cache 4 9    match bits which are encoded to create an invalidation vector that is to be sent back  to the SPARC CPU cores to invalidate the L1 cache lines  Descriptions of these data  access are as follows         Loads     The icdir is cammed to maintain I D exclusivity  The dcdir is updated to  reflect the load data that fills the L1 cache         IFetch     The dcdir is cammed to maintain the I D exclusivity  The icdir is  updated to reflect the instruction data that fills the L1 cache         Stores     Both directories are cammed  which ensures that  1  if the store is to  instruction space  the L1 icache invalidates the line and does not pick up stale  data   2  if a line is shared across SPARC CPUs  the L1 dcache invalidates the  other CPUs and does not pick up the stale data  and  3  the issuing CPU has the  most current information on the va
207. strm ld instructions  and the store  buffer will not buffer strm st data  Software must be written to enforce the ordering  and the maintenance of the data coherency     The acknowledgements for strm st instructions will be ordered through the data fill  queue  DFQ  upon the return to the stream processing unit  SPU   The  corresponding store acknowledgement  st ack  will be sent to the SPU once the  level  1 D cache  L1D  invalidation  if any  has been completed     2 4 19  Test Access Port Controller Accesses and Forward  Packets Support    Test access port  TAP  controller can access any SPARC core by way of the SPARC  interface of the I O bridge  IOB   A forward request to the SPARC core might take  any of the following actions         Read or write level 1 I cache or D cache        Read or write BIST control        Read or write margin control        Read or write de feature bits to the de feature any  or all  of L1I  L1D  ITLB  DTLB  in order to take a cache off line  or a TLB offline  for diagnostic purposes    2 32 OpenSPARC T1 Microarchitecture Specification      August 2006    A forward reply will be sent back to the I O bridge  IOB  once the data is read or  written  A SPARC core might further forward the request to the L2 cache for an  access to the control status register  CSR   The I O bridge only supports one  outstanding forward access at any time     2 4 20  SPARC Core Pipeline Flush Support    A SPARC core pipeline flush is reported through the LSU since the
208. t directly related to any particular  processor or memory state     A disrupting trap  related to an earlier instruction causing an exception  is similar  to a deferred trap in that it occurs after instructions  follows the trap inducing  instruction  and modifies the processor or memory state  The difference is that the  condition which caused the instruction to induce the trap may lead to  unrecoverable errors  since the implantation may not preserve the necessary  states     Disrupting trap conditions should persist until the corresponding trap is taken     TABLE  2 5 illustrates the type of traps supported by the OpenSPARC T1 processor     Asynchronous traps are taken opportunistically  They will be pending until the TLU  can find a trap bubble in the SPARC core pipeline  A maximum of one asynchronous  trap per thread can be pending at a time  When the other three threads are taking  traps back to back  an asynchronous trap may wait a maximum three SPARC core  clock cycles before the trap is taken     TABLE  2 5  Supported OpenSPARC T1 Trap Types    Trap Type  Deferred  Disrupting  Precise    Asynchronous  None  None  Spill traps  FPU traps  DTLB  parity error on loads  SPU MA  memory error return on load to  SYNC reg    Synchronous  DTLB parity error on  stores  precise to SW     Interrupts and some error traps  All other traps    Chapter 2  SPARC Core 2 55    2 10 3  Trap Flow    An asynchronous trap is normally associated with long latency instructions and  saves resto
209. t is  ranks  must be installed on each DRAM controller port  The DRAM  controller frequency is an exact ratio of the core frequency  where the core frequency  must be at least three times the DRAM controller frequency  The double data rate   DDR  data buses transfer data at twice the frequency of the DRAM controller  frequency     The OpenSPARC T1 processor can support memory sizes of up to 128 Gbytes with a  25 Gbytes sec peak bandwidth limit  Memory access is scheduled across 8 reads  plus 8 writes  and the processor can be programmed into a two channel mode for a  reduced configuration  Each DRAM channel has 128 bits of data and 16 bytes of ECC  interface  with chipkill support  nibble error correction  and byte error detection     1 3 6  I O Bridge    The I O bridge  IOB  performs an address decode on I O addressable transactions  and directs them to the appropriate internal block or to the appropriate external  interface  J Bus or the serial system interface   Additionally  the IOB maintains the  register status for external interrupts     1 3 7  J Bus Interface    The J Bus interface  JBI  is the interconnect between the OpenSPARC T1 processor  and the I O subsystem  The J Bus is a 200 MHz  128 bit wide  multiplexed address  or data bus  used predominantly for direct memory access  DMA  traffic  plus the  programmable input output  PIO  traffic used to control it     The J Bus interface is the functional block that interfaces to the J Bus  receiving and  responding to D
210. the SOFTINT_REG register  I O and CPU cross call  interrupts are delivered to each virtual core using the interrupt_vector trap  0x60      Interrupt_vector traps for software interrupts have a corresponding 64 bit  ASI_SWVR_INTR_RECEIVE register     I O devices and CPU cross call interrupts contain a 6 bit identifier  which  determines which interrupt vector  level  in the ASI_SWVR_INTR_RECEIVE register  the interrupt will target  Each strand   s ASI_SWVR_INTR_RECEIVE register can  queue up to 64 outstanding interrupts  one for each interrupt vector  Interrupt  vectors are implicitly prioritized with vector 63 being the highest priority and vector  0 being the lowest priority     Each I O interrupt source has a hard wired interrupt number  which is used to  index a table of interrupt vector information  INT_MAN  in the I O bridge unit   Generally  each I O interrupt source will be assigned a unique virtual core target  and vector level  This association is defined by the software programming of the    2 58 OpenSPARC T1 Microarchitecture Specification      August 2006    interrupt vector and the VC_ID fields in the INT_MAN table of the I O bridge   IOB   The software must maintain the association between the interrupt vector and  the hardware interrupt number in order to index the appropriate entry in the  INT_MAN and the INT_CTL tables     2 10 6  Interrupt Flow    FIGURE  2 31 illustrates the flow of hardware interrupts and vector interrupts     FIGURE  2 31 Flow of Hardware
211. the line size is 16  bytes  The D cache has a single read and write port  1 RW  for the data and tag  array  The valid bit  V bit  array is dual ported with one read port and one write port   1R 1W   The valid bit array holds the cache line state of valid or invalid   Invalidations access the V bit array directly without first accessing the data and tag  array  The cache line replacement policy follows a pseudo random algorithm  where  loads are allocating and stores non allocating     A cacheable load miss will allocate a line  and it will execute the write through  policy for stores  Stores do not allocate  and local stores may update the L1 D cache  if it is present in the L1 D cache  as determined by L2  Level 2  cache directory  If it  is deemed that it is not present in L1 D cache  the local stores will cause the lines to  become invalidated  The line replacement policy is pseudo random based on a linear  shift register  The data from the bypass queues will be multiplexed into the L1 D   cache in order to be steered to the intended destination  The D cache supports up to  four simultaneous invalidates from the data evictions     array  tag  vld  dfq    From cpx  To pcx    To ifu  etc  irf  frf  load    tlb    store    stb    pcx gen    2 24 OpenSPARC T1 Microarchitecture Specification      August 2006    The L2 cache is always inclusive of the L1 D cache  The exclusivity of the D cache  dictates that a line present in the L1 D cache will not be present in the L1 I cache
212. the order for cache  updates and invalidates     1 3 1 4  Floating Point Frontend Unit    The floating point frontend unit  FFU  decodes floating point instructions and it also  includes the floating point register file  FRF   Some of the floating point instructions  like move  absolute value  and negate are implemented in the FFU  while the others  are implemented in the FPU  The following steps are taken when the FFU detects a  floating point operation  Fpop          The thread switches out         The Fpop is further decoded and the FRF is read         Fpops with operands are packetized and shipped over the crossbar to the FPU         The computation is done in the FPU and the results are returned by way of the  crossbar         Writeback completed to the FRF and the thread restarts     1 3 1 5  Trap Logic Unit    The trap logic unit  TLU  has support for six trap levels  Traps cause pipeline flush  and thread switch until trap program counter  PC  becomes available  The TLU also  has support for up to 64 pending interrupts per thread     1 8 OpenSPARC T1 Microarchitecture Specification      August 2006    1 3 1 6  Stream Processing Unit    The stream processing unit  SPU  includes a modular arithmetic unit  MAU  for  crypto  one per core   and it supports asymmetric crypto  public key RSA  for up to  a 2048 byte size key  It shares an integer multiplier for modular arithmetic  operations  MAU can be used by one thread at a time  The MAU operation is set up  by the store t
213. the smallest normalized number     Loss of accuracy is detected when the delivered result value differs from what  would have been computed had both the exponent range and the precision been  unbounded  an inexact condition      The FPA  FPM  and FPD will signal an underflow to the SPARC core FFU for all tiny  results  The FFU must clear the FSR ufc flag if the result is exact  the FSR nxc is not  set  and the FSR UFM mask is not set  This case represents an exact denormalized  result     Chapter 7  Floating Point Unit 7 13    7 1 8 2  IEEE Exception List    TABLE  7 6 lists the IEEE exception cases and their OpenSPARC T1 generated results     Note     The FPU does not receive the trap enable mask  FSR TEM   The FSR TEM  bits are used within the FFU  If an instruction generates an IEEE exception when the  corresponding trap enable is set  then an fp_exception_ieee_754 trap is generated  and the results are inhibited by the FFU     TABLE  7 6  IEEE Exception Cases    Instruction  Invalid  Divide by zero  Overflow    Underflow or  Denormalized  Inexact    FABS s d   Executed in SPARC core FFU  cannot generate IEEE exceptions     FADD s d       SNaN                   result NaN1 2    FSR nvc 1    result   max  or        FSR ofc 14    result   0 or    min or    denorm  FSR ufc 15  4    result IEEE6    FSR nxc 17    FCMP s d       SNaN  result fcc  FSR nvc 1    FCMPE s d       NaN  result fcc  FSR nvc 1    FDIV s d       SNaN      0    0                  result NaN1  2    FSR nvc 
214. tion  and the length  field specifies the number of 64 bit words for each operation  The maximum length  of these operations should never exceed 32 words     The MA_MUL operates on A  B  M  N and N operands  The result will be stored in  the X operand     The MA_RED operates on A and N operands and the result will be stored in the R  operand     The MA_EXP performs the inner loop of modular exponentiation of A  M  N  X  E   operands stored in the MA Memory  This is the binary approach where the  MA_MUL  followed by MA_RED functions  are called and will have the results  stored in X operand     The parity error encountered on an operand read will cause the operation to be  halted  The LSU and the IFU will be signaled     FIGURE  2 24 shows a pipeline diagram that illustrates the sequence of the result    generation of the multiply function     FIGURE  2 24 Multiply Function Result Generation Sequence Pipeline Diagram    2 9  Memory Management Unit    The memory management unit  MMU  maintains the contents of the instruction  translation lookaside buffer  ITLB  and the data translation lookaside buffer  DTLB    The ITLB resides in instruction fetch unit  IFU   and the DTLB resides in load and  store unit  LSU   FIGURE  2 25 shows the relationship among the MMU and the TLBs     oprnd1  oprnd2    mul_ack  oprnd1    mul_ack    mul_res    mul_res  oprnd2    2 44 OpenSPARC T1 Microarchitecture Specification      August 2006    FIGURE  2 25 MMU and TLBs Relationship    2 9 1  The Rol
215. tion Register         An injection of a bad ECC on the data written to memory          When ENB 1 is set  the DRAM writes will be XOR   d with the normally  generated ECC     Chapter 9  Error Handling 9 9         Errors can be injected as either single shot or continuously          In single shot mode  after the first injected error is generated  the SSHOT and  ENB are automatically reset by the hardware to 0     9 4 2  DRAM Error Protection    Each DRAM bank has 16 bits of ECC for 128 bits of data     9 4 3  DRAM Correctable Errors        Corrected data written to the L1  or L2 caches         Error information is captured in the DRAM error status  L2 cache error status   and the L2 cache error address registers         If the L2 cache error enable CEEN and SPARC error enable CEEN bits are set  a  disrupting ECC_error trap is generated         Load  ifetch  atomic  prefetch     an error on the critical chunk will be reported to  the thread that requested the data  otherwise it will be reported to the steering  thread         Stores  streaming stores  DMA reads  DMA writes     errors reported to the steering  thread         Streaming loads      errors are reported to the streaming unit  which reports it to the  thread programmed in the MA control register to receive the completion  interrupt         A correctable error during a scrub is captured in the DRAM error status and  DRAM error address registers  and the DSC bit is set in the L2 cache error status  register     9 4 4  
216. tion is sent to all other  CPUs appropriately  The icache and the dcache directories are cammed and the  entries are invalidated  In case of atomics  the directory entry of even the issuing  CPU is invalidated     CAS CAS X     CAS X  instructions are handled as two packets on the PCX  The first packet   CAS 1   contains the address and the data  against which the read data will be  compared      The first pass reads the addressed cache line and sends 128 bits of data read back to  the requesting CPU   The comparison is performed in the first pass      The second packet  CAS 2   contains the store data  The store data is inserted into  the miss buffer as a store at the address contained in the first packet  If the  comparison result is true  the second pass proceeds like a normal store  If the result  was false  the second pass proceeds to generate the store acknowledgment only  The  scdata array is not written     4 1 4 5  J Bus Interface Instructions    I O requests are sent to the L2 cache by way of the J Bus interface  JBI   The L2   cache processes the following instructions from a JBI     block read  RD64   write  invalidate  WRI   and partial line write  WR8      Chapter 4  Level 2 Cache 4 15    Block Read    A block read  RD64  from the JBI goes through the L2 cache pipe like a regular load  from the CPU  On a hit  64 bytes of data is returned to the JBI  On a miss  the L2   cache does not allocate  but sends a non allocating read to the DRAM  It gets 64  bytes of dat
217. tive or if there is a misaligned  access  Otherwise  it does a regular store      Data is written into the scdata cache on a miss  allocated      The CTAG  the instruction identifier  is returned to the JBI when the processor sends  an acknowledgement to the cache line invalidation request sent over the CPX     The directory entry is not created in the case of a miss     4 16 OpenSPARC T1 Microarchitecture Specification      August 2006    4 1 4 6  Eviction    When a load or a store instruction is a miss in the L2 cache  a request goes to the  DRAM controller to bring the cache line from the main memory  Before the arriving  data can be installed  one of the ways must be evicted  The pseudo LRU algorithm  described earlier picks the way to be evicted     The L2 cache  scdata  includes all valid L1 cache lines  In order to preserve the  inclusion  the L2 cache directory  both icache and dcache  is cammed with the  evicted tag  and the corresponding entry is invalidated  The invalidated packets are  all sent to the appropriate CPUs     If the evicted line is dirty  it is written into the write back buffer  WBB   The WBB  opportunistically streams out the cache line to the DRAM controller over a 64 bit  bus     4 1 4 7  Fill    A fill is issued following an eviction after an L2 cache store or load miss  The 64 byte  data arrives from the DRAM controller and is stored in the fill buffer  Data is read  from the fill buffer and written into the L2 cache scdata array     4 1 4 8  Ot
218. to MA_CTL register opcode equals  MA_LD  and the length field specifies the number of words to be fetched from the  L2 cache  The SPU sends a PCX request to the LSU and waits for an  acknowledgement from the LSU before sending out another request  The L2 cache  returns data to the SPU directly on CPX     Any data returned with an uncorrectable error will halt the operation  If the Int bit is  cleared  Int   0   the SPU will send a signal to the LSU and the IFU on any ldxa to  MA register     Any data returned with a correctable error will cause the error address to be sent to  IFU and be logged  while the operation will continue until completion     TABLE  2 4 illustrates the error handling behavior     tr2_maop_frm_idle   cur_idle  amp  stxa_2ctlreq  amp   wait_4stb_empty  amp   wait_4trapack_set   tr2_abort_frm_maop   cur_maop  amp  stxa_2ctlreg   tr2_wait_frm_abort   cur_abort  amp  ma_op_complete   tr2_maop_frm_wait   cur_wait  amp    stxa_2ctlreg   wait_4stb_empty   wait_4trapack_set    tr2_idl_frm_maop   cur_maop  amp   stxa_2cltreg  amp  ma_op_complete   tr2_wait_frm_idle   cur_idle  amp  stxa_2ctlreg  amp   wait_4stb_empty   wait_4trapack_set      TABLE  2 4  Error Handling Behavior    NCEEN  Int  LSU  IFU    0  0     error_log    0  1     error_log    1  0  precise trap  error_log    1  1     error_log    Chapter 2  SPARC Core 2 43    The MA_MUL  MA_RED  and the MA_EXP operations all started with a stxa to  MA_CTL register with an opcode equal to the respective opera
219. to replace the cache line     There is a fully associative instruction TLB with 64 entries  The buffer supports the  following page sizes  8 Kbytes  64 Kbytes  4 Mbytes  and 256 Mbytes  The TLB uses  a pseudo least recently used  LRU  algorithm for replacement  Multiple hits in the  TLB are prevented by doing an autodemap on a fill     Two instructions are fetched each cycle  though only one instruction is issued per  clock  which reduces the instruction cache activity and allows for an opportunistic  line fill  There is only one outstanding miss per thread  and only four per core   Duplicate misses do not issue requests to the L2 cache     The integer register file  IRF  of the SPARC core has 5 Kbytes with 3 read 2 write 1  transport ports  There are 640 64 bit registers with error correction code  ECC   Only  32 registers from the current window are visible to the thread  Window changing in  background occurs under the thread switch  Other threads continue to access the IRF   the IRF provides a single cycle read write access      1 3 1 2  Execution Unit    The execution unit  EXU  has a single arithmetic logic unit  ALU  and shifter  The  ALU is reused for branch address and virtual address calculation  The integer  multiplier has a 5 clock latency  and a throughput of half per cycle for area saving   One integer multiplication is allowed outstanding per core  The integer multiplier is  shared between the core pipe  EXU  and the modular arithmetic  SPU  unit on a  round rob
220. to the floating point unit  FPU  through the LSU  as well as  executing simple FP ops  mov  abs  neg  and VIS instructions  The FFU also  maintains the floating point state register  FSR  and the graphics state register   GSR   There can be only one outstanding instruction in the FFU at a time     Input  Queue    XOR  Dividend   Quotient    1 bit left shift    Divisor       2 36 OpenSPARC T1 Microarchitecture Specification      August 2006    The FFU is composed of four blocks     the floating point register file  FFU_FRF   the  control block  FFU_CTL   the data path block  FFU_DP   and the VIS execution block   FFU_VIS   FIGURE  2 19 shows a block diagram of the FFU illustrating these four sub     blocks     FIGURE  2 19 Top Level FFU Block Diagram    2 6 2  Floating Point Register File    The floating point register file  FRF  has 128 entries of 64 bits of data  plus 14 bits of  ECC  The write port has an enable for each half of the data  Bits  38 32  are the ECC  bits for the lower word  data 31 0   and bits  77 71  are the ECC bits for the upper  word  data 70 39       2 6 3  FFU Control  FFU_CTL     The FFU control  FFU_CTL  block implements the control logic for the FFU  and it  generates the appropriate multiplexor selects and data path control signals  The FFU  control also decodes the fp_opcode and contains the state machine for the FFU  pipeline  It also generates the FP traps and kill signals  as well as signalling the LSU  when the data is ready for dispatch     F
221. trap level can be changed by the done or the retry instructions or a WRPR  instruction to TL  The trap is taken on the instruction immediately following these  instructions  The change could be stepping down the trap level  or changing the TL  from  gt 0 to 0  The HPSTATE tlz bit will not be cleared by the hardware when a trap  is taken so the TLZ trap  tlz trap  handler has to clear this bit before returning in  order to avoid the infinite tlz trap loop     2 10 15  Performance Control Register and Performance  Instrumentation Counter    Each thread has a privileged performance control register  PCR   Non privileged  accesses to this register causes a privileged_opcode trap     Each thread has a performance instrumentation counter  PIC  register  The access  privileged is controlled by the setting the PERF_CONTROL_REG PRIV bit  When  PERF_CONTROL_REG PRIV 1  non privileged accesses to this register cause a  privileged_action trap     FIGURE  2 36 highlights the layout of PCR and PIC     Chapter 2  SPARC Core 2 67    FIGURE  2 36 PCR and PIC Layout    If the PCR OVFH bit is set to 1  the PIC H has overflowed and the next event will  cause a disrupting trap that appears to be precise to the instruction following the  event     If the PCR OVFL bit is set to 1  the PIC L has overflowed and next event will cause a  disrupting trap that appears to be precise to the instruction following the event     If the PCR UT bit is set to 1  it counts events in user mode  Otherwise  it will 
222. trated in this order     1  I cache miss    2  Load miss    3  Stores    4   FPU operations  SPU operations  Interrupts     Chapter 2  SPARC Core 2 27    The use of a two level history allows a fair  per category scheduling among the  different categories  The arbiter achieves a resolution in every cycle  Requests from  atomic instructions take two cycles to finish the arbitration     There are five possible targets  which include four L2 cache banks and one I O  buffer  IOB   The FPU access shares the path through the IOB     Speculation on the PCX availability does occur  and a history will be established  once the speculation is known to be correct     2 4 8  Data Fill Queue    A SPARC core communicates with memory and I O using packets  The incoming  packets  destined to a SPARC core  are queued in the data fill queue  DFQ  first   These packets can be acknowledgement packets or data packets from independent  sources  The DFQ maintains a predefined ordering requirement for all the inbound  packets  The targets for the DFQ to deliver the packets to include the instruction  fetch unit  IFU   load store unit  LSU   trap logic unit  TLU   and stream processing  unit  SPU      A store to the D cache is not allowed to bypass another store to the D cache  Store  operations to different caches can bypass each other without violating the total store  ordering  TSO  model     Interrupts are allowed to be delivered to TLU only after all the prior invalidates  have been visible in th
223. ts of an adder and logic operations such  as     ADD  SUB  AND  NAND  OR  NOR  XOR  XNOR  and NOT  The ALU is also  reused when calculating the branch address or a virtual address  FIGURE  2 17    illustrates the top level block diagram of the ALU     FIGURE  2 17 ALU Block Diagram    32b  mask sign  extend    right shift    left shift    selec    Logic     0      0     cc  C  V  sum    shft result    Result Select Mux    logic    regz    cc Z    PR or SR output  Sum predict    Add sub    Exu_ifu_brpc_e    Chapter 2  SPARC Core 2 35    MUL is the integer multiplier unit  IMUL   and DIV is the integer divider unit   IDIV   IMUL includes the accumulate function for modular arithmetic  The latency  of IMUL is 5 cycles  and the throughput is 1 half per cycle  IMUL supports one  outstanding integer multiplication operation per core  and it is shared between a  SPARC core pipeline and the modular arithmetic unit  MAU   The arbitration is  based on a round robin algorithm     IDIV contains a simple non restoring divider  and it supports one outstanding  divide operation per core     FIGURE  2 18 illustrates the top level diagram of the IDIV     FIGURE  2 18 IDIV Block Diagram    When either IMUL or IDIV is occupied  a thread issuing a MUL or DIV instruction  will be rolled back and switched out     2 6  Floating Point Frontend Unit    2 6 1  Functional Description of the FFU    The floating point frontend unit  FFU  is responsible for dispatching floating point  operations  FP ops  
224. ty Web Sites    Sun is not responsible for the availability of third party web sites mentioned in this  document  Sun does not endorse and is not responsible or liable for any content   advertising  products  or other materials that are available on or through such sites  or resources  Sun will not be responsible or liable for any actual or alleged damage  or loss caused by or in connection with the use of or reliance on any such content   goods  or services that are available on or through such sites or resources     1 1    CHAPTER 1    OpenSPARC T1 Overview    This chapter contains the following topics         Section  1 1     Introducing the OpenSPARC T1 Processor    on page  1 1        Section  1 2     Functional Description    on page  1 2        Section  1 3     OpenSPARC T1 Components    on page  1 4    1 1  Introducing the OpenSPARC T1  Processor    The OpenSPARC T1 processor is the first chip multiprocessor that fully implements  the Sun Throughput Computing Initiative  The OpenSPARC T1 processor is a highly  integrated processor that implements the 64 bit SPARC V9 architecture  This  processor targets commercial applications such as application servers and database  servers     The OpenSPARC T1 processor contains eight SPARC   processor cores  which each  have full hardware support for four threads  Each SPARC core has an instruction  cache  a data cache  and a fully associative instruction and data translation lookaside  buffers  TLB   The eight SPARC cores are con
225. u_clsp v    ctu_jbi_jbus_cken  Out  JBI  From ctu_clsp of ctu_clsp v    ctu_jbusl_clock_dr  Out  PADS  From ctu_dft of ctu_dft v    ctu_jbusl_hiz_l  Out  PADS  From ctu_dft of ctu_dft v    ctu_jbusl_jbus_cken  Out  PADS  From ctu_clsp of ctu_clsp v    ctu_jbusl_mode_ctl  Out  PADS  From ctu_dft of ctu_dft v    ctu_jbusl_shift_dr  Out  PADS  From ctu_dft of ctu_dft v    ctu_jbusl_update_dr  Out  PADS  From ctu_dft of ctu_dft v    ctu_jbusr_clock_dr  Out  PADS  From ctu_dft of ctu_dft v    ctu_jbusr_hiz_l  Out  PADS  From ctu_dft of ctu_dft v    ctu_jbusr_jbus_cken  Out  PADS  From ctu_clsp of ctu_clsp v    ctu_jbusr_mode_ctl  Out  PADS  From ctu_dft of ctu_dft v    ctu_jbusr_shift_dr  Out  PADS  From ctu_dft of ctu_dft v    ctu_jbusr_update_dr  Out  PADS  From ctu_dft of ctu_dft v    TABLE  10 2 CTU I O Signal List  Continued     Signal Name  I O  Source   Destination  Description    Chapter 10  Clocks and Resets 10 23    ctu_misc_clock_dr  Out  PADS  From ctu_dft of ctu_dft v    ctu_misc_hiz_l  Out  PADS  From ctu_dft of ctu_dft v    ctu_misc_jbus_cken  Out  PADS  From ctu_clsp of ctu_clsp v    ctu_misc_mode_ctl  Out  PADS  From ctu_dft of ctu_dft v    ctu_misc_shift_dr  Out  PADS  From ctu_dft of ctu_dft v    ctu_misc_update_dr  Out  PADS  From ctu_dft of ctu_dft v    ctu_pads_bso  Out  PADS  From ctu_dft of ctu_dft v    ctu_pads_so  Out  PADS  From ctu_dft of ctu_dft v    ctu_pads_sscan_update  Out  PADS  From ctu_dft of ctu_dft v    ctu_scdata0_cmp_cken  Out  SCDATA0  Cloc
226. unters in the JBI  which are 31 bits wide each         The software can select one of the 12 events to be counted          J Bus cycles         DMA read transactions  inbound          Total DMA read latency         DMA write transactions         DMA WR8 transactions         Ordering waits  number of jbi  gt l2 queues blocked each cycle          PIO read transactions         Total PIO read latency         PIO write transactions         AOK off or DOK off seen         AOK off seen         DOK off seen    6 8 OpenSPARC T1 Microarchitecture Specification      August 2006    6 2  I O Signal list    TABLE  6 1 lists the I O Signals for the OpenSPARC T1 JBI block     TABLE  6 1  JBI I O Signal List    Signal Name  I O    Source   Destination  Description    cmp_gclk  In  CTU  CMP clock     cmp_arst_l  In  CTU  CMP clock domain async reset     cmp_grst_l  In  CTU  CMP clock domain reset     jbus_gclk  In  CTU  J Bus clock     jbus_arst_l  In  CTU  J Bus clock domain async reset     jbus_grst_l  In  CTU  J Bus clock domain reset     ctu_jbi_ssiclk  In  CTU  J Bus clk divided by 4    ctu_jbi_tx_en  In  CTU  CMP to JBI clock domain crossing synchronization pulse     ctu_jbi_rx_en  In  CTU  JBI to CMP clock domain crossing synchronization pulse     ctu_jbi_fst_rst_l  In  CTU  Fast reset for capturing port present bits  J_RST_L   1      clk_jbi_jbus_cken  In  CTU  Jbi clock enable     clk_jbi_cmp_cken  In  CTU  Cmp clock enable     global_shift_enable  In  CTU  Scan shift enable signal   
227. upt is treated similar to a membar  It will be sent to PCX once the store  buffer of the corresponding thread has been drained  This interrupt will then  immediately be acknowledged to TLU     After the interrupt packet has been dispatched by way of the L2 cache to Core  Interface  CCX   the packet would be executed on the destination thread of a SPARC  core  It can be invalidated after all prior invalidates have completed and results  arrived at L1 D cache  L1D      2 4 14  Flush Instruction Support    A flush instruction does not actually flush the instruction memory  It instead  it acts  as a barrier to ensure that all of the prior invalidations for a thread have been visible  in the level 1 I cache  L1I  before causing the thread to be switched back in     The flush is issued as an interrupt with the flush bit set  which causes the L2 cache  to broadcast the packet to all SPARC cores     2 30 OpenSPARC T1 Microarchitecture Specification      August 2006    For the SPARC core that issued the flush  an acknowledgement from the DFQ upon  receiving the packet will cause all of the prior invalidations to complete with the  results arrived at the level 1 I cache and the level 1 D cache  L1 I D      For the SPARC cores that did not issue the flush  the DFQ will serialize the flushes  so that the order of the issuing threads actions  relative to the flushes  will be  preserved     2 4 15  Prefetch Instruction Support    A prefetch instruction is treated as a non cacheable load  A
228. uses a precise trap         I O Load Instruction Fetch Uncorrectable Error         Causes a precise trap         Modular Arithmetic Memory Error         SPU aborts the operation and logs the error          Different synchronization modes result in different traps          No address applies to this case     9 3  L2 Cache Errors    This section lists the error registers and error protection types of the L2 cache  This  section also describes the L2 cache correctable and uncorrectable errors     9 3 1  L2 Cache Error Registers    Each L2 cache bank contains the following error registers         L2 Control Register  whose bits in this register are          ERRORSTEER     specifies which of the 32 threads receives all the L2 errors  whose cause cannot be linked to a specific thread          SCRUBINTERVAL     the interval between scrubbing of adjacent sets in the L2   cache          SCRUBENABLE     enable a hardware scrub         L2 Error Enable Register         NCEEN     if set  uncorrectable errors are reported to the SPARC core          CEEN     if set  correctable errors are reported to the SPARC core          Logging occurs even if reporting to the cores is disabled     9 6 OpenSPARC T1 Microarchitecture Specification      August 2006        L2 Error Status Register         Contains the error status for that bank          Not cleared after a reset          Indicates multiple errors if they have occurred         L2 Error Address Register         Logs the error address per cach
229. validate  111  1  0  0  0  0  x  x  0  0    Prefetch  110  1  0  0  0  0  x  x  0  0    Block init store   Displacement flush    109  1  0  0  0  0  x  x  0  0    Rep_L1_way  108 107  2  V  V  A x  x  x  x  x  x    Size  106 104  3  V  A B x  V  x  x  x  011  ERR    Address  103 64  40  V  V  V  x  V   V   V  x    Data  63 0  64  Vrs2  x  V  V   RS2  RS1  V x  V x    Chapter 3  CPU Cache Crossbar 3 9    3 2  CCX I O List    TABLE  3 5 lists the CCX I O signals     TABLE  3 5  CCX I O Signal List    Signal Name  I O  Source Destination  Description    adbginit_l  In  Asynchronous reset    ccx_scanin0  In  DFT  Scan in 0    ccx_scanin1  In  DFT  Scan in 1    clk_ccx_cken  In  CTU    cmp_arst_l  In  CTU  Asynchronous reset    cmp_grst_l  In  CTU  Synchronous reset    ctu_tst_macrotest  In  CTU    ctu_tst_pre_grst_l  In  CTU    ctu_tst_scan_disable  In  CTU    ctu_tst_scanmode  In  CTU    ctu_tst_short_chain  In  CTU    fp_cpx_data_ca 144 0   In  FPU  FPU CPX data    fp_cpx_req_cq 7 0   In  FPU  FPU CPX request    gclk 1 0   In  CTU  Clock    gdbginit_l  In  CTU  Synchronous reset    global_shift_enable  In  CTU    iob_cpx_data_ca 144 0   In  IOB  IOB CPX data    iob_cpx_req_cq 7 0   In  IOB  IOB CPX request    sctag0_cpx_atom_cq  In  L2 Bank0  Atomic packet    sctag0_cpx_data_ca 144 0   In  L2 Bank0  L2 CPX data    sctag0_cpx_req_cq 7 0   In  L2 Bank0  L2 CPX request    sctag0_pcx_stall_pq  In  L2 Bank0  PCX Stall    sctag1_cpx_atom_cq  In  L2 Bank1  Atomic packet    sctag1_cpx_
230. wl_r2 2 0   Out  SCBUF    sctag_scbuf_fbd_stdatasel_c3  Out  SCBUF  Select store data in OFF mode    TABLE  4 3  SCTAG I O Signal List  Continued     Signal Name  I O  Source   Destination  Description    Chapter 4  Level 2 Cache 4 23    sctag_scbuf_wbwr_wen_c6 3 0   Out  SCBUF  Write en    sctag_scbuf_wbwr_wl_c6 2 0   Out  SCBUF  From wbctl    sctag_scbuf_wbrd_en_r0  Out  SCBUF  Triggerred by a wr_ack from dram    sctag_scbuf_wbrd_wl_r0 2 0   Out  SCBUF    sctag_scbuf_ev_dword_r0 2 0   Out  SCBUF    sctag_scbuf_evict_en_r0  Out  SCBUF    sctag_scbuf_rdma_wren_s2 15 0   Out  SCBUF  May be all 1s    sctag_scbuf_rdma_wrwl_s2 1 0   Out  SCBUF    sctag_scbuf_rdma_rdwl_r0 1 0   Out  SCBUF    sctag_scbuf_rdma_rden_r0  Out  SCBUF    sctag_scbuf_ctag_en_c7  Out  SCBUF    sctag_scbuf_ctag_c7 14 0   Out  SCBUF    sctag_scbuf_word_c7 3 0   Out  SCBUF    sctag_scbuf_req_en_c7  Out  SCBUF    sctag_scbuf_word_vld_c7  Out  SCBUF  This signal is high for 16 signals     sctag_dram_rd_req  Out  DRAM    sctag_dram_rd_dummy_req  Out  DRAM    sctag_dram_rd_req_id 2 0   Out  DRAM    sctag_dram_addr 39 5   Out  DRAM    sctag_dram_wr_req  Out  DRAM    sctag_jbi_iq_dequeue  Out  JBI  Implies that an instruction has been issued    sctag_jbi_wib_dequeue  Out  JBI  Implies that an entry in the rdma array has freed     sctag_dbgbus_out 40 0   Out  IOB  Debug bus    sctag_clk_tr  Out    sctag_ctu_mbistdone  Out  CTU  MBIST done    sctag_ctu_mbisterr  Out  CTU  MBIST error    sctag_ctu_scanout  Out  DFT  S
    
Download Pdf Manuals
 
 
    
Related Search
    
Related Contents
Data Sheet: uDiskOnChip (uDOC) - Digi-Key      Anton_Brochure  Edimax BR-6314K Broadband Router / 4 port switch  Finances - Comité Départemental Olympique et Sportif  CV-MIDI Assembly Manual  MMS6850/A3760 fflfflfflffl User Manual  C 4/36-90 / C 4/36-350  Chicco Soft & Dream    Copyright © All rights reserved. 
   Failed to retrieve file