UltraSPARC IV+ Processor User's Manual Supplement
Contents
1. 8K_PTR is formed from TSB_Base<63:14+n> XORed with TSB_Extension<63:14+n>, a zero bit, VA<21+n:13>, and four low-order zero bits (0000).
64K_PTR is formed from TSB_Base<63:14+n> XORed with TSB_Extension<63:14+n>, a zero bit, VA<24+n:16>, and four low-order zero bits (0000).
The TSB Tag Target is formed by aligning the missing access VA (from the Tag Access Register) and the current context to the positions found above in the description of the TTE tag, allowing a simple XOR instruction to detect a TSB hit.
13.1.6 Faults and Traps. On a mem_address_not_aligned trap that occurs during the execution of a JMPL or RETURN instruction, the UltraSPARC IV processor updates the D-SFSR register with the FT field set to 0 and updates the D-SFAR register with the fault address. For details, please refer to the UltraSPARC III Cu Processor User's Manual.
13.1.7 Reset, Disable, and RED_state Behavior. Please refer to the UltraSPARC III Cu Processor User's Manual for general details. When the I-MMU is disabled, it truncates all instruction accesses to the physical address size (implementation dependent) and passes the physically cacheable bit (Data Cache Unit Control Register CP bit) to the cache system. The access does not generate an instruction_access_exception trap.
13.1.8 Internal Registers and ASI Operations. Please refer to the UltraSPARC III Cu Processor User's Manual for details.
13.1.9 I-TLB Tag Ac...
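The pointer and tag-target formation above lends itself to a short software model. The sketch below is illustrative only: the helper names are invented, the bit positions follow the field layout quoted above, and the Tag Target packing assumes the UltraSPARC III Cu TTE tag layout (context in bits 60:48, VA<63:22> in bits 41:0).

    #include <stdint.h>

    /* n is the TSB size field; the 8K TSB holds 512 * 2^n TTE pairs. */
    uint64_t tsb_8k_pointer(uint64_t tsb_base, uint64_t tsb_ext,
                            uint64_t va, unsigned n)
    {
        uint64_t base  = (tsb_base ^ tsb_ext) & ~((1ULL << (14 + n)) - 1); /* bits 63:14+n */
        uint64_t index = (va >> 13) & ((1ULL << (9 + n)) - 1);             /* VA<21+n:13>  */
        return base | (index << 4);               /* 16-byte TTE pairs, bits 3:0 are zero */
    }

    /* Tag Target packing: context above, VA<63:22> below, mirroring the TTE tag. */
    uint64_t tsb_tag_target(uint64_t va, uint64_t context)
    {
        return ((context & 0x1fff) << 48) | (va >> 22);
    }

    int tsb_hit(uint64_t tte_tag, uint64_t tag_target)
    {
        return (tte_tag ^ tag_target) == 0;       /* a single XOR detects the hit */
    }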
2. See the UltraSPARC III Cu Processor User's Manual.
13.2.4 Hardware Support for TSB Access. The MMU hardware provides services to allow the TLB miss handler to efficiently reload a missing TLB entry for an 8 KB or 64 KB page. These services include:
- Formation of TSB Pointers, based on the missing virtual address and address space
- Formation of the TTE Tag Target used for the TSB tag comparison
- Efficient atomic write of a TLB entry with a single store ASI operation
- Alternate globals on MMU-signaled traps
Please refer to the UltraSPARC III Cu Processor User's Manual for additional details.
13.2.4.1 Typical TLB Miss Refill Sequence. A typical TLB miss and TLB refill sequence is the following (a schematic sketch of this sequence appears after the steps):
1. A D-TLB miss causes a fast_data_access_MMU_miss exception.
2. The appropriate TLB miss handler loads the TSB Pointers and the TTE Tag Target with loads from the MMU registers.
3. Using this information, the TLB miss handler checks to see if the desired TTE exists in the TSB. If so, the TTE data are loaded into the TLB Data In Register to initiate an atomic write of the TLB entry chosen by the replacement algorithm.
4. If the TTE does not exist in the TSB, then the TLB miss handler jumps to the more sophisticated (and slower) TSB miss handler.
The virtual address used in the formation of the pointer addresses comes from the Tag Access Register.
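The following is a minimal C pseudocode sketch of steps 2 through 4. The accessors ldxa_dmmu(), stxa_dtlb_data_in(), retry_instruction(), and tsb_miss_handler(), as well as the register offsets, are placeholders standing in for the real LDXA/STXA ASI accesses and the RETRY instruction; they are not names defined by this manual.

    #include <stdint.h>

    extern uint64_t ldxa_dmmu(uint64_t va_offset);      /* placeholder ASI load  */
    extern void     stxa_dtlb_data_in(uint64_t tte);    /* placeholder ASI store */
    extern void     retry_instruction(void);
    extern void     tsb_miss_handler(void);

    #define DMMU_TSB_8K_PTR  0x28    /* illustrative offsets within the D-MMU ASI */
    #define DMMU_TAG_TARGET  0x00

    void dtlb_miss_refill(void)
    {
        uint64_t tsb_ptr    = ldxa_dmmu(DMMU_TSB_8K_PTR);   /* step 2: TSB pointer   */
        uint64_t tag_target = ldxa_dmmu(DMMU_TAG_TARGET);   /* step 2: TTE TagTarget */

        uint64_t tte_tag  = *(volatile uint64_t *)tsb_ptr;       /* step 3: probe TSB */
        uint64_t tte_data = *(volatile uint64_t *)(tsb_ptr + 8);

        if (tte_tag == tag_target) {
            stxa_dtlb_data_in(tte_data);   /* atomic TLB write; entry chosen by HW   */
            retry_instruction();           /* re-execute the faulting access         */
        } else {
            tsb_miss_handler();            /* step 4: fall back to TSB miss handler  */
        }
    }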
3. (Bit, Field, Description, R/W)
Bits 63:7 Reserved: Reserved for future implementation.
Bit 6 LRU: The LRU bit in the CAM. Read/write (RW).
Bits 5:3 CAM SIZE: The 3-bit page size field from the RAM. Read only.
Bits 2:0 RAM SIZE: The 3-bit page size field from the CAM. Read only.
TABLE 13-14 T512 Diagnostic Register (Bit, Field, Description, R/W):
Bit 63: See the UltraSPARC III Cu Processor User's Manual.
Bits 62:61 Size<1:0>: Encoded page size bits.
Bit 60: See the UltraSPARC III Cu Processor User's Manual.
Bit 59 IE: See the UltraSPARC III Cu Processor User's Manual.
Bits 58:50 Soft2: See the UltraSPARC III Cu Processor User's Manual.
Bit 49: Reserved for future implementation.
Bit 48 Size<2>: Always 0.
Bit 47 DP: Data parity bit.
Bit 46 TP: Tag parity bit.
Bits 45:43: Reserved for future implementation.
Bits 42:13: Physical page number.
Bits 12:7 Soft: See the UltraSPARC III Cu Processor User's Manual.
Bit 6: Lock bit.
Bit 3 E: See the UltraSPARC III Cu Processor User's Manual.
Bit 2 P: See the UltraSPARC III Cu Processor User's Manual.
Bit 1 W: See the UltraSPARC III Cu Processor User's Manual.
Note: See TABLE 13-5 for a detailed description of the fields.
An ASI load from the I-TLB Diagnostic Register initiates an internal read of the data portion of the specified I-TLB. If any instruction that misses the I-TLB is followed by a diagnostic read access (LDXA from ASI_ITLB_DATA_ACCESS_REG, i.e., ASI 0x55) from the fully associative...
4. Virtual Address Format. The virtual address format of the D-TLB Data Access register is described in TABLE 13-24.
TABLE 13-24 D-TLB Data Access register (Bit, Field, Description, R/W):
Bits 63:19 Reserved: Reserved for future implementation.
Bit 18 Mandatory value: Should be 0.
Bits 17:16: The TLB to access, as defined below: 0 = T16, 2 = T512_0, 3 = T512_1.
Bits 15:12 Reserved: Reserved for future implementation.
Bits 11:3 TLB Entry: The TLB entry number to be accessed, in the range 0-511. Not all TLBs will have all 512 entries; all TLBs, regardless of size, are accessed from 0 to N-1, where N is the number of entries in the TLB. For the T512s, bit 11 is used to select either way 0 or way 1, and bits 10:3 are used to access the specified index. For the T16, only bits 6:3 are used to access one of 16 entries.
Bits 2:0 Mandatory value: Should be 0.
Data Format. The D-TLB Data Access Register uses the TTE data format, with the addition of parity information in the T512s, as described in TABLE 13-21 for the T16 fields and TABLE 13-31 for the T512 fields.
Note: When writing to the T512_0 and T512_1, Size<2> (bit 48) for the T512_0 and Size<1> (bit 62) for the T512_1 are masked and forced to zero by the hardware. The Data Parity and Tag Parity bits (DP, bit 47, and TP, bit 46) are also masked by the hardware during writes. The parity bits a...
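A small helper, sketched below, shows how software could compose the virtual address for this ASI access from the fields in TABLE 13-24. The function and macro names are illustrative, not names defined by the manual.

    #include <stdint.h>

    #define DTLB_T16     0ULL   /* TLB select encodings from TABLE 13-24 */
    #define DTLB_T512_0  2ULL
    #define DTLB_T512_1  3ULL

    /* bits 17:16 select the TLB, bits 11:3 select the entry (0..511);
     * all other bits are mandatory zero. */
    static inline uint64_t dtlb_access_va(uint64_t tlb_sel, uint64_t entry)
    {
        return (tlb_sel << 16) | ((entry & 0x1ff) << 3);
    }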
5. The D-cache line matching Physical_Address is invalidated. If there is no matching D-cache entry, then the ASI store is a NOP.
Bits 42:5: Physical Address.
Bits 4:0 Mandatory value: Should be 0.
3.9 Write Cache Diagnostic Accesses. Three W-cache diagnostic accesses are supported:
- W-cache diagnostic state register access
- W-cache diagnostic data register access
- W-cache diagnostic tag register access
3.9.1 Write Cache Diagnostic State Register Access. ASI 0x38, per logical processor. Name: ASI_WCACHE_STATE.
The address format for W-cache diagnostic state register access is shown in TABLE 3-41.
TABLE 3-41 Write Cache Diagnostic State Access Address Format:
Bits 63:11 Mandatory value: Should be 0.
Bits 10:6 WC_entry: A 5-bit index (VA<10:6>) that selects a W-cache entry for ASI writes. These bits are don't-care for ASI reads.
Bits 5:0 Mandatory value: Should be 0.
The data format for W-cache diagnostic state register write access is shown in TABLE 3-42.
TABLE 3-42 Write Cache Diagnostic State Access Write Data Format:
Bits 63:2 Mandatory value: Should be 0.
Bits 1:0 wcache_state: A 2-bit W-cache state field, encoded as follows: 2'b00 Invalid, 2'b10 Owner, 2'b11 Modified.
3.9.2 The data format for W-cache diagnostic state register read access ...
6. (TABLE 4-1, Machine State After Reset and RED_state, continued)
Caches, data (includes data, instruction, prefetch, and write caches): Unknown / Unchanged.
Cache snooping: Enabled.
Cache data/tag Valid bits, Instruction Prefetch Buffer (IPB) data, Branch Target Buffer (BTB) data: Unchanged / Unchanged / Unchanged / Unchanged / Unchanged / 0 / Unchanged.
Instruction Queue: Empty. Store Queue: Empty / Empty / Unchanged.
I-TLB, D-TLB: mappings in the 2-way set-associative TLBs: Unknown / Unchanged / Unchanged / Unchanged; mappings in the fully associative TLBs: Unknown / Unknown / 0 (set invalid).
EI bit: 1 / 1 / 1. NC bit: 1 / 1.
CMT ASI extensions, SHARED registers:
ASI_CORE_AVAILABLE: Predefined value set by hardware; 1 for each implemented logical processor, 0 for other bits.
ASI_CORE_ENABLED: value of ASI_CORE_ENABLE at the time of deassertion of reset / value of ASI_CORE_ENABLE at the time of deassertion of reset / Unchanged.
ASI_CORE_ENABLE: 1 for each available logical processor, 0 for other bits; the value can be overwritten by the system controller during reset / value can be overwritten by the system controller during reset / Unchanged.
ASI_XIR_STEERING: value of ASI_CORE_ENABLE at the time of deassertion of reset / value of ASI_CORE_ENABLE at the time of deassertion of reset / Unchanged.
TABLE 4-1 (continued) Name: ASI_CMP_ERROR_STEERING, ASI_CORE_RUNNING_RW. Fields, Machine State After Reset and R...
7. ... in bit 46 and bit 47 of the D-TLB Diagnostic Registers. During bypass ASIs, the D-TLB does not flag parity errors. Through ASI_DTLB_DATA_ACCESS_REG, the tag and data parities are available, and on writes the tag and data parities can be supplied by the stored data. When ..., all tag and data parity bits will be cleared.
TABLE 13-18 summarizes the UltraSPARC IV processor D-MMU parity error behavior.
TABLE 13-18 D-MMU parity error behavior: for each combination of Parity Enable (DCR register bit 17), D-MMU Enable (DCU Control Register bit 3), operation (Translation, Demap, Read), and the TLB that hits (T512_0, T512_1, T16), the table entry is either "no trap taken" or "data_access_exception".
Note: An "x" in the table represents don't-cares.
13.2.2.3 / 13.2.2.4 Rules for Setting the Same Page Size on Both the T512_0 and T512_1. When both the T512s are programmed to have identical page sizes, they behave as if both T512s were a single 4-way, 1024-entry ...
8. (Bits, Field, Description)
Bits 63:17 Mandatory value: Should be 0.
Bits 16:15 IC_way: This 2-bit field selects a way (4-way associative): 2'b00 Way 0, 2'b01 Way 1, 2'b10 Way 2, 2'b11 Way 3.
Bits 14:7 IC_addr: This 8-bit index (VA<13:6>) selects a cache tag.
Bits 6:0 Mandatory value: Should be 0.
The data format of the Instruction Cache Snoop Tag is described below in TABLE 3-21.
TABLE 3-21 The Data Format of I-cache Snoop Tag:
Bits 63:38 Undefined: The value of these bits is undefined on reads and must be masked off by software.
Bit 37 IC_snoop_tag_parity: Odd parity value of the IC_snoop_tag fields.
Bits 36:8 IC_snoop_tag: The 29-bit physical tag field (PA<41:13>) of the associated instructions.
Bits 7:0 Undefined: The value of these bits is undefined on reads and must be masked off by software.
3.6 Instruction Prefetch Buffer Diagnostic Accesses
3.6.1 Instruction Prefetch Buffer Data Field Accesses. ASI 0x69, per logical processor. Name: ASI_IPB_DATA.
The address format of the instruction prefetch buffer (IPB) data array access is shown in TABLE 3-22.
TABLE 3-22 Instruction Prefetch Buffer Data Access Address Format:
Bits 63:10 Mandatory value: Should be 0.
Bits 9:3 IPB_addr: IPB_addr<9:7> is a 3-bit index (VA<9:7>) that selects one entry of the 8-entry prefetch data. IPB_addr<6> is used to select a 32-byte sub-block of the 64-byte cache line, and IPB...
9. FIGURE 5-3 Operational Flow Diagram for Controlling Event Counters (flow: on a context switch away from context A, PCR is saved to savePCR and PIC to savePIC, and 0 is written to PCR; in context B, PCR is programmed and PIC is read into r[rd]; on switching back to context A, savePCR is restored to PCR and savePIC to PIC, and PIC is again read into r[rd]).
Set up the PCR register as desired to select two events and the modes in which data should be collected. When more than two events need to be monitored, the program code sequence or code loop needs to be run again with the new events enabled; it is not possible to monitor more than two events at any given time. The monitoring must consider the real effects of the computer, which include calls to the system and interrupts. When used, the PCR register is considered part of a process state and must be saved and restored when switching process contexts. Multiple data collections can be done while the program executes to show ongoing statistics.
5.7.1 Performance Instrumentation Implementations. Counting events and cycle stalls is sometimes complex due to dynamic conditions and cancelled activities.
5.7.2 Performance Instrumentation Accuracy. The performance instrumentation counters are designed to provide reasona...
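The save/restore discipline in FIGURE 5-3 can be summarized in a short sketch. The accessors rd_pcr(), wr_pcr(), rd_pic(), and wr_pic() are placeholders for the privileged reads and writes of the %pcr and %pic registers; they are not functions defined by this manual.

    #include <stdint.h>

    extern uint64_t rd_pcr(void);  extern void wr_pcr(uint64_t v);
    extern uint64_t rd_pic(void);  extern void wr_pic(uint64_t v);

    struct perf_ctx { uint64_t saved_pcr, saved_pic; };

    void perf_switch_out(struct perf_ctx *ctx)      /* leaving the measured context */
    {
        ctx->saved_pcr = rd_pcr();   /* PCR -> savePCR */
        ctx->saved_pic = rd_pic();   /* PIC -> savePIC */
        wr_pcr(0);                   /* stop counting while another context runs */
    }

    void perf_switch_in(const struct perf_ctx *ctx) /* returning to the measured context */
    {
        wr_pic(ctx->saved_pic);      /* restore the counter values first */
        wr_pcr(ctx->saved_pcr);      /* then re-enable the selected events */
    }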
10. 13.1.3.2 I-TLB Tag Read Register. The UltraSPARC IV processor's behavior on a read of ASI_ITLB_TAG_READ_REG (ASI 0x56) is as follows:
- For the T16, the 64-bit data read is the same as in the UltraSPARC III processor and is backward compatible (see the UltraSPARC III Cu Processor User's Manual).
- For the T512, the bit positions of the VPN (virtual page number) within bits 63:13 change as follows:
  bits 63:21: VA<63:21> for 8 KB and 64 KB pages
  bits 20:13: I-TLB index, VA<23:16> for a 64 KB page
  bits 20:13: I-TLB index, VA<20:13> for an 8 KB page
  bits 12:0: Context<12:0>
The I-TLB tag array can be written during BIST mode to support back-to-back ASI writes and reads.
(13.1.3.3 through 13.1.3.5)
Demap Operation. For the demap page in the large I-TLB, the page size used to index the I-TLB is derived from the Context bits (primary, nucleus). The hardware will automatically select the proper PgSz bits based on the Context field (primary, nucleus) defined in ASI_IMMU_DEMAP (ASI 0x57). These two PgSz fields are used to properly index the T512. Demap operations in the T16 are single-cycle operations; all matching entries in the T16 are demapped in one cycle. Demap operations in the T512, however, are multi-cycle operations that demap one entry at a time.
I-TLB SRAM BIST Mode. Back-to-back ASI writes and reads to the ...
11. (Floating-point divide results, continued.) For each combination of operands the table lists the destination register written (rd), the exception flag(s) asserted, and the trap behavior. A QNaN result has sign 0, exponent all ones, and fraction all ones, and asserts nvc (an IEEE trap is taken if the trap is enabled). Zero and normal operand combinations produce 0 or a normal result with no flags set. Division by zero produces infinity and asserts dzc (plus nvc/nva for invalid combinations), with an IEEE trap taken if enabled. Normal/normal results can underflow or overflow (see 6.5). Infinity operands produce infinity with no flags set, except infinity divided by infinity, which produces a QNaN and asserts nvc/nva.
1. "IEEE trap" means fp_exception_IEEE_754.
6.3.5 Square Root. TABLE 6-7 SQUARE ROOT. Instruction: square root of rs2, FSQRT (rs2 -> rd). Normal floating-point square root: the result from the operation includes one or more of the following: a number in an f register (see "Trap Event" on page 132), an exception bit set (see TABLE 6-12), a trap occurring (see abbreviations in TABLE 6-12), underflow, overflow ...
12. (System bus data error handling, continued.)
I-cache fill request with UE in the critical 32-byte data from the system bus: UE logged; yes; ... installed in the I-cache; ... taken and the precise trap (fast_ecc_error) will be dropped.
I-cache fill request with CE in the non-critical 2nd 32-byte data from the system bus: CE logged; no; corrected data; no action; no action; disrupting trap.
I-cache fill request with UE in the non-critical 2nd 32-byte data from the system bus: UE logged; no; raw UE data; no action; no action; deferred trap.
D-cache load 32-byte fill request with CE in the critical 32-byte data from the system bus: CE logged; no; corrected data, corrected ECC; good data in D-cache; good data taken; disrupting trap.
D-cache load 32-byte fill request with UE in the critical 32-byte data from the system bus: UE logged; yes; raw UE data, raw ECC; bad data in D-cache; bad data dropped; deferred trap (the UE will be taken and the precise trap, fast_ecc_error, will be dropped).
D-cache load 32-byte fill request with CE in the non-critical 2nd 32-byte data from the system bus: CE logged; no; corrected data; no action; no action; disrupting trap.
D-cache load 32-byte fill request with UE in the non-critical 2nd 32-byte data from the system bus: UE logged; no; raw UE data; no action; no action; deferred trap.
D-cache FP 64-bit load fill request with CE in the critical 32-byte data from the system bus: CE logged; no; corrected data, corrected ECC; good data taken; disrupting trap.
D-cache FP 64-bit load fill request with UE in th...: bad data in D-...; deferred trap; UE will be ...
13. Dispatch Control Register (DCR), ASR 18. The Dispatch Control Register is accessed through ASR 18. This register should only be accessed in privileged mode; non-privileged accesses to this register cause a privileged_opcode trap. The Dispatch Control Register is described in TABLE 9-1.
Note: The bit fields IPE, DPE, ITPE, DTPE, and PPE are 0 by default (disabled) after power-on or system reset.
TABLE 9-1 Dispatch Control Register (1 of 2):
Bits 63:19 Reserved: Reserved for future implementation.
Bit 18 PPE: Prefetch Cache Parity Error Enable. If cleared, no parity checking is done at the Prefetch Cache SRAM arrays (data, physical tag, and virtual tag arrays).
Bit 17 DTPE: D-TLB Parity Error Enable. If cleared, no parity checking is done at the D-TLB arrays (data and tag arrays).
Bit 16 ITPE: I-TLB Parity Error Enable. If cleared, no parity checking is done at the I-TLB arrays (data, tag, and content-addressable memory arrays).
TABLE 9-1 Dispatch Control Register (2 of 2):
Bit 15 JPE: Jump Prediction Enable. If set, the BTB (Branch Target Buffer) is used to predict the target address of JMPL instructions that do not match the format of the RET/RETL synthetic instructions.
Branch Predictor Mode: Branch Predictor Mode can be configured to use a separate history register when operating in privileged mode. It can also be configured not to u...
14. If any or both of the appropriate trap enable masks are set (FSR.OFM = 1 or FSR.NXM = 1), only an IEEE overflow trap is generated (FSR.FTT = 1). The FSR.CEXC bit that is set follows the SPARC V9 architecture:
- If FSR.OFM = 0 and FSR.NXM = 1, then FSR.NXC = 1.
- If FSR.OFM = 1, independent of FSR.NXM, then FSR.OFC = 1 and FSR.NXC = 0.
6.8.2.2 Gross Underflow (Zero Result). Result is 0 with the correct sign. If the appropriate trap enable masks are not set (FSR.UFM = 0 and FSR.NXM = 0), set the FSR.AEXC and FSR.CEXC underflow and inexact flags (FSR.UFA = 1, FSR.NXA = 1, FSR.UFC = 1, and FSR.NXC = 1). A trap is not generated.
If either or both of the appropriate trap enable masks are set (FSR.UFM = 1 or FSR.NXM = 1), only an IEEE underflow trap is generated (FSR.FTT = 1 and FSR.CEXC.UF = 1). The FSR.CEXC bit that is set diverges from previous UltraSPARC implementations to follow the SPARC V9 architecture:
- If FSR.UFM = 0 and FSR.NXM = 1, then FSR.NXC = 1.
- If FSR.UFM = 1, independent of FSR.NXM, then FSR.UFC = 1 and FSR.NXC = 0.
6.8.2.3 Subnormal Handling Override. Result is a QNaN or SNaN:
- Subnormal with SNaN: QNaN, invalid exception generated. Standard mode: no unfinished_FPop. Non-standard mode: no FSR.NX.
- Subnormal with QNaN: QNaN, no exception generated. Standard mode: no unfinished_FPop. Non-standard mode: no FSR.NX.
Result already generates an exception (divide by zero or invalid operation):
- FSQRT (number less than zero): in...
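The gross-underflow flag rules quoted above can be written out as a small decision procedure. The sketch below is a software model of just those two bullet rules over an invented FSR structure; it is not the hardware logic and the field names are modeling aids only.

    /* Behavioral sketch of the gross-underflow CEXC/AEXC selection rules. */
    struct fsr_model {
        int ufm, nxm;            /* trap enable mask bits (FSR.TEM)   */
        int ufc, nxc;            /* current exception bits (FSR.CEXC) */
        int ufa, nxa;            /* accrued exception bits (FSR.AEXC) */
        int ftt;                 /* floating-point trap type          */
    };

    int gross_underflow(struct fsr_model *f)    /* returns 1 if a trap is taken */
    {
        if (!f->ufm && !f->nxm) {               /* untrapped: accrue uf and nx  */
            f->ufa = f->nxa = 1;
            f->ufc = f->nxc = 1;
            return 0;
        }
        f->ftt = 1;                             /* trapped: IEEE underflow only */
        if (!f->ufm && f->nxm) {
            f->nxc = 1;                         /* only nx enabled -> NXC       */
        } else {
            f->ufc = 1;                         /* UFM set -> UFC, NXC cleared  */
            f->nxc = 0;
        }
        return 1;
    }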
15. Note: A MEMBAR #Sync is required before and after a load or store to ASI_WCACHE_TAG.
3.10 Prefetch Cache Diagnostic Accesses. Four P-cache diagnostic accesses are supported:
- P-cache status data register access
- P-cache diagnostic data register access
- P-cache virtual tag/valid fields access
- P-cache snoop tag register access
3.10.1 Prefetch Cache Status Data Register Access. ASI 0x30, per logical processor. Name: ASI_PCACHE_STATUS_DATA.
The address format of the P-cache status data register access is shown in TABLE 3-49.
TABLE 3-49 Prefetch Cache Status Data Access Address Format:
Bits 63:11 Reserved: Reserved for future implementation.
Bits 10:9 PC_way: A 2-bit entry that selects an associative way (4-way associative): 2'b00 Way 0, 2'b01 Way 1, 2'b10 Way 2, 2'b11 Way 3.
Bits 8:6 PC_addr: A 3-bit index (VA<8:6>) that selects a P-cache entry.
Bits 5:0 Reserved: Reserved for future implementation.
The data format of the P-cache status data register is shown in TABLE 3-50.
TABLE 3-50 Data Format Bit Description:
Bits 63:58 Reserved: Reserved for future implementation.
Bits 57:50 Parity_bits: Data array parity bits (odd parity). A read-only field accessed through an ASI read. Not used in SRAM test mode.
Bit 49 Prefetch_Que_empty: 1 corresponds to empty, 0 corresponds t...
16. Rstall_FP_use (PICL): Stall cycles when the next instruction to be executed is stalled in the R-stage waiting for the result of a preceding floating-point instruction in the pipeline that is not yet available.
Rstall_IU_use (PICL): Stall cycles when the next instruction to be executed is stalled in the R-stage waiting for the result of a preceding integer instruction in the pipeline that is not yet available.
Note: If multiple events result in an R-stage stall in a given cycle, only one of the counters will be incremented, based on the following priority: Rstall_IU_use > Rstall_FP_use > Rstall_storeQ.
5.8.5 Recirculate Stall Counts. The counters listed in TABLE 5-10 count the stall cycles due to recirculation. The recirculation may happen due to a non-bypassable RAW hazard, a non-bypassable FPU condition, a load miss, or prefetch-queue-full conditions. These are also private counters.
TABLE 5-10 Counters for Recirculation:
Re_RAW_miss (PICU): Stall cycles due to recirculation when there is a load instruction in the E-stage of the pipeline that has a non-bypassable read-after-write (RAW) hazard with an earlier store instruction. Note that due to an implementation issue, this count also includes the stall cycles due to recirculation of prefetch requests when the prefetch queue is full (see the Re_PFQ_full description on page 105).
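The priority rule in the Note above amounts to a simple if/else chain; the sketch below illustrates it. The counter names are taken from the tables, while the stall-condition inputs are illustrative flags, not processor signal names.

    #include <stdint.h>

    struct rstall_counters { uint64_t iu_use, fp_use, storeq; };

    /* Only the highest-priority stall counter increments in a given cycle:
     * Rstall_IU_use > Rstall_FP_use > Rstall_storeQ. */
    void count_rstage_stall(struct rstall_counters *c,
                            int iu_stall, int fp_stall, int storeq_stall)
    {
        if (iu_stall)
            c->iu_use++;
        else if (fp_stall)
            c->fp_use++;
        else if (storeq_stall)
            c->storeq++;
    }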
17. ... (TABLE 7-18, L2 cache data CE and UE errors, continued. Columns: event; error logged in AFSR; fast_ecc_error trap; L2 cache data.)
P-cache prefetch-for-instruction (fcn 17) request with CE in the critical 32-byte or the 2nd 32-byte L2 cache data: EDC; original data, original ECC.
P-cache prefetch-for-instruction (fcn 17) request with UE in the critical 32-byte or the 2nd 32-byte L2 cache data: EDU; original data, original ECC.
P-cache HW prefetch request with CE in the critical 32-byte or the 2nd 32-byte L2 cache data: EDC; original data, original ECC.
P-cache HW prefetch request with UE in the critical 32-byte or the 2nd 32-byte L2 cache data: EDU; original data, original ECC.
TABLE 7-18 L2 cache data CE and UE errors (3 of 3). Columns: event; error logged in AFSR; fast_ecc_error trap; L2 cache data.
W-cache exclusive request with CE in the critical 32-byte or the 2nd 32-byte L2 cache data: EDC; no; original data, original ECC.
W-cache exclusive request with UE in the critical 32-byte or the 2nd 32-byte L2 cache data: EDU; no; original data, original ECC.
W-cache eviction request and CE in the critical 32-byte or the 2nd 32-byte L2 cache data: no error logged; the W-cache eviction data overwrites the original data.
W-cache eviction request and UE in the critical 32-byte or the 2nd 32-byte L2 cache data: the W-cache eviction d...
Direct ASI L2 data read request with CE in the 1st 32-byte or 2nd 32-byte L2 cache data: ...
18. ... Data Access: the TLB entry specified by the STXA address is written with the store data. Tag Access Register: written with the contents of the store data; no effect / no effect / no effect. SFSR: written with the fault status of the faulting instruction; no effect / no effect / no effect / no effect. Tag Access: written with the VA and context of the access, and written with the miss page sizes (the page size of the faulting context) for the two 2-way set-associative TLBs.
13.1.4 Translation Table Entry (TTE). The Translation Table Entry (TTE) holds information for a single page mapping. The TTE is divided into two 64-bit words, representing the tag and data of the translation. Just as in a hardware cache, the tag is used to determine whether there is a hit in the TSB; if there is a hit, the data is fetched by software. The configuration of the TTE is described in TABLE 13-5 (see also the UltraSPARC III Cu Processor User's Manual).
Note: All bootbus addresses must be mapped as side-effect pages, with the TTE E bit set.
TABLE 13-5 Translation Table Entry (TTE) (1 of 2):
Bit 63: See the UltraSPARC III Cu Processor User's Manual.
Bits 62:61 Size<1:0>: The encoded page size bits for the I-TLB: 00 = 8 KB page, 01 = 64 KB page, 10 = 512 KB page, 11 = 4 MB page.
Bit 60 NFO: ...
19. 7.4 Error Reporting Summary. TABLE 7-16 Error Reporting Summary (1 of 4). For each error event the table gives the AFSR status bit, the trap taken, the controlling enable bit, and whether the error is private to a logical processor or shared.
System unrecoverable error other than CPQ_TO, NCPQ_TO, TID_TO: PERR; no trap; Shared.
System unrecoverable error CPQ_TO, NCPQ_TO, TID_TO: PERR; Shared.
Internal unrecoverable error: IERR; no trap; Shared.
Parity error during transfer from fuse array to repairable SRAM: EFA_PAR_ERR; no trap; both LPs are logged.
Bit in Redundancy register is flipped for I-cache, D-cache, D-TLB, or I-TLB: RED_ERR; no trap; Private.
Bit in Redundancy register is flipped for L2 cache tag/data or L3 cache tag/data: RED_ERR; no trap; both LPs are logged.
System address parity error: ISAP; Shared.
Uncorrectable system bus data ECC, instruction fetch: UE; ...
Uncorrectable system bus data ECC, load, block load, or atomic instructions: UE; NCEEN; Private.
Uncorrectable system bus data ECC, store queue, RTO, or prefetch queue read: DUE; NCEEN; Private.
Uncorrectable system bus data ECC, interrupt vector fetch: IVU; ...
HW_corrected system bus data ECC, all but interrupt vector fetch: CE; CEEN; ...
HW_corrected system bus data ECC, interrupt vector fetch: IVC; CEEN; Private.
Uncorrectable system bus microtag ECC ...
20. D-cache FP 64-bit load request with UE in the 2nd 32-byte L3 cache data: L3_EDU; no; original data, original ECC; good data in D-cache; ...
D-cache block load request with CE in the 1st 32-byte or the 2nd 32-byte L3 cache data: L3_EDC; no; original data (original ECC) moved from L3 cache to L2 cache; good data in D-cache/P-cache; good data taken; disrupting trap.
D-cache block load request with UE in the 1st 32-byte or the 2nd 32-byte L3 cache data: L3_EDU; no; original data moved from L3 cache to L2 cache; bad data in P-cache buffer; bad data in FP register file; deferred trap.
D-cache atomic request with CE in the critical 32-byte L3 cache data: ...; original data, original ECC, moved from L3 cache to L2 cache; ... in W-cache; ... dropped; precise trap.
D-cache atomic request with UE in the critical 32-byte L3 cache data: L3_UCU; yes; original data, original ECC, moved from L3 cache to L2 cache; good data in W-cache; good data taken; precise trap. When the line is evicted out from the W-cache, the bad critical 32-byte data and the UE (again, based on the UE status bit) are signalled by the W-cache flipping the 2 least significant ECC check bits (C<1:0>) in both the lower and upper 16 bytes.
D-cache atomic request with CE in the 2nd 32-byte L3 cache data: L3_EDC; no; disrupting trap.
21. ECC is not checked for system bus data that returns with DSTAT = 1, 2, or 3. ECC errors may occur in either the data or the microtag field. The UltraSPARC IV processor can store only one data ECC syndrome and one microtag ECC syndrome for every 64 bytes of incoming data, even though it does detect errors in every 16 bytes of data. The syndrome of the first ECC error detected, whether HW_corrected or uncorrectable, is saved in an internal error register. If the first occurrence of an ECC error is uncorrectable, the error register is locked and all subsequent errors within the 64-byte block are ignored.
7.8.1.2 If the first occurrence of an ECC error is HW_corrected, then subsequent correctable errors within the 64-byte block will be corrected but not logged. A subsequent uncorrectable error will overwrite the syndrome of the correctable error; at this point, the error register is locked.
7.8.2 Signalling ECC Error. Not only does the UltraSPARC IV processor perform ECC checking for incoming system bus data, but it also generates ECC check bits for outgoing system bus data. A problem occurs when new ECC check bits are generated for data that contains an uncorrectable error (ECC or bus error): with new ECC check bits, there is no way to detect that the original data was bad. To fix this problem without having to add an additional interface, a new uncorrectable ECC error is injected into the ...
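The "first syndrome wins, lock on uncorrectable" behavior described above is a small state machine; the sketch below models it for one 64-byte fill. The structure and enum are modeling aids only, not processor register names.

    enum ecc_kind { ECC_NONE, ECC_CORRECTABLE, ECC_UNCORRECTABLE };

    struct syndrome_reg {
        int      valid;       /* a syndrome has been captured for this block */
        int      locked;      /* no further updates are accepted             */
        unsigned syndrome;
    };

    void log_ecc_error(struct syndrome_reg *r, enum ecc_kind kind, unsigned syndrome)
    {
        if (kind == ECC_NONE || r->locked)
            return;                         /* locked: later errors are ignored   */

        if (!r->valid) {
            r->syndrome = syndrome;         /* first error in the 64-byte block   */
            r->valid = 1;
            if (kind == ECC_UNCORRECTABLE)
                r->locked = 1;              /* first error uncorrectable: lock    */
        } else if (kind == ECC_UNCORRECTABLE) {
            r->syndrome = syndrome;         /* a UE overwrites a prior CE syndrome */
            r->locked = 1;
        }
        /* a later correctable error after a correctable one is corrected but not
         * logged, so nothing needs to be recorded in that case */
    }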
22. Bit 54 P-cache Mode:
Reset to 0: When weak prefetches (function code 0x0, 0x1, 0x2, or 0x3) miss the TLB, a TLB miss trap is not taken. When strong prefetches (function code 0x14, 0x15, 0x16, or 0x17) miss the TLB, a TLB miss trap is taken. Weak prefetches are not recirculated if the prefetch queue is full; strong prefetches are recirculated if the prefetch queue is full.
Set to 1: When weak prefetches miss the TLB, a TLB miss trap is not taken. When strong prefetches miss the TLB, a TLB miss trap is taken. Weak prefetches are recirculated if the prefetch queue is full; strong prefetches are recirculated if the prefetch queue is full.
Bits 53:52: Programmable Instruction Prefetch Stride: 00 = no prefetch, 01 = 64 bytes, 10 = 128 bytes, 11 = 192 bytes.
Bits 51:50: Programmable P-cache Prefetch Stride: 00 = no prefetch, 01 = 64 bytes, 10 = 128 bytes, 11 = 192 bytes.
Bit 49: The UltraSPARC IV processor implements this cacheability bit as described in the UltraSPARC III Cu Processor User's Manual.
Bit 48: The UltraSPARC IV processor implements this cacheability bit as described in the UltraSPARC III Cu Processor User's Manual.
Bit 47: Non-cacheable Store Merging Enable. If cleared, no merging or coalescing of noncacheable, non-side-effect store data occurs; each non-cacheable store generates a system bus (Fireplane) transaction.
Bit 46: RAW Bypass Enable. If cleared, no bypassing of data from the store queue to a dependent load instruction ...
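The stride encodings above map directly to byte counts, as the tiny helper below illustrates. The field positions used in the second function (bits 51:50 for the P-cache stride) follow the list above; treat the extraction helper itself as an illustration rather than a definitive DCUCR accessor.

    #include <stdint.h>

    /* 00 = no prefetch, 01 = 64 bytes, 10 = 128 bytes, 11 = 192 bytes */
    static inline unsigned prefetch_stride_bytes(unsigned two_bit_code)
    {
        return (two_bit_code & 0x3) * 64;
    }

    static inline unsigned dcucr_pcache_prefetch_stride(uint64_t dcucr)
    {
        return prefetch_stride_bytes((unsigned)(dcucr >> 50));  /* bits 51:50 */
    }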
23. ... The FSR.AEXC field is unchanged. The FSR.CEXC field is unchanged. The FSR.FTT field is set to ... (table legend: "no change" / "appropriate bit is set to 1").
6.4.5 Trap Priority. Traps generated by floating-point exceptions (fp_disabled, fp_exception_IEEE_754, and fp_exception_other) are prioritized.
6.5 IEEE Traps. Underflow, overflow, inexact, division-by-zero, and invalid IEEE traps are supported in standard and non-standard modes. These traps are listed in TABLE 6-12 and operate according to the IEEE 754-1985 Standard.
6.5.1 IEEE Trap Enable Mask (TEM). Individual IEEE traps (nv, of, uf, dz, and nx) are masked by the FSR.TEM bits. When a trap is masked and an exception is detected, the appropriate FSR.CEXC bit(s) are set and the destination register is written with the data shown in TABLE 6-3, TABLE 6-4, TABLE 6-5, TABLE 6-6, TABLE 6-7, TABLE 6-8, and TABLE 6-9.
6.5.2 IEEE Invalid (nv) Trap. The IEEE invalid exception (nv) is generated when either the source operand to a mathematical operation is a NaN (signaling or quiet) or the result of a mathematical operation does not fit in the integer format. The nv trap for an invalid case can be masked using the FSR register.
6.5.3 IEEE Overflow (of) Trap. When an overflow occurs, the inexact flag is also set.
- If an overflow occurs and the IEEE overflow (of) and invalid (nv) traps are enabled (FSR.TEM.NVM ...
24. UltraSPARC IV Processor User's Manual Supplement. Sun Microsystems. Version 1.0, October 2005.
Copyright 2005 Sun Microsystems, Inc., 4150 Network Circle, Santa Clara, California 95054, U.S.A. All rights reserved.
Sun Microsystems, Inc. has intellectual property rights relating to technology embodied in the product that is described in this document. In particular, and without limitation, these intellectual property rights may include one or more of the U.S. patents listed at http://www.sun.com/patents and one or more additional patents or pending patent applications in the U.S. and in other countries.
This document and the product to which it pertains are distributed under licenses restricting their use, copying, distribution, and decompilation. No part of the product or of this document may be reproduced in any form by any means without prior written authorization of Sun and its licensors, if any. Third-party software, including font technology, is copyrighted and licensed from Sun suppliers.
Sun, Sun Microsystems, the Sun logo, Java, Solaris, UltraSPARC IV+, UltraSPARC IV, UltraSPARC III Cu, UltraSPARC, Sun Fireplane Interconnect, VIS, and OpenBoot PROM are trademarks or registered trademarks of Sun Microsystems, Inc. in the U.S. and other countries. All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. in the U.S. and other countries. Pro...
25. Underflow occurs when the result of an operation, before rounding, is less than that representable by a normal number. After rounding, the tiny number (underflow) is usually represented by a subnormal number, but it may equal the smallest normal number if the unrounded result is just below the range of normal numbers and the rounding mode specified in FSR.RD moves the value into the normal number range. The underflow result will be zero, subnormal, or the smallest normal value.
Note: The floating-point unit does not support exponent wrapping for underflow or overflow.
6.6.1 Trapped Underflow. The floating-point unit will trap on underflow if the FSR.TEM.UFM bit is set to 1. Because tininess is detected before rounding, trapped underflow occurs when the exact unrounded result has a magnitude between zero and the smallest representable normal number in the precision of the destination format. When underflow is trapped, the destination and other registers are left unchanged. See "Trap Event" on page 132.
6.6.2 Untrapped Underflow. If the FSR.TEM.UFM bit is set to 0, the floating-point unit will not generate an underflow trap when an underflow occurs. If the result causes an underflow and the result after rounding is exact, the floating-point unit will not generate an inexact trap. Tininess detection before rounding is summarized in TABLE 6-15, using the following terms:
- u is the u...
26. ... (TABLE 7-18 L2 cache data CE and UE errors, 2 of 3. Columns: event; error logged in AFSR; fast_ecc_error trap; L2 cache data.)
D-cache atomic request with CE in the 2nd 32-byte L2 cache data: no fast_ecc_error trap; original data, original ECC.
D-cache atomic request with UE in the 2nd 32-byte L2 cache data: original data, original ECC.
P-cache prefetch for several reads (function codes 0, 20), for one read (1, 21), for several writes (2, 22), and for one write (3, 23), with CE or with UE in the critical 32-byte or the 2nd 32-byte L2 cache data: in each case the L2 cache data is returned as the original data with the original ECC.
27. ...'s Manual for general information about the XIR steering register.
XIR steering register: ASI 0x41, VA<63:0> = 0x30. Name: ASI_XIR_STEERING. Access: Read/Write. Privileged-access ASI register. See "XIR Steering Register (ASI_XIR_STEERING)" on page 21 for details about the XIR steering register in the UltraSPARC IV processor.
Watchdog Reset (WDR) and error_state. Please refer to the UltraSPARC III Cu Processor User's Manual for the description of the watchdog reset and error_state.
Software-Initiated Reset (SIR). Please refer to the UltraSPARC III Cu Processor User's Manual for the description of software-initiated reset.
4.3 RED_state Trap Vector. When an earlier UltraSPARC processor processes a reset or trap that enters RED_state, that processor takes a trap at an offset relative to the RED_state trap vector base address (RSTVaddr) at virtual address FFFF FFFF F000 0000. In the UltraSPARC IV processor, this base address passes through to physical address 7FF F000 0000 (hexadecimal).
4.4 Machine States After Reset. TABLE 4-1 shows the machine states created as a result of any reset or after RED_state is entered. RSTVaddr is often abbreviated as RSTV in the table.
TABLE 4-1 Machine State After Reset and RED_state (1 of 5):
Integer registers: Unknown / Unchanged / Unchanged.
Floating-point reg...
28. Note: Size bits are stored in the tag array of the T512 in the UltraSPARC IV processor to correctly select bits for different page sizes. Page size bits are returned by all operations. When writing to the T512, the two most significant size bits (Size<2:1>, bit 48 and bit 62) are masked and forced to zero by the hardware. The Data Parity and Tag Parity bits (DP, bit 47, and TP, bit 46) are masked by the hardware during writes. The parity bits are calculated by the hardware and written to the corresponding I-TLB entry when replacement occurs, as mentioned in "TLB Parity Protection" on page 304.
Tag Read Register. ASI 0x56, VA<63:0> = 0x00 to 0x20FF8. Name: ASI_ITLB_TAG_READ_REG. Access: RW.
Virtual Address Format: The virtual address format of the I-TLB Tag Read Register ... Note: Bits 2:0 are 0.
Data Format: The data format of the Tag Read Register is described in TABLE 13-9 and TABLE 13-10.
TABLE 13-9 Tag Read Register Data Format for T512:
Bits 63:21 VA: T512 VA<63:21> for both 8 KB and 64 KB pages.
Bits 20:13 Index: T512 VA<20:13> for 8 KB pages, VA<23:16> for 64 KB pages.
Bits 12:0 Context: The 13-bit context identifier.
TABLE 13-10 Tag Read Register Data Format for T16:
Bits 63:13 VA (T16): The 51-bit virtual page number. In the fully associative TLB, page-offset bits ...
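The T512 layout in TABLE 13-9 can be unpacked with simple shifts and masks, as sketched below. The structure and function names are illustrative; the bit positions come directly from the table above.

    #include <stdint.h>

    struct t512_tag { uint64_t va_63_21; unsigned index; unsigned context; };

    struct t512_tag decode_t512_tag_read(uint64_t v)
    {
        struct t512_tag t;
        t.va_63_21 = v >> 21;           /* VA<63:21> (8 KB and 64 KB pages)          */
        t.index    = (v >> 13) & 0xff;  /* VA<20:13> for 8 KB, VA<23:16> for 64 KB   */
        t.context  = v & 0x1fff;        /* Context<12:0>                             */
        return t;
    }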
29. ... = 1), an fp_exception_IEEE_754 is generated.
- If the overflow trap is masked and the operation is valid, the destination register (rd) receives infinity.
The overflow trap is caused when the result of an arithmetic operation exceeds the range supported by the floating-point or integer number precision. This condition can happen in many different cases, as listed in the tables of this section.
6.5.4 IEEE Underflow (uf) Trap. When a normal number underflows, the inexact flag is also set. Underflow is detected before rounding. The underflow condition leads to a subnormal result unless gross underflow is detected; in that case the result is 0 and the inexact flag is asserted. Underflow is discussed in detail in "Underflow Operation" on page 134.
6.5.5 IEEE Divide-by-Zero (dz) Trap. When a number is divided by zero, the divide-by-zero flag is asserted and an IEEE_exception is generated if enabled. The dz flag and trap can only be generated by the FDIV instruction.
6.5.6 IEEE Inexact (nx) Trap. When an inexact condition occurs, the processor sets the FSR.AEXC.NXA and/or the FSR.CEXC.NXC bits whenever the rounded result of an operation differs from the precise result.
- The inexact flag is asserted for most overflow or underflow conditions.
- The inexact trap is caused when the ideal result cannot fit into the destination format. This occurs for:
  - Most square root operations
  - Some add/subtract ...
30. ... 25-ohm pulldown. 03_16 (DTL-2): termination pullup and 25-ohm pulldown. See TABLE 11-3 for DTL pin configuration.
Bits 48:45 Module revision: Written at boot time by the OpenBoot PROM (OBP) code, which reads it from the module serial PROM.
TABLE 11-2 FIREPLANE_CONFIG Register Format (2 of 3):
Bits 44:39 Module type: Written at boot time by OBP code, which reads it from the module serial PROM.
Bit 38 Timeout Freeze mode: If set, all timeout counters are reset and stop counting.
Bits 37:34 TOL, Timeout Log value: The timeout period is 2^(10 + 2 x TOL) Sun Fireplane Interconnect cycles. Setting TOL >= 10 results in the maximum timeout period of 2^30 Sun Fireplane Interconnect cycles; setting TOL = 9 results in a Sun Fireplane Interconnect timeout period of 1.75 seconds. A TOL value of 0 should not be used, since the timeout could occur immediately or as much as 2^10 Sun Fireplane Interconnect cycles later.
Bit 33 CLKB: Processor-to-Sun Fireplane Interconnect clock ratio, bit 3. This field may only be written during initialization, before any Sun Fireplane Interconnect transactions are initiated. Please refer to "Additional CLK Encoding in the Sun Fireplane Interconnect Configuration Register" on page 292. 0 ...
31. ... 3'b100: late-write SRAM (default). Hardwired to 1'b0. Default to 32 MB L3 cache size, hardwired to 3'b100. EC_clock: 4'b0101, default to 8:1 L3 cache clock ratio, 8-8-8 L3 cache mode. EC_ECC_en: no ECC checking on L3 cache data bits. EC_ECC_force: ECC bits for writing are not picked from the L3 cache control register. EC_check: ECC bits to be forced onto the L3 cache data bus.
Reset and RED_state. This chapter defines and describes RED_state (Reset, Error, and Debug state) in the following sections.
Chapter Topics:
- RED_state Characteristics on page 83
- Resets on page 84
- RED_state Trap Vector on page 85
- Machine States After Reset on page 86
In the UltraSPARC IV processor, RED_state, externally initiated reset (XIR), watchdog reset (WDR), and software-initiated reset (SIR) can apply to one logical processor but not the other, while a hard power-on reset (hard POR) or a system reset (soft POR) always applies to both logical processors.
4.1 RED_state Characteristics. A reset or trap that sets PSTATE.RED (including a trap occurring while in RED_state) clears the Data Cache Unit Control Register, including the enable bits for the I-cache, D-cache, I-MMU, D-MMU, and the virtual and physical watchpoints. The characteristi...
32. 0x46 ASI_DCACHE_DATA: D-cache data RAM diagnostic access; RW; Private.
0x47 ASI_DCACHE_TAG: D-cache tag/valid RAM diagnostic access; RW; Private.
0x48 Reserved: Reserved for future implementation; Private.
0x49 Reserved: Reserved for future implementation; Private.
0x4A ASI_FIREPLANE_CONFIG_REG: Sun Fireplane Interconnect config register; Shared.
0x4A ASI_FIREPLANE_CONFIG_REG: Sun Fireplane Interconnect address register; Shared.
0x4A ASI_FIREPLANE_CONFIG_REG_2: Sun Fireplane Interconnect config register 2; Shared.
0x4A ASI_EMU_ACTIVITY_STATUS: EMU activity status register; R; Shared.
0x4B ASI_L3STATE_ERROR_EN_REG (VA 0x00): L3 cache error enable register; Shared.
0x4C ASI_ASYNC_FAULT_STATUS: Cregs async fault status register (AFSR); Private.
0x4C ASI_SECOND_ASYNC_FAULT_STATUS: Cregs secondary async fault status register (AFSR_2); Private.
0x4C ASI_ASYNC_FAULT_STATUS_EXT: Cregs async fault status extension register (AFSR_EXT); R; Private.
TABLE 10-2 The UltraSPARC IV processor ASI Extensions (3 of 5). (Value, ASI Name / Suggested Macro Syntax, Description, R/W, Private/Shared)
0x4C ASI_SECOND_ASYNC_FAULT_STATUS_EXT: Cregs secondary async fault status extension register (AFSR_EXT_2); R; Private.
0x4D Reserved: Reserved for future implementation; Private.
0x4E ASI_L3CACHE_TAG: L3 cache tag RAM data diagnostic access; RW; Shared.
0x4F ASI_SCRATCHPAD_0_REG: Scratchpad register 0; RW; Private.
0x4F ASI_SCRATCHP...
33. A D-TLB miss fast trap handler utilizes the automatic hardware replacement write, using a store to ASI_DTLB_DATA_IN_REG. When a D-TLB miss, data_access_exception, or fast_data_access_protection is detected, the hardware automatically saves the missing VA and context to the Tag Access Register (ASI_DMMU_TAG_ACCESS). The missing page size information of the T512_0 and T512_1 is captured into the Tag Access Extension Register, ASI_DMMU_TAG_ACCESS_EXT (see "D-TLB Tag Access Extension Registers" on page 331). This information is used during replacement. The hardware D-TLB replacement algorithm is as follows.
Note: PgSz0 below is ASI_DMMU_TAG_ACCESS_EXT<18:16>, the bits corresponding to the page size of T512_0, and PgSz1 is ASI_DMMU_TAG_ACCESS_EXT<21:19>, the bits corresponding to the page size of T512_1.
    // D-TLB replacement pseudocode
    if (TTE to fill is a locked page (TTE L bit is set))
        fill TTE to T16
    else if (both T512s have the same page size (PgSz0 == PgSz1)) {
        if (TTE's Size != PgSz0)
            fill TTE to T16
        else if (one of the 4 same-index entries is invalid)
            fill TTE to an invalid entry, with selection order T512_0 way0, T512_0 wa...
34. CMT 10
concatenation of bit vectors xxiv
condition code register 86
conventions: font xxiii; notational xxiv
Correctable_Error (CEEN) trap 258
corrected_ECC_error trap 258
counters: branch prediction statistics 103; floating-point operation statistics 116; IU stall 103; memory access statistics 106; memory controller statistics 111; recirculate 105; R-stage stall 104; software statistics 115; system interface statistics 115
cycles, accumulated count 102
D
D pipeline stage 103
data alignment 96
data cache: access statistics 106; and RED_state 83; bypassing 25; data fields access 48; data parity error 262; description 25; diagnostic accesses 48; enable bit 275; error detection 152; error recovery 151; flushing 25, 28; invalidation 51; microtag fields access 50; physical tag parity error 263; snoop tag parity error 264; tag/valid fields access 49
data cache unit control register (DCUCR) 273
data_access_error exception 163, 171, 172, 175, 176, 197, 199, 255
data_access_exception exception 79, 80, 277, 281
D-cache: hit rate 96; line 96; logical organization, illustrated 96; misses 96; organization 96; sub-block 96; timing 96
DCACHE_DATA: DC_addr field 48; DC_data field 48, 49; DC_data_parity field 48; DC_parity field 48; DC_way field 48
dcache_parity_error exception 253, 254, 259
DCACHE_SNOOP_TAG: DC_addr field 51; DC_way field 51
D...
35. Displacing the wrong line by mistake from the D-cache would give a data correctness problem. Displacing the wrong line by mistake from the L2 cache and L3 cache will only lead to the same error being reported twice; the second time the error is reported, the AFAR is likely to be correct. Corrupt data is never stored in the D-cache without a trap being generated to allow it to be cleared out.
Note: While the above code description appears only to be appropriate for correctable L2 cache data and tag errors, it is actually effective for uncorrectable L2 cache data errors as well. In the event that it is handling an uncorrectable error, the victimize at step 3 ("Evict the L2 cache line that contained the error") will write it out to L3 cache. If the L2 cache still returns an uncorrectable data ECC error when the processor reads it to perform an L2 cache writeback, the WDU bit will be set in the AFSR during this trap handler, which would generate a disrupting trap later if it was not cleared somewhere in this handler. In this case, the processor will write deliberately bad (signalling) ECC back to L3 cache.
In the event that it is handling an uncorrectable error, the victimize at step 4 ("Evict the L3 cache line that contained the error") will either invalidate the line in error or, if it is in M, O, or Os state, write it out to memory. If the L3 cache still returns an uncorrectable data ECC error when the processor reads it to perform an L3 ...
36. If hw_ecc_gen_en is not set, the ECC specified in the data register will be written into the index; the hardware-generated ECC will be ignored in that case.
Note: For the one-entry address buffer used in the L2-cache-off mode, no ASI diagnostic access is supported. The L2 cache tag array can be accessed as usual in the L2-cache-off mode; however, the returned value is not guaranteed to be correct, since the SRAM can be defective (and this may be the reason to turn off the L2 cache).
3.11.2.1 Notes on L2 cache Tag ECC. The ECC value of a zero L2 cache tag is also zero. Thus, after ASI_SRAM_FAST_INIT_SHARED (STXA), the ECC value is correct and all lines will be in the INVALID state.
L2 cache tag ECC checking is carried out regardless of whether the L2 cache line is valid or not.
If an L2 cache tag diagnostic access encounters an L2 cache tag CE, the returned data will not be corrected; raw L2 cache tag data will be returned, regardless of whether L2_tag_ecc_en in the L2 cache control register (ASI 0x6D) is set or not.
If there is an ASI write request to the L2 cache tag (this does not include displacement flush) and the ASI write request wins the L2/L3 pipeline arbitration, it will be sent to the L2/L3 pipeline to access the L2 tag and L3 tag at the same time. Within 15 cycles, if there is another request (I-cache, D-cache, P-cache, SIU snoop, or SIU copyback request) to the same cache index following the ASI request, the second access will get inco...
37. Note: The I-TLB tag parity calculation can ignore Size<2:1>, as these bits are always 0. The I-TLB data parity calculation can ignore the NFO, IE, E, and W bits, as these also are always 0. Due to a physical implementation constraint, a parity error will still be reported if these bits are flipped in the T512. The CV bit is included in the I-TLB data parity calculation, as this bit is maintained in the UltraSPARC IV processor (in the UltraSPARC III processor, the CV bit was read as zero and writes to it were ignored).
Parity bits are available in the same cycle that the tag and data are sent to the I-TLB. The tag parity is written to bit 60 of the tag array, while the data parity is written to bit 35 of the data array. During I-TLB translation, the I-TLB generates parities on both tag and data and then checks them against the parity bits previously written during replacement. The T512 has 4 parity trees, 2 per way, with one for the tag and one for the data arrays. Parity checking takes place in parallel with tag comparison. A tag and/or data parity error is reported as an instruction_access_exception, with the fault status recorded in the I-SFSR register (Fault Type = 0x20).
Note: Both tag and data parities are checked even for invalid I-TLB entries. When a trap is taken on an I-TLB parity error, software needs to invalidate the corresponding entry and write the entry with good parity. The I-TLB tag and data parity errors are masked ...
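A parity computation of the kind described above can be sketched in a few lines. The mask argument and the choice of odd parity are assumptions for illustration (the text only states which bits may be ignored, and odd parity is the convention used elsewhere in this chapter, for example for the I-cache snoop tag).

    #include <stdint.h>

    /* Odd parity: the returned bit makes the total number of ones
     * (payload plus parity bit) odd. */
    static inline unsigned odd_parity64(uint64_t v)
    {
        v ^= v >> 32; v ^= v >> 16; v ^= v >> 8;
        v ^= v >> 4;  v ^= v >> 2;  v ^= v >> 1;
        return (unsigned)(~v & 1);
    }

    /* ignore_mask covers fields excluded from the calculation
     * (for example Size<2:1> for the tag, NFO/IE/E/W for the data). */
    unsigned itlb_parity(uint64_t tag_or_data, uint64_t ignore_mask)
    {
        return odd_parity64(tag_or_data & ~ignore_mask);
    }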
38. ... : Nucleus context's page size for the T512 in the I-TLB.
Bits 54:23 Reserved: Reserved for future implementation.
Bits 24:22, 21:19 (P_pgsz1), 18:16 (P_pgsz_I): Primary context's page size for the T512 in the I-TLB.
Bits 15:13 Reserved: Reserved for future implementation.
Bits 12:0: Context identifier for the primary address space.
Page size bit encoding (the two most significant bits are reserved to 0 in the I-MMU): 000 = 8 KB, 001 = 64 KB.
Note: The ASI_PRIMARY_CONTEXT_REG resides in the D-MMU. There is no I-MMU Secondary Context Register. When changing the page size of the primary or nucleus context for the T512, the code must reside in a T16 page. A FLUSH must be executed after programming a page size change that affects the I-TLB; this is to ensure that instructions fetched after the change are translated correctly.
I-TLB Access Operation. When an instruction memory access is issued, its VA, Context, and PgSz are presented to the I-MMU. Both I-TLBs (T512 and T16) are accessed in parallel. The fully associative T16 only needs the VA and the Context to CAM-match and output an entry (1 out of 16). The proper VA bits are compared based on the page size bits of each T16 entry (a 3-bit encoding is used to define 8 KB, 64 KB, 512 KB, and 4 MB pages). Since the T512 is not fully associative, indexing the T512 array requires knowledge of the page size to properly select the VA bits (8 bits) to be used as the index, as shown below: if an 8 KB page is sele...
39. S: invalid, R_RTO, invalid, invalid, invalid, invalid; retry; MTag miss, miss, miss; hit, hit; LPA: R_RTO, R_WS, R_WS.
TABLE 3-3 Combined State, MODE, Hit/Miss, State Change, and Transaction Generated for Processor Action (2 of 2)
Processor action: Store / Block Store / Write / Swap / Block Store Commit / Prefetch
miss miss miss; R; SSM hit RTO WS WS; SSM & MTag miss miss miss; S hit LPA RTO R_WS R_WS; O SSM & MTag miss LPA & retry invalid R_RTO invalid; SSM & MTag miss hit LPA R_RTO; SSM invalid; MTag miss hit hit; Os (legal only in SSM mode) LPA R_RTO; SSM & MTag miss LPA & Reset invalid R_RTO invalid invalid invalid invalid retry; MTag miss miss miss; hit hit; LPA R_RTO R_WS R_WS; miss SSM hit hit hit hit hit WS; SSM & miss LPA hit hit hit hit R_WS hit; M SSM & LPA & retry invalid; SSM & miss LPA hit hit hit hit R_WS hit
TABLE 3-4 Combined Tag/MTag States (MTag State, CTag State): gM cM I; M cO I; O cE I; E cS I; S cI I; I I
TABLE 3-5 Deriving DTags, CTags, and MTags from Combined Tags. Combined Tags (CCTags)
3.3.2 Snoop Output and Input T...
40. ... TLB will be incorrect.
13.2.8 Translation Lookaside Buffer Hardware
13.2.8.1 TLB Replacement Policy. On an automatic replacement write to the TLB, the D-MMU picks the entry to write. The rules used for picking the entry to write are given in detail in "D-TLB Automatic Replacement" on page 324. If replacement is directed to the fully associative TLB, then the following alternatives are evaluated:
a. The first invalid entry is replaced (measuring from entry 0). If there is no invalid entry, then
b. The first unused (unlocked, LRU bit clear) entry will be replaced (measuring from entry 0). If there is no unused, unlocked entry, then
c. All used bits are reset and the process is repeated from Step b.
The replacement operation is undefined if all entries in the fully associative TLB have their lock bit set. For the 2-way set-associative TLBs, a pseudo-random replacement algorithm is used to select a way.
INDEX
A
accesses: branch prediction diagnostic 47; data cache diagnostic 48; Fireplane Configuration Register 288, 293; instruction cache diagnostic 39; L2 cache diagnostics 58; L3 cache diagnostics 66; prefetch cache diagnostic 54
address: aliasing 24; branch prediction array access format 47; branch target buffer access format 47; data cache data access format 48; i...
41. The Write cache is a 2 KB, fully associative cache with a 64-byte line size. The W-cache uses a FIFO (first-in, first-out) replacement policy. The UltraSPARC IV processor's W-cache has no parity protection. The W-cache is included in the L2 cache; however, write data is not immediately updated in the L2 cache. The L2 cache line gets updated either when the line is evicted from the W-cache or when a primary cache of either logical processor sends a read request to the L2 cache, in which case the L2 cache probes the W-cache, sends the data from the W-cache to the primary cache, and subsequently updates the L2 cache line. Since the W-cache is inclusive in the L2 cache, flushing the L2 cache ensures that the W-cache has also been flushed.
3.1.3.2 / 3.1.4 / 3.1.4.1 L2 cache and L3 cache. The L2 cache is a unified, 2 MB, 4-way set-associative cache with a 64-byte line size. It is a writeback, write-allocate cache and uses a pseudo-LRU replacement policy. The L2 cache includes the contents of the I-cache, the D-cache, and the W-cache in both logical processors; thus, invalidating a line in the L2 cache will also invalidate the corresponding line(s) in the I-cache, D-cache, or W-cache. The inclusion property is not maintained between the L2 cache and the P-cache.
The L3 cache is a unified, 32 MB, 4-way set-associative cache with a 64-byte line size. It is a writeback, write-allocate ...
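As a worked example of the L2 cache geometry stated above: 2 MB divided by 4 ways and 64-byte lines gives 8192 sets, so a set index needs 13 bits and, with the 6-bit line offset, occupies PA<18:6>. The constants below are taken from the text; the index position is derived arithmetic, not a quotation, and the helper name is illustrative.

    #include <stdint.h>

    enum {
        L2_SIZE = 2 * 1024 * 1024,              /* 2 MB, 4 ways, 64-byte lines */
        L2_WAYS = 4,
        L2_LINE = 64,
        L2_SETS = L2_SIZE / L2_WAYS / L2_LINE   /* 8192 sets */
    };

    static inline unsigned l2_index(uint64_t pa)
    {
        return (unsigned)((pa / L2_LINE) % L2_SETS);   /* PA<18:6> */
    }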
42. ... UCC, TUE_SH, TUE, L3_TUE_SH, L3_TUE, L3_UCU, L3_UCC.
Class 3: UE, DUE, EDU, EMU, WDU, CPU, L3_EDU, L3_WDU, L3_CPU.
Class 2: CE, EDC, EMC, WDC, CPC, THCE, L3_THCE, L3_EDC, L3_WDC, L3_CPC.
Class 1 (the lowest priority): TO, DTO, BERR, DBERR.
Class 5 errors are hardware timeouts associated with the EESR status bits CPQ_TO and NCPQ_TO. These are transactions that have exceeded the permitted time (unlike AFSR TO events, which are just transactions for which the system bus did not assert MAPPED). These are all fatal errors. AFSR PERR events other than these three do not capture AFAR1.
Priority for AFAR1 updates: PERR > (UCU, UCC, TUE_SH, TUE, L3_TUE_SH, L3_TUE, L3_UCU, L3_UCC) > (UE, DUE, EDU, EMU, WDU, CPU, L3_EDU, L3_WDU, L3_CPU) > (CE, EDC, EMC, WDC, CPC, THCE, L3_THCE, L3_EDC, L3_WDC, L3_CPC) > (TO, DTO, BERR, DBERR).
There is one exception to the above AFAR1 overwrite policy. Within the same priority class, it is possible that multiple errors from the system bus, L2 cache tag, L2 cache data, L3 cache tag, and L3 cache data might be reported in the same clock cycle. For this case, all the errors will be logged into AFSR1/AFSR1_EXT, and the priority for AFAR1 is: system bus error > L3 cache data error > L3 cache tag error > L2 cache data error > L2 cache tag error. Note that when an L2 cache tag correctable error occurs, the request will be retried (except snoop requests) and no L2 cache data errors will be reported. If an L2 cache tag unc...
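The AFAR1 overwrite rule reduces to a numeric comparison of the class rankings listed above. The sketch below illustrates that comparison; mapping a particular AFSR status bit to its class is left out, and the enum encoding is an assumption for illustration.

    /* 5 = highest priority (PERR-type), 1 = lowest (TO/BERR-type). */
    enum afar_class { CLASS_NONE = 0, CLASS_TO = 1, CLASS_CE = 2,
                      CLASS_UE = 3, CLASS_UCU = 4, CLASS_PERR = 5 };

    /* Nonzero when a newly reported error of class `incoming` may overwrite
     * an AFAR1 value currently held for an error of class `held`. */
    static inline int afar1_may_overwrite(enum afar_class held,
                                          enum afar_class incoming)
    {
        return incoming > held;
    }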
43. L2 cache Data ECC Error (UCC). When an instruction fetch misses the I-cache, a load-like instruction misses the D-cache, or an atomic operation is performed, and it hits the L2 cache, data is read from the L2 cache SRAM and checked for the correctness of its ECC. If a single-bit error is detected in the critical 32-byte data (for load and atomic operations) or in either the critical or non-critical 32-byte data (for an I-cache fetch), the UCC bit will be set to log this error condition. This is a SW_correctable error. A precise fast_ECC_error trap will be generated, provided that the UCEEN bit of the Error Enable Register is set.
For correctness, a software-initiated flush of the D-cache is required, because the faulty word will already have been loaded into the D-cache and will be used if the trap routine retries the faulting instruction. L2 cache errors are not loaded into the I-cache or P-cache, so there is no need to flush them. A software-initiated L2 cache flush, which evicts the corrected line into the L3 cache, is desirable so that the corrected data can be brought back from the L3 cache later. Without the L2 cache flush, a further single-bit error is likely the next time this word is fetched from the L2 cache. Multiple occurrences of this error will cause AFSR1.ME to be set.
In the event that the UCC event is for an instruction fetch which is later discarded without the instruction being executed, no ...
44. and multiple page sizes This chapter describes the Instruction Memory Management Unit as seen by the operating system software in these sections Chapter Topics Instruction Memory Management Unit on page 302 e Larger amp Programmable I TLB on page 302 e Translation Table Entry TTE on page 310 e Hardware Support for TSB Access on page 312 e Faults and Traps on page 313 e Reset Disable and RED_state Behavior on page 313 e Internal Registers and ASI Operations on page 313 s I TLB Tag Access Extension Register on page 313 e Write Access Limitation of I MMU Registers on page 319 e Data Memory Management Unit on page 319 e Virtual Address Translation on page 319 e Two D TLBs with Large Page Support on page 320 e Hardware Support for TSB Access on page 329 e Faults and Traps on page 331 e Reset Disable and RED_state Behavior on page 331 e Internal Registers and ASI Operations on page 331 e Translation Lookaside Buffer Hardware on page 337 Instruction and Data Memory Management Unit 301 amp Sun microsystems 13 1 13 1 1 13 12 Instruction Memory Management Unit Virtual Address Translation The UltraSPARC IV processor supports a 64 bit virtual address VA space with 43 bits of physical address PA The I MMU s two Translation Lookaside Buffers TLBs and their respective page sizes are described in TABLE 13 1 Replacement pages not maintaining the T512 page size are covered by the T16 TLB TABLE 13 1 I MMU
45. and some apply to an arbitrary subset The following sections address how each type of reset is handled with respect to having multiple logical processors integrated into a package In general the reset nomenclature used is consistent with UltraSPARC IV processors Future processors may have a different classification of resets in which case those processors should extend this model appropriately Private Resets SIR and WDR Resets The only resets that are limited to a single logical processor are the private resets internally generated by a logical processor An UltraSPARC IV processor has a number of resets of this class These types of resets are generated by an individual logical processor and are not propagated to the other logical processors on a CMT processor Full CMT Resets System Reset There is a class of resets that are generated by an external agent and apply to all the logical processors in a CMT processor These include any resets associated with fundamental reconfigurations of the CMT processor Current SPARC processors have a single system reset of which power on reset is a special case System reset is required for certain reconfigurations of the processor Future processors may have multiple resets that replace the single system reset of current processors The power on and system resets or their equivalents in future processors are sent to all logical processors in a CMT processor All logical processors except the
46. ... 50
Data Cache Microtag Access Data Format ... 50
Data Cache Tag Valid Access Data Format ... 50
Data Cache Snoop Tag Access Address Format ... 51
Data Cache Snoop Tag Access Data Format ... 51
Data Cache Invalidate Address Format ... 52
Write Cache Diagnostic State Access Address Format ... 52
Write Cache Diagnostic State Access Write Data Format ... 52
Write Cache Diagnostic State Access Read Data Format ... 53
Write Cache Diagnostic Data Access Address Format ... 53
Write Cache Diagnostic Data Access Data Format ... 53
Write Cache Diagnostic Data Access Data Format ... 54
Write Cache Tag Register Access Address Format ... 54
Write Cache Tag Register Access Data Format ... 54
Prefetch Cache Status Data Access Address Format ... 55
Data Format Bit Description ... 55
P cache status data array ...
47. divided into two 32-byte subblocks with separate valid bits. Also, the 2 KB write cache has been made fully associative. The other two L1 caches (64 KB data and 2 KB prefetch) are unchanged. As in the UltraSPARC IV processor, all four L1 caches are duplicated with each core. To maintain data coherency, the L1 data cache is write-through.
L2 Cache
The UltraSPARC IV processor's L2 cache has been reduced in size from 16 MB to 2 MB but brought on chip. The L2 cache is also shared, rather than split between the two cores. The structure of the L2 cache has been revised to 4-way set associative with a 64-byte line. The L2 cache operates at half the processor frequency, sustaining one read or write request every 2 cycles. With the exception of the prefetch cache, all L1 caches are included in the L2 cache. To minimize off-chip traffic, the L2 cache is copy-back.
1.4.3 L3 Cache
The first two levels of on-chip caches are backed by a large off-chip L3 cache of 32 MB, also shared by the two cores. Like the revised L2 cache, the new L3 cache is 4-way set associative with a 64-byte line. To maximize access bandwidth and speed, the tags for the L3 cache are kept on chip. To minimize traffic to main memory, the L3 cache is copy-back. The L3 cache is a victim cache:
e When data or instructions are loaded from memory, they are written into the L2 and L1 on-chip caches but not into the L3 off-chip cache.
e A l
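For reference, the sketch below computes the set counts that follow from the L2 and L3 parameters given above (sizes, 4-way associativity, 64-byte lines). The arithmetic is ours; only the parameters come from the text.

    #include <stdio.h>

    /* sets = cache size / (ways * line size) */
    static unsigned sets(unsigned long long bytes, unsigned ways, unsigned line)
    {
        return (unsigned)(bytes / (ways * line));
    }

    int main(void)
    {
        printf("L2: %u sets (13 index bits)\n", sets(2ULL << 20, 4, 64));    /* 8192   */
        printf("L3: %u sets (17 index bits)\n", sets(32ULL << 20, 4, 64));   /* 131072 */
        return 0;
    }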
48. fill request with CE in the critical 32 byte data from system bus due to RTSR transaction OR Prefetch for one read 1 21 fill request with CE in the non critical 2nd 32 byte data from system bus due to RTSR transaction System Bus CE UE TO DTO BERR DBERR errors 4 of 10 flag fast ecc error L2 cache data Corrected data corrected ecc L2 cache state L1 cache data Good data in P cache Pipeline Action No action Comment Disrupting trap Prefetch for several writes 2 22 fill request with CE in the critical 32 byte data from system bus regardless non RTOR transaction or RTOR transaction OR Prefetch for several writes 2 22 fill request with CE in the non critical 2nd 32 byte data from system bus regardless non RTOR transaction or RTOR transaction Corrected data corrected ecc Data installed in P cache No action Disrupting trap Prefetch for one write 3 23 fill request with CE in the critical 32 byte data from system bus regardless non RTOR transaction or RTOR transaction OR Prefetch for one write 3 23 fill request with CE in the non critical 2nd 32 byte data from system bus regardless non RTOR transaction or RTOR transaction Prefetch for instruction 17 fill request with CE in the critical 32 byte data from system bus due to non RTSR transaction OR Prefetch for instruction 17 fill request with CE in the non critical 2nd 32 byte data from sys
49. is updated the corresponding field in the Sun Fireplane Interconnect Configuration Register will be updated too The register is illustrated below and described in TABLE 11 4 TABLE 11 4 FIREPLANE_CONFIG 2 Register Format J of 2 Bits Field Description 63 CPBK_BYP Copyback Bypass Enable If set it enables the copyback bypass mode 62 SAPE SDRAM Address Parity Enable If set it enables the detection of SDRAM address parity New SSM Transactions Enable If set it enables the 3 new SSM transactions 61 RE RTSU ReadToShare and update MTag from gM to gS RTOU ReadToOwn and update MTag to gI UGM Update MTag to gM 60 59 DTL_6 58 57 DTL 5 DTL_ 1 0 DTL termination mode O16 Reserved 56 55 DTL_4 116 DTL end Termination pullup 54 53 DTL 3 216 DTL mid 25 ohm pulldown 316 DTL 2 Termination pullup and 25 ohm pulldown 52 51 DTL_2 See TABLE 11 3 for DTL pin configuration 50 49 DTL_1 Module revision Written at boot time by the OpenBoot PROM OBP code which reads it eg Ee from the module serial PROM 44 39 MT 5 0 Module type Written at boot time by OBP code which reads it from the module serial b PROM 38 TOF Timeout Freeze mode If set all timeout counters are reset and stop counting gt P g Timeout Log value Timeout period is 2 x TOL Sun Fireplane Interconnect cycles Setting TOL gt 10 results in the max timeout period of 2 Sun Fireplane Interconnect cycl
50. 8.4.3 Hardware Action on Trap for I cache Snoop Tag Parity Error
A parity error detected in the I cache snoop tag will not cause a trap, nor will it be reported or logged. An invalidate transaction snoops all four ways of the I cache in parallel. On an invalidate transaction, each entry over the four ways that has a parity error will be invalidated, in addition to those that have a true match to the invalidate address. Entries which do not possess parity errors or do not match the invalidate address are not affected.
Note: I cache snoop tag parity error detection is suppressed if DCR.IPE = 0 or the cache line is invalid.
8.5 D cache Parity Error Trap
A D cache physical tag or data parity error results in a dcache_parity_error precise trap. Hardware does not provide any information as to whether the dcache_parity_error trap occurred due to a tag or a data parity error.
8.5.1 Hardware Action on Trap for D cache Data Parity Error
A parity error detected during a D cache load operation will take a dcache_parity_error trap (TT = 0x071, priority 2, globals = AG).
Note: D cache data parity error reporting is suppressed if DCR.DPE = 0. D cache data parity error checking ignores the cache line's valid bit, microtag hit/miss, and physical tag hit/miss. Parity error checking is only done for load instructions, not for store instructions. On a store update to the D cache, a parity bit is generated for every byte of store data written
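Since the text states that a parity bit is generated for every byte of store data written to the D cache, a per-byte parity computation is the natural reference point. The function below is only an illustration; the manual does not state the parity polarity here, so even parity is assumed.

    #include <stdint.h>

    /* Even parity over one byte (polarity is an assumption, not taken from the manual). */
    static unsigned byte_parity(uint8_t b)
    {
        b ^= b >> 4;
        b ^= b >> 2;
        b ^= b >> 1;
        return b & 1u;
    }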
51. of all D cache tag indexes and ways. Clear the P cache by writing the valid bit for every way at every line to 0 with a write to ASI_Pcache_TAG. Re-enable the I cache and D cache by writing 1 to DCUCR.DC and DCUCR.IC. Execute RETRY to return to the originally faulted instruction. D cache entries must be invalidated because cacheable stores performed by the dcache_parity_error trap routine, and any pending stores in the store queue, will not update any old data in the D cache while DCUCR.DC is set to 0. If the D cache was not invalidated, some data could be stale when the D cache was re-enabled. Snoop invalidates, however, still happen normally while DCUCR.DC is 0. The D cache entries must be initialized and the correct data parity installed because D cache data parity is still checked even when the cache line is marked as invalid. It is possible to write to just the one cache line that returned the error, but this would require disassembling the instruction pointed to by the TPC for this precise trap. Disassembling the instruction would provide the D cache tag index from the target address. D cache physical tag parity is checked only when the DC_valid bit is 1. It is not necessary to write to ASI_DCACHE_SNOOP_TAG. Zero can be written to each tag at ASI_DCACHE_TAG. The fact that the physical tag is the same for every entry and is different from the snoop tag for every entry
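Putting the steps above together, the recovery path can be sketched as below. The ASI constants, the address-encoding helpers, and asi_store64() are placeholders for the privileged STXA accesses documented in the diagnostics chapter; treat this as a structural outline rather than working code.

    /* Placeholder wrappers; real accesses are privileged STXA instructions. */
    extern void asi_store64(unsigned asi, unsigned long long va, unsigned long long data);
    extern unsigned long long dcache_tag_va(unsigned index, unsigned way);   /* diagnostic address encoding */
    extern unsigned long long pcache_tag_va(unsigned index, unsigned way);
    extern const unsigned ASI_DCACHE_TAG_NUM, ASI_PCACHE_TAG_NUM;            /* per the ASI map */

    void dcache_parity_recover(unsigned dc_indexes, unsigned pc_indexes)
    {
        /* Caches are assumed already disabled (DCUCR.DC = DCUCR.IC = 0) on entry. */
        for (unsigned way = 0; way < 4; way++) {
            for (unsigned i = 0; i < dc_indexes; i++)
                asi_store64(ASI_DCACHE_TAG_NUM, dcache_tag_va(i, way), 0);   /* zero every D cache tag       */
            for (unsigned i = 0; i < pc_indexes; i++)
                asi_store64(ASI_PCACHE_TAG_NUM, pcache_tag_va(i, way), 0);   /* clear every P cache valid bit */
        }
        /* Re-enable DCUCR.DC and DCUCR.IC, then execute RETRY (not shown). */
    }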
52. where
l3tag_ecc_din[127:0] = {18'b0, tag2[19], tag0[19], tag2[18], tag0[18], ..., tag2[1], tag0[1], tag2[0], tag0[0], state2[2], state0[2], state2[1], state0[1], state2[0], state0[0], 18'b0, tag3[19], tag1[19], tag3[18], tag1[18], ..., tag3[1], tag1[1], tag3[0], tag1[0], state3[2], state1[2], state3[1], state1[1], state3[0], state1[0]}
The ECC value of a zero L3 cache tag is also zero. Thus, after an ASI_SRAM_FAST_INIT_SHARED STXA, the ECC value is correct and all lines will be in the INVALID state. L3 cache tag ECC checking is carried out regardless of whether the L3 cache line is valid or not. If an L3 cache tag diagnostic access encounters an L3 cache tag CE, the returned data will not be corrected and raw L3 cache tag data will be returned, regardless of whether ET_ECC_en in the L3 cache control register is set or not. If there is an ASI write request to the L3 cache tag (not including displacement flush) and the ASI write request wins L2L3 pipeline arbitration, it will be sent to the L2L3 pipeline to access the L2 tag and L3 tag at the same time. Within 15 cycles, if there is a request (I cache, D cache, P cache, SIU snoop, or SIU copyback request) to the same cache index following the ASI request, then the second request will get incorrect tag ECC data. There will be no issue if the two accesses are to different indexes. To avoid this problem, the following procedures will be followed when software uses an ASI write to the L3 cache tag to inje
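The interleaving quoted above can be restated in C to make the bit ordering explicit. The sketch assumes 20-bit tags and 3-bit states per way, as in the expression; the helper names are illustrative, and the ECC generator itself is not shown.

    #include <stdint.h>

    /* Build one 64-bit half: 18 zero bits, then tagH/tagL bits interleaved MSB-first,
       then stateH/stateL bits interleaved, exactly as in the expression above. */
    static uint64_t ecc_din_half(uint32_t tag_hi, uint32_t tag_lo, uint32_t st_hi, uint32_t st_lo)
    {
        uint64_t v = 0;
        for (int i = 19; i >= 0; i--) {
            v = (v << 1) | ((tag_hi >> i) & 1u);
            v = (v << 1) | ((tag_lo >> i) & 1u);
        }
        for (int i = 2; i >= 0; i--) {
            v = (v << 1) | ((st_hi >> i) & 1u);
            v = (v << 1) | ((st_lo >> i) & 1u);
        }
        return v;   /* bits 63:46 stay zero, matching the leading 18'b0 */
    }

    /* din[1] holds l3tag_ecc_din[127:64] (ways 2/0); din[0] holds bits [63:0] (ways 3/1). */
    void build_l3tag_ecc_din(const uint32_t tag[4], const uint32_t state[4], uint64_t din[2])
    {
        din[1] = ecc_din_half(tag[2], tag[0], state[2], state[0]);
        din[0] = ecc_din_half(tag[3], tag[1], state[3], state[1]);
    }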
53. 111 Asserts nvc nva No IEFE trap enabled Normal gt Integer representation Integer representation of 26 1 of the normal number None s t the normal number None set a lt i Normal lt 100 000 None set 100 000 None set 2 1 1 IEEE trap means fp_exception_IEEE_754 IEEE 754 1985 Standard 129 un microsystems 6 3 9 TABLE 6 11 Integer to Floating Point NUMBER CONVERSION Instruction single operand Integer to Floating Point Number Conversion Integer to Floating Point Number Conversion Result from the operation includes one or more of the following e Number in f register See Trap Event on page 132 e Exception bit set See TABLE 6 12 s Trap occurs See abbreviations in TABLE 6 12 e Underflow Overflow may occur MSB and converted 1 IEEE trap means f p_exception_IEEE_754 6 3 10 Copy Move Operations FiTOs rs gt rd Masked Exception TEM NXM 0 Enabled Exception TEM NXM 1 FiTOd rs gt rd FxTOs rs gt rd Destination Register Destination Register FxTOd rsz gt rd Written rd PERE Written rd marO ap SP DP 0 0 None set 0 None set Integer lt 223 Normal None set Normal None set Asserts nvc Int i ded to 23 Integer gt 23 EE Asserts nvc nxc No IEEE trap MSB and converted SP enabled Integer gt BE 1 Normal None set Normal None set Integer is rounded to 23 Asserts nvc Integer lt 2
54. 7.2.11 Memory Errors and Interrupt Transmission ... 176
7.2.12 Cache Flushing in the Event of Multiple Errors ... 176
7.3 Error Registers ... 177
7.3.1 Shared Error Reporting ... 177
7.3.2 Error Enable Register ... 178
7.3.3 AFSR Register and AFSR_EXT Register ... 180
7.3.4 ECC Syndromes ... 186
7.3.5 Asynchronous Fault Address Register ... 191
7.4 Error Reporting Summary ... 192
7.5 Overwrite Policy ... 198
7.5.1 AFAR1 Overwrite Policy ... 198
7.5.2 AFSR1 E_SYND Data ECC Syndrome Overwrite Policy ... 199
7.5.3 AFSR1 M_SYND microtag ECC Syndrome Overwrite Policy ... 199
7.6 Multiple Errors and Nested Traps ... 199
7.7 Further Details on Detected Errors ... 200
7.7.1 L2 cache Data ECC Error ... 201
7.7.2 L2 cache Tag ECC Errors ... 204
7.7.3 L3 cache Data ECC Errors ... 205
7.7.4 L
55. 258 Status Register AFSR 143 183 220 258 primary 180 secondary 180 atomic instructions and store queue 97 B big endian default ordering after POR 277 instruction fetches 277 bit vector concatenation xxiv bit vectors concatenation 312 330 block load instruction in block store flushing 29 L2 L3 cache allocation 26 and store queue 97 store instruction L2 L3 cache allocation 26 booting code 26 branch target alignment 92 branch prediction access 47 INDEX XXxi amp Sun microsystems un microsystems diagnostic accesses 47 branch predictor access 47 branch taget buffer access 47 branch target buffet BTB 270 BRANCH_PREDICTION_ARRAY BPA_addr field 47 PNT_Bits prediction not taken field 47 PT_Bits prediction taken field 47 BTB_DATA BTB_addr 47 BUSY NACK pairs 299 byte ordering 277 C cache flushing 28 flushing and multiple errors 176 invalidation 51 organization 23 physical indexed physical tagged PIPT 25 virtual indexed virtual tagged VIVT 26 virtual indexed physical tagged VIPT 24 cache coherence tables 29 37 cache state Not Available NA 61 64 72 cacheability determination of 25 CALL instruction 268 CANRESTORE register 86 CANSAVE register 86 CCR register 86 chip multiprocessor CMP 1 2 CLEANWIN register 87 Cluster Error Status ID Register 295 CMP error reporting noncore specific 177 CMP registers CMP error steering register 177 scratchpad registers 276
56. 3 22 23 fill request with UE in the critical 32 byte data from system bus regardless non RTOR transaction or RTOR transaction OR Prefetch2 3 22 23 fill request with UE in the 2nd 32 byte data from system bus regardless non RTOR transaction or RTOR transaction DUE Raw UE data raw ecc Raw UE data raw ecc Bad data not installed in P cache Data not installed in P cache No action No action Disrupting trap Disrupting trap Error Handling 245 un microsystems TABLE 7 27 System Bus CE UE TO DTO BERR DBERR errors 6 of 10 Event stores W cache exclusive fill request with CE in the critical 32 byte data from system bus OR stores W cache exclusive fill request with CE in the 2nd 32 byte data from system bus stores W cache exclusive fill request with UE in the critical 32 byte data from system bus OR stores W cache exclusive fill request with UE in the 2nd 32 byte data from system bus flag fast ecc error L2 cache data Corrected data corrected ecc Raw UE data raw ecc L2 cache state L1 cache data W cache gets the permission to modify the data after all 64 byte of data have been received W cache gets the permission to modify the data after all 64 byte of data have been received Pipeline Action W cache proceeds to modify the data W cache proceeds to modify the data and UE information is sent to a
57. 34 27 12 11 1110 unused 34 27 12 11 1111 unused After hard reset this bit will be read as 9 b0 20 Reserved This bit is hardwired to 1 b0 in the UltraSPARC IV processor 15 Reserved This bit is hardwired to 1 b0 in the UltraSPARC IV processor If set enables ECC checking on L3 cache data bits 10 EC_ECC_en A itat 4S After hard reset this bit will be read as bt If set forces EC_check 8 0 onto L3 cache data ECC bits 9 EC_ECC_force a After hard reset this bit will be read as bt ECC check vector to force onto ECC bits 8 0 EC_check Note Bit 63 45 bit 20 and bit 15 of the L3 cache control register are treated as don t care for ASI write operations and read as zero for ASI read operations Hardware will automatically use the default POR value if an un supported value is programmed in L3 cache control register 3 12 2 Secondary L3 cache Control Register ASI 7516 Read and Write Shared by both logical processors Caches Cache Coherency and Diagnostics 69 amp Sun microsystems 3123 3 12 4 VA 816 The UltraSPARC IV processor does not implement this register because low power mode is not supported in the UltraSPARC IV processor For stores it is treated as a NOP For loads it is the same as ASI 7516 VA 016 L3 cache Data ECC Fields Diagnostics Accesses ASI 7616 WRITING ASI 7E 6 READING Shared by both logical processors VA 63
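The three ECC-related fields described above (EC_ECC_en at bit 10, EC_ECC_force at bit 9, EC_check[8:0] at bits 8:0) can be composed as shown below. Only the bit positions come from the table; the function is an illustration, and actually writing the L3 cache control register requires a privileged STXA to its ASI, which is not shown.

    #include <stdint.h>

    /* Compose the ECC fields of an L3 cache control register value (bit positions per the table). */
    static uint64_t l3ctl_ecc_fields(unsigned ecc_en, unsigned ecc_force, unsigned check9)
    {
        uint64_t v = 0;
        v |= (uint64_t)(ecc_en & 1u) << 10;     /* EC_ECC_en                 */
        v |= (uint64_t)(ecc_force & 1u) << 9;   /* EC_ECC_force              */
        v |= (uint64_t)(check9 & 0x1FFu);       /* EC_check[8:0] force value */
        return v;
    }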
58. 6 7 4 NaN Results From Operands Without NaNs The following operations generate NaNs See JEEE Operations on page 122 for details e FSQRT Normal or 0 e FDIV 0 IEEE 754 1985 Standard 137 amp Sun microsystems 6 8 6 8 1 6 8 1 1 6 8 1 2 Subnormal Operations The handling of subnormals is different for standard and non standard floating point modes The handling of operands and results are described separately in the following sections Response to Subnormal Operands The floating point unit responds to subnormal operands and results by either handling the result in the hardware or by generating an fp_exception_other with FSR FTT 2 unfinished_FPop The response of the floating point unit depends on its operating mode which is controlled by the FSR NS bit Standard Mode In standard mode the floating point unit in most cases traps when a subnormal operand is detected or a subnormal result is generated In this situation the system software must perform or complete the operation The floating point unit supports the following in standard mode e Some cases of subnormal operands are handled in hardware e Gross underflow results are supported in hardware for FdTOs FMULs d and FDIVs d instructions Non standard Mode In non standard mode the floating point unit in most cases flushes subnormal operands to 0 with the same sign as the subnormal number then uses the value in the oper
59. ... 65
L3 cache Control Register Access Data Format ... 67
L3 cache data diagnostic access ... 70
L3 cache data staging register access ... 71
L3 cache data staging register data access ... 71
L3 cache data staging register ECC access ... 71
L3 cache tag diagnostic access ... 72
L3 cache Tag Diagnostic Access ... 73
ASI access to shared SRAMs in L2 L3 off mode ... 75
The UltraSPARC IV processor OBP Backward Compatibility List ... 80
Machine State After Reset and RED_state ... 86
Performance Control Register ... 98
... 99
Performance Instrumentation Counter Register ... 100
PIC Register Fields ... 100
PIC Counter Overflow Processor Compatibility Comparison ... 100
Instruction Execution Clock Cycles and Counts ... 102
Counters for Collecting ITU Branch Statistics ... 103
Counters
60. {8 KB, 64 KB, DIRECT}_PTR_REG expands to ASI_DMMU_TSB_8KB_PTR_REG, ASI_DMMU_TSB_64KB_PTR_REG, and ASI_DMMU_TSB_DIRECT_PTR_REG.
e The symbol designates concatenation of bit vectors. A comma on the left side of an assignment separates quantities that are concatenated for the purpose of assignment. For example, if X, Y, and Z are 1-bit vectors and the 2-bit vector T equals 11₂, then (X, Y, Z) ← (0, T) results in X = 0, Y = 1, and Z = 1.
e A mod B means A modulus B, where the calculated value is the remainder when A is divided by B.
X and x are used to represent states or bits that are either not used or are not relevant (a don't-care condition). X usually indicates that a state may be either Yes or No (True or False), while x indicates that the bit may be either a 1 or a 0.
Notation for Numbers
Numbers throughout this specification are decimal (base 10) unless otherwise indicated. Numbers in other bases are followed by a numeric subscript indicating their base, for example, 1001₂, FFFF 0000₁₆. In some cases numbers may be preceded by 0x to indicate hexadecimal (base 16) notation, for example, 0xFFFF 0000. Long binary and hexadecimal numbers within the text have spaces or periods inserted every four characters to improve readability. The notation 7'h1F indicates a hexadecimal number of 1F₁₆ with 7 binary bits o
61. DCU Control Register Access Data Format (ASI 45₁₆) ...
ASI_SCRATCHPAD_n_REG Register ...
Rast Internals ...
The UltraSPARC IV processor ASI Extensions ...
FIREPLANE_PORT_ID Register Format ...
FIREPLANE_CONFIG Register Format ...
DTL Pin Configurations ...
FIREPLANE_CONFIG_2 Register Format ...
Fireplane Address Register ...
Sun Fireplane Interconnect Specific Machine State After Reset and RED_state ...
VER Register Encoding in Panther ...
Speed Data Register Bits ...
Program Version Register Bits ...
I MMU and D MMU Primary Context Register ...
62. After hard reset this bit will be read as 3 b111 8 7 6 5 L2_ off Retry_disable Retry_debug_counter If set disables the on chip L2 cache A separate one entry address and data buffer behave as a one entry L2 cache for debugging purposes when the L2 cache is disabled After hard reset this bit will be read as 1 b0 If set disables the logic that stops the W cache s of one or both LPs After hard reset this bit will be read as 1 b0 It is recommended that programmers set this bit to avoid livelocks if the Write cache of an enabled logical processor is enabled If Retry_disable is not set these bits control the retry counter which stops the W cache s 1024 retries will stop the W cache s 512 retries will stop the W cache s 256 retries will stop the W cache s 11 128 retries will stop the W cache s After hard reset these bits will be read as 2 b00 4 L2L3arb_single_issue_frequency Caches Cache Coherency and Diagnostics Controls the frequency of issue in the single issue mode 0 one transaction every 16 cycles 1 one transaction every 32 cycles After hard reset will be read as 1 b0 59 un microsystems TABLE 3 58 L2 cache Control Register 2 of 2 If set the L2L3 arbiter enters the single issue mode where a transaction to L2 L2L3arb_single_issue_en and or L3 pipeline s except snoop and ASI_SRAM FAST INIT_SHARED 3 transactions will be
63. D cache P cache and W cache excluding block stores requests Note that the load part and the store part of an atomic is counted as a single reference L3_rd_miss PICL L3_IC_miss PICU Performance Instrumentation and Optimization Number of L3 cache misses from this core by cacheable D cache requests excluding block loads and atomics Number of L3 cache misses by cacheable I cache requests from this core 109 TABLE 5 11 Counter L3_SW_pf_miss PICU L3_write_hit_RTO PICU L3_write_miss_RTO PICU Cache Access Counters 5 of 5 Description Number of L3 cache misses by software prefetch requests from this core Number of L3 cache hits in O Os or S state by cacheable store requests from this core that do a read to own RTO bus transaction The count does not include RTO requests for prefetch fen 2 3 22 23 instructions Number of L3 cache misses from this core by cacheable store requests that do a read to own RTO bus transaction The count does not include RTO requests for prefetch fen 2 3 22 23 instructions Note that this count also includes the L3_write_hit_RTO cases RTO_nodata i e stores that hit L3 cache in O Os or S state L3_hit_other_half PICU Number of L3 cache hits from this core to the ways filled by the other core when the cache is in pseudo split mode Note that the counter is not incremented if the L3 cache is not in pseudo split mode If the L3 cache is
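Because the note above says L3_write_miss_RTO also counts the RTO_nodata cases that L3_write_hit_RTO records, a derived "true miss" figure can be obtained by subtraction. This is our reading of the note, not a documented formula.

    /* Derived metric (our interpretation of the counter note above). */
    static unsigned long long l3_rto_true_misses(unsigned long long write_miss_rto,
                                                 unsigned long long write_hit_rto)
    {
        return write_miss_rto - write_hit_rto;
    }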
64. E S O Os or M state a store queue will issue an exclusive request to L3 cache The exclusive request will perform an L3 cache read and the line is moved from the L3 cache to L2 cache and will be checked for the correctness of its ECC If uncorrectable ECC error is detected the L3_EDU bit will be set to log this error condition A disrupting ECC_error trap will be generated in this case provided that the NCEEN bit is set in the Error Enable Register When L3 cache data is delivered to the W cache the associated UE information will be stored with data When the W cache evicts a line with the UE data information it will generate ECC based on the data stored in the W cache then ECC check bits C 1 0 of the 128 bit word are inverted before the word is scrubbed back to the L2 cache When an atomic operation misses from the W cache and hits the L3 cache in E S O Os M state a store queue will issue an exclusive request to the L3 cache The exclusive request will perform an L3 cache read and the line is moved from the L3 cache to the L2 cache and will be checked for the correctness of its ECC If uncorrectable ECC error is detected in non critical 32 byte data the L3_EDU bit will be set to log this error condition A disrupting ECC_error trap will be generated in this case provided that the NCEEN bit is set in the Error Enable Register When the L3 cache data is delivered to the W cache the associated UE information will be stored with data When the
65. I TLB data array is done using AST_ITLB_DATA_ACCESS_REG ASI 0x55 Back to back ASI writes and reads to the I TLB tag array are done using AST_ITLB_TAG_READ_REG ASI 0x56 I TLB Access Summary TABLE 13 4 lists the MMU TLB Access Summary Instruction and Data Memory Management Unit 308 un microsystems Note There is no I MMU Synchronous Fault Address Register SFAR A missed VA is found in the TPC Register TABLE 13 4 I MMU TLB Access Summary Software Operation Effect on MMU Physical Registers Load Tag Access Store Register TLB tag array TLB data array Tag Access Ext SFSR Tag Read Contents returned From No effect No effect entry specified by LDXA s No effect No effect access Contents Tag No effect No effect Contents No effect Access returned returned Data In Trap with instruction_access_exception see I MMU Synchronous Fault Status Register I SFSR on page 316 Data Contents returned No effect No effect No effect Access From entry No effect specified by LDXA s access Load SFSR No effect No effect No effect No effect Contents returned Tag Read Trap with instruction_access_exception Tag No effect No effect E No effect No effect Access store data TLB entry determined by TIB entry Data In replacement policy written qetermined by S replacement No effect No effect No effect with contents of Tag p
66. L2 cache and D 127 126 will be inverted in the data written to the system bus L2 cache Tag ECC Errors THCE For an instruction fetch a load or atomic operation or writeback copyout store queue exclusive request block store and prefetch queue operations L2 cache data fill processor hardware corrects single bit errors in L2 cache tags without software intervention then retry the operation For a snoop read operation the processor will return L2 cache snoop result based on the corrected L2 cache tag and correct L2 cache tag at the same time For these events AFSR THCE is set and a disrupting ECC_error trap is generated For all hardware corrections of L2 cache tag ECC errors not only does the processor hardware automatically correct the erroneous tag but it also writes the corrected tag back to the L2 cache tag RAM Diagnosis software can use correctable error rate discrimination to determine if a real fault is present in the L2 cache tags rather than a succession of soft errors 204 amp Sun microsystems Tad Error Handling TUE For any access except ASI access and SIU tag update or copyback from foreign bus transaction and snoop request if an uncorrectable tag error is discovered the processor sets AFSR TUE The processor asserts its ERROR output pin and it is expected that the coherence domain has suffered a fatal error and must be restarted TUE_SH For any access due to SIU tag update or copyback f
67. LDF load floating point single precision instructions and avoid the trap The UltraSPARC processor supports single precision loads mixed with double precision operations so that the case above can execute without penalty except for the additional load If a trap does occur the UltraSPARC processor dedicates a trap vector for this specific misalignment which reduces the overall penalty of the trap Grouping load data is desirable because a D cache subblock can contain either four properly aligned single precision operands or two properly aligned double precision operands eight and four respectively for a D cache line This grouping is desirable not only for improving the D cache hit rate by increasing its utilization density but also for D cache misses where for sequential accesses one out of two requests to the L2 cache can be eliminated Performance Instrumentation and Optimization 96 amp Sun microsystems 5 3 3 1 5 3 4 53 53 Using LDDF to Load Two Single Precision Operands Cycle The UltraSPARC processor supports single cycle eight byte data transfers into the floating point register file for LDDF Wherever possible applications that use single precision floating point arithmetic heavily should organize their code and data to replace two LDF s with one LDDF This replacement reduces the load frequency by approximately one half and cuts execution time considerably Store Considerations The store on the UltraSPARC
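The data-layout idea behind replacing two LDF instructions with one LDDF is to keep pairs of single-precision values contiguous and 8-byte aligned so that one 8-byte load can bring in both. The C fragment below only illustrates the layout (using a GCC-style alignment attribute); whether the compiler actually emits LDDF depends on the toolchain and target options.

    /* Pairs of single-precision operands kept in one aligned 8-byte slot. */
    typedef struct {
        float a;
        float b;
    } __attribute__((aligned(8))) float_pair;

    float sum_a(const float_pair *v, int n)
    {
        float s = 0.0f;
        for (int i = 0; i < n; i++)
            s += v[i].a;       /* each v[i] can be fetched as a single 8-byte load */
        return s;
    }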
68. LP LRU 1 0 000 way0 is the LRU LP LRU 1 0 001 way is the LRU LP LRU 1 0 010 way0 is the LRU LP LRU 1 0 011 way is the LRU LP LRU 1 0 100 way2 is the LRU LP LRU 1 0 101 way2 is the LRU LP LRU 1 0 110 way3 is the LRU LP LRU 1 0 111 way3 is the LRU A 3 bit L2 cache state entry The 3 state bits are encoded as follows state 2 0 000 Invalid state 2 0 001 Shared state 2 0 010 Exclusive state 2 0 011 Owner state 2 0 100 Modified state 2 0 101 NA Not Available see Notes on NA Cache State on page 64 state 2 0 110 Owner Shared state 2 0 111 Reserved 2 0 State T Note Bit 63 43 and bit 18 15 of ASI_L2CACHE_TAG data are treated as don t care for ASI write operations and read as zero for ASI read operations On reads to ASI_L2CACHE_TAG regardless of the hw_ecc_gen_en bit the tag and the state of the specified entry given by the index and the way will be read out along with the ECC and the LRU of the corresponding index A common ECC and LRU is shared by all 4 ways of an index Caches Cache Coherency and Diagnostics 62 amp Sun microsystems On writes to AST_L2CACHE_TAG the tag and the state will always be written into the specified entry The LRU will always be written into the specified index For the ECC field if hw_ecc_gen_en is set the ECC field of the data register is don t care
69. N A N A the request is corrects the tag 1 request misses L2 cache retried i Good data D cache FP 64 bit load D installed in D Good Disrupting trap request with L3 tag UE Original tag cache and P data request hits L2 cache ehe taken D cache FP 64 bit load Disrupting trap request with L3 tag UE Original tag N A N A the request is request misses L2 cache dropped L3 cache access for D and itai Sue cache block load request i eater at 20 Disrupting tra E L3_THCE No L3 Pipe P cache block data GE we GE corrects the tag load buffer taken request hits L2 cache 237 un microsystems Error Handling TABLE 7 25 L3 cache Tag CE and UE Errors 2 of 4 Errors logged Pipeline Event in AFSR L3 cache Tag Li cache data Note Comment L3 cache access for D Derin cache block load request L3 Pipe pune Si with tag CE Ee corrects the tag NA the TOUS UIS request misses L2 cache DESS L3 cache access for D i cache block load request A Good data in Good Disrupting trap with tag UE Original tag P cache block data i load buffer taken request hits L2 cache L3 cache access for D Disrintins tr cache block load request ER ect with tag UE Original tag N A the request is request misses L2 cache dropped D cache atomic request Good data Good e i Disrupting tra with L3 tag CE L3_THCE L3 Pipe installed in D data eee corrects the tag request hits L2 cache cache taken D cache atom
70. Note The I cache microtags must be initialized after power on reset and before the instruction cache is enabled For each of the four ways of each index of the instruction cache the microtags must contain a unique value For example for index 0 the four microtags could be initialized to 0 1 2 and 3 respectively The values need not be unique across indices in the previous example 0 1 2 and 3 could also be used in cache index 1 2 3 and so on 10 Valid predict tag Access I cache Valid Load Predict Bits Note Any write to the I cache Valid Predict array will update both sub blocks simultaneously Reads can however happen individually to any sub block Caches Cache Coherency and Diagnostics 43 microsystems TABLE 3 17 shows the write data format for I cache Valid Load Predict Bits LPB TABLE 3 17 Format for Writing I cache Valid Predict Tag Field Data Bits Field Description 63 56 Mandatory value Should be 0 55 Valid1 Valid1 is the Valid bit for the 32 byte sub block given by IC_addr 14 7 with IC_addr 6 1 54 Valido ValidO is the valid bit for the other 32 byte sub block given by the same IC_addr 14 7 but with IC_addr 6 0 IC_vpred1 is the 8 bit LPB for eight instructions starting at the 32 byte boundary 53 46 IC_vpred1 7 0 align address given by IC_addr 14 7 with IC_addr 6 1 f IC_vpred0 are the LPB bits for eight instructions of the other 32 byte sub block 45 38 IC_vpr
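The initialization rule in the note above (a unique microtag value per way at every index, with values reusable across indexes) can be expressed as a simple double loop. The ASI constant, the address-encoding helper, and asi_store64() are hypothetical stand-ins for the privileged diagnostic STXA accesses; this is a sketch of the rule, not the documented procedure.

    /* Hypothetical wrappers for the diagnostic microtag write. */
    extern void asi_store64(unsigned asi, unsigned long long va, unsigned long long data);
    extern unsigned long long icache_utag_va(unsigned index, unsigned way);
    extern const unsigned ASI_ICACHE_UTAG_NUM;

    void init_icache_microtags(unsigned num_indexes)
    {
        for (unsigned index = 0; index < num_indexes; index++)
            for (unsigned way = 0; way < 4; way++)
                asi_store64(ASI_ICACHE_UTAG_NUM, icache_utag_va(index, way), way);  /* 0..3: unique per way */
    }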
71. Original ta N A p with tag CE logged 8 8 action ASI access get original tag No tra direct ASI L2 tag read request No error No No Original ta N A S with tag UE logged 8 8 action ASI access get original tag i Disrupting tra ASI L2 displacement flush THCE No No e SE N A No on S read request with tag CE z action the reguestas tag retried Disrupting tra ASI L2 displacement flush ve Ge p read request with tag UE TUE No Yes Original tag N A the request is dropped ASI tag write request with tag No error No No new tag N A No CE logged written in action ASI tag write request with tag No error N N new tag N A No UE logged S S written in action TABLE 7 24 L2 cache Tag CE and UE Errors Error logged in Error Snoop result Event AFSR Pin L2 cache tag sent to SIU Comment Snoop request with tag CE Good L2 and without E to S L2 Pipe corrects the tag Disrupting trap gier cache state Snoop request with tag CE L2 Pipe performs E to S Good L2 EE and with E to S upgrade based on corrected tag cache state ping trap ER force a miss e Snoop request with tag UE Original state BE Disrupting trap Error Handling 236 un microsystems 7 13 Behavior on L3 cache TAG Errors Error Handling This section presents information about L3 cache tag errors TABLE 7 25 L3 cache Tag CE and UE Errors 1 of 4 Errors logged Pipeline Event in AFSR L3 cache Tag Li cache data NAGA Comment L3 cache access f
72. P cache and maybe in the L2 cache depending on the type of prefetch it is If the prefetch queue operation to the L2 cache detects a single bit L2 cache data error AFSR EDC will be set AFAR captured and a disrupting ECC_error trap generated to log the event The P cache line being fetched will not be installed The L2 cache data is unchanged If the prefetch queue operation to the L2 cache detects an uncorrectable L2 cache data error AFSR EDU will be set AFAR captured and a disrupting ECC_error trap generated to log the event The P cache line being fetched will not be installed If the prefetch queue operation that misses in L2 cache and hits in L3 cache detects a single bit L3 cache data error AFSR_EXT L3_EDC will be set AFAR captured and a disrupting ECC_error trap generated to log the event The P cache line being fetched will not be installed The L3 cache data will be moved from L3 cache to L2 cache without any correction If the prefetch queue operation that misses in L2 cache and hits in L3 cache detects an uncorrectable L2 cache data error AFSR EDU will be set AFAR captured and a disrupting ECC_error trap generated to log the event The P cache line being fetched will not be installed The L3 cache data will be moved from L3 cache to L2 cache without any change If the prefetch queue operation causes a system bus read operation correctable data ECC uncorrectable data ECC correctable microtag ECC uncorrectable microtag ECC
73. Parity Error Parity error detected in D cache snoop tag will not cause a trap nor will it be reported logged An invalidate transaction snoops all four ways of the D cache in parallel On an invalidate transaction each entry over the four ways that has a parity error will be invalidated in addition to those that have a true match to the invalidate address Entries which do not possess parity errors or do not match the invalidate address are not affected Note D cache snoop tag parity error detection is suppressed if DCR DPE 0 or cache line is invalid 8 6 8 6 1 P cache Parity Error Trap A P cache data parity error results in a dcache_parity_error precise trap Hardware does not provide any information as to whether the dcache_parity_error trap occurred due to a D cache tag data error or a P cache data error Hardware Action on Trap for P cache Data Parity Error For parity error detected for floating point load through P cache the P cache data parity error is reported the same way same trap type and precise trap timing as in D cache data array parity error See Hardware Action on Trap for D cache Data Parity Error on page 262 Exceptions Traps and Trap Types 264 un microsystems Note P cache data parity error detection is suppressed if DCR PPE 0 The error behavior on P cache data parity is described in TABLE 8 6 TABLE 8 6 _ P cache Data Parity Instructions Reporting if Parity Error dcache_
74. Pipeline Action Good critical 32 byte data taken Comment Deferred trap when the line is evicted out from the W cache again based on the UE status bit sent from L2 cache W cache flips the 2 least significant ecc check bits C 1 0 in both lower and upper 16 byte Prefetch for several reads 0 20 fill request with CE in the critical 32 byte data from system bus due to non RTSR transaction OR Prefetch for several reads 0 20 fill request with CE in the non critical 2nd 32 byte data from system bus due to non RTSR transaction Corrected data corrected ecc Good data in P cache No action Disrupting trap Prefetch for several reads 0 20 fill request with CE in the critical 32 byte data from system bus due to RTSR transaction OR Prefetch for several reads 0 20 fill request with CE in the non critical 2nd 32 byte data from system bus due to RTSR transaction Prefetch for one read 1 21 fill request with CE in the critical 32 byte data from system bus due to non RTSR transaction OR Prefetch for one read 1 21 fill request with CE in the non critical 2nd 32 byte data from system bus due to non RTSR transaction Error Handling Corrected data corrected ecc Not installed Good data in P cache Good data in P cache No action No action Disrupting trap Disrupting trap 243 un microsystems TABLE 7 27 Event Prefetch for one read 1 21
75. Soft2 See the UltraSPARC III Cu Processor User s Manual 49 Reserved Reserved for future implementation Instruction and Data Memory Management Unit 328 un microsystems TABLE 13 21 TTE Data Field Descriptions 2 of 2 Bit 48 47 43 Field Size 2 Reserved Description Bit 48 is the most significant bit of the page size and is concatenated with bits 62 61 Reserved for future implementation 42 13 PA The physical page number Page offset bits for larger page sizes PA 15 13 PA 18 13 PA 21 13 PA 24 13 and PA 27 13 for 64 KB 512 KB 4 MB 32 MB and 256 MB pages respectively are stored in the TLB and returned for a Data Access read but are ignored during normal translation When page offset bits for larger page sizes are stored in the TLB on UltraSPARC IV processor the data returned from those fields by a Data Access read are the data previously written to them 12 7 Soft See the UltraSPARC III Cu Processor User s Manual 6 If the lock bit is set then the TTE entry will be locked down when it is loaded into the TLB that is if this entry is valid it will not be replaced by the automatic replacement algorithm invoked by an ASI store to the Data In Register The lock bit has no meaning for an invalid entry Arbitrary entries can be locked down in the TLB Software must ensure that at least one entry is not locked when replacing a TLB entry otherwise a locked entry w
76. Software assigned unique interrupt RW Private ID for logical processor 63 16 ASL CORE ID Hardware assigned ID for logical R Private processor 63 16 ASI_CESR_ID Software assigned CESR_ID Private value 66 16 ASI_ICACHE_INSTR ASI_IC_INSTR I cache RAM diagnostic access Private 6716 ASLICACHE_TAG ASI_IC_TAG pie tag valid RAM diagnostic Private 68 ASI_ICACHE_SNOOP_TAG I cache snoop tag RAM diagnostic Pavie 16 ASI_IC_STAG access 6916 ASL IER DATA I cache prefetch buffer data RAM Reais diagnostic access ER ASL IPB_TAG I cache prefetch buffer tag RAM Pavate diagnostic access ep ASL_L2CACHE_ DATA L2 cache data RAM diagnostic Shared access SC ASL _L2CACHE_TAG L2 cache tag RAM diagnostic Shared access GC ASL L2CACHE_ CTRL L2 cache control diagnostic Shared access 6E16 ASI_BTB_DATA J mip predict table RAM Private diagnostic access 6F 16 ASI_BRANCH_PREDICTION_ARRAY EE RW Private diagnostic access 7016 to PE Reserved for future Private 7li6 implementation 7216 ASI_MCU_TIM1_CTRL_REG 00 6 Cregs memory control I register Shared 7216 ASI_MCU_TIM2_CTRL_REG Cregs memory control II register RW Shared ays ASI_MCU_ADDR_DEC_REGO Gregs memory decoding registi eis Shared bank 0 oie ASI_MCU_ADDR_DEC_REG1 Cregs memory decoding register eer Shared bank 1 7216 ASI_MCU_ADDR_DEC_REG2 Cregs memory decoding register Rw Shared bank 2 Ti ASI_MCU_ADDR_DEC_REG3 Cregs memory decoding register Riy Stared bank 3 Cregs
77. Summary The UltraSPARC IV processor implements the following private and shared registers Chip Multithreading CMT 21 un microsystems 2 6 1 Implementation Registers TABLE 2 10 and TABLE 2 11 summarize the private and shared registers respectively TABLE 2 10 The UltraSPARC IV Processor Private Registers as on ASI Name Access VA Description 0x63 ASI_CORE_ID R 0x10 LP ID register 0x63 ASI_CESR_ID RW 0x40 CESR ID register TABLE 2 11 The UltraSPARC IV Processor Shared Registers E Sc ASI Name Access VA Description 0x41 ASI_CORE_AVAILABLE 0x41 ASI_CORE_ENABLE_ STATUS LP Enable Status register 0x41 ASI_CORE_ ENABLE LP Enable register Read Write 0x41 ASI_XIR_STEERING XIR Steering register Read Write 0x41 ASI_CORE_RUNNING_RW 0x41 ASI CORE RUNNING WIS LP Running register Write One Set 0x41 ASI_CORE_RUNNING_WI1C LP Running register Write One Clear 0x41 ASI_CORE_RUNNING_STATUS LP Running Status register 0x41 ASI_CMT_ERROR_STEERING Error Steering register Read Write Note ASI accesses to the registers must use LDXA STXA LDDFA STDEA instructions Using another type of load or store instruction will cause a data_access_exception trap with SFSR FT 8 illegal ASI value VA RW or size Attempt to access these registers while in non privileged mode will cause a privileged_action trap with SFSR FT 1 privilege violation A non aligned access will cause a mem_address_not_aligned tr
78. T512 During the TTE fill if there is no invalid TLB entry to take then the T512 selection and way selection are determined based on a random replacement policy using a new 10 bit LFSR Linear Feedback Shift Register pseudo random generator For 4 way pseudo random selection LFSR 1 0 bits two least significant bits of LFSR will be used Please refer to TABLE 13 19 TABLE 13 19 LFSR Bit Setting LFSR 1 0 TTE Fill To 00 T512_0 way 0 01 T512_0 way 1 10 T512_1 way 0 11 T512_1 way 1 Note All LFSRs in the D cache I cache and TLBs are initialized on power on reset Hard_POR and system reset Soft_POR This single LFSR is also shared by both the T512_0 and T512_1 when they have different page sizes The least significant bit LFSR 0 is used for the entry replacement It selects the same bank of both T512s but only one of the T512 write enables is eventually asserted at the TTE fill time The Demap Context is needed when the same context changes PgSz During context in if the operating system software decides to change any PgSz setting of the T512_0 or T512_1 differently from last context out of the same Context e g was both 8 KB at context out now 8 KB amp 256 MB at context in then the operating system software will perform a Demap Context operation first This avoids remnant entries in the T512_0 or T512_1 which could cause a duplicate possibly stale hit D TLB Automatic Replacement
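The way-selection rule of TABLE 13-19 reduces to decoding LFSR[1:0]: bit 1 selects T512_0 versus T512_1 and bit 0 selects the way. The sketch below shows that decode; the LFSR step uses a generic maximal-length 10-bit polynomial (x^10 + x^7 + 1) purely for illustration, since the hardware's feedback taps are not given in this section.

    #include <stdint.h>

    /* Generic 10-bit LFSR step (assumed taps; the hardware polynomial is not stated here). */
    static uint16_t lfsr10_step(uint16_t s)
    {
        uint16_t fb = ((s >> 9) ^ (s >> 6)) & 1u;
        return (uint16_t)(((s << 1) | fb) & 0x3FFu);
    }

    /* Decode LFSR[1:0] per TABLE 13-19. */
    static void t512_victim(uint16_t lfsr, unsigned *which_t512, unsigned *way)
    {
        *which_t512 = (lfsr >> 1) & 1u;   /* 0 -> T512_0, 1 -> T512_1 */
        *way        = lfsr & 1u;          /* way 0 or way 1           */
    }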
79. EPID Register ... 13
LP Interrupt ID Register ... 13
CESR ID Register ... 14
LP Available Register Shared ... 15
LP Enable Status Register Shared ... 16
LP Enable Register Shared ... 16
LP Running Register Shared ... 18
LP Running Status Register Shared ... 19
XIR Steering Register Shared ... 21
The UltraSPARC IV Processor Private Registers ... 22
The UltraSPARC IV Processor Shared Registers ... 22
The UltraSPARC IV Processor Cache Organization ... 24
Definitions of the Terms ... 30
Hit Miss State Change and Transaction Generated for Processor Action ... 31
Combined Tag MTag States ... 32
Snoop Output and DTag Transition ... 33
Deriving DTags CTags and MTags from Combined Tags ... 33
Snoop Input and CIQ Operation Queued ... 36
Transaction Handling at Head of CIQ ...
80. TLBs TLB name TLB ID Translating Page Size Description 16 entry fully associative both for locked and T16 0 8 KB 64 KB 512 KB 4 MB unlocked pages 512 entry 2 way set associative 256 entries T512 2 8 KB 64 KB per way for unlocked pages Larger amp Programmable I TLB The UltraSPARC IV processor has two I TLBs The original 2 way set associative I TLB of the UltraSPARC III processor has been changed to 512 entries and is now known as the T512 I TLB The 16 entry fully associative T16 I TLB is the same as found in the UltraSPARC III processor with the exception of now supporting 8 KB unlocked pages When an instruction memory access is issued its VA Context and PgSz are presented to the I MMU Both I TLBs T512 and T16 are accessed in parallel Note Unlike in the UltraSPARC III processor the UltraSPARC IV processor s T16 can support unlocked 8 KB pages This is necessary in the case where the T512 is programmed to a non 8 KB page size to ensure that the I TLB s unlocked 8 KB pages will not get dropped Instruction and Data Memory Management Unit 302 amp Sun microsystems 13 1 2 1 The T512 page size pgsz is programmable one PgSz per context primary nucleus Software can set the PgSz fields in ASI_PRIMARY_CONTEXT_REG as described in TABLE 13 2 TABLE 13 2 I MMU and D MMU Primary Context Register 60 58 N_pgszl 57 55 N_pgsz_I
81. Then perform a write to an address which maps to the same I cache tag index but does not match any of the entries in the I cache Diagnostic accesses to the I cache should confirm that all the ways of the I cache at this index have either been invalidated or displaced by instructions from other addresses that happened to map to the same I cache tag index Again this can be iterated for all the covered I cache snoop tag bits The parity is computed and stored independently for each instruction in the I cache An I cache parity error is determined based on per instruction fetch group granularity Unused or annulled instruction s are not masked during the parity check It means that they can still cause an icache_parity_error trap Note In the event of the simultaneous detection of an I cache parity error and an I TLB miss the icache_parity_error trap is taken When this trap routine executes RETRY to return to the original code a fast_instruction_access_MMU_umiss trap will be generated Because the icache_parity_error trap routine uses the alternate global register set recovery from an I cache parity error is unlikely to be possible if the error occurs in a trap routine which is already using these registers This is expected to be a low probability event The icache_parity_error trap routine can check to see if it has been entered at other than TL 1 and reboot the domain if it has 148 un microsystems 7 2 1 6 Notes on the C
82. This approach works even though both the second STXA and the MEMBAR Sync can take interrupts STXA to an MMU Register and MEMBAR Sync implicitly wait for all previous stores to complete before starting down the pipeline Thus if the second STXA or the MEMBAR takes an interrupt it does so only at the end of the pipeline after having made sure that all previous stores were complete In the above case the MEMBAR Sync is still susceptible to all the traps noted above except interrupts and mem_address_not_aligned Address Space Identifiers 279 amp Sun microsystems e I MMU miss e Illegal_instruction e Instruction breakpoint debug feature which manifests itself as an illegal instruction but is currently unsupported e Instruction_access_exception es Instruction_access_error e Fast ECC error Note DONE RETRY can take a privileged_opcode trap if used in place of MEMBAR Sync This possibility is not considered since as any STXA that target internal ASIs must be done in privileged mode The second part of the code is to start any of the vulnerable trap handlers with a MEMBAR Sync an approach which has also been verified in the system especially if they use LDXA instructions which target internal ASIs ASI values 0x30 to Ox6f 0x72 to 0x77 and 0x7a to 0x7f In the case of the I MMU miss handler this approach may result in unacceptable performance reduction In s
83. To determine the stall cycles due to non bypassable RAW hazard only subtract Re_PFQ_full from Re_RAW_umiss i e actual Re_RAW_miss Re_RAW_miss Re_PFQ_full Re_FPU_bypass PICU Re_DC_miss PICL Stall cycles due to recirculation when an FPU bypass condition that does not have a direct bypass path occurs FPU bypass cannot occur in the following cases 1 a PDIST instruction is followed by a dependent FP multiply FG multiply instruction 2 a PDIST instruction is followed by another PDIST instruction with the same destination register WAW hazard which in turn is followed by a dependent FP multiply FG multiply instruction Stall cycles due to recirculation of cacheable loads that miss D cache This includes L2 hit L2 miss L3 hit and L3 miss cases Stall cycles from the point when a cacheable D cache load miss reaches D stage to the point when the recirculated flow reaches D stage again are counted This is equivalent to the load to use latency of the load instruction Note The count does not include stall cycles for cacheable loads that recirculate due to a D cache miss for which there is an outstanding prefetch fcn 1 request in the prefetch queue LAP hazard Re_L2_miss PICL Stall cycles due to recirculation of cacheable loads that miss both D cache and L2 cache This includes both L3 hit and L3 miss cases Stall cycles from the point when L2 cache miss is detected to the point when the recirculated flow rea
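The adjustment described above is just a subtraction of the two counter values; the helper below states it directly.

    /* Stall cycles due only to the non-bypassable RAW hazard, per the formula above. */
    static unsigned long long raw_hazard_stalls(unsigned long long re_raw_miss,
                                                unsigned long long re_pfq_full)
    {
        return re_raw_miss - re_pfq_full;
    }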
84. V8 equivalent 310 328 Tag Target 312 330 U uncorrectable error 171 underflow mask UFM bit of TEM field of FSR register 134 135 unfinished_FPop exception 138 user trap handler 135 V VA_WATCHPOINT register 88 VER register 87 297 VIPT cache D cache 25 virtual caching 29 virtually indexed physically tagged VIPT 96 VIVT cache prefetch cache 26 INDEX xlvii amp Sun microsystems un microsystems W watchdog_reset WDR 83 watchpoint and RED state 83 WC_DATA WC entry field 53 WCACHE_DATA WC_dbl_word field 53 WC_ecc field 53 wcache data field 53 WCACHE_STATE WC entry field 52 WCACHE_TAG WC addr field 54 WRASR instruction 98 99 diagnostic control data 39 write cache access statistics 107 description 25 Diagnostic Bank Valid Bits Register 54 diagnostic data register access 53 diagnostic state register access 52 diagnostic tag register access 54 enable bit 275 flushing 25 write through cache 96 WRPIC instruction 99 WSTATE register 87 Y Y register 86 xlviii UltraSPARC IV Processor User s Manual October 2005
85. W cache evicts a line with UE data information it will generate ECC based on the data stored in the W cache then ECC check bits C 1 0 of the 128 bit word are inverted before the word is scrubbed back to the L2 cache When a software PREFETCH instruction misses the P cache and hits the L3 cache the line is moved from the L3 cache to the L2 cache and will be checked for the correctness of its ECC If an uncorrectable ECC error is detected the L3_EDU bit will be set to log this error condition A disrupting ECC_error trap will be generated in this case No data will be installed in the P cache Note that when the line moved from the L3 cache to the L2 cache raw data read from the L3 cache without correction is stored in the L2 cache Since the L2 cache and the L3 cache are mutually exclusive once the line is read from the L3 cache to the L2 cache it will not exist in the L3 cache Software initiated flushes of the L2 cache and L3 cache are required if this event is not to recur the next time the word is fetched from the L2 cache This may need to be linked with a correction of a multi bit error in the L3 cache if that is corrupted too 207 amp Sun microsystems Error Handling If an L3 cache uncorrectable error is detected as the result of either a store queue exclusive request or a block load or a prefetch queue operation and the AFSR EDU status bit is already set AFSR1 ME will be set If both the L3_EDU and the L3_M
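The "poisoning" step described above (generate ECC normally, then invert check bits C[1:0] of each 128-bit word before scrubbing it back) can be written as below. gen_ecc9() stands in for the real 9-bit ECC generator, which is not reproduced here; only the C[1:0] inversion reflects the text.

    #include <stdint.h>

    extern uint16_t gen_ecc9(const uint8_t data128[16]);   /* placeholder for the 9-bit ECC generator */

    /* ECC for a UE-marked 128-bit word: normal ECC with check bits C[1] and C[0] inverted. */
    static uint16_t poisoned_ecc(const uint8_t data128[16])
    {
        return (uint16_t)((gen_ecc9(data128) ^ 0x3u) & 0x1FFu);
    }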
86. above regarding the number of instructions fetched from an I-cache access, it is desirable to align branch targets so that enough instructions are fetched to match the number of instructions issued in the first group at the branch target. The following examples highlight the logic behind branch target alignment:
- If the compiler scheduler indicates that the target can be grouped with only one more instruction, the target should be placed anywhere in the line except in the last slot, because only one instruction would be fetched in that case. It may be beneficial to fetch more instructions if possible.
- If the target is accessed from more than one place, it should be aligned so that it accommodates the largest possible group (the first five instructions of a line).
- If accesses to the I-cache are expected to miss, it may be desirable to align targets on a 64-byte boundary, or at least near the front end of a block, so that four instructions are forwarded to the next stage. Such an alignment helps assure that the maximum number of instructions can be processed between cache misses, assuming that the code does not branch out of the sequence of instructions. In fact, 64-byte alignment can help instruction prefetch.
In general, it is best to align for maximum fetching by aligning branch targets on four-instruction (16-byte) boundaries. This can help ensure that the fetch bandwidth matches the issue width, which is a maximum of four instructions in the case of the UltraSPARC IV processor.
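As a hedged, compiler-level illustration of the alignment advice above: GCC's aligned attribute can place a frequently entered routine (a common branch target) on a 64-byte boundary, matching the cache-line case discussed. Alignment of branch targets inside a routine is normally controlled by compiler or assembler options (for example, .align directives emitted by the toolchain) rather than by source code, so the function below is only a sketch.

    /* Illustrative only: force a hot routine's entry point onto a
     * 64-byte boundary so that a full fetch group is available at the
     * call/branch target.  The 64-byte figure follows the I-cache line
     * size discussed above. */
    __attribute__((aligned(64)))
    void hot_loop_body(int *data, int n)
    {
        for (int i = 0; i < n; i++)
            data[i] += 1;
    }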
87. an alternative to displacement flushing The invalidated line will not be written back to the next level of cache when these ASI accesses are used Specifically data clean or modified in the L2 cache will not be written back to the L3 cache and the modified data in the L3 cache will not be written back to memory Prefetch Cache Flushing A context switch flushes the P cache When a write to a context register takes place all entries in the P cache are invalidated The entries in the prefetch queue also get invalidated such that data for outstanding prefetch requests will not get installed in the P cache once the data is returned 3 3 Coherence Tables The set of tables in this section describes the cache coherence protocol that governs the behavior of the processor on the Sun Fireplane Interconnect Caches Cache Coherency and Diagnostics 29 un microsystems 3 3 1 Processor State Transition and the Generated Transaction Tables in this section summarize the following e Hit Miss State Change and Transaction Generated for Processor Action e Combined Tag MTag States TABLE 3 2 defines the terms used in the subsequent tables Derivation of DTags CTags and MTags from Combined Tags is shown in TABLE 3 5 TABLE 3 2 Definitions of the Terms Term Meaning MOESI A cache coherence protocol M modified dirty data with no outstanding shared copy O owned dirty data with outstanding shared copy s E excl
88. and data parities are calculated in the I MMU and sent to the I TLB to be stored into the respective arrays When writing to the I TLB using the I TLB Diagnostic Register AST_ITLB_DIAG_REG both tag and data parities are explicitly specified in the to be stored data with bit 46 being the tag parity and bit 47 being the data parity see D MMU Synchronous Fault Status Registers D SFSR on page 334 When using AST_SRAM_FAST_INIT all tag and data parity bits will be cleared When reading the I TLB using the Data Access Register AST_ITLB_DATA_ACCESS_REG or the I TLB Diagnostic Register ASI_ITLB_DIAG_REG both tag and data parities are available in bit 46 and bit 47 of the I TLB Diagnostic Register During bypass mode both the tag and the data bypass the I TLB structure and no parity errors are generated Parity error generation is also suppressed during ASI reads Instruction and Data Memory Management Unit 305 un microsystems TABLE 13 3 summarizes the UltraSPARC IV processor I MMU parity error behavior TABLE 13 3 I MMU parity error behavior dee CU Parity Enable a DCR register T16 T512 Trap taken Control bit 16 Operation register bit 2 Data Tag hit parity parity error 0 x x no trap taken 1 0 x fast_instruction_access_MMU 1 0 x no trap taken 1 1 0 fast_instruction_access_
89. and written by any logical processor when running in privileged mode, by using special Address Space Identifiers (ASIs) in specific load and store instructions. ASIs are a feature of the SPARC instruction set, providing a convenient and flexible mechanism for mapping additional architectural state. SPARC architecture processors can access this additional state through special load and store instructions that take an ASI value together with an address (virtual address). Certain ASI values cause an access to the corresponding address in physical memory, but with behavior different from the default semantics of normal load and store operations. Other ASI values are used to access special locations reserved for storing configuration, diagnostic, or other vital information. The CMT Programming Model defines a number of new ASIs used specifically for accessing the CMT-specific registers.
Types of CMT Registers
The two main classes of CMT-specific registers are private registers and shared registers:
- Private registers: a private copy of the register is associated with each logical processor.
- Shared registers: a single copy of each register is shared by all the logical processors.
Both private and shared registers can be accessed as ASI-mapped registers by privileged software running on one of the logical processors. A logical processor can access only its own private registers, since it has no way to address the private registers of any other logical processor.
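As a hedged illustration of how privileged software reaches ASI-mapped registers, the helpers below wrap the LDXA/STXA accesses discussed above in GCC-style SPARC V9 inline assembly. The macro names are mine, not from this manual; the ASI number must be a compile-time constant because it is encoded in the instruction, and such accesses are legal only in privileged mode.

    /* Sketch of privileged ASI accessors (names are illustrative, not
     * from this manual).  The asi argument must be a compile-time
     * constant because LDXA/STXA encode the ASI as an immediate. */
    #include <stdint.h>

    #define ASI_LOAD64(va, asi)                                    \
    ({                                                             \
        uint64_t _v;                                               \
        __asm__ __volatile__("ldxa [%1] %2, %0"                    \
                             : "=r" (_v)                           \
                             : "r" (va), "i" (asi));               \
        _v;                                                        \
    })

    #define ASI_STORE64(va, data, asi)                             \
        __asm__ __volatile__("stxa %0, [%1] %2"                    \
                             : /* no outputs */                    \
                             : "r" (data), "r" (va), "i" (asi)     \
                             : "memory")

    #define MEMBAR_SYNC()                                          \
        __asm__ __volatile__("membar #Sync" : : : "memory")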
90. any instruction other than a block load causes a fast_ECC_error trap As described in Software Correctable L2 cache ECC Error Recovery Actions on page 160 these errors will be recoverable by the trap handler if the line at fault was in the E or S MOESI state Uncorrectable L2 cache data ECC errors can also occur on multi bit data errors detected as the result of the following transactions e Reads of data from an O Os or M state line to respond to an incoming snoop request copyout e Reads of data from an E S O or Os or M state line to write it back to L3 cache writeback e Reads of data from the L2 cache to merge with bytes being written by the processor W cache exclusive request Reads of data from the L2 cache to perform an operation placed in the prefetch queue by an explicit software PREFETCH instruction or a P cache hardware PREFETCH operation Reads of the L2 cache by the processor to perform an operation block load instruction 162 amp Sun microsystems 7 2 8 7 2 8 1 Error Handling For copyout the processor reading the uncorrectable error from its L2 cache sets its AFSR CPU bit In this case deliberately bad signalling ECC is sent with the data to the processor issuing the snoop request If the device issuing the snoop request is a processor it takes an instruction_access_error or data_access_error trap If the device issuing the snoop request is an T O device it will have some
91. 2'b00 = Way 0, 2'b01 = Way 1, 2'b10 = Way 2, 2'b11 = Way 3.
[13:5] DC_addr - A 9-bit index that selects a tag/valid field (512 tags).
[4:0] - Mandatory value. Should be 0.
The data format for D-cache microtag access is shown in TABLE 3-37.
TABLE 3-37 Data Cache Microtag Access Data Format
[63:8] - Mandatory value. Should be 0.
[7:0] DC_utag - DC_utag is the 8-bit virtual microtag (VA[21:14]) of the associated data.
Note: A MEMBAR #Sync is required before and after a load or store to ASI_DCACHE_UTAG. The data cache microtags must be initialized after power-on reset and before the data cache is enabled. For each of the four ways of each index of the data cache, the microtags must contain a unique value; for example, for index 0, the four microtags could be initialized to 0, 1, 2, and 3, respectively. The values need not be unique across indices; in the previous example, 0, 1, 2, and 3 could also be used in cache index 1, 2, 3, and so on.
Data Cache Snoop Tag Access
ASI 4416, per logical processor. Name: ASI_DCACHE_SNOOP_TAG
The address format for the D-cache snoop tag fields is shown in TABLE 3-38.
TABLE 3-38 Data Cache Snoop Tag Access Address Format
Bits / Field / Description
[63:16] - Mandatory value. Should be 0.
[15:14] DC_way - A 2-bit index that selects an associative way (4-way associative): 2'b00 = Way 0,
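A minimal initialization sketch for the microtags described in the note above follows. It is illustrative only: ASI_DCACHE_UTAG_NUM is a placeholder for the ASI number of ASI_DCACHE_UTAG (take the documented value from this chapter), and the address encoding (way in bits 15:14, index in bits 13:5) follows the access formats given in this section.

    /* Sketch only: give each of the four ways at every D-cache index a
     * unique DC_utag value after power-on reset, with MEMBAR #Sync
     * before and after each diagnostic store as required above. */
    #include <stdint.h>

    #define ASI_DCACHE_UTAG_NUM 0x43   /* placeholder: use the documented ASI value */

    #define STXA(va, data, asi) \
        __asm__ __volatile__("stxa %0, [%1] %2" :: "r" (data), "r" (va), "i" (asi) : "memory")
    #define MEMBAR_SYNC() __asm__ __volatile__("membar #Sync" ::: "memory")

    static void dcache_utag_init(void)
    {
        for (uint64_t index = 0; index < 512; index++) {       /* 512 tags per way */
            for (uint64_t way = 0; way < 4; way++) {            /* 4-way associative */
                uint64_t va = (way << 14) | (index << 5);       /* DC_way, DC_addr */
                MEMBAR_SYNC();
                STXA(va, way, ASI_DCACHE_UTAG_NUM);             /* DC_utag = 0..3 per way */
                MEMBAR_SYNC();
            }
        }
    }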
92. be generated in this case No data will be installed in the P cache 206 amp Sun microsystems Error Handling Note that when the line moved from L3 cache to L2 cache raw data read from L3 cache without correction is stored in L2 cache Since L2 cache and L3 cache are mutually exclusive once the line is read from L3 cache to L2 cache it will not exist in L3 cache A software initiated L2 cache flush which will flush the data back to L3 cache is desirable so that the corrected data can be brought back from L3 cache later Without the L2 cache flush a further single bit error is likely the next time this word is fetched from L2 cache If both the L3_EDC and the L3_MECC bits are set it indicates that an address parity error has occurred L3_EDU The AFSR EDU status bit is set by errors in block loads to the processor and errors in reading L3 cache to merge data to complete store like operations and prefetch queue operations When a block load instruction misses the D cache and hits the L3 cache the line is moved from the L3 cache to L2 cache and will be checked for the correctness of its ECC If a multi bit error is detected it will be recognized as an uncorrectable error and the L3_EDU bit will be set to log this error condition A deferred data_access_error trap will be generated in this case provided that the NCEEN bit is set in the Error Enable Register When a store instruction misses the W cache and hits L3 cache in
93. by the FSR NS bit e When FSR NS 1 non standard mode is selected e When GSR IM 1 interval arithmetic rounding mode is selected In that case the processor will be in standard mode regardless of the FSR NS bit Memory and Register Data Images Floating point values are represented in the floating point f registers in the same way that they are represented in memory Any conversions for ALU operations are completed within the floating point execution unit Load and store operations do not modify the register value VIS instructions logical and move copy operations can be used with values generated by the floating point unit Subnormal Operations Subnormal operations include operations with subnormal number operands and operations without subnormal number operands that generate a subnormal number result The floating point unit response to subnormal numbers is described in Subnormal Operations on page 138 FSR CEXC and FSR AEXC Updates The current exception CEXC and accrued exception AEXC fields in the FSR register are described in JEEE Traps on page 133 FPops update these fields in the following situations e CEXC Only floating point operations will update CEXC and only when an exceptional condition is detected All other instructions will leave CEXC unchanged e AEXC When an exception is detected and the trap is masked the FPop will update the appropriate AEXC field of the FSR register Prediction Logic
94. cache.
5. Log the error.
6. Clear AFSR.{WDC, WDU, CPC, CPU, THCE} and AFSR_EXT.{L3_UCC, L3_UCU, L3_WDC, L3_WDU, L3_CPC, L3_CPU, L3_THCE}.
7. Clear AFSR2 and AFSR2_EXT.
8. Displacement flush any cacheable fast_ECC_error exception vector, or cacheable fast_ECC_error trap handler code or data, from the L2 cache to the L3 cache.
9. Displacement flush any cacheable fast_ECC_error exception vector, or cacheable fast_ECC_error trap handler code or data, from the L3 cache.
10. Re-execute the instruction that caused the error using RETRY.
Data in error is stored in the D-cache. If the data was read from the L3 cache as the result of a load instruction or an atomic instruction, corrupt data will be stored in the D-cache. However, if the data was read as the result of a block load instruction, corrupt data will not be stored in the D-cache. Store instructions never cause fast_ECC_error traps directly, just load and atomic instructions; store-like instructions never result in corrupt data being loaded into the D-cache.
The entire D-cache is invalidated because there are circumstances in which the AFAR used by the trap routine does not point to the line in error in the D-cache. This can happen when multiple errors are reported, and when an instruction fetch (not a prefetch queue operation) has logged an L3 cache error in AFSR_EXT and AFAR but has not generated a trap due to a misfetch. Displacing the wrong line by mistake from the D-cache
95. cache The instruction cache is a 64 KB 4 way set associative cache with a 64 byte line size Each I cache line is divided into two 32 byte subblocks with separate valid bits The I cache is a write invalidate cache It uses a pseudo random replacement policy I cache tag and data arrays are parity protected Instruction fetches bypass the I cache in the following cases e The I cache enable IC bit in the Data Cache Unit Control Register is not set DCUCR IC 0 e The I MMU is disabled DCUCR IM O and the CV bit in the Data Cache Unit Control Register is not set DCUCR CV 0 e The processor is in RED_state e The fetch is mapped by the I MMU as being non virtual cacheable The I cache snoops stores from other processors or DMA transfers as well as stores in the same processor and block store commits Caches Cache Coherency and Diagnostics 24 amp Sun microsystems The FLUSH instruction is not required to maintain coherency Stores and block store commits invalidate the I cache but do not flush instructions that have already been prefetched into the logical processor A FLUSH DONE or RETRY instruction can be used to flush the logical processor Note If a program changes I cache mode to I cache ON from I cache OFF then the next instruction fetch always causes an I cache miss even if it is supposed to hit This rule applies even when the DONE instruction turns on the I cache by changing its status from RE
96. cache IPB data is enabled by DCR IPE in the dispatch control register When this bit is 0 parity will still be correctly generated and installed in the I cache IPB arrays but will not be checked Parity error checking is also enabled separately for each line by the line s valid bit If a line is not valid in the I cache IPB then tag and data parity for that line is not checked I cache physical tag or I cache IPB data parity errors are not checked for non cacheable accesses or when DCUCR IC is 0 Although I cache IPB errors are not logged in AFSR trap software can log I cache IPB errors 144 un microsystems 7 2 1 1 hed 7 2 1 3 Error Handling I cache Physical Tag Errors I cache physical tag errors can only occur as the result of an instruction fetch I cache entries that have been invalidated by bus coherence traffic do not use the physical tag array For each instruction fetch the microtags select the correct bank of physical tag array The resulting physical tag data value is then checked for parity errors and can be compared in parallel with the fetched physical address for determining I cache hit miss If there is a parity error in the selected bank of physical tag data a parity error is reported regardless whether the result was an I cache hit or miss When the trap is taken TPC will point to the beginning of the fetch group An I cache physical tag error is only reported for instructions that are actually exec
97. can, if needed, then be placed directly in the cache, without executing them, by diagnostic accesses.
To test hardware and software for IPB parity error recovery, programs need to load an instruction into the IPB. Depending on the I-cache prefetch stride, the instruction block at the 64-, 128-, or 192-byte offset from the current instruction block will be prefetched into the IPB. Once the instruction block is loaded into the IPB, use the ASI_IPB_DATA (ASI 0x69) diagnostic access to flip the parity bit of the instruction in the IPB entry. Upon executing the modified instruction, a precise trap should be generated. If no trap is generated, the program should check the IPB using ASI_IPB_DATA to see whether the instruction has been repaired; this would be a sign that the broken instruction had been displaced from the IPB before it was executed. Iterating this test for each entry of the IPB can check that each covered bit of the IPB data is actually connected to its parity generator.
Testing the I-cache snoop tag error recovery hardware is harder, but one example is:
- First, execute multiple instructions that map to the same I-cache tag index, so that all the ways of the I-cache are filled with valid instructions.
- Then, insert a parity error in one of the I-cache snoop tag entries for this index, using the ASI_ICACHE_SNOOP_TAG (ASI 0x68) diagnostic access.
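The parity-flip step of the IPB test above might look like the following sketch, which performs a read-modify-write of one IPB data entry through the ASI 0x69 diagnostic access. The entry address encoding and the position of the parity bit are placeholders here; take the real ASI_IPB_DATA address and data formats from the diagnostics chapter.

    /* Sketch only: inject a parity error into one IPB data entry via the
     * ASI_IPB_DATA (ASI 0x69) diagnostic access, as in the procedure
     * described above.  IPB_ENTRY_VA and IPB_PARITY_BIT are placeholders. */
    #include <stdint.h>

    #define ASI_IPB_DATA    0x69
    #define IPB_ENTRY_VA    0x0            /* placeholder entry address */
    #define IPB_PARITY_BIT  (1ULL << 32)   /* placeholder parity-bit position */

    static void ipb_flip_parity(void)
    {
        uint64_t v;
        __asm__ __volatile__("membar #Sync" ::: "memory");
        __asm__ __volatile__("ldxa [%1] %2, %0"
                             : "=r" (v)
                             : "r" ((uint64_t)IPB_ENTRY_VA), "i" (ASI_IPB_DATA));
        v ^= IPB_PARITY_BIT;               /* corrupt the stored parity bit */
        __asm__ __volatile__("stxa %0, [%1] %2"
                             :: "r" (v), "r" ((uint64_t)IPB_ENTRY_VA), "i" (ASI_IPB_DATA)
                             : "memory");
        __asm__ __volatile__("membar #Sync" ::: "memory");
    }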
98. copyout from the L3 cache, or a snoop or tag update due to a foreign transaction, encounters a HW_corrected L3 cache tag ECC error, the CMT error steering register will be used to decide which logical processor logs L3_THCE.
When the L2_tag_ecc_en bit of ASI_L2CACHE_CONTROL (ASI 0x6D) is set to 0, no L2 cache tag error is reported. This applies to TUE_SH, TUE, and THCE.
When the ET_ECC_en bit of ASI_L3CACHE_CONTROL (ASI 0x75) is set to 0, no L3 cache tag error is reported. This applies to L3_TUE_SH, L3_TUE, and L3_THCE.
When the IPE or DPE bit of the Dispatch Control Register (DCR, ASR 0x12) is set to 0, no I-cache or D-cache data/tag parity error is reported, respectively.
Trap types:
I - instruction_access_error trap (because the error is always the result of an instruction fetch); always deferred.
D - data_access_error trap; always deferred.
C - ECC_error trap; always disrupting.
FC - fast_ECC_error trap; always precise.
IP - icache_parity_error trap; always precise.
DP - dcache_parity_error trap; always precise.
none - no trap is taken; the processor continues normal execution.
FERR - fatal error. If there is a 1 in the FERR column, the processor will assert its ERROR output pin for this event. Detailed processor behavior is not specified; it is expected that the system will reset the processor.
Priority: All priority entries in TABLE 7-16 work as follows. AFAR1 and AFSR1/AFSR1_EXT have an overwrite policy. Associated with the AFAR1 and with
99. data No action Disrupting trap W cache gets W cache exclusive request with CE in the e E the Pine critical 32 byte or the 2nd 32 byte L3 cache L3_EDC No moved Ome permission to modify the Disrupting trap data modify the cache to L2 cache data data Disrupting trap when the line is evicted out from W cache gets the W cache the permiss W cache again based on Original data i 8 ion to modify the UE status bit W cache exclusive request with UE in the fy i S original ecc proceeds to sa 32 byte or the 2nd 32 byte L3 cache L3_EDU No moved from ae e Ee UE modifyihe sent from L2 W E cache to L2 cache TT data cache flips the 2 stored in W least significant cache ecc check bits C 1 0 in both lower and upper 16 byte TABLE 7 21 L3 cache ASI Access Errors as fast ecc L3 cache E Pipeline Event logged in mates daia cache Acun Comment AFSR R data No trap Direct ASI L3 cache data read Original ASI access will get request with CE in the 1st 32 No error data Noaction corrected data if byte or 2nd 32 byte L3 cache logged ec_ecc_en is asserted original ecc fe data Otherwise the original data is returned Direct ASI L3 cache data read request with tag UE in the Let No error Gre E 32 byte or 2nd 32 byte L3 logged ce original ecc cache data Direct ASI L3 cache data write Se request with CE in the 1st 32 No error No Ve N A No action No trap byte or 2nd 32 byte L3 cache log
100. data installed M garbage data not installed in P cache No action Disrupting trap the 2 least significant data bits 1 0 in both lower and upper 16 byte are flipped cacheable stores W cache exclusive fill request with BERR in the critical 32 byte data from system bus OR cacheable stores W cache exclusive fill request with BERR in the 2nd 32 byte data from system bus DBERR No garbage data installed M W cache gets the permission to modify the data after all 64 byte of data have been received W cache proceeds to modify the data and UE information is sent to and stored in W cache Disrupting trap the 2 least significant data bits 1 0 in both lower and upper 16 byte are flipped Error Handling 250 un microsystems Error Handling TABLE 7 28 System Bus EMC and EMU errors J of 2 Event Error logged AFSR Error Pin Comment I cache fill request with CE in the microtag of the critical 32 byte data from system bus OR I cache fill request with CE in the microtag of the non critical 2nd 32 byte data from system bus I cache fill request with UE in the microtag of the critical 32 byte data from system bus OR I cache fill request with UE in the microtag of the non critical 2nd 32 byte data from system bus D cache fill request with CE in the microtag of the critical 32 byte data from system bus OR D cache fill request with CE
101. data fill Hardware corrected errors optionally produce an ECC_error disrupting trap enabled by the CEEN bit in the Error Enable Register to carry out error logging Hardware corrected L3 cache tag errors set AFSR_EXT L3_THCE and log the access physical address in AFAR In contrast to L3 cache data ECC errors AFSR E_SYND is not captured For hardware corrected L3 cache tag errors the processor actually writes the corrected entry back to the L3 cache tag array Future accesses to this same tag should see no error This is in contrast to Hardware corrected L3 cache data ECC errors for which the processor corrects the data but does not write it back to the L3 cache This rewrite correction activity by the processor manages still to maintain the full required snoop latency and also obey the coherence ordering rules Uncorrectable L3 cache Tag ECC Errors An uncorrectable L3 cache tag ECC error may be detected as the result of any operation which reads the L3 cache tags All uncorrectable L3 cache tag ECC errors are fatal errors The processor will assert its ERROR output pin The event will be logged in AFSR_EXT L3_TUE or AFSR_EXT TUE_SH and AFAR 164 amp Sun microsystems 7 2 8 4 7 2 8 5 7 2 8 6 Error Handling All uncorrectable L3 cache tag ECC errors for tag update or copyout due to foreign bus transactions or snoop request will set AFSR_EXT L3_TUE_SH All uncorrectable L3 cache tag ECC errors for fill tag update due to
102. data format of the instruction prefetch buffer tag array access is shown in TABLE 3-25.
TABLE 3-25 Instruction Prefetch Buffer Tag Field Write Access Data Format
Bits / Field / Description
[63:42] - Mandatory value. Should be 0.
[41] IPB_valid - Valid bit of the 32-byte sub-block of an IPB entry.
[40:8] IPB_tag - IPB_tag represents the 33-bit physical tag.
[5:0] - Mandatory value. Should be 0.
Since the IPB_tag array is small, it is not parity protected.
3.7 Branch Prediction Diagnostic Accesses
3.7.1 Branch Predictor Array Accesses
ASI 6F16, per logical processor. Name: ASI_BRANCH_PREDICTION_ARRAY
The address format of the branch prediction array access is shown in TABLE 3-26.
TABLE 3-26 Branch Prediction Array Access Address Format
Bits / Field / Description
[63:16] - Mandatory value. Should be 0.
[15:3] BPA_addr - BPA_addr is a 13-bit index (VA[15:3]) that selects a branch prediction array entry.
[2:0] - Mandatory value. Should be 0.
The branch prediction array entry is shown in TABLE 3-27.
TABLE 3-27 Branch Prediction Array Data Format
Bits / Field / Description
[63:4] - Undefined. The value of these bits is undefined on reads and must be masked off by software.
[3:2] PNT_bits - The two predict bits if the last prediction was NOT_TAKEN.
[1:0] PT_bits - The two predict bits if the last prediction was TAKEN.
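A hedged sketch of reading one entry through this diagnostic access follows. It places the index in VA[15:3] as TABLE 3-26 specifies and masks off the undefined upper bits as TABLE 3-27 requires; the function name is mine, not from the manual.

    /* Sketch only: read one branch prediction array entry via
     * ASI_BRANCH_PREDICTION_ARRAY (ASI 0x6F).  The 13-bit index goes in
     * VA[15:3]; bits 63:4 of the returned data are undefined and are
     * masked off, leaving PNT_bits [3:2] and PT_bits [1:0]. */
    #include <stdint.h>

    static uint64_t bpa_read_entry(uint64_t index)
    {
        uint64_t va = (index & 0x1FFF) << 3;
        uint64_t v;
        __asm__ __volatile__("ldxa [%1] %2, %0"
                             : "=r" (v) : "r" (va), "i" (0x6F));
        return v & 0xF;    /* keep only PNT_bits and PT_bits */
    }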
103. device specific error reporting mechanism which the device driver must handle The processor being snooped logs error information in AFSR For copyout the processor reading the uncorrectable error from its L2 cache sets its AFSR CPU bit For writeback the processor reading the uncorrectable error from its L2 cache sets its AFSR WDU bit and the uncorrectable writeback data is written into L3 cache If an uncorrectable L2 cache Data ECC error occurs as the result of a copyout deliberately bad signalling ECC is sent with the data to the system bus Correct system bus ECC for the uncorrectably corrupted data is computed and transmitted on the system bus and data bits 127 126 are inverted as the corrupt 128 bit word is transmitted on the system bus This signals to other devices that the word is corrupt and should not be used The error information is logged in the AFSR and an optional disrupting ECC_error trap is generated if the NCEEN bit is set in the Error Enable Register Software should log the copyout error so that a subsequent uncorrectable system bus data ECC error can be correlated back to the L2 cache data ECC error For an uncorrectable L2 cache data ECC error as the result of a exclusive request from the store queue or W cache or as the result of an operation placed in the prefetch queue by an explicit software PREFETCH instruction the processor sets AFSR EDU Data can be read from L2 cache by the W cache exclusive request On these
104. effect No effect No effect EE store data Written with fault status of faulting Written with Written with VA instruction and virtual TLB No effect No effect and context of page sizes at address of miss access faulting context for l faulting two 2 way set instruction associative TLB Instruction and Data Memory Management Unit 327 un microsystems 13 2 3 Translation Table Entry TTE The Translation Table Entry TTE holds the information for a single page mapping The TTE is divided into two 64 bit words representing the tag and data of the translation Just as in a hardware cache the tag is used to determine whether there is a hit in the TSB if there is a hit then the data is fetched by software The configuration of the TTE is described in TABLE 13 21 see also the UltraSPARC IIT Cu Processor User s Manual Note All bootbus addresses must be mapped as side effect pages with the TTE E bit set TABLE 13 21 TTE Data Field Descriptions J of 2 Bit Field Description 63 V See the UltraSPARC III Cu Processor User s Manual 62 61 Bit 62 61 represent the least significant 2 bits of the page size Size 2 concatenated with Size 1 0 encodes the page size for D TLB as follows 000 8 KB 001 64 KB Size l 0 Van 512 KB 011 4 MB 100 32 MB 101 256 MB 60 NFO See the UltraSPARC III Cu Processor User s Manual 59 IE See the UltraSPARC III Cu Processor User s Manual 58 50
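As an illustration of the split size field, the sketch below packs a 3-bit page-size code into a TTE data word, with Size[1:0] in bits 62:61 as described in TABLE 13-21. Size[2] is assumed here to occupy bit 48, matching the TLB diagnostic-register layout shown earlier in this chapter; confirm against the full table before relying on it. The function name is illustrative.

    /* Sketch only: place a 3-bit page-size code into the split TTE size
     * field (Size[1:0] -> bits 62:61, Size[2] -> bit 48, the latter an
     * assumption noted above).  size_code uses the listed encoding:
     * 0 = 8KB, 1 = 64KB, 2 = 512KB, 3 = 4MB, 4 = 32MB, 5 = 256MB. */
    #include <stdint.h>

    static uint64_t tte_set_page_size(uint64_t tte_data, unsigned size_code)
    {
        tte_data &= ~((3ULL << 61) | (1ULL << 48));              /* clear old size bits */
        tte_data |= ((uint64_t)(size_code & 0x3)) << 61;         /* Size[1:0] */
        tte_data |= ((uint64_t)((size_code >> 2) & 0x1)) << 48;  /* Size[2] (assumed bit) */
        return tte_data;
    }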
106. error trap ever generated due to L2 cache Tag access 80 un microsystems TABLE 3 72 The UltraSPARC IV processor OBP Backward Compatibility List 2 of 2 The UltraSPARC IV processor No New Feat 4 0 ew Feature Bits Hard _POR State The UltraSPARC IV processor Behavior on UltraSPARC OBP Software siu_data_mode 5 b01111 Default to 8 8 8 L3 cache mode with 8 1 Sun Fireplane Interconnect clock ratio ET_off EC DAR force _LHS The on chip L3 cache tag RAM is enabled No bit flip on the address to the left hand side L3 cache SRAM DIMM EC_PAR_ force _RHS No bit flip on the address to the right hand side L3 cache SRAM DIMM EC_RW_grp_en EC_split_en pf2_RTO_en No read bypass write feature in L3 cache access All four ways in L3 cache can be used for replacement triggered by either logical processor No prefetch request can send RTO on the bus ET_ECC_en Default to L3 cache Tag ECC checking disabled NO error trap ever generated due to L3 cache Tag access ASIL3CACHE_CTRL 7516 EC_assoc addr_setup 2 b01 Late write SRAM hardwired to 1 b0 Default to 2 cycle SRAM addr setup w r t SRAM clock trace_out trace_in 4 b0101 4 b0110 Default to 8 cycles 8 8 8 L3 cache mode Default to 8 cycles 8 8 8 L3 cache mode EC_turn_rw Default to 2 SRAM cycles between read write EC_early EC_size
107. fill request with CE in the microtag of the non critical 2nd 32 byte data from system bus EMC Disrupting trap 251 un microsystems Error Handling TABLE 7 28 System Bus EMC and EMU errors 2 of 2 Error 2nd 32 byte data from system bus stores W cache exclusive fill request with UE in the microtag of the non critical data is not delivered to L2 cache SIU will not report the error Event logged Benes Comment AFSR Sg stores W cache exclusive fill request with UE in the microtag of the critical 32 byte data from system bus OR EMU Yes Deferred trap Note For microtag error when data is delivered to L2 cache then SIU will report the error If the D cache FP 64 bit load means any of the following 4 kinds of FP load instructions that is LDDF or LDF or LDDFA with some ASIs or LDFA with some ASIs where the cited ASIs are tabulated below TABLE 7 29 System Bus IVC and IVU errors Interrupt vector with UE in the microtag of the non critical 2nd 32 byte data from system bus interrupt data Error Interrupt Vector Event lesed Receive Register Interrup tData Comment m Busy bit setting EE AFSR v Interrupt vector with CE in the critical 32 byte data from system bus Disrupting trap OR IVC No Yes Bee interrupt data Interrupt vector with CE in the non critical 2nd Interrupt taken 32 byte data from system bus Interrupt vector with UE in th
108. floating point integer sign bit If 0 changes it to 1 if 1 changes it to 0 e Does not change any other bit regardless of register content 6 3 11 f Register Load Store Operations Load store operations for the f register include the following Load single floating point LDF instruction writes to a 32 bit register This value must be converted to a 64 bit value F sTOd for use with double precision instructions Load double floating point LDDF instruction can write to a pair of adjacent 32 bit f registers aligned to an even boundary LDDF can also write to a 64 bit register The value must be converted to a 32 bit value FdTOs for use with single precision instructions Two LDF instructions can be used to load a 64 bit value when the memory address alignment to 64 bits is not guaranteed Two STF instructions can be used to store a 64 bit value when the memory address alignment to 64 bits is not guaranteed 6 3 12 VIS Operations The floating point unit must be enabled to execute VIS instructions VIS instructions do not generate interrupts unless the floating point unit is disabled VIS instructions are unaffected by floating point models 6 4 Traps and Exceptions Three trap vectors are defined for floating point operations e fp_disabled e fp_exception_IEEE_754 See IEEE Traps on page 133 e fp_exception_other 6 4 1 fp_disabled Trap The floating point unit can be enabled and disabled IEEE 754 1985 Standard 1
109. flushed In this case the trap handler should flush all D cache contents from the processor to be sure of flushing all the required lines 176 amp Sun microsystems 13 7 3 1 7 3 1 1 Error Handling Error Registers Shared Error Reporting Shared errors are more complicated than private errors When a shared error occurs it must be recorded one of the logical processor within the CMT must be trapped to deal with the error By definition shared errors are asynchronous errors if they could be identified with an instruction they could be identified with a logical processor that occur in shared resources Even within this set as many errors as possible are attributed to a specific core and reported as a private errors Where to record the error and which logical processor to trap is addressed in the following subsections Note MEMBAR Sync is generally needed after stores to error ASI Registers CMT Error Steering Register AST_CMP_ERROR_STEERING The CMT Error Steering register is a shared register accessible from all logical processors as well as being accessible from the system controller This register is used to control and identify which logical processor is responsible for handling all shared errors The specified logical processor has the error recorded in its asynchronous error reporting mechanism and takes an appropriate trap to resolve the error Whenever an error occurs that cannot be iden
110. for larger page sizes are stored in TLB that is VA 15 13 VA 18 13 and VA 21 13 for 64 KB 512 KB and 4 MB pages respectively These values are ignored R during normal translation The 13 bit context identifier R Description R W 315 amp Sun microsystems 13 1 9 2 13 1 9 3 13 1 9 4 Note An ASI load from the I TLB Diagnostic Register initiates an internal read of the data portion of the specified I TLB If any instruction that misses the I TLB is followed by a diagnostic read access LDXA from ASIT_ITLB_DATA_ACCESS_REG i e ASI 0x55 from the fully associative TLBs and the target TTE has the page size set to 64 KB 512 KB or 4 MB then the data returned from the TLB will be incorrect This issue can be overcome by reading the fully associative TLB TTE twice back to back The first access may return incorrect data if the above conditions are met however the second access will return correct data I MMU TSB I TSB Base Register ASI 5016 VA 63 0 2816 Name AGT MM Top BASE Access RW The I MMU TSB Base Register follows the same format as the D MMU TSB Please refer to Tag Read Register on page 333 I MMU TSB I TSB Extension Registers ASI 5016 VA 63 0 4816 Name ASI_IMMU_TSB_PEXT_REG Access RW ASI 5016 VA 63 0 5816 Name ASI_IMMU_TSB_NEXT_REG Access RW Please refer to the Ultr
111. for prefetch queue operations does not depend on the privilege state of the original PREFETCH instruction In the UltraSPARC IV processor any error returned as the result of a prefetch queue operation is correctly reported to the operating system for logging Memory Errors and Interrupt Transmission System bus data ECC errors for an interrupt vector fetch operation are treated in a special manner HW_corrected interrupt vector data ECC errors set AFSR IVC not AFSR CE and correct the error in hardware before writing the vector data into the interrupt receive registers and generating an interrupt_vector trap Uncorrectable interrupt vector data ECC errors set AFSR IVU not AFSR UE or DUE write the received vector data unchanged into the interrupt receive registers and do not generate an interrupt_vector disrupting trap AFSR E_SYND will be captured AFAR will not be captured AFSR PRIV will be updated with the state that happens to be in PSTATE PRIV at the time the event is detected System bus microtag ECC errors for interrupt vector fetches whether in an SSM system or not are treated exactly as though the bus cycle was a read access to I O or memory AFSR IMC or IMU is set M_SYND and AFAR will be captured The value captured in AFAR is not meaningful AFSR PRIV will be updated with the state that happens to be in PSTATE PRIV at the time the event is detected For AFSR IMU events the processor will assert its ERROR output pin an
112. in XIR Steering Register ASI_XIR_STEERING on page 21 should be used The second way to specify the subset is to specify the subset concurrently with delivering the reset across the interface used for communicating the reset This method would require that the interface used for communicating resets supports sending packets of information along with the resets XIR Steering Register ASI_XIR_STEERING The XIR reset can be steered only to specific logical processors under the control of the XIR Steering register described in TABLE 2 9 Name ASI_XIR_STEERING ASI 0x41 VA 63 0 0x30 Privileged Read Write TABLE 2 9 XIR Steering Register Shared Bit Field Description 63 2 Mandatory value Should be 0 1 LP 1 This bit represents LP 1 0 LP 0 This bit represents LP 0 The XIR Steering register is a 64 bit register out of which only bits 1 0 are used in the UltraSPARC IV processor Each bit of the register represents one logical processor with bit 0 representing LP 0 and bit 1 representing LP 1 An XIR is blocked to a logical processor if the corresponding bit is 0 Hardware will force a 0 for unimplemented logical processors State After Reset At the end of a system reset or equivalent reset the value of the XIR reset is equal to the value of the LP Enable Status register which in turn is equal to the value of the LP Enable register 2 6 Private and Shared Registers
113. in the UltraSPARC IV processor reducing the need for frequent Sun Fireplane Interconnect flow control operations Supports Larger Main Memories The memory controller in the UltraSPARC IV processor has been redesigned to support higher density DRAM up to a maximum configuration of 32GB of memory per processor applicable to both CK DIMM and NG DIMM This compares with a maximum configuration of 16GB of memory per processor for earlier family members In SMP configurations the per processor memory capacity of the UltraSPARC III IV processor family is additive An SMP system based on four of the UltraSPARC IV processors could support a maximum of 128GB of memory which is twice the 64GB possible with earlier family members 1 6 Architectural Overview Enhanced Error Detection and Correction In addition to the many new features intended to improve performance the reliability availability and serviceability RAS of the UltraSPARC IV processor have been significantly enhanced Although the UltraSPARC IV processor design is much more complicated than the UltraSPARC III processor design requiring a larger die and over ten times as many transistors to implement it actually offers a lower fault rate than any previous family member All the large memory arrays in the UltraSPARC IV processor design have error detection and recovery logic associated with them Like earlier family members the UltraSPARC IV processor protects its cl
114. in the microtag of the non critical 2nd 32 byte data from system bus EMC EMU EMC No No Disrupting trap Deferred trap Disrupting trap D cache fill request with UE in the microtag of the critical 32 byte data from system bus OR D cache fill request with UE in the microtag of the non critical 2nd 32 byte data from system bus EMU Deferred trap D cache atomic fill request with CE in the microtag of the critical 32 byte data from system bus OR D cache atomic fill request with CE in the microtag of the non critical 2nd 32 byte data from system bus D cache atomic fill request with UE in the microtag of the critical 32 byte data from system bus OR D cache atomic fill request with UE in the microtag of the non critical 2nd 32 byte data from system bus P cache fill request with CE in the microtag of the critical 32 byte data from system bus OR P cache fill request with CE in the microtag of the non critical 2nd 32 byte data from system bus EMC EMU EMC No Yes Disrupting trap Deferred trap Disrupting trap P cache fill request with UE in the microtag of the critical 32 byte data from system bus OR P cache fill request with UE in the microtag of the non critical 2nd 32 byte data from system bus EMU Deferred trap stores W cache exclusive fill request with CE in the microtag of the critical 32 byte data from system bus OR stores W cache exclusive
115. instruction prefetch buffer IPB The count includes some fills related to wrong path instructions The count is updated on 64 Byte granularity Number of I cache prefetch requests sent to L2 cache IC_L2_req PICU IC_miss_cancelled PICL Number of I cache requests sent to L2 cache This includes both I cache miss requests and I cache prefetch requests The count does not include non cacheable acesses to the I cache Note that some of the I cache requests sent to L2 cache may not eventually be filled into the I cache Number of I cache miss requests cancelled due to new fetch stream The cancellation may be due to misspeculation recycle or other events ITLB_miss PICU Data Cache DC_rd PICL DC_rd_miss PICU Number of I TLB miss traps taken Number of D cache read references by cacheable loads excluding block loads References by all cacheable load instructions including LDD are considered as 1 reference each The count is only updated for load instructions that retired Number of cacheable loads excluding atomics and block loads that miss D cache as well as P cache for FP loads The count is only updated for load instructions that retired DC_wr PICU Number of D cache write references by cacheable stores excluding block stores References by all cacheable store instructions including STD and atomic are counted as 1 reference each The count is only updated for store instru
116. issued every 16 or 32 cycles as specified by L2L3arb_single_issue_frequency After hard rest this bit will be read as 1 b0 If set enables the following scheme for misses generated by LPO allocate only in ways 0 and 1 of the L2 cache and for misses generated by LP1 2 L2_split_en allocate only in ways 2 and 3 of the L2 cache After hard rest this bit will be read as 1 b0 If set enables ECC checking on L2 cache data bits 1 L2_data_ecc_en SEA After hard rest this bit will be read as 1 b0 If set enables ECC checking on L2 cache tag bits If ECC checking is disabled there will never be ECC errors due to L2 cache tag access However 0 L2_tag_ecc_en ECC generation and write to L2 cache tag will still occur in correct manner After hard rest this bit will be read as 1 b0 Note Bit 63 16 of the L2 cache control register are reserved They are treated as don t care for ASI write operations and read as zero for ASI read operations The L2_off field in the L2 cache control register should not be changed during run time Otherwise the behavior is undefined The rest of the fields can be programmed during run time without breaking the functionality and they will take effect without requiring a system reset Bits 15 13 Queue_timeout_detected WC1_status WCU status bits are sticky status bits They are set by the hardware when the associated event occurs Multiple bits can be set by the hardware These bits are not ASI writab
117. it is possible for the different aliased virtual addresses to end up in different cache blocks Such aliases are illegal because updates to one cache block will not be reflected in aliased cache blocks For example consider two virtual addresses A and B with the same VA 12 0 bits and different VA 13 bit map to the same physical address PA 12 0 VA 12 0 Now if both A and B are loaded into the D cache they will be mapped to different D cache blocks since the D cache index VA 13 5 for A and B are different due to bit 13 Such address aliasing is illegal since stores to one aliased block say A will not update the other aliased block B as they are mapped to different blocks Normally software avoids illegal aliasing by forcing aliases to have the same address bits known as virtual color up to an alias boundary The minimum alias boundary is 16 KB This size may increase in future designs When the alias boundary is violated software must flush the I cache or the D cache if the page was virtual cacheable In this case only one mapping of the physical page can be allowed in the I MMU or D MMU at a time Caches Cache Coherency and Diagnostics 28 amp Sun microsystems D222 320 3 2 4 Alternatively software can turn off virtual caching of illegally aliased pages Doing so allows multiple mapping of the alias to be in the I MMU or D MMU and avoids flushing of the I cache or the D cache each time a different mappi
118. it will be recognized as an uncorrectable error and the EDU bit will be set to log this error condition A deferred data_access_error trap will be generated in this case provided that the NCEEN bit is set in the Error Enable Register A software PREFETCH instruction writes a command to the prefetch queue which operates autonomously from the execution unit An uncorrectable L2 cache data ECC error as the result of a read operation initiated by a prefetch queue entry will set EDU No data will be stored in the P cache 202 amp Sun microsystems Error Handling When a store instruction misses the W cache and hits the L2 cache in the E or M state a store queue will issue an exclusive request to the L2 cache The exclusive request will perform an L2 cache read and the data read back from the L2 cache SRAM will be checked for the correctness of its ECC If uncorrectable ECC error is detected the EDU bit will be set to log this error condition A disrupting ECC_error trap will be generated in this case provided that the NCEEN bit is set in the Error Enable Register When the L2 cache data is delivered to the W cache the associated UE information will be stored with data When W cache evicts a line with UE data information it will generate ECC based on the data stored in W cache then ECC check bits C 1 0 of the 128 bit word are inverted before the word is scrubbed back to the L2 cache When an atomic instruction misses from the
119. its ERROR output pin so whether the trap is ever executed depends on system design Bus error BERR as the result of a system bus read of memory or I O for I fetch load block load or atomic operations Timeout TO as the result of the following e A system bus read of memory or I O for I fetch load block load or atomic operations e A system bus write of memory for block store and writeback operations e A system bus write of I O for store and block store operations e An interrupt vector transmit operation 8 1 2 5 Special Access Sequence for Recovering Deferred Traps A special access sequence is required for intentional peeks and pokes to determine device presence and correctness and for I O accesses from hardened drivers which must survive faults in an I O device This special access sequence allows the data_access_error trap handler to recover predictably even though the trap is deferred One possible sequence is described here Exceptions Traps and Trap Types 255 un microsystems The_peeker lt Set the peek_sequence_flag to indicate that a special peek sequenc is about to occur This flag includes specifying the handler as Special_peek_handler if a deferred TO BERR does occur gt MEMBAR Sync error barrier for deferred traps 1 See explanation below lt Call routine to do the peek gt lt Reset the peek_sequence_flag gt lt Check success failure indication from peek gt Do_the_pee
120. like instructions never cause system bus reads for stores that will not be executed Memory Errors and Hardware PREFETCH The P cache can autonomously read data from the L2 cache into the internal P cache This is called hardware prefetch This never generates system bus activity In the UltraSPARC IV processor errors detected as the result of hardware prefetch operations are treated exactly the same as errors detected as the result of explicit software PREFETCH 174 amp Sun microsystems 7 2 10 3 Error Handling Memory Errors and Software PREFETCH Programs can execute explicit software PREFETCH operations The effect of an explicit software PREFETCH instruction is to write a command to the prefetch queue an autonomous hardware queue outside of the execution unit The prefetch queue hardware works very much like the store queue L3 cache accesses and system bus accesses being handled by the hardware completely decoupled from the flow of instructions through the execution unit In the UltraSPARC IV processor errors as the result of operations from the prefetch queue are handled the same as errors as the result of operations from the store queue A prefetch queue operation first searches the L2 cache and L3 cache If it misses in both the L2 cache and L3 cache it will generate a system bus read operation After the prefetch queue operation completes the prefetched data will be in the
121. local bus transactions writeback will set AFSR_EXT L3_TUE The response of the system to assertion of the ERROR output pin depends on the system but is expected to result in a reboot of the affected domain Error status can be read from the AFSR_EXT after a system reset event For L3_TUE or L3_TUE_SH the request will be dropped including copyback request L3 cache Data ECC Errors The types of L3 cache data ECC errors are e Hardware corrected HW_corrected L3 cache data ECC errors single bit ECC errors that are corrected by hardware e Software correctable SW_correctable L3 cache data ECC errors L2 cache ECC errors that require software intervention e Uncorrectable L3 cache data ECC errors multi bit ECC errors that are not correctable by hardware or software Hardware Corrected L3 cache Data ECC Errors Hardware corrected ECC errors occur on single bit errors detected as the result of these transactions e W cache read accesses to L3 cache AFSR_EXT L3_EDC e Reads of the L3 cache by the processor in order to perform a writeback or copyout to the system bus AFSR_EXT L3_WDC AFSR_EXT L3_CPC e Reads of the L3 cache by the processor to perform an operation placed in the prefetch queue by an explicit software PREFETCH instruction AFSR_EXT L3_EDC e Reads of L3 cache by the processor to perform an operation block load instruction AFSR_EXT L3_EDC Hardware corrected errors optionally produce an ECC_error d
122. logged No E e N A No action No trap Note D cache FP 64 bit load means any of the following 4 kinds of FP load instructions that is LDDF or LDF or LDDFA with some ASIs or LDFA with some ASIs TABLE 7 26 L3 cache Tag CE and UE Errors Error logged in Event AFSR Snoop request with L3 tag CE and without E to S upgrade Snoop request with L3 tag CE and L3_THCE No Error Pin snoop result sent L3 cache tag to SIU Good L3 cache state L3 Pipe corrects the tag E to S upgrade Good L3 cache Comment Disrupting trap Disrupting trap snoop result with E to S upgrade L3_THCE No based on corrected state tag i Disrupting tra Snoop request with L3 tag UE L3_TUE_SH Yes Original state Porce anis GE 240 un microsystems 7 14 Behavior on System Bus Errors 1 Note that dropping the precise trap and taking the deferred trap instead is not visible to software TABLE 7 27 Event I cache fill request with System Bus CE UE TO DTO BERR DBERR errors 1 of 10 flag fast ecc error L2 cache data a z an a o EI a Ll cache data Pipeline Action Comment system bus P cache CE in the critical 32 byte CE No Corrected data S Good data ini I good d ta Disrupting trap corrected ecc cache taken data from system bus I cache fill request with Bad data not Deferred trap UE will be
123. lst32byte or L3_CPU D 127 126 of the corresponding upper or Disrupting trap Note D Cache FP 64 bit load means any of the following 4 kinds of FP load instructions that When UE and CE occur in the same 32 byte data both CE and UE will be reported but AFAR will point to the UE case on the 16 byte boundary 229 un microsystems PAZ Error Handling Behavior on L2 cache TAG Errors This section presents information about L2 cache tag errors TABLE 7 23 L2 cache Tag CE and UE errors 1 of 7 Errors a Error S Event logged in we S L2 cache Tag E Comment ecc Pin EN AFSR a error i Disrupting tra L2 cache access for I cache THCE No No Se SE k P request with tag CE S L2 cache pipe tag retries the request L2 cache access for I cache request with tag UE but the ee Disrupting trap the data returned to I cache get TUE No Tes Original tag request is dropped used later E E GE SE Disrupting trap the NES RES igi request is dropped data returned to I cache do not TUE SS Yes Original tag KR pped get used later i Disrupting tra D cache 32 byte load request TH N L2 SC Ge k with tag CE CE No o corrects the L2 cache pipe tag retries the request D cache 32 byte load request Ga Disrupting trap the with tag UE TUE No nee Original tag request is dropped i Disrupting tra Deca PP 64 bit load THCE No No Ke ee i request with tag CE L2 cache pipe tag retries the request D cache FP 6
124. members all of which were characterized by an off chip L2 cache the write cache critically served to minimize the amount of off chip traffic to L2 generated by a core s write through L1 cache For this reason a store generated no off chip traffic until the written line eventually was evicted from the write cache On the first write to a line line ownership was transferred by accessing the on chip L2 tags To maximize the capacity of the write cache just those bytes of a line actually updated by the store stream were cached Only on eviction was it finally necessary to go off chip performing a background read of the entire original line in L2 merging any unmodified bytes from that line with the modified bytes held in the write cache and writing the complete updated line back to L2 In the UltraSPARC IV processor with the L2 cache brought on chip the original function of the write cache reducing off chip store traffic has disappeared making it possible to streamline write cache operations In effect in the UltraSPARC IV processor the write cache functions largely as a 32 entry expansion of the 8 entry store queue On a store transaction when the store misses in the write cache a single read transaction is issued to the L2 cache that both reads in the entire 64 byte line and gets ownership of it before overwriting modified bytes with the store Subsequent writes to that line update only the copy held in the write cache When the line is e
125. mode change 25 enable bit 275 error checking sequence 149 error detection 147 error eecovery 146 instruction fields access 40 physical tag parity error 261 snoop tag fields access 44 snoop tag parity error 262 tag valid field access data 44 tag valid fields access 42 instruction prefetch buffer data field access 45 diagnostic access 45 tag field access 46 instruction queue state after reset 89 Instruction Translation Lookaside Buffer iTLB misses 94 instruction_access_error exception 163 171 172 174 176 197 199 215 255 instruction_access_error exception 84 268 instruction_access_error trap 166 168 instruction_access_exception exception 313 INSTRUCTION_TRAP register 88 internal ASI 277 interrupt CMP related behavior 300 on floating point instructions 270 global registers 268 to disabled core 300 to parked core 300 Interrupt Vector Dispatch Register 299 Interrupt Vector Dispatch Status Register 299 Interrupt Vector Receive Register 299 interrupt_vector exception 176 interrupt_vector exception 271 INTR_DISPATCH register 88 INTR_RECEIVE register 88 invalidation data cache 51 prefetch cache 27 IPB_DATA IPB_instr field 46 IPB_parity field 46 IPB_predecode field 46 ITLB Data Access register 314 xxxviii UltraSPARC IV Processor User s Manual October 2005 J Data In register 314 error detection 156 error recovery 156 state after reset 89 Tag Access Extension register 313 Tag Read
126. multiply and divide operations e Some number and precision conversion operations TABLE 6 14 Floating Point lt gt Integer Conversions That Generate Inexact Exceptions S Masked e S SS Unmasked Exception i Instruction Conversion Description TEM 0 Exception TEM 1 FsTOi Floating Point to 32 bit integer when the source operand is not FdTOi between 23 1 and 23 then the result is inexact EE EE FsTOx Floating Point to 64 bit integer when the source operand is not FdTOx between 2 1 and 263 then the result is inexact Integer number nx IEEE rap Integer to Floating Point when the 32 bit integer source FiTOs operand magnitude is not exactly representable in single precision 23 bit fraction Single Precision Normal nx nx IEEE trap Integer to Floating Point when the 64 bit integer source FxTOs operand magnitude is not exactly representable in single precision 23 bit fraction Single Precision EE nx IEEE trap Integer to Floating Point when the 64 bit integer source FxTOd operand magnitude is not exactly representable in double precision 52 bit fraction Double Precision Normal nx nx IEEE trap 1 Even if the operand is gt 274 1 if enough of its trailing bits are zeros it may still be exactly representable 2 Even if the operand is gt 25 1 if enough of its trailing bits are zeros it may still be exactly representable 6 6 Underflow Operation
127. not automatically turn off D cache and or I cache Because I cache is not filled on error detection the trap code can safely run off I cache where the first step is to have software turn off D cache and or I cache as needed See Software Correctable L2 cache ECC Error Recovery Actions on page 160 and Software Correctable L3 cache ECC Error Recovery Actions on page 166 for details about fast_ECC_error trap handler actions Taken on parity error detected when a load instruction gets its data from the D cache or the P cache dcache_parity_error See D cache Error Recovery Actions on page 151 for details about dcache_parity_error trap handler actions Taken on parity error detected when instructions are fetched from the I cache icache_parity_error See Legche Error Recovery Actions on page 146 for details about 2 icache_parity_error trap handler actions Exceptions Traps and Trap Types 259 amp Sun microsystems The exceptions listed in TABLE 8 1 result in precise traps In the UltraSPARC IV processor all traps are precise except for the deferred traps described in Deferred Traps on page 254 and the disrupting traps described in Disrupting Traps on page 257 8 3 8 3 1 Trap Priority To ensure the appropriate processing of trap information a priority system is employed Arriving traps are processed by the hardware depending on their priority and the length of time that the trap has been waiting
128. occurs See abbreviations in TABLE 6 12 Underflow Overflow may occur Operation Masked Exception Enabled Exception TEM NVM 0 TEM NVM 1 rd or fcc register rd or fcc written EE register written E One Operand rs gt rd QNaN QNaN Any QNaN Sech None set Spe None set Asserts nvc NaN NaN Any SNaN Ge Mi Asserts nvc nva No IEEE trap See note enabled Two Operand ra rsz rs2 rsy gt rd QNaN QNaN QNaN s2 None set None set NaN thi t QNaN anything excep QNaN None set None set SNaN and QNan SNaN s2 QNaN SNaN SNaN See note SNaN anything except SNaN SNaN gt QNaN See note Asserts nvc nva Asserts nvc nva Asserts nvc IEEE trap enabled Asserts nvc IEEE trap enabled FCMPEs d SNaN or QNaN anything fec 3 unordered FCMPs d SNaN anything fec 3 unordered Asserts nvc nva Asserts nvc nva Asserts nvc IEEE trap enabled Asserts nvc IEEE trap enabled QNaN anything except FCMPs d SKON fec 3 unordered None set 1 For the Fs dTOs d and other instructions see SNaN to QNaN Transformation on page 136 2 IEEE trap means fp_exception_IEEE_754 fec 3 unordered Ee Note Notice from TABLE 6 16 that the compare and cause exception if unordered instruction FCMPEs d will cause an invalid nv exception if either operand is a quiet or signaling NaN The FCMP instruction causes an exception for signaling NaNs only
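The distinction drawn in the note above can be captured directly in terms of the NaN encodings used by this processor (a QNaN has the most significant fraction bit set, an SNaN has it clear). The helper functions below are an illustrative sketch for double-precision operands, not a manual-defined interface.

/*
 * FCMPE signals the invalid (nv) exception for any NaN operand (quiet or
 * signaling); FCMP signals it only for a signaling NaN.
 */
#include <stdint.h>
#include <string.h>
#include <stdbool.h>

static uint64_t bits_of(double d) { uint64_t u; memcpy(&u, &d, sizeof u); return u; }

static bool is_nan(double d)
{
    uint64_t u = bits_of(d);
    return ((u >> 52) & 0x7FF) == 0x7FF && (u & 0xFFFFFFFFFFFFFULL) != 0;
}

static bool is_snan(double d)   /* NaN with fraction MSB (bit 51) clear */
{
    return is_nan(d) && ((bits_of(d) >> 51) & 1) == 0;
}

/* Does a compare of rs1 and rs2 assert nv? */
static bool fcmp_sets_nv(double rs1, double rs2)   /* FCMPd  */
{
    return is_snan(rs1) || is_snan(rs2);
}

static bool fcmpe_sets_nv(double rs1, double rs2)  /* FCMPEd */
{
    return is_nan(rs1) || is_nan(rs2);
}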
129. of NaN e QNaN quiet NaN is a NaN with the most significant fraction field bit set QNaN is allowed to freely propagate through most arithmetic operations QNaN tends to appear when an operation produced mathematically undefined results e SNaN signaling NaN is a NaN with the most significant fraction field bit clear SNaN is used to signal an exception when it appears out of an operation being executed Semantically QNaN denotes indeterminate operations while SNaN indicates invalid operations 6 2 5 Floating Point Number Line The floating point number line in FIGURE 6 1 represents the floating point numbers used in the processor Infinity Normal Subnormal Subnormal Infinity SNaN QNaN CO Za e SC ee zx m Xx xe l Exp All1s Exp All 1S e A Sign Bit 0 me FFF FFF 800 000 000 000 AA gister Positive y Negative 7FF FFF Register Register Register FIGURE 6 1 Floating Point Number Line 6 3 IEEE Operations The response of each operation to operands with 0 normal infinite and NaN numbers are described in this section The response to subnormal numbers is described in Subnormal Operations on page 138 The result of each operation is concluded by one of the following e A number is written to the destination f register rd e A number is written to the destination register and an IEEE flag is set e An IEEE flag is set and an IEEE tr
130. or 3 response This makes a ECC_error trap pending The processor continues to execute instructions looking for a BR MS AO or A1 pipe instruction 3 The processor reads an instruction from the L2 cache and detects an L2 cache data ECC error This makes a precise fast_ECC_error trap pending 4 An earlier instruction prefetch from the system bus by the instruction fetcher not a prefetch queue operation of an instruction now known not to be used completes This instruction has a UE which makes an instruction_access_error pending The instruction fetcher dispatches the corrupt instruction specially marked in the BR pipe Because the processor can now take a trap it inhibits further instruction execution and waits for outstanding system bus reads to complete When the reads have completed the processor then examines the various pending traps and begins to execute the deferred instruction_access_error trap because deferred traps are handled before fast_ECC_error as a special case and that has the higher priority This makes the instruction_access_error trap and all precise traps no longer pending The processor takes only one trap at a time It will begin executing the instruction_access_error trap by fetching the exception vector and executing the instruction there As part of SPARC V9 trap processing the processor clears PSTATE IE so the ECC_error and interrupt_vector traps cannot be taken at the moment so are no longer pending alt
131. parity protected for PC relative instructions IC_predecode 7 is used to determine whether an instruction is PC relative or not Since this PC relative status bit is used to determine which instruction bits will be parity protected if IC_predecode 7 is flipped a non PC relative CTI will be treated as a PC relative CTI or vice versa when computing the parity value Since the parity computations for the two types of instructions are different a trap may or may not occur To allow code to be written to check the operation of the parity detection hardware the following equations can be used For a non PC relative instruction IC_predecode 7 0 IC_parity = XOR(IC_instr[31:0], IC_predecode[9:7], IC_predecode[5:0]) For a PC relative instruction IC_predecode 7 1 IC_parity = XOR(IC_instr[31:11], IC_predecode[9:7], IC_predecode[4:0]) The PC relative instructions where IC_predecode 7 1 are BPcc Bicc BPr CALL FBfcc and FBPfcc Note After ASI reads or writes to ASI 6616 6716 6816 6916 6A16 or the other instruction cache diagnostic ASIs instruction cache consistency may be broken even if the instruction cache is disabled The reason is that invalidates to the instruction cache may collide with the ASI load store Thus before these ASI accesses the instruction cache must be turned off Then before the instruction cache is turned on again all of the instruction cache entries must be invalidated
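The equations above translate directly into a small checker. The sketch below computes the XOR reduction over the covered bits for each case; IC_predecode[6] is never covered, and because the instruction-cache data-format table describes IC_parity as an odd parity bit, diagnostic code may need to complement this reduction depending on the convention it checks against.

#include <stdint.h>

static unsigned xor_reduce64(uint64_t v)
{
    unsigned p = 0;
    while (v) { p ^= (unsigned)(v & 1); v >>= 1; }
    return p;
}

unsigned ic_parity_cover(uint32_t ic_instr, uint32_t ic_predecode /* 10 bits */)
{
    unsigned p;
    if ((ic_predecode >> 7) & 1) {
        /* PC-relative: IC_instr[31:11], IC_predecode[9:7], IC_predecode[4:0] */
        p  = xor_reduce64(ic_instr >> 11);
        p ^= xor_reduce64((ic_predecode >> 7) & 0x7);
        p ^= xor_reduce64(ic_predecode & 0x1F);
    } else {
        /* Non-PC-relative: IC_instr[31:0], IC_predecode[9:7], IC_predecode[5:0] */
        p  = xor_reduce64(ic_instr);
        p ^= xor_reduce64((ic_predecode >> 7) & 0x7);
        p ^= xor_reduce64(ic_predecode & 0x3F);
    }
    return p;   /* an odd-parity stored bit would be p ^ 1 */
}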
132. processor is designed so that stores can be issued even when the data is not ready More specifically a store can be issued in the same group as the instruction producing the result The address of a store is buffered until the data is eventually available Once in the store buffer the store data is buffered until it can be completed The write cache can be used to exploit locality both temporal and spatial in the write stream Read After Write Hazards Load data can be bypassed from previous stores before they become globally visible data for load from the store queue This bypass is specifically allowed by the total store order TSO memory model Data for all types of loads cannot be bypassed from all types of stores All types of load instructions can get data from the store queue except the following load instructions e Signed loads 1dsb ldsh ldsw e Atomics e Load double to integer register file 1dd e Quad loads to integer register file e Load from FSR register e Block loads e Short floating point loads e Loads from internal ASIs All types of store instructions can give data to a load except the following store instructions e Floating point partial stores e Store double from integer register file std e Store part of atomic e Short floating point stores e Stores to pages with side effect bit set e Stores to non cacheable pages When data for a load cannot be bypassed from previous stores before they beco
133. recirculation instrumentation 105 RED state exiting 268 Fireplane Interconnect 296 instruction cache bypassing 24 L2 L3 cache bypassing 26 MMU behavior 313 331 trap vector 85 Register L3 Cache Error Enable 258 register Data Cache Unit Control 274 Floating Point Status FSR 267 global trap 268 PSTATE 268 values after reset 86 registers performance control PCR 98 reset Fireplane values 296 PSTATE RED 83 register values after reset 86 software Initiated Reset SIR 84 system 85 watchdog reset 83 Reset pin 85 RET RETL instruction 270 RETRY instruction 271 exiting RED_state 84 268 flushing pipeline 25 use with IFPOE 271 when TSTATE uninitialized 85 Return Address Prediction Enable 270 Return Address Stack RAS 95 RETURN instruction 313 331 Rfr_CSR register 88 RSTVaddr 85 S SFSR INDEX xlv amp Sun microsystems un microsystems FT 1 39 state after reset 88 short floating point load instruction 97 SIGM instruction 84 single bit ECC error 258 snooping instruction cache 24 snoop counts 115 SOFTINT register 87 software prefetch enable 275 Software Initiated Reset SIR 84 Speed Data register 298 stable storage 29 STDFA instruction 281 STICK register 87 STICK_COMPARE register 87 store instructions giving data to a load 97 queue state after reset 89 store buffer 97 STXA instruction 281 caution 39 diagnostic control data 39 system bus data ECC errors uncorrectable 171 data MTag ECC errors HW_co
134. rts GE fec 1 rs None set fec 1 rs None set 20 E fec 2 rs None set fec 2 rs None set Normal Normal gt or lt None set gt or lt None set IEEE 754 1985 Standard 127 un microsystems 6 3 7 Precision Conversion TABLE 6 9 Precision Conversion PRECISION CONVERSION Operations single operand FsTOd rs gt rd FdTOs rs gt rd FsTOd 0 FdTOs 0 Result from the operation includes one or more of the following Number in f register See Trap Event on page 132 Exception bit set See TABLE 6 12 Trap occurs See abbreviations in TABLE 6 12 Underflow Overflow can occur Masked Exception TEM 0 Enabled Exception TEM 1 Destination Register Written rd Hee 0 None set Destination Register Written rd 0 Flag s Trap None set FsTOd Normal Normal None set Normal None set FdTOs Normal FsTOd Infinity FdTOs Infinity Can underflow overflow See 6 4 Infinity None set Can underflow overflow See 6 4 Infinity None set Examples e FsTOd 7FD1 0000 7FFA 2000 0000 0000 e FsTOd FDD1 0000 FFFA 2000 0000 0000 e FdTOs 7FFA 2000 0000 0000 7FD 1 0000 e FdTOs FFFA 2000 0000 0000 FFD1 0000 IEEE 754 1985 Standard 128 un microsystems 6 3 8 TABLE 6 10 Floating Point to Integer Number Conversion Floating Point to Integer NUMBER CONVERSION Instruction si
135. setting of the DCR OBS control bits bit 11 must be set to 1 272 Sun microsystems 9 4 Registers 9 4 1 Registers Referenced Through ASIs Data Cache Unit Control Register DCUCR ASI 4516 ASI_DCU_CONTROL_REGISTER VA 016 The Data Cache Unit Control Register contains fields that control several memory related hardware functions These functions include instruction prefetch write and data caches MMUs and watchpoint setting Most of the DCUCR s functions are described in the UltraSPARC III Cu Processor User s Manual details specific to the UltraSPARC IV processor are described in this section After a power on reset POR all fields of the DCUCR are set to 0 After a WDR XIR or SIR reset all fields of the DCUCR except WE are set to 0 and WE is left unchanged 273 Sun microsystems The Data Cache Unit Control Register as implemented in the UltraSPARC IV processor is described in TABLE 9 3 In the table bits are grouped by function rather than by strict bit sequence TABLE 9 3 DCU Control Register Access Data Format ASI 4516 1 of 2 Bits Field Description R W 63 62 Reserved Reserved for DFT Design for Test 61 56 Reserved Reserved for future implementation 55 WCE Write cache Coalescing Enable If cleared coalescing of store data for cacheable stores is disabled The default setting for this bit is 1 i e coalescing is enabled
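As a small illustration of driving this register from software, the sketch below manipulates the one field whose position is given in the excerpt above (WCE, bit 55). It reuses the hypothetical dcucr_read()/dcucr_write() accessors sketched earlier; neither the masks nor the accessors are part of a defined API.

/* Accessors as sketched earlier (hypothetical, not a defined API). */
unsigned long long dcucr_read(void);
void dcucr_write(unsigned long long v);

#define DCUCR_WCE (1ULL << 55)   /* Write cache Coalescing Enable, default 1 */

static inline int wc_coalescing_enabled(void)
{
    return (dcucr_read() & DCUCR_WCE) != 0;
}

static inline void wc_coalescing_disable(void)
{
    /* Clearing WCE disables coalescing of store data for cacheable stores. */
    dcucr_write(dcucr_read() & ~DCUCR_WCE);
}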
136. tag hits in L3 cache that miss in L2 when the line is in I state This counter approximates the number of coherence misses in the L3 cache in a multiprocessor system Performance Instrumentation and Optimization 110 un microsystems 5 10 Memory Controller Counters Memory controller statistics are collected through the counters listed in TABLE 5 12 These counters are shared counters TABLE 5 12 Counters for Memory Controller Statistics Counter Description Number of read requests completed to memory bank 0 MC_reads_0_sh PICL Note that some memory read requests correspond to transactions where some other processor s cache contains a dirty copy of the data and the data will really be provided by that processor s cache MC_reads_1_sh PICL The same as above for bank 1 MC_reads_2_sh PICL The same as above for bank 2 MC_reads_3_sh PICL The same as above for bank 3 MC_writes_0_sh PICU Number of write requests completed to memory bank 0 MC_writes_1_sh PICU The same as above for bank 1 MC_writes_2_sh PICU The same as above for bank 2 MC_writes_3_sh PICU The same as above for bank 3 Number of processor cycles that requests were stalled in the MCU queues MC _stalls_0_sh PICL because bank 0 was busy with a previous request The delay could be due to data bus contention bank busy data availability for a write etc MC_stalls_1_sh PICU The same as above for bank 1 MC _stalls_2
137. tag or data parity errors are masked with the following conditions e Fully associative TLB T16 is hit e The D TLB parity enable bit is off The D TLB parity enable is controlled by bit 17 DTPE of the Data Cache Unit Control Register e The D MMU enable bit is off The D MMU enable is controlled by bit 3 DM of the Data Cache Unit Control Register Instruction and Data Memory Management Unit 322 Sun microsystems During the D TLB demap operation set associative TLBs T512_0 and T512_1 also check the tag and data parities If a parity error is detected the corresponding entry will be invalidated by the hardware regardless of whether a hit or miss has occurred The demap operation will only clear the valid bit but not the parity error No data_access_exception is generated during demap operation While writing to the D TLB using the Data In Register ASI_DTLB_DATA_IN_REG the tag and data parities are generated by hardware While writing to the D TLB using the Data Access Register ASI_DTLB_DATA_ACCESS_REG the tag and data parities are generated by the hardware While writing to the D TLB using the D TLB Diagnostic Register ASI_DTLB_DIAG_REG using ASI While reading the D TLB using the Data Access Register ASI_DTLB_DATA_ACCESS_REG the D TLB Diagnostic Register ASI_SRAM_FAST_INIT
138. taking two cycles To guarantee this does not happen in the case of uncertain instruction alignment ensure that no two branches are within four instructions of each other Instruction Cache Timing While accesses to the I cache hit successfully the pipeline rarely starves for instructions In rare cases however the Instruction Dispatch is unable to provide a sufficient number of instructions to keep the functional units busy For example a taken branch to a taken branch sequence without any instructions between the branches except for the delay slot could only be executed at a peak rate of two instructions per cycle An I cache miss does not necessarily result in bubbles being inserted into the pipeline Part of the I cache miss processing or even all of it can be overlapped with the execution of instructions that are already in the instruction buffer and are waiting to be grouped and executed Because the operation of the Instruction Dispatch is somewhat separated from the rest of the pipeline the I cache miss may have occurred when the pipeline was already stalled for example due to a multi cycle integer divide floating point divide dependency dependency on load data that missed the D cache etc This means that the miss or part of it may be transparent to the pipeline Note Because of the possibility of stalling the processor when the pipeline is waiting for new instructions try to make code routines fit in the I cache and
139. the need for the set and clear operations required when writing a specific value to the register Chip Multithreading CMT 18 amp Sun microsystems 2 4 3 2 State After Reset On assertion of power on reset or system reset Soft POR the LP Running register will be initialized such that all the logical processors are suspended except the logical processor with the lowest number which instead is marked enabled in the LP Enable Status register This provides an integrated boot master logical processor for systems without a System Controller SC reducing bootbus contention The logical processors suspended by the reset should be set to run by the master logical processor at the proper time in the booting process LP Running Status Register AST_CORE_RUNNING_STATUS Since there is a delay from when a logical processor is directed to suspend until it actually becomes suspended the LP Running Status register is provided to indicate when a logical processor actually becomes suspended The LP Running Status register is a shared read only register where each bit indicates if the corresponding logical processor is active In the UltraSPARC IV processor a logical processor is considered suspended successfully if the following conditions are satisfied 1 No instruction in the instruction queue and logical processor 2 No pending I cache fetch D cache load D cache store P cache load and W cache eviction requests 3
140. the CBND field If 14 9 CBND 5 0 PA 42 37 CBASE and PA 42 37 lt CBND then PA is in COMA space Remote_WriteBack not issued in SSM mode CBASE address limit Physical address bits 42 37 are compared to the CBASE field If 8 3 CBASE 5 0 PA 42 37 CBASE and PA 42 37 lt CBND then PA is in COMA space Remote_WriteBack not issued in SSM mode If set it expects snoop responses from other Fireplane agents using the slow snoop response 2 Hierarchical bus mode If set uses the Sun Fireplane Interconnect protocol for a 1 multilevel transaction request bus If cleared the UltraSPARC IV processor uses the Sun Fireplane Interconnect protocol for a single level transaction request bus If set performs Sun Fireplane Interconnect transactions in accordance with the Sun Fireplane Interconnect Scalable Shared Memory model 0 Sun Fireplane Interconnect and Processor Identification 290 un microsystems TABLE 11 3 DTL Pin Configurations DTL GROUPS DTL_1 Group 0 COMMAND _L 1 0 ADDRESS_L 42 4 MASK_L 9 0 ATRANSID_L 8 0 Group 2 ADDRARBOUT_L ADDRARBIN_L 4 0 Group 8 ADDRPTY_L Power Workstation Midrange Server Enterprise Server mid Power on Reset State DTL_2 Group 1 INCOMING_L PREREQIN_L DTL_3 Group 3 PAUSEOUT_L MAPPEDOUT_L SHAREDOUT_L OWNEDOUT_L DTL_4 Group 5 DTRANSID_L 8 0 DTARG_L DSTAT_L 1 0 Group 6 TARGID_L 8 0 TTRANSID_L 8 0 216 lie lie 21
141. the CMT model defines an ASI register used by the operating system to steer the errors to a designated logical processor Resets generated by an externally initiated reset XIR signal can be steered to an arbitrary subset of the logical processors by either the operating system or an external service processor by setting the appropriate bit mask in an ASI register Logical processors can be enabled disabled either by software or by an external service processor through an ASI register This action only takes effect after a system reset Logical processors can be suspended at any time by software or by an external service processor through an ASI register Logical processors can suspend either themselves or other logical processors When a logical Chip Multithreading CMT 9 un microsystems 2 1 1 212 processor is suspended it stops fetching instructions completes any instructions it already has in its pipe and then becomes idle A suspended logical processor still fully participates in cache coherency transactions and remains coherent When it is started again it continues execution from the point of suspension The ability to suspend logical processors is very important for diagnostic and recovery code and is used during the boot process to facilitate initial bring up CMT Definition A CMT processor is defined by its external visible nature and not its internal organization The following section provides background terminology fol
142. the P cache or L2 cache or both depending on the type of prefetch instruction used Software prefetching can also hide memory latency for integer instructions by bringing in integer data into the L2 cache the data brought into the P cache is useless in this case since integer instructions cannot use the AX pipe Note To enable the use of software prefetching the Software Prefetch Enable SPE bit bit 43 as well as the PE bit in DCUCR must be set If the P cache is disabled or the P cache is enabled without the SPE bit being set i e PE 1 SPE 0 all software prefetch instructions will be treated as NOPs The PREFETCH instruction with fcn 16 can be used to invalidate or flush a P cache entry This fcn can be used to invalidate special non cacheable data after the data is loaded into registers from the P cache Note PREFETCH with fcn 16 cannot be used to prefetch non cacheable data It is used only to invalidate the P cache line In particular it is used to invalidate the P cache data that was non cacheable and was prefetched through other software prefetch instructions Caches Cache Coherency and Diagnostics 27 Sun microsystems 3 2 Cache Flushing Data in the I cache D cache P cache W cache L2 cache and L3 cache can be flushed by invalidation of the entry in the cache Modified data in the W cache L2 cache and L3 cache must be written back to memory
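Returning to the software-prefetch notes above (rather than the cache-flushing discussion that has just begun), the following is a minimal sketch of issuing PREFETCH with fcn 16 to invalidate a P cache line after previously prefetched non-cacheable data has been consumed. The GCC-style inline assembly is one possible way to emit the instruction, not a manual-defined interface.

static inline void pcache_line_invalidate(const void *addr)
{
    /* prefetch [addr], 16 -- fcn 16 only invalidates the matching P-cache
       line; it cannot be used to prefetch non-cacheable data. */
    __asm__ volatile("prefetch [%0], 16" : : "r"(addr) : "memory");
}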
143. the missed page size information in the T512_0 and T512_1 TABLE 13 23 DTLB Tag Access Extension Register Description Bit Field R W Description 21 19 pgszl RW Page size of T512_1 pgsz1 1 is reserved as 0 18 16 pgsz0 RW Page size of T512_0 pgsz0 2 is reserved as 0 Note With the saved page sizes hardware pre computes in the background the index to the T512_0 and T512_1 for the TTE fill When the TTE data arrive only one write is enabled to the T512_0 T512_1 or T16 Instruction and Data Memory Management Unit 331 un microsystems 13 2 7 2 D TLB Data In Data Access and Tag Read Registers Data In Register ASI 5C 16 VA 63 0 001 6 Name ASI_DTLB_DATA_IN_REG Access W Writes to the TLB Data In register require the virtual address to be set to 0 Note DTLB Data In register is used when fast_data_access_MMU_miss trap is taken to fill DTLB based on replacement algorithm Other than fast_data_access_MMU_miss trap an ASI store to DTLB Data In register may replace an unexpected entry in the DTLB even if the entry is locked The entry that gets updated depends on the state of the Tag Access Extension Register LFSR bits and the TTE page size in the store data If nested fast_data_access_MMU_miss happens DTLB Data In register will not work Data Access Register ASI SD 16 VA 63 0 00 j 6 30FF8 16 Name ASI_DTLB_ DATA _ACCESS_REG Access RW
144. three levels of cache TABLE 3 1 summarizes the cache organization of the UltraSPARC IV processor level 1 L1 I cache D cache P cache W cache level 2 L2 and level 3 L3 caches The cache organization is discussed in details in subsequent sections 23 un microsystems oa by 3 1 2 1 TABLE 3 1 The UltraSPARC IV Processor Cache Organization Subblock Data Protection Set Replacement BC Cache Number of Associativit Polic Organization subblocks y y ECC Parity Instruction Yes IWO 327 Pseudo Tag cache bye away Random None Dat subblocks ale 8 Ta Data cache No 4 way Pseudo None s Random Data Prefetch Yes UWO 327 Cache byte 4 way Sequential None Data subblocks Write cache Fully None None associative 64 Tag L2 cache 2 MB 4 way Pseudo LRU PIPT None byte Data 2 Tag SRAM L3 cache 32 MB No 4 way Pseudo LRU PIPT Ss Data address Virtually Indexed Physically Tagged VIPT Caches The Instruction cache I cache and the Data cache D cache are virtually indexed physically tagged caches The I cache and D cache have no references to context information Virtual addresses index into the cache tag and data arrays while accessing the I MMU D MMU The resulting tag is compared against the translated physical address to determine a cache hit Note A side effect inherent in a virtual indexed cache is address aliasing See Address Aliasing Flushing on page 28 Instruction Cache I
145. to W THCE No No corrects the i cache for W cache exclusive e the request 1s request and writes data to L2 8 retried cache SIU forward and fill request with L2 cache tag UE cs plats os SE Ge Disrupting trap and then 2n byte to W ed cache for W cache exclusive TUE No Ka Original tag the request is request and writes data to L2 dropped cache i Disrupting tra L2 cache Tag update request THCE No No E SE V d by SIU with tag CE e the request 1s tag retried L2 cache Tag update request Disrupting trap by SIU local transaction with TUE No Yes Original tag the request is tag UE dropped L2 cache Tag update request Disrupting trap by SIU foreign transaction TUE_SH No Yes Original tag the request is with tag UE dropped L2 Pipe Disrupting trap I t t with t SS BE THCE No No corrects the the request is tag retried SIU ith Disrupting trap EE TUE SH No Yes Original tag the request is UE dropped Error Handling un microsystems TABLE 7 23 1L2 cache Tag CE and UE errors 7 of 7 Errors Ae Error L1 2 Event logged in ne he L2 cache Tag cache 3 Comment AFSR data error aoe data read request with Tea No Original tag N A Ge data read request with EE No 5 Original tag N A r data write request with Ee No o Original tag N A ie data write request with E E Original tag N A o tra direct ASI L2 tag read request No error S No
146. to D cache Precise traps are used to report a D cache data parity error On detection of a D cache data parity error hardware turns off the D cache and I cache by clearing DCUCR DC and DCUCR IC bits In the trap handler software should invalidate the entire I cache D cache and P cache See D cache Error Recovery Actions on page 151 for details No special parity error status bit or address information will be logged in hardware Because dcache_parity_error trap is precise software has the option to log the parity error information on its own Exceptions Traps and Trap Types 262 un microsystems There are some cases where a speculative access to D cache belongs to a canceled or retried instruction Also the access could be from a special load instruction The error behaviors in such cases are described in TABLE 8 4 The general explanation is that in hardware trap information must be attached to an instruction Thus if the instruction is canceled for example in a wrong branch path no trap is taken TABLE 8 4 D cache Data Parity Error Behavior on Canceled Retried Special Load dcache_parity_error Canceled Retried Special Load Reporting if Parity Error Trap Taken D cache miss real miss without tag parity error due to invalid line dcache_parity_error is not suppressed or tag mismatch dcache_parity_error has priority over D LOD miS fast_data_access_MMU_miss Preceded by trap or retry of an older
147. to capture a new hardware error when bits 62 54 51 33 of AFSR1 and bits 11 0 of AFSR1_EXT are zero 182 amp Sun microsystems Error Handling Note Software should clear both the error bit and the PRIV bit in the AFSR register at the same time If software attempts to clear error bits at the same time as an error occurs one of two events will occur 1 The clear will appear to happen before the error occurs The state of AFSRI AFSR1_EXT syndrome ME PRIV and sticky bits and the state of AFAR1 will all be consistent with the clear having happened before the error occurs If the clear zeroed all bits 62 54 51 33 of AFSR1 and bits 11 0 AFSR1_EXT then AFSR2 and AFSR2_EXT and AFAR2 will capture the new error 2 The clear will appear to happen after the error occurs The state of AFSR1 AFSR_EXT syndrome ME PRIV and sticky bits and the state of AFAR1 will all be consistent with the clear having happened after the error occurs AFSR2 and AFSR2_EXT and AFAR2 will not have been updated with the new error information The PERR and JERR bits must be cleared by software by writing a 1 to the corresponding bit positions When multiple events have been logged by the various bits in AFSR1 or AFSR1_EXT at most one of these events will have its status captured in AFAR1 AFAR1 will be unlocked and available to capture the address of another event as soon as the one bit is cleared in AFSR1 or AFSR1I_EXT which correspo
148. trap will be generated UCU When a cacheable load instruction misses the I cache or D cache or an atomic operation misses the D cache and it hits the L2 cache an L2 cache read will be performed and the data read back from the L2 cache SRAM will be checked for the correctness of its ECC If a multi bit error is detected in critical 32 byte data for load and atomic operations or in either critical or non critical 32 byte data for I cache fetch it will be recognized as an uncorrectable error and the UCU bit will be set to log this error condition A precise fast_ECC_error trap will be generated provided that the UCEEN bit of the Error Enable Register is set For correctness a software initiated flush of the D cache is required because the faulty word may already have been loaded into the D cache and will be used without any error trap if the trap routine retries the faulting instruction Corrupt data is never stored in the I cache or P cache A software initiated flush of the L2 cache which evicts the corrupted line into L3 cache then a software initiated flush of the L3 cache which evicts the corrupted line from L3 cache back to DRAM is required if this event is not to recur the next time the word is fetched from L2 cache This may need to be linked with a correction of a multi bit error in L2 cache if that is corrupted too Multiple occurrences of this error will cause the AFSR1 ME to be set In the event that the UCU event is for
149. updates UCU UCC L3_UCU L3_UCC gt UE DUE IVU EDU WDU CPU L3_EDU L3_WDU L3_CPU gt CE IVC EDC WDC CPC L3_EDC L3_WDC L3_CPC AFSR1 M_SYND microtag ECC Syndrome Overwrite Policy Class 2 EMU IMU the highest priority Class 1 EMC IMC the lowest priority Priority for M_SYND updates EMU IMU gt EMC IMC 7 6 Error Handling Multiple Errors and Nested Traps The AFSR1 ME bit is set when there are multiple uncorrectable errors or multiple SW_correctable errors associated with the same sticky bit in different data transfers Multiple occurrences of all uncorrectable errors ISAP EMU IVU TO DTO BERR DBERR UCU TUE_SH TUE CPU WDU EDU DUE UE L3_UCU L3_UCC L3_EDU L3_WDU L3_CPU L3_TUE_SH L3_TUE or L3_MECC errors will set the AFSR1 ME bit For example one ISAP error and one EMU error will not set the ME bit but two ISAP errors will Multiple occurrences of SW_correctable errors that set AFSR1 ME include UCC and L3_UCC errors only This is to make diagnosis easier for the unrecoverable event of an L2 cache L3 cache error while handling a previous L2 cache L3 cache error If multiple errors leading to the same trap type are reported before a trap is taken due to any one of them then only one trap will be taken for all those errors If multiple errors leading to different trap types are reported before a trap is taken for any one of them then one trap of each type will be taken One instr
150. will not be significant because every entry will be invalid Snoop tag and physical tag will be reloaded next time the line is used D cache Error Detection Details D cache diagnostic accesses described in D cache Errors on page 149 can be used for diagnostic purposes or testing development of the D cache parity error trap handling code To allow code to be written to check the operation of the parity detection hardware the following equations specify which storage bits are covered by which parity bits DC_tag_parity xor DC_tag 28 0 This is equal to xor PA 41 13 DC_snoop_tag_parity xor DC_snoop_tag 28 0 This is equal to xor PA 41 13 TABLE 7 3 D cache Parity Generation for Load Miss Fill and Store Update 1 of 2 Parity Bit D cache Load Miss Fill D cache Store Update DC_data_parity 0 xor data 7 0 xor data 7 0 DC_data_parity 1 DC_data_parity 2 DC_data_parity 3 xor data 15 8 xor data 23 16 xor data 31 24 xor data 15 8 xor data 23 16 xor data 31 24 DC_data_parity 4 xor data 39 32 xor data 39 32 DC_data_parity 5 DC_data_parity 6 DC_data_parity 7 xor data 47 40 xor data 55 48 xor data 63 56 xor data 47 40 xor data 55 48 xor data 63 56 DC_data_parity 8 xor data 71 64 xor data 7 0 DC_data_parity 9 DC_data_parity 10 DC_data_parity 11 xor data 79 72 xor data 87 80 xor data 95 88 xor data 15 8 xor d
151. 0 2 b01 Way 2 b10 Way 2 2 b11 Wax 14 3 IC addr This 12 bit index which corresponds to VA 13 2 of the instruction address T selects a 32 bit instruction and associated predecode bits and parity bit 2 0 Mandatory value Should be 0 The data format for the instruction cache instruction fields is shown in TABLE 3 11 TABLE 3 11 Instruction Cache Instruction Access Data Format 63 43 Mandatory value Should be 0 42 IC_parity Odd parity bit of the 32 bit instruction field plus 9 predecode bits 41 32 IC_predecode 9 0 10 predecode bits associated with the instruction field 31 0 32 bit instruction field IC_predecode 4 0 represents the following pipes TABLE 3 12 Definition of predecode bits 4 0 Bits Field 4 FM 3 FA 2 MS 1 BR 0 AX 40 un microsystems IC_predecode 9 5 represents the following TABLE 3 13 Definition of predecode bits 9 5 19 8 17 IC_instr IC ae 0 0 0 not a cti 1 cti done retry jmpl return br call 1 0 1 1 den jmpl call return br 1 1 0 regular jmpl 1 1 0 pop RAS JMPL return 1 1 0 call call w o push RAS followed by restore or write 07 IC_parity Odd parity bit of the 32 bit instruction field plus 9 predecode bits IC_predecode 6 is not parity protected due to implementation considerations IC_instr 10 0 and IC_predecode 5 are not
152. 0 21 17 fill request with BERR in the 2nd 32 byte data from system bus due to RTSR transaction Error Handling garbage data installed garbage data not installed in P cache No action Disrupting trap the 2 least significant data bits 1 0 in both lower and upper 16 byte are flipped 249 un microsystems TABLE 7 27 System Bus CE UE TO DTO BERR DBERR errors 10 of 10 Event cacheable Prefetch for several writes 2 22 fill request with BERR in the critical 32 byte data from system bus regardless non RTSR transaction or RTSR transaction OR cacheable Prefetch for several writes 2 22 fill request with BERR in the 2nd 32 byte data from system bus regardless non RTSR transaction or RTSR transaction DBERR flag fast ecc error No L2 cache data garbage data installed 2 Ki Ki N D Si kl N kel L1 cache data garbage data not installed in P cache Pipeline Action No action Comment Disrupting trap the 2 least significant data bits 1 0 in both lower and upper 16 byte are flipped cacheable Prefetch for one write 3 23 fill request with BERR in the critical 32 byte data from system bus regardless non RTSR transaction or RTSR transaction OR cacheable Prefetch for one write 3 23 fill request with BERR in the 2nd 32 byte data from system bus regardless non RTSR transaction or RTSR transaction DBERR No garbage
153. 010 011 Note Due to different page size support in the T512_0 and T512_1 the following bits in the D MMU Primary Context Register and Secondary Context Register are reserved as 0 N_pgsz1 1 N_pgsz0 2 P_pgsz1 1 P_pgsz0 2 N_pgsz_I 2 1 S_pgsz1 1 S_pgsz0 2 It is illegal to program primary and or secondary context registers with page sizes that are not supported by the two T512 D TLBs Such incorrect page size programming can result in unexpected D TLB parity error data_access_exception trap or fast_data_access_MMU_miss trap It is also possible that these traps can recur and in some cases will lead to RED mode A load operation 1dxa to appropriate ASI targetting an illegally programmed context register will not return the actual value instead it will return one of the legal values The only way to find out if illegal page sizes are programmed in the context register is to read the AST_DMMU_TAG_ACCESS_EXT register on a fast_data_access_MMU_umiss or a data_access_exception trap and inspect AST_DMMU_ACCSES_EXT bits 21 19 for T512_1 page size and bits 18 16 for T512_0 page size 13 2 2 1 D TLB Access Operation When a memory access instruction is issued its VA Context and PgSz are used to access all 3 D TLBs T512_0 T512_1 and T16 in parallel The fully associative T16 only needs the VA and Context to CAM match and output an entry 1 out of 16 The proper VA bits are compared based on the pa
154. 1 Assert N Ee MSB and converted RE R IEEE trap enabled Integer lt 2 Normal None set Normal None set 52 Integer is rounded to 52 Asserts nvc gt Integer 2 MSB and converted GE Ne IEEE trap enabled DP Integer gt 25 1 Normal None set Normal None set Int i ded to 52 Assert Integer lt 2 1 PER a TOEI EEA Asserts nvc nxc No peace IEEE trap enabled Floating point numbers are not modified by the copy and move instructions FMOV FABS and FNEG The copy move instructions will not generate an unfinished_FPop or unimplemented_F Pop exception but they will generate the fp_disabled exception if the floating point unit is disabled The processor performs the appropriate sign bit transformation but will not cause an invalid exception and will not perform a QNaN to SNaN transformation The following single operand instructions use the rs register as the source operand 6 3 10 1 The FMOV Instruction The FMOV instruction e Performs f register to f register moves e Does not change any bit regardless of register content e Is useful with VIS instructions IEEE 754 1985 Standard 130 amp Sun microsystems 6 3 10 2 The FABS Instruction The FABS instruction e Changes the floating point integer sign bit to positive if needed e Does not change any other bit regardless of register content 6 3 10 3 The FNEG Instruction The FNEG instruction e Toggles the
155. 110 High 4 bits hexadecimal row number Low 3 bits hexadecimal column number 0x6 Syndrome returned for 1bit error in data bit 126 Ox1c9 TABLE 7 10 Data Single Bit Error ECC Syndromes EE Ge SC 0x0 0x1 0x2 0x3 0x4 0x5 0x6 0x7 0x0 03b 127 067 097 10f O8f 04f 02c 0x1 147 0c7 02f Ole 117 032 08a 04a 0x2 or 086 046 026 09b 08c 0c1 Oal 0x3 Ola 016 061 091 052 00e 109 029 0x4 02a 019 105 085 045 025 015 103 0x5 031 00d 083 043 051 089 023 007 0x6 0b9 049 013 0a7 057 00b 07a 187 0x7 Of8 11b 079 034 178 1d8 05b 04c 0x8 064 1b4 037 03d 058 13c 1b1 03e 0x9 103 Obe 1a0 1d4 lca 190 124 13a Oxa 1c0 188 122 114 184 182 160 118 Oxb 181 150 148 144 142 141 130 0a8 Oxe 128 m oo 094 112 10c 0d0 0b0 Oxd 10a 106 062 1b2 08 0c4 0c2 1f0 Oxe 0a4 0a2 098 ldl 070 Je 1c6 1c5 Oxf 068 Lei le2 lel 1d2 Icc 109 1b8 Error Handling 187 amp Sun microsystems Error Handling TABLE 7 11 shows the 9 bit ECC syndromes that correspond to a single bit error for each of the 9 ECC check bits for the L2 cache L3 cache and system bus error correcting codes used for data TABLE 7 11 Data Check Bit Single Bit Error Syndrome Check bit number AFSR E_SYND 0 0x001 1 0x002 2 0x004 3 0x008 4 0x010 5 0x020 6 0x040 7 0x080 8 0x100 Other syndromes found in the AFSR E_SYND field indicate either no error syndrome 0 or a multi bit error has occurred TABLE 7 12 gives the mapping from
156. 114 un microsystems 5 12 Miscellaneous Counters 5 12 1 System Interface Event Counters System interface statistics are collected through the counters listed in TABLE 5 15 The counters with a _sh suffix are shared by both cores and so the count in both cores is the same for the shared counters TABLE 5 15 Counters for System Interface Statistics Counter Description SI_snoop_sh PICL SL_ciq_flow_sh PICL Number of snoops from other processors on the system due to foreign RTS RTSR RTO RTOR RS RTSM WS RTSU RTOU UGM Number of system cycles with flow control PauseIn observed by the local processor SIl_owned_sh PICU SI_RTS_srce_data PICL Number of times owned_in is asserted on bus requests from the local processor This corresponds to the number of requests from this processor that will be satisfied either by the local processor s cache in case of RTO_nodata or by the cache of another processor on the system but not by the memory Number of local RTS transactions due to I cache D cache or P cache requests from this core where data is from the cache of another processor on the system not from memory The count does not include re issued local RTS i e RTSR transactions SI_RTO_src_data PICU Number of local RTO transactions due to W cache or P cache requests from this core where data is from the cache of another processor on the system not from memory
157. 13 2 7 3 D MMU TSB D TSB Base Registers ASI 58165 VA 63 0 28 6 Name ASI_DMMU_TSB_ BASE Access RW The D MMU TSB Base Register is described in TABLE 13 27 TABLE 13 27 TSB Base Register Description Bit Field Description R W 63 13 TSB_Base See the UltraSPARC III Cu Processor User s Manual RW 12 Split See the UltraSPARC III Cu Processor User s Manual RW 11 3 The UltraSPARC IV processor implements a 3 bit TSB_Size field The number of entries in the TSB or each TSB if split 512 x SE 2 0 TSB_Size 13 2 7 4 D MMU TSB D TSB Extension Registers ASI 5816 VA 63 0 48 16 Name ASI_DMMU_TSB_PEXT_REG Access RW ASI 5816 VA 63 0 5016 Name ASI_DMMU_TSB_SEXT_REG Access RW ASI 5816 VA 63 0 58 16 Name ASI_DMMU_TSB_NEXT_REG Access RW Please refer to the UltraSPARC II Cu Processor User s Manual for information on the TSB Extension Registers The TSB registers are defined as follows In the UltraSPARC IV processor the TSB_Hash bits 11 3 of the extension registers are exclusive ORed with the calculated TSB offset to provide a hash into the TSB Changing the TSB_Hash field on a per process basis minimizes the collision of TSB entries between different processes 13 2 7 5 D MMU Synchronous Fault Status Registers D SFSR ASI 5816 VA 63 0 18 16 Name ASI_DMMU_SFSR Access RW Instruction and
158. 2 8 2 7 2 8 3 Error Handling Hardware corrected HW_corrected L3 cache tag ECC errors single bit ECC errors that are corrected by hardware e Uncorrectable L3 cache tag ECC errors multi bit ECC errors that are not correctable by hardware or software Each L3 cache tag entry is covered by ECC The tag includes the MOESI state of the line which implies that tag ECC is checked whether or not the line is valid Tag ECC must be correct even if the line is not present in the L3 cache L3 cache tag ECC checking is enabled by the ET_ECC_en bit in the L3 cache control register The processor always generates correct ECC when writing L3 cache tag entries except when programs use diagnostic ECC tag accesses Hardware Corrected L3 cache Tag ECC Errors Hardware corrected errors occur on single bit errors in tag value or tag ECC detected as the result of these transactions Cacheable I fetches Cacheable load pperations Atomic operations W cache exclusive request to the L3 cache to obtain data and ownership of the line in W cache Reads of the L3 cache by the processor in order to perform a writeback or copyout to the system bus Reads of the L3 cache by the processor to perform an operation placed in the prefetch queue by an explicit software PREFETCH instruction or a P cache hardware PREFETCH operation Reads of the L3 cache Tag while performing snoop read local writeback displacement flush L3 cache
159. 2 bit entry that selects an associative way 4 way associative wa y 2 b00 Way 0 2 b01 Way 1 2 b10 Way 2 2 b11 Way 3 18 6 index A 13 bit index PA 18 6 that selects an L2 cache entry 5 0 Mandatory value Should be 0 Caches Cache Coherency and Diagnostics 61 un microsystems The data format for L2 cache tag diagnostic access disp_flush 0 is described in TABLE 3 60 TABLE 3 60 L2 cache Tag Access Data Format 63 43 Mandatory value Should be 0 42 19 Tag A 24 bit tag PA 42 19 18 15 Mandatory value Should be 0 A 9 bit ECC entry that protects L2 cache tag and state The ECC bits protect all 4 ways of a given L2 cache index 14 6 ECC A 3 bit entry that records the access history of the 4 ways of a given L2 cache index If L2_split_en in the L2 cache control register ASI 6D16 is not set the LRU is as described below The LRU pointed way will not be picked for replacement if the corresponding state is NA LRU 2 0 000way0 is the LRU LRU 2 0 001 way is the LRU LRU 2 0 010way0 is the LRU LRU 2 0 01 1way1 is the LRU LRU 2 0 100way2 is the LRU LRU 2 0 101 way2 is the LRU LRU 2 0 110way3 is the LRU 5 3 LRU LRU 2 0 111way3 is the LRU If L2_split_en in the L2 cache control register ASI 6D16 is set the LRU is as described below LRU 2 is ignored and the logical processor ID of the logical processor that issues the request is used instead
160. 25 0 VA 24 23 EC_way VA 22 5 EC_addr VA 4 0 0 Name AST_L3CACHE_W AST_L3CACHE_R The address format for L3 cache data diagnostic access is described below in TABLE 3 65 TABLE 3 65 L3 cache data diagnostic access Bits Field Description 63 25 Mandatory value Should be 0 A 2 bit entry that selects an associative way 4 way associative 24 23 EC_way 2 b00 Way 0 2 b01 Way 1 2 b10 Way 2 2 b11 Way 3 The size of this field is determined by the EC_size field specified in the L3 cache control register 22 5 EC addr The UltraSPARC IV processor supports a 32MB L3 cache and therefore uses an 18 bit index PA 22 5 plus a 2 bit way 24 23 to read write a 32 byte field from the L3 cache to from the L3 cache Data Staging registers discussed in L3 cache Data Staging Registers on page 70 4 0 Mandatory value Should be 0 Note The off chip L3 cache data SRAM ASI accesses can take place regardless of the on chip L3 cache tag SRAM is on or off L3 cache Data Staging Registers ASI 7416 Read and Write Shared by both logical processors VA 63 6 0 VA 5 3 Staging Register Number VA 2 0 0 Name ASI_L3CACHE_DATA Caches Cache Coherency and Diagnostics 70 un microsystems 342 5 The address format for L3 cache data staging register access is shown in TABLE 3 66 TABLE 3 66 L3 cache data staging register access Bits F
161. 271 OTHERWIN register 87 overwrite policy AFSR non sticky bit 198 P PA Data Watchpoint Register 275 PA_WATCHPOINT register 88 PC register 86 PC Instr_cnt 102 PCACHE_DATA PC_adadr field 56 PC_dbl_word field 56 PC_way field 56 PCACHE_SNOOP_TAG PC_adadr field 58 PC_physical_tag field 58 PC_valid_bit field 58 PC_way field 58 PCACHE_STATUS_DATA PC_addr field 55 PC_way field 55 PCACHE_TAG PC addr field 57 PC_port field 57 PC_way field 57 PCR access 98 99 fields PRIV 99 ST system trace enable field 99 INDEX xli amp Sun microsystems un microsystems SU select upper bits of PIC field 99 UT user trace enable field 99 function Cycle_cnt 102 IC_ref function 106 SL field 116 ST field 102 state after reset 87 SU field 116 unused bits 268 UT field 102 PCR EC_hit function 111 PIC xlii DC_PC_rd_miss 106 DC_rd 106 DC_wr 106 DC_wr_miss 107 Dispatch0_2nd_br 104 DispatchO_IC_miss 103 104 DispatchO_other 104 DTLB_miss 107 FA_pipe_completion 116 FM_pipe_completion 116 HW_PF_exec 107 IC_fill 106 IC_L2_req 106 IC_miss_cancelled 106 IC_prefetch 106 IC_ref 106 IPB_to_IC_fill 106 ITLB_miss 106 IU_Stat_Br_count_taken 103 IU_Stat_Br_count_untaken 103 IU_Stat_Br_miss_untaken 103 IU_Stat_Jmp_correct_pred 103 IU_Stat_Jmp_mispred 103 ITU_Stat_Ret_correct_pred 103 IU_Stat_Ret_mispred 103 L2_hit_I_state_sh 109 L2_hit_other_half 108 L2_HWPF_miss 108 L2_ic_miss 108 L2_miss 108 L2_rd_miss 108 L
162. 28 27 DBG 1 0 1 Up to 8 outstanding transactions allowed 2 Up to 4 outstanding transactions allowed 3 One outstanding transaction allowed Contains the 10 bit Sun Fireplane Interconnect bus agent identifier for this processor This 26 17 AID 9 0 field must be initialized on power up before any Sun Fireplane Interconnect transactions are initiated Processor to Sun Fireplane Interconnect clock ratio bits 1 0 This field may only be written during initialization before any Sun Fireplane Interconnect transactions are 16 15 CLK 1 0 initiated refer to Additional CLK Encoding in the Sun Fireplane Interconnect Configuration Register on page 292 CBND address limit Physical address bits 42 37 are compared to the CBND field If 14 9 CBND 5 0 PA 42 37 CBASE and PA 42 37 lt CBND then PA is in COMA space Remote_WriteBack not issued in SSM mode CBASE address limit Physical address bits 42 37 are compared to the CBASE field If 8 3 CBASE 5 0 PA 42 37 CBASE and PA 42 37 lt CBND then PA is in COMA space Remote_WriteBack not issued in SSM mode If set it expects snoop responses from other Sun Fireplane Interconnect agents using the 2 SLOW slow snoop response Hierarchical bus mode If set uses the Sun Fireplane Interconnect protocol for a multilevel 1 HBM transaction request bus If cleared the UltraSPARC IV processor uses the Sun Fireplane Interconnect protocol for a single level transaction
163. 2_ref 108 L2_snoop_cb_sh 108 L2_snoop_inv_sh 108 L2_SWPF_miss 108 L2_wb 108 L2_wb_sh 108 L2_write_hit_RTO 108 L2_write_miss 108 L2L3_snoop_cb_sh 110 L2L3_snoop_inv_sh 110 L3_hit_I_state_sh 110 L3_hit_other_half 110 L3_ic_miss 109 UltraSPARC IV Processor User s Manual October 2005 amp Sun microsystems L3_miss 109 L3_rd_miss 109 L3_SWPF_miss 110 L3_wb 110 L3_wb_sh 110 L3_write_hit_RTO 110 L3_write_miss_RTO 110 MC _reads_ _sh 111 MC stalls_ _sh 111 MC_writes_ _sh 111 New_SSM_transaction_sh 112 PC_hard_hit 107 PC_inv 107 DC MS misses 107 PC_rd 107 PC_soft_hit 107 Re_DC_miss 105 Re_DC_missovhd 105 Re_FPU_bypass 105 Re_L2_miss 105 Re_L3_miss 105 Re_PFQ_ full 105 Re RAW miss 105 Rstall_FP_use 104 Rstall_TU_use 104 Rstall_storeQ 104 SI_ciq_flow_sh 115 SI_owned_sh 115 SI_RTO_src_data 115 SI_RTS_src_data 115 SI_snoop_sh 115 SSM_L3_miss_local 112 SSM_L3_miss_mtag_remote 112 SSM_L3_miss_remote 112 SSM_L3_wb_remote 112 SW_count_NOP 115 SW_PF_dropped 107 SW_PF_duplicate 107 SW_PF_exec 107 SW_PF_instr 107 SW_PF_L2_ installed 107 SW_PF_PC_installed 107 SW_PF_str_exec 107 SW_PF_str_trapped 107 WC miss 107 PIC register and PCR 98 access 98 99 event logging 99 PICL 269 PICL field 100 PICU 269 SL selection bit field encoding 116 state after reset 87 SU selection bit field encoding 116 PIL register 86 pipeline INDEX xliii un microsystems FGA 116 FGM 116 stages D 103 R 104 PIPT cach
164. 2nd 32 byte L3 cache original ecc Bad data not Bad data data and the data either the critical or the JE Ke moved from L3 in I cache dropped Frecise trap non critical 32 byte data do get used later cache to L2 cache I cache fill request with UE in the non Original data critical 32 byte the 2nd 32 byte L3 cache original ecc Bad data not Bad data data and the data either the critical or the ech yes moved from L3 in I cache dropped SCH non critical 32 byte data do get used later cache to L2 cache I cache fill request with CE in the critical 32 byte L3 cache data but the data never get used Original data original ecc Bad data not Bad data OR L3_UCC Yes moved from L3 in I cache dropped DOE I cache fill request with CE in the non cache to L2 cache critical 32 byte the 2nd 32 byte L3 cache data but the data never get used I cache fill request with UE in the critical 32 byte L3 cache data but the data never get used Original data original ecc Bad data not Bad data OR L3_UCU es moved from L3 in I cache dropped SE I cache fill request with UE in the non cache to L2 cache critical 32 byte the 2nd 32 byte L3 cache data but the data never get used Original data D cache 32 byte load request with CE in original ecc Bad data in Bad data F the critical 32 byte L3 cache data L3_UCC yes moved from L3 D cache dropped Precise trap cache to L2 cache Original data Bad data D cache 32 byte load request with UE in original ecc Ba
165. 3 7 2 Branch Target Buffer Accesses ASI 6Ej6 per logical processor Name ASI_BTB_ DATA The address format of the branch target buffer access is described in TABLE 3 28 TABLE 3 28 Branch Target Buffer Access Address Format Bits Field Description 63 10 Mandatory value Should be 0 BTB_addr is a 5 bit index VA 9 5 that selects a branch target buffer 9 5 BTB_addr entry 4 0 Mandatory value Should be 0 Caches Cache Coherency and Diagnostics 47 un microsystems Branch Target Buffer array entry is described below and described in TABLE 3 29 TABLE 3 29 Branch Target Buffer Data Format The address bits of the predicted target 63 2 Target Address i instruction 1 0 Reserved These two bits are unused Note The Branch Target Buffer is not ASI accessible in RED state 3 8 3 8 1 Data Cache Diagnostic Accesses Five D cache diagnostic accesses are supported e Data cache data fields access e Data cache tag valid fields access e Data cache microtag fields access e Data cache snoop tag access e Data cache invalidate Data Cache Data Fields Access ASI 4616 per logical processor T Name ASI DCACHE DATA TABLE 3 30 Data Cache Data Parity Access Address Format Bits Field Description 63 17 Mandatory value Should be 0 16 DC_data_parity A 1 bit index that selects a data 0 or a parity 1 A 2 bit index that
166. 3 cach Tag ECC Errors o cc ceccsscccveiceissscesacescvsaseseed dass E E E a a i 209 TTS System Bus ECC EMOL S mereci eege E AS ANENE E EEA EEA 210 V7 6 System Bus Status Errors iento eech E AES 212 7 7 7 SRAM e Fuse Array Related Errors ccecccceeescceesceeeenceeseneeeeeneeensneeesneeeesaees 213 Further Details of ECC Error Processing 213 LST System Bus ECC EOT Secene re deities E AE A N E T E EEA 213 182 L2 cache and L3 cache Data ECC Errors robisuienssrnarnnnanenanani 214 7 8 3 When Are Traps Taken cccccecccccecccceseeeesneeeeeeeeceneeceseeeseneeesseeeeseeeseteeensaees 215 7 8 4 When Are Interrupts Taken 0 0 ccc cccceceseceeceeceneeeceseeeseneeeseeeeesseeessteeensaees 217 TERR PERR Error Handling ee EE R 220 7 9 1 Error Detection and Reporting Structures ccccccccccesceeeeneeeeseeceseeesteeeesnees 220 7 9 2 Fatal Eitor FERR eseu ee hatte ees 220 79 3 Entering RED state onecie irori soia ir iari i EAE EEN 221 Behavior on L2 cache DATA Error srncisensaiiericinenieniania aa A 222 Behavior on L3 cache DATA ETOT ronisricair niietada e E N 225 Behavior on L2 cache TAG Errors oo ec eeeeceeseeecceseeseeeseeseeeeeeseseeeseeseseaeeaeseeesaeeseeeaeeaes 230 Behavior on L3 cache TAG Errors oo ee ceeeeceeeeeecceseeeeeesceseeeeeeseseeceseeseesseeaeseseeaseeeeeaeeaes 237 Table of Contents amp Sun xi microsystems un microsystems 7 14 Behavior on System Bus Errors ccccccescccssseceesseeee
167. 3 cache writeback the L3_WDU bit will be set in the AFSR_EXT during this trap handler which would generate a disrupting trap later if it were not cleared somewhere in this handler In this case the processor will write deliberately bad signalling ECC back to memory When the fast_ECC_error trap handler exits and retries the offending instruction the previously faulty line will be re fetched from main memory It will either be correct so the program will continue correctly or still contain an uncorrectable data ECC error in which case the processor will take a deferred instruction_access_error or data_access_error trap It is the responsibility of these later traps to perform the proper clean up for the uncorrectable error The fast_ECC_error trap routine does not need to execute any complex cleanup operations Encountering a software correctable error while executing the software correctable trap routine is unlikely to be recoverable To avoid this three approaches are known 1 The software correctable exception handler code can be written normally in cacheable space If a single bit error exists in exception handler code in the L2 cache other single bit L2 cache data or tag errors will be unrecoverable To reduce the probability of this the software correctable exception handler code can be flushed from the L2 cache and L3 cache at the end of execution This solution does not cover cases where the L2 cache or L3 cache has a hard fault on a
168. 3 else it is equal to 0 vtag 0 1 if the line is valid in Way 0 else it is equal to 0 vtag 1 1 if the line is valid in Way 1 else it is equal to 0 vtag 2 1 if the line is valid in Way 2 else it is equal to 0 vtag 3 1 if the line is valid in Way 3 else it is equal to 0 TABLE 7 4 highlights the checking sequence for D cache tag data parity errors TABLE 7 4 D cache Tag Data Parity Errors Page microtag Cache Line Physical tag D Cache Tag D Cache Data DCR s Signal a Cacheable Hit Valid Hit Parity Error Parity Error DPE bit Trap Yes Yes Yes 1 x x x 1 1 Yes I TLB Parity Errors The I TLB is composed of two structures the 2 way associative T512 array and the fully associative T16 array The T512 array has parity protection for both tags and data while the T16 array is not parity protected 155 amp Sun microsystems 7 2 3 1 e 7 2 4 7 2 4 1 7 2 4 2 725 Error Handling I TLB Parity Error Detection Please refer to TLB Parity Protection on page 304 for details about I TLB parity error detection I TLB parity Error Recovery Actions When all parity error reporting conditions are met I MMU enabled I TLB parity enabled and no translation hit in the T16 a parity error detected on a translation will generate an instruction_access_exception and the I SFSR FT will be set to 2016 Thus the instruction_access_exception trap handler must check the I SFSR to determine if the c
169. 31 microsystems 6 4 2 fp_exception_other Trap 6 4 3 6 4 4 The fp_exception_other trap occurs when a floating point operation cannot be completed by the processor unfinished_F Pop or an operation is requested that is not implemented by the processor unimplemented_F Pop Summary of Exceptions TABLE 6 12 Floating Point Unit Exceptions Trap Description Te Trap Status Fault Trap Type Exception Trap Vector Pare No traps fp_disabled Floating Point unit disabled None set enabled None 02015 Floating Point operation e invalid IEEE Floating Point operation of overflow IEEE Floating Point operation uf IEEE trap IEEE_745_exception fp_exception_IEEE_754 underflow IEEE enabled FSR FTT 1 02116 Floating Point operation division T by zero IEEE Floating Point operation inexact E IEEE Trap Event When a floating point exception causes a trap the trap is precise The traps that are affected are checked in TABLE 6 13 TABLE 6 13 Response to Traps Exception Event gt fp_disabled unimplemented_FPop Resulting Action 1 fp_exception_other unfinished_FPop fp_exception_IEEE_754 Address of instruction that caused the trap is put in the PC and pushed onto the trap stack The destination f register rd is unchanged from its state prior to the execution of the instruction that caused the trap The floating point condition codes och are unchanged v
170. 36 36 36 36 36 36 36 36 36 36 19 17 trace_in Description Address trace out cycles 29 22 21 0000 3 cycles not supported 29 22 21 0001 4 cycles not supported 29 22 21 0010 5 cycles 29 22 21 0011 6 cycles 29 22 21 0100 7 cycles 29 22 21 0101 8 cycles POR value 29 22 21 0110 9 cycles 37 29 22 21 0111 10 cycles 29 22 21 1000 11 cycles 29 22 21 1001 12 cycles 29 22 21 1010 unused 29 22 21 1011 unused 29 22 21 1100 unused 29 22 21 1101 unused 29 22 21 1110 unused 29 22 21 1111 unused Data trace in cycles 19 17 0000 2 cycles not supported 19 17 0001 4 cycles not supported 19 17 0010 5 cycles 19 17 0011 6 cycles 19 17 0100 3 cycles not supported 19 17 0101 7 cycles 19 17 0110 8 cycles POR value 19 17 0111 9 cycles 19 17 1000 10 cycles 19 17 1001 11 cycles 19 17 1010 12 cycles 19 17 1011 unused 19 17 1100 unused 19 17 1101 unused 19 17 1110 unused 19 17 1111 unused 35 35 35 35 35 16 EC_turn_rw L3 cache data turnaround cycle read to write 16 00 1 SRAM cycle not supported 16 01 2 SRAM cycles POR value 16 10 3 SRAM cycles 16 11 unused Caches Cache Coherency and Diagnostics 68 un microsystems
171. 4 bit load d Disrupting trap the request with tag UE TUE Ne Ze Original tag request is dropped i Disrupting tra L2 cache access for D cache THCE No No eo Ene g block load request with tag CE L2 cache Pipe tag retries the request L2 cache access for D cache Distupting wap ch block load request with tag TUE No Yes Original tag PTSS ee request is dropped UE i Disrupting tra D cache atomic request with L2 Pipe Ve f g THCE No No corrects the L2 cache pipe tag CE tag retries the request D cache atomic request with bests Disrupting trap the tag UE TUE No ZS Original tag request is dropped L2 cache access for Prefetch L2 Pipe Disrupting trap 0 1 2 3 20 21 22 23 17 THCE No No corrects the L2 cache pipe request with tag CE tag retries the request Disrupting trap L2 cache access for Prefetch hie teqwestis 0 1 2 3 20 21 22 23 17 TUE No Yes Original tag INE request with tag UE dropped 230 un microsystems Error Handling TABLE 7 23 1L2 cache Tag CE and UE errors 2 of 7 Errors tag L1 fast Error Event logged in SE Pin L2 cache Tag cache gt Comment AFSR data error L2 cache access for P cache L2 Pipe Disrupting trap HW prefetch request with tag THCE No No corrects the N A L2 cache pipe CE tag retries the request L2 cache access for P cache Disrupting trap the HW prefetch request with tag TUE No Yes Original tag N A prae Tap UE request is dropped L2 Pi
172. 58 TABLE 3 58 L2 cache Control Register 1 of 2 Reserved for future implementation 15 Queue_timeout_detected If set indicates that one of the queues that need to access the 12 13 pipeline was detected to timeout After hard reset this bit will be read as 1 b0 14 WC1_status If set indicates that the W cache of LP1 was stopped After hard reset this bit will be read as 1 b0 13 WCO_status If set indicates that the W cache of LPO was stopped After hard reset this bit will be read as 1 b0 12 11 9 Queue_timeout _ disable Queue_timeout If set disables the hardware logic that detects the progress of a queue After hard reset this bit will be read as 1 b0 Programmers should set this bit to ensure livelock free operation if the throughput of the L2 L3 pipelines is reduced from 1 2 to 1 16 or 1 32 The throughput is reduced whenever the L2L3 arbiter enters the single issue mode under one of the following conditions e The L2L3arb_single_issue_en field of the L2 cache control register is set e The L2 cache is disabled by setting the L2_off field of the L2 cache control register e The Write cache of an enabled logical processor is disabled Controls the timeout period of the queues timeout period 2 7 2 Queue_timeout system cycles where Queue_timeout 000 001 110 111 This gives a timeout period ranging from 128 system cycles to 2M system cycles
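The time-out arithmetic above can be checked with a few instructions. The following SPARC V9 sketch (register choices are illustrative, not from the manual) evaluates 2^(7 + 2*Queue_timeout) for a field value already extracted into %o0:

        ! Queue time-out period = 2^(7 + 2*Queue_timeout) system cycles.
        sllx    %o0, 1, %g1            ! 2 * Queue_timeout
        add     %g1, 7, %g1            ! exponent, ranges from 7 to 21
        mov     1, %g2
        sllx    %g2, %g1, %o1          ! period: 128 to 2,097,152 (about 2M) cycles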
173. 6 liclk 12clk 8 I1clk 8 bootclk 110110 LP 0 IU a0_valid al_valid ms_valid br_valid fa_valid fm_valid ins_comp mispred recirc 110100 LP 1 IU ms_valid br_valid fa_valid fm_valid ins_comp mispred recirc 2 LPO LPO LPO LPO LP LP LPI LP 110101 i e A r master LP IU trap_tak ins_disp ins_comp recirc trap_tak ins_disp ins_comp recirc E delta 111000 IOT impctl 2 up Zcu 3 Zcu 2 Zcu 1 Zcu 0 up IOT 111001 impctl gelta Zcd 3 Zcd 2 Zcd 1 Zcd 0 Zcd 1 Zcd 0 P down 3 111011 IOL impctl 0 Zceu 3 Zcu 0 IOL 111010 impcti 1 Zcd 3 Zcd 0 111110 IOR Zeul Zcu 2 Zeu 1 Zcu 0 Zcu 1 Zcu 0 impctl 4 IOR 111111 impctl 5 Zcd 3 Zcd 2 Zced 1 Zcd 0 Zcd 1 Zcd 0 IOB 111101 Zcul Zcu 0 impctl 6 IOB 111100 Zcd 3 Zcd 0 impctl 7 271 un microsystems TABLE 9 2 Signals Observed at obsdata 9 0 for Settings on Bits 11 6 of the DCR 2 of 2 Bee Signal obsdata obsdata obsdata obsdata obsdata obsdata obsdata obsdata obsdata 11 6 Source 9 8 7 6 5 4 3 2 1 L2 cache L2 ee 12t_LP_id 110000 pipeline cache_ gt 12t_valid E 12t_sre 4 12t_sre 3 12t_sre 2 12t_sre 1 T signal hit L3 L3 cache 110001 ine sional 20e Bt LP_id 13t_valid ES 13t_sre 4 13t_src 3 13t_sre 2 13t_sre 1 one pipeline signa hit L2 cache livelock SE ee queue_ under_ under_ under_ stop_w stop_w 12miss_fi 100000 L3 cache tix watch_ 2 watch_ 1 watch_ 0 Cac
174. 6 lig lie DTL_S Group 11 ERROR_L FREEZE_L FREEZEACK_L CHANGE_L lie DTL_6 Group 4 PAUSEIN_L OWNEDIN_L SHAREDIN_L MAPPEDIN_L Sun Fireplane Interconnect and Processor Identification 291 amp Sun microsystems Note The Sun Fireplane Interconnect bootbus signals CAStrobe_L ACD_L and Ready_L have their DTL configuration programmable through two the UltraSPARC IV processor package pins All other Sun Fireplane Interconnect DTL signals that do not have a programmable configuration are configured as DTL end Several fields of the Sun Fireplane Interconnect Configuration Register do not take effect until after a soft POR is performed If these fields are read before a soft POR then the value last written will be returned However this value may not be the one currently being used by the processor The fields that require a soft POR to take effect are HBM SLOW DTL CLK MT MR NXE SAPE CPBK_BYP Low power mode clock rate is NOT supported in the UltraSPARC IV processor The field 31 30 are kept for backward compatibility and will be write ignore and read 0 11 1 1 2 The UltraSPARC IV Processor System Bus Clock Ratio The default UltraSPARC IV processor boot up system bus clock ratio is 8 1 The UltraSPARC IV processor supports system bus clock ratio from 8 1 to 16 1 11 1 1 3 Additional CLK Encoding in the Sun Fireplane Interconnect Configuration Register CLK 3 CLK 2 CLK 1 0 bi
175. …bits 62, 54, and 51:33 of AFSR1 and bits 11:0 of AFSR1_EXT are cleared. AFAR1 is captured when one of the AFSR1 error status bits that capture an address is set (see TABLE 7-16 for details). The address corresponds to the first occurrence of the highest-priority error that captures an address, according to the AFAR1 overwrite policy in AFSR1/AFSR1_EXT. Address capture in AFAR is re-enabled by clearing the corresponding error bit in AFSR1. See Clearing the AFSR and AFSR_EXT on page 182, above, for a description of the behavior when clearing occurs at the same time as an error.
AFAR1: ASI 0x4D, VA<63:0> = 0x0, private. AFAR2: ASI 0x4D, VA<63:0> = 0x8, private. Name: ASI_ASYNC_FAULT_ADDRESS.
TABLE 7-15 Asynchronous Fault Address Register
Bits 63:43: Reserved for future implementation (R).
Bits 42:4: PA<42:4>, physical address of the faulting 16-byte component; bits 5:4 isolate the fault to a 128-bit sub-unit within a 512-bit coherency block (RW in AFAR1, R in AFAR2).
Bits 3:0: Reserved for future implementation (R).
TABLE 7-15 describes AFAR1; AFAR2 differs only in being read-only. PA holds address information for the most recently captured error. In the event of multiple errors within a 64-byte block, AFAR captures only the first detected, highest-priority error. When there is an asynchronous error and AFAR2 is unfrozen (that is, AFSR1 bits 62, 54, 51:33 and AFSR1_EXT bits 11:0 are zero), a write to AFAR1 will write to both AFAR1 and AFAR2.
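As a minimal illustration of the register pair just described, the following privileged SPARC V9 sketch reads AFAR1 and AFAR2 using the ASI number and VA offsets from the table above; the registers used and the final shift are illustrative only.

        ldxa    [%g0] 0x4D, %g2        ! AFAR1 (ASI 0x4D, VA 0x0, read/write)
        mov     0x8, %g1
        ldxa    [%g1] 0x4D, %g3        ! AFAR2 (ASI 0x4D, VA 0x8, read only)
        srlx    %g2, 4, %g4            ! right-align PA<42:4>; bits 63:43 are reserved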
176. … 184
Asynchronous Fault Status Extension Register  185
Key to interpreting TABLE 7-10  187
Data Single Bit Error ECC Syndromes  187
Data Check Bit Single Bit Error Syndrome  188
ECC Syndromes  189
Microtag Single Bit Error ECC Syndromes  190
Syndrome table for Microtag ECC  190
Asynchronous Fault Address Register  191
Error Reporting Summary  192
TABLE 7-17 through TABLE 7-29, TABLE 8-1 through TABLE 8-6, TABLE 9-1 through TABLE 9-4, TABLE 10-1, TABLE 10-2, TABLE 11-1 through TABLE 11-9
177. TABLE 13-1, TABLE 13-2
Traps and when they are taken
L2 cache data CE and UE Errors
L2 cache data Writeback and Copyback Errors
L3 cache Data CE and UE Errors
L3 cache ASI Access Errors
L3 cache Data Writeback and Copyback Errors
L2 cache Tag CE and UE Errors
L2 cache Tag CE and UE Errors
L3 cache Tag CE and UE Errors
L3 cache Tag CE and UE Errors
System Bus CE, UE, TO, DTO, BERR, DBERR errors
System Bus EMC and EMU errors
System Bus IVC and IVU errors
Exceptions Specific to the UltraSPARC IV Processor
I-cache Data Parity Error Behavior on Instruction Fetch
I-cache Physical Tag Parity Error Behavior on Instruction Fetch
D-cache Data Parity Error Behavior on Canceled Retried Special Load
D-cache Physical Tag Parity Error Behavior on Canceled Retried Special Load
P-cache Data Parity Error Behavior
Dispatch Control Register
Signals Observed at obsdata[9:0] for Settings on Bits 11:6 of the DCR
178. ABLE 3 6 summarizes snoop output and DTag transition TABLE 3 7 summarizes snoop input and CIQ operation queueing Note The symbol implies not TABLE 3 6 Snoop Output and DTag Transition 1 of 3 Shared Owned Error Nexe Snooped Request DTag State Output Output Output DTag Action for Snoop Pipeline State dI 0 0 0 dT own RTS wait data dS 1 0 1 ds Error own RTS for data dT 1 1 dT Error d 0 0 ds own RTS inst wait data ds 1 1 dS Error own RTS for instructions dT 1 1 dT Error dI 0 dI none dS 1 ds none foreign RTS foreign RTS copyback a O dT 1 0 dS none dI 0 0 dO own RTO wait data dS amp SSM 1 1 dO own RTO no data own RTO dS amp SSM ZERURA own RTO wait data dO 1 1 dO own RTO no data dT 1 1 1 dO Error dI 0 0 dI none dS Ee ee foreign RTO invalidate foreign RTO foreign RTO copyback q9 a invalidate dT 0 dI foreign RTO invalidate Caches Cache Coherency and Diagnostics 33 un microsystems TABLE 3 6 Snoop Output and DTag Transition 2 of 3 Shared Owned Error 3 SE Snooped Request DTag State Output Output Output Action for Snoop Pipeline dI 0 own RS wait data own RS dO 1 dO Error dT 1 dT Error d d none foreign RS foreign RS copyback dO 1 dO discard dT 0 dT none ds 1 d own WB cancel own WB dO 0 d own WB dT 1 1 d Error ds ds none foreign WB dO dO none dT dT none dS 0 dS foreign RT
179. AD_1_REG Scratchpad register 1 RW Private 4F 16 ASI_SCRATCHPAD 2 REG Scratchpad register 2 RW Private 4F 16 ASI_SCRATCHPAD_3_REG 1815 Scratchpad register 3 Private 4F i ASI_SCRATCHPAD A REG Scratchpad register 4 RW Private 4F 16 ASI_SCRATCHPAD_5_ REG Scratchpad register 5 RW Private 4F 16 ASI_SCRATCHPAD _6_REG Scratchpad register 6 RW Private AF 16 ASI_SCRATCHPAD_7_REG 38 4 Scratchpad register 7 Private 5016 Reserved Reserved for future Private implementation 5016 ASI_IMMU_TAG_ACCESS_EXT EEN RW Private register 5146 to E Reserved for future Private 5416 implementation 5516 Reserved Reserved tor ER Private implementation 4000016 5516 ASL I TLB_DIAG_REG to I TLB diagnostic register RW Private 60FF8 16 5646 to E Reserved for future Private 5716 implementation 5816 Reserved Reserved tor meure Private implementation 5816 ASI_DMMU_TAG_ACCESS_EXT Dre Tag access extension RW Private register 5916 to e Reserved for future Private 5C16 implementation Address Space Identifiers 283 un microsystems TABLE 10 2 The UltraSPARC IV processor ASI Extensions 4 of 5 Dax Private Value ASI Name Suggested Macro Syntax Description R W Shared 5Di6 Reserved Reserved Tor future Private implementation 5D16 ASIDTLB_DIAG_REG D TLB diagnostic register RW Private Sie to Reese Reserved for future ie 6016 implementation 63 46 ASLINTR_ID
180. ARC IV processor still supports the same three TLBs as earlier family members a small fully associative 16 entry TLB and two large 2 way set associative 512 entry TLBs However in the UltraSPARC IV processor one large D TLB continues to support page sizes of 8 KB 64 KB 512 KB and 4 MB while the second of the two large D TLBs has been modified to support page sizes of 8KB 64 KB 32 MB and 256 MB The new large page sizes allow the UltraSPARC IV processor to support applications that need to map extremely large data sets A TLB can only access fill pages of one size at a time but the two large TLBs are each programmable and may be independently set to support either the same page size or different page sizes Thus for systems with very large memories one TLB can be set to handle default pages of either 8 or 64 KB while the other TLB handles large pages of 32 or 256 MB Whereas for systems with smaller memories both TLBs can be set to handle the default page size doubling the number of entries available for mapping smaller pages 1 4 1 4 1 1 4 2 Architectural Overview Cache Hierarchy The cache hierarchy supported by the UltraSPARC IV processor has been completely revised The cache hierarchy has been expanded from two to three levels L1 L2 L3 L1 Cache The L1 instruction cache I cache was doubled in size from 32 KB to 64 KB in the UltraSPARC IV processor The expanded I cache has a 64 byte line
181. …16-entry, fully associative, for both locked and unlocked pages.
T512_0 (TLB ID 2): 8 KB, 64 KB, 512 KB, 4 MB. First large D-TLB; 512-entry, 2-way set associative (256 entries per way), for unlocked pages.
T512_1 (TLB ID 3): 8 KB, 64 KB, 32 MB, 256 MB. Second large D-TLB; 512-entry, 2-way set associative (256 entries per way), for unlocked pages.
Two D-TLBs with Large Page Support
The UltraSPARC IV processor has three D-TLBs. The first, the 16-entry fully associative D-TLB (the T16), remains the same as in the UltraSPARC III processor, with the exception of supporting 8 KB unlocked pages. All supported page sizes are described in TABLE 13-15. When a memory access is issued, its VA, Context, and PgSz are presented to the D-MMU, and all three D-TLBs (T512_0, T512_1, and T16) are accessed in parallel.
Note: Unlike the UltraSPARC III processor, the UltraSPARC IV processor's T16 can support unlocked 8 KB pages, to prevent dropping of a D-TLB fill of an unlocked 8 KB page when the large D-TLBs are programmed to page sizes other than unlocked 8 KB pages.
When both large D-TLBs are configured with the same page size, the pair behaves like a single 1024-entry, 4-way set-associative D-TLB (256 entries per way). Each T512's page size (PgSz) is programmable independently, one PgSz per context (primary, secondary, nucleus). Software can set the PgSz fields in ASI_PRIMARY_CONTEXT_REG and ASI_SECONDARY_CONTEXT_REG as described (a programming sketch is shown below).
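A minimal sketch of the page-size programming mentioned above follows. It assumes that the primary context register is reached at ASI 0x58, VA 0x08, as on earlier UltraSPARC III family parts; the constants PGSZ0_MASK and PGSZ0_4MB are hypothetical stand-ins for the PgSz field position and encoding defined in the referenced section and are not taken from this manual.

        set     PGSZ0_MASK, %g3        ! hypothetical mask covering the T512_0 PgSz field
        set     PGSZ0_4MB, %g4         ! hypothetical encoded 4 MB value, pre-shifted
        mov     0x08, %g1
        ldxa    [%g1] 0x58, %g2        ! primary context register (assumed ASI/VA)
        andn    %g2, %g3, %g2          ! clear the old page-size selection
        or      %g2, %g4, %g2
        stxa    %g2, [%g1] 0x58        ! write the new PgSz selection back
        membar  #Sync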
182. Back to back Sun Fireplane Interconnect Request Bus Mastership 32 DEAD 1 Inserts a dead cycle in between bus masters on the Sun Fireplane Interconnect Request Bus Sun Fireplane Interconnect and Processor Identification 289 un microsystems TABLE 11 2 FIREPLANE_ CONFIG Register Format 3 of 3 Bits Field Description 31 30 Reserved Reserved for future implementation Processor to Sun Fireplane Interconnect clock ratio bit 2 This field may only be written 29 CLK 2 during initialization before any Sun Fireplane Interconnect transactions are initiated refer to Additional CLK Encoding in the Sun Fireplane Interconnect Configuration Register on page 292 Debug 0 Up to 15 outstanding transactions allowed 28 27 DBG 1 0 1 Up to 8 outstanding transactions allowed 2 Up to 4 outstanding transactions allowed 3 One outstanding transaction allowed Contains the lower 10 bits of the logical processor Interrupt ID ASI_INTR_ID for each 26 17 INTR_ID 9 0 logical processor on the processor This field is logical processor specific and is not shared Processor to Sun Fireplane Interconnect clock ratio bits 1 0 This field may only be written during initialization before any Sun Fireplane Interconnect transactions are 16 15 initiated refer to Additional CLK Encoding in the Sun Fireplane Interconnect Configuration Register on page 292 CBND address limit Physical address bits 42 37 are compared to
183. C 0 L3_WDU Multiple bit ECC error on L3 cache data access for writeback RWIC TABLE 7 8 describes AFSR1_EXT AFSR2_EXT is identical except that all bits are read only ECC Syndromes ECC syndromes are captured on system bus data and microtag ECC errors and on L2 cache data ECC errors and L3 cache data ECC errors Syndromes are not captured for L2 cache tag ECC errors and L3 cache tag ECC errors The syndrome tables for system bus data L2 cache data L3 cache data and system bus microtags are given here E_SYND The AFSR E_SYND field contains a 9 bit value that indicates which data bit of a 128 bit quad word contains a single bit error This field is used to report the ECC syndrome for system bus L2 cache tag and data and L3 cache tag and data ECC errors of all types HW_corrected SW_correctable and uncorrectable 186 un microsystems TABLE 7 10 shows the 9 bit ECC syndromes that correspond to a single bit error for each of the 128 data bits To locate a syndrome in the table use the low order 3 bits of the data bit number to find the column and the high order 4 bits of the data bit number to find the row For example data bit number 126 is at column 0x6 row Oxf and has a syndrome of 0x1c9 TABLE 7 9 Key to interpreting TABLE 7 10 Interpretation Example Data bit number decimal 126 Data bit number hexadecimal Data bit number 7 bit binary 111 1110 High 4 bits binary 1111 Low 3 bits binary
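The row/column rule described above (the low-order 3 bits of the data-bit number select the column, the high-order 4 bits select the row) can be expressed directly. In this sketch the data-bit number is assumed to be in %o0, so 126 (0x7e) yields column 0x6 and row 0xf, matching the worked example.

        and     %o0, 0x7, %o1          ! column = data-bit number <2:0>
        srlx    %o0, 3, %o2            ! row    = data-bit number <6:3>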
184. C IV processor are equally as fast TABLE 9 4 ASI_SCRATCHPAD_n_REG Register REGISTER NAME ASI VA SHARED ACCESS SP ASI_SCRATCHPAD_0_REG 0x4F 0x00 Private RD WR NO ASI_SCRATCHPAD_1_REG Ox4F 0x08 Private RD WR NO ASI_SCRATCHPAD_2_REG Ox4F Private ASI_SCRATCHPAD_3_REG Ox4F Private ASI_SCRATCHPAD_4_REG Ox4F Private ASI_SCRATCHPAD_5_REG Ox4F Private ASI_SCRATCHPAD_6_REG Ox4F Private ASI_SCRATCHPAD_7_REG Ox4F 0x38 Private RD WR NO Registers 276 amp Sun microsystems 10 Address Space Identifiers A SPARC V9 processor generates an address space identifier ASI with every address sent to memory The ASI provides the following e The ability to distinguish different address spaces e An attribute that is unique to an address space e A map of the internal control and diagnostic registers within a processor The UltraSPARC IV processor memory management hardware translates a 64 bit virtual address and an 8 bit ASI to a 43 bit physical address The UltraSPARC IV processor supports both big endian and little endian byte ordering The default data access byte ordering after a power on reset is big endian Instruction fetches are always big endian Note Programmers must not issue any memory operation with ASI_PHYS_USE_EC or AST_PHYS_USE_EC_LITTLE to any bootbus address This chapter discusses the following sections Chapter Topics e TSB AS
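Returning to the scratchpad registers of TABLE 9-4 above, the sketch below writes and reads back ASI_SCRATCHPAD_1_REG using the ASI and VA shown in that table; the global registers and the MEMBAR (included as a conservative ordering barrier) are illustrative choices, not requirements stated here.

        mov     0x08, %g1
        stxa    %g2, [%g1] 0x4F        ! write scratchpad register 1 (ASI 0x4F, VA 0x08)
        membar  #Sync                  ! conservative: order the internal ASI store
        ldxa    [%g1] 0x4F, %g3        ! read the value back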
185. …PC value to TPC[TL] and TNPC[TL]. The UltraSPARC IV processor truncates (the high-order 32 bits are set to 0) the branch target address sent to nPC by a CALL or JMPL instruction, as well as the value loaded into PC/nPC from TPC[TL]/TNPC[TL] on returning from a trap using DONE/RETRY. When an exception occurs, the UltraSPARC IV processor writes the full 64-bit address to the D-SFAR.
Note: Exiting RED_state by writing 0 to PSTATE.RED in the delay slot of a JMPL instruction is not recommended. A noncacheable instruction prefetch can be made to the JMPL target, which can be in a cacheable memory area; this may result in a bus error on some systems and cause an instruction_access_error trap. Programmers can mask the trap by setting the NCEEN bit in the L3 cache Error Enable Register to 0, but this solution masks all non-correctable error checking. Exiting RED_state with DONE or RETRY avoids the problem.
9.3 Registers
9.3.1 Ancillary State Registers (ASRs)
Performance Control Register (PCR) (ASR 16)
Bits 47:32, 26:17, and bit 3 of the PCR are unused in the UltraSPARC IV processor. These bits read as zeroes, and writes to these bits are ignored.
9.3.2 Performance Instrumentation Counter (PIC) Register (ASR 17)
The performance instrumentation counters, previously known as PIC1 and PIC0 in earlier processor documentation, are now known as PICU and PICL.
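Returning to the RED_state note above, a minimal sketch of the recommended exit path is shown below. It assumes that TSTATE at the current trap level already holds the desired post-RED_state PSTATE value and that %g1 holds the resume address; neither assumption is spelled out in this fragment.

        wrpr    %g1, %tpc              ! resume address
        add     %g1, 4, %g2
        wrpr    %g2, %tnpc
        retry                          ! PSTATE (with RED = 0) is restored from TSTATE and
                                       ! execution resumes at %g1, avoiding the JMPL
                                       ! delay-slot prefetch hazard described above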
186. CACHE_TAG DC addr field 49 DC_tag field 50 DC_tag_parity field 50 DC_valid field 50 DC_way field 49 DCACHE_UTAG DC addr field 50 DC_way field 50 DCR OBS field 270 requirements for controlling observability bus 272 INDEX xxxiii amp Sun microsystems un microsystems state after reset 87 DCU control register 83 DCUCR access data format 274 CP cacheability field 25 26 274 313 CV cacheability field 24 25 274 331 DC D cache enable field 25 275 DM D MMtU enable field 275 DM field 25 HPE P cache HW enable field 274 HPE field 27 IC I cache enable field 24 275 IM IMMU enable field 275 IPS instruction prefetch stride field 274 ME noncacheable store merging enable field 274 PCM P cache mode field 274 PE P cache enable field 274 PE field 27 PM physical address field 275 PPS P cache prefetch stride field 274 PPS field 27 PR PA data watchpoint enable field 275 PW PA data watchpoint enable field 275 RE RAW bypass enable field 274 SPE software prefetch enable field 275 SPE field 27 state after reset 87 VM virtual address data watchpoint mask field 275 VR VA data watchpoint enable field 275 VW VA data watchpoint enable field 275 WCE W cache coalescing enable field 274 WE W cache enable field 275 deferred traps 254 and TPC TNPC 255 handling 257 delay slot and instruction fetch 93 diagnostics L2 cache accesses 58 L3 cache accesses 66 displacement flush 29 L2 cac
187. … 102
5.8 Pipeline Counters  102
5.8.1 Instruction Execution and Clock Counts  102
5.8.2 IU Branch Statistics  103
5.8.3 IU Stall Counts  103
5.8.4 Restage Stall Counts  104
5.8.5 Recirculate Stall Counts  105
5.9 Cache Access Counters  106
5.10 Memory Controller Counters  111
5.11 Data Locality Counters for Scalable Shared Memory Systems  111
5.11.1 Scalable Shared Memory Systems  112
5.11.2 …  112
5.11.3 Data Locality Event Matrix  114
5.12 Miscellaneous Counters  115
5.12.1 System Interface Event Counters  115
5.12.2 Software Event Counters  115
5.12.3 Floating Point Operation Events  116
5.13 PCR.SL and PCR.SU Encoding  116
6. IEEE 754-1985 Standard  119
6.1 Floating Point Operations
188. C_error trap will be generated provided that the CEEN bit is set in the Error Enable Register Hardware will correct the error EMU When data are entering the UltraSPARC IV processor from the system bus the microtags will be checked for the correctness of ECC If a multi bit error is detected it will be recognized as an uncorrectable error and the EMU bit will be set to log this error condition Provided that the 210 amp Sun Error Handling microsystems NCEEN bit is set in the Error Enable Register a deferred instruction_access_error or data_access_error trap will be generated depending on whether the read was to satisfy an instruction fetch or a load operation Multiple occurrences of this error will cause the AFSR ME to be set IVC When interrupt vector data are entering the UltraSPARC IV processor from the system bus the data will be checked for ECC correctness If a single bit error is detected the IVC bit will be set to log this error condition A disrupting ECC_error trap will be generated provided that the CEEN bit is set in the Error Enable Register Hardware will correct the error IVU When interrupt vector data are entering the UltraSPARC IV processor from the system bus the data will be checked for ECC correctness If a multi bit error is detected it will be recognized as an uncorrectable error and the IVU bit will be set to log this error condition A disrupting ECC_error trap will be generated prov
189. …IC_parity = XOR(IC_instr[31:11], IC_predecode[9:7], IC_predecode[4:0]).
For a non-PC-relative instruction (IPB_predecode[7] = 0):
IPB_parity = XOR(IPB_instr[31:0], IPB_predecode[9:7], IPB_predecode[5:0]).
For a PC-relative instruction (IPB_predecode[7] = 1):
IPB_parity = XOR(IPB_instr[31:11], IPB_predecode[9:7], IPB_predecode[4:0]).
The PC-relative instructions are, in SPARC V9 terms, BPcc, Bicc, BPr, CALL, FBfcc, and FBPfcc; for these, IC_predecode[7] = IPB_predecode[7] = 1.
To test hardware and software for I-cache parity error recovery, programs can cause an instruction to be loaded into the I-cache by executing it, then use the I-cache diagnostic accesses to flip the parity bit (using ASI_ICACHE_INSTR, ASI 0x66), or to modify the tags (using ASI_ICACHE_TAG, ASI 0x67) or the data (using ASI_ICACHE_DATA). Upon re-executing the modified instruction, a precise trap should be generated. If no trap is generated, the program should check the I-cache using diagnostic accesses to see whether the instruction has been repaired; this would be a sign that the broken instruction had been displaced from the I-cache before it had been re-executed. Iterating this test can check that each covered bit of the I-cache physical tags and data is actually connected to its parity generator. Instructions need to be executed in order to discover the value of the predecode bits synthesized by the processor; after the I-cache fill, the predecode bits can be read via ASI_ICACHE_INSTR. Instructions
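The test procedure above can be started with a diagnostic read-modify-write such as the sketch below. The ASI number 0x66 comes from the text; %g1 is assumed to already hold an address encoded in the ASI_ICACHE_INSTR diagnostic address format (way and index), which is not reproduced in this fragment.

        ldxa    [%g1] 0x66, %g2        ! read one I-cache instruction entry, including
                                       ! its predecode and parity information
        xor     %g2, 1, %g2            ! flip one covered bit
        stxa    %g2, [%g1] 0x66        ! write the corrupted entry back
        membar  #Sync                  ! then re-execute the target instruction and
                                       ! expect a precise icache_parity_error trap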
190. D status This is not a hardware time out operation which causes an AFSR PERR event It is also different from a DSTAT 2 time out response for a Sun Fireplane Interconnect transaction which causes an AFSR BERR or AFSR DBERR For an unmapped bus error due to a system bus read for an instruction fetch load like block load or atomic operation or a system bus write for block store to memory WS store to I O WIO block store to I O WBIO writeback from L3 cache or interrupt vector transmit operation the TO bit will be set to log this error condition Provided that the NCEEN bit is set in the Error Enable Register a deferred instruction_access_error or data_access_error trap will be generated depending on whether the read was to satisfy an instruction fetch or a load operation Multiple occurrences of this error will cause AFSR1 ME to be set DTO When the UltraSPARC IV processor performs a system bus read access it is possible that no device responds with a MAPPED status This is not a hardware time out operation which causes an AFSR PERR event It is also different from a DSTAT 2 time out response for a Sun Fireplane Interconnect transaction which causes an AFSR BERR or AFSR DBERR For an unmapped bus error due to a system bus read for a prefetch queue or read to own store queue operation the DTO bit will be set to log this error condition Provided that the NCEEN bit is set in the Error Enable Register a deferred data_
191. D_state to normal mode 3 1 2 2 Data Cache D cache The data cache is a 64 KB 4 way set associative cache with 32 byte line size It is a write through non write allocate cache The D cache uses a pseudo random replacement policy D cache tag and data arrays are parity protected Data accesses bypass the D cache if the D cache enable DC bit in the Data Cache Unit Control Register is not set DCUCR DC 0 If the D MMU is disabled DCUCR DM 0 then cacheability in the D cache is determined by the CP and CV bits If the access is mapped by the D MMU as non virtual cacheable then load misses will not allocate in the D cache For more information on the DM CP or CV bits see Data Cache Unit Control Register DCUCR on page 273 A non virtual cacheable access may access data in the D cache from an earlier cacheable access to the same physical block unless the D cache is disabled Note Software must flush the D cache when changing a physical page from cacheable to non cacheable see Cache Flushing on page 28 3 1 3 Physically Indexed Physically Tagged Caches PIPT The Write cache Level 2 L2 cache and Level 3 L3 cache are physically indexed physically tagged PIPT caches These caches have no references to virtual address and context information The operating system needs no knowledge of such caches after initialization except for stable storage management and error handling 3 1 3 1 Write Cache W cache
192. Data Memory Management Unit 334 un microsystems D MMU SFSR is described in TABLE 13 28 TABLE 13 28 D SFSR Bit Description Bit Description R W 63 25 Reserved Reserved for future implementation 24 N Set if the faulting instruction is a nonfaulting load a load to RW ASI_NOFAULT F J Records the 8 bit ASI associated with the faulting instruction This field is 23 16 AS valid for all traps in which the FV bit is set RW 15 TM D TLB miss RW 14 Reserved Reserved for future implementation Specifies the exact condition that caused the recorded fault according to following this table In the D MMU the Fault Type field FT 5 0 is valid only for data_access_exception faults and invalid for fast_data_access_MMU_miss FT 6 is valid only for fast_data_access_MMU_miss and invalid for data_access_exception There RW is no ambiguity in all other MMU trap cases The hardware does not priority encode the bits set in the fault type register that is multiple bits can be set Note that when a D TLB parity error occurs FT 5 1 the other FT bits FT 6 and FT 4 0 are undefined Side effect bit Associated with the faulting data access or FLUSH instruction Set by translating ASI accesses that are mapped by the TLB with 6 E the E bit set and bypass ASIs 1516 and 1D 6 The E bit is undefined after a RW D TLB parity error All other cases that update the SFSR including bypass or internal ASI accesses set the
193. Disrupting traps e Multiple traps Precise Traps A precise trap occurs before any program visible state has been modified by the instruction to which the trap saved program counter TPC points When a precise trap occurs several conditions are true e The program counter PC saved in TPC TL points to a valid instruction which will be executed by the program The next program counter nPC saved in TNPC TL points to the instruction that will be executed following that one e All instructions issued before the one pointed to by the TPC have completed execution e Any instructions issued after the one pointed to by the TPC remain unexecuted The UltraSPARC IV processor generates three varieties of precise traps associated with data errors dcache_parity_error icache_parity_error and fast_ECC_error Exceptions Traps and Trap Types 253 un microsystems A precise dcache_parity_error trap 1s generated for parity errors detected in the D cache data physical tag arrays or P cache data arrays as the result of instructions that perform a load A precise icache_parity_error trap is generated for a parity error detected in the I cache data or physical tag arrays as the result of an I fetch A precise fast_ECC_error trap also occurs when an uncorrectable L2 cache tag a data error correcting code ECC error or L3 cache data ECC error is detected as the result of a D cache load miss atomic instruction or I cache miss for I fetch All
194. E bit to 0 Context Register selection The context is set to 11 when the access does 5 4 GI not have a translating ASI RW PR W OW FV 13 7 3 Privilege bit Set if the faulting access occurred while in privileged mode RW This filed is valid for all traps in which FV bit is set Write bit Set if the faulting access indicated a data write operation a store i RW or atomic load store instruction 2 Overwrite bit When the D MMU detects a fault the Overwrite bit is set to 1 if the FV bit has not been cleared from a previous fault otherwise it is set to RW 0 1 Fault Valid bit Set when the D MMU detects a fault it is cleared only on an explicit ASI write of 0 to SFSR When the FV bit is not set the values of the remaining fields in the SFSR and SFAR are undefined for traps RW 0 TABLE 13 29 D MMU Synchronous Fault Status Register FT Fault Type Field I D FT 6 0 Description D Olie Privileged violation D 0216 Speculative Load instruction to page marked with E bit This bit is 0 for internal ASI access D 0416 Atomic including 128 bit atomic load to page marked non cacheable D 0816 Ga LDA STA ASI value VA RW or size Does not include cases where 0216 and 0446 are D 1016 Access other than nonfaulting load to page marked NFO This bit is 0 for internal ASI accesses 2016 D TLB tag or data parity error D 4016 D TLB miss with prefetch instru
195. E iaie 56 Prefetch Cache Diagnostic Data Access Address Format 56 UltraSPARC IV Processor User s Manual October 2005 TABLE 3 53 TABLE 3 54 TABLE 3 55 TABLE 3 56 TABLE 3 57 TABLE 3 58 TABLE 3 59 TABLE 3 60 TABLE 3 61 TABLE 3 62 TABLE 3 63 TABLE 3 64 TABLE 3 65 TABLE 3 66 TABLE 3 67 TABLE 3 68 TABLE 3 69 TABLE 3 70 TABLE 3 71 TABLE 3 72 TABLE 4 1 TABLE 5 1 TABLE 5 2 TABLE 5 3 TABLE 5 4 TABLE 5 5 TABLE 5 6 TABLE 5 7 TABLE 5 8 TABLE 5 9 TABLE 5 10 TABLE 5 11 TABLE 5 12 TABLE 5 13 TABLE 5 14 TABLE 5 15 Prefetch Cache Diagnostic Data Access Data Format o ecccccsescesseseesseeeeeeeeseeseeseeseeseeaeeneeaeens 56 Prefetch Cache Tag Register Access Address Format o cccccesssscsssceeneeseeseeeseeerseeseesaeeesacees 57 Prefetch Cache Tag Register Access Data Format 57 Prefetch Snoop Tag Access Address Format 58 Prefetch Cache Snoop Tag Access Data Format e sssssssessesisisesesrstsesterrertetststsrstsrsrsrerererersrseet 58 E2 cache Control Registers sninnrserori rns aae VEEE E OE EEE NEIEN E 59 L2 cache Tag Access Address Format 6l E2 ca he Tag Access Data Format ged aii tanh A A E YET ones 62 L2 cache Data Diagnostit Access 01 65 L2 cache Data Access Data Format when ECC_sel 50 usssssesesisisesrsrersesrestsrsrersrsrsrsrererersrsene 65 L2 cache data access Data Format when ECC_sel 51 wee eeecessceeseeeseneeseeseseeseeesseesesesaeeesacees
196. ECC bits are set it indicates that an address parity error has occurred L3_WDC For an L3 cache writeback operation when a modified line in the L3 cache is being victimized to make way for a new line an L3 cache read will be performed and the data read back from the L3 cache will be checked for the correctness of its ECC The data read back from L3 cache will be put in the writeback buffer as the staging area for the writeback operation If a single bit error is detected the L3_WDC bit will be set to log this error condition A disrupting ECC_error trap will be generated in this case providing that the CEEN bit is set in the Error Enable Register Hardware will proceed to correct the error Corrected data will be written out to memory through the system bus If both the L3_WDC and the L3_MECC bits are set it indicates that an address parity error has occurred L3_WDU For an L3 cache writeback operation an L3 cache read will be performed and the data read back from the L3 cache will be checked for the correctness of its ECC The data read back from L3 cache will be put in the writeback buffer as the staging area for the writeback operation When a multi bit error is detected it will be recognized as an uncorrectable error and the L3_WDU bit will be set to log this error condition A disrupting ECC_error trap will be generated in this case provided that the NCEEN bit is set in the Error Enable Register When the processor reads uncorr
197. ED_state 5 of 5 the LP_ID of the LP_ID of Unchanged the lowest the lowest numbered numbered enabled enabled logical logical processor as processor as indicated by indicated by ASI_CORE ASI_CORE ENABLE a ENABLE at time of time of deassertion of deassertion of reset reset I in bit Unchanged position of lowest I in bit enabled position of logical lowest processor as available specified by logical ASI_CORE_ processor ENABLE 0 in all before the other bits reset NOTE if the 0 in all system other bit controller positions changes NOTE if the ASI_CORE_ system ENABLE controller during the changes reset and ASI_CORE_ disables the ENABLE lowest logical during the processor it reset and disables the lowest logical processor it must update this register must update this register ASTI_CORE_RUNNING_STA TUS Equal to the Equal to the Not affected ASI_CORE ASI_CORE RUNNING RUNNING register register CMT ASI extensions PER logical processor Registers AST_CORE_ID ASI_INTR_ID ASI_SCRATCHPAD_n_REG Number of LPs Predefined value set by hardware LP ID Predefined value set by hardware Unknown Unchanged Unknown Unchanged 1 Processor states are only updated according to the following table if RED_state is entered because of a reset or a trap If RED_state is entered because the PSTATE RED bit was explicitly set to 1 then
198. E_SH All uncorrectable L2 cache tag ECC errors for fill tag update due to local bus transactions writeback will set AFSR TUE 158 amp Sun microsystems 7 2 1 4 de 7 2 1 6 Error Handling The response of the system to assertion of the ERROR output pin depends on the system but is expected to result in a reboot of the affected domain Error status can be read from the AFSR after a system reset event L2 cache Data ECC Errors e Hardware corrected HW_corrected L2 cache data ECC errors single bit ECC errors that are corrected by hardware e Software correctable SW_correctable L2 cache data ECC errors L2 cache ECC errors that require software intervention e Uncorrectable L2 cache data ECC errors multi bit ECC errors that are not correctable by hardware or software Depending on the operation accessing the L2 cache the full 64 byte line may be checked for data ECC errors or only a 32 byte portion representing either the lower or upper half of the line may be checked The cases where only 32 bytes are checked correspond to some reads from the L1 caches that only load 32 bytes Hardware Corrected L2 cache Data ECC Errors Hardware corrected ECC errors occur on single bit errors detected as the result of the following transactions W cache exclusive request accesses to the L2 cache to obtain the line and ownership from L2 cache AFSR EDC Reads of the L2 cache by the processor in order to perform a writeback t
199. E_SYND applies only to system bus L2 cache and L3 cache data ECC errors It is not updated for L2 cache tag ECC errors or L3 cache tag ECC errors AFSR_EXT Fields Bits 11 0 are sticky error bits that record the most recently detected errors Each sticky bit in AFSR1_EXT accumulates errors that have been detected since the last write to clear the bit Unless two errors are detected in the same clock cycle at most one of these bits can be set in AFSR2_EXT Clearing the AFSR and AFSR_EXT AFSR1 AFSR1_EXT must be explicitly cleared by software it is not cleared automatically by a read Writes to the AFSR1 RWIC bits 62 33 with particular bits set will clear the corresponding bits in both AFSR1 and AFSR2 Writes to the AFSR1_EXT RWIC bits 11 0 with particular bits set will clear the corresponding bits in both AFSR1_EXT and AFSR2_EXT Bits associated with disrupting traps must be cleared before re enabling interrupts by setting PSTATE IE to prevent multiple traps for the same error Writes to AFSR1 bits with particular bits clear will not affect the corresponding bits in either AFSR The syndrome fields are read only and writes to these fields are ignored Each of the AFSR2 bits 62 33 is cleared automatically when software clears the corresponding AFSR1 bit Each of the AFSR2_EXT bits 11 0 is cleared automatically when software clears the corresponding AFSR1_EXT bit AFSR2 AFSR2_EXT is read only AFSR2 only becomes available unfrozen
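A minimal sketch of the write-one-to-clear behavior described above follows. It assumes AFSR1 is reached at ASI 0x4C, VA 0x0 (the AFSR location documented earlier in this chapter; the value is not repeated in this fragment) and that %g2 holds a mask with only the sticky error bit(s) to be cleared set.

        stxa    %g2, [%g0] 0x4C        ! write 1s to the AFSR1 bits to be cleared;
                                       ! the matching AFSR2 bits clear automatically
        membar  #Sync                  ! complete the clear before re-enabling
                                       ! interrupts (PSTATE.IE)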
200. FSR DUE AFSR DTO AFSR DBERR AFSR IVU or AFSR IMU will be followed later by some unrecoverable event that requires process death The appropriate action by the handler for the disrupting trap is to log the event and return A later event will cause the right system recovery action to be taken The ECC_error disrupting trap is enabled by PSTATE IE PSTATE PIL has no effect on ECC_error traps Note To prevent multiple traps from the same error software should not re enable interrupts until after the disrupting error status bit in AFSR1 is cleared Multiple Traps See When Are Traps Taken on page 215 for a discussion on what happens when multiple traps occur at once Exceptions Traps and Trap Types 258 un microsystems 8 2 Exceptions Specific to the UltraSPARC IV Processor Note Floating Point Control Two state bits PSTATE PEF and FPRS FEF in the SPARC V9 architecture provide the means to disable direct floating point execution If either field is set to 0 an fp_disabled exception is taken when any floating point instruction is encountered Graphics instructions that use the floating point register file and instructions that read or update the Graphic Status Register GSR are treated as floating point instructions They cause an fp_disabled exception if either PSTATE PEF or FPRS FEF is zero See Graphics Status Register GSR ASR 19 the UltraSPARC III Cu Processor User s Manual for more information The exc
201. FSR_EXT during this trap handler which would generate a disrupting trap later if it were not cleared somewhere in this handler In this case the processor will write deliberately bad signalling ECC back to memory When the fast_ECC_error trap handler exits and retries the offending instruction the previously faulty line will be re fetched from main memory It will either be correct so the program will continue correctly or still contain an uncorrectable data ECC error in which case the processor will take a deferred instruction_access_error or data_access_error trap It is the responsibility of these later traps to perform the proper cleanup for the uncorrectable error The fast_ECC_error trap routine does not need to execute any complex cleanup operations Encountering a software correctable error while executing the software correctable trap routine is unlikely to be recoverable To avoid this three approaches are known 1 The software correctable exception handler code can be written normally in cacheable space If a single bit error exists in exception handler code in the L3 cache other single bit L3 cache data or tag errors will be unrecoverable To reduce the probability of this the software correctable exception handler code can be flushed from the L3 cache at the end of execution This solution does not cover cases where the L3 cache has a hard fault on a tag or data bit giving a software correctable error on every access 2 All
202. Fast initialize all SRAM in logical Ww ot processor 4116 ASI_CORE_AVAILABLE Bit mask ofimplemented logical Deeg processors Alig ASI_CORE_ENABLED ip ioe enabled logical R Shared processors Address Space Identifiers 281 un microsystems TABLE 10 2 The UltraSPARC IV processor ASI Extensions 2 of 5 Val ASI Name Suggested Macro Syntax Description E alue ame Suggested Macro Syntax escriptio Shared Alig ASL CORE_ ENABLE Bit mask of logical processors to Shared enable after next reset Alig ASI_XIR_STEERING 30 6 Cregs XIR steering register Shared Specify ID of which logical 4116 ASI_CMP_ERROR_STEERING processor to trap on non logical RW Shared processor specific error Bit mask to control which logical 4li ASI_CORE_RUNNING_RW processors are active and which RW Shared are parked 1 active 0 parked Bit mask of logical processors that 4116 ASI CORE RUNNING SIATUS are currently active 1 active R Shared 0 parked Alig ASI_CORE_RUNNING_WIS lo ical pro ess r EE tee write one to set bit s Alig ASI_CORE_RUNNING_WIC logical processor parking fegi ter w Shale write one to clear bit s 4216 ASI_DCACHE_INVALIDATE Se Invalidate diagnostic w Private 4316 ASIDCACHE_UTAG D cache mTag diagnostic access Private 4416 ASI_DCACHE_SNOOP_TAG D cache snoop tag RAM RW Private diagnostic access 4516 Reserved Peed for murite Private implementation
203. I Groupings on page 277 e ASI Assignments on page 280 10 1 I TSB ASI Groupings Internal ASIs also called non translating ASIs are in the ranges of 3016 to 6F 16 7216 to 7716 and 7A 6 to Pie These ASIs are not translated by the MMU Instead these ASIs pass through their virtual addresses as physical addresses Note Access to internal ASIs with invalid virtual addresses have undefined behavior Invalid virtual addresses may or may not cause a data_access_exception trap and may or may not alias onto a valid virtual address Software should not rely on any specific behavior Address Space Identifiers 277 un microsystems 10 1 1 Fast Internal ASIs In the UltraSPARC IV processor internal ASIs are further classified into either fast or regular ASIs Fast internal ASIs have a 3 cycle read latency same as the regular load latency for a D cache hit Data for fast internal ASIs are returned in the M stage of the pipeline without recirculating Regular internal ASIs on the other hand have a longer read latency approximately 20 cycles The regular internal ASIs always recirculate once and return the data in the M stage of the recirculated instruction The latency of the regular internal ASIs depends on the ASI s address The ASIs listed in TABLE 10 1 are implemented as fast ASIs in the UltraSPARC IV processor The balance of the internal ASIs are all regular ASIs TABLE 10 1 Fast Internal ASIs Value ASI Name Suggested Macr
204. L3_EDU No 8 f No action Disrupting trap EE moved from L3 in P cache i rear cache to L2 cache mes e P cache for several writes 2 22 request with Ge Ee E E CE in the critical 32 byte or the 2nd 32 byte L3_EDC No 8 7 No action Disrupting trap moved from L3 in P cache L3 cache data cache to L2 cache mes F P cache for several writes 2 22 request with GE Se Ee UE in the critical 32 byte or the 2nd 32 L3_EDU No 8 No action Disrupting trap byt L3 cache dat moved from L3 in P cache PE ERE EE cache to L2 cache sees e P cache for one write 3 23 request with CE SE E in the critical 32 byte or the 2nd 32 byte L3_EDC No 8 z No action Disrupting trap EE moved from L3 in P cache een cache to L2 cache me F P cache for one write 3 23 request with UE EEN Bad datadoi in the critical 32 byte or the 2nd 32 byte L3_EDU No 8 f No action Disrupting trap E3 cache dat moved from L3 in P cache EE cache to L2 cache mes e P cache for instruction 17 request with CE ee Gena datanot in the critical 32 byte or the 2nd 32 byte L3_EDC No 8 No action Disrupting trap 227 un microsystems TABLE 7 20 L3 cache Data CE and UE errors 4 of 4 Error fast ecc Event logged in error L2 cache data L1 cache data AFSR trap Pipeline 5 Comment Action Original data original ecc Bad data not moved from L3 in P cache cache to L2 cache P cache for instruction 17 request with UE in the critical 32 byte or the 2nd 32 byte L3_EDU No L3 cache
205. LB even if the entry is locked The entry that gets updated depends on the state of the Tag Access Extension Register LFSR bits and the TTE page size in the store data If a nested fast_instruction_access_MMU_miss happens the I TLB Data in register will not work Data Access Register ASI 5516 VA 63 0 0016 20FF816 Name ASI_ITLB DATA ACCESS DEG Access RW Virtual Address Format The virtual address format of the I TLB Data Access register is described in TABLE 13 8 TABLE 13 8 I TLB Data Access Register Virtual Address Format Description Bit Field Description 63 19 Reserved Reserved for future implementation 18 Mandatory value Should be 0 The TLB to access as defined below 17 16 TLB ID 0 T16 2 T512 15 12 Reserved Reserved for future implementation The TLB entry number to be accessed in the range 0 511 Not all TLBs will have all 512 entries All TLBs regardless of size are accessed from 0 to N 1 where N is the number of entries in the TLB For the T512 bit 11 11 3 TLB Entry is used to select either way0 or way1 and bit 10 3 is used to access the specified index For the T16 only bit 6 3 is used to access one of 16 entries 2 0 Mandatory value Should be 0 Z Z Data Format The Data Access Register uses the TTE data format with the addition of parity information in the T512 as described in TABLE 13 5 Instruction and Data Memory Management Unit 314
206. MMU fast_instruction_access_MMU l 1 e instruction_access_exception on retry if accessing T512 and the other way has the parity error S fast_instruction_access_MMU Se 1 l 1 instruction_access_exception on retry if accessing T512 and the other way has the parity error 1 1 0 x instruction_access_exception 1 1 0 1 x 1 instruction_access_exception 1 1 1 0 1 x no trap taken 1 1 1 0 x 1 no trap taken Demap x x x x X 1 no trap taken NOTE An x in the table represents don t cares 13 1 3 I TLB Automatic Replacement The I TLB miss fast trap handler utilizes the automatic hardware replacement write using store ASI_ITLB_DATA_IN_REG When an I TLB misses an instruction_access_exception or protection trap is detected with the hardware automatically saving the missing VA and context to the Tag Access Register AST_IMMU_TAG_ACCESS To facilitate indexing of the T512 when the TTE data is presented via STXA ASI_ITLB_DATA_IN_REG the missing page size information of the T512 is captured into a new extension register called AST_IMMU_TAG_ACCESS_EXT which is described in J TLB Tag Access Extension Register on page 313 The hardware I TLB replacement algorithm is as follows T Note PgSz below is AST_IMMU_TAG_ACC S nN _EXT 24 22 bits Instruction and Data Memory Management Unit 306 amp S
207. Mandatory value Should be 0 A 9 bit ECC entry that protects L3 cache tag and state The ECC bits protect all 4 ways 14 6 ECC of a given L3 cache index A 3 bit entry that records the access history of the 4 ways of a given L3 cache index If EC_split_en in the L3 cache control register ASI 7516 is not set the LRU is as described below The LRU pointed way will not be picked for replacement if NA is the state LRU 2 0 000way0 is the LRU LRU 2 0 001way 1 is the LRU LRU 2 0 010way0 is the LRU LRU 2 0 011way1 is the LRU LRU 2 0 100way2 is the LRU LRU 2 0 101way2 is the LRU LRU 2 0 110way3 is the LRU 5 3 LRU LRU 2 0 111way3 is the LRU If EC_split_en is set the LRU is as described below LRU 2 is ignored and logical processor ID of the logical processor which triggers the request is used LP LRU 1 0 000way0 is the LRU LP LRU 1 0 001wayl is the LRU LP LRU 1 0 010way0 is the LRU LP LRU 1 0 011lwayl is the LRU LP LRU 1 0 100way2 is the LRU LP LRU 1 0 101 way2 is the LRU LP LRU 1 0 110way3 is the LRU LP LRU 1 0 111way3 is the LRU 3 bit L3 cache state field The bits are encoded as follows state 2 0 000 Invalid state 2 0 001 Shared state 2 0 010 Exclusive 2 0 EC_state state 2 0 011 Owner state 2 0 100 Modified state 2 0 101 NA Not Available see Notes on NA Cache State on page 64 state 2 0 110 Owne
208. N A cache block load buffer for s SE E the request is block load and writes 64 byte dropped data to L2 cache L3 cache to L2 cache fill request with L2 cache tag CE L2 Pipe Disrupting trap forwards the critical 32 byte THCE No No Ets N A h ti to D cache for atomic request t EES and writes 64 byte data to L2 a8 retried cache L3 cache to L2 cache fill request with L2 cache tag UE Disrupting trap forwards the critical 32 byte TUE N y Original t N A i to D cache for atomic request S hee the request is and writes 64 byte data to L2 dropped cache L3 cache to L2 cache fill request with L2 cache tag CE forwards the critical 32 byte L2 Pipe Disrupting trap and then 2nd 32 byte to P THCE No No corrects the N A the request is cache for Prefetch 0 1 2 3 tag retried request and writes 64 byte data to L2 cache 232 un microsystems Error Handling TABLE 7 23 1L2 cache Tag CE and UE errors 4 of 7 Errors flag L1 S fast Error 5 Event logged in SE Pin L2 cache Tag cache D Comment AFSR data error L3 cache to L2 cache fill request with L2 cache tag UE forwards the critical 32 byte Disrupting trap and then 2nd 32 byte to P TUE No Yes Original tag N A the request is cache for Prefetch 0 1 2 3 dropped request and writes 64 byte data to L2 cache L3 cache to L2 cache fill request with L2 cache tag CE forwards the critical 32 byte L2 Pipe Disrupting trap and then 2nd 32
209. ND and AFAR being captured and disrupting trap being taken System Bus DSTAT 2 or 3 Errors A DSTAT 2 or 3 may be returned in response to a system bus read operation In this case the processor handles the event in the same way as specified in the section titled Uncorrectable System Bus Data ECC Errors on page 171 except for the following differences 1 For a system bus read from memory or I O caused by an instruction fetch load block load or atomic operation the AFSR BERR bit is set instead of AFSR UE 2 For a system bus read from memory or I O caused by prefetch queue or a system bus read from memory causes by read to own store queue operation the AFSR DBERR bit is set instead of AFSR DUE 3 The BERR or DBERR AFSR and AFAR overwrite priorities are used rather than the UE or DUE priorities 4 Data bits 1 0 of each of the four 128 bit correction words written to the L2 cache are inverted to create signalling ECC if the access is cacheable 5 For AFSR BERR a deferred instruction_access_error or data_access_error trap is generated 6 For AFSR DBERR a disrupting ECC_error trap is generated The processor treats both Sun Fireplane Interconnect termination code DSTAT 2 time out error and DSTAT 3 bus error as the same event Both will cause AFSR BERR or AFSR DBERR to be set and cause the same signalling ECC to be sent to the L2 cache These conditions are checked on both cacheable and non cach
210. Name ASI_SRAM_FAST_INIT_SHARED Description This ASI is used only with STXA instruction to quickly initialize the on chip SRAM structures that are shared between the two logical processors in the CMT device A single STXA ASI Aye will cause the store data value to be written in all shared on chip SRAM entries of L2 cache Data and Tag arrays L3 cache Tag array Usage stxa g0 g0 AST_SRAM FAST_INIT_SHARED The 64 bit store data must be zero Initializing the SRAMs to non zero value could have unwanted side effects This STXA instruction must be surrounded preceded and followed by MEMBAR Sync to guarantee that e All Sun Fireplane Interconnect transactions have completed before the SRAM initialization begins e SRAM initialization fully completes before proceeding Note Only one logical processor should issue AST_SRAM_FAST_INIT_SHARED The other logical processor must be parked or disabled During the SRAM initialization caches and TLBs are considered unusable and incoherent The AST_SRAM_FAST_INIT_SHARED instruction should be located in non cacheable address space ASI_SRAM_FAST_INIT_SHARED will be issued at the default L2L3 arbitration throughput which is once every two cycles regardless of the L2L3arb_single_issue_en in L2 cache control register ASI 6D 6 or WE in DCUCR ASI 4546 Caches Cache Coherency and Diagnostics 79 microsys
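Expanding the usage line above into the MEMBAR-protected sequence the text requires, a minimal sketch is shown below; it assumes ASI_SRAM_FAST_INIT_SHARED has been defined for the assembler with the numeric ASI value listed in TABLE 10-2.

        membar  #Sync                  ! all prior Sun Fireplane transactions complete
        stxa    %g0, [%g0] ASI_SRAM_FAST_INIT_SHARED   ! store data must be zero
        membar  #Sync                  ! wait for the SRAM initialization to finish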
211. No requests in the Store Queue Note A D cache load is considered finished if the D cache has received the data Name AST_CORE_RUNNING_STATUS ASI 0x41 VA 63 0 0x58 Privileged Read Only TABLE 2 8 LP Running Status Register Shared 63 2 Mandatory value Should be 0 1 LP 1 This bit represents LP 1 0 LP 0 This bit represents LP 0 As shown in TABLE 2 8 the LP Running Status register is a 64 bit register Each bit of the register represents a logical processor with bit 0 representing LP 0 and bit 1 representing LP 1 For any bit set to 1 in the LP Running register the corresponding bit needs to be 1 in the LP Running Status register Note For one suspend command to a logical processor the corresponding bit of the specified logical processor in the LP Running Status register will have only one transition from 1 to 0 The LP Enable LP Running and LP Running Status registers are mainly used to support debug and diagnostics The LP Running register is also used to support booting Chip Multithreading CMT 19 amp Sun microsystems State After Reset The value of the LP Running Status register is the same as the value of the LP Running register at the end of a system reset 23 sel 29 2 2 5 3 Reset Handling Each reset is handled differently in a CMT processor Some resets apply to all the logical processors some apply to an individual logical processor
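As an illustration of the register just described, the following privileged sketch reads the LP Running Status register at ASI 0x41, VA 0x58 and tests the LP 1 bit; the register choices are arbitrary.

        mov     0x58, %g1
        ldxa    [%g1] 0x41, %g2        ! LP Running Status register (read only)
        andcc   %g2, 0x2, %g0          ! bit 1 set means LP 1 is currently active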
212. IPB data parity error is determined on a per-instruction-fetch-group granularity. If a fetch group is used in the pipe from the IPB and any of the instructions in the fetch group has a parity error, an icache_parity_error trap will be taken on that fetch group, irrespective of whether the instruction is executed or not. When the trap is taken, TPC will point to the beginning instruction of the fetch group. If an I-fetch misses the IPB, an icache_parity_error trap will not be generated.

7.2.6 P-cache Data Parity Errors
Parity error protection is provided for the P-cache data array only. The P-cache snoop tag array, virtual tag array, and status array are not parity error protected. A P-cache Data Parity trap is taken only when a floating-point load hits the P-cache and there is a parity error associated with that extended word. Software or hardware prefetch instructions do not generate a P-cache Data parity trap. Parity error checking in the P-cache data array is enabled by DCR.PPE in the dispatch control register. When this bit is 0, parity will still be correctly generated and installed in the P-cache data array, but will not be checked.
Note: P-cache data parity errors are not checked while DCUCR.PE is 0.
P-cache Error Recovery Actions
A P-cache data parity error is reported the same way (same trap type and precise trap timing) as a D-cache Data Array Pari
213. ...PSTATE.PIL is less than the pending interrupt level, the processor automatically disables its floating-point unit by clearing PSTATE.PEF. If the next instruction executed is an FGA or FGM pipe instruction, a precise fp_disabled trap will be taken. This has the side effect of clearing PSTATE.IE, so the disrupting traps can still not be taken. However, the fp_disabled trap routine is specially arranged to first set PSTATE.PEF again, then set PSTATE.IE again, then execute a BR, MS, A0, or A1 pipe instruction. When this occurs, the interrupt_vector or interrupt_level_n (n = 1-15) trap routine is executed. Eventually the trap routine executes a RETRY instruction to retry the faulted floating-point operation.
If the next instruction after the processor clears PSTATE.PEF is not an FGA or FGM pipe instruction, then it must be a BR, MS, A0, or A1 pipe instruction, and the disrupting trap routine can be executed anyway. In this case the next floating-point operation will trigger an fp_disabled trap, which results in the floating-point unit being enabled again.
Automatic clearing of PSTATE.PEF is not triggered by pending disrupting ECC_error traps. The ECC_error trap routine can be indefinitely delayed if the processor is handling only FGA and FGM pipe instructions.

Error Barriers
A MEMBAR #Sync instruction causes the processor to wait until all system bus reads are complete and the store queue is empty before continuing. Stores will have completed any syste
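A minimal, hedged sketch of using MEMBAR #Sync as an error barrier follows; the surrounding store and the registers used are illustrative only and are not taken from this manual.

        stx     %l0, [%l1]              ! last store of the region to be isolated
        membar  #Sync                   ! wait for system bus reads and store-queue drain
        ! any deferred error from the accesses above is now attributable to them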
214. Prediction of overflow, underflow, and inexact traps is used in the hardware. Prediction provides correct results when possible and generates an exception when it is not possible. Prediction of an inexact value never occurs if one of the operands is a zero, NaN (Not a Number), or infinity. When an inexact prediction occurs and the exception is enabled, system software will properly handle these cases and resume program execution. If the exception is not enabled, the result status is used to update the FSR.AEXC and FSR.CEXC fields of the FSR register (IEEE 754-1985 Standard).

6.2 Floating Point Numbers
Floating-point number types and their abbreviations are shown in TABLE 6-2. In general, the IEEE 754-1985 Standard reserves exponent field values of all 0s and all 1s to represent special values in the standard's floating-point scheme.
TABLE 6-2 Floating Point Numbers (Data Representation)
Zero (0): sign 0 or 1; exponent 000...000; fraction 000...000.
Subnormal (SbN): sign 0 or 1; exponent 000...000; fraction 000...001 to 111...111.
Normal (Normal): sign 0 or 1; exponent 000...001 to 111...110; fraction 000...000 to 111...111.
Infinity (Infinity): sign 0 or 1; exponent 111...111; fraction 000...000.
Signaling NaN (SNaN): sign 0 or 1; exponent 111...111; fraction 0xx...xxx (nonzero).
Quiet NaN (QNaN): sign 0 or 1; exponent 111...111; fraction 1xx...xxx.

6.2.1 Zero
Zero is not directly representable if the straight format is followed. This limitation is due to the assumption of a leading 1. To allow the num
215. (TABLE 7-18, L2 cache data CE and UE errors, 2 of 3, continued)
(...): UCU; data, original.
I-cache fill request with UE in the non-critical 32 bytes (the 2nd 32 bytes) of L2 cache data, but this data never gets used.
D-cache 32-byte load request with CE in the critical 32 bytes of L2 cache data: UCC; original data, original ECC.
D-cache 32-byte load request with UE in the critical 32 bytes of L2 cache data: UCU; original data, original ECC.
D-cache 32-byte load request with CE in the non-critical 32 bytes of L2 cache data: EDC; original data, original ECC.
D-cache 32-byte load request with UE in the non-critical 32 bytes of L2 cache data: EDU; original data, original ECC.
D-cache FP 64-bit load request with CE in the critical 32 bytes of L2 cache data: UCC; original data, original ECC.
D-cache FP 64-bit load request with UE in the critical 32 bytes of L2 cache data: UCU; original data, original ECC.
D-cache FP 64-bit load request with CE in the 2nd 32 bytes of L2 cache data: EDC; original data, original ECC.
D-cache FP 64-bit load request with UE in the 2nd 32 bytes of L2 cache data: EDU; original data, original ECC.
D-cache block load request with CE in the 1st or the 2nd 32 bytes of L2 cache data: EDC; original data, original ECC.
D-cache block load request with UE in the 1st or the 2nd 32 bytes of L2 cache data: EDU; original data, original ECC.
D-cache atomic request with CE in the critical 32 bytes of L2 cache data: UCC; original data, original ECC.
D-cache atomic request with UE in the critical 32 bytes of L2 cache data: UCU; original data, original ECC
217. The IERR status bit indicates that an event has been detected which is likely to have its source inside the processor reporting the problem. The PERR status bit indicates that the error source may well be elsewhere in the system, not in the processor reporting the problem. However, this differentiation cannot be perfect; these are merely likely outcomes. Further error recording elsewhere in the system is desirable for accurate diagnosis.
Bits 62:54 and 49:33 are sticky error bits that record the most recently detected errors. Each sticky bit in AFSR1 accumulates errors that have been detected since the last write to clear the bit. Unless two errors in AFSR or AFSR_EXT are detected in the same clock cycle, at most one of these bits can be set in AFSR2 and AFSR2_EXT.
Bits 19:16 contain the data microtag ECC syndrome captured on a system bus microtag ECC error. The syndrome field captures the status of the first occurrence of the highest-priority error according to the M_SYND overwrite policy. After the AFSR1 sticky bit corresponding to the error for which the M_SYND is reported is cleared, the contents of the M_SYND field will be cleared.
Bits 8:0 contain the data ECC syndrome. The syndrome field captures the status of the first occurrence of the highest-priority error according to the E_SYND overwrite policy. After the AFSR sticky bit corresponding to the error for which the E_SYND is reported is cleared, the contents of the E_SYND field will be cleared
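A hedged sketch of capturing and clearing the sticky AFSR bits follows. The ASI encoding shown for the AFSR (ASI 0x4C, VA 0) is an assumption taken from the UltraSPARC III family register layout and should be checked against the AFSR register description in this manual; the write-ones-to-clear behavior is likewise assumed from that description.

        ldxa    [%g0] 0x4C, %l0         ! ASI_AFSR (assumed encoding): sticky bits, E_SYND<8:0>, M_SYND<19:16>
        ! ... log %l0 for diagnosis ...
        stxa    %l0, [%g0] 0x4C         ! write back 1s to clear the bits just observed (assumed W1C)
        membar  #Sync                   ! MEMBAR #Sync after ASI stores to error registers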
218. ..._SH bit, the L3_TUE bit, or the L3_TUE_SH bit in the AFSR. When one of the fatal error bits in the AFSR is set, the processor will assert its error pin for 8 consecutive system cycles. For information on the AFSR, please refer to AFSR Register and AFSR_EXT Register on page 180. The Asynchronous Fault Status Register and the Error Status and Error Mask registers report IERR and PERR errors.

Fatal Error (FERR)
It is usually impossible to recover a domain which suffers a system snoop request parity error, an invalid coherence state, an L2 cache tag uncorrectable error, an L3 cache tag uncorrectable error, a system interface protocol error, or an internal error at the processor level. When these errors occur, the normal recovery mechanism is to reset the coherence domain of the affected processor. When one of these fatal errors is detected by a processor, the processor asserts its ERROR output pin. The response of the system when an ERROR pin is asserted depends on the system design.
Since the AFSR is not reset by a system reset event, error logging information is preserved. The system can generate a domain reset in response to assertion of an ERROR pin, and software can then examine the system registers to determine that the reset was due to an FERR. The AFSR of all processors can be read to determine the source and cause of the FERR.
Most errors which lead to FERR do not cause any special processor behavior. However, an uncorrectable error in the MTags, L2 cache t
219. (TABLE 3-6, Snoop Output and DTag Transition, 2 of 3, continued)
foreign RTSM: dO 1 dS, fRTSM copyback; dT 0 dS; dS 0 dS, none.
foreign RTSU: dO 1 dS, fRTSU copyback; dT 0 dS, none.
foreign RTOU: dS 0 dI, invalidate; fRTOU copyback; dO 1 dI, invalidate; dI 0 dI, none; dS 0 dS, none.
foreign UGM / foreign RS: copyback; dO 1 dO, discard; dT 0 0 dT, none.

TABLE 3-6 Snoop Output and DTag Transition (3 of 3)
Columns: Snooped Request; DTag State; Shared Output; Owned Output; Error Output; Action for Snoop Pipeline.
own RTSR (issued by SSM device): dI 0, wait data; dS, wait data; dO 1 dO, Error; dT 1 dT, wait data, Error.
foreign RTSR: dI 0 0 dI, none; dS 1 0 dS, none; dO 1 1 dS.
own RTOR (issued by SSM device): dI dO, wait data; dS dO, wait data; dO dO, wait data.
foreign RTOR: dI dI, none; dS dI, invalidate; dO dI, invalidate.
own RSR: dI dI, wait data; dS 1 dS, Error; dO 1 dO, Error.
foreign RSR: dI dI, none; dS dS, none; dO dO, none.
own WS: dI dI; dS dI, own invalidate WS; dO dI, own invalidate WS.
foreign WS: dI dI, none; dS dI, invalidate; dO dI, invalidate.
NOTE: A blank entry in the Error Output column indicates that there is no corresponding error output.

TABLE 3-7 summarizes the Snoop in
220. ...SSM mode (by setting the SSM bit in the Sun Fireplane Interconnect Configuration register), system bus microtag ECC is checked.
All four microtag values associated with a 64-byte system bus read are checked for ECC correctness. Uncorrectable errors in microtags arriving at the processor from the system bus are not normally recoverable. When the processor detects one of these errors, it will assert its ERROR output pin. The response of the system to the assertion of the ERROR output pin is system dependent, but will usually result in the reset of all the chips in the affected coherence domain.
In addition to asserting its ERROR output pin, the processor will take a trap if the NCEEN bit is set in the Error Enable Register. This will be an instruction_access_error disrupting trap if the error is the result of an instruction fetch, or a data_access_error disrupting trap for any other operation. Whether the trap taken has any effect or meaning will depend on the system's response to the processor ERROR output pin. The effect of an uncorrectable microtag ECC error on the L2 cache state is undefined.
System bus microtag ECC is checked on interrupt vector fetch operations and on read accesses to uncacheable space, even though the microtag has little meaning for these. An uncorrectable error in a microtag will still result in the ERROR output pin being asserted, AFSR.IMU being set, M_SY
221. A DSTAT 2 or 3 response to a prefetch queue or store queue read operation from user code can cause a disrupting trap with AFSR.PRIV = 1 after the user code has trapped into system space. It happens that the detailed timing behavior of the processor prevents the same anomaly with load or atomic operations: uncorrectable errors on these user operations will always present a deferred data_access_error trap with AFSR.PRIV = 0.
It is possible to use a MEMBAR #Sync instruction near the front of all trap routines, and to handle specially deferred traps that occur there, to properly isolate user-space hardware faults from system-space faults. However, the cost in execution time for the error-free case is significant, and programmers may decide that the additional small gain in possible system availability is not worth the cost in throughput. If there is no MEMBAR #Sync, the effect will be that a very small fraction of errors that might perhaps have terminated only one user process will instead result in a reboot of the affected coherence domain.
Neither traps nor explicit MEMBAR #Sync instructions provide a barrier for prefetch queue operations. Software PREFETCH instructions which are executed before a trap or MEMBAR #Sync can plant an operation in the prefetch queue. This operation can cause system bus activity after the trap or MEMBAR #Sync. In the UltraSPARC IV processor
222. ...See the UltraSPARC III Cu Processor User's Manual.
Bit 59, IE: See the UltraSPARC III Cu Processor User's Manual.
Bits 58:50, Soft2: See the UltraSPARC III Cu Processor User's Manual.
Bit 49, Reserved: Reserved for future implementation.

TABLE 13-5 Translation Table Entry (TTE) (2 of 2)
Bit 48, Size[2]: Bit 48 is the most significant bit of the page size and is concatenated with bits 62:61. Size[2] is always 0 for the I-TLB.
Bits 47:43, Reserved: Reserved for future implementation.
Bits 42:13, PA: The physical page number. Page offset bits for larger page sizes (PA<15:13>, PA<18:13>, and PA<21:13> for 64 KB, 512 KB, and 4 MB pages, respectively) are stored in the TLB and returned for a Data Access read, but are ignored during normal translation. When page offset bits for larger page sizes are stored in the TLB on the UltraSPARC IV processor, the data returned from those fields by a Data Access read are the data previously written to them.
Bits 12:7, Soft: See the UltraSPARC III Cu Processor User's Manual.
Bit 6, L; Bit 5, CP; Bit 4, CV; Bit 3, E: If the lock bit (L) is set, then the TTE entry will be locked down when it is loaded into the TLB; that is, if this entry is valid, it will not be replaced by the automatic replacement algorithm invoked by an ASI store to the Dat
224. TABLE 3-64 L3 cache Control Register Access Data Format (3 of 3)
Bits 28, 14:13 (L3 cache size): The size specified here affects the size of the EC_addr field in the L3 cache Data Register. These bits are hardwired to 3'b100 in the UltraSPARC IV processor.
  000: 1 MB L3 cache size (not supported)
  001: 4 MB L3 cache size (not supported)
  010: 8 MB L3 cache size (not supported)
  011: 16 MB L3 cache size (not supported)
  100: 32 MB L3 cache size (hardwired to 3'b100)
  101, 110, 111: unused
Bits 34, 27, 12:11, EC_clock (L3 cache clock ratio):
  0000: 3:1 L3 cache clock ratio (not supported)
  0001: 4:1 L3 cache clock ratio (not supported)
  0010: 5:1 L3 cache clock ratio
  0011: 6:1 L3 cache clock ratio
  0100: 7:1 L3 cache clock ratio
  0101: 8:1 L3 cache clock ratio (POR value)
  0110: 9:1 L3 cache clock ratio
  0111: 10:1 L3 cache clock ratio
  1000: 11:1 L3 cache clock ratio
  1001: 12:1 L3 cache clock ratio
  1010, 1011, 1100, 1101: unused
225. ...The context is set to 11 (binary) when the access does not have a translating ASI.
Bit 3, PR: Privilege bit. Set if the faulting access occurred while in privileged mode. This field is valid for all traps in which the FV bit is set.
Bit 2, W: Write bit. Hardwired to zero in the I-SFSR.
Bit 1, OW: Overwrite bit. When the I-MMU detects a fault, the Overwrite bit is set to 1 if the FV bit has not been cleared from a previous fault; otherwise it is set to 0.
Bit 0, FV: Fault Valid bit (RW). Set when the I-MMU detects a fault; it is cleared only on an explicit ASI write of 0 to the SFSR. When the FV bit is not set, the values of the remaining fields in the SFSR and SFAR are undefined for traps.

TABLE 13-12 FT<5:0>
I-MMU, FT<5:0> = 0x01: Privilege violation.
I-MMU, FT<5:0> = 0x20: I-TLB tag or data parity error.
Note: FT<4:1> are hardwired to zero; bit 18 and bits 2:0 are set to 1 and 0, respectively.

Data Format
The Diagnostic Register format for the T16 is described below in TABLE 13-13. Three new bits are added to the UltraSPARC IV processor in the T512 Diagnostic Register, described in TABLE 13-14.
An ASI store to the I-TLB Diagnostic Register initiates an internal atomic write to the specified TLB entry. The Tag portion is obtained from the Tag Access Register, while the data portion is obtained from the to-be-stored data.
TABLE 13-13 I-TLB Diagnostic Register
226. The count does not include local RTO_nodata and re-issued local RTO (i.e., RTOR) transactions.

5.12.2 Software Events
Software statistics are collected through the counter listed in TABLE 5-16. This is a private counter.
TABLE 5-16 Counters for Software Statistics
SW_count_NOP (PICL, PICU): Number of retired, non-annulled, special software NOP instructions, which is equivalent to the sethi %hi(0xfc000), %g0 instruction.

5.12.3 Floating Point Operation Events
Floating-point operation statistics are collected through the counters listed in TABLE 5-17. These are private counters.
TABLE 5-17 Counters for Floating Point Operation Statistics
FA_pipe_completion (PICL): Number of retired instructions that complete execution on the Floating Point/Graphics ALU pipeline.
FM_pipe_completion (PICU): Number of retired instructions that complete execution on the Floating Point/Graphics Multiply pipeline.

5.13 PCR.SL and PCR.SU Encoding
TABLE 5-18 lists the PCR.SL selection bit field encoding for the PICL counters as well as the PCR.SU encoding for the PICU counters.
TABLE 5-18 PCR.SU and PCR.SL Selection Bit Field Encoding (1 of 2)
PCR.SU value 000000: Cycle_cnt (PICU); PCR.SL value 000000: Cycle_cnt (PICL).
PCR.SU value 000001: Instr_cnt (PICU); PCR.SL value 000001: Ins
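As a small illustration, the special software NOP named above can be planted in code paths of interest so that SW_count_NOP based profiling counts how often those paths retire; the surrounding code is hypothetical.

        sethi   %hi(0xfc000), %g0       ! architectural NOP; each retired occurrence increments SW_count_NOP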
227. ...AFSR.PRIV accumulates the privilege status of the specified error events detected since the last time the AFSR.PRIV bit was cleared.
ME: A 1 in the set-ME column implies that the specified event will cause the AFSR.ME bit to be set if the status bit specified for that event is already set at the time the event happens. A 0 in the set-ME column implies that multiple events will not cause the AFSR.ME bit to be set. AFSR.ME accumulates the multiple-error status of the specified error events detected since the last time the AFSR.ME bit was cleared.
Flushing: The flush-needed column contains a 0 if a D-cache flush is never needed for correctness. It contains a 1 if a D-cache flush is needed only if the read access is to a cacheable address. It contains a 2 if a D-cache flush is always needed. It contains a 3 if an I-cache, D-cache, and P-cache flush is needed. Note that for some of these errors an L2 cache or L3 cache flush, or a main memory update, is desirable to eliminate errors still stored in the L2 cache, L3 cache, or DRAM. However, the system does not need these to ensure that the data stored in the caches does not lead to undetected data corruption; the entries in the table only deal with data correctness. D-cache flushes should not be needed for instruction_access_error traps, but it is simplest to invalidate the D-cache for both instruction_access_error and data_access_error traps.
Shared/Private: The Shared/Private column specifies i
228. ...When a store instruction misses the W cache and hits the L2 cache in E or M state, the store queue will issue an exclusive request to the L2 cache. The exclusive request will perform an L2 cache read, and the data read back from the L2 cache SRAM will be checked for the correctness of its ECC. If an uncorrectable ECC error is detected in the non-critical 32-byte data, the EDU bit will be set to log this error condition. A disrupting ECC_error trap will be generated in this case, provided that the NCEEN bit is set in the Error Enable Register. When the L2 cache data is delivered to the W cache, the associated UE information will be stored with the data. When the W cache evicts a line with UE data information, it will generate ECC based on the data stored in the W cache; then ECC check bits C<1:0> of the 128-bit word are inverted before the word is scrubbed back to the L2 cache.
If an L2 cache uncorrectable error is detected as the result of a store queue exclusive request, an atomic request, a block load, or a prefetch queue operation, and the AFSR.EDU status bit is already set, AFSR1.ME will be set.

WDC
For an L2 cache writeback operation, when a line which is in either clean or dirty state in the L2 cache is being victimized to make way for a new line, an L2 cache read will be performed and the data read back from the L2 cache SRAM will be checked for the correctness of its ECC. The data read back from the L2 cache will be put in the writeback buffer as the staging area for the writeback oper
229. ...For a W cache exclusive request, if an uncorrectable error occurs in the requested line, a disrupting ECC_error trap is generated (if the NCEEN bit is set in the Error Enable Register) and the L2 cache sends the 64-byte data to the W cache together with the uncorrectable data error information. The W cache stores the uncorrectable data error information so that deliberately bad signalling ECC is scrubbed back to the L2 cache: correct ECC is computed for the corrupt merged data, then ECC check bits 0 and 1 are inverted in the check word scrubbed to the L2 cache.
The W cache sends out data for an L2 cache writeback or copyout event if it has the latest modified data, rather than the data being supplied through eviction from the W cache and update of the L2 cache. For writeback or copyout, AFSR.WDU or AFSR.CPU is set, and not AFSR.EDU (which only occurs on W cache exclusive requests and prefetch queue operations). A copyout operation which happens to hit in the processor writeback buffer sets AFSR.WDU, not AFSR.CPU.

L3 cache Errors
L3 cache tags are on-chip, and data held in the external RAMs are covered by ECC. L3 cache errors can be recovered by software or hardware measures. Information on errors detected is logged in the AFSR_EXT and AFAR. L3 cache address parity errors are also detected and reported as an L3 cache uncorrectable or correctable data error with AFSR_EXT.L3_MECC set.
L3 cache Tag ECC Errors
The types of L3 cache tag ECC errors are
231. MC_stalls_2_sh (PICL): The same as above, for bank 2.
MC_stalls_3_sh (PICU): The same as above, for bank 3.

5.11 Data Locality Counters for Scalable Shared Memory Systems
Data locality performance event counters in the UltraSPARC IV processor improve the ability to monitor and exploit performance in Scalable Shared Memory systems, where multiprocessor system clusters using a Shared Memory Protocol are tied to other clusters using a fabric interconnect utilizing the Scalable Shared Memory (SSM) architecture. SSM data locality counters are listed in TABLE 5-13. The SSM_new_transaction_sh counter is shared by both cores, while the other SSM counters count private events.
TABLE 5-13 SSM Data Locality Counters
SSM_new_transaction_sh (PICL): Number of new SSM transactions (RTSU, RTOU, UGM) observed by this processor on the Fireplane Interconnect (Safari bus).
SSM_L3_wb_remote (PICL): Number of L3 cache line victimizations from this core which generate R_WB transactions to the non-LPA (remote physical address) region.
SSM_L3_miss_local (PICL): Number of L3 cache misses to LPA (local physical address) from this core which generate an RTS, RTO, or RS transaction.
Number of L3 cache misses to LPA (local physical address) from this core which generate retry (R_) transactions, including R_RTS, R_RTO, and R_RS: SSM
233. ..._L3_miss_mtag_remote (PICL, PICU).
SSM_L3_miss_remote (PICU): Number of L3 cache misses from this core which generate retry (R_) transactions to non-LPA (non-local physical address) address space, or R_WS transactions due to block store (BST) or block store commit (BSTC) to any address space (LPA or non-LPA), or R_RTO due to an atomic request on Os state to LPA space. Note that this counter counts more than just remote misses as defined above. To determine the actual number of remote misses, use L3_miss minus SSM_L3_miss_local.

5.11.1 Scalable Shared Memory Systems
Typically four to six local processors are in a system cluster and have their own local memory subsystems. They use a Shared Memory Protocol to maintain data coherency among themselves. Data coherency is maintained between system clusters using a directory-based Scalable Shared Memory data coherency mechanism to ensure data coherency across systems with a large number of processors. The data locality event counters are only valid for Scalable Shared Memory system architectures.

5.11.2 Event Tree
The SSM data locality event counters are illustrated in FIGURE 5-4.
FIGURE 5-4 SSM Performance Counter Event Tree (figure: all L3 cache misses, from loads, block stores (BST), and block store commits, broken down into SSM_L3_miss_local (LPA, local processor physical address), SSM_L3_miss_mtag_remote, SSM_L3_miss_remote (remote processor physical address), and SSM_L3_wb_remote)
234. ..._action trap: 0x0000 0000.

TABLE 5-2 PCR Bit Description
Upper reserved bits: Reserved by the SPARC architecture or unused by the UltraSPARC IV processor. Read as zero; write zero, or write the value previously read (read-modify-write).
Bits 26:17, Reserved: Reserved by the SPARC architecture. Read as zero; write zero, or write the value previously read (read-modify-write).
Bits 16:11, SU: Selects one of up to 64 counters accessible in the upper half (bits 63:32) of the PIC register.
Bit 10, Reserved: Read as zero; write zero, or write the value previously read (read-modify-write).
Bits 9:4, SL: Selects one of up to 64 counters accessible in the lower half (bits 31:0) of the PIC register.
Bit 3, Reserved: Unused by the UltraSPARC IV processor. Read as zero; write zero, or write the value previously read (read-modify-write).
Bit 2, UT: User Trace Enable. If set to 1, counts events in nonprivileged (user) mode.
Bit 1, ST: System (Supervisor) Trace Enable. If set to 1, counts events in privileged mode.
Notes:
- If both PCR.UT and PCR.ST are set to 1, all selected events are counted.
- If both PCR.UT and PCR.ST are zero, counting is disabl
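A hedged example of programming these fields follows. It selects Cycle_cnt (encoding 000000) on PICL and Instr_cnt (encoding 000001) on PICU, counting in both user and privileged mode, using the SL, SU, UT, and ST field positions from TABLE 5-2; the PCR and PIC are assumed to be reachable as %asr16 and %asr17 per the SPARC V9 ASR assignments used by this processor family.

        set     0x806, %o0              ! SU<16:11> = 1, SL<9:4> = 0, UT = 1, ST = 1
        wr      %o0, %g0, %asr16        ! write the PCR
        ! ... code region being measured ...
        rd      %asr17, %o1             ! read the PIC: PICU in bits 63:32, PICL in bits 31:0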
235. ..._addr<5:3> is used to select one instruction from the 32-byte sub-block.
Bits 2:0: Mandatory value; should be 0.

The instruction prefetch buffer data format is shown in TABLE 3-23.
TABLE 3-23 Instruction Prefetch Buffer Data Format
Bits 63:43: Mandatory value; should be 0.
Bit 42, IPB_parity: The parity bit for IPB instructions is computed the same way as the I-cache parity bit.
Bits 41:32, IPB_predecode: This is similar to the instruction cache data format.
Bits 31:0, IPB_instr: This is similar to the instruction cache data format.

3.6.2 Instruction Prefetch Buffer Tag Field Accesses
ASI 0x6A (per logical processor)
Name: ASI_IPB_TAG
The address format of the instruction prefetch buffer tag array access is shown in TABLE 3-24.
TABLE 3-24 Instruction Prefetch Buffer Tag/Valid Field Read Access Data Format
Bits 63:10: Mandatory value; should be 0.
Bits 9:6, IPB_addr: IPB_addr<9:7> is a 3-bit index (VA<9:7>) that selects one entry of the 8-entry instruction prefetch buffer tag. During instruction prefetch buffer tag reads or writes, the valid bit of one of the 32-byte sub-blocks of that entry will also be read or written. IPB_addr<6> is used to select the valid bit of the desired sub-block of an instruction prefetch buffer entry.
Bits 5:0: Mandatory value; should be 0.
The
236. ..._sel = 1 are shown in TABLE 3-62 and TABLE 3-63, respectively.
TABLE 3-62 L2 cache Data Access Data Format when ECC_sel = 0
Bits 63:0, L2_data: 64-bit L2 cache data of a given index/way/xw_offset.
TABLE 3-63 L2 cache Data Access Data Format when ECC_sel = 1
Bits 17:9, L2_ecc_hi: When VA<5> = 0, L2_ecc_hi corresponds to the 9-bit ECC for L2 data<511:384>; when VA<5> = 1, L2_ecc_hi corresponds to the 9-bit ECC for L2 data<255:128>.
Bits 8:0, L2_ecc_lo: When VA<5> = 0, L2_ecc_lo corresponds to the 9-bit ECC for L2 data<383:256>; when VA<5> = 1, L2_ecc_lo corresponds to the 9-bit ECC for L2 data<127:0>.
Note: If an L2 cache data diagnostic access encounters an L2 cache data CE, the returned data will not be corrected; raw L2 cache data will be returned regardless of whether L2_data_ecc_en in the L2 cache control register (ASI 0x6D) is set or not.
For the one-entry data buffer used in the L2 cache off mode, no ASI diagnostic access is supported. The L2 cache data array can be accessed as usual in the L2 cache off mode; however, the returned value is not guaranteed to be correct, since the SRAM can be defective, and this may be the reason the L2 cache was turned off.

3.12 L3 cache Diagnostic & Control Accesses
Separate ASIs are provided for reading and writing the L3 cache tag and data SRAMs, as well as
237. ..._setup: 5'b01111; trace_out: 2'b01.
L3 cache Control Register: Unchanged; EC_turn_rw: 4'b0110; EC_size: 2'b01; EC_clock: 3'b100; all others: 4'b0101, 0.
INSTRUCTION_TRAP: all 0 (off); 0 (off); Unchanged.
VA_WATCHPOINT: Unknown; Unchanged; Unchanged.
PA_WATCHPOINT: Unknown; Unchanged.
I-SFSR / D-SFSR: ASI: Unknown, Unchanged, Unchanged. FT: Unknown, Unchanged. E: Unknown, Unchanged, Unchanged. CTXT: Unknown, Unchanged, Unchanged. PRIV: Unknown, Unchanged. W: Unknown, Unchanged, Unchanged. OW (overwrite): Unknown, Unchanged, Unchanged. FV: 0, Unchanged. NF: Unknown, Unchanged. TM: Unknown, Unchanged.
DMMU_SFAR: Unknown; Unchanged; Unchanged.
INTR_DISPATCH_W: all 0; Unchanged.
INTR_DISPATCH_STATUS: all 0; Unchanged; Unchanged.
INTR_RECEIVE: BUSY: 0, Unchanged; MID: Unknown, Unchanged, Unchanged.
ESTATE_ERR_EN: all 0 (all off); Unchanged; Unchanged.
AFAR: PA: Unknown, Unchanged, Unchanged.
AFAR_2: PA: Unknown, Unchanged.
AFSR: all 0; Unchanged; Unchanged.
AFSR_2: all 0; Unchanged; Unchanged.
AFSR_EXT: all 0; Unchanged; Unchanged.
AFSR_EXT_2: all 0; Unchanged.
Rfr_CSR: all Unknown; Unchanged; Unchanged.
Mem_Timing_CSR: all Unknown; Unchanged; Unchanged.
Mem_Addr_Dec: all Unknown; Unchanged; Unchanged.
Mem_Addr_Cntl: all Unknown; Unchanged; Unchanged.
EMU_ACTIVITY_STATUS: all 0; 0; Unchanged.

TABLE 4-1 Machine State After Reset and RED_state (4 of 5)
Name: Other Processor-Specific States; Processor and external cache tags, microtags, and ...; RED_state: Unchanged
238. ...a In Register. The lock bit has no meaning for an invalid entry. Arbitrary entries can be locked down in the TLB. Software must ensure that at least one entry is not locked when replacing a TLB entry; otherwise a locked entry will be replaced.
Since the 16-entry, fully associative TLB is shared for all locked entries as well as for 4 MB and 512 KB pages, the total number of locked pages is limited to less than or equal to 15.
In the UltraSPARC IV processor the TLB lock bit is only implemented in the D-MMU 16-entry fully associative TLB and the I-MMU 16-entry fully associative TLB. In the TLBs dedicated to 8 KB page translations (the D-MMU 512-entry, 2-way associative TLB and the I-MMU 512-entry, 2-way associative TLB), each TLB entry's lock bit reads as 0 and writes to it are ignored. The lock bit set for an 8 KB page translation in both the I-MMU and D-MMU is read as 0 and ignored when written.
The cacheable-in-physically-indexed-cache (CP) bit and cacheable-in-virtually-indexed-cache (CV) bit determine the placement of data in the caches. The UltraSPARC IV processor fully implements the CV bit. The following table describes how CP and CV control cacheability in specific UltraSPARC IV processor caches.
TABLE 13-6 Meaning of TTE CP/CV Bits (when the TTE is placed in the I-TLB)
CP CV = 00 or 01: Non-cacheable.
CP CV = 10: Cacheable (L2/L3 cache) but not installed in the I-cache.
CP CV = 11: Cacheable (L2/L3 cache and I-cache).
The MMU does not oper
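A hedged sketch of installing a locked TLB entry follows; the ASI and VA encodings shown are assumptions based on the D-MMU register descriptions elsewhere in this chapter, and the register contents are illustrative (%o0 holds the VA and context for the Tag Access Register, %o1 holds a valid TTE with the lock bit, TTE bit 6, set).

        set     0x30, %g1                   ! assumed VA of the D-MMU Tag Access Register
        stxa    %o0, [%g1] 0x58             ! assumed ASI_DMMU: write the tag for the new entry
        membar  #Sync
        stxa    %o1, [%g0] 0x5C             ! assumed ASI_DTLB_DATA_IN_REG: atomic write installs the locked entry
        membar  #Sync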
239. (TABLE 7-27, System Bus CE, UE, TO, DTO, BERR, DBERR errors, 2 of 10, continued)
...data in P-cache block load buffer; good critical 32-byte data and non-critical 32-byte data in W cache; good data taken; bad data in FP register file; good data taken; disrupting trap; deferred trap; disrupting trap.
D-cache atomic fill request with UE in the critical 32-byte data from system bus: L2 cache data raw UE data, raw ECC; bad critical 32-byte data plus UE information and good non-critical 32-byte data in W cache; good critical 32-byte data taken; deferred trap (UE will be taken and the precise fast_ecc_error trap will be dropped); when the line is evicted out from the W cache again, based on the UE status bit sent from the L2 cache, the W cache flips the 2 least significant ECC check bits C<1:0> in both the lower and upper 16 bytes.
D-cache atomic fill request with CE in the non-critical (2nd) 32-byte data from system bus: CE; fast_ecc_error no; L2 cache data corrected data, corrected ECC; good critical and non-critical 32-byte data in W cache; disrupting trap.

TABLE 7-27 System Bus CE, UE, TO, DTO, BERR, DBERR errors (3 of 10)
Columns: Event; flag; fast_ecc_error; L2 cache data; L2 cache state; L1 cache data; ...
D-cache atomic fill request with UE in the non-critical (2nd) 32-byte data from system bus: L2 cache data raw UE data, raw ECC; good critical 32-byte data and bad non-critical 32-byte data, and UE information, in the W cache
240. ...the UltraSPARC III Cu Processor User's Manual for information on the TSB Extension Registers. In the UltraSPARC IV processor, TSB_Hash (bits 11:3 of the extension registers) is exclusive-ORed with the calculated TSB offset to provide a hash into the TSB. Changing the TSB_Hash field on a per-process basis minimizes the collision of TSB entries between different processes.

I-MMU Synchronous Fault Status Register (I-SFSR)
ASI 0x50, VA<63:0> = 0x18
Name: ASI_IMMU_SFSR
Access: RW
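A small, hedged sketch of servicing this register follows, using the ASI and VA given above; the FV bit is cleared only by an explicit ASI write of 0, as described in the bit definitions for this register.

        set     0x18, %g1
        ldxa    [%g1] 0x50, %g2         ! read the I-SFSR (FT, ASI, PR, OW, FV, ...)
        stxa    %g0, [%g1] 0x50         ! explicit write of 0 clears FV for the next fault
        membar  #Sync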
241. Preface
This book contains information about the architecture and programming of the UltraSPARC IV processor, one of Sun Microsystems' family of SPARC V9-compliant processors. This document is a supplement to the UltraSPARC III Cu Processor User's Manual and should be read in conjunction with that document. This document extends the material in the UltraSPARC III Cu Processor User's Manual. Any material that is not referred to in this document remains unchanged for the UltraSPARC IV processor. Any material that overrides or extends the material in the UltraSPARC III Cu Processor User's Manual should be read from this document.
Target Audience
This user's manual is mainly targeted at programmers who write software for the UltraSPARC IV processor. This user's manual supplement contains a repository of information that is useful to operating system programmers, application software programmers, logic designers, and third-party vendors who are trying to understand the architecture and operation of the UltraSPARC IV processor. This supplement is both a guide and a reference manual for low-level programming of the processor.
Prerequisites
This user's manual is a companion to the UltraSPARC III Cu Processor User's Manual. The reader of this user's manual should be familiar with the contents of the UltraSPARC III Cu Processor Use
242. ...access_error trap will be generated. Multiple occurrences of this error will cause AFSR1.ME to be set.
Note: For foreign WIO to the UltraSPARC IV processor internal ASI registers (such as MCU parameters), neither a CE error nor a UE error in the data will be logged. A CE error in the data will be corrected automatically. For a UE error in the data, bad data will be installed in the ASI registers and no traps will be generated.

SRAM e-Fuse Array Related Errors
EFA_PAR_ERR: When a parity error occurs during the transfer from the e-Fuse array to a repairable SRAM array, AFSR.EFA_PAR_ERR is set to 1 and the error pin is asserted. To clear AFSR.EFA_PAR_ERR, the user must pull hard power-on reset, which will initiate the e-Fuse array transfer and clear the parity error.
RED_ERR: When an e-Fuse error occurs in the I-cache, D-cache, D-TLBs, or I-TLB SRAM redundancy, or in the shared L2 cache tag, L3 cache tag, or L2 cache data SRAM redundancy, AFSR.RED_ERR is set to 1 and the error pin is asserted. To clear AFSR.RED_ERR, the user must pull hard power-on reset, which will initiate the e-Fuse array transfer and clear the parity error.

7.8 Further Details of ECC Error Processing
7.8.1 System Bus ECC Errors
7.8.1.1 ECC Error Detection
For incoming data from the system bus, ECC error checking is turned on when the Sun Fireplane Interconnect DSTAT bits indicate that the data is valid and ECC has been generated for this data.
243. (TABLE 7-27, continued)
...action: garbage data in D-cache but not in P-cache; garbage data not in P-cache; no action; garbage data not taken; no action; garbage data dropped; no action; deferred trap, the 2 least significant data bits 1:0 in both the lower and upper 16 bytes are flipped; deferred trap (BERR will be taken and the precise fast_ecc_error trap will be dropped), the 2 least significant data bits 1:0 in both the lower and upper 16 bytes are flipped; deferred trap, the 2 least significant data bits 1:0 in both the lower and upper 16 bytes are flipped; deferred trap (BERR will be taken and the precise fast_ecc_error trap will be dropped), the 2 least significant data bits 1:0 in both the lower and upper 16 bytes are flipped; deferred trap, the 2 least significant data bits 1:0 in both the lower and upper 16 bytes are flipped.
Cacheable D-cache block load fill request with BERR in the critical 32-byte data from system bus, OR cacheable D-cache block load fill request with BERR in the non-critical (2nd) 32-byte data from system bus: BERR; fast_ecc_error no; not installed; garbage data in P-cache block load buffer; garbage data in FP register file; deferred trap, the 2 least significant data bits 1:0 in both the lower and upper 16 bytes are flipped
244. ...loaded into the D-cache, and it will be used without any error trap if the trap routine retries the faulting instruction. Note that when the line is moved from the L3 cache to the L2 cache, raw data read from the L3 cache, without correction, is stored in the L2 cache. Since the L2 cache and the L3 cache are mutually exclusive, once the line is read from the L3 cache into the L2 cache it will not exist in the L3 cache. Software-initiated flushes of the L2 cache and L3 cache are required if this event is not to recur the next time the word is fetched from the L2 cache. This may need to be linked with a correction of a multi-bit error in the L3 cache if that is corrupted too. Multiple occurrences of this error will cause AFSR1.ME to be set.
In the event that the L3_UCU event is for an instruction fetch which is later discarded without the instruction being executed, no trap will be generated. If both the L3_UCU and the L3_MECC bits are set, it indicates that an address parity error has occurred.
L3_EDC
The AFSR.L3_EDC status bit is set by errors in block loads to the processor, errors in reading the L3 cache as the result of a store queue exclusive request, and errors as the result of prefetch queue operations. When a block load instruction misses the D-cache and hits the L3 cache, the line is moved from the L3 cache to the L2 cache and will be checked for the correctness of its ECC. If a single-bit error is detected, the L3_EDC bit will be set to log t
246. ...tags, or L3 cache tag behaves differently than normal, causing the processor to both assert its ERROR output pin and to begin trap execution. Uncorrectable errors in the MTags, L2 cache tags, or L3 cache tags are normally fatal and reset the affected coherence domain.
In the UltraSPARC IV processor, on-chip SRAMs are e-Fuse repairable. A parity error detected during the transfer of data between the e-Fuse array and the repairable SRAM array will cause the processor to assert the ERROR output pin. During normal operation, bit flipping in the REDUNDANCY registers will cause the processor to assert the ERROR output pin.
After an FERR event, system reset to the processor can be cycled to regain control, and code can then read out the internal register values and run diagnostic tests. Following this diagnosis phase, if it is determined that the processor is to be integrated into a new domain, cycling POK (not necessarily power) initializes the processor as consistently as possible.
Entering RED_state
In the event of a catastrophic hardware fault which produces repeated errors, or a variety of programming faults, the processor can take a number of nested traps leading to an eventual entry into RED_state. RED_state entry is not normally recoverable. However, programs in RED_state can provide a useful diagnosis of the problem encountered prior to attempting corrective action. The I-cache, P-cache
247. ...take the interrupt (only one trap entry/exit). Upon entry, the handler must check both the TSTATE.PEF and FPRS.FEF bits. If TSTATE.PEF = 1 and FPRS.FEF = 1, the handler has been entered because of an interrupt (either interrupt_vector or interrupt_level). In such a case:
The fp_disabled handler should enable interrupts (that is, set PSTATE.IE = 1), then issue an integer instruction, for example add %g0, %g0, %g0. An interrupt is triggered on this instruction. The UltraSPARC IV processor then enters the appropriate interrupt handler; PSTATE.IE is turned off there for the type of interrupt. At the end of the handler, the interrupted instruction is retried (RETRY) after returning from the interrupt; the add %g0, %g0, %g0 is retried. The fp_disabled handler then returns to the original process with a RETRY. The interrupted FPop is then retried, taking an fp_exception_ieee_754 or fp_exception_other at this time if needed.
TABLE 9-2 shows the mapping between the settings of bits 11:6 of the DCR and the values seen on the observability bus.
TABLE 9-2 Signals Observed at obsdata<9:0> for Settings of Bits 11:6 of the DCR (1 of 2)
DCR bits 11:6 = 101xxx, signal source ECC (default): l3t_cor, l2t_cor, l3d_cor, l2d_cor, l2d_uncor, sys_cor, ...
DCR bits 11:6 = 100010, signal source Clk grid: l2clk/4, l2clk/16, l2clk, l1clk/4, l1clk
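A hedged sketch of the handler sequence described above follows; it is illustrative only, and the PSTATE bit positions used (PEF = bit 4, IE = bit 1) are those defined by the SPARC V9 architecture.

fp_disabled_handler:
        rdpr    %pstate, %l0
        or      %l0, 0x10, %l0          ! set PSTATE.PEF again
        wrpr    %l0, %g0, %pstate
        or      %l0, 0x02, %l0          ! then set PSTATE.IE = 1
        wrpr    %l0, %g0, %pstate
        add     %g0, %g0, %g0           ! integer instruction; the pending interrupt
                                        ! is taken here and later retried
        retry                           ! finally retry the interrupted FPop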
248. ...each logical processor's I-cache cluster, D-cache cluster, P-cache cluster, W-cache cluster, Branch Prediction Array, and TLBs. The TLBs must be turned off before the ASI_SRAM_FAST_INIT is launched. ASI_SRAM_FAST_INIT_SHARED (ASI 0x3F) will initialize all the SRAM structures that are shared between the two logical processors: the L2 cache tag, L2 cache data, and L3 cache tag SRAMs.
FIGURE 3-1 shows how the D-MMU of a particular logical processor sends the initialization data to all 3 loops in parallel for ASI 0x40, or sends the initialization data to the shared loop for ASI 0x3F. Loop 2 will initialize the D-cache data, physical tag, snoop tag, and microtag arrays. Loop 3 will initialize the three D-TLBs. Loop 1 will initialize the 2 I-TLBs as well as the I-cache physical tag, snoop tag, valid/predict, and microtag arrays, etc. For the shared loop, the D-MMU will continuously dispatch ASI write requests to flush all the entries of the L2 cache tag, L2 cache data, and L3 cache tag SRAM structures on the ASI chain in a pipelined fashion.
FIGURE 3-1 Fast init (ASI 0x40) goes through three loops in pipelined fashion (figure: shared loop covering the L2 cache tag, L2 cache data, and L3 cache tag; per-LP loops covering the I-cache, D-cache, branch prediction, IPB, and TLB arrays, with the SRAM test/BIST control at the top level)
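A hedged sketch of the per-logical-processor variant follows; ASI 0x40 (ASI_SRAM_FAST_INIT) is the encoding given above, and the zero store data and surrounding MEMBAR #Sync discipline are assumed to mirror the shared variant described earlier in this chapter. The TLBs must already be turned off.

        membar  #Sync
        stxa    %g0, [%g0] 0x40         ! ASI_SRAM_FAST_INIT: initialize this LP's cache and TLB arrays
        membar  #Sync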
249. ...all E_SYND ECC syndromes to the indicated event.
Legend for TABLE 7-12:
(blank): no error.
0-127: data single-bit error (data bit 0-127).
C0-C8: ECC check single-bit error (check bit 0-8).
M2: probable double-bit error within a nibble.
M3: probable triple-bit error within a nibble.
M4: probable quad-bit error within a nibble.
M: multi-bit error.
Three syndromes in particular from TABLE 7-12 are useful. These are the syndromes corresponding to the three different deliberately inserted bad-ECC conditions, the signalling ECC codes used by the processor:
For a DSTAT 2 or 3 (BERR or DBERR) event from the system bus for a cacheable load, data bits 1:0 are inverted in the data stored in the L2 cache. The syndrome seen when one of these signalling words is read will be 0x11C.
For an uncorrectable data ECC error from the L3 cache, data bits 127:126 are inverted in data sent to the system bus as part of a writeback or copyout. The syndrome seen when one of these signalling words is read will be 0x071.
For an uncorrectable data ECC error from the L2 cache, data bits 127:126 are inverted in data sent to the system bus as part of a copyout. The syndrome seen when one of these signalling words is read will be 0x071.
For an uncorrectable data ECC error on the L2 cache or L3 cache read done to complete a store queue exclusive request, the uncorrectable ECC error information is stored. When the line is evicted from the W cache back to the L2 cache, if the un
250. ...an individual single-threaded processor that has been aggregated together with other separate single-threaded processors to form a Symmetric MultiProcessing (SMP) system. In either case, the operating system exploits logical processors to simultaneously schedule multiple threads of execution. Various low-level software programs, however (boot, error diagnostic, and other codes), must be aware of logical processors as elements of the same CMT processor chip. This chapter describes the interface between low-level software and the multiple logical processors that cohabit a CMT processor.
Logical processors obey the same memory model semantics as if they were independent processors. All multiprocessing libraries, thread libraries, and code will be able to operate without modification on a CMT processor comprising N logical processors in exactly the same way they operate on an SMP system composed of N independent processors.
Note: All previous documentation, including the UltraSPARC III Cu Processor User's Manual and the SPARC V9 specification, uses the term processor. When these earlier documents are read in conjunction with this supplement, replace the term processor with logical processor to read them in the context of the UltraSPARC IV processor.

2.2 Accessing CMT Registers
A key part of the CMT Programming Model is a set of specific privileged registers. This section covers how these registers are organized and accessed. These registers can be read
251. ...an instruction fetch which is later discarded without the instruction being executed, no trap will be generated.
EDC
The AFSR.EDC status bit is set by errors in block loads to the processor, errors in reading the L2 cache as the result of a store queue exclusive request, and errors as the result of prefetch queue operations. When a block load instruction misses the D-cache and hits the L2 cache, an L2 cache read will be performed and the data read back from the L2 cache SRAM will be checked for the correctness of its ECC. If a single-bit error is detected, the EDC bit will be set to log this error condition. A disrupting ECC_error trap will be generated in this case, while hardware will proceed to correct the error.
A software PREFETCH instruction writes a command to the prefetch queue, which operates autonomously from the execution unit. A correctable L2 cache data ECC error as the result of a read operation initiated by a prefetch queue entry will set EDC, and a disrupting ECC_error trap will be generated. No data will be installed in the P-cache.
When a store instruction misses the W cache and hits the L2 cache in E or M state, the store queue will issue an exclusive request to the L2 cache. The exclusive request will perform an L2 cache read, and the data read back from the L2 cache SRAM will be checked for the correctness of its ECC. If a single-bit error is detected, the EDC bit w
252. ...an uncorrectable system bus data ECC error as the result of a prefetch queue operation results in a disrupting trap; it is not particularly important when this is taken. One way to enforce an error barrier for a software PREFETCH for reads (PREFETCH fcn = 0, 1, 20, or 21) is to issue a floating-point load instruction that uses the prefetched data. This will force an interlock when the load is issued: the load miss request will not be sent, waiting for the completion of the prefetch request (with or without error). Upon completion, the load instruction is recirculated; any error will then be replayed and generate a precise trap. Note that, in implementation, the floating-point load instruction has to be scheduled at least 4 cycles after the PREFETCH instruction.
Disrupting and precise traps do not act as though a MEMBAR #Sync was executed at the time of the trap. This is because the disrupting and precise traps wait for all reads that have been issued on the system bus to complete, but not for the store queue to be empty, before beginning to execute the trap.
Store queue write operations (WS, WIO, or WBIO) can result in deferred traps, but only if the target device does not assert MAPPED (AFSR.TO). These are generally indicative of a fault more serious than a transient data upset. This can lead to the problem described in the next paragraph.
If several stores of type WS, WIO, or WBIO ar
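A hedged sketch of the barrier described above follows; the address register and the number of filler instructions are illustrative, the point being only that the dependent floating-point load issues at least 4 cycles after the PREFETCH.

        prefetch [%o0], 0               ! PREFETCH for several reads (fcn = 0)
        nop
        nop
        nop
        nop
        ldd     [%o0], %f0              ! FP load that uses the prefetched data; it is
                                        ! recirculated on error, yielding a precise trap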
253. ...and D-cache are automatically disabled by the hardware, which clears the IC, PC, and DC bits in the DCU Control Register (DCUCR) on entering RED_state. The L2 cache, L3 cache, and W cache state are unchanged.

7.10 Behavior on L2 cache DATA Error
TABLE 7-18 L2 cache data CE and UE errors (1 of 3)
Columns: Event; error logged in AFSR; fast_ecc_error; L2 cache data.
I-cache fill request with CE in the critical 32 bytes of L2 cache data, and the data does get used later: UCC; original data, original ECC.
I-cache fill request with UE in the critical 32 bytes of L2 cache data, and the data does get used later: UCU; original data, original ECC.
I-cache fill request with CE in the non-critical 32 bytes (the 2nd 32 bytes) of L2 cache data, and the data (either the critical or the non-critical 32 bytes) does get used later: UCC; original data, original ECC.
I-cache fill request with UE in the non-critical 32 bytes (the 2nd 32 bytes) of L2 cache data, and the data (either the critical or the non-critical 32 bytes) does get used later: UCU; original data, original ECC.
I-cache fill request with CE in the critical 32 bytes of L2 cache data, but the data never gets used: UCC; data, original.
I-cache fill request with CE in the non-critical 32 bytes (the 2nd 32 bytes) of L2 cache data, but the data never gets used.
I-cache fill request with UE in the critical 32 bytes of L2 cache data, but the data never gets used
…aning of TTE  329
D-TLB Tag Access Extension Register Description  331
D-TLB Data Access Register  332
Tag Read Register Data Format Description for T16  333
Tag Read Register Data Format Description for T512  333
TSB Base Register Description  334
D-SFSR Bit Description  335
D-MMU Synchronous Fault Status Register FT (Fault Type) Field  335
TTE Data Format  336
D-TLB Diagnostic Register of T512_0 and T512_1  336

List of Figures

FIGURE 3-1  Fast Init (ASI 40₁₆) Goes Through Three Loops in Pipeline Fashion  77
FIGURE 5-1  Handling of Conditional Branches  94
FIGURE 5-2  Handling of …  95
FIGURE 5-3  Operational Flow Diagram for Controlling Event Counters  101
FIGURE 5-4  SSM Performance Counter Event Tree  113
FIGURE 6-1  Floating Point Number Line  122
FIGURE 7-1  The UltraSPARC IV Processor RAS Diagram  200
FIGURE 8-1  Recovering Deferred Traps  256
255. antial benefits across a wide spectrum of workloads including technical computing Within the constraints imposed by building from the common pipeline and the common bus protocol that defines membership in the UltraSPARC III IV processor family every effort has been made to optimize the design of the UltraSPARC IV processor Key features of the processor include e Chip Multithreading with two upgraded UltraSPARC III processor cores per chip e Implemented three levels of cache hierarchy e Enhanced memory controller and system interface unit e Enhanced error detection and correction Architectural Overview 1 amp Sun microsystems 1 2 Chip Multithreading CMT Many of the workloads for the midrange and enterprise server markets exhibit a high degree of thread level parallelism TLP These workloads consist of many independent tasks or threads that can be partitioned to run on separate logical processors These threads scale well in performance as the number of logical processors available is increased up to the limit of available threads This characteristic behavior of enterprise workloads will be exploited by future UltraSPARC processors specifically designed to take advantage of TLP in code One of the most effective ways to exploit TLP at the processor level is through Chip Multithreading technology CMT processors incorporate multiple logical processors onto a single chip Sun Microsystems a longtime leader in providing supp
256. ap If the instruction is LDDFA STDFA and if the address is aligned to a 32 bit boundary but not to a 64 bit boundary then the trap type will be LDDF STDF_mem_address_not_aligned Chip Multithreading CMT 22 amp Sun microsystems Caches Cache Coherency and Diagnostics This chapter describes the caches cache coherency and the diagnostics in the following sections Chapter Topics 3 1 3 1 1 Caches Cache Coherency and Diagnostics Cache Organization on page 23 Cache Flushing on page 28 Coherence Tables on page 29 Diagnostics Control and Accesses on page 39 Instruction Cache Diagnostic Accesses on page 39 Instruction Prefetch Buffer Diagnostic Accesses on page 45 Branch Prediction Diagnostic Accesses on page 47 Data Cache Diagnostic Accesses on page 48 Write Cache Diagnostic Accesses on page 52 Prefetch Cache Diagnostic Accesses on page 54 L2 cache Diagnostics amp Control Accesses on page 58 L3 cache Diagnostic amp Control Accesses on page 66 Summary of ASI Accesses in L2 L3 Off Mode on page 75 ASI SRAM Fast Init on page 76 OBP Backward Compatibility Incompatibility on page 80 Cache Organization This section describes the different types of cache organizations found in the UltraSPARC IV processor virtually indexed physically tagged VIPT physically indexed physically tagged PIPT and virtually indexed virtually tagged VIVT caches Cache Overview The UltraSPARC IV processor supports
257. ap is generated rd is unchanged IEEE 754 1985 Standard 122 un microsystems 6 3 1 Each instruction is defined with one or more operands Most instructions generate a result The FCMP E instruction does not generate a result it sets the fecN bits instead Addition TABLE 6 3 Floating Point Addition Result from the operation includes one or more of the following e Number in f register See Trap Event on page 132 ADDITION e Exception bit set See TABLE 6 12 Instruction FADD rsj rsp rs2 Trap occurs See abbreviations in TABLE 6 12 Underflow overflow can occur Masked Exception TEM 0 Enabled Exception TEM 1 rel gt rd Destination Register Destination Register Written rd Flag s Written rd Kaeo tose 0 0 0 one set 0 None set 0 FSR RD 0 1 2 0 FSR RD 0 1 2 0 A 0 0 0 FSR RD 3 one set 0 FSR RD 3 None set 0 0 0 one set 0 None set 0 Normal Normal one set Normal None set 0 Normal Normal one set Normal None set 0 Infinity Infinity one set Infinity None set 0 Infinity Infinity one set Infinity None set Asserts ofc ofa Asserts ofc nvc a ola gt Normal Infinity Infinity EC No IEEE trap enabled Asserts ofc ofa Asserts ofc nvc 4 i z ola gt Normal Infinity Infinity Bee No IEFE trap enabled Normal Normal Can overflow See 6 5 3 Ca
258. apped 011101 IU_stat_br_miss_untaken 011101 SW_pf_PC_installed 011110 IU_stat_br_count_taken 011110 IPB_to_IC_fill 011111 PC_miss 011111 L2_write_miss 100000 MC_writes_0_sh 100000 MC_reads_0_sh 100001 MC_writes_1_sh 100001 MC_reads_1_sh 100010 MC_writes_2_sh 100010 MC_reads_2_sh 100011 MC_writes_3_sh 100011 MC_reads_3_sh 100100 MC_stalls_1_sh 100100 MC stalls_0_sh 100101 MC_stalls_3_sh 100101 MC stalls_2_sh 100110 Re_RAW_miss 100110 L2_hit_other_half 100111 FM_pipe_completion 100111 Reserved 101000 SSM_L3_miss_mtag_remote 101000 L3_rd_miss 101001 SSM_L3_miss_remote 101001 Re_L2_miss 101010 SW_pf_exec 101010 IC_miss_cancelled 101011 SW_pf_str_exec 101011 DC_wr_miss 101100 SW_pf_dropped 101100 L3_hit_I_state_sh 101101 SW_pf_L2_installed 101101 SI_RTS_src_data 101110 Reserved 101110 L2_IC_miss 101111 L2_HW_pf_miss 101111 SSM_new_transaction_sh 110000 Reserved 110000 L2_SW_pf_miss 110001 L3_miss 110001 L2_wb 110010 L3_IC_miss 110010 L2_wb_sh 110011 L3_SW_pf_miss 110011 L2_snoop_cb_sh 110100 L3_hit_other_half 110100 Reserved 110101 L3_wb 110101 Reserved 110110 L3_wb_sh 110110 Reserved 110111 L2L3_snoop_cb_sh 110111 Reserved 111000 111111 Reserved 110000 111111 Reserved 117 un microsystems Performance Instrumentation and Optimization 118 amp Sun microsystems IEEE 754 1985 Standard The implementation of the floating point unit for standard and non standard operating modes is described in this chapter Debug and diagnosti
259. ard call and return conventions so that hardware can correctly predict the return addresses Performance Instrumentation and Optimization 95 amp Sun microsystems 5 3 5 3 1 SE 533 Data Stream Issues The following section addresses these data stream issues e Data Cache Organization e Data Cache Timing e Data Alignment e Using LDDF to Load Two Single Precision Operands Cycle e Store Considerations e Read After Write Hazards Data Cache Organization The D cache is a mapped virtually indexed physically tagged VIPT write through non write allocating cache It is logically organized as lines of 32 bytes Data Cache Timing The latency of a load to the D cache depends on the opcode LDX and LDUW have two cycle load to use latency while all other loads have three For example if the first two instructions in the instruction buffer are a load and an instruction dependent on that load the grouping logic will break the group after the load and insert a bubble in the pipeline It is very important to separate loads from their use Data Alignment The SPARC V9 specification requires that all accesses be aligned on an address equal to the size of the access Otherwise a mem_address_not_aligned trap is generated This is especially important for double precision floating point loads which should be aligned on an 8 byte boundary If misalignment is determined to be possible at compile time it is better to use two
ared before the caches are enabled. The I-MMU and D-MMU TLBs must be initialized by clearing the valid bits of all TLB entries (see the UltraSPARC III Cu Processor User's Manual). The P-cache valid bits must be cleared before any floating-point loads are executed. The MCU refresh control register, as well as the Sun Fireplane Interconnect configuration register, must be initialized after a hard power-on reset. In SSM (Scalable Shared Memory) systems, the microtags contained in memory must be initialized before any Sun Fireplane Interconnect transactions are generated.

Note: Executing a DONE or RETRY instruction when TSTATE is uninitialized after a POR can damage the processor. The POR boot code should initialize TSTATE 3:0 using wrpr writes before any DONE or RETRY instructions are executed.

4.2.2 System Reset (Soft POR, Sun Fireplane Interconnect Reset, POR)

A system reset occurs when the Reset pin is activated. When the Reset pin is active, all other resets and traps are ignored. System reset has a trap type of 1 at physical address offset 20₁₆. Any pending external transactions are canceled. Memory refresh continues uninterrupted during a system reset.

4.2.3 Externally Initiated Reset (XIR)

An XIR steering register controls which logical processor(s) will receive the XIR reset. Please refer to the UltraSPARC III Cu Processor User's Manual
261. assured using an error barrier as in the example above If AFAR is used the presence of orphaned errors resulting from the asynchronous activity of the instruction fetcher must be considered e If an orphaned error occurs the source of the TO or BERR report cannot be determined from the AFAR Given the error barrier sequence above it is reasonable to expect that the TO or BERR resulted from the peek or poke and proceed accordingly To reduce the likelihood of this event orphaned errors can be cleaned at point 1 shown in FIGURE 8 1 The source of the TO or BERR can be confirmed by retrying the peek or poke e If the TO or BERR happens again the system can continue with the normal peek or poke failure case e If the TO or BERR does not happen the system must panic Exceptions Traps and Trap Types 256 amp Sun microsystems 8 1 2 6 8 1 3 The peek access should be preceded and followed by MEMBAR Sync instructions The state of the destination register of the access may be corrupted however other states will not be affected If TPC is pointing to the MEMBAR Sync following the access then the data_access_error trap handler knows that a recoverable error has occurred and resumes execution after setting a status flag The trap handler will have to set TNPC to TPC 4 before resuming because the contents of TNPC are otherwise undefined Deferred Trap Handler Functionality The following is a possible
262. at caused it due to the skid problem Each of the two 32 bit PICs can accumulate over four billion events before wrapping around Overflow of PICL or PICU causes a disrupting trap and SOFTINT register bit 15 to be set to 1 If the overflow occurs when PSTATE IE 1 and PIL lt 15 an interrupt_level_15 trap is generated Extended event logging can be accomplished by periodic reading of the contents of the PICs before each overflows Additional statistics can be collected using the two PICs over multiple Performance Instrumentation and Optimization 99 un microsystems passes of program execution Two events can be measured simultaneously by setting the PCR SU PCR SL fields along with the PCR UT and PCR ST fields The selected statistics are reflected during subsequent accesses to the PICs The difference between the values read from the PIC on two reads reflects the number of events that occurred between register reads Software can only rely on read to read PIC accesses to get an accurate count and not a write to read of the PIC counters TABLE 5 3 shows the details of the PIC TABLE 5 4 describes the various fields of the PIC TABLE 5 3 Performance Instrumentation Counter Register ASR R W Description Reset Accessibility depends on PCR PRIV bit 64 bit Read Write i e a accessible in any mode Note Writes are designed for 7 SE 7 0x0000 0000 diagnostic and test purposes 1 accessible in Supervisor Mode otherwise privilege
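A minimal sketch of the read-to-read usage just described, assuming a 64-bit SPARC V9 compile, an assembler that accepts the %asr17 form for the PIC register, and that supervisor software has already programmed the PCR (event selection in PCR.SL/PCR.SU, and PCR.PRIV = 0 if the reads are done from non-privileged mode); the helper names are illustrative only.

    #include <stdint.h>

    /* Read the 64-bit PIC register (ASR 17): PICU in bits 63:32, PICL in bits 31:0. */
    static inline uint64_t read_pic(void)
    {
        uint64_t v;
        __asm__ __volatile__("rd %%asr17, %0" : "=r" (v));
        return v;
    }

    /* Difference two PIC reads around a region of interest (read-to-read). */
    void pic_delta(void (*region)(void), uint32_t *picl, uint32_t *picu)
    {
        uint64_t before = read_pic();
        region();
        uint64_t after = read_pic();

        *picl = (uint32_t)after - (uint32_t)before;                  /* PICL delta */
        *picu = (uint32_t)(after >> 32) - (uint32_t)(before >> 32);  /* PICU delta */
    }

The 32-bit modular subtraction tolerates a single wrap between the two reads; for longer regions the counters should be read periodically, as noted above, before either PIC overflows.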
263. ata overwrite the original data No error logged No error logged Original data original ecc Direct ASI L2 data read request with tag UE in the 1st 32 byte or 2nd 32 byte L2 cache data ASI L2 displacement flush read request with CE in the 1st 32 byte or 2nd 32 byte L2 cache data No error logged Original data original ecc L2 cache data flushed out and corrected data written into L3 cache ASI L2 displacement flush read request with UE in the 1st 32 byte or 2nd 32 byte L2 cache data L2 cache data flushed out and written into L3 cache WDU Direct ASI L2 data write request with CE in the 1st 32 byte or 2nd 32 byte L2 cache data Direct ASI L2 data write request with tag UE in the 1st 32 byte or 2nd 32 byte L2 cache data ASI write data overwrite original data No error logged ASI write data overwrite original data No error logged TABLE 7 19 L2 cache data Writeback and Copyback Errors Error Handling Event Error logged in AFSR Data sent to L3 cache Comment L2 Writeback encountering CE in the 1st 32 WDC Corrected data corrected reste byte or 2nd 32 byte L2 cache data ecc Pung rap L2 Writeback encountering UE in the 1st 32 Get St byte or 2nd 32 byte L2 cache data WDU Original data original ecc Disrupting trap Copyout hits in the L2 writeback buffer because the line is being victimized where a CE WDC SE pes corrected Disrupting trap ha
xor data[23:16]    xor data[31:24]

TABLE 7-3 D-cache Parity Generation for Load Miss Fill and Store Update (2 of 2)

Parity Bit             D-cache Load Miss Fill    D-cache Store Update
DC_data_parity[12]     xor data[103:96]          xor data[39:32]
DC_data_parity[13]     xor data[111:104]         xor data[47:40]
DC_data_parity[14]     xor data[119:112]         xor data[55:48]
DC_data_parity[15]     xor data[127:120]         xor data[63:56]
DC_data_parity[16]     xor data[135:128]         xor data[7:0]
DC_data_parity[17]     xor data[143:136]         xor data[15:8]
DC_data_parity[18]     xor data[151:144]         xor data[23:16]
DC_data_parity[19]     xor data[159:152]         xor data[31:24]
DC_data_parity[20]     xor data[167:160]         xor data[39:32]
DC_data_parity[21]     xor data[175:168]         xor data[47:40]
DC_data_parity[22]     xor data[183:176]         xor data[55:48]
DC_data_parity[23]     xor data[191:184]         xor data[63:56]
DC_data_parity[24]     xor data[199:192]         xor data[7:0]
DC_data_parity[25]     xor data[207:200]         xor data[15:8]
DC_data_parity[26]     xor data[215:208]         xor data[23:16]
DC_data_parity[27]     xor data[223:216]         xor data[31:24]
DC_data_parity[28]     xor data[231:224]         xor data[39:32]
DC_data_parity[29]     xor data[239:232]         xor data[47:40]
DC_data_parity[30]     xor data[247:240]         xor data[55:48]
DC_data_parity[31]     xor data[255:248]         xor data[63:56]

Note: The D-cache data parity check granularity is 8 bits. This is the
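For illustration, the generation pattern in TABLE 7-3 can be mirrored in a minimal C sketch. This is an illustrative reading of the table, not the hardware algorithm: the helper names, the byte-array layout, the plain XOR reduction used as the parity value, and the extension of the same pattern to the parity bits that fall outside this excerpt are all assumptions.

    #include <stdint.h>

    /* XOR-reduce one byte to a single bit. */
    static unsigned byte_parity(uint8_t b)
    {
        b ^= b >> 4;
        b ^= b >> 2;
        b ^= b >> 1;
        return b & 1u;
    }

    /* Load-miss fill: parity bit n covers fill data bits [8n+7:8n] of the
     * 32-byte fill (fill[n] is assumed to hold bits [8n+7:8n]). */
    static uint32_t dc_fill_parity(const uint8_t fill[32])
    {
        uint32_t p = 0;
        for (int n = 0; n < 32; n++)
            p |= (uint32_t)byte_parity(fill[n]) << n;
        return p;
    }

    /* Store update: the 64-bit store data repeats across the line, so parity
     * bit n covers store-data byte (n mod 8), matching the store-update column. */
    static uint32_t dc_store_parity(uint64_t store_data)
    {
        uint32_t p = 0;
        for (int n = 0; n < 32; n++)
            p |= (uint32_t)byte_parity((uint8_t)(store_data >> (8 * (n % 8)))) << n;
        return p;
    }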
265. ate on the cacheable bits but merely passes them through to the cache subsystem In the UltraSPARC IV processor the CV bit is maintained in I TLB See the UltraSPARC III Cu Processor User s Manual 2 See the UltraSPARC III Cu Processor User s Manual 1 0 Q 2 w mo Instruction and Data Memory Management Unit See the UltraSPARC III Cu Processor User s Manual See the UltraSPARC III Cu Processor User s Manual 311 amp Sun microsystems 13 1 5 Hardware Support for TSB Access The MMU hardware provides services to allow the TLB miss handler to efficiently reload a missing TLB entry for either an 8 KB or 64 KB page These services include e Formation the TSB Pointers based on the missing virtual address and address space e Formation of the TTE Tag Target used for the TSB tag comparison e Efficient atomic write of a TLB entry with a single store ASI operation e Alternate globals on MMU signaled traps Please refer to the UltraSPARC II Cu Processor Leer e Manual for additional details 13 1 5 1 Typical I TLB Miss Refill Sequence A typical TLB miss and TLB refill sequence 1 An I TLB miss causes a fast_instruction_access_MMU_miss exception 2 The appropriate TLB miss handler loads the TSB Pointers and the TTE Tag Target with loads from the MMU registers 3 Using this information the TLB miss handler checks to see if the desired TTE exists in the TSB If so the TTE data are loade
266. ation If a single bit error is detected the WDC bit will be set to log this error condition A disrupting ECC_error trap will be generated in this case provided that the CEEN bit is set in the Error Enable Register Hardware will proceed to correct the error Corrected data will be written out to the L3 cache WDU For an L2 cache writeback operation an L2 cache read will be performed and the data read back from the L2 cache SRAM will be checked for the correctness of its ECC The data read back from L2 cache will be put in the L2 cache writeback buffer as the staging area for the writeback operation When a multi bit error is detected it will be recognized as an uncorrectable error and the WDU bit will be set to log this error condition A disrupting ECC_error trap will be generated in this case provided that the NCEEN bit is set in the Error Enable Register The uncorrectable L2 cache writeback data will be written into L3 cache Therefore the trap handler should perform a displacement flush to flush out the line in the L3 cache Multiple occurrences of this error will cause ASFR1 ME to be set 203 amp Sun microsystems Lede Error Handling CPC For a copyout operation to serve a snoop request from another processor an L2 cache read will be performed and the data read back from the L2 cache SRAM will be checked for the correctness of its ECC If a single bit error is detected the CPC bit will be set to log this error condi
267. ation Subnormal results results that would otherwise cause an unfinished_FPop are also flushed to 0 in non standard mode If the higher priority invalid operation or a divide by zero condition occurs the corresponding bits are asserted in the FSR CEXC register field e If the trap is enabled FSR TEM an fp_exception_IEEE_754 trap occurs e If the trap is disabled the corresponding bits are also flagged in the FSR AEXC register field If neither the invalid nor divide by zero condition occurs an inexact condition plus any other detected floating point exception conditions are flagged in the FSR CEXC register field e If an IEEE trap is enabled FSR TEM an fp_exception_IEEE_754 trap occurs e If the trap is disabled the corresponding condition s are also flagged in the FSR AEXC register field IEEE 754 1985 Standard 138 amp Sun microsystems 6 8 2 Subnormal Number Generation Handling of the FMULs FMULd FDIVs FDIVd and FdTOs instructions requires further explanation using the following terms e Sign sign of result e Re round nearest effective truncate or round truncate RP round to infinity RM round to infinity e RND FSR RD e Er biased exponent result e Ep the biased exponent result before rounding E rs biased exponent of rs operand e P_rs precision of the rs operand The value of the constants depends on precision type as shown in TABLE 6 17 TABLE 6 17 Subnormal Handl
268. ause of the trap was an I TLB parity error When an I TLB parity error trap is detected software must invalidate the corresponding T512 TTE by either executing either a demap context or a demap all and write a new entry with the correct parity D TLB Parity Errors The D TLB is composed of three structures the two 2 way associative TLB array T512_0 and T512_1 and the fully associative TLB array T16 The T512 array has parity protection for both tag and data while T16 array is not parity protected D TLB Parity Error Detection Please refer to D TLB Parity Protection on page 322 for details about D TLB parity error detection D TLB parity Error Recovery Actions When all parity error reporting conditions are met D MMU enabled D TLB parity enabled and no translation hit in T16 a parity error detected on a translation will generate an data_access_exception and the SFSR will be set to Fault Type 2016 Thus the data_access_exception trap handler must check the SFSR to determine if the cause of the trap was an D TLB parity error When a D TLB parity error trap is detected software must invalidate the corresponding T512 TTE by either a demap page demap context or demap all and write a new entry with correct parity Instruction Prefetch Buffer IPB Data Errors Similar to those of the I cache IPB data parity errors are only caused by an I fetch from IPB IPB data is not checked as part of a snoop invalidation operation An I
269. aused by the read to own required to obtain permission to complete a store like operation for which data is held in the store queue then the behavior is different An uncorrectable system bus data ECC error will set AFSR DUE and will capture the ECC syndrome in E_SYND AFSR PRIV will not be captured Uncorrectable system bus data ECC errors on read accesses to a cacheable space will install the bad ECC from the system bus directly in the L2 cache This prevents using the bad data or having the bad data written back to memory with good ECC bits Uncorrectable ECC errors from the system bus on cache fills will be reported for any ECC error in the 64 byte line not just the referenced word The error information is logged in the AFSR An instruction_access_error or data_access_error deferred trap if AFSR UE or an ECC_error disrupting trap if AFSR DUE is generated provided that the NCEEN bit is set in the Error Enable Register If NCEEN were clear the processor would operate incorrectly on the corrupt data An uncorrectable error as the result of a system bus read for an instruction fetch causes an instruction_access_error deferred trap An uncorrectable error as the result of a load like block load or atomic operation causes a data_access_error deferred trap An uncorrectable error as the result of a prefetch queue or store queue system bus read causes a disrupting ECC_error trap See Multiple Errors and Nested Traps on page 199 for the behavior in
270. avoid cache misses Performance Instrumentation and Optimization 93 amp Sun microsystems I2 5 2 4 5 2 4 1 SE The UltraSPARC processor provides instrumentation to profile a program and detect if instruction accesses generate a cache miss or a cache hit By checkpointing the counters before and after a large section of code combined with profiling the section of code one can determine if the frequently executed functions hit or miss the I cache Executing Instructions With Minimum Latency Instructions fetched from the L2 cache or the L3 cache require fewer number of cycles than fetching an instruction from main memory The hardware can prefetch the next eight instructions if the initial fetch was from the lower 32 bytes of a 64 byte aligned memory boundary Translation Lookaside Buffer TLB Misses The TLB contains the virtual page number and the associated physical page number of the most recently accessed pages A TLB miss is handled by software via the translation storage buffer TSB and takes a large number of cycles To minimize the frequency of TLB misses the UltraSPARC processor provides a large number of entries in the TLB Impact of the Annulled Slot Grouping rules in the UltraSPARC III Cu Processor User s Manual describe how the UltraSPARC processor handles instructions following an annulling branch Avoid scheduling WR PR ASR SAVE SAVED RESTORE RESTORED RETURN RETRY and DONE in the delay sl
271. ays 0 47 DP Data parity bit Instruction and Data Memory Management Unit 336 un microsystems TABLE 13 31 D TLB Diagnostic Register of T512_0 and T512_1 2 of 2 46 TP Tag parity bit 45 43 Reserved Reserved for future implementation 42 13 PA Physical page number 12 7 See the UltraSPARC III Cu Processor User s Manual 6 L Lock bit 5 4 Cacheable bit See the UltraSPARC III Cu Processor User s Manual 3 E 2 P See the UltraSPARC III Cu Processor User s Manual 1 W 0 See the UltraSPARC III Cu Processor User s Manual Q See the UltraSPARC III Cu Processor User s Manual Note See TABLE 13 5 for a detailed description of the fields An ASI store to the D TLB Diagnostic Register initiates an internal atomic write to the specified TLB entry The Tag portion is obtained from the Tag Access Register AST_DMMU_TAG_ACCESS the Data portion is obtained from the store data An ASI load from the TLB Diagnostic Register ASI_DTLB_DIAG_REG initiates an internal read of the Tag and Data The Tag portion is discarded the Data portion is returned Note If any memory access instruction that misses the D TLB is followed by a diagnostic read access LDXA from ASI_DTLB_DATA_ACCESS_REG i e ASI 0x5d from fully associative TLBs and the target TTE has page size set to 64 KB 512 KB or 4 MB the data returned from the
272. be taken to mean either primary or secondary AFSR_EXT Secondary AFSR The secondary AFSR and secondary AFSR_EXT are intended to capture the first event that the processor sees among a closely connected series of errors The secondary AFSR and secondary AFSR_EXT captures the first event that sets one of bits 62 33 of the primary AFSR and bits 11 0 of the primary AFSR_EXT In the case there are multiple first errors arriving at exactly the same cycle multiple error bits will be captured at secondary AFSR AFSR_EXT The secondary AFSR and AFSR_EXT are unfrozen enabled to capture a new event when bits 62 54 and 51 33 of the primary AFSR and bits 11 0 of the primary AFSR_EXT are 0 Note that AFSR1 PRIV and AFSR1I ME do not have to be 0 in order to unlock the secondary AFSR The secondary AFSR and AFSR_EXT never accumulates nor does any overwrite policy apply To clear the secondary AFSR bits software should clear bits 62 33 of the primary AFSR and bits 11 0 of the primary AFSR_EXT 180 amp Sun microsystems 7 3 3 3 Error Handling The secondary AFSR and AFSR_EXT enable diagnosis software to determine the source of an error If the processor reads a uncorrectable data ECC error from the system bus into the L2 cache and then a writeback event copies the same error out to the system bus again before the diagnosis software executes the primary AFSR AFSR_EXT cannot show whether the original event came from the sys
273. because the prefetch request matched an outstanding request in the prefetch queue or the request hit the P cache SW_pf_str_trapped PICL SW_pf_L2_installed PICU SW_pf_PC_installed PICL Number of strong software prefetch instructions trapping due to TLB miss Number of software prefetch instructions that installed lines in the L2 cache Number of software prefetch instructions that installed lines in the P cache Note that both SW_pf_PC_installed and SW_pf_L2_installed can be updated by some prefetch instructions depending on the prefetch function Performance Instrumentation and Optimization 107 un microsystems TABLE 5 11 Cache Access Counters 3 of 5 L2 Cache Note The L2 cache access counters do not include retried L2 cache requests Private L2 counters Number of L2 cache references from this core by cacheable I cache D cache P cache and W cache excluding block stores that miss L2 cache requests A 64 Byte L2_ref PICL request is counted as 1 reference Note that the load part and the store part of an atomic is counted as a single reference Number of L2 cache misses from this core by cacheable I cache D cache P cache and W cache excluding block stores requests This is equivalent to the number of L3 cache references requested by this core L2_miss PICU Note that the load part and the store part of an atomic is counted as a single request Also the count does no
274. bed in TABLE 13 16 and TABLE 13 17 TABLE 13 16 I MMU and D MMU Primary Context Register Bit Field Description 63 61 N_pgsz0 Nucleus context s page size at the first large D TLB T512_0 60 58 N_pgszl Nucleus context s page size at the second large D TLB T512_1 57 55 N_pgsz_lI Nucleus context s page size at the first large I TLB iT512 54 23 Reserved Reserved for future implementation 24 22 P_pgsz0 Primary context s page size at the first large D TLB T512_0 21 19 P_pgszl Primary context s page size at the second large D TLB T512_1 18 16 P_pgsz_I Primary context s page size at the first large I TLB iT512 15 13 Reserved Reserved for future implementation 12 0 PContext Context identifier for the primary address space 320 Instruction and Data Memory Management Unit amp Sun microsystems TABLE 13 17 D MMU Secondary Context Register Bit Field Description 63 22 Reserved Reserved for future implementation 21 19 S_pgszl Secondary context s page size at the second large D TLB T512_1 18 16 S_pgsz_0 Secondary context s page size at the first large D TLB T512_0 15 13 Reserved Reserved for future implementation 12 0 SContext Context identifier for the secondary address space Page size bit encoding 000 8 KB 001 64 KB 010 512 KB 011 4 MB 100 32 MB 101 256 MB T512_0 undefined page size bit encoding 100 101 T512_1 undefined page size bit encoding
ber zero to yield a value of zero, the fraction (or mantissa) must be exactly zero. Therefore, the number zero is a special case with exponent and fraction fields of zero. Note that +0 and -0 are considered to be distinct values even though they both compare as equal.

Subnormal

If the exponent field is all 0s and the fraction field is non-zero, the value is a subnormal (denormalized) number. These numbers do not have an assumed leading 1 before the binary point. For single precision, these numbers are represented as (-1)^s x 0.f x 2^-126. In double precision, the representation is (-1)^s x 0.f x 2^-1022. In both cases, s is the sign bit and f is the fraction. Exponent and fraction fields of all 0s are the special representation of the number zero. From this point of view, the number zero can be considered a subnormal.

Infinity

The values +infinity and -infinity are represented with an exponent field of all 1s and a fraction field of all 0s. The sign bit distinguishes between positive and negative infinities. The infinity representation is important because it allows operations to continue past overflow. Operations dealing with infinities are well defined by the IEEE 754-1985 Standard.

6.2.4 Not a Number (NaN)

The value NaN (Not a Number) is used to represent values that do not represent real numbers. The NaN exponent field is all 1s and the fraction field is non-zero. There are two categories
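To make the single-precision encoding above concrete, here is a minimal C sketch that decodes a bit pattern by hand; decode_single is a hypothetical helper written only for this example, and the double-precision case differs only in its field widths and in using 2^-1022 for subnormals, as stated above.

    #include <stdint.h>
    #include <string.h>
    #include <math.h>

    /* Decode an IEEE 754 single-precision value from its fields.  An exponent
     * field of all 0s with a non-zero fraction is a subnormal:
     * (-1)^s x 0.f x 2^-126, with no assumed leading 1. */
    static double decode_single(float x)
    {
        uint32_t bits;
        memcpy(&bits, &x, sizeof bits);

        unsigned s = bits >> 31;              /* sign bit              */
        unsigned e = (bits >> 23) & 0xFFu;    /* 8-bit exponent field  */
        uint32_t f = bits & 0x7FFFFFu;        /* 23-bit fraction field */
        double sign = s ? -1.0 : 1.0;

        if (e == 0)                /* +/-0 when f == 0, otherwise a subnormal */
            return sign * ldexp((double)f, -126 - 23);
        if (e == 0xFFu)            /* infinity when f == 0, NaN otherwise     */
            return (f == 0) ? sign * INFINITY : NAN;
        return sign * ldexp(1.0 + (double)f / 8388608.0, (int)e - 127);
    }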
bits are reset to 3'b000 and updated based on the following scheme:
- bit[2] = 1 if hit in way1 or way0
- bit[1] = 1 if hit in way2, or it remains 1 when not hit in way3
- bit[0] = 1 if hit in way0, or it remains 1 when not hit in way1
- LRU bits are not ECC protected.

L3 cache SRAM Mapping

The following hashing algorithm is used to determine which L3 cache lines are mapped to which SRAM blocks:

L3_SRAM_addr[24:5] = {PA[22:5], way[1:0]}

where PA and way are the physical address and the way of the L3 cache line to be mapped, and L3_SRAM_addr is the address of the L3 cache SRAM block to which the L3 cache line is mapped.

3.13 Summary of ASI Accesses in L2/L3 Off Mode

TABLE 3-71 summarizes the different ASI accesses to L2 and L3 tag/data SRAMs in the L2/L3 off mode.

TABLE 3-71 ASI access to shared SRAMs in L2/L3 off mode

SRAM          ASI Write          displacement flush   ASI Read         Tag Update
L2data SRAM   the same as L2 on  NOP                  same as L2 on    N/A
L3tag SRAM    the same as L3 on  NOP                  same as L3 on    No
L3data SRAM   the same as L3 on  NOP                  same as L3 on    N/A

3.14 ASI SRAM Fast Init

To speed up cache and on-chip SRAM initialization, the UltraSPARC IV processor leverages the SRAM manufacturing hooks to initialize (zero out) these SRAM contents. ASI 40₁₆ (ASI_SRAM_FAST_INIT) will initialize all the SRAM structures that are associated with a particular logic
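The L3 cache SRAM mapping hash given above, L3_SRAM_addr[24:5] = {PA[22:5], way[1:0]}, can be written out directly. The sketch below is only a reading of that formula, assuming the byte-offset bits [4:0] of the SRAM block address are left at zero; the function name and types are illustrative.

    #include <stdint.h>

    /* L3_SRAM_addr[24:5] = { PA[22:5], way[1:0] }; bits [4:0] left at zero. */
    static uint64_t l3_sram_block_addr(uint64_t pa, unsigned way)
    {
        uint64_t pa_bits  = (pa >> 5) & 0x3FFFFu;  /* PA[22:5]  -> 18 bits */
        uint64_t way_bits = way & 0x3u;            /* way[1:0]  ->  2 bits */
        return ((pa_bits << 2) | way_bits) << 5;   /* place at addr[24:5]  */
    }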
277. ble Register See Uncorrectable L2 cache Data ECC Errors on page 162 Software Correctable L2 cache ECC Error Recovery Actions The fast_ECC_error trap handler should carry out the following sequence of actions to correct an L2 cache tag or data ECC error Read the address of the correctable error from the AFAR register 2 Invalidate the entire D cache using writes to ASI_DCACHE_TAG to zero out the valid bits in the tags 3 Evict the L2 cache line that contained the error This requires four LDXA AST_L2CACHE_TAG operations using the L2 cache disp_flush addressing mode to evict all four ways of the L2 cache A single bit data error will be corrected when the processor reads the data from the L2 cache and writes it to L3 cache to perform the writeback This operation may set the AFSR WDC or AFSR WDU bits If the offending line was in O Os or M MOESI state and another processor happened to read the line while the trap handler was executing the AFSR CPC or AFSR CPU bit could be set A single bit tag error will be corrected when the processor reads the line tag to update it to I state This operation may set AFSR THCE 4 Evict the L3 cache line that contained the error This requires four LDXA ASI_L3CACHE_TAG operations using the L3 cache disp_flush addressing mode to evict all four ways of the L3 cache to memory When the line is displacement flushed from L2 cache if the line is not in I state then it will be writ
278. ble accuracy especially when used to count hundreds or thousands of events or stall cycles or when comparing the PIC counts that have recorded a similar number of events or stall cycles Accuracy is most challenging when trying to associate an event to an instruction and when comparing PIC counts with one rarely occurring count When using the overflow trap it is sometimes difficult to pinpoint the instruction that is responsible for the overflow because of the way the pipeline is designed A delay of several instructions is possible before the overflow is able to stop the current instruction flow and fetch the trap vector The skid for the load miss detection case is small The skid value cannot be measured and its length depends on what event or stall cycle is being measured and what other instructions are in the pipeline 5 8 Pipeline Counters 5 8 1 Instruction Execution and Clock Counts The instruction execution count monitors are described in TABLE 5 6 for clock and instruction execution counts TABLE 5 6 Instruction Execution Clock Cycles and Counts Counter Description Cvele cni PICL 00 0000 and PICU 00 0000 Counts clock cycles This counter increments the same as the ege SPARC V9 TICK register except that cycle counting is controlled by the PCR UT and PCR ST fields ce ane PICL 00 0001 and PICU 00 0001 Counts the number of instructions completed retired This count does not include annulled mispredicted tra
279. byte to W THCE No No corrects the N A the request is cache for W cache exclusive tag retried request and writes 64 byte data to L2 cache L3 cache to L2 cache fill request with L2 cache tag UE forwards the critical 32 byte Disrupting trap and then 2nd 32 byte to W TUE No Yes Original tag N A the request is cache for W cache exclusive dropped request and writes 64 byte data to L2 cache L2 Pipe Disrupting trap SEH EE THCE No No corrects the N A the request is 8 tag retried SIU fill ith L2 cach E tag UE reguest wat deeg TUE No Yes Original tag N A the request is S dropped SIU forward and fill request with L2 cache tag CE L2 Pipe Disrupting trap forwards the critical 64 byte THCE No No corrects the N A the request is to I cache and writes 64 byte tag retried data to L2 cache SIU forward and fill request with L2 cache tag UE Disrupting trap forwards the critical 64 byte TUE No Yes Original tag N A the request is to I cache and writes 64 byte dropped data to L2 cache SIU forward only request with d RE THCE N N E N A GE o o corrects the i forwards the critical 32 byte Ghe request is tag retried to I cache SIU forward only request with Deea etag UE TUE N Y Original t N A SE o es riginal ta i forwards the critical 32 byte S sk Sg 2 to I cache PP SIU forward and fill request with L2 cache tag CE L2 Pipe Disrupting trap forwards the critical 32 byte THCE No No SE N A Ge to D cache for 32 byte load tag re
280. c support are defined in these sections Chapter Topics Floating Point Operations on page 119 e Floating Point Numbers on page 121 e IEEE Operations on page 122 e Traps and Exceptions on page 131 IEEE Traps on page 133 e Underflow Operation on page 134 e IEEE NaN Operations on page 136 e Subnormal Operations on page 138 6 1 Floating Point Operations Floating point operations FPops include the algebraic operations and usually do not include the specially treated floating point load store FBfcc or the VIS instructions The FAI FMOV instructions are also treated separately from the algebraic operations 6 1 1 Rounding Mode BS FNI EG and The rounding mode of the floating point unit is determined either by the FSR RD bit when in standard rounding mode or by the GSR IRND bit when in interval arithmetic rounding mode The rounding direction affects the result after any under or overflow condition is detected Underflow is detected before rounding TABLE 6 1 Rounding Direction FSR RD Round Toward 0 Nearest even if tie 1 0 2 co IEEE 754 1985 Standard 119 amp Sun microsystems 6 1 2 6 1 3 6 1 4 6 1 5 6 1 6 Non standard Floating Point Operating Mode The processor supports a non standard floating point mode to facilitate the handling of subnormals by the hardware thus avoiding a software trap to supervisor software The floating point operating mode is controlled
281. cache and uses a pseudo LRU replacement policy The L3 cache is a dirty victim cache When a line comes into the processor it is loaded in the L2 cache and the appropriate Ll cache s When a line both clean and dirty is evicted from the L2 cache it is written back to the L3 cache The L2 cache and L3 caches are mutually exclusive i e a given cache line can exist either in the L2 cache or the L3 cache but not in both The tag and the data arrays of both the L2 cache and the L3 cache are ECC protected Instruction fetches bypass the L2 cache and L3 cache in the following cases e The I MMU is disabled DCUCR IM 0 and the CP bit in the Data Cache Unit Control Register is not set DCUCR CP 0 e The processor is in RED_state e The access is mapped by the I MMU as non physical cacheable Data accesses bypass the L2 cache and L3 cache if the D MMU is disabled DCUCR DM 0 or if the access is mapped by the D MMU as non physical cacheable unless ASIT_PHYS_USE_EC is used The system must provide a non cacheable scratch memory region for booting code use until the MMUs are enabled Block loads and block stores which load or store a 64 byte block of data from memory to the Floating Point Register file do not allocate into the L2 cache or L3 cache Prefetch Read Once instructions prefetch fen 1 21 which load a 64 byte block of data into the P cache do not allocate into the L2 cache and L3 cache Virtually Indexe
282. cache would give a data correctness problem Displacing the wrong line by mistake from the L2 cache and L3 cache will only lead to the same error being reported twice The second time the error is reported the AFAR is likely to be correct Corrupt data is never stored in the D cache without a trap being generated to allow it to be cleared out Note While the above code description appears only to be appropriate for correctable L3 cache data and tag errors it is actually effective for uncorrectable L3 cache data errors as well In the event that it is handling an uncorrectable error the victimize at step 3 Evict the L2 cache line that contained the error will write it out to L3 cache If the L2 cache still returns an uncorrectable data ECC error when the processor reads it to perform a L2 cache writeback the WDU bit will be set in the AFSR during this trap handler which would generate a disrupting trap later if it was not cleared somewhere in this handler In this case the processor will write deliberately bad signalling ECC back to L3 cache In the event that it is handling an uncorrectable error the victimize at step 4 Evict the L3 cache line that contained the error will either invalidate the line in error or will if it is in M O or Os state write it out to memory If the L3 cache still returns an uncorrectable data ECC error when the processor reads it to perform a L3 cache writeback the L3_WDU bit will be set in the A
283. ceeccesseeereeeeeeeeeseeeesseeeeseeeeees 299 12 2 CMT Related Interrupt Behavior 2 0 0 ccc cecccccesseeeeneeeseneeesseeeeeseeeneneecseneeeseeeeesseeeneeeeneas 300 12 2 1 Interrupt to Disabled Logical Processor c cccsscecesseeeenseeeeeeeeeeeeeesseeeesneeenans 300 12 2 2 Interrupt to Parked Logical Processor ccccccessseeessceeeeeeeeeeceeseeeesneeesseeenens 300 13 Instruction and Data Memory Management Unit ssssccsssssssecssssscessssscceesssssesesssssseees 301 13 1 Instruction Memory Management Unit c cc cccccceescceeseceeseeeeseeeeeeeeeeaeeesseeeesseeesseeenens 302 13 1 1 Virtual Address Translation aeeiio nehrin ea e EE 302 13 1 2 Larger amp Programmable I TLB ssssssssessssesssssssessserrsseessesssressseesseessresseersseesseees 302 13 1 3 I TLB Automatic Replacement 306 13 1 4 Translation Table Entry TTE essessssssessesesssessssesserrsseesseessrresseesseessresseresseesseees 310 13 1 5 Hardware Support for TSB Access cccccecssceesssceseseeeseeeeseeecenseeeesseeenseeesans 312 EAE aR Fauilts and Traps EE E E een akin d 313 13 1 7 Reset Disable and RED state Behavior esssseessesersrsssssssesessrrrrssssssssserrreees 313 13 1 8 Internal Registers and ASI Operations ssssssesssesssseessessseesseesseessressrrrsseesseees 313 13 1 9 I TLB Tag Access Extension Register cccccccsscecesseeeeneeeseseeeeseeeesseeenseeenens 313 13 1 10 Write Access Limitation of I MMU Regis
284. ceeesscececseseeseseeseeeeess 136 6 7 3 Operations With NaN Operands 00 0 eeeceeeseceseecenceeeeeeeeseeeeseeesseeesseeensaes 136 6 7 4 NaN Results From Operands Without NaNs 137 6 8 Subnormal Operations aaae a a R aE Eirian E aE TEE R i 138 6 8 1 Response to Subnormal Operands sesssesesseesseesseessrrsserssseessrsssersseeesseessersse 138 6 8 2 Subnormal Number Generation ssssseesessessreersrssrsrrsrerrsrerrstesrnreersrrsrsersesrsest 139 Te Hrror Handling 55 cassiccvssnsecessssszansessscosdesssescsestvancedstosssesestvesscessesscesatedsosnoabecsosenessersoconscavsssessess s 143 Tel Error Classes Soinuenean seas eh else orcens RNR AEE Nee 143 E2 Memory ETOL S ernen Berck e r a aA n a E ENR 144 Tal STeeache EOTS oa To a A A 144 Tao Decache Errors nasienia E A OE E RA 149 T23 ETEB Parity Errors eine E KE ee ee 155 TZA D TEB Parity Errors c e e dee OE Re ere 156 7 2 5 Instruction Prefetch Buffer IPB Data Errors sessessesesesrsserssesssrrssrersseesseees 156 7 2 6 P cache Data Parity Errors cccccccessssccssssseesesessscessneessenevessnecseeueceenseeseneesses 157 Tad E EE 157 UltraSPARC IV Processor User s Manual October 2005 7 3 7 4 7 5 7 6 7 7 7 8 7 9 7 10 7 11 7 12 7 13 72 8 EEN Errors A A EO E E T T 163 7 2 9 Errors on the System Bus 170 7 2 10 Memory Errors and Prefetch 0 ccccccssceeeseceeseecesneeceseeeseneeesseeeesueeeesueeensaees 174 7
285. cess Extension Register ASI 50165 VA 63 0 60165 Name ASI_IMMU_TAG_ACCESS_EXT Access RW Tag Access Extension Register keeps the missed page sizes of T512 The format of the I TLB Tag Access Extension Register is described in TABLE 13 7 TABLE 13 7 I TLB Tag Access Extension Register Bit Field Description 63 25 Reserved Reserved for future implementation 24 22 pgsz Page size of I TLB miss context primary nucleus in the T512 21 0 Reserved Reserved for future implementation Note Bit 24 and 23 are hardwired to zero since only one bit is required to decode 8 KB and 64 KB page sizes With the saved page sizes the hardware pre computes in the background the index to the T512 for a TTE fill When the TTE data arrives only one write enable to the T512 and the T16 will be activated Instruction and Data Memory Management Unit 313 un microsystems 13 1 9 1 I TLB Data In Data Access and Tag Read Registers Data In Register ASI 5416 VA 63 0 0016 Name ASI_ITLB DATA IN BG Access W Writes to the TLB Data In register requires the virtual address to be set to 0 Note The I TLB Data In Register is used when the fast_instruction_access_MMU_miss trap is taken to fill I TLB based on replacement algorithm Other than the fast_instruction_access_MMU_umiss trap an ASI store to I TLB Data in register may replace an unexpected entry in the I T
286. cessors are still issuing cacheable accesses Note Exiting RED_state by setting PSTATE RED to 0 in the delay slot of a JMPL is not recommended A noncacheable instruction prefetch can be made to the JMPL target which may be in a cacheable memory area This condition could result in a bus error on some systems and cause an instruction_access_error trap The trap can be masked by setting the NCEEN bit in the ESTATE_ERR_EN register to 0 but this approach will mask all non correctable error checking Exiting RED_state with DONE or RETRY avoids the problem 4 2 4 2 1 Resets The reset priorities ranging from highest to lowest are e power on resets POR hard or soft e externally initiated reset XIR e watchdog reset WDR e software initiated reset SIR Hard Power on Reset Hard POR Power on Reset Power OK Reset A hard power on reset hard POR occurs when the Power OK Reset POK pin is activated and stays asserted until the processor is within its specified operating range When the POK pin is active all other resets and traps are ignored Power on reset has a trap type of 1 at physical address offset 2016 Any pending external transactions are canceled After a hard power on reset the software must initialize processor certain values please see TABLE 4 1 In particular the valid and microtag bits in the I cache the valid and microtag bits in the D cache and all L2 L3 cache tags and data must be cle
287. che atomic fill request with BERR in the critical 32 byte data from system bus Cacheable D cache atomic fill request with BERR in the non critical 2nd 32 byte data from system bus flag fast ecc error L2 cache data garbage data installed garbage data installed L2 cache state L1 cache data garbage critical 32 byte data and UE information is sent and stored in W cache garbage non critical 32 byte data and UE information is sent and stored in W cache Pipeline Action garbage data dropped Good data taken Comment Deferred trap BERR will be taken and Precise trap fast ecc error will be dropped the 2 least significant data bits 1 0 in both lower and upper 16 byte are flipped Deferred trap the 2 least significant data bits 1 0 in both lower and upper 16 byte are flipped Cacheable Prefetch 0 1 20 21 17 fill request with BERR in the critical 32 byte data from system bus due to non RTSR transaction OR Cacheable Prefetch 0 1 20 21 17 fill request with BERR in the 2nd 32 byte data from system bus due to non RTSR transaction Not installed garbage data not installed in P cache No action Disrupting trap the 2 least significant data bits 1 0 in both lower and upper 16 byte are flipped Cacheable P cache 0 1 20 21 17 fill request with BERR in the critical 32 byte data from system bus due to RTSR transaction OR Cacheable P cache 0 1 2
288. che should confirm that all the ways of the D cache at this index have either been invalidated or displaced by data from other addresses that happened to map to the same D cache tag index Again this can be iterated for all the covered D cache snoop tag bits If an instruction is not executed whether this is because it is annulled because a trap prior to the instruction switched the program flow or because a branch diverted control then a D cache data parity error on that instruction cannot cause a dcache_parity_error trap D cache data parity is not checked for store like instructions D cache physical tag parity is checked for all store instructions except for STD and STDA Because the primary D cache has a high ratio of reads to writes and also because the majority of D cache writes do overwrite entire 32 bit words the effect of this reduction in coverage is small D cache data parity is not checked for block stores which overwrite a number of 64 bit words and recompute parity Any original parity error is overwritten 153 amp Sun microsystems 7 2 2 6 EE Error Handling D cache data parity is not checked for atomic operations Atomic operation data comes from the L2 cache not the D cache D cache data parity is not checked for block load quad load operations Data for block load quad load never comes from the D cache For load like instructions D cache data and physical tag parity is always checked
289. ches D stage again are counted Note that these stall cycles are also counted in Re_DC_miss Re_L3_miss PICU Re_PFQ_full PICU Re_DC_missovhd PICU Performance Instrumentation and Optimization Stall cycles due to recirculation of cacheable loads that miss D cache L2 and L3 cache Stall cycles from the point when L3 cache miss is detected to the point when the recirculated flow reaches D stage again are counted Note that these stall cycles are also counted in Re_DC_miss and Re_L2_miss Stall cycles due to recirculation of prefetch instructions because the prefetch queue PFQ was full The count includes stall cycles for strong software prefetch instructions that recirculate when the PFQ is full The count also includes stall cycles for any software prefetch instruction when the PCM bit in the Data cache Unit Control Register DCUCR is enabled that recirculates when the PFQ is full Counts the overhead of stall cycles due to D cache load miss Includes cycles from the point the load reaches D stage about to be recirculated to the point L2 cache hit miss is reported Note The count does not include overhead cycles for cacheable loads that recirculate due to D cache miss for which there is an outstanding prefetch fen 1 request in the prefetch queue LAP hazard 105 un microsystems 5 9 Cache Access Counters Instruction data prefetch write and L2 L3 cache access statistics can be collected
290. cise dcache_parity_error trap See D cache Error Recovery Actions on page 151 D cache Snoop Tag Errors Snoop cycles from the system bus may need to invalidate entries in the D cache To discover if the referenced line is present in the D cache the physical address from the snoop access is compared in parallel with the snoop tags for each of the ways of the D cache In the event that any of the addressed valid snoop tags contains a parity error the processor hardware will automatically clear the valid bits for all the ways of the D cache at that tag index This applies whether the snoop hits in the D cache or not There is just one valid bit for each line used in both the D cache physical tags and snoop tags Clearing this bit will make the next data fetch to this line miss in the D cache and refill the physical and snoop tags and the D cache entry Hardware in the D cache snoop logic ensures that both a conventional invalidation where the snoop tag matches and an error invalidation where there is a snoop tag parity error so the hardware cannot tell if the snoop tag matched or not meet the ordering and timing requirements necessary for the coherence policy Note This operation of automatically clearing the valid bits on an error is not logged in the AFSR or reported as a trap Therefore it may cause undiagnosable out of sync events in lockstep systems In non lockstep systems if a snoop tag array memory bit became stuck because of a fa
291. ck Store Commit BSTC Wille Back WB Block Load LPA LPA LPA and LPA LPA LPA l v SSM_L3_wb_remote remote event mtag_miss mtag_hit L SZ TI LG wp SSM_L3_miss_mtag_remote SSM_L3_miss_remote retry event el remote event SSM_L3_miss_local l transaction event L3_miss l transaction event FIGURE 5 4 SSM Performance Counter Event Tree Performance Instrumentation and Optimization amp Sun microsystems 5 11 3 5 11 3 1 Data Locality Event Matrix TABLE 5 14 shows the data locality event matrix TABLE 5 14 Data Locality Events Local Processor Physical Address LPA Retried Events Processor Action Combined Block MODE State Load E Block Block Store Sunre Write Back Swap Load with miss miss miss RTS issued RTO issued RS issued R_WS issued LPA MTag miss RTO issued miss MTag miss R_WS issued R_RTO issued MTag miss MTag miss MTag miss R_RTS issued R_RTO issued R_RS issued invalid LPA EE MTag miss R_RTO issued miss miss miss R_RTO issued R_RS issued R_WS issued hit hit hit E gt M E gt M miss LPA R_WS MTag miss miss issued R_RTO issued R_WS issued miss R_WB issued Retry is to issue an R_ transaction for an RTS RTO RS transaction that gets unexpected MTag from the SSM system interconnect for example cache state O and MTag state gS A retry takes place in LPA Performance Instrumentation and Optimization
control. A set of rules on the D-TLB replacement, demap, and context switch must be followed to maintain consistent and correct behavior.

D-TLB Parity Protection

Both the T512_0 and T512_1 support parity protection for both the tag and data arrays; however, the T16 does not support parity protection. The D-MMU generates an odd parity for the tag from a 60-bit parity tree and an odd parity for the data from a 37-bit parity tree upon a D-TLB replacement. The parities are calculated as follows:

Tag Parity = XOR(Size[2:0], Global, VA[63:21], Context[12:0])
Data Parity = XOR(NFO, IE, PA[42:13], CP, CV, E, P, W)

Note: The Valid bit is not included in the tag parity calculation.

The parity bits are available during the same cycle that the tag and data are sent to the D-TLB. The tag parity is written to bit 60 of the tag array, while the data parity is written to bit 35 of the data array.

During the D-TLB translation, the set-associative TLBs T512_0 and T512_1 check the tag and data parities previously written by replacement. The tag and data parity errors are reported as a data_access_exception with the fault status recorded in the D-SFSR register (Fault Type 20₁₆). D-SFSR bit 12 is valid when tag or data parity errors occur.

Note: The tag and data parities are checked even for invalid D-TLB entries.

When a trap is taken on a D-TLB parity error, the software needs to invalidate the corresponding entry and write that entry with good parity. The
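For illustration, the two parity equations can be mirrored in a short C sketch. The field lists and tree widths (60 bits for the tag, 37 bits for the data) come from the text above; the function signatures, the packing of the inputs, and the reading of odd parity as the inverse of the plain XOR reduction are assumptions made for the example.

    #include <stdint.h>

    /* XOR-reduce the low 'bits' bits of v to a single bit. */
    static unsigned xor_reduce(uint64_t v, unsigned bits)
    {
        if (bits < 64)
            v &= (1ULL << bits) - 1;
        v ^= v >> 32; v ^= v >> 16; v ^= v >> 8;
        v ^= v >> 4;  v ^= v >> 2;  v ^= v >> 1;
        return (unsigned)(v & 1);
    }

    /* Odd parity over the 60-bit tag tree: Size[2:0], Global, VA[63:21], Context[12:0]. */
    static unsigned dtlb_tag_parity(unsigned size3, unsigned global,
                                    uint64_t va, unsigned context13)
    {
        unsigned x = xor_reduce(size3, 3) ^ (global & 1u)
                   ^ xor_reduce(va >> 21, 43) ^ xor_reduce(context13, 13);
        return x ^ 1u;   /* odd parity: invert the even-parity reduction */
    }

    /* Odd parity over the 37-bit data tree: NFO, IE, PA[42:13], CP, CV, E, P, W. */
    static unsigned dtlb_data_parity(unsigned nfo, unsigned ie, uint64_t pa,
                                     unsigned cp, unsigned cv,
                                     unsigned e, unsigned p, unsigned w)
    {
        unsigned x = (nfo & 1u) ^ (ie & 1u) ^ xor_reduce(pa >> 13, 30)
                   ^ (cp & 1u) ^ (cv & 1u) ^ (e & 1u) ^ (p & 1u) ^ (w & 1u);
        return x ^ 1u;
    }

A matching check at translation time would recompute the same two values and compare them against tag bit 60 and data bit 35, respectively.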
293. correctable ECC error information is asserted ECC check bits 1 0 are inverted in the data written back to the L2 cache The syndrome seen when one of these signalling words is read will be 0x003 188 un microsystems Error Handling TABLE 7 12 ECC Syndromes For uncorrectable on chip L2 cache tag or L3 cache tag ECC error error pin is asserted and the domain should be reset The copyout or writeback might be dropped CT EEE sc 1 C4 M M 50 16 2 C5 M M 46 10 3 M2 M 4 C6 6 5 M2 M4 6 M2 M4 7 116 M3 8 C7 5 9 M M a M2 M2 b 103 M3 c M2 M d 102 M3 e 98 M f M2 M 10 C8 4 11 M M 12 M2 M2 13 94 M 14 M2 M4 15 89 M3 16 86 M3 17 M M2 18 M2 M4 19 77 M la 74 M3 lb M2 M lc 80 M3 ld M2 M le M M If 111 M M M M 189 amp Sun microsystems 7 3 4 2 M_SYND The AFSR M_SYND field contains a 4 bit ECC syndrome for the 3 bit microtags of the system bus TABLE 7 13 shows the 4 bit syndrome corresponding to a single bit error in each of the microtag data or correction bits TABLE 7 13 Microtag Single Bit Error ECC Syndromes Bit Number AFSR M_SYND microtag Data 0 0x7 microtag Data 1 0xB microtag Data 2 0xD microtag ECC 0 0x1 microtag ECC 1 0x2 microtag ECC 2 0x4 microtag ECC 3 0x8 A complete microtag syndrome
294. cs of RED_state include the following e The default access in RED state is non cacheable so there must be non cacheable scratch memory somewhere in the system e The D cache watchpoints and D MMU can be enabled by software in RED_state but any trap will disable them again e The I MMU and consequently the I cache are always disabled in RED_state Disabling overrides the enable bits in the DCU control register e When PSTATE RED is explicitly set by a software write there are no side effects other than that the I MMU is disabled Software must create the appropriate state itself A trap when TL MAXTL immediately brings the processor into RED_state In addition any trap that occurs while TL MAXTL immediately brings the processor into error_state Upon error_state entry the processor automatically recovers through watchdog reset into RED state e A trap to error_state immediately triggers watchdog reset Reset and RED_ state 83 amp Sun microsystems s A SIR instruction generates a software_initiated_reset SIR trap on the corresponding logical processor Trapping to software_initiated_reset causes an S R trap on the corresponding logical processor and brings the logical processor into RED_state The External Reset pin generates an externally_initiated_reset XIR trap which is used for system debug or Sun Fireplane Interconnect transactions The caches continue to snoop and maintain coherence if DVMA or other pro
295. ct L2 cache tag error 3 12 5 2 Procedure For Writing AST_L3CACHE_TAG Park the other logical processor Wait for the parking logical processor to be parked Turn off kernel pre emption Block interrupts on this processor Displacement flush all 4 ways in L2 cache and L3 cache for the index to be error injected Load some data into the L3 cache Locate the data in the L3 cache and associated tag Read the L3 cache tag ECC using ASI L3 cache tag read access Corrupt the tag ECC 10 Store the tag ECC back using ASI L3 cache tag write access Oe OO Geht SON Cen en So a 11 Re enable interrupts 12 Unpark the other logical processor Caches Cache Coherency and Diagnostics 74 amp Sun microsystems 3 12 5 3 Sl 226 The reason to displacement flush all 4 ways in both L2 cache and L3 cache is to guarantee that foreign snoop will have no effect to the index during ASI L3 cache tag write access even if the hazard window exists in the hardware Notes on L3 cache LRU Bits e The L3 cache LRU algorithm is described below For each 4 way set of L3 data blocks a 3 bit structure is used to identify the least recently used way Bit 2 tracks which one of the two 2 way sets way3 2 is one group way1 0 is the other is the least recently used Bit 1 tracks which way of the 2 way set way3 and way2 is the least recently used Bit 0 tracks which way of the other 2 way set way1 and way0 is the least recently used The LRU
296. cted, the array index is VA<20:13>; if a 64 KB page is selected, the array index is VA<23:16>. The Context bits are used after the indexed entry comes out of each array bank (way) to qualify the context hit. There are 3 possible Context numbers active in the processor, but only two are relevant to the I-TLB:

• primary: the PContext field in ASI_PRIMARY_CONTEXT_REG
• nucleus: defaults to Context 0

Determining which Context register to send to the I-MMU is based on the ASI encoding (primary or nucleus) of the instruction memory access. Since both I-TLBs are accessed in parallel, software must guarantee that there are no duplicate stale entries. Most of this responsibility lies in operating system software, with the hardware providing some assistance to support full software control. The rules of I-TLB replacement, demap, and context switching must be followed to maintain consistent and correct behavior.

13.1.2.2 I-TLB Parity Protection

The T512 I-TLB supports parity protection for both tag and data arrays; the T16, however, does not support parity protection. The I-MMU generates odd parity for the tag array from a 58-bit parity tree and odd parity for the data array from a 33-bit parity tree upon I-TLB replacement. Parities are calculated as follows:

Tag Parity  = XOR(Size<0>, Global, VA<63:21>, Context<12:0>)
Data Parity = XOR(PA<42:13>, CP, CV, P
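The tag-parity tree above is just an XOR reduction over the listed fields, which is easy to mirror in software when predicting what the hardware will store. The sketch below is illustrative only: the function names and the way fields are packed into C integers are assumptions, and it covers only the tag parity term shown in full above.

    #include <stdint.h>

    /* XOR-reduce a value down to a single parity bit. */
    static unsigned parity64(uint64_t v)
    {
        v ^= v >> 32; v ^= v >> 16; v ^= v >> 8;
        v ^= v >> 4;  v ^= v >> 2;  v ^= v >> 1;
        return (unsigned)(v & 1);
    }

    /* Sketch of the T512 I-TLB tag-parity computation described above:
     * XOR of Size<0>, Global, VA<63:21> and Context<12:0> (58 inputs).
     * Field extraction here is for illustration, not a statement of the
     * physical bit ordering inside the array.
     */
    static unsigned itlb_tag_parity(unsigned size0, unsigned global,
                                    uint64_t va, unsigned context)
    {
        uint64_t va_63_21 = va >> 21;           /* VA<63:21>     */
        unsigned ctx_12_0 = context & 0x1FFF;   /* Context<12:0> */

        return (size0 & 1) ^ (global & 1)
             ^ parity64(va_63_21)
             ^ parity64(ctx_12_0);
    }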
297. ction.

Note: The NF, ASI, TM, FT, E, CT, PR, and W fields of the D-SFSR are undefined for a privileged_action trap caused by an rd instruction.

13.2.7.6 D-TLB Diagnostic Register

ASI 0x5D, VA<63:0> = 0x40000 + TOFF8 46
Name: ASI_DTLB_DIAG_REG
Access: RW

Virtual Address Format
Note: Bit 18 and bits <2:0> are set to 1 and 0, respectively.

Data Format
The format for the Diagnostic Register of the T16 is described in TABLE 13-30. The format for the Diagnostic Register of the T512s, as described in TABLE 13-31, uses the TTE data format with the addition of parity information.

TABLE 13-30 TTE Data Format

Bit     Field      Description                                R/W
63:7    Reserved   Reserved for future implementation
6       LRU        The LRU bit in the CAM                     RW
5:3     CAM SIZE   The 3-bit page size field from the CAM     R
2:0     RAM SIZE   The 3-bit page size field from the RAM     R

TABLE 13-31 D-TLB Diagnostic Register of T512_0 and T512_1 (1 of 2)

Bit     Field       Description
63      V           See the UltraSPARC III Cu Processor User's Manual
62:61   Size<1:0>   Encode page size bits
60      NFO         See the UltraSPARC III Cu Processor User's Manual
59      IE          See the UltraSPARC III Cu Processor User's Manual
58:50               See the UltraSPARC III Cu Processor User's Manual
49      Reserved    Reserved for future implementation
48      Size<2>     Always 0
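For software that reads the T16 diagnostic word, the three fields in TABLE 13-30 decode with simple shifts and masks. The sketch below is illustrative; the struct and function names are not a hardware-defined layout, and how the raw word is obtained (a privileged diagnostic ASI load) is outside the scope of the example.

    #include <stdint.h>
    #include <stdio.h>

    /* Decode a T16 D-TLB Diagnostic Register data word per TABLE 13-30. */
    struct t16_diag {
        unsigned lru;       /* bit 6    */
        unsigned cam_size;  /* bits 5:3 */
        unsigned ram_size;  /* bits 2:0 */
    };

    static struct t16_diag decode_t16_diag(uint64_t data)
    {
        struct t16_diag d;
        d.lru      = (unsigned)((data >> 6) & 0x1);
        d.cam_size = (unsigned)((data >> 3) & 0x7);
        d.ram_size = (unsigned)( data       & 0x7);
        return d;
    }

    int main(void)
    {
        struct t16_diag d = decode_t16_diag(0x2B);  /* example raw value */
        printf("LRU=%u CAM size=%u RAM size=%u\n", d.lru, d.cam_size, d.ram_size);
        return 0;
    }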
298. ctions that retired.

TABLE 5-11 Cache Access Counters (2 of 5)

DC_wr_miss (PICL): Number of D-cache write misses by cacheable stores, excluding block stores. The count is only updated for store instructions that retired. Note that hitting or missing the D-cache does not significantly impact the performance of a store.

DTLB_miss (PICU): Number of D-TLB miss traps taken.

Write Cache
WC_miss (PICU): Number of W-cache misses by cacheable stores.

Prefetch Cache
PC_miss (PICU): Number of cacheable FP loads that miss the P-cache, irrespective of whether the loads hit or miss the D-cache. The count is only updated for FP load instructions that retired.

PC_soft_hit (PICU): Number of cacheable FP loads that hit a P-cache line that was prefetched by a software prefetch instruction, irrespective of whether the loads hit or miss the D-cache. The count is only updated for FP load instructions that retired.

PC_hard_hit (PICU): Number of cacheable FP loads that hit a P-cache line that was fetched by an FP load or a hardware prefetch, irrespective of whether the loads hit or miss the D-cache. The count is only updated for FP load instructions that retired. Note that if hardware prefetching is disabled (DCUCR bit HPE = 0), the counter will count the number of hits to P-cache lines that were fetched by an FP load only, since no ha

PC_inv (PICU)
299. ctor (RW)

3   UCEEN     Enable fast_ECC_error trap on SW_correctable and uncorrectable L3 cache errors. (RW)
2   Reserved  Reserved for future implementation. (RW)
1   NCEEN     Enable instruction_access_error, data_access_error, or ECC_error trap on uncorrectable ECC errors. (RW)
0   CEEN      Enable ECC_error trap on HW_corrected ECC errors. (RW)

FPPE    When this bit is 1, force a Cport processor data port data parity error on the data parity bit.
FDPE    When this bit is 1, force a Cport data parity error on the data LSB bit.
FISAPE  When this bit is 1, force a Sun Fireplane Interconnect address parity error on the parity bit.
FSDAPE  When this bit is 1, force an SDRAM address parity error on the parity bit during memory write access.
FMT     When this bit is 1, the contents of the FMECC field are transmitted as the system bus microtag ECC bits for all data sent to the system bus by this processor. This includes writeback, copyout, interrupt vector, and non-cacheable store-like operation data.
FMECC   4-bit ECC vector to transmit as the system bus microtag ECC bits.
FMD     When this bit is 1, the contents of the FDECC field are transmitted as the system bus data ECC bits for all data sent to the system bus by this processor. This includes writeback, copyout, interrupt vector, and non-cacheable store-like operation data.
FDECC   9-bit ECC vector to transmit as the system bus data ECC bits.

The FMT and FMD fields allow test code to confirm correct operation of s
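The three low-order trap-enable bits listed above (UCEEN at bit 3, NCEEN at bit 1, CEEN at bit 0) are the ones production software typically leaves set. The following sketch just packages those positions as constants; it deliberately omits the force-error fields (FPPE, FMT, FMECC, and so on) because their bit positions are not given in this excerpt, and the macro names are illustrative.

    #include <stdint.h>

    /* Enable bits of the Error Enable Register as listed above. */
    #define EEN_UCEEN  (1ULL << 3)   /* fast_ECC_error on SW_correctable / uncorrectable L3 errors */
    #define EEN_NCEEN  (1ULL << 1)   /* access_error / ECC_error traps on uncorrectable ECC errors */
    #define EEN_CEEN   (1ULL << 0)   /* ECC_error trap on HW_corrected errors */

    /* Typical "all error traps enabled" value. */
    static inline uint64_t error_enable_all(void)
    {
        return EEN_UCEEN | EEN_NCEEN | EEN_CEEN;
    }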
300. d, Virtually Tagged (VIVT) Caches

The prefetch cache (P-cache) is virtually indexed, virtually tagged for cache reads. However, it is physically indexed, physically tagged (PIPT) for snooping purposes.

Prefetch Cache (P-cache)

The prefetch cache is a 2 KB, 4-way set-associative cache with a 64-byte line size. Each cache line is divided into two 32-byte subblocks with separate valid bits. The P-cache is a write-invalidate cache and uses a sequential replacement policy; that is, the ways are replaced in sequential order. The P-cache data array is parity protected. The P-cache needs to be flushed only for error handling.

The P-cache can be used to hide memory latency and increase memory-level parallelism by prefetching data into the P-cache. Prefetches can be generated by an autonomous hardware prefetch engine or by software prefetch instructions.

Hardware Prefetching

The hardware prefetch engine in the UltraSPARC IV processor automatically starts prefetching the next cache line from the L2 cache after a floating-point (FP) load instruction hits the P-cache. When a floating-point load misses both the D-cache and the P-cache, either 32 bytes (if from memory) or 64 bytes (if from the L2 or L3 cache) of data is installed in the P-cache. Each P-cache line contains a fetched_mode bit that indicates how the P-cache line was installed: by a software prefetch instruction or by the hardware prefetch mechanism.
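A software prefetch that runs ahead of an FP load stream is the usual way to exploit the P-cache from compiled code. The loop below is a sketch only: it relies on the GCC-style __builtin_prefetch, which on SPARC targets generally lowers to a PREFETCH instruction, and the distance of 8 doubles (one 64-byte line) ahead is an assumed tuning value, not a documented recommendation.

    #include <stddef.h>

    /* Sum an array while software-prefetching one cache line ahead. */
    double sum_with_prefetch(const double *a, size_t n)
    {
        const size_t ahead = 8;      /* assumed distance: one 64-byte line of doubles */
        double s = 0.0;

        for (size_t i = 0; i < n; i++) {
            if (i + ahead < n)
                __builtin_prefetch(&a[i + ahead], 0 /* read */, 0 /* low temporal locality */);
            s += a[i];               /* FP load: candidate to hit the P-cache */
        }
        return s;
    }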
301. d and used. In the UltraSPARC IV processor this register is always read as 2'b11. TABLE 2-4 shows the format of the LP Available register. Each bit represents one logical processor: bit 0 for LP 0, bit 1 for LP 1, and so on. If a logical processor is available (implemented), the hardware sets the corresponding bit to 1; otherwise, the hardware sets the bit to 0. In the UltraSPARC IV processor, bit 1 and bit 0 will be set to 1; bits 63:2 are always 0.

TABLE 2-4 LP Available Register (Shared)

Bit    Field             Description
63:2   Mandatory value   Should be 0
1      LP 1              This bit represents LP 1
0      LP 0              This bit represents LP 0

2.4.2 Enabling and Disabling Logical Processors

The CMT programming model allows logical processors to be enabled and disabled. Enabling or disabling a logical processor is a special operation that requires a system reset for the update to take effect. Disabled logical processors produce no architectural effects observable by other logical processors and do not participate in cache coherency. Any transaction issued to a disabled logical processor, such as an interrupt, results in an unmapped reply or a time-out.

2.4.2.1 LP Enable Status Register (ASI_CORE_ENABLE_STATUS)

The LP Enable Status register is a shared register that indicates whether each logical processor is currently enabled. The register is a read-only register with a single 64-bit field, assuming a maximum of 64 logical processors per CMT processor.
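Decoding the one-bit-per-logical-processor format in TABLE 2-4 is straightforward once the raw register value is in hand. The helpers below are illustrative; how the value is actually read (a privileged ASI load of the shared register) is not shown, and the function names are not part of any documented interface.

    #include <stdint.h>

    /* TABLE 2-4 format: bit N set means logical processor N is available. */
    static int lp_is_available(uint64_t lp_available_reg, unsigned lp_id)
    {
        return (int)((lp_available_reg >> lp_id) & 1);
    }

    static unsigned lp_count_available(uint64_t lp_available_reg)
    {
        unsigned count = 0;
        for (unsigned lp = 0; lp < 64; lp++)
            count += (unsigned)((lp_available_reg >> lp) & 1);
        return count;               /* 2 on the UltraSPARC IV processor (value 0x3) */
    }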
302. d data g the critical 32 byte L3 cache data Ey ae moved from L3 a dropped Bess ep cache to L2 cache D cache Original data D cache 32 byte load request with CE in original ecc Good data in Good data F the non critical 32 byte L3 cache data L3_EDC No moved from L3 D cache taken D isruptng trap cache to L2 cache Original data e i i igi Good data in D cache 32 byte load request with UE in L3_EDU No original ecc Good data Disrupting trap 225 un microsystems TABLE 7 20 L3 cache Data CE and UE errors 2 of 4 Error fast ecc Event logged in error L2 cache data L1 cache data AFSR trap Pipeline 5 Comment Action Original data Bad data in original ecc D cache but Bad data moved from L3 good data in dropped cache to L2 cache P cache D cache FP 64 bit load request with CE in the critical 32 byte L3 cache data FREE ass Precise trap Original data original ecc Bad data in Bad data Precise trap D cache but moved from L3 d dropped not in P cache cache to L2 cache D cache FP 64 bit load request with UE in the critical 32 byte L3 cache data L3_UCU Yes Original data original ecc Poole Good data moved from L3 EE taken Disrupting trap cache to L2 cache EE D cache FP 64 bit load request with CE in the 2nd 32 byte L3 cache data SES Ne Original data original ecc moved from L3 cache to L2 cache Good 32 byte Good 32 data in D byte data Disrupting trap cache only taken
303. d into the TLB Data In Register to initiate an atomic write of the TLB entry chosen by the replacement algorithm.

4. If the TTE does not exist in the TSB, then the TLB miss handler jumps to the more sophisticated (and slower) TSB miss handler.

The virtual address used in the formation of the pointer addresses comes from the Tag Access Register, which holds the virtual address and context of the load or store responsible for the MMU exception. See Translation Table Entry (TTE) on page 328.

Note: There are no separate physical registers in hardware for the pointer registers; rather, virtual registers are implemented through dynamic reordering of the data stored in the Tag Access and TSB registers.

The hardware provides pointers for the most common cases of 8 KB and 64 KB page miss processing. These pointers give the virtual addresses where the 8 KB and 64 KB TTEs are stored if either is present in the TSB. The TSB_Size field (n) is defined to have a value ranging from 0 to 7. Note that TSB_Size refers to the size of each TSB when the TSB is split. The symbol :: designates concatenation of bit vectors, and the symbol xor indicates an exclusive-or operation.

For a shared TSB (TSB register split field = 0):

8K_PTR  = TSB_Base<63:13+n> xor TSB_Extension<63:13+n> :: VA<21+n:13> :: 0000
64K_PTR = TSB_Base<63:13+n> xor TSB_Extension<63:13+n> :: VA<24+n:16> :: 0000

For a split TSB (TSB register split field = 1):
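The shared-TSB pointer formation shown above can be written out in C to check a miss handler's arithmetic. This is a sketch, not a replacement for reading the hardware pointer registers: the function name and the way the notation is turned into shifts and masks are illustrative, and it covers only the shared-TSB (split field = 0) case given above.

    #include <stdint.h>

    /* Form the 8K or 64K TSB pointer for a shared TSB:
     *   PTR = (TSB_Base xor TSB_Extension)<63:13+n> :: VA<index> :: 0000
     * where the index is VA<21+n:13> for 8 KB pages and VA<24+n:16>
     * for 64 KB pages, and n is the TSB_Size field (0..7).
     */
    static uint64_t tsb_ptr(uint64_t tsb_base, uint64_t tsb_ext,
                            uint64_t va, unsigned n, int page_64k)
    {
        unsigned va_lo   = page_64k ? 16 : 13;             /* low VA bit of the index   */
        unsigned split   = 13 + n;                         /* lowest bit kept from base */
        uint64_t base    = (tsb_base ^ tsb_ext) & ~((1ULL << split) - 1);
        unsigned idxbits = 9 + n;                          /* number of index bits      */
        uint64_t index   = (va >> va_lo) & ((1ULL << idxbits) - 1);

        return base | (index << 4);                        /* each TTE entry is 16 bytes */
    }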
304. d microtags are fixed in hardware. A single-bit data error as the result of a system bus read from memory or I/O sets the AFSR.CE bit. A single-bit data error as the result of an interrupt vector fetch sets AFSR.IVC. A single-bit microtag error as the result of a system bus read from memory or I/O sets AFSR.EMC. A single-bit microtag error as the result of an interrupt vector fetch sets AFSR.IMC.

HW_corrected system bus errors cause an ECC_error disrupting trap if the CEEN bit is set in the Error Enable Register. The HW_corrected error is corrected by the system bus interface hardware at the processor, and the processor uses the corrected data automatically.

The microtag ECC correctness is checked whether or not the processor is configured in SSM mode (by setting the SSM bit in the Fireplane configuration register). All four microtag values associated with a 64-byte system bus read are checked for ECC correctness.

7.2.9.2 Uncorrectable System Bus Data ECC Errors

An uncorrectable system bus data ECC error as the result of a system bus read from memory or I/O caused by an instruction fetch, load-like, block load, or atomic operation sets the AFSR.UE bit. The ECC syndrome is captured in E_SYND. The AFSR.PRIV bit is set if PSTATE.PRIV was set at the time the error was detected. If the same ECC error is caused by a read triggered by a prefetch queue operation or c
305. d mode When PCR PRIV 1 supervisor access only an attempt by user software to access the PIC register causes a privileged_action trap Software can control event measurements in nonprivileged or privileged modes by setting the PCR UT user trace and PCR ST system trace fields The PCR has mode bits to enable the counters in privileged mode nonprivileged mode or either mode The mode setting affects both counters 5 5 Performance Control Register PCR The Performance Control Register PCR is used to select which events to monitor and provides control for counting in privileged and or nonprivileged modes The 64 bit PCR is accessed through the read write Ancillary State Register instructions RDASR WRASR The PCR is located at ASRs 16 10 Two events can be measured simultaneously by setting the PIC_SL and PIC_SU fields The counters can be enabled separately for Supervisor and User mode using UT and ST fields The selected statistics are reflected during subsequent accesses to the PICs The PCR is a read write register used to control the counting of performance monitoring events TABLE 5 1 shows the details of the PCR TABLE 5 2 describes the various fields of the PCR Counts are collected in the PIC register See Performance Instrumentation Counter PIC Register on page 99 TABLE 5 1 Performance Control Register ASR R W Description Reset ASR 1616 64 bit Read Write Privileged Mode otherwise privileged
306. d take a disrupting instruction_access_error trap or data_access_error trap For AFSR IMC events an ECC_error disrupting trap will be taken Both of these events also generate an interrupt_vector trap A Sun Fireplane Interconnect DSTAT 2 or DSTAT 3 response from the interrupting device to an interrupt vector fetch operation will set AFSR BERR at the processor which is fetching the interrupt vector The interrupt vector data received in this transfer is written into the interrupt receive registers and an interrupt_vector exception is generated even though the data may be incorrect A deferred data_access_error trap is also generated A processor transmitting an interrupt may receive no MAPPED response to its Sun Fireplane Interconnect address cycle This is treated exactly as though the bus cycle was a read access to I O or memory AFSR TO will be set AFAR will be captured although its meaning is uncertain and AFSR PRIV will be updated with the state that happens to be in PSTATE PRIV at the time the event is detected A deferred data_access_error trap will be generated Cache Flushing in the Event of Multiple Errors If a software trap handler needs to flush a line from any processor cache in order to ensure correct operation as part of recovery from an error and multiple uncorrectable errors are reported in the AFSR either through multiple sticky bits or through AFSR ME then the value stored in AFAR may not be the only line needing to be
307. d_action trap ASR 1710 TABLE 5 4 PIC Register Fields Bit Field Description 32 bit field representing the count of an event selected by the SU field of the Performance Control 63 32 PICU s Register PCR 32 bit field representing the count of an event selected by the SL field of the Performance Control 31 0 PICL Register PCR 5 6 1 PIC Counter Overflow Trap Operation When a PIC counter overflows an interrupt is generated as described in TABLE 5 5 TABLE 5 5 PIC Counter Overflow Processor Compatibility Comparison Function Description The counter overflow trap is triggered on the transition from value FFFF FFFF 16 to value 0 The point at which the interrupt is delivered may be several instructions after the instruction responsible for the overflow event This situation is known as a skid e The counter wraps to zero e SOFTINT register bit 15 is set to 1 e An interrupt_level_15 trap a disrupting trap is generated PIC Counter Overflow Performance Instrumentation and Optimization 100 un microsystems 5 7 Performance Instrumentation Operation FIGURE 5 3 shows how an operating system might use the performance instrumentation features to provide event monitoring services Set up PCR hi_select_value gt PCR su low_select_value PCR sl 0 1 gt PCR ut Accumulate stat in PIC PIC gt rd No No
308. data after the new ECC check bits have been generated In this way the receiver will detect the uncorrectable error when it performs its own ECC checking This deliberately bad ECC is known as signalling ECC For DSTAT 2 or 3 events coming from the system bus and being stored with deliberately bad signalling ECC in the L2 cache an uncorrectable error is injected by inverting data bits 1 0 after correct ECC is generated for the corrupt data For ano MAPPED event coming from the system bus the data and ECC values present on the system bus at the time that the unmapped error is detected are not stored in the L2 cache Any result can be returned when the L2 cache line affected is read For UE and DUE events coming from the system bus the data and ECC values present on the system bus are stored unchanged in the L2 cache An uncorrectable error should be returned when the L2 cache line is read but the syndrome is not defined For uncorrectable ECC errors detected in copyout data from L2 cache L3 cache or writeback data from the L3 cache an uncorrectable error is injected into outgoing data by inverting data bits 127 126 after correct ECC is generated for the corrupt data For uncorrectable ECC errors detected in an L2 cache or L3 cache read to complete a store queue exclusive request associated with a store like operation ECC check bits 1 0 will be inverted in the data scrubbed back to the L2 cache when W cache evict this line A line w
309. ding instruction will be issued to the BR pipe to allow trap processing to be initiated The processor then will examine both the pending traps and take the instruction_access_error trap say at TL 1 because it is higher priority The data_access_error remains pending When the first BR or MS pipe instruction is executed in the instruction_access_error trap routine the data_access_error trap routine will run say at TL 2 Despite the fact that the data_access_error has lower priority than the instruction_access_error trap the data_access_error trap routine runs at a higher TL within an enclosing instruction_access_error trap routine and before the bulk of that routine This is the opposite of the usual effect that interrupt priorities have The result of this is that at the time that the trap handler begins only one data_access_error trap is executed for all data access errors that have been detected by this time and only one instruction_access_error trap is executed for all instruction access errors Processor action is always determined by the trap priorities except for one special case and that is for a precise fast_ECC_error trap pending at the same time as a deferred data_access_error or instruction_access_error trap In this one case only the higher priority deferred trap will be taken and the precise trap will no longer be pending If a deferred trap is taken while a precise trap is pending that precise trap will no longer be p
310. do split mode If the L2 cache is switched from the psuedu split mode to regular mode the counter will retain its value Number of L2 cache lines that were written back to the L3 cache because of requests RH from this core Shared L2 event counters Total number of L2 cache lines that were written back to the L3 cache due to L2_wb_sh PICL requests from both cores Total number of L2 cache lines that were invalidated due to other processors doing L2 i h PICL S Snoop_iny_sh PICL RTO RTOR RTOU or WS transactions Total number of L2 cache lines that were copied back due to other processors The count includes copybacks due to both foreign copy back and copy back invalidate requests e foreign RTS RTO RS RTSR RTOR RSR RTSM RTSU or RTOU requests L2_snoop_cb_sh PICL Performance Instrumentation and Optimization 108 un microsystems TABLE 5 11 Cache Access Counters 4 of 5 Counter L2_hit_I_state_sh PICU Total number of tag hits in L2 cache when the line is in I state The count does not include L2 cache tag hits for hardware prefetch and block store requests This counter approximates the number of coherence misses in the L2 cache in a multiprocessor system L3 Cache Note The L3 cache access counters do not include retried L3 cache requests Private L3 cache counters L3_miss PICU Number of L3 cache misses sent out to SIU from this core by cacheable I cache
311. duction on page 9 e Accessing CMT Registers on page 11 e Private Processor Registers on page 12 e Disabling and Suspending Logical Processors on page 14 e Reset Handling on page 20 e Private and Shared Registers Summary on page 21 2 1 Introduction This chapter corresponds to Sun s common interface between hardware and software and addresses issues common to CMT processors The UltraSPARC IV processor uses Sun s standard CMT programming model an interface specifying the basic functionality needed in the operating system diagnostics and recovery code to control and configure a processor comprised of multiple logical processors Among other requirements the CMT programming model defines how logical processors are identified how errors and resets are steered to logical processors and how logical processors can be disabled or suspended Logical processors are identified by two globally unique IDs one used to reference the processor s registers and a second used to reference interrupts The former ID is used to disable or suspend logical processors while the latter is used for steering a thread s errors resets and traps These IDs enable CMT processors to behave much like traditional symmetrical multiprocessor systems When an error can be identified as being associated with a logical processor the error will be reported to that logical processor For errors that cannot be associated with any specific thread or logical processor
312. ducts bearing SPARC trademarks are based upon architecture developed by Sun Microsystems Inc DOCUMENTATION IS PROVIDED AS IS AND ALL EXPRESS OR IMPLIED CONDITIONS REPRESENTATIONS AND WARRANTIES INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY FITNESS FOR A PARTICULAR PURPOSE OR NON INFRINGEMENT ARE DISCLAIMED EXCEPT TO THE EXTENT THAT SUCH DISCLAIMERS ARE HELD TO BE LEGALLY INVALID amp Sun microsystems Table of Contents amp Sun microsystems NC xxiii e Architectural Overview cssscccsssccssseccssseccssscccssscssssecssssessssessssssesssssesssssssssesscsesssssssssseoosases 1 LA troduction nean a tiie LNA eta oat acide on at ey eet 1 1 2 Chip Multithreading CMT cccceeccccesscessseeeeseeeeeeeesseeeesseeeesaeeeneaeeeseneeeseeeeseneeeseneeenags 2 LS Enhanced Core Design Zteteestzegsueee EENS 2 13 1 Instruction Fetches gx s ngt EEN e Sa ge edad eee ths 3 13 2 Executi UNIS EE 3 1 3 3 Wite Cache eieiei a a a dha ane ae 4 RIA Data Prefetehiht nreo a RE ee a ee AL 4 1 3 5 Memory Management Units MMU 0 0 eeeccceesseceseeeeeeeeseeeeeseeeeeseeenseeenens 5 14 Cache Hierarchy EE 5 Lai Lachen deent nian eect tie eine ioe Nee aa ete 5 142 L2 Cache EE 5 143a E ee E 6 1 5 System Interface and Memory Controller Enhancements 0 cceecceeseeeseceneeeeeeeeneeeneenes 6 1 5 1 Reduced Latency of Cache to Cache Transfers ccccssccesseeeeteeeeseeeeteeeeens 7 1 5 2 Higher Sustainable Ba
313. e MMU miss There are two ways of ensuring such a case does not occur 1 Any code which modifies the I MMU state should be locked down in the I TLB to prevent the possibility of intervening TLB misses 2 If suggestion 1 is not possible the STXA and the subsequent FLUSH DONE or RETRY should be kept on the same 8 KB page again preventing an intervening I TLB miss Note An instruction_access_execption by I MMU parity error may cause this problem 132 13 2 1 Data Memory Management Unit The Data Memory Management Unit D MMU conforms to the requirements set forth in the UltraSPARC III Cu Processor User s Manual In particular the D MMU supports a 64 bit virtual address space simplified protection encoding multiple page sizes and software D TLB miss processing There is no hardware page table walk Virtual Address Translation A 64 bit virtual address VA space is supported with 43 bits of physical address PA Instruction and Data Memory Management Unit 319 un microsystems 13 2 2 The UltraSPARC IV processor D MMU consists of three Translation Lookaside Buffers TLBs as described in TABLE 13 15 There is no support for locked pages of sizes 32 MB or 256 MB An attempt to install a 32 MB or 256 MB locked page will result in undefined and potentially erroneous translations TABLE 13 15 D MMU TLBs TLB name TLB ID Translating Page Size Remark T16 0 8 KB 64 KB 512 KB 4 M
314. e TLBs and the target TTE has the page size set to 64 KB 512 KB or 4 MB then the data returned from the TLB will be incorrect This issue can be overcome by reading the fully associative TLB TTE twice back to back The first access may return incorrect data if the above conditions are met however the second access will return correct data Instruction and Data Memory Management Unit 318 amp Sun microsystems 13 1 10 Write Access Limitation of I MMU Registers If an STXA instruction that targets an I MMU register eg AST_IMMU_SFSR ASI 0x50 VA 0x18 is executed and an I MMU miss is taken prior to the programmer visible state being modified it is possible for both the STXA and the I MMU miss to attempt to update the targeted register at the exact same instant An example is shown below 0x25105e21ffc stxa g0 i10 0x50 0x25105e22000 IMMU miss In this case if the STXA instruction takes priority over the I MMU miss it will cause stale data to be stored in the I MMU register A FLUSH DONE or RETRY is a special case needed after an internal ASI store that affects instruction accesses see Instruction Memory Management Unit on page 302 The usage is that the FLUSH DONE or RETRY should immediately follow the STXA There can be a case where the programmer may have inserted such an instruction after the STXA instruction but the instruction was not processed prior to th
315. e including prefetch buffer or hits in the prefetch buffer In addition to reducing the overall latency of instruction fetching the instruction prefetch mechanism provides more robust performance for applications whose instruction working set exceeds the capacity of the instruction cache The UltraSPARC IV processor benefits not only from more aggressive instruction prefetching but also from superior mechanisms for predicting both the direction and target of branch instructions The branch direction predictor has been made configurable allowing different mechanisms to be used for code of different types In addition to a standard Gshare predictor of the sort used in UltraSPARC III processor two separate history registers are available for privileged supervisor and user code Further either or both of the history registers can be disabled in favor of a PC indexed branch predictor While a standard Gshare predictor works well for smaller applications a PC indexed predictor often works better for large irregular applications like large databases as well as for privileged code in general To improve target prediction for indirect branches branches whose targets are specified by a register value the UltraSPARC IV processor incorporates a 32 entry branch target buffer BTB When an indirect branch is encountered the BTB is used in conjunction with the return address stack and the branch preparation instruction to predict the target instructio
316. e prefetch cache snooping purposes 26 POK pin 84 power on reset POR hard reset when POK pin activated 84 and I cache microtags 43 soft reset 292 294 software 272 system reset when Reset pin activated 85 precise trap priority 260 prediction branch 47 prefetch hardware 27 instruction noncacheable 84 software 27 prefetch cache access statistics 107 data parity error 264 description 26 diagnostic accesses 54 diagnostic data register access 56 enable bit 274 error detection 157 error recovery 157 HW prefetch enable bit 274 invalidating flushing entry 27 invalidation 27 noncacheable data 27 snoop tag register access 57 software prefetch enable bit 275 status data register access 55 valid bits 84 virtual tag valid fields access 56 PREFETCH instruction L2 L3 cache allocation 26 P cache invalidation 27 privileged_action exception 39 98 100 PIC access 99 Program Version register 298 PSTATE IE field 271 IG field 268 MG field 268 PEF field 259 RED field clearing DCUCR 83 explicitly set 83 register 268 state after reset 86 xliv UltraSPARC IV Processor User s Manual October 2005 quad load instruction 97 queue instruction state after reset 89 store state after reset 89 quiet NaN not a number 136 R R A W bypassing algorithm 97 detection algorithm 97 RAW bypass enable bit in DCUCR 274 bypassing data from store queue 274 RDASR instruction 98 99 RDPC instruction 268 RDPIC instruction 99 RDPR instruction 297
317. e Raw UE data Bad data Z UE Yes E S cache but not in taken and Precise trap fast critical 32 byte data from raw ecc dropped ecc error will be dropped Error Handling 241 un microsystems TABLE 7 27 System Bus CE UE TO DTO BERR DBERR errors 2 of 10 Event D cache FP 64 bit load fill request with CE in the non critical 2nd 32 byte data from system bus flag fast ecc error L2 cache data Corrected data corrected ecc L2 cache state L1 cache data Good amp critical 32 byte data in D Cache and good amp critical 32 byte or full 64 byte data in P cache Pipeline Action No action Comment Disrupting trap D cache FP 64 bit load fill request with UE in the non critical 2nd 32 byte data from system bus Raw UE data raw ecc Bad data not in P cache No action Deferred trap D cache block load fill request with CE in the critical 32 byte data from system bus OR D cache block load fill request with CE in the non critical 2nd 32 byte data from system bus D cache block load fill request with UE in the critical 32 byte data from system bus OR D cache block load fill request with UE in the non critical 2nd 32 byte data from system bus D cache atomic fill request with CE in the critical 32 byte data from system bus No Not installed Not installed Corrected data corrected ecc Good data in P cache block load buffer Bad dat
318. e Tag Li cache data ENEE Comment L3 cache access for Prefetch 0 1 2 3 20 21 Disrupting trap 22 23 17 request with tag Original tag N A the request is UE dropped request misses L2 cache stores W cache exclusive e pipe sends Tisai request with L3 tag CE L3 THCE L3 Pipe valid amp N A EE g T corrects the tag grant to W request hits L2 cache cache stores W cache exclusive i Disrupting trap request with L3 tag CE L3 THCE L3 Pipe N A N A the request is corrects the tag q request misses L2 cache retried stores W cache exclusive L2 Pipe gives RE request with L3 tag UE Original tag both valid amp N A E irap grant to W request hits L2 cache cache stores W cache exclusive Disrupting trap request with L3 tag UE Original tag N A N A the request is request misses L2 cache dropped L2 Writeback t with L3 Pi t EE riteback request wi ipe corrects i L3 tag CE L3_THCE the tag No action the request is retried f Disrupting trap L2 Writeback request with Original N i L3 tag UE riginal tag o action the request is dropped TERES d iat Disrupting trap cache eviction tag rea ipe corrects i request with tag CE L3 IHEE the tag EK the request IS retried Disrupting tra L3 cache eviction tag read Original ta Ne danas gea ia request with tag UE 8 8 s the request is dropped L3 cache Tag update 13 Pine correi Disrupting trap request by SIU with tag L3_THCE P No action CE the tag t
319. e UltraSPARC processor Performance Instrumentation and Optimization 92 amp Sun microsystems 5 2 1 3 5 2 1 4 5 2305 5 2 1 6 5 2 2 Branch Optimization The UltraSPARC processors favor branch not taken conditionals Regardless of this preference the instruction issue remains the same and the fetch is optimized Impact of the Delay Slot on Instruction Fetch Most Control Transfer Instructions CTIs are actually delayed Control transfer takes effect one instruction after the actual CTI The intervening delay is called the delay slot The instruction following the branch or after CTI is always executed regardless of where the CTI directs execution unless annulling is used If the last instruction of a line is a branch the next sequential line in the I cache must be fetched even if the branch predicted is taken because the delay slot must be sent to the grouping logic This line fetch leads to inefficient fetches because an entire L2 cache access must be dedicated to fetching the missing delay slot Therefore do not place delayed CTIs at the end of a cache line Instruction Alignment for the Grouping Logic See the UltraSPARC III Cu Processor User s Manual for a description of grouping logic Impact of Instruction Alignment on Instruction Dispatch It is important that no two branches are in the same fetch group If there are two branches in the same group the second branch will end the group and will cause a refetch
320. e critical 32 byte data from system bus k isrupting trap OR IVU No No garbage data i SR Interrupt Interrupt vector with UE in the non critical 2nd dropped 32 byte data from system bus Interrupt vector with CE in the microtag of the Disrupting trap critical 32 byte data from system bus amp OR IMC No Yesifnorvu Corrected Interrupt taken interrupt data if no IVU Interrupt vector with CE in the microtag of the Interrupt non critical 2nd 32 byte data from system bus dropped if IVU Interrupt vector with UE in the microtag of the Disrupting trap critical 32 byte data from system bus amp OR IMU Yes Yes if no IVU Received Interrupt taken if no IVU Interrupt dropped if IVU 252 amp Sun microsystems Exceptions Traps and Trap Types The UltraSPARC IV processor implements all mandatory SPARC V9 exceptions as described in the UltraSPARC III Cu Processor User s Manual In addition the UltraSPARC IV processor implements the exceptions listed in TABLE 8 1 which are specific to the UltraSPARC IV processor Chapter Topics e Traps on page 253 e Exceptions Specific to the UltraSPARC IV Processor on page 259 e Trap Priority on page 260 e I cache Parity Error Trap on page 260 e D cache Parity Error Trap on page 262 e P cache Parity Error Trap on page 264 8 1 8 1 1 Traps The four main types of traps are discussed in detail in the following sections e Precise traps e Deferred traps e
321. e errors or are a form of an uncorrectable error that requires system intervention before normal processor execution can continue Deferred traps These errors are signaled as uncorrectable errors requiring immediate attention but do not require a system reset Disrupting traps These errors are signaled as requiring logging and clearing but which do not otherwise affect processor execution Fatal errors These errors are normally handled by a processor reset before continuing ey ec Error Handling Memory Errors Memory errors include I cache errors D cache errors I TLB parity error D TLB parity error IPB data parity error P cache data parity error L2 cache errors L3 cache errors Errors on the system bus I cache Errors Parity error protection is provided for the I cache physical and snoop tag arrays I cache data array and the instruction prefetch buffer IPB data array The I cache IPB is clean meaning that the value in the I cache IPB for a given physical address is always the same as that in some other store in the system This means that recovery from single bit errors in the I cache IPB needs only parity error detection not full error correcting code The basic concept is that when an error is observed the I cache IPB can be invalidated and the access retried Both hardware and software methods are used in the UltraSPARC IV processor Parity error checking in physical tags snoop tags and I
322. e forces the corresponding bit in the LP Enable register to 0 and ignores attempts to write 1 to that bit Since the UltraSPARC IV processor always has both logical processors available this scenario does not exist in the UltraSPARC IV processor Note A disabled logical processor will not respond to any transaction issued to it The sender should encounter an unmapped reply or a timeout error In the UltraSPARC IV processor if both bits 1 and 0 are set to 0 then both logical processors will be disabled after a Hard Soft POR State After Reset The value of the LP Enable register is set to the value of the LP Available register at the assertion of a power on reset The value of the LP Enable register remains unchanged during all other resets including system resets or equivalent resets Suspending and Running Logical Processors Suspended logical processors can be set to run later The suspending and running of logical processors can be performed at arbitrary points in time and unlike disabling a logical processor a system reset is not required There may be an arbitrarily long but bounded delay from when a logical processor is directed to suspend until the change takes effect There is a LP Running Status register that can be used to determine if a logical processor has completed the process of becoming suspended A suspended logical processor does not execute instructions and does not initiate any transactions on its ow
323. e present in the store queue each store can potentially result in a data_access_error trap as a result of a system bus problem Because deferred trap processing does not wait for all stores to complete the data_access_error trap routine can start as soon as an error is detected as the result of the first store Execution of the trap routine still may be delayed until the right pipe includes a valid instruction though Once the data_access_error routine has started a further store from the original store queue can result in system bus activity which eventually returns an error and causes another data_access_error trap to become pending This can once the correct pipe has a valid instruction in it start another data_access_error trap routine at TL 2 This can continue until all available trap levels are exhausted and the processor begins RED_state execution To overcome this problem we need to insert a MEMBAR Sync or RETRY instruction at the beginning of the deferred trap handler This avoids for nested deferred traps going to RED_state The MEMBAR Sync or RETRY requires the store queue to be empty before it can be issued This forces the hardware to merge multiple deferred traps while in TL 1 into one deferred trap to stop at TL 2 In the case of other store types that fill STQ and generate system bus read operation read to own multiple disrupting traps might be generated back to back In this case one disrupti
324. e to be invoked right after a reset the uninitialized content of ASIT_IMMU_TAG_ACCESS register will cause the I TLB tag to not initialize As a result the expected result would not have been accomplished The UltraSPARC IV processor also requires that there are no outstanding requests of any kind before ASI_SRAM_FAST_INITT is invoked In order to guarantee that there are no outstanding requests before ASI_SRAM_FAST_INIT and to avoid an I MMU tag initialization problem the following code sequence must be executed in atomic form set 3016 gl membar Sync stxa g0 g1 ASI_IMMU_TAG_ACCESS membar Sync align 16 nop Caches Cache Coherency and Diagnostics 78 amp Sun microsystems 3 14 2 flush g0 stxa g0 g0 ASI_SRAM_FAST_INIT membar Sync Note The requirement of executing this code sequence in atomic form is to ensure that there would not be any trap handler sneaking in between to initialize the ASI_IMMU_TAG_ACCESS register to other unwanted value After ASI_SRAM_FAST_INIT D cache and I cache microtag arrays have to be initialized with unique values among 4 same index entries from different banks Traps data_access_exception trap for ASI 40 use other than with STXA instruction ASI_SRAM_FAST_INIT_Shared Definition ASI 3F 16 Write only Shared by both logical processors VA 63 0 016
325. eable reads Neither sets AFSR TO nor AFSR DTO System Bus Unmapped Errors The AFSR TO or AFSR DTO bit is set when no device responds with a MAPPED status as the result of the system bus address phase This is not a hardware time out operation which causes an AFSR PERR event It is also different from a DSTAT 2 time out response for a Sun Fireplane Interconnect transaction which actually sets AFSR BERR or AFSR DBERR 172 amp Sun microsystems 7 2 9 6 Error Handling A TO or DTO may be returned in response to a system bus read or write operation In this case the processor handles the event in the same way as specified above at Uncorrectable System Bus Data ECC Errors on page 171 except for the following differences 1 The AFSR DTO is set instead of AFSR DUE for a system bus read caused by prefetch queue or read to own store queue operation 2 The AFSR TO is set instead AFSR UE for a system bus read caused by an instruction fetch load like block load or atomic operation 3 The AFSR TO is also set for all system bus unmapped errors caused by write access transfer This includes block store to memory WS store to I O WIO block store to I O WBIO or writeback from L3 cache operation WB 4 The AFSR TO is also set for all system bus unmapped errors caused by issuing an interrupt to undefined target device or disabled logical processor INT 5 The TO or DTO AFSR and AFAR overwrite priorities are used rat
326. ean write through L1 caches with simple parity checking If a data error is detected the faulty line is invalidated and quickly reloaded with a good copy of the data from the L2 cache In addition the UltraSPARC IV processor improves the bit layout of the L1 instruction and data caches specifically to reduce the probability of a single cosmic ray striking two bits in the same parity group thereby causing an undetectable double bit error Also parity protection has been added to smaller previously unguarded clean data structures including the prefetch cache and the large 512 entry TLBs in both the LMMU and D MMU amp Sun microsystems The on chip L2 cache both tag and data as well as the on chip L3 cache tags are protected by full error correcting code ECC that supports single bit error correction and double bit error detection While all members of the UltraSPARC III IV processor family provide ECC for the external data buses the UltraSPARC IV processor goes a step further by providing protection for the external address buses that connect the processor to its external cache and main memory Architectural Overview 8 amp Sun microsystems Chip Multithreading CMT The UltraSPARC IV processor supports Sun s new software interface and registers to support logical processor identification reset diagnostics and error reporting These CMT registers can be classified as private or shared Chapter Topics e Intro
327. eback to L3 cache 5 Log the error 6 Clear AFSR UCC UCU WDC WDU CPC CPU THCE and AFSR_EXT L3_WDU L3_CPU 7 Clear AFSR2 and AFSR2_EXT Displacement flush any cacheable fast_ECC_error exception vector or cacheable fast_ECC_error trap handler code or data from the L2 cache to L3 cache 9 Displacement flush any cacheable fast_ECC_error exception vector or cacheable fast_ECC_error trap handler code or data from the L3 cache to memory 10 Re execute the instruction that caused the error using RETRY Corrupt data is never stored in the I cache Data in error is stored in the D cache If the data was read from the L2 cache as the result of a load instruction or an atomic instruction corrupt data will be stored in D cache However if the data was read as the result of a block load instruction corrupt data will not be stored in D cache Store like instructions never cause fast_ECC_error traps directly just load like and atomic instructions Store like instructions never result in corrupt data being loaded into the D cache 160 amp Sun microsystems Error Handling The entire D cache is invalidated because there are circumstances when the AFAR used by the trap routine does not point to the line in error in the D cache This can happen when multiple errors are reported and when an instruction fetch not a prefetch queue operation has logged an L2 cache error in AFSR and AFAR but not generated a trap due to a misfetch
328. ecovery action should treat this as an uncorrectable L3 cache data error 169 amp Sun microsystems Note There is no real parity check parity pin for the L3 cache data address bus since the parity check requires a motherboard respin To avoid motherboard respin and to protect SRAM data address bus 9 bit ECC for the first DIMM is stored in the second DIMM and vice versa If an error occurs on the address bus there is high possibility that ECC violation is detected on both DIMMs Thus we assume that address bus errors occur when both DIMMs get ECC errors 7 2 9 Errors on the System Bus Errors on the system bus are detected as the result of e Cacheable read accesses to the system bus e Non cacheable read accesses to the system bus Data ECC microtag ECC system bus error with Dstat 2 or 3 and unmapped responses are always checked on these accesses e Fetching interrupt vector data by the processor on the system bus Data ECC microtag ECC and system bus error with Dstat 2 or 3 responses are always checked on interrupt vector fetches e Cacheable write accesses to the system bus e Non cacheable write accesses to the system bus e Transmitting interrupts by the processor on the system bus Unmapped response is checked on these accesses above 7 2 9 1 HW_corrected System Bus Data and Microtag ECC Errors ECC is checked for data and microtags arriving at the processor from the system bus Single bit errors in data an
329. ectable L3 cache data and writes it to the system bus in a writeback operation it will compute correct system bus ECC for the corrupt data then invert bits 127 126 of the data to signal to other devices that the data is not usable Multiple occurrences of this error will cause ASFR1 ME to be set If both the L3_WDU and the L3_MECC bits are set it indicates that an address parity error has occurred L3_CPC For an L3 cache copyout operation to serve a snoop request from another processor an L3 cache read will be performed and the data read back from the L3 cache will be checked for the correctness of its ECC If a single bit error is detected the L3_CPC bit will be set to log this error condition A disrupting ECC_error trap will be generated in this case provided that the CEEN bit is set in the Error Enable Register Hardware will proceed to correct the error and corrected data will be sent to the snooping device This bit is not set if the copyout happens to hit in the writeback buffer because the line is being victimized Instead the WDC bit is set Please refer to the section L3_WDC for an explanation of this 208 amp Sun microsystems 7 1 4 Error Handling If both the L3_CPC and the L3_MECC bits are set it indicates that an address parity error has occurred L3_CPU For an L3 cache copyout operation an L3 cache read will be performed and the data read back from the L3 cache will be checked for the correct
330. ecuted block identified by profiling achieves a better I cache utilization Handling of CTI Couples Avoid placing a CTI into the delay slot of another CTI because it will disrupt the fetch and cost many cycles Mispredicted Branches Correctly predicted conditional branches allow the processor to group instructions from subsequent basic blocks and continue to progress speculatively until the branch is resolved The capability of executing instructions speculatively is a significant performance boost for the UltraSPARC processor Return Address Stack RAS To speed up returns from subroutines invoked through CALL instructions the UltraSPARC processor dedicates an 8 deep stack to store the return address Each time a CALL is detected the return address is pushed onto this Return Address Stack Each time a return is encountered the address is obtained from the top of the stack and the stack is popped The UltraSPARC processor considers a return to be a JMPL or RETURN with rs equal to 07 normal subroutine or i7 leaf subroutine The Return Address Stack provides a guess for the target address so that prefetching can continue even though the address calculation has not yet been performed JMPL or RETURN instructions using rsl values other than 07 or i7 use the value on the top of the Return Address Stack for continuing prefetching but they do not pop the stack To take full advantage of the Return Address Stack follow the stand
331. ed e PCR UT and PCR ST are global fields which apply to both PIC pairs 1 Privileged If PCR PRIV 1 a nonprivileged PSTATE PRIV 0 attempt to access PIC via a RDPIC or WRPIC instruction will result in a privileged_action exception 0 5 6 Performance Instrumentation Counter PIC Register The 64 bit PIC is accessed through read write Ancillary State Register instructions RDASR WRASR PIC is located at ASRs 17 11 6 The PIC counters can be monitored during program execution to gather ongoing statistics or reconfigure during steady state program execution to gather statistics for more than two events The pair of 32 bit counters can accumulate over four billion events each prior to wrapping Overflow of PICL or PICU causes a disrupting trap and SOFTINT Active monitoring will allow the gathering software to extend the data range by periodically reading the contents of the PICs to detect and avoid overflow an interrupt can be enabled on a counter overflow The point at which the interrupt due to a PIC overflow is delivered may be several instructions after the instruction responsible for the overflow event This delay is known as a skid The degree of skid a delay of a dozen or more clock cycles in length depends on the event that caused the overflow and the state of the processor pipelines at the time the overflow occurred It may not be possible to associate a counter overflow with the particular instruction th
332. ed as returns described in IU_stat_ret_correct_pred counter below IU_stat_jmp_mispred PICU Number of retired non annulled register indirect jumps mispredicted Number of retired non annulled returns predicted correctly Returns include the return instruction op 216 0p3 39 and the ret retl IU_stat_ret_correct_pred PICL synthetic instructions Ret retl are jmpl instructions op 2 0p3 38 46 with the following format ret jmpl i7 8 g0 retl jmpl 07 8 g0 IU_stat_ret_mispred PICU Number of retired non annulled returns mispredicted HU Stall Counts IIU stall counts listed in TABLE 5 8 correspond to the major cause of pipeline stalls from the fetch and decode stages of the pipeline These counters count cycles during which no instructions are dispatched or issued because the I queue is empty due to various events including I cache miss and refetching due to a second branch in a fetch group Stalls are counted for each clock at which the associated condition is true The counters listed in TABLE 5 8 are all per core counters Performance Instrumentation and Optimization 103 un microsystems TABLE 5 8 Counters for IIU stalls Dispatcht JC miss PICL Stall cycles due to the event that no instructions are dispatched because the I queue is empty from an I cache miss Stall cycles due to the event that no instructions are dispatched because the I queue is empty because there were two branch instructi
333. ed0 7 0 given by the same IC_addr 14 7 but with IC_addr 6 0 37 0 Undefined The value of these bits is undefined on reads and must be masked off by software TABLE 3 18 shows the read data format for the upper bits of the I cache Valid LPB array TABLE 3 18 Format for Reading Upper Bits of Valid Predict Tag Field Data Bits Field Description 63 51 Mandatory value Should be 0 50 Valid Valid is the Valid bit for the 32 byte sub block i IC_vpred is the upper 4 LPB bits for the eight instructions starting at the 32 4 48 1C_vpred 7 4 byte boundary align address given by IC_addr 45 0 Mandatory value Should be 0 TABLE 3 19 shows the read data format for the lower bits of the I cache Valid LPB array TABLE 3 19 Format for Reading Lower Bits of Valid Predict Tag Field Data 63 51 Mandatory value Should be 0 50 Valid Valid is the Valid bit for the 32 byte sub block IC_vpred is the lower 4 LPB bits for the eight instructions starting at the 32 49 46 1C_vpred 3 0 byte boundary align address given by IC_addr 45 0 Undefined The value of these bits is undefined on reads and must be masked off by software 3 5 3 Instruction Cache Snoop Tag Fields Access ASI 686 per logical processor Caches Cache Coherency and Diagnostics 44 un microsystems Name ASI T ICACHE_SNOOP_TAG TABLE 3 20 The address format for the I cache snoop tag
334. eeeeeseeaeeseeseeaeeseees 37 Memory controller actions for SSM RMW transactions eeeesceeseeseseeseeseeeseeesseeseseseceeeacees 39 Instruction Cache Instruction Access Address Format 40 Instruction Cache Instruction Access Data Format o ccceecsesseeeesseseecsesessessseseeesecsesseeseeseaes 40 Definition of predecode bits 4 0 eceeeeceeeeseeseeseeseeseeseeeeseeeceeeeceseeseesecsecsecsecseeeeeeeeeaeeaeeaeeaees 40 Definition of predecode bits 9 5 nerens nn e a a a 41 Instruction Cache Tag Valid Access Address Format s sssessssesesisrsesesrerrrrresessrsrstsrsrerersrersrseee 42 Data Format for I cache Physical Address Tag Field sssesssesssesseseesessssssesessrsessssssesessesesesseseses 43 Data Format for I cache Microtag Field 43 List of Tables XV amp Sun microsystems xvi TABLE 3 17 TABLE 3 18 TABLE 3 19 TABLE 3 20 TABLE 3 21 TABLE 3 22 TABLE 3 23 TABLE 3 24 TABLE 3 25 TABLE 3 26 TABLE 3 27 TABLE 3 28 TABLE 3 29 TABLE 3 30 TABLE 3 31 TABLE 3 33 TABLE 3 34 TABLE 3 32 TABLE 3 36 TABLE 3 37 TABLE 3 35 TABLE 3 38 TABLE 3 39 TABLE 3 40 TABLE 3 41 TABLE 3 42 TABLE 3 43 TABLE 3 44 TABLE 3 45 TABLE 3 46 TABLE 3 47 TABLE 3 48 TABLE 3 49 TABLE 3 50 TABLE 3 51 TABLE 3 52 Format for Writing I cache Valid Predict Tag Field Data oo ceceesesesseeseneeseeeeseeseesseeeeaeees 44 Format for Reading Upper Bits of Valid Predict Tag Fi
335. eeeeseeceneeeceseeeseeeeseneesseeessteeeesteeensaes 130 6 3 11 fRegister Load Store Operations 0 ccccccescecescceceneeeeneeeseeeeesneeeeeneeeeseeesaaes 131 6 312 VIS Operations ee cece eebe ee ii 131 6 4 Traps and Exceptions e a r a E e a Ea aaRS 131 6 41 fp_disabled Trap moane n a e e E A A a a 131 6 4 2 fp_exception_other Trap 132 64 3 Summary of Exc ptiOns recor EEEE E EN E IR 132 Dk sirap Evente aao EE E AE E node E E EEE E EEE RENEE 132 0 45 BT o a aO a AEE E Ee E E EST 133 6 3 JEFE apse aaie E EE aE S aa A E EA EA E N E 133 6 5 1 IEEE Trap Enable Mask TEM seessessesssesssssrsserssessssrrssersseressressersseeesseesseesse 133 D IEEE Invalid ny Trap ees nn n E E R 133 D i IEEE Overflow of Trap ccecccececccccssccesseeeeseeceeeeceseeeseseeeseeeeseeesseeenseeenaaes 133 6 5 4 IEEE Underflow uf Trap ccccecccccscccseseeceseececeeceseeessneeesseeeeseeesseeensteeensaes 133 6 5 5 IEEE Divide by Zero dz Trap 134 65 6 TEBE Imneeaert mx Trap dese a Be ee ee 134 66 Underflow Operation r aaar T TE E T a a aaa a aaiae 134 6 6 1 Trapp d UnderloW sssrinin iin Eni E E E K E a a 135 6 6 2 Unttapped Endertlomw sirna ase center E EE E E RE 135 6 7 IEEE NaN Operations srecan nea e eet cavers E EE E 136 6 7 1 Signaling and Quiet NaNs sssssesssesssessssesssessseessseesseessresseresseesseesseeesreesseessress 136 6 7 2 SNaN to QNaN Transformation ccccccccceeessecceeeescee
336. eene ebessi a a qutedeacta r E aa ENTRES 280 10 2 1 UltraSPARC IV Processor ASI Assignments cccsccesssseeeeteeeeseeessneeesens 281 11 Sun Fireplane Interconnect and Processor Identification cssscccssscssssecssssscsstesssseosees 287 11 1 Sun Fireplane Interconnect ASI Extensions cceccccceesccesseeceseeeeeeeeeeeeessneeesseeenseeenens 287 11 1 1 Sun Fireplane Interconnect Port ID Register eee eeseesceseeeereceneeeereeseeeeaes 287 11 2 RED state and Reset Values osios creci citenesveaieectebeves i e REEE E AAR ARE E aa 296 11 3 The UltraSPARC IV Processor Identification sssssseseseseseeesseessseesserssersssresseesseessreesses 297 LL31 NErsiOn RESIStEL ce ees Eed AEN 297 11 3 2 FIREPLANE_PORT_ID MID Field oo ce ccc eccececceseeeceeeeeeeeseeeceseeaeeneeeaeens 297 113 3 Speed Data E 298 L134 Program Version Register oifisean aa a ie e EE eer 298 12 Interrupt Haudltug iesge gege gege ees Eed Eed 299 12 1 Tnterript ASI Registers ces accecsec ovsscheschsedeteeceuscevasdscdysaseeeus ESA coe EEA E staneee abies neat 299 12 1 1 Interrupt Vector Dispatch Register ccccecesceesssceseneceeeeeeseeeeeeneeeeseeeeseeensns 299 12 1 2 Interrupt Vector Dispatch Status Register eccccsscecesceeeneeeeseeeesseeessneeeeees 299 12 1 3 Interrupt Vector Receive Register ccccceessceesscesenseeseeeeseneceeseeeeeneeesseeeeens 299 12 1 4 Logical Processor Interrupt ID Register 0 0
337. … 57
3.11 L2 cache Diagnostics & Control Accesses … 58
3.11.1 L2 cache Control Register … 58
3.11.2 L2 cache Tag Diagnostic Access … 61
3.11.3 L2 cache Data Diagnostic Access … 64
3.12 L3 cache Diagnostic & Control Accesses … 66
3.12.1 L3 cache Control Register … 66
3.12.2 Secondary L3 cache Control Register … 69
3.12.3 L3 cache Data ECC Fields Diagnostics Accesses … 70
3.12.4 L3 cache Data Staging Registers … 70
3.12.5 Direct L3 cache Tag Diagnostics Access and Displacement Flush … 71
3.12.6 L3 cache SRAM Mapping … 75
3.13 Summary of ASI Accesses in L2/L3 Off Mode … 75
3.14 ASI SRAM Fast Init … 76
3.14.1 ASI_SRAM_FAST_INIT Definition … 77
3.14.2 ASI_SRAM_FAST_INIT_Shared Definitio
338. egister read access is shown in TABLE 3-43.

TABLE 3-43  Write Cache Diagnostic State Access Read Data Format
Bits 63:0  entry31_state[1:0], entry30_state[1:0], …, entry2_state[1:0], entry1_state[1:0], entry0_state[1:0]: the 2-bit state of each of the 32 W-cache entries, entry 0 through entry 31, packed two bits per entry.

Note: A MEMBAR #Sync is required before and after a load or store to ASI_WCACHE_STATE.

Write Cache Diagnostic Data Register Access
ASI 0x39, per logical processor
Name: ASI_WCACHE_DATA

The address format for W-cache diagnostic data access is shown in TABLE 3-44.

TABLE 3-44  Write Cache Diagnostic Data Access Address Format
Bits 63:12  Mandatory value; should be 0.
Bit 11  WC_ecc_error: A 1-bit field that selects between the ecc_error and data arrays; 1 = ecc_error, 0 = data.
Bits 10:6  WC_entry: A 5-bit index (VA[10:6]) that selects a W-cache entry.
Bits 5:3  WC_dbl_word: A 3-bit field that selects one of 8 doublewords read from the Data Return. When reading from ECC, bit 5 determines which bank of ECC bits is read.
Bits 2:0  Mandatory value; should be 0.

TABLE 3-45  Write Cache Diagnostic Data Access Data Format
Bits 63:0  wcache_data: The data format for W-cache diagnostic data access when WC_ecc_error = 0; wcache_data is a doubleword of W-cache data.
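As a non-normative illustration of the address format in TABLE 3-44, the following C sketch composes the virtual address used with ASI_WCACHE_DATA from its three fields. The helper name is invented for this example; the actual access is performed with a privileged LDXA or STXA to ASI 0x39.

```c
#include <stdint.h>

/* Illustrative only: build the VA for a W-cache diagnostic data access
 * (ASI_WCACHE_DATA, ASI 0x39) from the fields of TABLE 3-44:
 *   bit  11    WC_ecc_error  1 = ecc_error array, 0 = data array
 *   bits 10:6  WC_entry      selects one of the 32 W-cache entries
 *   bits 5:3   WC_dbl_word   selects one of 8 doublewords
 * All other bits must be zero. */
static inline uint64_t wcache_diag_va(unsigned ecc_sel, unsigned entry,
                                      unsigned dword)
{
    return ((uint64_t)(ecc_sel & 0x1)  << 11) |
           ((uint64_t)(entry   & 0x1f) << 6)  |
           ((uint64_t)(dword   & 0x7)  << 3);
}
```

For example, wcache_diag_va(0, 12, 5) addresses doubleword 5 of W-cache entry 12 in the data array. Bracketing the diagnostic access with MEMBAR #Sync, as required above for ASI_WCACHE_STATE, is a safe practice here as well.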
339. egister will be used to predict one non-RETURN JMPL after it is written, if JPE is enabled in the DCR. After one prediction it is invalidated until the next write. This register is written as a side effect of a MOVcc with %o7 (r15) as the destination register.
5. Implies that no dcache_parity_error trap (TT = 0x071) will ever be generated. However, parity bits are still generated automatically and correctly by hardware.
6. Implies that no icache_parity_error trap (TT = 0x072) will ever be generated. However, parity bits are still generated automatically and correctly by hardware.
7. This enable bit is cleared by hardware at power-on. System software must set the bit as needed. When this bit is enabled, the UltraSPARC IV processor forces an fp_disabled trap when an interrupt occurs on floating-point-only code. The trap handler is then responsible for checking whether floating point is indeed disabled; if it is not, the trap handler enables interrupts to take the pending interrupt. This behavior deviates from the SPARC V9 trap priorities in that interrupts are of lower priority than the other two types of floating-point exceptions (fp_exception_ieee_754, fp_exception_other). This mechanism is triggered for a floating-point instruction only if none of the approximately 12 preceding instructions across the two integer, load/store, and branch pipelines are valid, under the assumption that they are better suited to t
340. eld Data … 44
Format for Reading Lower Bits of Valid Predict Tag Field Data … 44
The address format for the I-cache snoop tag … 45
The Data Format of I-cache Snoop Tag … 45
Instruction Prefetch Buffer Data Access Address Format … 45
Instruction Prefetch Buffer Data Format … 46
Instruction Prefetch Buffer Tag Valid Field Read Access Data Format … 46
Instruction Prefetch Buffer Tag Field Write Access Data Format … 46
Branch Prediction Array Access Address Format … 47
Branch Prediction Array Data Format … 47
Branch Target Buffer Access Address Format … 47
Branch Target Buffer Data Format … 48
Data Cache Data Parity Access Address Format … 48
Data Cache Data Access Data Format … 48
Data parity bits … 49
Data Cache Tag Valid Access Address Format … 49
Data Cache Data Access Data Format When DC_data_parity = 1 … 49
Data Cache Microtag Access Address Format …
341. ement flushed is in NA state it will not be written out to the memory The line remains in NA state in L3 cache In L3 cache tag SRAM off mode displacement flush is treated as a NOP Note For L3 cache displacement flush use only LDXA STXA has NOP behavior Since L3 cache will return garbage data to the MS pipeline it is recommended to use Idxa reg_address ASI_L3CACHE_TAG g0 instruction format On read this bit is don t care On write if the bit is set write the 9 bit ECC field of the selected L3 cache tag entry with the value calculated by the hardware ECC generation logic 25 hw_gen_ecc inside the L3 tag SRAM When this bit is not set the ECC generated by hardware ECC generation logic is ignored what is specified in the ECC field of the ASI data register will be written into the ECC field of the selected L3 cache line A 2 bit entry that selects an associative way 4 way associative 24 23 EC_way 2 b00 Way 0 2 b01 Way 1 2 b10 Way 2 2 b11 Way 3 22 6 EC_tag_addr Specifies the L3 cache tag set PA 22 6 for the read write access 5 0 Mandatory value Should be 0 Caches Cache Coherency and Diagnostics 72 un microsystems The data format for L3 cache tag diagnostic access when disp_flush 0 is shown in TABLE 3 70 TABLE 3 70 L3 cache Tag Diagnostic Access Bits Field Description 63 44 Mandatory value Should be 0 43 24 EC_Tag A 20 bit tag PA 42 23 23 15
342. enabled A logical processor suspended for debug or diagnostics is considered enabled State After Reset The LP Enable Status register changes only at system resets or power on reset The logical processor enable status register value is set by hardware to the value of the LP Enable register at the deassertion of reset LP Enable Register ASI_CORE_ENABLE The LP Enable register illustrated in TABLE 2 6 is used by software to enable disable logical processor s The enable disable action takes effect only when a power on reset or a system reset Soft POR is deasserted Name AST_CORE_ENABLE ASI 0x41 VA 63 0 0x20 Privileged Read Write TABLE 2 6 LP Enable Register Shared 63 2 Mandatory value Should be 0 II LP 1 This bit represents LP 1 0 LP 0 This bit represents LP 0 The LP Enable register is a 64 bit register Each bit of the register represents one logical processor with bit 0 representing LP 0 and bit 1 representing LP 1 A bit set to 1 means a logical processor should be enabled after the next system reset and a bit set to 0 means a logical processor should be disabled after the next reset Note Bits 63 2 are forced to 0 since their corresponding logical processors are not implemented in the UltraSPARC IV processor Chip Multithreading CMT 16 amp Sun microsystems 2 4 3 2 4 3 1 If a bit in the LP Available register is 0 unavailable hardwar
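For illustration, the fragment below shows one way software might form the value written to the LP Enable register (ASI_CORE_ENABLE, ASI 0x41, VA 0x20) so that both implemented logical processors come up enabled after the next reset. This is a sketch only; the macro names are invented, and the actual write is a privileged STXA.

```c
#include <stdint.h>

/* Bit n of the LP Enable register represents logical processor n.
 * Only LP 0 and LP 1 are implemented; bits 63:2 must remain 0. */
#define LP0_ENABLE  (1ULL << 0)
#define LP1_ENABLE  (1ULL << 1)

/* Value to store (privileged STXA, ASI 0x41, VA 0x20).  The new setting
 * takes effect only when a power-on reset or system reset is deasserted. */
static const uint64_t lp_enable_both = LP0_ENABLE | LP1_ENABLE;
```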
343. ending So a data_access_error trap routine might see multiple events logged in AFSR associated with both data and instruction access errors and might also see an L2 cache or L3 cache error event logged The L2 cache or L3 cache error event would normally be associated with a precise trap but the deferred trap happened to arrive at the same time and make the precise trap no longer pending If the deferred trap routine executes RETRY an unlikely event in itself then the precise trap may become pending again but this would depend on the L2 cache or L3 cache giving the same error again Error Handling 216 amp Sun microsystems 7 8 4 Error Handling Pending disrupting traps are affected by the current state of PSTATE IE and PSTATE PIL All the disrupting traps interrupt_vector ECC_error and interrupt_level_ 1 15 are temporarily inhibited i e their pending status is hidden when PSTATE IE is 0 Interrupt_level_ 1 15 traps are also temporarily inhibited when PSTATE PIL is greater than or equal to the interrupt level As an example consider the following sequence of events 1 An interrupt vector arrives with a HW_corrected system bus data ECC error This makes ECC_error and interrupt_vector traps pending The processor continues to execute instructions looking for a BR MS AO or A1 pipe instruction 2 The processor executes a system bus read the RTO associated with an earlier store like instruction and detects a DSTAT 2
344. eporting Summary on page 192 e Overwrite Policy on page 198 e Multiple Errors and Nested Traps on page 199 e Further Details on Detected Errors on page 200 e Further Details of ECC Error Processing on page 213 e IERR PERR Error Handling on page 220 e Behavior on L2 cache DATA Error on page 222 e Behavior on L3 cache DATA Error on page 225 e Behavior on L2 cache TAG Errors on page 230 e Behavior on L3 cache TAG Errors on page 237 e Behavior on System Bus Errors on page 241 7 1 Error Handling Error Classes The three main classes or types of errors that occur are 1 Hardware correctable errors These errors are corrected automatically by the hardware For some hardware correctable errors a trap is optionally generated to log the error condition 2 Software correctable errors These errors are not corrected automatically by the hardware but are correctable by the software 3 Uncorrectable errors These errors are not correctable by either the software or the hardware 1 This user s manual refers to the level one instruction cache and data cache as I cache and D cache the prefetch cache as P cache the write cache as W cache the level two cache as L2 cache and the level three external cache as L3 cache 143 amp Sun microsystems These three main classes of errors are handled by normal recovery mechanisms and result in the following types of traps Precise traps These errors are either signaled as software correctabl
345. eptions specific to the UltraSPARC IV processor are described in TABLE 8 1 TABLE 8 1 Exceptions Specific to the UltraSPARC IV Processor Exception or D ipti Interrupt Request Ge Global Register Set Priority Taken on software correctable L2 cache or L3 cache data ECC errors uncorrectable L2 cache tag data ECC errors or L3 cache data ECC errors detected as a result of a D cache load miss atomic instruction or I cache miss for instruction fetch The trap handler is required to flush the cache line containing the error from the D cache L2 cache and L3 cache because incorrect data would have already been written into the D cache The UltraSPARC IV processor hardware will automatically correct single bit ECC errors on the L2 cache writeback and L3 cache writeback when the trap handler performs the L2 cache flush and L3 cache flush After the caches are flushed the instruction that encountered the error should be retried the corrected data will then be brought back in from memory and reinstalled in the D cache and L2 cache fast_ECC_error On fast_ECC_error detection during D cache load miss fill D cache installs the uncorrected data Because the fast_ECC_error trap is precise hardware can rely on software to help clean up the bad data In case of I cache miss however bad data never gets installed in the I cache Unlike D cache and I cache parity error a D cache I cache miss request that returns with fast_ECC_error will
346. equired to write the D TLB Tag Access Extension Register AST_DMMU_TAG_ACCESS_EXT with page size information since ASI_DTLB_DATA_ACCESS_REG gets the page size from the TTE data But it is recommended that software writes the proper page size information to AST_DMMU_TAG_ACCESS_EXT before writing to ASI_DTLB DATA _ACCESS_REG D TLB Tag Read Register See Tag Read Register on page 333 for details about ASI_DTLB_TAG_READ_REG ASI 5E j Demap Operation For a demap page in the large D TLBs the page size used to index the D TLBs is derived based on the Context bits primary secondary nucleus The hardware will automatically select the proper PgSz bits based on the Context field primary secondary nucleus defined in ASI_DMMU_DEMAP ASI 5Fj These two PgSz fields are used to properly index the T512_0 and T512_1 Demap operations in the T16 are single cycle operations all matching entries are demapped in one cycle The Demap operations in the T512_0 and T512_1 are multi cycle operations The demap operation is done in parallel for both TLBs one entry at a time for all 512 entries Note When global pages are used G 1 any active page in a given T512 must have the same page size When pages with G 1 in a T512 have variety of page sizes the T512 cannot index and locate the page correctly when trying to match the VA tag without t
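Because the page size selects which virtual-address bits index the 512-entry D-TLBs, the index selection can be expressed compactly. The C sketch below restates the selection listed in this chapter (VA[20:13] for 8 KB pages, VA[23:16] for 64 KB, VA[26:19] for 512 KB, VA[29:22] for 4 MB, VA[32:25] for 32 MB, VA[35:28] for 256 MB); the function name and the 3-bit page-size codes shown are assumptions made for the example, not definitions from this manual.

```c
#include <stdint.h>

/* Illustrative T512 set-index selection.  The page-size codes
 * (0 = 8 KB ... 5 = 256 MB) are assumed for the example only. */
static unsigned t512_index(uint64_t va, unsigned pgsz_code)
{
    /* Low bit of the 8-bit index field for each page size:
     * 8 KB -> VA[20:13], 64 KB -> VA[23:16], 512 KB -> VA[26:19],
     * 4 MB -> VA[29:22], 32 MB -> VA[32:25], 256 MB -> VA[35:28]. */
    static const unsigned shift[6] = { 13, 16, 19, 22, 25, 28 };
    return (unsigned)((va >> shift[pgsz_code]) & 0xff);
}
```

This also makes the note above concrete: if global (G = 1) pages of different sizes were installed in the same T512, the same virtual address would be indexed with different shifts, and the entry could not be located reliably without the page size as a qualifier.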
347. er logical processor before accessing L3 cache data staging registers If both logical processors access the L3 cache data staging registers the behavior is undefined If L3 cache data diagnostic access encounters a L3 cache data CE the returned data will be corrected if EC_ECC_en in L3 cache control register ASI 7546 is asserted otherwise the raw L3 cache data will be returned Direct L3 cache Tag Diagnostics Access and Displacement Flush ASI Ae Read and Write Shared by both logical processors Caches Cache Coherency and Diagnostics 71 un microsystems VA 63 27 0 VA 26 disp_flush VA 25 hw_gen_ecc VA 24 23 EC_way VA 22 6 EC_tag_addr VA 5 0 0 Name ASI_L3CACHE_TAG T The address format for L3 cache tag diagnostic access is shown in TABLE 3 69 TABLE 3 69 L3 cache tag diagnostic access 63 27 Mandatory value Should be 0 Specifies the type of access 1 b1 displacement flush if the specified L3 cache line is dirty Oe state is equal to Modified or Owned or Owned Shared write the line back to memory and invalidate the L3 cache line if the line is clean i e state is equal to Shared or Exclusive update the state to Invalid state The ASI data portion is unused when this bit is set 1 b0 direct L3 cache tag access read write the tag state ECC and LRU fields of the 26 disp_flush selected L3 cache line In L3 cache tag SRAM on mode if the line to be displac
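Given the VA layout just listed, the diagnostic address can be assembled mechanically. The sketch below is an illustration only (the helper name is invented); the access itself is the privileged LDXA from ASI_L3CACHE_TAG, for example in the ldxa [reg_address] ASI_L3CACHE_TAG, %g0 form recommended in this section for displacement flushes.

```c
#include <stdint.h>

/* Illustrative only: compose the VA for a direct L3-cache tag access or
 * displacement flush (ASI_L3CACHE_TAG):
 *   bit  26     disp_flush   1 = displacement flush, 0 = direct tag access
 *   bit  25     hw_gen_ecc   write path: use hardware-generated tag ECC
 *   bits 24:23  EC_way       selects one of the four ways
 *   bits 22:6   EC_tag_addr  L3-cache tag set, PA[22:6]
 * Bits 63:27 and 5:0 must be zero. */
static inline uint64_t l3_tag_diag_va(unsigned disp_flush, unsigned hw_ecc,
                                      unsigned way, uint64_t pa)
{
    return ((uint64_t)(disp_flush & 0x1) << 26) |
           ((uint64_t)(hw_ecc     & 0x1) << 25) |
           ((uint64_t)(way        & 0x3) << 23) |
           (pa & 0x7fffc0ULL);            /* keep PA[22:6] only */
}
```

Flushing a complete set would then use four such addresses, one per way, each read with LDXA; as noted in this section, STXA with disp_flush = 1 behaves as a NOP.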
348. error is logged in the AFSR and no trap is generated This bit enables the traps associated with the AFSR EMU EDU WDU CPU IVU UE DUE BERR DBERR TO DTO and IMU bits and AFSR_EXT L3_WDU L3_CPU L3_EDU bits Note Executing code with NCEEN clear can lead to the processor executing instructions with uncorrectable errors spuriously because it will not take traps on uncorrectable errors CEEN If set e A HW_corrected data or microtag ECC error detected as the result of a system bus read causes an ECC_error disrupting trap e A HW_corrected L2 cache L3 cache data error as the result of a store queue exclusive request or for a writeback or copyout will generate a disrupting ECC_error trap e A HW_corrected L2 cache L3 cache tag error will cause a disrupting ECC_error trap e If CEEN is clear the error is logged in the AFSR and no trap is generated This bit enables the errors associated with the AFSR EDC WDC CPC IVC CE EMC and IMC bits and AFSR_EXT L3_WDC L3_CPC L3_EDC bits 179 amp Sun microsystems ere 7 3 3 1 7 3 3 2 Error Handling Note FSAPE in the UltraSPARC IV processor is renamed to FISAPE in the UltraSPARC IV processor to avoid the naming confusion between forcing Sun Fireplane Interconnect address parity error and forcing SDRAM address parity error AFSR Register and AFSR_EXT Register The Asynchronous Fault Status Register AFSR is presented as two separate registers the p
349. es Setting TOL 9 results in the Sun Fireplane Interconnect timeout period of 1 75 seconds A TOL value of 0 should not be used since the timeout could occur immediately or as much as 2 Sun Fireplane Interconnect cycles later TOL Timeout Period Timeout Period 37 34 TOL 3 0 0 22 1 212 728 2 230 3 23 4 32 5 2 6 232 31 T 233 233 32 Processor to Sun Fireplane Interconnect clock ratio bit 3 This field may only be written 33 CLKB during initialization before any Sun Fireplane Interconnect transactions are initiated ease refer to itiona ncoding in the Sun Fireplane Interconnect Configuration Pl fe Add l CLK Encoding in the Sun Fireplane I Config Register on page 292 0 Back to back Sun Fireplane Interconnect Request Bus Mastership Inserts a dead cycle in between bus masters on the Sun Fireplane Interconnect Request 32 DEAD 1 I dead cycle in b b he Sun Fireplane I Req Bus Sun Fireplane Interconnect and Processor Identification 293 un microsystems TABLE 11 4 FIREPLANE_ CONFIG 2 Register Format 2 of 2 31 30 Reserved Reserved for future implementation Processor to Sun Fireplane Interconnect clock ratio bit 2 This field may only be written during initialization before any Sun Fireplane Interconnect transactions are initiated 29 CLK 2 i refer to Additional CLK Encoding in the Sun Fireplane Interconnect Configuration Register on page 292 Debug 0 Up to 15 outstanding transactions allowed
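The bit positions given above for the Sun Fireplane Interconnect configuration register (TOL at bits 37:34, CLKB at bit 33, DEAD at bit 32) make field extraction straightforward. The helpers below are a hedged illustration only; the names are invented, and the register itself is read through its documented ASI, which is not shown here.

```c
#include <stdint.h>

/* Illustrative field extraction, using only the bit positions documented
 * above for the Sun Fireplane Interconnect configuration register. */
static inline unsigned fireplane_tol(uint64_t reg)  { return (unsigned)(reg >> 34) & 0xf; }
static inline unsigned fireplane_clkb(uint64_t reg) { return (unsigned)(reg >> 33) & 0x1; }
static inline unsigned fireplane_dead(uint64_t reg) { return (unsigned)(reg >> 32) & 0x1; }
```

As stated above, TOL = 0 should not be used; the TOL = 9 setting quoted in the text corresponds to a timeout period of roughly 1.75 seconds.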
350. es are ignored If the first occurrence of an ECC error is correctable then subsequent correctable errors within the 32 bytes are ignored A subsequent uncorrectable error will overwrite the syndrome of the correctable error stored in the internal error register At this point the internal error register is locked This applies to both HW_corrected and SW_correctable errors The internal error register is cleared and unlocked once the error has been logged in the AFSR When Are Traps Taken Precise traps such as fast_instruction_access_mmu_miss and fast_ECC_error are taken explicitly when the instruction with the problem is executed Disrupting and deferred traps are not associated with particular instructions In fact the processor only takes these traps when a valid instruction that will definitely be executed not discarded as a result of speculation is moving through particular internal pipelines TABLE 7 17 Traps and when they are taken Initiate trap processing E when a valid instruction is in Note The instruction_access_error events and many precise traps produce instructions that cannot be executed Therefore the instruction fetcher takes the affected instruction and dispatches it to the BR pipe specially marked to cause a trap instruction_access_error It is true to say that instruction_access_error traps are only taken when an instruction is in the BR pipe but this has no effect because the error creates an instruc
351. essor does encounter a software correctable L2 cache ECC error while executing the fast_ECC_error trap handler the processor may recurse into RED_state and not make any record in the AFSR of the event leading to difficult diagnosis The processor will set the AFSR ME bit for multiple software correctable events but this is expected to occur routinely when an AFAR and AFSR is captured for an instruction which is prefetched automatically by the instruction fetcher then discarded The fast_ECC_error trap uses the alternate global registers If a software correctable L2 cache error occurs while the processor is running some other trap which uses alternate global registers such as spill and fill traps there may be no practical way to recover the system state The fast_ECC_error routine should note this condition and if necessary reset the domain rather than recover from the software correctable event One way to look for the condition is to check whether the TL of the fast_ECC_error trap handler is greater than 1 Uncorrectable L2 cache Data ECC Errors Uncorrectable L2 cache data ECC errors occur on multi bit data errors detected as the result of the following transactions e Reads of data from the L2 cache to fill the D cache e Reads of data from the L2 cache to fill the I cache e Performing an atomic instruction These events set AFSR UCU An uncorrectable L2 cache data ECC error which is the result of an I fetch or a data read caused by
352. eveeseugbucsdtestadeeuecvecvaces 121 Floatinge PointAdditigi ga cidteeshcevensdealieaetis deed SERA 123 Floating Point Subtraction 124 Floating Point Multiplication 125 Floating Point DIVISION EREECHEN EE 126 Floating Poinit Square Root ee nies dade een Reece tent 127 Number Compare ot eier EES a A A Es 127 Eeer 128 Floating Point to Integer Number Conversion 129 Integer to Floating Point Number Conversion 130 Floating Point Unit Exceptions sud te e aE ese aet 132 R sponse to Traps geet sonene i e E a E Ae EEAS Aa Sni E AEE Ra eiai aias 132 Floating Point Integer Conversions That Generate Inexact Exceptions eseeeeeeeeeeeererereeees 134 Underflow Exc ption Summary tenira eer EES EENS ee ie 135 Results From NaN Operands AAA 137 Subnormal Handling Constants per Destination Register Precision 139 I cache Tag Data Parity Errors o c ecececcescessesscsseescesceseeseeseesecaeeaecncesceseeseesecsecaecaeceeeeeeeeeeaeeaeeaeeaees 149 Iscache Data Parity EION arrani a aa a a E OE ER ene cadesitiien hee gues 149 D cache Parity Generation for Load Miss Fill and Store Update ssssesesseseeeeseseeeeeerererererrersrsees 152 D c che Tag Data Parity EE 155 CMT Error Steering Register aprioran tiee En VEENVAT SUAE EEEO NEES EE E ei eS 177 Error Enable Register After Reset cipenarsinnrec reei ee e AAEE R E Rienie 178 Asynchronous Fault Status Register cceeceeccescessesseseeseeeeeeeececeeseeseeseesecaecaecaeseeeeeeeeaeeaeeaeeaeeaes 1
353. events AFSR L3_THCE is set and a disrupting ECC_error trap is generated For all hardware corrections of L3 cache tag ECC errors not only does the processor hardware automatically correct the erroneous tag but it also writes the corrected tag back to the L3 cache tag RAM Diagnosis software can use correctable error rate discrimination to determine if a real fault is present in the L3 cache tags rather than a succession of soft errors L3_TUE For any access except ASI access and SIU tag update or copyback from foreign bus transaction and snoop request if an uncorrectable L3 cache tag error is discovered the processor sets AFSR TUE The processor asserts its ERROR output pin and it is expected that the coherence domain has suffered a fatal error and must be restarted 209 amp Sun microsystems 7 1 5 Error Handling L3_TUE_SH For any access for SIU tag update or copyback from foreign bus transaction and snoop request if an uncorrectable L3 cache tag error is discovered the processor sets AFSR TUE The processor asserts its ERROR output pin and it is expected that the coherence domain has suffered a fatal error and must be restarted System Bus ECC Errors CE When data are entering the UltraSPARC IV processor from the system bus the data will be checked for the correctness of its ECC If a single bit error is detected the CE bit will be set to log this error condition A disrupting ECC_error trap will be gene
354. except for integer LDD and LDDA For these instructions which access two or four 32 bit words data and physical tag parity is only checked for the first 4 byte 32 bit word retrieved Thus the parity error detected in the second access of integer LDD LDDA in which helper is used is suppressed In the event of the simultaneous detection of a D cache parity error and a D TLB miss the dcache_parity_error trap is taken When this trap routine executes RETRY to return to the original code a fast_data_access_MMU_miss trap will be generated Notes on D cache Data Parity Error Traps D cache Data Parity will not be taken under the following conditions e When D Cache is disabled DCR s DPE 0 e Parity errors detected for any type of Stores e Parity errors detected for Block loads e Parity errors detected for Atomic e Parity errors detected for Quad loads e Parity errors detected for integer LDD second access only or helper access e Parity errors detected for Internal ASI loads D Cache Data parity Traps will not be suppressed on Data Parity Error that occurs for load that misses D Cache or causes other types of recirculates Notes on D cache Physical Tag Parity Error Traps D cache Physical Tag Parity will not be taken under the following conditions e When D Cache is disabled DCR s DPE 0 e Parity errors detected for a invalid line e Parity errors detected with microtag miss e Parity errors detected for Block
355. f 5 Fields Hard_POR Unknown Unchanged Unchanged RED state Reset and RED_ state CLEANWIN Unknown Unchanged Unchanged WSTATE OTHER Unknown Unchanged Unchanged NORMAL Unknown Unchanged Unchanged MANUF 003E16 IMPL 001916 VER MASK Mask dependent MAXTL 5 MAXWIN 7 FSR all 0 0 Unchanged FPRS all Unknown Unchanged nchanged Non SPARC V9 ASRs SOFTINT Unknown Unchanged Unchanged TICK COMPARE INT_DIS 1 off 1 off Unchanged S TICK_CMPR_ 0 Unchanged STICK NPT 1 Unchanged counter 0 Count STICK COMPARE INT_DIS 1 off 1 off Unchanged TICK_CMPR 0 0 Unchanged sl Unknown nchanged Unchanged so Unchanged Unknown nchanged Unchanged UT Unknown nchanged PCR trace user Unchanged ST Unknown nchanged trace system S Unchanged PRIV priv access Unknown nchanged PIC all Unknown nknown Unknown GSR IM 0 Unchanged others Unknown nchanged Unchanged DCR all 0 Unchanged Non SPARC V9 ASIs sun Pireplane Interconnect See RED_state and Reset Values on page 296 Information WE O oft Doft Unchanged DCUCR e Ge 0 off all others 0 off 0 off Queue_timeout 16 h0e00 Unchanged L2 cache Control Register 3 b111 Unchanged all others 0 87 un microsystems TABLE 4 1 Machine State After Reset and RED_state 3 of 5 Name Fields Hard_POR RED_state 45 h0f0038ad Unchanged siu_data_mode 0800 addr
356. f the corresponding error event can be traced back to the logical processor that caused the error If the error can be traced back to a particular logical processor it is listed as a private event If the error cannot be traced back to a particular logical processor it is listed as a shared event For shared error events the CMT Error Steering register AST_CMP_ERROR_STEERING determines which logical processor the error should be reported to See CMT Error Steering Register ASI_CMP_ERROR_STEERING on page 177 for details about how the CMT Error Steering register is used to report shared errors 197 amp Sun microsystems 7 5 Poul Error Handling Overwrite Policy This section describes the overwrite policy for error status when multiple errors have occurred Errors are captured in the order that they are detected not necessarily in program order This policy applies only to AFSR1 and AFARI AFAR2 and AFSR2 have no overwrite mechanism they are either frozen or unfrozen AFSR2 and AFAR2 capture the first event after AFSR1 status lock bits are all cleared The overwrite policies are described by the priority entries in TABLE 7 16 The descriptions here set the policies out at length For the behavior when errors arrive at the same time as the AFSR is being cleared by software see Clearing the AFSR and AFSR_EXT on page 182 AFARI Overwrite Policy Class 5 PERR the highest priority Class 4 UCU
357. f width Informational Notes This guide provides several different types of information in notes as follows Note This highlights a useful note regarding important and informative processor architecture or functional operation This may be used for purposes not covered in one of the other notes XXIV amp Sun microsystems Preface XXV amp Sun microsystems Preface XXVi amp Sun microsystems Architectural Overview This chapter provides an overview of the UltraSPARC IV processor focusing on its differences from the UltraSPARC III and UltraSPARC IV processors The UltraSPARC IV processor is a high performance processor targeted at the enterprise server market and is intended as an upgrade product for the Sun Fire family of servers Compared with the UltraSPARC III IV processors the UltraSPARC IV processor has an equivalent footprint fits within a compatible thermal and power budget and is designed to be incorporated into interchangeable motherboards based on the Sun Fireplane Interconnect Chapter Topics e Introduction on page 1 e Chip Multithreading CMT on page 2 e Enhanced Core Design on page 2 e Cache Hierarchy on page 5 e System Interface and Memory Controller Enhancements on page 6 e Enhanced Error Detection and Correction on page 7 1 1 Introduction Although the UltraSPARC IV processor is primarily targeted at addressing the demands of commercial computing it offers subst
358. finity one set 0 Infinity None set one set Infinity Asserts ufc pc IEEE trap enabled Normal Infinity Normal Normal Normal Normal Asserts ufc Infinit y nvc ufa nva Can overflow See 6 5 3 Normal None set Normal Asserts ofc pc IEEE trap enabled Can overflow See 6 5 3 None set Normal Normal Can underflow See 6 5 4 Can underflow See 6 5 4 Normal Normal Infinity 0 Normal Infinity 0 Normal Can underflow See 6 5 4 Infinity None set Infinity None set Infinity Infinity Can underflow See 6 5 4 None set Infinity Infinity Infinity Infinity Infinity Infinity Asserts nvc nva QNaN Asserts nvc No IEFE trap enabled Infinity None set Infinity Infinity None set Infinity None set 1 IEEE trap means fp_exception_IEEE_754 IEEE 754 1985 Standard 124 un microsystems 6 3 3 TABLE 6 5 FMUL rs rsz fra rs gt rd Multiplication Floating Point Multiplication MULTIPLICATION 3 Instruction Result from the operation includes one or more of the following Number in f register See Trap Event on page 132 Exception bit set See TABLE 6 12 Trap occurs See abbreviations in TABLE 6 12 Underflow overflow can occur Masked Exception TEM 0 Enabled Exception TEM 1 Destination Register Destina
359. for processing Precise Trap Priority All traps with priority 2 are precise traps Miss error traps with priority 2 that arrive at the same time are processed by hardware according to their age or program order The oldest instruction with an miss error will get the trap However there are two cases where the same instruction generates multiple traps Case 1 Singular trap type with highest priority The processing order is determined by the priority number lowest number that has the highest priority is processed first Case 2 Multiple traps having same priority For trap priority 2 the only possible combination is simultaneous traps due to I cache parity error and I TLB miss In this case the hardware processing order is icache_parity_error gt fast_instruction_access_MMU_umiss All other priority 2 traps have staggered arrivals and therefore will not result in simultaneous traps D cache access is further down the pipeline after instruction fetch from I cache Thus D cache parity error on a load instruction if any will be detected after I cache parity error if any and I TLB miss if any The other priority 2 trap fast_ECC_error can only be caused by an I cache miss or D cache load miss therefore it arrives even later To summarize precise traps are processed in the following order program order gt trap priority number gt hardware implementation order 8 4 8 4 1 I cache Parity Error Trap An I cache physical ta
360. g or data parity error results in an icache_parity_error precise trap Hardware does not provide any information as to whether the icache_parity_error trap occurred due to a tag or a data parity error Hardware Action on Trap for I cache Data Parity Error Parity error detected during I cache instruction fetch will take an icache_parity_error trap TT 0x072 priority 2 globals AG Exceptions Traps and Trap Types 260 un microsystems Note I cache data parity error detection and reporting is suppressed if DCR IPE 0 or cache line is invalid Precise traps are used to report an I cache data parity error I cache and D cache are immediately disabled by hardware when a parity error is detected by clearing DCUCR IC and DCUCR DC bits In the trap handler software should invalidate the entire I cache D cache and P cache See cache Error Recovery Actions on page 146 for details No special parity error status bit or address information will be logged in hardware Because icache_parity_error trap is precise software has the option to log the parity error information on its own I cache data parity error is determined based on per instruction fetch group granularity Unused or annulled instructions are not masked out during the parity check If an I cache data parity error is detected while in another event I cache miss or I TLB miss the behavior is described in TABLE 8 2 TABLE 8 2 I cache Data Parity Error Behav
361. ge size bits of each T16 entry Fast 3 bit encoding is used to define the 8 KB 64 KB 512 KB and 4 MB page sizes Since the T512s are not fully associative indexing the T512 arrays require knowledge of the page size to properly select the VA bits to be used as the index as shown below if an 8 KB page is selected array index VA 20 13 if a 64 KB page is selected array index VA 23 16 if a 512 KB page is selected array index VA 26 19 Instruction and Data Memory Management Unit 321 amp Sun microsystems d if a 4 MB page is selected array index VA 29 22 if a 32 MB page is selected array index VA 32 25 if a 256 MB page is selected array index VA 35 28 The Context bits are used after the indexed entry comes out of each array bank way to qualify the context hit There are 3 possible Context numbers active in the processor i e the primary PContext field in ASI_PRIMARY_CONTEXT_REG secondary SContext field in ASI_SECONDARY_CONTEXT_REG and nucleus default to Context 0 The Context register used to send to data to the D MMU is determined based on the load store s ASI encoding of primary secondary nucleus Since all 3 D TLBs are accessed in parallel software must guarantee that there are no duplicate stale entry hits Most of this responsibility lies with the operating system software with the hardware providing some assistance to support full software
362. ged S original data data Direct ASI L3 cache data write CG request with tag UE in the Let No error 32 byte or 2nd 32 byte L3 logged No overwrite N A No action No trap original cache data data Error Handling 228 un microsystems Error Handling TABLE 7 22 L3 cache Data Writeback and Copyback Errors 2nd 32 byte L3 cache data lower 16 byte data is LDDF or LDF or LDDFA with some ASIs or LDFA with some ASIs AFAR points to 16 byte boundary as dictated by the system bus Error Event logged in Data sent to SIU Comment AFSR L3 Writeback encountering CE in the 1st 32 byte or 2nd 32 byte L3 cache data L3_WDC Corrected data corrected ecc Disrupting trap EE L3_WDU a Se of EH a Ges Disrupting tra byte or 2nd 32 byte L3 cache data d P 8 Upp Pang uap lower 16 byte data Copyout hits in the L3 writeback buffer because the line is being victimized where a L3_WDC Corrected data corrected ecc Disrupting trap CE has already been detected Copyout hits in the L3 writeback buffer SIU flips the most significant 2 bits of data because the line is being victimized where a L3_WDU D 127 126 of the corresponding upper or Disrupting trap UE has already been detected lower 16 byte data Copyout encountering CE in the Ist 32 byte or Corrected data corrected ecc 2nd 32 byte L3 cache data L3 CPC Disrupting trap F i SIU flips the most significant 2 bits of data Copyout encountering VE imthe
363. ged unchanged CLK 0 updated unchanged MT 0 updated unchanged MR 0 updated unchanged FIREPLANE_CONFIG_2 AID 0 9 5 unchanged unchanged 4 0 updated DBG 0 unchanged unchanged DTL see TABLE 11 3 updated unchanged TOL 0 unchanged unchanged TOF 0 unchanged unchanged NXE 0 updated unchanged SAPE 0 updated unchanged CPBK_BYP 0 updated unchanged FIREPLANE_ADDRESS all unknown unchanged unchanged ASI_CESR_ID all 0 unchanged unchanged Sun Fireplane Interconnect and Processor Identification 296 un microsystems 1 The state of Module Type MT and Module Revision MR fields of FIREPLANE_PORT_ID after reset is not defined Typically software reads a module PROM after a reset then updates MT and MR 2 This field of the FIREPLANE_CONFIG register is not immediately updated when written by software Writes to this field of the FIREPLANE_CONFIG register have no visible effect until a reset occurs 3 This field of the FIREPLANE_CONFIG_2 register is not immediately updated when written by software Writes to this field of the FIREPLANE_CONFIG register have no visible effect until a reset occurs 11 3 11 3 1 ES e The UltraSPARC IV Processor Identification Version Register The 64 bit Version VER register is only accessible via the SPARC V9 RDPR instruction It is not accessible through an ASI or ASR In SPARC assembly language the instruction is rdpr Sver srd The VER mask field consists of two 4 bi
364. gical processor The specific semantics for accessing the CMT registers through the ASI interface are described in Accessing CMT Registers Through ASI Interface on page 12 Chip Multithreading CMT 11 amp Sun microsystems 2 2 2 Accessing CMT Registers Through ASI Interface Each CMT specific register is accessible through an ASI address a combination of an address space identifier value and virtual address All CMT registers are mapped into ASI values that are only accessible in privileged mode The specific ASI number and virtual address of each CMT register is covered later in this document Each logical processor can access only its own private registers Accesses by logical processors to their own associated private registers follow the standard semantics for accessing ASI mapped internal registers Each logical processor can access all the shared registers An update to a shared register from one logical processor will be visible to all other logical processors The ordering of accesses to shared registers from different logical processors is not defined but there are a number of hardware rules that are enforced The hardware guarantees that accesses to a shared register from the same logical processor follow sequential semantics The hardware also guarantees that if multiple logical processors attempt to write the same shared register at the same time after the updates the register contains the value from just one of
365. hanism When a load hits the P cache and the fetched_mode bit of the P cache line indicates that it was not brought in by a software prefetch the hardware prefetch engine attempts to prefetch the next 64 byte cache line from the L2 cache Depending on the prefetch stride the next line to be prefetched can be at a 64 128 or 192 byte offset from the P cache line that initiated the prefetch Thus when a floating point load at address A hits the P cache a hardware prefetch to address A 64 A 128 or A 192 will be initiated A hardware prefetch request will be dropped in the following cases e the data already exists in the P cache e the prefetch queue is full e the hardware prefetch address is the same as one of the outstanding prefetches e the prefetch address is not in the same 8 KB boundary as the line that initiated the prefetch e the request misses the L2 cache To enable hardware prefetching the Prefetch Cache Enable PE bit bit 45 and the Hardware Prefetch Enable HPE bit bit 44 in the Data Cache Unit Control Register DCUCR must be set The Programmable P cache Prefetch Stride PPS bits bit 51 50 of DCUCR determine the prefetch stride when hardware prefetch is enabled See Data Cache Unit Control Register DCUCR on page 273 for details Software Prefetching The UltraSPARC IV processor supports software prefetching through the PREFETCH A instructions Software prefetching can prefetch floating point data into
366. he 61 L3 cache 72 disrupting traps handling 257 D MMU and RED state 83 demap operation 326 disabled effect on D cache 331 Synchronous Fault Status Register D SFSR 335 TLB replacement policy 337 TSB Base register 334 TSB Extension registers 334 virtual address translation 320 DONE instruction exiting RED_state 84 268 xXX V UltraSPARC IV Processor User s Manual October 2005 flushing pipeline 25 when TSTATE uninitialized 85 D SFAR register exception address 64 bit 268 state after reset 88 D SFSR register FT fault type field 331 state after reset 88 DTLB Data Access register 332 Data In register 332 error detection 156 error recovery 156 miss refill sequence 312 330 state after reset 89 Tag Access Extension register 331 DVMA 84 E E_SYND overwrite policy 199 ECACHE_W EC_data 71 ECC error correction 259 Syndrome 186 overwrite policy 199 ECC_error exception 158 159 163 164 165 169 171 172 175 ECC_error trap 258 Error Enable Register state after reset 88 error registers 177 error_state and watchdog reset 85 errors and RED state 221 data cache 149 155 deferred how to handle 257 DTLB 156 handling IERR PERR 220 instruction cache 144 149 ITLB 155 L2 cache 157 163 L3 cache 163 169 memory 144 176 and hardware prefetch 174 and instruction fetch 174 and interrupt 176 and software prefetch 175 multiple 199 overwrite policy 198 199 prefetch cache 157 recoverable ECC errors 258 reporting 192 197
367. he L2 cache it will not exist in the L3 cache A software initiated the L2 cache flush which will flush the data back to the L3 cache is desirable so that the corrected data can be brought back from the L3 cache later Without the L2 cache flush a further single bit error is likely the next time this word is fetched from the L2 cache Multiple occurrences of this error will cause the AFSR1 ME to be set In the event that the L3_UCC event is for an instruction fetch which is later discarded without the instruction being executed no trap will be generated If both the L3_UCC and the L3_MECC bits are set it indicates that an address parity error has occurred L3_UCU When a cacheable load instruction misses the I cache or D cache or an atomic operation misses the D cache and it hits the L3 cache the line is moved from the L3 cache to the L2 cache and will be checked for the correctness of its ECC If a multi bit error is detected in critical 32 byte 205 amp Sun microsystems Error Handling data for load and atomic operations or in either critical or non critical 32 byte data for I cache fetch it will be recognized as an uncorrectable error and the UCU bit will be set to log this error condition A precise fast_ECC_error trap will be generated provided that the UCEEN bit of the Error Enable Register is set For correctness a software initiated flush of the D cache is required because the faulty word may already have been lo
368. he context number as a qualifier Instruction and Data Memory Management Unit 326 un microsystems 13 2 2 8 D TLB Access Summary TABLE 13 20 lists the D MMU TLB Access Summary TABLE 13 20 D MMU TLB Access Summary Software Operation Effect on MMU Physical Registers ree Register TLB tag array TLB data array Tag Access Contents returned Tag Read i No effect No effect No effect 8 From entry specified No effect by LDXA s LDDFA s access Tag No effect No effect Contents No effect No effect Access returned Data In Trap with data_access_exception Contents returned Data No effect From entry No effect No effect Access specified by No effect LDXA s LDDFA s access Load SFSR No effect No effect Contents returned No effect Contents SFAR No effect No effect No effect No effect returned Tag Read Trap with data_access_exception Tag No effect No effect Nae No effect No effect Access store data TLB entry determined TLB enuy Data In by replacement polic determined Dy eee Pone replacement policy No effect No effect No effect written with contents S written with store of Tag Access Register data TLB entry specified by TLB entry specified Data STXA s STDFA s by STXA s address written with STDFA s address No effect No effect No effect Store Access P contents of Tag Access written with store Register data SFSR No effect No effect No effect eee SS No effect SFAR No effect No
369. he line based on VA 5 3 A 2 bit entry that selects an associative way 4 way associative 20 19 l l 2 b00 Way 0 2 b01 Way 1 2 b10 Way 2 2 b11 Way 3 18 6 index A 13 bit index PA 18 6 that selects an L2 cache entry A 3 bit double word offset VA 5 3 000 selects L2_data 511 448 VA 5 3 001 selects L2_data 447 384 VA 5 3 010 selects L2_data 383 320 5 3 xw_offset VA 5 3 011 selects L2_data 319 256 VA 5 3 100 selects L2_data 255 192 VA 5 3 101 selects L2_data 191 128 VA 5 3 110 selects L2_data 127 64 VA 5 3 111 selects L2_data 63 0 2 0 Mandatory value Should be 0 During a write to AST_L2CACHE_DATA with ECC_sel 0 ECC check bits will not be written because there is no ECC generation circuitry in the L2 data write path and the data portion will be written based on the address data format During an ASI read with ECC_sel 0 the data portion will be read out based on the address format T During an ASI write with ECC_sel 1 ECC check bits will be written based on the ASI address data format During an ASI read with ECC_sel 1 the ECC check bits will be read out based on the address format VA 4 3 in the address format is not used when ECC_sel 1 because ECC is 16B boundary and ASI_L2CACHE_DATA returns the ECC for both the high and the low 16B data based on VA 5 T The data format for L2 cache data diagnostic access when ECC_sel 0 and ECC
370. he physical tag or the data is in error and maybe even the bit in error by accessing the D cache in its diagnostic address space However this attempt would depend on the D cache continuing to return an error for the same access and the entry not being displaced by coherence traffic This means that attempting to pinpoint the erroneous bit may often fail When the processor takes a dcache_parity_error trap it automatically disables both the I cache and D cache by setting DCUCR DC and DCUCR IC to 0 This prevents recursion in the dcache_parity_error trap routine when it takes a D cache miss for a D cache index which contains an error It also prevents an I cache error causing a trap to the icache_parity_error routine from within the dcache_parity_error routine When the primary caches are disabled in this way program accesses can still hit in the L2 cache so throughput will not degrade excessively Actions required of the dcache_parity_error trap routine e Log the event e Clear the I cache by writing the valid bit for every way at every line to 0 with a write to ASI_ICACHE_TAG e Clear the IPB by writing the valid bit for every entry of the IPB tag array to 0 with a write to AST_IPB_TAG e Clear the D cache by writing the valid bit for every way at every line to 0 with a write to ASTI_DCACHE_TAG Also initialize the D cache by writing distinct values to ASI_DCACHE_UTAG and writing 0 to AST_DCACHE_DATA both data and parity
371. he request is retried 239 un microsystems Error Handling TABLE 7 25 1L3 cache Tag CE and UE Errors 4 of 4 Errors logged Pipeline Event in AFSR L3 cache Tag Li cache data ENEE Comment L3 cache Tag update Disrupting trap L3 Pipe corrects i request by SIU local L3_TUE No action the request is 8 the tag 1 transaction with tag UE retried L3 cache Tag update Disrupting trap request by SIU foreign L3_TUE_SH Original tag No action the request is transaction with tag UE dropped sU TON Sp iab Disrupting trap copyback request wit 1pe corrects tag CE L3_THCE the tag N A No action the request is retried See bask Disrupting trap copyback request ki with tag UE L3_TUE_SH Original tag N A N A the request is dropped Direct ASI L3 cache t eee irec cache tag ee read request with tag CE No error logged Original tag N A No action ASI access get original tag Direct ASI L3 cache t e irec cache tag vc read request with tag UE No error logged Original tag N A No action ASI access get original tag ASI L3 cache L3 Pi Disrupting trap displacement flush read L3_THCE ee N A No action the request is i corrects the tag 1 request with tag CE retried ASI L3 cache Disrupting trap displacement flush read Original tag N A No action the request is request with tag UE dropped ASI tag write request with No error logged new tag written N A No acion No trap tag CE in on Ke Ee No error
372. he using writes to ASI_DCACHE_TAG to zero the valid bits in the tags 3 Evict the L2 cache line that contained the error This requires four LDXA AST_L2CACHE_TAG operations using the disp_flush addressing mode to evict all four ways of the L2 cache L2 cache and L3 cache are mutually exclusive When a read miss from primary caches hit in L3 cache the line will be moved from L3 cache to L2 cache If the line contains the error it is still moved to L2 cache without any change Therefore the recovery actions for software correctable L3 cache ECC error must displacement flush the line in L2 cache A single bit data error will be corrected when the processor reads the data from the L2 cache and writes it to the L3 cache to perform the writeback This operation may set the AFSR WDC or AFSR WDU bits If the offending line was in O Os or M MOESI state and another processor happened to read the line while the trap handler was executing the AFSR CPC or AFSR CPU bits could be set A single bit tag error will be corrected when the processor reads the line tag to update it to I state This operation may set AFSR THCE 4 Evict the L3 cache line that contained the error This requires four LDXA ASI_L3CACHE_TAG operations using the L3 cache disp_flush addressing mode to evict all four ways of the L3 cache to memory When the line is displacement flushed from L2 cache if the line is not in I state then it will be writeback to L3
373. he valid bits must be cleared to keep cache consistency Instruction Cache Tag Valid Fields Access ASI 6716 per logical processor Kal Name ASI_ICACHI _ TAG The address format for the instruction cache tag and valid fields are shown in TABLE 3 14 TABLE 3 14 Instruction Cache Tag Valid Access Address Format Bits Field Description 63 17 Mandatory value Should be 0 A 2 bit field that selects a way 4 way associative 16 15 IC_way 2 b00 Way 0 2 b01 Way 2 b10 Way 2 2 b11 Way3 IC_addr 14 7 corresponds to VA 13 6 of the instruction address It is used to index the 14 7 IC_addr physical tag microtag and valid load predict bit arrays Since the I Cache line size is 64bytes sub blocked as two 32bytes IC_addr 6 is used to select a sub block of the valid load predict bit LPB array This sub block selection is not needed for physical and microtag arrays as they are common for both sub blocks Only one half of the load predict bit data from a cache line can be read in each cycle Hence IC_addr 5 is used to select the upper or lower half of the load predict bits 6 5 IC_addr IC_addr 5 0 Corresponds to the upper half of the load predict bits IC_addr 5 1 Corresponds to the lower half of the load predict bits The valid bit is always read out along with the load predict bits Note IC_addr 6 5 is a don t care in accessing physical tag and microtag arrays 4 3 IC In
374. hecking Sequence of icache_parity_error Trap TABLE 7 1 and TABLE 7 2 highlight the checking sequence for I cache tag data parity errors and IPB data parity error respectively TABLE 7 1 I cache Tag Data Parity Errors Page microtag Cache Line Physical Tag Fetch Group E DCR IPE Signal a Cacheable Hit Valid Parity Error Executed Parity Error i Trap Yes No X X Yes Yes No X Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes No Yes Yes Yes No Yes Yes Yes No Yes Yes Yes No TABLE 7 2 I cache Data Parity Error e GH IPB Tag Hit KE eer EE DCR IPE Bit Ga e Error Yes o Yes No Yes o Yes No Yes No Yes No Yes Yes 7 2 2 D cache Errors Parity error protection is provided for the D cache physical tag array snoop tag array and data array The D cache is clean meaning that the value in the D cache for a given physical address is always the same as that in some other store in the system This means that recovery from single bit errors in D cache need only parity error detection not full error correcting code The basic concept is that when an error is observed the D cache is invalidated and the access retried Both hardware and software methods are used in the UltraSPARC IV processor Parity error checking in physical tags snoop tags and D cache data is enabled by DCR DPE in the dispatch control register When this bit is 0 parity will still be correctly genera
375. her than the UE or DUE priorities 6 Ifthe access is a cacheable read transfer the data value from the system bus and the ECC present on the system bus are not written to the L2 cache 7 For AFSR TO a deferred instruction_access_error or data_access_error trap is generated The latter applies even if the event was a writeback from L2 cache not directly related to instruction processing 8 For AFSR DTO a disrupting ECC_error trap is generated It is possible that no destination asserts MAPPED when the processor attempts to send an interrupt This event also causes AFSR TO to be set and PRIV and AFAR to be captured although the exact meaning of the captured information is not clear A deferred data_access_error trap will be taken Copyout events from L2 cache or L3 cache cannot see an AFSR TO event because they are responses to Sun Fireplane Interconnect transactions not initiating Sun Fireplane Interconnect transactions System Bus Hardware Time outs The AFSR TO bit is set when no device responds with a MAPPED status as the result of the system bus address phase This is not a hardware time out operation In addition to the AFSR TO functionality there are hardware time outs that detect that an ASIC is taking too long to complete a system bus operation This time out might come into effect if for example a target device developed a fault during an access to it Hardware time outs are reported as AFSR PERR fatal error events
376. het cachel JEM ll livelock signal bere LPO L1 cache 12 to ic D a P W a e signal fill cache_fill cac Ie sio cache_fill cache_fill K ssen re ore LPI Ll cache 12_to_ic D D P W W 100100 Cache sto Cache st signal _ fill cache_fill cache_fill cache_fill fa 100101 Empty 0 0 0 100110 0 Registers ECC Error Correcting Code IU Integer Unit A few requirements and recommendations for using the DCR to control the observability bus are outlined below Use only one or the other mode of control for the observability bus that is control either by pulses at obsdata 0 or by programming of the DCR bits As long as the por_n pin is asserted the state of obsdata 9 0 will always be 0 Once the device has been reset the default state becomes visible on the bus Note that the DCR OBS control bits are reset to all 0 s on a software POR trap Until the DCR bit 11 is programmed to 1 the obsdata 0 mode of control has precedence There is a latency of approximately 5 cycles between writing the DCR and the signals corresponding to the setting appearing at obsdata 9 0 In the UltraSPARC IV processor there are two sets of DCR 11 6 one for each logical processor Either set can be used to control the state of obsdata 9 0 When one set is programmed with a value the other set must be programmed to 0 When a logical processor is disabled or parked its DCR 11 6 should be programmed to 0 Note Every valid
377. hich arrives as DSTAT 2 or 3 and is stored in the L2 cache with data bits 1 0 inverted can then be rewritten as part of a store queue operation with check bits 1 0 inverted and eventually written back out to the system bus with data bits 127 126 inverted The E_SYND reported for correction words with data bits 1 0 inverted is always Ox1 1c The E_SYND reported for correction words with data bits 127 126 inverted is always 0x071 The E_SYND reported for correction words with check bits 1 0 inverted is always 0x003 L2 cache and L3 cache Data ECC Errors ECC error checking for L2 cache data is turned on for all read transactions from the L2 cache whenever the L2_data_ecc_en bit of the L2 cache Control Register is asserted ECC error checking for L3 cache data is turned on for all read transactions from the L3 cache whenever the EC_ECC_ENABLE bit of the L3 cache Control Register is asserted 214 un microsystems 7 8 3 Error Handling The UltraSPARC IV processor can store only one data ECC syndrome and one microtag ECC syndrome for every 32 bytes of L2 cache or L3 cache data even though it is possible to detect errors in every 16 bytes of data The syndrome of the first ECC error detected whether HW_corrected SW_correctable or uncorrectable is saved in an internal error register If the first occurrence of an ECC error is uncorrectable the internal error register is locked and all subsequent errors within the 32 byt
378. his error condition A disrupting ECC_error trap will be generated in this case while hardware deliver the corrected data to P cache block load data buffer When a store instruction misses the W cache and hits the L3 cache in E S O Os or M state a store queue will issue an exclusive request The exclusive request will perform an L3 cache read and the data read back from the L2 cache will be checked for the correctness of its ECC If a single bit error is detected the L3_EDC bit will be set to log this error condition A disrupting ECC_error trap will be generated in this case while hardware writes the corrected data to the W cache When an atomic operation misses from the W cache and hits the L3 cache in E S O Os or M state a store queue will issue an exclusive request The exclusive request will perform an L3 cache read and the data read back from the L2 cache will be checked for the correctness of its ECC Ifa single bit error is detected in non critical 32 byte data the L3_EDC bit will be set to log this error condition A disrupting ECC_error trap will be generated in this case while hardware writes the corrected data to W cache When a software PREFETCH instruction misses the P cache and hits the L3 cache the line is moved from the L3 cache to the L2 cache and is checked for the correctness of its ECC Ifa single bit error is detected the L3_EDC bit will be set to log this error condition A disrupting ECC_error trap will
379. hough they re still remembered just hidden The processor will now be running with TL 1 When the instruction_access_error trap routine executes a BR or MS pipe instruction the data_access_error trap routine will run at TL 2 If that routine explicitly set PSTATE IE then the interrupt_vector and ECC_error traps would become pending again After the next BR MS AO or Al pipe instruction the processor after waiting for outstanding reads to complete would take the interrupt_vector trap which would run at TL 3 However assuming the data_access_error trap routine does not set PSTATE IE then it will run uninterrupted to completion at TL 2 It s a deferred trap so it s not possible to return to the TL routine correctly Recovery at this time is a matter for the system designer When Are Interrupts Taken The processor is only sensitive to interrupt_vector and interrupt_level_ 1 15 traps when a valid instruction is in the BR MS AO or A1 pipes If the processor is executing only FGA or FGM pipe instructions it will not take the interrupt This could lead to unacceptably long delays in interrupt processing 217 amp Sun microsystems 7 8 4 1 Error Handling To avoid this problem if the processor is handling only floating point instructions with PSTATE PEF and FPRS FEF both set and an interrupt_vector or interrupt_level_ 1 15 trap becomes pending and PSTATE IE is set and for interrupt_level_ 1 15 traps
380. ic instruction for cacheable data On these W cache exclusive request if an uncorrectable error occurs in the requested line a disrupting ECC_error trap is generated if the NCEEN bit is set in the Error Enable Register and L3 cache sends 64 byte data to W cache associated with the uncorrectable data error information W cache stores the uncorrectable data error information in W cache so that deliberately bad signalling ECC is scrubbed back to the L2 cache during W cache eviction Correct ECC is computed for the corrupt evicted W cache data then ECC check bits 0 and 1 are inverted in the check word scrubbed to L2 cache A copyout operation which happens to hit in the processor writeback buffer sets AFSR_EXT L3_WDU not AFSR_EXT L3_CPU L3 cache address parity errors When an L3 cache address parity error is detected it is reported and treated as an uncorrectable L3 cache data error which are described in previous subsections with AFSR_EXT L3_MECC set When an L3 cache address parity error occurs AFSR_EXT L3_MECC is set and based on the request source the corresponding L3_UCU or L3_EDU or L3_CPU or L3_WDU is also set The recovery actions for L3_UCU L3_EDU L3_CPU L3_WDU are also applied to L3 cache address parity errors In very rare cases when an L3 cache address parity error is detected AFSR_EXT L3_MECC is set and based on the request source the corresponding L3_UCC L3_EDC L3_CPC or L3_WDC is also set However the r
381. (Continuation rows of TABLE 7-25, L3 cache Tag CE and UE Errors:)
• D cache atomic request, request misses L2 cache, L3 tag CE: L3_THCE; hardware corrects the tag; the request is retried; disrupting trap.
• D cache atomic request, request hits L2 cache, L3 tag UE: original tag; good data installed in D cache and W cache; disrupting trap taken.
• D cache atomic request, request misses L2 cache, L3 tag UE: original tag; the request is dropped; disrupting trap.
• L3 cache access for Prefetch 0/1/2/20/21/22 request with tag CE, request hits L2 cache: L3_THCE; corrects the tag; data installed in P cache; disrupting trap.
• L3 cache access for Prefetch 3/23/17 request with tag CE, request hits L2 cache: L3_THCE; corrects the tag; data installed in P cache; disrupting trap.
• L3 cache access for Prefetch 0/1/2/3/20/21/22/23/17 request with tag CE, request misses L2 cache: corrects the tag; the request is retried; disrupting trap.
• L3 cache access for Prefetch 0/1/2/20/21/22 request with tag UE, request hits L2 cache: original tag; data installed in P cache; disrupting trap.
• L3 cache access for Prefetch 3/23/17 request with tag UE, request hits L2 cache: L3_TUE; original tag; data not installed in P cache; disrupting trap.
TABLE 7-25  L3 cache Tag CE and UE Errors (3 of 4)
Event | Errors logged in AFSR | Pipeline | L3 cach
382. ided that the NCEEN bit is set in the Error Enable Register Multiple occurrences of this error will cause AFSR ME to be set A multi bit error in received interrupt vector data still causes the data to be stored in the interrupt receive registers but does not cause an interrupt_vector disrupting trap IMC When interrupt vector data are entering the UltraSPARC IV processor from the system bus the data will be checked for microtags correctness If a single bit error is detected the IMC bit will be set to log this error condition A disrupting ECC_error trap will be generated provided that the CEEN bit is set in the Error Enable Register Hardware will correct the error IMU When interrupt vector data are entering the UltraSPARC IV processor from the system bus the data will be checked for microtags correctness If a multi bit error is detected it will be recognized as an uncorrectable error and the IMU bit will be set to log this error condition A disrupting ECC_error trap will be generated provided that the NCEEN bit is set in the Error Enable Register Multiple occurrences of this error will cause AFSR ME to be set A multi bit error in received interrupt vector data still causes the data to be stored in the interrupt receive registers but does not cause an interrupt_vector disrupting trap 211 amp Sun microsystems 7 7 6 Error Handling System Bus Status Errors BERR When the UltraSPARC IV processor perfo
383. ield | Description
63:6 | Mandatory value. Should be 0.
5:3 | data_register_number. Selects one of five staging registers: 000 = EC_data_3, 001 = EC_data_2, 010 = EC_data_1, 011 = EC_data_0, 100 = EC_data_ECC, 101 = unused, 110 = unused, 111 = unused.
2:0 | Mandatory value. Should be 0.
The data format for the L3 cache data staging register data access is shown in TABLE 3-67.
TABLE 3-67  L3 cache data staging register data access
Bits | Field | Description
63:0 | EC_data_N | 64-bit staged L3 cache data: EC_data_0 (L3 data 63:0) corresponds to VA[5:3] = 011; EC_data_1 (L3 data 127:64) corresponds to VA[5:3] = 010; EC_data_2 (L3 data 191:128) corresponds to VA[5:3] = 001; EC_data_3 (L3 data 255:192) corresponds to VA[5:3] = 000.
The data format for the L3 cache data staging register ECC access is shown in TABLE 3-68.
TABLE 3-68  L3 cache data staging register ECC access
Bits | Field | Description
63:18 | Reserved | Reserved for future implementation.
17:9 | EC_data_ECC_hi | 9-bit L3 cache data ECC field for the high 16 bytes (PA[4] = 0), L3 data 255:128.
8:0 | EC_data_ECC_lo | 9-bit L3 cache data ECC field for the low 16 bytes (PA[4] = 1), L3 data 127:0.
Note: The L3 cache data staging registers are shared by both logical processors. When a logical processor accesses the L3 cache data staging registers, software must guarantee that the other logical processor will not access the L3 cache data staging registers. This can be done by parking or disabling the oth
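As a worked illustration of the address format in the first table above (VA bits 5:3 select the staging register; all other address bits must be 0), the sketch below builds the VA that would be used with ASI_L3CACHE_DATA for a given staging register. The enum values follow the 3-bit encodings listed; everything else (names, helper function) is illustrative only, and the privileged LDXA/STXA access itself is not shown.

```c
#include <stdint.h>

/* 3-bit data_register_number encodings from the address-format table above. */
enum l3_staging_reg {
    EC_DATA_3   = 0x0,  /* 000 */
    EC_DATA_2   = 0x1,  /* 001 */
    EC_DATA_1   = 0x2,  /* 010 */
    EC_DATA_0   = 0x3,  /* 011 */
    EC_DATA_ECC = 0x4   /* 100 */
};

/* VA[63:6] and VA[2:0] are mandatory zero; VA[5:3] selects the register. */
static uint64_t l3_staging_va(enum l3_staging_reg r)
{
    return ((uint64_t)r & 0x7u) << 3;
}
```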
384. ill be replaced. Only the 16-entry, fully associative TLB (T16) can hold locked pages. In the UltraSPARC IV processor, 32 MB and 256 MB pages cannot be locked. The set-associative TLBs T512_0 and T512_1 can hold unlocked pages of all sizes.
5:4 | CP, CV | The cacheable-in-physically-indexed-cache bit and the cacheable-in-virtually-indexed-cache bit determine the placement of data in the caches. The UltraSPARC IV processor fully implements the CV bit. The following table describes how CP and CV control cacheability in specific UltraSPARC IV processor caches.
TABLE 13-22  Meaning of TTE Cacheable bits
CP CV | Meaning of TTE when placed in D TLB (Data Cache is VA-indexed)
00 | Non-cacheable
01 | Cacheable, P cache
10 | Cacheable, all caches except D cache
11 | Cacheable, all caches
The MMU does not operate on the cacheable bits but merely passes them through to the cache subsystem.
Note: When defining alias page attributes, care should be taken to avoid the following CP/CV combinations:
Set 1: VA = VA1, PA = PA1, CP = 1, CV = 1
Set 2: VA = VA2, PA = PA1, CP = 1, CV = 0
Aliasing with the above attributes will result in a stale value in the D cache for VA1 after a write to VA2. A write to VA2 will only update the W cache and not the D cache.
(Bit fields 3, 2, 1, and 0: see the UltraSPARC III Cu Processor User's Manual.)
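The CP/CV encoding in TABLE 13-22 can be captured as a small lookup. The sketch below is illustrative only; the enum and function names are hypothetical, and the hardware itself simply forwards CP/CV to the cache subsystem as the text states.

```c
#include <stdio.h>

/* Cacheability categories from TABLE 13-22; names are illustrative. */
typedef enum {
    NON_CACHEABLE,            /* CP=0, CV=0 */
    CACHEABLE_P_CACHE,        /* CP=0, CV=1 */
    CACHEABLE_EXCEPT_DCACHE,  /* CP=1, CV=0: all caches except the VA-indexed D cache */
    CACHEABLE_ALL             /* CP=1, CV=1: all caches */
} tte_cacheability_t;

static tte_cacheability_t tte_cacheability(unsigned cp, unsigned cv)
{
    /* The enum values are laid out so that (CP,CV) maps directly. */
    return (tte_cacheability_t)(((cp & 1u) << 1) | (cv & 1u));
}

int main(void)
{
    /* The aliasing note above: VA1 uses CP=1/CV=1, VA2 uses CP=1/CV=0. */
    printf("VA1: %d  VA2: %d\n", tte_cacheability(1, 1), tte_cacheability(1, 0));
    return 0;
}
```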
385. ill be set to log this error condition A disrupting ECC_error trap will be generated in this case while hardware will proceed to correct the error Note The UltraSPARC IV processor s W cache has been improved to contain entire modified line data When the modified line is evicted from W cache to L2 cache it is written into L2 cache directly without the sequence of reading merging and scrubbing actions in the UltraSPARC III processor Therefore W cache eviction will not generate any EDC or EDU error in the UltraSPARC IV processor When an atomic instruction misses from the W cache and hits the L2 cache in the E or M state a store queue will issue an atomic exclusive request to L2 cache The exclusive request will perform an L2 cache read and the data read back from the L2 cache SRAM will be checked for the correctness of its ECC If a single bit error is detected in a non critical 32 byte the EDC bit will be set to log this error condition A disrupting ECC_error trap will be generated in this case while hardware will proceed to correct the error EDU The AFSR EDU status bit is set by errors in block loads to the processor and errors in reading the L2 cache for store operations and prefetch queue operations When a block load misses the D cache and hits the L2 cache an L2 cache read will be performed and the data read back from the L2 cache SRAM will be checked for the correctness of its ECC If a multi bit error is detected
386. ill generated automatically and cor rectly by hardware 2 Implies no data_access_exception trap TT 0x030 will ever be generated However parity bits are still generated automatically and correctly by hardware 3 Implies no instruction_access_exception trap TT 0x008 will ever be generated However parity bits are still generated automatically and correctly by hardware 4 This 32 entry direct mapped BTB is updated when a non RETURN JMPL is mispredicted Ifa JMPL is not a RET RETL and jump prediction is enabled in the DCR and there is not currently a valid entry in the prepare to jump register its target will be predicted by reading the BTB using VA 9 5 In addition to BTB the target of indirect branches those which don t contain the target address as part of the instruction but get the target from a register such as JMPL or RETURN will be predicted using one of two structures Note if PSTATE RED 1 the predicted target is forced to a known safe address instead of using the standard methods of prediction The Return Address Stack RAS This 8 entry stack contains the predicted address for RET RETL RETURN instructions When a CALL is executed the PC 8 of the CALL is pushed onto this stack If return prediction is enabled in the DCR a return instruction will use the top entry of the RAS as its predicted target A return instruction will also decrement the stack pointer The Branch Target Buffer BTB The prepare to jump r
387. in the UltraSPARC III Cu Processor User s Manual 22 21 DCU Virtual Address Data Watchpoint Enable Implemented as described in the UltraSPARC III Cu Processor User s Manual 20 4 Reserved Reserved for future implementation D MMU Enable Implemented as described in the UltraSPARC III Cu Processor 3 User e Manual 2 I MMU Enable Implemented as described in the UltraSPARC II Cu Processor User s Manual d Data Cache Enable Implemented as described in the UltraSPARC UI Cu Processor User s Manual 0 Instruction Cache Enable Implemented as described in the UltraSPARC III Cu Processor User s Manual 9 4 2 Data Watchpoint Registers The UltraSPARC IV processor supports a 43 bit physical address space Software is responsible to write a zero extended 64 bit address into the PA Data Watchpoint Register Note Prefetch instructions do not generate VA PA_watchpoint traps Registers 275 un microsystems H ScratchPad Registers ASIT_SCRATCHPAD_n_REG Each logical processor in the UltraSPARC IV processor implements eight ScratchPad registers 64 bit each read write accessible The addresses of the ScratchPad registers are defined in TABLE 9 4 The use of the ScratchPad registers is completely defined by software The ScratchPad registers provide faster access than saving data in main memory and are not subject to register windowing All ScratchPad registers in the UltraSPAR
388. ine is written into L3 only when it is evicted from the L2 cache Both clean and dirty modified lines are treated the same The L2 and L3 caches are exclusive e A line cannot be in both caches at the same time e A line evicted from L2 and written into L3 is no longer in L2 e On a hit in L3 the line is copied back to the L2 and L1 levels of the cache hierarchy and marked as invalid in the L3 cache e Because L2 is not included in L3 in effect the L2 and L3 caches provide a total of 34MB of secondary cache storage between them e Because the L2 and L3 caches are exclusive either can be the source of data for cache to cache transactions Shared L2 and L3 caches offer significant benefits When two running threads operate on common data shared caches work to minimize access times and maximize hit rates for both Even when two simultaneously running threads do not share data shared caches enable more flexible allocation of space to each according to need and usually provide superior performance However to handle the occasional case where two running threads are unable to cooperate to their mutual advantage but instead antagonistically contend for cache space with the result that the performance of each is impaired the UltraSPARC IV processor also provides a mechanism for pseudo splitting its shared caches In this mode each thread is allocated its own half of the L2 and L3 cache resources thereby avoiding any interfe
389. ing Constants per Destination Register Precision
Destination Register Precision | Number of Bits in Exponent Field | Exponent Bias (E_bias) | Max Exponent (E_max) | Gross Underflow / P
Single | 8 | 127 | 255 | 24
Double | 11 | 1023 | 2047 | 53
• For FMULs and FMULd: E = E[rs1] + E[rs2] - E_bias
• For FDIVs and FDIVd: E = E[rs1] - E[rs2] + E_bias
(1) When two normal operands of FMULs/d and FDIVs/d generate a subnormal result, E_p is calculated using the algorithm shown below:
IF fraction_msb overflows (i.e., fraction_msb > 1) THEN E_p = E + 1 ELSE E_p = E
• For FdTOs: E = E[rs2] - E_bias(P_rs2) + E_bias(P_rd), where P_rs2 is the larger precision of the source and P_rd is the smaller precision of the destination.
Even though 0 < E[rs1], E[rs2] < 255 for each single-precision biased operand exponent, the computed biased exponent result E can fall outside the range 0 < E < 255 and can even be negative. For example, for the FMULs instruction:
• If E[rs1] = E[rs2] = 127, then E = 127 + 127 - 127 = 127.
• If E[rs1] = E[rs2] = 0, then E = 0 + 0 - 127 = -127.
6.8.2.1  Overflow
Result: If the appropriate trap enable masks are not set (FSR.OFM = 0 and FSR.NXM = 0), set the FSR.AEXC and FSR.CEXC overflow and inexact flags as follows: FSR.OFA = 1, FSR.NXA = 1, FSR.OFC = 1, and FSR.NXC = 1. No trap is generated.
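The biased-exponent estimates above can be checked with a few lines of arithmetic. The sketch below is illustrative only: it uses the FMULs/FDIVs formulas as reconstructed above, the single-precision bias from the table, and hypothetical function names, and it reproduces the two FMULs worked examples from the text.

```c
#include <stdio.h>

/* Single-precision exponent bias from the table above. */
#define E_BIAS_SINGLE 127

/* FMULs estimate: E = E[rs1] + E[rs2] - E_bias. Result is a plain int
 * because E may fall outside 0..255 and can even be negative. */
static int fmuls_biased_exponent(int e_rs1, int e_rs2)
{
    return e_rs1 + e_rs2 - E_BIAS_SINGLE;
}

/* FDIVs estimate: E = E[rs1] - E[rs2] + E_bias. */
static int fdivs_biased_exponent(int e_rs1, int e_rs2)
{
    return e_rs1 - e_rs2 + E_BIAS_SINGLE;
}

int main(void)
{
    /* Worked examples from the text. */
    printf("%d\n", fmuls_biased_exponent(127, 127)); /* 127 + 127 - 127 = 127 */
    printf("%d\n", fmuls_biased_exponent(0, 0));     /* 0 + 0 - 127 = -127   */
    printf("%d\n", fdivs_biased_exponent(1, 254));   /* deep underflow: negative */
    return 0;
}
```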
390. ing a write.
Sun Fireplane Interconnect Port ID Register
The private FIREPLANE_PORT_ID Register can be accessed only from the Sun Fireplane Interconnect, as a read-only, non-cacheable slave access at offset 0 of the address space of the processor port.
This register indicates the capability of the processor module. See TABLE 11-1 for the state of this register after reset. The FIREPLANE_PORT_ID Register is described in TABLE 11-1.
TABLE 11-1  FIREPLANE_PORT_ID Register Format
Bits | Field | Description
63:56 | FC | A one-byte field containing the value FC16. This field is used by the OpenBoot PROM to indicate that no Fcode PROM is present.
55:27 | Reserved | Reserved for future implementation.
26:17 | AID | 10-bit Sun Fireplane Interconnect Agent ID (read-only field); copy of the AID field of the Sun Fireplane Interconnect Configuration Register 2.
16 | M/S | Master or Slave bit. Indicates whether the agent is a Sun Fireplane Interconnect master or a Sun Fireplane Interconnect slave device: 1 = master, 0 = slave. This bit is hardwired to 0 in the UltraSPARC IV processor master device.
15:10 | MID | Manufacturer ID (read-only field). Refer to the UltraSPARC IV processor data sheet for the content of this register.
9:4 | MT | Module Type (read-only field); copy of the MT field of the Sun Fireplane Interconnect Configuration Register.
3:0 | MR | Module
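As a minimal sketch of how the fields in TABLE 11-1 unpack from a 64-bit value, assuming the value has already been obtained from the slave access described above: the struct and function names, and the sample value, are illustrative only and not part of the manual.

```c
#include <stdint.h>
#include <stdio.h>

/* Field layout per TABLE 11-1 (FIREPLANE_PORT_ID Register Format). */
struct fireplane_port_id {
    unsigned fc;   /* bits 63:56 - one-byte field, expected FC16        */
    unsigned aid;  /* bits 26:17 - 10-bit Fireplane Agent ID            */
    unsigned m_s;  /* bit  16    - master/slave indication              */
    unsigned mid;  /* bits 15:10 - Manufacturer ID                      */
    unsigned mt;   /* bits 9:4   - Module Type                          */
    unsigned mr;   /* bits 3:0   - MR field (see the table above)       */
};

static struct fireplane_port_id decode_port_id(uint64_t v)
{
    struct fireplane_port_id id;
    id.fc  = (unsigned)((v >> 56) & 0xFF);
    id.aid = (unsigned)((v >> 17) & 0x3FF);
    id.m_s = (unsigned)((v >> 16) & 0x1);
    id.mid = (unsigned)((v >> 10) & 0x3F);
    id.mt  = (unsigned)((v >>  4) & 0x3F);
    id.mr  = (unsigned)( v        & 0xF);
    return id;
}

int main(void)
{
    uint64_t raw = 0xFC00000000000F80ULL;   /* made-up sample value */
    struct fireplane_port_id id = decode_port_id(raw);
    printf("FC=%02X AID=%u M/S=%u MID=%u MT=%u MR=%u\n",
           id.fc, id.aid, id.m_s, id.mid, id.mt, id.mr);
    return 0;
}
```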
391. ing the DCUCR DC and the DCUCR IC bits to 0 This prevents recursion in the icache_parity_error trap routine when it takes an I cache miss for an I cache index which contains an error It also prevents a D cache error causing a trap to the dcache_parity_error routine from within the icache_parity_error routine When the primary caches are disabled in this way program accesses can still hit in the L2 cache so throughput will not be excessively degraded Note In the extremely rare case of an I cache parity error occurring immediately after a write to the DCU register enabling either the D cache or I cache the DCU register update may take effect after the hardware has automatically disabled caches resulting in the caches being enabled after the processor vectors to the trap handler for the parity error Actions required of the icache_parity_error trap routine e Log the event e Clear the I cache by setting the valid bit for every way at every line to 0 with a write to AST_ICACHE_TAG e Clear the IPB by setting the valid bit for every entry of the IPB tag array to 0 with a write to AST_IPB_TAG e Clear the D cache by setting the valid bit for every way at every line to 0 with a write to AST_DCACHE_TAG Also initialize the D cache by setting distinct values to ASI_DCACHE_UTAG and writing 0 to ASI_DCACHE_TAG both data and parity of all D cache tag indexes and ways e Clear the P cache by setting the va
392. instruction (for example, following the wrong path of a mispredicted branch instruction) | Dropped; suppressed by age at the trap logic.
Annulled (in the delay slot of a branch) | Dropped; an annulled instruction never enters an execution pipeline, thus no D cache access.
Block load, internal ASI load, atomic | Suppressed.
Integer LDD (needs two load accesses to the D cache) | The hardware checks the full eight bytes on the first access and will report any errors at that time. If an error occurs during the second load access, the error will not be detected.
Quad load (accesses the D cache but does not get the data from the D cache; a quad load always forces a D cache miss so that the data is loaded from the L2 cache, L3 cache, or memory) | Yes; it can detect an error, but the error will correspond to a line not actually being used. The D cache access may cause a parity error to be observed and reported even though the corresponding data will not be used.
8.5.2  Hardware Action on Trap for D cache Physical Tag Parity Error
Parity error is reported the same way (same trap type and precise trap timing) as a D cache data array parity error. See Hardware Action on Trap for D cache Data Parity Error on page 262.
Note: Unlike D cache data array parity error checking, D cache physical tag parity checking is further qualified with the valid bit and microtag hit but ignores physical tag hit/miss. On store, atomic, or block store instructio
393. (Continuation rows of TABLE 7-16, Error Reporting Summary:)
Uncorrectable system bus microtag ECC, instruction fetch | EMU | NCEEN | Private
Uncorrectable system bus microtag ECC, all but instruction fetch | EMU | NCEEN | Private
HW_corrected system bus microtag ECC | EMC | CEEN | Private
TABLE 7-16  Error Reporting Summary (2 of 4)
Error Event | AFSR status bit | Trap controlled by | Private or Shared
Uncorrectable L2 cache data ECC: instruction fetch (critical 32-byte and non-critical 32-byte), load (critical 32-byte), atomic instruction (critical 32-byte and non-critical 32-byte) | UCU | UCEEN | Private
Uncorrectable L2 cache data ECC: store queue or prefetch queue operation, or load-like instruction (non-critical 32-byte) | EDU | NCEEN | Private
Uncorrectable L2 cache data ECC: block load | EDU | NCEEN | Private
Uncorrectable L2 cache data ECC: writeback | WDU | NCEEN | Private
Uncorrectable L2 cache data ECC: copyout | CPU | NCEEN | Shared
SW_correctable L2 cache data ECC: instruction fetch (critical 32-byte and non-critical 32-byte), load (critical 32-byte), atomic instruction (critical 32-byte) | UCC | UCEEN | Private
HW_corrected L2 cache data ECC: store queue or prefetch queue operation, or load instruction (non-critical 32-byte), or atomic instruction (no
394. instruction fetch misses in the I cache then an icache_parity_error trap may still be generated if there is a parity error in one of the ways of the set of the I cache indexed for the instruction fetch Only one arbitrary way of the I cache set will be checked for parity errors in the case of an instruction fetch miss If there are parity errors in one of the other ways then that is not checked it will not be detected An I cache data parity error for an instruction fetched and later discarded can cause an undiagnosable out of sync event in a lockstep system I cache Error Recovery Actions An I cache physical tag or I cache IPB data parity error results in an icache_parity_error precise trap The processor logs nothing about the error in hardware for this event The trap routine can log the index of the I cache IPB error based on the virtual address stored in the TPC The trap routine may attempt to discover which associative way caused an error and whether the physical tags or the data is in error and sometimes even the bit that is in error by accessing the I cache IPB in its diagnostic address space However this attempt depends on the I cache continuing to return an error for the same access and the entry not being displaced by coherence traffic This means that attempting to pinpoint the erroneous bit may often fail When the processor takes an icache_parity_error trap both the I cache and the D cache are automatically disabled by sett
395. ior on Instruction Fetch Canceled Retried Instruction Reporting if Parity Error icache_parity_error Trap Taken I cache miss due to invalid line valid 0 icache_parity_error detection is microtag miss or tag mismatch suppressed icache_parity_error has priority over fast_instruction_access_MMU_miss I TLB miss Yes 8 4 2 Hardware Action on Trap for I cache Physical Tag Parity Error Parity error is reported the same way same trap type and precise trap timing as in I cache data array parity error See Hardware Action on Trap for I cache Data Parity Error on page 260 Note I cache physical tag parity error detection is suppressed if DCR IPE 0 or cache line is invalid I cache physical tag parity error is determined based on per instruction fetch group granularity Unused or annulled instructions are not masked out during the parity check If an I cache physical tag parity error is detected while in another event I cache miss or I TLB miss the behavior is described in TABLE 8 3 TABLE 8 3 I cache Physical Tag Parity Error Behavior on Instruction Fetch Canceled Retried Instruction Reporting if Parity Error icache_parity_error Trap Taken I cache miss due to invalid line valid 0 icache_parity_error detection is microtag miss or tag mismatch suppressed icache_parity_error has priority over I TLB miss A 8 fast_instruction_access_MMU_umiss Exceptions Traps and Trap Types 261 amp Sun
396. ireplane accessible registers FIREPLANE_PORT_ID and Memory Controller registers These address bits 42 23 Address correspond to Sun Fireplane bus address bits PA 42 23 22 0 Reserved Reserved for future implementation 11 1 1 6 Cluster Error Status Register ID ASI_CESR_ID Register See TABLE 11 6 for the state of this register after reset Refer to CESR Cluster Error Status Register ID Register on page 14 for further information Sun Fireplane Interconnect and Processor Identification 295 un microsystems 11 2 RED state and Reset Values Reset values and RED_state for Sun Fireplane Interconnect specific machines are listed in TABLE 11 6 TABLE 11 6 Sun Fireplane Interconnect Specific Machine State After Reset and RED_state Name Fields Hard_POR WDR FC FCi6 AID 0 FIREPLANE_PORT_ID MID 3E 16 MT undefined MR undefined SSM 0 unchanged unchanged HBM 0 updated unchanged SLOW 0 updated unchanged CBASE 0 unchanged unchanged CBND 0 unchanged unchanged CLK 0 updated unchanged MT 0 updated unchanged FIREPLANE_CONFIG MR 0 updated unchanged INTR_ID unknown unchanged unchanged DBG 0 unchanged unchanged DTL see TABLE 11 3 updated unchanged TOL 0 unchanged unchanged TOF 0 unchanged unchanged NXE 0 updated unchanged SAPE 0 updated unchanged CPBK_BYP 0 updated unchanged SSM 0 unchanged unchanged HBM 0 updated unchanged SLOW 0 updated unchanged CBASE 0 unchanged unchanged CBND 0 unchan
397. isrupting trap enabled by the CEEN bit in the Error Enable Register to carry out error logging Note For HW_corrected L3 cache data errors the hardware does not actually write the corrected data back to the L3 cache data SRAM However the L3 cache sends corrected data to the P cache the W cache and the system bus Therefore the instruction that creates the single bit error transaction can be completed without correcting the L3 cache data SRAM The L3 cache data SRAM may later be corrected in the disrupting ECC_error trap Software Correctable L3 cache Data ECC Errors Software correctable errors occur on single bit data errors detected as the result of the following transactions e Reads of data from the L3 cache to fill the D cache 165 amp Sun microsystems 7 2 8 7 Error Handling e Reads of data from the L3 cache to fill the I cache e Performing an atomic instruction All these events cause the processor to set AFSR_EXT L3_UCC A software correctable error will generate a precise fast_ECC_error trap if the UCEEN bit is set in the Error Enable Register See Software Correctable L3 cache ECC Error Recovery Actions on page 166 Software Correctable L3 cache ECC Error Recovery Actions The fast_ECC_error trap handler should carry out the following sequence of actions to correct an L3 cache tag or data ECC error Read the address of the correctable error from the AFAR register 2 Invalidate the entire D cac
398. (Rows of TABLE 4-1, Machine State After Reset and RED_state; the full column layout, giving values after hard POR, after WDR, and on RED_state entry, is not recoverable here. Recoverable entries include: ...isters: Unknown / Unchanged / Unchanged. RSTVaddr value: VA FFFF FFFF F000 000016 (PA 7FF F000 000016). PC: RSTV | 2016, 4016, 6016, 8016, or A016, and nPC: RSTV | 2416, 4416, 6416, 8416, or A416, for the respective reset and RED_state trap vectors. PSTATE: MM = 0 (TSO); RED = 1 (RED_state); PEF = 1 (FPU on); AM = 0 (full 64-bit address); PRIV = 1 (privileged mode); IE = 0 (interrupts disabled); AG = 1 (alternate globals selected); TLE and CLE 0 or unchanged; IG = MG = 0 (interrupt globals and MMU globals not selected). TBA[63:15], Y, PIL, CCR, ASI, CANSAVE, CANRESTORE: Unknown / Unchanged / Unchanged. CWP: Unknown / Unchanged / Unchanged except for register window traps. TT[TL]: the trap type. TL: MAXTL, or Min(TL + 1, MAXTL). TPC[TL], TNPC[TL], and TSTATE[TL] (CCR, ASI, PSTATE, CWP): Unknown / Unchanged / capture the pre-trap PC, nPC, CCR, ASI, PSTATE, and CWP. TICK: NPT = 1 or unchanged; the counter restarts at 0 after reset or continues counting.)
TABLE 4-1  Machine State After Reset and RED_state (2 of
OTHERWIN
399. (Structures initialized by the fast-init loops, as labeled in FIGURE 3-1: ...t512, W cache data, W cache data status, W cache tag valid, W cache snoop, Device ID, P cache data, P cache data status, P cache tag valid, P cache snoop, D TLB t512_0, and D TLB t512_1.)
FIGURE 3-1  Fast init (ASI 4016) Goes Through Three Loops in Pipeline Fashion
For the shared loop, since the L3 cache tag SRAM has the maximum number of entries (2^19), the D-MMU sweeps SRAM address bits VA[24:3] (2^22 addresses) to initialize the cache. The address is incremented every cycle in regular SRAM test mode, or every other cycle in BIST control mode. When there are no more ASI write requests to be issued, the state machine enters a wait state where it waits for the last return signal from the chain to arrive, which is exactly a fixed 21-cycle delay. Once the 21st cycle is reached, the STQ is signaled to retire the store ASI 3F16 instruction. Therefore, the total cycle count for the ASI 3F16 execution is at most 2^22 x 2 + 21 = 8,388,629 cycles, or about 0.007 seconds for a 1200 MHz UltraSPARC IV processor and about 0.004 seconds for a 2400 MHz UltraSPARC IV processor. Thus, all cache structures are initialized in under 10 milliseconds.
3.14.1  ASI_SRAM_FAST_INIT Definition
ASI 4016
Write only
Per logical processor
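The cycle estimate above can be reproduced with a few lines of arithmetic. The sketch below simply recomputes 2^22 x 2 + 21 and the corresponding wall-clock time at the two clock rates mentioned, using the values as reconstructed above; it is a check, not part of the manual.

```c
#include <stdio.h>

int main(void)
{
    /* Shared loop: 2^22 ASI write addresses (VA[24:3]), one every other
     * cycle in BIST control mode, plus the fixed 21-cycle drain delay. */
    const unsigned long addresses = 1UL << 22;
    const unsigned long cycles    = addresses * 2UL + 21UL;

    printf("total cycles: %lu\n", cycles);              /* 8388629                    */
    printf("at 1200 MHz: %.4f s\n", cycles / 1200e6);   /* ~0.0070 s                  */
    printf("at 2400 MHz: %.4f s\n", cycles / 2400e6);   /* ~0.0035 s (cited as 0.004) */
    return 0;
}
```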
400. k_routine
    <Perform load. If a deferred trap occurs, execution will never resume here>
    MEMBAR Sync    (error barrier; make sure the load takes)
    <Indicate peek success>
    <Return to peeker>

Special_peek_handler
    <Indicate peek failure>
    <Return to peeker as if returning from Do_the_peek_routine>

Deferred_trap_handler (TL = 1)
    <If the deferred trap handler sees a UE or TO or BERR and the peek_sequence_flag is set, it resumes execution at the Special_peek_handler by setting TPC and TNPC>

FIGURE 8-1  Recovering Deferred Traps
Other than the load (or the store, in the case of poke), the Do_the_peek_routine should not have any other side effect, because the deferred trap means that the code is not restartable. Execution after the trap is resumed in the Special_peek_handler. The code in Deferred_trap_handler must be able to recognize any deferred traps that happen as a result of hitting the error barrier in The_peeker as not being from the peek operation. This situation is typically part of setting the peek_sequence_flag. A MEMBAR Sync is required as the first instruction in the trap table entry for Deferred_trap_handler to collect all potentially trapping stores together, to avoid a RED_state exception.
You can determine whether a deferred trap has come from a peek or poke sequence by using TPC or AFAR, as follows:
• If TPC is used, the locality of the trap to the Do_the_peek_routine must be
401. l be read as 1'b0.
30 | EC_split_en | If set, enables L2 cache writebacks originating from logical processor 0 to write into way 0 and way 1 of the L3 cache, and L2 cache writebacks originating from logical processor 1 to write into way 2 and way 3 of the L3 cache. After hard reset this bit will be read as 1'b0.
26 | pf2_RTO_en | If set, enables sending RTO on PREFETCH fcn 2, 3, 22, 23. After hard reset this bit will be read as 1'b0.
25 | ET_ECC_en | If set, enables ECC checking on the L3 cache tag bits. If disabled, there will never be ECC errors due to L3 cache tag access; however, ECC generation and writes to the L3 cache tag ECC array will still occur in the correct manner. After hard reset this bit will be read as 1'b0.
24 | EC_assoc | L3 cache data SRAM architecture. This bit is hardwired to 1'b0 in the UltraSPARC IV processor. 1'b0 = Late Write SRAM (hard POR value); 1'b1 = Late Select SRAM (not supported).
addr_setup | Address setup cycles prior to the SRAM rising clock edge: 00 = 1 cycle (not supported); 01 = 2 cycles (POR value); 10 = 3 cycles; 11 = unused.
TABLE 3-64  L3 cache Control Register Access Data Format (2 of 3)
(Continuation rows for bits 37, 36, 29 (trace_out), and 22:21 are not recoverable from this extract.)
402. le. An ASI read to the L2 cache control register will reset these bits to 0. The value of the Queue_timeout bits [11:9] should not be programmed to result in a timeout value longer than what the TOL field of the Fireplane Configuration Register is programmed to result in for CPQ timeout.
3.11.1.1  Notes on L2 cache Off Mode
• If the Write cache is disabled in an enabled logical processor, or if the L2 cache is disabled (L2_off = 1), the L2L3arb_single_issue_en field in the L2 cache control register is ignored and the L2L3 arbiter hardware will automatically switch to the single-issue mode.
• If the L2 cache is disabled, software should disable L2_tag_ecc_en, L2_data_ecc_en (i.e., no support for L2 cache tag/data ECC checking, reporting, and correction), and L2_split_en in the L2 cache control register. EC_ECC_en in the L3 cache control register (ASI 7516) should also be disabled in the L2 cache off mode, since the L3 cache is a victim cache. In addition, the L2 cache LRU replacement scheme will not work, and only one pending miss is allowed when the L2 cache is disabled. The inclusion property between the L2 cache and the primary caches is sustained.
• The one-entry address and data buffer used in L2 cache off mode will be reset by hard reset. System reset or ASI_SRAM_FAST_INIT_SHARED (ASI 3F16) will not have any effect.
3.11.2  L2 cache Tag Diagnostic Access
ASI 6C16
Read and Write
Shared b
403. le L2 cache tag ECC errors multi bit ECC errors that are not correctable by hardware or software 157 amp Sun microsystems e 7 2 1 3 Error Handling Each L2 cache tag entry is covered by ECC The tag includes the MOESI state of the line which implies that tag ECC is checked whether or not the line is valid Tag ECC must be correct even if the line is not present in the L2 cache L2 cache tag ECC checking is enabled by the L2_TAG_ECC_en bit in the L2 cache control register The processor always generates correct ECC when writing L2 cache tag entries except when programs use diagnostic ECC tag accesses Hardware Corrected L2 cache Tag ECC Errors Hardware corrected errors occur on single bit errors in tag value or tag ECC detected as the result of the following transactions Cacheable I fetches Cacheable load like operations Atomic operations W cache exclusive request to the L2 cache to obtain data and ownership of the line in W cache Reads of the L2 cache by the processor in order to perform a writeback to L3 cache or a copyout to the system bus Reads of the L2 cache by the processor to perform an operation placed in the prefetch queue by an explicit software PREFETCH instruction or a P cache hardware PREFETCH operation Reads of the L2 cache tag while performing snoop read displacement flush L3 cache data fill block store Hardware corrected errors optionally produce an ECC_error di
404. lease refer to LP Interrupt ID Register (ASI_INTR_ID) on page 13.
12.2  CMT Related Interrupt Behavior
12.2.1  Interrupt to Disabled Logical Processor
If an interrupt is issued to a disabled logical processor, the target processor in which the disabled logical processor resides will not assert a Mapped Out signal for the interrupt transaction. The TO bit in the AFSR will be asserted in the logical processor issuing the interrupt. The busy bit in the Interrupt Vector Dispatch Status Register will be cleared for the interrupt.
12.2.2  Interrupt to Parked Logical Processor
If an interrupt is issued to a parked logical processor, the target processor in which the parked logical processor resides will assert a snoop response and log the incoming interrupt transaction in the interrupt receive register, the same way as a running logical processor. If the incoming interrupt is not NACK'ed for a parked logical processor, the pipeline will process the interrupt after the logical processor is unparked.
13  Instruction and Data Memory Management Unit
The Instruction Memory Management Unit (I-MMU) conforms to the requirements set forth in the UltraSPARC III Cu Processor User's Manual. In particular, the MMU supports a 64-bit virtual address space, software TLB miss processing (no hardware page table walk), simplified protection encoding
405. (Continuation rows of TABLE 7-27, System Bus CE, UE, TO, DTO, BERR, DBERR errors: for one unmapped-address case, garbage data is taken and the precise fast_ecc_error trap is dropped for the D cache and P cache. A cacheable D cache block load request for an unmapped address flags TO, does not install the data, delivers garbage data to the FP register file, and takes a deferred trap; a non-cacheable D cache block load request for an unmapped address behaves the same way.)
TABLE 7-27  System Bus CE, UE, TO, DTO, BERR, DBERR errors (7 of 10)
Event | flag | fast_ecc_error | L2 cache data | L2 cache state | L1 cache data | Pipeline Action | Comment
Cacheable D cache atomic fill request for unmapped address: L2 cache data not installed; the L2L3 unit gives both grant and valid to the W cache; the pipeline does not take the garbage data; deferred trap. TO will be taken and the precise fast_ecc_error trap will be dropped.
Cacheable Prefetch 0/1/2/3/20/21/22/23/17 fill request for unmapped address: not installed; garbage data not installed in the P cache; no pipeline action; disrupting trap.
Non-cacheable Prefetch 0/1/2/3/20/21/22/23/17 fill request for unmapped address: not installed; garbage data not installed in the P cache; no pipeline action; disrupting trap.
Cacheable stores, W cache exclusive fill request with unmapped address: not installed; the L2L3 unit gives both grant and valid to the W cache; the W cache will drop
406. lid bit for every way at every line to 0 with a write to AST_PCACHE_TAG e Re enable I cache and D cache by setting DCUCR DC and DCUCR IC to 1 e Execute RETRY to return to the originally faulted instruction 146 amp Sun microsystems 7 2 1 5 Error Handling It is necessary to invalidate the D cache because any stores the icache_parity_error trap routine makes on pending stores in the store queue while DCUCR DC is 0 will not write to the D cache The D cache will later contain stale data I cache Error Detection Details I cache IPB diagnostic accesses described in cache Data Errors on page 145 and Instruction Prefetch Buffer IPB Data Errors on page 156 can be used for diagnostic purposes or testing development of the I cache parity error trap handling code For SRAM manufacturing test since the parity bits are part of existing SRAM it is automatically included in the SRAM loop chain of ASI_SRAM_FAST_INIT ASI 0x40 To allow code to be written to check the operation of the parity detection hardware the following equations specify which storage bits are covered by which parity bits IC_tag_parity xor IC_tag 28 0 This is equal to xor PA 41 13 IC_snoop_tag_parity xor IC_snoop_tag 28 0 This is equal to xor PA 41 13 For a non PC relative instruction IC_predecode 7 0 IC_parity xor IC_instr 31 0 IC_predecode 9 7 5 0 For a PC relative instruction IC_predecode 7 1 I
407. line from the L2 cache L3 cache In the event of an error from the L2 cache or L3 cache for one of these fetches a fast_ECC_error trap is generated provided that the fetched instruction is actually executed If the instruction marked as encountering an error is discarded without being executed no trap is generated However AFSR AFSR_EXT and AFAR will still be updated with the L2 cache or L3 cache error status In the event of an error from the system bus for an instruction fetch the processor works exactly as normal with the AFSR and AFAR being set and a deferred instruction_access_error trap being taken despite the fact that the faulty line has not yet been used in the committed instruction stream and may in fact never be used The above applies to speculative fetches well beyond a branch and also to annulled instructions in the delay slot of a branch Corrupt data is never stored in the I cache The execution unit can issue speculative requests for data because of load like instructions but not block load store or atomic operations These can miss in the D cache and go to the L2 cache However in all circumstances if the data is not to be used the execution unit cancels the fetch before the L2 cache can detect any errors The AFSR and AFAR are not updated the D cache is not loaded with corrupt data and no trap is taken Speculative data fetches which are later discarded never cause system bus reads Speculation around store
408. ling TABLE 3 8 in this section summarizes handling of the following e Transactions at the head of CIQ e No snoop transactions e Transactions internal to the UltraSPARC IV processor TABLE 3 8 Transaction Handling at Head of CIQ J of 2 Operation at Head of CIQ CCTag MTag in out Error Retry Next CCTag RTS Shared I gM in S gS in S peo RTS Shared E I RTSR Shared O Os I RTSR Shared M Os I RTO nodata M RTO data amp SSM M Os I RTO data amp SSM M Os I RTOR data O et a S 0 Os I gS in 1 Os gl in 1 1 I Caches Cache Coherency and Diagnostics un microsystems TABLE 3 8 Transaction Handling at Head of CIQ 2 of 2 Operation at Head of CIQ Next CCTag no change S S RTSM copyback M O S S Coa gS out S 1 copyback O Os BEE S invalidate x I copyback invalidate I a S E a gI out I gM out I copyback discard gM out no change Os gI out I gM out no change gM in no change RS data gS in no change gl in no change own WS gM out I own WB gM out I I own invalidate WS Caches Cache Coherency and Diagnostics 38 un microsystems TABLE 3 9 summarizes Memory controller actions for SSM RMW Read Modify Write transactions TABLE 3 9 Memory controller actions for SSM RMW transactions SSM RMW 3 Memory Controller Action t
409. loads e Parity errors detected for Quad loads e Parity errors detected for integer LDD second access only or helper access e Parity errors detected for Internal ASI loads D cache Data parity Traps will NOT be suppressed on D cache Tag Parity Errors that occurs for Load that misses D cache or causes other types of recirculates 154 amp Sun microsystems T23 Error Handling For loads D cache Physical Tag Parity error Trap is taken if the following expression is true pe 0 amp microtag 0 amp vtag 0 pe 1 amp microtag 1 amp vtag 1 pe 2 amp microtag 2 amp vtag 2 pe 3 amp microtag 3 amp vtag 3 For Stores and Atomics D cache Tag Parity error Trap is taken if the following expression is true pe 0 amp vtag 0 pe 1 amp vtag 1 pe 2 amp vtag 2 pe 3 amp vtag 3 The values of the pe s microtag s and vtag s used in the above expressions are determined as follows pe 0 1 if Parity is detected in Way 0 else it is equal to 0 pe 1 1 if Parity is detected in Way 1 else it is equal to 0 pe 2 1 if Parity is detected in Way 2 else it is equal to 0 pe 3 1 if Parity is detected in Way 3 else it is equal to 0 utag 0 1 if microtags hits in Way 0 else it is equal to 0 utag 1 1 if microtags hits in Way else it is equal to 0 utag 2 1 if microtags hits in Way 2 else it is equal to 0 utag 3 1 if microtags hits in Way
410. lowed by a description of the CMT definition Background Terminology Thread The basic unit of program execution a stream of computer instructions that constitutes the control flow of a process Logical Processor LP The abstraction of a processor s architecture that maintains the state and management of an executing thread Core A hardware unit that instantiates one or more logical processors In addition to the basic execution pipeline s and associated registers a core often includes L1 caches Processor A single piece of silicon chip that interprets and executes operating system functions and other software tasks A processor is implemented by one or more cores Chip Multithreading CMT A processor capable of executing 2 or more software threads simultaneously without resorting to a software context switch Chip Multithreading may be achieved through the use of a single core able to execute multiple threads in parallel or multiple cores each able to execute one or more parallel threads General CMT Behavior In general each logical processor of a CMT processor behaves functionally from the viewpoint of software visibility as if it was an independent unit This is an important aspect of CMT because user code running on a logical processor need not know whether or not that logical processor is Chip Multithreading CMT 10 amp Sun microsystems just part of a single multithreaded CMT processor or
411. lowest enabled logical processor are set by default to the suspended state at the beginning of a system reset The one logical processor that is set to run becomes the default master logical processor which should arbitrate for the bootbus if necessary i e if multiple CMT processors share the same bootbus The master logical processor should enable set to run the other logical processors at the proper time in the booting process Partial CMT Resets XIR Reset There is a class of resets that are generated by an external agent and apply to an arbitrary improper subset of the logical processors within a CMT processor any number of the LPs included from zero to all The UltraSPARC IV processors have in addition to a single global system reset a single eXternally Initiated Reset XIR signal This is a reset intended to reset a specific processor in a system primarily for diagnostic and recovery purposes Future processors may have multiple resets that replace the single XIR reset of current processors Chip Multithreading CMT 20 amp Sun microsystems 2 5 3 1 For this class of resets there must be a mechanism to specify which subset of logical processors should be reset There are two possible ways to specify the subset The first way to specify the subset is to have a steering register that is set up ahead of time to specify the subset of logical processors For systems using an XIR reset the XIR Steering register described
412. m bus activity including any RTO for a cacheable store but the store data may still be in the W cache and may not have reached the L2 cache A DONE or RETRY instruction behaves exactly like a MEMBAR Sync instruction for error isolation The processor waits for outstanding data loads or stores to complete before continuing Traps do not serve as error barriers in the way that MEMBAR Sync does User code can issue a store instruction which misses in the D cache L2 cache and L3 cache therefore generating a system bus RTO operation After the store instruction the user code can go on to trap into privileged code through an explicit TRAP instruction a TLB miss a spill or fill trap or an arithmetic exception such as a floating point trap None of these trap events wait for the user code s pending stores to be issued let alone to complete The processor s store queue can hold several outstanding stores any or all of which can require system bus activity In the UltraSPARC IV processor an uncorrectable system bus data ECC error as the result of a store queue or prefetch queue RTO leads only to a disrupting trap not to a deferred trap When the disrupting trap is taken is of no particular importance These stores can issue and complete on the system bus after a trap has changed PSTATE PRIV to 1 and errors as the result of the stores are not logged with AFSR PRIV 1 because they come from user code A D
413. me globally visible store data is not yet retired from the store queue the load is recirculated after the read after write RAW hazard is removed The following conditions can cause this recirculation e Load data can be bypassed from more than one store in the store queue Performance Instrumentation and Optimization 97 amp Sun microsystems e The load s VA 12 0 overlaps a store s VA 12 0 and store data cannot be bypassed from the store queue e The load s VA 12 5 matches a store s VA 12 5 and the load misses the D cache e Load is from a side effect page page attribute E 1 when the store queue contains one or more stores to side effect pages 5 4 Performance Instrumentation Performance instrumentation consists of processor event counters that can be used to gather statistics during program execution and calls that start and stop the gathering process Many events can be monitored two at a time to gain information about the performance of the processor Memory access and stall times for example can be measured using two 32 bit Performance Instrumentation Counters PICs The Performance Control Register PCR and PIC are accessed through read write Ancillary State Register ASR instructions 5 4 1 Supervisor User Mode Access to the PCR is privileged Nonprivileged accesses cause a privileged_opcode trap Software can restrict nonprivileged access to PICs by setting the PCR PRIV field while in privilege
414. memory address control 7216 ASI_MCU_ADDR_CTRL_REG RW Shared register Address Space Identifiers 284 un microsystems TABLE 10 2 The UltraSPARC IV processor ASI Extensions 5 of 5 Le Private Value ASI Name Suggested Macro Syntax Description R W Shared 7216 ASI_MCU_TIM3_CTRL_REG Cregs memory control III register RW Shared 7216 ASI_MCU_TIM4_CTRL_REG Cregs memory control IV register RW Shared 7216 ASI_MCU_TIMS_CTRL_REG 4816 Cregs memory control V register Shared 7416 ASI_L3CACHE_DATA L3 cache data staging register RW Shared 7516 ASI_L3CACHE_CTRL L3 cache control register RW Shared CH ASL L3CACHE_W L3 cache data RAM diagnostic w Shared write access 7716 Reserved Reserved for mutare Private implementation 7816 to EE Reserved for future Private 7916 implementation ASI_L3CACHE_R z i i TE 16 E L3 cache data RAM diagnostic Shared read access TF 16 Reserved Reserved for Gr Private implementation 8016 to g R S 83 The SPARC Architecture Manual Version 9 Private 16 8816 to B The SPARC Architecture Manual Version 9 Private 16 Cie to Revered Reserved for future Private C516 implementation C816 to PE Reserved for future e CDie implementation D046 to See Reserved for future Priva D316 implementation Die to Reed Reserved for future primar D I implementation E016 to Besa Reserved for future pause Elie implementation F0j to EE Reserved fo
415. microsystems TABLE 7 20 L3 cache Data CE and UE errors 3 of 4 Error fast ecc L3 cache data Error Handling moved from L3 cache to L2 cache in P cache Event logged in error L2 cache data Ll cache data Pipel Rar Comment Action AFSR trap Disrupting trap when the line is Good critical evicted out from 32 byte data the W cache Original data and bad non again based on i Ge Sp Good 32 2 D cache atomic request with UE in the 2nd original ecc critical 32 the UE status bit L3_EDU No byte data 32 byte L3 cache data moved from L3 byte data and W cache flips the taken cache to L2 cache UE 2 least significant information in ecc check bits W cache C 1 0 in both lower and upper 16 byte mes a P cache for several reads 0 20 request with GE E E CE in the critical 32 byte or the 2nd 32 byte L3_EDC No 8 S No action Disrupting trap moved from L3 in P cache L3 cache data cache to L2 cache mes e P cache for several reads 0 20 request with GE Ee UE in the critical 32 byte or the 2nd 32 L3_EDU No 8 No action Disrupting trap byt L3 cache dat moved from L3 in P cache EE EE cache to L2 cache ae e P cache for one read 1 21 request with CE GE E dainot in the critical 32 byte or the 2nd 32 byte L3_EDC No 8 z No action Disrupting trap L3 cache dat moved from L3 in P cache SE cache to L2 cache mes F P cache for one read 1 21 request with UE BEEN Bad datadoi in the critical 32 byte or the 2nd 32 byte
416. mming NA state to certain ways of all indices 3 During multi probe phase the tester can detect and store the bitmap information of the L2 cache data SRAM and disable certain indext way by writing NA state in the L2 cache tag SRAM To ensure a smooth transition between Available i e all other states accept NA and Not Available state during run time the ASI write should follow the steps defined in Procedure For Writing ASI_L2CACHE_TAG on page 63 When L2_split_en is disabled in L2 cache control register ASI 6D 6 or EC_split_en is disabled in L3 cache control register ASI 7516 the OS cannot program all 4 ways of an index to be in NA state When L2_split_en or EC_split_en is set the OS has to make sure at least one way among way0 and way of an index is available Similarly the same restriction applies for way2 and way3 L2 cache Data Diagnostic Access ASI ob Read and Write Shared by both logical processors VA 63 22 0 VA 21 ECC_sel VA 20 19 way Caches Cache Coherency and Diagnostics 64 un microsystems VA 18 6 index VA 5 3 xw_offset VA 2 0 0 Name ASI_L2CACHE_ DATA The address format for L2 cache data diagnostic access is shown in TABLE 3 61 TABLE 3 61 L2 cache Data Diagnostic Access Bits Field Description 63 22 Mandatory value Should be 0 If set access the ECC portion of the L2 cache line based on VA 5 If not access the data 21 ECC sel portion of t
417. mory when flushed Cache flushing is required in the following cases e A D cache flush is needed when a physical page is changed from virtually cacheable to virtually non cacheable or when an illegal address alias is created Flushing is done with a displacement flush see Displacement Flushing on page 29 or by use of ASI accesses An L2 cache flush is needed for stable storage Flushing is done with either a displacement flush or a store with AST_BLK_COMMIT Flushing the L2 cache will flush the corresponding blocks from W cache also See Committing Block Store Flushing on page 29 An L3 cache flush is needed for stable storage Flushing is done with either a displacement flush or a store with ASI_BLK_COMMIT L2 L3 Data Prefetch and Instruction cache flushes may be required when an ECC error occurs on a read from the Sun Fireplane Interconnect or the L3 cache AFSR Register and AFSR_EXT Register on page 180 describes the case when a flush on an error is required When an ECC error occurs invalid data may be written into one of the caches and the cache lines must be flushed to prevent further corruption of data Address Aliasing Flushing A side effect inherent in a virtually indexed cache is illegal address aliasing Aliasing occurs when multiple virtual addresses map to the same physical address Note Since the I cache and D cache are indexed with the virtual address bits and the caches are larger than the minimum page size
418. mpact HW_corrected ECC errors result from detection of a single bit ECC error as the result of a system bus read or L2 cache or L3 cache access HW_corrected errors are logged in the Asynchronous Fault Status Register and except for interrupt vector fetches in the Asynchronous Fault Address Register If the Correctable_Error CEEN trap is enabled in the Error Enable Register an ECC_error trap is generated L2 cache data ECC errors are discussed in L2 cache Data ECC Errors on page 159 Uncorrectable L2 cache data ECC errors as the result of a read to satisfy store queue exclusive request prefetch requests writeback or copyout require only logging on the processor not using the affected data Consequently a disrupting ECC_error trap is taken instead of a deferred trap This avoids panics when the system displaces corrupted user data from the cache L3 cache data ECC errors are discussed in L3 cache Data ECC Errors on page 165 Uncorrectable L3 cache data ECC errors as the result of a read to satisfy a store queue exclusive request prefetch requests writeback or copyout require only logging on the processor not using the affected data Consequently a disrupting ECC_error trap is taken instead of a deferred trap thus avoiding panics when the system displaces corrupted user data from the cache Uncorrectable errors causing disrupting traps need no immediate action to guarantee data correctness However it is likely that an event signaled by A
419. mplementation A 1 bit field that selects port 0 or port 1 of the dual ported RAM Both ports give the same 11 PC_port values and this bit is used only during manufacture testing A 2 bit entry that selects a way 4 way associative 2 b00 Way 0 2 b01 Way 2 b10 Way 2 2 b11 Way3 8 6 PC_addr A 3 bit index VA 8 6 that selects a P cache entry 5 0 Reserved Reserved for future implementation 10 9 PC_way The data format for P cache snoop tag register access is shown in TABLE 3 57 TABLE 3 57 Prefetch Cache Snoop Tag Access Data Format 63 38 Reserved Reserved for future implementation 37 PC_valid_bit A 1 bit field that indicates a valid physical tag entry 36 0 PC_physical_tag The 37 bit physical tag of associated data 3 11 L2 cache Diagnostics amp Control Accesses Separate ASIs are provided for reading and writing the L2 cache tag and data as well as the L2 cache control register This section describes three L2 cache accesses e L2 cache control register e L2 cache tag state LRU ECC field diagnostics accesses e L2 cache data ECC field diagnostic accesses 3 11 1 L2 cache Control Register ASI 6D 6 Read and Write Shared by both logical processors VA 0 Name AST_L2CACHE_CONTROL Caches Cache Coherency and Diagnostics 58 un microsystems Bits 63 16 Field Reserved The data format for L2 cache control register is shown in TABLE 3
420. mportant optimization technique is to prefetch data avoiding the long latencies associated with cache misses 5 2 Instruction Stream Issues The following section addresses these issues e Instruction Alignment e Instruction Cache Timing Performance Instrumentation and Optimization 91 un microsystems e Executing Code Out of the Level 2 Cache e Translation Lookaside Buffer TLB Misses e Instruction Cache Utilization e Handling of Control Transfer Instructions CTI Couples e Mispredicted Branches e Return Address Stack 5 2 1 Instruction Alignment To ensure that the maximum number of instructions are fetched from an access the instructions should be appropriately aligned 5 2 1 1 Instruction Cache Organization The I cache is organized as a four way set associative cache with each set containing a multiple of eight instruction lines Depending on its address for each line of 16 instructions up to four instructions are sent to the instruction buffer If the address points to one of the last three instructions in the line only this last instruction and instructions 0 2 from the end of the line are selected Consequently on average for random accesses 3 25 instructions are fetched from the I cache For sequential accesses the fetching rate four instructions per cycle matches the consuming rate of the pipeline up to four instructions per cycle 5 2 1 2 Branch Target Alignment Given the restriction mentioned
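One practical consequence of the fetch behavior described above is that frequently executed branch targets (for example, loop heads) benefit from starting on a four-instruction (16-byte) fetch boundary, so a taken branch delivers a full group. The sketch below shows how hand-written assembly might request this with an alignment directive; the label and the placeholder loop body are arbitrary.

        ! Hedged sketch: align a hot loop entry to a 16-byte fetch-group boundary.
        .align  16
hot_loop:
        ldx     [%o0], %o3               ! placeholder loop body
        add     %o3, %o4, %o3
        stx     %o3, [%o0]
        subcc   %o1, 1, %o1
        bne,pt  %xcc, hot_loop
          add   %o0, 8, %o0              ! delay slot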
1.3.2 Execution Units
The UltraSPARC IV processor's integer execution unit has been augmented with a new special unit that uses the load/store pipe to execute the population count (POPC) instruction in hardware.
The UltraSPARC IV processor's floating-point functional units have been redesigned to handle many more special functions and IEEE exceptions directly in hardware than did previous members of the UltraSPARC III/IV processor family, which trapped to a software handler on certain IEEE exceptions. For example, the UltraSPARC IV processor handles in hardware both (1) integer-to-floating-point conversions where the relevant bits of the integer exceed the bits of mantissa, and (2) operations in non-standard mode that produce subnormal results. Programs that generate significant numbers of operations affected by this upgrade should experience substantially improved performance on the UltraSPARC IV processor.
Write Cache
Like previous family members, each of the UltraSPARC IV processor cores has an associated 2 KB write cache. Write-through stores from a core's L1 cache are written into the write cache rather than directly to the L2 cache. If a store misses in the write cache, the missed line is brought into the cache. If the write cache is already full, the oldest line in the cache is evicted to L2 to make room for the new line. In previous family
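Because POPC now executes directly in the load/store pipe as described above, counting set bits no longer needs a multi-instruction software sequence. A minimal usage sketch follows; the register choices are arbitrary.

        ! Hedged sketch: hardware population count of the 64-bit value in %o0.
        popc    %o0, %o1                 ! %o1 <- number of 1 bits in %o0
        and     %o1, 1, %o2              ! e.g., derive the parity of %o0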
422. n A suspended logical processor does remain coherent with the system To remain coherent a suspended logical processor fully participates in cache coherency and can generate transactions in response to coherency requests from other logical processors on the same or different CMT processor When a logical processor is set to run it continues execution with the instruction that was next to be executed when the logical processor was suspended It is transparent to the software running on a logical processor that it was ever suspended An interrupt to a suspended logical processor behaves the same as if the logical processor was too busy to accept the interrupt For example if an interrupt buffer is available the interrupt is ACK ed and a trap is taken only when the logical processor is set to run If however no interrupt buffer is available the interrupt is NACK ed The STICK and TICK counters will continue to count while a logical processor is suspended Suspending logical processors is intended for critical diagnostic and recovery code The interference with performance monitors using the TICK or STICK counters should not be a general issue Using the TICK or STICK counter to detect the suspending of a logical processor is not recommended LP Running Register AST_CORE_RUNNING The LP Running register is a shared register used by software to suspend and run selected logical processors When a logical processor is suspended the logical pr
… 79
3.15 OBP Backward Compatibility / Incompatibility 80
4. Reset and RED_state 83
4.1 RED_state Characteristics 83
4.2 Resets 84
4.2.1 Hard Power-on Reset (Hard POR, Power-on Reset, Power OK Reset) 84
4.2.2 System Reset (Soft POR, Sun Fireplane Interconnect Reset, POR) 85
4.2.3 Externally Initiated Reset (XIR) 85
4.2.4 Watchdog Reset (WDR) and error_state 85
4.2.5 Software Initiated Reset (SIR) 85
4.3 RED_state Trap Vector 85
4.4 Machine States After Reset 86
5. Performance Instrumentation and Optimization 91
5.1 Introduction to Optimization 91
5.2 Instruction Stream Issues 91
5.2.1 Instruction Alignment 92
5.2.2 Instruction Cache Timing 93
5.2.3 Executing Ins
424. n critical 32 byte HW_corrected L2 cache data ECC EDC c CEEN l Private Block Load HW_corrected L2 cache data ECC WDC c CEEN l Private writeback HW_corrected L2 cache data ECC CPC c CEEN l Shared copyout Uncorrectable L2 cache tag ECC SIU tag update or copyback from NCEEN foreign bus transactions or snoop TUE_SH S L2_tag_ ECC_en l 0 Shared operations Uncorrectable L2 cache tag ECC NCEEN all other L2 cache tag accesses TUE S L2_tag_ ECC_en l Private HW_corrected L2 cache tag ECC fetch load atomic instruction CEEN writeback copyout block load THCE C S Private L2_tag_ ECC_en store queue or prefetch queue read SW_correctable I cache data or tag SE IP DCR IPE nis parity instruction fetch SES none DP DCR DPE olo o olo Private 193 un microsystems Error Handling TABLE 7 16 Error Reporting Summary 3 of 4 Error Event AFSR status bit Trap taken Trap controlled by FERR AFAR Priority Set PRIV Set ME Shared Private SW_correctable D cache tag parity load like or store like instruction none DP DCR DPE Private HW_ corrected I cache or D cache tag parity snoop invalidation DSTAT 2 or 3 response bus error instruction fetch none BERR none NCEEN Private Private DSTAT 2 or 3 response bus error load like block load atomic instructions interrupt vector fetch operations BERR NCEEN Private DSTAT 2 or 3
425. n issue D cache physical tag is also read to determine whether it is a D cache store hit miss Consequently the D cache physical tag parity error checking is also done on store or block store Exceptions Traps and Trap Types 263 un microsystems 8 5 3 There are some cases where a speculative access to D cache belongs to a canceled or retried instruction Also the access could be from a special load instruction The error behaviors on D cache physical tag in such cases are described in TABLE 8 5 TABLE 8 5 D cache Physical Tag Parity Error Behavior on Canceled Retried Special Load dcache_parity_error Canceled Retried Special Load Reporting if Parity Error Trap Taken D cache miss due to invalid line valid 0 dcache_parity_error detection is suppressed dcache_parity_error detection is suppressed for D cache miss due to microtag miss S physical tag array dcache_parity_error has priority over DATED miss fast_data_access_MMU_miss Preceded by trap or retry of an older instruction for example following wrong Dropped suppressed by age at the trap logic path of a mispredicted branch instruction Microtag hit but D cache miss due to Physical Tag parity error dcache_parity_error is reported Dropped annulled instructions never enter an See execution pipeline thus no D cache access Block load internal ASI load Suppressed No Hardware Action on Trap for D cache Snoop Tag
426. n occur Masked Exception TEM 0 Enabled Exception TEM 1 Destination Register Written rd Flag s None set Can underflow overflow See 6 5 Asserts nvc nva Destination Register Written rd Flag s Trap Asserts nvc IEEE trap enabled Can underflow overflow See 6 5 Normall Infinity Infinity QNaN sign 0 expo 111 111 frac 111 111 Infinity None set Asserts nvc nva Asserts nvc Ne IEEE trap enabled Infinity None set 1 IEEE trap means fp_exception_IEEE_754 6 3 6 Compare Two f registers are compared The result of the compare is reflected in the Tech bits of the FSR registers The FCMP on page 137 TABLE 6 8 Number Compare E version of the instruction relates to subnormal operations See TABLE 6 16 Floating Point NUMBER COMPARE Instruction FCMP E ren res Result from the operation includes one or more of the following e Exception bit set See TABLE 6 12 e Trap occurs See abbreviations in TABLE 6 12 e The fcc bit set Masked Exception TEM 0 Enabled Exception TEM 1 Condition Code Setting Flag s Condition Code Setting Flag s Trap fecN fecN 0 0 fec 0 rs None set fec 0 rs None set 0 0 fec 0 rs None set fec 0 rs None set 0 Normal Infinity fec 1 rs None set fec 1 rs None set 0 Normal Infinity None set fec 0 rs None set
427. n overflow See 6 5 3 Normal Normal Normal Normal Normal Normal Normal Normal Can underflow See 6 5 4 Normal Normal Can underflow See 6 5 4 1 IEEE trap means fp_exception_IEEE_754 IEEE 754 1985 Standard Infinity Infinity Infinity None set Infinity None set A Asserts nvc Infinity Infinity QNaN Asserts nvc nva No TEBE trap enabled Infinity Infinit QNaN Asserts nvc nva No eee ANG 2 y gen IEEE trap enabled Infinity Infinity Infinity None set Infinity None set 123 un microsystems 6 3 2 Subtraction TABLE 6 4 SUBTRACTION Instruction FS VS2 FSUB rsj res gt rd Floating Point Subtraction Result from the operation includes one Number in f register See Trap Event on page 132 Exception bit set See TABLE 6 12 Trap occurs See abbreviations in TABLE 6 12 Underflow overflow can occur or more of the following Masked Exception TEM 0 Enabled Exception TEM 1 Destination Register Destin Written rd Flas Written rd ation Register Flag s Trap Normal Infinity Infinity Asserts ufc Infinit y nvc ufa nva 0 0 0 one set 0 one set 0 0 None set one set 0 0 None set None set 0 Normal Normal None set Normal 0 Normal Normal one set Normal one set 0 Infinity Infinity None set In
428. n that caused the error but TPC TL does not necessarily point to the corrupted instruction 8 1 2 3 Enabling Deferred Traps When an error occurs which leads to a deferred trap the trap will only be taken if the NCEEN bit is set in the Error Enable Register See Error Enable Register on page 178 The deferred trap is an instruction_access_error if the error occurred as the result of an I fetch The deferred trap is a data_access_error if the error occurred as the result of a load store block load block store or atomic data access instruction The NCEEN bit should normally be set If NCEEN is clear the processor will compute using corrupt data and instructions when an uncorrectable error occurs The NCEEN bit also controls a number of disrupting traps associated with uncorrectable errors 8 1 2 4 Errors Leading to Deferred Traps Deferred traps are generated by the following Uncorrectable L2 cache data ECC error as the result of a block load operation Uncorrectable L3 cache data ECC error as the result of a block load operation Uncorrectable system bus data ECC error in system bus read of memory or I O for I fetch load block load or atomic operations Uncorrectable ECC errors on L2 cache fills will be reported for any ECC error in the cache block not just the referenced word UE Uncorrectable system bus MTag ECC error for any incoming data but not including interrupt vectors These errors also cause the processor to assert
429. nd stored in W cache Comment Disrupting trap Disrupting trap Cacheable I cache fill request for unmapped address Non cacheable I cache fill Not installed garbage data not installed in I cache garbage data not garbage data not taken garbage data Deferred trap TO will be taken and Precise trap fast ecc error will be dropped Deferred trap TO will be unmapped address cache block load buffer request for unmapped TO Yes Not installed installed in I taken and Precise trap fast not taken 1 address cache ecc error will be dropped Cacheable D cache 32 garbage data date Deferred trap TO will be byte fill request for TO Yes Not installed N A installed in D 8 Si SE taken and Precise trap fast unmapped address cache ecc error will be dropped Non cacheable D cache garbage data not GE Deferred trap TO will be 32 byte fill request for TO Yes not installed N A installed in D 8 not E taken and Precise trap fast unmapped address cache ecc error will be dropped Cacheable D cache FP 64 e d T bacerdat Deferred trap TO will be bit load fill request for TO Yes Not installed NAA WE garbage data taken and Precise trap fast unmapped address cache but notin notitaken ecc error will be dropped PP P cache Opp garbage data not f Non cacheable D cache installed in both base dat Deferred trap TO will be FP 64 bit load fill request TO Yes Not instal
430. nds to the event logged in AFAR1 For example if AFSR1 CE is detected then AFSR1 UE which overwrites AFAR1 and AFSR1 UE is cleared but not AFSR1 CE then AFARI will be unlocked and ready for another event even though AFSR1 CE is still set This same argument also applies to primary AFSRI M_SYND and AFSR1 E_SYND fields Each field will be unlocked and available for further error capture when the specific AFSR1 status bit is cleared associated with the event logged in the field AFAR2 captures the address associated with the error stored in AFSR2 AFSR2_EXT AFSR2 AFSR2_EXT and AFAR2 are frozen with the status captured on the first error which sets one of bits 62 33 of AFSR1 and bits 11 0 AFSR1_EXT No overwrites occur in AFSR2 AFSR2_EXT and AFAR2 AFSR1 ASI 4Cj6 VA 63 0 0x0 private Name ASI_ASYNC_FAULT_STATUS AFSR2 ASI 4Cj VA 63 0 0x8 private 183 un microsystems Error Handling Name ASI_ASYNC_FAULT_STATUS2 TABLE 7 7 Asynchronous Fault Status Register 1 of 2 Bits Field Description RW 63 Reserved Reserved for future implementation R 62 TUE_SH Ee EE SC to copyback or tag update from foreign Sun RWIC 61 IMC Single bit ECC error on system bus microtag for interrupt vector RWIC 60 IMU Multi bit microtag ECC error on system bus microtag for interrupt vector RWIC 59 DTO Unmapped error from system bus for prefetch queue or store queue read opera
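Given the unlock behavior described above, software re-arms AFAR1 (and the associated M_SYND/E_SYND fields) by clearing the specific AFSR1 status bit that was logged, using a write-one-to-clear store. In the sketch below, only the ASI 0x4C / VA 0x0 encoding is taken from this section; AFSR1_CE_MASK is a hypothetical constant naming the single status bit to be cleared.

        ! Hedged sketch: clear one sticky AFSR1 status bit (write-one-to-clear).
        setx    AFSR1_CE_MASK, %g1, %g2  ! hypothetical mask of the bit to clear
        stxa    %g2, [%g0] 0x4C          ! only the bits written as 1 are cleared
        membar  #Sync                    ! internal ASI store ordering rule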
…Bandwidth of Foreign Writes 7
1.5.3 Larger Coherent Pending Queue 7
1.5.4 Supports Larger Main Memories 7
1.6 Enhanced Error Detection and Correction 7
2. Chip Multithreading (CMT) 9
2.1 Introduction 9
2.1.1 CMT Definition 10
2.1.2 General CMT Behavior 10
2.2 Accessing CMT Registers 11
2.2.1 Types of CMT Registers 11
2.2.2 Accessing CMT Registers Through ASI Interface 12
2.3 Private Processor Registers 12
2.3.1 LPID Register (ASI_CORE_ID) 12
2.3.2 LP Interrupt ID Register (ASI_INTR_ID) 13
2.3.3 CESR (Cluster Error Status Register) ID Register 14
2.4 Disabling and Suspending Logical Processors 14
2.4.1 LP Available Register (ASI_CORE_AVAILABLE) 14
2.4.2 Enabling and Disabling Logical Processors 15
2.4.3 Suspending and R
432. ness of its ECC If a multi bit error is detected it will be recognized as an uncorrectable error and the L3_CPU bit will be set to log this error condition A disrupting ECC_error trap will be generated in this case provided that the NCEEN bit is set in the Error Enable Register Multiple occurrences of this error will cause AFSR1 ME to be set When the processor reads uncorrectable L3 cache data and writes it to the system bus in a copyback operation it will compute correct system bus ECC for the corrupt data then invert bits 127 126 of the data to signal to other devices that the data is not usable This bit is not set if the copyout hits in the writeback buffer Instead the WDU bit is set Please refer to the section L3_WDU for an explanation of this If both the L3_CPU and the L3_MECC bits are set it indicates that an address parity error has occurred L3_MECC L3 cache data access errors on both 16 byte data of L3 cache data access L3 cache Tag ECC Errors L3_THCE For an instruction fetch a load atomic operation writeback copyout store queue exclusive request block store and prefetch queue operations and L3 cache data fill processor hardware corrects single bit errors in the L3 cache tags without software intervention then retries the operation For a snoop read operation the processor will return L3 cache snoop result based on the corrected L3 cache tag and correct L3 cache tag at the same time For these
433. ng is referenced Note A change in virtual color when allocating a free page does not require a D cache flush because the D cache is write through Committing Block Store Flushing Stable storage must be implemented by software cache flush Examples of stable storage are battery backed memory and a transaction log Data that are present and modified in the L2 cache L3 cache or W cache must be written back to the stable storage Two ASIs ASI_BLK_COMMIT_PRIMARY and AST_BLK_COMMIT_SECONDARY perform these writebacks efficiently when software can ensure exclusive write access to the block being flushed These ASIs write back the data to memory from the Floating Point Registers and invalidate the entry in the cache The data in the Floating Point Registers must first be loaded by a block load instruction A MEMBAR Sync instruction can be used to ensure that the flush is complete Displacement Flushing Cache flushing can also be accomplished by a displacement flush This procedure reads a range of addresses that map to the corresponding cache line being flushed forcing out modified entries in the local cache Note The range of read only addresses must be mapped in the MMU before starting a displacement flush otherwise the TLB miss handler may put new data into the caches Diagnostic ASI accesses to the D cache L2 cache or L3 cache can be used to invalidate a line but they are not
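A hedged sketch of the committing block store sequence described above: the 64-byte block is first brought into the floating-point registers with a block load and is then written back and invalidated with a committing block store, with MEMBAR #Sync ensuring completion. The address in %o0 and the numeric ASI encodings (0xF0 for ASI_BLK_P, 0xE0 for ASI_BLK_COMMIT_PRIMARY) are assumptions to be checked against the ASI tables, not values stated in this section; exclusive write access to the block is assumed.

        ! Hedged sketch: flush one 64-byte block to stable storage.
        ! %o0 = 64-byte aligned virtual address of the block.
        ldda    [%o0] 0xF0, %f0          ! block load into %f0-%f14 (ASI_BLK_P, assumed 0xF0)
        stda    %f0, [%o0] 0xE0          ! committing store (ASI_BLK_COMMIT_PRIMARY, assumed 0xE0)
        membar  #Sync                    ! ensure the writeback and invalidation are complete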
434. ng trap is serviced at a time no nested traps since the TL goes back to 0 before serving the next disrupting trap This is the case because when a trap is taken the hardware automatically sets the PSTATE IE bit to 0 which disable further disrupting traps and interrupts Subsequent disrupting traps and interrupts are blocked but not dropped 219 amp Sun microsystems 19 7 9 1 7 9 1 1 192 Error Handling IERR PERR Error Handling In the UltraSPARC IV processor additional error detection and handling are added in the memory arrays within the logical processors and EMU External Memory Unit to improve the RAS features of the processor Error Detection and Reporting Structures The errors detected in the memory arrays within each logical processor and the EMU are divided into three categories e Bus protocol errors Error conditions that violate system bus protocol and the UltraSPARC IV processor Coherency tables Internal errors Internal error conditions that point to inconsistent or illegal operations on some of the finite state machines of the memory arrays within each logical processor and the EMU e Tag errors Uncorrectable ECC errors on the L2 cache tag and L3 cache tag Asynchronous Fault Status Register Bus protocol errors are reported by setting the PERR field of the AFSR Internal errors are reported by setting the JERR bit in the AFSR Tag errors are reported by setting the TUE bit or TUE_
435. ngle operand FsTOi rs gt rd FsTOx rs gt rd FdTOi rs gt rd FdTOx rs gt rd Floating Point to Integer Number Conversion Number in f register See Trap Event on page 132 Exception bit set See TABLE 6 12 Trap occurs See abbreviations in TABLE 6 12 Underflow Overflow can occur Result from the operation includes one or more of the following Masked Exception TEM NVM 0 Enabled Exception TEM NVM 1 Destination Register Written rd Flag s Destination Register Written rd Flag s Trap SP DP Int 000 000 None set 000 000 None set 111 111 None set 111 111 None set Asserts nvc Infinity None set IEEE trap enabled Infinity Normal lt 23 100 000 Integer representation of the normal number None set None set Integer representation of the normal number Asserts nvc IEFE trap enabled None set Normal gt 23 Normal gt 2 1 011 111 Integer representation of the normal number Asserts nvc nva None set No Integer representation of the normal number Asserts nvc IEEE trap enabled None set Normal lt 23 1 Normal lt 263 100 000 Integer representation of the normal number Asserts nvc nva None set No Integer representation of the normal number Asserts nvc IEEE trap enabled None set Asserts nvc S gt 763 Normal gt 2 011
436. no matter what bus activity was taking place No other status is logged in the AFSR The AFAR will log the address causing time out e When a transaction stays at the top of CPQ more than the time period specified in the TOL field of Sun Fireplane Interconnect Configuration Register the CPQ_TO bit will be set in the EESR and the PA 42 4 will be logged into AFAR e When a transaction stays at the top of NCPQ more than the time period specified in the TOL field of Sun Fireplane Interconnect Configuration Register the NCPQ_TO bit will be set in the EESR and the PA 42 4 will be logged into AFAR 173 amp Sun microsystems 7 2 10 7 2 10 1 7 2 10 2 Error Handling Memory Errors and Prefetch Memory Errors and Prefetch by the Instruction Fetcher and I cache The instruction fetcher sends requests for instructions to the I cache before it is certain that the instructions will ever be executed This occurs for example when a branch is mispredicted These appear as perfectly normal operations to the rest of the processor and the rest of the system which cannot tell that they are prefetch requests One of these requests from the instruction fetcher to the I cache can miss in the I cache and cause a fetch from the L2 cache It can also miss in the L2 cache A miss in the L3 cache can cause a fetch from the system bus In addition any instruction fetch by an I cache miss causes an automatic read by the I cache of the next I cache
437. nrounded exact value of the result e r is the rounded value of u which occurs when there is no trap generated e Underflow is when 0 lt u lt smallest normal number TABLE 6 15 Underflow Exception Summary Underflow enabled UFM 1 masked UFM 0 masked UFM 0 Inexact don t care NXM x enabled NXM 1 masked NXM 0 r is minimum normal None None u r r is subnormal Asserts None exact IEEE trap enabled result r is zero None r is minimum normal Asserts ufe ASSET TG Asserts ufc ufa IEFE trap enabled IEFE trap enabled CZE ur r is subnormal EE EE Asserts ufc ufa inexact S IEEE trap enabled IEEE trap enabled ANG result r is zero EE Se Asserts ufc ufa IEEE trap enabled IEEE trap enabled wey 1 IEEE trap means fp_exception_IEEE_754 IEEE 754 1985 Standard 135 amp Sun microsystems 6 7 6 7 1 6 7 2 6 7 3 IEEE NaN Operations When a NaN operand appears or a NaN result is generated and the invalid nv trap is enabled FSR TEM NVM 1 the fp_exception_IEEE_754 occurs If the invalid mv trap is masked FSR TEM NVM 0 a signaling NaN operand is transformed into a quiet NaN A quiet NaN operand will propagate to the destination register Subnormal operations are described in TABLE 6 16 Whenever a NaN is created from non NaN operands the nv flag is set Signaling and Quiet NaNs SNaN and QNaN numbers are unsigned The sign bit is an extensi
438. nvalidate format 52 microtag access format 50 tag valid access format 49 Fireplane Address Register 295 illegal address alliasing 28 instruction cache tag valid access format 42 instruction prefetch buffer data access format 45 prefetch cache data access format 56 snoop tag access format 58 status data access format 55 space identifier ASI 277 address alias flushing 28 AFAR overwrite policy 198 state after reset 88 AFSR clearing 182 fields 181 182 non sticky bit overwrite policy 198 state after reset 88 writes to 182 alias address 24 boundary 28 aligning branch targets 92 alternate global registers MMU 312 330 INDEX Xxix un microsystems ancillary state register ASR 268 annulled slot 94 ASI _ASYNC_FAULT_STATUS 183 _ESTATE_ERROR_EN_REG 178 D MMU operations 331 I MMU operations 313 internal 277 load from DTLB Diagnostic register 337 load from ITLB Data Access register 316 318 nontranslating 277 registers state after reset 86 ASI accesses and shared resources 39 ASI_ASYNC_FAULT_ADDRESS 191 ASIASYNC_FAULT_STATUS 183 ASI_ASYNC_FAULT_STATUS2 184 ASI_BLK_COMMIT 28 ASI_BLK_COMMIT_PRIMARY 29 ASI_BLK_COMMIT_SECONDARY 29 ASI_BRANCH_PREDICTION_ARRAY 47 284 ASI_BTB_DATA 47 ASI_CESR_ID 295 ASI_CMP_ERROR_STEERING 177 ASI_DCACHE_DATA 48 282 ASI_DCACHE_INVALIDATE 51 282 ASI_DCACHE_SNOOP_TAG 51 282 ASI_DCACHE_TAG 49 282 ASI_DCACHE_UTAG 50 282 ASI_DCU_CONTROL_REGISTER 273 ASI_DMMU_SFSR 334 ASI_DMMU_TAG
439. o L3 cache or copyout to the system bus AFSR WDC AFSR CPC Reads of the L2 cache by the processor to perform an operation placed in the prefetch queue by an explicit software PREFETCH instruction AFSR EDC or a P cache hardware PREFETCH operation Reads of the L2 cache by the processor to perform an operation block load instruction AFSR EDC Hardware corrected errors optionally produce a disrupting ECC_error trap enabled by the CEEN bit in the Error Enable Register to carry out error logging Note For hardware corrected L2 cache data errors the hardware does not actually write the corrected data back to the L2 cache data array However the L2 cache sends corrected data to the W cache Therefore the instruction that creates the single bit error transaction can be completed without correcting the L2 cache data The L2 cache data may later be corrected in the disrupting ECC_error trap Software Correctable L2 cache Data ECC Errors Software correctable errors occur on single bit data errors detected as the result of the following transactions e Reads of data from the L2 cache to fill the D cache e Reads of data from the L2 cache to fill the I cache 159 amp Sun microsystems T2I4 Error Handling e Performing an atomic instruction All these events cause the processor to set AFSR UCC A SW_correctable error will generate a precise fast_ECC_error trap if the UCEEN bit is set in the Error Ena
…ASI  Syntax  Description
0x50  ASI_IMMU_TAG_TARGET  I-TSB Tag Target Register
0x50  ASI_IMMU_TAG_ACCESS  I-TLB Tag Access Register
0x51  ASI_IMMU_TSB_8K_PTR_REG  I-TSB 8 KB Pointer Register
0x52  ASI_IMMU_TSB_64K_PTR_REG  I-TSB 64 KB Pointer Register
0x58  ASI_DMMU_TAG_TARGET  D-TSB Tag Target Register
0x58  ASI_DMMU_TAG_ACCESS  D-TLB Tag Access Register
0x59  ASI_DMMU_TSB_8K_PTR_REG  D-TSB 8 KB Pointer Register
0x5A  ASI_DMMU_TSB_64K_PTR_REG  D-TSB 64 KB Pointer Register
0x4F  ASI_SCRATCHPAD_n_REG (n = 0-7, VA 0x00 to 0x38)  Scratchpad Registers (RW)
10.1.2 Rules for Accessing Internal ASIs
All stores to internal ASIs are single instruction group (SIG) instructions that basically have MEMBAR semantics before they can be issued. Stores to internal ASIs must be followed by a MEMBAR #Sync, FLUSH, DONE, or RETRY before the effects of the store are guaranteed to have taken effect. This includes ASI stores to the scratchpad registers, which must have a MEMBAR #Sync separating a write from any subsequent read. Specifically:
• A MEMBAR #Sync is needed after an internal ASI store (other than MMU ASIs) before the point at which its side effects must be visible. This MEMBAR must precede the next load or non-internal store. To avoid data corruption, the MEMBAR must also be in or before the delay slot of a delayed control transfer instruction of any type.
• A MEMBAR #Sync is needed af
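The rule above can be illustrated with the scratchpad registers themselves (ASI 0x4F, VA 0x00-0x38): a value written with STXA must be separated from any read-back by a MEMBAR #Sync. The value and register choices in the sketch are arbitrary.

        ! Hedged sketch: internal ASI store followed by the required MEMBAR #Sync.
        setx    0x1234, %g1, %g2         ! arbitrary value
        stxa    %g2, [%g0] 0x4F          ! scratchpad register 0 (ASI 0x4F, VA 0x00)
        membar  #Sync                    ! required before any dependent access
        ldxa    [%g0] 0x4F, %g3          ! read back; %g3 now equals %g2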
441. o not empty 49 bits of status from selected entry as follows BitFieldAttribute 48 46 pg_size 2 0 Page size 45pg_w Writable 44pg_priv Privileged 43pg_ebit Side effect 42pg_cp Cacheable Physical 41pg_cv Cacheable Virtual 40pg_le Invert endianness 39pg_nfo No fault 38PG_sn No snoop not used 37 1 paddr 42 6 PA 42 6 Ofetched Software Prefetch 1 48 0 pceache_data Hardware prefetch 0 Caches Cache Coherency and Diagnostics 55 un microsystems TABLE 3 51 shows the parity bits in the P cache status data array and the corresponding P cache data bytes that are parity protected TABLE 3 51 P cache status data array P cache Status Data Array bit P cache Data Bytes 50 63 56 corresponding to VA 5 3 111 51 55 48 corresponding to VA 5 3 110 52 47 40 corresponding to VA 5 3 101 53 39 32 corresponding to VA 5 3 100 54 31 24 corresponding to VA 5 3 011 55 23 16 corresponding to VA 5 3 010 56 15 8 corresponding to VA 5 3 001 57 7 0 corresponding to VA 5 3 000 3 10 2 Prefetch Cache Diagnostic Data Register Access ASI 3116 per logical processor Name ASI_PCACHE_DATA T The address format for P cache diagnostic data register access is shown in TABLE 3 52 TABLE 3 52 Prefetch Cache Diagnostic Data Access Address Format Bits Field Description 63 58 Reserved Reserved for future implementati
442. occurs All load instructions will have their RAW predict field cleared P cache Enable If cleared all references to the P cache are handled as P cache misses 45 P cache hardware Prefetch Enable If cleared the P cache does not generate any 44 hardware prefetch requests to the L2 cache Software prefetch instructions are not affected by this bit Registers 274 un microsystems TABLE 9 3 DCU Control Register Access Data Format ASI 4516 2 of 2 Bits Description Software Prefetch Enable If cleared software prefetch instructions do not 43 SPE generate a request to the L2 cache L3 cache or the system interface They will continue to be issued to the pipeline where they will be treated as NOPs 42 Reserved Reserved for future implementation Write Cache Enable If cleared all W cache references are handled as W cache misses Each store queue entry performs an RMW transaction to the L2 cache 41 and the W cache is maintained in a clean state Software is required to flush the W cache force it to a clean state before setting this bit to 0 40 33 PMI7 0 DCU Physical Address Data Watchpoint Mask Implemented as described in the i UltraSPARC III Cu Processor User s Manual 32 25 VMI7 0 DCU Virtual Address Data Watchpoint Mask Implemented as described in the UltraSPARC III Cu Processor User s Manual 24 23 DCU Physical Address Data Watchpoint Enable Implemented as described
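As a sketch of manipulating the DCU Control Register described above, the fragment below clears the SPE bit (bit 43) so that software PREFETCH instructions are treated as NOPs. The ASI (0x45) and the SPE bit position come from the table above; the VA of 0 and the read-modify-write structure are assumptions, and the trailing MEMBAR follows the general internal-ASI store rule.

        ! Hedged sketch: clear DCUCR.SPE (bit 43) to disable software prefetch requests.
        ldxa    [%g0] 0x45, %g2          ! read the DCU Control Register (VA 0 assumed)
        mov     1, %g3
        sllx    %g3, 43, %g3             ! mask for SPE, bit 43
        andn    %g2, %g3, %g2            ! clear the bit
        stxa    %g2, [%g0] 0x45          ! write it back
        membar  #Sync                    ! internal ASI store ordering rule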
443. ocessor in which each bit corresponds to a possible logical processor The UltraSPARC IV processor has two software visible logical processors Name ASI_CORE_ENABLE_ STATUS ASI 0x41 VA 63 0 0x 10 Read Only Privileged Bit 0 and bit 1 represents LP 0 and LP 1 respectively If a bit in the register is asserted 1 the corresponding logical processor is implemented and enabled A logical processor not implemented in a CMT device indicated as not available in the LP Available register cannot be enabled and its corresponding enabled bit in this register will be 0 A logical processor that is suspended is still considered enabled Chip Multithreading CMT 15 amp Sun microsystems 2 4 2 2 TABLE 2 5 shows the format of the LP Enable Status register Each bit represents one logical processor A bit set to 1 indicates the corresponding logical processor is enabled if set to 0 it is disabled In the UltraSPARC IV processor bit 0 and bit 1 are defined for LP 0 and LP 1 respectively Bits 63 2 are reserved and read as 0 TABLE 2 5 LP Enable Status Register Shared Bit Field Description 63 2 Mandatory value Should be 0 1 This bit represents LP 1 0 LP 0 This bit represents LP 0 A logical processor disabled by programming the LP Enable register it requires a power on reset or system reset for the updates to the LP Enable register to take effect is considered not
444. ocessor stops executing new instructions and will not initiate transactions except in response to a coherency transaction initiated Chip Multithreading CMT 17 amp Sun microsystems by another logical processor There may be an arbitrarily long but bounded delay from when the LP Running register is updated until the corresponding logical processor s actually suspends or is set to run The LP Running register is described in TABLE 2 7 is used by software to suspend selected logical processors Name ASI_CORE_RUNNING_RW ASI 0x41 VA 63 0 0x50 Privileged Read Write Name AST_CORE_RUNNING_W1S ASI 0x41 VA 63 0 0x60 Privileged Write Only Write One to Set Name AST_CORE_RUNNING_W1C ASI 0x41 VA 63 0 0x68 Privileged Write Only Write One to Clear TABLE 2 7 LP Running Register Shared 63 2 Mandatory value Should be 0 1 LP 1 This bit represents LP 1 0 LP 0 This bit represents LP 0 The LP Running register is a 64 bit register Each bit of the register represents one logical processor with bit 0 representing LP 0 and bit 1 representing LP 1 Once a logical processor is set to suspend the logical processor will stop fetching instructions complete the instructions in the logical processor and the instruction buffers and then become idle When the logical processor is set to run it continues execution from the point it was suspended A logical process
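A hedged sketch of the self-suspend protocol for the LP Running register defined above: the logical processor writes its own bit to the write-one-to-clear address (ASI 0x41, VA 0x68) and then issues a FLUSH, as required by the usage rules later in this chapter. The example assumes it is running on LP 1; discovering the LPID dynamically is omitted.

        ! Hedged sketch: LP 1 suspends itself through ASI_CORE_RUNNING_W1C.
        mov     0x68, %g1                ! VA of ASI_CORE_RUNNING_W1C
        mov     2, %g2                   ! bit 1 = LP 1 (this logical processor, assumed)
        stxa    %g2, [%g1] 0x41          ! clear LP 1's running bit
        flush   %g0                      ! FLUSH after the ASI write; it may complete
                                         ! before or after the suspension takes effect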
445. oes not affect other types of Sun Fireplane Interconnect transactions 2 4 2 4 1 Disabling and Suspending Logical Processors The CMT programming model provides the ability to disable or temporarily suspend logical processors This section describes the interface for probing which logical processors are available enabled and not suspended This section also describes the interface for enabling disabling and suspending running logical processors The registers described in this section are shared between logical processors LP Available Register ASI_CORE_AVAILABLE The LP Available register is a shared register that indicates the number of logical processors implemented in a CMT processor and which logical processor numbers are assigned to them Chip Multithreading CMT 14 amp Sun microsystems Name ASI_CORE_AVAILABLE ASI 0x41 VA 63 0 0x00 Read Only Privileged The LP Available register is a read only register with fields in which each bit position corresponds to a logical processor Bit 0 represents LP 0 bit 1 represents LP 1 If a bit position in the register is asserted 1 the corresponding logical processor is implemented and is functional in the CMT processor If a bit position in the register is not asserted 0 the corresponding logical processor is not implemented or was permanently disabled at manufacturing time An implemented logical processor is a logical processor that can be enable
446. off with the following conditions e If there is a translation hit in the T16 e If the I TLB parity checking is disabled The I TLB parity enable is controlled by bit 16 of the Dispatch Control Register Instruction and Data Memory Management Unit 304 amp Sun microsystems If the I MMU enable bit is off The I MMU enable is controlled by bit 2 IM of the Data cache Unit Control Register The T512 hit signal and the I TLB tag parity error signal generation share many common bits including VA 63 21 If one of these common bits is flipped a miss trap fast_instruction_access_MMU_miss will take place instead of a parity error trap instruction_access_exception as fast_instruction_access_MMU_miss has a higher priority than instruction_access_exception The parity error in this case will be indirectly corrected when the TLB entry gets replaced by the fast_instruction_access_MMU_miss trap handler The T512 also checks both tag and data parities during the I TLB demap operation If a parity error is found during the demap operation the corresponding entry will be invalidated by the hardware regardless of whether it was a hit or miss This will only clear the valid bit but not the parity error No instruction_access_exception will be reported during the demap operation When writing to the I TLB using the Data In Register ASI_ITLB_DATA_IN_REG during a replacement or using the Data Access Register ASIT_ITLB_DATA_ACCESS_REG both tag
447. on A 2 bit entry that selects an associative way 4 way associative 10 9 PC_way 2 b00 Way 0 2 b01 Way 2 b10 Way 2 2 b11 Way3 8 6 PC_addr A 3 bit index VA 8 6 that selects a P cache entry 5 3 PC_dbl_word A 3 bit field that selects one of 8 doublewords read from the Data Return 2 0 Reserved Reserved for future implementation The data format for P cache diagnostic data register access is shown in TABLE 3 53 TABLE 3 53 Prefetch Cache Diagnostic Data Access Data Format Bits Field Description pcache_data is a doubleword of P cache data 3 10 3 Prefetch Cache Virtual Tag Valid Fields Access ASI 3216 per logical processor Name ASI_PCACHI GI _ TAG Caches Cache Coherency and Diagnostics 56 un microsystems The address format for P cache virtual tag valid fields access is shown in TABLE 3 54 TABLE 3 54 Prefetch Cache Tag Register Access Address Format Bits Field Description 63 12 Reserved Reserved for future implementation A 1 bit field that selects port 0 or port 1 of the dual ported RAM Both ports give the same 11 PC_port SE i p values This bit is used only during manufacture testing A 2 bit entry that selects a way 4 way associative 10 9 PC_way 7 d a 2 b00 Way 0 2 b01 Way 2 b10 Way 2 2 b11 Way3 8 6 PC_addr A 3 bit index VA 8 6 that selects a P cache tag entry 5 0 Reserved Reserved for future implementation The da
448. on of the NaN s fraction field SNaN operands propagate to the destination register as a QNaN result when the nv exception is masked All operations with NaN operands keep the sign bit unchanged including an FSQRT operation NaNs are generated for the conditions shown in NaN Results From Operands Without NaNs on page 137 SNaN to QNaN Transformation The signaling to quiet NaN transformation causes the following events e The most significant bits of the operand fraction are copied to the most significant bits of the result s fraction e In conversion to a narrower format excess low order bits of the operand fraction are discarded e In conversion to a wider format unwritten low order bits of the result fraction are set to 0 e The quiet bit the most significant bit of the result fraction is set to 1 The NaN transformation produces a QNaN e The sign bit is copied from the operand to the result without modification Operations With NaN Operands Operations with NaN operands may assert the IEEE invalid trap flag nv These operations are listed in TABLE 6 16 If the invalid trap is enabled FSR TEM NVM 1 a trap event occurs as described in Trap Event on page 132 IEEE 754 1985 Standard 136 un microsystems TABLE 6 16 Results From NaN Operands Result from the operation includes one or more of the following Number in f register See Trap Event note page 132 Exception bit set See TABLE 6 12 Trap
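The transformation rules above can be observed directly: with the invalid trap masked (FSR.TEM.NVM = 0), an arithmetic operation on a signaling NaN delivers the corresponding quiet NaN and asserts nvc. The sketch below assumes the standard double-precision encodings (0x7FF0000000000001 as an SNaN, quieted to 0x7FF8000000000001) and arbitrary data labels; it is an illustration of the rule, not an excerpt from the tables above.

        ! Hedged sketch: quieting of a signaling NaN when the invalid trap is masked.
        ! Data shown inline for brevity.
        .align  8
snan:   .xword  0x7ff0000000000001      ! double-precision SNaN
one:    .xword  0x3ff0000000000000      ! 1.0
res:    .xword  0

        setx    snan, %g1, %o0
        ldd     [%o0], %f0               ! %f0:%f1 <- SNaN
        setx    one, %g1, %o1
        ldd     [%o1], %f2               ! %f2:%f3 <- 1.0
        faddd   %f0, %f2, %f4            ! quiet bit is set in the result; nvc asserted
        setx    res, %g1, %o2
        std     %f4, [%o2]               ! stored value: 0x7ff8000000000001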
449. ons in the fetch group causing the second branch in the group to be refetched Dispatch0_2nd_br PICL Stall cycles due to the event that no instructions are dispatched because the I queue is empty due to various other events including branch target address fetch and various events which cause an instruction to be refetched Dispatch0_other PICU Note that this count does not include IIU stall cycles due to recirculation measured by Re_ counters see Section 5 8 5 Also the count does not include the stall cycles due to I cache misses measured by Dispatcht IC miss and refetch due to second branch in the fetch group measured by Dispatch0O_2nd_br Note If multiple events result in IU stall in a given cycle only one of the counters will be incremented based on the following priority Re_ gt Dispatch0_ic_miss gt Dispatch0O_2nd_br gt Dispatch0_other 5 8 4 R stage Stall Counts The counters described in TABLE 5 9 count the stall cycles at the R stage of the pipeline The stalls may happen due to the unavailability of a resource or the unavailability of the result of a preceding instruction Stalls are counted for each clock at which the associated condition is true The counters are private counters TABLE 5 9 Counters for R stage Stalls Counter Description Stall cycles when a store instruction which is the next instruction to be ERR executed is stalled in the R stage due to the store queue being full
450. opped L3 cache to L2 cache fill E E D to D cache for 32 byte load THCE No No corrects the N A N A the request is tag retried and writes 64 byte data to L2 cache 231 un microsystems Error Handling TABLE 7 23 1L2 cache Tag CE and UE errors 3 of 7 Errors Bie Error Event logged in z L2 cache Tag gt Comment ecc Pin EN AFSR E error L3 cache to L2 cache fill request with L2 cache tag UE Disrupting trap forwards the critical 32 byte TUE N Y Original t N A to D cache for 32 byte load g EUT AE the request is and writes 64 byte data to L2 dropped cache L3 cache to L2 cache fill request with L2 cache tag CE L2 Pipe Disrupting trap forwards the critical 32 byte THCE No No Geet the N A th 1 to D cache for FP 64 bit load CPS UE SETS and writes 64 byte data to P a retried cache and L2 cache L3 cache to L2 cache fill request with L2 cache tag UE Disrupting trap forwards the critical 32 byte TUE N y Original t N A i to D cache for FP 64 bit load 3 S SE EE the request is and writes 64 byte data to P dropped cache and L2 cache L3 cache to L2 cache fill request with L2 cache tag CE L2 Pipe Disrupting trap Comat ee Os byte to E THCE No No corrects the N A th ti cache block load buffer for R KEE block load and writes 64 byte ag retried data to L2 cache L3 cache to L2 cache fill request with L2 cache tag UE Disrupting trap forwards the 64 byte to P TUE N Yi Original t
451. or Handling register the hardware guarantees that the error reporting and trap are both delivered to the same logical processor either the logical processor specified by the old or new value of the Steering register Recording Shared Errors Before a trap can be generated for a shared error the error must be recorded Shared errors are recorded in the asynchronous error reporting mechanism of the logical processor specified by the ASI_CMP_ERROR_STEERING register The asynchronous error reporting mechanism that is used for reporting private errors is also used in this case Error Enable Register Refer to TABLE 7 6 for the state of this register after reset ASI 4B 16 VA 63 0 0x0 Shared Name ASI_ESTATE_ERROR_EN_REG TABLE 7 6 Error Enable Register After Reset Bits Field Description RW 22 FPPE Force Cport data parity error on data parity bit on both incoming and outgoing data RW 21 FDPE Force Cport data parity error on data LSB bit on both incoming and outgoing data RW 20 FISAPE Force Sun Fireplane Interconnect address parity error on parity bit RW 19 FSDAPE Force SDRAM address parity error on parity bit RW 18 FMT Force error on the outgoing system microtag ECC RW 17 14 FMECC Forced error on the outgoing system microtag ECC vector RW 13 FMD Force error on the outgoing system Data ECC RW 12 4 FDECC Forced error on the outgoing system Data ECC ve
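The force-error fields above exist for error-injection testing. The fragment below sketches how diagnostic software might set FMD (bit 13) together with a pattern in FDECC (bits 12:4) so that the next outgoing system data transfer carries a deliberately corrupted ECC vector. The bit positions come from TABLE 7-6 above; the read-modify-write structure, the chosen FDECC pattern, and the trailing MEMBAR are assumptions.

        ! Hedged sketch: arm a forced outgoing system-data ECC error for injection tests.
        ldxa    [%g0] 0x4B, %g2          ! read the Error Enable Register (ASI 0x4B, VA 0)
        mov     1, %g3
        sllx    %g3, 13, %g3             ! FMD, bit 13
        or      %g2, %g3, %g2
        mov     1, %g4
        sllx    %g4, 4, %g4              ! FDECC[12:4] = arbitrary single-bit pattern
        or      %g2, %g4, %g2
        stxa    %g2, [%g0] 0x4B          ! write it back
        membar  #Sync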
452. or I Good data Good Disrupting tra cache request with tag CE L3_THCE eek D installed in I data Ree request hits L2 cache 8 cache taken L3 cache access for I L3 Pipe Disrupting trap cache request with tag CE L3_THCE p N A N A the request is corrects the tag 1 request misses L2 cache retried L3 cache access for I Good data Good Disrupting trap cache request with tag UE Original tag installed in I data request hits L2 cache cache taken L3 cache access for I Disrupting trap cache request with tag UE Original tag N A N A the request is request misses L2 cache dropped D cache 32 byte load Good data Good P i Disrupting tra request with L3 tag CE L3_THCE GE he ag installed in D data E request hits L2 cache cache taken D cache 32 byte load tapi Disrupting trap request with L3 tag CE Ke i q g L3_THCE corrects the tag N A N A the request is request misses L2 cache retried D cache 32 byte load Good data Good Disrupting trap request with L3 tag UE Original tag installed in D data request hits L2 cache cache taken D cache 32 byte load Disrupting trap request with L3 tag UE Original tag N A No action the request is request misses L2 cache dropped Good data D cache FP 64 bit load Good S Disrupting tra request with L3 tag CE L3_THCE L3 Pipe Ee data GH corrects the tag cache and P request hits L2 cache cache taken D cache FP 64 bit load Ee r Disrupting trap request with L3 tag CE L3_THCE p
453. or is allowed to suspend itself A logical processor that suspends itself should follow the ASI write by a FLUSH instruction This satisfies the ASI writing rules and guarantees that the logical processor will be suspended and no instructions will be executed following the FLUSH if the logical processor is successfully suspended The FLUSH instruction itself may be asserted before or after the logical processor is suspended Note The UltraSPARC IV processor will not allow software to suspend both logical processors An update to the LP Running register that would cause both logical processors to become suspended results only in the suspension of the logical processor not making the request with the logical processor making the update automatically set to run instead by hardware To minimize the need for synchronization between logical processors in writing to this register separate virtual addresses are provided to set and reset the bits of this register This combined with the reset setting means that the need for special interlocking on the register is not necessary When writing to this register there is a choice between writing an exact value and modifying individual bits When a logical processor suspends itself a write to the clear bit VA should be used When a logical processor wants to become the only logical processor active it is more appropriate to write the desired value directly to the direct access VA since this eliminates
454. orrectable error occurs the error pin is asserted In this case it will not report any L2 data error since it does not know which way to access It is not possible for software to differentiate this event from the same errors arriving on different clock cycles but the probability of having simultaneous errors is extremely low This difficulty only applies to APART and AFAR2 AFSR1 AFSR1_EXT and AFSR2 AFSR2_EXT fields do not suffer this confusion on simultaneously arriving errors and the normal overwrite priorities apply there The policy of flushing the entire D cache on a deferred data_access_error trap or a precise fast_ECC_error trap avoids problems with the AFAR showing an inappropriate address when 1 Multiple errors occur 198 amp Sun microsystems T52 pie 2 Simultaneous L2 cache L3 cache and system bus errors occur An L2 cache or L3 cache error is captured in AFSR and AFAR yet no trap is generated because the event was a speculative instruction later discarded Later a trap finds the old AFAR 4 A UE was detected in the first half on a 64 byte block from system bus but the second half of the 64 byte block also in error was loaded into the D cache AFSR1 E_SYND Data ECC Syndrome Overwrite Policy Class 3 UCU UCC L3_UCU L3_UCC the highest priority Class 2 UE DUE IVU EDU WDU CPU L3_EDU L3_WDU L3_CPU Class 1 CE IVC EDC WDC CPC L3_EDC L3_WDC L3_CPC the lowest priority Priority for E_SYND
455. ort for threads at the operating system level and in creating symmetric multiprocessor SMP systems for running threaded workloads is now taking a leading role in developing CMT processor technology and driving support for threads down to the basic hardware level Sun s MAJC 5200 processor shipped in 2000 was one of the first commercial CMT products The UltraSPARC IV processor will be the second CMT processor in the UltraSPARC III IV family after the UltraSPARC IV processor Relative to the UltraSPARC III processor generation the UltraSPARC IV processor generation achieves a large leap in thread level parallelism by integrating two UltraSPARC III processor cores onto a single chip Using UltraSPARC IV processors products built for the UltraSPARC III family of processors effectively deliver twice as many logical processors in the same system The dual core UltraSPARC IV processors are designed to be compatible with the single core UltraSPARC III processors in terms of both spatial and thermal footprint It is therefore possible to upgrade a Sun Fire system based on UltraSPARC III processors to UltraSPARC IV processors This upgrade results in a significant increase in throughput per cubic foot per watt and per dollar for that system Relative to the initial UltraSPARC IV processor built in 130 nm process technology the UltraSPARC IV processor takes advantage of the greatly expanded transistor budget possible with 90 nm technology to p
456. orward only request with L2 cache tag UE TUE y TEE N A Distupting trap o es riginal ta i forwards the critical 32 byte S S e Ge e to D cache for FP 64 bit load PP SIU forward and fill request with L2 cache tag CE i Ss L2 Pipe Disrupting trap forwards the critical 32 byte THCE No No G trects th N A the request is to D cache for atomic request t 8 ag retried and writes 64 byte data to L2 cache SIU forward and fill request with L2 cache tag UE Disrupting trap forwards the critical 32 byte TUE No Yes Original tag N A the request is to D cache for atomic request and writes 64 byte data to L2 cache dropped 234 TABLE 7 23 L2 cache Tag CE and UE errors 6 of 7 flag Ge fast Error 2 Event logged in 7 L2 cache Tag 3 Comment ecc Pin EN AFSR 2 error SIU forward and fill request with L2 cache tag CE forwards the critical 32 byte L2 Pipe Disrupting trap and then 2nd 32 byte to P THCE No No corrects the the request is cache for Prefetch 0 1 2 3 tag retried request and writes 64 byte data to L2 cache SIU forward and fill request with L2 cache tag UE forwards the critical 32 byte Disrupting trap and then 2nd 32 byte to P TUE No Yes Original tag the request is cache for Prefetch 0 1 2 3 dropped request and writes data to L2 cache SIU forward and fill request with L2 cache tag CE forwards the critical 32 byte L2 Pipe Disrupting trap and thien sad 32 byte
457. ositions found above in the description of the TTE tag allowing a simple XOR instruction for TSB hit detection Instruction and Data Memory Management Unit 330 amp Sun microsystems 1325 13 2 6 B24 13 2 7 1 Faults and Traps On a mem_address_not_aligned trap that occurs during a JMPL or RETURN instruction the UltraSPARC IV processor updates the D SFSR register with the FT field set to 0 and updates the D SFAR register with the fault address For details please refer to the UltraSPARC HI Cu Processor User s Manual Reset Disable and RED state Behavior Please refer to the UltraSPARC IIT Cu Processor User s Manual for general details Note While the D MMU is disabled and the default CV bit in the Data Cache Unit Control Register is set to 0 data in the D cache can be accessed only through the load and store alternates to the internal D cache access ASI Normal loads and stores bypass the D cache Data in the D cache cannot be accessed by load or store alternates that use AST_PHYS_EC or AST_PHYS_EC_LITTLE Other caches are physically indexed or are still accessible Internal Registers and ASI Operations Please refer to the UltraSPARC III Cu Processor User s Manual for details D TLB Tag Access Extension Registers Tag Access Extension Register ASI 58165 VA 63 0 60 16 Name ASI_DMMU_TAG_ACCESS_EXT Access RW The D TLB Tag Access Extension Register saves
458. ot and in the first three groups following an annulling branch Conditional Moves vs Conditional Branches The MOVcc and MOVR instructions provide an alternative to conditional branches for executing short code segments The UltraSPARC processor differentiates the two as follows e Conditional branches Distancing the SETcc from Bicc does not gain any performance The penalty for a mispredicted branch is always eight cycles SETcc Bicc and the delay slot can be grouped together as shown in FIGURE 5 1 setcc G bicc G delay G FIGURE 5 1 Handling of Conditional Branches Performance Instrumentation and Optimization 94 amp Sun microsystems 5 2 6 Seat 5 2 8 5 29 Conditional moves A use of the destination register for the MOVcc follows the same rule as a load use FIGURE 5 2 shows a typical example FIGURE 5 2 Handling of MOVCC If a branch is correctly predicted the issue rate can be higher than that of a branch that is replaced by a conditional move If a branch is not predictable the mispredicted penalty is significantly higher than the extra latency of a conditional move Instruction Cache Utilization Grouping blocks that are executed frequently can effectively increase the apparent size of the I cache Case studies show that often half of the I cache entries are never executed Placing rarely executed code out of a line containing a frequently ex
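The trade-off above can be made concrete with a small selection idiom. Both fragments below compute max(%o1, %o2) into %o0: the first with a conditional branch, which pays the eight-cycle penalty when mispredicted, and the second with a MOVcc, whose destination-use latency follows the load-use rule. Register and label choices are arbitrary.

        ! Hedged sketch (branch form): %o0 = max(%o1, %o2)
        mov     %o1, %o0
        cmp     %o1, %o2
        bge,pt  %xcc, max_done           ! profitable when the branch is predictable
          nop
        mov     %o2, %o0
max_done:

        ! Hedged sketch (conditional-move form): %o0 = max(%o1, %o2)
        mov     %o1, %o0
        cmp     %o1, %o2
        movl    %xcc, %o2, %o0           ! if %o1 < %o2 then %o0 = %o2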
459. ot load data into the D cache and will never cause a precise trap 8 1 2 Deferred Traps Deferred traps may corrupt the processor state Such traps lead to termination of the currently executing process or result in a system reset if the system state has been corrupted Error logging information allows software to determine if the system state has been corrupted 8 1 2 1 Error Barriers A MEMBAR Sync instruction provides an error barrier for deferred traps It ensures that deferred traps from earlier accesses will not be reported after the MEMBAR A MEMBAR Sync should be used when context switching or any time the PSTATE PRIV bit is changed to provide error isolation between processes DONE and RETRY instructions implicitly provide the same function as MEMBAR Sync so that they act as error barriers Errors reported as the result of fetching user code after a DONE or RETRY are always reported after the DONE or RETRY Exceptions Traps and Trap Types 254 amp Sun microsystems 8 1 2 2 TPC TNPC and Deferred Traps After a deferred trap the contents of TPC TL and TNPC TL are undefined except for the special peek sequence described below Because they do not generally contain the oldest unexecuted instruction and its next PC execution cannot normally be resumed from the point that the trap is taken Instruction access errors are reported before executing the instructio
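A hedged sketch of the error-barrier rule above: a MEMBAR #Sync is issued before the privilege change so that any deferred trap from earlier accesses is attributed to the old context. The PSTATE manipulation shown (PRIV in bit 2 of PSTATE, per the SPARC V9 layout) is only illustrative of a context-switch path, not actual kernel code; in practice the privilege change is usually performed by DONE or RETRY, which provide the same barrier implicitly.

        ! Hedged sketch: error barrier before changing PSTATE.PRIV.
        membar  #Sync                    ! deferred traps from earlier accesses report here
        rdpr    %pstate, %g1
        andn    %g1, 0x4, %g1            ! clear PSTATE.PRIV (bit 2, SPARC V9 layout assumed)
        wrpr    %g1, 0, %pstate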
460. other single bit data ECC errors are corrected by hardware and sent to the P cache or W cache In any case the L2 cache or L3 cache data is not corrected Therefore software support is required to correct the single bit ECC errors A precise trap also occurs when an uncorrectable L2 cache data ECC error or L3 cache data ECC error is detected as the result of a D cache load miss atomic instruction or I cache miss If the affected line is in the E or S MOESI state software can recover from this problem in the precise trap handler If the affected line is in the M or O states a process or the whole domain must be terminated An I cache or D cache error can be detected for speculative instruction or data fetches An L2 cache or L3 cache error can be detected when the instruction fetch is missed in the I cache Errors can also be detected when the I cache autonomously fetches the second 32 byte line of a 64 byte L2 cache or L3 cache line If an error detected in this way is on an instruction which is never executed the precise trap associated with the error is never taken However L2 cache or L3 cache errors of this kind will be logged in the primary and secondary Asynchronous Fault Status Register AFSR and Asynchronous Fault Address Register AFAR When a speculative request is canceled after an associated L2 cache or L3 cache error was generated these errors will not be logged in the AFSR or AFAR The speculative request when canceled will n
461. p As described in Software Correctable L3 cache Data ECC Errors on page 165 these errors will be recoverable by the trap handler if the line at fault was in the E or S MOESI state Uncorrectable L3 cache data ECC errors can also occur on multi bit data errors detected as the result of the following transactions e Reads of data from an O Os or M state line to respond to an incoming snoop request copyout e Reads of data from an O Os or M state line to write it back to memory writeback e Reads of data from the L3 cache to merge with bytes being written by the processor W cache exclusive request 168 amp Sun microsystems Error Handling e Reads of data from the L3 cache to perform an operation placed in the prefetch queue by an explicit software PREFETCH instruction or a P cache hardware PREFETCH operation e Reads of the L2 cache by the processor to perform an operation block load instruction For copyout the processor reading the uncorrectable error from its L3 cache sets its AFSR_EXT L3_CPU bit In this case deliberately bad signalling ECC is sent with the data to the processor issuing the snoop request If the processor issuing the snoop request is a processor it takes an instruction_access_error or data_access_error trap If the procssor issuing the snoop request is an I O device it will have some device specific error reporting mechanism which the device driver must handle The processo
462. Note: If the Int IDs of the two logical processors in an UltraSPARC IV processor are not unique in a system, then the behavior of the logical processor when an interrupt specifying that ID is sent or received is undefined.

CESR (Cluster Error Status Register) ID Register

The CESR ID register, summarized in TABLE 2-3, provides support for a tightly clustered system. This register contains an 8-bit field, CESR_ID, which uniquely identifies a logical processor in a tightly clustered system. Certain transactions append this value to the transaction. This allows software at a remote node, or within the cluster switch, to associate the initiating logical processor with the transaction. The CESR ID register should only be used with the appropriate cluster interconnect and the corresponding cluster-specific software support. The specific value to encode in the CESR ID register is platform specific. When not used in a cluster architecture, this register should always be programmed to zero.

Name: ASI_CESR_ID
ASI 0x63, VA<63:0> = 0x40
Read/Write, Privileged Access

TABLE 2-3  CESR ID Register
Bits   Field      Description
63:8   Reserved   Reserved for future implementation
7:0    CESR_ID    The CESR_ID field is an 8-bit CESR ID in the bus transaction. For a RBIO/WBIO transaction, CESR<7:0> is encoded appropriately.

Note: The CESR_ID only affects the Sun Fireplane Interconnect RBIO and WBIO transactions. It d…
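The register programming itself is a single privileged ASI store. The fragment below is a sketch of the non-clustered case (CESR_ID = 0) using the ASI and VA values listed above; the trailing MEMBAR #Sync is added conservatively to complete the internal ASI store before any later access.

    set     0x40, %g1              ! VA of the CESR ID register
    stxa    %g0, [%g1] 0x63        ! ASI_CESR_ID: program CESR_ID = 0 when not clustered
    membar  #Sync                  ! complete the internal ASI store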
463. …parity_error                                                     Trap Taken
FP load miss P-cache            dcache_parity_error detection is suppressed   No
FP load hit P-cache             dcache_parity_error detected                  Yes
Software prefetch hit P-cache   dcache_parity_error detected                  No
Internal ASI load               dcache_parity_error detected                  No

Registers

For general register information, see the UltraSPARC III Cu Processor User's Manual. The registers specific to the UltraSPARC IV processor are discussed in this chapter.

Chapter Topics
• Floating Point State Register (FSR) on page 267
• PSTATE Register on page 268
• Ancillary State Registers (ASRs) on page 268
• Registers Referenced Through ASIs on page 273

Note: In the UltraSPARC IV processor, all user-visible registers are private registers. This includes all General Purpose Registers (GPRs) and Floating Point Registers (FPRs), as well as all Ancillary State Registers (ASRs) and privileged registers. Some ASI (Address Space Identifier) registers are private, while some ASI registers are shared by both logical processors. ASI Assignments on page 280 lists all the ASI registers in the UltraSPARC IV processor and specifies whether each register is a private or a shared register.

9.1 Floating Point State Register (FSR)

9.1.1 FSR_nonstandard_fp (NS)

If a floating-point operation generates a subnormal val…
464. pe Disrupting trap stores W cache exclusive request with tag CE THCE No No corrects the N A L2 cache pipe tag retries the request stores W cache exclusive CG Disrupting trap the request with tag UE TUE NO ZS Original tag NA request is dropped i Disrupting tra W cache block store request THCE N e Se N A SS f g with tag CE C No o corrects the L2 cache pipe tag retries the request pi oa neste i Disrupting trap the cache block store reques E request is dropped with tag UE TUE No Yes Original tag N A q pped Disrupting tra W cache eviction request with L2 Pipe N i Se gt q THCE No No corrects the N A eviction data tag CE t action written into L2 ag cache ea m Disrupting trap tag UE EE request wit TUE No Yes Original tag N A W cache eviction is dropped i Disrupting tra L2 cache eviction tag read THCE No No ao N A No SE i g request with tag CE action L2 cache pipe tag retries the request NER Disrupting trap oes viction tag read TUE No Yes Original tag N A Ng eviction is request with tag UE action dropped L3 cache to L2 cache fill request with L2 cache tag CE L2 Pipe Disrupting trap forwards the 64 byte data to THCE No No corrects the N A the request is I cache and writes the 64 byte tag retried data to L2 cache L3 cache to L2 cache fill request with L2 cache tag UE forwards the 64 byte data to Wi Disiupting rap I cache and writes the 64 byte TUE No Yes Original tag N A the request is data to L2 cache dr
465. …pped or helper instructions.

5.8.1.1 Synthesized Clocks Per Instruction (CPI)

The cycle and instruction counts can be used to calculate the average number of clock cycles per instruction (CPI):

    CPI = Cycle_cnt / Instr_cnt        (EQ 1)

In EQ 1, the formula refers to clock cycles per instruction.

5.8.2  5.8.3  IIU Branch Statistics

The counters listed in TABLE 5-7 record branch prediction statistics for retired, non-annulled branch instructions. A retired branch in the following descriptions refers to a branch that reaches the D-stage of the pipeline without being cancelled. These counters are private counters.

TABLE 5-7  Counters for Collecting IIU Branch Statistics
IU_stat_br_miss_taken (PICL)      Number of retired, non-annulled conditional branches that were predicted to be taken, but in fact were not taken.
IU_stat_br_miss_untaken (PICU)    Number of retired, non-annulled conditional branches that were predicted to be not taken, but in fact were taken.
IU_stat_br_count_taken (PICU)     Number of retired, non-annulled conditional branches that were taken.
IU_stat_br_count_untaken (PICL)   Number of retired, non-annulled conditional branches that were not taken.
IU_stat_jmp_correct_pred (PICL)   Number of retired, non-annulled register-indirect jumps predicted correctly. Register-indirect jumps are jmpl instructions (op = 2, op3 = 0x38), except for the cases treat…
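As a rough illustration of how the cycle and instruction counts are obtained, the fragment below reads the PIC register through its ASR. It assumes that the PCR has already been programmed to select Cycle_cnt and Instr_cnt on the two counters, and that PICL and PICU occupy the lower and upper 32 bits of %pic as described in the PIC register section; it is a sketch, not a complete measurement harness.

    rd      %pic, %o0             ! read both performance counters at once
    srlx    %o0, 32, %o1          ! %o1 = PICU count (assumed upper 32 bits)
    srl     %o0, 0, %o0           ! %o0 = PICL count (assumed lower 32 bits, zero-extended)
    ! With Cycle_cnt on one counter and Instr_cnt on the other, software can
    ! then form CPI = Cycle_cnt / Instr_cnt as in EQ 1.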
466. put and CTQ operation queued TABLE 3 7 Snoop Input and CIQ Operation Queued Action from Snoop Pipeline Ee RE CH Operation Queued in CIQ own RTS wait data 1 x RTS Shared 0 0 RTS Shared 0 1 RTS Shared Error own RTS inst wait data x x RTS Shared foreign RTS copyback D 1 copyback x 0 1 copyback Error own RTO no data 1 RTO nodata 0 x 1 RTO nodata error own RTO wait data 1 x 1 RTO data error 0 x RTO data foreign RTO invalidate x invalidate foreign RTO copyback invalidate x 0 1 copyback invalidate Error 0 1 copyback invalidate 1 1 invalidate own RS wait data RS data x x foreign RS copyback discard D nr 1 Error x 1 copyback discard foreign RTSM copyback D 0 1 RTSM copyback Error D 1 RTSM copyback foreign RTSU copyback D nr 1 RTSU copyback Error D 1 RTSU copyback foreign RTOU invalidate D x 1 invalidate foreign RTOU copyback invalidate x 0 1 copyback invalidate Error 0 copyback invalidate own RTSR wait data 1 x RTSR shared 0 x RTSR shared own RTOR wait data x x RTOR data foreign RTOR invalidate x invalidate own RSR x x RS data own WS x x own WS own WB x x own WB own invalidate WS x own invalidate WS invalidate x x invalidate An X in the table represents don t cares A blank entry in the Error out column indicates that there is no corresponding error output Caches Cache Coherency and Diagnostics 36 un microsystems 3 3 3 Transaction Hand
467. …r's Manual.

Textual Usage Fonts

Fonts are used as follows:
• Emphasis is used for exceptions, traps, and errors, as well as book titles.
• Courier is used for all fields in the registers, register names, instructions, and read-only register fields. "The rs1 field contains…" is an example of how this font is used.
• UPPERCASE items are acronyms, instruction names, or writable register fields.
  Note: Names of some instructions contain both upper- and lowercase letters.
• Underbar characters join words in register, register field, exception, and trap names.
  Note: Such words can be split across lines at the underbar without an intervening hyphen. "This is true whenever the integer_condition_code field…" is an example of how the underbar characters are used.

Notational Conventions

The following notational conventions are used:
• Square brackets indicate a numbered register in a register file. For example, r[0] translates to register 0.
• Angle brackets indicate a bit number or colon-separated range of bit numbers within a field. "Bits FSR<29:28> and FSR<12> are…" is an example of how the angle brackets are used.
• Curly braces indicate textual substitution. For example, the string ASI_PRIMARY{_LITTLE} expands to ASI_PRIMARY and ASI_PRIMARY_LITTLE.
• If the bar is used with the curly braces, it represents multiple substitutions. For example, the string ASI_DMMU_TSB_…
468. r Shared state 2 0 111 Reserved T Note Bit 63 44 and bit 23 15 of ASI_L3CACHE_TAG data are treated as don t care for ASI write operations and read as zero for ASI read operations On reads to AST_L3CACHE_TAG regardless of the hw_ecc_gen_en bit the tag and the state of the specified entry given by the index and the way will be read out along with the ECC and the LRU of the corresponding index A common ECC and LRU is shared by all 4 ways of an index T T On writes to ASI_L3CACHE_TAG the tag and the state will always be written into the specified entry LRU will always be written into the specified index For the ECC field if hw_ecc_gen_en is set the ECC field of the data register is don t care If the hw_ecc_gen_en is not set the ECC specified in the data register will be written into the index The hardware generated ECC will be ignored in that case Caches Cache Coherency and Diagnostics 73 amp Sun microsystems Note The L3 cache tag array can be accessed as usual through ASI when ET_off is set in the L3 cache control register ASI 7516 However the returned value is not guaranteed to be correct since the SRAM can be defective and this may be one reason to turn off the L3 cache 3 12 5 1 Notes on L3 cache Tag ECC e The ECC for L3 cache tag is generated using a 128 bit ECC generator based on Hsiao s SEC DED S4ED code ECC 8 0 128bit_ECC_generator 3tag_ecc_din
469. r being snooped logs error information in AFSR_EXT For copyout the processor reading the uncorrectable error from its L3 cache sets its AFSR_EXT L3_CPU bit For writeback the processor reading the uncorrectable error from its L3 cache sets its AFSR_EXT L3_WDU bit If an uncorrectable L3 cache Data ECC error occurs as the result of a writeback or a copyout deliberately bad signalling ECC is sent with the data to the system bus Correct system bus ECC for the uncorrectably corrupted data is computed and transmitted on the system bus and data bits 127 126 are inverted as the corrupt 128 bit word is transmitted on the system bus This signals to other devices that the word is corrupt and should not be used The error information is logged in the AFSR_EXT and an optional disrupting ECC_error trap is generated if the NCEEN bit is set in the Error Enable Register Software should log the writeback error or the copyout error so that a subsequent uncorrectable system bus data ECC error reported by this processor or any other processor can be correlated back to the L3 cache data ECC error For an uncorrectable L3 cache data ECC error as the result of a exclusive request from the store queue or W cache or as the result of an operation placed in the prefetch queue by an explicit software PREFETCH instruction the processor sets AFSR_EXT L3_EDU If the W cache is turned off for some reason the store buffer causes this to happen on every store like and atom
470. r future PEVAR Flig implementation F816 to Hesse Reserved for future Private F916 implementation 1 Read or write only access will cause a data_access_exception trap Address Space Identifiers 285 un microsystems Address Space Identifiers 286 amp Sun microsystems 11 Sun Fireplane Interconnect and Processor Identification This chapter describes the UltraSPARC IV processor system bus in the following sections Chapter Topics e Sun Fireplane Interconnect ASI Extensions on page 287 e RED_state and Reset Values on page 296 e The UltraSPARC IV Processor Identification on page 297 11 1 11 1 1 Sun Fireplane Interconnect ASI Extensions Note When performing an ASI access to shared resources it is important that the other logical processor first be parked If the other logical processor is not parked there is no guarantee that the ASI access will complete in a timely fashion as normal transactions to shared resources from the other logical processor will have higher priority than ASI accesses Sun Fireplane Interconnect ASI extensions include e Sun Fireplane Interconnect Port ID Register e Sun Fireplane Interconnect Configuration Register e Sun Fireplane Interconnect Configuration Register 2 e Sun Fireplane Interconnect Address Register e Cluster Error Status Register ID ASI_CESR_ID Register Note In the following registers Reserved bits are read as 0 s and ignored dur
471. …ransaction
foreign RTSU — Perform an atomic RMW with MTag gS on the SDRAM through the DCDS, and read data is delivered to the Sun Fireplane Interconnect. Read cancellation: no atomic MTag update is scheduled by the memory controller.
foreign RTOU — Perform an atomic RMW with MTag gI on the SDRAM through the DCDS, and read data is delivered to the Sun Fireplane Interconnect.
foreign UGM — Perform a memory write with MTag gI (do-not-care data); perform an atomic RMW with MTag gM to the home memory. WS will be issued from the SSM device.

3.4 Diagnostics Control and Accesses

The diagnostics control and data registers are accessed through Load/Store Alternate (LDXA/STXA) instructions.

Note: Attempts to access these registers while in non-privileged mode cause a privileged_action exception with SFSR.FT = 1 (privilege violation). User accesses can be accomplished through system calls to these facilities. See D-Synchronous Fault Status Register (SFSR) in the UltraSPARC III Cu Processor User's Manual for SFSR details.

A store (STXA, STDFA) to any internal debug or diagnostic register requires a MEMBAR #Sync before another load instruction is executed. Furthermore, the MEMBAR must be executed in or before the delay slot of a delayed control transfer instruction of any type. This requirement is not just to guarantee that the result of the store is seen; it is also imposed because the store may corrupt the load data if the…
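The store-then-fence requirement above can be illustrated with a short fragment; the ASI name used here is a placeholder for any internal debug or diagnostic register, and the surrounding registers are arbitrary.

    stxa    %o1, [%g1] ASI_DIAG_REG    ! placeholder name: any internal debug/diagnostic ASI store
    membar  #Sync                      ! required before the next load, and in or before the
                                       ! delay slot of any delayed control transfer
    ldx     [%o2], %o3                 ! loads are safe again after the MEMBAR #Sync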
472. rated in this case if the CEEN bit is set in the Error Enable Register Hardware will correct the error UE When data are entering the UltraSPARC IV processor from the system bus the data will be checked for the correctness of its ECC For load like block load or atomic operations if a multi bit error is detected it will be recognized as an uncorrectable error and the UE bit will be set to log this error condition Provided that the NCEEN bit is set in the Error Enable Register a deferred instruction_access_error or data_access_error trap will be generated depending on whether the read was to satisfy an instruction fetch or a load operation Multiple occurrences of this error will cause AFSR1 ME to be set DUE When data are entering the UltraSPARC IV processor from the system bus the data will be checked for the correctness of its ECC For prefetch queue and store queue operations if a multi bit error is detected it will be recognized as an uncorrectable error and the DUE bit will be set to log this error condition Provided that the NCEEN bit is set in the Error Enable Register a disrupting ECC_error trap will be generated Multiple occurrences of this error will cause AFSR1 ME to be set EMC When data are entering the UltraSPARC IV processor from the system bus the microtags will be checked for the correctness of ECC If a single bit error is detected the EMC bit will be set to log this error condition A disrupting EC
473. rdware prefetches will be issued in that case Number of P cache lines that were invalidated due to external snoops internal stores and L2 evictions PC_rd PICL SW_pf_instr PICL SW_pf_exec PICU HW_pf_exec PICL Number of cacheable FP loads to P cache The count is only updated for FP load instructions that retired Number of retired software prefetch instructions Note SW_pf_instr SW_pf_exec SW_pf_dropped SW_pf_duplicate Number of retired non trapping software prefetch instructions that completed i e number of retired prefetch instructions that were not dropped due to the prefetch queue being full The count does not include duplicate prefetch instructions for which the prefetch request was not issued because it matched an outstanding prefetch request in the prefetch queue or the request hit the P cache Number of hardware prefetches enqueued in the prefetch queue SW_pf_str_exec PICU Number of retired non trapping strong prefetch instructions that completed The count does not include duplicate strong prefetch instructions for which the prefetch request was not issued because it matched an outstanding prefetch request in the prefetch queue or the request hit the P cache SW_pf_dropped PICU SW_pf_duplicate PICU Number of software prefetch instructions dropped due to TLB miss or due to the prefetch queue being full Number of software prefetch instructions that were dropped
474. re calculated by the hardware and written to the corresponding D TLB entry when replacement occurs as mentioned in D TLB Parity Protection on page 322 Tag Read Register ASI SE 6 VA 63 0 00 6 20FF8 16 Name ASI_DTLB_ TAG READ REG Access R Virtual Address Format Note Bit 2 0 is 0 Data Format The data format of Tag Read Register is described in TABLE 13 25 and TABLE 13 26 TABLE 13 25 Tag Read Register Data Format Description for T16 Bit Field Description R W The 51 bit virtual page number In the fully associative TLB page offset bits for larger VA page sizes are stored in TLB that is VA 15 13 VA 18 13 and VA 21 13 for 64 KB ae 512 KB and 4 MB pages respectively These values are ignored during normal Ge translation 12 0 Context The 13 bit context identifier R TABLE 13 26 Tag Read Register Data Format Description for T512 Bit Field Description R W 63 21 VA VA 63 21 stored in T512 Tag array R 12 0 Context The 13 bit context identifier R Note If any memory access instruction that misses the D TLB is followed by a diagnostic read access LDXA from ASI_DTLB_DATA_ACCESS_REG i e ASI 0x5d from fully associative TLBs and the target TTE has page size set to 64 KB 512 KB or 4 MB the data returned from the TLB will be incorrect Instruction and Data Memory Management Unit 333 amp Sun microsystems
475. re is not an intervening MEMBAR Sync For more predictable behavior it may be desirable to park the other logical processor when performing ASI accesses to a shared resource If the other logical processor is not parked it may perform operations making use of and or modifying the shared resource being accessed The UltraSPARC IV processor does not allow simultaneous ASI write accesses to shared resources by both logical processors The other logical processor will be parked or disabled when performing write accesses to any of the ASIs described in this chapter 3 5 Instruction Cache Diagnostic Accesses Three I cache diagnostic accesses are supported Caches Cache Coherency and Diagnostics 39 un microsystems 3 5 1 Caches Cache Coherency and Diagnostics e Instruction cache instruction fields access e Instruction cache tag valid fields access e Instruction cache snoop tag fields access In the following ASI descriptions means reserved and unused bits These bits are treated as don t care for ASI write operations and read as zero by ASI read operations Instruction Cache Instruction Fields Access ASI 6616 per logical processor al INSTR Name ASI_ICACHI TABLE 3 10 Instruction Cache Instruction Access Address Format Bits Field Description 63 17 Mandatory value Should be 0 This 2 bit field selects a way 4 way associative 16 15 IC_way 2 b00 Way
476. register 315 JEDEC code 297 JMPL instruction 84 268 270 313 331 L L2 cache access Statistics 108 bypassing by instruction fetches 26 control register 58 data diagnostic access 64 data ECC errors 159 201 204 214 HW_corrected 159 SW_correctable 159 uncorrectable 162 data error behavior 222 displacement flush 61 flushing 28 LRU 64 off mode 60 75 tag diagnostic access 61 tag ECC 63 tag ECC errors 204 205 HW_corrected 158 uncorrectable 158 tag error behavior 230 L2 cache counters per core 108 shared 108 L3 Cache Error Enable Register 258 L3 cache access statistics 109 address parity errors 169 bypassing by instruction fetches 26 control register 66 data ECC errors 165 205 209 214 HW corrected 165 uncorrectable 168 data error behavior 225 data ECC diagnostic access 70 description 26 displacement flush 72 error enable register 178 CEEN field 179 FDECC field 179 FDPE field 178 FMD field 179 FMT field 178 FPPE field 178 FSAPE field 178 FTECC field 178 NCEEN field 179 INDEX XXX X amp Sun microsystems un microsystems flushing 28 LRU 75 secondary cache control register 69 supported modes 69 tag diagnostic access 71 tag ECC 74 tag ECC errors 163 209 HW_corrected 164 uncorrectable 164 tag error behavior 237 tag off mode 75 L3 cache counters per core 109 shared 110 L3 cache Error Enable Register NCEEN field 268 LDD instruction 97 LDDFA instruction 281 LDFS
477. rence from one thread in the cache operations of the other thread When operating in split cache mode while each thread can still read all four way sets of the L2 and L3 caches the processor can only write into two of the four way sets in L2 or L3 1 5 Architectural Overview System Interface and Memory Controller Enhancements In addition to its revised and more aggressive cache hierarchy the UltraSPARC IV processor has additional improvements to reduce the average latency of off chip memory and I O transactions amp Sun microsystems 1 5 1 1 5 2 1 5 3 1 5 4 Reduced Latency of Cache to Cache Transfers The latency of cache to cache transfers has been reduced by overlapping some of the copyback latency with the snoop latency This is important in symmetric multiprocessor SMP configurations because the large L2 with dirty entries and L3 caches will cause a significant percentage of requests to be satisfied by modified data currently held in other chips caches Higher Sustainable Bandwidth of Foreign Writes By a more optimal assignment of transaction identification the overall sustainable bandwidth for foreign transactions in general and write streams in particular has been increased Larger Coherent Pending Queue The size of the Coherent Pending Queue CPQ which holds outstanding coherent Sun Fireplane Interconnect transactions has been increased from 24 entries in earlier family members to 32 entries
478. request bus If set performs Sun Fireplane Interconnect transactions in accordance with the Sun 0 SSM Fireplane Interconnect Scalable Shared Memory model Note The Sun Fireplane Interconnect bootbus signals CAStrobe_L ACD_L and Ready_L have their DTL configuration programmable through the UltraSPARC IV processor package pins All other Sun Fireplane Interconnect DTL signals that do not have a programmable configuration are configured as DTL end Several fields of the Sun Fireplane Interconnect Configuration Register 2 do not take effect until after a soft POR is performed If these fields are read before a soft POR then the value last written will be returned However this value may not be the one currently being used by the processor The fields that require a soft POR to take effect are HBM SLOW DTL CLK MT MR AID 4 0 NXE SAPE CPBK_BYP Low power mode clock rate is NOT supported in the UltraSPARC IV processor The field 31 30 are kept for backward compatibility and will be write ignore and read 0 Sun Fireplane Interconnect and Processor Identification 294 un microsystems 11 1 1 5 Fireplane Interconnect Address Register The Fireplane Interconnect Register can be accessed at ASI 4A VA 08 6 TABLE 11 5 describes the Address register TABLE 11 5 Fireplane Address Register Bit Field Description 63 43 Reserved for future implementation Address is the 20 bit physical address of the F
479. response bus error prefetch queue and store queue operations No MAPPED response time out instruction fetch DBERR TO NCEEN NCEEN Private Private No MAPPED response time out load like block load atomic store queue write WS WBIO WIO writeback block store instructions interrupt vector transmit operations No MAPPED response time out prefetch queue and store queue read operations Uncorrectable system bus microtag ECC interrupt vector fetch TO DTO IMU NCEEN NCEE NCEE Private Private Private Correctable system bus microtag ECC interrupt vector fetch IMC CEEN Private Uncorrectable L3 cache data ECC writeback HW_corrected L3 cache data ECC writeback L3_WDU L3_WDC NCEE CEE Private Private Uncorrectable L3 cache data ECC copyback L3_CPU NCEE Shared 194 un microsystems Error Handling TABLE 7 16 Error Reporting Summary 4 of 4 read The notes below provide the detailed explanation of the entries in the TABLE 7 16 E 2 Trap e E gt CS CEMA 2 Error Event AESSR Trap controlled al s s status bit taken Si o ls 3 by E SRKAE m a E lt n HW_corrected L3 cache data ECC L3_CPC c CEEN 2 Shared copyback Uncorrectable L3 cache data ECC instruction fetch critical 32 byte and non cri
480. revision read only field copy of the MR field of the Sun Fireplane Interconnect Configuration Register 11 1 1 1 Sun Fireplane Interconnect Configuration Register The Sun Fireplane Interconnect Configuration register can be accessed at ASI 4Aj VA 0 All fields except bits 26 17 INTR_ID are shared by both logical processors and are identical to the corresponding fields in the Sun Fireplane Interconnect Configuration Register 2 When any field except bits 26 17 INTR_ID is updated the corresponding field in the Sun Fireplane Interconnect Configuration Register 2 will be updated as well The bits 26 17 NTR_ID are logical processor specific See TABLE 11 2 for the state of this register after reset The fields in the register are described in this table as well TABLE 11 2 FIREPLANE_CONFIG Register Format 1 of 3 Bits Field Description 63 CPBK_BYP Copyback Bypass Enable If set it enables the copyback bypass mode 62 SAPE SDRAM Address Parity Enable If set it enables the detection of SDRAM address parity New SSM Transactions Enable If set it enables the 3 new SSM transactions RTSU ReadToShare and update MTag from gM to gS RTOU ReadToOwn and update MTag to gl UGM Update MTag to gM 61 60 59 DTL_6 DTL_5 DTL_4 DTL_3 DTL_2 DTL_1 1 0 DTL Dynamic Termination 58 57 Logic termination mode 56 55 0016 Reserved 0116 DTL end Termination pullup 54 53 0216 DTL mid
481. rimary AFSR and the secondary AFSR These registers have the same bit specifications throughout but have different mechanisms for writing and clearing The new primary Asynchronous Fault Status Extension register AFSR_EXT is added to log the L3 cache tag or L3 cache data ECC errors The secondary AFSR2_EXT is added to extend the AFSR_2 register The Asynchronous Fault Status Extension Register AFSR_EXT is presented as two separate registers the primary AFSR_EXT and the secondary AFSR_EXT These registers have the same bit specifications throughout but have different mechanisms for writing and clearing Note AFSR AFSR_EXT AFSR2 AFSR2_EXT AFAR AFAR2 are private ASI registers Primary AFSR and AFSR_EXT The primary AFSR accumulates all errors from system bus and L2 cache that have occurred since its fields were last cleared An extension of AFSR register AFSR_EXT accumulates all errors from L3 cache that have occurred since its fields were last cleared The AFSR and AFSR_EXT are updated according to the policy described in Table 7 16 Error Reporting Summary on page 192 The primary AFSR is represented by the label AFSR1 in this document where it is necessary to distinguish the registers A reference to AFSR should be taken to mean either primary or secondary AFSR The primary AFSR_EXT is represented by the label AFSR1_EXT in this document where it is necessary to distinguish the registers A reference to AFSR_EXT should
482. ription RW 63 14 Reserved Reserved for future implementation R e Fuse error for I cache D Cache DTLBs I TLB SRAM redundancy E RED ERR or shared L2 cache L3 cache tag L2 cache data SRAM redundancy RWIS 12 EFA_PAR_ERR e Fuse parity error RWIC d L3_MECC Both 16 byte data of L3 cache data access have ECC error either RWIC correctable or uncorrectable ECC error 10 L3_THCE Single bit ECC error on L3 cache tag access RWIC Error Handling 185 un microsystems 7 3 4 7 3 4 1 Error Handling TABLE 7 8 Asynchronous Fault Status Extension Register 2 of 2 Bits Field Description RW 9 L3_TUE_SH Multiple bit ECC error on L3 cache tag access due to copyback or RWIC tag update from foreign Sun Fireplane device snoop request 8 L3_TUE Multiple bit ECC error on L3 cache tag access due to private tag RWIC access 7 L3_EDC Single bit ECC error on L3 cache data access for P cache and W RWIC cache request 6 L3_ EDU Multiple bit ECC error on L3 cache data access for P cache and W RWIC cache request 5 L3_UCC Single bit ECC error on L3 cache data access for I cache and D RWIC cache request 4 L3_UCU Multiple bit ECC error on L3 cache data access for I cache and D RWIC cache request 3 L3_CPC Single bit ECC error on L3 cache data access for copyout RWIC 2 L3_CPU Multiple bit ECC error on L3 cache data access for copyout RWIC 1 L3_WDC Single bit ECC error on L3 cache data access for writeback RWI
483. rms a system bus read access DSTAT 2 or 3 status may be returned The processor treats both Sun Fireplane Interconnect termination code DSTAT 2 time out error and DSTAT 3 bus error as the same event For a bus error due to a instruction fetch load like block load or atomic operation the BERR bit will be set to log this error condition Provided that the NCEEN bit is set in the Error Enable Register a deferred instruction_access_error or data_access_error trap will be generated depending on whether the read was to satisfy an instruction fetch or a load operation Multiple occurrences of this error will cause AFSR1 ME to be set DBERR When the UltraSPARC IV processor performs a system bus read access DSTAT 2 or 3 status may be returned The processor treats both Sun Fireplane Interconnect termination code DSTAT 2 time out error and DSTAT 3 bus error as the same event For a bus error due to a system bus read from memory or I O caused by prefetch queue or a system bus read from memory caused by read to own store queue operation the DBERR bit will be set to log this error condition Provided that the NCEEN bit is set in the Error Enable Register a deferred data_access_error trap will be generated Multiple occurrences of this error will cause AFSR1 ME to be set TO When the UltraSPARC IV processor performs a system bus read or write access it is possible that no device responds with a MAPPE
484. rom foreign bus transaction and snoop request if an uncorrectable tag error is discovered the processor sets AFSR TUE_SH The processor asserts its ERROR output pin and it is expected that the coherence domain has suffered a fatal error and must be restarted L3 cache Data ECC Errors L3_UCC When an instruction fetch misses the I cache or a load instruction misses the D cache or an atomic operation is performed and it hits the L3 cache the line is moved from the L3 cache to L2 cache and will be checked for the correctness of its ECC If a single bit error is detected in critical 32 byte data for load and atomic operations or in either critical or non critical 32 byte data for I cache fetch the L3_UCC bit will be set to log this error condition This is a SW_correctable error A precise fast_ECC_error trap will be generated provided that the UCEEN bit of the Error Enable Register is set For correctness a software initiated flush of the D cache is required because the faulty word will already have been loaded into the D cache and will be used if the trap routine retries the faulting instruction L3 cache errors are not loaded into the I cache or P cache so there is no need to flush them Note that when the line moved from the L3 cache to the L2 cache raw data read from the L3 cache without correction is stored in the L2 cache Since the L2 cache and the L3 cache are mutually exclusive once the line is read from the L3 cache to t
485. rong prefetches are similar to regular software prefetches but will succeed under a wider range of conditions First if a strong prefetch has a translation lookaside buffer TLB miss instead of simply dropping the prefetch the processor will take a trap to fill the TLB and then re issue the prefetch Second if the prefetch queue is full instead of potentially dropping the prefetch the processor will wait until one of the outstanding prefetches completes and then place the prefetch in queue Strong prefetch allows software to use prefetch for critically needed items with a high degree of confidence that the item requested will in fact be loaded in advance of use Architectural Overview 4 amp SUN microsystems 1 3 5 Memory Management Units MMU While the general MMU organization remains the same in the UltraSPARC IV processor cores as in the cores of previous family members both the number of TLB entries in the instruction MMU and the page sizes supported by the MMUs have been increased Instruction MMU The instruction MMU of the UltraSPARC IV processor core incorporates two TLBs a small fully associative 16 entry TLB and a large 2 way set associative TLB In earlier family members the large I TLB had a total of 128 entries In the UltraSPARC IV processor this total has been increased to 512 entries The UltraSPARC IV processor can work with either a base page size of 8 KB or 64 KB Data MMU The data MMU in the UltraSP
486. rovide a thorough upgrade of the initial UltraSPARC IV processor dual core design 1 3 Enhanced Core Design Compared with its predecessors the UltraSPARC IV processor s core has been optimized in a number of important ways The enhancements include improvements in the following processor resources e Instruction fetch e Execution units e Write cache e Data prefetching e Memory management units 1 Defined in Chapter 2 Architectural Overview 2 amp SUN microsystems 1 3 1 Instruction Fetch The UltraSPARC IV processor s instruction cache I cache has been doubled in capacity from 32 KB to 64 KB and its line size also has been doubled from 32 bytes to 64 bytes The larger capacity significantly improves the hit rate for programs whose instruction stream exhibits good temporal locality while the longer line length helps programs whose instruction stream exhibits good spatial locality The expanded I cache is augmented with a much more aggressive instruction prefetch mechanism based on an 8 entry prefetch buffer where each entry is a full 64 byte line The prefetch buffer is accessed in parallel with the I cache and in the case of a prefetch buffer hit on an I cache miss the prefetched line is filled into the instruction cache A prefetch of the next sequential line address N 64 into the prefetch buffer is triggered by one of two conditions a request for address N that either misses altogether in the L1 I cach
487. …rrect tag ECC data. It will not have any issues if the two accesses are to different indexes. To avoid the problem, the following procedure should be followed when software uses an ASI write to the L2-cache tag to inject an L2-cache tag error.

3.11.2.2 Procedure For Writing ASI_L2CACHE_TAG

1. Park the other logical processor.
2. Wait for the parking logical processor to be parked.
3. Turn off kernel pre-emption.
4. Block interrupts on this processor.
5. Displacement flush all 4 ways in the L2 cache for the index to be error-injected.
6. Load some data into the L2 cache.
7. Locate the data in the L2 cache and the associated tag.
8. Read the L2-cache tag ECC using the ASI L2-cache tag read access.
9. Corrupt the tag ECC.
10. Store the tag ECC back using the ASI L2-cache tag write access.
11. Re-enable interrupts.
12. Unpark the other logical processor.

The reason to displacement flush all 4 ways is to guarantee that a foreign snoop will have no effect on the index during the ASI L2-cache tag write access, even if the hazard window exists in the hardware.

3.11.2.3 Notes on L2-cache LRU Bits

The LRU entry for a given L2-cache index is based on the following algorithm. For each 4-way set of L2 data blocks, a 3-bit structure is used to identify the least recently used way. Bit 2 tracks which one of the two 2-way sets (way3/2 is one set, way1/0 is the other) is the least recently…
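Steps 8 through 10 of the procedure above reduce to a read-modify-write through the L2-cache tag diagnostic ASI. The fragment below is only a sketch: ASI_L2CACHE_TAG is assumed to be defined to the value given in the ASI assignments table, %g1 is assumed to already encode the target index and way per the ASI_L2CACHE_TAG address format, and the bit chosen for corruption is arbitrary.

    ldxa    [%g1] ASI_L2CACHE_TAG, %o0     ! step 8: read the tag and its ECC
    xor     %o0, 1, %o0                    ! step 9: flip one (assumed) tag-ECC bit
    stxa    %o0, [%g1] ASI_L2CACHE_TAG     ! step 10: write the corrupted tag ECC back
    membar  #Sync                          ! fence the diagnostic store before further loads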
488. rrected 170 Dstat 2 3 errors 172 ECC errors 210 211 213 error behavior 241 hardware time outs 173 MTag ECC errors uncorrectable 171 status errors 212 213 213 unmapped errors 172 system bus clock ratio 292 system interface statistics 115 T Tag Access Register 312 330 331 TBA register 86 Thread 10 TICK register state after reset 86 TICK_COMPARE register 87 TL register 86 TLB CAM Diagnostic Register 336 Data Access register 332 Data In register 312 330 DTLB state after reset 89 hardware 337 ITLB state after reset 89 miss handler 312 329 330 miss processing 301 319 xlvi UltraSPARC IV Processor User s Manual October 2005 missing entry 312 329 TNPC register 86 TPC register 86 Translation Storage Buffer TSB 334 334 Table Entry TTE 310 328 trap CEEN Correctable_Error 258 corrected_ECC_error 258 fp_disabled 271 instruction_access_error 268 level TL MAXTL 83 TL MAXTL 1 83 trap globals 268 trap handler ECC errors 259 user 135 TSB 94 Extension Registers TSB_Hash field 316 334 miss handler 312 330 SB_Size field 330 shared 312 330 split 312 330 Tag Target Register 313 330 TSTATE register initializing 85 PEF field 271 state after reset 86 TT register 86 TTE configuration 310 328 CP cacheability field 311 318 329 337 CV cacheability field 311 318 329 337 entry locking in TSB 311 329 L lock field 311 318 329 337 PA physical page number field 311 318 329 337 SPARC
489. rred trap data_access_error or instruction_access_error is pending then the processor begins execution of either the data_access_error or instruction_access_error trap code If both data_access_error and instruction_access_error traps are pending the instruction_access_error trap will be taken because it is higher priority Taking a data_access_error trap clears the pending status for data access errors Taking an instruction_access_error trap clears the pending status for instruction access errors and has no effect on the pending status of data access errors Because of priorities there cannot be an instruction_access_error trap pending at the time a data_access_error trap is taken Taking any trap makes all precise traps no longer pending The description above implies that a pipe has to have a valid instruction to initiate trap handling but once trap handling is initiated any of the pending traps can be taken not just ones to which that pipe is sensitive So if the processor is executing AO pipe instructions and a data_access_error is pending but cannot be taken an interrupt_vector can arrive and enable the data_access_error trap to be executed even though only AO pipe instructions are present If a data_access_error trap becomes pending but cannot be taken because neither the BR or MS pipe has a valid instruction the processor continues to fetch and execute instructions If an instruction_access_error trap then becomes pending the offen
490. s In TABLE 3 46 bit 63 represents the ecc_error bit and the rest of the data are don t cares TABLE 3 46 Write Cache Diagnostic Data Access Data Format The data format for W cache diagnostic data access when WC_ecc_error 1 The 63 ecc_error ecc_error bit if set indicates that an ECC error occurred when the corresponding W cache line was loaded from the L2 cache 62 0 Reserved for future implementation Note A MEMBAR Sync is required before and after a load or store to ASIT_WCACHE_DATA T 3 9 3 Write Cache Diagnostic Tag Register Access ASI 3Aj6 per logical processor Name ASI_WCACH GI _ TAG The address format for W cache diagnostic tag register access is shown in TABLE 3 47 TABLE 3 47 Write Cache Tag Register Access Address Format Bits Field Description 63 11 Reserved Reserved for future implementation 10 6 WC_entry A 5 bit index VA 10 6 that selects a W cache entry 5 0 Reserved Reserved for future implementation The data format for W cache diagnostic tag register access is shown in TABLE 3 48 TABLE 3 48 Write Cache Tag Register Access Data Format Bits Field Description Must Be zero 63 37 Undefined Note Writing a nonzero value to this field may generate an undefined result Software should not rely on any specific behavior 36 0 WC_physical_tag A 37 bit physical tag PA 42 6 of the associated data
491. …s Register, which holds the virtual address and context of the load or store responsible for the MMU exception. See Translation Table Entry (TTE) on page 328.

Note: There are no separate physical registers in hardware for the pointer registers; rather, they are implemented through a dynamic reordering of the data stored in the Tag Access and the TSB registers.

The hardware provides pointers for the most common cases of either 8 KB or 64 KB page miss processing. These pointers give the virtual addresses where the 8 KB and 64 KB TTEs are stored if either is present in the TSB. The TSB_Size field (n) of the TSB register ranges from 0 to 7. Note that TSB_Size refers to the size of each TSB when the TSB is split. The symbol :: designates concatenation of bit vectors, and ⊕ indicates an exclusive-or operation.

For a shared TSB (TSB register split field = 0):

    8K_PTR  = (TSB_Base<63:13+n> ⊕ TSB_Extension<63:13+n>) :: VA<21+n:13> :: 0000
    64K_PTR = (TSB_Base<63:13+n> ⊕ TSB_Extension<63:13+n>) :: VA<24+n:16> :: 0000

For a split TSB (TSB register split field = 1):

    8K_PTR  = (TSB_Base<63:14+n> ⊕ TSB_Extension<63:14+n>) :: 0 :: VA<21+n:13> :: 0000
    64K_PTR = (TSB_Base<63:14+n> ⊕ TSB_Extension<63:14+n>) :: 1 :: VA<24+n:16> :: 0000

The TSB Tag Target is formed by aligning the missing access VA from the Tag Access Register and the current context to p…
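To make the use of these pointers concrete, a heavily simplified D-TLB miss refill fragment is sketched below. The symbolic ASI names, the quad-load of the TTE, the tsb_miss label, and the register choices are placeholders rather than a definitive sequence; the actual ASI numbers and virtual addresses for the TSB 8K pointer, Tag Target, and TLB Data-In registers are listed in the MMU register descriptions, and a real handler must also cover the 64 KB case and the TSB-miss path.

    ! Sketch of an 8 KB-page D-TLB miss refill using the hardware-formed pointer.
    ldxa    [%g0] ASI_DMMU_TSB_8KB_PTR, %g1   ! 8K_PTR: VA of the candidate TTE in the TSB
    ldxa    [%g0] ASI_DMMU_TAG_TARGET, %g2    ! TSB Tag Target for the missing access
    ldda    [%g1] ASI_NUCLEUS_QUAD_LDD, %g4   ! load the TTE tag (%g4) and data (%g5) atomically
    cmp     %g2, %g4                          ! does the stored tag match the Tag Target?
    bne,pn  %xcc, tsb_miss                    ! no: go to the slower TSB miss handler
      nop
    stxa    %g5, [%g0] ASI_DTLB_DATA_IN       ! yes: atomic write of the TTE into the D-TLB
    retry                                     ! re-execute the load or store that missed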
492. s already been detected Copyout hits in the L2 writeback buffer Original data original ecc because the line is being victimized where a WDU Disrupting trap UE has already been detected c 7 ER CBA EE Corrected data corrected opyout encountering in the 1st 32 byte or ace 8 2nd 32 byte L2 cache data GEG Disrupting trap SIU flips the most significant 2 bits of data Copyout encountering UE in the 1st 32 byte or i g 2nd 32 byte L2 cache data CPU D 127 126 of the Disrupting trap corresponding upper or lower 16 byte data 224 un microsystems 7 11 Behavior on L3 cache DATA Error TABLE 7 20 L3 cache Data CE and UE errors 1 of 4 the non critical 32 byte L3 cache data Error Handling moved from L3 cache to L2 cache D cache taken Error fast ecc Pipeline Event logged in error L2 cache data L1 cache data Gs ES Comment AFSR trap me i I cache fill request with CE in the critical Original data original ecc Bad data not Bad data 32 byte L3 cache data and the data do get L3_UCC Yes Precise trap moved from L3 in I cache dropped used later cache to L2 cache Sen e I cache fill request with UE in the critical Original data original ecc Bad data not Bad data F 32 byte L3 cache data and the data do get L3_UCU Yes Precise trap moved from L3 in I cache dropped used later cache to L2 cache I cache fill request with CE in the Non Original data critical 32 byte the
493. s not set if the copyout hits in the writeback buffer Instead the WDU bit RW1C is set 184 un microsystems TABLE 7 7 Asynchronous Fault Status Register 2 of 2 Bits Field Description RW 38 WDC HW_corrected L2 cache data ECC error for writeback RWIC 37 WDU Uncorrectable L2 cache data ECC error for writeback RWIC 36 EDC ee L2 cache data ECC error for store or block load or prefetch queue RWIC 35 EDU SE L2 cache data ECC error for store or block load or prefetch queue RWIC 34 UE ae oe read of memory or I O for instruction fetch RWIC 33 CE Correctable system bus data ECC error for any read of memory or I O RWIC 20 32 Reserved Reserved for future implementation R 19 16 M_SYND System bus microtag ECC syndrome R 15 9 Reserved Reserved for future implementation R 8 0 E_SYND System bus or L2 cache or L3 cache data ECC syndrome R Note For system bus read access error due to prefetch queue or store queue read operation a DUE DTO or DBERR is set instead of UE TO or BERR respectively TABLE 7 7 describes AFSR1 AFSR2 is identical except that all bits are read only and AFSR2 ME is always 0 AFSR1_EXT ASI ACie VA 63 0 0x10 private Name ASI _ASYNC_FAULT_STATUS_EXT AFSR2_EXT ASI 4Cj VA 63 0 0x18 private Name AST_ASYNC_FAULT_STATUS_EXT TABLE 7 8 Asynchronous Fault Status Extension Register J of 2 Bits Field Desc
494. s shown in TABLE 3 34 TABLE 3 34 Data Cache Tag Valid Access Address Format Bits Field Description 63 16 Mandatory value Should be 0 A 2 bit index that selects an associative way 4 way associative 15 14 DC_way 2 b00 Way 0 2 b01 Way 2 b10 Way 2 2 b11 Way3 13 5 DC addr A 9 bit index that selects a tag valid field 512 e tags 4 0 Mandatory value Should be 0 Caches Cache Coherency and Diagnostics 49 un microsystems 3 8 3 TABLE 3 35 Data Cache Tag Valid Access Data Format Bits Field Description 63 31 Mandatory value Should be 0 30 DC_tag_parity The 1 bit odd parity bit of DC_tag The 29 bit physical tag PA 41 13 of the 23 1 DC_tag associated data 0 DC_valid The 1 bit valid field Note A MEMBAR Sync is required before and after a load or store to AST_DCACHE_TAG During ASI writes DC_tag_parity is not generated from data bits 29 1 but data bit 30 is written as the parity bit This will allow testing of D cache tag parity error trap Data Cache Microtag Fields Access ASI 43 16 per logical processor T Name ASI_DCACHE_UTAG The address format for the D cache microtag access is shown in TABLE 3 36 TABLE 3 36 Data Cache Microtag Access Address Format Bits Field Description 63 16 Mandatory value Should be 0 A 2 bit index that selects an associative way 4 way associative 15 14 DC_way 2
495. …s that are associated with each logical processor. A single STXA (ASI 0x40) will cause the store data value to be written to all on-chip SRAM entries within one logical processor:
• I-cache: Inst, Predecode, Physical Tag, Snoop Tag, Valid, Predict, Microtag arrays
• IPB: Inst, Predecode, Tag, Valid
• BP: Branch Predictor Array, Branch Target Buffer
• D-cache: Data, Physical Tag, Snoop Tag, Microtag arrays
• P-cache: Data, Status Data, Tag Valid, Snoop Tag arrays
• W-cache: Data, Status Data, Tag Valid, Snoop Tag arrays
• D-TLB: t16, t512_0, t512_1 arrays
• I-TLB: it16, it512 arrays

Usage:

    stxa    %g0, [%g0] ASI_SRAM_FAST_INIT

The 64-bit store data must be zero. Initializing the SRAMs to a non-zero value could have unwanted side effects. This STXA instruction must be surrounded (preceded and followed) by MEMBAR #Sync to guarantee that:
• All Sun Fireplane Interconnect transactions have completed before the SRAM initialization begins
• SRAM initialization fully completes before proceeding

Note: During the SRAM initialization, caches and TLBs are considered unusable and incoherent. The ASI_SRAM_FAST_INIT instruction should be located in non-cacheable address space.

Code Sequence for ASI_SRAM_FAST_INIT

In the UltraSPARC IV processor, ASI_SRAM_FAST_INIT copies the content of the ASI_IMMU_TAG_ACCESS register to the I-MMU tag, assuming this register contains functional content. However, if the ASI_SRAM_FAST_INIT operation wer…
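The surrounding sequence described above is short; assuming ASI_SRAM_FAST_INIT corresponds to ASI 0x40 as stated, and that this code itself resides in non-cacheable space, it reduces to:

    membar  #Sync                  ! all prior Sun Fireplane Interconnect transactions complete
    stxa    %g0, [%g0] 0x40        ! ASI_SRAM_FAST_INIT: write zero to every per-core SRAM entry
    membar  #Sync                  ! SRAM initialization completes before proceeding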
496. …same for line fill and store update. Fill data size: data<255:0>; store data size: data<63:0>.

To test the hardware or software for D-cache parity error recovery, programs can cause data to be loaded into the D-cache by reading it, then use the D-cache diagnostic accesses to modify the tags (using ASI_DCACHE_TAG, ASI 0x47) or data (using ASI_DCACHE_DATA, ASI 0x46). Alternatively, the data can be synthesized completely using diagnostic accesses. Upon executing an instruction to read the modified line into a register, a trap should be generated. If no trap is generated, the program should check the D-cache using diagnostic accesses to see whether the entry has been repaired; this would be a sign that the broken entry had been displaced from the D-cache before it had been re-accessed. Iterating this test can check that each covered bit of the D-cache physical tags and data is actually connected to its parity generator.

Testing the D-cache snoop tag error recovery hardware is more extensive. First, load multiple lines of data that map to the same D-cache tag index, so all the ways of the D-cache are filled with valid data. Then insert a parity error in one of the D-cache snoop tag entries for this index using the ASI_DCACHE_SNOOP_TAG (ASI 0x44) diagnostic access. Then have another processor perform a write to an address which maps to the same D-cache tag index but does not match any of the entries in the D-cache. Diagnostic accesses to the D-ca…
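A minimal sketch of the data-array variant of this test follows. The encoding of the diagnostic address in %g1 (way and word selection) and the bit being flipped are assumptions; the ASI_DCACHE_DATA address and data formats elsewhere in this chapter define the real layout, and the Data Cache Parity Error Enable bit in the DCR must be set for the trap to be taken.

    ldx     [%o0], %g0             ! bring the test line into the D-cache
    membar  #Sync
    ldxa    [%g1] 0x46, %o2        ! ASI_DCACHE_DATA: read one cached word (assumed encoding in %g1)
    xor     %o2, 1, %o2            ! flip one data bit
    stxa    %o2, [%g1] 0x46        ! write it back; parity is not regenerated on ASI writes
    membar  #Sync
    ldx     [%o0], %o3             ! re-read through the normal path; expect a dcache_parity_error trap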
497. se the history register when indexing the Branch Prediction table using PC based indexing instead DCR Branch Predictor Mode 14 13 14 13 BPM 00 One history register gshare mode US III mode Two history registers both gshare mode PC based indexing Two history registers normal uses gshare privileged uses PC based 12 DPES Data Cache Parity Error Enable If cleared no parity checking is done at the Data Cache SRAM arrays data physical tag and snoop tag arrays 11 6 OBS Observability Bus Bits 11 6 can be programmed to select the set of signals to be observed at i obsdata 9 0 See TABLE 9 2 for bit settings 5 BPE See the UltraSPARC III Cu Processor User s Manual for a description of the BPE 4 RPE See the UltraSPARC III Cu Processor User s Manual for a description of the RPE 3 SI Single Issue Disable See the UltraSPARC III Cu Processor User s Manual and has no additional side effects 2 IPE Instruction Cache Parity Error Enable If cleared no parity checking will be done at the Instruction Cache and IPB SRAM arrays data physical tag snoop tag and IPB arrays oi IFPOE Interrupt Floating Point Operation Enable This bit enables system software to take interrupts on FP instructions 0 MS Multiscalar Dispatch Enable See the UltraSPARC III Cu Processor User s Manual for a description of the MS bit 1 Implies no dcache_parity_error trap TT 0x071 will ever be generated However parity bits are st
498. … 241
8. Exceptions, Traps, and Trap Types ................................................ 253
    8.1  ........................................................................... 253
    8.1.1  Precise Traps ........................................................... 253
    8.1.2  Deferred Traps .......................................................... 254
    8.1.3  Disrupting Traps ........................................................ 257
    8.1.4  Multiple Traps .......................................................... 258
    8.2  Exceptions Specific to the UltraSPARC IV Processor ........................ 259
    8.3  Trap Priority ............................................................. 260
    8.3.1  Precise Trap Priority ................................................... 260
    8.4  I-cache Parity Error Trap ................................................. 260
    8.4.1  Hardware Action on Trap for I-cache Data Parity Error ................... 260
    8.4.2  Hardware Action on Trap for I-cache Physical Tag Parity Error ........... 261
    8.4.3  Hardware Action on Trap for I-cache Snoop Tag Parity Error .............. 262
    8.5  D-cache Parity Error Trap ................................................. 262
    8.5.1  Hardware Action on Trap for D-cache Data Parity Error ................... 262
    8.5.2  Hardware Action on Trap for D-cache Physical Tag Parity Error ........... 263
    8.5.3  Hardware Action on Trap for D-cache Snoop Tag Parity Error .............. 264
    8.6  P-cache Parity Error Trap ................................................. 264
    8.6.1  Hardware Action on Trap for P-cache Data Parity…
499. … 48
    3.8.1  Data Cache Data Fields Access ........................................... 48
    3.8.2  Data Cache Tag Valid Fields Access ...................................... 49
    3.8.3  Data Cache Microtag Fields Access ....................................... 50
    3.8.4  Data Cache Snoop Tag Access ............................................. 51
    3.8.5  Data Cache Invalidate ................................................... 51
    3.9  Write Cache Diagnostic Accesses ........................................... 52
    3.9.1  Write Cache Diagnostic State Register Access ............................ 52
    3.9.2  Write Cache Diagnostic Data Register Access ............................. 53
    3.9.3  Write Cache Diagnostic Tag Register Access .............................. 54
    3.10  Prefetch Cache Diagnostic Accesses ....................................... 54
    3.10.1  Prefetch Cache Status Data Register Access ............................. 55
    3.10.2  Prefetch Cache Diagnostic Data Register Access ......................... 56
    3.10.3  Prefetch Cache Virtual Tag Valid Fields Access ......................... 56
    3.10.4  Prefetch Cache Snoop Tag Register Access …
500. selects an associative way 4 way associative 15 14 DC_way 2 b00 Way 0 2 b01 Way 2 b10 Way 2 2 b11 Way3 13 3 DC_addr An 11 bit index that selects a 64 bit data field 2 0 Mandatory value Should be 0 The address format for D cache data access is shown in TABLE 3 31 The data formats for D cache data access when DC_data_parity 0 and are shown in TABLE 3 31 and TABLE 3 32 respectively DC_parity is 8 bit data parity odd parity TABLE 3 31 Data Cache Data Access Data Format Bits Field Description 63 0 DC_data DC_data is 64 bit data Caches Cache Coherency and Diagnostics 48 un microsystems 3 8 2 TABLE 3 32 Data Cache Data Access Data Format When DC_data_parity 1 63 8 Should be 0 7 0 DC_parity DC_parity is 8 bit data parity odd parity T A MEMBAR Sync is required before and after a load or store to ASI_DCACHE_DATA TABLE 3 33 Data parity bits Data Bits Parity Bit 63 56 7 55 48 6 47 40 5 39 32 4 31 24 3 23 16 2 15 8 1 7 0 0 The TABLE 3 33 shows the data parity bits and the corresponding data bits that are parity protected Note During ASI writes to DC data parity bits are not generated from the data Data Cache Tag Valid Fields Access ASI 4716 per logical processor Name ASI_DCACHI GI _ TAG The address format for the D cache Tag Valid fields i
501. …sequence for handling unexpected deferred errors within the trap handler:

1. Log the error(s).
2. Reset the error logging bits in AFSR1.
3. Perform a MEMBAR #Sync to complete internal ASI stores.
4. Panic if AFSR1.PRIV is set and not performing an intentional peek/poke; otherwise, try to continue.
5. Invalidate the D-cache by writing each line of the D-cache with ASI_DCACHE_TAG. This may not be required for instruction_access_error events, but it is the simplest way to invalidate the D-cache in all cases.
6. Abort the current process.
7. For user-process UE errors in a conventional UNIX system, once all processes using the physical page in error have been signaled and terminated, as part of the normal page recycling mechanism clear the UE from main memory by writing the page-zero routine to use block store instructions. The trap handler does not usually have to clear out a UE in main memory.
8. Resume execution.

8.1.3 Disrupting Traps

Disrupting traps, like deferred traps, may cause program-visible state change. However, disrupting traps are similar to precise traps in the following ways:

1. The PC saved in TPC[TL] points to a valid instruction which will be executed by the program, and the nPC saved in TNPC[TL] points to the instruction that will be executed after that one.
2. All instructions issued before the one pointed to by the TPC have completed execution.
3. Any instructions issued after the one pointed to by the TPC remain…
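Step 5 is commonly implemented as a loop of diagnostic stores through ASI_DCACHE_TAG (ASI 0x47). The loop below is only a sketch: it assumes the tag access address format given in the cache diagnostics chapter (way select in VA<15:14>, tag index in VA<13:5>), so the stride and limit should be checked against that table before use.

    set     0x10000, %g2           ! assumed: 4 ways x 512 tags x 0x20 bytes of VA space
    clr     %g1
1:  stxa    %g0, [%g1] 0x47        ! ASI_DCACHE_TAG: clear the tag/valid entry selected by %g1
    add     %g1, 0x20, %g1         ! next tag (VA<4:0> must stay zero)
    cmp     %g1, %g2
    bne,pt  %xcc, 1b
     nop
    membar  #Sync                  ! complete the diagnostic stores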
502. …set to 0x5a might cause this issue.

ASI set to 0x5a:

    init_mondo+0x24:   stxa    %o1, [%g0 + 0x50] %asi
    init_mondo+0x28:   stxa    %o2, [%g0 + 0x60] %asi
    init_mondo+0x2c:   jmpl    %o7 + 0x8, %g0
    init_mondo+0x30:   membar  #Sync

In this particular case, a vector interrupt trap was taken on the JMP instruction. The interrupt trap handler executed an LDXA to the Interrupt Dispatch Register (ASI_INTR_DISPATCH_W, 0x77), which returned indeterminate data, as the second STXA was still in progress (in this case, the data returned was the data written by the first STXA). The reason the above case failed is that the JMP instruction took the interrupt before the MEMBAR #Sync semantics was invoked, thus leaving the interrupt trap handler unprotected. Besides interrupts, the JMP in this code sequence is also susceptible to the following traps:

• The trap mem_address_not_aligned
• The trap illegal_instruction
• Instruction breakpoint debug feature (which manifests itself as an illegal instruction, but is currently unsupported)
• I-MMU miss
• The trap instruction_access_exception
• The trap instruction_access_error
• The trap fast_ECC_error

The specific problem observed (the JMP instruction taking an interrupt before the MEMBAR #Sync) can be avoided by using the following code sequence:

    init_mondo+0x24:   stxa    %o1, [%g0 + 0x50] %asi
    init_mondo+0x28:   stxa    %o2, [%g0 + 0x60] %asi
    init_mondo+0x2c:   membar  #Sync
    init_mondo+0x30:   jmpl    %o7 + 0x8, %g0
503. shown in TABLE 11 9 TABLE 11 9 Program Version Register Bits Bit Fuse Description 63 8 Mandatory value 7 0 Program version Sun Fireplane Interconnect and Processor Identification 298 amp Sun microsystems 12 Interrupt Handling For general information please refer to the UltraSPARC III Cu Processor User s Manual This chapter describes the UltraSPARC IV processor implementation specific information about interrupt handling in the following sections Chapter Topics e Interrupt ASI Registers on page 299 e CMT Related Interrupt Behavior on page 300 121 12 11 KN 2 13 12 1 4 Interrupt ASI Registers Interrupt Vector Dispatch Register The UltraSPARC IV processor interprets all 10 bits of VA 38 29 when the Interrupt Vector Dispatch Register is written Interrupt Vector Dispatch Status Register In the UltraSPARC IV processor 32 BUSY NACK pairs are implemented in the Interrupt Vector Dispatch Status Register Interrupt Vector Receive Register The UltraSPARC IV processor sets all 10 physical module ID MID bits in the SID_U and SID_L fields of the Interrupt Vector Receive Register UltraSPARC IV processor obtains SID_U from VA 38 34 of the interrupt source and SID_L from VA 33 29 of the interrupt source Logical Processor Interrupt ID Register ASI 0x63 VA 63 0 0x00 Name ASI_INTR_ID Read Write Privileged access per logical processor register P
504. software must create the appropriate states itself 2 CLE is Current Little Endian 3 TLE is Trap Little Endian 4 E is side effect bit 5 NC is Non cacheable bit Reset and RED_ state 90 amp Sun microsystems Performance Instrumentation and Optimization This chapter addresses the following sections Chapter Topics Introduction to Optimization on page 91 e Instruction Stream Issues on page 91 e Data Stream Issues on page 96 e Performance Instrumentation on page 98 e Performance Control Register PCR on page 98 e Performance Instrumentation Counter PIC Register on page 99 e Performance Instrumentation Operation on page 101 e Pipeline Counters on page 102 e Cache Access Counters on page 106 e Memory Controller Counters on page 111 e Data Locality Counters for Scalable Shared Memory Systems on page 111 e Miscellaneous Counters on page 115 e PCR SL and PCR SU Encoding on page 116 5 1 Introduction to Optimization Recompiling legacy code using a specifically designed compiler and setting the correct compile flags can significantly increase performance There are several performance features provided by newer UltraSPARC processors that can only be taken advantage of by using modern compiler technology Optimization is aimed at increasing the supply of as many valid instructions as possible to the grouping logic and eventually to the functional units ALUs FGUs branch units load store pipes just to name a few One very i
505. sor State Transition and the Generated Transaction 30
3.3.2 Snoop Output and Input 33
3.3.3 Transaction Handling 37
3.4 Diagnostics Control and Accesses 39
3.5 Instruction Cache Diagnostic Accesses 39
3.5.1 Instruction Cache Instruction Fields Access 40
3.5.2 Instruction Cache Tag/Valid Fields Access 42
3.5.3 Instruction Cache Snoop Tag Fields Access 44
3.6 Instruction Prefetch Buffer Diagnostic Accesses 45
3.6.1 Instruction Prefetch Buffer Data Field Accesses 45
3.6.2 Instruction Prefetch Buffer Tag Field Accesses 46
vi UltraSPARC IV Processor User's Manual October 2005
3.7 Branch Prediction Diagnostic Accesses 47
3.7.1 Branch Predictor Array Accesses 47
3.7.2 Branch Target Buffer Accesses 47
3.8 Data Cache Diagnostic Accesses
506. sponding bit mask CMT registers like LP Enable register Many of the CMT specific registers provide a bit mask wherein each bit corresponds to an individual logical processor For these registers the LP_ID field indicates which bit of a bit mask corresponds to a specific logical processor Chip Multithreading CMT 12 un microsystems Zio Name ASI_CORE_ID ASI 0x63 VA 63 0 0x10 Read Only Privileged Access As described in the TABLE 2 1 the LP ID register has two fields TABLE 2 1 LP ID Register Bit Field Description 63 22 Reserved Reserved for future implementation Max LP ID which gives the logical processor ID value of the highest numbered implemented but not necessarily enabled logical processor 21 16 MAX_LP_ID in this CMT processor For the UltraSPARC IV processor the value of this field is 1 because there are two logical processors 15 6 Reserved Reserved for future implementation A LP_ID field which represents this logical processor s number as assigned by the hardware The LP ID is encoded in 6 bits In the UltraSPARC IV processor one logical processor has a value of 6 b000000 the other logical processor has a value of 6 b000001 5 0 LP Interrupt ID Register ASI_INTR_ID The LP Interrupt ID register described in TABLE 2 2 is added to support the Sun Fireplane Interconnect interrupt transaction This register is used to differentiate bet
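The LP ID register layout in TABLE 2-1 can be decoded with simple shifts and masks. The sketch below is illustrative; the bit positions (MAX_LP_ID in bits 21:16, LP_ID in bits 5:0) follow the table, and reading the register itself would require a privileged LDXA from ASI 0x63, VA 0x10, which is represented here only by a hypothetical helper.

    #include <stdint.h>

    extern uint64_t read_asi_core_id(void);   /* assumed wrapper for LDXA [0x10] from ASI 0x63 */

    /* Decode the fields of the LP ID register (TABLE 2-1). */
    static inline unsigned lp_id(uint64_t reg)     { return (unsigned)(reg & 0x3F); }          /* bits 5:0   */
    static inline unsigned max_lp_id(uint64_t reg) { return (unsigned)((reg >> 16) & 0x3F); }  /* bits 21:16 */

    /* Example: on the UltraSPARC IV processor, max_lp_id() returns 1 and
     * lp_id() returns 0 or 1, identifying this logical processor. */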
507. srupting trap enabled by the CEEN bit in the Error Enable Register to carry out error logging Hardware corrected L2 cache tag errors set AFSR THCE and log the access physical address in AFAR In contrast to L2 cache data ECC errors AFSR E_SYND is not captured For hardware corrected L2 cache tag errors the processor actually writes the corrected entry back to the L2 cache tag array then retries the original request except for a snoop request For a snoop request the Hardware corrected L2 cache state will be returned to the system interface unit and the processor writes the corrected entry back to L2 cache tag array For most cases future accesses to this same tag should see no error This is in contrast to Hardware corrected L2 cache data ECC errors for which the processor corrects the data but does not write it back to the L2 cache This rewrite correction activity by the processor is to maintain the full required snoop latency and also obey the coherence ordering rules Uncorrectable L2 cache Tag ECC Errors An uncorrectable L2 cache tag ECC error may be detected as the result of any operation which reads the L2 cache tags All uncorrectable L2 cache tag ECC errors are fatal errors indicated by the processor asserting its ERROR output pin The event will be logged in AFSR TUE or AFSR TUE_SH and AFAR All uncorrectable L2 cache tag ECC errors for tag update or copyout due to foreign bus transactions or snoop request will set AFSR TU
508. struction cache tag number 00 0 and 10 S tag H gt E pay sical tag microtag or valid load predict bit array is accessed 2 0 Mandatory value Should be 0 IC_tag I cache tag numbers TABLE 3 15 through TABLE 3 19 illustrate the meaning of the tag numbers in IC_tag In the tables Undefined means the value of these bits is undefined on reads and must be masked off by software Caches Cache Coherency and Diagnostics 42 un microsystems 00 Physical Address Tag Access I cache physical address tag The data format for I cache physical address tag is shown in TABLE 3 15 TABLE 3 15 Data Format for I cache Physical Address Tag Field Bits Field Description 63 38 Mandatory value Should be 0 37 Parity Parity is the odd parity of the IC_tag fields 36 8 IC_tag IC_tag is the 29 bit physical tag field PA 41 13 of the associated instructions 7 0 Undefined The value of these bits is undefined on reads and must be masked off by software 01 Microtag Access I cache microtag The data format for the I cache microtag is shown in TABLE 3 16 TABLE 3 16 Data Format for I cache Microtag Field Bits Field Description 63 46 Mandatory value Should be 0 IC_utag is the 8 bit virtual microtag field VA 21 14 of the associated instructions 45 38 IC_utag The value of these bits is undefined on reads and must be masked off by 37 0 Undefined software
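TABLE 3-15 states that bit 37 holds the odd parity of the IC_tag field (bits 36:8). A diagnostic routine could recompute that parity from a value read through the I cache tag diagnostic access, as in the illustrative C sketch below. The ASI read itself is not shown, and the sketch assumes the usual odd-parity convention that the stored bit makes the total number of ones (tag plus parity bit) odd.

    #include <stdint.h>
    #include <stdbool.h>

    /* Check the odd parity bit of an I-cache physical address tag word
     * returned by the tag diagnostic access (format of TABLE 3-15). */
    static bool icache_ptag_parity_ok(uint64_t tag_word)
    {
        uint64_t ic_tag = (tag_word >> 8) & 0x1FFFFFFFULL;  /* bits 36:8, the 29-bit PA<41:13> tag */
        unsigned parity = (unsigned)((tag_word >> 37) & 1); /* bit 37, odd parity over IC_tag      */

        unsigned ones = 0;
        for (uint64_t v = ic_tag; v != 0; v >>= 1)
            ones += (unsigned)(v & 1);

        /* Odd parity: tag bits plus the parity bit should contain an odd number of ones. */
        return ((ones + parity) & 1) == 1;
    }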
509. structions unless otherwise noted. Other length accesses cause a data_access_exception trap.

TABLE 10 2 The UltraSPARC IV processor ASI Extensions (1 of 5)
Value   ASI Name (Suggested Macro Syntax)   Description   R/W   Private/Shared
0416    The SPARC Architecture Manual, Version 9    Private
0C16    The SPARC Architecture Manual, Version 9    Private
1016 to 1116    The SPARC Architecture Manual, Version 9    Private
1416 to 1516    Reserved    Reserved for future implementation    Private
1816 to 1916    The SPARC Architecture Manual, Version 9    Private
1C16 to 1D16    Reserved    Reserved for future implementation    Private
2416    Reserved    Reserved for future implementation    Private
2C16    Reserved    Reserved for future implementation    Private
3016    ASI_PCACHE_STATUS_DATA    P cache data status RAM diagnostic access    RW    Private
3116    ASI_PCACHE_DATA    P cache data RAM diagnostic access    RW    Private
3216    ASI_PCACHE_TAG    P cache tag RAM diagnostic access    Private
3316    ASI_PCACHE_SNOOP_TAG    P cache snoop tag RAM diagnostic access    Private
3416    ASI_QUAD_LDD_PHYS    128 bit atomic load    Private
3816    ASI_WCACHE_STATE    W cache state RAM diagnostic access    RW    Private
3916    ASI_WCACHE_DATA    W cache data RAM diagnostic access    RW    Private
3A16    ASI_WCACHE_TAG    W cache tag RAM diagnostic access    RW    Private
3C16    ASI_QUAD_LDD_PHYS_L    128 bit atomic load, little endian    R    Private
3F16    ASI_SRAM_FAST_INIT_SHARED    Fast initialize all shared SRAM    Shared
4016    ASI_SRAM_FAST_INIT
510. switched from the pseudo split mode to regular mode, the counter will retain its value.
L3_wb    PICU    Number of L3 cache lines that were written back because of requests from this core.
Shared L3 Event counters
L3_wb_sh    PICU    Total number of L3 cache lines that were written back due to requests from both cores.
L2L3_snoop_inv_sh    PICU    Total number of L2 and L3 cache lines that were invalidated due to other processors doing RTO, RTOR, RTOU, or WS transactions. Note that the count includes invalidations to the L2 miss block but does not include invalidations to the L3 miss block.
L2L3_snoop_cb_sh    PICU    Total number of L2 and L3 cache lines that were copied back due to other processors. The count includes copybacks due to both foreign copy back and copy back invalidate requests (i.e., foreign RTS, RTO, RS, RTSR, RTOR, RSR, RTSM, RTSU, or RTOU requests). Note that the count includes copybacks from the L2 miss block but does not include copybacks from the L3 miss block.
The total number of cache to cache transfers observed within the processor can be formulated as the following:
Cache_to_cache_transfer = L2L3_snoop_cb_sh + SI_RTS_src_data(core0 + core1) + SI_RTO_src_data(core0 + core1)
For the total number of cache to cache transfers in an n chip multiprocessor system, use:
Cache_to_cache_transfer = L2L3_snoop_cb_sh(chip 0) + L2L3_snoop_cb_sh(chip 1) + ... + L2L3_snoop_cb_sh(chip n-1)
L3_hit_I_state_sh    PICL    Total number of
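The two cache to cache transfer formulas above (as reconstructed here) translate directly into code. The sketch below assumes the relevant counter values have already been sampled into plain variables; the sampling of the PIC counters themselves is outside the scope of this example.

    #include <stdint.h>

    /* Cache-to-cache transfers observed within one processor (both cores). */
    static uint64_t c2c_transfers_one_chip(uint64_t l2l3_snoop_cb_sh,
                                           uint64_t si_rts_src_data_core0,
                                           uint64_t si_rts_src_data_core1,
                                           uint64_t si_rto_src_data_core0,
                                           uint64_t si_rto_src_data_core1)
    {
        return l2l3_snoop_cb_sh
             + si_rts_src_data_core0 + si_rts_src_data_core1
             + si_rto_src_data_core0 + si_rto_src_data_core1;
    }

    /* Cache-to-cache transfers in an n-chip system: sum of the shared
     * copyback counter over all chips. */
    static uint64_t c2c_transfers_system(const uint64_t l2l3_snoop_cb_sh[], unsigned n_chips)
    {
        uint64_t total = 0;
        for (unsigned i = 0; i < n_chips; i++)
            total += l2l3_snoop_cb_sh[i];
        return total;
    }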
511. system bus 170 173 External Cache Tag Access Address Format 332 336 external cache INDEX XXXV amp Sun microsystems un microsystems Control register See ECACHE_CTRL data access bypass 26 External Reset pin 84 Externally Initiated Reset XIR 84 F fast_data_access_MMU_miss exception 330 fast_ECC_error exception 159 160 162 168 174 215 253 254 fast_ECC_error exception 259 fast_instruction_access_MMU_miss exception 312 330 fast_instruction_access_mmu_miss exception 215 FGA pipeline 116 FGM pipeline 116 Fireplane ASI extensions 287 DTL signals 292 294 reset values 296 Fireplane Configuration Register access 288 293 AID field 294 CBASE field 290 CBND field 290 CLK field 289 290 DEAD field 289 DTL_ fields 288 E _CLK field reset values 296 HBM hierarchical bus mode field 290 MR field 288 MT field 289 NXE field 288 SAPE field 288 SLOW field 290 SSM field 290 TOF field 289 TOL field 289 updating fields 297 FIREPLANE_PORT_ID AID field 288 MID field 288 297 MR module revision field 288 297 MT module type field 288 297 register accessing 287 Floating point control 40 floating point subnormal value generation 267 FLUSH instruction 25 I cache 25 flushing address aliasing 28 displacement 29 L2 cache 25 write cache 25 fp_disabled exception 259 271 fp_exception_ieee_754 exception 271 fp_exception_other exception 120 131 132 138 271 FPRS register
512. t Non cacheable D cache block load fill request with BERR response flag fast ecc error L2 cache data Not installed L2 cache state L1 cache data Not installed Pipeline Action garbage data not taken Comment Deferred trap Non cacheable Prefetch 0 1 2 3 fill request with BERR response Not installed Not installed garbage data not taken Disrupting trap Cacheable I cache fill request with BERR in the critical 32 byte data from system bus garbage data installed Not installed garbage data not taken Deferred trap BERR will be taken and Precise trap fast ecc error will be dropped the 2 least significant data bits 1 0 in both lower and upper 16 byte are flipped Cacheable I cache fill request with BERR in the non critical 2nd 32 byte data from system bus Cacheable D cache load 32 byte fill request with BERR in the critical 32 byte data from system bus Cacheable D cache load 32 byte fill request with BERR in the non critical 2nd 32 byte data from system bus Cacheable D cache FP 64 bit load fill request with BERR in the critical 32 byte data from system bus Cacheable D cache FP 64 bit load fill request with BERR in the non critical 2nd 32 byte data from system bus BERR Y BERR garbage data installed garbage data installed garbage data installed garbage data installed garbage data installed No action Installed No
513. t 33 bit 29 bit 16 15 of the Sun Fireplane Interconnect Configuration Register Set the processor to system bus clock ratio These fields may only be written during initialization before any System Bus transactions are initiated 0 0 00 reserved 0 0 01 reserved 0 0 10 reserved 0 0 11 reserved 0 1 00 8 1 processor to system clock ratio default 0 1 01 9 1 processor to system clock ratio new 0 1 10 10 1 processor to system clock ratio new 0 1 11 11 1 processor to system clock ratio new 1 0 00 12 1 processor to system clock ratio new 1 0 01 13 1 processor to system clock ratio new 1 0 10 14 1 processor to system clock ratio new 1 0 11 15 1 processor to system clock ratio new 1 1 00 16 1 processor to system clock ratio new 1 1 01 reserved 1 1 10 reserved 1 1 11 reserved Note The UltraSPARC IV processor supports 8 1 9 1 10 1 11 1 12 1 13 1 14 1 15 1 16 1 processor to system clock ratio Sun Fireplane Interconnect and Processor Identification 292 un microsystems 11 1 1 4 Sun Fireplane Interconnect Configuration Register 2 The Sun Fireplane Interconnect Configuration Register 2 can be accessed at ASI 4A 6 VA 0x10 All fields are shared by both logical processors All fields except bits 26 17 are identical to the corresponding fields in the Sun Fireplane Interconnect Configuration Register When any field except bits 26 17 AID
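The encoding of bit 33, bit 29, and bits 16:15 listed above maps onto the processor-to-system clock ratio in a regular way: treating the four bits as a single code, the supported values 0b0100 through 0b1100 correspond to ratios 8:1 through 16:1, and everything else is reserved. The C sketch below decodes a configuration register value on that basis; reading the register itself (ASI access) is not shown.

    #include <stdint.h>

    /* Decode the processor-to-system bus clock ratio from a Sun Fireplane
     * Interconnect Configuration Register value.  The 4-bit code is built
     * from bit 33, bit 29 and bits 16:15, per the table above. */
    static int fireplane_clock_ratio(uint64_t cfg)
    {
        unsigned code = (unsigned)(((cfg >> 33) & 1) << 3)
                      | (unsigned)(((cfg >> 29) & 1) << 2)
                      | (unsigned)((cfg >> 15) & 3);      /* bits 16:15 */

        if (code < 0x4 || code > 0xC)
            return -1;              /* reserved encoding */
        return (int)(code + 4);     /* 0b0100 -> 8:1 ... 0b1100 -> 16:1 */
    }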
514. t include hardware prefetch requests that miss L2 cache L2_HWPF_miss Number of L2 cache misses from this core by cacheable D cache requests excluding Lacra miss PICE block loads and atomics Number of L2 cache misses from this core by cacheable I cache requests The count L2_I iss PICL R s 1C_miss PICL includes some wrong path instruction requests LI Sw pt miss PICL Number of L2 cache misses by software prefetch requests from this core Number of L2 cache misses by hardware prefetch requests from this core L2_HW_pf_miss PICU Note that hardware prefetch requests that miss L2 cache are not sent to L3 cache they are dropped Number of L2 cache hits in O Os or S state by cacheable store requests from this L2_write_hit_RTO PICL core that do a read to own RTO bus transaction The count does not include RTO requests for prefetch fen 2 3 22 23 instructions Number of L2 cache misses from this core by cacheable store requests excluding block stores The count does not include write miss requests for prefetch fen 2 3 L2_write_miss PICL 22 23 instructions Note that this count also includes the L2_write_hit RTO cases RTO_nodata i e stores that hit L2 cache in O Os or S state Number of L2 cache hits from this core to the ways filled by the other core when the cache is in the pseudo split mode L2_hit_other_half PICL Note that the counter does not count if the L2 cache is not in pseu
515. t is set for an uncorrectable error system state has been corrupted The corruption may be limited and may be recoverable if this occurs as the result of code as described in Special Access Sequence for Recovering Deferred Traps on page 255 PRIV accumulates the privilege state of the processor at the time errors are detected until software clears PRIV PRIV accumulates the state of the PSTATE PRIV bit at the time the event is detected rather than the PSTATE PRIV value associated with the instruction which caused the access which returns the error Note MEMBAR Sync is required before an ASI store which changes the PSTATE PRIV bit to act as an error barrier between previous transactions that were launched with a different PRIV state This ensures that privileged operations which fault will be recorded as privileged in AFSR PRIV AFSR1 PRIV accumulates as specified in TABLE 7 16 AFSR2 PRIV captures privilege state as specified in this table but only for the first error encountered 181 amp Sun microsystems 7 3 3 4 7 3 3 5 Error Handling Bits 51 50 PERR and IERR indicate that either an internal inconsistency has occurred in the system interface logic or that a protocol error has occurred on the system bus If either of these conditions occurs the processor will assert its ERROR output pin The AFSR may be read after a reset event used to recover from the error condition to discover the cause The IE
516. t subfields the major mask number VER bits 31 28 and the minor mask number VER bits 27 24 Please refer to TABLE 11 7 TABLE 11 7 VER Register Encoding in Panther Bit Field Value 63 48 manuf 003E16 47 32 impl 001916 31 24 mask starts from 1016 23 16 Reserved 0 15 8 maxtl 0516 7 5 Reserved 0 4 0 maxwin 0716 FIREPLANE_PORT_ID MID Field The 6 bit MID field in the FIREPLANE_PORT_ID register contains the six least significant bits 3E of Sun s JEDEC code Sun Fireplane Interconnect and Processor Identification 297 amp Sun microsystems 11 3 3 11 3 4 Speed Data Register The Speed Data register is a low power mode programmable 64 bit register It is programmed to hold the clock frequency information of the processor after final testing The value stored in the register is the clock frequency in MHz divided by 25 This register is read accessible from the ASI bus using ASI 53 6 VA 20 The data format for the Speed Data register is shown in TABLE 11 8 TABLE 11 8 Speed Data Register Bits 63 8 Mandatory value 7 0 Speed data clock frequency in MHz 25 Program Version Register The Program Version register is a low power mode 64 bit register It is programmed to hold the test program version used to test the processor This register is read accessible from the ASI bus using ASI 5316 VA 3046 The data format for the Speed Data register is
517. ta format for P cache tag valid fields access is shown in TABLE 3 55 TABLE 3 55 Prefetch Cache Tag Register Access Data Format Bits Field Description 63 62 Reserved Reserved for future implementation 61 PC_bank0_valid Valid bit for RAM bits 511 256 60 PC_bank1_valid Valid bit for RAM bits 255 0 59 58 context nucleus_cxt and secondary_cxt bits of the load instruction 57 0 PC_virtual_tag 58 bit virtual tag VA 63 6 of the associated data The P cache keeps a 2 bit context tag bit 59 58 of the P cache tag register The two bits are decoded as follows 00 Primary Context 01 Secondary Context 10 Nucleus Context 11 Not used A nucleus access will never hit P cache for a primary or a secondary context entry even if it has the same VA Moving between nucleus primary or secondary context does not require P cache invalidation All entries in the P cache are invalidated on a write to a context register The write invalidates the prefetch queue such that once the data is returned from the external memory unit it is not installed in the P cache 3 10 4 Prefetch Cache Snoop Tag Register Access Name ASI_PCACHE_SNOOP_TAG T Caches Cache Coherency and Diagnostics 57 un microsystems The address format of P cache snoop tag register access is shown in TABLE 3 56 TABLE 3 56 Prefetch Snoop Tag Access Address Format Bits Description 63 12 Reserved Reserved for future i
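The 2-bit context field in bits 59:58 of the P cache tag register (TABLE 3-55) decodes as shown in the illustrative C sketch below; it is a direct transcription of the encoding given above, with hypothetical helper names.

    #include <stdint.h>

    enum pcache_context { PC_CTX_PRIMARY = 0, PC_CTX_SECONDARY = 1,
                          PC_CTX_NUCLEUS = 2, PC_CTX_UNUSED   = 3 };

    /* Extract the context tag (bits 59:58) from a P-cache tag register value. */
    static inline enum pcache_context pcache_tag_context(uint64_t tag_reg)
    {
        return (enum pcache_context)((tag_reg >> 58) & 0x3);
    }

    /* Valid bits for the two RAM banks (bits 61 and 60); the 58-bit virtual
     * tag VA<63:6> occupies bits 57:0 and can be picked out the same way. */
    static inline int pcache_bank0_valid(uint64_t tag_reg) { return (int)((tag_reg >> 61) & 1); }
    static inline int pcache_bank1_valid(uint64_t tag_reg) { return (int)((tag_reg >> 60) & 1); }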
518. table is shown in TABLE 7 13 and TABLE 7 14 TABLE 7 14 Syndrome table for Microtag ECC Syndrome 3 0 Error Indication 0x0 None 0x1 microtag ECC 0 0x3 Double bit UE 0x4 microtag ECC 2 0x5 Double bit UE 0x7 microtag Data 0 0x8 microtag ECC 3 0x9 Double bit UE 0xB microtag Data 1 0xC Double bit UE 0xD microtag Data 2 OxF Double bit UE The M_SYND field is locked by the AFSR EMC EMU IMC and IMU bits The E_LSYND field is locked by the AFSR UE CE UCU UCC EDU EDC WDU WDC CPU CPC IVU IVC and L3_UCC L3_UCU L3_EDC L3_EDU L3_CPC L3_CPU L3_WDC and L3_WDU bits So a data ECC error can lead to the data ECC syndrome being recorded in ELSYND perhaps with a CE status then a later microtag ECC error event can store the microtag ECC syndrome in M_SYND perhaps with an EMC status The two are independent Error Handling 190 amp Sun microsystems ye Fe Error Handling Asynchronous Fault Address Register Primary and secondary AFARs are provided AFAR1 and AFAR2 associated with AFSR1 and AFSR2 AFAR1 works with the status captured in AFSR1 according to the overwrite policy described in Overwrite Policy on page 198 AFAR2 is captured at the time that AFSR2 is captured and specifically reflects the address of the transaction which causes AFSR2 to be frozen AFAR2 operates no overwrite policy AFAR2 becomes available for further updates exactly as AFSR2 does when bits
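The microtag ECC syndrome encoding of TABLE 7-14 above can be turned into a small lookup for diagnosis software. The sketch below is illustrative only; syndrome values not listed in the table are reported as such rather than guessed.

    #include <stdint.h>

    /* Classification of a 4-bit microtag ECC syndrome per TABLE 7-14. */
    static const char *mtag_syndrome_decode(unsigned syndrome)
    {
        switch (syndrome & 0xF) {
        case 0x0: return "no error";
        case 0x1: return "microtag ECC[0]";
        case 0x4: return "microtag ECC[2]";
        case 0x7: return "microtag Data[0]";
        case 0x8: return "microtag ECC[3]";
        case 0xB: return "microtag Data[1]";
        case 0xD: return "microtag Data[2]";
        case 0x3: case 0x5: case 0x9: case 0xC: case 0xF:
                  return "double-bit uncorrectable error";
        default:  return "not listed in TABLE 7-14";
        }
    }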
519. tag or data bit giving a software correctable error on every access 2 All the exception handler code can be placed on a non cacheable page This solution does cover hard faults on data bits in the L2 cache at least adequately to report a diagnosis or to remove the processor from a running domain provided that the actual exception vector for the fast_ECC_error trap is not in the L2 cache Exception vectors are normally in cacheable memory To avoid fetching the exception vector from the L2 cache flush it from the L2 cache and L3 cache in the fast_ECC_error trap routine 161 amp Sun microsystems 7 2 1 8 Error Handling 3 Exception handler code may be placed in cacheable memory but only in the first 32 bytes of each 64 byte L2 cache sub block At the end of the 32 bytes the code has to branch to the beginning of another L2 cache sub block The first 32 bytes of each L2 cache sub block fetched from system bus are sent directly to the instruction unit without being fetched from the L2 cache None of the I cache or D cache lines fetched may be in the L2 cache or L3 cache This does cover hard faults on data bits in the L2 cache or L3 cache for systems that do not have non cacheable memory from which the trap routine can be run It does not cover hard faults on tag bits The exception vector and the trap routine must all be flushed from the L2 cache and L3 cache after the trap routine has executed Note If by some means the proc
520. ted and installed in the D cache arrays but will not be checked Note D cache physical tag or data parity errors are not checked for non cacheable accesses or when DCUCR DC is 0 D cache errors are not logged in AFSR Trap software can log D cache errors Error Handling 149 amp Sun microsystems 7 2 2 1 RE 7 2 2 3 Error Handling D cache Physical Tag Errors D cache physical tag parity errors can only occur as the result of a load store or atomic instruction Invalidation of D cache entries by bus coherence traffic does not use the physical tag array The physical address of each datum being fetched as a result of a data access instruction including speculative loads is compared in parallel with the physical tags in all four ways of the D cache A parity error on the physical tag of any of the four ways will result in a dcache_parity_error trap when it hits valid and micro tag array The D cache actually only supplies data for load like operations from the processor However physical tag parity is checked for store like operations as well because if the D cache has an entry for the relevant line it must be updated Note A D cache physical tag error is only reported for data which is actually used Errors associated with data which is speculatively fetched but later discarded without being used are ignored for the moment D cache physical tag parity errors on executed instructions are reported by a pre
521. tem bus due to non RTSR transaction CE Corrected data corrected ecc Corrected data corrected ecc N A Data not installed in P cache Good data not installed in P cache No action No action Disrupting trap Disrupting trap Error Handling 244 un microsystems TABLE 7 27 Event Prefetch for instruction 17 fill request with CE in the critical 32 byte data from system bus due to RTSR transaction OR Prefetch for instruction 17 fill request with CE in the non critical 2nd 32 byte data from system bus due to RTSR transaction System Bus CE UE TO DTO BERR DBERR errors 5 of 10 flag fast ecc error L2 cache data Corrected data corrected ecc L2 cache state L1 cache data Good data not installed in P cache Pipeline Action No action Comment Disrupting trap Prefetch 0 1 20 21 17 fill request with UE in the critical 32 byte data from system bus due to non RTSR transaction OR Prefetch 0 1 20 21 17 fill request with UE in the 2nd 32 byte data from system bus due to non RTSR transaction Not installed Bad data not installed in P cache No action Disrupting trap P cache 0 1 20 21 17 fill request with UE in the critical 32 byte data from system bus due to RTSR transaction OR P cache 0 1 20 21 17 fill request with UE in the 2nd 32 byte data from system bus due to RTSR transaction Prefetch 2
522. tem bus or the L2 cache The secondary AFSR in this case would show that it came from the system bus The secondary AFSR is represented by the label AFSR2 in this manual where it is necessary to distinguish primary and secondary registers The secondary AFSR_EXT enables diagnosis software to determine the source of an error The secondary AFSR_EXT is represented by the label AFSR2_EXT in this document where it is necessary to distinguish primary and secondary registers AFSR Fields Bit 53 the accumulating multiple error ME bit is set in AFSR1 or AFSR1_EXT when an uncorrectable error occurs or a SW_correctable error occurs and the AFSR1 or AFSR1_EXT status bit to report that error is already set to 1 Multiple errors of different types are indicated by setting more than one of the AFSR1 or AFSR1_EXT status bits Note AFSR1 ME is not set if multiple HW_corrected errors with the same status bit occur only uncorrectable and SW_correctable AFSR2 ME can never be set because if any bit is already set in AFSR1 AFSR2 is already frozen AFSRI1 ME is not set by multiple ECC errors which occur within a single 64 byte system bus transaction The first ECC error in a 16 byte correction word will be logged Further errors of the same type in following 16 byte words from the same 64 byte transaction are ignored Bit 52 the accumulating privilege error PRIV is set when an error is detected at a time when PSTATE PRIV 1 If this bi
523. tems Traps data_access_exception trap for ASI 3F 16 use other than with STXA instruction 3 15 OBP Backward Compatibility Incompatibility TABLE 3 72 summarizes the UltraSPARC IV processor OBP Open Boot PROM backward compatibility list TABLE 3 72 The UltraSPARC IV processor OBP Backward Compatibility List 7 of 2 New Feature ASI_L2CACHE_CTRL 6D 6 Caches Cache Coherency and Diagnostics The UltraSPARC IV processor Bits Queue_timeout _ disable Hard _POR State The UltraSPARC IV processor Behavior on UltraSPARC OBP Software The logic to detect the progress of a queue is disabled Queue_timeout Queue timeout period is set at maximum value of 2M system cycles L2_off L2 cache is enabled Retry_disable The logic to measure request queue progress and put L2L3 arbiter in single issue mode is enabled Retry_debug _counter The logic in L2L3 arbiter will count 1024 retries before entering the single issue mode L2L3arb_single _issue_frequency L2L3arb_single _issue_en If L2L3 arbiter is in single issue mode the dispatch rate is one per 16 cycles L2L3 arbiter will dispatch a request every other cycle L2_split_en All four ways in L2 cache can be used for replacement triggered by either logical processor L2_data_ecc_en L2_tag_ecc_en No ECC checking on L2 cache data bits Default to L2 cache Tag ECC checking disabled NO
524. ter an internal ASI store before a load from any internal ASI If a MEMBAR is not used the load data returned for the internal ASI is not defined e A FLUSH DONE or RETRY is needed after an internal ASI store that affects instruction accesses including the I D MMU ASIs ASI 5016 to 5216 5416 to 5F16 the I cache ASIs 6616 to 6F 16 or the IC bit in the DCU Control Register ASI 45 before the point that side effects must be visible Stores to D MMU registers ASI 5816 to 5F other than the context ASIs ASI 5816 VA 8 6 1016 can use a MEMBAR Sync A MEMBAR FLUSH DONE or RETRY instruction must precede the next load or non internal store The instruction must also be in or before the delay slot of a delayed control transfer instruction to avoid corrupting data Address Space Identifiers 278 amp Sun microsystems 10 1 3 Note For more predictable behavior it is recommended to park the other logical processor when performing ASI accesses to a shared resource If the other logical processor is not parked it may perform operations making use of and or modifying the shared resource being accessed Limitation of Accessing Internal ASIs STXA to an internal ASI may corrupt the data returned for a subsequent LDXA from an internal ASI due to an exception which occurred prior to the protecting MEMBAR Sync For example the following code sequence where asi is
525. ters cccccccsseseseeeeeeeeesseeeeeteeenens 319 13 2 Data Memory Management Unit 0 0 ccc cccccceseceeenceeeseeeeeseeeeseeeneaeeeseseeeseneeeseeensneeensas 319 Table of Contents amp Sun xiii microsystems un microsystems 14 INDEX xiv 13 2 1 13 2 2 13 2 3 13 2 4 13 2 5 13 2 6 13 2 7 13 2 8 Virtual Address Translation cesceescesscescessseceeesseesseecsaeeeseesseesseeeeaeesaeenaee 319 Two D TLBs with Large Page Support cccecsscecesseeeeceeeseeeeeeneeeseeeeseeeeees 320 Translation Table Entry TTE iiceoe ie E E E E EE 328 Hardware Support for TSB Access ssssssssssersssessseessersssrssseesseessersseeesseessersse 329 Faults and R B eY o A E E E T E EAA 331 Reset Disable and RED_ state Behavior csesccsesssssseessssrrsssssssssrrerrrrrssssssseee 331 Internal Registers and ASI Operations ccccccssceeessceeeeeeeeeeeeeneeeesneeeseneeeees 331 Translation Lookaside Buffer Hardware 0 cceesesscessessseeseeeeesseeeeneeeaeenaee 337 Gadaussecsddeacsescnseesesasavansessascsseassevevesssssosssted successesessusseesesescowiseovesssutsssaneeeussosseosesesse lx UltraSPARC IV Processor User s Manual October 2005 List of Tables amp Sun microsystems TABLE 2 1 TABLE 2 2 TABLE 2 3 TABLE 2 4 TABLE 2 5 TABLE 2 6 TABLE 2 7 TABLE 2 8 TABLE 2 9 TABLE 2 10 TABLE 2 11 TABLE 3 1 TABLE 3 2 TABLE 3 3 TABLE 3 4 TABLE 3 6 TABLE 3 5
526. the AFSR1 M_SYND and E_SYND fields is a separate stored priority for the data in the field When AFSR1 and AFSR1_EXT are empty and no errors have been logged the effective priority stored for each field is 0 Whenever an event to be logged in AFSR1 AFSR1_EXT or AFARI occurs compare the priority specified for each field for that event to the priority stored internal to the processor for that field If the priority for the field for the event is numerically higher than the priority stored internal to the processor for that field update the field with the value appropriate for the event that has just occurred and update the stored priority in the processor with the priority specified in the table for that field and new event 196 amp Sun microsystems Error Handling Note This implies that fields with a 0 priority in the above table are never stored for that event For instance if first a UE occurs to capture AFSR1 E_SYND then an EDU the EDU doesn t update AFSR1 E_SYND because it has the same priority as UE Trap handler software clears AFSR1 UE but leaves AFSR1 EDU set AFSR1 E_SYND will be unchanged When a CE occurs AFSRI E_SYND will not be changed since CE has lower priority than EDU PRIV A 1 in the set PRIV column implies that the specified event will set the AFSR PRIV bit if the PSTATE PRIV bit is 1 at the time the event is detected A 0 implies that the event has no effect on the AFSR PRIV bit AFSR PRI
527. the L3 cache control register This section describes the following L3 cache accesses e L3 cache Control Register e L3 cache data ECC field diagnostic accesses e L3 cache tag state LRU ECC field diagnostics accesses e L3 cache SRAM mapping L3 cache Control Register ASI 7516 Read and Write Shared by both logical processors VA O16 Name ASI_L3CACHE_CONTROL Caches Cache Coherency and Diagnostics 66 un microsystems The data format for L3 cache control register is shown in TABLE 3 64 TABLE 3 64 L3 cache Control Register Access Data Format 1 of 3 63 45 Mandatory value Description Should be 0 44 40 siu_data_mode 39 ET_off 33 EC_PAR_force_LHS 32 EC_PAR_force_RHS SIU data interface mode required to be set for different system clock ratio and L3 cache mode based on Secondary L3 cache Control Register on page 69 After hard reset these bits will be read as OF j6 If set disables on chip L3 cache tag SRAM for debugging After hard reset this bit will be read as bt If set flips the least significant bit of address to the left hand side SRAM DIMM of L3 cache After hard reset this bit will be read as bt If set flips the least significant bit of address to the right hand side SRAM DIMM of L3 cache After hard reset this bit will be read as bt 31 EC_RW_grp_en If set enables the read bypassing write logic in L3 cache data access After hard reset this bit wil
528. the event of multiple errors being detected simultaneously When an uncorrectable error is present in a 64 byte line read from the system bus in order to complete a load like or atomic instruction corrupt data will be installed in the D cache The deferred trap handler should invalidate the D cache during recovery Corrupt data is never stored in the I cache or P cache An uncorrectable system bus data ECC error on a read to a non cacheable space is handled in the same way as cacheable accesses except that the error cannot be stored in the processor caches so there is no need to flush them An uncorrectable error cannot occur as the result of a store like operation to uncached space An uncorrectable system bus data ECC error as the result of an interrupt vector fetch sets AFSR IVU in the processor fetching the vector The error is not reported to the processor which generates the interrupt When the uncorrectable interrupt vector data is read by the interrupt vector fetch hardware of the processor receiving the interrupt a disrupting ECC_error exception is generated No interrupt_vector trap is generated The processor will store the uncorrected interrupt vector data in the internal interrupt registers unmodified as it is received from the system bus Uncorrectable System Bus Microtag Errors An uncorrectable microtag ECC error as the result of a system bus read of memory or I O sets AFSR EMU Whether or not the processor is configured in
529. the exception handler code can be placed on a non cacheable page This solution does cover hard faults on data bits in the L3 cache at least adequately to report a diagnosis or to remove the processor from a running domain provided that the actual exception vector for the 167 amp Sun microsystems 7 2 8 8 Error Handling fast_ECC_error trap is not in the L3 cache Exception vectors are normally in cacheable memory To avoid fetching the exception vector from the L3 cache flush it from the L3 cache in the fast_ECC_error trap routine 3 Exception handler code may be placed in cacheable memory but only in the first 32 bytes of each 64 byte L2 cache sub block At the end of the 32 bytes the code has to branch to the beginning of another L2 cache sub block The first 32 bytes of each L2 cache sub block fetched from system bus are sent directly to the instruction unit without being fetched from the L2 cache or L3 cache None of the I cache or D cache lines fetched may be in the L2 cache or L3 cache This does cover hard faults on data bits in the L2 cache or L3 cache for systems that do not have non cacheable memory from which the trap routine can be run It does not cover hard faults on tag bits The exception vector and the trap routine must all be flushed from the L2 cache and L3 cache after the trap routine has executed Note If by some means the processor does encounter a software correctable L3 cache ECC error while executing
530. the fast_ECC_error trap handler the processor may recurse into RED_state and not make any record in the AFSR of the event leading to difficult diagnosis The processor will set the AFSR ME bit for multiple software correctable events but this is expected to occur routinely when an AFAR and AFSR is captured for an instruction which is prefetched automatically by the instruction fetcher then discarded The fast_ECC_error trap uses the alternate global registers If a software correctable L3 cache error occurs while the processor is running some other trap which uses alternate global registers such as spill and fill traps there may be no practical way to recover the system state The fast_ECC_error routine should note this condition and if necessary reset the domain rather than recover from the software correctable event One way to look for the condition is to check whether the TL of the fast_ECC_error trap handler is greater than 1 Uncorrectable L3 cache Data ECC Errors Uncorrectable L3 cache data ECC errors occur on multi bit data errors detected as the result of the following transactions e Reads of data from the L3 cache to fill the D cache e Reads of data from the L3 cache to fill the I cache e Performing an atomic instruction These events set AFSR_EXT L3_UCU An uncorrectable L3 cache data ECC error which is the result of an I fetch or a data read caused by any instruction other than a block load causes a fast_ECC_error tra
531. the store Disrupting trap W cache cacheable block store missing L2 cache write stream with unmapped address OR W cache cacheable block store commit request write stream with unmapped address Deferred trap Non cacheable W cache store with unmapped address Deferred trap Non cacheable W cache block store with unmapped address Ecache eviction with unmapped address Deferred trap Deferred trap writeback operation is terminated amp data coherency is lost Outgoing interrupt request with unmapped address Non cacheable I cache fill garbage data Deferred trap the corresponding Busy bit in Interrupt Vector Dispatch Status Register will be cleared Deferred trap BERR will be BERR response not taken request with BERR BERR Yes Not installed Not installed taken and Precise trap fast E not taken H ponse ecc error will be dropped Non cacheable D cache EE Deferred trap BERR will be 32 byte fill request with BERR Yes Not installed N A Not installed 8 Se SE taken and Precise trap fast BERR response ecc error will be dropped Non cacheable D cache E EG Deferred trap BERR will be FP 64 bit fill request with BERR Yes Not installed N A Not installed ete taken and Precise trap fast ecc error will be dropped Error Handling 247 un microsystems TABLE 7 27 System Bus CE UE TO DTO BERR DBERR errors 8 of 10 Even
532. those writes That is stores to shared CMT registers must be performed atomically on all bits of the register All the CMT registers are 64 bit registers although some of the bits of individual registers can be reserved or defined to a fixed value Reserved register fields always should be written by software with values of those fields previously read from that register or with zeroes they should read as zero in hardware Software intended to run on future versions of CMTs should not assume that these fields will read as 0 or any other particular value This software convention makes future expansion of the interface easier Only the Load extended from alternate space LDXA or Load double floating point register from alternate space LDDFA instructions can be used to read CMT registers Only the Store extended into alternate space STXA and the Store double floating point register to alternate space STDFA instructions can be used to write to CMT registers An attempt to access a CMT register with any other instruction results in a data_access_exception trap Le 2 3 1 Private Processor Registers There are three private registers used for logical processor identification LP ID Register AST_CORE_ID The LP ID register is a read only private register that holds the ID value assigned by hardware to each implemented logical processor The ID value is unique within the CMT The LP ID register corresponds to a bit offset for corre
533. through the counters listed in TABLE 5 11 Counts are updated by each cache access regardless of whether the access will be used The Instruction data prefetch and write cache counters are private counters Because the L2 and the L3 caches are shared by the two cores some events are counted by the core which causes the access while other events cannot be attributed to an individual core and are treated as shared events TABLE 5 11 Cache Access Counters 1 of 5 Counter Description Instruction Cache IC_ref PICL IC_fill PICU IPB_to_IC_fill PICL IC_pf PICU Number of I cache references I cache references are fetches of up to four instructions from an aligned block of eight instructions Note that the count includes references for non cacheable instruction accesses and instructions that were later cancelled due to misspeculation or other reasons Thus the count is generally higher than the number of references for instructions that were actually executed Number of I cache fills excluding fills from the instruction prefetch buffer This is the best approximation of the number of I cache misses for instructions that were actually executed The count includes some fills related to wrong path instructions where the branch was not resolved before the fill took place The count is updated for 64 Byte fills only in some cases the fetcher performs 32 Byte fills to I cache Number of I cache fills from the
534. tical 32 byte load L3_UCU FC UCEEN Private critical 32 byte atomic instruction critical 32 byte SW_correctable L3 cache data ECC instruction fetch critical 32 byte and non critical 32 byte load L3_UCC FC UCEEN Private critical 32 byte atomic instruction critical 32 byte Uncorrectable L3 cache data ECC store queue or prefetch queue operation or load instruction L3_EDU C NCEEN Private critical 32 byte or atomic instruction non critical 32 byte Uncorrectable L3 cache data ECC L3_EDU D NCEEN Private block load operation HW_corrected L3 cache data ECC store queue or prefetch queue operation or load instruction L3_EDC C CEEN Private critical 32 byte or atomic instruction non critical 32 byte HW_corrected L3 cache data ECC L3_EDC C CEEN Private block load operation Uncorrectable L3 cache tag ECC SIU tag update or copyback from NCEEN foreign bus transactions or snoop L3_TUE_SH ET_ECC_en Se operations Uncorrectable L3 cache tag ECC NCEEN all other L3 cache tag accesses L3 TUE C ET_ECC_en l Private HW_corrected L3 cache tag ECC writeback copyout block load CEEN d store queue or prefetch queue TO THEE S EI ECC en Ke 195 amp Sun microsystems Error Handling Note When copyout from L2 cache or snoop or tag update due to foreign transaction encounters HW_corrected L2 cache tag ECC error CMT error steering register will be used to decide which logical processor to log THCE When
535. tified with any particular logical processor that error is recorded in and a trap is sent to the logical processor identified by the CMT Error Steering register Name AST_CMP_ERROR_STEERING Register ASI 0x41 VA 63 0 0x40 Read Write Privileged Access TABLE 7 5 CMT Error Steering Register Bit Field Description 63 1 Reserved Reserved for future implementation The Target_ID field has only one 1 bit field that encodes the LP_ID of the logical 0 Target_ID processor that will be informed of shared errors The Target_ID indicates the TTE that has an LP_ID equal in value to that of the Target_ID Software is responsible for ensuring that the CMT Error Steering register identifies an appropriate logical processor Particularly the case of assigning the LP_ID of non enabled logical processor to the CMT Error Steering register must be avoided If the CMT Error Steering register identifies a logical processor that is parked the shared error is reported to that logical processor and the logical processor will take the appropriate trap but not until after it is unparked The timing of the update to the CMT Error Steering register is not defined If the store to the CMT Error Steering register is followed by a MEMBAR synchronization barrier the completion of the barrier guarantees the completion of the update During the update of the CMT Error Steering 177 un microsystems R32 Toe Err
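The invalid-operation rows in the division table (for example 0/0 and infinity/infinity) follow standard IEEE 754 behavior: with the corresponding trap masked, the destination receives a QNaN and the invalid accrued bit is set. The small host-side C program below demonstrates the equivalent behavior through the C99 floating-point environment; it is a generic IEEE 754 illustration, not SPARC-specific code.

    #include <fenv.h>
    #include <math.h>
    #include <stdio.h>

    #pragma STDC FENV_ACCESS ON

    int main(void)
    {
        feclearexcept(FE_ALL_EXCEPT);

        volatile double zero = 0.0, inf = INFINITY;
        volatile double a = zero / zero;   /* invalid operation -> QNaN   */
        volatile double b = inf / inf;     /* invalid operation -> QNaN   */
        volatile double c = 1.0 / zero;    /* divide-by-zero -> +infinity */

        printf("0/0     = %f (invalid raised: %d)\n", a, fetestexcept(FE_INVALID) != 0);
        printf("inf/inf = %f\n", b);
        printf("1/0     = %f (divide-by-zero raised: %d)\n", c, fetestexcept(FE_DIVBYZERO) != 0);
        return 0;
    }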
536. tion A disrupting ECC_error trap will be generated in this case provided that the CEEN bit is set in the Error Enable Register Hardware will proceed to correct the error and corrected data will be sent to the snooping device This bit is not set if the copyout happens to hit in the L2 cache writeback buffer because the line is being victimized Instead the WDC bit is set Please refer to the section WDC for an explanation of this CPU For a copyout operation an L2 cache read will be performed and the data read back from the L2 cache SRAM will be checked for the correctness of its ECC If a multi bit error is detected it will be recognized as an uncorrectable error and the CPU bit will be set to log this error condition A disrupting ECC_error trap will be generated in this case provided that the NCEEN bit is set in the Error Enable Register Multiple occurrences of this error will cause AFSR1 ME to be set When the processor reads uncorrectable L2 cache data and writes it to the system bus in a copyback operation it will compute correct system bus ECC for the corrupt data then invert bits 127 126 of the data to signal to other devices that the data is not usable This bit is not set if the copyout hits in the writeback buffer Instead the WDU bit is set Please refer to the section WDU for an explanation of this The copyback data with UE information in W cache C 1 0 will be inverted in the data written back to the
537. tion RWIC 58 DBERR _ Bus error from system bus for prefetch queue or store queue read operation RWIC 57 THCE HW_corrected L2 cache tag ECC error RWIC 56 Reserved Reserved for future implementation R 55 TUE Uncorrectable L2 cache tag ECC error due to logical processor specific tag access RWIC 54 DUE Uncorrectable system bus data ECC for prefetch queue or store queue read operation RWIC 53 ME Multiple error of same type occurred RWIC 52 PRIV Privileged state error has occurred RWIC 51 PERR System interface protocol error RWIC 50 IERR Internal processor error RWIC 49 ISAP System request parity error on incoming address RWIC 48 EMC HW_corrected system bus microtag ECC error RWIC 47 EMU Uncorrectable system bus microtag ECC error RWIC 46 IVC HW_corrected system bus data ECC error for read of interrupt vector RWIC 45 IVU Uncorrectable system bus data ECC error for read of interrupt vector RWIC 44 TO Unmapped error from system bus RWIC 43 BERR Bus error response from system bus RWIC 42 UCC SW_correctable L2 cache ECC error for instruction fetch load like or atomic instruction RW1C 41 UCU Ee L2 cache data ECC error for instruction fetch load like or atomic RWIC HW_corrected L2 cache data ECC error for copyout 40 CPC Note This bit is not set if the copyout hits in the writeback buffer Instead the WDC bit RW1C is set Uncorrectable L2 cache data ECC error for copyout 39 CPU Note This bit i
538. tion Register Written rd Flag s Written rd ER 0 0 Normal 0 None set 0 None set 0 0 Normal 0 None set 0 None set 0 0 Normal 0 None set 0 None set 0 0 Normal 0 None set 0 None set s Asserts nvc 0 Infinity QNaN Asserts nvc nva No IEEE trap enabled d Asserts nvc 0 Infinity QNaN Asserts nvc nva No IEEE trap enabled i Asserts nvc 0 Infinity QNaN Asserts nvc nva No IEEE trap enabled S Asserts nvc 0 Infinity QNaN Asserts nvc nva No Normal Normal Can underflow overflow See 6 5 IEEE trap enabled Can underflow overflow See 6 5 Normal Infinity Infinity Infinity None set Infinity None set Normal Infinity Infinity Infinity None set Infinity None set Normall Infinity Infinity Infinity None set Infinity None set Normall Infinity Infinity Infinity None set Infinity None set 1 IEEE trap means fp_exception_IEEE_754 IEEE 754 1985 Standard 125 un microsystems 6 3 4 Division TABLE 6 6 Floating Point Division Result from the operation includes one or more of the following DIVISION Number in f register See Trap Event on page 132 s Exception bit set See TABLE 6 12 s Trap occurs See abbreviations in TABLE 6 12 ren rsz e Underflow overflow can occur Instruction Masked Exception TEM 0 Enabled Exception TEM 1 LIS oog gt 00 Destination Register Written rd
539. tion in the BR pipe itself So instruction_access_error traps are taken as soon as the instruction fetcher dispatches the faulty instruction data_access_error BR or MS pipe interrupt_vector BR MS AO or A1 pipe but see note above ECC_error BR or MS pipe interrupt_level_n BR MS AO or A1 pipe but see note above The above table specifies when the processor will initiate trap processing meaning consider what trap to take When the processor has initiated trap processing it will take one of the outstanding traps but not necessarily the one which caused the trap processing to be initiated When the processor encounters an event which should lead to a trap that trap type becomes pending The processor continues to fetch and execute instructions until a valid instruction is issued to a pipe which is sensitive to a trap which is pending During this time further traps may become pending When the next committed instruction is issued to a pipe which is sensitive to any pending trap 215 un microsystems The processor ceases to issue instructions in the normal processing stream 2 Ifthe pending trap is a precise or disrupting trap the processor waits for all the system bus reads which have already started to complete The processor does not do this if a deferred trap will be taken During this waiting time more traps can become pending 3 The processor takes the highest priority pending trap If a defe
540. tr_cnt 000010 Dispatch0_other 000010 Dispatcht IC miss 000011 DC_wr 000011 IU_stat_jmp_correct_pred 000100 Re_DC_missovhd 000100 DispatchO_2nd_br 000101 Re_FPU_bypass 000101 Rstall_storeQ 000110 L3_write_hit_RTO 000110 Ratall IU use 000111 L2L3_snoop_inv_sh 000111 IU_stat_ret_correct_pred 001000 IC_L2_req 001000 IC_ref 001001 DC_rd_miss 001001 DC_rd 001010 L2_hit_I_state_sh 001010 Rstall_FP_use 001011 L3_write_miss_ RTO 001011 SW_pf_instr 001100 L2_miss 001100 L2_ref 001101 SI_owned_sh 001101 L2_write_hit_RTO 001110 SI_RTO_src_data 001110 L2_snoop_inv_sh 001111 SW_pf_duplicate 001111 L2_rd_miss 010000 IU_stat_jmp_mispred 010000 PC_rd 010001 ITLB_miss 010001 SI_snoop_sh 010010 DTLB_miss 010010 SL_ciq_flow_sh 010011 WC_miss 010011 Re_DC_miss 010100 IC_fill 010100 SW_count_NOP Performance Instrumentation and Optimization 116 un microsystems TABLE 5 18 PCR SU and PCR SL Selection Bit Field Encoding 2 of 2 Performance Instrumentation and Optimization PCR SU Value PICU Selection PCR SL Value PICL Selection 010101 IU_stat_ret_mispred 010101 IU_stat_br_miss_taken 010110 Re_L3_miss 010110 IU_stat_br_count_untaken 010111 Re_PFQ_full 010111 HW_pf_exec 011000 PC_soft_hit 011000 FA_pipe_completion 011001 PC_inv 011001 SSM_L3_wb_remote 011010 PC_hard_hit 011010 SSM_L3_miss_local 011011 IC_pf 011011 SSM_L3_miss_mtag_remote 011100 SW_count_NOP 011100 SW_pf_str_tr
541. tried and writes 64 byte data to L2 cache 233 un microsystems Error Handling TABLE 7 23 L2 cache Tag CE and UE errors 5 of 7 Ge Aes Error S Event logged in ee Pin L2 cache Tag 3 Comment AFSR error SIU forward and fill request with L2 cache tag UE Disrupting trap forwards the critical 32 byte TUE No Yes Original tag N A data will not be to D cache for 32 byte load stored in L2 cache and writes 64 byte data to L2 cache SIU forward only request with i d L2 cache tag CE THCE N N i B N A Distuptine trap o o corrects the i forwards the critical 32 byte tag Se to D cache for 32 byte load SIU forward only request with L2 cache tag UE TUE N y ENNE WA Distupting trap o es riginal ta i forwards the critical 32 byte R E ee 8 to D cache for 32 byte load PP SIU forward and fill request with L2 cache tag CE f GE L2 Pipe Disrupting trap OLWAGEN Ee ee ened THCE No No corrects the N A i to D cache for FP 64 bit load the requests e tag retried and writes 64 byte data to L2 cache SIU forward and fill request with L2 cache tag UE Disrupting trap forwards the critical 32 byte TUE No Yes Original ta N A i to D cache for FP 64 bit load S ee SE Se and writes 64 byte data to L2 PP cache SIU forward only request with L2 cache tag CE THCE x L2 ged N A Distupting trap o o corrects the i forwards the critical 32 byte tag SE 2 to D cache for FP 64 bit load S SIU f
542. tructions With Minimum Latency cee eeeesesseeeeeeeeeeeeenaee 94 5 2 4 Translation Lookaside Buffer TLB Misses cccccssceseseceesteeeeneeesteeeesees 94 5 2 5 Conditional Moves vs Conditional Branches eeceeseeseceseeeeeeeseeeeeeeseeaee 94 5 2 6 Instruction Cache Utilization 00 cee eeeeeeesesscececeeeneeseeessecsaeesaeesaeseaeeeaeesaes 95 5 2 7 Handling of CTE Couples morien a EEEE eegene ee EE 95 SR Mispredicted Branches 13 2 9 22 scence E deen ted Bessie hn ede ee 95 5 2 9 Return Address Stack RAS eececcscecesseeseneecesteeceeeeseeeessneecseeessteeeneeeenaaes 95 5 3 Data Str am EE 96 5 3 1 Data Cache Organizations nseni E E a 96 CA Data Cache Timing nesia e E E EE See 96 E GE ER EE 96 5 3 4 Store Considerations ee 97 53 5 Read After Write Hazards 97 5 4 Performance Instrumentation arszt ioii i e E A a EE 98 5 41 Supervisor User Mode sirisser ten inia aE aE a an TE E AaS 98 33 Performance Control Register DCH 98 5 6 Performance Instrumentation Counter PIC Register ssessssseessssresseessrsssrresseessersrers 99 5 6 1 PIC Counter Overflow Trap Operation cceeccecesseeeeneeeseeeeeneeeeeneeeeseeenaaes 100 5 7 Performance Instrumentation Operation cccceeesceeeseceeeneeeeeneeeseeeceseeeeeeeesseeensaeeeees 101 5 7 1 Performance Instrumentation Implementations cccceecceeesseeeesteeeeteeeeees 102 5 7 2 Performance Instrumentation ACCULA
543. ty error see D cache Error Recovery Actions on page 151 P cache Error Detection Details Use ASI_PCACHE_STATUS_DATA ASI 0x30 diagnostic access to access the P cache data parity bits stored in the P cache Status Array see P cache Data Parity Errors on page 157 for details To test hardware and software for P cache data parity error recovery programs can cause data to be loaded into the P cache by issuing prefetches then use the P cache diagnostic access to modify the parity bits Alternatively the data can be synthesized completely using diagnostic accesses Upon executing an instruction to read the modified line into a register a trap should be generated If no trap is generated the program should check the P cache using diagnostic accesses to see whether the entry has been repaired This would be a sign that the broken entry had been displaced from the P cache before it had been re accessed Iterating this test can check that each covered bit of the P cache data is actually connected to its parity generator L2 cache Errors Both the L2 cache tags and data are internal to the processor and covered by ECC L2 cache errors can be recovered by hardware measures Information on errors detected is logged in the AFSR and AFAR L2 cache Tag ECC Errors The types of L2 cache tag ECC errors are Hardware corrected HW_corrected L2 cache tag ECC errors single bit ECC errors that are corrected by hardware e Uncorrectab
544. uch a case it is recommended that both the STXA and the protecting MEMBAR Sync or FLUSH DONE or RETRY are always on the same 8 KB page thus eliminating the possibility of an intervening I MMU miss unless the code is otherwise guaranteed to not take an I MMU miss e g it was guaranteed to be locked down in the TLB The code described should be sufficient in all cases where an STXA to an internal ASI is either followed immediately by another such STXA or by one of the protecting instructions MEMBAR Sync FLUSH DONE or RETRY Note In cases where other interruptable instructions are used after an STXA and before a protecting instruction any exception handler which can be invoked would need similar protection Such coding style is strongly discouraged and should only be done with great care when there are compelling performance reasons e g in TLB miss handlers 10 2 ASI Assignments Please refer to the UltraSPARC III Cu Processor User s Manual for general information regarding the ASI assignments The sections below discuss the UltraSPARC IV processor ASI assignments Address Space Identifiers 280 un microsystems 10 2 1 UltraSPARC IV Processor ASI Assignments TABLE 10 2 defines all ASIs supported by the UltraSPARC IV processor that are not defined by either The SPARC Architecture Manual Version 9 or are new These can be used only with LDXA STXA or LDDFA STDFA in
545. uction_access_error and one data_access_error and so on Multiple errors occurring in separate correction words of a single transaction an L2 cache read or L3 cache read or a system bus read do not set the AFSR1 ME bit AFSR2 ME is never set 199 amp Sun microsystems 7 7 Further Details on Detected Errors This section includes a more extensive description for detailed diagnosis A simplified block diagram of the on chip caches and the external interfaces is provided here to illustrate the main data paths and the terminologies used below for logging the different kind of errors TABLE 7 16 is the main reference for all aspects of each individual error The descriptive paragraphs in the following sections are meant to clarify the key concepts D TLB HL TLB D TLB LP 0 8 parityy t8 parity t8 parity t8 parity B cache 2 ceche fFcache Fcache D cache deparityy UEBST t Iparity t Iparity t Iparity P cache par d parity d parity use d parity Vd part ncorr dat W cache W cache ES E data ECC tag ECC data ecc L2 cache Data Tag data ecc data data ecc GP data ecc data ecc data ecc addr parity data ECC microtag ECC Sun Fireplane Interconnect data ecc uncorr data ECC vc corr data ONLY Uncorr data error reported FIGURE 7 1 The UltraSPARC IV Processor RAS Diagram Error Handling 200
value on the UltraSPARC IV processor and FSR.NS = 1, the subnormal value is replaced by a floating-point zero value of the same sign. In the UltraSPARC IV processor this replacement is performed in hardware for all cases. See Subnormal Handling Override on page 140 for details.

Note: Earlier processors in the UltraSPARC III processor family performed this replacement in hardware for some cases, while for other cases the processor generates an fp_exception_other exception with FSR.FTT = 2 (unfinished_FPop), expecting that the trap handler will perform the replacement.

9.1.2 FSR_floating_point_trap_type (FTT)

The UltraSPARC IV processor triggers fp_exception_other with trap type unfinished_FPop under the conditions described in Response to Subnormal Operands on page 138.

9.2 PSTATE Register

The UltraSPARC IV processor supports two additional sets (privileged only) of eight 64-bit global registers: the interrupt globals and the MMU globals. These additional registers are called the trap globals. Two 1-bit fields, PSTATE.IG and PSTATE.MG, have been added to the PSTATE register to select which set of global registers to use.

While PSTATE.AM = 1:
• The UltraSPARC IV processor writes the full 64-bit program counter (PC) value to the destination register of a CALL, JMPL, or RDPC instruction.
• When a trap occurs, the UltraSPARC IV processor writes the truncated (i.e., high-order 32 bits set to 0) PC and nPC
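A small user-level illustration of the flush-to-zero behavior controlled by FSR.NS is sketched below. It is an assumption-level example, not code from this manual: the inline assembly follows the common SPARC V9 LDXFSR/STXFSR idiom and assumes NS is bit 22 of the FSR.

    #include <float.h>
    #include <stdio.h>

    /* Set FSR.NS (non-standard mode).  Assumes the SPARC V9 FSR layout
     * with NS at bit 22; hedged illustration only. */
    static void enable_flush_to_zero(void)
    {
        unsigned long fsr;
        __asm__ __volatile__("stx %%fsr, %0" : "=m" (fsr));
        fsr |= (1UL << 22);
        __asm__ __volatile__("ldx %0, %%fsr" : : "m" (fsr));
    }

    int main(void)
    {
        volatile double sub = DBL_MIN / 4.0;      /* a subnormal operand */

        printf("standard:     %g\n", sub * 2.0);  /* tiny, non-zero result */
        enable_flush_to_zero();
        printf("non-standard: %g\n", sub * 2.0);  /* operand flushed to zero: prints 0 */
        return 0;
    }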
fault, it would not be detected in normal operation but would slow down the processor. Depending on the fault, it might be detected in power-on self-test code.

D-cache snooping and invalidation continue to operate even if DCUCR.DC is 0. In this case, if DCR.DPE is 1, the fix-up with a silent snoop will still apply in the case of a parity error.

D-cache Data Errors

D-cache data parity errors can only be the result of a load-like instruction accessing cacheable data. D-cache data is not checked as part of a snoop invalidation.

7.2.2.4 Error Handling

A D-cache data parity error can only be reported for instructions which are actually executed, not for those speculatively fetched and later discarded. The error results in a dcache_parity_error precise trap.

If a load has a microtag hit for a particular way in the D-cache and there is a D-cache data array parity error in that way, then, irrespective of the valid bit, a dcache_parity_error trap will be generated. D-cache data array errors in any other way are ignored.

D-cache Error Recovery Actions

A D-cache physical tag or data parity error results in a dcache_parity_error precise trap. The processor logs nothing about the error in hardware for this event. To log the index of the D-cache which produced the error, the trap routine can disassemble the instruction pointed to by TPC. The trap routine may attempt to discover the way in error and whether t
See below for the I-TLB replacement algorithm:

    if (TTE to fill is a locked page (L bit is set))
        fill TTE to T16
    else if (TTE's Size == PgSz)
        if (one of the 2 same-index entries is invalid)
            fill TTE to that invalid entry
        else    // no entry is invalid; both entries are valid
            case (LFSR[0])
                0: fill TTE to T512 way0
                1: fill TTE to T512 way1
    else
        fill TTE to T16

13.1.3.1 Direct I-TLB Data Read/Write

As described in the UltraSPARC III Cu Processor User's Manual, each I-TLB can be directly written using the store ASI_ITLB_DATA_ACCESS_REG instruction. Software typically uses this method for initialization and diagnostics. Page size information is determined by bits 48 and 62:61 of the TTE data (the store data of STXA ASI_ITLB_DATA_ACCESS_REG). Directly accessing the I-TLB by properly selecting the TLB ID and TLB entry fields of the ASI virtual address is explained in D-TLB Tag Access Extension Registers on page 331.

It is not required to write the Tag Access Extension Register (ASI_IMMU_TAG_ACCESS_EXT) with page size information, since ASI_ITLB_DATA_ACCESS_REG gets the page size from the TTE data. It is recommended, however, that software write the proper page size information to ASI_IMMU_TAG_ACCESS_EXT before writing to ASI_ITLB_DATA_ACCESS_REG.
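The direct-write sequence just described could look roughly like the C sketch below. It is an assumption-level illustration: stxa() and itlb_access_va() are hypothetical helpers (the VA packing of the TLB ID and TLB entry fields, and the numeric ASI values, must be taken from the ASI tables of this manual), and real initialization code would normally also have set up the I-MMU Tag Access Register with the tag being installed.

    #include <stdint.h>

    extern void stxa(uint64_t va, int asi, uint64_t data);   /* assumed STXA wrapper  */
    extern uint64_t itlb_access_va(int tlb_id, int entry);   /* assumed VA builder    */
    extern const int ASI_IMMU_TAG_ACCESS_EXT;                /* values per ASI tables */
    extern const int ASI_ITLB_DATA_ACCESS_REG;

    static void itlb_direct_write(int tlb_id, int entry,
                                  uint64_t tte_data, uint64_t pgsz_ext)
    {
        /* Recommended (though not required): write the page size to the
         * Tag Access Extension Register first. */
        stxa(0, ASI_IMMU_TAG_ACCESS_EXT, pgsz_ext);

        /* The TTE data carries the page size in bits 48 and 62:61; this
         * store performs the direct write of the selected TLB entry. */
        stxa(itlb_access_va(tlb_id, entry), ASI_ITLB_DATA_ACCESS_REG, tte_data);
    }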
549. unexecuted Errors which lead to disrupting traps are the following e HW_corrected ECC errors in the L2 cache data AFSR EDC CPC WDC and in the L2 cache tag AFSR THCE e HW_corrected ECC errors in L3 cache data AFSR_EXT L3_EDC L3_CPC L3_WDC and in L3 cache tag AFSR_EXT L3_THCE correctable MTag error AFSR EMC correctable interrupt vector AFSR IVC correctable Mag error in interrupt vector AFSR IMC e Uncorrectable L2 cache data errors as the result of store operation writeback and copyout operations AFSR EDU WDU CPU e Uncorrectable L2 cache tag errors AFSR TUE TUE_SH Exceptions Traps and Trap Types 257 amp Sun microsystems 8 1 4 e Uncorrectable L3 cache errors as the result of prefetch store writeback and copyout operations AFSR_EXT L3_EDU L3_WDU L3_CPU e Uncorrectable interrupt vector prefetch queue and store queue read system bus errors AFSR IVU AFSR DUE AFSR DTO AFSR DBERR e Uncorrectable MTag error in interrupt vector AFSR IMU The disrupting trap handler should log the error No special operations such as cache flushing are required for correctness after a disrupting trap However for many errors it is appropriate to correct the data which produced the original error so that a later access to the same data does not produce the same trap again For uncorrectable errors it is the responsibility of the software to determine the recovery mechanism with the minimum system i
550. unmapped or DSTAT 2 or 3 could be returned All of these are handled in the same way as a system bus read to own operation triggered by a store queue entry If a single bit data ECC error or single bit microtag ECC error is returned from the system bus for a prefetch queue operation the event is logged in AFSR CE or AFSR EMC and AFAR Hardware will correct the error and install the prefetched data in the P cache or L2 cache or both Uncorrectable system bus data ECC as the result of a prefetch queue operation will set AFSR DUE and generate an ECC_error disrupting trap If the prefetch queue operation causes an RTO or an RTSR in an SSM system the unmodified uncorrectable error will be installed in the L2 cache Otherwise the L2 cache line remains invalid Corrupt data is never stored in P cache Uncorrectable system bus microtag ECC as the result of an operation from the prefetch queue will set AFSR EMU and cause the processor to assert its ERROR output pin and take a data_access_error trap If the prefetch instruction causes an RTO or an RTSR in an SSM system the unmodified uncorrectable error will be installed in the L2 cache Otherwise the L2 cache line remains invalid 175 amp Sun microsystems Teall ek Error Handling If a bus error or unmapped error is returned from the system bus for a prefetch queue operation the processor sets AFSR DBERR or DTO and takes a disrupting data_access_error The behavior on errors
Running Logical Processors 17
2.5 Reset Handling 20
2.5.1 Private Resets (SIR and WDR Resets) 20
2.5.2 Full CMT Resets (System Reset) 20
2.5.3 Partial CMT Resets (XIR Reset) 20
2.6 Private and Shared Registers Summary 21
2.6.1 Implementation Registers 22
3 Caches, Cache Coherency, and Diagnostics 23
3.1 Cache Organization 23
3.1.1 Cache Overview 23
3.1.2 Virtually Indexed, Physically Tagged (VIPT) Caches 24
3.1.3 Physically Indexed, Physically Tagged (PIPT) Caches 25
3.1.4 Virtually Indexed, Virtually Tagged (VIVT) Caches 26
3.2 Cache Flushing 28
3.2.1 Address Aliasing Flushing 28
3.2.2 Committing Block Store Flushing 29
3.2.3 Displacement Flushing 29
3.2.4 Prefetch Cache Flushing 29
3.3 29
3.3.1 Proces
552. usive clean data with no outstanding shared copy S shared clean data with outstanding shared copy s I invalid invalid data MOOSESI RTO An SSM mode cache coherence protocol M modified dirty data with no outstanding shared copy O owned dirty data with potentially outstanding shared copy s in the local SSM domain Os owned dirty data with potentially outstanding shared copy s in other SSM domains E exclusive clean data with no outstanding shared copy S shared clean data with outstanding shared copy s I invalid invalid data Read To Own Transaction RTS Read To Share Transaction RTOR RTSM Caches Cache Coherency and Diagnostics Read To Own Remote Read To Share MTag 30 un microsystems TABLE 3 3 Hit Miss State Change and Transaction Generated for Processor Action 1 of 2 Processor action Combined State MODE SE Block Store Write Gs WE Commit Prefetch miss miss SSM miss RTO RTS RS WS WS miss miss SSM miss RTO LPA RTS RS R_WS R_WS I SSM amp MTag miss MTag miss MTag miss LPA amp g R_RTO retry R_RS invalid SSM amp LPA miss R_RTO SSM hit SSM amp LPA hit hit E SSM amp LPA amp i o retry invalid SSM amp LPA hit hit miss miss miss SSM hit hit hit RTO WS WS MTag miss miss miss SSM amp ve E hit hit LPA RTO R_WS R_WS S SSM amp MTag miss LPA amp e 4
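The coherence-state definitions above can be summarized, purely for illustration, as C enumerations. The names and comments restate the definitions given in the text; the numeric encodings are arbitrary and do not correspond to the hardware tag-state encoding.

    /* MOESI (non-SSM) protocol states, as described above. */
    typedef enum {
        MOESI_M,  /* modified:  dirty data, no outstanding shared copy            */
        MOESI_O,  /* owned:     dirty data, potentially outstanding shared copies */
        MOESI_E,  /* exclusive: clean data, no outstanding shared copy            */
        MOESI_S,  /* shared:    clean data, outstanding shared copy(s)            */
        MOESI_I   /* invalid:   invalid data                                      */
    } moesi_state_t;

    /* MOOSESI (SSM-mode) protocol states, as described above. */
    typedef enum {
        MOOSESI_M,   /* modified:  dirty, no outstanding shared copy                     */
        MOOSESI_O,   /* owned:     dirty, shared copies possible in the local SSM domain */
        MOOSESI_Os,  /* owned:     dirty, shared copies possible in other SSM domains    */
        MOOSESI_E,   /* exclusive: clean, no outstanding shared copy                     */
        MOOSESI_S,   /* shared:    clean, outstanding shared copy(s)                     */
        MOOSESI_I    /* invalid:   invalid data                                          */
    } moosesi_state_t;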
553. uted Errors associated with instructions which are fetched but later discarded without being executed are ignored Such an event will leave no trace but may cause an out of sync event in a lockstep system I cache Snoop Tag Errors Snoop cycles from the system bus and from store operations executed in this processor may need to invalidate entries in the I cache To discover if the referenced line is present in the I cache the physical address from the snoop access is compared in parallel with the snoop tags for each of the ways of the I cache In the event that any of the addressed valid snoop tags contains a parity error the processor will safely report snoop hits There is just one valid bit for each line used in both the I cache physical tags and snoop tags Clearing this bit will make the next instruction fetch to this line miss in the I cache and refill the physical and snoop tags and the I cache entry Hardware in the I cache snoop logic does not differentiate between conventional invalidations where the snoop tag matches and error invalidations where there is a snoop tag parity error hence the hardware cannot tell if the snoop tag matched or not This is done to ensure that the ordering and timing requirements necessary for the coherence policy are met This operation of automatically clearing the valid bits on an error is neither logged in the AFSR nor reported as a trap Therefore it is possible to cause undiagnosable out of s
554. valid Result is infinity e Subnormal Infinity Infinity no exception generated e Standard mode No unfinished_FPop e Non standard mode No FSR NX e Standard mode Subnormal x Infinity Infinity IEEE 754 1985 Standard 140 amp Sun microsystems Non standard mode Subnormal x Infinity QNaN with nv exception Subnormal is flushed to zero Result is zero e Subnormal x 0 0 no exception generated e Standard mode No unfinished_FPop e Non standard mode No FSR NX IEEE 754 1985 Standard 141 un microsystems IEEE 754 1985 Standard 142 amp Sun microsystems Error Handling This chapter describes the behavior of the UltraSPARC IV processor as viewed by a programmer writing operating systems software service processor diagnosis or recovery code for the UltraSPARC IV processor Errors are checked in data arriving at or passing through the processor from the L2 cache the L3 cache the system bus and in the MTags arriving from the system bus In addition certain cache arrays protocols and internal logic are also checked for errors Error information is logged in the Asynchronous Fault Address Register AFAR and Asynchronous Fault Status Register AFSR Errors are logged even if their corresponding traps are disabled First error information is captured in the secondary fault registers Chapter Topics e Error Classes on page 143 e Memory Errors on page 144 e Error Registers on page 177 e Error R
555. ventually evicted a single write transaction is made to L2 without need for any associated background read and merge operations Storing entire lines in the write cache unmodified in addition to modified bytes is less conservative of write cache space but for an on chip L2 the associated increase in write traffic is a minor penalty compared to the benefits of greatly simplified write cache operation The write cache in the UltraSPARC IV processor is fully set associative and uses a FIFO allocation policy This makes the cache much more robust for applications that have multiple store streams and enables the write cache to better exploit temporal locality in the store stream Data Prefetching Because all members of the UltraSPARC III IV processor family block access to the Ll cache on a miss data prefetching is an important feature for exploiting memory level parallelism In the UltraSPARC IV processor the prefetch mechanisms have been improved in a number of ways The most important optimizations are focused on making the behavior of software prefetches more predictable thus making it easier for the compiler or a programmer to use these instructions Prefetching efficiency also has been improved by optimizing a number of steps in the prefetch process thereby reducing the latency of prefetch operations To make prefetch behavior more predictable the UltraSPARC IV processor supports a new class of prefetch operation strong prefetches St
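As a generic illustration of software prefetching of the kind discussed above, the loop below uses the GCC __builtin_prefetch() builtin, which compiles to a SPARC PREFETCH instruction. Whether the toolchain emits a weak or a strong prefetch variant, and the best prefetch distance, are assumptions of this sketch, not something specified by this manual.

    #include <stddef.h>

    static long sum_with_prefetch(const long *a, size_t n)
    {
        long s = 0;
        const size_t ahead = 16;    /* assumed prefetch distance (a couple of cache lines) */

        for (size_t i = 0; i < n; i++) {
            if (i + ahead < n)
                __builtin_prefetch(&a[i + ahead], 0 /* read */, 3 /* high locality */);
            s += a[i];
        }
        return s;
    }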
556. ween logical processors when sending interrupts This private register is used by software to assign a 10 bit interrupt ID to a logical processor that is unique within the system This is important to enable logical processors to receive interrupts The ID in this register is used by other logical processors and other bus agents to address interrupts to this specific logical processor It is also used by this logical processor to identify the source of interrupts it issues to other logical processors and bus agents It is expected to be changed only at boot or reconfiguration time Name AST_INTR_ID ASI 0x63 VA 63 0 0x00 Read Write Privileged Access Note The UltraSPARC IV processor sets the Sun Fireplane Interconnect MID 9 5 to SID_U and MID 4 0 to SID_L The source of MID 9 0 is the AST_INTR_ID 9 0 of the logical processor issuing the interrupt TABLE 2 2 LP Interrupt ID Register Fields Bits Field Description 63 10 Reserved Reserved for future implementation The Int_ID is used as the source or target logical processor identities in a Sun Fireplane Interconnect interrupt transaction In a Sun Fireplane Interconnect interrupt transaction the source logical processor identity is placed in the Sun Fireplane Interconnect Address bus bits 38 29 and the target logical processor identity is placed in Address bus bits 23 14 9 0 Int_ID Chip Multithreading CMT 13 am
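A sketch of privileged-mode accessors for the LP Interrupt ID register follows. ldxa() and stxa() are assumed wrapper routines (e.g., thin inline-assembly helpers); the ASI number (0x63), the VA (0x00), and the 10-bit Int_ID field come from the register description above.

    #include <stdint.h>

    extern uint64_t ldxa(uint64_t va, int asi);          /* assumed LDXA wrapper */
    extern void     stxa(uint64_t va, int asi, uint64_t data);

    #define ASI_INTR_ID    0x63
    #define INTR_ID_VA     0x00
    #define INTR_ID_MASK   0x3ffUL        /* 10-bit Int_ID field (bits 9:0) */

    static unsigned int get_lp_interrupt_id(void)
    {
        return (unsigned int)(ldxa(INTR_ID_VA, ASI_INTR_ID) & INTR_ID_MASK);
    }

    /* Expected to be changed only at boot or reconfiguration time. */
    static void set_lp_interrupt_id(unsigned int id)
    {
        stxa(INTR_ID_VA, ASI_INTR_ID, id & INTR_ID_MASK);
    }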
557. y 0 2 b01 Way 2 b10 Way 2 2 b11 Way3 13 5 DC_addr A 9 bit index that selects a snoop tag field 512 tags 4 0 Mandatory value Should be 0 The data format for D cache snoop tag access is shown in TABLE 3 39 TABLE 3 39 Data Cache Snoop Tag Access Data Format Bits Field Description 63 31 Mandatory value Should be 0 30 DC_snoop_tag_parity Odd parity bit of DC_snoop_tag The 29 bit physical snoop tag PA 41 13 of the associated data PA 42 Pere DC_snoop_tag is always 0 for cacheable 0 Mandatory value Should be 0 Note A MEMBAR Sync is required before and after a load or store to ASI_DCACHE_SNOOP_TAG During ASI writes DC_snoop_tag_parity is not generated from data bits 29 1 but data bit 30 is written as the parity bit This will allow testing of D cache snoop tag parity error trap Data Cache Invalidate ASI 4216 per logical processor Kal Name ASI_DCACHE_INVALIDATE A store that uses the Data Cache Invalidate ASI invalidates a D cache line that matches the supplied physical address from the data cache A load to this ASI returns an undefined value and does not invalidate a D cache line Caches Cache Coherency and Diagnostics 51 un microsystems The address format for D cache invalidate is shown in TABLE 3 40 TABLE 3 40 Data Cache Invalidate Address Format Bits Field Description 63 41 Mandatory value Should be 0
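The MEMBAR #Sync requirement around ASI_DCACHE_SNOOP_TAG accesses can be sketched as below. This is an assumption-level example: ldxa(), membar_sync(), and dcache_snoop_tag_va() are hypothetical helpers, and the numeric ASI value and the exact packing of the way-select and DC_addr fields must be taken from the address-format table above.

    #include <stdint.h>

    extern uint64_t ldxa(uint64_t va, int asi);              /* assumed LDXA wrapper      */
    extern void     membar_sync(void);                       /* assumed MEMBAR #Sync      */
    extern int      ASI_DCACHE_SNOOP_TAG;                    /* value per the ASI tables  */
    extern uint64_t dcache_snoop_tag_va(int way, int index); /* assumed address builder   */

    static uint64_t read_dcache_snoop_tag(int way, int index)
    {
        uint64_t tag;

        membar_sync();                               /* required before the access */
        tag = ldxa(dcache_snoop_tag_va(way, index), ASI_DCACHE_SNOOP_TAG);
        membar_sync();                               /* ...and after it            */

        /* Per TABLE 3-39, bit 30 holds DC_snoop_tag_parity and the 29-bit
         * snoop tag (PA[41:13]) sits in the bits below it. */
        return tag;
    }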
y Error 264
9 Registers 267
9.1 Floating-Point State Register (FSR) 267
9.1.1 FSR_nonstandard_fp (NS) 267
9.1.2 FSR_floating_point_trap_type (FTT) 268
9.2 PSTATE Register 268
9.3 Ancillary State Registers (ASRs) 268
9.3.1 Performance Control Register (PCR) (ASR 16) 268
9.3.2 Performance Instrumentation Counter (PIC) Register (ASR 17) 269
9.3.3 Dispatch Control Register (DCR) (ASR 18) 269
9.4 Registers Referenced Through ASIs 273
9.4.1 Data Cache Unit Control Register (DCUCR) 273
9.4.2 Data Watchpoint Registers 275
9.5 ScratchPad Registers (ASI_SCRATCHPAD) 276
10 Address Space Identifiers 277
10.1 TSB ASI 277
10.1.1 Fast 278
10.1.2 Rules for Accessing Internal ASIs 278
10.1.3 Limitation of Accessing Internal ASIs 279
10.2 ASI Assignments
559. y both LPs Name ASI_L2CACHE _ TAG The L2 cache Tag Access Address format is shown in TABLE 3 59 TABLE 3 59 L2 cache Tag Access Address Format Bits Field Description 63 24 Reserved Reserved for future implementation specifies the type of access 1 b1 displacement flush write back the selected L2 cache line both clean and modified to L3 cache and invalidate the line in L2 cache When this bit is set the ASI data portion is unused 1 b0 direct L2 cache tag diagnostic access read write the tag state ECC and LRU fields 23 disp_flush of the selected L2 cache line 8 In L2 cache on mode if the line to be displacement flushed is in NA state see Notes on NA Cache State on page 64 it will not be written out to L3 cache The line remains in NA state in L2 cache In L2 cache off mode displacement flush is treated as a NOP Note For L2 cache displacement flush use only LDXA STXA has NOP behavior Since L2 cache will return garbage data to the MS pipeline it is recommended to use Idxa reg_address ASI_L2CACHE_TAG g0 instruction format On read this bit is don t care On write if set it enables hardware ECC generation logic 22 EE inside L2 cache tag SRAM When this bit is not set the ECC generated by hardware ECC eee generation logic is ignored what is specified in the ECC field of the ASI data portion will be written into the ECC entry 21 Mandatory value Should be 0 20 19 A
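An L2-cache displacement flush of a single line, using the recommended "ldxa [reg_address] ASI_L2CACHE_TAG, %g0" form in which the meaningless load result is discarded, could be sketched as below. This is an illustration under stated assumptions: ldxa() is a hypothetical wrapper, l2_tag_va() is an assumed helper that builds the index/way portion of reg_address, and the numeric ASI value must be taken from the ASI tables; only the disp_flush bit position (bit 23) comes from the table above.

    #include <stdint.h>

    extern uint64_t ldxa(uint64_t va, int asi);     /* assumed LDXA wrapper        */
    extern int      ASI_L2CACHE_TAG;                /* value per the ASI tables    */
    extern uint64_t l2_tag_va(int index, int way);  /* assumed address builder     */

    #define L2_DISP_FLUSH   (1ULL << 23)            /* disp_flush bit (bit 23)     */

    static void l2_displacement_flush(int index, int way)
    {
        /* With disp_flush set, the LDXA writes the selected line back to the
         * L3 cache (unless it is in NA state) and invalidates it in the L2
         * cache; the returned data is discarded. */
        (void)ldxa(l2_tag_va(index, way) | L2_DISP_FLUSH, ASI_L2CACHE_TAG);
    }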
used. Bit 1 tracks which way of the 2-way set {way3, way2} is the least recently used. Bit 0 tracks which way of the other 2-way set {way1, way0} is the least recently used.

The LRU bits are reset to 3'b000 and are updated based on the following scheme:
• bit 2 = 1 if the hit is in way1 or way0 (0 if the hit is in way3 or way2)
• bit 1 = 1 if the hit is in way2 (0 if the hit is in way3); it remains unchanged when the hit is not in way3 or way2
• bit 0 = 1 if the hit is in way0 (0 if the hit is in way1); it remains unchanged when the hit is not in way1 or way0

An example of LRU decoding: LRU = 3'b000 means that way1 and way0 are less recently used than way3 and way2, way2 is less recently used than way3, and way0 is less recently used than way1.

This algorithm is like a binary tree, which makes it suitable for the split cache mode as well. LRU bits are updated during fills or hits; no updates can happen during the replacement calculation. LRU bits are not ECC protected.

Notes on NA Cache State

The NA (Not Available) cache state is introduced to enhance the RAS features and testability of the L2 and L3 caches. When a cache line is in NA state, that way is excluded from the replacement algorithm. It can be used in the following scenarios:

1. During run time, the operating system can selectively disable a specific index/way in the L2 and L3 cache tag SRAMs, based on soft error reporting in the L2 and L3 cache data SRAMs.
2. During run time, the operating system can make a 4-way set-associative L2 and/or L3 cache behave like a direct-mapped, 2-way, or 3-way set-associative cache by programming
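Read one way, the update and decode rules above correspond to the small C sketch below. It is an illustrative model only; the actual hardware implementation may differ in detail.

    /* Pseudo binary-tree LRU for a 4-way set: lru is a 3-bit value,
     * reset to 3'b000 on initialization. */
    static unsigned lru_update(unsigned lru, int hit_way)
    {
        switch (hit_way) {
        case 0: lru |= 0x4; lru |= 0x1;   break;  /* bit2=1 (lower pair used), bit0=1 (way0) */
        case 1: lru |= 0x4; lru &= ~0x1u; break;  /* bit2=1, bit0=0 (way1)                   */
        case 2: lru &= ~0x4u; lru |= 0x2; break;  /* bit2=0 (upper pair used), bit1=1 (way2) */
        case 3: lru &= ~0x4u; lru &= ~0x2u; break;/* bit2=0, bit1=0 (way3)                   */
        }
        return lru & 0x7;
    }

    static int lru_victim(unsigned lru)
    {
        if (lru & 0x4)                     /* way1/way0 pair used more recently...       */
            return (lru & 0x2) ? 3 : 2;    /* ...so replace within the way3/way2 pair    */
        else
            return (lru & 0x1) ? 1 : 0;    /* otherwise replace within the way1/way0 pair */
    }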
way1, T512_1 way0, T512_1 way1
    else
        case (LFSR[1:0])
            00: fill TTE to T512_0 way0
            01: fill TTE to T512_0 way1
            10: fill TTE to T512_1 way0
            11: fill TTE to T512_1 way1
else if (TTE's Size == PgSz0)
    if (one of the 2 same-index entries is invalid)
        fill TTE to an invalid entry, with selection order of T512_0 way0, T512_0 way1
    else
        case (LFSR[0])
            0: fill TTE to T512_0 way0
            1: fill TTE to T512_0 way1
else if (TTE's Size == PgSz1)
    if (one of the 2 same-index entries is invalid)
        fill TTE to an invalid entry, with selection order of T512_1 way0, T512_1 way1
    else
        case (LFSR[0])
            0: fill TTE to T512_1 way0
            1: fill TTE to T512_1 way1
else

13.2.2.5 Direct D-TLB Data Read/Write

As described in the UltraSPARC III Cu Processor User's Manual, each D-TLB can be directly written using the store ASI_DTLB_DATA_ACCESS_REG (ASI 0x5D) instruction. Software typically uses this method for initialization and diagnostics. Page size information is determined by bits 48 and 62:61 of the TTE data (the store data of STXA ASI_DTLB_DATA_ACCESS_REG). Direct access to the D-TLB is achieved by properly selecting the TLB ID and TLB entry fields of the ASI virtual address, as explained in D-TLB Tag Access Extension Registers on page 331.

13.2.2.6 13.2.2.7 It is not r
562. ync events in lockstep systems In non lockstep systems if a snoop tag array memory bit became stuck because of a fault it would not be detected in normal operation but would slow down the processor Depending on the fault it might be detected in power on self test code I cache snooping and invalidation continues to operate even if DCUCR IC is 0 In this case when DCR IPE is 1 and a parity error occurs the silent snoop fix up will apply I cache Data Errors I cache data parity errors can only be the result of an instruction fetch I cache data is not checked as part of a snoop invalidation operation An I cache data parity error is determined based on per instruction fetch group granularity Any instruction no matter whether it is executed or not in the fetch group which has parity error will cause an icache_parity_error trap on that fetch group When the trap is taken TPC will point to the beginning instruction of the fetch group If an instruction fetch hits in one way of the I cache an icache_parity_error trap will be generated for data errors only in the way of the I cache which hits at this address Meanwhile data parity errors in the other ways will be ignored 1 A fetch group is the collection of instructions which appear on the I cache outputs at the end of the F Stage A fetch group is four or fewer consecutive instructions contained within a 32 byte I cache line 145 un microsystems 7 2 1 4 Error Handling If an
563. ystem bus error detection hardware and software To check L3 cache error detection test programs should use the L3 cache diagnostic access operations UCEEN If set a SW_correctable or uncorrectable L2 cache ECC error or uncorrectable L3 cache ECC error will generate a precise fast_ECC_error trap This event can only occur on reads of the L2 cache or L3 cache by this processor for I fetches data loads and atomic operations and not on merge writeback and copyout operations This bit enables the traps associated with the AFSR UCC and UCU and AFSR_EXT L3_UCC and L3_UCU NCEEN If set An uncorrectable system bus data or microtag ECC error system bus error with unmapped or DSTAT 2 or 3 response as the result of an instruction fetch causes a deferred instruction_access_error trap and as the result of a load like atomic or block load instruction causes a deferred data_access_error trap An uncorrectable L2 cache L3 cache data error as the result of a store queue exclusive request a prefetch queue operation or for a writeback or copyout will generate a disrupting ECC_error trap An uncorrectable L2 cache L3 cache tag error will cause a disrupting ECC_error trap An uncorrectable system bus data ECC error as the result of an interrupt vector fetch will generate a disrupting ECC_error trap An uncorrectable system bus microtag ECC error or system bus DSTAT 2 or 3 will cause a deferred data_access_error trap If NCEEN is clear the