Intel Computer Hardware 80200 User's Manual
Contents
1. Software Debug. Figure 13-2. SELDCSR Data Register. 13.11.2.1 DBG.HLD_RST. The debugger uses DBG.HLD_RST when loading code into the instruction cache during a processor reset. Details about loading code into the instruction cache are in Section 13.14, Downloading Code in the ICache. The debugger must set DBG.HLD_RST before or during assertion of the reset pin. Once DBG.HLD_RST is set, the reset pin can be de-asserted, and the processor internally remains in reset. The debugger can then load debug handler code into the instruction cache before the processor begins executing any code. Once the code download is complete, the debugger must clear DBG.HLD_RST. This takes the processor out of reset, and execution begins at the reset vector. A debugger sets DBG.HLD_RST in one of two ways. Either by taking the JTAG state machine into the Capture-DR state, which automatically loads DBG_SR[1] with 1, then the Exit2 state, followed by the Update-DR state. This sets DBG.HLD_RST, clears DBG.BRK, and leaves the DCSR unchanged (the DCSR bits captured in DBG_SR[34:3] are written back to the DCSR on the Update-DR). Alternatively, a 1 can be scanned into DBG_SR[1] with the
2. Message Name | Message Byte Type | Message Byte Format | Number of Address Bytes
Exception | exception | 0b0VVV CCCC | 0
Direct Branch (a) | non-exception | 0b1000 CCCC | 0
Checkpointed Direct Branch (a) | non-exception | 0b1100 CCCC | 0
Indirect Branch (b) | non-exception | 0b1001 CCCC | 4
Checkpointed Indirect Branch (b) | non-exception | 0b1101 CCCC | 4
Roll-over | non-exception | 0b1111 1111 | 0
a. Direct branches include ARM and THUMB bl, b.
b. Indirect branches include ARM ldm, ldr, and dproc to PC; ARM and THUMB bx, blx(1), and blx(2); and THUMB pop.
Developer's Manual, March 2003. Intel 80200 Processor based on Intel XScale Microarchitecture: Software Debug.
13.13.1.1 Exception Message Byte. When any kind of exception occurs, an exception message is placed in the trace buffer. In an exception message byte, the message type bit (M) is always 0. The vector exception (VVV) field is used to specify bits [4:2] of the vector address (offset from the base of the default or relocated vector table). The vector allows the host SW to identify which exception occurred. The incremental word count (CCCC) is the instruction count since the last control flow change (not including the current instruction for undef, SWI, and pre-fetch abort). The instruction count includes instructions that were executed and conditional instructions that were not executed due to the condition of the instruction not matching the CC flags. A count value of 0 indicat
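The message-byte table above can be turned into a small decoder. The following is a minimal sketch in Python, assuming only the bit patterns listed in the table; the function name, the returned field names, and the "unknown" fallback are illustrative and not part of the manual.

```python
# Sketch: classify one trace-buffer message byte using the bit patterns
# from the message table. Names here are hypothetical, chosen for clarity.

def decode_message_byte(byte):
    """Return (message_name, address_bytes, fields) for an 8-bit message byte."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("message byte must fit in 8 bits")
    if byte == 0xFF:                       # 0b1111 1111: roll-over message
        return ("roll-over", 0, {})
    count = byte & 0x0F                    # CCCC: incremental word count
    if byte & 0x80 == 0:                   # M = 0: exception message
        vvv = (byte >> 4) & 0x7            # VVV: bits [4:2] of the vector offset
        return ("exception", 0, {"vector_offset": vvv << 2, "count": count})
    kinds = {
        0b1000: ("direct branch", 0),
        0b1100: ("checkpointed direct branch", 0),
        0b1001: ("indirect branch", 4),
        0b1101: ("checkpointed indirect branch", 4),
    }
    # Any other upper nibble is not listed in the table (assumed fallback).
    name, address_bytes = kinds.get(byte >> 4, ("unknown", 0))
    return (name, address_bytes, {"count": count})
```

For example, 0x93 decodes as an indirect branch with a count of 3, followed by four address bytes in the buffer.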
3. 4.2.1 Operation When Instruction Cache is Enabled; 4.2.2 Operation When The Instruction Cache Is Disabled; 4.2.3 ...; 4.2.4 Round-Robin Replacement Algorithm; 4.2.5 Parity Protection; 4.2.6 Instruction Fetch Latency; 4.2.7 Instruction Cache Coherency; 4.3 Instruction Cache Control; 4.3.1 Instruction Cache State at RESET; 4.3.2 Enabling/Disabling; 4.3.3 Invalidating the Instruction Cache; 4.3.4 Locking Instructions in the Instruction Cache; 4.3.5 Unlocking Instructions in the Instruction Cache; 5 Branch Target Buffer; 5.1 Branch Target Buffer (BTB) Operation; 5.1.1 Reset; 5.1.2 Update Policy; 5.2 BTB Control; 5.2.1 Disabling/Enabling; 5.2.2 ...; 6 Data Cache; 6.1 ...; 6.1.1 Data Cache Overview
4. 10.2 Signal Description. Table 10-1. Intel 80200 Processor based on Intel XScale Microarchitecture Bus Signals. Columns: Signal, Width, I/O, Function.
MCLK (width 1): bus clock (note: all bus activity is triggered by the rising edge of this clock).
Request Bus
ADS/LEN[2]: During the first cycle of the issue phase, this signal indicates the start of a bus request. During the second cycle of the issue phase, this signal is the MSB of a value which indicates the length of the transaction.
Lock/LEN[1]: During the first cycle of the issue phase, this signal indicates whether the current transaction is part of an atomic read-write pair. During the second cycle of the issue phase, this signal is the middle bit of a value which indicates the length of the transaction.
W/R#/LEN[0]: During the first cycle of the issue phase, this signal indicates whether the current transaction is a write (W/R# = 1) or a read (W/R# = 0). During the second cycle of the issue phase, this signal is the LSB of a value which indicates the length of the transaction.
(width 16): During the first cycle of the issue phase, this signal carries the upper 16 bits of t
5. A.3 Architecture ...; A.3.1 ...; A.3.2 ...; A.3.3 Cacheable (C) and Bufferable (B) Encoding; A.3.4 Write Buffer Behavior; A.3.5 External Aborts; A.3.6 Performance Differences; A.3.7 System Control Coprocessor; A.3.8 New Instructions and Instruction Formats; A.3.9 Augmented Page Table Descriptors; B Optimization Guide; B.1 Introduction; B.1.1 About This Guide; B.2 Intel 80200 Processor ...; B.2.1 General Pipeline Characteristics; B.2.1.1 Number of Pipeline Stages; B.2.1.2 Intel 80200 Processor Pipeline Organization; B.2.1.3 Out Of Order Completion; B.2.1.4 Register Scoreboarding; B.2.1.5 Use of Bypassing; B.2.2 Instruction Flow Through the Pipeline; B.2.2.1 ARM V5 Instruction Execution; B.2.2.2 Pipeline St
6. download flag. Figure 13-6. DBGRX Data Register. 13.11.6.3 DBG.RR. The debugger uses DBG.RR as part of the synchronization that occurs between the debugger and debug handler for accessing RX. This bit contains the value of TXRXCTRL[31] after a Capture-DR. The debug handler automatically clears TXRXCTRL[31] by doing a read from RX. The debugger polls DBG.RR to determine when the handler has read the previous data from RX. The debugger sets TXRXCTRL[31] by setting the DBG.V bit. 13.11.6.4 DBG.V. The debugger sets this bit to indicate the data scanned into DBG_SR[34:3] is valid data to write to RX. DBG.V is an input to the RX Write Logic and is also cleared by the RX Write Logic. When this bit is set, the data scanned into the DBG_SR is written to RX following an Update-DR. If DBG.V is not set and the debugger does an Update-DR, RX is unchanged. This bit does not affect the actions of DBG.FLUSH or DBG.D. 13.11.6.5 DBG.RX. DBG.RX is written into the RX register based on the output of the RX Write Logic. Any data that needs to be sent
7. 2.3.4.5 Events from Preload Instructions; 2.3.4.6 Debug ...; 3 Memory Management; 3.1 ...; 3.2 Architecture ...; 3.2.1 Version 4 vs. Version 5; 3.2.2 Memory Attributes; 3.2.2.1 Page (P) Attribute Bit; 3.2.2.2 Cacheable (C), Bufferable (B), and eXtension (X) Bits; 3.2.2.3 Instruction Cache; 3.2.2.4 Data Cache and Write Buffer; 3.2.2.5 Details on Data Cache and Write Buffer Behavior; 3.2.2.6 Memory Operation Ordering; 3.2.3 Exceptions; 3.3 Interaction of the MMU, Instruction Cache, and Data Cache; 3.4 ...; 3.4.1 Invalidate (Flush) Operation; 3.4.2 Enabling/Disabling; 3.4.3 Locking Entries; 3.4.4 Round-Robin Replacement Algorithm; 4 Instruction Cache
8. 10.3.5 Two Word Coalesced Write. In Figure 10-8, two store-byte instructions from the instruction stream have been coalesced into a single write command in the write buffer. The bytes were stored to addresses 0x240 and 0x247. The request is the same as the basic write word case, except now the length is 0x3, indicating a two-word write. When the chipset or memory needs the data, DValid is asserted and two cycles later the data is driven. In this case, however, only BE bits 0 and 7 are asserted, indicating that the first and last byte of the bus have valid data to be stored, and the rest must not be written. Figure 10-8. Two Word Coalesced Write (timing diagram, 0 ns to 75 ns; signals: MCLK, ADS/LEN[2], Lock/LEN[1], W/R#/LEN[0], A, DValid, D, BE, DCB, Abort).
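The byte-enable pattern in the coalesced-write example can be checked with a short calculation. This is a sketch in Python, assuming a 64-bit (8-byte) bus and that BE bit n corresponds to byte offset n within the aligned bus word; both the lane mapping and the helper name are assumptions for illustration.

```python
# Sketch: derive the byte-enable mask for byte stores coalesced into one
# bus-wide write. Assumes BE bit n covers byte offset n (illustrative mapping).

def coalesced_byte_enables(addresses, bus_bytes=8):
    """Return the BE mask for byte stores that fall in the same bus word."""
    base = addresses[0] // bus_bytes
    mask = 0
    for address in addresses:
        if address // bus_bytes != base:
            raise ValueError("stores do not share one bus word")
        mask |= 1 << (address % bus_bytes)   # enable only the stored byte lane
    return mask
```

With the addresses from the text, `coalesced_byte_enables([0x240, 0x247])` yields a mask with only bits 0 and 7 set, matching the description of the first and last bytes being valid.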
9. The second loop reuses the data elements A[i] and c[i]. Fusing the loops together produces:
    for (i = 0; i < NMAX; i++) {
        prefetch(D[i+1], A[i+1], c[i+1], b[i+1]);
        ai = b[i] + c[i];
        A[i] = ai;
        D[i] = ai + c[i];
    }
B.4.4.13 Prefetch to Reduce Register Pressure. Prefetch can be used to reduce register pressure. When data is needed for an operation, the load is scheduled far enough in advance to hide the load latency. However, the load ties up the receiving register until the data can be used. For example:
    ldr r2, [r0]
    ; Process code { not yet cached latency > 60 core clocks }
    add r1, r1, r2
In the above case, r2 is unavailable for processing until the add statement. Prefetching the data load frees the register for use. The example code becomes:
    pld [r0]        ; prefetch the data keeping r2 available for use
    ; Process code
    ldr r2, [r0]
    ; Process code { ldr result latency is 3 core clocks }
    add r1, r1, r2
With the added prefetch, register r2 can be used for other operations until just before it is needed.
B.5.1 Instruction Scheduling. This chapter discusses instruction scheduling optimizations. Instruction scheduling refers to
10. Performance Considerations. Table 14-7. Multiply Instruction Timings (Sheet 2 of 2)
Mnemonic | Rs Value (Early Termination) | S-Bit Value | Minimum Issue Latency | Minimum Result Latency (1) | Minimum Resource Latency (Throughput)
UMULL | Rs[31:15] = 0x00000 | 0 | 1 | RdLo = 2; RdHi = 3 | 2
UMULL | Rs[31:15] = 0x00000 | 1 | 3 | 3 | 3
UMULL | Rs[31:27] = 0x00 | 0 | 1 | RdLo = 3; RdHi = 4 | 3
UMULL | Rs[31:27] = 0x00 | 1 | 4 | 4 | 4
UMULL | all others | 0 | 1 | RdLo = 4; RdHi = 5 | 4
UMULL | all others | 1 | 5 | 5 | 5
1. If the next instruction needs to use the result of the multiply for a shift by immediate or as Rn in a QDADD or QDSUB, one extra cycle of result latency is added to the number listed.
Table 14-8. Multiply Implicit Accumulate Instruction Timings
Mnemonic | Rs Value (Early Termination) | Minimum Issue Latency | Minimum Result Latency | Minimum Resource Latency (Throughput)
MIA | Rs[31:16] = 0x0000 or Rs[31:16] = 0xFFFF | 1 | 1 | 1
MIA | Rs[31:28] = 0x0 or Rs[31:28] = 0xF | 1 | 2 | 2
MIA | all others | 1 | 3 | 3
MIAxy | N/A | 1 | 1 | 1
MIAPH | N/A | 1 | 2 | 2
Table 14-9. Implicit Accumulator Access Instruction Timings
Mnemonic | Minimum Issue Latency | Minimum Result Latency | Minimum Resource Latency (Throughput)
MAR | 2 | 2 | 2
MRA | 1 | RdLo = 2; RdHi = 3 | 2
If the next instruction needs to use the result of the MRA for a shift by immediate or as Rn in a QDADD or QDSUB, one extra cycle of result latency is added to the number listed.
11. General instruction decoding (extracting the opcode, operand addresses, destination addresses, and the offset). Detecting undefined instructions and generating an exception. Dynamic expansion of complex instructions into a sequence of simple instructions. Complex instructions are defined as ones that take more than one clock cycle to issue, such as LDM, STM, and SWP. B.2.3.3 RF (Register File / Shifter) Pipestage. The main function of the RF pipestage is to read and write to the register file unit (RFU). It provides source data to: EX for ALU operations; MAC for multiply operations; Data Cache for memory writes; Coprocessor interface. The ID unit decodes the instruction and specifies which registers are accessed in the RFU. Based upon this information, the RFU determines if it needs to stall the pipeline due to a register dependency. A register dependency occurs when a previous instruction is about to modify a register value that has not been returned to the RFU and the current instruction needs to access that same register. If no dependencies exist, the RFU selects the appropriate data from the register file and passes it to the next pipestage. When a register dependency does exist, the RFU keeps track of which register is unavailable and when the result is retur
12. Instruction Cache. 4.2.5 Parity Protection. The instruction cache is protected by parity to ensure data integrity. Each instruction cache word has 1 parity bit. (The instruction cache tag is NOT parity protected.) When a parity error is detected on an instruction cache access, a prefetch abort exception occurs if the Intel 80200 processor attempts to execute the instruction. Before servicing the exception, hardware will place a notification of the error in the Fault Status Register (Coprocessor 15, register 5). A software exception handler can recover from an instruction cache parity error. This can be accomplished by invalidating the instruction cache and the branch target buffer and then returning to the instruction that caused the prefetch abort exception. A simplified code example is shown in Example 4-1 on page 4-4. A more complex handler might choose to invalidate the specific line that caused the exception and then invalidate the BTB.
Example 4-1. Recovering from an Instruction Cache Parity Error
    ; Prefetch abort handler
    MCR P15, 0, R0, C7, C5, 0   ; Invalidate the instruction cache and branch target buffer
    CPWAIT                      ; wait for effect (see Section 2.3.3 for a description of CPWAIT)
    SUBS PC, R14, #4            ; Returns to the instruction that generated the parity error
    ; The Instruction Cache is guaranteed to be invalidated at this point
If a parity error occurs on an instruction that is locked in the cache, the softwa
13. 0x3: Instruction TLB miss.
0x4: Data TLB miss.
0x5: Branch instruction executed; branch may or may not have changed program flow.
0x6: Branch mispredicted (B and BL instructions only).
0x7: Instruction executed.
0x8: Stall because the data cache buffers are full. This event occurs every cycle in which the condition is present.
0x9: Stall because the data cache buffers are full. This event occurs once for each contiguous sequence of this type of stall.
0xA: Data cache access, not including Cache Operations (defined in Section 7.2.8).
0xB: Data cache miss, not including Cache Operations (defined in Section 7.2.8).
0xC: Data cache write-back. This event occurs once for each 1/2 line (four words) that are written back from the cache.
0xD: Software changed the PC. This event occurs any time the PC is changed by software and there is not a mode change. For example, a mov instruction with PC as the destination triggers this event. Executing a swi from User mode does not trigger this event, because it incurs a mode change.
0x10: The BCU received a new memory request from the core.
0x11: The BCU's request queue is full. This event takes place each clock cycle in which the condition is met. A high incidence of this event indicates the BCU is often waiting for transactions to complete on the external bus.
0x12: The number of times the BCU queues were drained due to a Drain Write Buffer command or an I/O transaction as identified by C 0 and B
14. 1 Correction if BCUCTL.SC = 1. 1 0 Merge store data with data from read. RMW with 1-bit error from the read cycle: Write updated data to memory; correction if BCUCTL.SC = 1; merge store data with data from read; write updated data to memory; request core interrupt. 0 Does not occur (RMW only if BCUCTL.EE = 1). Merge store data with data from read; write updated data to memory with deasserted byte enables. If transaction is a data read: imprecise data abort. If transaction is an instruction fetch: prefetch abort. If transaction is an MMU operation: precise data abort. RMW with multi-bit error from the read cycle 1.
a. In all cases, the BCU attempts the additional response of logging the error into a register. If register ELOG0 is not already being used to record an error, the BCU logs error information into registers ELOG0 and ECAR0. If ELOG0 already has error information but ELOG1 does not, the BCU uses ELOG1 and ECAR1 to log the error. If both ELOG0 and ELOG1 are in use, the BCU sets the error overflow bit (register BCUCTL, bit EV). A description of these registers is in Section 11.4.1 and Section 11.4.2.
When an error is detected during a RMW, the BCU writes the corrupted data back to memory, because it needs to finish the atomic transaction. If it did not do this, an external bus agent might consider the bus to be permanently locked. The BCU does not assert any of the byte enables (nBE) wh
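The error-logging rule in footnote a (first ELOG0/ECAR0, then ELOG1/ECAR1, then the overflow bit) can be sketched as a small model. This Python sketch represents an unused log register as `None` and the overflow bit as a boolean; that representation, and the dictionary keys, are modeling assumptions rather than the hardware register layout.

```python
# Sketch: model of the BCU error-log selection described in footnote a.
# "In use" is modeled as "not None"; real registers use status bits instead.

def new_bcu_log_state():
    return {"ELOG0": None, "ECAR0": None, "ELOG1": None, "ECAR1": None, "EV": False}

def log_bus_error(state, error_info, address):
    """Record an error in the first free log pair, else flag overflow."""
    if state["ELOG0"] is None:
        state["ELOG0"], state["ECAR0"] = error_info, address
    elif state["ELOG1"] is None:
        state["ELOG1"], state["ECAR1"] = error_info, address
    else:
        state["EV"] = True        # both logs busy: set error overflow
    return state
```

A third error arriving before software clears either log leaves both logged entries intact and only sets the overflow flag.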
15. Bits | Access | Description
31:9 | Read-as-Zero / Write-ignored | Reserved
8 | Read / Write | DBR1 Mode (M): 0 = DBR1 = Data Address Breakpoint; 1 = DBR1 = Data Address Mask
7:4 | Read-as-Zero / Write-ignored | Reserved
3:2 | Read / Write | DBR1 Enable (E1). When DBR1 = Data Address Breakpoint: 0b00 = DBR1 disabled; 0b01 = DBR1 enabled, Store only; 0b10 = DBR1 enabled, Any data access, load or store; 0b11 = DBR1 enabled, Load only. When DBR1 = Data Address Mask, this field has no effect.
1:0 | Read / Write | DBR0 Enable (E0): 0b00 = DBR0 disabled; 0b01 = DBR0 enabled, Store only; 0b10 = DBR0 enabled, Any data access, load or store; 0b11 = DBR0 enabled, Load only
When DBR1 is programmed as a data address mask, it is used in conjunction with the address in DBR0. The bits set in DBR1 are ignored by the processor when comparing the address of a memory access with the address in DBR0. Using DBR1 as a data address mask allows a range of addresses to generate a data breakpoint. When DBR1 is selected as a data address mask, it is unaffected by the E1 field of DBCON. The mask is used only when DBR0 is enabled. When DBR1 is programmed as a second data address breakpoint, it functions independently of DBR0. In this case, the DBCON.E1 controls DBR1. A data breakpoint is triggered if the memory access matches the access type a
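The DBCON field layout and the masking rule above can be illustrated with a small model. This is a sketch in Python under a simplifying assumption: an access matches a breakpoint when its address equals the breakpoint address (after masking), ignoring any access-size rules the manual may define beyond this excerpt. Function and constant names are hypothetical.

```python
# Sketch: evaluate a data breakpoint from DBCON, DBR0, and DBR1.
# Assumes exact 32-bit address comparison (a simplification).

STORE, LOAD = "store", "load"

def _type_matches(enable, access):
    """Apply the 2-bit E0/E1 encoding from the DBCON table."""
    if enable == 0b00:
        return False                  # disabled
    if enable == 0b01:
        return access == STORE        # store only
    if enable == 0b10:
        return True                   # any data access
    return access == LOAD             # 0b11: load only

def data_breakpoint_hit(dbcon, dbr0, dbr1, address, access):
    e0 = dbcon & 0x3                  # bits 1:0
    e1 = (dbcon >> 2) & 0x3           # bits 3:2
    mask_mode = (dbcon >> 8) & 0x1    # bit 8 (M)
    if mask_mode:
        keep = ~dbr1 & 0xFFFFFFFF     # bits set in DBR1 are ignored
        return _type_matches(e0, access) and (address & keep) == (dbr0 & keep)
    hit0 = _type_matches(e0, access) and address == dbr0
    hit1 = _type_matches(e1, access) and address == dbr1
    return hit0 or hit1
```

For instance, with M = 1, E0 = 0b10, DBR0 = 0x1000, and DBR1 = 0xFF, any load or store in the range 0x1000 to 0x10FF reports a hit.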
16. 0, Rd, c8, c6, 1. Configuration. 7.2.10 Register 9: Cache Lock Down. Register 9 is used for locking down entries into the instruction cache and data cache. (The protocol for locking down entries can be found in Chapter 6, Data Cache.) Table 7-14 shows the command for locking down entries in the instruction cache, instruction TLB, and data TLB. The entry to lock is specified by the virtual address in Rd. The data cache locking mechanism follows a different procedure than the others. The data cache is placed in lock down mode such that all subsequent fills to the data cache result in that line being locked in, as controlled by Table 7-15. Lock/unlock operations on a disabled cache have an undefined effect. This register should be accessed as write-only. Reads from this register, as with an MRC, have an undefined effect.
Table 7-14. Cache Lockdown Functions
Function | opcode_2 | CRm | Data | Instruction
Fetch and Lock I cache line | 0b000 | 0b0001 | MVA | MCR p15, 0, Rd, c9, c1, 0
Unlock Instruction cache | 0b001 | 0b0001 | Ignored | MCR p15, 0, Rd, c9, c1, 1
Read data cache lock register | 0b000 | 0b0010 | Read lock mode value | MRC p15, 0, Rd, c9, c2, 0
Write data cache lock register | 0b000 | 0b0010 | Set/Clear lock mode | MCR p15, 0, Rd, c9, c2, 0
Unlock Data Cache | 0b001 | 0b0010 | Ignored | MCR p15, 0, Rd, c9, c2, 1
Data C
17. 10.2.4 Configuration Pins; 10.2.5 Multimaster Support; 10.2.6 ...; 10.2.7 ...; 10.2.8 Big Endian System Configuration; 10.3 ...; 10.3.1 ...; 10.3.2 Read Burst, No Critical Word First; 10.3.3 Read Burst, Critical Word First Data Return; 10.3.4 ...; 10.3.5 Two Word Coalesced Write; 10.3.5.1 Write Burst; 10.3.6 Write Burst, Coalesced; 10.3.7 Pipelined Accesses; 10.3.8 Locked Access; 10.3.9 Aborted Access; 10.4 ...; 11 ...; 11.1 ...; 11.2 ...; 11.3 Error ...; 11.3.1 Bus Aborts; 11.3.2 ECC Errors; 11.4 Programmer Model; 11.4.1 BCU Control Registers; 11.4.2 ECC Error Registers; Performance Monitoring
18. 11.2 ECC. System software has the ability to request ECC checking on memory accesses. If the MMU determines that a region of memory is protected by ECC, the BCU is responsible for checking and generating ECC on accesses to that region. See Chapter 3, Memory Management, for a description of the MMU. The Intel 80200 processor data bus width is configured at reset time to either 32 or 64 bits. Bus configuration is detailed in Chapter 10, External Bus. The Intel 80200 processor only supports ECC in the 64-bit mode. When ECC is enabled for a memory region, the BCU never performs sub-bus-width (64 bits) writes to that region. If directed by the core to perform a sub-bus-width write, the BCU performs a bus-width read, merges in the appropriate bytes, and then performs a bus-width write with all byte enables asserted. This read-modify-write (RMW) is performed as an atomic transaction on the external bus. This RMW behavior affects performance and can be avoided by specifying the region as read/write-allocate and write-back cacheable in the MMU. When writing to ECC protected memory, the BCU always calculates the correct ECC bits and writes them to memory at the same time as the data. The ECC algorithm supported by the Intel 80200 processor uses eight bits to protect the data bus. It can detect and correct any one-bit error, and it can detect any two-bit error. Whenever the BCU checks ECC, it generates a syndrome, a bitwise exclusive OR of
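The merge step of the read-modify-write described above can be sketched as follows. This Python sketch assumes byte lane n occupies bits [8n+7:8n] of the 64-bit value and that a per-byte mask marks which bytes the store supplies; the function name and the mask convention are illustrative assumptions, and ECC generation itself is not modeled.

```python
# Sketch: merge a sub-bus-width store into a 64-bit word read from memory,
# as in the RMW sequence. Lane numbering is an assumption for illustration.

def rmw_merge(read_data, store_data, store_byte_mask):
    """Return the 64-bit value to write back with all byte enables asserted."""
    merged = 0
    for lane in range(8):
        lane_bits = 0xFF << (8 * lane)
        source = store_data if (store_byte_mask >> lane) & 1 else read_data
        merged |= source & lane_bits      # take this byte from store or memory
    return merged
```

Because the write-back covers the full bus width, the ECC bits can then be computed over the merged value rather than over a partial word.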
19. Branch Target Buffer. Figure 5-2. Branch History (states: SN = Strongly Not Taken, WN = Weakly Not Taken, WT = Weakly Taken, ST = Strongly Taken; transitions on Taken and Not Taken). 5.1.1 Reset. After Processor Reset, the BTB is disabled and all entries are invalidated. 5.1.2 Update Policy. A new entry is stored into the BTB when the following conditions are met: the branch instruction has executed; the branch was taken; the branch is not currently in the BTB. The entry is then marked valid and the history bits are set to WT. If another valid branch exists at the same entry in the BTB, it is evicted by the new branch. Once a branch is stored in the BTB, the history bits are updated upon every execution of the branch, as shown in Figure 5-2. 5.2 BTB Control. 5.2.1 Disabling/Enabling. The BTB is always disabled out of Reset. Software can enable the BTB through a bit in a coprocessor register (see Section 7.2.2). Before enabling or disabling the BTB, software must invalidate it (described in the following section). This action ensures correct operation in case stale data is in the BTB. Software should not place any branch instruction between the code that inval
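The four history states and the update policy above can be modeled compactly. Since Figure 5-2 itself did not survive extraction, this Python sketch assumes the conventional two-bit saturating-counter transitions (one step toward ST on taken, one step toward SN on not taken); only the state names and the initial WT value come from the text.

```python
# Sketch: 2-bit branch history, assuming conventional saturating-counter
# transitions (the actual arrows of Figure 5-2 are not recoverable here).

SN, WN, WT, ST = 0, 1, 2, 3          # strongly/weakly not taken, weakly/strongly taken
NEW_ENTRY_STATE = WT                 # per the update policy: new entries start at WT

def update_history(state, taken):
    """Move one step toward ST when taken, toward SN when not taken."""
    return min(state + 1, ST) if taken else max(state - 1, SN)

def predicts_taken(state):
    return state in (WT, ST)
```

Under this assumption, a newly stored branch predicts taken, and a single not-taken outcome flips it to WN.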
20. B.2.4 Memory Pipeline. The memory pipeline consists of two stages, D1 and D2. The data cache unit, or DCU, consists of the data cache array, mini-data cache, fill buffers, and write buffers. The memory pipeline handles load/store instructions. B.2.4.1 D1 and D2 Pipestage. Operation begins in D1 after the X1 pipestage has calculated the effective address for load/stores. The data cache and mini-data cache return the destination data in the D2 pipestage. Before data is returned in the D2 pipestage, sign extension and byte alignment occur for byte and half-word loads. B.2.5 Multiply/Multiply Accumulate (MAC) Pipeline. The Multiply-Accumulate (MAC) unit executes the multiply and multiply-accumulate instructions supported by the Intel 80200 processor core. The MAC implements the 40-bit Intel 80200 processor accumulator register acc0 and handles the instructions which transfer its value to and from general-purpose ARM registers. The following are important characteristics about the MAC: The MAC is not truly pipelined, as the processing of a single instruction may require use of the same datapath resources for several cycles before a new instruction can be accepted. The type of instruction and source arguments determines the number of cycles required. No more than two instructions can occupy the MAC pipeline concurrently. When the MAC i
21. 80200 processor Control Register are used to control the allocation policy for the mini-data cache and whether it uses write-back caching or write-through caching. The configuration of the mini-data cache should be set up before any data access is made that may be cached in the mini-data cache. Once data is cached, software must ensure that the mini-data cache has been cleaned and invalidated before the mini-data cache attributes can be changed.
Auxiliary Control Register (bits 31 to 0). Reset value: writeable bits set to 0.
Bits | Access | Description
31:6 | Read-Unpredictable / Write-as-Zero | Reserved
5:4 | Read / Write | Mini Data Cache Attributes (MD). All configurations of the mini-data cache are cacheable, stores are buffered in the write buffer, and stores are coalesced in the write buffer as long as coalescing is globally enabled (bit 0 of this register). 0b00 = Write back, Read allocate; 0b01 = Write back, Read/Write allocate; 0b10 = Write through, Read allocate; 0b11 = Unpredictable
3:2 | Read-Unpredictable / Write-as-Zero | Reserved
1 | Read / Write | Page Table Memory Attribute (P). If set, page table accesses are protected by ECC. See Chapter 11, Bus Controller, for more information.
Write Buffer Coalescing Disable (K). This bit globally disables the coalescing of all stores in the write buffer no matter what the value of the Cachea
22. Early terminate: MAC may finish instruction and return results in any of the MAC pipestages. Instruction Cache. Replacement: Both the SA-110 and the Intel 80200 processor are 32-way set associative with round-robin replacement. They differ in size (16K, 32K). Lockable by line: Instructions may be locked into the instruction cache with line granularity, preventing future eviction. Fetch buffers: Buffer incoming external memory operations. Data Cache / Mini-Data Cache. Replacement Policy: round robin, round robin. Data Cache: Both SA-110 and the Intel 80200 processor are 32-way set associative with round-robin replacement. They differ in size. Mini-Data Cache Geometry: 512 bytes, 2K; 2 ways, 2 ways. Software can re-map portions of the data cache into data RAM on a line granularity. Hit-under-miss: Allow accesses to the data cache while there are outstanding miss requests to external memory. Write back: Store operation that hits cache is not written to external memory. Store operations are written to external memory even if cache is hit. Fill Buffer: Buffer incoming external memory operations. Pending Buffer: Collects memory requests that hit an outstanding load. Write Buffer: 8-entry write buffer. Write Buffer coalescing: All entries coalesce to in the write buffer a b A 32 bit shift and AL
23. F2 (Instruction Fetch) Pipestages. The job of the instruction fetch stages F1 and F2 is to present the next instruction to be executed to the ID stage. Several important functional units reside within the F1 and F2 stages, including: Branch Target Buffer (BTB); Instruction Fetch Unit (IFU). An understanding of the BTB (see Chapter 5, Branch Target Buffer) and IFU is important for performance considerations. A summary of operation is provided here so that the reader may understand its role in the F1 pipestage. Branch Target Buffer (BTB): The BTB predicts the outcome of branch-type instructions. Once a branch-type instruction reaches the X1 pipestage, its target address is known. If this address is different from the address that the BTB predicted, the pipeline is flushed, execution starts at the new target address, and the branch's history is updated in the BTB. Instruction Fetch Unit (IFU): The IFU is responsible for delivering instructions to the instruction decode (ID) pipestage. One instruction word is delivered each cycle (if possible) to the ID. The instruction could come from one of two sources: instruction cache or fetch buffers. ID (Instruction Decode) Pipestage. The ID pipestage accepts an instruction word from the IFU and sends register decode information to the RF pipestage. The ID is able to accept a new instruction word from the IFU on every clock cycle in which there is no stall. The ID pipestage is responsible for
24. Instructions that do not call functions generate no activity in the test logic while the controller is in this state. The instruction register and all test data registers retain their current state. When TMS is high on the rising edge of TCK, the controller moves to the Select-DR-Scan state. Select-DR-Scan State. The Select-DR-Scan state is a temporary controller state. The test data registers selected by the current instruction retain their previous state. If TMS is held low on the rising edge of TCK when the controller is in this state, the controller moves into the Capture-DR state and a scan sequence for the selected test data register is initiated. If TMS is held high on the rising edge of TCK, the controller moves into the Select-IR-Scan state. The instruction does not change while the TAP controller is in this state. Capture-DR State. When the controller is in this state and the current instruction is sample/preload, the Boundary-Scan register captures input pin data on the rising edge of TCK. Test data registers that do not have parallel input are not changed. Also, if the sample/preload instruction is not selected while in this state, the Boundary-Scan registers retain their previous state. The instruction does not change while the TAP controller is in this state. If TMS is high on the rising edge of TCK, the controller enters the Exit1-DR state. If TMS is low on the rising edge of TCK, the controller enters the Shift-DR state.
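The state transitions described above are part of the standard IEEE 1149.1 TAP controller, which can be written as a lookup table. The Python sketch below encodes the full 16-state machine from the standard; the helper names `tap_step` and `tap_walk` are illustrative, and the table is a generic model rather than anything specific to the 80200.

```python
# Sketch: IEEE 1149.1 TAP controller as a table of (next if TMS=0, next if TMS=1).

TAP_NEXT = {
    "Test-Logic-Reset": ("Run-Test/Idle", "Test-Logic-Reset"),
    "Run-Test/Idle":    ("Run-Test/Idle", "Select-DR-Scan"),
    "Select-DR-Scan":   ("Capture-DR", "Select-IR-Scan"),
    "Capture-DR":       ("Shift-DR", "Exit1-DR"),
    "Shift-DR":         ("Shift-DR", "Exit1-DR"),
    "Exit1-DR":         ("Pause-DR", "Update-DR"),
    "Pause-DR":         ("Pause-DR", "Exit2-DR"),
    "Exit2-DR":         ("Shift-DR", "Update-DR"),
    "Update-DR":        ("Run-Test/Idle", "Select-DR-Scan"),
    "Select-IR-Scan":   ("Capture-IR", "Test-Logic-Reset"),
    "Capture-IR":       ("Shift-IR", "Exit1-IR"),
    "Shift-IR":         ("Shift-IR", "Exit1-IR"),
    "Exit1-IR":         ("Pause-IR", "Update-IR"),
    "Pause-IR":         ("Pause-IR", "Exit2-IR"),
    "Exit2-IR":         ("Shift-IR", "Update-IR"),
    "Update-IR":        ("Run-Test/Idle", "Select-DR-Scan"),
}

def tap_step(state, tms):
    """Advance one rising edge of TCK with the given TMS level (0 or 1)."""
    return TAP_NEXT[state][1 if tms else 0]

def tap_walk(state, tms_sequence):
    """Apply a sequence of TMS values and return the final state."""
    for tms in tms_sequence:
        state = tap_step(state, tms)
    return state
```

Starting from Run-Test/Idle, the TMS sequence 1, 0, 0 reaches Shift-DR by way of Select-DR-Scan and Capture-DR, and five consecutive clocks with TMS high return the controller to Test-Logic-Reset from any state.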
25. P2: Percentage of times we are likely to incur a branch misprediction penalty. N1_C: Number of cycles to execute the if-else portion using conditional instructions, assuming the if-condition to be true. N2_C: Number of cycles to execute the if-else portion using conditional instructions, assuming the if-condition to be false. Once we have the above data, use conditional instructions when:
$$\left(N1_C \times \frac{P1}{100}\right) + \left(N2_C \times \frac{100 - P1}{100}\right) \le \left(N1_B \times \frac{P1}{100}\right) + \left(N2_B \times \frac{100 - P1}{100}\right) + \left(\frac{P2}{100} \times 4\right)$$
The following example illustrates a situation in which we are better off using branches over conditional instructions. Consider the code sample shown below:
    cmp r0, #0
    bne L1
    add r0, r0, #1
    add r1, r1, #1
    add r2, r2, #1
    add r3, r3, #1
    add r4, r4, #1
    b L2
    L1:
    sub r0, r0, #1
    sub r1, r1, #1
    sub r2, r2, #1
    sub r3, r3, #1
    sub r4, r4, #1
    L2:
In the above code sample, the cmp instruction takes 1 cycle to execute, the if-part takes 7 cycles to execute, and the else-part takes 6 cycles to execute. If we were to change the code above so as to eliminate the branch instructions by making use of conditional instructions, the if-else part would always take 10 cycles to complete. If we make the assumptions that both paths are equally likely to be taken and that branches are mis-predicted 50% of the time, the costs of using conditional execution vs. using branches can be computed as follo
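The comparison set up in the example above can be worked through numerically. The Python sketch below plugs in the cycle counts stated in the text (1 cycle for cmp, 7 and 6 cycles for the branch paths, 10 cycles for the conditional version, 50% probabilities); the 4-cycle misprediction penalty and the function names are assumptions made for this illustration.

```python
# Sketch: expected cycle cost of conditional execution vs. branches.
# MISPREDICT_PENALTY = 4 cycles is an assumed value for this illustration.

MISPREDICT_PENALTY = 4

def conditional_cost(p_if, n1_c, n2_c):
    """Expected cycles for the if-else body using conditional instructions."""
    return p_if * n1_c + (1 - p_if) * n2_c

def branch_cost(p_if, p_mispredict, n1_b, n2_b):
    """Expected cycles for the if-else body using branches."""
    return p_if * n1_b + (1 - p_if) * n2_b + p_mispredict * MISPREDICT_PENALTY

CMP_CYCLES = 1
cost_conditional = CMP_CYCLES + conditional_cost(0.5, 10, 10)
cost_branches = CMP_CYCLES + branch_cost(0.5, 0.5, 7, 6)
```

Under these assumptions the conditional version costs 11 cycles and the branch version 9.5 cycles, which agrees with the text's statement that branches are the better choice in this case.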
26. N LOCKLINE R1 R2 R6 R6 1 Decrement loop count DONE LOCKLINE R1 R3 DONE OCKLINE R1 R4 DONE OCKLINE R1 R5 N MOV R2 MCR P15 CPWAIT DSP RO PC 4 RO 0x1 R6 R6 1 Decrement loop count R6 R6 1 Decrement loop count R6 R6 1 Decrement loop count LOOP1 Turn off data cache locking 0 RO C2 CO 0 0 RO C7 C10 4 drain pending loads and stores 0 NRx C7 C10 1 0 Rx C7 C6 1 NRx 32 R4 R6 LR 0 R2 C9 C2 0 Put the data cache in lock mode OxO 0 R2 C9 C2 0 Take the data cache out of lock mode R4 R6 PC Developer s Manual March 2003 6 13 Intel 80200 Processor based on Intel XScale Microarchitecture m Data Cache n e Example 6 4 Creating Data RAM R1 contains the virtual address of a region of memory to configure as data RAM which is aligned on a 32 byte boundary MMU is configured so that the memory region is cacheable RO is the number of 32 byte lines to designate as data RAM In this example 16 lines of the data cache are re configured as data RAM The inner loop is used to initialize the newly allocated lines MMU and data cache are enabled prior to this code MACRO ALLOCATE Rx MCR P15 D Rx C7 C2 5 ENDM MACRO DRAIN MCR P15 0 RO C7 C10 A Grain pending loads and stores ENDM DRAIN MOV R4 0x0 MOV R5 0x0 MOV R2 0x1 MCR P15 0 R2 C9 C2 0 Put the data cache
27. TDO is connected to the least significant bit. Data is shifted one bit position within the register towards TDO on each rising edge of TCK. The following sections describe each of the test data registers. See Figure C-5 for an example of loading the data register.

Device Identification Register

The Device Identification register is a 32-bit register containing the manufacturer's identification code, part number code, and version code. The identification register is selected only by the idcode instruction. When the TAP controller's Test_Logic_Reset state is entered, idcode is automatically loaded into the instruction register. The Device Identification register has a fixed parallel input value that is loaded in the Capture_DR state. The value of this register is shown in Table C-4.

Table C-4. JTAG ID Register Value

    Stepping    Value
    A0          0x09263013
    A1          0x19263013
    B0          0x29263013
    C0          0x39263013
    D0          0x49263013

Bypass Register

The required Bypass Register, a one-bit shift register, provides the shortest path between TDI and TDO when a bypass, highz, or clamp instruction is in effect. This allows rapid movement of test data to and from other components on the board. This path can be selected when no test operation is being performed. While the Bypass register is selected, data is transferred from TDI to TDO without inversion. Any instruction that does not make use of another test data register may select the Bypass reg
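The ID values in the table differ only in their top hexadecimal digit. A short C sketch makes that visible by splitting an IDCODE into fields. It assumes the generic IEEE 1149.1 layout (version in bits 31:28, part number in bits 27:12, manufacturer identity in bits 11:1, bit 0 fixed at 1); the excerpt itself only says the register holds those three codes, and the struct and function names here are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of a JTAG IDCODE, assuming the standard IEEE 1149.1 layout. */
typedef struct {
    unsigned version;       /* bits 31:28 */
    unsigned part_number;   /* bits 27:12 */
    unsigned manufacturer;  /* bits 11:1  */
} jtag_idcode;

jtag_idcode decode_idcode(uint32_t id)
{
    jtag_idcode f;
    assert((id & 1u) == 1u);              /* bit 0 of an IDCODE is always 1 */
    f.version      = (id >> 28) & 0xFu;
    f.part_number  = (id >> 12) & 0xFFFFu;
    f.manufacturer = (id >> 1)  & 0x7FFu;
    return f;
}
```

Under this layout, 0x09263013 through 0x49263013 decode to versions 0 through 4 with an identical part number and manufacturer field, so a debugger could use the version field alone to tell the listed steppings apart.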
28. but never hits on an address translation. Effectively, a hole is in the TLB. This situation may be rectified by unlocking the TLB.

Enabling/Disabling

The MMU is enabled by setting bit 0 in coprocessor 15, register 1 (Control Register). When the MMU is disabled, accesses to the instruction cache default to cacheable and all accesses to data memory are made non-cacheable. A recommended code sequence for enabling the MMU is shown in Example 3-1 on page 3-6.

Example 3-1. Enabling the MMU

    ; This routine provides software with a predictable way of enabling the MMU.
    ; After the CPWAIT, the MMU is guaranteed to be enabled. Be aware
    ; that the MMU will be enabled sometime after MCR and before the instruction
    ; that executes after the CPWAIT.
    ; Programming Note: This code sequence requires a one-to-one virtual to
    ; physical address mapping on this code since the MMU may be enabled part
    ; way through. This would allow the instructions after MCR to execute
    ; properly regardless of the state of the MMU.
    MRC P15, 0, R0, C1, C0, 0    ; Read CP15, register 1
    ORR R0, R0, #0x1             ; Turn on the MMU
    MCR P15, 0, R0, C1, C0, 0    ; Write to CP15, register 1
    ; For a description of CPWAIT, see
    ; Section 2.3.3, "Additions to CP15 Functionality" on page 2-11
    CPWAIT
    ; The MMU is guaranteed to be enabled at this point; the next instruction or
    ; data address will be translated.
29.
    mra r8, r9, acc0
    add r1, r1, #1

The code shown above would incur a 1-cycle stall due to the 2-cycle resource latency of an MRA instruction. The code can be rearranged as shown below to prevent this stall:

    mra r6, r7, acc0
    add r1, r1, #1
    mra r8, r9, acc0

Similarly, the code shown below would incur a 2-cycle penalty due to the 3-cycle result latency for the second destination register:

    mra r6, r7, acc0
    mov r1, r7
    mov r0, r6
    add r2, r2, #1

The stalls incurred by the code shown above can be prevented by rearranging the code:

    mra r6, r7, acc0
    add r2, r2, #1
    mov r0, r6
    mov r1, r7

The MAR (MCRR) instruction has an issue latency, a result latency, and a resource latency of 2 cycles. Due to the 2-cycle issue latency, the pipeline would always stall for 1 cycle following a MAR instruction. The MAR instruction should therefore be used only where absolutely necessary.

Scheduling the MIA and MIAPH Instructions

The MIA instruction has an issue latency of 1 cycle. The result and resource latency can vary from 1 to 3 cycles depending on the values in the source register. Consider the following code sample:

    mia acc0, r2, r3
    mia acc0, r4, r5

The second MIA instruction above can stall from 0 to 2 cycles depending on the values in the registers r2 and r3, due t
30. 0x200. Bits 15:12 refer to the processor generation. Bits 11:8 refer to the implementation. Bits 7:4 are used for implementation derivatives.

    Bits   Access                  Description
    3:0    Read / Write Ignored    Revision number for the processor (implementation specified):
                                   A0 stepping = 0b0000
                                   A1 stepping = 0b0001
                                   B0 stepping = 0b0010
                                   C0 stepping = 0b0011
                                   D0 stepping = 0b0100

The Cache Type Register is selected when opcode_2 = 1 and describes the present Intel 80200 processor cache.

Table 7-5. Cache Type Register (Sheet 1 of 2)

    reset value: As Shown

    Bits    Access                          Description
    31:29   Read-as-Zero / Write Ignored    Reserved
    28:25   Read / Write Ignored            Cache class = 0b0101. The caches support locking, write back and round-robin replacement. They do not support address by index.
    24      Read / Write Ignored            Harvard Cache
    23:21   Read-as-Zero / Write Ignored    Reserved
    20:18   Read / Write Ignored            Data cache size = 0b110 (32 kB)
    17:15   Read / Write Ignored            Data cache associativity = 0b101 (32)
    14      Read-as-Zero / Write Ignored    Reserved
    13:12   Read / Write Ignored            Data cache line length = 0b10 (8 words/line)
    11:9    Read-as-Zero / Write Ignored    Reserved
    8:6     Read / Write Ignored            Instruction cache size = 0b110 (32 kB)
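The bit ranges listed for the Cache Type Register translate directly into shift-and-mask code. The C sketch below extracts only the fields that appear on this sheet of the table; the struct, the `field` helper, and the function name are hypothetical, and the sketch does not attempt to decode the fields on the sheet that is cut off here.

```c
#include <assert.h>
#include <stdint.h>

/* Extract bits hi..lo (inclusive) from a 32-bit register value. */
static unsigned field(uint32_t v, unsigned hi, unsigned lo)
{
    assert(lo <= hi && hi < 32 && hi - lo < 31);
    return (unsigned)((v >> lo) & ((1u << (hi - lo + 1)) - 1u));
}

/* Raw field encodings, named after the table rows. */
typedef struct {
    unsigned cache_class;    /* bits 28:25 */
    unsigned harvard;        /* bit 24     */
    unsigned dcache_size;    /* bits 20:18 */
    unsigned dcache_assoc;   /* bits 17:15 */
    unsigned dcache_line;    /* bits 13:12 */
    unsigned icache_size;    /* bits 8:6   */
} cache_type_fields;

cache_type_fields decode_cache_type(uint32_t v)
{
    cache_type_fields f;
    f.cache_class  = field(v, 28, 25);
    f.harvard      = field(v, 24, 24);
    f.dcache_size  = field(v, 20, 18);
    f.dcache_assoc = field(v, 17, 15);
    f.dcache_line  = field(v, 13, 12);
    f.icache_size  = field(v, 8, 6);
    return f;
}
```

On the target, the input would come from an MRC read of the register; here it is simply a 32-bit value, so the decoder can be exercised on a host with a value assembled from the encodings in the table.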
31. Software Debug

13.3 Introduction

The Intel 80200 processor debug unit, when used with a debugger application, allows software running on an Intel 80200 processor target to be debugged. The debug unit allows the debugger to stop program execution and re-direct execution to a debug handling routine. Once program execution has stopped, the debugger can examine or modify processor state, co-processor state, or memory. The debugger can then restart execution of the application.

On the Intel 80200 processor, one of two debug modes can be entered:
• Halt mode
• Monitor mode

13.3.1 Halt Mode

When the debug unit is configured for halt mode, the reset vector is overloaded to serve as the debug vector. A new processor mode, DEBUG mode (CPSR[4:0] = 0x15), is added to allow debug exceptions to be handled similarly to other types of ARM exceptions. When a debug exception occurs, the processor switches to debug mode and redirects execution to a debug handler via the reset vector. After the debug handler begins execution, the debugger can communicate with the debug handler to examine or alter processor state or memory through the JTAG interface. The debug handler can be downloaded and locked directly into the instruction cache through JTAG, so external memory is not required to contain debug handler code.

13.3.2 Monitor Mode

In monito
32.
    reset value: unpredictable

    Bits    Access                          Description
    31:8    Read-as-Zero / Write-ignored    Reserved
    7:0     Read / Write-unpredictable      Message Byte or Address Byte

13.13 Trace Buffer Entries

Trace buffer entries consist of either one or five bytes. Most entries are one-byte messages indicating the type of control flow change. The target address of the control flow change represented by the message byte is either encoded in the message byte (like for exceptions) or can be determined by looking at the instruction word (like for direct branches). Indirect branches require five bytes per entry. One byte is the message byte identifying it as an indirect branch. The other four bytes make up the target address of the indirect branch. The following sections describe the trace buffer entries in detail.

13.13.1 Message Byte

There are two message formats, exception and non-exception, as shown in Figure 13-7.

Figure 13-7. Message Byte Formats
    Exception format (bits 7 to 0): M = Message Type Bit, VVV = exception vector[4:2], CCCC = Incremental Word Count
    Non-exception format (bits 7 to 0): MMMM = Message Type Bits, CCCC = Incremental Word Count

Table 13-17 shows all of the possible trace messages.

Table 13-17. Message Byte Formats
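A host-side trace reader has to classify each message byte before it knows whether four address bytes follow. The C sketch below shows one way to do that, using the message-type encodings listed in Table 13-17 (0b0VVV CCCC for exceptions, 0b1000, 0b1100, 0b1001 and 0b1101 for the branch messages, 0b1111 1111 for roll-over). The type and function names are hypothetical, and any other bit pattern is reported as unknown rather than guessed at.

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    MSG_EXCEPTION, MSG_DIRECT_BRANCH, MSG_CHECKPOINTED_DIRECT_BRANCH,
    MSG_INDIRECT_BRANCH, MSG_CHECKPOINTED_INDIRECT_BRANCH,
    MSG_ROLL_OVER, MSG_UNKNOWN
} trace_msg_type;

typedef struct {
    trace_msg_type type;
    unsigned count;          /* CCCC: incremental word count */
    unsigned vector_offset;  /* exceptions only: VVV placed at bits 4:2 of the offset */
    unsigned address_bytes;  /* 4 for indirect branches, otherwise 0 */
} trace_msg;

trace_msg decode_message_byte(uint8_t b)
{
    trace_msg m = { MSG_UNKNOWN, b & 0x0Fu, 0, 0 };

    if (b == 0xFFu) {                       /* roll-over uses the whole byte */
        m.type = MSG_ROLL_OVER;
        m.count = 0;
    } else if ((b & 0x80u) == 0) {          /* M = 0: exception format */
        m.type = MSG_EXCEPTION;
        m.vector_offset = ((b >> 4) & 0x7u) << 2;
    } else {
        switch (b >> 4) {                   /* MMMM: non-exception format */
        case 0x8: m.type = MSG_DIRECT_BRANCH; break;
        case 0xC: m.type = MSG_CHECKPOINTED_DIRECT_BRANCH; break;
        case 0x9: m.type = MSG_INDIRECT_BRANCH; m.address_bytes = 4; break;
        case 0xD: m.type = MSG_CHECKPOINTED_INDIRECT_BRANCH; m.address_bytes = 4; break;
        default:  break;                    /* not listed in the table */
        }
    }
    assert(m.count <= 15);
    return m;
}
```

For example, the byte 0x25 decodes as an exception with vector offset 0x08 and a word count of 5, and 0x93 decodes as an indirect branch with a count of 3 followed by four address bytes.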
33. [Figure residue: DBGRX hardware diagram. Recoverable labels: RX, RX write logic, write enable, TXRXCTRL[31], Intel 80200 processor software read, CLK, set overflow.]

Capture_DR loads TXRXCTRL[31] into DBG_SR[0]. The other bits in DBG_SR are loaded as shown in Figure 13-4. The captured data is scanned out during the Shift_DR state. Care must be taken while scanning in data. While polling TXRXCTRL[31], incorrectly setting DBG_SR[35] or DBG_SR[1] may cause unpredictable behavior following an Update_DR. Update_DR parallel loads DBG_SR[35:1] into DBG_REG[34:0]. Whether the new data gets written to the RX register or an overflow condition is detected depends on the inputs to the RX write logic.

13.11.6.1 RX Write Logic

The RX write logic (Figure 13-6) serves 4 functions:

1) Enable the debugger write to RX: the logic ensures only new, valid data from the debugger is written to RX. In particular, when the debugger polls TXRXCTRL[31] to see whether the debug handler has read the previous data from RX, the JTAG state machine must go through Update_DR, which should not modify RX.

2) Clear DBG_REG[34]: mainly to support high-speed download. During high-speed download, the debugger continuously scans in data to send to the debug handler and sets DB
34. [Figure residue: ... VA[31:5]]

For Load Main IC and Load Mini IC, 8 additional data packets are used to specify 8 ARM instructions to be loaded into the target instruction cache. Bits[31:0] of the data packets contain the data to download. Bit[32] of each data packet is the value of the parity for the data in that packet.

As shown in Figure 13-11, the first bit shifted in TDI is bit 0 of the first packet. After each 33-bit packet, the host must take the JTAG state machine into the Update_DR state. After the host does an Update_DR and returns the JTAG state machine back to the Shift_DR state, the host can immediately begin shifting in the next 33-bit packet.

13.14.4 Loading IC During Reset

Code can be downloaded into the instruction cache through JTAG during a processor reset. This feature is used during software debug to download the debug handler prior to starting an application program. The downloaded handler can then intercept the reset vector and do any necessary setup before the application code executes.

In general, any code downloaded into the instruction cache through JTAG must be downloaded to addresses that are not already valid in the instruction cache. Failure to meet this requirement results in unpredictable behavior by the processor. During a processor
35. A write-through policy instructs the Data Cache to keep external memory coherent by performing stores to both external memory and the cache. A write-back policy only updates external memory when a line in the cache is cleaned or needs to be replaced with a new line. Generally, write-back provides higher performance because it generates less data traffic to external memory. More details on cache policies can be found in Section 6.2.3, "Cache Policies" on page 6-5.

Memory Operation Ordering

A fence memory operation (memop) is one that guarantees all memops issued prior to the fence execute before any memop issued after the fence. Thus software may issue a fence to impose a partial ordering on memory accesses. Table 3-3 on page 3-4 shows the circumstances in which memops act as fences. Any swap (SWP or SWPB) to a page that would create a fence on a load or store is a fence.

Table 3-3. Memory Operations that Impose a Fence

    operation        X    C    B
    load             -    0    -
    store            1    0    1
    load or store    0    0    0

Exceptions

The MMU may generate prefetch aborts for instruction accesses and data aborts for data memory accesses. The types and priorities of these exceptions are described in Section 2.3.4, "Event Architecture" on page 2-12. Data address alignment checking is enabled by setting bit 1 of the Control Register (CP15, register 1). Alignment faults are still reported even if the MMU is disabled. All other MMU exceptions are disabled when the
36. ARM Architecture Reference Manual (ARMv5T).

Operation:
    If DCSR[31] = 0, BKPT is a nop.
    If DCSR[31] = 1, BKPT causes a debug exception.

The processor handles the software breakpoint as described in Section 13.5, "Debug Exceptions" on page 13-6.

13.8 Transmit/Receive Control Register (TXRXCTRL)

Communications between the debug handler and debugger are controlled through handshaking bits that ensure the debugger and debug handler make synchronized accesses to TX and RX. The debugger side of the handshaking is accessed through the DBGTX (Section 13.11.4, "DBGTX JTAG Register") and DBGRX (Section 13.11.6, "DBGRX JTAG Register") JTAG Data Registers, depending on the direction of the data transfer. The debug handler uses separate handshaking bits in the TXRXCTRL register for accessing TX and RX.

The TXRXCTRL register also contains two other bits that support high-speed download. One bit indicates an overflow condition that occurs when the debugger attempts to write the RX register before the debug handler has read the previous data written to RX. The other bit is used by the debug handler as a branch flag during high-speed download.

All of the bits in the TXRXCTRL register are placed such that they can be read directly into the CC flags in the CPSR with an MRC (with Rd = PC). The subsequent instruction can
37.
    ... C1, c0, 0                ; read CCNT register
    MRC P14, 0, R4, C2, c0, 0    ; read PMN0 register
    MRC P14, 0, R5, C3, c0, 0    ; read PMN1 register
    <process the results>
    SUBS PC, R14, #4             ; return from interrupt

As an example, assume the following values in CCNT, PMN0, PMN1 and PMNC:

Example 12-3. Computing the Results

    ; Assume CCNT overflowed
    CCNT = 0x0000,0020    ; Overflowed and continued counting
    Number of instructions executed = PMN0 = 0x6AAA,AAAA
    Number of instruction cache miss requests = PMN1 = 0x0555,5555
    Instruction Cache miss-rate = 100 * PMN1/PMN0 = 5%
    CPI = (CCNT + 2^32)/Number of instructions executed = 2.4 cycles/instruction

In the contrived example above, the instruction cache had a miss-rate of 5% and CPI was 2.4.

Software Debug 13

This chapter describes software debug and related features in the Intel 80200 processor based on Intel XScale microarchitecture (compliant with ARM Architecture V5TE), namely:
• debug modes, registers and exceptions
• a serial debug communication link via the JTAG interface
• a trace buffer
• a mechanism to load the instruction cache through JTAG
• Debug Handler SW requirements and suggestions

13.1 Definitions

debug handler: an event handler that runs on the Intel 80200 processor when a debug event occurs.
debugger: software that runs on a host system outside of the Intel 80200 processor.

13.2 Debug Registers

CP1
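The arithmetic in Example 12-3 can be checked on a host. The C sketch below recomputes both statistics from the counter values quoted in the example; the function names are hypothetical, and passing the number of CCNT overflows as a parameter is a generalization made for this sketch (the example itself covers a single overflow).

```c
#include <assert.h>
#include <stdint.h>

/* Instruction cache miss rate in percent: 100 * PMN1 / PMN0. */
double icache_miss_rate_percent(uint32_t pmn0_instructions, uint32_t pmn1_misses)
{
    assert(pmn0_instructions != 0);
    return 100.0 * (double)pmn1_misses / (double)pmn0_instructions;
}

/* Cycles per instruction; each CCNT overflow contributes 2^32 cycles. */
double cycles_per_instruction(uint32_t ccnt, unsigned overflows, uint32_t instructions)
{
    assert(instructions != 0);
    uint64_t cycles = (uint64_t)ccnt + ((uint64_t)overflows << 32);
    return (double)cycles / (double)instructions;
}
```

With CCNT = 0x00000020 after one overflow, PMN0 = 0x6AAAAAAA and PMN1 = 0x05555555, these functions return values within rounding of 5% and 2.4 cycles per instruction, matching the example.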
38. The BCU does not write to these ELOGx/ECARx registers unless the corresponding BCUCTL Ex bit is cleared, either by reset or by software.

Table 11-6. ECTST (Register 8)

    reset value: all implemented bits are 0

    Bits    Access                              Description
    31      Read-unpredictable / Write          Disable Write ECC (DWE): Disables the Read-Modify-Write for sub-64-bit stores. If written as a 1, even if ECC is enabled, then the 80200 will not perform the usual ECC actions when writing data.
    30:8    Read-unpredictable / Write-as-0     reserved
    7:0     Read / Write                        mask: When writing, the BCU exclusive-ORs this mask with the ECC code it generates.

Software can generate data with incorrect ECC values for validation purposes. By setting a bit in the ECTST mask to 1, the corresponding bit in all generated ECCs is inverted during a write to memory. Subsequent reads of that memory generate an ECC error.

Bit 31 in ECTST is defined as the Disable Write ECC (DWE) bit in the C-0 and D-0 steppings of the 80200. If DWE is written as a 1, even if ECC is enabled, then the 80200 will not perform the usual ECC actions when writing data. In other words, if DWE = 1 and a write is performed to ECC-protected memory, the Bus Controller Unit (BCU) will not perform a RMW. The value on
39. Coprocessor 14 was also created. Enhancements were made to the Event Architecture: instruction cache and data cache parity error exceptions, breakpoint events, and imprecise external data aborts.

DSP Coprocessor 0 (CP0)

The Intel 80200 processor adds a DSP coprocessor to the architecture for the purpose of increasing the performance and the precision of audio processing algorithms. This coprocessor contains a 40-bit accumulator and new instructions.

The 40-bit accumulator is referenced by several new instructions that were added to the architecture: MIA, MIAPH and MIAxy are multiply/accumulate instructions that reference the 40-bit accumulator instead of a register-specified accumulator. MAR and MRA provide the ability to read and write the 40-bit accumulator.

Access to CP0 is always allowed in all processor modes when bit 0 of the Coprocessor Access Register is set. Any access to CP0 when this bit is clear causes an undefined exception. (See Section 7.2.15, "Register 15: Coprocessor Access Register" on page 7-18 for more details.) Note that only privileged software can set this bit in the Coprocessor Access Register.

The 40-bit accumulator needs to be saved on a context switch if multiple processes are using it.

Two new instruction formats were added for coprocessor 0: Multiply with Internal Accumulate Format and Internal Accumulate Access Format. The formats and instructions are described next.
40.
    13.3.1 Halt Mode
    13.3.2 Monitor Mode
13.4 Debug Control and Status Register (DCSR)
    13.4.1 Global Enable Bit (GE)
    13.4.2 Halt Mode Bit (H)
    13.4.3 Vector Trap Bits (TF, TI, TD, TA, TS, TU, TR)
    13.4.4 Sticky Abort Bit (SA)
    13.4.5 Method of Entry Bits (MOE)
    13.4.6 Trace Buffer Mode Bit (M)
    13.4.7 Trace Buffer Enable Bit (E)
13.5 Debug Exceptions
    13.5.1 Halt Mode
    13.5.2 Monitor Mode
13.6 HW Breakpoint Resources
    13.6.1 Instruction Breakpoints
    13.6.2 Data Breakpoints
13.7 Software Breakpoints
13.8 Transmit/Receive Control Register (TXRXCTRL)
    13.8.1 RX Register Ready Bit (RR)
    13.8.2 Overflow Flag (OV)
    13.8.3 Download Flag (D)
    13.8.4 TX Register Ready Bit (TR)
    13.8.5 Conditional Execution Using TXRXCTRL
41. ERAUS 32 13 10 LDIC JTAG Data Register Hardware 35 13 11 Format of LDIC Cache Functions esee eee ren irese anena p ER ESAR REE beta the EE e hose one E 37 13 12 Code Download During a Cold Reset For Debug esee 39 13 13 Code Download During a Warm Reset For Debug essent enne 41 13 14 Downloading Code in IC During Program Rxecupon 43 B 1 Intel 80200 Processor RISC S perpipe etre etr ter Ree Ide ie PL RE CORRER 3 C 1 Test Access Port Block Diagram ett ne bett see ete e Ret e HOY desc CERE ree E Pv dd Pen 2 C 2 TAP Controller State Diagram mem eee E E Lease ERES dE SES 7 C 3 JTAG EX ample c 13 C 4 Timing Diagram Illustrating the Loading of Instruction Register eese 14 C 5 Timing Diagram Illustrating the Loading of Data Register 15 xii March 2003 Developer s Manual i ntel Intel 80200 Processor based on Intel XScale Microarchitecture Tables 2 1 2 2 2 3 2 4 2 5 2 6 2 7 2 9 2 10 2 8 2 11 2 12 2 13 2 14 3 1 3 2 3 3 3 4 7 1 7 2 7 3 7 4 7 5 7 6 7 7 7 8 7 9 7 10 7 11 7 12 7 13 7 14 7 15 7 16 7 17 7 18 7 19 7 20 7 21 7 22 7 23 7 24 7 25 7 26 8 1 8 2 8 3 8 4 Multiply with Internal Accumulate Format 4 MIA T cond accO Rim R8 eniin rtc e ERR HERE EE PERIERE EEE RICE WERE See c 4 MIAPH lt cond gt acc0 Rim R5 annii Rc SES Pe Po sb Re E R EE e Roe CH dE 5 MIAxy lt cond gt acc0
42. External Bus

10.3.5 Write Burst

Figure 10-9 shows a four-word write caused by the eviction of a half cache line. In this case, the Len is 0x5, indicating four words. DValid is asserted for two consecutive cycles here, but the two cycles could be spread out. In this case, the Intel 80200 processor drives the data as requested, along with BE of 0x00 each cycle, indicating that all the bytes are being written.

[Figure 10-9. Four Word Eviction Write: timing diagram, 0 ns to 75 ns, showing MCLK, ADS, Lock/Write/Len, A, DValid, D, BE and Abort.]

10.3.6 Write Burst, Coalesced

Figure 10-10 shows a four-word cache write caused by store requests coalesced in a write buffer. The Len is 0x5, indicating four words. DValid is asserted for two consecutive cycles. The Intel 80200 processor drives the data as requested, but this time the byte enables are not all zeroes. The byte enables here are asserted low only for those bytes that were stored by the instruction stream. Any possible combination of byte enables can occur, with only the requirement that the first and last data cycles have at least o
43. HER Ry ERR E RR 1 6 1 2 Mini Data Cache Overview esses enne aea nnne nnne nnns en 3 6 1 3 Write Buffer and Fill Buffer COvervlew 4 Data Cache and Mini Data Cache Operation seessseessieesiresseessrrsritstrnstirtstrnntinnntensttnnnrnntennnennn 5 6 2 1 Operation When Caching is Enabled 5 6 2 2 Operation When Data Caching is Disabled eeseesseeseeesreerereerrennrernttrntrrrnennene 5 6 2 3 Cache PoliGiBS 2 te aaeoa E rti eic bee ao dt b redacta lieu da 5 6 2 3 1 Ehel ET 5 6 2 3 2 Road Ecaunlm 6 March 2003 Developer s Manual In m tel Intel 80200 Processor based on Intel XScale Microarchitecture 6 2 3 3 Write Miss Policy inercia criar tensa cede aTe in ne corona 7 6 2 3 4 Write Back Versus Write Through eeeen 7 6 2 4 Round Robin Replacement Algorithm AA 8 6 2 5 Parity aerei O 8 6 2 6 rel ee 8 6 3 Data Cache and Mini Data Cache Control 9 6 3 1 Data Memory State After Heset enne enne nennen 9 6 3 2 Enabling Disablirig s con iiri irr RE et PO XR ORER UE EERS 9 6 3 3 Invalidate A Clean Operations nennen nnne nnne 9 6 3 3 1 Global Clean and Invalidate Operation 10 6 4 Re configuring the Data Cache as Data DAME 12 6 5 Write Buffer Fill Buffer Operation and Control 16 7 je 1 7 1 I HE 1 7 2 CP1
44. MMU is disabled.

3.3 Interaction of the MMU, Instruction Cache, and Data Cache

The MMU, instruction cache, and data/mini-data cache may be enabled/disabled independently. The instruction cache can be enabled with the MMU enabled or disabled. However, the data cache can only be enabled when the MMU is enabled. Therefore only three of the four combinations of the MMU and data/mini-data cache enables are valid. The invalid combination causes undefined results.

Table 3-4. Valid MMU & Data/mini-data Cache Combinations

    MMU    Data/mini-data Cache
    Off    Off
    On     Off
    On     On

3.4 Control

3.4.1 Invalidate (Flush) Operation

The entire instruction and data TLB can be invalidated at the same time with one command, or they can be invalidated separately. An individual entry in the data or instruction TLB can also be invalidated. See Table 7-13, "TLB Functions" on page 7-13 for a listing of commands supported by the Intel 80200 processor.

Globally invalidating a TLB does not affect locked TLB entries. However, the invalidate-entry operations can invalidate individual locked entries. In this case, the locked entry remains in the TLB
45. Note that STM and LDM each count as several accesses to the data cache, depending on the number of registers specified in the register list. LDRD registers two accesses.

PMN1 counts the number of data cache and mini-data cache misses. Cache operations do not contribute to this count. See Section 7.2.8 for a description of these operations.

The statistic derived from these two events is:
• Data cache miss-rate. This is derived by dividing PMN1 by PMN0.

Instruction Fetch Latency Mode

PMN0 accumulates the number of cycles when the instruction cache is not able to deliver an instruction to the Intel 80200 processor due to an instruction cache miss or instruction TLB miss. This event means that the processor core is stalled.

PMN1 counts the number of instruction fetch requests to external memory. Each of these requests loads 32 bytes at a time. This is the same event as measured in instruction cache efficiency mode and is included in this mode for convenience, so that only one performance monitoring run is needed.

Statistics derived from these two events:
• The average number of cycles the processor stalled waiting for an instruction fetch from external memory to return. This is calculated by dividing PMN0 by PMN1. If the average is high, then the Intel 80200 processor may be starved of the bus external to the Intel 80200 processor.
• The percentage of total execution cycles the processor stalled waiting on an instruction fetch from external me
46. P Domain O0 C B 1 0 Fine page table base address SBZ P Domain SBZ UN Table 2 9 Second level Descriptors for Coarse Page Table 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 131109 8 7 6 54 3 2 1 0 SBZ 0 0 Large page base address TEX AP3 AP2 AP1 APO C B 0 1 Small page base address AP3 AP2 AP1 APO C B 1 0 Extended small page base address SBZ TEX AP C B 1 1 Table 2 10 Second level Descriptors for Fine Page Table 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 1110 9 8 7 6 5 43 2 1 0 SBZ 0 0 Large page base address TEX AP3 AP2 AP1 APO C B0 1 Small page base address AP3 AP2 AP1 APO C B 1 0 Tiny Page Base Address TEX AP C B 1 1 The P bit controls ECC The TEX Type Extension field is present in several of the descriptor types In the Intel 80200 processor only the LSB of this field is used this is called the X bit A Small Page descriptor does not have a TEX field For these descriptors TEX is implicitly zero that is they operate as if the X bit had a 0 value The X bit when set modifies the meaning of the C and B bits Description of page attributes and their encoding can be found in Chapter 3 Memory Management 2 10 March 2003 Developer s Manual m Intel 80200 Processor based on Intel XScale Microarchitecture I n D Prog
47. A simpler but lower-performance method would be to assert Hold to the Intel 80200 processor, wait for all outstanding transactions to complete, grant the issue bus to the alternate master (using the issue bus pins with the Intel 80200 processor bus protocols, or whatever protocol the alternate master required), and give the bus back to the Intel 80200 processor only once the alternate bus master is completely finished.

The Lock signal is active in transactions which require atomicity on a read-write pair. The minimum level of memory granularity over which a lock can be held should be at least 32 bytes. That is, if a master on the bus asserts Lock on a particular address, the naturally aligned 32-byte block that contains that address should be considered protected from access by other bus masters. It is permissible, of course, to consider all of memory locked and to hold off all accesses by other bus masters.

The interaction of Hold with the Lock signal is interesting. If the Lock pin is asserted by the Intel 80200 processor, the Intel 80200 processor is executing an ARM Swap instruction, which does a read from a memory location followed atomically by a write to the same location. This is used for updating semaphores in shared memory. Until a request appears that does not have the Lock signal asserted, the chipset should not allow accesses of any kind to the memory locati
48. Rm Ra 6 Internal Accumulator Access Format nennen nennen eene s Aea Ease E nennen ene 7 MAR lt cond gt acc0 RdLo Rad 8 MRA lt cond gt RdLo RdHi Act sisipin eenean eoi ei rE EEE Eo pE dre eR PUR EHE REN Ca 8 Second level Descriptors for Coarse Page Table 10 Second level Descriptors for Fine Page Table 10 te TR ek OC 10 Exception utnmalty EE 12 Byent PHOLUDyS e e eiecit iem MM dE PN TRUE EE 12 Intel 80200 Processor Encoding of Fault Status for Prefetch Aborts esses 13 Intel 80200 Processor Encoding of Fault Status for Data Aborts sessessseeeeeen 14 Data Cache and Buffer Behavior when X 0 nennen rennen rennen 3 Data Cache and Buffer Behavior when X cl 3 Memory Operations that Impose a Pence 4 Valid MMU amp Data mini data Cache Combmnatong ccc eeeeeeseeeceeeeeeeeseceseeseeseseaeeseeesesseeeaesneseaeenees 5 MRC MCR Form l 8 d onere Severed eon cues te 2 TD eA PC ipd E S 3 CP15 Reeislets iei teh GRO bo esit A ee D Pott Ee LAE eee ee E Pega ese 4 BR M 5 Cache Type Register Dr RI EE EE tp iit ree e DR e ett i iones 5 ARM Control uic 7 Auxiliary Control R glster 5 6 tre fret pee ere EE DH HER ERE ERU Re S Cte Pee suena Fe ec pa oen 8 Translation KEE 9 Domain Access Control Register isisi sonona eene eene nennen nne eter eter et
49. STRD ARM v5 Sticky overflow flag in SPSR for saturated math e DSP extensions SMLAxy SMLAWy SMLALxy SMULxy SMULWy QADD S QDADD QSUB QDSUB CLZ ARM Thumb transfer instructions Thumb H Floating Point Instructions Tiny pages 26 bit code D Big Endian H Little Endian Developer s Manual March 2003 A 1 Intel 80200 Processor based on Intel XScale Microarchitecture Compatibility Intel 80200 Processor vs SA 110 Intel i E Intel 80200 Feature Parameter Brief Description or Note SA 110 Processor Main Execution Pipeline Scalar in order execution single issue S PE Pipeline with more than usual number of pipe S RISC Superpipeline stages Allows greater operating frequency 5 stage 7 stage e Out of order completion Instructions may finish out of program order Concurrent execution in 3 Instructions may occupy the MAC ALU and S pipes data cache pipes concurrently E Number of cycles ALU takes to complete a ALU instruction 1 1 e Allows instructions in different pipelines to s Register Scoreboarding execute as long as there are no data hazards eu 128 entry branch table address cache holds e Dynamic Branch predictions c of branches taken and not taken Branch misprediction penalty in cycles 1 4 MAC Pipeline Handles all multiply operations Dedicated 40 bit Allows 256 accumulates before scaling of data is accumulator necessary e 1 cycle 16x32 MAC sustained
50. The caution is to choose data structure sizes and stride requirements that do not overwhelm a given set, causing conflicts and increased register pressure. Register pressure can be increased because additional registers are required to track prefetch addresses. These effects can be reduced by rearranging data structure components to use more parallel access to search and compare elements. Similarly, rearranging sections of data structures so that sections often written fit in the same half cache line (16 bytes for the Intel 80200 processor) can reduce cache eviction write-backs. On a global scale, techniques such as array merging can enhance the spatial locality of the data.

As an example of array merging, consider the following code:

    int a[NMAX];
    int b[NMAX];
    int i, ix;

    for (i = 0; i < NMAX; i++)
    {
        ix = b[i];
        if (a[i] != 0)
            ix = a[i];
        do_other_calculations();
    }

In the above code, data is read from both arrays a and b, but a and b are not spatially close. Array merging can place a and b spatially close:

    struct {
        int a;
        int b;
    } c[NMAX];
    int i, ix;

    for (i = 0; i < NMAX; i++)
    {
        ix = c[i].b;
        if (c[i].a != 0)
            ix = c[i].a;
        do_other_calculations();
    }

As an example of rearranging often-written-to sections in a structure, consider the code sample:

    struct employee {
        struct employee *prev;
        struct employee *next;
        float Year2DatePay;
        float Year2DateTax;
        int ssno;
        int empid;
        float Year2Date401KDed;
        float Year2DateOther
51. acc0, Rm, Rs

Operation:
    if ConditionPassed(<cond>) then
        acc0 = sign_extend(Rm[31:16] * Rs[31:16]) + sign_extend(Rm[15:0] * Rs[15:0]) + acc0[39:0]

Exceptions: none

Qualifiers: Condition Code. S bit is always cleared; no condition code flags are updated.

Notes: Instruction timings can be found in Section 14.4.4, Multiply Instruction Timings. Specifying R15 for register Rs or Rm has unpredictable results. acc0 is defined to be 0b000 on the 80200.

The MIAPH instruction performs two 16-bit signed multiplies on packed half-word data and accumulates these to a single 40-bit accumulator. The first signed multiplication is performed on the lower 16 bits of the value in register Rs with the lower 16 bits of the value in register Rm. The second signed multiplication is performed on the upper 16 bits of the value in register Rs with the upper 16 bits of the value in register Rm. Both signed 32-bit products are sign extended and then added to the value in the 40-bit accumulator (acc0).

The instruction is only executed if the condition specified in the instruction matches the condition code status.

Table 2-4. MIAxy<cond> acc0, Rm, Rs
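The MIAPH operation described above can be modelled in a few lines of C. This is a minimal sketch, not the processor's implementation: the names `miaph` and `wrap40` are hypothetical, the accumulator is held in an `int64_t`, and wrapping at 40 bits on overflow is an assumption that the text above does not confirm.

```c
#include <stdint.h>

/* Hypothetical helper: keep bits 39:0 and sign-extend from bit 39.
 * Wrap-around on overflow is an assumption of this sketch. */
static int64_t wrap40(int64_t v)
{
    uint64_t u = (uint64_t)v & 0xFFFFFFFFFFULL;
    if (u & 0x8000000000ULL)
        u |= ~0xFFFFFFFFFFULL;
    return (int64_t)u;
}

/* Model of MIAPH: two signed 16x16 multiplies on packed half-words,
 * both products added to the 40-bit accumulator. */
int64_t miaph(int64_t acc0, uint32_t rm, uint32_t rs)
{
    /* Upper half-words, treated as signed 16-bit values. */
    int32_t hi = (int32_t)(int16_t)(rm >> 16) * (int32_t)(int16_t)(rs >> 16);
    /* Lower half-words, treated as signed 16-bit values. */
    int32_t lo = (int32_t)(int16_t)(rm & 0xFFFFu) * (int32_t)(int16_t)(rs & 0xFFFFu);
    return wrap40(acc0 + (int64_t)hi + (int64_t)lo);
}
```

For instance, with Rm = 0x00020003 and Rs = 0x00040005 the two products are 2 * 4 and 3 * 5, so an accumulator that starts at zero ends at 23.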
52. after Abort is sampled. At that point the transaction is cancelled and no further data or DValids are expected for that transaction.

For burst transactions, care must be taken to end the transaction when Abort is asserted. If a cache line fill (four data cycles on a 64-bit bus) has Abort asserted along with the first DValid, the Intel 80200 processor ends the transaction and expects that none of the other data cycles for the burst occur. If the chipset or memory were to assert DValid at that point and continue to send data, the Intel 80200 processor would assume that the data was associated with the next read request after the aborted cache line fill. If there had been no read request after the cache line fill, those extra data cycles would cause unpredictable results.

Abort must not be asserted for two consecutive cycles. In other words, back-to-back aborts are not permitted and cause incorrect operation if they occur. The Data Bus portion of the bus must have a dead cycle after any abort cycle. That is, DValid must not be asserted in the cycle after Abort has been asserted.

10.2.7 ECC

Software running on the Intel 80200 processor may configure pages in memory as being ECC protected. For such pages, the Intel 80200 processor checks the ECC code associated with read data and generates a
53. all imprecise data aborts is undefined and R14_ABORT is the address of the next instruction to execute + 4, which is the same for both ARM and Thumb mode.

The Intel 80200 processor generates external data aborts on multi-bit ECC errors and when the Abort pin is asserted on memory transactions. See Chapter 11, Bus Controller, for more details. An external data abort can occur on non-cacheable loads, reads into the cache, cache evictions, or stores to external memory.

Although the Intel 80200 processor guarantees the Base Restored Abort Model for precise aborts, it cannot do so in the case of imprecise aborts. A Data Abort handler may encounter an updated base register if it is invoked because of an imprecise abort.

Imprecise data aborts may create scenarios that are difficult for an abort handler to recover from. Both external data aborts and data cache parity errors may result in corrupted data in the targeted registers. Because these faults are imprecise, it is possible that the corrupted data has been used before the Data Abort fault handler is invoked. Because of this, software should treat imprecise data aborts as unrecoverable.

Note that even memory accesses marked as "stall until complete" (see Section 3.2.2.4) can result in imprecise data aborts. For these types of accesses, the fault is somewhat less imprecise than the general case: it is guaranteed to be raised within three instructions of the instruction that caused it. In other words, i
54. appropriate value scanned in for the DCSR and DBG.BRK.

DBG.HLD_RST can only be cleared by scanning in a 0 to DBG_SR[1] and scanning in the appropriate values for the DCSR and DBG.BRK.

13.11.2.2 DBG.BRK

DBG.BRK allows the debugger to generate an external debug break and asynchronously redirect execution to a debug handling routine.

A debugger sets an external debug break by scanning data into the DBG_SR with DBG_SR[2] set and the desired value for the DCSR JTAG-writable bits in DBG_SR[34:3].

Once an external debug break is set, it remains set internally until a debug exception occurs. In Monitor mode, external debug breaks detected during abort mode are pended until the processor exits abort mode. In Halt mode, breaks detected during SDS are pended until the processor exits SDS. When an external debug break is detected outside of these two cases, the processor ceases executing instructions as quickly as possible. This minimizes breakpoint skid by reducing the number of instructions that can execute after the external debug break is requested. However, the processor continues to process any instructions which may have already begun execution. Debug mode is not entered until all processor activity has ceased in an orderly fashion.

13.11.2.3 DBG.DCSR

The DCSR is updated with t
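The bit positions named above (DBG_SR[1] for hold-reset, DBG_SR[2] for the external debug break, DBG_SR[34:3] for the DCSR) suggest how a host tool might assemble the value it shifts in. The following is a sketch only: `seldcsr_scan_value` is a hypothetical name, and leaving bits 0 and 35 at zero is an assumption, since the text here does not say what the host should place there.

```c
#include <stdint.h>

/* Build the value a debugger would scan into DBG_SR for SELDCSR.
 * Bit 1      : DBG.HLD_RST (hold the processor in reset)
 * Bit 2      : DBG.BRK     (request an external debug break)
 * Bits 34:3  : new value for the JTAG-writable DCSR bits
 * Bits 0, 35 : left at zero here (an assumption of this sketch) */
uint64_t seldcsr_scan_value(uint32_t dcsr, int hold_rst, int dbg_brk)
{
    uint64_t v = 0;
    v |= (uint64_t)(hold_rst != 0) << 1;
    v |= (uint64_t)(dbg_brk != 0) << 2;
    v |= (uint64_t)dcsr << 3;
    return v;
}
```

Because all three fields travel in the same scan, clearing DBG.HLD_RST means calling this with `hold_rst` set to 0 while still passing the DCSR and break values the debugger wants to keep.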
55. are in use. This happens when more than four loads are outstanding and are being fetched from memory. As a result, code should ensure that no more than four loads are outstanding at the same time. For example, the number of loads issued sequentially should not exceed four. Also note that a preload instruction may cause a fill buffer to be used, so outstanding preload instructions must be counted toward the number of outstanding loads.

Similarly, the number of write buffers limits the number of successive writes that can be issued before the processor stalls. No more than eight stores can be issued. Also note that if the data caches use a write-allocate, write-back policy, a load operation may cause stores to external memory if the read evicts a dirty (modified) cache line. The number of sequential stores may be limited by this fact.

B.5.1.1 Scheduling Load and Store Double (LDRD/STRD)

The Intel 80200 processor introduces two new double-word instructions: LDRD and STRD. LDRD loads 64 bits of data from an effective address into two consecutive registers; conversely, STRD stores 64 bits from two consecutive registers to an effective address. There are two important restrictions on how these instructions may be used:
• the effective address must be aligned on an 8-byte boundary
• the
56. arise from load-use dependencies. A load-use dependency gives rise to a LUD if the result of the load instruction cannot be made available by the pipeline in due time for the subsequent instruction. An optimizing compiler should find independent instructions to fill the slot following the load.

Certain instructions incur a few extra cycles of delay on the Intel 80200 processor as compared to first generation Intel StrongARM processors (LDM, STM).
• Decode and register file lookups are spread out over 2 cycles in the Intel 80200 processor, instead of 1 cycle in predecessors.

B.2.1.2 Intel 80200 Processor Pipeline Organization

The Intel 80200 processor single-issue superpipeline consists of a main execution pipeline, a MAC pipeline, and a memory access pipeline. These are shown in Figure B-1, with the main execution pipeline shaded.

Figure B-1. Intel 80200 Processor RISC Superpipeline (memory pipeline, main execution pipeline, MAC pipeline)

Table B-1 gives a brief description of each pipe stage.

Table B-1. Pipelines and Pipe stages
Pipe / Pipestage | Description | Covered In
Main Execution Pipeline | Handles data processing instructions | Section B.2.3
IF1/IF2 | Instruction Fetch |
ID | Instruction Decode |
RF | Register File / Operand Shifter |
X1 | ALU Execute |
57. attempted to download the next word before the debugger read the previous word. More details on the Download bit, Overflow flag and high-speed download in general can be found in Section 13.8, Transmit/Receive Control Register (TXRXCTRL).

Following is example code showing how the Download bit and Overflow flag are used in the debug handler:

    hs_write_word_loop:
    hs_write_overflow:
        bl    read_RX                  ; read data word from host
        ; read TXRXCTRL into the CCs
        mrc   p14, 0, r15, c14, c0, 0
        bcc   hs_write_done            ; if D bit clear, download complete, exit loop
        beq   hs_write_overflow        ; if overflow detected, loop until host clears D bit
        str   r0, [r6], #4             ; store only if there is no overflow
        b     hs_write_word_loop       ; get next data word
    hs_write_done:
        ; after the loop, if the overflow flag was set, return error message to host
        moveq r0, #OVERFLOW_RESPONSE
        beq   send_response
        b     write_common_exit

13.15.3 Ending a Debug Session

Prior to ending a debug session, the debugger should take the following actions:
• Clear the DCSR (disable debug, exit Halt Mode, clear all vector traps, disable the trace buffer)
• turn off all breakpoints
• invalidate the mini instruction cache
• invalidate the main instruction cache
• invalidate the BTB

These actions ensure that the application program executes correctly after the debugger has been
58. because all the system resources to transfer data are quickly allocated and there are no instructions that can profitably be executed. On the other end of the scale, compute-bound loops allow complete hiding of all data transfer latencies.

Low Number of Iterations

Loops with very low iteration counts may have the advantages of prefetch completely mitigated. A loop with a small fixed number of iterations may be faster if the loop is completely unrolled rather than trying to schedule prefetch instructions.

B.4.4.6 Bandwidth Limitations

Overuse of prefetches can usurp resources and degrade performance. This happens because once the bus traffic requests exceed the system resource capacity, the processor stalls. The Intel 80200 processor data transfer resources are:
• 4 fill buffers
• 4 pending buffers
• 8 half-cache-line write buffers

SDRAM resources are typically:
• 4 memory banks
• page buffer per bank referencing a 4K address range
• 4 transfer request buffers

Consider how these resources work together. A fill buffer is allocated for each cache read miss. A fill buffer is also allocated for each cache write miss if the memory space is write allocate, along with a pending buffer. A subsequent read to the same cache line does not require a new fill buffer, but does require a pending buffer, and a subseq
59. c10, c8, 1

Register 11-12: Reserved

These registers are reserved. Reading and writing them yields unpredictable results.

7.2.13 Register 13: Process ID

The Intel 80200 processor supports the remapping of virtual addresses through a Process ID (PID) register. This remapping occurs before the instruction cache, instruction TLB, data cache and data TLB are accessed. The PID register controls when virtual addresses are remapped and to what value.

The PID register is a 7-bit value that is ORed with bits 31:25 of the virtual address when they are zero. This effectively remaps the address to one of 128 "slots" in the 4 Gbytes of address space. If bits 31:25 are not zero, no remapping occurs. This feature is useful for operating system management of processes that may map to the same virtual address space. In those cases, the virtually mapped caches on the Intel 80200 processor would not require invalidating on a process switch.

Table 7-17. Accessing Process ID
Function | opcode_2 | CRm | Instruction
Read Process ID Register | 0b000 | 0b0000 | MRC p15, 0, Rd, c13, c0, 0
Write Process ID Register | 0b000 | 0b0000 | MCR p15, 0, Rd, c13, c0, 0

Table 7-18. Process ID Register
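The remapping rule in the Process ID description can be written out directly. This is an illustrative sketch; the function name `pid_remap` is hypothetical and the model ignores everything except the address arithmetic stated above.

```c
#include <assert.h>
#include <stdint.h>

/* Model of PID remapping: the 7-bit PID is ORed into bits 31:25 of the
 * virtual address, but only when those bits are all zero. */
uint32_t pid_remap(uint32_t va, uint32_t pid)
{
    assert(pid < 128u);                 /* PID is a 7-bit value */
    if ((va & 0xFE000000u) != 0u)       /* bits 31:25 non-zero: no remapping */
        return va;
    return va | (pid << 25);            /* select one of 128 slots of 32 Mbytes */
}
```

With PID 3, virtual address 0x00001000 becomes 0x06001000, while an address such as 0x02000000 already has a non-zero bit in 31:25 and passes through unchanged.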
60. code does not need to be specially mapped out to avoid this problem. Also, space for dynamic functions does not need to be allocated in the mini instruction cache, and dynamic functions are not limited to the size allocated.

The dynamic function can actually be downloaded anywhere in the address space. The debugger specifies the location of the dynamic function by writing the address to RX when it signals to the handler to continue. The debug handler then does a branch-and-link to that address.

If the dynamic function is already downloaded in the main instruction cache, the debugger immediately downloads the address, signalling the handler to continue.

The static Debug Handler only needs to support one dynamic function command. Multiple dynamic functions can be downloaded to different addresses, and the debugger uses the function's address to specify which dynamic function to execute.

Since the dynamic function is being downloaded into the main instruction cache, the downloaded code may overwrite valid application code, and conversely, application code may overwrite the dynamic function. The dynamic function is only guaranteed to be in the cache from the time it is downloaded to the time the debug handler returns to the application (or the debugger overwrites it).

External memory

Dynamic functions can also be downloaded to external memory (or they may already exist there). The debugger can download to external memory using the write-memory comman
61. contents are unaffected. The trace buffer captures a trace up to the processor reset.

The trace buffer does not capture reset events or debug exceptions.

Since the trace buffer is cleared out before it is used, all entries are initially 0b0000 0000. In fill-once mode, these 0s can be used to identify the first valid entry in the trace buffer. In wrap-around mode, in addition to identifying the first valid entry, these 0 entries can be used to determine whether a wrap-around occurred.

As the trace buffer is read, the oldest entries are read first. Reading a series of 5 or more consecutive 0b0000 0000 entries in the oldest entries indicates that the trace buffer has not wrapped around and the first valid entry is the first non-zero entry read out.

Reading 4 or fewer consecutive 0b0000 0000 entries requires a bit more intelligence in the host SW. The host SW must determine whether these 0s are part of the address of an indirect branch message, or whether they are part of the 0b0000 0000 that the trace buffer was initialized with. If the first non-zero message byte is an indirect branch message, then these 0s are part of the address, since the address is always read before the indirect branch message (see Section 13.13.1.3, Address Bytes). If the first non-zero entry is any other type of message byte, then these 0s indi
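The first step of the host-side check described above is simply to count the 0b0000 0000 entries at the old end of the buffer. A minimal sketch follows; the function names are hypothetical, the entries are assumed to be supplied oldest first, and the harder case of 4 or fewer zeros (which needs the message type of the first non-zero byte) is deliberately left to the caller.

```c
#include <stddef.h>
#include <stdint.h>

/* Count consecutive zero entries at the start of a trace read-out.
 * entries[0] is assumed to be the oldest entry. */
size_t leading_zero_entries(const uint8_t *entries, size_t n)
{
    size_t i = 0;
    while (i < n && entries[i] == 0x00)
        i++;
    return i;
}

/* Returns 1 when the zero count alone shows the buffer has not wrapped
 * (5 or more zeros); returns 0 when the host must inspect the first
 * non-zero message byte before drawing a conclusion. */
int zeros_prove_no_wrap(size_t zero_count)
{
    return zero_count >= 5;
}
```

When `zeros_prove_no_wrap` returns 1, the index returned by `leading_zero_entries` is also the position of the first valid entry.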
62. continuing.

The debugger uses the RX handshaking to signal the debug handler when the download is complete. The debug handler polls the RR bit until it is set. A debugger write to RX automatically sets the RR bit, allowing the handler to proceed.

NOTE: The value written to RX by the debugger is implementation defined. It can be a bogus value signalling the handler to continue, or it can be a target address for the handler to branch to.

    loop:
        mrc p14, 0, r15, c14, c0, 0    ; handler waits for signal from debugger
        bpl loop
        mrc p14, 0, r0, c9, c0, 0      ; debugger writes target address to RX
        bx  r0

In a very simple debug handler stub, the above parts may form the complete handler downloaded during reset (with some handler entry and exit code). When a debug exception occurs, routines can be downloaded as necessary. This basically allows the entire handler to be dynamic.

Another possibility is for a more complete debug handler to be downloaded during reset. The debug handler may support some operations, such as read memory, write memory, etc. However, other operations, such as reading or writing a group of CP registers, can be downloaded dynamically. This method could be used to dynamically download infrequently used debug handler functions, while the more common operations remain static in the mini instruction cache.

The Intel Debug Handler is a complete debug handler that implements the more commonly used functions and allows less frequently
63. data (example: a byte read), the other lines should be driven with some binary value.

If a read is directed to a slave that implements ECC, a full bus width of valid data (64 bits) must be returned without regard for the requested size. For example, even if just a byte is requested from ECC memory, the memory should still return eight bytes of data. This restriction guarantees that the Intel 80200 processor will be able to compute ECC on the data (see Section 10.2.7 for more information about ECC).

DValid for a transaction may be asserted no earlier than the clock after the transaction request. In other words, if the Intel 80200 processor asserts ADS at cycle N, it must not receive a matching DValid until at least cycle N+2. See Figure 10-11 for an example of this minimum wait timing.

For a single word read returning to the core, the transaction would consist of the DValid signal being asserted high and sampled on clock edge n, and the data being sampled by the Intel 80200 processor from the D bus on clock edge n+2 (see Figure 10-4).

For a read burst request, multiple data cycles are needed. On a 64-bit bus, an eight-word burst read would be four data cycles. The data cycles are independent and can occur back to back or spread out with any delay between the cycles. Each cycle consists of DValid being asserted, followed two cycles later by the corresponding data. This can be overlapped such that DValid for one data cycle is being asserted while t
64. data cache is typically modified once and then written back out to external memory.

Parity Protection

The data cache and mini-data cache are protected by parity to ensure data integrity; there is one parity bit per byte of data. (The tags are NOT parity protected.) When a parity error is detected on a data/mini-data cache access, a data abort exception occurs. Before servicing the exception, hardware sets bit 10 of the Fault Status Register.

A data/mini-data cache parity error is an imprecise data abort, meaning R14_ABORT - 8 may not point to the instruction that caused the parity error. If the parity error occurred during a load, the targeted register may be updated with incorrect data.

A data abort due to a data/mini-data cache parity error may not be recoverable if the data address that caused the abort occurred on a line in the cache that has a write-back caching policy. Prior updates to this line may be lost; in this case the software exception handler should perform a "clean and clear" operation on the data cache, ignoring subsequent parity errors, and restart the offending process. This operation is shown in Section 6.3.3.1.

Atomic Accesses

The SWP and SWPB instructions generate an atomic load and store operation, allowing a memory semaphore to be loaded and altered without interruption. These accesses may hit or miss the data/mini-data cache depending on configuration of the cache, configuration of the MMU, and the page attr
65. executed. See Figure C-3 for a JTAG example. The steps are:

1. Load the sample/preload instruction into the Instruction Register:
   a. Select the Instruction register scan.
   b. Use the Shift-IR state four times to read the least through most significant instruction bits into the instruction register (we do not care that the old instruction is being shifted out of the TDO pin).
   c. Enter the Update-IR state to make the instruction take effect.
   d. Exit the Instruction register.
2. Capture and shift the data onto the TDO pin:
   a. Select the Data register scan state.
   b. Capture the pin information into the n-stage Boundary Scan register.
   c. Enter and stay in the Shift-DR state for n cycles. These TDO values are compared against expected data to determine if component operation and connection are correct. Record the TDO value after each cycle. New serial data enters the boundary scan register through the TDI pin, while old data is scanned out.
   d. Pass through the Exit1-DR and Update-DR states to continue.

This example does not make use of the pause states. Those states would be more useful where we do not control the clock directly. The pause states let the clock tick without affecting the shift registers.

The old instruction was abcd in the example. It is known that the original value is the ID code, since the example starts from the reset state. Other times it represents the previous opcode. The new instruction opcode is 0001 (sample/preload). All pins are
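The state names used in these steps belong to the standard IEEE 1149.1 TAP controller, whose next state depends only on the current state and the TMS value sampled on each TCK rising edge. The transition table below is that standard state machine; the identifiers (`tap_state`, `tap_step`, `tap_run`) are hypothetical names chosen for this sketch, and nothing here is specific to the Intel 80200 processor.

```c
#include <stddef.h>

typedef enum {
    TLR, RTI,
    SELECT_DR, CAPTURE_DR, SHIFT_DR, EXIT1_DR, PAUSE_DR, EXIT2_DR, UPDATE_DR,
    SELECT_IR, CAPTURE_IR, SHIFT_IR, EXIT1_IR, PAUSE_IR, EXIT2_IR, UPDATE_IR
} tap_state;

/* tap_next[state][tms]: next state for TMS = 0 and TMS = 1. */
static const tap_state tap_next[16][2] = {
    [TLR]        = { RTI,        TLR       },
    [RTI]        = { RTI,        SELECT_DR },
    [SELECT_DR]  = { CAPTURE_DR, SELECT_IR },
    [CAPTURE_DR] = { SHIFT_DR,   EXIT1_DR  },
    [SHIFT_DR]   = { SHIFT_DR,   EXIT1_DR  },
    [EXIT1_DR]   = { PAUSE_DR,   UPDATE_DR },
    [PAUSE_DR]   = { PAUSE_DR,   EXIT2_DR  },
    [EXIT2_DR]   = { SHIFT_DR,   UPDATE_DR },
    [UPDATE_DR]  = { RTI,        SELECT_DR },
    [SELECT_IR]  = { CAPTURE_IR, TLR       },
    [CAPTURE_IR] = { SHIFT_IR,   EXIT1_IR  },
    [SHIFT_IR]   = { SHIFT_IR,   EXIT1_IR  },
    [EXIT1_IR]   = { PAUSE_IR,   UPDATE_IR },
    [PAUSE_IR]   = { PAUSE_IR,   EXIT2_IR  },
    [EXIT2_IR]   = { SHIFT_IR,   UPDATE_IR },
    [UPDATE_IR]  = { RTI,        SELECT_DR },
};

/* Advance the TAP controller by one TCK with the given TMS value. */
tap_state tap_step(tap_state s, int tms)
{
    return tap_next[s][tms != 0];
}

/* Apply a sequence of TMS values, one per TCK, and return the final state. */
tap_state tap_run(tap_state s, const int *tms, size_t n)
{
    for (size_t i = 0; i < n; i++)
        s = tap_step(s, tms[i]);
    return s;
}
```

Starting from Test-Logic-Reset, the TMS sequence 0, 1, 1, 0, 0 reaches Shift-IR (step 1a and 1b), and two further clocks with TMS high pass through Exit1-IR into Update-IR (step 1c).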
66. feature set allows programmers to select the appropriate features that obtain the best performance for their application. Many of the architectural features added to the Intel 80200 processor help hide memory latency, which often is a serious impediment to high performance processors. This includes:
• the ability to continue instruction execution even while the data cache is retrieving data from external memory
• a write buffer
• write-back caching
• various data cache allocation policies, which can be configured differently for each application
• cache locking
• a pipelined external bus

All these features improve the efficiency of the external bus.

The Intel 80200 processor has been equipped to efficiently handle audio processing through the support of 16-bit data types and 16-bit operations. These audio coding enhancements center around multiply and accumulate operations, which accelerate many of the audio filter operations.

ARM Architecture Compliance

ARM Version 5 (V5) Architecture added floating point instructions to ARM Version 4. The Intel 80200 processor implements the integer instruction set architecture of ARM V5, but does not provide hardware support of the floating point instructions.

The Intel 80200 processor provides the Thumb instruction set (ARM V5T) and the ARM V5E DSP extensions.

Backward compatibility with the first generation of Intel StrongARM products is maintained for user-mode applications. Operati
67. flow change (excluding the current branch). The instruction count includes instructions that were executed and conditional instructions that were not executed due to the condition of the instruction not matching the CC flags. In the case of back-to-back branches, the word count would be 0, indicating that no instructions executed after the last branch and before the current one.

A rollover message is used to keep track of long traces of code that do not have control flow changes. The rollover message means that 16 instructions have executed since the last message byte was written to the trace buffer.

If the incremental counter reaches its maximum value of 15, a rollover message is written to the trace buffer following the next instruction (which is the 16th instruction to execute). This is shown in Example 13-1. The count in the rollover message is 0b1111, indicating that 15 instructions have executed after the last branch and before the current non-branch instruction that caused the rollover message.

Example 13-1. Rollover Messages Examples

                 count = 5
    BL label1    branch message placed in trace buffer after branch executes (count = 0b0101)
                 count = 0
    MOV          count = 1
    MOV          count = 2
    ...
    MOV          count = 14
    MOV          count = 15
    MOV          rollover message placed in trace buffer after 16th instruction executes (count = 0b1111)
                 count = 0

If the 16th instruction is a branch (direct or indirect), the appropriate branch message is placed in the tr
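The counter behaviour walked through in Example 13-1 can be mimicked with a small state machine. This is a sketch under stated assumptions, not a description of the hardware: `trace_counter` and `trace_step` are hypothetical names, only direct branches and plain instructions are modelled, the message byte formats (0b1000 CCCC for a direct branch, 0b1111 1111 for a rollover) are taken from the message table in this manual, and emitting a branch message in place of a rollover when the 16th instruction is a branch is how this sketch reads the sentence above.

```c
#include <stdbool.h>
#include <stdint.h>

#define TRACE_NO_MESSAGE (-1)

typedef struct {
    uint8_t count;   /* incremental word count, 0..15 */
} trace_counter;

/* Account for one executed instruction. Returns the message byte written
 * to the trace buffer, or TRACE_NO_MESSAGE when nothing is written. */
int trace_step(trace_counter *t, bool direct_branch)
{
    if (direct_branch) {
        int msg = 0x80 | t->count;   /* 0b1000 CCCC: direct branch message */
        t->count = 0;
        return msg;
    }
    if (t->count == 15) {            /* this is the 16th instruction */
        t->count = 0;
        return 0xFF;                 /* 0b1111 1111: rollover message */
    }
    t->count++;
    return TRACE_NO_MESSAGE;
}
```

Replaying Example 13-1 with the counter at 5, the BL yields 0x85 and resets the count, the next fifteen MOVs raise it to 15 without writing anything, and the sixteenth MOV yields the rollover byte 0xFF.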
68. from the debugger to the processor must be loaded into DBG.RX with DBG.V set to 1. DBG.RX is loaded from DBG_SR[34:3] when the JTAG enters the Update-DR state. DBG.RX is written to RX following an Update-DR when the RX Write Logic enables the RX register.

DBG.D

DBG.D is provided for use during high-speed download. This bit is written directly to TXRXCTRL[29]. The debugger sets DBG.D when downloading a block of code or data to the Intel 80200 processor system memory. The debug handler then uses TXRXCTRL[29] as a branch flag to determine the end of the loop.

Using DBG.D as a branch flag eliminates the need for a loop counter in the debug handler code. This avoids the problem where the debugger's loop counter is out of synchronization with the debug handler's counter because of overflow conditions that may have occurred.

DBG.FLUSH

DBG.FLUSH allows the debugger to flush any previous data written to RX. Setting DBG.FLUSH clears TXRXCTRL[31].

Debug JTAG Data Register Reset Values

Upon asserting TRST, the DEBUG data register is reset. Assertion of the reset pin does not affect the DEBUG data register. Table 13-13 shows the reset and TRST values for the data register. Note: these values apply for DBG_REG for SELDCSR, DBGTX and DBGRX.

Table 13-13. DEBUG Data Register Reset Values
Bit | TRST | RESET
DBG_REG[0] | 0 | unchanged
DBG_REG[1] | 0 | unchanged
DBG_REG[33:2] | unpredictable | unpredictable
DBG_REG[34] | 0 | unchanged
69. X2 | State Execute |
XWB | Write-back |
Memory Pipeline | Handles load/store instructions | Section B.2.4
D1/D2 | Data Cache Access |
DWB | Data cache writeback |
MAC Pipeline | Handles all multiply instructions | Section B.2.5
M1-M5 | Multiplier stages |
MWB (not shown) | MAC write-back, may occur during M2-M5 |

B.2.1.3 Out Of Order Completion

Sequential consistency of instruction execution relates to two aspects: first, to the order in which the instructions are completed; and second, to the order in which memory is accessed due to load and store instructions. The Intel 80200 processor preserves a weak processor consistency because instructions may complete out of order, provided that no data dependencies exist.

While instructions are issued in order, the main execution pipeline, memory, and MAC pipelines are not lock-stepped, and therefore have different execution times. This means that instructions may finish out of program order. Short "younger" instructions may be finished earlier than long "older" ones. (The term "to finish" is used here to indicate that the operation has been completed and the result has been written back to the register file.)

B.2.1.4 Register Scoreboarding

In certain situations, the pipeline may need to be stalled because of register dependencies between instru
70. incur a 1 cycle stall due to the issue latency of the add instruction, as the shifter operand is shifted by a register. The issue latency can be avoided by changing the code as follows:

    mov r3, #10
    mul r4, r2, r3
    add r5, r6, r2, LSL #10
    sub r7, r8, r2

B.5.3 Scheduling Multiply Instructions

Multiply instructions can cause pipeline stalls due to either resource conflicts or result latencies. The following code segment would incur a stall of 0-3 cycles, depending on the values in registers r1, r2, r4 and r5, due to resource conflicts:

    mul r0, r1, r2
    mul r3, r4, r5

The following code segment would incur a stall of 1-3 cycles, depending on the values in registers r1 and r2, due to result latency:

    mul r0, r1, r2
    mov r4, r0

Note that a multiply instruction that sets the condition codes blocks the whole pipeline. A 4-cycle multiply operation that sets the condition codes behaves the same as a 4-cycle issue operation. Consider the following code segment:

    muls r0, r1, r2
    add  r3, r3, #1
    sub  r4, r4, #1
    sub  r5, r5, #1

The add operation above would stall for 3 cycles if the multiply takes 4 cycles to complete. It is better to replace the code segment above with the following sequence:

    mul r0, r1, r2
    add r3, r3, #1
    sub r4, r4, #1
    sub r5, r5, #1
    cmp r0, #0

Please refer to Section 14.4, Instructi
71. instructions have an issue latency of 2-20 cycles, depending on the number of registers being loaded or stored. The issue latency is typically 2 cycles plus an additional cycle for each of the registers being loaded or stored, assuming a data cache hit. The instruction following an LDM would stall whether or not this instruction depends on the results of the load. An LDRD or STRD instruction does not suffer from this drawback (except when followed by a memory operation) and should be used where possible. Consider the task of adding two 64-bit integer values. Assume that the addresses of these values are aligned on an 8-byte boundary. This can be achieved using the LDM instructions as shown below:

    ; r0 contains the address of the value being copied
    ; r1 contains the address of the destination location
    ldm  r0, {r2, r3}
    ldm  r1, {r4, r5}
    adds r0, r2, r4
    adc  r1, r3, r5

If the code were written as shown above, assuming all the accesses hit the cache, the code would take 11 cycles to complete. Rewriting the code as shown below using the LDRD instruction would take only 7 cycles to complete. The performance would increase further if we can fill in other instructions after LDRD to reduce the stalls due to the result latencies of the LDRD instructions.

    ; r0 contains the address of the value being copied
    ; r1 contains the address of the destination location
    ldrd r2, [r0]
    ldrd r4, [r1]
    adds r0, r2, r4
    adc  r1, r3, r5

Similarly, the code sequence shown below takes
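The adds/adc pair in these listings is the usual way to build a 64-bit addition from 32-bit registers: add the low words, then add the high words plus the carry out of the low addition. The C sketch below spells that out. The type `u64_pair` and function `add64` are hypothetical names, and the assignment of the lower-numbered register to the low word assumes little-endian operation.

```c
#include <stdint.h>

/* A 64-bit value held as two 32-bit words, as it sits in a register pair. */
typedef struct {
    uint32_t lo;   /* e.g. r2 or r4 after the load */
    uint32_t hi;   /* e.g. r3 or r5 after the load */
} u64_pair;

u64_pair add64(u64_pair a, u64_pair b)
{
    u64_pair r;
    r.lo = a.lo + b.lo;               /* adds: low words, wraps modulo 2^32 */
    uint32_t carry = r.lo < a.lo;     /* carry out of the low-word addition */
    r.hi = a.hi + b.hi + carry;       /* adc: high words plus carry */
    return r;
}
```

Adding 0x00000000FFFFFFFF and 1 this way gives a low word of 0 with a carry, so the high word becomes 1.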
72. into instruction TLB

    MCR P15,0,R2,C10,C4,0    ; Translate virtual address (R2) and lock into instruction TLB
    MCR P15,0,R3,C10,C4,0    ; Translate virtual address (R3) and lock into instruction TLB
    CPWAIT
    ; The MMU is guaranteed to be updated at this point; the next instruction will
    ; see the locked instruction TLB entries.

If exceptions are allowed to occur in the middle of this routine, the TLB may end up caching a translation that is about to be locked. For example, if R1 is the virtual address of an interrupt service routine and that interrupt occurs immediately after the TLB has been invalidated, the lock operation is ignored when the interrupt service routine returns back to this code sequence. Software should disable interrupts (FIQ or IRQ) in this case.

As a general rule, software should avoid locking in all other exception types.

The proper procedure for locking entries into the data TLB is shown in Example 3-3.

Example 3-3. Locking Entries into the Data TLB

    ; R1 and R2 contain the virtual addresses to translate and lock into the data TLB
    MCR P15,0,R1,C8,C6,1     ; Invalidate the data TLB entry specified by the virtual address in R1
    MCR P15,0,R1,C10,C8,0    ; Translate virtu
    MCR P15,0,R2,C8,C6,1
    MCR P15,0,R2,C10,C8,0
    CPWAIT
73. into the instruction cache during reset. [Figure 13-1. SELDCSR Hardware: the DBG_SR shift register between TDI and TDO, DBG_REG, the hold_rst and external debug break signals, and DCSR[31:0] with software read/write access.] A Capture_DR loads the current DCSR value into DBG_SR[34:3]. The other bits in DBG_SR are loaded as shown in Figure 13-1. A new DCSR value can be scanned into DBG_SR, and the previous value out, during the Shift_DR state. When scanning a new DCSR value into DBG_SR, care must be taken to also set up DBG_SR[2:1] to prevent undesirable behavior. Update_DR parallel loads the new DCSR value into DBG_REG[33:2]. This value is then loaded into the actual DCSR register. All bits defined as JTAG writable in Table 13-1, "Debug Control and Status Register (DCSR)" on page 13-3 are updated. An external host and the debug handler running on the Intel 80200 processor must synchronize access to the DCSR. If one side writes the DCSR at the same time the other side reads the DCSR, the results are unpredictable.
74. is to count the number of core clock cycles, which is useful in measuring total execution time. The Intel 80200 processor can monitor either occurrence events or duration events. When counting occurrence events, a counter is incremented each time a specified event takes place; when measuring duration, a counter counts the number of processor clocks that occur while a specified condition is true. If any of the three counters overflows, an IRQ or FIQ is generated if it is enabled. IRQ or FIQ selection is programmed in the interrupt controller. Each counter has its own interrupt enable. The counters continue to monitor events even after an overflow occurs, until disabled by software. Each of these counters can be programmed to monitor any one of various events. To further augment performance monitoring, the Intel 80200 processor clock counter can be used to measure the execution time of an application. This information, combined with a duration event, can yield the percentage of time the event occurred with respect to overall execution time. Each of the three counters and the performance monitoring control register are accessible through Coprocessor 14 (CP14), registers 0-3. Refer to Section 7.3.1, "Registers 0-3: Performance Monitoring" on page 7-20 for more details on accessing these registers with MRC, MCR, LDC, and STC coprocessor instructions. Access is allowed in privileged mode only.
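The excerpt above says that the clock counter combined with a duration event gives the share of execution time during which a condition held. A minimal sketch of that arithmetic follows; the function name and the sample counter readings are hypothetical, and the sketch assumes neither counter overflowed during the measurement.

```python
def duration_share(duration_count: int, clock_count: int) -> float:
    """Percentage of elapsed core clocks during which the monitored condition was true.

    duration_count: value read from a counter programmed for a duration event
    clock_count:    value read from the clock counter over the same interval
    Both are assumed not to have overflowed (illustrative simplification).
    """
    if clock_count <= 0:
        raise ValueError("clock_count must be positive")
    return 100.0 * duration_count / clock_count


# Hypothetical readings: the condition held for 250,000 of 1,000,000 core clocks.
share = duration_share(250_000, 1_000_000)
print(f"{share:.1f}% of execution time")
```

In real use the two readings would come from the CP14 registers in privileged mode; that access path is not modeled here.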
75. located functions for cache locking, this approach runs the risk of landing multiple cache ways in one set and few or none in another set. This uneven distribution can lead to excessive thrashing of the Data and Mini Caches.
B.4.2 Data and Mini Cache
The Intel 80200 processor allows the user to define memory regions whose cache policies can be set by the user (see Section 6.2.3, "Cache Policies"). Supported policies and configurations are:
- Non-cacheable with no coalescing of memory writes
- Non-cacheable with coalescing of memory writes
- Mini-data cache with write coalescing, read allocate, and write-back caching
- Mini-data cache with write coalescing, read allocate, and write-through caching
- Mini-data cache with write coalescing, read-write allocate, and write-back caching
- Data cache with write coalescing, read allocate, and write-back caching
- Data cache with write coalescing, read allocate, and write-through caching
- Data cache with write coalescing, read-write allocate, and write-back caching
To support allocating variables to these various memory regions, the tool chain (compiler, assembler, linker, and debugger) must implement named sections. The performance of your application code depends on what cache policy you are using for data objects. A description
76. mode, 1 = fill-once mode. [Table residue: SW Read/Write, JTAG Read-only; Enable (E): 0 = Disabled, 1 = Enabled; reset columns: 0, unchanged.]
13.4.1 Global Enable Bit (GE)
The Global Enable bit disables and enables all debug functionality (except the reset vector trap). Following a processor reset, this bit is clear, so all debug functionality is disabled. When debug functionality is disabled, the BKPT instruction becomes a no-op, and external debug breaks, hardware breakpoints, and non-reset vector traps are ignored.
13.4.2 Halt Mode Bit (H)
The Halt Mode bit configures the debug unit for either halt mode or monitor mode.
13.4.3 Vector Trap Bits (TF, TI, TD, TA, TS, TU, TR)
The Vector Trap bits allow instruction breakpoints to be set on exception vectors without using up any of the breakpoint registers. When a bit is set, it acts as if an instruction breakpoint was set up on the corresponding exception vector. A debug exception is generated before the instruction in the exception vector executes. Software running on the Intel 80200 processor must set the Global Enable bit, and the debugger must set the Halt Mode bit and the appropriate vector trap bit through JTAG, to set up a non-reset vector trap. To set up a reset vector trap, the debugger sets the Halt Mode bit and reset vector trap b
77. mode the Intel 80200 processor does not.
A.3.3 Cacheable (C) and Bufferable (B) Encoding
Table A-1 describes the differences in the encoding of the C and B bits for data accesses. The Intel 80200 processor now follows the ARM definition of the C and B bits when X = 0. The main difference occurs when cacheable and non-bufferable data is specified (C = 1, B = 0): SA-110 uses this encoding for the mini-data cache, and the Intel 80200 processor uses this encoding to specify write-through caching. Another subtle difference is for C = 0, B = 1, where the Intel 80200 processor coalesces stores in the write buffer and SA-110 does not.
Table A-1. C and B Encoding
- C = 1, B = 1. SA-110: cacheable in data cache; store misses can coalesce in write buffer. Intel 80200 processor: cacheable in data cache; store misses can coalesce in write buffer.
- C = 1, B = 0. SA-110: cacheable in mini-data cache; store misses can coalesce in write buffer. Intel 80200 processor: cacheable in data cache, with a write-through policy; store misses can coalesce in write buffer.
- C = 0, B = 1. SA-110: non-cacheable; no coalescing in write buffer, but can wait in write buffer. Intel 80200 processor: non-cacheable; stores can coalesce in the write buffer.
- C = 0, B = 0. SA-110: non-cacheable, no coalescing in the write buffer; SA-110 stalls until this transaction is done. Intel 80200 processor: non-cacheable, no coalescing in the write buffer; Intel 80200 processor does stall until the operation is complete.
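The C and B encodings described above can be captured as a small lookup, which makes it easy to see which encodings behave differently on the two processors. This is only a sketch for X = 0: the dictionary names, the tuple layout (cache placement, whether stores can coalesce in the write buffer), and the strings are illustrative choices, not part of the manual.

```python
# Keys are (C, B) with X = 0. Values are (cache placement, stores can coalesce).
# Both the layout and the wording are assumptions made for this illustration.
SA110 = {
    (1, 1): ("data cache", True),
    (1, 0): ("mini-data cache", True),
    (0, 1): (None, False),   # non-cacheable, no coalescing (may wait in write buffer)
    (0, 0): (None, False),   # non-cacheable, no coalescing
}

I80200 = {
    (1, 1): ("data cache", True),
    (1, 0): ("data cache, write-through", True),
    (0, 1): (None, True),    # non-cacheable, stores can coalesce
    (0, 0): (None, False),   # non-cacheable, no coalescing
}

# Encodings whose behaviour changed between the two processors.
changed = [cb for cb in sorted(SA110) if SA110[cb] != I80200[cb]]
print(changed)
```

The two encodings reported as changed are the ones the text singles out: C = 1, B = 0 and C = 0, B = 1.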
78. If it is necessary to download code into the instruction cache, then:
2. Assert TRST. This clears the Halt Mode bit, allowing the instruction cache to be invalidated.
3. Clear the Halt Mode bit through JTAG. This allows the instruction cache to be invalidated by reset.
4. Place the LDIC JTAG instruction in the JTAG IR, then proceed with the normal code download, using the Invalidate IC Line function before loading each line. This requires 10 packets to be downloaded per cache line instead of the 9 packets described in Section 13.14.3.
13.14.5 Dynamically Loading IC After Reset
An external host can load code into the instruction cache "on the fly", or dynamically. This occurs when the host downloads code while the processor is not being reset. However, this requires strict synchronization between the code running on the Intel 80200 processor and the external host. The guidelines for downloading code during program execution must be followed to ensure proper operation of the processor. The description in this section focuses on using a debug handler running on the Intel 80200 processor to synchronize with the external host, but the details apply for any application that is running while code is dynamically downloaded. To dynamically download code during software debug, there must be a mi
79. of when to use a particular policy is described below. The Intel 80200 processor allows dynamic modification of the cache policies at run time; however, the operation requires considerable processing time and therefore should not be used by applications. If the application is running under an OS, then the OS may restrict you from using certain cache policies.
Non-Cacheable Regions
It is recommended that non-cacheable memory (X = 0, C = 0, and B = 0) be used only if necessary, as is often the case for I/O devices. Accessing non-cacheable memory is likely to cause the processor to stall frequently due to the long latency of memory reads.
Write-through and Write-back Cached Memory Regions
Write-through memory regions generate more data traffic on the bus. Therefore, it is not recommended that the write-through policy be used. The write-back policy must be used whenever possible. However, in a multiprocessor environment, it is necessary to use a write-through policy if data is shared across multiple processors. In such a situation, all shared memory regions should use the write-through policy. Memory regions that are private to a particular processor should use the write-back policy.
B.4.2.3 Read Allocate and Read-write Allocate Memory Regions
Most of the regular data and t
80. on page 6-8. The line chosen may contain a valid line previously allocated in the cache. In this case, both dirty bits are examined and, if set, the four words associated with a dirty bit that is asserted are written back to external memory as a four-word burst operation.
3. When the data requested by the load is returned from external memory, it is immediately sent to the destination register specified by the load. A system that returns the requested data back first, with respect to the other bytes of the line, obtains the best performance.
4. As data returns from external memory, it is written into the cache in the previously allocated line.
A load operation that misses the cache and is NOT cacheable makes a request from external memory for the exact data size of the original load request. For example, LDRH requests exactly two bytes from external memory, LDR requests 4 bytes from external memory, etc. This request is placed in the fill buffer until the data is returned from external memory, which is then forwarded back to the destination register(s).
6.2.3.3 Write Miss Policy
A write operation that misses the cache requests a 32-byte cache line from external memory if the access is cacheable and write allocation is specified in the page. In this case, the following sequence of events occurs:
1. The fill buffer
81. processor supports the Thumb instruction set.
2.2.4 ARM DSP-Enhanced Instruction Set
The Intel 80200 processor implements the ARM DSP-enhanced instruction set, which is a set of instructions that boost the performance of signal processing applications. There are new multiply instructions that operate on 16-bit data values, and new saturation instructions. Some of the new instructions are:
- SMLAxy: 32 <= 16x16 + 32
- SMLAWy: 32 <= 32x16 + 32
- SMLALxy: 64 <= 16x16 + 64
- SMULxy: 32 <= 16x16
- SMULWy: 32 <= 32x16
- QADD: adds two registers and saturates the result if an overflow occurred
- QDADD: doubles and saturates one of the input registers, then adds and saturates
- QSUB: subtracts two registers and saturates the result if an overflow occurred
- QDSUB: doubles and saturates one of the input registers, then subtracts and saturates
The Intel 80200 processor also implements LDRD, STRD, and PLD instructions with the following implementation notes:
- PLD is interpreted as a read operation by the MMU and is ignored by the data breakpoint unit, i.e., PLD never generates data breakpoint events.
- PLD to a non-cacheable page performs no action. Also, if the targeted cache line is already resident, this instruction has no effect.
- Both LDRD and STRD instructions generate an alignment exception
82. register is required to track the next prefetch address. Generally, not aligning and sizing data adds extra computational overhead. Additional prefetch considerations are discussed in greater detail in following sections.
B.4.2.7 Literal Pools
The Intel 80200 processor does not have a single instruction that can move all literals (a constant or address) to a register. One technique to load registers with literals in the Intel 80200 processor is by loading the literal from a memory location that has been initialized with the constant or address. These blocks of constants are referred to as literal pools. See Section B.3, "Basic Optimizations" for more information on how to do this. It is advantageous to place all the literals together in a pool of memory known as a literal pool. These data blocks are located in the text or code address space so that they can be loaded using PC-relative addressing. However, references to the literal pool area load the data into the data cache instead of the instruction cache. Therefore, it is possible that the literal may be present in both the data and instruction caches, resulting in wasted space. For maximum efficiency, the compiler should align all literal pools on cache boundaries and size each pool to a multiple of 32 bytes (the size of a cache line). One additional optimi
83. register selected by the current instruction retains its previous value during this state. The instruction does not change in this state. The controller remains in this state as long as TMS is low. When TMS goes high on the rising edge of TCK, the controller moves to the Exit2-DR state.
Exit2-DR State
This is a temporary state. If TMS is held high on the rising edge of TCK, the controller enters the Update-DR state, which terminates the scanning process. If TMS is held low on the rising edge of TCK, the controller enters the Shift-DR state. The instruction does not change while the TAP controller is in this state. All test data registers selected by the current instruction retain their previous value during this state.
C.2.5.9 Update-DR State
The Boundary Scan register is provided with a latched parallel output. This output prevents changes at the parallel output while data is shifted in response to the extest and sample/preload instructions. When the Boundary Scan register is selected while the TAP controller is in the Update-DR state, data is latched onto the Boundary Scan register's parallel output from the shift register path on the falling edge of TCK. The data held at the latched parallel output does not change unless the controller is in this state. Wh
84. requested instruction. If the cache does not contain the requested instruction, the access misses the cache, and the cache requests a fetch from external memory of the 8-word line (32 bytes) that contains the requested instruction, using the fetch policy described in Section 4.2.3. As the fetch returns instructions to the cache, they are placed in one of two fetch buffers, and the requested instruction is delivered to the instruction decoder. A fetched line is written into the cache if it is cacheable. Code is designated as cacheable when the Memory Management Unit (MMU) is disabled, or when the MMU is enabled and the cacheable (C) bit is set to 1 in its corresponding page. See Chapter 3, "Memory Management" for a discussion on page attributes. Note that an instruction fetch may miss the cache but hit one of the fetch buffers. When this happens, the requested instruction is delivered to the instruction decoder in the same manner as a cache hit.
Operation When The Instruction Cache Is Disabled
Disabling the cache prevents any lines from being written into the instruction cache. Although the cache is disabled, it is still accessed and may generate a hit if the data is already in the cache. Disabling the instruction cache does not disable instruction buffering that may occur within the instruction fetch buffers. Two 8-word instruction fetch buffers are always enabled in the cache-disabled mode. So long as instruction fetches contin
85. reset, the instruction cache is typically invalidated, with the exception of the following modes:
- LDIC mode: active when the LDIC JTAG instruction is loaded in the JTAG IR; prevents the mini instruction cache and the main instruction cache from being invalidated during reset.
- HALT mode: active when the Halt Mode bit is set in the DCSR; prevents only the mini instruction cache from being invalidated; the main instruction cache is invalidated by reset.
During a cold reset (in which both a processor reset and a JTAG reset occur), it can be guaranteed that the instruction cache is invalidated, since the JTAG reset takes the processor out of any of the modes listed above. During a warm reset, if a JTAG reset does not occur, the instruction cache is not invalidated by reset when any of the above modes are active. This situation requires special attention if code needs to be downloaded during the warm reset. Note that while Halt Mode is active, reset can invalidate the main instruction cache. Thus, debug handler code downloaded during reset can only be loaded into the mini instruction cache. However, code can be dynamically downloaded into the main instruction cache (refer to Section 13.14.5, "Dynamically Loading IC After Reset"). The following sections describe the steps necessary to ensure code is correctly downloaded into the instruction cache.
13.14.4.1 Loading IC During Cold Reset for Debug
86. BCUMOD.AF affects the behavior of the BCU when it is reading a 32-byte block (a cache line fill). If this bit is 0, then the BCU always emits the 32-byte aligned address of the cache line when requesting it. If this bit is 1, then the BCU emits the address of the critical word in the cache line when requesting it. This latter setting allows external logic to implement CWF logic as detailed in Section 10.2.3, which will usually yield higher performance.
11.4.2 ECC Error Registers
Table 11-4. ELOG0, ELOG1 (Registers 4, 5). Reset value: undefined.
Bits | Access | Description
31 | Read / Write-ignored | RW: indicates the direction of the errant transfer. 0 = read error, 1 = write error
30:29 | Read / Write-ignored | ET: Error Type. 00 = single-bit ECC error, 01 = multi-bit ECC error, 10 = bus abort, 11 = reserved
28:8 | Read-unpredictable / Write-as-0 | reserved
7:0 | Read / Write-ignored | syn: the syndrome value that indicated the error. This field has an undefined value if ET is greater than 1.
Table 11-5. ECAR0, ECAR1 (Registers 6, 7)
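The ELOG field layout listed above can be unpacked with a few shifts and masks. The sketch below is an illustration only: the function name and the returned dictionary are hypothetical, the sample register values are made up, and placing the syndrome in bits 7:0 is an assumption inferred from bits 28:8 being reserved.

```python
ET_NAMES = {
    0b00: "single-bit ECC error",
    0b01: "multi-bit ECC error",
    0b10: "bus abort",
    0b11: "reserved",
}


def decode_elog(value: int) -> dict:
    """Split a 32-bit ELOG value into its RW, ET, and syndrome fields."""
    rw = (value >> 31) & 0x1          # bit 31: 0 = read error, 1 = write error
    et = (value >> 29) & 0x3          # bits 30:29: error type
    syn = value & 0xFF                # assumed bits 7:0: ECC syndrome
    return {
        "direction": "write" if rw else "read",
        "error_type": ET_NAMES[et],
        # The syndrome is undefined when ET is greater than 1, so report None.
        "syndrome": syn if et <= 1 else None,
    }


# Made-up sample values, not captured from hardware.
print(decode_elog(0xA000005A))
print(decode_elog(0x40000000))
```

Returning None for the syndrome when ET is greater than 1 keeps callers from acting on a field the text describes as undefined.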
87. significant 16 bits of the address for the request. Lock is asserted for the read request that begins the atomic pair, and is asserted for each new request up to, but not including, the write operation that ends the atomic pair. It is possible that instruction fetches and instruction MMU page table walk requests occur between the read and the write of an atomic operation, but no other requests are made. Details of the lock mechanism and its interaction with the multimaster logic are described below. A bus master may have at most four requests outstanding at any time. That is, a bus master may issue up to four requests before receiving back any data. For more information, see Section 10.3.7, "Pipelined Accesses" on page 10-21.
Intel 80200 Processor Use of the Request Bus
The possible sizes and alignments of read and write requests that can be issued are shown in Table 10-2 and Table 10-3.
Requests on a 64-bit Bus [column headers not recoverable; cells are listed as extracted]
LEN | Bytes | Cycles | Flags | Alignment
000 | 1 | 1 | Y Y | Any address
001 | 2 | 1 | Y | A[0] = 0
010 | 4 | 1 | Y Y | A[1:0] = 00
011 | 8 | 1 | Y Y | A[2:0] = 000
100 | Not used in 64-bit bus mode
101 | 16 | 2 | N | A[3:0] = 0000
110 | 32 | 4 | Y | A[1:0] = 00
111 | Not used in 64-bit bus mode
1. LEN of 000, 001, or 010 should be treated as a LEN of 011 if directed to a slave that implements ECC.
2. An 8-byte read will only occur if ECC is enabled; it will occur as part of read-modify-write transacti
88. that it is possible on a 32-bit bus for a 3-4 word write transaction to go out (which requires 3-4 data cycles on the bus), but during one or more of the middle data cycles no byte enables are asserted. The first and last data cycle always has at least one byte enable valid. Even though it seems inefficient to waste data cycles with no byte enables, on average the merging of writes in the write buffers can be a big performance gain. Some examples: a pair of non-cacheable, bufferable byte stores to addresses 0x2401 and 0x2403 might be merged in the write buffer. On the bus, they show up as a write of length 010 (word) to address 0x2400. When data is driven out on the data bus, however, only byte enables for bytes 1 and 3 would be asserted. A pair of non-cacheable, bufferable byte stores to addresses 0x2401 and 0x2409 could be merged in the write buffers and require a three-word write. On a 64-bit bus, there would be two data cycles. Both data cycles have only one byte enable asserted. On a 32-bit bus, there would be three data cycles. The first and third would have 1 byte enable asserted.
10.2.2 Data Bus
Some time after a request is made on the request bus, data must be transferred for that request on the data bus. Each request has a corresponding transaction (one or more cycles) on the data bus. Data
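The two merged-store examples above can be reproduced with a short model that maps byte addresses to data cycles and byte lanes. This is a simplified sketch, not the bus controller's actual algorithm: the function name is hypothetical, and it assumes the merged write spans the aligned words from the lowest to the highest stored byte, with data cycles aligned to the bus width.

```python
def merged_write(addresses, bus_bytes):
    """Model a merged write of single-byte stores.

    Returns (start_address, length_in_words, enables), where enables holds one
    list of asserted byte lanes per data cycle. Simplified illustration only.
    """
    first_word = min(addresses) & ~0x3            # word-aligned start address
    last_word = max(addresses) & ~0x3             # word containing the last byte
    words = (last_word - first_word) // 4 + 1
    cycle_base = first_word & ~(bus_bytes - 1)    # bus-width-aligned base
    cycles = (last_word + 4 - cycle_base + bus_bytes - 1) // bus_bytes
    enables = [[] for _ in range(cycles)]
    for addr in sorted(addresses):
        enables[(addr - cycle_base) // bus_bytes].append(addr % bus_bytes)
    return first_word, words, enables


# Stores to 0x2401 and 0x2403: one word at 0x2400, byte lanes 1 and 3.
print(merged_write([0x2401, 0x2403], bus_bytes=4))
# Stores to 0x2401 and 0x2409: three words; the middle 32-bit cycle has no enables.
print(merged_write([0x2401, 0x2409], bus_bytes=4))
# The same pair on a 64-bit bus: two data cycles, one byte enable each.
print(merged_write([0x2401, 0x2409], bus_bytes=8))
```

The empty list in the second result corresponds to the middle data cycle in which no byte enables are asserted.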
89. that some data was lost. The assumption during high-speed download is that the time it takes for the debugger to shift in the next data word is greater than the time necessary for the debug handler to process the previous data word. So, before the debugger shifts in the next data word, the handler is polling for that data. However, if the handler incurs stalls that are long enough such that the handler is still processing the previous data when the debugger completes shifting in the next data word, an overflow condition occurs and the OV bit is set. Once set, the overflow flag remains set until cleared by a write to TXRXCTRL with an MCR. After the debugger completes the download, it can examine the OV bit to determine if an overflow occurred. The debug handler software is responsible for saving the address of the last valid store before the overflow occurred.
Download Flag (D)
The value of the download flag is set by the debugger through JTAG. This flag is used during high-speed download to replace a loop counter. The download flag becomes especially useful when an overflow occurs. If a loop counter is used and an overflow occurs, the debug handler cannot determine how many data words overflowed. Therefore, the debug handler counter may get out of sync with the debugger: the debugger may finish downloading the data, but the debug handler counter may indicate there is more data to be downloaded. This may result in unpredictable behavior of the d
90. the rearrangement of a sequence of instructions for the purpose of minimizing pipeline stalls. Reducing the number of pipeline stalls improves application performance. While making this rearrangement, care should be taken to ensure that the rearranged sequence of instructions has the same effect as the original sequence of instructions.
Scheduling Loads
On the Intel 80200 processor, an LDR instruction has a result latency of 3 cycles, assuming the data being loaded is in the data cache. If the instruction after the LDR needs to use the result of the load, then it would stall for 2 cycles. If possible, the instructions surrounding the LDR instruction should be rearranged to avoid this stall. Consider the following example:
    add  r1, r2, r3
    ldr  r0, [r5]
    add  r6, r0, r1
    sub  r8, r2, r3
    mul  r9, r2, r3
In the code shown above, the ADD instruction following the LDR would stall for 2 cycles because it uses the result of the load. The code can be rearranged as follows to prevent the stalls:
    ldr  r0, [r5]
    add  r1, r2, r3
    sub  r8, r2, r3
    add  r6, r0, r1
    mul  r9, r2, r3
Note that this rearrangement may not always be possible. Consider the following example:
    cmp    r1, #0
    addne  r4, r5, #4
    subeq  r4, r5, #4
    ldr    r0, [r4]
    cmp    r0, #10
In the example above, the LDR instruction cannot be moved before the ADDNE or the SUBEQ instructions because the LDR instruction depends on the result of these instructions. Rewrite the above code to make it run faster a
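The effect of the rearrangement above can be checked with a toy single-issue model that tracks when each register becomes available. This is a sketch under stated assumptions, not a model of the real pipeline: the function and table names are hypothetical, LDR uses the 3-cycle result latency quoted in the text, and every other instruction is assumed to produce its result after 1 cycle.

```python
# Result latency in cycles. Only the LDR figure comes from the text above;
# treating everything else as 1 cycle is a simplifying assumption.
RESULT_LATENCY = {"ldr": 3}


def count_stalls(program):
    """program: list of (mnemonic, dest, sources). Returns total stall cycles."""
    ready = {}      # register -> first cycle in which its value can be used
    cycle = 0       # earliest cycle in which the next instruction may issue
    stalls = 0
    for mnemonic, dest, sources in program:
        issue = max([cycle] + [ready.get(src, 0) for src in sources])
        stalls += issue - cycle
        ready[dest] = issue + RESULT_LATENCY.get(mnemonic, 1)
        cycle = issue + 1
    return stalls


original = [
    ("add", "r1", ["r2", "r3"]),
    ("ldr", "r0", ["r5"]),
    ("add", "r6", ["r0", "r1"]),   # consumes the load result immediately
    ("sub", "r8", ["r2", "r3"]),
    ("mul", "r9", ["r2", "r3"]),
]
rearranged = [
    ("ldr", "r0", ["r5"]),
    ("add", "r1", ["r2", "r3"]),
    ("sub", "r8", ["r2", "r3"]),
    ("add", "r6", ["r0", "r1"]),   # two independent instructions hide the latency
    ("mul", "r9", ["r2", "r3"]),
]
print(count_stalls(original), count_stalls(rearranged))
```

Under these assumptions the original order stalls for the 2 cycles the text describes, and the rearranged order does not stall.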
91. the else part and four cycles for the if part, assuming best case conditions and no branch misprediction penalties. In the case of Intel 80200 processors, a branch misprediction incurs a penalty of four cycles. If the branch is mispredicted 50% of the time, and if we assume that both the if part and the else part are equally likely to be taken, on average the code above takes 5.5 cycles to execute:
    (50/100 × 4) + (3 + 4)/2 = 5.5 cycles
If we were to use Intel 80200 processors to execute instructions conditionally, the code generated for the above if-else statement is:
    cmp    r0, #10
    movgt  r0, #0
    movle  r0, #1
The above code segment would not incur any branch misprediction penalties and would take three cycles to execute, assuming best case conditions. As can be seen, using conditional instructions speeds up execution significantly. However, the use of conditional instructions should be carefully considered to ensure that it does improve performance. To decide when to use conditional instructions over branches, consider the following hypothetical code segment:
    if (cond)
        if_stmt
    else
        else_stmt
Assume that we have the following data:
- N1B: Number of cycles to execute the if_stmt, assuming the use of branch instructions
- N2B: Number of cycles to execute the else_stmt, assuming the use of branch instructions
- P1: Percentage of times the if_stmt is likely to be executed
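The 5.5-cycle figure above is an expected value: the average of the two path lengths plus the misprediction penalty weighted by how often it is paid. A minimal sketch of that calculation follows; the function and parameter names are hypothetical, while the numbers are the ones quoted in the excerpt.

```python
def branch_cycles(n_if, n_else, p_if, p_mispredict, penalty=4):
    """Expected cycles for an if-else implemented with branch instructions.

    n_if, n_else:  best-case cycles for each path
    p_if:          probability that the if part executes
    p_mispredict:  probability that the branch is mispredicted
    penalty:       misprediction penalty in cycles (4 per the text above)
    """
    return p_if * n_if + (1 - p_if) * n_else + p_mispredict * penalty


with_branches = branch_cycles(n_if=4, n_else=3, p_if=0.5, p_mispredict=0.5)
with_conditionals = 3   # cmp + movgt + movle, best case, no branch to mispredict
print(with_branches, with_conditionals)
```

Lowering p_mispredict in the call shows how quickly the advantage of conditional execution shrinks when the branch is well predicted.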
92. the transfer, and one additional core clock if the word is in the upper word address range of the transfer. Thus, for the examples presented here, this would be 6 or 7 core clock cycles (N_cwfxfer). N_cwf for the Intel 80200 processor works out to be 60 instructions, assuming 2-wait-state SDRAM and that the current SDRAM memory page is selected. The second 64 bits of data are available at the next bus cycle (or 6 core clocks). N_clxfer is the minimal number of cycles to prefetch ahead for an entire cache line:
    N_clxfer = N_lookup + N_linexfer
where N_linexfer is the number of core clocks required to transfer one complete cache line. The Intel 80200 processor requires 4 bus cycles to transfer four 64-bit words of a full cache line. Given the six-to-one core to bus clock ratio, this would be 24 core clock cycles. N_clxfer works out to be about 78 cycles for the Intel 80200 processor when using 2-bus-cycle wait state. N_subissue is the maximum number of core clocks within which a subsequent bus transfer request must be made to guarantee that the transfer takes place immediately after the previous request has completed its transfer. If a transfer is not made in this time, then idle bus cycles occur, reducing efficiency. This time is the transfer time of the previous request. If the previous transfer was for a full cache line read or write, then this would take 24 core cycles at a six-to-one ratio between core and bus clocks. If the previous operation was for
93. then conditionally execute based on the updated CC value.
TX/RX Control Register (TXRXCTRL). Reset value: 0x00000000.
Bits | Access | Description
31 | SW Read-only / Write-ignored; JTAG Write-only | RR: RX Register Ready
30 | SW Read / Write | OV: RX overflow sticky flag
29 | SW Read-only / Write-ignored; JTAG Write-only | D: High-speed download flag
28 | SW Read-only / Write-ignored; JTAG Write-only | TR: TX Register Ready
27:0 | Read-as-Zero / Write-ignored | Reserved
13.8.1 RX Register Ready Bit (RR)
The debugger and debug handler use the RR bit to synchronize accesses to RX. Normally, the debugger and debug handler use a handshaking scheme that requires both sides to poll the RR bit. To support higher download performance for large amounts of data, a high-speed download handshaking scheme can be used, in which only the debug handler polls the RR bit before accessing the RX register, while the debugger continuously downloads data. Table 13-7 shows the normal handshaking used to access the RX register.
Table 13-7. Normal RX Handshaking
Debugger Actions: Debugger wants to send data to debug handler. Before writing new data to the RX register, the debugger polls RR through JTAG
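The four status flags in the TXRXCTRL layout above sit in the top nibble of the register, so decoding a raw value is a matter of testing individual bits. The sketch below is illustrative: the function name and the sample value are hypothetical, and the flag name OV for bit 30 is taken from the overflow discussion elsewhere in this listing.

```python
# Bit positions as listed in the register description above.
TXRXCTRL_BITS = {"RR": 31, "OV": 30, "D": 29, "TR": 28}


def decode_txrxctrl(value: int) -> dict:
    """Return the state of each TXRXCTRL flag for a raw 32-bit register value."""
    return {name: bool((value >> bit) & 1) for name, bit in TXRXCTRL_BITS.items()}


# Hypothetical reading: RX register ready and a sticky overflow recorded.
flags = decode_txrxctrl(0xC0000000)
print(flags)
```

A debug handler would obtain the raw value with an MRC from the coprocessor register; that access is outside the scope of this sketch.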
94. this checkpoint. In this case, the host SW would have to start at its known address (the first checkpoint), which is halfway through the buffer, and work forward from there.
13.14 Downloading Code in the ICache
On the Intel 80200 processor, a 2K mini instruction cache, physically separate from the 32K main instruction cache, can be used as an on-chip instruction RAM. An external host can download code directly into either instruction cache through JTAG. In addition to downloading code, several cache functions are supported. The Intel 80200 processor supports loading the instruction cache during reset and during program execution. Loading the instruction cache during normal program execution requires a strict handshaking protocol between software running on the Intel 80200 processor and the external host. In the remainder of this section, the term "instruction cache" applies to either the main or the mini instruction cache.
13.14.1 LDIC JTAG Command
The LDIC JTAG instruction selects the JTAG data register for loading code into the instruction cache. The JTAG opcode for this instruction is 00111. The LDIC instruction must be in the JTAG instruction register in order to load code directly into the instruction cache through JTAG.
1. A cache line fill from external memory is never written into t
95. to be prefetched for both reading and writing. N_evict is the number of cache half-line evictions caused by the loop. N_inst is the number of instructions executed in one iteration of the loop. N_hwlinexfer is the number of core clocks required to write half a cache line, as would happen if only one of the cache line dirty bits were set when a line eviction occurred. For the Intel 80200 processor, this takes 2 bus clocks or 12 core clocks. CPI is the average number of core clocks per instruction. The N_psd number provided by the above equation is a good starting point, but may not be the most ideal consideration. Estimating N_evict is very difficult from static code. However, if the operational data uses the mini-data cache, and if the loop operations should overflow the mini-data cache, then a first-order estimate of N_evict would be the number of bytes written per loop iteration divided by a half cache line size of 16 bytes. Cache overflow can be estimated by the number of cache lines transferred each iteration and the number of expected loop iterations. N_evict and CPI can be estimated by profiling the code using the performance monitor "cache write-back" event count.
Prefetch Loop Limitations
It is not always advantageous to add prefetch to a loop. Loop characteristics that limit the use or value of prefetch are discussed below.
Compute vs. Data Bus Bound
At the extreme, a loop which is data bus bound does not benefit from prefetch
96. to non-existent memory. To avoid this situation, the line-allocate operation should only be used if one of the following can be guaranteed: the virtual address associated with this command is not one that is generated during normal program execution (this is the case when line-allocate is used to clean/invalidate the entire cache); or the line-allocate operation is used only on a cache region destined to be locked (when the region is unlocked, it must be invalidated before making another data access). 7.2.9 Register 8: TLB Operations. Disabling/enabling the MMU has no effect on the contents of either TLB: valid entries stay valid, locked items remain locked. All operations defined in Table 7-13 work regardless of whether the TLB is enabled or disabled. This register should be accessed as write-only. Reads from this register, as with an MRC, have an undefined effect.
Table 7-13. TLB Functions
Function                   opcode_2  CRm     Data     Instruction
Invalidate (I and D) TLB   0b000     0b0111  Ignored  MCR p15, 0, Rd, c8, c7, 0
Invalidate I TLB           0b000     0b0101  Ignored  MCR p15, 0, Rd, c8, c5, 0
Invalidate I TLB entry     0b001     0b0101  MVA      MCR p15, 0, Rd, c8, c5, 1
Invalidate D TLB           0b000     0b0110  Ignored  MCR p15, 0, Rd, c8, c6, 0
Invalidate D TLB entry     0b001     0b0110  MVA      MCR p15
97. up after the instruction cache is loaded. In this case, the DCSR should be set up to do a reset vector trap, with the Halt Mode bit and the hold_rst signal remaining set. In either case, when the debugger clears the hold_rst bit to de-assert internal reset, the debugger must set the Halt Mode and Trap Reset bits in the DCSR. 13.14.4.2 Loading IC During a Warm Reset for Debug. Loading the instruction cache during a warm reset may be a slightly different situation than during a cold reset. For a warm reset, the main issue is whether the instruction cache gets invalidated by the processor reset or not. There are several possible scenarios. (1) While reset is asserted, TRST is also asserted. In this case, the instruction cache is invalidated, so the actions taken to download code are identical to those described in Section 13.14.4.1. (2) When reset is asserted, TRST is not asserted, but the processor is not in Halt Mode. In this case, the instruction cache is also invalidated, so the actions are the same as described in Section 13.14.4.1, after the LDIC instruction is loaded into the JTAG IR. (3) When reset is asserted, TRST is not asserted, and the processor is in Halt Mode. In this last scenario, the mini instruction cache does not get invalidated by reset, since the processor is in Halt Mode. This scenario is described in more det
98. used functions to be dynamically downloaded. 13.14.6 Mini Instruction Cache Overview. The mini instruction cache is a smaller version of the main instruction cache. Refer to Chapter 4 for more details on the main instruction cache. It is a 2 KB, 2-way set associative cache. There are 32 sets, each containing two ways; each way contains 8 words. The cache uses the round-robin replacement policy. The mini instruction cache is virtually addressed, and addresses may be remapped by the PID. However, since the debug handler executes in Special Debug State, address translation and PID remapping are turned off. For application code, accesses to the mini instruction cache use the normal address translation and PID mechanisms. Normal application code is never cached in the mini instruction cache on an instruction fetch. The only way to get code into the mini instruction cache is through the JTAG LDIC function. Code downloaded into the mini instruction cache is essentially locked: it cannot be overwritten by application code running on the Intel 80200 processor. However, it is not locked against code downloaded through the JTAG LDIC functions. Application code can invalidate a line in the mini instruction cache using a CP15 Invalidate IC line function to an address that hits in the mini instruction cache. Howe
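The geometry quoted above (32 sets, 2 ways, 8 words per way) can be checked with a short Python sketch. The 4-byte word size and the set-index mapping (line number modulo the number of sets) are assumptions made for illustration; the excerpt does not state which address bits select the set.

```python
SETS = 32
WAYS = 2
WORDS_PER_WAY = 8
BYTES_PER_WORD = 4  # assumption: 32-bit words

LINE_BYTES = WORDS_PER_WAY * BYTES_PER_WORD  # 32 bytes per way (line)


def cache_bytes():
    """Total capacity implied by the quoted geometry."""
    return SETS * WAYS * LINE_BYTES


def set_index(address):
    """Assumed conventional mapping: line number modulo the number of sets."""
    return (address // LINE_BYTES) % SETS
```

The product comes to 2048 bytes, which agrees with the 2 KB figure in the text.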
99. where all four requests can be active on the pipelined external bus. Chapter 10, "External Bus", describes the external bus protocol, and Chapter 11, "Bus Controller", covers the aspects of ECC protection. The bus controller registers are accessed via coprocessor 13. Performance Monitoring. Two performance monitoring counters have been added to the Intel 80200 processor that can be configured to monitor various events in the Intel 80200 processor. These events allow a software developer to measure cache efficiency, detect system bottlenecks, and reduce the overall latency of programs. Chapter 12, "Performance Monitoring", discusses this in more detail. Debug. The Intel 80200 processor supports software debugging through two instruction address breakpoint registers, one data-address breakpoint register, one data-address/mask breakpoint register, and a trace buffer. Chapter 13, "Software Debug", discusses this in more detail. JTAG. Testability is supported on the Intel 80200 processor through the Test Access Port (TAP) Controller implementation, which is based on IEEE 1149.1 (JTAG) Standard Test Access Port and Boundary-Scan Architecture. The purpose of the TAP controller is to support test logic internal and external to the Intel 80200 processor, such as built-in self-test, boundary scan and scan. Appendix C.2 discusses this in more detail.
100. wide. When the IR is selected in the Shift-IR state, the most significant bit is connected to TDI, and the least significant bit is connected to TDO. TDI is shifted into IR on each rising edge of TCK, as long as TMS remains asserted. When the processor enters the Capture-IR TAP controller state, fixed parallel data (0001) is captured. During Shift-IR, when a new instruction is shifted in through TDI, the value 0001 is always shifted out through TDO, least significant bit first. This helps identify instructions in a long chain of serial data from several devices. Upon activation of the TRST pin, the latched instruction asynchronously changes to the idcode instruction. If the TAP controller moved into the Test-Logic-Reset state other than by reset activation, the opcode changes as TDI is shifted, and becomes active on the falling edge of TCK. See Figure C-4 for an example of loading the instruction register. Boundary-Scan Instruction Set. The Intel 80200 processor supports three mandatory boundary-scan instructions: bypass, sample/preload and extest. The Intel 80200 processor also contains seven additional public instructions, along with seven Intel 80200 processor private instructions. Table C-2 lists the Intel 80200 processor instruction codes and Table C-3 describes each instruction. Table C-2. JTAG Instruction Set. Instruction Code / Instruction Name: 00000 extest; 010
101. write completes. If software is running in a privileged mode, it can explicitly drain all buffered writes. For details on this operation, see the description of Drain Write Buffer in Section 7.2.8, "Register 7: Cache Functions" on page 7-11. Chapter 7: Configuration. This chapter describes the System Control Coprocessor (CP15) and coprocessor 14 (CP14). CP15 configures the MMU, caches, buffers and other system attributes. Where possible, the definition of CP15 follows the definition in the first generation Intel StrongARM products. CP14 contains the performance monitor registers and the trace buffer registers. 7.1 Overview. CP15 is accessed through MRC and MCR coprocessor instructions, and access is allowed only in privileged mode. Any access to CP15 in user mode, or with LDC or STC coprocessor instructions, causes an undefined instruction exception. CP14 registers can be accessed through MRC, MCR, LDC and STC coprocessor instructions, and access is allowed only in privileged mode. Any access to CP14 in user mode causes an undefined instruction exception. Coprocessors on the Intel 80200 processor (based on Intel XScale microarchitecture, compliant with the ARM Architecture V5TE) do not support access via CDP, MRRC or MCRR instructions. An attempt to execute these instructions results in an Undefined Instruction exception. Many of the MCR commands available in CP15 modify hardware state sometime after execution. A so
102. 0] = 0b10101 (DEBUG mode); CPSR[5] = 0; CPSR[6] = 1; CPSR[7] = 1; PC = 0x0. Following a debug exception, the processor switches to debug mode and enters SDS, which allows the following special functionality. All events are disabled. SWI or undefined instructions have unpredictable results. The processor ignores pre-fetch aborts, FIQ and IRQ (SDS disables FIQ and IRQ regardless of the enable values in the CPSR). The processor reports data aborts detected during SDS by setting the Sticky Abort bit in the DCSR, but does not generate an exception (the processor also sets up FSR and FAR as it normally would for a data abort). Normally, during halt mode, software cannot write the hardware breakpoint registers or the DCSR. However, during the SDS, software has write access to the breakpoint registers (see Section 13.6, "HW Breakpoint Resources") and the DCSR (see Table 13-1, "Debug Control and Status Register (DCSR)" on page 13-3). The IMMU is disabled. In halt mode, since the debug handler would typically be downloaded directly into the IC, it would not be appropriate to do TLB accesses or translation walks, since there may not be any external memory or, if there is, the translation table or TLB may not contain a valid mapping for the debug handler code. To avoid these problems, the processor internally disables the IMMU during SDS. The PID is disabled for instruction fetches. This prevents fetches of the debug handler code from being remapped to a differe
103. 0 cacheable and bufferable page attribute bits. 0x13: Reserved, unpredictable results. 0x14: The BCU detected an ECC error, but no ELOG register was available in which to log the error. See Section 11.4.2, "ECC Error Registers" on page 11-9 for a description of the ELOG registers. 0x15: BCU detected a 1-bit error while reading data from the bus. This event may be counted even if reporting of 1-bit errors is disabled. See Section 11.3, "Error Handling" on page 11-2 for a description of 1-bit errors. RMW cycle occurred due to narrow write on ECC-protected memory (see Section 11 2 DB ECC on page 11-1 for a description of ECC and RMW cycles). All others: Reserved, unpredictable results. Some typical combinations of counted events are listed in this section and summarized in Table 12-5. In this section, we call such an event combination a mode.
Table 12-5. Some Common Uses of the PMU
Mode                          PMNC.evtCount0                PMNC.evtCount1
Instruction Cache Efficiency  0x7 (instruction count)       0x0 (ICache miss)
Data Cache Efficiency         0xA (DCache access)           0xB (DCache miss)
Instruction Fetch Latency     0x1 (ICache cannot deliver)   0x0 (ICache miss)
Data/Bus Request Buffer Full  0x8 (DBuffer stall duration)  0x9 (DBuffer stall)
Stall Writeb
104. 0 = no overflow, 1 = overflow has occurred. Write values: 0 = no change, 1 = clear this bit. Read-unpredictable / Write-as-0: Reserved.
Table 12-3. Performance Monitor Control Register (CP14, register 0) (Sheet 2 of 2)
Reset value: E and inten are 0, others unpredictable.
Bits  Access                      Description
6:4   Read/Write                  Interrupt Enable: used to enable/disable interrupt reporting for each counter. Bit 6 = clock counter interrupt enable; bit 5 = performance counter 1 interrupt enable; bit 4 = performance counter 0 interrupt enable. For each bit: 0 = disable interrupt, 1 = enable interrupt.
3     Read/Write                  Clock Counter Divider (D): 0 = CCNT counts every processor clock cycle; 1 = CCNT counts every 64th processor clock cycle.
2     Read-unpredictable / Write  Clock Counter Reset (C): 0 = no action; 1 = reset the clock counter to 0x0.
1     Read-unpredictable / Write  Performance Counter Reset (P): 0 = no action; 1 = reset both performance counters to 0x0.
0     Read/Write                  Enable (E): 0 = all 3 counters are disabled; 1 = all 3 counters are enabled.
12.4.1 Managing PM
105. [Instruction encoding diagram, bits 31:0] Operation: if ConditionPassed(cond) then acc0[39:32] = RdHi[7:0]; acc0[31:0] = RdLo[31:0]. Exceptions: none. Qualifiers: Condition Code. No condition code flags are updated. Notes: Instruction timings can be found in Section 14.4.4, "Multiply Instruction Timings" on page 14-6. Specifying R15 as either RdHi or RdLo has unpredictable results. The MAR instruction moves the value in register RdLo to bits [31:0] of the 40-bit accumulator (acc0) and moves bits [7:0] of the value in register RdHi into bits [39:32] of acc0. The instruction is only executed if the condition specified in the instruction matches the condition code status. This instruction executes in any processor mode. MRA{cond} RdLo, RdHi, acc0. [Instruction encoding diagram, bits 31:0] Operation: if ConditionPassed(cond) then RdHi[31:0] = sign_extend(acc0[39:32]); RdLo[31:0] = acc0[31:0]. Exceptions: none. Qualifiers: Condition Code. No condition code flags are updated. Notes: Instruction timings can be found in Section 14.4.4, "Multiply Instruction Timings" on page 14-6. Specifying the same register for RdHi and RdLo has unpredictable results. Specifying R15 as either RdHi or RdLo has unpredictable results. The MRA instruction moves the 40-bit accumulator value (acc0) into two regis
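The two Operation descriptions above can be modelled in Python as a rough sketch. The function names `mar` and `mra` are chosen here for illustration, and registers are represented as plain unsigned integers; condition checking and the unpredictable register cases are not modelled.

```python
MASK32 = 0xFFFFFFFF


def mar(rdlo, rdhi):
    """Model of MAR: acc0[39:32] = RdHi[7:0], acc0[31:0] = RdLo[31:0]."""
    return ((rdhi & 0xFF) << 32) | (rdlo & MASK32)


def mra(acc0):
    """Model of MRA: returns (RdLo, RdHi) as unsigned 32-bit values.

    RdHi is acc0[39:32] sign-extended to 32 bits.
    """
    rdlo = acc0 & MASK32
    rdhi = (acc0 >> 32) & 0xFF
    if rdhi & 0x80:          # bit 39 set: extend the sign into bits 31:8
        rdhi |= 0xFFFFFF00
    return rdlo, rdhi
```

Note that a MAR followed by an MRA does not necessarily return the original RdHi: only its low 8 bits are kept in the accumulator, and they come back sign-extended.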
106. 0 to 127
    mov r0, #127
; Set the value of r0 to 0xfffffefb
    mvn r0, #260
; Set the value of r0 to 257
    mov r0, #1
    orr r0, r0, #256
; Set the value of r0 to 0x51f
    mov r0, #0x1f
    orr r0, r0, #0x500
; Set the value of r0 to 0xf100ffff
    mvn r0, #0xff, 16
    bic r0, r0, #0xe, 8
; Set the value of r0 to 0x12341234
    mov r0, #0x8d, 30
    orr r0, r0, #0x1, 20
    add r0, r0, r0, LSL #16   ; shifter delay of 1 cycle
Note that it is possible to load any 32-bit value into a register using a sequence of four instructions. B.3.4 Optimizing Integer Multiply and Divide. Multiplication by an integer constant should be optimized to make use of the shift operation whenever possible.
; Multiplication of R0 by 2^n
    mov r0, r0, LSL #n
; Multiplication of R0 by 2^n + 1
    add r0, r0, r0, LSL #n
Multiplication by an integer constant that can be expressed as (2^n + 1) * 2^m can similarly be optimized as:
; Multiplication of r0 by an integer constant that can be expressed as (2^n + 1) * 2^m
    add r0, r0, r0, LSL #n
    mov r0, r0, LSL #m
Please note that the above optimization should only be used in cases where the multiply operation cannot be advanced far enough to prevent pipeline stalls. Dividing an unsigned integer by an integer constant should be optimized to make use of the shift operation whenever possible.
; Dividing r0 containing an unsigned value by an integer constant that can be represented as 2^n
    mov r0, r
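The shift-and-add identities behind these multiply sequences can be checked with a small Python model. This is a sketch only: the helper names are hypothetical, and 32-bit register wraparound is modelled with a mask.

```python
MASK32 = 0xFFFFFFFF


def lsl(value, n):
    """32-bit logical shift left, as performed by the barrel shifter."""
    return (value << n) & MASK32


def mul_pow2(r0, n):
    """mov r0, r0, LSL #n  ->  r0 * 2^n"""
    return lsl(r0, n)


def mul_pow2_plus1(r0, n):
    """add r0, r0, r0, LSL #n  ->  r0 * (2^n + 1)"""
    return (r0 + lsl(r0, n)) & MASK32


def mul_pow2_plus1_times_pow2(r0, n, m):
    """add then mov  ->  r0 * (2^n + 1) * 2^m"""
    return lsl(mul_pow2_plus1(r0, n), m)
```

With n = 3 and m = 2, for instance, the two-instruction sequence multiplies by 36, and the result agrees with an ordinary multiply truncated to 32 bits.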
107. 12.2 Clock Counter (CCNT; CP14 Register 1). The format of CCNT is shown in Table 12-1. The clock counter is reset to 0 by the Performance Monitor Control Register (PMNC), or can be set to a predetermined value by directly writing to it. It counts core clock cycles. When CCNT reaches its maximum value, 0xFFFFFFFF, the next clock cycle causes it to roll over to zero and set the overflow flag (bit 6) in PMNC. An IRQ or FIQ is reported if it is enabled via bit 6 in the PMNC register.
Table 12-1. Clock Count Register (CCNT)
Reset value: unpredictable.
Bits  Access      Description
31:0  Read/Write  32-bit clock counter. Reset to 0 by the PMNC register. When the clock counter reaches its maximum value, 0xFFFFFFFF, the next cycle causes it to roll over to zero and generate an IRQ or FIQ if enabled.
12.3 Performance Count Registers (PMN0 - PMN1; CP14 Register 2 and 3, Respectively). There are two 32-bit event counters; their format is shown in Table 12-2. The event counters are reset to 0 by the PMNC register, or can be set to a predetermined value
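The rollover behaviour described for CCNT can be illustrated with a short Python sketch. Both helpers are hypothetical and simply model a 32-bit counter; the elapsed-cycle calculation assumes the counter rolled over at most once between the two readings.

```python
CCNT_MAX = 0xFFFFFFFF


def ccnt_tick(value):
    """Advance the counter by one clock; returns (new_value, overflowed)."""
    if value == CCNT_MAX:
        return 0, True   # rolls over to zero and raises the overflow condition
    return value + 1, False


def elapsed_cycles(start, end):
    """Cycles between two CCNT readings, tolerating a single rollover."""
    return (end - start) & CCNT_MAX
```

Software that needs longer intervals would have to count overflow interrupts as well, since the modular subtraction alone cannot distinguish one rollover from several.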
108. 13.11 Debug JTAG Access. There are four JTAG instructions used by the debugger during software debug: LDIC, SELDCSR, DBGTX and DBGRX. LDIC is described in Section 13.14, "Downloading Code in the ICache". The other three JTAG instructions are described in this section. SELDCSR, DBGTX and DBGRX use a common 36-bit shift register (DBG_SR). New data is shifted in, and captured data is shifted out, through the DBG_SR. In the Update-DR state, the new data is shifted into the appropriate data register. 13.11.1 SELDCSR JTAG Command. The SELDCSR JTAG instruction selects the DCSR JTAG data register. The JTAG opcode is 01001. When the SELDCSR JTAG instruction is in the JTAG instruction register, the debugger can directly access the Debug Control and Status Register (DCSR). The debugger can only modify certain bits through JTAG, but can read the entire register. The SELDCSR instruction also allows the debugger to generate an external debug break. 13.11.2 SELDCSR JTAG Register. Placing the SELDCSR JTAG instruction in the JTAG IR selects the DCSR JTAG Data register (Figure 13-1), allowing the debugger to access the DCSR, generate an external debug break, set the hold_rst signal, which is used when loading code
109. B.4.1 Instruction Cache
B.4.1.1 Cache Miss Cost
B.4.1.2 Round-Robin Replacement Cache Policy
B.4.1.3 Code Placement to Reduce Cache Misses
B.4.1.4 Locking Code into the Instruction Cache
B.4.2 Data and Mini Cache
B.4.2.1 Non-Cacheable Regions
B.4.2.2 Write-through and Write-back Cached Memory Regions
B.4.2.3 Read Allocate and Read-write Allocate Memory Regions
B.4.2.4 Creating On-chip RAM
B.4.2.5 Mini-data Cache
B.4.2.6 Data Alignment
B.4.2.7 Literal
B.4.3 Cache Considerations
B.4.3.1 Cache Conflicts, Pollution and Pressure
B.4.3.2 Memory Page Thrashing
B.4.4 Prefetch Considerations
B.4.4.1 Prefetch Distances in the Intel 80200 Processor
B.4.4.2 Prefetch Loop Scheduling
B.4.4.3 Prefetch Loop Limitations
110. Reset value: 0x0000,0000.
Bits   Access                          Description
31:25  Read/Write                      Process ID: this field is used for remapping the virtual address when bits 31-25 of the virtual address are zero.
24:0   Read-as-Zero / Write-as-Zero    Reserved: should be programmed to zero for future compatibility.
7.2.13.1 The PID Register Affect On Addresses. All addresses generated and used by User Mode code are eligible for being "PIDified" as described in the previous section. Privileged code, however, must be aware of certain special cases in which address generation does not follow the usual flow. The PID register is not used to remap the virtual address when accessing the Branch Target Buffer (BTB). Any writes to the PID register invalidate the BTB, which prevents any virtual addresses from being double mapped between two processes. A breakpoint address (see Section 7.2.14, "Register 14: Breakpoint Registers" on page 7-17) must be expressed as an MVA when written to the breakpoint register. This means the value of the PID must be combined appropriately with the address before it is written to the breakpoint register. All virtual addresses in translation descriptors (see Chapter 3, "Memory Management") are MVAs. 7.2.14 Register 14: Breakpoint Registers. The Intel 80200 processor contains two instruct
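As an illustration of combining the PID with an address, the Python sketch below forms a modified virtual address (MVA) from a virtual address and a 7-bit process ID. It assumes that "combined appropriately" means placing the PID in bits 31:25 when those bits of the virtual address are zero, which matches the register layout above; the helper name `to_mva` is hypothetical.

```python
MASK32 = 0xFFFFFFFF
PID_SHIFT = 25  # PID occupies bits 31:25 of the register


def to_mva(virtual_address, pid):
    """Assumed remapping: substitute the PID when VA[31:25] is zero."""
    va = virtual_address & MASK32
    if (va >> PID_SHIFT) == 0:
        return va | ((pid & 0x7F) << PID_SHIFT)
    return va  # addresses with non-zero upper bits are left unchanged
```

Under this assumption, a debugger setting a breakpoint at virtual address 0x8000 in a process with PID 1 would write 0x02008000 to the breakpoint register.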
111. 11  private
Instruction Code  Instruction Name     Instruction Code       Instruction Name
00001             sample               01100                  private
00010             dbgrx                01101                  private
00011             private              01110                  not used
00100             clamp                01111                  not used
00101             private              10000                  dbgtx
00110             not used             10001                  not used
00111             ldic                 10010                  not used
01000             highz                10011 through 11101    not used
01001             dcsr                 11110                  idcode
01010             private              11111                  bypass
Table C-3. IEEE Instructions
Instruction / Requisite: extest (IEEE 1149.1, Required). Opcode: 00000. Description: extest initiates testing of external circuitry, typically board-level interconnects and off-chip circuitry. extest connects the Boundary-Scan register between TDI and TDO in the Shift-DR state only. When extest is selected, all output signal pin values are driven by values shifted into the Boundary-Scan register and may change only on the falling edge of TCK in the Update-DR state. Also, when extest is selected, all system input pin states must be loaded into the Boundary-Scan register on the rising edge of TCK in the Capture-DR state. Values shifted into input latches in the Boundary-Scan register are never used by the processor's internal logic.
Instruction: sample/preload. Description: sample/preload performs two functions. When the TAP controller is in the Capture-DR state, the sample instruction occurs on the rising edge of TCK and provides a snapshot of the component's normal operation without inte
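For host-side tooling, the public instruction codes in the table above can be kept in a small lookup table. The Python sketch below is illustrative only: the dictionary and helper names are hypothetical, the 5-bit width is inferred from the codes themselves, and the least-significant-bit-first order follows from the instruction register description (most significant bit connected to TDI, least significant bit connected to TDO).

```python
IR_WIDTH = 5  # inferred from the 5-bit instruction codes

# Public instruction codes as listed in the JTAG instruction table.
JTAG_OPCODES = {
    "extest": 0b00000,
    "sample": 0b00001,
    "dbgrx":  0b00010,
    "clamp":  0b00100,
    "ldic":   0b00111,
    "highz":  0b01000,
    "dcsr":   0b01001,  # selected by the SELDCSR instruction
    "dbgtx":  0b10000,
    "idcode": 0b11110,
    "bypass": 0b11111,
}


def tdi_bits(name):
    """Bits to present on TDI during Shift-IR, first bit shifted first.

    The first bit shifted in ends up in the least significant IR position.
    """
    opcode = JTAG_OPCODES[name]
    return [(opcode >> i) & 1 for i in range(IR_WIDTH)]
```

Driving TMS and TCK to walk the TAP state machine is outside the scope of this sketch.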
112. 13.12.1 Trace Buffer CP Registers
13.12.1.1 Checkpoint Registers
13.12.1.2 Trace Buffer Register (TBREG)
13.13 Trace Buffer
13.13.1
13.13.1.1 Exception Message Byte
13.13.1.2 Non-exception Message Byte
13.13.1.3
13.13.2 Trace Buffer Usage
13.14 Downloading Code in the ICache
13.14.1 LDIC JTAG Command
13.14.2 LDIC JTAG Data Register
13.14.3 LDIC Cache Functions
13.14.4 Loading IC During Reset
13.14.4.1 Loading IC During Cold Reset for Debug
13.14.4.2 Loading IC During a Warm Reset for Debug
13.14.5 Dynamically Loading IC After Reset
13.14.5.1 Dynamic Code Download Synchronization
13.14.6 Mini Instruction Cache Overview
13.15 Halt Mode Software Protocol
13.15.1 Starting a Debug Session
13.15.1.1 Setting up Override Vector Tables
13.15.1.2 Placing the Handler in Memory
13.15.2 Implementing a Debug Handler
13.15.2.1 Debug Handler Entry
13.15.2.2 Debug Handler Restrictions
13.15.2.3 Dynamic Debug Handler
13.15.2.4 High-Speed Download
13.15.3 Ending a Debug Session
113. 10.3.9 Aborted Access. As discussed in Section 10.2.6, "Abort" on page 10-11, any request from the Intel 80200 processor can be aborted by the chipset or memory. This might occur if there was a PCI error, or if a request was issued to unimplemented memory. Figure 10-13 shows an aborted read. Read A is issued at time 10 ns for 32 bytes of data. At 50 ns, DValid goes high to indicate the beginning of the first data cycle. Abort is not asserted along with this DValid assertion. At 60 ns, DValid for the second data cycle is asserted, and this time Abort is asserted. This indicates that this transaction is aborted and no more data cycles begin for this transaction. Notice, however, that the first data cycle, begun with the DValid assertion at 50 ns, is still ongoing, and the Intel 80200 processor latches that data and returns it to the core as valid data. If transaction A was a write request rather than a read, the Intel 80200 processor would drive data onto the bus at time 70 ns, as requested by the DValid with Abort NOT asserted at 50 ns. No data would be driven at time 80 ns, because the DValid at time 60 ns had Abort asserted high. When the Abort data cycle occurs, the Intel 80200 processor ends that transaction and expects no further data on it. When DValid next goes high at time 80 ns t
114. 14, "Cache Lockdown Functions" on page 7-14 for the exact command. Chapter 5: Branch Target Buffer. The Intel 80200 processor (based on Intel XScale microarchitecture, compliant with the ARM Architecture V5TE) uses dynamic branch prediction to reduce the penalties associated with changing the flow of program execution. The Intel 80200 processor features a branch target buffer that provides the instruction cache with the target address of branch type instructions. The branch target buffer is implemented as a 128-entry, direct mapped cache. This chapter is primarily for those optimizing their code for performance. An understanding of the branch target buffer is needed in this case, so that code can be scheduled to best utilize the performance benefits of the branch target buffer. 5.1 Branch Target Buffer (BTB) Operation. The BTB stores the history of branches that have executed, along with their targets. Figure 5-1 shows an entry in the BTB, where the tag is the instruction address of a previously executed branch and the data contains the target address of the previously executed branch along with two bits of history information. Figure 5-1. BTB Entry: TAG = Branch Address[31:9,1]; DATA = Target Address[31:1], History Bits[1:0]. The BTB takes the current instruction address and checks to see if this address is a branch that was previously seen. It uses bits [8:2] of the
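The entry layout above suggests how an instruction address splits into a BTB index and tag. The Python sketch below assumes that bits [8:2] are used as the index into the 128 entries (the excerpt is cut off before saying so, but seven bits match a 128-entry direct mapped structure) and represents the tag as the pair (address[31:9], address[1]). The function names are hypothetical.

```python
BTB_ENTRIES = 128


def btb_index(address):
    """Assumed index: address bits [8:2], giving 128 possible entries."""
    return (address >> 2) & (BTB_ENTRIES - 1)


def btb_tag(address):
    """Tag fields from Figure 5-1: (address[31:9], address[1])."""
    return (address & 0xFFFFFFFF) >> 9, (address >> 1) & 1
```

Two branches whose addresses differ by a multiple of 512 bytes would map to the same entry under this scheme, which is why code placement can affect prediction behaviour.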
115. Reset value: all implemented bits are 0.
Bits  Access                           Description
3     Read/Write                       EE (ECC Enable): 0 = disable ECC generation and checking; 1 = enable ECC.
2     Read/Write                       SC (Single-bit Correct Enable): 0 = disable single-bit error correction; 1 = enable single-bit error correction.
1     Read-unpredictable / Write-as-1  Reserved.
0     Read/Write                       SR (Single-bit Error Reporting Enable): 0 = disable single-bit error reporting; 1 = enable single-bit error reporting.
BCUCTL.TP allows software to determine if the BCU has any pending memory transactions. This may be used to ensure that all memory operations have completed before attempting to modify system state. For example, the code in Example 11-1 simply waits until the BCU is idle.
Example 11-1. Loop to Wait on BCU
; Wait for BCU to finish all outstanding operations
waitLoop  MRC P13, 0, R15, C0, C1, 0   ; read BCUCTL, update condition code flags
          BMI waitLoop                 ; try again if BCU is busy (bit 31 set)
; Get here when BCU is no longer busy
Of course, this sort of code should be in a cacheable region, or it may keep the BCU eternally busy fetching the code. If BCUCTL.EE is set, then the BCU performs ECC generation/checking as described in Table 11-1. The other control bits (SC, SR) should only be modified while the EE bit is cleared. To ensure correct operation, hardware waits until all pendi
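The bit assignments above can be captured in a small Python helper for composing and decoding BCUCTL values, for example in a host-side test script. This is a sketch under the layout given in the text (EE = bit 3, SC = bit 2, reserved bit 1 written as 1, SR = bit 0, and the busy indication TP in bit 31); the names are hypothetical.

```python
BCUCTL_TP = 1 << 31        # transactions pending (BCU busy), per Example 11-1
BCUCTL_EE = 1 << 3         # ECC enable
BCUCTL_SC = 1 << 2         # single-bit correct enable
BCUCTL_RESERVED1 = 1 << 1  # reserved, listed as write-as-1
BCUCTL_SR = 1 << 0         # single-bit error reporting enable


def bcuctl_value(ee, sc, sr):
    """Compose a value to write to BCUCTL from the three enable flags."""
    value = BCUCTL_RESERVED1
    if ee:
        value |= BCUCTL_EE
    if sc:
        value |= BCUCTL_SC
    if sr:
        value |= BCUCTL_SR
    return value


def bcu_busy(bcuctl):
    """True while the TP bit (bit 31) reads as set."""
    return bool(bcuctl & BCUCTL_TP)
```

Since SC and SR should only be modified while EE is cleared, a careful sequence would write them first with EE = 0 and set EE in a second write.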
116. 4 Registers 8-15: Software Debug
8 System Management
8.1
8.2 Processor Reset
8.2.1 Reset Sequence
8.2.2 Reset Effect on Outputs
8.3 Power Management
8.3.1 Invocation
8.3.2 Signals Associated with Power Management
9
9.1
9.2
9.3 Programmer Model
9.3.1
9.3.2 INTSRC
9.3.3 INTSTR
10 External Bus
10.1 General Description
10.2 Signal Description
10.2.1
10.2.1.1 Intel 80200 Processor Use of the Request Bus
10.2.2
10.2.3 Critical Word First
117. 5 Registers
7.2.1 Register 0: ID and Cache Type Registers
7.2.2 Register 1: Control and Auxiliary Control Registers
7.2.3 Register 2: Translation Table Base Register
7.2.4 Register 3: Domain Access Control Register
7.2.5 Register 4: Reserved
7.2.6 Register 5: Fault Status Register
7.2.7 Register 6: Fault Address Register
7.2.8 Register 7: Cache Functions
7.2.9 Register 8: TLB Operations
7.2.10 Register 9: Cache Lock Down
7.2.11 Register 10: TLB Lock Down
7.2.12 Register 11-12: Reserved
7.2.13 Register 13: Process ID
7.2.13.1 The PID Register Affect On Addresses
7.2.14 Register 14: Breakpoint Registers
7.2.15 Register 15: Coprocessor Access Register
7.3
7.3.1 Registers 0-3: Performance Monitoring
7.3.2 Register 4-5: Reserved
7.3.3 Registers 6-7: Clock and Power Management
7.3
118. 5 Registers
CRn = 14, CRm = 8: instruction breakpoint register 0 (IBCR0)
CRn = 14, CRm = 9: instruction breakpoint register 1 (IBCR1)
CRn = 14, CRm = 0: data breakpoint register 0 (DBR0)
CRn = 14, CRm = 3: data breakpoint register 1 (DBR1)
CRn = 14, CRm = 4: data breakpoint control register (DBCON)
CP15 registers are accessible using MRC and MCR. CRn and CRm specify the register to access. The opcode_1 and opcode_2 fields are not used and should be set to 0.
CP14 Registers
CRn = 8, CRm = 0: TX Register (TX)
CRn = 9, CRm = 0: RX Register (RX)
CRn = 10, CRm = 0: Debug Control and Status Register (DCSR)
CRn = 11, CRm = 0: Trace Buffer Register (TBREG)
CRn = 12, CRm = 0: Checkpoint Register 0 (CHKPT0)
CRn = 13, CRm = 0: Checkpoint Register 1 (CHKPT1)
CRn = 14, CRm = 0: TXRX Control Register (TXRXCTRL)
CP14 registers are accessible using MRC, MCR, LDC and STC (CDP to any CP14 register causes an undefined instruction trap). The CRn field specifies the number of the register to access. The CRm, opcode_1 and opcode_2 fields are not used and should be set to 0. Software access to all debug registers must be done in privileged mode. User mode access generates an undefined instruction exception. Specifying registers which do not exist has unpredictable results. The TX and RX registers, certain bits in the TXRXCTRL register, and certain bits in the DCSR can be accessed by a debugger through the JTAG interface.
119. 5 cycles to complete:
    stmia r0, {r2, r3}
    add r1, r1, #1
The alternative version, which is shown below, would only take 3 cycles to complete:
    strd r2, [r0]
    add r1, r1, #1
B.5.2 Scheduling Data Processing Instructions. Most Intel 80200 processor data processing instructions have a result latency of 1 cycle. This means that the current instruction is able to use the result from the previous data processing instruction. However, the result latency is 2 cycles if the current instruction needs to use the result of the previous data processing instruction for a shift by immediate. As a result, the following code segment would incur a 1-cycle stall for the mov instruction:
    sub r6, r7, r8
    add r1, r2, r3
    mov r4, r1, LSL #2
The code above can be rearranged as follows to remove the 1-cycle stall:
    add r1, r2, r3
    sub r6, r7, r8
    mov r4, r1, LSL #2
All data processing instructions incur a 2-cycle issue penalty and a 2-cycle result penalty when the shifter operand is a shift/rotate by a register, or the shifter operand is RRX. Since the next instruction would always incur a 2-cycle issue penalty, there is no way to avoid such a stall except by re-writing the assembler instruction. Consider the following segment of code:
    mov r3, #10
    mul r4, r2, r3
    add r5, r6, r2, LSL r3
    sub r7, r8, r2
The subtract instruction would
120. 6 domains was being accessed when a data abort occurred (Read / Write).
Bits 3:0 (Read / Write): Status. Type of data access being attempted.

Register 6: Fault Address Register

Fault Address Register (reset value: unpredictable)
Bits 31:0 (Read / Write): Fault Virtual Address. Contains the MVA of the data access that caused the memory abort.

7.2.8 Register 7: Cache Functions (Table 7-12)

All the functions defined in the first generation of Intel StrongARM appear here. The Intel 80200 processor adds other functions as well. This register should be accessed as write-only. Reads from this register, as with an MRC, have an undefined effect.

The Drain Write Buffer function not only drains the write buffer but also drains the fill buffer.

The Intel 80200 processor does not check permissions on addresses supplied for cache or TLB functions. Because only privileged software may execute these functions, full accessibility is assumed. Cache functions do not generate any of the following: translation faults, domain faults, permission faults.

The invalidate instruction cache line command does not invalidate the BTB. If software invalidates a
121. [Instruction encoding diagram for MIAxy: bit positions 31 down to 0]

Operation:
    if ConditionPassed(<cond>) then
        if (bit[17] == 0) <operand1> = Rm[15:0]
        else <operand1> = Rm[31:16]
        if (bit[16] == 0) <operand2> = Rs[15:0]
        else <operand2> = Rs[31:16]
        acc0[39:0] = sign_extend(<operand1> * <operand2>) + acc0[39:0]

Exceptions: none

Qualifiers: Condition Code. The S bit is always cleared; no condition code flags are updated.

Notes: Instruction timings can be found in Section 14.4.4, Multiply Instruction Timings on page 14-6. Specifying R15 for register Rs or Rm has unpredictable results. acc0 is defined to be 0b000 on the 80200.

The MIAxy instruction performs one 16-bit signed multiply and accumulates the result into a single 40-bit accumulator. x refers to either the upper half or lower half of register Rm (multiplicand) and y refers to the upper or lower half of Rs (multiplier). A value of 0x1 selects bits [31:16] of the register, which is specified in the mnemonic as T (for top). A value of 0x0 selects bits [15:0] of the register, which is specified in the mnemonic as B (for bottom).

MIAxy does not support unsigned multiplication; all values in Rs and Rm are interpreted as signed data values. The instruction is only executed if the condition specified in the instruction matches the condition code status.
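The MIAxy operation described above can be modeled in a few lines of C. This is a behavioral sketch only, assuming the accumulator simply wraps modulo 2^40 (the excerpt does not describe saturation); the names signed_half and miaxy are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define ACC_MASK 0xFFFFFFFFFFull   /* 40-bit accumulator */

/* Pick the top (bits 31:16) or bottom (bits 15:0) halfword of a register
 * and interpret it as a signed 16-bit value. */
static int32_t signed_half(uint32_t reg, int top)
{
    uint32_t h = top ? (reg >> 16) : (reg & 0xFFFFu);
    return (h & 0x8000u) ? (int32_t)h - 0x10000 : (int32_t)h;
}

/* Model of MIA<x><y>: acc0 = sign_extend(op1 * op2) + acc0, kept to 40 bits.
 * x_top corresponds to instruction bit 17, y_top to bit 16. */
static uint64_t miaxy(uint64_t acc, uint32_t rm, int x_top,
                      uint32_t rs, int y_top)
{
    assert((acc & ~ACC_MASK) == 0);            /* caller passes a 40-bit value */
    int64_t product = (int64_t)signed_half(rm, x_top) * signed_half(rs, y_top);
    return (acc + (uint64_t)product) & ACC_MASK; /* unsigned wraparound */
}
```

With this model, a MIATB of Rm = 0xFFFF0002 and Rs = 0x00000003 onto a zero accumulator multiplies -1 by 3 and leaves the 40-bit two's complement representation of -3.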
122. [Register diagram: ECARx, bits 31:0 = addr; reset value undefined]

Bits 31:0 (Read / Write-ignored): addr, the physical address that yielded an error.

The contents of these registers should only be considered valid if the corresponding bit in register BCUCTL is set. When an error is detected, the BCU selects a free ELOGx/ECARx register pair and updates it with information relevant to the error. It then sets BCUCTL.Ex.

ECARx holds the physical address associated with the transaction that caused the error. If the transaction was a multiple-cycle burst, then only the initial address is captured; the actual error occurred at some point during the burst. Because a burst can cover up to 32 bytes, software may only know that the error was in the range [ECARx, ECARx + 31].

System software bears the burden of translating this address to a logical one, if needed. Since changes to the page tables may make this a non-trivial exercise, systems that respond to errors may wish to delay page table updates until the BCU is quiescent, as determined by the BCUCTL.TP bit.

For ECC errors, the lower bits of this register are always zero, because ECC always operates on 64-bit items. This means ECC errors have an ECARx register in which bits 2:0 are zero. For bus aborts, which could occur on byte-sized transactions, all address bits are recorded.
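A small C sketch of how error-handling software might interpret a captured ECARx value under the rules above. The struct and function names are hypothetical, and the code does not read the hardware register; it only works on a value the caller has already obtained.

```c
#include <assert.h>
#include <stdint.h>

struct addr_range {
    uint32_t first;  /* captured initial address of the transaction */
    uint32_t last;   /* last byte a 32-byte burst could have touched */
};

/* For a burst, only the initial address is captured, so the error may lie
 * anywhere in [ECARx, ECARx + 31]. */
static struct addr_range ecar_burst_range(uint32_t ecar)
{
    assert(ecar <= 0xFFFFFFFFu - 31u);   /* keep the range from wrapping */
    struct addr_range r = { ecar, ecar + 31u };
    return r;
}

/* ECC operates on 64-bit items, so an ECC error address has bits 2:0 clear.
 * A value with any of those bits set can only come from a bus abort. */
static int ecar_consistent_with_ecc(uint32_t ecar)
{
    return (ecar & 0x7u) == 0;
}
```

The range is a conservative bound: software that needs the exact failing location has to re-read or scrub the whole 32-byte window.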
123. 14.4.4 Multiply Instruction Timings

Table 14-7. Multiply Instruction Timings (Sheet 1 of 2)

Mnemonic   Rs Value (Early Termination)       S-Bit   Minimum   Minimum Result         Minimum Resource
                                              Value   Issue     Latency                Latency (Throughput)
                                                      Latency
MLA        Rs[31:15] = 0x00000 or 0x1FFFF     0       1         2                      1
                                              1       2         2                      2
           Rs[31:27] = 0x00 or 0x1F           0       1         3                      2
                                              1       3         3                      3
           all others                         0       1         4                      3
                                              1       4         4                      4
MUL        Rs[31:15] = 0x00000 or 0x1FFFF     0       1         2                      1
                                              1       2         2                      2
           Rs[31:27] = 0x00 or 0x1F           0       1         3                      2
                                              1       3         3                      3
           all others                         0       1         4                      3
                                              1       4         4                      4
SMLAL      Rs[31:15] = 0x00000 or 0x1FFFF     0       2         RdLo = 2; RdHi = 3     2
                                              1       3         3                      3
           Rs[31:27] = 0x00 or 0x1F           0       2         RdLo = 3; RdHi = 4     3
                                              1       4         4                      4
           all others                         0       2         RdLo = 4; RdHi = 5     4
                                              1       5         5                      5
SMLALxy    N/A                                N/A     2         RdLo = 2; RdHi = 3     2
SMLAWy     N/A                                N/A     1         3                      2
SMLAxy     N/A                                N/A     1         2                      1
SMULL      Rs[31:15] = 0x00000 or 0x1FFFF     0       1         RdLo = 2; RdHi = 3     2
                                              1       3         3                      3
           Rs[31:27] = 0x00 or 0x1F           0       1         RdLo = 3; RdHi = 4     3
                                              1       4         4                      4
           all others                         0       1         RdLo = 4; RdHi = 5     4
                                              1       5         5                      5
SMULWy     N/A                                N/A     1         3                      2
SMULxy     N/A                                N/A     1         2                      1
UMLAL      Rs[31:15] = 0x00000                0       2         RdLo = 2; RdHi = 3     2
                                              1       3         3                      3
           Rs[31:27] = 0x00                   0       2         RdLo = 3; RdHi = 4     3
                                              1       4         4                      4
           all others                         0       2         RdLo = 4; RdHi = 5     4
                                              1       5         5                      5
124. 80200 processor a 3X performance boost over SA-110 alone.

System Control Coprocessor: Additional bits and registers were added to CP15 to support the added functionality of the Intel 80200 processor. Also, some system resources are controlled from CP14 registers. See Chapter 7, Configuration, for more information.

New Instructions and Instruction Formats: Chapter 2, Programming Model, discusses new instructions and instruction formats. These instructions would have generated Undefined faults on the SA-110.

Augmented Page Table Descriptors: Chapter 2, Programming Model, discusses how the Intel 80200 processor augments the SA-110 descriptors.

Optimization Guide

B.1 Introduction

This appendix contains optimization techniques for achieving the highest performance from the Intel 80200 processor based on Intel XScale microarchitecture (compliant with the ARM Architecture V5TE). It is written for developers who are optimizing compilers or performance analysis tools for processors based on the Intel 80200 processor. It can also be used by application developers to obtain the best performance from their assembly language code. The optimizations presented in this appendix are based on the Intel 80200 processor core, and hence can be applied to all products that are based on the Intel 80200 processor core.

The Intel 80200 processor architecture includes a superpipelin
125. 6.1.2 Mini-Data Cache Overview

The mini-data cache is a 2-Kbyte, 2-way set associative cache; this means there are 32 sets, with each set containing 2 ways. Each way of a set contains 32 bytes (one cache line) and one valid bit. There also exist 2 dirty bits for every line, one for the lower 16 bytes and the other one for the upper 16 bytes. When a store hits the cache, the dirty bit associated with it is set. The replacement policy is a round-robin algorithm.

Figure 6-2, Mini-Data Cache Organization on page 6-3, shows the cache organization and how the data address is used to access the cache.

The mini-data cache is virtually addressed and virtually tagged and supports the same caching policies as the data cache. However, lines can't be locked into the mini-data cache.

[Figure 6-2. Mini-Data Cache Organization. The example shows Set 0 (of Set 0 through Set 31) being selected by the set index; the selected data word (4 bytes) passes through byte alignment and sign extension to the destination register. Data Address (Virtual): Tag = bits 31:10, Set Index = bits 9:5, Word Select = bits 4:2, Byte Select = bits 1:0.]

6.1.3 Write Buffer and Fil
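The address split in Figure 6-2 can be expressed directly in C. The sketch below only models how a virtual address maps onto tag, set, word and byte fields; the type and function names are hypothetical, and nothing here touches real cache state.

```c
#include <assert.h>
#include <stdint.h>

enum { MINI_DC_SETS = 32, MINI_DC_WAYS = 2, MINI_DC_LINE_BYTES = 32 };

struct mini_dc_addr {
    uint32_t tag;   /* bits 31:10 */
    unsigned set;   /* bits 9:5, one of 32 sets */
    unsigned word;  /* bits 4:2, one of 8 words in the line */
    unsigned byte;  /* bits 1:0, byte within the word */
};

/* Split a virtual data address into the fields used to index the cache. */
static struct mini_dc_addr mini_dc_split(uint32_t va)
{
    struct mini_dc_addr a;
    a.tag  = va >> 10;
    a.set  = (va >> 5) & 0x1Fu;
    a.word = (va >> 2) & 0x7u;
    a.byte = va & 0x3u;
    return a;
}

/* Inverse of mini_dc_split, useful for checking the field widths. */
static uint32_t mini_dc_join(struct mini_dc_addr a)
{
    assert(a.tag < (1u << 22) && a.set < 32u && a.word < 8u && a.byte < 4u);
    return (a.tag << 10) | ((uint32_t)a.set << 5) |
           ((uint32_t)a.word << 2) | (uint32_t)a.byte;
}
```

The geometry is self-consistent: 32 sets times 2 ways times 32 bytes gives the stated 2 Kbytes, and two addresses that differ only above bit 9 compete for the same set.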
126. [Figure: JTAG timing diagram. Waveforms shown: TDI, data input to IR, IR shift register, parallel output of IR (INSTRUCTION, then IDCODE), data input to TDR, TDR shift register, parallel output of TDR (OLD DATA, then NEW DATA), register selected (TEST DATA REGISTER), TDO enable (INACTIVE / ACTIVE), and TDO. Hatched regions denote don't care or undefined.]
127. AR 12 move from internal accumulator (MRA)

Bits 19:16, RdHi: specifies the high-order eight bits (39:32) of the internal accumulator. On a read of the acc, this 8-bit high-order field is sign extended. On a write to the acc, the lower 8 bits of this register are written to acc[39:32].

Bits 15:12, RdLo: specifies the low-order 32 bits of the internal accumulator.

Should be zero: This field could be used in future implementations to specify the type of saturation to perform on the read of an internal accumulator (e.g., a signed saturation to 16 bits may be useful for some filter algorithms).

Bit 3: Should be zero.

Bits 2:0, acc: specifies 1 of 8 internal accumulators. The Intel 80200 processor only implements acc0; access to any other acc is unpredictable.

MAR has the same encoding as MCRR (to coprocessor 0) and MRA has the same encoding as MRRC (to coprocessor 0). These instructions move 64 bits of data to/from ARM registers from/to coprocessor registers. MCRR and MRRC are defined in ARM's DSP instruction set.

Disassemblers not aware of MAR and MRA produce the following syntax:

    MCRR<cond> p0, 0x0, RdLo, RdHi, c0
    MRRC<cond> p0, 0x0, RdLo, RdHi, c0

Table 2-6. MAR<cond> acc0, RdLo, RdHi
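The register-pair view of the 40-bit accumulator can be sketched in C as follows. This is a behavioral model of the read and write rules stated above, not processor code; the function names mar_model and mra_model are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define ACC40_MASK 0xFFFFFFFFFFull

/* Model of MAR: RdLo supplies acc[31:0]; only the low 8 bits of RdHi
 * are written, to acc[39:32]. */
static uint64_t mar_model(uint32_t rdlo, uint32_t rdhi)
{
    return ((uint64_t)(rdhi & 0xFFu) << 32) | rdlo;
}

/* Model of MRA: RdLo receives acc[31:0]; RdHi receives acc[39:32]
 * sign extended to 32 bits. */
static void mra_model(uint64_t acc, uint32_t *rdlo, uint32_t *rdhi)
{
    assert(rdlo != 0 && rdhi != 0);
    assert((acc & ~ACC40_MASK) == 0);
    uint32_t hi = (uint32_t)(acc >> 32) & 0xFFu;
    *rdlo = (uint32_t)(acc & 0xFFFFFFFFu);
    *rdhi = (hi & 0x80u) ? (hi | 0xFFFFFF00u) : hi;
}
```

One consequence worth noting: a MAR followed by an MRA does not necessarily return the original RdHi, because bits 31:8 of RdHi are discarded on the write and regenerated by sign extension on the read.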
128. ATE DR

NOTE: All state transitions are based on the value of TMS.

C.2.5.1 Test-Logic-Reset State

In this state, test logic is disabled to allow normal operation of the Intel 80200 processor. Test logic is disabled by loading the idcode register. No matter what the state of the controller, it enters the Test-Logic-Reset state when the TMS input is held high (1) for at least five rising edges of TCK. The controller remains in this state while TMS is high. The TAP controller is also forced to enter this state by enabling TRST.

If the controller exits the Test-Logic-Reset state as a result of an erroneous low signal on the TMS line at the time of a rising edge on TCK (for example, a glitch due to external interference), it returns to the Test-Logic-Reset state following three rising edges of TCK with the TMS line at the intended high logic level. Test logic operation is such that no disturbance is caused to on-chip system logic operation as the result of such an error.

C.2.5.2 Run-Test/Idle State

The TAP controller enters the Run-Test/Idle state between scan operations. The controller remains in this state as long as TMS is held low. In the Run-Test/Idle state, the runbist instruction is performed; the result is reported in the RUNBIST register.
129. CFG reflects the effect of the PLLCFG pin: CCLKCFG contains either one or four.

CCLK Modification Procedure

    MOV R1, #7                 ; New CCLKCFG value
    MCR P14, 0, R1, C6, C0, 0  ; Change core clock frequency
                               ; and wait for PLL to re-lock

8.2 Processor Reset

The RESET pin must be asserted when CLK and power are applied to the processor. CLK, MCLK and power must be present and stable before RESET can be deasserted. To ensure reset, RESET must be asserted for at least 32 MCLK cycles once both clocks and the power are stable. Reset pulses shorter than this have an undefined effect.

To simplify external reset circuitry, the Intel 80200 processor has a Schmitt trigger and an internal pull-up resistor on RESET. This allows a board to implement power-on reset simply by connecting a 0.1 uF capacitor between RESET and VSSp.

TRST, the JTAG reset pin, must be asserted simultaneously with RESET. It is permissible to deassert it when RESET is deasserted, or TRST may remain asserted (tied to its active state). Like RESET, TRST has a Schmitt trigger and internal pull-up, so it is permissible to tie these together to a single external capacitor.

Hardware debug solutions may impose special requirements on RESET and TRST. Before designing a board, consult the guidelines provided by the hardware
130. Coprocessor Access Register

An application may request the use of a shared resource (e.g., the accumulator in CP0) by issuing an access to the resource, which results in an undefined exception. The operating system may grant access to this coprocessor by setting the appropriate bit in the Coprocessor Access Register and return to the application, where the access is retried.

Sharing resources among different applications requires a state-saving mechanism. Two possibilities are:

The operating system, during a context switch, could save the state of the coprocessor if the last executing process had access rights to the coprocessor.

The operating system, during a request for access, saves off the old coprocessor state and stores it with the last process to have had access to it.

Under both scenarios, the OS needs to restore state when a request for access is made. This means the OS has to maintain a list of which processes are modifying CP0 and their associated state.

Example 7-1. Disallowing access to CP0

The following code clears bit 0 of the CPAR. This will cause the processor to fault if software attempts to access CP0.

    LDR R0, =0x0000              ; bit 0 is clear
    MCR P15, 0, R0, C15, C1, 0   ; move to CPAR
    CPWAIT                       ; wait for effect

Table 7-20. Coprocessor Access Register
131. Data Cache Operation

The following discussions refer to the data cache and mini-data cache as one cache (data/mini-data), since their behavior is the same when accessed.

Operation When Caching is Enabled

When the data/mini-data cache is enabled for an access, the data/mini-data cache compares the address of the request against the addresses of data that it is currently holding. If the line containing the address of the request is resident in the cache, the access hits the cache. For a load operation, the cache returns the requested data to the destination register, and for a store operation, the data is stored into the cache. The data associated with the store may also be written to external memory if write-through caching is specified for that area of memory. If the cache does not contain the requested data, the access misses the cache, and the sequence of events that follows depends on the configuration of the cache, the configuration of the MMU and the page attributes, which are described in Section 6.2.3.2, Read Miss Policy on page 6-6, and Section 6.2.3.3, Write Miss Policy on page 6-7, for a load miss and store miss respectively.

Operation When Data Caching is Disabled

The data/mini-data cache is still accessed even though it is disabled. If a load hits the cache, it returns the requested data to the destination register. If a store hits the cache, the data is written into the cache. Any access that misses the cache does not a
132. Ded hs

In the data structure shown above, the fields Year2DatePay, Year2DateTax, Year2Date401KDed, and Year2DateOtherDed are likely to change with each pay check. The remaining fields, however, change very rarely. If the fields are laid out as shown above, assuming that the structure is aligned on a 32-byte boundary, modifications to the Year2Date fields are likely to use two write buffers when the data is written out to memory. However, we can restrict the number of write buffers that are commonly used to 1 by rearranging the fields in the above data structure as shown below:

    struct employee {
        struct employee *prev;
        struct employee *next;
        int ssno;
        int empid;
        float Year2DatePay;
        float Year2DateTax;
        float Year2Date401KDed;
        float Year2DateOtherDed;
    };

Cache Blocking

Cache blocking techniques, such as strip-mining, are used to improve temporal locality of the data. Given a large data set that can be reused across multiple passes of a loop, data blocking divides the data into smaller chunks which can be loaded into the cache during the first loop and then be available for processing on subsequent loops, thus minimizing cache misses and reducing bus traffi
133. F low, to indicate that this burst starts at the lowest word pair and returns sequentially. Notice that the data returning is the eight-word block beginning at 0x240, not 0x249. The low five bits of the address are not used for determining what data to return in a 32-byte read request (cache line fill).

DValid stays high for three cycles, drops for a cycle, and then is asserted for one more cycle. Two clocks after each clock where DValid is sampled high, the next sequential pair of data words is driven on the bus. This data can come back-to-back in sequential cycles (as it likely would from a burst SDRAM, for example) or can be spaced further apart, as shown by the last data cycle here. Each data cycle in the transaction is independent of the others in timing, as long as the order of cycles in the transaction is maintained, consistent with the CWF value asserted on the first data cycle.

[Figure: Read Burst, No CWF. Timing diagram spanning 0 ns to 100 ns.]

10.3.3 Read Burst, Critical Word First Data Return

Figure 10-6 is the same as the last with o
134. The Fault Status Register (FSR) indicates which fault has occurred, which could be either a prefetch abort or a data abort. Bit 10 extends the encoding of the status field for prefetch aborts and data aborts. The definition of the extended status field is found in Section 2.3.4, Event Architecture on page 2-12. Bit 9 indicates that a debug event occurred, and the exact source of the event is found in the debug control and status register (CP14, register 10). When bit 9 is set, the domain and extended status field are undefined.

Upon entry into the prefetch abort or data abort handler, hardware updates this register with the source of the exception. Software is not required to clear these fields.

Fault Status Register (reset value: unpredictable)

Bits 31:11 (Read-unpredictable / Write-as-Zero): Reserved.
Bit 10 (Read / Write): Status Field Extension (X). This bit is used to extend the encoding of the Status field, when there is a prefetch abort and when there is a data abort. The definition of this field can be found in Section 2.3.4, Event Architecture on page 2-12.
Bit 9 (Read / Write): Debug Event (D). This flag indicates a debug event has occurred and that the cause of the debug event is found in the MOE field of the debug control register (CP14, register 10).
Bit 8 (Read-as-zero / Write-as-Zero): 0.
Domain: Specifies which of the 1
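A hedged C sketch of how an abort handler might pull the fields out of an FSR value. Bits 10, 9 and 8 follow the description above; placing the domain in bits 7:4 and the status in bits 3:0 follows the standard ARM fault status layout and is an assumption here, since the excerpt is cut off before those rows. The names fsr_decode and fsr_encode are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct fsr_fields {
    unsigned debug_event;  /* bit 9 (D) */
    unsigned domain;       /* bits 7:4 (assumed position) */
    unsigned fs;           /* 5-bit fault status: bit 10 (X) then bits 3:0 */
};

static struct fsr_fields fsr_decode(uint32_t fsr)
{
    struct fsr_fields f;
    f.debug_event = (fsr >> 9) & 1u;
    f.domain      = (fsr >> 4) & 0xFu;
    f.fs          = (((fsr >> 10) & 1u) << 4) | (fsr & 0xFu);
    return f;
}

/* Build an FSR value for test fixtures; bit 8 is always left as zero. */
static uint32_t fsr_encode(unsigned fs, unsigned debug_event, unsigned domain)
{
    assert(fs < 32u && debug_event < 2u && domain < 16u);
    return ((uint32_t)(fs >> 4) << 10) | ((uint32_t)debug_event << 9) |
           ((uint32_t)domain << 4) | (uint32_t)(fs & 0xFu);
}
```

A handler should test debug_event first: when bit 9 is set, the text above says the domain and extended status fields are undefined, so fs and domain must not be interpreted.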
135. G_REG[34] to signal the data is valid. Since DBG_REG[34] is never cleared by the debugger in this case, the 0-to-1 transition used to enable the debugger write to RX would not occur.

3. Set TXRXCTRL[31]. When the debugger writes new data to RX, the logic automatically sets TXRXCTRL[31], signalling to the debug handler that the data is valid.

4. Set the overflow flag (TXRXCTRL[30]). During high-speed download, the debugger does not poll to see if the handler has read the previous data. If the debug handler stalls long enough, the debugger may overwrite the previous data before the handler can read it. The logic sets the overflow flag when the previous data has not been read yet and the debugger has just written new data to RX.

[Figure 13-5. RX Write Logic. DBG_REG[34] drives the RX write enable and is then cleared; the logic sets TXRXCTRL[31] and the overflow flag TXRXCTRL[30], synchronized to the Intel 80200 processor CLK.]

13.11.6.2 DBGRX Data Register

The bits in the DBGRX data register (Figure 13-6) are used by the debugger to send data to the processor. The data register also contains a bit to flush previously written data and a high-speed
136. Intel 80200 Processor based on Intel XScale Microarchitecture
Developer's Manual
March 2003
Order Number: 273411-003

Information in this document is provided in connection with Intel products. No license, express or implied, by estoppel or otherwise, to any intellectual property rights is granted by this document. Except as provided in Intel's Terms and Conditions of Sale for such products, Intel assumes no liability whatsoever, and Intel disclaims any express or implied warranty, relating to sale and/or use of Intel products including liability or warranties relating to fitness for a particular purpose, merchantability, or infringement of any patent, copyright or other intellectual property right. Intel products are not intended for use in medical, life saving, or life sustaining applications.

Intel may make changes to specifications and product descriptions at any time, without notice.

Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them.

The Intel 80200 Processor may contain design defects or errors known as errata which may cause the product t
137. 12.4 Performance Monitor Control Register (PMNC)

The performance monitor control register (PMNC) is a coprocessor register that:
    controls which events PMN0 and PMN1 monitor
    detects which counter overflowed
    enables/disables interrupt reporting
    extends CCNT counting by six more bits (cycles between counter rollover = 2^38)
    resets all counters to zero
    and enables the entire mechanism

Table 12-3 shows the format of the PMNC register.

Table 12-3. Performance Monitor Control Register (CP14, register 0) (Sheet 1 of 2)

Reset value: E and inten are 0, others unpredictable.

Bits 31:28 (Read-unpredictable / Write-as-0): Reserved.
Bits 27:20 (Read / Write): Event Count1. Identifies the source of events that PMN1 counts. See Table 12-4 for a description of the values this field may contain.
Bits 19:12 (Read / Write): Event Count0. Identifies the source of events that PMN0 counts. See Table 12-4 for a description of the values this field may contain.
Bit 11 (Read-unpredictable / Write-as-0): Reserved.
Bits 10:8 (Read / Write): Overflow/Interrupt Flag. Identifies which counter overflowed.
    Bit 10 = clock counter overflow flag
    Bit 9 = performance counter 1 overflow flag
    Bit 8 = performance counter 0 overflow flag
    Read Values
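The field positions listed above can be turned into a small C sketch for composing the event-select part of a PMNC value and for decoding the overflow flags. Only the fields visible in this excerpt are handled; the interrupt enable and control bits in the low byte are left at zero because their layout is not shown here. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum {
    PMNC_OVF_PMN0 = 1u << 8,   /* performance counter 0 overflow flag */
    PMNC_OVF_PMN1 = 1u << 9,   /* performance counter 1 overflow flag */
    PMNC_OVF_CCNT = 1u << 10   /* clock counter overflow flag */
};

/* Place the two 8-bit event numbers into evtCount0 (bits 19:12) and
 * evtCount1 (bits 27:20). Reserved bits are written as 0. */
static uint32_t pmnc_select_events(unsigned evt0, unsigned evt1)
{
    assert(evt0 <= 0xFFu && evt1 <= 0xFFu);
    return ((uint32_t)evt1 << 20) | ((uint32_t)evt0 << 12);
}

/* Extract the three overflow flags, still in their register positions. */
static uint32_t pmnc_overflow_flags(uint32_t pmnc)
{
    return pmnc & (PMNC_OVF_PMN0 | PMNC_OVF_PMN1 | PMNC_OVF_CCNT);
}
```

For instance, pmnc_select_events(0x1, 0x2) selects event 0x1 for PMN0 and event 0x2 for PMN1. The result would still need the enable and interrupt bits merged in before being written with an MCR to CP14 register 0.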
138. JTAG ensures that components function correctly, connections between components are correct, and components interact correctly on the printed circuit board.

C.2.1 Boundary-Scan Architecture

Boundary-scan test logic consists of a Boundary-Scan register and support logic. These are accessed through a Test Access Port (TAP). The TAP provides a simple serial interface that allows all processor signal pins to be driven and/or sampled, thereby providing the direct control and monitoring of processor pins at the system level.

This mode of operation is valuable for design debugging and fault diagnosis, since it permits examination of connections not normally accessible to the test system. The following subsections describe the boundary-scan test logic elements: TAP pins, instruction register, test data registers, and TAP controller. Figure C-1 illustrates how these pieces fit together to form the JTAG unit.

To ensure that the processor does not enter an invalid state during boundary scan, it must have received a valid reset on RESET, including a valid clock input on clk.

[Figure C-1. Test Access Port Block Diagram. Shows the TDI, TMS and TCK inputs, the ID register (32 bits), the bypass register (1 bit), and the control and clock signals.]
139. NC

The following are a few notes about controlling the performance monitoring mechanism:

An interrupt is reported when a counter overflow flag is set and its associated interrupt enable bit is set in the PMNC register. The interrupt remains asserted until software clears the overflow flag by writing a one to the flag that is set. (Note that the interrupt unit (Chapter 9, Interrupts) and the CPSR must have enabled the interrupt in order for software to receive it.)

The counters continue to record events even after they overflow.

12.5 Performance Monitoring Events

Table 12-4 lists events that may be monitored by the PMU. Each of the Performance Monitor Count Registers (PMN0 and PMN1) can count any listed event. Software selects which event is counted by each PMNx register by programming the evtCountx fields of the PMNC register.

Table 12-4. Performance Monitoring Events

Event Number (evtCount0 or evtCount1)   Event Definition
0x0    Instruction cache miss requires fetch from external memory.
0x1    Instruction cache cannot deliver an instruction. This could indicate an ICache miss or an ITLB miss. This event occurs every cycle in which the condition is present.
0x2    Stall due to a data dependency. This event occurs every cycle in which the condition is present.
140. O LSR n

Dividing a signed integer by an integer constant should be optimized to make use of the shift operation whenever possible.

Dividing r0, containing a signed value, by an integer constant that can be represented as 2^n:

    mov r1, r0, ASR #31
    add r0, r0, r1, LSR #(32 - n)
    mov r0, r0, ASR #n

The add instruction would stall for 1 cycle. The stall can be prevented by filling in another instruction before add.

B.3.5 Effective Use of Addressing Modes

The Intel 80200 processor provides a variety of addressing modes that make indexing an array of objects highly efficient. For a detailed description of these addressing modes, please refer to the ARM Architecture Reference Manual. The following code samples illustrate how various kinds of array operations can be optimized to make use of these addressing modes.

Set the contents of the word pointed to by r0 to the value contained in r1 and make r0 point to the next word:

    str r1, [r0], #4

Increment the contents of r0 to make it point to the next word and set the contents of the word pointed to, to the value contained in r1:

    str r1, [r0, #4]!

Set the contents of the word pointed to by r0 to the value contained in r1 and make r0 point to the previous word:

    str r1, [r0], #-4

Decrement the contents of r0 to make it point to the previous word and set t
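The three-instruction signed divide by 2^n above can be mirrored step by step in C to see why the bias is needed: a plain arithmetic shift rounds toward minus infinity, while division must round toward zero. The sketch below is illustrative only; asr32 and div_pow2 are hypothetical names, and asr32 is written so that it does not depend on how a given C compiler shifts negative values.

```c
#include <assert.h>
#include <stdint.h>

/* Portable arithmetic shift right for 32-bit signed values. */
static int32_t asr32(int32_t v, unsigned n)
{
    return (v < 0) ? ~(~v >> n) : (v >> n);
}

/* Divide x by 2^n (1 <= n <= 31), rounding toward zero. */
static int32_t div_pow2(int32_t x, unsigned n)
{
    assert(n >= 1u && n <= 31u);
    /* mov r1, r0, ASR #31 : all ones if x is negative, else zero */
    uint32_t sign = (x < 0) ? 0xFFFFFFFFu : 0u;
    /* r1, LSR #(32 - n)   : bias of 2^n - 1 for negative x, else 0 */
    uint32_t bias = sign >> (32u - n);
    /* add r0, r0, r1, LSR #(32 - n) : cannot overflow, bias is only
     * nonzero when x is negative */
    int32_t biased = x + (int32_t)bias;
    /* mov r0, r0, ASR #n */
    return asr32(biased, n);
}
```

For non-negative inputs the bias is zero and the sequence degenerates to a single shift, which is why the unsigned case needs only one instruction.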
141. 1.1.2.11 JTAG
1.2 Terminology and Conventions
1.2.1 Number Representation
1.2.2 Terminology and Acronyms
1.3 Other Relevant Documents
2 Programming Model
2.1 ARM Architecture Compliance
2.2 ARM Architecture Implementation Options
2.2.1 Big Endian versus Little Endian
2.2.2 [title illegible in source]
2.2.3 [title illegible in source]
2.2.4 ARM DSP-Enhanced Instruction Set
2.2.5 Base Register Update
2.3 Extensions to ARM Architecture
2.3.1 DSP Coprocessor 0 (CP0)
2.3.1.1 Multiply With Internal Accumulate Format
2.3.1.2 Internal Accumulator Access Format
2.3.2 New Page Attributes
2.3.3 Additions to CP15 Functionality
2.3.4 Event Architecture
2.3.4.1 Exception Summary
2.3.4.2 [title illegible in source]
2.3.4.3 Prefetch Aborts
2.3.4.4 Data Aborts
142. Read / Write    Fault Status
6      0    Read / Write                  Fault Address
7      0    Read-unpredictable / Write    Cache Operations
8      0    Read-unpredictable / Write    TLB Operations
9      0    Read-unpredictable / Write    Cache Lock Down
10     0    Read / Write                  TLB Lock Down
11-12       Unpredictable                 Reserved
13     0    Read / Write                  Process ID (PID)
14     0    Read / Write                  Breakpoint Registers
15     0    Read / Write                  (CRm = 1) CP Access

7.2.1 Register 0: ID and Cache Type Registers

Register 0 houses two read-only registers that are used for part identification: an ID register and a cache type register.

The ID Register is selected when opcode_2 = 0. This register returns the code for the Intel 80200 processor: 0x69052000 for A0 stepping/revision. The low-order four bits of the register are the chip revision number and will be incremented for future steppings.

Table 7-4. ID Register (reset value: as shown; bits 3:0 = Revision)

Bits 31:24 (Read / Write-Ignored): Implementation trademark (0x69 = 'i' = Intel Corporation).
Bits 23:16 (Read / Write-Ignored): Architecture version = ARM Version 5.
Bits 15:4 (Read / Write-Ignored): Part Number. Implementation Specified. Intel 80200 processor
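The ID register layout above can be decoded with a few shifts and masks. The following C sketch is an illustration using the A0 value quoted in the text; the struct and function names are hypothetical, and obtaining the raw value (an MRC from CP15 register 0 with opcode_2 = 0) is outside its scope.

```c
#include <assert.h>
#include <stdint.h>

struct cpu_id {
    unsigned implementer;   /* bits 31:24, 0x69 ('i') for Intel */
    unsigned architecture;  /* bits 23:16 */
    unsigned part;          /* bits 15:4 */
    unsigned revision;      /* bits 3:0, incremented for new steppings */
};

static struct cpu_id decode_id(uint32_t id)
{
    struct cpu_id d;
    d.implementer  = (id >> 24) & 0xFFu;
    d.architecture = (id >> 16) & 0xFFu;
    d.part         = (id >> 4) & 0xFFFu;
    d.revision     = id & 0xFu;
    return d;
}

/* Self-check against the A0 stepping value given in the text. */
static void check_a0_example(void)
{
    struct cpu_id d = decode_id(0x69052000u);
    assert(d.implementer == 0x69u);
    assert(d.revision == 0u);
}
```

Software that must run on later steppings should compare the fields above the revision nibble and treat the revision only as a minimum or a lookup key, since the text says it changes between steppings.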
143. SA-110 and the Intel 80200 processor. If an SA-110 application had operations that had specific timing relationships, these relationships would not hold for the Intel 80200 processor.

In all typical applications, the Intel 80200 processor performance greatly exceeds that of SA-110. The following is a list of all the new features introduced in the Intel 80200 processor that significantly improve the instruction per cycle (IPC) number seen in SA-110. Any application written for SA-110 will encounter these performance-enhancing features.

    Data cache is non-blocking, which means execution does not stall for every data cache miss.
    Multiply instructions execute in parallel with other non-multiply instructions.
    The instruction and data cache size doubled.
    A branch target buffer was added to reduce branch latency.

Overall, the instruction timings specified in Section 14.4, Instruction Latencies on page 14-3, are similar to the SA-110. There are, however, a few cases that may negatively impact the IPC number:

    Worst-case branch latency increased from two to five cycles.
    The minimum result latency of loads that zero-extend the result data increased from two to three cycles.
    The minimum result latency of an ALU operation followed by a shift operation increased from one to two cycles.

Keep in mind that the previous discussion was focusing on IPC differences. There is also the frequency difference, which is targeted to give the Intel
144. SR JTAG instruction into JTAG IR and scan in a value to set the Halt Mode bit in DCSR and to set the hold_rst signal. For details of the SELDCSR, refer to Section 13.11.2.

After hold_rst is set, de-assert the RESET pin. Internally the processor remains held in reset.

After RESET is de-asserted, wait 2030 TCKs.

Load the LDIC JTAG instruction into JTAG IR.

Download code into instruction cache in 33-bit packets as described in Section 13.14.3, LDIC Cache Functions.

After code download is complete, clock a minimum of 15 TCKs following the last update_dr in LDIC mode.

Place the SELDCSR JTAG instruction into the JTAG IR and scan in a value to clear the hold_rst signal. The Halt Mode bit must remain set to prevent the instruction cache from being invalidated.

When hold_rst is cleared, internal reset is de-asserted, and the processor executes the reset vector at address 0.

An additional issue for debug is setting up the reset vector trap. This must be done before the internal reset signal is de-asserted. As described in Section 13.4.3, the Halt Mode and the Trap Reset bits in the DCSR must be set prior to de-asserting reset in order to trap the reset vector. There are two possibilities for setting up the reset vector trap:

The reset vector trap can be set up before the instruction cache is loaded by scanning in a DCSR value that sets the Trap Reset bit in addition to the Halt Mode bit and the hold_rst signal; OR

The reset vector trap can be set
145. TRST (Input): Provides asynchronous initialization of the test logic. TRST is pulled low when not being driven. Assertion of this pin puts the TAP controller in the Test-Logic-Reset (initial) state. An external source drives this signal high for TAP controller operation.
C.2.3 Instruction Register (IR)
The instruction register holds instruction codes shifted through the Test Data Input (TDI) pin. The instruction codes are used to select the specific test operation to be performed and the test data register to be accessed. The instruction register is a parallel-loadable, master/slave-configured, 5-bit-wide serial shift register with latched outputs. Data is loaded into the IR serially through the TDI pin, clocked by the rising edge of TCK, when the TAP controller is in the Shift-IR state. The shifted-in instruction becomes active upon latching from the master stage to the slave stage in the Update-IR state. At that time the IR outputs, along with the TAP finite state machine outputs, are decoded to select and control the test data register selected by that instruction. Upon latching, all actions caused by any previous instructions must terminate. The instruction determines the test to be performed, the test data register to be accessed, or both (Table C-2). The IR is five bits
146. Table 2-14, Intel 80200 Processor Encoding of Fault Status for Data Aborts, on page 2-14. The Fault Address Register is set to the effective data address of the instruction, and R14_ABORT is the address of the aborted instruction + 8.
Table 2-14. Intel 80200 Processor Encoding of Fault Status for Data Aborts
Priority  Sources                                     FS[10,3:0]  Domain   FAR
Highest   Alignment                                   0b000x1     invalid  valid
          External Abort on Translation, First level  0b01100     invalid  valid
          External Abort on Translation, Second level 0b01110     valid    valid
          Translation, Section                        0b00101     invalid  valid
          Translation, Page                           0b00111     valid    valid
          Domain, Section                             0b01001     valid    valid
          Domain, Page                                0b01011     valid    valid
          Permission, Section                         0b01101     valid    valid
          Permission, Page                            0b01111     valid    valid
          Lock Abort                                  0b10100     invalid  invalid
          Imprecise External Data Abort               0b10110     invalid  invalid
Lowest    Data Cache Parity Error Exception           0b11000     invalid  invalid
Lock Abort: this data abort occurs on an MMU lock operation (data or instruction TLB) or on an Instruction Cache lock operation.
a. All other encodings not listed in the table are reserved.
Imprecise data aborts:
• A data cache parity error is imprecise; the extended Status field of the Fault Status Register is set to 0b11000.
• All external data aborts, except for those generated on a data MMU translation, are imprecise. The Fault Address Register for
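The fault status encodings in Table 2-14 lend themselves to a small lookup routine. The following is a minimal sketch, not code from the manual: it assumes that the column heading FS[10,3:0] means bit 10 of the Fault Status Register concatenated with bits 3:0, and the function names `fsr_status` and `data_abort_source` are hypothetical.

```c
#include <stdint.h>

/* Assumption: the 5-bit status is FSR bit 10 followed by FSR bits 3:0. */
unsigned fsr_status(uint32_t fsr)
{
    return (((fsr >> 10) & 1u) << 4) | (fsr & 0xFu);
}

/* Map the 5-bit status to the source named in Table 2-14. */
const char *data_abort_source(uint32_t fsr)
{
    switch (fsr_status(fsr)) {
    case 0x01: /* 0b000x1: x is a don't-care bit */
    case 0x03: return "Alignment";
    case 0x0C: return "External abort on translation, first level";
    case 0x0E: return "External abort on translation, second level";
    case 0x05: return "Translation, section";
    case 0x07: return "Translation, page";
    case 0x09: return "Domain, section";
    case 0x0B: return "Domain, page";
    case 0x0D: return "Permission, section";
    case 0x0F: return "Permission, page";
    case 0x14: return "Lock abort";
    case 0x16: return "Imprecise external data abort";
    case 0x18: return "Data cache parity error";
    default:   return "Reserved";
    }
}
```

A data abort handler could call `data_abort_source()` on the value read from the Fault Status Register to log the cause; encodings outside the table fall through to "Reserved", matching the table footnote.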
147. U operation takes 1 5 cycles SA1100
A.3 Architecture Deviations
A.3.1 Read Buffer
A Read Buffer is not supported on the Intel 80200 processor, and the definition of CP15 register 9 has changed from controlling the read buffer on SA-110 to one that controls cache/TLB lock-down on the Intel 80200 processor. The functionality of the Read Buffer can be realized with the existing architecture features of the Intel 80200 processor. The Read Buffer allowed applications to prefetch data into the Read Buffer for future use and did not stall SA-110 during the data prefetch. The Intel 80200 processor provides a PLD instruction that preloads 32 bytes of data into the data cache. This instruction, combined with the support for hit-under-miss, provides similar functionality to the Read Buffer. Note that this is for cacheable data only, so software needs to re-map non-cacheable read-only data to cacheable before issuing the PLD instruction. If the targeted memory region is being shared by another hardware entity, be sure to issue a cache invalidate (see Section 7.2.8, Register 7: Cache Functions, on page 7-11) before the PLD. This ensures an incoherent copy doesn't already exist in the Intel 80200 processor cache.
A.3.2 26-bit Mode
SA-110 supported 26-bit
148. While the Intel 80200 processor is held off the issue bus, data can continue to be returned to the Intel 80200 processor on the data bus, and write data can be requested from the Intel 80200 processor by the memory controller. When write data is not being requested by the chipset from the Intel 80200 processor, the Intel 80200 processor automatically floats the data bus and associated signals (D, BE, DCB). This means that write data from another master, or read data to another master, can be driven onto the data buses without informing or requesting permission from the Intel 80200 processor. The chipset owns the data bus and completely controls who gets to drive it. Care must be taken, however, not to assert the Intel 80200 processor DValid pin at any time other than two cycles before the next valid Intel 80200 processor data cycle.
This system allows memory accesses from the Intel 80200 processor to be pipelined with memory accesses from another master. An Intel 80200 processor memory access can be issued, followed by a memory access from another master, followed by another Intel 80200 processor memory access, all before the data cycles of the first access begin. The data cycles for those transactions can then occur sequentially, except for any required turnaround cycles on the data buses. This would, of course, require that the other master and the chipset support this pipelining.
149. a half cache line then this would be done in 12 core clocks. Consider the following code sample:
    add r1, r1, #1
    ; Sequence of instructions that use r2
    ; These instructions leave r3 unchanged
    ldr r2, [r3]
    add r3, r3, #4
    mov r4, r3
    sub r2, r2, #1
The sub instruction above would stall if the data being loaded misses the cache. These stalls can be avoided by using a pld instruction well ahead, as shown below. The number of instructions required to ensure a stall does not occur is proportional to Nr for a given system.
    pld [r3]
    add r1, r1, #1
    ; Sequence of instructions that use r2
    ; These instructions leave r3 unchanged
    ldr r2, [r3]
    add r3, r3, #4
    mov r4, r3
    sub r2, r2, #1
B.4.4.2 Prefetch Loop Scheduling
When adding prefetch to a loop which operates on arrays, it may be advantageous to prefetch ahead one, two, or more iterations. The data for future iterations is located in memory at a fixed offset from the data for the current iteration. This makes it easy to predict where to fetch the data. The number of iterations to prefetch ahead is referred to as the prefetch scheduling distance, or psd. For the Intel 80200 processor this can be calculated as:
$$\mathrm{psd} = \left\lfloor \frac{N_{lookup} + N_{linexfer} \times N_{pref} + N_{hwlinexfer} \times N_{evict}}{CPI \times N_{inst}} \right\rfloor$$
Where N Is the number of cache lines
150. ace buffer instead of the roll-over message. The incremental counter is still set to 0b1111, meaning 15 instructions executed between the last branch and the current branch.
13.13.1.3 Address Bytes
Only indirect branch entries contain address bytes in addition to the message byte. Indirect branch entries always have four address bytes indicating the target of that indirect branch. When reading the trace buffer, the MSB of the target address is read out first, the LSB is the fourth byte read out, and the indirect branch message byte is the fifth byte read out. The byte organization of the indirect branch message is shown in Figure 13-8.
Figure 13-8. Indirect Branch Entry Address Byte Organization
    target[31:24]    (read first)
    target[23:16]
    target[15:8]
    target[7:0]
    indirect_br_msg  (read last)
The trace buffer is read by software from top to bottom in this figure. The message byte is always the last of the 5 bytes in the entry to be read.
13.13.2 Trace Buffer Usage
The Intel 80200 processor trace buffer is 256 bytes in length. The first byte read from the buffer represents the oldest trace history information in the buffer. The last (256th) byte read repres
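The read-out order in Figure 13-8 can be illustrated with a short host-side decoder. This is a sketch under stated assumptions rather than debugger code from the manual: the struct and function names are hypothetical, and the input is taken to be the five bytes of one indirect branch entry in the order software reads them from the trace buffer.

```c
#include <stdint.h>

/* Hypothetical container for one decoded indirect branch entry. */
typedef struct {
    uint32_t target; /* branch target address                    */
    uint8_t  msg;    /* message byte, read last                  */
    uint8_t  count;  /* CCCC field: low four bits of the message */
} indirect_entry_t;

/* bytes[0] is the first byte read (target[31:24]);
 * bytes[4] is the fifth byte read (the message byte). */
indirect_entry_t decode_indirect_entry(const uint8_t bytes[5])
{
    indirect_entry_t e;
    e.target = ((uint32_t)bytes[0] << 24) |
               ((uint32_t)bytes[1] << 16) |
               ((uint32_t)bytes[2] << 8)  |
                (uint32_t)bytes[3];
    e.msg   = bytes[4];
    e.count = (uint8_t)(bytes[4] & 0x0Fu);
    return e;
}
```

Because the message byte arrives last, a host that walks the buffer from oldest to newest cannot tell that four bytes are address bytes until it reaches the fifth; a practical decoder would likely parse the captured buffer from the newest entry backwards.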
151. ache Lock Register
[Bit-field diagram, bits 31:0; bit 0 is L.] reset value: writeable bits set to 0
Bits   Access                               Description
31:1   Read-unpredictable / Write-as-Zero   Reserved
0      Read-unpredictable / Write           Data Cache Lock Mode (L): 0 = No locking occurs; 1 = Any fill into the data cache while this bit is set gets locked in
7.2.11 Register 10: TLB Lock Down
Register 10 is used for locking down entries into the instruction TLB and data TLB. The protocol for locking down entries can be found in Chapter 3, Memory Management. Lock/unlock operations on a TLB when the MMU is disabled have an undefined effect. This register should be accessed as write-only. Reads from this register, as with an MRC, have an undefined effect. Table 7-16 shows the command for locking down entries in the instruction TLB and data TLB. The entry to lock is specified by the virtual address in Rd.
Table 7-16. TLB Lockdown Functions
Function                         opcode_2  CRm     Data     Instruction
Translate and Lock I TLB entry   0b000     0b0100  MVA      MCR p15, 0, Rd, c10, c4, 0
Translate and Lock D TLB entry   0b000     0b1000  MVA      MCR p15, 0, Rd, c10, c8, 0
Unlock I TLB                     0b001     0b0100  Ignored  MCR p15, 0, Rd, c10, c4, 1
Unlock D TLB                     0b001     0b1000  Ignored  MCR p15, 0, Rd
152. ack Statistics: 0x2 (data stall), 0xC (DCache writeback)
Instruction TLB Efficiency: 0x7 (instruction count), 0x3 (ITLB miss)
Data TLB Efficiency: 0xA (Dcache access), 0x4 (DTLB miss)
Instruction Cache Efficiency Mode
PMN0 totals the number of instructions that were executed, which does not include instructions fetched from the instruction cache that were never executed. This can happen if a branch instruction changes the program flow; the instruction cache may retrieve the next sequential instructions after the branch before it receives the target address of the branch. PMN1 counts the number of instruction fetch requests to external memory. Each of these requests loads 32 bytes at a time.
Statistics derived from these two events:
• Instruction cache miss rate. This is derived by dividing PMN1 by PMN0.
• The average number of cycles it took to execute an instruction, commonly referred to as cycles-per-instruction (CPI). CPI can be derived by dividing CCNT by PMN0, where CCNT was used to measure total execution time.
12.5.2 Data Cache Efficiency Mode
PMN0 totals the number of data cache accesses, which includes cacheable and non-cacheable accesses, mini-data cache accesses, and accesses made to locations configured as data RAM
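The two statistics described for Instruction Cache Efficiency Mode are simple ratios of the counter values. A minimal sketch follows, assuming the counters have already been read into ordinary variables; the function names are hypothetical, and returning 0.0 when no instructions were counted is a choice made for this example only.

```c
#include <stdint.h>

/* Instruction cache miss rate: fetch requests to external memory (PMN1)
 * divided by instructions executed (PMN0). */
double icache_miss_rate(uint32_t pmn0_instructions, uint32_t pmn1_fetches)
{
    if (pmn0_instructions == 0)
        return 0.0; /* nothing executed; avoid dividing by zero */
    return (double)pmn1_fetches / (double)pmn0_instructions;
}

/* Cycles per instruction: clock count (CCNT) divided by
 * instructions executed (PMN0). */
double cycles_per_instruction(uint32_t ccnt_cycles, uint32_t pmn0_instructions)
{
    if (pmn0_instructions == 0)
        return 0.0;
    return (double)ccnt_cycles / (double)pmn0_instructions;
}
```

For example, 1500 cycles and 250 external fetch requests over 1000 executed instructions give a CPI of 1.5 and a miss rate of 0.25.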
153. ail in this section. The last scenario described above is shown in Figure 13-14.
Figure 13-13. Code Download During a Warm Reset For Debug
[Timing diagram not reproduced. Signals: RESET pin, TRST, Internal RESET, hold_rst, JTAG IR, Halt Mode. Annotations: RESET pin asserted until the hold_rst signal is set; RESET does not affect the Mini IC (Halt Mode bit set); hold_rst keeps internal reset asserted; wait 2030 TCKs; clock 15 TCKs after the last update_dr in LDIC mode. JTAG IR sequence: SELDCSR (set hold_rst signal, keep Halt Mode bit set), LDIC (enter LDIC mode, load code into IC), SELDCSR (clear hold_rst signal, keep Halt Mode bit set).]
As shown in the figure, reset does not invalidate the instruction cache because the processor is in Halt Mode. Since the instruction cache was not invalidated, it may contain valid lines. The host must avoid downloading code to virtual addresses that are already valid in the instruction cache (mini IC or main IC); otherwise the processor may behave unpredictably. There are several possible solutions that ensure code is not downloaded to a VA that already exists in the instruction cache:
1. Since the mini instruction cache was not invalidated, any code previously downloaded into the mini IC is valid in the mini IC, so it is not necessary to download the same code again.
154. al address R1 and lock into data TLB. Repeat sequence for virtual address in R2. Invalidate the data TLB entry specified by the virtual address in R2. Translate virtual address R2 and lock into data TLB. Wait for locks to complete. The MMU is guaranteed to be updated at this point; the next instruction will see the locked data TLB entries.
Care must be exercised here when allowing exceptions to occur during this routine whose handlers may have data that lies in a page that is trying to be locked into the TLB.
3.4.4 Round-Robin Replacement Algorithm
The line replacement algorithm for the TLBs is round-robin; there is a round-robin pointer that keeps track of the next entry to replace. The next entry to replace is the one sequentially after the last entry that was written. For example, if the last virtual-to-physical address translation was written into entry 5, the next entry to replace is entry 6. At reset, the round-robin pointer is set to entry 31. Once a translation is written into entry 31, the round-robin pointer gets set to the next available entry, beginning with entry 0 if no entries have been locked down. Subsequent translations move the round-robin pointer to the next sequential entry until entry 31 is reached, where it wraps back to entry 0 upon the next translation. A lock pointer is
155. alls
B.2.3 Main Execution Pipeline
B.2.3.1 F1 / F2 (Instruction Fetch) Pipestages
B.2.3.2 ID (Instruction Decode) Pipestage
B.2.3.3 RF (Register File / Shifter) Pipestage
B.2.3.4 X1 (Execute) Pipestage
B.2.3.5 X2 (Execute 2) Pipestage
B.2.3.6 WB (write-back)
B.2.4 Memory Pipeline
B.2.4.1 D1 and D2 Pipestage
B.2.5 Multiply/Multiply Accumulate (MAC) Pipeline
B.2.5.1 Behavioral Description
B.3 Basic Optimizations
B.3.1 Conditional Instructions
B.3.1.1 Optimizing Condition Checks
B.3.1.2 Optimizing Branches
B.3.1.3 Optimizing Complex Expression
B.3.2 Bit Field Manipulation
B.3.3 Optimizing the Use of Immediate Values
B.3.4 Optimizing Integer Multiply and Divide
B.3.5 Effective Use of Addressing Modes
B.4 Cache and Prefetch Optimizations
156. ally needs multiple data processing instructions to create the target address and branch to it. One possibility is to set up vector traps on the non-reset exception vectors. These vector locations can then be used to extend the reset vector. Another solution is to have the reset vector do a direct branch to some intermediate code. This intermediate code can then use several instructions to create the debug handler start address and branch to it. This would require another line in the mini instruction cache, since the intermediate code must also be downloaded. This method also requires that the layout of the debug handler be well thought out, to avoid the intermediate code overwriting a line of debug handler code or vice versa. For the indirect branch cases, a temporary scratch register may be necessary to hold intermediate values while computing the final target address. DBG_r13 can be used for this purpose (see Section 13.15.2.2, Debug Handler Restrictions, for restrictions on DBG_r13 usage).
13.15.2 Implementing a Debug Handler
The debugger uses the debug handler to examine or modify processor state by sending commands and reading data through JTAG. The API between the debugger and debug handler is specific to a debugger implementation. Intel provides a standard debug handler and API which c
157. an be used by third-party vendors. Issues and details for writing a debug handler are discussed in this section and in the Intel Debug Handler.
13.15.2.1 Debug Handler Entry
When the debugger requests an external debug break or is waiting for an internal break, it should poll the TR bit through JTAG to determine when the processor has entered Debug Mode. The debug handler entry code must do a write to TX to signal the debugger that the processor has entered Debug Mode. The write to TX sets the TR bit, signalling the host that a debug exception has occurred and the processor has entered Debug Mode. The value of the data written to TX is implementation defined (debug break message, contents of a register to save on the host, etc.).
13.15.2.2 Debug Handler Restrictions
The Debug Handler executes in Debug Mode, which is similar to other privileged processor modes; however, there are some differences. Following are restrictions on Debug Handler code and differences between Debug Mode and other privileged modes:
• The processor is in Special Debug State following a debug exception, and thus has special functionality as described in Section 13.5.1, Halt Mode.
• Although address translation and PID remapping are disabled for instruction accesses (as defined in Special Debug State), data accesses use the normal address translation and PID remapping mechanisms.
• Debug Mode does not have a dedicated stack pointer, DBG_r13. Although DBG_r13 exists, it is not a general purpose re
158. apter describes the programming model of the Intel 80200 processor based on Intel XScale microarchitecture, namely the implementation options and extensions to the ARM Version 5 architecture. The ARM Architecture Version 5TE Specification (ARM DDI 0100E) describes Version 5TE of the ARM Architecture, including the Thumb ISA and ARM DSP-Enhanced ISA.
2.1 ARM Architecture Compliance
The Intel 80200 processor implements the integer instruction set architecture specified in ARM Version 5TE. T refers to the Thumb instruction set and E refers to the DSP-Enhanced instruction set. ARM Version 5 introduces a few more architecture features over Version 4, specifically the addition of tiny pages (1 Kbyte), a new instruction (CLZ) that counts the leading zeroes in a data value, enhanced ARM-Thumb transfer instructions, and a modification of the system control coprocessor, CP15.
2.2 ARM Architecture Implementation Options
2.2.1 Big Endian versus Little Endian
The Intel 80200 processor supports both big and little endian data representation. The B-bit of the Control Register (Coprocessor 15, register 1, bit 7) selects big and little endian mode. To run in big endian mode, the B bit must be set before attempting any sub-word accesses to memory, or undefined results occur. Note that this bit takes effect even if the MMU is disabled.
2.2.2 26-Bit Code
The Intel 80200 processor does not support 26-bit code.
2.2.3 Thumb
The Intel 80200
159. are a few isolated cases where back-to-back ALU operations may result in a one-cycle delay in the execution. These cases are defined in Chapter 14, Performance Considerations.
PMN1 counts the number of writeback operations emitted by the data cache. These writebacks occur when the data cache evicts a dirty line of data to make room for a newly requested line, or as the result of a clean operation (CP15, register 7).
Statistics derived from these two events:
• The percentage of total execution cycles the processor stalled because of a data dependency. This is calculated by dividing PMN0 by CCNT, which was used to measure total execution time. Often a compiler can reschedule code to avoid these penalties when given the right optimization switches.
• Total number of data writeback requests to external memory can be derived solely with PMN1.
12.5.6 Instruction TLB Efficiency Mode
PMN0 totals the number of instructions that were executed, which does not include instructions that were translated by the instruction TLB and never executed. This can happen if a branch instruction changes the program flow; the instruction TLB may translate the next sequential instructions after the branch before it receives the target address of the branch. PMN1 counts the number of instruction TLB table-walk
160. as a round-robin pointer that keeps track of the next line in that set to replace. The next line to replace in a set is the next sequential line after the last one that was just filled. For example, if the line for the last fill was written into way 5, set 2, the next line to replace for that set would be way 6. None of the other round-robin pointers for the other sets are affected in this case. After reset, way 31 is pointed to by the round-robin pointer for all the sets. Once a line is written into way 31, the round-robin pointer points to the first available way of a set, beginning with way 0 if no lines have been re-configured as data RAM in that particular set. Re-configuring lines as data RAM effectively reduces the available lines for cache updating. For example, if the first three lines of a set were re-configured, the round-robin pointer would point to the line at way 3 after it rolled over from way 31. Refer to Section 6.4, Re-configuring the Data Cache as Data RAM, on page 6-12 for more details on data RAM.
The mini-data cache follows the same round-robin replacement algorithm as the data cache, except that there are only two lines the round-robin pointer can point to, such that the round-robin pointer always points to the least recently filled line. A least recently used replacement algorithm is not supported because the purpose of the mini-data cache is to cache data that exhibits low temporal locality, i.e., data that is placed into the mini
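The per-set round-robin behaviour described above for the main data cache can be modelled in a few lines. This is a software sketch for illustration only, not a description of the hardware implementation: the type and function names are hypothetical, and it assumes that lines re-configured as data RAM occupy the lowest-numbered ways of the set, as in the "first three lines" example.

```c
enum { NUM_WAYS = 32 };

/* Hypothetical model of one set's replacement state. */
typedef struct {
    unsigned next_way;    /* way the round-robin pointer points to  */
    unsigned locked_ways; /* ways 0..locked_ways-1 are data RAM     */
} rr_set_t;

/* After reset the pointer of every set points to way 31. */
void rr_reset(rr_set_t *s, unsigned locked_ways)
{
    s->next_way = NUM_WAYS - 1;
    s->locked_ways = locked_ways;
}

/* Fill the way currently pointed to, then advance the pointer.
 * After way 31 the pointer rolls over to the first available way. */
unsigned rr_fill(rr_set_t *s)
{
    unsigned way = s->next_way;
    s->next_way = (way == NUM_WAYS - 1) ? s->locked_ways : way + 1;
    return way;
}
```

With three locked ways, the first fill after reset lands in way 31 and the next in way 3, matching the roll-over example in the text; with no locked ways the sequence is 31, 0, 1, and so on.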
161. Internal Accumulator Access Format
The Intel 80200 processor defines a new instruction format for accessing internal accumulators in CP0. Table 2-5, Internal Accumulator Access Format, on page 2-7 shows that the opcode falls into the coprocessor register transfer space. The RdHi and RdLo fields allow up to 64 bits of data transfer between Intel StrongARM registers and an internal accumulator. The acc field specifies 1 of 8 internal accumulators to transfer data to/from. The Intel 80200 processor implements a single 40-bit accumulator, referred to as acc0; future implementations can specify multiple internal accumulators of varying sizes, up to 64 bits. Access to the internal accumulator is allowed in all processor modes (user and privileged) as long as bit 0 of the Coprocessor Access Register is set. See Section 7.2.15, Register 15: Coprocessor Access Register, on page 7-18 for more details. The Intel 80200 processor implements two instructions, MAR and MRA, that move two Intel StrongARM registers to acc0 and move acc0 to two Intel StrongARM registers, respectively.
Table 2-5. Internal Accumulator Access Format
[Bit-field diagram, bits 31:0, not reproduced.]
Bits    Description                                  Notes
31:28   cond: ARM condition codes
20      L: move to/from internal accumulator         0 = move to internal accumulator (M
162. The line-allocate command will not operate on the mini Data Cache, so system software must clean this cache by reading 2 Kbytes of contiguous unused data into it. This data must be unused and reserved for this purpose so that it will not already be in the cache. It must reside in a page that is marked as mini Data Cache cacheable (see Section 2.3.2). The time it takes to execute the global clean operation depends on the number of dirty lines in the cache.
6.4 Re-configuring the Data Cache as Data RAM
Software has the ability to lock tags associated with 32-byte lines in the data cache, thus creating the appearance of data RAM. Any subsequent access to this line always hits the cache unless it is invalidated. Once a line is locked into the data cache, it is no longer available for cache allocation on a line fill. Up to 28 lines in each set can be reconfigured as data RAM, such that the maximum data RAM size is 28 Kbytes. Hardware does not support locking lines into the mini-data cache; any attempt to do this produces unpredictable results. There are two methods for locking tags into the data cache; the method of choice depends on the application. One method is used to lock data that resides in external memory into the data cache, and the other met
163. [Timing diagram residue. Signal labels: TDI, data input to IR, IR shift register, parallel output of IR, data input to TDR, TDR shift register, parallel output of TDR, register selected, TDO enable, TDO.]
Figure C-5. Timing Diagram Illustrating the Loading of Data Register
[Timing diagram not reproduced. Signal labels include TCK, TMS, and Controller State.]
164. ation and Control on page 6-16.
External Aborts
External aborts are imprecise exceptions on the Intel 80200 processor. External aborts may be generated by external memory when, for example, there is a parity error detected during a memory access. Since the Intel 80200 processor continues instruction execution during external memory requests, the PC that is saved in R14 when the exception is reported may not be the PC of the offending instruction. Many instructions may have executed after the offending instruction.
SA-110 always stalls the processor when there is an external load request, or when an external write request occurs with C=0 and B=0. External aborts detected on these requests would be precise, meaning the PC that is saved in R14_ABORT when the exception is reported is the address of the offending instruction. The Intel 80200 processor also stalls the processor for these requests, but an external abort on these requests would not be precise. The value in R14_ABORT when the exception is reported would not be that of the offending instruction. Software relying on this feature of SA-110 may not be compatible with the Intel 80200 processor.
A.3.6 Performance Differences
There exist significant performance differences in program execution between
165. Locking Entries
Individual entries can be locked into the instruction and data TLBs. See Table 7-14, Cache Lockdown Functions, on page 7-14 for the exact commands. If a lock operation finds the virtual address translation already resident in the TLB, the results are unpredictable. An invalidate-by-entry command before the lock command ensures proper operation. Software can also accomplish this by invalidating all entries, as shown in Example 3-2 on page 3-7. Locking entries into either the instruction TLB or data TLB reduces the number of entries available for hardware to cache other virtual-to-physical address translations by the number that was locked down. A procedure for locking entries into the instruction TLB is shown in Example 3-2 on page 3-7. If an MMU abort is generated during an instruction or data TLB lock operation, the Fault Status Register is updated to indicate a Lock Abort (see Section 2.3.4.4, Data Aborts, on page 2-14), and the exception is reported as a data abort.
Locking Entries into the Instruction TLB
    ; R1, R2 and R3 contain the virtual addresses to translate and lock into the instruction TLB.
    ; The value in R0 is ignored in the following instruction.
    ; Hardware guarantees that accesses to CP15 occur in program order.
    MCR P15, 0, R0, C8, C5, 0     ; Invalidate the entire instruction TLB
    MCR P15, 0, R1, C10, C4, 0    ; Translate virtual address R1 and lock
166. ble 8 Read / Write and Bufferable bits are in the page table descriptors: 0 = Enabled, 1 = Disabled
7.2.3 Register 2: Translation Table Base Register
Table 7-8. Translation Table Base Register
[Bit-field diagram, bits 31:0, not reproduced.] reset value: unpredictable
Bits    Access                               Description
31:14   Read / Write                         Translation Table Base: physical address of the base of the first-level table
13:0    Read-unpredictable / Write-as-Zero   Reserved
7.2.4 Register 3: Domain Access Control Register
Table 7-9. Domain Access Control Register
[Bit-field diagram, bits 31:0, not reproduced.] reset value: unpredictable
Bits    Access         Description
31:0    Read / Write   Access permissions for all 16 domains. The meaning of each field can be found in the ARM Architecture Reference Manual.
7.2.5 Register 4: Reserved
Register 4 is reserved. Reading and writing this register yields unpredictable results.
7.2.6 Register 5: Fault Status Register
The
167. bort has higher priority, so the processor first goes to the data abort handler. This data abort is placed into the trace buffer without losing any data. However, if another imprecise data abort is detected at the start of the data abort handler, it has higher priority than the trace buffer full break, so the processor goes back to the data abort handler. This second data abort also gets written into the trace buffer. This causes the trace buffer to wrap around, and one trace buffer entry is lost (the oldest entry is lost). Additional trace buffer entries can be lost if imprecise data aborts continue to be detected before the processor can handle the trace buffer full break (which turns off the trace buffer). This trace buffer overflow problem can be avoided by enabling vector traps on data aborts.
TXRXCTRL.RR prevents TX register from being updated, even if TXRXCTRL.TR is clear. This is to be fixed on B-step. The problem is that there is incorrect and unnecessary interaction between the RX ready (RR) flag and writing the TX register. The debug handler looks at the TX ready bit before writing to the TX register. If this bit is clear, then the handler should be able to write to the TX register. However, in the current implementation, even if the TR bit is clear, if the RR bit is set, TX is unchanged when the handler writes to it. It is OK to prevent a write to TX when the TR bit is set, since the host has not read the previous data in the TX and we don
168. bus transactions must occur in the same order as the requests were made. The delay between a request going out and the data coming back to (or being driven from) the bus master is arbitrary. No explicit wait-state insertion is needed.
All data on the 64-bit data bus is read or written in its natural location within an aligned 64-bit memory block. In little-endian mode on a 64-bit bus (big endian and 32-bit busses are covered later), bits 7:0 of D always correspond to a byte with low address bits 2:0 of 000, and bits 63:56 always correspond to a byte with low address bits of 111. As an example, a word (32-bit) read to address 0x24004 would need to be returned to the core with the most significant byte on bits 63:56 and the least significant byte on bits 39:32. For a byte write to location 0x3703, the valid data byte would be driven out on bits 31:24 of D, and only bit 3 of the BE would be asserted.
Each data transaction consists of one or more data cycles. Each data cycle begins with the assertion of DValid to indicate to the Intel 80200 processor that the next cycle of data is going to be transferred, followed two clock cycles later by the data transfer (read or write data) on the D, BE (for writes only), and DCB buses. All lines in D must be driven during a data transaction: on writes, the Intel 80200 processor will drive them; on reads, they must be driven by the addressed slave. If the read request was for less than a full bus width of
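The byte-lane rule for the little-endian 64-bit bus described above can be checked with a small calculation. This is a sketch under stated assumptions: the function names are hypothetical, accesses are assumed to be naturally aligned and 1, 2, 4, or 8 bytes wide, and an asserted byte enable is represented as a 1 bit regardless of the electrical polarity of the BE pins.

```c
#include <stdint.h>

/* Lowest bit position on D used by an access at this address:
 * the byte lane is the low three address bits. */
unsigned lane_low_bit(uint32_t addr)
{
    return (addr & 7u) * 8u;
}

/* One bit per byte lane touched by an aligned access of
 * size_bytes (1, 2, 4 or 8) at addr. */
uint8_t byte_enables(uint32_t addr, unsigned size_bytes)
{
    unsigned lane = addr & 7u;
    return (uint8_t)(((1u << size_bytes) - 1u) << lane);
}
```

Applied to the examples in the text, the word at 0x24004 occupies lanes 4 to 7 (bits 63:32, least significant byte at bits 39:32), and the byte at 0x3703 occupies lane 3 (bits 31:24) with only bit 3 of the enable mask set.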
169. by directly writing to them. When an event counter reaches its maximum value 0xFFFFFFFF, the next event it needs to count will cause it to roll over to zero and set the overflow flag (bit 8 or 9) in PMNC. An IRQ or FIQ interrupt will be reported if it is enabled via bit 4 or 5 in the PMNC register. Performance Monitor Count Register (PMN0 and PMN1): Event Counter, reset value unpredictable. Bits 31:0, Read/Write: 32-bit event counter. Reset to 0 by PMNC register. When an event counter reaches its maximum value 0xFFFFFFFF, the next event it needs to count causes it to roll over to zero and generate an IRQ or FIQ interrupt if enabled. Extending Count Duration Beyond 32 Bits: To increase the monitoring duration, software can extend the count duration beyond 32 bits by counting the number of overflow interrupts each 32-bit counter generates. This can be done in the interrupt service routine (ISR), where an increment to some memory location every time the interrupt occurs enables longer durations of performance monitoring. This does intrude upon program execution, but the effect is negligible, since the ISR execution time is in the order of tens of cycles compared to the number of cycles it took to generate an overflow interrupt.
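The overflow-counting technique described above can be sketched as follows. This is a minimal sketch under stated assumptions: the function and variable names are hypothetical, the ISR body shows only the software increment (clearing the overflow flag in PMNC and reading the hardware counter are target-specific and left out), and the race between reading the counter and taking an overflow interrupt is not handled.

```c
#include <stdint.h>

/* Number of overflow interrupts seen so far for one event counter.
 * Hypothetical name; the manual only says "some memory location". */
static volatile uint32_t pmn0_overflows;

/* Body of the overflow interrupt handler: one increment per interrupt. */
void pmn0_overflow_isr(void)
{
    pmn0_overflows++;
}

/* Combine the overflow count with the current 32-bit counter value
 * (passed in by the caller) into a 64-bit event count. */
uint64_t pmn0_extended(uint32_t pmn0_now)
{
    return ((uint64_t)pmn0_overflows << 32) | pmn0_now;
}
```

Each overflow represents 2^32 events, so two overflows followed by a counter reading of 5 give a total of 2 * 2^32 + 5 events.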
170. c As an example of cache blocking, consider the following code:
    for (i = 0; i < 10000; i++)
        for (j = 0; j < 10000; j++)
            for (k = 0; k < 10000; k++)
                C[j][k] += A[i][k] * B[j][i];
The variable A[i][k] is completely reused. However, accessing C[j][k] in the j and k loops can displace A[i][k] from the cache. Using blocking, the code becomes:
    for (i = 0; i < 10000; i++)
        for (j1 = 0; j1 < 100; j1++)
            for (k1 = 0; k1 < 100; k1++)
                for (j2 = 0; j2 < 100; j2++)
                    for (k2 = 0; k2 < 100; k2++)
                    {
                        j = j1 * 100 + j2;
                        k = k1 * 100 + k2;
                        C[j][k] += A[i][k] * B[j][i];
                    }
Prefetch Unrolling: When iterating through a loop, data transfer latency can be hidden by prefetching ahead one or more iterations. The solution incurs an unwanted side effect: the final iterations of a loop load useless data into the cache, polluting the cache, increasing bus traffic and possibly evicting valuable temporal data. This problem can be resolved by prefetch unrolling. For example, consider:
    for (i = 0; i < NMAX; i++)
    {
        prefetch(data[i + 2]);
        sum += data[i];
    }
The final two iterations (i = NMAX-2 and i = NMAX-1) prefetch superfluous data. The problem can be avoided by unrolling the end of the loop:
    for (i = 0; i < NMAX - 2; i++)
    {
        prefetch(data[i + 2]);
        sum += data[i];
    }
    sum += data[NMAX - 2];
    sum += data[NMAX - 1];
Unfortunately, prefetch loop unrolling does not work on loops with indeterminate iterations.
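A compilable version of the prefetch-unrolling idea may help when trying it out. The sketch below makes several assumptions that are not in the manual: `prefetch` is a hypothetical stand-in that only counts calls (on a real target it would issue a PLD or a compiler prefetch intrinsic), the element type is `int`, and arrays shorter than two elements are handled by the tail loop alone.

```c
#include <stddef.h>

/* Hypothetical prefetch hook: counts calls so the behaviour can be
 * observed on a host machine. */
static size_t prefetch_calls;

static void prefetch(const int *p)
{
    (void)p;
    prefetch_calls++;
}

/* Sum an array while prefetching two elements ahead. The last two
 * iterations are peeled off so no prefetch goes past the array end. */
long sum_unrolled(const int *data, size_t nmax)
{
    long sum = 0;
    size_t i = 0;

    if (nmax >= 2) {
        for (; i < nmax - 2; i++) {
            prefetch(&data[i + 2]);
            sum += data[i];
        }
    }
    for (; i < nmax; i++)   /* peeled tail: no prefetch */
        sum += data[i];
    return sum;
}
```

For a five-element array the main loop runs three times and issues three prefetches, all inside the array; the two tail elements are summed without any prefetch.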
171. can be moved before the ORR instruction to prevent this stall. Scheduling CP15 Coprocessor Instructions: The MRC instruction has an issue latency of 1 cycle and a result latency of 3 cycles. The MCR instruction has an issue latency of 1 cycle. Consider the code sample:
    add r1, r2, r3
    mrc p15, 0, r7, C1, C0, 0
    mov r0, r7
    add r1, r1, #1
The MOV instruction above would incur a 2-cycle latency due to the 3-cycle result latency of the mrc instruction. The code shown above can be rearranged as follows to avoid these stalls:
    mrc p15, 0, r7, C1, C0, 0
    add r1, r2, r3
    add r1, r1, #1
    mov r0, r7
Optimizing C Libraries: Many of the standard C library routines can benefit greatly by being optimized for the Intel 80200 processor architecture. The following string and memory manipulation routines should be tuned to obtain the best performance from the Intel 80200 processor architecture (instruction selection, cache usage and data prefetch): strcat, strchr, strcmp, strcoll, strcpy, strcspn, strlen, strncat, strncmp, strpbrk, strrchr, strspn, strstr, strtok, strxfrm, memchr, memcmp, memcpy, memmove, memset. Optimizations for Size: For applications such as cell phone software it is necessary to optimize the code for improved performance while minimizing code si
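The arithmetic behind the MRC example can be written down as a small helper. It is a simplified model, not a description of the pipeline: it assumes every intervening instruction has an issue latency of 1 and that no other stalls occur, and the function name is hypothetical.

```c
/* Stall cycles seen by an instruction that consumes a result when it
 * would otherwise issue `distance` cycles after the producing
 * instruction. A result latency of N means the first stall-free
 * consumer issues N cycles after the producer. */
static unsigned stall_cycles(unsigned result_latency, unsigned distance)
{
    return result_latency > distance ? result_latency - distance : 0u;
}
```

In the first code sample the MOV follows the MRC directly (distance 1), so the 3-cycle result latency leaves 2 stall cycles. In the rearranged sample two ADD instructions sit between them (distance 3), and the stall disappears.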
172. captured into the serial Boundary Scan register and the values are output to the TDO pin. The clock signal drawn at the top of the diagram is drawn as a stable symmetrical clock. This is not, in practice, the most common case. Instead, the clocking is usually done by a program writing to a port bit. The TMS and TDI signals are written by software, and then the software makes the clock go high. The software typically lowers the clock input quickly. The program can then read the TDO pin. Figure C-3. JTAG Example. Figure C-4. Timing Diagram Illustrating the Loading of Instruction Register: TCK TMS Controller St
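The software-driven clocking described above can be sketched as one function per TCK pulse. Everything about the port is an assumption made for illustration: `struct jtag_port`, its two callbacks, and the recording fake port are hypothetical and stand in for whatever parallel-port or GPIO access a real host program uses.

```c
#include <stddef.h>

/* Hypothetical port abstraction: one call writes TCK, TMS and TDI,
 * another samples TDO. */
struct jtag_port {
    void (*write_pins)(int tck, int tms, int tdi);
    int  (*read_tdo)(void);
};

/* One software-generated TCK pulse, in the order the text gives:
 * set TMS and TDI, raise the clock, lower it again, then read TDO. */
static int jtag_clock(const struct jtag_port *port, int tms, int tdi)
{
    port->write_pins(0, tms, tdi);  /* TMS/TDI set up while TCK is low */
    port->write_pins(1, tms, tdi);  /* rising edge of TCK */
    port->write_pins(0, tms, tdi);  /* falling edge of TCK */
    return port->read_tdo();
}

/* Recording fake port, so the sequence can be checked on a host. */
static int    pin_trace[16];
static size_t pin_trace_len;

static void fake_write_pins(int tck, int tms, int tdi)
{
    if (pin_trace_len < 16)
        pin_trace[pin_trace_len++] = (tck << 2) | (tms << 1) | tdi;
}

static int fake_read_tdo(void)
{
    return 1;  /* the fake port always reports TDO high */
}

static const struct jtag_port fake_port = { fake_write_pins, fake_read_tdo };
```

Calling `jtag_clock(&fake_port, 1, 0)` records three pin writes (TCK low, high, low with TMS held at 1) and returns the sampled TDO value.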
173. cate that the trace buffer has not wrapped around and that first non-zero entry is the start of the trace. If the oldest entry from the trace buffer is non-zero, then the trace buffer has either wrapped around or just filled up. Once the trace buffer has been read and parsed, the host SW should re-create the trace history from oldest trace buffer entry to latest. Trying to re-create the trace going backwards from the latest trace buffer entry may not work in most cases, because once a branch message is encountered, it may not be possible to determine the source of the branch. In fill-once mode, the return from the debug handler to the application should generate an indirect branch message. The address placed in the trace buffer is that of the target application instruction. Using this as a starting point, re-creating a trace going forward in time should be straightforward. In wrap-around mode, the host SW should use the checkpoint registers and address bytes from indirect branch entries to re-create the trace going forward. The drawback is that some of the oldest entries in the trace buffer may be untraceable, depending on where the earliest checkpoint (or indirect branch entry) is located. The best case is when the oldest entry in the trace buffer was checkpointed, so the entire trace buffer can be used to re-create the trace. The worst case is when the first checkpoint is in the middle of the trace buffer and no indirect branch messages exist before
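Before host SW can rebuild a trace it has to classify each message byte. The sketch below uses the message byte formats from the manual's trace message table (exception 0b0VVVCCCC, direct branch 0b1000CCCC, checkpointed direct 0b1100CCCC, indirect 0b1001CCCC, checkpointed indirect 0b1101CCCC, roll-over 0b11111111). Treating an all-zero byte as an unused entry is an assumption drawn from the "first non-zero entry" remark above, and all type and function names are hypothetical. The ordering of address bytes in the buffer is not modelled.

```c
#include <stdint.h>

enum trace_msg {
    TRACE_EMPTY,                 /* 0x00: assumed to be an unused entry */
    TRACE_EXCEPTION,             /* 0b0VVVCCCC */
    TRACE_DIRECT,                /* 0b1000CCCC */
    TRACE_DIRECT_CHECKPOINTED,   /* 0b1100CCCC */
    TRACE_INDIRECT,              /* 0b1001CCCC */
    TRACE_INDIRECT_CHECKPOINTED, /* 0b1101CCCC */
    TRACE_ROLL_OVER,             /* 0b11111111 */
    TRACE_UNKNOWN
};

static enum trace_msg classify_message_byte(uint8_t b)
{
    if (b == 0x00) return TRACE_EMPTY;
    if (b == 0xFF) return TRACE_ROLL_OVER;
    if ((b & 0x80) == 0) return TRACE_EXCEPTION;
    switch (b >> 4) {
    case 0x8: return TRACE_DIRECT;
    case 0xC: return TRACE_DIRECT_CHECKPOINTED;
    case 0x9: return TRACE_INDIRECT;
    case 0xD: return TRACE_INDIRECT_CHECKPOINTED;
    default:  return TRACE_UNKNOWN;
    }
}

/* Incremental word count (CCCC) carried in the low nibble. */
static unsigned message_count(uint8_t b)
{
    return b & 0x0Fu;
}

/* Address bytes that accompany a message: 4 for indirect branches,
 * 0 for every other message type. */
static unsigned address_bytes(enum trace_msg m)
{
    return (m == TRACE_INDIRECT || m == TRACE_INDIRECT_CHECKPOINTED) ? 4u : 0u;
}
```

A forward walk over the buffer would start at the oldest entry, skip entries classified as empty, and use the checkpointed and indirect entries as anchor points, as the text recommends.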
174. cess, load/store, semaphore, and coprocessor. The following section explains how to read these tables. Performance Terms. Issue Clock (cycle 0): The first cycle when an instruction is decoded and allowed to proceed to further stages in the execution pipeline (i.e., when the instruction is actually issued). Cycle Distance from A to B: The cycle distance from cycle A to cycle B is (B - A), that is, the number of cycles from the start of cycle A to the start of cycle B. Example: the cycle distance from cycle 3 to cycle 4 is one cycle. Issue Latency: The cycle distance from the first issue clock of the current instruction to the issue clock of the next instruction. The actual number of cycles can be influenced by cache misses, resource dependency stalls, and resource availability conflicts. Result Latency: The cycle distance from the first issue clock of the current instruction to the issue clock of the first instruction that can use the result without incurring a resource dependency stall. The actual number of cycles can be influenced by cache misses, resource dependency stalls, and resource availability conflicts. Minimum Issue Latency (without Branch Misprediction): The minimum cycle distance from the issue clock of the current instruction to the first possible issue clock of the next instruction, assuming best case conditions (i.e., that the issuing of the next instruction is not stalled due to a resource dependency stall; the next instr
175. 14.4.5 Saturated Arithmetic Instructions
Table 14-10. Saturated Data Processing Instruction Timings
    Mnemonic   Minimum Issue Latency   Minimum Result Latency
    QADD       1                       2
    QSUB       1                       2
    QDADD      1                       2
    QDSUB      1                       2
14.4.6 Status Register Access Instructions
Table 14-11. Status Register Access Instruction Timings
    Mnemonic   Minimum Issue Latency           Minimum Result Latency
    MRS        1                               2
    MSR        2 (6 if updating mode bits)     1
14.4.7 Load/Store Instructions
Table 14-12. Load and Store Instruction Timings
    Mnemonic   Minimum Issue Latency    Minimum Result Latency
    LDR        1                        3 for load data; 1 for writeback of base
    LDRB       1                        3 for load data; 1 for writeback of base
    LDRBT      1                        3 for load data; 1 for writeback of base
    LDRD       1 (+1 if Rd is R12)      3 for Rd; 4 for Rd+1; 2 for writeback of base
    LDRH       1                        3 for load data; 1 for writeback of base
    LDRSB      1                        3 for load data; 1 for writeback of base
    LDRSH      1                        3 for load data; 1 for writeback of base
    LDRT       1                        3 for load data; 1 for writeback of base
    PLD        1                        N/A
    STR        1                        1 for writeback of base
    STRB       1                        1 for writeback of base
    STRBT      1                        1 for writeback of base
    STRD       2                        1 for writeback of base
    STRH       1                        1 for writeback of base
    STRT       1                        1 for writeback of base
Table 14-13. Load and Store Multiple Instruction Timings: Mnemonic, Minimum Issue Latency, M
176. ch Predicted by the BTB; Misprediction
    B      1    5
    BL     1    5
Table 14-5. Branch Instruction Timings (Those not predicted by the BTB)
    Mnemonic                                   Minimum Issue Latency          Minimum Issue Latency
                                               when the branch is not taken   when the branch is taken
    BLX(1)                                     N/A                            5
    BLX(2)                                     1                              5
    BX                                         1                              5
    Data Processing Instruction with
      PC as the destination                    Same as Table 14-6             4 + numbers in Table 14-6
    LDR PC,<>                                  2                              8
    LDM with PC in register list               3 + numreg (1)                 10 + max(0, numreg - 3)
    (1) numreg is the number of registers in the register list, including the PC.
14.4.3 Data Processing Instruction Timings
Table 14-6. Data Processing Instruction Timings
               <shifter operand> is NOT a            <shifter operand> is a Shift/Rotate by
               Shift/Rotate by Register              Register OR <shifter operand> is RRX
    Mnemonic   Min Issue    Min Result (1)           Min Issue    Min Result (1)
    ADC        1            1                        2            2
    ADD        1            1                        2            2
    AND        1            1                        2            2
    BIC        1            1                        2            2
    CMN        1            1                        2            2
    CMP        1            1                        2            2
    EOR        1            1                        2            2
    MOV        1            1                        2            2
    MVN        1            1                        2            2
    ORR        1            1                        2            2
    RSB        1            1                        2            2
    RSC        1            1                        2            2
    SBC        1            1                        2            2
    SUB        1            1                        2            2
    TEQ        1            1                        2            2
    TST        1            1                        2            2
    (1) If the next instruction needs to use the result of the data processing for a shift by immediate or as Rn in a QDADD or QDSUB, one extra cycle of result latency is added to the number listed.
177. Tables: B-1 Pipelines and Pipe stages; C-1 TAP Controller Pin Definitions; C-2 JTAG Instruction Set; C-3 IEEE Instructions; C-4 JTAG ID Register. Introduction. High-Level Overview: The Intel 80200 processor based on Intel XScale microarchitecture is the next generation in the Intel StrongARM processor family (compliant with ARM Architecture V5TE). It is designed for high performance and low power, leading the industry in mW/MIPs. The Intel 80200 processor integrates a bus controller and an interrupt controller around a core processor, with intended embedded markets such as handheld devices, networking, remote access servers, etc. This technology is ideal for internet infrastructure products such as network and I/O processors, where ultimate performance is critical for moving and processing large amounts of data quickly. The Intel 80200 processor incorporates an extensive list of architecture features that allows it to achieve high performance. This rich
178. ction is in the JTAG IR (see Section 13.11.3, DBGTX JTAG Command), the host polls DBG_SR[0], waiting for the debug handler to set it. When the debug handler gets to the point where it is OK to begin the code download, it writes to TX, which automatically sets DBG_SR[0]. This signals the host that it is OK to begin the download. The debug handler then begins polling TXRXCTRL[31], waiting for the host to clear it through the DBGRX JTAG register (to indicate the download is complete). The host writes LDIC to the JTAG IR and downloads the code. For each line downloaded, the host must invalidate the target line before downloading code to that line. Failure to invalidate a line prior to writing it may cause unpredictable operation by the processor. When the host completes its download, the host must wait a minimum of 15 TCKs, then switch the JTAG IR to DBGRX and complete the handshaking by scanning in a value that sets DBG_SR[35]. This clears TXRXCTRL[31] and allows the debug handler code to exit the polling loop. The data scanned into DBG_SR[34:3] is implementation specific. After the handler exits the polling loop, it branches to the downloaded code. Note that this debug handler stub must reside in the instruction cache and execute out of the cache while doing the synchronization. The processor should not be doing any code fetches to external memory while code is being downloaded.
179. ction to execute + 4 (for Data Aborts); R14_abt = PC of the faulting instruction + 4 (for Prefetch Aborts); SPSR_abt = CPSR; CPSR[4:0] = 0b10111 (ABORT mode); CPSR[5] = 0; CPSR[6] = unchanged; CPSR[7] = 1; PC = 0xc (for Prefetch Aborts), PC = 0x10 (for Data Aborts). During abort mode, external debug breaks and trace buffer full breaks are internally pended. When the processor exits abort mode, either through a CPSR restore or a write directly to the CPSR, the pended debug breaks immediately generate a debug exception. Any pending debug breaks are cleared out when any type of debug exception occurs. When exiting, the debug handler should do a CPSR restore operation that branches to the next instruction to be executed in the program under debug. HW Breakpoint Resources: The Intel 80200 processor debug architecture defines two instruction and two data breakpoint registers, denoted IBCR0, IBCR1, DBR0, and DBR1. The instruction and data address breakpoint registers are 32-bit registers. The instruction breakpoint causes a break before execution of the target instruction. The data breakpoint causes a break after the memory access has been issued. In this section, Modified Virtual Address (MVA) refers to the virtual address ORed with the PID. Refer to section XXX for more details o
180. ctions. A register dependency occurs when a previous MAC or load instruction is about to modify a register value that has not been returned to the register file and the current instruction needs access to the same register. Only the destinations of MAC operations and memory loads are scoreboarded. The destinations of ALU instructions are not scoreboarded. If no register dependencies exist, the pipeline is not stalled. For example, if a load operation has missed the data cache, subsequent instructions that do not depend on the load may complete independently. Use of Bypassing: The Intel 80200 processor pipeline makes extensive use of bypassing to minimize data hazards. Bypassing allows results forwarding from multiple sources, eliminating the need to stall the pipeline. Instruction Flow Through the Pipeline: The Intel 80200 processor pipeline issues a single instruction per clock cycle. Instruction execution begins at the F1 pipestage and completes at the WB pipestage. Although a single instruction may be issued per clock cycle, all three pipelines (MAC, memory, and main execution) may be processing instructions simultaneously. If there are no data hazards, then each instruction may complete independently of the others. Each pipestage takes a single clock cycle or machine cycle to
181. current address to read out the tag and then compares this tag to bits [31:9,1] of the current instruction address. If the current instruction address matches the tag in the cache and the history bits indicate that this branch has usually been taken in the past, the BTB uses the data (target address) as the next instruction address to send to the instruction cache. Bit[1] of the instruction address is included in the tag comparison in order to support Thumb execution. This organization means that two consecutive Thumb branch (B) instructions, with instruction address bits[8:2] the same, contend for the same BTB entry. Thumb also requires 31 bits for the branch target address. In ARM mode, bit[1] is zero. The history bits represent four possible prediction states for a branch entry in the BTB. Figure 5-2, Branch History, on page 5-2 shows these states along with the possible transitions. The initial state for branches stored in the BTB is Weakly-Taken (WT). Every time a branch that exists in the BTB is executed, the history bits are updated to reflect the latest outcome of the branch, either taken or not-taken. Chapter 14, Performance Considerations, describes which instructions are dynamically predicted by the BTB and the performance penalty for mispredicting a branch. The BTB does not have to be managed explicitly by software; it is disabled by default after reset and is invalidated when the instruction cache is invalidated.
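The address split described above can be illustrated with two helpers. The entry index taken from bits[8:2] follows from the remark about contending Thumb branches; how the tag bits are packed into one word is purely an assumption of this sketch, and the function names are hypothetical.

```c
#include <stdint.h>

/* BTB entry selected by instruction address bits [8:2]. */
static unsigned btb_index(uint32_t addr)
{
    return (addr >> 2) & 0x7Fu;
}

/* Tag built from address bits [31:9] and bit [1]. The packing
 * (bit 1 placed in the least significant position) is arbitrary. */
static uint32_t btb_tag(uint32_t addr)
{
    return ((addr >> 9) << 1) | ((addr >> 1) & 1u);
}
```

Two consecutive Thumb branches at 0x8004 and 0x8006 share index 1 but differ in tag (bit 1 of the address), so they compete for one entry, as the text states.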
182. Coprocessor access register: reset value 0x0000,0000.
    Bits     Access                               Description
    31:16    Read-unpredictable / Write-as-Zero   Reserved. Program to zero for future compatibility.
    15:14    Read-as-Zero / Write-as-Zero         Reserved. Program to zero for future compatibility.
    13:0     Read / Write                         Coprocessor Access Rights. Each bit in this field corresponds to the access rights for each coprocessor. For each bit: 0 = Access denied; attempts to access the corresponding coprocessor generate an undefined exception. 1 = Access allowed; includes read and write accesses. Setting any of bits 12:1 has an undefined effect.
7.3 CP14 Registers: Table 7-21 lists the CP14 registers implemented in the Intel 80200 processor.
Table 7-21. CP14 Registers
    Register (CRn)   Access          Description
    0-3              Read / Write    Performance Monitoring Registers
    4-5              Unpredictable   Reserved
    6-7              Read / Write    Clock and Power Management
    8-15             Read / Write    Software Debug
7.3.1 Registers 0-3: Performance Monitoring. The performance monitoring unit contains a control register (PMNC), a clock counter (CCNT), and two event counters (PMN0 and PMN1). The format of these registers can be found in Chapter 12, Performance Monitoring, along with a description on how to use the performance monitoring facilit
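Because setting bits 12:1 has an undefined effect and bits 31:14 are to be written as zero, software may want to check a value before writing the access rights field. A minimal sketch follows; the macro and function names are hypothetical and only the bit positions come from the table above.

```c
#include <stdint.h>

/* Hypothetical names for the two access-rights bits that may be set. */
#define CPAR_CP0   (1u << 0)
#define CPAR_CP13  (1u << 13)

/* Returns 1 when `value` sets only bits 13 and/or 0, so that bits 12:1
 * and the reserved bits 31:14 stay zero; returns 0 otherwise. */
static int cpar_value_is_valid(uint32_t value)
{
    return (value & ~(CPAR_CP0 | CPAR_CP13)) == 0;
}
```

For example, `CPAR_CP0 | CPAR_CP13` (0x2001) passes the check, while 0x0002 is rejected because it sets one of bits 12:1.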
183. debug solutions you may be utilizing. Reset Sequence: The output pin RESETOUT is asserted when RESET is asserted. The deassertion of RESET triggers an internal reset timer that keeps the Intel 80200 processor in a reset state until the PLL has locked. CCLK is not running during this reset state, and RESETOUT remains asserted. When the PLL is locked and stable, RESETOUT is deasserted, informing the system that the Intel 80200 processor has completed reset and CCLK is running. The time from the deassertion of RESET to the deassertion of RESETOUT is approximately two thousand CLK cycles. MCLK (input memory clock) must be stable at a minimum of 10 CLK cycles before the deassertion of RESET. After RESETOUT# is deasserted, MCLK need only be present when the Intel 80200 processor external bus is active. [Figure: Reset Sequence, showing CCLK, MCLK, RESET and RESETOUT over cycles 0 to 2000.] 8.2.2 Reset Effect on Outputs: After RESETOUT# is asserted, the processor's output pins are driven to a well-defined state. Critical bus signals receive a 0 or 1 value as shown in Figure 8-2. This figure also illustrates that HOLD is acknowledged during the reset sequence. Outpu
184. debugger turns off Halt mode through JTAG, either by scanning in a new DCSR value or by a TRST. Processor reset does not affect the value of the Halt mode bit. When Halt mode is active, the processor uses the reset vector as the debug vector. The debug handler and exception vectors can be downloaded directly into the instruction cache to intercept the default vectors and reset handler, or they can be resident in external memory. Downloading into the instruction cache allows a system with memory problems, or no external memory, to be debugged. Refer to Section 13.14, Downloading Code in the ICache, on page 13-34 for details about downloading code into the instruction cache. During Halt mode, software running on the Intel 80200 processor cannot access DCSR or any of the hardware breakpoint registers, unless the processor is in Special Debug State (SDS), described below. When a debug exception occurs during Halt mode, the processor takes the following actions: disables the trace buffer; sets DCSR.moe encoding; processor enters a Special Debug State (SDS); for data breakpoints, trace buffer full break, and external debug break: R14_dbg = PC of the next instruction to execute + 4; for instruction breakpoints and software breakpoints and vector traps: R14_dbg = PC of the aborted instruction + 4; SPSR_dbg = CPSR; CPSR[4
185. dictable / Write-as-Zero: Reserved. Bits 1:0, Read / Write: Mode (M). 0 = ACTIVE, 1 = IDLE, 2 = RESERVED, 3 = SLEEP. Software can change the core clock frequency by writing to the CP14 register 6, CCLKCFG. This function waits for all the Intel 80200 processor initiated memory requests to complete and informs the PLL to change the core clock frequency. This function completes when the PLL is re-locked. Software can read CCLKCFG to determine current operating frequency.
Clock and Power Management
    Function         Data             Instruction
    Go to IDLE       1                MCR p14, 0, Rd, c7, c0, 0
    Go to SLEEP      3                MCR p14, 0, Rd, c7, c0, 0
    Read CCLKCFG     ignored          MRC p14, 0, Rd, c6, c0, 0
    Write CCLKCFG    CCLKCFG value    MCR p14, 0, Rd, c6, c0, 0
CCLKCFG Register: reset value unpredictable.
    Bits    Access                               Description
    31:4    Read-unpredictable / Write-as-Zero   Reserved
    3:0     Read / Write                         Core Clock Configuration (CCLKCFG). This field is used to configure the core clock frequency. The value in this field is multiplied by REFCLK to obtain core clock. See Table 8-2.
7.3.4 Registers 8-15: Software Debug. Software debug is supported by address breakpoint registers (Coprocessor 15
186. different MVA already exists at this location, it will be evicted. The 32 bytes of data associated with the newly allocated line are not initialized and therefore generate unpredictable results if read. This command may be used for cleaning the entire data cache on a context switch and also when re-configuring portions of the data cache as data RAM. In both cases, Rd is a virtual address that maps to some non-existent physical memory. When creating data RAM, software must initialize the data RAM before read accesses can occur. Specific uses of these commands can be found in Chapter 6, Data Cache. Other items to note about the line-allocate command are: It forces all pending memory operations to complete. Bits 31:5 of Rd are used to specify the virtual address of the line to be allocated into the data cache. If the targeted cache line is already resident, this command has no effect. This command cannot be used to allocate a line in the mini Data Cache. The newly allocated line is not marked as dirty, so it never gets evicted. However, if a valid store is made to that line, it is marked as dirty and gets written back to external memory if another line is allocated to the same cache location. This eviction produces unpredictable results if the line-allocate command used a virtual address that mapped
187. dirty bit associated with it is set. The replacement policy is a round-robin algorithm, and the cache also supports the ability to reconfigure each line as data RAM. Figure 6-1, Data Cache Organization, on page 6-2 shows the cache organization and how the data address is used to access the cache. Cache policies may be adjusted for particular regions of memory by altering page attribute bits in the MMU descriptor that controls that memory. See Section 3.2.2 for a description of these bits. The data cache is virtually addressed and virtually tagged. It supports write-back and write-through caching policies. The data cache always allocates a line in the cache when a cacheable read miss occurs, and allocates a line into the cache on a cacheable write miss when write allocate is specified by its page attribute. Page attribute bits determine whether a line gets allocated into the data cache or mini-data cache. Figure 6-1. Data Cache Organization (this example shows Set 0 being selected by the set index; labelled elements: Set Index, Tag, Word Select, Content Addressable Memory, Byte Alignment, Sign Extension, Byte Select, Data Word (4 bytes to Destination Register), Data Address (Virtual) 31 10
188. disabled. To allow compatibility with older system software, the new Intel 80200 processor attributes take advantage of encoding space in the descriptors that was formerly reserved. Page (P) Attribute Bit: The P bit specifies that the associated memory should be protected with ECC. The P bit is only present in the first level descriptors. Thus, ECC memory is specified with a 1 megabyte granularity. If the MMU is disabled, ECC is disabled for all memory accesses. If the MMU is enabled, ECC is enabled for a region of memory if its P bit in the first-level descriptor for that virtual memory is set and the BCU has ECC enabled (see Chapter 11, Bus Controller). Accesses to memory for page walks do not use the MMU. For these accesses, ECC is enabled if the CP15 Auxiliary Control Register enables it (see Section 7.2.2, Register 1: Control and Auxiliary Control Registers, on page 7-7) and the BCU has ECC enabled (see Chapter 11, Bus Controller). Cacheable (C), Bufferable (B), and eXtension (X) Bits. Instruction Cache: When examining these bits in a descriptor, the Instruction Cache only utilizes the C bit. If the C bit is clear, the Instruction Cache considers a code fetch from that memory to be non-cacheable, and does not fill a cache entry. If the C bit is set, then fetches from the associated memory region are cached.
189. disconnected. 13.16 Software Debug Notes/Errata. Trace buffer message count value on data aborts: LDR to non-PC that aborts gets counted in the exception message. But an LDR to the PC that aborts does not get counted on exception message. SW Note on data abort generation in Special Debug State: 1) Avoid code that could generate precise data aborts. 2) If this cannot be done, then the handler needs to be written such that a memory access is followed by 1 nops. In this case, certain memory operations must be avoided: LDM, STM, STRD, LDC, SWP. Data abort on Special Debug State: When write-back is on for a memory access that causes a data abort, the base register is updated with the write-back value. This is inconsistent with normal (non-SDS) behavior, where the base remains unchanged if write-back is on and a data abort occurs. Trace Buffer wraps around and loses data in Halt Mode when configured for fill-once mode: It is possible to overflow (and lose) data from the trace buffer in fill-once mode, in Halt Mode. When the trace buffer fills up, it has space for 1 indirect branch message (5 bytes) and 1 exception message (1 byte). If the trace buffer fills up with an indirect branch message and generates a trace buffer full break at the same time as a data abort occurs, the data a
190. does not alter the contents of registers ELOGn and ECARn. This ensures that software can use these registers until it clears BCUCTL.En. These bits may be used by an ISR to quickly determine how many errors have occurred and to locate the error information in other registers (ELOGx and ECARx). To clear the BCU's interrupt, an ISR must ensure that all of these bits are cleared. Pseudocode for a typical ISR that handles BCU Errors is shown in Example 11-3.
Example 11-3. Handling BCU Errors
    if (BCUCTL.EV)
    {
        /* more errors occurred that couldn't be logged; react to them here */
        BCUCTL.EV = 1;    /* clear the EV bit; error is handled */
    }
    if (BCUCTL.E1)
    {
        /* react to error here */
        BCUCTL.E1 = 1;    /* clear the E1 bit; error is handled */
    }
    if (BCUCTL.E0)
    {
        /* react to error here */
        BCUCTL.E0 = 1;    /* clear the E0 bit; error is handled */
    }
It is possible for more errors to occur while clearing the existing errors. In this case, the ISR is reinvoked as soon as BCU interrupts are unmasked. The BCU Control Register (BCUCTL) allows software to view and control the behavior of the BCU. Table 11-3. BCUMOD Register: reset value: all implemented bits are 0. Bits, Access, Description: bit 0, Read-unpredictable / Write: AF (Aligned Fetch). 0 = On a 32-byte read, BCU requests an aligned block. 1 = BCU can request 32-byte reads on any 4-byte aligned address.
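The write-1-to-clear pattern in Example 11-3 can be exercised on a host with a simulated register. This is a sketch under loud assumptions: the bit positions chosen for E0, E1 and EV are placeholders (the snippet does not give them), the register is a plain variable rather than a coprocessor access, and all names are hypothetical.

```c
#include <stdint.h>

/* Placeholder bit positions, NOT taken from the manual. */
#define BCU_E0  (1u << 0)
#define BCU_E1  (1u << 1)
#define BCU_EV  (1u << 2)
#define BCU_ERR_MASK (BCU_E0 | BCU_E1 | BCU_EV)

/* Simulated BCUCTL error bits. */
static uint32_t bcuctl_sim;

static uint32_t bcuctl_read(void)
{
    return bcuctl_sim;
}

/* Writing a 1 to an error bit clears it; writing 0 leaves it alone. */
static void bcuctl_write(uint32_t value)
{
    bcuctl_sim &= ~(value & BCU_ERR_MASK);
}

/* Same order as Example 11-3: EV first, then E1, then E0.
 * Returns the number of error conditions that were handled. */
static unsigned bcu_error_isr(void)
{
    unsigned handled = 0;

    if (bcuctl_read() & BCU_EV) { handled++; bcuctl_write(BCU_EV); }
    if (bcuctl_read() & BCU_E1) { handled++; bcuctl_write(BCU_E1); }
    if (bcuctl_read() & BCU_E0) { handled++; bcuctl_write(BCU_E0); }
    return handled;
}
```

If a new error were latched while the routine runs, the corresponding bit would still be set on exit and the ISR would be entered again, which matches the reinvocation behaviour described in the text.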
191. ds. Then the debugger executes the dynamic command, using the address of the function to identify which function to execute. This method has many of the same advantages as downloading into the main instruction cache. Depending on the memory system, this method could be much slower than downloading directly into the instruction cache. Another problem is that the application may write to the memory where the function is downloaded. If it can be guaranteed that the application does not modify the downloaded dynamic function, the debug handler can save the time it takes to re-download the code. Otherwise, to ensure the application does not corrupt the dynamic functions, the debugger should re-download any dynamic functions it uses. For all three methods, the downloaded code executes in the context of the debug handler. The processor is in Special Debug State, so all of the special functionality applies. The downloaded functions may also require some common routines from the static debug handler, such as the polling routines for reading RX or writing TX. To simplify the dynamic functions, the debug handler should define a set of registers to contain the addresses of the most commonly used routines. The dynamic functions can then access these routines using indirect branches (BLX). This helps reduce the amount of code in the dynamic function, since common routines do not need to be replicated within each dynamic function.
192. e locked into the cache, along with how the round-robin pointer is affected. Figure 6-3. Locked Line Effect on Round Robin Replacement: set 0: 8 ways locked, 24 ways available for round robin replacement; set 1: 23 ways locked, 9 ways available for round robin replacement; set 2: 28 ways locked, only ways 28-31 available for replacement; set 31: all 32 ways available for round robin replacement. Software can lock down data located at different memory locations. This may cause some sets to have more locked lines than others, as shown in Figure 6-3. Lines are unlocked in the data cache by performing an unlock operation. See Section 7.2.10, Register 9: Cache Lock Down, on page 7-14 for more information about locking and unlocking the data cache. Before locking, the programmer must ensure that no part of the target data range is already resident in the cache. The Intel 80200 processor does not refetch such data, which results in it not being locked into the cache. If there is any doubt as to the location of the targeted memory data, the cache should be cleaned and invalidated to prevent this scenario. If the cache contains a locked region which the programmer wishes to lock again, then the cache must be unlocked before being cleaned and invalidated.
193. e Intel 80200 processor architecture provides the ability to execute instructions conditionally. This feature, combined with the ability of the Intel 80200 processor instructions to modify the condition codes, makes possible a wide array of optimizations. B.3.1.1 Optimizing Condition Checks. Intel 80200 processor instructions can selectively modify the state of the condition codes. When generating code for if-else and loop conditions, it is often beneficial to make use of this feature to set condition codes, thereby eliminating the need for a subsequent compare instruction. Consider the C code segment:
    if (a + b)
Code generated for the if condition without using an add instruction to set condition codes is:
    ; Assume r0 contains the value a, and r1 contains the value b
    add r0, r0, r1
    cmp r0, #0
However, the code can be optimized as follows, making use of the add instruction to set the condition codes:
    ; Assume r0 contains the value a, and r1 contains the value b
    adds r0, r0, r1
The instructions that increment or decrement the loop counter can also be used to modify the condition codes. This eliminates the need for a subsequent compare instruction. A conditional branch instruction can then be used to exit or continue with the next loop iteration. Consider the following C code segment:
    for (i = 10; i != 0; i--)
    {
        do something;
    }
The optimized code generated for the above code segment would look like:
    L6:
        ; do something
        subs r3, r3, #1
        bne  L6
It is also ben
194. e buffer. Both DCSR.e and DCSR.ge must be set to enable the trace buffer. The processor automatically clears this bit to disable the trace buffer when a debug exception occurs. For more details on the trace buffer, refer to Section 13.12, Trace Buffer. 13.5 Debug Exceptions. A debug exception causes the processor to re-direct execution to a debug event handling routine. The Intel 80200 processor debug architecture defines the following debug exceptions: instruction breakpoint, data breakpoint, software breakpoint, external debug break, exception vector trap, and trace-buffer full break. When a debug exception occurs, the processor's actions depend on whether the debug unit is configured for Halt mode or Monitor mode. Table 13-2 shows the priority of debug exceptions relative to other processor exceptions. Table 13-2. Event Priority (events in table order, beginning with Reset): Reset; Vector Trap; data abort (precise); data bkpt; data abort (imprecise); external debug break, trace-buffer full; FIQ; IRQ; instruction breakpoint; pre-fetch abort; undef, SWI, software Bkpt. 13.5.1 Halt Mode. The debugger turns on Halt mode through the JTAG interface by scanning in a value that sets the bit in DCSR. The
195. e instruction register retains its state. The controller remains in this state as long as TMS is held low. When TMS goes high on the rising edge of TCK, the controller moves to the Exit2-IR state. Exit2-IR State: This is a temporary state. If TMS is held high on the rising edge of TCK, the controller enters the Update-IR state, which terminates the scanning process. If TMS is held low on the rising edge of TCK, the controller enters the Shift-IR state. The test data register selected by the current instruction retains its previous value during this state. The instruction does not change and the instruction register retains its state. Update-IR State: The instruction shifted into the instruction register is latched onto the parallel output from the shift-register path on the falling edge of TCK. Once latched, the new instruction becomes the current instruction. Test data registers selected by the current instruction retain their previous values. If TMS is held high on the rising edge of TCK, the controller enters the Select-DR-Scan state. If TMS is held low on the rising edge of TCK, the controller enters the Run-Test/Idle state. C.2.5.17 Boundary-Scan Example. In the example that follows, two command actions are described. The example starts in the reset state; a new instruction is loaded and
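The TMS-driven transitions described above can be sketched as a small next-state function in C. The enum and function names are hypothetical, only the three instruction-register states covered by this excerpt are modeled, and the first state is assumed to be Pause-IR, as in the standard IEEE 1149.1 TAP controller.

```c
#include <assert.h>

typedef enum {
    TAP_RUN_TEST_IDLE,
    TAP_SELECT_DR_SCAN,
    TAP_SHIFT_IR,
    TAP_PAUSE_IR,
    TAP_EXIT2_IR,
    TAP_UPDATE_IR
} tap_state;

/* Next state on a rising edge of TCK for the states described in the
 * excerpt. States outside this sketch are returned unchanged. */
tap_state tap_next(tap_state s, int tms)
{
    assert(tms == 0 || tms == 1);
    switch (s) {
    case TAP_PAUSE_IR:  return tms ? TAP_EXIT2_IR       : TAP_PAUSE_IR;
    case TAP_EXIT2_IR:  return tms ? TAP_UPDATE_IR      : TAP_SHIFT_IR;
    case TAP_UPDATE_IR: return tms ? TAP_SELECT_DR_SCAN : TAP_RUN_TEST_IDLE;
    default:            return s; /* not modeled here */
    }
}
```

Holding TMS high from Pause-IR therefore walks through Exit2-IR and Update-IR in two TCK edges, which is the path that terminates an instruction scan.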
196. e minimum cycle distance from the issue clock of the current multiply instruction to the issue clock of the next multiply instruction, assuming the second multiply does not incur a data dependency and is immediately available from the instruction cache or memory interface. For the following code fragment, here is an example of computing latencies. Example 14-1. Computing Latencies:
    UMLAL r6, r8, r0, r1
    ADD   r9, r10, r11
    SUB   r2, r8, r9
    MOV   r0, r1
Table 14-3 shows how to calculate Issue Latency and Result Latency for each instruction. Looking at the issue column, the UMLAL instruction starts to issue on cycle 0 and the next instruction, ADD, issues on cycle 2, so the Issue Latency for UMLAL is two. From the code fragment, there is a result dependency between the UMLAL instruction and the SUB instruction. In Table 14-3, UMLAL starts to issue at cycle 0 and the SUB issues at cycle 5, thus the Result Latency is five. Table 14-3. Latency Example (columns: Cycle, Issue, Executing; entries as extracted): umlal (1st cycle), umlal (2nd cycle), add, umlal, umlal, sub (stalled), umlal & add, sub (stalled), umlal, sub, mov, sub, mov. 14.4.2 Branch Instruction Timings. Table 14-4. Branch Instruction Timings (Those Predicted by the BTB): Mnemonic, Minimum Issue Latency when Correctly, Minimum Issue Latency with Bran
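The two latency figures quoted above are simple differences between issue cycles, which a short C sketch makes explicit. The function names are hypothetical; the cycle numbers used in the usage note are the ones stated in the text (UMLAL issues at cycle 0, ADD at cycle 2, SUB at cycle 5).

```c
#include <assert.h>

/* Issue latency: cycles from the issue of an instruction to the issue
 * of the instruction that follows it. */
int issue_latency(int issue_cycle, int next_issue_cycle)
{
    assert(next_issue_cycle >= issue_cycle);
    return next_issue_cycle - issue_cycle;
}

/* Result latency: cycles from the issue of the producing instruction to
 * the issue of the first instruction that depends on its result. */
int result_latency(int producer_issue_cycle, int consumer_issue_cycle)
{
    assert(consumer_issue_cycle >= producer_issue_cycle);
    return consumer_issue_cycle - producer_issue_cycle;
}
```

With the cycles from the example, `issue_latency(0, 2)` gives the UMLAL issue latency of two and `result_latency(0, 5)` gives the UMLAL-to-SUB result latency of five.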
197. e various attributes with regions of memory: cacheable, bufferable, line allocate policy, write policy, I/O, mini Data Cache, Coalescing, ECC Protected. See Section 3.2.2, Memory Attributes, on page 3-2 for a description of page attributes, and Section 2.3.2, New Page Attributes, on page 2-9 to find out where these attributes have been mapped in the MMU descriptors. The virtual address with which the TLBs are accessed may be remapped by the PID register. See Section 7.2.13, Register 13: Process ID, on page 7-16 for a description of the PID register. 3.2 Architecture Model. 3.2.1 Version 4 vs. Version 5. ARM MMU Version 5 Architecture introduces the support of tiny pages, which are 1 KByte in size. The reserved field in the first-level descriptor (encoding 0b11) is used as the fine page table base address. The exact bit fields and the format of the first- and second-level descriptors can be found in Section 2.3.2, New Page Attributes, on page 2-9. 3.2.2 Memory Attributes. The attributes associated with a particular region of memory are configured in the memory management page table and control the behavior of accesses to the instruction cache, data cache, mini-data cache, and the write buffer. These attributes are ignored when the MMU is
198. ead Write. 11: Y, Y, Write Back, Allocate. a. Normally, bufferable writes can coalesce with previously buffered data in the same address range. b. See Section 7.2.2 for a description of this register. 3.2.2.5 Details on Data Cache and Write Buffer Behavior. If the MMU is disabled, all data accesses are non-cacheable and non-bufferable. This is the same behavior as when the MMU is enabled and a data access uses a descriptor with X, C, and B all set to 0. The X, C, and B bits determine when the processor should place new data into the Data Cache. The cache places data into the cache in lines (also called blocks). Thus, the basis for making a decision about placing new data into the cache is called a Line Allocation Policy. If the Line Allocation Policy is read-allocate, all load operations that miss the cache request a 32-byte cache line from external memory and allocate it into either the data cache or mini-data cache (this is assuming the cache is enabled). Store operations that miss the cache do not cause a line to be allocated. If read/write-allocate is in effect, load or store operations that miss the cache request a 32-byte cache line from external memory if the cache is enabled. The other policy determined by the X, C, and B bits is the Write Policy
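The two line allocation policies described above reduce to a small decision function. The sketch below is illustrative only: the enum and function names are hypothetical, and it assumes the access has already missed in a region that is marked cacheable.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { READ_ALLOCATE, READ_WRITE_ALLOCATE } line_alloc_policy;
typedef enum { ACCESS_LOAD, ACCESS_STORE } access_kind;

/* Returns true when a cache miss causes a 32-byte line to be fetched
 * and allocated, following the policy text above. Assumes the access
 * targets a cacheable region. */
bool miss_allocates_line(line_alloc_policy policy, access_kind access,
                         bool cache_enabled)
{
    assert(policy == READ_ALLOCATE || policy == READ_WRITE_ALLOCATE);
    if (!cache_enabled)
        return false;                   /* nothing is allocated */
    if (access == ACCESS_LOAD)
        return true;                    /* loads allocate under both policies */
    return policy == READ_WRITE_ALLOCATE; /* stores only under read/write-allocate */
}
```

The only case that differs between the two policies is a store miss, which allocates a line under read/write-allocate and does not under read-allocate.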
199. ebug handler. Using the download flag, the debug handler loops until the debugger clears the flag. Therefore, when doing a high-speed download, for each data word downloaded, the debugger should set the D bit. 13.8.4 TX Register Ready Bit (TR). The debugger and debug handler use the TR bit to synchronize accesses to the TX register. The debugger and debug handler must poll the TR bit before accessing the TX register. Table 13-9 shows the handshaking used to access the TX register. Table 13-9. TX Handshaking. Debugger Actions: Debugger is expecting data from the debug handler. Before reading data from the TX register, the debugger polls the TR bit through JTAG until the bit is set. NOTE: while polling TR, the debugger must scan out the TR bit and the TX register data. Reading a 1 from the TR bit indicates that the TX data scanned out is valid. The action of scanning out data when the TR bit is set automatically clears TR. Debug Handler Actions: Debug handler wants to send data to the debugger (in response to a previous request). The debug handler polls the TR bit to determine when the TX register is empty (any previous data has been read out by the debugger). The handler polls the TR bit until it is clear. Once the TR bit is clear, the debug handler writes ne
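The TX handshake in Table 13-9 can be modeled in C with one data word and one flag. This is a software model only, not driver code: the struct and function names are hypothetical, and it relies on the behavior stated elsewhere in the manual that a write to TX sets TR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t tx; /* TX register contents                  */
    bool     tr; /* TR bit: set while TX holds unread data */
} tx_channel;

/* Debug handler side: write only when TR is clear. Returns false when
 * the handler must keep polling because the previous word is unread. */
bool handler_try_write(tx_channel *ch, uint32_t word)
{
    assert(ch != 0);
    if (ch->tr)
        return false;
    ch->tx = word;
    ch->tr = true;  /* a write to TX sets TR */
    return true;
}

/* Debugger side: one scan returns TR and TX together. The data is valid
 * only if TR was set, and scanning it out while TR is set clears TR. */
bool debugger_scan(tx_channel *ch, uint32_t *out)
{
    assert(ch != 0 && out != 0);
    bool valid = ch->tr;
    *out = ch->tx;
    if (valid)
        ch->tr = false;
    return valid;
}
```

Each side polls until its call returns true, so a word is never overwritten before it is read and never read twice.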
200. ed RISC architecture with an enhanced memory pipeline. The Intel 80200 processor instruction set is based on the ARM V5TE architecture; however, the Intel 80200 processor includes additional instructions. Code generated for the SA-110, SA-1100, and SA-1110 executes on the Intel 80200 processor; however, to obtain the maximum performance of your application code, it should be optimized for the Intel 80200 processor architecture using the techniques presented in this document. About This Guide. This guide assumes that you are familiar with the Intel StrongARM instruction set and the C language. It consists of the following sections: Section B.1, Introduction: Outlines the contents of this guide. Section B.2, Intel 80200 Processor Pipeline: This chapter provides an overview of the Intel 80200 processor pipeline behavior. Section B.3, Basic Optimizations: This chapter outlines basic Intel StrongARM optimizations that can be applied to the Intel 80200 processors. Section B.4, Cache and Prefetch Optimizations: This chapter contains optimizations for efficient use of caches. Also included are optimizations that take advantage of the prefetch instruction of the Intel 80200 processor. Section B.5, Instruction Scheduling: This chapter shows how to optimally schedule code for the Intel 80200 processor pipeline. Section B.6, Optimizing C Libraries: This chapter contains information relating to optimizations fo
201. ed TX value is scanned out during the Shift-DR state. Data scanned in is ignored on an Update-DR. A 1 captured in DBG_SR[0] indicates the captured TX data is valid. After doing a Capture-DR, the debugger must place the JTAG state machine in the Shift-DR state to guarantee that a debugger read clears TXRXCTRL[28]. DBGRX JTAG Command. The DBGRX JTAG instruction selects the DBGRX JTAG data register. The JTAG opcode for this instruction is 0b00010. Once the DBGRX data register is selected, the debugger can send data to the debug handler through the RX register. 13.11.6 DBGRX JTAG Register. The DBGRX JTAG instruction selects the DBGRX JTAG Data register. The debugger uses the DBGRX data register to send data or commands to the debug handler. Figure 13-4. DBGRX Hardware (block diagram; labels shown: TDI, TDO, TCK, DBG_SR, DBG_REG, TXRXCTRL, RX, Capture-DR, Update-DR, software read, debugger write).
202. ed on Intel XScale Microarchitecture Data Cache and Write Buffer Memory Management All of these descriptor bits affect the behavior of the Data Cache and the Write Buffer If the X bit for a descriptor is zero the C and B bits operate as mandated by the ARM architecture This behavior is detailed in Table 3 1 If the X bit for a descriptor is one the C and B bits meaning is extended as detailed in Table 3 2 Data Cache and Buffer Behavior when X 0 CB Cacheable Bufferable Write Policy Fee Notes Policy 00 N N Stall until complete 0 1 N N 10 Y Y Write Through Read Allocate 11 Y Y Write Back Read Allocate a Normally the processor continues executing after a data access if no dependency on that access is encountered With this setting the processor stalls execution until the data access completes This guarantees to software that the data access has taken effect by the time execution of the data access instruction completes External data aborts from such accesses are imprecise but see Section 2 3 4 4 for a method to shield code from this imprecision Data Cache and Buffer Behavior when X 1 Line CB Cacheable Bufferable Write Policy Allocation Notes Policy 00 Unpredictable do not use Writes do not coalesce into 21 N d j D buffers we Cache policy is determined 10 I by MD field of Auxiliary Control register R
203. ee ees 1 Clock Counter CCNT CP14 Register 1 nennen nensi 2 Performance Count Registers PMNO PMN1 CP14 Register 2 and 3 Respectively 3 12 3 1 Extending Count Duration Beyond 32 Bits sssssssssseeeeee ene 3 Performance Monitor Control Register PMNC sssssssssssseeneeeeenneneen nennen 4 12 4 1 ENEInegalemEEP 5 Performance Monitoring Events A 6 12 5 1 Instruction Cache Efficiency Mode nnne 7 12 5 2 Data Cache Efficiency Mode 8 March 2003 Developer s Manual In 12 6 12 7 13 13 1 13 2 13 3 13 4 13 5 13 6 13 7 13 8 13 9 13 10 13 11 m tel Intel 80200 Processor based on Intel XScale Microarchitecture 12 5 8 Instruction Fetch Latency Mode 8 12 5 4 Data Bus Request Buffer Full Mode cccscceceseeeeeeeeeeeneeeeeeeeeseeeeeeeaeeeseaeeeeneeeesenees 9 12 5 5 Stall Writeback Statistics riasu entre pete d peer tede En ede vena yess 9 12 5 6 Instruction TLB Efficiency Mode AAA 10 12 5 7 Data TLB Efficiency Mode tete cde ntc et c e ir ue ux 10 Multiple Performance Monitoring Run Statistics sssssseseeeeenenennens 11 ge m 12 Software DGG E 1 Blue rd 1 Debug Beglsters petis tet dees ache facon Erat dv eee eee ered ca Exc deen Pese ed ata dE 1 MINTO Le m
204. 10.3.8 Locked Access. An example of a locked access is shown in Figure 10-12. Here the processor is doing an atomic read-write to address 0x240 (denoted as A in the figure). The Lock signal, which is valid at the positive edge of MCLK when ADS is asserted, is asserted for each request from the read of A until just prior to the matching write of A. Note that an intervening read cycle to location 0x12349988 also occurs (perhaps to read a page table element). This is legal behavior and must be accommodated by all entities on the bus. The bus makes no guarantee as to how many cycles may elapse between a locked read and its corresponding locked write. It is guaranteed that no writes intervene during that period, although an arbitrary number of reads may occur. Figure 10-12. Locked Access (timing diagram, 0 ns to 100 ns; signals: ADS, Lock/LEN[1], W/R/LEN[0], A, DValid, Abort).
205. eficial to rewrite loops whenever possible so as to make the loop exit conditions check against the value 0. For example, the code generated for the code segment below needs a compare instruction to check for the loop exit condition:
    for (i = 0; i < 10; i++)
    {
        do something;
    }
If the loop were rewritten as follows, the code generated avoids using the compare instruction to check for the loop exit condition:
    for (i = 9; i >= 0; i--)
    {
        do something;
    }
B.3.1.2 Optimizing Branches. Branches decrease application performance by indirectly causing pipeline stalls. Branch prediction improves the performance by lessening the delay inherent in fetching a new instruction stream. The number of branches that can accurately be predicted is limited by the size of the branch target buffer. Since the total number of branches executed in a program is relatively large compared to the size of the branch target buffer, it is often beneficial to minimize the number of branches in a program. Consider the following C code segment:
    int foo(int a)
    {
        if (a > 10)
            return 0;
        else
            return 1;
    }
The code generated for the if-else portion of this code segment using branches is:
        cmp r0, #10
        ble L1
        mov r0, #0
        b   L2
    L1: mov r0, #1
    L2:
The code generated above takes three cycles to execute
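The loop rewrite described above is safe whenever the loop body does not depend on iteration order. A minimal C sketch, with hypothetical function names, shows the count-up and count-down forms producing the same result; whether a given compiler actually drops the compare instruction for the second form is not guaranteed by this code.

```c
#include <assert.h>

/* Count-up form: the exit test compares i against n. */
long sum_count_up(const int *v, int n)
{
    assert(n >= 0);
    long total = 0;
    for (int i = 0; i < n; i++)
        total += v[i];
    return total;
}

/* Count-down form: the exit test is against 0, so the decrement itself
 * can set the condition codes used by the loop branch. */
long sum_count_down(const int *v, int n)
{
    assert(n >= 0);
    long total = 0;
    for (int i = n - 1; i >= 0; i--)
        total += v[i];
    return total;
}
```

Both functions visit the same n elements, only in opposite order, so a sum or any other order-independent body is a good candidate for the count-down form.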
206. egister 13: Process ID, on page 7-16. An instruction that modifies a BCU register is guaranteed to take effect before the next instruction executes. BCU Control Registers. The BCU Control Register (BCUCTL) allows software to view and control the behavior of the BCU. BCUCTL (Register 0) (Sheet 1 of 2). Reset value: all implemented bits are 0.
    Bit 31, Read / Write-ignored: TP (Transactions Pending). Indicates whether the BCU is idle. 0 = no memory transactions pending; 1 = one or more transactions pending.
    Bit 30, Read / Write: EV (Error Overflow). Read values: 0 = no unlogged errors have occurred; 1 = errors have occurred beyond those logged in ELOG0 and ELOG1. Write values: 0 = no change; 1 = clear this bit.
    Bit 29, Read / Write: E1 (ELOG1 is valid). Read values: 0 = contents of ELOG1 should be disregarded; 1 = error occurred and is logged in ELOG1. Write values: 0 = no change; 1 = clear this bit.
    Bit 28, Read / Write: E0 (ELOG0 is valid). Read values: 0 = contents of ELOG0 should be disregarded; 1 = error occurred and is logged in ELOG0. Write values: 0 = no change; 1 = clear this bit.
    Bits 27:4, Read-unpredictable / Write-as-0: reserved.
Table 11-2. BCUCTL (Register 0) (Sheet 2 of 2) 31 30 29 28 27 26
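The status bits listed for BCUCTL can be pulled out of a 32-bit register value with a few masks. The sketch below is an assumption-laden illustration: the macro, struct, and function names are hypothetical, EV is taken to be bit 30 because it sits between bits 31 and 29 in the table, and reading the register itself (an MRC from CP13) is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as listed in the BCUCTL table (names are illustrative). */
#define BCUCTL_TP (UINT32_C(1) << 31) /* transactions pending */
#define BCUCTL_EV (UINT32_C(1) << 30) /* error overflow        */
#define BCUCTL_E1 (UINT32_C(1) << 29) /* ELOG1 is valid        */
#define BCUCTL_E0 (UINT32_C(1) << 28) /* ELOG0 is valid        */

typedef struct {
    bool transactions_pending;
    bool error_overflow;
    bool elog1_valid;
    bool elog0_valid;
} bcuctl_status;

/* Decode the four status bits from a value already read from BCUCTL. */
bcuctl_status bcuctl_decode(uint32_t value)
{
    bcuctl_status s;
    s.transactions_pending = (value & BCUCTL_TP) != 0;
    s.error_overflow       = (value & BCUCTL_EV) != 0;
    s.elog1_valid          = (value & BCUCTL_E1) != 0;
    s.elog0_valid          = (value & BCUCTL_E0) != 0;
    return s;
}
```

For example, a value of 0x90000000 decodes as transactions pending with a valid ELOG0 entry and no overflow.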
207. en or sustained. Once asserted, the level on the Hold pin must be maintained until HldA has been asserted by the Intel 80200 processor. If Hold is deasserted before HldA asserts, the results are unpredictable. The delay to asserted HldA after Hold is sampled may vary depending on the state of the bus. If Hold is asserted on clock edge n, the Intel 80200 processor guarantees it asserts HldA by clock edge n+6. The Intel 80200 processor drives the floated signals two cycles after Hold is deasserted. That is, if Hold is deasserted at clock edge n, then the Intel 80200 processor drives the floating signals to a valid level on clock edge n+2. The implication of this is that the signals are not carrying valid data that may be sampled until clock edge n+3. HldA is deasserted at the same time the signals are taken out of float. Once HldA is deasserted, external hardware must wait at least one cycle before asserting Hold again. That is, Hold may only be asserted at cycle n if HldA is not asserted at cycle n-1. If the Intel 80200 processor is in a low power mode (see Section 8.3, Power Management, on page 8-5), then the timing for entering and exiting hold mode may be faster than described above. If Hold functionality is desired during one of these modes, the MCLK clock must be toggling. During reset, the Intel 80200 processor honors requests for Hold. Note that the data bus is not affected at all by the Hold pin or the floating of the request bus
208. en the locking process can be sped up by using the CP15 prefetch zero function. This function does not generate external memory references. See the Intel 80200 processor reference manual for more information on how to do this. When creating the on-chip RAM, care must be taken to ensure that all sets in the on-chip RAM area of the Data cache have approximately the same number of ways locked; otherwise some sets have more ways locked than the others. This uneven allocation increases the level of thrashing in some sets and leaves other sets under-utilized. For example, consider three arrays, arr1, arr2, and arr3, of size 64 bytes each that are being allocated to the on-chip RAM, and assume that the address of arr1 is 0, the address of arr2 is 1024, and the address of arr3 is 2048. All three arrays are within the same sets, i.e., set 0 and set 1; as a result, three ways in both set 0 and set 1 are locked, leaving 29 ways for use by other variables. This can be overcome by allocating on-chip RAM data in sequential order. In the above example, allocating arr2 to address 64 and arr3 to address 128 allows the three arrays to use only 1 way in sets 0 through 5. B.4.2.5 Mini-data Cache. The mini-data cache is best used for data structures which have short temporal lives and/or cover vast amounts of data space. Addressing these typ
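The effect of array placement on locked ways can be checked with a small set-index calculation. The sketch below assumes the data cache geometry given elsewhere in the manual (32-byte lines, 32 sets), assumes each locked line consumes one way in its set, and uses hypothetical names throughout.

```c
#include <assert.h>
#include <stdint.h>

enum { LINE_BYTES = 32, NUM_SETS = 32 };

/* Set selected by an address: line number modulo the number of sets. */
unsigned set_index(uint32_t addr)
{
    return (addr / LINE_BYTES) % NUM_SETS;
}

/* Add one locked way to every set touched by [base, base + size). */
void lock_range(unsigned ways_locked[NUM_SETS], uint32_t base, uint32_t size)
{
    assert(size > 0);
    uint32_t first = base & ~(uint32_t)(LINE_BYTES - 1);
    for (uint32_t a = first; a < base + size; a += LINE_BYTES)
        ways_locked[set_index(a)]++;
}
```

Under these assumptions, three 64-byte arrays at addresses 0, 1024, and 2048 each land in sets 0 and 1 and lock three ways there, while the same arrays at 0, 64, and 128 lock a single way in each of sets 0 through 5.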
209. en writing the corrupted data back. A memory system should use this as an indication to not update the DCB value or any data bytes. Error reporting may be enabled with the BCUCTL register, described in Section 11.4.1. If enabled, single-bit errors cause the BCU to assert an interrupt to the Interrupt Controller Unit (ICU). If the interrupt is not enabled in the ICU, it is not propagated to the core. This interrupt may be cleared by software by writing to the BCU Control Register (see Section 11.4.1). If a masked interrupt is not cleared by software, it interrupts the core if software ever unmasks it. Although single-bit errors can be corrected by the BCU, the software may choose to take the additional step of scrubbing the offending memory location. To accomplish this, the software needs to write the correct data back to the location. The BCU logs sufficient information in its registers (ELOGx and ECARx) to enable software to re-issue the load that uncovered the problem. Single-bit errors detected during a RMW are corrected and written back, if single-bit correction is enabled. The BCU still reports the error if enabled, but software does not need to scrub the location. Single-bit errors may also be detected in the received ECC itself. In this case, the BCU doesn't need to modify the received data, but it still r
210. ent Priority. Exception Priority: Reset, 1 (Highest); Data Abort (Precise & Imprecise), 2; FIQ, 3; IRQ, 4; Prefetch Abort, 5; Undefined Instruction, SWI, 6 (Lowest). 2.3.4.3 Prefetch Aborts. The Intel 80200 processor detects three types of prefetch aborts: Instruction MMU abort, external abort on an instruction access, and an instruction cache parity error. These aborts are described in Table 2-13. When a prefetch abort occurs, hardware reports the highest priority one in the extended Status field of the Fault Status Register. The value placed in R14_ABORT (the link register in abort mode) is the address of the aborted instruction + 4. Table 2-13. Intel 80200 Processor Encoding of Fault Status for Prefetch Aborts (columns: Priority, Sources, FS[10,3:0], Domain, FAR):
    Highest: Instruction MMU Exception. Several exceptions can generate this encoding: translation faults, domain faults, and permission faults. It is up to software to figure out which one occurred. FS[10,3:0] = 0b10000; Domain invalid; FAR invalid.
    External Instruction Error Exception. This exception occurs when the external memory system reports an error on an instruction cache fetch. FS[10,3:0] = 0b10110; Domain invalid; FAR invalid.
    Lowest: Instruction Cache Parity Error Exception. FS[10,3:0] = 0b11000; Domain invalid; FAR invalid.
a. All other encodings not listed
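The FS[10,3:0] encodings in Table 2-13 can be decoded from a Fault Status Register value by joining bit 10 with bits 3:0. This is a sketch with hypothetical names; it handles only the three prefetch-abort encodings listed in the table and treats everything else as unknown.

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    PF_INSTRUCTION_MMU,
    PF_EXTERNAL_INSTRUCTION_ERROR,
    PF_ICACHE_PARITY_ERROR,
    PF_UNKNOWN
} prefetch_abort_source;

/* Build the 5-bit status FS[10,3:0]: FSR bit 10 becomes the top bit,
 * FSR bits 3:0 supply the low four bits. */
unsigned fs_field(uint32_t fsr)
{
    return (unsigned)((((fsr >> 10) & 1u) << 4) | (fsr & 0xFu));
}

prefetch_abort_source classify_prefetch_abort(uint32_t fsr)
{
    switch (fs_field(fsr)) {
    case 0x10: return PF_INSTRUCTION_MMU;            /* 0b10000 */
    case 0x16: return PF_EXTERNAL_INSTRUCTION_ERROR; /* 0b10110 */
    case 0x18: return PF_ICACHE_PARITY_ERROR;        /* 0b11000 */
    default:   return PF_UNKNOWN;
    }
}
```

An abort handler would read the FSR through the appropriate coprocessor access and pass the value in; that read is outside the scope of this sketch.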
211. ents the most recent entry in the buffer. The last byte read from the buffer is always a message byte. This provides the debugger with a starting point for parsing the entries out of the buffer. Because the debugger needs the last byte as a starting point when parsing the buffer, the entire trace buffer must be read (256 bytes on the Intel 80200 processor) before the buffer can be parsed. Figure 13-9 is a high-level view of the trace buffer. Figure 13-9. High Level View of Trace Buffer (entries from the first byte read, the oldest entry, to the last byte read, the most recent entry; labels CHKPT1 and CHKPT0 also appear in the figure): target[7:0]; 1001 CCCC (indirect); 1000 CCCC (direct); 1100 CCCC (direct); 1111 1111 (roll-over); target[31:24]; target[23:16]; target[15:8]; target[7:0]; 1101 CCCC (indirect); 1000 CCCC (direct); 1111 1111 (roll-over); 1000 CCCC (direct). The trace buffer must be initialized prior to its initial usage, then again prior to each subsequent usage. Initialization is done by reading the entire trace buffer. The process of reading the trace buffer also clears it out (all entries are set to 0b0000 0000), so when the trace buffer has been used to capture a trace, the process of reading the captured trace data also re-initializes the trace buffer for its next usage. The trace buffer can be used to capture a trace up to a processor reset. A processor reset disables the trace buffer, but the
212. eports the error if enabled. Software can scrub the memory's ECC by simply writing to it. The BCU automatically generates and writes the correct ECC value. If a multi-bit error is detected, the BCU requests an exception of the core. The exact exception triggered depends on the data's destination. For example, a code fetch that received a multi-bit error would result in a Prefetch Abort if the processor attempted to execute the code. If the BCU receives a bus abort (see Section 10.2.6, Abort, on page 10-11 and Section 11.3.1, Bus Aborts, on page 11-2) and an ECC error on the same cycle, it ignores the ECC information for the cycle and processes the bus abort normally. Any ECC error that elicits a Prefetch or Data abort from the BCU is the final error associated with a data burst. The BCU accepts the remaining data from the bus, but ignores it and does not perform ECC checks. 11.4 Programmer Model. The BCU registers reside in Coprocessor 13 (CP13). They may be accessed/manipulated with the MCR, MRC, STC, and LDC instructions. The CRn field of the instruction denotes the register number to be accessed. Field CRm must be set to 1. The opcode_1 and opcode_2 fields of the instruction should be zero. Access to CP13 may be controlled using the Coprocessor Access Register (see Section 7.2.13, R
213. er (CP13, register 8). Reset value: writeable bits set to 0.
    Bits 31:2, Read-unpredictable / Write-as-Zero: Reserved.
    Bit 1, Read / Write: BS (BCU Steering). If BCU interrupts are enabled, this bit steers them. 0 = BCU interrupts directed to internal IRQ; 1 = BCU interrupts directed to internal FIQ.
    Bit 0, Read / Write: PS (PMU Steering). If PMU interrupts are enabled, this bit steers them. 0 = PMU interrupts directed to internal IRQ; 1 = PMU interrupts directed to internal FIQ.
10 External Bus. 10.1 General Description. The Intel 80200 processor based on Intel XScale microarchitecture (compliant with the ARM Architecture V5TE) bus is a split bus, with separate request and data buses. It is designed primarily as the memory and I/O bus for the Intel 80200 processor, not as a general-purpose multi-master bus, although it is possible to have several masters on it efficiently. It is deeply pipelined and works well with deeply pipelined memory technologies, with the memory sharing the data bus with the Intel 80200 processor or with the memory on a separate chipset bus. Figure 10-1 shows a typical system; this configuration allows the chipset to avoid having a second 64-bit memory bus in addition to the 64-bit Intel 80200 processor data bus, with significant savings
214. errupt request. 0 = BCU not interrupting; 1 = BCU interrupting. PI (PMU Interrupt Active): Holds the state of the PMU interrupt request. 0 = not interrupting; 1 = interrupting. Bits 31, 30, 29, and 28: Read / Write-ignored. Bits 27:0: Read-unpredictable / Write-ignored; Reserved. Note that memory buffering and external logic on FIQ and IRQ could cause INTSRC.II and INTSRC.FI to remain asserted for several cycles after the interrupt source has been commanded to stop asserting. An ISR should ensure that the interrupt source is quelled, or a spurious recursive entry to the ISR may result when interrupts are enabled. Example 9-1 on page 9-4 shows how software might wait for the FIQ signal to be deasserted before proceeding. Example 9-1. Waiting for FIQ Deassertion:
    waitForNoFIQ:
        MRC P13, 0, R15, C4, C0, 0   ; get high bits of INTSRC into flags
        BMI waitForNoFIQ             ; if FI bit set, try again
9.3.3 INTSTR. Systems may have differing priorities for the various interrupt cases; the ICU allows system designers to associate each internal interrupt source with one of the two internal interrupts, FIQ and IRQ. This association is called steering. INTSTR is used to specify how internal interrupt sources should be steered. Table 9-3. Interrupt Steer Regist
215. es of data spaces from the Data cache would corrupt much, if not all, of the Data cache by evicting valuable data. Eviction of valuable data reduces performance. Placing this data instead in a Mini-data cache memory region would prevent Data cache corruption while providing the benefits of cached accesses. A prime example of using the mini-data cache would be for caching the procedure call stack. The stack can be allocated to the mini-data cache so that its use does not thrash the main dcache. This would keep local variables separate from global data. Following are examples of data that could be assigned to the mini-dcache: The stack space of a frequently occurring interrupt; the stack is used only during the duration of the interrupt, which is usually very small. Video buffers; these are usually large and can occupy the whole cache. Over-use of the Mini-Data cache thrashes the cache. This is easy to do because the Mini-Data cache only has two ways per set. For example, a loop which uses a simple statement such as:
    for (i = 0; i < IMAX; i++)
    {
        A[i] = B[i] + C[i];
    }
where A, B, and C reside in a mini-data cache memory region and each array is aligned on a 1K boundary, quickly thrashes the cache. B.4.2.6 Data Alignment. Cache lines begin on 32-byte address boundaries. To maximize cache line use and min
216. es that 0 instructions executed between the last control flow change and the current exception. For example, if a branch is immediately followed by a SWI, a direct branch message (for the branch) is followed by an exception message (for the SWI) in the trace buffer. The count value in the exception message is 0, meaning that 0 instructions executed after the last control flow change (the branch) and before the current control flow change (the SWI). Instead of the SWI, if an IRQ was handled immediately after the branch (before any other instructions executed), the count would still be 0, since no instructions executed after the branch and before the interrupt was handled. A count of 0b1111 indicates that 15 instructions executed between the last branch and the exception. In this case, an exception was either caused by the 16th instruction (if it is an undefined instruction exception, pre-fetch abort, or SWI) or handled before the 16th instruction executed (for FIQ, IRQ, or data abort). 13.13.1.2 Non-exception Message Byte. Non-exception message bytes are used for direct branches, indirect branches, and rollovers. In a non-exception message byte, the 4-bit message type field (MMMM) specifies the type of message (refer to Table 13-17). The incremental word count (CCCC) is the instruction count since the last control
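A debugger-side decoder for trace message bytes can be sketched in C from the encodings in the manual's message byte table (exception 0b0VVV CCCC, direct branch 0b1000 CCCC, checkpointed direct 0b1100 CCCC, indirect 0b1001 CCCC, checkpointed indirect 0b1101 CCCC, roll-over 0b1111 1111). The type and function names are hypothetical, and this only classifies a single byte; walking the buffer backwards from the last byte read is not shown.

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    MSG_EXCEPTION,
    MSG_DIRECT_BRANCH,
    MSG_CHECKPOINTED_DIRECT,
    MSG_INDIRECT_BRANCH,
    MSG_CHECKPOINTED_INDIRECT,
    MSG_ROLL_OVER,
    MSG_UNKNOWN
} msg_type;

msg_type classify_message_byte(uint8_t b)
{
    if (b == 0xFF)
        return MSG_ROLL_OVER;            /* 0b1111 1111 */
    if ((b & 0x80) == 0)
        return MSG_EXCEPTION;            /* 0b0VVV CCCC */
    switch (b >> 4) {
    case 0x8: return MSG_DIRECT_BRANCH;
    case 0xC: return MSG_CHECKPOINTED_DIRECT;
    case 0x9: return MSG_INDIRECT_BRANCH;
    case 0xD: return MSG_CHECKPOINTED_INDIRECT;
    default:  return MSG_UNKNOWN;
    }
}

/* CCCC: incremental instruction count carried in the low nibble. */
unsigned message_count(uint8_t b) { return b & 0x0Fu; }

/* VVV holds bits 4:2 of the vector address offset (exception messages). */
unsigned exception_vector_offset(uint8_t b) { return ((b >> 4) & 0x7u) << 2; }

/* Indirect branch messages are accompanied by 4 target address bytes. */
int address_bytes(msg_type t)
{
    return (t == MSG_INDIRECT_BRANCH || t == MSG_CHECKPOINTED_INDIRECT) ? 4 : 0;
}
```

For instance, the byte 0x20 classifies as an exception message with count 0 and vector offset 0x08, which matches the SWI-immediately-after-a-branch case described in the text.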
217. Dynamic Code Download Synchronization: The following pieces of code are necessary in the debug handler to implement the synchronization used during dynamic code download. The pieces must be ordered in the handler as shown below. Before the download can start, all outstanding instruction fetches must complete. The MCR invalidate-IC-by-line function serves as a barrier instruction in the 80200: all outstanding instruction fetches are guaranteed to complete before the next instruction executes. NOTE 1: The actual address specified to invalidate is implementation defined, but must not have any harmful effects. NOTE 2: The placement of the invalidate code is implementation defined; the only requirement is that it must be placed such that, by the time the debugger starts loading the instruction cache, all outstanding instruction fetches have completed.
    mov r5, address
    mcr p15, 0, r5, c7, c5, 1
The host waits for the debug handler to signal that it is ready for the code download. This can be done using the TX register access handshaking protocol. The host polls the TR bit through JTAG until it is set, then begins the code download. The following MCR does a write to TX, automatically setting the TR bit. NOTE: The value written to TX is implementation defined.
    mcr p14, 0, r6, c8, c0, 0
The debug handler waits until the download is complete before
218. Instruction Cache: Software can lock down several different routines located at different memory locations. This may cause some sets to have more locked lines than others, as shown in Figure 4-2. Example 4-4 on page 4-9 shows how a routine, called "lockMe" in this example, might be locked into the instruction cache. Note that it is possible to receive an exception while locking code (see Section 2.3.4, Event Architecture, on page 2-12). Example 4-4. Locking Code into the Cache:
    lockMe:                              ; This is the code that will be locked into the cache
        mov r0, #5
        add r5, r1, r2
        . . .
    lockMeEnd:
        . . .
    codeLock:                            ; here is the code to lock the "lockMe" routine
        ldr r0, =(lockMe AND NOT 31)     ; r0 gets a pointer to the first line we should lock
        ldr r1, =(lockMeEnd AND NOT 31)  ; r1 contains a pointer to the last line we should lock
    lockLoop:
        mcr p15, 0, r0, c9, c1, 0        ; lock next line of code into ICache
        cmp r0, r1                       ; are we done yet?
        add r0, r0, #32                  ; advance pointer to next line
        bne lockLoop                     ; if not done, do the next line
4.3.5 Unlocking Instructions in the Instruction Cache: The Intel 80200 processor provides a global unlock command for the instruction cache. Writing to coprocessor 15, register 9 unlocks all the locked lines in the instruction cache and leaves them valid. These lines then become available for the round-robin replacement algorithm. See Table 7
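Before locking a routine it helps to know how many cache lines the loop in Example 4-4 will consume. The sketch below mirrors the loop's address arithmetic: both end points are rounded down to a 32-byte line, and every line from the first to the last is locked, inclusive. The function names are hypothetical, and the addresses in the usage note are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define ICACHE_LINE 32u

/* Same rounding as "AND NOT 31" in the assembly example. */
uint32_t line_base(uint32_t addr)
{
    return addr & ~(ICACHE_LINE - 1u);
}

/* Number of lines locked for a routine spanning [start, end]. */
unsigned lines_to_lock(uint32_t start, uint32_t end)
{
    assert(start <= end);
    return (line_base(end) - line_base(start)) / ICACHE_LINE + 1u;
}
```

A routine that starts at 0x101C and ends at 0x1024 straddles a line boundary, so it costs two locked lines even though it is only a few instructions long. Aligning short routines to a line boundary keeps the locked footprint small.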
219. f a stall-until-complete LD or ST instruction triggers an imprecise fault, then that fault is seen by the program within three instructions. With this knowledge, it is possible to write code that accesses stall-until-complete memory with impunity. Simply place several NOP instructions after such an access. If an imprecise fault occurs, it happens during the NOPs; the data abort handler sees the same register and memory state as it would with a precise exception, and so should be able to recover. An example of this is shown in Example 2-2 on page 2-15. Example 2-2. Shielding Code from Potential Imprecise Aborts:
    ; Example of code that maintains architectural state through the
    ; window where an imprecise fault might occur.
    LD  R0, [R1]    ; R1 points to stall-until-complete region of memory
    NOP
    NOP
    NOP
    ; Code beyond this point is guaranteed not to see any aborts from the LD.
Of course, if a system design precludes events that could cause external aborts, then such precautions are not necessary. 2.3.4.5 Multiple Data Aborts: Multiple data aborts may be detected by hardware, but only the highest priority one is reported. If the reported data abort is precise, software can correct the cause of the abort and re-execute the aborted instruction. If the lower priority abort still exists
220. ftware must disable the breakpoint before exiting the handler. This allows the breakpointed instruction to execute after the exception is handled. Single-step execution is accomplished using the instruction breakpoint registers and must be completely handled in software (either on the host or by the debug handler). 13.6.2 Data Breakpoints: The Intel 80200 processor debug architecture defines two data breakpoint registers (DBR0, DBR1). The format of the registers is shown in Table 13-4. Table 13-4. Data Breakpoint Register (DBRx): reset value: unpredictable. Bits 31:0, access Read/Write. DBR0: Data Breakpoint MVA. DBR1: Data Address Mask OR Data Breakpoint MVA. DBR0 is a dedicated data address breakpoint register. DBR1 can be programmed for one of two operations: data address mask, or second data address breakpoint. The DBCON register controls the functionality of DBR1, as well as the enables for both DBRs. DBCON also controls what type of memory access to break on. Table 13-5. Data Breakpoint Controls Register (DBCON): reset value: 0x00000000
221. ftware sequence is available for those wishing to determine when this update occurs, and can be found in Section 2.3.3, Additions to CP15 Functionality, on page 2-11. Like certain other ARM architecture products, the Intel 80200 processor includes an extra level of virtual address translation in the form of a PID (Process ID) register and associated logic. For a detailed description of this facility, see Section 7.2.13, Register 13: Process ID, on page 7-16. Privileged code needs to be aware of this facility because, when interacting with CP15, some addresses are modified by the PID and others are not. An address that has yet to be modified by the PID ("PIDified") is known as a virtual address (VA). An address that has been through the PID logic, but not translated into a physical address, is a modified virtual address (MVA). Non-privileged code always deals with VAs, while privileged code that programs CP15 occasionally needs to use MVAs. The format of MRC and MCR is shown in Table 7-1. cp_num is defined for CP15, CP14, CP13, and CP0. CP13 contains the interrupt controller and bus controller registers and is described in Chapter 9, Interrupts, and Chapter 11, Bus Controller, respectively. CP0 supports instructions specific for DSP and is described in Chapter 2, Programming Model. Access to all o
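The VA to MVA step can be pictured with a few lines of C. This is a sketch under an assumption that the excerpt does not spell out: that the PID occupies address bits [31:25] and replaces those bits only when they are zero in the VA, as in the ARM Fast Context Switch Extension. Section 7.2.13 of the manual is the authority on the actual behaviour; the function name here is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: a 7-bit process ID is substituted into VA[31:25] when,
 * and only when, those bits of the VA are all zero. */
uint32_t va_to_mva(uint32_t va, uint32_t pid)
{
    assert(pid < 128u);
    if ((va >> 25) != 0u)
        return va;               /* upper bits in use: address unchanged */
    return va | (pid << 25);     /* relocate into the process's 32 MB slot */
}
```

Under this model, privileged code that hands an address to a CP15 function expecting an MVA would apply the same mapping itself, since the hardware does not PIDify every CP15 operand.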
222. g sleep mode. Drive a 0 into the JTAG clock when not toggling it. 9 Interrupts. 9.1 Introduction: The Intel 80200 processor based on Intel XScale microarchitecture (compliant with the ARM Architecture V5TE) supports a variety of external and internal interrupt sources. The Interrupt Control Unit (ICU) controls how the Intel 80200 processor reacts to these interrupts. Ultimately, all interrupt sources are combined into one of two internal interrupts: IRQ and FIQ. These interrupts correspond to the IRQ and FIQ described in the ARM Architecture Reference Manual. The two interrupt signals that enter the chip are FIQ (fast interrupt) and IRQ (normal interrupt). These signals must be asserted and held low to interrupt the processor. The internal interrupt sources originate in the Bus Controller Unit (see Chapter 11, Bus Controller) and the Performance Monitoring Unit (see Chapter 12, Performance Monitoring). To allow flexible system design, these interrupts may be steered under software control to act equivalently to either FIQ or IRQ. All interrupts are level sensitive: interrupt sources must keep asserting the interrupt signal until software causes the source to deassert it. All interrupt sources are individually maskable with the ICU's Interrupt Control register (INTCTL). Additionally, all interrupts may be quickly disabled by altering the F and I bits in the CPSR, as s
223. gger starts the download. The debugger scans data into JTAG to write to the RX register with the download bit and the valid bit set. Following the write to RX, the RR bit and D bit are automatically set in TXRXCTRL. Without polling RR to see whether the debug handler has read the data just scanned in, the debugger continues scanning new data into JTAG for RX, with the download bit and the valid bit set. An overflow condition occurs if the debug handler does not read the previous data before the debugger completes scanning in the new data (see Section 13.8.2, Overflow Flag (OV), for more details on the overflow condition). After completing the download, the debugger clears the D bit, allowing the debug handler to exit the download loop. Debug Handler Actions: The debug handler is in a routine waiting to write data out to memory. The routine loops based on the D bit in TXRXCTRL. The debug handler polls the RR bit until it is set. It then reads the RX register and writes it out to memory. The handler loops, repeating these operations, until the debugger clears the D bit. 13.8.2 Overflow Flag (OV): The Overflow flag is a sticky flag that is set when the debugger writes to the RX register while the RR bit is set. The flag is used during high-speed download to indicate
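The interaction between the RR, D, and OV bits during high-speed download can be modelled in software to reason about when data is lost. The following is a toy model, not a description of the hardware: the struct and function names are hypothetical, the assumption that a handler read of RX clears RR is not stated in this excerpt, and which word survives an overflow is not modelled faithfully (this sketch simply keeps the newer one).

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t rx;  /* last word written by the debugger          */
    int rr;       /* register ready: RX holds unread data       */
    int d;        /* download bit: handler stays in its loop    */
    int ov;       /* sticky overflow flag                       */
} rx_model;

/* Debugger scans a word into RX with the download and valid bits set. */
void host_write_rx(rx_model *m, uint32_t data)
{
    assert(m != 0);
    if (m->rr)
        m->ov = 1;    /* previous word was never read: flag stays set */
    m->rx = data;     /* assumption: the newer word overwrites RX     */
    m->rr = 1;
    m->d = 1;
}

/* Debug handler reads RX; assumed to clear RR but never OV. */
uint32_t handler_read_rx(rx_model *m)
{
    assert(m != 0);
    m->rr = 0;
    return m->rx;
}

/* Debugger ends the download so the handler can leave its loop. */
void host_clear_d(rx_model *m)
{
    assert(m != 0);
    m->d = 0;
}
```

Because OV is sticky, a host that checks it once at the end of the download learns whether any word was dropped along the way, without having polled RR after every write.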
224. gister. Its contents are unpredictable and should not be relied upon across any instructions or exceptions. However, DBG_r13 can be used, by data processing (non-RRX) and MCR/MRC instructions, as a temporary scratch register. The following instructions should not be executed in Debug Mode, as they may result in unpredictable behavior: LDM; LDR with Rd = PC; LDR with RRX addressing mode; SWP; LDC; STC. The handler executes in Debug Mode and can be switched to other modes to access banked registers. The handler must not enter User Mode; any User Mode registers that need to be accessed can be accessed in System Mode. Entering User Mode may cause unpredictable behavior. 13.15.2.3 Dynamic Debug Handler: On the Intel 80200 processor, the debug handler and override vector tables reside in the 2 KB mini instruction cache, separate from the main instruction cache. A "static" Debug Handler is downloaded during reset. This is the base handler code, necessary to do common operations such as handler entry/exit, parse commands from the debugger, read/write ARM registers, read/write memory, etc. Some functions may require large amounts of code or may not be used very often. As long as there is space in the mini instruction cache, these functions can be downloaded as part of the static Debug Handler. However, if
225. gister (BCUCTL) bit EV. A description of these registers is in Section 11.4.1 and Section 11.4.2. See Chapter 2, Programming Model, for more discussion of Data Aborts. 11.3.2 ECC Errors: An ECC error occurs when the BCU reads data and notices that the associated ECC bits do not match the data. This could also happen as a result of the RMW that the BCU performs on sub-bus-width writes. A single transaction on the bus could result in multiple ECC errors, as the BCU checks each bus-width entity as it is received. Table 11-1 summarizes the BCU error response for ECC errors. The response is tailorable: some of the actions may be disabled, such as requesting a core interrupt. See Section 11.4.1, BCU Control Registers, on page 11-5 for information on the EE and SR bits and how the BCU error response may be altered. Table 11-1. BCU Response to ECC Errors (columns: Event, EE, SR, Response):
    Read with 1-bit error, EE = 0: No correction, no notification.
    Read with 1-bit error, EE = 1, SR = 0: Correction if BCUCTL.SC = 1.
    Read with 1-bit error, EE = 1, SR = 1: Correction if BCUCTL.SC = 1. Request core interrupt.
    Read with multi-bit error, EE = 0: No notification.
    Read with multi-bit error, EE = 1: If transaction is a data read: imprecise data abort. If transaction is an instruction fetch: prefetch abort. If transaction is an MMU operation: precise data abort.
    EE = 0: Does not occur; RMW only if BCUCTL EE
226. handler. Note: This number assumes that the interrupt vector is resident in the instruction cache. The Intel 80200 processor does provide the capability to lock the vector and the interrupt service routine into the instruction cache. Many parameters can affect this best-case performance: instruction currently executing (could be as bad as a 16-register LDM); fault status (the processor could fault just when the interrupt arrives); stalls (the processor could be waiting for data from a load, doing a page table walk, etc.); bus ratio (the best case assumes a 3:1 core/bus ratio; higher ratios would slightly improve performance). 14.2 Branch Prediction: The Intel 80200 processor implements dynamic branch prediction for the ARM instructions B and BL and for the Thumb instruction B. Any other instruction that specifies the PC as the destination is predicted as not taken. For example, an LDR or a MOV that loads or moves directly to the PC is predicted not taken and incurs a branch latency penalty. These instructions (ARM B, ARM BL, and Thumb B) enter the branch target buffer when they are "taken" for the first time. (A "taken" branch is one whose condition evaluates to true.) Once in the branch target buffer, the Intel 80200 processor dynamically predicts the outco
227. he D for another data cycle is being driven on the bus. A typical burst would be DValid asserted for four contiguous cycles (n, n+1, n+2, n+3), with the data for that burst being driven in four contiguous cycles delayed two cycles from the DValid (n+2, n+3, n+4, n+5), as shown in Figure 10-5. If the part has been configured to use a 32-bit data bus, then D[63:32] and DCB[7:0] are floated on all writes. Also, the bits in BE[7:4] drive a binary 1 or 0 value. It is suggested that DCB and the upper bits of D be pulled down on 32-bit data bus systems. 10.2.3 Critical Word First: The CWF signal is only used during read bursts of eight words (Len = 6). CWF needs to be driven at the same time as the DValid of the first data cycle of the transaction. This bit indicates to the requesting master what order the data is returning in. The Intel 80200 processor uses this sort of transaction to fill a cache line. There are two acceptable return orders that may be signalled with CWF. CWF = 0: starting with words 0 and 1 (64-bit bus) or word 0 (32-bit bus) and sequentially returning the data. CWF = 1: returning the word, or pair of words, that contains the byte indicated by the request address on the first cycle, then sequentially returning the next highest memory location (word or pair of words) unti
228. he Intel 80200 processor expects a data cycle associated with the next transaction (in this case, the first data cycle of read B). If an aborted data burst cannot be stopped by the memory system, it is sufficient to allow it to complete with DValid deasserted. This is a slight bandwidth hit, but aborts should be rare. If, during a locked access, any read access encounters an Abort, the Intel 80200 processor still emits the final, unlocking write. If the final unlocking write is an Abort, the bus is still unlocked with the write. The Intel 80200 processor never fails to deassert Lock. External logic must leave a gap of at least one MCLK cycle between aborts to the Intel 80200 processor. [Figure: Aborted Access. Timing diagram from 0 ns to 100 ns showing MCLK, ADS, Len, Lock, W/R, A, Abort, and DValid for reads A and B.] 10.3.10 Hold: Figure 10-14 shows an example of hold being asserted to stop new transactions being issued. The Intel 80200 processor floats the issue bus pins and issues no transactions until HldA is deasserted. The Hold signal assertion does not affect the data bus, which continues to operate normally. Read data for requests A and B c
229. he address for the access. During the second cycle of the issue phase, this signal carries the lower 16 bits of the address. Data Bus signals. D (64 bits): Data Bus. BE: Byte Enables for writes; timing same as Data. DCB: Data Check Bits for ECC; timing same as Data. CWF: Critical Word First. Indicates the order in which the current 32-byte read burst is returning. See Section 10.2.3, Critical Word First, on page 10-7 for more information on this signal. This pin must be asserted at the same time as the DValid of the first cycle of returning data for a given read transaction. It is ignored at all other times. DValid: Indicates that, two cycles later, there is an Intel 80200 processor data cycle on D, DCB, and possibly BE (either a read sampled by the Intel 80200 processor or a write driven by the Intel 80200 processor). Abort: Asserted with DValid; indicates that the Intel 80200 processor transaction next in order on the Data bus has been aborted; timing same as DValid. Multimaster Support signals. Hold: Input that tells the Intel 80200 processor to float the following pins: ADS, A, W/R, Lock. HldA: Asserted when the Intel 80200 processor has completed the transition to hold mode in response to Hold. Configuration signals. CWF / DBusWidth (shared with CWF): When the Intel 80200 processor is in reset, this pin functions as DBusWidth; the Intel 80200 proces
230. he checkpoint registers are used to hold target addresses of specific entries in the trace buffer. Only direct and indirect entries get checkpointed. Exception and roll-over messages are never checkpointed. When an entry is checkpointed, the processor sets bit 6 of the message byte to indicate this (refer to Table 13-17, Message Byte Formats). When the trace buffer contains only one checkpointed entry, the corresponding checkpoint register is CHKPT0. When the trace buffer wraps around, two entries are typically checkpointed, usually about half a buffer's length apart. In this case, the first (oldest) checkpointed entry read from the trace buffer corresponds to CHKPT1; the second checkpointed entry corresponds to CHKPT0. Although the checkpoint registers are provided for wrap-around mode, they are still valid in fill-once mode. Trace Buffer Register (TBREG): The trace buffer is read through TBREG, using MRC and MCR. Software should only read the trace buffer when it is disabled. Reading the trace buffer while it is enabled may cause unpredictable behavior of the trace buffer. Writes to the trace buffer have unpredictable results. Reading the trace buffer returns the oldest byte in the trace buffer in the least significant byte of TBREG. The byte is either a message byte or one byte of the 32-bit address associated with an indirect branch message. Table 13-16 shows the format of the trace buffer register. Table 13-16. TBREG Format: 31 30 29 28 27 26 25 24 23 22 21
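The pairing between checkpointed entries and checkpoint registers is easy to get backwards, so a host tool may want to encode it once. A minimal sketch, assuming the host has already read the trace buffer oldest-first and counted the checkpointed entries (one or two); the function name is hypothetical.

```c
#include <assert.h>

/* index: position of the checkpointed entry in read order (0 = oldest).
 * total: number of checkpointed entries found in the buffer (1 or 2).
 * Returns the checkpoint register number: 0 for CHKPT0, 1 for CHKPT1. */
int checkpoint_register_for(int index, int total)
{
    assert(total == 1 || total == 2);
    assert(index >= 0 && index < total);
    if (total == 1)
        return 0;                 /* a single checkpoint always uses CHKPT0 */
    return index == 0 ? 1 : 0;    /* oldest -> CHKPT1, newer -> CHKPT0      */
}
```

The register number flips for the oldest entry depending on whether the buffer has wrapped, which is why the count of checkpointed entries has to be known before the first entry can be resolved.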
231. he contents of the word pointed to the value contained in r1 str r1 r0 4 4 B.4 Cache and Prefetch Optimizations: This chapter considers how to use the various cache memories in all their modes and then examines when and how to use prefetch to improve execution efficiencies. B.4.1 Instruction Cache: The Intel 80200 processor has separate instruction and data caches. Only fetched instructions are held in the instruction cache, even though both data and instructions may reside within the same memory space. Functionally, the instruction cache is either enabled or disabled. There is no performance benefit in not using the instruction cache. The exception is that code which locks code into the instruction cache must itself execute from non-cached memory. B.4.1.1 Cache Miss Cost: The Intel 80200 processor performance is highly dependent on reducing the cache miss rate. When an instruction cache miss occurs, the timing to retrieve the next instruction is the same as that for retrieving data for the data cache. Section B.4.4.1, Prefetch Distances in the Intel 80200 Processor, provides a more detailed explanation of the required time. Using the same assumptions as those used for the data caches, the result is that it takes about 60 to 90 core cycles to retrieve the fi
232. he mini instruction cache. The only way to load a line into the mini instruction cache is through JTAG. 13.14.2 LDIC JTAG Data Register: The LDIC JTAG Data Register is selected when the LDIC JTAG instruction is in the JTAG IR. An external host can load and invalidate lines in the instruction cache through this data register. [Figure 13-10. LDIC JTAG Data Register: block diagram showing TDI and TDO, LDIC_SR1 (loaded on Capture_DR, unpredictable), LDIC_REG (loaded on Update_DR), TCK and the Intel 80200 processor CLK, LDIC_SR2, and the LDIC State Machine feeding the instruction cache.] The data loaded into LDIC_SR1 during a Capture_DR is unpredictable. All LDIC functions and data consist of 33-bit packets, which are scanned into LDIC_SR1 during the Shift_DR state. Update_DR parallel loads LDIC_SR1 into LDIC_REG, which is then synchronized with the Intel 80200 processor clock and loaded into LDIC_SR2. Once data is loaded into LDIC_SR2, the LDIC State Machine turns on and serially shifts the contents of LDIC_SR2 to the instruction cache. Note that there is a delay from the time of the Update_DR to the time the entire c
233. he stack for your application should be allocated to a read/write-allocate region. It is expected that you write to and read from it often. Data that is write-only, or data that is written to and subsequently not used for a long time, should be placed in a read-allocate region. Under the read-allocate policy, if a cache write miss occurs, a new cache line is not allocated and hence does not evict critical data from the Data cache. Creating On-chip RAM: Part of the Data cache can be converted into fast on-chip RAM. Access to objects in the on-chip RAM does not incur cache miss penalties, thereby reducing the number of processor stalls. Application performance can be improved by converting a part of the cache into on-chip RAM and allocating frequently used variables to it. Due to the Intel 80200 processor round-robin replacement policy, all data is eventually evicted. Therefore, to prevent critical or frequently used data from being evicted, it should be allocated to on-chip RAM. The following variables are good candidates for allocating to the on-chip RAM: frequently used global data used for storing context for context switching; and global variables that are accessed in time-critical functions such as interrupt service routines. The on-chip RAM is created by locking a memory region into the Data cache (see Section 6.4, Re-configuring the Data Cache as Data RAM, for more details). If the data in the on-chip RAM is to be initialized to zero, th
234. he value loaded into DBG_DCSR following an Update_DR. Only bits specified as writable by JTAG in Table 13-1 are updated. DBGTX JTAG Command: The DBGTX JTAG instruction selects the DBGTX JTAG data register. The JTAG opcode for this instruction is 0b10000. Once the DBGTX data register is selected, the debugger can receive data from the debug handler. 13.11.4 DBGTX JTAG Register: The DBGTX JTAG instruction selects the Debug JTAG Data register (Figure 13-3). The debugger uses the DBGTX data register to poll for breaks (internal and external) to debug mode and, once in debug mode, to read data from the debug handler. [Figure 13-3. DBGTX Hardware: block diagram showing the TX register (software write) and TXRXCTRL bit (set by software write to TX, cleared by debugger read, software read-only) captured into DBG_SR on Capture_DR, shifted between TDI and TDO, with the Update_DR value ignored.] A Capture_DR loads the TX register value into DBG_SR[34:3] and TXRXCTRL[28] into DBG_SR[0]. The other bits in DBG_SR are loaded as shown in Figure 13-1. The captur
235. hod is used to re-configure lines in the data cache as data RAM. Locking data from external memory into the data cache is useful for lookup tables, constants, and any other data that is frequently accessed. Re-configuring a portion of the data cache as data RAM is useful when an application needs scratch memory (bigger than the register file can provide) for frequently used variables. These variables may be strewn across memory, making it advantageous for software to pack them into data RAM memory. Code examples for these two applications are shown in Example 6-3 on page 6-13 and Example 6-4 on page 6-14. The difference between these two routines is that Example 6-3 actually requests the entire line of data from external memory, while Example 6-4 uses the line-allocate operation to lock the tag into the cache. No external memory request is made, which means software can map any unallocated area of memory as data RAM. However, the line-allocate operation does validate the target address with the MMU, so system software must ensure that the memory has a valid descriptor in the page table. Another item to note in Example 6-4 is that the 32 bytes of data located in a newly allocated line in the cache must be initialized by software before they can be read. The line-allocate operation does not initialize the 32 bytes, and therefore reading from that line produces unpredictable results. In both examples, the code drains the pe
236. ibutes. If the swap operation is directed to external memory, the BCU performs a locked set of memory operations (see Chapter 11, Bus Controller). 6.3 Data Cache and Mini-Data Cache Control. 6.3.1 Data Memory State After Reset: After processor reset, both the data cache and mini-data cache are disabled, all valid bits are set to zero (invalid), and the round-robin bit points to way 31. Any lines in the data cache that were configured as data RAM before reset are changed back to cacheable lines after reset, i.e., there are 32 KBytes of data cache and zero bytes of data RAM. 6.3.2 Enabling/Disabling: The data cache and mini-data cache are enabled by setting bit 2 in coprocessor 15, register 1 (Control Register). See Chapter 7, Configuration, for a description of this register and others. Example 6-1 shows code that enables the data and mini-data caches. Note that the MMU must be enabled to use the data cache. Example 6-1. Enabling the Data Cache:
    enableDCache:
        MCR p15, 0, r0, c7, c10, 4   ; Drain pending data operations
                                     ; (see Chapter 7.2.8, Register 7: Cache functions)
        MRC p15, 0, r0, c1, c0, 0    ; Get current control register
        ORR r0, r0, #4               ; Enable DCache by setting 'C' (bit 2)
        MCR p15, 0, r0, c1, c0, 0    ; And update the Control register
6.3.3 Invalidate & Clean Operations: Individual en
237. idates the BTB and the code that enables/disables it. 5.2.2 Invalidation: There are four ways the contents of the BTB can be invalidated. 1. Reset. 2. Software can directly invalidate the BTB via a CP15, register 7 function. Refer to Section 7.2.8, Register 7: Cache Functions, on page 7-11. 3. The BTB is invalidated when the Process ID Register is written. 4. The BTB is invalidated when the instruction cache is invalidated via CP15, register 7 functions. 6 Data Cache: The Intel 80200 processor based on Intel XScale microarchitecture (compliant with the ARM Architecture V5TE) data cache enhances performance by reducing the number of data accesses to and from external memory. There are two data cache structures in the Intel 80200 processor: a 32 Kbyte data cache and a 2 Kbyte mini-data cache. An eight-entry write buffer and a four-entry fill buffer are also implemented to decouple the Intel 80200 processor instruction execution from external memory accesses, which increases overall system performance. 6.1 Overviews. 6.1.1 Data Cache Overview: The data cache is a 32 Kbyte, 32-way set associative cache; this means there are 32 sets, with each set containing 32 ways. Each way of a set contains 32 bytes (one cache line) and one valid bit. There also exist two dirty bits for every line: one for the lower 16 bytes and the other for the upper 16 bytes. When a store hits the cache, the
238. ile the TAP controller is in this state, all of the test data register's shift-register bit positions selected by the current instruction retain their previous values. The instruction does not change while the TAP controller is in this state. When the TAP controller is in this state and TMS is held high on the rising edge of TCK, the controller enters the Select-DR-Scan state. If TMS is held low on the rising edge of TCK, the controller enters the Run-Test/Idle state. Select-IR-Scan State: This is a temporary controller state. The test data registers selected by the current instruction retain their previous state. In this state, if TMS is held low on the rising edge of TCK, the controller moves into the Capture-IR state and a scan sequence for the instruction register is initiated. If TMS is held high on the rising edge of TCK, the controller moves to the Test-Logic-Reset state. The instruction does not change in this state. Capture-IR State: When the controller is in the Capture-IR state, the shift register contained in the instruction register loads the fixed value 0001 on the rising edge of TCK. The test data register selected by the current instruction retains its previous value during this state. The instruction does not change in this state. While in this state, holding TMS high on the rising edge of TCK causes the controller to enter the Exit1-IR state. If TMS is held low on the rising edge of TCK, the controller enters the Shift-IR sta
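The transitions described state by state above are those of the standard IEEE 1149.1 TAP controller, which can be captured as a lookup table. The sketch below encodes the standard 16-state machine; the enum and function names are hypothetical, and the table is taken from the JTAG standard rather than from this excerpt, which covers only a few of the states.

```c
#include <assert.h>

typedef enum {
    TLR, RTI,
    SEL_DR, CAP_DR, SHIFT_DR, EXIT1_DR, PAUSE_DR, EXIT2_DR, UPD_DR,
    SEL_IR, CAP_IR, SHIFT_IR, EXIT1_IR, PAUSE_IR, EXIT2_IR, UPD_IR,
    TAP_STATES
} tap_state;

/* next_tab[state][tms]: state entered on the rising edge of TCK. */
static const tap_state next_tab[TAP_STATES][2] = {
    [TLR]      = { RTI,      TLR      },
    [RTI]      = { RTI,      SEL_DR   },
    [SEL_DR]   = { CAP_DR,   SEL_IR   },
    [CAP_DR]   = { SHIFT_DR, EXIT1_DR },
    [SHIFT_DR] = { SHIFT_DR, EXIT1_DR },
    [EXIT1_DR] = { PAUSE_DR, UPD_DR   },
    [PAUSE_DR] = { PAUSE_DR, EXIT2_DR },
    [EXIT2_DR] = { SHIFT_DR, UPD_DR   },
    [UPD_DR]   = { RTI,      SEL_DR   },
    [SEL_IR]   = { CAP_IR,   TLR      },
    [CAP_IR]   = { SHIFT_IR, EXIT1_IR },
    [SHIFT_IR] = { SHIFT_IR, EXIT1_IR },
    [EXIT1_IR] = { PAUSE_IR, UPD_IR   },
    [PAUSE_IR] = { PAUSE_IR, EXIT2_IR },
    [EXIT2_IR] = { SHIFT_IR, UPD_IR   },
    [UPD_IR]   = { RTI,      SEL_DR   },
};

tap_state tap_next(tap_state s, int tms)
{
    assert((int)s >= 0 && (int)s < TAP_STATES);
    return next_tab[s][tms ? 1 : 0];
}
```

A table like this lets host software compute the TMS sequence needed to reach a given state. One property worth noting is that holding TMS high for five TCK cycles reaches Test-Logic-Reset from any state, which is the usual way to resynchronize with a TAP whose state is unknown.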
239. imize cache pollution, data structures should be aligned on 32-byte boundaries and sized to multiple cache line sizes. Aligning data structures on cache address boundaries simplifies later addition of prefetch instructions to optimize performance. Not aligning data on cache lines has the disadvantage of moving the prefetch address correspondingly to the misalignment. Consider the following example:
    struct {
        long ia;
        long ib;
        long ic;
        long id;
    } tdata[IMAX];

    for (i = 0; i < IMAX; i++)
    {
        PREFETCH(tdata[i+1]);
        tdata[i].ia = tdata[i].ib + tdata[i].ic - tdata[i].id;
        ....
        tdata[i].id = 0;
    }
In this case, if tdata[] is not aligned to a cache line, then the prefetch using the address of tdata[i+1].ia may not include element id. If the array was aligned on a cache line + 12 bytes, then the prefetch would have to be placed on &tdata[i+1].id. If the structure is not sized to a multiple of the cache line size, then the prefetch address must be advanced appropriately and requires extra prefetch instructions. Consider the following example:
    struct {
        long ia;
        long ib;
        long ic;
        long id;
        long ie;
    } tdata[IMAX];

    ADDRESS preadd = tdata;
    for (i = 0; i < IMAX; i++)
    {
        PREFETCH(preadd += 16);
        tdata[i].ia = tdata[i].ib + tdata[i].ic - tdata[i].id + tdata[i].ie;
        ....
        tdata[i].ie = 0;
    }
In this case, the prefetch address was advanced by the size of half a cache line, and every other prefetch instruction is ignored. Further, an additional
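The alignment and sizing advice can be enforced at compile time in C11. The sketch below is one way to do it, not the manual's method: it uses the standard `alignas` and `_Static_assert` facilities, fixed-width 32-bit fields in place of `long` so the element size is 16 bytes on any target, and a hypothetical helper that checks whether an object stays within one 32-byte line.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 32u
#define IMAX 64

struct tdata_elem { int32_t ia, ib, ic, id; };   /* 16 bytes */

/* Elements must tile a cache line exactly, or some will straddle two lines. */
_Static_assert(CACHE_LINE % sizeof(struct tdata_elem) == 0,
               "element size must divide the cache line size");

/* Force the array to start on a cache-line boundary. */
alignas(32) struct tdata_elem tdata[IMAX];

/* Returns 1 when the object at p occupies a single cache line. */
int within_one_line(const void *p, size_t size)
{
    assert(size > 0);
    uintptr_t first = (uintptr_t)p / CACHE_LINE;
    uintptr_t last = ((uintptr_t)p + size - 1u) / CACHE_LINE;
    return first == last;
}
```

With both conditions met, a prefetch of any field of tdata[i+1] brings in the whole element, so a single prefetch per element (or per pair of elements here, since two fit in a line) is enough.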
240. in chipset cost. [Figure 10-1. Typical System: block diagram showing the Intel 80200 processor based on Intel XScale microarchitecture connected by address/control and data lines to a chipset with SDRAM controller and PCI, along with ROM/Flash and SDRAM.] An alternate configuration, with a separate memory bus, is also possible (shown in Figure 10-2). All signals on this bus, data and request, are sampled on the rising edge of MCLK. MCLK is created by the system and is an input to the Intel 80200 processor. MCLK is asynchronous with respect to the Intel 80200 processor core frequency and any other Intel 80200 processor input clocks. MCLK in the configuration shown in Figure 10-1 would also need to be the SDRAM clock. MCLK frequencies of up to 100 MHz are supported. A 50% duty cycle is required. However, MCLK must be one-third or less of the internal clock frequency of the core. For example, an Intel 80200 processor system running the core at 200 MHz would be allowed a maximum MCLK of 66 MHz. This constraint comes from the design of the low-latency synchronization logic in the Intel 80200 processor bus controller. Figure 10-2. Alternate Configuration: Intel 80200 Processor based on Intel XScale Microarchitecture
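The two MCLK limits above (at most 100 MHz, and at most one-third of the core clock) combine into a single bound. A minimal sketch with a hypothetical function name, working in whole MHz and rounding down as the manual's 200 MHz example does:

```c
#include <assert.h>

/* Highest permitted MCLK, in MHz, for a given core clock in MHz. */
unsigned max_mclk_mhz(unsigned core_mhz)
{
    assert(core_mhz > 0);
    unsigned limit = core_mhz / 3u;          /* one-third of the core clock */
    return limit < 100u ? limit : 100u;      /* capped at 100 MHz           */
}
```

The one-third rule is the binding constraint below a 300 MHz core clock; at 300 MHz and above, the 100 MHz cap takes over.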
241. in lock mode

        CPWAIT
        MOV     R0, #16
    LOOP1:
        ALLOCATE R1             ; Allocate and lock a tag into the data cache at address [R1]
                                ; Initialize 32 bytes of the newly allocated line
        DRAIN
        STRD    R4, [R1], #8
        STRD    R4, [R1], #8
        STRD    R4, [R1], #8
        STRD    R4, [R1], #8
        SUBS    R0, R0, #1      ; Decrement loop count
        BNE     LOOP1
                                ; Turn off data cache locking
        DRAIN                   ; Finish all pending operations
        MOV     R2, #0x0
        MCR     P15, 0, R2, C9, C2, 0   ; Take the data cache out of lock mode
        CPWAIT

Tags can be locked into the data cache by enabling the data cache lock mode bit located in coprocessor 15, register 9. (See Table 7-14, "Cache Lockdown Functions" on page 7-14 for the exact command.) Once enabled, any new lines allocated into the data cache are locked down. Note that the PLD instruction does not affect the cache contents if it encounters an error while executing. For this reason, system software should ensure the memory address used in the PLD is correct. If this cannot be ascertained, replace the PLD with an LDR instruction that targets a scratch register. Lines are locked into a set starting at way 0 and may progress up to way 27; which set a line gets locked into depends on the set index of the virtual address of the request. Figure 6-3, "Locked Line Effect on Round Robin Replacement" on page 6-15 is an example of where lines of code may b
242. in the table are reserved.

2.3.4.4 Data Aborts

Two types of data aborts exist in the Intel 80200 processor: precise and imprecise. A precise data abort is defined as one where R14_ABORT always contains the PC (+8) of the instruction that caused the exception. An imprecise abort is one where R14_ABORT contains the PC (+4) of the next instruction to execute and not the address of the instruction that caused the abort. In other words, instruction execution has advanced beyond the instruction that caused the data abort. On the Intel 80200 processor, precise data aborts are recoverable and imprecise data aborts are not recoverable.

Precise Data Aborts

A lock abort is a precise data abort; the extended Status field of the Fault Status Register is set to 0b10100. This abort occurs when a lock operation directed to the MMU (instruction or data) or instruction cache causes an exception, due to either a translation fault, access permission fault, or external bus fault. The Fault Address Register is undefined and R14_ABORT is the address of the aborted instruction + 8.

A data MMU abort is precise. These are due to an alignment fault, translation fault, domain fault, permission fault, or external data abort on an MMU translation. The status field is set to a predetermined ARM definition, which is shown in
243. ing six times as fast as the memory transfer bus. Further, the example values presented here apply to the current processor implementation (TBD processor name) and are different for future implementations.

N_cwf is the number of core cycles required to transfer the first critical word of a prefetch or load operation:

    N_cwf = N_lookup + N_cwfxfer

Where:

N_lookup: This is the number of core clocks required for the processor to issue a memory transfer request to the SDRAM, plus the time the SDRAM requires to locate the data:

    N_lookup = N_processor + N_memwait + N_mempagewait

The Intel 80200 processor needs seven bus clocks to process a memory request to the SDRAM (N_processor). Typical SDRAM needs 2 to 3 bus clocks to select the memory locations, provided that the current SDRAM memory page is selected (N_memwait). If the current SDRAM memory page is not selected, then an additional 3 to 4 bus cycles are required to look up the memory data locations (N_mempagewait). Thus, the lookup time can range from 9 to 14 bus clock cycles. Translating this to core cycles at a ratio of six to one means between 54 and 84 core clocks.

N_cwfxfer: This is the number of core clocks required to transfer the first critical word of a cache line fill operation. It takes one bus clock to transfer the first word if the data is in the lower word address of
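The lookup-time arithmetic above (seven bus clocks in the processor, 2 to 3 for the SDRAM, plus 3 to 4 more on a page miss) can be reproduced with a short calculation. This is a minimal sketch; the type clock_range and the function names are hypothetical, and the six-to-one core-to-bus ratio is passed in as a parameter because the text says the example values are implementation specific.

```c
#include <assert.h>

/* Hypothetical type: inclusive range of clock counts. */
struct clock_range {
    unsigned min;
    unsigned max;
};

/* Lookup time in bus clocks, using the figures quoted in the text:
 * 7 (processor) + 2..3 (SDRAM select) + 3..4 extra on a page miss. */
struct clock_range lookup_bus_clocks(int page_hit)
{
    const unsigned n_processor = 7u;
    struct clock_range r = { n_processor + 2u, n_processor + 3u };
    if (!page_hit) {
        r.min += 3u;
        r.max += 4u;
    }
    return r;
}

/* Convert bus clocks to core clocks for a given core:bus ratio. */
unsigned to_core_clocks(unsigned bus_clocks, unsigned ratio)
{
    assert(ratio > 0);
    return bus_clocks * ratio;
}
```

The best case (page hit, 9 bus clocks) and worst case (page miss, 14 bus clocks) convert to 54 and 84 core clocks at a ratio of six, which agrees with the range given in the text.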
244. inimum Result Latency

    LDM     3 - 23 (1)      1 - 3 for load data; 1 for writeback of base
    STM     3 - 18 (1)      1 for writeback of base

(1) LDM issue latency is 7 + N if R15 is in the register list and 2 + N if it is not. STM issue latency is calculated as 2 + N. N is the number of registers to load or store.

14.4.8 Semaphore Instructions

Table 14-14. Semaphore Instruction Timings

    Mnemonic    Minimum Issue Latency    Minimum Result Latency
    SWP         5                        5
    SWPB        5                        5

14.4.9 Coprocessor Instructions

Table 14-15. CP15 Register Access Instruction Timings

    Mnemonic    Minimum Issue Latency    Minimum Result Latency
    MRC         4                        4
    MCR         2                        N/A

Table 14-16. CP14 Register Access Instruction Timings

    Mnemonic    Minimum Issue Latency    Minimum Result Latency
    MRC         7                        7
    MCR         7                        N/A
    LDC         10                       N/A
    STC         7                        N/A

14.4.10 Miscellaneous Instruction Timing

Table 14-17. SWI Instruction Timings

    Mnemonic    Minimum latency to first instruction of SWI exception handler
    SWI

Table 14-18. Count Leading Zeros Instruction Timings

    Mnemonic    Minimum Issue Latency    Minimum Result Latency
    CLZ         1                        1

14.4.11 Thumb Instructions

The timing of Thumb instructions is the same as that of their equivalent ARM instructions. This mapping can be found in the ARM Architecture Reference Manual. The only excepti
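The LDM/STM footnote can be turned into a small calculation. The sketch below assumes the flattened footnote reads "7 + N" and "2 + N" (the plus signs are missing in this extraction); that reading is consistent with the 3 to 23 and 3 to 18 ranges in the table for N between 1 and 16. The function names are hypothetical.

```c
#include <assert.h>

/* Minimum LDM issue latency, assuming the footnote means 7 + N when
 * R15 is in the register list and 2 + N otherwise. */
unsigned ldm_issue_latency(unsigned n_regs, int loads_pc)
{
    assert(n_regs >= 1 && n_regs <= 16);
    return (loads_pc ? 7u : 2u) + n_regs;
}

/* Minimum STM issue latency, assuming the footnote means 2 + N. */
unsigned stm_issue_latency(unsigned n_regs)
{
    assert(n_regs >= 1 && n_regs <= 16);
    return 2u + n_regs;
}
```

Under these assumptions a single-register LDM without R15 costs 3 cycles and a 16-register LDM that includes R15 costs 23, which are the two ends of the range listed for LDM.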
245. instruction cache is always disabled, unlocked, and invalidated (flushed).

4.3.2 Enabling/Disabling

The instruction cache is enabled by setting bit 12 in coprocessor 15, register 1 (Control Register). This process is illustrated in Example 4-2, Enabling the Instruction Cache.

Example 4-2. Enabling the Instruction Cache

    ; Enable the ICache
    MRC     P15, 0, R0, C1, C0, 0   ; Get the control register
    ORR     R0, R0, #0x1000         ; Set bit 12 (the I bit)
    MCR     P15, 0, R0, C1, C0, 0   ; Set the control register
    CPWAIT

4.3.3 Invalidating the Instruction Cache

The entire instruction cache, along with the fetch buffers, is invalidated by writing to coprocessor 15, register 7. (See Table 7-12, "Cache Functions" on page 7-11 for the exact command.) This command does not unlock any lines that were locked in the instruction cache, nor does it invalidate those locked lines. To invalidate the entire cache including locked lines, the unlock instruction cache command needs to be executed before the invalidate command. This unlock command can also be found in Table 7-14, "Cache Lockdown Functions" on page 7-14. There is an inherent delay from the execution of the instruction cache invalidate command to where the next instruction sees the result of the invalidate. The following routine can be used to guarantee proper synchroniza
246. ion breakpoint address registers (IBCR0 and IBCR1), one data breakpoint address register (DBR0), one configurable data mask/address register (DBR1), and one data breakpoint control register (DBCON). The Intel 80200 processor also supports a 256-entry trace buffer that records program execution information. The registers to control the trace buffer are located in CP14. Refer to Chapter 13, "Software Debug" for more information on these features of the Intel 80200 processor.

Table 7-19. Accessing the Debug Registers (columns: Function, opcode_2, CRm, Instruction) [table body illegible in source]

7.2.15 Register 15: Coprocessor Access Register

This register is selected when opcode_2 = 0 and CRm = 1. This register controls access rights to all the coprocessors in the system except for CP15 and CP14. Both CP15 and CP14 can only be accessed in privileged mode. This register is accessed with an MCR or MRC with the CRm field set to 1. This register controls access to CP0 and CP13 for the Intel 80200 processor. A typical use for this register is for an operating system to control resource sharing among applications. Initially, all applications are denied access to shared resources by clearing the appropriate coprocessor bit in the
247. is checked to see if an outstanding fill request already exists for that line. If so, the current request is placed in the pending buffer and waits until the previously requested fill completes, after which it writes its data into the recently allocated cache line. If there is no outstanding fill request for that line, the current store request is placed in the fill buffer and a 32-byte external memory read request is made. If the pending buffer or fill buffer is full, the Intel 80200 processor stalls until an entry is available.

2. The 32 bytes of data can be returned to the Intel 80200 processor in any word order, i.e., the eight words in the line can be returned in any order. Note that the order in which the data is returned does not matter for performance, since the store operation has to wait until the entire line is written into the cache before it can complete.

3. When the entire 32-byte line has returned from external memory, a line is allocated in the cache, selected by the round-robin pointer (see Section 6.2.4, "Round-Robin Replacement Algorithm" on page 6-8). The line to be written into the cache may replace a valid line previously allocated in the cache. In this case both dirty bits are examined and, if any are set, the four words associated with a dirty bit that is asserted are written back to external memory as a 4-word burst operation. This write operation is placed in the write buffer.

4. The
248. ister (TX)

The TX register is the debug handler transmit buffer. The debug handler sends data to the debugger through this register.

TX Register (bits 31:0: TX; reset value: unpredictable)

    Bits    Access                              Description
    31:0    SW Write-only / JTAG Read-only      Debug handler writes data to send to debugger

Since the TX register is accessed by the debug handler (using MCR/MRC) and the debugger (through JTAG), handshaking is required to prevent the debug handler from writing new data before the debugger reads the previous data. The TX register handshaking is described in Table 13-9, "TX Handshaking" on page 13-15.

Receive Register (RX)

The RX register is the receive buffer used by the debug handler to get data sent by the debugger through the JTAG interface.

RX Register (bits 31:0: RX; reset value: unpredictable)

    Bits    Access                              Description
    31:0    SW Read-only / JTAG Write-only      Software reads to receive data/commands from debugger

Since the RX register is accessed by the debug handler (using MRC) and the debugger (through JTAG), handshaking is required to prevent the debugger from writing new data to the register before the debug handler reads the previous data out. The handshaking is described in Section 13.8.1, "RX Register Ready Bit (RR)".
249. ister as its active TDI to TDO path.

Boundary-Scan Register: The Boundary-Scan register is a required set of serial-shiftable register cells

C.2.5 TAP Controller

The TAP controller is a 16-state synchronous finite state machine that controls the sequence of test logic operations. The TAP can be controlled via a bus master. The bus master can be either automatic test equipment or a component (i.e., PLD) that interfaces to the Test Access Port (TAP). The TAP controller changes state only in response to a rising edge of TCK or power-up. The value of the test mode select (TMS) input signal at a rising edge of TCK controls the sequence of state changes. The TAP controller is automatically initialized on power-up. In addition, the TAP controller can be initialized by applying a high signal level on the TMS input for five TCK periods. Behavior of the TAP controller and other test logic in each controller state is described in the following subsections. For greater detail on the state machine and the public instructions, refer to the IEEE 1149.1 Standard Test Access Port and Boundary-Scan Architecture document.

[Figure C-2. TAP Controller State Diagram; recoverable state names: TEST-LOGIC-RESET, RUN-TEST/IDLE, SELECT-DR-SCAN, SELECT-IR-SCAN, CAPTURE-DR, EXIT1-DR, EXIT2-DR]
250. ister and memory. These instructions do not allow the programmer to specify values for opcode_1, opcode_2, or Rm; those fields implicitly contain zero.

Table 7-2. LDC/STC Format

    Bits        Description                 Notes
    31:28       cond                        ARM condition codes
    24, 23, 21  P, U, W                     Specifies 1 of 3 addressing modes identified by addressing mode 5 in the ARM Architecture Reference Manual
    22          N                           Should be 0 for Intel 80200 processors. Setting this bit to 1 has an undefined effect.
    20          L (Load or Store)           0 = STC; 1 = LDC
    19:16       Rn                          Specifies the base register
    15:12       CRd                         Specifies the coprocessor register
    11:8        cp_num                      Coprocessor number: 0b1111 = Undefined Exception; 0b1110 = CP14; 0b1101 = CP13
    7:0         8-bit word offset

7.2 CP15 Registers

Table 7-3 lists the CP15 registers implemented in the Intel 80200 processor.

Table 7-3. CP15 Registers

    Register (CRn)  Opcode_2    Access                  Description
    0               0           Read / Write-Ignored    ID
    0               1           Read / Write-Ignored    Cache Type
    1               0           Read / Write            Control
    1               1           Read / Write            Auxiliary Control
    2               0           Read / Write            Translation Table Base
    3               0           Read / Write            Domain Access Control
    4                           Unpredictable           Reserved
    5               0
251. it is reported. Software can handle each abort separately until the instruction successfully executes. If the reported data abort is imprecise, software needs to check the SPSR to see if the previous context was executing in abort mode. If this is the case, the link back to the current process has been lost and the data abort is unrecoverable.

2.3.4.6 Events from Preload Instructions

A PLD instruction never causes the Data MMU to fault for any of the following reasons:

    Domain Fault
    Permission Fault
    Translation Fault

If execution of the PLD would cause one of the above faults, then the PLD causes no effect. This feature allows software to issue PLDs speculatively. For example, Example 2-3 on page 2-16 places a PLD instruction early in the loop. This PLD is used to fetch data for the next loop iteration. In this example, the list is terminated with a node that has a null pointer. When execution reaches the end of the list, the PLD on address 0x0 does not cause a fault. Rather, it is ignored and the loop terminates normally.

Example 2-3. Speculatively issuing PLD

    ; R0 points to a node in a linked list. A node has the following layout:
    ; Offset  Contents
    ; 0       data
    ; 4       pointer to next node
    ; This code computes the sum of all nodes in a list. The sum is placed into R9.
        MOV     R9, #0          ; Clear accumulator
    sumList:
        LDR     R1, [R0, #4]    ; R1 gets pointer to next node
        LDR     R3, [R0]        ; R3 gets data from current node
        PLD     [R1]            ; S
252. it through JTAG. The Global Enable bit does not affect the reset vector trap. A reset vector trap can be set up before or during a processor reset. When processor reset is de-asserted, a debug exception occurs before the instruction in the reset vector executes.

Sticky Abort Bit (SA)

The Sticky Abort bit is only valid in Halt mode. It indicates a data abort occurred within the Special Debug State (see Section 13.5.1, "Halt Mode"). Since Special Debug State disables all exceptions, a data abort exception does not occur. However, the processor sets the Sticky Abort bit to indicate a data abort was detected. The debugger can use this bit to determine if a data abort was detected during the Special Debug State. The Sticky Abort bit must be cleared by the debug handler before exiting the debug handler.

Method of Entry Bits (MOE)

The Method of Entry bits specify the cause of the most recent debug exception. When multiple exceptions occur in parallel, the processor places the highest priority exception (based on the priorities in Table 13-2) in the MOE field.

Trace Buffer Mode Bit (M)

The Trace Buffer Mode bit selects one of two trace buffer modes:

    Wrap-around mode: The trace buffer fills up and wraps around until a debug exception occurs.
    Fill-once mode: The trace buffer automatically generates a debug exception (trace buffer full break) when it becomes full.

Trace Buffer Enable Bit (E)

The Trace Buffer Enable bit enables and disables the trac
253. [Table of contents excerpt; page numbers and dot leaders removed]

    Receive Register
    Debug JTAG Access
    13.11.1 SELDCSR JTAG Command
    13.11.2 SELDCSR JTAG Register
    13.11.2.1 DBG.HLD_RST
    13.11.2.2 DBG.BRK
    13.11.2.3 DBG.DCSR
    13.11.3 DBGTX JTAG Command
    13.11.4 DBGTX JTAG Register
    13.11.5 DBGRX JTAG Command
    13.11.6 [title illegible in source]
    13.11.6.1 RX Write Logic
    13.11.6.2 DBGRX Data Register
    13.11.6.3 [title illegible in source]
    13.11.6.4 [title illegible in source]
    13.11.6.5 DBG.RX
    13.11.6.6 DBG.D
    13.11.6.7 DBG.FLUSH
    13.11.7 Debug JTAG Data Register Reset Values
    [title illegible in source]
254. B.4.4.10 Pointer Prefetch

Not all looping constructs contain induction variables. However, prefetching techniques can still be applied. Consider the following linked list traversal example:

    while (p) {
        do_something(p->data);
        p = p->next;
    }

The pointer variable p becomes a pseudo induction variable, and the data pointed to by p->next can be prefetched to reduce data transfer latency for the next iteration of the loop. Linked lists should be converted to arrays as much as possible.

    while (p) {
        prefetch(p->next);
        do_something(p->data);
        p = p->next;
    }

Recursive data structure traversal is another construct where prefetching can be applied. This is similar to linked list traversal. Consider the following pre-order traversal of a binary tree:

    preorder(treeNode *t) {
        if (t) {
            process(t->data);
            preorder(t->left);
            preorder(t->right);
        }
    }

The pointer variable t becomes the pseudo induction variable in a recursive loop. The data structures pointed to by the values t->left and t->right can be prefetched for the next iteration of the loop.

    preorder(treeNode *t) {
        if (t) {
            prefetch(t->right);
            prefetch(t->left);
            process(t->data);
            preorder(t->left);
            preorder(t->right);
        }
    }

Note the order reversal of the prefetches in relationship to the usage. If there is a cache conflict and data is evicted from the cache, then only the data from the fir
255. its set to 0

    Bits    Access                              Description
    31:14   Read-Unpredictable / Write-as-Zero  Reserved
    13      Read / Write                        Exception Vector Relocation (V): 0 = Base address of exception vectors is 0x0000,0000; 1 = Base address of exception vectors is 0xFFFF,0000
    12      Read / Write                        Instruction Cache Enable/Disable (I): 0 = Disabled; 1 = Enabled
    11      Read / Write                        Branch Target Buffer Enable (Z): 0 = Disabled; 1 = Enabled
    10      Read-as-Zero / Write-as-Zero        Reserved
    9       Read / Write                        ROM Protection (R): This selects the access checks performed by the memory management unit. See the ARM Architecture Reference Manual for more information.
    8       Read / Write                        System Protection (S): This selects the access checks performed by the memory management unit. See the ARM Architecture Reference Manual for more information.
    7       Read / Write                        Big/Little Endian (B): 0 = Little-endian operation; 1 = Big-endian operation
    6:3     Read-as-One / Write-as-One          0b1111
    2       Read / Write                        Data cache enable/disable (C): 0 = Disabled; 1 = Enabled
    1       Read / Write                        Alignment fault enable/disable (A): 0 = Disabled; 1 = Enabled
    0       Read / Write                        Memory management unit enable/disable (M): 0 = Disabled; 1 = Enabled

Table 7-7. The mini-data cache attribute bits in the Intel
256. l Buffer Overview

The Intel 80200 processor employs an eight-entry write buffer, each entry containing 16 bytes. Stores to external memory are first placed in the write buffer and subsequently taken out when the bus is available. The write buffer supports the coalescing of multiple store requests to external memory. An incoming store may coalesce with any of the eight entries.

The fill buffer holds the external memory request information for a data cache or mini-data cache fill or non-cacheable read request. Up to four 32-byte read request operations can be outstanding in the fill buffer before the Intel 80200 processor needs to stall.

The fill buffer has been augmented with a four-entry pend buffer that captures data memory requests to outstanding fill operations. Each entry in the pend buffer contains enough data storage to hold one 32-bit word, specifically for store operations. Cacheable load or store operations that hit an entry in the fill buffer get placed in the pend buffer and are completed when the associated fill completes. Any entry in the pend buffer can be pended against any of the entries in the fill buffer; multiple entries in the pend buffer can be pended against a single entry in the fill buffer. Pended operations complete in program order.

6.2 Data Cache and Mini
257. 13.12 Trace Buffer

The 256-entry trace buffer provides the ability to capture control flow information to be used for debugging an application. Two modes are supported:

1. The buffer fills up completely and generates a debug exception. Then SW empties the buffer.
2. The buffer fills up and wraps around until it is disabled. Then SW empties the buffer.

13.12.1 Trace Buffer CP Registers

CP14 defines three registers (see Table 13-14) for use with the trace buffer. These CP14 registers are accessible using MRC, MCR, LDC, and STC (CDP to any CP14 register causes an undefined instruction trap). The CRn field specifies the number of the register to access. The CRm, opcode_1, and opcode_2 fields are not used and should be set to 0.

Table 13-14. CP14 Trace Buffer Register Summary

    CP14 Register Number    Register Name
    11                      Trace Buffer Register (TBREG)
    12                      Checkpoint 0 Register (CHKPT0)
    13                      Checkpoint 1 Register (CHKPT1)

Any access to the trace buffer registers in User mode causes an undefined instruction exception. Specifying registers which do not exist has unpredictable results.

13.12.1.1 Checkpoint Registers

When the debugger reconstructs a trace history, it is required to start at the oldest trace buffer entry and construct a trace going forward. In fill-once mode and w
258. l the end of the eight-word aligned block, then returning the lowest word (or pair of words) and incrementing back up to the word (or pair of words) just below the word (or pair of words) pointed to by the request address. The allowable orders are spelled out below. In this description, A, B, C, and D refer to pairs of 32-bit words in the aligned 8-word block: A = (0,1), B = (2,3), C = (4,5), D = (6,7). In the 32-bit bus section, a through h refer to the eight words in the aligned 8-word block. Note that the Intel 80200 processor will align all 8-word burst transactions on a 64-bit address. Specifically, if the instruction or data address requested by the program is odd-word aligned, the Intel 80200 processor will emit a 64-bit aligned address by setting A[2] = 0x0.

Return Order for 8-Word Burst, 64-bit Data Bus

    CWF     A[4:3]      Return Order
    0       XX          ABCD
    1       00          ABCD
    1       01          BCDA
    1       10          CDAB
    1       11          DABC

Return Order for 8-Word Burst, 32-bit Data Bus (1)

    CWF     A[4:2]      Return Order
    0       XXX         abcdefgh
    1       000         abcdefgh
    1       010         cdefghab
    1       100         efghabcd
    1       110         ghabcdef

(1) The Intel 80200 processor will never generate an 8-word burst with A[2] = 0x1. The addresses for these requests will be 64-bit aligned.

The timing of write transactions on the data bus is similar to the timing of reads. After a write request has gone out, two cycles before the chipset or memory is ready to receive the data from the Intel 80200 processor, the DValid is a
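The 64-bit return-order table above follows a simple wrap-around rule: with critical word first (CWF) set, the burst starts at the pair selected by address bits 4:3 and wraps within the aligned block; with CWF clear it always starts at pair A. The sketch below expresses that rule in C. The function name and parameter names are hypothetical, and the code only models the ordering in the table, not bus timing.

```c
#include <assert.h>

/* Hypothetical model of the 64-bit data bus return order.
 * cwf: nonzero if critical-word-first is requested.
 * a43: address bits [4:3] of the request (0..3).
 * pos: position within the burst (0..3).
 * Returns the pair label ('A'..'D') delivered at that position. */
char burst_pair_64(int cwf, unsigned a43, unsigned pos)
{
    assert(a43 < 4u && pos < 4u);
    unsigned start = cwf ? a43 : 0u;          /* CWF = 0 always starts at A */
    return (char)('A' + ((start + pos) & 3u)); /* wrap within the 4 pairs */
}
```

For example, a CWF request with address bits 4:3 equal to 0b10 delivers C, D, A, B in that order, which corresponds to the CDAB row of the table.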
259. 1.2 Terminology and Conventions

1.2.1 Number Representation

All numbers in this document can be assumed to be base 10 unless designated otherwise. In text and pseudo code descriptions, hexadecimal numbers have a prefix of 0x and binary numbers have a prefix of 0b. For example, 107 would be represented as 0x6B in hexadecimal and 0b1101011 in binary.

1.2.2 Terminology and Acronyms

ASSP: Application Specific Standard Product.

Assert: This term refers to the logically active value of a signal or bit.

BTB: Branch Target Buffer.

Clean: A clean operation updates external memory with the contents of the specified line in the data/mini-data cache if any of the dirty bits are set and the line is valid. There are two dirty bits associated with each line in the cache, so only the portion that is dirty gets written back to external memory. After this operation, the line is still valid and both dirty bits are deasserted.

Coalescing: Coalescing means bringing together a new store operation with an existing store operation already resident in the write buffer. The new store is placed in the same write buffer entry as an existing store when the address of the new store falls in the 4-word aligned address of the existing entry. This includes, in PCI terminology, write merging, write collapsing, and write combining.

Deassert: This term refers to the logically inactive value of a signal or bit.

Flush: A flush operation invalidate
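The address condition in the Coalescing definition (the new store falls in the same 4-word aligned block as an existing write buffer entry) amounts to comparing addresses with the low four bits masked off. The following is a minimal sketch of that comparison only; the function names are hypothetical, and any further conditions the hardware may apply before coalescing are not modeled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical check: do two byte addresses fall in the same
 * 4-word (16-byte) aligned block? */
int same_4word_block(uint32_t addr_a, uint32_t addr_b)
{
    const uint32_t block_mask = ~(uint32_t)0xF; /* clear the low 4 bits */
    return (addr_a & block_mask) == (addr_b & block_mask);
}

/* Small self-check that also illustrates the 0x notation from the text. */
void same_4word_block_self_check(void)
{
    assert(107 == 0x6B);                         /* example from the text */
    assert(same_4word_block(0x1000u, 0x100Cu));  /* same 16-byte block */
    assert(!same_4word_block(0x100Cu, 0x1010u)); /* adjacent blocks */
}
```

A 16-byte block size is used because each word is 32 bits, which also matches the 16-byte write buffer entries described elsewhere in the manual.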
260. line from the instruction cache and modifies the same location in external memory, it needs to invalidate the BTB also. Not invalidating the BTB in this case may cause unpredictable results. Disabling/enabling a cache has no effect on contents of the cache: valid data stays valid, locked items remain locked. All operations defined in Table 7-12 work regardless of whether the cache is enabled or disabled. Since the Clean D Cache Line function reads from the data cache, it is capable of generating a parity fault. The other operations do not generate parity faults.

Table 7-12. Cache Functions

    Function                            opcode_2    CRm         Data        Instruction
    Invalidate I&D cache & BTB          0b000       0b0111      Ignored     MCR p15, 0, Rd, c7, c7, 0
    Invalidate I cache & BTB            0b000       0b0101      Ignored     MCR p15, 0, Rd, c7, c5, 0
    Invalidate I cache line             0b001       0b0101      MVA         MCR p15, 0, Rd, c7, c5, 1
    Invalidate D cache                  0b000       0b0110      Ignored     MCR p15, 0, Rd, c7, c6, 0
    Invalidate D cache line             0b001       0b0110      MVA         MCR p15, 0, Rd, c7, c6, 1
    Clean D cache line                  0b001       0b1010      MVA         MCR p15, 0, Rd, c7, c10, 1
    Drain Write (& Fill) Buffer         0b100       0b1010      Ignored     MCR p15, 0, Rd, c7, c10, 4
    Invalidate Branch Target Buffer     0b110       0b0101      Ignored     MCR p15, 0, Rd, c7, c5, 6
    Allocate Line in the Data Cache     0b101       0b0010      MVA         MCR p15, 0, Rd, c7, c2, 5

The line-allocate command allocates a tag into the data cache specified by bits [31:5] of Rd. If a valid dirty line (with a
261. line is written into the cache along with the data associated with the store operation. If the above condition for requesting a 32-byte cache line is not met, a write miss causes a write request to external memory for the exact data size specified by the store operation, assuming the write request does not coalesce with another write operation in the write buffer.

Write-Back Versus Write-Through

The Intel 80200 processor supports write-back caching or write-through caching, controlled through the MMU page attributes. When write-through caching is specified, all store operations are written to external memory even if the access hits the cache. This feature keeps the external memory coherent with the cache, i.e., no dirty bits are set for this region of memory in the data/mini-data cache. This, however, does not guarantee that the data/mini-data cache is coherent with external memory, which is dependent on the system-level configuration, specifically if the external memory is shared by another master. When write-back caching is specified, a store operation that hits the cache does not generate a write to external memory, thus reducing external memory traffic.

6.2.4 Round-Robin Replacement Algorithm

The line replacement algorithm for the data cache is round-robin. Each set in the data cache h
262. llocate a line in the cache when it is disabled, even if the MMU is enabled and the memory region's cacheability attribute is set.

6.2.3 Cache Policies

6.2.3.1 Cacheability

Data at a specified address is cacheable given the following: the MMU is enabled, the cacheable attribute is set in the descriptor for the accessed address, and the data/mini-data cache is enabled.

6.2.3.2 Read Miss Policy

The following sequence of events occurs when a cacheable (see Section 6.2.3.1, "Cacheability" on page 6-5) load operation misses the cache:

1. The fill buffer is checked to see if an outstanding fill request already exists for that line. If so, the current request is placed in the pending buffer and waits until the previously requested fill completes, after which it accesses the cache again to obtain the requested data and returns it to the destination register. If there is no outstanding fill request for that line, the current load request is placed in the fill buffer and a 32-byte external memory read request is made. If the pending buffer or fill buffer is full, the Intel 80200 processor stalls until an entry is available.

2. A line is allocated in the cache to receive the 32 bytes of fill data. The line selected is determined by the round-robin pointer (see Section 6.2.4, "Round-Robin Replacement Algorithm"
263. C.2.2 TAP Pins

The Intel 80200 processor TAP is composed of four input connections (TMS, TCK, TRST#, and TDI) and one output connection (TDO). These pins are described in Table C-1. The TAP pins provide access to the instruction register and the test data registers.

Table C-1. TAP Controller Pin Definitions

    Pin Name            Mnemonic    Type    Definition
    Test Clock          TCK         Input   Clock input for the TAP controller, the instruction register, and the test data registers. The JTAG unit retains its state when TCK is stopped at 0 or 1.
    Test Mode Select    TMS         Input   Controls the operation of the TAP controller. The TMS input is pulled high when not being driven. TMS is sampled on the rising edge of TCK.
    Test Data In        TDI         Input   Serial data input to the instruction and test data registers. Data at TDI is sampled on the rising edge of TCK. Like TMS, TDI is pulled high when not being driven. Data shifted from TDI through a register to TDO appears non-inverted at TDO.
    Test Data Out       TDO         Output  Used for serial data output. Data at TDO is driven at the falling edge of TCK and provides an inactive (high-Z) state when scanning is not in progress. The non-shift inactive state is provided to support parallel connection of TDO outputs at the board or module level.
    Asynchronous Reset
264. [List of tables excerpt; page numbers and dot leaders removed]

    m Interrupt
    Branch Latency Penalty
    Latency Example
    Branch Instruction Timings (Those predicted by the BTB)
    Branch Instruction Timings (Those not predicted by the BTB)
    Data Processing Instruction Timings
    Multiply Instruction Timings
    Multiply Implicit Accumulate Instruction Timings
    Implicit Accumulator Access Instruction Timings
    Saturated Data Processing Instruction Timings
    Status Register Access Instruction Timings
    Load and Store Instruction Timings
    Load and Store Multiple Instruction Timings
    14-14 Semaphore Instruction Timings
    14-15 CP15 Register Access Instruction Timings
    14-16 CP14 Register Access Instruction Timings
    14-17 SWI Instruction Timings
    14-18 Count Leading Zeros Instruction Timings
    A-1 C and B encoding u
265. me of these instructions based on previous outcomes. Table 14-2 shows the branch latency penalty when these instructions are correctly predicted and when they are not. A penalty of zero for correct prediction means that the Intel 80200 processor can execute the next instruction in the program flow in the cycle following the branch.
Table 14-2. Branch Latency Penalty (core clock cycles: ARM, Thumb; description)
Predicted correctly (ARM 0, Thumb 0): The instruction is in the branch target cache and is correctly predicted.
Mispredicted (ARM 4, Thumb 5): There are three occurrences of branch misprediction, all of which incur a 4-cycle branch delay penalty: 1. The instruction is in the branch target buffer and is predicted not-taken, but is actually taken. 2. The instruction is not in the branch target buffer and is a taken branch. 3. The instruction is in the branch target buffer and is predicted taken, but is actually not-taken.
14.3 Addressing Modes. None of the load and store addressing modes implemented in the Intel 80200 processor add to the instruction latency numbers.
14.4 Instruction Latencies. The latencies for all the instructions are shown in the following sections with respect to their functional groups: branch, data processing, multiply, status register ac
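As a rough illustration of the branch latency table in the entry above, the sketch below estimates the cycles lost to ARM-mode branches. It assumes the 4-cycle misprediction penalty and the zero penalty for correct predictions quoted in that entry; the function name and the sample counts are hypothetical.

```python
# Penalties in core clock cycles for ARM-mode branches, as quoted in the entry above.
PENALTY_PREDICTED = 0
PENALTY_MISPREDICTED = 4


def branch_penalty_cycles(predicted: int, mispredicted: int) -> int:
    """Return the total cycles lost to branch latency for the given branch counts."""
    if predicted < 0 or mispredicted < 0:
        raise ValueError("branch counts must be non-negative")
    return predicted * PENALTY_PREDICTED + mispredicted * PENALTY_MISPREDICTED


# Hypothetical workload: 900 correctly predicted branches, 100 mispredictions.
print(branch_penalty_cycles(900, 100))  # 400
```

Only mispredictions contribute to the total, so a profile that reports misprediction counts is enough to bound the branch overhead.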
266. mory to return. This is calculated by dividing PMN0 by CCNT, which was used to measure total execution time.
12.5.4 Data/Bus Request Buffer Full Mode. The Data Cache has buffers available to service cache misses or uncacheable accesses. For every memory request that the Data Cache receives from the processor core, a buffer is speculatively allocated in case an external memory request is required or temporary storage is needed for an unaligned access. If no buffers are available, the Data Cache stalls the processor core. How often the Data Cache stalls depends on the performance of the bus external to the Intel 80200 processor and on the memory access latency for Data Cache miss requests to external memory. If the Intel 80200 processor memory access latency is high, possibly due to starvation, these Data Cache buffers become full. This performance monitoring mode is provided to see whether the Intel 80200 processor is being starved of the external bus, which affects the performance of the application running on the Intel 80200 processor. PMN0 accumulates the number of clock cycles the processor is stalled due to this condition, and PMN1 monitors the number of times this condition occurs. Statistics derived from these two events: The average number of cycles the process
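The statistics for this monitoring mode can be sketched as follows. The entry above is cut off before it gives the formulas, so this is an assumption based on the counter descriptions: PMN0 (stall cycles) divided by PMN1 (stall occurrences) for the average stall length, and PMN0 divided by CCNT for the share of execution time spent stalled. The function name and sample counter values are hypothetical.

```python
def buffer_full_stats(pmn0: int, pmn1: int, ccnt: int) -> dict:
    """Derive stall statistics from raw counter values.

    pmn0: clock cycles the core was stalled because no buffer was available
    pmn1: number of times the stall condition occurred
    ccnt: total clock cycles measured (total execution time)
    """
    # Guard against division by zero when no stalls or no cycles were recorded.
    avg_stall_cycles = pmn0 / pmn1 if pmn1 else 0.0
    stall_fraction = pmn0 / ccnt if ccnt else 0.0
    return {"avg_stall_cycles": avg_stall_cycles, "stall_fraction": stall_fraction}


# Hypothetical counter readings after a profiling run.
stats = buffer_full_stats(pmn0=5000, pmn1=250, ccnt=1_000_000)
print(stats)
```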
267. moved before the MOV instruction. Note that this would prevent pipeline stalls if the load hits the data cache. However, if the load is likely to miss the data cache, move the LDR instruction so that it executes as early as possible, before the SUB instruction. Moving the LDR instruction before the SUB instruction would, however, change the program semantics. It is possible to move the ADD and the LDR instructions before the SUB instruction if we allow the contents of the register r6 to be spilled and restored from the stack, as shown below:
    ; all other registers are in use
    str r6, [sp, #-4]!
    add r0, r4, r5
    ldr r6, [r0]
    mov r2, r2, LSL #2
    orr r9, r9, #0xf
    add r8, r6, r8
    ldr r6, [sp], #4
    add r8, r8, #4
    orr r8, r8, #0xf
    sub r1, r6, r7
    mul r3, r6, r2
    ; The value in register r6 is not used after this
As can be seen above, the contents of the register r6 have been spilled to the stack and subsequently loaded back to the register r6 to retain the program semantics. Another way to optimize the code above is with the use of the preload instruction, as shown below:
    ; all other registers are in use
    add r0, r4, r5
    pld [r0]
    sub r1, r6, r7
    mul r3, r6, r2
    mov r2, r2, LSL #2
    orr r9, r9, #0xf
    ldr r6, [r0]
    add r8, r6, r8
    add r8, r8, #4
    orr r8, r8, #0xf
    ; The value in register r6 is not used after this
The Intel 80200 processor has four fill buffers used to fetch data from external memory when a data cache miss occurs. The Intel 80200 processor stalls when all fill buffers
268. 6.5 Write Buffer/Fill Buffer Operation and Control. See Section 1.2.2, Terminology and Acronyms, on page 1-5 for a definition of coalescing. The write buffer is always enabled, which means stores to external memory are buffered. The K bit in the Auxiliary Control Register (CP15 register 1) is a global enable/disable for allowing coalescing in the write buffer. When this bit disables coalescing, no coalescing occurs regardless of the value of the page attributes. If this bit enables coalescing, the page attributes X, C and B are examined to see if coalescing is enabled for each region of memory. All reads and writes to external memory occur in program order when coalescing is disabled in the write buffer. If coalescing is enabled in the write buffer, writes may occur out of program order to external memory. Program correctness is maintained in this case by comparing all store requests with all the valid entries in the fill buffer. The write buffer and fill buffer support a drain operation, such that before the next instruction executes, all Intel 80200 processor data requests to external memory, including the write operations in the bus controller, have completed. See Table 7-12, Cache Functions, on page 7-11 for the exact command. Writes to a region marked non-cacheable/non-bufferable (page attributes C, B and X all 0) cause execution to stall until the
269. n ECC code to associate with write data. The ECC code for a data element is transported over the DCB bus. For 64-bit wide memories, an additional eight bits of width are required to hold the ECC code. ECC is not supported in 32-bit memory systems. If system software has indicated that a memory region is ECC protected, then the Intel 80200 processor always processes that memory at bus-width granularity (64 bits). Read data should only be returned in multiples of 64 bits; 8-, 16- and 32-bit read requests should be answered with a full 64 bits of data. Narrow writes to ECC-protected memory are seen as read-modify-write cycles of bus-width granularity, with Lock asserted during the transaction. If the Intel 80200 processor is accessing a region of memory for which ECC is not enabled, it drives zeroes on DCB during writes; receivers should ignore such values. On reads to such a memory region, the Intel 80200 processor ignores the value on DCB, but this bus should be driven to a valid level (all zeroes, for example). Note: for 32-bit data bus systems, the system can satisfy this requirement on DCB by tying its bits low through pull-down resistors. To ensure that a memory location has a valid ECC byte, software must write to an address before it ever reads from it. This would not seem a problem, but the aggressive memory architecture of the Intel 80200 processor can make it a challenge. When the data cache receives a write request, it may respond by reading
270. 10.3 Examples. All examples assume a 64-bit bus in a little-endian system.
10.3.1 Simple Read Word. In Figure 10-4, a read request for one word at address 0x240 is issued at time 10 ns. ADS is asserted low at that clock edge, 0x240 is driven on A, W/R is driven low to indicate a read request, and 0x2 is driven onto the Len bus to indicate that the access is for four bytes. Some time later (four clocks in this case), DValid is asserted to indicate that the next sequential data cycle is occurring. Two clock edges later, the data word from 0x240 is driven onto D[31:0]. The other half of D can be any value. The ECC or parity bits associated with the data are driven onto DCB at the same time as the data.
[Figure 10-4. Basic Read Timing]
10.3.2 Read Burst, No Critical Word First. In Figure 10-5, the request goes out the same as in the last example, with the address 0x248 this time and the length 0x6, indicating an eight-word cache line fill. The first data cycle begins at 50 ns with DValid being asserted with CW
271. n the PID. The processor does not OR the PID with the specified breakpoint address prior to doing address comparison. This must be done by the programmer and written to the breakpoint register as the MVA. This applies to data and instruction breakpoints.
Instruction Breakpoints. The Debug architecture defines two instruction breakpoint registers (IBCR0 and IBCR1). The format of these registers is shown in Table 13-3, Instruction Breakpoint Address and Control Register (IBCRx). In ARM mode, the upper 30 bits contain a word-aligned MVA to break on. In Thumb mode, the upper 31 bits contain a half-word-aligned MVA to break on. In both modes, bit 0 enables and disables that instruction breakpoint register. Enabling instruction breakpoints while debug is globally disabled (DCSR.GE = 0) may result in unpredictable behavior.
Table 13-3. Instruction Breakpoint Address and Control Register (IBCRx). Reset value: unpredictable address, disabled.
Bits 31:1, Read/Write: Instruction Breakpoint MVA. In ARM mode, IBCRx[1] is ignored.
Bit 0, Read/Write: IBCRx Enable (E). 0 = Breakpoint disabled; 1 = Breakpoint enabled.
An instruction breakpoint generates a debug exception before the instruction at the address specified in the IBCR executes. When an instruction breakpoint occurs, the processor sets the DCSR.MOE bits to 0b001. So
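The register layout described in the entry above can be modelled with a few lines of bit arithmetic. This is a minimal sketch, assuming only what that entry states (address in bits 31:1, bit 1 ignored in ARM mode, enable in bit 0); the helper name `ibcr_value` is hypothetical, and the caller is assumed to have already formed the MVA, including any PID bits.

```python
def ibcr_value(mva: int, thumb: bool = False, enable: bool = True) -> int:
    """Build the 32-bit value to write to an instruction breakpoint register.

    mva:    modified virtual address to break on (already combined with the PID
            by the caller, since the processor does not do this itself)
    thumb:  True for a half-word-aligned Thumb address, False for ARM
    enable: value of the E bit (bit 0)
    """
    if not 0 <= mva <= 0xFFFFFFFF:
        raise ValueError("mva must fit in 32 bits")
    # ARM mode: word aligned, so bits 1:0 of the address are dropped.
    # Thumb mode: half-word aligned, so only bit 0 is dropped.
    align_mask = 0xFFFFFFFE if thumb else 0xFFFFFFFC
    return (mva & align_mask) | (1 if enable else 0)


print(hex(ibcr_value(0x80001234)))              # ARM breakpoint, enabled
print(hex(ibcr_value(0x80001236, thumb=True)))  # Thumb breakpoint, enabled
```

Clearing bit 1 in ARM mode is a choice made here for tidiness; the hardware is described as ignoring that bit.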
272. nd the address of any byte within the memory access matches the address in DBRx. For example, LDR triggers a breakpoint if DBCON.E0 is 0b10 or 0b11 and the address of any of the 4 bytes accessed by the load matches the address in DBR0. The processor does not trigger data breakpoints for the PLD instruction or any CP15 register 7, 8, 9 or 10 functions. Any other type of memory access can trigger a data breakpoint. For data breakpoint purposes, the SWP and SWPB instructions are treated as stores: they do not cause a data breakpoint if the breakpoint is set up to break on loads only and an address match occurs. On unaligned memory accesses, breakpoint address comparison is done on a word-aligned address (aligned down to the word boundary). When a memory access triggers a data breakpoint, the breakpoint is reported after the access is issued. The memory access is not aborted by the processor. The actual timing of when the access completes with respect to the start of the debug handler depends on the memory configuration. On a data breakpoint, the processor generates a debug exception and redirects execution to the debug handler before the next instruction executes. The processor reports the data breakpoint by setting DCSR.MOE to 0b010. The link register of a data breakpoint is always the PC of the next instruction to execute + 4, regardless of whether the processor is configured for monitor mode or halt mode.
Software Breakpoints. Mnemonics: BKPT (See
273. nd thus increasing register pressure. By contrast, the Intel 80200 processor prefetch can be used to reduce register pressure instead of increasing it. The Intel 80200 processor prefetch load is a hint instruction and does not guarantee that the data is loaded. Whenever the load would cause a fault or a table walk, the processor ignores the prefetch instruction, along with the fault or table walk, and continues processing the next instruction. This is particularly advantageous where a linked list or recursive data structure is terminated by a NULL pointer: prefetching the NULL pointer does not fault program flow.
Prefetch Distances in the Intel 80200 Processor. Scheduling the prefetch instruction requires understanding the system latency times and system resources, which affect when to use the prefetch instruction. This section considers three timing elements: Ncwf, critical word first; Nlinexfer, full cache line transfer time; Nsubissue, subsequent prefetch issue time (to ensure uninterrupted transfers). The memory latency times presented here assume typical SDRAM that is currently available and working with the Intel 80200 processor. It is assumed that the SDRAM supports Critical Word First transfers. That is, when a cache line is being transferred, the first word transferred corresponds to the one needed by the processor immediately, as opposed to transferring the data from the lowest address first. The cycle times assume that the core is runn
274. nding loads before and after locking data. This step ensures that outstanding loads do not end up in the wrong place, either unintentionally locked into the cache or not locked at all. Note also that a drain operation has been placed after the operation that locks the tag into the cache. This drain ensures predictable results if a programmer tries to lock more than 28 lines in a set; the tag gets allocated in this case but not locked into the cache.
Example 6-3. Locking Data into the Data Cache
    ; configured with C=1 and B=1
    ; R0 is the number of 32-byte lines to lock into the data cache.
    ; In this example 16 lines of data are locked into the cache.
    ; MMU and data cache are enabled prior to this code.
    ; macro DRAIN
    ; macro CPWAIT
    ; macro LOCKLINE Rx, Ry
    ;   Write back the line if it's dirty in the cache
    ;   Flush/Invalidate the line from the cache
    ;   Load and lock 32 bytes of data located at [R1] into the data cache.
    ;   Post-increment the address in R1 to the next cache line.
    ; LockLines(int cache_lines, void *start_address)
    global LockLines
    LockLines:
    STMFD SP
    MOV R6
275. ne byte enable asserted. If the first or last data cycle had no bytes to store, the transaction would have been issued as a shorter transaction.
[Figure 10-10. Four Word Coalesced Write Burst]
10.3.7 Pipelined Accesses. The example in Figure 10-11 demonstrates the four-deep pipelined nature of this bus. In this example, the Intel 80200 processor is bus limited and is issuing requests as quickly as it can. Before time 0 ns there are no outstanding transactions. Two reads (A and B), followed by a write (C) and another read (D), are all requested before 85 ns in this timing diagram. Because the Intel 80200 processor may have up to four outstanding transactions, and in this example only three are outstanding at time 85 ns, it can send another request, for E, at time 90 ns. If none of the transactions had completed, the E transaction would have been delayed.
[Figure 10-11. Pipeline Example]
276. ne difference: CWF is asserted high on the first data cycle of the return data. This indicates that the data is returning critical word first. In this case, since the address requested was 0x248, the word pair containing that byte (starting at 0x248) is returned first. The data is then returned sequentially through the end of the cache line, starting over at the beginning.
[Figure 10-6. Read Burst, CWF]
10.3.4 Word Write. Figure 10-7 shows a 32-bit write request to address 0x240. W/R is high when ADS is asserted low. Two cycles before the write data needs to be on the bus for the SDRAM, DValid is asserted by the chipset to tell the Intel 80200 processor that the data is needed. Two cycles later, the Intel 80200 processor drives the data onto the D bus (the lower 32 bits in this case) along with the appropriate check bits and byte enables. In this case the low four byte enables are asserted low and the upper four are deasserted high, because the write data is on the low four bytes of the bus.
[Figure 10-7. Basic Word Write]
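The critical-word-first return order described in the entry above (requested word pair first, then sequentially to the end of the line, then wrapping to the start) can be sketched as below. This assumes a 32-byte cache line and a 64-bit bus returning 8 bytes per data cycle, consistent with the eight-word line fill and 64-bit bus used in these examples; the function name is hypothetical.

```python
def cwf_order(addr: int, line_bytes: int = 32, beat_bytes: int = 8) -> list:
    """Return the addresses of the data cycles in critical-word-first order."""
    base = addr & ~(line_bytes - 1)        # start of the cache line
    first = (addr - base) // beat_bytes    # index of the data cycle holding addr
    beats = line_bytes // beat_bytes       # data cycles per line fill
    # Start at the requested data cycle and wrap around at the end of the line.
    return [base + ((first + i) % beats) * beat_bytes for i in range(beats)]


# The request for 0x248 from the entry above.
print([hex(a) for a in cwf_order(0x248)])  # ['0x248', '0x250', '0x258', '0x240']
```

A request for the first word pair of the line (0x240) yields plain ascending order, which matches the burst without critical word first.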
277. C.2.5.8 Exit2-DR State
C.2.5.9 Update-DR State
C.2.5.10 Select-IR-Scan State
C.2.5.11 Capture-IR State
C.2.5.12 Shift-IR State
C.2.5.13 Exit1-IR State
C.2.5.14 Pause-IR State
C.2.5.15 Exit2-IR State
C.2.5.16 Update-IR State
C.2.5.17 Boundary-Scan Example
Figures
1-1 Intel 80200 Processor based on Intel XScale Microarchitecture Features
3-1 Example of Locked Entries in TLB
4-1 Instruction Cache Organization
4-2 Locked Line Effect on Round Robin Replacement
5-1 ...
5-2 Branch History
6-1 Data Cache Organization
6-2 Mini-Data Cache Organization
6-3 Locked Line Effect on Round Robin Replacement
8-1 ...
8-2 Pin State at Reset
9-1 I
278. ned, the RFU stops stalling the pipe. The ARM architecture specifies one of the operands for data processing instructions as the shifter operand, where a 32-bit shift can be performed before it is used as an input to the ALU. This shifter is located in the second half of the RF pipestage.
X1 (Execute) Pipestage. The X1 pipestage performs the following functions:
ALU calculation: the ALU performs arithmetic and logic operations, as required for data processing instructions and load/store index calculations.
Determine conditional instruction execution: the instruction's condition is compared to the CPSR prior to execution of each instruction. Any instruction with a false condition is cancelled and does not cause any architectural state changes, including modifications of registers, memory, and PSR.
Branch target determination: if a branch was mispredicted by the BTB, the X1 pipestage flushes all of the instructions in the previous pipestages and sends the branch target address to the BTB, which restarts the pipeline.
X2 (Execute 2) Pipestage. The X2 pipestage contains the program status registers (PSRs). This pipestage selects what is going to be written to the RFU in the WB cycle: PSRs (MRS instruction), ALU output, or other items.
WB (write-back). When an instruction has reached the write-back stage, it is considered complete. Changes are written to the RFU.
279. Software Debug Notes/Errata
14 Performance Considerations
14.1 Interrupt Latency
14.2 Branch Prediction
14.3 Addressing Modes
14.4 Instruction Latencies
14.4.1 Performance Terms
14.4.2 Branch Instruction Timings
14.4.3 Data Processing Instruction Timings
14.4.4 Multiply Instruction Timings
14.4.5 Saturated Arithmetic Instructions
14.4.6 Status Register Access Instructions
14.4.7 Load/Store Instructions
14.4.8 Semaphore Instructions
14.4.9 Coprocessor Instructions
14.4.10 Miscellaneous Instruction Timing
14.4.11 ...
A Compatibility: Intel 80200 Processor vs. SA-110
A.1 Introduction
A.2
280. ng operations are completed before allowing the EE bit to take effect. Code similar to that shown in Example 11-2 is recommended for enabling ECC.
Example 11-2. Enabling ECC
    MACRO FLUSHALL
    MCR P15, 0, R0, C7, C10, 4   ; Drain buffers
    ENDM
    enableECC:
    FLUSHALL                     ; Finish pending memory operations
    MOV R0, #0xA                 ; Set bits 3 and 1
    MCR P13, 0, R0, C0, C1, 0    ; Set BCUCTL: enable ECC
When ECC is enabled, the BCU only generates an interrupt on a single-bit error if BCUCTL.SR is set. When ECC is enabled, the BCU always generates an abort on a multi-bit error. The BCU repairs single-bit errors if BCUCTL.SC is set. It is recommended that this bit always be set; running with this bit cleared could cause software to operate on corrupted data before the ECC error-detect interrupt is received. If BCUCTL.EE is zero, the BCU ignores the ECC bits on reads and drives all zeroes to the ECC bits on writes. This is the fastest mode for the BCU, as ECC error detection/correction incurs an additional two MCLK cycles. If error reporting is enabled in BCUCTL and any of the bits BCUCTL.EV, BCUCTL.E1 or BCUCTL.E0 are set, the BCU asserts its interrupt to the ICU on a single-bit error. These bits are set by the BCU when it detects an error and saves the error information to ELOGx and ECARx. If BCUCTL.En is set (n = 0, 1), the BCU
281. ng systems may require modifications to match the specific hardware features of the Intel 80200 processor and to take advantage of the performance enhancements added to the Intel 80200 processor.
1.1.2 Features. Figure 1-1 shows the major functional blocks of the Intel 80200 processor. The following sections give a brief, high-level overview of these blocks.
[Figure 1-1. Intel 80200 Processor based on Intel XScale Microarchitecture Features]
1.1.2.1 Multiply/Accumulate (MAC). The MAC unit supports early termination of multiplies/accumulates in two cycles and can sustain a throughput of a MAC operation every cycle. Several architectural enhancements were made to the MAC to support audio coding algorithms, which include a 40-bit accumulator and support for 16-bit packed data. See Section 2.3, Extensions to ARM Architecture, on page 2-3 for more details.
1.1.2.2 Memory Management. The Intel 80200 processor implements the Memory Management Unit (MMU) Architecture specified in the ARM Architecture Reference Manual. The MMU provides access protection and virtual-to-physical address translation. The MMU A
282. nimal debug handler stub, responsible for doing the handshaking with the host, resident in the instruction cache. This debug handler stub should be downloaded into the instruction cache during processor reset using the method described in Section 13.14.4. Section 13.14.5.1, Dynamic Code Download Synchronization, describes the details for implementing the handshaking in the debug handler. Figure 13-14 shows a high-level view of the actions taken by the host and debug handler during dynamic code download.
[Figure 13-14. Downloading Code in IC During Program Execution. Debugger actions: wait for handler to signal ready to start download; download code; signal handler that download is complete. JTAG IR sequence: DBGTX, LDIC, DBGRX. Debug handler actions: handler begins execution; signal host ready for download; wait for host to signal download complete; continue execution.]
The following steps describe the details for downloading code. Since the debug handler is responsible for synchronization during the code download, the handler must be executing before the host can begin the download. The debug handler execution starts when the application running on the Intel 80200 processor generates a debug exception or when the host generates an external debug break. While the DBGTX JTAG instru
283. Some Common Uses of the PMU
Debug Control and Status Register (DCSR)
Event Priority
Instruction Breakpoint Address and Control Register (IBCRx)
Data Breakpoint Register (DBRx)
Data Breakpoint Controls Register (DBCON)
TX RX Control Register (TXRXCTRL)
Normal RX Handshaking
High-Speed Download Handshaking States
TX Handshaking
TXRXCTRL Mnemonic Extensions
TX Register
RX Register
DEBUG Data Register Reset Values
CP 14 Trace Buffer Register Summary
Checkpoint Register (CHKPTx)
TBREG Format
Message Byte Formats
LDIC Cache Functions
Minimu
284. nt address than where the code was downloaded. The SDS remains in effect regardless of the processor mode. This allows the debug handler to switch to other modes, maintaining SDS functionality. Entering user mode may cause unpredictable behavior. The processor exits SDS following a CPSR restore operation. When exiting, the debug handler should use: subs pc, lr, #4. This restores CPSR, turns off all of SDS functionality, and branches to the target instruction.
Footnote 1: When the vector table is relocated (CP15 Control Register[13] = 1), the debug vector is relocated to 0xffff0000.
13.5.2 Monitor Mode. In monitor mode, the processor handles debug exceptions like normal ARM exceptions. If debug functionality is enabled (DCSR[31] = 1) and the processor is in monitor mode, debug exceptions cause either a data abort or a pre-fetch abort.
The following debug exceptions cause data aborts: data breakpoint; external debug break; trace-buffer full break.
The following debug exceptions cause pre-fetch aborts: instruction breakpoint; BKPT instruction.
The processor ignores vector traps during monitor mode.
When an exception occurs in monitor mode, the processor takes the following actions: disables the trace buffer; sets DCSR.moe encoding; sets FSR[9]; R14_abt = PC of the next instru
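The monitor-mode mapping listed in the entry above amounts to a small lookup from debug event to abort type. The sketch below encodes exactly that list; the event strings and function name are hypothetical labels chosen for illustration, not identifiers from the manual.

```python
# Debug events grouped by the abort they raise in monitor mode, per the entry above.
DATA_ABORT_EVENTS = {"data breakpoint", "external debug break", "trace buffer full break"}
PREFETCH_ABORT_EVENTS = {"instruction breakpoint", "BKPT instruction"}
IGNORED_EVENTS = {"vector trap"}


def monitor_mode_abort(event: str):
    """Return the abort type a debug event causes in monitor mode, or None if ignored."""
    if event in DATA_ABORT_EVENTS:
        return "data abort"
    if event in PREFETCH_ABORT_EVENTS:
        return "pre-fetch abort"
    if event in IGNORED_EVENTS:
        return None
    raise ValueError(f"unknown debug event: {event!r}")


print(monitor_mode_abort("data breakpoint"))   # data abort
print(monitor_mode_abort("BKPT instruction"))  # pre-fetch abort
```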
285. 2.3.1.1 Multiply With Internal Accumulate Format. A new multiply format has been created to define operations on 40-bit accumulators. Table 2-1, Multiply with Internal Accumulate Format, on page 2-4 shows the layout of the new format. The opcode for this format lies within the coprocessor register transfer instruction type. These instructions have their own syntax.
Table 2-1. Multiply with Internal Accumulate Format (Bits, field: Description; Notes)
31:28, cond: ARM condition codes.
19:16, opcode_3: specifies the type of multiply with internal accumulate. The Intel 80200 processor defines the following: 0b0000 = MIA, 0b1000 = MIAPH, 0b1100 = MIABB, 0b1101 = MIABT, 0b1110 = MIATB, 0b1111 = MIATT. The effect of all other encodings is unpredictable.
15:12, Rs: Multiplier.
7:5, acc: select 1 of 8 accumulators. The Intel 80200 processor only implements acc0; access to any other acc has unpredictable effect.
3:0, Rm: Multiplicand.
Two new fields were created for this format, acc and opcode_3. The acc field specifies 1 of 8 internal accumulators to operate on, and opcode_3 defines the operation for this format. The Intel 80200 processor defines a single 40-bit accumulator, referred to as acc0; future implementations may define mul
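The field layout in the entry above can be exercised with a small decoder. This is a sketch under stated assumptions: it extracts only the fields named there (cond, opcode_3, Rs, acc, Rm) and does not verify that the word is a coprocessor instruction. The MIABB and MIABT encodings (0b1100, 0b1101) are inferred from the 0b11xy pattern of the listed MIATB and MIATT values, because the extracted table is damaged at that point. The function name and the sample word are hypothetical.

```python
# opcode_3 values; MIABB and MIABT are inferred (see the note above).
OPCODE_3 = {
    0b0000: "MIA",
    0b1000: "MIAPH",
    0b1100: "MIABB",
    0b1101: "MIABT",
    0b1110: "MIATB",
    0b1111: "MIATT",
}


def decode_mia_fields(instr: int) -> dict:
    """Extract the multiply-with-internal-accumulate fields from a 32-bit word."""
    opcode_3 = (instr >> 16) & 0xF
    return {
        "cond": (instr >> 28) & 0xF,           # bits 31:28
        "opcode_3": opcode_3,                  # bits 19:16
        "mnemonic": OPCODE_3.get(opcode_3),    # None for unpredictable encodings
        "rs": (instr >> 12) & 0xF,             # bits 15:12, multiplier
        "acc": (instr >> 5) & 0x7,             # bits 7:5, accumulator select
        "rm": instr & 0xF,                     # bits 3:0, multiplicand
    }


# Sample word built only from the named fields (all other bits left at zero, so
# it is not a complete instruction encoding): cond=0xE, MIAPH, Rs=r3, acc0, Rm=r4.
print(decode_mia_fields(0xE0083004))
```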
286. ntel 80200 processor expects this cycle, and system operation is not guaranteed without it. It is up to the chipset or SDRAM controller to control the data bus cycles. Because the Intel 80200 processor transactions on the data bus must occur in the order they were requested, DValid can be used for both read and write data cycles. Both the Intel 80200 processor and the chipset have enough information to know if the Intel 80200 processor is driving or sampling the data bus for any given transaction.
Configuration Pins. DBusWidth, which is on the CWF pin at reset, indicates whether the data bus is 32 bits wide or 64 bits wide. If the pin is sampled as 0 during reset, the Intel 80200 processor assumes a 64-bit bus. If the pin is 1 at reset, a 32-bit bus is assumed.
10.2.5 Multimaster Support. Simple multimaster support is supplied with the Hold pin. The Hold pin causes the Intel 80200 processor to stop issuing new requests as soon as possible (see below for timing) and to float the following pins: A, ADS/LEN[2], W/R/LEN[0] and Lock/LEN[1]. Before floating ADS, the Intel 80200 processor drives it to an inactive state (high). Simultaneously with floating the affected signals, the Intel 80200 processor asserts HldA. When HldA is asserted, it is up to the chipset to make sure that the floating signals are driv
287. nterrupt Controller Block Diagram
10-1 Typical System
10-2 ...
10-3 Big Endian Lane Swapping on a 64-bit Bus
10-4 Basic Read Timing
10-5 Read Burst, No CWF
10-6 Read Burst, CWF
10-7 Basic Word Write
10-8 Two Word Coalesced Write
10-9 Four Word Eviction
10-10 Four Word Coalesced Write Burst
10-11 Pipeline Example
10-12 ...
10-13 ...
10-14 Hold Assertion
13-1 SELDCSR Hardware
13-2 SELDCSR Data Register
13-3 ...
13-4 DBGRX Hardware
13-5 RX Write
13-6 DBGRX Data Register
13-7 Message Byte Formats
13-8 Indirect Branch Entry Address Byte Organization
13-9 High Level View of Trace Buffer
288. o deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an ordering number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725 or by visiting Intel's website at http://www.intel.com. Copyright Intel Corporation, 2003. Other brands and names are the property of their respective owners. ARM and StrongARM are registered trademarks of ARM, Ltd.
Contents
1 Introduction
1.1 Intel 80200 Processor based on Intel XScale Microarchitecture High-Level Overview
1.1.1 ARM Architecture Compliance
1.1.2 Features
1.1.2.1 Multiply/Accumulate (MAC)
1.1.2.2 Memory Management
1.1.2.3 Instruction Cache
1.1.2.4 Branch Target Buffer
1.1.2.5 Data Cache
1.1.2.6 Power Management
1.1.2.7 Interrupt Controller
1.1.2.8 Bus Controller
1.1.2.9 Performance Monitoring
1.1.2.10
289. o the 1 to 3 cycle resource latency. Similarly, consider the following code sample:

    mia   acc0, r2, r3
    mra   r4, r5, acc0

The MRA instruction above can stall from 0 to 2 cycles, depending on the values in registers r2 and r3, due to the 1 to 3 cycle result latency. The MIAPH instruction has an issue latency of 1 cycle, a result latency of 2 cycles, and a resource latency of 2 cycles. Consider the code sample shown below:

    add   r1, r2, r3
    miaph acc0, r3, r4
    miaph acc0, r5, r6
    mra   r6, r7, acc0
    sub   r8, r3, r4

The second MIAPH instruction would stall for 1 cycle due to the 2 cycle resource latency. The MRA instruction would stall for 1 cycle due to the 2 cycle result latency. These stalls can be avoided by rearranging the code as follows:

    miaph acc0, r3, r4
    add   r1, r2, r3
    miaph acc0, r5, r6
    sub   r8, r3, r4
    mra   r6, r7, acc0

B.5.7 Scheduling MRS and MSR Instructions

The MRS instruction has an issue latency of 1 cycle and a result latency of 2 cycles. The MSR instruction has an issue latency of 2 cycles (6 if updating the mode bits) and a result latency of 1 cycle. Consider the code sample:

    mrs   r0, cpsr
    orr   r0, r0, #1
    add   r1, r2, r3

The ORR instruction above would incur a 1 cycle stall due to the 2 cycle result latency of the MRS instruction. In the code example above, the ADD instruction
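The stall counts quoted for the MIAPH example can be reproduced with a small in-order issue model. This is only a sketch under stated assumptions: one instruction issues per cycle, a consumer may issue no earlier than the producer's issue cycle plus its result latency, and a second user of the same unit may issue no earlier than the first user's issue cycle plus its resource latency. The `Instr` class and `count_stalls` function are hypothetical names, and the model ignores every pipeline effect other than these two latencies (in particular, MIAPH accumulating into acc0 is not treated as a read dependency).

```python
from dataclasses import dataclass

@dataclass
class Instr:
    name: str
    reads: tuple = ()
    writes: tuple = ()
    resource: str = ""        # "" means no shared unit is modelled
    result_latency: int = 1
    resource_latency: int = 1

def count_stalls(program):
    """Return the stall cycles seen by each instruction, in program order."""
    ready = {}   # register -> earliest cycle a consumer may issue
    free = {}    # resource -> earliest cycle the next user may issue
    cycle = 0
    stalls = []
    for ins in program:
        earliest = cycle
        for reg in ins.reads:
            earliest = max(earliest, ready.get(reg, 0))
        if ins.resource:
            earliest = max(earliest, free.get(ins.resource, 0))
        stalls.append(earliest - cycle)
        for reg in ins.writes:
            ready[reg] = earliest + ins.result_latency
        if ins.resource:
            free[ins.resource] = earliest + ins.resource_latency
        cycle = earliest + 1
    return stalls

def miaph(src_a, src_b):
    # Latencies taken from the text: result 2 cycles, resource 2 cycles.
    return Instr("miaph", reads=(src_a, src_b), writes=("acc0",),
                 resource="mac", result_latency=2, resource_latency=2)

add = Instr("add", reads=("r2", "r3"), writes=("r1",))
sub = Instr("sub", reads=("r3", "r4"), writes=("r8",))
mra = Instr("mra", reads=("acc0",), writes=("r6", "r7"))

original = [add, miaph("r3", "r4"), miaph("r5", "r6"), mra, sub]
rearranged = [miaph("r3", "r4"), add, miaph("r5", "r6"), sub, mra]

print(count_stalls(original))
print(count_stalls(rearranged))
```

Under these assumptions the original order shows one stall on the second MIAPH and one on the MRA, and the rearranged order shows none, matching the text.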
290. 13.15.2.4 High-Speed Download

Special debug hardware has been added to support a high-speed download mode, to increase the performance of downloads to system memory (vs. writing a block of memory using the standard handshaking). The basic assumption is that the debug handler can read any data sent by the debugger and write it to memory before the debugger can send the next data. Thus, in the time it takes for the debugger to scan in the next data word and do an Update_DR, the handler is already in its polling loop, waiting for it. Using this assumption, the debugger does not have to poll RR to see whether the handler has read the previous data; it assumes the previous data has been consumed and immediately starts scanning in the next data word.

The pitfall is when the write to memory stalls long enough that the assumption fails. In this case the download with normal handshaking can be used, or high-speed download can still be used, but a few extra TCKs in the Pause_DR state may be necessary to allow a little more time for the store to complete.

The hardware support for high-speed download includes the Download bit (DCSR[29]) and the Overflow Flag (DCSR[30]). The download bit acts as a branch flag, signalling to the handler to continue with the download. This removes the need for a counter in the debug handler. The overflow flag indicates that the debugger
291. Figure 13-12 shows the actions necessary to download code into the instruction cache during a cold reset for debug.

NOTE: In Figure 13-12, hold_rst is a signal that gets set and cleared through JTAG. When the JTAG IR contains the SELDCSR instruction, the hold_rst signal is set to the value scanned into DBG_SR[1].

[Figure 13-12. Code Download During a Cold Reset For Debug. Timing diagram of TRST, RESET, internal RESET, hold_rst and the JTAG IR. Annotations: TRST resets JTAG IR to IDCODE; RESET pin asserted until hold_rst signal is set; RESET invalidates IC; hold_rst keeps internal reset asserted; wait 2030 TCKs after RESET asserted; clock 15 TCKs after last Update_DR in LDIC mode; processor branches to address 0. JTAG IR sequence: SELDCSR (set hold_rst signal, set Halt Mode bit), LDIC (enter LDIC mode, download code), SELDCSR (clear hold_rst signal, keep Halt Mode bit set).]

An external host should take the following steps to load code into the instruction cache following a cold reset: Assert the RESET and TRST pins. This resets the JTAG IR to IDCODE and invalidates the instruction cache (main and mini). Load the SELDC
292. olled by the round-robin replacement algorithm. This update may evict a valid line at that location.
6. Once the cache is updated, the eight valid bits of the fetch buffer are invalidated.

Round-Robin Replacement Algorithm

The line replacement algorithm for the instruction cache is round-robin. Each set in the instruction cache has a round-robin pointer that keeps track of the next line in that set to replace. The next line to replace in a set is the one after the last line that was written. For example, if the line for the last external instruction fetch was written into way 5, set 2, the next line to replace for that set would be way 6. None of the round-robin pointers for the other sets are affected in this case.

After reset, way 31 is pointed to by the round-robin pointer for all the sets. Once a line is written into way 31, the round-robin pointer points to the first available way of a set, beginning with way 0 if no lines have been locked into that particular set. Locking lines into the instruction cache effectively reduces the available lines for cache updating. For example, if the first three lines of a set were locked down, the round-robin pointer would point to the line at way 3 after it rolled over from way 31. Refer to Section 4.3.4, Locking Instructions in the Instruction Cache on page 4-8 for more details on cache locking.
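The pointer behaviour described above can be sketched for a single set as follows. This is a minimal illustration, not the hardware implementation: the class name `RoundRobinSet` and its methods are hypothetical, locked lines are assumed to occupy the lowest-numbered ways, and the case where every way is locked is not handled.

```python
class RoundRobinSet:
    """Round-robin victim selection for one 32-way instruction cache set."""

    WAYS = 32

    def __init__(self, locked_ways=0):
        # Assumption: locked lines fill ways 0 .. locked_ways - 1.
        self.locked_ways = locked_ways
        # After reset the pointer selects way 31 in every set.
        self.pointer = self.WAYS - 1

    def write_line(self):
        """Write a fetched line; return the way that was replaced."""
        victim = self.pointer
        nxt = victim + 1
        if nxt == self.WAYS:
            # Roll over to the first way that is not locked.
            nxt = self.locked_ways
        self.pointer = nxt
        return victim


unlocked = RoundRobinSet()
ways_used = [unlocked.write_line() for _ in range(7)]   # 31, then 0..5
print(ways_used, "next:", unlocked.pointer)

locked = RoundRobinSet(locked_ways=3)
locked.write_line()                                     # replaces way 31
print("next after rollover:", locked.pointer)
```

With no locked lines, the way after 5 is 6; with the first three ways locked, the pointer lands on way 3 after rolling over from way 31, as in the examples in the text.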
293. on Intel XScale Microarchitecture tel e 9 2 9 3 10 1 10 2 10 3 10 4 10 5 11 1 11 2 11 3 11 4 11 5 11 6 12 1 12 2 12 3 12 4 12 5 13 1 13 2 13 3 13 4 13 5 13 6 13 7 13 8 13 9 13 10 13 11 13 12 13 13 13 14 13 15 13 16 13 17 13 18 14 1 14 2 14 3 14 4 14 5 14 6 14 7 14 8 14 9 14 10 14 11 14 12 14 13 xiv Interrupt Control Register CP13 register 0 3 Interrupt Source Register CP13 register Ai 4 Interrupt Steer Register CP13 register H 5 Intel 80200 Processor based on Intel XScale Microarchitecture Bus SUP tal S ege eee teet 3 Requests Oma 64 bit BUS E M 4 Requests oma 32 bit Bus auderet ee nitet Pe Ee dosed cases eng Eege 5 Return Order for 8 Word Burst 64 bit Data Bus 7 Return Order for 8 Word Burst 32 bit Data Bus 7 BCU Response to ECC Beate ed ee Ire ite ere cera ENEE 3 Ee geng NEE M 5 BCUMOD Register T nir E Ree RE SET eid EE EQ FUR Ue EEN 7 ELOGO0 ELOGITI Re gisters 4 5 iter tret eere he it Pre t Ra urere ek eap is E 9 ECARO EGAR Registers 0 7 55er been Re REIR da VO NNUS NN ES PER ede Pega pete o enu sp teo pe ea eve REN EUR 9 ECIST Register EE 10 Clock Count EE GL E KR EE 2 Performance Monitor Count Register PMNO and PMNI7 sese nnne 3 Performance Monitor Control Register CP14 register OU 4 Performance Monitoring Evehts
294. on Latencies to get the instruction latencies for various multiply instructions. The multiply instructions should be scheduled taking into consideration these instruction latencies.

B.5.4 Scheduling SWP and SWPB Instructions

The SWP and SWPB instructions have a 5 cycle issue latency. As a result of this latency, the instruction following the SWP/SWPB instruction would stall for 4 cycles. SWP and SWPB instructions should therefore be used only where absolutely needed. For example, the following code may be used to swap the contents of 2 memory locations:

    ; Swap the contents of memory locations pointed to by r0 and r1
    ldr   r2, [r0]        ; r2 = old contents of [r0]
    swp   r2, r2, [r1]    ; [r1] = old [r0], r2 = old contents of [r1]
    str   r2, [r0]        ; [r0] = old contents of [r1]

The code above takes 9 cycles to complete. The rewritten code below takes 6 cycles to execute:

    ; Swap the contents of memory locations pointed to by r0 and r1
    ldr   r2, [r0]
    ldr   r3, [r1]
    str   r2, [r1]
    str   r3, [r0]

B.5.5 Scheduling the MRA and MAR Instructions (MRRC/MCRR)

The MRA (MRRC) instruction has an issue latency of 1 cycle, a result latency of 2 or 3 cycles depending on the destination register value being accessed, and a resource latency of 2 cycles. Consider the code sample:

    mra   r6, r7, acc0
295. on is the Thumb BL instruction when H = 0; the timing in this case would be the same as an ARM data processing instruction.

Compatibility: Intel 80200 Processor vs. SA-110 (Appendix A)

This appendix highlights the differences between the first generation Intel StrongARM technology (SA-110) and the Intel 80200 processor based on Intel XScale microarchitecture (compliant with the ARM Architecture V5TE).

A.1 Introduction

The Intel 80200 processor architecture has been defined to be compatible with SA-110 where possible; however, there are some features that are not supported on the Intel 80200 processor, or whose definition has been modified. The following sections discuss these deviations. A programmer who is developing an application for SA-110 and wishes to migrate to the Intel 80200 processor must be aware of these architecture differences. Use of these architecture features should be avoided or isolated in developing the application, so that migrating to the Intel 80200 processor can occur with minimal effort.

A.2 Summary

Various features of the SA-110 and the Intel 80200 processor are outlined in this section. Subsequent sections give more details.

Feature (SA-110 vs. Intel 80200 processor)
Intel Extensions:
    40-bit accumulator access instructions (MRA, MAR)
    Cache preload (PLD)
    New multiply instructions for packed data (MIA, MIAPH, MIAxy)
    New Load/Store Consecutive (LDRD
296. on of the read that asserted the lock. It is possible that the chipset may assert Hold to allow another master on the bus on clock edge n, and the Intel 80200 processor issues a read request and asserts Lock on the same clock edge n. In this case the chipset should not let another master access memory at this time; the Intel 80200 processor is stalled waiting for access to the bus. However, the Intel 80200 processor continues to respect the Hold pin and floats the request bus as it normally would. This allows the chipset to have a guaranteed delay between Hold assertion and the Intel 80200 processor floating the pins.

In the general case, the Hold pin should be deasserted a cycle or two later (speed here is not critical, as long as no other master is allowed to initiate a memory request), and the Intel 80200 processor continues on with the atomic pair of requests. Once the write request that deasserts Lock is issued, the chipset can reassert Hold and give the bus to another master.

If the system designer knows that the other requesting master is not accessing the same 32-byte memory region as the locked read, the chipset may choose to not deassert Hold and can continue on with the multimaster request. Another possibility is for the chipset to accept the read with Lock request and store it into the chipset queues, but to delay execution of the read with Lock until after the transactions from the other bus master, to avoid a semaphore conflic
297. ons, see Section 11.2, ECC on page 11-1. If ECC is not enabled, the Intel 80200 processor will never perform an 8-byte read.
3. On a 32-byte load, A[4:2] carries information for Critical Word First logic. See Section 10.2.3, Critical Word First on page 10-7 for more information.

Requests on a 32-bit Bus

LEN   Data Bytes   Data Bus Cycles   Used for Reads   Used for Writes   Address Alignment
000   1            1                 Y                Y                 Any address
001   2            1                 Y                Y                 A[0] = 0
010   4            1                 Y                Y                 A[1:0] = 00
011   8            2                 N                Y                 A[1:0] = 00
100   12           3                 N                Y                 A[3:0] = 0X00
101   16           4                 N                Y                 A[3:0] = 0000
110   32           8                 Y                N                 A[1:0] = 00 (1)
111   Not used in 32-bit bus mode

1. On a 32-byte load, A[4:2] carries Critical Word First logic information; see Section 10.2.3, Critical Word First.

In addition to the alignment constraints listed above, read transactions never cross a 32-byte boundary, and write transactions never cross a 16-byte boundary.

Some write case explanations: Byte and short writes are caused by non-cacheable, non-bufferable store commands in the Intel 80200 processor. The four-word write can be caused by eviction of dirty data in a cache line (the Intel 80200 processor has two dirty bits for each eight-word cache line and evicts halves separately as necessary) or from the write buffe
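The 32-bit bus table can be captured as a lookup, with the data bus cycle count derived from the byte count and the boundary rules checked separately. This is a sketch for illustration only: `BUS32_REQUESTS`, `data_bus_cycles` and `crosses_boundary` are hypothetical names, and the per-row alignment column is not modelled.

```python
# LEN encoding -> (data bytes, used for reads, used for writes), 32-bit bus mode.
BUS32_REQUESTS = {
    0b000: (1, True, True),
    0b001: (2, True, True),
    0b010: (4, True, True),
    0b011: (8, False, True),
    0b100: (12, False, True),
    0b101: (16, False, True),
    0b110: (32, True, False),
    # 0b111 is not used in 32-bit bus mode.
}

def data_bus_cycles(len_code, bus_bytes=4):
    """Data bus cycles for a request: bytes divided by bus width, rounded up."""
    nbytes = BUS32_REQUESTS[len_code][0]
    return -(-nbytes // bus_bytes)

def crosses_boundary(addr, nbytes, is_read):
    """True if the transfer would cross a 32-byte (read) or 16-byte (write) boundary."""
    limit = 32 if is_read else 16
    return addr // limit != (addr + nbytes - 1) // limit

cycles = [data_bus_cycles(code) for code in sorted(BUS32_REQUESTS)]
print(cycles)
print(crosses_boundary(0x2404, 12, is_read=False))
print(crosses_boundary(0x2408, 12, is_read=False))
```

The derived cycle counts agree with the "Data Bus Cycles" column. The second boundary check shows why a 12-byte write needs A[3:0] = 0X00: started at offset 8 it would spill past the 16-byte block.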
298. ontents of LDIC_SR2 have been shifted to the instruction cache. Removing the LDIC JTAG instruction from the JTAG IR before the entire contents of LDIC_SR2 are sent to the instruction cache results in unpredictable behavior. Therefore, following the Update_DR for the last LDIC packet, the LDIC instruction must remain in the JTAG IR for a minimum of 15 TCKs. This ensures the last packet is correctly sent to the instruction cache.

13.14.3 LDIC Cache Functions

The Intel 80200 processor supports four cache functions that can be executed through JTAG. Two functions allow an external host to download code into the main instruction cache or the mini instruction cache through JTAG. Two additional functions are supported to allow lines to be invalidated in the instruction cache. The following table shows the cache functions supported through JTAG.

Table 13-18. LDIC Cache Functions

Function             Encoding        Arguments: Address          Arguments: # Data Words
Invalidate IC Line   0b000           VA of line to invalidate    0
Invalidate Mini IC   0b001           -                           0
Load Main IC         0b010           VA of line to load          8
Load Mini IC         0b011           VA of line to load          8
RESERVED             0b100-0b111

Invalidate IC line invalidates the line in the instruction cache containing the specified virtual address. If the line is not in the cache, the operation has no effect. I
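A host-side tool might represent the function encodings from Table 13-18 as follows. This is a sketch only: the names `LdicFunction`, `DATA_WORDS`, `NEEDS_ADDRESS` and `decode_ldic` are hypothetical and not part of any Intel tool or API; only the encodings and word counts come from the table.

```python
from enum import IntEnum

class LdicFunction(IntEnum):
    INVALIDATE_IC_LINE = 0b000
    INVALIDATE_MINI_IC = 0b001
    LOAD_MAIN_IC = 0b010
    LOAD_MINI_IC = 0b011

# Number of data words that accompany each function.
DATA_WORDS = {
    LdicFunction.INVALIDATE_IC_LINE: 0,
    LdicFunction.INVALIDATE_MINI_IC: 0,
    LdicFunction.LOAD_MAIN_IC: 8,
    LdicFunction.LOAD_MINI_IC: 8,
}

# Functions that take a virtual address argument.
NEEDS_ADDRESS = {
    LdicFunction.INVALIDATE_IC_LINE,
    LdicFunction.LOAD_MAIN_IC,
    LdicFunction.LOAD_MINI_IC,
}

def decode_ldic(encoding):
    """Map a 3-bit encoding to (function, data word count); reject reserved values."""
    if not 0 <= encoding <= 0b111:
        raise ValueError("encoding must fit in 3 bits")
    if encoding >= 0b100:
        raise ValueError("encodings 0b100-0b111 are reserved")
    func = LdicFunction(encoding)
    return func, DATA_WORDS[func]

print(decode_ldic(0b010))
```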
299. ontinue to return. Rd C shows the Intel 80200 processor requesting another access after Hold has been deasserted. If Hold had continued to be asserted, another bus master could take control of the request bus.

[Figure: Hold Assertion. Timing diagram, 0 ns to 100 ns, showing MCLK, ADS#/LEN[2], Lock/LEN[1], W/R#/LEN[0], A, Hold, HldA, DValid, CWF, D and Abort for requests Rd A, Rd B and Rd C.]

Bus Controller 11

11.1 Introduction

The Intel 80200 processor based on Intel XScale microarchitecture (compliant with the ARM Architecture V5TE) Bus Controller Unit (BCU) is responsible for accessing off-chip memory. It initiates bus cycles as documented in Chapter 10, External Bus. The BCU is capable of queuing four outstanding transactions. This improves the performance of the processor because it does not need to wait for the result of a memory transaction before initiating another. If enabled by software, the BCU can protect data with an Error Correcting Code (ECC). The BCU has software-accessible state in the form of coprocessor registers. All BCU registers reside in Coprocessor 13 (CP13).
300. or stalled on a data cache access that may overflow the data cache buffers. This is calculated by dividing PMN0 by PMN1. This statistic lets you know if the duration event cycles are due to many requests or are attributed to just a few requests. If the average is high, then the Intel 80200 processor may be starved of the bus external to the Intel 80200 processor.
The percentage of total execution cycles the processor stalled because a Data Cache request buffer was not available. This is calculated by dividing PMN0 by CCNT, which was used to measure total execution time.

12.5.5 Stall/Writeback Statistics

When an instruction requires the result of a previous instruction and that result is not yet available, the Intel 80200 processor stalls in order to preserve the correct data dependencies. PMN0 counts the number of stall cycles due to data dependencies. Not all data dependencies cause a stall; only the following dependencies cause such a stall penalty:

Load-use penalty: attempting to use the result of a load before the load completes. To avoid the penalty, software should delay using the result of a load until it is available. This penalty shows the latency effect of data cache access.
Multiply/Accumulate-use penalty: attempting to use the result of a multiply or multiply-accumulate operation before the operation completes. Again, to avoid the penalty, software should delay using the result until it is available.
ALU use penalty: there
301. ore. This feature allows software to conserve power by matching the core frequency to the current workload. Register CCLKCFG (see Section 7.3.3, Registers 6-7: Clock and Power Management on page 7-21) controls the clock multiplier.

Table 8-2. Software CCLK Configuration

CCLKCFG value   Multiplier for CLK   Example CCLK Frequency (MHz, assuming a CLK frequency of 66 MHz)
0               reserved             Unpredictable
1               3                    200
2               4                    266
3               5                    333
4               6                    400
5               7                    466
6               8                    533
7               9                    600
8               10                   666
9               11                   733
10-15           reserved             Unpredictable

The Intel 80200 processor supports low voltage operation with a supply as low as 0.95 V. At lower voltages, not all CCLK configurations are available. See the Intel 80200 processor Datasheet for voltage/frequency information.

Changing CCLK frequency is similar to entering a low power mode. First, the core is stalled waiting for all processing to complete; second, the new configuration is programmed into CCLKCFG; and finally, the core waits for the PLL to re-lock. The exact code sequence is shown in Example 8-1. If there are no external bus transactions, this procedure takes approximately two thousand CLK cycles, the same time it takes to transition out of reset. After the Intel 80200 processor resets, the value in CCLK
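The table follows a simple pattern: for the defined values 1 through 9, the multiplier is the CCLKCFG value plus 2. The sketch below encodes that observation. The function names are hypothetical, and the default CLK of 200/3 MHz (nominally "66 MHz") is an assumption chosen so that the truncated results match the example frequency column.

```python
from fractions import Fraction

def cclk_multiplier(cclkcfg):
    """Clock multiplier selected by a CCLKCFG value (1..9); other values are reserved."""
    if not 1 <= cclkcfg <= 9:
        raise ValueError("CCLKCFG values 0 and 10-15 are reserved (unpredictable)")
    return cclkcfg + 2

def cclk_mhz(cclkcfg, clk_mhz=Fraction(200, 3)):
    """Example CCLK frequency in MHz, truncated to an integer as in the table."""
    return int(clk_mhz * cclk_multiplier(cclkcfg))

example_column = [cclk_mhz(value) for value in range(1, 10)]
print(example_column)
```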
302. pecified in the ARM Architecture Reference Manual. When software running on the Intel 80200 processor is vectored to an Interrupt Service Routine (ISR), it may query the ICU's Interrupt Source register (INTSRC) to quickly determine the source of the interrupt.

External Interrupts

The two external interrupts, FIQ and IRQ, go through synchronization logic before being sampled by the ICU. External interrupts must be held asserted until cleared at the interrupting source by software. The Intel 80200 processor does not latch the external interrupt signals. Enabled interrupts that are deasserted before software enters the interrupt service routine cause UNPREDICTABLE behavior.

9.3 Programmer Model

Software has access to three registers in the ICU. INTCTL is used to enable or disable (mask) individual interrupts. As mentioned, masking of all interrupts may still be accomplished via the CPSR register in the core. INTSRC is a read-only register that records all currently active interrupt sources. Even if an interrupt is masked, software may use INTSRC to test for its source. INTSTR is used to direct internal interrupts to either FIQ or IRQ.

Figure 9-1. Interrupt Controller Block Diagram
INTSTR Interrupt Steering Register INTCTL Interrupt Control Register Steer Control Mask Cont
303. peculatively start load of next node
    ADD   R9, R9, R3      ; Add into accumulator
    MOVS  R0, R1          ; Advance to next node. At end of list?
    BNE   sumList         ; If not then loop

Debug Events

Debug events are covered in Section 13.5, Debug Exceptions on page 13-6.

Memory Management 3

This chapter describes the memory management unit implemented in the Intel 80200 processor based on Intel XScale microarchitecture, which is compliant with the ARM Architecture V5TE.

3.1 Overview

The Intel 80200 processor implements the Memory Management Unit (MMU) Architecture specified in the ARM Architecture Reference Manual. To accelerate virtual to physical address translation, the Intel 80200 processor uses both an instruction Translation Look-aside Buffer (TLB) and a data TLB to cache the latest translations. Each TLB holds 32 entries and is fully associative. Not only do the TLBs contain the translated addresses, but also the access rights for memory references.

If an instruction or data TLB miss occurs, a hardware translation-table-walking mechanism is invoked to translate the virtual address to a physical address. Once translated, the physical address is placed in the TLB along with the access rights and attributes of the page or section. These translations can also be locked down in either TLB to guarantee the performance of critical routines.

The Intel 80200 processor allows system software to associat
304. perform its subtask, with the exception of the MAC unit.

ARM V5 Instruction Execution

Figure B-1 uses arrows to show the possible flow of instructions in the pipeline. Instruction execution flows from the F1 pipestage to the RF pipestage. The RF pipestage may issue a single instruction to either the X1 pipestage or the MAC unit (multiply instructions go to the MAC, while all others continue to X1). This means that M1 or X1 is idle. All load/store instructions are routed to the memory pipeline after the effective addresses have been calculated in X1.

The ARM v5 bx (branch and exchange) instruction, which is used to branch between ARM and THUMB code, causes the entire pipeline to be flushed. The bx instruction is not dynamically predicted by the BTB. If the processor is in Thumb mode, then the ID pipestage dynamically expands each Thumb instruction into a normal ARM v5 RISC instruction and execution resumes as usual.

Pipeline Stalls

The progress of an instruction can stall anywhere in the pipeline. Several pipestages may stall for various reasons. It is important to understand when and how hazards occur in the Intel 80200 processor pipeline. Performance degradation can be significant if care is not taken to minimize pipeline stalls.

B.2.3 Main Execution Pipeline

F1
305. poral evenness over this space. This is very difficult, if not impossible, for a compiler to do. Most of the input needed to best estimate how to distribute the code comes from profiling followed by compiler-based two-pass optimizations.

B.4.1.4 Locking Code into the Instruction Cache

One very important instruction cache feature is the ability to lock code into the instruction cache. Once locked into the instruction cache, the code is always available for fast execution. Another reason for locking critical code into cache is that, with the round-robin replacement policy, eventually the code is evicted, even if it is a very frequently executed function. Key code components to consider for locking are:

Interrupt handlers
Real-time clock handlers
OS critical code
Time-critical application code

The disadvantage to locking code into the cache is that it reduces the cache size for the rest of the program. How much code to lock is very application dependent and requires experimentation to optimize. Code placed into the instruction cache should be aligned on a 1024-byte boundary and placed sequentially together as tightly as possible, so as not to waste precious memory space. Making the code sequential also ensures even distribution across all cache ways. Though it is possible to choose randomly
306. possible, however, that the CP15 side-effect takes place before CPWAIT completes or is issued. Programmers should take care that this does not affect the correctness of their code.

2.3.4 Event Architecture

2.3.4.1 Exception Summary

Table 2-11 shows all the exceptions that the Intel 80200 processor may generate, and the attributes of each. Subsequent sections give details on each exception.

Table 2-11. Exception Summary

Exception Description      Exception Type (a)      Precise   Updates FAR
Reset                      Reset                   N
FIQ                        FIQ                     N         N
IRQ                        IRQ                     N         N
External Instruction       Prefetch                Y         N
Instruction MMU            Prefetch                Y         N
Instruction Cache Parity   Prefetch                Y         N
Lock Abort                 Data                    Y         N
MMU Data                   Data                    Y         Y
External Data              Data                    N         N
Data Cache Parity          Data                    N         N
Software Interrupt         Software Interrupt      Y         N
Undefined Instruction      Undefined Instruction   Y         N
Debug Events (b)           varies                  varies    N

a. Exception types are those described in the ARM, section 2.5.
b. Refer to Chapter 13, Software Debug for more details.

2.3.4.2 Event Priority

The Intel 80200 processor follows the exception priority specified in the ARM Architecture Reference Manual. The processor has additional exceptions that might be generated while debugging. For information on these debug exceptions, see Chapter 13, Software Debug.

Table 2-12. Ev
307. A.3.4 Write Buffer Behavior

Definition of Coalescing: Coalescing means bringing together a new store operation with an existing store operation already resident in the write buffer. The new store is placed in the same write buffer entry as an existing store when the address of the new store falls in the 4-word aligned address of the existing entry. This includes, in PCI terminology, write merging, write collapsing, and write combining.

There is a difference in how stores are coalesced to existing entries in the write buffer. When coalescing is enabled, SA-110 only coalesces to the last entry placed in the write buffer. The Intel 80200 processor can coalesce to any entry in the write buffer. The Intel 80200 processor also added a global coalesce disable bit located in the Control Register (CP15, register 1, opcode_2 = 1).

Another difference between SA-110 and the Intel 80200 processor is that the write buffer is always enabled on the Intel 80200 processor. Bit 3 of the Control Register (CP15, register 1, opcode_2 = 0) was used in SA-110 to enable/disable the write buffer. For the Intel 80200 processor, this bit is always set to 1.

Memory references are rearranged if that would cause incorrect program behavior, see Section 6.5, Write Buffer/Fill Buffer Oper
308. r. For all writes not from the write buffer (non-cacheable writes and cache line evictions), the writes are simple, and the byte enables on the data bus are asserted for the contiguous bytes specified by the address and the length specified in the request.

Writes coming from write buffers can look somewhat different. Due to coalescing in the write buffers, it is possible to get a single write request on the bus writing out a non-contiguous byte pattern. The write buffers temporarily hold outgoing store data in 4-word aligned blocks, and later stores (byte, short, or word) to the same block can be merged into or overwrite previous data. Once a given write buffer is next in line for access to the bus, merging to it stops, and whatever pattern of bytes is valid in that write buffer determines the type of write transaction sent out. The byte enables on the bus indicate which bytes within that word are valid and need to be written.

Even if only one byte is valid in the coalesce buffer, a word store goes out. The byte enables are only asserted for the one byte, however. This means that a single byte write request to address 0x2402 can be requested in two valid ways: a non-cacheable strb instruction causes a write of Len = 0x0 (byte) to A = 0x2402 with only one byte enable asserted when the data is driven, whereas a coalesce buffer drain could cause a write of Len = 0x2 (word) to A = 0x2400 with only one byte enable asserted. The same applies for two-byte stores. Notice
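The two equivalent encodings of a byte write to 0x2402 can be illustrated with a short sketch. The function names and the dictionary layout are hypothetical, and the mapping of a byte's address to a byte-enable bit (bit number = address modulo 4) is an assumption made for illustration; the actual lane assignment depends on bus width and endianness.

```python
LEN_BYTE = 0x0   # Len encoding for a byte request, as given in the text
LEN_WORD = 0x2   # Len encoding for a word request, as given in the text

def strb_request(addr):
    """Non-cacheable STRB: a byte request issued at the byte address."""
    return {"len": LEN_BYTE, "addr": addr, "byte_enables": 1 << (addr & 3)}

def coalesced_word_request(valid_byte_addrs):
    """Write buffer drain of one word in which only some bytes are valid."""
    addrs = sorted(valid_byte_addrs)
    word_addr = addrs[0] & ~3
    if any(a & ~3 != word_addr for a in addrs):
        raise ValueError("all bytes must fall in the same aligned word")
    enables = 0
    for a in addrs:
        enables |= 1 << (a & 3)
    return {"len": LEN_WORD, "addr": word_addr, "byte_enables": enables}

direct = strb_request(0x2402)
drained = coalesced_word_request([0x2402])
print(direct)
print(drained)
```

Both requests assert the same single byte enable; they differ only in the Len encoding and in the address presented on the bus.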
309. r C library routines.
Section B.7, Optimizations for Size: This chapter contains optimizations that reduce the size of the generated code. Thumb optimizations are also included.

B.2 Intel 80200 Processor Pipeline

One of the biggest differences between the Intel 80200 processor and first-generation Intel StrongARM processors is the pipeline. Many of the differences are summarized in Figure B-1. This section provides a brief description of the structure and behavior of the Intel 80200 processor pipeline.

B.2.1 General Pipeline Characteristics

While the Intel 80200 processor pipeline is scalar and single issue, instructions may occupy all three pipelines at once. Out-of-order completion is possible. The following sections discuss general pipeline characteristics.

B.2.1.1 Number of Pipeline Stages

The Intel 80200 processor has a longer pipeline (7 stages versus 5 stages), which operates at a much higher frequency than its predecessors do. This allows for greater overall performance. The longer Intel 80200 processor pipeline has several negative consequences, however:

Larger branch misprediction penalty (4 cycles in the Intel 80200 processor instead of 1 in Intel StrongARM). This is mitigated by dynamic branch prediction.
Larger load use delay (LUD). LUDs
310. r mode, debug exceptions are handled like ARM prefetch aborts or ARM data aborts, depending on the cause of the exception. When a debug exception occurs, the processor switches to abort mode and branches to a debug handler using the pre-fetch abort vector or data abort vector. The debugger then communicates with the debug handler to access processor state or memory contents.

13.4 Debug Control and Status Register (DCSR)

The DCSR register is the main control register for the debug unit. Table 13-1 shows the format of the register. The DCSR register can be accessed in privileged modes by software running on the core, or by a debugger through the JTAG interface. Refer to Section 13.11.2, SELDCSR JTAG Register for details about accessing DCSR through JTAG.

Table 13-1. Debug Control and Status Register (DCSR) (Sheet 1 of 2)

[Register bit-field diagram, bits 31:0]

Bits 31. Access: SW Read/Write, JTAG Read-Only. Global Enable (GE): 0 disables all debug functionality, 1 enables all debug functionality. Reset value: 0. TRST value: unchanged.
Bits 30. Access: SW Read-Only, JTAG Read/Write. Halt Mode (H): 0 Monitor Mode, 1 Halt Mode. Reset value: unchanged. TRST value: 0.
Bits 29:24. Access: Read-undefined, Wri
311. r modes (idle, sleep).

Clocking

CLK is the input reference clock for the Intel 80200 processor. CLK accepts an input clock frequency of 33 to 66 MHz. The Intel 80200 processor uses an internal PLL to lock to the input clock and multiplies the frequency by a variable multiplier to produce a high-speed core clock (CCLK). This multiplier is initially configured by the PLL configuration pin (PLLCFG) and can be changed anytime later by software.

Table 8-1 shows the possible clock multipliers immediately after the reset sequence. PLLCFG can select between a multiplier of three and six. PLLCFG is sampled when RESET transitions from low to high.

Table 8-1. Reset CCLK Configuration

PLLCFG Value Sampled at Deassertion of RESET   Clock Multiplier
0                                              3
1                                              6 (a)

a. Lower speed grade parts may not provide this; instead they substitute the highest supported multiplier.

MCLK is the input memory clock for the Intel 80200 processor. MCLK is asynchronous with respect to CCLK and supports frequencies up to 100 MHz. The ratio of MCLK to CCLK must be 1:3 or less. For example, if CCLK is 200 MHz, MCLK is restricted to 66 MHz or less.

Normally CLK and MCLK support 40%/60% duty cycle inputs. At higher input frequencies, MCLK may impose a stricter requirement, such as 50%/50%. See the latest Intel 80200 Processor based on Intel XScale Datasheet for details.

Software has the ability to change the frequency of CCLK without having to reset the c
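The two MCLK limits stated above (at most 100 MHz, and at most one third of CCLK) and the reset multiplier selection can be combined into a small helper. This is a sketch with hypothetical function names; it assumes frequencies are given in MHz and ignores speed-grade restrictions and the duty cycle requirements.

```python
RESET_MULTIPLIER = {0: 3, 1: 6}   # PLLCFG value sampled at deassertion of RESET

def reset_cclk_mhz(clk_mhz, pllcfg):
    """CCLK immediately after reset, for a CLK input of 33 to 66 MHz."""
    return clk_mhz * RESET_MULTIPLIER[pllcfg]

def max_mclk_mhz(cclk_mhz):
    """Highest MCLK allowed: 100 MHz ceiling and MCLK:CCLK ratio of 1:3 or less."""
    return min(100.0, cclk_mhz / 3)

def mclk_allowed(mclk_mhz, cclk_mhz):
    return mclk_mhz <= max_mclk_mhz(cclk_mhz)

print(max_mclk_mhz(200))        # roughly 66.7, i.e. "66 MHz or less"
print(max_mclk_mhz(600))        # capped by the 100 MHz limit
print(mclk_allowed(100, 200))
```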
312. r of writeback operations (PMN1 of mode Stall/Writeback) and the second run could monitor the total number of data cache accesses (PMN0 of mode Data Cache Efficiency). From the results, a percentage of writeback operations to the total number of data accesses can be derived.

12.7 Examples

In this example, the events selected with the Instruction Cache Efficiency mode are monitored and CCNT is used to measure total execution time. Sampling time ends when PMN0 overflows, which generates an IRQ interrupt.

Example 12-1. Configuring the Performance Monitor

    ; Configure PMNC with the following values:
    ;   evtCount0 = 7, evtCount1 = 0   instruction cache efficiency
    ;   inten = 0x7                    set all counters to trigger an interrupt on overflow
    ;   C = 1                          reset CCNT register
    ;   P = 1                          reset PMN0 and PMN1 registers
    ;   E = 1                          enable counting
    MOV R0, #0x7777
    MCR P14, 0, R0, C0, c0, 0          ; write R0 to PMNC
    ; Counting begins

Counter overflow can be dealt with in the IRQ interrupt service routine as shown below:

Example 12-2. Interrupt Handling

    IRQ_INTERRUPT_SERVICE_ROUTINE:
    ; Assume that performance counting interrupts are the only IRQ in the system
    MRC P14, 0, R1, C0, c0, 0          ; read the PMNC register
    BIC R2, R1, #1                     ; clear the enable bit
    MCR P14, 0, R2, C0, c0, 0          ; clear interrupt flag and disable counting
    MRC P14, 0, R3
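The value 0x7777 written to PMNC in Example 12-1 can be rebuilt from its fields. The field names come from the example's comments, but the bit positions used here are an assumption, since this excerpt does not include the PMNC register table: E at bit 0, P at bit 1, C at bit 2, inten at bits 6:4, an overflow flag field at bits 10:8, evtCount0 at bits 19:12 and evtCount1 at bits 27:20. The function name `pmnc_value` is hypothetical.

```python
def pmnc_value(evt_count0, evt_count1, inten, flags=0, c=0, p=0, e=0):
    """Pack PMNC fields into a 32-bit value (bit positions are assumed, see above)."""
    if not (0 <= evt_count0 <= 0xFF and 0 <= evt_count1 <= 0xFF):
        raise ValueError("event counts are 8-bit fields")
    if not (0 <= inten <= 0x7 and 0 <= flags <= 0x7):
        raise ValueError("inten and flags are 3-bit fields")
    return ((evt_count1 << 20) | (evt_count0 << 12) | (flags << 8)
            | (inten << 4) | (c << 2) | (p << 1) | e)

# Settings from Example 12-1; the flag field is also written as 0x7, which is
# assumed here to clear any pending overflow flags.
value = pmnc_value(evt_count0=7, evt_count1=0, inten=0x7, flags=0x7, c=1, p=1, e=1)
print(hex(value))
```

Under the assumed layout, the fields listed in the example's comments account for 0x7077, and the remaining 0x700 corresponds to the flag field.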
313. ramming Model 2.3.3 Additions to CP15 Functionality: To accommodate the functionality in the Intel 80200 processor, registers in CP15 and CP14 have been added or augmented. See Chapter 7, Configuration, for details. At times it is necessary to be able to guarantee exactly when a CP15 update takes effect. For example, when enabling memory address translation (turning on the MMU), it is vital to know when the MMU is actually guaranteed to be in operation. To address this need, a processor-specific code sequence is defined for each Intel StrongARM processor. For the Intel 80200 processor, the sequence, called CPWAIT, is shown in Example 2-1 on page 2-11. Example 2-1, CPWAIT: Canonical method to wait for CP15 update:
    ; The following macro should be used when software needs to be
    ; assured that a CP15 update has taken effect.
    ; It may only be used while in a privileged mode, because it accesses CP15.
    MACRO CPWAIT
    MRC P15, 0, R0, C2, C0, 0          ; arbitrary read of CP15
    MOV R0, R0                         ; wait for it
    SUB PC, PC, #4                     ; branch to next instruction
    ; At this point, any previous CP15 writes are guaranteed to have taken effect.
    ENDM
When setting multiple CP15 registers, system software may opt to delay the assurance of their update. This is accomplished by emitting CPWAIT only after the sequence of MCR instructions. The CPWAIT sequence guarantees that CP15 side effects are complete by the time the CPWAIT is complete. It is
314. rap around mode when the buffer does not wrap around, the trace can be reconstructed by starting from the point in the code where the trace buffer was first enabled. The difficulty occurs in wrap-around mode when the trace buffer wraps around at least once. In this case the debugger gets a snapshot of the last N control flow changes in the program, where N <= size of buffer. The debugger does not know the starting address of the oldest entry read from the trace buffer. The checkpoint registers provide reference addresses to help reduce this problem. Checkpoint Register (CHKPTx): reset value: unpredictable. Bits 31:0, access Read/Write, description: CHKPTx, target address for the corresponding entry in the trace buffer. The two checkpoint registers (CHKPT0, CHKPT1) on the Intel 80200 processor provide the debugger with two reference addresses to use for reconstructing the trace history. When the trace buffer is enabled, reading and writing to either checkpoint register has unpredictable results. When the trace buffer is disabled, writing to a checkpoint register sets the register to the value written. Reading the checkpoint registers returns the value of the register. In normal usage t
315. rchitecture also specifies the caching policies for the instruction cache and data memory. These policies are specified as page attributes and include: identifying code as cacheable or non-cacheable; selecting between the mini-data cache or data cache; write-back or write-through data caching; enabling data write allocation policy; and enabling the write buffer to coalesce stores to external memory. Chapter 3, Memory Management, discusses this in more detail. Instruction Cache: The Intel 80200 processor implements a 32-Kbyte, 32-way set associative instruction cache with a line size of 32 bytes. All requests that miss the instruction cache generate a 32-byte read request to external memory. A mechanism to lock critical code within the cache is also provided. Chapter 4, Instruction Cache, discusses this in more detail. Branch Target Buffer: The Intel 80200 processor provides a Branch Target Buffer (BTB) to predict the outcome of branch type instructions. It provides storage for the target address of branch type instructions and predicts the next address to present to the instruction cache when the current instruction address is that of a branch. The BTB holds 128 entries. See Chapter 5, Branch Target Buffer, for more details. Data Cache: The Intel 80200 processor implements a 32-Kbyte, 32-way set associative data cache and a 2-Kbyte, 2-way set associative mini-data cache. Each cache has a line size of 32 bytes s
316. re compliant with the ARM Architecture V5TE implements Design For Test (DFT) techniques to ensure quality and reliability. This appendix describes those techniques. Introduction: Testing VLSI circuits is critical for achieving high outgoing quality levels. Unfortunately, the cost of testing is already one of the largest portions of the final product cost and is getting more expensive as the complexity of VLSI chips grows. Reducing test cost is thus one of the critical issues that must be resolved in order to make a cost-competitive VLSI chip, and area and/or delay minimization at the cost of increased test time may not be a good design choice. Testability is the ability or ease with which tests can be generated for and applied to the CUT (Circuit Under Test). Design For Testability (DFT) stands for making design decisions early in the design cycle that ease the testing, test generation, and fault grading process for a given circuit. JTAG (IEEE 1149.1): The goal of JTAG, also known as IEEE 1149.1, is to ensure that chips containing a common denominator of DFT circuitry make testing of boards containing these chips significantly less costly and more effective. JTAG consists of a TAP controller, Boundary Scan register, instruction and data registers, and dedicated signals including TDI, TDO, TCK, TMS, and TRST. The Intel 80200 processor provides test features compatible with IEEE Standard Test Access Port and Boundary Scan Architecture (IEEE Std. 1149.1
317. re exception handler needs to unlock the instruction cache, invalidate the cache, and then re-lock the code in before it returns to the faulting instruction. 4.2.6 Instruction Fetch Latency: Because the Intel 80200 processor core is clocked at a multiple of the external bus clock, and the two clocks are truly asynchronous, an exact fetch latency is difficult to derive. In general, if a fetch can be directly issued (no other memory accesses are intervening), then the delay to the first instruction is approximately 8 + W bus clocks, where W is the number of memory wait states. As an example, in a system with 2-wait-state memory (W = 2), an unoccluded fetch would require about 10 bus clocks to get the first instruction. If this system were running with a core/bus clock ratio of 6, then the core would perceive this as a latency of about 60 cycles. These numbers are best case and assume that no other active memory transactions exist. Refer to Chapter 10, External Bus, for more information on External Bus signal definitions and request timings. 4.2.7 Instruction Cache Coherency: The instruction cache does not detect modification to program memory by loads, stores, or actions of other bus masters. Several situations may require program memory modification, such as uploading code from disk. The application program is
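The best-case fetch latency estimate in the excerpt above (about 8 + W bus clocks, scaled by the core/bus clock ratio) can be written out as a small calculation. This Python sketch is an illustration only; the function names are hypothetical, and the result is an approximation that assumes no other memory transactions are active, as the text states.

```python
# Hypothetical helpers for the best-case instruction fetch latency estimate.

def fetch_latency_bus_clocks(wait_states: int) -> int:
    """Approximate delay to the first instruction, in bus clocks: 8 + W."""
    if wait_states < 0:
        raise ValueError("wait states cannot be negative")
    return 8 + wait_states


def fetch_latency_core_cycles(wait_states: int, core_to_bus_ratio: int) -> int:
    """The same delay as perceived by the core, in core cycles."""
    return fetch_latency_bus_clocks(wait_states) * core_to_bus_ratio
```

With W = 2 and a core/bus ratio of 6, this reproduces the figures quoted in the text: about 10 bus clocks, or about 60 core cycles.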
318. re trennen teen 9 Failt Status Keefer C 10 Fault Address Register iecoris nat testers o Gel Geo EDS ree en Eege rais 10 Cache Functions M 11 TEB EE 13 Cache Lockdown Punctions eec oe terrre treten a P SOSPECHA driers PU TEE NEA 14 Data Cache Lock Register ertet RD ntur Eo Eee RI DT ee 14 TLB Lockdown PUnctOms n certi Ite irte EE EE degen 15 RR EE 16 Process Reeg M C eu We E ee eege A 16 Accessing the Debug Registers ic e enrii neri oi th Cete Fete S E tan HERE EAR eases RH AER ine asas 17 Coprocessor Access ROSIS E se ori coto ODE ERR E Eege E capes tonics PR NE drengend 19 CRIA Registers enti REOR RU UIS LO en C reet oes stein 20 Accessing the Performance Monitoring Registers essere eem enne nennen 20 PWRMODE R6EiSI6r t cte oe ree euo pnr EE RR Dee ederet 21 Clock and Power Management iie retreat 21 COLE GEG Registers as tea o PERI EE 21 Accessing the Debug Registers 00 0 0 ccesceeccesseeessessnecececseecesceesaeeeseceneceaeessaeceaeeeseecnaeeeseecseesaeeeaeeeeeeneeeeaee 22 Reset CCEK Combis uration en rer eg rei SEHE ECHTE T Pe ort P y ee v Geen aeons 1 Software CCLK Cont purations ssc e cieceete er dee t oen tto se ERE e Peg Ib a E esie roto age e ERES RENE Ie ERE dee 2 Low Power Mod s eR E Ui titu D preti tei ete e eb eto oen 5 IN RT EE TE 5 Developer s Manual March 2003 xiii Intel 80200 Processor based
319. register 14 serial communication over the JTAG interface and a trace buffer. Registers 8 and 9 are used for the serial interface, and registers 10 through 13 support a 256-entry trace buffer. Registers 14 and 15 are the debug link register and debug SPSR (saved program status register). These registers are explained in more detail in Chapter 13, Software Debug. Opcode_2 and CRm should be zero. Accessing the Debug Registers (function: CRn register; instruction):
    Access Transmit Debug Register (TX): 0b1000; MRC p14, 0, Rd, c8, c0, 0 and MCR p14, 0, Rd, c8, c0, 0
    Access Receive Debug Register (RX): 0b1001
    Access Debug Control and Status Register (DBGCSR): 0b1010; MCR p14, 0, Rd, c10, c0, 0 and MRC p14, 0, Rd, c10, c0, 0
    Access Trace Buffer Register (TBREG): 0b1011
    Access Checkpoint 0 Register (CHKPT0): 0b1100
    Access Checkpoint 1 Register (CHKPT1): 0b1101
    Access Transmit and Receive Debug Control: 0b1110; MCR p14, 0, Rd, c14, c0, 0 and MRC p14, 0, Rd, c14, c0, 0
Chapter 8, System Management: This chapter describes the clocking and power management features of the Intel 80200 processor based on Intel XScale microarchitecture (compliant with the ARM Architecture V5TE), along with reset details. Main features include a software-controlled internal clock frequency and two low powe
320. responsible for synchronizing code modification and invalidating the cache. In general, software must ensure that modified code space is not accessed until modification and invalidating are completed. To achieve cache coherence, instruction cache contents can be invalidated after code modification in external memory is complete. Refer to Section 4.3.3, Invalidating the Instruction Cache, on page 4-7 for the proper procedure for invalidating the instruction cache. If the instruction cache is not enabled, or code is being written to a non-cacheable region, software must still invalidate the instruction cache before using the newly written code. This precaution ensures that state associated with the new code is not buffered elsewhere in the processor, such as the fetch buffers or the BTB. Naturally, when writing code as data, care must be taken to force it completely out of the processor into external memory before attempting to execute it. If writing into a non-cacheable region, flushing the write buffers is sufficient precaution (see Section 7.2.8 for a description of this operation). If writing to a cacheable region, then the data cache should be submitted to a Clean/Invalidate operation (see Section 6.3.3.1) to ensure coherency. 4.3 Instruction Cache Control. 4.3.1 Instruction Cache State at RESET: After reset, the
321. rfering with that normal operation. The sample instruction (opcode 0b00001, IEEE 1149.1 required) causes Boundary Scan register cells associated with outputs to sample the value being driven by or to the processor. When the TAP controller is in the Update-DR state, the preload instruction occurs on the falling edge of TCK. This instruction causes the transfer of data held in the Boundary Scan cells to the slave register cells. Typically the slave latched data is then applied to the system outputs by means of the extest instruction. dbgrx (opcode 0b00010): See Chapter 13, Software Debug. clamp (opcode 0b00100): The clamp instruction allows the state of the signals driven from Intel 80200 processor pins to be determined from the boundary scan register while the Bypass register is selected as the serial path between TDI and TDO. Signals driven from the component pins do not change while the clamp instruction is selected. ldic (opcode 0b00111): See Chapter 13, Software Debug. highz (opcode 0b01000): The execution of highz generates a signal that is read on the rising edge of RESET. If this signal is found asserted, the device floats all its output pins. Also, when this instruction is active, the Bypass register is connected between TDI and TDO. This register can be accessed via the JTAG Test Access Port throughout the device operation. Access to the Bypass register can also be obtained with the bypass instruction. dcsr (opcode 0b01001): See Chapter 13, Software Debug. idcode is used in conjunc
322. rol Figure: interrupt controller block diagram showing the internal interrupt sources, the interrupt steering logic, the synchronized FIQ and IRQ pins, the internal FIQ and IRQ signals, and INTSRC (Interrupt Source Register). The ICU registers reside in Coprocessor 13 (CP13). They may be accessed/manipulated with the MCR, MRC, STC, and LDC instructions. The CRn field of the instruction denotes the register number to be accessed. The opcode_1, opcode_2, and CRm fields of the instruction should be zero. Most systems restrict access to CP13 to privileged processes. To control access to CP13, use the Coprocessor Access Register (see Section 7.2.15). An instruction that modifies an ICU register is guaranteed to take effect before the next instruction executes. For example, if an instruction masks an interrupt source, subsequent instructions execute in an environment in which the masked interrupt does not occur. The details of the ICU registers are discussed in the following sections. 9.3.1 INTCTL: INTCTL is used to specify what interrupts are disabled (masked). Table 9-1, Interrupt Control Register (CP13 register 0): reset value: writeable bits set to 0. Bits, Access, Description: Read-unpredictable / Write-as-Zero Mese
323. rst instruction. Once the first 8-byte word is read, it takes another six core cycles to read in the next two instructions, or a total of 78 to 108 clocks to fill a cache line. If the new instructions each execute in one core cycle, then the processor is stalled for 4 cycles waiting for the next pair of instructions. Further, if the next pair of instructions each execute in one cycle, the processor is again stalled for 4 more cycles. From this it is clear that executing non-cached instructions severely curtails the processor's performance. It is very important to do everything possible to minimize cache misses. Round Robin Replacement Cache Policy: Both the data and the instruction caches use a round robin replacement policy to evict a cache line. The simple consequence of this is that at some time every line is evicted, assuming a non-trivial program. The less obvious consequence is that predicting when and over which cache lines evictions take place is very difficult. This information must be gained by experimentation using performance profiling. Code Placement to Reduce Cache Misses: Code placement can greatly affect cache misses. One way to view the cache is to think of it as 32 sets of 32 bytes, which span an address range of 1024 bytes. When running, the code maps into 32 blocks modulo 1024 of cache space. Any sets which are overused thrash the cache. The ideal situation is for the software tools to distribute the code on a tem
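The "32 sets of 32 bytes spanning 1024 bytes" view of the cache in the excerpt above suggests a simple way to see which sets a region of code lands in. The Python sketch below is a hedged illustration of that mapping, not a tool from the manual: the function names are hypothetical, and it only counts how many 32-byte lines of a code range fall into each set.

```python
from collections import Counter

LINE_BYTES = 32   # cache line size, from the text
NUM_SETS = 32     # 32 sets x 32 bytes = 1024-byte span, from the text


def set_index(addr: int) -> int:
    """Set that an address maps to: line number modulo the number of sets."""
    return (addr // LINE_BYTES) % NUM_SETS


def lines_per_set(start: int, size: int) -> Counter:
    """Count how many cache lines of [start, start + size) fall in each set."""
    if size <= 0:
        raise ValueError("size must be positive")
    first_line = start // LINE_BYTES
    last_line = (start + size - 1) // LINE_BYTES
    return Counter(line % NUM_SETS for line in range(first_line, last_line + 1))
```

A contiguous 2048-byte routine touches every set exactly twice, whereas code scattered at 1024-byte strides piles into a single set; the second pattern is the kind of overuse the text warns about.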
324. rved 31:4. BM (BCU Mask), bit 3, Read/Write: Controls whether BCU interrupts are enabled. 0 = disable interrupt, 1 = enable interrupt. PM (PMU Mask), bit 2, Read/Write: Controls whether PMU interrupts are enabled. 0 = disable interrupt, 1 = enable interrupt. IM (IRQ Mask), bit 1, Read/Write: Enables external interrupts from the IRQ pin. 0 = disable interrupt, 1 = enable interrupt. FM (FIQ Mask), bit 0, Read/Write: Enables external interrupts from the FIQ pin. 0 = disable interrupt, 1 = enable interrupt. 9.3.2 INTSRC: The Interrupt Source register (INTSRC) indicates which interrupts are active. This register may be used by an ISR to determine quickly the source of an interrupt. Even if an interrupt is masked with INTCTL, software may still detect whether it is asserted by reading its bit from INTSRC. Table 9-2, Interrupt Source Register (CP13 register 4): reset value: undefined. Bits, Access, Description: FI (FIQ active): Holds the state of the synchronized FIQ signal. 0 = not interrupting, 1 = interrupting. II (IRQ active): Holds the state of the synchronized IRQ signal. 0 = not interrupting, 1 = interrupting. BI (BCU Interrupt Active): Holds the state of the BCU int
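The INTCTL mask bits listed above can be combined into a register value as follows. This Python sketch is an illustration only: the bit positions (FM = 0, IM = 1, PM = 2, BM = 3) are read from the flattened table in this excerpt and should be treated as an assumption to verify against the original table, and the function name is hypothetical. Writing the value to CP13 register 0 would still require an MCR instruction on the target.

```python
# Bit positions as read from the flattened INTCTL table (assumption).
FM_BIT = 0  # FIQ mask: 1 enables interrupts from the FIQ pin
IM_BIT = 1  # IRQ mask: 1 enables interrupts from the IRQ pin
PM_BIT = 2  # PMU mask: 1 enables PMU interrupts
BM_BIT = 3  # BCU mask: 1 enables BCU interrupts


def intctl_value(fiq: bool = False, irq: bool = False,
                 pmu: bool = False, bcu: bool = False) -> int:
    """Build an INTCTL value; reserved bits 31:4 are left as zero."""
    value = 0
    if fiq:
        value |= 1 << FM_BIT
    if irq:
        value |= 1 << IM_BIT
    if pmu:
        value |= 1 << PM_BIT
    if bcu:
        value |= 1 << BM_BIT
    return value
```

For example, enabling only the external IRQ and FIQ pins yields `0b0011`, and enabling all four sources yields `0xF`.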
325. rved for this clean operation
    ; R0 is the loop count; iterate 1024 times, which is the number of lines in the data cache
    ; Macro ALLOCATE performs the line-allocation cache operation on the
    ; address specified in register Rx.
    MACRO ALLOCATE Rx
    MCR P15, 0, Rx, C7, C2, 5
    ENDM
    MOV R0, #1024
    LOOP1:
    ALLOCATE R1                        ; Allocate a line at the virtual address specified by R1
    ADD R1, R1, #32                    ; Increment the address in R1 to the next cache line
    SUBS R0, R0, #1                    ; Decrement loop count
    BNE LOOP1
    ; Clean the Mini-data Cache
    ; Can't use line-allocate command, so cycle 2KB of unused data through.
    ; R2 contains the virtual address of a region of cacheable memory reserved for
    ; cleaning the Mini-data Cache
    ; R0 is the loop count; iterate 64 times, which is the number of lines in the Mini-data Cache
    MOV R0, #64
    LOOP2:
    LDR R3, [R2], #32                  ; Load and increment to next cache line
    SUBS R0, R0, #1                    ; Decrement loop count
    BNE LOOP2
    ; Invalidate the data cache and mini-data cache
    MCR P15, 0, R0, C7, C6, 0
The line-allocate operation does not require physical memory to exist at the virtual address specified by the instruction, since it does not generate a load/fill request to external memory. Also, the line-allocate operation does not set the 32 bytes of data associated with the line to any known value. Reading this data produces unpredictable results.
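The loop counts used in the cleaning routine above (1024 and 64) follow directly from the cache sizes and the 32-byte line size quoted elsewhere in this document. A minimal Python check, with a hypothetical function name, makes the arithmetic explicit.

```python
def cache_lines(cache_bytes: int, line_bytes: int = 32) -> int:
    """Number of lines in a cache of the given size and line length."""
    if cache_bytes % line_bytes != 0:
        raise ValueError("cache size must be a whole number of lines")
    return cache_bytes // line_bytes


DATA_CACHE_LINES = cache_lines(32 * 1024)   # 32-Kbyte data cache
MINI_DCACHE_LINES = cache_lines(2 * 1024)   # 2-Kbyte mini-data cache
```

The same numbers give the size of the reserved regions the routine walks: 1024 lines of 32 bytes for the data cache and 64 lines of 32 bytes (2 KB) for the mini-data cache.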
326. rwritten if the pointer does not select the expected way. To avoid this problem, the debug handler should be written to avoid placing critical code in either way of a set that is intended for dynamic code download. This allows code to be downloaded into either way, and the only code that is overwritten is the previously downloaded dynamic function. This method requires that space within the mini instruction cache be allocated for dynamic download, limiting the space available for the static Debug Handler. Also, the space available may not be suitable for a larger dynamic function. Once downloaded, a dynamic function essentially becomes part of the Debug Handler. Since it is in the mini instruction cache, it does not get overwritten by application code. It remains in the cache until it is replaced by another dynamic function or the lines where it is downloaded are invalidated. 2. Using the Main IC: The steps for downloading dynamic functions into the main instruction cache are similar to those for downloading into the mini instruction cache. However, using the main instruction cache has its advantages. Using the main instruction cache eliminates the problem of inadvertently overwriting static Debug Handler code by writing to the wrong way of a set, since the main and mini instruction caches are separate. The debug handler
327. s which occurs when there is a TLB miss. If the instruction TLB is disabled, PMN1 does not increment. Statistics derived from these two events: Instruction TLB miss-rate: this is derived by dividing PMN1 by PMN0. The average number of cycles it took to execute an instruction, commonly referred to as cycles-per-instruction (CPI): CPI can be derived by dividing CCNT by PMN0, where CCNT was used to measure total execution time. Data TLB Efficiency Mode: PMN0 totals the number of data cache accesses, which includes cacheable and non-cacheable accesses, mini-data cache accesses, and accesses made to locations configured as data RAM. Note that STM and LDM each count as several accesses to the data TLB, depending on the number of registers specified in the register list. LDRD registers two accesses. PMN1 counts the number of data TLB table-walks, which occur when there is a TLB miss. If the data TLB is disabled, PMN1 does not increment. The statistic derived from these two events is: Data TLB miss-rate: this is derived by dividing PMN1 by PMN0. 12.6 Multiple Performance Monitoring Run Statistics: Even though only two events can be monitored at any given time, multiple performance monitoring runs can be done, capturing different events from different modes. For example, the first run could monitor the numbe
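The derived statistics described above are plain ratios of the counter values. The Python sketch below shows the arithmetic on counter readings that have already been collected; the function names are hypothetical, and counter overflow handling is deliberately left out.

```python
def _ratio(numerator: int, denominator: int) -> float:
    """Divide two counter readings, rejecting an empty sample."""
    if denominator == 0:
        raise ValueError("PMN0 is zero: no events were counted")
    return numerator / denominator


def tlb_miss_rate(pmn1: int, pmn0: int) -> float:
    """TLB miss-rate: table-walks (PMN1) divided by accesses or instructions (PMN0)."""
    return _ratio(pmn1, pmn0)


def cycles_per_instruction(ccnt: int, pmn0: int) -> float:
    """CPI: total cycles (CCNT) divided by instructions executed (PMN0)."""
    return _ratio(ccnt, pmn0)
```

For instance, 5 table-walks over 1000 counted events is a miss-rate of 0.5%, and 3000 cycles over 1000 instructions is a CPI of 3.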
328. s processing an instruction, another instruction may not enter M1 unless the original instruction completes in the next cycle. The MAC unit can operate on 16-bit packed signed data. This reduces register pressure and memory traffic size. Two 16-bit data items can be loaded into a register with one LDR. The MAC can achieve throughput of one multiply per cycle when performing a 16-by-32-bit multiply. Behavioral Description: The execution of the MAC unit starts at the beginning of the M1 pipestage, where it receives two 32-bit source operands. Results are completed N cycles later (where N is dependent on the operand size) and returned to the register file. For more information on MAC instruction latencies, refer to Section 14.4, Instruction Latencies. An instruction that occupies the M1 or M2 pipestage also occupies the X1 or X2 pipestage, respectively. Each cycle, a MAC operation progresses from M1 to M5. A MAC operation may complete anywhere from M2 to M5. If a MAC operation enters M3-M5, it is considered committed because it modifies architectural state regardless of subsequent events. B.3 Basic Optimizations: This chapter outlines optimizations specific to ARM architecture. These optimizations have been modified to suit the Intel 80200 processor architecture where needed. B.3.1 Conditional Instructions: Th
329. s provide a useful way of manipulating bit fields. Bit field operations can be optimized as follows:
    ; Set the bit number specified by r1 in register r0
    mov r2, #1
    orr r0, r0, r2, asl r1
    ; Clear the bit number specified by r1 in register r0
    mov r2, #1
    bic r0, r0, r2, asl r1
    ; Extract the bit-value of the bit number specified by r1 of the
    ; value in r0, storing the value in r0
    mov r1, r0, asr r1
    and r0, r1, #1
    ; Extract the higher order 8 bits of the value in r0, storing
    ; the result in r1
    mov r1, r0, lsr #24
B.3.3 Optimizing the Use of Immediate Values: The Intel 80200 processor MOV or MVN instruction should be used when loading an immediate (constant) value into a register. Please refer to the ARM Architecture Reference Manual for the set of immediate values that can be used in a MOV or MVN instruction. It is also possible to generate a whole set of constant values using a combination of MOV, MVN, ORR, BIC, and ADD instructions. The LDR instruction has the potential of incurring a cache miss in addition to polluting the data and instruction caches. The code samples below illustrate cases when a combination of the above instructions can be used to set a register to a constant value: Set the mov Set the mvn Set the mov OEE Set the mov orr Set the mvn bic Set the mov orr add value of r
330. s the location(s) in the cache by deasserting the valid bit. Individual entries (lines) may be flushed or the entire cache may be flushed with one command. Once an entry is flushed in the cache it can no longer be used by the program. Reserved: A reserved field is a field that may be used by an implementation. If the initial value of a reserved field is supplied by software, this value must be zero. Software should not modify reserved fields or depend on any values in reserved fields. 1.3 Other Relevant Documents: Intel 80200 Processor based on Intel XScale Microarchitecture Datasheet, Intel Order 273414. ARM Architecture Version 5TE Specification, Document Number ARM DDI 0100E; this document describes Version 5TE of the ARM Architecture, which includes Thumb ISA and ARM DSP-Enhanced ISA. ARM Architecture Reference Manual, Document Number ARM DDI 0100B; this document describes Version 4 of the ARM Architecture. Intel XScale Microarchitecture Programming Reference Manual, Intel Order 273436. Intel 80312 I/O Companion Chip Developer's Manual, Intel Order 273410. StrongARM SA-1100 Microprocessor Developer's Manual, Intel Order 278088. StrongARM SA-110 Microprocessor Technical Reference Manual, Intel Order 278058. Chapter 2, Programming Model: This ch
331. set. Execution is redirected to the debug handler, allowing the debugger to perform any necessary initialization. The reset vector trap is the only debug exception that can occur with debug globally disabled (DCSR[31] = 0). Therefore, the debugger must also enable debug prior to exiting the handler to ensure all subsequent debug exceptions correctly break to the debug handler. 13.15.1.1 Setting up Override Vector Tables: The override default vector table intercepts the reset vector and branches to the debug handler when a debug exception occurs. If the vector table is relocated, the debug vector is relocated to address 0xffff0000. Thus, an override relocated vector table is required to intercept vector 0xffff0000 and branch to the debug handler. Both override vector tables also intercept the other debug exceptions, so they must be set up to either branch to a debugger-specific handler or go to the application's handlers. It is possible that the application modifies its vector table in memory, so the debugger may not be able to set up the override vector table to branch to the application's handlers. The Debug Handler may be used to work around this problem by reading memory and branching to the appropriate address. Vector traps can be used to get to the debug handler, or the override vector tables can redirect execution to a debug handler routine that examines memory and branches to the application's handler.
332. sor samples this pin to find the bus width in use. If it is 1, then the system is operating with a 32-bit bus; otherwise the Intel 80200 processor uses a 64-bit bus. This pin is sampled by the Intel 80200 processor while RESETOUT is asserted. 10.2.1 Request Bus: The request bus issues read or write requests from the Intel 80200 processor (or other bus master) to the chipset or memory controller. Each request takes two MCLK cycles. All signals should be sampled on the rising edge of MCLK. No data is ever transferred on the request bus. On the first cycle, ADS/LEN[2], Lock/LEN[1], and W/R/LEN[0] are used to carry the ADS, Lock, and W/R signals. A valid request is indicated by the ADS signal being asserted (low). On that same clock edge, the sampled value of A is the most significant 16 bits of the 32-bit address of the request. W/R indicates whether the request is a read or write from the Intel 80200 processor, and the Lock pin indicates whether there is an atomic pair of operations outstanding. On the second cycle of a request, ADS/LEN[2], Lock/LEN[1], and W/R/LEN[0] are used to carry LEN[2:0]. LEN is used to indicate the number of data bytes associated with the request. See Table 10-2 and Table 10-3 for information on how this signal is encoded. A has the least
333. space is limited, the debug handler also has a dynamic capability that allows a function to be downloaded when it is needed. There are three methods for implementing a dynamic debug handler: using the mini instruction cache, the main instruction cache, or external memory. Each method has its limitations and advantages. Section 13.14.5, Dynamically Loading IC After Reset, describes how to dynamically load the mini or main instruction cache. 1. Using the Mini IC: The static debug handler can support a command which can have functionality dynamically mapped to it. This dynamic command does not have any specific functionality associated with it until the debugger downloads a function into the mini instruction cache. When the debugger sends the dynamic command to the handler, new functionality can be downloaded or the previously downloaded functionality can be used. There are also variations in which the debug handler supports multiple dynamic commands, each mapped to a different dynamic function, or a single dynamic command that can branch to one of several downloaded dynamic functions based on a parameter passed by the debugger. Debug Handlers that allow code to be dynamically downloaded into the mini instruction cache must be carefully written to avoid inadvertently overwriting a critical piece of debug handler code. Dynamic code is downloaded to the way pointed to by the round-robin pointer. Thus, it is possible for critical debug handler code to be ove
334. specified register must be even (r0, r2, etc.). If this situation occurs, using LDRD/STRD instead of LDM/STM to do the same thing is more efficient, because LDRD/STRD issues in only one/two clock cycle(s), as opposed to LDM/STM, which issues in four clock cycles. Avoid LDRDs targeting R12; this incurs an extra cycle of issue latency. The LDRD instruction has a result latency of 3 or 4 cycles depending on the destination register being accessed (assuming the data being loaded is in the data cache).
    add r6, r7, r8
    sub r5, r6, r9
    ; The following ldrd instruction would load values
    ; into registers r0 and r1
    ldrd r0, [r3]
    orr r8, r1, #0xf
    mul r7, r0, r7
In the code example above, the ORR instruction would stall for 3 cycles because of the 4-cycle result latency for the second destination register of an LDRD instruction. The code shown above can be rearranged to remove the pipeline stalls:
    ; The following ldrd instruction would load values
    ; into registers r0 and r1
    ldrd r0, [r3]
    add r6, r7, r8
    sub r5, r6, r9
    mul r7, r0, r7
    orr r8, r1, #0xf
Any memory operation following a LDRD instruction (LDR, LDRD, STR and so on) would stall for 1 cycle.
    ; The str instruction below would stall for 1 cycle
    ldrd r0, [r3]
    str r4, [r5]
B.5.1.2 Scheduling Load and Store Multiple (LDM/STM): LDM and STM
335. sserted. Two cycles later, the Intel 80200 processor drives the data onto the data bus. For a write whose data all fits within one aligned 64-bit memory block (for a 64-bit bus) or a 32-bit aligned memory block (for a 32-bit bus), there is a single data cycle. For a burst write, there can be up to two (64-bit bus) or four (32-bit bus) data cycles, for a maximum write burst of four 32-bit words. There are eight byte enables (BE) associated with the D bus. Each byte enable corresponds to one byte of the bus. During a write cycle, the byte enable for each byte that is being written is asserted (low). More detail on write transactions is given below. Eight check bits (DCB) are also provided as part of the data bus. These bits are used for ECC. Section 10.2.7, ECC, on page 10-12 has more information on how the Intel 80200 processor uses ECC. The data bus pins (D, BE and DCB) are not driven by the Intel 80200 processor except when explicitly requested by a DValid assertion for a write two clock edges earlier. This means that when the chipset is not getting write data from the Intel 80200 processor, it can use that bus for other purposes or allow its use by other masters. Between a read and a write data cycle on the data bus, there should be one turnaround cycle to avoid contention on the bus. The I
336. ssor supports execution in a big endian system. A system is said to be big endian if multi-byte values are accessed with the MSB at lower addresses. The endian orientation of a system is only evident when software performs sub-word sized accesses. To operate an Intel 80200 processor in a big endian system, software running on the Intel 80200 processor must configure the device appropriately, and the board hosting the Intel 80200 processor must swap the byte lanes in each word of the D bus. The requisite arrangement for a 64-bit bus is shown in Figure 10-3. Figure 10-3. Big Endian Lane Swapping on a 64-bit Bus (figure: the eight byte lanes D[7:0] through D[63:56] of the Intel 80200 processor wired to the eight byte lanes D[7:0] through D[63:56] of memory; the lane crossings are not recoverable from this text). The lines in DCB should be wired the same for either endian configuration. Entities that create and check ECC should always interpret the contents of memory in the system format (either big or little endian). The processor uses the Byte Enable lines (BE) to indicate valid bytes during writes. In big endian systems, these lines should be swapped also. That is, bits in BE[3:0] should be swapped and bits in BE[7:4] should be swapped. This has the effect of always associating BE[0] with D[7:0], BE[1] with D[15:8], and so on.
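The lane swap described above can be sketched as a small permutation. This is a minimal illustration, not the board wiring from Figure 10-3: it assumes that "swap the byte lanes in each word" means full byte reversal inside each 32-bit word, so lane n maps to lane n XOR 3. The function names `swap_lane` and `swap_byte_enables` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Map a processor byte lane (0..7) to a memory byte lane.
 * Assumption: the swap is a byte reversal within each 32-bit word,
 * so lanes 0..3 permute among themselves, as do lanes 4..7. */
unsigned swap_lane(unsigned lane)
{
    assert(lane < 8);
    return lane ^ 3u;
}

/* Apply the same permutation to an 8-bit byte-enable mask, so that each
 * enable keeps travelling with the data byte it qualifies. */
uint8_t swap_byte_enables(uint8_t be)
{
    uint8_t out = 0;
    for (unsigned lane = 0; lane < 8; lane++) {
        if (be & (1u << lane))
            out |= (uint8_t)(1u << swap_lane(lane));
    }
    return out;
}
```

Because the data lanes and the byte enables go through the same permutation, an enable asserted for a given data byte on the processor side still lines up with that byte on the memory side.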
337. st be placed in non-cacheable memory (which means the MMU is enabled). As a corollary: no fetches of cacheable code should occur while locking instructions into the cache. 2. The code being locked into the cache must be cacheable. 3. The instruction cache must be enabled and invalidated prior to locking down lines. Failure to follow these requirements produces unpredictable results when accessing the instruction cache. System programmers should ensure that the code to lock instructions into the cache does not reside closer than 128 bytes to a non-cacheable/cacheable page boundary. If the processor fetches ahead into a cacheable page, then the first requirement noted above could be violated. Lines are locked into a set starting at way 0 and may progress up to way 27; which set a line gets locked into depends on the set index of the virtual address. Figure 4-2 is an example of where lines of code may be locked into the cache, along with how the round-robin pointer is affected. Figure 4-2. Locked Line Effect on Round Robin Replacement:
    set 0: 8 ways locked, 24 ways available for round-robin replacement
    set 1: 23 ways locked, 9 ways available for round-robin replacement
    set 2: 28 ways locked, only ways 28-31 available for replacement
    set 31: all 32 ways available for round-robin replacement
338. st prefetch is lost. B.4.4.11 Loop Interchange. As mentioned earlier, the sequence in which data is accessed affects cache thrashing. Usually, it is best to access data in a spatially contiguous address range. However, arrays of data may have been laid out such that indexed elements are not physically next to each other. Consider the following C code, which places array elements in row-major order:
    for(j=0; j<NMAX; j++)
        for(i=0; i<NMAX; i++)
        {
            prefetch(A[i+1][j]);
            sum += A[i][j];
        }
In the above example, A[i][j] and A[i+1][j] are not sequentially next to each other. This situation causes an increase in bus traffic when prefetching loop data. In some cases where the loop mathematics are unaffected, the problem can be resolved by induction variable interchange. The above example becomes:
    for(i=0; i<NMAX; i++)
        for(j=0; j<NMAX; j++)
        {
            prefetch(A[i][j+1]);
            sum += A[i][j];
        }
B.4.4.12 Loop Fusion. Loop fusion is a process of combining multiple loops, which reuse the same data, into one loop. The advantage of this is that the reused data is immediately accessible from the data cache. Consider the following example:
    for(i=0; i<NMAX; i++)
    {
        prefetch(A[i+1], c[i+1], c[i+1]);
        A[i] = b[i] + c[i];
    }
    for(i=0; i<NMAX; i++)
    {
        prefetch(D[i+1], c[i+1], A[i+1]);
        D[i] = A[i] + c[i];
    }
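The loop interchange above can be checked with a small, self-contained sketch. It assumes a hypothetical `prefetch` macro that is a no-op here (standing in for a PLD-style hint), a 32x32 test array, and helper names of our own choosing. It only shows that both traversal orders compute the same sum; it does not measure bus traffic.

```c
#include <assert.h>
#include <stddef.h>

#define NMAX 32

/* Hypothetical stand-in for a prefetch hint; it does nothing here. */
#define prefetch(addr) ((void)(addr))

static int A[NMAX][NMAX];

/* Inner loop walks down a column: successive accesses are NMAX ints apart. */
long sum_column_order(void)
{
    long sum = 0;
    for (size_t j = 0; j < NMAX; j++)
        for (size_t i = 0; i < NMAX; i++) {
            if (i + 1 < NMAX)
                prefetch(&A[i + 1][j]);
            sum += A[i][j];
        }
    return sum;
}

/* After interchange: inner loop walks along a row, touching adjacent ints. */
long sum_row_order(void)
{
    long sum = 0;
    for (size_t i = 0; i < NMAX; i++)
        for (size_t j = 0; j < NMAX; j++) {
            if (j + 1 < NMAX)
                prefetch(&A[i][j + 1]);
            sum += A[i][j];
        }
    return sum;
}

/* Fill A with 0..NMAX*NMAX-1 and confirm both orders agree. */
long check_interchange(void)
{
    for (size_t i = 0; i < NMAX; i++)
        for (size_t j = 0; j < NMAX; j++)
            A[i][j] = (int)(i * NMAX + j);
    long a = sum_column_order();
    long b = sum_row_order();
    assert(a == b);
    return a;
}
```

The interchange is only valid because the loop body has no dependence on the order of iteration, which is the "loop mathematics are unaffected" condition in the text.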
339. t Any of these strategies work as long as there are no accesses to the 32-byte memory region of the locked read after the read has executed and before the next write request is executed (which deasserts Lock). 10.2.6 Abort. If for any reason a request made by the Intel 80200 processor can not be completed, it must be aborted. At the same time as the assertion of DValid for any data cycle of any transaction, Abort can be asserted. This has the effect of ending that transaction at that data cycle. The Intel 80200 processor saves the address of the aborted transaction and takes an exception. If Abort is asserted at the same time as DValid of the first data cycle of a given bus transaction, no data is sampled or driven by the Intel 80200 processor off of the D bus two cycles after the Abort is asserted. That transaction is finished as soon as the Abort signal is sampled, and no further data is transferred. On a transaction with multiple data cycles, Abort can be asserted along with DValid at any time during the transaction. Data is read from or driven to the data bus two cycles after each of the pre-abort DValids. For a write transaction, no data is driven at the clock edge two cycles after Abort is sampled. For a read transaction, external logic must drive the data buses (D and DCB) to a valid level two cycles
340. t does not take any data arguments. Invalidate Mini IC invalidates the entire mini instruction cache. It does not affect the main instruction cache. It does not require a virtual address or any data arguments. Load Main IC and Load Mini IC write one line of data (8 ARM instructions) into the specified instruction cache at the specified virtual address. Each cache function is downloaded through JTAG in 33-bit packets. Figure 13-11 shows the packet formats for each of the JTAG cache functions. Invalidate IC Line and Invalidate Mini IC each require 1 packet. Load Main IC and Load Mini IC each require 9 packets. Footnote 1: The LDIC Invalidate Mini IC function does not invalidate the BTB (like the CP15 Invalidate IC function), so software must do this manually where appropriate. Figure 13-11. Format of LDIC Cache Functions (figure: packet layouts for Invalidate IC Line, Invalidate Mini IC, Load Main IC (CMD = 0b010) and Load Mini IC (CMD = 0b011); the first packet carries VA[31:5] and the command in bits [2:0], and the load functions follow it with Data Word 0 through Data Word 7; markers indicate the first and last bit shifted in). All packets are 33 bits in length. Bits [2:0] of the first packet specify the function to execute. For functions that require an address, bits [32:6] of the first packet specify an 8-word aligned address (Packet1[32
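As a rough illustration of the first-packet layout described above, the sketch below packs a command and an 8-word aligned virtual address into a 33-bit value. It assumes bits [5:3] of the packet are zero and that the packet is held in the low 33 bits of a 64-bit integer; the function name `ldic_first_packet` is hypothetical, and the bit order used when shifting the packet through JTAG is not modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Command values stated in the text for the two load functions. */
#define LDIC_CMD_LOAD_MAIN_IC 0x2u /* 0b010 */
#define LDIC_CMD_LOAD_MINI_IC 0x3u /* 0b011 */

/* Build the first 33-bit LDIC packet.
 * bits [32:6] = VA[31:5] (8-word aligned address)
 * bits [5:3]  = 0        (assumed)
 * bits [2:0]  = command */
uint64_t ldic_first_packet(uint32_t va, unsigned cmd)
{
    assert(cmd <= 0x7u);
    assert((va & 0x1Fu) == 0); /* address must be 32-byte aligned */
    return ((uint64_t)(va >> 5) << 6) | (uint64_t)cmd;
}
```

For example, `ldic_first_packet(0x00000020, LDIC_CMD_LOAD_MAIN_IC)` places VA[5] at packet bit 6 and yields 0x42.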
341. t; B.4.4.4 Compute vs. Data Bus Bound; B.4.4.5 Low Number of Iterations; B.4.4.6 Bandwidth Limitations; B.4.4.7 Cache Memory Considerations; B.4.4.8 Cache Blocking; B.4.4.9 Prefetch Unrolling; B.4.4.10 Pointer Prefetch; B.4.4.11 Loop Interchange; B.4.4.12 Loop Fusion; B.4.4.13 Prefetch to Reduce Register Pressure; B.5 Instruction Scheduling; B.5.1 Scheduling Loads; B.5.1.1 Scheduling Load and Store Double (LDRD/STRD); B.5.1.2 Scheduling Load and Store Multiple (LDM/STM); B.5.2 Scheduling Data Processing Instructions; B.5.3 Scheduling Multiply Instructions; B.5.4 Scheduling SWP and SWPB Instructions; B.5.5 Scheduling the MRA and MAR Instructions (MRRC/MCRR); B.5.6 Scheduling the MIA and MIAPH Instructions; B.5.7 Scheduling MRS and MSR Instructions; B.5.8 Scheduling CP15 Coprocessor Instructions; Optimi
342. t pins only transition if a valid MCLK is present. Figure 8-2. Pin State at Reset (figure: waveforms for RESET, RESETOUT, ADS, PWRSTATUS, HOLD and HLDA). 8.3 Power Management. The Intel 80200 processor provides low power modes (idle and sleep), which are listed in increasing power-saving order. Table 8-3 describes the attributes of each low power mode. Table 8-3. Low Power Modes:
    Low Power Mode | PLL | Architectural State | Wakeup Method
    Idle | On | Retained | FIQ, IRQ
    Sleep | Off | Must be saved prior to entering | RESET
8.3.1 Invocation. The Intel 80200 processor provides a simple instruction to enter into low power mode. See Table 7-24, Clock and Power Management, on page 7-21 for exact commands. This instruction waits for all processing to complete, asserts the PWRSTATUS output pins, and waits for a wake-up event to transition back into the normal running mode. For idle, the wake-up event is the assertion of the FIQ or IRQ pins. If the interrupt is masked, the Intel 80200 processor still wakes up, but won't service the interrupt. The only way to exit Sleep mode is to go through the reset sequence. Sleep mode provides
343. t the expense of increasing code size:
    cmp r1, #0
    ldrne r0, [r5, #4]
    ldreq r0, [r5, #-4]
    addne r4, r5, #4
    subeq r4, r5, #4
    cmp r0, #10
The optimized code takes six cycles to execute, compared to the seven cycles taken by the unoptimized version. The result latency for an LDR instruction is significantly higher if the data being loaded is not in the data cache. To minimize the number of pipeline stalls in such a situation, the LDR instruction should be moved as far away as possible from the instruction that uses the result of the load. Note that this may at times cause certain register values to be spilled to memory due to the increase in register pressure. In such cases, use a preload instruction or a preload hint to ensure that the data access in the LDR instruction hits the cache when it executes. A preload hint should be used in cases where we cannot be sure whether the load instruction would be executed. A preload instruction should be used in cases where we can be sure that the load instruction would be executed. Consider the following code sample:
    ; all other registers are in use
    sub r1, r6, r7
    mul r3, r6, r2
    mov r2, r2, LSL #2
    orr r9, r9, #0xf
    add r0, r4, r5
    ldr r6, [r0]
    add r8, r6, r8
    add r8, r8, #4
    orr r8, r8, #0xf
    ; The value in register r6 is not used after this
In code sample above ADD and LDR instruction can be
344. t want a write to TX to overwrite previous data. The TXRXCTRL.OV bit (overflow flag) does not get set during high-speed download when the handler reads the RX register at the same time the debugger writes to it. If the debugger writes to RX at the same time the handler reads from RX, the handler read returns the newly written data and the previous data is lost. However, in this specific case, the overflow flag does not get set, so the debugger is unaware that the download was not successful. 14 Performance Considerations. This chapter describes relevant performance considerations that compiler writers, application programmers and system designers need to be aware of to efficiently use the Intel 80200 processor based on Intel XScale microarchitecture (compliant with the ARM Architecture V5TE). Performance numbers discussed here include interrupt latency, branch prediction, and instruction latencies. 14.1 Interrupt Latency. Table 14-1 shows the Minimum Interrupt Latency for the Intel 80200 processor, which is the minimum number of cycles from the assertion of any interrupt signal (IRQ or FIQ) to the execution of the instruction at the vector for that interrupt. Table 14-1. Minimum Interrupt Latency:
    MCLK Clock Cycles | Description
    3 | Minimum Interrupt Latency: This is measured from the assertion of IRQ or FIQ interrupt pin to the execution of the first instruction of the interrupt event
345. te Shift-IR State. When the controller is in this state, the shift register contained in the instruction register is connected between TDI and TDO and shifts data one bit position nearer to its serial output on each rising edge of TCK. The test data register selected by the current instruction retains its previous value during this state. The instruction does not change. If TMS is held high on the rising edge of TCK, the controller enters the Exit1-IR state. If TMS is held low on the rising edge of TCK, the controller remains in the Shift-IR state. C.2.5.13 Exit1-IR State. This is a temporary state. If TMS is held high on the rising edge of TCK, the controller enters the Update-IR state, which terminates the scanning process. If TMS is held low on the rising edge of TCK, the controller enters the Pause-IR state. The test data register selected by the current instruction retains its previous value during this state. The instruction does not change and the instruction register retains its state. C.2.5.14 Pause-IR State. The Pause-IR state allows the test controller to temporarily halt the shifting of data through the instruction register. The test data registers selected by the current instruction retain their previous values during this state. The instruction does not change and th
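The instruction-scan transitions described above can be summarised as a small next-state function. This is a sketch covering only the IR-scan column: the Shift-IR and Exit1-IR transitions come from the text, while the Pause-IR and Exit2-IR transitions are filled in from the standard IEEE 1149.1 TAP state diagram rather than from this excerpt. Update-IR is treated as terminal because its successors lie outside the modelled subset. The enum and function names are hypothetical.

```c
#include <assert.h>

typedef enum {
    SHIFT_IR,
    EXIT1_IR,
    PAUSE_IR,
    EXIT2_IR,
    UPDATE_IR
} tap_ir_state;

/* Return the state entered on the next rising edge of TCK,
 * given the level of TMS (0 = low, 1 = high). */
tap_ir_state tap_ir_next(tap_ir_state s, int tms)
{
    assert(tms == 0 || tms == 1);
    switch (s) {
    case SHIFT_IR:  return tms ? EXIT1_IR  : SHIFT_IR;  /* from the text */
    case EXIT1_IR:  return tms ? UPDATE_IR : PAUSE_IR;  /* from the text */
    case PAUSE_IR:  return tms ? EXIT2_IR  : PAUSE_IR;  /* IEEE 1149.1 */
    case EXIT2_IR:  return tms ? UPDATE_IR : SHIFT_IR;  /* IEEE 1149.1 */
    case UPDATE_IR: return UPDATE_IR;                   /* not modelled further */
    }
    return s;
}
```

Driving TMS high twice from Shift-IR therefore reaches Update-IR, which is the shortest way to end a scan.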
346. te-As-Zero | undefined | undefined | Reserved
Table 13-1. Debug Control and Status Register (DCSR), continued:
    Bits | Access | Reset Value | TRST Value | Description
    23 | SW Read Only, JTAG Read/Write | unchanged | 0 | Trap FIQ (TF)
    22 | SW Read Only, JTAG Read/Write | unchanged | 0 | Trap IRQ (TI)
    21 | Read-undefined / Write-As-Zero | undefined | undefined | Reserved
    20 | SW Read Only, JTAG Read/Write | unchanged | 0 | Trap Data Abort (TD)
    19 | SW Read Only, JTAG Read/Write | unchanged | 0 | Trap Prefetch Abort (TA)
    18 | SW Read Only, JTAG Read/Write | unchanged | 0 | Trap Software Interrupt (TS)
    17 | SW Read Only, JTAG Read/Write | unchanged | 0 | Trap Undefined Instruction (TU)
    16 | SW Read Only, JTAG Read/Write | unchanged | 0 | Trap Reset (TR)
Table 13-1. Debug Control and Status Register (DCSR) (Sheet 2 of 2):
    Bits | Access | Reset Value | TRST Value | Description
    15:6 | Read-undefined / Write-As-Zero | undefined | undefined | Reserved
    5 | SW Read/Write, JTAG Read-Only | 0 | unchanged | Sticky Abort (SA)
    4:2 | SW Read/Write, JTAG Read-Only | 0b000 | unchanged | Method Of Entry (MOE): 000: Processor Reset; 001: Instruction Breakpoint Hit; 010: Data Breakpoint Hit; 011: BKPT Instruction Executed; 100: External Debug Event Asserted; 101: Vector Trap Occurred; 110: Trace Buffer Full Break; 111: Reserved
    1 | SW Read/Write, JTAG Read-Only | 0 | unchanged | Trace Buffer Mode (M): 0: Wrap around
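A debugger or handler reading the DCSR would typically pull the Method Of Entry field out of bits [4:2]. The sketch below shows one way to do that using the encodings listed in the table; the function names `dcsr_moe` and `dcsr_moe_name` are hypothetical, and how the DCSR value is obtained is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Extract the 3-bit Method Of Entry field, DCSR bits [4:2]. */
unsigned dcsr_moe(uint32_t dcsr)
{
    return (unsigned)((dcsr >> 2) & 0x7u);
}

/* Translate the MOE field into the description used in Table 13-1. */
const char *dcsr_moe_name(uint32_t dcsr)
{
    static const char *const names[8] = {
        "Processor Reset",
        "Instruction Breakpoint Hit",
        "Data Breakpoint Hit",
        "BKPT Instruction Executed",
        "External Debug Event Asserted",
        "Vector Trap Occurred",
        "Trace Buffer Full Break",
        "Reserved"
    };
    unsigned moe = dcsr_moe(dcsr);
    assert(moe < 8);
    return names[moe];
}
```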
347. 13.15.1.2 Placing the Handler in Memory. The debug handler is not required to be placed at a specific pre-defined address. However, there are some limitations on where the handler can be placed due to the override vector tables and the 2-way set associative mini instruction cache. In the override vector table, the reset vector must branch to the debug handler using either a direct branch, which limits the start of the handler code to within 32 MB of the reset vector, or an indirect branch with a data processing instruction. The data processing instruction creates an address using immediate operands and then branches to the target. An LDR to the PC does not work because the debugger cannot set up data in memory before starting the debug handler. The 2-way set associative limitation is due to the fact that when the override default and relocated vector tables are downloaded, they take up both ways of Set 0 (with addresses 0x0 and 0xffff0000). Therefore, debug handler code can not be downloaded to an address that maps into Set 0; otherwise, it overwrites one of the vector tables (avoid addresses with lower 12 bits = 0). The instruction cache 2-way set limitation is not a problem when the reset vector uses a direct branch, since the branch offset can be adjusted accordingly. However, it makes using indirect branches more complicated. Now the reset vector actu
348. ters. Bits [31:0] of the value in acc0 are moved into the register RdLo. Bits [39:32] of the value in acc0 are sign extended to 32 bits and moved into the register RdHi. The instruction is only executed if the condition specified in the instruction matches the condition code status. This instruction executes in any processor mode. 2.3.2 New Page Attributes. The Intel 80200 processor extends the page attributes defined by the C and B bits in the page descriptors with an additional X bit. This bit allows four more attributes to be encoded when X = 1. These new encodings include allocating data for the mini-data cache and write-allocate caching. A full description of the encodings can be found in Section 3.2.2, Memory Attributes, on page 3-2. The Intel 80200 processor retains ARM definitions of the C and B encoding when X = 0, which is different than the first generation Intel StrongARM products. The memory attribute for the mini-data cache has been moved and replaced with the write-through caching attribute. When write-allocate is enabled, a store operation that misses the data cache (cacheable data only) generates a line fill. If disabled, a line fill only occurs when a load operation misses the data cache (cacheable data only). Write-through caching causes all store operations to be written to memory, whether
349. the greatest power savings since the Intel 80200 processor supply voltage can be reduced to zero. Signals Associated with Power Management. PWRSTATUS[1:0] is a 2-bit output from the Intel 80200 processor. It carries information about the current power mode on the part. Table 8-4 shows the encoding of the PWRSTATUS[1:0] signals. Table 8-4. PWRSTATUS[1:0] Encoding:
    PWRSTATUS[1:0] | Power Mode
    00 | Normal
    01 | Idle
    10 | Reserved
    11 | Sleep
The external interrupt pins may be used to exit Idle mode. If MCLK is toggling, asserting FIQ or IRQ wakes the Intel 80200 processor up, even if the interrupt is disabled. If interrupts are enabled, the interrupt is taken some time after the processor is woken up. The exact timing depends on the CCLK:MCLK ratio and implementation details, and can not be reliably predicted. If software needs to guarantee it does not proceed until an interrupt is taken, it should poll for an asserted interrupt after it is woken up. As with all external interrupts, the interrupt source must keep the wake-up interrupt asserted until told otherwise by software running on the Intel 80200 processor. See Chapter 9, Interrupts, for more information. After an interrupt is asserted, the Intel 80200 processor takes approximately 10 CLK cycles to exit Idle mode. The JTAG clock must be stopped durin
350. the ECC bus (DCB[7:0]) signals in this case will be unpredictable, but will be consistent with whatever is being driven on D[63:0]. For example, in the case of a non-cacheable DWORD write, D[31:0] is defined by the store instruction and D[63:32] is unpredictable. However, DCB[7:0] is still consistent with what is driven out on D[63:0]. DWE resets to 0 and is write-only. Reads of DWE result in an unpredictable value. The actions that the 80200 performs when reading data are unaffected by the setting of DWE. When DWE is written, it must be done while ECC is disabled in the BCU (BCUCTL[3] = 0). Software should only change the value of DWE once, typically at initialization/reset time. Changing the value of DWE in other circumstances results in unpredictable behavior. 12 Performance Monitoring. This chapter describes the performance monitoring facility of the Intel 80200 processor based on Intel XScale microarchitecture (compliant with the ARM Architecture V5TE). The events that are monitored can provide performance information for compiler writers, system application developers and software programmers. 12.1 Overview. The Intel 80200 processor hardware provides two 32-bit performance counters that allow two unique events to be monitored simultaneously. In addition, the Intel 80200 processor implements a 32-bit clock counter that can be used in conjunction with the performance counters; its sole purpose
351. the associated line from memory and then updating the specific address (data cache operation is described in Chapter 6, Data Cache). This line fill operation could be the first access to that memory, so a write to memory could trigger an ECC fault. The solution is to write zeroes to memory before ECC is enabled. The Intel 80200 processor writes zero into the DCB bus for this kind of transaction, and the ECC code for 64 bits of zero is zero. It is possible for the Intel 80200 processor to perform a write in which all byte enables are not asserted (i.e., BE[7:0] = 11111111). Systems that contain data memory for ECC should interpret this to mean no data bytes or ECC bytes should be updated in response to the write. Because the Intel 80200 processor only ever performs bus-width sized writes in ECC systems, a simple solution to this requirement is to wire BE[0] to the byte enable of both the LSB and the ECC byte. To summarize the behavior of BE on ECC-protected memory:
    Normal writes are 64 bits wide, so all BE lines are asserted.
    Reads do not activate the BE lines and should be answered with at least 64 bits of data.
    If an ECC RMW operation detects an error on the read phase, it deasserts all BE lines during the write phase.
10.2.8 Big Endian System Configuration. The Intel 80200 proce
352. the expected ECC versus the actual code received on the bus. Should an error be detected, this syndrome is available to software to aid diagnosis. 11.3 Error Handling. The BCU is able to detect and respond to two classes of errors: bus aborts and ECC errors. Information about errors is captured in a set of programmer-accessible registers: ELOG0, ELOG1 and ECAR0, ECAR1. The ELOGx registers log general information about an error, while the ECARx registers capture the address associated with an error. 11.3.1 Bus Aborts. A bus abort occurs when the Abort pin is asserted during an external bus transaction. This is described in detail in Chapter 10, External Bus. In response to a bus abort, the BCU causes an exception in the currently running software. Note that if the exception raised is an External Data Abort, then it is an imprecise exception: it is not necessarily related to the instruction that just executed. The BCU attempts the additional response of logging the error into a register. If register ELOG0 is not already being used to record an error, the BCU logs error information into registers ELOG0 and ECAR0. If ELOG0 already has error information, but ELOG1 does not, the BCU uses ELOG1 and ECAR1 to log the error. If both ELOG0 and ELOG1 are in use, the BCU sets the error overflow bit re
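The register selection policy described above (first ELOG0/ECAR0, then ELOG1/ECAR1, then overflow) can be modelled in a few lines. This is a software sketch of the policy only: the struct layout, field names and function names are hypothetical, and real ELOGx contents and the overflow bit location are not represented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software model of the two error log slots. */
typedef struct {
    int      in_use[2];  /* models "ELOGx already has error information" */
    uint32_t ecar[2];    /* models ECAR0 / ECAR1 (captured address) */
    int      overflow;   /* models the error overflow bit */
} bcu_error_log;

/* Log one error address. Returns the slot used (0 or 1),
 * or -1 when both slots were busy and overflow was flagged instead. */
int bcu_log_error(bcu_error_log *log, uint32_t addr)
{
    assert(log != NULL);
    for (int slot = 0; slot < 2; slot++) {
        if (!log->in_use[slot]) {
            log->in_use[slot] = 1;
            log->ecar[slot] = addr;
            return slot;
        }
    }
    log->overflow = 1;
    return -1;
}

/* Log three errors in a row and report whether the outcome matches the
 * policy: slot 0, then slot 1, then overflow with both addresses kept. */
int bcu_log_sequence_ok(void)
{
    bcu_error_log log = { {0, 0}, {0, 0}, 0 };
    int first  = bcu_log_error(&log, 0x1000u);
    int second = bcu_log_error(&log, 0x2000u);
    int third  = bcu_log_error(&log, 0x3000u);
    return first == 0 && second == 1 && third == -1 &&
           log.overflow == 1 &&
           log.ecar[0] == 0x1000u && log.ecar[1] == 0x2000u;
}
```

The third error leaves both captured addresses untouched, which reflects the text's description that only the overflow indication changes once both logs are in use.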
353. ther coprocessors on the Intel 80200 processor causes an undefined exception. Unless otherwise noted, unused bits in coprocessor registers have unpredictable values when read. For compatibility with future implementations, software should not rely on the values in those bits. MRC/MCR Format:
    Bits | Description | Notes
    31:28 | cond: ARM condition codes |
    23:21 | opcode_1: Reserved | Should be programmed to zero for future compatibility
    20 | n: Read or write coprocessor register; 0 = MCR, 1 = MRC |
    19:16 | CRn: specifies which coprocessor register |
    15:12 | Rd: General Purpose Register, R0..R15 |
    11:8 | cp_num: coprocessor number | 0b1111 = CP15, 0b1110 = CP14, 0b1101 = CP13, 0b0000 = CP0
    7:5 | opcode_2: Function bits | This field should be programmed to zero for future compatibility unless a value has been specified in the command
    3:0 | CRm: Function bits | This field should be programmed to zero for future compatibility unless a value has been specified in the command
The format of LDC and STC is shown in Table 7-2. LDC and STC follow the programming notes in the ARM Architecture Reference Manual. LDC and STC transfer a single 32-bit word between a coprocessor reg
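The field layout in the table can be exercised by assembling an instruction word from its fields. The sketch below follows the standard ARM MCR/MRC encoding, in which bits [27:24] are 0b1110 and bit 4 is 1; those two fixed fields come from the ARM Architecture Reference Manual rather than from the table above. The function name `encode_mcr_mrc` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assemble an MCR (is_mrc = 0) or MRC (is_mrc = 1) instruction word. */
uint32_t encode_mcr_mrc(unsigned cond, int is_mrc, unsigned opcode_1,
                        unsigned crn, unsigned rd, unsigned cp_num,
                        unsigned opcode_2, unsigned crm)
{
    assert(cond <= 0xF && crn <= 0xF && rd <= 0xF);
    assert(cp_num <= 0xF && crm <= 0xF);
    assert(opcode_1 <= 0x7 && opcode_2 <= 0x7);
    return ((uint32_t)cond << 28)        /* [31:28] condition code   */
         | (0xEu << 24)                  /* [27:24] fixed 0b1110      */
         | ((uint32_t)opcode_1 << 21)    /* [23:21] opcode_1          */
         | ((uint32_t)(is_mrc ? 1 : 0) << 20) /* [20] 0 = MCR, 1 = MRC */
         | ((uint32_t)crn << 16)         /* [19:16] CRn               */
         | ((uint32_t)rd << 12)          /* [15:12] Rd                */
         | ((uint32_t)cp_num << 8)       /* [11:8]  coprocessor number */
         | ((uint32_t)opcode_2 << 5)     /* [7:5]   opcode_2          */
         | (1u << 4)                     /* [4]     fixed 1           */
         | (uint32_t)crm;                /* [3:0]   CRm               */
}
```

With the always condition (0xE), `MCR p15, 0, R1, c7, c5, 0` assembles to 0xEE071F15 under this layout.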
354. they are cacheable or not cacheable. This feature is useful for maintaining data cache coherency. The Intel 80200 processor also added a P bit in the first-level descriptors to identify which pages of memory are protected with ECC. A descriptor with the P bit set indicates the corresponding page in memory is ECC protected. If the BCU's ECC mode is enabled (see Chapter 11, Bus Controller), then writes to such a page are accompanied with an ECC, and reads are validated by an ECC. Bit 1 in the Control Register (coprocessor 15, register 1, opcode = 1) enables ECC protection for memory accesses made during page table walks. These attributes are programmed in the translation table descriptors, which are highlighted in Table 2-8, First-level Descriptors, on page 2-10; Table 2-9, Second-level Descriptors for Coarse Page Table, on page 2-10; and Table 2-10, Second-level Descriptors for Fine Page Table, on page 2-10. Two second-level descriptor formats have been defined for the Intel 80200 processor: one is used for the coarse page table and the other is used for the fine page table. Table 2-8. First-level Descriptors (fields listed from bit 31 down to bit 0, one descriptor type per row):
    SBZ | 0 0
    Coarse page table base address | P | Domain | SBZ | 0 1
    Section base address | SBZ | TEX | AP
355. tion. Example 4-3. Invalidating the Instruction Cache:
    MCR P15, 0, R1, C7, C5, 0 ; Invalidate the instruction cache
                              ; and branch target buffer
    CPWAIT
    ; The instruction cache is guaranteed to be invalidated at this point; the next
    ; instruction sees the result of the invalidate command.
The Intel 80200 processor also supports invalidating an individual line from the instruction cache. See Table 7-12, Cache Functions, on page 7-11 for the exact command. 4.3.4 Locking Instructions in the Instruction Cache. Software has the ability to lock performance-critical routines into the instruction cache. Up to 28 lines in each set can be locked; hardware ignores the lock command if software is trying to lock all the lines in a particular set (i.e., ways 28-31 can never be locked). When this happens, the line is still allocated into the cache, but the lock is ignored. The round-robin pointer stays at way 31 for that set. Lines can be locked into the instruction cache by initiating a write to coprocessor 15. See Table 7-14, Cache Lockdown Functions, on page 7-14 for the exact command. Register Rd contains the virtual address of the line to be locked into the cache. There are several requirements for locking down code: The routine used to lock lines down in the cache mu
356. tion and how the instruction address is used to access the cache. The instruction cache is a 32-Kbyte, 32-way set associative cache; this means there are 32 sets with each set containing 32 ways. Each way of a set contains eight 32-bit words and one valid bit, which is referred to as a line. The replacement policy is a round-robin algorithm and the cache also supports the ability to lock code in at a line granularity. Instruction Cache Organization (figure: the set index selects one of the 32 sets, shown with Set 0 selected; the tag is compared in a CAM (Content Addressable Memory); the word select picks one 4-byte instruction word from the line). Instruction Address (Virtual): bits [31:10] tag, bits [9:5] set index, bits [4:2] word select, bits [1:0]. The instruction cache is virtually addressed and virtually tagged. The virtual address presented to the instruction cache may be remapped by the PID register. See Section 7.2.13, Register 13: Process ID, on page 7-16 for a description of the PID register. 4.2 Operation. 4.2.1 Operation When Instruction Cache is Enabled. When the cache is enabled, it compares every instruction request address against the addresses of instructions that it is currently holding. If the cache contains the requested instruction, the access hits the cache, and the cache returns the
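The address split described above (tag in bits [31:10], set index in bits [9:5], word select in bits [4:2]) can be illustrated with a small decoder. This is a sketch of the bit slicing only; it ignores PID remapping, and the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define ICACHE_SETS       32u
#define ICACHE_WAYS       32u
#define ICACHE_LINE_BYTES 32u  /* eight 32-bit words per line */

typedef struct {
    uint32_t tag;          /* VA[31:10] */
    unsigned set_index;    /* VA[9:5]   */
    unsigned word_select;  /* VA[4:2]   */
} icache_addr;

/* Split a virtual instruction address into its cache lookup fields. */
icache_addr icache_decode(uint32_t va)
{
    /* Sanity check: 32 sets x 32 ways x 32 bytes is the stated 32 Kbytes. */
    assert(ICACHE_SETS * ICACHE_WAYS * ICACHE_LINE_BYTES == 32u * 1024u);

    icache_addr a;
    a.tag = va >> 10;
    a.set_index = (unsigned)((va >> 5) & 0x1Fu);
    a.word_select = (unsigned)((va >> 2) & 0x7u);
    return a;
}
```

Two addresses that differ only in bits [31:10] land in the same set and compete for its 32 ways, which is why the set index matters when choosing which lines to lock.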
357. tion with the device identification register. It connects the identification register between TDI and TDO in the Shift-DR state. When selected, idcode parallel-loads the hard-wired identification code (32 bits) on TDO into the identification register on the rising edge of TCK in the Capture-DR state. NOTE: The device identification register is not altered by data being shifted in on TDI. (idcode: opcode 11110, IEEE 1149.1 Optional.)
    dbgtx | 10000 | See Chapter 13, Software Debug.
    bypass | 11111 | IEEE 1149.1 Required | The bypass instruction selects the Bypass register between TDI and TDO pins while in SHIFT-DR state, effectively bypassing the processor's test logic. 0 is captured in the CAPTURE-DR state. While this instruction is in effect, all other test data registers have no effect on the operation of the system. Test data registers with both test and system functionality perform their system functions when this instruction is selected.
C.2.4 TAP Test Data Registers. The Intel 80200 processor contains a device identification register and two test data registers (Bypass and RUNBIST). Each test data register selected by the TAP controller is connected serially between TDI and TDO. TDI is connected to the test data register's most significant bit
358. tiple internal accumulators. The Intel 80200 processor uses opcode_3 to define six instructions: MIA, MIAPH, MIABB, MIABT, MIATB and MIATT. Table 2-2. MIA{<cond>} acc0, Rm, Rs:
    Operation: if ConditionPassed(<cond>) then acc0 = (Rm[31:0] * Rs[31:0])[39:0] + acc0[39:0]
    Exceptions: none
    Qualifiers: Condition Code. No condition code flags are updated.
    Notes: Early termination is supported. Instruction timings can be found in Section 14.4.4, Multiply Instruction Timings, on page 14-6. Specifying R15 for register Rs or Rm has unpredictable results. acc0 is defined to be 0b000 on 80200.
The MIA instruction operates similarly to MLA, except that the 40-bit accumulator is used. MIA multiplies the signed value in register Rs (multiplier) by the signed value in register Rm (multiplicand) and then adds the result to the 40-bit accumulator (acc0). MIA does not support unsigned multiplication; all values in Rs and Rm are interpreted as signed data values. MIA is useful for operating on signed 16-bit data that was loaded into a general purpose register by LDRSH. The instruction is only executed if the condition specified in the instruction matches the condition code status. Table 2-3. MIAPH{<cond>}
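The MIA operation given above can be modelled in software to see how the 40-bit accumulator behaves. This is a sketch under two assumptions: the accumulator is kept in the low 40 bits of a 64-bit integer, and the sum wraps modulo 2^40, which the excerpt does not state explicitly. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define ACC_MASK ((UINT64_C(1) << 40) - 1)

/* Model of: acc0 = (Rm[31:0] * Rs[31:0])[39:0] + acc0[39:0]
 * Rm and Rs are treated as signed 32-bit values, as the text requires. */
uint64_t mia(uint64_t acc0, int32_t rm, int32_t rs)
{
    assert((acc0 & ~ACC_MASK) == 0);
    int64_t product = (int64_t)rm * (int64_t)rs;
    /* Unsigned arithmetic keeps the low 40 bits exact for signed inputs. */
    return ((uint64_t)product + acc0) & ACC_MASK;
}

/* Interpret a 40-bit accumulator value as a signed number (bit 39 = sign). */
int64_t acc_to_signed(uint64_t acc0)
{
    uint64_t v = acc0 & ACC_MASK;
    if (v & (UINT64_C(1) << 39))
        return (int64_t)v - ((int64_t)1 << 40);
    return (int64_t)v;
}
```

Starting from an accumulator of 10, `mia(10, 3, -2)` adds the signed product -6 and leaves 4.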
359. tries can be invalidated and cleaned in the data cache and mini-data cache via coprocessor 15, register 7. Note that a line locked into the data cache remains locked even after it has been subjected to an invalidate-entry operation. This will leave an unusable line in the cache until a global unlock has occurred. For this reason, do not use these commands on locked lines. This same register also provides the command to invalidate the entire data cache and mini-data cache. Refer to Table 7-12, Cache Functions, on page 7-11 for a listing of the commands. These global invalidate commands have no effect on lines locked in the data cache. Locked lines must be unlocked before they can be invalidated. This is accomplished by the Unlock Data Cache command found in Table 7-14, Cache Lockdown Functions, on page 7-14. 6.3.3.1 Global Clean and Invalidate Operation. A simple software routine is used to globally clean the data cache. It takes advantage of the line-allocate data cache operation, which allocates a line into the data cache. This allocation will evict any dirty data in the cache back to external memory. Example 6-2 shows how the data cache can be cleaned. Example 6-2. Global Clean Operation:
    ; Global Clean/Invalidate THE DATA CACHE
    ; R1 contains the virtual address of a region of cacheable memory rese
360. tures such that the data resides within the same memory page. It is also extremely important to ensure that instruction and data sections are in different memory banks, or they continually thrash the memory page selection. B.4.4 Prefetch Considerations. The Intel 80200 processor has a true prefetch load instruction (PLD). The purpose of this instruction is to preload data into the data and mini-data caches. Data prefetching allows hiding of memory transfer latency while the processor continues to execute instructions. The prefetch is important to compiler and assembly code because judicious use of the prefetch instruction can enormously improve throughput performance of the Intel 80200 processor. Data prefetch can be applied not only to loops but also to any data references within a block of code. Prefetch also applies to data writing when the memory type is enabled as write allocate. The Intel 80200 processor prefetch load instruction is a true prefetch instruction because the load destination is the data or mini-data cache and not a register. Compilers for processors which have data caches but do not support prefetch sometimes use a load instruction to preload the data cache. This technique has the disadvantages of using a register to load data and requiring additional registers for subsequent preloads a
361. uction is immediately available from the cache or memory interface, the current instruction does not incur resource dependency stalls during execution that cannot be detected at issue time, and, if the instruction uses dynamic branch prediction, correct prediction is assumed. Minimum Result Latency: The required minimum cycle distance from the issue clock of the current instruction to the issue clock of the first instruction that can use the result without incurring a resource dependency stall, assuming best-case conditions (i.e., that the issuing of the next instruction is not stalled due to a resource dependency stall, the next instruction is immediately available from the cache or memory interface, and the current instruction does not incur resource dependency stalls during execution that cannot be detected at issue time). Minimum Issue Latency (with Branch Misprediction): The minimum cycle distance from the issue clock of the current branching instruction to the first possible issue clock of the next instruction. This definition is identical to Minimum Issue Latency except that the branching instruction has been mispredicted. It is calculated by adding Minimum Issue Latency (without Branch Misprediction) to the minimum branch latency penalty number from Table 14-2, which is four cycles. Minimum Resource Latency: Th
362. ue to hit within either buffer, even in the presence of forward and backward branches, no external fetches for instructions are generated. A miss causes one or the other buffer to be filled from external memory using the fill policy described in Section 4.2.3. 4.2.3 Fetch Policy. An instruction cache miss occurs when the requested instruction is not found in the instruction fetch buffers or instruction cache; a fetch request is then made to external memory. The instruction cache can handle up to two misses. Each external fetch request uses a fetch buffer that holds 32 bytes and eight valid bits, one for each word. A miss causes the following: 1. A fetch buffer is allocated. 2. The instruction cache sends a fetch request to the external bus. This request is for a 32-byte line. 3. Instruction words are returned back from the external bus, at a maximum rate of 1 word per core cycle. As each word returns, the corresponding valid bit is set for the word in the fetch buffer. 4. As soon as the fetch buffer receives the requested instruction, it forwards the instruction to the instruction decoder for execution. 5. When all words have returned, the fetched line is written into the instruction cache if cacheable and if the instruction cache is enabled. The line chosen for update in the cache is contr
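The fetch-buffer steps listed in entry 362 can be sketched as a small Python model. This is an illustration under stated assumptions, not the hardware design: the class name FetchBuffer and its methods are hypothetical, and the order in which words return from the bus is left to the caller because the excerpt does not specify it.

```python
WORDS_PER_LINE = 8  # a 32-byte line holds eight 4-byte words


class FetchBuffer:
    """Hypothetical model of one fetch buffer: 32 bytes plus eight valid bits."""

    def __init__(self, line_address, requested_word):
        self.line_address = line_address      # address of the 32-byte line
        self.requested_word = requested_word  # index (0..7) of the missed word
        self.words = [None] * WORDS_PER_LINE
        self.valid = [False] * WORDS_PER_LINE

    def word_returned(self, index, value):
        """Record a returned word; True means it is the requested instruction
        and can be forwarded to the decoder (step 4)."""
        self.words[index] = value
        self.valid[index] = True              # step 3: set the valid bit
        return index == self.requested_word

    def line_complete(self):
        """True once all eight words are valid, so the line may be written
        into the instruction cache (step 5)."""
        return all(self.valid)
```

The point of the model is that forwarding (step 4) depends only on the requested word arriving, while the cache write (step 5) waits for every valid bit.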
363. uent write also requires a new pending buffer. A fill buffer is also allocated for each read to a non-cached memory, and a write buffer is needed for each memory write to non-cached memory that is non-coalescing. Consequently, an STM instruction listing eight registers and referencing non-cached memory uses eight write buffers, assuming they don't coalesce, and two write buffers if they do coalesce. A cache eviction requires a write buffer for each dirty bit set in the cache line. The prefetch instruction requires a fill buffer for each cache line and 0, 1, or 2 write buffers for an eviction. When adding prefetch instructions, care must be taken to ensure that the combination of prefetch and instruction bus requests does not exceed the system resource capacity described above, or performance is degraded instead of improved. The important points are to spread prefetch operations over calculations, so as to allow bus traffic to flow freely, and to minimize the number of necessary prefetches. B.4.4.7 Cache Memory Considerations. Stride, the way data structures are walked through, can affect the temporal quality of the data and reduce or increase cache conflicts. The Intel 80200 processor data cache and mini-data caches each have 32 sets of 32 bytes. This means that each cache line in a set is on a modular 1 K address boundary
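The "32 sets of 32 bytes" statement in entry 363 implies that addresses 1 KB apart compete for the same set. A minimal Python sketch of that mapping follows; the function name cache_set is hypothetical, and using the address bits directly above the line offset as the set index is an assumption consistent with the 1 K boundary mentioned in the text.

```python
LINE_BYTES = 32  # bytes per cache line
NUM_SETS = 32    # sets per cache


def cache_set(address):
    """Return the set index an address maps to (hypothetical helper).

    Assumes the set index is taken from the address bits just above the
    5-bit line offset, so the mapping repeats every 32 * 32 = 1024 bytes.
    """
    return (address // LINE_BYTES) % NUM_SETS
```

Walking an array with a stride of exactly 1024 bytes therefore lands every access in one set, which is the kind of conflict the stride discussion warns about.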
364. until the bit is cleared. After the debugger reads a 0 from the RR bit, it scans data into JTAG to write to the RX register and sets the valid bit. The write to the RX register automatically sets the RR bit. Debug Handler Actions: Debug handler is expecting data from the debugger. The debug handler polls the RR bit until it is set, indicating data in the RX register is valid. Once the RR bit is set, the debug handler reads the new data from the RX register. The read operation automatically clears the RR bit. When data is being downloaded by the debugger, part of the normal handshaking can be bypassed to allow the download rate to be increased. Table 13-8 shows the handshaking used when the debugger is doing a high-speed download. Note that before the high-speed download can start, both the debugger and debug handler must be synchronized, such that the debug handler is executing a routine that supports the high-speed download. Although it is similar to the normal handshaking, the debugger polling of RR is bypassed with the assumption that the debug handler can read the previous data from RX before the debugger can scan in the new data. High-Speed Download Handshaking States. Debugger Actions: Debugger wants to transfer code into the Intel 80200 processor system memory. Prior to starting download, the debugger must poll the RR bit until it is clear. Once the RR bit is clear, indicating the debug handler is ready, the debu
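The normal RX handshake in entry 364 (debugger writes only when RR is clear, handler reads only when RR is set) can be sketched as a tiny state model. The class RxChannel and its method names are hypothetical, and the model ignores JTAG scan timing entirely.

```python
class RxChannel:
    """Hypothetical model of the RX register and its RR (register ready) bit."""

    def __init__(self):
        self.rr = False  # RR clear: the handler has consumed the last word
        self.rx = 0

    def debugger_write(self, value):
        """Debugger side: write RX only if RR reads 0. Returns True on success."""
        if self.rr:
            return False   # previous data not yet read; debugger keeps polling
        self.rx = value
        self.rr = True     # writing RX automatically sets RR
        return True

    def handler_read(self):
        """Handler side: return RX if RR is set, else None (keep polling)."""
        if not self.rr:
            return None
        self.rr = False    # reading RX automatically clears RR
        return self.rx
```

In the high-speed download variant described above, the debugger would skip the RR check in debugger_write and rely on the handler having already read the previous word.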
365. upports write-through or write-back caching. The data/mini-data cache is controlled by page attributes defined in the MMU Architecture and by coprocessor 15. Chapter 6, Data Cache, discusses all this in more detail. The Intel 80200 processor allows applications to re-configure a portion of the data cache as data RAM. Software may place special tables or frequently used variables in this RAM. See Section 6.4, Re-configuring the Data Cache as Data RAM, on page 6-12 for more information on this. 1.1.2.6 Power Management. The Intel 80200 processor supports two low-power modes: idle and sleep. These modes are discussed in Section 8.3, Power Management, on page 8-5. 1.1.2.7 Interrupt Controller. An interrupt controller is implemented on the Intel 80200 processor that provides masking of interrupts and the ability to steer interrupts to FIQ or IRQ. It is accessed through Coprocessor 13 registers. See Chapter 9, Interrupts, for more detail. 1.1.2.8 Bus Controller. The Intel 80200 processor supports a pipelined external bus that runs at 100 MHz. The data bus is 32/64 bits with ECC protection. The bus controller can be configured to provide critical word first on load operations, enhancing overall system performance. The bus controller has four request queues
366. used for locking entries into the TLB and is set to entry 0 at reset. A TLB lock operation places the specified translation at the entry designated by the lock pointer, moves the lock pointer to the next sequential entry, and resets the round-robin pointer to entry 31. Locking entries into either TLB effectively reduces the available entries for updating. For example, if the first three entries were locked down, the round-robin pointer would be entry 3 after it rolled over from entry 31. Only entries 0 through 30 can be locked in either TLB; entry 31 can never be locked. If the lock pointer is at entry 31, a lock operation updates the TLB entry with the translation and ignores the lock. In this case the round-robin pointer stays at entry 31. Figure 3-1. Example of Locked Entries in TLB: eight entries locked (entry 0 through entry 7), 24 entries available for round-robin replacement (entry 8 through entry 31). 4 Instruction Cache. The Intel 80200 processor based on Intel XScale microarchitecture (compliant with the ARM Architecture V5TE) instruction cache enhances performance by reducing the number of instruction fetches from external memory. The cache provides fast execution of cached code. Code can also be locked down when guaranteed or fast access time is required. 4.1 Overview. Figure 4-1 shows the cache organiza
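The interaction between the lock pointer and the round-robin pointer in entry 366 is easier to follow as a small model. The sketch below is hypothetical (the class TlbPointers is not part of any Intel API) and makes two assumptions the excerpt does not state: the round-robin pointer starts at entry 31, and a normal update writes the entry the pointer names before the pointer advances.

```python
LAST_ENTRY = 31


class TlbPointers:
    """Hypothetical model of the TLB lock and round-robin pointers."""

    def __init__(self):
        self.lock_ptr = 0          # set to entry 0 at reset (from the text)
        self.rr_ptr = LAST_ENTRY   # assumption: initial round-robin position

    def lock(self):
        """Perform a lock operation and return the entry that was written."""
        entry = self.lock_ptr
        if self.lock_ptr < LAST_ENTRY:
            self.lock_ptr += 1     # only entries 0..30 can be locked
        # At entry 31 the translation is written but the lock is ignored.
        self.rr_ptr = LAST_ENTRY   # a lock operation resets the pointer to 31
        return entry

    def replace(self):
        """Perform a normal update and return the entry that was written."""
        entry = self.rr_ptr
        # After entry 31 the pointer rolls over to the first unlocked entry.
        self.rr_ptr = self.lock_ptr if entry == LAST_ENTRY else entry + 1
        return entry
```

Locking three entries and then performing one normal update reproduces the example in the text: the pointer rolls over from entry 31 to entry 3.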
367. C.2.5.5 Shift-DR State. In this controller state, the test data register, which is connected between TDI and TDO as a result of the current instruction, shifts data one bit position nearer to its serial output on each rising edge of TCK. Test data registers that the current instruction selects but does not place in the serial path retain their previous value during this state. The instruction does not change while the TAP controller is in this state. If TMS is high on the rising edge of TCK, the controller enters the Exit1-DR state. If TMS is low on the rising edge of TCK, the controller remains in the Shift-DR state. C.2.5.6 Exit1-DR State. This is a temporary controller state. When the TAP controller is in the Exit1-DR state and TMS is held high on the rising edge of TCK, the controller enters the Update-DR state, which terminates the scanning process. If TMS is held low on the rising edge of TCK, the controller enters the Pause-DR state. The instruction does not change while the TAP controller is in this state. All test data registers selected by the current instruction retain their previous value during this state. C.2.5.7 Pause-DR State. The Pause-DR state allows the test controller to temporarily halt the shifting of data through the test data register in the serial path between TDI and TDO. The test data
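The TMS-driven transitions described in entry 367 can be written as a lookup table. This sketch covers only the Shift-DR and Exit1-DR transitions stated in the excerpt; the names NEXT_STATE, step and run are hypothetical, and the remaining TAP states are deliberately left out rather than guessed.

```python
# (current state, TMS value sampled on the rising edge of TCK) -> next state
NEXT_STATE = {
    ("Shift-DR", 1): "Exit1-DR",
    ("Shift-DR", 0): "Shift-DR",
    ("Exit1-DR", 1): "Update-DR",
    ("Exit1-DR", 0): "Pause-DR",
}


def step(state, tms):
    """Advance one TCK rising edge; raises KeyError for states not modelled."""
    return NEXT_STATE[(state, tms)]


def run(state, tms_bits):
    """Apply a sequence of TMS values and return the final state."""
    for tms in tms_bits:
        state = step(state, tms)
    return state
```

For example, holding TMS low keeps the controller shifting, and two consecutive high values from Shift-DR end the scan in Update-DR.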
368. ver, a CP15 global invalidate IC function does not affect the mini instruction cache. The mini instruction cache can be globally invalidated through JTAG by the LDIC Invalidate-IC function, or by a processor reset when the processor is not in HALT or LDIC mode. A single line in the mini instruction cache can be invalidated through JTAG by the LDIC Invalidate-IC-line function. 13.15 Halt Mode Software Protocol. This section describes the overall debug process in Halt Mode. It describes how to start and end a debug session, and details for implementing a debug handler. Intel provides a standard Debug Handler that implements some of the techniques in this chapter. The Intel Debug Handler itself is a document describing additional handler implementation techniques and requirements. 13.15.1 Starting a Debug Session. Prior to starting a debug session in Halt Mode, the debugger must download code into the instruction cache during reset, via JTAG (Section 13.14, Downloading Code in the ICache). This downloaded code should consist of: a debug handler; an override default vector table; an override relocated vector table, if necessary. While the processor is still in reset, the debugger should set up the DCSR to trap the reset vector. This causes a debug exception to occur immediately when the processor comes out of re
369. w data to the TX register. The write operation automatically sets the TR bit. Conditional Execution Using TXRXCTRL. All of the bits in TXRXCTRL are placed such that they can be read directly into the CC flags using an MRC instruction. To simplify the debug handler, the TXRXCTRL register should be read using the following instruction:
    mrc p14, 0, r15, c14, c0, 0
This instruction directly updates the condition codes in the CPSR. The debug handler can then conditionally execute based on each CC bit. Table 13-10 shows the mnemonic extension to conditionally execute based on whether the TXRXCTRL bit is set or clear.
Table 13-10. TXRXCTRL Mnemonic Extensions
    TXRXCTRL bit      extension if bit set   extension if bit clear
    31 (to N flag)    MI                     PL
    30 (to Z flag)    EQ                     NE
    29 (to C flag)    CS                     CC
    28 (to V flag)    VS                     VC
The following example is a code sequence in which the debug handler polls the TXRXCTRL handshaking bit to determine when the debugger has completed its write to RX and the data is ready for the debug handler to read:
    loop: mrc   p14, 0, r15, c14, c0, 0   ; read the handshaking bit in TXRXCTRL
          mrcmi p14, 0, r0, c9, c0, 0     ; if RX is valid, read it
          bpl   loop                      ; if RX is not valid, loop
13.9 Transmit Reg
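The bit-to-flag mapping in the TXRXCTRL Mnemonic Extensions table of entry 369 can be checked with a short Python helper. This is only a sketch for reasoning about which condition suffix applies to a given register value; the names FLAG_BITS, EXTENSIONS, flags_from_txrxctrl and extension_for are hypothetical.

```python
# Bit positions from the table: TXRXCTRL[31:28] land in the N, Z, C, V flags.
FLAG_BITS = {"N": 31, "Z": 30, "C": 29, "V": 28}

# (extension if bit set, extension if bit clear) for each flag.
EXTENSIONS = {
    "N": ("MI", "PL"),
    "Z": ("EQ", "NE"),
    "C": ("CS", "CC"),
    "V": ("VS", "VC"),
}


def flags_from_txrxctrl(value):
    """Return the CC flags that reading TXRXCTRL into r15 would produce."""
    return {flag: bool((value >> bit) & 1) for flag, bit in FLAG_BITS.items()}


def extension_for(value, flag):
    """Return the mnemonic extension that executes for this TXRXCTRL value."""
    if_set, if_clear = EXTENSIONS[flag]
    return if_set if flags_from_txrxctrl(value)[flag] else if_clear
```

With bit 31 set, extension_for(value, "N") yields MI, which is why the polling loop uses an MI-conditional read followed by a PL-conditional branch.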
370. when the address bits [2:0] = 0b100. MCRR and MRRC are only supported on the Intel 80200 processor when directed to coprocessor 0 and are used to access the internal accumulator. See Section 2.3.1.2 for more information. Access to any other coprocessor besides 0x0 is undefined. Base Register Update: If a data abort is signalled on a memory instruction that specifies writeback, the contents of the base register are not updated. This holds for all load and store instructions. This behavior matches that of the first-generation Intel StrongARM processor and is referred to in the ARM V5 architecture as the Base Restored Abort Model. 2.3 Extensions to ARM Architecture. The Intel 80200 processor made a few extensions to the ARM Version 5 architecture to meet the needs of various markets and design requirements. The following is a list of the extensions, which are discussed in the next sections: A DSP coprocessor (CP0) has been added that contains a 40-bit accumulator and new instructions. New page attributes were added to the page table descriptors. The C and B page attribute encoding was extended by one more bit to allow for more encodings: write allocate and mini-data cache. An attribute specifying ECC for 1 Meg regions was also added. Additional functionality has been added to coprocessor 15
371. ws: Cost of using conditional instructions = 1 + (50/100 x 10) + (50/100 x 10) = 11 cycles. Cost of using branches = 1 + (50/100 x 7) + (50/100 x 6) + (50/100 x 4) = 9.5 cycles. As can be seen, we get better performance by using branch instructions in the above scenario. B.3.1.3 Optimizing Complex Expressions. Conditional instructions should also be used to improve the code generated for complex expressions such as the C shortcut evaluation feature. Consider the following C code segment:
    int foo(int a, int b)
    {
        if (a != 0 && b != 0)
            return 0;
        else
            return 1;
    }
The optimized code for the if condition is:
    cmp   r0, #0
    cmpne r1, #0
Similarly, the code generated for the following C segment:
    int foo(int a, int b)
    {
        if (a != 0 || b != 0)
            return 0;
        else
            return 1;
    }
is:
    cmp   r0, #0
    cmpeq r1, #0
The use of conditional instructions in the above fashion improves performance by minimizing the number of branches, thereby minimizing the penalties caused by branch mispredictions. This approach also reduces the utilization of branch prediction resources. B.3.2 Bit Field Manipulation. The Intel 80200 processor shift and logical operation
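The two cost figures at the start of entry 371 (11 cycles and 9.5 cycles) are probability-weighted sums, and the arithmetic can be reproduced in a few lines. This is a sketch only: the helper name expected_cycles is hypothetical, and the grouping of the terms (one fixed cycle, then 50% weights on 10 and 10 for the conditional case and on 7, 6 and 4 for the branch case) is an assumption about how the expressions are meant to be read.

```python
def expected_cycles(fixed, weighted_terms):
    """Return fixed cycles plus the sum of probability * cycles per term."""
    return fixed + sum(prob * cycles for prob, cycles in weighted_terms)


# Assumed reading of the two expressions; see the note above.
conditional_cost = expected_cycles(1, [(0.5, 10), (0.5, 10)])
branch_cost = expected_cycles(1, [(0.5, 7), (0.5, 6), (0.5, 4)])
```

Changing the probabilities or cycle counts in the term lists shows how quickly the balance can tip the other way, which is why the guidance depends on the specific scenario.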
372. y Opcode_2 and CRm should be zero.
Table 7-22. Accessing the Performance Monitoring Registers
    Function      CRn (Register)   Instruction
    Read PMNC     0b0000           MRC p14, 0, Rd, c0, c0, 0
    Write PMNC    0b0000           MCR p14, 0, Rd, c0, c0, 0
    Read CCNT     0b0001           MRC p14, 0, Rd, c1, c0, 0
    Write CCNT    0b0001           MCR p14, 0, Rd, c1, c0, 0
    Read PMN0     0b0010           MRC p14, 0, Rd, c2, c0, 0
    Write PMN0    0b0010           MCR p14, 0, Rd, c2, c0, 0
    Read PMN1     0b0011           MRC p14, 0, Rd, c3, c0, 0
    Write PMN1    0b0011           MCR p14, 0, Rd, c3, c0, 0
7.3.2 Registers 4-5: Reserved. These registers are reserved. Reading and writing them yields unpredictable results. 7.3.3 Registers 6-7: Clock and Power Management. These registers contain functions for managing the core clock and power. Three low-power modes are supported that are entered upon executing the functions listed in Table 7-24. To enter any of these modes, write the appropriate data to CP14, register 7 (PWRMODE). Software may read this register, but since software only runs during ACTIVE mode, it always reads zeroes from the M field. PWRMODE Register. Reset value: writeable bits set to 0. Bits 31:2, Access: Read-unpre
373. ype Register (Sheet 2 of 2). Reset value: As Shown. Bits [5:3]: Read / Write Ignored; Instruction cache associativity = 0b101 (32-way). Bit [2]: Read-as-Zero / Write Ignored; Reserved. Bits [1:0]: Read / Write Ignored; Instruction cache line length = 0b10 (8 words/line). 7.2.2 Register 1: Control and Auxiliary Control Registers. Register 1 is made up of two registers: one that is compliant with ARM Version 5 and is referenced by opcode_2 = 0x0, and the other which is specific to Intel StrongARM and is referenced by opcode_2 = 0x1. The Exception Vector Relocation bit (bit 13 of the ARM control register) allows the vectors to be mapped into high memory rather than their default location at address 0. This bit is readable and writable by software. If the MMU is enabled, the exception vectors are accessed via the usual translation method involving the PID register (see Section 7.2.13, Register 13: Process ID, on page 7-16) and the TLBs. To avoid automatic application of the PID to exception vector accesses, software may relocate the exceptions to high memory. Table 7-6. ARM Control Register. Reset value: writeable b
374. zation would be to group highly used literal pool references into the same cache line. The advantage is that once one of the literals has been loaded, the other seven are available immediately from the data cache. B.4.3 Cache Considerations. B.4.3.1 Cache Conflicts, Pollution and Pressure. Cache pollution occurs when unused data is loaded in the cache, and cache pressure occurs when data that is not temporal to the current process is loaded into the cache. For an example, see Section B.4.4.2, Prefetch Loop Scheduling, below. B.4.3.2 Memory Page Thrashing. Memory page thrashing occurs because of the nature of SDRAM. SDRAMs are typically divided into 4 banks. Each bank can have one selected page, where a page address size for current memory components is often defined as 4k. Memory lookup time or latency time for a selected page address is currently 2 to 3 bus clocks. Thrashing occurs when subsequent memory accesses within the same memory bank access different pages. The memory page change adds 3 to 4 bus clock cycles to memory latency. This added delay extends the prefetch distance correspondingly, making it more difficult to hide memory access latencies. This type of thrashing can be resolved by placing the conflicting data structures into different memory banks, or by paralleling the data struc
375. ze. Optimizing for smaller code size, in general, lowers the performance of your application. This chapter contains techniques for optimizing for code size using the Intel 80200 processor instruction set. Space/Performance Trade-Off. Many optimizations mentioned in the previous chapters improve the performance of ARM code. However, using these instructions results in increased code size. Use the following optimizations to reduce the space requirements of the application code. Multiple Word Load and Store. The LDM/STM instructions are one word long and let you load or store multiple registers at once. Use the LDM/STM instructions instead of a sequence of loads/stores to consecutive addresses in memory whenever possible. Use of Conditional Instructions. Using conditional instructions to expand if-then-else statements, as described in Section B.3.1, Conditional Instructions, results in increasing the size of the generated code. Therefore, do not use conditional instructions if application code space requirements are an issue. Use of PLD Instructions. The preload instruction (PLD) is only a hint; it does not change the architectural state of the processor. Using or not using PLD instructions will not change the behavior of your code; therefore, you should avoid using these instructions when optimizing for space. C Test Features. The Intel 80200 processor based on Intel XScale microarchitectu
376. zing C Libraries. Optimizations for Size. B.7.1 Space/Performance Trade-Off. B.7.1.1 Multiple Word Load and Store. B.7.1.2 Use of Conditional Instructions. B.7.1.3 Use of PLD Instructions. JTAG. C.2.1 Boundary-Scan Architecture. C.2.3 Instruction Register (IR). C.2.3.1 Boundary-Scan Instruction Set. C.2.4 TAP Test Data. C.2.4.1 Device Identification Register. C.2.4.2 Bypass. C.2.4.3 Boundary-Scan Register. C.2.5.1 Test-Logic-Reset State. C.2.5.2 Run-Test/Idle State. C.2.5.4 Capture-DR State. C.2.5.5 Shift-DR State. C.2.5.6 Exit1-DR State. C.2.5.7 Pause-DR State