
AMD Athlon Processor x86 Code Optimization Guide (22007E/0, November 1999)


Contents

1. Index (excerpt): 3DNow! and MMX Intra-Operand Swapping 112; ... 122; Fast Division 108; Fast Square Root and Reciprocal Square Root 110; FEMMS Instruction 107; PAVGUSB for MPEG-2 Motion Compensation 123; PFCMP 114; PFMUL 113, 114; ...2 Instruction 113; PREFETCH and PREFETCHW Instructions 8, 46, 47, 49; PSWAPD Instruction 112, 126; Scalar Code Translated into 3DNow! Code 61, 64; Data Cache 134; ... 33, 133; Dependencies 128; DirectPath 133; DirectPath Over VectorPath Instructions 9, 34, 219; Instructions 219; Displacements, 8-Bit Sign Extended 39; Division 77, 80, 93, 95; Replace Divides with Multiplies, Integer 31, 77; Using 3DNow! Instructions 108, 109; Dynamic Memory Allocation Consideration 25; Event and Time Stamp Monitoring Software 168; Address Generation Interlocks 72; Execution Unit ...

2. Index (excerpt, continued): Ensure Floating-Point Variables and Expressions are Type Float ...; ... Instruction 105; Replace with Computation, 3DNow! Code 60; FXCH Instruction 99, 103; C Language 13; Group I, Essential Optimizations 7, 8; Array Style Over Pointer Style Code 15; Group II, Secondary Optimizations 7, 9; C Code to 3DNow! Code Examples 61-64; Structure Component Considerations 27, 55; Caches 4; 64-Byte Cache Line 11, 50; If Statements 24; Cache and Memory Optimizations 45; Immediates, 8-Bit 38; CALL and RETURN 59; Inline 71, 72, 86; Code Padding Using Neutral Code Fillers 39; Inline REP String with Low Counts 85; Code Sample Analysis 152; Complex Number Arithmetic 126; Const Type Qualifier 22; Constant Control Code, Multiple 23; Instruction ... (Index, page 237)

3. Index (excerpt, continued): ... 148; Extended Precision 99; Branch-Free Code 58; Code Padding 40; Family 3; ... 4, 129, 130; Far Control Transfer 65; AMD Athlon System Bus 139; Fetch and Decode Pipeline Stages 141; Macro... 98; Floating-Point Compare Instructions 98; Blended Code, AMD-K6 and AMD Athlon Processors 16, 29; 3DNow! and MMX Intra-Operand Swapping 112; Execution 137; Block Copies and Block Fills 115, 97; Branch Examples 58; Pipeline 150; Code Padding 41; Pipeline 146; Signed Words to Floating-Point Example 113; Scheduler 136; Branches; Subexpression 103; Align Branch Targets 36; To Integer Conversions 100; Compound Branch ...
4. ... 65
      Avoid Far Control Transfer Instructions 65
      Avoid Recursive Functions 66
   7  Scheduling Optimizations 67
      Schedule Instructions According to their Latency 67
      Unrolling Loops 67
         Complete Loop Unrolling 67
         Partial Loop Unrolling 68
      Use Function Inlining 71
         Overview 71
         Always Inline Functions Called from One Site 72
         Always Inline Functions with Fewer than 25 Machine Instructions 72
      Avoid Address Generation Interlocks 72
      Use MOVZX and MOVSX 73
      Minimize Pointer Arithmetic in Loops 73
      Push Memory Data Carefully 75
   8  Integer Optimizations 77
      Replace Divides with Multiplies 77
         Multiplication by Reciprocal Division Utility 77
         Unsigned Division by Multiplication of Constant 78
         Signed Division by Multiplication of Constant 79
      Use Alternative Code When Multiplying by a Constant 81
      Use MMX Instructions for Integer On...
5. [63:36 Reserved | 35:12 Physical Base | 11:8 Reserved | 7:0 Type]

   Symbol          Description                              Bits
   Physical Base   Base address in register pair            35:12
   Type            See MTRR Types and Properties            7:0

Figure 16. MTRRphysBasen Register Format
Note: A software attempt to write to reserved bits will generate a general protection exception.

Physical Base: Specifies a 24-bit value which is extended by 12 bits to form the base address of the region defined in the register pair.
Type: See "Standard MTRR Types and Properties" on page 176.

Page Attribute Table (PAT) 183

   [63:36 Reserved | 35:12 Physical Mask | 11 V | 10:0 Reserved]

   Symbol          Description                              Bits
   Physical Mask   24-Bit Mask                              35:12
   V               Variable Range Register Pair Enabled     11
                   (V = 0 at reset)

Figure 17. MTRRphysMaskn Register Format
Note: A software attempt to write to reserved bits will generate a general protection exception.

Physical Mask: Specifies a 24-bit mask to determine the range of the region defined in the register pair.
V: Enables the register pair when set (V = 0 at reset).

Mask values can represent discontinuous ranges (when the mask defines a lower significant bit as zero and a higher significant bit as one). In a discontinuous range, the memory area not mapped by the mask value is set to the default type. Discontinuous ranges should not be used. The range that is mapped by the variable-range MTRR register pair must meet the follow...
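As a minimal illustration of how a variable-range base/mask pair describes a region (this sketch is not text from the manual; the helper name is made up), a physical address falls inside the range when the masked address equals the masked base:

   #include <stdint.h>
   #include <stdbool.h>

   /* Sketch: test whether a 36-bit physical address is matched by one
      variable-range MTRR pair. PhysBase and PhysMask occupy bits 35:12 of
      MTRRphysBasen/MTRRphysMaskn; bit 11 of the mask register is the V
      (enable) bit. */
   static bool mtrr_matches(uint64_t mtrr_phys_base, uint64_t mtrr_phys_mask,
                            uint64_t phys_addr)
   {
       const uint64_t FIELD = 0x0000000FFFFFF000ull;   /* bits 35:12 */
       const uint64_t V_BIT = 1ull << 11;

       if (!(mtrr_phys_mask & V_BIT))
           return false;                               /* pair disabled */

       uint64_t base = mtrr_phys_base & FIELD;
       uint64_t mask = mtrr_phys_mask & FIELD;
       return (phys_addr & mask) == (base & mask);     /* region match */
   }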
6. Use the 3DNow! PREFETCH and PREFETCHW Instructions 47

        MOV   ECX, (-LARGE_NUM)      ;used biased index
        MOV   EAX, OFFSET array_a    ;get address of array_a
        MOV   EDX, OFFSET array_b    ;get address of array_b
        MOV   EBX, OFFSET array_c    ;get address of array_c
$loop:
        PREFETCHW [EAX+ECX*8+ARR_SIZE+192]     ;two cachelines ahead
        PREFETCH  [EDX+ECX*8+ARR_SIZE+192]     ;two cachelines ahead
        PREFETCH  [EBX+ECX*8+ARR_SIZE+192]     ;two cachelines ahead
        FLD   QWORD PTR [EDX+ECX*8+ARR_SIZE]       ;b[i]
        FMUL  QWORD PTR [EBX+ECX*8+ARR_SIZE]       ;b[i]*c[i]
        FSTP  QWORD PTR [EAX+ECX*8+ARR_SIZE]       ;a[i] = b[i]*c[i]
        FLD   QWORD PTR [EDX+ECX*8+ARR_SIZE+8]     ;b[i+1]
        FMUL  QWORD PTR [EBX+ECX*8+ARR_SIZE+8]     ;b[i+1]*c[i+1]
        FSTP  QWORD PTR [EAX+ECX*8+ARR_SIZE+8]     ;a[i+1] = b[i+1]*c[i+1]
        FLD   QWORD PTR [EDX+ECX*8+ARR_SIZE+16]    ;b[i+2]
        FMUL  QWORD PTR [EBX+ECX*8+ARR_SIZE+16]    ;b[i+2]*c[i+2]
        FSTP  QWORD PTR [EAX+ECX*8+ARR_SIZE+16]    ;a[i+2] = b[i+2]*c[i+2]
        FLD   QWORD PTR [EDX+ECX*8+ARR_SIZE+24]    ;b[i+3]
        FMUL  QWORD PTR [EBX+ECX*8+ARR_SIZE+24]    ;b[i+3]*c[i+3]
        FSTP  QWORD PTR [EAX+ECX*8+ARR_SIZE+24]    ;a[i+3] = b[i+3]*c[i+3]
        FLD   QWORD PTR [EDX+ECX*8+ARR_SIZE+32]    ;b[i+4]
        FMUL  QWORD PTR [EBX+ECX*8+ARR_SIZE+32]    ;b[i+4]*c[i+4]
        FSTP  QWORD PTR [EAX+ECX*8+ARR_SIZE+32]    ;a[i+4] = b[i+4]*c[i+4]
        FLD   QWORD PTR [EDX+ECX*8+ARR_SIZE+40]    ;b[i+5]
        FMUL  QWORD PTR ...
7. ...(tail of the preceding example: write the transformed vertex, advance to the next input vertex, reset to the start of the transform MATRIX m)...

   void XForm(float *res, const float *v, const float *m, int numverts)
   {
      int i;
      const VERTEX *vv = (VERTEX *)v;
      const MATRIX *mm = (MATRIX *)m;
      VERTEX *rr = (VERTEX *)res;
      for (i = 0; i < numverts; i++) {
         rr->x = vv->x*mm->m[0][0] + vv->y*mm->m[0][1] +
                 vv->z*mm->m[0][2] + vv->w*mm->m[0][3];
         rr->y = vv->x*mm->m[1][0] + vv->y*mm->m[1][1] +
                 vv->z*mm->m[1][2] + vv->w*mm->m[1][3];
         rr->z = vv->x*mm->m[2][0] + vv->y*mm->m[2][1] +
                 vv->z*mm->m[2][2] + vv->w*mm->m[2][3];
         rr->w = vv->x*mm->m[3][0] + vv->y*mm->m[3][1] +
                 vv->z*mm->m[3][2] + vv->w*mm->m[3][3];
         ++rr;
         ++vv;
      }
   }

Use Array Style Instead of Pointer Style Code 17

Completely Unroll Small Loops

Take advantage of the AMD Athlon processor's large 64-Kbyte instruction cache and completely unroll small loops. Unrolling loops can be beneficial to performance, especially if the loop body is small, which makes the loop overhead significant. Many compilers are not aggressive at unrolling loops. For loops that have a small fixed loop count and a small loop body, completely unrolling the loops at the source level is recommended.

Example 1 (Avoid): 3D transfo...
8. 204 Instruction Dispatch and Execution Resources AMDA 22007E 0 November 1999 Table 19 Integer Instructions Continued AMD Athlon Processor x86 Code Optimization Instruction Mnemonic b pin en M pon SHR mem16 32 imm8 Cih mm 101 xxx DirectPath SHR mregg 1 Doh 11 101 xxx DirectPath SHR 1 Doh mm_ 101 xxx DirectPath SHR mreg16 32 1 Dih 11 101 xxx DirectPath SHR mem16 32 1 Dih mm 101 xxx DirectPath SHR mreg8 CL D2h 11 101 xxx DirectPath SHR meme CL D2h mm 101 xxx DirectPath SHR mreg16 32 CL D3h 11 101 xxx DirectPath SHR mem16 32 CL D3h mm 101 xxx DirectPath SHLD mreg16 32 reg16 32 imm8 OFh A4h 11 xxx xxx VectorPath SHLD mem16 32 reg16 32 imm8 OFh A4h mm xxx xxx VectorPath SHLD mreg16 32 reg16 32 CL OFh Ash 11 xxx xxx VectorPath SHLD mem16 32 reg16 32 CL OFh A5h mm xxx xxx VectorPath SHRD mreg16 32 reg16 32 imm8 OFh ACh 11 xxx xxx VectorPath SHRD mem16 32 reg16 32 imm8 OFh ACh mm xxx xxx VectorPath SHRD mreg16 32 reg16 32 CL OFh ADh 11 xxx xxx VectorPath SHRD mem16 32 reg16 32 CL OFh ADh mm xxx xxx VectorPath SLDT 16 OFh ooh 11 000 xxx VectorPath SLDT mem16 OFh ooh mm 000 xxx VectorPath SMSW mreg16 OFh Oth 11 100 xxx VectorPath SMSW mem16 OFh oih mm 100 xxx VectorPath STC F9h DirectPath STD FDh
9. PSRLD mmreg mem64 DirectPath Instructions 227 AMD AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 Table 26 DirectPath MMX Instructions Continued Instruction Mnemonic Instruction Mnemonic PSRLD mmreg imm8 si PXOR mmreg mem64 PSRLQ mmreg1 mmreg2 Table 27 DirectPath MMX Extensions PSRLQ mmreg imm8 PSRLW mmreg1 mmreg2 Instruction Mnemonic PSRLW mmreg mem64 MOVNTQ mem64 mmreg PSRLW mmreg imm8 PAVGB mmreg1 mmreg2 PSUBB 1 mmreg2 PAVGB mmreg mem64 PSUBB mmreg mem64 PAVGW mmreg1 mmreg2 PSUBD mmregl mmreg2 PAVGW mmreg mem64 PSUBD mmreg mem64 PMAXSW mmreg1 mmreg2 PSUBSB mmreg1 mmreg2 PMAXSW mmreg mem64 PSUBSB mmreg mem64 PMAXUB mmreg1 mmreg2 PSUBSW mmreg1 mmreg2 PMAXUB mmreg mem64 PSUBSW mmreg mem64 PMINSW mmreg1 mmreg2 PSUBUSB mmreg1 mmreg2 PMINSW mmreg mem64 PSUBUSB mmreg mem64 PMINUB mmreg mmreg2 PSUBUSW mmreg1 mmreg2 PMINUB mmreg mem64 PSUBUSW mmreg mem64 PMULHUW mmreg1 mmreg2 PSUBW mmreg1 mmreg2 PMULHUW mmreg mem64 PSUBW mmreg mem64 PSADBW mmreg1 mmreg2 PUNPCKHBW mmreg1 mmreg2 PSADBW mmreg mem64 PUNPCKHBW mmreg mem64 PSHUFW 1 mmreg2 imm8 PUNPCKHDQ mmregl mmreg2 PSHUFW mmreg mem64 imm8 PUNPCKHDQ mmreg mem64 PREFETCHNT
10. ADD mreg16 32 imm16 32 CMOVA CMOVBE reg16 32 reg16 32 ADD mem16 32 imm16 32 CMOVA CMOVBE 16 32 mem16 32 ADD mreg16 32 imm8 sign extended CMOVAE CMOVNB CMOVNC reg16 32 mem16 32 ADD mem16 32 imm8 sign extended CMOVAE CMOVNB CMOVNC mem16 32 mem16 32 AND mreg8 reg8 CMOVB CMOVC CMOVNAE reg16 32 16 32 AND mem reg8 CMOVB CMOVC CMOVNAE mem 16 32 reg 16 32 220 DirectPath Instructions 22007E 0 November 1999 Table 25 DirectPath Integer Instructions Continued AMD Athlon Processor x86 Code Optimization Instruction Mnemonic Instruction Mnemonic CMOVBE CMOVNA 16 32 reg16 32 CMP AL imm8 CMOVBE CMOVNA reg16 32 mem16 32 CMP EAX imm16 32 CMOVE CMOVZ reg16 32 reg16 32 CMP mreg8 imm8 CMOVE CMOVZ reg16 32 mem16 32 imm8 CMOVG CMOVNLE reg16 32 reg16 32 CMP mreg16 32 imm16 32 CMOVG CMOVNLE reg16 32 mem16 32 CMP mem16 32 imm16 32 CMOVGE CMOVNL reg16 32 reg16 32 CMP mreg16 32 imm8 sign extended CMOVGE CMOVNL reg16 32 mem16 32 CMP mem16 32 imm8 sign extended CMOVL CMOVNGE reg16 32 reg16 32 CWD CDQ CMOVL CMOVNGE reg16 32 mem16 32 DEC EAX CMOVLE CMOVNG reg16 32 reg16 32 DEC ECX CMOVLE CMO
11.     ...                      ;multiplicand_lo * multiplier_lo
        ADD   EDX, ECX           ;p1 + p2_lo + p3_lo (result in EDX:EAX)
        RET                      ;return to caller
_llmul  ENDP

Efficient 64-Bit Integer Arithmetic 87

Example 7 (Division):
;_ulldiv divides two unsigned 64-bit integers and returns the quotient.
;
;INPUT:    [ESP+8]:[ESP+4]    dividend
;          [ESP+16]:[ESP+12]  divisor
;OUTPUT:   EDX:EAX   quotient of division
;DESTROYS: EAX, ECX, EDX, EFlags

_ulldiv PROC
        PUSH  EBX               ;save EBX as per calling convention
        MOV   ECX, [ESP+20]     ;divisor_hi
        MOV   EBX, [ESP+16]     ;divisor_lo
        MOV   EDX, [ESP+12]     ;dividend_hi
        MOV   EAX, [ESP+8]      ;dividend_lo
        TEST  ECX, ECX          ;divisor > 2^32-1?
        JNZ   $big_divisor      ;yes, divisor > 2^32-1
        CMP   EDX, EBX          ;only one division needed? (ECX = 0)
        JAE   $two_divs         ;need two divisions
        DIV   EBX               ;EAX = quotient_lo
        MOV   EDX, ECX          ;EDX = quotient_hi = 0 (quotient in EDX:EAX)
        POP   EBX               ;restore EBX as per calling convention
        RET                     ;done, return to caller

$two_divs:
        MOV   ECX, EAX          ;save dividend_lo in ECX
        MOV   EAX, EDX          ;get dividend_hi
        XOR   EDX, EDX          ;zero-extend it into EDX:EAX
        DIV   EBX               ;quotient_hi in EAX
        XCHG  EAX, ECX          ;ECX = quotient_hi, EAX = dividend_lo
        DIV   EBX               ;EAX = quotient_lo
        MOV   EDX, ECX          ;EDX = quotient_hi (quotient in EDX:EAX)
        POP   EBX               ;restore EBX as per calling convention
        RET                     ;done, return to caller

$big_diviso...
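To make the control flow of the routine above easier to follow, here is a C sketch of the same strategy (an illustration only, not the manual's code; the function name and use of a 128-bit product are assumptions): when the divisor fits in 32 bits, at most two 32-bit divides produce the quotient; otherwise divisor and dividend are shifted right together until the divisor fits in 32 bits, a quotient estimate is formed, and it is corrected at most once.

   #include <stdint.h>

   /* Sketch of the two-path strategy used by _ulldiv (illustrative C). */
   uint64_t ulldiv_sketch(uint64_t dividend, uint64_t divisor)
   {
       if ((divisor >> 32) == 0) {
           /* Small divisor: at most two 32-bit divides.
              (The assembly also has a one-divide fast path when
              dividend_hi < divisor.) */
           uint32_t d  = (uint32_t)divisor;
           uint64_t hi = dividend >> 32;
           uint64_t lo = dividend & 0xFFFFFFFFu;
           uint64_t q_hi = hi / d;                    /* first divide  */
           uint64_t rem  = hi % d;
           uint64_t q_lo = ((rem << 32) | lo) / d;    /* second divide */
           return (q_hi << 32) | q_lo;
       } else {
           /* Big divisor: shift divisor and dividend right together until
              the divisor fits in 32 bits, then form a quotient estimate. */
           int shift = 0;
           uint64_t d = divisor;
           while (d >> 32) { d >>= 1; shift++; }
           uint64_t q = (dividend >> shift) / (uint32_t)d;  /* estimate */
           /* The estimate is never too small and at most one too large;
              one correction suffices (128-bit product avoids overflow,
              GCC/Clang extension used here only for clarity). */
           if ((unsigned __int128)q * divisor > dividend)
               q--;
           return q;   /* quotient fits in 32 bits when divisor >= 2^32 */
       }
   }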
12. VectorPath Instructions 233 AMDA AMD Athlon Processor x86 Code Optimization Table 29 VectorPath Integer Instructions Continued Instruction Mnemonic STI STOSB meme AL STOSW 16 STOSD mem32 EAX STR mreg16 STR mem16 SYSCALL SYSENTER SYSEXIT SYSRET VERR 16 VERR mem16 VERW 16 VERW 16 WBINVD WRMSR XADD mreg8 reg8 XADD reg8 XADD mreg16 32 reg16 32 XADD mem16 32 reg16 32 XCHG reg8 mreg8 XCHG reg8 mem8 XCHG reg16 32 mreg16 32 reg16 32 mem 16 32 XCHG EAX ECX XCHG EAX EDX XCHG EAX EBX XCHG EAX ESP XCHG EAX EBP XCHG EAX ESI XCHG EAX EDI XLAT 22007E 0 November 1999 Table 30 VectorPath MMX Instructions Instruction Mnemonic MOVD mmreg mreg32 MOVD mreg32 mmreg Table 31 VectorPath MMX Extensions Instruction Mnemonic MASKMOVQ mmreg1 mmreg2 PEXTRW reg32 mmreg imm8 PINSRW mmreg reg32 imm8 PINSRW mmreg mem16 imm8 PMOVMSKB reg32 mmreg SFENCE 234 VectorPath Instructions 22007E 0 November 1999 Table 32 VectorPath Floating Point Instructions Instruction Mnemonic F2XMI AMD Athlon Processor x86 Code Optimization FBLD mem80 Instruction Mnemonic FLDENV m
13. ... 10
      Avoid Placing Code and Data in the Same 64-Byte Cache Line 11
   3  C Source-Level Optimizations 13
      Ensure Floating-Point Variables and Expressions are Type Float 13
      Use 32-Bit Data Types for Integer Code 13
      Consider the Sign of Integer Operands 14
      Use Array Style Instead of Pointer Style Code 15
      Completely Unroll Small Loops 18
      Avoid Unnecessary Store-to-Load Dependencies 18
      Consider Expression Order in Compound Branch Conditions 20
      Switch Statement Usage 21
         Optimize Switch Statements 21
      Use Prototypes for All Functions 21
      Use Const Type Qualifier 22
      Generic Loop Hoisting 22
         Generalization for Multiple Constant Control Code 23
      Declare Local Functions as Static 24
      Dynamic Memory Allocation Consideration 25
      Introduce Explicit Parallelism into Code 25
      Explicitly Extract Common Subexpressions 26
      C Language Structure Component Considerations 27
14. MEDEC MESEQ Cycle 6 IDEC Rename AMD Athlon Processor x86 Code Optimization The FETCH pipeline stage calculates the address of the next x86 instruction window to fetch from the processor caches or system memory SCAN determines the start and end pointers of instructions SCAN can send up to six aligned instructions DirectPath and VectorPath to ALIGN1 and only one VectorPath instruction to the microcode engine MENG per cycle Because each 8 byte buffer quadword queue can contain up to three instructions ALIGN1 can buffer up to a maximum of nine instructions or 24 instruction bytes ALIGN 1 tries to send three instructions from an 8 byte buffer to ALIGN2 per cycle For VectorPath instructions the microcode engine control MECTL stage of the pipeline generates the microcode entry points ALIGN2 prioritizes prefix bytes determines the opcode ModR M and SIB bytes for each instruction and sends the accumulated prefix information to EDEC In the microcode engine ROM pipeline stage the entry point generated in the previous cycle MECTL is used to index into the MROM to obtain the microcode lines necessary to decode the instruction sent by SCAN The early decode EDEC stage decodes information from the DirectPath stage ALIGN2 and VectorPath stage MEROM into MacroOPs In addition EDEC determines register pointers flag updates immediate values displacements and other information EDEC then sel
15. Fast Conversion of Signed Words to Floating Point 113

...cycle bypassing penalty, and another one-cycle penalty if the result goes to a 3DNow! operation. The PFMUL execution latency is four; therefore, in the worst case, the PXOR and PFMUL instructions are the same in terms of latency. On the AMD-K6 processor, there is only a one-cycle latency for PXOR, versus a two-cycle latency for the 3DNow! PFMUL instruction. Use the following code to negate 3DNow! data:

   msgn  DQ  8000000080000000h
         PXOR  MM0, [msgn]     ;toggle sign bit

Use MMX PCMP Instead of 3DNow! PFCMP

Use the MMX PCMP instruction instead of the 3DNow! PFCMP instruction. On the AMD Athlon processor, the PCMP has a latency of two cycles, while the PFCMP has a latency of four cycles. In addition to the shorter latency, PCMP can be issued to either the FADD or the FMUL pipe, while PFCMP is restricted to the FADD pipe.

Note: The PFCMP instruction has a GE (greater or equal) version, PFCMPGE, that is missing from PCMP.

Both Numbers Positive: If both arguments are positive, PCMP always works.

One Negative, One Positive: If one number is negative and the other is positive, PCMP still works, except when one number is a positive zero and the other is a negative zero.

Both Numbers Negative: Be careful when performing integer comparison using PCMPGT on two negative 3DNow! numbers. T...
16. ...(tail of the optimized matrix-multiplication listing; only the comment column survives in this extract: products of v->x, v->y, v->z, v->w with the matrix entries m[0..3][0..3], stores of res->x, res->y, res->z, res->w, loop until numverts vertices are done, then clear the MMX state)...

Optimized Matrix Multiplication 121

Efficient 3D Clipping Code Computation Using 3DNow! Instructions

.DATA
RIGHT        EQU  01h
LEFT         EQU  02h
ABOVE        EQU  04h
BELOW        EQU  08h
BEHIND       EQU  10h
BEFORE       EQU  20h
ABOVE_RIGHT  EQU  ...
BELOW_LEFT   EQU  ...
.CODE

Clipping is one of the major activities occurring in a 3D graphics pipeline. In many instances this activity is split into two parts, which do not necessarily have to occur consecutively:
m  Computation of the clip code for each vertex, where each bit of the clip code indicates whether the vertex is outside the frustum with regard to a specific clip plane.
m  Examination of the clip...
17. 1 DS XXXx xlxxb SS xxlxb CS XXXX xxxlb ES 21h LS Stores to active instruction stream 40h DC Data cache accesses 4th DC Data cache misses xxxxb Modified M 1 XXXX_1xxxb Owner 0 42h DC 1 Exclusive E Data cache refills Shared S Invalid I xxxxb Modified 1 1 Owner 0 43h DC XXXX_X1xxb Exclusive E Data cache refills from system Shared S 18 Invalid I xxxxb Modified M 1 XXXX_1xxxb Owner 0 44h DC XXXX_X1xxb Exclusive E Data cache writebacks Shared S xxxlb Invalid 1 45h DC L1 DTLB misses and L2 DTLB hits 46h DC L1 and L2 DTLB misses 47h DC Misaligned data references 64h BU DRAM system requests 164 Performance Counter Usage AMDA 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization Table 11 Performance Monitoring Counters Continued Nas e Notes Unit Mask bits 15 8 Event Description lxxx xxxxb reserved xxxxb WB xxxxb WP 65h BU Xxxl xxxxb WT System requests with the selected type bits 11 10 reserved xxlxb WC xxxlb UC bits 15 11 reserved xlxxb L2 L2 hit and no DC 73h BU Oj hit Snoop hits Xxxx xxlxb Data
18. ... 40
      Recommendations for AMD-K6 Family and AMD Athlon Processor Blended Code 41
   5  Cache and Memory Optimizations 45
      Memory Size and Alignment Issues 45
         Avoid Memory Size Mismatches 45
         Align Data Where Possible 46
      Use the 3DNow! PREFETCH and PREFETCHW Instructions 46
      Take Advantage of Write Combining 50
      Avoid Placing Code and Data in the Same 64-Byte Cache Line 50
      Store-to-Load Forwarding 51
         Store-to-Load Forwarding Pitfalls, True Dependencies 51
         Summary of Store-to-Load Forwarding Pitfalls to Avoid 54
      Stack Alignment Considerations 54
      Align TBYTE Variables on Quadword-Aligned Addresses 55
      C Language Structure Component Considerations 55
      Sort Variables According to Base Type Size 56
   6  Branch Optimizations 57
      Avoid Branches Dependent on Random Data 57
         AMD Athlon Processor Specific Code 58
         Blended AMD-K6 and AMD Athlon Processor Code 58
      Always Pair CALL and RETURN 59
      Replace Branches with Computation in 3DNow! Code 60
         Muxing Constructs 60
         Sample Code Translated into 3DNow! Code 61
      Avoid the Loop Instruction ...
19. ...(residue of the preceding branch-replacement example: masks of 0xffffffff/0 selecting between pi/2 and -pi/2)...

Replace Branches with Computation in 3DNow! Code 64

Avoid the Loop Instruction

The LOOP instruction in the AMD Athlon processor requires eight cycles to execute. Use the preferred code shown below.

Example 1 (Avoid):
      LOOP  LABEL

Example 2 (Preferred):
      DEC   ECX
      JNZ   LABEL

Avoid Far Control Transfer Instructions

Avoid using far control transfer instructions. Far control transfer branches cannot be predicted by the branch target buffer (BTB).

Avoid the Loop Instruction 65

Avoid Recursive Functions

Avoid recursive functions due to the danger of overflowing the return address stack. Convert end-recursive functions to iterative code. An end-recursive function is one where the function call to itself is at the end of the code.

Example 1 (Avoid):
   long fac(long a)
   {
      if (a == 0) {
         return (1);
      } else {
         return (a * fac(a - 1));
      }
   }

Example 2 (Preferred):
   long fac(long a)
   {
      long t = 1;
      while (a > 0) {
         t *= a;
         a--;
      }
      return (t);
   }

Avoid Recursive Functions 66

Scheduling Optimizations

This...
20. m  Avoid Placing Code and Data in the Same 64-Byte Cache Line

Optimization Star

The top optimizations described in this chapter are flagged with a star. In addition, the star appears beside the more detailed descriptions found in subsequent chapters.

Group I Optimizations: Essential Optimizations

Memory Size and Alignment Issues
See "Memory Size and Alignment Issues" on page 45 for more details.

Avoid Memory Size Mismatches
Avoid memory size mismatches when instructions operate on the same data. For instructions that store and reload the same data, keep operands aligned and keep the loads/stores of each operand the same size.

Align Data Where Possible
Avoid misaligned data references. A misaligned store or load operation suffers a minimum one-cycle penalty in the AMD Athlon processor load/store pipeline.

Use the 3DNow! PREFETCH and PREFETCHW Instructions
For code that can take advantage of prefetching, use the 3DNow! PREFETCH and PREFETCHW instructions to increase the effective bandwidth to the AMD Athlon processor, which significantly improves performance. All the prefetch instructions are essentially integer instructions and can be used anywhere, in any type of code (integer, x87, 3DNow!, MMX, etc.). Use the following formula to determin...
21. The Intel documentation P6 PII states that the mapping of large pages into regions that are mapped with differing memory types can result in undefined behavior However testing shows that these processors decompose these large pages into 4 Kbyte pages When a large page 2 Mbytes 4 Mbytes mapping covers a region that contains more than one memory type as mapped by the MTRRs the AMD Athlon processor does not suppress the caching of that large page mapping and only caches the mapping for just that 4 Kbyte piece in the 4 Kbyte TLB Therefore the AMD Athlon processor does not decompose large pages under these conditions The fixed range MTRRs are 176 Memory Type Range Register MTRR Mechanism AMDA 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization not affected by this issue only the variable range and MTRR DefType registers are affected Page Attribute Table PAT MSR Access The Page Attribute Table PAT is an extension of the page table entry format which allows the specification of memory types to regions of physical memory based on the linear address The PAT provides the same functionality as MTRRs with the flexibility of the page tables It provides the operating systems and applications to determine the desired memory type for optimal performance PAT support is detected in the feature flags bit 16 of the CPUID instruction The PAT is located in a 64 bit MSR at location 27
22. JNZ JNE short disp8 MOV meme reg8 JBE JNA short disp8 MOV mreg16 32 reg16 32 JNBE JA short disp8 MOV mem16 32 reg16 32 JS short 8 MOV reg8 mreg8 JNS short disp8 MOV reg8 mem8 JP JPE short disp8 MOV reg16 32 mreg16 32 JNP JPO short disp8 MOV 16 32 mem16 32 JL JNGE short disp8 MOV AL mem8 JNL JGE short disp8 MOV EAX mem 16 32 JLE JNG short disp8 MOV meme AL JNLE JG short disp8 MOV mem16 32 EAX JO near disp16 32 MOV AL imm8 JNO near disp16 32 MOV CL imm8 JB JNAE near disp16 32 MOV DL imm8 JNB JAE near disp16 32 MOV BL imm8 JZ JE near disp16 32 MOV AH imm8 JNZ JNE near disp16 32 MOV CH imm8 JBE JNA near disp16 32 MOV DH imm8 JNBE JA near disp16 32 MOV BH imm8 JS near disp16 32 MOV EAX imm16 32 JNS near disp 2 MOV ECX imm16 32 JP JPE near disp16 32 MOV EDX imm16 32 JNP JPO near disp16 32 MOV EBX imm16 32 JL JNGE near disp16 32 MOV ESP imm16 32 JNL JGE near disp16 32 MOV EBP imm16 32 JLE JNG near disp16 32 MOV ESI imm16 32 JNLE JG near disp16 32 MOV EDI imm16 32 JMP near disp16 32 direct MOV mreg8 imm8 JMP far disp32 48 direct MOV meme imm8 JMP disp8 short MOV mreg16 32 imm16 32 222 DirectPath Instructions 22007E 0 November 1999 Table 25 DirectPath Integer Instructions Continued AMD Athlon Processor x86 Code Optimization Instruction Mnemonic Instruction Mnemonic MOV mem16 32 imm 16 3
23. Splitting a load-execute integer instruction into two separate instructions (a load instruction and a "reg, reg" instruction) reduces decoding bandwidth and increases register pressure, which results in lower performance. The split-instruction form can be used to avoid scheduler stalls for longer executing instructions and to explicitly schedule the load and execute operations.

Select DirectPath Over VectorPath Instructions 34

Use Load-Execute Floating-Point Instructions with Floating-Point Operands

When operating on single-precision or double-precision floating-point data, wherever possible use floating-point load-execute instructions to increase code density.

Note: This optimization applies only to floating-point instructions with floating-point operands, and not with integer operands, as described in the next optimization.

This coding style helps in two ways. First, denser code allows more work to be held in the instruction cache. Second, the denser code generates fewer internal OPs and therefore the FPU scheduler holds more work, which increases the chances of extracting parallelism from the code.

Example 1 (Avoid):
      FLD   QWORD PTR [TEST1]
      FLD   QWORD PTR [TEST2]
      FMUL  ST, ST(1)

Example 2 (Preferred):
      FLD   QWORD PTR [TEST1]
      FMUL  QWORD PTR [TEST2]

Avoid Load-Execute Floating-Point Instructions with Integer Operands

Do not use...
24. Table 19 Integer Instructions Continued AMD Athlon Processor x86 Code Optimization Instruction Mnemonic b pin jd M pon NOT mem8 Feh mm 010 xx DirectPath NOT mreg16 32 F7h 11 010 xxx DirectPath NOT mem16 32 F7h mm 010 xx DirectPath OR mreg8 reg8 08h 11 xxx xxx DirectPath OR mem8 reg8 08h mm xxx xxx DirectPath OR mreg16 32 reg16 32 09h 11 xxx xxx DirectPath OR mem16 32 reg16 32 09h mm xxx xxx DirectPath OR reg8 mreg8 0Ah 11 xxx xxx DirectPath OR reg8 mem8 0Ah mm xxx xxx DirectPath OR reg16 32 mreg16 32 OBh 11 xxx xxx DirectPath OR 16 32 mem16 32 OBh mm xxx xxx DirectPath OR AL imm8 oCh DirectPath OR EAX imm16 32 0Dh DirectPath OR mreg8 imm8 80h 11 001 xxx DirectPath OR mem8 imm8 80h mm 001 xxx DirectPath OR mreg16 32 imm16 32 811 11 001 xxx DirectPath OR mem16 32 imm16 32 81h mm 001 xxx DirectPath OR mreg16 32 imm sign extended 83h 11 001 xxx DirectPath OR mem16 32 imm8 sign extended 85h mm 001 xxx DirectPath OUT AL E6h VectorPath OUT imm8 AX E7h VectorPath OUT imma EAX E7h VectorPath OUT DX AL EEh VectorPath OUT DX AX EFh VectorPath OUT DX EAX EFh VectorPath POP ES 07h VectorPath POP SS 17h VectorPath POP DS 1Fh VectorPath POP FS OFh Ath VectorPath POP GS OFh A9h VectorPath POP EAX 58h VectorPath POP ECX 59h VectorPath POP EDX 5Ah Vector
25. While string instructions with DF = 1 (DOWN) are slower, only the overhead part of the cycle equation is larger and not the throughput part. See Table 1, "Latency of Repeated String Instructions," on page 84 for additional latency numbers.

For REP MOVS, make sure that both source and destination are aligned with regard to the operand size. Handle the end case separately, if necessary. If either source or destination cannot be aligned, make the destination aligned and the source misaligned. For REP STOS, make the destination aligned.

Expand REP string instructions into equivalent sequences of simple x86 instructions if the repeat count is constant and less than eight. Use an inline sequence of loads and stores to accomplish the move. Use a sequence of stores to emulate REP STOS. This technique eliminates the setup overhead of REP instructions and increases instruction throughput.

If the repeated count is variable, but is likely less than eight, use a simple loop to move/store the data. This technique avoids the overhead of REP MOVS and REP STOS (see the sketch after this section).

To fill or copy blocks of data that are larger than 512 bytes, or where the destination is in uncacheable memory, it is recommended to use the MMX instructions MOVQ/MOVNTQ instead of REP STOS and REP MOVS in order to achieve maximum performance. See the guideline "Use MMX Instructions for Block Copies and Block Fills" on page 115.

Repeated String Instruction Usage 85
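The following C sketch illustrates the same two guidelines at source level (illustrative only, not the manual's code; the helper names are made up): a plain loop for a small, variable count, and a fully unrolled sequence when the count is a small compile-time constant.

   #include <stddef.h>
   #include <stdint.h>

   /* Small, variable count (likely < 8): a simple loop avoids the setup
      overhead of a REP MOVSD sequence that a memcpy() call might use. */
   static void copy_small_dwords(uint32_t *dst, const uint32_t *src, size_t n)
   {
       size_t i;
       for (i = 0; i < n; i++)
           dst[i] = src[i];
   }

   /* Constant count below eight: an inline, fully unrolled sequence of
      loads and stores removes the loop entirely, mirroring the "expand
      REP string instructions" guideline. */
   static void copy_4_dwords(uint32_t *dst, const uint32_t *src)
   {
       dst[0] = src[0];
       dst[1] = src[1];
       dst[2] = src[2];
       dst[3] = src[3];
   }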
26. ...(tail of the 64-bit division/remainder routine; only the comment column survives in this extract: save dividend_lo in ECX; dividend_hi; zero-extend it into EDX:EAX; quotient_hi; intermediate remainder; remainder_lo; remainder_hi = 0; restore EBX/EDI as per calling convention; return to caller. The big-divisor path shifts both divisor and dividend right one bit at a time until the divisor is less than 2^32 and fits in EBX, stores the original divisor, computes the quotient estimate, multiplies quotient by divisor (hi and lo words), subtracts the product from the dividend, and, if the remainder went negative, adds the divisor back to correct the result.)...

Efficient 64-Bit Integer Arithmetic 90

Efficient Implementation of Population Count Function

Step 1 ... Step 2 ...

Population count is an operation that determines the number of set bits in a 32-bit operand.
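The manual goes on to derive integer and MMX implementations, which are not included in this extract. For reference, a population count of the kind this section optimizes can be sketched in C as the classic bit-parallel (SWAR) reduction below; this is a standard formulation, not necessarily the manual's exact instruction sequence.

   #include <stdint.h>

   /* Bit-parallel population count of a 32-bit operand: sum adjacent bit
      pairs, then nibbles, then bytes, and gather the byte sums with a
      multiply. */
   static unsigned popcount32(uint32_t v)
   {
       v = v - ((v >> 1) & 0x55555555u);                  /* 2-bit sums  */
       v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);  /* 4-bit sums  */
       v = (v + (v >> 4)) & 0x0F0F0F0Fu;                  /* 8-bit sums  */
       return (v * 0x01010101u) >> 24;                    /* add bytes   */
   }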
27. 00041 001 0007h The performance counter MSRs contain the event or duration counts for the selected events being counted The RDPMC instruction can be used by programs or procedures running at any privilege level and in virtual 8086 mode to read these counters The PCE flag in control register CR4 bit 8 allows the use of this instruction to be restricted to only programs and procedures running at privilege level 0 The RDPMC instruction is not serializing or ordered with other instructions Therefore it does not necessarily wait until all previous instructions have been executed before reading the counter Similarly subsequent instructions can begin execution before the RDPMC instruction operation is performed Only the operating system executing at privilege level 0 can directly manipulate the performance counters using the RDMSR and WRMSR instructions A secure operating system would clear the PCE flag during system initialization which disables direct user access to the performance monitoring counters but provides a user accessible programming interface that emulates the RDPMC instruction The WRMSR instruction cannot arbitrarily write to the performance monitoring counter MSRs PerfCtr 3 0 Instead the value should be treated as 64 bit sign extended which Performance Counter Usage 167 AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 allows writing both positive and negative val
28.     ...
        JB    DONE2               ;then i = 0
        MOV   EDX, DWORD PTR [X]  ;get lower 32 bits of double
        SHR   ECX, 20             ;extract exponent
        SHRD  EDX, EAX, 21        ;extract mantissa
        NEG   ECX                 ;compute shift factor for extracting
        ADD   ECX, 1054           ; non-fractional mantissa bits
        OR    EDX, 080000000h     ;set integer bit of mantissa
        SAR   EAX, 31             ;x < 0 ? 0xFFFFFFFF : 0
        SHR   EDX, CL             ;trunc(abs(x))
        XOR   EDX, EAX            ;negate result
        SUB   EDX, EAX            ; if x < 0
DONE2:  MOV   [I], EDX            ;store result

For applications which can tolerate a floating-point to integer conversion that is not compliant with existing programming language standards, but is IEEE-754 compliant, perform the conversion using the rounding mode that is currently in effect (usually round-to-nearest-even).

Example 4 (Fastest):
      FLD   QWORD PTR [X]    ;get double to be converted
      FISTP DWORD PTR [I]    ;store integer result

Some compilers offer an option to use the code from example 4 for floating-point to integer conversion, using the default rounding mode.

Lastly, consider setting the rounding mode throughout an application to truncate and using the code from example 4 to perform extremely fast conversions that are compliant with language standards and IEEE-754. This mode is also provided as an option by some compilers. Note that use of this technique also changes the rounding mode for all other FPU operations inside the application, which can lead to significant changes in numerical results and even program failure (fo...
29. D9h mm 110 xxx VectorPath FSTENV mem28byte D9h mm 110 xxx VectorPath FSTP mem32real D9h mm 011 xxx DirectPath FADD FMUL FSTP meme4real DDh mm 011 xxx DirectPath FADD FMUL FSTP 80 D9h mm 111 xxx VectorPath FSTP ST i DDh 11 011 xxx DirectPath FADD FMUL FSTSW AX DFh VectorPath FSTSW mem16 DDh mm 111 xxx VectorPath FSTORE FSUB memz2real D8h mm 100 xxx DirectPath FADD FSUB mem64real DCh mm 100 xxx DirectPath FADD FSUB ST ST i D8h 11 100 xxx DirectPath FADD 1 FSUB ST i ST DCh 11 101 xxx DirectPath FADD 1 FSUBP ST ST i DEh 11 101 xxx DirectPath FADD 1 FSUBR mem32real D8h mm 101 xxx DirectPath FADD FSUBR mem64real DCh mm 101 xxx DirectPath FADD FSUBR ST ST i D8h 11 100 xxx DirectPath FADD 1 FSUBR ST i ST DCh 11 101 xxx DirectPath FADD 1 FSUBRP ST i ST DEh 11 100 xxx DirectPath FADD 1 FTST D9h E4h DirectPath FADD FUCOM DDh 11 100 xxx DirectPath FADD FUCOMI ST ST i DB E8 EFh VectorPath FADD FUCOMIP ST ST i DF E8 EFh VectorPath FADD FUCOMP DDh 11 101 xxx DirectPath FADD FUCOMPP DAh E9h DirectPath FADD FWAIT 9Bh DirectPath FXAM D9h E5h VectorPath FXCH D9h 11 001 xxx DirectPath FADD FMUL FSTORE FXTRACT F4h VectorPath FYL2X D9h Fih VectorPath FYL2XP1 D9h F9h VectorPath Notes 1 The last three bits of the modR M byte select the stack entry ST i 216 Instruction Dispatch and Execution Resources AMDA 22007E 0 Nov
30. DB 08Dh 1 00000000 EQU DB 08Dh 1 00000000 1 EQU DB 08Dh 1 00000000 EQU DB 08Dh 1 00000000 1 EQU DB 08Dh 1 00000000 EQU DB 08Dh 1 00000000 EQU DB 08Dh 1 00000000 EQU DB 08Dh 1 00000000 EQU DB 08Dh 1 00000000 EQU DB 08Dh 1 00000000 EQU DB 08Dh 1 00000000 EQU DB 08Dh 1 00000000 EQU DB 08Dh 1 00000000 EQU DB 08Dh 1 00000000 EQU DB 08Dh DB 0EBh 007h 90h OBFh OADh 004h 01Ch 00Ch 014h 034h 03Ch 02Ch 1 004h nop 01Ch s p 00Ch I p 014 5 034h nop 03Ch 02Ch 0 0 0 0 gt 0 0 0 0 gt 005h O1Dh 00Dh 015h 035h 03Dh 020 005h O1Dh 00Dh O15h 035h 030 02Dh 0 0 0 0 gt 0 0 0 0 gt 0 0 0 0 gt 0 0 0 0 gt 0 0 0 0 gt 0 0 0 0 gt 0 0 0 0 gt 0 0 0 0 90h 0 0 0 0 90h 0 0 0 0 90h 0 0 0 0 90h 0 0 0 0 90h 0 0 0 0 90h 0 0 0 0 90h 90h 90h 90h 90h 90h 90h Code Padding Using Neutral Code Fillers 43 AMD AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 44 Code Padding Using Neutral Code Fillers AMDA 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization Cache and Memory Optimizations This chapter describes code optimization techniques that take advantage of the large L1 caches and
31. DirectPath FADD PSWAPD mmreg1 mmreg2 OFh OFh BBh 11 xxx xxx DirectPath FADD FMUL PSWAPD mmreg mem64 OFh OFh BBh mm xxx xxx DirectPath FADD FMUL 218 Instruction Dispatch and Execution Resources AMDA 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization Appendix G DirectPath versus VectorPath Instructions Select DirectPath Over VectorPath Instructions Use DirectPath instructions rather than VectorPath instructions DirectPath instructions are optimized for decode and execute efficiently by minimizing the number of operations per x86 instruction which includes register lt register memory as well as register register op register forms of instructions DirectPath Instructions The following tables contain DirectPath instructions which should be used in the AMD Athlon processor wherever possible m Table 25 DirectPath Integer Instructions on page 220 m Table 26 DirectPath MMX Instructions on page 227 and Table 27 DirectPath MMX Extensions on page 228 m Table 28 DirectPath Floating Point Instructions on page 229 m All 3DNow instructions including the 3DNow Extensions are DirectPath and are listed in Table 23 3DNow Instructions on page 217 and Table 24 3DNow Exten sions on page 218 Select DirectPath Over VectorPath Instructions 219 AMDA AMD Athlon Processor
32. FLDPI D9h EBh DirectPath FSTORE FLDZ EEh DirectPath FSTORE FMUL ST ST i D8h 11 001 xxx DirectPath FMUL 1 FMUL ST i ST DCh 11 001 xxx DirectPath FMUL 1 FMUL mem32real D8h mm 001 xxx DirectPath FMUL FMUL mem64real DCh mm 001 xxx DirectPath FMUL FMULP ST ST i DEh 11 001 xxx DirectPath FMUL 1 FNOP D9h Doh DirectPath FADD FMUL FSTORE FPTAN Dh F2h VectorPath FPATAN Dh F3h VectorPath FPREM D9h F8h DirectPath FMUL FPREMI D9h F5h DirectPath FMUL FRNDINT D9h VectorPath FRSTOR mem94byte DDh mm 100 xxx VectorPath FRSTOR mem108byte DDh mm 100 xxx VectorPath FSAVE mem94byte DDh mm 110 xxx VectorPath FSAVE mem108byte DDh mm 110 xxx VectorPath FSCALE D9h FDh VectorPath FSIN FEh VectorPath FSINCOS D9h FBh VectorPath FSQRT FAh DirectPath FMUL FST mem32real D9h mm 010 xxx DirectPath FSTORE FST mem64real DDh mm 010 xxx DirectPath FSTORE FST ST i DDh 11 010xxx DirectPath FADD FMUL Instruction Dispatch and Execution Resources 215 AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 Table 22 Floating Point Instructions Continued Instruction Mnemonic Pir St Second MOR F P Note Byte Byte Byte Type Pipe s FSTCW mem16 D9h mm 111 xxx VectorPath FSTENV 14
33. Input Memory Type Output Memory Type AMD 751 5 5 5 9 Note 5184 512 2 uL e e UC e e UC 1 e e CD e e CD e e WC e WC e e WT e e WT 1 e e WP e e WP 1 e e WB e e WB e e e e e CD 1 2 e UC e UC e CD e CD e WC e WC e WT 3 WP e WP 1 e WB e CD 3 e e e CD 2 e UC e UC e CD e CD e WC e WC e WT e CD 6 e WP e CD 6 e WB 6 e e CD 2 e e UC e UC 22007E 0 November 1999 180 Page Attribute Table PAT AMD 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization Table 16 Final Output Memory Types Continued Input Memory Type Output Memory Type AMD 751 E E 5 Note E E E a 7 gt 2 5 pr e e CD e e CD e e WC e e WC e e WT e e WT e e WP e e WP e e WB e e WT 4 e e e e e CD 2 Notes 1 WP is not functional for RdMem WrMem 2 ForceCD must cause the MTRR memory type to be ignored in order to avoid x s 3 D Ishould always be WP because the BIOS will only program RdMem WrlO for WP CD Is forced to preserve the write protect intent 4 Since cached IO lines cannot be copied back to IO the processor forces WB to WT to prevent cached IO from going dirty 5 ForceCD The memory type is forced CD due to CRO CD 1 2 memory type 15 for the ITLB and the is disabled or for the DTLB and the D Cache is disabled 3 w
34. PFMAX mmreg mem64 OFh OFh A4h mm xxx xxx DirectPath FADD PFMIN mmreg1 mmreg2 OFh OFh 94h 11 xxx xxx DirectPath FADD PFMIN mmreg mem64 OFh OFh 94h mm xxx xxx DirectPath FADD PFMUL mmreg1 mmreg2 OFh OFh B4h 11 xxx xxx DirectPath FMUL PFMUL mmreg mem64 OFh OFh B4h mm xxx xxx DirectPath FMUL PFRCP mmreg1 mmreg2 OFh OFh 96h 11 Xxx xxx DirectPath FMUL PFRCP mmreg mem64 OFh OFh 96h mm xxx xxx DirectPath FMUL mmreg1 mmreg2 OFh OFh A6h 11 xxx xxx DirectPath FMUL PFRCPITI mmreg mem64 OFh OFh Aeh mm xxx xxx DirectPath FMUL PFRCPIT2 mmreg1 mmreg2 OFh OFh 6 11 xxx xxx DirectPath FMUL PFRCPIT2 mmreg mem64 OFh OFh B6h mm xxx xxx DirectPath FMUL PFRSQITI mmregl mmreg2 OFh OFh A7h 11 xxx xxx DirectPath FMUL PFRSOITI mmreg mem64 OFh OFh A7h DirectPath FMUL PFRSQRT mmregl mmreg2 OFh OFh 97h 11 xxx xxx DirectPath FMUL 1 For the PREFETCH and PREFETCHW instructions the mem8 value refers to an address in the 64 byte line that will be Instruction Dispatch and Execution Resources 217 AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 Table 23 3DNow Instructions Continued Instruction Mnemonic Prefix imm8 ModR M Decode F pU Note Byte s Byte Type Pipe s PFRSQRT mmreg mem64 OFh OFh 97h mm xxx x
35.   Sort Local Variables According to Base Type Size 28
      Accelerating Floating-Point Divides and Square Roots 29
      Avoid Unnecessary Integer Division 31
      Copy Frequently De-referenced Pointer Arguments to Local Variables 31
   4  Instruction Decoding Optimizations 33
      Overview 33
      Select DirectPath Over VectorPath Instructions 34
      Load-Execute Instruction Usage 34
         Use Load-Execute Integer Instructions 34
         Use Load-Execute Floating-Point Instructions with Floating-Point Operands 35
         Avoid Load-Execute Floating-Point Instructions with Integer Operands 35
      Align Branch Targets in Program Hot Spots 36
      Use Short Instruction Lengths 36
      Avoid Partial Register Reads and Writes 37
      Replace Certain SHLD Instructions with Alternative Code 38
      Use 8-Bit Sign-Extended Immediates 38
      Use 8-Bit Sign-Extended Displacements 39
      Code Padding Using Neutral Code Fillers 39
         Recommendations for the AMD Athlon Processor
36. esp 2 TEXTEQU DB O8Bh 0EDh gt mov ebp ebp TEXTEQU DB O8Dh 004h 020h gt lea eax eax TEXTEQU DB 08Dh 01Ch 023h lea ebx Code Padding Using Neutral Code Fillers 41 AMDA AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 ECX TEXTEQU DB 08Dh 00Ch 021h lea ecx ecx EDX TEXTEQU DB 08Dh 014h 022h lea edx edx ESI TEXTEQU DB 08Dh 024h 024h lea esi esi EDI TEXTEQU DB 08Dh 034h 026h lea edi edi OP3 ESP TEXTEQU DB O8Dh 03Ch 027h gt lea esp esp 0P3 TEXTEQU DB 08Dh 06Dh 000h lea ebp ebp OP4_EAX TEXTEQU DB 08Dh 044h 020h 000h lea eax 00 OP4_EBX TEXTEQU DB 08Dh 05Ch 023h 000h 168 ebx 00 OP4_ECX TEXTEQU DB 08Dh 04Ch 021h 000h 168 ecx ecx 00 OP4_EDX TEXTEQU DB 08Dh 054h 022h 000h 168 edx edx 00 OPA ESI TEXTEQU DB O8Dh 064h 024h 000h gt lea esi 00 65 0 4 EDI TEXTEQU DB 08Dh 074h 026h 000h lea edi 601 00 0 4 ESP TEXTEQU DB 08Dh 07Ch 027h 000h 168 esp 00 lea eax Leax 00 nop OP5 TEXTEQU DB O8Dh 044h 020h 000h 090h gt lea Lebx 00 nop OP5 EBX TEXTEQU DB 08Dh 05Ch 023h 000h 090h lea ecx Lecx 00 nop OP5 ECX TEXTEQU DB 08Dh 04Ch 021h 000h 090h ea edx edx 00 nop P5 EDX TEXTEQU DB 08Dh 054h 022h 000h 090h 2 le
37. execution time is spent. Such hot spots can be found through the use of profiling. A typical data stream should be fed to the program while doing the experiments.

Consider Expression Order in Compound Branch Conditions 20

Switch Statement Usage

Optimize Switch Statements

Switch statements are translated using a variety of algorithms. The most common of these are jump tables and comparison chains/trees. It is recommended to sort the cases of a switch statement according to the probability of occurrence, with the most probable first. This will improve performance when the switch is translated as a comparison chain. It is further recommended to make the case labels small, contiguous integers, as this will allow the switch to be translated as a jump table.

Example 1 (Avoid):
   int days_in_month, short_months, normal_months, long_months;

   switch (days_in_month) {
      case 28:
      case 29:
         short_months++;
         break;
      case 30:
         normal_months++;
         break;
      case 31:
         long_months++;
         break;
      default:
         printf("month has fewer than 28 or more than 31 days\n");
   }

Example 2 (Preferred):
   switch (days_in_month) {
      case 31:
         long_months++;
         break;
      case 30:
         normal_months++;
         break;
      case 28:
      case 29:
         short_months++;
         break;
      default:
         printf("month has fewer than 28 or more than 31 days\n");
   }
38. should be avoided. Use double-precision variables instead.

C Language Structure Component Considerations

Structures ("struct") in C language should be made the size of a multiple of the largest base type of any of their components. To meet this requirement, padding should be used where necessary. Language definitions permitting, to minimize padding, structure components should be sorted and allocated such that the components with a larger base type are allocated ahead of those with a smaller base type. For example, consider the following code:

Align TBYTE Variables on Quadword Aligned Addresses 55

Example:
   struct {
      char   a[5];
      long   k;
      double x;
   } baz;

The structure components should be allocated (lowest to highest address) as follows:

   x, k, a[4], a[3], a[2], a[1], a[0], padbyte6, ..., padbyte0

See "C Language Structure Component Considerations" on page 27 for more information from a C source code perspective.

Sort Variables According to Base Type Size

Sort local variables according to their base type size and allocate variables with larger base type size ahead of those with smaller base type size. Assuming the first variable allocated is naturally aligned, all other variables are naturally aligned without any padding. The following example is a declaration of local variables in a C function (a concrete illustration of the padding effect appears after the example):

Example:
   short  ga, gu, gi;
   long   ...
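To make the padding cost concrete, here is a small illustration of the sorting rule (a hypothetical example, not the manual's; exact sizes depend on the compiler and ABI, but the sorted layout is never larger):

   #include <stdio.h>

   /* Interleaving small and large members forces padding before each
      larger member so that it stays naturally aligned. */
   struct unsorted {
       char   c1;
       double d;
       char   c2;
       long   k;
   };

   /* The same members sorted by decreasing base type size need padding,
      if any, only at the end of the structure. */
   struct sorted {
       double d;
       long   k;
       char   c1;
       char   c2;
   };

   int main(void)
   {
       printf("unsorted: %zu bytes, sorted: %zu bytes\n",
              sizeof(struct unsorted), sizeof(struct sorted));
       return 0;
   }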
39. the pipelined floating-point adder allows one add every cycle. In the following code, the loop is partially unrolled by a factor of two, which creates potential end cases that must be handled outside the loop.

With Partial Loop Unrolling:
      MOV   ECX, MAX_LENGTH
      MOV   EAX, OFFSET A
      MOV   EBX, OFFSET B
      SHR   ECX, 1
$add_loop:
      FLD   QWORD PTR [EAX]
      FADD  QWORD PTR [EBX]
      FSTP  QWORD PTR [EAX]
      FLD   QWORD PTR [EAX+8]
      FADD  QWORD PTR [EBX+8]
      FSTP  QWORD PTR [EAX+8]
      ADD   EAX, 16
      ADD   EBX, 16
      DEC   ECX
      JNZ   $add_loop

Now the loop consists of 10 instructions. Based on the decode/retire bandwidth of three OPs per cycle, this loop goes

Unrolling Loops 69

no faster than three iterations in 10 cycles, or 6/10 floating-point adds per cycle, or 1.4 times as fast as the original loop.

Deriving Loop Control For Partially Unrolled Loops

A frequently used loop construct is a counting loop. In a typical case, the loop count starts at some lower bound (lo), increases by some fixed, positive increment (inc) for each iteration of the loop, and may not exceed some upper bound (hi). The following example shows how to partially unroll such a loop by an unrolling factor of (fac), and how to derive the loop control for the partially unrolled version of the loop.
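The manual's own worked example is not included in this extract. As an illustration of the derivation it describes (a sketch using the lo/hi/inc/fac terminology above, not the manual's code; names are made up), the unrolled loop runs only while a full group of fac iterations still fits under hi, and a rolled cleanup loop handles the remaining end cases:

   /* Partially unroll  for (k = lo; k <= hi; k += inc) { body(k); }
      by a factor of fac (here fac = 4). */
   void unrolled(long lo, long hi, long inc, void (*body)(long))
   {
       const long fac = 4;            /* unrolling factor (example value) */
       long k = lo;

       /* Main loop: the last index touched per group is k + (fac-1)*inc,
          so the group is executed only while that index stays <= hi. */
       for (; k + (fac - 1) * inc <= hi; k += fac * inc) {
           body(k);
           body(k + inc);
           body(k + 2 * inc);
           body(k + 3 * inc);
       }
       /* End cases: fewer than fac iterations remain. */
       for (; k <= hi; k += inc)
           body(k);
   }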
40. Write Combining

Definitions and Abbreviations

This appendix uses the following definitions and abbreviations:
   UC          Uncacheable memory type
   WC          Write-combining memory type
   WT          Writethrough memory type
   WP          Write-protected memory type
   WB          Writeback memory type
   One Byte    8 bits
   One Word    16 bits
   Longword    32 bits (same as an x86 doubleword)
   Quadword    64 bits, or 2 longwords
   Octaword    128 bits, or 2 quadwords
   Cache Block 64 bytes, or 4 octawords, or 8 quadwords

What is Write Combining?

Write combining is the merging of multiple memory write cycles that target locations within the address range of a write buffer. The AMD Athlon processor combines multiple memory-write cycles to a 64-byte buffer whenever the memory address is within a WC or WT memory type region. The processor continues to combine writes to this buffer, without writing the data to the system, as long as certain rules apply (see Table 9 on page 158 for more information).

Programming Details

The steps required for programming write combining on the AMD Athlon processor are as follows:
1. Verify the presence of an AMD Athlon processor by using the CPUID instruction to check for the instruction family code and vendor identification of the processor. Standard function 0 on AMD processors returns a vendor identification string of "AuthenticAMD" in registers EBX, EDX, and ECX. Standard function 1 returns the processor...
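As an illustration of step 1 only (a sketch, not the manual's code; it assumes a GCC/Clang-style compiler that provides <cpuid.h> and its __get_cpuid helper), the vendor string is assembled from the EBX, EDX, and ECX values returned by CPUID function 0:

   #include <cpuid.h>
   #include <string.h>

   /* Sketch: check for the "AuthenticAMD" vendor string via CPUID
      function 0. Family/model checks against function 1 would follow
      the same pattern. */
   int is_amd_processor(void)
   {
       unsigned int eax, ebx, ecx, edx;
       char vendor[13];

       if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
           return 0;                    /* CPUID not supported */

       memcpy(vendor + 0, &ebx, 4);     /* vendor string order: EBX, EDX, ECX */
       memcpy(vendor + 4, &edx, 4);
       memcpy(vendor + 8, &ecx, 4);
       vendor[12] = '\0';

       return strcmp(vendor, "AuthenticAMD") == 0;
   }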
41. 1 POMC i DoWork1 DoWork3 1 35 Tice break default break The trick here is that there is some up front work involved in generating all the combinations for the switch constant and the total amount of code has doubled However it is also clear that the inner loops are if free In ideal cases where the DoWork functions are inlined the successive functions will have greater overlap leading to greater parallelism than would be possible in the presence of intervening if statements The same idea can be applied to constant switch statements or combinations of switch statements and if statements inside of for loops The method for combining the input constants gets more complicated but will be worth it for the performance benefit However the number of inner loops can also substantially increase If the number of inner loops is prohibitively high then only the most common cases need to be dealt with directly and the remaining cases can fall back to the old code ina default clause for the switch statement This typically comes up when the programmer is considering runtime generated code While runtime generated code can lead to similar levels of performance improvement it is much harder to maintain and the developer must do their own optimizations for their code generation without the help of an available compiler Declare Local Functions as Stat
42. 1 0 3 0 lt 2 0 2 lt res gt y v x m 0 L1 v y m 1 L1 v z m 2 E1 3 1 lt res gt z v x m 0 2 3 2 lt 2 2 2 lt 1 2 lt res gt w v x m 0 3 v y m 11 3 2250121131 3 3 lt MOO 0 01 4 M02 8 define 12 M10 16 M11 20 define M12 24 M13 28 M20 32 M21 36 define M22 40 M23 44 M30 48 M31 52 define M32 56 M33 60 void XForm float res const float v const float m int numverts asm MOV EDX V EDX source vector ptr MOV EAX M EAX matrix ptr MOV EBX RES EBX destination vector ptr MOV ECX NUMVERTS ECX number of vertices to transform 3DNow version of fully general 3D vertex tranformation Optimal for AMD Athlon completes in 16 cycles FEMMS clear MMX state ALIGN 16 for optimal branch alignment 120 Optimized Matrix Multiplication AMDA 22007E 0 November 1999 xform gt 0 lt 00 lt OCO CO CJ CO PFMUL PUNPCKHDQ PFMUL MOVQ MOVQ MOVQ PUNPCKHDQ PFADD PFMUL PFMUL PFADD PFADD MOVQ PFADD MOVQ DEC JNZ FEMMS EBX 16 MMO QWO MM1 QWO EDX 16 MM2 MMO MM3 QWO
43.     ...                      ;1 load/store and 1 ... MacroOPs
        FSTP  [EAX]              ;1 store MacroOP
        MOVQ  [EAX], MM0         ;1 store MacroOP

As shown in Table 6, the load/store unit (LSU) consists of a three-stage data cache lookup.

Table 6. Load/Store Unit Stages
   Stage 1 (Cycle 8)         Stage 2 (Cycle 9)            Stage 3 (Cycle 10)
   Address Calculation /     Transport Address to         Data Cache Access /
   LS1 Scan                  Data Cache                   Data Forward

Loads and stores are first dispatched in order into a 12-entry deep reservation queue called LS1. LS1 holds loads and stores that are waiting to enter the cache subsystem. Loads and stores are allocated into LS1 entries at dispatch time in program order, and are required by LS1 to probe the data cache in program order. The AGUs can calculate addresses out of program order; therefore, LS1 acts as an address reorder buffer. When a load or store is scanned out of the LS1 queue (Stage 1), it is deallocated from the LS1 queue and inserted into the data cache probe pipeline (Stage 2 and Stage 3). Up to two memory operations can be scheduled (scanned out of LS1) to access the data cache per cycle. The LSU can handle the following:
m  Two 64-bit loads per cycle, or
m  One 64-bit load and one 64-bit store per cycle, or
m  Two 32-bit stores per cycle

Execution Unit Resources 151

Code Sample Analysis

The samples in Table 7 on page 153 and Table 8 on page 154 show...
44. 32 21h mm xxx xxx DirectPath AND reg8 mreg8 22h 11 DirectPath AND reg8 mem8 22h mm xxx xxx DirectPath AND reg16 32 mreg16 32 23h 11 DirectPath AND reg16 32 mem16 32 23h mm xxx xxx DirectPath AND AL imm8 24h DirectPath AND EAX imm16 32 25h DirectPath AND mreg8 imm8 80h 11 100 xxx DirectPath AND imm8 80h mm 100 xxx DirectPath AND mreg16 32 imm16 32 8ih 11 100 xxx DirectPath AND 16 32 imm16 32 81h mm 100 xxx DirectPath AND mreg16 32 imm8 sign extended 83h 11 100 xxx DirectPath AND mem16 32 imm8 sign extended 85h mm 100 xxx DirectPath ARPL mreg16 reg16 63h 11 Xxx xxx VectorPath ARPL 16 reg16 65h VectorPath BOUND 62h VectorPath BSF 16 32 mreg16 32 OFh BCh 11 xxx xxx VectorPath BSF reg16 32 mem16 32 OFh BCh mm xxx xxx VectorPath BSR reg16 32 mreg16 32 OFh BDh 11 xxx xxx VectorPath BSR reg16 32 mem16 32 OFh BDh mm xxx xxx VectorPath BSWAP EAX OFh C8h DirectPath BSWAP ECX OFh C9h DirectPath BSWAP EDX OFh DirectPath BSWAP EBX OFh CBh DirectPath BSWAP ESP OFh CCh DirectPath BSWAP EBP OFh CDh DirectPath BSWAP ESI OFh CEh DirectPath BSWAP EDI OFh CFh DirectPath BT mreg16 22 2 OFh Ash 11 DirectPath BT mem16 32 reg16 32 OFh Ash mm xxx xxx VectorPath BT mreg16 32 imm8 OFh BAh 11 100 xxx DirectPath 190 Instruction Dispatch and Execution Resource
45. 90h DirectPath XCHG EAX ECX 91h VectorPath XCHG EAX EDX 92h VectorPath XCHG EAX EBX 93h VectorPath XCHG EAX ESP 94h VectorPath XCHG EAX EBP 95h VectorPath XCHG EAX ESI 96h VectorPath XCHG EAX EDI 97h VectorPath XLAT D7h VectorPath XOR mreg8 reg8 30h 11 xxx xxx DirectPath XOR mem8 reg8 30h mm xxx xxx DirectPath XOR mreg16 32 reg16 32 31h 11 xxx xxx DirectPath XOR mem16 32 reg16 32 31h mm xxx xxx DirectPath XOR reg8 mreg8 32h 11 xxx xxx DirectPath reg8 mem8 32h mm xxx xxx DirectPath XOR reg16 32 mreg16 32 33h 11 xxx xxx DirectPath XOR reg16 32 mem16 32 33h mm xxx xxx DirectPath XOR AL imm8 34h DirectPath XOR EAX imm16 32 35h DirectPath XOR mreg8 imm8 80h 11 110 xxx DirectPath XOR mem8 imm8 80h mm 110 xxx DirectPath XOR mreg16 32 imm16 32 81h 11 110 xxx DirectPath XOR mem16 32 imm16 32 81h mm 110 xxx DirectPath XOR mreg16 32 imm8 sign extended 83h 11 110 xxx DirectPath XOR mem16 32 imm8 sign extended 83h mm 110 xxx DirectPath Instruction Dispatch and Execution Resources 207 AMD AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 Table 20 MMX Instructions Instruction Mnemonic ipe deis pe FPU Pipe s Notes EMMS OFh 77h DirectPath FADD FMUL FSTORE MOVD mmreg reg32 OFh 6Eh 11 xxx xxx VectorPath 1 MOVD mmreg mem
46. AMD Athlon processor combined with 3DNow technology brings a better multimedia experience to mainstream PC users while maintaining backwards compatibility with all existing x86 software Although the AMD Athlon processor can extract code parallelism on the fly from off the shelf commercially available x86 software specific code optimization for the AMD Athlon processor can result in even higher delivered performance This document describes the proprietary microarchitecture in the AMD Athlon processor and makes recommendations for optimizing execution of x86 software on the processor AMD Athlon Processor Microarchitecture Summary 5 AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 The coding techniques for achieving peak performance on the AMD Athlon processor include but are not limited to those for the AMD K6 AMD K6 2 Pentium Pentium Pro and Pentium II processors However many of these optimizations are not necessary for the AMD Athlon processor to achieve maximum performance Due to the more flexible pipeline control and aggressive out of order execution the AMD Athlon processor is not as sensitive to instruction selection and code scheduling This flexibility is one of the distinct advantages of the AMD Athlon processor The AMD Athlon processor uses the latest in processor microarchitecture design techniques to provide the highest x86 performance for today s PC In short the A
47. Use XOR Instruction to Clear Integer Registers

To clear an integer register to all 0s, use "XOR reg, reg". The AMD Athlon processor is able to avoid the false read dependency on the XOR instruction.

Example 1 (Acceptable):
      MOV   REG, 0

Example 2 (Preferred):
      XOR   REG, REG

Efficient 64-Bit Integer Arithmetic

This section contains a collection of code snippets and subroutines showing the efficient implementation of 64-bit arithmetic. Addition, subtraction, negation, and shifts are best handled by inline code. Multiplies, divides, and remainders are less common operations and should usually be implemented as subroutines. If these subroutines are used often, the programmer should consider inlining them. Except for division and remainder, the code presented works for both signed and unsigned integers. The division and remainder code shown works for unsigned integers, but can easily be extended to handle signed integers.

Example 1 (Addition):
      ;add operand in ECX:EBX to operand in EDX:EAX, result in EDX:EAX
      ADD   EAX, EBX
      ADC   EDX, ECX

Example 2 (Subtraction):
      ;subtract operand in ECX:EBX from operand in EDX:EAX, result in EDX:EAX
      SUB   EAX, EBX
      SBB   EDX, ECX

Example 3 (Negation):
      ;negate operand in EDX:EAX
      NOT   EDX
      NEG   EAX
      SBB   EDX, -1    ;fixup: increment hi-word if low word was 0

Use XOR Instruction to Clear Integer Registers 86
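For readers following along in C, the same add, subtract, and negate operations can be sketched from 32-bit halves as shown below (illustrative only; the type and helper names are made up, and a compiler given a native 64-bit type on x86 will normally emit the ADD/ADC, SUB/SBB, and NOT/NEG/SBB sequences above on its own):

   #include <stdint.h>

   typedef struct { uint32_t lo, hi; } u64parts;   /* EDX:EAX style pair */

   /* Add: carry out of the low halves propagates into the high halves
      (the job ADC does in the assembly version). */
   static u64parts add64(u64parts a, u64parts b)
   {
       u64parts r;
       r.lo = a.lo + b.lo;
       r.hi = a.hi + b.hi + (r.lo < a.lo);   /* carry from the low word */
       return r;
   }

   /* Subtract: borrow from the low halves (the job SBB does). */
   static u64parts sub64(u64parts a, u64parts b)
   {
       u64parts r;
       r.lo = a.lo - b.lo;
       r.hi = a.hi - b.hi - (a.lo < b.lo);   /* borrow from the low word */
       return r;
   }

   /* Negate: complement the high word, negate the low word, and add one
      to the high word only when the low word was zero. */
   static u64parts neg64(u64parts a)
   {
       u64parts r;
       r.lo = (uint32_t)(0u - a.lo);
       r.hi = ~a.hi + (a.lo == 0);
       return r;
   }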
48. ...Buffer Data to the System

Once write combining is closed for a 64-byte write buffer, the contents of the write buffer are eligible to be sent to the system as one or more AMD Athlon system bus commands. Table 10 lists the rules for determining what system commands are issued for a write buffer, as a function of the alignment of the valid buffer data.

Table 10. AMD Athlon System Bus Command Generation Rules
1. If all eight quadwords are either full (8 bytes valid) or empty (0 bytes valid), a Write-Quadword system command is issued, with an 8-byte mask representing which of the eight quadwords are valid. If this case is true, do not proceed to the next rule.
2. If all longwords are either full (4 bytes valid) or empty (0 bytes valid), a Write-Longword system command is issued for each 32-byte buffer half that contains at least one valid longword. The mask for each Write-Longword system command indicates which longwords are valid in that 32-byte write buffer half. If this case is true, do not proceed to the next rule.
3. Sequence through all eight quadwords of the write buffer, from quadword 0 to quadword 7. Skip over a quadword if no bytes are valid. Issue a Write-Quad system command if all bytes are valid, asserting one mask bit. Issue a Write-Longword system command if the quadword contains one aligned longword, asserting one mask bit. Otherwise, issue a Write-Byte system command if there is at least one valid byte, asserting a ma...
49. Code Optimization 11 General x86 Optimization Guidelines Short Forms This chapter describes general code optimization techniques specific to superscalar processors that is techniques common to the AMD K6 processor AMD Athlon processor and Pentium family processors In general all optimization techniques used for the AMD K6 processor Pentium and Pentium Pro processors either improve the performance of the AMD Athlon processor or are not required and have a neutral effect usually due to fewer coding restrictions with the AMD Athlon processor Use shorter forms of instructions to increase the effective number of instructions that can be examined for decoding at any one time Use 8 bit displacements and jump offsets where possible Example 1 Avoid CMP REG 0 Example 2 Preferred TEST REG REG Although both of these instructions have an execute latency of one fewer opcode bytes need to be examined by the decoders for the TEST instruction Short Forms 127 AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 Dependencies Spread out true dependencies to increase the opportunities for parallel execution Anti dependencies and output dependencies do not impact performance Register Operands Maintain frequently used values in registers rather than in memory This technique avoids the comparatively long latencies for accessing memory Stack Allocati
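The advice to spread out true dependencies can also be illustrated at the C level. The sketch below is an addition to this edit (the function names are invented): it sums an array once through a single serial accumulator and once through four independent accumulators, giving the out-of-order core independent chains whose additions can overlap. Note that reassociating floating-point additions can change the result in the last bits.

    #include <stddef.h>

    /* one long dependency chain: every add waits on the previous one */
    double sum_serial(const double *a, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* four independent chains: adds from different chains can overlap */
    double sum_split(const double *a, size_t n)
    {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        size_t i = 0;
        for (; i + 3 < n; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        for (; i < n; i++)
            s0 += a[i];
        return (s0 + s1) + (s2 + s3);
    }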
50. Cycle 7 SCHED Cycle 8 EXEC Cycle 9 ADDGEN Cycle 10 DCACC Cycle 11 RESP AMD Athlon Processor x86 Code Optimization In the scheduler SCHED pipeline stage the scheduler buffers can contain MacroOPs that are waiting for integer operands from the ICU or the IEU result bus When all operands are received SCHED schedules the MacroOP for execution and issues the OPs to the next stage EXEC In the execution EXEC pipeline stage the OP and its associated operands are processed by an integer pipe either the IEU or the AGU If addresses must be calculated to access data necessary to complete the operation the OP proceeds to the next stages ADDGEN and DCACC In the address generation ADDGEN pipeline stage the load or store OP calculates a linear address which is sent to the data cache TLBs and caches In the data cache access DCACC pipeline stage the address generated in the previous pipeline stage is used to access the data cache arrays and TLBs Any OP waiting in the scheduler for this data snarfs this data and proceeds to the EXEC stage assuming all other operands were available In the response RESP pipeline stage the data cache returns hit miss status and data for the request from DCACC Integer Pipeline Stages 145 AMD AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 Floating Point Pipeline Stages The floating point unit FPU is implemented as a coprocessor t
51. DirectPath SCASB AL mem8 AEh VectorPath SCASW AX mem16 AFh VectorPath SCASD EAX mem32 AFh VectorPath SETO mreg8 OFh 90h 11 xxx xxx DirectPath SETO mem8 OFh 90h mm xxx xxx DirectPath SETNO mreg8 OFh 91h 11 xxx xxx DirectPath SETNO mem8 OFh 91h mm xxx xxx DirectPath SETB SETC SETNAE mreg8 OFh 92h 11 xxx xxx DirectPath SETB SETC SETNAE mem8 OFh 92h mm xxx xxx DirectPath SETAE SETNB SETNC mreg8 OFh 93h 11 xxx xxx DirectPath SETAE SETNB SETNC mem8 OFh 93h mm xxx xxx DirectPath SETE SETZ mreg8 OFh 94h 11 xxx xxx DirectPath SETE SETZ mem8 OFh 94h mm xxx xxx DirectPath SETNE SETNZ mreg8 OFh 95h 11 xxx xxx DirectPath SETNE SETNZ mem8 OFh 95h mm xxx xxx DirectPath SETBE SETNA mreg8 OFh 96h 11 xxx xxx DirectPath SETBE SETNA mem8 OFh 96h mm xxx xxx DirectPath SETA SETNBE mreg8 OFh 97h 11 xxx xxx DirectPath SETA SETNBE mem8 OFh 97h mm xxx xxx DirectPath Instruction Dispatch and Execution Resources 203 AMD AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 Table 19 Integer Instructions Continued Instruction Mnemonic yee M pon SETS mreg8 OFh 98h 11 xxx xxx DirectPath SETS mem8 OFh 98h mm xxx xxx DirectPath SETNS mreg8 OFh 99h 11 xxx xxx DirectPath SETNS mem8 OFh 99h mm xxx xxx DirectPath SETP SETPE
52. FFREEP ST i FST mem64real mem 16int FST ST i mem32int FSTP mem32real mem64int FSTP mem64real FIMUL mem32int FSTP mem80real FIMUL mem 16int FSTP ST i FINCSTP FSUB mem32real FIST mem 16int FSUB mem64real FSUB ST ST i DirectPath Instructions 229 AMDA AMD Athlon Processor x86 Code Optimization Table 28 DirectPath Floating Point Instructions Instruction Mnemonic FSUB ST i ST FSUBP ST ST i FSUBR mem32real FSUBR mem64real FSUBR ST ST i FSUBR ST i ST FSUBRP ST i ST FTST FUCOM FUCOMP FUCOMPP FWAIT FXCH 22007E 0 November 1999 230 DirectPath Instructions 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization VectorPath Instructions The following tables contain VectorPath instructions which should be avoided in the AMD Athlon processor Table 29 VectorPath Integer Instructions on page 231 Table 30 VectorPath MMX Instructions on page 234 and Table 31 VectorPath MMX Extensions on page 234 Table 32 VectorPath Floating Point Instructions on page 235 Table 29 VectorPath Integer Instructions AAA Instruction Mnemonic Instruction Mnemonic BTS mem16 32 imm8 AAD CALL full pointer AAM CALL near imm
53. LGS reg16 32 mem32 48 IMUL reg16 32 mreg16 32 imm16 32 LIDT mem48 IMUL reg16 32 mem16 32 imm16 32 LLDT mreg16 IMUL reg16 32 imm sign extended LLDT mem16 IMUL reg16 32 mreg16 32 imm8 signed LMSW 16 IMUL reg16 32 mem16 32 imm8 signed LMSW mem16 IMUL AX AL mreg8 10058 AL mem8 IMUL AX AL mem8 LODSW AX mem16 IMUL EDX EAX EAX mreg16 32 LODSD EAX mem32 IMUL EDX EAX EAX mem16 32 LOOP disp8 IMUL reg16 32 mreg16 32 LOOPE LOOPZ disp8 IMUL reg16 32 mem16 32 LOOPNE LOOPNZ disp8 IN AL imm8 LSL reg16 32 mreg16 32 IN AX imm8 LSL reg16 52 mem16 32 IN EAX imm8 LSS reg16 32 mem32 48 IN AL DX LTR mregi6 IN AX DX LTR mem16 IN EAX DX MOV mreg16 segment reg INVD MOV 6 segment reg INVLPG MOV segment mreg16 JCXZ JEC short disp8 MOV segment reg mem16 JMP far disp32 48 direct MOVSB mem8 mems JMP far mem32 indirect MOVSD mem16 mem16 JMP far mreg32 indirect MOVSW mem32 mem32 LAHF MUL AL mreg8 LAR reg16 32 mreg16 32 MUL AL mem8 LAR reg16 32 mem16 32 MUL AX mreg16 LDS reg16 32 mem32 48 MUL AX mem16 MUL EAX mreg32 232 VectorPath Instructions 22007E 0 Nov
54. [3DNow! vertex-transform listing (the XFORM loop): MMX/3DNow! code that multiplies each vertex component (v->x, v->y, v->z, v->w) by the rows of a 4x4 matrix m[.][.] addressed through EAX ([EAX].M10, [EAX].M22, and so on), accumulates the dot products in MM0-MM7, and writes the transformed vertex out through EDX before advancing EBX/ECX and looping back to XFORM. The instruction sequence itself is too garbled in this copy to reproduce; only the register usage and the matrix/vertex references noted here are recoverable.]
55. MMX and 3DNow instruction tables have an additional column that lists the possible FPU execution pipelines available for use by any particular DirectPath decoded operation Typically VectorPath instructions require more than one execution pipe resource Table 19 Integer Instructions Maemonic First Second ModR M Decode Byte Byte Byte Type AAA 37h VectorPath AAD D5h OAh VectorPath AAM D4h OAh VectorPath AAS 3Fh VectorPath 188 Instruction Dispatch and Execution Resources AMDA 22007E 0 November 1999 Table 19 Integer Instructions Continued AMD Athlon Processor x86 Code Optimization Mnemodic First Second ModR M Decode Byte Byte Byte Type ADC reg8 10h 11 xxx xxx DirectPath ADC mem8 reg8 10h mm xxx xxx DirectPath ADC mreg16 32 reg16 32 11h 11 xxx xxx DirectPath ADC mem16 32 reg16 32 11h mm xxx xxx DirectPath ADC reg8 mreg8 12h 11 DirectPath ADC reg8 mem8 12h mm xxx xxx DirectPath ADC reg16 32 mreg16 32 13h 11 xxx xxx DirectPath ADC 16 32 mem16 32 13h mm xxx xxx DirectPath ADC AL imm8 14h DirectPath ADC EAX imm 16 32 15h DirectPath ADC mreg8 imm8 80h 11 010 xxx DirectPath ADC mem8 imm8 80h mm 010 xxx DirectPath ADC mreg16 32 imm16 32 8lh 11 010 xxx DirectPath ADC mem16 32 imm16
56. MMX Instruction Set Manual order 22466 for more usage information Blended Code Otherwise for blended code which needs to run well on AMD K6 and AMD Athlon family processors the following code is recommended Example 1 Preferred faster MM1 SWAP MMO MMO destroyed MOVQ MM1 MMO make a copy PUNPCKLDQ MMO MMO duplicate lower half PUNPCKHDQ MM1 MMO combine lower halves Example 2 Preferred fast MM1 SWAP MMO MMO preserved MOVQ MM1 MMO make a copy PUNPCKHDQ MM1 MM1 duplicate upper half PUNPCKLDQ MM1 MMO combine upper halves Both examples accomplish the swapping but the first example should be used if the original contents of the register do not need to be preserved The first example is faster due to the fact that the MOVQ and PUNPCKLDQ instructions can execute in parallel The instructions in the second example are dependent on one another and take longer to execute 112 3DNow and MMX Intra Operand Swapping AMDA 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization Fast Conversion of Signed Words to Floating Point In many applications there is a need to quickly convert data consisting of packed 16 bit signed integers into floating point numbers The following two examples show how this can be accomplished efficiently on AMD processors The first example shows how to do the conversion on a processor that supports AMD s 3DNow extensions such as the AMD Athlon
57. MOVQ mem64 mmreg PCMPGTD mmregl mmreg2 PACKSSDW mmreg1 mmreg2 PCMPGTD mmreg mem64 PACKSSDW mmreg mem64 PCMPGTW mmreg1 mmreg2 PACKSSWB mmreg1 mmreg2 PCMPGTW mmreg mem64 PACKSSWB mmreg mem64 PMADDWD mmregl mmreg2 PACKUSWB mmreg1 mmreg2 PMADDWD mmreg mem64 PACKUSWB mmreg mem64 PMULHW mmregl mmreg2 PADDB mmreg1 mmreg2 PMULHW mmreg mem64 PADDB mmreg mem64 PMULLW mmreg1 mmreg2 PADDD mmreg2 PMULLW mmreg mem64 PADDD mmreg mem64 POR mmregl mmreg2 PADDSB mmreg1 mmreg2 POR mmreg mem64 PADDSB mmreg mem64 PSLLD mmreg1 mmreg2 PADDSW mmreg1 mmreg2 PSLLD mmreg mem64 PADDSW mmreg mem64 PSLLD mmreg imm8 PADDUSB mmreg1 mmreg2 PSLLQ mmreg1 mmreg2 PADDUSB mmreg mem64 PSLLQ mmreg mem64 PADDUSW mmreg1 mmreg2 PSLLQ mmreg imm8 PADDUSW mmreg mem64 PSLLW mmreg1 mmreg2 PADDW mmreg1 mmreg2 PSLLW mmreg mem64 PADDW mmreg mem64 PSLLW mmreg imm8 PAND mmreg1 mmreg2 PSRAW mmreg1 mmreg2 PAND mmreg mem64 PSRAW mmreg mem64 PANDN mmreg1 mmreg2 PSRAW mmreg imm8 PANDN mmreg mem64 PSRAD mmreg1 mmreg2 PCMPEQB mmreg1 mmreg2 PSRAD mmreg mem64 PCMPEQB mmreg mem64 PSRAD mmreg imm8 PCMPEQD mmreg1 mmreg2 PSRLD mmreg1 mmreg2
58. PV AX 1 SV lt lt AX 055555555h v gt gt 1 amp 0x55555555 DX EAX v v gt gt 1 8 0x55555555 w TO O 1 AX EDX DX 2 AX 033333333h DX 033333333h zc K 000720 lt lt lt 2 amp 0 33333333 w gt gt 2 amp 0x33333333 lt lt lt gt Coco 5 92 Efficient Implementation of Population Count Function AMDA 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization ADD EAX EDX x w amp 0x33333333 w gt gt 2 8 0x33333333 MOV EDX EDX PX SHR EAX 4 iX lt lt 4 ADD EAX EDX sx x lt lt 4 AND FAX OOFOFOFOFh y x x gt gt 4 amp OxOFOFOFOF IMUL 001010101h y 0x01010101 SHR EAX 24 population count y 0x01010101 gt gt 24 MOV retVal EAX store result return retVal Derivation of Multiplier Used for Integer Division by Constants Unsigned Derivation for Algorithm Multiplier and Shift Factor The utility udiv exe was compiled using the code shown in this section The following code derives the multiplier value used when performing integer division by constants The code works for unsigned integer division and for odd divisors between 1 and 231 1 inclusive For divisors of the form d d 2 the multiplier is the same as for d and the shift factor is s n Code snippet to determine algorithm a multiplier m and sh
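As a companion to the derivation described above, here is a small, self-contained C sketch that computes a correct (though not necessarily minimal) multiplier and shift for unsigned division by a constant, and spot-checks the identity. This is not the manual's udiv.exe code; it is a simpler ceiling-based derivation, and it assumes a compiler with the GCC/Clang __int128 extension for the 65-bit intermediate product.

    #include <stdio.h>
    #include <stdint.h>

    /* Derive m and s so that, for all 32-bit unsigned n,
       n / d == (uint32_t)(((unsigned __int128)n * m) >> (32 + s)).   */
    void derive_magic(uint32_t d, uint64_t *m, unsigned *s)
    {
        unsigned l = 0;                      /* l = ceil(log2(d)) */
        while ((1ull << l) < d)
            l++;
        /* m = ceil(2^(32+l) / d); may need up to 33 bits */
        *m = (uint64_t)((((unsigned __int128)1 << (32 + l)) + d - 1) / d);
        *s = l;
    }

    int main(void)
    {
        uint32_t d = 10;                     /* example divisor */
        uint64_t m; unsigned s;
        derive_magic(d, &m, &s);
        printf("d=%u  m=0x%llx  shift=%u\n", d, (unsigned long long)m, 32 + s);

        /* spot-check the identity on a few values */
        for (uint32_t n = 0; n < 1000000; n += 12345) {
            uint32_t q = (uint32_t)(((unsigned __int128)n * m) >> (32 + s));
            if (q != n / d) { printf("mismatch at %u\n", n); return 1; }
        }
        printf("all ok\n");
        return 0;
    }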
60. Stream of Packed Unsigned Bytes ample 2 oe AX E EDX EBX ECX MMO 1 MMO 1 EDI L1 DWORD PTR Src MB 16 EAX EAX 8 EDI PL EDI 8 EDX I MMO I 8 1 EBX DWORD PTR Dst_MB DWORD PTR SrcStride DWORD PTR DstStride MMO QWORD1 MM1 QWORD2 QWORD1 QWORD3 2 with adjustment CQWORD2 QWORD4 2 with adjustment The following code is an example of how to process a stream of packed unsigned bytes like RGBA information with faster 3DNow instructions Example outside 1 PXOR inside lo MOVD PUNPCKLBW MOVQ PUNPCKLWD PUNPCKHWD PI2FD PI2FD 00p MMO MMO 1 VAR 1 MMO MM2 MM1 MM1 MMO MM2 MMO MM1 MM1 MM2 MM2 0 v 3 v 2 v 1 v 0 0 VE3J 0 vE2T O0 vEL1 0 vEOJ 0 4131 0 21 0 v 11 0 v 0 0 0 0 v 1 0 0 0 v 0 0 0 0 v 3 0 0 0 v 2 float v 1 float v 0 float v 3 float v 2 Stream of Packed Unsigned Bytes 125 AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 Complex Number Arithmetic Complex numbers have a real part and an imaginary part Multiplying complex numbers ex 3 4i is an integral part of many algorithms such as Discrete Fourier Transform DFT and complex FIR filters Complex number multiplication is shown below srcO real srcO imag srcl rea
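Since the complex product is referenced just above, here it is stated in plain C for reference (the struct and function names are ours, added in this edit): (a + bi)(c + di) = (ac - bd) + (ad + bc)i.

    /* Plain C statement of the complex product used by DFT and complex
       FIR filter kernels; illustrative only. */
    typedef struct { float re, im; } cplx;

    cplx cmul(cplx x, cplx y)
    {
        cplx r;
        r.re = x.re * y.re - x.im * y.im;
        r.im = x.re * y.im + x.im * y.re;
        return r;
    }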
61. VectorPath IN AL DX ECh VectorPath IN AX DX EDh VectorPath IN EAX DX EDh VectorPath INC EAX 40h DirectPath INC ECX 41h DirectPath INC EDX 42h DirectPath INC EBX 43h DirectPath INC ESP 44h DirectPath INC EBP 45h DirectPath INC ESI 46h DirectPath INC EDI 47h DirectPath 194 Instruction Dispatch and Execution Resources AMDA 22007E 0 November 1999 Table 19 Integer Instructions Continued AMD Athlon Processor x86 Code Optimization Mnemonic First Second ModR M Decode Byte Byte Byte Type INC mreg8 FEh 11 000 xxx DirectPath INC mem8 FEh mm 000 xxx DirectPath INC mreg16 32 FFh 11 000 xxx DirectPath INC mem16 32 FFh mm 000 xxx DirectPath INVD OFh 08h VectorPath INVLPG OFh oih 111 VectorPath JO short disp8 70h DirectPath JNO short disp8 71h DirectPath JB JNAE JC short disp8 72h DirectPath JNB JAE JNC short disp8 73h DirectPath JZ JE short disp8 74h DirectPath JNZ JNE short disp8 75h DirectPath JBE JNA short disp8 76h DirectPath JNBE JA short disp8 77h DirectPath JS short disp8 78h DirectPath JNS short disp8 79h DirectPath JP JPE short disp8 7Ah DirectPath JNP JPO short disp8 7Bh DirectPath JL JNGE short disp8 7Ch DirectPath JNL JGE short disp8 7Dh DirectPath JLE JNG short disp8 7Eh DirectPath JNLE JG short disp8 7Fh DirectPath JCXZ JEC short dis
62. X load double to be converted FST DWORD PTR TX store X because sign X is needed FIST DWORD PTR 1 store rndint x as default result FISUB DWORD PTR I compute DIFF X rndint X FSTP DWORD PTR DIFF store DIFF as we need sign DIFF MOV EAX TX PX MOV EDX DIFF DIFF TEST EDX EDX DIFF 0 JZ DONE default result is OK done XOR EDX EAX need correction if sign X sign DIFF SAR EAX 31 40 OxFFFFFFFF 0 SAR EDX 31 sign X sign DIFF 0xFFFFFFFF 0 LEA EAX LEAX EAX 1 40 OxFFFFFFFF 1 AND EDX EAX correction 1 0 1 SUB I EDX trunc X e8rndint X correction DONE The second substitution simulates a truncating floating point to integer conversion using only integer instructions and therefore works correctly independent of the FPUs current rounding mode It does not handle NaNs infinities and overflows according to the IEEE 754 standard Note that the first instruction of this code may cause an STLF size mismatch resulting in performance degradation if the variable to be converted has been stored recently Minimize Floating Point to Integer Conversions 101 AMDA AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 Example 3 Potentially faster MOV ECX DWORD PTR X 4 get upper 32 bits of double XOR EDX EDX zo MOV EAX ECX save sign bit AND ECX 07FF00000h isolate exponent field CMP ECX 03 00000 if abs x lt
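The correction logic in the second substitution above is compact but terse, so here is a C restatement of the same idea, added in this edit as an illustration only (lrint() follows the current rounding mode, round-to-nearest by default, which mirrors what FIST does under the default control word). Like the assembly, it does not handle NaNs, infinities, or overflow.

    #include <math.h>

    /* Start from the round-to-nearest result, then step back toward zero
       whenever the rounding moved away from zero.                        */
    long trunc_from_nearest(double x)
    {
        long   r    = lrint(x);          /* rounded-to-nearest result  */
        double diff = x - (double)r;     /* what the rounding absorbed */
        if (diff != 0.0 && ((x < 0.0) != (diff < 0.0)))
            r += (x < 0.0) ? 1 : -1;     /* correction of +/-1         */
        return r;
    }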
63. below Note that the FFREEP instruction although insufficiently documented in the past is supported by all 32 bit x86 processors opcode bytes for FFREEP ST i are listed in Table 22 on page 212 FFREEP ST 0 removes one register from stack FFREEP ST i works like FFREE ST i except that it increments the FPU top of stack after doing the FFREE work In other words FFREEP ST i marks ST 1 as empty then increments the x87 stack pointer On the AMD Athlon processor the FFREEP instruction converts to an internal NOP which can go down any pipe with no dependencies Many assemblers do not support the FFREEP instruction In these cases a simple text macro can be created to facilitate use of the FFREEP ST 0 STO TEXTEQU DB ODFh OCOh gt Floating Point Compare Instructions For branches that are dependent on floating point comparisons use the following instructions FCOMI FCOMIP FUCOMI FUCOMIP 98 Use FFREEP Macro to Pop One Register from the FPU AMDA 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization These instructions are much faster than the classical approach using FSTSW because FSTSW is essentially a serializing instruction on the AMD Athlon processor When FSTSW cannot be avoided for example backward compatibility of code with older processors no FPU instruction should occur between an FCOM P FICOM P FUCOM P or FTST and a dependent FSTSW This optimiz
64. bit codes 162 Performance Counter Usage AMDA 22007E 0 November 1999 Unit Mask Field Bits 8 15 USR User Mode Flag Bit 16 OS Operating System Mode Flag Bit 17 E Edge Detect Flag Bit 18 PC Pin Control Flag Bit 19 INT APIC Interrupt Enable Flag Bit 20 EN Enable Counter Flag Bit 22 INV Invert Flag Bit 23 Counter Mask Field Bits 31 24 AMD Athlon Processor x86 Code Optimization These bits are used to further qualify the event selected in the event select field For example for some cache events the mask is used as a MESI protocol qualifier of cache states See Table 11 on page 164 for a list of unit masks and their 8 bit codes Events are counted only when the processor is operating at privilege levels 1 2 or 3 This flag can be used in conjunction with the OS flag Events are counted only when the processor is operating at privilege level 0 This flag can be used in conjunction with the USR flag When this flag is set edge detection of events is enabled The processor counts the number of negated to asserted transitions of any condition that can be expressed by the other fields The mechanism is limited in that it does not permit back to back assertions to be distinguished This mechanism allows software to measure not only the fraction of time spent in a particular state but also the average length of time spent in such a state for example the time spent wai
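To make the field layout above concrete, here is a small C sketch that packs an event-select value from those fields. It is an illustration added in this edit, not code from the manual: the helper name and parameter list are invented, and the event-select field is assumed to occupy bits 7-0 (the text refers to its 8-bit event codes); the remaining bit positions follow the list above.

    #include <stdint.h>

    uint64_t perf_evt_sel(uint8_t event, uint8_t unit_mask,
                          int usr, int os, int edge, int pin,
                          int irq, int inv, uint8_t cnt_mask)
    {
        uint64_t v = 0;
        v |= (uint64_t)event;                   /* event select  bits  7-0  */
        v |= (uint64_t)unit_mask << 8;          /* unit mask     bits 15-8  */
        v |= (uint64_t)(usr  != 0) << 16;       /* USR flag      bit  16    */
        v |= (uint64_t)(os   != 0) << 17;       /* OS flag       bit  17    */
        v |= (uint64_t)(edge != 0) << 18;       /* edge detect   bit  18    */
        v |= (uint64_t)(pin  != 0) << 19;       /* pin control   bit  19    */
        v |= (uint64_t)(irq  != 0) << 20;       /* APIC int. en. bit  20    */
        v |= 1ull << 22;                        /* EN (enable)   bit  22    */
        v |= (uint64_t)(inv  != 0) << 23;       /* invert        bit  23    */
        v |= (uint64_t)cnt_mask << 24;          /* counter mask  bits 31-24 */
        return v;
    }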
65. chapter describes how to code instructions for efficient scheduling Guidelines are listed in order of importance Schedule Instructions According to their Latency The AMD Athlon processor can execute up to three x86 instructions per cycle with each x86 instruction possibly having a different latency The AMD Athlon processor has flexible scheduling but for absolute maximum performance schedule instructions especially FPU and 3DNow instructions according to their latency Dependent instructions will then not have to wait on instructions with longer latencies See Appendix F Instruction Dispatch and Execution Resources on page 187 for a list of latency numbers Unrolling Loops Complete Loop Unrolling Make use of the large AMD Athlon processor 64 Kbyte instruction cache and unroll loops to get more parallelism and reduce loop overhead even with branch prediction Complete Schedule Instructions According to their Latency 67 AMD AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 unrolling reduces register pressure by removing the loop counter To completely unroll a loop remove the loop control and replicate the loop body N times In addition completely unrolling a loop increases scheduling opportunities Only unrolling very large code loops can result in the inefficient use of the L1 instruction cache Loops can be unrolled completely if all of the following conditions are tru
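For readers working in C, the idea of partial unrolling looks like the sketch below, which is an illustration added in this edit (the routine and the unroll factor of four are arbitrary). The second loop handles trip counts that are not a multiple of the unroll factor.

    /* Scale-and-accumulate loop unrolled by four, with a cleanup loop. */
    void axpy_unrolled(float *a, const float *b, float c, int n)
    {
        int i = 0;
        for (; i + 3 < n; i += 4) {       /* unrolled body: 4 iterations */
            a[i]     += c * b[i];
            a[i + 1] += c * b[i + 1];
            a[i + 2] += c * b[i + 2];
            a[i + 3] += c * b[i + 3];
        }
        for (; i < n; i++)                /* remainder iterations        */
            a[i] += c * b[i];
    }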
66. design for fast single cycle execution and fast operating frequencies The term design implementation refers to the actual logic and circuit designs from which the processor is created according to the microarchitecture specifications Introduction 129 AMD AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 AMD Athlon Processor Microarchitecture The innovative AMD Athlon processor microarchitecture approach implements the x86 instruction set by processing simpler operations OPs instead of complex x86 instructions These OPs are specially designed to include direct support for the x86 instructions while observing the high performance principles of fixed length encoding regularized instruction fields and a large register set Instead of executing complex x86 instructions which have lengths from 1 to 15 bytes the AMD Athlon processor executes the simpler fixed length OPs while maintaining the instruction coding efficiencies found in x86 programs The enhanced microarchitecture used in the AMD Athlon processor enables higher processor core performance and promotes straightforward extendibility for future designs Superscalar Processor The AMD Athlon processor is an aggressive out of order three way superscalar x86 processor It can fetch decode and issue up to three x86 instructions per cycle with a centralized instruction control unit ICU and two independent instruction schedulers an in
67. foo bar double Xs Jaata ds char IDs float baz Allocate in the following order from left to right from higher to lower addresses X y 2121 2 1 2 01 foo bar baz ga gu gi a b See Sort Local Variables According to Base Type Size on page 28 for more information from a C source code perspective Sort Variables According to Base Type Size AMD 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization Branch Optimizations While the AMD Athlon processor contains a very sophisticated branch unit certain optimizations increase the effectiveness of the branch prediction unit This chapter discusses rules that improve branch prediction and minimize branch penalties Guidelines are listed in order of importance Avoid Branches Dependent on Random Data Avoid conditional branches depending on random data as these are difficult to predict For example a piece of code receives a random stream of characters A through Z and branches if the character is before M in the collating sequence Data dependent branches acting upon basically random data causes the branch prediction logic to mispredict the branch about 50 of the time If possible design branch free alternative code sequences which results in shorter average execution time This technique is especially important if the branch body is small Examples 1 and 2 illustrate this concept using the CMOV instruction No
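A C-level illustration of the same branch-free idea follows; it is an addition of this edit (the function names are invented), using the very example described above: counting characters that collate before 'M' in a random stream. Many compilers turn the second form into a CMOV or SBB/AND sequence similar to the assembly examples in this chapter.

    /* branchy form: the conditional branch mispredicts roughly half the
       time on random text */
    unsigned count_before_m_branchy(const char *s, unsigned n)
    {
        unsigned k = 0;
        for (unsigned i = 0; i < n; i++)
            if (s[i] < 'M')
                k++;
        return k;
    }

    /* branch-free form: the comparison result is used directly as 0 or 1 */
    unsigned count_before_m_branchfree(const char *s, unsigned n)
    {
        unsigned k = 0;
        for (unsigned i = 0; i < n; i++)
            k += (s[i] < 'M');
        return k;
    }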
68. imm8 OFh 70h DirectPath FADD FMUL PREFETCHNTA mem8 OFh 18h DirectPath 1 PREFETCHTO mem8 OFh 18h DirectPath 1 PREFETCHT1 mem8 OFh 18h DirectPath 1 PREFETCHT2 mem8 OFh 18h DirectPath 1 SFENCE OFh AEh VectorPath Notes 1 For the PREFETCHNTA TO T 1 72 instructions the mem value refers to an address in the 64 byte line that will be prefetched Table 22 Floating Point Instructions Instruction Mnemonic First Second Decode f 0 Note Byte Byte Byte Type Pipe s F2XM1 VectorPath FABS D9h Eth DirectPath FMUL FADD ST 510 D8h 11 000 xxx DirectPath FADD 1 FADD mem32real D8h mm 000 xxx DirectPath FADD FADD ST i ST DCh 11 000 xxx DirectPath FADD 1 FADD meme4real DCh mm 000 xxx DirectPath FADD FADDP ST i ST DEh 11 000 xxx DirectPath FADD 1 FBLD mem80 DFh mm 100 xxx VectorPath FBSTP mem80 DFh mm 110 xxx VectorPath FCHS EOh DirectPath FMUL FCLEX DBh E2h VectorPath Notes 1 The last three bits of the modR M byte select the stack entry ST i 212 Instruction Dispatch and Execution Resources AMDA 22007E 0 November 1999 Table 22 Floating Point Instructions Continued AMD Athlon Processor x86 Code Optimization Notes 1 The last three bits of the modR M byte select the stack entry
69. instances in the AMD Athlon processor load store architecture when either a load operation is not allowed to read needed data from a store in the store buffer or a load OP detects a false data dependency on a store in the store buffer In either case the load cannot complete load the needed data into a register until the store has retired out of the store buffer and written to the data cache A store buffer entry cannot retire and write to the data cache until every instruction before the store has completed and retired from the reorder buffer The implication of this restriction is that all instructions in the reorder buffer up to and including the store must complete and retire out of the reorder buffer before the load can complete Effectively the load has a false dependency on every instruction up to the store The following sections describe store to load forwarding examples that are acceptable and those that should be avoided Store to Load Forwarding Pitfalls True Dependencies A load 1s allowed to read data from the store buffer entry only if all of the following conditions are satisfied m The start address of the load matches the start address of the store m The load operand size is equal to or smaller than the store operand size Neither the load or store is misaligned m The store data is not from a high byte register AH BH CH or DH The following sections describe common case scenarios to avoid where
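One common way the forwarding conditions above are violated is composing a value with several narrow stores and then immediately reloading it with a wider load. The sketch below is an illustration added in this edit (the pixel_t type and function names are invented; the union type punning is a common C idiom and is only used here to force the memory round trip): the first routine makes the wide load wait on the narrow stores, while the second builds the value in a register and never needs forwarding at all.

    #include <stdint.h>

    typedef union { uint8_t b[4]; uint32_t dw; } pixel_t;

    /* four byte stores followed by a dword load: no forwarding possible,
       so the load must wait for the stores to retire */
    uint32_t pack_slow(uint8_t r, uint8_t g, uint8_t b, uint8_t a)
    {
        pixel_t p;
        p.b[0] = r; p.b[1] = g; p.b[2] = b; p.b[3] = a;
        return p.dw;
    }

    /* build the value in a register instead; store it full width if needed */
    uint32_t pack_fast(uint8_t r, uint8_t g, uint8_t b, uint8_t a)
    {
        return (uint32_t)r | ((uint32_t)g << 8) |
               ((uint32_t)b << 16) | ((uint32_t)a << 24);
    }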
70. instructions can be selected for decode per cycle. Only one VectorPath instruction can be selected for decode per cycle; DirectPath instructions and VectorPath instructions cannot be simultaneously decoded.

Select DirectPath Over VectorPath Instructions

Use DirectPath instructions rather than VectorPath instructions. DirectPath instructions are optimized for decode and execute efficiently by minimizing the number of operations per x86 instruction, which includes register-register op memory as well as register-register op register forms of instructions. Up to three DirectPath instructions can be decoded per cycle; VectorPath instructions block the decoding of DirectPath instructions. The vast majority of instructions used by a compiler have been implemented as DirectPath instructions in the AMD Athlon processor. Assembly writers must still take into consideration the usage of DirectPath versus VectorPath instructions. See Appendix F, Instruction Dispatch and Execution Resources, on page 187 and Appendix G, DirectPath versus VectorPath Instructions, on page 219 for tables of DirectPath and VectorPath instructions.

Load-Execute Instruction Usage

Use Load-Execute Integer Instructions

Most load-execute integer instructions are DirectPath decodable and can be decoded at the rate of three per cycle
71. is advisable to check the argument before one of the trigonometric instructions is invoked Example 2 Preferred FLD QWORD PTR x sargument FLD DWORD PTR two to the 63 x2 63 FCOMIP ST ST 1 argument lt 2 63 JBE in range Yes It is in range CALL reduce range reduce argument in ST 0 to lt 2 63 in range FSIN compute sine in range argument guaranteed 104 Check Argument Range of Trigonometric Instructions AMD 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization Since out of range arguments are extremely uncommon the conditional branch will be perfectly predicted and the other instructions used to guard the trigonometric instruction can execute in parallel to it Take Advantage of the FSINCOS Instruction Frequently a piece of code that needs to compute the sine of an argument also needs to compute the cosine of that same argument In such cases the FSINCOS instruction should be used to compute both trigonometric functions concurrently which is faster than using separate FSIN and FCOS instructions to accomplish the same task Example 1 Avoid FLD QWORD PTR x FLD DWORD PTR two to the 63 FCOMIP ST ST 1 JBE in range CALL reduce range in range FLD ST 0 FCOS FSTP QWORD PTR cosine x FSIN FSTP QWORD PTR sine x Example 2 Preferred FLD QWORD PTR x FLD DWORD PTR two to the 63 FCOMIP ST ST 1 JBE in range CA
72. ...last place. The square root (x4) is formed in the last step by multiplying by the input operand b.

Use MMX PMADDWD Instruction to Perform Two 32-Bit Multiplies in Parallel

The MMX PMADDWD instruction can be used to perform two signed 16x16->32-bit multiplies in parallel, with much higher performance than can be achieved using the IMUL instruction. The PMADDWD instruction is designed to perform four 16x16->32-bit signed multiplies and accumulate the results pairwise. By making one of the results in a pair a zero, there are now just two multiplies. The following example shows how to multiply the 16-bit signed numbers a, b, c, d into the signed 32-bit products a*c and b*d.

Example:
    PXOR      MM2, MM2      ; 0         0
    MOVD      MM0, [ab]     ; 0 0       b a
    MOVD      MM1, [cd]     ; 0 0       d c
    PUNPCKLWD MM0, MM2      ; 0 b       0 a
    PUNPCKLWD MM1, MM2      ; 0 d       0 c
    PMADDWD   MM0, MM1      ; b*d       a*c

3DNow and MMX Intra-Operand Swapping

AMD Athlon Specific Code: If the swapping of MMX register halves is necessary, use the PSWAPD instruction, which is a new AMD Athlon 3DNow DSP extension. Use of this instruction should only be for AMD Athlon specific code. PSWAPD MMreg1, MMreg2 performs the following operation:

    mmreg1[63:32] = mmreg2[31:0]
    mmreg1[31:0]  = mmreg2[63:32]

See the AMD Extensions to the 3DNow and
73. load execute floating point instructions with integer operands FIADD FISUB FISUBR FIMUL FIDIV FIDIVR FICOM and FICOMP Remember that floating point instructions can have integer operands while integer instruction cannot have floating point operands Floating point computations involving integer memory operands should use separate FILD and arithmetic instructions This optimization has the potential to increase decode bandwidth and OP density in the FPU scheduler The floating point load execute instructions with integer operands are VectorPath and generate two OPs in a cycle while the discrete equivalent enables a third DirectPath instruction to be decoded in the same cycle In some situations this optimizations can also reduce execution time if the FILD can be scheduled several instructions ahead of the arithmetic instruction in order to cover the FILD latency Load Execute Instruction Usage 35 AMDA AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 Example 1 Avoid RD PT FLD QWO PIR foo FIMUL DWORD PTR bar FIADD DWORD PTR baz Example 2 Preferred FILD DWORD PTR bar FILD DWORD PTR baz FLD QWORD PTR foo FMULP STC2 ST FADDP ST 1 ST Align Branch Targets in Program Hot Spots In program hot spots i e innermost loops in the absence of profiling data place branch targets at or near the beginning of 16 byte aligned code windows This technique helps t
74. mm xxx xxx DirectPath FADD FMUL PADDUSW mmreg1 mmreg2 OFh DirectPath FADD FMUL PADDUSW mmreg mem64 OFh DDh mm xxx xxx DirectPath FADD FMUL PADDW mmreg1 mmreg2 OFh FDh 11 xxx xxx DirectPath FADD FMUL PADDW mmreg mem64 OFh FDh mm xxx xxx DirectPath FADD FMUL mmreg1 mmreg2 OFh DBh 11 xxx xxx DirectPath FADD FMUL PAND mmreg mem64 OFh DBh mm xxx xxx DirectPath FADD FMUL Notes 1 Bits 2 1 and 0 of the modR M byte select the integer register 208 Instruction Dispatch and Execution Resources AMDA 22007E 0 November 1999 Table 20 MMX Instructions Continued AMD Athlon Processor x86 Code Optimization Notes 1 Bits 2 1 and 0 of the modR M byte select the integer register Instruction Mnemonic Ain pe M pong FPU Pipe s Notes PANDN mmreg1 mmreg2 OFh DFh 11 xxx xxx DirectPath FADD FMUL PANDN mmreg mem64 OFh DFh mm xxx xxx DirectPath FADD FMUL mmreg1 mmreg2 OFh 74h 11 xxx xxx DirectPath FADD FMUL PCMPEQB mmreg mem64 OFh 74h mm xxx xxx DirectPath FADD FMUL PCMPEQD mmreg1 mmreg2 OFh 76h 11 DirectPath FADD FMUL PCMPEQD mmreg mem64 OFh 76h mm xxx xxx DirectPath FADD FMUL PCMPEQW mmreg1 mmreg2 OFh 75h DirectPath FADD FMUL PCMPEQW mmreg mem64 OFh 75h m
75. Use Prototypes for All Functions

In general, use prototypes for all functions. Prototypes can convey additional information to the compiler that might enable more aggressive optimizations.

Use Const Type Qualifier

Use the const type qualifier as much as possible. This optimization makes code more robust and may enable higher performance code to be generated due to the additional information available to the compiler. For example, the C standard allows compilers to not allocate storage for objects that are declared const if their address is never taken.

Generic Loop Hoisting

To improve the performance of inner loops, it is beneficial to reduce redundant constant calculations (i.e., loop-invariant calculations). However, this idea can be extended to invariant control structures. The first case is that of a constant if() statement inside a for() loop.

Example 1:
    for ( ... ) {
        if (CONSTANT0)
            DoWork0(i);    /* does not affect CONSTANT0 */
        else
            DoWork1(i);    /* does not affect CONSTANT0 */
    }

The above loop should be transformed into:

    if (CONSTANT0) {
        for ( ... )
            DoWork0(i);
    } else {
        for ( ... )
            DoWork1(i);
    }

This makes the inner loops tighter by avoiding repetitious evaluation of a known
76. on word boundaries doublewords on doubleword boundaries and quadwords on quadword boundaries Misaligned memory accesses reduce the available memory bandwidth Use Multiplies Rather than Divides If accuracy requirements allow floating point division by a constant should be converted to a multiply by the reciprocal Divisors that are powers of two and their reciprocal are exactly representable except in the rare case that the reciprocal overflows or underflows and therefore does not cause an accuracy issue Unless such an overflow or underflow occurs a division by a power of two should always be converted to a multiply Although the AMD Athlon processor has high performance division multiplies are significantly faster than divides Ensure All FPU Data is Aligned 97 AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 Use FFREEP Macro to Pop One Register from the FPU Stack In FPU intensive code frequently accessed data is often pre loaded at the bottom of the FPU stack before processing floating point data After completion of processing it is desirable to remove the pre loaded data from the FPU stack as quickly as possible The classical way to clean up the FPU stack is to use either of the following instructions FSTP ST C0 removes one register from stack FCOMPP removes two registers from stack On the AMD Athlon processor a faster alternative is to use the FFREEP instruction
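As a C-level companion to the multiply-by-reciprocal advice above, the sketch below (added in this edit; the function name is invented) converts a division by a power-of-two constant into a multiplication by its reciprocal. Because the reciprocal of a power of two is exactly representable, the results are bit-identical to the division.

    /* Divide an array of floats by 8 using an exact reciprocal multiply. */
    void scale_down_by_8(float *v, int n)
    {
        const float inv8 = 1.0f / 8.0f;   /* exactly representable */
        for (int i = 0; i < n; i++)
            v[i] = v[i] * inv8;           /* instead of v[i] / 8.0f */
    }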
77. ooh 011 VectorPath MOV mregg reg8 88h 11 xxx xxx DirectPath MOV meme reg8 88h mm xxx xxx DirectPath MOV mreg16 32 reg16 32 89h 11 xxx xxx DirectPath MOV mem16 32 reg16 32 89h mm xxx xxx DirectPath MOV reg8 mreg8 8Ah 11 xxx xxx DirectPath MOV reg8 mem8 8Ah mm xxx xxx DirectPath MOV 16 32 mreg16 32 8Bh 11 xxx xxx DirectPath MOV 16 32 mem16 32 8Bh mm xxx xxx DirectPath MOV mreg16 segment reg 8Ch 11 xxx xxx VectorPath MOV 16 segment reg 8Ch mm xxx xxx VectorPath MOV segment reg mreg16 8Eh 11 xxx xxx VectorPath MOV segment reg mem16 8Eh mm xxx xxx VectorPath MOV AL mem8 DirectPath MOV EAX mem16 32 Alh DirectPath MOV meme AL A2h DirectPath MOV mem16 32 EAX Ash DirectPath MOV AL imm8 Boh DirectPath MOV CL imm8 Bih DirectPath MOV DL imm8 B2h DirectPath MOV BL imm8 B3h DirectPath MOV AH imm8 B4h DirectPath MOV CH imm8 B5h DirectPath MOV DH imm8 B6h DirectPath MOV BH imm8 B7h DirectPath MOV EAX imm16 32 B8h DirectPath MOV ECX imm16 32 B9h DirectPath Instruction Dispatch and Execution Resources 197 AMD AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 Table 19 Integer Instructions Continued Instruction Mnemonic prin pr M pon MOV EDX imm16 32 _ Bh MOV EBX imm16 32 BBh DirectPat
78. order of importance Ensure Floating Point Variables and Expressions are of Type Float For compilers that generate 3DNow instructions make sure that all floating point variables and expressions are of type float Pay special attention to floating point constants These require a suffix of F or f for example 3 14f in order to be of type float otherwise they default to type double To avoid automatic promotion of float arguments to double always use function prototypes for all functions that accept float arguments Use 32 Bit Data Types for Integer Code Use 32 bit data types for integer code Compiler implementations vary but typically the following data types are included int signed signed int unsigned unsigned int long signed long long int signed long int unsigned long and unsigned long int Ensure Floating Point Variables and Expressions are of Type Float 13 AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 Consider the Sign of Integer Operands In many cases the data stored in integer variables determines whether a signed or an unsigned integer type is appropriate For example to record the weight of a person in pounds no negative numbers are required so an unsigned type is appropriate However recording temperatures in degrees Celsius may require both positive and negative numbers so a signed type is needed Where there is a choice of using eit
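Where the value range permits either choice, the unsigned type is often cheaper for division and modulo. The pair of routines below is an illustration added in this edit (the names are invented): with an unsigned operand, division and modulo by a power of two typically compile to a single shift and mask, whereas the signed versions require extra adjustment code to handle negative operands correctly.

    /* divide and reduce by powers of two: cheap when the operand is unsigned */
    unsigned int bucket_unsigned(unsigned int i) { return (i / 32) % 16; }

    /* same expression with a signed operand: needs fixup code for i < 0 */
    int bucket_signed(int i) { return (i / 32) % 16; }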
79. processor It demonstrates the increased efficiency from using the PI2FW instruction Use of this instruction should only be for AMD Athlon processor specific code See the AMD Extensions to the 3DNow and MMX Instruction Set Manual order 22466 for more information on this instruction The second example demonstrates how to accomplish the same task in blended code that achieves good performance on the AMD Athlon processor as well as on the AMD K6 family processors that support 3DNow technology Example 1 AMD Athlon specific code using 3DNow DSP extension MOVD MMO packed sword bea PUNPCKLWD MMO MMO bb aa PI2FW MMO MMO xb float b xa float a MOVQ packed float MMO store xb xa Example 2 AMD K6 Family and AMD Athlon processor blended code MOVD MM1 packed sword 00 ba PXOR MMO MMO 00 090 PUNPCKLWD MMO 1 PSRAD MMO 16 sign extend b a PI2FD MMO MMO xb float b xa float a MOVQ packed float MMO store xb xa Use MMX PXOR to Negate 3DNow Data For both the AMD Athlon and AMD K6 processors it is recommended that code use the MMX PXOR instruction to change the sign bit of 3DNow operations instead of the 3DNow PFMUL instruction On the AMD Athlon processor using PXOR allows for more parallelism as it can execute in either the FADD or FMUL pipes PXOR has an execution latency of two but because it is a MMX instruction there is an initial one
80. significantly See Appendix C Implementation of Write Combining on page 155 for more details Avoid Placing Code and Data in the Same 64 Byte Cache Sharing code and data in the same 64 byte cache line may cause the L1 caches to thrash unnecessary castout of code data in order to maintain coherency between the separate instruction and data caches The AMD Athlon processor has a cache line size of 64 bytes which is twice the size of previous processors Programmers must be aware that code and data should not be shared within this larger cache line especially if the data becomes modified For example programmers should consider that a memory indirect JMP instruction may have the data for the jump table residing in the same 64 byte cache line as the JMP instruction which would result in lower performance Although rare do not place critical code at the border between 32 byte aligned code segments and a data segments The code at the start or end of your data segment should be as rarely executed as possible or simply padded with garbage In general the following should be avoided m self modifying code m storing data in code segments 50 Take Advantage of Write Combining AMD 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization Store to Load Forwarding Restrictions Store to load forwarding refers to the process of a load reading forwarding data from the store buffer LS2 There are
81. the execution behavior of several series of instructions as a function of decode constraints dependencies and execution resource constraints The sample tables show the x86 instructions the decode pipe in the integer execution pipeline the decode type the clock counts and a description of the events occurring within the processor The decode pipe gives the specific IEU used see Figure 7 on page 144 The decode type specifies either VectorPath VP or DirectPath DP The following nomenclature is used to describe the current location of a particular operation m D Dispatch stage Allocate in ICU reservation stations load store LS1 queue m I Issue stage Schedule operation for AGU or FU execution m E Integer Execution Unit IEU number corresponds to decode pipe m amp Address Generation Unit AGU number corresponds to decode pipe M Multiplier Execution m S Load Store pipe stage 1 Schedule operation for load store pipe A Load Store pipe stage 2 1st stage of data cache LS2 buffer access m Load Store pipe stage 3 2nd stage of data cache LS2 buffer access Note Instructions execute more efficiently that is without delays when scheduled apart by suitable distances based on dependencies In general the samples in this section show poorly scheduled code in order to illustrate the resultant effects 152 Execution Unit Resources AMDA 22007E 0 November 1999 Table 7
83. writing a partial register merges the modified portion with the current state of the remainder of the register Therefore the dependency hardware can potentially force a false dependency on the most recent instruction that writes to any part of the register Example 1 Avoid MOV AL 10 inst 1 MOV AH 12 inst 2 has a false dependency on inst 1 inst 2 merges new AH with current EAX register value forwarded by inst 1 In addition an instruction that has a read dependency on any part of a given architectural register has a read dependency on the most recent instruction that modifies any part of the same architectural register Example 2 Avoid MOV BX 12h inst 1 MOV BL DL inst 2 false dependency on completion of inst 1 MOV BH CL inst 3 false dependency on completion of inst 2 MOV AL BL inst 4 depends on completion of inst 2 Avoid Partial Register Reads and Writes 37 AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 Replace Certain SHLD Instructions with Alternative Code Certain instances of the SHLD instruction can be replaced by alternative code using SHR and LEA The alternative code has lower latency and requires less execution resources SHR and LEA 32 bit version are DirectPath instructions while SHLD is a VectorPath instruction SHR and LEA preserves decode bandwidth as it potentially enables the decoding of a third DirectPath instruction Examp
84. x86 Code Optimization Table 25 DirectPath Integer Instructions 22007E 0 November 1999 Instruction Mnemonic ADC mreg8 reg8 Instruction Mnemonic AND mreg 16 32 reg16 32 ADC mem reg8 AND mem16 32 reg16 32 ADC mreg16 32 2 AND reg8 mreg8 ADC mem16 32 reg16 32 AND reg8 mem8 ADC reg8 mreg8 AND reg16 32 mreg16 32 ADC reg8 mem8 AND reg16 32 mem16 32 ADC reg16 32 mreg16 32 AND AL imm8 ADC 16 32 mem16 32 AND EAX imm16 32 ADC AL imm8 AND mreg8 imm8 ADC EAX imm16 32 AND mem8 imm8 ADC mreg8 imm8 AND mreg16 32 imm16 32 ADC imm8 AND mem16 32 imm16 32 ADC mreg16 32 imm16 32 AND mreg16 32 imm8 sign extended ADC mem16 32 imm16 32 AND mem16 32 imm8 sign extended ADC mreg16 32 imm8 sign extended BSWAP EAX ADC mem16 32 imm8 sign extended BSWAP ECX ADD mreg8 reg8 BSWAP EDX ADD reg8 BSWAP EBX ADD mreg16 32 reg16 32 BSWAP ESP ADD mem16 32 reg16 32 BSWAP EBP ADD reg8 mreg8 BSWAP ESI ADD reg8 mem8 BSWAP EDI ADD reg16 32 mreg16 32 BT mreg16 32 reg16 32 ADD 16 32 mem16 32 BT mreg16 32 imm8 ADD AL imm8 BT mem16 32 imm8 ADD EAX imm16 32 CBW CWDE ADD mreg8 imm8 CLC ADD mem8 imm8 CMC
85. 0 26Ah MTRRFIX4k 20000 See MTRR Fixed Range Register Format on page 182 26Bh MTRRFIX4k_D8000 26Ch MTRRFIX4k_E0000 26Dh MTRRFIX4k_E8000 26Eh 0000 26Fh MTRRFIX4k F8000 2FFh MTRRdefType See MTRR Default Type Register Format on page 175 Page Attribute Table PAT 185 AMDA AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 186 Page Attribute Table PAT AMDA 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization Appendix F Instruction Dispatch and Execution Resources This chapter describes the MacroOPs generated by each decoded instruction along with the relative static execution latencies of these groups of operations Tables 19 through 24 starting on page 188 define the integer MMX MMX extensions floating point 3DNow M and 3DNow extensions instructions respectively The first column in these tables indicates the instruction mnemonic and operand types with the following notations reg8 byte integer register defined by instruction byte s bits 5 4 and 3 of the modR M byte mreg8 byte integer register defined by bits 2 1 and 0 of the modR M byte reg16 32 word and doubleword integer register defined by instruction byte s or bits 5 4 and 3 of the modR M byte mreg16 32 word and doubleword integer register defined by bits 2 1 and 0 of the mod R M byte mem8 amp byte memory location
86. [Fixed-range MTRR map, garbled in this copy: each MTRRfix4K register covers a 32-Kbyte block as eight 4-Kbyte ranges. The rows that remain legible are MTRRfix4K_D0000 (D0000h-D7FFFh), MTRRfix4K_D8000 (D8000h-DFFFFh), MTRRfix4K_E0000 (E0000h-E7FFFh), MTRRfix4K_E8000 (E8000h-EFFFFh), MTRRfix4K_F0000 (F0000h-F7FFFh), and MTRRfix4K_F8000 (F8000h-FFFFFh).]

Variable Range MTRRs

A variable MTRR can be programmed to start at address 0000_0000h because the fixed MTRRs always override the variable ones. However, it is recommended not to create an overlap. The upper two variable MTRRs should not be used by the BIOS and are reserved for operating system use.

Variable Range MTRR Register Format

The variable address range is power-of-2 sized and aligned. The range of supported sizes is from 2^12 to 2^36 in powers of 2. The AMD Athlon processor does not implement A[35:32].
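As an illustration of the base/mask arithmetic for a power-of-two sized, size-aligned region, here is a small C sketch added in this edit. The function name is invented, 36-bit physical addressing is assumed as stated above, and the field positions (type in bits 7-0 of the base register, mask in bits 35-12, valid bit 11 of the mask register) follow the register formats given elsewhere in the manual.

    #include <stdint.h>

    /* Compute the MTRRphysBase/MTRRphysMask values for one register pair.
       Returns -1 if the region is not a power-of-two size, not size-aligned,
       smaller than 4 Kbytes, or outside the 36-bit physical address space. */
    int mtrr_pair(uint64_t base, uint64_t size, uint8_t type,
                  uint64_t *phys_base, uint64_t *phys_mask)
    {
        const uint64_t phys_limit = 1ull << 36;
        if (size < 0x1000 || (size & (size - 1)) ||
            (base & (size - 1)) || base + size > phys_limit)
            return -1;
        *phys_base = base | type;                        /* type in bits 7-0 */
        *phys_mask = (~(size - 1) & (phys_limit - 1))    /* mask, bits 35-12 */
                     | (1ull << 11);                     /* V: enable pair   */
        return 0;
    }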
87. (ASCII '0' = 0x30, 'A' = 0x41.)

    MOV   AL, X       ; load X value
    CMP   AL, 10      ; if x is less than 10, set carry flag
    SBB   AL, 69h     ; 0-9 -> 96h-9Fh, Ah-Fh -> A1h-A6h
    DAS               ; 0-9: subtract 66h, Ah-Fh: subtract 60h
    MOV   Y, AL       ; save conversion in y

Example 6: Increment Ring Buffer Offset

C Code:
    char buf[BUFSIZE];
    int a;
    if (a < BUFSIZE - 1)
        a++;
    else
        a = 0;

Assembly Code:
    MOV   EAX, a            ; old offset
    CMP   EAX, BUFSIZE-1    ; a < BUFSIZE-1 ? CF : NC
    INC   EAX               ; a++
    SBB   EDX, EDX          ; a < BUFSIZE-1 ? 0xFFFFFFFF : 0
    AND   EAX, EDX          ; a < BUFSIZE-1 ? a++ : 0
    MOV   a, EAX            ; store new offset

Example 7: Integer Signum Function

C Code:
    int a, s;
    if (a > 0)
        s = 1;
    else if (a < 0)
        s = -1;
    else
        s = 0;

Assembly Code:
    MOV   EAX, a      ; load a
    CDQ               ; t = a < 0 ? 0xFFFFFFFF : 0
    CMP   EDX, EAX    ; a > 0 ? CF : NC
    ADC   EDX, 0      ; t = a > 0 ? t + 1 : t
    MOV   s, EDX      ; signum(a)

Always Pair CALL and RETURN

When the 12-entry return address stack gets out of synchronization, the latency of returns increases. The return address stack becomes out of sync when:
- calls and returns do not match
- the depth of the return stack is exceeded because of too many levels of nested function calls
88. 1 0 6 1 1 1 7 The processor contains MTRRs as described earlier which provide a limited way of assigning memory types to specific regions However the page tables allow memory types to be assigned to the pages used for linear to physical translation The memory type as defined by PAT and MTRRs are combined to determine the effective memory type as listed in Table 15 and Table 16 Shaded areas indicated reserved settings 178 Page Attribute Table PAT AMDA 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization Table 15 Effective Memory Type Based on PAT and MTRRs PAT Memory Type MTRR Memory Type Effective Memory Type UC WB WT WP WC UC Page UC UC MTRR WC X WC WT WB WT WT UC UC WC CD WP CD WP WB WP WP UC UC MTRR WC WT CD WB WB WB UC UC WC WC WT WT WP WP Notes 1L UC MTRR indicates that the UC attribute came from the MTRRs and that the processor caches should not be probed for performance reasons 2 UC Page indicates that the UC attribute came from the page tables and that the processor caches must be probed due to page aliasing 3 All reserved combinations default to CD Page Attribute Table PAT 179 AMDA AMD Athlon Processor x86 Code Optimization Table 16 Final Output Memory Types
89. [The latency entries of Table 1 are garbled in this copy: "16 10 3 c 16 10 3 0".] Note: c = the value of ECX (ECX > 0). Table 1 lists the latencies with the direction flag DF = 0 (increment) and DF = 1. In addition, these latencies are assumed for aligned memory operands. Note that for MOVS/STOS, when DF = 1 (DOWN), the overhead portion of the latency increases significantly; however, these types are less commonly found. The user should use the formula and round up to the nearest integer value to determine the latency.

Guidelines for Repeated String Instructions

To help achieve good performance, this section contains guidelines for the careful scheduling of VectorPath repeated string instructions.

Use the Largest Possible Operand Size: Always move data using the largest operand size possible. For example, use REP MOVSD rather than REP MOVSW, and REP MOVSW rather than REP MOVSB. Use REP STOSD rather than REP STOSW, and REP STOSW rather than REP STOSB.

Ensure DF = 0 (UP): Always make sure that DF = 0 (UP) after execution of CLD for REP MOVS and REP STOS. DF = 1 (DOWN) is only needed for certain cases of overlapping REP MOVS, for example when source and destination overlap.

The remaining guideline headings in this group are: Align Source and Destination with Operand Size; Inline REP String with Low Counts; Use Loop for REP String with Low Variable Counts; and Using MOVQ and MOVNTQ for Block Copy/Fill.
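The "largest possible operand size" guideline has a direct C analogue, sketched below as an addition of this edit (the function name is invented, and the routine assumes 4-byte-aligned, non-overlapping buffers): move data in the widest convenient unit and mop up the tail in bytes. In practice, simply calling the library memcpy() is usually the right choice, since it applies the same idea.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Copy n bytes using 32-bit moves for the bulk and bytes for the tail,
       mirroring the preference for REP MOVSD over REP MOVSB. */
    void copy_wide(void *dst, const void *src, size_t n)
    {
        uint32_t       *d = (uint32_t *)dst;
        const uint32_t *s = (const uint32_t *)src;
        while (n >= 4) {
            *d++ = *s++;
            n -= 4;
        }
        memcpy(d, s, n);    /* remaining 0-3 bytes */
    }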
90. 16 32 AAS CALL mem 16 16 32 ARPL 16 reg16 CALL near mreg32 indirect ARPL 16 reg16 CALL near mem32 indirect BOUND CLD BSF 16 32 mreg16 32 CLI BSF reg16 32 mem16 32 CLTS BSR reg16 32 mreg16 32 CMPSB mem8 mem8 BSR reg16 32 mem16 32 CMPSW 16 mem32 BT mem16 32 reg16 32 CMPSD mem32 mem32 BIC mreg16 32 reg16 32 CMPXCHG reg8 mem16 32 reg16 32 CMPXCHG meme reg8 BIC mreg16 32 imm8 CMPXCHG mreg16 32 reg16 32 BIC mem16 32 imm8 CMPXCHG mem16 32 reg16 32 BTR mreg16 32 reg16 32 CMPXCHG8B mem64 BIR mem16 32 reg16 32 CPUID BTR mreg16 32 imm8 DAA BIR mem16 32 imm8 DAS BTS mreg16 32 16 32 DIV AL mreg8 BTS mem16 32 reg16 32 DIV AL mem8 BTS mreg16 32 imm8 DIV EAX mreg16 32 VectorPath Instructions 231 AMDA AMD Athlon Processor x86 Code Optimization Table 29 VectorPath Integer Instructions Continued Instruction Mnemonic DIV EAX mem16 32 22007E 0 November 1999 ENTER Instruction Mnemonic LEA reg16 mem16 32 IDIV mreg8 LEAVE IDIV mem8 LES reg16 32 mem32 48 IDIV EAX mreg16 32 LFS reg16 32 mem32 48 IDIV EAX mem 16 32 LGDT mem48 IMUL reg16 32 imm16 32
91. 19 Added the optimization Efficient 3D Clipping Code Computation Using 3DNow Instructions on page 122 Added the optimization Complex Number Arithmetic on page 126 Added Appendix E Programming the MTRR and PAT Rearranged the appendices Added Index Revision History XV AMD 1 AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 xvi Revision History AMDA 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization Introduction The AMD Athlon processor is the newest microprocessor in the AMD K86 family of microprocessors The advances in the AMD Athlon processor take superscalar operation and out of order execution to a new level The AMD Athlon processor has been designed to efficiently execute code written for previous generation x86 processors However to enable the fastest code execution with the AMD Athlon processor programmers should write software that includes specific code optimization techniques About this Document This document contains information to assist programmers in creating optimized code for the AMD Athlon processor In addition to compiler and assembler designers this document has been targeted to C and assembly language programmers writing execution sensitive code sequences This document assumes that the reader possesses in depth knowledge of the x86 instruction set the x86 architecture registers programming modes
92. 2 PUSH EAX MOVSX reg16 32 mreg8 PUSH ECX MOVSX reg16 32 mem8 PUSH EDX MOVSX reg32 mreg16 PUSH EBX MOVSX reg32 mem16 PUSH ESP MOVZX reg16 32 mreg8 PUSH EBP MOVZX 16 32 mem8 PUSH ESI MOVZX reg32 mreg16 PUSH EDI MOVZX reg32 mem16 PUSH imm8 NEG mreg8 PUSH imm16 32 NEG mem8 RCL mreg8 imm8 NEG mreg16 32 RCL mreg16 32 imm8 NEG mem16 32 RCL mregg 1 NOP XCHG EAX EAX RCL 1 NOT mreg8 RCL mreg16 22 1 NOT mem8 RCL mem16 32 1 NOT mreg16 32 RCL mreg8 CL NOT mem16 32 RCL mreg16 32 CL OR mreg8 reg8 RCR mreg8 imm8 OR meme reg8 RCR mreg16 32 imm8 OR mreg16 32 reg16 32 RCR mreg8 1 OR mem16 32 reg16 32 RCR 1 OR reg8 mreg8 RCR mreg16 32 1 OR reg8 mem8 RCR mem16 32 1 OR 16 32 mreg16 32 RCR mreg8 CL OR reg16 32 mem16 32 RCR mreg16 32 CL OR AL imm8 ROL mreg8 imm8 OR EAX imm16 32 ROL mem8 imm8 OR mreg8 imm8 ROL mreg16 32 imm8 OR mem8 imm8 ROL mem16 32 imm8 OR mreg16 32 imm16 32 ROL 1 OR mem16 32 imm16 32 ROL 1 OR mreg16 32 imm sign extended ROL mreg16 32 1 OR mem16 32 imm8 sign extended ROL mem16 32 1 DirectPath Instructions 223 AMDA AMD Athlon Processor x86 Code Optimization Table 25 Direc
93. 32 81h mm 010 xxx DirectPath ADC mreg16 32 imm8 sign extended 85h 11 010 xxx DirectPath ADC mem16 32 imm8 sign extended 83h mm 010 xxx DirectPath ADD mregs reg8 00h 11 DirectPath ADD mem reg8 00h mm xxx xxx DirectPath ADD mreg16 32 reg16 32 oih 11 DirectPath ADD mem16 32 reg16 32 oih mm xxx xxx DirectPath ADD reg8 mreg8 02h 11 DirectPath ADD reg8 mem8 02h mm xxx xxx DirectPath ADD 16 32 mreg16 32 03h 11 DirectPath ADD 16 32 mem16 32 03h mm xxx xxx DirectPath ADD AL imm8 04h DirectPath ADD EAX imm16 32 05h DirectPath ADD mreg8 imm8 80h 11 000 xxx DirectPath ADD meme imm8 80h mm 000 xxx DirectPath ADD mreg16 32 imm16 32 81h 11 000 xxx DirectPath ADD mem16 32 imm16 32 81h mm 000 xxx DirectPath ADD mreg16 32 imm8 sign extended 83h 11 000 xxx DirectPath ADD mem16 32 imm8 sign extended 83h mm 000 xxx DirectPath AND mreg8 reg8 20h 11 xxx xxx DirectPath Instruction Dispatch and Execution Resources 189 AMD AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 Table 19 Integer Instructions Continued lisiricion Mnemonic First Second ModR M Decode Byte Byte Byte Type AND mem8 reg8 20h mm xxx xxx DirectPath AND mreg16 32 reg16 32 21h 11 xxx xxx DirectPath AND mem16 32 reg16
94. 32 OFh 6Eh mm xxx xxx DirectPath FADD FMUL FSTORE MOVD reg32 mmreg OFh 7Eh 11 xxx xxx VectorPath 1 MOVD mem32 mmreg OFh 7Eh mm xxx xxx DirectPath FSTORE MOVQ mmreg1 mmreg2 OFh 6Fh 11 xxx xxx DirectPath FADD FMUL MOVQ mmreg mem64 OFh 6Fh mm xxx xxx DirectPath FADD FMUL FSTORE MOVQ mmreg2 mmregl OFh 7Fh 11 xxx xxx DirectPath FADD FMUL MOVQ mem64 mmreg OFh 7Fh mm xxx xxx DirectPath FSTORE PACKSSDW mmreg1 mmreg2 OFh 6Bh 11 xxx xxx DirectPath FADD FMUL PACKSSDW mmreg mem64 OFh 6Bh mm xxx xxx DirectPath FADD FMUL PACKSSWB mmreg1 mmreg2 OFh 63h 11 xxx xxx DirectPath FADD FMUL PACKSSWB mmreg mem64 OFh 63h mm xxx xxx DirectPath FADD FMUL PACKUSWB mmreg1 mmreg2 OFh 67h 11 xxx xxx DirectPath FADD FMUL PACKUSWB mmreg mem64 OFh 67h mm xxx xxx DirectPath FADD FMUL PADDB mmreg1 mmreg2 OFh FCh 11 xxx xxx DirectPath FADD FMUL PADDB mmreg mem64 OFh FCh mm xxx xxx DirectPath FADD FMUL PADDD mmreg1 mmreg2 OFh FEh 11 xxx xxx DirectPath FADD FMUL PADDD mmreg mem64 OFh FEh mm xxx xxx DirectPath FADD FMUL PADDSB mmreg1 mmreg2 OFh ECh 11 xxx xxx DirectPath FADD FMUL PADDSB mmreg mem64 OFh ECh mm xxx xxx DirectPath FADD FMUL PADDSW mmreg1 mmreg2 OFh EDh 11 xxx xxx DirectPath FADD FMUL PADDSW mmreg mem64 OFh EDh mm xxx xxx DirectPath FADD FMUL PADDUSB mmreg1 mmreg2 OFh DCh 11 xxx xxx DirectPath FADD FMUL PADDUSB mmreg mem64 OFh DCh
95. 333333) + ((x >> 2) & 0x33333333). Each 4-bit field now has the value 0000b, 0001b, 0010b, 0011b, or 0100b.

Step 3   For the first time, the value in each k-bit field is small enough that adding two k-bit fields results in a value that still fits in the k-bit field. Thus the following computation is performed:

    y = (x + (x >> 4)) & 0x0F0F0F0F

The result is four 8-bit fields whose lower half has the desired sum and whose upper half contains junk that has to be masked out. In symbolic form:

    x        = 0aaa0bbb0ccc0ddd0eee0fff0ggg0hhh
    x >> 4   = 00000aaa0bbb0ccc0ddd0eee0fff0ggg
    sum      = 0aaaWWWWiiiiXXXXjjjjYYYYkkkkZZZZ

where WWWW, XXXX, YYYY, and ZZZZ are the interesting sums, each at most 1000b (8 decimal).

Step 4   The four sums can now be rapidly accumulated by means of a multiply with a "magic" multiplier. Writing out the partial products of y * 0x01010101 (that is, y + (y << 8) + (y << 16) + (y << 24)) shows that the most significant byte of the 32-bit product is the sum of all four byte-wide fields, which is the result of interest. Thus the final result is:

    z = (y * 0x01010101) >> 24

Example:

    unsigned int popcount(unsigned int v)
    {
       unsigned int retVal;
       __asm {
          MOV   EAX, [v]      ;v
          MOV   EDX, EAX      ;v
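For comparison with the inline-assembly version, a compact C rendering of the same branchless algorithm is sketched below. It follows the grouping steps described in this section; the function name and the plain-C form are illustrative only, not the guide's own listing.

    /* Branchless population count of a 32-bit value, following the steps
       above: 2-bit sums, 4-bit sums, 8-bit sums, then the multiply by the
       "magic" constant 0x01010101 to gather the total into the top byte. */
    unsigned int popcount32(unsigned int x)
    {
        x = x - ((x >> 1) & 0x55555555);                /* Step 1: 2-bit sums */
        x = (x & 0x33333333) + ((x >> 2) & 0x33333333); /* Step 2: 4-bit sums */
        x = (x + (x >> 4)) & 0x0F0F0F0F;                /* Step 3: 8-bit sums */
        return (x * 0x01010101) >> 24;                  /* Step 4: total      */
    }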
96.
    FSTP  QWORD PTR [EAX+ECX*8+ARR_SIZE+40]  ;a[i+5] = b[i+5]*c[i+5]
    FLD   QWORD PTR [EDX+ECX*8+ARR_SIZE+48]  ;b[i+6]
    FMUL  QWORD PTR [EBX+ECX*8+ARR_SIZE+48]  ;b[i+6]*c[i+6]
    FSTP  QWORD PTR [EAX+ECX*8+ARR_SIZE+48]  ;a[i+6] = b[i+6]*c[i+6]
    FLD   QWORD PTR [EDX+ECX*8+ARR_SIZE+56]  ;b[i+7]
    FMUL  QWORD PTR [EBX+ECX*8+ARR_SIZE+56]  ;b[i+7]*c[i+7]
    FSTP  QWORD PTR [EAX+ECX*8+ARR_SIZE+56]  ;a[i+7] = b[i+7]*c[i+7]
    ADD   ECX, 8                             ;next 8 products
    JNZ   $loop                              ;until none left
    END

The following optimization rules were applied to this example:
■ Loops should be unrolled to make sure that the data stride per loop iteration is equal to the length of a cache line. This avoids overlapping PREFETCH instructions and thus makes optimal use of the available number of outstanding PREFETCHes.
■ Since the array array_a is written rather than read, PREFETCHW is used instead of PREFETCH to avoid the overhead of switching cache lines to the correct MESI state. The PREFETCH lookahead has been optimized such that each loop iteration works on three cache lines while six active PREFETCHes bring in the next six cache lines.
■ Index arithmetic has been reduced to a minimum by use of complex addressing modes and biasing of the arr
97. 5 for a different perspective.

Sort Local Variables According to Base Type Size

When a compiler allocates local variables in the same order in which they are declared in the source code, it can be helpful to declare local variables in such a manner that variables with a larger base type size are declared ahead of variables with a smaller base type size. Then, if the first variable is allocated so that it is naturally aligned, all other variables are allocated contiguously in the order they are declared and are naturally aligned without any padding.

Some compilers do not allocate variables in the order they are declared. In these cases, the compiler should automatically allocate variables in such a manner as to make them naturally aligned with the minimum amount of padding. In addition, some compilers do not guarantee that the stack is aligned suitably for the largest base type (that is, they do not guarantee quadword alignment), so that quadword operands might be misaligned even if this technique is used and the compiler does allocate variables in the order they are declared.

The following example demonstrates the reordering of local variable declarations.

Original ordering (avoid):

    short  ga, gu, gi;
    long   foo, bar;
    double x, y, z[3];
    char   a, b;
    float  baz;

Improved ordering (preferred):
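Applying the rule above (largest base types first), the reordered declarations would plausibly read as in the sketch below; this is a reconstruction illustrating the rule, not necessarily the manual's exact listing.

    double x, y, z[3];   /* 8-byte base types first  */
    long   foo, bar;     /* then 4-byte types        */
    float  baz;
    short  ga, gu, gi;   /* then 2-byte types        */
    char   a, b;         /* 1-byte types last        */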
98. 6 32 OFh 02h 11 Xxx xxx VectorPath LAR reg16 32 mem16 32 OFh 02h mm xxx xxx VectorPath LDS reg16 32 mem32 48 C5h mm xxx xxx VectorPath LEA reg16 mem16 32 8Dh mm xxx xxx VectorPath LEA reg32 mem16 32 8Dh mm xxx xxx DirectPath LEAVE C9h VectorPath LES reg16 32 mem32 48 C4h mm xxx xxx VectorPath LFS reg16 32 mem32 48 OFh B4h VectorPath LGDT mem48 OFh Olh mm 010 xxx VectorPath LGS reg16 52 mem32 48 OFh B5h VectorPath LIDT mem48 OFh 011 VectorPath LLDT 16 OFh ooh 11 010 xxx VectorPath LLDT mem16 OFh ooh mm 010 xxx VectorPath LMSW 16 OFh Oth 11 100 xxx VectorPath LMSW mem16 OFh oih mm 100 xxx VectorPath LODSB AL mem8 ACh VectorPath LODSW AX mem16 ADh VectorPath LODSD EAX mem32 ADh VectorPath LOOP disp8 E2h VectorPath 196 Instruction Dispatch and Execution Resources AMDA 22007E 0 November 1999 Table 19 Integer Instructions Continued AMD Athlon Processor x86 Code Optimization Instruction Mnemonic e p pog LOOPE LOOPZ disp8 Eth VectorPath LOOPNE LOOPNZ disp8 VectorPath LSL reg16 32 mreg16 32 OFh 03h 11 xxx xxx VectorPath LSL reg16 32 mem16 32 OFh 03h mm xxx xxx VectorPath LSS reg16 32 mem32 48 OFh B2h mm xxx xxx VectorPath LTR 16 OFh ooh 11 011 xxx VectorPath LTR mem16 OFh
99. 6 32 OFh 47h mm xxx xxx DirectPath CMOVAE CMOVNB CMOVNC 16 32 mem16 32 OFh 43h 11 xxx xxx DirectPath ne MONNE CMOVNC meml6 52 OFh 43h mm xxx xxx DirectPath CMOVB CMOVC CMOVNAE reg16 32 reg16 32 OFh 42h 11 xxx xxx DirectPath CMOVB CMOVC CMOVNAE mem16 32 6616 32 OFh 42h mm xxx xxx DirectPath CMOVBE CMOVNA reg16 32 reg16 32 OFh 46h 11 xxx xxx DirectPath CMOVBE CMOVNA reg16 32 mem16 32 OFh 46h mm xxx xxx DirectPath Instruction Dispatch and Execution Resources 191 AMDA AMD Athlon Processor x86 Code Optimization Table 19 Integer Instructions Continued 22007E 0 November 1999 Instruction Mnemonic 2 k M pon CMOVE CMOVZ reg16 32 16 32 OFh 44h 11 xxx xxx DirectPath CMOVE CMOVZ reg16 32 mem16 32 OFh 44h mm xxx xxx DirectPath CMOVG CMOVNLE reg16 32 reg16 32 OFh 4Fh 11 xxx xxx DirectPath CMOVG CMOVNLE reg16 32 mem16 32 OFh 4Fh mm xxx xxx DirectPath CMOVGE CMOVNL reg16 32 reg16 32 OFh 4Dh 11 xxx xxx DirectPath CMOVGE CMOVNL 16 32 mem16 32 OFh 4Dh mm xxx xxx DirectPath CMOVL CMOVNGE reg16 32 reg16 32 OFh 4Ch 11 xxx xxx DirectPath CMOVL CMOVNGE reg16 32 mem16 32 OFh 4Ch mm xxx xxx DirectPath CMOVLE CMOVNG reg16 32 reg16 32 OFh 4Eh 11 xxx xxx DirectPath CMOVLE CMOVNG reg 16 32 mem16 32 O
100. 7h. It is illustrated in Figure 15. Each of the eight PAn fields can contain the memory type encodings as described in Table 12 on page 174. An attempt to write an undefined memory type encoding into the PAT will generate a GP fault.

Figure 15. Page Attribute Table (MSR 277h); bits not occupied by the PA0-PA7 fields are reserved.

Accessing the PAT / MTRRs and PAT

A 3-bit index consisting of the PATi, PCD, and PWT bits of the page table entry is used to select one of the eight PAT register fields to acquire the memory type for the desired page. PATi is defined as bit 7 for 4-Kbyte PTEs and bit 12 for PDEs which map to 2-Mbyte or 4-Mbyte pages. The memory type from the PAT is used instead of the PCD and PWT bits for the effective memory type.

A 2-bit index consisting of the PCD and PWT bits of the page table entry is used to select one of four PAT register fields when PAE (page address extensions) is enabled, or when the PDE doesn't describe a large page. In the latter case, the PATi bit for a PTE (bit 7) corresponds to the page size bit in a PDE. Therefore, the OS should only use PA0-3 when setting the memory type for a page table that is also used as a page directory. See Table 14 on page 178.

Table 14. PATi, PCD, PWT 3-Bit Encodings (PATi PCD PWT -> PAT Entry):
    0 0 0 -> 0
    0 0 1 -> 1
    0 1 0 -> 2
    0 1 1 -> 3
    1 0 0 -> 4
    1 0 1 -> 5
    1
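As a quick illustration of the indexing scheme in Table 14, the sketch below computes which PAn field a normal 4-Kbyte page selects. The function name is made up for the example; the PTE bit positions used (PWT = bit 3, PCD = bit 4, PATi = bit 7 for 4-Kbyte PTEs) follow the description above.

    #include <stdint.h>

    /* Build the 3-bit PAT index (selects PA0..PA7) for a 4-Kbyte PTE,
       with PATi as the most significant bit, as in Table 14. Sketch only. */
    static unsigned pat_index_4k(uint32_t pte)
    {
        unsigned pwt  = (pte >> 3) & 1;   /* PWT  = PTE bit 3              */
        unsigned pcd  = (pte >> 4) & 1;   /* PCD  = PTE bit 4              */
        unsigned pati = (pte >> 7) & 1;   /* PATi = PTE bit 7 (4-KB pages) */
        return (pati << 2) | (pcd << 1) | pwt;
    }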
101. A mem8 PUNPCKHWD mmreg1 mmreg2 PREFETCHTO mem8 PUNPCKHWD mmreg mem64 PREFETCHTI mem8 PUNPCKLBW 1 mmreg2 2 mem8 PUNPCKLBW mmreg mem64 PUNPCKLDQ mmreg1 mmreg2 PUNPCKLDQ mmreg mem64 PUNPCKLWD mmreg1 mmreg2 PUNPCKLWD mmreg mem64 PXOR mmreg1 mmreg2 228 DirectPath Instructions 22007E 0 November 1999 Table 28 DirectPath Floating Point Instructions Instruction Mnemonic FABS AMD Athlon Processor x86 Code Optimization FADD ST ST i Instruction Mnemonic FIST mem32int FADD mem32real FISTP mem 16int FADD ST i ST FISTP mem32int FADD mem 64real FISTP meme4int FADDP ST i ST FLD ST i FCHS FLD mem32real FCOM 510 FLD mem64real FCOMP ST i FLD mem80real FCOM mem32real FLD1 FCOM mem64real FLDL2E FCOMP mem32real FLDL2T FCOMP mem 64real FLDLG2 FCOMPP FLDLN2 FDECSTP FLDPI FDIV ST ST i FLDZ FDIV ST i ST FMUL ST 510 FDIV mem32real FMUL ST i ST FDIV mem64real FMUL mem32real FDIVP ST ST i FMUL mem64real FDIVR ST ST i FMULP ST ST i FDIVR ST i ST FNOP FDIVR mem32real FPREM FDIVR mem64real FPREMI FDIVRP ST i ST FSQRT FFREE 510 FST memz2real
102. AMD Athlon Processor x86 Code Optimization Guide 1999 Advanced Micro Devices Inc All rights reserved The contents of this document are provided in connection with Advanced Micro Devices Inc AMD products AMD makes no representations or warranties with respect to the accuracy or completeness of the contents of this publication and reserves the right to make changes to specifications and product descriptions at any time without notice No license whether express implied arising by estoppel or otherwise to any intellectual property rights is granted by this publication Except as set forth in AMD s Standard Terms and Conditions of Sale AMD assumes no liability whatsoever and disclaims any express or implied warranty relating to its products including but not limited to the implied warranty of merchantability fitness for a particular purpose or infringement of any intellectual property right AMD S products not designed intended authorized or warranted for use as components in systems intended for surgical implant into the body or in other applications intended to support or sustain life or in any other applica tion in which the failure of AMD s product could create a situation where per sonal injury death or severe property or environmental damage may occur AMD reserves the right to discontinue or make changes to its products at any time without notice Trademarks AMD the AMD logo AMD Ath
103. B mmreg1 mmreg2 OFh F8h 11 xxx xxx DirectPath FADD FMUL PSUBB mmreg mem64 OFh F8h mm xxx xxx DirectPath FADD FMUL PSUBD mmreg1 mmreg2 OFh FAh DirectPath FADD FMUL PSUBD mmreg mem64 OFh FAh mm xxx xxx DirectPath FADD FMUL PSUBSB mmreg1 mmreg2 OFh E8h 11 xxx xxx DirectPath FADD FMUL PSUBSB mmreg mem64 OFh E8h mm xxx xxx DirectPath FADD FMUL PSUBSW 1 mmreg2 Oh E9h 11 xxx xxx DirectPath FADD FMUL PSUBSW mmreg mem64 OFh E9h mm xxx xxx DirectPath FADD FMUL PSUBUSB mmreg1 mmreg2 OFh D8h 11 xxx xxx DirectPath FADD FMUL PSUBUSB mmreg mem64 OFh D8h mm xxx xxx DirectPath FADD FMUL PSUBUSW mmreg1 mmreg2 OFh D9h 11 xxx xxx DirectPath FADD FMUL PSUBUSW mmreg mem64 OFh D9h mm xxx xxx DirectPath FADD FMUL PSUBW mmreg1 mmreg2 Oh F9h 11 xxx xxx DirectPath FADD FMUL PSUBW mmreg mem64 OFh F9h mm xxx xxx DirectPath FADD FMUL PUNPCKHBW mmreg1 mmreg2 OFh 68h 11 xxx xxx DirectPath FADD FMUL PUNPCKHBW mmreg mem64 OFh 68h mm xxx xxx DirectPath FADD FMUL 210 Instruction Dispatch and Execution Resources AMDA 22007E 0 November 1999 Table 20 MMX Instructions Continued AMD Athlon Processor x86 Code Optimization Instruction Mnemonic dis b sd M pong FPU Pipe s Notes PUNPCKHDQ mmreg1 mmreg2 OFh 11 xx
104. Continued AMD Athlon Processor x86 Code Optimization Mnemodic First Second ModR M Decode Byte Byte Byte Type CMP EAX imm16 32 3Dh DirectPath CMP mreg8 imm8 80h 11 111 xxx DirectPath CMP mem imm8 80h mm 111 xxx DirectPath CMP mreg16 32 imm16 32 81h 11 111 xxx DirectPath CMP mem16 32 imm16 32 81h mm 111 xxx DirectPath CMP mreg16 32 imm8 sign extended 85h 11 111 xxx DirectPath CMP mem16 32 imm8 sign extended 83h mm 111 xxx DirectPath CMPSB mem8 mem8 A6h VectorPath CMPSW mem16 mem32 A7h VectorPath CMPSD mem32 mem32 A7h VectorPath CMPXCHG mreg8 reg8 OFh BOh 11 xxx xxx VectorPath CMPXCHG mem reg8 OFh Boh mm xxx xxx VectorPath CMPXCHG mreg16 32 reg16 32 OFh Bih 11 VectorPath CMPXCHG mem16 32 reg16 32 OFh Bih mm xxx xxx VectorPath CMPXCHG8B mem64 OFh C7h mm xxx xxx VectorPath CPUID OFh VectorPath CWD CDQ 99h DirectPath DAA 27h VectorPath DAS 2Fh VectorPath DEC EAX 48h DirectPath DEC ECX 49h DirectPath DEC EDX 4Ah DirectPath DEC EBX 4Bh DirectPath DEC ESP 4Ch DirectPath DEC EBP 4Dh DirectPath DEC ESI 4Eh DirectPath DEC EDI 4Fh DirectPath DEC mreg8 FEh 11 001 xxx DirectPath DEC mem8 FEh mm 001 xxx DirectPath DEC mreg16 32 FFh 11 001 xxx DirectPath DEC mem16 32 FFh mm 001 xxx DirectPath DIV AL mreg8 F6h 11 110 xxx Vec
105. D MM1 MM7 POR MM2 MM3 PSRLQ MMO 1 PSRLQ MM1 1 PAND MM2 MM6 PADDB MMO MM1 PADDB MMO MM2 MOVQ EDI MMO MOVQ MM4 551 8 MOVQ MM5 EDI 8 MOVQ MM2 MM4 MOVQ MM3 MM5 PAND MM2 MM6 PAND MM3 MM6 PAND MM4 MM7 PAND MM5 MM7 POR MM2 MM3 PSRLQ MM4 1 PSRLQ MM5 1 PAND MM2 MM6 PADDB MM4 MM5 PADDB MM4 MM2 MOVQ EDI 8 MM4 ADD ESI EDX ADD EDI EBX LOOP L1 5 MB Dst MB SrcStride DstStride ConstFEFE Const0101 MMO QWORD1 MM1 QWORD3 MMO MM1 calculate MMO 1 MMO adjustment add 150 ad MM4 QWORD2 MM5 QWORDA MMO MM1 calculate MMO 1 MMO adjustmen add 150 ad 22007E 0 November 1999 D1 8 Oxfefefefe D3 amp Oxfefefefe adjustment RD1 amp Oxfefefefe 2 RD3 amp Oxfefefefe 2 D1 2 QWORD3 2 w o justment D2 amp Oxfefefefe 04 8 Oxfefefefe adjustment RD2 amp Oxfefefefe 2 RD4 amp Oxfefefefe 2 D2 2 QWORD4 2 w o justment 124 Use 3DNow PAVGUSB for MPEG 2 Motion AMDA 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization The following code fragment uses the 3DNow PAVGUSB instruction to perform averaging between the source macroblock and destination macroblock V PN X LOOP
106. DIVRP ST i ST DEh 11 110 xxx DirectPath FMUL 1 FFREE ST i DDh 11 000 xxx DirectPath FADD FMUL FSTORE 1 FFREEP ST i DFh CO C7h DirectPath FADD FMUL FSTORE 1 Instruction Dispatch and Execution Resources 213 AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 Table 22 Floating Point Instructions Continued Instruction Mnemonic s f Decode F P Note Byte Byte Byte Type Pipe s FIADD mem32int DAh mm 000 xxx VectorPath FIADD memieint DEh mm 000 xxx VectorPath FICOM mem32int DAh mm 010 xxx VectorPath FICOM mem 16int DEh mm 010 xxx VectorPath FICOMP mem32int DAh mm 011 xxx VectorPath FICOMP memib6int DEh mm 011 xxx VectorPath FIDIV mem32int DAh mm 110 xxx VectorPath FIDIV mem 16int DEh mm 110 xxx VectorPath FIDIVR mem32int DAh mm 111 xxx VectorPath FIDIVR mem 16int DEh mm 111 xxx VectorPath FILD mem 16int DFh mm 000 xxx DirectPath FSTORE mem32int DBh mm 000 xxx DirectPath FSTORE mem64int DFh mm 101 xxx DirectPath FSTORE FIMUL mem32int DAh mm 001 xxx VectorPath FIMUL mem 16int DEh mm 001 xxx VectorPath FINCSTP D9h F7h DirectPath FADD FMUL FSTORE FINIT Eh VectorPath FIST 16 DFh mm 010 xxx DirectPath FSTORE FIST mem32int DBh mm 010 xxx Di
107. DX 5 ADD EDX ECX quotient in EDX typedef unsigned _ int64 064 typedef unsigned long U32 U32 1042 U32 1 U32 t 0 i i l while i j return t 032 1 s m a U64 m low m high j k Determine algorithm a multiplier m and shift count s for 32 bit signed integer division Based on Granlund T Montgomery P L Division by Invariant Integers using Multiplication SIGPLAN Notices Vol 29 June 1994 page 61 1 log2 d J U64 0x80000000 k U64 1 lt lt 3241 m_low U64 1 lt lt 32 1 m high U64 1 lt lt 32 1 k d while low gt gt 1 gt m high lt lt 1 88 1 gt 0 m low low gt gt 1 m high high lt lt 1 1 eu I U64 d U64 0x80000000 3 CU ht d U32 m_high s m_high gt gt 31 1 0 96 Derivation of Multiplier Used for Integer Division by AMD 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization Floating Point Optimizations This chapter details the methods used to optimize floating point code to the pipelined floating point unit FPU Guidelines are listed in order of importance Ensure All FPU Data is Aligned As discussed in Memory Size and Alignment Issues on page 45 floating point data should be naturally aligned That is words should be aligned
108. Data to the System 159 AppendixD Performance Monitoring Counters 161 Overview ps soa us Qd VERRE CRI Ud S e ORE UN E 161 Performance Counter 161 PerfEvtSel 3 0 MSRs MSR Addresses C001 0000 001 0003 162 Contents ix AMDA AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 PerfCtr 3 0 MSRs MSR Addresses C001 0004 001 0007h 167 Starting and Stopping the Performance Monitoring Counters uox eruitur Va c aid wads Der 168 Event and Time Stamp Monitoring Software 168 Monitoring Counter 169 Appendix E Programming the MTRR and PAT 171 Introduction mo eb de E te ten tO Ah u CO Sena elas ay 171 Memory Type Range Register MTRR Mechanism 171 Page Attribute Table PAT sue io Beebe ERE E 177 AppendixF Instruction Dispatch and Execution Resources 187 Appendix G DirectPath versus VectorPath Instructions 219 Select DirectPath Over VectorPath Instructions 219 DirectPath Instructions 219 VectorPath Instructions s 231 dex ree oe coches 237 x Contents AMDA 22007E 0 November 1999 List of Figures AMD Athlon Processor x86 Code Optimization Figure 1 AMD Athlon Processor Block Diagram 131 Figure 2 Integ
109. DirectPath TEST mem reg8 84h mm xxx xxx DirectPath TEST mreg16 32 2 85h 11 xxx xxx DirectPath TEST mem16 32 reg16 32 85h mm xxx xxx DirectPath TEST AL imm8 DirectPath TEST EAX imm16 32 A9h DirectPath TEST mreg8 imm8 Feh 11 000 xxx DirectPath TEST mem8 imm8 Feh mm 000 xxx DirectPath TEST mreg8 imm16 32 F7h 11 000 xxx DirectPath TEST imm16 32 F7h mm 000 xxx DirectPath VERR mreg16 OFh ooh 11 100 xxx VectorPath VERR mem16 OFh 00 mm 100 xxx VectorPath VERW 16 OFh ooh 11 101 VectorPath VERW mem16 OFh ooh mm 101 xxx VectorPath WAIT 9Bh DirectPath WBINVD OFh 09h VectorPath WRMSR OFh 30h VectorPath 206 Instruction Dispatch and Execution Resources AMDA 22007E 0 November 1999 Table 19 Integer Instructions Continued AMD Athlon Processor x86 Code Optimization Instruction Mnemonic jd M pn XADD mregg reg8 OFh 11 100 xxx VectorPath XADD mem reg8 OFh COh mm 100 xxx VectorPath XADD mreg16 32 reg16 32 OFh Cih 11 101 xxx VectorPath XADD mem16 32 reg16 32 OFh Cih mm 101 xxx VectorPath XCHG reg8 mreg8 86h 11 xxx xxx VectorPath XCHG reg8 mem8 86h mm xxx xxx VectorPath XCHG reg16 32 mreg16 32 87h 11 xxx xxx VectorPath XCHG reg16 32 mem16 32 87h mm xxx xxx VectorPath XCHG EAX EAX
110. Example 4 (left shift of a 64-bit operand in EDX:EAX; shift count in ECX, count applied modulo 64):

    SHLD  EDX, EAX, CL     ;first apply shift count
    SHL   EAX, CL          ; mod 32 to EDX:EAX
    TEST  ECX, 32          ;need to shift by another 32?
    JZ    $lshift_done     ;no, done
    MOV   EDX, EAX         ;left shift EDX:EAX
    XOR   EAX, EAX         ; by 32 bits
$lshift_done:

Example 5 (right shift):

    SHRD  EAX, EDX, CL     ;first apply shift count
    SHR   EDX, CL          ; mod 32 to EDX:EAX
    TEST  ECX, 32          ;need to shift by another 32?
    JZ    $rshift_done     ;no, done
    MOV   EAX, EDX         ;right shift EDX:EAX
    XOR   EDX, EDX         ; by 32 bits
$rshift_done:

Example 6 (multiplication): computes the low-order half of the product of its arguments, two 64-bit integers.

    ;INPUT:    [ESP+8]:[ESP+4]    multiplicand
    ;          [ESP+16]:[ESP+12]  multiplier
    ;OUTPUT:   EDX:EAX = multiplicand * multiplier % 2^64
    ;DESTROYS: EAX, ECX, EDX, EFlags

_llmul PROC
    MOV   EDX, [ESP+8]     ;multiplicand_hi
    MOV   ECX, [ESP+16]    ;multiplier_hi
    OR    EDX, ECX         ;one operand >= 2^32?
    MOV   EDX, [ESP+12]    ;multiplier_lo
    MOV   EAX, [ESP+4]     ;multiplicand_lo
    JNZ   $twomul          ;yes, need two multiplies
    MUL   EDX              ;multiplicand_lo * multiplier_lo
    RET                    ;done, return to caller
$twomul:
    IMUL  EDX, [ESP+8]     ;p3_lo = multiplicand_hi * multiplier_lo
    IMUL  ECX, EAX         ;p2_lo = multiplier_hi * multiplicand_lo
    ADD   ECX, EDX         ;p2_lo + p3_lo
    MUL   DWORD PTR
111. ECEN EAE 128 Integer Only Work OG Gu LENA eur d MP VICO BA 85 Store to Load Forwarding rae A 18 51 53 54 PAND to Find Absolute Value in 3DNow Code 119 Stream of Packed Unsigned 125 PCMP Instead of 3DNow PFCMP 114 String 5 0 5 84 PCMPEQD to Set an MMX Register 119 Structure 8 1 eee eee eee 27 28 56 PMADDWD Instruction 111 Subexpressions Explicitly Extract Common 26 PREFETCHNTA TO T1 T2 Instruction 47 Superscalar 130 113 118 119 Switch 21 24 238 Index AMDA 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization 1 irian lw Pl I eR RR RI RR 55 Write Combining 10 50 139 155 157 159 Trigonometric Instructions 103 x86 Optimization Guidelines 127 VectorPath 133 XOR Instruction 86 VectorPath 231 Index 239 AMD 1 AMD Athlo
112.     LEA REG1, [REG1*8+REG1]
         ADD REG1, REG2

by 15:   MOV REG2, REG1            ;2 cycles
         SHL REG1, 4
         SUB REG1, REG2

by 16:   SHL REG1, 4               ;1 cycle

by 17:   MOV REG2, REG1            ;2 cycles
         SHL REG1, 4
         ADD REG1, REG2

by 18:   ADD REG1, REG1            ;3 cycles
         LEA REG1, [REG1*8+REG1]

by 19:   LEA REG2, [REG1*2+REG1]   ;3 cycles
         SHL REG1, 4
         ADD REG1, REG2

by 20:   SHL REG1, 2               ;3 cycles
         LEA REG1, [REG1*4+REG1]

by 21:   LEA REG2, [REG1*4+REG1]   ;3 cycles
         SHL REG1, 4
         ADD REG1, REG2

by 22:   use IMUL

by 23:   LEA REG2, [REG1*8+REG1]   ;3 cycles
         SHL REG1, 5
         SUB REG1, REG2

by 24:   SHL REG1, 3               ;3 cycles
         LEA REG1, [REG1*2+REG1]

by 25:   LEA REG2, [REG1*8+REG1]   ;3 cycles
         SHL REG1, 4
         ADD REG1, REG2

by 26:   use IMUL

by 27:   LEA REG2, [REG1*4+REG1]   ;3 cycles
         SHL REG1, 5
         SUB REG1, REG2

by 28:   MOV REG2, REG1            ;3 cycles
         SHL REG1, 3
         SUB REG1, REG2
         SHL REG1, 2

by 29:   LEA REG2, [REG1*2+REG1]   ;3 cycles
         SHL REG1, 5
         SUB REG1, REG2

by 30:   MOV REG2, REG1            ;3 cycles
         SHL REG1, 4
         SUB REG1, REG2
         ADD REG1, REG1

by 31:   MOV REG2, REG1            ;2 cycles
         SHL REG1, 5
         SUB REG1, REG2

by 32:   SHL REG1, 5               ;1 cycle

Use MMX Instructions for Integer Only Work

In many programs it can be advantageous to use MMX instructions to do integer-only work, especially if the function already uses 3DNow or MMX code.
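As a simple illustration of this idea, byte-wide additions can be carried through MMX registers instead of the integer register file, leaving the integer registers free for pointers and loop counters. The sketch below uses the C MMX intrinsics (mmintrin.h) rather than the assembly style used elsewhere in this guide; the function name and the assumption that the buffer length is a multiple of eight bytes are illustrative only.

    #include <mmintrin.h>

    /* Add two byte arrays eight bytes at a time using MMX (PADDB).
       Assumes n is a multiple of 8; illustrative sketch only. */
    static void add_bytes_mmx(unsigned char *dst, const unsigned char *src, int n)
    {
        int i;
        for (i = 0; i < n; i += 8) {
            __m64 a = *(const __m64 *)(src + i);       /* MOVQ load          */
            __m64 b = *(__m64 *)(dst + i);
            *(__m64 *)(dst + i) = _mm_add_pi8(a, b);   /* PADDB, MOVQ store  */
        }
        _mm_empty();   /* EMMS/FEMMS before returning to integer/x87 code */
    }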
113. ETP SETPE mreg8 SBB mem16 32 reg16 32 SETP SETPE mem8 SBB reg8 mreg8 SETNP SETPO mreg8 SBB reg8 mem8 SETNP SETPO 8 224 DirectPath Instructions AMDA 22007E 0 November 1999 Table 25 DirectPath Integer Instructions Continued AMD Athlon Processor x86 Code Optimization Instruction Mnemonic Instruction Mnemonic SETL SETNGE mreg8 SUB mem reg8 SETL SETNGE mem8 SUB mreg16 32 reg16 32 SETGE SETNL mreg8 SUB mem16 32 reg16 32 SETGE SETNL mem8 SUB reg8 mreg8 SETLE SETNG mreg8 SUB reg8 mem8 SETLE SETNG mem8 SUB reg16 32 mreg16 32 SETG SETNLE mreg8 SUB reg16 32 mem16 32 SETG SETNLE mem8 SUB AL imm8 SHL SAL mreg8 imm8 SUB EAX imm16 32 SHL SAL mem8 imm8 SUB mreg8 imm8 SHL SAL mreg16 32 imm8 SUB imm8 SHL SAL mem16 32 imm8 SUB mreg16 32 imm16 32 SHL SAL mregg 1 SUB mem16 32 imm16 32 SHL SAL mem8 1 SUB mreg16 32 imm8 sign extended SHL SAL mreg16 32 1 SUB mem16 32 imm8 sign extended SHL SAL mem16 32 1 TEST mreg8 reg8 SHL SAL mreg8 CL TEST mem reg8 SHL SAL memg8 CL TEST mreg16 32 reg16 32 SHL SAL mreg16 32 CL TEST mem16 32 reg16 32 SHL SAL mem16 32 CL TEST AL imm8 SHR mreg8 imm8 TEST EAX imm16 32 SHR imm8 TEST mreg8 imm8 SHR mreg16 32 imm8 TEST me
114. Fh 4Eh mm xxx xxx DirectPath CMOVNE CMOVNZ reg16 32 reg16 32 OFh 45h 11 xxx xxx DirectPath CMOVNE CMOVNZ reg16 32 mem16 32 OFh 45h mm xxx xxx DirectPath CMOVNO reg16 32 reg16 32 OFh 41h 11 xxx xxx DirectPath CMOVNO reg16 32 mem16 32 OFh 41h mm xxx xxx DirectPath CMOVNP CMOVPO reg16 32 reg16 32 OFh 4Bh 11 xxx xxx DirectPath CMOVNP CMOVPO reg16 32 mem16 32 OFh 4Bh mm xxx xxx DirectPath CMOVNS reg16 32 reg16 32 OFh 49h 11 xxx xxx DirectPath CMOVNS 16 32 mem16 32 OFh 49h mm xxx xxx DirectPath CMOVO reg16 32 16 32 OFh 40h 11 xxx xxx DirectPath CMOVO 16 32 mem16 32 OFh 40h mm xxx xxx DirectPath CMOVP CMOVPE reg16 32 reg16 32 OFh 4Ah 11 xxx xxx DirectPath CMOVP CMOVPE reg16 32 mem16 32 OFh 4Ah mm xxx xxx DirectPath CMOVS 16 32 16 32 OFh 48h 11 xxx xxx DirectPath CMOVS 16 32 mem16 32 OFh 48h mm xxx xxx DirectPath CMP mreg8 reg8 38h 11 xxx xxx DirectPath CMP mem8 reg8 38h mm xxx xxx DirectPath CMP mreg16 32 reg16 32 39h 11 xxx xxx DirectPath CMP mem16 32 reg16 32 39h mm xxx xxx DirectPath CMP reg8 mreg8 3Ah 11 xxx xxx DirectPath CMP reg8 mem8 3Ah mm xxx xxx DirectPath CMP reg16 32 mreg16 32 3Bh 11 xxx xxx DirectPath CMP reg16 32 mem16 32 3Bh mm xxx xxx DirectPath CMP AL imm8 3Ch DirectPath 192 Instruction Dispatch and Execution Resources AMDA 22007E 0 November 1999 Table 19 Integer Instructions
115. Floating Point to Integer Conversions C C and Fortran define floating point to integer conversions as truncating This creates a problem because the active rounding mode in an application is typically round to nearest even The classical way to do a double to int conversion therefore works as follows Example 1 Fast SUB I EDX trunc X rndint X correction FLD QWORD PTR X load double to be converted FSTCW SAVE CW save current FPU control word MOVZX WORD PTR SAVE CW retrieve control word OR EAX OCOOh rounding control field truncate MOV WORD PTR NEW CW AX new FPU control word FLDCW load new FPU control word FISTP DWORD PTR I do double gt int conversion FLDCW SAVE CW restore original control word The AMD Athlon processor contains special acceleration hardware to execute such code as quickly as possible In most situations the above code is therefore the fastest way to perform floating point to integer conversion and the conversion 1s compliant both with programming language standards and the IEEE 754 standard According to the recommendations for inlining see Always Inline Functions with Fewer than 25 Machine Instructions on page 72 the above code should not be put into a separate subroutine e g ftol It should rather be inlined into the main code In some codes floating point numbers are converted to an integer and the result is immediately converted back to
116. LL reduce range in range FSINCOS FSTP QWORD PTR cosine x FSTP QWORD PTR sine x Take Advantage of the FSINCOS Instruction 105 AMDA AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 106 Take Advantage of the FSINCOS Instruction AMDA 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization 3DNow and MMX Optimizations This chapter describes 3DNow and MMX code optimization techniques for the AMD Athlon processor Guidelines are listed in order of importance 3DNow porting guidelines can be found in the 3DNow M Instruction Porting Guide order 22621 Use 3DNow Instructions Unless accuracy requirements dictate otherwise perform floating point computations using the 3DNow instructions instead of x87 instructions The SIMD nature of 3DNow achieves twice the number of FLOPs that are achieved through x87 instructions 3DNow instructions provide for a flat register file instead of the stack based approach of x87 instructions See the 3DNow Technology Manual order 21928 for information on instruction usage Use FEMMS Instruction Though there is no penalty for switching between x87 FPU and 3DNow MMX instructions in the AMD Athlon processor the FEMMS instruction should be used to ensure the same code also runs optimally on AMD K6 family processors The Use 3DNow Instructions 107 AMD Athlon Processor
117. LOW ABOVE BEHIND LEFT RIGHT BEFORE MOVQ MM1 MM2 BELOW ABOVE BEHIND LEFT RIGHT BEFORE PUNPCKHDQ MM2 MM2 BELOW ABOVE BEHIND BELOW ABOVE BEHIND POR MM2 1 zclip yclip xclip clip code Use 3DNow PAVGUSB for MPEG 2 Motion Compensation Use the 3DNow PAVGUSB instruction for MPEG 2 motion compensation The PAVGUSB instruction produces the rounded averages of the eight unsigned 8 bit integer values in the source operand a MMX register or a 64 bit memory location and the eight corresponding unsigned 8 bit integer values in the destination operand a MMX register The PAVGUSB instruction is extremely useful in DVD MPEG 2 decoding where motion compensation performs a lot of byte averaging between and within macroblocks The PAVGUSB instruction helps speed up these operations In addition PAVGUSB can free up some registers and make unrolling the averaging loops possible The following code fragment uses original MMX code to perform averaging between the source macroblock and destination macroblock Use 3DNow PAVGUSB for MPEG 2 Motion Compensation 123 AMDA AMD Athlon Processor x86 Code Optimization Example 1 Avoid MOV ESI DWORD PTR MOV EDI DWORD PTR MOV EDX DWORD PTR MOV EBX DWORD PTR MOVQ MM7 QWORD PTR MOVQ MM6 QWORD PTR MOV ECX 16 MOVQ MMO LEST MOVQ MM1 EDI MOVQ MM2 MMO MOVQ MM3 MM1 PAND MM2 MM6 PAND MM3 MM6 PAND MMO MM7 PAN
118. MD Athlon processor offers true next-generation performance with x86 binary software compatibility.

Top Optimizations

This chapter contains concise descriptions of the best optimizations for improving the performance of the AMD Athlon processor. Subsequent chapters contain more detailed descriptions of these and other optimizations. The optimizations in this chapter are divided into two groups and listed in order of importance.

Group I: Essential Optimizations. Group I contains essential optimizations. Users should follow these critical guidelines closely. The optimizations in Group I are as follows:
■ Memory Size and Alignment Issues: avoid memory size mismatches; align data where possible
■ Use the 3DNow PREFETCH and PREFETCHW Instructions
■ Select DirectPath Over VectorPath Instructions

Group II: Secondary Optimizations. Group II contains secondary optimizations that can significantly improve the performance of the AMD Athlon processor. The optimizations in Group II are as follows:
■ Load-Execute Instruction Usage: use load-execute instructions; avoid load-execute floating-point instructions with integer operands
■ Take Advantage of Write Combining
■ Use 3DNow Instructions
■ Avoid Branches Dependent on Random Data
119.
    MOV   EAX, m
    MUL   EDX
    SHR   EDX, s         ;EDX = quotient

    ;algorithm 1
    MOV   EDX, dividend
    MOV   EAX, m
    MUL   EDX
    ADD   EAX, m
    ADC   EDX, 0
    SHR   EDX, s         ;EDX = quotient

The derivation of the algorithm (a), multiplier (m), and shift count (s) is found in the section Unsigned Derivation for Algorithm, Multiplier, and Shift Factor on page 93.

For divisors 2^31 <= d < 2^32, the possible quotient values are either 0 or 1. This makes it easy to establish the quotient by simple comparison of the dividend and divisor. In cases where the dividend needs to be preserved, example 1 below is recommended.

Example 1:

    ;In:  EAX = dividend
    ;Out: EDX = quotient
    XOR   EDX, EDX       ;0
    CMP   EAX, d         ;CF = (dividend < divisor) ? 1 : 0
    SBB   EDX, -1        ;quotient = 0 + 1 - CF = (dividend < divisor) ? 0 : 1

In cases where the dividend does not need to be preserved, the division can be accomplished without the use of an additional register, thus reducing register pressure. This is shown in example 2 below.

Example 2:

    ;In:  EDX = dividend
    ;Out: EAX = quotient
    CMP   EDX, d         ;CF = (dividend < divisor) ? 1 : 0
    MOV   EAX, 0         ;0
    SBB   EAX, -1        ;quotient = 0 + 1 - CF = (dividend < divisor) ? 0 : 1

Simpler Code for Restricted Dividend

Integer division by a constant can be made faster if the range of the dividend is limited, which removes a shift associated with mo
120. MTRR Mechanism

Figure 12. MTRR Mapping of Physical Memory: from 0 to 80000h (512 Kbytes), 8 fixed ranges of 64 Kbytes each; from 80000h to C0000h (256 Kbytes), 16 fixed ranges of 16 Kbytes each; from C0000h to 100000h (256 Kbytes), 64 fixed ranges of 4 Kbytes each; from 100000h to FFFFFFFFh, 8 variable ranges and SMM TSeg.

Memory Types

Five standard memory types are defined by the AMD Athlon processor: writethrough (WT), writeback (WB), write-protect (WP), write-combining (WC), and uncacheable (UC). These are described in Table 12 on page 174.

Table 12. Memory Type Encodings

Type Number  Type Name               Type Description
00h          UC (Uncacheable)        Uncacheable for reads or writes. Cannot be combined. Must be non-speculative for reads or writes.
01h          WC (Write-Combining)    Uncacheable for reads or writes. Can be combined. Can be speculative for reads. Writes can never be speculative.
04h          WT (Writethrough)       Reads allocate on a miss, but only to the S state. Writes do not allocate on a miss, and for a hit, writes update the cached entry and main memory.
05h          WP (Write-Protect)      WP is functionally the same as the WT memory type, except stores do not actually modify cached data and do not cause an exception. Reads wil
121. P instruction can be used to jump across the padding region Note that each of the instructions and instruction sequences below utilizes an x86 register To avoid performance degradation the register used in the padding should be selected so as to not lengthen existing dependency chains i e one should select a register that is not used by instructions in the vicinity of the neutral code filler Note that certain instructions use registers implicitly For example PUSH POP CALL and RET all make implicit use of the ESP register The 5 byte filler sequence below consists of two instructions If flag changes across the code padding are acceptable the following instructions may be used as single instruction 5 byte code fillers m TEST EAX OFFFF0000h m CMP EAX OFFFF0000h The following assembly language macros show the recommended neutral code fillers for code optimized for the AMD Athlon processor that also has to run well on other x86 processors Note for some padding lengths versions using ESP or EBP are missing due to the lack of fully generalized addressing modes 2 TEXTEQU DB O8Bh 0COh gt mov eax eax 2 EBX TEXTEQU DB O8Bh 0DBh gt mov OP2_ECX TEXTEQU DB O8Bh 0C9h gt mov ecx ecx OP2_EDX TEXTEQU DB 08Bh 0D2h mov edx edx 2 ESI TEXTEQU DB 08Bh 0F6h mov esi esi 2 EDI TEXTEQU DB O8Bh 0FFh gt mov edi edi 2 ESP TEXTEQU DB O8Bh 0E4h gt mov esp
122. Path Instruction Dispatch and Execution Resources 199 AMD AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 Table 19 Integer Instructions Continued Instruction Mnemonic b in pr M gi POP EBX 5Bh VectorPath POP ESP 5Ch VectorPath POP EBP 5Dh VectorPath POP ESI 5Eh VectorPath POP EDI 5Fh VectorPath POP mreg 16 32 8Fh 11 000 xxx VectorPath POP mem 16 32 8Fh mm 000 xxx VectorPath POPA POPAD 61h VectorPath POPF POPFD 9Dh VectorPath PUSH ES 06h VectorPath PUSH CS OEh VectorPath PUSH FS OFh VectorPath PUSH GS OFh A8h VectorPath PUSH SS 16h VectorPath PUSH DS 1Eh VectorPath PUSH EAX 50h DirectPath PUSH ECX 51h DirectPath PUSH EDX 52h DirectPath PUSH EBX 53h DirectPath PUSH ESP 54h DirectPath PUSH EBP 55h DirectPath PUSH ESI 56h DirectPath PUSH EDI 57h DirectPath PUSH imm8 6Ah DirectPath PUSH imm16 32 68h DirectPath PUSH mreg16 32 FFh 11 110 xxx VectorPath PUSH mem16 32 FFh mm 110 xxx VectorPath PUSHA PUSHAD 60h VectorPath PUSHF PUSHFD 9Ch VectorPath RCL mreg8 imm8 11 010 DirectPath RCL mem8 imm8 mm 010 xxx VectorPath RCL mreg16 32 imm8 Cih 11 010 xxx DirectPath RCL mem16 32 imm8 Cih mm 010 xxx VectorPath 200 Instruction Dispatch and Execution Resources AMDA 22007E 0 November 1999 Table 19 Integer Ins
123. Pointer Arithmetic 73 Partial Loop 68 REP String with Low Variable Counts 85 Unroll Small 18 Unrolling Loops si eera eR er eee 67 22007E 0 November 1999 MOVZX and MOVSX 73 MSR ACCESS suu uw i Ir d Seed a EIS 177 Multiplication Alternative Code When Multiplying by a Constant 81 erdt eid 119 Multiplies over Divides Floating Point 97 Muxing 8 60 Newton Raphson 109 Newton Raphson Reciprocal Square Root 111 522 LESS VES Ges y 148 Largest Possible Operand Size Repeated String 84 Optimization Star seam lr RS tae 8 Page Attribute Table 171 177 178 Parallelism D nae ERA ER EE 25 MSR thee Anea a PE es TRUE 167 PerfEvtSel 8 162 Performance Monitoring Counters 161 168 169 Pipeline and Execution Unit Resources Overview 141 Pointers De referenced 1 31 Use Array Style Code Instead 15 Population Count 91 4 y
124. Reciprocal Division Utility

The code for the utilities can be found at Derivation of Multiplier Used for Integer Division by Constants on page 93. All utilities were compiled for the Microsoft Windows 95, Windows 98, and Windows NT environments. All utilities are provided "as is" and are not supported by AMD.

Signed Division Utility: In the opt_utilities directory of the AMD documentation CDROM, run sdiv.exe in a DOS window to find the fastest code for signed division by a constant. The utility displays the code after the user enters a signed constant divisor. Type "sdiv > example.out" to output the code to a file.

Unsigned Division Utility: In the opt_utilities directory of the AMD documentation CDROM, run udiv.exe in a DOS window to find the fastest code for unsigned division by a constant. The utility displays the code after the user enters an unsigned constant divisor. Type "udiv > example.out" to output the code to a file.

Unsigned Division by Multiplication of Constant

Algorithm: Divisors 1 < d < 2^31, Odd d / Derivation of a, m, s

The following code shows an unsigned division using a constant value multiplier.

    ;In:  d = divisor, 1 < d < 2^31, odd d
    ;Out: a = algorithm
    ;     m = multiplier
    ;     s = shift factor

    ;algorithm 0
    MOV   EDX, dividend
125. ST i Instruction Mnemonic Cast second odie Decada F P Note Byte Byte Byte Type Pipe s FCMOVB ST 0 ST i DAh CO C7h VectorPath FCMOVE 51 0 ST i DAh C8 CFh VectorPath FCMOVBE ST 0 ST i DAh DO D7h VectorPath FCMOVU ST 0 ST i DAh D8 DFh VectorPath FCMOVNB ST 0 ST i DBh CO C7h VectorPath FCMOVNE ST 0 ST i DBh C8 CFh VectorPath FCMOVNBE ST 0 ST i DBh 00 071 VectorPath FCMOVNU ST 0 ST i DBh D8 DFh VectorPath FCOM ST i D8h 11 010 xxx DirectPath FADD 1 FCOMP ST i D8h 11 011 xxx DirectPath FADD 1 FCOM mem32real D8h mm 010 xxx DirectPath FADD FCOM mem 64real DCh mm 010 xxx DirectPath FADD FCOMI ST ST i DBh FO F7h VectorPath FADD FCOMIP ST ST i DFh FO F7h VectorPath FADD FCOMP mem32real D8h mm 011 xxx DirectPath FADD mem 64real DCh mm 011 xxx DirectPath FADD FCOMPP DEh D9h 11 011 001 DirectPath FADD FCOS FFh VectorPath FDECSTP D9h Feh DirectPath FADD FMUL FSTORE FDIV ST ST i D8h 11 110 xxx DirectPath FMUL 1 FDIV ST i ST DCh 11 111 xxx DirectPath FMUL 1 FDIV mem32real D8h mm 110 xxx DirectPath FMUL FDIV mem64real DCh mm 110 xxx DirectPath FMUL FDIVP ST ST i DEh 11 111 xxx DirectPath FMUL 1 FDIVR ST ST i D8h 11 110 xxx DirectPath FMUL 1 FDIVR ST i ST DCh 11 111 xxx DirectPath FMUL 1 FDIVR mem32real D8h mm 111 xxx DirectPath FMUL FDIVR mem64real DCh mm 111 xxx DirectPath FMUL F
126. Sample 1 Integer Register Operations Instruction Decode Decode Clocks Number Instruction Pipe Type 2 3 4 5 6 7 8 1 IMUL EAX ECX 0 VP D 2 INC ESI 0 DP D 3 MOV EDI 0x07F4 1 DP D 4 ADD EDI EBX 2 DP D 5 SHL EAX 8 0 DP D 6 EAX 0x0F 1 DP D 7 INC EBX 2 DP D 8 ADD ESI EDX 0 DP D Be ON Comments for Each Instruction Number 1 The IMUL is VectorPath instruction It cannot be decode or paired with other operations and therefore dispatches alone in pipe 0 The multiply latency is four cycles The simple INC operation is paired with instructions 3 and 4 The INC executes in IEUO in cycle 4 The MOV executes in IEU1 in cycle 4 The ADD operation depends on instruction 3 It executes in IEU2 in cycle 5 The SHL operation depends on the multiply result instruction 1 The MacroOP waits in a reservation station and is eventually scheduled to execute in cycle 7 after the multiply result is available This operation executes in cycle 8 in IEU1 This simple operation has a resource contention for execution in IEU2 in cycle 5 Therefore the operation does not execute until cycle 6 The ADD operation executes immediately in IEUO after dispatching AMD Athlon Processor x86 Code Optimization Execution Unit Resources 153 AMDA AMD Athlon Proces
127. Use signed types for:
■ Integer-to-float conversion

Use Array Style Instead of Pointer Style Code

The use of pointers in C makes work difficult for the optimizers in C compilers. Without detailed and aggressive pointer analysis, the compiler has to assume that writes through a pointer can write to any place in memory. This includes storage allocated to other variables, creating the issue of aliasing, i.e., the same block of memory is accessible in more than one way.

In order to help the optimizer of the C compiler in its analysis, avoid the use of pointers where possible. One example where this is trivially possible is in the access of data organized as arrays. C allows the use of either the array operator [] or pointers to access the array. Using array-style code makes the task of the optimizer easier by reducing possible aliasing. For example, x[0] and x[2] cannot possibly refer to the same memory location, while *p and *q could. It is highly recommended to use the array style, as significant performance advantages can be achieved with most compilers.

Note that source code transformations will interact with a compiler's code generator, and that it is difficult to control the generated machine code from the source level. It is even possible that source code transformations for improving perfor
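To make the aliasing point concrete, the short sketch below contrasts the two styles on a simple element copy. The function names and the copy loop are illustrative, not taken from the guide.

    /* Pointer style: the compiler must assume that the writes through dst
       could overlap the reads through src, which can block optimizations. */
    void copy_ptr(int *dst, const int *src, int len)
    {
        while (len--)
            *dst++ = *src++;
    }

    /* Array style: dst[i] and src[i] are expressed with the array operator,
       which the guide recommends to ease the optimizer's aliasing analysis. */
    void copy_arr(int dst[], const int src[], int len)
    {
        int i;
        for (i = 0; i < len; i++)
            dst[i] = src[i];
    }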
128. Using MMX instructions relieves register pressure on the integer registers. As long as data is simply loaded, stored, added, shifted, etc., MMX instructions are good substitutes for integer instructions. Integer registers are freed up, with the following results:
■ May be able to reduce the number of integer registers to be saved/restored on function entry/exit.
■ Free up integer registers for pointers, loop counters, etc., so that they do not have to be spilled to memory, which reduces memory traffic and latency in dependency chains.

Be careful with regard to passing data between MMX and integer registers and with creating mismatched store-to-load forwarding cases. See Unrolling Loops on page 67.

In addition, using MMX instructions increases the available parallelism. The AMD Athlon processor can issue three integer OPs and two MMX OPs per cycle.

Repeated String Instruction Usage

Latency of Repeated String Instructions

Table 1 shows the latency for repeated string instructions on the AMD Athlon processor.

Table 1. Latency of Repeated String Instructions (Instruction / ECX = 0 cycles / DF = 0 cycles / DF = 1 cycles):
REP MOVS 11 15 4 5 25 4 5
REP STOS 1 14 24 1
REP LODS 15 2 15 2 0
REP SCAS 15 5 2 c 15 5 2 c
REP CMPS
129. VNG reg16 32 mem16 32 DEC EDX CMOVNE CMOVNZ reg16 32 reg16 32 DEC EBX CMOVNE CMOVNZ reg16 32 mem16 32 DEC ESP CMOVNO regi6 32 reg16 32 DEC EBP CMOVNO reg16 32 mem16 32 DEC ESI CMOVNP CMOVPO reg16 32 reg16 32 DEC EDI CMOVNP CMOVPO reg16 32 mem 16 32 DEC mreg8 CMOVNS reg16 32 reg16 32 DEC 8 CMOVNS reg16 32 mem16 32 DEC mreg16 32 CMOVO reg16 32 reg16 32 DEC mem16 32 CMOVO 16 32 mem16 32 INC EAX CMOVP CMOVPE reg16 32 reg16 32 INC ECX CMOVP CMOVPE reg16 32 mem16 32 INC EDX CMOVS 16 32 16 32 INC CMOVS 16 32 mem16 32 INC ESP mreg8 reg8 INC EBP reg8 INC ESI mreg16 32 reg16 32 INC EDI CMP mem16 32 reg16 32 INC mreg8 CMP reg8 mreg8 INC mem8 CMP reg8 mem8 INC mreg16 32 CMP reg16 32 mreg16 32 INC mem16 32 CMP 16 32 mem16 32 JO short disp8 DirectPath Instructions 221 AMDA AMD Athlon Processor x86 Code Optimization Table 25 DirectPath Integer Instructions Continued 22007E 0 November 1999 Instruction Mnemonic Instruction Mnemonic JNO short disp8 JMP near mreg16 32 indirect JB JNAE short disp8 JMP near 16 32 indirect JNB JAE short disp8 LEA reg32 mem16 32 JZ JE short disp8 MOV mregg reg8
130. VectorPath STI FBh VectorPath STOSB 8 AL AAh VectorPath STOSW mem16 ABh VectorPath STOSD mem32 EAX ABh VectorPath STR mreg16 OFh ooh 11 001 xxx VectorPath STR mem16 OFh ooh mm 001 xxx VectorPath SUB mreg8 reg8 28h 11 xxx xxx DirectPath SUB mem8 reg8 28h mm xxx xxx DirectPath SUB mreg16 32 reg16 32 29h 11 xxx xxx DirectPath SUB 16 32 reg16 32 29h mm xxx xxx DirectPath Instruction Dispatch and Execution Resources 205 AMD 1 AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 Table 19 Integer Instructions Continued Instruction Mnemonic prin ge M pron SUB reg8 mreg8 2Ah 11 xxx xxx DirectPath SUB reg8 mem8 2Ah mm xxx xxx DirectPath SUB reg16 32 mreg16 32 2Bh 11 xxx xxx DirectPath SUB reg16 32 mem16 32 2Bh mm xxx xxx DirectPath SUB AL imm8 2Ch DirectPath SUB EAX imm16 32 2Dh DirectPath SUB mreg8 imm8 80h 11 101 xxx DirectPath SUB mem8 imm8 80h mm 101 xxx DirectPath SUB mreg16 32 imm16 32 81h 11 101 DirectPath SUB mem 16 32 imm16 32 81h mm 101 xxx DirectPath SUB mreg16 32 imm8 sign extended 85h 11 101 xxx DirectPath SUB mem16 32 imm8 sign extended 83h mm 101 xxx DirectPath SYSCALL OFh 05h VectorPath SYSENTER OFh 34h VectorPath SYSEXIT OFh 35h VectorPath SYSRET OFh 071 VectorPath TEST mreg8 reg8 84h 11 xxx xxx
131. X 2          ;word load cannot forward high word from store buffer

Use example 5 instead of example 4.

Example 4 (avoid):

    MOVQ  [foo], MM1        ;store upper and lower half
    ADD   EAX, [foo]        ;fine
    ADD   EDX, [foo+4]      ;uh-oh!

Example 5 (preferred):

    MOVD      [foo], MM1    ;store lower half
    PUNPCKHDQ MM1, MM1      ;get upper half into lower half
    MOVD      [foo+4], MM1  ;store lower half
    ADD       EAX, [foo]    ;fine
    ADD       EDX, [foo+4]  ;fine

Misaligned Store Buffer Data Forwarding Restriction

If the following condition is present, there is a misaligned store buffer data forwarding restriction:
■ The store or load address is misaligned. For example, a quadword store is not aligned to a quadword boundary, a doubleword store is not aligned to a doubleword boundary, etc.

A common case of misaligned store data forwarding involves the passing of misaligned quadword floating-point data on the doubleword-aligned integer stack. Avoid the type of code shown in the following example.

Example 6 (avoid):

    MOV   ESP, 24h
    FSTP  QWORD PTR [ESP]   ;store occurs to quadword misaligned address
    FLD   QWORD PTR [ESP]   ;quadword load cannot forward from quadword
                            ; misaligned 'FSTP [ESP]' store OP

High Byte Store Buffer Data Forwarding Restriction

If the following condition is present, there is a high-byte store buffer data forwardi
132. X EFlags ullrem PROC PUSH EBX save EBX as per calling convention MOV ECX LESP 20 divisor hi MOV EBX LESP 16 divisor lo MOV EDX LESP 12 dividend hi MOV 8 dividend 10 TEST ECX ECX divisor gt 2 32 1 JNZ r big divisor yes divisor gt 32 32 1 CMP EDX EBX only one division needed ECX 0 JAE r two divs heed two divisions DIV EBX EAX quotient 10 MOV EAX EDX EAX remainder 0 MOV EDX ECX EDX remainder hi 0 POP EBX restore EBX as per calling convention RET done return to caller Efficient 64 Bit Integer Arithmetic 89 AMDA AMD Athlon Processor x86 Code Optimization VS CX AX DX BX AX BX AX DX BX DI DI DX gt gt lt gt DI BX gt lt BX 00 gt lt gt gt lt C2 UJ 4 lt gt lt gt gt EAX EDX EDX ECX EDX EDX ECX 1 1 1 1 EDI EDX GL 1 ES EAX EAX PT EDI EAX ES ES EDX EDX EDX LES EBX ECX r two di MOV E MOV E XOR E DIV E MOV E DIV E MOV E XOR E POP E RET r big divisor PUSH E MOV E SHR E RCR E ROR E RCR E BSR E SHRD E SHRD E SHR E ROL E DIV E MOV E MOV E IMUL E MUL D ADD E SUB E MOV E MOV E SBB E SBB E AND E AND E ADD E ADC E POP E POP E RET _ullrem E save d get di zero e SEAX remai SEAX SEAX EAX EDX restor done 5 5
133. a esi esi 00 nop OP5 ESI TEXTEQU DB O8Dh 064h 024h 000h 090h gt lea edi 001 OP5 EDI TEXTEQU DB 08Dh 074h 026h 000h 090h lea esp esp 00 nop OP5 ESP TEXTEQU DB 08Dh 07Ch 027h 000h 090h lea eax 00000000 68 P6 TEXTEQU DB 08Dh 080h 0 0 0 0 ES lea Lebx 00000000 OP6 EBX TEXTEQU DB 08Dh 09Bh 0 0 0 0 lea ecx 00000000 66 ECX TEXTEQU DB 08Dh 089h 0 0 0 0 lea edx 00000000 60 OP6 EDX TEXTEQU DB 08Dh 092h 0 0 0 0 ea esi esi 00000000 P6 ESI TEXTEQU DB 08Dh 0B6h 0 0 0 0 ES 42 Code Padding Using Neutral Code Fillers AMDA 22007E 0 November 1999 2 0 lea ebp Led TEXT ea edi P6_EDI P6 EBP TEXT lea P7_EAX TEXT lea ebx Lebx P7 EBX TEXT ea ecx ecx P7 ECX TEXT lea edx Ledx P7 EDX TEXT lea P7 ESI TEXT lea edi edi P _EDI TEXT ea ebp ebp P7 EBP TEXT lea P8 EAX TEXT lea ebx Lebx P8 EBX TEXT lea P8 ECX TEXT ea edx edx P8 EDX TEXT lea P8 ESI TEXT lea edi Ledi P8 EDI TEXT lea ebp ebp P8 TEXT JMP P9 TEXTEQU AMD Athlon Processor x86 Code Optimization 1 00000000 EQU DB 08Dh p 00000000 EQU
134. ache for one cyde by instruction 5 In cycles 7 and 8 instruction 7 accesses the data cache concurrently with instruction 5 The load execute instruction accesses the data cache in cycles 10 11 and executes the OR operation in IEU1 in cycle 12 154 Execution Unit Resources AMDA 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization Appendix C Implementation of Write Combining Introduction This appendix describes the memory write combining feature as implemented in the AMD Athlon processor family The AMD Athlon processor supports the memory type and range register MTRR and the page attribute table PAT extensions which allow software to define ranges of memory as either writeback WB write protected WP writethrough WT uncacheable UC or write combining WC Defining the memory type for a range of memory as WC or WT allows the processor to conditionally combine data from multiple write cycles that are addressed within this range into a merge buffer Merging multiple write cycles into a single write cycle reduces processor bus utilization and processor stalls thereby increasing the overall system performance To understand the information presented in this appendix the reader should possess a knowledge of K86 processors the x86 architecture and programming requirements Introduction 155 AMD Athlon Processor x86 Code Optimization 22007E
135. address calculations of load and store instructions m Data register operands Used for register instructions m Store data register operands Used for memory stores The two types of results are as follows m Data register results Produced by load or register instructions m Address register results Produced by LEA or PUSH instructions The following examples illustrate the operand and result definitions ADD EAX EBX The ADD instruction has two data register operands EAX and EBX and one data register result EA X MOV EBX LESP 4 ECX 8 Load The Load instruction has two address register operands ESP and ECX as base and index registers respectively and a data register result EBX MOV ESP A ECX 8 EAX Store The Store instruction has a data register operand EAX and two address register operands ESP and ECX as base and index registers respectively LEA ESI LESP 4 ECX 8 The LEA instruction has address register operands ESP and ECX as base and index registers respectively and an address register result ESI 148 Execution Unit Resources AMDA 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization Integer Pipeline Operations Table 2 shows the category or type of operations handled by the integer pipeline Table 3 shows examples of the decode type Table2 Integer Pipeline Operation Types Category Execution Unit Integer Memory Load or Stor
136. agram Instruction Cache The out of order execute engine of the AMD Athlon processor contains a very large 64 Kbyte L1 instruction cache The L1 instruction cache is organized as a 64 Kbyte two way set associative array Each line in the instruction array is 64 bytes long Functions associated with the L1 instruction cache are instruction loads instruction prefetching instruction predecoding and branch prediction Requests that miss in the L1 instruction cache are fetched from the backside L2 cache or subsequently from the local memory using the bus interface unit BIU The instruction cache generates fetches on the naturally aligned 64 bytes containing the instructions and the next sequential line of 64 bytes a prefetch The principal of program spatial locality makes data prefetching very effective and avoids or reduces execution stalls due to the amount of time wasted reading the necessary data Cache line AMD Athlon Processor Microarchitecture I31 AMDA AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 Predecode Branch Prediction replacement is based on a least recently used LRU replacement algorithm The L1 instruction cache has an associated two level translation look aside buffer TLB structure The first level TLB is fully associative and contains 24 entries 16 that map 4 Kbyte pages and eight that map 2 Mbyte or 4 Mbyte pages The second level TLB is four way set associative
137. and contains 256 entries which can map 4 Kbyte pages Predecoding begins as the L1 instruction cache is filled Predecode information is generated and stored alongside the instruction cache This information is used to help efficiently identify the boundaries between variable length x86 instructions to distinguish DirectPath from VectorPath early decode instructions and to locate the opcode byte in each instruction In addition the predecode logic detects code branches such as CALLs RETURNSs and short unconditional JMPs When a branch is detected predecoding begins at the target of the branch The fetch logic accesses the branch prediction table in parallel with the instruction cache and uses the information stored in the branch prediction table to predict the direction of branch instructions The AMD Athlon processor employs combinations of a branch target address buffer BTB a global history bimodal counter GHBC table and a return address stack RAS hardware in order to predict and accelerate branches Predicted taken branches incur only a single cycle delay to redirect the instruction fetcher to the target instruction In the event of a mispredict the minimum penalty is ten cycles The BTB is a 2048 entry table that caches in each entry the predicted target address of a branch In addition the AMD Athlon processor implements a 12 entry return address stack to predict return addresses from a near or far call As CALLs are f
138. as a latency of five cycles. Therefore, use alternative code when multiplying by certain constants. In addition, because there is just one multiply unit, the replacement code may provide better throughput.

The following code samples are designed such that the original source register also receives the final result. Other sequences are possible if the result is in a different register. Adds have been favored over shifts to keep code size small. Generally, there is a fast replacement if the constant has very few 1 bits in binary. More constants are found in the file multiply, located in the same directory as this document in the SDK.

by 2:    ADD REG1, REG1            ;1 cycle

by 3:    LEA REG1, [REG1*2+REG1]   ;2 cycles

by 4:    SHL REG1, 2               ;1 cycle

by 5:    LEA REG1, [REG1*4+REG1]   ;2 cycles

by 6:    LEA REG2, [REG1*4+REG1]   ;3 cycles
         ADD REG1, REG2

by 7:    MOV REG2, REG1            ;2 cycles
         SHL REG1, 3
         SUB REG1, REG2

by 8:    SHL REG1, 3               ;1 cycle

by 9:    LEA REG1, [REG1*8+REG1]   ;2 cycles

by 10:   LEA REG2, [REG1*8+REG1]   ;3 cycles
         ADD REG1, REG2

by 11:   LEA REG2, [REG1*8+REG1]   ;3 cycles
         ADD REG1, REG1
         ADD REG1, REG2

by 12:   SHL REG1, 2               ;3 cycles
         LEA REG1, [REG1*2+REG1]

by 13:   LEA REG2, [REG1*2+REG1]   ;3 cycles
         SHL REG1, 4
         SUB REG1, REG2

by 14:   LEA REG2, [REG1*4+REG1]   ;3 cycles
         LEA R
139. ation. However, loading the data from memory runs the risk of cache misses. Cases where MOVQ is superior to PCMPEQD are therefore rare, and PCMPEQD should be used in general.

Use MMX PAND to Find Absolute Value in 3DNow Code

Use the following code to compute the absolute value of 3DNow floating-point operands:

    mabs  DQ 7FFFFFFF7FFFFFFFh
    PAND  MM0, [mabs]        ;mask out sign bit

Optimized Matrix Multiplication

The multiplication of a 4x4 matrix with a 4x1 vector is commonly used in 3D graphics for geometry transformation. This routine serves to translate, scale, rotate, and apply perspective to 3D coordinates represented in homogeneous coordinates. The following code sample is a 3DNow optimized, general 3D vertex transformation routine that completes in 16 cycles on the AMD Athlon processor.

Function XForm performs a fully generalized 3D transform on an array of vertices pointed to by v and stores the transformed vertices in the location pointed to by res. The 4x4 transform matrix is pointed to by m. The argument numverts indicates how many vertices have to be transformed. Each vertex consists of four floats; the matrix elements are also floats. The computation performed for each vertex is: res.x = v.x*m[0][0] + v.y*m
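Before looking at the 3DNow version, it may help to see the computation spelled out in scalar form. The plain-C sketch below is only a reference for what the optimized routine computes; the struct and argument names and the m[row][col] layout are assumptions inferred from the formula above, not the guide's own code.

    typedef struct { float x, y, z, w; } Vertex;

    /* Scalar reference for the 4x4 matrix * 4x1 homogeneous-vector
       transform described above; the 3DNow routine computes the same
       values. Names and m[row][col] layout are illustrative assumptions. */
    void xform_ref(Vertex *res, const Vertex *v, const float m[4][4], int numverts)
    {
        int i;
        for (i = 0; i < numverts; i++) {
            float x = v[i].x, y = v[i].y, z = v[i].z, w = v[i].w;
            res[i].x = x*m[0][0] + y*m[1][0] + z*m[2][0] + w*m[3][0];
            res[i].y = x*m[0][1] + y*m[1][1] + z*m[2][1] + w*m[3][1];
            res[i].z = x*m[0][2] + y*m[1][2] + z*m[2][2] + w*m[3][2];
            res[i].w = x*m[0][3] + y*m[1][3] + z*m[2][3] + w*m[3][3];
        }
    }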
140. ation allows the use of a fast forwarding mechanism for the FPU condition codes internal to the AMD Athlon processor FPU and increases performance Use the FXCH Instruction Rather than FST FLD Pairs Increase parallelism by breaking up dependency chains or by evaluating multiple dependency chains simultaneously by explicitly switching execution between them Although the AMD Athlon processor FPU has a deep scheduler which in most cases can extract sufficient parallelism from existing code long dependency chains can stall the scheduler while issue slots are still available The maximum dependency chain length that the scheduler can absorb is about six 4 cycle instructions To switch execution between dependency chains use of the FXCH instruction is recommended because it has an apparent latency of zero cycles and generates only one OP The AMD Athlon processor FPU contains special hardware to handle up to three FXCH instructions per cycle Using FXCH is preferred over the use of FST FLD pairs even if the FST FLD pair works on a register An FST FLD pair adds two cycles of latency and consists of two OPs Avoid Using Extended Precision Data Store data as either single precision or double precision quantities Loading and storing extended precision data is comparatively slower Use the FXCH Instruction Rather than FST FLD Pairs 99 AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 Minimize
141. ay base addresses in order to cut down on loop overhead. Given the latency of a typical AMD Athlon processor system and expected processor speeds, the following formula should be used to determine the prefetch distance in bytes for a single array:

Prefetch Distance = 200 × (DS/C) bytes

■ Round up to the nearest 64-byte cache line.
■ The number 200 is a constant based upon expected AMD Athlon processor clock frequencies and typical system memory latencies.
■ DS is the data stride in bytes per loop iteration.
■ C is the number of cycles for one loop iteration to execute entirely from the L1 cache.

The prefetch distance for multiple arrays is typically even longer. The PREFETCH and PREFETCHW instructions can be affected by false dependencies on stores. If there is a store to an address that matches a PREFETCH or PREFETCHW request, that request may be blocked until the store is written to the cache. Therefore, code should prefetch data that is located at least 64 bytes away from any surrounding store's data address.

Take Advantage of Write Combining

Operating system and device driver programmers should take advantage of the write combining capabilities of the AMD Athlon processor. The AMD Athlon processor has a very aggressive write combining algorithm, which improves performance
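To make the arithmetic concrete, suppose a loop walks one array with a data stride of DS = 16 bytes per iteration and runs in roughly C = 5 cycles per iteration out of the L1 cache; the formula gives 200 × (16/5) = 640 bytes, which is already a multiple of the 64-byte cache line. The fragment below is only a sketch under those assumptions (the loop body, register choices, and the iterations/array_ptr operands are illustrative, not taken from this manual):

    ; Sketch: prefetch 640 bytes ahead of the data currently being consumed.
        MOV     ECX, iterations      ; hypothetical iteration count
        MOV     ESI, array_ptr       ; hypothetical array base address
    sum_loop:
        PREFETCH [ESI+640]           ; bring in a cache line ~640 bytes ahead
        FADD    QWORD PTR [ESI]      ; consume 16 bytes (DS) per iteration
        FADD    QWORD PTR [ESI+8]
        ADD     ESI, 16
        DEC     ECX
        JNZ     sum_loop

If the loop also wrote to the array, PREFETCHW would be the better choice, since it brings the line in already marked Modified.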
142. ber 1999 Avoid Load Execute Floating Point Instructions with Integer Operands Do not use load execute floating point instructions with integer operands The floating point load execute instructions with integer operands are VectorPath and generate two OPs in a cycle while the discrete equivalent enables a third DirectPath instruction to be decoded in the same cycle Take Advantage of Write Combining This guideline applies only to operating system device driver and BIOS programmers In order to improve system performance the AMD Athlon processor aggressively combines multiple memory write cycles of any data size that address locations within a 64 byte cache line aligned write buffer See Appendix C Implementation of Write Combining on page 155 for more details Use 3DNow Instructions Unless accuracy requirements dictate otherwise perform floating point computations using the 3DNow instructions instead of x87 instructions The SIMD nature of 3DNow instructions achieves twice the number of FLOPs that are achieved through x87 instructions 3DNow instructions also provide for a flat register file instead of the stack based approach of x87 instructions See Table 23 on page 217 for a list of 3DNow instructions For information about instruction usage see the 3DNow Technology Manual order 21928 Avoid Branches Dependent on Random Data Avoid data dependent branches around a single instruction Data dependent
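To illustrate the load-execute guideline above for integer operands, a minimal sketch follows (the memory operand int_val is hypothetical, not from this manual):

    ; Avoid: VectorPath load-execute form with an integer operand
        FIADD   DWORD PTR [int_val]  ; ST(0) = ST(0) + converted integer

    ; Preferred: discrete equivalent
        FILD    DWORD PTR [int_val]  ; load and convert the integer
        FADDP   ST(1), ST            ; add to the accumulator and pop

The discrete form frees a decode slot, allowing a third DirectPath instruction to be decoded in the same cycle.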
143. bit string. For example, this can be used to determine the cardinality of a set. The following example code shows how to efficiently implement a population count operation for 32-bit operands. The example is written for the inline assembler of Microsoft Visual C. Function popcount() implements a branchless computation of the population count. It is based on an O(log n) algorithm that successively groups the bits into groups of 2, 4, 8, 16, and 32, while maintaining a count of the set bits in each group. The algorithm consists of the following steps:

Partition the integer into groups of two bits. Compute the population count for each 2-bit group and store the result in the 2-bit group. This calls for the following transformation to be performed for each 2-bit group:

    00b -> 00b
    01b -> 01b
    10b -> 01b
    11b -> 10b

If the original value of a 2-bit group is v, then the new value will be v - (v >> 1). In order to handle all 2-bit groups simultaneously, it is necessary to mask appropriately to prevent spilling from one bit group to the next lower bit group. Thus:

    v = v - ((v >> 1) & 0x55555555)

Add the population count of adjacent 2-bit groups and store the sum to the 4-bit group resulting from merging these adjacent 2-bit groups. To do this simultaneously to all groups, mask out the odd numbered groups, mask out the even numbered groups, and then add the odd numbered groups to the even numbered groups: & 0x33333
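A compact sketch of the complete sequence implied by these steps is shown below in plain assembly (the register choice is arbitrary; this is not the manual's inline-assembler listing, only the standard mask-and-add population count):

    ; Branchless population count of EAX; result is left in EAX.
        MOV     EDX, EAX
        SHR     EDX, 1
        AND     EDX, 55555555h       ; (v >> 1) & 0x55555555
        SUB     EAX, EDX             ; 2-bit group counts
        MOV     EDX, EAX
        SHR     EDX, 2
        AND     EAX, 33333333h
        AND     EDX, 33333333h
        ADD     EAX, EDX             ; 4-bit group counts
        MOV     EDX, EAX
        SHR     EDX, 4
        ADD     EAX, EDX
        AND     EAX, 0F0F0F0Fh       ; 8-bit group counts
        MOV     EDX, EAX
        SHR     EDX, 8
        ADD     EAX, EDX
        MOV     EDX, EAX
        SHR     EDX, 16
        ADD     EAX, EDX
        AND     EAX, 3Fh             ; final count (0..32)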
144. branches acting upon basically random data can cause the branch prediction logic to mispredict the branch about 50 of the time Design branch free alternative code sequences which results in shorter average execution time See Avoid Branches Dependent on Random Data on page 57 for more details 10 Group II Optimizations Secondary Optimizations AMD 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization Avoid Placing Code and Data in the Same 64 Byte Cache Line Consider that the AMD Athlon processor cache line is twice the size of previous processors Code and data should not be shared in the same 64 byte cache line especially if the data ever becomes modified In order to maintain cache coherency the AMD Athlon processor may thrash its caches resulting in lower performance In general the following should be avoided m Self modifying code m Storing data in code segments See Avoid Placing Code and Data in the Same 64 Byte Cache Line on page 50 for more details Group II Optimizations Secondary Optimizations 11 AMD AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 12 Group II Optimizations Secondary Optimizations 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization C Source Level Optimizations This chapter details C programming practices for optimizing code for the AMD Athlon processor Guidelines are listed in
145. by PCMP the mask needs to be saved which requires an additional register This adds an instruction lengthens the dependency chain and increases register pressure Therefore 2 way muxing constructs should be written as follows 60 Replace Branches with Computation in 3DNow Code AMDA 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization Example 2 Preferred gt in mmO mml b mm2 mm3 y out mml r PCMPGTD MM2 sc a GI PAND 1 MM3 e cue oO PANDN MM3 MMO lt 0 POR MM1 MM3 r y gt x b a 1 Sample Code Translated into 3DNow Code The following examples use scalar code translated into 3DNow code Note that it is not recommended to use 3DNow SIMD instructions for scalar code because the advantage of 3DNow instructions lies in their SIMDness These examples are meant to demonstrate general techniques for translating source code with branches into branchless 3DNow code Scalar source code was chosen to keep the examples simple These techniques work in an identical fashion for vector code Each example shows the C code and the resulting 3DNow code Example 1 C code float x y Z if x gt y f z 1 0 else 2 1 0 3DNow code sin MMO x y 1 MM2 z out MMO 2 MOVQ MM3 MMO 5 x MOVQ MM4 one 1 0 PFCMPGE MMO 1 Xo KY 0 OX TERE ETE PSLLD MMO 31 aX lt 0 0
146. by a load has a true dependency on a LS2 buffered store but cannot read forward data from a store buffer entry Store to Load Forwarding Restrictions 51 AMDA AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 Narrow to Wide Store Buffer Data Forwarding Restriction Wide to Narrow Store Buffer Data Forwarding Restriction If the following conditions are present there is a narrow to wide store buffer data forwarding restriction m The operand size of the store data is smaller than the operand size of the load data m The range of addresses spanned by the store data covers some sub region of range of addresses spanned by the load data Avoid the type of code shown in the following two examples Example 1 Avoid MOV EAX 10h MOV WORD PTR EAX BX sword store MOV ECX DWORD PTR EAX doubleword load cannot forward upper byte from store buffer Example 2 Avoid MOV EAX 10h MOV BYTE PTR EAX 3 BL byte store MOV ECX DWORD PTR EAX doubleword load cannot forward upper byte from store buffer If the following conditions are present there is a wide to narrow store buffer data forwarding restriction m The operand size of the store data is greater than the operand size of the load data m The start address of the store data does not match the start address of the load Example 3 Avoid MOV EAX 10h ADD DWORD PTR EAX EBX doubleword store MOV CX WORD PTR EA
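When such a mismatch cannot simply be avoided, one possible rework of the pattern in Example 1 is sketched below (this is an illustrative rewrite, not a listing from this manual, and it assumes the rest of the destination doubleword must be preserved): merge the narrow data into a register copy of the full doubleword, so that the store and any later load are the same size.

    ; Preferred sketch: same-size store and load allow forwarding.
        MOV     EAX, 10h
        MOV     ECX, DWORD PTR [EAX]   ; read the full doubleword once
        MOV     CX, BX                 ; merge the new low word in the register
        MOV     DWORD PTR [EAX], ECX   ; doubleword store
        MOV     ECX, DWORD PTR [EAX]   ; doubleword load - can now be forwarded

In many cases the reload is unnecessary altogether, since ECX already holds the merged value; keeping the value in a register is the simplest way to sidestep the restriction.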
147. cache XXXX xxxlb Instruction cache bits 15 10 reserved 74h BU xxxx xxlxb L2 single bit error Single bit ECC errors detected corrected xxxlb System single bit error bits 15 12 reserved xxxx 1xxxb I invalidates 75h BU X1xxb xx1xb I invalidates Internal cache line invalidates D invalidates He oO Me Oo XXXX_XXXlb D invalidates 76h BU Cycles processor is running not in HLT or STPCLK 1xxx xxxxb Data block write from the L2 TLB RMW Data block write from the DC xxxxb Data block write from the syste 1 xxxxb Data block read data 79h BU ef ora L2 requests Xxxx lxxxb Data block read data load Xxxx xlxxb Data block read instruction xxlxb Tag write xxxlb Tag read Performance Counter Usage 165 AMDA AMD Athlon Processor x86 Code Optimization Table 11 Performance Monitoring Counters Continued 22007E 0 November 1999 may Notes Unit Mask bits 15 8 Event Description des that at least one fill request zn to use s 2 80h PC Instruction cache fetches 8ih PC Instruction cache misses 82h PC Instruction cache refills from L2 85h PC Instruction cache refills from system 84h PC 11 ITLB misses and L2 ITLB hit
148. cal 5 0 1 and 2 Each integer pipe consists of an integer execution unit IEU and an address generation unit AGU The integer execution pipeline is organized to match the three MacroOP dispatch pipes in the ICU as shown in Figure 2 on page 135 MacroOPs are broken down into OPs in the schedulers OPs issue when their operands are available either from the register file or result buses OPs are executed when their operands are available OPs from a single MacroOP can execute out of order In addition a particular integer pipe can be executing two OPs from different MacroOPs one in the IEU and one in the AGU at the same time ol Unit and 5 5 Integer Scheduler 7 18 entry 8 Integer Multiply IMUL Figure 2 Integer Execution Pipeline AMD Athlon Processor Microarchitecture 135 AMD AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 Each of the three IEUs are general purpose in that each performs logic functions arithmetic functions conditional functions divide step functions status flag multiplexing and branch resolutions The AGUs calculate the logical addresses for loads stores and LEAs A load and store unit reads and writes data to and from the L1 data cache The integer scheduler sends a completion status to the ICU when the outstanding OPs for a given MacroOP are executed All integer operations can be handled within any of the thr
149. code for a vertex and clipping if the clip code is non zero The following example shows how to use 3DNow instructions to efficiently implement a clip code computation for a frustum that is defined by m lt lt W lt lt w W lt Z lt W LIGN 8 RIGHT ABOVE LEFT BELOW BEFORE BEHIND Generalized computation of 3D clip code out code Register usage IN 5 y MM6 w z OUT MM2 clip code out code 122 Efficient 3D Clipping Code Computation Using AMD 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization DESTROYS 1 2 3 4 PXOR MMO MMO i cg d MOVQ MM1 MM6 2 MOVQ MM4 MM5 y x PUNPCKHDQ MM1 MM1 w MOVQ MM3 MM6 22 MOVQ MM2 MM5 e PFSUBR MM3 MMO w z PFSUBR MM2 MMO pei ex PUNPCKLDQ MM3 MM6 AZ PFCMPGT 4 MM1 yDW FFFFFFFF 0 x gt w FFFFFFFF 0 MOVQ MMO QWORD PTR ABOVE RIGHT ABOVE RIGHT PFCMPGT MM1 ZOW FFFFFFFF 0 z gt w gt FFFFFFFF 0 PFCMPGT MM2 MM1 y2w FFFFFFFF 0O x gt w FFFFFFFF 0 MOVQ MM1 QWORD PTR LBEHIND BEFORE BEHIND BEFORE PAND MM4 MMO gt 2 ABOVE O x gt w RIGHT 0 MOVQ MMO QWORD PTR BELOW LEFT BELOW LEFT PAND MM3 MM1 Z gt BEHIND 0 z gt w 2 BEFORE 0 PAND MM2 MMO y gt BELOW 0 x gt w LEFT 0 POR MM2 MM4 BELOW ABOVE LEFT RIGHT POR MM2 MM3 BE
150. cture members, some compilers might allocate structure elements in an order that differs from the order in which they are declared. However, some compilers might not offer any of these features, or their implementation might not work properly in all situations. Therefore, to achieve the best alignment of structures and structure members while minimizing the amount of padding regardless of compiler optimizations, the following methods are suggested.

Sort structure members according to their base type size, declaring members with a larger base type size ahead of members with a smaller base type size.

Pad by Multiple of Largest Base Type Size: Pad the structure to a multiple of the largest base type size of any member. In this fashion, if the first member of a structure is naturally aligned, all other members are naturally aligned as well. The padding of the structure to a multiple of the largest base type size allows, for example, arrays of structures to be perfectly aligned.

The following example demonstrates the reordering of structure member declarations.

Original ordering (Avoid):

    struct {
       char   a[5];
       long   k;
       double x;
    } baz;

New ordering with padding (Preferred):

    struct {
       double x;
       long   k;
       char   a[5];
       char   pad[7];
    } baz;

See C Language Structure Component Considerations on page 5
151. d immediate data re use of the data at the destination is expected m AMD K6 family specific code where the destination is in non cacheable memory Example 1 block copy source and destination QWORD aligned asm mov eax src ptr mov edx dst ptr mov ecx blk size shr ecx 6 align 16 Use MMX Instructions for Block Copies and Block Fills 115 AMDA AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 xfer ovq 0 eax add edx 64 ovq 1 8 68 add eax 64 ovq 2 eax 48 ovq edx 64 0 ovq 0 eax 40 ovq edx 56 1 ovq 1 eax 32 ovq edx 48 mm2 ovq 2 eax 24 ovq edx 40 0 ovq 0 eax 16 ovq edx 32 1 ovq 1 8 68 ovq edx 24 mm2 ovq edx 16 0 dec ecx ovq edx 8 mmi jnz xfer femms block fill destination QWORD aligned asm mov edx dst ptr mov ecx size shr ecx 6 movq mm0 fill data align 16 fill ovq edx mm0 ovq edx 8 mmO ovq 161 mm0 ovq edx 24 mm0 ovq Ledx 32 mmO ovq 401 mm0 add edx 64 ovq edx 16 mmO decq ecx edx 8 mmO jnz fill femms 116 Use MMX Instructions for Block Copies and Block Fills AMDA 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization AMD Athlon The following example code written for the inline assembler of Processor Specific Microsoft V
152. d store accesses and a 32 entry queue for L2 cache or system memory load and store accesses The 12 entry queue can request a maximum of two L1 cache loads and two L1 cache 32 bits stores per cycle The 32 entry queue effectively holds requests that missed in the L1 cache probe by the 12 entry queue Finally the LSU ensures that the architectural load and store ordering rules are preserved a requirement for x86 architecture compatibility Operand Buses 14133 Data Cache 2 Way 64Kbytes Store Data to BIU 138 AMD Athlon Processor Microarchitecture AMDA 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization L2 Cache Controller Write Combining The AMD Athlon processor contains a very flexible onboard L2 controller It uses an independent backside bus to access up to 8 Mbytes of industry standard SRAMs There are full on chip tags for a 512 Kbyte cache while larger sizes use a partial tag system In addition there is a two level data TLB structure The first level TLB is fully associative and contains 32 entries 24 that map 4 Kbyte pages and eight that map 2 Mbyte or 4 Mbyte pages The second level TLB is four way set associative and contains 256 entries which can map 4 Kbyte pages See Appendix C Implementation of Write Combining on page 155 for detailed information about write combining AMD Athlon System Bus The AMD Athlon system bus is a high speed bus
153. dder is occupied on every clock cycle ensuring maximal sustained utilization Explicitly Extract Common Subexpressions In certain situations C compilers are unable to extract common subexpressions from floating point expressions due to the guarantee against reordering of such expressions in the ANSI standard Specifically the compiler can not re arrange the computation according to algebraic equivalencies before extracting common subexpressions In such cases the programmer should manually extract the common subexpression It should be noted that re arranging the expression may result in different computational results due to the lack of associativity of floating point operations but the results usually differ in only the least significant bits 26 Explicitly Extract Common Subexpressions AMDA 22007E 0 November 1999 Example 1 Example 2 AMD Athlon Processor x86 Code Optimization Avoid double a b c d e f b c d b d a e f Preferred double a b c d e f t t b d e c t f a t Avoid double a b c e f e a c PSU Ce Preferred double a b c e f t Up 0 e a t f b t C Language Structure Component Considerations Sort by Base Type Size Many compilers have options that allow padding of structures to make their size multiples of words doublewords or quadwords in order to achieve better alignment for structures In addition to improve the alignment of stru
154. dependency In some instances the language definition may prohibit the compiler from using code transformations that would remove the store to load dependency It is therefore recommended that the programmer remove the dependency manually e g by introducing a temporary variable that can be kept in a register This can result in a significant performance increase The following is an example of this Example 1 Avoid double x VECLEN yLVECLEN z VECLENJ unsigned int k for k 1 k gt VECLEN k x k 11 y k for k 1 k gt VECLEN k XLk zEk 11 Example 2 Preferred double x VECLEN yLVECLEN z VECLENJ unsigned int k double t t XLO for k 1 gt VECLEN k t y k x k t j t x 0 for k 1 k gt VECLEN k t z k 1 1 x k t Avoid Unnecessary Store to Load Dependencies 19 AMDA AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 Consider Expression Order in Compound Branch Conditions Branch conditions in C programs are often compound conditions consisting of multiple boolean expressions joined by the boolean operators amp amp and C guarantees a short circuit evaluation of these operators This means that in the case of ll the first operand to evaluate to TRUE terminates the evaluation i e following operands are not evaluated at all Similarly
155. due to passing function arguments through memory which creates STLF store to load forwarding dependencies Some compilers allow for a reduction of this overhead by allowing arguments to be passed in registers in one of their calling conventions which has the drawback of constraining register allocation in the function and at the site of the function call In general function inlining works best if the compiler can utilize feedback from a profiler to identify the function call sites most frequently executed If such data is not available a reasonable heuristic is to concentrate on function calls inside loops Functions that are directly recursive should not be considered candidates for inlining However if they are end recursive the compiler should convert them to an iterative equivalent to avoid potential overflow of the AMD Athlon processor return prediction mechanism return stack during deep recursion For best results a compiler should support function inlining across multiple source files In addition a compiler should provide inline templates for commonly used library functions such as sin stremp memcpy Use Function Inlining 71 AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 Always Inline Functions if Called from One Site A function should always be inlined if it can be established that it is called from just one site in the code For the C language determination o
156. e The loop is in a frequently executed piece of code The loop count is known at compile time m The loop body once unrolled is less than 100 instructions which is approximately 400 bytes of code Partial Loop Unrolling Partial loop unrolling can increase register pressure which can make it inefficient due to the small number of registers in the x86 architecture However in certain situations partial unrolling can be efficient due to the performance gains possible Partial loop unrolling should be considered if the following conditions are met m Spareregisters are available m Loop body is small so that loop overhead is significant m Number of loop iterations is likely 10 Consider the following piece of C code double aL MAX LENGTH LENGTH for 1 0 i lt MAX LENGTH i ali ali b i Without loop unrolling the code looks like the following 68 Unrolling Loops AMDA 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization Without Loop Unrolling MOV ECX MAX LENGTH MOV EAX OFFSET A MOV EBX OFFSET B add loop FLD QWORD PTR EAX FADD QWORD PTR EBX FSTP QWORD PTR EAX ADD EAX 8 ADD EBX 8 DEC ECX JNZ add loop The loop consists of seven instructions The AMD Athlon processor can decode retire three instructions per cycle so it cannot execute faster than three iterations in seven cycles or 3 7 floating point adds per cycle However
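Assuming MAX_LENGTH is even, a partially unrolled version might look like the following sketch (this is not the manual's own unrolled listing, only an illustration of the transformation):

    ; Loop unrolled by a factor of two: the same loop overhead now
    ; covers two floating-point adds per iteration.
        MOV     ECX, MAX_LENGTH/2
        MOV     EAX, OFFSET A
        MOV     EBX, OFFSET B
    add_loop:
        FLD     QWORD PTR [EAX]
        FADD    QWORD PTR [EBX]
        FSTP    QWORD PTR [EAX]
        FLD     QWORD PTR [EAX+8]
        FADD    QWORD PTR [EBX+8]
        FSTP    QWORD PTR [EAX+8]
        ADD     EAX, 16
        ADD     EBX, 16
        DEC     ECX
        JNZ     add_loop

Ten instructions now perform two adds, improving the ratio of floating-point work to loop overhead.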
157. e 9 can receive up to three MacroOPs per cycle The schedule SCHED pipeline stage in cycle 10 schedules up to three MacroOPs per cycle from the 36 entry FPU scheduler to the FREG pipeline stage to read register operands MacroOPs are sent when their operands and or tags are obtained The register file read FREG pipeline stage reads the floating point register file for any register source operands of MacroOPs The register file read is done before the MacroOPs are sent to the floating point execution pipelines The FPU has three logical pipes FADD FMUL and FSTORE Each pipe may have several associated execution units MMX execution is in both the FADD and FMUL pipes with the exception of MMX instructions involving multiplies which are limited to the FMUL pipe The FMUL pipe has special support for long latency operations DirectPath VectorPath operations are dispatched to the FPU during cycle 6 but are not acted upon until they receive validation from the ICU in cycle 7 Floating Point Pipeline Stages 147 AMDA AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 Execution Unit Resources Terminology Operands Results Examples The execution units operate with two types of register values operands and results There are three operand types and two result types which are described in this section The three types of operands are as follows m Address register operands Used for
158. e Operations L S Address Generation Operations AGU Integer Execution Unit Operations IEU Integer Multiply Operations IMUL Table3 Integer Decode Types x86 Instruction Decode Type OPs MOV CX SP 4 DirectPath AGU L S ADD BX DirectPath IEU CMP AX VectorPath AGU L S IEU JZ Addr DirectPath IEU As shown in Table 2 the MOV instruction early decodes in the DirectPath decoder and requires two OPs an address generation operation for the indirect address and a data load from memory into a register The ADD instruction early decodes in the DirectPath decoder and requires a single OP that can be executed in one of the three IEUs The CMP instruction early decodes in the VectorPath and requires three OPs an address generation operation for the indirect address a data load from memory and a compare to CX using an IEU The final JZ instruction is a simple operation that early decodes in the DirectPath decoder and requires a single OP Not shown is a load op store instruction which translates into only one MacroOP one AGU OP one IEU OP and one L S OP Execution Unit Resources 149 AMDA AMD Athlon Processor x86 Code Optimization Floating Point Pipeline Operations 22007E 0 November 1999 Table 4 shows the category or type of operations handled by the floating point execution units Table 5 shows examples of the decode types Table4 Floating Point Pipeline Ope
159. e prefetch distance:

Prefetch Distance = 200 × (DS/C) bytes

■ Round up to the nearest cache line.
■ DS is the data stride per loop iteration.
■ C is the number of cycles per loop iteration when hitting in the L1 cache.

See Use the 3DNow PREFETCH and PREFETCHW Instructions on page 46 for more details.

Select DirectPath Over VectorPath Instructions

Use DirectPath instructions rather than VectorPath instructions. DirectPath instructions are optimized to decode and execute efficiently by minimizing the number of operations per x86 instruction. Three DirectPath instructions can be decoded in parallel. Using VectorPath instructions will block DirectPath instructions from decoding simultaneously. See Appendix G, DirectPath versus VectorPath Instructions on page 219 for a list of DirectPath and VectorPath instructions.

Group II Optimizations - Secondary Optimizations

Load-Execute Instruction Usage

See Load-Execute Instruction Usage on page 34 for more details.

Use Load-Execute Instructions

Wherever possible, use load-execute instructions to increase code density, with the one exception described below. The split-instruction form of load-execute instructions can be used to avoid scheduler stalls for longer executing instructions and to explicitly schedule the load and execute operations.
160. ectPath ROL mem 16 32 imm8 Cih mm 000 xxx DirectPath ROL mreg8 1 Doh 11 000 xxx DirectPath ROL mem8 1 Doh mm 000 xxx DirectPath Instruction Dispatch and Execution Resources 201 AMD AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 Table 19 Integer Instructions Continued Instruction Mnemonic 02 js M pon ROL mreg16 32 1 Dih 11 000 xxx DirectPath ROL mem 16 32 1 Dih mm 000 xxx DirectPath ROL mrega8 CL D2h 11 000 xxx DirectPath ROL CL D2h mm 000 xxx DirectPath ROL mreg16 32 CL D3h 11 000 xxx DirectPath ROL mem16 32 CL D3h mm 000 xxx DirectPath ROR mreg8 imm8 11 001 xxx DirectPath mem8 imm8 mm 001 xxx DirectPath ROR mreg16 32 imm8 Cih 11 001 xxx DirectPath ROR mem16 32 imm8 Cih mm 001 xxx DirectPath ROR mregg 1 Doh 11 001 xxx DirectPath ROR mem 1 Doh mm 001 xxx DirectPath ROR mreg16 32 1 Dih 11 001 xxx DirectPath ROR mem16 32 1 Dih mm 001 xxx DirectPath ROR mreg8 CL D2h 11 001 xxx DirectPath ROR CL D2h mm 001 xxx DirectPath ROR mreg16 32 CL D3h 11 001 xxx DirectPath ROR mem16 32 CL D3h mm 001 xxx DirectPath SAHF 9Eh VectorPath SAR mreg8 imm8 11 111 xxx DirectPath mem imm8 mm 111 xxx DirectPath SAR mreg16 32 imm8 Cih 11 111 xxx DirectPath SAR mem16 32 imm8 Cih m
161. ects either MacroOPs from the DirectPath or MacroOPs from the VectorPath to send to the instruction decoder IDEC stage The microcode engine decode MEDEC stage converts x86 instructions into MacroOPs The microcode engine sequencer MESEQ performs the sequence controls redirects and exceptions for the MENG At the instruction decoder IDEC rename stage integer and floating point MacroOPs diverge in the pipeline Integer MacroOPs are scheduled for execution in the next cycle Floating point MacroOPs have their floating point stack Fetch and Decode Pipeline Stages 143 AMD AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 operands mapped to registers Both integer and floating point MacroOPs are placed into the ICU Integer Pipeline Stages The integer execution pipeline consists of four or more stages for scheduling and execution and if necessary accessing data in the processor caches or system memory There are three integer pipes associated with the three IEUs Pipeline BA IEEE 2 Stage MacroOPs MacroOPs Integer Scheduler 7 18 entry 8 Integer Multiply IMUL Figure 7 Integer Execution Pipeline Figure 7 and Figure 8 show the integer execution resources and the pipeline stages which are described in the following sections SS sem D anser Figure 8 Integer Pipeline Stages 144 Integer Pipeline Stages AMDA 22007E 0 November 1999
162. ed loads and stores increases the likelihood of encountering a store to load forwarding pitfall For a more detailed discussion of store to load forwarding issues see Store to Load Forwarding Restrictions on page 51 Use the 3DNow PREFETCH and PREFETCHW Instructions For code that can take advantage of prefetching use the 3DNow PREFETCH and PREFETCHW instructions to increase the effective bandwidth to the AMD Athlon processor The PREFETCH and PREFETCHW instructions take advantage of the AMD Athlon processor s high bus bandwidth to hide long latencies when fetching data from system memory The prefetch instructions are essentially integer instructions and can be used anywhere in any type of code integer x87 3DNow MMX etc Large data sets typically require unit stride access to ensure that all data pulled in by PREFETCH or PREFETCHW is actually used If necessary algorithms or data structures should be reorganized to allow unit stride access 46 Use the 3DNow PREFETCH and PREFETCHW AMDA 22007E 0 November 1999 PREFETCH W versus PREFETCHNTA TO TI T2 PREFETCHW Usage Multiple Prefetches AMD Athlon Processor x86 Code Optimization The PREFETCHNTA TO T1 T2 instructions in the MMX extensions are processor implementation dependent To Maintain compatibility with the 25 million AMD K6 2 and AMD K6 III processors already sold use the 3DNow PREFETCH W instructions instead of the various pref
163. ee IEUs with the exception of multiplies Multiplies are handled by a pipelined multiplier that is attached to the pipeline at pipe 0 See Figure 2 on page 135 Multiplies always issue to integer pipe 0 and the issue logic creates results bus bubbles for the multiplier in integer pipes 0 and 1 by preventing non multiply OPs from issuing at the appropriate time Floating Point Scheduler The AMD Athlon processor floating point logic is a high performance fully pipelined superscalar out of order execution unit It is capable of accepting three MacroOPs of any mixture of x87 floating point 3DNow or MMX operations per cycle The floating point scheduler handles register renaming and has a dedicated 36 entry scheduler buffer organized as 12 lines of three MacroOPs each It also performs OP issue and out of order execution The floating point scheduler communicates with the ICU to retire a MacroOP to manage comparison results from the FCOMI instruction and to back out results from a branch misprediction 136 AMD Athlon Processor Microarchitecture AMD 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization Floating Point Execution Unit The floating point execution unit FPU is implemented as a coprocessor that has its own out of order control in addition to the data path The FPU handles all register operations for x87 instructions all 3DNow operations and all MMX operations The FPU consists of a stack
164. efore starting the lock Writes within a lock can be combined A UC read doses write combining A WC read closes Uncacheable Read combining only if a cache block address match occurs between the WC read and a write in the write buffer Any WT write while write combining for WC memory or any Different memory type WC write while write combining for WT memory closes write combining Write combining is closed if all 64 bytes of the write buffer Buffer full are valid If 16 processor clocks have passed since the most recent WT time out write for WT write combining write combining is closed There is no time out for WC write combining Write combining is closed if a write fills the most significant byte of a quadword which includes writes that are misaligned across a quadword boundary In the misaligned case combining is closed by the LS part of the misaligned write and combining is opened by the MS part of the misaligned store WT write fills byte 7 If a subsequent WT write is not in ascending sequential order the write combining completes WC writes have no WT Nonsequential addressing constraints within the 64 byte line being combined TLB AD bit set Write combining is closed whenever a TLB reload sets the accessed A or dirty D bits of a Pde or Pte 158 Write Combining Operations AMDA 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization Sending Write
165. em14byte FBSTP mem80 FLDENV mem28byte FCLEX FPTAN FCMOVB ST 0 ST i FPATAN FCMOVE ST 0 ST i FRNDINT FCMOVBE ST 0 ST i FRSTOR mem94byte FCMOVU ST 0 ST i FRSTOR mem 108byte FCMOVNB ST 0 ST i FSAVE mem94byte FCMOVNE ST 0 ST i FSAVE mem 108byte FCMOVNBE ST 0 ST i FSCALE FCMOVNU ST 0 ST i FSIN FCOMI ST ST i FSINCOS FCOMIP ST ST i FSTCW mem16 FCOS FSTENV mem 14byte FIADD mem32int FSTENV mem28byte FIADD mem 16int FSTP mem80real FICOM mem32int FSTSW AX FICOM mem 16int FSTSW mem 16 FICOMP mem32int ST ST i FICOMP mem 161 FUCOMIP ST ST i FIDIV mem32int FXAM FIDIV mem 16int FXTRACT FIDIVR mem32int FYL2X FIDIVR mem 16int FYL2XP1 FIMUL mem32int FIMUL mem 16int FINIT FISUB mem32int FISUB mem 16int FISUBR mem32int FISUBR mem 16int FLD mem80real FLDCW mem16 VectorPath Instructions 235 AMD AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 236 VectorPath Instructions AMDA 22007E 0 November 1999 Index AMD Athlon Processor x86 Code Optimization Numerics 3DNow
166. ember 1999 Table 23 3DNow Instructions AMD Athlon Processor x86 Code Optimization Notes prefetched 2 The byte listed in the column titled 1 8 is actually the opcode byte Instruction Mnemonic Prefix imm8 ModR M Decode F pU Note Byte s Byte Type Pipe s FEMMS OFh OEh DirectPath FADD FMUL FSTORE 2 PAVGUSB 1 mmreg2 OFh OFh 11 xxx xxx DirectPath FADD FMUL PAVGUSB mmreg mem64 OFh OFh BFh DirectPath FADD FMUL PF2ID mmreg1 mmreg2 OFh OFh 1Dh 11 xxx xxx DirectPath FADD PF2ID mmreg mem64 OFh OFh 1Dh mm xxx xxx DirectPath FADD mmreg1 mmreg2 OFh OFh AEh 11 xxx xxx DirectPath FADD PFACC mmreg mem64 OFh OFh AEh mm xxx xxx DirectPath FADD PFADD mmreg1 mmreg2 OFh OFh 9Eh 11 xxx xxx DirectPath FADD PFADD mmreg mem64 OFh OFh 9Eh mm xxx xxx DirectPath FADD PFCMPEQ mmreg1 mmreg2 OFh Boh 11 xxx xxx DirectPath FADD mmreg mem64 OFh OFh Boh mm xxx xxx DirectPath FADD mmreg1 mmreg2 OFh OFh 90h 11 Xxx xxx DirectPath FADD PFCMPGE mmreg mem64 OFh OFh 90h mm xxx xxx DirectPath FADD PFCMPGT mmreg1 mmreg2 OFh OFh 11 xxx xxx DirectPath FADD PFCMPGT mmreg mem64 OFh OFh Aoh mm xxx xxx DirectPath FADD PFMAX mmreg1 mmreg2 OFh OFh A4h 11 xxx xxx DirectPath FADD
167. ember 1999 Table 29 VectorPath Integer Instructions Continued Instruction Mnemonic MUL EAX mem32 AMD Athlon Processor x86 Code Optimization OUT imme AL Instruction Mnemonic RCL imm8 OUT imme RCL mem16 32 imm8 OUT imme EAX RCL meme CL OUT DX AL RCL mem16 32 CL OUT DX AX RCR mems imm8 OUT DX EAX RCR mem16 32 imm8 POP ES RCR 8 CL POP SS RCR mem16 32 CL POP DS RDMSR POP FS RDPMC POP GS RDTSC POP EAX RET near imm16 POP ECX RET near POP EDX RET far imm16 POP EBX RET far POP ESP SAHF POP EBP SCASB AL mem8 POP ESI SCASW AX mem16 POP EDI SCASD EAX mem32 POP mreg 16 32 SGDT mem48 POP mem 16 32 SIDT mem48 POPA POPAD SHLD mreg16 32 reg16 32 imm8 POPF POPFD SHLD mem16 32 reg16 32 imm8 PUSH ES SHLD mreg16 32 reg16 32 CL PUSH CS SHLD mem 16 32 reg16 32 CL PUSH FS SHRD mreg 16 32 reg16 32 imm8 PUSH GS SHRD mem16 32 reg16 32 imm8 PUSH SS SHRD mreg16 32 reg16 32 CL PUSH DS SHRD mem 16 32 reg16 32 CL PUSH mreg16 32 SLDT 16 PUSH mem16 32 SLDT mem16 PUSHA PUSHAD SMSW 16 PUSHF PUSHFD SMSW mem16 STD
168. ensions 218 Table 25 DirectPath Integer Instructions 220 Table 26 DirectPath MMX Instructions 227 Table 27 DirectPath MMX 228 Table 28 DirectPath Floating Point Instructions 229 List of Tables AMDA AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 Table 29 VectorPath Integer Instructions 231 Table 30 VectorPath MMX Instructions 234 Table 31 VectorPath MMX Extensions 234 Table 32 VectorPath Floating Point Instructions 235 xiv List of Tables AMDA 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization Revision History Date Nov 1999 Rev Description Added About this Document on page 1 Further clarification of Consider the Sign of Integer Operands on page 14 Added the optimization Use Array Style Instead of Pointer Style Code on page 15 Added the optimization Accelerating Floating Point Divides and Square Roots on page 29 Clarified examples in Copy Frequently De referenced Pointer Arguments to Local Variables on page 31 Further clarification of Select DirectPath Over VectorPath Instructions on page 34 Further clarification of Align Branch Targets in Program Hot Spots on page 36 Further clarification of REP instruction as a fille
169. er Execution 135 Figure 3 Floating Point Unit Block Diagram 137 Figure 4 Load Store Unit 138 Figure 5 Fetch Scan Align Decode Pipeline Hardware 142 Figure 6 Fetch Scan Align Decode Pipeline Stages 142 Figure 7 Integer Execution 144 Figure 8 Integer Pipeline 144 Figure 9 Floating Point Unit Block Diagram 146 Figure 10 Floating Point Pipeline Stages 146 Figure 11 PerfEvtSel 3 0 Registers 162 Figure 12 MTRR Mapping of Physical Memory 173 Figure 13 MTRR Capability Register Format 174 Figure 14 MTRR Default Type Register Format 175 Figure 15 Page Attribute Table MSR 277h 177 Figure 16 MTRRphysBasen Register Format 183 Figure 17 MTRRphysMaskn Register Format 184 List of Figures AMD AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 Xii List of Figures AMDA 22007E 0 November 1999 List of Tables AMD Athlon Processor x86 Code Optimization Table 1 Latency of Repeated String Instructions 84 Table2 Integer Pipeline Operation Types 149 Table 3 Integer Decode Types 149 Table 4 Floati
170. er and floating point schedulers for final decode issue and execution as OPs In addition the ICU handles exceptions and manages the retirement of MacroOPs The L1 data cache contains two 64 bit ports It isa write allocate and writeback cache that uses an LRU replacement policy The data cache and instruction cache are both two way set associative and 64 Kbytes in size It is divided into 8 banks where each bank is 8 bytes wide In addition this cache supports the MOESI Modified Owner Exclusive Shared and Invalid cache coherency protocol and data parity The L1 data cache has an associated two level TLB structure The first level TLB is fully associative and contains 32 entries 24 that map 4 Kbyte pages and eight that map 2 Mbyte or 4 Mbyte pages The second level TLB is four way set associative and contains 256 entries which can map 4 Kbyte pages 134 AMD Athlon Processor Microarchitecture AMD 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization Integer Scheduler The integer scheduler is based on a three wide queuing system also known as a reservation station that feeds three integer execution positions or pipes The reservation stations are six entries deep for a total queuing system of 18 integer MacroOPs Each reservation station divides the MacroOPs into integer and address generation OPs as required Integer Execution Unit The integer execution pipeline consists of three identi
171. erations Since most languages including ANSI C guarantee that floating point expressions are not re ordered compilers can not usually perform such optimizations unless they offer a switch to allow ANSI non compliant reordering of floating point expressions according to algebraic rules Note that re ordered code that is algebraically identical to the original code does not necessarily deliver identical computational results due to the lack of associativity of floating point operations There are well known numerical considerations in applying these optimizations consult a book on numerical analysis In some cases these optimizations may Dynamic Memory Allocation Consideration 25 AMDA AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 lead to unexpected results Fortunately in the vast majority of cases the final result will differ only in the least significant bits Example 1 Avoid double a 100 sum int i sum 0 0f for 1 0 1 lt 100 i sum a i Example 2 Preferred double a 100 suml1 sum2 sum3 sum4 sum int i suml sum2 sum3 sum4 for 1 0 1 lt 100 1 4 5001 ali sum2 1 11 sum3 21 sum4 a i 3 sum sum4 sum3 suml sum2 G2 CO ED Notice that the 4 way unrolling was chosen to exploit the 4 stage fully pipelined floating point adder Each stage of the floating point a
172. erred double z 3 double x y long foo bar float baz short ga gu gi See Sort Variables According to Base Type Size on page 56 for more information from a different perspective Accelerating Floating Point Divides and Square Roots Divides and square roots have a much longer latency than other floating point operations even though the AMD Athlon processor provides significant acceleration of these two operations In some codes these operations occur so often as to seriously impact performance In these cases it is recommended to port the code to 3DNow inline assembly or to use a compiler that can generate 3DNow code If code has hot spots that use single precision arithmetic only i e all computation involves data of type float and for some reason cannot be ported to 3DNow the following technique may be used to improve performance The x87 FPU has a precision control field as part of the FPU control word The precision control setting determines what precision results get rounded to It affects the basic arithmetic operations including divides and square roots AMD Athlon and AMD K6 family processors implement divide and square root in such fashion as to only compute the number of bits Accelerating Floating Point Divides and Square Roots 29 AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 necessary for the currently selected precision This means that setting
173. etc and the IBM PC AT platform This guide has been written specifically for the AMD Athlon processor but it includes considerations for About this Document AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 previous generation processors and describes how those optimizations are applicable to the AMD Athlon processor This guide contains the following chapters Chapter 1 Introduction Outlines the material covered in this document Summarizes the AMD Athlon microarchitecture Chapter 2 Top Optimizations Provides convenient descriptions of the most important optimizations a programmer should take into consideration Chapter 3 C Source Level Optimizations Describes optimizations that C C programmers can implement Chapter 4 Instruction Decoding Optimizations Describes methods that will make the most efficient use of the three sophisticated instruction decoders in the AMD Athlon processor Chapter 5 Cache and Memory Optimizations Describes optimizations that makes efficient use of the large L1 caches and high bandwidth buses of the AMD Athlon processor Chapter 6 Branch Optimizations Describes optimizations that improves branch prediction and minimizes branch penalties Chapter 7 Scheduling Optimizations Describes optimizations that improves code scheduling for efficient execution resource utilization Chapter 8 Integer Optimizations Describes optimizations that impr
174. etch flavors in the new MMX extensions Code that intends to modify the cache line brought in through prefetching should use the PREFETCHW instruction While PREFETCHW works the same as a PREFETCH on the AMD K6 2 AMD K6 III processors PREFETCHW gives a hint to the AMD Athlon processor of an intent to modify the cache line The AMD Athlon processor will mark the cache line being brought in by PREFETCHW as Modified Using PREFETCHW can save an additional 15 25 cycles compared to a PREFETCH and the subsequent cache state change caused by a write to the prefetched cache line Programmers can initiate multiple outstanding prefetches on the AMD Athlon processor While the AMD K6 2 and AMD K6 III processors can have only one outstanding prefetch the AMD Athlon processor can have up to six outstanding prefetches When all six buffers are filled by various memory read requests the processor will simply ignore any new prefetch requests until a buffer frees up Multiple prefetch requests are essentially handled in order If data is needed first then that data should be prefetched first The example below shows how to initiate multiple prefetches when traversing more than one array Example Multiple Prefetches CODE K3D original C code define LARGE NUM 65536 double array aL LARGE NUM double array b LARGE NUM double array cLLARGE NUM 12 i 0 i gt LARGE NUM i ali b i c i
175. etched the next EIP is pushed onto the 132 AMD Athlon Processor Microarchitecture AMDA 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization return stack Subsequent RETs pop a predicted return address off the top of the stack Early Decoding The DirectPath and VectorPath decoders perform early decoding of instructions into MacroOPs A MacroOP is a fixed length instruction which contains one or more OPs The outputs of the early decoders keep all DirectPath or VectorPath instructions in program order Early decoding produces three MacroOPs per cycle from either path The outputs of both decoders are multiplexed together and passed to the next stage in the pipeline the instruction control unit When the target 16 byte instruction window is obtained from the instruction cache the predecode data is examined to determine which type of basic decode should occur DirectPath or VectorPath DirectPath Decoder DirectPath instructions can be decoded directly into a MacroOP and subsequently into one or two OPs in the final issue stage A DirectPath instruction is limited to those x86 instructions that can be further decoded into one or two OPs The length of the x86 instruction does not determine DirectPath instructions A maximum of three DirectPath x86 instructions can occupy a given aligned 8 byte block 16 bytes are fetched at a time Therefore up to six DirectPath x86 instructions can be passed into t
176. f this characteristic is made easier if functions are explicitly declared static unless they require external linkage This case occurs quite frequently as functionality that could be concentrated in a single large function 1s split across multiple small functions for improved maintainability and readability Always Inline Functions with Fewer than 25 Machine Instructions In addition functions that create fewer than 25 machine instructions once inlined should always be inlined because it is likely that the function call overhead is close to or more than the time spent executing the function body For large functions the benefits of reduced function call overhead gives diminishing returns Therefore a function that results in the insertion of more than 500 machine instructions at the call site should probably not be inlined Some larger functions might consist of multiple relatively short paths that are negatively affected by function overhead In such a case it can be advantageous to inline larger functions Profiling information is the best guide in determining whether to inline such large functions Avoid Address Generation Interlocks Loads and stores are scheduled by the AMD Athlon processor to access the data cache in program order Newer loads and stores with their addresses calculated can be blocked by older loads and stores whose addresses are not yet calculated this is known as an address generation interlock Therefore it
177. floating point In such cases the FRNDINT instruction should be used for maximum performance instead of FISTP in the code above FRNDINT delivers the integral result directly to an FPU register in floating point form which is faster than first using FISTP to store the integer result and then converting it back to floating point with FILD If there are multiple consecutive floating point to integer conversions the cost of FLDCW operations should be minimized by saving the current FPU control word forcing the 100 Minimize Floating Point to Integer Conversions AMD 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization FPU into truncating mode and performing all of the conversions before restoring the original control word The speed of the above code is somewhat dependent on the nature of the code surrounding it For applications in which the speed of floating point to integer conversions is extremely critical for application performance experiment with either of the following substitutions which may or may not be faster than the code above The first substitution simulates a truncating floating point to integer conversion provided that there are no NaNs infinities and overflows This conversion is therefore not IEEE 754 compliant This code works properly only if the current FPU rounding mode is round to nearest even which is usually the case Example 2 Potentially faster FLD QWORD PTR
178. for amp amp the first operand to evaluate to FALSE terminates the evaluation Because of this short circuit evaluation it is not always possible to swap the operands of and amp amp This is especially the case when the evaluation of one of the operands causes a side effect However in most cases the exchange of operands is possible When used to control conditional branches expressions involving ll and amp amp are translated into a series of conditional branches The ordering of the conditional branches is a function of the ordering of the expressions in the compound condition and can have a significant impact on performance It is unfortunately not possible to give an easy closed form formula on how to order the conditions Overall performance is a function of a variety of the following factors m probability of a branch mispredict for each of the branches generated m additional latency incurred due to a branch mispredict m cost of evaluating the conditions controlling each of the branches generated m amount of parallelism that can be extracted in evaluating the branch conditions m data stream consumed by an application mostly due to the dependence of mispredict probabilities on the nature of the incoming data in data dependent branches It is therefore recommended to experiment with the ordering of expressions in compound branch conditions in the most active areas of a program so called hot spots where most of the
179. h MOV ESP imm16 32 BCh DirectPath MOV EBP imm16 32 BDh DirectPath MOV ESI imm16 32 BEh DirectPath MOV EDI imm16 32 BFh DirectPath MOV mreg8 imm8 C6h 11 000 xxx DirectPath MOV meme imm8 C6h mm 000 xxx DirectPath MOV mreg16 32 imm16 32 C7h 11 000 xxx DirectPath MOV mem16 32 imm16 32 C7h mm 000 xxx DirectPath MOVSB mem8 mem8 A4h VectorPath MOVSD mem16 mem16 A5h VectorPath MOVSW mem32 mem32 A5h VectorPath MOVSX reg16 32 mreg8 OFh BEh 11 xxx xxx DirectPath MOVSX reg16 32 mem8 OFh BEh mm xxx xxx DirectPath MOVSX reg32 mreg16 OFh 11 xxx xxx DirectPath MOVSX reg32 mem16 OFh BFh mm xxx xxx DirectPath MOVZX reg16 32 mreg8 OFh B6h 11 xxx xxx DirectPath MOVZX reg16 32 mem8 OFh B6h mm xxx xxx DirectPath MOVZX reg32 mreg16 OFh B7h 11 xxx xxx DirectPath MOVZX reg32 mem16 OFh B7h mm xxx xxx DirectPath MUL AL mreg8 F6h 11 100 xxx VectorPath MUL AL mem8 Feh mm 100 xx VectorPath MUL AX 16 F7h 11 100 xxx VectorPath MUL AX mem16 F7h mm 100 xxx VectorPath MUL EAX mreg32 F7h 11 100 xxx VectorPath MUL EAX mem32 F7h mm 100 xx VectorPath NEG mreg8 F6h 11 011 xxx DirectPath NEG mem8 Feh mm 011 xx DirectPath NEG mreg16 32 F7h 11 011 xxx DirectPath NEG mem16 32 F7h mm 011 xx DirectPath NOP XCHG EAX EAX 90h DirectPath NOT mreg8 F6h 11 010 xxx DirectPath 198 Instruction Dispatch and Execution Resources AMDA 22007E 0 November 1999
180. hat has its own out of order control in addition to the data path The FPU handles all register operations for x87 instructions all 3DNow operations and all MMX operations The FPU consists of a stack renaming unit a register renaming unit a scheduler a register file and three parallel execution units Figure 9 shows a block diagram of the dataflow through the FPU Pipeline Stage Figure 9 Floating Point Unit Block Diagram The floating point pipeline stages 7 15 are shown in Figure 10 and described in the following sections Note that the floating point pipe and integer pipe separates at cycle 7 Figure 10 Floating Point Pipeline Stages 146 Floating Point Pipeline Stages AMDA 22007E 0 November 1999 Cycle 7 STKREN Cycle 8 REGREN Cycle 9 SCHEDW Cycle 10 SCHED Cycle 11 FREG Cycle 12 15 Floating Point Execution FEXECI 4 AMD Athlon Processor x86 Code Optimization The stack rename STKREN pipeline stage in cycle 7 receives up to three MacroOPs from IDEC and maps stack relative register tags to virtual register tags The register renaming REGREN pipeline stage in cycle 8 is responsible for register renaming In this stage virtual register tags are mapped into physical register tags Likewise each destination is assigned a new physical register The MacroOPs are then sent to the 36 entry FPU scheduler The scheduler write SCHEDW pipeline stage in cycl
181. he DirectPath decode pipeline VectorPath Decoder Uncommon x86 instructions requiring two or more MacroOPs proceed down the VectorPath pipeline The sequence of MacroOPs is produced by an on chip ROM known as the MROM The VectorPath decoder can produce up to three MacroOPs per cycle Decoding a VectorPath instruction may prevent the simultaneous decode of a DirectPath instruction AMD Athlon Processor Microarchitecture 133 AMDA AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 Instruction Control Unit Data Cache The instruction control unit ICU is the control center for the AMD Athlon processor The ICU controls the following resources the centralized in flight reorder buffer the integer scheduler and the floating point scheduler In turn the ICU is responsible for the following functions MacroOP dispatch MacroOP retirement register and flag dependency resolution and renaming execution resource management interrupts exceptions and branch mispredictions The ICU takes the three MacroOPs per cycle from the early decoders and places them in a centralized fixed issue reorder buffer This buffer is organized into 24 lines of three MacroOPs each The reorder buffer allows the ICU to track and monitor up to 72 in flight MacroOPs whether integer or floating point for maximum instruction throughput The ICU can simultaneously dispatch multiple MacroOPs from the reorder buffer to both the integ
182. he result is the inverse of the PFCMPGT floating point comparison For example 2 84000000 4 84800000 PCMPGT gives 84800000 gt 84000000 but 4 lt 2 To address this issue simply reverse the comparison by swapping the source operands 114 Use MMX PCMP Instead of 3DNow PFCMP AMDA 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization Use MMX Instructions for Block Copies and Block Fills AMD K6 and AMD Athlon Processor Blended Code For moving or filling small blocks of data e g less than 512 bytes between cacheable memory areas the REP MOVS and REP STOS families of instructions deliver good performance and are straightforward to use For moving and filling larger blocks of data or to move fill blocks of data where the destination is in non cacheable space it is recommended to make use of MMX instructions and MMX extensions The following examples all use quadword aligned blocks of data In cases where memory blocks are not quadword aligned additional code is required to handle end cases as needed The following example code written for the inline assembler of Microsoft Visual C is suitable for moving filling a large quad word aligned block of data in the following situations m Blended code i e code that needs to perform well on both AMD Athlon and AMD K6 family processors m AMD Athlon processor specific code where the destination is in cacheable memory an
183. head Instead take advantage of the complex addressing modes to utilize the loop counter to index into memory arrays Using complex addressing modes does not have any negative impact on execution speed but the reduced number of instructions preserves decode bandwidth Use MOVZX and MOVSX 73 AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 Example 1 Avoid int a MAXSIZE b MAXSIZE c MAXSIZE 1 for i20 i gt MAXSIZE i 1 6 up sar di bL ds MOV ECX MAXSIZE initialize loop counter XOR ESI ESI initialize offset into array a XOR EDI EDI initialize offset into array b XOR EBX EBX initialize offset into array c 800 loop MOV EAX ESI a get element a MOV EDX EDI b get element b ADD EAX EDX sali bli MOV EBX c write result to c ADD ESI 4 increment offset into a ADD EDI 4 increment offset into b ADD EBX 4 increment offset into c DEC ECX decrement loop count JNZ add loop until loop count 0 Example 2 Preferred int a MAXSIZE b MAXSIZE c MAXSIZE i for 1 0 i gt MAXSIZE i c al eibi MOV ECX MAXSIZE 1 initialize loop counter add loop MOV EAX ECX 4 a element a MOV EDX 4 6 b get element b ADD EAX EDX 8 1 5111 MOV ECX 4 c EAX write result to c DEC ECX decrement index JNS add loop until index negative Note that the code in example 2 t
184. hen clean victims must be written back and RdlO and WrlO and WT WB or WP or 4 access to Local APIC space 6 The processor does not support this memory type Page Attribute Table PAT 181 AMDA AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 MTRR Fixed Range Register Format The memory types defined for memory segments defined in each of the MTRR fixed range registers are defined in Table 17 Also See Standard MTRR Types and Properties on page 176 Table 17 MTRR Fixed Range Register Format Address Range in hexadecimal Register 63 56 55 48 47 40 39 32 3124 23 16 158 70 70000 60000 50000 40000 30000 20000 10000 00000 7FFFF 6FFFF SFFFF 4FFFF 2 9CO00 98000 94000 90000 8 000 88000 84000 80000 MTRR 16 80000 OFFFF 9BFFF 97FFF 93FFF 8FFFF 8BFFF 87FFF 83FFF BCOO0 B8000 84000 80000 000 8000 4000 0000 BFFFF BBFFF B7FFF B3FFF AFFFF ABFFF AZFFF MIRR fix16K_A0000 C7000 C6000 000 4000 3000 C2000 1000 0000 C6FFF CIF MIRR fx4K 0000 000 60000 CCOO0 000 000 C9000 000 FFFF CEFFF CDFFF CCFFF CBFFF CAFFF 8000 D7000 D6000 05000 D4000 D3000 D2000
her a signed or an unsigned type, it should be considered that certain operations are faster with unsigned types while others are faster for signed types.

Integer-to-floating-point conversion using integers larger than 16 bits is faster with signed types, as the x86 FPU provides instructions for converting signed integers to floating-point but has no instructions for converting unsigned integers. In a typical case, a 32-bit integer is converted as follows:

Example 1 (Avoid):
  double x;          MOV   [temp+4], 0
  unsigned int i;    MOV   EAX, i
                     MOV   [temp], EAX
  x = i;             FILD  QWORD PTR [temp]
                     FSTP  QWORD PTR [x]

This code is slow not only because of the number of instructions, but also because a size mismatch prevents store-to-load forwarding to the FILD instruction.

Example (Preferred):
  double x;          FILD  DWORD PTR [i]
  int i;             FSTP  QWORD PTR [x]
  x = i;

Computing quotients and remainders in integer division by constants is faster when performed on unsigned types. In a typical case, a 32-bit integer is divided by four as follows:

Example (Avoid):
  int i;             MOV   EAX, i
                     CDQ
  i = i / 4;         AND   EDX, 3
                     ADD   EAX, EDX
                     SAR   EAX, 2
                     MOV   i, EAX

Example (Preferred):
  unsigned int i;    SHR   i, 2
  i = i / 4;

In summary:

Use unsigned types for:
  Division and remainders
  Loop counters
  Array indexing
186. hes with Computation in 3DNow Code Branches negatively impact the performance of 3DNow code Branches can operate only on one data item at a time i e they are inherently scalar and inhibit the SIMD processing that makes 3DNow code superior Also branches based on 3DNow comparisons require data to be passed to the integer units which requires either transport through memory or the use of MOVD reg MMreg instructions If the body of the branch is small one can achieve higher performance by replacing the branch with computation The computation simulates predicated execution or conditional moves The principal tools for this are the following instructions PCMPGT PFCMPGT PFCMPGE PFMIN PFMAX PAND PANDN POR PXOR Muxing Constructs The most important construct to avoiding branches in 3DNow and MMX code is a 2 way muxing construct that is equivalent to the ternary operator in C and C It is implemented using the PCMP PFCMP PAND PANDN and POR instructions To maximize performance it is important to apply the PAND and PANDN instructions in the proper order Example 1 Avoid gt in mmO a mml b mm2 mm3 y out mml r PCMPGTD MM2 gt x Oxffffffff 0 MOVQ MM4 MM3 duplicate mask PANDN MM3 MMO gt 0 8 PAND 1 MM4 Gips POR 1 MM3 lt 0 8 Because the use of PANDN destroys the mask created
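As an illustrative sketch of the 2-way muxing construct, the routine below computes r = (x > y) ? a : b for packed 32-bit integers, written for the inline assembler of Microsoft Visual C. The operand layout (quadwords a, b, x, y, r in memory) is an assumption for this example; PAND is applied before PANDN, so destroying the mask with PANDN is harmless and no extra copy of the mask is needed.

  /* Hedged sketch of the 2-way mux r = (x > y) ? a : b using MMX instructions. */
  void mux2(const __int64 *a, const __int64 *b,
            const __int64 *x, const __int64 *y, __int64 *r)
  {
      __asm {
          mov      eax, x
          mov      edx, y
          movq     mm2, [eax]       ; mm2 = x
          pcmpgtd  mm2, [edx]       ; mask = (x > y) ? 0xffffffff : 0
          mov      eax, a
          mov      edx, b
          movq     mm0, [eax]       ; mm0 = a
          pand     mm0, mm2         ; a & mask
          pandn    mm2, [edx]       ; b & ~mask (destroys the mask last)
          por      mm0, mm2         ; r = (x > y) ? a : b
          mov      eax, r
          movq     [eax], mm0
          emms                      ; leave MMX state clean for x87 code
      }
  }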
187. high bandwidth buses of the AMD Athlon processor Guidelines are listed in order of importance Memory Size and Alignment Issues Avoid Memory Size Mismatches Avoid memory size mismatches when instructions operate on the same data For instructions that store and reload the same data keep operands aligned and keep the loads stores of each operand the same size The following code examples result in a store to load forwarding STLF stall Example 1 Avoid MOV DWORD PTR FOO EAX MOV DWORD PTR 41 EDX FLD QWORD PTR FOO Avoid large to small mismatches as shown in the following code Example 2 Avoid FST QWORD PTR FOO MOV EAX DWORD PTR FOO MOV EDX DWORD PTR F0044 Memory Size and Alignment Issues 45 AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 Align Data Where Possible In general avoid misaligned data references All data whose size is a power of 2 is considered aligned if it is naturally aligned For example m QWORD accesses are aligned if they access an address divisible by 8 m DWORD accesses are aligned if they access an address divisible by 4 m WORD accesses are aligned if they access an address divisible by 2 m TBYTE accesses are aligned if they access an address divisible by 8 A misaligned store or load operation suffers a minimum one cycle penalty in the AMD Athlon processor load store pipeline In addition using misalign
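As a small illustrative helper (not from this manual), natural alignment of a pointer for a power-of-two access size can be checked as follows.

  /* Hedged helper: returns nonzero if p is naturally aligned for an access of
     the given power-of-two size (2 for WORD, 4 for DWORD, 8 for QWORD/TBYTE). */
  int is_naturally_aligned(const void *p, unsigned long size)
  {
      return ((unsigned long)p & (size - 1UL)) == 0;
  }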
ic

Functions that are not used outside the file in which they are defined should always be declared static, which forces internal linkage. Otherwise, such functions default to external linkage, which might inhibit certain optimizations with some compilers, for example aggressive inlining.

Dynamic Memory Allocation Consideration

Dynamic memory allocation (malloc in C language) should always return a pointer that is suitably aligned for the largest base type (quadword alignment). Where this aligned pointer cannot be guaranteed, use the technique shown in the following code to make the pointer quadword aligned, if needed. This code assumes the pointer can be cast to a long.

Example:
  double *p;
  double *np;

  p  = (double *)malloc(sizeof(double) * number_of_doubles + 7L);
  np = (double *)((((long)(p)) + 7L) & (-8L));

Then use np instead of p to access the data. p is still needed in order to deallocate the storage.

Introduce Explicit Parallelism into Code

Where possible, long dependency chains should be broken into several independent dependency chains which can then be executed in parallel, exploiting the pipeline execution units. This is especially important for floating-point code, whether it is mapped to x87 or 3DNow instructions, because of the longer latency of floating-point op
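The following sketch illustrates the idea for a simple summation loop; the four partial sums form independent dependency chains that the execution units can process in parallel. (This example is illustrative only; note that reassociating floating-point additions can change rounding slightly.)

  /* Break one long dependency chain into four independent chains. */
  double sum_array(const double *a, int n)
  {
      double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
      int i;
      for (i = 0; i < n - 3; i += 4) {
          s0 += a[i];       /* four independent accumulators */
          s1 += a[i + 1];
          s2 += a[i + 2];
          s3 += a[i + 3];
      }
      for ( ; i < n; i++)   /* handle remaining elements */
          s0 += a[i];
      return (s0 + s1) + (s2 + s3);
  }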
189. if control structure Although the branch would be easily predicted the extra instructions and decode limitations imposed by branching are saved which are usually well worth it 22 Use Const Type Qualifier AMDA 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization Generalization for Multiple Constant Control Code To generalize this further for multiple constant control code some more work may have to be done to create the proper outer loop Enumeration of the constant cases will reduce this to a simple switch statement Example 2 FOP CI nace if CONSTANTO DoWorkO i does not affect CONSTANTO or CONSTANTI else DoWorkl i does not affect CONSTANTO or CONSTANTI if CONSTANTI DoWork2 i does not affect CONSTANTO or CONSTANTI else DoWork3 i does not affect CONSTANTO or CONSTANTI The above loop should be transformed into Lj SS I 002 define combine cl c2 0 CONSTANT1 20 1 switch combine CONSTANTO case combine 0 0 for i DoWorkO DoWork2 abes d 2 case combine 1 0 for i DoWork1 DoWork2 lt case combine 0 1 FORE 1 DoWorkO DoWork3 cde Sa 2 x break Generic Loop Hoisting 23 AMD AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 case combine 1
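The following is a compact, self-contained sketch of the same transformation. The DoWork0...DoWork3 routines and the encoding of the combined constants are illustrative assumptions, kept consistent with the if/else structure shown above (CONSTANT0 selects DoWork0 versus DoWork1, CONSTANT1 selects DoWork2 versus DoWork3).

  extern void DoWork0(int), DoWork1(int), DoWork2(int), DoWork3(int);

  #define COMBINE(c0, c1) (((c0) << 1) | (c1))

  /* constant0 and constant1 are loop-invariant and assumed to be 0 or 1. */
  void process(int constant0, int constant1, int maxsize)
  {
      int i;
      switch (COMBINE(constant0, constant1)) {   /* invariant tests hoisted out of the loop */
      case COMBINE(0, 0):
          for (i = 0; i < maxsize; i++) { DoWork1(i); DoWork3(i); }
          break;
      case COMBINE(0, 1):
          for (i = 0; i < maxsize; i++) { DoWork1(i); DoWork2(i); }
          break;
      case COMBINE(1, 0):
          for (i = 0; i < maxsize; i++) { DoWork0(i); DoWork3(i); }
          break;
      case COMBINE(1, 1):
          for (i = 0; i < maxsize; i++) { DoWork0(i); DoWork2(i); }
          break;
      }
  }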
190. ift factor s to perform division on unsigned 32 bit integers by constant divisor Code is written for the Microsoft Visual C compiler In divisor 1 lt d lt 2 31 d odd Qut lgorithm multiplier shift factor o lI algorithm 0 MOV EDX dividend MOV EAX m MUL EDX SHR EDX s EDX quotient Derivation of Multiplier Used for Integer Division by Constants 93 AMDA AMD Athlon Processor x86 Code Optimization algorithm 1 MOV EDX dividend MOV EAX m MUL EDX ADD EAX m ADC EDX 0 SHR EDX s EDX quotient typedef unsigned __int64 U64 typedef unsigned long U32 U32 d lj S mi as rn U64 m low m high j k U32 log2 U32 i 032 t 0 i lt lt 1 while i j gt gt 1 t return t Generate m s for algorithm O0 Montgomery P L Division by Multiplication SIGPLAN Notices 61 1 log2 d 1 j U64 COxf f FF f f f k U64 1 lt lt 3241 m low U64 1 lt lt 32 1 m high U64 1 while m_low gt gt 1 lt low m_low gt gt 1 high m high lt lt 1 1 1 if m high gt gt 32 0 0032 m high 5 1 422500 lt lt 32 1 k m high gt gt 1 8 22007E 0 November 1999 Based on Granlund T Integers using June 1994 page 1 0 gt 0 94 Derivation of Multiplier Used for In
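To make the multiplier/shift-factor output concrete, the well-known pair for a divisor of 10 under algorithm 0 is m = 0xCCCCCCCD with s = 3, i.e., the 64-bit product is shifted right by 32 + 3 = 35 bits. A C rendering of that one case (an illustration, not part of the utility listed above) is:

  /* Unsigned division by the constant 10 via reciprocal multiplication
     (algorithm 0: no intermediate add needed). */
  unsigned long div10(unsigned long n)
  {
      return (unsigned long)(((unsigned __int64)n * 0xCCCCCCCD) >> 35);
  }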
191. in MMO x MM1 r out MM1 res MOVQ MM5 mabs MOVQ MM6 one PAND MMO MM5 PCMPGTD MM6 MMO MOVQ MM4 2 PFSUB MM4 1 PANDN MM6 MM4 PFMAX MM1 MM6 mask to clear sign bit 1 0 z abs x 1 0 pi 2 a Zt mum qu ius pi 2 r res z lt 1 pi 2 r Replace Branches with Computation in 3DNow Code 63 AMDA AMD Athlon Processor x86 Code Optimization Example 5 C code define PI 3 14159265358979323 float x y xa ya nr res 22007E 0 November 1999 gt int XS df xs lt 0 1 xa fabs x ya fabs y df xa lt ya if xs 4 df res PI 2 r else if xs res PI r else if df res PI 2 r else res r 3DNow code sin MMO 1 MM2 out MMO MOVQ MOVQ MOVQ PAND PAND PAND MOVQ PCMPGTD PSLLD MOVQ PXOR MOVQ PXOR PSRAD PANDN PFSUB POR PFADD 7 MM6 5 7 1 MM2 MM6 MM6 MM6 5 7 MM3 5 MM6 MM6 MM6 MMO MMO res sgn sgn mabs MM2 MM5 MM5 MM1 MM2 MM7 MM6 npio2 MM3 MM5 MM3 MM7 MM6 mask mask mask 5 y df df 5 to extract sign bit to extract sign bit to clear sign bit sign x abs y abs x xa ya Oxffffffff 0 bit 31 xs df 0x80000000 0 1 2 5 se SXS 2 pr df sar
192. inal PFMUL MM2 MMO Y W X W 108 Use 3DNow Instructions for Fast Division AMD 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization Pipelined Pair of 24 Bit Precision Divides This divide operation executes with a total latency of 21 cycles assuming that the program hides the latency of the first MOVD MOVQ instructions within preceding code Example MOVQ MMO LDIVISORS y PFRCP MM1 MMO 1 x 1 x approximate MOVQ MM2 MMO y PUNPCKHDQ MMO MMO y y PFRCP MMO MMO l y 1 approximate PUNPCKLDQ MM1 MMO 1 1 x approximate MOVQ MMO DIVIDENDS z w PFRCPIT1 MM2 1 l y 1 x intermediate PFRCPIT2 MM2 MM1 l y 1 x final PFMUL MMO MM2 Z y Newton Raphson Reciprocal Consider the quotient q An on chip ROM based table lookup can be used to quickly produce a 14 to 15 bit precision approximation of 2 using just one PFRCP instruction A full 24 bit precision reciprocal can then be quickly computed from this approximation using a Newton Raphson algorithm The general Newton Raphson recurrence for the reciprocal is as follows 1141 Zi 9 2 De Zi Given that the initial approximation is accurate to at least 14 bits and that a full IEEE single precision mantissa contains 24 bits just one Newton Raphson iteration is required The following sequence shows the 3DNow instructions that produce the initial reciproca
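Expressed in scalar C, the Newton-Raphson reciprocal step above is simply the following; x0 is assumed to be the 14- to 15-bit accurate approximation delivered by PFRCP.

  /* One Newton-Raphson step for the reciprocal of b: x1 = x0 * (2 - b*x0).
     With a >= 14-bit approximation x0, the result is accurate to roughly
     single precision (24 bits). */
  float recip_nr_step(float b, float x0)
  {
      return x0 * (2.0f - b * x0);
  }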
193. ing range size and alignment rule m Each defined memory range must have a size equal to 2 11 n 36 m The base address for the address pair must be aligned to a similar 2 boundary An example of a variable MTRR pair is as follows To map the address range from 8 Mbytes 0080 0000h to 16 Mbytes 00FF FFFFh as writeback memory the base register should be loaded with 80 0006h and the mask should be loaded with FFF8 00800h 184 Page Attribute Table PAT AMDA 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization MTRR MSR Format This table defines the model specific registers related to the memory type range register implementation All MTRRs are defined to be 64 bits Table 18 MTRR Related Model Specific Register MSR Map Register Address Register Name Description OFEh MTRRcap See MTRR Capability Register Format on page 174 200h MTRR BaseO See MTRRphysBasen Register Format on page 183 201 MTRR Masko See MTRRphysMaskn Register Format on page 184 202h MTRR Basel 203h MTRR Mask1 204h MTRR Base2 205h MTRR Mask2 206h MTRR Base3 207h MTRR 5 3 208h MTRR Base4 209h MTRR Mask4 20Ah MTRR Base5 20Bh MTRR Mask5 20Ch MTRR Base6 20Dh MTRR Maske 20Eh MTRR Base7 20Fh MTRR Mask7 250h MTRRFIX64Kk 00000 258h MTRRFIX16k 80000 259h MTRRFIX16k 0000 268h MTRRFIX4k 0000 269h MTRRFIX4k_C800
194. ions respectively The PerfEvtSel 3 0 registers are located at MSR locations C001 0000h to C001 0003h The PerfCtr 3 0 registers located at MSR locations C001 0004h to C0001 0007h and are 64 byte registers The PerfEvtSel 3 0 registers can be accessed using the RDMSR WRMSR instructions only when operating at privilege level 0 The PerfCtr 3 0 MSRs can be read from any privilege level using the RDPMC read performance monitoring counters instruction if the PCE flag in CR4 is set PerfEvtSel 3 0 MSRs MSR Addresses 001 0000 001 0005 The PerfEvtSel 3 0 MSRs shown in Figure 11 control the operation of the performance monitoring counters with one register used to set up each counter These MSRs specify the events to be counted how they should be counted and the privilege levels at which counting should take place The functions of the flags and fields within these MSRs are as are described in the following sections 31 30 29 28 27 26 25 24 25 2221 20 19 18 17 16 15 14 15 12 1110 9 8 7 6 5 4 3 2 E V Reserved Symbol Description Bit USR User Mode 16 OS Operating System Mode 17 E Edge Detect 18 PC Pin Control 19 NT APIC Interrupt Enable 20 EN Enable Counter 22 NV nvert Mask 23 Figure 11 PerfEvtSel 3 0 Registers Event Select Field These bits are used to select the event to be monitored See Bits 0 7 Table 11 on page 164 for a list of event masks and their 8
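As an illustrative sketch (not from this manual), once the operating system has set the PCE flag in CR4, a program can read PerfCtr0 with RDPMC roughly as follows; the counter index is passed in ECX and the 64-bit count is returned in EDX:EAX. If the inline assembler does not accept the RDPMC mnemonic, the opcode bytes 0Fh 33h can be emitted instead.

  /* Hedged sketch: read performance counter 0 with RDPMC (requires CR4.PCE = 1). */
  unsigned __int64 read_perfctr0(void)
  {
      unsigned long lo, hi;
      __asm {
          mov   ecx, 0      ; select PerfCtr0
          rdpmc             ; EDX:EAX = counter value
          mov   lo, eax
          mov   hi, edx
      }
      return ((unsigned __int64)hi << 32) | lo;
  }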
195. is advantageous to schedule loads and stores that can calculate their addresses quickly ahead of loads and stores that require the resolution of a long dependency chain in order to generate their addresses Consider the following code examples 72 Avoid Address Generation Interlocks AMDA 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization Example 1 Avoid ADD EBX ECX inst 1 MOV EAX DWORD PTR 10h inst 2 fast address calc MOV ECX DWORD PTR EAX EBX inst 3 slow address calc MOV EDX DWORD PTR 24h this load is stalled from accessing data cache due to long latency for generating address for inst 3 Example 2 Preferred ADD EBX ECX inst 1 MOV EAX DWORD PTR 10h sinist MOV EDX DWORD PTR 24h place load above inst 3 to avoid address generation interlock stall MOV ECX DWORD PTR EAX EBX inst 3 Use MOVZX and MOVSX Use the MOVZX and MOVSX instructions to zero extend and sign extend byte size and word size operands to doubleword length For example typical code for zero extension creates a superset dependency when the zero extended value is used as in the following code Example 1 Avoid XOR EAX EAX MOV AL MEM Example 2 Preferred MOVZX EAX BYTE PTR MEM Minimize Pointer Arithmetic in Loops Minimize pointer arithmetic in loops especially if the loop body is small In this case the pointer arithmetic would cause significant over
196. isual C is suitable for moving filling a quadword Code aligned block of data in the following situations m AMD Athlon processor specific code where the destination of the block copy is in non cacheable memory space m AMD Athlon processor specific code where the destination of the block copy is in cacheable space but no immediate data re use of the data at the destination is expected Example 2 block copy source and destination QWORD aligned asm mov eax src ptr mov edx dst ptr mov ecx blk size shr ecx 6 align 16 xfer nc prefetchnta 2561 ovq 0 eax add edx 64 ovq 1 eax 8 add eax 64 ovq 2 eax 48 ovntq edx 64 0 ovq 0 eax 40 ovntq edx 56 1 ovq 1 eax 32 ovntq edx 48 2 0 4 2 eax 24 ovntq edx 40 0 ovq 0 eax 16 ovntq 321 1 1 eax 8 ovntq edx 24 2 ovntq edx 16 0 dec ecx ovntq edx 8 mmi jnz xfer nc femms sfence Use MMX Instructions for Block Copies and Block Fills 117 AMDA AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 block fill destination QWORD aligned asm mov edx dst ptr mov ecx size shr ecx 6 movq mmO fill data align 16 fill nc ovntq edx mm0 ovntq Ledx 8 mmO ovntq 161 mmO ovntq Ledx 24 mmO ovntq 321 mmO ovntq 401 mmO ovntq 481 mmO ovn
197. ither not contiguous and ascending or fills byte 7 All other memory types for stores that go through the write buffer UC and WP cannot be combined Combining is able to continue until interrupted by one of the conditions listed in Table 9 on page 158 When combining is interrupted one or more bus commands are issued to the system for that write buffer as described by Table 10 on page 159 Write Combining Operations 157 AMD AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 Table9 Write Combining Completion Events Event Comment The first non WB write to a different cache block address doses combining for previous writes WB writes do not affect write combining Only one line sized buffer can be open for write combining at a time Once a buffer is closed for write combining it cannot be reopened for write combining Any IN INS or OUT OUTS instruction closes combining The 1 0 Read or Write implied memory type for all IN OUT instructions is UC which cannot be combined Non WB write outside of current buffer Any serializing instruction closes combining These instructions include MOVCRx MOVDRx WRMSR INVD INVLPG WBINVD LGDT LLDT LIDT LTR CPUID IRET RSM INIT HALT Flushing instructions Any flush instruction causes the WC to complete Serializing instructions Any instruction or processor operation that requires a cache Locks or bus lock closes write combining b
198. ization 22007E 0 November 1999 If an argument out of range is detected a range reduction subroutine is invoked which reduces the argument to less than 2 63 before the instruction is attempted again While an argument gt 2 63 is unusual it often indicates a problem elsewhere in the code and the code may completely fail in the absence of a properly guarded trigonometric instruction For example in the case of FSIN or FCOS generated from a sin or cos function invocation in the HLL the downstream code might reasonably expect that the returned result is in the range 1 1 A naive solution for guarding a trigonometric instruction may check the C2 bit in the FPU status word after each FSIN FCOS FPTAN and FSINCOS instruction and take appropriate action if it is set indicating an argument out of range Example 1 Avoid FLD QWORD PTR x argument FSIN compute sine FSTSW store FPU status word to AX TEST AX 0400h is the C2 bit set JZ in range nope argument was in range all OK CALL reduce range reduce argument in ST 0 to lt 2 63 FSIN compute sine in range argument guaranteed in range Such a solution is inefficient since the FSTSW instruction is serializing with respect to all x87 3DNow MMxX instructions and should thus be avoided see the section Floating Point Compare Instructions on page 98 Use of FSTSW in the above fashion slows down the common path through the code Instead it
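A high-level sketch of that alternative, checking the argument once before the operation instead of testing the status word afterwards, is shown below; reduce_range() is a hypothetical helper standing in for the range-reduction subroutine mentioned above.

  #include <math.h>

  extern double reduce_range(double x);   /* hypothetical range-reduction helper */

  /* Hedged sketch: guard a sine computation by testing the argument against
     2^63 up front, so no FSTSW is needed on the common path. */
  double safe_sin(double x)
  {
      if (fabs(x) >= 9223372036854775808.0)   /* 2^63 */
          x = reduce_range(x);
      return sin(x);
  }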
199. l srcl imag result result result real result imag result real lt srcO real srcl real srcO imag srcl imag result imag lt srcO real srcl imag srcO imag srcl real Example 1 21 344i gt result real result imag result real lt 1 3 2 4 5 result imag lt 1 4i 2i 3 10i result 5 101 Assuming that complex numbers are represented as two element vectors v real v imag one can see the need for swapping the elements of 1 to perform the multiplies for result imag and the need for a mixed positive negative accumulation to complete the parallel computation of result real and result imag PSWAPD performs the swapping of elements for src1 and PFPNACC performs the mixed positive negative accumulation to complete the computation The code example below summarizes the computation of a complex number multiply Example MMO sO imag sO real reg hi reg_lo 1 sl imag sl real PSWAPD MM2 MMO M2 s0 real s0 imag PFMUL MMO 1 0 sO imag sl imag 50 1 51 1 PFMUL MM1 MM2 1 sO real sl imag sO imag sl real PFPNACC MMO 1 MO res imag res real PSWAPD supports independent source and result operands and enables PSWAPD to also perform a copy function In the above example this eliminates the need for a separate MOVQ MM2 MMO instruction 126 Complex Number Arithmetic AMDA 22007E 0 November 1999 AMD Athlon Processor x86
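For reference, the scalar computation that the PSWAPD/PFPNACC sequence vectorizes is the textbook complex multiply shown below.

  /* Scalar reference for the complex multiply performed by the 3DNow sequence:
     (a.re + a.im*i) * (b.re + b.im*i). */
  typedef struct { float re, im; } complex_f;

  complex_f cmul(complex_f a, complex_f b)
  {
      complex_f r;
      r.re = a.re * b.re - a.im * b.im;
      r.im = a.re * b.im + a.im * b.re;
      return r;
  }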
200. l allocate on a miss and will allocate to 3 S _ state if returned with ReadDataShared command M state if returned with a ReadDataDirty command Writes allocate to the M state if the read allows the line to be marked E MTRR Capability The MTRR capability register is a read only register that Register Format defines the specific MTRR capability of the processor and is defined as follows 65 11109 8 7 0 F P VCNT X Reserved Symbol Description Bits WC Write Combining Memory Type 10 FIX Fixed Range Registers 8 VCNT No of Variable Range Registers 7 0 Figure 13 MTRR Capability Register Format For the AMD Athlon processor the MTRR capability register should contain 0508h write combining fixed MTRRs supported and eight variable MTRRs defined 174 Memory Type Range Register MTRR Mechanism AMD 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization MTRR Default Type Register Format The MTRR default type register 1s defined as follows 65 11109 8 732 1 0 Reserved Symbol Description Bits E MTRRs Enabled 11 FE Fixed Range Enabled 10 Type Default Memory Type 7 0 Figure 14 MTRR Default Type Register Format E MTRRSs are enabled when set All MTRRs both fixed and variable range are disabled when clear and all of physical memory is mapped as uncacheable memory reset state 0 FE Fixed range MTRRs are enabled when set All MTRRs a
201. l approximation compute the full precision reciprocal from the approximation and finally complete the desired divide of Xo PFRCP b PFRCPIT2 X1 Xo q PFMUL a The 24 bit final reciprocal value is X5 In the AMD Athlon processor 3DNow technology implementation the operand X contains the correct round to nearest single precision reciprocal for approximately 99 of all arguments Use 3DNow Instructions for Fast Division 109 AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 Use 3DNow Instructions for Fast Square Root and Reciprocal Square Root 3DNow instructions can be used to compute a very fast highly accurate square root and reciprocal square root Optimized 15 Bit Precision Square Root This square root operation can be executed in only 7 cycles assuming a program hides the latency of the first MOVD instruction within previous code The reciprocal square root operation requires four less cycles than the square root operation Example MOVD MMO MEM 0 a PFRSQRT MM1 MMO l sqrt a 1 sqrt a approximate PUNPCKLDQ MMO MMO j 8 8 MMX instr PFMUL MMO MM1 Sqrt a sqrt a Optimized 24 Bit Precision Square Root This square root operation can be executed in only 19 cycles assuming a program hides the latency of the first MOVD instruction within previous code The reciprocal square root operatio
le 1 (Avoid):
  SHLD  REG1, REG2, 1

(Preferred):
  SHR   REG2, 31
  LEA   REG1, [REG1*2 + REG2]

Example 2 (Avoid):
  SHLD  REG1, REG2, 2

(Preferred):
  SHR   REG2, 30
  LEA   REG1, [REG1*4 + REG2]

Example 3 (Avoid):
  SHLD  REG1, REG2, 3

(Preferred):
  SHR   REG2, 29
  LEA   REG1, [REG1*8 + REG2]

Use 8-Bit Sign-Extended Immediates

Using 8-bit sign-extended immediates improves code density with no negative effects on the AMD Athlon processor. For example, ADD BX, -5 should be encoded as "83 C3 FB" and not as "81 C3 FF FB".

Use 8-Bit Sign-Extended Displacements

Use 8-bit sign-extended displacements for conditional branches. Using short, 8-bit sign-extended displacements for conditional branches improves code density with no negative effects on the AMD Athlon processor.

Code Padding Using Neutral Code Fillers

Occasionally a need arises to insert neutral code fillers into the code stream, e.g., for code alignment purposes or to space out branches. Since this filler code can be executed, it should take up as few execution resources as possible, not diminish decode density, and not modify any processor state other than advancing the instruction pointer (EIP). A one-byte padding can easily be achieved using the NOP instruction (XCHG EAX, EAX), opcode 0x90. In the x86 architecture there are several multi-byte NOP instructions availab
203. le that do not change processor state other than EIP MOV REG REG XCHG REG REG CMOVcc REG REG SHR REG 0 SAR REG 0 SHL REG 0 SHRD REG REG 0 SHLD REG REG 0 LEA REG REG LEA REG REG 00 LEA REG REG 1 00 LEA REG REG 00000000 LEA REG REG 1 00000000 Not all of these instructions are equally suitable for purposes of code padding For example SHLD SHRD are microcoded which reduces decode bandwidth and takes up execution resources Use 8 Bit Sign Extended Displacements 39 AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 Recommendations for the AMD Athlon Processor For code that is optimized specifically for the AMD Athlon processor the optimal code fillers are NOP instructions opcode 0x90 with up to two REP prefixes 0xF3 In the AMD Athlon processor a NOP with up to two REP prefixes can be handled by a single decoder with no overhead As the REP prefixes are redundant and meaningless they get discarded and NOPs are handled without using any execution resources The three decoders of AMD Athlon processor can handle up to three NOPs each with up to two REP prefixes each in a single cycle for a neutral code filler of up to nine bytes Note When used as a filler instruction REP REPNE prefixes can be used in conjunction only with NOPs REP REPNE has undefined behavior when used with instructions other than a NOP If a larger amount of code padding is
204. lon K6 3DNow and combinations thereof K86 and Super7 are trademarks and AMD K6 is a registered trademark of Advanced Micro Devices Inc Microsoft Windows and Windows NT are registered trademarks of Microsoft Corporation is a trademark and Pentium is a registered trademark of Intel Corporation Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies AMDA 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization Contents Revision History lt 2 poU DAN hea XV 1 Introduction 1 About this Docamett ee pus 1 AMD Athlon Processor 3 AMD Athlon Processor Microarchitecture Summary 4 2 Top Optimizations 7 Optimization Star Sud eps babe eMe cacao 8 Group I Optimizations Essential Optimizations 8 Memory Size and Alignment 5 8 Use the 3DNow PREFETCH and PREFETCHW ItSt UCEIOHS ta are re ed ED sc a 8 Select DirectPath Over VectorPath Instructions 9 Group II Optimizations Secondary Optimizations 9 Load Execute Instruction 9 Take Advantage of Write Combining 10 Use 3DNow Instructions 10 Avoid Branches Dependent on Random Data
205. ly Work 83 Repeated String Instruction 84 Latency of Repeated String Instructions 84 Guidelines for Repeated String Instructions 84 Use Instruction to Clear Integer Registers 86 Efficient 64 Bit Integer 86 Efficient Implementation of Population Count Function 91 Derivation of Multiplier Used for Integer Division by 6 RENE ad Sak NT AGAR 93 Unsigned Derivation for Algorithm Multiplier and Shift Factoren he ht ey ets PII 93 vi Contents AMDA 22007E 0 November 1999 Signed Derivation for Algorithm Multiplier and Shift Factors eter 9 Floating Point Optimizations Ensure FPU Data is Use Multiplies Rather than Use FFREEP Macro to Pop One Register from the FPU Stack Floating Point Compare Instructions Use the FXCH Instruction Rather than FST FLD Pairs Avoid Using Extended Precision AMD Athlon Processor x86 Code Optimization Minimize Floating Point to Integer Conversions 100 Floating Point Subexpression Elimination 103 Check Argument Range of Trigonome
206. m 111 xxx DirectPath SAR mreg8 1 Doh 11 111 xxx DirectPath SAR mem8 1 Doh mm 111 xxx DirectPath SAR mreg16 32 1 Dih 11 111 xxx DirectPath SAR mem16 32 1 Dih mm 111 xxx DirectPath SAR mreg8 CL D2h 11 111 xxx DirectPath SAR mem8 CL D2h mm 111 xxx DirectPath SAR mreg16 32 CL D3h 11 111 xxx DirectPath SAR mem16 32 CL D3h mm 111 xxx DirectPath SBB mreg8 reg8 18h 11 xxx xxx DirectPath SBB meme 8 18h mm xxx xxx DirectPath 202 Instruction Dispatch and Execution Resources AMDA 22007E 0 November 1999 Table 19 Integer Instructions Continued AMD Athlon Processor x86 Code Optimization Instruction Mnemonic b pin js M pon SBB mreg16 32 2 19h 11 xxx xxx DirectPath SBB mem16 32 reg16 32 19h mm xxx xxx DirectPath SBB reg8 mreg8 1Ah 11 xxx xxx DirectPath SBB reg8 mem8 1Ah mm xxx xxx DirectPath SBB reg16 32 mreg16 32 1Bh 11 xxx xxx DirectPath SBB 16 32 mem16 32 1Bh mm xxx xxx DirectPath SBB AL imm8 1Ch DirectPath SBB EAX imm16 32 1Dh DirectPath SBB mreg8 imm8 80h 11 011 xxx DirectPath SBB imm8 80h mm 011 xxx DirectPath SBB mreg16 32 imm16 32 8lh 11 011 xxx DirectPath SBB mem16 32 imm16 32 81h mm 011 xxx DirectPath SBB mreg16 32 imm8 sign extended 83h 11 011 xxx DirectPath SBB mem16 32 imm8 sign extended 83h mm 011 xxx
207. m xxx xxx DirectPath FADD FMUL PCMPGTB mmreg1 mmreg2 OFh 64h 11 xxx xxx DirectPath FADD FMUL PCMPGTB mmreg mem64 OFh 64h mm xxx xxx DirectPath FADD FMUL PCMPGTD mmreg1 mmreg2 OFh 66h 11 xxx xxx DirectPath FADD FMUL PCMPGTD mmreg mem64 OFh 66h mm xxx xxx DirectPath FADD FMUL PCMPGTW mmreg1 mmreg2 OFh 65h 11 xxx xxx DirectPath FADD FMUL PCMPGTW mmreg mem64 OFh 65h mm xxx xxx DirectPath FADD FMUL PMADDWD mmreg1 mmreg2 OFh F5h 11 xxx xxx DirectPath FMUL PMADDWD mmreg mem64 OFh F5h mm xxx xxx DirectPath FMUL PMULHW mmreg1 mmreg2 OFh E5h 11 xxx xxx DirectPath FMUL PMULHW mmreg mem64 OFh E5h mm xxx xxx DirectPath FMUL PMULLW mmreg1 mmreg2 OFh D5h 11 xxx xxx DirectPath FMUL PMULLW mmreg mem64 OFh D5h mm xxx xxx DirectPath FMUL POR mmreg1 mmreg2 OFh EBh 11 xxx xxx DirectPath FADD FMUL POR mmreg mem64 OFh EBh mm xxx xxx DirectPath FADD FMUL PSLLD mmreg1 mmreg2 OFh F2h 11 xxx xxx DirectPath FADD FMUL PSLLD mmreg mem64 OFh F2h mm xxx xxx DirectPath FADD FMUL PSLLD mmreg imm8 Oh 72h 11 110 xxx DirectPath FADD FMUL PSLLQ mmreg1 mmreg2 OFh Fah 11 xxx xxx DirectPath FADD FMUL PSLLQ mmreg mem64 OFh F3h mm xxx xxx DirectPath FADD FMUL PSLLQ mmreg imm8 OFh 73h 11 110 xxx DirectPath FADD FMUL PSLLW 1 mmreg2 OFh Fih 11 xxx xxx DirectPath FADD FMUL PSLLW mmreg mem64 OFh Fih mm xxx xxx DirectPath FADD FMUL PSLLW mmreg i
208. m8 imm8 SHR mem16 32 imm8 TEST mreg8 imm16 32 SHR 8 1 TEST mem8 imm16 32 SHR 1 WAIT SHR mreg16 32 1 XCHG EAX EAX SHR mem16 32 1 mreg8 reg8 SHR mregg CL reg8 SHR mem8 CL mreg16 32 reg16 32 SHR mreg16 32 CL mem16 32 reg16 32 SHR mem16 32 CL XOR reg8 mreg8 STC XOR reg8 mem8 SUB mreg8 reg8 6816 32 mreg16 32 DirectPath Instructions 225 AMD AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 Table 25 DirectPath Integer Instructions Continued Instruction Mnemonic XOR reg16 32 mem16 32 XOR AL imm8 XOR EAX imm16 32 mreg8 imm8 mem8 imm8 XOR mreg16 32 imm16 32 mem 16 32 imm16 32 mreg16 32 imm sign extended XOR mem16 32 imm8 sign extended 226 DirectPath Instructions 22007E 0 November 1999 Table 26 DirectPath MMX Instructions Instruction Mnemonic EMMS AMD Athlon Processor x86 Code Optimization MOVD mmreg mem32 Instruction Mnemonic PCMPEQD mmreg mem64 MOVD mem32 mmreg PCMPEQW mmreg1 mmreg2 MOVQ mmreg1 mmreg2 PCMPEQW mmreg mem64 MOVQ mmreg mem64 PCMPGTB mmregl mmreg2 MOVQ mmreg2 mmreg1 PCMPGTB mmreg mem64
209. mance and compiler optimizations fight each other Depending on the compiler and the specific source code it is therefore possible that pointer style code will be compiled into machine code that is faster than that generated from equivalent array style code It is advisable to check the performance after any source code transformation to see whether performance indeed increased Example 1 Avoid typedef struct float 2 VERTEX typedef struct float m 41 4 MATRIX void XForm float res const float v const float m int numverts float dp int const VERTEX vv VERTEX v for i 0 gt numverts i dp vv gt x m dp vv gt y m dp vv gt z m dp vv 2 gt w m res dp write transformed x dp VV Xx MEF vv gt y m dp vv gt z m dp vv gt w m rest dp write transformed y dp vv gt x m dp vv gt y m dp vv gt z m dp vv gt w m 16 Use Array Style Instead of Pointer Style Code AMDA 22007E 0 November 1999 resct dp VV gt X vv gt y vv gt Z VV gt W dp dp dp dp restt dp n THVV m 16 Example 2 Preferred typedef struct float x y Z w VERTEX typedef struct float 1 AMD Athlon Processor x86 Code Optimization write transformed z m m
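The contrast can be reduced to a minimal sketch (illustrative only, separate from the XForm example): the pointer-style loop maintains three induction variables, while the array-style loop maintains one.

  /* Pointer style: three pointers are advanced every iteration. */
  void add_ptr(const float *a, const float *b, float *c, int n)
  {
      int i;
      for (i = 0; i < n; i++) {
          *c++ = *a++ + *b++;
      }
  }

  /* Array style: a single loop counter indexes all three arrays. */
  void add_array(const float *a, const float *b, float *c, int n)
  {
      int i;
      for (i = 0; i < n; i++) {
          c[i] = a[i] + b[i];
      }
  }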
210. mem16 32 word or doubleword memory location mem32 48 doubleword or 6 byte memory location mem48 48 bit integer value in memory mem64 64 bit value in memory imm8 16 32 8 bit 16 bit or 32 bit immediate value disp8 8 bit displacement value Instruction Dispatch and Execution Resources 187 AMDA AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 disp16 32 16 bit or 32 bit displacement value disp32 48 32 bit or 48 bit displacement value eXX register width depending on the operand size mem32real 32 bit floating point value in memory mem64real 64 bit floating point value in memory mem amp 0real 80 bit floating point value in memory mmreg MMX 3DNow register mmreg1 MMX 3DNow register defined by bits 5 4 and 3 of the modR M byte m mmreg2 MMX 3DNow register defined by bits 2 1 and 0 of the modR M byte The second and third columns list all applicable encoding opcode bytes The fourth column lists the modR M byte used by the instruction The modR M byte defines the instruction as register or memory form If mod bits 7 and 6 are documented as mm memory form mm can only be 10b 016 or 00b The fifth column lists the type of instruction decode DirectPath or VectorPath see DirectPath Decoder on page 133 and VectorPath Decoder on page 133 for more information The AMD Athlon processor enhanced decode logic can process three instructions per clock The FPU
211. ment ADD EAX EDX sali 0111 MOV ECX 4 c MAXSIZE 4 EAX write result to c INC ECX increment index JNZ add loop until index 0 Push Memory Data Carefully Carefully choose the best method for pushing memory data To reduce register pressure and code dependencies follow example 2 below Example 1 Avoid MOV EAX MEM PUSH Example 2 Preferred PUSH MEM Push Memory Data Carefully 75 AMD AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 76 Push Memory Data Carefully AMD 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization Integer Optimizations This chapter describes ways to improve integer performance through optimized programming techniques The guidelines are listed in order of importance Replace Divides with Multiplies Replace integer division by constants with multiplication by the reciprocal Because the AMD Athlon processor has a very fast integer multiply 5 9 cycles signed 4 8 cycles unsigned and the integer division delivers only one bit of quotient per cycle 22 47 cycles signed 17 41 cycles unsigned the equivalent code is much faster The user can follow the examples in this chapter that illustrate the use of integer division by constants or access the executables in the opt utilities directory in the AMD documentation CD ROM order 21860 to find alternative code for dividing by a constant Multiplication by
212. mm8 OFh 71h 11 110 xxx DirectPath FADD FMUL Instruction Dispatch and Execution Resources 209 AMDA AMD Athlon Processor x86 Code Optimization Table 20 MMX Instructions Continued 22007E 0 November 1999 Notes 1 Bits 2 1 and 0 of the modR M byte select the integer register Instruction Mnemonic An pd M pong FPU Pipe s Notes PSRAW mmreg1 mmreg2 OFh Eth 11 xxx xxx DirectPath FADD FMUL PSRAW mmreg mem64 OFh Eth mm xxx xxx DirectPath FADD FMUL PSRAW mmreg imm8 OFh 71h 11 100 xxx DirectPath FADD FMUL PSRAD mmreg1 mmreg2 OFh Eh 11 xxx xxx DirectPath FADD FMUL PSRAD mmreg mem64 OFh E2h mm xxx xxx DirectPath FADD FMUL PSRAD mmreg imm8 OFh 72h 11 100 xxx DirectPath FADD FMUL PSRLD mmreg1 mmreg2 OFh D2h 11 xxx xxx DirectPath FADD FMUL PSRLD mmreg mem64 OFh D2h mm xxx xxx DirectPath FADD FMUL PSRLD mmreg imm8 OFh 72h 11 010 xxx DirectPath FADD FMUL PSRLQ mmreg1 mmreg2 OFh D3h 11 xxx xxx DirectPath FADD FMUL PSRLQ mmreg mem64 OFh D3h mm xxx xxx DirectPath FADD FMUL PSRLQ mmreg imm8 OFh 73h 11 010 xxx DirectPath FADD FMUL PSRIW 1 mmreg2 OFh Dih 11 xxx xxx DirectPath FADD FMUL PSRLW mmreg mem64 OFh Dih mm xxx xxx DirectPath FADD FMUL PSRLW mmreg imm8 OFh 71h 11 010 xxx DirectPath FADD FMUL PSUB
213. mmreg mem16 imm8 OFh C4h VectorPath PMAXSW mmreg1 mmreg2 OFh EEh 11 xxx xxx DirectPath FADD FMUL PMAXSW mmreg mem64 OFh EEh mm xxx xxx DirectPath FADD FMUL PMAXUB mmreg1 mmreg2 OFh DEh 11 xxx xxx DirectPath FADD FMUL PMAXUB mmreg mem64 OFh DEh mm xxx xxx DirectPath FADD FMUL PMINSW mmreg1 mmreg2 OFh EAh 11 xxx xxx DirectPath FADD FMUL Notes 1 For the PREFETCHNTA TO T 1 72 instructions the mem value refers to an address in the 64 byte line that will be prefetched Instruction Dispatch and Execution Resources 211 AMD 1 AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 Table 21 MMX Extensions Continued Instruction Mnemonic Prefix First ModR M Decode F Notes Byte s Byte Byte Type Pipe s PMINSW mmreg mem64 OFh EAh mm xxx xxx DirectPath FADD FMUL PMINUB mmreg1 mmreg2 OFh DAh 11 xxx xxx DirectPath FADD FMUL PMINUB mmreg mem64 OFh mm xxx xxx DirectPath FADD FMUL PMOVMSKB reg32 mmreg OFh D7h VectorPath PMULHUW 1 mmreg2 OFh E4h 11 xxx xxx DirectPath FMUL PMULHUW mmreg mem64 OFh E4h mm xxx xxx DirectPath FMUL PSADBW mmregl mmreg2 OFh F6h 11 xxx xxx DirectPath FADD PSADBW mmreg mem64 OFh Feh mm xxx xxx DirectPath FADD PSHUFW mmreg1 mmreg2 imm8 oOFh 70h DirectPath FADD FMUL PSHUFW mmreg mem64
214. mreg8 OFh 9Ah 11 xxx xxx DirectPath SETP SETPE mem8 OFh 9Ah mm xxx xxx DirectPath SETNP SETPO mreg8 OFh 9Bh 11 xxx xxx DirectPath SETNP SETPO mem8 OFh 9Bh mm xxx xxx DirectPath SETL SETNGE mreg8 OFh 9Ch 11 xxx xxx DirectPath SETL SETNGE mem8 OFh 9Ch mm xxx xxx DirectPath SETGE SETNL 8 OFh 9Dh 11 xxx xxx DirectPath SETGE SETNL mem8 OFh 9Dh mm xxx xxx DirectPath SETLE SETNG mreg8 OFh 9Eh 11 xxx xxx DirectPath SETLE SETNG mem8 OFh 9Eh mm xxx xxx DirectPath SETG SETNLE mreg8 OFh 9Fh 11 xxx xxx DirectPath SETG SETNLE mem8 OFh 9Fh mm xxx xxx DirectPath SGDT mem48 OFh mm 000 xxx VectorPath SIDT mem48 OFh mm 001 xxx VectorPath SHL SAL mreg8 imm8 11 100 DirectPath SHL SAL mem8 imm8 mm 100 xxx DirectPath SHL SAL mreg16 32 imm8 Cih 11 100 xxx DirectPath SHL SAL mem16 32 imm8 Cih mm 100 xxx DirectPath SHL SAL 1 Doh 11 100 xxx DirectPath SHL SAL mem8 1 Doh mm 100 xxx DirectPath SHL SAL mreg16 32 1 Dih 11 100 xxx DirectPath SHL SAL mem16 32 1 Dih mm 100 xxx DirectPath SHL SAL mreg8 CL D2h 11 100 xxx DirectPath SHL SAL 8 CL D2h mm 100 xxx DirectPath SHL SAL mreg16 32 CL D3h 11 100 xxx DirectPath SHL SAL mem16 32 CL D3h mm 100 xxx DirectPath SHR mreg8 imm8 11 101 DirectPath SHR mem8 imm8 mm 101 xxx DirectPath SHR mreg16 32 imm8 Cih 11 101 xxx DirectPath
216. n addition the procedure checks the MSR and TSC flags returned to register EDX by the CPUID instruction to determine if the MSRs and the RDTSC instruction are supported 168 Event and Time Stamp Monitoring Software AMDA 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization The initialization and start counters procedure sets the PerfEvtSel0 and or PerfEvtSel1 MSRs for the events to be counted and the method used to count them and initializes the counter MSRs PerfCtr 3 0 to starting counts The stop counters procedure stops the performance counters See Starting and Stopping the Performance Monitoring Counters on page 168 for more information about starting and stopping the counters The read counters procedure reads the values in the PerfCtr 3 0 MSRs and a read time stamp counter procedure reads the time stamp counter These procedures can be used instead of enabling the RDTSC and RDPMC instructions which allow application code to read the counters directly Monitoring Counter Overflow The AMD Athlon processor provides the option of generating a debug interrupt when a performance monitoring counter overflows This mechanism is enabled by setting the interrupt enable flag in one of the PerfEvtSel 3 0 MSRs The primary use of this option is for statistical performance sampling To use this option the operating system should do the following m Provide an interrupt routine for handling the c
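Where direct reads are permitted, a minimal sketch of such a read-time-stamp-counter procedure is shown below; RDTSC returns the 64-bit time stamp in EDX:EAX.

  /* Hedged sketch of a read-time-stamp-counter procedure using RDTSC. */
  unsigned __int64 read_tsc(void)
  {
      unsigned long lo, hi;
      __asm {
          rdtsc               ; EDX:EAX = time stamp counter
          mov   lo, eax
          mov   hi, edx
      }
      return ((unsigned __int64)hi << 32) | lo;
  }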
217. n requires four less cycles than the square root operation Example MOVD MMO MEM 0 a PFRSQRT MM1 MMO l sqrt a l sqrt a approx MOVQ MM2 1 1 0 l sqrt a approx PFMUL 1 1 X_0 0 0 0 step 1 PUNPCKLDQ MMO a a MMX instr PFRSQIT1 1 MMO intermediate step 2 PFRCPIT2 1 MM2 l sqrt a l sqrt a step 3 PFMUL MMO MM1 sqrt a sqrt a 110 Use 3DNow Instructions for Fast Square Root and AMDA 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization Newton Raphson Reciprocal Square Root The general Newton Raphson reciprocal square root recurrence 1s 144 1 2 e Z gt 3 b e 212 To reduce the number of iterations the initial approximation read from a table The 3DNow reciprocal square root approximation is accurate to at least 15 bits Accordingly to obtain a single precision 24 bit reciprocal square root of an input operand b one Newton Raphson iteration is required using the following sequence of 3DNow instructions Xo PFRSQRT b X1 PFMUL Xg Xp PFRSQIT1 b X4 PFRCPIT2 X2 Xg X4 PFMUL b Xa The 24 bit final reciprocal square root value is X5 In the AMD Athlon processor 3DNow implementation the estimate contains the correct round to nearest value for approximately 8776 of all arguments The remaining arguments differ from the correct round to nearest value by one unit in the
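Expressed in scalar C, the Newton-Raphson reciprocal-square-root step quoted above is the following; z0 is assumed to be the 15-bit accurate approximation delivered by PFRSQRT.

  /* One Newton-Raphson step for the reciprocal square root of b:
     z1 = 0.5 * z0 * (3 - b*z0*z0). */
  float rsqrt_nr_step(float b, float z0)
  {
      return 0.5f * z0 * (3.0f - b * z0 * z0);
  }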
218. ng Point Pipeline Operation Types 150 Table 5 Floating Point Decode Types 150 Table 6 Load Store Unit Stages 151 Table 7 Sample 1 Integer Register Operations 153 Table 8 Sample 2 Integer Register and Memory Load ODGFaH ossi Sid dcm Re ah ete daa d 154 Table 9 Write Combining Completion Events 158 Table 10 AMD Athlon System Bus Commands Generation 159 Table 11 Performance Monitoring Counters 164 Table 12 Memory Type 174 Table 13 Standard MTRR Types 176 Table 14 PATi 3 Bit 178 Table 15 Effective Memory Type Based on PAT and reru 0 amen PA NE 179 Table 16 Final Output Memory 180 Table 17 MTRR Fixed Range Register Format 182 Table 18 MTRR Related Model Specific Register G MSR X amsa a IR de x ed 185 Table 19 Integer 188 Table 20 MMX Instructions 208 Table 21 MMX Extensions 211 Table 22 Floating Point 5 212 Table 23 3DNow 217 Table 24 3DNow Ext
219. ng restriction m The store data is from a high byte register AH BH CH DH Avoid the type of code shown in the following example Example 7 Avoid MOV EAX 10h MOV EAX BH high byte store MOV DL EAX load cannot forward from high byte store Store to Load Forwarding Restrictions 53 AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 One Supported Store There is one case of a mismatched store to load forwarding that to Load Forwarding is supported by the by AMD Athlon processor The lower 32 bits Case from an aligned QWORD write feeding into a DWORD read is allowed Example 8 Allowed MOVQ AlignedQword mmO MOV EAX AlignedQword Summary of Store to Load Forwarding Pitfalls to Avoid To avoid store to load forwarding pitfalls code should conform to the following guidelines m Maintain consistent use of operand size across all loads and stores Preferably use doubleword or quadword operand sizes m Avoid misaligned data references m Avoid narrow to wide and wide to narrow forwarding cases When using word or byte stores avoid loading data from anywhere in the same doubleword of memory other than the identical start addresses of the stores Stack Alignment Considerations Make sure the stack 1s suitably aligned for the local variable with the largest base type Then using the technique described in C Language Structure Component Considerations on
ng the number of integer divisions is multiple divisions, in which division can be replaced with multiplication as shown in the following examples. This replacement is possible only if no overflow occurs during the computation of the product. This can be determined by considering the possible ranges of the divisors.

Example 1 (Avoid):
  int i, j, k, m;

  m = i / j / k;

Example 2 (Preferred):
  int i, j, k, m;

  m = i / (j * k);

Copy Frequently De-referenced Pointer Arguments to Local Variables

Avoid frequently de-referencing pointer arguments inside a function. Since the compiler has no knowledge of whether aliasing exists between the pointers, such de-referencing cannot be optimized away by the compiler. This prevents data from being kept in registers and significantly increases memory traffic.

Note that many compilers have an "assume no aliasing" optimization switch. This allows the compiler to assume that two different pointers always have disjoint contents and does not require copying of pointer arguments to local variables. Otherwise, copy the data pointed to by the pointer arguments to local variables at the start of the function and, if necessary, copy them back at the end of the function.
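A minimal sketch of the copy-to-local technique (illustrative only) is shown below; reading the pointed-to value once lets the compiler keep it in a register for the whole loop.

  /* Copy data reached through a pointer argument into a local at function
     entry, so possible aliasing cannot force repeated reloads from memory. */
  void scale(float *data, const float *factor, int n)
  {
      float f = *factor;      /* de-reference once */
      int i;
      for (i = 0; i < n; i++)
          data[i] = data[i] * f;
  }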
221. nrolled version of the loop Example 1 rolled loop for k lo k lt hi k inc X k 2 Example 2 partially unrolled loop for k lo k lt hi fac 1 inc k fac inc xL k 6 xLEeCfac 1 8ne handle end cases k k k lt hi k inc x k 70 Unrolling Loops AMDA 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization Use Function Inlining Overview Make use of the AMD Athlon processor s large 64 Kbyte instruction cache by inlining small routines to avoid procedure call overhead Consider the cost of possible increased register usage which can increase load store instructions for register spilling Function inlining has the advantage of eliminating function call overhead and allowing better register allocation and instruction scheduling at the site of the function call The disadvantage is decreasing code locality which can increase execution time due to instruction cache misses Therefore function inlining is an optimization that has to be used judiciously In general due to its very large instruction cache the AMD Athlon processor is less susceptible than other processors to the negative side effect of function inlining Function call overhead on the AMD Athlon processor can be low because calls and returns are executed at high speed due to the use of prediction mechanisms However there is still overhead
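A minimal C sketch of the partial loop unrolling described above, by a factor of two and with the end case handled explicitly (the array and bounds are illustrative):

  /* Partial unrolling by two; the leftover iteration (odd trip count) is
     handled after the main loop. */
  void scale_by_two(double *x, int lo, int hi)
  {
      int k;
      for (k = lo; k < hi - 1; k += 2) {   /* two iterations per pass */
          x[k]     = x[k]     * 2.0;
          x[k + 1] = x[k + 1] * 2.0;
      }
      for ( ; k < hi; k++)                 /* handle end case */
          x[k] = x[k] * 2.0;
  }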
222. o maximize the number of instructions that are filled into the instruction byte queue while preventing I cache space in branch intensive code Use Short Instruction Lengths Assemblers and compilers should generate the tightest code possible to optimize use of the I cache and increase average decode rate Wherever possible use instructions with shorter lengths Using shorter instructions increases the number of instructions that can fit into the instruction byte queue For example use 8 bit displacements as opposed to 32 bit displacements In addition use the single byte format of simple integer instructions whenever possible as opposed to the 2 byte opcode ModR M format Example 1 Avoid 81 CO 78 56 34 12 add eax 12345678h uses 2 byte opcode form with ModR M 81 FB FF FF FF add ebx 5 uses 32 bit immediate OF 84 05 00 00 00 722 1 uses 2 byte opcode 32 bit immediate 36 Align Branch Targets in Program Hot Spots AMD 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization Example 2 Preferred 05 78 56 34 12 add eax 12345678h uses single byte opcode form 83 C3 FB add ebx 5 uses 8 bit sign extended immediate 74 05 jz labell uses 1 opcode 8 bit immediate Avoid Partial Register Reads and Writes In order to handle partial register writes the AMD Athlon processor execution core implements a data merging scheme In the execution unit an instruction
223. ocessor x86 Code Optimization Integer Execution 135 Floating Point 136 Floating Point Execution Unit 137 Load Store Unit LSU 138 L2 Cache Controller igs ties Pa Ree v TOU 139 Write Combining Uwe Baden te ke 139 AMD Athlon System Bus 139 Appendix B Pipeline and Execution Unit Resources Overview 141 Fetch and Decode Pipeline Stages 141 Integer Pipeline Stages 144 Floating Point Pipeline Stages 146 Execution Unit Resources 2 E save dpi reed dea 148 Terminology usi re Tow da ERREUR ERR bud e RR d 148 Integer Pipeline Operations 149 Floating Point Pipeline Operations 150 Load Store Pipeline 151 Code Sample 152 Appendix Implementation of Write Combining 155 Introductions SPST TUR eA Ta AUS 155 Write Combining Definitions and Abbreviations 156 What is Write 156 Programming Details 156 Write Combining 157 Sending Write Buffer
224. on When allocating space for local variables and or outgoing parameters within a procedure adjust the stack pointer and use moves rather than pushes This method of allocation allows random access to the outgoing parameters so that they can be set up when they are calculated instead of being held somewhere else until the procedure call In addition this method reduces ESP dependencies and uses fewer execution resources 128 Dependenaes AMDA 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization Appendix A AMD Athlon Processor Microarchitecture Introduction When discussing processor design it is important to understand the following terms architecture microarchitecture and design implementation The term architecture refers to the instruction set and features of a processor that are visible to software programs running on the processor The architecture determines what software the processor can run The architecture of the AMD Athlon processor is the industry standard x86 instruction set The term microarchitecture refers to the design techniques used in the processor to reach the target cost performance and functionality goals The AMD Athlon processor microarchitecture is a decoupled decode execution design approach In other words the decoders essentially operate independent of the execution units and the execution core uses a small number of instructions and simplified circuit
225. ouble rate local bus interface Large split 128 Kbyte level one L1 cache Dedicated backside level two L2 cache Instruction predecode and branch detection during cache line fills Decoupled decode execution core Three way x86 instruction decoding Dynamic scheduling and speculative execution Three way integer execution Three way address generation Three way floating point execution 3DNow technology and MMX M Gsingle instruction multiple data SIMD instruction extensions Super data forwarding Deep out of order integer and floating point execution Register renaming Dynamic branch prediction The AMD Athlon processor communicates through a next generation high speed local bus that is beyond the current Socket 7 or Super7 bus standard The local bus can transfer data at twice the rate of the bus operating frequency by using both the rising and falling edges of the clock see AMD Athlon System Bus on page 139 for more information To reduce on chip cache miss penalties and to avoid subsequent data load or instruction fetch stalls the AMD Athlon processor has a dedicated high speed backside L2 cache The large 128 Kbyte L1 on chip cache and the backside L2 cache allow the 4 AMD Athlon Processor Microarchitecture Summary AMD 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization AMD Athlon execution core to achieve and sustain maximum performance As a decoupled decode execution
226. ounter overflow as an APIC interrupt m Provide an entry in the IDT that points to a stub exception handler that returns without executing any instructions m Provide an event monitor driver that provides the actual interrupt handler and modifies the reserved IDT entry to point to its interrupt routine When interrupted by a counter overflow the interrupt handler needs to perform the following actions m Save the instruction pointer EIP register code segment selector TSS segment selector counter values and other relevant information at the time of the interrupt m Reset the counter to its initial setting and return from the interrupt Monitoring Counter Overflow 169 AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 An event monitor application utility or another application program can read the collected performance information of the profiled application 170 Monitoring Counter Overflow AMDA 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization Appendix E Programming the MTRR and PAT Introduction The AMD Athlon processor includes a set of memory type and range registers MTRRs to control cacheability and access to specified memory regions The processor also includes the Page Address Table for defining attributes of pages This chapter documents the use and capabilities of this feature The purpose of the MTRRs is to provide system soft
227. oves integer arithmetic and makes efficient use of the integer execution units in the AMD Athlon processor Chapter 9 Floating Point Optimizations Describes optimizations that makes maximum use of the superscalar and pipelined floating point unit FPU of the AMD Athlon processor Chapter 10 3DNow and MMX Optimizations Describes guidelines for Enhanced 3DNow and MMX code optimization techniques Chapter 11 General x86 Optimizations Guidelines Lists generic optimizations techniques applicable to x86 processors Appendix A AMD Athlon Processor Microarchitecture Describes in detail the microarchitecture of the AMD Athlon processor 2 About this Document AMDA 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization Appendix B Pipeline and Execution Unit Resources Overview Describes in detail the execution units and its relation to the instruction pipeline Appendix C Implementation of Write Combining Describes the algorithm used by the AMD Athlon processor to write combine Appendix D Performance Monitoring Counters Describes the usage of the performance counters available in the AMD Athlon processor Appendix E Programming the MTRR and PAT Describes the steps needed to program the Memory Type Range Registers and the Page Attribute Table Appendix F Instruction Dispatch and Execution Resources Lists the instruction s execution resource usage Appendix G DirectPath versus VectorPa
228. p8 E3h VectorPath JO near disp16 32 OFh 80h DirectPath JNO near disp16 32 Oh 8th DirectPath JB JNAE near disp16 32 OFh 82h DirectPath JNB JAE near disp16 32 OFh 83h DirectPath JZ JE near disp16 32 OFh 84h DirectPath JNZ JNE near disp16 32 OFh 85h DirectPath JBE JNA near disp16 32 OFh 86h DirectPath JNBE JA near disp16 32 OFh 87h DirectPath JS near disp16 32 OFh 88h DirectPath JNS near disp 2 OFh 89h DirectPath Instruction Dispatch and Execution Resources 195 AMDA AMD Athlon Processor x86 Code Optimization Table 19 Integer Instructions Continued 22007E 0 November 1999 Mnemodic First Second ModR M Decode Byte Byte Byte Type JP JPE near disp16 32 OFh 8Ah DirectPath JNP JPO near disp16 32 OFh 8Bh DirectPath JL JNGE near disp16 32 OFh 8Ch DirectPath JNL JGE near disp16 32 OFh 8Dh DirectPath JLE JNG near disp16 32 OFh DirectPath JNLE JG near disp16 32 OFh 8Fh DirectPath JMP near disp16 32 direct E9h DirectPath JMP far disp32 48 direct EAh VectorPath JMP disp8 short EBh DirectPath JMP far mem32 indirect EFh mm 101 xxx VectorPath JMP far mreg32 indirect FFh mm 101 xxx VectorPath JMP near mreg16 32 indirect FFh 11 100 xxx DirectPath JMP near mem16 32 indirect FFh mm 100 xxx DirectPath LAHF 9Fh VectorPath LAR 16 32 mreg1
229. page 55 all variables can be properly aligned with no padding Extend to 32 Bits Function arguments smaller than 32 bits should be extended to Before Pushing onto 32 bits before being pushed onto the stack which ensures that Stack the stack is always doubleword aligned on entry to a function If a function has no local variables with a base type larger than doubleword no further work is necessary If the function does have local variables whose base type is larger than a doubleword additional code should be inserted to ensure proper alignment of the stack For example the following code achieves quadword alignment 54 Stack Alignment Considerations AMD 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization Example Preferred Prolog PUSH EBP MOV EBP ESP SUB ESP SIZE LOCALS size of local variables AND 7 8 push registers that need to be preserved Epilog pop register that needed to be preserved MOV ESP EBP POP EBP RET With this technique function arguments can be accessed via EBP and local variables can be accessed via ESP In order to free EBP for general use it needs to be saved and restored between the prolog and the epilog Align TBYTE Variables on Quadword Aligned Addresses Align variables of type TBYTE on quadword aligned addresses In order to make an array of TBYTE variables that are aligned array elements are 16 bytes apart In general TBYTE variables
230. page 142 show the AMD Athlon processor instruction fetch and decoding pipeline stages The pipeline consists of one cycle for instruction fetches and four cycles of instruction alignment and decoding The three ports in stage 5 provide a maximum bandwidth of three MacroOPs per cycle for dispatching to the instruction control unit ICU Fetch and Decode Pipeline Stages 141 AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 3 ecode MacroOps Decode Quadword f x Queue x FETCH SCAN ALIGN1 ALIGN2 EDEC MECTL MEROM MEDEC 1 2 3 4 5 6 Figure 5 Fetch Scan Align Decode Pipeline Hardware The most common x86 instructions flow through the DirectPath pipeline stages and are decoded by hardware The less common instructions which require microcode assistance flow through the VectorPath Although the DirectPath decodes the common x86 instructions it also contains VectorPath instruction data which allows it to maintain dispatch order at the end of cycle 5 VectorPath Figure 6 Fetch Scan Align Decode Pipeline Stages 142 Fetch and Decode Pipeline Stages AMDA 22007E 0 November 1999 Cycle 1 FETCH Cycle 2 SCAN Cycle 3 DirectPath ALIGNI Cycle 3 VectorPath MECTL Cycle 4 DirectPath ALIGN2 Cycle 4 VectorPath MEROM Cycle 5 DirectPath EDEC Cycle 5 VectorPath
231. precision control to single precision (versus the Win32 default of double precision) lowers the latency of those operations. The Microsoft Visual C environment provides functions to manipulate the FPU control word and thus the precision control. Note that these functions are not very fast, so changes of precision control should be inserted where they create little overhead, such as outside a computation-intensive loop. Otherwise, the overhead created by the function calls outweighs the benefit from reducing the latencies of divide and square root operations.

The following example shows how to set the precision control to single precision and later restore the original settings in the Microsoft Visual C environment.

Example

/* prototype for the _controlfp function */
#include <float.h>

unsigned int orig_cw;

/* get the current FPU control word and save it */
orig_cw = _controlfp(0, 0);

/* set precision control in the FPU control word to single precision;
   this reduces the latency of divide and square root operations */
_controlfp(_PC_24, _MCW_PC);

/* restore the original FPU control word */
_controlfp(orig_cw, 0xfffff);

Avoid Unnecessary Integer Division

Integer division is the slowest of all integer arithmetic operations and should be avoided wherever possible. One possibility for reducing
232. processor the AMD Athlon processor makes use of a proprietary microarchitecture which defines the heart of the AMD Athlon processor With the inclusion of all these features the AMD Athlon processor is capable of decoding issuing executing and retiring multiple x86 instructions per cycle resulting in superior scaleable performance The AMD Athlon processor includes both the industry standard SIMD integer instructions and the 3DNow SIMD floating point instructions that were first introduced in the AMD K69 2 processor The design of 3DNow technology was based on suggestions from leading graphics and independent software vendors ISVs Using SIMD format the AMD Athlon processor can generate up to four 32 bit single precision floating point results per clock cycle The 3DNow execution units allow for high performance floating point vector operations which can replace x87 instructions and enhance the performance of 3D graphics and other floating point intensive applications Because the 3DNow architecture uses the same registers as the MMX instructions switching between MMX and 3DNow has no penalty The AMD Athlon processor designers took another innovative step by carefully integrating the traditional x87 floating point and 3DNow execution units into one operational engine With the introduction of the AMD Athlon processor the switching overhead between x87 MMX and 3DNow technology is virtually eliminated The
233. r 118 Use MMX PCMPEQD to Set AII Bits in an MMX Register 119 Use MMX PAND to Find Absolute Value in 3DNow Code 119 Optimized Matrix Multiplication 119 Efficient 3D Clipping Code Computation Using 3DNow mee es 122 Use 3DNow PAVGUSB for MPEG 2 Motion Compensation 123 Stream of Packed Unsigned Bytes 125 Complex Number Arithmetic 126 11 General x86 Optimization Guidelines 127 Short FOUIDS o oe due Sek eee AO OX dr qud Rec 6 in 127 Dependencies SO UR REC E 128 Register Operands o eee eee ewe eio S OR Y 128 Stack AJ location ruin iyd pe ER USER TAa E PERROS 128 Appendix A AMD Athlon Processor Microarchitecture 129 Introduction u eve o ede PRX v pe e e fedet ON ce ER SCR ed 129 AMD Athlon Processor Microarchitecture 130 superscalar Processor 130 instruction Caches a kt sae du aad Ate gA 131 Predecode a TU RA 132 Branch Predictions deo opm ee Ro ae ede ey 132 Early Deeoding D seeded VR 133 Instruction Control Unit 134 Data Cache s e eae Paced Deng SES EG 134 Integer Scheduler 135 Viii Contents AMDA 22007E 0 November 1999 AMD Athlon Pr
234. r Example 2 Preferred assumes pointers are different and q r void isqrt unsigned long a unsigned long q unsigned long r unsigned long qq rr qq a if gt 0 dq gt rr qq 44 qq rr lt lt 1 pe d 250003 32 Copy Frequently De referenced Pointer Arguments to Local Variables AMDA 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization Instruction Decoding Optimizations Overview This chapter discusses ways to maximize the number of instructions decoded by the instruction decoders in the AMD Athlon processor Guidelines are listed in order of importance The AMD Athlon processor instruction fetcher reads 16 byte aligned code windows from the instruction cache The instruction bytes are then merged into a 24 byte instruction queue On each cycle the in order front end engine selects for decode up to three x86 instructions from the instruction byte queue All instructions x86 x87 3DNow and classified into two types of decodes DirectPath VectorPath see DirectPath Decoder and VectorPath Decoder on page 133 for more information DirectPath instructions are common instructions that are decoded directly in hardware VectorPath instructions are more complex instructions that require the use of a sequence of multiple operations issued from an on chip ROM Up to three DirectPath
235. r PUSH EDI save EDI as per calling convention MOV EDI ECX save divisor hi SHR EDX 1 shift both divisor and dividend right RCR EAX 1 by 1 bit ROR EDI 1 RCR EBX 1 BSR ECX ECX ECX number of remaining shifts SHRD EBX EDI CL scale down divisor and dividend SHRD EDX CL such that divisor is SHR EDX CL less than 2 32 i e fits in EBX ROL EDI 1 restore original divisor hi DIV EBX compute quotient MOV EBX 121 dividend lo 88 Efficient 64 Bit Integer Arithmetic AMDA 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization MOV ECX EAX save quotient IMUL EDI EAX quotient divisor hi word low only MUL DWORD PTR LESP 20 quotient divisor lo word ADD EDX EDI EDX EAX quotient divisor SUB EBX EAX dividend lo quot divisor lo MOV EAX ECX get quotient MOV ECX ESP 16 dividend hi SBB ECX EDX subtract divisor quot from dividend SBB 0 adjust quotient if remainder negative XOR EDX EDX clear hi word of quot EAX lt FFFFFFFFh POP EDI restore EDI as per calling convention POP EBX restore EBX as per calling convention RET done return to caller ulldiv ENDP Example 8 Remainder ullrem divides two unsigned 64 bit integers and returns the remainder INPUT 5 81 1 5 441 dividend ESP 16 ESP 12 divisor OUTPUT EDX EAX remainder of division DESTROYS EAX ECX ED
236. r example due to lack of convergence in iterative algorithms 102 Minimize Floating Point to Integer Conversions AMD 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization Floating Point Subexpression Elimination There are cases which do not require an FXCH instruction after every instruction to allow access to two new stack entries In the cases where two instructions share a source operand an FXCH 1s not required between the two instructions When there is an opportunity for subexpression elimination reduce the number of superfluous FXCH instructions by putting the shared source operand at the top of the stack For example using the function func x y 2 Example 1 Avoid FLD Z FLD Y FLD X FADD ST STC2 FXCH ST 1 FMUL ST ST 2 CALL FUNC FSTP ST 0 Example 2 Preferred FLD Z FLD Y FLD X FMUL STELI ST FADDP ST 2 ST CALL FUNC Check Argument Range of Trigonometric Instructions Efficiently The transcendental instructions FSIN FCOS FPTAN and FSINCOS are architecturally restricted in their argument range Only arguments with a magnitude of lt 2 63 can be evaluated If the argument is out of range the C2 bit in the FPU status word is set and the argument is returned as the result Software needs to guard against such extremely infrequent cases Floating Point Subexpression Elimination 103 AMD Athlon Processor x86 Code Optim
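One simple way to guard a trigonometric argument, as described above, is sketched below in C. The 2^63 threshold comes from the architectural restriction on FSIN/FCOS/FPTAN; the fmod-based reduction is only an illustrative assumption, not the manual's own code, and a production routine would use a higher-precision range reduction.

#include <math.h>

/* Sketch: FSIN, FCOS, and FPTAN only accept |x| < 2^63, so reduce huge
   arguments in software before evaluating the sine.                    */
double safe_sin(double x)
{
    if (fabs(x) >= 9223372036854775808.0) {      /* 2^63 */
        /* crude range reduction modulo 2*pi; illustrative only */
        x = fmod(x, 2.0 * 3.14159265358979323846);
    }
    return sin(x);
}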
237. ... in "Code Padding Using Neutral Code Fillers" on page 39.
- Further clarification of "Use the 3DNow PREFETCH and PREFETCHW Instructions" on page 46.
- Modified examples 1 and 2 of "Unsigned Division by Multiplication of Constant" on page 78.
- Added the optimization "Efficient Implementation of Population Count Function" on page 91.
- Further clarification of "Use FFREEP Macro to Pop One Register from the FPU Stack" on page 98.
- Further clarification of "Minimize Floating Point to Integer Conversions" on page 100.
- Added the optimization "Check Argument Range of Trigonometric Instructions Efficiently" on page 103.
- Added the optimization "Take Advantage of the FSINCOS Instruction" on page 105.
- Further clarification of "Use 3DNow Instructions for Fast Division" on page 108.
- Further clarification of "Use FEMMS Instruction" on page 107.
- Further clarification of "Use 3DNow Instructions for Fast Square Root and Reciprocal Square Root" on page 110.
- Clarified "3DNow and MMX Intra-Operand Swapping" on page 112.
- Corrected PCMPGT information in "Use MMX PCMP Instead of 3DNow PFCMP" on page 114.
- Added the optimization "Use MMX Instructions for Block Copies and Block Fills" on page 115.
- Modified the rule for "Use MMX PXOR to Clear All Bits in an MMX Register" on page 118.
- Modified the rule for "Use MMX PCMPEQD to Set All Bits in an MMX Register" on page 119.
- Added the optimization "Optimized Matrix Multiplication" on page 119.
238. ration Types Category FPU 3DNow MMX Load store or Miscellaneous Operations Execution Unit FSTORE FPU 3DNow MMX Multiply Operation FMUL FPU 3DNow MMX Arithmetic Operation FADD Table5 Floating Point Decode Types x86 Instruction Decode Type OPs FADD ST ST DirectPath FADD FSIN VectorPath various PFACC DirectPath FADD PFRSQRT DirectPath FMUL As shown in Table 4 the FADD register to register instruction generates a single MacroOP targeted for the floating point scheduler FSIN is considered a VectorPath instruction because it is a complex instruction with long execution times as compared to the more common floating point instructions The PFACC instruction is DirectPath decodeable and generates a single MacroOP targeted for the arithmetic operation execution pipeline in the floating point logic Just like PFACC a single MacroOP is early decoded for the 3DNow PFRSQRT instruction but it is targeted for the multiply operation execution pipeline 150 Execution Unit Resources AMDA 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization Load Store Pipeline Operations The AMD Athlon processor decodes any instruction that references memory into primitive load store operations For example consider the following code sample MOV EBX 1 load MacroOP PUSH EAX 1 store MacroOP POP EAX 1 load MacroOP ADD EAX EBX
239. raverses the arrays in a downward direction 1 e from higher addresses to lower addresses whereas the original code in example 1 traverses the arrays in an upward direction Such a change in the direction of the traversal is possible if each loop iteration is completely independent of all other loop iterations as is the case here In code where the direction of the array traversal can t be switched it is still possible to minimize pointer arithmetic by appropriately biasing base addresses and using an index 74 Minimize Pointer Arithmetic in Loops AMD 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization variable that starts with a negative value and reaches zero when the loop expires Note that if the base addresses are held in registers e g when the base addresses are passed as arguments of a function biasing the base addresses requires additional instructions to perform the biasing at run time and a small amount of additional overhead is incurred In the examples shown here the base addresses are used in the displacement portion of the address and biasing is accomplished at compile time by simply modifying the displacement Example 3 Preferred int a MAXSIZE b MAXSIZE c MAXSIZE i for i90 i gt MAXSIZE i e tT e aki obs MOV ECX MAXSIZE initialize index add_loop MOV EAX LECX 4 a MAXSIZE 4 get a element MOV EDX ECX 4 b MAXSIZE 4 get b ele
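The biased-base idea in Example 3 above can also be written directly in C. The fragment below is a sketch (MAXSIZE and the array names simply mirror the example): the index starts negative and the loop ends when it reaches zero, so the bias is folded into the compile-time displacement and no pointer arithmetic is needed inside the loop.

#define MAXSIZE 1024            /* assumed size, as in the example above */

int a[MAXSIZE], b[MAXSIZE], c[MAXSIZE];

void add_arrays(void)
{
    int i;
    /* index runs from -MAXSIZE up to -1; the "+ MAXSIZE" bias becomes
       part of the displacement, matching the assembly version above   */
    for (i = -MAXSIZE; i != 0; i++) {
        c[MAXSIZE + i] = a[MAXSIZE + i] + b[MAXSIZE + i];
    }
}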
240. re disabled when clear. When the fixed range MTRRs are enabled and an overlap occurs with a variable range MTRR, the fixed range MTRR takes priority (reset state = 0).

Type      Defines the default memory type (reset state = 0). See Table 13 for more details.

Table 13. Standard MTRR Types and Properties

Memory Type            Encoding    Internally     Writeback     Allows              Memory Ordering
                       in MTRR     Cacheable      Cacheable     Speculative Reads   Model
Uncacheable (UC)       0           No             No            No                  Strong ordering
Write Combining (WC)   1           No             No            Yes                 Weak ordering
Reserved               2           -              -             -                   -
Reserved               3           -              -             -                   -
Writethrough (WT)      4           Yes            No            Yes                 Speculative ordering
Write Protected (WP)   5           Yes (reads,    No            Yes                 Speculative ordering
                                   no writes)
Writeback (WB)         6           Yes            Yes           Yes                 Speculative ordering
Reserved               7-255       -              -             -                   -

MTRR Overlapping

Note that if two or more variable memory ranges match, then the interactions are defined as follows:

1. If the memory types are identical, then that memory type is used.
2. If one or more of the memory types is UC, the UC memory type is used.
3. If one or more of the memory types is WT and the only other matching memory type is WB, then the WT memory type is used.
4. Otherwise, if the combination of memory types is not listed above, then the behavior of the processor is undefined.
241. rectPath FSTORE FISTP mem 16int DFh mm 011 xxx DirectPath FSTORE FISTP memz2int DBh mm 011 xxx DirectPath FSTORE FISTP mem64int DFh mm 111 xxx DirectPath FSTORE FISUB mem32int DAh mm 100 xxx VectorPath FISUB mem 16int DEh mm 100 xxx VectorPath FISUBR mem32int DAh mm 101 xxx VectorPath FISUBR mem 16int DEh mm 101 xxx VectorPath FLD ST i D9h 11 000 xxx DirectPath FADD FMUL 1 FLD memz2real D9h mm 000 xxx DirectPath FADD FMUL FSTORE FLD memed4real DDh mm 000 xxx DirectPath FADD FMUL FSTORE FLD mem80real DBh mm 101 xxx VectorPath FLD1 D9h E8h DirectPath FSTORE Notes 1 The last three bits of the modR M byte select the stack entry ST i 214 Instruction Dispatch and Execution Resources AMDA 22007E 0 November 1999 Table 22 Floating Point Instructions Continued AMD Athlon Processor x86 Code Optimization Notes 1 The last three bits of the modR M byte select the stack entry ST i Instruction Mnemonic Par Second MOR M F P Note Byte Byte Byte Type Pipe s FLDCW mem16 D9h mm 101 xxx VectorPath FLDENV mem 14byte D9h mm 100 xxx VectorPath FLDENV mem28byte D9h mm 100 xxx VectorPath FLDL2E D9h EAh DirectPath FSTORE FLDL2T D9h E9h DirectPath FSTORE FLDLG2 D9h ECh DirectPath FSTORE FLDLN2 D9h EDh DirectPath FSTORE
242. renaming unit a register renaming unit a scheduler a register file and three parallel execution units Figure 3 shows a block diagram of the dataflow through the FPU Pipeline Stage Figure 3 Floating Point Unit Block Diagram As shown in Figure 3 on page 137 the floating point logic uses three separate execution positions or pipes for superscalar x87 3DNow and MMX operations The first of the three pipes is generally known as the adder pipe FADD and it contains 3DNow add MMX ALU shifter and floating point add execution units The second pipe is known as the multiplier FMUL It contains a 3DNow MMX multiplier reciprocal unit an MMX ALU and a floating point multiplier divider square root unit The third pipe is known as the floating point load store FSTORE which handles floating point constant loads FLDZ FLDPI etc stores FILDs as well as many OP primitives used in VectorPath sequences AMD Athlon Processor Microarchitecture 137 AMDA AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 Load Store Unit LSU Result Buses from Core Figure 4 Load Store Unit The load store unit LSU manages data load and store accesses to the L1 data cache and if required to the backside L2 cache or system memory The 44 entry LSU provides a data interface for both the integer scheduler and the floating point scheduler It consists of two queues a 12 entry queue for L1 cache load an
243. required it is recommended to use a JMP instruction to jump across the padding region The following assembly language macros show this ATHLON TEXTEQU DB 090h OP2 ATHLON TEXTEQU DB OF3h 090h OP3 ATHLON TEXTEQU DB OF3h OF3h 090h ATHLON TEXTEQU DB OF3h OF3h 090h 090h 4 0 OP5 ATHLON TEXTEQU DB OF3h OF3h 090h OF3h 090h OP6 ATHLON TEXTEQU DB OF3h OF3h 090h OF3h OF3h 090h ATHLON TEXTEQU DB OF3h OF3h 090h OF3h OF3h 090h 090h OP8 ATHLON TEXTEQU DB OF3h OF3h 090h OF3h OF3h 090h OF3h 090h OP9 ATHLON TEXTEQU DB OF3h OF3h 090h OF3h OF3h 090h OF3h OF3h 090h 10 ATHLONTEXTEQU DB OEBh 008h 90h 90h 90h 90h 90h 90h 90h 90h 40 Code Padding Using Neutral Code Fillers AMDA 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization Recommendations for AMD K6 Family and AMD Athlon Processor Blended Code On x86 processors other than the AMD Athlon processor including the AMD K6 family of processors the REP prefix and especially multiple prefixes cause decoding overhead so the above technique is not recommended for code that has to run well both on AMD Athlon processor and other x86 processors blended code In such cases the instructions and instruction sequences below are recommended For neutral code fillers longer than eight bytes in length the JM
244. rite Combining Definitions and Abbreviations AMDA 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization signature in register EAX where EAX 11 8 contains the instruction family code For the AMD Athlon processor the instruction family code is six 2 In addition the presence of the MTRRs is indicated by bit 12 and the presence of the PAT extension is indicated by bit 16 of the extended features bits returned in the EDX register by CPUID function 8000 0001h See the AMD Processor Recognition Application Note order 20734 for more details on the CPUID instruction 3 Write combining is controlled by the MTRRs and PAT Write combining should be enabled for the appropriate memory ranges The AMD Athlon processor MTRRs and PAT are compatible with the Pentium II Write Combining Operations In order to improve system performance the AMD Athlon processor aggressively combines multiple memory write cycles of any data size that address locations within a 64 byte write buffer that is aligned to a cache line boundary The data sizes can be bytes words longwords or quadwords WC memory type writes can be combined in any order up to a full 64 byte sized write buffer WT memory type writes can only be combined up to a fully aligned quadword in the 64 byte buffer and must be combined contiguously in ascending order Combining may be opened at any byte boundary in a quadword but is closed by a write that is e
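The three checks described above (the family code in EAX[11:8] of the CPUID signature, and the MTRR and PAT feature bits returned in EDX by extended function 8000_0001h) can be sketched in C as follows. The __get_cpuid helper from GCC's <cpuid.h> is an assumption made only for illustration; the manual itself issues the CPUID instruction directly, and a complete check would also verify the vendor string.

#include <cpuid.h>

/* Sketch of a write-combining capability check (see the text above). */
int write_combining_supported(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    if (((eax >> 8) & 0xF) < 6)        /* instruction family code; 6 = AMD Athlon */
        return 0;

    if (!__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx))
        return 0;

    return ((edx >> 12) & 1) &&        /* bit 12: MTRRs present         */
           ((edx >> 16) & 1);          /* bit 16: PAT extension present */
}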
245. rm multiply vector V by 4x4 transform matrix M for i90 1 lt 4 i 1 0 for j 0 j lt 4 j MESTETISVES T5 Example 2 Preferred 3D transform multiply vector V by 4x4 transform matrix M r 0 1 015 01 M 1 0 V 1 211015 121 ME3 E0 VE3 r 1 1 115 01 111115 111 211115 121 ME3 E1 VE3 r 2 1 215 01 2 11 ME21E2 1 VE2 ME3 E2 VE3 r 3 M 0 3 V 0 111315 11 ME21E3 VE2 ME3 E3 vE3 Avoid Unnecessary Store to Load Dependencies A store to load dependency exists when data is stored to memory only to be read back shortly thereafter See Store to Load Forwarding Restrictions on page 51 for more details The AMD Athlon processor contains hardware to accelerate such store to load dependencies allowing the load to obtain the store data before it has been written to memory However it is still faster to avoid such dependencies altogether and keep the data in an internal register Avoiding store to load dependencies is especially important if they are part of a long dependency chains as might occur in a recurrence computation If the dependency occurs while operating on arrays many compilers are unable to optimize the 18 Completely Unroll Small Loops AMD 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization code in a way that avoids the store to load
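Restated as compilable C, the fully unrolled 3-D transform of Example 2 above looks like the sketch below. The M[j][i] indexing follows the example; treat the listing as illustrative rather than as the manual's exact source.

/* 3-D transform: multiply vector v by the 4x4 transform matrix m,
   with the inner loop fully unrolled as recommended above.        */
void xform(float r[4], const float m[4][4], const float v[4])
{
    r[0] = m[0][0]*v[0] + m[1][0]*v[1] + m[2][0]*v[2] + m[3][0]*v[3];
    r[1] = m[0][1]*v[0] + m[1][1]*v[1] + m[2][1]*v[2] + m[3][1]*v[3];
    r[2] = m[0][2]*v[0] + m[1][2]*v[1] + m[2][2]*v[2] + m[3][2]*v[3];
    r[3] = m[0][3]*v[0] + m[1][3]*v[1] + m[2][3]*v[2] + m[3][3]*v[3];
}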
246. s 85h PC L1 and L2 ITLB misses 86h PC Snoop resyncs 87h PC Instruction fetch stall cycles 88h PC Return stack hits 89h PC Return stack overflow con om Cih FR Retired Ops Coh ER Retired branches conditional unconditional exceptions interrupts Gh FR Retired branches mispredicted C4h FR Retired taken branches C5h FR Retired taken branches mispredicted C6h FR Retired far control transfers C8h FR Retired near returns C9h FR Retired near returns mispredicted CAh ER oe branches with target CDh FR Interrupts masked cycles IF 0 Interrupts masked while pending cycles GER TE ngee CFh FR Number of taken hardware interrupts Doh FR Instruction decoder empty Dih FR Dispatch stalls event masks D2h through DAh below combined D2h FR Branch abort to retire D3h FR Serialize D4h FR Segment load stall 166 Performance Counter Usage AMDA 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization Table 11 Performance Monitoring Counters Continued Nass x Notes Unit Mask bits 15 8 Event Description D5h FR ICU full D6h FR Reservation stations full D7h FR FPU full D8h FR LS full D9h FR All quiet stall DAh FR Far transfer or resync branch pending DCh FR Breakpoint matches for DRO DDh FR Breakpoint matches for DEh FR Breakpoint matches for DR2 DFh FR Breakpoint matches for DR3 PerfCtr 3 0 MSRs MSR Addresses
247. s AMDA 22007E 0 November 1999 Table 19 Integer Instructions Continued AMD Athlon Processor x86 Code Optimization Instruction Mnemonic e gre M pon BT mem16 32 imm8 OFh BAh mm 100 xxx DirectPath BIC mreg16 32 reg16 32 OFh BBh 11 xxx xxx VectorPath BIC mem16 32 16 32 OFh BBh mm xxx xxx VectorPath BIC mreg16 32 imm8 OFh BAh 11 111 xxx VectorPath BIC mem16 32 imm8 OFh BAh mm 111 xxx VectorPath BIR mreg16 32 16 32 OFh B3h 11 xxx xxx VectorPath BIR mem16 32 reg16 32 OFh B3h mm xxx xxx VectorPath mreg16 32 imm8 OFh BAh 11 110 xxx VectorPath BIR mem16 32 imm8 OFh BAh mm 110 xxx VectorPath BTS mreg16 32 reg16 32 OFh ABh 11 xxx xxx VectorPath BTS mem16 32 reg16 32 OFh ABh mm xxx xxx VectorPath BTS mreg16 32 imm8 OFh BAh 11 101 xxx VectorPath BIS mem16 32 imm8 OFh BAh mm 101 xxx VectorPath CALL full pointer 9Ah VectorPath CALL near imm16 32 E8h VectorPath CALL mem16 16 32 FFh 11 011 VectorPath CALL near mreg32 indirect FFh 11 010 xxx VectorPath CALL near mem32 indirect FFh mm 010 xxx VectorPath CBW CWDE 98h DirectPath CLC F8h DirectPath CLD FCh VectorPath CLI FAh VectorPath CLTS OFh 06h VectorPath CMC F5h DirectPath CMOVA CMOVNBE reg16 32 reg16 32 OFh 47h 11 xxx xxx DirectPath CMOVA CMOVNBE reg16 32 mem1
248. sk bit for each valid byte Write Combining Operations 159 AMD AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 160 Write Combining Operations AMDA 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization Appendix D Performance Monitoring Counters Overview This chapter describes how to use the AMD Athlon processor performance monitoring counters The AMD Athlon processor provides four 48 bit performance counters which allows four types of events to be monitored simultaneously These counters can either count events or measure duration When counting events a counter is incremented each time a specified event takes place or a specified number of events takes place When measuring duration a counter counts the number of processor clocks that occur while a specified condition is true The counters can count events or measure durations that occur at any privilege level Table 11 on page 164 lists the events that can be counted with the performance monitoring counters Performance Counter Usage The performance monitoring counters are supported by eight MSRs PerfEvtSel 3 0 are the performance event select MSRs and PerfCtr 3 0 are the performance counter MSRs Overview 161 AMDA AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 These registers can be read from and written to using the RDMSR and WRMSR instruct
249. sor x86 Code Optimization 22007E 0 November 1999 Table8 Sample 2 Integer Register and Memory Load Operations Instruc Decode Decode 568 Num Instruction Pipe Type 1 2 314 5 6 7 8 9 2 1 DEC EDX 0 DP D l E 2 MOV EDI 1 DP 011 8 5 5 3 SUB EAX EDX 20 2 DP amp S 1 E 4 SAR EAX 5 0 DP D l E 5 ADD ECX EDI 4 1 DP D 1 18 5 5 6 AND EBX OxIF 2 DP D E 7 MOV ESI 0x0F 100 0 DP 11181 51 3 15 OR ECX ESHEAX 4 8 1 I 18 5 1 5 E 1 2 Comments for Each Instruction Number The ALU operation executes in IEUO The load operation generates the address in AGU1 and is simultaneously scheduled for the load store pipe in cycle 3 In cycles 4 and 5 the load completes the data cache access The load execute instruction accesses the data cache in tandem with instruction 2 After the load portion completes the subtraction is executed in cycle 6 in IEU2 The shift operation executes in IEUO cycle 7 after instruction 3 completes This operation is stalled on its address calculation waiting for instruction 2 to update EDI The address is calculated in cycle 6 In cycle 7 8 the cache access completes This simple operation executes quickly in IEU2 The address for the load is calculated in cycle 5 in AGUO However the load is not scheduled to access the data cache until cyde 6 The load is blocked for scheduling to access the data c
250. st divisors For example for a divide by 10 operation use the following code if the dividend is less than 40000005h MOV EAX dividend MOV EDX 01999999Ah MUL EDX MOV quotient EDX Signed Division by Multiplication of Constant Algorithm Divisors These algorithms work if the divisor is positive If the divisor is 2 lt 6 gt 231 negative use abs d instead of d and append a NEG EDX to the code The code makes use of the fact that n d n d SIN OUT divisor 2 lt d lt 2 31 algorithm multiplier shift count 5 algorithm 0 MOV EAX m MOV EDX dividend MOV ECX EDX IMUL EDX SHR 31 SAR EDX s ADD EDX ECX quotient in EDX Replace Divides with Multiplies 79 AMDA AMD Athlon Processor x86 Code Optimization Derivation for a m s al MU 2 00 Cy Tp lt lt Iw C gorithm 1 EAX EDX dividend ECX EDX L EDX EDX ECX ECX 31 EDX s EDX ECX 22007E 0 November 1999 quotient in EDX The derivation for the algorithm a multiplier m and shift count s is found in the section Signed Derivation for Algorithm Multiplier and Shift Factor on page 95 Signed Division By 2 IN EAX dividend 0UT EAX quotient CMP EAX 800000000h 1 if dividend gt 0 SBB EAX 1 Increment dividend if it is lt 0 SAR EAX 1 Perform a right shift Signed Division By 2 IN EAX dividend OUT EAX quo
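The divide-by-10 sequence above carries over directly to C when a 64-bit intermediate is available. In the sketch below, uint64_t stands in for the EDX:EAX register pair produced by MUL; the same dividend range restriction quoted in the text applies.

#include <stdint.h>

/* Unsigned divide by 10 via multiplication by the scaled reciprocal
   0x1999999A; the quotient is the high half of the 64-bit product.  */
uint32_t div10(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0x1999999Au) >> 32);   /* "MOV quotient, EDX" */
}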
251. tPath Integer Instructions Continued 22007E 0 November 1999 Instruction Mnemonic ROL mreg8 CL Instruction Mnemonic SBB reg16 32 mreg16 32 ROL meme CL SBB reg16 32 mem16 32 ROL mreg16 32 CL SBB AL imm8 ROL 16 32 CL SBB EAX imm16 32 ROR mreg8 imm8 SBB mreg8 imm8 ROR mem8 imm8 SBB imm8 ROR mreg16 32 imm8 SBB mreg16 32 imm16 32 ROR mem16 32 imm8 SBB mem16 32 imm16 32 ROR mregg 1 SBB mreg16 32 imm8 sign extended ROR 8 1 SBB mem16 32 imm8 sign extended ROR mreg16 32 1 SETO mreg8 ROR mem16 32 1 SETO mem8 ROR mreg8 CL SETNO mreg8 mem CL SETNO mem8 ROR mreg16 32 CL SETB SETC SETNAE mreg8 ROR mem 16 32 CL SETB SETC SETNAE SAR mreg8 imm8 SETAE SETNB SETNC mreg8 SAR mem8 imm8 SETAE SETNB SETNC SAR mreg16 32 imm8 SETE SETZ mreg8 SAR mem 16 32 imm8 SETE SETZ mem8 SAR 1 SETNE SETNZ mreg8 SAR mem8 1 SETNE SETNZ mem8 SAR mreg16 32 1 SETBE SETNA mreg8 SAR mem16 32 1 SETBE SETNA mem8 SAR mreg8 CL SETA SETNBE mreg8 SAR mem8 CL SETA SETNBE mem8 SAR mreg16 32 CL SETS mreg8 SAR mem16 32 CL SETS mem8 SBB reg8 SETNS mreg8 SBB mem8 reg8 SETNS mem8 SBB mreg16 32 reg16 32 S
252. te that the AMD K6 processor does not support the CMOV instruction Therefore blended AMD K6 and AMD Athlon processor code should use examples 3 and 4 Avoid Branches Dependent on Random Data 57 AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 AMD Athlon Processor Specific Code Example 1 Signed integer ABS function X labs X MOV ECX X load value MOV EBX ECX save value NEG ECX value CMOVS ECX EBX if value is negative select value MOV X ECX save labs result Example 2 Unsigned integer min function z x y x y MOV EAX X load X value MOV EBX Y load Y value CMP EAX EBX EBX lt EAX CF 0 1 CMOVNC EAX EBX EAX EBX lt EAX EBX EAX MOV Z EAX save min X Y Blended AMD K6 and AMD Athlon Processor Code Example 3 Signed integer ABS function X labs X MOV ECX X load value MOV EBX ECX save value SAR ECX 31 aX 0 Oxffffffff 0 XOR EBX ECX gt 0 x SUB EBX ECX x lt 0 3 aed MOV X EBX x gt 0 2 x X Example 4 Unsigned integer min function z x y x y MOV EAX x load x MOV EBX y load y SUB EAX EBX 2 gt se NG WE IX SBB ECX ECX aX gt y Oxffffffff 0 AND ECX EAX X 0 ADD ECX EBX AX gt MOV z ECX x Syn TX vy Example 5 Hexadecimal to ASCII conversion 10 gt
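The branch-free idioms of examples 3 and 4 can also be expressed at the C level, which leaves the compiler free to pick CMOV on the AMD Athlon processor or the shift/SBB sequences on earlier processors. This is a sketch only and assumes the usual arithmetic right shift for signed values.

#include <stdint.h>

/* Signed ABS without a branch (compare example 3). */
int32_t iabs(int32_t x)
{
    int32_t m = x >> 31;                     /* 0 if x >= 0, all ones if x < 0 */
    return (x ^ m) - m;
}

/* Unsigned min without a branch (compare example 4). */
uint32_t umin(uint32_t x, uint32_t y)
{
    uint32_t mask = (uint32_t)0 - (x < y);   /* all ones when x < y */
    return y + ((x - y) & mask);
}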
253. teger Division by AMDA 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization Generate m s for algorithm 1 Based on Magenheimer D J et al Integer Multiplication and Division on the HP Precision Architecture IEEE Transactions on Computers Vol 37 No 8 August 1988 page 980 else s log2 d m low U64 1 lt lt 32 50 U64 d r U32 U64 1 lt lt 32 s Z U64 d m gt 6 lt lt 1 1 2 U32 m 1 U32 m_low 1 8 1 Reduce multiplier shift factor for either algorithm to smallest possible while m amp l m gt gt 1 622 Signed Derivation for Algorithm Multiplier and Shift Factor The utility sdiv exe was compiled using the following code Code snippet to determine algorithm a multiplier m and shift count s for 32 bit signed integer division given divisor d Written for Microsoft Visual C compiler IN divisor 2 lt d lt 2 31 QUT algorithm multiplier shift count 5 algorithm 0 MOV EAX m MOV EDX dividend MOV ECX EDX IMUL EDX SHR ECX 31 SAR EDX s ADD EDX ECX quotient in EDX Derivation of Multiplier Used for Integer Division by Constants 95 AMD AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 algorithm 1 MOV EAX MOV EDX dividend MOV ECX EDX IMUL EDX ADD EDX ECX SHR ECX 31 SAR E
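Restating algorithm 0 in C makes the intent of the assembly easier to follow. The sketch below assumes the multiplier m and shift count s were produced by the sdiv utility described above, uses a 64-bit product in place of the IMUL result in EDX:EAX, and assumes arithmetic right shifts for signed values.

#include <stdint.h>

/* Signed division by a constant, algorithm 0: take the high half of m*x,
   shift it right by s, and add 1 when the dividend is negative.          */
int32_t sdiv_const(int32_t x, int32_t m, int s)
{
    int32_t hi = (int32_t)(((int64_t)m * x) >> 32);    /* IMUL EDX: EDX holds the high half */
    return (hi >> s) + (int32_t)((uint32_t)x >> 31);   /* SAR EDX, s ; ADD EDX, ECX         */
}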
254. teger scheduler and a floating point scheduler These two schedulers can simultaneously issue up to nine OPs to the three general purpose integer execution units IEUs three address generation units AGUs and three floating point 3DNow M MMX M execution units The AMD Athlon moves integer instructions down the integer execution pipeline which consists of the integer scheduler and the IEUs as shown in Figure 1 on page 131 Floating point instructions are handled by the floating point execution pipeline which consists of the floating point scheduler and the x87 3DNow MMX execution units 130 AMD Athlon Processor Microarchitecture AMDA 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization Branch Prediction Table Predecode Cache Fetch Decode Control 3 Way x86 Instruction Decoders Instruction Control Unit 72 Entry FPU Stack Map Rename FPU Scheduler 36 Entry FPU Register File 88 Entry FMUL MMX 3DNow Integer Scheduler 18 Entry L2 Cache Controller FSTOR 2 Way 64 Kbyte Data Cache 32 Entry L1 TLB 256 Entry L2 TLB System Interface L2 SRAMs Figure 1 AMD Athlon Processor Block Di
255. th Instructions Lists the x86 instructions that are DirectPath and VectorPath instructions AMD Athlon Processor Family The AMD Athlon processor family uses state of the art decoupled decode execution design techniques to deliver next generation performance with x86 binary software compatibility This next generation processor family advances x86 code execution by using flexible instruction predecoding wide and balanced decoders aggressive out of order execution parallel integer execution pipelines parallel floating point execution pipelines deep pipelined execution for higher delivered operating frequency dedicated backside cache memory and a new high performance double rate 64 bit local bus As an x86 binary compatible processor the AMD Athlon processor implements the industry standard x86 instruction set by decoding and executing the x86 instructions using a proprietary microarchitecture This microarchitecture allows the delivery of maximum performance when running x86 based PC software AMD Athlon Processor Family 3 AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 AMD Athlon Processor Microarchitecture Summary The AMD Athlon processor brings superscalar performance and high operating frequency to PC systems running industry standard x86 software A brief summary of the next generation design features implemented in the AMD Athlon processor is as follows High speed d
256. that consists of a pair of unidirectional 13 bit address and control channels and a bidirectional 64 bit data bus The AMD Athlon system bus supports low voltage swing multiprocessing clock forwarding and fast data transfers The clock forwarding technique is used to deliver data on both edges of the reference clock therefore doubling the transfer speed A four entry 64 byte write buffer is integrated into the BIU The write buffer improves bus utilization by combining multiple writes into a single large write cycle By using the AMD Athlon system bus the AMD Athlon processor can transfer data on the 64 bit data bus at 200 MHz which yields an effective throughput of 1 6 Gbyte per second AMD Athlon Processor Microarchitecture 139 AMD AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 140 AMD Athlon Processor Microarchitecture AMDA 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization Appendix B Pipeline and Execution Unit Resources Overview The AMD Athlon processor contains two independent execution pipelines one for integer operations and one for floating point operations The integer pipeline manages x86 integer operations and the floating point pipeline manages all x87 3DNow and MMX instructions This appendix describes the operation and functionality of these pipelines Fetch and Decode Pipeline Stages Figure 5 on page 142 and Figure 6 on
257. tient CDQ Sign extend into EDX AND EDX 2 n 1 Mask correction use divisor 1 ADD EAX EDX Apply correction if necessary SAR EAX n Perform right shift by log2 divisor Signed Division By 2 IN EAX dividend 0UT EAX quotient CMP EAX 800000000h CY 1 if dividend gt 0 SBB 1 Increment dividend if it is lt 0 SAR EAX 1 Perform right shift NEG EAX Use x 2 2 Signed Division By IN EAX dividend 2 OUT EAX quotient CDQ Sign extend into EDX AND EDX 2 n 1 Mask correction divisor 1 ADD EAX EDX Apply correction if necessary SAR EAX n Right shift by log2 divisor NEG EAX Use x 2 n x 2 n Remainder of Signed IN EAX dividend Integer 2 or 2 00 remainder CDQ Sign extend into EDX AND EDX 1 Compute remainder XOR EAX EDX Negate remainder if SUB EAX EDX Dividend was lt 0 MOV remainder EAX 80 Replace Divides with Multiplies AMDA 22007E 0 November 1999 Remainder of Signed Integer 2 or 2 AMD Athlon Processor x86 Code Optimization IN EAX dividend OUT EAX remainder CDQ Sign extend into EDX AND EDX 2 n 1 Mask correction abs divison 1 ADD EAX EDX Apply pre correction AND EAX 2 n 1 Mask out remainder abs divison 1 SUB EAX EDX Apply pre correction if necessary MOV remainder EAX Use Alternative Code When Multiplying by a Constant A 32 bit integer multiply by a constant h
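The CDQ/AND/ADD/SAR pattern above has a direct C counterpart. The sketch below shows division by 8 (n = 3) and assumes arithmetic right shifts on signed values, as on x86.

#include <stdint.h>

/* Signed division by 2^n (n == 3 here) that rounds toward zero, like IDIV:
   add (2^n - 1) to the dividend only when it is negative, then shift.     */
int32_t sdiv8(int32_t x)
{
    int32_t bias = (x >> 31) & 7;    /* CDQ ; AND EDX, 2^n - 1   */
    return (x + bias) >> 3;          /* ADD EAX, EDX ; SAR EAX, n */
}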
258. ting for an interrupt to be serviced When this flag is set the processor toggles the PMi pins when the counter overflows When this flag is clear the processor toggles the PMi pins and increments the counter when performance monitoring events occur The toggling of a pin is defined as assertion of the pin for one bus clock followed by negation When this flag is set the processor generates an interrupt through its local APIC on counter overflow This flag enables disables the PerfEvtSeln MSR When set performance counting is enabled for this counter When clear this counter is disabled By inverting the Counter Mask Field this flag inverts the result of the counter comparison allowing both greater than and less than comparisons For events which can have multiple occurrences within one clock this field 1s used to set a threshold If the field is non zero the counter increments each time the number of events is Performance Counter Usage 163 AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 greater than or equal to the counter mask Otherwise if this field is zero then the counter increments by the total number of events Table 11 Performance Monitoring Counters Event Source 1 AP Number Unit Notes Unit Mask bits 15 8 Event Description 1xxx xxxxb reserved xxxxb HS xxxxb GS xxxxb FS 20h LS Segment register loads Xxxx
259. torPath DIV AL mema8 F6h mm 110 xxx VectorPath Instruction Dispatch and Execution Resources 193 AMD 1 AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 Table 19 Integer Instructions Continued Mnemonic First Second ModR M Decode Byte Byte Byte Type DIV EAX mreg16 32 F7h 11 110 xxx VectorPath DIV EAX mem16 32 F7h mm 110 xxx VectorPath ENTER C8 VectorPath IDIV mreg8 F6h 11 111 xxx VectorPath IDIV mem8 Feh mm 111 xxx VectorPath IDIV EAX mreg16 32 F7h 11 111 xxx VectorPath IDIV EAX mem16 32 F7h mm 111 xxx VectorPath IMUL reg16 32 imm16 32 69h 11 xxx xxx VectorPath IMUL reg16 32 mreg16 32 imm16 32 69h 11 xxx xxx VectorPath IMUL reg16 32 mem16 32 imm16 32 69h mm xxx xxx VectorPath IMUL reg16 32 imm sign extended 6Bh 11 xxx xxx VectorPath IMUL reg16 32 mreg16 32 imm8 signed 6Bh 11 xxx xxx VectorPath IMUL reg16 32 mem16 32 imma8 signed 6Bh mm xxx xxx VectorPath IMUL AX AL mreg8 Feh 11 101 xxx VectorPath IMUL AX AL mem8 F6h mm 101 xxx VectorPath IMUL EDX EAX EAX mreg16 32 F7h 11 101 xxx VectorPath IMUL EDX EAX EAX mem16 32 F7h mm 101 xxx VectorPath IMUL reg16 32 mreg16 32 OFh 11 xxx xxx VectorPath IMUL reg16 32 mem16 32 OFh AFh mm xxx xxx VectorPath IN AL imm8 E4h VectorPath IN AX imm8 E5h VectorPath IN EAX imm8 E5h
260. tq Ledx 56 mmO add edx 64 dec ecx jnz fill nc femms sfence Use MMX PXOR to Clear All Bits in an MMX Register To clear all the bits in an MMX register to zero use PXOR MMreg MMreg Note that PXOR MMreg MMreg is dependent on previous writes to MMreg Therefore using PXOR in the manner described can lengthen dependency chains which in return may lead to reduced performance An alternative in such cases 1s to use zero DD 0 MOVD MMreg DWORD PTR zero i e to load a zero from a statically initialized and properly aligned memory location However loading the data from memory runs the risk of cache misses Cases where MOVD is superior to PXOR are therefore rare and PXOR should be used in general 118 Use MMX PXOR to Clear All Bits in an MMX Register AMD 22007E 0 November 1999 AMD Athlon Processor x86 Code Optimization Use MMX PCMPEQD to Set All Bits in an MMX Register To set all the bits in an MMX register to one use PCMPEQD MMreg MMreg Note that PCMPEQD MMreg MMreg is dependent on previous writes to MMreg Therefore using PCMPEQD in the manner described can lengthen dependency chains which in return may lead to reduced performance An alternative in such cases is to use ones DQ OFFFFFFFFFFFFFFFFh MOVQ MMreg QWORD PTR ones i e to load a quadword of OxXFFFFFFFFFFFFFFFF from a statically initialized and properly aligned memory loc
261. tric Instructions Effici ntly m RS SR FERES 103 Take Advantage of the FSINCOS Instruction 105 0 3DNow and MMX Optimizations 107 Use 3DNow Instructions 107 Use FEMMS Instructions 2 3s oi 107 Use 3DNow Instructions for Fast Division 108 Optimized 14 Bit Precision Divide 108 Optimized Full 24 Bit Precision Divide 108 Pipelined Pair of 24 Bit Precision Divides 109 Newton Raphson 109 Use 3DNow Instructions for Fast Square Root and Reciprocal Square 110 Optimized 15 Bit Precision Square Root 110 Optimized 24 Bit Precision Square Root 110 Newton Raphson Reciprocal Square Root 111 Use MMX PMADDWD Instruction to Perform Two 32 Bit Multiplies in 111 3DNow and MMX Intra Operand Swapping 112 Contents vii AMDA AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 Fast Conversion of Signed Words to Floating Point 113 Use MMX PXOR to Negate 3DNow Data 113 Use MMX PCMP Instead of 3DNow PFCMP 114 Use MMX Instructions for Block Copies and Block Fills 115 Use MMX PXOR to Clear All Bits in an MMX Registe
262. tructions Continued AMD Athlon Processor x86 Code Optimization Mnemonic First Second ModR M Decode Byte Byte Byte Type RCL mreg8 1 Doh 11 010 xxx DirectPath RCL 1 Doh mm 010 xxx DirectPath RCL mreg16 32 1 Dih 11 010 xxx DirectPath RCL mem16 32 1 Dih mm 010 xxx DirectPath RCL mreg8 CL D2h 11 010 xxx DirectPath RCL CL D2h mm 010 xxx VectorPath RCL mreg16 32 CL D3h 11 010 xxx DirectPath RCL mem16 32 CL D3h mm_ 010 xxx VectorPath RCR mreg8 imm8 11 011 xxx DirectPath RCR mem8 imm8 mm 011 xxx VectorPath RCR mreg16 32 imm8 Cih 11 011 xxx DirectPath RCR mem16 32 imm8 Cih mm 011 xxx VectorPath RCR mregg 1 Doh 11 011 xxx DirectPath RCR memg 1 Doh mm 011 xxx DirectPath RCR mreg16 32 1 Dih 11 011 xxx DirectPath RCR mem16 32 1 Dih mm 011 xxx DirectPath RCR 8 CL D2h 11 011 xxx DirectPath RCR mem CL D2h mm 011 xxx VectorPath RCR mreg16 32 CL D3h 11 011 xxx DirectPath RCR mem16 32 CL D3h mm 011 xxx VectorPath RDMSR OFh 32h VectorPath RDPMC OFh 33h VectorPath RDTSC OF 31h VectorPath RET near imm16 Ch VectorPath RET near C3h VectorPath RET far imm16 CAh VectorPath RET far CBh VectorPath ROL mreg8 imm8 11 000 DirectPath ROL mem8 imm8 mm 000 xxx DirectPath ROL mreg16 32 imm8 Cih 11 000 xxx Dir
263. ues to the performance counters The performance counters may be initialized using a 64 bit signed integer in the range 2 and 2 7 Negative values are useful for generating an interrupt after a specific number of events Starting and Stopping the Performance Monitoring Counters The performance monitoring counters are started by writing valid setup information in one or more of the PerfEvtSel 3 0 MSRs and setting the enable counters flag in the PerfEvtSel0 MSR If the setup is valid the counters begin counting following the execution of a WRMSR instruction which sets the enable counter flag The counters can be stopped by clearing the enable counters flag or by clearing all the bits in the PerfEvtSel 3 0 MSRs Event and Time Stamp Monitoring Software For applications to use the performance monitoring counters and time stamp counter the operating system needs to provide an event monitoring device driver This driver should include procedures for handling the following operations m Feature checking Initialize and start counters Stop counters Read the event counters Reading of the time stamp counter The event monitor feature determination procedure must determine whether the current processor supports the performance monitoring counters and time stamp counter This procedure compares the family and model of the processor returned by the CPUID instruction with those of processors known to support performance monitoring I
264. ware with the ability to manage the memory mapping of the hardware Both the BIOS software and operating systems utilize this capability The AMD Athlon processor s implementation is compatible to the Pentium II Prior to the MTRR mechanism chipsets usually provided this capability Memory Type Range Register MTRR Mechanism The memory type and range registers allow the processor to determine cacheability of various memory locations prior to bus access and to optimize access to the memory system The AMD Athlon processor implements the MTRR programming model in a manner compatible with Pentium II Introduction 171 AMD Athlon Processor x86 Code Optimization 22007E 0 November 1999 There are two types of address ranges fixed and variable See Figure 12 For each address range there is a memory type For each 4K 16K or 64K segment within the first 1 Mbyte of memory there is one fixed address MTRR The fixed address ranges all exist in the first 1 Mbyte There are eight variable address ranges above 1 Mbytes Each is programmed to a specific memory starting address size and alignment If a variable range overlaps the lower 1 MByte and the fixed MTRRs are enabled then the fixed memory type dominates The address regions have the following priority with respect to each other 1 Fixed address ranges 2 Variable address ranges 3 Default memory type UC at reset 172 Memory Type Range Register
265. x xxx DirectPath FADD FMUL PUNPCKHDQ mmreg mem64 OFh 6Ah mm xxx xxx DirectPath FADD FMUL PUNPCKHWD mmreg1 mmreg2 OFh 69h 11 xxx xxx DirectPath FADD FMUL PUNPCKHWD mmreg mem64 OFh 69h mm xxx xxx DirectPath FADD FMUL PUNPCKLBW mmreg1 mmreg2 OFh 60h 11 xxx xxx DirectPath FADD FMUL PUNPCKLBW mmreg mem64 OFh 60h mm xxx xxx DirectPath FADD FMUL PUNPCKLDQ mmreg1 mmreg2 OFh 62h 11 xxx xxx DirectPath FADD FMUL PUNPCKLDQ mmreg mem64 OFh 62h mm xxx xxx DirectPath FADD FMUL PUNPCKLWD mmreg1 mmreg2 OFh 61h 11 xxx xxx DirectPath FADD FMUL PUNPCKLWD mmreg mem64 OFh 61h mm xxx xxx DirectPath FADD FMUL PXOR mmreg1 mmreg2 OFh EFh 11 xxx xxx DirectPath FADD FMUL PXOR mmreg mem64 OFh EFh mm xxx xxx DirectPath FADD FMUL Notes 1 Bits 2 1 and 0 of the modR M byte select the integer register Table 21 MMX Extensions Instruction Mnemonic prefix First M I Decode PU Notes Byte s Byte Byte Type Pipe s MASKMOVQ mmreg1 mmreg2 OFh F7h VectorPath FADD FMUL FSTORE MOVNTQ 4 mmreg OFh E7h DirectPath FSTORE PAVGB mmreg1 mmreg2 OFh EOh 11 xxx xxx DirectPath FADD FMUL PAVGB mmreg mem64 OFh EOh mm xxx xxx DirectPath FADD FMUL PAVGW mmreg1 mmreg2 OFh E3h 11 xxx xxx DirectPath FADD FMUL PAVGW mmreg mem64 OFh E3h mm xxx xxx DirectPath FADD FMUL PEXTRW reg32 mmreg imm8 OFh C5h VectorPath PINSRW mmreg reg32 imm8 OFh C4h VectorPath PINSRW
266. x80000000 PXOR MMO MM4 6 aye 102205220 PFADD MMO MM2 66 10 20 Replace Branches with Computation in 3DNow Code 61 AMDA AMD Athlon Processor x86 Code Optimization C code float x z Z abs x if z gt 1 z 1 2 Example 2 3DNow code sin MMO x out MMO 2 MOVQ MM5 mabs PAND MMO MM5 PFRCP MM2 MMO MOVQ MM1 MMO PFRCPIT1 MMO MM2 PFRCPIT2 MMO MM2 PFMIN MMO MM1 C code float x z r res 7 fabs x if 2 gt 0 575 res r Example 3 else res PI 2 3DNow code sin MMO x 1 r out MMO res MOVQ MM7 mabs PAND MMO MM7 MOVQ MM2 bnd PCMPGTD MM2 MMO MOVQ MM3 pio2 MOVQ MMO MM1 PFADD MM1 MM1 PFSUBR MM1 MM3 PAND MMO MM2 PANDN MM2 MM1 POR MMO MM2 22007E 0 November 1999 Ox7fffffff l z approx 5 2 1 2 step l z final 22 5 LAZ 2 mask for absolute value 2 abs x 205 75 2 lt 0 575 Oxffffffff 20 pi 2 save r 2 mv 7T p 10 57216205755 p 0 yz Q0 575 07 DIOS 2 2 Replace Branches with Computation in 3DNow Code AMDA 22007E 0 November 1999 Example 4 C code AMD Athlon Processor x86 Code Optimization PI 3 14159265358979323 float x z r res 0 lt lt PI A Z abs x if 2 gt 1 else res PI 2 r 3DNow code s
267. x86 Code Optimization 22007E 0 November 1999 FEMMS instruction is supported for backward compatibility with AMD K6 family processors and is aliased to the EMMS instruction 3DNow and MMX instructions are designed to be used concurrently with no switching issues Likewise enhanced 3DNow instructions can be used simultaneously with MMX instructions However x87 and 3DNow instructions share the same architectural registers so there is no easy way to use them concurrently without cleaning up the register file in between using FEMMS EMMS Use 3DNow Instructions for Fast Division 3DNow instructions can be used to compute a very fast highly accurate reciprocal or quotient Optimized 14 Bit Precision Divide This divide operation executes with a total latency of seven cycles assuming that the program hides the latency of the first MOVD MOVQ instructions within preceding code Example MOVD MMO MEM 0 PFRCP MMO MMO 1 W 1 W approximate MOVQ MM2 MEM Y x PFMUL MM2 MMO Y W X W Optimized Full 24 Bit Precision Divide This divide operation executes with a total latency of 15 cycles assuming that the program hides the latency of the first MOVD MOVQ instructions within preceding code Example MOVD MMO W 0 W PFRCP MM1 MMO 1 W 1 W approximate PUNPCKLDQ MMO MMO W W MMX instr PFRCPITI MMO 1 1 W 1 W refine MOVQ MM2 X Y Y X PFRCPIT2 MMO 1 1 W 1 f
268. xx DirectPath FMUL PFSUB mmreg1 mmreg2 OFh OFh 9 11 xxx xxx DirectPath FADD PFSUB mmreg mem64 OFh OFh 9Ah mm xxx xxx DirectPath FADD PFSUBR mmregl mmreg2 OFh OFh AAh 11 xxx xxx DirectPath FADD PFSUBR mmreg mem64 OFh OFh AAh mm xxx xxx DirectPath FADD PI2FD mmreg1 mmreg2 OFh OFh ODh 11 xxx xxx DirectPath FADD 2 mmreg mem64 OFh OFh ODh mm xxx xxx DirectPath FADD PMULHRW 1 mmreg2 OFh OFh B7h 11 xxx xxx DirectPath FMUL PMULHRW 1 mem64 OFh B7h mm xxx xxx DirectPath FMUL PREFETCH mem8 OFh ODh mm 000 xxx DirectPath 1 2 PREFETCHW mems OFh ODh mm 001 xxx DirectPath 1 2 Nofes EE 1 TCH PREFETCHW instructions the 8 value refers to an address in the 64 byte line that will be 2 byte listed in the column titled imm is actually the opcode byte Table 24 3DNow Extensions Instruction Mnemonic uA imms id 6 e Sip Note PF2IW mmreg1 mmreg2 OFh OFh 1Ch 11 DirectPath FADD PF2IW mmreg mem64 OFh OFh 1Ch mm xxx xxx DirectPath FADD PFNACC mmreg1 mmreg2 OFh OFh 8Ah 11 xxx xxx DirectPath FADD PFNACC mmreg mem64 OFh OFh 8Ah mm xxx xxx DirectPath FADD mmreg1 mmreg2 OFh OFh 8Eh 11 xxx xxx DirectPath FADD PFPNACC mmreg mem64 OFh OFh 8Eh mm xxx xxx DirectPath FADD PI2FW mmreg1 mmreg2 OFh OFh OCh 11 xxx xxx DirectPath FADD PI2FW mmreg mem64 OFh OFh oCh mm xxx xxx
