
4stack Processor's User Manual

1. Chapter 1 Basic Programming Model

This chapter describes the application programming environment as seen by the assembly language programmer. Architectural features that directly address the design and implementation of application programs are introduced. They consist of the following parts:

• Stack operations
• Data types
• Memory
• Register files
• Instruction format

1.1 Stack Operations

Stacks are pushdown lists with a LIFO (last in, first out) characteristic. The two elementary operations are push, to store data, and pop, to fetch data. Pops read data in the reverse of the order in which it was pushed. Stack-type operations first pop an amount of data from the stack, process this data, and push the results back.

To express these stack operations, a stack effect is used. The stack effect addresses only the affected topmost part of the stack. A Forth-like notation is used, as Forth is a stack language with a long tradition. Each stack effect has a left side (the source operands) and a right side (the result operands), separated by a long dash, written -- here. Each stack entry is represented by a space-separated name. The topmost stack element, TOS (top of stack), is the rightmost element in the name list.

Example: the stack add has the following stack effect: a b -- c, where c is the sum of a and b.
2. O 2 1 3 ldh 0 amp 2 R1 ldh 1 amp 3 R1 26 sub 1s3 nop 33 dz c 2 subr si 33 dz c 2 pick si dup nop 3 dz cz add 0s2 nop 100p subr si O sub 2s3 33 CZ pick si dup nop AZ cz drop drop drop zT sub s3p nop cz zr subr si cz pick si dup nop cz add 0s2 nop subr si 2 sub 283 cz pick si dup nop cz drop drop nop CHAPTER 3 PROGRAMMING EXAMPLES sub 1s3 nop ec z subr si ec z pick si dup nop ecz add 0s2 nop subr si 1 sub 2s3 e cz pick si dup nop e cz drop drop drop zr sub 1s3 nop rz subr si C rz pick s2 dup nop erz add 0s2 nop subr si 3 sub 2s3 Cr Ze pick s2 dup nop erz drop ret drop gt gt do Zr u gt u gt st 0 amp 2 R2 N sth 0 amp 2 Ri N t t ldh 0 amp 2 R1 u gt u gt 7 lt lt st 0 amp 2 R2 N sth 0 amp 2 R1 N Sb t gt gt u gt u gt st 1 amp 3 R2 N sth 1 amp 3 R1 N t E Seis ldh 1 amp 3 R1 u gt u gt lt lt st 1 amp 3 R2 N sth 1 amp 3 R1 N t as Simply drawing flat shaded polygons is not very satisfying Therefore at least Gouraud shading should be applied to fill a polygon Gouraud shading is a bilinear interpolation between the edges of a polygon This removes the most unpleasant color discontinuities at the borders of the polygons The basic line loop therefore has to increment or decrement the color each drawin
3. gt gt gt The third of stack on stack 1 contains the number of points to skip at the start the rest from dividing the start offset by 4 This code compares the starting index with 0 to 3 and sets the X flag if the number on the stack is less than the starting index The z buffer load is included into this code sequence the load is aborted if the X flag is set At the end of the loop the z buffer value is just loaded and can t be aborted Thus the order of masking points out is reversed subr si subr si subr si subr s1 u gt u gt u gt u gt O 2 1 3 a3 sub 2s3 sub 2s3 sub 2s3 sub 2s3 lt lt lt lt 33 c z0 c z2 c zl cr z3 pick si pick sl pick si pick s1 st 0 amp 2 R2 N st 1 amp 3 R2 N dup dup dup dup sth 0 amp 2 R1 N sth 1 amp 3 R1 N nop nop nop nop t t t E Here the third element of stack 2 contains the number of remaining pixel for the last it eration It is compared with the numbers 0 to 3 Only smaller numbers will continue execution If the points wouldn t be drawn anyway because they are invisible the com parison isn t performed The complete code for one z buffered line drawing is listed below It must be noted that the return address has to be moved to a save place where it doesn t disturb the computing process 3 4dz Ri zbuffer Ri zbuffer color starti endi color R2 screen R2 screen EZ color color Z N1 2 N1 2 index Z Z ret N2 2 N2 2 drawline index nop nop swap ed
4. b m 2 m x 2 m x 2 2 m

The final correction is only necessary if the result is a denormal, too; in this case don't really perform the m, because the value is already in denormal representation. The result is a denormal if the exponent is < 64.
5. 0 DO fdup fsqrt 1e0 fswap f double 1e0 dup s>d d>f f f LOOP fdrop drop end macro

For square roots of numbers with an odd exponent, the result has to be multiplied by the square root of 2; that is the reason for the sqrt 2 macro.

macro sqrt 2 F 2e0 fsqrt double end macro

The number of iterations can be reduced by one if the size of the approximation table is squared, that is, if the table index gets twice as many bits. This gives a space-time tradeoff: the tables would be 8K instead of 128 bytes. While 8K can reside in the data cache, it is likely that the values are evicted or evict other important memory cells. However, it could be worthwhile in both typical situations. When division and square root are used heavily, the approximation table stays in the L1 cache and supports low-latency divide and square root. When they are used only from time to time, not much cache is wasted, because the approximation table has no chance to stay, and the L2 access time may be below the time of the spared iteration.

Note: a typical usage pattern of division and square root is to divide by the square root of a number, as in vector norming:

$$\frac{(x, y, z)}{\sqrt{x^2 + y^2 + z^2}}$$

In this case, compute the reciprocal square root with a modified GOLDSCHMIDT's algorithm and multiply the result with each of x, y and z. Reciprocal square root does not take longer than square root and thus is clearly faster than square root follow
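The reciprocal square root pattern can be illustrated in Python. This is only a sketch under stated assumptions: it uses the commonly published coupled Goldschmidt iteration, not necessarily the modified variant the manual has in mind, and a straight-line estimate stands in for the approximation table. The names rsqrt and normalize are made up for this example.

```python
import math

def rsqrt(s, iterations=4):
    """Approximate 1/sqrt(s) for s > 0 with multiplies and adds only."""
    m, e = math.frexp(s)               # s = m * 2**e with 0.5 <= m < 1
    # Assumption: a chord through the end points replaces the table lookup.
    y = 1.8284271 - 0.8284271 * m      # rough estimate of 1/sqrt(m)
    x = m * y                          # converges to sqrt(m)
    h = 0.5 * y                        # converges to 1 / (2 * sqrt(m))
    for _ in range(iterations):
        r = 0.5 - x * h                # remaining relative error
        x += x * r
        h += h * r
    result = 2.0 * h
    if e % 2:                          # odd exponent: fold in 1/sqrt(2)
        result *= math.sqrt(0.5)
        e -= 1
    return math.ldexp(result, -e // 2)

def normalize(v):
    """Scale a 3-vector to unit length with one reciprocal square root."""
    x, y, z = v
    k = rsqrt(x * x + y * y + z * z)
    return (x * k, y * k, z * k)
```

One reciprocal square root followed by three multiplications replaces one square root and three divisions, which is the saving the note describes.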
6. and subtraction from cz looping for all 0 lt j lt m and all j lt k lt n step 2m The conversion from complex to real equations leads to See for example Skriptum zur Vorlesung Numerik I und II C Reinsch 30 CHAPTER 3 PROGRAMMING EXAMPLES Fr cosTrj ml f i sinaj m h iCal Mii Chin hr h r x fr hixcillhi h rx fi hix for Cre ere three Cik hale tia Serum Rada Ca h i Arranging the loops with constant j in the inner loop makes the first assignment loop invariant It further turns out that the second assignment and the fourth have a swap type parallelism the fourth read h the second writes to h so for loop unrolling the second equation can be executed in parallel with the fourth Now only two sequential assignment lines remain in the inner loop fr hi fi hr pick 3s0 mul 0sO pick 1s0 mul 0s0 ld 2 amp 0 R1 O fr hr fi hi gt mul 2s1 mul mul sl mul fr gi 2 hifr 2 fi gr 2 hrfr 2 asr mulr asr mulr O ld 3 1 Ri N fr gi 2 hifr fihr 2 qi fi gr 2 hrir hifi 2 qr add 1s0 subr 0s0 add 3s0 subr 2s0 st 2 amp 0 R1 N st 3 amp 1 R1 N fr qi gi 2 fi qr gr 2 qi gi 2 qr gr 2 It should be noted further that FFT needs a final scaling by 1 ldn either in the FFT case or in the reverse FFT case For integer FFTs the best place to do the scaling is the FFT procedure because it ensures that no values will be out of range The fixed point multiplication has the impli
7. are four stacks for general purpose computations. Each stack allows direct access to its 8 topmost values. Access to the 4 topmost values of any other stack is also allowed.

The 4stack processor creates carry on subtract, not borrow; thus a carry set while subtracting means no overflow.

Stack 0   Stack 1   Stack 2   Stack 3
0s0       1s0       2s0       3s0
0s1       1s1       2s1       3s1
0s2       1s2       2s2       3s2
0s3       1s3       2s3       3s3
s4        s4        s4        s4
s5        s5        s5        s5
s6        s6        s6        s6
s7        s7        s7        s7
sr        sr        sr        sr
sp        sp        sp        sp
ip        index     loops     loope

Each element is addressed with its index: s0 is the top of stack (TOS), s1 the next of stack (NOS), and so on. To address another stack, the stack number precedes the stack index: 0s0 is the TOS of stack 0, 1s2 is the third element in stack 1.

Each stack has its own processing unit (ALU) and a status register sr. The lower half of the status register contains ALU-specific information, the upper half contains the global CPU state.

global state / per stack state: STADOOOQOQIIIIIIIII CCCCCCCCIOUMMXOOC

• supervisor state
• trace mode
• address mode for instructions (0: 32 bits, 1: 64 bits)
• activates external debugger
• interrupt disabled
• shift count
• signed shift mode (0: unsigned, 1: signed)
• rounding mode (0: to nearest, 1: to zero, 2 and 3: to the two infinities)
• conditional execution (0: execute, 1: don't execute)
• overflow
• carry
• reserved and must be set to 0

The UMMCCCCCCCC val
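The carry-on-subtract convention mentioned above can be checked with a small Python model. This is an illustrative sketch only: sub32 is a made-up helper, and it assumes the usual implementation of subtraction as a + ~b + 1 on 32-bit values.

```python
MASK = 0xFFFFFFFF

def sub32(a, b):
    """Return (a - b mod 2**32, carry); carry set means no borrow occurred."""
    total = (a & MASK) + (~b & MASK) + 1   # subtraction as add of complement
    return total & MASK, total > MASK

print(sub32(5, 3))   # carry set: 5 >= 3 as unsigned numbers
print(sub32(3, 5))   # carry clear: a borrow would have been needed
```

With this convention the carry flag after a subtraction directly answers the unsigned comparison a >= b.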
8. cleaned up Unlike conventional register architectures this cleaning up has do be done in the code not only in the mind of the programmer or compiler 3 3 Floating Point Applications The 4stack processor s application target is not excessive floating point computation the main applications are integer and fixed point computation as in graphics and signal pro cessing Therefore the number of parallel executed floating point instructions is less than integer units the 4stack processor issues one floating point multiplication and one floating point add in parallel per cycle Both result values can be forwarded to the floating point adder this reduces the adder latency from two to one cycle 3 3 1 Dot Product A basic operation on vector and matrix computation is the dot product a b y a b i l This simple operation shows how to make use of the floating point pipeline A single load multiply and add would look like this 3 3 FLOATING POINT APPLICATIONS nop nop nop nop a bi fmul fmul nop nop nop nop fsum nop nop fsum nop fsum fadd fadd fsum ldf 0 Ri N ldf 1 Ri N 33 This code is very slow and leaves many instruction slots empty Thus a full pipelined version 1s 55 count dotp pick 3s0 nip nop fmul fmul fmul loop fmul fmul ip fadd 3 fresult loops index nop fmul fmul fmul fmul fmul index loops 5 loope subr Osi nop index O nop O fmuladd fadd fmula
9. ns0 nsl nsl ns2 ns2 ns3 ns3 0 0 1 1 7FFFFFFF 7FFFFFFF max 80000000 80000000 min c0 xc cl xcl c2 xc2 c3 xc3 The following operations may use any of these second operand addresses or xl x2 x3 23 2 Zo and x1 x2 x3 3 21622 xor xl x2 x3 73 T1 22 add n1 n2 ul u2 n3 u3 nz n n3 c ur t u2 gt 2 o 2 1 lt ny ng lt 23 sub n1 n2 ul u2 n3 u3 ng n n3 c ui Ue gt 0 o 23 lt ny ng lt 231 subr nl n2 ul u2 n3 u3 ng ng n c uz u1 gt 0 o 23 lt ng n lt 231 add nl n2 ul u2 n3 u3 na n na c c Uy tute gt 2 o 231 lt ni n c lt 281 subc nl n2 ul u2 n3 ud ng n ng c c u ug c gt 0 o 1231 lt ny ng c lt 281 suber nl n2 ul u2 n3 ud ng no n c c uz u c gt 0 o 231 lt ng n c lt 231 pass dll dhl mlatch d mul nl n2 mlatch n no umul ul u2 mlatch uy us The operation pick may only use the stack addresses not the constants It allows to use stack addresses without processing any operation so the stack effect of pick is the stack effect of its stack address 2 1 STACK OPERATIONS 15 pick stack The operation pin pops th
10. parts are zero if the condition is asked in the same cycle t x r 0 n f f n 0 0 lt n f f n lt 0 ov n f f o0 u lt n f f u gt n f f cA nXH 0 lt n f f n lt 0 0 gt n f f n 0 A A n lt 0 0 f n 0 0 lt gt n f f n 0 0 gt n f f n gt 0 no n f f 0 u gt n f f c u lt n f f cV n 0 ehr lt n f f n 0 V n lt 0 0

Although the barrel shifter in mul lt Q provides a powerful bitwise shift instruction, there are a number of simple shift instructions that shift only one position at a time. These instructions are useful for some special purpose shifts, and especially for simple divisions and multiplications by two.

asr:  n1 -- n2: n2 = n1 >> 1; c = n1 & 1; o = 0.
lsr:  u1 -- u2: u2 = u1 >> 1; c = u1 & 1; o = 0.
ror:  u1 -- u2: u2 = u1 >> 1 | u1 << 31; c = u1 & 1; o = 0.
rorc: u1 -- u2: u2 = u1 >> 1 | c << 31; c = u1 & 1; o = 0.
asl:  n1 -- n2: n2 = n1 << 1; c = n1 >> 31 & 1; o = sign(n1) ≠ sign(n2).
lsl:  u1 -- u2: u2 = u1 << 1; c = u1 >> 31 & 1; o = 0.
rol:  u1 -- u2: u2 = u1 << 1 | u1 >> 31; c = u1 >> 31 & 1; o = 0.
rolc: u1 -- u2: u2 = u1 << 1 | c; c = u1 >> 31 & 1; o = 0.

There are two bit operations: find first one and population count.

ffl x n
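The one-bit shift and rotate instructions can be modeled in Python to check their results and carry outputs. This is a sketch for illustration: each function returns the pair (result, carry), the overflow flag is left out, and the function signatures are assumptions of this example rather than anything the processor defines.

```python
MASK = 0xFFFFFFFF

def asr(n):            # arithmetic shift right: the sign bit is kept
    n &= MASK
    return (n >> 1) | (n & 0x80000000), n & 1

def lsr(u):            # logical shift right: a zero is shifted in
    u &= MASK
    return u >> 1, u & 1

def ror(u):            # rotate right by one position
    u &= MASK
    return ((u >> 1) | (u << 31)) & MASK, u & 1

def rorc(u, c):        # rotate right through carry
    u &= MASK
    return ((u >> 1) | (c << 31)) & MASK, u & 1

def lsl(u):            # logical shift left
    u &= MASK
    return (u << 1) & MASK, (u >> 31) & 1

def rol(u):            # rotate left by one position
    u &= MASK
    return ((u << 1) | (u >> 31)) & MASK, (u >> 31) & 1

def rolc(u, c):        # rotate left through carry
    u &= MASK
    return ((u << 1) | c) & MASK, (u >> 31) & 1
```

Chaining lsl on the low word with rolc on the high word gives a 64-bit shift left by one, which is the kind of special purpose shift these instructions are meant for.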
11. usage negate N on writes limit mask bits all M 0 means no limit ZUWION

For pipeline intermediate results there exist some hidden registers that have certain access restrictions. Each of the two integer multipliers has one intermediate result latch, called mlatch. One mlatch is accessible by the lower stack half (stack 0 and 1), the other by the higher stack half (stack 2 and 3). Values are stored with mul, umul and pass, and received with any sort of mul operations (speak "mul fetch"). This restricts the usage of the multiplier to only one per stack half and instruction. If there are two multiply instructions per stack half, the result is undefined. It is not guaranteed that the behavior of one implementation is valid for other implementations. The result is specified only if both pairs of operands and both operations are equal.

The floating point operations store their arguments in the flatch registers and their intermediate results in the fal and fml (floating point adder and multiplier latch) registers. The intermediate result registers are readable from any stack; the argument registers are divided into first and second argument. The first argument can be written by even stacks (stack 0 and 2), the second argument by odd stacks (stack 1 and 3). The two adder argument latches can only be read by the lower stack half, the multiplier argument latches by the upper stack half.

1.5 Instructio
12. 3 N1 ER 8n n 2m n m dup pick 2s0 pick 2s0 pick 2s0 set 2 N2 a tae n 2m 8n n 2m n m lt m gt n 2m 33 the middle loop see above a O n 2m 8n n 2m n m 0 drop 2 asr drop R1 R3 R1 R3 Be n 2m 2 8n n 4m n m asl subr nop asr br 1 gt until ped al 8n n 4m n 2m For the final loop f e 77 is no longer loop invariant it changes every step IF n 2m is not compared if gt 1 but gt 2 the final loop could be separated A small modification allows to process the last loop in 4 cycles per loop step reducing the overhead of 4 cycles per middle loop A tiny trick allows to consume the f r value without an additional drop instruction 32 CHAPTER 3 PROGRAMMING EXAMPLES final loop a ER hi fi hr pick 3s0 mul 0s0 pick 1s0 mul 0s0 ld 2 amp 0 R1 O fr hr fi hi u mul 2s1 mul mul sip mul t t t t drops fr gi 2 hifr 2 gr 2 hrfr 2 asr mulr asr mulr ld 0 amp 2 R2 N 1d 3 amp 1 Ri N fr gi 2 hifr fihr 2 qi 35 gr 2 hrfr hifi 2 qr add 1s0 subr 0s0 add 3s0 subr 2s0 st 2 amp 0 Ri N st 3 1 Ri N This example shows the change of active values in loop computing Each nested loop has other active values The inner loop operates on complex numbers the outer loops do index and address computation Concentrating on the actual part helps to understand the work being done The inactive variables disappear in the stack diagrams Sometimes they show up at the edges of a inner part and have to be
13. 3 instructions forth. Loops may use up to 1023 instructions as loop body. A call or jump may cover the entire 32-bit addressing space of 4 gigabytes (2 gigabytes forth or back), a far call the entire extended 64-bit addressing space. All flow control operations except indirect jump/call with ip use the data move operation field, so while executing flow control operations no data move operations can be executed in parallel.

Conditional instructions are the most important flow control operations. The 4stack processor allows all four stacks to be checked for a specified condition and to jump if either at least one or all of the specified stacks meet the condition. This allows several conditions to be combined into one conditional jump. Conditions may either pop the examined top of stack (then preceded with a question mark) or leave it unchanged (preceded with a colon). The conditional branch operation br is thus followed by the specified stacks (more than one must be separated, either for ORing or by & for ANDing the condition), by the specified condition (t 0 0 lt ov u lt u gt lt gt f 0 lt gt 0 gt no u gt u lt gt lt) preceded with ? or :, and by the branch target, usually a label. If no stack and no condition is specified, the branch is unconditional.

Counted loops help to implement inner loops without branch overhead. The loop setup
14. 8-bit bytes, addressed by a 32- or 64-bit number (4 gigabytes or 16 exabytes), called the physical memory address. Data with a size greater than one byte is addressed by the first byte containing the data. All accesses should be size aligned, so that dividing the address by the access size gives a zero remainder. Misaligned accesses either lead to an exception or are supported by the memory interface unit, which inserts additional instructions to load the other part of the unaligned address. Misaligned accesses should be assumed to be slower.

The 4stack processor supports big and little endian byte ordering. This is controlled by each memory register set; thus a program may use mixed little and big endian accesses. Big endian accesses are propagated unchanged to the bus interface. Little endian accesses are provided either by xoring the address with the appropriate bit pattern (7 for byte accesses, 6 for half word accesses and 4 for word accesses) or by byte swapping (implementation dependent).

Implementations of the 4stack processor designed for a workstation-like environment usually don't use the physical memory address for addressing the memory. They use a page-level translation of the virtual memory address to the physical memory address. This translation is done by TLBs (translation lookaside buffers) in hardware and by page tables in software. Virtual memory also provides memory protection and controlled kernel entry points.

1.4 Register Files

There
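The address xoring scheme for little endian accesses can be tried out in Python. The following is a minimal sketch, assuming a big endian memory organized in 64-bit units; the helper names and the bytearray memory model are inventions of this example.

```python
XOR_MASK = {1: 7, 2: 6, 4: 4}    # access size in bytes -> address xor pattern

def load_be(mem, addr, size):
    """Big endian load: the address goes to memory unchanged."""
    return int.from_bytes(mem[addr:addr + size], "big")

def store_be(mem, addr, size, value):
    mem[addr:addr + size] = value.to_bytes(size, "big")

def load_le(mem, addr, size):
    """Little endian load, done as a big endian load at the xored address."""
    assert addr % size == 0, "accesses should be size aligned"
    return load_be(mem, addr ^ XOR_MASK[size], size)

def store_le(mem, addr, size, value):
    assert addr % size == 0, "accesses should be size aligned"
    store_be(mem, addr ^ XOR_MASK[size], size, value)

mem = bytearray(8)
store_le(mem, 0, 4, 0x11223344)
print(hex(load_le(mem, 0, 1)))   # lowest address holds the least significant byte
print(hex(load_le(mem, 2, 2)))   # upper half word of the stored value
```

Seen through little endian accesses only, the memory behaves like a true little endian memory, although every byte sits at a mirrored position within its 64-bit unit.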
15. 4stack Processor's User Manual

Bernd Paysan

25th April 2000

Introduction

The 4stack processor is a research project to create a high-performance VLIW (very long instruction word) microprocessor architecture without the typical disadvantages, such as non-compact code, by using the stack paradigm for the individual operating units. Despite the restrictions of such a programming model, the processor will deliver highest integer and reasonably high floating point performance without excessive resource consumption such as large decode and superscalar schedule logic.

Contents

1 Basic Programming Model
  1.1 Stack Operations
  1.2 Data Types
  1.3 Memory
  1.4 Register Files
  1.5 Instruction Format
2 Instruction Set Summary
  2.1 Stack Operations
  2.2 Data Move Operations
    2.2.1 Cache Control
    2.2.2 MMU Control
    2.2.3 I/O Control
  2.3 Flow Control
3 Programming Examples
  3.2 Fixed Point Example: Fractals
    3.2.1 Fast Fourier Transformation
  3.3 Floating Point Applications
    3.3.1 Dot Product
    3.3.2 Floating Point Division and Square Root
16. If the stack was created by pushing 1, 2 and 3 in this order, a is unified with 2 and b with 3. The result c then is 5, and the stack contains 1 (not affected) and 5 as TOS afterwards, so the total stack effect is 1 2 3 -- 1 5.

To give a formal description of the processor's instructions, each instruction is described by a name, the stack effect (separated by colons), and the equations of the result (separated by semicolons and terminated by a full stop). For the above example of add, the description would be

add: a b -- c: c = a + b.

Stack names may be either literals, in which case they represent constants, or names containing non-literal characters, in which case they represent variables. Equal symbols represent equal variables. E.g. the operation that duplicates the TOS, dup, is described as follows:

dup: x -- x x.

Each x stands for the same value, both on the left and the right side.

1.2 Data Types

The stack elements store only one unified data type, a 32-bit value. The interpretation of these values is up to the instructions. 64-bit data is represented by two stack elements; if these values have a significance order, the more significant value is nearer to the top of stack.

Stack instructions use either 32- or 64-bit two's complement signed or unsigned integers or bit patterns, or 32-bit IEEE single or 64-bit IEEE double floats. Memory instructions load and store bytes (8 bit), half words (16 bit), words (32 bit) and double words (64 bit).
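The stack effects of add and dup can be replayed on a Python list whose last element plays the role of the TOS. This is only an illustrative sketch; the helper apply_effect is made up for this example and is not part of any 4stack tool.

```python
def apply_effect(stack, n_in, fn):
    """Pop n_in entries (the TOS is the rightmost), push what fn returns."""
    args = stack[len(stack) - n_in:]
    del stack[len(stack) - n_in:]
    stack.extend(fn(*args))
    return stack

def add(stack):
    # add: a b -- c: c = a + b. (32-bit wrap around assumed)
    return apply_effect(stack, 2, lambda a, b: [(a + b) & 0xFFFFFFFF])

def dup(stack):
    # dup: x -- x x.
    return apply_effect(stack, 1, lambda x: [x, x])

stack = [1, 2, 3]    # pushed in the order 1, 2, 3
add(stack)           # total effect: 1 2 3 -- 1 5
dup(stack)           # total effect: 1 5 -- 1 5 5
print(stack)
```

Only the topmost entries named in the stack effect are touched; the 1 at the bottom stays where it is.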
17. For clarity, stack effects indicate the data type the operation assumes on the stack. The first letters (the prefix) of a stack element signal the type of the data:

x     any 32-bit pattern
n     32-bit two's complement signed integer
u     32-bit unsigned integer
c     8-bit unsigned character
h     16-bit unsigned half word
nb    n-bit unsigned bit pattern
snb   n-bit two's complement signed bit pattern
xdh   64-bit pattern, higher significant part
xdl   64-bit pattern, lower significant part
dh    64-bit two's complement signed integer, higher significant part
dl    64-bit two's complement signed integer, lower significant part
udh   64-bit unsigned integer, higher significant part
udl   64-bit unsigned integer, lower significant part
sf    32-bit IEEE single float
fh    64-bit IEEE double float, higher significant part
f     64-bit IEEE double float, lower significant part
bfd   bit field descriptor

The 4stack processor uses two's complement arithmetic for signed integers. Negative numbers are represented by the increment of the bit inversion of their absolute value. Apart from being widely used, two's complement has two important advantages: different bit patterns represent different numbers, and, when ignoring the sign bit, add and sub can be used for unsigned numbers as well, except for the overflow conditions. For the latter reason there are two overflow conditions: carry for unsigned arithmetic and overflow for signed arithmetic.

1.3 Memory

Memory is a linear sequence of
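The two overflow conditions can be made concrete with a short Python model of a 32-bit add. This is a sketch for illustration only; negate, to_signed and add32 are names chosen for this example.

```python
MASK = 0xFFFFFFFF

def to_signed(x):
    """Interpret a 32-bit pattern as a two's complement signed integer."""
    x &= MASK
    return x - (1 << 32) if x & 0x80000000 else x

def negate(x):
    """Increment of the bit inversion, as described in the text."""
    return (~x + 1) & MASK

def add32(a, b):
    """Return (result, carry, overflow) of a 32-bit addition."""
    a &= MASK
    b &= MASK
    total = a + b
    carry = total > MASK                      # unsigned result out of range
    signed_sum = to_signed(a) + to_signed(b)
    overflow = not (-(1 << 31) <= signed_sum < (1 << 31))  # signed out of range
    return total & MASK, carry, overflow
```

The same bit pattern comes out for signed and unsigned operands; only the flag to look at differs. Adding 0xFFFFFFFF and 1 sets carry but not overflow, while adding 0x7FFFFFFF and 1 sets overflow but not carry.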
18. PHSON's has more dependencies in each iteration step and, surprisingly, converges a bit slower in reality, while both have a theoretical convergence rate of doubling the number of accurate digits each iteration.

GOLDSCHMIDT's approximation uses three iteration variables and thus allows enough parallelism to keep the pipeline filled, at least for division. The relative error is computed each time by using subtraction instead of division to get the reciprocal of the relative error.

Division computes a b z a y b r 1 while y 4 1 doz rz y ry r 2 ry

As an approximation table would only contain reciprocal approximations for numbers between 1 and 2, sign and exponent of y should be transferred to x before.

20 div table bfd5 int 0511 align 3 fdiv fdivgs bl bh al ah fxtract pick 0s0 nop pick 0s0 ip w bfd5 ld 3 ipb yl yh bx bh al ah bh drop and min pick 0s0 ip O set 3 R3 33 yl yh bs al ah bx bh bfd5 fabs O neg bfu yl yh bs 0 al ah bx bfrac 1 5 nop 40 fiscale drop 23 ldf 3 R3 s0 yl yh bs 0 40 xl xh nop hib xor 1s2 nop yl yh bs 2 0 xl xh gt Y 2 0 X r fmul fadd nop fmul es bs x fmulsub drop fmul fmul 53 3 T zu oF y fadd fmul nop nop 35 Y x y 35 Y x gt y fmul nop nop fmul fmul fmul fmulsub nop nop fadd fmul nop 5 y E x fmul fmul nop nop fmulsub nop fmul nop nop fadd fmul nop ae r x nop fmul fm
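The three iteration variables of the division can be followed in a small Python sketch. It uses the Goldschmidt iteration in its commonly published form (x and y are both multiplied by r, and r is renewed as 2 - y), which may differ in detail from the manual's variant. The linear first estimate stands in for the approximation table and is an assumption of this example, as is the name goldschmidt_div.

```python
def goldschmidt_div(a, b, iterations=4):
    """Approximate a / b for 1 <= b < 2 using only multiply and subtract."""
    # Assumption: a linear estimate of 1/b replaces the table lookup.
    r = 24.0 / 17.0 - (8.0 / 17.0) * b
    x, y = a * r, b * r            # y is now close to 1, x close to a / b
    for _ in range(iterations):
        r = 2.0 - y                # correction factor from the remaining error
        x *= r                     # the two multiplications are independent,
        y *= r                     # so they can be issued back to back
    return x

print(goldschmidt_div(1.0, 1.5))
```

Each step squares the distance of y from 1, so the number of accurate digits doubles per iteration, and the two multiplications of one step do not depend on each other, which is what keeps the pipeline filled.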
19. ause it is all in the set. So the interesting part of the Mandelbrot set does not even span an order of magnitude. Therefore the magnitude (the exponent of a floating point number) would not change much during the calculation process. 32-bit fractional integer arithmetic thus provides more accuracy than single precision floating point.

Now we have to separate the multiplications. We need x², y² and 2xy: three multiplications. However, we need to calculate x² − y² and x² + y², and the multiply and accumulate doesn't allow computing both results at once. So we compute 2x² and add −(x² + y²) to it, so we finally get x² − y², sacrificing some low bits of accuracy.

iterloop 35 x n y mul SO pick 2s0 mul 0s0 pick 0s0 ip d four 1d2 3 ipb ae 4 ny y x mul lt mul sO mulr lt mul sO 33 x dx2 n y 2xy d4 mulr lt dec add si mulr lt x x2ty2 n 1 y 2xy ty 4 2x2 add s1 dup nop add 0s0 br 1 amp 3 0 gt iterloop X X x2 y2 R n 1 n 1 gt 0 3 y 2xy y 4 x2 y2 gt 0

The final branch branches to iterloop while the square of the absolute value of c, x² + y², is < 4 and the iteration count n is > 0.

3.2.1 Fast Fourier Transformation

One important function for digital signal processing is the fast Fourier transformation (FFT). It transforms a complex time-discrete sample into a complex spectrum with discrete frequencies. These discrete frequencies are called Fourier coeffic
20. awn if their z value is interpreted as closer to the screen than the actual z value. Usually object surfaces are divided into flat areas. These areas are then drawn in horizontal lines, as the frame buffer is organized in horizontal lines, too. A C code fragment for z-buffer drawing may look like this:

unsigned short *zbuffer;
color *screen;
color c;
unsigned short z;
short dz;
int i;
                              /* loop body                 */
for (i = 0; i < len; i++) {   /* index do                  */
    if (z > *zbuffer) {       /* sub s1    u>              */
        *screen = c;          /* pick s1   st 0 R2 N       */
        *zbuffer = z;         /* dup       sth 0 R1 N      */
    }
    screen++; zbuffer++;      /* nop                       */
    z += dz;                  /* add s2    ldh 0 R1  nop   */
}

The sequential 4stack code in the comments above is only slightly vectorized: the zbuffer load is placed at the end of the loop. R1 is used to keep zbuffer, R2 points into the screen. The TOS is z, the next of stack is c, and the third of stack is dz.

This inner loop contains two active pointers (screen and zbuffer) and three active values (z, dz and c), of which only z changes. There are no dependencies between neighboring pixels except the linear z change and the address pointer increments. Thus, although this basic block seems to be rather sequential, there is a great chance to vectorize it and to make use of the four stacks. As z changes linearly, it is proposed to replace z by z0 = z, z1 = z + dz, z2 = z + 2dz and z3 = z + 3dz. The last line the z increment i
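The proposed split of z into four values can be sketched in Python, with one list entry per stack. This is an illustration under simplifying assumptions: plain integers are used instead of 16-bit z values, and the function name zline and its parameters are made up for this example.

```python
def zline(zbuffer, screen, start, length, z, dz, c):
    """Draw one z-buffered horizontal line, four pixels per loop step."""
    lanes = [z + k * dz for k in range(4)]     # z0, z1, z2, z3
    end = start + length
    for base in range(start, end, 4):
        for k in range(4):                     # one lane per stack
            i = base + k
            if i < end and lanes[k] > zbuffer[i]:
                screen[i] = c                  # pixel is visible: draw it
                zbuffer[i] = lanes[k]          # and remember its depth
        lanes = [v + 4 * dz for v in lanes]    # each lane now steps by 4*dz

zbuffer = [5] * 6
screen = [0] * 6
zline(zbuffer, screen, 0, 6, z=3, dz=1, c=9)
print(screen, zbuffer)
```

The four lanes never depend on each other, so the comparison, the two stores and the increment of one lane can run alongside those of the other three. The check i < end corresponds to masking out the surplus pixels of the last step.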
21. c mulr n n round mlatch gt gt 32 m mulr lt n n round mlatch lt lt c 32 m mul dl dh d mlatch mul lt dl dh d mlatch lt lt c mulr n n round mlatch gt gt 32 m mulr lt n n round mlatch lt lt c 32 m mul dll dh1 dl2 dh2 c da d mlatch cc ce c mul lt Q dll dhl dl2 dh2 c da d mlatch lt lt c ce ce c 16 CHAPTER 2 INSTRUCTION SET SUMMARY mulr dl dh n c n round d mlatch gt gt 32 m ce cc c mulr lt Q dl dh n c n round d mlatch lt lt c gt gt 32 m ce ce c mul dll dh1 dl2 dh2 c da d mlatch cc cc rc mul lt dll dh1 dl2 dh2 c d d mlatch lt lt c cc ce rc mulr dl dh n c n round d mlatch gt gt 32 m cc cc rc mulr lt Q dl dh n c n round d mlatch lt lt c gt gt 32 m cc cc wc Flag computation allows some branchless conditional computing In the 4stack processor flags are computed using the TOS carry and overflow flag of each stack Flag computations consume the TOS A true flag has all its bits set 1 a false flag has all its bits clear 0 sub and subr compute the appropriate carry and overflow flag for compare flags There is one exception from this rule double number operations like mul or fadd give zero only if both 32 bit
22. cit property to divide the result by two and the asr instructions does so for the cx For reverse FFT the barrel shifter can be used to provide a non halfing fixed point multiplication The middle loop has to step through the j index and thus complete the access to every array element Remember that all array accesses are bit reverse for FFT the result is obtained by a bit reverse address walk through the input array begin ee n 2m 8n n 2m 33 lt m gt 0 3 2 FIXED POINT EXAMPLE FRACTALS nop sub ci nop drop a n 2m 2 8n n 2m nop index swap pick 2s1 lo Ea hi n 2m 8n fi ae 8n hr 33 the inner loop see above 100p ar E hi fi hr pick 3s0 mul 050 pick 1s0 mul 050 fr hr fi hi mul 2s1 mul mul sip mul gi 2 hifr 2 gr 2 hrfr 2 asr mulr asr mulr 33 gi 2 hifr fihr 2 qi ld 0 amp 2 R2 N ld 3 amp 1 DO ld 2 amp 0 Ri O CE EE 35 gr 2 hrfr hifi 2 qr add 1s0 subr 0s0 add 3s0 subr 2s0 st 2 amp 0 R1 N st 3 amp 1 the final loop step that doesn t load another c_k m value 33 j k a n 2m 8n 8n nop nop swap drop j k iz 8n n 2m dec pick 2s0 nop O je n 2m 8n n 2m 0 31 Ri s0 Ri N Ri R1 2 s0 Ri R1 3 s0 br O gt until The outer loop then has to loop through m which scales from 1 n 2 while doubling each step The above cited lt m gt is the bitwise mirrored m by a bit length of ld n thus it is n 2m begin nop nop pick 3s0 pick 3s0 set 2 N1 set
23. date and immediate offset operations An immediate offset operation in one data move field is added to the computed address in the other data move field Load and store operations may address one of four register sets as do update address operations set and get may direct address any part of the register set the four registers R N M and F Load operations have one delay slot so they push their result before the second instruction after the load starts execution Data move operations of the even data unit operate on one or both of the even stacks stack 0 and 2 operations of the odd data unit on one or both of the odd stacks stack 1 and 3 Both load and store operations may use one or two stacks as source or result Update address set and get may only use one stack as source or result The stack is specifield with n or n amp na where n is the stack number of a single destination stack n is the stack for the value at the lower address na is the stack for the value at the higher address Stack 0 or stack 1 is the default stack Load and store operations allow a number of addressing modifiers syntax address Rn becomes comment Rn Rn imm Rn indirect Rn N Rn Nn imm Rn indexed Rn N Rn imm Rn Nn modify after Rn N Rn Nn imm Rn Nn modify before sOb s0 imm Rn stack indirect b e ipb ip imm Rn ip relative b e s0l s0 imm Rn stack indirect l e ipl ip imm Rn ip relative Le Rn s0 Rn s0 imm Rn stack ind
24. dd faddadd fmuladd faddadd fmuladd faddadd fmuladd faddadd drop loope Ri v1 ldf O ldf 0 ldf 0 ldf 0 do ldf 0 Ri N Ri N Ri N Ri N Ri N Ri v2 ldf ldf ldf ldf ldf Pee ee Ri N Ri N Ri N Ri N Ri N

Although the inner loop contains only one instruction, there is a 4-step pipeline:

1. the element's addresses are computed
2. the cache loads are done
3. the multiplication is started
4. the multiply and adder results are forwarded to the adder input, and another unnormalized sum is computed

The final step normalizes the sum. There must be at least 3 address computations, 2 loads and one multiplication start before the loop can start. However, the DO operation doesn't allow a parallel load, so there must be one extra load in advance, and the first faddadd must be replaced by a fadd of 0.0 to clear the second adder input latch.

3.3.2 Floating Point Division and Square Root

The 4stack instruction set contains neither a floating point division nor a floating point square root. These instructions have been omitted, whilst silicon implementations exist, because there is only a minor speedup with these hardware division or square root devices if these are implemented in an excessive way (at least 4 bits per cycle). GOLDSCHMIDT's algorithm should be used for both the division and the square root. The traditional algorithm NEWTON RA
25. e TOS and stores it to a deeper position in the stack pin s0 x pin sl x0 x1 xl pin s2 x0 x1 x2 x2 x1 pin s3 x0 x1 x2 x3 x3 x1 x2 pin s4 x0 x1 x2 x3 x4 x4 x1 x2 x3 pin sd xO xl x2 x3 x4 x5 xd xl x2 x3 x4 pin s6 x0 x1 x2 x3 x4 x5 x6 x6 x1 x2 x3 x4 x5 pin s7 x0 x1 x2 x3 x4 x5 x6 x7 x7 x1 x2 x3 x4 x5 x There are some abbreviations for simple stack operations dup x x x pick s0 over x0 x1 x0 x1 x0 pick sl swap x0 xl x1 x0 pick slp rot x0 x1 x2 x1 x2 x0 pick s2p drop x pin 90 nip x0 x1 x1 2 pin sl There are also some abbreviation for unary operations that are performed using binary operations with a special second operand nop 2 or 0 not xl x2 12 212 xor 1 neg nl n2 na n1 subr 0 inc ul u2 u u 1 sub 4 1 dec ul u2 uz u 1 add 4 1 The rest of the stack instructions don t have additional stack addresses The multiplication is a two cycle pipelined operation The first part of the operation mul umul and pass allows various parameter addressing the second part all variations of mul makes use of an additional adder multiply and accumulate a barrel shifter and a rounder for fractional integer arithmetics If the shift count exceeds the range 32 lt c lt 32 the result is undefined mul dl dh d mlatch mul lt dl dh d mlatch lt lt
26. e dc c z dc e cz de cr z de add s2 add s2 add s2 add s3 ldh 0 amp 2 Ri ldh 1 amp 3 Ri 33 dz c Z c dc c Zc eczc crdc pin s2 pin s2 pin s2 pin s3 Ai 3 dz c z de c Z ec Zz crd

This code allows drawing Gouraud-shaded z-buffered polygons at a peak rate of 2 cycles per pixel.

3.2 Fixed Point Example: Fractals

A very popular benchmark program is drawing fractals like the MANDELBROT set. This set contains the elements of the complex number set C that fit the condition

$$M = \{\, c_0 \in \mathbb{C} : \lim_{i \to \infty} |c_i| < 2 \,\} \quad \text{with} \quad c_{i+1} = c_i^2 + c_0$$

To get a nicely colored picture, the number of iterations until the number becomes bigger than 2 is counted, and for each count a different color is selected.

This code has some strict conditions that don't allow easy vectorization. It is an iteration: the value computed in one step is used in the next. However, it contains enough inner parallelism to fill all the instruction slots.

First we have to transform the complex calculation into real and imaginary parts (x is the real part of c, y the imaginary part):

$$c^2 = x^2 - y^2 + 2ixy$$
$$|c|^2 = x^2 + y^2$$
$$|c| < 2 \;\Rightarrow\; |c|^2 < 4 \;\Rightarrow\; x^2 + y^2 < 4$$

For speed we want to use fractional integer arithmetic. The problem fits quite well to the restrictions of fractional integer arithmetic: the interesting part of the Mandelbrot set is between −2 and 2 for both the imaginary and the real coordinate. The part close to 0 is not very interesting bec
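The iteration in fractional integer arithmetic can be sketched in Python. This is an illustration only: the number of fraction bits (28) and the names to_fix, fmul and mandel_iterations are assumptions of this example, and Python's unbounded integers hide the 32-bit range limits that real code has to respect.

```python
FRAC = 28                      # assumed number of fraction bits
ONE = 1 << FRAC

def to_fix(v):
    """Convert a float to the fixed point representation."""
    return int(round(v * ONE))

def fmul(a, b):
    """Fractional multiply: full product, then shift back."""
    return (a * b) >> FRAC

def mandel_iterations(cx, cy, max_iter=100):
    """Count iterations until x*x + y*y >= 4 or the count runs out."""
    x0, y0 = to_fix(cx), to_fix(cy)
    x, y = x0, y0
    n = max_iter
    while n > 0:
        x2, y2, xy = fmul(x, x), fmul(y, y), fmul(x, y)   # three multiplications
        if x2 + y2 >= 4 * ONE:                            # |c|^2 < 4 no longer holds
            break
        x, y = x2 - y2 + x0, 2 * xy + y0                  # c = c^2 + c0
        n -= 1
    return max_iter - n

print(mandel_iterations(0.0, 0.0), mandel_iterations(1.0, 0.0))
```

The loop continues under the same two conditions as the final branch of the assembly loop: the squared absolute value stays below 4 and the iteration count is still positive.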
27. erations are used for general purpose and for floating point calculations. The more important operations (add, sub, subr, addc, subc, subcr, or, and, xor, mul, umul, and pass) have a second operand address that gives access to any of the addressable elements of the stacks as second operand: 8 elements deep in the ALU's own stack, 4 into any other stack, and 8 often-needed constants. The other operations need explicit use of the stack operation pick, which uses all the stack accesses of the above operations except the 8 constants.

The second operand is called NOS (next of stack) from here on, although it might be any other operand addressed by the second operand address. The first operand is always the top of stack (TOS) and is therefore stated in the stack effect diagram. The stack effect of a one-address operation is then: apply the stack effect of the second operand address, pop NOS and the previous TOS, apply the operation, and push the result (if any). The default stack effect stored in the instruction word is the swap operation that exchanges TOS and NOS; this stack effect will be used to describe the operation.

The stack effects of the second operand addresses are:

s0    x -- x x
s1    x0 x1 -- x0 x1 x0
...
s7    x0 x1 x2 x3 x4 x5 x6 x7 -- x0 x1 x2 x3 x4 x5 x6 x7 x0
s0p   x0 -- x0
s1p   x0 x1 -- x1 x0
s2p   x0 x1 x2 -- x1 x2 x0
s3p   x0 x1 x2 x3 -- x1 x2 x3 x0
ns0
28. exed Rn s0 Rn imm Rn s0 stack modify after Rn s0 Rn s0 imm Rn s0 stack modify before

The top of stack s0 is used from the only or the low address stack at the beginning of the operation. The semantics of the operation depend on the flags Fn. The index is the sum of the second operand (if any) and the constant offset from the other data move unit.

Z set: The index is shifted by the access size.
R set: The index is added using bit-reverse addition.
S set: The index is negated if the instruction is a store (stack-like usage).

If the limit mask bit number M is not 0, the base address Rn is split into a lower half with M bits and an upper half with 32 - M bits. The calculation is performed on the lower half; if the result exceeds the limit register Ln or causes a carry on the Mth bit, it is set to 0, and a bound crossing trap occurs if B is set. The result of this comparison is then added to the upper half. This is the final effective address that is propagated to the cache to perform the memory access.

The store operations are stb, sth, st, and st2/stf. stb stores the lower 8 bits of the number(s) on the stack(s) specified at the end of the cycle, sth stores the lower 16 bits, st stores the total 32-bit value, and st2/stf stores a value pair (stf is an alias for st2). Little-endian accesses (when O is set) are computed by xoring 7, 6, and 4 for byte, half word, and word accesses. F
29. fter the next instruction is specified by the address applied to ip. This is true for loops and loope to some extent too: there is a one-cycle delay until the fetch unit recognizes the new loop range addresses.

The 4stack processor has a floating point unit with a two-cycle pipelined multiplier and a two-cycle pipelined adder. Floating point operations move the two topmost stack elements to the unit's input latches, from the unit's output latches to the stack, or from one unit's unnormalized output to the adder input. The last allows a three-cycle multiply-and-add and a single-cycle accumulate. Once supplied with an input, the floating point units start to compute the result.

fadd fl fh fal f fsub fl fh fal f fmul fl fh fml f fnmul fl fh fml f faddadd fal far faddsub fal far fmuladd fal fmr fmulsub fal fmr fi2f n fal 43300000 n 80000000 fni2f n fal C 3300000 n 80000000 fadd fl fh f far mula fl fh f fmr fs2d s fl fh f s fd2s fl fh s s f fiscale fll fh1 n fl2 fh2 fa f x 2 fxtract fll fh1 f2 fh2 n fo f 2 with 1 lt f lt 2

Some simple floating point instructions use the stack ALU:

fabs   fl1 fh1 -- fl2 fh2    f2 = f1 and 7FFFFFFF
fneg   fl1 fh1 -- fl2 fh2    f2 = f1 xor 80000000

Dx fl
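The fabs and fneg entries work because the sign of an IEEE 754 double is a single bit. The following Python sketch illustrates this under the assumption that the masks 7FFFFFFF and 80000000 apply to the high word (fh) of the low/high pair; the helper names `split_double` and `join_double` are hypothetical and exist only for this example.

```python
import struct


def split_double(value):
    """Return (low, high) 32-bit words of an IEEE 754 double."""
    bits = struct.unpack(">Q", struct.pack(">d", value))[0]
    return bits & 0xFFFFFFFF, bits >> 32


def join_double(low, high):
    """Rebuild a double from its (low, high) 32-bit words."""
    return struct.unpack(">d", struct.pack(">Q", (high << 32) | low))[0]


def fabs(low, high):
    """Clear the sign bit: high word AND 7FFFFFFF (assumed to act on fh)."""
    return low, high & 0x7FFFFFFF


def fneg(low, high):
    """Flip the sign bit: high word XOR 80000000 (assumed to act on fh)."""
    return low, high ^ 0x80000000
```

For example, `join_double(*fabs(*split_double(-2.5)))` yields 2.5, and applying `fneg` twice returns the original value, since only bit 31 of the high word changes.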
30. g step This changes the loop body to the following in C and linear 4stack assembly code loop body for i 0 i lt len i if z gt zbuffer screen c zbuffer z screent zbuffer z dz color dcolor index sub si pick si dup nop add s2 pick s3 do u gt st 0 R2 sth 0 R1 vt N N 3 2 FIXED POINT EXAMPLE FRACTALS 27 add s2 ldh 0 Ri pin s2 As Gouraud shading does not produce overflows a normal add is sufficient However accuracy is not very high since true color screens usually represent colors by a set of 3 8 bit values in the format aRGB and therefore each color scales only between the integer values of 0 to 255 For larger polygons this differential increment is not very good for small polygons it can be tolerated While unrolling this part of the loop we get a small amount of additional accuracy as with dz we add 4 dcolor each iteration The final loop body then is eZ Coz zr dec cz zr ecaz zr er z zr subr si subr si subr si subr s1 u gt u gt u gt 7u gt 33 dz c 2 dc c 2 e cz C r z pick s1 pick sl pick si pick s2 st 0 amp 2 R2 N st 1 amp 3 R2 N dup dup dup dup sth 0 amp 2 R1 N sth 1 amp 3 R1 N nop nop nop nop t t t BESS 3 dz cz dc cz ecz erz add 0s2 add 0s2 add 0s2 add 0s2 pick 1s2 pick 1s2 pick 1s2 pick 1s2 3 dz c z d
31. ients C f 1 Olf Fedt kez 0

For discrete samples (n samples for f(t), dt = const), only the first n coefficients C_k(f) are of interest. It is easy to show that $C_{k+zn}(f) = C_k(f)$, $z \in \mathbb{Z}$, in this case (aliasing). Cooley and Tukey showed in 1965 that for $n = 2^p$ only $np/2$ complex additions, subtractions, and multiplications and n trigonometric coefficients are necessary to compute a complete Fourier transformation (the Cooley-Tukey fast Fourier transformation, FFT).

The main principle of the FFT is a recursive divide-and-conquer strategy. The input is divided into even and odd parts; each of them is processed separately in the same way, and the results are combined afterwards. In practice, all divide steps are performed at once: consecutively dividing the input array into even and odd parts bit-reverses the access indices. The 4stack processor has a dedicated address adding mode for this sort of access that makes it possible to process consecutive (or constant-offset) bit-reverse-addressed elements of an array.

The combination part is usually not processed recursively either; a loop rearrangement makes it possible to reuse the trigonometric coefficients and therefore reduces memory bandwidth. The body of the combination part is called the butterfly diagram.

Figure 3.1: Butterfly diagram

The realization of the butterfly diagram results in the complex multiplication of Ck m and e r and the addition
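The bit-reversed access order can be made concrete with a short model. This Python sketch is an illustration of the general technique, not of the processor's address unit: `bit_reverse` reverses an index directly, and `bit_reverse_increment` steps through the same order by adding 1 with the carry running from the most significant bit downward, which is one common way to describe bit-reverse addition. Both function names are assumptions made for this example.

```python
def bit_reverse(index, bits):
    """Reverse the lowest 'bits' bits of index."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reverse_increment(index, bits):
    """Add 1 to a bit-reversed index; the carry propagates from MSB to LSB."""
    mask = 1 << (bits - 1)
    while mask and index & mask:
        index ^= mask        # this bit overflows: clear it and carry downward
        mask >>= 1
    return index | mask      # set the first clear bit (mask is 0 on wrap-around)


# Walk an 8-element array in FFT input order.
order = []
position = 0
for _ in range(8):
    order.append(position)
    position = bit_reverse_increment(position, 3)
```

After the loop, `order` holds the indices 0, 4, 2, 6, 1, 5, 3, 7, which is the sequence obtained by repeatedly splitting an 8-element input into even and odd halves.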
32. ing by division. In fact, Goldschmidt's square root algorithm makes it possible to divide any number by the square root of another number. But because square-root divide is more expensive than multiplication, computing the reciprocal square root and multiplying is better in the usual vector norming application. Reciprocal square root computes b ya z b y a r e a while y 1dox rg y r y r 3 r y 2

3.3.3 Denormalized floats

The 4stack floating point unit doesn't define denormal handling, while IEEE 754 requires it. Implementations that don't implement denormal handling may react in one of two ways: either denormal handling is disabled, in which case they treat denormalized values as zero, or denormal handling is enabled, in which case they trap to the unimplemented floating point error. It is then up to the exception handler to perform the correct operations to handle denormal floats. Denormal float processing can be emulated with instructions for normalized floats.

The purpose of denormalized floats is to fill the hole between zero and the smallest normalized floating point number. So a simple shift of the exponent makes it possible to compute with normalized numbers, and an add with the appropriate value turns this into the same representation as a denormalized value (except for the exponent value). A multiplication of a normalized a and a denormalized b value where b is represented as e b thus looks like a x b ax
33. l fh1 112 fh2 fo f 2 add c3 f2 fll fh1 112 fh2 fo f 2 sub c3 The bit field instructions use a bit field descriptor bfd The format of the bfd is rot left r s mask length n s mask rotation m s All these values are interpreted modulo the number of bits in a word bfu x bfd u mask rotl 1 lt lt n 1 m u rotl x r gzmask bfs x bfd n mask rotl 1 lt lt n 1 m n sign_extend rotl x r amp mask r m n cc n n cc cel n cc n The pixel pack and extract instructions transform pixel data into the next wider repre sentation and back 4 8 and 16 bits per pixel are supported The operator means concatenation The pixel elements of the word are numerated from MSB to LSB px4 pnl pn2 dpb dpb pna 0 pn 0 pn2 1 pr l1 pna 2 pn 2 pna 3 pn 3 pn2 4 pn1 4 pn2 5 pri 5 pna 6 pn 6 pna 7 pn 7 px8 pbl pb2 dph dph pb 0 pb 0 pb2 1 pb 1 pb2 2 pb1 2 pb 3 pb 3 pp4 dpb pn1 pn2 dpb pna 0 pn 0 pn 1 pn 1 pna 2 pn1 2 pr2 3 pna 3 pn2 4 pn 4 pn2 5 pn 5 pr2 6 pn 6 pn2 7 pr 7 pp8 dph pb1 pb2 dph pb2 0 pb1 0 pb2 1 pb 1 pb2 2 pb 2 pb2 3 pb1 3 2 2 DATA MOVE OPERATIONS 19 2 2 Data Move Operations Data move operations are divided into load store address up
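The unsigned bit field extract described in this excerpt (mask = rotl((1 << n) - 1, m), result = rotl(x, r) and mask) can be modelled directly. The Python sketch below is only an illustration: it passes the three descriptor fields r, n, and m as separate arguments instead of decoding a packed bfd word, because the exact packing is not recoverable from the text, and the names `rotl` and `bfu` as Python functions are assumptions for this example. A 32-bit word is assumed.

```python
WORD_BITS = 32
WORD_MASK = (1 << WORD_BITS) - 1


def rotl(value, count):
    """Rotate a 32-bit word left; the count is taken modulo the word size."""
    count %= WORD_BITS
    value &= WORD_MASK
    return ((value << count) | (value >> (WORD_BITS - count))) & WORD_MASK


def bfu(x, r, n, m):
    """Unsigned bit field extract: rotate x left by r, keep n bits at rotation m.

    r, n, and m stand for the rotate-left, mask length, and mask rotation
    fields of the bit field descriptor (passed separately here).
    """
    mask = rotl((1 << (n % WORD_BITS)) - 1, m)
    return rotl(x, r) & mask
```

With these definitions, `bfu(0x12345678, 8, 8, 0)` rotates the top byte into the low position and masks it, giving 0x12; choosing m = 4 and n = 4 instead keeps bits 4 to 7 of the rotated word in place.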
34. n 0 while 2 amp 80000000 0 do n n 1 1 x lt lt 1 done pope x n n 0 while x 4 0 do n n z amp 1 x x gt gt 1 done Bytes and half words are loaded as unsigned integer parts If they need to be interpreted as signed integer or fraction parts they have to be converted lob u b b u gt gt 24 loh u h h u gt gt 16 extb b n n b gt 80 b 80 b exth h n n h gt 8000 A 8000 h hib b n n b lt lt 24 hih h n n h lt lt 16 There are a number of instructions to address the special global or per stack registers To load these registers to the stack the syntax register speak fetch is used to store them register speak store is used srQ x T Sr sp u u sp cm b b sr gt gt 8 amp FF sr lt lt 4 amp 700 index u u index loops u u loops loope u u loope ip u u tp flatch fl fh f flatch srl x sr s x sr amp FFFF0000 c amp FFFF sp u sp u cm b sr sr amp FFFFOO8F b amp FF lt lt 8 b lt 075 300 b gt gt 4 amp 70 index u index u loops u loops u amp 8 loope u loope u amp 8 ip u ip u amp 8 ip doesn t affect the next instruction it is executed in the delay slot The address of the instruction a
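The two counting loops at the start of this excerpt (shift left until bit 31 is set, and add up the low bit while shifting right) translate almost literally. The Python sketch below is an illustration on 32-bit values; the function names are assumptions, and the bound of 32 in the first loop is an addition of this example so that an input of 0 terminates, which the pseudocode in the text does not specify.

```python
def leading_zero_count(x):
    """Count how often x must be shifted left until bit 31 is set.

    The 'n < 32' bound is an assumption of this sketch; the loop in the
    text has no stated result for x = 0.
    """
    x &= 0xFFFFFFFF
    n = 0
    while n < 32 and (x & 0x80000000) == 0:
        n += 1
        x = (x << 1) & 0xFFFFFFFF
    return n


def population_count(x):
    """Count the set bits of a 32-bit value (the popc loop)."""
    x &= 0xFFFFFFFF
    n = 0
    while x != 0:
        n += x & 1
        x >>= 1
    return n
```

Both loops take up to 32 steps in this form; they describe the result of the instructions, not how the hardware computes it.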
35. n Format

The 4stack processor is a VLIW (very long instruction word) processor. Architectures of this sort exploit the fine-grain parallelism of a program. Each long instruction word consists of several operation fields for the independent execution units. All these operations are performed simultaneously. It is up to the programmer or compiler to schedule the individual operations into the instruction word.

The 4stack processor has 5 main instruction formats:

1. The normal instruction consists of four stack operations and two data move operations.
2. The conditional setup instruction consists of four stack operations and four corresponding conditional setup operations.
3. The branch instruction consists of four stack operations and one conditional branch operation.
4. The call instruction consists of three stack operations and one ip-relative or absolute call or jump instruction.
5. The far call instruction consists of one absolute call instruction.

Chapter 2 Instruction Set Summary

2.1 Stack Operations

Stack operations are divided into ALU operations and immediate number operations. The immediate number operations are intended to push small numbers on the stack. They either push one sign-extended 8-bit number or shift the TOS by 8 bits to the left and add one unsigned 8-bit number to the result.

s8b s8b 8b gt nl n2 ng n lt lt 8 8b

The ALU op
36. n odd is selected by the operation slot.

Example: an implementation may have 4K pages and 32-byte cache lines in a four-way set-associative cache. Thus the lower 12 bits select the cache line. Of these, the lower 5 bits select way and cache: 0 is way 0, 8 is way 1, 10 is way 2, and 18 is way 3, all in the data cache.

ccheck returns the result of the check at the beginning of the second instruction after its issue. Thus it has a delay slot like any other cache access. The result contains the status bits (shared, exclusive, modified, and valid) of the cache line, a flag indicating whether the compare matched or failed, the number of unprocessed entries in the global write buffer, and the cache tag address. Checking the global write buffer helps to flush the cache without long interrupt latency.

cclr clears a cache line, cstore writes a cache line, and cflush writes and clears a cache line; these three operations ignore the higher part of the address. cload starts a cache load to the specified line, calloc allocates the cache line without changing the line entries, and cxlock changes the lock flag while ensuring that the specified cache line is loaded.

A cache flush does not ensure that the cache is empty after the flush, but it ensures that cache and memory as they were before the flush are consistent afterwards: all modified lines are written out, and any valid data in the cache contains the same values as the corresponding memory. When interrupts are disabled
37. operation do sets loops to the next instruction and loope to the label after the operation. The loop index has to be set by the programmer with the instruction index. The loop continues until the loop index decrements from 0 to -1. The index is decremented in the last cycle of every loop.

Calls (operation call) and jumps (operation jmp) transfer the program flow to the label after the operation. Both use the operation field of stack 3: call executes ip@ on stack 3 to push the return address, jmp performs a nop on stack 3. Far calls execute nops on every stack except stack 3, where ip@ is executed.

The conditional setup operation allows some stacks to conditionally execute operations without breaking the program flow. For each stack one condition is tested; if it is false, the X bit in the status register of the corresponding stack is set, and all following operations on that stack are replaced by nops until another conditional setup changes the situation. As in conditional branches, conditions may either pop or copy the top of stack and are preceded with or to select that. The conditions t and f have a special meaning: t resets the X bit, and thus all operations afterwards are executed; f inverts the X bit and therefore allows if-then-else. When set, no other condition may change the state of the X bit of one stack. All conditions, regardless of being part of a conditional setup operation or a conditional branch operation, read as fal
38. or double accesses the relation between lower and higher address is reversed, so you specify the higher address first. Double word addresses are the same in big and little endian.

The address update operation assigns the sum or difference of one register, either 0, its Nn, or the TOS of the selected stack, and the constant offset to one of the four registers. Syntax: Rm Rn s N s0. There is no scaling by size, because no size is specified. The source register's flags are used for computing.

set and get individually select one of the registers Rn, Nn, Mn, or Fn and set its value to the TOS of the specified stack, or get the register's value and push it to the stack. setd and getd do the same for 64-bit extended addressing; if this is not implemented, the higher half is ignored when set and 0 when get.

2.2.1 Cache Control

The instructions ccheck, cclr, cstore, cflush, cload, calloc, and cxlock control the cache. All these accesses use the address in Rn to select one cache line. The address part that is used to address the cache line set in usual accesses is used here too. The lower bits select one line of the set. The higher bits are compared with the cache line's address; the mask count M specifies how many of their lower bits are not compared. It is ensured that a walk with constant offset through one page's address range will select all cache lines. Bits 0 and 1 specify which cache to check (data, instruction, lower stacks', and higher stacks' cache); eve
39. s replaced by z Zn 4dz then As the data move unit supports double word accesses neighboring pixels are stored with the same store instruction The two data move units hold addresses for interleaving pixel pairs This leads to the following inner loop code subr s1 subr si subr si subr s1 u gt u gt u gt u gt 3 4dz c z0 c z2 c zl c z3 pick s1 pick sl pick si pick sl st 0 amp 2 R2 N st 1 amp 3 R2 N dup dup dup dup sth 0 amp 2 R1 N sth 1 amp 3 R1 N nop nop nop nop t t t t 3 4dz c z0 c z2 c zl c z3 add 0s2 add 0s2 add 0s2 add 0s2 ldh 0 amp 2 Ri ldh 1 amp 3 R1 4dz c z0 c z2 c zl c z3 nop nop nop nop For the C code above it isn t necessary to keep the color c on every stack it doesn t change However many z buffer drawing algorithms are mixed with shading or texture mapping 3 1 Z BUFFER DRAWING 25 There each pixel has a different color So to make changes easy the color is kept on every stack This inner loop draws 4 z buffered pixel every step in 6 cycles It would be very unpleas ant if every line length has to be a multiple of four So for beginning and end of the loop the left respectively the right of the four pixel line piece must be masked out This can be done with the conditional setup too as conditional setup on disabled stacks doesn t change the X flag state The startup then is O 2 1 3 ldh 0 amp 2 R1 ldh 1 amp 3 Ri sub 1s3 sub s3p sub 1s3 sub 153 gt
40. se. Stores and loads to that stack are aborted; address offsets (s0) read as zero.

Indirect jumps are generated with ip. They transfer program flow to the address on the stack that executes ip. Indirect calls may use any stack except stack 3 and have to save the caller's instruction pointer with ip on stack 3.

Chapter 3 Programming Examples

This chapter contains a number of programming examples that introduce the programming of the 4stack processor. The examples are intended to show programming methods that exploit the parallelism of the algorithm and therefore produce fast and compact code. The algorithms of the example programs target possible applications of the 4stack processor: high-speed graphics, integer signal processing, and some floating point number crunching.

3.1 Z-Buffer Drawing

Three-dimensional drawing on a two-dimensional CRT display has one problem: hidden surfaces. A number of algorithms address that problem. One simple and elegant solution is z-buffering. Each pixel in the three-dimensional space has x and y coordinates that correspond to display coordinates. The z coordinate does not correspond to anything on the display, but there is one condition for pixels on the screen: of all the 3D pixels with the same x and y coordinates, only the one with the highest z coordinate value shows on the screen. Therefore a z-buffer holds the actual z value for the pixel shown at any screen position. New pixels are only dr
41. the cache is really empty, except those parts needed to run the cache flush loop.

2.2.2 MMU Control

2.2.3 I/O Control

The operations out, outd, ins, ioq, inb, inh, in, and ind control I/O ports. I/O accesses are not cacheable and not translated by the memory management unit, even if they go to memory addresses and not to special port addresses. All I/O accesses are performed strictly sequentially in an I/O queue with an implementation-dependent number of I/O read/write buffers. The accesses use the address in Rn plus the immediate constant scaled by 8. Each I/O read is a 64-bit read, and each write is a 64-bit write to the corresponding address. On a wider bus, the lowest valid bits in the address select the appropriate bus part.

The input start operation ins returns the number of one of an implementation-defined amount of input latch registers. The io query operation ioq returns true if the passed register number is filled, false otherwise. An inb, inh, in, or ind operation reads the port value and frees the latch register. The immediate offset selects the appropriate byte, half word, or word out of the double word read. I/O accesses will cause a privilege violation exception when not processed in supervisor mode.

2.3 Flow Control

Flow control operations divide into conditional branches, counted loops, calls, jumps, returns, and indirect calls/jumps. Conditional branches use a ten bit branch offset that allows to jump back 1024 instructions and 102
42. ue is called the cm value and individually accessible Each stack has a stack pointer sp that points to the memory address where s0 will be saved when it would be spilled out However an access to the memory location of sp will not give s0 because s0 is cached in a register file oO OM eo See There are some global registers for special purposes The instruction pointer ip points to the next instruction the loop counter index the loop start address loops and the loop end address loope are used by the hardware do loop instruction 1 5 INSTRUCTION FORMAT 11 For memory access there are two data move units Each data unit is connected with two stacks one with the stacks with even number the other with the odd stacks Loads and stores from the even data move unit can only go to or come from even stacks from the odd data move unit can go to or come from odd stacks Even stack unit Odd stack unit RO NO LO FO RO NO LO FO R1 N1 L1 F1 R1 N1 L1 F1 R2 N2 L2 F2 R2 N2 L2 F2 R3 N3 L3 F3 R3 N3 L3 F3 Each Rn contains one 32 64 bit address The corresponding Nn could be added to the address to step through an array Ln contains a modulo value to implement a ring buffer Fn contains usage flags 00000000100000000100MMMMMMIO00SBROZ scale displacement N s0 constant offset by size byte order 0 big 1 little endian bit reverse addition create bound crossing trap stack
43. ul ret fmul nop nop nop a b The division table contains 32 reziprocal values containing 64 27 65 macro div table F n 1e0 fdup dup 2 s gt d d gt f f f dup 0 DO 1e0 fover f double le0 dup s gt d d gt f f f LOOP fdrop drop end macro The same approach can be done for the square root Square root computes ya z a y a r a while y 1dox rz y r y r 3 r y 2 20 sqrt table double 1e0 sqrt 2 bfd5 int 511 align 3 fsqrt 36 fsqrtgs fxtract 1 and drop 2 3 y fmul fmul fmul fmulsub fadd 3 Y fmul nop nop fmulsub fadd 33 Y fmul nop nop fmulsub fadd fmul fmul sart a 3FF8 pick Osi pick Osi hih pick Osi pick 0s0 asr xor s0 fiscale 1 sqrt 2 x swap nop fmul fmul nop fmul fadd fmul fmul nop x ES nop nop fmul nop fmul nop nop nop nop nop x nop nop fmul nop fmul nop nop nop nop nop fmul nop nop nop CHAPTER 3 PROGRAMMING EXAMPLES pick 0s0 ip w bfd5 ld 3 ipb ip O set 3 R3 bfu drop 25 nop 5 ldf 3 R3 s0 ldf 1 R3 s0 r fmul flatch fmul fmul nop y fmul fmul fmul fmul nop fmul fmul fmul nop nop ret nop The square root approximation table contains reciprocal square roots for 32 numbers be tween 1 and 2 in fact the square roots of the division table macro sqrt table F n 1e0 fdup dup 2 s gt d d gt f f f dup
