
Library of Macros for Optimization Using eMAC and MAC

2. Parameters

Table 3-2. ARR2D_ADD2 Parameters

    dest    in/out  Pointer to the destination array
    src     in      Pointer to the source array
    size1   in      Number of rows of matrices
    size2   in      Number of columns of matrices

Returns: The ARR2D_ADD2 macro generates unsigned/signed output values, which are stored in the array pointed to by the parameter dest.

3.2.3 Description of Optimization

C code:

    for (i = 0; i < SIZE1; i++)
        for (j = 0; j < SIZE2; j++)
            dest[i][j] += src[i][j];

Optimization can be done using the following techniques:

1. The elements are accessed as 1D array elements, with the number of elements equal to size1 * size2, because the elements of a 2D array are located sequentially in memory.
2. Loop unrolling by four.
3. The four values of array dest used in each iteration are loaded with a single movem instruction.
4. The four values of array src used in each iteration are loaded using postincrement addressing mode while the additions are performed.
5. After the additions are performed, the four resulting values of each iteration are stored with a single movem instruction.
6. If the number of elements is not divisible by four, the tail elements are processed in regular order.

Library of Macros for Optimization, Rev. 1.0
Freescale Semiconductor

Optimized code:

    move.l  size1,d1
    move.l  size2,d2
    mulu.l  d2,d1
    move.l  d1,d2
    asr.l   #2,d1
    beq     l1
    movem.l (a0),d3-d6
    add.l   (a1)+,d3
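The techniques listed for ARR2D_ADD2 in section 3.2.3 can be sketched in portable C. This is only an illustration under stated assumptions, not the library's macro: the function name `arr2d_add2_sketch` is hypothetical, and unsigned 32-bit elements are assumed so that overflow simply wraps.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C rendering of the ARR2D_ADD2 techniques (not the library code).
 * The 2D arrays are passed as pointers to their first element and walked
 * as one linear array of size1 * size2 elements. */
void arr2d_add2_sketch(uint32_t *dest, const uint32_t *src,
                       size_t size1, size_t size2)
{
    size_t n = size1 * size2;   /* linear element count */
    size_t blocks = n >> 2;     /* number of groups of four */
    size_t tail = n & 3;        /* elements left after the unrolled loop */

    while (blocks != 0) {
        /* Load four dest values at once (the role of the movem load). */
        uint32_t v0 = dest[0], v1 = dest[1], v2 = dest[2], v3 = dest[3];

        /* Add four src values, advancing the pointer (postincrement). */
        v0 += *src++;
        v1 += *src++;
        v2 += *src++;
        v3 += *src++;

        /* Store the four results at once (the role of the movem store). */
        dest[0] = v0; dest[1] = v1; dest[2] = v2; dest[3] = v3;
        dest += 4;
        blocks--;
    }

    /* Tail elements are processed in regular order. */
    while (tail != 0) {
        *dest++ += *src++;
        tail--;
    }
}
```

A caller with true 2D arrays would pass `&d[0][0]` and `&s[0][0]`, which relies on the row-major, contiguous layout that technique 1 describes.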
3. ...

Optimization for the MAC unit can be done using the following techniques:

1. Loop unrolling by four.
2. Using the macl instruction, which allows multiplying simultaneously with loading four values for the next iteration.
3. The first four values are loaded using one movem instruction.

Optimized code (uses MAC unit):

    lea     -60(a7),a7
    movem.l d2-d7/a2-a5,(a7)
    move.l  #0,d0
    move.l  d0,MACSR
    moveq.l #16,d0
    move.l  dest,a0
    move.l  src,a1
    move.l  size1,d1
    move.l  size2,d2
    mulu.l  d2,d1
    move.l  d1,d2
    asr.l   #2,d1
    beq     out1
    move.l  #0,ACC0
    movem.l (a1),d7/a3-a5
    add.l   d0,a1
loop1:
    movem.l (a0),d3-d6
    macl.l  d7,d3,(a1)+,d7,ACC0
    move.l  ACC0,d3
    move.l  #0,ACC0
    macl.l  a3,d4,(a1)+,a3,ACC0
    move.l  ACC0,d4
    move.l  #0,ACC0
    macl.l  a4,d5,(a1)+,a4,ACC0
    move.l  ACC0,d5
    move.l  #0,ACC0
    macl.l  a5,d6,(a1)+,a5,ACC0
    move.l  ACC0,d6
    move.l  #0,ACC0
    movem.l d3-d6,(a0)
    add.l   d0,a0
    subq.l  #1,d1
    bne     loop1
out1:
    and.l   #3,d2
    beq     out2
    sub.l   d0,a1
loop2:
    move.l  (a0),d3
    mulu.l  (a1)+,d3
    move.l  d3,(a0)+
    subq.l  #1,d2
    bne     loop2
out2:
    movem.l (a7),d2-d7/a2-a5
    lea     60(a7),a7

Optimization for the eMAC unit can be done using the following techniques:

1. Loop unrolling by four.
2. Using four accumulators for pipelining.
3. Using ma
4. 3. The first four values are loaded using one movem instruction.

Optimized code (uses MAC unit):

    ...

No separate optimization for the eMAC unit is needed, because the computation contains only one multiply-accumulate sequence.

4.1.4 Differences Between DOT_PROD_UL and DOT_PROD_SL

The DOT_PROD_UL macro uses the unsigned mode of the MAC unit, while the DOT_PROD_SL macro uses the signed mode.

4.2 RDOT_PROD_UL, RDOT_PROD_SL

4.2.1 Macros Description

These macros compute the reverse dot product of two vector arrays with unsigned/signed values. The reverse dot product is computed by the following formula:

$$ X \cdot Y = \sum_{i=1}^{n} x_i \, y_{n-i+1} $$

where
    X, Y    input vectors
    x, y    elements of the corresponding vectors
    n       size of the vectors

4.2.2 Parameters Description

Call(s):

    unsigned long RDOT_PROD_UL(unsigned long *arr1, unsigned long *arr2, int size)
    signed long RDOT_PROD_SL(signed long *arr1, signed long *arr2, int size)

Parameters:

Table 4-2. RDOT_PROD Parameters

    arr1    in      Pointer to the first vector
    arr2    in      Pointer to the second vector
    size    in      Number of elements in vectors

Re
5. Figure 6-2. New Project Dialog Box

Click OK. A new folder is created for your project, and the project window appears docked at the left side of the main window.

Modifying the Settings of your Project

Select an appropriate target to debug your code, e.g. M5282EVB UART Debug. Open the Settings window of your project by selecting Edit > <your target> Settings from the menu, by pressing Alt+F7, or by clicking the Settings button. The Settings window should appear.

Enable the processor to use the MAC or eMAC unit by selecting the appropriate checkbox in the Language Settings > ColdFire Assembler section, e.g. check the "Processor has EMAC" checkbox.

[Screenshot: M5282EVB UART Debug Settings window, ColdFire Assembler panel, with the "Processor has EMAC" checkbox selected]
6. Optimized code:

    move.l  size1,d1
    move.l  size2,d2
    mulu.l  d2,d1
    move.l  d1,d2
    moveq.l #1,d0
    asr.l   #2,d1
    beq     out1
loop1:
    muls.l  (a0)+,d0
    muls.l  (a0)+,d0
    muls.l  (a0)+,d0
    muls.l  (a0)+,d0
    subq.l  #1,d1
    bne     loop1
out1:
    and.l   #3,d2
    beq     out2
loop2:
    muls.l  (a0)+,d0
    subq.l  #1,d2
    bne     loop2
out2:

3.5.4 Differences Between the ARR2D_PROD_UL and the ARR2D_PROD_SL Macros

ARR2D_PROD_UL uses the mulu instruction for multiplication. ARR2D_PROD_SL uses the muls instruction for multiplication, to keep the signs of the operands.

3.6 ARR2D_MUL2_SL, ARR2D_MUL2_UL

3.6.1 Macros Description

These macros perform multiplication of two 2D arrays of unsigned/signed values.

3.6.2 Parameters Description

Call(s):

    int ARR2D_MUL2_UL(unsigned long *dest, unsigned long *src, long size1, long size2)
    int ARR2D_MUL2_SL(long *dest, long *src, long size1, long size2)

Parameters:

Table 3-6. ARR2D_MUL2 Parameters

    dest    in      Pointer to the destination array
    src     in      Pointer to the source array
    size1   in      Number of rows in arrays
    size2   in      Number of columns in arrays

Returns: The ARR2D_MUL2 macro generates an unsigned/signed output matrix, which is the result of multiplying dest by src and is pointed to by dest.

3.6.3 Description of Optimization

C code:

    for (i = 0; i < SIZE1; i++)
        for (j = 0; j < SIZE2; j++)
7. Optimized code (uses MAC unit):

    ...

Constants held in address registers: a0 = 1073741824, a1 = 89478485, a2 = 2982616, a3 = 53261, a4 = 591, a5 = 2.

Optimization for the eMAC unit includes the same optimization techniques as for the MAC unit, as well as the following:

1. Using the fractional mode of the eMAC unit, which allows 32x32 multiplication without loss of precision.
2. Using the movclr instruction to store a value in a register and clear an accumulator at the same time.
3. Using two accumulators to quickly raise the operand to the needed power.

Optimized code (uses eMAC unit):

    move.l  #0,ACC0
    move.l  #0,ACC1
    ...

5.5 MUL

5.5.1 Macro Description

This mac
8. where
    x       element of the input vector
    size    number of elements in the input vector

2.1.2 Parameters Description

Call(s):

    int ARR1D_SUM_UL(unsigned long *src, int size)
    int ARR1D_SUM_SL(signed long *src, int size)

Parameters:

Table 2-1. ARR1D_SUM Parameters

    src     in      Pointer to the source vector
    size    in      Number of elements in vector

Returns: The ARR1D_SUM macros return the unsigned/signed sum of the array elements.

2.1.3 Description of Optimization

C code:

    for (i = 0; i < SIZE; i++)
        res += arr1[i];

Optimization can be done using the following techniques:

1. Loop unrolling by four.
2. Postincrement addressing mode to access input array elements.
3. Descending loop organization.

The following should be noticed:

• The d0 register always holds the sum of the array elements.
• The a0 register holds the pointer to the input array.
• The d1 register is the counter.

Optimized code:

loop1:
    add.l   (a0)+,d0
    add.l   (a0)+,d0
    add.l   (a0)+,d0
    add.l   (a0)+,d0
    subq.l  #1,d1
    bne     loop1

2.1.4 Differences Between the ARR1D_SUM_UL and the ARR1D_SUM_SL Macros

The type of the src parameter of ARR1D_SUM_UL is unsigned long. The type of the src parameter of ARR1D_SUM_SL is signed long.

2.2 ARR1D_ADD2_UL, ARR1D_ADD2_SL

2.2.1 Macros Description

These macro
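The three techniques named in section 2.1.3 for ARR1D_SUM (unrolling by four, postincrement access, descending loop counter) map onto C roughly as follows. This is a minimal sketch, not the library's macro: the name `arr1d_sum_unrolled` is hypothetical, and the tail loop for sizes not divisible by four is an assumption added here, since the excerpted listing shows only the unrolled loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C sketch of the ARR1D_SUM techniques (not the library code). */
uint32_t arr1d_sum_unrolled(const uint32_t *src, size_t size)
{
    uint32_t sum = 0;            /* plays the role of d0 */
    size_t blocks = size >> 2;   /* plays the role of d1: groups of four */
    size_t tail = size & 3;      /* assumption: leftovers handled separately */

    /* Descending loop: the counter runs down to zero, so the loop test
     * is a plain compare-with-zero (subq / bne in the listing). */
    while (blocks != 0) {
        sum += *src++;           /* postincrement access, like add.l (a0)+,d0 */
        sum += *src++;
        sum += *src++;
        sum += *src++;
        blocks--;
    }

    while (tail != 0) {
        sum += *src++;
        tail--;
    }
    return sum;
}
```

Unsigned arithmetic is used so that overflow wraps instead of being undefined; a signed variant would need the caller to guarantee that the sum fits.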
9. FRAC32 *src, long size)

The original signal is held in array src, and the first differences are stored in array dst. Both arrays run from 0 to size-1. Prior to any call of FIRST_DIFF, the user must allocate memory for both the src and dst arrays, either in static or in dynamic memory.

Parameters:

Table 4-5. FIRST_DIFF Parameters

    dst     out     Pointer to the output array of size FRAC32 data elements
    src     in      Pointer to the input array of size FRAC32 data elements
    size    in      Number of elements in input and output arrays

Returns: The FIRST_DIFF macro generates output values, which are stored in the array pointed to by dst.

4.5.3 Description of Optimization

This macro performs no multiplications, so MAC and eMAC instructions are not suitable for optimizing it. Instructions from the integer instruction set were used instead. The following optimization techniques were used:

1. Multiple load/store operations to access array elements.
2. Loop unrolling by four.
3. Descending loop organization.

The particular optimization techniques are discussed below.

C code:

    ...

The following should be noticed:

• The loop is unrolled by four.
• The input operands are fetched from memory in fours a
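As an illustration of the FIRST_DIFF techniques in section 4.5.3, the following C sketch computes a first difference with the loop unrolled by four and the operands fetched in groups of four. Several points are assumptions rather than facts from the manual: the name `first_diff_sketch`, the use of `int32_t` as a stand-in for FRAC32, the definition dst[i] = src[i] - src[i-1], and the boundary convention dst[0] = src[0].

```c
#include <stddef.h>
#include <stdint.h>

typedef int32_t frac32_t;   /* assumed stand-in for the library's FRAC32 */

/* Hypothetical sketch: dst[i] = src[i] - src[i-1], with dst[0] = src[0]
 * chosen here as the boundary convention (an assumption). */
void first_diff_sketch(frac32_t *dst, const frac32_t *src, size_t size)
{
    if (size == 0)
        return;

    frac32_t prev = src[0];
    dst[0] = src[0];

    size_t i = 1;
    size_t blocks = (size - 1) >> 2;   /* groups of four after element 0 */

    while (blocks != 0) {              /* descending loop counter */
        /* Fetch four inputs together (multiple-load technique). */
        frac32_t x0 = src[i], x1 = src[i + 1], x2 = src[i + 2], x3 = src[i + 3];

        dst[i]     = x0 - prev;
        dst[i + 1] = x1 - x0;
        dst[i + 2] = x2 - x1;
        dst[i + 3] = x3 - x2;

        prev = x3;                     /* carry the last sample forward */
        i += 4;
        blocks--;
    }

    /* Remaining elements, processed one by one. */
    for (; i < size; i++) {
        dst[i] = src[i] - prev;
        prev = src[i];
    }
}
```

Keeping the last sample of each group in `prev` means every input is read from memory exactly once, which is the point of fetching in fours.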
10. int size1, int size2)

    int ARR2D_MUL3_SL(long *dest, long *src1, long *src2, int size1, int size2)

Parameters:

Table 3-7. ARR2D_MUL3 Parameters

    dest    in      Pointer to the destination array
    src1    in      Pointer to the source1 array
    src2    in      Pointer to the source2 array
    size1   in      Number of rows in arrays
    size2   in      Number of columns in arrays

Returns: The ARR2D_MUL3 macro generates an unsigned/signed output matrix, which is the result of multiplying src1 by src2 and is pointed to by dest.

3.7.3 Description of Optimization

C code:

    for (i = 0; i < SIZE1; i++)
        for (j = 0; j < SIZE2; j++)
            dest[i][j] = src1[i][j] * src2[i][j];

Optimization for the MAC unit can be done using the following techniques:

1. Loop unrolling by four.
2. Using the macl instruction, which allows multiplying simultaneously with loading four values for the next iteration.
3. The first four values are loaded using one movem instruction.

Optimized code (uses MAC unit):

    lea     -60(a7),a7
    movem.l d2-d7/a2-a5,(a7)
    move.l  #0x40,d0
    move.l  d0,MACSR
    moveq.l #16,d0
    move.l  dest,a0
    move.l  src1,a1
    move.l  src2,a2
    move.l  size1,d1
    move.l  size2,d2
    mulu.l  d2,d1
    move.l  d1,d2
    asr.l   #2,d1
    beq     out1
    move.l  #0,ACC0
    movem.l (a1),
11. 6.1 Creating a new Project
6.2 Modifying the settings of your project
6.3 Adding the Library of Macros
6.4 Using a macro

About This Book

This programmer's manual provides a detailed description of a set of macros used for optimizations. The information in this book is subject to change without notice, as described in the disclaimers on the title page. As with any technical documentation, it is the reader's responsibility to be sure they are using the most recent version of the documentation. To locate any published errata or updates for this document, refer to the world-wide web at http://www.freescale.com/coldfire.

Audience

This manual is intended for system software developers and applications programmers who want to develop products with ColdFire processors. It is assumed that the reader understands microprocessor system design, basic principles of software and hardware, and basic details of the ColdFire architecture.

Organization

This document is organized into five chapters. Chapter 1, Overview, includes a general description of the library of M
12. The Scientist and Engineer's Guide to Digital Signal Processing, Steven W. Smith, Ph.D., California Technical Publishing, http://www.dspguide.com

Revision History

The following table summarizes revisions to this manual since the previous release.

Revision History

    Revision Number   Date of Release   Substantive Changes
    1.0               10/2005           Initial Public Release

Chapter 1 Overview

The Library of Macros was designed to ensure efficient programming of the ColdFire processor by using the MAC and eMAC units where applicable. This document is the main document describing the Library of Macros, and it provides the following information on each macro in the library:

• Macros Description: general information about a macro, including a description and its purpose.
• Parameters Description: information on how to invoke a macro, as well as its parameters and returned value.
• Description of Optimization: information on the techniques that were used during macro optimization.

1.1 Project Resources

The following resources were used in the project:

• Targets:
    MCF5249 (evaluation board M5249C3)
    MCF5206 (evaluation board M5206EC3)
    MCF5282 (evaluation board M5282EVB)
• Compilation tools:
    Metrowerks CodeWarrior for ColdFire V4.0
    Metrowerks CodeWarrior for ColdFire V5.0
    WindRiver Diab RTA 4.4b Suite
    gc
13. 3.11 ARR2D_CAST_SWL, ARR2D_CAST_UWL
Chapter 4 Macros for DSP Algorithms
4.1 DOT_PROD_UL, DOT_PROD_SL
4.2 RDOT_PROD_UL, RDOT_PROD_SL
4.3 MATR_MUL_UL, MATR_MUL_SL
4.4 CONV
4.5 FIRST_DIFF
4.6 RUNN_SUM
4.7 LPASS_1POLE_FLTR
4.8 HPASS_1POLE_FLTR
4.9 ...
4.10 BANDPASS_FLTR
4.11 BANDREJECT_FLTR
4.12 MOV_AVG_FLTR
Chapter 5 Macros for Mathematical Functions
5.1 SIN
5.2 COS
5.3 SIN_F
5.4 ...
5.5 MUL
Chapter 6 QuickStart for CodeWarrior
14. for (i = 0; i < SIZE1; i++)
        for (j = 0; j < SIZE2; j++)
            arr[i][j] *= scal;

Optimization for the MAC unit can be done using the following techniques:

1. Loop unrolling by four.
2. Using the macl instruction, which allows multiplying simultaneously with loading four values for the next iteration.
3. The first four values are loaded using one movem instruction.

Optimized code (uses MAC unit):

    lea     -60(a7),a7
    movem.l d2-d6/a2-a5,(a7)
    move.l  #0,d0
    move.l  d0,MACSR
    move.l  arr,a0
    move.l  scal,d0
    move.l  size1,d1
    move.l  size2,d2
    mulu.l  d2,d1
    move.l  d1,d2
    asr.l   #2,d1
    beq     out1
    move.l  #0,ACC0
    moveq.l #16,d7
loop1:
    movem.l (a0),d3-d6
    mac.l   d0,d3,ACC0
    move.l  ACC0,d3
    move.l  #0,ACC0
    mac.l   d0,d4,ACC0
    move.l  ACC0,d4
    move.l  #0,ACC0
    mac.l   d0,d5,ACC0
    move.l  ACC0,d5
    move.l  #0,ACC0
    mac.l   d0,d6,ACC0
    move.l  ACC0,d6
    move.l  #0,ACC0
    movem.l d3-d6,(a0)
    add.l   d7,a0
    subq.l  #1,d1
    bne     loop1
out1:
    and.l   #3,d2
    beq     out2
loop2:
    move.l  (a0),d3
    muls.l  d0,d3
    move.l  d3,(a0)+
    subq.l  #1,d2
    bne     loop2
out2:
    movem.l (a7),d2-d6/a2-a5
    lea     60(a7),a7

Optimization for the eMAC unit can be done using the following techniques:

1. Loop unrolling by four.
2. Using four accumulators for pipelining.
3. Using the macl instruction, which allows multiplying simultaneously with loading four values for the next iteration.
4. The first
15. 5.1.2 Parameters Description

Call(s):

    FIXED64 SIN(FIXED64 ang)

Parameters:

Table 5-1. SIN Parameters

    ang     in      An angle value

Returns: Sine value of the angle.

5.1.3 Description of Optimization

Because the SIN macro performs only a few simple arithmetic operations on the ang parameter before invoking the SIN_F/COS_F functions, no optimization is needed.

5.2 COS

5.2.1 Macro Description

This macro performs some arithmetic operations on the angle parameter to reduce the angle value to the range [0, π/4], and then calls the SIN_F or COS_F macro to compute the cosine function.

Notes:

• The value of the angle parameter must be in [0, 2π].
• The SIN and COS macros have a common header file, sin_cos.h.

5.2.2 Parameters Description

Call(s):

    FIXED64 COS(FIXED64 ang)

Parameters:

Table 5-2. COS Parameters

    ang     in      An angle value

Returns: Cosine value of the angle.

5.2.3 Description of Optimization

Because the COS macro performs only a few simple arithmetic operations on the ang parameter before invoking the SIN_F/COS_F functions, no optimization is needed.

5.3 SIN_F

5.3.1 Macro Description

This macro computes the sine of an angle from the range [0, π/4]. The computation is done using a Taylor series consisting of 6 elements.

Notes:

• Value of th
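The six-element Taylor series mentioned for SIN_F in section 5.3.1 can be illustrated in floating point. This is a sketch under assumptions, not the library's fixed-point implementation: the name `sin_taylor6` is hypothetical, `double` replaces the FIXED64 type, the series is evaluated in nested (Horner) form, and the argument is assumed to be already reduced to [0, π/4].

```c
/* Hypothetical floating-point sketch of a 6-term Taylor series for sine:
 *   sin x = x - x^3/3! + x^5/5! - x^7/7! + x^9/9! - x^11/11!
 * written in nested form so each step needs one multiply by x^2:
 *   x * (1 - x^2/(2*3) * (1 - x^2/(4*5) * (1 - x^2/(6*7)
 *         * (1 - x^2/(8*9) * (1 - x^2/(10*11))))))
 * Intended for arguments already reduced to [0, pi/4] (assumption). */
double sin_taylor6(double x)
{
    double x2 = x * x;
    double p = 1.0 - x2 / 110.0;   /* innermost factor, 10*11 */

    p = 1.0 - (x2 / 72.0) * p;     /* 8*9 */
    p = 1.0 - (x2 / 42.0) * p;     /* 6*7 */
    p = 1.0 - (x2 / 20.0) * p;     /* 4*5 */
    p = 1.0 - (x2 / 6.0) * p;      /* 2*3 */
    return x * p;
}
```

On [0, π/4] the first omitted term, x^13/13!, is below 1e-11, which is why so short a series is adequate once the argument has been reduced.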
16. Should Buyer purchase or use Freescale Semiconductor products for any such unintended or unauthorized application, Buyer shall indemnify and hold Freescale Semiconductor and its officers, employees, subsidiaries, affiliates, and distributors harmless against all claims, costs, damages, and expenses, and reasonable attorney fees arising out of, directly or indirectly, any claim of personal injury or death associated with such unintended or unauthorized use, even if such claim alleges that Freescale Semiconductor was negligent regarding the design or manufacture of the part.

Learn More: For more information about Freescale products, please visit www.freescale.com.

Freescale and the Freescale logo are trademarks of Freescale Semiconductor, Inc. All other product or service names are the property of their respective owners. © Freescale Semiconductor, Inc. 2005. All rights reserved.

Contents

About This Book
Organization
Conventions
Definitions, Acronyms, and Abbreviations
Revision History
Chapter 1 Overview
1.1 Project Resources
1.2 Structure of the Project an
17. Table 6-1. Result of CONV Example (columns: x (f32), h (f32), y (f32))

Figure 6-5. Resulting Graph of CONV Example
18. ...
    mac.l    d6,d0,ACC2     ; computes b1*y(i+1) to produce y(i+2)
    movclr.l ACC2,d0        ; moves y(i+2) to d0
    move.l   d0,(a0)+       ; and stores y(i+2) to memory
    mac.l    d6,d0,ACC3     ; computes b1*y(i+2) to produce y(i+3)
    movclr.l ACC3,d0        ; moves y(i+3) to d0
    move.l   d0,(a0)+       ; and stores y(i+3) to memory

4.8 HPASS_1POLE_FLTR

4.8.1 Macro Description

The macro computes a single-pole high-pass filter. This recursive filter uses three coefficients, a0, a1, and b1, so the filter can be represented in the form:

$$ y_n = a_0 x_n + a_1 x_{n-1} + b_1 y_{n-1} $$

The filter's response characteristics are controlled by the parameter x, a value between zero and one. Physically, x is the amount of decay between adjacent samples.

$$ a_0 = \frac{1 + x}{2}, \qquad a_1 = -\frac{1 + x}{2}, \qquad b_1 = x $$

Note: The filter becomes unstable if x is made greater than one. In that case, any nonzero value on the input makes the output grow until an overflow occurs.

More details on this digital recursive filter's characteristic ma
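The single-pole high-pass recursion described in section 4.8.1 can be written out directly in C. This is a floating-point sketch, not the library's fixed-point macro: the name `hpass_1pole_sketch`, the `double` sample type, and the zero initial history for x(n-1) and y(n-1) are assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical sketch of y[n] = a0*x[n] + a1*x[n-1] + b1*y[n-1]
 * with a0 = (1 + x)/2, a1 = -(1 + x)/2, b1 = x.
 * 'decay' is the parameter called x in the text (0 < decay < 1). */
void hpass_1pole_sketch(double *y, const double *in, size_t size, double decay)
{
    double a0 = (1.0 + decay) / 2.0;
    double a1 = -(1.0 + decay) / 2.0;
    double b1 = decay;

    double x_prev = 0.0;   /* assumed zero history before the first sample */
    double y_prev = 0.0;

    for (size_t i = 0; i < size; i++) {
        double out = a0 * in[i] + a1 * x_prev + b1 * y_prev;
        x_prev = in[i];
        y_prev = out;
        y[i] = out;
    }
}
```

Feeding a unit step with decay 0.5 gives 0.75, 0.375, 0.1875, ...: the constant input is rejected and the output decays by the factor x per sample, which matches the description of x as the decay between adjacent samples.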
19. columns of matrix
    scal    in      Scalar value

Returns: The ARR2D_ADDSC macro generates unsigned/signed output values, which are stored in the array pointed to by the parameter arr.

3.4.3 Description of Optimization

C code:

    for (i = 0; i < SIZE1; i++)
        for (j = 0; j < SIZE2; j++)
            arr[i][j] += scal;

Optimization can be done using the following techniques:

1. The elements are accessed as 1D array elements, with the number of elements equal to size1 * size2, because the elements of a 2D array are located sequentially in memory.
2. Loop unrolling by four.
3. The four values of array arr used in each iteration are stored using postincrement addressing mode while the additions are performed.
4. If the number of elements is not divisible by four, the tail elements are processed in regular order.

Optimized code:

    move.l  size1,d1
    move.l  size2,d2
    ...

3.4.4 Differences Between the ARR2D_ADDSC_UL and the ARR2D_ADDSC_SL Macros

There are no differences. The macro was written in two versions in order to preserve library uniformity.

3.5 ARR2D_PROD_UL, ARR2D_PROD_SL

3.5.1 Macros Description

These macros compute the product of the elements of a 2D array with unsigned/signed values. The product is computed by the formula:

$$ res = \prod_{i=0}^{size1-1} \; \prod_{j=0}^{size2-1} x_{ij} $$
20.         d1,d2
    asr.l   #2,d1
    beq     out1
    move.l  #0,ACC0
    move.l  #0,ACC1
    move.l  #0,ACC2
    move.l  #0,ACC3
    movem.l (a1),d7/a3-a5
    add.l   d0,a1
loop1:
    movem.l (a2),d3-d6
    macl.l  d7,d3,(a1)+,d7,ACC0
    macl.l  a3,d4,(a1)+,a3,ACC1
    macl.l  a4,d5,(a1)+,a4,ACC2
    macl.l  a5,d6,(a1)+,a5,ACC3
    movclr.l ACC0,d3
    movclr.l ACC1,d4
    movclr.l ACC2,d5
    movclr.l ACC3,d6
    movem.l d3-d6,(a0)
    add.l   d0,a2
    add.l   d0,a0
    subq.l  #1,d1
    bne     loop1
out1:
    and.l   #3,d2
    beq     out2
    sub.l   d0,a1
loop2:
    move.l  (a2)+,d3
    mulu.l  (a1)+,d3
    move.l  d3,(a0)+
    subq.l  #1,d2
    bne     loop2
out2:
    movem.l (a7),d2-d7/a2-a5
    lea     60(a7),a7

2.7.4 Differences Between ARR1D_MUL3_UL and ARR1D_MUL3_SL

The ARR1D_MUL3_UL macro uses the unsigned mode of the MAC unit, while the ARR1D_MUL3_SL macro uses the signed mode.

2.8 ARR1D_MULSC_SL, ARR1D_MULSC_UL

2.8.1 Macros Description

These macros multiply a vector array by an unsigned/signed scalar value.

2.8.2 Parameters Description

Call(s):

    int ARR1D_MULSC_UL(long *arr, long size, unsigned long scal)
    int ARR1D_MULSC_SL(long *arr, long size, long scal)

Parameters:

Table 2-8. ARR1D_MULSC Parameters

    arr     in      Pointer to the destination vector
    size    in      Number of elements in vectors
    scal    in
21.         d7/a3-a5
    add.l   d0,a1
loop1:
    movem.l (a2),d3-d6
    macl.l  d7,d3,(a1)+,d7,ACC0
    move.l  ACC0,d3
    move.l  #0,ACC0
    macl.l  a3,d4,(a1)+,a3,ACC0
    move.l  ACC0,d4
    move.l  #0,ACC0
    macl.l  a4,d5,(a1)+,a4,ACC0
    move.l  ACC0,d5
    move.l  #0,ACC0
    macl.l  a5,d6,(a1)+,a5,ACC0
    move.l  ACC0,d6
    move.l  #0,ACC0
    movem.l d3-d6,(a0)
    add.l   d0,a2
    add.l   d0,a0
    subq.l  #1,d1
    bne     loop1
out1:
    and.l   #3,d2
    beq     out2
    sub.l   d0,a1
loop2:
    move.l  (a2)+,d3
    mulu.l  (a1)+,d3
    move.l  d3,(a0)+
    subq.l  #1,d2
    bne     loop2
out2:
    movem.l (a7),d2-d7/a2-a5
    lea     60(a7),a7

Optimization for the eMAC unit can be done using the following techniques:

1. Loop unrolling by four.
2. Using four accumulators for pipelining.
3. Using the macl instruction, which allows multiplying simultaneously with loading four values for the next iteration.
4. The first four values are loaded using one movem instruction.

Optimized code (uses eMAC unit):

    lea     -60(a7),a7
    movem.l d2-d7/a2-a5,(a7)
    moveq.l #16,d0
    move.l  dest,a0
    move.l  src1,a1
    move.l  src2,a2
    move.l  size1,d1
    move.l  size2,d2
    mulu.l  d2,d1
    move.l  d1,d2
    asr.l   #2,d1
    beq     out1
    move.l  #0,ACC0
    move.l  #0,ACC1
    move.l  #0,ACC2
    move.l  #0,ACC3
    movem.l (a1),d7/a3-a5
    add.l   d0,a1
loop1:
    movem.l (a2),d3-d6
    macl.l  d7,d3,(a1)+,d7,ACC0
    macl.l  a3,d4,(a1)+,a3,ACC1
    macl.l  a4,d5,(a1)+,a4,ACC2
    macl.l  a5,d6,(a1)+,a5,ACC3
22. elements is not divisible by 4, the tail elements are processed in regular order.

Optimized code:

    move.l  size,d1
    move.l  d1,d2
    asr.l   #2,d1
    beq     l1
l2:
    movem.l (a0),d3-d6
    add.l   (a1)+,d3
    add.l   (a1)+,d4
    add.l   (a1)+,d5
    add.l   (a1)+,d6
    movem.l d3-d6,(a0)
    add.l   #16,a0
    subq.l  #1,d1
    bne     l2
    ...

2.2.4 Differences Between the ARR1D_ADD2_UL and the ARR1D_ADD2_SL Macros

The type of the dest and src parameters of ARR1D_ADD2_UL is unsigned long. The type of the dest and src parameters of ARR1D_ADD2_SL is signed long.

2.3 ARR1D_ADD3_UL, ARR1D_ADD3_SL

2.3.1 Macros Description

These macros compute the elementwise sum of two vector arrays with unsigned/signed values and store the results in a third vector with unsigned/signed values. The elementwise sum is computed by the formula:

$$ Z = X + Y: \quad z_i = x_i + y_i, \quad x_i \in X,\; y_i \in Y,\; z_i \in Z,\; i \in [0, size-1] $$

where
    X, Y    input vectors
    x, y    elements of the corresponding vectors
    Z       resultant vector
    z       element of vector Z
    size    number of elements in the input vectors

2.3.2 Parameters Description

Call(s):

    int ARR1D_ADD3_UL(unsigned long *dest, unsigned long *src1, unsigned long *src2, int size)
    int ARR1D_ADD3_SL(signed long *dest, signed long *src1, signed long *src2, int size)

Parameters:

Table 2-3. ARR1D_ADD3 Parameters

    dest    in/out  Point
23. for eMAC unit can be done using the following techniques:

1. Loop unrolling by four.
2. Using four accumulators for pipelining.
3. Using the macl instruction, which allows multiplying simultaneously with loading four values for the next iteration.
4. The first four values are loaded using one movem instruction.

Optimized code (uses eMAC unit):

    lea     -60(a7),a7
    movem.l d2-d6/a2-a5,(a7)
    move.l  arr,a0
    move.l  scal,d0
    move.l  size,d1
    move.l  d1,d2
    asr.l   #2,d1
    beq     out1
    move.l  #0,ACC0
    move.l  #0,ACC1
    move.l  #0,ACC2
    move.l  #0,ACC3
    moveq.l #16,d7
loop1:
    movem.l (a0),d3-d6
    mac.l   d0,d3,ACC0
    mac.l   d0,d4,ACC1
    mac.l   d0,d5,ACC2
    mac.l   d0,d6,ACC3
    movclr.l ACC0,d3
    movclr.l ACC1,d4
    movclr.l ACC2,d5
    movclr.l ACC3,d6
    movem.l d3-d6,(a0)
    add.l   d7,a0
    subq.l  #1,d1
    bne     loop1
out1:
    and.l   #3,d2
    beq     out2
loop2:
    move.l  (a0),d3
    muls.l  d0,d3
    move.l  d3,(a0)+
    subq.l  #1,d2
    bne     loop2
out2:
    movem.l (a7),d2-d6/a2-a5

2.8.4 Differences Between ARR1D_MULSC_UL and ARR1D_MULSC_SL

The ARR1D_MULSC_UL macro uses the unsigned mode of the MAC unit, while the ARR1D_MULSC_SL macro uses the signed mode.

2.9 ARR1D_MAX_S, ARR1D_MAX_U

2.9.1 Macros Description

These macros search for the maximum element in a 1D array of signed or unsigned integer values.

2.9.2 Parameters
24. four values are loaded using one movem instruction.

Optimized code (uses eMAC unit):

    lea     -60(a7),a7
    movem.l d2-d6/a2-a5,(a7)
    move.l  arr,a0
    move.l  scal,d0
    move.l  size1,d1
    move.l  size2,d2
    mulu.l  d2,d1
    move.l  d1,d2
    asr.l   #2,d1
    beq     out1
    move.l  #0,ACC0
    move.l  #0,ACC1
    move.l  #0,ACC2
    move.l  #0,ACC3
    moveq.l #16,d7
loop1:
    movem.l (a0),d3-d6
    mac.l   d0,d3,ACC0
    mac.l   d0,d4,ACC1
    mac.l   d0,d5,ACC2
    mac.l   d0,d6,ACC3
    movclr.l ACC0,d3
    movclr.l ACC1,d4
    movclr.l ACC2,d5
    movclr.l ACC3,d6
    movem.l d3-d6,(a0)
    add.l   d7,a0
    subq.l  #1,d1
    bne     loop1
out1:
    and.l   #3,d2
    beq     out2
loop2:
    move.l  (a0),d3
    muls.l  d0,d3
    move.l  d3,(a0)+
    subq.l  #1,d2
    bne     loop2
out2:
    movem.l (a7),d2-d6/a2-a5
    lea     60(a7),a7

3.8.4 Differences Between ARR2D_MULSC_UL and ARR2D_MULSC_SL

The ARR2D_MULSC_UL macro uses the unsigned mode of the MAC unit, while the ARR2D_MULSC_SL macro uses the signed mode.

3.9 ARR2D_MAX_S, ARR2D_MAX_U

3.9.1 Macros Description

These macros search for the maximum element in a 2D array of signed or unsigned integer values.

3.9.2 Parameters Description

Call(s):

    ARR2D_MAX_S(void *src, int size1, int size2)
    ARR2D_MAX_U(void *src, int size1, int size2)

The elements are held in the src array. The src array is searched for the maximum from 0 to siz
25.         out     Pointer to the destination array
    src1    in      Pointer to the source array 1
    src2    in      Pointer to the source array 2
    size1   in      Number of rows of matrices
    size2   in      Number of columns of matrices

Returns: The ARR2D_ADD3 macro generates unsigned/signed output values, which are stored in the array pointed to by the parameter dest.

3.3.3 Description of Optimization

C code:

    for (i = 0; i < SIZE1; i++)
        for (j = 0; j < SIZE2; j++)
            dest[i][j] = src1[i][j] + src2[i][j];

Optimization can be done using the following techniques:

1. The elements are accessed as 1D array elements, with the number of elements equal to size1 * size2, because the elements of a 2D array are located sequentially in memory.
2. Loop unrolling by four.
3. The four values of array src1 used in each iteration are loaded with a single movem instruction.
4. The four values of array src2 used in each iteration are loaded using postincrement addressing mode while the additions are performed.
5. After the additions are performed, the four resulting values of each iteration are stored into the dest array with a single movem instruction.
6. If the number of elements is not divisible by four, the tail elements are processed in regular order.

Optimized code:

    move.l  size1,d1
    move.l  size2,d2
    mulu.l  d2,d1
    move.l  d1,d2
    asr.l   #2,d1
    beq     l1
    movem.l (a1),d3-d6
    add.l   (a2)+,d3
    add.l   (a2)+,d4
26. were used:

1. Multiple load/store operations to access array elements.
2. Loop unrolling by four.
3. Descending loop organization.

The particular optimization techniques are reviewed below.

C code:

    for (i = 0; i < SIZE1; i++)
        for (j = 0; j < SIZE2; j++)
            if (src[i][j] > max)
                max = src[i][j];

Optimized code (this code is similar to the 1D array macro, but in the pre-loop operations the linear size must be calculated and stored; taken from the ARR2D_MAX_S macro):

    movem.l (a0),d1-d4      ; multiple load operations to access
                            ; source array elements
    cmp.l   d1,d5           ; making comparisons between four
                            ; elements because of loop unrolling
    ...
    move.l  d1,d5
    addq.l  #1,d6
    move.l  d6,a3           ; index is accumulated in d6
    bra     c2
    ...
                            ; descending loop organization

3.9.4 Differences Between ARR2D_MAX_U and ARR2D_MAX_S

For signed and unsigned values, the appropriate comparison instructions were used.

3.10 ARR2D_MIN_S, ARR2D_MIN_U

3.10.1 Macros Description

These macros search for the minimum element in a 2D array of signed or unsigned integer numbers.

3.10.2 Parameters Description

Call(s):

    ARR2D_MIN_S(void *src, int size1, int size2)
    ARR2
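The ARR2D_MAX approach from section 3.9.3 (compute the linear size first, then scan in groups of four while tracking the index) might look like this in C. The sketch is illustrative only: the name `arr2d_max_sketch`, the `int32_t` element type, the `index_out` parameter, and the handling of an empty array are all assumptions, since the excerpt does not show how the macro reports its result.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch of a linearized, unrolled-by-four maximum search.
 * Returns the maximum value; if index_out is non-NULL it receives the
 * linear index of the first maximum. An empty array yields INT32_MIN. */
int32_t arr2d_max_sketch(const int32_t *src, size_t size1, size_t size2,
                         size_t *index_out)
{
    size_t n = size1 * size2;          /* linear size, computed before the loop */
    if (n == 0) {
        if (index_out) *index_out = 0;
        return INT32_MIN;
    }

    int32_t max = src[0];
    size_t max_idx = 0;
    size_t i = 1;
    size_t blocks = (n - 1) >> 2;      /* groups of four after element 0 */

    while (blocks != 0) {              /* descending loop organization */
        /* Load four elements together, then compare each against max. */
        int32_t v0 = src[i], v1 = src[i + 1], v2 = src[i + 2], v3 = src[i + 3];

        if (v0 > max) { max = v0; max_idx = i; }
        if (v1 > max) { max = v1; max_idx = i + 1; }
        if (v2 > max) { max = v2; max_idx = i + 2; }
        if (v3 > max) { max = v3; max_idx = i + 3; }

        i += 4;
        blocks--;
    }

    for (; i < n; i++) {               /* tail elements */
        if (src[i] > max) { max = src[i]; max_idx = i; }
    }

    if (index_out) *index_out = max_idx;
    return max;
}
```

An unsigned variant would differ only in the element type, which changes the comparison the compiler emits; that corresponds to the difference noted in section 3.9.4.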
27. 0 x i for y i ouput element macl w d6 u d0 u lt lt al t a2 ACCO mcomputesebii salit cya ied OW ut element and loads the next input operand msacow dl u dass ACCU computes b 2 y i 2 for y i ouput element mac w a4 u d4 u ACCO computes b 3 y i 3 for y i ouput element msacow asc do ce ACCD computes b 4 y i 4 to produce yli move l ACCO d5 mnovessyiis t oNds move l 0 ACCO Clear accumulator move l d5 a0 and stores y i to memory Optimization for eMAC unit The following should be noticed e The loop is unrolled by four e Coefficients ag b b2 b3 and b are pre computed and held in registers a3 d6 d7 a4 and a5 correspondingly e The a2 register always holds the input sample per each iteration e Input operands are fetched from memory one by one and stored in registers d5 d4 d3 and dO 4 86 Library of Macros for Optimization Rev 1 0 Freescale Semiconductor All add multiply instructions are performed by the eMAC unit After each computation of an output sample the movclr instruction is used to clear the accumulator and store the result into the general purpose register After the result is stored into memory Optimized code uses eMAC unit micr a3 a2 ACCO computes a 0 x i for y i ouput element macie lt d6 d0 ality az ACe0 computes b 1 y i 1 for y i 1 ouput element loads the next input operand msac l d7 d3 ACCO computes b 2 y i 2 for y i ou
28. 2.6.4 Differences Between ARR1D_MUL2_UL and ARR1D_MUL2_SL

The ARR1D_MUL2_UL macro uses the unsigned mode of the MAC unit, while the ARR1D_MUL2_SL macro uses signed mode.

2.7 ARR1D_MUL3_SL, ARR1D_MUL3_UL

2.7.1 Macros Description

These macros multiply two vectors of unsigned/signed values and store the result in a third vector. The ARR1D_MUL3_UL macro uses the unsigned mode of the MAC unit, while the ARR1D_MUL3_SL macro uses signed mode.

2.7.2 Parameters Description

Call(s):
int ARR1D_MUL3_UL(unsigned long *dest, unsigned long *src1, unsigned long *src2, int size)
int ARR1D_MUL3_SL(long *dest, long *src1, long *src2, int size)

Parameters: Table 2-7. ARR1D_MUL3 Parameters
dest in Pointer to the destination vector
src1 in Pointer to the source1 vector
src2 in Pointer to the source2 vector
size in Number of elements in vectors

Returns: The ARR1D_MUL3 macro generates an unsigned/signed output vector, which is the result of multiplying src1 by src2 and is pointed to by dest.

2.7.3 Description of Optimization

C code:

    for (i = 0; i < SIZE; i++)
        dest[i] = src1[i] * src2[i];

Optimization for the MAC unit can be done using the following techniques:
1. Loop unrolling by four.
2. Using the macl instruction, which allows multiplying simultaneously with loading four values for the next iteration.
3. First four values are loaded using one movem instructi
29. Optimized code:

    loop1:
        movem.l (a1),d4-d7      ; multiple load operations to access the source array's elements
        add.l   d4,d0           ; computing the output value
        move.l  d0,(a0)+        ; storing the value in the output array
        add.l   d5,d0
        move.l  d0,(a0)+
        add.l   d6,d0
        move.l  d0,(a0)+
        add.l   d7,d0
        move.l  d0,(a0)+
        add.l   #16,a1
        subq.l  #1,d1           ; descending loop organization
        bne     loop1

4.7 LPASS_1POLE_FLTR

4.7.1 Macros Description

This macro computes a single-pole low-pass filter. This recursive filter uses just two coefficients, $a_0$ and $b_1$, so the filter can be represented in the following form:

$y_n = a_0 x_n + b_1 y_{n-1}$

The filter's response characteristics are controlled by the parameter x, a value between zero and one. Physically, x is the amount of decay between adjacent samples:

$a_0 = 1 - x$
$b_1 = x$

Note: The filter becomes unstable if x is made greater than one. In that case, any non-zero value on the input will increase the output until an overflow occurs.

More details on this digital recursive filter's characteristics may be found in "The Scientist and Engineer's Guide to Digital Signal Processing", Steven W. Smith, Ph.D., California Technical Publishing, http://www.dspguide.com

4.7.2 Parameters Description

Call(s):
LPASS_1POLE_FLTR(FRAC32 *dst, FRAC32 *src, long size, FRAC32 x)

The input s
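The recurrence $y_n = a_0 x_n + b_1 y_{n-1}$ can be checked against a plain C reference model. The sketch below is not the library macro: the name lpass_1pole_ref is hypothetical, it works on double values rather than FRAC32, and it assumes the output before the first sample is zero, which the description above does not state.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical floating-point reference for the single-pole low-pass filter:
 *   y[n] = a0 * x[n] + b1 * y[n-1],  a0 = 1 - x,  b1 = x
 * Assumption: y[-1] is taken as 0. */
void lpass_1pole_ref(double *dst, const double *src, size_t size, double x)
{
    assert(x >= 0.0 && x <= 1.0);   /* the filter is unstable for x > 1 */

    const double a0 = 1.0 - x;
    const double b1 = x;
    double prev = 0.0;              /* last computed output sample */

    for (size_t n = 0; n < size; n++) {
        prev = a0 * src[n] + b1 * prev;
        dst[n] = prev;
    }
}
```

With x = 0.5 and a constant input of 1.0, the output approaches 1.0 by halving the remaining distance at every sample, which is a quick sanity check on the coefficients.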
30. C0,d6
    move.l  #0,ACC0
    movem.l d3-d6,(a0)
    add.l   d0,a0
    subq.l  #1,d1
    bne     loop1
out1:
    and.l   #3,d2
    beq     out2
    sub.l   d0,a1
loop2:
    move.l  (a0),d3
    muls.l  (a1)+,d3
    move.l  d3,(a0)+
    subq.l  #1,d2
    bne     loop2
out2:
    movem.l (a7),d2-d7/a2-a5
    lea     60(a7),a7

Optimization for the eMAC unit can be done using the following techniques:
1. Loop unrolling by four.
2. Using four accumulators for pipelining.
3. Using the macl instruction, which allows multiplying simultaneously with loading four values for the next iteration.
4. The first four values are loaded using one movem instruction.

Optimized code (uses eMAC unit):

    lea     -60(a7),a7
    movem.l d2-d7/a2-a5,(a7)
    moveq.l #16,d0
    move.l  dest,a0
    move.l  src,a1
    move.l  size,d1
    move.l  d1,d2
    asr.l   #2,d1
    beq     out1
    move.l  #0,ACC0
    move.l  #0,ACC1
    move.l  #0,ACC2
    move.l  #0,ACC3
    movem.l (a1),d7/a3-a5
    add.l   d0,a1
loop1:
    movem.l (a0),d3-d6
    macl.l  d7,d3,(a1)+,d7,ACC0
    macl.l  a3,d4,(a1)+,a3,ACC1
    macl.l  a4,d5,(a1)+,a4,ACC2
    macl.l  a5,d6,(a1)+,a5,ACC3
    movclr.l ACC0,d3
    movclr.l ACC1,d4
    movclr.l ACC2,d5
    movclr.l ACC3,d6
    movem.l d3-d6,(a0)
    add.l   d0,a0
    subq.l  #1,d1
    bne     loop1
out1:
    and.l   #3,d2
    beq     out2
    sub.l   d0,a1
loop2:
    move.l  (a0),d3
    mulu.l  (a1)+,d3
    move.l  d3,(a0)+
    subq.l  #1,d2
    bne     loop2
out2:
    movem.l (a7),d2-d7/a2-a5
    lea     60(a7),a7
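The loop structure of the listings above (a main loop unrolled by four, followed by a tail loop for the remaining size mod 4 elements) can be sketched in portable C. This is an illustrative model only: the name arr1d_mul2_ref and the use of uint32_t are assumptions, and the sketch does not model the accumulators or the preloading done by macl.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C model of the unroll-by-four pattern: dest[i] = dest[i] * src[i].
 * Unsigned arithmetic wraps modulo 2^32, matching a 32-bit result register. */
void arr1d_mul2_ref(uint32_t *dest, const uint32_t *src, size_t size)
{
    size_t blocks = size >> 2;   /* number of four-element iterations (asr.l #2) */
    size_t tail = size & 3;      /* leftover elements (and.l #3) */

    while (blocks-- > 0) {       /* main loop, unrolled by four */
        dest[0] *= src[0];
        dest[1] *= src[1];
        dest[2] *= src[2];
        dest[3] *= src[3];
        dest += 4;
        src += 4;
    }
    while (tail-- > 0) {         /* tail elements processed in regular order */
        *dest++ *= *src++;
    }
}
```

Splitting the count into size >> 2 and size & 3 is what lets the main loop skip any per-element bounds check.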
31. D_MIN_U(void *src, int size1, int size2)

The elements are held in the src array. The src array is searched for a minimum from 0 to size - 1, where size = size1 * size2. Prior to any call of ARR2D_MIN_S and ARR2D_MIN_U, the user must allocate memory for the src array, either in static or in dynamic memory. The types of the array and the invoking macro must correspond. In the declaration, the src array is declared as void for compatibility.

Parameters: Table 3-10. ARR2D_MIN_S, ARR2D_MIN_U Parameters
src in Pointer to the input array
size1 in Number of rows
size2 in Number of columns

Returns: The ARR2D_MIN_S and ARR2D_MIN_U macros return the minimum element's index as their result, which is why they can be used in an assignment operation. The index is linear and must be converted to two indices to access a C array. The conversion can be done in the following way:

index1 = index / size2
index2 = index - index1 * size2

where
index1: first C index (row)
index2: second C index (column)
index: linear index
size1: number of rows
size2: number of columns

3.10.3 Description of Optimization

These macros do not use any multiply operations, so MAC and eMAC instructions are not suitable for optimizing them. This is why instructions from the Integer Instruction Set were used for optimization. For signed and unsigned values, the appropriate comparison instructions were used. All optimization issues are the same for both macros. Th
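The search and the index conversion described in this section can be sketched as a short C reference. The names arr2d_min_index_ref and linear_to_2d are hypothetical, and returning the first occurrence of the minimum is an assumption; the text does not say which index the macros report when several elements tie.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference: linear index of the minimum element of a 2D array
 * stored row by row. Assumption: the first occurrence wins on ties. */
size_t arr2d_min_index_ref(const long *src, size_t size1, size_t size2)
{
    size_t size = size1 * size2;     /* the 2D array is scanned as a 1D array */
    size_t best = 0;

    assert(size > 0);
    for (size_t i = 1; i < size; i++) {
        if (src[i] < src[best]) {
            best = i;
        }
    }
    return best;
}

/* Converts a linear index into (row, column) using the formulas above. */
void linear_to_2d(size_t index, size_t size2, size_t *index1, size_t *index2)
{
    assert(size2 > 0);
    *index1 = index / size2;            /* row */
    *index2 = index - *index1 * size2;  /* column */
}
```

For a 2 x 3 array, linear index 3 maps to row 1, column 0, since 3 / 3 = 1 and 3 - 1 * 3 = 0.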
32. Descending loop organization

The following should be noticed:
- The d0 register always holds the sum of the array elements.
- The a0 register holds the pointer to the input array.
- The d1 register is the counter.

Optimized code:

    loop1:
        add.l   (a0)+,d0
        add.l   (a0)+,d0
        add.l   (a0)+,d0
        add.l   (a0)+,d0
        subq.l  #1,d1
        bne     loop1

3.1.4 Differences Between the ARR2D_SUM_UL and the ARR2D_SUM_SL Macros

The type of the ARR2D_SUM_UL parameter src is unsigned long. The type of the ARR2D_SUM_SL parameter src is signed long.

3.2 ARR2D_ADD2_UL, ARR2D_ADD2_SL

3.2.1 Macros Description

These macros compute the elementwise sum of two 2D arrays of unsigned/signed values. The elementwise sum is computed by the formula:

$x_{ij} = x_{ij} + y_{ij}, \quad x \in X, \; y \in Y, \quad i \in [0, size1 - 1], \; j \in [0, size2 - 1]$

where
X, Y: input arrays
$x_{ij}$, $y_{ij}$: elements of the corresponding arrays
size1: number of rows
size2: number of columns

Note: The type of the array elements in the ARR2D_ADD2_UL macro must be unsigned long, and the type of the array elements in the ARR2D_ADD2_SL macro must be signed long.

3.2.2 Parameters Description

Call(s):
int ARR2D_ADD2_UL(void *dest, void *src, int size1, int size2)
int ARR2D_ADD2_SL(void *dest, void *src, int size1, int size2)
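Because a C 2D array is stored row by row, the elementwise sum $x_{ij} = x_{ij} + y_{ij}$ can be computed with a single loop over size1 * size2 elements. The following is a minimal reference sketch, not the macro itself; the name arr2d_add2_ref is hypothetical and the element type is fixed to unsigned long for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference for the in-place elementwise sum of two 2D arrays.
 * Both arrays are passed as pointers to their first element and are walked
 * as flat 1D arrays of size1 * size2 elements. */
void arr2d_add2_ref(unsigned long *dest, const unsigned long *src,
                    int size1, int size2)
{
    assert(size1 >= 0 && size2 >= 0);

    size_t count = (size_t)size1 * (size_t)size2;
    for (size_t k = 0; k < count; k++) {
        dest[k] += src[k];       /* dest[i][j] += src[i][j] with k = i * size2 + j */
    }
}
```

An array declared as unsigned long a[2][3] can be passed as &a[0][0]; element a[i][j] is then visited at flat position i * 3 + j.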
33. Description

Call(s):
ARR1D_MAX_S(signed long *src, int size)
ARR1D_MAX_U(unsigned long *src, int size)

The elements are held in array src. The src array is searched for a maximum from 0 to size - 1. Prior to any call of the ARR1D_MAX_S and ARR1D_MAX_U macros, the user must allocate memory for the src array, either in static or in dynamic memory. The types of the array and the invoking macro must correspond.

Parameters: Table 2-9. ARR1D_MAX_S, ARR1D_MAX_U Parameters
src in Pointer to the input array
size in Number of elements in the input array

Returns: The ARR1D_MAX_S and ARR1D_MAX_U macros return the maximum element's index as their result, which is why they can be used in an assignment operation.

2.9.3 Description of Optimization

These macros do not use any multiplication operations, so MAC and eMAC instructions are not suitable for optimizing them. This is why instructions from the Integer Instruction Set were used for optimization. For signed and unsigned values, the appropriate comparison instructions were used. All optimization issues are the same for both macros.

The following optimization techniques were used:
1. Multiple load/store operations for accessing array elements.
2. Loop unrolling by four.
3. Descending loop organization.

Particular techniques of optimization are reviewed below.

C code: fo
34. Description of Optimization

C code:

    for (i = 0; i < SIZE; i++)
        res = res * arr[i];

Optimization can be done using the following techniques:
1. Loop unrolling by four.
2. The four values of array arr used in each iteration are loaded using postincrement addressing mode while performing the multiplications.
3. If the number of elements is not divisible by four, the tail elements are processed in regular order.

Optimized code:

    move.l  size,d1
    move.l  d1,d2
    moveq.l #1,d0
    asr.l   #2,d1
    beq     out1
loop1:
    mulu.l  (a0)+,d0
    mulu.l  (a0)+,d0
    mulu.l  (a0)+,d0
    mulu.l  (a0)+,d0
    subq.l  #1,d1
    bne     loop1
out1:
    and.l   #3,d2
    beq     out2
loop2:
    mulu.l  (a0)+,d0
    subq.l  #1,d2
    bne     loop2
out2:

2.5.4 Differences Between the ARR1D_PROD_UL and the ARR1D_PROD_SL Macros

The type of the ARR1D_PROD_UL parameter arr is unsigned long. The type of the ARR1D_PROD_SL parameter arr is signed long. ARR1D_PROD_UL uses the mulu instruction for multiplication. ARR1D_PROD_SL uses the muls instruction for multiplication to keep the signs of the operands.

2.6 ARR1D_MUL2_SL, ARR1D_MUL2_UL

2.6.1 Macros Description

These macros perform multiplication of two vector arrays of unsigned/signed values.

2.6.2 Parameters Description

Call(s):
int ARR1D_MUL2_UL(unsigned long *dest, unsigned long *src, long size)
i
35. Library of Macros for Optimization Using eMAC and MAC, Programmer's Manual. Document Number: CFLMOPM, Rev. 1.0, 10/2005. Freescale Semiconductor.
36. O 30812 is 1 TA S222 A413 ETOUF32 0 8090169943 74947 LAO i782 SOs No LEZ ISSA y 191 32 5 2 f 1D rz Ow 375 OPOSMIOSGOIIG29 1 52 TO F32 0 809016994374948 STOEN ES OPSDICHNIOIOIOD 2 2173 TO F32 0 309016994374948 0S2 PES 2 TT EINST ET pP uS DTI Our SQ IO EE OO TSA Ay I DOSES D y LO JUS oss d TO TSZ Gp IDL 9 35 7 D TO ms E mo 92 9 d 252 3599 ihe d Click the Make button You shouldn t have any errors Otherwise review the errors and fix them e Now you can debug or execute your application You can also use the serial terminal to display the results of your function as follows for i20 i lt X SIZE H SIZE 1 i prantrid U Yed sane 344 32 e piles Note that this print function will send output data in FRAC32 format multiplied by 2 In order to get the real value the result must be divided by 2 6 108 Library of Macros for Optimization Rev 1 0 Freescale Semiconductor f For this example the result will be as follows X0 0 2147483 6 YO f32 h f32 y decima frac32 decimal frac32 frac32 decimal 0 X1 6 64E 0 8 0 30901 7 4294967 2 Y1 6636089 0 00309 X2 1 26E 0 9 0 58778 5 6442450 9 Y2 2589477 0 0 01205 8 X3 1 74E 0 9 0 80901 7 8589934 5 YS 6252695 9 0 02911 6 X4 2 04E 0 9 0 95105 7 1 07E 08 Y4 1 2E 08 0 05568 5
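The scaling between FRAC32 integers and real values can be illustrated with a small conversion sketch. It assumes the scale factor is 2^31, which fits a 32-bit signed fractional type and agrees with the table above (0.309017 corresponds to about 6.64E08). The type frac32_t and both function names are hypothetical stand-ins, not library definitions.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t frac32_t;            /* hypothetical stand-in for FRAC32 */

#define FRAC32_SCALE 2147483648.0    /* 2^31, assumed scale factor */

/* Converts a raw FRAC32 integer, as printed on the terminal, to its real value. */
double frac32_to_double(frac32_t v)
{
    return (double)v / FRAC32_SCALE;
}

/* Converts a real value in [-1, 1) to the raw FRAC32 integer (truncating). */
frac32_t double_to_frac32(double x)
{
    assert(x >= -1.0 && x < 1.0);    /* representable range of the format */
    return (frac32_t)(x * FRAC32_SCALE);
}
```

Under this assumption, a printed value of 1073741824 corresponds to 0.5.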
37. RR1D_ADD3_UL and the ARR1D_ADD3_SL Macros

The type of the ARR1D_ADD3_UL parameters dest, src1, src2 is unsigned long. The type of the ARR1D_ADD3_SL parameters dest, src1, src2 is signed long.

2.4 ARR1D_ADDSC_UL, ARR1D_ADDSC_SL

2.4.1 Macros Description

This macro computes the elementwise sum of a vector array of unsigned/signed values with a scalar unsigned/signed value. The elementwise sum is computed by the formula:

$x_i = x_i + scalar, \quad x \in X, \quad i \in [0, size - 1]$

where
X: input vector
$x_i$: element of vector X
scalar: variable with an unsigned/signed value
size: number of elements in the input vector

2.4.2 Parameters Description

Call(s):
int ARR1D_ADDSC_UL(unsigned long *arr, int size, unsigned long scal)
int ARR1D_ADDSC_SL(signed long *arr, int size, signed long scal)

Parameters: Table 2-4. ARR1D_ADDSC Parameters
arr in/out Pointer to the vector
size in Number of elements in vector
scal in Scalar value

Returns: The ARR1D_ADDSC macro generates unsigned/signed output values, which are stored in the array pointed to by the parameter arr.

2.4.3 Description of Optimization

C code:

    for (i = 0; i < size; i++)
        arr[i] = arr[i] + scalar;

Optimization can be done using the following techniques:
1. Loop unrolling by four.
2. Every four values of array arr used in each iteration are st
38. Scalar value

Returns: The ARR1D_MULSC macro generates an unsigned/signed output vector, which is the result of multiplying arr by scal and is pointed to by arr.

2.8.3 Description of Optimization

C code:

    for (i = 0; i < size; i++)
        arr[i] = arr[i] * scalar;

Optimization for the MAC unit can be done using the following techniques:
1. Loop unrolling by four.
2. Using the macl instruction, which allows multiplying simultaneously with loading four values for the next iteration.
3. The first four values are loaded using one movem instruction.

Optimized code (uses MAC unit):

    lea     -60(a7),a7
    movem.l d2-d6/a2-a5,(a7)
    move.l  #0,d0
    move.l  d0,MACSR
    move.l  arr,a0
    move.l  scal,d0
    move.l  size,d1
    move.l  d1,d2
    asr.l   #2,d1
    beq     out1
    move.l  #0,ACC0
    moveq.l #16,d7
loop1:
    movem.l (a0),d3-d6
    mac.l   d0,d3,ACC0
    move.l  ACC0,d3
    move.l  #0,ACC0
    mac.l   d0,d4,ACC0
    move.l  ACC0,d4
    move.l  #0,ACC0
    mac.l   d0,d5,ACC0
    move.l  ACC0,d5
    move.l  #0,ACC0
    mac.l   d0,d6,ACC0
    move.l  ACC0,d6
    move.l  #0,ACC0
    movem.l d3-d6,(a0)
    add.l   d7,a0
    subq.l  #1,d1
    bne     loop1
out1:
    and.l   #3,d2
    beq     out2
loop2:
    move.l  (a0),d3
    muls.l  d0,d3
    move.l  d3,(a0)+
    subq.l  #1,d2
    bne     loop2
out2:
    movem.l (a7),d2-d6/a2-a5
    lea     60(a7),a7

Optimization
39. V C I ACCO d3 MOVCLEL dL ACCE Ad MOOIE ACC2nci5 movclr l ACC3 d6 movem 1 d3 d6 a0 acceda add 1 d0 a0 suba p so bne loopl Library of Macros for Optimization Rev 1 0 3 51 Freescale Semiconductor Sand and l 3 d2 beq out2 SubpylenclO cull loop2 move a2 d3 mulu al da3 move l d3 a0 JL a al iL subq 1 1 d2 bne loop2 out2 movem l a7 d2 d7 a2 a5 lea 60 a7 a7 3 7 4 Differences Between ARR2D MULS3 UL and ARR2D MUL3 SL ARR2D MULS3 UL macro uses unsigned mode of the MAC unit while ARR2D MUL3 SL macro uses signed mode 3 8 ARR2D MULSC SL ARR2D MULSC UL 3 8 4 Macros Description These macros perform multiplication of one 2D array by scalar unsigned signed value 3 8 2 Parameters Description Call s int ARR2D_MULSC_UL long arr long sizel long size2 unsigned long scal int ARR2D_MULSC_SL long arr long sizel long size2 long scal Parameters Table 3 8 ARR2D_MULSC Parameters Arr in Pointer to the destination array 3 52 Library of Macros for Optimization Rev 1 0 Freescale Semiconductor sizel in Number of rows in arrays Size2 in Number of columns in arrays scal in Scalar value Returns The ARR2D MULSC macro generates an unsigned signed output matrix which is the result of arr multiplication by scal and is pointed to by arr 3 8 3 Description of Optimization C code intone a
40. WL for unsigned values. This library of macros only supports arrays of long data elements, so these macros should be used when the programmer needs to use the library with arrays of word data elements. After conversion with these macros, any macro from this library can be used for word data.

3.11.2 Parameters Description

Call(s):
ARR2D_CAST_SWL(void *src, void *dest, int size1, int size2)
ARR2D_CAST_UWL(void *src, void *dest, int size1, int size2)

The original elements are held in the src array, and the converted elements are stored in the dest array. Both arrays run from 0 to size - 1, where size = size1 * size2. Prior to any call of ARR2D_CAST_SWL or ARR2D_CAST_UWL, the user must allocate memory for both the src and dest arrays, either in static or dynamic memory. The type void in the declaration of these macros is used only for compatibility, so the macro must be called with an array of the appropriate type.

Parameters: Table 3-11. ARR2D_CAST_SWL, ARR2D_CAST_UWL Parameters
dest out Pointer to the output array of size long data elements; declared as void, but the array must have the appropriate type depending on the type of the macro
src in Pointer to the input array of size signed or unsigned word data elements; declared as void, but the array must have the appropriate type depending on the type of the macro
size1 in Number of rows
size2 in Number of columns

Returns: The ARR2D_CAST_SWL and ARR2D_CAST_UWL macros g
41. able for the eMAC unit because the eMAC has a fractional mode. Optimization for the MAC unit is performed as an emulation of the fractional mode, using the mac.w instruction with left shift on the upper 16 bits of the operands. Therefore, only the upper 16 bits of the resulting signals are valuable.

The standard C macro MOV_AVG_FLTR computes the 1/M value and uses the IMPL_MOV_AVG_FLTR macro to compute the output samples.

Optimization of the IMPL_MOV_AVG_FLTR macro

The following optimization techniques were used:
1. Postincrement addressing mode to load input and store output array elements.
2. Descending loop organization.

Particular techniques for optimization are reviewed below.

C code:

    for (i = 0; i < SIZE - M + 1; i++)
        for (j = 0; j < M; j++)
            dst[i] += src[i + j] / M;

Optimization for the MAC unit

The following should be noticed:
- The 1/M value is stored in register a3.
- To calculate the y[i+1] value, the y[i] value is used. The first item of the y[i] value is subtracted from the accumulator, and the last item of y[i+1] is added to the accumulator. Then the accumulator value is stored as y[i+1].
- The a1 and a0 registers hold the pointers to the src and dst arrays.

All add/multiply instructions are performed by the MAC unit.

Optimized code (uses MAC unit):

    mac.w   d4.u,a3.u,ACC0      ; adds the last item of y[i+1] to accumulat
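The running update described above, where each new output reuses the previous one by dropping the oldest term and adding the newest, can be sketched in C as follows. This is a floating-point illustration with hypothetical names (mov_avg_ref, inv_m); it does not model the FRAC32 format or the 16-bit MAC emulation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference for an M-point moving average using a running
 * accumulator. Produces size - m + 1 output samples in dst. */
void mov_avg_ref(double *dst, const double *src, size_t size, size_t m)
{
    assert(m > 0 && m <= size);

    const double inv_m = 1.0 / (double)m;   /* the precomputed 1/M value */
    double acc = 0.0;

    /* First output: full sum of the first m scaled inputs. */
    for (size_t j = 0; j < m; j++) {
        acc += src[j] * inv_m;
    }
    dst[0] = acc;

    /* Each next output: subtract the first item of the previous window,
     * add the last item of the new window. */
    for (size_t i = 0; i + m < size; i++) {
        acc -= src[i] * inv_m;
        acc += src[i + m] * inv_m;
        dst[i + 1] = acc;
    }
}
```

The update costs two multiply-accumulate steps per output regardless of M, whereas recomputing each window from scratch costs M.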
42. acros

Chapter 2, Macros for 1D Array Operations, describes the macros used for 1D array operations.
Chapter 3, Macros for 2D Array Operations, describes the macros used for 2D array operations.
Chapter 4, Macros for DSP Algorithms, includes the description of several macros used for DSP algorithms.
Chapter 5, Macros for Mathematical Functions, includes the description of several macros used for common mathematical operations.
Chapter 6, QuickStart for CodeWarrior, includes a step-by-step description of how to create a new project in CodeWarrior using the Library of Macros.

Conventions

This document uses the following notational conventions:
- CODE: Courier in a box indicates code examples.
- Prototypes: Courier is used for code in function prototypes.
- Formulas: Italics is used for formulas.
- All source code examples are in C and Assembly.

Definitions, Acronyms, and Abbreviations

The following list defines the abbreviations used in this document:
FRAC32: Data type that represents a 32-bit signed fractional value.
FIXED64: Data type that represents a 64-bit signed value with 32 bits in the integer part and 32 bits in the fractional part.

References

The following documents were referenced to write this document:
1. ColdFire Family Programmer's Reference, Rev. 3
2. MCF5249 ColdFire User's Manual, Rev. 0
3. MCF5282 ColdFire User's Manual, Rev. 2
43. ada a2 reS add l a2 d6 movem 1 d3 d6 a0 Library of Macros for Optimization Rev 1 0 3 39 Freescale Semiconductor 1 16 a0 1 16 al wil a Iri i italy d3 a2 d3 L d3 a0 Eid 3 3 4 Differences Between the ARR2D ADDS UL and the ARR2D ADD3 SL Macros There are no differences The macro was written in two versions in order to preserve library uniformity 3 4 ARR2D ADDSC UL ARR2D ADDSC SL 3 4 1 Macros Description These macros compute the elementwise sum of 2d array of unsigned signed values with a scalar unsigned signed value The elementwise sum is computed by the formula X 7 X scalar Xj Xi 0 sizel 1 j size2 1 where X input array x element of the array X scalar variable with unsigned signed value size number of rows size2 number of columns Note The type of elements of array in the ARR2D ADDSC UL macro must be unsigned long and the type of elements of array in the ARR2D ADDSC SL macro must be signed long 3 40 Library of Macros for Optimization Rev 1 0 Freescale Semiconductor 3 4 2 Parameters Description Call s int ARR2D_ADDSC_UL void arr int sizel int size2 unsigned long scal int ARR2D_ADDSC_SL void arr int sizel int size2 signed long scal Parameters Table 3 4 ARR2D_ADDSC Parameters arr in out Pointer to the array sizel in Number of rows of matrix size2 in Number of
44.     add.l   (a1)+,d5
    add.l   (a1)+,d6
    movem.l d3-d6,(a0)
    add.l   #16,a0
    subq.l  #1,d1
    bne     l2

    move.l  (a0),d3
    add.l   (a1)+,d3
    move.l  d3,(a0)+
    subq.l  #1,d2
    bne     l3

3.2.4 Differences Between the ARR2D_ADD2_UL and the ARR2D_ADD2_SL Macros

There are no differences. The macro was written in two versions to preserve library uniformity.

3.3 ARR2D_ADD3_UL, ARR2D_ADD3_SL

3.3.1 Macros Description

These macros compute the elementwise sum of two 2D arrays of unsigned/signed values and store the results in a third 2D array of unsigned/signed values. The elementwise sum is computed by the formula:

$z_{ij} = x_{ij} + y_{ij}, \quad x \in X, \; y \in Y, \; z \in Z, \quad i \in [0, size1 - 1], \; j \in [0, size2 - 1]$

where
X, Y: input arrays
$x_{ij}$, $y_{ij}$: elements of the corresponding arrays
Z: resultant array
$z_{ij}$: element of array Z
size1: number of rows
size2: number of columns

Note: The type of the array elements in the ARR2D_ADD3_UL macro must be unsigned long, and the type of the array elements in the ARR2D_ADD3_SL macro must be signed long.

3.3.2 Parameters Description

Call(s):
int ARR2D_ADD3_UL(void *dest, void *src1, void *src2, int size1, int size2)
int ARR2D_ADD3_SL(void *dest, void *src1, void *src2, int size1, int size2)

Parameters: Table 3-3. ARR2D_ADD3 Parameters
dest in
45. and ARR1D_MIN_S

For signed and unsigned values, the appropriate comparison instructions were used.

2.11 ARR1D_CAST_SWL, ARR1D_CAST_UWL

2.11.1 Macros Description

These macros convert an array of word data elements to an array of long data elements. ARR1D_CAST_SWL is used for signed values and ARR1D_CAST_UWL for unsigned values. The Library of Macros only supports arrays of long data elements, so these macros must be used when a programmer wants to use the library with arrays of word data elements. After conversion with these macros, any macro from this library can be used for word data.

2.11.2 Parameters Description

Call(s):
ARR1D_CAST_SWL(signed short *src, signed long *dest, int size)
ARR1D_CAST_UWL(unsigned short *src, unsigned long *dest, int size)

The original elements are held in array src, and the converted elements are stored in array dest. Both arrays run from 0 to size - 1. Prior to any call of ARR1D_CAST_SWL or ARR1D_CAST_UWL, the user must allocate memory for both the src and dest arrays, either in static or dynamic memory.

Parameters: Table 2-11. ARR1D_CAST_SWL, ARR1D_CAST_UWL Parameters
dest out Pointer to the output array of size signed or unsigned long data elements, depending on the type of the macro
src in Pointer to the input array of size signed or unsigned short data elements, depending on the type of a
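In C terms, the two casts amount to sign extension for the signed variant and zero extension for the unsigned one. The sketch below uses hypothetical names (arr1d_cast_swl_ref, arr1d_cast_uwl_ref) and fixed-width types from stdint.h, assuming a 16-bit word and a 32-bit long as on ColdFire; it is an illustration, not the macro source.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference for the signed cast: each 16-bit word is
 * sign-extended to 32 bits. */
void arr1d_cast_swl_ref(const int16_t *src, int32_t *dest, size_t size)
{
    for (size_t i = 0; i < size; i++) {
        dest[i] = (int32_t)src[i];
    }
}

/* Hypothetical reference for the unsigned cast: each 16-bit word is
 * zero-extended to 32 bits. */
void arr1d_cast_uwl_ref(const uint16_t *src, uint32_t *dest, size_t size)
{
    for (size_t i = 0; i < size; i++) {
        dest[i] = (uint32_t)src[i];
    }
}
```

The same bit pattern 0xFFFF therefore becomes -1 through the signed variant and 65535 through the unsigned one, which is why the array type and the macro variant must correspond.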
46. c 3.3.3 GNU compiler

1.2 Structure of the Project and Installation

The Library of Macros has the following structure:

Figure 1-1. Structure of Macro Library (emac_macro.h, mac_macro.h; eMAC headers, MAC headers, common headers)

There are two main parts of the library:
- The library for the eMAC unit
- The library for the MAC unit

Each part has its own header file, emac_macro.h and mac_macro.h, respectively. Each part also includes some common macros and can be logically divided into four sections:
- 1D array operations
- 2D array operations
- DSP algorithms
- Mathematical functions

To use the library of macros within your project, first include the appropriate C header file. Include the file mac_macro.h if you use the MAC unit, or the file emac_macro.h if you use the eMAC unit in your program. To avoid macro name conflicts, do not include both headers in the same program. Moreover, there is no need to include them both, because macros for the same functions are duplicated in these headers.

Chapter 2 Macros for 1D Array Operations

2.1 ARR1D_SUM_UL, ARR1D_SUM_SL

2.1.1 Macros Description

These macros compute the sum of the array elements of unsigned/signed values. This sum is computed by the following formula:

$res = \sum_{i=0}^{size-1} x_i$
47. cending loop organization

Particular techniques for optimization are reviewed below.

C code:

    for (i = 0; i < SIZE; i++)
        y[i] = a0 * x[i] + b1 * y[i - 1];

Optimization for the MAC unit

The following should be noticed:
- The loop is unrolled by four.
- Coefficients a0 and b1 are pre-computed and held in registers a3 and d6, correspondingly.
- Register d0 always holds the last computed output signal.
- Input operands are fetched from memory in fours and stored in registers d3, d4, d5, and a2.

The MAC unit has only one accumulator, and all output elements must be computed sequentially, so the mac instruction pipelining is worse than in the eMAC case. Another aspect is that the MAC unit has no movclr instruction, so the accumulator must be cleared explicitly.

Optimized code (uses MAC unit):

    mac.w   a3.u,d3.u,ACC0      ; computes a[0]*x[i] for y[i] output element
    mac.w   d6.u,d0.u,ACC0      ; computes b[1]*y[i-1] to produce y[i]
    move.l  ACC0,d0             ; moves y[i] to d0
    move.l  #0,ACC0             ; clear accumulator
    move.l  d0,(a0)+            ; and stores y[i] to memory
    mac.w   a3.u,d4.u,ACC0
    mac.w   d6.u,d0.u,ACC0
    move.l  ACC0,d0
    move.l  #0,ACC0
    move.l  d0,(a0)+
    mac.w   a3.u,d5.u,ACC0
    mac.w   d6.u,d0.u,ACC0
    move.l  ACC0,d0
    move.l  #0,ACC0
    move.l  d0,(a0)+
    mac.w   a3.u,a2.u,ACC0
    mac.w   d6.u,d0.u,ACC0
    move.l  ACC0,d0
    move.l  #0,ACC0
    move.l  d0,(a0)+
48. cl instruction, which allows multiplying simultaneously with loading four values for the next iteration. First four values are loaded using one movem instruction.

Optimized code (uses eMAC unit):

    lea     -60(a7),a7
    movem.l d2-d7/a2-a5,(a7)
    moveq.l #16,d0
    move.l  dest,a0
    move.l  src,a1
    move.l  size1,d1
    move.l  size2,d2
    mulu.l  d2,d1
    move.l  d1,d2
    asr.l   #2,d1
    beq     out1
    move.l  #0,ACC0
    move.l  #0,ACC1
    move.l  #0,ACC2
    move.l  #0,ACC3
    movem.l (a1),d7/a3-a5
    add.l   d0,a1
loop1:
    movem.l (a0),d3-d6
    macl.l  d7,d3,(a1)+,d7,ACC0
    macl.l  a3,d4,(a1)+,a3,ACC1
    macl.l  a4,d5,(a1)+,a4,ACC2
    macl.l  a5,d6,(a1)+,a5,ACC3
    movclr.l ACC0,d3
    movclr.l ACC1,d4
    movclr.l ACC2,d5
    movclr.l ACC3,d6
    movem.l d3-d6,(a0)
    add.l   d0,a0
    subq.l  #1,d1
    bne     loop1
out1:
    and.l   #3,d2
    beq     out2
    sub.l   d0,a1
loop2:
    move.l  (a0),d3
    mulu.l  (a1)+,d3
    move.l  d3,(a0)+
    subq.l  #1,d2
    bne     loop2
out2:
    movem.l (a7),d2-d7/a2-a5
    lea     60(a7),a7

3.6.4 Differences Between ARR2D_MUL2_UL and ARR2D_MUL2_SL

The ARR2D_MUL2_UL macro uses the unsigned mode of the MAC unit, while the ARR2D_MUL2_SL macro uses signed mode.

3.7 ARR2D_MUL3_SL, ARR2D_MUL3_UL

3.7.1 Macros Description

These macros perform multiplication of two 2D arrays of unsigned/signed values.

3.7.2 Parameters Description

Call(s):
int ARR2D_MUL3_UL(unsigned long *dest, unsigned long *src1, unsigned long *src2
49. d Installation

Chapter 2 Macros for 1D Array Operations
2.1 ARR1D_SUM_UL, ARR1D_SUM_SL
2.2 ARR1D_ADD2_UL, ARR1D_ADD2_SL
2.3 ARR1D_ADD3_UL, ARR1D_ADD3_SL
2.4 ARR1D_ADDSC_UL, ARR1D_ADDSC_SL
2.5 ARR1D_PROD_UL, ARR1D_PROD_SL
2.6 ARR1D_MUL2_SL, ARR1D_MUL2_UL
2.7 ARR1D_MUL3_SL, ARR1D_MUL3_UL
2.8 ARR1D_MULSC_SL, ARR1D_MULSC_UL
2.9 ARR1D_MAX_S, ARR1D_MAX_U
2.10 ARR1D_MIN_S, ARR1D_MIN_U
2.11 ARR1D_CAST_SWL, ARR1D_CAST_UWL

Chapter 3 Macros for 2D Array Operations
3.1 ARR2D_SUM_UL, ARR2D_SUM_SL
3.2 ARR2D_ADD2_UL, ARR2D_ADD2_SL
3.3 ARR2D_ADD3_UL, ARR2D_ADD3_SL
3.4 ARR2D_ADDSC_UL, ARR2D_ADDSC_SL
3.5 ARR2D_PROD_UL, ARR2D_PROD_SL
3.6 ARR2D_MUL2_SL, ARR2D_MUL2_UL
3.7 ARR2D_MUL3_SL, ARR2D_MUL3_UL
3.8 ARR2D_MULSC_SL, ARR2D_MULSC_UL
3.9 ARR2D_MAX_S, ARR2D_MAX_U
3.10 ARR2D_MIN_S, ARR2D_MIN_U
50. d for signed values and ARR2D_CAST_UWL is used for unsigned values. For ARR2D_CAST_SWL the ext.l instruction is used, and for ARR2D_CAST_UWL the andi.l instruction.

Chapter 4 Macros for DSP Algorithms

4.1 DOT_PROD_UL, DOT_PROD_SL

4.1.1 Macros Description

These macros compute the dot product of two vector arrays with unsigned/signed values. The dot product is computed by the following formula:

$X \cdot Y = \sum_{i=1}^{n} x_i y_i$

where
X, Y: input vectors
$x_i$, $y_i$: elements of the corresponding vectors
n: size of the vectors

4.1.2 Parameters Description

Call(s):
unsigned long DOT_PROD_UL(unsigned long *arr1, unsigned long *arr2, int size)
signed long DOT_PROD_SL(signed long *arr1, signed long *arr2, int size)

Parameters: Table 4-1. DOT_PROD Parameters
arr1 in Pointer to the first vector
arr2 in Pointer to the second vector
size in Number of elements in vectors

Returns: The DOT_PROD macro generates an unsigned/signed output value, which is returned by the macro.

4.1.3 Description of Optimization

C code:

    for (i = 0; i < SIZE; i++)
        res += arr1[i] * arr2[i];

Optimization for the MAC unit can be done using the following techniques:
1. Loop unrolling by four.
2. Using the macl instruction, which allows multiplying simultaneously with loading four values for the next iteration.
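A portable C model of the dot product with the loop unrolled by four is sketched below. The name dot_prod_ref is hypothetical, and uint32_t stands in for a 32-bit unsigned long; the sketch shows only the loop structure and does not model the MAC accumulator or the operand preloading performed by macl.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference: dot product of two vectors, main loop unrolled
 * by four with a tail loop for the remaining elements. Unsigned arithmetic
 * wraps modulo 2^32. */
uint32_t dot_prod_ref(const uint32_t *arr1, const uint32_t *arr2, size_t size)
{
    uint32_t res = 0;
    size_t blocks = size >> 2;   /* iterations of the unrolled loop */
    size_t tail = size & 3;      /* elements left for the tail loop */

    while (blocks-- > 0) {
        res += arr1[0] * arr2[0];
        res += arr1[1] * arr2[1];
        res += arr1[2] * arr2[2];
        res += arr1[3] * arr2[3];
        arr1 += 4;
        arr2 += 4;
    }
    while (tail-- > 0) {
        res += *arr1++ * *arr2++;
    }
    return res;
}
```

For the vectors {1, 2, 3, 4, 5} and {6, 7, 8, 9, 10}, the main loop handles the first four products and the tail loop adds the fifth.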
51. dw parameters control the computation of the a0, a1, a2, b1, and b2 filter coefficients. Prior to any call of BANDPASS_FLTR, the user must allocate memory for both the src and dst arrays, in either static or dynamic memory.

Parameters: Table 4-10. BANDPASS_FLTR Parameters
dst out Pointer to the output array of size FRAC32 data elements
src in Pointer to the input array of size FRAC32 data elements
size in Number of elements in input and output arrays
freq in FRAC32 value in the range of 0 to 0.5 that controls filter coefficients computation
bandw in FRAC32 value in the range of 0 to 0.5 that controls filter coefficients computation

Returns: The BANDPASS_FLTR macro generates output values, which are stored in the array pointed to by dst.

4.10.3 Description of Optimization

This macro frequently performs multiplication and addition operations on fractional values. It is suitable for the eMAC unit because the eMAC has a fractional mode. The optimization for the MAC unit is performed as an emulation of the fractional mode, using the mac.w instruction with left shift on the upper 16 bits of the operands. Therefore, only the upper 16 bits of the resulting signals are valuable.

The coefficients are pre-computed using standard C subroutines in the BANDPASS_FLTR macro. Then this macro uses the IMPL_BAND_FLTR macro to compute outp
52. e - 1, where size = size1 × size2. Prior to any call of the ARR2D_MAX_S and ARR2D_MAX_U macros, the user must allocate memory for the src array in either static or dynamic memory. The types of the array and the invoking macro must correspond. In the declaration, the src array is declared as void * for compatibility.

Parameters:

Table 3-9. ARR2D_MAX_S, ARR2D_MAX_U Parameters
    src     in   Pointer to the input array
    size1   in   Number of rows
    size2   in   Number of columns

Returns: The ARR2D_MAX_S and ARR2D_MAX_U macros return the maximum element's index as their result, which is why they can be used in an assignment operation. The index is linear and must be converted to two indices to access a C array. The conversion can be done in the following way:

$$\mathit{index1} = \lfloor \mathit{index} / \mathit{size2} \rfloor, \qquad \mathit{index2} = \mathit{index} - \mathit{index1} \times \mathit{size2}$$

where index1 is the first C index (row), index2 is the second C index (column), index is the linear index, size1 is the number of rows, and size2 is the number of columns.

3.9.3 Description of Optimization

These macros do not use any multiplication operations. Therefore, MAC and eMAC instructions are not suitable for optimizing them, and instructions from the Integer Instruction Set were used instead. For signed and unsigned values, appropriate comparison instructions were used. All optimization issues are the same for both macros. The following optimization techniques
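The linear-to-2D index conversion above can be tried out with a short C sketch. The names index2d_t, linear_to_2d and arr2d_max_index are hypothetical helpers introduced for illustration; arr2d_max_index only mimics what the text says ARR2D_MAX_S returns (the linear index of the maximum element) and is not the library implementation.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int row; int col; } index2d_t;   /* hypothetical helper type */

/* Convert a linear index into (row, column) for an array with size2 columns. */
index2d_t linear_to_2d(int index, int size2)
{
    assert(index >= 0 && size2 > 0);
    index2d_t r;
    r.row = index / size2;              /* index1 = index / size2          */
    r.col = index - r.row * size2;      /* index2 = index - index1 * size2 */
    return r;
}

/* Linear index of the maximum element, treating the 2D array as 1D. */
int arr2d_max_index(const long *src, int size1, int size2)
{
    int size = size1 * size2;
    assert(src != NULL && size > 0);
    int best = 0;
    for (int i = 1; i < size; i++) {
        if (src[i] > src[best]) {
            best = i;
        }
    }
    return best;
}
```

For a 2x3 array whose maximum sits at linear index 3, the conversion yields row 1, column 0.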
53. e angle parameter must be in [0, π/4].
• SIN_F and COS_F macros have a common header file, sin_cos.h.

5.3.2 Parameters Description

Call(s):

    FRAC32 SIN_F(FRAC32 ang)

Parameters:

Table 5-3. SIN_F Parameters
    ang   in   An angle value

Returns: The value of the sine function of the angle.

5.3.3 Description of Optimization

C code:

    res_c = sin(tstvalc);

Optimization for the MAC unit can be done using the following techniques:
1. Sequential mac instructions that allow efficient use of the MAC pipeline.
2. Quick multiplication and subtraction due to the msac instruction.
3. Quick multiplication due to the MAC unit.

Optimized code (uses MAC unit): 0 ACCO dO u do ACCO dl 0 ACCO Ol ty dos ACCO d2 0 ACCO Cel Wy Ses AGEU as 0 ACCO Ol clips dos ACCO d4 0 ACCO Clik Shs GYE ACEO db 0 ACCO Hilo Clee ACCO d6 0xa100 move ld0 ACCO 41357913941 a0 117895697 al 426088 a2 5917 a3 53 a4 Ck si a0 dss al Vl Sle a2 CIR UE deru ade do

Optimization for the eMAC unit includes the same optimization techniques as the MAC unit, as well as the following:
1. Using the fractional mode of the eMAC unit, which allows using 32x32 multiplication without loss of precision.
2. Using the movclr instruction to store a va
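The immediates that remain legible in the listing above appear to be 357913941, 17895697, 426088, 5917 and 53. These coincide with floor(2^31 / n!) for n = 3, 5, 7, 9, 11, which would be the Q31 encodings of the Taylor coefficients of the sine series. The sketch below checks that observation and evaluates the series in double precision. It assumes SIN_F uses a Taylor series, as the COS_F description in this document does; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Q31 encoding of 1/n!, i.e. floor(2^31 / n!). */
int32_t q31_inv_factorial(int n)
{
    assert(n >= 2 && n <= 12);
    int64_t fact = 1;
    for (int k = 2; k <= n; k++) {
        fact *= k;
    }
    return (int32_t)((INT64_C(1) << 31) / fact);
}

/* Reference evaluation: x - x^3/3! + x^5/5! - ... - x^11/11!. */
double sin_taylor(double x)
{
    double x2 = x * x;
    double term = x;
    double sum = x;
    for (int n = 3; n <= 11; n += 2) {
        term *= -x2 / (double)((n - 1) * n);   /* next odd-power term */
        sum += term;
    }
    return sum;
}
```

On [0, π/4] the first omitted term, x^13/13!, is below 1e-11, so six terms are more than enough for 32-bit fractional precision.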
54. e following optimization techniques were used:
1. Multiple load/store operations to access array elements.
2. Loop unrolling by four.
3. Descending loop organization.

Particular techniques of optimization are reviewed below.

C code: none a MO a lt SIZE sith Om a Op G eS SAEZ EE mse Bere cebat spi saat mine arr ebat IESU PR il i i2 j

Optimized code (this code is similar to the 1D array macro, but in the preloop operations the linear size must be calculated and stored): 12 taken from ARR2D_MIN_U macro movem l a0 dli d4 multiple load operations to access cmp l eH celo Source array elements ei oll lis making comparisons beetwen four 1 d6 elements because of loop unrolling d6 a3 1 d6 index is accumulated in d6 ola lls el ol 1 d6 d6 a3 addq 1 1 d6 adt d3 d5 das Td 1 d6 J6 a3 addq l 1 d6 Sill 16 a0 subg 1 1 d0 decsending loop organization 12

3.10.4 Differences Between ARR2D_MIN_U and ARR2D_MIN_S

For signed and unsigned values, appropriate comparison instructions were used.

3.11 ARR2D_CAST_SWL, ARR2D_CAST_UWL

3.11.1 Macros Description

These macros convert arrays of word data elements to arrays of long data elements. ARR2D_CAST_SWL is used for signed values and ARR2D_CAST_U
55. eloped in CodeWarrior for ColdFire V6.0 using the MCF5282 microprocessor, and the same steps may be applied to other derivatives and versions.

6.1 Creating a New Project

a. Open CodeWarrior, usually found in Start > Programs > Metrowerks CodeWarrior > CW for ColdFire 6.0 > CodeWarrior IDE. The CodeWarrior main window should appear.
b. From the main menu bar, select File > New. The New dialog box should appear.

Figure 6-1. New Dialog Box

c. Select ColdFire Stationery as the type of project.
d. Enter a project name in the Project name textbox, e.g. eMAC_CONV_test.
e. Select an appropriate location for your project using the Location textbox.
f. Click OK. The New Project dialog box should appear.
g. Select the appropriate stationery, e.g. expand CF_M5282EVB and select C.

[Screenshot: New Project dialog box, "Select project stationery" list]
56. ements in the input array

Returns: The ARR1D_MIN_S and ARR1D_MIN_U macros return the minimum element's index as their result, which is why they can be used in an assignment operation.

2.10.3 Description of Optimization

These macros do not use any multiplication operations. Therefore, MAC and eMAC instructions are not suitable for optimizing them, and instructions from the Integer Instruction Set were used instead. For signed and unsigned values, appropriate comparison instructions were used. All optimization issues are the same for both macros. The following optimization techniques were used:
1. Multiple load/store operations to access array elements.
2. Loop unrolling by four.
3. Descending loop organization.

Particular techniques of optimization are reviewed below.

C code: 0 i lt SIZE i iE arr chi l min min arr ckil index i

Optimized code (taken from the ARR1D_MIN_U macro): movem l a0 dl d4 multiple load operations to access cmp 1 eli els Source array elements call eil re los making comparisons beetwen four 1 d6 elements because of loop unrolling Gras index is accumulated in d6 addq 1 d6 pall 16 a0 subq 1 1 d0 descending loop organization 12

2.10.4 Differences Between ARR1D_MIN_U
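The three techniques listed above (grouped loads, unrolling by four, and a loop counter that counts down) can be sketched in C for the signed case. The function name arr1d_min_index_s is hypothetical, and the handling of a tail of up to three elements is an assumption, since this fragment does not state how the macro treats sizes not divisible by four.

```c
#include <assert.h>
#include <stddef.h>

/* Index of the minimum element; first occurrence wins on ties. */
int arr1d_min_index_s(const long *src, int size)
{
    assert(src != NULL && size > 0);

    long min = src[0];
    int index = 0;
    int i = 0;
    int blocks = size / 4;          /* descending loop counter */

    while (blocks > 0) {
        /* Four comparisons per iteration (loop unrolling by four). */
        if (src[i]     < min) { min = src[i];     index = i;     }
        if (src[i + 1] < min) { min = src[i + 1]; index = i + 1; }
        if (src[i + 2] < min) { min = src[i + 2]; index = i + 2; }
        if (src[i + 3] < min) { min = src[i + 3]; index = i + 3; }
        i += 4;
        blocks--;                   /* counts down to zero, like subq/bne */
    }

    /* Assumed tail handling for the remaining 0..3 elements. */
    for (; i < size; i++) {
        if (src[i] < min) { min = src[i]; index = i; }
    }
    return index;
}
```

Counting down lets the loop end on the zero flag set by the decrement itself, so no separate compare instruction is needed. An unsigned variant would differ only in the element type and therefore in the comparison the compiler emits.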
57. enerate output values, which are stored in the array pointed to by dest.

3.11.3 Description of Optimization

These macros do not use any multiplication operations. Therefore, MAC and eMAC instructions are not suitable for optimizing them, and instructions from the Integer Instruction Set were used instead. The following optimization techniques were used:
1. Multiple load/store operations to access array elements.
2. Loop unrolling by four.
3. Descending loop organization.

Particular techniques of optimization are reviewed below.

C code: ee ie ae ip ye GL 0p fori ccr a Selb euer re Fab pg

Optimized code (this code is similar to the 1D array macro, but in the preloop operations the linear size must be calculated and stored): ara di long arrl i j Ze taken from ARR1D_CAST_SWL movem l a0 d2 d4 multiple load operations to access move l d2 d3 Source array elements move l d4 d5 swap w d2 convertion performed by four elements because of loop unrolling Swap ext ext 4 2 3 in ARRID CAST UWL andi 1 SOxffff d2 4 ext instruction was used eb nut s nh ext 5 Za Coin als 8 a0 16 al movem 1 multiple stor operation addq 1 add 1 d0 subq 1 decsending loop organization Z

3.11.4 Differences Between the ARR1D_SUM_UL and the ARR1D_SUM_SL Macros

ARR2D_CAST_SWL is use
58. equation form, this filter can be represented as follows:

$$y[i] = \frac{1}{M} \sum_{j=0}^{M-1} x[i+j]$$

M is the number of points used in the moving average. More details on this digital filter's characteristics may be found in The Scientist and Engineer's Guide to Digital Signal Processing, Steven W. Smith, Ph.D., California Technical Publishing, http://www.dspguide.com.

4.12.2 Parameters Description

Call(s):

    MOV_AVG_FLTR(FRAC32 *dst, FRAC32 *src, long size, long M)

The input signals to the filter are held in array src, and the output values are stored in array dst. Both arrays run from 0 to size - 1. M is the number of points used in the moving average. Prior to any call of MOV_AVG_FLTR, the user must allocate memory for both the src and dst arrays in either static or dynamic memory.

Parameters:

Table 4-12. MOV_AVG_FLTR Parameters
    dst    out  Pointer to the output array of size FRAC32 data elements
    src    in   Pointer to the input array of size FRAC32 data elements
    size   in   Number of elements in input and output arrays
    M      in   Number of points used in the moving average

Returns: The MOV_AVG_FLTR macro generates output values, which are stored in the array pointed to by dst.

4.12.3 Description of Optimization

This macro frequently performs multiplication and addition operations on fractional values. It is suit
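The moving average can be sketched in C as follows. This illustration makes several assumptions that are not taken from the macro itself: it works on double instead of FRAC32, it assumes the forward-looking window x[i] .. x[i+M-1], it only produces the size - M + 1 outputs for which the whole window lies inside the input, and it uses a running-sum update (add the sample entering the window, subtract the one leaving it) instead of re-summing M points each time. The name mov_avg is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Writes y[i] = (x[i] + ... + x[i+M-1]) / M for i = 0 .. size-M.
   Returns the number of output samples written. */
int mov_avg(double *dst, const double *src, int size, int M)
{
    assert(dst != NULL && src != NULL && M > 0 && size >= M);

    /* First window: plain sum of M points. */
    double acc = 0.0;
    for (int j = 0; j < M; j++) {
        acc += src[j];
    }
    dst[0] = acc / M;

    /* Later windows: slide by one sample using the previous sum. */
    for (int i = 1; i + M <= size; i++) {
        acc += src[i + M - 1];   /* sample entering the window */
        acc -= src[i - 1];       /* sample leaving the window  */
        dst[i] = acc / M;
    }
    return size - M + 1;
}
```

With the running-sum update, each output costs two additions regardless of M.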
59. er to the destination vector
    src1   in   Pointer to the source vector
    src2   in   Pointer to the source vector 2
    size   in   Number of elements in vector

Returns: The ARR1D_ADD3 macro generates unsigned/signed output values, which are stored in the array pointed to by the parameter dest.

2.3.3 Description of Optimization

C code: intone ak We al ws Ce aldbds Guese xe ra t arrl i

Optimization can be done using the following techniques:
1. Loop unrolling by four.
2. Every four values of array dest used in each iteration are loaded with only one movem instruction.
3. Every four values of array src used in each iteration are loaded using postincrement addressing mode while performing additions.
4. After performing additions, the resulting four values in each iteration are stored with only one movem instruction.
5. If the number of elements is not divisible by 4, the tail elements are processed in regular order.

Optimized code: move l size dl move l d1 d2 asr 1 beq 11 2 d1 movem 1 a0 d3 d6 Bags aaa ECE add 1 alee ds al d4 al jire a1 d6 movem 1 d3 d6 a0 adami subq bne 2 10 m 12 16 a0 1 d1 and l 3 d2 beq 14 sta 0y aS al d3 L d3 a0 Lidl

2.3.4 Differences Between the A
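The unroll-by-four structure with a separate tail loop, as listed in the techniques above, looks roughly like this in C. The sketch assumes that ARR1D_ADD3 computes dest[i] = src1[i] + src2[i], which is inferred from the parameter names only; arr1d_add3 is a hypothetical stand-in for the macro.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed semantics: dest[i] = src1[i] + src2[i] for i = 0 .. size-1. */
void arr1d_add3(long *dest, const long *src1, const long *src2, int size)
{
    assert(dest != NULL && src1 != NULL && src2 != NULL && size >= 0);

    int i = 0;
    int blocks = size >> 2;         /* number of full groups of four   */
    int tail = size & 3;            /* leftover elements, 0..3         */

    while (blocks > 0) {
        /* One group of four: in the assembly version the operands are
           loaded and stored with movem, four longwords at a time. */
        dest[i]     = src1[i]     + src2[i];
        dest[i + 1] = src1[i + 1] + src2[i + 1];
        dest[i + 2] = src1[i + 2] + src2[i + 2];
        dest[i + 3] = src1[i + 3] + src2[i + 3];
        i += 4;
        blocks--;
    }

    while (tail > 0) {              /* tail processed in regular order */
        dest[i] = src1[i] + src2[i];
        i++;
        tail--;
    }
}
```

The `size >> 2` and `size & 3` split corresponds to the asr and and.l #3 steps that are still legible in the listing.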
60. ers
    dst     out  Pointer to the output array of size FRAC32 data elements
    src     in   Pointer to the input array of size FRAC32 data elements
    size    in   Number of elements in input and output arrays
    freq    in   FRAC32 value in the range 0 to 0.5 that controls filter coefficients computation
    bandw   in   FRAC32 value in the range 0 to 0.5 that controls filter coefficients computation

Returns: The BANDREJECT_FLTR macro generates output values, which are stored in the array pointed to by dst.

4.11.3 Description of Optimization

This macro frequently performs multiplication and addition operations on fractional values. It is suitable for the eMAC unit because the eMAC has a fractional mode. The optimization for the MAC unit is performed as an emulation of the fractional mode, using the mac.w instruction with a left shift on the upper 16 bits of the operands. So only the upper 16 bits of the resulting signals are significant. The coefficients are pre-computed using standard C subroutines in the BANDREJECT_FLTR macro. Then this macro uses the IMPL_BAND_FLTR macro to compute output samples.

4.12 MOV_AVG_FLTR

4.12.1 Macros Description

This macro computes the moving average filter. As the name implies, the moving average filter operates by averaging a number of points from the input signal to produce each point in the output signal. In the
61. ers

Figure 6-3. Settings Window in ColdFire Assembler Selection

d. Change to the Debugger > Remote Debugging section.
e. A message will appear informing you that the project must be rebuilt. Click OK.
f. Select an appropriate Connection for your EVB, e.g. PEMICRO_USB if you are using the P&E USB wiggler.
g. Click OK. Your project is now configured to use and debug the Library of Macros.

6.3 Adding the Library of Macros

h. Using Windows Explorer, copy the unzipped folder library_macros into your project, e.g. the final path for your libraries can be eMAC_CONV_test\Source\library_macros.
i. Drag and drop the copied library_macros folder from Windows Explorer to your CodeWarrior project window, inside the Source folder. This will add all files and folders from the library of macros to your current project. You can also add each file and folder by right-clicking in the project window and selecting Add Files and Create Group.

[Screenshot: CodeWarrior project window for eMAC_CONV_test.mcp, listing emac_macro.h, mac_macro.h and the common, emac and mac folders]
62. escale Semiconductor products. There are no express or implied copyright licenses granted hereunder to design or fabricate any integrated circuits or integrated circuits based on the information in this document. Freescale Semiconductor reserves the right to make changes without further notice to any products herein. Freescale Semiconductor makes no warranty, representation or guarantee regarding the suitability of its products for any particular purpose, nor does Freescale Semiconductor assume any liability arising out of the application or use of any product or circuit, and specifically disclaims any and all liability, including without limitation consequential or incidental damages. "Typical" parameters that may be provided in Freescale Semiconductor data sheets and/or specifications can and do vary in different applications, and actual performance may vary over time. All operating parameters, including "Typicals", must be validated for each customer application by customer's technical experts. Freescale Semiconductor does not convey any license under its patent rights nor the rights of others. Freescale Semiconductor products are not designed, intended, or authorized for use as components in systems intended for surgical implant into the body, or other applications intended to support or sustain life, or for any other application in which the failure of the Freescale Semiconductor product could create a situation where personal injury or death may occur
63. h, Ph.D., California Technical Publishing, http://www.dspguide.com.

Notes:
• Array elements must be of the FRAC32 type.
• The size of the output array must equal the sum of the sizes of the input array and the array of coefficients.

4.4.2 Parameters Description

Call(s):

    void CONV(void *y, void *x, void *h, int xsize, int hsize)

Parameters:

Table 4-4. CONV Parameters
    y       out  Pointer to the output vector containing computed values
    x       in   Pointer to the input vector (array of samples)
    h       in   Pointer to the array of coefficients
    xsize   in   Size of the input vector
    hsize   in   Size of the array of coefficients

Returns: The CONV macro generates output samples, which are pointed to by y.

4.4.3 Description of Optimization

C code: tort OF f lt Kein se Muni ig a 4 irene 5 a 3 geste Jess 4 eG Fs xe G oe xeu 3 use xellbatl ap sieeve chia 3 deeazss xellt 3

Optimization for the MAC unit: multiplication and addition are performed at the same time due to the mac instruction.

Optimized code (uses MAC unit): 4 d6 a0 a2 Gu move l al d4 MOC ipl Cad mac w cial toro dorir sec ACCO sube ho 442 bne _IN1 move l accO d7 move l 0 accO move l d7 a4 subgoh ab a bne OUT1

Optimization f
64. he following techniques:
1. Loop unrolling by four.
2. Using the macl instruction, which allows multiplying simultaneously with loading four values for the next iteration.
3. Postincrement addressing mode is used for sequential access to matrix elements.

Optimized code (uses eMAC unit): lea a0 al lea a2 a3 lea 16 a2 a2 move l n d2 movem l a3 d4 d7 add 1 d3 a3 al a4 d4 a4 ACCO GS ady ACCI d a4 ACC2 d7 a4 ACC3 d2 SUCCO LACCI BIAC SUIDAS movem 1 CaS lien clon lea 16 a5 a5 sucre bne OUT1

4.3.4 Differences Between MATR_MUL_UL and MATR_MUL_SL

The MATR_MUL_UL macro uses the unsigned mode of the MAC unit, while the MATR_MUL_SL macro uses the signed mode.

4.4 CONV

4.4.1 Macro Description

This macro computes convolution using an array of samples and an array of coefficients. Convolution is computed by the following formula:

$$y[i] = \sum_{j=0}^{M-1} h[j]\, x[i-j]$$

where y[i] is an output sample, x[i] is an input sample, and h[j] is a coefficient.

There are two algorithms for computing convolution:
• Input side algorithm
• Output side algorithm

The macro uses the output side algorithm for the implementation using the MAC unit because it is more suitable. To learn more about convolution and its properties, refer to The Scientist and Engineer's Guide to Digital Signal Processing, Steven W. Smit
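The output side algorithm named above computes each output sample as one complete sum, which is the shape a multiply-accumulate unit handles well. A minimal C sketch, assuming double samples instead of FRAC32 and using the hypothetical name conv_output_side, could look like this. It writes the xsize + hsize - 1 samples of the full convolution; samples of x outside the array are treated as zero, which is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* y[i] = sum over j of h[j] * x[i - j], for i = 0 .. xsize + hsize - 2. */
void conv_output_side(double *y, const double *x, const double *h,
                      int xsize, int hsize)
{
    assert(y != NULL && x != NULL && h != NULL && xsize > 0 && hsize > 0);

    for (int i = 0; i < xsize + hsize - 1; i++) {
        double acc = 0.0;                 /* plays the accumulator's role */
        for (int j = 0; j < hsize; j++) {
            int k = i - j;                /* index into the input signal  */
            if (k >= 0 && k < xsize) {
                acc += h[j] * x[k];       /* one multiply-accumulate step */
            }
        }
        y[i] = acc;                       /* one store per output sample  */
    }
}
```

The input side algorithm would instead scatter each input sample into several outputs, which needs a read-modify-write of y for every product.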
65. ignals to the filter are held in array src, and the output values are stored in array dst. Both arrays run from 0 to size - 1. The x parameter controls the computation of the a0 and b1 filter coefficients. Prior to any call of LPASS_1POLE_FLTR, the user must allocate memory for both the src and dst arrays in either static or dynamic memory.

Parameters:

Table 4-7. LPASS_1POLE_FLTR Parameters
    dst    out  Pointer to the output array of size FRAC32 data elements
    src    in   Pointer to the input array of size FRAC32 data elements
    size   in   Number of elements in input and output arrays
    x      in   FRAC32 value between zero and one that controls filter coefficients computation

Returns: The LPASS_1POLE_FLTR macro generates output values, which are stored in the array pointed to by dst.

4.7.3 Description of Optimization

This macro frequently performs multiplication and addition operations on fractional values. It is suitable for the eMAC unit because the eMAC has a fractional mode. Optimization for the MAC unit is performed as an emulation of the fractional mode, using the mac.w instruction with a left shift on the upper 16 bits of the operands. So only the upper 16 bits of the resulting signals are significant. The following optimization techniques were used:
1. Multiple load operations to access input array elements.
2. Postincrement addressing mode to store output array elements.
3. Loop unrolling by four.
4. Des
66. lue in a register and clear an accumulator at the same time.

Optimized code (uses eMAC unit): dO dO ACCO mEACC OF 7 CO ACEO d2 Q Q e ae ACO d3 Q Q ise p els ACEO Q Q oO d4 p GHn JACCO CCF a ey aa doy ACEO ACCU de 00 move ld0 ACCO 357913941 a0 TTISSSSM LI E 426088 a2 SOMU ES 53 a4 d2 a0 ACCO d3 alg ACCO d4 a2 ACCO d5 a3 ACCO d6 a4 ACCO ACCO d0

5.4 COS_F

5.4.1 Macro Description

This macro computes the cosine of an angle from the range [0, π/4]. Computation is done by a Taylor series consisting of 7 elements:

$$\cos x = 1 - \frac{x^{2}}{2!} + \frac{x^{4}}{4!} - \frac{x^{6}}{6!} + \frac{x^{8}}{8!} - \frac{x^{10}}{10!} + \frac{x^{12}}{12!}$$

Notes:
• Value of the angle parameter must be in [0, π/4].
• SIN_F and COS_F macros have a common header file, sin_cos.h.

5.4.2 Parameters Description

Call(s):

    FRAC32 COS_F(FRAC32 ang)

Parameters:

Table 5-4. COS_F Parameters
    ang   in   An angle value

Returns: The value of the cosine function of the angle.

5.4.3 Description of Optimization

C code:

    res_c = cos(tstvalc);

Optimization for the MAC unit can be done using the following techniques:
1. Sequential mac instructions that allow efficient use of the MAC pipeline.
2. Quick multiplication and subtraction due to the msac instruction.
3. Quick multiplication due to the MAC unit.
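The 7-element Taylor series for the cosine can be checked numerically with a few lines of C. The sketch works in double precision rather than FRAC32, and the name cos_taylor7 is hypothetical. Each term is derived from the previous one by multiplying with -x^2 / ((n-1) n), so the alternating signs show up as alternating additions and subtractions, which is where a msac-style multiply-subtract fits.

```c
#include <assert.h>

/* 1 - x^2/2! + x^4/4! - x^6/6! + x^8/8! - x^10/10! + x^12/12! */
double cos_taylor7(double x)
{
    assert(x >= 0.0 && x <= 0.7854);   /* [0, pi/4], upper bound rounded up */

    double x2 = x * x;
    double term = 1.0;                 /* first element of the series */
    double sum = 1.0;
    for (int n = 2; n <= 12; n += 2) { /* six more elements           */
        term *= -x2 / (double)((n - 1) * n);
        sum += term;
    }
    return sum;
}
```

At x = π/4 the first omitted term, x^14/14!, is on the order of 1e-13, well below the resolution of a 32-bit fraction.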
67. macro
    size   in   Number of elements in input and output arrays

Returns: The ARR1D_CAST_SWL and ARR1D_CAST_UWL macros generate output values, which are stored in the array pointed to by dest.

2.11.3 Description of Optimization

These macros do not use any multiplication operations. Therefore, MAC and eMAC instructions are not suitable for optimizing them, and instructions from the Integer Instruction Set were used instead. The following optimization techniques were used:
1. Multiple load/store operations to access array elements.
2. Loop unrolling by four.
3. Descending loop organization.

Particular techniques of optimization are reviewed below.

C code: imonat ak 0 ASTE E aion ea ome ease d Lab 8

Optimized code (taken from ARR1D_CAST_SWL):

    movem.l (a0),d2/d4     ; multiple load operations to access source array elements
    move.l  d2,d3
    move.l  d4,d5          ; conversion performed by four elements because of loop unrolling
    swap.w  d2
    swap.w  d4
    ext.l   d2
    ext.l   d3             ; in ARR1D_CAST_UWL, the andi.l #0xffff,d2 instruction was used
    ext.l   d4
    ext.l   d5
    movem.l d2-d5,(a1)     ; multiple store operation
    addq.l  #8,a0
    add.l   #16,a1
    subq.l  #1,d0          ; descending loop organization

2.11.4 Differences Between ARR1D_CAST_SWL and ARR1D_CAST_UWL

ARR1D_CAST_SWL is used for signed values and ARR1D_CAST_UWL is u
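The difference between the signed and unsigned conversions comes down to how the upper 16 bits of each long are filled: sign extension (what ext.l does) versus zero fill (what masking with 0xffff does). A small C sketch with hypothetical function names cast_swl and cast_uwl, using fixed-width types as stand-ins for word and long data:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Signed word to long: the sign bit is copied into the upper 16 bits. */
void cast_swl(int32_t *dest, const int16_t *src, int size)
{
    assert(dest != NULL && src != NULL && size >= 0);
    for (int i = 0; i < size; i++) {
        dest[i] = (int32_t)src[i];
    }
}

/* Unsigned word to long: the upper 16 bits are cleared. */
void cast_uwl(uint32_t *dest, const uint16_t *src, int size)
{
    assert(dest != NULL && src != NULL && size >= 0);
    for (int i = 0; i < size; i++) {
        dest[i] = (uint32_t)src[i] & 0xffffu;
    }
}
```

The same 16-bit pattern 0xFFFF therefore becomes -1 in the signed case and 65535 in the unsigned case.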
68. may be found in The Scientist and Engineer's Guide to Digital Signal Processing, Steven W. Smith, Ph.D., California Technical Publishing, http://www.dspguide.com.

4.9.2 Parameters Description

Call(s):

    LPASS_4STG_FLTR(FRAC32 *dst, FRAC32 *src, long size, FRAC32 x)

The input signals to the filter are held in array src, and the output values are stored in array dst. Both arrays run from 0 to size - 1. The x parameter controls the computation of the a0, b1, b2, b3 and b4 filter coefficients. Prior to any call of LPASS_4STG_FLTR, the user must allocate memory for both the src and dst arrays in either static or dynamic memory.

Parameters:

Table 4-9. LPASS_4STG_FLTR Parameters
    dst    out  Pointer to the output array of size FRAC32 data elements
    src    in   Pointer to the input array of size FRAC32 data elements
    size   in   Number of elements in input and output arrays
    x      in   FRAC32 value between zero and one that controls filter coefficients computation

Returns: The LPASS_4STG_FLTR macro generates output values, which are stored in the array pointed to by dst.

4.9.3 Description of Optimization

This macro frequently performs multiplication and addition operations on fractional values. It is suitable for the eMAC unit because the eMAC has a fractional mode. Optimization for the MAC unit is performed as an emulation
69. msac w a3 u d4 u ACCO computes a 1 x i 1 for y i ouput element macl w d6 u d0 u lt lt al t d4 ACCO Ecomoutesabuu VII OX odu cea is loads the next input operand ACCO d0 moves y i to dO 0 ACCO clears accumulator d0 a0 gt stores y i to memory mac w a3 u d4 u ACCO t computes a O0 x rtl for y itl osuput element msac w a3 u d3 u ACCO computes a 1 x i for y i 1 ouput element macl w d6 u d0 u al t d3 ACCO EcombutesebN DENSIS cea loads the next input operand o db ACCO AO moves y it l to dO l 0 ACCO clears accumulator 1 dO a0 stores y i to memory Optimization for the eMAC unit The following should be noticed e The loop is unrolled by two e Coefficients a and b are pre computed and held in registers a3 and d6 correspondingly e The a coefficient is not computed because a a so thr msac operation is used e d0 register always holds the last computed output signal e Input operands are fetched from memory in msac instructions and stored in registers d3 and d4 e The a0 register holds the pointer to the output array the al register holds the pointer to the input array As the loop is unrolled by two the output values are computed in two eMAC accumulators The movelr instruction is used to clear the accumulators Optimized code uses eMAC unit Toc Mr dip AGO computes a 0 x i for y i ouput element msac l a3 d4 ACCO computes a 1 x i 1 for y i
70. Figure 6-4. Library of Macros added to Project Explorer

j. Click on the Make button in order to compile and link your project.
k. You shouldn't get any errors. Otherwise, verify the previous steps.
l. Now you can use any desired macro from the library.

6.4 Using a Macro

a. Include the appropriate header into your main.c file:
   o Using a microprocessor with eMAC: #include "emac_macro.h"
   o Using a microprocessor with MAC: #include "mac_macro.h"
b. Using the prototype declaration described in this document, add your function call. E.g., using the CONV macro described in Section 4.4, "CONV", the prototype is the following:

    void CONV(void *y, void *x, void *h, int xsize, int hsize)

   So the call of your function can be something like: CONVE SAE HSD Se 48362 du SS ra HS EAEI
c. Create the arrays for testing purposes, e.g.:

    #define X_SIZE 20
    #define H_SIZE 10

   FRAC32 T32 y X SIZETH SIZE L FRAC32 32 x X SIZE L7309 10 5127 O TO F32 0 309016994374947 L WOM SIA OSS Tels Se JAAS e TO F32 0 809016994374947 MTOM 720009 5380 5 651662 9 5395417 STOSS ZA OTOI SEDONIS2ACOIRODEIOSGI62095s 54 TO F32 0 809016994374947 O TS KO OEEOI2 27197 9 TO F32 0 309016994374948 T0 F32 1 22514845490862E 16 D_TO_F32 0 309016994374948 A
71. nal mode, using the mac.w instruction with a left shift on the upper 16 bits of the operands. So only the upper 16 bits of the resulting signals are significant. The following optimization techniques were used:
1. Mac-with-load operations to access input array elements.
2. Postincrement addressing mode to store output array elements.
3. Loop unrolling by two.
4. Descending loop organization.

Particular techniques for optimization are reviewed below.

C code:

Optimization for the MAC unit. The following should be noticed:
• The loop is unrolled by two.
• Coefficients a0 and b1 are pre-computed and held in registers a3 and d6, correspondingly.
• The a1 coefficient is not computed because a1 = -a0, so the msac operation is used.
• The d0 register always holds the last computed output signal.
• Input operands are fetched from memory in msac instructions and stored in registers d3 and d4.
• The a0 register holds the pointer to the output array; the a1 register holds the pointer to the input array.

The MAC unit has only one accumulator, and all output elements must be computed sequentially, so mac instruction pipelining is worse than in the eMAC unit case. Another aspect is that the MAC unit has no movclr instruction, so the accumulator must be cleared explicitly.

Optimized code (uses MAC unit):

    mac.w a3.u,d3.u,ACC0     ; computes a[0]*x[i] for y[i] output element
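The recurrence implied by the notes above is y[i] = a0*x[i] + a1*x[i-1] + b1*y[i-1] with a1 = -a0, so the middle product can be formed with the same a0 register and subtracted. A C sketch of that data flow follows. It uses double instead of FRAC32, assumes zero initial conditions for x[-1] and y[-1], takes a0 and b1 as already computed, and uses the hypothetical name one_pole_filter; none of these details come from the macro source.

```c
#include <assert.h>
#include <stddef.h>

/* y[i] = a0*x[i] - a0*x[i-1] + b1*y[i-1], with x[-1] = y[-1] = 0 assumed. */
void one_pole_filter(double *dst, const double *src, int size,
                     double a0, double b1)
{
    assert(dst != NULL && src != NULL && size >= 0);

    double x_prev = 0.0;    /* previous input sample                    */
    double y_prev = 0.0;    /* last computed output (the d0 role)       */

    for (int i = 0; i < size; i++) {
        double acc = 0.0;           /* accumulator cleared explicitly   */
        acc += a0 * src[i];         /* mac:  a[0] * x[i]                */
        acc -= a0 * x_prev;         /* msac: a[1] * x[i-1], a1 = -a0    */
        acc += b1 * y_prev;         /* mac:  b[1] * y[i-1]              */
        dst[i] = acc;
        x_prev = src[i];
        y_prev = acc;
    }
}
```

Because every output depends on the previous one through the b1 term, the three accumulate steps for y[i+1] cannot start before y[i] is finished when only one accumulator is available, which is the pipelining limitation the text mentions for the MAC unit.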
72. nd stored in registers d4, d5, d6, d7, a2, a3, a4 and a5.
• The d0 register contains the previously computed value.
• Results are stored in registers a2, a3, a4 and a5.
• The a0 register holds the pointer to the output array; the a1 register holds the pointer to the input array.

Optimized code: movem 1 al d4 d7 multiple load operations to access source movem l al a2 a5 array s elements d0 a2 performing loop body that unrolled by four d4 a3 d5 a4 d6 a5 l a2 a5 a0 multiple store operation to save results mover c c0 add 1l 16 al add 1 16 a0 subq 1l 1 d1 decsending loop organization bne loopl

4.6 RUNN_SUM

4.6.1 Macro Description

This macro calculates the running sum of the input fractional operands, commonly known as discrete integration. More details on this linear system's characteristics may be found in The Scientist and Engineer's Guide to Digital Signal Processing, Steven W. Smith, Ph.D., California Technical Publishing, http://www.dspguide.com.

4.6.2 Parameters Description

Call(s):

    RUNN_SUM(FRAC32 *dst, FRAC32 *src, long size)

The original signals are held in array src, and the running sum up to the n-th element is stored in the corresponding n-th element of array dst. Both arrays run from 0 to size - 1. Prior to any call of RUNN_SUM, the user must allocate memory for both the src and dst ar
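The first difference and the running sum are inverse operations, which makes them easy to sketch and check together. The C functions below are hypothetical stand-ins that use long instead of FRAC32 and assume a zero sample before the start of the array.

```c
#include <assert.h>
#include <stddef.h>

/* First difference: dst[i] = src[i] - src[i-1], with src[-1] = 0 assumed. */
void first_diff(long *dst, const long *src, int size)
{
    assert(dst != NULL && src != NULL && size >= 0);
    long prev = 0;                   /* previously loaded input value */
    for (int i = 0; i < size; i++) {
        dst[i] = src[i] - prev;
        prev = src[i];
    }
}

/* Running sum: dst[n] = src[0] + src[1] + ... + src[n]. */
void runn_sum(long *dst, const long *src, int size)
{
    assert(dst != NULL && src != NULL && size >= 0);
    long acc = 0;                    /* sum of all samples seen so far */
    for (int i = 0; i < size; i++) {
        acc += src[i];
        dst[i] = acc;
    }
}
```

Applying runn_sum to the output of first_diff returns the original signal under the zero-start assumption.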
73. nt ARRID MUL2 SL long dest long src long size Parameters Table 2 6 ARRI1D MUL2 Parameters dest in Pointer to the destination vector SIC in Pointer to the source vector size in Number of elements in vectors Library of Macros for Optimization Rev 1 0 2 15 Freescale Semiconductor Returns The ARRID MUL2 macro generates an unsigned signed output vector which is the result of dest and src multiplication and is pointed to by dest 2 6 3 Description of Optimization C code fior 0r TESTSEITE euer epa vemm wuesedi rou g Optimization for MAC unit can be done using the following techniques 1 Loop unrolling by four 2 Using macl instruction which allows multiplying simultaneously with loading four values for the next iteration 3 The first four values are loaded using one movem instruction Optimized code uses MAC unit lea 60 a7 a7 movem l d2 d7 a2 a5 a7 move l 0 d0 move l d0 MACSR moveq l 16 d0 move l dest a0 TON eM eSI cra move l size dl move l d1 d2 arem 2v beq outl move l 0 ACCO movem l al d7 a3 a5 add 1 d0 al loopl movem l a0 d3 d6 maci clc adco ACEO ACCO d3 move macli i asr oa aas 1 AL move l 0 ACCO al JL move l ACCO d4 2 16 Library of Macros for Optimization Rev 1 0 Freescale Semiconductor move l 0 ACCO macl l a4 d5 al a4 ACCO move l ACCO d5 move l 0 ACCO Macleay sore alle arr be DANG G0 move l AC
74. of the fractional mode, using the mac.w instruction with a left shift on the upper 16 bits of the operands. So only the upper 16 bits of the resulting signals are significant. The following optimization techniques were used:
1. Mac-with-load instructions to access input array elements.
2. Postincrement addressing mode to store output array elements.
3. Loop unrolling by four.
4. Descending loop organization.

Particular techniques for optimization are reviewed below.

C code: eee e a eu galls a Jedi 9 eue resi a 192 9 qwe ierat ar los ewemexea 34 sp lo dk w esee pae

Optimization for the MAC unit. The following should be noticed:
• The loop is unrolled by four.
• Coefficients a0, b1, b2, b3 and b4 are pre-computed and held in registers a3, d6, d7, a4 and a5, correspondingly.
• The a2 register always holds the output sample per each iteration.
• Input operands are fetched from memory one by one and stored in registers d5, d4, d3 and d0.

All add/multiply instructions are performed by the MAC unit. The MAC unit has no movclr instruction, so the accumulator must be cleared explicitly. After each computation of an output sample, the data from the accumulator is stored in the register and the accumulator is cleared explicitly. After that, the result is stored into memory.

Optimized code (uses MAC unit):

    mac.w a3.u,a2.u,ACC0     ; computes a
75. on Optimized code uses MAC unit movem 1 move l move l moveq 1 move l move l move l move i move l asro ii beq out move l movem 1l lea 60 a7 a7 dez d7faz a5 a7 0x40 d0 d0 MACSR 16 d0 dest a0 Srelyat See EV size dl Cll OU Zi ol Al 0 ACCO al d7 a3 a5 accidat loopl movem 1 macle l move l move l macl l move l move l macl l move l move l macl l move t move t movem 1 2 20 a2 d3 d6 d7 d3 a1 d7 ACCO ACCO d3 0 ACCO a3 d4 a1 a3 ACCO ACCO d4 0 ACCO a4 d5 al a4 ACCO ACCO d5 0 ACCO a5 d6 al a5 ACCO ACCO d6 0 ACCO d3 d6 a0 Library of Macros for Optimization Rev 1 0 Freescale Semiconductor add 1 d0 a2 add 1 d0 a0 subor 41 01 bne loopl aubP and 1 3 d2 beq out2 SUME rali loop2 move a2 dqd3 maU NALI FOA move l d3 a0 suba 102 bne loop2 GVEZS movem l a7 d2 d7 a2 a5 lea 60 a7 a7 Optimization for eMAC unit can be done using the following techniques 4 Loop unrolling by four Using 4 accumulators for pipelining Using macl instruction which allows multiplying simultaneously with loading four values for the next iteration The first four values are loaded using one movem instruction Optimized code uses eMAC unit lea 60 a7 a7 movem l d2 d7 a2 a5 a7 moveq 1l 16 d0 move l dest a0 move SECI AL size dl JL ili move l src2 a2 move l Jb move l
76. where res is the result value, X is the input array, x is an element of array X, size1 is the number of rows, and size2 is the number of columns.

Notes: The type of the array elements in the ARR2D_PROD_UL macro must be unsigned long, and the type of the array elements in the ARR2D_PROD_SL macro must be signed long.

3.5.2 Parameters Description

Call(s):

    int ARR2D_PROD_SL(void *arr, int size1, int size2)
    int ARR2D_PROD_UL(void *arr, int size1, int size2)

Parameters:

Table 3-5. ARR2D_PROD Parameters
    arr     in/out  Pointer to the array
    size1   in      Number of rows of matrix
    size2   in      Number of columns of matrix

Returns: The ARR2D_PROD macro generates an unsigned/signed output value, which is returned by the macro.

3.5.3 Description of Optimization

C code: were a c 108 3L lt lt Sarva casts ioe 3 Of 3p vx SADA see prodici arredime

Optimization can be done using the following techniques:
1. The elements are accessed as 1D array elements, with the number of elements size1 × size2, because the elements of a 2D array are located in memory sequentially.
2. Loop unrolling by four.
3. Every four values of array arr used in each iteration are loaded using postincrement addressing mode while performing multiplications.
4. If the number of elements is not divisible by four, the tail elements are processed in regular order.
77. or
msac.w d0.u,a3.u,ACC0 ; subtracts the first item of y[i] from the accumulator
move.l ACC0,d5 ; stores y[i] from the accumulator to d5
move.l d5,(a0)+ ; stores y[i] into memory
Optimization for eMAC unit
The following should be noticed:
• The 1/M value is stored in register a3.
• To calculate the y[i+1] value, the y[i] value is used. The first item of the y[i] value is subtracted from the accumulator and the last item of y[i+1] is added to the accumulator. Then the accumulator value is stored as y[i+1].
• The a1 and a0 registers hold pointers to the src and dst arrays.
All add/multiply instructions are performed by the eMAC unit.
Optimized code (uses eMAC unit):
mac.l d4,a3,ACC0 ; adds the last item of y[i+1] to the accumulator
msac.l d0,a3,ACC0 ; subtracts the first item of y[i] from the accumulator
move.l ACC0,d5 ; stores y[i] from the accumulator to d5
move.l d5,(a0)+ ; stores y[i] into memory
Chapter 5 Macros for Mathematical Functions
5.1 SIN
5.1.1 Macro Description
This macro performs some arithmetic operations on the angle parameter to reduce the angle value to the range $[0, \pi/4]$ and then calls the SIN_F or COS_F macro to compute the sine function.
Notes:
• The value of the angle parameter must be in $[0, 2\pi]$.
• The SIN and COS macros have a common header file, sin_cos.h.
78. or eMAC unit can be done using the following techniques:
1. Loop unrolling by four.
2. Reduction of the number of instructions for fetching operands from memory (one element can be used in the computation of several output elements).
3. Using the macl instruction, which allows multiplying simultaneously with loading four values for the next iteration.
4. Using the movclr instruction instead of two instructions to store the value and clear the accumulator.
5. Sequential mac operations allow efficient use of the eMAC unit pipeline.
Optimized code (uses eMAC unit):
movea l al a0 moveas ease Maver L A0 aS HE Ee a5 d4 CS dan AS dip aec cls d r aoit Asy cxereul
macis tk els cll a5 E cei velo mac l ds dA aces Tea 12 a5 aS Suba AdS bne _ mlkace lr d5 a4 olere El d5 a4 placer cl d5 a4 iy lace3y do d5 a4 llc aw osa a3 sube h Gl dl bne _OUT2
4.5 FIRST_DIFF
4.5.1 Macro Description
This macro performs a calculation of the first differences on input fractional operands, commonly known as discrete derivation. More details on this linear system's characteristics may be found in The Scientist and Engineer's Guide to Digital Signal Processing, Steven W. Smith, Ph.D., California Technical Publishing, http://www.dspguide.com.
4.5.2 Parameters Description
Call(s):
FIRST DIFF FRAC32 dst
79. ored using postincrement addressing mode while performing additions.
3. If the number of elements is not divisible by 4, the tail elements are processed in regular order.
Optimized code:
move.l d1,d2
asr.l #2,d1
beq l1
1 do a0 1 do a0 1 do a0 1 do a0 gal edito 12
subq.l #1,d2
bne l3
2.4.4 Differences Between the ARR1D_ADDSC_UL and the ARR1D_ADDSC_SL Macros
The type of the ARR1D_ADDSC_UL parameters arr and scale is unsigned long. The type of the ARR1D_ADDSC_SL parameters arr and scale is signed long.
2.5 ARR1D_PROD_UL, ARR1D_PROD_SL
2.5.1 Macros Description
These macros compute the product of the elements of a vector (array) of unsigned/signed values. The product is computed by the formula
$res = \prod_{i=0}^{size-1} x_i, \quad x_i \in X$
where
res: result value
X: input vector
x: element of the X vector
size: number of elements in the input vector
2.5.2 Parameters Description
Call(s):
int ARR1D_PROD_UL(unsigned long *arr, int size);
int ARR1D_PROD_SL(signed long *arr, int size);
Parameters:
Table 2-5. ARR1D_PROD Parameters
arr   in/out  Pointer to the vector
size  in      Number of elements in vector
Returns: The ARR1D_PROD macro generates an unsigned/signed output value, which is returned by the macro.
2.5.3
80. output element
mac scorre asc mne compu es evi to vproducer yj ; loads the next input operand
moved Acc 0c ; moves y[i] to d0
move.l d0,(a0)+ ; and stores y[i] to memory
mac.l a3,d4,ACC1 ; computes a[0]*x[i+1] for the y[i+1] output element
miedo d a3 ae ACD ; computes a[1]*x[i] for the y[i+1] output element
macl I de dlra t SAC CAL 7 computes oMi aan EOP roduc Sny s ; loads the next input operand
MOC lary TNC IL l0 ; moves y[i+1] to d0
move.l d0,(a0)+ ; and stores y[i+1] to memory
4.9 LPASS_4STG_FLTR
4.9.1 Macros Description
This macro computes a four-stage low-pass filter. This recursive filter uses five coefficients, $a_0$, $b_1$, $b_2$, $b_3$, and $b_4$, so the filter can be represented in the following form:
$y_n = a_0 x_n + b_1 y_{n-1} + b_2 y_{n-2} + b_3 y_{n-3} + b_4 y_{n-4}$
The filter's response characteristics are controlled by the parameter x, a value between zero and one. The four-stage low-pass filter is comparable to the Blackman and Gaussian filters (relatives of the moving average), but with a much faster execution speed. The design equations for a four-stage low-pass filter are the following:
$a_0 = (1 - x)^4$
$b_1 = 4x$
$b_2 = -6x^2$
$b_3 = 4x^3$
$b_4 = -x^4$
Note: The filter becomes unstable if x is made greater than one. Thus, any nonzero value on the input will increase the output until an overflow occurs.
More details on this digital recursive filter's characteristic
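As a worked illustration of the four-stage low-pass recurrence, here is a minimal C sketch. It assumes the design equations are a0 = (1-x)^4, b1 = 4x, b2 = -6x^2, b3 = 4x^3, b4 = -x^4 (the four-stage form given in the cited Smith text), uses double in place of the library's FRAC32 type, and assumes a zero output history before the first sample. The function names are hypothetical and are not part of the library.

```c
/* Hypothetical helper: coefficients of the four-stage low-pass filter.
 * b[0]..b[3] hold b1..b4. */
void lpass4_coeffs_sketch(double x, double *a0, double b[4])
{
    double u = 1.0 - x;
    *a0  = u * u * u * u;       /* a0 = (1 - x)^4 */
    b[0] = 4.0 * x;             /* b1 =  4x       */
    b[1] = -6.0 * x * x;        /* b2 = -6x^2     */
    b[2] = 4.0 * x * x * x;     /* b3 =  4x^3     */
    b[3] = -(x * x * x * x);    /* b4 = -x^4      */
}

/* Hypothetical reference model of the recurrence
 * y[n] = a0*x[n] + b1*y[n-1] + b2*y[n-2] + b3*y[n-3] + b4*y[n-4]. */
void lpass4_filter_sketch(double *dst, const double *src, long size, double x)
{
    double a0, b[4];
    double y1 = 0.0, y2 = 0.0, y3 = 0.0, y4 = 0.0; /* assumed zero history */
    long n;

    lpass4_coeffs_sketch(x, &a0, b);
    for (n = 0; n < size; n++) {
        double y = a0 * src[n] + b[0] * y1 + b[1] * y2 + b[2] * y3 + b[3] * y4;
        dst[n] = y;
        y4 = y3; y3 = y2; y2 = y1; y1 = y;         /* shift the output history */
    }
}
```

With x = 0 the feedback terms vanish and the output equals the input; the coefficients sum to one for any x, so a constant input settles at the same constant level.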
81. put element
mac.l a4,d4,ACC0 ; computes b[3]*y[i-3] for the y[i] output element
MS Acre aot oO ACCO ; computes b[4]*y[i-4] to produce y[i]
MOGs AGCO ds mmowvesavslbis ll eom do
move.l d5,(a0)+ ; and stores y[i] to memory
4.10 BANDPASS_FLTR
4.10.1 Macro Description
This macro computes a band-pass filter. This recursive filter uses five coefficients, $a_0$, $a_1$, $a_2$, $b_1$, and $b_2$. The filter can be represented in the following form:
$y_n = a_0 x_n + a_1 x_{n-1} + a_2 x_{n-2} + b_1 y_{n-1} + b_2 y_{n-2}$
The filter's response characteristics are controlled by the parameters f, the center frequency, and BW, the bandwidth. Both parameter values must be in the range 0 to 0.5. The design equations for a band-pass filter are the following:
$a_0 = 1 - K$
$a_1 = 2(K - R)\cos(2\pi f)$
$a_2 = R^2 - K$
$b_1 = 2R\cos(2\pi f)$
$b_2 = -R^2$
where
$K = \frac{1 - 2R\cos(2\pi f) + R^2}{2 - 2\cos(2\pi f)}$
$R = 1 - 3BW$
More details on this digital recursive filter's characteristics may be found in The Scientist and Engineer's Guide to Digital Signal Processing, Steven W. Smith, Ph.D., California Technical Publishing, http://www.dspguide.com.
4.10.2 Parameters Description
Call(s):
BANDPASS_FLTR(FRAC32 *dst, FRAC32 *src, long size, FRAC32 freq, FRAC32 bandw)
The input signals to the filter are held in array src, and the output values are stored in array dst. Both arrays run from 0 to size-1. The freq and ban
82. r i 0 i lt SIZE i iE arr ce Lat Saute max c3 wen venale index i
Optimized code 12 taken from ARR1D_MAX_S macro:
movem.l (a0),d1-d4 ; multiple load operations to access source array elements
cmpl cll cls
bge el ; making comparisons between four elements because of loop unrolling
move l di d5
1 d6 a6 a3 ; index is accumulated in d6
addq.l #1,d6
P 16 a0
subq.l #1,d0 ; descending loop organization
2.9.4 Differences Between ARR1D_MAX_U and ARR1D_MAX_S
For signed and unsigned values, the appropriate comparison instructions were used.
2.10 ARR1D_MIN_S, ARR1D_MIN_U
2.10.1 Macros Description
These macros search for the minimum element in a 1D array of signed or unsigned integer values.
2.10.2 Parameters Description
Call(s):
ARR1D_MIN_S(signed long *src, int size)
ARR1D_MIN_U(unsigned long *src, int size)
The elements are held in array src. The src array is searched for the minimum from 0 to size-1. Prior to any call of the ARR1D_MIN_S and ARR1D_MIN_U macros, the user must allocate memory for the src array either in static or in dynamic memory. The types of the array and the invoking macro must correspond.
Parameters:
Table 2-10. ARR1D_MIN_S, ARR1D_MIN_U Parameters
src   in  Pointer to the input array
size  in  Number of el
83. rays either in static or in dynamic memory.
Parameters:
Table 4-6. RUNN_SUM Parameters
dst   Out  Pointer to the output array of size FRAC32 data elements
src   In   Pointer to the input array of size FRAC32 data elements
size  In   Number of elements in input and output arrays
Returns: The RUNN_SUM macro generates output values that are stored in the array pointed to by dst.
4.6.3 Description of Optimization
This macro does not use any multiplication operations, so MAC and eMAC instructions are not suitable for optimizing it. Instead, instructions from the Integer Instruction Set were used for optimization. The following optimization techniques were used:
1. Multiple load operations to access array src elements.
2. Postincrement addressing mode to store results in array dst.
3. Loop unrolling by four.
4. Descending loop organization.
Particular techniques for optimization are reviewed below.
C code: FOLL slp Gl lt lt SUDO apap gue vens e aie elf sb eiieieiel fat ip
The following should be noticed:
• The loop is unrolled by four.
• The input operands are fetched from memory in fours and stored in registers d4, d5, d6, and d7.
• The d0 register contains the latest computed value.
• The a0 register holds the pointer to the output array; register a1 holds the pointer to the input array.
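The register usage described above can be mirrored in a short C sketch. This is an illustration under assumptions rather than the library's implementation: the running sum is taken to be dst[i] = dst[i-1] + src[i], long stands in for the FRAC32 type, the function name runn_sum_sketch is hypothetical, and the tail loop for sizes not divisible by four is an assumption that the excerpt does not confirm.

```c
/* Hypothetical C model of the RUNN_SUM optimization notes. */
void runn_sum_sketch(long *dst, const long *src, int size)
{
    long last = 0;              /* latest computed value (d0 in the notes) */
    int blocks, tail;

    if (size <= 0)
        return;
    blocks = size >> 2;         /* number of unrolled iterations */
    tail = size & 3;            /* leftover elements */

    while (blocks-- > 0) {      /* descending loop, unrolled by four */
        /* multiple loads: four inputs at once (d4..d7 in the notes) */
        long x0 = src[0], x1 = src[1], x2 = src[2], x3 = src[3];
        src += 4;
        last += x0; *dst++ = last;   /* postincrement store of each result */
        last += x1; *dst++ = last;
        last += x2; *dst++ = last;
        last += x3; *dst++ = last;
    }
    while (tail-- > 0) {        /* assumed tail handling, not from the source */
        last += *src++;
        *dst++ = last;
    }
}
```

Counting the loop down to zero lets the assembly version end each iteration with a single subtract-and-branch pair, which is the point of the descending loop organization.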
84. ro computes a product of two fixed-point numbers.
5.5.2 Parameters Description
Call(s):
FIXED64 MUL(FIXED64 m1, FIXED64 m2)
Parameters:
Table 5-5. MUL Parameters
m1  in  Multiplicand
m2  in  Multiplier
Returns: product of m1 and m2.
5.5.3 Description of Optimization
C code:
Optimization for the MAC unit is unsuitable because of the absence of fractional mode in the MAC unit.
Optimization for the eMAC unit can be done using the following techniques:
1. Using both integer and fractional modes of the eMAC unit to get all 64 bits of the result with only 6 mac instructions.
2. Using the eMAC rounding mode to gain suitable precision without additional mac instructions.
Optimized code (uses eMAC unit):
3 ACCO SEA CHI ACCEXTO1 d2 ACC1 dl d3 0x40 d5 d5 MACSR QUIM qol Boe dl a2 ACCS Cla xe Xeon
Chapter 6 QuickStart for CodeWarrior
The Library of Macros is very easy to use and test. Although all macros are written in assembly, they were developed in such a way that they can be easily integrated into a C program. The purpose of this chapter is to guide a user through the steps required to add, compile, test, and use the Library of Macros. The CONV function will be used for demonstration purposes. The example was dev
85. s a[2]*x[i-2] for the y[i] output element
mac.w d6.u,d0.u,ACC0 ; computes b[1]*y[i-1] for the y[i] output element
mac.w d7.u,d3.u,ACC0 ; computes b[2]*y[i-2] to produce y[i]
move.l ACC0,d3 ; moves y[i] to d3
move.l #0,ACC0 ; clears the accumulator
move.l d3,(a0)+ ; and stores y[i] to memory
Optimization for eMAC unit
The following should be noticed:
• The loop is unrolled by two.
• Coefficients $a_0$, $a_1$, $a_2$, $b_1$, and $b_2$ are precomputed and held in registers a3, a4, a5, d6, and d7, correspondingly.
• The a2 and d5 registers always hold the input samples per each iteration.
• The d3 and d0 registers always hold the output samples per each iteration.
• The a1 and a0 registers hold pointers to the src and dst arrays.
All add/multiply instructions are performed by the eMAC unit. After each computation of an output sample, the movclr instruction is used to clear the accumulator and store the result into a general-purpose register. After that, the result is stored into memory.
Optimized code (uses eMAC unit):
mac.l a3,a2,ACC0 ; computes the x[i] term for the y[i] output element
mac.l a4,d4,ACC0 ; computes the x[i-1] term for the y[i] output element
mace 5n dpa ; computes the x[i-2] term for the y[i] output element
mac.l d6,d0,ACC0 ; computes the y[i-1] term for the y[i] output element
Macrelan 917 003 2X6 computes syl CO roces saa
MOC eral ACCOA ; moves y[i] to d3
move.l d3,(a0)+ ; and stores y[i] to memory
4.11 BANDREJECT_FLTR
4.11.1 Macro Description
Thi
86. s compute the elementwise sum of two vectors (arrays) with unsigned/signed values. The elementwise sum is computed by the following formula:
$x_i = x_i + y_i, \quad x \in X, \; y \in Y, \; i \in [0, size-1]$
where
X, Y: input vectors
x, y: element of the corresponding vector
size: number of elements in the input vectors
2.2.2 Parameters Description
Call(s):
int ARR1D_ADD2_UL(unsigned long *dest, unsigned long *src, int size);
int ARR1D_ADD2_SL(signed long *dest, signed long *src, int size);
Parameters:
Table 2-2. ARR1D_ADD2 Parameters
dest  in/out  Pointer to the destination vector
src   in      Pointer to the source vector
size  in      Number of elements in vector
Returns: The ARR1D_ADD2 macro generates unsigned/signed output values, which are stored in the array pointed to by the parameter dest.
2.2.3 Description of Optimization
C code: foL NOSE STEEL euer ven pub ap a S A Es p
Optimization can be done using the following techniques:
1. Loop unrolling by four.
2. Every four values of array dest used in each iteration are loaded with only one movem instruction.
3. Every four values of array src used in each iteration are loaded using postincrement addressing mode while performing additions.
4. After performing additions, the resulting four values in each iteration are stored with only one movem instruction.
5. If the number of
87. s macro computes a band-reject filter. This recursive filter uses five coefficients, $a_0$, $a_1$, $a_2$, $b_1$, and $b_2$, so the filter can be represented in the following form:
$y_n = a_0 x_n + a_1 x_{n-1} + a_2 x_{n-2} + b_1 y_{n-1} + b_2 y_{n-2}$
The filter's response characteristics are controlled by the parameters f, the center frequency, and BW, the bandwidth. Both parameter values must be in the range 0 to 0.5. The design equations for a band-reject filter are the following:
$a_0 = K$
$a_1 = -2K\cos(2\pi f)$
$a_2 = K$
$b_1 = 2R\cos(2\pi f)$
$b_2 = -R^2$
where
$K = \frac{1 - 2R\cos(2\pi f) + R^2}{2 - 2\cos(2\pi f)}$
$R = 1 - 3BW$
More details on this digital recursive filter's characteristics may be found in The Scientist and Engineer's Guide to Digital Signal Processing, Steven W. Smith, Ph.D., California Technical Publishing, http://www.dspguide.com.
4.11.2 Parameters Description
Call(s):
BANDREJECT_FLTR(FRAC32 *dst, FRAC32 *src, long size, FRAC32 freq, FRAC32 bandw)
The input signals to the filter are held in array src, and the output values are stored in array dst. Both arrays run from 0 to size-1. The freq and bandw parameters control the computation of the $a_0$, $a_1$, $a_2$, $b_1$, and $b_2$ filter coefficients. Prior to any call of BANDREJECT_FLTR, the user must allocate memory for both the src and dst arrays, either in static or dynamic memory.
Parameters:
Table 4-11. BANDREJECT_FLTR Paramet
88. sed for unsigned values. For ARR1D_CAST_SWL, the ext.l instruction is used, and for ARR1D_CAST_UWL, the andi.l instruction is used.
Chapter 3 Macros for 2D Array Operations
3.1 ARR2D_SUM_UL, ARR2D_SUM_SL
3.1.1 Macros Description
These macros compute the sum of the array elements of unsigned/signed values. This sum is computed by the formula
$res = \sum_{i=0}^{size1-1} \sum_{j=0}^{size2-1} x_{ij}$
where
x: element of the input array
size1: number of rows of the input array
size2: number of columns in the input array
3.1.2 Parameters Description
Call(s):
int ARR2D_SUM_UL(unsigned long *src, int size1, int size2);
int ARR2D_SUM_SL(signed long *src, int size1, int size2);
Parameters:
Table 3-1. ARR2D_SUM Parameters
src    in  Pointer to the source array
size1  in  Number of rows in array
size2  in  Number of columns in array
Returns: The ARR2D_SUM macros return the unsigned/signed sum of the array elements.
3.1.3 Description of Optimization
C code: intone Gal 0 ak xe Siva able OwA e OF pom SUMAR Jes rasis dE ayer ERCTE TES IER
Optimization can be done using the following techniques:
1. The elements are accessed as 1D array elements, with the number of elements equal to size1 * size2, because the elements of a 2D array are located in memory sequentially.
2. Loop unrolling by four.
3. Postincrement addressing mode to access input array elements.
4.
89. to memory
computes a[0]*x[i+1] for the y[i+1] output element
computes b[1]*y[i] to produce y[i+1]
moves y[i+1] to d0, clears the accumulator
and stores y[i+1] to memory
computes a[0]*x[i+2] for the y[i+2] output element
computes b[1]*y[i+1] to produce y[i+2]
moves y[i+2] to d0, clears the accumulator
and stores y[i+2] to memory
computes a[0]*x[i+3] for the y[i+3] output element
computes b[1]*y[i+2] to produce y[i+3]
moves y[i+3] to d0, clears the accumulator
and stores y[i+3] to memory
Optimization for eMAC unit
The following should be noticed:
• The loop is unrolled by four.
• Coefficients $a_0$ and $b_1$ are precomputed and held in registers a3 and d6, correspondingly.
• The d0 register always holds the last computed output signal.
• Input operands are fetched from memory in fours and stored in registers d3, d4, d5, and a2.
The eMAC unit has four accumulators, so for better pipelining the $a_0 x$ part of each output element is computed for all four output elements at the beginning of the loop. The rest of the output element computation is performed sequentially, because the computation of each output element depends on the value of the previous element.
Optimized code (uses eMAC unit):
a3 d3 ACC0
a3 d4 ACC1
lees ere UN
a3 a2 ACC3
mac.l d6,d0,ACC0
movclr.l ACC0,d0
move.l d0,(a0)+
mac.l d6,d0,ACC1
movclr.l ACC1,d0
move l d0
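The accumulator scheduling described above can be modelled in C. The sketch below is an assumption-laden illustration, not the library's code: it takes the recurrence to be y[i] = a0*x[i] + b1*y[i-1] (as the listing comments suggest), uses double instead of FRAC32, assumes a zero initial output, and adds a tail loop that the excerpt does not show. The name pole1_filter_sketch is hypothetical.

```c
/* Hypothetical C model of the four-accumulator scheduling. */
void pole1_filter_sketch(double *dst, const double *src, long size,
                         double a0, double b1)
{
    double y = 0.0;   /* last computed output (d0 in the notes); zero assumed */
    long i = 0;

    for (; i + 4 <= size; i += 4) {
        /* Independent a0*x products, one per accumulator (ACC0..ACC3).
         * None depends on an earlier output, so they can be pipelined. */
        double acc0 = a0 * src[i];
        double acc1 = a0 * src[i + 1];
        double acc2 = a0 * src[i + 2];
        double acc3 = a0 * src[i + 3];

        /* Dependent chain: each output needs the previous output. */
        acc0 += b1 * y; y = acc0; dst[i]     = y;
        acc1 += b1 * y; y = acc1; dst[i + 1] = y;
        acc2 += b1 * y; y = acc2; dst[i + 2] = y;
        acc3 += b1 * y; y = acc3; dst[i + 3] = y;
    }
    for (; i < size; i++) {      /* assumed tail handling, not from the source */
        y = a0 * src[i] + b1 * y;
        dst[i] = y;
    }
}
```

Only the feedback multiply-accumulate stays on the critical path; the four feed-forward products are issued back to back, which is what the four accumulators make possible.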
90. turns The RDOT_PROD macro generates an unsigned/signed output value, which is returned by the macro.
4.2.3 Description of Optimization
Particular techniques of optimization are reviewed below.
C code: for i 0 i SIZE i res er re eeigil fal arro Sivas 3b alg
Optimization for the MAC unit can be done using the following techniques:
1. Loop unrolling by four.
2. Using the macl instruction, which allows multiplying simultaneously with loading four values for the next iteration.
3. The first four values are loaded using one movem instruction.
Optimized code (uses MAC unit):
lea 16 a0 a0 movem l a0 dl1 d4 subq 1 1 d0 beq L2 movem l al a2 a5 lea 16 al al maciyvi olio Sk even cil Welsh lsh eyes Ik ela vas neve sik lily cu subq 1 do bne L3
There is no need for a separate optimization for the eMAC unit, because there is only one multiply-accumulate sequence in the computations.
4.2.4 Differences Between RDOT_PROD_UL and RDOT_PROD_SL
The RDOT_PROD_UL macro uses the unsigned mode of the MAC unit, while the RDOT_PROD_SL macro uses the signed mode.
4.3 MATR_MUL_UL, MATR_MUL_SL
4.3.1 Macros Description
These macros compute the product of two matrices with unsigned/signed values. Matrix multiplication is computed by the following formula:
$c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}$
91. where $c_{ij}$ is an element of the resultant matrix C, and $a_{ik}$ and $b_{kj}$ are elements of the input matrices A and B, respectively.
4.3.2 Parameters Description
Call(s):
void MATR_MUL_UL(void *arrr, void *arr1, void *arr2, int m, int n, int p);
void MATR_MUL_SL(void *arrr, void *arr1, void *arr2, int m, int n, int p);
Parameters:
Table 4-3. MATR_MUL Parameters
arrr  out  Pointer to the resulting matrix (size must be m × p)
arr1  in   Pointer to the first matrix (size must be m × n)
arr2  in   Pointer to the second matrix (size must be n × p)
m     in   Number of rows in the first matrix
n     in   Number of columns in the first matrix, number of rows in the second matrix
p     in   Number of columns in the second matrix
Returns: The MATR_MUL macro generates an output matrix with unsigned/signed values, which is pointed to by arrr.
4.3.3 Description of Optimization
C code: ions a Oe al MSIE ELTE aene 3 OR 3 ss RSA sess for k 0 k lt NSIZE k Eyer velata Psp a eucrei ah el ener Dis T shi
Optimization for the MAC unit: multiplication and addition are performed at the same time by the mac instruction.
Optimized code (uses MAC unit):
lea a0 ee ar
lea a2 a3 lea 4 a2 a2 move l n d2 move l a3 d4 add l aap as move l al a4 mac l d4 a4 ACCO subq l 1 d2 bne IN3
Optimization for MAC unit can be done using t
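For reference, the matrix product that the macros compute can be written as a plain C triple loop. This is a minimal sketch under assumptions, not the library's C listing: the matrices are taken to be stored row by row in flat arrays of long, indices start at zero, and the name matr_mul_sketch is hypothetical. The inner accumulation is the part that a single mac instruction per step would perform on the MAC unit.

```c
/* Hypothetical reference model of MATR_MUL:
 * arrr (m x p) = arr1 (m x n) * arr2 (n x p), all stored row-major. */
void matr_mul_sketch(long *arrr, const long *arr1, const long *arr2,
                     int m, int n, int p)
{
    int i, j, k;

    for (i = 0; i < m; i++) {
        for (j = 0; j < p; j++) {
            long acc = 0;                          /* models accumulator ACC0 */
            for (k = 0; k < n; k++) {
                /* multiply and add in one step, as the mac instruction does */
                acc += arr1[i * n + k] * arr2[k * p + j];
            }
            arrr[i * p + j] = acc;
        }
    }
}
```

Walking arr2 down a column means stepping by p elements per iteration, which is why an assembly version would keep a separate pointer and stride for the second matrix.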
92. ut samples. The following optimization techniques were used:
1. Postincrement addressing mode to load input and store output array elements.
2. Loop unrolling by two.
3. Descending loop organization.
Particular techniques for optimization are reviewed below.
C code: arr ofii emm Sapra a F aL 9 euexedhel shed sp a2 99 muerdkei 2 x Iul use Gil st ili ap 49 cuseeet 2
Optimization for MAC unit
The following should be noticed:
• The loop is unrolled by two.
• Coefficients $a_0$, $a_1$, $a_2$, $b_1$, and $b_2$ are precomputed and held in registers a3, a4, a5, d6, and d7, correspondingly.
• The a2 and d5 registers always hold the input samples per each iteration.
• The d3 and d0 registers always hold the output samples per each iteration.
• The a1 and a0 registers hold pointers to the src and dst arrays.
All add/multiply instructions are performed by the MAC unit. The MAC unit has no movclr instruction, so after each computation of an output sample, the data from the accumulator is stored into a register and the accumulator is cleared explicitly. After that, the result is stored into memory.
Optimized code (uses MAC unit):
mac.w a3.u,a2.u,ACC0 ; computes a[0]*x[i] for the y[i] output element
mac.w a4.u,d4.u,<<,ACC0 ; computes a[1]*x[i-1] for the y[i] output element
mac.w a5.u,d5.u,<<,ACC0 ; compute
93. y be found in The Scientist and Engineer's Guide to Digital Signal Processing, Steven W. Smith, Ph.D., California Technical Publishing, http://www.dspguide.com.
4.8.2 Parameters Description
Call(s):
HPASS_1POLE_FLTR(FRAC32 *dst, FRAC32 *src, long size, FRAC32 x)
The input signals to the filter are held in array src, and the output values are stored in array dst. Both arrays run from 0 to size-1. The x parameter controls the computation of the $a_0$, $a_1$, and $b_1$ filter coefficients. Prior to any call to HPASS_1POLE_FLTR, the user must allocate memory for both the src and dst arrays, either in static or in dynamic memory.
Parameters:
Table 4-8. HPASS_1POLE_FLTR Parameters
dst   Out  Pointer to the output array of size FRAC32 data elements
src   In   Pointer to the input array of size FRAC32 data elements
size  In   Number of elements in input and output arrays
x     In   FRAC32 value between zero and one that controls filter coefficients computation
Returns: The HPASS_1POLE_FLTR macro generates output values, which are stored in the array pointed to by dst.
4.8.3 Description of Optimization
This macro frequently performs multiplication and addition operations on fractional values. It is suitable for the eMAC unit because it has a fractional mode. Optimization for the MAC unit is performed as an emulation of the fractio

