Software Analysis on Genesi Pegasos II Using PMON and AltiVec

Contents

1. 0x00, 0x08, 0x04, 0x0c, 0x02, 0x0a, 0x06, 0x0e, 0x01, 0x09, 0x05, 0x0d, 0x03, 0x0b, 0x07, 0x0f };
    unsigned char small_lookup_h[16] __attribute__((aligned(16))) = {
        0x00, 0x80, 0x40, 0xc0, 0x20, 0xa0, 0x60, 0xe0, 0x10, 0x90, 0x50, 0xd0, 0x30, 0xb0, 0x70, 0xf0 };
    reversed[j] = small_lookup_l[(j & 0xf0) >> 4] | small_lookup_h[j & 0x0f];
This method uses less memory but runs a bit slower, yielding 0.11 Bytes/Cycle. The true advantage comes from the observation that the small tables fit into two AltiVec registers and all lookups are completely independent, so 16 of them can be performed in parallel.
    void reverse_vector(vector unsigned char *in, vector unsigned char *out, int num_elements)
    {
        int i;
        vector unsigned char st_l, st_h;
        vector unsigned char four = vec_splat_u8(4);
        vector unsigned char v_in, vl, vh, v_out;
        st_l = vec_ld(0, (vector unsigned char *) small_lookup_l);
        st_h = vec_ld(0, (vector unsigned char *) small_lookup_h);
        for (i = 0; i < num_elements; i += 16) {
            v_in = vec_ld(i, in);
            vh = vec_sr(v_in, four);
            vl = vec_sl(v_in, four);
            vl = vec_sr(vl, four);
            vh = vec_perm(st_l, st_l, vh);
            vl = vec_perm(st_h, st_h, vl);
            v_out = vec_or(vh, vl);
            vec_st(v_out, i, out);
        }
    }
Software Analysis on Genesi Pegasos Using PMON and AltiVec, Rev. 0.1. Freescale Semiconductor. PRELIMINARY: SUBJECT TO CHANGE WITHOUT NOTICE.
This method, for the same conditions, yields 2.7 Bytes/Cycle. It is 30x faster than the original Scalar and 15x fast
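The nibble-splitting idea described above can be tried in portable scalar C before moving to AltiVec. The following is a minimal sketch, not the note's own code: the names `nibble_rev_low`, `nibble_rev_high`, `reverse_with_small_tables` and `reverse_buffer` are chosen here for illustration, and the table contents are the two 16-entry tables quoted in the text.

```c
#include <stddef.h>
#include <stdint.h>

/* Reversed 4-bit values placed in the low nibble (hypothetical name). */
const uint8_t nibble_rev_low[16] = {
    0x00, 0x08, 0x04, 0x0c, 0x02, 0x0a, 0x06, 0x0e,
    0x01, 0x09, 0x05, 0x0d, 0x03, 0x0b, 0x07, 0x0f
};

/* The same reversed values placed in the high nibble (hypothetical name). */
const uint8_t nibble_rev_high[16] = {
    0x00, 0x80, 0x40, 0xc0, 0x20, 0xa0, 0x60, 0xe0,
    0x10, 0x90, 0x50, 0xd0, 0x30, 0xb0, 0x70, 0xf0
};

/* The high nibble of j is reversed into the low half of the result,
 * the low nibble of j is reversed into the high half. */
uint8_t reverse_with_small_tables(uint8_t j)
{
    return (uint8_t)(nibble_rev_low[(j & 0xf0) >> 4] | nibble_rev_high[j & 0x0f]);
}

/* Apply the lookup to every byte of a buffer; each byte is independent,
 * which is what makes the 16-wide AltiVec version possible. */
void reverse_buffer(const uint8_t *in, uint8_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = reverse_with_small_tables(in[i]);
}
```

Because every byte is handled independently, the loop body maps directly onto a permute-based vector version: the two tables become the two permute source registers and the nibbles become the permute indices.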
2. unsigned int read 744x upmc2 void unsigned int read 744x upmc3 void unsigned int read 744x upmc4 void char outOfAlignment dif FORCE ALIGNMENT char aa array MAX SIZE _ attribute aligned 16 char ab array MAX SIZE _ attribute aligned 16 else char aa_array MAX SIZE char ab array MAX SIZE endif void print int vector vector int this one printf 08x 08x 08x 08x n int this one 0 int this_one 1 int this 2 int this one 3 Software Analysis on Genesi Pegasos Using PMON and AltiVec Rev 0 1 12 PRELIMINARY SUBJECT TO CHANGE WITHOUT NOTICE Freescale Semiconductor 57 58 59 void print char vector vector unsigned char this_one Introduction to Compiling on Linux with GCC printf 02x 02x 02x 02x 02x 02 02x 02x 02 02x 02x 02x 402K 02x 02x 60 02x n 61 unsigned 62 unsigned 63 unsigned 64 unsigned 65 unsigned 66 unsigned 67 unsigned 68 unsigned 69 unsigned 70 unsigned 71 unsigned 72 unsigned 73 unsigned 74 unsigned 75 unsigned 76 unsigned 77 78 79 vector unsigned char vectorLoadUnaligned vector unsigned char v char this char this one char this one char this one char this one char this one char this one char this one char this one char this one char this one char this one char this one char this one char this
3. Revision History
    Rev. No.   Date         Substantive Change(s)
               07/23/2004   Initial release
    0.1        08/31/2004   Minor editing
How to Reach Us
USA/Europe/Locations Not Listed: Freescale Literature Distribution, P.O. Box 5405, Denver, Colorado 80217. 1-480-768-2130, (800) 521-6274.
Japan: Freescale Semiconductor Japan Ltd., SPS, Technical Information Center, 3-20-1, Minami-Azabu, Minato-ku, Tokyo 106-8573, Japan. 81-3-3440-3569.
Asia/Pacific: Freescale Semiconductor H.K. Ltd., 2 Dai King Street, Tai Po Industrial Estate, Tai Po, N.T., Hong Kong. 852-26668334.
Learn More: For more information about Freescale Semiconductor products, please visit http www freescal
4. unsigned int read_744x_upmc2(void) { unsigned int val32; asm volatile("mfspr %0, 938" : "=r" (val32)); return val32; }
    unsigned int read_744x_upmc3(void) { unsigned int val32; asm volatile("mfspr %0, 941" : "=r" (val32)); return val32; }
    unsigned int read_744x_upmc4(void) { unsigned int val32; asm volatile("mfspr %0, 942" : "=r" (val32)); return val32; }
    unsigned int read_744x_upmc5(void) { unsigned int val32; asm volatile("mfspr %0, 929" : "=r" (val32)); return val32; }
    169 unsigned int read_744x_upmc6(void)
    170 {
    171     unsigned int val32;
    172     asm volatile("mfspr %0, 930" : "=r" (val32));
    173     return val32;
    174 }
Description of the lines follows:
1-5: comments.
6: defines _GNU_SOURCE, which exposes GNU extensions in the C library headers (getline, used later in this file, is one of them).
7 through 12: include header files from the standard include directory at /usr/include, not from the kernel sources.
13: defines the maximum number of PMC counters.
14: creates an array of unsigned ints to store the values for each counter selection.
16 through 21: prototypes for the functions that read the non-privileged counter registers UPMC1 through UPMC6.
22: the prototype for the function that prints out the contents of the UPMC registers obtained from the functions prototyped in lines 16 through
5. 10 _ 1 0b 0c 0d 0e 0f 10 11 12 13 14 15 16 17 18 19 1a Note that the first array vec_a contains the bytes loaded from char array al and vec_array 1 contains the bytes loaded from char array a2 3 4 An AltiVec Hello AltiVec from veclnt Program with Some AltiVec Int Constructs The only difference in this program is that we are loading from an integer array and we can demonstrate the offset capability of the vec_Id a b intrinsics The numbers allow a description of the constructs do not type in the numbers if you wish to try this program for yourself guestGdebian fae training 04 library maurie cat n vecInt c 1 include lt altivec h gt 2 include lt stdio h gt Software Analysis on Genesi Pegasos Using PMON and AltiVec Rev 0 1 Freescale Semiconductor PRELIMINARY SUBJECT TO CHANGE WITHOUT NOTICE 7 Introduction to Compiling on Linux with GCC 4 void print char vector vector unsigned char this one 5 void print int vector vector int this one 6 vector unsigned char vec array 256 7 vector int vec int 8 9 main 10 11 int i 12 unsigned int a3 16 attribute aligned 16 13 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 14 vec int 14 0 vector int a3 15 printf nHello AltiVec from vecInt n 16 printf vec int offset by 0 17 print int vector amp vec int 18 printf Mn 19 print char vector vector unsigned char amp vec int 20
6. In this case we are using PMON to collect our usual cycles and instructions, but also to collect event 64 (AltiVec load instructions completed) and event 41 (L1 instruction cache accesses). The performance monitor events are described in Table 11-9 of the MPC7450 RISC Microprocessor Family User's Manual.
    guest@debian:~/fae training 04/library/constant_gen$ ./test > j
    guest@debian:~/fae training 04/library/constant_gen$ cat -n j
     1 CPU 7457
     2 CPU 7457 has 6 PMCs
     3 Monitoring events are PMC 0 64
     4 Monitoring events are PMC 1 41
     5 Monitoring events are PMC 2 1
     6 Monitoring events are PMC 3 2
     7 Monitoring events are PMC 4 0
     8 Monitoring events are PMC 5 0
     9 V1 function timing 3 ins 9 20871 10057
    10
    11 V2 function timing 0 ins 5 20302 10054
    12
    guest@debian:~/fae training 04/library/constant_gen$
Line 9 (compiler declaration constants) takes 20871 cycles and line 11 (AltiVec vector splat) takes 20302 cycles: not a dramatic difference, but a little bit better. For large amounts of constant data the vector splat could save significant amounts of computation time. Also notice that the declaration used 3 AltiVec load instructions while the splat took zero, and the declaration used 9 L1 cache ac
7.  7 Monitoring events are PMC 2 1
     8 Monitoring events are PMC 3 2
     9 Monitoring events are PMC 4 0
    10 Monitoring events are PMC 5 0
    11 00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f
    12 124813 Instructions 110166 Cycles 0.882648 IPC
    13 00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f
    14 541317 Instructions 480033 Cycles 0.886787 IPC
    guest@debian:~/fae training 04/library/align$
3 and 4: tell us what processor we are using and the number of performance monitors that are available.
5 through 8: indicate we are monitoring events 1 and 2 twice, i.e. cycles and instructions. Note we are only using PMC1 and PMC2 in the START_TIME and STOP_TIME macros. We are ignoring the other 4 counters.
12: tells us that it took 124813 instructions and 110166 cycles to load vec_a for one (REPEAT = 1) time. By dividing the instructions by the cycles we get 0.882648 instructions per cycle. (Note that 0.882648 equals 110166/124813, so the two counts appear to be printed under swapped labels: the first value is the cycle count and the second the instruction count.)
14: tells us that we executed roughly 4 times (about 4.3x) the number of instructions and cycles to perform vectorLoadUnaligned compared with the previous code, which just loaded aligned data. Hence it is more efficient to use 16-byte-aligned data than to execute a function to align non-16-byte-aligned data for AltiVec operations.
As you can see, this is a rudimentary look at our code, but we can become much more sophisticated in our measurements, which we will see in Section 7, Advanced Examples.
4 Defining and Using an AltiVec Vector
AltiVec vectors are a 12
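The IPC figure in the listing is simply the ratio of two counter differences. The following is a minimal sketch of that arithmetic; the helper names `counter_delta` and `instructions_per_cycle` are introduced here for illustration and are not part of the note's pmon.c. The ratio printed on line 12 of the listing, 0.882648, corresponds to 110166 / 124813.

```c
/* Difference between two readings of a free-running 32-bit counter.
 * Unsigned subtraction stays correct across a single wraparound. */
unsigned int counter_delta(unsigned int start, unsigned int stop)
{
    return stop - start;
}

/* Instructions per cycle; returns 0.0 when no cycles were counted
 * so the caller never divides by zero. */
double instructions_per_cycle(unsigned int instructions, unsigned int cycles)
{
    if (cycles == 0)
        return 0.0;
    return (double)instructions / (double)cycles;
}
```

A caller would read the cycle and instruction counters before and after the region of interest, pass each pair through `counter_delta`, and then hand the two differences to `instructions_per_cycle`.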
8. Comparison of Scalar Versus Vector Computations graphically illustrates the value of vectorization.
7.2 Branching
The /home/guest/fae training 04/library/branches directory holds an example showing the increased efficiency of eliminating branches where possible. Again it uses PMON to find the cycles and instructions used for various methods which eliminate branches. Rather than a detailed look at the code, this section presents an overview of the code, which can then be consulted for the details.
The processor cannot proceed at full speed unless it knows where it is going. Branches cannot always be predicted correctly. The processor therefore guesses which branch will be taken and, if wrong, must backtrack to the condition and then go off in the right direction.
• There are some general guidelines on how the processor will guess:
  – Static branch prediction: a forward branch is predicted not taken, a backward branch is predicted taken. This means that in an if-then-else decision the most likely case belongs in the then section.
  – Dynamic branch prediction: after one or two invocations of the same branch instruction, enough history is accumulated to make good predictions next time around. However, branch prediction is vulnerable to aliasing.
• Try to avoid branches even if it means more computations i
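One common way to trade a branch for extra arithmetic is a mask-based select. The sketch below shows this for an element-wise maximum; it is an illustration under assumptions, not the note's own predicated routine: the name `max_branchless` is hypothetical, and the XOR/mask formulation is a standard technique swapped in here. It relies on two's-complement integers, where negating the comparison result 1 yields an all-ones mask.

```c
#include <stddef.h>

/* Element-wise maximum without a data-dependent branch in the loop body.
 * (a[i] < b[i]) is 0 or 1; negating it gives 0 or an all-ones mask. */
void max_branchless(const int *a, const int *b, int *c, size_t elements)
{
    for (size_t i = 0; i < elements; i++) {
        int mask = -(a[i] < b[i]);               /* all ones when b[i] is larger */
        c[i] = a[i] ^ ((a[i] ^ b[i]) & mask);    /* selects b[i] if mask set, else a[i] */
    }
}
```

Whether this beats the plain if/else depends on the compiler and on how predictable the data is; measuring both variants with the cycle and instruction counters is the way to find out.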
9. Therefore all the sums of these alternate products will be zero for this example. We are not interested in the result of the dot product, only in the speed at which it completes. The code also repeats the operations many times just to get enough cycles to make a comparison.
    guest@debian:~/fae training 04/library/dot_product$ make clean
    rm -rf *.o test
    guest@debian:~/fae training 04/library/dot_product$ make
    gcc -maltivec -mabi=altivec -O3 pmon.c dot_product.c -o test
    guest@debian:~/fae training 04/library/dot_product$ ./test > j
    guest@debian:~/fae training 04/library/dot_product$ cat -n j
     1 CPU 7457
     2 CPU 7457 has 6 PMCs
     3 Monitoring events are PMC 0 1
     4 Monitoring events are PMC 1 2
     5 Monitoring events are PMC 2 1
     6 Monitoring events are PMC 3 2
     7 Monitoring events are PMC 4 0
     8 Monitoring events are PMC 5 0
     9 Scalar function timing
    10 2130188 ins 2461537 2130218 2461558
    11 Output 0.000000
    12
    13 Parallel version
    14 420495 ins 618164 420513 618186
    15 Output 0.000000
    16
    17 Parallel version 2
    18 275311 ins 439212 275329 439234
    19 Output 0.000000
    guest@debian:~/fae training 04/library/dot_product$
Line explanations: 1 th
10. vec madd 1 1 2 2 1 2 temp3 temp4 vec madd 1 1 3 2 1 3 temp4 1 4 while i lt num_elements 4 while 1 if i gt num_elements 4 break temp vec madd vi i v2 i temp this time doing 4 vectors in parallel 132 133 134 135 136 137 138 139 140 141 142 143 144 temp2 vec madd 1 1 1 v2 i 1 temp2 to fill the pipeline temp3 vec madd v1 i 2 v2 i 2 temp3 temp4 vec madd 1 1 3 v2 i 3 temp4 1 4 Sum our temp vectors temp vec add temp temp2 temp3 vec add temp3 temp4 temp vec_add temp temp3 Add across the vector temp vec add temp vec_sld temp temp 4 temp vec_add temp vec_sld temp temp 8 Software Analysis on Genesi Pegasos Using PMON and AltiVec Rev 0 1 44 PRELIMINARY SUBJECT TO CHANGE WITHOUT NOTICE Freescale Semiconductor 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 More Advanced Examples Copy the result to the stack so we can return it via the IPU vec ste temp 0 amp result return result int main int 3 1 int 81 82 83 unsigned int start time stop time unsigned int start ins stop ins unsigned int Start pc3 stop pc3 start pc4
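The listing keeps four independent accumulators (temp through temp4) so that consecutive multiply-adds do not wait on each other, then sums them at the end. The same structure can be sketched in portable scalar C; this is an illustration only, with the hypothetical name `dot_product_4acc`, and it uses plain floats rather than AltiVec vectors.

```c
#include <stddef.h>

/* Dot product with four independent partial sums, mirroring the
 * temp..temp4 pattern: each accumulator forms its own dependency chain. */
float dot_product_4acc(const float *v1, const float *v2, size_t n)
{
    float t0 = 0.0f, t1 = 0.0f, t2 = 0.0f, t3 = 0.0f;
    size_t i = 0;

    /* Main loop: four elements per iteration. */
    for (; i + 4 <= n; i += 4) {
        t0 += v1[i]     * v2[i];
        t1 += v1[i + 1] * v2[i + 1];
        t2 += v1[i + 2] * v2[i + 2];
        t3 += v1[i + 3] * v2[i + 3];
    }

    /* Remainder loop for lengths that are not a multiple of four. */
    for (; i < n; i++)
        t0 += v1[i] * v2[i];

    /* Combine the partial sums pairwise, as the listing does with vec_add. */
    return (t0 + t1) + (t2 + t3);
}
```

Note that splitting the sum changes the order of floating-point additions, so results can differ in the last bits from a single-accumulator loop.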
11. 14 unsigned char a2 16 attribute aligned 16 15 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 16 vec vec 1d 0 a1 17 vec array 1 vec 1d 0 a2 18 printf nHello AltiVec from vecChar n 19 printf vec a 20 print char vector amp vec a 21 printf n Software Analysis on Genesi Pegasos Using PMON and AltiVec Rev 0 1 4 PRELIMINARY SUBJECT TO CHANGE WITHOUT NOTICE Freescale Semiconductor Introduction to Compiling on Linux with GCC 22 printf vec array 1 23 print char vector amp vec_array 1 j 24 printf n 25 26 27 void print char vector vector unsigned char this one 28 29 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02x 30 unsigned char this one 0 31 unsigned char this one 1 32 unsigned char this one 2 33 unsigned char this one 3 34 unsigned char this one 4 35 unsigned char this one 5 36 unsigned char this one 6 37 unsigned char this one 7 38 unsigned char this one 8 39 unsigned char this one 9 40 unsigned char this one 10 41 unsigned char this one 11 42 unsigned char this one 12 43 unsigned char this one 13 44 unsigned char this one 14 45 unsigned char this 151 46 47 void print int vector vector int this one 48 49 08 08 08 08 50 int this on
12. 21. It is not used in this program.
24 through 123: the function that initializes the performance registers.
26 through 31: declarations.
32 through 68: declare a FILE type which will be used to read the /proc/cpuinfo pseudo file, which specifies the CPU information on the running system. Try the shell command cat /proc/cpuinfo. It is this information that is being read here, checking for the existence of an MPC744x, MPC745x, or MPC741x processor, which are the only processors that have performance monitor registers.
70 through 83: commented out, hence we do not give the caller the opportunity to choose which events to count; we just use the events passed to this function in four arguments, p1 through p4.
84 through 89: set the sel array to the arguments in preparation to monitor these counters.
90 and 91: print out the events that are going to be monitored, which corresponds to lines 5 through 10 in the output listings below.
93 through 99: open the char device we have defined for this PMON facility, /dev/pmon, using the standard I/O call open, which will invoke the pmon26.ko module function pmon_open. We check to see if it is available and, if not, print the error message we see in line 11 in Section 5.2.3, Results When /dev/pmon is Not Available.
101: calls the pmon26.ko module function pmon_write to write the selection bits to the privileged performance monitor selector registers MMCR0 and MMCR1 and zero the counte
13. 6 upmc end 6 int i fd byteCount len n read unsigned int cycles char textLine NULL char item 32 delim 32 name 32 int total pmc 0 FILE p cpuinfo id CPU to decide how many PMCs we have on this machine p cpuinfo fopen proc cpuinfo r if p cpuinfo NULL printf ERR unable to open 1 Mn return 0 Software Analysis on Genesi Pegasos Using PMON and AltiVec Rev 0 1 26 PRELIMINARY SUBJECT TO CHANGE WITHOUT NOTICE Freescale Semiconductor 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 Using the Performance Monitors for Performance Gathering while n read getline amp textLine amp len p cpuinfo 1 sscanf textLine s s s item delim name ifdef DBG PMON printf INFO getline s n textLine printf INFO item s delim s name s n item delim name if memcmp item cpu 4 printf s n name if memcmp name 744 3 total 6 else if memcmp name 745 3 total_pmc 6 else if memcmp name 741 3 total_pmc 4 else printf ERR unsupported CPU s n name return 0 printf CPU 55 has 54 PMCs n name total break if textLine free textLine fclose p cpuinfo FIXME let usr choose which even
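The CPU check in this excerpt splits each /proc/cpuinfo line into three whitespace-separated fields and matches the start of the CPU name against 744, 745, or 741. The following is a minimal sketch of that decision logic on a single line of text; the function names `pmc_count_for_cpu` and `pmc_count_from_cpuinfo_line` are hypothetical, and the sample line format in the usage note is an assumption about what the pseudo file contains.

```c
#include <stdio.h>
#include <string.h>

/* Map a CPU name to the number of performance monitor counters,
 * following the 744x/745x -> 6 and 741x -> 4 rule in the excerpt.
 * Returns 0 for an unsupported CPU. */
int pmc_count_for_cpu(const char *name)
{
    if (strncmp(name, "744", 3) == 0 || strncmp(name, "745", 3) == 0)
        return 6;
    if (strncmp(name, "741", 3) == 0)
        return 4;
    return 0;
}

/* Parse one line of cpuinfo-style text ("item delim name ...").
 * Returns the PMC count, or 0 if this is not a recognised "cpu" line. */
int pmc_count_from_cpuinfo_line(const char *line)
{
    char item[32], delim[32], name[32];

    if (sscanf(line, "%31s %31s %31s", item, delim, name) != 3)
        return 0;
    if (strcmp(item, "cpu") != 0)
        return 0;
    return pmc_count_for_cpu(name);
}
```

For example, a line such as `cpu : 7457, altivec supported` would yield 6, while a line for any other field yields 0. The width limits in the `sscanf` format keep each field inside its 32-byte buffer.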
14. LIL 11 2 for i 0 i lt MAX SIZE i 113 aa array i i 114 ab 1 i 115 116 117 printf nAlignment Test n Software Analysis on Genesi Pegasos Using PMON and AltiVec Rev 0 1 14 PRELIMINARY SUBJECT TO CHANGE WITHOUT NOTICE Freescale Semiconductor 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 Introduction to Compiling on Linux with GCC start pmon 1 2 1 2 START TIMER for i 0 1 lt REPEAT i vec 14 0 vector unsigned char aa_array STOP TIMER print char vector amp vec a printf d tInstructions t d Cycles t f IPC n stop time start time stop ins start ins double stop ins start ins double stop time start time START TIMER for 1 0 1 lt i vec a vectorLoadUnaligned vector unsigned char aa array STOP TIMER print char vector amp vec a printf d tInstructions t d Cycles t IPC n stop time start time stop ins start ins double stop ins start ins double stop time start time return 0 guestGdebian fae training 04 library align This example serves double duty it demonstrates the necessities of alignment and is a simple example of using performance monitoring which is in Section 3 6 AltiVec Program Demonstrating the Us
15. This application note concentrates on Debian Linux; however, it translates directly to the Yellow Dog Linux system and any other native PowerPC Linux distribution. The GCC compiler executable resides in the /usr/bin directory. The include files all reside in /usr/include. All of the related tool chain libraries reside in /usr/lib/gcc-lib. GCC version 3.3.3 is used in this paper. To compile any C application program, the simple gcc command can be used with all the default parameters. All the examples discussed in this paper, except the hello world programs, are in the /home/guest/fae training 04/library directory.
3.1 The Objective and Tools for Achieving Software Development
The main objective is to familiarize ourselves with this new software development platform.
3.1.1 Software Development Tools
The Pegasos software development system consists of the following items:
• MPC7447, Discovery II
• Running Debian Linux
• Standard GNU tool set
  – GCC V3.3 for PPC (supports AltiVec)
  – GNU utilities (gdb, objdump, etc.)
• Customized tool set for PPC monitoring (PMON, SimG4)
• Text editor of your choice (vi, vim, emacs, gnome text editor (gedit))
3.1.2 PMON Performance Monitor
PMON is a kernel modu
16. c 50 clean rm rf o pmon test Software Analysis on Genesi Pegasos Using PMON and AltiVec Rev 0 1 Freescale Semiconductor PRELIMINARY SUBJECT TO CHANGE WITHOUT NOTICE 51 More Advanced Examples guestGdebian fae training 04 library branches The relevant code is shown here inline void Max int a int b int c int elements int i for i 0 i lt elements i if a i gt b i ali else inline void Max_p int a int b int c int elements int i for i 0 i lt elements i ac i aa i ab i aa i ab i inline void Max_vec int a int b int c int elements vector int va vb vc vector int pva vector int a vector int pvb vector int b vector int pvc vector int c vector bool int mask int i for i 0 i lt elements 4 i va vec Id 0 pva vb vec Id 0 pvb mask vec cmplt va vb vec sel vb mask vec_st vc 0 pvc The code will run two different sets of data One in a predictable fashion the other in a random fashion This arrangement should expose branch predictor behavior define ELEMENTS 4 1024 int aa ELEMENTS attribute aligned 16 Input data set int ab ELEMENTS attribute aligned 16 Input data set int ac ELEMENTS attribute aligned 16 Output data set for i 0 ic ELEMENTS i if RANDOM rand ab i rand else aa i i ab i ELEMENTS
17. code to off-load the results, but that is beyond the scope of this application note. The full list of monitorable statistics is given in the MPC7450 RISC Microprocessor Family User's Manual. Below is a short list.
Table 11-9. PMC1 Events: MMCR0[PMC1SEL] Select Encodings
    0 (0b000_0000): Register counter holds current value.
    1 (0b000_0001): Counts every processor cycle.
    2 (0b000_0010): Instructions completed. Counts all completed PowerPC and AltiVec instructions. Load/store multiple instructions (lmw, stmw) and load/store string instructions (lswi, lswx, stswi, stswx) are only counted once. Does not include folded branches. The counter can increment by 0, 1, 2, or 3, depending on the number of completed instructions per cycle. Branch folding must be disabled (HID0[FOLD] = 0) in order to count all the instructions.
    3 (0b000_0011): TBL bit transitions. Counts transitions from 0 to 1 of TBL bits specified through MMCR0[TBSEL]: 00 uses the TBL[31] bit to count, 01 uses the TBL[23] bit, 10 uses the TBL[19] bit, 11 uses the TBL[15] bit.
    4 (0b000_0100): Instructions dispatched. Counts dispatched instructions. The counter can increment by 0, 1, 2, or 3, depending on the number of dispatched ins
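The TBSEL encoding listed for event 3 is a two-bit selector over four fixed time-base bits. A small lookup makes the mapping explicit; this is only a sketch of the table above, and the function name `tbsel_to_tbl_bit` is hypothetical. It does not touch any hardware register.

```c
/* Map the two-bit MMCR0[TBSEL] value to the TBL bit number whose
 * 0-to-1 transitions are counted, as listed for event 3 above. */
int tbsel_to_tbl_bit(unsigned int tbsel)
{
    static const int tbl_bit[4] = { 31, 23, 19, 15 };

    return tbl_bit[tbsel & 0x3u];   /* only the low two bits are meaningful */
}
```

With this helper, `tbsel_to_tbl_bit(2)` reports that the selector value 10 counts transitions of TBL[19].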
18. i endif ac i 0 We are using PMON to collect not only the cycles and instructions used but also event 15 the number of cycles an AltiVec instruction in the VFPU reservation station is waiting for an operand and event 26 the true branch target instruction hits for taken branches The call to start_pmon determines which counters we are going to use The performance monitor events are described in table 11 9 of MPC7450 RISC Microprocessor Family User s Manual start 1 2 26 15 1 2 26 15 Software Analysis on Genesi Pegasos Using PMON and AltiVec Rev 0 1 52 PRELIMINARY SUBJECT TO CHANGE WITHOUT NOTICE Freescale Semiconductor More Advanced Examples The results of running this code dramatically shows the case for using the vector method which substitutes masking and vector operations for decision branching guestGdebian fae training 04 library branches test gt j guestGdebian fae training 04 library branches cat n j l CPU 7457 2 CPU 7457 has 6 PMCs 3 Monitoring events are PMC 0 1 4 Monitoring events are PMC 1 2 5 Monitoring events are PMC 2 26 6 Monitoring events are PMC 3 15 7 Monitoring events are PMC 4 0 8 Monitoring events are PMC 5 0 9 Scalar function timing 178432 ins 106480 8219 46771 10 Scalar predicate timing 379000 ins 147392 4162 267956 11 Vector timing 22210 ins 9364 0 17543 Line description 1 through 9 is our all famili
19. in Section 4, Defining and Using an AltiVec Vector. Suffice it to say that the 16 bytes starting at the address of the a1 character array (1, 2, 3, etc.) will be written to the vector (either a true AltiVec register or a memory location representing that AltiVec register), and the result is that the location of vec_a will now contain the characters 1, 2, 3, etc. The same logic applies to line 17, except that the 2nd element (i.e. address of vec_array + 16) will contain the characters from the char array a2 (11, 12, 13, etc.).
20: calls the function print_char_vector, which I will discuss at line 27.
27 through 46: a function to print the contents of the vector in memory, one byte at a time, for a total of 16 bytes. Since a vector is a 16-byte quantity, we can treat each byte independently, similar to a char array of 16 bytes. Since we are giving the address of the first byte of the vector to this function, we access each additional byte by an array increment, which is equivalent to adding 1 to the previous address.
47 through 54: a similar function to print the contents of the vector in memory, one int (i.e. 4 bytes) at a time, for a total of 4 ints (16 bytes). This function will be used in the next example, explained in Section 3.4, AltiVec "Hello AltiVec from vecInt" Program with Some AltiVec Int Constructs.
We can compile and execute this example in several ways. I will describe three ways here:
1. Explicitl
20. once.
135: stores the 16 bytes starting at ab_array into the vector vec_a, but regardless of the alignment of this array, the function vectorLoadUnaligned will align the data properly.
126 and 138: print the vectors.
Compile and execute this example using this Makefile:
    guest@debian:~/fae training 04/library/align$ cat Makefile
    test: align.c pmon.c
        gcc -maltivec -mabi=altivec pmon.c align.c -o $@
    clean:
        rm -rf *.o pmon test
Run make clean, make, and then test:
    guest@debian:~/fae training 04/library/align$ make clean
    rm -rf *.o pmon test
    guest@debian:~/fae training 04/library/align$ make
    gcc -maltivec -mabi=altivec pmon.c align.c -o test
    guest@debian:~/fae training 04/library/align$ ./test
Ignore the lines in italics; they will be described later.
    Alignment Test
    CPU 7457
    CPU 7457 has 6 PMCs
    Monitoring events are PMC 0 1
    Monitoring events are PMC 1 2
    Monitoring events are PMC 2 1
    Monitoring events are PMC 3 2
    Monitoring events are PMC 4 0
    Monitoring events are PMC 5 0
    00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f
    125266 Instructions 110163 Cycles 0.879433 IPC
    00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f
    549870 Instructions 480786 Cycles 0.874363 IPC
The ignored lines are PMON output and wil
21. temp4 vec madd 1 1 3 2 1 3 temp4 92 93 94 for i 0 lt num elements 16 i Loop over the length of the vectors 95 temp vec madd vlp v2p temp this time doing 4 vectors in parallel 96 temp2 vec madd 1 v2p temp2 to fill the pipeline 97 temp3 vec madd 1 v2p temp3 98 temp4 vec madd 1 v2p temp4 99 100 101 102 for i 0 i lt num elements 16 i Loop over the length of the vectors 103 tl vec ld 0 v1p 104 t2 vec_ld 0 v2p 105 t3 vec 14 1 1 106 t4 vec_ld 1 v2p 107 temp vec madd ti t2 temp this time doing 4 vectors in parallel 108 t5 vec ld 2 v1p 109 t6 vec l1d 2 v2p 110 temp2 vec madd t3 t4 temp2 to fill the pipeline 111 t7 vec ld 3 v1p 112 t8 vec ld 3 v2p 113 temp3 vec madd t5 t6 temp3 Software Analysis on Genesi Pegasos Using PMON and AltiVec Rev 0 1 Freescale Semiconductor PRELIMINARY SUBJECT TO CHANGE WITHOUT NOTICE 43 More Advanced Examples 120 121 1 2 temp4 vec madd t7 t8 temp4 do temp vec madd vi i v2 i temp this time doing 4 vectors in parallel 122 123 124 125 126 127 128 129 130 131 temp2 vec madd 1 1 1 v2 i 1 temp2 to fill the pipeline temp3
22. To determine the PMON ID, see Section 5.2.4, All These Conditions Must Be Met for the PMON Facility to Work.
Instantiate the PMON facility by navigating to the /root/ppctools/pmon directory and performing this command:
    insmod pmon26.ko
To stop this facility, use this command:
    rmmod pmon26.ko
Remember to change back to a regular user after starting PMON.
To reiterate, in order to use the PMON facility, these steps must be performed:
1. The module pmon26.ko must be built.
2. The module pmon26.ko must be installed (insmod pmon26.ko).
3. /dev/pmon must be created (mknod /dev/pmon c <node number> 0).
4. The permissions must be 777 (chmod 777 /dev/pmon).
The node number can be determined from the /proc/devices file. After the insmod pmon26.ko, look at the /proc/devices file, find the entry for PMON, and the node number will be displayed. Then enter the mknod command. It may be necessary to remove an existing /dev/pmon entry if it does not correspond to the id number listed in /proc/devices.
Further, for this example (align) to work, these conditions must also be met:
1. align.c and pmon.c must be built.
2. The resultant executable must be run.
See Freescale application note PMON Module: An Example of Writing Kernel Module Code for Debian 2.6 on Genesi Pegasos (AN2744) for more information
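Finding the node (major) number by eye in /proc/devices is easy to get wrong, so the lookup can be sketched in code. The function below scans text in the /proc/devices layout and returns the number listed for a named character device. It is an illustration under assumptions: the function name `find_char_major` is hypothetical, and the entry name under which the PMON module registers (shown as `pmon` in the test data) is assumed rather than taken from the module source.

```c
#include <stdio.h>
#include <string.h>

/* Search the "Character devices:" part of /proc/devices-style text for
 * a device name and return its major number, or -1 if it is not listed.
 * Scanning stops at the "Block devices:" header. */
int find_char_major(const char *devices_text, const char *name)
{
    const char *line = devices_text;

    while (line != NULL && *line != '\0') {
        int major;
        char dev[64];

        if (strncmp(line, "Block devices:", 14) == 0)
            break;
        /* Each entry looks like "<major> <name>"; header lines do not parse. */
        if (sscanf(line, "%d %63s", &major, dev) == 2 && strcmp(dev, name) == 0)
            return major;

        line = strchr(line, '\n');   /* advance to the next line */
        if (line != NULL)
            line++;
    }
    return -1;
}
```

The returned value is what goes in the node number position of the mknod command in step 3; reading the real file into a buffer and calling this function on it is left to the caller.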
23. 0xd8, 0x38, 0xb8, 0x78, 0xf8,
    0x04, 0x84, 0x44, 0xc4, 0x24, 0xa4, 0x64, 0xe4, 0x14, 0x94, 0x54, 0xd4, 0x34, 0xb4, 0x74, 0xf4,
    0x0c, 0x8c, 0x4c, 0xcc, 0x2c, 0xac, 0x6c, 0xec, 0x1c, 0x9c, 0x5c, 0xdc, 0x3c, 0xbc, 0x7c, 0xfc,
    0x02, 0x82, 0x42, 0xc2, 0x22, 0xa2, 0x62, 0xe2, 0x12, 0x92, 0x52, 0xd2, 0x32, 0xb2, 0x72, 0xf2,
    0x0a, 0x8a, 0x4a, 0xca, 0x2a, 0xaa, 0x6a, 0xea, 0x1a, 0x9a, 0x5a, 0xda, 0x3a, 0xba, 0x7a, 0xfa,
    0x06, 0x86, 0x46, 0xc6, 0x26, 0xa6, 0x66, 0xe6, 0x16, 0x96, 0x56, 0xd6, 0x36, 0xb6, 0x76, 0xf6,
    0x0e, 0x8e, 0x4e, 0xce, 0x2e, 0xae, 0x6e, 0xee, 0x1e, 0x9e, 0x5e, 0xde, 0x3e, 0xbe, 0x7e, 0xfe,
    0x01, 0x81, 0x41, 0xc1, 0x21, 0xa1, 0x61, 0xe1, 0x11, 0x91, 0x51, 0xd1, 0x31, 0xb1, 0x71, 0xf1,
    0x09, 0x89, 0x49, 0xc9, 0x29, 0xa9, 0x69, 0xe9, 0x19, 0x99, 0x59, 0xd9, 0x39, 0xb9, 0x79, 0xf9,
    0x05, 0x85, 0x45, 0xc5, 0x25, 0xa5, 0x65, 0xe5, 0x15, 0x95, 0x55, 0xd5, 0x35, 0xb5, 0x75, 0xf5,
    0x0d, 0x8d, 0x4d, 0xcd, 0x2d, 0xad, 0x6d, 0xed, 0x1d, 0x9d, 0x5d, 0xdd, 0x3d, 0xbd, 0x7d, 0xfd,
    0x03, 0x83, 0x43, 0xc3, 0x23, 0xa3, 0x63, 0xe3, 0x13, 0x93, 0x53, 0xd3, 0x33, 0xb3, 0x73, 0xf3,
    0x0b, 0x8b, 0x4b, 0xcb, 0x2b, 0xab, 0x6b, 0xeb, 0x1b, 0x9b, 0x5b, 0xdb, 0x3b, 0xbb, 0x7b, 0xfb,
    0x07, 0x87, 0x47, 0xc7, 0x27, 0xa7, 0x67, 0xe7, 0x17, 0x97, 0x57, 0xd7, 0x37, 0xb7, 0x77, 0xf7,
    0x0f, 0x8f, 0x4f, 0xcf, 0x2f, 0xaf, 0x6f, 0xef, 0x1f, 0x9f, 0x5f, 0xdf, 0x3f, 0xbf, 0x7f, 0xff };
This method yields 0.19 Bytes/Cycle, or 2x faster than the original.
Another method is using a Small Lookup Table:
• based on splitting each byte into two nibbles
• looking up values for both of them independently and merging the result later
    unsigned char small_lookup_l[16] __attribute__((aligned(16)))
24. 64 0 657677 2 2861328 3 933137
7.3 Bit Reversal: Eliminating Computations
This example, in /home/guest/fae training 04/library/bitreverse, calculates the reverse of a bit pattern by computing it, by looking it up in a large table, or by looking it up in a small table. PMON is used to collect just the cycles and instructions to determine which is more efficient, i.e. uses fewer cycles. The technique of data slicing can be used in AltiVec to eliminate computation.
The first function is a Byte-wise Bit Reversal Algorithm:
    unsigned char reverse(unsigned char in)
    {
        unsigned char out = ((in & 0x01) << 7) | ((in & 0x02) << 5) | ((in & 0x04) << 3) | ((in & 0x08) << 1) |
                            ((in & 0x10) >> 1) | ((in & 0x20) >> 3) | ((in & 0x40) >> 5) | ((in & 0x80) >> 7);
        return out;
    }
This straightforward method yields 0.10 Bytes/Cycle.
An alternative implementation is a Big Lookup Table:
• 256-entry byte table holding the reversed values
So the computation for each byte is converted into a single load:
    reversed[j] = big_lookup[j];
    unsigned char big_lookup[256] = {
    0x00, 0x80, 0x40, 0xc0, 0x20, 0xa0, 0x60, 0xe0, 0x10, 0x90, 0x50, 0xd0, 0x30, 0xb0, 0x70, 0xf0,
    0x08, 0x88, 0x48, 0xc8, 0x28, 0xa8, 0x68, 0xe8, 0x18, 0x98, 0x58
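Rather than typing a 256-entry table by hand, the big lookup table can be generated from the shift-and-mask routine and then checked against it. The following is a minimal sketch under that assumption; the names `reverse_bits` and `build_big_lookup` are introduced here for illustration and the note itself uses a statically initialized table.

```c
#include <stdint.h>

/* Reverse the bit order of one byte using shifts and masks:
 * bit 0 moves to bit 7, bit 1 to bit 6, and so on. */
uint8_t reverse_bits(uint8_t in)
{
    return (uint8_t)(((in & 0x01) << 7) | ((in & 0x02) << 5) |
                     ((in & 0x04) << 3) | ((in & 0x08) << 1) |
                     ((in & 0x10) >> 1) | ((in & 0x20) >> 3) |
                     ((in & 0x40) >> 5) | ((in & 0x80) >> 7));
}

/* Fill a 256-entry table so that table[j] holds the bit-reversed j.
 * After this one-time cost, each reversal is a single indexed load. */
void build_big_lookup(uint8_t table[256])
{
    for (int j = 0; j < 256; j++)
        table[j] = reverse_bits((uint8_t)j);
}
```

Generating the table at start-up trades a small one-time cost for the guarantee that the table and the reference computation cannot disagree.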
25. 8 bit quantity aligned on a 16-byte boundary. How do we manipulate it? We load it with vec_ld(a, b), which loads 16 bytes into the vector regardless of how we define it (char, short, int, float, double). So the receiver is a vector and the sender is an offset (offset by 16) and a memory address of 16 bytes. Thus, if we define an array of 16 bytes and a char vector as:
    char aa_array[MAX_SIZE] __attribute__((aligned(16)));
    vector unsigned char vec_a;
Remember to align to 16 bytes. And then we fill it with numbers from 0 to 15:
    for (i = 0; i < MAX_SIZE; i++)
        aa_array[i] = i;
Now, starting at the address of aa_array, i.e. &aa_array, we have the 16 bytes set to the numbers from 0 to 15. E.g., assume that aa_array starts at address 10011b10 <aa_array>; then the following values are stored in memory:
    10011b10: 0
    10011b11: 1
    10011b12: 2
    ...
    10011b1f: 15
Now when we use the intrinsic
    vec_a = vec_ld(0, (vector unsigned char *) aa_array);
we are loading the values, one byte at a time, from 10011b10 through 10011b1f into the vector register (or memory location); thus the vector vec_a now contains the char values from 0 to 15 in each of the bytes of the vector. For int arrays we have the same scenario; however, each int is 4 bytes. So for the example be
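The reason for the alignment reminder is that the underlying vector load ignores the low four bits of the address it is given, so an unaligned pointer silently loads from the enclosing 16-byte block. That address arithmetic can be sketched in plain C; the helper names `vec_ld_effective_address` and `is_16_byte_aligned` are hypothetical and only model the address calculation, they do not perform a vector load.

```c
#include <stdint.h>

/* Model of the address used by an aligned 16-byte vector load:
 * offset plus base, with the low four bits forced to zero. */
uintptr_t vec_ld_effective_address(long offset, uintptr_t base)
{
    return (base + (uintptr_t)offset) & ~(uintptr_t)0xF;
}

/* True when the address sits exactly on a 16-byte boundary. */
int is_16_byte_aligned(uintptr_t addr)
{
    return (addr & 0xF) == 0;
}
```

Using the address from the text, an array at 10011b10 is loaded as expected, whereas a pointer to 10011b13 would still fetch the 16 bytes starting at 10011b10. An offset of 16 selects the next 16-byte block, which is how the offset argument steps through an array of vectors.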
26. 88 IPC
13 00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f
14 540918 Instructions 480033 Cycles 0.887441 IPC
guest@debian:~/fae-training-04/library/align$

A description of the output lines follows. Previous line 11 is not printed, so /dev/pmon was found correctly. Lines 12 and 14 have values.
2: printed by line 117 in align.c
3 and 4: printed by lines 50 and 62 in pmon.c
5 through 10: printed by line 91 in pmon.c
11: printed by line 126 in align.c
12: printed by line 127 in align.c
13: printed by line 138 in align.c
14: printed by line 139 in align.c

6 Using the PMON Facility

Since some of the performance registers are privileged registers, only the Linux root user can change them. A normal user must therefore call a kernel support function to set these registers. Linux does not supply such a facility; however, the Pegasos II system includes one, written by the Freescale CPD applications team. This facility, called PMON, is supplied as a kernel module in the root directory at /root/ppctools/pmon. Since PMON is not a normally supplied module, the user is required to start and stop it. In addition, PMON uses the char device /dev/pmon for its operation, so the user must create this device. Create the /dev/pmon device with the command:

    mknod /dev/pmon c <pmon id> 0
27. A normal user cannot do any of the above; thus the user is required to change to the root user.

[Figure 1. User Interaction with Kernel Module PMON (user space: shell, filesystem; system calls; kernel space: kernel log)]

Figure 1 shows that the user program, which runs as a normal user, interfaces to the kernel via a call to the PMON module, which can in turn perform root activities for the user program.

[Figure 2. Instantiating the PMON Module]

A user cannot instantiate a kernel module. Hence the module must be instantiated at boot time or by a root user with the insmod command. The insmod (insert module) command installs the module in the kernel and then calls the init_module of PMON, which initializes itself and waits for user calls to the PMON module. The rmmod command de-instantiates it: it calls cleanup_module to clean up any memory or other resources the module is using and then removes it from the kernel module list. Figure 2 shows the interaction of the user with the kernel; however, the user in this case must be the root user. The command insmod calls the kernel function sys_init_module, which adds the module to the kernel's list and invokes the initialization function of the module. Modules are version
28. Freescale Semiconductor Application Note AN2743, Rev. 0.1, 08/2004

Software Analysis on Genesi Pegasos II Using PMON and AltiVec

by Maurie Ommerman and Sergei Larin, CPD Applications, Freescale Semiconductor, Inc., Austin, TX

This application note is the sixth in a series describing the Genesi Pegasos II system, which contains a PowerPC microprocessor, and its various applications. This document describes software analysis using the PMON facility, which uses the PowerPC processor performance measurement registers. It also describes the general compiler tool set, GCC, and some AltiVec constructs.

1 Introduction

This application note describes some features of the AltiVec constructs and the PMON kernel interface, and how to use one of the PowerPC performance monitor measuring facilities. The PMON facility is an application written by the Freescale application team and is described in this application note. Even though this document is part of the series on Genesi Pegasos II systems, the PMON facility is available on any Linux system running on a PowerPC processor. The PMON facility is preloaded on the Genesi Pegasos II system, but may be downloaded by request for any PowerPC Linux platform. This paper assumes that the user will log in as guest with password guest, and all the examples discussed in this paper, with the exception of the hello world programs, are in the /home/guest/fae-training-04/library directory.
29. ar introduction to these programs, describing the CPU type and the events to count, in this case 1, 2, 26, and 15: cycles, instructions, branch target hits, and cycles for VFPU instructions. The performance monitor events are described in Table 11-9 of the MPC7450 RISC Microprocessor Family User's Manual.
9 shows the counters for a scalar branch.
10 shows the counters for map_p.
11 shows the counters for a vector solution.
Clearly the vector solution is superior. The next diagram graphically shows this conclusion.

[Figure: bar chart of Sorted Cycles and Random Cycles (vertical axis: cycles, 0 to 60000) for Max, Max_p, and Max_vec]

Sorted Values: Max, Max_p, Max_vec
12508 9364 0 748641 2 2861328 2652223 9385 8044 21 0002238
pmc1 4, 1 15, pmc2 26. Columns: Cycles, Retired Instr, IPC, Ins/Iter, Speedup, Dispatched, Dispatch 0, Junk, Junk, Flush
33174 30474 0 918611 7 4399414 1 30501 11423 27 0 000885
25486 31773 1 246684 7 7570801 1 301656 31799 7096 26 0 00081

Random Values: 1_4, 1_15, 2_26, Max_vec. Columns: Cycles, Retired Instr, IPC, Ins/Iter, Speedup, Dispatched, Dispatch 0, Junk, Junk
30920 0 552143 7 5488281 1 32410 0 751554 7 9125977 1 298581 93
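The IPC column in these tables is the retired instruction count divided by the cycle count. A minimal sketch of that arithmetic follows; the function name `ipc` is hypothetical and is not part of pmon.c. For the row with 33174 cycles and 30474 retired instructions, the quotient is approximately 0.9186, which agrees with the tabulated 0.918611.

```c
/* Instructions per cycle from two counter deltas.
 * Returns 0.0 when no cycles elapsed, to avoid dividing by zero. */
double ipc(unsigned int cycles, unsigned int instructions)
{
    if (cycles == 0)
        return 0.0;
    return (double)instructions / (double)cycles;
}
```

An IPC above 1.0, as in the row with 25486 cycles and 31773 retired instructions, simply means that more than one instruction retired per cycle on average.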
30. cause the macro TRACE is undefined.
175 initializes PMON to count cycles and instructions.
177 through 195 perform the scalar dot product algorithm many times and count the cycles and instructions.
197 through 215 perform the 1 madd per 4 cycles vectorization algorithm many times and count the cycles and instructions.
217 through 235 perform the 4 madd per 4 cycles vectorization algorithm many times and count the cycles and instructions.

7.1.3 Results and Explanation

The following commands clean our directory, make the elf executable test, and execute it with the ./test command, since the local directory is not part of the PATH variable. Finally, the result is stored in the file j so that we can use the cat -n command to get line numbers. The line numbers are not part of the output and are used for this explanation. Of course, the code is supplied in /home/guest/fae-training-04/library/dot_product/dot_product.c.

The two arrays, aa and ab, which are vectors (a special case of a matrix, which has one row), are filled with values that will result in a value of zero when the dot product is applied:

aa 0 0 aa 1 20 ab 0 0 ab 1 0 aa 2 2 aa 3 2 ab 2 2 ab 3 2 etc.

The sum of aa i ab i 1 1 ab i 1 0
31. cesses while splat used 5

7.5 Step Back and Take a 10,000 Foot View

There is a logical sequence to be observed in the implementation of these methods. One can look at the optimization process as moving the bottleneck around the processor:
• if computations take longer than anything else, speed them up
• if the system bus is underutilized, use prefetching
• if the bus is 100% full and computations are at the minimum, reduce the code and data size
• but the truly superior goal is to reach computational entropy: get rid of all the unnecessary computations through algorithm modifications, balance added memory bandwidth with real data I/O, and use the predictability of the data streams to the full extent

Concentrate your effort: in large applications, work with the 10% of the code which accounts for 90% of the execution time.

8 Conclusion

This application note has presented the information needed to use some AltiVec constructs, understand alignment, and use the PMON monitoring facility of the Performance Monitor Registers on the MPC74xx processors. Two detailed examples were presented, align.c and dot_product.c, which show how PMON can be used to determine which code is faster. Several other examples were overviewed. AltiVec technology transparently adds SIMD functionality to a high spee
32. cutable whose name is test because that is the name of the make target it line 1 Line 3 is the clean target which invokes line 4 to remove all the objects and the elf file test guestedebian fae training 04 library dot product cat n Makefile 1 test dot product c pmon c 2 gcc maltivec mabi altivec O3 pmon c dot product c 50 Software Analysis on Genesi Pegasos Using PMON and AltiVec Rev 0 1 Freescale Semiconductor PRELIMINARY SUBJECT TO CHANGE WITHOUT NOTICE 39 More Advanced Examples 3 clean 4 rm rf o test guestGdebian fae training 04 library dot product 7 1 2 Code Listing and Explanation The listing is the same as in the example in the Genesi Pegasos II directory and file home guest fae training 04 library dot_product dot_product c with the exception that the printf statements at lines 177 195 197 215 217 and 235 have been changed to make the printing easier to discuss and lines 184 204 and 224 have been corrected to avoid the warning errors in the original guestGdebian fae training 04 library dot product cat n dot product c I 2 Sergei Larin 3 Bitreversal example 4 47 6 include altivec h 7 include lt stdio h gt 8 9 10 define START TIMER 11 start_time read_744x_upmcl1 12 start_ins read 744x upmc2 13 14 define STOP TIMER 15 asm volatile eieio 16 stop time read 744x upmcl 17 stop ins read 744x upmc2 19
33. d RISC engine. AltiVec enables a broad range of embedded and computing applications. C-level programming offers a certain level of comfort while providing a powerful way to extract parallelism from applications. You must think in terms of vector processing throughout the design cycle of an application. AltiVec is not pixie dust to be sprinkled on existing code; it takes foresight and design. With these techniques, a 4x to 30x or more speedup is possible. AltiVec coding can speed up many common applications. CPD Applications has some AltiVec library applications that are available.

The items from the following categories are available at http://www.freescale.com/altivec:
• Telecomm: FFT, IFFT, FIR, Autocorrelation, Convolution Encoder, Viterbi Decoder, GSM
• MultiMedia: DCT, IDCT, JPEG 2000, Quantization, Dequantization, SAD
• Networking: QOS, NAT, Route Lookup, IP Reassembly, TCP/IP, Encryption, SHA
• LibC (means could be linked at compilation): link-level support for standard C functions (memcpy, strcmp, etc.)
• Mathematical primitives: extension of LibC Math.h (Log, Exp, Sin, Cos, Sqrt)
• OS enablement: Linux TCP/IP

The items from these categories are available upon request:
• Telecomm: Convolution Encoder, Viterbi Decoder, 3G, Error Correction Codes, CRC 8/12/16/24
• MultiMedia: 2 MP3 JPEG
• Printer: GhostScript Library elements, Color Conversion, FS Dithering
• Networking: Encryp
34. dependent, since once instantiated they are part of the kernel and must have access to all the kernel symbols. The module internals are invoked by normal users making function calls to the device that is owned by the module. Finally, only the root user can remove a module, using the rmmod command, which calls the module's cleanup code and removes the module from the kernel list.

[Figure 3. User Interaction to the Module]

Figure 3 shows that, once PMON is running, normal users can interface to the module with the standard POSIX API, using file commands like open, read, and write. This is because PMON instantiates itself as a char device and gets a device entry in /dev of /dev/pmon.

In summary, a root user must start PMON with the insmod pmon26.ko command. A shell script is available for this action: /root/ppctools/pmon/install.sh. If the system is rebooted, PMON must be reinstalled.

7 More Advanced Examples

There are several more examples in the guest/fae-training-04/library directory. We are going to discuss just one of them in detail, the Dot Product. Several other examples will be overviewed; the reader is encouraged to look at the other code. The mathlib directory has some AltiVec functions that can be used in your own programs.

7.1 Dot Product Example

In th
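Because the module appears as an ordinary character device, a user program can check for it with the standard POSIX file calls before attempting any measurement. The sketch below only opens and closes the device; the helper name `pmon_device_available` is hypothetical, and the request format that PMON expects on write and read is not shown because this section does not specify it.

```c
#include <fcntl.h>
#include <unistd.h>

/* Returns 1 if the character device at `path` can be opened for
 * reading and writing, 0 otherwise (for example when the module has
 * not been loaded or the device node has not been created). */
int pmon_device_available(const char *path)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return 0;
    close(fd);
    return 1;
}
```

A program would typically call `pmon_device_available("/dev/pmon")` first and print a hint to run the install script as root when it returns 0.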
35. described later.

Alignment Test
CPU 7457
CPU 7457 has 6 PMCs
Monitoring events are PMC 0 1
Monitoring events are PMC 1 2
Monitoring events are PMC 2 1
Monitoring events are PMC 3 2
Monitoring events are PMC 4 0
Monitoring events are PMC 5 0
00 01 02 03 04 05 06 07 08 09 0b 0c 0d 0e 0F
125423 Instructions 110162 Cycles 0.878324 IPC
00 01 02 03 04 05 06 07 08 09 0b 0c 0d 0e 0F
549744 Instructions 480785 Cycles 0.874562 IPC
guest@debian:~/fae-training-04/library/align$

We now see that both arrays are loaded correctly, because aa_array and ab_array are on a 16 byte boundary:

    10011a10 <ab_array>
    10011b10 <aa_array>

Thus it is important to guarantee 16 byte alignment for all memory that will be associated with AltiVec operations.

3.5.1 More Information on AltiVec Data Alignment

3.5.1.1 Obtaining Data Alignment for AltiVec with Compiler Constructs

It is strongly recommended that you align all data structures to a 16 byte boundary if AltiVec is used. Different compilers have different means of achieving this, but all of them have some method. Here is a GCC example:

    #include <altivec.h>
    typedef union {
        vector unsigned int vec;
        int elements[4];
    } LongVector at
36. dif TRACE 20 define START TRACING asm long 0x14000001 21 define STOP TRACING asm long 0x14000002 22 define MAX SIZE 64 Software Analysis on Genesi Pegasos Using PMON and AltiVec Rev 0 1 40 PRELIMINARY SUBJECT TO CHANGE WITHOUT NOTICE Freescale Semiconductor 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 More Advanced Examples define REPEAT 1 else define START TRACING define STOP TRACING define MAX SIZE 4 1024 define REPEAT 100 dendif int start pmon int pl int p2 int p3 int p4 unsigned int unsigned int unsigned int unsigned int float aa MAX float ab MAX read 744x upmcl1 void read 744x upmc2 void read 744x upmc3 void read 744x upmc4 void SIZE _ attribute _ aligned 16 SIZE _ attribute 1 16 void print int vector vector int this one printf 08x 08x 08x 08x n int this one 0 int this_one 1 int this_one 2 int this one 3 float dot_product float a float b int num_elements int i float tmp for i 0 i lt num_ elements i tmp ali b il Software Analysis on Genesi Pegasos Using PMON and AltiVec Rev 0 1 Freescale Semiconductor PRELIMINARY SUBJECT TO CHANGE WITHOUT NOTICE 41 More Advanced Exam
37. e 01 51 int this 1 52 int this one 2 53 int this one 31 54

The line numbers described here are the AltiVec constructs; the non-AltiVec C constructs will not be explained.
1: Include the GCC standard AltiVec header file. AltiVec intrinsics are built into the GCC compiler; this header expands the constructs at compile time.
4 and 5: Standard prototypes for these functions, which will be described later. However, the construct vector indicates that a vector variable is being used.
6 and 11: The construct vector declares a vector array of 16 vectors, i.e. 128-bit elements in memory, which are aligned on a 16 byte boundary, i.e. an address that ends with 4 bits of zero, e.g. 0x10105660. This example only uses one of those vectors.
12, 13, 14, and 15: These are normal character arrays; however, the attribute signature forces 16 byte boundary alignment. We will discuss this more in the align example in Section 3.5, An AltiVec Alignment Program Demonstrating Alignment Considerations.
16 and 17: Load a single vector, vec_a, from the address of the character array a1, and load a single 16 byte element of a vector array from the address of a character array a2. This will be discussed in more detail
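The "address that ends with 4 bits of zero" rule can be checked directly in C. The sketch below is illustrative only: `is_aligned16` and `a1_demo` are hypothetical names, and the attribute syntax is the GCC form used in the listings.

```c
#include <stdint.h>

/* Returns 1 when the low four bits of the address are zero,
 * i.e. the address lies on a 16 byte boundary. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 0xF) == 0;
}

/* A character array forced onto a 16 byte boundary (GCC attribute). */
unsigned char a1_demo[16] __attribute__((aligned(16)));
```

With this helper, `is_aligned16(a1_demo)` holds, while an address four bytes further on, such as `a1_demo + 4`, fails the check and would not be a safe operand for a vector load.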
38. e com AN2743 Rev 0 1 08 2004 PRELIMINARY SUBJECT TO CHANGE WITHOUT NOTICE Information in this document is provided solely to enable system and software implementers to use Freescale Semiconductor products There are no express or implied copyright licenses granted hereunder to design or fabricate any integrated circuits or integrated circuits based on the information in this document Freescale Semiconductor reserves the right to make changes without further notice to any products herein Freescale Semiconductor makes no warranty representation or guarantee regarding the suitability of its products for any particular purpose nor does Freescale Semiconductor assume any liability arising out of the application or use of any product or circuit and specifically disclaims any and all liability including without limitation consequential or incidental damages Typical parameters which may be provided in Freescale Semiconductor data sheets and or specifications can and do vary in different applications and actual performance may vary over time All operating parameters including Typicals must be validated for each customer application by customer s technical experts Freescale Semiconductor does not convey any license under its patent rights nor the rights of others Freescale Semiconductor products are not designed intended or authorized for use as components in systems intended for surgical implant into the body or other applications inte
39. e of PMON for Obtaining Performance Statistics. Here the lines associated with demonstrating alignment are discussed.
31 through 34: This defined variable can force 16 byte alignment or not, as shown starting in line 42.
42 through 49 can force 16 byte alignment due to the __attribute__ ((aligned (16))) construct. When FORCE_ALIGNMENT is 0, we skip the two lines with this attribute and use the two lines without the attribute; line 42 aligns to a byte, which will not be on a 16 byte address. When FORCE_ALIGNMENT is 1, we use the two lines with the attribute and skip the two lines without the attribute. Thus this code, with FORCE_ALIGNMENT set to 0, will run with non 16 byte alignment and will not give us the correct answer. We will change this later and show the correct answer.
51 through 77 are the previously described functions that print an int vector and a char vector.
79 through 84 is a function to load unaligned vectors correctly.
86 through 102 can store vectors correctly to unaligned memory.
112 through 115 fill these two arrays (aligned or unaligned, depending on FORCE_ALIGNMENT) with the numbers 0 through 15.
123 loads the 16 bytes starting at aa_array into the vector vec_a, but since the array is unaligned, we will get unexpected results. Since the macro REPEAT is set to 1 we only perform this loop
40. e stand alone and allow the user to input performance monitor register numbers via the keyboard is available in root ppctools pmon usr pmon test c guest debian fae training 04 library align cat n pmon c guest debian fae training 04 library align cat n pmon c 1 BOR KK RR KR RRR KK KK KK KR 2 Filename pmon_test c 3 Note this file test kernel module pmon c which is registered as a char device 4 at dev pmon 5 6 define _GNU_SOURCE 7 dinclude lt stdio h gt 8 include lt stdlib h gt Software Analysis on Genesi Pegasos Using PMON and AltiVec Rev 0 1 Freescale Semiconductor PRELIMINARY SUBJECT TO CHANGE WITHOUT NOTICE 25 Using the Performance Monitors for Performance Gathering 9 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 3 7 38 39 40 include lt string h gt include fcntl h include lt unistd h gt include lt sys uio h gt define NUM 6 static unsigned int pmc_sel NUM unsigned int read_744x_upmcl void unsigned int read 744x upmc2 void unsigned int read 744x upmc3 void unsigned int read 744x upmc4 void unsigned int read 744x upmc5 void unsigned int read 744x void void show upmcs unsigned int upmc int start pmon int pl int p2 int p3 int p4 static unsigned int upmc begin
41. eneral help information for GCC with the gcc h command gcc hello c a out Hello World 3 2 ASimple AltiVec hello World Program Using the same program above compile it with the AltiVec flags maltivec mabi altivec Software Analysis on Genesi Pegasos Using PMON and AltiVec Rev 0 1 Freescale Semiconductor PRELIMINARY SUBJECT TO CHANGE WITHOUT NOTICE 3 Introduction to Compiling on Linux with GCC gcc maltivec mabi altivec hello c a out Hello World There is not any difference That is because there are no AltiVec constructs in this program 3 3 An AltiVec Hello AltiVec from vecChar Program with Some AltiVec Char Constructs AltiVec intrinsics are built into the GCC compiler and will be explained as they are encountered in this program See Section 9 References 9 and 10 This program illustrates some AltiVec constructs The numbers allow a description of the constructs do not type in the numbers if you wish to try this program for yourself guestGdebian fae training 04 library maurie cat n vecChar c 1 include altivec h 2 include lt stdio h gt 4 void print char vector vector unsigned char this one 5 void print int vector vector int this one 6 vector unsigned char vec array 16 only using vec 1 7 8 main 9 10 int i 11 vector unsigned char vec a 12 unsigned char 1 16 attribute aligned 16 13 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
42. er than the BigLookupTable.

Compiling and running this program generates this output:

guest@debian:~/fae-training-04/library/bitreverse$ ./test > j
guest@debian:~/fae-training-04/library/bitreverse$ cat -n j
1 2 7 8 9 Scalar function timing CPU 7457 CPU 7457 has 6 PMCs Monitoring Monitoring Monitoring Monitoring Monitoring Monitoring events events events events events events 4872576 10 Big Lookup 1286546 11 Small Lookup 2309602 12 Vector timing guest debian fae training 04 library bitreverses are are are are are are PMC 0 PMC 1 PMC 2 PMC 3 PMC 4 5 2844410 1455586 1793101 95939 ins ins ins ins 4872558 2844465 1286521 1455615 2309580 1793279 156704 96188 156732

Line 9 uses 2,844,410 cycles, versus line 10, which uses 1,455,586, versus line 11, the vector solution, which uses 95,939 cycles. That is 2844410 / 95939 = 29.65, almost 30 times faster.

[Figure: bar chart of cycles (vertical axis 0 to 3000000) for Scalar, Lookup, Small Lookup, and Vector Small Lookup]

Cycles Instructions IPC Bytes C
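The speedup figures quoted here are plain ratios of cycle counts. A minimal sketch of the arithmetic, using a hypothetical helper named `speedup`: dividing the scalar count of 2844410 cycles by the vector count of 95939 cycles gives about 29.65, and dividing the 1455586 cycles quoted for line 10 by the same vector count gives about 15.2.

```c
/* Ratio of baseline cycles to optimized cycles.
 * Returns 0.0 when the optimized count is zero. */
double speedup(unsigned long baseline_cycles, unsigned long optimized_cycles)
{
    if (optimized_cycles == 0)
        return 0.0;
    return (double)baseline_cycles / (double)optimized_cycles;
}
```

Comparing cycle counts in this way is only meaningful when both runs process the same amount of data, which is the case for the timings above since they come from one run of the same test program.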
43. fine macros for getting the start and stop time, used in calculating the number of units used in a timing session, in this case cycles and instructions.
19 through 24 are used for simg4plus, which is not used here since TRACE is not defined by the Makefile.
25 through 28 are used for PMON; however, 25 and 26 just shut off the simg4plus tracing facility.
31 through 35 are prototypes for the PMON functions, which are defined in pmon.c in this directory.
37 and 38 declare our float vectors, which are aligned to 16 bytes, a requirement for vector AltiVec operations.
40 through 46 is a function to print vectors; it is not used.
48 through 57 is the scalar function to perform a dot product.
59 through 73 is a vectorization of the dot product algorithm, which can only perform 1 madd per 4 cycles because of data dependency.
75 through 148 is the same vectorization; however, it can perform 4 madd per 4 cycles, i.e. 1 madd per cycle, by filling the pipe with four madd in a row. 93 through 137 are all commented out and therefore ignored; they do not participate in this algorithm.
150 through 157 is the beginning of the main function and the declaration of variables.
159 through 164 initialize the two arrays to values that will produce a result of 0 in the dot product, no matter how many elements are in the array, as long as there is an even number of elements. This is described in Section 7.1.3, Results and Explanation.
166 through 173 are ignored be
44. float aa vector float ab MAX SIZE STOP TIMER Stop pc3 read 744x upmc3 stop pc4 read 744x upmc4 Software Analysis on Genesi Pegasos Using PMON and AltiVec Rev 0 1 46 PRELIMINARY SUBJECT TO CHANGE WITHOUT NOTICE Freescale Semiconductor More Advanced Examples 209 printf d t ins d t d t d n 210 stop time start time 211 stop ins start ins 212 Stop pc3 start pc3 213 stop pc4 start pc4 214 215 printf Output f n n result 216 217 printf Parallel version 2 Xn We 218 219 start_pc4 read 744x upmc4 220 start pc3 read 744x upmc3 221 START TIMER 222 223 for 1 0 1 lt i 224 result dot p vec 2 vector float aa vector float ab MAX SIZE 225 226 STOP_TIMER 227 stop pc3 read 744x upmc3 228 stop pc4 read 744x upmc4 229 printf d Nt ins d NC Sd NC 54 230 stop time start time 231 stop ins start ins 232 stop pc3 start pc3 233 stop pc4 start pc4 234 235 printf Output f n n result 236 237 endif 238 return 0 239 guestGdebian fae training 04 library dot product Software Analysis on Genesi Pegasos Using PMON and AltiVec Rev 0 1 Freescale Semiconductor PRELIMINARY SUBJECT TO CHANGE WITHOUT NOTICE 47 More Advanced Examples Line explanation 1 though 5 are comments 6 is the header to define the AltiVec intrinsics 10 through 17 de
45. gnment onto a 16 byte boundary for Alti Vec vector operations You can find this example in home guest fae training 04 library align It has been modified slightly to emphasize what we are demonstrating guest debian fae training 04 library align cat n align c 1 2 Modified slightly from the example 3 Alignment example 4 H 6 include lt altivec h gt 7 dinclude lt stdio h gt 8 9 10 define START TIMER 11 start_time read_744x_upmcl1 12 start_ins read 744x upmc2 13 14 define STOP TIMER 15 asm volatile eieio 16 stop time read 744 upmcl 17 stop ins read 744x upmc2 19 if TRACE 20 define START TRACING asm long 0x14000001 21 define STOP TRACING asm long 0x14000002 22 define MAX SIZE 64 23 define REPEAT 1 24 else Software Analysis on Genesi Pegasos Using PMON and AltiVec Rev 0 1 Freescale Semiconductor PRELIMINARY SUBJECT TO CHANGE WITHOUT NOTICE 11 Introduction to Compiling on Linux with GCC 25 26 27 28 29 30 31 32 33 34 35 36 3 7 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 define START TRACING define STOP TRACING define MAX SIZE 256 define REPEAT 10000 endif define FORCE ALIGNMENT 0 1 forces alignment 0 forces non alignment d int start pmon int pl int p2 int p3 int p4 unsigned int read 744 upmc1 void
46. gth)
    {
        vector float temp = (vector float) vec_splat_s8(0);
        vector float temp2 = temp;
        vector float temp3 = temp;
        vector float temp4 = temp;
        float result;

        /* Loop over the length of the vectors, this time doing
           4 vectors in parallel to fill the pipeline */
        for (int i = 0; i < length; i += 4) {
            temp  = vec_madd(v1[i],     v2[i],     temp);
            temp2 = vec_madd(v1[i + 1], v2[i + 1], temp2);
            temp3 = vec_madd(v1[i + 2], v2[i + 2], temp3);
            temp4 = vec_madd(v1[i + 3], v2[i + 3], temp4);
        }

        /* Sum our temp vectors */
        temp  = vec_add(temp, temp2);
        temp3 = vec_add(temp3, temp4);
        temp  = vec_add(temp, temp3);

        /* Add across the vector */
        temp = vec_add(temp, vec_sld(temp, temp, 4));
        temp = vec_add(temp, vec_sld(temp, temp, 8));

        /* Copy the result to the stack so we can return it via the IPU */
        vec_ste(temp, 0, &result);
        return result;
    }

This code example, dot product, proceeds using these three methods: the classic method, one madd at a time, and 4 madd at a time for one fourth the iterations. PMON is used to count the number of cycles and instructions used in each method. As we will see, the vector method is significantly better than the classic method, and the 4 madd at a time method is again significantly better, i.e. more efficient and faster, than either of the others.

7.1.1 Makefile

Line 2 in the Makefile below will compile and link our dot product example using the AltiVec intrinsics, including our pmon.c interface to PMON, generating an elf exe
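The idea of keeping four independent accumulators can be tried without AltiVec hardware. The following portable C sketch is an emulation under stated assumptions, not the code from dot_product.c: each of the four accumulators is modeled as an array of four floats, the function name `dot_product_4x4` is hypothetical, and the element count is assumed to be a multiple of 16.

```c
/* Scalar emulation of the "4 madd in a row" scheme.
 * acc[v] stands in for one vector register holding four float lanes.
 * Assumes n is a multiple of 16 (four vectors of four floats). */
float dot_product_4x4(const float *a, const float *b, int n)
{
    float acc[4][4] = {{0.0f}};

    for (int i = 0; i < n; i += 16) {
        /* Four independent accumulators: no multiply-add in this
         * group depends on the result of another one. */
        for (int v = 0; v < 4; v++)
            for (int lane = 0; lane < 4; lane++)
                acc[v][lane] += a[i + 4 * v + lane] * b[i + 4 * v + lane];
    }

    /* Sum the four accumulators lane by lane, then add across lanes. */
    float lanes[4];
    for (int lane = 0; lane < 4; lane++)
        lanes[lane] = (acc[0][lane] + acc[1][lane]) +
                      (acc[2][lane] + acc[3][lane]);

    return (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
}
```

Because the partial sums are combined in a different order than in a simple scalar loop, the floating-point result can differ from the scalar version in the last bits for general data; for small integer-valued inputs it is exact.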
47. is example the simg4plus facility is invoked with the -DTRACE=1 parameter, as shown in the build.sh file. In order to use the PMON facility instead of simg4plus, use the Makefile, which does not define the TRACE macro.

True data dependency, as well as some classical code optimization, can often prevent vectorization. But in some cases the data dependency can be removed, and a gain in efficiency and speed can be obtained by using the AltiVec engine for vectorization. This example shows how to vectorize a classical data dependency problem.

Consider the dot product of two matrices X and Y (vectors of size N):

$$\mathrm{DotProduct}(x[n], y[n]) = \sum_{i=1}^{n} x[i]\, y[i]$$

The classic code solution:

    float DotProduct(float *X, float *Y, int length)
    {
        float temp = 0;
        /* N iterations */
        for (int i = 0; i < length; i++)
            temp = X[i] * Y[i] + temp;
        return temp;
    }

This same function could be written in vector form, where each vector can contain 4 floats; thus v1 and v2 are of size N/4, and we use four times fewer iterations. There is some setup time, but the dot product loop operates four times faster.

    float VectorDotProduct(vector float *v1, vector float *v2, int length)
    {
        vector float temp = (vector float) vec_splat_u32(0);
        float result;
        /* Loop over the length
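The "four times fewer iterations" claim can be illustrated in portable C by treating a group of four floats as one vector. This is a sketch under stated assumptions: the name `dot_product_lanes` is hypothetical, the inner loop over `lane` stands in for what a single vector multiply-add would do in one instruction, and a scalar tail is added so that lengths that are not a multiple of four still work.

```c
#define LANES 4

/* Dot product that processes four elements per outer iteration,
 * mimicking one 4-lane vector accumulator. */
float dot_product_lanes(const float *x, const float *y, int length)
{
    float acc[LANES] = {0.0f, 0.0f, 0.0f, 0.0f};

    /* About length/4 outer iterations instead of length. */
    for (int i = 0; i + LANES <= length; i += LANES)
        for (int lane = 0; lane < LANES; lane++)
            acc[lane] += x[i + lane] * y[i + lane];

    /* Add across the four lanes to obtain a single scalar. */
    float result = (acc[0] + acc[1]) + (acc[2] + acc[3]);

    /* Scalar tail for the last length % 4 elements. */
    for (int i = length - length % LANES; i < length; i++)
        result += x[i] * y[i];

    return result;
}
```

The final add across lanes is the setup cost mentioned in the text: it is paid once per call, so it matters less as the arrays grow.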
48. l be explained in the next example, in Section 3.6, An AltiVec Program Demonstrating the Use of PMON for Obtaining Performance Statistics.

Looking at the two lines of output, we see that the first unaligned array starts with the value fc, and it should start with 00. This is because the arrays aa_array and ab_array are not on a 16 byte boundary. The second unaligned array vector prints correctly, since the vectorLoadUnaligned function aligned the data before storing it in the vector.

The following command disassembles an elf executable file:

guest@debian:~/fae-training-04/library/align$ objdump -D test > j

Looking at the assembly saved in file j, we see that aa_array and ab_array start at addresses 10011b04 and 10011a04, which are not on a 16 byte boundary (the last 4 bits of the address are not zero):

    10011a04 <ab_array>
    10011b04 <aa_array>

Recompile and rerun the program, setting FORCE_ALIGNMENT to 1, and we get this result:

guest@debian:~/fae-training-04/library/align$ make clean
rm -rf *.o pmon test
guest@debian:~/fae-training-04/library/align$ make
gcc -maltivec -mabi=altivec pmon.c align.c -o test
guest@debian:~/fae-training-04/library/align$ ./test

Ignore the lines in italics; they will be
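The wrong first bytes can be reproduced on any machine with a small model of the load. The AltiVec vector load ignores the low four bits of the effective address, so a load from an address ending in 4 actually fetches from the 16 byte boundary just below it. The function below is a portable emulation of that behavior, not the intrinsic itself, and the name `emulated_vec_ld` is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Emulates the address truncation of an aligned vector load:
 * copies 16 bytes starting at `addr` with its low four bits cleared.
 * The caller must ensure that the whole 16 byte block is readable. */
void emulated_vec_ld(const unsigned char *addr, unsigned char out[16])
{
    uintptr_t base = (uintptr_t)addr & ~(uintptr_t)0xF;
    memcpy(out, (const unsigned char *)base, 16);
}
```

For an array that starts 4 bytes past a boundary, the first four bytes of the result come from whatever precedes the array in memory, and array element 0 only appears at byte 4 of the vector. That would be consistent with the unexpected leading value in the unaligned output discussed above.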
49. le which allows non-root users to set the performance monitors to count specific CPU activities, such as cycles. See Section 5, Using the Performance Monitors for Performance Gathering, for more details. PMON is a limited application which can count only 32 bits, approximately 4 billion items. Similar tools are available from commercial software vendors. A user can read these registers and develop statistical analysis; however, selecting which CPU activities to gather requires changing OEA supervisor registers. Thus, by calling PMON, the user can request that the performance registers collect specific counts. Using PMON is described in this application note; the implementation of the PMON kernel module is described in the Freescale application note PMON Module: An Example of Writing Kernel Module Code for Debian 2.6 on Genesi Pegasos II (AN2744).

3.1 A Simple Hello World Program

Navigate to the /home/guest/fae-training-04/library directory, create a local directory, navigate to it, and type in this program:

    cd fae-training-04/library
    mkdir localperson
    cd localperson
    <editor of your choice> hello.c

    #include <stdio.h>
    main()
    {
        printf("Hello World\n");
    }

Compile and run this program; the executable elf file will be called a.out by default. Since the local directory is not in the PATH, local executables must be preceded by the two characters ./, thus the construct ./a.out. You get g
50. low we are assigning 16 integer values of 4 bytes each, which is 64 bytes:

vector int vec_int;
unsigned int a3[16] __attribute__((aligned(16))) = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16};
vec_int = vec_ld(0, (vector int *)a3);

Let's just consider the first 4 integers, which is 16 bytes. Assume that int array a3 starts at 10011b10:

10011b10 0    10011b11 0    10011b12 0    10011b13 1
10011b14 0    10011b15 0    10011b16 0    10011b17 2
10011b18 0    10011b19 0    10011b1a 0    10011b1b 3
10011b1c 0    10011b1d 0    10011b1e 0    10011b1f 4

Now, after the vec_ld instruction, the vector vec_int has 16 bytes of data copied from the addresses 10011b10 through 10011b1f, which is 0 0 0 1 0 0 0 2 0 0 0 3 0 0 0 4. It still has 16 bytes, but only four integer values.

5 Using the Performance Monitors for Performance Gathering

5.1 General Description

All G4 parts contain special hardware to collect certain statistical information about the CPU state and events. The MPC7447 contains six performance counters, accessible as privileged SPRs (PMC1-PMC6), which can monitor up to 242 unique events. Normally these are 32-bit hardware counters. If you count an event every cycle at a speed of 1 GHz, you will overflow these counters in 4.3 seconds. It is possible to extend them to 64 bits and/or write
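The 4.3-second overflow figure follows from 2^32 counts divided by 10^9 counts per second. The sketch below works that arithmetic through and also shows one common way to take the difference of two 32-bit counter readings so that a single wraparound between them does not corrupt the result. It is an illustrative sketch only; the names pmc_overflow_seconds and pmc_delta are hypothetical and do not appear in pmon.c.

```c
#include <assert.h>
#include <stdint.h>

/* Seconds until a 32-bit counter that increments once per clock cycle
 * wraps around, for a clock of clock_hz cycles per second. */
double pmc_overflow_seconds(double clock_hz)
{
    assert(clock_hz > 0.0);
    return 4294967296.0 / clock_hz;   /* 2^32 counts */
}

/* Events counted between two readings of a 32-bit counter.
 * Unsigned subtraction is performed modulo 2^32, so the result is still
 * correct if the counter wrapped once between start and stop. */
uint32_t pmc_delta(uint32_t start, uint32_t stop)
{
    return (uint32_t)(stop - start);
}
```

For a 1 GHz clock, pmc_overflow_seconds(1e9) is about 4.29 seconds, matching the figure quoted above. A measurement that runs longer than that can wrap more than once, in which case the difference alone is no longer enough.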
51. 105 if (byteCount == -1)
106 {
107 printf("ERR read failed\n");
108 return 0; // can read again
109 }
110 else if (byteCount < sizeof(cycles))
111 {
112 printf("ERR not read enough data\n");
113 return 0; // can read again
114 }
115 // printf("PMC count 0x%08X\n", cycles);
116 }
117 close(fd);
118 // show_upmcs(upmc_begin);
119 // printf("Running my code\n\n\n");
120 // asm volatile ("eieio");
121 // show_upmcs(upmc_end);
122 return 0;
123 }
124
125 void show_upmcs(unsigned int *upmc)
126 {
127 int i;
128 upmc[0] = read_744x_upmc1();
129 upmc[1] = read_744x_upmc2();
130 upmc[2] = read_744x_upmc3();
131 upmc[3] = read_744x_upmc4();
132 upmc[4] = read_744x_upmc5();
133 upmc[5] = read_744x_upmc6();
134 for (i = 0; i < 6; i++)
135 printf("UPMC%d 0x%08x\n", i, upmc[i]);
136 }
137
138
139 unsigned int read_744x_upmc1(void)
140 {
141 unsigned int val32;
142 asm volatile ("mfspr %0, 937" : "=r" (val32));
143 return val32;
144 }
52. nctions here are described in Section 5, Using the Performance Monitors for Performance Gathering.
11, 12, 16, and 17 call the read_744x_upmc functions, which are defined in pmon.c, described in Section 5.2.2, PMON Interface File Code.
19 through 29 are used for the simg4plus facility.
36 through 40 are prototypes.
119 through 121 call the PMON facility and tell it to monitor performance monitors 1 and 2, which count the number of instructions and the number of cycles; see Section 5.2.2, PMON Interface File Code.
124 turns off the counters.
127 through 130 display the results: the number of instructions and cycles used to perform the code between START_TIMER and STOP_TIMER.
133 and 134 are the same as 119 through 121, except that they call the function vectorLoadUnaligned, which executes more instructions than the code timed at 119 through 121.
136 is the same as 124.
139 through 142 are the same as 127 through 130.

Let's look at the result of running this program again and explain the previously ignored output lines.

guest@debian:~/fae-training-04/library/align$ ./test > j
guest@debian:~/fae-training-04/library/align$ cat -n j
1
2 Alignment Test
3 CPU 7457
4 7457 has 6 PMCs
5 Monitoring events are PMC 0 1
6 Monitoring events are PMC 1 2
53. nded to support or sustain life, or for any other application in which the failure of the Freescale Semiconductor product could create a situation where personal injury or death may occur. Should Buyer purchase or use Freescale Semiconductor products for any such unintended or unauthorized application, Buyer shall indemnify and hold Freescale Semiconductor and its officers, employees, subsidiaries, affiliates, and distributors harmless against all claims, costs, damages, and expenses, and reasonable attorney fees arising out of, directly or indirectly, any claim of personal injury or death associated with such unintended or unauthorized use, even if such claim alleges that Freescale Semiconductor was negligent regarding the design or manufacture of the part.

Freescale and the Freescale logo are trademarks of Freescale Semiconductor, Inc. The PowerPC name is a trademark of IBM Corp. and is used under license. All other product or service names are the property of their respective owners.

Freescale Semiconductor, Inc. 2004
54. of the vectors, multiplying like terms and summing. Number of iterations is N/4.

for (int i = 0; i < length; i++)
    temp = vec_madd(v1[i], v2[i], temp);
// Still have four ints splat across a vector. Add across the vector.
temp = vec_add(temp, vec_sld(temp, temp, 4)); // Vector Shift Left Double
temp = vec_add(temp, vec_sld(temp, temp, 8));
vec_ste(temp, 0, &result);
return result;

However, there is a data dependency: only 1 madd can complete every 4 cycles.

float VectorDotProduct(vector float *v1, vector float *v2, int length)
{
    vector float temp = (vector float) vec_splat_u32(0);
    float result;
    // Loop over the length of the vectors, multiplying like terms and summing.
    // Number of iterations is N/4.
    for (int i = 0; i < length; i++)
        temp = vec_madd(v1[i], v2[i], temp); // true data dependency: only 1 madd every 4 cycles
    temp = vec_add(temp, vec_sld(temp, temp, 4)); // Vector Shift Left Double
    temp = vec_add(temp, vec_sld(temp, temp, 8));
    vec_ste(temp, 0, &result);
    return result;
}

We can eliminate this dependency by performing 4 madds in a row, filling the pipeline, by doing 4 vectors at a time and incrementing our for loop by 4 each time instead of by 1.

int FastVectorDotProduct(vector float *v1, vector float *v2, int len
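The idea of breaking the madd dependency chain can be modeled in portable scalar C, which may help when experimenting on a machine without AltiVec. This is only an illustrative sketch, not the AltiVec code from this note: dot_product_4acc is a hypothetical name, and four plain float accumulators stand in for the four vector accumulators, so that each multiply-add depends only on the result from four steps earlier.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar model of the "4 madds in a row" technique.
 * Four independent accumulators mean no multiply-add has to wait for the
 * one issued immediately before it. n must be a multiple of 4. */
float dot_product_4acc(const float *a, const float *b, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i;

    assert(a != NULL && b != NULL);
    assert(n % 4 == 0);

    for (i = 0; i < n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    /* Combine the partial sums once, after the loop. */
    return (s0 + s1) + (s2 + s3);
}
```

Because the partial sums are combined in a different order than in a single-accumulator loop, the floating-point result can differ in the last bits from the straightforward version.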
55. one char this one 80 vector unsigned char permuteVector vec 1 81 0 81 vector unsigned char low 1 0 82 vector unsigned char high vec_ld 16 83 return vec perm low high permuteVector 84 85 86 void vectorStoreUnaligned vector unsigned char v vector unsigned char where int Software Analysis on Genesi Pegasos Using PMON and AltiVec Rev 0 1 v Freescale Semiconductor PRELIMINARY SUBJECT TO CHANGE WITHOUT NOTICE 13 Introduction to Compiling on Linux with GCC 87 vector unsigned char permuteVector vec lvsr 0 int where 88 vector unsigned char low high tmp mask 89 vector unsigned char ones vec splat u8 Oxff 90 vector unsigned char zeroes vec splat u8 0 91 92 low vec ld O where Load the surrounding area 93 high vec 1 16 where 94 Make a mask for which parts of the vectors to swap out 95 mask vec perm zeroes ones permuteVector 96 tmp vec perm tmp tmp permuteVector Right rotate our input data 97 low aligned vector vec sel tmp low mask Insert masked data to 98 high vec sel high v mask 99 100 vec st low 0 where Store aligned results 101 vec st high 16 where 102 103 104 int main 105 106 vector unsigned char vec a 107 int 2 1 108 int 51 52 53 109 unsigned int start time stop time 110 unsigned int start ins stop ins
56. ools/pmon# insmod pmon26.ko
root@debian:~/ppctools/pmon# cat /proc/devices
Character devices:
1 mem
4 /dev/vc/0
(intervening lines removed)
171 1394
180 usb
254 pmon
Block devices:
1 ramdisk
3 ide0
8 sd
(remaining lines removed)
root@debian:~/ppctools/pmon# mknod /dev/pmon c 254 0
root@debian:~/ppctools/pmon# chmod 777 /dev/pmon
root@debian:~/ppctools/pmon# ls -l /dev/pmon
crwxrwxrwx 1 root root 254, 0 Jul 12 16:28 /dev/pmon

As can be seen from this example, /proc/devices shows that the PMON device is assigned to id 254. Further, for this example (align) to work, these conditions must also be met:
1. align.c and pmon.c must be built.
2. The resultant executable must be run.

5.2.5 Results When /dev/pmon is Available and pmon26.ko is Installed

Now that all these conditions have been met, let's run it again.

guest@debian:~/fae-training-04/library/align$ cat -n j
1
2 Alignment Test
3 CPU 7457
4 7457 has 6 PMCs
5 Monitoring events are PMC 0 1
6 Monitoring events are PMC 1 2
7 Monitoring events are PMC 2 1
8 Monitoring events are PMC 3 2
9 Monitoring events are PMC 4 0
10 Monitoring events are PMC 5 0
11 00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f
12 134610 Instructions 110917 Cycles 0.8239
57. ples 55 56 57 58 59 60 61 62 63 64 65 66 67 68 Double 69 70 71 72 733 74 75 76 77 78 79 80 81 82 83 84 85 return tmp float dot p 1 float va vector float int num elements vector float temp vector float vec splat u32 0 int i float result for i 0 i num elements 4 1 temp vec madd 1 vb il temp temp vec add temp vec sld temp temp 4 Nector Shift Left temp vec add temp vec sld temp temp 8 vec ste temp 0 amp result return result float dot p vec 2 vector float vl vector float v2 int num elements vector float temp vector float vec splat s8 0 vector float temp2 temp vector float temp3 temp vector float temp4 temp vector float 1 vl vector float v2p v2 vector float t1 t2 t3 t4 t5 t6 t7 t8 float result int i 0 Software Analysis on Genesi Pegasos Using PMON and AltiVec Rev 0 1 42 PRELIMINARY SUBJECT TO CHANGE WITHOUT NOTICE Freescale Semiconductor More Advanced Examples 86 87 for i 0 i lt num elements 4 i 4 Loop over the length of the vectors 88 temp vec_madd 1 1 v2 i temp this time doing 4 vectors in parallel 89 temp2 vec madd 1 1 1 v2 i 1 temp2 to 111 the pipeline 90 temp3 vec madd 1 1 2 2 1 2 temp3 91
58. printf("\n\n");
21 vec_int = vec_ld(16, (vector int *)a3);
22 printf("\nHello AltiVec from vecInt\n");
23 printf("vec_int offset by 16 ");
24 print_int_vector(&vec_int);
25
26 vec_int = vec_ld(32, (vector int *)a3);
27 printf("\nHello AltiVec from vecInt\n");
28 printf("vec_int offset by 32 ");
29 print_int_vector(&vec_int);
30 printf("\n\n");
31 vec_int = vec_ld(48, (vector int *)a3);
32 printf("\nHello AltiVec from vecInt\n");
33 printf("vec_int offset by 48 ");
34 print_int_vector(&vec_int);
35 printf("\n\n");
36 }
37
38 void print_char_vector(vector unsigned char *this_one)
39 {
40 printf("%02x %02x %02x %02x %02x %02x %02x %02x %02x %02x %02x %02x %02x %02x %02x %02x",
41 ((unsigned char *)this_one)[0],
42 ((unsigned char *)this_one)[1],
43 ((unsigned char *)this_one)[2],
44 ((unsigned char *)this_one)[3],
45 ((unsigned char *)this_one)[4],
46 ((unsigned char *)this_one)[5],
47 ((unsigned char *)this_one)[6],
48 ((unsigned char *)this_one)[7],
49 ((unsigned char *)this_one)[8],
50 ((unsigned char *)this_one)[9],
51 ((unsigned char *)this_one)[10],
52 ((unsigned char *)this_one)[11],
53 ((unsigned char *)this_one)[12],
54 ((unsigned char *)this_one)[13],
55 un
59. guest@debian:~/fae-training-04/library/align$

Change the following line in the makefile to remove the executable test with the clean target:

4 rm -rf *.o test

5.2.2 PMON Interface File Code

We have already discussed align.c. The program pmon.c, described below, is intended to be linked with any other program that wishes to set up performance monitors. It has a limit of 4 performance monitor registers that can be used at any one time. It can easily be changed to handle up to 6 registers by changing line 24 to accept 6 arguments and changing lines 88 and 89 to use these two new arguments. Another example of this interface program, rewritten to be stand-alone and to allow the user to input performance monitor register numbers via the keyboard, is available in /root/ppctools/pmon/usr/pmon_test.c. The pmon.c program is listed here, taken from /home/guest/fae-training-04/library/align, with line numbers, which are obtained with the cat -n command. A description of all these lines follows.
60. rough 8 are the same as in Section 3.6, An AltiVec Program Demonstrating the Use of PMON for Obtaining Performance Statistics, and in fact we are using the same performance monitor counters: cycles and instructions.
9 through 12 are the result of the scalar dot product, which took 2,130,188 cycles and 2,461,537 instructions.
13 through 16 are the result of the vectorization using only 1 madd per 4 cycles, which took 420,495 cycles and 618,164 instructions. Thus, this code is 5 times faster (2130188/420495) than the scalar case.
17 through 20 are the result of the vectorization using 4 madds per 4 cycles (i.e., 1 madd per cycle), which took 275,311 cycles and 439,212 instructions. Thus, this code is 1.5 times faster (420495/275311) than the previous case, and 7.7 times faster (2130188/275311) than the scalar case.

Additional events monitored in this example:
15 (000 1111) Cycles a VFPU instruction is waiting for an operand: counts the cycles an AltiVec instruction in the VFPU reservation station is waiting for an operand.
32 (010 0000) Cycles where no instructions completed: counts the cycles where no instructions are completed.

Figure 4. Comparison of Scalar Versus Vector Computations (bar chart titled "Advanced example of PMON usage"; table columns: Cycles, Instructions, IPC, Speedup, VFPU wait, No Ins Completed; legible rows: 427256 cycles, 617450 instructions, IPC 1.445152; 275730 cycles, 439212 instructions, IPC 1.592906, speedup 7.67812)

Figure 4
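The speedup and IPC figures quoted above are simple ratios of the counter values, and can be rechecked with a few lines of C. This is a sketch for verifying the arithmetic only; the function names speedup and ipc are hypothetical and are not taken from the example program.

```c
#include <assert.h>

/* Speedup of a new version over a baseline, as a ratio of cycle counts. */
double speedup(double baseline_cycles, double new_cycles)
{
    assert(new_cycles > 0.0);
    return baseline_cycles / new_cycles;
}

/* Instructions per cycle from the two counter values. */
double ipc(double instructions, double cycles)
{
    assert(cycles > 0.0);
    return instructions / cycles;
}
```

With the cycle counts from the text, speedup(2130188, 420495) is about 5.07, speedup(420495, 275311) is about 1.53, and speedup(2130188, 275311) is about 7.74, in line with the rounded values of 5, 1.5, and 7.7.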
61. rs.
102 through 117 call the pmon26.ko function pmon_read to read the contents of SPR937 (UPMC1), which does not even require this module, and then we would print out the values in line 115, but it is commented out, so this is some debug code.
118 through 121 are commented out, so they are some debug lines.
122 through 123 return from this function.
125 through 136 are the function to print out all the UPMC values, which are not privileged. This function is not called by align.c.
139 through 144 are a function to read the UPMC1 (SPR937) register, which is not privileged. Thus, we can just read this register as a normal user. It is used by align.c to read each of the counters and print out lines 13 and 15 in the results shown below.
145 through 174 are the functions to get all the other counter values.

5.2.3 Results When /dev/pmon is Not Available

Running the executable again, we get this result:

guest@debian:~/fae-training-04/library/align$ ./test > j
guest@debian:~/fae-training-04/library/align$ cat -n j
1
2 Alignment Test
3 CPU 7457
4 7457 has 6 PMCs
5 Monitoring events are PMC 0 1
6 Monitoring events are PMC 1 2
7 Monitoring events are PMC 2 1
8 Monitoring even
62. s of 0x00000005, 0x00000006, 0x00000007, 0x00000008) into vec_int, because the a value in vec_ld(a, b) indicates an offset of 16 bytes. Now, if 0 <= a < 16, we offset by 0, because the offset used must be a multiple of 16. Thus, in this case, 16 <= a < 32, so the offset is 16 bytes.
26 stores the 4 int values (128 bits) of 0x00000009, etc., in vec_int.
31 stores the next int values, starting with 0x0000000d.

We can now use any of the three methods described in the previous example to compile and execute this program, changing vecChar.c to vecInt.c.

guest@debian:~/fae-training-04/library/maurie$ ./compile vecInt.c
guest@debian:~/fae-training-04/library/maurie$ ./a.out

Hello AltiVec from vecInt
vec_int offset by 0 00000001 00000002 00000003 00000004
00 00 00 01 00 00 00 02 00 00 00 03 00 00 00 04

Hello AltiVec from vecInt
vec_int offset by 16 00000005 00000006 00000007 00000008

Hello AltiVec from vecInt
vec_int offset by 32 00000009 0000000a 0000000b 0000000c

Hello AltiVec from vecInt
vec_int offset by 48 0000000d 0000000e 0000000f 00000010

guest@debian:~/fae-training-04/library/maurie$

3.5 An AltiVec Alignment Program Demonstrating Alignment Considerations

This example demonstrates the necessity of forcing ali
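The rounding behavior described above, where an offset between 16 and 31 bytes behaves like an offset of 16, comes from vec_ld ignoring the low 4 bits of the sum of offset and base address. The following portable C sketch models only that address calculation; it is an illustration under that assumption, the function name vec_ld_effective_address is hypothetical, and the base address is the one assumed for a3 earlier in this note.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the address a vector load actually reads from:
 * (base + offset) with the low 4 bits cleared, so the load always
 * starts on a 16-byte boundary. */
uintptr_t vec_ld_effective_address(uintptr_t base, unsigned long offset)
{
    uintptr_t ea = (base + (uintptr_t)offset) & ~(uintptr_t)0xF;
    assert((ea & (uintptr_t)0xF) == 0);
    return ea;
}
```

For an aligned base of 0x10011b10, offsets 0 through 15 all yield 0x10011b10 and offsets 16 through 31 all yield 0x10011b20, which is why the listing steps the offset in multiples of 16.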
63. signed char *)this_one)[14],
56 ((unsigned char *)this_one)[15]);
57 }
58 void print_int_vector(vector int *this_one)
59 {
60 printf("%08x %08x %08x %08x",
61 ((int *)this_one)[0],
62 ((int *)this_one)[1],
63 ((int *)this_one)[2],
64 ((int *)this_one)[3]);
65 }

The code is almost the same. The code at these line numbers is different:
7 defines a vector of type int instead of type char: vec_int. A vector of type char indicates that the 128-bit vector is divided into 16 bytes of 8 bits each. A vector of type int indicates that the 128-bit vector is divided into 4 ints of 32 bits each.
12 initializes an int array of 16 elements, each 32 bits long. The values 1, 2, 3, etc. are stored into these ints. In the previous example, the char values 1, 2, 3 were each 8 bits long. In this example, the ints 1, 2, 3, etc. are 32 bits long. Hence, the char vector contained all 16 values 1, 2, 3, etc. in 128 bits. The int values contain only 1, 2, 3, 4 in the first 128 bits, 5, 6, 7, 8 in the next 128 bits, etc.
14 stores the 4 int values (128 bits) of 0x00000001, 0x00000002, 0x00000003, 0x00000004 into vec_int.
21 stores the 4 int values (128 bit
64. stop pc4 float result 0 0 for 1 0 i lt MAX_SIZE 2 1 2 1 float i 1 1 float i ab i float i ab i 1 float i if TRACE START TRACING result dot_product amp aa amp ab MAX SIZE result dot p vec 1 amp aa amp ab MAX SIZE result dot p vec 2 amp aa amp ab MAX SIZE STOP TRACING return int result else start pmon 1 2 1 2 1 2 1 15 56 23 Software Analysis on Genesi Pegasos Using PMON and AltiVec Rev 0 1 Freescale Semiconductor PRELIMINARY SUBJECT TO CHANGE WITHOUT NOTICE 45 More Advanced Examples 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 printf Scalar function timing n Nt start_pc4 read 744x upmc4 start pc3 read 744x upmc3 START TIMER for 1 0 1 lt i result dot product aa ab MAX SIZE STOP TIMER Stop pc3 read 744x upmc3 Stop pc4 read 744x upmc4 printf d t ins d NC d NC 54 stop time start time stop ins start ins stop pc3 start pc3 stop pc4 start pc4 printf Output f n n result printf Parallel version Xn NEM start pc4 read 744x upmc4 start pc3 read 744x upmc3 START TIMER for 1 0 1 lt i result dot p vec_1 vector
65. t is likely to be faster, assuming it is used to compare two arrays.

int Max(int a, int b)
{
    int result;
    if (a < b) result = b;
    else result = a;
    return result;
}

// Return the maximum of two vector integers
// result = (a & ~mask) | (b & mask)
vector signed int Max(vector signed int a, vector signed int b)
{
    vector bool int mask = vec_cmplt(a, b); // a < b ?
    vector signed int result = vec_sel(a, b, mask); // Select a or b
    return result;
}

Figure 5. Scalar and Vector Method of Finding the Maximum of Two Numbers

Figure 5 is an example of finding the maximum of two numbers, which uses a simple comparison to determine which number is larger. The code used for this example compares three methods for determining the maximum of two numbers. The major assumption is that eliminating the need for branches will improve our performance. The fact that we process 4 times the data per iteration with AltiVec is a second-order effect. The difference between the two functions Max and Max_p is mere semantics: even though they mean exactly the same thing, they cause different compiler analysis for some compilers, resulting in different code. We used GCC V3.3.3 with an optimization level of -O2. Max_vec uses AltiVec vector code.

guest@debian:~/fae-training-04/library/branches$ cat Makefile
test: branches.c pmon.c
gcc -maltivec -mabi=altivec -O2 pmon.c branches
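The compare-then-select pattern used by the vector version can be tried out in plain C on a single 32-bit value. The sketch below models one lane of the vec_cmplt and vec_sel pair using the formula result = (a & ~mask) | (b & mask). It is an illustrative scalar model only, and the name max_select is hypothetical; it is not one of the three functions measured in this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Branch-free maximum of two 32-bit integers, modeling one vector lane.
 * mask is all ones when a < b (like one element of vec_cmplt) and all
 * zeros otherwise; the select step then mirrors vec_sel(a, b, mask). */
int32_t max_select(int32_t a, int32_t b)
{
    uint32_t ua, ub, mask, r;
    int32_t out;

    memcpy(&ua, &a, sizeof ua);          /* reinterpret bits, no conversion */
    memcpy(&ub, &b, sizeof ub);

    mask = (uint32_t)0 - (uint32_t)(a < b);   /* 0xFFFFFFFF or 0x00000000 */
    r = (ua & ~mask) | (ub & mask);           /* pick b where mask is set */

    memcpy(&out, &r, sizeof out);
    assert(out == a || out == b);
    return out;
}
```

Whether a compiler turns the scalar Max into a branch or into similar select-style code depends on the compiler and optimization level, which is the effect the measurements in this example are meant to expose.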
66. t for which PMCs based on cpuinfo hardcode event number for each Software Analysis on Genesi Pegasos Using PMON and AltiVec Rev 0 1 Freescale Semiconductor PRELIMINARY SUBJECT TO CHANGE WITHOUT NOTICE 27 Using the Performance Monitors for Performance Gathering 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 printf Please choose PMC events Mn for i 0 1i lt total_pmc i printf 9 event t i if getline amp textLine amp len stdin 1 sscanf textLine d amp pmc sel il printf dWMn pmc sel il 72 pmc_sel 0 pi pmc_sel 1 p2 pmc_sel 2 p3 pmc_sel 3 p4 pmc_sel 4 0 pmc_sel 5 0 for 1 0 1i lt total_pmc i printf Monitoring events are PMC d d n i sel il fd open dev pmon O RDWR printf ERR unable to open device dev pmon n return 0 Write to pmc selection information to pmon device driver write fd sel sizeof pmc sel for i 0 1 lt 10 1 byteCount read fd amp cycles sizeof int Software Analysis on Genesi Pegasos Using PMON and AltiVec Rev 0 1 28 PRELIMINARY SUBJECT TO CHANGE WITHOUT NOTICE Freescale Semiconductor Using the Performance Monitors for Perfor
67. tion: AES, DES, 3DES, MD5, Kasumi
OS enablement: VxWorks elements

These applications will be available soon:
MultiMedia: MPEG4
Printer: Scaling, Rotation
Networking: OSPF
Encryption: RSA
Wireless network: 802.11
LZO
Mathematical primitives: Extension of LibC, Matrix math, LargeNumber Lib

For assistance or answers to any question on the information that is presented in this document, send an e-mail to risc10@freescale.com.

9 References

The following documents describe the various applications of the Genesi Pegasos system:
1. Freescale application note AN2666, Genesi Pegasos Setup
2. Freescale application note AN2736, Genesi Pegasos II Boot Options
3. Freescale application note AN2738, Genesi Pegasos II Firmware
4. Freescale application note AN2739, Genesi Pegasos II Debian Linux
5. Freescale application note AN2744, PMON Module: An Example of Writing Kernel Module Code for Debian 2.6 on Genesi Pegasos II
6. AltiVec Technology Programming Environments Manual (ALTIVECPEM/D), Rev. 1
7. AltiVec Programming Interface Manual (ALTIVECPIM/D), Rev. 1
8. MPC7450 RISC Microprocessor Family User's Manual

10 Revision History

Table 1 provides a revision history for this application note.

Table 1. Document Revision History
68. tor, Inc., 2004. All rights reserved.

Contents
Introduction
Terminology
Introduction to Compiling on Linux with GCC
Defining and Using an AltiVec Vector
Using the Performance Monitors for Performance Gathering
Using the PMON Facility
More Advanced Examples
References
Revision History

2 Terminology

The following terms are used in this document:
Linux OS: Linux operating system
PMON: Performance monitor facility
GCC: GNU compiler collection and GNU utilities
Performance monitors: MPC74xx processors contain registers that can be used to monitor system activity
AltiVec: Processing engine on the MPC74xx processors that allows for SIMD functionality
SIMD: Single instruction, multiple data paths
POSIX: Portable operating system interface; a standardization effort that was formerly run by the POSIX standards committee
OEA: PowerPC operating environment architecture, which defines supervisor-level resources typically required by an operating system

3 Introduction to Compiling on Linux with GCC

The GNU native tool chain is available on the Genesi Pegasos II system with both the Debian and Yellow Dog Linux distribution
69. tor unsigned char *where)
{
    vector unsigned char permuteVector = vec_lvsr(0, (int *)where);
    vector unsigned char low, high, tmp, mask;
    vector unsigned char ones = vec_splat_u8(0xff);
    vector unsigned char zeroes = vec_splat_u8(0);
    low = vec_ld(0, where); // Load the surrounding area
    high = vec_ld(16, where);
    // Make a mask for which parts of the vectors to swap out
    mask = vec_perm(zeroes, ones, permuteVector);
    tmp = vec_perm(tmp, tmp, permuteVector); // Right rotate our input data
    low = vec_sel(tmp, low, mask); // Insert masked data to aligned vector
    high = vec_sel(high, v, mask);
    vec_st(low, 0, where); // Store aligned results
    vec_st(high, 16, where);
}

3.6 An AltiVec Program Demonstrating the Use of PMON for Obtaining Performance Statistics

The code for this example is the same as that used in Section 3.5, An AltiVec Alignment Program Demonstrating Alignment Considerations. What follows discusses the lines that are associated with using the pmon.c code facility. A more complete discussion of the PMON facility is given in Section 5, Using the Performance Monitors for Performance Gathering.
10 through 17 define a macro that can be used to turn on performance monitor gathering; the fu
70. tribute__((aligned(16)));
unsigned char bitbuf8[16] __attribute__((aligned(16))) = {
#include "table.txt"
};

where the file table.txt is in the local directory and contains some constant data. In this example, every variable of data type Long Vector will be aligned on quad-word boundaries, and bitBuf8 is also aligned. In this case, bitBuf8 will be filled with the data in the table.txt file that is in this local directory; i.e., including table.txt is another way of initializing an array. Data alignment is absolutely critical for mapping algorithms on AltiVec.

3.5.1.2 Obtaining Data Alignment for AltiVec with a Function

Loading unaligned data using the function vectorLoadUnaligned requires loading twice the data you really need, which is less efficient than just aligning with the compiler attribute aligned(16).

vector unsigned char vectorLoadUnaligned(vector unsigned char *v)
{
    vector unsigned char permuteVector = vec_lvsl(0, (int *)v);
    vector unsigned char low = vec_ld(0, v);
    vector unsigned char high = vec_ld(16, v);
    return vec_perm(low, high, permuteVector);
}

3.5.2 Obtaining Data Alignment for AltiVec with a More Efficient Function

This function is more efficient than the previous one (vectorLoadUnaligned), but is still less efficient than just aligning the data with the compiler attribute aligned(16).

void vectorStoreUnaligned(vector unsigned char v, vec
71. tructions per cycle.

In this application note, we have seen the use of counting cycles and processor instructions, which, as can be seen from the above table, are counter 1 (processor cycles) and counter 2 (instructions completed).

5.2 A Code Example Using the PMON Facility to Gather Performance Statistics

As described earlier, the align.c program is linked with the pmon.c program, which supplies the calls to the PMON facility, described in Section 6, Using the PMON Facility. Looking in the directory /home/guest/fae-training-04/library/align, we see two .c files: align.c and pmon.c.

5.2.1 Makefile

The Makefile shown below compiles align.c and pmon.c and links them together. Since the target (line 1) is named test, the gcc line (line 2) generates an ELF executable named test, and that is the file we execute with the ./test command. There is a bug in this makefile: in the clean target (line 4) we rm pmon_test; however, the Makefile generates the file test, so the clean target does not work. Change pmon_test to test in the clean target, and the clean target will work as expected, which is to remove the executable.

guest@debian:~/fae-training-04/library/align$ cat -n Makefile
1 test: align.c pmon.c
2 gcc -maltivec -mabi=altivec pmon.c align.c -o $@
3 clean:
4 rm -rf *.o pmon_test
72. ts are PMC 3 2
9 Monitoring events are PMC 4 0
10 Monitoring events are PMC 5 0
11 ERR unable to open device /dev/pmon
12 00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f
13 0 Instructions 0 Cycles nan IPC
14 00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f
15 0 Instructions 0 Cycles nan IPC

In this case, lines 13 and 15 do not give valid answers, because at line 11 the module gave an error: either pmon26.ko has not been started, or /dev/pmon does not exist, or /dev/pmon has the wrong permissions (they must be 777, all permissions).

5.2.4 All These Conditions Must Be Met for the PMON Facility to Work

1. The module pmon26.ko must be built.
2. The module pmon26.ko must be installed (insmod pmon26.ko).
3. /dev/pmon must be created (mknod /dev/pmon c node-number 0).
4. The permissions must be 777 (chmod 777 /dev/pmon).

The node number can be determined from the /proc/devices file. After the insmod pmon26.ko, look at the /proc/devices file and find the entry for PMON; the node number will be displayed. Then enter the mknod command. It may be necessary to remove the current /dev/pmon entry if it does not correspond to the /proc/devices id number as listed, for example:

root@debian:~/ppct
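The three failure causes listed above can usually be told apart by the error code that open() reports. The following is a minimal diagnostic sketch, not part of pmon.c: the function name pmon_device_check is hypothetical, and the mapping of errno values to causes (ENOENT for a missing node, EACCES for wrong permissions, ENXIO or ENODEV for a node with no loaded driver behind it) is typical Linux behavior rather than something this note specifies.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Try to open a device node read/write.
 * Returns 0 on success, otherwise the errno value from open(). */
int pmon_device_check(const char *path)
{
    int fd;

    assert(path != NULL);
    fd = open(path, O_RDWR);
    if (fd < 0) {
        int err = errno;   /* save before any other library call */
        fprintf(stderr, "cannot open %s: %s\n", path, strerror(err));
        return err;
    }
    close(fd);
    return 0;
}
```

Calling pmon_device_check("/dev/pmon") before starting a measurement would let a program stop early instead of printing 0 instructions, 0 cycles, and a nan IPC.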
73. y type in the command as shown in the second example above, in Section 3.2, A Simple AltiVec Hello World Program:

gcc -maltivec -mabi=altivec vecChar.c
./a.out

2. Create a shell script that is easier to remember and type. Use any editor to create a file, call it compile, add the gcc command, and change the permissions to allow execution. The contents of the file can just be input from the cat command; $1 indicates the first parameter and $2 indicates the second parameter. In this case we have only one parameter, vecChar.c.

cat > compile
gcc -maltivec -mabi=altivec $1 $2
^d (i.e., a control-d on the keyboard)
chmod 777 compile
./compile vecChar.c
./a.out

3. Create a Makefile. Edit the file. The permissions do not need to be changed. However, type a tab character, not 8 spaces, in front of the make commands. So ensure that, while typing the contents of Makefile, indentations are made with the tab key.

cat > Makefile
make: vecChar.c
	gcc -maltivec -mabi=altivec vecChar.c
clean:
	rm -rf *.o a.out
^d (i.e., control-d)
make clean
make
./a.out

In any case, the output we see is this:

guest@debian:~/fae-training-04/library/maurie$ ./a.out
Hello AltiVec from vecChar
vec_a 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d
74.
Method               Cycles   Instructions  IPC        Bytes/Cycle  Speedup   L1D Hits  L1D Miss  L1D Access
Scalar               2830757  4871217       1.7208178  0.090435173  1         541155    1922
Big_Lookup           1454818  1285741       0.8837813  0.175967028  1.945781  512332    30
Small_Lookup         2360056  2403152       1.0182606  0.108472002  1.199445  768249    6
Vector_Small_Lookup  95390    156143        1.6368907  2.683719467  29.67562  37047     34

7.4 Constant Generations

/home/guest/fae-training-04/library/constant_gen is an example of generating constant data in vectors. The first case uses compiler-generated code in a declaration, which generates code to replicate the constants; the other uses the vec_splat AltiVec instruction to replicate the same constant into an array.

vector unsigned char add_constants_1()
{
    vector unsigned char vec_a = {5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
    vector unsigned char vec_b = {7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7};
    vector unsigned char vec_c = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1};
    vec_c = vec_add(vec_a, vec_c);
    vec_c = vec_add(vec_b, vec_c);
    return vec_c;
}

vector unsigned char add_constants_2()
{
    vector unsigned char vec_a = vec_splat_u8(5);
    vector unsigned char vec_b = vec_splat_u8(7);
    vector unsigned char vec_c = vec_splat_u8(1);
    vec_c = vec_add(vec_a, vec_c);
    vec_c = vec_add(vec_b, vec_c);
    return vec_c;
}
