Intel Fortran Compiler for Linux* Systems User's Guide
Unnamed critical sections use the global lock from the Pthread package. This allows you to synchronize with other code by using the same lock. Named locks are created and maintained by the compiler and can be significantly more efficient.

FLUSH Directive

Use the FLUSH directive to identify a synchronization point at which a consistent view of memory is provided. Thread-visible variables are written back to memory at this point. To avoid flushing all thread-visible variables at this point, include a list of comma-separated named variables to be flushed.

The following example uses the FLUSH directive for point-to-point synchronization between thread 0 and thread 1 for the variable ISYNC:

!$OMP PARALLEL DEFAULT(PRIVATE), SHARED(ISYNC)
      IAM = OMP_GET_THREAD_NUM()
      ISYNC(IAM) = 0
!$OMP BARRIER
      CALL WORK()
!     I Am Done With My Work, Synchronize With My Neighbor
      ISYNC(IAM) = 1
!$OMP FLUSH(ISYNC)
!     Wait Till Neighbor Is Done
      DO WHILE (ISYNC(NEIGH) .EQ. 0)
!$OMP FLUSH(ISYNC)
      END DO
!$OMP END PARALLEL

MASTER and END MASTER

Use the MASTER and END MASTER directives to identify a block of code that is executed only by the master thread. The other threads of the team skip the code and continue execution. There is no implied barrier at the END MASTER directive. In the foll
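The MASTER example the text introduces is cut off in this copy. What follows is a minimal sketch of what such a block typically looks like; the routine name INIT_ONCE and the PRINT statement are our illustration, not taken from the original.

```fortran
      SUBROUTINE INIT_ONCE()
      USE OMP_LIB
!$OMP PARALLEL DEFAULT(SHARED)
!$OMP MASTER
!     Only the master thread (thread 0) executes this block; the
!     other threads skip it and continue execution, since there is
!     no implied barrier at END MASTER.
      PRINT *, 'setup done by thread ', OMP_GET_THREAD_NUM()
!$OMP END MASTER
!$OMP END PARALLEL
      END SUBROUTINE
```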
void _PGOPTI_Prof_Reset(void);

Recommended usage: use this function to clear the profile counters prior to collecting profile information on a section of the instrumented application. See the example under _PGOPTI_Prof_Dump().

Dumping and Resetting Profile Information

The _PGOPTI_Prof_Dump_And_Reset() function dumps the profile information to a new .dyn file and then resets the dynamic profile counters. The execution of the instrumented application then continues. The prototype of this function is:

void _PGOPTI_Prof_Dump_And_Reset(void);

This function is used in non-terminating applications and may be called more than once.

Recommended usage: periodic calls to this function enable a non-terminating application to generate one or more profile information files (.dyn files). These files are merged during the feedback phase (phase 3) of profile-guided optimizations. The direct use of this function enables your application to control precisely when the profile information is generated.

Interval Profile Dumping

The _PGOPTI_Set_Interval_Prof_Dump() function activates Interval Profile Dumping and sets the approximate frequency at which dumps occur. The prototype of the function call is:

void _PGOPTI_Set_Interval_Prof_Dump(int interval);

This function is used in non-terminating applications. The interval parameter specifies the time interval
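From Fortran instrumented code, these C runtime routines can be reached through an explicit interface. The sketch below is an assumption about the binding, not taken from the original; the symbol decoration and the DO_TRANSACTION work routine are illustrative and may need adjusting for your system.

```fortran
      PROGRAM SERVER_LOOP
!     Assumed interface to the PGO runtime routine; the leading
!     underscore in the alias follows the C prototype shown above.
      INTERFACE
        SUBROUTINE PGOPTI_PROF_DUMP_AND_RESET()
        !DEC$ ATTRIBUTES C, ALIAS:'_PGOPTI_Prof_Dump_And_Reset' :: PGOPTI_PROF_DUMP_AND_RESET
        END SUBROUTINE
      END INTERFACE
      INTEGER :: I
      DO I = 1, 3
        CALL DO_TRANSACTION()              ! illustrative work routine
        CALL PGOPTI_PROF_DUMP_AND_RESET()  ! write a .dyn file, clear counters
      END DO
      END PROGRAM
```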
      DO I = 1, MAX, BS
        DO J = 1, MAX, BS
          DO II = I, I + BS - 1
            DO JJ = J, J + BS - 1
              A(II,JJ) = A(II,JJ) + B(JJ,II)
            ENDDO
          ENDDO
        ENDDO
      ENDDO

Statements in the Loop Body

The vectorizable operations are different for floating-point and integer data.

Floating-point Array Operations

The statements within the loop body may be REAL operations (typically on arrays). Arithmetic operations supported are addition, subtraction, multiplication, division, negation, square root, MAX, MIN, and mathematical functions such as SIN and COS. Note that conversion to/from some types of floats is not valid. Operation on DOUBLE PRECISION types is not valid, unless optimizing for an Intel® Pentium® 4 and Intel® Xeon(TM) processors system, and Intel® Pentium® M processor, using the -xW or -axW compiler option.

Integer Array Operations

The statements within the loop body may be arithmetic or logical operations (again, typically for arrays). Arithmetic operations are limited to such operations as addition, subtraction, ABS, MIN, and MAX. Logical operations include bitwise AND, OR, and XOR operators. You can mix data types only if the conversion can be done without a loss of precision. Some example operators where you can mix data types are multiplication, shift, or unary operators.

Other Operations

No stat
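As a concrete illustration of the rules above, a single-precision loop of the following shape uses only the listed vectorizable operations (addition and multiplication on unit-stride REAL arrays). The routine name and shape are our sketch, not from the original.

```fortran
      SUBROUTINE SAXPYISH(A, B, N)
!     REAL adds and multiplies on unit-stride arrays are among
!     the floating-point operations listed as vectorizable.
      INTEGER N, I
      REAL A(N), B(N)
      DO I = 1, N
        A(I) = A(I) + 2.0 * B(I)
      ENDDO
      END
```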
The optimizer logical names and the optimizations they cover are:

ipo_inl — Interprocedural Optimizer, inline expansion of functions
ipo_cp — Interprocedural Optimizer, copy propagation
hlo_unroll — High-level Language Optimizer, loop unrolling
hlo_prefetch — High-level Language Optimizer, prefetching
ilo_copy_propagation — Intermediate Language Scalar Optimizer, copy propagation
ecg_swp — Itanium®-based Compiler Code Generator, software pipelining

The following command generates a report for the Itanium®-based Compiler Code Generator (ecg):

ifort -c -opt_report -opt_report_phase ecg myfile.f

where:

- -c tells the compiler to stop at generating the object code, not linking
- -opt_report invokes the report generator
- -opt_report_phase ecg indicates the phase (ecg) for which to generate the report; the space between the option and the phase is optional

The entire name for a particular optimization within an optimizer need not be specified in full; just a few characters is sufficient. All optimization reports that have a matching prefix with the specified optimizer are generated. For example, if -opt_report_phase ilo_co is specified, a report from both the constant propagation and the copy propagation is generated.

The Availability of Report Generation

The -opt_report_help option lists the logical names of optimizers and optimizations that are currently available for report generation. For IA-32 sy
After vectorization, the loop is executed as shown in the figure below.

[Figure: Vector and Scalar Clean-up Iterations — 2 vector iterations (i = 1..4 and i = 5..8) followed by 2 clean-up iterations in scalar mode (i = 9, 10).]

Both the vector iterations A(1:4) = B(1:4) and A(5:8) = B(5:8) can be implemented with aligned moves if both the elements A(1) and B(1) are 16-byte aligned.

Caution: If you specify the vectorizer with incorrect alignment options, the compiler will generate code with unexpected behavior. Specifically, using aligned moves on unaligned data will result in an illegal instruction exception.

Alignment Strategy

The compiler has at its disposal several alignment strategies in case the alignment of data structures is not known at compile time. A simple example is shown below (several other strategies are supported as well). If, in the loop shown below, the alignment of A is unknown, the compiler will generate a prelude loop that iterates until the array reference that occurs the most hits an aligned address. This makes the alignment properties of A known, and the vector loop is optimized accordingly. In this case, the vectorizer applies dynamic loop peeling, a specific Intel® Fortran feature.

Example of Data Alignment

Original loop:

      SUBROUTINE DOIT(A)
      REAL A(100)     ! alignment of argument A is unknown
      DO I = 1, 100
        A(I) = A(I) + 1.0
      E
OMP_SCHEDULE — Specifies the type of run-time scheduling.

Auto-parallelization: Threshold Control and Diagnostics

Threshold Control

The -par_threshold{n} option sets a threshold for auto-parallelization of loops based on the probability of profitable execution of the loop in parallel. The value of n can be from 0 to 100. The default value is 100. The -par_threshold{n} option should be used when the computation work in loops cannot be determined at compile time. The meaning for various values of n is as follows:

- n = 100: Parallelization will only proceed when performance gains are predicted based on the compiler analysis data. This is the default; this value is used when -par_threshold{n} is not specified on the command line or is used without specifying a value of n.
- n = 0 (-par_threshold0 is specified): The loops get auto-parallelized regardless of computation work volume, that is, parallelize always.
- The intermediate values 1 to 99 represent the percentage probability for profitable speed-up. For example, n = 50 means: parallelize only if there is a 50% probability of the code speeding up if executed in parallel.

The compiler applies a heuristic that tries to balance the overhead of creating multiple threads versus the amount of work available to be shared amongst the threads.

Diagnostics

The -par_report{0|1|2|3} option controls the aut
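A hedged sketch of the kind of loop these options govern; the routine and the command in the comment are our illustration. With a trip count unknown at compile time, the threshold heuristic decides whether parallelizing pays off.

```fortran
      SUBROUTINE SQUARES(A, S, N)
!     No cross-iteration dependences, so the auto-parallelizer
!     (e.g. "ifort -parallel -par_report3 file.f", an illustrative
!     command line) can consider this loop; -par_threshold0 would
!     force parallelization even though N is unknown here.
      INTEGER N, I
      REAL A(N), S(N)
      DO I = 1, N
        S(I) = A(I) * A(I)
      ENDDO
      END
```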
Returns .TRUE. if nested parallelism is enabled; otherwise returns .FALSE.

Lock Routines

subroutine omp_init_lock(lock)
integer (kind=omp_lock_kind) :: lock
Initializes the lock associated with lock for use in subsequent calls.

subroutine omp_destroy_lock(lock)
integer (kind=omp_lock_kind) :: lock
Causes the lock associated with lock to become undefined.

subroutine omp_set_lock(lock)
integer (kind=omp_lock_kind) :: lock
Forces the executing thread to wait until the lock associated with lock is available. The thread is granted ownership of the lock when it becomes available.

subroutine omp_unset_lock(lock)
integer (kind=omp_lock_kind) :: lock
Releases the executing thread from ownership of the lock associated with lock. The behavior is undefined if the executing thread does not own the lock associated with lock.

logical omp_test_lock(lock)
integer (kind=omp_lock_kind) :: lock
Attempts to set the lock associated with lock. If successful, returns .TRUE.; otherwise returns .FALSE.

subroutine omp_init_nest_lock(lock)
integer (kind=omp_nest_lock_kind) :: lock
Initializes the nested lock associated with lock for use in the subsequent calls.

subroutine omp_destroy_nest_lock(lock)
integer (kind=omp_nest_lock_kind) :: lock
Causes the nested lock associated with lock to become undefined.
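A minimal usage sketch of the simple lock routines; the routine name and the counting loop are our illustration, and the OMP_LIB module is assumed to provide the kind constants and interfaces.

```fortran
      SUBROUTINE COUNT_SAFELY(N)
      USE OMP_LIB            ! assumed module providing omp_lock_kind
      INTEGER N, I, TOTAL
      INTEGER (KIND=OMP_LOCK_KIND) :: LCK
      CALL OMP_INIT_LOCK(LCK)
      TOTAL = 0
!$OMP PARALLEL DO SHARED(TOTAL, LCK)
      DO I = 1, N
        CALL OMP_SET_LOCK(LCK)     ! serialize the shared update
        TOTAL = TOTAL + 1
        CALL OMP_UNSET_LOCK(LCK)
      ENDDO
      CALL OMP_DESTROY_LOCK(LCK)
      PRINT *, TOTAL
      END
```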
[Figure: Phases of the test-prioritization process — Step 2: run the instrumented executables on Test 1 through Test n and merge the dynamic profile information (.dyn files); Step 3: run the test prioritizer.]

Here are the steps for a simple example (myApp.f90) for IA-32 systems:

1. Set PROF_DIR to the directory that will hold the profile files.
2. Issue the following command:

   ifort -prof_genx myApp.f90

   This command compiles the program and generates the instrumented binary myApp, as well as the corresponding static profile information pgopti.spi.
3. Issue the following command:

   rm $PROF_DIR/*.dyn

   Make sure that there are no unrelated .dyn files present.
4. Issue the following command:

   myApp < data1

   Invocation of this command runs the instrumented application and generates one or more new dynamic profile information files that have an extension .dyn in the directory specified by PROF_DIR.
5. Issue the following command:

   profmerge -prof_dpi Test1.dpi

   At this step, the profmerge tool merges all the .dyn files into one file (Test1.dpi) that represents the total profile information of the application on Test1.
6. Issue the following command:

   rm $PROF_DIR/*.dyn

   Make sure that there are no unrelated .dyn files present.
7. Issue the following command:

   myApp < data2

   This command runs the instrumented application and generates one or more new dynamic profile information files that have
Uses an array temporary. Does not pass an array descriptor. Interface block optional.

Improving I/O Performance

Improving overall I/O performance can minimize both device I/O and actual CPU time. The techniques listed in this topic can significantly improve performance in many applications. I/O flow problems limit the maximum speed of execution by being the slowest process in an executing program. In some programs, I/O is the bottleneck that prevents an improvement in run-time performance. The key to relieving I/O problems is to reduce the actual amount of CPU and I/O device time involved in I/O.

The problems can be caused by one or more of the following:

- A dramatic reduction in CPU time without a corresponding improvement in I/O time
- Such coding practices as:
  - Unnecessary formatting of data and other CPU-intensive processing
  - Unnecessary transfers of intermediate results
  - Inefficient transfers of small amounts of data
- Application requirements

Improved coding practices can minimize actual device I/O, as well as the actual CPU time. Intel offers software solutions to system-wide problems like minimizing device I/O delays.

Use Unformatted Files Instead of Formatted Files

Use unformatted files whenever possible. Unformatted I/O of numeric data is more efficient and more precise than formatted
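A hedged sketch contrasting the two forms; the unit numbers and file names are illustrative. The unformatted WRITE transfers the binary representation directly, while the formatted WRITE converts every value to text.

```fortran
      PROGRAM WRITEFAST
      REAL X(1000)
      X = 1.0
!     Unformatted: binary transfer, no conversion cost, no
!     precision loss on the round trip.
      OPEN (UNIT=10, FILE='data.unf', FORM='UNFORMATTED')
      WRITE (10) X
      CLOSE (10)
!     Formatted: every value is converted to text - slower,
!     and the decimal representation can lose precision.
      OPEN (UNIT=11, FILE='data.txt', FORM='FORMATTED')
      WRITE (11, '(8F10.4)') X
      CLOSE (11)
      END
```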
[gdb listing, partially unreadable in this copy: a backtrace showing parallel_ at parallel.f:27 called from __libc_start_main (main=<parallel_>, argc=1, ...) at sysdeps/generic/libc-start.c:129.]

Switching from One Thread to Another

[gdb listing: "info threads" output showing four threads — threads 4 and 1 executing _padd__6__par_loop at parallel.f:13 (the current thread marked with *), one thread in __libc_nanosleep from /lib/i686/libc.so.6, and one thread in poll (fds=0x80abd5c, nfds=1, timeout=2000).]

Call Stack Dump of Master Thread upon Entry to Parallel Region

[gdb listing: "bt" output showing the master thread in _padd__6__par_loop, invoked through __kmpc_invoke_task_func, while the worker threads are suspended in sigsuspend and pthread_cond_wait inside the OpenMP runtime and thread manager.]

Example 2: Debugging Code Using Multiple Threads with Shared Variables

Subroutine PADD Machine Cod
[Code-coverage screenshot residue: a sample C source listing with covered and uncovered functions highlighted, alongside a summary table of covered functions with per-function coverage percentages and covered/total block counts.]

Setting the Coloring Scheme for the Code Coverage

The tool provides a visible coloring distinction of the following coverage categories:

- covered code
- uncovered basic blocks
- uncovered functions
- partially covered code
- unknown

The default colors that the tool uses for presenting the coverage information are shown in the table that follows.

Covered code — The portion of code colored in this color was exercised by the tests. The default color can be overridden with the -ccolor option.

Uncovered basic blocks — Basic blocks that are colored in this color were not exercised by any of the tests. They were, however, within functions that were executed during the tests. The default color can be overridden with the -bcolor option.

Uncovered functions — Functions that are colored in this color were never called during the tests. The default color can be overridden with the -fcolor option.

Partially covered code — More than one basic block was generated for the code at this position. Some of the blocks were covered w
Note: The visibility options are supported by both IA-32 and Itanium compilers, but currently the optimization benefits are for Itanium-based systems only.

Global Symbols and Visibility Attributes

A global symbol is a symbol that is visible outside the compilation unit in which it is declared (a compilation unit is a single source file with its include files). Each global symbol definition or reference in a compilation unit has a visibility attribute that controls how it may be referenced from outside the component in which it is defined. The values for visibility are defined in the table that follows.

EXTERN — The compiler must treat the symbol as though it is defined in another component. This means that the compiler must assume that the symbol will be overridden (preempted) by a definition of the same name in another component. (See Symbol Preemption.) If a function symbol has external visibility, the compiler knows that it must be called indirectly, and can inline the indirect call stub.

DEFAULT — Other components can reference the symbol. Furthermore, the symbol definition may be overridden (preempted) by a definition of the same name in another component.

PROTECTED — Other components can reference the symbol, but it cannot be preempted by a definition of the same name in another component.

HIDDEN — Other components cannot directly reference t
Scalar Replacement (IA-32 Only)

The goal of scalar replacement is to reduce memory references. This is done mainly by replacing array references with register references. While the compiler replaces some array references with register references when -O1 or -O2 is specified, more aggressive replacement is performed when -O3 (-scalar_rep) is specified. For example, with -O3 the compiler attempts replacement when there are loop-carried dependences or when data-dependence analysis is required for memory disambiguation.

-scalar_rep[-] — Enables (default) or disables scalar replacement performed during loop transformations (requires -O3).

Loop Unrolling with -unroll[n]

The -unroll[n] option is used in the following way:

- -unroll{n} specifies the maximum number of times you want to unroll a loop. The following example unrolls a loop at most four times:

  ifort -unroll4 a.f

  To disable loop unrolling, specify n as 0. On IA-32 systems, specifying 0 also disables the vectorizer's unroller, except for the unrolling required to resolve cache-line split penalties. The following example disables loop unrolling:

  ifort -unroll0 a.f

- -unroll (n omitted) lets the compiler decide whether to perform unrolling or not. This is the default; the compiler uses default heuristics or defines n.
- -unroll0 (n = 0) disables the unroller.

The Itanium compiler curre
The high-level optimizations include loop interchange, loop fusion, loop unrolling, loop distribution, unroll-and-jam, blocking, data prefetch, scalar replacement, data-layout optimizations, and loop-unrolling techniques. The option that turns on the high-level optimizations is -O3. The scope of optimizations turned on by -O3 is different for IA-32 and Itanium-based applications. See Setting Optimization Levels.

IA-32 and Itanium-based Applications

The -O3 option enables the -O2 option and adds more aggressive optimizations, for example, loop transformation and prefetching. -O3 optimizes for maximum speed, but may not improve performance for some programs.

IA-32 Applications

In conjunction with the vectorization options -ax{K|W|N|B|P} and -x{K|W|N|B|P}, the -O3 option causes the compiler to perform more aggressive data-dependency analysis than for the default -O2. This may result in longer compilation times.

Itanium-based Applications

The -ivdep_parallel option asserts there is no loop-carried dependency in the loop where the IVDEP directive is specified. This is useful for sparse matrix applications.

Key Techniques to Tune Your Itanium-based Applications

Follow these steps to tune applications on Itanium-based systems:

1. Compile your program with -O3 and -ipo. Use profile-guided optimization whenever possible.
2. Identify hot spots in your code.
- The control variable must be an integer. The control variable cannot be a dummy argument or contained in an EQUIVALENCE or VOLATILE statement. Intel Fortran must be able to determine that the control variable does not change unexpectedly at run time.
- The format must not contain a variable format expression.

For information on the VOLATILE attribute and statement, see the Intel® Fortran Language Reference. For loop optimizations, see Loop Transformations, Loop Unrolling, and Optimization Levels.

Use of Variable Format Expressions

Variable format expressions (an Intel Fortran extension) are almost as flexible as run-time formatting, but they are more efficient because the compiler can eliminate run-time parsing of the I/O format. Only a small amount of processing and the actual data transfer are required during run time. On the other hand, run-time formatting can impair performance significantly. For example, in the following statements, S1 is more efficient than S2 because the formatting is done once at compile time, not at run time:

S1:   WRITE (6,400) (A(I), I=1,N)
400   FORMAT (1X,<N>F5.2)

S2:   WRITE (CHFMT,500) '(1X,', N, 'F5.2)'
500   FORMAT (A,I3,A)
      WRITE (6,FMT=CHFMT) (A(I), I=1,N)

Efficient Use of Record Buffers and Disk I/O

Records being read or written are transferred between the user's program buffers
General Directives, in the Intel® Fortran Language Reference.

Loop Distribution Directive

The DISTRIBUTE POINT directive indicates a preference for performing loop distribution. Loop distribution may cause large loops to be distributed into smaller ones. This may enable more loops to get software-pipelined. If the directive is placed inside a loop, the distribution is performed after the directive, and any loop-carried dependency is ignored. If the directive is placed before a loop, the compiler will determine where to distribute, and data dependency is observed. Currently only one DISTRIBUTE POINT directive is supported if it is placed inside the loop.

!DEC$ DISTRIBUTE POINT
      do i = 1, m
        b(i) = a(i) + 1
        c(i) = a(i) + b(i)   ! Compiler will decide where to
        d(i) = c(i) + 1      ! distribute. Data dependency is observed.
      enddo

      do i = 1, m
        b(i) = a(i) + 1
!DEC$   DISTRIBUTE POINT
        call sub(a, n)       ! Distribution will start here,
        c(i) = a(i) + b(i)   ! ignoring all loop-carried dependency.
        d(i) = c(i) + 1
      enddo

For more details on this directive, see the Directive Enhanced Compilation section, General Directives, in the Intel® Fortran Language Reference.

Loop Unrolling Support

The UNROLL[n] directive tells the compiler how many times to unroll a counted loop. The n is an integer constant from 0 through 255. The UNROLL directive must prec
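A hedged sketch of the UNROLL directive in use; the routine name and loop body are our illustration, not from the original.

```fortran
      SUBROUTINE ADD4(A, B, N)
      INTEGER N, I
      REAL A(N), B(N)
!DEC$ UNROLL(4)
      DO I = 1, N
!       Request four-way unrolling of this counted loop;
!       UNROLL(0) would suppress unrolling instead.
        A(I) = A(I) + B(I)
      ENDDO
      END
```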
Intel® Fortran Compiler for Linux* Systems User's Guide, Volume II: Optimizing Applications

Document Number: 253260-002

Disclaimer and Legal Information

Information in this document is provided in connection with Intel products. No license, express or implied, by estoppel or otherwise, to any intellectual property rights is granted by this document. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY RELATING TO SALE AND/OR USE OF INTEL PRODUCTS, INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT, OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel products are not intended for use in medical, life-saving, or life-sustaining applications.

This User's Guide, Volume II, as well as the software described in it, is furnished under license and may only be used or copied in accordance with the terms of the license. The information in this manual is furnished for informational use only, is subject to change without notice, and should not be construed as a commitment by Intel Corporation. Intel Corporation assumes no responsibility or liability for any errors or inaccuracies that may appear in this document or any software that may be provided in association with this document. Designers must not rely on the absence or characteristics of any features or in
One of the main benefits of IPO is that it enables more inlining. For information on inlining and the minimum inlining criteria, see Criteria for Inline Function Expansion and Controlling Inline Expansion of User Functions. Inlining and other optimizations are improved by profile information. For a description of how to use IPO with profile information for further optimization, see Example of Profile-Guided Optimization.

Compilation Phase

When using IPO, as each source file is compiled the compiler stores an intermediate representation (IR) of the source code in the object file, which includes summary information used for optimization.

By default, the compiler produces mock object files during the compilation phase of IPO. Generating mock files instead of real object files reduces the time spent in the IPO compilation phase. Each mock object file contains the IR for its corresponding source file, but no real code or data. These mock objects must be linked using the -ipo option in ifort or using the xild tool. See Creating a Multifile IPO Executable with xild.

Note: Failure to link mock objects with ifort and -ipo, or with xild, will result in linkage errors. There are situations where mock object files cannot be used. See Compilation with Real Object Files for more information.

Linkage Phase

When you invoke the linker, adding -ipo to the command line causes the compiler to be invoked a final time befor
28. s to the subroutine parallel_, while the second entry point corresponds to the OpenMP parallel region at line 3.

Debugging Code with Parallel Region

Machine Code Listing of the Subroutine parallel_

[IA-32 assembly listing of parallel_ and its two outlined parallel-region entry points (par_region0 at line 3 and par_region1 at line 6 of parallel_): the listing shows the standard prologue (pushl %ebp / movl %esp, %ebp / subl $44, %esp), the packed source-location structure passed to the OpenMP runtime, the call to __kmpc_fork_call that forks the region, and the call to omp_get_thread_num inside the region body. The listing itself is not recoverable from the damaged scan.]

196 Parallel Programming with Intel Fortran
29. -vec_report1.

Vectorization Reports

The -vec_report[n] option (n = 0, 1, 2, 3, 4, 5) directs the compiler to generate vectorization reports with different levels of information, as follows:

-vec_report0: no diagnostic information is displayed
-vec_report1: display diagnostics indicating loops successfully vectorized (default)
-vec_report2: same as -vec_report1, plus diagnostics indicating loops not successfully vectorized

131 Intel Fortran Compiler for Linux Systems User's Guide Vol. II

-vec_report3: same as -vec_report2, plus additional information about any proven or assumed dependences
-vec_report4: indicate non-vectorized loops
-vec_report5: indicate non-vectorized loops and the reason why they were not vectorized

If you specify -vec_report without a number, the default of -vec_report1 is used.

Usage with Other Options

The vectorization reports are generated in the final compilation phase, when the executable is generated. Therefore, if you use the -c option and a -vec_report{n} option on the command line, no report will be generated. If you use -c, -ipo and -x{K|W|N|B|P} or -ax{K|W|N|B|P}, and -vec_report{n}, the compiler issues a warning and no report is generated. To produce a report when using the above-mentioned options, you need to add the -ipo_obj option. The combination of -c and -ipo_obj produces a single-file compilation and hence does generate object code, and eventually a report is generated. The fo
31. 3. Execute myApp. Each invocation of myApp runs the instrumented application and generates one or more new dynamic profile information files that have an extension .dyn in the directory specified by PROF_DIR.

4. Issue the following command:

ifort -prof_use myApp.f90

At this step, the compiler merges all the .dyn files into one .dpi file representing the total profile information of the application, and generates the optimized binary. The default name of the .dpi file is pgopti.dpi.

Basic PGO Options

The options used for basic PGO optimizations are:

- -prof_gen: to generate instrumented code
- -prof_use: to generate a profile-optimized executable
- -prof_format_32: to produce 32-bit counters for .dyn and .dpi files

In cases where your code behavior differs greatly between executions, you have to ensure that the benefit of the profile information is worth the effort required to maintain up-to-date profiles. In basic profile-guided optimization, the following options are used in the phases of the PGO.

Generating Instrumented Code: -prof_gen

The -prof_gen option instruments the program for profiling to get the execution count of each basic block. It is used in phase 1 of the PGO to instruct the compiler to produce instrumented code in your object files in preparation for instrumented execution. Parallel make is automatically supported for -prof_gen compilations.

Generating a Profile-optimized Ex
32. Caller is the function that contains the call site. Callee is the function being called that might be inlined.

Minimum call site criteria:

- The number of actual arguments must match the number of formal arguments of the callee.
- The number of return values must match the number of return values of the callee.
- The data types of the actual and formal arguments must be compatible.
- No multilingual inlining is permitted. Caller and callee must be written in the same source language.

Minimum criteria for the caller:

- At most 2000 intermediate statements will be inlined into the caller from all the call sites being inlined into the caller. You can change this value by specifying the option -Qoption,f,-ip_ninl_max_total_stats=new_value.
- The function must be called if it is declared as static. Otherwise, it will be deleted.

Minimum criteria for the callee:

- Does not have a variable argument list.
- Is not considered infrequent due to its name. Routines which contain the following substrings in their names are not inlined: abort, alloca, denied, err, exit, fail, fatal, fault, halt, init, interrupt, invalid, quit, rare, stop, timeout, trace, trap, and warn.
- Is not considered unsafe for other reasons.

Selecting Routines for Inlining with or without PGO

Once the above criteria are met, the compiler picks the routines whose inline expansions will provide the greatest benefit to program perfo
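As a hedged illustration of the name-based heuristic above (routine names hypothetical), a callee whose name contains one of the listed substrings is excluded from inlining even if it meets every other criterion:

```fortran
! init_grid contains the substring "init", so the inliner skips it;
! an otherwise identical routine named, say, setup_grid would remain
! a candidate for inline expansion
subroutine init_grid (a, n)
  integer n
  real a(n)
  a = 0.0
end subroutine init_grid
```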
33. !DEC$ IVDEP
    do j = 1, n
      a(j) = b(j) + 1
      b(j) = a(j+m) + 1
    enddo

Example 1 ignores the possible backward dependencies and enables the loop to get software-pipelined. Example 2 shows possible forward and backward dependencies involving array a in this loop, creating a dependency cycle. With IVDEP, the backward dependencies are ignored.

IVDEP has options: IVDEP:LOOP and IVDEP:BACK. The IVDEP:LOOP option implies no loop-carried dependencies. The IVDEP:BACK option implies no backward dependencies. The IVDEP directive is also used with the -ivdep_parallel option for Itanium-based applications. For more details on these directives, see Directive Enhanced Compilation, section General Directives, in the Intel® Fortran Language Reference.

Overriding Vectorizer's Efficiency Heuristics

In addition to the IVDEP directive, there are more directives that can be used to override the efficiency heuristics of the vectorizer: VECTOR ALWAYS, NOVECTOR, VECTOR ALIGNED, VECTOR UNALIGNED, and VECTOR NONTEMPORAL.

The VECTOR directives control the vectorization of the subsequent loop in the program, but the compiler does not apply them to nested loops. Each nested loop needs its own directive preceding it. You must place the vector directive before the loop control statement. For more details on these directives, see Directive Enhanced Compilation, section General Directives, i
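As a hedged sketch (loop body hypothetical), the narrower forms are written the same way as the plain directive; IVDEP:BACK below asserts only that no backward dependencies exist:

```fortran
!DEC$ IVDEP:BACK
do j = 1, n
   a(j) = a(j+m) + 1   ! forward dependence on a may still be assumed
enddo
```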
34. allowing instructions within a loop to be split into different stages, allowing increased instruction-level parallelism. This can reduce the impact of long-latency operations, resulting in faster loop execution. Loops chosen for software pipelining are always innermost loops that do not contain procedure calls that are not inlined. Because the optimizer no longer considers fully unrolled loops as innermost loops, fully unrolling loops can allow an additional loop to become the innermost loop (see -unroll[n]). You can request and view the optimization report to see whether software pipelining was applied (see Optimizer Report Generation).

!DEC$ SWP
do i = 1, m
  if (a(i) .eq. 0) then
    b(i) = a(i) + 1
  else
    b(i) = a(i) / c(i)
  endif
enddo

For more details on these directives, see Directive Enhanced Compilation, section General Directives, in the Intel® Fortran Language Reference.

Loop Count and Loop Distribution

LOOP COUNT (N) Directive

The LOOP COUNT (n) directive indicates the loop count is likely to be n, where n is an integer constant. The value of the loop count affects heuristics used in software pipelining, vectorization, and loop transformations.

!DEC$ LOOP COUNT (10000)
do i = 1, m
  a(i) = 1    ! This is likely to enable the loop to get software-pipelined
enddo

For more details on this directive, see Directive Enhanced Compilation, section
35. and one or more disk block I/O buffers, which are established when the file is opened by the Intel Fortran RTL. Unless very large records are being read or written, multiple logical records can reside in the disk block I/O buffer when it is written to disk or read from disk, minimizing physical disk I/O.

You can specify the size of the disk block physical I/O buffer by using the OPEN statement BLOCKSIZE specifier; the default size can be obtained from fstat(2). If you omit the BLOCKSIZE specifier in the OPEN statement, it is set for optimal I/O use with the type of device the file resides on (with the exception of network access).

The OPEN statement BUFFERCOUNT specifier specifies the number of I/O buffers. The default for BUFFERCOUNT is 1. Any experiments to improve I/O performance should increase the BUFFERCOUNT value, and not the BLOCKSIZE value, to increase the amount of data read by each disk I/O.

If the OPEN statement has BLOCKSIZE and BUFFERCOUNT specifiers, then the internal buffer size in bytes is the product of these specifiers. If the OPEN statement does not have these specifiers, then the default internal buffer size is 8192 bytes. This internal buffer will grow to hold the largest single record, but will never shrink.

The default for the Fortran run-time system is to use unbuffered disk writes. That is, by default, records are written to disk immediately as each record is written, instead of accumulating in the buffer to be written to di
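For example (file name and values hypothetical), BLOCKSIZE and BUFFERCOUNT are set on the OPEN statement; per the rule above, this requests an internal buffer of 4 * 8192 = 32768 bytes:

```fortran
! request four 8192-byte buffers for sequential unformatted access
open (unit=9, file='big_records.dat', form='unformatted', &
      status='old', blocksize=8192, buffercount=4)
```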
36. ...color: sets the HTML color name or code of the unknown code (default: ffffff).

Visual Presentation of the Application's Code Coverage

Based on the profile information collected from running the instrumented binaries when testing an application, the Intel Compiler creates HTML files using a code coverage tool. These HTML files indicate portions of the source code that were or were not exercised by the tests. When applied to the profile of the performance workloads, the code coverage information shows how well the training workload covers the application's critical code. High coverage of performance-critical modules is essential to taking full advantage of the profile-guided optimizations.

The code coverage tool can create two levels of coverage:

- Top level: for a group of selected modules
- Individual module source view

Top Level Coverage

The top-level coverage reports the overall code coverage of the modules that were selected. The following options are provided:

- You can select the modules of interest.
- For the selected modules, the tool generates a list with their coverage information. The information includes the total number of functions and blocks in a module and the portions that were covered.
- By clicking on the title of columns in the reported tables, the lists may be sorted in ascending or descending order based on:
  - basic block coverage
  - function coverage
  - function na
38. - Minimizing the number of tests that are required to achieve a given overall coverage: for any subset of the application, the tool defines the smallest subset of the application tests that achieves exactly the same code coverage as the entire set of tests.
- Reducing the turn-around time of testing: instead of spending a long time on finding a possibly large number of failures, the tool enables the users to quickly find a small number of tests that expose the defects associated with the regressions caused by a change set.
- Selecting and prioritizing the tests to achieve a certain level of code coverage in a minimal time, based on the data of the tests' execution time.

Command-line Syntax

The syntax for this tool is as follows:

tselect -dpi_list file

where -dpi_list is a required tool option that sets the path to the DPI list file that contains the list of the .dpi files of the tests you need to prioritize.

Tool Options

The tool uses options that are listed in the table that follows.

Option      Description
-help       Prints all the options of the test prioritization tool.
-spi file   Sets the path name of the static profile information file (.spi). The default is pgopti.spi.
-dpi_list   Sets the path name of the file that contains the file names of the dynamic profile information (.dpi) files. Each line of the file should contain one .dpi name.
-opt
43. For information on Fortran record structures, see the STRUCTURE statement in the Intel® Fortran Language Reference. If you specify -Zp (omit n), structures are packed at the 8-byte boundary.

-align and -pad

The -align option is a front-end option that changes the alignment of variables in a common block.

Example:

common /block1/ ch, doub, ch1, int
integer int
character (len=1) ch, ch1
double precision doub
end

The -align option enables padding inserted to ensure alignment of doub and int on natural alignment boundaries. The -noalign option disables padding.

The -align option applies mainly to structures. It analyzes and reorders memory layout for variables and arrays and basically functions as -Zp{n}. You can disable either option with -noalign. For -align keyword options, see your User's Guide, Volume I.

The -pad option is effectively not different from -align when applied to structures and derived types. However, the scope of -pad is greater, because it also applies to common blocks, derived types, sequence types, and VAX* structures.

Recommendations on Controlling Alignment with Options

The following options control whether the Intel Fortran compiler adds padding (when needed) to naturally align multiple data items in common blocks, derived-type structures, and Intel Fortran record structures:

- By default (with -O2), the -align commons option requests that data
44. I/O. Native unformatted data does not need to be modified when transferred and will take up less space on an external file. Conversely, when writing data to formatted files, formatted data must be converted to character strings for output, less data can transfer in a single operation, and formatted data may lose precision if read back into binary form.

To write the array A(25,25) in the following statements, S1 is more efficient than S2:

S1    WRITE (7) A

S2    WRITE (7,100) A
100   FORMAT (25(25F5.2))

Although formatted data files are more easily ported to other systems, Intel Fortran can convert unformatted data in several formats (see Little-endian-to-Big-endian Conversion).

Write Whole Arrays or Strings

To eliminate unnecessary overhead, write whole arrays or strings at one time rather than individual elements at multiple times. Each item in an I/O list generates its own calling sequence. This processing overhead becomes most significant in implied DO loops. When accessing whole arrays, use the array name (Fortran array syntax) instead of using implied DO loops.

Write Array Data in the Natural Storage Order

Use the natural ascending storage order whenever possible. This is column-major order, with the leftmost subscript varying fastest and striding by 1. See Accessing Arrays Efficiently. If a program must read or write data in any other order, efficient block moves are inhibited. If the whole array is not being written, n
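The column-major rule above can be sketched as follows (array names hypothetical); the inner loop varies the leftmost subscript, giving stride-1 access through memory:

```fortran
real a(100,100), b(100,100)
integer i, j
do j = 1, 100        ! rightmost subscript in the outer loop
  do i = 1, 100      ! leftmost subscript varies fastest (stride 1)
    a(i,j) = b(i,j) + 1.0
  enddo
enddo
```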
45.   integer identifier
  real weight
  character (len=15) description
end type part_dt
type (part_dt) catalog_spring(30)
end module data_defs

Using Arrays Efficiently

Many of the array access efficiency techniques described in this section are applied automatically by the Intel Fortran loop transformation optimizations. Several aspects of array use can improve run-time performance:

- The fastest array access occurs when contiguous access to the whole array or most of an array occurs. Perform one or a few array operations that access all of the array or major parts of an array instead of numerous operations on scattered array elements. Rather than use explicit loops for array access, use elemental array operations, such as the following line that increments all elements of array variable a:

a = a + 1

- When reading or writing an array, use the array name and not a DO loop or an implied DO loop that specifies each element number. Fortran 95/90 array syntax allows you to reference a whole array by using its name in an expression. For example:

real a(100,100)
a = 0.0
a = a + 1      ! Increment all elements of a by 1
write (8) a    ! Fast whole-array use

Similarly, you can use derived-type array structure components, such as:

type x
  integer a(5)
end type x
type (x) z
write (8) z%a  ! Fast array structure component use

20 Programming for High Performance

Make sure
46. Intel Fortran Compiler optimizations take effect at run time. For IA-32 systems, the compiler enhances processor-specific optimizations by inserting in the main routine a code segment that performs the run-time checks described below.

Check for Supported Processor with -xN, -xB, or -xP

To prevent execution errors, the compiler inserts code in the main routine of the program to check for proper processor usage. Programs compiled with options -xN, -xB, or -xP check at run time whether they are being executed on the Intel Pentium 4 processor, Intel Pentium M processor, or the Intel Pentium 4 processor with Streaming SIMD Extensions 3 (SSE3) instruction support, respectively, or a compatible Intel processor. If the program is not executed on one of these processors, the program terminates with an error.

Example

To optimize a program foo.f90 for an Intel Pentium 4 processor with Streaming SIMD Extensions 3 (SSE3) instruction support, issue the following command:

ifort -xP foo.f90 -o foo.exe

foo.exe aborts if it is executed on a processor that is not validated to support the Intel® Pentium® 4 processor with Streaming SIMD Extensions 3 (SSE3) instruction support; this is to account for the fact that this processor may have some additional feature enabling.

If you intend to run your programs on multiple IA-32 processors, do not use the -x options that optimize for proce
47. Interprocedural Optimizations (IPO)

Overview of Interprocedural Optimizations

Use -ip and -ipo to enable interprocedural optimizations (IPO), which enable the compiler to analyze your code to determine where you can benefit from the optimizations listed in the tables that follow.

IA-32 and Itanium-based applications:

Optimization                               Affected Aspect of Program
Inline function expansion                  Calls, jumps, branches, and loops
Interprocedural constant propagation       Arguments, global variables, and return values
Monitoring module-level static variables   Further optimizations and loop invariant code
Dead code elimination                      Code size
Propagation of function characteristics    Call deletion and call movement
Multifile optimization                     The same aspects as -ip, but across multiple files

IA-32 applications only:

Optimization                     Affected Aspect of Program
Passing arguments in registers   Calls and register usage
Loop-invariant code motion       Further optimizations and loop invariant code

Inline function expansion is one of the main optimizations performed by the interprocedural optimizer. For function calls that the compiler believes are frequently executed, the compiler might decide to replace the instructions of the call with code for the function itself. With -ip, the compiler performs inline function expansion for calls to procedures defined within the curre
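A minimal two-file sketch (file names and routine hypothetical) of a call that cross-file IPO could replace with an inline expansion:

```fortran
! area.f90 -- small, frequently called routine; a likely inlining candidate
real function area (r)
  real r
  area = 3.14159 * r * r
end function area

! main.f90 -- built together with area.f90, e.g.: ifort -ipo main.f90 area.f90
program main
  real area
  external area
  print *, area(2.0)
end program main
```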
48. Option   Default File Name
-spi     pgopti.spi
-dpi     pgopti.dpi

The -spi and -dpi options specify the paths to the corresponding files.

The coverage tool also has the following additional options for generating a link at the bottom of each HTML page to send an electronic message to a named contact, by using the -mname and -maddr options:

codecov -prj Project_Name -mname John_Smith -maddr 1519@company.com

Test Prioritization Tool

The Intel® Compilers test prioritization tool enables the profile-guided optimizations to select and prioritize an application's tests based on prior execution profiles of the application. The tool offers a potential of significant time saving in testing and developing large-scale applications where testing is the major bottleneck. The tool can be used for both IA-32 and Itanium architectures.

This tool enables the users to select and prioritize the tests that are most relevant for any subset of the application's code. When certain modules of an application are changed, the test prioritization tool suggests the tests that are most probably affected by the change. The tool analyzes the profile data from previous runs of the application, discovers the dependency between the application's components and its tests, and uses this information to guide the process of testing.

Features and Benefits

The tool provides an effective testing hierarchy based on the application's code coverage. The advantages of the tool usage can be summarized as follows:
49. be set via the KMP_STACKSIZE environment variable. In order for kmp_set_stacksize to have an effect, it must be called before the beginning of the first (dynamically executed) parallel region in the program.

subroutine kmp_set_stacksize (size)
integer size

This routine is provided for backward compatibility only; use kmp_set_stacksize_s(size) for compatibility across different families of Intel processors.

Memory Allocation

function kmp_malloc (size)
integer (kind=kmp_pointer_kind) kmp_malloc
integer (kind=kmp_size_t_kind) size

Allocate a memory block of size bytes from the thread-local heap.

function kmp_calloc (nelem, elsize)
integer (kind=kmp_pointer_kind) kmp_calloc
integer (kind=kmp_size_t_kind) nelem
integer (kind=kmp_size_t_kind) elsize

Allocate an array of nelem elements of size elsize from the thread-local heap.

function kmp_realloc (ptr, size)
integer (kind=kmp_pointer_kind) kmp_realloc
integer (kind=kmp_pointer_kind) ptr
integer (kind=kmp_size_t_kind) size

Reallocate a memory block at address ptr of size bytes from the thread-local heap.

subroutine kmp_free (ptr)
integer (kind=kmp_pointer_kind) ptr

Free a memory block at address ptr from the thread-local heap. Memory must have been previously allocated with kmp_malloc, kmp_calloc, or kmp_realloc.

Examples of OpenMP Usage

The following examples show how to use the Op
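A hedged usage sketch of the routines above (block size hypothetical, and assuming the kind constants kmp_pointer_kind and kmp_size_t_kind are visible, e.g. via the OpenMP library module); kmp_malloc and kmp_free pair up on the same thread-local heap:

```fortran
! allocate 1024 bytes from the calling thread's local heap, then free it
integer (kind=kmp_pointer_kind) p
p = kmp_malloc (int(1024, kind=kmp_size_t_kind))
! ... use the block ...
call kmp_free (p)
```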
50. Routines topic describes the OpenMP extensions to the specification that have been added by Intel in the Intel Fortran Compiler.

-openmp Option

The -openmp option enables the parallelizer to generate multithreaded code based on the OpenMP directives. The code can be executed in parallel on both uniprocessor and multiprocessor systems. The -openmp option works with both -O0 (no optimization) and any optimization level of -O1, -O2 (default), and -O3. Specifying -O0 with -openmp helps to debug OpenMP applications.

When you use the -openmp option, the compiler sets the -auto option (causes all variables to be allocated on the stack, rather than in local static storage) for the compiler, unless you specified it on the command line.

OpenMP Directive Format and Syntax

The OpenMP directives use the following format:

<prefix> <directive> [<clause> [[,] <clause> ...]]

where the brackets above mean:

- <xxx>: the prefix and directive are required
- [<xxx>]: if a directive uses one clause or more, the clause(s) is required
- [,]: commas between the <clause>s are optional

For fixed form source input, the prefix is !$omp or c$omp; for free form source input, the prefix is !$omp only. The prefix is followed by the directive name, for example: !$omp parallel.

Since OpenMP directives begin with an exclamation point, the directives take the form of comments if you omit t
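Putting the format together, a minimal free-form example (loop body hypothetical) with the !$omp prefix, the directive name, and two clauses:

```fortran
!$omp parallel do private(i) shared(a)
do i = 1, n
   a(i) = i
enddo
!$omp end parallel do
```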
51. intrinsic functions. For more information on ways to restrict optimization, see Using -ip with -Qoption Specifiers.

Floating-point Arithmetic Optimizations

Options Used for Both IA-32 and Itanium® Architectures

The options described in this section all provide optimizations with varying degrees of precision in floating-point (FP) arithmetic for IA-32 and Itanium® architectures. The -mp1 (IA-32 only) and -mp options improve floating-point precision, but also affect the application performance. See more details about these options in Improving/Restricting FP Arithmetic Precision. The FP options provide optimizations with varying degrees of precision in floating-point arithmetic. The option that disables these optimizations is -O0.

-mp Option

Use -mp to limit floating-point optimizations and maintain declared precision. For example, the Intel® Fortran Compiler can change floating-point division computations into multiplication by the reciprocal of the denominator. This change can alter the results of floating-point division computations slightly. The -mp switch may slightly reduce execution speed. For more information, see Improving/Restricting FP Arithmetic Precision.

-mp1 Option

Use the -mp1 option to restrict floating-point precision to be closer to declared precision, with less impact on performance than with the -mp option. The option will ensure the out-of-range check of operands of transcenden
52. adding -ipo to your link command line. The following example produces an executable named app:

ifort -oapp -ipo a.o b.o c.o

This command invokes the compiler on the objects containing IR and creates a new list of object(s) to be linked. The command then calls GCC ld to link the specified object files and produce app, as specified by the -o option. IPO is applied only to the object files that contain IR; otherwise the object file passes to the link stage.

Note: For the above step, you can use the xild tool instead of ifort.

The two steps described above can be combined, as shown in the following:

ifort -ipo -oapp a.f b.f c.f

Generating Multiple IPO Object Files

For the most part, IPO generates a single object file for the link-time compilation. This can be clumsy for very large applications, perhaps even making it impossible to use -ipo on the application. The compiler provides two ways to avoid this problem. The first way is a size-based heuristic, which automatically causes the compiler to generate multiple object files for large link-time compilations. The second way is using one of two explicit command-line controls that tell the compiler to do multi-object IPO:

- -ipoN, where N indicates the number of object files to generate
- -ipo_separate, which tells the compiler to generate a separate IPO object file for each source file

These options are alternatives to the -ipo option; that is, they indicate an IPO compilation. Explicit
53. an extension .dyn in the directory specified by PROF_DIR. 8. Issue the following command: profmerge -prof_dpi Test2.dpi At this step, the profmerge tool merges all the .dyn files into one file (Test2.dpi) that represents the total profile information of the application on Test2. 9. Issue the following command: rm $PROF_DIR/*.dyn Make sure that there are no unrelated .dyn files present. 10. Issue the following command: myApp < data3 This command runs the instrumented application and generates one or more new dynamic profile information files that have an extension .dyn in the directory specified by PROF_DIR. 11. Issue the following command: profmerge -prof_dpi Test3.dpi At this step, the profmerge tool merges all the .dyn files into one file (Test3.dpi) that represents the total profile information of the application on Test3. 12. Create a file named tests_list with three lines. The first line contains Test1.dpi, the second line contains Test2.dpi, and the third line contains Test3.dpi. When these items are available, the test prioritization tool may be launched from the command line in the PROF_DIR directory, as described in the following examples. Note that in all examples the discussion references the same set of data. Example 1: Minimizing the Number of Tests tselect -dpi_list tests_list -spi pgopti.spi where the -spi option speci
54. and Pentium III processors: ifort prog.f ifort -tpp7 prog.f However, if you intend to target your application specifically to the Intel Pentium and Pentium with MMX(TM) technology processors, use the -tpp5 option: ifort -tpp5 prog.f Processors for Itanium(R)-based Systems The -tpp1 and -tpp2 options optimize your application's performance for a specific Intel Itanium processor, as listed in the table below. The resulting binaries will also run correctly on both processors mentioned in the table. Option / Optimizes your application for: -tpp1 Intel Itanium(R) processor; -tpp2 Intel(R) Itanium(R) 2 processor. Example The following invocation results in a compiled binary of the source program prog.f optimized for the Itanium 2 processor by default. The same binary will also run on Itanium processors. ifort prog.f ifort -tpp2 prog.f However, if you intend to target your application specifically to the Intel Itanium processor, use the -tpp1 option: ifort -tpp1 prog.f Processor-specific Optimization (IA-32 only) The -x{K|W|N|B|P} options target your program to run on a specific Intel processor. The resulting code might contain unconditional use of features that are not supported on other processors. Option / Optimizes for: -xK Intel Pentium III and compatible Intel processors; -xW Intel Pentium 4 and compatible Intel processors; -xN Intel Pentium 4 and compatible
55. common subexpression elimination: An optimization in which the compiler detects and combines redundant computations. conditionals: Any operation that takes place depending on whether or not a certain condition is true. constant argument propagation: An optimization in which the compiler replaces the formal arguments of a routine with actual constant values. The compiler then propagates constant variables used as actual arguments. constant branches: Conditionals that always take the same branch. constant folding: An optimization in which the compiler, instead of storing the numbers and operators for computation when the program executes, evaluates the constant expression and uses the result. copy propagation: An optimization in which the compiler eliminates unnecessary assignments by using the value assigned to a variable instead of using the variable itself. dataflow: The movement of data through a system, from entry to destination. dead code elimination: An optimization in which the compiler eliminates any code that generates unused values or any code that will never be executed in the program. dynamic linking: The process in which a shared object is mapped into the virtual address space of your program at run time. empty declaration: A semicolon and nothing before it. frame pointer: A pointer that holds a base address for the current stack and is used to access
56. are in a directory named cmp1. If no component file is specified, then all files that have been compiled with -prof_genx are selected for coverage analysis. Dynamic Counters This feature displays the dynamic execution count of each basic block of the application, and as such it is useful for both coverage and performance tuning. The coverage tool can be configured to generate the information about the dynamic execution counts. This configuration requires using the -counts option. The counts information is displayed under the code, after a ^ sign, precisely under the source position where the corresponding basic block begins. If more than one basic block is generated for the code at a source position (for example, for macros), then the total number of such blocks and the number of the blocks that were executed are also displayed in front of the execution count. For example, line 11 in the code is an IF statement: 11 IF ((N .EQ. 1) .OR. (N .EQ. 0)) ^ 10 (1/2) 12 PRINT N ^ 7 The coverage lines under code lines 11 and 12 contain the following information: the IF statement in line 11 was executed 10 times; two basic blocks were generated for the IF statement in line 11; only one of the two blocks was executed, hence the partial coverage color; only seven out of the ten times, variable N had a value of 0 or 1. In certain situations, it may be desirable to consider all the blocks generated for a single source
57. by code within loops. In most cases, -O2 is recommended over -O1. On IA-32 systems: Disables intrinsics inlining to reduce code size. Enables optimizations for speed. Also disables intrinsic recognition and the -fp option. On Itanium-based systems: Disables software pipelining and global code scheduling. Enables optimizations for server applications (straight-line and branch-like code with flat profile). Enables optimizations for speed while being aware of code size; for example, this option disables software pipelining and loop unrolling. -O2, -O: This option is the default for optimizations. However, if -g is specified, the default is -O0. Optimizes for code speed. This is the generally recommended optimization level. However, if -g is specified, -O2 is turned off and -O0 is the default, unless -O2 (or -O1 or -O3) is explicitly specified in the command line together with -g. On IA-32 systems this option is the same as the -O1 option. On Itanium-based systems: Enables optimizations for speed, including global code scheduling, software pipelining, predication, and speculation. On these systems, the -O2 option enables inlining of intrinsics. It also enables the following capabilities for performance gain: constant propagation, copy propagation, dead code elimination, global register allocation, global instruction scheduling and control speculation, loop unrolling, optimized code selection
58. by the number of threads in the team. Each piece is then dispatched to a thread before loop execution begins. DYNAMIC: The iterations are divided into pieces having a size specified by chunk. As each thread finishes its currently dispatched piece of the iteration space, the next piece is dynamically dispatched to the thread. When no chunk is specified, the default is 1. GUIDED: The chunk size is decreased exponentially with each succeeding dispatch. chunk specifies the minimum number of iterations to dispatch each time. If there are fewer than chunk iterations remaining, the rest are dispatched. When no chunk is specified, the default is 1. RUNTIME: The decision regarding scheduling is deferred until run time. The schedule type and chunk size can be chosen at run time by using the OMP_SCHEDULE environment variable. When you specify RUNTIME, you cannot specify a chunk size. The following list shows which schedule type is used, in priority order: 1. The schedule type specified in the SCHEDULE clause of the current DO or PARALLEL DO directive. 2. If the schedule type for the current DO or PARALLEL DO directive is RUNTIME, the default value specified in the OMP_SCHEDULE environment variable. 3. The compiler default schedule type of STATIC. The following list shows which chunk size is used, in priority order: 1. The chunk size specified in
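A minimal sketch of selecting a schedule with the clause described above (the loop body and array name are ours):

```
!$OMP PARALLEL DO SCHEDULE(DYNAMIC,4) PRIVATE(I) SHARED(A)
      DO I = 1, 1000
         A(I) = A(I) * 2.0    ! chunks of 4 iterations dispatched on demand
      END DO
!$OMP END PARALLEL DO
```

With SCHEDULE(RUNTIME), the same loop could instead be controlled at run time by, for example, setting OMP_SCHEDULE to "guided,8" in the shell before execution.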
59. compiler to evaluate the expressions involving floating-point operands in the following way: -IPF_flt_eval_method0 directs the compiler to evaluate the expressions involving floating-point operands in the precision indicated by the variable types declared in the program. -IPF_flt_eval_method2 is not supported in the current version. Controlling Accuracy of the FP Results -IPF_fltacc disables the optimizations that affect floating-point accuracy. The default is -IPF_fltacc- to enable such optimizations. The Itanium(R) compiler may reassociate floating-point expressions to improve application performance. Use -IPF_fltacc or -mp to disable or restrict these floating-point optimizations. Improving/Restricting FP Arithmetic Precision The -mp and -mp1 options maintain and restrict, respectively, floating-point precision, but also affect the application performance. The -mp1 option causes less impact on performance than the -mp option. -mp1 ensures the out-of-range check of operands of transcendental functions and improves the accuracy of floating-point compares. For IA-32 systems, the -mp option implies -mp1, and -mp1 implies -fp_port. -mp slows down performance the most of these three, -fp_port the least. The -mp option restricts some optimizations to maintain declared precision and to ensure that floating-point arithmetic conforms
60. cross their declared dimension boundaries can be converted to their linearized form before the tests are applied. Some of the simple tests that can be used are the fast greatest common divisor (GCD) test and the extended bounds test. The GCD test proves independence if the GCD of the coefficients of loop indices cannot evenly divide the constant term. The extended bounds test checks for potential overlap of the extreme values in subscript expressions. If all simple tests fail to prove independence, we eventually resort to a powerful hierarchical dependence solver that uses Fourier-Motzkin elimination to solve the data dependence problem in all dimensions. Loop Constructs Loops can be formed with the usual DO-END DO and DO WHILE, or by using IF/GOTO statements and a label. However, the loops must have a single entry and a single exit to be vectorized. Following are examples of correct and incorrect usage of loop constructs. Example of Correct Usage SUBROUTINE FOO (A, B, C) DIMENSION A(100), B(100), C(100) INTEGER I I = 1 DO WHILE (I .LE. 100) A(I) = B(I) * C(I) IF (A(I) .LT. 0.0) A(I) = 0.0 I = I + 1 ENDDO RETURN END Example of Incorrect Usage SUBROUTINE FOO (A, B, C) DIMENSION A(100), B(100), C(100) INTEGER I I = 1 DO WHILE (I .LE.
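As a hedged worked example of the GCD test described above (subroutine and bounds are ours): in the loop below, every store touches an even element and every load an odd one; GCD(2, 2) = 2 does not evenly divide the constant term 1 of the dependence equation 2*I = 2*J + 1, so the references are proven independent.

```
! Assumed example: A(2*I) and A(2*I+1) can never overlap, since
! 2*I = 2*J + 1 has no integer solution (the GCD test applies).
      SUBROUTINE GCDDEMO (A, N)
      INTEGER N, I
      REAL A(2*N+1)
      DO I = 1, N
         A(2*I) = A(2*I+1) + 1.0   ! proven independent: vectorizable
      END DO
      END SUBROUTINE GCDDEMO
```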
61. each thread, as if you had used the PRIVATE clause. The private copy is initialized to a value that depends on the operator or intrinsic, as shown in the following table. The actual initialization value is consistent with the data type of the reduction variable. Operators/Intrinsics and Initialization Values for Reduction Variables: + 0; * 1; - 0; .AND. .TRUE.; .OR. .FALSE.; .EQV. .TRUE.; .NEQV. .FALSE.; MAX Smallest representable number; MIN Largest representable number; IAND All bits on; IOR 0; IEOR 0. At the end of the construct to which the reduction applies, the shared variable is updated to reflect the result of combining the original value of the SHARED reduction variable with the final value of each of the private copies, using the specified operator. Except for subtraction, all of the reduction operators are associative, and the compiler can freely reassociate the computation of the final value. The partial results of a subtraction reduction are added to form the final value. The value of the shared variable becomes undefined when the first thread reaches the clause containing the reduction, and it remains undefined until the reduction computation is complete. Normally, the computation is complete at the end of the REDUCTION construct. However, if you use the REDUCTION clause on a
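A minimal sketch of the reduction mechanism just described (variable names are ours): each thread's private copy of SUM starts at 0, the '+' initialization value from the table above, and the private partial sums are combined into the shared variable at the end of the construct.

```
      SUM = 0.0
!$OMP PARALLEL DO REDUCTION(+:SUM)
      DO I = 1, N
         SUM = SUM + A(I)      ! each thread accumulates privately
      END DO
!$OMP END PARALLEL DO
```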
62. end program The Intel Fortran whole-array access (x = y + 1) uses efficient column-major order. However, if the application requires that J vary the fastest, or if you cannot modify the loop order without changing the results, consider modifying the application program to use a rearranged order of array dimensions. Program modifications include rearranging the order of: dimensions in the declaration of the arrays, x(5,3) and y(5,3); the assignment of x(j,i) and y(j,i) within the do loops; and all other references to arrays x and y. In this case, the original DO loop nesting is used, where J is the innermost loop: integer x(5,3), y(5,3), i, j y = 0 do i=1,3 ! I outer loop varies slowest do j=1,5 ! J inner loop varies fastest x(j,i) = y(j,i) + 1 ! Efficient column-major storage order end do ! leftmost subscript varies fastest end do end program Code written to access multidimensional arrays in row-major order (as in C) or random order often makes less efficient use of the CPU memory cache. For more information on using natural storage order during record I/O, see Improving I/O Performance. Use the available Fortran 95/90 array intrinsic procedures rather than creating your own. Whenever possible, use Fortran 95/90 array intrinsic procedures instead of creating your own routines to accomplish the same task. Fortran 95/90 array intrinsic
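A minimal sketch of the array-intrinsic advice above (the function and argument names are ours): prefer the built-in intrinsics to hand-written accumulation loops.

```
! Hypothetical example: SUM replaces an explicit DO loop that
! computes the same result and is easier for the compiler to optimize.
      REAL FUNCTION TOTAL (A, N)
      INTEGER N
      REAL A(N)
      TOTAL = SUM(A)       ! instead of a hand-coded accumulation loop
      END FUNCTION TOTAL
```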
63. for the IPO compilation. You can specify an optional name for the object file, or a directory (with the backslash) in which to place the file. The default object file name is ipo_out.o. -qipo_fcode_asm: Adds code bytes to the assembly listing. -qipo_fsource_asm: Adds high-level source code to the assembly listing. -qipo_fverbose_asm and -qipo_fnoverbose_asm: Enables and disables, respectively, inserting comments containing version and options used in the assembly listing for xild. If the xild invocation leads to an IPO multi-object compilation, either because the application is big or because the user explicitly asked for multiple objects, the first .s file takes its name from the -qipo_fa option. The compiler derives the names of subsequent .s files by appending a number to the name, for example foo.s and foo1.s for -qipo_fafoo.s. The same is true for the -qipo_fo option. Code Layout and Multi-Object IPO One of the optimizations performed during an IPO compilation is code layout. IPO analysis determines a layout order for all of the routines for which it has IR. If a single object is being generated, the compiler generates the layout simply by compiling the routines in the desired order. For a multi-object IPO compilation, the compiler must tell the linker about the desired order. The compiler first puts each routine in a named text section: the first routine in .text00001, the second in .text00002, and so forth. I
64. improving performance. An application developer willing to trade pure IEEE 754 compliance for speed would benefit from these options. For more information on FTZ and DAZ, see Setting FTZ and DAZ Flags, and Floating-point Exceptions in the Intel(R) Architecture Optimization Reference Manual. For Itanium architecture, enable flush-to-zero (FTZ) mode with the -ftz option, set by the -O3 option. Auto-vectorization Many applications significantly increase their performance if they can implement vectorization, which uses streaming SIMD (SSE2) instructions for the main computational loops. The Intel Compiler turns vectorization on automatically (auto-vectorization), or you can implement it with compiler directives. See the Auto-vectorization (IA-32 Only) section for complete details. Creating Multithreaded Applications The Intel Fortran Compiler and the Intel(R) Threading Toolset have capabilities that make developing multithreaded applications easy. See Parallel Programming with Intel Fortran. Multithreaded applications can show significant benefit on multiprocessor Intel symmetric multiprocessing (SMP) systems or on Intel processors with Hyper-Threading technology. Analyzing and Timing Your Application Using Intel Performance Analysis Tools Intel offers a variety of application performance tools that are optimized to take advantage of the Intel architecture-based processors. You can employ these tools for developing
65. invoke the Intel Fortran Compiler without specifying any compiler options, the default state of each option takes effect. The following tables summarize the options whose default status is ON, as they are required for Intel Fortran Compiler default operation. The tables group the options by their functionality. For the default states and values of all options, see the Alphabetical Quick Reference Guide in the Intel Fortran Compiler Options Quick Reference. The table provides links to the sections describing the functionality of the options. If an option has a default value, such value is indicated. Depending on your application requirements, you can disable one or more options. For general methods of disabling optimizations, see Volume I. The following tables list all options that the compiler uses for its default optimizations. Data Setting and Fortran Language Conformance Default Option / Description: -align records Analyzes and reorders memory layout for variables and arrays; -align rec8byte Specifies 8-byte boundary for alignment constraint; -altparam Specifies that the alternate form of parameter constant declarations is recognized; -ansi_alias Enables assumption of the program's ANSI conformance; -assume cc_omp Enables OpenMP conditional compilation directives; -ccdefault default Specifies default carriage control for units 6 and -double
66. … 41; PRINT … 176; PRINT statement … 103; prioritization … 111; PRIVATE … 176; PRIVATE clause … 176, 178; private scoping variable … 150; procedure names … 160; process … 41; process data … 119; processor: processor-based … 73, processor instruction … 73, targeting … 73; produced: IL … 86, multithreaded … 149, 150, profile-optimized … 97; -prof_dir dirname compiler option … 98; prof_dpi … 111; prof_dpi Test1.dpi … 111; prof_dpi Test2.dpi … 111; prof_dpi Test3.dpi … 111; PROF_DUMP_INTERVAL … 99, 118; -prof_file filename compiler option … 98; -prof_gen compiler option … 97; PROF_NO_CLOBBER … 99; -prof_use compiler option … 97; profile data: dumping … 101; profile: described … 118, environment variable … 118, functions … 118, variable … 118; profile information: dumping … 119, generation support … 118; profile-guided optimizations (see also PGO): instrumented program … 93, methodology … 94, overview … 93, phases … 94, usage … 101; profil
67. member. U is a simple unit number or a number of units. The number of list members is limited to 64. decimal is a non-negative decimal number less than 2**32. Converted data should have basic data types, or arrays of basic data types. Derived data types are disabled. Command lines for variable setting with different shells: Sh: export F_UFMTENDIAN=MODE;EXCEPTION Csh: setenv F_UFMTENDIAN MODE;EXCEPTION Note: Environment variable values should be enclosed in quotes if a semicolon is present. Another Possible Environment Variable Setting The environment variable can also have the following syntax: F_UFMTENDIAN=u[,u...] Command lines for the variable setting with different shells: Sh: export F_UFMTENDIAN=u[,u...] Csh: setenv F_UFMTENDIAN u[,u...] See the error messages that may be issued during the little-endian/big-endian conversion. They are all fatal. You should contact Intel if such errors occur. Usage Examples 1. F_UFMTENDIAN=big All input/output operations perform conversion from big-endian to little-endian on READ and from little-endian to big-endian on WRITE. 2. F_UFMTENDIAN="little;big:10,20" or F_UFMTENDIAN="big:10,20" or F_UFMTENDIAN=10,20 In this case, only on unit numbers 10 and 20 do the input/output operations perform big/little-endian conversion. 3. F_UFMTENDIAN="big;little:8" In this case, on unit number 8 no conversion operati
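A hedged end-to-end sketch of the variable described above (the unit number, file name, and program are ours): writing a big-endian unformatted file from a little-endian system.

```
! Assumed example: with F_UFMTENDIAN=big:20 set in the shell before
! running, unit 20 is byte-swapped on WRITE, producing big-endian data.
      PROGRAM ENDIAN_DEMO
      INTEGER V
      V = 12345
      OPEN (UNIT=20, FILE='big.dat', FORM='UNFORMATTED')
      WRITE (20) V
      CLOSE (20)
      END PROGRAM ENDIAN_DEMO
```

Usage: export F_UFMTENDIAN=big:20 (sh), then run the program.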
68. modify source code to avoid slow arithmetic operators, be aware that optimizations convert many slow arithmetic operators to faster arithmetic operators. For example, the compiler optimizes the expression H*J**2 to be H*J*J. Consider also whether replacing a slow arithmetic operator with a faster arithmetic operator will change the accuracy of the results or impact the maintainability (readability) of the source code. Replacing slow arithmetic operators with faster ones should be reserved for critical code areas. The following hierarchy lists the Intel Fortran arithmetic operators, from fastest to slowest: addition, subtraction, and floating-point multiplication; integer multiplication; division; exponentiation. Avoid Using EQUIVALENCE Statements Avoid using EQUIVALENCE statements. EQUIVALENCE statements can: force unaligned data or cause data to span natural boundaries; prevent certain optimizations, including global data analysis under certain conditions (see -O2 in Setting Optimization with -On options) and implied-DO loop collapsing when the control variable is contained in an EQUIVALENCE statement. Use Statement Functions and Internal Subprograms Whenever the Intel Fortran compiler has access to the use and definition of a subprogram during compilation, it may choose to inline the subprogram. Using statement functions and internal subprograms maximizes
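A minimal sketch of the internal-subprogram advice above (all names are ours): keeping the helper inside the host routine via CONTAINS makes its definition visible at compile time and thus a candidate for inlining.

```
      PROGRAM HOST
      REAL X
      X = TWICE(3.0)     ! definition visible below: candidate for inlining
      PRINT *, X
      CONTAINS
         REAL FUNCTION TWICE (V)
         REAL V
         TWICE = 2.0 * V
         END FUNCTION TWICE
      END PROGRAM HOST
```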
69. multidimensional arrays are referenced using proper array syntax and are traversed in the natural ascending storage order, which is column-major order for Fortran. With column-major order, the leftmost subscript varies most rapidly, with a stride of one. Whole-array access uses column-major order. Avoid row-major order, as is done by C, where the rightmost subscript varies most rapidly. For example, consider the nested do loops that access a two-dimension array with the j loop as the innermost loop: integer x(3,5), y(3,5), i, j y = 0 do i=1,3 ! I outer loop varies slowest do j=1,5 ! J inner loop varies fastest x(i,j) = y(i,j) + 1 ! Inefficient row-major storage order end do ! rightmost subscript varies fastest end do end program Since j varies the fastest and is the second array subscript in the expression x(i,j), the array is accessed in row-major order. To make the array accessed in natural column-major order, examine the array algorithm and the data being modified. Using arrays x and y, the array can be accessed in natural column-major order by changing the nesting order of the do loops so the innermost loop variable corresponds to the leftmost array dimension: integer x(3,5), y(3,5), i, j y = 0 do j=1,5 ! J outer loop varies slowest do i=1,3 ! I inner loop varies fastest x(i,j) = y(i,j) + 1 ! Efficient column-major storage order end do ! leftmost subscript varies fastest end do end program
70. non-iterative worksharing construct (terminated by END SECTIONS) that specifies a set of structured blocks that are to be divided among threads in a team. SECTION: Indicates that the associated structured block should be executed in parallel as part of the enclosing sections construct. SINGLE / END SINGLE: Identifies a construct that specifies that the associated structured block is executed by only one thread in the team. PARALLEL DO / END PARALLEL DO: A shortcut for a parallel region that contains a single DO directive. Note: The PARALLEL DO or DO OpenMP directive must be immediately followed by a DO statement (do_stmt as defined by R818 of the ANSI Fortran standard). If you place another statement or an OpenMP directive between the PARALLEL DO or DO directive and the DO statement, the Intel Fortran Compiler issues a syntax error. PARALLEL SECTIONS / END PARALLEL SECTIONS: Provides a shortcut form for specifying a parallel region containing a single SECTIONS construct. MASTER / END MASTER: Identifies a construct that specifies a structured block that is executed by only the MASTER thread of the team. CRITICAL[(lock)] / END CRITICAL[(lock)]: Identifies a construct that restricts execution of the associated structured block to a single thread at a time. Each thread waits at the beginning of the critical construct until no other thread is executing a critical construct with the
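A minimal sketch of the PARALLEL SECTIONS shortcut listed above (the work routines are ours): two independent tasks run concurrently, one per SECTION, inside a single parallel region.

```
!$OMP PARALLEL SECTIONS
!$OMP SECTION
      CALL INIT_X()        ! executed by one thread
!$OMP SECTION
      CALL INIT_Y()        ! executed concurrently by another thread
!$OMP END PARALLEL SECTIONS
```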
71. … 210; function splitting … 97; … 48; intrinsics inlining … 62; IPO … 79; -On optimizations … 62; disclaimer … 3; disk I/O … 25; dispatch options … 73; DISTRIBUTE POINT directive … 204; division-to-multiplication optimization … 69; DO directive … 150, 166, 170, 176; DO loop … 20, 25, 30, 166, 176, 180; DO WHILE … 135, 136, 166, 170; document number … 1; DO-ENDDO … 135; DOUBLE … 34; DOUBLE PRECISION … 186, … 139, variables KIND … 48, variables … 48; double size 64 … 48; -double_size compiler option … 48; dpi: dpi_customer.dpi … 103, dpi file … 94, 97, 101, 103, 111, … 111, dpi options … 103, pgopti.dpi … 103; … 48; -dps compiler option … 48; dummy argument … 20, 25, 41; dummy aliases … 41, 51; dumping: profile data … 101, profile information … 119, 120; dyn files (dynamic information files) … 100, 101; dynamic … 103; DYNAMIC … 170, 180; dynamic threads … 186; dynamic information: information files
72. of PGO often enables the compiler to make better decisions about function inlining, thereby increasing the effectiveness of interprocedural optimizations. Instrumented Program Profile-guided optimization creates an instrumented program from your source code and special code from the compiler. Each time this instrumented code is executed, the instrumented program generates a dynamic information file. When you compile a second time, the dynamic information files are merged into a summary file. Using the profile information in this file, the compiler attempts to optimize the execution of the most heavily traveled paths in the program. Unlike other optimizations, such as those strictly for size or speed, the results of IPO and PGO vary. This is due to each program having a different profile and different opportunities for optimizations. The guidelines provided help you determine if you can benefit by using IPO and PGO. You need to understand the principles of the optimizations and the unique aspects of your source code. Added Performance with PGO In this version of the Intel(R) Fortran Compiler, PGO is improved in the following ways: register allocation uses the profile information to optimize the location of spill code; for indirect function calls, branch prediction is improved by identifying the most likely targets. With the Intel(R) Pentium(R) 4 and Intel(R) Xeon(TM) processors' longer pipeline, improving branch prediction translates
73. of code e The ATOMIC directive is used to update a memory location in an uninterruptable fashion e The FLUSH directive is used to insure that all threads in a team have a consistent view of memory e ABARRIER directive forces all team members to gather at a particular point in code Each team member that executes a BARRIER waits at the BARRIER until all of the team members have arrived A BARRIER cannot be used within worksharing or other synchronization constructs due to the potential for deadlock e The MASTER directive is used to force execution by the master thread See the list of OpenMP Directives and Clauses 152 Parallel Programming with Intel Fortran Data Sharing Data sharing is specified at the start of a parallel region or worksharing construct by using the SHARED and PRIVATE clauses All variables in the SHARED clause are shared among the members of a team It is the application s responsibility to e Synchronize access to these variables All variables in the PRIVATE clause are private to each team member For the entire parallel region assuming t team members there are t 1 copies of all the variables in the PRIVATE clause one global copy that is active outside parallel regions and a PRIVATE copy for each team member e Initialize PRIVATE variables at the start of a parallel region unless the FIRSTPRIVATE clause is specified In this case the PRIVATE copy is initialized from the global copy at the start of the c
74. of the parallel loop as PRIVATE. This step is optional. 6. COMMON block elements must not be placed on the PRIVATE list if their global scope is to be preserved. The THREADPRIVATE directive can be used to privatize to each thread the common block containing those variables with global scope. THREADPRIVATE creates a copy of the COMMON block for each of the threads in the team. 7. Any I/O in the parallel region should be synchronized. 8. Identify more parallel loops and restructure them. 9. If possible, merge adjacent PARALLEL DO constructs into a single parallel region containing multiple DO directives, to reduce execution overhead. Tune The tuning process should include minimizing the sequential code in critical sections, and load balancing by using the SCHEDULE clause or the OMP_SCHEDULE environment variable. Note: This step is typically performed on a multiprocessor system. Parallel Processing Thread Model This topic explains the processing of the parallelized program and adds more definitions of the terms used in parallel programming. The Execution Flow A program containing OpenMP Fortran API compiler directives begins execution as a single process, called the master thread of execution. The master thread executes sequentially until the first parallel construct is encountered. In OpenMP Fortran API, the PARALLEL and END PARALLEL directives d
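A minimal sketch of the THREADPRIVATE step described above (the block and variable names are ours): each thread gets its own copy of the common block, preserving the variables' global scope per thread.

```
! Assumed example: /STATE/ is replicated per thread by THREADPRIVATE.
      COMMON /STATE/ SEED
      INTEGER SEED
!$OMP THREADPRIVATE (/STATE/)
!$OMP PARALLEL
      SEED = 0             ! initializes this thread's private copy
      CALL WORK()
!$OMP END PARALLEL
```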
75. parallel_6__par_region1: # parameter 1: 8(%ebp) # parameter 2: 12(%ebp) pushl %ebp movl %esp, %ebp subl $44, %esp call omp_get_thread_num movl %eax, -28(%ebp) movl -28(%ebp), %eax movl %eax, -16(%ebp) leave ret .align 4, 0x90 # mark_end Debugging the program at this level is just like debugging a program that uses POSIX threads directly. Breakpoints can be set in the threaded code just like any other routine. With the GNU debugger, breakpoints can be set to source-level routine names (such as parallel). Breakpoints can also be set to entry point names (such as parallel_ and parallel_3__par_region0). Note that the Intel Fortran Compiler for Linux converted the upper-case Fortran subroutine name to the lower-case one. Debugging Multiple Threads When in a debugger, you can switch from one thread to another. Each thread has its own program counter, so each thread can be in a different place in the code. Example 2 shows a Fortran subroutine PADD. A breakpoint can be set at the entry point of the OpenMP parallel region. Source listing of the Subroutine PADD: 12 SUBROUTINE PADD (A, B, C, N) 13 INTEGER N 14 INTEGER
76. parallel threads and memory allocation. The Intel extension routines described in this section can be used for low-level debugging, to verify that the library code and application are functioning as intended. It is recommended to use these routines with caution, because using them requires the use of the -openmp_stubs command-line option to execute the program sequentially. These routines are also generally not recognized by other vendors' OpenMP-compliant compilers, which may cause the link stage to fail for those compilers. Stack Size In most cases, environment variables can be used in place of the extension library routines. For example, the stack size of the parallel threads may be set using the KMP_STACKSIZE environment variable rather than the kmp_set_stacksize library routine. Note: A run-time call to an Intel extension routine takes precedence over the corresponding environment variable setting. The routines kmp_set_stacksize and kmp_get_stacksize take a 32-bit argument only. The routines kmp_set_stacksize_s and kmp_get_stacksize_s take a size_t argument, which can hold 64-bit integers. On Itanium-based systems, it is recommended to always use kmp_set_stacksize_s and kmp_get_stacksize_s. These _s variants must be used if you need to set a stack size >= 2**32 bytes (4 gigabytes). See the definitions of the stack size routines in the ta
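A minimal sketch of the stack-size routine described above (the stack size value and work routine are ours): request a larger stack for parallel threads before the first parallel region, using the _s (size_t) variant.

```
! Assumed example: 8 MB per-thread stack, set via the _s variant
! recommended above; must be called before the first parallel region.
      CALL KMP_SET_STACKSIZE_S (8388608)
!$OMP PARALLEL
      CALL WORK()
!$OMP END PARALLEL
```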
77. patet ... 97
Advanced PGO Options ... 98
PGO Environment Variables ... 99
Example of Profile-Guided Optimization ... 100
Merging the .dyn Files ... 101
Using profmerge to Relocate the Source Files ... 102
Code-Coverage Tool ... 103
Test Prioritization Tool ... 111
PGO API: Profile Information Generation Support ... 118
High-level Language Optimizations (HLO) ... 121
HLO Overview ... 121
Loop Transformations ... 122
Scalar Replacement (IA-32 Only) ... 123
Loop Unrolling with HLO ... 123
Memory Dependency with IVDEP Directive ... 124
Prefetching ... 125
Parallel Programming with Intel Fortran ... 127
Parallelism: an Overview ... 127
Parallel Program Development ... 128
Auto-vectorization (IA-32 Only) ... 130
Vectorization Overview ... 130
Vectorizer Options ... 131
Loop Parallelization and Vectorization ... 132
Vectorization Key Programming Guidelines ... 133
Data Dependence ... 134
Loop Constructs ... 135
Loop Exit Conditions ... 136
Types of Loops Vectorized
78. performs optimizations as described for -ipo, but stops prior to the final link stage, leaving an optimized assembly file. The default name for this file is ipo_out.s. You can use the -o option to specify a different name. For example:

ifort -tpp6 -ipo_S -ofilename a.f b.f c.f

The -ipo_c and -ipo_S options generate multiple outputs if multi-object IPO is being used. The name of the first file is taken from the value of the -o option. The name of each subsequent file is derived from this file by appending a numeric value to the file name. For example, if the first object file is named foo.o, the second object file will be named foo1.o. The compiler generates a message indicating the name of each object or assembly file it is generating. These files can be added to the real link step to build the final application.

Creating an IPO Executable Using xild

Use the Intel linker, xild, instead of step 2 in Command Line for Creating an IPO Executable. The linker xild performs the following steps:

1. Invokes the compiler to perform IPO if objects containing IR are found.
2. Invokes GCC ld to link the application.

The command-line syntax for xild is the same as that of the GCC linker:

xild <options> <LINK_commandline>

where:

- <options> (optional) may include any GCC linker options or options supported only by xild.
- <LINK_commandline> is your linker command line
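The derived-file naming described above can be modeled in a few lines. This is an illustrative sketch of the naming scheme only; the helper name is ours and is not part of any Intel tool:

```python
def ipo_output_names(first_name, count):
    """Model the multi-object IPO naming scheme: foo.o, foo1.o, foo2.o, ..."""
    stem, dot, ext = first_name.rpartition(".")
    return [first_name] + [f"{stem}{i}{dot}{ext}" for i in range(1, count)]

ipo_output_names("foo.o", 3)  # ['foo.o', 'foo1.o', 'foo2.o']
```

The point to remember is only that the numeric suffix goes before the extension, so the extra objects still look like ordinary .o files to the link step.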
79. procedures are designed for efficient use with the various Intel Fortran run-time components. Using the standard-conforming array intrinsics can also make your program more portable.

With multidimensional arrays where access to array elements will be noncontiguous, avoid leftmost array dimensions that are a power of two (such as 256 or 512). Since the cache sizes are a power of 2, array dimensions that are also a power of 2 may make less efficient use of cache when array access is noncontiguous. If the cache size is an exact multiple of the leftmost dimension, your program will probably make inefficient use of the cache. This does not apply to contiguous sequential access or whole-array access.

One workaround is to increase the dimension to allow some unused elements, making the leftmost dimension larger than actually needed. For example, increasing the leftmost dimension of A from 512 to 520 would make better use of cache:

real a(512,100)
do i = 2, 511
  do j = 2, 99
    a(i,j) = (a(i+1,j-1) + a(i-1,j+1)) * 0.5
  end do
end do

In this code, array a has a leftmost dimension of 512, a power of two. The innermost loop accesses the rightmost dimension (row major), causing inefficient access. Increasing the leftmost dimension of a to 520 (real a(520,100)) allows the loop to provide better performance, but at the expense of some unused elements. Because loop index variables I and J are use
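The cache effect described above can be illustrated with a small model. The cache geometry below (64-byte lines, 64 sets) is an assumption chosen only to show the mechanism; real processors differ:

```python
LINE_BYTES = 64   # assumed cache-line size
NUM_SETS = 64     # assumed number of sets in the cache
ELEM_BYTES = 4    # bytes per single-precision REAL

def sets_touched(leading_dim, accesses=16):
    """Count distinct cache sets touched when a(i, j) is accessed for
    consecutive j (column-major Fortran: stride = leading_dim elements)."""
    stride = leading_dim * ELEM_BYTES
    return len({(j * stride // LINE_BYTES) % NUM_SETS for j in range(accesses)})

sets_touched(512)  # power-of-two leading dimension: only 2 distinct sets
sets_touched(520)  # padded dimension: 16 distinct sets, better cache use
```

With the power-of-two stride, successive accesses keep landing in the same few sets and evict each other; the padded stride spreads the accesses across the cache.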
80. running in OpenMP mode 158; using 38, 83, 87; vectorization support 206
Intel Debugger: IA-32 applications 193; Itanium-based applications 193
Intel® Enhanced Debugger
Intel® extensions: extended intrinsics 33, 237; OpenMP routines 189
Intel® Fortran language: record structures 13, 25
Intel Itanium® Compiler 33
Intel Itanium® processor 48, 73
Intel Pentium 4 processor 75, 76, 77
Intel Pentium III processor 75, 76, 77
Intel Pentium M processor 73, 75, 76, 77, 139
Intel Pentium processors
Intel processors, optimizing for 73, 75, 76, 77
Intel Threading Toolset 34, 37
Intel VTune Performance Analyzer 37
Intel-specific 33, 149
IR 101
intermediate language scalar optimizer 213
intermediate results, using 25
internal subprograms 30, 238
INTERNAL visibility attribute 57
interprocedural: inlining 92; usage 34
interprocedural optimizations (IPO): compilation with real object files 86; criteria for inline function expansion 90; inline expansion of user functions 92; library of IPO objects 87; mult
81. same lock argument.
BARRIER: Synchronizes all the threads in a team. Each thread waits until all of the other threads in that team have reached this point.
ATOMIC: Ensures that a specific memory location is updated atomically, rather than exposing it to the possibility of multiple, simultaneously writing threads.
FLUSH [(list)]: Specifies a cross-thread sequence point at which the implementation is required to ensure that all the threads in a team have a consistent view of certain objects in memory. The optional list argument consists of a comma-separated list of variables to be flushed.
ORDERED / END ORDERED: The structured block following an ORDERED directive is executed in the order in which iterations would be executed in a sequential loop.
THREADPRIVATE (list): Makes the named COMMON blocks or variables in list private to a thread. The list argument consists of a comma-separated list of COMMON blocks or variables.

OpenMP Clauses

Clause: Description
PRIVATE (list): Declares the variables in list to be PRIVATE to each thread in a team.
FIRSTPRIVATE (list): Same as PRIVATE, but the copy of each variable in the list is initialized using the value of the original variable existing before the construct.
LASTPRIVATE (list): Same as PRIVATE, but the original variables in list are updated using the values assigned to them in the sequentially last iteration or section.
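A rough sequential model may make the FIRSTPRIVATE/LASTPRIVATE pair concrete. The Python sketch below only mimics the two clause semantics and is not OpenMP:

```python
def emulate_first_last_private(n_iters, original):
    # FIRSTPRIVATE: every iteration starts from a copy of the original value
    private = [original] * n_iters
    for i in range(n_iters):
        private[i] += i          # each iteration updates only its own copy
    # LASTPRIVATE: the original receives the sequentially last iteration's value
    return private[-1]

emulate_first_last_private(4, 10)  # -> 13: copy of 10, plus 3 in the last iteration
```

In real OpenMP the copies live in different threads; the observable contract is the same: initialization from the original on entry, write-back from the sequentially last iteration on exit.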
82. single operations. Unless -mp is specified, the compiler tries to contract these operations whenever possible. The -mp option disables the contractions. -IPF_fma and -IPF_fma- can be used to override the default compiler behavior. For example, a combination of -mp and -IPF_fma enables the compiler to contract operations:

ifort -mp -IPF_fma myprog.f

FP Speculation

-IPF_fp_speculation<mode> sets the compiler to speculate on floating-point operations in one of the following modes:

- fast: sets the compiler to speculate on floating-point operations; this is the default.
- safe: enables the compiler to speculate on floating-point operations only when it is safe.
- strict: enables the compiler's speculation on floating-point operations, preserving floating-point status in all situations. In the current version, this mode disables the speculation of floating-point operations (same as off).
- off: disables the speculation on floating-point operations.

FP Math Function Optimization

-IPF_fp_relaxed enables or disables the use of faster, but slightly less accurate, code sequences for math functions such as divide and sqrt. Compared to strict IEEE precision, this option slightly reduces the accuracy of floating-point calculations performed by these functions, usually limited to the least significant digit. The default is -IPF_fp_relaxed-.

FP Operations Evaluation

-IPF_flt_eval_method{0|2} directs the
83. size 64: Defines DOUBLE PRECISION declarations, constants, functions, and intrinsics as REAL*8.
-dps: Enables DEC parameter statement recognition.
-error_limit 30: Specifies the maximum number of error-level or fatal-level compiler errors permissible.
-fpe3: Specifies floating-point exception handling at run time for the main program.
-integer_size 32: Makes default integer and logical variables 4 bytes long. INTEGER and LOGICAL declarations are treated as KIND=4.
-pad: Enables changing variable and array memory layout.
-pc80: -pc{32|64|80} enables floating-point significand precision control (IA-32 only) as follows: -pc32 to 24-bit significand, -pc64 to 53-bit significand, and -pc80 to 64-bit significand.
-real_size 64: Specifies the size of REAL and COMPLEX declarations, constants, functions, and intrinsics.
-save: Saves all variables in static allocation. Disables -auto; that is, disables setting all variables AUTOMATIC.
-Zp8: -Zp{n} specifies the alignment constraint for structures on 1-, 2-, 4-, 8-, or 16-byte boundaries. To disable, use -noalign or -Zp1.

Optimizations

Default Option: Description
-assume cc_omp: Enables OpenMP conditional-compilation directives.
-fp: Disables the use of the ebp register (IA-32 only) in optimizations. Directs to use the ebp-ba
84. the END DO directive.

Usage Rules

- You cannot use a GOTO statement, or any other statement, to transfer control into or out of the DO construct.
- If you specify the optional END DO directive, it must appear immediately after the end of the DO loop. If you do not specify the END DO directive, an END DO directive is assumed at the end of the DO loop, and threads synchronize at that point.
- The loop iteration variable is private by default, so it is not necessary to declare it explicitly.

SECTIONS, SECTION and END SECTIONS

Use the noniterative worksharing SECTIONS directive to divide the enclosed sections of code among the team. Each section is executed just one time by one thread. Each section should be preceded with a SECTION directive, except for the first section, in which the SECTION directive is optional. The SECTION directive must appear within the lexical extent of the SECTIONS and END SECTIONS directives. The last section ends at the END SECTIONS directive. When a thread completes its section and there are no undispatched sections, it waits at the END SECTIONS directive unless you specify NOWAIT.

The SECTIONS directive takes an optional comma-separated list of clauses that specifies which variables are PRIVATE, FIRSTPRIVATE, LASTPRIVATE, or REDUCTION. The following example shows how to use the SECTIONS and SECTION directives to execute subroutines X
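The behavior of SECTIONS can be approximated with a thread pool. The Python sketch below uses hypothetical x_axis/y_axis/z_axis work units standing in for the SECTION blocks; it models the "each block once, distributed among the team" contract, not OpenMP itself:

```python
from concurrent.futures import ThreadPoolExecutor

def x_axis(): return "X done"
def y_axis(): return "Y done"
def z_axis(): return "Z done"

with ThreadPoolExecutor(max_workers=3) as team:
    # like SECTIONS: three independent blocks, each executed exactly once,
    # distributed among the team's threads
    futures = [team.submit(work) for work in (x_axis, y_axis, z_axis)]
    results = [f.result() for f in futures]  # waiting here mirrors END SECTIONS

results  # ['X done', 'Y done', 'Z done']
```

Collecting the futures before continuing plays the role of the implicit barrier at END SECTIONS; skipping that wait corresponds to NOWAIT.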
85. the max: the full report. The default is -opt_report_levelmin.

- -opt_report_routine[substring] generates reports from all routines with names containing substring as part of their name. If substring is not specified, reports from all routines are generated. The default is to generate reports for all routines being compiled.

Specifying Optimizations to Generate Reports

The compiler can generate reports for an optimizer you specify in the phase argument of the -opt_report_phase{phase} option. The option can be used multiple times on the same command line to generate reports for multiple optimizers. Currently, reports for the following optimizers are supported:

Logical Name: Optimizer Full Name
ipo: Interprocedural Optimizer
hlo: High-level Language Optimizer
ilo: Intermediate Language Scalar Optimizer
ecg: Itanium-based Compiler Code Generator
all: All optimizers

When one of the above logical names for optimizers is specified, all reports from that optimizer will be generated. For example, -opt_report_phaseipo and -opt_report_phaseecg generate reports from the interprocedural optimizer and the code generator.

Each of the optimizers can potentially have specific optimizations within them. Each of these optimizations is prefixed with the optimizer's logical name. For example,
86. the stack frame.
inline function expansion: An optimization in which the compiler replaces each function call with the function body expanded in place.
induction variable simplification: An optimization in which the compiler reduces the complexity of an array index calculation by using only additions.
instruction scheduling: An optimization in which the compiler reorders the generated machine instructions so that more than one can execute in parallel.
instruction sequencing: An optimization in which the compiler eliminates less efficient instructions and replaces them with instruction sequences that take advantage of a particular processor's features.
interprocedural optimization: An optimization that applies to the entire program, except for library routines.
loop blocking: An optimization in which the compiler reorders the execution sequence of instructions so that the compiler can execute iterations from outer loops before completing all the iterations of the inner loop.
loop unrolling: An optimization in which the compiler duplicates the executed statements inside a loop to reduce the number of loop iterations.
loop invariant code movement: An optimization in which the compiler detects multiple instances of a computation that does not change within a loop.
padding: The addition of bytes or words at the end of each data type in order to meet size and alignment constraints.
preloading: An opti
87. to Big Endian Conversion Environment Variable

In order to use the little-endian-to-big-endian conversion feature, specify the numbers of the units to be used for conversion purposes by setting the F_UFMTENDIAN environment variable. Then the READ/WRITE statements that use these unit numbers will perform the relevant conversions. Other READ/WRITE statements will work in the usual way.

In the general case, the variable consists of two parts divided by a semicolon. No spaces are allowed inside the F_UFMTENDIAN value. The variable has the following syntax:

F_UFMTENDIAN = MODE | [MODE;]EXCEPTION

where:

MODE = big | little
EXCEPTION = big:ULIST | little:ULIST | ULIST
ULIST = U | ULIST,U
U = decimal | decimal-decimal

- MODE defines the current format of the data represented in the files; it can be omitted. The keyword little means that the data have little-endian format and will not be converted. This keyword is the default. The keyword big means that the data have big-endian format and will be converted. This keyword may be omitted together with the colon.
- EXCEPTION is intended to define the list of exclusions for MODE; it can be omitted. The EXCEPTION keyword (little or big) defines the data format in the files that are connected to the units from the EXCEPTION list. This value overrides the MODE value for the units listed.
- Each list
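What the conversion actually does is a per-item byte swap. The Python snippet below illustrates the idea on a single 4-byte integer; it is illustrative only, since the compiler's run-time library performs this transparently for unformatted I/O on the listed units:

```python
import struct

raw = struct.pack(">i", 1234)        # a 4-byte INTEGER as a big-endian system wrote it

wrong = struct.unpack("<i", raw)[0]  # read without conversion: a garbled value
right = struct.unpack(">i", raw)[0]  # byte-swapped read, as F_UFMTENDIAN arranges

right  # 1234
```

The same swap applies item by item to each REAL or INTEGER in a record, which is why the variable selects whole units rather than individual variables.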
88. two runs. The syntax for this tool is as follows:

codecov [-codecov_option]

where codecov_option is a tool option you choose to run the code coverage with. If you do not use any option, the tool will provide the top-level code coverage for your whole program. The tool uses the options listed in the table that follows.

Option: Description (Default)
-help: Prints all the options of the code-coverage tool.
-spi file: Sets the path name of the static profile information file (.spi). Default: pgopti.spi
-dpi file: Sets the path name of the dynamic profile information file (.dpi). Default: pgopti.dpi
-prj: Sets the project name.
-counts: Generates dynamic execution counts.
-nopartial: Treats partially covered code as fully covered.
-comp: Sets the filename that contains the list of files of interest.
-ref: Finds the differential coverage with respect to ref_dpi_file.
-demang: Demangles both function names and their arguments.
-mname: Sets the name of the web-page owner.
-maddr: Sets the email address of the web-page owner.
-bcolor: Sets the HTML color name or code of the uncovered blocks. Default: #ffff99
-fcolor: Sets the HTML color name or code of the uncovered functions. Default: #ffcccc
-pcolor: Sets the HTML color name or code of the partially covered code. Default: #fafad2
-ccolor: Sets the HTML color name or code of the covered code. Default: #ffffff
89. unless you are sure that the data items in the record structures will be naturally aligned.

- EQUIVALENCE statements: EQUIVALENCE statements can force unaligned data or cause data to span natural boundaries. For more information, see the Intel® Fortran Language Reference.

To avoid unaligned data in a common block, derived-type structure, or record structure, use one or both of the following:

- For new programs, or for programs where the source code declarations can be modified easily, plan the order of data declarations with care. For example, you should order variables in a COMMON statement such that numeric data is arranged from largest to smallest, followed by any character data (see the data declaration rules in Ordering Data Declarations to Avoid Unaligned Data below).
- For existing programs where source code changes are not easily done, or for array elements containing derived-type or record structures, you can use command-line options to request that the compiler align numeric data by adding padding spaces where needed.

Other possible causes of unaligned data include unaligned actual arguments and arrays that contain a derived-type structure or record structure:

- When actual arguments from outside the program unit are not naturally aligned, unaligned data access occurs. Intel Fortran assumes all passed arguments are naturally aligned and has no information at compile time about data that will be introduced by actual arguments during execution.
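The payoff of ordering declarations from largest to smallest can be estimated with a small model of natural alignment (field sizes in bytes). The helper below is an illustrative sketch, not a rule taken from the compiler:

```python
def padded_size(field_sizes):
    """Size of a naturally aligned record: each field is aligned to its own
    size, and the whole record is rounded up to the largest alignment."""
    offset = 0
    for size in field_sizes:
        offset = (offset + size - 1) // size * size + size  # align, then place
    align = max(field_sizes)
    return (offset + align - 1) // align * align

padded_size([8, 4, 2, 1])  # largest to smallest: 16 bytes, no interior padding
padded_size([1, 8, 2, 4])  # careless order: 24 bytes, one third is padding
```

Sorting the fields by decreasing size guarantees every field lands on an already-aligned offset, which is exactly the declaration-ordering advice above.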
90. unrelated .dyn files, oftentimes from previous runs or from other tests, are not present in that directory. Otherwise, profile information will be based on invalid profile data, which can negatively impact the performance of optimized code as well as generate misleading coverage information.

- User-generated file containing the list of tests to be prioritized.

For successful tool execution, you should:

- Name each test .dpi file so that the file names uniquely identify each test.
- Create a DPI list file: a text file that contains the names of all .dpi test files. The name of this file serves as an input for the test prioritization tool execution command. Each line of the DPI list file should include one, and only one, .dpi file name. The name can optionally be followed by the duration of the execution time for the corresponding test, in the dd:hh:mm:ss format. For example, "Test1.dpi 00:00:60:35" informs that Test1 lasted 0 days, 0 hours, 60 minutes, and 35 seconds.

The execution time is optional. However, if it is not provided, the tool will not prioritize the tests for minimizing execution time; it will prioritize to minimize the number of tests only.

Usage Model

The chart that follows presents the test prioritization tool usage model.

[Chart: test prioritization tool usage model — the static profile information (.spi) is kept for coverage analysis; runs of the instrumented executable produce the dynamic profile data.]
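The dd:hh:mm:ss duration can be reduced to seconds with a few lines. This parser is an illustrative sketch for working with DPI list files, not part of the tool itself:

```python
def dpi_duration_seconds(stamp):
    """Convert the dd:hh:mm:ss duration that may follow a .dpi name."""
    dd, hh, mm, ss = (int(part) for part in stamp.split(":"))
    return ((dd * 24 + hh) * 60 + mm) * 60 + ss

dpi_duration_seconds("00:00:60:35")  # Test1's run time: 3635 seconds
```

Note that the format tolerates field overflow, as in the guide's own example, where the minutes field holds 60.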
91. vectorization. You will often need to make some changes to your loops. Guidelines for loop bodies follow.

Use:
- Straight-line code (a single basic block).
- Vector data only; that is, arrays and invariant expressions on the right-hand side of assignments. Array references can appear on the left-hand side of assignments.
- Only assignment statements.

Avoid:
- Function calls.
- Unvectorizable operations (other than mathematical).
- Mixing vectorizable types in the same loop.
- Data-dependent loop exit conditions.
- Loop unrolling (the compiler does it).
- Decomposing one loop with several statements in the body into several single-statement loops.

There are a number of restrictions that you should be aware of. Vectorization depends on two major factors:

- Hardware. The compiler is limited by restrictions imposed by the underlying hardware. In the case of Streaming SIMD Extensions, the vector memory operations are limited to stride-1 accesses, with a preference for 16-byte-aligned memory references. This means that if the compiler abstractly recognizes a loop as vectorizable, it still might not vectorize it for a distinct target architecture.
- Style. The style in which you write source code can inhibit optimization. For example, a common problem with global pointers is that they often prevent the compiler from being able to prove that two memory references refer to distinct locations. Consequently, this prevents certain reordering transformations.
92. 0 overflows the single-precision floating-point range and results in +Infinity; 1E30*1E30 results in +Infinity.

Floating divide by zero: if the computation is 0.0/0.0, the result is the exceptional value NaN (Not a Number), a value that means the computation was not successful. If the numerator is not 0.0, the result is a signed Infinity.

Floating underflow: the result of a computation is too small for the floating-point type. Each floating-point type (32-, 64-, and 128-bit) has a denormalized range where very small numbers can be represented with some loss of precision. For example, the lower bound for a normalized single-precision floating-point value is approximately 1E-38; the lower bound for a denormalized single-precision floating-point value is 1E-45. 1E-30*1E-10 underflows the normalized range, but not the denormalized range, so the result is the denormal exceptional value 1E-40. 1E-30*1E-30 underflows the entire range, and the result is zero. This is known as gradual underflow to 0.

Floating invalid: when an exceptional value (signed Infinity, NaN, denormal) is used as input to a computation, the result is also a NaN.

The -fpe{n} option allows some control over the results of floating-point exception handling at run time for the main program.

-fpe0 restricts floating-point exceptions as follows:
- Floating overflow, floating divide by zero, and floating invalid cause the program to print an error message and abort.
- If a fl
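These behaviors can be observed from any IEEE 754 environment. The Python snippet below uses double precision, so the thresholds differ from the single-precision bounds quoted above; it is purely an illustration of overflow, invalid operations, and gradual underflow:

```python
import math

overflowed = 1e308 * 10                # overflow -> +Infinity
invalid = float("inf") - float("inf")  # invalid operation -> NaN

tiny = 1e-310      # below the normalized double minimum (~2.2e-308): a denormal
gone = tiny * 1e-20  # underflows the entire range -> exactly 0.0

math.isinf(overflowed), math.isnan(invalid), tiny > 0.0, gone == 0.0  # all True
```

The denormal value tiny is still nonzero, which is the "gradual" part of gradual underflow; only when the result falls below even the denormal range does it become zero.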
93. 1), which has already been written to and changed in the previous iteration. To see this, look at the access patterns of the array for the first two iterations, as shown below.

Example of Data Dependence Vectorization Patterns

I=1: READ DATA(0), READ DATA(1), READ DATA(2), WRITE DATA(1)
I=2: READ DATA(1), READ DATA(2), READ DATA(3), WRITE DATA(2)

In the normal sequential version of this loop, the value of DATA(1), read during the second iteration, was written to in the first iteration. For vectorization, it must be possible to do the iterations in parallel without changing the semantics of the original loop.

Data Dependence Analysis

Data dependence analysis involves finding the conditions under which two memory accesses may overlap. Given two references in a program, the conditions are defined by:

- whether the referenced variables may be aliases for the same (or overlapping) regions in memory, and, for array references,
- the relationship between the subscripts.

For IA-32, the data dependence analyzer for array references is organized as a series of tests, which progressively increase in power as well as in time and space costs. First, a number of simple tests are performed in a dimension-by-dimension manner, since independence in any dimension will exclude any dependence relationship. Multidimensional array references that may
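The subscript relationship can be sketched for the one-dimensional pattern above: iteration i writes DATA(i) and reads DATA(i-1), DATA(i), DATA(i+1). A loop-carried flow dependence exists when a later iteration reads what an earlier one wrote. The offsets below are illustrative bookkeeping, not the compiler's actual test suite:

```python
def flow_dependence_distance(write_offset, read_offsets):
    """Smallest positive iteration distance d such that iteration i+d reads
    the element iteration i wrote; None means no loop-carried flow dependence."""
    distances = [write_offset - r for r in read_offsets if write_offset - r > 0]
    return min(distances) if distances else None

flow_dependence_distance(0, [-1, 0, 1])  # 1: DATA(I-1) was written one iteration ago
flow_dependence_distance(0, [1])         # None: reading DATA(I+1) is not a flow dep
```

A distance of 1, as in the DATA example, is exactly what forbids running the iterations in parallel without changing the loop's semantics.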
94. 1; extended precision 30; extensions support 73; EXTERN symbol visibility attribute 57

F

F_UFMTENDIAN: setting 45; usage 45
-fast compiler option 62, 103
feature: application 127; contributes 127, 232; 45, 103; OpenMP contains 150; overview 203; 193
feedback compilation 100
FIELDS 174, 175
file: 97, 103, 111; 101; assembly 83, 86, 210; default output 41; dynamic information 97; executable 38, 41, 76, 81, 83, 86, 94, 97, 100, 131, 145, 149, 150; 25; multiple 80; multiple source files 41; object 41, 48, 80, 81, 83, 86, 97; pathname 57; real object files 86; relocating the source files 102; required 81, 103, 111; specifying symbol files 57
FIRSTPRIVATE clause, usage 176
floating-point: applications 34; arithmetic precision: IA-32 systems 69; Itanium-based systems 70; -mp option 66; -mp1 option 66; options 66; overview 66; exception handling 34, 66; floating-point-to-integer 69; multiply and
95. 150; shared variables 197; updating 191; usage 180
significand, rounding 69
SIMD 34, 127, 130, 133, 137, 138
SIMD-SSE2, streaming 34
SIMD encodings, enabling 138
simple difference operator 191
137, 139
SINGLE directive: 166; encounters 166; executing 166; usage 166
single instruction 133
single precision 30, 66
single-statement loops 133
single-threaded 193
small logical data items 30
33
34, 143, 149
software pipelining 127, 203, 204
source: 158; coding guidelines 30; files relocation 102; 145, 158; 195, 197; view 103
specialized code 75, 76, 127, 131
specific, optimizing 73
specifying: 8-byte data 48; DEFAULT 175; 98; END DO 166; KIND 13; ORDERED 170; profiling summary 98; 25; 180; SEQUENCE 13; symbol visibility explicitly 57; vectorized 140; visibility without symbol file 57
.spi file 103, 111; 111; 103, 111
96. 195; shared variables 201
debugger 193; 210
default: compiler optimizations 147; for record buffers 25; level optimization 41; 83; 83; optimizations 48; value 147
DEFAULT clause 175
deferred-shape arrays 20
-demang option for code coverage tool 103
denormal: 34; 66, 70; values 34, 66, 70
denormals-are-zero 34, 77
dependence of data 134
dequeuing 170
derived-type components 13, 20
determining parallelization 127
device-specific blocksize 25, 228
diagnostic reports 147, 158
diagnostics: auto-parallelizer 127, 147; indicating loops 131; indicating MASTER 158; OpenMP 158
difference operators 191
differential coverage 103
DIMENSION 135, 136
directives: 156; enhanced compilation 203, 206; format 145, 158; IVDEP 124, 206; 145, 158; overview 203; preceding 206; usage rules 150; VECTOR 206
directory, specifying 98
disable
97. 3. Turn on optimization reporting.
4. Check why loops are not software-pipelined:
   - Use CDEC$ ivdep to tell the compiler there is no dependency. You may also need the -ivdep_parallel option to indicate there is no loop-carried dependency.
   - Use CDEC$ swp to enable software pipelining (useful for lopsided control and unknown loop count). Use CDEC$ loop count(n) when needed.
   - If Cray pointers are used, use -safe_cray_ptr to indicate there is no aliasing.
   - Use CDEC$ distribute point to split large loops (normally this is done automatically).
5. Check that the prefetch distance is correct. Use CDEC$ prefetch to override the distance when it is needed.

Loop Transformations

The loop transformation techniques include:

- loop normalization
- loop reversal
- loop interchange and permutation
- loop distribution
- loop fusion
- scalar replacement
- absence of loop-carried memory dependency (with the IVDEP directive)
- runtime data-dependence checking (Itanium®-based systems only)

The loop transformations listed above are supported by data dependence. The loop transformation techniques also include induction variable elimination, constant propagation, copy propagation, forward substitution, and dead code elimination. In addition to the loop transformations listed for both IA-32 and Itanium® architectures above, the Itanium architecture enables implementation of the collapsing techniques.
98. 3; object file 80
synchronization 170
Itanium architectures 34
Itanium® compiler: -auto_ilp32 compiler option 79; code generator 213
Itanium® processors 34, 48, 73
Itanium-based applications, pipelining 203
Itanium-based compilation 94
Itanium-based multiprocessor 127
Itanium-based processors 66
Itanium-based systems: 90; Intel Debugger 193; optimization reports 213; pipelining 203; software pipelining 127; using intrinsics 33
IVDEP directive 121, 124, 206
ivdep_parallel 124
-ivdep_parallel compiler option 121, 124, 206

K

KIND parameter: double-precision variables 48; specifying 13
KAP 183, 197
KMP_ALL_THREADS 183
KMP_BLOCKTIME 183
KMP_BLOCKTIME value 182
kmp_calloc 189
kmp_free 189
kmp_get_stacksize 189
kmp_get_stacksize_s 189
KMP_LIBRARY 183
kmp_malloc 189
KMP_MONITOR_STACKSIZE 183
kmp_pointer_kind 189
kmp_realloc 189
kmp_set_stacksize 189
kmp_set
99. 45, 150, 156, 160, 164, 169, 175, 176, 178, 180, 201
parallel construct, PARALLEL directive 145, 164, 170
PARALLEL DO, PARALLEL DO directive 144, 180
parallel invocations with makefile 83, 97
PARALLEL PRIVATE 127
parallel processing: directive groups 150; thread model pseudo code 156; thread model 156
parallel program development 127
parallel regions: debugging 195; directives 164; 197
PARALLEL SECTIONS, usage 169
PARALLEL SECTIONS / END PARALLEL SECTIONS 169
parallel worksharing 150, 169
parallelism 127
parallelization: 127; overview 48, 127, 144, 156, 158; 143
parsing 25
part mutually exclusive 48
pathname 57
-pc compiler option 48, 69
103
Pentium 4 processors 73
Pentium III processors 73
Pentium M processors 73
performance analysis 149
performance analyzer 37, 193
performance-critical 103, 182
performance-related options 41
performing data flow 127, 143
25
PGO: environment variables 99; methodology 94; PGO
100. 70; C$OMP END PARALLEL 170; C$OMP PARALLEL 170; C$OMP prefix for OpenMP directives 158, 191
cache size, function returning 33
CACHESIZE 23
call stack dumps: master thread 197; worker thread 197
callee 90
calls: and DO loop collapsing 25; malloc 57
call stack 193
cc_omp keyword for -assume 48
-ccdefault compiler option 48, 224
-ccolor option of code coverage tool 103
CDEC$ prefix for general directives 145, 203, 204, 205, 206
CEIL rounding mode 34
character data 13, 55
checking: floating-point stack state 51; inefficient unaligned data 13
chunk size, specifying 180
clauses: COPYIN 175; cross-reference of 160; DEFAULT 175; FIRSTPRIVATE 176; in worksharing constructs 166; LASTPRIVATE 176; 174; overview of 160, 174; PRIVATE 176; REDUCTION 178; SHARED 180; summary of 160; to debug shared variables 201
cleanup of loops 138
code: assembly 37, 206; preparing 150
codecov command 103
codecov_option for code coverage tool 103
code-coverage tool 103
coding guide
101. 93
Debugging Multithread Programs Overview ... 193
Debugging Parallel Regions ... 195
Debugging Multiple Threads ... 197
Debugging Shared Variables ... 201
Optimization Support Features ... 203
Optimization Support Features Overview ... 203
Compiler Directives ... 203
Compiler Directives Overview ... 203
Pipelining for Itanium®-based Applications ... 203
Loop Count and Loop Distribution ... 204
Loop Unrolling Support ... 205
Prefetching Support ... 206
Vectorization Support ... 206
Optimizations and Debugging ... 210
Support for Symbolic Debugging ... 211
The Use of ebp Register ... 211
Combining Optimization and Debugging ... 211
Debugging and Assembling ... 212
Optimizer Report Generation ... 213
Specifying Optimizations to Generate Reports ... 213
Glossary ... 217
Index ... 221

Optimizing Applications Overview

This is the second volume in a two-volume Intel® Fortran Compiler User's Guide. It covers the following topics: programming for high performance using the Intel Fortran C
         A(N), B(N), C(N)
15       INTEGER I, ID, OMP_GET_THREAD_NUM
16 !$OMP PARALLEL DO SHARED(A,B,C,N) PRIVATE(ID)
17       DO I = 1, N
18          ID = OMP_GET_THREAD_NUM()
19          C(I) = A(I) + B(I) + ID
20       ENDDO
21 !$OMP END PARALLEL DO
22       END

The Call Stack Dumps

The first call stack below is obtained by breaking at the entry to subroutine PADD using the GNU debugger. At this point, the program has not executed any OpenMP regions and therefore has only one thread. The call stack shows a system run-time __libc_start_main function calling the Fortran main program parallel, and parallel calls subroutine padd.

When the program is executed by more than one thread, you can switch from one thread to another. The second and the third call stacks are obtained by breaking at the entry to the parallel region. The call stack of the master thread contains the complete call sequence. At the top of the call stack is padd__6__par_loop0. Invocation of a threaded entry point involves a layer of Intel OpenMP library function calls, that is, functions with the __kmp prefix. The call stack of the worker thread contains a partial call sequence that begins with a layer of Intel OpenMP library function calls.

ERRATA: The GNU debugger sometimes fails to properly unwind the call stack of the immediate caller of the Intel OpenMP library function __kmpc_fork_call.

Call Stack Dump of Master Thread upon Entry to Subroutine PADD

(gdb) bt
#0  0x0864a031 in padd (a
API: Profile Information Generation Support

PGO API Support Overview

The Profile Information Generation Support (Profile IGS) enables you to control the generation of profile information during the instrumented execution phase of profile-guided optimizations. Normally, profile information is generated by an instrumented application when it terminates, by calling the standard exit function.

To ensure that profile information is generated, the functions described in this section may be necessary or useful in the following situations:

- The instrumented application exits using a non-standard exit routine.
- The instrumented application is a non-terminating application: exit is never called.
- The application requires control of when the profile information is generated.

A set of functions and an environment variable comprise the Profile IGS.

The Profile IGS Functions

The Profile IGS functions are made available to your application by inserting a header file at the top of any source file where the functions may be used:

#include "pgouser.h"

Note: The Profile IGS functions are written in the C language. Fortran applications need to call C functions.

The rest of the topics in this section describe the Profile IGS functions.

Note: Without instrumentation, the Profile IGS functions cannot provide PGO API support.

The Profile IGS Environment Variable

The environment variable for Profile IGS
...X_AXIS, Y_AXIS, and Z_AXIS in parallel. The first SECTION directive is optional:

!$OMP PARALLEL
!$OMP SECTIONS
!$OMP SECTION
      CALL X_AXIS
!$OMP SECTION
      CALL Y_AXIS
!$OMP SECTION
      CALL Z_AXIS
!$OMP END SECTIONS
!$OMP END PARALLEL

SINGLE and END SINGLE

Use the SINGLE directive when you want just one thread of the team to execute the enclosed block of code. Threads that are not executing the SINGLE directive wait at the END SINGLE directive unless you specify NOWAIT.

The SINGLE directive takes an optional comma-separated list of clauses that specifies which variables are PRIVATE or FIRSTPRIVATE.

When the END SINGLE directive is encountered, an implicit barrier is erected, and threads wait until all threads have finished. This can be overridden by using the NOWAIT option.

In the following example, the first thread that encounters the SINGLE directive executes subroutines OUTPUT and INPUT:

!$OMP PARALLEL DEFAULT(SHARED)
      CALL WORK(X)
!$OMP BARRIER
!$OMP SINGLE
      CALL OUTPUT(X)
      CALL INPUT(Y)
!$OMP END SINGLE
      CALL WORK(Y)
!$OMP END PARALLEL

Combined Parallel Worksharing Constructs

The combined parallel worksharing constructs provide an abbreviated way to specify a parallel region that contains a single worksharing construct. The combined parallel worksharing constructs are:

- PARALLEL DO
Bold normal text indicates menu names, menu items, button names, dialog window names, and other user interface items. For example: File > Open.

Menu names and menu items joined by a greater-than (>) sign indicate a sequence of actions. For example, "Click File > Open" indicates that in the File menu, click Open to perform this action.

ifort: The use of the compiler command in examples follows this general rule: when there is no usage difference between architectures, only one command is given. Whenever there is a difference in usage, the commands for each architecture are given.

This type style: Regular monospaced text indicates an element of syntax, a reserved word, a keyword, a file name, a variable, or a code example. The text appears in lowercase unless uppercase is required.

This type style: Bold monospaced text indicates user input. It shows what you type as a command or input.

This type style: Italic monospaced text indicates placeholders for information that you must supply. This style is also used to introduce new terms.

[options]: Items inside single square brackets are optional. In some examples, square brackets are used to show arrays.

{value | value}: Braces and a vertical bar indicate a choice of items. You must choose one of the items unless all of the items are also enclosed in square brackets.
...DO), __par_section for OpenMP parallel sections (!$OMP PARALLEL SECTIONS)

- Sequence number of the parallel region (for each source file, the sequence number starts from zero)

When you use routine names (for example, padd) and entry names (for example, padd__6__par_loop0), the following occurs: the Fortran Compiler by default first changes lower or mixed case routine names to upper case. For example, pAdD becomes PADD, and this becomes the entry name by adding one underscore. The secondary entry name change happens after that; that is why the par_loop part of the entry name stays lower case. For some reason, the debugger does not accept the upper case routine name PADD to set the breakpoint; instead, it accepts the lower case routine name padd.

Example 1 shows the debugging of the code with a parallel region. Example 1 is produced by this command:

ifort -openmp -g -O0 -S file.f90

Let us consider the code of subroutine parallel in Example 1.

Subroutine PARALLEL source listing

1     subroutine parallel
2     integer id, OMP_GET_THREAD_NUM
3 !$OMP PARALLEL PRIVATE(id)
4     id = OMP_GET_THREAD_NUM()
5 !$OMP END PARALLEL
6     end

The parallel region is at line 3. The compiler created two entry points: parallel_ and parallel__3__par_region0. The first entry point corresponds

Example 1
      DO I = 1, 100
        A(I) = B(I) * C(I)
C The next statement allows early
C exit from the loop and prevents
C vectorization of the loop
        IF (A(I) .LT. 0.0) GOTO 10
      ENDDO
10    CONTINUE
      RETURN
      END

Loop Exit Conditions

Loop exit conditions determine the number of iterations that a loop executes. For example, fixed indexes for loops determine the iterations. The loop iterations must be countable; that is, the number of iterations must be expressed as one of the following:

- A constant
- A loop invariant term
- A linear function of outermost loop indices

Loops whose exit depends on computation are not countable. The examples below show countable and non-countable loop constructs.

Correct Usage for Countable Loop, Example 1

      SUBROUTINE FOO (A, B, C, N, LB)
      DIMENSION A(N), B(N), C(N)
      INTEGER N, LB, I, COUNT
C Number of iterations is (N - LB + 1)
      COUNT = N
      DO WHILE (COUNT .GE. LB)
        A(I) = B(I) * C(I)
        COUNT = COUNT - 1
        I = I + 1
      ENDDO
C LB is not defined within loop
      RETURN
      END

Correct Usage for Countable Loop, Example 2

C Number of iterations is (N - M + 2) / 2
      SUBROUTINE FOO (A, B, C, M, N, LB)
      DIMENSION A(N), B(N), C(N)
      INTEGER I, L, M, N
      I = 1
      DO L = M, N, 2
        A(I) = B(I) * C(I)
        I = I + 1
      ENDDO
      RETURN
      END

Exam
...EQUIVALENCE or SAVE statement, or those that are in COMMON. -auto is the same as -automatic and -nosave.

-auto may provide a performance gain for your program, but if your program depends on variables having the same value as the last time the routine was invoked, your program may not function properly. Variables that need to retain their values across routine calls should appear in a SAVE statement. If you specify -recursive or -openmp, the default is -auto.

-auto_scalar

The -auto_scalar option causes allocation of local scalar variables of intrinsic type INTEGER, REAL, COMPLEX, or LOGICAL to the stack. This option does not affect variables that appear in an EQUIVALENCE or SAVE statement, or those that are in COMMON.

-auto_scalar may provide a performance gain for your program, but if your program depends on variables having the same value as the last time the routine was invoked, your program may not function properly. Variables that need to retain their values across subroutine calls should appear in a SAVE statement.

This option is similar to -auto, which causes all local variables to be allocated on the stack. The difference is that -auto_scalar allocates only scalar variables of the intrinsic types stated above to the stack. -auto_scalar enables the compiler to make better choices about which variables should be kept in registers during program execution.

-save, -zero

The -save option is the opposite of -auto: the -save option saves
Strip-mining and Cleanup
Statements in the Loop Body
Vectorization Examples
    Loop Interchange and Subscripts: Matrix Multiply
Auto-parallelization
    Auto-parallelization Overview
    Programming with Auto-parallelization
    Auto-parallelization: Enabling, Options, Directives, and Environment Variables
    Auto-parallelization: Threshold Control and Diagnostics
Parallelization with OpenMP
    Parallelization with OpenMP Overview
    Programming with OpenMP
    Parallel Processing Thread Model
    Compiling with OpenMP, Directive Format, and Diagnostics
    OpenMP Directives and Clauses Summary
    OpenMP Directive Descriptions
    OpenMP Clause Descriptions
    OpenMP Support Libraries
    OpenMP Environment Variables
    OpenMP Run-time Library Routines
    Intel Extension Routines
    Examples
Debugging Multithreaded Programs
-par_threshold{n}
Sets a threshold for the auto-parallelization of loops based on the probability of profitable execution of the loop in parallel, where n = 0 to 100. n = 0 implies "always." Default: n = 100.

-par_report{0|1|2|3}
Controls the auto-parallelizer's diagnostic levels. Default: -par_report1.

OpenMP (IA-32 and Itanium(R) architectures)

-openmp
Enables the parallelizer to generate multithreaded code based on the OpenMP directives. Default: OFF.

-openmp_report{0|1|2}
Controls the OpenMP parallelizer's diagnostic levels. Default: -openmp_report1.

-openmp_stubs
Enables compilation of OpenMP programs in sequential mode. The OpenMP directives are ignored and a stub OpenMP library is linked. Default: OFF.

Note: When both -openmp and -parallel are specified on the command line, the -parallel option is only honored in routines that do not contain OpenMP directives. For routines that contain OpenMP directives, only the -openmp option is honored.

With the right choice of options, programmers can:

- Increase the performance of the application with minimum effort
- Use compiler features to develop multithreaded programs faster

With a relatively small effort of adding OpenMP directives to their code, programmers can transform a sequential program into a parallel program. The following are examples of
High-level parallelization: analyze the dependence graph to determine loops which can execute in parallel; compute run-time dependency.

Data partitioning: examine data references and partition based on the following types of access: SHARED, PRIVATE, and FIRSTPRIVATE.

Multi-threaded code generation: modify loop parameters; generate entry/exit per threaded task; generate calls to parallel run-time routines for thread creation and synchronization.

Auto-parallelization: Enabling, Options, Directives, and Environment Variables

To enable the auto-parallelizer, use the -parallel option. The -parallel option detects parallel loops capable of being executed safely in parallel and automatically generates multithreaded code for these loops. An example of the command using auto-parallelization is as follows:

ifort -c -parallel myprog.f

Auto-parallelization Options

The -parallel option enables the auto-parallelizer if the -O2 or -O3 optimization option is also on (the default is -O2). The -parallel option detects parallel loops capable of being executed safely in parallel and automatically generates multithreaded code for these loops.

-parallel
Enables the auto-parallelizer.

-par_threshold{0..100}
Controls the work threshold needed for auto-parallelization; see later subsection.

-par_report{1|2|3}
Controls the diagnostic messages from the auto-parallelizer; see later subsection.
...Itanium-based applications support the debugging of programs that are executed by multiple threads. However, the currently available versions of such debuggers do not directly support the debugging of parallel decomposition directives, and therefore there are limitations on the debugging features. Some of the new features used in OpenMP are not yet fully supported by the debuggers, so it is important to understand how these features work and to know how to debug them. The two problem areas are:

- Multiple entry points
- Shared variables

The Intel Debugger (IDB) is not aware of, and currently does not handle, the unique OpenMP features that relate to multithreading.

Debugging Parallel Regions

The compiler implements a parallel region by enabling the code in the region and putting it into a separate, compiler-created entry point. Although this is different from outlining (the technique employed by other compilers, that is, creating a subroutine), the same debugging technique can be applied.

Constructing an Entry-point Name

The compiler-generated parallel region entry point name is constructed from a concatenation of the following strings:

- "_" character
- entry point name for the original routine (for example, _parallel)
- "_" character
- line number of the parallel region
- __par_region for OpenMP parallel regions (!$OMP PARALLEL), __par_loop for OpenMP parallel loops (!$OMP PARALLEL
      ENDDO
      END SUBROUTINE

Aligning Data

The vectorizer will apply dynamic loop peeling as follows:

      SUBROUTINE DOIT(A)
      REAL A(100)
C let P be (A mod 16) where A is address of A(1)
      IF (P .NE. 0) THEN
C determine run-time peeling factor
        P = (16 - P) / 4
        DO I = 1, P
          A(I) = A(I) + 1.0
        ENDDO
      ENDIF
C Now this loop starts at a 16-byte boundary
C and will be vectorized accordingly
      DO I = P + 1, 100
        A(I) = A(I) + 1.0
      ENDDO
      END SUBROUTINE

Loop Interchange and Subscripts: Matrix Multiply

Matrix multiplication is commonly written as shown in the following example:

      DO I = 1, N
        DO J = 1, N
          DO K = 1, N
            C(I,J) = C(I,J) + A(I,K) * B(K,J)
          END DO
        END DO
      END DO

The use of B(K,J) is not a stride-1 reference and therefore will not normally be vectorizable. If the loops are interchanged, however, all the references will become stride-1, as in the Matrix Multiplication with Stride 1 example that follows.

Note: Interchanging is not always possible because of dependencies, which can lead to different results.

Example of Matrix Multiplication with Stride 1

      DO J = 1, N
        DO K = 1, N
          DO I = 1, N
            C(I,J) = C(I,J) + A(I,K) * B(K,J)
          ENDDO
        ENDDO
      ENDDO

For additional information, see publications on compiler optimizations.

Auto-parallelization
...COMMON statement. The order of variables in the COMMON statement determines their storage order. Unless you are sure that the data items in the common block will be naturally aligned, specify either the -align commons or -align dcommons option, depending on the largest data size used. See Alignment Options.

Derived-type (user-defined) data: Derived-type data items are declared after a TYPE statement. If your data includes derived-type data structures, you should use the -align records option, unless you are sure that the data items in the derived-type structures will be naturally aligned.

If you omit the SEQUENCE statement, the -align records option (the default) ensures all data items are naturally aligned. If you specify the SEQUENCE statement, the -align records option is prevented from adding the padding necessary to avoid unaligned data (data items are packed), unless you specify the -align sequence option. When you use SEQUENCE, you should specify the data declaration order so that all data items are naturally aligned.

Record structures (RECORD and STRUCTURE statements): Intel Fortran record structures usually contain multiple data items. The order of variables in the STRUCTURE statement determines their storage order. The RECORD statement names the record structure. Record structures are an Intel Fortran language extension. If your data includes record structures, you should use the -align records option
Overview

The auto-parallelization feature of the Intel Fortran Compiler automatically translates serial portions of the input program into equivalent multithreaded code. The auto-parallelizer analyzes the dataflow of the program's loops and generates multithreaded code for those loops which can be safely and efficiently executed in parallel. This enables the potential exploitation of the parallel architecture found in symmetric multiprocessor (SMP) systems.

Automatic parallelization relieves the user from:

- Having to deal with the details of finding loops that are good worksharing candidates
- Performing the dataflow analysis to verify correct parallel execution
- Partitioning the data for threaded code generation, as is needed in programming with OpenMP directives

The parallel run-time support provides the same run-time features as found in OpenMP, such as handling the details of loop iteration modification, thread scheduling, and synchronization.

While OpenMP directives enable serial applications to transform into parallel applications quickly, the programmer must explicitly identify specific portions of the application code that contain parallelism and add the appropriate compiler directives. Auto-parallelization, triggered by the -parallel option, automatically identifies those loop structures which contain parallelism. During compilation, the compiler automatically attempts to decompose the code sequences into separate threads for parallel processing.
!$OMP END PARALLEL

Setting Conditional Parallel Region Execution

When an IF clause is present on the PARALLEL directive, the enclosed code region is executed in parallel only if the scalar logical expression evaluates to .TRUE.. Otherwise, the parallel region is serialized. When there is no IF clause, the region is executed in parallel by default.

In the following example, the statements enclosed within the !$OMP DO and !$OMP END DO directives are executed in parallel only if there are more than three processors available. Otherwise, the statements are executed serially:

!$OMP PARALLEL IF (OMP_GET_NUM_PROCS() .GT. 3)
!$OMP DO
      DO I = 1, N
         Y(I) = SQRT(Z(I))
      END DO
!$OMP END DO
!$OMP END PARALLEL

If a thread executing a parallel region encounters another parallel region, it creates a new team and becomes the master of that new team. By default, nested parallel regions are always executed by a team of one thread.

Note: To achieve better performance than sequential execution, a parallel region must contain one or more worksharing constructs so that the team of threads can execute work in parallel. It is the contained worksharing constructs that lead to the performance enhancements offered by parallel processing.

Worksharing Construct Directives

A worksharing construct must be enclosed dynamically within a parallel
...parallel region has four shared variables. The first two parameters (parameters 1 and 2) are reserved for the compiler's use, and each of the remaining four parameters corresponds to one shared variable. These four parameters exactly match the last four parameters to __kmpc_fork_call in the machine code of PADD.

Note: The FIRSTPRIVATE, LASTPRIVATE, and REDUCTION variables also require shared variables to get the values into or out of the parallel region.

Due to the lack of support in debuggers, the correspondence between the shared variables (in their original names) and their contents cannot be seen in the debugger at the threaded entry point level. However, you can still move to the call stack of one of the subroutines and examine the contents of the variables at that level. This technique can be used to examine the contents of shared variables. In Example 2, the contents of the shared variables A, B, C, and N can be examined if you move to the call stack of PARALLEL.

Optimization Support Features

Optimization Support Features Overview

This section describes the Intel(R) Fortran features, such as directives, intrinsics, run-time library routines, and various utilities, which enhance your application performance in support of compiler optimizations. These features are Intel Fortran language extensions that enable you to optimize your source code directly. This section includes examples of optimizations supported by Intel
...TE, FIRSTPRIVATE, LASTPRIVATE, REDUCTION
PRIVATE, FIRSTPRIVATE, LASTPRIVATE, REDUCTION
PRIVATE, FIRSTPRIVATE, COPYIN, DEFAULT
PRIVATE, FIRSTPRIVATE, LASTPRIVATE, REDUCTION, SHARED, SCHEDULE, COPYIN, DEFAULT
PRIVATE, FIRSTPRIVATE, LASTPRIVATE, REDUCTION, SHARED
None

OpenMP Directive Descriptions

Parallel Region Directives

The PARALLEL and END PARALLEL directives define a parallel region as follows:

!$OMP PARALLEL
! parallel region
!$OMP END PARALLEL

When a thread encounters a parallel region, it creates a team of threads and becomes the master of the team. You can control the number of threads in a team by the use of an environment variable, a run-time library call, or both.

The PARALLEL directive takes an optional comma-separated list of clauses. Clauses include:

- IF: whether the statements in the parallel region are executed in parallel by a team of threads or serially by a single thread
- PRIVATE, FIRSTPRIVATE, SHARED, or REDUCTION: variable types
- DEFAULT: variable data scope attribute
- COPYIN: master thread common block values are copied to THREADPRIVATE copies of the common block

Changing the Number of Threads

Once created, the number of threads in the team remains constant for the duration of that parallel region. To explicitly change the number of threads used in the next parallel region, call
The following example illustrates the use of the VECTOR NONTEMPORAL directive:

      subroutine set(a, n)
      integer i, n
      real a(n)
!DEC$ VECTOR NONTEMPORAL
!DEC$ VECTOR ALIGNED
      do i = 1, n
        a(i) = 1
      enddo
      end

      program setit
      parameter (n = 1024*1204)
      real a(n)
      integer i
      do i = 1, n
        a(i) = 0
      enddo
      call set(a, n)
      do i = 1, n
        if (a(i) .ne. 1) then
          print *, 'failed nontemp.f', a(i), i
          stop
        endif
      enddo
      print *, 'passed nontemp.f'
      end

For more details on these directives, see the Directive Enhanced Compilation section, General Directives, in the Intel(R) Fortran Language Reference.

Optimizations and Debugging

This topic describes the command-line options that you can use to debug your compilation and to display and check compilation errors. The options that enable you to get debug information while optimizing are as follows:

-O0
Disables optimizations. Enables the -fp option.

-g
Generates symbolic debugging information and line numbers in the object code for use by source-level debuggers. Turns off -O2 and makes -O0 the default, unless -O2 (or -O1 or -O3) is explicitly specified in the command line together with -g.

-debug keyword
Specifies settings that enhance debugging. To use this option, you must also specify the -g option. The only choice for keyword is
...CPU_TIME, SYSTEM_CLOCK, TIME, and DATE_AND_TIME. See Intrinsic Procedures in the Intel(R) Fortran Language Reference.

Compiler Optimizations

Compiler Optimizations Overview

Intel(R) Fortran Compiler optimizations enable you to enhance the performance of your application. Optimization options are described in the following sections:

- Optimizing the compilation process, including stack alignment and symbol visibility attribute options
- Optimizing different application types
- Floating-point arithmetic operations
- Optimizing applications for specific processors
- Interprocedural optimizations (IPO)
- Profile-guided optimizations
- High-level language optimizations

In addition to the optimizations invoked by the compiler command-line options, other performance-enhancing features such as directives, intrinsics, run-time library routines, and various utilities are provided. These features are discussed in the Optimization Support Features section.

Optimizing the Compilation Process

Optimizing the Compilation Process Overview

This section describes the Intel(R) Fortran Compiler options that optimize the compilation process. By default, the compiler converts source code directly to an executable file. Appropriate options enable you not only to control the process and obtain the desired output file produced by the compiler, but also to make the compilation itself more efficient.

A group of options monitors the outcome of Intel compiler-generated
The Intel Fortran Compiler uses the -rcd option to disable changing of the rounding mode for floating-point-to-integer conversions. The system default floating-point rounding mode is round-to-nearest. This means that values are rounded during floating-point calculations. However, the Fortran language requires floating-point values to be truncated when a conversion to an integer is involved. To do this, the compiler must change the rounding mode to truncation before each floating-point-to-integer conversion and change it back afterwards.

The -rcd option disables the change to truncation of the rounding mode for all floating-point calculations, including floating-point-to-integer conversions. Turning on this option can improve performance, but floating-point conversions to integer will not conform to Fortran semantics.

You can also use the -fp_port option to round floating-point results at assignments and casts. This may cause some speed impact, but it also makes sure that rounding to the user-declared precision at assignments is always done. The -mp1 option implies -fp_port.

Floating-point Arithmetic Precision for Itanium(R)-based Systems

The following Intel Fortran Compiler options enable you to control the compiler optimizations for floating-point computations on Itanium-based systems.

Contraction of FP Multiply and Add/Subtract Operations

-IPF_fma[-] enables or disables the contraction of floating-point multiply and add/subtract operations into a
...information into the object file, which allows a symbolic stack traceback to be produced if a run-time failure occurs.

Combining Optimization and Debugging

The -O0 option turns off all optimizations, so you can debug your program before any optimization is attempted. To get the debug information, use the -g option. The compiler lets you generate code to support symbolic debugging while one of the -O1, -O2, or -O3 optimization options is specified on the command line along with -g, which produces symbolic debug information in the object file.

Note that if you specify an -O1, -O2, or -O3 option with the -g option, some of the debugging information returned may be inaccurate as a side effect of optimization.

It is best to make your optimization and/or debugging choices explicit:

- If you need to debug your program excluding any optimization effect, use the -O0 option, which turns off all the optimizations.
- If you need to debug your program with optimization enabled, then you can specify the -O1, -O2, or -O3 option on the command line along with -g.

Note: The -g option slows down the program when no optimization level (-On) is specified. In this case, -g turns on -O0, which is what slows the program down. However, if, for example, both -O2 and -g are specified, the code should run very nearly at the same speed as if -g were not specified.

Refer to the table below for a summary of the effects of using the -g option with the optimization options.

These options: Produce these results:

-g
Debugging information produced; -O0 enabled (optimizations disabled); -fp enabled for IA-32-targeted compilations.

-g -O1
Debugging information produced; -O1 optimizations enabled.

-g -O2
Debugging information produced; -O2 optimizations enabled.

-g -O3 -fp
Debugging information produced; -O3 optimizations enabled; -fp enabled for IA-32-targeted compilations.

Debugging and Assembling

The assembly listing file is generated without debugging information, but if you produce an object file, it will contain debugging information. If you link the object file and then use the GDB debugger on it, you will get full symbolic representation.

Optimizer Report Generation

The Intel(R) Fortran Compiler provides options to generate and manage optimization reports:

- -opt_report generates an optimizations report and places it in the file specified in -opt_report_file filename. If -opt_report_file is not specified, -opt_report directs the report to stderr. The default is OFF: no reports are generated.
- -opt_report_file filename generates an optimizations report and directs it to the file specified in filename.
- -opt_report_level {min|med|max} specifies the detail level of the optimizations report. The min argument provides the minimal summary and
ackets. In syntax examples, a horizontal ellipsis (three dots) following an item indicates that the item preceding the ellipsis can be repeated. In code examples, a horizontal ellipsis means that not all of the statements are shown. An asterisk at the end of a word or name (for example, Linux* systems) indicates it is a third-party product trademark.

Programming for High Performance Overview

This section provides information on the following:

- Programming Guidelines. This section discusses programming guidelines that can enhance application performance and includes specific coding practices that use the Intel® architecture features.
- Analyzing and Timing Your Application. This section discusses how to use the Intel performance analysis tools and how to time program execution to collect information about problem areas.

Programming Guidelines

Setting Data Type and Alignment

Data alignment considerations apply to the following kinds of variables:

- Those that are dynamically allocated
- Those that are members of a data structure
- Those that are global or local variables
- Those that are parameters passed on the stack

For best performance, align data as follows:

- Align 8-bit data at any address.
- Align 16-bit data to be contained within an aligned four-byte word.
- Align 32-bit data so that its base address is a multiple of four.
- Align 64-bit data so that its base address is a multiple of eight.
acter variable, each element contains five bytes of padding (16 is an exact multiple of 8). However, if the structure contains one 4-byte floating-point number, one 4-byte integer, followed by a 3-byte character variable, each element would contain one byte of padding (12 is an exact multiple of 4).

Checking for Inefficient Unaligned Data

During compilation, the Intel Fortran compiler naturally aligns as much data as possible. Exceptions that can result in unaligned data are described above. Because unaligned data can slow run-time performance, it is worthwhile to:

- Double-check data declarations within common blocks, derived-type structures, or record structures to ensure all data items are naturally aligned (see the data declaration rules in the subsection below). Using modules to contain data declarations can ensure consistent alignment and use of such data.
- Avoid the EQUIVALENCE statement, or use it in a way that cannot cause unaligned data or data spanning natural boundaries.
- Ensure that arguments passed from outside the program unit are naturally aligned.
- Check that the size of array elements containing at least one derived-type structure or record structure causes array elements to start on aligned boundaries (see the previous subsection).

There are two ways unaligned data might be reported:

- During compilation, warning messages are issued for any data items that are known to be unaligned, unless you specify the -warn noalig
flushing
  denormal ........................ 66, 70
  zero denormal ................... 66, 70
FMA ................................. 71
-fnsplit compiler option ............ 97
FOR_SET_FPE intrinsic,
  FOR_M_ABRUPT_UND .................. 77
fork/join ........................... 191
format
  auto-parallelization directives ... 145
  big endian ........................ 45
  files ............................. 25
  floating-point applications ....... 34
  OpenMP directives ................. 158
formatted files, unformatted files .. 25
FORT_BUFFERED run-time environment
  variable .......................... 25
Fortran ................... 150, 156, 193
FORTRAN 77 dummy aliases ............ 41
FORTRAN 77 .................. 13, 20, 41
Fortran standard .................... 9
Fortran uninitialized ............... 57
Fortran
  USE statement ..................... 186
  INCLUDE statement ................. 186
Fourier-Motzkin elimination ......... 134
FP
  operations evaluation ............. 70
  options ........................... 66
-fp compiler option ................. 210
frames, browsing .................... 103
-ftz compiler option ................ 34
FTZ flag, Itanium®-based systems .... 66
full name ........................... 213
function, best performance .......... 77
function splitting .................. 97
  disabling ......................... 97
fun
add_
        .globl _padd__6__par_loop0
_padd__6__par_loop0:
# parameter 1: 8 + %ebp
# parameter 2: 12 + %ebp
# parameter 3: 16 + %ebp
# parameter 4: 20 + %ebp
# parameter 5: 24 + %ebp
# parameter 6: 28 + %ebp
.B1.30:                       # Preds .B1.0
..LN16:
        pushl %ebp
        movl  %esp, %ebp
        subl  $208, %esp
        movl  %ebx, -4(%ebp)
..LN17:
        movl  8(%ebp), %eax
        movl  (%eax), %eax
        movl  %eax, -8(%ebp)
        movl  28(%ebp), %eax
..LN18:
        movl  (%eax), %eax
        movl  %eax, -80(%ebp)
        movl  $1, -76(%ebp)
        movl  -80(%ebp), %eax
        testl %eax, %eax
        jg    .B1.20          # Prob 50%
.B1.31:                       # Preds .B1.41 .B1.39 .B1.38
..LN19:
        movl  -4(%ebp), %ebx
        leave
        ret
        .align 4, 0x90
# mark_end

Debugging Shared Variables

When a variable appears in a PRIVATE, FIRSTPRIVATE, LASTPRIVATE, or REDUCTION clause on some block, the variable is made private to the parallel region by redeclaring it in the block. SHARED data, however, is not declared in the threaded code. Instead, it gets its declaration at the routine level. At the machine-code level, these shared variables become incoming subroutine call arguments to the threaded entry points (such as _padd__6__par_loop0). In Example 2, the entry point _padd__6__par_loop0 has six incoming parameters. The corresponding OpenM
adding bytes as needed.

- The -align recNbyte option requests that fields of records and components of derived types be aligned on either the size-byte boundary specified or the boundary that will naturally align them, whichever is smaller. This option does not affect whether common blocks are naturally aligned or packed.
- The -align sequence option controls alignment of derived-type components declared with the SEQUENCE statement (sequenced components). The -align nosequence option means that sequenced components are packed, regardless of any other alignment rules. Note that -align none implies -align nosequence. The -align sequence option means that sequenced components obey whatever alignment rules are currently in use. Consequently, since -align record is a default value, -align sequence alone on the command line will cause the components of these derived types to be naturally aligned.

The default behavior is that multiple data items in derived-type structures and record structures will be naturally aligned; data items in common blocks will not (-align records with -align nocommons). In derived-type structures, using the SEQUENCE statement prevents -align records from adding needed padding bytes to naturally align data items.

Symbol Visibility Attribute Options

Applications that do not require symbol preemption or position-independent code can obtain a performance benefit by taking advantage of the generic ABI visibility attributes.
all variables in static allocation, except local variables within a recursive routine. If a routine is invoked more than once, this option forces the local variables to retain their values between the invocations. The -save option ensures that the final results on the exit of the routine are saved in memory and can be reused at the next occurrence of that routine. This may cause some performance degradation, as it causes more frequent rounding of the results. When the compiler optimizes the code, the results are stored in registers. -save is the same as -noauto.

The -zero option initializes to zero all local scalar variables of intrinsic type INTEGER, REAL, COMPLEX, or LOGICAL that are saved and not yet initialized. Use it in conjunction with -save. The default is -nozero.

Summary

There are three choices for allocating variables: -save, -auto, and -auto_scalar. Only one of these three can be specified. The correlation among them is as follows:

- -save disables -auto, sets -noautomatic, and allocates all variables not marked AUTOMATIC to static memory.
- -auto disables -save, sets -automatic, and allocates all variables (scalars and arrays of all types) not marked SAVE to the stack.
- -auto_scalar:
  - Makes local scalars of intrinsic types INTEGER, REAL, COMPLEX, and LOGICAL automatic.
  - This is the default; there is no -noauto_scalar. However, -recursive or -openmp disables -auto_scalar and makes au
Fortran Language Reference.

The VECTOR ALIGNED and UNALIGNED Directives

Like VECTOR ALWAYS, these directives also override the efficiency heuristics. The difference is that the qualifiers UNALIGNED and ALIGNED instruct the compiler to use, respectively, unaligned and aligned data movement instructions for all array references. This disables all the advanced alignment optimizations of the compiler, such as determining alignment properties from the program context or using dynamic loop peeling to make references aligned.

Note: The directives VECTOR ALWAYS, UNALIGNED, and ALIGNED should be used with care. Overriding the efficiency heuristics of the compiler should only be done if the programmer is absolutely sure the vectorization will improve performance. Furthermore, instructing the compiler to implement all array references with aligned data movement instructions will cause a run-time exception in case some of the access patterns are actually unaligned. For more details on these directives, see Directive Enhanced Compilation, section General Directives, in the Intel® Fortran Language Reference.

The VECTOR NONTEMPORAL Directive

The VECTOR NONTEMPORAL directive results in streaming stores on Pentium® 4 based systems. A floating-point type loop, together with the generated assembly, is shown in the example below. For large n, significant performance improvements result on Pentium 4 systems over a non-streaming implementation.
arallel. In the OpenMP Fortran API, a parallel construct is defined by placing OpenMP directives PARALLEL at the beginning and END PARALLEL at the end of the code segment. Code segments thus bounded can be executed in parallel.

A structured block of code is a collection of one or more executable statements with a single point of entry at the top and a single point of exit at the bottom.

The Intel Fortran Compiler supports worksharing and synchronization constructs. Each of these constructs consists of one or two specific OpenMP directives and sometimes the enclosed or following structured block of code. For complete definitions of constructs, see the OpenMP Fortran version 2.0 specifications.

At the end of the parallel region, threads wait until all team members have arrived. The team is logically disbanded (but may be reused in the next parallel region), and the master thread continues serial execution until it encounters the next parallel region.

Worksharing Construct

A worksharing construct divides the execution of the enclosed code region among the members of the team created on entering the enclosing parallel region. When the master thread enters a parallel region, a team of threads is formed. Starting from the beginning of the parallel region, code is replicated (executed by all team members) until a worksharing construct is encountered. A worksharing construct divides the execution of the
arallel processing. No other effort by the programmer is needed.

The following example illustrates how a loop's iteration space can be divided so that it can be executed concurrently on two threads.

Original Serial Code

do i=1,100
  a(i) = a(i) + b(i) * c(i)
enddo

Transformed Parallel Code

! Thread 1
do i=1,50
  a(i) = a(i) + b(i) * c(i)
enddo

! Thread 2
do i=51,100
  a(i) = a(i) + b(i) * c(i)
enddo

Programming with Auto-parallelization

The auto-parallelization feature implements some concepts of OpenMP, such as the worksharing construct with the PARALLEL DO directive. See Programming with OpenMP for the worksharing construct. This section provides specifics of auto-parallelization.

Guidelines for Effective Auto-parallelization Usage

A loop is parallelizable if:

- The loop is countable at compile time: this means that an expression representing how many times the loop will execute (also called the loop trip count) can be generated just before entering the loop.
- There are no FLOW (READ after WRITE), OUTPUT (WRITE after WRITE), or ANTI (WRITE after READ) loop-carried data dependences. A loop-carried data dependence occurs when the same memory location is referenced in different iterations of the loop. At the compiler's discretion, a loop may be parallelized if any assumed inhibiting loop-carried dependences can be resolved by run-time dependency testing.

The co
at which profile dumping occurs, and is measured in milliseconds. For example, if interval is set to 5000, then a profile dump and reset will occur approximately every 5 seconds. The interval is approximate because the time check controlling the dump and reset is only performed upon entry to any instrumented function in your application.

Note:
1. Setting interval to zero or a negative number will disable interval profile dumping.
2. Setting a very small value for interval may cause the instrumented application to spend nearly all of its time dumping profile information. Be sure to set interval to a large enough value so that the application can perform actual work and substantial profile information is collected.

Recommended usage: This function may be called at the start of a non-terminating user application to initiate interval profile dumping. Note that an alternative method of initiating interval profile dumping is setting the environment variable PROF_DUMP_INTERVAL to the desired interval value prior to starting the application. The intention of interval profile dumping is to allow a non-terminating application to be profiled with minimal changes to the application source code.

High-level Language Optimizations (HLO)

HLO Overview

High-level optimizations exploit the properties of source code constructs (for example, loops and arrays) in applications developed in high-level programming languages such as Fortran and C
atural storage order is the best order possible. If you must use an unnatural storage order, in certain cases it might be more efficient to transfer the data to memory and reorder the data before performing the I/O operation.

Use Memory for Intermediate Results

Performance can improve by storing intermediate results in memory rather than storing them in a file on a peripheral device. One situation that may not benefit from using intermediate storage is when there is a disproportionately large amount of data in relation to physical memory on your system. Excessive page faults can dramatically impede virtual memory performance.

If you are primarily concerned with the CPU performance of the system, consider using a memory file system (mfs) virtual disk to hold any files your code reads or writes.

Enable Implied DO Loop Collapsing

DO loop collapsing reduces a major overhead in I/O processing. Normally, each element in an I/O list generates a separate call to the Intel Fortran run-time library (RTL). The processing overhead of these calls can be most significant in implied DO loops. Intel Fortran reduces the number of calls in implied DO loops by replacing up to seven nested implied DO loops with a single call to an optimized run-time library I/O routine. The routine can transmit many I/O elements at once. Loop collapsing can occur in formatted and unformatted I/O, but only if certain conditions are met.
ble that follows.

Memory Allocation

The Intel® Fortran Compiler implements a group of memory allocation routines as an extension to the OpenMP run-time library to enable threads to allocate memory from a heap local to each thread. These routines are kmp_malloc, kmp_calloc, and kmp_realloc. The memory allocated by these routines must also be freed by the kmp_free routine. While it is legal for the memory to be allocated by one thread and freed (via kmp_free) by a different thread, this mode of operation has a slight performance penalty. See the definitions of these routines in the table that follows.

Stack Size Routines

function kmp_get_stacksize_s()
integer (kind=kmp_size_t_kind) kmp_get_stacksize_s
    Returns the number of bytes that will be allocated for each parallel thread to use as its private stack. This value can be changed via the kmp_set_stacksize_s routine, prior to the first parallel region, or via the KMP_STACKSIZE environment variable.

function kmp_get_stacksize()
integer kmp_get_stacksize
    This routine is provided for backwards compatibility only; use the kmp_get_stacksize_s routine for compatibility across different families of Intel processors.

subroutine kmp_set_stacksize_s(size)
integer (kind=kmp_size_t_kind) size
    Sets to size the number of bytes that will be allocated for each parallel thread to use as its private stack. This value can also
brary and use it to optimize the program. For example, it is possible to inline functions defined in the libraries into the user's source code.

Creating a Library from IPO Objects

Normally, libraries are created using a library manager such as ar. Given a list of objects, the library manager will insert the objects into a named library to be used in subsequent link steps:

xiar cru user.a a.o b.o

The above command creates a library named user.a that contains the a.o and b.o objects.

If, however, the objects have been created using -ipo -c, then the objects will not contain a valid object but only the intermediate representation (IR) for that object file. For example:

ifort -ipo -c a.f b.f

will produce a.o and b.o that contain only IR to be used in a link-time compilation. The library manager will not allow these to be inserted in a library. In this case, you must use the Intel library driver, xild. This program will invoke the compiler on the IR saved in the object file and generate a valid object that can be inserted in a library:

xild -lib cru user.a a.o b.o

See Creating a Multifile IPO Executable Using xild.

Using -ip with -Qoption Specifiers

You can adjust the Intel® Fortran Compiler's optimization for a particular application by experimenting with memory and interprocedural optimizations. Enter the -Qoption option with the applicable ke
bsection.

Auto-parallelization Directives

Auto-parallelization uses two specific directives: !DEC$ PARALLEL and !DEC$ NOPARALLEL. The format of an Intel Fortran auto-parallelization compiler directive is:

<prefix> <directive>

where both the prefix and the directive are required. For fixed-form source input, the prefix is !DEC$ or CDEC$. For free-form source input, the prefix is !DEC$ only. The prefix is followed by the directive name, for example, !DEC$ PARALLEL. Since auto-parallelization directives begin with an exclamation point, the directives take the form of comments if you omit the -parallel option.

Examples

The !DEC$ PARALLEL directive instructs the compiler to ignore dependences which it assumes may exist and which would prevent correct parallelization in the immediately following loop. However, if dependences are proven, they are not ignored. The !DEC$ NOPARALLEL directive disables auto-parallelization for the immediately following loop.

program main
parameter (n=100)
integer x(n), a(n)
!DEC$ NOPARALLEL
do i=1,n
  x(i) = i
enddo
!DEC$ PARALLEL
do i=1,n
  a( x(i) ) = i
enddo
end

Auto-parallelization Environment Variables

Variable: OMP_NUM_THREADS
Description: Controls the number of threads used.
Default: Number of processors currently installed in the system while generating the executable.
cal vectorization issues and resolutions.

The Intel Fortran compiler supports a variety of directives that can help the compiler to generate effective vector instructions. See compiler directives supporting vectorization.

Vectorizer Options

Vectorization is an IA-32-specific feature and can be summarized by the command-line options described in the following tables. Vectorization depends upon the compiler's ability to disambiguate memory references. Certain options may enable the compiler to do better vectorization. These options can enable other optimizations in addition to vectorization. When -x{K|W|N|B|P} or -ax{K|W|N|B|P} is used and -O2 (which is ON by default) is also in effect, the vectorizer is enabled. The -x{K|W|N|B|P} or -ax{K|W|N|B|P} options enable the vectorizer with the -O1 and -O3 options also.

-x{K|W|N|B|P}
    Generates specialized code to run exclusively on the processors supporting the extensions indicated by {K|W|N|B|P}. See Processor-specific Optimization (IA-32 only) for details.

-ax{K|W|N|B|P}
    Generates, in a single binary, code specialized to the extensions specified by {K|W|N|B|P} and also generic IA-32 code. The generic code is usually slower. See Automatic Processor-specific Optimization (IA-32 only) for details.

-vec_report{0|1|2|3|4|5}
    Controls the diagnostic messages from the vectorizer; see the subsection that follows the table. Default: -vec_report1.
ce of the second command, you could use the linker ld directly to produce the instrumented program. If you do this, make sure you link with the libirc.a library.

2. Instrumented Execution

Run your instrumented program with a representative set of data to create a dynamic information file:

prompt> a1

The resulting dynamic information file has a unique name and a .dyn suffix, every time you run a1. The instrumented file helps predict how the program runs with a particular set of data. You can run the program more than once with different input data.

3. Feedback Compilation

Compile and link the source files with -prof_use to use the dynamic information to optimize your program according to its profile:

ifort -prof_use -prof_dir /usr/profdata -ipo a1.f a2.f a3.f

Besides the optimization, the compiler produces a pgopti.dpi file. You typically specify the default optimizations (-O2) for phase 1, and specify more advanced optimizations (-ip or -ipo) for phase 3. This example used -O2 in phase 1 and -ipo in phase 3.

Note: The compiler ignores the -ip or -ipo options with -prof_gen. See Basic PGO Options.

Merging the .dyn Files

To merge the .dyn files, use the profmerge utility. The compiler executes profmerge automatically during the feedback compilation phase when you specify -prof_use. The command-line usage for profmerge is as follows:

profmerge -nologo -prof_dir dirname

w
143. ce position as one entity In such cases it is necessary to assume that all blocks generated for one source position are covered when at least one 109 Intel Fortran Compiler for Linux Systems User s Guide Vol Il of the blocks is covered This assumption can be configured with the nopartial option When this option is specified decision coverage is disabled and the related statistics are adjusted accordingly The code lines 11 and 12 indicate that the PRINT statement in line 12 was covered However only one of the conditions in line 11 was ever true With the nopartial option the tool treats the partially covered code like the code on line 11 as covered Differential Coverage Using the code coverage tool you can compare the profiles of the application s two runs a reference run and a new run identifying the code that is covered by the new run but not covered by the reference run This feature can be used to find the portion of the application s code that is not covered by the application s tests but is executed when the application is run by a customer It can also be used to find the incremental coverage impact of newly added tests to an application s test space The dynamic profile information of the reference run for differential coverage is specified by the ref option such as in the following command codecov prj Project Name dpi customer dpi ref appTests dpi The coverage statistics of a differential coverage run sho
144. ck You cannot use a thread private common block or its constituent variables in any clause other than the COPYIN clause In the following example common blocks BLK1 and FIELDS are specified as thread private COMMON BLK1 SCRATCH COMMON FIELDS XFIELD YFIELD ZFIELD SOMP THREADPRIVATE BLK1 FIELDS OpenMP Clause Descriptions Controlling Data Scope Data Scope Attribute Clauses Overview 174 Parallel Programming with Intel Fortran You can use several directive clauses to control the data scope attributes of variables for the duration of the construct in which you specify them If you do not specify a data scope attribute clause on a directive the default is SHARED for those variables affected by the directive Each of the data scope attribute clauses accepts a list which is a comma separated list of named variables or named common blocks that are accessible in the scoping unit When you specify named common blocks they must appear between slashes name Not all of the clauses are allowed on all directives but the directives to which each clause applies are listed in the clause descriptions The data scope attribute clauses are COPYIN DEFAULT PRIVATE FIRSTPRIVATE LASTPRIVATE REDUCTION SHARED COPYIN Clause Use the COPYIN clause on the PARALLEL PARALLEL DO and PARALLEL SECTIONS directives to copy the data in the master thread common block to th
145. ck to become undefined subroutine omp set nest lock lock integer kind omp nest lock kind lock Forces the executing thread to wait until the nested lock associated with lock is available The thread is granted ownership of the nested lock when it becomes available subroutine omp unset nest lock lock integer kind omp nest lock kind lock 188 Releases the executing thread from ownership of the nested lock associated with lock if the nesting count is zero Behavior is undefined if the executing thread does not own the nested lock associated with lock Parallel Programming with Intel Fortran integer omp test nest lock lock Attempts to set the integer kind omp nest lock kind lock nested lock associated with lock If successful returns the nesting count otherwise returns zero Timing Routines double precision function Returns a double omp get wtime precision value equal to the elapsed wallclock time in seconds relative to an arbitrary reference time The reference time does not change during program execution double precision function Returns a double omp get wtick precision value equal to the number of seconds between successive clock ticks Intel Extension Routines The Intel amp Fortran Compiler implements the following group of routines as an extension to the OpenMP run time library getting and setting stack size for
construct to which NOWAIT is also applied, the shared variable remains undefined until a barrier synchronization has been performed. This ensures that all of the threads have completed the REDUCTION clause.

The REDUCTION clause is intended to be used on a region or worksharing construct in which the reduction variable is used only in reduction statements having one of the following forms:

x = x operator expr
x = expr operator x (except for subtraction)
x = intrinsic(x, expr)
x = intrinsic(expr, x)

Some reductions can be expressed in other forms. For instance, a MAX reduction might be expressed as follows:

IF (x .LT. expr) x = expr

Alternatively, the reduction might be hidden inside a subroutine call. Be careful that the operator specified in the REDUCTION clause matches the reduction operation.

Any number of reduction clauses can be specified on the directive, but a variable can appear only once in a REDUCTION clause for that directive, as shown in the following example:

!$OMP DO REDUCTION(+: A, Y), REDUCTION(.OR.: AM)

The following example shows how to use the REDUCTION clause:

!$OMP PARALLEL DO DEFAULT(PRIVATE), SHARED(A, B), REDUCTION(+: A, B)
DO I=1,N
  CALL WORK(ALOCAL, BLOCAL)
  A = A + ALOCAL
  B = B + BLOCAL
END DO
!$OMP END PARALLEL DO

SHARED Clause

Use the SHARED c
147. counters 97 POL co czas hcec dus 79 93 advanced PGO 98 IVDEP eoi edo 206 ATOMIC ge eege AE 170 LASTPRIVATE 5 532 25 2 176 auto parallelization 144 MASTER N ee ee 170 BARRIE RS ES ess 170 MEMOTFY se oet be Ne ae 25 COP YIN se EE eds 175 noniterative worksharing SECTIONS ere 166 CRITIC AL 170 non SSE instructions 34 DEFAULT SS a eto 175 NONTEMPORAL 206 ebp register sss 210 optimal record 25 EDE es ee SEER REDE ES 13 ORDERED vykes 170 efficient data types 30 orphaned directives 156 EQUIVALENCE statements 30 PARALLEL DO xs 169 FIRSTERIVATE EE ie 176 PARALLEL SECTIONS 169 EEUSFI sea tein aoa 170 e Es 176 formatted files 25 profile guided optimization 100 GDB educa 210 profmerge utility 102 GOTO EE 166 REAL data type 30 GP relative 57 REAL variables 34 implied DO loops 25 RECORD ES DS Oen an 13 260 REDUG TON 178 threshold control 147 SCHEDULE rtr 166 visibility attributes 57 el ER EE 166 variables SEQUENCE recente 13 AUTOMATIC 48 SHARED se ee ctc tus 180 automatic allocation 51 SINGLE eet dre
148. ction partial redundancy elimination strength reduction induction variable simplification variable renaming exception handling optimizations tail recursions peephole optimizations structure assignment lowering and optimizations and dead store elimination 63 Intel Fortran Compiler for Linux Systems User s Guide Vol Il 03 Enables 02 optimizations and in addition enables more aggressive optimizations such as prefetching scalar replacement and loop and memory access transformations Enables optimizations for maximum speed but does not guarantee higher performance unless loop and memory access transformation take place The 03 optimizations may slow down code in some cases compared to 02 optimizations Recommended for applications that have loops that heavily use floating point calculations and process large data sets On IA 32 systems In conjunction with ax K W N B P or x K W N B P options this option causes the compiler to perform more aggressive data dependency analysis than for 02 This may result in longer compilation times On Itanium based systems enables optimizations for technical computing applications loop intensive code loop optimizations and data prefetch fast This option is a single simple method to enable a collection of optimizations for run time performance Sets the following options that can improve run time performance 03 maximum speed and high level op
149. ction is preempted the new version of this function is executed rather than the old version Preemption can be used to replace an erroneous or inferior version of a function with a correct or improved version 91 Intel Fortran Compiler for Linux Systems User s Guide Vol Il The compiler assumes that when ip is on any externally visible function might be preempted and therefore cannot be inlined Currently this means that all Fortran subprograms except for internal procedures are not inlinable when ip is on However if you use ipo and ipo obj on a file by file basis the functions can be inlined See Compilation with Real Object Files Controlling Inline Expansion of User Functions The compiler enables you to control the amount of inline function expansion with the options shown in the following summary Option Effect ip no inlining This option is only useful if ip or ipo is also specified In this case ip no inlining disables inlining that would result from the ip interprocedural optimizations but has no effect on other interprocedural optimizations inline debug info Preserve the source position of inlined code instead of assigning the call site source position to inlined code ip no pinlining Disables partial inlining can be used if ip or ipo is also specified Inline Expansion of Library Functions By default the compiler automatically expands inlines a number of stan
150. ction routine 189 function subroutine 51 234 G g compiler option 210 GCC Id 81 ELE KEE 134 GDB NEE SUR 210 general purpose registers 210 generating instrumented code 97 MON SS E edv 34 processor specific function version NP nM MEET UTR 76 profile optimized executable 97 le y cu St E C UN 213 vectorization reports 131 gigabytes octets 183 189 global symbols 57 GNU see also GCC 195 197 GOT global offset table 57 GP relative ds sa AR ie tac 57 GUIDED schedule type 180 guidelines advanced PGO 98 auto parallelization 144 GOdINE iie EE desto reacts 34 vectorization sse 133 H help od ULTIO bies sd de ed ae 45 HIDDEN visibility attribute 57 high performance programming sss 13 high level Ooptmtzer ee ee ee ee ee 213 parallelization 144 HLO hlo prefetchiss uie t 213 hlo_unroll o as ro redo 213 OVErVI W see EE Ee eene nenne 121 prefetching sese EE Ee EE 125 HELENE se Es EE Ee 123 ATME les ses Ed se edt 103 Hyper Threading technology 34 127 149 Darsing esse beseer toot dee Ee 25 performance see 25 IA 32 floating point arithmetic 69 Hyper Threading Technology enabled
d code without interfering with the way your program runs. These options control some computation aspects, such as allocating the stack memory, setting or modifying variable settings, and defining the use of some registers.

The options in this section provide you with the following capabilities of efficient compilation:

- Automatic allocation of variables and stacks
- Aligning data
- Symbol visibility attribute options

Efficient Compilation

Understandably, efficient compilation contributes to performance improvement. Before you analyze your program for performance improvement, and improve program performance, you should think of efficient compilation itself. Based on the analysis of your application, you can decide which Intel Fortran Compiler optimizations and command-line options can improve the run-time performance of your application.

Efficient Compilation Techniques

The efficient compilation techniques can be used during the earlier and later stages of program development. During the earlier stages of program development, you can use incremental compilation with minimal optimization. For example:

ifort -c -g -O0 sub2.f90    (generates object file of sub2)
ifort -c -g -O0 sub3.f90    (generates object file of sub3)
ifort -o main -g -O0 main.f90 sub2.o sub3.o

The above commands turn off all compiler default optimizations (for example, -O2) with -O0. You can u
d for running on Itanium 2 processor.

The -tpp<n> option always generates code that is backwards compatible with Intel® processors of the same family. This means that code generated with -tpp7 will run correctly on Pentium Pro or Pentium III processors, possibly just not quite as fast as if the code had been compiled with -tpp6. Similarly, code generated with -tpp2 will run correctly on Itanium processor, but possibly not quite as fast as if it had been generated with -tpp1.

Processors for IA-32 Systems

The -tpp5, -tpp6, and -tpp7 options optimize your application's performance for a specific Intel IA-32 processor, as listed below. The resulting binaries will also run correctly on any of the processors mentioned.

-tpp5: Optimizes your application for the Intel Pentium and Pentium with MMX(TM) technology processors.
-tpp6: Optimizes your application for the Intel Pentium Pro, Pentium II, and Pentium III processors.
-tpp7 (default): Optimizes your application for the Intel Pentium 4 processors, Intel Xeon(TM) processors, Intel Pentium M processors, and Intel Pentium 4 processor with Streaming SIMD Extensions 3 (SSE3) instruction support.

Example

The invocations listed below each result in a compiled binary of the source program prog.f optimized for Pentium 4 and Intel Xeon processors by default. The same binary will also run on Pentium, Pentium Pro, Pentium II
d in the calculation, changing the nesting order of the do loops changes the results. For more information on arrays and their data declaration statements, see the Intel Fortran Language Reference.

Passing Array Arguments Efficiently

In Fortran, there are two general types of array arguments:

- Explicit-shape arrays (used with Fortran 77). These arrays have a fixed rank and extent that is known at compile time. Other dummy argument (receiving) arrays that are not deferred-shape, such as assumed-size arrays, can be grouped with explicit-shape array arguments.
- Deferred-shape arrays (introduced with Fortran 95/90). Types of deferred-shape arrays include array pointers and allocatable arrays. Assumed-shape array arguments generally follow the rules about passing deferred-shape array arguments.

When passing arrays as arguments, either the starting (base) address of the array or the address of an array descriptor is passed:

- When using explicit-shape (or assumed-size) arrays to receive an array, the starting address of the array is passed.
- When using deferred-shape (or assumed-shape) arrays to receive an array, the address of the array descriptor is passed (the compiler creates the array descriptor).

Passing an assumed-shape array or array pointer to an explicit-shape array can slow run-time performance. This is because the compiler needs to create an array temporary for t
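As a hedged illustration of the two argument-passing styles (the subroutine and variable names below are invented for this sketch, not taken from the guide), an explicit-shape dummy receives just a base address, while an assumed-shape dummy receives an array descriptor and therefore needs an explicit interface:

```fortran
! Hypothetical sketch: explicit-shape vs. assumed-shape dummy arguments
subroutine sum_explicit(a, n, s)   ! callee receives the base address only
  integer :: n
  real    :: a(n), s               ! explicit-shape dummy
  s = sum(a)
end subroutine sum_explicit

subroutine sum_assumed(a, s)       ! callee receives an array descriptor
  real :: a(:), s                  ! assumed-shape dummy; requires an explicit interface
  s = sum(a)
end subroutine sum_assumed
```

Passing a contiguous explicit-shape actual argument to sum_explicit avoids the descriptor handling and any array temporary.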
d parallel regions.

PROGRAM MAIN            ! Begin serial execution
...                     ! Only the master thread executes
!$OMP PARALLEL          ! Begin a Parallel construct, form a team
...                     ! This is Replicated Code, where each team member executes the same code
!$OMP SECTIONS          ! Begin a Worksharing construct
!$OMP SECTION           ! One unit of work
...
!$OMP SECTION           ! Another unit of work
...
!$OMP END SECTIONS      ! Wait until both units of work complete
...                     ! More Replicated Code
!$OMP DO                ! Begin a Worksharing construct; each iteration is a unit of work
DO
...                     ! Work is distributed among the team
END DO
!$OMP END DO NOWAIT     ! End of Worksharing construct; NOWAIT is specified, so threads need not wait until all work is completed before proceeding
...                     ! More Replicated Code
!$OMP END PARALLEL      ! End of PARALLEL construct; disband team and continue with serial execution
...                     ! Possibly more PARALLEL constructs
END PROGRAM MAIN        ! End serial execution

Compiling with OpenMP, Directive Format, and Diagnostics

To run the Intel Fortran Compiler in OpenMP mode, you need to invoke the Intel compiler with the -openmp option:

ifort -openmp input_file(s)

Before you run the multithreaded code, you can set the number of desired threads in the OpenMP environment variable OMP_NUM_THREADS. See the OpenMP Environment Variables section for further information. The Intel Extensions
d to ensure the consistency of shared data and to coordinate parallel execution among threads. The synchronization constructs are:

- ATOMIC directive
- BARRIER directive
- CRITICAL directive
- FLUSH directive
- MASTER directive
- ORDERED directive

ATOMIC Directive

Use the ATOMIC directive to ensure that a specific memory location is updated atomically, instead of exposing the location to the possibility of multiple, simultaneously writing threads. This directive applies only to the immediately following statement, which must have one of the following forms:

x = x operator expr
x = expr operator x
x = intrinsic(x, expr)
x = intrinsic(expr, x)

Parallel Programming with Intel Fortran

In the preceding statements: x is a scalar variable of intrinsic type; expr is a scalar expression that does not reference x; intrinsic is either MAX, MIN, IAND, IOR, or IEOR; operator is either +, -, *, /, .AND., .OR., .EQV., or .NEQV.

This directive permits optimization beyond that of a critical section around the assignment. An implementation can replace all ATOMIC directives by enclosing the statement in a critical section. All of these critical sections must use the same unique name. Only the load and store of x are atomic; the evaluation of expr is not atomic. To avoid race conditions, all updates of the location in parallel must be protected by using the ATOMIC directive, except those that are known to be free of race conditions. The fu
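A minimal sketch of the directive in use (the variable names are invented for illustration): each addition to the shared accumulator is protected, so concurrent updates do not race:

```fortran
! Hypothetical example: atomically accumulating into a shared sum
      SUM = 0.0
!$OMP PARALLEL DO SHARED(SUM, A)
      DO I = 1, N
!$OMP ATOMIC
        SUM = SUM + A(I)      ! only the load and store of SUM are atomic
      END DO
!$OMP END PARALLEL DO
```

For a single scalar update like this, ATOMIC is typically cheaper than a named CRITICAL section around the same assignment.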
dard and math library functions at the point of the call to that function, which usually results in faster computation. However, the inlined library functions do not set the errno variable when being expanded inline. In code that relies upon the setting of the errno variable, you should use the -nolib_inline option.

Also, if one of your functions has the same name as one of the compiler-supplied library functions, then, when this function is called, the compiler assumes that the call is to the library function and replaces the call with an inlined version of the library function. So, if the program defines a function with the same name as one of the known library routines, you must use the -nolib_inline option to ensure that the user-supplied function is used. -nolib_inline disables inlining of all intrinsics.

Compiler Optimizations

Note: Automatic inline expansion of library functions is not related to the inline expansion that the compiler does during interprocedural optimizations. For example, the following command compiles the program sum without expanding the math library functions:

ifort -ip -nolib_inline sum.f

Profile-guided Optimizations

Profile-guided Optimizations Overview

Profile-guided optimizations (PGO) tell the compiler which areas of an application are most frequently executed. By knowing these areas, the compiler is able to be more selective and specific in optimizing the application. For example, the use
declarations in a module, so the declarations are consistent. If the common block is not needed for compatibility (such as file storage or Fortran 77 use), you can place the data declarations in a module without using a common block.

Arranging Data Items in Derived-Type Data

Like common blocks, derived-type structures can contain multiple data items (members). Data item components within derived-type structures are naturally aligned on up to 64-bit boundaries, with certain exceptions related to the use of the SEQUENCE statement and Fortran options. See Options Controlling Alignment for information about these exceptions.

Intel Fortran stores a derived data type as a linear sequence of values, as follows:

- If you specify the SEQUENCE statement, the first data item is in the first storage location and the last data item is in the last storage location. The data items appear in the order in which they are declared. The Fortran options have no effect on unaligned data, so data declarations must be carefully specified to naturally align data. The -align sequence option specifically aligns data items in a SEQUENCE derived type on natural boundaries.
- If you omit the SEQUENCE statement, Intel Fortran adds the padding bytes needed to naturally align data item components, unless you specify the -align norecords option.

Consider the following declaration of array CATALOG_SPRING of derived type PART_DT:

module data_defs
  type part_dt
description to the customized linker script. If you add these lines to your linker script, it is desirable to add additional entries to account for future development. This is harmless, since the syntax makes these contributions optional. If you choose to not use the linker script, your application will still build, but the layout order will be random. This may have an adverse effect on application performance, particularly for large applications.

Compilation with Real Object Files

In certain situations you might need to generate real object files with -ipo. To force the compiler to produce real object files instead of "mock" ones with IPO, you must specify -ipo_obj in addition to -ipo.

Use of -ipo_obj is necessary under the following conditions:

- The objects produced by the compilation phase of -ipo will be placed in a static library without the use of xiar. The compiler does not support multifile IPO for static libraries, so all static libraries are passed to the linker. Linking with a static library that contains mock object files will result in linkage errors because the objects do not contain real code or data. Specifying -ipo_obj causes the compiler to generate object files that can be used in static libraries.
- Alternatively, if you create the static library using xiar, then the resulting static library will work as a normal library.
- The objects produced by the compilation phase o
e thread private copies of the common block. The copy occurs at the beginning of the parallel region. The COPYIN clause applies only to common blocks that have been declared THREADPRIVATE. You do not have to specify a whole common block to be copied in; you can specify named variables that appear in the THREADPRIVATE common block. In the following example, the common blocks BLK1 and FIELDS are specified as thread private, but only one of the variables in common block FIELDS is specified to be copied in:

      COMMON /BLK1/ SCRATCH
      COMMON /FIELDS/ XFIELD, YFIELD, ZFIELD
!$OMP THREADPRIVATE(/BLK1/, /FIELDS/)
!$OMP PARALLEL DEFAULT(PRIVATE) COPYIN(/BLK1/, ZFIELD)

DEFAULT Clause

Use the DEFAULT clause on the PARALLEL, PARALLEL DO, and PARALLEL SECTIONS directives to specify a default data scope attribute for all variables within the lexical extent of a parallel region. Variables in THREADPRIVATE common blocks are not affected by this clause. You can specify only one DEFAULT clause on a directive. The default data scope attribute can be one of the following:

- PRIVATE: Makes all named objects in the lexical extent of the parallel region private to a thread. The objects include common block variables but exclude THREADPRIVATE variables.
- SHARED: Makes all named objects in the lexical ext
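A short, hypothetical sketch of the DEFAULT clause (the loop and variable names are invented): with DEFAULT(PRIVATE), anything not listed explicitly becomes private, so shared data must be named:

```fortran
! Hypothetical example: DEFAULT(PRIVATE) with an explicit SHARED list
!$OMP PARALLEL DO DEFAULT(PRIVATE) SHARED(A, B, N)
      DO I = 1, N
        A(I) = B(I) * 2.0      ! I is private; A, B, N are shared by request
      END DO
!$OMP END PARALLEL DO
```

Listing the shared objects explicitly, as here, makes the scoping intent visible at the directive rather than spread across the loop body.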
e used in a computation be treated as a zero, so no floating invalid exception occurs. On Itanium-based systems, the -O3 option sets the abrupt underflow to zero (-ftz is on); at lower optimization levels, gradual underflow to 0 is the default on the Itanium-based systems.

On IA-32, setting abrupt underflow by -ftz may improve performance of SSE/SSE2 instructions, while it does not affect either performance or numerical behavior of x87 instructions. Thus, -ftz will have no effect unless you select the -x or -ax options, which activate instructions of the more recent IA-32 Intel processors.

On Itanium-based processors, gradual underflow to 0 can degrade performance. Using higher optimization levels to get the default abrupt underflow, or explicitly setting -ftz, improves performance. -ftz may improve performance on Itanium 2 processor, even in the absence of actual underflow, most frequently for single-precision code.

Using the Floating-point Exception Handling, -fpe<n>

Use the -fpe<n> option to control the handling of exceptions. The -fpe<n> option controls floating-point exceptions according to the value of n.

The following are the kinds of floating-point exceptions:

- Floating overflow: the result of a computation is too large for the floating-point data type. The result is replaced with the exceptional value Infinity with the proper + or - sign. For example, 1E30 1E3
e Intel processors. When the main program is compiled with this option, it will detect non-compatible processors and generate an error message during execution. This option also enables new optimizations in addition to Intel processor-specific optimizations.

-xB: Intel® Pentium® M and compatible Intel processors. When the main program is compiled with this option, it will detect non-compatible processors and generate an error message during execution. This option also enables new optimizations in addition to Intel processor-specific optimizations.

-xP: Intel® Pentium® 4 processors with Streaming SIMD Extensions 3 (SSE3) instruction support. When the main program is compiled with this option, it will detect non-compatible processors and generate an error message during execution. This option also enables new optimizations in addition to Intel processor-specific optimizations.

To execute a program on x86 processors not provided by Intel Corporation, do not specify the -x{K|W|N|B|P} option.

Example

The invocation below compiles myprog.f for Intel Pentium 4 and compatible processors. The resulting binary might not execute correctly on Pentium, Pentium Pro, Pentium II, Pentium III, or Pentium with MMX technology processors, or on x86 processors not provided by Intel Corporation:

ifort -xN myprog.f

Caution: If a program comp
e Listing:

		.globl	padd_
padd_:
# parameter 1: 8 + %ebp
# parameter 2: 12 + %ebp
# parameter 3: 16 + %ebp
# parameter 4 (n): 20 + %ebp
B1.1:                         # Preds B1.0
        pushl     %ebp                                     #1.0
        ...
B1.19:                        # Preds B1.15
        addl      $-28, %esp                               #6.0
        movl      $.2.1_2_kmpc_loc_struct_pack.1, (%esp)   #6.0
        movl      $4, 4(%esp)                              #6.0
        movl      $padd__6__par_loop0, 8(%esp)             #6.0
        movl      -196(%ebp), %eax                         #6.0
        movl      %eax, 12(%esp)                           #6.0
        movl      -152(%ebp), %eax                         #6.0
        movl      %eax, 16(%esp)                           #6.0
        movl      -112(%ebp), %eax                         #6.0
        movl      %eax, 20(%esp)                           #6.0
        lea       -20(%ebp), %eax                          #6.0
        movl      %eax, 24(%esp)                           #6.0
        call      __kmpc_fork_call                         #6.0
        ...
        addl      $28, %esp                                #6.0
        jmp       B1.23         # Prob 100%                #6.0
B1.20:                        # Preds B1.30
        call      __kmpc_for_static_init_4                 #6.0
        ...
B1.26:                        # Preds B1.28 B1.21
        addl      $-8, %esp                                #6.0
        movl      $.2.1_2_kmpc_loc_struct_pack.1, (%esp)   #6.0
        movl      8(%ebp), %eax                            #6.0
        movl      %eax, 4(%esp)                            #6.0
        call      __kmpc_for_static_fini                   #6.0
        ...
B1.27:                        # Preds B1.28 B1.25
        call      omp_get_thread_num                       #8.0
        cmpl      %edx, %eax                               #1.0
        jle       B1.27         # Prob 50%                 #1.0
        jmp       B1.26         # Prob 100%                #1.0
        ...
        .type     padd_,@function
        .size     padd_,.-padd_
e PARALLEL SECTIONS.

PARALLEL DO and END PARALLEL DO

Use the PARALLEL DO directive to specify a parallel region that implicitly contains a single DO directive. You can specify one or more of the clauses for the PARALLEL and the DO directives. The following example shows how to parallelize a simple loop. The loop iteration variable is private by default, so it is not necessary to declare it explicitly. The END PARALLEL DO directive is optional:

!$OMP PARALLEL DO
      DO I = 1, N
        B(I) = (A(I) + A(I-1)) / 2.0
      END DO
!$OMP END PARALLEL DO

PARALLEL SECTIONS and END PARALLEL SECTIONS

Use the PARALLEL SECTIONS directive to specify a parallel region that implicitly contains a single SECTIONS directive. You can specify one or more of the clauses for the PARALLEL and the SECTIONS directives. The last section ends at the END PARALLEL SECTIONS directive. In the following example, subroutines X_AXIS, Y_AXIS, and Z_AXIS can be executed concurrently. The first SECTION directive is optional. Note that all SECTION directives must appear in the lexical extent of the PARALLEL SECTIONS/END PARALLEL SECTIONS construct:

!$OMP PARALLEL SECTIONS
!$OMP SECTION
      CALL X_AXIS
!$OMP SECTION
      CALL Y_AXIS
!$OMP SECTION
      CALL Z_AXIS
!$OMP END PARALLEL SECTIONS

Synchronization Constructs

Synchronization constructs are use
OpenMP Run-time Library Routines

OpenMP provides several run-time library routines to assist you in managing your program in parallel mode. Many of these run-time library routines have corresponding environment variables that can be set as defaults. The run-time library routines enable you to dynamically change these factors to assist in controlling your program. In all cases, a call to a run-time library routine overrides any corresponding environment variable.

The following summary specifies the interface to these routines. The names for the routines are in user name space. The omp_lib.f, omp_lib.h, and omp_lib.mod header files are provided in the INCLUDE directory of your compiler installation. The omp_lib.h header file is provided for use with the Fortran INCLUDE statement; the omp_lib.mod file is provided for use with the Fortran USE statement.

There are definitions for two different locks, omp_lock_t and omp_nest_lock_t, which are used by the functions in the summary that follows. This topic provides a summary of the OpenMP run-time library routines. For detailed descriptions, see the OpenMP Fortran version 2.0 specifications.

Execution Environment Routines:

- subroutine omp_set_num_threads(num_threads): Sets the number of threads to use for subsequent parallel regions.
- integer function omp_get_num_thr
e a default size, such as INTEGER, LOGICAL, COMPLEX, and REAL, be aware that the compiler options -integer_size {16|32|64} or -real_size {32|64|128} can change the size of an individual field's data declaration and thus can alter the data alignment of a carefully planned order of data declarations.

Using the suggested data declaration guidelines minimizes the need to use the -align keyword options to add padding bytes to ensure naturally aligned data. In cases where the -align keyword options are still needed, using the suggested data declaration guidelines can minimize the number of padding bytes added by the compiler.

Arranging Data Items in Common Blocks

The order of data items in a common statement determines the order in which the data items are stored. Consider the following declaration of a common block named x:

logical (kind=2) flag
integer iarry_i(3)
character(len=5) name_ch
common /x/ flag, iarry_i, name_ch

As shown in Figure 1-1, if you omit the appropriate Fortran command options, the common block will contain unaligned data items beginning at the first array element of iarry_i.

[Figure 1-1: Common Block with Unaligned Data — byte offsets 0 through 18 hold FLAG, IARRY_I(1) through IARRY_I(3), and NAME_CH (1 byte per character)]

As shown in Figure 1-2, if you compile the program units that use the common block with the al
e the linker. The compiler performs IPO across all object files that have an IR. The compiler first analyzes all of the summary information and then finishes compiling the pieces of the application for which it has IR. Having global information about the application while it is compiling individual pieces can improve the quality of optimization.

Note: The compiler does not support multifile IPO for static libraries (.a files). See Compilation with Real Object Files for more information.

-ipo enables the driver and compiler to attempt detecting a whole program automatically. If a whole program is detected, the interprocedural constant propagation, stack frame alignment, data layout, and padding of common blocks perform more efficiently, while more dead functions get deleted. This option is safe.

Command Line for Creating an IPO Executable

The command-line options to enable IPO for compilations targeted for both the IA-32 and Itanium architectures are identical. To produce mock object files containing IR, compile your source files with -ipo as follows:

ifort -ipo -c a.f b.f c.f

This produces a.o, b.o, and c.o object files. These files contain Intel compiler intermediate representation (IR) corresponding to the compiled source files a.f, b.f, and c.f. Using -c to stop compilation after generating .o files is required.
eads: Returns the number of threads that are being used in the current parallel region.

- integer function omp_get_max_threads(): Returns the maximum number of threads that are available for parallel execution.
- integer function omp_get_thread_num(): Determines the unique thread number of the thread currently executing this section of code.
- integer function omp_get_num_procs(): Determines the number of processors available to the program.
- logical function omp_in_parallel(): Returns .TRUE. if called within the dynamic extent of a parallel region executing in parallel; otherwise returns .FALSE..
- subroutine omp_set_dynamic(dynamic_threads), logical dynamic_threads: Enables or disables dynamic adjustment of the number of threads used to execute a parallel region. If dynamic_threads is .TRUE., dynamic threads are enabled; if dynamic_threads is .FALSE., dynamic threads are disabled. Dynamic threads are disabled by default.
- logical function omp_get_dynamic(): Returns .TRUE. if dynamic thread adjustment is enabled; otherwise returns .FALSE..
- subroutine omp_set_nested(nested), logical nested: Enables or disables nested parallelism. If nested is .TRUE., nested parallelism is enabled; if nested is .FALSE., nested parallelism is disabled. Nested parallelism is disabled by default.
- logical function omp_get_nested
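A minimal, hypothetical usage sketch of a few of these routines (the program body is invented for illustration):

```fortran
! Hypothetical example: querying the OpenMP run-time environment
      PROGRAM QUERY
      USE OMP_LIB
      INTEGER TID
      CALL OMP_SET_NUM_THREADS(4)        ! request 4 threads for the next region
!$OMP PARALLEL PRIVATE(TID)
      TID = OMP_GET_THREAD_NUM()         ! unique number of this thread
      IF (TID .EQ. 0) THEN
        PRINT *, 'threads in team:', OMP_GET_NUM_THREADS()
      END IF
!$OMP END PARALLEL
      END PROGRAM QUERY
```

Because OMP_SET_NUM_THREADS is called before the parallel region, it overrides any OMP_NUM_THREADS environment setting for that region.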
ecutable, -prof_use.

The -prof_use option is used in phase 3 of the PGO to instruct the compiler to produce a profile-optimized executable; it merges available dynamic information (.dyn) files into a pgopti.dpi file.

Note: The dynamic information files are produced in phase 2, when you run the instrumented executable. If you perform multiple executions of the instrumented program, -prof_use merges the dynamic information files again and overwrites the previous pgopti.dpi file.

Using 32-bit Counters, -prof_format_32

The Intel Fortran compiler by default produces profile data with 64-bit counters to handle large numbers of events in the .dyn and .dpi files. The -prof_format_32 option produces 32-bit counters for compatibility with the earlier compiler versions. If the format of the .dyn and .dpi files is incompatible with the format used in the current compilation, the compiler issues the following message:

Error: xxx.dyn has old or incompatible file format - delete file and redo instrumentation compilation/execution

The 64-bit format for counters and pointers in .dyn and .dpi files eliminates the incompatibilities on various platforms due to different pointer sizes.

Disabling Function Splitting, -fnsplit-

-fnsplit- disables function splitting on Itanium-based systems. Function splitting is enabled by -prof_use in phase 3 to improve code locality by s
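The three PGO phases referred to above can be sketched as a command sequence. The file names are placeholders, and the exact option spellings should be checked against the compiler's option reference for your version:

```
# Phase 1: instrumented compilation
ifort -prof_gen myapp.f90 -o myapp_instr
# Phase 2: instrumented execution -- each run writes a .dyn file
./myapp_instr
# Phase 3: feedback compilation -- merges the .dyn files into pgopti.dpi
ifort -prof_use myapp.f90 -o myapp_opt
```

Running the instrumented executable several times, with representative inputs, gives -prof_use more dynamic information to merge in phase 3.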
ede the DO statement for each DO loop it affects. If n is specified, the optimizer unrolls the loop n times. If n is omitted, or if it is outside the allowed range, the optimizer assigns the number of times to unroll the loop. The UNROLL directive overrides any setting of loop unrolling from the command line. Currently, the directive can be applied only to the innermost loop nest; if applied to the outer loop nests, it is ignored. The compiler generates correct code by comparing n and the loop count.

!DEC$ UNROLL(4)
do i = 1, m
172. ee see ee ee ee ee ee 71 Optimizing for Specific Processors se ee see ee ee Re ee ee ee ee ee ee ee ee 73 Optimizing for Specific Processors OVerVieW ees see e 73 Targeting a Processor PP EE ES EES EE EE EG Ee ee Ge 73 Processor specific Optimization IA 32 only 75 Automatic Processor specific Optimization 1A 32 only 76 Processor specific Run time Checks IA 32 Systems 77 Interprocedural Optimizations IPO sese 79 Overview of Interprocedural Optimizations ee ee ee ee Ee ee ee RR ee 79 IPO Compilation Model SE ES RI n Cu aa cya We peas as ecg ae ae ee ea Ge ee and 80 Command Line for Creating an IPO Executable iese ee see ee ee ee Re ee ee 81 Generating Multiple IPO Object Files eee eee eee 82 Capturing Intermediate Outputs of IPC 83 Table OT Contents Creating an IPO Executable Using ad 83 Code Layout and Multi Object IPO sse gees si ee ea ER ee Gee Ek Re ee esee 85 Compilation with Real Object Files eee 86 Creating a Library from IPO Object iese see ee ee RE ee ee ee EE ee ee 87 Using ip with Qoption Specifiers sse eee eee 88 Inline Expansion of EUNEUONS oe dr i E deet od d ene edu d e 90 Profile guided Optimizations so ia Ie eoa eee eee 93 Profile guided Optimizations OvervieW ss eee ee ee e ee 93 Profile guided Optimizations Methodology and Usage Model 94 Basic O0 Real EE RE EED pe toda en ot T OUS
173. eference Vectorization Support The directives discussed in this topic support vectorization IVDEP Directive The IVDEP directive instructs the compiler to ignore assumed vector dependences To ensure correct code the compiler treats an assumed dependence as a proven dependence which prevents vectorization This directive overrides that decision Use IVDEP only when you know that the assumed loop dependences are safe to ignore For example if the expression j gt 0 is always true in the code fragment bellow the IVDEP directive can communicate this information to the compiler This directive informs the compiler that the conservatively assumed loop carried flow dependences for values j 0 can be safely ignored 206 Optimization Support Features DECS IVDEP do i 1 100 a i a i j enddo Note The proven dependences that prevent vectorization are not ignored only assumed dependences are ignored The usage of the directive differs depending on the loop form For loops of the form 1 use old values of a and assume that there is no loop carried flow dependencies from DEF to USE For loops of the form 2 use new values of a and assume that there is no loop carried anti dependencies from USE to DEF In both cases it is valid to distribute the loop and there is no loop carried output dependency Example 1 CDECS IVDEP do j 1 n at a j m 1 enddo Example 2 CDEC
efine the parallel construct. When the master thread encounters a parallel construct, it creates a team of threads, with the master thread becoming the master of the team. The program statements enclosed by the parallel construct are executed in parallel by each thread in the team. These statements include routines called from within the enclosed statements.

The statements enclosed lexically within a construct define the static extent of the construct. The dynamic extent includes the static extent as well as the routines called from within the construct. When the END PARALLEL directive is encountered, the threads in the team synchronize at that point, the team is dissolved, and only the master thread continues execution. The other threads in the team enter a wait state. You can specify any number of parallel constructs in a single program. As a result, thread teams can be created and dissolved many times during program execution.

Using Orphaned Directives

In routines called from within parallel constructs, you can also use directives. Directives that are not in the lexical extent of the parallel construct, but are in the dynamic extent, are called orphaned directives. Orphaned directives allow you to execute major portions of your program in parallel with only minimal changes to the sequential version of the program. Using this functionality, you can code parallel constructs at the top levels of your program call tree and use directives to contro
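A small, invented sketch of an orphaned directive: the DO directive below is not lexically inside the PARALLEL construct, but it is in its dynamic extent, so the loop's iterations are shared among the team that called WORK:

```fortran
! Hypothetical example of an orphaned worksharing directive
      PROGRAM MAIN
!$OMP PARALLEL
      CALL WORK(100)
!$OMP END PARALLEL
      END PROGRAM MAIN

      SUBROUTINE WORK(N)
      INTEGER N, I
!$OMP DO                  ! orphaned: binds to the enclosing team at run time
      DO I = 1, N
        CALL DO_UNIT(I)   ! hypothetical helper performing one unit of work
      END DO
!$OMP END DO
      END SUBROUTINE WORK
```

If WORK is called outside any parallel region, the orphaned DO simply executes serially, which is what keeps the sequential version of the program intact.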
175. …el work without yielding to other threads.

Note: Avoid over-allocating system resources. This occurs if either too many threads have been specified, or if too few processors are available at run time. If system resources are over-allocated, this mode will cause poor performance. The throughput mode should be used instead if this occurs.

OpenMP Environment Variables

This topic describes the standard OpenMP environment variables (with the OMP prefix) and Intel-specific environment variables (with the KMP prefix) that are Intel extensions to the standard Fortran Compiler.

Standard Environment Variables

  Variable         Description                                          Default
  OMP_SCHEDULE     Sets the run-time schedule type and chunk size.      STATIC, no chunk size specified
  OMP_NUM_THREADS  Sets the number of threads to use during execution.  Number of processors
  OMP_DYNAMIC      Enables (true) or disables (false) the dynamic       false
                   adjustment of the number of threads.
  OMP_NESTED       Enables (true) or disables (false) nested            false
                   parallelism.

Intel Extension Environment Variables

  Environment Variable  Description                                     Default
  KMP_ALL_THREADS       Sets the maximum number of threads that can     max(32, 4 * OMP_NUM_THREADS,
                        be used by any parallel region.                 4 * number of processors)
  KMP_BLOCKTIME         Sets the time, in milliseconds, that a thread
                        should wait aft
176. …ements other than the preceding floating-point and integer operations are permitted. The loop body cannot contain any function calls other than the ones described above.

Vectorization Examples

This section contains simple examples of some common issues in vector programming.

Argument Aliasing: A Vector Copy

The loop in the example of a vector copy operation does not vectorize because the compiler cannot prove that DEST(A(I)) and DEST(B(I)) are distinct.

Example of Unvectorizable Copy Due to Unproven Distinction:

      SUBROUTINE VEC_COPY(DEST, A, B, LEN)
      DIMENSION DEST(*)
      INTEGER A(*), B(*)
      INTEGER LEN, I
      DO I = 1, LEN
        DEST(A(I)) = DEST(B(I))
      END DO
      END

Data Alignment

A 16-byte or greater data structure or array should be aligned so that the beginning of each structure or array element is aligned in a way that its base address is a multiple of 16. The Misaligned Data Crossing 16-Byte Boundary figure shows the effect of a data cache unit (DCU) split due to misaligned data. The code loads the misaligned data across a 16-byte boundary, which results in an additional memory access, causing a six- to twelve-cycle stall. You can avoid the stalls if you know that the data is aligned, and you specify to assume alignment.

[Figure: Misaligned Data Crossing 16-Byte Boundary]
177. …enMP feature. See more examples in the OpenMP Fortran version 2.0 specifications.

DO: A Simple Difference Operator

This example shows a simple parallel loop where each iteration contains a different number of instructions. To get good load balancing, dynamic scheduling is used. The END DO has a NOWAIT because there is an implicit BARRIER at the end of the parallel region.

      subroutine do_1(a, b, n)
      real a(n,n), b(n,n)
c$omp parallel
c$omp& shared(a,b,n)
c$omp& private(i,j)
c$omp do schedule(dynamic,1)
      do i = 2, n
        do j = 1, i
          b(j,i) = ( a(j,i) + a(j,i-1) ) / 2
        enddo
      enddo
c$omp end do nowait
c$omp end parallel
      end

DO: Two Difference Operators

This example shows two parallel regions fused to reduce fork/join overhead. The first END DO has a NOWAIT because all the data used in the second loop is different than all the data used in the first loop.

      subroutine do_2(a, b, c, d, m, n)
      real a(n,n), b(n,n), c(m,m), d(m,m)
c$omp parallel
c$omp& shared(a,b,c,d,m,n)
c$omp& private(i,j)
c$omp do schedule(dynamic,1)
      do i = 2, n
        do j = 1, i
          b(j,i) = ( a(j,i) + a(j,i-1) ) / 2
        enddo
      enddo
c$omp end do nowait
c$omp do schedule(dynamic,1)
      do i = 2, m
        do j = 1, i
          d(j,i) = ( c(j,i) + c(j,i-1) ) / 2
        enddo
      enddo
c$omp end do nowait
c$omp end parallel
      end

SECTIONS: Two Difference Operators

This example dem
178. …enclosed code among the members of the team that encounter it. The OpenMP SECTIONS or DO constructs are defined as worksharing constructs because they distribute the enclosed work among the threads of the current team. A worksharing construct is only distributed if it is encountered during dynamic execution of a parallel region. If the worksharing construct occurs lexically inside the parallel region, then it is always executed by distributing the work among the team members. If the worksharing construct is not lexically (explicitly) enclosed by a parallel region — that is, it is orphaned — then the worksharing construct will be distributed among the team members of the closest dynamically enclosing parallel region, if one exists. Otherwise, it will be executed serially.

When a thread reaches the end of a worksharing construct, it may wait until all team members within that construct have completed their work. When all of the work defined by the worksharing construct is finished, the team exits the worksharing construct and continues executing the code that follows.

A combined parallel worksharing construct denotes a parallel region that contains only one worksharing construct.

Parallel Processing Directive Groups

The parallel processing directives include the following groups:

Parallel Region
- PARALLEL and END PARALLEL

Worksharing Constructs
- The DO and END DO directives specify parallel execution of loop iterations.
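For example, the combined PARALLEL DO construct fuses a parallel region with the single DO worksharing construct it contains. The following hedged sketch (array names invented) shows the two equivalent spellings:

```fortran
! Hypothetical illustration: a parallel region containing one DO
! worksharing construct, followed by the combined shorthand form.
!$omp parallel
!$omp do
      do i = 1, n
        a(i) = b(i) + c(i)
      enddo
!$omp end do
!$omp end parallel

! Combined parallel worksharing construct: same semantics, less text.
!$omp parallel do
      do i = 1, n
        a(i) = b(i) + c(i)
      enddo
!$omp end parallel do
```

The combined form is convenient when the region holds exactly one worksharing construct; with more than one, the explicit PARALLEL/END PARALLEL form avoids repeated fork/join overhead.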
179. …ent of the parallel region shared among all the threads in the team.

- NONE: Declares that there is no implicit default as to whether variables are PRIVATE or SHARED. You must explicitly specify the scope attribute for each variable in the lexical extent of the parallel region.

If you do not specify the DEFAULT clause, the default is DEFAULT(SHARED). However, loop control variables are always PRIVATE by default. You can exempt variables from the default data-scope attribute by using other scope-attribute clauses on the parallel region, as shown in the following example:

!$OMP PARALLEL DO DEFAULT(PRIVATE), FIRSTPRIVATE(I), SHARED(X),
!$OMP& SHARED(R), LASTPRIVATE(I)

PRIVATE, FIRSTPRIVATE, and LASTPRIVATE Clauses

PRIVATE

Use the PRIVATE clause on the PARALLEL, DO, SECTIONS, SINGLE, PARALLEL DO, and PARALLEL SECTIONS directives to declare variables to be private to each thread in the team. The behavior of variables declared PRIVATE is as follows:

- A new object of the same type and size is declared once for each thread in the team, and the new object is no longer storage-associated with the original object.
- All references to the original object in the lexical extent of the directive construct are replaced with references to the private object.
- Variables defined as PRIVATE are undefined for each thread on entering the construct
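The three clauses differ in how the private copy is initialized and whether a final value is copied back. Here is a hedged sketch (all names invented) of that difference:

```fortran
! Hypothetical illustration of PRIVATE / FIRSTPRIVATE / LASTPRIVATE.
! x: FIRSTPRIVATE - each thread's private x starts with the value 5.0
!    that x held on entry to the construct.
! t: PRIVATE - undefined in each thread on entry; must be assigned
!    before use in every iteration.
! i: LASTPRIVATE - the value from the sequentially last iteration is
!    copied back to the original i when the loop ends.
      x = 5.0
!$omp parallel do private(t) firstprivate(x) lastprivate(i)
      do i = 1, n
        t = a(i) * x
        b(i) = t
      enddo
!$omp end parallel do
! After the construct, i holds the value it would have after a
! sequential execution of the loop.
```

Choosing FIRSTPRIVATE over PRIVATE matters whenever the loop body reads a value the variable held before the region; plain PRIVATE would leave it undefined.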
180. …er completing the execution of a parallel region before sleeping. See also the throughput execution mode and the KMP_LIBRARY environment variable. Use the optional character suffix s, m, h, or d to specify seconds, minutes, hours, or days. Default: 200 milliseconds.

  KMP_LIBRARY            Selects the OpenMP run-time library execution mode. The options
                         for the variable value are serial, turnaround, or throughput,
                         indicating the execution mode. The default value of throughput
                         is used if this variable is not specified.
                         Default: throughput

  KMP_MONITOR_STACKSIZE  Sets the number of bytes to allocate for the monitor thread,
                         which is used for book-keeping during program execution. Use the
                         optional suffix b, k, m, g, or t to specify bytes, kilobytes,
                         megabytes, gigabytes, or terabytes.
                         Default: max(32k, system minimum thread stack size)

  KMP_STACKSIZE          Sets the number of bytes to allocate for each parallel thread to
                         use as its private stack. Use the optional suffix b, k, m, g, or
                         t to specify bytes, kilobytes, megabytes, gigabytes, or terabytes.
                         Default: IA-32: 2m; Itanium compiler: 4m

  KMP_VERSION            Enables (set) or disables (unset) the printing of OpenMP run-time
                         library version information during program execution.
                         Default: disabled
181. The debugging of multithreaded programs discussed in this section applies to both the OpenMP Fortran API and the Intel Fortran parallel compiler directives. When a program uses parallel decomposition directives, you must take into consideration that a bug might be caused either by an incorrect program statement or by an incorrect parallel decomposition directive. In either case, the program to be debugged can be executed by multiple threads simultaneously.

To debug multithreaded programs, you can use:

- The Intel Debugger for IA-32 and the Intel Debugger for Itanium-based applications (idb)
- The Intel Fortran Compiler debugging options and methods
- The Intel parallelization extension routines for low-level debugging
- The VTune(TM) Performance Analyzer to define the problematic areas

Other best-known debugging methods and tips include:

- Correct the program in a single-threaded, uni-processor environment
- Statically analyze locks
- Use a trace statement, such as the PRINT statement
- Think in parallel; make very few assumptions
- Step through your code
- Make sense of threads and call-stack information
- Identify the primary thread
- Know what thread you are debugging
- Single-stepping in one thread does not mean single-stepping in others
- Watch out for context switches

Debugger Limitations for Multithread Programs

Debuggers such as the Intel Debugger (IDB) for IA-32 and the Intel Debugger (IDB) for
182. …eration after every function/subroutine call to ensure a proper state of the floating-point stack, and slows down compilation. It is meant only as a debugging aid for finding floating-point stack underflow/overflow problems, which can otherwise be hard to find.

Aliases, -common_args

The -common_args option assumes that the "by-reference" subprogram arguments may have aliases of one another.

Preventing CRAY Pointer Aliasing

Option -safe_cray_ptr specifies that the CRAY pointers do not alias with other variables. The default is OFF. Consider the following example:

      pointer (pb, b)
      pb = getstorage()
      do i = 1, n
        b(i) = a(i) + 1
      enddo

When -safe_cray_ptr is not specified (default), the compiler assumes that b and a are aliased. To prevent such an assumption, specify this option, and the compiler will treat b(i) and a(i) as independent of each other.

However, if the variables are intended to be aliased with CRAY pointers, using the -safe_cray_ptr option produces incorrect results. For the code example below, -safe_cray_ptr should not be used:

      pb = loc(a(2))
      do i = 1, n
        b(i) = a(i) + 1
      enddo

-ansi_alias

The -ansi_alias[-] option enables (default) or disables the compiler's assumption that the program adheres to the ANSI Fortran type-aliasability rules. For example, an object of type real cannot be accessed as an integer. See the ANSI standard for the complete set of rules. The option di
183. …es the number of subprogram references that will be inlined, especially when multiple source files are compiled together at optimization level -O3. For more information, see Efficient Compilation.

Code DO Loops for Efficiency

Minimize the arithmetic operations and other operations in a DO loop whenever possible. Moving unnecessary operations outside the loop will improve performance — for example, when the intermediate nonvarying values within the loop are not needed. For more information on loop optimizations, see Pipelining for Itanium(R)-based Applications and Loop Unrolling; on the syntax of Intel Fortran statements, see the Intel(R) Fortran Language Reference.

Using Intrinsics for Itanium(R)-based Systems

Intel(R) Fortran supports all standard Fortran intrinsic procedures and, in addition, provides Intel-specific intrinsic procedures to extend the functionality of the language. Intel Fortran intrinsic procedures are provided in the library libintrins.a. For more information on intrinsic procedures, see the Intel Fortran Language Reference. This topic provides examples of the Intel extended intrinsics that are helpful in developing efficient applications.

CACHESIZE Intrinsic (Itanium(R) Compiler)

Intrinsic CACHESIZE(n) is used only with the Intel Itanium(R) Compiler. CACHESIZE(n) returns the size in kilobytes of the cache at level n (1 represents the first-level cache). Zero is returned for a nonexistent cache level. T
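As a hedged illustration of moving nonvarying work out of a DO loop (the subroutine and variable names here are invented, not from the manual):

```fortran
! Hypothetical sketch: hoisting a loop-invariant expression.
      subroutine scale_array(a, n, scale, offset)
      integer n, i
      real a(n), scale, offset, factor
      ! Compute the nonvarying value once, outside the loop, instead
      ! of re-evaluating scale*offset on every iteration.
      factor = scale * offset
      do i = 1, n
        a(i) = a(i) * factor
      enddo
      end subroutine
```

Optimizing compilers often perform this hoisting themselves, but writing it explicitly keeps the loop body minimal regardless of the optimization level in effect.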
184. …f -ipo might be linked without the -ipo option and without the use of xiar.

- You want to generate an assembly listing for each source file (using -S) while compiling with -ipo. If you use -ipo with -S, but without -ipo_obj, the compiler issues a warning and an empty assembly file is produced for each compiled source file.

Implementing the .il Files with Version Numbers

An IPO compilation consists of two parts: the compile phase and the link phase. In the compile phase, the compiler produces an intermediate language (IL) version of the user's code. In the link phase, the compiler reads the IL and completes the compilation, producing a real object file or executable.

Generally, different compiler versions produce IL based on different definitions, and therefore the ILs from different compilations can be incompatible. Intel Fortran Compiler assigns a unique version number to each compiler's IL definition. If a compiler attempts to read IL in a file with a version number other than its own, the compilation proceeds, but the IL is discarded and not used in the compilation. The compiler then issues a warning message about an incompatible IL detected and discarded.

The IL produced by the Intel compiler is stored in a file with the suffix .il. Then the .il file is placed in the library. If this library is used in an IPO compilation invoked with the same compiler as produced the IL for the library, the compiler can extract the .il file from the li
185. …f the OpenMP directives within the code:

!$OMP PARALLEL PRIVATE(NUM), SHARED(X,A,B,C)  ! Defines a parallel region
!$OMP PARALLEL DO        ! Specifies a parallel region that
                         ! implicitly contains a single DO directive
      DO I = 1, 1000
        NUM = FOO(B(I), C(I))
        X(I) = BAR(A(I), NUM)
! Assume FOO and BAR have no side effects
      ENDDO

See examples of the auto-parallelization and auto-vectorization directives in the respective sections.

Auto-vectorization (IA-32 Only)

Vectorization Overview

The vectorizer is a component of the Intel(R) Fortran Compiler that automatically uses SIMD instructions in the MMX(TM), SSE, SSE2, and SSE3 instruction sets. The vectorizer detects operations in the program that can be done in parallel and then converts the sequential operations to parallel; for example, one SIMD instruction can process 2, 4, 8, or up to 16 elements in parallel, depending on the data type.

This section provides option descriptions, guidelines, and examples for Intel Fortran Compiler vectorization, implemented by the IA-32 compiler only. For additional information, see Publications on Compiler Optimizations. The following list summarizes this section's contents:

- Descriptions of compiler options to control vectorization
- Vectorization key programming guidelines
- Discussion and general guidelines on vectorization levels:
  - Automatic vectorization
  - Vectorization with user intervention
- Examples demonstrating typi
186. …fies the path to the .spi file. Here is a sample output from this run of the test prioritization tool:

  Total number of tests   = 3
  Total block coverage    = 52.17
  Total function coverage = 50.00

  num  %RatCvrg  %BlkCvrg  %FncCvrg  Test Name (Options)
  1       87.50     45.65     37.50  Test3
  2      100.00     52.17     50.00  Test2

In this example, the test prioritization tool has provided the following information:

- By running all three tests, we achieve 52.17% block coverage and 50.00% function coverage.
- Test3 by itself covers 45.65% of the basic blocks of the application, which is 87.50% of the total block coverage that can be achieved from all three tests.
- By adding Test2, we achieve a cumulative block coverage of 52.17%, or 100% of the total block coverage of Test1, Test2, and Test3.
- Elimination of Test1 has no negative impact on the total block coverage.

Example 2: Minimizing Execution Time

Suppose we have the following execution time of each test in the tests_list file:

  Test1.dpi  00:00:60.35
  Test2.dpi  00:00:10.15
  Test3.dpi  00:00:30.45

The following command executes the test prioritization tool to minimize the execution time with the -mintime option:

  tselect -dpi_list tests_list -spi pgopti.spi -mintime

Here is a sample output:

  Total number of tests   = 3
  Total block coverage    = 52.17
  Total function coverage = 50.00
  Total execution time    = 1:41:35

  num  ElapsedTime  %RatCvrg  %BlkCvrg  %FncCvrg  Test Name
187. …for IA-32 systems, 76
automatic compiler option, 51
AUTOMATIC statement, 51
auto-parallelization
    data flow, 144
    diagnostics, 147
    enabling, 145
    environment variables, 145
    overview, 143
    processing, 144
    programming with, 144
    threshold control, 147
    threshold needed, 145
auto-parallelized loops, 48, 147
auto-parallelizer
    diagnostic levels, 127
    enabling, 127, 145
    threshold, 147
auto-vectorization, 34, 127
auto-vectorizer, 134
ax compiler option, 34, 62, 131

B
BACKSPACE, 25
backtrace compiler option, 210
BARRIER directive
    description of, 160
    using, 150, 170
basic PGO options (profile-guided optimization), 94, 97
bcolor option of code-coverage tool, 103
big endian, 41, 45
binding to a parallel region, 150
block size, 138
BLOCKSIZE
    increasing, 25
    specifying, 25
browsing frames, 103
BUFFERCOUNT
    buffered_io option, 25
buffers, 25

C
c compiler option, 41, 81
c$OMP BARRIER, 170
c$OMP DO PRIVATE, 1
188. …formance is affected slightly by the run-time checks to determine which code to use.

Note: Applications that you compile to optimize themselves for specific processors in this way will execute on any Intel IA-32 processor. If you specify both the -x and -ax options, the -x option forces the generic code to execute only on processors compatible with the processor type specified by the -x option.

  Option  Optimizes Your Code for
  -axK    Intel Pentium III and compatible Intel processors.
  -axW    Intel Pentium 4 and compatible Intel processors.
  -axN    Intel Pentium 4 and compatible Intel processors. This option also enables
          new optimizations in addition to Intel processor-specific optimizations.
  -axB    Intel Pentium M and compatible Intel processors. This option also enables
          new optimizations in addition to Intel processor-specific optimizations.
  -axP    Intel(R) Pentium(R) 4 processors with Streaming SIMD Extensions 3 (SSE3)
          instruction support. This option also enables new optimizations in
          addition to Intel processor-specific optimizations.

Example

The compilation below generates a single executable that includes:

- A generic version for use on any IA-32 processor
- A version optimized for Intel Pentium 4 processors, as long as there is a performance benefit
- A version optimized for Intel Pentium M processors, as long as there is a performance benefit

  ifort -axNB prog.f90

Processor-specific Run-time Checks, IA-32 Systems

The
189. …he -openmp option.

Syntax for Parallel Regions in the Source Code

The OpenMP constructs defining a parallel region have one of the following syntax forms:

  !$omp directive
    structured block of code
  !$omp end directive

or

  !$omp directive
    structured block of code

or

  !$omp directive

where directive is the name of a particular OpenMP directive.

OpenMP Diagnostic Reports

The -openmp_report{0|1|2} option controls the OpenMP parallelizer's diagnostic levels 0, 1, or 2, as follows:

  -openmp_report0  No diagnostic information is displayed.
  -openmp_report1  Display diagnostics indicating loops, regions, and sections
                   successfully parallelized.
  -openmp_report2  Same as -openmp_report1, plus diagnostics indicating MASTER
                   constructs, SINGLE constructs, CRITICAL constructs, ORDERED
                   constructs, ATOMIC directives, etc. successfully handled.

The default is -openmp_report1.

OpenMP Directives and Clauses Summary

This topic provides a summary of the OpenMP directives and clauses. For detailed descriptions, see the OpenMP Fortran version 2.0 specifications.

OpenMP Directives

  Directive                Description
  PARALLEL / END PARALLEL  Defines a parallel region.
  DO / END DO              Identifies an iterative worksharing construct in which
                           the iterations of the associated loop should be executed
                           in parallel.
  SECTIONS                 Identifies a
190. …he team is executing a critical section having the same name. When a thread enters the critical section, a latch variable is set to closed and all other threads are locked out. When the thread exits the critical section (at the END CRITICAL directive), the latch variable is set to open, allowing another thread access to the critical section.

If you specify a critical-section name in the CRITICAL directive, you must specify the same name in the END CRITICAL directive. If you do not specify a name for the CRITICAL directive, you cannot specify a name for the END CRITICAL directive. All unnamed CRITICAL directives map to the same name. Critical-section names are global to the program.

The following example includes several CRITICAL directives and illustrates a queuing model in which a task is dequeued and worked on. To guard against multiple threads dequeuing the same task, the dequeuing operation must be in a critical section. Because there are two independent queues in this example, each queue is protected by CRITICAL directives having different names, X_AXIS and Y_AXIS, respectively:

!$OMP PARALLEL DEFAULT(PRIVATE), SHARED(X,Y)
!$OMP CRITICAL(X_AXIS)
      CALL DEQUEUE(IX_NEXT, X)
!$OMP END CRITICAL(X_AXIS)
      CALL WORK(IX_NEXT, X)
!$OMP CRITICAL(Y_AXIS)
      CALL DEQUEUE(IY_NEXT, Y)
!$OMP END CRITICAL(Y_AXIS)
      CALL WORK(IY_NEXT, Y)
!$OMP END PARALLEL
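As a hedged complement to the named form above (the names total, a, and n are invented for this sketch), an unnamed CRITICAL serializes all threads through a single implicit lock:

```fortran
! Hypothetical illustration: all unnamed CRITICAL directives map to
! the same name, so at most one thread at a time executes the update.
      total = 0.0
!$omp parallel private(i)
!$omp do
      do i = 1, n
!$omp critical
        total = total + a(i)
!$omp end critical
      enddo
!$omp end do
!$omp end parallel
```

In practice a REDUCTION clause would avoid serializing every iteration; the sketch only illustrates the unnamed form and its single shared latch.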
191. …he entire array. The array temporary is created because the passed array may not be contiguous and the receiving (explicit-shape) array requires a contiguous array. When an array temporary is created, the size of the passed array determines whether the impact on slowing run-time performance is slight or severe.

The following table summarizes what happens with the various combinations of array types. The amount of run-time performance inefficiency depends on the size of the array.

  Actual Argument      Dummy Argument Array Types (choose one)
  Array Types          Explicit-Shape Arrays          Deferred-Shape and
  (choose one)                                        Assumed-Shape Arrays

  Explicit-Shape       Result when using this         Result when using this
  Arrays               combination: Very efficient.   combination: Efficient. Only
                       Does not use an array          allowed for assumed-shape
                       temporary. Does not pass an    arrays (not deferred-shape
                       array descriptor. Interface    arrays). Does not use an array
                       block optional.                temporary. Passes an array
                                                      descriptor. Requires an
                                                      interface block.

  Deferred-Shape and   Result when using this         Result when using this
  Assumed-Shape        combination: When passing an   combination: Efficient.
  Arrays               allocatable array, very        Requires an assumed-shape or
                       efficient. Does not use an     array pointer as dummy
                       array temporary. Does not      argument. Does not use an
                       pass an array descriptor.      array temporary. Passes an
                       Interface block optional.      array descriptor. Requires an
                       When not passing an            interface block.
                       allocatable array, not
                       efficient. Instead, use
                       allocatable arrays whenever
                       possible.
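A hedged sketch (all names invented) of the efficient assumed-shape combination from the table, which passes an array descriptor and therefore requires an explicit interface:

```fortran
! Hypothetical illustration: assumed-shape dummy argument.
! No array temporary is created; a descriptor is passed; the caller
! must have an explicit interface for the called procedure.
      subroutine scale(v)
      real v(:)                 ! assumed-shape dummy argument
      v = v * 2.0
      end subroutine

      program main
      interface                 ! interface block makes the shape known
        subroutine scale(v)
        real v(:)
        end subroutine
      end interface
      real a(100)
      a = 1.0
      call scale(a)             ! contiguous actual argument: efficient
      end program
```

Placing the subroutine in a module would provide the explicit interface automatically, which is usually preferable to writing interface blocks by hand.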
192. …he pgopti.dpi file whose source path begins with the c:/work/sources prefix, profmerge replaces that prefix with d:/project/src. The c:/work/pgopti.dpi file is updated with the new source path information.

The following rules apply:

- You can execute profmerge more than once on a given pgopti.dpi file. You may need to do this if the source files are located in multiple directories. For example:

    profmerge -src_old "c:/program files" -src_new "e:/program files"
    profmerge -src_old c:/proj/application -src_new d:/app

- In the values specified for -src_old and -src_new, uppercase and lowercase characters are treated as identical. Likewise, forward-slash and backward-slash characters are treated as identical.
- Because the source relocation feature of profmerge modifies the pgopti.dpi file, you may wish to make a backup copy of the file prior to performing the source relocation.

Code-coverage Tool

The Intel Compilers code-coverage tool can be used, for both IA-32 and Itanium(R) architectures, in a number of ways to improve development efficiency, reduce defects, and increase application performance. The major features of the Intel Compilers code-coverage tool are:

- Visual presentation of the application's code coverage information, with the code-coverage coloring scheme
- Display of the dynamic execution counts of each basic block of the application
- Differential coverage, or comparison of the profiles of the applications
193. …he symbol. However, its address might be passed to other components indirectly — for example, as an argument to a call to a function in another component, or by having its address stored in a data item referenced by a function in another component.

INTERNAL — The symbol cannot be referenced outside the component where it is defined, either directly or indirectly.

Note: Visibility applies to both references and definitions. A symbol reference's visibility attribute is an assertion that the corresponding definition will have that visibility.

Symbol Preemption and Optimization

Sometimes programmers need to use some of the functions or data items from a shareable object, but at the same time they need to replace other items with definitions of their own. For example, an application may need to use the standard run-time library shareable object, libc.so, but to use its own definitions of the heap-management routines malloc and free. In this case it is important that calls to malloc and free within libc.so use the user's definition of the routines and not the definitions in libc.so. The user's definition should then override, or preempt, the definition within the shareable object.

This functionality of redefining the items in shareable objects is called symbol preemption. When the run-time loader loads a component, all symbols within the component that have default visibility are subject to preempt
194. …here -prof_dir dirname is a profmerge utility option. This merges all .dyn files in the current directory, or the directory specified by -prof_dir, and produces the summary file pgopti.dpi.

The -prof_file filename option enables you to specify the name of the .dpi file. The command-line usage for profmerge with -prof_file filename is as follows:

  profmerge -nologo -prof_file filename

where -prof_file filename is a profmerge utility option.

Note: The profmerge tool merges all the .dyn files that exist in the given directory. It is very important to make sure that unrelated .dyn files, oftentimes from previous runs, are not present in that directory. Otherwise, profile information will be based on invalid profile data. This can negatively impact the performance of optimized code as well as generate misleading coverage information.

Note: The .dyn files can be merged to a .dpi file by the profmerge tool without recompiling the application.

Dumping Profile Data

This subsection provides an example of how to call the C PGO API routines from Fortran. For a complete description of the PGO API support routines, see PGO API: Profile Information Generation Support. As part of the instrumented execution phase of profile-guided optimization, the instrumented program writes profile data to the dynamic-information file (.dyn file). The file is written after the in
195. …hich produces enhanced debug information useful in finding scalar local variables.

  -fp  (IA-32 only) Disables the use of the ebp register in optimizations.
       Directs the compiler to use the ebp-based stack frame for all functions.

Support for Symbolic Debugging, -g

Use the -g option to direct the compiler to generate code to provide symbolic debugging information and line numbers in the object code that will be used by your source-level debugger. For example:

  ifort -g prog1.f

The -g option turns off -O2 and makes -O0 the default, unless -O2 (or -O1 or -O3) is explicitly specified on the command line together with -g.

The Use of the ebp Register: -fp (IA-32 only)

Most debuggers use the ebp register as a stack-frame pointer to produce a stack backtrace. The -fp option disables the use of the ebp register in optimizations and directs the compiler to generate code that maintains and uses ebp as a stack-frame pointer for all functions, so that a debugger can still produce a stack backtrace without turning off -O1, -O2, or -O3 optimizations. Note that using this option reduces the number of available general-purpose registers by one and results in slightly less efficient code.

-fp Summary:

  Default          OFF
  -O1, -O2, or -O3 Disables -fp
  -O0              Enables -fp

The -traceback Option

The -traceback option also forces the compiler to use ebp as the stack-frame pointer. In addition, the -traceback option causes the compiler to generate extr
196. …hile some were not. The default color can be overridden with the -pcolor option.

Unknown: No code was generated for this source line. Most probably, the source at this position is a comment, a header-file inclusion, or a variable declaration. The default color can be overridden with the -ucolor option.

The default colors can be customized to be any valid HTML color by using the options mentioned for each coverage category in the table above.

For the code-coverage colored presentation, the coverage tool uses the following heuristic: source characters are scanned until reaching a position in the source that is indicated by the profile information as the beginning of a basic block. If the profile information for that basic block indicates that a coverage category changes, then the tool changes the color corresponding to the coverage condition of that portion of the code, and the coverage tool inserts the appropriate color change in the HTML files.

Note: You need to interpret the colors in the context of the code. For instance, comment lines that follow a basic block that was never executed would be colored in the same color as the uncovered blocks. Another example is the closing brackets in C/C++ applications.

Coverage Analysis of a Module's Subset

One of the capabilities of the Intel Compilers code-coverage tool is efficient coverage analysis of an application's subset of modules. This analysis is accomplished based on the selected opti
197. …his intrinsic can be used in many scenarios where application programmers would like to tailor their algorithms for the target processor's cache hierarchy. For example, an application may query the cache size and use it to select block sizes in algorithms that operate on matrices:

      subroutine foo(level)
      integer level
      if (cachesize(level) > threshold) then
        call big_bar()
      else
        call small_bar()
      end if
      end subroutine

Coding Guidelines for Intel(R) Architectures

This topic provides general guidelines for coding practices and techniques for:

- IA-32 architecture supporting MMX(TM) technology and Streaming SIMD Extensions (SSE) and Streaming SIMD Extensions 2 (SSE2)
- Itanium(R) architecture

This topic describes practices, tools, coding rules, and recommendations associated with the architecture features that can improve the performance on IA-32 and Itanium processor families. For details about optimization for IA-32 processors, see the Intel(R) Architecture Optimization Reference Manual. For all details about optimization for the Itanium processor family, see the Intel Itanium 2 Processor Reference Manual for Software Development and Optimization.

Note: If a guideline refers to a particular architecture only, this architecture is explicitly named. The default is for both IA-32 and Itanium architectures. Performance of compiler-generated code may vary from one co
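A hedged extension of the same idea (the subroutine name, the factor of three blocks, and the REAL(4) element size are assumptions for this sketch): deriving a square matrix block edge from the first-level cache size reported by CACHESIZE.

```fortran
! Hypothetical sketch, not from the manual: choose a block edge nb so
! that three nb-by-nb REAL(4) blocks fit in the level-1 data cache.
      subroutine pick_block(nb)
      integer nb, kb
      kb = cachesize(1)                          ! level-1 cache, in KB
      nb = int(sqrt(real(kb * 1024) / (3.0 * 4.0)))
      end subroutine
```

The constant 3.0 assumes a blocked multiply that keeps one block each of two operands and the result resident; a different algorithm would change that factor.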
198. …ibutes. The options ensure immediate access to the feature without depending on header-file modifications.

The visibility options cause all global symbols to get the visibility specified by the option. There are two varieties of options to specify symbol visibility explicitly:

  -fvisibility=keyword
  -fvisibility-keyword=file

The first form specifies the default visibility for global symbols. The second form specifies the visibility for symbols that are in a file (this form overrides the first form). The file is the pathname of a file containing the list of symbols whose visibility you want to set; the symbols are separated by whitespace (spaces, tabs, or newlines). In both options, the keyword is one of extern, default, protected, hidden, and internal (see definitions above).

Note: These two ways to explicitly set visibility are mutually exclusive: you may use the visibility attribute on the declaration, or specify the symbol name in a file, but not both.

The -fvisibility-keyword=file option specifies the same visibility attribute for a number of symbols, using one of the five command-line options corresponding to the keyword:

  -fvisibility-extern=file
  -fvisibility-default=file
  -fvisibility-protected=file
  -fvisibility-hidden=file
  -fvisibility-internal=file

where file is the pathname of a file containing a list of the symbol names whose visibility you w
icit different algorithms to be called. This can cause the behavior of your program to vary from one execution to the next.

Phases of Basic Profile-Guided Optimization

94 Compiler Optimizations

1. Instrumented compilation: ifort -prof_gen a.f produces executable files with instrumented code (a.out).
2. Instrumented execution: running a.out outputs dynamic-information files, with a unique name for each execution (<8 hex digits>.dyn).
3. Feedback compilation: ifort -prof_use <my_option> a.f creates and uses the merged dynamic-information summary file, pgopti.dpi.

PGO Usage Model

The chart that follows presents the PGO usage model:

- Step one: compile with -prof_genx; keep the static profile information (.spi) for coverage analysis and PGT. This produces the instrumented executables.
- Step two: run the instrumented executables; keep the dynamic profile information (.dpi) for coverage analysis and PGT, and merge the dynamic profile information.
- Step three: compile with -prof_use. This produces the optimized executables.

Here are the steps for a simple example (myApp.f90) for IA-32 systems:

1. Set the following: PROF_DIR=c:/myApp/prof_dir
2. Issue the following command:

   ifort -prof_genx myApp.f90

This command compiles the program and generates the instrumented binary myApp.exe, as well as the corresponding static profile information, pgopti.spi.
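The three phases listed above can be sketched as the following command sequence. This is a minimal recipe, not a definitive build script; the source file name a.f is hypothetical, and the option spellings follow the -prof_gen/-prof_use forms used in this guide:

```shell
# Phase 1: instrumented compilation
ifort -prof_gen a.f -o a.out

# Phase 2: instrumented execution
# each run writes a dynamic-information file named <8 hex digits>.dyn
./a.out

# Phase 3: feedback compilation
# merges the .dyn data into pgopti.dpi and uses it to optimize
ifort -prof_use a.f -o a.out
```

Setting PROF_DIR before the instrumented run directs the .dyn files to a chosen directory, which keeps the profile data for separate applications from mixing.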
ign commons option, data items will be naturally aligned.

Figure 1-2: Common Block with Naturally Aligned Data (figure shows byte offsets within the block, with one padding byte per character)

Because the common block x contains data items whose size is 32 bits or smaller, specify the -align commons option. If the common block contains data items whose size might be larger than 32 bits (such as REAL(KIND=8) data), use the -align dcommons option.

If you can easily modify the source files that use the common block data, define the numeric variables in the COMMON statement in descending order of size and place the character variable last. This provides more portability, ensures natural alignment without padding, and does not require the command-line options -align commons or -align dcommons:

   logical (kind=2) :: flag
   integer :: iarry_i(3)
   character(len=5) :: name_ch
   common /x/ iarry_i, flag, name_ch

18 Programming for High Performance

As shown in Figure 1-3, if you arrange the order of variables from largest to smallest size and place character data last, the data items will be naturally aligned.

Figure 1-3: Common Block with Naturally Aligned, Reordered Data (figure shows IARRY_I(1), IARRY_I(2), IARRY_I(3) at byte offsets 0, 4, 8, with one padding byte per character)

When modifying or creating all source files that use common block data, consider placing the common block data
iled with -x{K|W|N|B|P} is executed on a non-compatible processor, it might fail with an illegal instruction exception or display other unexpected behavior. Executing programs compiled with -xN, -xB, or -xP on unsupported processors (see the table above) will display the following run-time error:

   Fatal error: This program was not built to run on the processor in your system.

Automatic Processor-specific Optimization (IA-32 only)

The -ax{K|W|N|B|P} options direct the compiler to find opportunities to generate separate versions of functions that take advantage of features that are specific to the specified Intel processor. If the compiler finds such an opportunity, it first checks whether generating a processor-specific version of a function is likely to result in a performance gain. If this is the case, the compiler generates both a processor-specific version of the function and a generic version of the function. The generic version will run on any IA-32 processor. At run time, one of the versions is chosen to execute, depending on the Intel processor in use. In this way, the program can benefit from performance gains on more advanced Intel processors while still working properly on older IA-32 processors.

The disadvantages of using -ax{K|W|N|B|P} are:

- The size of the compiled binary increases, because it contains processor-specific versions of some of the code as well as a generic version of the code. Per
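As a minimal sketch of the automatic dispatch described above (the file names are hypothetical; the option letter follows the table of processor values given earlier in this guide):

```shell
# Build one binary that carries both a Pentium 4 (SSE2)-specific
# code path for selected functions and a generic IA-32 path;
# the appropriate version is chosen at run time.
ifort -axN -o app main.f90 compute.f90
```

Contrast this with -xN, which produces only the processor-specific code and therefore fails on older processors.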
iler. For complete information on the OpenMP standard, visit the www.openmp.org web site. For complete Fortran language specifications, see the OpenMP Fortran version 2.0 specifications.

Parallel Processing with OpenMP

To compile with OpenMP, you need to prepare your program by annotating the code with OpenMP directives in the form of Fortran program comments. The Intel Fortran Compiler first processes the application and produces a multithreaded version of the code, which is then compiled. The output is a Fortran executable with the parallelism implemented by threads that execute parallel regions or constructs. See Programming with OpenMP.

Performance Analysis

For performance analysis of your program, you can use the VTune(TM) analyzer and/or the Intel Threading Tools to show performance information. You can obtain detailed information about which portions of the code require the largest amount of time to execute and where parallel performance problems are located.

Programming with OpenMP

The Intel(R) Fortran Compiler accepts a Fortran program containing OpenMP directives as input and produces a multithreaded version of the code. When the parallel program begins execution, a single thread exists. This thread is called the master thread. The master thread will continue to process serially until it encounters a parallel region.

Parallel Region

A parallel region is a block of code that must be executed by a team of threads in p
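As a minimal illustration of the master thread and a parallel region (the file name, loop bound, and printed text are hypothetical; compiled with the -openmp option described in this guide):

```fortran
! hello_omp.f90 -- compile with: ifort -openmp hello_omp.f90
program hello_omp
  use omp_lib          ! OpenMP run-time library routines
  implicit none
  integer :: i

  ! Serial part: only the master thread executes here.
  print *, 'threads available:', omp_get_max_threads()

!$OMP PARALLEL DO
  do i = 1, 4
    ! Inside the parallel region, iterations may run on different threads.
    print *, 'iteration', i, 'on thread', omp_get_thread_num()
  end do
!$OMP END PARALLEL DO
end program hello_omp
```

The directives are ordinary Fortran comments, so the same source compiles serially when -openmp is omitted.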
iler enable you to enhance the performance of your application. Each optimization is performed using a set of options, discussed in the sections of this volume. In addition to optimizations invoked by the compiler command-line options, the compiler includes features that enhance your application performance, such as directives, intrinsics, run-time library routines, and various utilities. These features are discussed in the Optimization Support Features section.

Note: This document explains how information and instructions apply differently to targeted architectures. If there is no reference to a specific architecture, the description applies to all supported architectures.

This documentation assumes that you are familiar with the standard Fortran programming language and with the Intel processor architecture. You should also be familiar with the host computer's operating system.

Notation Conventions

This manual uses the following conventions:

Intel Fortran — The name of the common compiler language supported by the Intel Fortran Compiler for Windows and Intel Fortran Compiler for Linux products.

Fortran 95, Fortran 90, Fortran 77 — These terms are references to versions of the Fortran language. The default is Fortran, which corresponds to all versions.

THIS TYPE STYLE — Statements, keywords, and directives are shown in all uppercase, in a normal font. For example: add the USE statement. This type style
iler will not speculate on floating-point operations that may affect the floating-point state of the machine. See Floating-point Arithmetic Precision for Itanium-based Systems.

- Floating-point arithmetic comparisons conform to IEEE 754.
- The exact operations specified in the code are performed. For example, division is never changed to multiplication by the reciprocal.
- The compiler performs floating-point operations in the order specified, without reassociation.
- The compiler does not perform constant folding on floating-point values. Constant folding also eliminates any multiplication by 1, division by 1, and addition or subtraction of 0. For example, code that adds 0.0 to a number is executed exactly as written. Compile-time floating-point arithmetic is not performed, to ensure that floating-point exceptions are also maintained.
- On IA-32 systems, whenever an expression is spilled, it is spilled as 80 bits (extended precision), not 64 bits (DOUBLE PRECISION). Floating-point operations conform to IEEE 754. When assignments to type REAL and DOUBLE PRECISION are made, the precision is rounded from 80 bits (extended) down to 32 bits (REAL) or 64 bits (DOUBLE PRECISION). When you do not specify -O0, the extra bits of precision are not always rounded away before the variable is reused.
- Even if vectorization is enabled by the -x{K|W|N|B|P} options, the compiler does not vectorize reduction loops (loops com
in common blocks be aligned on up to 4-byte boundaries, by adding padding bytes as needed. The -align nocommons option arbitrarily aligns the bytes of common block data. In this case, unaligned data can occur unless the order of data items specified in the COMMON statement places the largest numeric data item first, followed by the next largest numeric data, and so on, followed by any character data.

- By default (with -O2), the -align dcommons option requests that data in common blocks be aligned on up to 8-byte boundaries, by adding padding bytes as needed. The -align nodcommons option arbitrarily aligns the bytes of data items in a common block. Specify the -align dcommons option for applications that use common blocks, unless your application has no unaligned data or, if the application might have unaligned data, all data items are four bytes or smaller. For applications that use common blocks where all data items are four bytes or smaller, you can specify -align commons instead of -align dcommons.
- The -align norecords option requests that multiple data items in derived-type data and record structures (an Intel Fortran extension) be aligned arbitrarily on byte boundaries, instead of being naturally aligned. The default is -align records.
- The -align records option requests that multiple data items in record structures (extension) and derived-type data without the SEQUENCE statement be naturally aligned, by adding p
ing representable ranges during computation, since handling these cases can have a performance impact. Use REAL variables in single-precision format unless the extra precision obtained through DOUBLE PRECISION or REAL(8) variables is required. Using variables with a larger precision format will also increase memory size and bandwidth requirements.

- For IA-32 only: avoid repeatedly changing rounding modes between more than two values, which can lead to poor performance when the computation is done using non-SSE instructions. Hence, avoid using the FLOOR and TRUNC instructions together when generating non-SSE code. The same applies to using CEIL and TRUNC. Another way to avoid the problem is to use the -x{K|W|N|B|P} options to do the computation using SSE instructions.
- Reduce the impact of denormal exceptions for both architectures, as described below.

Denormal Exceptions

Floating-point computations with underflow can result in denormal values that have an adverse impact on performance.

For IA-32: take advantage of the SIMD capabilities of Streaming SIMD Extensions (SSE) and Streaming SIMD Extensions 2 (SSE2) instructions. The -x{K|W|N|B|P} options enable the flush-to-zero (FTZ) mode in SSE and SSE2 instructions, whereby underflow results are automatically converted to zero, which improves application performance. In addition, the -xP option also enables the denormals-are-zero (DAZ) mode, whereby denormals are converted to zero on input, further
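A small sketch of how underflow produces the denormal values discussed above (the starting value and scale factor are hypothetical; with FTZ enabled, the denormal results would instead be flushed to zero):

```fortran
program denormal_demo
  implicit none
  real(4) :: x
  integer :: i

  x = 1.0e-30_4                  ! already near TINY(x), about 1.18e-38
  do i = 1, 3
    x = x * 1.0e-4_4             ! each step drives x below the normal range
    print *, i, x                ! denormal value, or 0.0 if flush-to-zero is on
  end do
end program
```

On hardware without fast denormal handling, each operation on such values can trap to microcode or an exception handler, which is why the guide recommends flushing them.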
ing the C shell, the following program timing reports 1.19 seconds of total actual CPU time (0.61 seconds in actual CPU time for user program use and 0.58 seconds of actual CPU time for system use), about 4 seconds (0:04) of elapsed time, the use of 28% of available CPU time, and other information:

   % time a.out
   Average of all the numbers is: 4368488960.000000
   0.61u 0.58s 0:04 28% 78+424k 9+5io 0pf+0w

Using the bash shell, the following program timing reports that the program uses 1.19 seconds of total actual CPU time (0.61 seconds in actual CPU time for user program use and 0.58 seconds of actual CPU time for system use) and 2.46 seconds of elapsed time:

   $ time a.out
   Average of all the numbers is: 4368488960.000000
   elapsed 0m2.46s
   user    0m0.61s
   sys     0m0.58s

Timings that indicate a large amount of system time is being used may suggest excessive I/O, a condition worth investigating. If your program displays a lot of text, you can redirect the output from the program on the time command line. Redirecting output from the program will change the times reported, because of reduced screen I/O.

39 Intel Fortran Compiler for Linux Systems User's Guide Vol II

For more information, see time(1). In addition to the time command, you might consider modifying the program to call routines within the program to measure execution time. For example, use the Intel Fortran intrinsic procedures, such as SECNDS, DCLOCK, CP
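For instance, a minimal sketch of in-program timing using the standard CPU_TIME intrinsic (the workload loop is hypothetical; SECNDS or DCLOCK could be substituted for elapsed-time measurement):

```fortran
program timing_demo
  implicit none
  real :: t_start, t_end, s
  integer :: i

  call cpu_time(t_start)
  s = 0.0
  do i = 1, 1000000            ! hypothetical workload to be timed
    s = s + sqrt(real(i))
  end do
  call cpu_time(t_end)

  print *, 'sum =', s
  print *, 'CPU seconds used:', t_end - t_start
end program
```

Unlike the shell's time command, this measures only the bracketed region, which makes it useful for isolating a single hot loop or routine.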
ion for compatibility. -vec_report1 indicates loops successfully vectorized.

Disabling Default Options

To disable an option, you can generally use one of the following:

- To disable one or a group of optimization options, use the -O0 option. For example:

   ifort -O2 -O0 input_file(s)

  Note: The -O0 option is part of a mutually exclusive group of options that includes -O0, -O, -O1, -O2, and -O3. The last of any of these options specified on the command line will override the previous options from this group.
- To disable options that include an optional "-" (shown as [-]), use that version of the option on the command line; for example, -ftz-.
- To disable options that have an n parameter, use the n=0 version; for example, -unroll0.

Note: If there are enabling and disabling versions of an option on the line, the last one takes precedence.

Using Compilation Options

Stacks: Automatic Allocation and Checking

51 Intel Fortran Compiler for Linux Systems User's Guide Vol II

The options in this group enable you to control the computation of stacks and variables in the compiler-generated code.

Automatic Allocation of Variables: -auto

The -auto option specifies that locally declared variables are allocated to the run-time stack rather than static storage. If variables defined in a procedure do not have the SAVE or ALLOCATABLE attribute, they are allocated to the stack. It does not affect variables that appear in an
ion, host association, or common block use. These program semantics slow performance, so you should specify -assume dummy_aliases only for the called subprograms that depend on such aliases. The use of dummy aliases violates the Fortran 77 and Fortran 95/90 standards, but occurs in some older programs.

-check bounds — Generates extra code for array bounds checking at run time.

-check overflow — Generates extra code to check integer calculations for arithmetic overflow at run time. Once the program is debugged, omit this option to reduce executable program size and slightly improve run-time performance.

-fpe3 — Enables certain types of floating-point exception handling, which can be expensive.

-g — Generates extra symbol table information in the object file. Specifying this option also reduces the default level of optimization to -O0 (no optimization).
Note: The -g option only slows your program down when no optimization level is specified, in which case -g turns on -O0, which slows the compilation down. If -g and -O2 are specified, the code runs at very much the same speed as if -g were not specified.

-O0 — Turns off optimizations. Can be used during the early stages of program development or when you use the debugger.

-save — Forces the local variables to retain their values from the last invocation terminated. This may change the output of your program; fo
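As a sketch of what -check bounds buys you (the file name and array sizes are hypothetical), the following out-of-bounds store compiles cleanly but is reported at run time when bounds checking is enabled:

```fortran
! oob.f90 -- compile with: ifort -check bounds -g oob.f90
program oob
  implicit none
  integer :: a(5), i

  i = 6
  a(i) = 1          ! subscript 6 exceeds the bound 5;
                    ! -check bounds turns this into a run-time error
  print *, a
end program
```

Without the option, the store silently overwrites adjacent memory, which is exactly the class of bug these debugging options exist to expose before the options are removed for the production build.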
ion by symbols of the same name in components that are already loaded. Note that since the main program image is always loaded first, none of the symbols it defines will be preempted (redefined).

The possibility of symbol preemption inhibits many valuable compiler optimizations, because symbols with default visibility are not bound to a memory address until run time. For example, calls to a routine with default visibility cannot be inlined, because the routine might be preempted if the compilation unit is linked into a shareable object. A preemptable data symbol cannot be accessed using GP-relative addressing, because the name may be bound to a symbol in a different component, and the GP-relative address is not known at compile time.

Symbol preemption is a rarely used feature and has negative consequences for compiler optimization. For this reason, by default the compiler treats all global symbol definitions as non-preemptable (protected visibility). Global references to symbols defined in another compilation unit are assumed by default to be preemptable (default visibility). In those rare cases where all global definitions, as well as references, need to be preemptable, specify the -fpic option to override this default.

Specifying Symbol Visibility Explicitly

The Intel Fortran Compiler has visibility attribute options that provide command-line control of the visibility attributes, as well as a source syntax to set the complete range of these attr
ionally followed by its execution time. The name must uniquely identify the test.

-o file — Sets the path name of the output report file.

-comp file — Sets the filename that contains the list of files of interest.

-cutoff value — Terminates when the cumulative block coverage reaches value% of the pre-computed total coverage. value must be greater than 0.0 (for example, 99.00). It may be set to 100.

-nototal — Does not pre-compute the total coverage.

-mintime — Minimizes testing execution time. The execution time of each test must be provided on the same line of the dpi_list file, after the test name, in dd:hh:mm:ss format.

-verbose — Generates more logging information about the program progress.

Usage Requirements

112 Compiler Optimizations

To run the test prioritization tool on an application's tests, the following files are required:

- The .spi file generated by the Intel compilers when compiling the application for the instrumented binaries with the -prof_genx option.
- The .dpi files generated by the Intel compilers' profmerge tool as a result of merging the dynamic profile information (.dyn) files of each of the application tests. The user needs to apply the profmerge tool to all .dyn files that are generated for each individual test, and name the resulting .dpi in a fashion that uniquely identifies the test. The profmerge tool merges all the .dyn files that exist in the given directory.

Note: It is very important that the user makes sure that
is the coverage rate, are displayed. For example, "66.67 (4/6)" indicates that four out of the six blocks of the corresponding function were covered; the block coverage rate of that function is thus 66.67%. These lists can be sorted based on the coverage rate, the number of blocks, or the function names. Function names are linked to the position in source view where the function body starts. So, with just one click the user can see the least-covered function in the list, and with another click the browser displays the body of the function. The user can then scroll down in the source view and browse through the function body.

Individual Module Source View

Within the individual module source views, the tool provides the list of uncovered functions as well as the list of covered functions. The lists are reported in two distinct frames that provide easy navigation of the source code. The lists can be sorted based on:

- the number of blocks within uncovered functions
- the block coverage, in the case of covered functions
- the function names

The following screen shows the coverage source view of SAMPLE.C.

(Screenshot: Intel Compilers code coverage information for SAMPLE.C, displayed in a web browser.)
is PROF_DUMP_INTERVAL. This environment variable may be used to initiate interval profile dumping in an instrumented user application. For more information, see the recommended usage of _PGOPTI_Set_Interval_Prof_Dump.

Dumping Profile Information

The PGOPTI_PROF_DUMP function dumps the profile information collected by the instrumented application and has the following prototype:

   void PGOPTI_Prof_Dump(void);

The profile information is generated in a .dyn file (generated in phase 2 of the PGO).

Recommended usage: insert a single call to this function in the body of the function that terminates the user application. Normally, PGOPTI_PROF_DUMP should be called just once. It is also possible to use this function in conjunction with the PGOPTI_PROF_RESET function to generate multiple .dyn files, presumably from multiple sets of input data.

Example: selectively collect profile information for the portion of the application involved in processing input data.

   input_data = get_input_data()
   do while (input_data)
     call PGOPTI_PROF_RESET()
     call process_data(input_data)
     call PGOPTI_PROF_DUMP()
     input_data = get_input_data()
   end do

Resetting the Dynamic Profile Counters

The PGOPTI_PROF_RESET function resets the dynamic profile counters and has the following prototype:
ish to set. The symbol names in the file are separated by either blanks, tabs, or newlines. For example, the command-line option:

   -fvisibility-protected=prot.txt

where the file prot.txt contains symbols a, b, c, d, and e, sets protected visibility for symbols a, b, c, d, and e. This has the same effect as the declared attribute visibility("protected") on the declaration of each of the symbols.

Specifying Visibility without a Symbol File: -fvisibility=keyword

This option sets the visibility for symbols that are not specified in a visibility list file and that do not have a visibility attribute in their declaration. If no symbol file option is specified, all symbols will get the specified attribute. Command-line example:

   ifort -fvisibility=protected a.f

You can set the default visibility for symbols using one of the following command-line options:

   -fvisibility=extern
   -fvisibility=default
   -fvisibility=protected
   -fvisibility=hidden
   -fvisibility=internal

The above options are listed in order of precedence: explicitly setting the visibility to extern, by using either the attribute syntax or the command-line option, overrides any setting to default, protected, hidden, or internal. Explicitly setting the visibility to default overrides any setting to protected, hidden, or internal, and so on. The visibility attribute default enables the compiler to change the default symbol visibility and then set the default attribu
ker command line containing a set of valid arguments to ld. To create app using IPO, use the option -ofilename, as shown in the following example:

   xild -oapp a.o b.o c.o

xild calls the compiler to perform IPO for objects containing IR and creates a new list of object(s) to be linked. Then xild calls ld to link the object files that are specified in the new list and produce app.

Note: The -ipo option can reorder object files and linker arguments on the command line. Therefore, if your program relies on a precise order of arguments on the command line, -ipo can affect the behavior of your program.

The xild command recognizes all three spellings of the IPO switch: -ipo, -ipoN, and -ipo_separate.

Usage Rules

You must use the Intel linker xild to link your application if:

- Your source files were compiled with IPO enabled. IPO is enabled by specifying the -ipo command-line option.
- You normally would invoke the GCC linker (ld) to link your application.

The xild Options

The additional options supported by xild may be used to examine the results of IPO. These options are described in the following table.

-qipo_fa[file.s] — Produces an assembly listing for the IPO compilation. You can specify an optional name for the listing file, or a directory (with the backslash) in which to place the file. The default listing name is ipo_out.s.

-qipo_fo[file.o] — Produces an object file
l extended directives and intrinsics, or library routines that enhance and/or help analyze performance. For complete details of the Intel Fortran Compiler directives and examples of their use, see Chapter 14, "Directive Enhanced Compilation," in the Intel(R) Fortran Language Reference. For intrinsic procedures, see Chapter 9, "Intrinsic Procedures," in the Intel Fortran Language Reference.

A final topic describes options that enable you to generate optimization reports for major compiler phases and major optimizations. The optimization report capability is used for Itanium(R)-based applications only.

Compiler Directives

Compiler Directives Overview

This section discusses the Intel(R) Fortran language extended directives that enhance optimizations of application code, such as software pipelining, loop unrolling, prefetching, and vectorization. For a complete list, descriptions, and code examples of the Intel(R) Fortran Compiler directives, see the Directive Enhanced Compilation section, General Directives, in the Intel(R) Fortran Language Reference.

Pipelining for Itanium-based Applications

The SWP and NOSWP directives indicate a preference for a loop to be software-pipelined or not. The SWP directive does not help data dependence, but overrides heuristics based on profile counts or lop-sided control flow. The software pipelining optimization triggered by the SWP directive applies instruction scheduling to certain innermost loops
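As a sketch of directive placement (the routine, array names, and the !DEC$ directive-prefix spelling are assumptions here; see the Directive Enhanced Compilation chapter for the authoritative syntax), the pipelining request goes immediately before the DO loop it should influence:

```fortran
subroutine saxpy_swp(n, a, x, y)
  implicit none
  integer, intent(in) :: n
  real,    intent(in) :: a, x(n)
  real, intent(inout) :: y(n)
  integer :: i

!DEC$ SWP                  ! prefer software pipelining for the loop below
  do i = 1, n
    y(i) = y(i) + a * x(i)
  end do
end subroutine
```

The directive only expresses a preference to the instruction scheduler; it does not remove any data dependence that would otherwise prevent pipelining.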
l execution in any of the called routines. For example:

156 Parallel Programming with Intel Fortran

   subroutine F
   !$OMP PARALLEL
     call G
   !$OMP END PARALLEL
   end subroutine

   subroutine G
   !$OMP DO
     ...
   end subroutine

The OMP DO is an orphaned directive, because the parallel region it will execute in is not lexically present in G.

Data Environment Directive

A data environment directive controls the data environment during the execution of parallel constructs. You can control the data environment within parallel and worksharing constructs. Using directives and data environment clauses on directives, you can:

- Privatize named common blocks, by using the THREADPRIVATE directive
- Control data scope attributes, by using the THREADPRIVATE directive's clauses

The data scope attribute clauses are:

- COPYIN
- DEFAULT
- PRIVATE
- FIRSTPRIVATE
- LASTPRIVATE
- REDUCTION
- SHARED

You can use several directive clauses to control the data scope attributes of variables for the duration of the construct in which you specify them. If you do not specify a data scope attribute clause on a directive, the default is SHARED for those variables affected by the directive.

For detailed descriptions of the clauses, see the OpenMP Fortran version 2.0 specifications.

Pseudo Code of the Parallel Processing Model

A sample program using some of the more common OpenMP directives is shown in the code example that follows. This example also indicates the difference between serial regions an
l the OMP_SET_NUM_THREADS run-time library routine from a serial portion of the program. This routine overrides any value you may have set using the OMP_NUM_THREADS environment variable. Assuming you have used the OMP_NUM_THREADS environment variable to set the number of threads to 6, you can change the number of threads between parallel regions as follows:

   CALL OMP_SET_NUM_THREADS(3)
   !$OMP PARALLEL
   ...
   !$OMP END PARALLEL
   CALL OMP_SET_NUM_THREADS(4)
   !$OMP PARALLEL DO
   ...
   !$OMP END PARALLEL DO

Setting Units of Work

Use the worksharing directives, such as DO, SECTIONS, and SINGLE, to divide the statements in the parallel region into units of work and to distribute those units so that each unit is executed by one thread. In the following example, the !$OMP DO and !$OMP END DO directives and all the statements enclosed by them comprise the static extent of the parallel region:

   !$OMP PARALLEL
   !$OMP DO
   DO I = 1, N
     B(I) = (A(I) + A(I-1)) / 2.0
   END DO
   !$OMP END DO
   !$OMP END PARALLEL

In the following example, the !$OMP DO and !$OMP END DO directives and all the statements enclosed by them, including all statements contained in the WORK subroutine, comprise the dynamic extent of the parallel region:

   !$OMP PARALLEL DEFAULT(SHARED)
   !$OMP DO
   DO I = 1, N
     CALL WORK(I, N)
   END DO
   !$OMP END DO
   !$OM
lause on the PARALLEL, PARALLEL DO, and PARALLEL SECTIONS directives to make variables shared among all the threads in a team. In the following example, the variables X and NPOINTS are shared among all the threads in the team:

   !$OMP PARALLEL DEFAULT(PRIVATE), SHARED(X,NPOINTS)
     IAM = OMP_GET_THREAD_NUM()
     NP = OMP_GET_NUM_THREADS()
     IPOINTS = NPOINTS / NP
     CALL SUBDOMAIN(X, IAM, IPOINTS)
   !$OMP END PARALLEL

Specifying Schedule Type and Chunk Size

The SCHEDULE clause of the DO or PARALLEL DO directive specifies a scheduling algorithm that determines how iterations of the DO loop are divided among, and dispatched to, the threads of the team. The SCHEDULE clause applies only to the current DO or PARALLEL DO directive. Within the SCHEDULE clause, you must specify a schedule type and, optionally, a chunk size. A chunk is a contiguous group of iterations dispatched to a thread. Chunk size must be a scalar integer expression.

The following list describes the schedule types and how the chunk size affects scheduling:

STATIC — The iterations are divided into pieces having a size specified by chunk. The pieces are statically dispatched to threads in the team in a round-robin manner, in the order of thread number. When chunk is not specified, the iterations are first divided into contiguous pieces by dividing the number of iterations
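For example, a minimal sketch of a static schedule with an explicit chunk size (the array contents, loop bound, and chunk size of 4 are hypothetical):

```fortran
program schedule_demo
  implicit none
  integer, parameter :: n = 16
  real :: a(n), b(n)
  integer :: i

  b = 2.0
!$OMP PARALLEL DO SCHEDULE(STATIC,4)
  do i = 1, n
    a(i) = sqrt(b(i))   ! iterations dispatched to threads in chunks of 4
  end do
!$OMP END PARALLEL DO
  print *, a(1)
end program
```

With four threads, thread 0 receives iterations 1-4, thread 1 receives 5-8, and so on, in round-robin order of thread number.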
llowing commands generate a vectorization report:

   ifort -x{K|W|N|B|P} -vec_report3 file.f
   ifort -x{K|W|N|B|P} -ipo -ipo_obj -vec_report3 file.f
   ifort -c -x{K|W|N|B|P} -ipo -ipo_obj -vec_report3 file.f

Loop Parallelization and Vectorization

Combining the -parallel and -x{K|W|N|B|P} options instructs the compiler to attempt both automatic loop parallelization and automatic loop vectorization in the same compilation. In most cases, the compiler will consider outermost loops for parallelization and innermost loops for vectorization. If deemed profitable, however, the compiler may even apply loop parallelization and vectorization to the same loop. See Guidelines for Effective Auto-parallelization Usage and Vectorization Key Programming Guidelines.

Note that in some rare cases, successful loop parallelization (either automatically or by means of OpenMP directives) may affect the messages reported by the compiler for a non-vectorizable loop in a non-intuitive way.

Vectorization Key Programming Guidelines

The goal of vectorizing compilers is to exploit single-instruction multiple-data (SIMD) processing automatically. Users can help, however, by supplying the compiler with additional information; for example, directives. Review these guidelines and restrictions (see code examples in further topics) and check them against your code to eliminate ambiguities that prevent the compiler from achieving optimal
illustrates access to arrays A and B in inner loop J in a non-contiguous manner, which results in poor performance. The compiler itself can transform the code so that inner loops access memory in a contiguous manner. To do that, you need to use advanced optimization options, such as -O3 for both IA-32 and Itanium architectures, and -O3 and -ax{K|W|N|B|P} for IA-32 only.

Memory Layout

Alignment is a very important factor in ensuring good performance. Aligned memory accesses are faster than unaligned accesses. If you use interprocedural optimization on multiple files (the -ipo option), the compiler analyzes the code and decides whether it is beneficial to pad arrays so that they start from an aligned boundary. Multiple arrays specified in a single common block can impose extra constraints on the compiler. For example, consider the following COMMON statement:

   COMMON /AREA1/ A(200), X, B(200)

If the compiler added padding to align A(1) at a 16-byte aligned address, the element B(1) would not be at a 16-byte aligned address. So it is better to split AREA1 as follows:

   COMMON /AREA1/ A(200)
   COMMON /AREA2/ X
   COMMON /AREA3/ B(200)

The above code provides the compiler maximum flexibility in determining the padding required for both A and B.

Optimizing for Floating-point Applications

To improve floating-point performance, observe these general rules:

- Avoid exceed
loops, or both. Parallelism defined with OpenMP and auto-parallelization directives is based on thread-level parallelism (TLP). Parallelism defined with auto-vectorization techniques is based on instruction-level parallelism (ILP).

The Intel Fortran compiler supports OpenMP and auto-parallelization on both IA-32 and Itanium architectures for multiprocessor systems, as well as on single IA-32 processors with Hyper-Threading Technology (for Hyper-Threading Technology, refer to the IA-32 Intel Architecture Optimization Reference Manual). Auto-vectorization is supported on the families of the Pentium, Pentium with MMX(TM) technology, Pentium II, Pentium III, and Pentium 4 processors. To enhance the compilation of the code with auto-vectorization, users can also add vectorizer directives to their program. A closely related technique that is available on Itanium-based systems is software pipelining (SWP).

The table below summarizes the different ways in which parallelism can be exploited with the Intel Fortran compiler.

   Explicit parallelism (programmed by the user):
      OpenMP (TLP) -- IA-32 and Itanium architectures

   Implicit parallelism (generated by the compiler and by user-supplied hints):
      Auto-parallelization (TLP) of outer-most loops -- IA-32 and Itanium architectures
      Auto-vectorization (ILP) of inner-most loops -- IA-32 only

   Sof
Explicitly requesting a multi-object IPO compilation turns the size-based heuristic off. The number of files generated by the link-time compilation is invisible to the user unless either the -ipo_c or -ipo_S option is used. In this case, the compiler appends a number to the file name. For example, consider this command line:

   ifort a.o b.o c.o -ipo_separate -ipo_c

In this command line, a.o, b.o, and c.o all contain IR, so the compiler will generate ipo_out.o, ipo_out1.o, ipo_out2.o, and ipo_out3.o. The first object file contains global symbols. The other object files correspond to the source files.

This naming convention is also applied to user-specified names. For example:

   ifort a.o b.o c.o -ipo_separate -ipo_c -o appl.o

This will generate appl.o, appl1.o, appl2.o, and appl3.o.

Capturing Intermediate Outputs of IPO

The -ipo_c and -ipo_S options are useful either for analyzing the effects of IPO or when using IPO on modules that do not make up a complete program.

Use the -ipo_c option to optimize across files and produce an object file. This option performs optimizations as described for -ipo but stops prior to the final link stage, leaving an optimized object file. The default name for this file is ipo_out.o. You can use the -o option to specify a different name. For example:

   ifort -tpp6 -ipo_c -ofilename a.f b.f c.f

Use the -ipo_S option to optimize across files and produce an assembly file. This option
me. The screenshot that follows shows a sample top-level coverage summary for a project. By clicking on a module name, for example SAMPLE.C, the browser will display the coverage source view of that particular module.

[Screenshot: "Intel Compilers code coverage information for Sample Project" displayed in a web browser -- a coverage summary listing the covered and uncovered files in Sample_Project, with counts of total, covered, and uncovered functions and blocks per file.]

Browsing the Frames

The coverage tool creates frames that facilitate browsing through the code to identify uncovered code. The top frame displays the list of uncovered functions, while the bottom frame displays the list of covered functions. For uncovered functions, the total number of basic blocks of each function is also displayed. For covered functions, both the total number of blocks and the number of covered blocks, as well as their ratio (that
dynamic information files are created. This variable applies to all three phases of the profiling process.

PROF_DUMP_INTERVAL

Initiates interval profile dumping in an instrumented user application.

PROF_NO_CLOBBER

Alters the feedback compilation phase slightly. By default, during the feedback compilation phase, the compiler merges the data from all dynamic information files and creates a new pgopti.dpi file, even if one already exists. When this variable is set, the compiler does not overwrite the existing pgopti.dpi file. Instead, the compiler issues a warning, and you must remove the pgopti.dpi file if you want to use additional dynamic information files.

See the documentation for your operating system for instructions on how to specify environment variables and their values.

Example of Profile-Guided Optimization

The following is an example of the basic PGO phases.

1. Instrumentation Compilation and Linking

Use -prof_gen to produce an executable with instrumented information. Also use the -prof_dir option, as recommended for most programs, especially if the application includes source files located in multiple directories. -prof_dir ensures that the profile information is generated in one consistent place. For example:

   ifort -prof_gen -prof_dir /usr/profdata -c a1.f a2.f a3.f
   ifort -o a1 a1.o a2.o a3.o

In pla
optimization in which the compiler loads the vectors one cache at a time, so that during the loop computation the number of external bus turnarounds is reduced.

profiling: A process in which detailed information is produced about the program's execution.

register variable detection: An optimization in which the compiler detects the variables that never need to be stored in memory and places them in register variables.

side effects: Results of the optimization process that might increase the code size and/or processing time.

static linking: The process in which a copy of the object file that contains a function used in your program is incorporated in your executable file at link time.

strength reduction: An optimization in which the compiler reduces the complexity of an array index calculation by using only additions.

strip mining: An optimization in which the compiler creates an additional level of nesting to enable inner-loop computations on vectors that can be held in the cache. This optimization reduces the size of inner loops so that the amount of data required for the inner loop can fit the cache size.

token pasting: The process in which the compiler treats two tokens separated by a comment as one; for example, a/**/b becomes ab.

transformation: A rearrangement of code. In contrast, an optimization is a rearrangement of code where improved run-time performance is guaranteed.

un
compiler may generate a run-time test for the profitability of executing in parallel a loop with loop parameters that are not compile-time constants.

Coding Guidelines

Enhance the power and effectiveness of the auto-parallelizer by following these coding guidelines:

- Expose the trip count of loops whenever possible; specifically, use constants where the trip count is known, and save loop parameters in local variables.
- Avoid placing structures inside loop bodies that the compiler may assume to carry dependent data, for example, procedure calls, ambiguous indirect references, or global references.
- Insert the !DEC$ PARALLEL directive to disambiguate assumed data dependencies.
- Insert the !DEC$ NOPARALLEL directive before loops known to have insufficient work to justify the overhead of sharing among threads.

Auto-parallelization Data Flow

For auto-parallelization processing, the compiler performs the following steps:

   Data flow analysis -> Loop classification -> Dependence analysis -> High-level parallelization -> Data partitioning -> Multi-threaded code generation

These steps include:

- Data flow analysis: compute the flow of data through the program.
- Loop classification: determine loop candidates for parallelization based on correctness and efficiency, as shown by threshold analysis.
- Dependence analysis: compute the dependence analysis for references in each loop nest.
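As an illustrative sketch of the two directives named in the guidelines above (this example is not from the original manual; the array names and bounds are hypothetical):

      ! Hypothetical example: the compiler cannot prove that ix(i)
      ! never repeats, so it assumes a dependence; !DEC$ PARALLEL
      ! asserts that the loop is safe to parallelize.
!DEC$ PARALLEL
      do i = 1, n
        a(ix(i)) = b(i) + c(i)
      enddo

      ! A short loop with too little work to amortize the overhead
      ! of sharing among threads:
!DEC$ NOPARALLEL
      do i = 1, 4
        s(i) = t(i)
      enddo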
compiler to another. The Intel® Fortran Compiler generates code that is highly optimized for Intel architectures. You can significantly improve performance by using various compiler optimization options. In addition, you can help the compiler to optimize your Fortran program by following the guidelines described here.

To achieve optimum processor performance in your Fortran application, do the following:

- avoid memory access stalls
- ensure good floating-point performance
- ensure good SIMD integer performance
- use vectorization

The coding practices, rules, and recommendations described here will contribute to optimizing the performance on Intel architecture-based processors.

Memory Access

The Intel compiler lays out Fortran arrays in column-major order. For example, in a two-dimensional array, elements A(22,34) and A(23,34) are contiguous in memory. For best performance, code arrays so that inner loops access them in a contiguous manner. Consider the following examples. The code in example 1 will likely have higher performance than the code in example 2.

Example 1:

      DO J = 1, N
        DO I = 1, N
          B(I,J) = A(I,J) + 1
        END DO
      END DO

The code above accesses arrays A and B in the inner loop I in a contiguous manner, which results in good performance.

Example 2:

      DO I = 1, N
        DO J = 1, N
          B(I,J) = A(I,J) + 1
        END DO
      END DO

The code above i
in the Intel® Fortran Language Reference.

The VECTOR ALWAYS and NOVECTOR Directives

The VECTOR ALWAYS directive overrides the efficiency heuristics of the vectorizer, but it only works if the loop can actually be vectorized; that is, use IVDEP to ignore assumed dependences.

The VECTOR ALWAYS directive can be used to override the default behavior of the compiler in the following situation. Vectorization of non-unit-stride references usually does not exhibit any speedup, so the compiler defaults to not vectorizing loops that have a large number of non-unit-stride references compared to the number of unit-stride references. The following loop has two references with stride 2. Vectorization would be disabled by default, but the directive overrides this behavior:

!DEC$ VECTOR ALWAYS
      do i = 1, 100, 2
        a(i) = b(i)
      enddo

If, on the other hand, avoiding vectorization of a loop is desirable (if vectorization results in a performance regression rather than an improvement), the NOVECTOR directive can be used in the source text to disable vectorization of a loop. For instance, the Intel® Compiler vectorizes the following example loop by default. If this behavior is not appropriate, the NOVECTOR directive can be used as shown below:

!DEC$ NOVECTOR
      do i = 1, 100
        a(i) = b(i) + c(i)
      enddo

For more details on these directives, see Directive Enhanced Compilation, section General Directives, in the Intel
function intrinsic, the operator operator, and the assignment must be the intrinsic function, operator, and assignment. This restriction applies to the ATOMIC directive. All references to storage location x must have the same type parameters.

In the following example, the collection of Y locations is updated atomically:

!$OMP ATOMIC
      Y = Y + B(I)

BARRIER Directive

To synchronize all threads within a parallel region, use the BARRIER directive. You can use this directive only within a parallel region defined by using the PARALLEL directive. You cannot use the BARRIER directive within the DO, PARALLEL DO, SECTIONS, PARALLEL SECTIONS, and SINGLE directives. When encountered, each thread waits at the BARRIER directive until all threads have reached the directive.

In the following example, the BARRIER directive ensures that all threads have executed the first loop and that it is safe to execute the second loop:

c$OMP PARALLEL
c$OMP DO PRIVATE(i)
      DO i = 1, 100
        b(i) = i
      END DO
c$OMP BARRIER
c$OMP DO PRIVATE(i)
      DO i = 1, 100
        a(i) = b(101-i)
      END DO
c$OMP END PARALLEL

CRITICAL and END CRITICAL

Use the CRITICAL and END CRITICAL directives to restrict access to a block of code, referred to as a critical section, to one thread at a time. A thread waits at the beginning of a critical section until no other thread in t
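The CRITICAL description can be illustrated with a minimal sketch (not from the original manual; the variable names are hypothetical):

c     Hypothetical example: each thread increments the shared
c     counter x inside a critical section, so only one thread
c     modifies x at a time.
c$OMP PARALLEL PRIVATE(i) SHARED(x)
c$OMP DO
      DO i = 1, 100
c$OMP CRITICAL
        x = x + 1
c$OMP END CRITICAL
      END DO
c$OMP END PARALLEL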
ne, 25
Improving Run-time Efficiency, 30
Using Intrinsics for Itanium-based Systems, 33
Coding Guidelines for Intel Architectures, 34
Analyzing and Timing Your Application, 37
    Using Intel Performance Analysis Tools, 37
    Timing Your Application, 38
Compiler Optimizations, 41
    Compiler Optimizations Overview, 41
    Optimizing the Compilation Process, 41
        Optimizing the Compilation Process Overview, 41
        Efficient Compilation, 41
        Little-endian to Big-endian Conversion, 45
        Default Compiler Optimizations, 48
        Using Compilation Options, 51
    Optimizing Different Application Types, 61
        Optimizing Different Application Types Overview, 61
        Setting Optimizations with -On Options, 62
        Restricting Optimization, 65
    Floating-point Arithmetic Optimizations, 66
        Options Used for Both IA-32 and Itanium Architectures, 66
        Floating-point Arithmetic Precision for IA-32 Systems, 69
        Floating-point Arithmetic Precision for Itanium®-based Systems, 70
        Improving/Restricting FP Arithmetic Precision, 71
floating-point division computations into multiplication by the reciprocal of the denominator. Use -prec_div to disable the floating-point division-to-multiplication optimization, resulting in more accurate division results. This option may have a speed impact.

-pc{32|64|80} Option

Use the -pc{32|64|80} option to enable floating-point significand precision control. Some floating-point algorithms created for specific IA-32 and Itanium®-based systems are sensitive to the accuracy of the significand, or fractional part, of the floating-point value. Use the appropriate version of the option to round the significand to the number of bits as follows:

   -pc32: 24 bits (single precision)
   -pc64: 53 bits (double precision)
   -pc80: 64 bits (extended precision)

The default version is -pc80, for full floating-point precision.

This option enables full optimization. Using this option does not have the negative performance impact of using the -mp option, because only the fractional part of the floating-point value is affected. The range of the exponent is not affected.

Note: This option only has an effect when the module being compiled contains the main program.

Caution: A change of the default precision control or rounding mode, for example, by using the -pc32 option or by user intervention, may affect the results returned by some of the mathematical functions.

Rounding Control: -rcd, -fp
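As a hedged illustration of how precision control would be selected (this command is not taken from the manual; the source file name is hypothetical):

```shell
# Round the significand to 53 bits (double precision) for the whole
# run; per the note above, effective only when this compilation
# contains the main program.
ifort -pc64 main.f90 -o app
```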
during program execution.

- For arrays where each array element contains a derived-type structure or record structure, the size of the array elements may cause some elements, but not the first, to start on an unaligned boundary.
- Even if the data items are naturally aligned within a derived-type structure without the SEQUENCE statement, or a record structure, the size of an array element might require use of the -align records option to supply needed padding to avoid some array elements being unaligned.
- If you specify -align norecords, or specify -vms without -align records, no padding bytes are added between array elements. If array elements each contain a derived-type structure with the SEQUENCE statement, array elements are packed without padding bytes, regardless of the Fortran command options specified. In this case, some elements will be unaligned.
- When the -align records option is in effect, the number of padding bytes added by the compiler for each array element depends on the size of the largest data item within the structure. The compiler determines the size of the array elements as an exact multiple of the largest data item in the derived-type structure without the SEQUENCE statement, or a record structure. The compiler then adds the appropriate number of padding bytes. For instance, if a structure contains an 8-byte floating-point number followed by a 3-byte char
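A minimal sketch of the padding rule described above (not from the manual; the type and field names are hypothetical):

      ! With -align records, each element of P is padded out to a
      ! multiple of the largest item (the 8-byte REAL), so every X
      ! stays naturally aligned across the array.
      TYPE POINT
        REAL(KIND=8)     :: X    ! 8 bytes, largest item
        CHARACTER(LEN=3) :: TAG  ! 3 bytes; 5 padding bytes follow
      END TYPE POINT
      TYPE(POINT) :: P(100)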
writing the most efficient programs without having to write assembly code. The following performance tools help you analyze your application and find and resolve problem areas:

- Intel® Debugger (IDB). The IDB debugger provides extensive support for debugging programs through a command line or graphical user interface.
- Intel® VTune(TM) Performance Analyzer. The VTune analyzer collects, analyzes, and provides Intel architecture-specific software performance data from the system-wide view down to a specific module, function, and instruction in your code. For information, see http://www.intel.com/software/products/vtune.
- Intel® Threading Tools. The Intel Threading Tools consist of the Intel Thread Checker and the Intel Thread Profiler. For general information, see http://www.intel.com/software/products/threadtool.htm.

Timing Your Application

One of the performance indicators is your application timing. Use the time command to provide information about program performance. The following considerations apply to timing your application:

- Run program timings when other users are not active. Your timing results can be affected by one or more CPU-intensive processes also running while doing your timings.
- Try to run the program under the same conditions each time to provide the most accurate results, especially when comparing execution times of a p
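For example (a generic sketch; the program name is hypothetical), the shell's time command reports elapsed and CPU time for a run:

```shell
# Time an application run; "real" is elapsed wall-clock time,
# "user" and "sys" are CPU time spent in user code and in the kernel.
time ./myprog
```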
alignments, or the -w0 option, which suppresses all warnings.

- During program execution, warning messages are issued for any data that is detected as unaligned. The message includes the address of the unaligned access. Consider the following run-time message:

   Unaligned access pid=24821 <a.out> va=140000154, pc=3ff80805d60, ra=1200017bc

This message shows that:

- The statement accessing the unaligned data (program counter) is located at 3ff80805d60.
- The unaligned data is located at address 140000154.

Ordering Data Declarations to Avoid Unaligned Data

For new programs, or when the source declarations of an existing program can be easily modified, plan the order of your data declarations carefully to ensure that the data items in a common block, derived-type structure, record structure, or data items made equivalent by an EQUIVALENCE statement will be naturally aligned.

Use the following rules to prevent unaligned data:

- Always define the largest-size numeric data items first.
- If your data includes a mixture of character and numeric data, place the numeric data first.
- Add small data items of the correct size, or padding, before otherwise unaligned data to ensure natural alignment for the data that follows.

When declaring data, consider using explicit length declarations, such as specifying a KIND parameter. For example, specify INTEGER(KIND=4) (or INTEGER*4) rather than INTEGER. If you do us
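The ordering rules above can be sketched as follows (a hypothetical common block, not from the manual):

      ! Largest items first: the 8-byte REAL, then the 4-byte
      ! INTEGER, with CHARACTER data last, so every numeric item
      ! falls on a naturally aligned boundary.
      REAL(KIND=8)     :: D
      INTEGER(KIND=4)  :: N
      CHARACTER(LEN=3) :: C
      COMMON /ALIGNED/ D, N, C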
the current source file. However, when you use -ipo to specify multifile IPO, the compiler performs inline function expansion for calls to procedures defined in separate files.

To disable the IPO optimizations, use the -O0 option.

Caution: The -ip and -ipo options can in some cases significantly increase compile time and code size.

Option -auto_ilp32 for Itanium-based Systems

On Itanium-based systems, the -auto_ilp32 option requires interprocedural analysis over the whole program. This optimization allows the compiler to use 32-bit pointers whenever possible, as long as the application does not exceed a 32-bit address space. Using the -auto_ilp32 option on programs that exceed the 32-bit address space might cause unpredictable results during program execution.

Because this optimization requires interprocedural analysis over the whole program, you must use the -auto_ilp32 option with the -ipo option. On Intel EM64T systems, -auto_ilp32 has no effect unless -xP or -axP is also specified.

IPO Compilation Model

For the topics in this section, the term IPO generally refers to multi-file IPO. When you use the -ipo option, the compiler collects information from individual program modules of a program. Using this information, the compiler performs optimizations across modules. In order to do this, the -ipo option is applied to both the compilation phase and the link phase.
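A hedged command-line sketch of the required combination (the file names are hypothetical):

```shell
# -auto_ilp32 needs whole-program analysis, so pair it with -ipo
# on both the compile and link steps.
ifort -c -ipo -auto_ilp32 a.f b.f
ifort -ipo -auto_ilp32 a.o b.o -o app
```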
currently uses only n = 0; any other value is a NOP.

Benefits and Limitations of Loop Unrolling

The benefits are:

- Unrolling eliminates branches and some of the code.
- Unrolling enables you to aggressively schedule (or pipeline) the loop to hide latencies, if you have enough free registers to keep variables live.
- The Intel Pentium 4 or Intel Xeon(TM) processors can correctly predict the exit branch for an inner loop that has 16 or fewer iterations, if that number of iterations is predictable and there are no conditional branches in the loop. Therefore, if the loop body size is not excessive and the probable number of iterations is known, unroll inner loops for the Pentium 4 or Intel Xeon processor until they have a maximum of 16 iterations, and for the Pentium III or Pentium II processors until they have a maximum of 4 iterations.

The potential cost: excessive unrolling, or unrolling of very large loops, can lead to increased code size.

For more information on how to optimize with -unroll[n], refer to the Intel® Pentium® 4 and Intel® Xeon(TM) Processor Optimization Reference Manual.

Memory Dependency with IVDEP Directive

For Itanium®-based applications, the -ivdep_parallel option indicates that there is absolutely no loop-carried memory dependency in the loop where the IVDEP directive is specified. This technique is useful for some sparse matrix applications. For example, the following loop requires -ivdep_parallel in addition to the directi
Auto-parallelization and vectorization can be combined for better performance results. For example, in the code below, TLP can be exploited in the outermost loop, while ILP can be exploited in the innermost loop:

      DO I = 1, 100     ! execute groups of iterations in different threads (TLP)
        DO J = 1, 32    ! execute in SIMD style with multimedia extension (ILP)
          A(J,I) = A(J,I) + 1
        ENDDO
      ENDDO

Auto-vectorization can help improve performance of an application that runs on systems based on the Pentium, Pentium with MMX(TM) technology, Pentium II, Pentium III, and Pentium 4 processors.

The following table lists the options that enable auto-vectorization, auto-parallelization, and OpenMP support.

Auto-vectorization (IA-32 only):

   -x{K|W|N|B|P}   Generates specialized code to run exclusively on processors with the extensions specified by {K|W|N|B|P}.

   -ax{K|W|N|B|P}   Generates, in a single binary, code specialized to the extensions specified by {K|W|N|B|P} and also generic IA-32 code. The generic code is usually slower.

   -vec_report{0|1|2|3|4|5}   Controls the diagnostic messages from the vectorizer (see the subsection that follows the table).

Auto-parallelization (IA-32 and Itanium architectures):

   -parallel   Enables the auto-parallelizer to generate multithreaded code for loops that can be safely executed in parallel. Default: OF
two loops:

   i = 1
   do while (i <= (n - mod(n,4)))
      ! Vector strip-mined loop
      a(i:i+3) = b(i:i+3) + c(i:i+3)
      i = i + 4
   end do
   do while (i <= n)
      a(i) = b(i) + c(i)   ! Scalar clean-up loop
      i = i + 1
   end do

Loop Blocking

It is possible to treat loop blocking as strip mining in two or more dimensions. Loop blocking is a useful technique for memory performance optimization. The main purpose of loop blocking is to eliminate as many cache misses as possible. This technique transforms the memory domain into smaller chunks, rather than sequentially traversing through the entire memory domain. Each chunk should be small enough to fit all the data for a given computation into the cache, thereby maximizing data reuse.

Consider the following example. The two-dimensional array A is referenced in the j (column) direction and then in the i (row) direction (column-major order); array B is referenced in the opposite manner (row-major order). Assume the memory layout is in column-major order; therefore, the access strides of array A and B for the code would be 1 and MAX, respectively. In example B, BS = block size; MAX must be evenly divisible by BS.

A. Original loop:

      REAL A(MAX,MAX), B(MAX,MAX)
      DO I = 1, MAX
        DO J = 1, MAX
          A(I,J) = A(I,J) + B(J,I)
        ENDDO
      ENDDO

B. Transformed loop after blocking:

      REAL A(MAX,MAX), B(MAX,MAX)
auto-parallelizer's diagnostic levels 0, 1, 2, or 3 as follows:

   -par_report0   No diagnostic information is displayed.
   -par_report1   Indicates loops successfully auto-parallelized (default). Issues a LOOP AUTO-PARALLELIZED message for parallel loops.
   -par_report2   Indicates successfully auto-parallelized loops as well as unsuccessful loops.
   -par_report3   Same as 2, plus additional information about any proven or assumed dependences inhibiting auto-parallelization (reasons for not parallelizing).

The following example shows an output generated by -par_report3 as a result of the command:

   ifort -c -parallel -par_report3 myprog.f90

where the program myprog.f90 is as follows:

      program myprog
      integer a(10000), q
C Assumed side effects
      do i = 1, 10000
        a(i) = foo(i)
      enddo
C Actual dependence
      do i = 1, 10000
        a(i) = a(i-1) + i
      enddo
      end

Example of -par_report Output:

   program myprog
   procedure: myprog
   serial loop: line 5: not a parallel candidate due to statement at line 6
   serial loop: line 9
   flow data dependence from line 10 to line 10, due to "a"
   12 Lines Compiled

Troubleshooting Tips

- Use -par_threshold0 to see if the compiler assumed there was not enough computational work.
- Use -par_report3 to view diagnostics.
- Use the !DEC$ PARALLEL directive to eliminate assumed data dependencies.
- Use -ipo to eliminate assumed side effects d
to the corresponding PRIVATE variables in the last iteration of the DO construct loop or the last SECTION construct.

   COPYPRIVATE(list)   Uses private variables in list to broadcast values, or pointers to shared objects, from one member of a team to the other members at the end of a SINGLE construct.

   NOWAIT   Specifies that threads need not wait at the end of worksharing constructs until they have completed execution. The threads may proceed past the end of the worksharing constructs as soon as there is no more work available for them to execute.

   SHARED(list)   Shares variables in list among all the threads in a team.

   DEFAULT(mode)   Determines the default data scope attributes of variables not explicitly specified by another clause. Possible values for mode are PRIVATE, SHARED, or NONE.

   REDUCTION({operator|intrinsic}:list)   Performs a reduction on variables that appear in list with the operator operator or the intrinsic procedure name intrinsic; operator is one of the following: +, *, -, .AND., .OR., .EQV., .NEQV.; intrinsic refers to one of the following: MAX, MIN, IAND, IOR, or IEOR.

   ORDERED / END ORDERED   Used in conjunction with a DO or SECTIONS construct to impose a serial order on the execution of a section of code. If ORDERED constructs are contained in the dynamic extent of the DO construct, the ORDERED clause must be present
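A minimal REDUCTION sketch (not from the original manual; the variable names are hypothetical):

c     Each thread accumulates a private partial sum; the partial
c     sums are combined with + into the shared variable total at
c     the end of the construct.
c$OMP PARALLEL DO REDUCTION(+:total)
      DO i = 1, n
        total = total + a(i)
      END DO
c$OMP END PARALLEL DO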
floating underflow occurs, the result is set to zero and execution continues. This is called abrupt underflow to 0. -fpe1 restricts only floating underflow:

- Floating overflow, floating divide-by-zero, and floating invalid produce exceptional values (NaN and signed Infinities) and execution continues.
- If a floating underflow occurs, the result is set to zero and execution continues.

The default is -fpe3 on both IA-32 and Itanium-based processors. This allows full floating-point exception behavior:

- Floating overflow, floating divide-by-zero, and floating invalid produce exceptional values (NaN and signed Infinities) and execution continues.
- Floating underflow is gradual: denormalized values are produced until the result becomes 0.

The -fpe{n} option only affects the Fortran main program. The floating-point exception behavior set by the Fortran main program is in effect throughout the execution of the entire program. If the main program is not Fortran, you can use the Fortran intrinsic FOR_SET_FPE to set the floating-point exception behavior.

When compiling different routines in a program separately, you should use the same value of n in -fpe{n}. For more information, refer to the Intel Fortran Compiler User's Guide for Linux Systems, Volume I, section Controlling Floating-point Exceptions.

Floating-point Arithmetic Precision for IA-32 Systems

-prec_div Option

The Intel® Fortran Compiler can change floati
of which the compiler takes advantage.

Strip-mining and Cleanup

Strip mining, also known as loop sectioning, is a loop transformation technique for enabling SIMD encodings of loops, as well as providing a means of improving memory performance. By fragmenting a large loop into smaller segments, or strips, this technique transforms the loop structure in two ways:

- It increases the temporal and spatial locality in the data cache if the data are reusable in different passes of an algorithm.
- It reduces the number of iterations of the loop by a factor of the length of each vector, or number of operations being performed per SIMD operation. In the case of Streaming SIMD Extensions, this vector or strip length is reduced by 4 times: four floating-point data items per single Streaming SIMD Extensions single-precision floating-point SIMD operation are processed.

First introduced for vectorizers, this technique consists of the generation of code where each vector operation is done for a size less than or equal to the maximum vector length on a given vector machine.

The compiler automatically strip-mines your loop and generates a cleanup loop.

Example of Strip Mining and Cleaning Up Loops

Before vectorization:

   i = 1
   do while (i <= n)
      a(i) = b(i) + c(i)   ! Original loop code
      i = i + 1
   end do

After vectorization, the vectorizer generates the following tw
profiling information, which you can use to identify those parts of your program where improving source code efficiency would most likely improve run-time performance. After you modify the appropriate source code, recompile the program and test the run-time performance.

   -tpp{n}   Optimizes your application's performance for specific Intel processors. See Targeting a Processor, -tpp{n}.

   -unroll[n]   Specifies the number of times a loop is unrolled (n) when specified with optimization level -O3. If you omit n in -unroll, the optimizer determines how many times loops can be unrolled.

Options That Slow Down the Run-time Performance

The table below lists options, in alphabetical order, that can slow down the run-time performance. Some applications that require floating-point exception handling or rounding might need to use the -fpe{n} dynamic option. Other applications might need to use the -assume dummy_aliases or -vms options for compatibility reasons. Other options that can slow down the run-time performance are primarily for troubleshooting or debugging purposes.

The following table lists the options that can slow down run-time performance:

   -assume dummy_aliases   Forces the compiler to assume that dummy (formal) arguments to procedures share memory locations with other dummy arguments or with variables shared through use associat
      sum = sum + a(k)*p(colidx(k))
   enddo
CDEC$ NOPREFETCH colidx
CDEC$ PREFETCH a:1:40
CDEC$ PREFETCH p:1:20
   do k = i+iresidue, rowstr(j+1)-8, 8
      sum = sum + a(k)*p(colidx(k)) &
          + a(k+1)*p(colidx(k+1)) + a(k+2)*p(colidx(k+2)) &
          + a(k+3)*p(colidx(k+3)) + a(k+4)*p(colidx(k+4)) &
          + a(k+5)*p(colidx(k+5)) + a(k+6)*p(colidx(k+6)) &
          + a(k+7)*p(colidx(k+7))
   enddo
   q(j) = sum
   enddo

For details, refer to the Intel Fortran Language Reference.

Parallel Programming with Intel Fortran

Parallelism: an Overview

This section discusses the three major features of parallel programming supported by the Intel Fortran compiler: OpenMP, auto-parallelization, and auto-vectorization. Each of these features contributes to application performance depending on the number of processors, the target architecture (IA-32 or Itanium architecture), and the nature of the application. The three features can be combined arbitrarily to contribute to application performance.

Parallel programming can be explicit, that is, defined by a programmer using OpenMP directives. Parallel programming can also be implicit, that is, detected automatically by the compiler. Implicit parallelism is exploited by either auto-parallelization of outermost loops or auto-vectorization of innermost
- Setting Data Type and Alignment
- Using Arrays Efficiently
- Improving I/O Performance
- Improving Run-time Efficiency
- Using Intrinsics for Itanium-based Systems
- Coding Guidelines for Intel Architectures

Analyzing and timing your application:

- Using Intel Performance Analysis Tools
- Timing Your Application

Implementing Intel Fortran Compiler optimizations:

- Optimizing the Compilation Process
- Efficient Compilation
- Stack Options for Automatic Allocation and Checking
- Alignment Options
- Symbol Visibility Attribute Options
- Options to Optimize Different Application Types
- Floating-point Arithmetic Optimizations
- Optimizing for Specific Processors
- Interprocedural Optimizations
- Profile-guided Optimizations
- High-level Language Optimizations (HLO)

Parallel programming with Intel Fortran:

- Auto-vectorization (IA-32 Only)
- Auto-parallelization
- Parallelization with OpenMP
- Debugging Multi-Threaded Programs

Optimization support features:

- Compiler Directives
- Optimizations and Debugging
- Optimizer Report Generation

For information on new features in this release, see the topic titled What's New in This Release in Volume I. Also refer to the product Release Notes.

How to Use This Document

This User's Guide explains how you can use the Intel® Fortran Compiler to enhance your application. The optimizations provided by the Intel Fortran Compiler
on completion of the tool's execution. You can generate the profile information for the whole application or a subset of it, then break the covered modules into different components and use the coverage tool to obtain the coverage information of each individual component. If only a subset of the application modules is compiled with the -prof_genx option, the coverage information is generated only for those modules that are compiled with this option, thus avoiding the overhead incurred by profile generation of other modules.

To specify the modules of interest, use the tool's -comp option. This option takes the name of a file as its argument. That file must be a text file that includes the names of the modules or directories you would like to analyze. Here is an example:

   codecov -prj Project_Name -comp component1

Note: Each line of the component file should include one and only one module name.

Any module of the application whose full path name contains an occurrence of any of the names in the component file is selected for coverage analysis. For example, if a line of the file component1 in the above example contains mod1.f90, then all modules in the application that have that name are selected. You can specify a particular module by giving more specific path information. For instance, if the line contains cmp1/mod1.f90, then only those modules named mod1.f90 under a cmp1 directory are selected.
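The substring-based selection rule above can be made concrete with a small sketch (the module paths and component entries here are invented for illustration):

```python
# Hypothetical illustration of how component-file entries select modules:
# a module matches when its full path contains one of the listed names.
def select_modules(module_paths, component_entries):
    return [p for p in module_paths
            if any(entry in p for entry in component_entries)]

mods = ["/src/cmp1/mod1.f90", "/src/cmp2/mod1.f90", "/src/cmp2/mod2.f90"]
# A bare file name matches every module with that name:
print(select_modules(mods, ["mod1.f90"]))
# Adding path context narrows the match to one copy:
print(select_modules(mods, ["cmp1/mod1.f90"]))
```

The first call selects both mod1.f90 files; the second selects only the one under cmp1, mirroring the cmp1/mod1.f90 example in the text.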
conversion occurs. On all other units, the input/output operations perform big-/little-endian conversion.

   F_UFMTENDIAN=10-20

Defines units 10, 11, 12, ..., 19, 20 for conversion purposes; on these units, the input/output operations perform big-/little-endian conversion.

Assume you set F_UFMTENDIAN=10,100 and run the following program:

   integer*4 cc4
   integer*8 cc8
   integer*4 c4
   integer*8 c8
   c4 = 456
   c8 = 789
C  prepare a little-endian representation of data
   open(11, file='lit.tmp', form='unformatted')
   write(11) c8
   write(11) c4
   close(11)
C  prepare a big-endian representation of data
   open(10, file='big.tmp', form='unformatted')
   write(10) c8
   write(10) c4
   close(10)
C  read big-endian data and operate with them on a
C  little-endian machine
   open(100, file='big.tmp', form='unformatted')
   read(100) cc8
   read(100) cc4
C  any operation with the data that have been read
   close(100)
   stop
   end

Now compare the lit.tmp and big.tmp files with the help of the od utility:

   > od -t x4 lit.tmp
   0000000 00000008 00000315 00000000 00000008
   0000020 00000004 000001c8 00000004
   0000034
   > od -t x4 big.tmp
   0000000 08000000 00000000 15030000 08000000
   0000020 04000000 c8010000 04000000
   0000034

You can see that the byte order is different in these files.

Default Compiler Optimizations
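The byte-order difference visible in the od output can be reproduced outside Fortran. This is an illustrative Python sketch, not part of the manual: it encodes the same 32-bit value 456 (hex 1C8) in both byte orders.

```python
import struct

# The same 32-bit integer 456 (0x000001C8) in the two byte orders,
# mirroring the difference between lit.tmp and big.tmp above.
little = struct.pack("<i", 456)   # little-endian, as stored in lit.tmp
big    = struct.pack(">i", 456)   # big-endian, as stored in big.tmp
print(little.hex())               # c8010000
print(big.hex())                  # 000001c8

# A READ with conversion enabled amounts to reversing the byte order:
assert struct.unpack(">i", little[::-1])[0] == 456
```

Note that c8010000 and 000001c8 are exactly the words shown for the c4 value in the od listings of lit.tmp and big.tmp.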
on the DO directive.

IF(scalar_logical_expression)  The enclosed parallel region is executed in parallel only if the scalar_logical_expression evaluates to .TRUE.; otherwise, the parallel region is serialized.

NUM_THREADS(scalar_integer_expression)  Requests the number of threads specified by scalar_integer_expression for the parallel region.

SCHEDULE(type[,chunk])  Specifies how iterations of the DO construct are divided among the threads of the team. Possible values for the type argument are STATIC, DYNAMIC, GUIDED, and RUNTIME. The optional chunk argument must be a positive scalar integer expression.

COPYIN(list)  Specifies that the master thread's data values be copied to the THREADPRIVATE copies of the common blocks or variables specified in list at the beginning of the parallel region.

Directives and Clauses Cross-reference

Directive: PARALLEL, END PARALLEL
Uses these clauses: COPYIN, DEFAULT, PRIVATE, FIRSTPRIVATE, REDUCTION, SHARED

Directive: DO, END DO
Uses these clauses: PRIVATE, FIRSTPRIVATE, LASTPRIVATE, REDUCTION, SCHEDULE

The table continues for the remaining directives: SECTIONS, END SECTIONS, SECTION, SINGLE, END SINGLE, PARALLEL DO, END PARALLEL DO, PARALLEL SECTIONS, END PARALLEL SECTIONS, MASTER, END MASTER, CRITICAL[(lock)], END CRITICAL[(lock)], BARRIER, ATOMIC, FLUSH[(list)], ORDERED, END ORDERED, and THREADPRIVATE(list), with clauses including PRIVA
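The STATIC schedule type described for the SCHEDULE clause hands out fixed-size chunks round-robin. This Python sketch (an illustration only; the OpenMP runtime does this internally) shows how a STATIC schedule with a chunk size maps iterations to threads:

```python
# Sketch of SCHEDULE(STATIC, chunk): chunks of iterations are assigned
# to threads in a fixed round-robin pattern, known before the loop runs.
def static_schedule(n_iters, n_threads, chunk):
    owner = {}
    for c, start in enumerate(range(0, n_iters, chunk)):
        for i in range(start, min(start + chunk, n_iters)):
            owner[i] = c % n_threads   # chunk c goes to thread c mod n_threads
    return owner

owners = static_schedule(10, 3, 2)
print(owners[0], owners[2], owners[4])  # 0 1 2 : successive chunks rotate among threads
```

With DYNAMIC, by contrast, each thread would grab the next available chunk at run time, so the mapping is not fixed in advance.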
on updates the version of the object it had before the construct. Subobjects that are not assigned a value by the last iteration of the DO loop or by the lexically last SECTION directive are undefined after the construct.

Correct execution sometimes depends on the value that the last iteration of a loop assigns to a variable. You must list all such variables as arguments to a LASTPRIVATE clause so that the values of the variables are the same as when the loop is executed sequentially. As shown in the following example, the value of I at the end of the parallel region is equal to N+1, as it would be with sequential execution.

   !$OMP PARALLEL
   !$OMP DO LASTPRIVATE(I)
   DO I=1,N
      A(I) = B(I) + C(I)
   END DO
   !$OMP END PARALLEL
   CALL REVERSE(I)

REDUCTION Clause

Use the REDUCTION clause on the PARALLEL, DO, SECTIONS, PARALLEL DO, and PARALLEL SECTIONS directives to perform a reduction on the specified variables by using an operator or intrinsic, as shown:

   REDUCTION ( { operator | intrinsic } : list )

Operator can be one of the following: +, *, -, .AND., .OR., .EQV., or .NEQV. Intrinsic can be one of the following: MAX, MIN, IAND, IOR, or IEOR.

The specified variables must be named scalar variables of intrinsic type and must be SHARED in the enclosing context. A private copy of each specified variable is created for
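The semantics of REDUCTION(+:SUM) can be sketched outside OpenMP: each thread accumulates into a private copy, and the private copies are combined with the reduction operator at the end. This Python illustration (not from the manual) simulates the thread-private copies sequentially:

```python
# Sketch of REDUCTION(+:sum) semantics: one private accumulator per
# thread, combined with the reduction operator when the region ends.
def parallel_sum(values, n_threads=4):
    private = [0] * n_threads                # private copy per thread, initialized for +
    for t in range(n_threads):
        for v in values[t::n_threads]:       # this thread's share of the iterations
            private[t] += v
    return sum(private)                      # combine the copies into the shared variable

assert parallel_sum(list(range(1, 101))) == 5050
```

Because addition is associative, the combined result matches the sequential sum regardless of how the iterations are split among threads.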
one to function calls.

Parallelization with OpenMP

Parallelization with OpenMP: Overview

The Intel Fortran Compiler supports the OpenMP Fortran version 2.0 API specification, except for the WORKSHARE directive. OpenMP provides symmetric multiprocessing (SMP) with the following major features:

- It relieves the user from having to deal with the low-level details of iteration-space partitioning, data sharing, and thread scheduling and synchronization.
- It provides the benefit of the performance available from shared-memory multiprocessor systems and, for IA-32 systems, from Hyper-Threading Technology-enabled systems (for Hyper-Threading Technology, refer to the IA-32 Intel Architecture Optimization Reference Manual).

The Intel Fortran Compiler performs transformations to generate multithreaded code based on the user's placement of OpenMP directives in the source program, making it easy to add threading to existing software. The Intel compiler supports all of the current industry-standard OpenMP directives, except WORKSHARE, and compiles parallel programs annotated with OpenMP directives. In addition, the Intel Fortran Compiler provides Intel-specific extensions to the OpenMP Fortran version 2.0 specification, including run-time library routines and environment variables.

See the parallelization options summary for all options of the OpenMP feature in the Intel Fortran Compiler.
conforms more closely to the ANSI and IEEE standards. This option causes more frequent stores to memory, or disallows some data from being register candidates altogether. The Intel architecture normally maintains floating-point results in registers. These registers are 80 bits long and maintain greater precision than a double-precision number. When the results have to be stored to memory, rounding occurs. This can affect accuracy, moving results toward the expected result, but at a cost in speed. The -pc{32|64|80} option (IA-32 only) can be used to control floating-point accuracy and rounding, along with setting various processor IEEE flags.

For most programs, specifying the -mp option adversely affects performance. If you are not sure whether your application needs this option, try compiling and running your program both with and without it to evaluate the effects on performance versus precision.

Specifying this option has the following effects on program compilation:

- On IA-32 systems, user variables declared as floating-point types are not assigned to registers.
- On Itanium-based systems, floating-point user variables may be assigned to registers. The expressions are evaluated using the precision of the source operands. The compiler does not use the Floating-point Multiply and Add (FMA) instruction to contract multiply and add/subtract operations into a single operation. The contractions can be enabled by using the -IPF_fma option.
Many stylistic issues that prevent automatic vectorization by compilers are found in loop structures. The ambiguity arises from the complexity of the keywords, operators, data references, and memory operations within the loop bodies. However, by understanding these limitations and by knowing how to interpret diagnostic messages, you can modify your program to overcome the known limitations and enable effective vectorization. The following sections summarize the capabilities and restrictions of the vectorizer with respect to loop structures.

Data Dependence

Data dependence relations represent the required ordering constraints on the operations in serial loops. Because vectorization rearranges the order in which operations are executed, any auto-vectorizer must have at its disposal some form of data dependence analysis. An example where data dependences prohibit vectorization is shown below. In this example, the value of each element of an array is dependent on the value of its neighbor that was computed in the previous iteration.

Example of Data-dependent Loop

   REAL DATA(0:N)
   INTEGER I
   DO I=1, N-1
      DATA(I) = DATA(I-1)*0.25 + DATA(I)*0.5 + DATA(I+1)*0.25
   END DO

The loop in the above example is not vectorizable because the WRITE to the current element DATA(I) is dependent on the use of the preceding element DATA(I-1).
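The effect of this dependence can be demonstrated concretely. The Python sketch below (an illustration, not from the manual) runs the same smoothing update two ways: serially, where each iteration sees the value DATA(I-1) just written, and "vector-like", where every iteration reads the original values. The two orderings give different answers, which is exactly why the vectorizer must refuse the loop:

```python
# Serial execution: DATA(I-1) has already been updated in this pass.
def serial(data):
    d = data[:]
    for i in range(1, len(d) - 1):
        d[i] = 0.25 * d[i - 1] + 0.5 * d[i] + 0.25 * d[i + 1]
    return d

# Vector-like execution: all reads see the original (pre-loop) values.
def vector_like(data):
    d = data[:]
    return [d[0]] + [0.25 * d[i - 1] + 0.5 * d[i] + 0.25 * d[i + 1]
                     for i in range(1, len(d) - 1)] + [d[-1]]

sample = [1.0, 2.0, 4.0, 8.0]
print(serial(sample))       # [1.0, 2.25, 4.5625, 8.0]
print(vector_like(sample))  # [1.0, 2.25, 4.5, 8.0]  -- differs at index 2
```

A loop with no such read-after-write relation between iterations would produce identical results under both orderings and could be vectorized safely.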
demonstrates the use of the SECTIONS directive. The logic is identical to the preceding DO example, but uses SECTIONS instead of DO. Here the speedup is limited to 2 because there are only two units of work, whereas in DO: Two Difference Operators above there are (n-1) + (m-1) units of work.

   subroutine sections_1(a, b, c, d, m, n)
   real a(n,n), b(n,n), c(m,m), d(m,m)
   !$omp parallel
   !$omp& shared(a,b,c,d,m,n)
   !$omp& private(i,j)
   !$omp sections
   !$omp section
   do i = 2, n
      do j = 1, i
         b(j,i) = ( a(j,i) + a(j,i-1) ) / 2
      enddo
   enddo
   !$omp section
   do i = 2, m
      do j = 1, i
         d(j,i) = ( c(j,i) + c(j,i-1) ) / 2
      enddo
   enddo
   !$omp end sections nowait
   !$omp end parallel
   end

SINGLE: Updating a Shared Scalar

This example demonstrates how to use a SINGLE construct to update an element of the shared array a. The optional NOWAIT after the first loop is omitted because it is necessary to wait at the end of that loop before proceeding into the SINGLE construct.

   subroutine sp_1a(a, b, n)
   real a(n), b(n)
   !$omp parallel
   !$omp& shared(a,b,n)
   !$omp& private(i)
   !$omp do
   do i = 1, n
      a(i) = 1.0 / a(i)
   enddo
   !$omp single
   a(1) = min( a(1), 1.0 )
   !$omp end single
   !$omp do
   do i = 1, n
      b(i) = b(i) / a(i)
   enddo
   !$omp end do nowait
   !$omp end parallel
   end

Debugging Multithreaded Programs

Debugging Multithreaded Programs: Overview
construct at which the FIRSTPRIVATE clause is specified.

- Update the global copy of a PRIVATE variable at the end of a parallel region. However, the LASTPRIVATE clause of a DO directive enables updating the global copy from the team member that executed serially the last iteration of the loop.

In addition to shared and PRIVATE variables, individual variables and entire COMMON blocks can be privatized using the THREADPRIVATE directive.

Orphaned Directives

OpenMP contains a feature called orphaning, which dramatically increases the expressiveness of parallel directives. Orphaning is a situation in which directives related to a parallel region are not required to occur lexically within a single program unit. Directives such as CRITICAL, BARRIER, SECTIONS, SINGLE, MASTER, and DO can occur by themselves in a program unit, dynamically binding to the enclosing parallel region at run time.

Orphaned directives enable parallelism to be inserted into existing code with a minimum of code restructuring. Orphaning can also improve performance by enabling a single parallel region to bind with multiple DO directives located within called subroutines. Consider the following code segment:

   !$omp parallel
   call phase1
   call phase2
   !$omp end parallel

   subroutine phase1
   !$omp do private(i) shared(n)
   do i = 1, n
      call some_work(i)
   end do
   !$omp end do
   end
The RECL value unit for formatted files is always 1-byte units. For unformatted files, the RECL unit is 4-byte units, unless you specify the -assume byterecl option to request 1-byte units (see -assume byterecl).

Use the Optimal Record Type

Unless a certain record type is needed for portability reasons, choose the most efficient type, as follows:

- For sequential files of a consistent record size, the fixed-length record type gives the best performance.
- For sequential unformatted files when records are not fixed in size, the variable-length record type gives the best performance, particularly for BACKSPACE operations.
- For sequential formatted files when records are not fixed in size, the Stream_LF record type gives the best performance.

Reading from a Redirected Standard Input File

Due to certain precautions that the Fortran run-time system takes to ensure the integrity of standard input, reads can be very slow when standard input is redirected from a file. For example, when you use a command such as myprogram.exe < myinput.data, the data is read using the READ(*) or READ(5) statement and performance is degraded. To avoid this problem, do one of the following:

- Explicitly open the file using the OPEN statement. For example:

   open(5, STATUS='OLD', FILE='myinput.dat')

- Use an environment variable to specify the input file.

To take advantage of these methods,
following example, only the master thread executes the routines OUTPUT and INPUT:

   !$OMP PARALLEL DEFAULT(SHARED)
   CALL WORK(X)
   !$OMP MASTER
   CALL OUTPUT(X)
   CALL INPUT(Y)
   !$OMP END MASTER
   CALL WORK(Y)
   !$OMP END PARALLEL

ORDERED and END ORDERED

Use the ORDERED and END ORDERED directives within a DO construct to allow work within an ordered section to execute sequentially while allowing work outside the section to execute in parallel. When you use the ORDERED directive, you must also specify the ORDERED clause on the DO directive. Only one thread at a time is allowed to enter the ordered section, and then only in the order of loop iterations. In the following example, the code prints out the indexes in sequential order:

   !$OMP DO ORDERED, SCHEDULE(DYNAMIC)
   DO I=LB, UB, ST
      CALL WORK(I)
   END DO

   SUBROUTINE WORK(K)
   !$OMP ORDERED
   WRITE(*,*) K
   !$OMP END ORDERED

THREADPRIVATE Directive

You can make named common blocks private to a thread, but global within the thread, by using the THREADPRIVATE directive. Each thread gets its own copy of the common block, with the result that data written to the common block by one thread is not directly visible to other threads. During serial portions and MASTER sections of the program, accesses are to the master thread's copy of the common block.
piler performs some optimizations by default, unless you turn them off by corresponding command-line options. Additional optimizations can be enabled or disabled using command options.

-align keyword  Analyzes and reorders memory layout for variables and arrays. Controls whether padding bytes are added between data items within common blocks, derived-type data, and record structures to make the data items naturally aligned.

-ax{K|W|N|B|P} (IA-32 and Intel® Extended Memory 64 Technology (Intel® EM64T) systems only)  Optimizes your application's performance for specific processors. Regardless of which -ax suboption you choose, your application is optimized to use all the benefits of that processor, with the resulting binary file capable of being run on any Intel IA-32 processor.

-fast  Enables a collection of optimizations for run-time performance.

-O1  Optimizes to favor code size and code locality. See Setting Optimizations with -On Options.

-O2  Optimizes for code speed; sets performance-related options. See Setting Optimizations with -On Options.

-O3  Activates loop transformation optimizations. See Setting Optimizations with -On Options.

-openmp  Enables the parallelizer to generate multithreaded code based on the OpenMP directives.

-parallel  Enables the auto-parallelizer to generate multithreaded code for loops that can be safely executed in parallel.

-qp  Requests profiling information.
Example of Incorrect Usage for Non-Countable Loop

   ! Number of iterations is dependent on A(I)
   SUBROUTINE FOO(A, B, C)
   DIMENSION A(100), B(100), C(100)
   INTEGER I
   I = 1
   DO WHILE (A(I) .GT. 0.0)
      A(I) = B(I) * C(I)
      I = I + 1
   END DO
   RETURN
   END

Types of Loop Vectorized

For integer loops, the 64-bit MMX(TM) technology and 128-bit Streaming SIMD Extensions 2 (SSE2) provide SIMD instructions for most arithmetic and logical operators on 32-bit, 16-bit, and 8-bit integer data types. Vectorization may proceed if the final precision of integer wrap-around arithmetic will be preserved. A 32-bit shift-right operator, for instance, is not vectorized in 16-bit mode if the final stored value is a 16-bit integer. Because the MMX(TM) and SSE2 instruction sets are not fully orthogonal (shifts on byte operands, for instance, are not supported), not all integer operations can actually be vectorized.

For loops that operate on 32-bit single-precision and 64-bit double-precision floating-point numbers, SSE/SSE2 provides SIMD instructions for the arithmetic operators +, -, *, and /. In addition, SSE/SSE2 provides SIMD instructions for the binary MIN and MAX and unary SQRT operators. SIMD versions of several other mathematical operators (like the trigonometric functions SIN, COS, and TAN) are supported in software in a vector mathematical run-time library that is provided with the Intel® Fortran Compiler.
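The defect in the Fortran example above is that the trip count depends on values the loop itself rewrites, so it cannot be known before the loop starts. This Python sketch (an illustration, not from the manual) shows the countable rewrite: compute the trip count first, then run a counted loop over it. This is valid here because each exit test reads an element the loop has not yet written.

```python
# Make the loop countable: fix the trip count before doing the work.
def first_nonpositive(a):
    n = 0
    while n < len(a) and a[n] > 0.0:   # exit condition evaluated up front
        n += 1
    return n

def foo(a, b, c):
    n = first_nonpositive(a)           # trip count now known before the loop
    for i in range(n):                 # a counted loop like this is vectorizable
        a[i] = b[i] * c[i]
    return a

print(foo([1.0, 2.0, -1.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]))  # [10.0, 18.0, -1.0]
```

The counted loop has a trip count (n) fixed at entry, which is precisely the property the vectorizer needs.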
splitting routines into different sections: one section to contain the cold, or very infrequently executed, code, and one section to contain the rest of the code (hot code). You can use -fnsplit- to disable function splitting for the following reasons:

- Most importantly, to get improved debugging capability. In the debug symbol table, it is difficult to represent a split routine, that is, a routine with some of its code in the hot code section and some of its code in the cold code section. The -fnsplit- option disables the splitting within a routine, but enables function grouping, an optimization in which entire routines are placed either in the cold code section or the hot code section. Function grouping does not degrade debugging capability.
- Another reason can arise when the profile data does not represent the actual program behavior, that is, when the routine is actually used frequently rather than infrequently.

Note: For Itanium-based applications, if you intend to use the -prof_use option with optimizations at the -O3 level, the -O3 option must be on. If you intend to use the -prof_use option with optimizations at the -O2 level or lower, you can generate the profile data with the default options.

See an example of using PGO.

Advanced PGO Options

The options controlling advanced PGO optimizations are:

- -prof_dir dirname
- -prof_file filename

Use the -prof_dir dirname option to specify the directory in which the profile information files are created.
options:

   75.00   25.00   100.00   50.00

In this case, the results indicate that running all tests sequentially would require one hour, 45 minutes, and 35 seconds, while the selected tests would achieve the same total block coverage in only 41 minutes.

Note: The order of tests when prioritization is based on minimizing time (first Test2, then Test3) could be different than when prioritization is done based on minimizing the number of tests (see the example above: first Test3, then Test2). In Example 2, Test2 is the test that gives the highest coverage per execution time, so it is picked as the first test to run.

Using Other Options

The -cutoff option enables the test prioritization tool to exit when it reaches a given level of basic block coverage:

   tselect -dpi_list tests_list -spi pgopti.spi -cutoff 85.00

If the tool is run with the cutoff value of 85.00, as in the above example, only Test3 will be selected, as it achieves 45.65% block coverage, which corresponds to 87.50% of the total block coverage that is reached from all three tests.

The test prioritization tool does an initial merging of all the profile information to figure out the total coverage that is obtained by running all the tests. The -nototal option enables you to skip this step. In such a case, only the absolute coverage information will be reported, as the overall coverage remains unknown.
computing the dot product, and loops with mixed precision types. Similarly, the compiler does not enable certain loop transformations. For example, the compiler does not transform reduction loops to perform partial summation or loop interchange.

Optimizing for Specific Processors

Optimizing for Specific Processors Overview

This section describes the options for targeting a processor, and for processor dispatch and extensions support. The options -tpp{5|6|7} optimize for the IA-32 processors, and the options -tpp{1|2} optimize for the Itanium® processor family. The options -x{K|W|N|B|P} and -ax{K|W|N|B|P} generate code that is specific to processor-instruction extensions. Note that you can run your application on the latest processor-based systems, like the Intel® Pentium® M processor or the Intel® Pentium® 4 processor with Streaming SIMD Extensions 3 (SSE3) instruction support, and still gear your code to any of the previous processors specified by the N, W, or K versions of the -x and -ax options.

Targeting a Processor, -tpp{n}

The -tpp{n} option optimizes your application's performance for specific Intel processors. This option generates code that is tuned for the processor associated with its version. For example, -tpp7 generates code optimized for running on the Intel Pentium 4, Intel Xeon(TM), and Intel Pentium M processors, and on Intel Pentium 4 processors with Streaming SIMD Extensions 3 (SSE3) instruction support, and -tpp2 generates code optimized
for floating-point values, as it forces operations to be carried out in memory rather than in registers, which in turn causes more frequent rounding of your results.

-vms  Controls certain VMS-related run-time defaults, including alignment. If you specify the -vms option, you may need to also specify the -align records option to obtain optimal run-time performance.

Little-endian-to-Big-endian Conversion

The Intel Fortran Compiler can write unformatted sequential files in big-endian format, and can also read files produced in big-endian format, by using the little-endian-to-big-endian conversion feature. On both IA-32-based processors and Itanium-based processors, Intel Fortran handles internal data in little-endian format. The little-endian-to-big-endian conversion feature is intended for Fortran unformatted input/output operations on unformatted sequential files. The feature enables:

- processing of files developed on processors that accept the big-endian data format
- producing big-endian files for such processors on little-endian systems

The little-endian-to-big-endian conversion is accomplished by the following operations:

- The WRITE operation converts little-endian format to big-endian format.
- The READ operation converts big-endian format to little-endian format.

The feature enables the conversion of variables and arrays (or array subscripts) of basic data types. Derived data types are not supported.
- The SECTIONS and END SECTIONS directives specify parallel execution for arbitrary blocks of sequential code. Each SECTION is executed once by a thread in the team.
- The SINGLE and END SINGLE directives define a section of code where exactly one thread is allowed to execute the code; threads not chosen to execute this section ignore the code.

Combined Parallel Worksharing Constructs

The combined parallel worksharing constructs provide an abbreviated way to specify a parallel region that contains a single worksharing construct. The combined parallel worksharing constructs are:

- PARALLEL DO and END PARALLEL DO
- PARALLEL SECTIONS and END PARALLEL SECTIONS

Synchronization and MASTER

Synchronization is the interthread communication that ensures the consistency of shared data and coordinates parallel execution among threads. Shared data is consistent within a team of threads when all threads obtain the identical value when the data is accessed. A synchronization construct is used to ensure this consistency of the shared data.

- The OpenMP synchronization directives are CRITICAL, ORDERED, ATOMIC, FLUSH, and BARRIER.
- Within a parallel region or a worksharing construct, only one thread at a time is allowed to execute the code within a CRITICAL construct.
- The ORDERED directive is used in conjunction with a DO or SECTIONS construct to impose a serial order on the execution of a section of code.
unreachable code  Instructions that are never executed.

unused code  Instructions that produce results that are not used in the program.

variable renaming  An optimization in which the compiler renames instances of a variable that refer to distinct entities.
directs the compiler to assume the following:

- Arrays are not accessed out of array bounds.
- Pointers are not cast to non-pointer types, and vice versa.
- References to objects of two different scalar types cannot alias. For example, an object of type INTEGER cannot alias with an object of type REAL, and an object of type REAL cannot alias with an object of type DOUBLE PRECISION.

If your program satisfies the above conditions, setting the -ansi_alias option will help the compiler better optimize the program. However, if your program may not satisfy one of the above conditions, the option must be disabled, as it can lead the compiler to generate incorrect code. The synonym of -ansi_alias is -assume nodummy_aliases.

Alignment Options

-align recNbyte or -Zp{n}

Use the -align recNbyte or -Zp{n} option to specify the alignment constraint for structures on n-byte boundaries, where n = 1, 2, 4, 8, or 16 with -Zp{n}. When you specify this option, each structure member after the first is stored on either the size of the member type or n-byte boundaries, where n = 1, 2, 4, 8, or 16, whichever is smaller.

For example, to specify 2 bytes as the packing boundary, or alignment constraint, for all structures and unions in the file prog1.f, use the following command:

   ifort -Zp2 prog1.f

The default for IA-32 and Itanium-based systems is -align rec8byte, or -Zp8. The -Zp16 option enables you to align Fortran structures such as common blocks.
region if the worksharing directive is to execute in parallel. No new threads are launched, and there is no implied barrier on entry to a worksharing construct. The worksharing constructs are:

- DO and END DO directives
- SECTIONS, SECTION, and END SECTIONS directives
- SINGLE and END SINGLE directives

DO and END DO

The DO directive specifies that the iterations of the immediately following DO loop must be dispatched across the team of threads so that each iteration is executed by a single thread. The loop that follows a DO directive cannot be a DO WHILE or a DO loop that does not have loop control. The iterations of the DO loop are dispatched among the existing team of threads.

The DO directive optionally lets you:

- Control data scope attributes (see Controlling Data Scope Attributes)
- Use the SCHEDULE clause to specify schedule type and chunk size (see Specifying Schedule Type and Chunk Size)

Clauses Used

The clauses for the DO directive specify:

- Whether variables are PRIVATE, FIRSTPRIVATE, LASTPRIVATE, or REDUCTION
- How loop iterations are SCHEDULEd onto threads

In addition, the ORDERED clause must be specified if the ORDERED directive appears in the dynamic extent of the DO directive. If you do not specify the optional NOWAIT clause on the END DO directive, threads synchronize at the END DO directive. If you specify NOWAIT, threads do not synchronize, and threads that finish early proceed directly to the instructions following
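As an illustration, the following sketch (array names and bounds are hypothetical, not from this guide) shows a worksharing DO inside a parallel region; NOWAIT on END DO lets threads continue without synchronizing:

```fortran
!$OMP PARALLEL SHARED(A,B,C,N) PRIVATE(I)
!$OMP DO SCHEDULE(STATIC)
      DO I = 1, N
         A(I) = B(I) + C(I)   ! each iteration runs on exactly one thread
      END DO
!$OMP END DO NOWAIT           ! no barrier: early finishers proceed
!$OMP END PARALLEL
```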
revious version of the same program. Use the same CPU, system model, amount of memory, version of the operating system, and so on, if possible.

- If you do need to change systems, you should measure the time using the same version of the program on both systems, so you know each system's effect on your timings.
- For programs that run for less than a few seconds, run several timings to ensure that the results are not misleading. Overhead functions like loading shared libraries might influence short timings considerably.

Use the time command and specify the name of the executable program to provide the following:

- The elapsed, real, or "wall clock" time, which will be greater than the total charged actual CPU time.
- Charged actual CPU time, shown for both system and user execution. The total actual CPU time is the sum of the actual user CPU time and actual system CPU time.

Example

In the following example timings, the sample program being timed displays the following line:

Average of all the numbers is: 4368488960.000000

Using the Bourne shell, the following program timing reports that the program uses 1.19 seconds of total actual CPU time (0.61 seconds in actual CPU time for user program use and 0.58 seconds of actual CPU time for system use) and 2.46 seconds of elapsed time:

$ time a.out
Average of all the numbers is: 4368488960.000000

real 0m2.46s
user 0m0.61s
sys  0m0.58s

Us
rmance. This is done using the default heuristics. The inlining heuristics used by the compiler differ based on whether you use profile-guided optimizations (-prof_use) or not.

When you use profile-guided optimizations with -ip or -ipo, the compiler uses the following heuristics:

- The default heuristic focuses on the most frequently executed call sites, based on the profile information gathered for the program.
- By default, the compiler does not inline functions with more than 230 intermediate statements. You can change this value by specifying the option -Qoption,f,-ip_ninl_max_stats=new_value.
- The default inline heuristic will stop inlining when direct recursion is detected.
- The default heuristic always inlines very small functions that meet the minimum inline criteria:
  - Default for Itanium-based applications: -ip_ninl_min_stats=15
  - Default for IA-32 applications: -ip_ninl_min_stats=7
  These limits can be modified with the option -Qoption,f,-ip_ninl_min_stats=new_value. See Qoption Specifiers and Profile-Guided Optimization (PGO).

When you do not use profile-guided optimizations with -ip or -ipo, the compiler uses less aggressive inlining heuristics: it inlines a function if the inline expansion does not increase the size of the final program.

Inlining and Preemption

Preemption of a function means that the code which implements that function at run time is replaced by different code. When a fun
s be written using unbuffered writes, enable buffered writes by a method described above.

- Especially with large files, increasing the BLOCKSIZE value increases the size of the block sent on the network and affects how often network data blocks get sent.
- Time the application when using different BLOCKSIZE values under similar conditions to find the optimal network block size.

When writing records, be aware that I/O records are written to unified buffer cache (UBC) system buffers. To request that I/O records be written from program buffers to the UBC system buffers, use the FLUSH library routine (see the Intel Fortran Libraries Reference). Be aware that calling FLUSH also discards read-ahead data in the user buffer.

Specify RECL

The sum of the record length (RECL specifier in an OPEN statement) and its overhead is a multiple or divisor of the blocksize, which is device-specific. For example, if the BLOCKSIZE is 8192, then RECL might be 24576 (a multiple of 3) or 1024 (a divisor of 8).

The RECL value should fill blocks as close to capacity as possible, but not over capacity. Such values allow efficient moves, with each operation moving as much data as possible; the least amount of space in the block is wasted. Avoid using values larger than the block capacity, because they create very inefficient moves for the excess data; only slightly filling a block, allocating extra memory for the buffer, and writing partial blocks are inefficient.
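As a hedged sketch (the unit number and file name are hypothetical), an OPEN that pairs an 8192-byte BLOCKSIZE with a RECL chosen as a divisor of the block size, per the guideline above:

```fortran
! Sketch: RECL=1024 divides BLOCKSIZE=8192, so records fill blocks
! close to capacity with little wasted space.
OPEN (UNIT=10, FILE='data.out', FORM='UNFORMATTED', &
      ACCESS='SEQUENTIAL', BLOCKSIZE=8192, RECL=1024)
```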
scalar to make the locals automatic. This is the default behavior of the Intel Fortran Compiler when -openmp is used. Avoid using the -save option, which inhibits stack allocation of local variables. By default, automatic local scalar variables become shared across threads, so you may need to add synchronization code to ensure proper access by threads.

Analyze

Analysis includes the following major actions:

- Profile the program to find out where it spends most of its time. This is the part of the program that benefits most from parallelization efforts. This stage can be accomplished using the VTune(TM) analyzer or basic PGO options.
- Wherever the program contains nested loops, choose the outermost loop, which has very few cross-iteration dependencies.

Restructure

To restructure your program for successful OpenMP implementation, you can perform some or all of the following actions:

1. If a chosen loop is able to execute iterations in parallel, introduce a PARALLEL DO construct around this loop.
2. Try to remove any cross-iteration dependencies by rewriting the algorithm.
3. Synchronize the remaining cross-iteration dependencies by placing CRITICAL constructs around the uses and assignments to variables involved in the dependencies.
4. List the variables that are present in the loop within appropriate SHARED, PRIVATE, LASTPRIVATE, FIRSTPRIVATE, or REDUCTION clauses.
5. List the DO index
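The restructuring steps above can be sketched as follows (array names and the dependency are hypothetical): a PARALLEL DO with scoping clauses, where a cross-iteration sum is handled by a REDUCTION clause instead of explicit synchronization:

```fortran
!$OMP PARALLEL DO SHARED(A,B,N) PRIVATE(I,TMP) REDUCTION(+:TOTAL)
      DO I = 1, N
         TMP = A(I) * B(I)    ! independent work per iteration
         TOTAL = TOTAL + TMP  ! cross-iteration sum handled by REDUCTION
      END DO
!$OMP END PARALLEL DO
```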
se the -g option to generate symbolic debugging information and line numbers in the object code, for all routines in the program, for use by a source-level debugger. The main file created in the third command above contains symbolic debugging information as well.

During the later stages of program development, you should specify multiple source files together and use an optimization level of at least -O2 (the default) to allow more optimizations to occur. For instance, the following command compiles all three source files together, using the default level of optimization, -O2:

ifort -o main main.f90 sub2.f90 sub3.f90

Compiling multiple source files lets the compiler examine more code for possible optimizations, which results in:

- Inlining more procedures
- More complete data flow analysis
- Reducing the number of external references to be resolved during linking

For very large programs, compiling all source files together may not be practical. In such instances, consider compiling source files containing related routines together using multiple ifort commands, rather than compiling source files individually.

Options That Improve Run-Time Performance

The table below lists the options, in alphabetical order, that can directly improve run-time performance. Most of these options do not affect the accuracy of the results, while others improve run-time performance but can change some numeric results. The Intel Fortran Com
se2
!$omp do private(j) shared(n)
      do j = 1, n
         call more_work(j)
      end do
!$omp end do
      end

The following orphaned directives usage rules apply:

- An orphaned worksharing construct (SECTIONS, SINGLE, DO) is executed by a team consisting of one thread, that is, serially.
- Any collective operation (worksharing construct or BARRIER) executed inside of a worksharing construct is illegal.
- It is illegal to execute a collective operation (worksharing construct or BARRIER) from within a synchronization region (CRITICAL or ORDERED).
- The opening and closing directives of a directive pair (for example, DO and END DO) must occur in a single block of the program.
- Private scoping of a variable can be specified at a worksharing construct. Shared scoping must be specified at the parallel region.

For complete details, see the OpenMP Fortran version 2.0 specifications.

Preparing Code for OpenMP Processing

The following are the major stages and steps of preparing your code for using OpenMP. Typically, the first two stages can be done on uniprocessor or multiprocessor systems; later stages are typically done only on multiprocessor systems.

Before Inserting OpenMP Directives

Before inserting any OpenMP parallel directives, verify that your code is safe for parallel execution by doing the following:

- Place local variables on the stack. This is the default behavior of the Intel Fortran Compiler when -openmp is used.
- Use automatic or auto
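For illustration, a hedged sketch (the subroutine and routine names are hypothetical) of how an orphaned worksharing construct binds to the enclosing parallel region in its caller:

```fortran
      program orphan_demo
!$omp parallel shared(n)
      call phase2(n)        ! the !$omp do inside phase2 is "orphaned":
!$omp end parallel          ! it binds to this enclosing parallel region
      end program

      subroutine phase2(n)
!$omp do private(j)
      do j = 1, n
         call more_work(j)  ! iterations shared among the team's threads
      end do
!$omp end do
      end subroutine
```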
sed stack frame for all functions.

-fpe3 — Specifies floating-point exception handling at run time for the main program. -fpe0 disables the option.

-IPF_fltacc — Enables the compiler to apply Itanium(R) compiler optimizations that affect floating-point accuracy.

-IPF_fma — (Itanium compiler) Enables the contraction of floating-point multiply and add/subtract operations into a single operation.

-IPF_fp_speculation fast — (Itanium compiler) Sets the compiler to speculate on floating-point operations. -IPF_fp_speculation off disables this optimization.

-O2 — Optimizes for maximum speed.

-openmp_report1 — Indicates loops, regions, and sections parallelized.

-opt_report_levelmin — Specifies the minimal level of the optimizations report.

-par_report1 — Indicates loops successfully auto-parallelized.

-tpp2 — (Itanium compiler) Optimizes code for the Intel(R) Itanium(R) 2 processor for Itanium-based applications. Generated code is compatible with the Itanium processor.

-tpp7 — (IA-32 only) Optimizes code for the Intel Pentium 4 and Intel Xeon(TM) processor for IA-32 applications.

-unroll[n] — Omit n to let the compiler decide whether to perform unrolling or not (default). Specify n to set the maximum number of times to unroll a loop. The Itanium compiler currently uses only n = 0; -unroll0 (disabled) opt
sk later. To enable buffered writes (that is, to allow the disk device to fill the internal buffer before the buffer is written to disk), use one of the following:

- The OPEN statement BUFFERED specifier
- The -assume buffered_io command-line option
- The FORT_BUFFERED run-time environment variable

The OPEN statement BUFFERED specifier takes precedence over the -assume buffered_io option. If neither one is set (which is the default), the FORT_BUFFERED environment variable is tested at run time.

The OPEN statement BUFFERED specifier applies to a specific logical unit. In contrast, the -assume [no]buffered_io option and the FORT_BUFFERED environment variable apply to all Fortran units.

Using buffered writes usually makes disk I/O more efficient by writing larger blocks of data to the disk less often. However, a system failure when using buffered writes can cause records to be lost, since they might not yet have been written to disk. (Such records would have been written to disk with the default unbuffered writes.)

When performing I/O across a network, be aware that the size of the block of network data sent across the network can impact application efficiency. When reading network data, follow the same advice for efficient disk reads by increasing the BUFFERCOUNT. When writing data through the network, several items should be considered:

- Unless the application requires that record
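A minimal sketch (the unit number and file name are hypothetical) of requesting buffered writes for one logical unit through the OPEN statement BUFFERED specifier:

```fortran
! Sketch: BUFFERED='YES' lets the run-time library fill its internal
! buffer before data is written to disk, for this unit only.
OPEN (UNIT=20, FILE='results.dat', FORM='UNFORMATTED', BUFFERED='YES')
```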
slates into high performance gains.

- The compiler detects and does not vectorize loops that execute only a small number of iterations, reducing the run-time overhead that vectorization might otherwise add.

Profile-Guided Optimizations Methodology and Usage Model

PGO works best for code with many frequently executed branches that are difficult to predict at compile time. An example is code with intensive error checking, in which the error conditions are false most of the time. The "cold" error-handling code can be placed such that the branch is hardly ever mispredicted. Minimizing "cold" code interleaved into the "hot" code improves instruction cache behavior.

PGO Phases

The PGO methodology requires three phases and options:

1. Instrumentation compilation and linking with -prof_gen
2. Instrumented execution, by running the executable; as a result, the dynamic information files (.dyn) are produced
3. Feedback compilation with -prof_use

The flowcharts below illustrate this process for IA-32 compilation and Itanium-based compilation.

A key factor in deciding whether you want to use PGO lies in knowing which sections of your code are the most heavily used. If the data set provided to your program is very consistent, and it elicits a similar behavior on every execution, then PGO can probably help optimize your program execution. However, different data sets can el
ssor-specific features, consider using -ax to attain processor-specific performance and portability among different processors.

Setting FTZ and DAZ Flags

Previously, the default status of the flags flush-to-zero (FTZ) and denormals-are-zero (DAZ) for IA-32 processors was off by default. However, even at the cost of losing IEEE compliance, turning these flags on significantly increases the performance of programs with denormal floating-point values in the gradual underflow mode run on the most recent IA-32 processors. Hence, for the Intel Pentium III, Pentium 4, Pentium M, Intel Pentium 4 processor with Streaming SIMD Extensions 3 (SSE3) instruction support, and compatible IA-32 processors, the compiler's default behavior is to turn these flags on. The compiler inserts code in the program to perform a run-time check for the processor on which the program runs, to verify it is one of the afore-listed Intel processors.

- Executing a program on a Pentium III processor enables the FTZ flag, but not DAZ.
- Executing a program on an Intel Pentium M processor, or an Intel Pentium 4 processor with Streaming SIMD Extensions 3 (SSE3) instruction support, enables both the FTZ and DAZ flags.

These flags are only turned on by Intel processors that have been validated to support them. For non-Intel processors, the flags can be set manually by calling the following Intel Fortran intrinsic:

RESULT = FOR_SET_FPE (FOR
stats=n — Sets the valid number of intermediate language statements for a function that is expanded in line. The number n is a positive integer. The number of intermediate language statements usually exceeds the actual number of source language statements. The default value for n is 230.

-ip_ninl_min_stats=n — Sets the valid minimum number of intermediate language statements for a function that is expanded in line. The number n is a positive integer. The default value for -ip_ninl_min_stats is:
  IA-32 compiler: -ip_ninl_min_stats=7
  Itanium(R) compiler: -ip_ninl_min_stats=15

-ip_ninl_max_total_stats=n — Sets the maximum increase in size of a function, measured in intermediate language statements, due to inlining. The number n is a positive integer. The default value for n is 2000.

The following command activates procedural and interprocedural optimizations on source.f and sets the maximum increase in the number of intermediate language statements to five for each function:

ifort -ip -Qoption,f,-ip_ninl_max_stats=5 source.f

Inline Expansion of Functions

Criteria for Inline Function Expansion

For a call to be considered for inlining, it has to meet certain minimum criteria. There are three main components of a call:

Call site — the site of the call to the function that might be inlined.
stems, the reports can be generated for:

- ilo
- hlo, if -O3 is on
- ipo, if the interprocedural optimizer is invoked with -ip or -ipo
- all of the above optimizers, if the -O3 and -ip or -ipo options are on

For Itanium-based systems, the reports can be generated for:

- ilo
- ecg
- hlo, if -O3 is on
- ipo, if the interprocedural optimizer is invoked with -ip or -ipo
- all of the above optimizers, if the -O3 and -ip or -ipo options are on

Note: If an hlo or ipo report is requested but the controlling option (-O3 or -ip/-ipo, respectively) is not on, the compiler generates an empty report.

Glossary

alignment constraint — The proper boundary of the stack where data must be stored.
alternate loop transformation — An optimization in which the compiler generates a copy of a loop and executes the new loop depending on the boundary size.
branch count profiler — A tool that counts the number of times a program executes each branch statement. The utility also generates a database that shows how the program executed.
branch probability database — The database generated by the branch count profiler. The database contains the number of times each branch is executed.
cache hit — The situation when the information the processor wants is in the cache.
call site — A call site consists of the instructions immediately preceding a call instruction and the call instruction itself.
common
structions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them.

The software described in this User's Guide, Volume II, may contain software defects which may cause the product to deviate from published specifications. Current characterized software defects are available on request.

Intel SpeedStep, Intel Thread Checker, Celeron, Dialogic, i386, i486, iCOMP, Intel, Intel logo, Intel386, Intel486, Intel740, IntelDX2, IntelDX4, Intel SX2, Intel Inside, Intel Inside logo, Intel NetBurst, Intel NetStructure, Intel Xeon, Intel XScale, Itanium, MMX, MMX logo, Pentium, Pentium II Xeon, Pentium III Xeon, Pentium M, and VTune are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

Other names and brands may be claimed as the property of others.

Copyright (C) Intel Corporation 2003-2004. Portions Copyright (C) 2001 Hewlett-Packard Development Company, L.P.

Table of Contents

Optimizing ... 9
How to Use This Document ... 11
Notation Conventions ... 11
Programming for High Performance ... 13
Programming for High Performance Overview ... 13
Programming Guidelines ... 13
Setting Data Type and Alignment ... 13
Using Arrays ... 20
Improving I/O
strumented program returns normally from main or calls the standard exit function. Programs that do not terminate normally can use the _PGOPTI_Prof_Dump function. During the instrumentation compilation (-prof_gen), you can add a call to this function to your program. Here is an example:

INTERFACE
  SUBROUTINE PGOPTI_PROF_DUMP()
  !DEC$ ATTRIBUTES C, ALIAS:'PGOPTI_Prof_Dump' :: PGOPTI_PROF_DUMP
  END SUBROUTINE
END INTERFACE

CALL PGOPTI_PROF_DUMP()

Note: You must remove the call or comment it out prior to the feedback compilation with -prof_use.

Using profmerge to Relocate the Source Files

The compiler uses the full path to the source file for each routine to look up the profile summary information associated with that routine. By default, this prevents you from:

- Using the profile summary file (.dpi) if you move your application sources.
- Sharing the profile summary file with another user who is building identical application sources that are located in a different directory.

To enable the movement of application sources, as well as the sharing of profile summary files, use profmerge with the -src_old and -src_new options. For example:

prompt> profmerge -prof_dir c:/work -src_old c:/work/sources -src_new d:/project/src

The above command will read the c:/work/pgopti.dpi file. For each routine represented in t
sure your program does not rely on sharing the standard input file.

For more information on Intel Fortran data files and I/O, see "Files, Devices, and I/O" in Volume I; on OPEN statement specifiers and defaults, see "Open Statement" in the Intel(R) Fortran Language Reference.

Improving Run-time Efficiency

Follow these source coding guidelines to improve run-time performance. The amount of improvement in run-time performance is related to the number of times a statement is executed. For example, improving an arithmetic expression executed within a loop many times has the potential to improve performance more than improving a similar expression executed once outside a loop.

Avoid Small Integer and Small Logical Data Items

Avoid using integer or logical data less than 32 bits. Accessing a 16-bit (or 8-bit) data type can make data access less efficient, especially on Itanium-based systems. To minimize data storage and memory cache misses with arrays, use 32-bit data rather than 64-bit data, unless you require the greater numeric range of 8-byte integers or the greater range and precision of double-precision floating-point numbers.

Avoid Mixed Data Type Arithmetic Expressions

Avoid mixing integer and floating-point (REAL) data in the same computation. Expressing all numbers in a floating-point arithmetic expression (assignment statement) as floating-point values eliminates the need to convert da
t then generates a linker script that tells the linker to first link contributions from .text00001, then .text00002. This happens transparently when the same ifort (or xild) invocation is used for both the link-time compilation and the final link.

However, the linker script must be taken into account by the user if -ipo_c or -ipo_S is used. With these switches, the IPO compilation and the actual link are done by different invocations of ifort. When this occurs, ifort will issue an informational message indicating that it is generating an explicit linker script, ipo_layout.script.

When ipo_layout.script is generated, the typical response is to modify your link command to use this script:

--script=ipo_layout.script

If your application already requires a custom linker script, you can place the necessary contents of ipo_layout.script in your script. The layout-specific content of ipo_layout.script is at the beginning of the description of the .text section. For example, to describe the layout order for 12 routines:

.text :
{
  *(.text00001) *(.text00002) *(.text00003) *(.text00004)
  *(.text00005) *(.text00006) *(.text00007) *(.text00008)
  *(.text00009) *(.text00010) *(.text00011) *(.text00012)

For applications that already require a linker script, you can add this section of the .text section
ta between fixed and floating-point formats. Expressing all numbers in an integer arithmetic expression as integer values also achieves this. This improves run-time performance. For example, assuming that I and J are both INTEGER variables, expressing a constant number 2 as an integer value, 2, eliminates the need to convert the data.

Inefficient Code

You can use different sizes of the same general data type in an expression with minimal or no effect on run-time performance. For example, using REAL, DOUBLE PRECISION, and COMPLEX floating-point numbers in the same floating-point arithmetic expression has minimal or no effect on run-time performance.

Use Efficient Data Types

In cases where more than one data type can be used for a variable, consider selecting the data types based on the following hierarchy, listed from most to least efficient:

- Integer (also see the example above)
- Single-precision real, expressed explicitly as REAL, REAL(KIND=4), or REAL*4
- Double-precision real, expressed explicitly as DOUBLE PRECISION, REAL(KIND=8), or REAL*8
- Extended-precision real, expressed explicitly as REAL(KIND=16) or REAL*16

However, keep in mind that in an arithmetic expression you should avoid mixing integer and floating-point (REAL) data (see the example in the previous subsection).

Avoid Using Slow Arithmetic Operators

Before you
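For example, a short sketch contrasting mixed-type and consistently typed expressions (variable names are hypothetical):

```fortran
INTEGER :: I, J
REAL    :: X, Y
J = I + 2      ! all-integer expression: no conversion needed
Y = X * 2.0    ! all-real expression: no conversion needed
! Less efficient: Y = X * 2 requires the integer constant 2 to be
! converted to REAL before the multiply.
```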
tal functions and improve accuracy of floating-point compares.

Flushing to Zero Denormal Values: -ftz

Option -ftz flushes denormal results to zero when the application is in the gradual underflow mode. Flushing the denormal values to zero with -ftz may improve performance of your application. The default status of -ftz is OFF: by default, the compiler lets results gradually underflow. With the default -O2 option, -ftz is OFF.

-ftz on Itanium-based Systems

On Itanium-based systems, only the -O3 option turns on -ftz. If the -ftz option produces undesirable results in the numerical behavior of your program, you can turn the flush-to-zero (FTZ) mode off by using -ftz- in the command line, while still benefiting from the -O3 optimizations:

ifort -O3 -ftz- myprog.f

Usage

- Use this option if the denormal values are not critical to application behavior.
- -ftz only needs to be used on the source that contains the main program to turn the FTZ mode on. The initial thread, and any threads subsequently created by that process, will operate in FTZ mode.

The -ftz option affects the results of floating underflow as follows:

- -ftz- results in gradual underflow to 0: the result of a floating underflow is a denormalized number or a zero.
- -ftz results in abrupt underflow to 0: the result of a floating underflow is set to zero, and execution continues.

-ftz also makes a denormal valu
te on functions and variables that require the default setting. Since "internal" is a processor-specific attribute, it may not be desirable to have a general option for it.

The combined command-line options

-fvisibility=protected -fvisibility-default=prot.txt

where the file prot.txt is as described above, cause all global symbols except a, b, c, d, and e to have protected visibility. Those five symbols, however, will have default visibility and thus be preemptable.

Visibility-related Options

-fminshared — Directs the compiler to treat the compilation unit as a component of a main program and not to link it as part of a shareable object. Since symbols defined in the main program cannot be preempted, this enables the compiler to treat symbols declared with default visibility as though they have protected visibility. That is, -fminshared implies -fvisibility=protected. The compiler need not generate position-independent code for the main program; it can use absolute addressing, which may reduce the size of the global offset table (GOT) and may reduce memory traffic.

-fpic — Specifies full symbol preemption. Global symbol definitions, as well as global symbol references, get default (that is, preemptable) visibility unless explicitly specified otherwise. Generates position-independent code. Required for building shared objects on Itanium-based systems.

Optimizing Different Application Types

Optimizing Different Application Types Overview

This section disc
the SCHEDULE clause of the current DO or PARALLEL DO directive.
2. For the RUNTIME schedule type, the value specified in the OMP_SCHEDULE environment variable.
3. For the DYNAMIC and GUIDED schedule types, the default value 1.
4. If the schedule type for the current DO or PARALLEL DO directive is STATIC, the loop iteration space divided by the number of threads in the team.

OpenMP Support Libraries

The Intel Fortran Compiler with OpenMP support provides a production support library, libguide.a. This library enables you to run an application under different execution modes. It is used for normal or performance-critical runs on applications that have already been tuned.

Execution Modes

The compiler with OpenMP enables you to run an application under different execution modes that can be specified at run time. The libraries support the serial, turnaround, and throughput modes. These modes are selected by using the KMP_LIBRARY environment variable at run time.

Throughput

In a multi-user environment, where the load on the parallel machine is not constant or where the job stream is not predictable, it may be better to design and tune for throughput. This minimizes the total time to run multiple jobs simultaneously. In this mode, the worker threads will yield to other threads while waiting for more parallel work. The throughput mode is designed to make the program aware of its environment (that is, the system load) and to adjust its resource usage
the directory in which you intend to place the dynamic information (.dyn) files to be created. The default is the directory where the program is compiled. The specified directory must already exist.

You should specify the -prof_dir dirname option with the same directory name for both the instrumentation and feedback compilations. If you move the .dyn files, you need to specify the new path. The -prof_file filename option specifies the file name for the profiling summary file.

Guidelines for Using Advanced PGO

When you use PGO, consider the following guidelines:

- Minimize the changes to your program after instrumented execution and before feedback compilation. During feedback compilation, the compiler ignores dynamic information for functions modified after that information was generated. (Note: The compiler issues a warning that the dynamic information does not correspond to a modified function.)
- Repeat the instrumentation compilation if you make many changes to your source files after execution and before feedback compilation.
- Specify the name of the profile summary file using the -prof_file filename option.

See PGO Environment Variables.

PGO Environment Variables

The environment variables determine the directory in which to store dynamic information files, or whether to overwrite pgopti.dpi. The PGO environment variables are described in the table below.

PROF_DIR — Specifies the directory in which dyna
optimizations (see above); -ipo enables interprocedural optimizations across files; -static prevents linking with shared libraries. On IA-32 and Intel EM64T systems, -fast sets these three options and also sets -xP. It provides a shortcut that requests several important compiler optimizations. To override one of the options set by -fast, specify that option after the -fast option on the command line. The options set by the -fast option may change from release to release. On IA-32 systems, in conjunction with the -ax{K|W|N|B|P} or -x{K|W|N|B|P} options, this option provides the best run-time performance.

Restricting Optimizations

The following options restrict or preclude the compiler's ability to optimize your program:

-O0            Disables optimizations. Enables the -fp option.
-g             Turns off the default -O2 option and makes -O0 the default, unless -O2 (or -O1 or -O3) is explicitly specified on the command line together with -g. See Optimizations and Debugging.
-mp            Restricts optimizations that cause some minor loss or gain of precision in floating-point arithmetic, to maintain a declared level of precision and to ensure that floating-point arithmetic more nearly conforms to the ANSI and IEEE standards. See the -mp option for more details.
-nolib_inline  Disables inline expansion of
to produce efficient execution in a dynamic environment. This mode is the default.

After completing the execution of a parallel region, threads wait for new parallel work to become available. After a certain period of time has elapsed, they stop waiting and sleep. Sleeping allows the threads to be used, until more parallel work becomes available, by non-OpenMP threaded code that may execute between parallel regions, or by other applications. The amount of time to wait before sleeping is set either by the KMP_BLOCKTIME environment variable or by the kmp_set_blocktime function. A small KMP_BLOCKTIME value may offer better overall performance if your application contains non-OpenMP threaded code that executes between parallel regions. A larger KMP_BLOCKTIME value may be more appropriate if threads are to be reserved solely for OpenMP execution, but it may penalize other concurrently running OpenMP or threaded applications.

Turnaround

In a dedicated (batch or single-user) parallel environment where all processors are exclusively allocated to the program for its entire run, it is most important to effectively utilize all of the processors all of the time. The turnaround mode is designed to keep active all of the processors involved in the parallel computation, in order to minimize the execution time of a single job. In this mode, the worker threads actively wait for more parallel work
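As a sketch, the block time can be lowered either in the environment or through the kmp_set_blocktime routine named above; the zero value and program name are illustrative:

```fortran
! Sketch: shorten the wait-before-sleep time for worker threads.
! Equivalent environment setting:  export KMP_BLOCKTIME=0  (milliseconds)
program blocktime_demo
  external kmp_set_blocktime
  call kmp_set_blocktime(0)   ! yield/sleep quickly; suits shared machines
!$omp parallel
  ! ... parallel work between serial phases ...
!$omp end parallel
end program blocktime_demo
```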
to the default.

Checking the Floating-point Stack State (IA-32 only): -fpstkchk

The -fpstkchk option (IA-32 only) checks whether a program makes a correct call to a function that should return a floating-point value. If an incorrect call is detected, the option places a code that marks the incorrect call in the program. When an application calls a function that returns a floating-point value, the returned floating-point value is supposed to be on the top of the floating-point stack. If the return value is not used, the compiler must pop the value off of the floating-point stack in order to keep the floating-point stack in a correct state. If the application calls a function without defining, or while incorrectly defining, the function's prototype, the compiler does not know whether the function must return a floating-point value, and the return value is not popped off of the floating-point stack if it is not used. This can cause a floating-point stack overflow. The overflow of the stack results in two undesirable situations:

- A NaN value gets involved in the floating-point calculations.
- The program results become unpredictable; the point where the program starts making errors can be arbitrarily far away from the point of the actual error.

The -fpstkchk option marks the incorrect call and makes it easy to find the error. Note: this option causes significant code generation overhead.
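The error class that -fpstkchk catches can be sketched as follows; the function name is hypothetical. The REAL function is invoked with CALL, so its x87 return value is never consumed or popped:

```fortran
! Hypothetical sketch of the bug -fpstkchk (IA-32) is meant to flag:
! f returns a REAL on the x87 stack, but CALL discards it, so the
! value is never popped; repeated calls can overflow the stack.
program stack_demo
  integer :: i
  do i = 1, 10
     call f()        ! wrong: f is a function, not a subroutine
  end do
end program stack_demo

real function f()
  f = 1.0
end function f
```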
software pipelining for the Itanium architecture. Supported on IA-32 or Itanium-based multiprocessor systems; on Pentium, Pentium with MMX Technology, Pentium II, Pentium III, and Pentium 4 processors; and on Hyper-Threading Technology-enabled systems.

Parallel Program Development

The Intel Fortran Compiler supports the OpenMP Fortran version 2.0 API specification, available from the www.openmp.org web site. The OpenMP directives relieve the user from having to deal with the low-level details of iteration space partitioning, data sharing, and thread scheduling and synchronization.

The auto-parallelization feature of the Intel Fortran Compiler automatically translates serial portions of the input program into semantically equivalent multithreaded code. Automatic parallelization determines the loops that are good worksharing candidates, performs the dataflow analysis to verify correct parallel execution, and partitions the data for threaded code generation as is needed in programming with OpenMP directives. The OpenMP and auto-parallelization applications provide the performance gains from shared memory on multiprocessor systems and on IA-32 processors with Hyper-Threading Technology.

Auto-vectorization detects low-level operations in the program that can be done in parallel and then converts the sequential program to process 2, 4, 8, or up to 16 elements in one operation, depending on the data type. In some cases auto
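A loop of the kind the auto-vectorizer converts to SIMD form might look like this; the names are placeholders, and a processor-specific option such as -xW would be assumed to enable vectorization on IA-32:

```fortran
! Sketch: a unit-stride loop the vectorizer can process several
! elements at a time (e.g. 4 single-precision elements with SSE).
subroutine vadd(a, b, c, n)
  integer :: n, i
  real :: a(n), b(n), c(n)
  do i = 1, n
     c(i) = a(i) + b(i)
  end do
end subroutine vadd
```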
construct, and the corresponding shared variable is undefined on exit from a parallel construct.

- Contents, allocation state, and association status of variables defined as PRIVATE are undefined when they are referenced outside the lexical extent, but inside the dynamic extent, of the construct, unless they are passed as actual arguments to called routines.

In the following example, the values of I and J are undefined on exit from the parallel region:

      INTEGER I, J
      I = 1
      J = 2
!$OMP PARALLEL PRIVATE(I) FIRSTPRIVATE(J)
      I = 3
      J = J + 2
!$OMP END PARALLEL
      PRINT *, I, J

FIRSTPRIVATE

Use the FIRSTPRIVATE clause on the PARALLEL, DO, SECTIONS, SINGLE, PARALLEL DO, and PARALLEL SECTIONS directives to provide a superset of the PRIVATE clause functionality. In addition to the PRIVATE clause functionality, private copies of the variables are initialized from the original object existing before the parallel construct.

LASTPRIVATE

Use the LASTPRIVATE clause on the DO, SECTIONS, PARALLEL DO, and PARALLEL SECTIONS directives to provide a superset of the PRIVATE clause functionality. When the LASTPRIVATE clause appears on a DO or PARALLEL DO directive, the thread that executes the sequentially last iteration updates the version of the object it had before the construct. When the LASTPRIVATE clause appears on a SECTIONS or PARALLEL SECTIONS directive, the thread that executes the lexically last section
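A minimal LASTPRIVATE sketch (with hypothetical names): after the loop, X carries the value assigned in the sequentially last iteration, I = 100:

```fortran
program lastpriv_demo
  integer :: i, x
!$omp parallel do lastprivate(x)
  do i = 1, 100
     x = 2 * i
  end do
!$omp end parallel do
  print *, x    ! x = 200, from the sequentially last iteration
end program lastpriv_demo
```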
multiple of eight. Align 80-bit data so that its base address is a multiple of sixteen. Align 128-bit data so that its base address is a multiple of sixteen.

Causes of Unaligned Data and Ensuring Natural Alignment

For optimal performance, make sure your data is aligned naturally. A natural boundary is a memory address that is a multiple of the data item's size. For example, a REAL(KIND=8) data item aligned on natural boundaries has an address that is a multiple of 8. An array is aligned on natural boundaries if all of its elements are so aligned. All data items whose starting address is on a natural boundary are naturally aligned. Data not aligned on a natural boundary is called unaligned data.

Although the Intel Fortran Compiler naturally aligns individual data items when it can, certain Fortran statements can cause data items to become unaligned. You can use the command-line option -align to ensure naturally aligned data, but you should check, and consider reordering, the declarations of data items within common blocks, derived-type structures, and record structures: carefully specify the order and sizes of data declarations to ensure naturally aligned data. Start with the largest-size numeric items first, followed by smaller-size numeric items, and then non-numeric (character) data.

The following statements can cause unaligned data:

- Common blocks (COMMON statement)
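The largest-first ordering rule above can be sketched on a common block (the block and variable names are hypothetical):

```fortran
! Sketch: declaration order controls alignment inside a common block.
! Bad:  COMMON /blk/ ch, s, d   would push the REAL*8 off its natural
!       8-byte boundary unless the compiler pads the block.
! Good: largest numeric items first, character data last:
real*8        d        ! 8-byte item first
integer*4     n        ! then 4-byte
integer*2     s        ! then 2-byte
character     ch       ! non-numeric data last
common /blk/ d, n, s, ch
end
```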
discusses the command-line options -O0, -O1, -O2 (or -O), and -O3. The -O0 option disables optimizations. Each of the other three turns on several compiler capabilities. To specify one of these optimizations, take into consideration the nature and structure of your application, as indicated in the more detailed descriptions of the options. In general terms, -O1, -O2 (or -O), and -O3 optimize as follows:

-O1: code size and locality
-O2 (or -O): code speed; this is the default option
-O3: enables -O2 with more aggressive optimizations
-fast: enables -O3 and -ipo to enhance speed across the entire program

These options behave similarly on IA-32 and Itanium architectures, with some specifics that are detailed in the sections that follow.

Setting Optimizations with -On Options

The following table details the effects of the -O0, -O1, -O2, -O3, and -fast options. The table first describes the characteristics shared by both IA-32 and Itanium architectures, and then explicitly describes the specifics (if any) of the -On and -fast options' behavior on each architecture.

Option  Effect
-O0     Disables -On optimizations. On IA-32 systems, this option sets the -fp option.
-O1     Optimizes to favor code size and code locality. Disables loop unrolling. May improve performance for applications with very large code size, many branches, and execution time not dominated
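As a sketch, the levels above translate into command lines like these (the file names are placeholders):

```shell
# Placeholder file names; each line picks one optimization strategy.
ifort -O1 prog.f -o prog    # favor code size and locality
ifort -O2 prog.f -o prog    # favor speed (the default)
ifort -O3 prog.f -o prog    # -O2 plus more aggressive optimizations
ifort -fast prog.f -o prog  # -O3 -ipo -static (and -xP on IA-32/EM64T)
```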
directive IVDEP to ensure there is no loop-carried dependency for the store into a():

!DIR$ IVDEP
      do j = 1, n
         a(b(j)) = a(b(j)) + 1
      enddo

See also Vectorization Support.

Prefetching

The goal of prefetch insertion is to reduce cache misses by providing hints to the processor about when data should be loaded into the cache. The prefetching optimizations implement the following option:

-prefetch[-]  Enables or disables prefetch insertion. This option requires that -O3 be specified. The default with -O3 is -prefetch.

To facilitate compiler optimization:

- Minimize use of global variables and pointers.
- Minimize use of complex control flow.
- Choose data types carefully and avoid type casting.

For more information on how to optimize with -prefetch, refer to the Intel Pentium 4 and Intel Xeon(TM) Processor Optimization Reference Manual. In addition to the -prefetch option, an intrinsic subroutine, MM_PREFETCH, and a compiler directive, PREFETCH, are also available. The subroutine MM_PREFETCH prefetches data from the specified address on one memory cache line. The compiler directive PREFETCH enables a data prefetch from memory. The following example is for Itanium-based systems only:

      do j = 1, lastrow - firstrow + 1
         i = rowstr(j)
         iresidue = mod(rowstr(j+1) - i, 8)
         sum = 0.d0
CDEC$ NOPREFETCH a, p, colidx
         do k = i, i + iresidue - 1
            sum = sum + a(k) * p(colidx(k))
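A sketch of the MM_PREFETCH subroutine mentioned above; the lookahead distance and the locality-hint argument are illustrative assumptions:

```fortran
! Sketch: software-prefetch a(i+16) while working on a(i).
! The second argument is a locality hint (value assumed here).
subroutine smooth(a, b, n)
  integer :: n, i
  real :: a(n), b(n)
  do i = 1, n - 16
     call mm_prefetch(a(i + 16), 1)
     b(i) = 0.5 * (a(i) + a(i + 1))
  end do
end subroutine smooth
```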
shows the percentage of the code that was exercised on a new run but was missed in the reference run. In such cases, the coverage tool shows only the modules that included the code that was uncovered. The coloring scheme in the source views should also be interpreted accordingly: code that has the same coverage property (covered or not covered) on both runs is considered covered code. Otherwise, if the new run indicates that the code was executed while in the reference run the code was not executed, the code is treated as uncovered. On the other hand, if the code is covered in the reference run but not covered in the new run, the differential coverage source view shows the code as covered.

Running for Differential Coverage

To run the Intel Compilers code coverage tool for differential coverage, the following files are required:

- The application sources.
- The .spi file generated by Intel Compilers when compiling the application for the instrumented binaries with the -prof_genx option.
- The .dpi file generated by the Intel Compilers profmerge utility as the result of merging the dynamic profile information (.dyn) files, or the .dpi file generated implicitly by Intel Compilers when compiling the application with the -prof_use option. See Usage Model of the Profile-guided Optimizations.

Once the required files are available, the coverage tool may be launched from this command line:

codecov -prj Project
keywords to select particular inline expansions and loop optimizations. The option must be entered with an -ip or -ipo specification, as follows:

-ip -Qoption,tool,opts

where tool is Fortran and opts are -Qoption specifiers (see below). Also refer to Criteria for Inline Function Expansion to see how these specifiers may affect the inlining heuristics of the compiler. For more information about passing options to other tools, see -Qoption,tool,opts.

-Qoption Specifiers

If you specify -ip or -ipo without any -Qoption qualification, the compiler does the following:

- Expands functions in line
- Propagates constant arguments
- Passes arguments in registers
- Monitors module-level static variables

You can refine interprocedural optimizations by using the following -Qoption specifiers. To have an effect, the -Qoption option must be entered with either -ip or -ipo also specified, as in this example:

-ip -Qoption,f,ip_specifier

where ip_specifier is one of the -Qoption specifiers described in the following table.

-Qoption Specifiers

-ip_args_in_regs=0  Disables the passing of arguments in registers. By default, external functions can pass arguments in registers when called locally. Normally, only static functions can pass arguments in registers, provided the address of the function is not taken and the function does not use a variable number of arguments.

-ip_ninl_max
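For example, disabling register-argument passing during IPO might look like this; the file names are placeholders, and the specifier spelling follows the table above:

```shell
# Sketch: pass an IPO specifier through to the Fortran front end.
ifort -ip -Qoption,f,-ip_args_in_regs=0 prog.f -o prog
```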