
PathScale™ Compiler Suite User Guide


Contents

1. Table C-1. Fortran Intrinsics Supported in Version 3.2 (Continued)

Intrinsic Name | Result | Arguments | Families | Remarks
FSTAT | Subroutine | UNIT: I*1, I*2, I*4, I*8; SARRAY: I*1, I*2, I*4, I*8 (array rank=1); STATUS: I*1, I*2, I*4, I*8 | G77 | O
FTELL | I*8 | UNIT: I*4 | G77 PGI |
FTELL | I*8 | UNIT: I*8 | G77 PGI |
FTELL | Subroutine | UNIT: I*4; OFFSET: I*4 | G77 |
FTELL | Subroutine | UNIT: I*4; OFFSET: I*8 | G77 |
FTELL | Subroutine | UNIT: I*8; OFFSET: I*8 | G77 |
GERROR | Subroutine | MESSAGE: C | G77 PGI |
GETARG | Subroutine | POS: I*4; VALUE: C | G77 PGI |
GETCWD | I*4 | NAME: C; STATUS: I*4 | G77 PGI | O
GETCWD | Subroutine | NAME: C; STATUS: I*4 | G77 | O
GETENV | Subroutine | NAME: C; VALUE: C | G77 PGI |
GETGID | I*4 | | G77 PGI |
GETLOG | Subroutine | LOGIN: C | G77 PGI |
GETPID | I*4 | | G77 PGI |
GETUID | I*4 | | G77 PGI |
GETPOS | | I: I*1, I*2, I*4, I*8 | TRADITIONAL | E
GET_COMMAND | Subroutine | COMMAND: C; LENGTH: I*4; STATUS: I*4 | ANSI TRADITIONAL | O (COMMAND, LENGTH, STATUS)
GET_COMMAND_ARGUMENT | Subroutine | NUMBER: I*4; VALUE: C; LENGTH: I*4; STATUS: I*4 | ANSI TRADITIONAL | O (VALUE, LENGTH, STATUS)
GET_ENVIRONMENT_VARIABLE | Subroutine | NAME: C; VALUE: C; LENGTH: I*4; STATUS: I*4; TRIM_NAME: L*4 | ANSI TRADITIONAL | O (VALUE, LENGTH, STATUS, TRIM_NAME)
GET_IEEE_EXCEPTIONS | Subroutine | STATUS: I*8 | TRADITIONAL |
GET_IEEE_INTERRUPTS | Subroutine | STATUS: I*8 | TRADITIONAL |
GET_IEEE_ROUNDING_MODE | Subroutine | STATUS: I*8 | TRADITIONAL |
2. 3.9.1 Fortran KINDs
   3.10 Library Compatibility
      3.10.1 Name Mangling
      3.10.2 ABI Compatibility
      3.10.3 Linking with g77-compiled Libraries
         3.10.3.1 AMD Core Math Library (ACML)
      3.10.4 List Directed I/O and Repeat Factors
         3.10.4.1 Environment Variable
         3.10.4.2 assign Command
   3.11 Porting Fortran Code
   3.12 Debugging and Troubleshooting Fortran
      3.12.1 Writing to Constants Can Cause Crashes
      3.12.2 Runtime Errors Caused by Aliasing Among Fortran Dummy Arguments
      3.12.3 Fortran malloc Debugging
      3.12.4 Arguments Copied to Temporary Variables
   3.13 Fortran Compiler Stack Size
   Section 4. The PathScale C/C++ Compiler
   4.1 Using the C/C++ Compilers
      4.1.1 Accessing the GCC 4.x Front-ends for C and C++
   4.2 Compiler and Runtime Features
      4.2.1 Preprocessi
3. Section 6. Tuning Quick Reference
   6.1 Basic Optimization
   6.2 IPA
   6.3 Feedback Directed Optimization (FDO)
   6.4 Aggressive Optimization
   6.5 Compiler Flag Recommendations
   6.6 Performance Analysis
   6.7 Optimize Your Hardware
   Section 7. Tuning Options
   7.1 Basic Optimizations: The -O flag
   7.2 Syntax for Complex Optimizations (-CG, -IPA, -LNO, -OPT, -WOPT)
   7.3 Inter-Procedural Analysis (IPA)
      7.3.1 The IPA Compilation Model
      7.3.2 Inter-procedural Analysis and Optimization
         7.3.2.1 Analysis
      7.3.3 Optimization
      7.3.4 Controlling IPA
         7.3.4.1 Inlining
      7.3.5 Cloning
      7.3.6 Other IPA Tuning Options
         7.3.6.1 Disabling Options
      7.3.7 Case Study on SPEC
4. pathcc -gnu4 world.c

This default behavior can be changed in your compiler.defaults file by adding this line:

    -gnu4

See section 2.3 for an example compiler.defaults file. The option has no effect on pathf90 or pathf95. There are currently some limitations when using this option. Please see the Release Notes for more information.

4.2 Compiler and Runtime Features

4.2.1 Preprocessing Source Files

Before being passed to the compiler front-end, source files are optionally passed through a source code preprocessor. The preprocessor searches for certain directives in the file and, based on these directives, can include or exclude parts of the source code, include other files, or define and expand macros. All C and C++ files are passed through the C preprocessor unless the -nocpp flag is specified.

4.2.1.1 Pre-defined Macros

The PathScale compiler pre-defines some macros for preprocessing code. These include the following:

Table 4-1. Pre-defined Macros

Macro | Remarks
__linux 1, __linux__ 1, linux 1, __unix 1, __unix__ 1, unix 1, __gnu_linux__ 1 | These macros specify the type of operating system.
__GNUC__ 4, __GNUC_MINOR__ 1, __GNUC_PATCHLEVEL__ 1 | The GNU and PATH values are derived from the respective compiler version numbers, and will change with each release.
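The pre-defined macros in Table 4-1 are typically consumed with #if and #ifdef tests. The following is a minimal sketch in C, not taken from the guide: it assumes the GCC-conventional spellings (`__linux__`, `__unix__`, `__GNUC__`, `__GNUC_MINOR__`), and the function names `os_family` and `gnu_compat_version` as well as the `DEMO_MAIN` guard are hypothetical names chosen for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Returns the operating-system family seen by the preprocessor.
   The macro spellings are the GCC-conventional ones (an assumption). */
const char *os_family(void)
{
#if defined(__linux__) || defined(__linux) || defined(linux)
    return "linux";
#elif defined(__unix__) || defined(__unix) || defined(unix)
    return "unix";
#else
    return "unknown";
#endif
}

/* Writes the GNU compatibility version ("major.minor") into buf,
   or "none" when the __GNUC__ macros are absent.
   Returns the number of characters snprintf would have written. */
int gnu_compat_version(char *buf, size_t size)
{
#if defined(__GNUC__) && defined(__GNUC_MINOR__)
    return snprintf(buf, size, "%d.%d", __GNUC__, __GNUC_MINOR__);
#else
    return snprintf(buf, size, "none");
#endif
}

#ifdef DEMO_MAIN /* hypothetical guard: build with -DDEMO_MAIN to run standalone */
int main(void)
{
    char version[32];
    gnu_compat_version(version, sizeof version);
    printf("OS family: %s\n", os_family());
    if (strcmp(version, "none") != 0)
        printf("GNU compatibility level: %s\n", version);
    return 0;
}
#endif
```

Because the tests happen at preprocessing time, the unused branches never reach the compiler front-end, which is why such checks are a common way to isolate platform-specific code.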
5. Default is 0.

-CG:load_exe=N
    Specify the threshold for subsuming a memory load operation into the operand of an arithmetic instruction. The value of 0 turns off this subsumption optimization. If N is 1, this subsumption is performed only when the result of the load has only one use. This subsumption is not performed if the number of times the result of the load is used exceeds the value N, a non-negative integer. The default value varies based on processor target and source language.

-CG:local_sched_alg=(0|1|2)
    Select the basic block instruction scheduling algorithm. If 0, perform backward scheduling, where instructions are scheduled from the bottom of the basic block to the top. If 1, perform forward scheduling. If 2, schedule the instructions twice, once in the forward direction and once in the backward direction, and take the better of the two schedules. The default value of this option is determined by the compiler during compilation.

-CG:locs_best=(ON|OFF)
    Run the local instruction scheduler several times using different heuristics, and pick the best schedule generated. If enabled, this option supersedes other options that control local instruction scheduling, such as -CG:local_sched_alg and -CG:locs_shallow_depth. The default is OFF.

-CG:locs_reduce_prefetch=(ON|OFF)
    If ON, delete prefetch instructions that cannot be scheduled into unused processor cycles. The deletion occurs only for backward instruction scheduling. Th
6. -isystem dir
    Search dir for header files after all directories specified by -I but before the standard system directories. Mark it as a system directory, so that it gets the same special treatment as is applied to the standard system directories.

-keep
    Write all intermediate compilation files. file.s contains the generated assembly language code. file.i contains the preprocessed source code. These files are retained after compilation is finished. If IPA is in effect and you want to retain file.s, you must specify -IPA:keeplight=OFF in addition to -keep.

-keepdollar
    For Fortran only. Treat the dollar sign ($) as a normal last character in symbol names.

-L directory
    In XPG4 mode, changes the algorithm of searching for libraries named in -l operands to look in the specified directory before looking in the default location. Directories specified in -L options are searched in the specified order. Multiple instances of -L options can be specified.

-l library
    In XPG4 mode, searches the specified library. A library is searched when its name is encountered, so the placement of a -l operand is significant.

-LANG:...
    This controls the language option group. The following sections describe the suboptions available in this group.

-LANG:copyinout=(ON|OFF)
    When an array section is passed as the actual argument in a call, the compiler sometimes copies the array section to a temporary array and passes the
7. LANGUAGE FORTRAN 1 LANGUAGE FORTRAN90 1 LANGUAGE FORTRAN90 1 unix 1 unix 1

NOTE: By default, Fortran uses cpp. You must specify the -ftpp command-line switch with Fortran code to use the Fortran preprocessor.

This command will print to stdout all of the #defines used with -cpp on a Fortran file:

    $ echo > junk.F90; pathf95 -cpp -Wp,-dD -E junk.F90

There is no corresponding way to find out what is defined by the default Fortran preprocessor (-ftpp). See section 3.6.4.1 for information on how to find pre-defined macros in C and C++.

No macros are predefined for the -coco preprocessor.

3.6.5 Error Numbers: The explain Command

By default, the Fortran compiler and its runtime library print brief error messages, such as this one:

    lib-4081 : UNRECOVERABLE library error
      An unformatted read or write is not allowed on a formatted file.

If you set the environment variable PSC_ERR_VERBOSE, the compiler and library will print a longer explanation following each message, such as this:

    lib-4081 : UNRECOVERABLE library error
      An unformatted read or write is not allowed on a formatted file.
      A Fortran READ or WRITE statement attempted an unformatted I/O operation on a file that was opened for formatted I/O. Either change the I/O statement to formatted (add a FORMAT specifier), or open the file for unformatted I/O. See the descripti
8. ... a pathf95 command line.

The default preprocessor for files with .F, .F90, or .F95 extensions is -cpp. See section 3.6.1 for more information on preprocessing.

The compiler drivers can use the extension to determine which language front-end to invoke. For example, some mixed-language programs can be compiled with a single command:

    $ pathf95 stream_d.f second_wall.c -o stream

The pathf95 driver will use the .c extension to know that it should automatically invoke the C front-end on the second_wall.c module and link the generated object files into the stream executable.

NOTE: GNU make does not contain a rule for generating object files from Fortran 90 files. You can add the following rules to your project Makefiles to achieve this:

    %.o: %.f90
    	$(FC) $(FFLAGS) -c $<
    %.o: %.F90
    	$(FC) $(FFLAGS) -c $<

You may need to modify this for your project, but in general the rules should follow this form.

For more information on compatibility and porting existing code, see section 5. Information on GCC compatibility and a wrapper script that you can use for your build packages can be found in section 5.7.1.

2.5 Other Input Files

Other possible input files, common to both C/C++ and Fortran, are assembly-language files, object files, and libraries. These can be used as inputs on the command line.

Extension | Implication to the driver
.i | preprocessed C source file
9. -LANG:IEEE_save=OFF, which disables the saving and restoring of IEEE state even in procedures which access the intrinsic modules.

3.4.7.1 Gradual Underflow

Fortran 2003 adds one feature not described in the TR15581 document mentioned earlier: control over gradual underflow (IEEE denormalized numbers). Most IEEE floating-point implementations execute faster if they are allowed to flush to zero instead of generating denormalized numbers when a computation underflows. Our compiler disables gradual underflow by default when the optimization level is -O3 or greater. You can also query and set it explicitly with procedures provided in the IEEE_ARITHMETIC module:

    use, intrinsic :: ieee_arithmetic
    logical :: gradual
    call ieee_get_underflow_mode(gradual)
    print *, gradual
    call ieee_set_underflow_mode(.false.)  ! Flush to zero for speed

Gradual underflow cannot be disabled by -O3 or via these procedures on the IA32 architecture when SSE instructions are not available.

3.4.8 Allocatable Components and Dummy Arguments

Fortran 2003 allows dummy variables, function results, and structure components to have the ALLOCATABLE attribute, which in Fortran 95 was restricted to ordinary variables. The specification of this extension is available at http://www.nag.co.uk/SC22WG5/TR15581.html. In brief, allocatable components behave much like ordinary allocatable variables, except that when a structure contains allocatable components, an assignment
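Returning to the gradual underflow discussion in section 3.4.7.1: the effect of the underflow mode can be observed from C as well. The sketch below is an illustration only and assumes IEEE double precision; the function name `gradual_underflow_active` and the `DEMO_MAIN` guard are hypothetical. It halves the smallest normal double at run time: with gradual underflow the result is a nonzero denormalized number, while in flush-to-zero mode it becomes 0.0.

```c
#include <float.h>
#include <stdio.h>

/* Returns 1 when dividing the smallest normal double by two yields a
   nonzero (denormalized) value, i.e. gradual underflow is in effect,
   and 0 when the result was flushed to zero. */
int gradual_underflow_active(void)
{
    /* volatile keeps the division from being folded at compile time */
    volatile double smallest_normal = DBL_MIN;
    volatile double halved = smallest_normal / 2.0;
    return halved != 0.0;
}

#ifdef DEMO_MAIN /* hypothetical guard: build with -DDEMO_MAIN to run standalone */
int main(void)
{
    printf("gradual underflow: %s\n",
           gradual_underflow_active() ? "enabled" : "flushed to zero");
    return 0;
}
#endif
```

Building the same file at different optimization levels is one way to check which mode a given set of compiler flags selects on your system.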
10. Table 4-1. Pre-defined Macros (Continued)

Macro | Remarks
__PATHSCALE__ "3.1", __PATHCC__ 3, __PATHCC_MINOR__ 1, __PATHCC_PATCHLEVEL__ 0 |
_LANGUAGE_FORTRAN 1, LANGUAGE_FORTRAN 1, _LANGUAGE_FORTRAN90 1, LANGUAGE_FORTRAN90 1 | These Fortran macros will also be used if the source file is Fortran, but cpp is used.
__i386 1, __i386__ 1, i386 1 | The macros specify 32-bit x86 compilation.
__x86_64 1, __x86_64__ 1 | These macros specify 64-bit x86 compilation.
_LP64 1, __LP64__ 1 | These macros specify that long and pointer are 64-bit, while int is 32-bit.
__OPTIMIZE__ 1 | When using an optimization level at -O1 or higher, the compiler will use this macro.
_mips 1, __mips__ 1, mips 1 | MIPS-specific. Indicates the target is a MIPS processor.
__mips64 1 | MIPS-specific. The target MIPS processor has 64-bit capability.
_MIPS_SIM _ABIN32, _MIPS_SIM _ABI64 | MIPS-specific. For the _MIPS_SIM macro, _ABIN32 indicates the n32 ABI and _ABI64 indicates the 64 ABI.
_MIPS_ISA _MIPS_ISA_MIPS3, _MIPS_ARCH_MIPS3 1, _MIPS_ARCH "mips3", _MIPS_TUNE "mips3", _MIPS_TUNE_MIPS3 1, __mips 3 | MIPS-specific. Indicates that the target supports the MIPS3 instruction set.
__MIPSEL__ 1, __MIPSEL 1, _MIPSEL 1, MIPSEL 1 | MIPS-specific. Indicates that the target is little endian.
_MIPS_SZPTR 32, _MIPS_SZINT 32, _MIPS_SZL | MIPS-specific. Size of pointer, int, and long in bits.
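The LP64 remark in the table above (long and pointer are 64-bit while int is 32-bit) can be checked against the compiler's own type sizes. This is a hedged sketch rather than anything from the guide: it assumes the GCC-conventional spellings `__LP64__` and `_LP64`, and the helper names and the `DEMO_MAIN` guard are hypothetical.

```c
#include <limits.h>
#include <stdio.h>

/* Widths in bits, measured from the types themselves. */
int bits_int(void)     { return (int)(sizeof(int) * CHAR_BIT); }
int bits_long(void)    { return (int)(sizeof(long) * CHAR_BIT); }
int bits_pointer(void) { return (int)(sizeof(void *) * CHAR_BIT); }

/* 1 when the measured sizes match the LP64 data model. */
int is_lp64_model(void)
{
    return bits_int() == 32 && bits_long() == 64 && bits_pointer() == 64;
}

/* 1 when the preprocessor advertises LP64 (spellings are an assumption). */
int lp64_macro_defined(void)
{
#if defined(__LP64__) || defined(_LP64)
    return 1;
#else
    return 0;
#endif
}

#ifdef DEMO_MAIN /* hypothetical guard: build with -DDEMO_MAIN to run standalone */
int main(void)
{
    printf("int=%d long=%d pointer=%d bits, LP64 macro %s\n",
           bits_int(), bits_long(), bits_pointer(),
           lp64_macro_defined() ? "defined" : "not defined");
    return 0;
}
#endif
```

On a 64-bit x86 Linux target both checks are expected to agree; comparing them is a quick sanity test when porting code that stores pointers in integer variables.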
11.                             qualify to be DVD_DEFAULT or DVD_KIND_CONST
                                or DVD_KIND_DOUBLE */
            DVD_STAR = 2,        /* *n is specified (example: REAL*8) */
            DVD_KIND_CONST = 3,  /* KIND=expression constant across all
                                    implementations */
            DVD_KIND_DOUBLE = 4  /* KIND=expression which evaluates to
                                    KIND(1.0D0) for real across all
                                    implementations. This code may be
                                    passed for real or complex type. */
        } kind_or_star :3;       /* Set if KIND= or *n appears in the variable
                                    declaration. Values are from enum dec_codes */
        unsigned int int_len :12; /* internal length in bits of iolist entity.
                                     8 for character data to indicate size of
                                     each character */
        unsigned int dec_len :8;  /* declared length in bytes for *n or KIND
                                     value. Ignored if
                                     kind_or_star == DVD_DEFAULT */
    } f90_type_t;

If DopeVectorType.alloc_cpnt is true, then following the last actual dimension (or codimension, not necessarily MAXDIM) there is a count of the number of allocatable components, followed by an array of byte offsets from the beginning of the structure to each allocatable component. If DopeVectorType.alloc_cpnt is false, neither of these appears.

    typedef struct {
        unsigned long n_alloc_cpnt;
        unsigned long alloc_cpnt_offset[0];
    } DopeAllocType;

    typedef struct DopeVector {
        union {
            _fcd charptr;        /* Fortran character descriptor */
            struct {
                void *ptr;       /* pointer to base addre
12. system
    Fortran interface to the C library function system. Execute command using a command interpreter or shell. The function form returns the value returned by the interpreter, conventionally 0 to indicate success and nonzero to indicate failure. The subroutine form sets status to the value which the function would return.

time, time8
    Fortran interface to the POSIX function time. Returns the current time as an integer suitable for use with ctime, gmtime, or ltime.

ttynam
    Fortran interface to the POSIX function ttyname. The function form returns the name of the interactive terminal device associated with logical unit unit, or blanks if unit is not associated with such a device. The subroutine form sets name to the value that the function would return.

umask
    Fortran interface to the POSIX function umask. Sets the file creation mask to mask. The function form returns the previous value of the mask. The subroutine form sets old to the previous value of the mask.

unlink
    Fortran interface to the POSIX function unlink. Remove the link to the file named file. The function form returns 0 on success, or the error code from the C library value errno. The subroutine form sets status to the value which the function would return. Trailing blanks in file are ignored; you can prevent this by using char(0) to place a null character after the last significant character
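The return convention described for the function form of unlink (0 on success, otherwise the errno value) can be illustrated directly against the underlying POSIX call. The following C sketch is not the library's implementation; `remove_link`, `create_empty_file`, the scratch path, and the `DEMO_MAIN` guard are hypothetical names used only for this example.

```c
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Mirrors the convention described above: 0 on success,
   otherwise the error code taken from errno. */
int remove_link(const char *path)
{
    if (unlink(path) == 0)
        return 0;
    return errno;
}

/* Creates (or truncates) an empty file; returns 0 or an errno value. */
int create_empty_file(const char *path)
{
    FILE *f = fopen(path, "w");
    if (f == NULL)
        return errno;
    fclose(f);
    return 0;
}

#ifdef DEMO_MAIN /* hypothetical guard: build with -DDEMO_MAIN to run standalone */
int main(void)
{
    const char *path = "/tmp/unlink_demo.tmp"; /* hypothetical scratch file */
    if (create_empty_file(path) != 0)
        return 1;
    printf("first unlink:  %d\n", remove_link(path));
    printf("second unlink: %d (ENOENT is %d)\n", remove_link(path), ENOENT);
    return 0;
}
#endif
```

The second call fails because the file no longer exists, so the returned code is ENOENT rather than 0.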
13. IPA IPA IPA IPA IPA IPA IPA IPA IPA alias ON OFF IPA Options or IPA addressing ON OFF aggr cprop ON OFF callee limit N cgi ON OFF clone list ON OFF common pad size N Cprop ON OFF ctype ON OFF IPA depth N dfe ON OFF dve ON OFF echo ON OFF field reorder ON OFF forcedepthzN ignore lang ON OFF inline ON OFF keeplight ON OFF linear ON OFF Treats all source files as in free source form otherwise default is that only f90 or F90 suffix files are treated this way Fortran only Fortran only Defaults Comments If this is used without suboptions defaults for all suboptions will be used Same as ipa OFF ON ON 500 ON OFF ON OFF Identical to IPA maxdepth N lt ON gt lt ON gt lt OFF gt lt OF Hy V O x Hy V lt OF lt O lt O Hy ni V Hi V ni V maxdepthzN map limit N max jobs N 0 1 1 Identical to IPA depthzN 1 E Summary of Compiler Options PM aaa Table E 1 Summary of Compiler Options by Function IPA min hotness N 10 IPA multi clone N 0 IPA node bloat N IPA plimit N lt 2500 gt IPA pu reorder 0 1 2 205 for non C programs lt 1 gt for C programs IPA relopt ON OFF OFF IPA small pu N 30 IPA sp
14. -OPT:unroll_size=N
    Set the ceiling of maximum number of instructions for an unrolled inner loop. If N=0, the ceiling is disregarded. At -O3 the default is 128; otherwise the default is 40.

-OPT:wrap_around_unsafe_opt=(ON|OFF)
    -OPT:wrap_around_unsafe_opt=OFF disables both the induction variable replacement and linear function test replacement optimizations. By default these optimizations are enabled at -O3. This option is disabled by default at -O0. Setting -OPT:wrap_around_unsafe_opt to OFF can degrade performance. It is provided as a diagnostic tool.

    When used with -E, the source preprocessor will not generate # lines in the output.

-pad_char_literals
    For Fortran only. Blank pad all character literal constants that are shorter than the size of the default integer type and that are passed as actual arguments. The padding extends the length to the size of the default integer type.

-pathcc
    Define __PATHCC__ and other macros.

-pedantic-errors
    Issue warnings needed by strict compliance to ANSI C.

-pg
    Generate extra code to profile information suitable for the analysis program pathprof(1). You must use this option when compiling the source files you want data about, and you must also use it when linking. This option turns on application-level profiling but not library-level profiling (see also -profile). See the gcc man pages for more information.

-profile
    Generate extra code to profile information s
15. These constants are all scalar default-kind integers.

CHARACTER_STORAGE_SIZE
    The number of bits in a character (for our compiler, 8).

ERROR_UNIT
    The logical unit for error reporting (for our compiler, 0).

FILE_STORAGE_SIZE
    The number of bits in a file storage unit, which is used to specify the record length of an unformatted file (for our compiler, 8).

INPUT_UNIT
    The logical unit corresponding to "*" in a READ statement (for our compiler, 5).

IOSTAT_END
    The value which IOSTAT= returns for a normal end-of-file during I/O (for our compiler, -4001).

IOSTAT_EOR
    The value which IOSTAT= returns for a normal end-of-record during I/O (for our compiler, -4006).

NUMERIC_STORAGE_SIZE
    The number of bits in a numeric storage unit (for our compiler, 32). Notice that the -i8 and -r8 command-line options do not change this: they cause integer and real declarations without explicit kind type parameters to use kind=8, which corresponds to two numeric storage units. A single numeric storage unit remains available via integer(kind=4) or real(kind=4) declarations.

OUTPUT_UNIT
    The logical unit corresponding to "*" in a WRITE statement (for our compiler, 6).

3.4.7 IEEE Floating Point

Three intrinsic modules, IEEE_EXCEPTIONS, IEEE_ARITHMETIC, and IEEE_FEATURES, provide control over IEEE floating-point behavior, such as: Enabling and disabling IEEE exception
16. This option specifies the cache size. n can be 0 or a positive integer followed by one of the following letters: k, K, m, or M. These letters specify the cache size in Kbytes or Mbytes. Specifying 0 indicates there is no cache at that level. cs1 is the primary cache, cs2 refers to the secondary cache, cs3 refers to memory, and cs4 is the disk.

The default cache size for each type of cache depends on your system. Use -LIST:options=ON to see the default cache sizes used during compilation.

With a smaller cache, the cache set associativity is often decreased as well. The flagset -LNO:assoc1=n,assoc2=n,assoc3=n,assoc4=n can define this appropriately for your system.

Once again, the above flags are already set appropriately for Opteron.

7.4.3 Cache Blocking, Loop Unrolling, Interchange Transformations

Cache blocking, also called tiling, is the process of choosing the appropriate loop interchanges and loop unrolling sizes at the correct levels of the loop nests so that cache reuse can be optimized and memory accesses reduced. This whole LNO feature is on by default, but can be turned off with -LNO:blocking=off. -LNO:blocking_size=n specifies a block size that the compiler must use when performing any blocking, where n is a positive integer that represents the number of iterations.

-LNO:interchange is on by default, but setting this to 0 can disable the loop
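To make the cache blocking (tiling) idea in section 7.4.3 concrete, here is a hand-written C sketch of the kind of loop restructuring involved. It is an illustration under stated assumptions, not the transformation the compiler actually emits: the function name `transpose_blocked`, the row-major square-matrix layout, and the `DEMO_MAIN` guard are all hypothetical. The `block` parameter plays a role similar to the iteration count given to -LNO:blocking_size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Transposes the n-by-n row-major matrix src into dst, walking the
   matrix in block-by-block tiles so that each tile of src and dst
   can stay in cache while it is being processed. */
void transpose_blocked(const double *src, double *dst, size_t n, size_t block)
{
    assert(block > 0);
    for (size_t ii = 0; ii < n; ii += block) {
        for (size_t jj = 0; jj < n; jj += block) {
            /* Clamp the tile at the matrix edge when n is not a multiple of block. */
            size_t i_end = (ii + block < n) ? ii + block : n;
            size_t j_end = (jj + block < n) ? jj + block : n;
            for (size_t i = ii; i < i_end; i++)
                for (size_t j = jj; j < j_end; j++)
                    dst[j * n + i] = src[i * n + j];
        }
    }
}

#ifdef DEMO_MAIN /* hypothetical guard: build with -DDEMO_MAIN to run standalone */
int main(void)
{
    double a[9] = {1, 2, 3, 4, 5, 6, 7, 8, 9};
    double b[9] = {0};
    transpose_blocked(a, b, 3, 2);
    for (size_t i = 0; i < 3; i++)
        printf("%g %g %g\n", b[i * 3], b[i * 3 + 1], b[i * 3 + 2]);
    return 0;
}
#endif
```

The result is the same as a plain two-loop transpose; only the order in which elements are visited changes, which is why the compiler can apply blocking automatically when it can prove the reordering is safe.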
17. frequent
    The branch is executed frequently.

The branch of the IF statement that contains the pragma will be affected.

4.2.3 Mixing Code

If you have a large application that mixes Fortran code with code written in other languages, and the main entry point to your application is from C or C++, you can optionally use pathcc or pathCC to link the application, instead of pathf95. If you do, you must manually add the Fortran runtime libraries to the link line. See section 3.7 for details.

To link object files that were generated with pathCC using pathcc or pathf95, include the option -lstdc++.

4.2.4 Linking

Note that the pathcc (C language) user needs to add -lm to the link line when calling libm functions. The second pass of feedback compilation may require an explicit -lm.

4.3 Debugging and Troubleshooting C/C++

The flag -g tells the PathScale C and C++ compilers to produce data in the form used by modern debuggers, such as pathdb or GDB. This format is known as DWARF 2.0 and is incorporated directly into the object files. Code that has been compiled using -g will be capable of being debugged using pathdb, GDB, or other debuggers.

The -g option automatically sets the optimization level to -O0 unless an explicit optimization level is provided on the command line. Debugging of higher levels of optimization is possible, but the code transformation performed by the optimizations may make it more difficult.

See section 10 for more infor
18. fshort double C C only fshort enums C C only fshort wchar C C only ftest coverage Coverage data will map better to the source files if used without optimization f no underscoring Fortran only fuse cxa atexit C only fwritable strings C C only gnu N If system compiler is GCC 3 default is gnu3 if GCC 4 gnu4 C C only iN lt 4 gt Other arg is lt 8 gt Fortran only ignore suffix no intrinsic name Fortran only module dir mp MP Use with M or MM MQ MT nobool nog77mangle Fortran only no pathcc o outfile openmp E 3 E Summary of Compiler Options ee E 4 Table E 1 Summary of Compiler Options by Function pad char literals pathcc LY rreal spec S U name uvar Wc argl arg2 Yc path Diagnostic Debugging Options clist CLIST ON OFF CLIST dotc file filename CLIST doth file filename Fortran only r4 REAL KIND 4 and COMPLEX KIND 4 Other option is r8 REAL KIND 8 and COMPLEX KIND 8 Fortran only Pass argumentss to compiler pass c can be one of preprocessor front end inliner backend assembler 1 loader c is same asfor w Can also specify I Where to search for include files S Where to search for startup files crt o L Where to search for libraries o O H MG Aa Defaults Comments For Fortran C Only Same as CLIST ON C on
19. to receive the address and examine the data at that address instead of assigning to an ordinary variable p user getlogin nounderscore write 6 3a user 1 index user char 0 1 end program f part Subroutine to be called from C subroutine fl c i i8 f d 1 implicit none intrinsic flush character c integer i integer 8 i8 real f doubleprecision d logical 1 writet6 3a 215 2f5 1 18 IU c NIN 1 185 f d 1l call flush 6 Flush output before switching languages end subroutine f1 And here is the third file decorate txt getlogin nounderscore getlogin Compile and execute these three files c part c f part f90 and decorate txt like this pathf90 Wall intrinsic flush fdecorate decorate txt f part f90 c part c a out d1 9 8 f1 7 6 il1z5 i2 4 11 0 12 1 c1 len 5 c2 len 4 c3 len 6 cl hello c2 from c3 f part hello from call fortran 123 456 7 8 9 1 T d 9 8 f 7 6 i 5 18 4 le p johndoe 3 7 1 2 Example Accessing Common Blocks from C Variables in Fortran 90 modules are grouped into common blocks one for initialized data and another for uninitialized data It is possible to use fdecorate to access these common blocks from C as shown in this example 3 33 3 The PathScale Fortran Compiler Runtime I O Compatibility AA NY S cat mymodule f90 module mymodule public integer modulevarl doubleprecision modulevar2 integer
20. xor
    Bitwise Boolean xor.

zabs, zcos, zexp, zlog, zsin, zsqrt
    Specific names for various mathematical functions having an argument of type complex*16.

Appendix D. Fortran 90 Dope Vector

Here is an example of a simplified data structure from a Fortran 90 dope vector, from the file clibinc/cray/dopevec.h found in the source distribution. See section 3.6.6 for more details.

    typedef struct _FCD {
        char *c_pointer;          /* C character pointer */
        unsigned long byte_len;   /* Length of item (in bytes) */
    } _fcd;

    typedef struct f90_type {
        unsigned int :32;         /* used for future development */
        enum typecodes {
            DVTYPE_UNUSED = 0,
            DVTYPE_TYPELESS = 1,
            DVTYPE_INTEGER = 2,
            DVTYPE_REAL = 3,
            DVTYPE_COMPLEX = 4,
            DVTYPE_LOGICAL = 5,
            DVTYPE_ASCII = 6,
            DVTYPE_DERIVEDBYTE = 7,
            DVTYPE_DERIVEDWORD = 8
        } type :8;                /* type code */
        unsigned int dpflag :1;   /* set if declared double precision
                                     or double complex */
        enum dec_codes {
            DVD_DEFAULT = 0,     /* KIND= and *n absent, or KIND=expression
                                    which evaluates to the default KIND, i.e.:
                                    KIND(0) for integer
                                    KIND(0.0) for real
                                    KIND((0,0)) for complex
                                    KIND(.TRUE.) for logical
                                    KIND('A') for character
                                    across all ANSI-conformant
                                    implementations */
            DVD_KIND = 1,        /* KIND=expression which does not
21. 7 30 7 Tuning Options The pathopt2 Tool ls Table 7 4 pathopt2 Options Continued con figfile The f option is used to specify the filename of the pathopt2 XML configuration file If itis not specified the tool will first check for a file called pathopt2 xm1 in the current working directory and use it if present otherwise the tool will use the file install paths pathscale share pa thopt2 pathopt2 xml g external con figfile Loads in additional user defined configfile s This allows a user to extend the pathopt2 xml file without having to modify it h Show usage j Number of jobs 1 k Keep temporary Remove temporary directory with T directory M Directory name pwd n num iterations Number of iterations to run on each option 1 r test command Test script If this option is not specified then there is no test run and the performance of the build command is used This is useful when the program is built and run in one step and thetiming file or rate file mechanism is used to report the performance S real user system timing file rate file Selects the performance metricfor choosing options and for sorting the results real t execute target Use execute target which corresponds to an execute tag found in conf igfile The first target in configfile 7 31 7 Tuning Options The pathopt2 To
22. Comments lt OFF gt lt OFF gt lt ON gt 0 5 2000 OFF 1 5 1 ON OFF OFF ON C C only If not set the default of the current processor is used ON s15 16 ON 5 4096 c2 E 9 E Summary of Compiler Options AA NY E 10 Table E 1 Summary of Compiler Options by Function LNO prefetch ahead N LNO prefetch verbose ON OFF LNO processorszN LNO sclrze ON OFF LNO simd 0 1 2 LNO simd reduction ON OFF LNO svr phasel ON OFF LNO trip count assumed when unknown trip c ount N LNO vintr 0 1 2 LNO vintr verbose ON OFF LNO Transformation Options LNO interchange ON OFF LNO unswitch ON OFF LNO unswitch verbose ON OFF LNO ouzN LNO ou deep ON OFF LNO ou_further N LNO ou_max N LNO pwr2 ON OFF LNO Target Cache Memory Options LNO assocl N assoc2 N assoc3 N assoc4 N LNO cmp1 N cmp2 N cmp3 N cmp4 N dmp1 N dmp2 N dmp3 N dmp4 N INO cs1 N cs2 N cs3 N cs4 N LNO is mem1 ON OFF is mem2 ON OFF is mem3 ON OFF is mem4 ON OFF LNO ls1zN 1s2 N 1s3 N ls4zN LNO TLB Options LNO psl N ps2 N ps3zN ps4 N I INO t1b1 N tlb2zN tlb3zN tlb4zN 2 OFF 0 ON 1 OFF ON 1000 1 OFF Defaults Comments ON ON OFF ON C C only S
23. Intrinsic Name | Result | Arguments | Families | Remarks
GET_IEEE_STATUS | Subroutine | STATUS: I*8 | TRADITIONAL |
GMTIME | Subroutine | STIME: I*4; TARRAY: I*4 (array rank=1) | G77 PGI |
HOSTNM | I*4 | NAME: C; STATUS: I*4 | G77 PGI | O
HOSTNM | Subroutine | NAME: C; STATUS: I*4 | G77 | O
HUGE | | X: I*1, I*2, I*4, I*8, R*4, R*8 | ANSI PGI TRADITIONAL | E
IABS | I*4 | A: I*1, I*2, I*4, I*8 | ANSI G77 PGI TRADITIONAL | E P
IACHAR | I*4 | C: C | ANSI G77 PGI TRADITIONAL | E
IAND | I*4 | I: I*1, I*2, I*4, I*8; J: I*1, I*2, I*4, I*8 | ANSI G77 PGI TRADITIONAL | E
IARGC | I*4 | | G77 PGI |
IBCHNG | I*4 | I: I*1, I*2, I*4, I*8; POS: I*1, I*2, I*4, I*8 | TRADITIONAL | E
IBCLR | I*4 | I: I*1, I*2, I*4, I*8; POS: I*1, I*2, I*4, I*8 | ANSI G77 PGI TRADITIONAL | E
IBITS | I*4 | I: I*1, I*2, I*4, I*8; POS: I*1, I*2, I*4, I*8; LEN: I*1, I*2, I*4, I*8 | ANSI G77 PGI TRADITIONAL | E
IBSET | I*4 | I: I*1, I*2, I*4, I*8; POS: I*1, I*2, I*4, I*8 | ANSI G77 PGI TRADITIONAL | E
ICHAR | I*4 | C: C | ANSI G77 PGI TRADITIONAL | E
IDATE | Subroutine | I: I*1; J: I*1; K: I*1 | G77 PGI TRADITIONAL |
IDATE | Subroutine | I: I*2; J: I*2; K: I*2 | G77 PGI TRADITIONAL |
IDATE | Subroutine | I: I*4; J: I*4; K: I*4 | G77 PGI TRADITIONAL |
IDATE | Subroutine | I: I*8; J: I*8; K: I*8 | G77 PGI TRADITIONAL |
IDATE | Subroutine | TARRAY: I*1 (array rank=1) | G77 PGI TRADITIONAL |
IDATE | Subroutine | TARRAY: I*2 (Arra | G77 PGI |
24. IPA space 7 8 O0 3 2 IPA split 7 10 O1 3 2 4 4 keep 4 5 O2 3 2 4 2 6 1 L 2 4 O2 ipa 6 1 2 8 03 2 8 3 2 7 2 7 12 LANG formal deref unsafe 3 43 O3 ipa 6 1 LIST options 7 16 Ofast 6 3 7 12 7 14 7 20 Im 2 8 4 7 OPT 3 24 LNO 3 24 4 2 OPT alias 3 44 6 2 LNO assoc1 n assoc2 n assoc3 n assoc4 OPT alias any 7 20 Index 6 PathScale Compiler Suite User Guide Version 3 2 ls OPT alias cray_pointer 7 20 OPT alias disjoint 7 20 OPT alias no_parm 3 44 OPT alias no_ restrict 7 20 OPT alias restrict 6 2 7 20 OPT alias typed 6 2 7 20 OPT alias unnamed 7 20 OPT div_split 6 2 7 21 OPT early_mp 8 28 OPT fast_complex 7 23 OPT fast_exp 7 22 OPT fast_math 7 21 OPT fast_nint 7 23 OPT fast_trunc 7 22 OPT fold_reassociate 7 22 OPT goto 7 2 OPT IEEE_arithmetic 7 22 OPT IEEE_arithmetic N 7 20 OPT Ofast 6 3 OPT Olimit 6 2 7 8 OPT recip 7 21 OPT reorg common 10 3 OPT roundoff 6 2 7 21 7 22 OPT wrap around unsafe opt 10 5 p 9 1 pg 2 8 r8 3 21 S 7 43 show defaults 2 5 static 2 9 5 4 8 11 trapuv 10 1 version 2 2 WI 5 1 WOPT 3 24 WOPT fold 3 44 WOPT fold off 3 44 Wuninitialized 10 1 y on 3 41 zerouv 10 1 enabling and disabling features 7 2 group 7 2 IPA specfile 7 8 LANG rw_const 3 43 msse4a 2 4 Ofast 10 4 OPT alias parm 7 20 OPT roundoff 6 3 syntax 7 2 Outer loop unrolling 7 16 p Parallel directives 8 1 Parallelism controlling 7 1
25. 1. Invokes the IPA linker.
    2. Performs inter-procedural analysis and optimization on the linked program.
    3. Invokes the backend phases to optimize and generate the object code.
    4. Invokes the real linker to produce the final executable.

Under IPA compilation, the user will notice that the compilation of separate files proceeds very fast, because it does not involve the backend phases. On the other hand, the linking phase will appear much slower because it now encompasses the compilation and optimization of the entire program.

7.3.2 Inter-procedural Analysis and Optimization

We call the phase that operates on the IR of the linked program IPA, for Inter-Procedural Analysis, but its tasks can be divided into two categories:

    Analysis, to collect information over the entire program
    Optimization, to transform the program so it can run faster

7.3.2.1 Analysis

IPA first constructs the program call graph. Each node in the call graph corresponds to a function in the program. The call graph represents the caller-callee relationship in the program.

Once the call graph is built, based on different inlining heuristics, IPA prepares a list of function calls where it wants to inline the callee into the caller.

Based on the call graph, IPA computes the mod-ref information for the program variables. This represents the information as to whether a variable is modified or referenced inside a function call.

IPA also computes alias information for all the
26. Linker Library Options Fortran version C C version Fortran only For g For gcc g For gcc g For gcc g For g For gcc g For gcc g For gcc g For gcc g For gcc g For gcc g For gcc g Defaults Comments XPG4 mode XPG4 mode For C E 8 E Summary of Compiler Options ls Table E 1 Summary of Compiler Options by Function stdi LIST LIST LIST All LNO LNO LNO LNO LNO LNO LNO LNO LNO LNO LNO LNO LNO LNO LNO LNO LNO LNO LIST LIST nc List Options ON OFF all options s ON OFF notes ON OFF options ON OFF symbols ON OFF LNO General Options LNO options require 03 or higher apo use feedback ON OFF build scalar reductions ON OFF blocking ON OFF blocking_size N fission 0 1 2 full unroll fu N full unroll size N full unroll outer ON OFF fusion 0 1 2 fusion peeling limit N gather scatter 0 1 2 hoistif ON OFF ignore feedback ON OFF ignore pragmas ON OFF local pad size N minvariant minvar ON OFF non blocking loads ON OFF oinvar ON OFF opt 0 1 ou prod max N outer ON OFF outer unroll max ou max N parallel overhead N prefetch 0 1 2 3 Defaults Comments lt ON gt if any LIST suboptions are enabled lt OFF gt lt ON gt lt OFF gt Defaults
27. PGI LOC 1 8 l Any type G77 PGI Array rank any TRADITIONAL LOCK_ Subroutine I 1 4 18 TRADITIONAL E RELEASE LOCK TEST I 1 4 TRADITIONAL E AND SET J 14 LOCK TEST I 1 8 TRADITIONAL E AND SET J I 8 LOG R 4 X R 4 R 8 Z 8 ANSI G77 E Z 16 PGI TRADITIONAL LOG10 R 4 X R 4 R 8 ANSI G77 E PGI TRADITIONAL LOG2 IMAGES l 4 TRADITIONAL LOGICAL 74 L L 1 L 2 L 4 L 8 ANSI PGI E KIND 11 1 2 1 4 TRADITIONAL Oo I8 LONG 1 4 A 1 1 1 2 1 4 8 G77 E R 4 R 8 Z 8 Z 16 TRADITIONAL C 29 C Supported Fortran Intrinsics Table of Supported Intrinsics AA NY C 30 Table C 1 Fortran Intrinsics Supported in Version 3 2 Continued Intrinsic Name Result Arguments Families Remarks LSHIFT 1 11 1 2 1 4 I8 G77 PGI E R 4 R 8 TRADITIONAL CrayPtr L 1 L 2 L 4 L 8 POSITIVE_SHIFT I 1 1 2 1 4 18 LSTAT 1 4 FILE C G77 PGI O SARRAY I 4 Array rank 1 STATUS 1 4 LSTAT Subroutine FILE C G77 O SARRAY I 4 Array rank 1 STATUS 1 4 LTIME Subroutine STIME 174 G77 PGI TARRAY 4 Array rank 1 MALLOC I 1 1 1 2 1 4 I8 PGI E TRADITIONAL MASK 1 11 1 2 1 4 1 8 TRADITIONAL E R 4 R 8 CrayPtr L 1 L 2 L 4 L 8 MATMUL ANSI PGI See Std TRADITIONAL MAX ANSI G77 See Std PGI TRADITIONAL MAXO ANSI G77 See Std PGI TRADITIONAL MAX1 ANSI G77 See Std PGI TRADITIONAL MAX X R 4 R 8 ANSI PGI E EXPONENT TRADITIONAL MAXLOC ANSI PGI See S
28. PGI O F8 TRADITIONAL ALARM 1 4 SECONDS l 4 18 G77 PGI HANDLER Procedure STATUS I 4 O C 3 C Supported Fortran Intrinsics Table of Supported Intrinsics AA NY Table C 1 Fortran Intrinsics Supported in Version 3 2 Continued Intrinsic Name Result Arguments Families Remarks ALARM Subroutine SECONDS 4 1 8 G77 HANDLER Procedure STATUS I 4 O ALL ANSI PGI See Sid TRADITIONAL ALLOCATED ANSI PGI See Sid TRADITIONAL ALOG R 4 X R 4 R 8 ANSI G77 E P PGI TRADITIONAL ALOG10 R 4 X R 4 R 8 ANSI G77 E P PGI TRADITI NAL AMAXO ANSI G77 See Std PGI TRADITIONAL AMAX1 ANSI G77 See Std PGI TRADITIONAL AMINO ANSI G77 See Std PGI TRADITIONAL AMIN1 ANSI G77 See Std PGI TRADITIONAL AMOD R4 A R 4 R 8 ANSI G77 E P P R 4 R 8 PGI TRADITIONAL AND 1 11 F2 1 4 1 8 ANSI G77 E R 4 R 8 PGI CrayPtr L 1 L 2 TRADITIONAL L 4 L 8 J 11 12 I4 I 8 R 4 R 8 CrayPtr L 1 L 2 L 4 L 8 AND AND I 1 4 TRADITIONAL E FETCH J 1 4 C 4 C Supported Fortran Intrinsics Table of Supported Intrinsics I Table C 1 Fortran Intrinsics Supported in Version 3 2 Continued Intrinsic Name Result Arguments Families Remarks AND AND I 1 8 TRADITIONAL E FETCH J I 8 ANINT R4 A R 4 R 8 ANSI G77 E P KIND I 1 1 2 1 4 PGI O 8 TRADITIONAL ANY ANSI PGI See Sid TRADITIONAL ASIN R 4 X R 4
29.
          PROGRAM HELLO
          INTEGER NTHREADS, TID, OMP_GET_NUM_THREADS, OMP_GET_THREAD_NUM
          TID = 0
          NTHREADS = 1
    ! Fork a team of threads giving them their own copies of variables TID
    !$OMP PARALLEL PRIVATE(TID)
    ! Obtain and print thread id
    !$    TID = OMP_GET_THREAD_NUM()
    !$OMP CRITICAL
          PRINT *, 'Hello World from thread ', TID
    !$OMP END CRITICAL
    !$OMP MASTER
    !$OMP CRITICAL
    ! Only master thread does this
    !$    NTHREADS = OMP_GET_NUM_THREADS()
          PRINT *, 'Number of threads ', NTHREADS
    !$OMP END CRITICAL
    !$OMP END MASTER
    ! All threads join master thread and disband
    !$OMP END PARALLEL
          END

The !$ before some of the lines are conditional compilation tokens. These lines are ignored when compiled without -mp.

We compile omphello.f for OpenMP with this command:

    $ pathf95 -c -mp omphello.f

Now we link it, again using -mp:

    $ pathf95 -mp omphello.o -o omphello.out

We set the environment variable for the number of threads with this command:

    export OMP_NUM_THREADS=5

Now run the program:

    $ ./omphello.out
    Hello World from thread 1
    Hello World from thread 2
    Hello World from thread 3
    Hello World from thread 0
    Number of threads 5
    Hello World from thread 4

The output from the different threads can be in a different order each time the program is run. We can change the environment variable to run with two threads:

    export OMP_NUM_THREADS=2

Now the outp
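The conditional compilation idea above has a close analog in C, sketched here for comparison. This is an illustration, not the guide's own C example: it relies on the standard _OPENMP macro, which OpenMP compilers define when OpenMP is enabled, so the same source builds with or without the OpenMP option. The helper names current_thread, team_size, and count_hellos are hypothetical.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>   /* only pulled in when OpenMP is enabled */
#endif

/* Thread id of the caller; 0 when built without OpenMP. */
int current_thread(void)
{
#ifdef _OPENMP
    return omp_get_thread_num();
#else
    return 0;
#endif
}

/* Size of the current team; 1 when built without OpenMP. */
int team_size(void)
{
#ifdef _OPENMP
    return omp_get_num_threads();
#else
    return 1;
#endif
}

/* Every thread in the team increments the counter once. The critical
 * section serializes the update, as the CRITICAL directive does for the
 * PRINT statements in the Fortran program. Without OpenMP the pragmas
 * are ignored and the body runs exactly once. */
int count_hellos(void)
{
    int hellos = 0;
#pragma omp parallel
    {
#pragma omp critical
        hellos++;
    }
    assert(hellos >= 1);
    return hellos;
}
```

Built with OpenMP enabled and OMP_NUM_THREADS set to 5, count_hellos() would be expected to return 5; built without it, the function returns 1.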
30. R 8 ANSI G77 E P PGI TRADITIONAL ASIND R 4 X R 4 R 8 PGI E TRADITIONAL ASSOCIATED ANSI PGI See Std TRADITIONAL ATAN R 4 X R 4 R 8 ANSI G77 E P PGI TRADITIONAL ATAN2 R4 Y R 4 R 8 ANSI G77 E P X R 4 R 8 PGI TRADITIONAL ATAN2D R 4 Y R 4 R 8 PGI E P X R 4 R 8 TRADITIONAL ATAND R 4 X R 4 R 8 PGI E P TRADITIONAL BESJO R 4 X R 4 G77 PGI BESJ1 R 4 X R 4 G77 PGI BESJ1 R 8 X R 8 G77 PGI BESJN R 4 N R 4 G77 PGI X R 4 BESJN R 8 N R 4 G77 PGI X R 8 BESYO R 4 X R 4 G77 PGI BESYO R 8 X R 8 G77 PGI BESY1 R 4 X R 4 G77 PGI BESY1 R 8 X R 8 G77 PGI BESYN R 4 N R 4 G77 PGI X R 4 C 5 C Supported Fortran Intrinsics Table of Supported Intrinsics AA NY Table C 1 Fortran Intrinsics Supported in Version 3 2 Continued Intrinsic Name Result Arguments Families Remarks BESYN R 8 N R 4 G77 PGI X R 8 BITEST 1 2 PGI E POS 1 1 1 2 1 4 1 8 TRADITIONAL BIT_SIZE I 1 1 F2 1 4 1 8 ANSI G77 E PGI TRADITIONAL BJTEST I 1 4 PGI E POS 1 1 1 2 1 4 1 8 TRADITIONAL BKTEST I 1 8 TRADITIONAL E POS 1 1 1 2 1 4 1 8 BTEST I 1 1 1 2 1 4 1 8 ANSI G77 E POS 1 1 1 2 1 4 1 8 PGI TRADITIONAL CABS R 4 A Z 8 Z 16 ANSI G77 E P PGI TRADITIONAL CCOS Z 8 X Z 8 Z 16 ANSI G77 E P PGI TRADITIONAL CDABS R 8 A Z 16 G77 PGI E P TRADITIONAL CDCOS Z 16 X Z 16 G77 PGI E P TRADI TIONAL CDEXP Z 16 X Z 16 G7
31. TRADITIONAL TRANSPOSE Depends on arg MATRIX Any type ANSI PGI Array TRADITIONAL rank 2 TRIM Depends onarg STRING C ANSI PGI TRADITIONAL TTYNAM C UNIT 174 G77 PGI TTYNAM Subroutine UNIT 1 4 G77 NAME C UBOUND ANSI PGI See Std TRADITIONAL UMASK 1 4 MASK I 4 G77 UMASK Subroutine MASK I 4 G77 O OLD 4 UNIT I 1 1 I2 1 4 I 8 TRADITIONAL E UNLINK 1 4 FILE C G77 PGI O STATUS 1 4 UNLINK Subroutine FILE C G77 O STATUS 1 4 UNPACK ANSI PGI See Std TRADITIONAL VERIFY 1 4 STRING C ANSI PGI E SET C TRADITIONAL Q BACK L 1 L 2 L 4 L 8 C Supported Fortran Intrinsics Fortran Intrinsic Extensions ls Table C 1 Fortran Intrinsics Supported in Version 3 2 Continued Intrinsic Name Result Arguments Families Remarks WRITE_ Subroutine TRADITIONAL E MEMORY BARRIER XOR I 1 1 1 2 1 4 I8 G77 PGI E R 4 R 8 TRADITIONAL CrayPtr L 1 L 2 L 4 L 8 J 1 1 l2 1 4 I 8 R 4 R 8 CrayPtr L 1 L 2 L 4 L 8 XOR_AND_ I 1 4 TRADITIONAL E FETCH J 1 4 XOR_AND_ I 148 TRADITIONAL E FETCH J I 8 ZABS R 8 A Z 16 G77 E P TRADITIONAL ZCOS Z216 X Z716 G77 E P TRADITIONAL ZEXP Z216 X Z716 G77 E P TRADITIONAL ZLOG Z216 X Z716 G77 E P TRADITIONAL ZSIN Z216 X Z716 G77 E P TRADITIONAL ZSQRT Z216 X Z716 G77 E P TRADITIONAL C 4 Fortran Intrinsic Extensions Standard Fortran intrinsic procedures are documented in ISO 1539 1 or any good textbook on Fortran 95
32. The -ipa flag implies -O2 -ipa because -O2 is the default. Flags like -ipa can be used in combination with a very large number of other flags, but some typical combinations with the -O flags are shown below: -O3 -ipa or -O2 -ipa is a typical additional attempt at improved performance over the -O3 or -O2 flag alone. -ipa needs to be used both in the compile and in the link steps of a build.

Using IPA with your program is usually straightforward. If you have only a few source files, you can simply use it like this:

    pathf95 -O3 -ipa main.f subs1.f subs2.f

If you compile files separately, the .o files generated by the compiler do not actually contain object code; they contain a representation of the source code. Actual compilation happens at link time. The link command also needs the -ipa flag added. For example, you could separately compile and then link a series of files like this:

    pathf95 -c -O3 -ipa main.f
    pathf95 -c -O3 -ipa subs1.f
    pathf95 -c -O3 -ipa subs2.f
    pathf95 -O3 -ipa main.o subs1.o subs2.o

Currently there is a restriction that each archive (for example, libfoo.a) must contain either .o files compiled with -ipa or .o files compiled without -ipa, but not both.

Note that in a non-IPA compile, most of the time is spent compiling all the files to create the object files (the .o's), and the link step is quite fast. In an IPA compile, creating the .o files is very fast, but the link step can take a long time. The t
33. This preprocessor automatically expands macros outside of preprocessor statements. The default is to run the C preprocessor (cpp) if the input file ends in a .F or .F90 suffix. For more information on controlling preprocessing, see the -ftpp, -E, and -nocpp options. For information on enabling macro expansion, see the -macro-expand option. By default, no preprocessing is performed on files that end in a .f or .f90 suffix.

-Dvar=[def][,var=[def]...]
    Define variables used for source preprocessing as if they had been defined by a #define directive. If no def is specified, 1 is used. For information on undefining variables, see the -Uvar option.

-d-lines
    (Fortran only) Compile lines with a D in column 1.

-default64
    (For Fortran only) Set the sizes of default integer, real, logical, and double precision objects. This option is a synonym for the pair of options -r8 -i8. Calling a routine in a specialized library, such as SCSL, requires that its 64-bit entry point be specified when 64-bit data are used. Similarly, its 32-bit entry point must be specified when 32-bit data are used.

-dumpversion
    Show the version of the compiler being used and nothing else.

-E
    Run only the source preprocessor files, without considering suffixes, and write the result to stdout. This option overrides the -nocpp option. The output file contains line directives. To generate an output file w
34. and releases the associated thread from ownership of the nestable lock.

int omp_test_lock(omp_lock_t *lock)
    Try to acquire the lock; return a non-zero value if successful, 0 if not.

int omp_test_nest_lock(omp_nest_lock_t *lock)
    Attempt to set a lock using the same method as omp_set_nest_lock, but the executing thread does not wait for confirmation that the lock is available. If the lock is successfully set, the function increments the nesting count and returns the new nesting count; if the lock is unavailable, the function returns a value of zero.

double omp_get_wtime(void)
    Returns a double-precision value equal to the number of seconds since the initial value of the operating system real-time clock.

double omp_get_wtick(void)
    Returns a double-precision floating-point value equal to the number of seconds between successive clock ticks.

8.8 Runtime Libraries

There are both static and dynamic versions of each library, and the libraries are supplied in both 64-bit and 32-bit versions.

The libraries are:

    /opt/pathscale/lib/<version>/libopenmp.so     dynamic 64-bit
    /opt/pathscale/lib/<version>/libopenmp.a      static 64-bit
    /opt/pathscale/lib/<version>/32/libopenmp.so  dynamic 32-bit
    /opt/pathscale/lib/<version>/32/libopenmp.a   static 32-bit

The symbolic links to the dynamic versions of the libr
35. - ROUNDUP(x) rounds x upwards to the nearest higher integer
- MIN(a,b) is the minimum of a and b
- MAX(a,b) is the maximum of a and b

The minimum chunk size is the value specified by the user in the guided scheduling directive (defaults to 1).

NOTE: If the values of PSC_OMP_GUIDED_CHUNK_MAX and minimum chunk size are inconsistent (i.e., the minimum is larger than the maximum), the minimum chunk size takes precedence per the OpenMP specification.

PSC_OMP_LOCK_SPIN (Integer value: 0 or non-zero)
    This chooses the locking mechanism used by critical sections and OMP locks.
    0: user-level spin locks are disabled; pthread mutexes are used
    non-zero: user-level spin locks are enabled. This is the default.
    This determines whether locking in critical sections and OMP locks is implemented with user-level spin loops or with pthread mutexes. Synchronization using pthread mutexes is significantly more expensive, but frees up execution resources for other threads.

PSC_OMP_SILENT (Set or not set)
    If you set PSC_OMP_SILENT to anything, then warning and debug messages from the libopenmp library are inhibited. Fatal error messages are not affected by the setting of PSC_OMP_SILENT.

PSC_OMP_STACK_SIZE (Stack size specifications)
    Stack size specification follows the syntax in section 3.13. See section 8.10.1 for more details.

PSC_OMP_STATIC_FAIR (Set or not set)
    Th
36. file. Default is OFF.

-LNO:local_pad_size=N
    This option specifies the amount by which to pad local array dimensions. By default, the compiler automatically chooses the amount of padding to improve cache behavior for local array accesses.

-LNO:minvariant,minvar=(ON|OFF)
    Enable or disable moving loop-invariant expressions out of loops. The default is ON.

-LNO:non_blocking_loads=(ON|OFF)
    (For C/C++ only) The option specifies whether the processor blocks on loads. If not set, the default of the current processor is used.

-LNO:oinvar=(ON|OFF)
    This option controls outer loop hoisting. Default is ON.

-LNO:opt=(0|1)
    This option controls the LNO optimization level. The options can be one of the following:
    0  Disable nearly all loop nest optimizations.
    1  Perform full loop nest transformations. This is the default.

-LNO:ou_prod_max=N
    This option indicates that the product of unrolling of the various outer loops in a given loop nest is not to exceed N, where N is a positive integer. The default is 16.

-LNO:outer=(ON|OFF)
    This option enables or disables outer loop fusion. Default is ON.

-LNO:outer_unroll_max,ou_max=N
    The outer_unroll_max option indicates that the compiler may unroll outer loops in a loop nest by as many as N per loop, but no more. The default is 5.

-LNO:parallel_overhead=N
    Effective only when specified with -apo, the parallel_overhead option controls the auto-parallelizing compi
37. if or endif is followed by text.

-W[no-]error
    -Werror makes all warnings into errors. -Wno-error tells the compiler not to make all warnings into errors.

-Werror-implicit-function-declaration
    (For C/C++ only) Give an error when a function is used before being declared.

-W[no-]float-equal
    -Wfloat-equal warns if floating-point values are compared for equality. -Wno-float-equal tells the compiler not to warn if floating-point values are compared for equality.

-W[no-]format
    (For C/C++ only) -Wformat warns about printf format anomalies. -Wno-format tells the compiler not to warn about printf format anomalies.

-W[no-]format-nonliteral
    (For C/C++ only) With -Wformat-nonliteral, and if -Wformat is in effect, warn if the format string is not a string literal. With -Wno-format-nonliteral, do not warn if the format string is not a string literal.

-W[no-]format-security
    (For C/C++ only) With -Wformat-security, and if -Wformat is in effect, warn on potentially insecure format functions. With -Wno-format-security, do not warn on potentially insecure format functions.

-W[no-]id-clash
    (For C/C++ only) -Wid-clash warns if two identifiers have the same first num chars. -Wno-id-clash tells the compiler not to warn if two identifiers have the same first num chars.

-W[no-]implicit
    (For C/C++ only) -Wimplicit warns about implicit declarations of functions or variables. -Wno-implicit tells the compiler not to warn about implicit dec
38. just as we do for the cpp and ftpp preprocessors. As with the other preprocessors, an option like -Isubdir (no trailing / is needed) tells the preprocessor to add subdir to the list of directories in which it will search for included files.

Unlike the cpp and ftpp preprocessors, this one requires that its identifiers be declared with a data type, so an option like -DIVAR=5 declares a constant (not a variable) IVAR with the type integer and the value 5, while an option like -DLVAR declares a constant LVAR with the type logical and the value .true. Only integer and logical constants are allowed. You can use the -D option to override the value of a constant declaration for that identifier which might appear in the source file.

The standard requires that the preprocessor read a setfile capable of defining constants, variables, and modes of operation, but it does not specify how to find the setfile. If you use -fcoco, the preprocessor looks for coco.set in the current directory. If no such file exists, the preprocessor quietly proceeds without it. If you use an option like -fcoco=somedir/mysettings, the preprocessor looks for the file somedir/mysettings. You cannot use the -D option to override a constant declaration which appears in the setfile.

The open source package on which this feature is based does provide additional extensions and command-line options, described at
39. since memory access latency and bandwidth may vary based on the relative locations of the processor and memory. The affinity mechanism is often specific to a particular OS or kernel, and the following discussion is relevant to most modern Linux distributions and kernels, though details may still vary.

A processor here refers to a CPU core, and this might be a conventional single-core processor, a CPU core in a multi-core processor, or a hyper-threaded CPU core.

Affinity can be specified at the thread level, allowing distinct threads in a process to have different settings. By default, the affinity of a thread is usually set to all available CPU cores on the system, which allows the kernel to schedule that thread freely. Typically, affinity is inherited by a child process when forked from a parent process.

Affinity can be modified to any subset of the CPU cores except the empty set. Examples include a single CPU core, all CPU cores on a particular socket, and all CPU cores on the system.

Affinity may be set or retrieved from the command line using the taskset utility or similar. Run-time libraries, such as the PathScale OpenMP run-time library, may automatically set affinity in order to optimize thread placement. Also, application programs may themselves set affinity if required.

PSC_OMP_AFFINITY (TRUE or FALSE)
    When TRUE, the operating system's affinity mechanism (where availab
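As a rough illustration of an application setting affinity itself, the sketch below uses the Linux-specific sched_getaffinity and sched_setaffinity calls from glibc. It is not how the PathScale OpenMP run-time library does it; the helper names allowed_cpu_count, first_allowed_cpu, and pin_to_cpu are hypothetical, and the code assumes a Linux system with glibc.

```c
#define _GNU_SOURCE          /* needed for cpu_set_t and the CPU_* macros */
#include <assert.h>
#include <sched.h>

/* Number of CPU cores the calling thread may run on, or -1 on error. */
int allowed_cpu_count(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    return CPU_COUNT(&set);
}

/* Lowest-numbered core in the current affinity set, or -1 on error. */
int first_allowed_cpu(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}

/* Restrict the calling thread to a single core. Returns 0 on success.
 * The set passed to the kernel must not be empty, matching the rule
 * that affinity can be any subset of cores except the empty set. */
int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}
```

Passing 0 as the first argument applies the call to the calling thread, so different threads in one process can end up with different settings, as described above. A child created with fork after pin_to_cpu would typically inherit the narrowed set.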
40. specifying -ipa or -IPA. Default settings for the individual IPA suboptions are used.

-IPA:...
    The inter-procedural analyzer option group controls application of inter-procedural analysis and optimization, including inlining, constant propagation, common block array padding, dead function elimination, alias analysis, and others. Specify -IPA by itself to invoke the inter-procedural analysis phase with default options. If you compile and link in distinct steps, you must specify at least -IPA for the compile step, and specify -IPA and the individual options in the group for the link step. If you specify -IPA for the compile step and do not specify -IPA for the link step, you will receive an error.

-IPA:addressing=(ON|OFF)
    Invoke the analysis of address operator usage. The default is OFF. -IPA:alias=ON is a prerequisite for this option.

-IPA:aggr_cprop=(ON|OFF)
    Enable or disable aggressive inter-procedural constant propagation. Setting can be ON or OFF. This attempts to avoid passing constant parameters, replacing the corresponding formal parameters by the constant values. Less aggressive inter-procedural constant propagation is done by default. The default setting is ON.

-IPA:alias=(ON|OFF)
    Invoke alias/mod/ref analysis. The default is ON.

-IPA:callee_limit=N
    Functions whose size exceeds this limit will never be automatically inlined by the compiler. The default is 500.

-IPA:cgi=(ON|OFF)
    In
41. that do not have the value attribute. More simply, you can achieve compatibility by using the bind(c) attribute on a Fortran procedure, and then either using the value attribute on a Fortran dummy variable to make it match the C default behavior, or using pointer arguments in the C code to make them match the Fortran default behavior.

The following example uses the Fortran value attribute to make argument a match the corresponding C argument, which is passed by value. For the argument b, the C prototype uses a pointer to match the corresponding Fortran argument, which uses call by reference. Argument c illustrates that type(c_ptr), passed with the value attribute, matches a C void pointer.

    extern long c_function(long a, long *b, void *c);

    interface
      integer(c_long) function c_function(a, b, c) bind(c)
        use, intrinsic :: iso_c_binding
        integer(c_long), value :: a
        integer(c_long) :: b
        type(c_ptr), value :: c
      end function c_function
    end interface

3.4.9.5 Enumerations

A C enumeration establishes a series of named integer constants, analogous to Fortran declarations having the PARAMETER attribute. To aid interoperability with C, Fortran 2003 provides an analogous statement. By default, the first name in an enumeration has the value 0, and each subsequent name has a value one greater than its predecessor. But you can assign a specific value to any name; then the next name will (unless you assign a specific value to it as well) have a value one greater than its predecessor. red 1 blue 2 gree
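For reference, here is what the C side of these two interoperability features might look like. The prototype is the one given above; the function body is invented purely for illustration, and the enumeration names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Matches: extern long c_function(long a, long *b, void *c);
 * a arrives by value, b is a pointer because the Fortran dummy argument
 * is passed by reference, and c is a plain void pointer.
 * The body is an assumption made for this sketch: it adds a to *b and
 * returns the new value, or -1 if c is a null pointer. */
long c_function(long a, long *b, void *c)
{
    assert(b != NULL);
    *b += a;                      /* this store is visible to the Fortran caller */
    return (c != NULL) ? *b : -1;
}

/* C enumeration rules: the first name defaults to 0, and a name without
 * an explicit value is one greater than its predecessor. */
enum color { red, blue = 5, green };  /* red == 0, blue == 5, green == 6 */
```

Because b is a pointer on the C side, the assignment through it changes the variable the Fortran caller passed, which is the call-by-reference behavior the interface relies on.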
42. write a end

This example generates the following output:

    5*88

This behavior conforms to the language standard. However, some users prefer to see multiple values instead of the repeat factor:

    88 88 88 88 88

There are two ways to accomplish this: using an environment variable and using the assign command.

3.10.4.1 Environment Variable

If the environment variable FTN_SUPPRESS_REPEATS is set before the program starts executing, then list-directed write and print statements will output multiple values instead of using the repeat factor.

To output multiple values when running within the bash shell:

    export FTN_SUPPRESS_REPEATS=yes

To output multiple values when running within the csh shell:

    setenv FTN_SUPPRESS_REPEATS yes

To output repeat factors when running within the bash shell:

    unset FTN_SUPPRESS_REPEATS

To output repeat factors when running within the csh shell:

    unsetenv FTN_SUPPRESS_REPEATS

3.10.4.2 assign Command

Using the -y on option to the assign command will cause all list-directed output to the specified file names or unit numbers to output multiple values; using the -y off option will cause them to use repeat factors instead. For example, to output multiple values on logical unit 6 and on any logical unit which is associated with file test2559.out, type these commands before running the program:

    export FILENV=myassignfile
44. 2

- PSC_OMP_GUIDED_CHUNK_MAX is the value of the PSC_OMP_GUIDED_CHUNK_MAX environment variable (defaults to 300)
- minimum chunk size is the size of the smallest piece (this is the value of chunk in the SCHEDULE directive)
- ROUNDUP(x) rounds x upwards to the nearest higher integer
- MIN(a,b) is the minimum of a and b
- MAX(a,b) is the maximum of a and b

When SCHEDULE(RUNTIME) is specified, the decision regarding scheduling is deferred until runtime. The schedule type and chunk size can be chosen at runtime by setting the OMP_SCHEDULE environment variable. If this environment variable is not set, the resulting schedule is implementation-dependent (Table 1, page 17).

The default runtime schedule is static scheduling. The default chunk size is set to the number of iterations of the loop divided by the number of threads in the team, rounded up to the nearest integer. The loop iterations are partitioned into chunks of the default chunk size. If the number of iterations of the loop is not an exact integer multiple of the number of threads in the team, the last chunk will be smaller than the default chunk size, and in some cases it may contain zero loop iterations. The chunks are assigned to threads starting from the thread with local index 0. The thread with the highest local index will receive the last chunk, and this may be smaller than the others, or even zero. The loop
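The default static schedule just described is simple arithmetic, and a short sketch makes the edge cases easier to see. The code below is only an illustration of the partitioning rule stated in the text, not the run-time library's code; chunk_t and static_chunk are hypothetical names.

```c
#include <assert.h>

/* First iteration index (0-based) and iteration count for one thread. */
typedef struct {
    long first;
    long count;
} chunk_t;

/* Default runtime schedule as described in the text: the chunk size is
 * iterations / nthreads rounded up, and chunks are handed out in order
 * starting from the thread with local index 0. */
chunk_t static_chunk(long iterations, int nthreads, int tid)
{
    chunk_t c;
    long size, start;

    assert(iterations >= 0 && nthreads > 0);
    assert(tid >= 0 && tid < nthreads);

    size  = (iterations + nthreads - 1) / nthreads;  /* ROUNDUP(iterations / nthreads) */
    start = (long)tid * size;

    if (start >= iterations) {
        /* Nothing left for this thread: a chunk of zero iterations. */
        c.first = iterations;
        c.count = 0;
    } else {
        long remaining = iterations - start;
        c.first = start;
        c.count = (remaining < size) ? remaining : size;  /* MIN(remaining, size) */
    }
    return c;
}
```

With 10 iterations and 4 threads the chunk size is 3, so threads 0 to 2 each get 3 iterations and thread 3 gets 1. With 9 iterations and 4 threads the chunk size is still 3, and thread 3 receives a chunk of zero iterations, which is the case the text warns about.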
45. 4 1 8 ANSI PGI _KIND TRADITIONAL SELECTED_ Depends on arg P 1 1 172 174 178 ANSI PGI O REAL_KIND R 1 1 1 2 1 4 1 8 TRADITIONAL SETBUF 1 4 UNIT 174 TRADITIONAL BUF C SETLINEBUF 1 4 UNIT 14 TRADITIONAL SET X R4 R 8 ANSI PGI E EXPONENT I 1 1 1 2 1 4 18 TRADITIONAL SET_IEEE_ Subroutine EXCEPTION I 8 TRADITIONAL E EXCEPTION SET_IEEE_ Subroutine STATUS I 8 TRADITIONAL EXCEPTIONS SET_IEEE_ Subroutine STATUS I 8 TRADITIONAL INTERRUPTS SET_IEEE_ Subroutine STATUS I 8 TRADITIONAL ROUNDING MODE SET_IEEE_ Subroutine STATUS I 8 TRADITIONAL STATUS C 36 C Supported Fortran Intrinsics Table of Supported Intrinsics I Table C 1 Fortran Intrinsics Supported in Version 3 2 Continued Intrinsic Name Arguments Families Remarks SHAPE ANSI PGI See Std TRADITIONAL SHIFT 1 11 I2 1 4 I 8 PGI E R 4 R 8 TRADITIONAL CrayPtr L 1 L 2 L74 L 8 J F1 12 1 4 1 8 SHIFTA I 1 1 I2 1 4 1 8 TRADITIONAL E R 4 R 8 CrayPtr L 1 L 2 L74 L 8 J 1 12 1 4 1 8 SHIFTL I 1 1 1 2 1 4 1 8 TRADITIONAL E R 4 R 8 CrayPtr L 1 L 2 L74 L 8 J F1 12 1 4 1 8 SHIFTR 1 11 1 2 1 4 1 8 TRADITIONAL E R 4 R 8 CrayPtr L 1 L 2 L 4 L 8 J 11 12 1 4 1 8 SHORT I 2 A I 1 I2 I 4 I 8 G77 E R 4 R 8 TRADITIONAL Z 8 Z 16 SIGN R 4 A 1 1 I2 1 4 1 8 ANSI G77 E P R 4 R 8 PGI B 1 1 I 2 1 4 1 8 TRADITIONAL R 4 R 8 SIGNAL I 8 NUMBER I 1 1 2 G77
46. or
    Bitwise Boolean OR.

perror
    Like the C library function perror, prints on the stderr stream the string, followed by a colon, a blank, and the message corresponding to the error code from the C library value errno.

rand
    Fortran interface to POSIX function rand. Returns a uniform pseudorandom integer. If flag is 0, return the next number in the current sequence; if flag is 1, call POSIX function srand(0); otherwise call srand(flag) to seed a new sequence.

realpart
    Real part of a complex number; synonym for standard intrinsic real, which in Fortran 95 preserves the precision of its argument.

rename
    Fortran interface to the C library function rename. Change name of file path1 to path2. The function form returns 0 on success or an error code from the C library value errno. The subroutine sets status to the value which the function would return. Trailing blanks in file are ignored; you can prevent this by using char(0) to place a null character after the last significant character.

rshift
    Arithmetic (sign-preserving) bitwise right shift. Shift count must be nonnegative and less than the bit size of the data.

secnds
    Return the number of seconds since midnight in the local time zone, minus the argument t.

second
    The function form returns the sum of user and system CPU time consumed by the process since the start of execution. The subroutine form sets seconds to that value.

setbuf
    This is simil
47. Bit pattern for pi end

3.4.3 Pointer INTENT

A dummy argument with the POINTER attribute may also use the INTENT attribute (section 5.1.2.7 of the Fortran 2003 standard).

    subroutine s(arg0, arg1, arg2)
      integer, pointer, intent(in) :: arg0
      integer, pointer, dimension(:), intent(out) :: arg1
      real, pointer, intent(inout) :: arg2
      ! Illegal: arg0 => null()
      arg0 = 5   ! Legal
    end subroutine s

When used with a pointer, the INTENT attribute refers to the pointer itself, not to the target of the pointer. Therefore, in the preceding example, it would be illegal to nullify arg0 or to associate arg0 with a different target, but it is legal to use arg0 to change the value of the target.

3.4.4 VOLATILE Attribute and Statement

The VOLATILE attribute tells the compiler that a variable might change in ways outside the ambit of the Fortran language itself. For example, suppose that a function remember, written in C, takes the address of its argument and stores it in a C global variable, and suppose that a function assignit uses that stored address to change the value of the variable.

    interface
      subroutine remember(a) bind(c)
        real a
      end subroutine remember
      subroutine assignit()
      end subroutine assignit
    end interface
    real rvalue
    call remember(rvalue)
    rvalue = 5.0
    call assignit()   ! Changes rvalue to something besides 5
    print *, rvalue
    end

The Fo
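The C half of the remember/assignit scenario is not shown in the text, so the following is a guess at its shape, written only to make the aliasing concrete. It assumes the Fortran default real corresponds to a C float, and the value 6.0 written by assignit is arbitrary.

```c
#include <assert.h>
#include <stddef.h>

/* The C global variable from the text: holds the address passed to remember. */
static float *remembered = NULL;

/* Store the address of the caller's variable for later use. */
void remember(float *a)
{
    assert(a != NULL);
    remembered = a;
}

/* Change the remembered variable behind the caller's back.
 * Does nothing if remember has not been called yet. */
void assignit(void)
{
    if (remembered != NULL)
        *remembered = 6.0f;   /* arbitrary value other than 5 */
}
```

Nothing in the Fortran call to assignit mentions rvalue, so a compiler that sees only the Fortran source has no reason to reload it after the call. That hidden store through remembered is the kind of change outside the language that VOLATILE is meant to declare.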
48. E 1 Summary of Compiler Options by Function Cpp Dvar def var def d lines fcoco setfile f no preprocessed ftpp E D Dtarget Dupdate Ss a S amp S Ss X F z G M z MMD 3 acro expand nocpp no gcc P traditional Uvar Processor Target Description m32 m3dnow m64 march lt cpu type gt mcmode1 small medium Fortran only Fortran only Fortran only Use with M or MM Fortran only Fortran only Fortran only Defaults Comments 32 bit ABI lt OFF gt 64 bit if march mcpu mtu ne is 64 bit otherwise 32 bit ABI auto which optimizes for platform compiler is running on Explicit choices are opteron athlon athlone4 athlon64fx em 4t pentium4 xeon core anyx86 small usually sufficient E Summary of Compiler Options aaa Table E 1 Summary of Compiler Options by Function mcpu lt cpu type gt mno sse mno sse2 mno sse3 msse2 msse3 mtune lt cpu type gt Profiling Options pg profile Target Environment Options TENV frame_pointer ON OFF TENV X 0 4 TENV simd imask ON OFF TENV simd dmask ON OFF TENV simd omask ON OFF O O TENV simd zmask ON OFF O ON OFF TENV simd_umask TENV simd pmask ON OFF Warning Options Wall Wdeclaration after statement Werror implicit function declara
49. Example 1 Run with Makefile This shows the simplest use of the application with a Makefile There are no optimization flags in the make def file we supply All optimization flags are sent from pathopt2 to the compiler by propagating the value of from the pathopt2 command line to the CFLAGS and FFLAGS Makefile variables 7 37 7 Tuning Options The pathopt2 Tool AA NN The command will now look like this S pathopt2 t try5 r bin ft A make clean ft CLASS A FFLAGS Note that we omitted the pathopt2 xm1 option in this example As mentioned previously when this option is omitted pathopt2 will use the file pathopt2 xml if itis present in the current working directory otherwise it will use the default pathopt2 xml that ship with the software Output from the run should be similar to the following Only the sorted summary is shown here Sorted summary from all runs Flags Build Test Real User System O3 OPT Ofast PASS PASS 12 74 12 38 0 36 03 ipa PASS PASS 12 77 12 31 0 45 03 PASS PASS 12 79 12 42 0 37 Ofast PASS PASS 13 66 13 19 0 47 02 PASS PASS 14 50 14 12 0 38 7 9 8 3 Example 2 Use Build Run Scripts and a Timing File Next let s assume that we want to do our pathopt2 work in a sub directory of NPB2 3 SER to avoid littering the top level directory with scripts and possibly output files mkdir pathopt2 cd pathopt2 mkdir logs logs is where we will keep a copy of t
50. Fortran source files are converted into an internal representation when compiled into object files. For example, a Fortran subroutine called foo gets turned into the name foo_ when placed in the object file. We do this to avoid name collisions with similar functions in other libraries. This makes mixing code from C, C++, and Fortran easier.

Name mangling ensures that function, subroutine, and common block names from a Fortran program or library do not clash with names in libraries from other programming languages. For example, the Fortran library contains a function named access, which performs the same function as the function access in the standard C library. However, the Fortran library access function takes four arguments, making it incompatible with the standard C library access function, which takes only two arguments. If your program links with the standard C library, this would cause a symbol name clash. Mangling the Fortran symbols prevents this from happening.

By default, we follow the same name mangling conventions as the GNU g77 compiler and libf2c library when generating mangled names. Names without an underscore have a single underscore appended to them, and names containing an underscore have two underscores appended to them. The following examples should help make this clear:

    molecule  -> molecule_
    run_check -> run_check__
    energy_   -> energy___

This behavior can be modified by using the -fno-second-underscore and the
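The default mangling rule is small enough to express as code. The helper below is a sketch of the rule as stated above, not a function from any PathScale or g77 library; the name g77_mangle is hypothetical. Folding the name to lowercase is an extra assumption, since the text does not discuss case.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Write the default mangled form of a Fortran name into out.
 * Returns 0 on success, or -1 if out is too small to hold the result. */
int g77_mangle(const char *name, char *out, size_t outsz)
{
    size_t len, extra;

    assert(name != NULL && out != NULL);
    len = strlen(name);

    /* One underscore is appended normally; two if the name contains one. */
    extra = (strchr(name, '_') != NULL) ? 2 : 1;

    if (len + extra + 1 > outsz)
        return -1;

    for (size_t i = 0; i < len; i++)
        out[i] = (char)tolower((unsigned char)name[i]);  /* assumption: lowercase */

    memset(out + len, '_', extra);
    out[len + extra] = '\0';
    return 0;
}
```

A C caller that wants to reach a Fortran subroutine named run_check would therefore declare and call the symbol run_check__, while one named molecule would be reached as molecule_.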
51. IMAG 2 28 Z 16 G77 E TRADITIONAL IMAGPART Z Z 8 Z 16 G77 E IMOD I2 A I2 PGI E P P 2 TRADITIONAL IMVBITS Subroutine FROM I 2 TRADITIONAL E FROMPOS 1 1 2 1 4 1 8 LEN 171 1 2 1 4 18 TO I2 TOPOS I 1 1 2 1 4 I8 INDEX 1 4 STRING C ANSI G77 E P SUBSTRING C PGI O BACK L 1 L 2 L 4 TRADITIONAL L 8 ININT I2 A R 4 R 8 PGI E P TRADITIONAL INOT I2 I 1 2 PGI E TRADITIONAL INT 1 4 A 1 1 1 2 1 4 1 8 ANSI G77 E R 4 R 8 Z 8 Z 16 PGI O KIND 171 1 2 1 4 TRADITIONAL I8 INT2 I2 A 1 1 1 2 I4 F8 G77 E R 4 R 8 Z 8 Z 16 TRADITIONAL INT4 1 4 A 1 1 1 2 1 4 F8 TRADITIONAL E R 4 R 8 Z 8 Z 16 INT8 I8 A 1 1 1 2 1 4 F8 G77 PGI E R 4 R 8 Z 8 Z 16 TRADITIONAL INT_MULT_ I 1 8 E UPPER J 1 8 INT_MULT_ I E UPPER J IOR 1 4 I 1 1 I2 1 4 I8 ANSI G77 E J 1 1 12 I4 I8 PGI TRADITIONAL IRAND 1 4 FLAG 1 4 G77 PGI O IRTC 1 8 TRADITIONAL C Supported Fortran Intrinsics Table of Supported Intrinsics ls Table C 1 Fortran Intrinsics Supported in Version 3 2 Continued Intrinsic Name Result Arguments Families Remarks ISATTY L 4 UNIT 1 4 G77 PGI ISHA 1 11 I2 1 4 I8 TRADITIONAL E SHIFT 171 1 2 1 4 1 8 ISHC I 1 1 I2 1 4 I8 TRADITIONAL E SHIFT 1 1 1 2 1 4 1 8 ISHFT 1 11 I 2 1 4 I8 ANSI G77 E SHIFT 1 1 1 2 1 4 PGI 1 8 TRADITIONAL ISHFTC 1 11 I2 1 4 I8 ANSI G77 E SHIFT 1 1 1 2 1 4 PGI O 48 TRAD
52. Intrinsic Name Result Arguments Families Remarks SUM ANSI PGI See Std TRADITIONAL SYMLNK 1 4 PATH1 C G77 PGI O PATH2 C STATUS 4 SYMLNK Subroutine PATH1 C G77 O PATH2 C STATUS I 4 SYNCHRONIZE Subroutine TRADITIONAL E SYNC IMAGES Subroutine TRADITIONAL SYNC IMAGES Subroutine IMAGE I 1 1 2 I 4 TRADITIONAL I8 SYNC IMAGES Subroutine IMAGE 171 172 1 4 TRADITIONAL I8 Array rank 1 SYSTEM 1 4 COMMAND C G77 PGI O STATUS I 4 SYSTEM Subroutine COMMAND C G77 O STATUS I 4 SYSTEM Subroutine COUNT 114 ANSI G77 O CLOCK COUNT RATE I4 PGI O COUNT MAX t4 TRADITIONAL SYSTEM CLOC Subroutine COUNT I 8 ANSI G77 O K COUNT RATE I 8 PGI O TAN R 4 X R 4 R 8 ANSI G77 E P PGI TRADITIONAL TAND R 4 X R 4 R 8 PGI E TRADITIONAL TANH R 4 X R 4 R 8 ANSI G77 E P PGI TRADITIONAL TEST IEEE EXCEPTION I 8 TRADITIONAL E EXCEPTION TEST_IEEE_ INTERRUPT I 8 TRADITIONAL E INTERRUPT C 39 C Supported Fortran Intrinsics Table of Supported Intrinsics AA NN C 40 Table C 1 Fortran Intrinsics Supported in Version 3 2 Continued Intrinsic Name Result Arguments Families Remarks THIS IMAGE Depends on arg ARRAY Any type TRADITIONAL O Arrayrank any DIM 171 1 2 1 4 I 8 TIME 1 4 G77 PGI TRADITIONAL TIMEF R 8 X 1 TIME8 I8 G77 TRADITIONAL TIME Subroutine BUF C G77 TINY X R 4 R 8 ANSI PGI E TRADITIONAL TRANSFER ANSI PGI See Std
53. L 8 K 171 12 1 4 I 8 R 4 R 8 CrayPtr L 1 L 2 L 4 L 8 CVMGN 1 11 I2 1 4 I 8 TRADITIONAL E R 4 R 8 CrayPtr L 1 L 2 L 4 L 8 J I 1 172 I4 I 8 R 4 R 8 CrayPtr L 1 L 2 L 4 L 8 K 171 12 1 4 I 8 R 4 R 8 CrayPtr L 1 L 2 L 4 L 8 C 9 C Supported Fortran Intrinsics Table of Supported Intrinsics AA p V PV Am 1A Table C 1 Fortran Intrinsics Supported in Version 3 2 Continued C 10 Intrinsic Name Arguments Families Remarks CVMGP I 1 1 I 2 I 4 I 8 TRADITIONAL E R 4 R 8 CrayPtr L 1 L 2 L 4 L 8 J I 1 I 2 I4 I 8 R 4 R 8 CrayPtr L 1 L 2 L 4 L 8 K 1 1 I2 I 4 I8 R 4 R 8 CrayPtr L 1 L 2 L 4 L 8 CVMGT 1 11 I2 I4 1 8 TRADITIONAL E R 4 R 8 CrayPtr L 1 L 2 L 4 L 8 J 11 12 I4 I 8 R 4 R 8 CrayPtr L 1 L 2 L 4 L 8 K L 1 L 2 L 4 L 8 CVMGZ 1 11 I 2 I4 I 8 TRADITIONAL E R 4 R 8 CrayPtr L 1 L 2 L 4 L 8 J I 1 F2 I4 I 8 R 4 R 8 CrayPtr L 1 L 2 L 4 L 8 K 1 1 I2 1 4 I8 R 4 R 8 CrayPtr L 1 L 2 L 4 L 8 C LOC I8 X Any type Array TRADITIONAL rank any DABS R 8 A R 8 ANSI G77 E P PGI TRADITIONAL DACOS R8 X R8 ANSI G77 E P PGI TRADITIONAL C Supported Fortran Intrinsics Table of Supported Intrinsics ls Table C 1 Fortran Intrinsics Supported in Version 3 2 Continued Intrinsic Name Resul
54. NY Table C 1 Fortran Intrinsics Supported in Version 3 2 Continued Intrinsic Name Result Arguments Families Remarks CONJG Z 8 Z Z 8 Z 16 ANSI G77 E P PGI TRADITIONAL COS R 4 X R 4 R 8 Z 8 ANSI G77 E P Z 16 PGI TRADITIONAL COSD R 4 X R 4 R 8 PGI E P TRADITIONAL COSH R 4 X R 4 R 8 ANSI G77 E P PGI TRADITIONAL COT R 4 X R 4 R 8 TRADITIONAL E P COTAN R 4 X R 4 R 8 TRADITIONAL E P COUNT ANSI PGI See Std TRADITIONAL CPU_TIME Subroutine TIME R 4 ANSI G77 PGI TRADITIONAL CPU_TIME Subroutine TIME R 8 ANSI G77 PGI TRADITIONAL CSHIFT ANSI PGI See Std TRADITIONAL CSIN Z 8 X Z 8 Z 16 ANSI G77 E P PGI TRADITIONAL CSMG 1 11 I2 1 4 I 8 TRADITIONAL E R 4 R 8 CrayPtr L 1 L 2 L74 L 8 J 1 1 12 I4 I 8 R 4 R 8 CrayPtr L 1 L 2 L74 L 8 K 171 I2 1 4 I 8 R 4 R 8 CrayPtr L 1 L 2 L 4 L 8 CSQRT Z 8 X Z 8 Z 16 ANSI G77 E P PGI TRADITIONAL C 8 C Supported Fortran Intrinsics Table of Supported Intrinsics TI a Table C 1 Fortran Intrinsics Supported in Version 3 2 Continued Intrinsic Name Result Arguments Families Remarks CTIME C STIME I 4 G77 PGI CTIME C STIME I 8 G77 PGI CTIME Subroutine G77 STIME I 4 O RESULT C CTIME Subroutine STIME 18 G77 O RESULT C CVMGM 1 11 I2 1 4 1 8 TRADITIONAL E R4 R 8 CrayPtr L 1 L 2 L 4 L 8 J I 1 172 I4 I 8 R 4 R 8 CrayPtr L 1 L 2 L 4
55. (ON|OFF) Controls conversion of a multi-dimensional array to a single-dimensional (linear) array that covers the same block of memory. When inlining Fortran subroutines, IPA tries to map formal array parameters to the shape of the actual parameter. In the case that it cannot map the parameter, it linearizes the array reference. By default, IPA will not inline such callsites because they may cause performance problems. The default is OFF.

-IPA:map_limit=N
    Direct when IPA enables sp_partition. N is the maximum size (in bytes) of input files mapped before IPA invokes -IPA:sp_partition.

-IPA:maxdepth=N
    This option directs IPA to not attempt to inline functions at a depth of more than N in the callgraph, where functions that make no calls are at depth 0, those that call only depth 0 functions are at depth 1, and so on. The default is a very large number. This inlining remains subject to overriding limits on code expansion. Also see -IPA:forcedepth, -IPA:space, and -IPA:plimit.

-IPA:max_jobs=N
    This option limits the maximum parallelism when invoking the compiler after IPA to at most N compilations running at once. The option can take the following values:
    0    The parallelism chosen is equal to either the number of CPUs, the number of cores, or the number of hyperthreading units in the compiling system, whichever is greatest.
    1    Disable parallelization during compilation (default).
    >1   Specifically set the degree o
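The depth numbering used by the maxdepth description (depth 0 for functions that make no calls, depth 1 for functions that call only depth 0 functions, and so on) can be illustrated with a short sketch. It is written in Python purely for illustration and is not how IPA is implemented; the function `call_depths` and the example call graph are hypothetical, and the sketch assumes a call graph without recursion because the text does not say how recursive calls are treated.

```python
def call_depths(callgraph):
    """Map each function to its depth in the call graph.

    Depth is 0 for a function that makes no calls, otherwise one more than
    the largest depth among its callees.
    Assumption: the graph is acyclic (no recursion).
    """
    depths = {}

    def depth(func):
        if func not in depths:
            callees = callgraph.get(func, [])
            depths[func] = (1 + max(depth(c) for c in callees)) if callees else 0
        return depths[func]

    for func in callgraph:
        depth(func)
    return depths


if __name__ == "__main__":
    # Hypothetical program: main calls solve and report, solve calls dot.
    graph = {"main": ["solve", "report"], "solve": ["dot"], "report": [], "dot": []}
    depths = call_depths(graph)
    limit = 1  # plays the role of N in the option description
    print(depths)
    print("within depth limit:", sorted(f for f, d in depths.items() if d <= limit))
```

With a limit of 1 in this example, `dot`, `report`, and `solve` fall within the limit while `main` (depth 2) does not.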
56. OpenMP in either Fortran or C and C++.

OMP_DYNAMIC
    Enables or disables dynamic adjustment of the number of threads available for execution. Default is FALSE, since this mechanism is not supported.
OMP_NESTED
    Enables or disables nested parallelism. Default is FALSE.
OMP_SCHEDULE
    This environment variable only applies to DO and PARALLEL DO directives that have schedule type RUNTIME. Type can be STATIC, DYNAMIC, or GUIDED. Default is STATIC, with no chunk size specified.
OMP_NUM_THREADS
    Set the number of threads to use during execution. Default is number of CPUs in the machine.

A.5.2 PathScale OpenMP Environment Variables

These environment variables can be used with OpenMP in Fortran and C and C++, except as indicated.

PSC_OMP_AFFINITY
    When TRUE, the operating system's affinity mechanism (where available) is used to assign threads to CPUs; otherwise no affinity assignments are made. The default value is TRUE.
PSC_OMP_AFFINITY_GLOBAL
    This environment variable controls whether thread global ID or local ID values are used when assigning threads to CPUs. The default is TRUE, so that global ID values are used for calculating thread assignments.
PSC_OMP_AFFINITY_MAP
    This environment variable allows the mapping from threads to CPUs to be fully specified by the user. It must be set to a list of CPU identifiers separated by commas. The list m
57. PSC_GENFLAGS
    Generic flags passed to all compilers.
PSC_STACK_LIMIT (Fortran)
    Controls the stack size limit the Fortran runtime attempts to use. This string takes the format of a floating-point number, optionally followed by one of the characters "k" (for units of 1024 bytes), "m" (for units of 1048576 bytes), "g" (for units of 1073741824 bytes), or "%" (to specify a percentage of physical memory). If the specifier is followed by the string "/cpu", the limit is divided by the number of CPUs the system has. For example, a limit of "1.5g" specifies that the Fortran runtime will use no more than 1.5 gigabytes (GB) of stack. On a system with 2GB of physical memory, a limit of "90%/cpu" will use no more than 0.9GB of stack (2/2*0.90, that is, 2GB divided among 2 CPUs, times 0.90).
PSC_STACK_VERBOSE (Fortran)
    If this environment variable is set, the Fortran runtime will print detailed information about how it is computing the stack size limit to use.

Standard OpenMP Runtime Environment Variables

These environment variables can be used with OpenMP in either Fortran or C and C++.

OMP_DYNAMIC
    Enables or disables dynamic adjustment of the number of threads available for execution. Default is FALSE, since this mechanism is not supported.
OMP_NESTED
    Enables or disables nested parallelism. Default is FALSE.
OMP_SCHEDULE
    This environment variable only applies to DO and PARALLEL DO directives that have schedule type RUNTIME. Type can be STATIC, DYNAMIC, or GUIDED. Default is ST
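To make the PSC_STACK_LIMIT format concrete, here is a minimal sketch that turns such a string into a byte count. It is written in Python for illustration only and does not reproduce the Fortran runtime's actual parser; the function name `parse_stack_limit` is hypothetical, and treating a bare number as a byte count is an assumption, since the text only describes the suffixed forms.

```python
import re

# Unit multipliers as given in the text.
_UNITS = {"k": 1024, "m": 1048576, "g": 1073741824}


def parse_stack_limit(spec, phys_mem_bytes, ncpus):
    """Return the stack limit in bytes described by a PSC_STACK_LIMIT-style string.

    Assumption: a number with no suffix is taken to mean bytes.
    """
    match = re.fullmatch(r"(\d+(?:\.\d*)?|\.\d+)([kmg%]?)(/cpu)?", spec.strip())
    if match is None:
        raise ValueError("unrecognized stack limit: %r" % spec)
    number, unit, per_cpu = float(match.group(1)), match.group(2), match.group(3)
    if unit == "%":
        limit = phys_mem_bytes * number / 100.0   # percentage of physical memory
    else:
        limit = number * _UNITS.get(unit, 1)      # absolute size
    if per_cpu:
        limit /= ncpus                            # "/cpu" divides by the CPU count
    return limit


if __name__ == "__main__":
    two_gb = 2 * 1073741824
    print(parse_stack_limit("1.5g", two_gb, 2) / 1073741824, "GB")
    print(parse_stack_limit("90%/cpu", two_gb, 2) / 1073741824, "GB")
```

The two calls mirror the examples in the text: "1.5g" gives 1.5 GB, and "90%/cpu" on a 2GB, two-CPU system gives about 0.9 GB.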
58. PathScale compilers:

    1. Check the compiler name in your makefile: is the correct compiler being called? For example, you may need to add a line like this:

           $ CC=pathcc ./configure <options>

       Change the compiler in your makefile to pathcc or pathf95.
    2. Check any flags that are called to be sure that the PathScale Compiler Suite supports them. See the eko man page in Appendix E for a complete listing of supported flags.
    3. If you plan on using IPA, see Section 7.3 for suggestions.
    4. Compile your code and look at the results.
       a. Did the program compile and link correctly? Are there missing libraries that were previously linked automatically?
       b. Look for behavior differences: does the program behave correctly? Are you getting the right answer (for example, with numerical analysis)?

5.7 Compatibility

5.7.1 gcc Compatibility Wrapper Script

Many software build packages check for the existence of gcc, and may even require the compiler used to be called gcc in order to build correctly. To provide complete compatibility with gcc, we provide a set of gcc compatibility wrapper scripts in /opt/pathscale/compat-gcc/bin (or <install_directory>/compat-gcc/bin). This script can be invoked with different names:

    gcc, cc      to look like the GNU C compiler, and call pathcc
    g++, c++     to look like the GNU C++ compiler, and call pathCC
    g77, f77     to look like the GNU Fortran compiler, and call pathf95

To use this script, you must put t
59. Performance 8 27 8 14 1 Reduced Datasets cas accea tease ie die RR RRERRREXREXS 8 27 8 14 2 Enable OpenMP a inate i6est cased ae o C908 EDAD EDNA DH etica 8 28 8 14 3 Optimizations for OpenMP 0 000 eee 8 28 8 14 3 1 Ris sme PT TT 8 28 8 14 3 2 Memory System Performance cece eee eee 8 28 8 14 3 3 Load Balancing za 9e donee due tes Seite eek Bp hdd 8 29 8 14 3 4 Tuning the Application Code lessen 8 30 8 14 3 5 Using Feedback Data axes paga scien BLEND ke ke o Ses NGA 8 30 8 15 Other Resources for OpenMP 0 000 e eee eee 8 31 Section 9 Examples 9 1 Compiler Flag Tuning and Profiling With pathprof 9 1 9 2 Using the profile Option 0 00 2 c eee eee eee 9 4 Section 10 Debugging and Troubleshooting 10 1 Subscription Manager Problems 0c eee enters 10 1 10 2 Deb gging MC 10 1 10 3 Dealing with Uninitialized Variables a 10 1 10 4 Trapping IEEE Exceptions 345 943 ERROR RRER Hab RR n 10 2 10 5 Larga Object Support xxx ERR ERR Re RE DRE hes 10 3 Page xv PathScale Compiler Suite User Guide Version 3 2 _ CP 10 6 More Inputs Than Registers 20 eee eee ee 10 4 10 7 Linking With 1ibg2g usura wb CODE PER NG Soe te EYERG Ie 10 4 10 8 Linking Large Object Files 20 0 00 e eee eee 10 4 10 9 Using ipa and SOLSS saved ss dice edge INS aude P
60. R 4 A 1 2 PGI E TRADITIONAL FLOATJ R 4 A 1 4 PGI E TRADITIONAL FLOATK R 4 A 1 8 PGI E TRADITIONAL FLOOR A R 4 R 8 ANSI PGI E KIND 1 1 1 2 1 4 TRADITIONAL I 8 C 17 C Supported Fortran Intrinsics Table of Supported Intrinsics AA NN Table C 1 Fortran Intrinsics Supported in Version 3 2 Continued Intrinsic Name Result Arguments Families Remarks FLUSH Subroutine UNIT 174 1 8 G77 PGI O STATUS 1 4 O FNUM 1 4 UNIT 1 4 G77 TRADITIONAL FPUT 1 4 C C G77 O STATUS 1 4 FPUT Subroutine C C G77 O STATUS 1 4 FPUTC 1 4 UNIT 1 4 18 G77 PGI O C C STATUS I 4 FPUTC Subroutine UNIT 1 4 I 8 G77 O C C STATUS 1 4 FP CLASS Depends on arg X R 4 TRADITIONAL E FP_CLASS Depends on arg X R 4 TRADITIONAL E FP_CLASS Depends on arg X R 8 TRADITIONAL E FP_CLASS Depends on arg X R 8 TRADITIONAL E FRACTION X R 4 R 8 ANSI PGI E TRADITIONAL FREE Subroutine P 1 1 1 2 1 4 I 8 PGI E CrayPtr TRADITIONAL FSEEK 1 4 UNIT 1 4 G77 PGI OFFSET 174 WHENCE 4 FSEEK Subroutine UNIT 74 G77 OFFSET 174 WHENCE 4 FSEEK Subroutine UNIT 74 G77 OFFSET I 8 WHENCE 4 FSTAT 1 4 UNIT I 1 1 2 4 1 8 G77 PGI SARRAY I 1 I 2 TRADITIONAL 1 4 18 Array rank 1 STATUS I 1 1 2 1 4 1 8 C Supported Fortran Intrinsics Table of Supported Intrinsics ls Table C 1 Fortran Intrinsics Supported in Version 3 2 Continued
61. Subscription Manager Install Guide for general information on setting your PATH. You should see a list of output summarizing the result of all the runs. The first set of flags are listed in the order in which they were run. This is followed by a summary table, which sorts the same output by time, from fastest to slowest. Sample output from this run is shown below.

    Flags             Build   Test    Real    User    System
    -O2               PASS    PASS    2.83    2.82    0.00
    -O3               PASS    PASS    2.39    2.39    0.00
    -O3 -ipa          PASS    PASS    2.40    2.40    0.01
    -O3 -OPT:Ofast    PASS    PASS    2.37    2.38    0.00
    -Ofast            PASS    PASS    2.38    2.38    0.00

    Sorted summary from all runs:
    Flags             Build   Test    Real    User    System
    -O3 -OPT:Ofast    PASS    PASS    2.37    2.38    0.00
    -Ofast            PASS    PASS    2.38    2.38    0.00
    -O3               PASS    PASS    2.39    2.39    0.00
    -O3 -ipa          PASS    PASS    2.40    2.40    0.01
    -O2               PASS    PASS    2.83    2.82    0.00

From these results, we see that the best option from this run is -O3 -OPT:Ofast. The next sections will discuss details on usage, command-line options, and the configuration file format.

7.9.2 pathopt2 Usage

Basic usage is as follows:

    pathopt2 [-n num_iterations] [-f configfile] [-t execute-target] [-r test-command] [-S real|user|system] build-command ... @ ... [args]

The command line above shows the most commonly used options; for the complete list of options, see Table 7-4. The pathopt2 tool runs build-command with the provided arguments, and using additional options as specified in configfile. The
62. TRADITIONAL Qo I8 SIZE 1 1 1 2 14 178 JISHL 1 4 I 1 4 TRADITIONAL E SHIFT 1 1 1 2 1 4 1 8 JISIGN 1 4 A P4 PGI E P B 1 4 TRADITIONAL JMOD 4 A l 4 PGI E P P 1 4 TRADITIONAL C 26 C Supported Fortran Intrinsics Table of Supported Intrinsics ls Table C 1 Fortran Intrinsics Supported in Version 3 2 Continued Intrinsic Name Result Arguments Families Remarks JMVBITS Subroutine FROM 4 TRADITIONAL E FROMPOS 1 1 1 2 1 4 I 8 LEN 171 172 1 4 18 TO 14 TOPOS 171 172 1 4 I8 JNINT 1 4 A R 4 R 8 TRADITIONAL E P JNOT 1 4 I 1 4 PGI E TRADITIONAL KIABS 1 8 A I8 PGI E TRADITIONAL KIAND I8 I 1 8 PGI E J 18 TRADITIONAL KIBCHNG I 8 I8 TRADITIONAL E POS 1 1 172 1 4 18 KIBCLR I 8 I 8 PGI E POS 1 1 1 2 1 4 1 8 TRADITIONAL KIBITS 1 8 1 8 PGI E POS 1 1 172 1 4 1 8 TRADITIONAL LEN 171 172 1 4 18 KIBSET I 8 I 8 PGI E POS 1 1 1 2 144 1 8 TRADITIONAL KIDIM I 8 X 1 8 PGI E Y 8 TRADITIONAL KIDINT I 8 A R 8 TRADITIONAL E KIEOR I 8 1 8 TRADITIONAL E J l 8 KIFIX I 8 A R 4 R 8 PGI E TRADITIONAL KILL 1 4 PID 1 4 G77 PGI SIG 1 4 TRADITIONAL KILL Subroutine PID 1 4 G77 O SIG 1 4 TRADITIONAL STATUS I4 KIND 1 4 X Any type ANSI PGI E TRADITIONAL KINT I 8 A R 4 TRADITIONAL E C 27 C Supported Fortran Intrinsics Table of Supported Intrinsics AA NN Table C 1 Fortran Intrinsics Supported in Vers
63. The optional third argument igndf 1 takes these values 1 Use the second argument to provide a handler to restore the default response to the signal or to ignore the signal o Regardless of the value of the second argument restore the default response to the signal 1 Regardless of the value of the second argument ignore the signal instead When igndf 1 is omitted handler can be an integer with these possible values addre s s An integer containing the address of the external procedure o Restore the default response to the signal 1 lgnore the signal The function form returns the previous state of the signal zero if the default response was in effect one if the signal was being ignored or the address of a handler procedure Here is an example using the two argument form C Keyboard interrupt normally Control C alternately triggers C handlerl and handler2 until 4 interrupts have occurred Then C restore the default handling so the fifth interrupt stops the C program C 50 C Supported Fortran Intrinsics Fortran Intrinsic Extensions o 9 9 97 aaa program once implicit none external handlerl handler2 common previous count integer 8 previous integer count previous signal 2 handler1 previous signal 2 handler2 count 4 do while true call sleep 100 end do end subroutine handlerl implicit none common previous count integer 8 previous integer count pri
64. UN secs log SPSC METRIC FILE grep SUCCESSFUL logs 1 2 txt

Make the file executable and run pathopt2:

    chmod +x compile-go-rate
    pathopt2 -S rate-file -t try5 ./compile-go-rate ft A

    Sorted summary from all runs:
    Flags             Build   Test    Rate
    -Ofast            PASS    PASS    662.60
    -O3 -ipa          PASS    PASS    662.37
    -O3               PASS    PASS    655.03
    -O3 -OPT:Ofast    PASS    PASS    654.30
    -O2               PASS    PASS    603.43

Since -Ofast produced the best results in the sorted summary, we can now try the target peak_Ofast:

    pathopt2 -S rate-file -t peak_Ofast ./compile-go-rate ft A

A truncated listing of the output shows the top five results for this run:

    Sorted summary from all runs:
    Flags                                                                    Build   Test    Rate
    -Ofast -CG:prefetch=off -CG:load_exe=0 -OPT:unroll_size=256              PASS    PASS    702.72
    -Ofast -CG:prefetch=off -CG:load_exe=0                                   PASS    PASS    702.17
    -Ofast -msse3 -CG:load_exe=0 -LNO:interchange=off -OPT:unroll_size=256   PASS    PASS    696.36
    -Ofast -CG:prefetch=off -msse -CG:load_exe=0 -LNO:interchange=off        PASS    PASS    695.08
    -Ofast -msse3 -CG:load_exe=0 -LNO:interchange=off                                        694.48

In a situation like this, with a near tie at the top, one would normally use the simpler flag set for production:

    -Ofast -CG:prefetch=off -CG:load_exe=0

which can be shortened to:

    -Ofast -CG:prefetch=off:load_exe=0

7.10 How Did the Compiler Optimize My Code?

Often you may want to
65. a block of code or execution order of statements within a block of code ATOMIC SOMP atomic expression statement BARRIER SOMP barrier CRITICAL SOMP critical name structured block SOMP end critical name FLUSH SOMP flush list MASTER SOMP master structured block SOMP end master ORDERED SOMP ordered structured block SOMP end ordered Data environments Control the data environment during the execution of parallel constructs THREADPRIVATE SOMP threadprivate c1 c2 WORKSHARE SOMP workshare 8 5 8 Using OpenMP and Autoparallelization OpenMP Compiler Directives C C AA NY 8 5 OpenMP Compiler Directives C C pragmaThe OpenMP directives for C and C all start with pragma They are only processed by the compiler if mp is specified 8 6 Some of the OpenMP directives also support additional clauses The following table lists the C and C compiler directives provided by version 2 0 of the OpenMP C C Application Program Interface Table 8 2 C C Compiler Directives Directive Clauses Example Parallel region construct Defines a parallel region PARALLEL pragma omp parallel clause structured block PRIVATE SHARED FIRSTPRIVATE DEFAULT SHARED NONE REDUCT ION COPYIN IF NUM THREADS encounter it Work sharing constructs Divide the execution of the enc
66. are represented as signed 32-bit quantities. The offsets for data within the binaries are represented as signed 64-bit quantities. In this model, all code in an executable must come to less than 2GB in total size. The data, both static and BSS, are allowed to exceed 2GB in size. As with the small memory model, pointers are also signed 64-bit quantities and may exceed 2GB in size.

NOTE: The PathScale compilers do not support the use of the -fPIC option flag in combination with the -mcmodel=medium option. The code model medium is not supported in PIC mode.

The PathScale compilers support -mcmodel=medium and -fPIC in the same way that GCC does. When building shared libraries, only -fPIC should be used. Use the option -mcmodel=medium, but not -fPIC, when compiling and linking the main program. The reasoning behind this is that, because the shared library is self-contained, it does not know about the fixed addresses of the data in the program that it is linked with. The library will only access the program data through pointers, and such pointer data accesses are not affected by the value of the -mcmodel option. The -mcmodel value only affects the addressing of data with fixed addresses. When these addresses are larger than 2GB, the compiler has to generate longer sequences of instructions. Thus it does not do that unless the -mcmodel=medium flag is given. See 10.4 for more information on using large objects, and your GCC 3.3.1 documentation
67. array arguments so that procedure calls pass a bit indicating whether the array is contiguous. This requires that the program use explicit interfaces for all procedures: with interface blocks, with module use statements, or by nesting one procedure inside another with contains. Each of those methods provides the compiler with an explicit interface from the viewpoint of the Fortran standard.

NOTE: Redundant interfaces are incorrect; don't provide an interface block for a procedure whose interface is already imported via a use statement.

The compiler will also copy noncontiguous arrays to temporary variables in some situations where the standard does not require it, but where heuristics suggest that this will improve performance by better using the cache. To disable this category of copying, use the command-line option -LANG:copyinout=off.

3.13 Fortran Compiler Stack Size

The Fortran compiler allocates data on the stack by default. Some environments set a low limit on the size of a process's stack, which may cause Fortran programs that use a large amount of data to crash shortly after they start. If the PathScale Fortran runtime environment detects a low stack size limit, it will automatically increase the size of the stack allocated to a Fortran process before the Fortran program begins executing. By default, it automatically increases this limit to the total amount of physical memory on a system, less 128 megabytes per CPU. For example
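The default described above (total physical memory, less 128 megabytes per CPU) is simple arithmetic, sketched here in Python for illustration. The function name `default_stack_limit` is hypothetical, and the sketch assumes that a megabyte means 1048576 bytes; it does not query the running system.

```python
MEGABYTE = 1048576  # assumption: 1 megabyte = 2**20 bytes


def default_stack_limit(phys_mem_bytes, ncpus):
    """Physical memory minus 128 megabytes for each CPU, in bytes."""
    return phys_mem_bytes - 128 * MEGABYTE * ncpus


if __name__ == "__main__":
    # Hypothetical machine: 4096 megabytes of memory and 2 CPUs.
    limit = default_stack_limit(4096 * MEGABYTE, 2)
    print(limit // MEGABYTE, "megabytes")
```

For the hypothetical machine in the sketch, the result is 4096 - 2 * 128 = 3840 megabytes.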
68. assume unknown trip count 0 1000 This flag is no longer supported It has been promoted to LNO trip_count_assumed_when_unknown LNO pf1 ON OFF pf2 ON OFF pf3 ON OFF pf4 ON OFF This options selectively disables or enables prefetching for cache level x for pfx ON OFF LNO prefetch 0 1 2 3 This option specifies the levels of prefetching The options can be one of the following 0 Prefetch disabled 1 Prefetch is done only for arrays that are always referenced in each iteration of a loop 2 Prefetch is done without the above restriction This is the default 3 Most aggressive LNO prefetch ahead N This option prefetches the specified number of cache lines ahead of the reference Specify a positive integer for N default is 2 LNO prefetch manual ON OFF This option specifies whether manual prefetches through directives should be respected or ignored OFF Ignores directives for prefetches ON Respects directives for prefetches This is the default M Run cpp and print list of make dependencies m32 Compile for 32 bit ABI also known as x86 or IA32 See m64 for defaults m3 dnow Enable use of 3DNow instructions The default is OFF F 33 F eko man Page o 9 7 7 a m64 Compile for 64 bit ABI also known as AMD64 x86 64 or IA32e On a 32 bit host the default is 32 bit ABI On a 64 bit host the default is 64 bit ABI if the target platform march mcpu mt
69. assumptions made about numerical accuracy at different levels of optimization Table 7 3 Numerical Accuracy with Options OPT option name 00 O1 02 03 Ofast Notes div split off off off off on onif IEEE a 3 fast complex off off off off off onifroundoff 3 fast exp off off off on on Onifroundoff gt 1 fast nint off off off off off onifroundof f 3 fast s qrt off off off off off fast trunc off off off on on onifzoundoff 1 fold reassociate off off off off on Onifroundoff gt 2 fold unsafe relops on on on on on fold unsigned relops off off off off off IEEE arithmetic 1 1 1 2 2 7 23 7 Tuning Options Hardware Performance EN PA UAh V A Table 7 3 Numerical Accuracy with Options IEEE NaN inf off off off off off recip off off off off on onifroundoff 2 roundoff 0 0 0 1 2 fast math off off off off off onifroundoff 5 2 rsqrt 0 0 0 0 1 1 ifroundoff 5 2 For example if you use OPT IEEE arithmetic at 03 the flag is set to IEEE arithmetic 2 by default 7 7 6 1 Flush to Zero Behavior The processor hardware which implements IEEE floating point arithmetic generally runs faster if itis allowed to generate zero rather than a denormalized number when an arithmetic operation underflows Therefore at optimization level 03 the PathScale compiler allows this behavior which is commonly known as flush to zero The flush to z
70. before running the program. You want to take advantage of the -S rate-file or -S timing-file feature, which requires some grep and sed commands to isolate the number in the output to use as the performance metric of interest (e.g., a megaflops number in the rate-file case). The next sections provide examples of a Makefile, build and test scripts, and the rate and timing files.

7.9.8 The NAS Parallel Benchmark Suite

Next is a concrete example with measurable results. The NAS Parallel Benchmark (NPB) suite is commonly used for both serial and parallel benchmarking. It consists of a set of dissimilar pieces of applications, illustrating the various numerical techniques used by NASA's high-performance applications. The benchmark comes with several data set sizes, with W being a workstation size (smallest), and A and B being two sizes appropriate to a cluster or supercomputer-size problem. These examples use the Class A data set. Several examples will be provided, showing usage in a step-by-step manner. By following these steps, you will get a better idea of how pathopt2 works.

7.9.8.1 Set Up the Workarea

The NAS Parallel Benchmark Suite (NPB) can be downloaded by going to http://www.nas.nasa.gov/Software/NPB and following the links to the file. Download the file to a writable working directory. Then:

    tar zxf NPB2.3.tar.gz
    cd NPB2.3/NPB2.3-SER/config
    cp /opt/pathscale/share/pathopt2/examples/make.def .

7.9.8.2
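Isolating a single number from program output, which the text describes doing with grep and sed, can be sketched as follows. This is an illustrative Python equivalent rather than the guide's own script; the function name `extract_metric` and the sample log lines are hypothetical.

```python
import re


def extract_metric(output_text, label):
    """Return the first number that follows `label` in the text, or None."""
    pattern = re.escape(label) + r"\s*[=:]?\s*([-+]?\d+(?:\.\d+)?)"
    match = re.search(pattern, output_text)
    return float(match.group(1)) if match else None


if __name__ == "__main__":
    # Hypothetical benchmark log; real output may be formatted differently.
    log = (
        " Verification    =   SUCCESSFUL\n"
        " Time in seconds =        12.74\n"
        " Mop/s total     =       662.60\n"
    )
    print(extract_metric(log, "Mop/s total"))      # value for a rate file
    print(extract_metric(log, "Time in seconds"))  # value for a timing file
```

A build-and-run script would then write the extracted value to the file that pathopt2 reads for the chosen metric.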
71. buggy inline assembly. If ON, the compiler assumes each asm has "memory" specified even if it is not there. The default is OFF.

-OPT:bb=N
    This specifies the maximum number of instructions a basic block (straight-line sequence of instructions with no control flow) can contain in the code generator's program representation. Increasing this value can improve the quality of optimizations that are applied at the basic block level, but can increase compilation time in programs that exhibit such large basic blocks. The default is 1300. If compilation time is an issue, use a smaller value.

-OPT:cis=(ON|OFF)
    Convert SIN/COS pairs using the same argument to a single call calculating both values at once. The default is ON.

-OPT:cyg_instr=(0|1|2|3|4)
    Insert instrumentation calls into each function, just after the function entry and just before the function returns:

        void __cyg_profile_func_enter(void *func_address, void *return_address);
        void __cyg_profile_func_exit(void *func_address, void *return_address);

    The first argument is the address at the start of the current function. The second argument is the return address into the caller of the current function. Instrumentation is also performed on the bodies of the inlined functions. In this case, the original uninlined function will not be deleted, because its address is passed as the first argument to the profiling calls. The value of -OPT:cyg_instr co
72. build-command can be a PathScale invocation command (pathcc, pathf95, pathCC), a make command, or a script which eventually invokes the compiler (perhaps via a make command). The character @ is replaced in the command with the list of options from the configfile being considered. The configfile is typically the provided pathopt2.xml file, although you can write your own. The execute-target parameter specifies the execution target from the configfile. The test-command parameter is the command to run the program, and can be replaced with a script. The program is expected to return a status value of 0 to indicate success, or a non-zero status to indicate failure. The -S option specifies the metric used for comparing performance:

    real          the elapsed real time (this is the default)
    user          the CPU time spent executing in user mode
    system        the CPU time spent executing in system mode
    timing-file   to use a file containing a timing value
    rate-file     to use a file containing a rate value

The chosen metric is used to guide the choices made by the pathopt2 algorithms when selecting options for the best performance, and is used to sort the final output. The interpretation of real, user, and system time is the same as for the time(1) command: real is equivalent to wall-clock time. An application may switch back and forth between user and kernel mode, so these components are factored separately into user and system times. Since the O/S is typically
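How the chosen metric orders the final output can be sketched in a few lines. This is an illustration in Python, not pathopt2's own code; the function `sorted_summary` and the record layout are hypothetical, and sorting a rate from highest to lowest (while times sort from lowest to highest) is an assumption based on a rate being a higher-is-better figure.

```python
def sorted_summary(runs, metric="real"):
    """Order run records from best to worst according to one metric.

    Assumption: times ("real", "user", "system") are better when smaller,
    while a "rate" is better when larger.
    """
    best_is_largest = (metric == "rate")
    return sorted(runs, key=lambda run: run[metric], reverse=best_is_largest)


if __name__ == "__main__":
    # Illustrative timings, listed in the order the runs were made.
    runs = [
        {"flags": "-O2", "real": 2.83},
        {"flags": "-O3", "real": 2.39},
        {"flags": "-O3 -ipa", "real": 2.40},
        {"flags": "-O3 -OPT:Ofast", "real": 2.37},
        {"flags": "-Ofast", "real": 2.38},
    ]
    for run in sorted_summary(runs, "real"):
        print("%-16s %.2f" % (run["flags"], run["real"]))
```

Because Python's sort is stable, runs with equal metric values keep the order in which they were executed.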
73. cannot be found, then no defaults file will be used, even if one is present in /opt/pathscale/etc. For more details, see the compiler.defaults man page.

2.3.1 Target Options for This Release

These options, related to ABI, ISA, and processor target, are supported in this release:

    -m32
    -m64
    -march= (same as -mcpu= and -mtune=)
    -mcpu= (same as -march= and -mtune=)
    -mtune= (same as -march= and -mcpu=)
    -msse2
    -msse3
    -msse4a
    -m3dnow

There are also -mno- versions for these options: -msse2, -msse3, -msse4a, -m3dnow. For example, -mno-sse3.

As indicated in this list using the -march= flag, the architectures supported in this release are:

    -march=(opteron|athlon64|athlon64fx)
    -march=barcelona
    -march=pentium4
    -march=xeon
    -march=em64t
    -march=core

We have also added two special options, -march=anyx86 and -march=auto. If you want to compile the program so that it can be run on any x86 machine, you can specify anyx86 as the value of the -march, -mcpu, or -mtune options:

    -march=anyx86

If the value for the -march, -mcpu, or -mtune options is auto, the compiler will automatically choose the target processor based on the machine on which the compilation takes place:

    -march=auto

The compiler defaults to -march=auto. Here is a sample of how options are specified in the compiler default
74. char c hello from call fortran int i 123 long long 11 45611 float f 7 8 double d 9 1 int nonzero 10 Any nonzero integer is true in Fortran f1 c Si amp ll Sf ed amp nonzero strlen c C function designed to be called from Fortran passing arguments by reference void c reference double di float f1 int il long long i2 char cl int 11 int 12 char c2 char c3 int cl len int c2 len int c3 len A fortran string has no null terminator so make a local copy and add a terminator Depending on the situation it might be preferable to put the terminator in place of the first trailing blank char null terminated c1 memcpy alloca cl_len 1 c1 c1 len char null terminated c2 memcpy alloca c2 len 1 c2 c2 len 3 31 3 The PathScale Fortran Compiler Mixed Code ee char null terminated c3 memcpy alloca c3 len 1 c3 c3 len null terminated c1 c1 len null terminated c2 c2 len null terminated c3 c3 len N 0 printf d1 1f f1 1f il d i2 11d 11 d 12 d cl len d c2 len d c3_len d n dl 1 il i2 11 12 c1 len c2 len c3 len printf ci s c2 5s c3 s WMn null terminated c1 null terminated c2 null terminated c3 fflush stdout Flush output before switching languages call fortran C function designed to be called from Fortran passing arguments by val
75. come with a package called schedutils, which includes a program called taskset. You can use taskset to specify that a program must run on one particular CPU. For low-level programming, this facility is provided by the sched_setaffinity(2) call in the C library. You will need a recent C library to be able to use this call. On systems that lack NUMA support in the kernel, and on runs that do not set process affinity before they start, we have seen variations in performance of 30% or more between individual runs.

Testing Memory Latency and Bandwidth

To test your memory latency and bandwidth, we recommend two tools. For memory latency, the LMbench package provides a tool called lat_mem_rd. This provides a cryptic, but fairly accurate, view of your memory hierarchy latency. LMbench is available from http://www.bitmover.com/lmbench. For measuring memory bandwidth, the STREAM benchmark is a useful tool. Compiling either the Fortran or C version of the benchmark with the following command lines will provide excellent performance:

    pathf95 -Ofast stream_d.f second_wall.c -DUNDERSCORE
    pathcc -Ofast -lm stream_d.c second_wall.c

If you do not compile with at least -O3, performance may drop by 40% or more. The STREAM benchmark is available from http://www.streambench.org. For both of these tools, we recommend that you perform a number of identical runs and average your results, as we have observed variations of more than 10% between
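As a minimal sketch of setting process affinity from inside a program, the following uses Python's `os.sched_setaffinity`, which wraps the sched_setaffinity system call mentioned above and is available only on Linux. The helper name `pin_to_cpu` is hypothetical, and which CPU to choose is left to the caller.

```python
import os


def pin_to_cpu(cpu):
    """Restrict the calling process to a single CPU and return the new CPU set."""
    os.sched_setaffinity(0, {cpu})      # pid 0 means the calling process
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    if hasattr(os, "sched_setaffinity"):            # not present on every platform
        first_allowed = min(os.sched_getaffinity(0))
        print("now restricted to CPUs:", pin_to_cpu(first_allowed))
    else:
        print("process affinity is not available on this platform")
```

Setting the affinity before the timed part of a run starts is what keeps the process from migrating between CPUs during the measurement.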
76. core. Otherwise, it is OFF by default.

-msse4a
    Enable use of SSE4A instructions. Default is OFF.
-mtune=<cpu-type>
    Behaves like -march. See -march.
-MT
    Change the target of the generated dependency rules.
-mx87-precision=(32|64|80)
    Specify the precision of x87 floating-point calculations. The default is 80 bits.
-nobool
    Do not allow boolean keywords.
-nocpp
    (For Fortran only) Disable the source preprocessor. See the -cpp, -E, and -ftpp options for more information on controlling preprocessing.
-nodefaultlibs
    Do not use standard system libraries when linking.
-noexpopt
    Do not optimize exponentiation operations.
-noextend-source
    Restrict Fortran source code lines to columns 1 through 72. See the -coln and -extend-source options for more information on controlling line length.
-no-gcc
    (For Fortran only) -no-gcc turns off the __GNUC__ and other predefined preprocessor macros.
-nog77mangle
    The PathScale Fortran compiler modifies Fortran symbol names by appending an underscore, so a name like foo in a source file becomes foo_ in an object file. However, if a name in a Fortran source file contains an underscore, the compiler appends a second underscore in the object file, so foo_bar becomes foo_bar__ and baz_ becomes baz___. The -nog77mangle option suppresses the addition of this second underscore.
-noinline
    Suppress expansion of inline functions. When this option
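The underscore rule described under -nog77mangle can be sketched in a few lines of C. This is only an illustration of the rule as stated in the text, not compiler code: the function name fortran_symbol and its second_underscore parameter are hypothetical, and the sketch assumes the name has already been put into whatever letter case the compiler uses.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not a PathScale API): write the object-file symbol for
 * a Fortran name into out. A nonzero second_underscore models the default
 * behavior; zero models compiling with -nog77mangle.
 * Returns 0 on success, -1 if out is too small for the result.
 */
int fortran_symbol(const char *name, int second_underscore,
                   char *out, size_t out_size)
{
    /* One underscore is always appended. A second one is appended only when
       the source name already contains an underscore and the second
       underscore has not been suppressed. */
    const char *suffix =
        (second_underscore && strchr(name, '_') != NULL) ? "__" : "_";
    int n = snprintf(out, out_size, "%s%s", name, suffix);
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}
```

Calling fortran_symbol("foo_bar", 1, buf, sizeof buf) fills buf with foo_bar__, while passing 0 for second_underscore yields foo_bar_, which is the kind of difference to look for with nm when a link against a library built by another compiler fails on undefined symbols.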
77. declarations in different places are compatible if they have either BIND or SEQUENCE.

Notice that making the Fortran-generated symbol match the C-generated symbol does not necessarily mean the symbol will never be decorated with extra underscores. In that example, on Linux, the subroutine s will generate the linker symbol s instead of the symbol s_. But on an operating system where pathcc would generate the linker symbol _s for a C function named s, the example likewise will generate the linker symbol _s for the Fortran procedure named s, rather than the symbol s_ which it would ordinarily use, so as to be compatible with pathcc.

For procedures, module variables, and common blocks, it is possible to specify an explicit binding label as well:

    subroutine s1() bind(c, name='S1_name')
    end subroutine s1

When you use the name= clause, pathf90 generates the same linker symbol that pathcc would generate for an entity having that name, taking into account upper case. Thus the preceding example would match a C function named S1_name but not a C function named s1_name.

Finally, it is possible to use an empty string for the binding label, which tells the compiler to make the variable compatible with C, but to use the same linker external symbol that it would use in the absence of BIND. On Linux, the following procedure would generate the linker symbol s2_ and t
78. default, performs extensive optimizations that will always shorten execution time, but may cause compile time to be lengthened. Level -O3 performs aggressive optimization that may or may not improve execution time. See section 7.1 for more information about the -O flag.

Use the -ipa switch to enable inter-procedural analysis:

    pathf95 -c -ipa matrix.f90
    pathf95 -c -ipa prog.f90
    pathf95 -ipa matrix.o prog.o -o prog

Note that the link line also specifies the -ipa option. This is required to perform the IPA link properly. See section 7.3 for more information on IPA.

NOTE: The compiler typically allocates data for Fortran programs on the stack for best performance. Some major Linux distributions impose a relatively low limit on the amount of stack space a program can use. When you attempt to run a Fortran program that uses a large amount of data on such a system, it will print an informative error message and abort. You can use your shell's ulimit (bash) or limit (tcsh) command to increase the stack size limit to a point where the program no longer crashes, or remove the limit entirely. See section 3.13 for more information on Fortran compiler stack size.

3.1.1 Fixed-form and Free-form Files

Fixed-form files follow the obsolete Fortran standard of assigning special meaning to the first 6 character positions of each line in a source file. If a C, !, or * character is present in the first character position on a line, that specifie
79. directives by enclosing the statement in a critical section (Section 2.5.4, page 27).
Many ATOMIC directives are implemented with in-line atomic code for the atomic statement, while others are implemented using a critical section, due to the absence of hardware support.

If the dynamic threads mechanism is enabled on entering a parallel region, the allocation status of an allocatable array that is not affected by a COPYIN clause that appears on the region is implementation-dependent (Section 2.6.1, page 32).
The allocation status of the thread's copy of an allocatable array will be retained on entering a parallel region.

Due to resource constraints, it is not possible for an implementation to document the maximum number of threads that can be created successfully during a program's execution. This number is dependent upon the load on the system, the amount of memory allocated by the program, and the amount of implementation-dependent stack space allocated to each thread. If the dynamic threads mechanism is disabled, the behavior of the program is implementation-dependent when more threads are requested than can be successfully created. If the dynamic threads mechanism is enabled, requests for more threads than an implementation can support are satisfied by a smaller number of threads (Section 2.3.1, page 15).
Since the implementation does not support dynamic thread adjustment, the dynamic threads mechanism is always disabled. If more threads are
80. enabled, some or all of the following:

- PathScale C compiler for x86_64 and EM64T architectures
- PathScale C++ compiler for x86_64 and EM64T architectures
- PathScale Fortran compiler for x86_64 and EM64T architectures
- Documentation
- Libraries
- Subscription Manager client. You must have a valid subscription (and associated subscription file) in order to run the compiler.
- Subscription Manager server. The PathScale Subscription Manager server is only required for floating subscriptions.
- PathScale debugger (pathdb)
- GNU binutils

2.2 How To Invoke the PathScale Compilers

The PathScale Compiler Suite has three different front-ends to handle programs written in C, C++, and Fortran, and it has common optimization and code generation components that interface with all the language front-ends. The language your program uses determines which command (driver) name to use:

    Language                              Command Name   Compiler Name
    C                                     pathcc         PathScale C compiler
    C++                                   pathCC         PathScale C++ compiler
    Fortran 77, Fortran 90, Fortran 95    pathf95        PathScale Fortran compiler

You can create a common example program called world.c:

    #include <stdio.h>
    int main(void) {
        printf("Hello World\n");
        return 0;
    }

Then you can compile it from your shell prompt very simply:

    pathcc world.c

The default output file for the pathcc-generated executable is named a.out. You can
81. exceeds this limit will never be automatically inlined by IPA. The default is 500.

-IPA:min_hotness=N is applicable only under feedback compilation. A call site's invocation count must be at least N before it can be inlined by IPA. The default is 10.

-INLINE:aggressive=ON increases the aggressiveness of the inlining, in which more non-leaf and out-of-loop calls are inlined. Default is OFF.

We mentioned that leaf functions are good candidates to be inlined. These functions do not contain calls that may inhibit various backend optimizations. To amplify the effect of leaf functions, IPA provides two options that exploit its call-tree-based inlining feature. This is based on the fact that a function that calls only leaf functions can become a leaf function if all of its calls are inlined. This in turn can be applied repeatedly up the call graph. In the description of the following two options, a function is said to be at depth N if it is never more than N edges from a leaf node in the call graph. A leaf function has depth 0.

-IPA:maxdepth=N causes IPA to inline all routines at depth N in the call graph, subject to space limitation.

-IPA:forcedepth=N causes IPA to inline all routines at depth N in the call graph, regardless of space limitation.

7.3.5 Cloning

There are two options for controlling cloning: -IPA:multi_clone=N specifies the maximum number of clones that can be creat
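To make the depth definition used by the maxdepth and forcedepth options concrete, here is a small C sketch that computes it for a toy call graph. It takes the depth of a function to be the length of the longest call chain from that function down to a leaf, which is the smallest N satisfying "never more than N edges from a leaf". The adjacency-matrix representation, the MAX_FUNCS bound, and the function name call_depth are all assumptions made for this illustration; IPA's real data structures are not described in the guide.

```c
#define MAX_FUNCS 16   /* illustrative upper bound on the number of functions */

/*
 * calls[f][g] != 0 means function f contains a call to function g.
 * Assumes the call graph has no recursion (no cycles); otherwise this
 * sketch would not terminate.
 * Returns 0 for a leaf, else 1 + the largest depth among the callees.
 */
int call_depth(int calls[MAX_FUNCS][MAX_FUNCS], int n_funcs, int f)
{
    int depth = 0;
    for (int g = 0; g < n_funcs; g++) {
        if (calls[f][g]) {
            int d = 1 + call_depth(calls, n_funcs, g);
            if (d > depth)
                depth = d;
        }
    }
    return depth;
}
```

For a graph in which main calls a and leaf, and a calls leaf, the sketch reports depth 0 for leaf, 1 for a, and 2 for main. Once a's only call is inlined, a itself becomes a leaf, which is the repeated bottom-up effect the text describes.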
82. execute it and see the output:

    ./a.out
    Hello World

As with most compilers, you can use the -o <filename> option to give your program executable file the desired name.

If invoked with the flag -v (or -version), the compilers will emit some text that identifies the version. For example:

    pathcc -v
    PathScale(TM) Compiler Suite: Version 3.2
    Built on: 2007-10-21 07:03:08 -0800
    Thread model: posix
    GNU gcc version 4.0.2 (PathScale 3.2 driver)

There are online manual pages ("man pages") with descriptions of the large number of command line options that are available. Type man pathscale_intro at the command line to see the pathscale_intro man page and its overview of the various man pages included with the Compiler Suite.

2.2.1 Accessing the GCC 4.x Front-ends for C and C++

This release supports GCC 3.x and GCC 4.x. The compiler defaults to -gnu3 or -gnu4 depending on whether the system-installed gcc/g++ is a 3.x or 4.x compiler. It is possible to override this choice using -gnu3 or -gnu4 to get the compiler to use the alternate front-end instead of the default one. A sample command for C is:

    pathcc -gnu4 world.c

This default option can be changed in your compiler.defaults file by adding this line:

    -gnu4

See section 2.3 for an example compiler.defaults file. The option has no effect on pathf90 or pathf95. There are currently some limitations when
83. file the code appears in. IPA can improve performance significantly. IPA can be used in combination with the other optimization flags. -O3 -ipa or -O2 -ipa will typically provide increased performance over the -O3 or -O2 flags alone. -ipa needs to be used both in the compile and in the link steps of a build. See section 7.3 for more details on how to use -ipa.

6.3 Feedback Directed Optimization (FDO)

Feedback directed optimization uses a special instrumented executable to collect profile information about the program, which is then used in later compilations to tune the executable. See section 7.6 for more information.

6.4 Aggressive Optimization

The PathScale compilers provide an extensive set of additional options to cover special-case optimizations. The options documented in section 7 may significantly improve the speed or performance of your code. This section briefly introduces some of the first tuning flags to try beyond -O2 or -O3. Some of these options require knowledge of the program's algorithms and coding style; used without that knowledge, they may affect the program's correctness. Some of these options depend on certain coding practices to be effective. One word of caution: the PathScale Compiler Suite, like all modern compilers, has a range of optimizations. Some produce identical program output to the non optimize
84. -fno-underscoring options to the pathf95 compiler. The default policies for Intel ifort, PGI pgf90, Sun f90, GNU gfortran, and g95 all correspond to our -fno-second-underscore option.

Common block names are also mangled. Our name for the blank common block is the same as g77 (_BLNK__). PGI's compiler uses the same name for the blank common block, while Intel's compiler uses _BLANK__.

3.10.2 ABI Compatibility

The PathScale compilers support the official x86_64 Application Binary Interface (ABI), which is not always followed by other compilers. In particular, g77 does not pass the return values from functions returning COMPLEX or REAL values according to the x86_64 ABI. (Double precision REALs are OK.) For more details about what g77 does, see the "info g77" entry for the -ff2c flag.

This issue is a problem when linking binary-only libraries such as Kazushige Goto's BLAS library or the ACML library (AMD Core Math Library); we have not tested ACML on the EM64T version of the compiler suite. Libraries such as FFTW and MPICH don't have any functions returning REAL or COMPLEX, so there are no issues with these libraries.

For linking with g77-compiled functions returning COMPLEX or REAL values, see section 3.10.3.

Like most Fortran compilers, we represent character strings passed to subprograms with a character pointer, and add an integer length parameter to the en
85. for your motherboard, and experiment. If you fail to set up your memory correctly, this can account for up to a factor-of-two difference in memory performance. In extreme cases, this can even affect system stability.

7.8.2 BIOS Setup

Some BIOSes allow you to change your motherboard's memory interleaving options. Depending on your configuration, this may have an effect on performance. For a discussion of memory interleaving across nodes, see section 7.8.3 below.

7.8.3 Multiprocessor Memory

Traditional small multiprocessor (MP) systems use symmetric multiprocessing (SMP), in which the latency and bandwidth of memory is the same for all CPUs. This is not the case on Opteron multiprocessor systems, which provide non-uniform memory access, known as NUMA. On Opteron MP systems, each CPU has its own direct-attached memory. Although every CPU can access the memory of all others, memory that is physically closest has both the lowest latency and highest bandwidth. The larger the number of CPUs, the higher will be the latency and the lower the bandwidth between the two CPUs that are physically furthest apart.

Most multiprocessor BIOSes allow you to turn on or off the interleaving of memory across nodes. Memory interleaving across nodes masks the NUMA variation in behavior, but it imposes uniformly lower performance. We recommend that you turn node interleaving off.

7.8.4 Kernel and System Effects

To achieve best performance on a NUMA system
87. inheritance ensures that the OpenMP library creates the right number of threads, and that CPUs are not overloaded with threads.

When using affinity inheritance, any explicit affinity settings made using PSC_OMP_AFFINITY_MAP, PSC_OMP_CPU_STRIDE, and PSC_OMP_CPU_OFFSET employ a virtualized CPU numbering. The virtualized CPU numbers are a sequence of incrementing integers starting from 0, and refer to the potentially non-contiguous real CPU numbers in ascending order. This means that the settings for these variables are independent of the specific CPU numbers specified by taskset.

PSC_OMP_AFFINITY_MAP (a list of integer values separated by commas)

This environment variable allows the mapping from threads to CPUs to be fully specified by the user. It must be set to a list of CPU identifiers separated by commas. The list must contain at least one CPU identifier, and entries in the list beyond the maximum number of threads supported by the implementation (256) are ignored. Each CPU identifier is a decimal number between 0 and one less than the number of CPUs in the system (inclusive).

The implementation generates a mapping table that enumerates the mapping from each thread to CPUs. The CPU identifiers in the PSC_OMP_AFFINITY_MAP list are inserted in the mapping table starting at the index for thread 0 and increasing upwards. If the list is shorter than the maximum number of thre
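The virtualized CPU numbering described above (virtual number k names the k-th real CPU, in ascending order, among the CPUs the process is allowed to use) can be sketched as a small C function. This is an illustration of the numbering rule only; the function name virtual_to_real_cpu and the flag-array representation of the allowed set are assumptions made for this example, not part of the OpenMP runtime.

```c
#include <stddef.h>

/*
 * Hypothetical helper: translate a virtual CPU number into a real one.
 * allowed[i] != 0 means real CPU i is in the affinity set the process
 * inherited (for example, from taskset).
 * Returns the real CPU number, or -1 if virtual_cpu is out of range.
 */
int virtual_to_real_cpu(const int *allowed, size_t n_cpus, int virtual_cpu)
{
    int seen = 0;   /* number of allowed CPUs passed so far */
    if (virtual_cpu < 0)
        return -1;
    for (size_t real = 0; real < n_cpus; real++) {
        if (allowed[real]) {
            if (seen == virtual_cpu)
                return (int)real;
            seen++;
        }
    }
    return -1;
}
```

With an inherited set of real CPUs 2, 3, 6, and 7, virtual CPUs 0 through 3 map to 2, 3, 6, and 7 respectively, so a setting written in virtual numbers keeps its meaning when the taskset CPU list changes.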
88. interchange transformation in the loop nest optimizer.

The LNO group controls outer loop unrolling, but the -OPT: group controls inner loop unrolling. Here are the major -LNO: flags to control loop unrolling:

-LNO:outer_unroll_max,ou_max=n specifies that the compiler may unroll outer loops in a loop nest by up to n per loop, but no more. The default is 10.

-LNO:ou_prod_max=n indicates that the product of unrolling levels of the outer loops in a given loop nest is not to exceed n, where n is a positive integer. The default is 16.

To be more specific about how much unrolling is to be done, use -LNO:outer_unroll,ou=n. This indicates that exactly n outer loop iterations should be unrolled if unrolling is legal. For loops where outer unrolling would cause problems, unrolling is not performed.

The LNO group can provide guidance to the compiler about the level and type of prefetching to enable. General guidance on how aggressively to prefetch is specified by -LNO:prefetch=n, where n=1 is the default level. n=0 disables prefetching in loop nests, while n=2 means to prefetch more aggressively than the default.

-LNO:prefetch_ahead=n defines how many cache lines ahead of the current data being loaded should be prefetched. The default is n=2 cache lines.

7.4.5 Vectorization

Vectorization is an optimization technique that works on multiple pieces of data at once. For example, the compiler will tu
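For readers who have not seen outer loop unrolling written out, the following C sketch shows the transformation applied by hand to a small matrix-vector loop nest, with an unroll factor of 2. It assumes that the outer unrolling discussed above has the usual unroll-and-jam shape (the unrolled copies of the outer iteration are fused into one inner loop); the guide does not show the code the compiler actually generates, and the names and the size N are made up for the example.

```c
#define N 8   /* illustrative size; even, so unrolling by 2 needs no cleanup loop */

/* Original loop nest: y[i] = sum over j of a[i][j] * x[j]. */
void matvec(double a[N][N], const double x[N], double y[N])
{
    for (int i = 0; i < N; i++) {
        y[i] = 0.0;
        for (int j = 0; j < N; j++)
            y[i] += a[i][j] * x[j];
    }
}

/*
 * The same nest with the outer (i) loop unrolled by 2 and the two copies
 * fused into a single inner loop. Each x[j] is now loaded once and used
 * for two rows, which is where the benefit typically comes from.
 */
void matvec_outer_unroll2(double a[N][N], const double x[N], double y[N])
{
    for (int i = 0; i < N; i += 2) {
        double s0 = 0.0, s1 = 0.0;
        for (int j = 0; j < N; j++) {
            s0 += a[i][j] * x[j];
            s1 += a[i + 1][j] * x[j];
        }
        y[i] = s0;
        y[i + 1] = s1;
    }
}
```

Both functions compute the same result. In practice you would leave the source in the first form and let the unrolling flags choose the factor, using a hand-written variant like the second only to check what a given factor would do to register pressure and memory traffic.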
89. is OFF by default.

-fb-phase=(0,1,2,3,4)
    Used to specify the compilation phase at which instrumentation for the collection of profile data is performed, so is useful only when used with -fb-create. The values must be in the range 0 to 4. The default value is 0, and specifies the earliest phase for instrumentation, which is after the front-end processing.

-f[no-]check-new
    (For C++ only) Check the result of new for NULL. When -fno-check-new is used, the compiler will not check the result of operator new for NULL.

-fcoco[=setfile]
    (For Fortran only) Run the ISO/IEC 1539-3 conditional compilation preprocessor on input Fortran source files before compiling. This overrides the default, whereby files suffixed with .F, .F90, or .F95 are preprocessed with cpp but files suffixed with .f, .f90, or .f95 are not preprocessed. If no setfile is specified, the preprocessor looks for coco.set in the current working directory. Any -I flags are passed to the preprocessor, and take precedence over the setfile. Any -D flags are passed to the preprocessor to assign values to constants, overriding values assigned within the source files. If the flag contains "=", the value on the right side must be an integer, and the name on the left side must be declared as an integer constant within the source files. Otherwise, the name must be declared as a logical constant within the source files, and will be set true. Constants defined
90. may find it useful to reduce the size of the data sets to give a quicker runtime, allowing the efficacy of particular tuning options to be quickly ascertained.

One thing to note is that OpenMP performance tends to get better with larger data sets, because the fork/join overheads diminish as the loops get larger. Thus you should also run trials with the full data set, especially when looking at scaling issues. You can also make use of more memory and more cache on an n-way multiprocessor than a uniprocessor, and this sometimes leads to a very nice superlinear speed-up.

8.14.2 Enable OpenMP

After you have tuned the serial version of the application, turn on OpenMP parallelization with the -mp flag. Try running the code on varying numbers of CPUs to see how the application scales.

One very important option for OpenMP tuning is -OPT:early_mp, which by default is off but can be turned on using -OPT:early_mp=on. The setting of this primarily determines the ordering of the SIMD vectorization and OpenMP parallelization optimization phases of the compiler. With late MP, loops will first be vectorized and then the vectorized loops will be parallelized. With early MP, loops will first be parallelized and then the parallel loops will be vectorized. Occasionally one of these orderings works better than the other, so you have to try both.

8.14.3 Opti
91. of PSC_OMP_GUIDED_CHUNK_DIVISOR equal to 1, the first thread will get 1/n-th of the iterations (for a team of n). If these iterations happen to be particularly expensive, then this thread will be the critical path through the loop. The default value is 2.

PSC_OMP_GUIDED_CHUNK_MAX (Integer value)

This is the maximum chunk size that will be used by the loop scheduler for guided scheduling. The default value for this is 300. Note that a minimum chunk size can already be set by the user on a guided schedule directive. This environment variable allows the user to set a maximum too, though it applies to the whole program. The rationale for setting a maximum is to break up the iterations under guided scheduling for better dynamic load balancing between the threads.

The full equation for the chunk size for guided scheduling is:

    chunk_size = MAX(MIN(ROUNDUP(remaining_size /
                                 (number_of_threads * PSC_OMP_GUIDED_CHUNK_DIVISOR)),
                         PSC_OMP_GUIDED_CHUNK_MAX),
                     minimum_chunk_size)

Where:

    remaining_size is the number of iterations of the loop
    number_of_threads is the number of threads in the team
    PSC_OMP_GUIDED_CHUNK_DIVISOR is the value of the PSC_OMP_GUIDED_CHUNK_DIVISOR environment variable (defaults to 2)
    PSC_OMP_GUIDED_CHUNK_MAX is the value of the PSC_OMP_GUIDED_CHUNK_MAX environment variable (defaults to 300)
    minimum_chunk_size is the size of the smallest piece (this is the value of chunk in the SCHEDULE directive
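The chunk-size equation can be checked with a few lines of C. This is a sketch of the arithmetic only, not the runtime's scheduler: the function names are hypothetical, and the grouping of the division (remaining size divided by the product of thread count and divisor) is an assumption inferred from the statement that a divisor of 1 gives the first thread 1/n-th of the iterations.

```c
/* Integer division rounded up, for positive operands (the ROUNDUP step). */
static long roundup_div(long numerator, long denominator)
{
    return (numerator + denominator - 1) / denominator;
}

/*
 * Hypothetical helper mirroring the guided-scheduling equation:
 * MAX(MIN(ROUNDUP(remaining / (threads * divisor)), chunk_max), minimum).
 * chunk_divisor and chunk_max stand for the values of
 * PSC_OMP_GUIDED_CHUNK_DIVISOR and PSC_OMP_GUIDED_CHUNK_MAX.
 */
long guided_chunk_size(long remaining_size, long number_of_threads,
                       long chunk_divisor, long chunk_max,
                       long minimum_chunk_size)
{
    long chunk = roundup_div(remaining_size,
                             number_of_threads * chunk_divisor);
    if (chunk > chunk_max)            /* MIN with the maximum chunk size */
        chunk = chunk_max;
    if (chunk < minimum_chunk_size)   /* MAX with the minimum chunk size */
        chunk = minimum_chunk_size;
    return chunk;
}
```

With the stated defaults (divisor 2, maximum 300), 4 threads, and a minimum of 1, a loop with 10000 iterations remaining is handed out in chunks of 300 until fewer than about 2400 remain, after which the chunks shrink; with 1000 remaining the next chunk is 125.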
92. only the first execute target in a configuration file The following is a listing of the try5 list and the try5 execute target in the default pathopt2 xml file try5 is typically the first target to use when testing options with pathopt2 define name tryb5 list s option O2 option option 03 lt option gt choose kz 1 append option O3 option choose kz 1 option ipa option option OPT Ofast lt option gt choose 7 32 7 Tuning Options The pathopt2 Tool ls lt append gt lt choose gt lt option gt Ofast lt option gt lt define gt lt execute name try5 gt choose k 1 5 source from try5 list gt lt choose gt lt execute gt The first two options 02 and 03 are runin order Next the 03 option is appended to both ipa and OPT Ofast Finally Of ast is used This ordering is shown in the first part of the pathopt2 output when try5 is the target Flags Build Test Real User System 02 PASS PASS 2 83 2 82 0 00 03 PASS PASS 2 39 2 39 0 00 03 ipa PASS PASS 2 40 2 40 0 01 03 OPT Ofast PASS PASS 2 37 2 38 0 00 Ofast PASS PASS 2 38 2 38 0 00 7 33 7 Tuning Options The pathopt2 Tool ee Table 7 5 Tags for option configuration file Table 7 5 Tags for Option Configuration Fle Tag Description lt config gt lt config gt Main body tag describing the configuration All other tag
93. operations such as sin, sqrt, compare, etc. If a NaN is detected, the application will abort. Assignments are not considered floating point calculations, and so x=y doesn't trap even if y is NaN.

The -trapuv option affects local scalar and array variables and memory returned by alloca(). It does not affect the behavior of globals, memory allocated with malloc(), or Fortran common data. The option initializes integer variables to the bit pattern for floating point NaN (integers don't have NaNs). The CPU doesn't trap on these integer operands, although the NaN bit pattern will make the wrong result more obvious. This option is not supported under the 32-bit ABI without SSE2.

The -Wuninitialized option warns about uninitialized automatic variables at compile time. -Wno-uninitialized tells the compiler not to warn about uninitialized automatic variables.

The new -zerouv option sets uninitialized variables in your code to zero at program runtime. Doing this will have a slight performance impact. This option affects local scalar and array variables and memory returned by alloca(). It does not affect the behavior of globals, memory allocated with malloc(), or Fortran common data.

10.4 Trapping IEEE Exceptions

By default, when an IEEE floating point operation generates a denormalized number or a special symbol such as NaN or Infinity, the program will continu
94. partition setting OFF IPA space N lt no limit gt IPA specfile filename IPA use intrinsic ON OFF lt OFF gt Inline Processing Options Defaults Comments f no implicit inline templates C f no implicit templates C f no inline functions C C fkeep inline functions C C inline INLINE Same as inline INLINE aggressive ON OFF lt OFF gt INLINE list ON OFF lt OFF gt INLINE preempt ON OFF lt OFF gt noinline Language Options Defaults Comments LANG copyinout ON OFF lt OFF gt unless O2 or higher LANG formal deref unsafe ON OFF lt OFF gt LANG heap allocation threshold size lt 1 gt LANG IEEE save setting lt ON gt Fortran only LANG recursive setting lt OFF gt LANG rw const ON OFF lt OFF gt LANG short circuit conditionals ON OFF ON Fortran only Language Standards Options Defaults Comments E 7 E Summary of Compiler Options AA NN Table E 1 Summary of Compiler Options by Function ansi ansi ffortran2003 std c 98 std c89 std c99 std c9x std gnu 98 std gnu89 std gnu99 std gnu9x Std iso9899 1990 Std iso9899 199409 Std iso9899 1999 Std iso9899 199x ar f no fast stdlib H Idir iquotedir isystem dir L directory 1 library nodefaultlibs nostartfiles nostdinc nostdinc nostdlib objectlist shared shared libgcc static static data static libgcc
95. pathscale/bin, or <install_directory>/bin if you installed to a non-default location.

When a .o file, archive, or an executable is passed to pathhow-compiled, it will display the compilation options for each .o file constituting the argument file. This includes any linked archives. For example, compile the file myfile.c with pathcc and then use the pathhow-compiled tool:

    pathcc myfile.c -o myfile
    pathhow-compiled myfile.o

The output would look something like this:

    PathScale Compiler Version 3.2 compiled myfile.c with options:
    -O2 -march=opteron -msse2 -mno-sse3 -mno-3dnow -m64

2.4 Input File Types

The name for a source file usually has the form filename.ext, where ext is a one- to three-character extension used on a source code file that can have various meanings:

    Extension             Implication to the driver
    .c                    C source file that will be preprocessed
    .C .cc .cpp .cxx      C++ source file that will be preprocessed
    .f .f90 .f95          Fortran source file
                          .f is fixed format, no preprocessor
                          .f90 is freeform format, no preprocessor
                          .f95 is freeform format, no preprocessor
    .F .F90 .F95          Fortran source file
                          .F is fixed format, invokes preprocessor
                          .F90 is freeform format, invokes preprocessor
                          .F95 is freeform format, invokes preprocessor

For Fortran files with the extensions .f, .f90, or .f95 you can use -ftpp (to invoke the Fortran preprocessor) or -cpp (to invoke the C preprocessor) on the
96. pragma omp master pragma omp critical ifdef 5 OPENMP Only master thread does this nthreads omp get num threads endif printf Number of threads d n nthreads All threads join master thread and disband The pragma and ifdef before some of the lines are conditional compilation tokens These lines are ignored when compiled without mp We compile omphello c for OpenMP with this command S pathce c mp omphello c Now we link it again using mp S pathcc mp omphello o o omphello out We set the environment variable for the number of threads with this command export OMP NUM THREADS 5 Now run the program S omphello out Hello World from thread 1 Hello World from thread 2 Hello World from thread 3 Hello World from thread 0 Number of threads 5 Hello World from thread 4 The output from the different threads can be in a different order each time the program is run We can change the environment variable to run with two threads export OMP NUM THREADS 2 8 26 8 Using OpenMP and Autoparallelization Tuning for OpenMP Application Performance ls Now the output looks like this S omphello out Hello World from thread 0 Number of threads 2 Hello World from thread 1 The same program can be compiled and linked without mp and the directives will be ignored We compile the program without mp pathcc c omphello c Link the object file and create an o
97. preprocessed C source file
    .s     assembly language file
    .o     object file
    .a     a static library of object files
    .so    a library of shared (dynamic) object files

2.6 Common Compiler Options

The PathScale Compiler Suite has command line options that are similar to many other Linux or Unix compilers:

    Option            What it does
    -c                Generates an intermediate object file for each source file, but doesn't link
    -g                Produces debugging information to allow full symbolic debugging
    -I<dir>           Adds path to the directories searched by preprocessor for include file resolution
    -l<library>       Searches the library specified during the linking phase for unresolved symbols
    -L<dir>           Adds path to the directories searched during the linking phase for libraries
    -lm               Links using the libm math library. This is typically required in C programs that use functions such as exp(), log(), sin(), cos()
    -o <filename>     Generates the named executable (binary) file
    -O3               Generates a highly optimized executable, generally numerically safe
    -O or -O2         Generates an optimized executable that is numerically safe. (This is also the default if no -O flag is used.)
    -pg               Generates profile information suitable for the analysis program pathprof

Many more options are available and described in the man pages (pathscale_intro, pathcc, pathf95, pathCC, eko) and se
98. program generates a list of all Fortran symbols in the library, including those that do not return COMPLEX or REAL types. The extra symbols will be ignored by the compiler.

3.10.3.1 AMD Core Math Library (ACML)

The AMD Core Math Library (ACML) incorporates BLAS, LAPACK, and FFT routines, and is designed to obtain maximum performance from applications running on AMD platforms. This highly optimized library contains numeric functions for mathematical, engineering, scientific, and financial applications. ACML is available both as a 32-bit library (for compatibility with legacy x86 applications), and as a 64-bit library that is designed to fully exploit the large memory space and improved performance offered by the x86_64 architecture; we have not tested ACML on the EM64T version of the compiler suite.

To use ACML 1.5 with the PathScale Fortran compiler, use the following:

    pathf95 foo.f bar.f -lacml

To use ACML 2.0 with the PathScale Fortran compiler, use the following:

    pathf95 -L<path-to-acml>/lib foo.f bar.f -lacml

ACML 2.5.1 and later, built with the PathScale compilers, is available from the AMD website at http://developer.amd.com/acml.aspx. With these later versions of ACML, the workarounds described above are unnecessary.

3.10.4 List Directed I/O and Repeat Factors

By default, when list directed I/O is used and two or more consecutive values are identical, the output uses a repeat factor. For example: real :: a(5) = 88.0
99. refer to "Troubleshooting" in the PathScale Compiler Suite and Subscription Manager Install Guide.

10.2 Debugging

The earlier sections on the PathScale Fortran and C/C++ compilers contain language-specific debugging information. See section 3.12 and section 4.3. More general information on debugging can be found in this section.

The flag -g tells the PathScale compilers to produce data in the form used by modern debuggers, such as pathdb or GDB. This format is known as DWARF 2.0 and is incorporated directly into the object files. Code that has been compiled using -g will be capable of being debugged using pathdb, GDB, or other debuggers. See the PathScale Debugger User Guide for more information on using pathdb.

It is advisable to use the -O0 level of optimization in conjunction with the -g flag, since code rearrangement and other optimizations can sometimes make debugging difficult. If -g is specified without an optimization level, then -O0 is the default.

10.3 Dealing with Uninitialized Variables

Uninitialized variables may cause your program to crash or to produce incorrect results. New options have been added to help identify and deal with uninitialized variables in your code. These options are -trapuv, -Wuninitialized, and -zerouv.

The -trapuv option works by initializing local variables to NaN (floating point not-a-number) and setting the CPU to detect floating point calculations involving NaNs. Floating point calculations are
100. related tuning flags are omitted. The third column gives the running times with the IPA-related tuning flags. The fifth column lists their IPA-related tuning flags. As this second table shows, proper IPA tuning can produce major improvements in applications.
7.3.8 Invoking IPA
Inter-procedural analysis is invoked in several possible ways: -ipa, -IPA, and implicitly via -Ofast. IPA can be used with any optimization level, but gives the biggest potential benefit when combined with -O3. The -Ofast flag turns on -ipa as part of its many optimizations.
When compiling with -ipa, the .o files that are created are not regular .o files. IPA uses the .o files in its analysis of your program, and then does a second compilation using that information to optimize the executable.
The IPA linker checks to see if the entire program is compiled with the same set of optimization options. If different optimization options are used, IPA will give a warning:
    Warning: Inconsistent optimization options detected between files involved in
For example, the following invocation will generate this warning for two C files, a.c and b.c:
    pathcc -O2 -ipa -c a.c
    pathcc -O3 -ipa -c b.c
    pathcc -ipa a.o b.o
The user can pass consistent optimization options to the individual compilations to remove the warning. In the above example, the user can either pass -O2 or pass -O3 to both the files
101. requested than are available, the request will be satisfied using only the available threads.
The maximum number of threads that can be allocated simultaneously is limited to 256 by the implementation. Additionally, if a system call to allocate threads, memory, or other system resources does not succeed, then the runtime library will exit with a fatal error message.
If an OMP runtime library routine interface is defined to be generic by an implementation, use of arguments of kind other than those specified by the OMP_*_KIND constants is implementation-dependent (Section D.3, page 111). No generic OMP runtime library routine interface is provided.
Appendix C Supported Fortran Intrinsics
The Version 3.2 release of the PathScale Compiler Suite supports all of the GNU g77 intrinsics. You must use -intrinsic=PGI or -intrinsic=G77 to get new G77 intrinsics which were added in the release. Not all of the argument types for each intrinsic may be supported in this release.
How to Use the Intrinsics Table
As an example, let's look at the intrinsic ACOS. This is what it looks like in the table:
    Intrinsic Name   Result   Arguments     Families                     Remarks
    ACOS             R*4      X: R*4, R*8   ANSI, G77, PGI, TRADITIONAL  E, P
For the intrinsic ACOS, the result is R*4, which means REAL
102. rsqrt followed by instructions to refine the result. 2 means to use rsqrt by itself. Default is 1 when -OPT:roundoff=2 or greater, else the default is 0.
-OPT:space=(ON|OFF) When ON, this option specifies that code size is to be given priority in tradeoffs with execution time in optimization choices. Default is OFF. This can be turned on either directly or by compiling with -Os.
-OPT:speculate=(ON|OFF) When ON, this option makes the compiler convert short-circuiting conditionals to their equivalent non-short-circuited forms whenever possible. This eliminates branches at the expense of more computations. Default is OFF.
-OPT:transform_to_memlib=(ON|OFF) When ON, this option enables transformation of loop constructs to calls to memcpy or memset. Default is ON.
-OPT:treeheight=(ON|OFF) The value ON enables re-association in expressions to reduce the expression's tree height. The default is OFF.
-OPT:unroll_analysis=(ON|OFF) The default value of ON lets the compiler analyze the content of the loop to determine the best unrolling parameters, instead of strictly adhering to the -OPT:unroll_times_max and -OPT:unroll_size parameters. -OPT:unroll_analysis=ON can have the negative effect of unrolling loops less than the upper limit dictated by the -OPT:unroll_times_max and -OPT:unroll_size specifications.
-OPT:unroll_times_max=N Unroll inner loops by a maximum of N. The default is 4
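The effect described for the speculate option can be pictured at source level. The following C sketch is only an illustration under stated assumptions: the function names are hypothetical, and the compiler performs the conversion on its internal representation, not on source text. It shows a short-circuiting conditional next to an equivalent non-short-circuited form.

```c
#include <stdbool.h>

/* Short-circuit form: the second comparison is evaluated only when the
   first one is true, so a conditional branch is normally emitted.
   (Function names are hypothetical, chosen for this illustration.) */
bool in_range_short_circuit(int x, int lo, int hi)
{
    return x >= lo && x <= hi;
}

/* Non-short-circuited equivalent: both comparisons are always evaluated
   and combined with a bitwise AND, so no branch is required. This is
   only equivalent because neither operand has side effects or can trap. */
bool in_range_branch_free(int x, int lo, int hi)
{
    return (x >= lo) & (x <= hi);
}
```

Both functions return the same value for every input; the second trades one always-executed comparison for the removal of a branch, which is the tradeoff the option description names.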
103. s are also accepted. The Fortran source files are compiled, and an executable object file is produced. The default name of the executable object file is a.out. For example, the following command line produces a.out:
    pathf95 myprog.f
By default, several files are created during processing. The compiler adds a suffix to the file portion of the file name and places the files it creates into your working directory. See the FILES section for more information on files used and generated.
files (C/C++) Indicates the source files to be compiled or assembled. File suffixes and the commands that accept them are as follows:
    Command   File Suffix
    pathCC    .c, .C, .ii, .c++, .C++, .cc, .cxx, .CXX, .CC, .cpp, and .CPP
    pathcc    .c and .i
ENVIRONMENT VARIABLES
F90_BOUNDS_CHECK_ABORT (Fortran) Set to YES, causes the program to abort on the first bounds check violation.
F90_DUMP_MAP (Fortran) When set to YES, if a segmentation fault occurs, print the current process's memory map before aborting. The memory map describes how the process's address space is allocated. The Fortran runtime will print the address of the segmentation fault; you can examine the memory map to see which mapped area was nearest to the fault address. This can help distinguish between program bugs that involve running out of stack space and null pointer dereferences. The memory map is displayed using the same format as the file /proc/self/maps.
FILENV Th
104. section 8.11 describes how these pthread stacks are sized.
NOTE: The automatic stack sizing algorithm used by Fortran serial programs and Fortran OpenMP programs is not employed for C and C++ programs.
8.11 Stack Size Algorithm
The stack limit for each OpenMP pthread is calculated as follows:
If PSC_OMP_STACK_SIZE is set, then this specifies the stack limit.
If this is a Fortran program, the stack limit is automatically set using the same approach as described in section 3.13, except that the calculated value is divided by the number of CPUs in the system. This ensures that the physical memory available for stack can be shared between as many threads as there are CPUs in the system.
Otherwise, this is a C/C++ program and the stack limit is set to a default value of 32MB.
The distinction between Fortran and C/C++ programs is determined by whether the program entry point is MAIN__ (for Fortran) or main (for C/C++).
This stack size is then compared against system-imposed limits (both lower and upper). If the check fails, then a warning is generated and the stack size is automatically adjusted to the appropriate limit.
The following lower limit is imposed: the minimum size of a pthread stack specified by the system. This is typically 16KB.
The following upper limits are imposed: the maximum stack size that the system's pthread library will accept, i.e. t
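The selection order described in section 8.11 can be summarized as a small decision function. The C sketch below is not the runtime library's code: the function name, its parameters, and the convention that 0 means "PSC_OMP_STACK_SIZE is not set" are assumptions made for illustration, and the Fortran automatic size from section 3.13 is simply passed in as a number.

```c
#include <stddef.h>

/* Default for C/C++ programs, as stated in the text: 32MB. */
#define DEFAULT_C_STACK_BYTES ((size_t)32 * 1024 * 1024)

/* Hypothetical helper mirroring the documented order of decisions.
   env_size:          value of PSC_OMP_STACK_SIZE in bytes, 0 if unset (assumed convention)
   is_fortran:        nonzero when the entry point is the Fortran one
   fortran_auto_size: size computed by the section 3.13 approach (supplied by caller)
   ncpus:             number of CPUs in the system
   sys_min, sys_max:  system-imposed lower and upper limits */
size_t choose_omp_stack_size(size_t env_size, int is_fortran,
                             size_t fortran_auto_size, unsigned ncpus,
                             size_t sys_min, size_t sys_max)
{
    size_t size;

    if (env_size != 0) {
        size = env_size;                  /* explicit setting wins */
    } else if (is_fortran) {
        if (ncpus == 0)
            ncpus = 1;                    /* guard against division by zero */
        size = fortran_auto_size / ncpus; /* share stack memory across CPUs */
    } else {
        size = DEFAULT_C_STACK_BYTES;     /* C/C++ default */
    }

    /* Adjust to the system limits; the real runtime also prints a warning. */
    if (size < sys_min)
        size = sys_min;
    if (size > sys_max)
        size = sys_max;
    return size;
}
```

For instance, a C program with no environment setting and generous limits gets the 32MB default, while a request of 1024 bytes is raised to the system minimum.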
105. specified, the transformation of code to run under multiple threads can only take place after the LNO phase, in which case this flag is ignored.
-OPT:early_intrinsics=(ON|OFF) When ON, this option causes calls to intrinsics to be expanded to inline code early in the backend compilation. This may enable more vectorization opportunities if vector forms of the expanded operations exist. Default is OFF.
-OPT:fast_bit_intrinsics=(ON|OFF) Setting this to ON will turn off the check for the bit count being within range for Fortran intrinsics (like BTEST and ISHFT). The default setting is OFF.
-OPT:fast_complex=(ON|OFF) Setting fast_complex=ON enables fast calculations for values declared to be of the type complex. When this is set to ON, complex absolute value (norm) and complex division use fast algorithms that overflow for an operand (the divisor, in the case of division) that has an absolute value that is larger than the square root of the largest representable floating-point number. This would also apply to an underflow for a value that is smaller than the square root of the smallest representable floating-point number. OFF is the default. fast_complex=ON is enabled if -OPT:roundoff=3 is in effect.
-OPT:fast_exp=(ON|OFF) This option enables optimization of exponentiation by replacing the runtime call for exponentiation by multiplication and/or square root operations for certain compile-time constant exponents (integers and halves). This can pro
106. taskset, or using the other OpenMP library environment variables for controlling thread affinity. See the following descriptions of related environment variables.
PSC_OMP_AFFINITY_GLOBAL (boolean: TRUE or FALSE) This environment variable controls whether thread global ID or local ID values are used when assigning threads to CPUs. The default is TRUE, so that global ID values are used for calculating thread assignments. Global IDs uniquely identify each thread, and are integer values starting from 0 (for the original master thread) and incrementing upwards in the order in which threads are allocated. The global ID is constant for a particular thread from its fork to its join. Using the global ID for the affinity mapping ensures that threads do not change CPU in their lifetime, and ensures that threads will be evenly distributed over CPUs.
The alternative is to use the thread local ID for this mapping. When nested parallelism is not employed, each thread's global and local ID will be identical and the setting of this variable is irrelevant. However, when a nested team of threads is created, that team will be assigned new local thread IDs starting at 0 for the master of that team and incrementing upwards. Note that the local ID of a thread can change when that thread performs a nested fork and then a nested join, and that these events may cause the CPU binding o
107. that an argument to a subroutine or function is given a constant value such as 0 or .FALSE., but the subroutine or function tries to assign a new value to that argument.
We recommend that, where possible, you fix code that assigns to constants so that it no longer does this. Such a change will continue to work with other Fortran compilers, but will allow the PathScale Fortran compiler to generate code that will not crash and will run more efficiently.
If you cannot modify your code, we provide an option called -LANG:rw_const=on that will change the compiler's behavior so that it allocates constant values in read-write memory. We do not make this option the default, as it reduces the compiler's ability to propagate constant values, which makes the resulting executables slower.
You might also try the -LANG:formal_deref_unsafe option. This option tells the compiler whether it is unsafe to speculate a dereference of a formal parameter in Fortran. The default is OFF, which is better for performance. See the eko man page for more details on these two flags.
3.12.2 Runtime Errors Caused by Aliasing Among Fortran Dummy Arguments
The Fortran standards require that arguments to functions and subroutines not alias each other. As an example, this is illegal:
    program bar
    ...
    call foo(c,c)
    ...
    subroutine foo(a,b)
    integer i
    real a(100), b(100)
    do i = 2, 100
      a(i) = b(i) - b(i-1)
    enddo
Because a and b are dummy arguments, the compiler relies on the assumption th
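To see why aliased dummy arguments can change results, it helps to compare two evaluation orders of a loop of the same shape. The C sketch below is an illustration only: the function names and the fixed buffer size N are hypothetical, and the second function merely stands in for one schedule an optimizer could legally pick when it assumes the two arrays do not overlap (for example, loading several elements before storing any).

```c
#include <stddef.h>
#include <string.h>

#define N 8   /* hypothetical upper bound on the array length used here */

/* Element-by-element update, in the order the loop is written. When a
   and b are the same array, each iteration reads a value that the
   previous iteration has already overwritten. */
void update_in_order(float *a, const float *b, int n)
{
    for (int i = 1; i < n; i++)
        a[i] = b[i] - b[i - 1];
}

/* All loads of b happen before any store to a. This matches the first
   function when a and b are distinct, but not when they alias. */
void update_loads_first(float *a, const float *b, int n)
{
    float tmp[N];
    if (n > N)
        n = N;                                   /* stay inside tmp */
    memcpy(tmp, b, (size_t)n * sizeof(float));   /* snapshot of b */
    for (int i = 1; i < n; i++)
        a[i] = tmp[i] - tmp[i - 1];
}
```

Called with two distinct arrays, the functions agree. Called with the same array for both arguments, starting from 1, 2, 4, 8, the first produces 1, 1, 3, 5 and the second 1, 1, 2, 4, so a program that violates the no-alias rule can give different answers depending on how the loop is optimized.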
108. the compiler not to warn about code violating sequence point rules.
-W[no-]shadow For C/C++ only. -Wshadow warns when one local variable shadows another. -Wno-shadow tells the compiler not to warn when one local variable shadows another.
-W[no-]sign-compare For C/C++ only. -Wsign-compare warns about signed/unsigned comparisons. -Wno-sign-compare tells the compiler not to warn about signed/unsigned comparisons.
-W[no-]sign-promo For C/C++ only. The -Wsign-promo option warns when overload resolution promotes from unsigned to signed. -Wno-sign-promo tells the compiler not to warn when overload resolution promotes from unsigned to signed.
-W[no-]strict-aliasing For C/C++ only. -Wstrict-aliasing warns about code that breaks strict aliasing rules. -Wno-strict-aliasing tells the compiler not to warn about code that breaks strict aliasing rules.
-W[no-]strict-prototypes For C/C++ only. -Wstrict-prototypes warns about non-prototyped function declarations. -Wno-strict-prototypes tells the compiler not to warn about non-prototyped function declarations.
-W[no-]switch For C/C++ only. -Wswitch warns when a switch statement is incorrectly indexed with an enum. -Wno-switch tells the compiler not to warn when a switch statement is incorrectly indexed with an enum.
-Wswitch-default For C/C++ only. Warn when a switch statement has no default.
-Wswitch-enum For C/C++ only. Warn when a switch statement is m
109. the list is shorter than the maximum number of threads, then it is simply repeated over and over again until there is a mapping for each thread. This repeat feature allows short lists to be used to specify repetitive thread mappings for all threads.
PSC_OMP_CPU_STRIDE This specifies the striding factor used when mapping threads to CPUs. It takes an integer value in the range of 0 to the number of CPUs (inclusive). The default is a stride of 1, which causes the threads to be linearly mapped to consecutive CPUs. When there are more threads than CPUs, the mapping wraps around, giving a round-robin allocation of threads to CPUs. The behavior for a stride of 0 is the same as a stride of 1.
PSC_OMP_CPU_OFFSET This specifies an integer value that is used to offset the CPU assignments for the set of threads. It takes an integer value in the range of 0 to the number of CPUs (inclusive). When a thread is mapped to a CPU, this offset is added onto the CPU number calculated after PSC_OMP_CPU_STRIDE has been applied. If the resulting value is greater than the number of CPUs, then the remainder is used from the division of this value by the number of CPUs.
PSC_OMP_GUARD_SIZE This environment variable specifies the size in bytes of a guard area that is placed below pthread stacks. This guard area is in addition to any guard pages created by your O/S.
PSC_OMP_GUIDED_CHUNK_DIVISOR The value of PSC_OMP_GUIDED_CHUNK_DIVISOR is used to divide down the chunk size as
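The interaction of PSC_OMP_CPU_STRIDE and PSC_OMP_CPU_OFFSET described in this entry can be sketched as modular arithmetic. The C function below is a hypothetical illustration, not the OpenMP runtime's code: the name and signature are invented, and the text does not spell out the exact formula used for strides greater than 1, so plain multiplication followed by wrap-around is assumed.

```c
/* Hypothetical sketch of the documented mapping:
   - a stride of 0 behaves like a stride of 1
   - the strided CPU number wraps around (round-robin)
   - the offset is added afterwards, and the remainder of the division
     by the number of CPUs is used when the sum runs past the last CPU */
unsigned map_thread_to_cpu(unsigned thread_id, unsigned stride,
                           unsigned offset, unsigned ncpus)
{
    if (ncpus == 0)
        return 0;                 /* degenerate input: nothing to map onto */
    if (stride == 0)
        stride = 1;               /* documented: 0 is treated as 1 */

    unsigned cpu = (thread_id * stride) % ncpus;  /* assumed stride formula */
    return (cpu + offset) % ncpus;                /* apply offset, then wrap */
}
```

With 4 CPUs, a stride of 1 and an offset of 0, threads 0 to 5 land on CPUs 0, 1, 2, 3, 0, 1; with an offset of 2, thread 3 lands on CPU 1.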
110. to link against fast versions of standard library routines. During compilation, -ffast-stdlib implies -OPT:fast_stdlib=on. If -fno-fast-stdlib is used during linking, the compiler will not link against the PathScale compiler runtime library. If you link code with -fno-fast-stdlib that was not also compiled with this flag, you may see linker errors. Much of the PathScale compiler Fortran runtime is compiled with -ffast-stdlib, so it is not advised to link Fortran applications with -fno-fast-stdlib.
-ffloat-store Do not store floating point variables in registers, and inhibit other options that might change whether a floating point value is taken from a register or memory. This option prevents undesirable excess precision on the X87 floating-point unit, where all floating-point computations are performed in one precision regardless of the original type (see -mx87-precision). If the program uses floating point values with less precision, the extra precision in the X87 may violate the precise definition of IEEE floating point. -ffloat-store causes all pertinent intermediate computations to be stored to memory to force truncation to lower precision. However, the extra stores will slow down program execution substantially. -ffloat-store has no effect under -msse2, which is the default under both -m64 and -m32.
-ffortran2003 When you apply the Fortran intrinsic real, dble, or cmplx to a boz constant such as z'3ff00000', the compiler traditionally c
111. to minimize data cache misses. This optimization is based on reference patterns of fields in large structs, learned during feedback compilation. The default is OFF.
-IPA:ctype=ON optimizes interfaces to constructs defined in the standard header file ctype.h by assuming that the program will not run in a multi-threaded environment. The default is OFF.
7.3.6.1 Disabling Options
The following options are for disabling various optimizations in IPA. They are useful for studying the effects of the optimizations.
-IPA:alias=OFF disables IPA's alias and mod-ref analyses.
-IPA:addressing=OFF disables IPA's address-taken analysis, which is a component of the alias analysis.
-IPA:cgi=OFF disables the constant propagation for global variables (constant global identification).
-IPA:cprop=OFF disables the constant propagation for parameters.
-IPA:dfe=OFF disables dead function elimination.
-IPA:dve=OFF disables dead variable elimination.
-IPA:split=OFF disables common block splitting.
7.3.7 Case Study on SPEC CPU2000
This section presents experimental data to show the importance of IPA in improving program performance. Our experiment is based on the SPEC CPU2000 benchmark suite compiled using release 1.2 of the PathScale compiler. The compiled benchmarks are run on a 1.4 GHz Opteron system. Two sets of data are shown here. The first set studies the effects of using the single option -ipa. The second set shows the effects of additional IPA-related tunin
112. to the POSIX function kill. Send to the process whose ID is pid the signal whose number is signal. The function form returns 0 on success, or an error code from the C library value errno. The subroutine form sets status to the value which the function would return.
link: Fortran interface to the POSIX function link. Creates a hard link path2 pointing to the same file as path1. The function form returns 0 on success, or an error code from the C library value errno. The subroutine form sets status to the value which the function would return. Trailing blanks in path1 and path2 are ignored; you can prevent this by using char(0) to place a null character after the last significant character.
lnblnk: Returns the length of its argument, neglecting trailing blanks (synonym for standard function len_trim).
loc: Returns address of argument in memory.
long: Convert to type integer*4.
lshift: Bitwise left shift. High-order bit is not treated as a sign bit. Shift count must be nonnegative and less than the bit size of the data.
lstat: Fortran interface to the POSIX function lstat. Store in array sarray information about the file named file; if that is a symbolic link, describe the link rather than the target of the link (cf. stat). The function form returns 0 or an error code from the C library value errno. Trailing blanks in file are ignored; you can prevent this by u
113. used by the compiler to generate better code. The PathScale Compiler Suite uses feedback information for branches, loop counts, calls, switch statements, and variable values.
A command line option for the compiler, usually an option relating to code optimization.
A utility used to determine if a test suite exercises all code paths in a program.
Inter-Procedural Analysis. A sophisticated compiler technique in which multiple functions and subroutines are optimized together.
Intermediate Representation.
A step in compilation where code is linked in an intermediate representation, so that inter-procedural analysis and optimization can take place.
A utility program that links a compiled or assembled program to a particular environment. Also known as a link editor, the linker unites references between program modules and libraries of subroutines. Its output is a load module, which is executable code ready to run in the computer.
Loop nest optimizer. Performs transformation on a loop nest; improves data cache performance; improves optimization opportunities in later phases of compiling; vectorizes loops by calling vector intrinsics; parallelizes loops; computes data dependency information for use by the code generator; can generate a listing of transformed code in source form.
Multiprocessor.
Non-uniform memory access is a method of configuring a cluster of microprocessors in a multiprocessing system so that they can share memory locally, i
114. using this option. Please see the Release Notes for more information.
2.3 Compiling for Different Platforms
The PathScale Compiler Suite currently compiles and optimizes your code for the Opteron processor, independent of where the compilation is happening. This may change in the future. To select the 32-bit or 64-bit ABI, the compiler queries the machine where the compilation is happening and will compile to the best ABI supported for that machine. These defaults for the target processor and the ABI can be overridden by command line flags or the compiler.defaults file.
You can set or change the default platform for compilation using the compiler.defaults file, found in /opt/pathscale/etc. If you installed in a non-default location, the path will be <install_directory>/pathscale/etc. You can use the defaults file to provide a set of additional include or library directories to search, or to specify some default compiler optimization flags.
The compiler refers to the compiler.defaults file for options to be used during compilation. The syntax in the compiler.defaults file is the same as for options specified on the compiler command line. Options are added to the command line in the order in which they appear in the defaults file. Every option is included unconditionally. For exclusive options, the command line takes precedence over the defaults file. For example, if the defaults file contains the -O3 option, but the compiler is invoked with -O2, o
115. values for N are:
0: No debugging information for symbolic debugging is produced. This is the default.
1: Produces minimal information, enough for making backtraces in parts of the program that you don't plan to debug. This is also the flag to use if the user wants backtraces but does not want the overhead of full debug information. This flag also causes -export-dynamic to be passed to the linker.
2: Produces debugging information for symbolic debugging. Specifying -g without a debug level is equivalent to specifying -g2. If there is no explicit optimization flag specified, the -O0 optimization level is used in order to maintain the accuracy of the debugging information. If optimization options -O1, -O2, or -O3 are explicitly specified, the optimizations are performed accordingly, but the accuracy of the debugging cannot be guaranteed. If -ipa is specified along with option -g2, then IPA is disabled.
3: Produces additional debugging information for debugging macros.
-gnuN For C/C++ only. Direct the compiler to generate code compatible with the GNU N series of compilers, where N is either 3 (GCC 3.3) or 4 (GCC 4.2). On systems whose system compiler is GCC 3, the default is -gnu3; on GCC 4 systems, the default is -gnu4. Use -show-defaults to display the default. -gnu40 is also supported, which selects GCC 4.0.
-GRA: Option group for Global Register Allocator.
-GRA:home=(ON|OFF) Turn off the rematerialization optimization for
116. 7.8.2 BIOS
7.8.3 Multiprocessor Memory
7.8.4 Kernel and System Effects
7.8.5 Tools and APIs
7.8.6 Testing Memory Latency and Bandwidth
7.9 The pathopt2 Tool
7.9.1 A Simple Example
7.9.2 pathopt2 Usage
7.9.3 Option Configuration File
7.9.4 Testing Methodology
7.9.5 Using an External Configuration File to Modify pathopt2.xml
7.9.6 PSC_GENFLAGS Environment Variable
7.9.7 Using Build and Test Scripts
7.9.8 The NAS Parallel Benchmark Suite
7.9.8.1 Set Up the Workarea
7.9.8.2 Example 1: Run with Makefile
7.9.8.3 Example 2: Use Build/Run Scripts and a Timing File
7.9.8.4 Example 3: Using a Single Script with the rate file
7.10 How Did the Compiler Optimize My Code
7.10.1 Using the -S flag
117. 0 both suppress the generation of prefetch instructions, but -LNO:prefetch=0 also affects LNO optimizations that depend on prefetch.
-CG:ptr_load_use=N Add a latency of N cycles between an instruction that loads a pointer and an instruction that uses the pointer. The extra latency will force the instruction scheduler to schedule the pointer load earlier. In general, it is beneficial to load pointers as soon as possible so that dependent memory instructions can begin execution. N is 4 by default. Load-pointer instructions include load-execute instructions that compute a pointer result.
-CG:push_pop_int_saved_regs=(ON|OFF) Use the x86 push and pop instructions to save the integer callee-saved registers at function prologues and epilogues, instead of mov instructions to and from memory locations based off the stack pointer. The default is ON when CPU target is barcelona, and OFF otherwise.
-CG:sse_cse_regs=N When performing common subexpression elimination during code generation, assume there are N extra SSE registers available over the number provided by the CPU. N can be positive, zero, or negative. The default is positive infinity. See also -CG:cse_regs.
-CG:use_prefetchnta=(ON|OFF) Prefetch when data is non-temporal at all levels of the cache hierarchy. This is for data streaming situations in which the data will not need to be re-used soon. The default is OFF.
-CG:use_test=(ON|OFF) Make the code generator us
118. 1 Programming languages Fortran (Fortran 90).
Supports legacy FORTRAN 77 (ANSI X3.9-1978) programs.
Provides support for common extensions to the above language definitions.
Links binaries generated with the GNU Fortran 77 compiler.
Generates code that complies with the x86-64 ABI and the 32-bit x86 ABI.
3.1 Using the Fortran Compiler
To invoke the PathScale Fortran compiler, use this command:
    pathf95
By default, the compiler will treat input files with an .F suffix or .f suffix as fixed-form files. Files with an .F90, .f90, .F95, or .f95 suffix are treated as free-form files. This behavior can be overridden using the -fixedform and -freeform switches. See section 3.1.1 for more information on fixed-form and free-form files.
By default, all files ending in .F, .F90, or .F95 are first preprocessed using the C preprocessor (cpp). If you specify the -ftpp option, all files are preprocessed using the Fortran preprocessor (tpp), regardless of suffix. See section 3.6.1 for more information on preprocessing.
Invoking the compiler without any options instructs the compiler to use optimization level -O2. These three commands are equivalent:
    pathf95 test.f90
    pathf95 -O test.f90
    pathf95 -O2 test.f90
Using optimization level -O0 instructs the compiler to do no optimization. Optimization level -O1 performs only local optimization. Level -O2, the
120. 4 or REAL(KIND=4), and its arguments X can be either R*4 (REAL*4) or R*8 (REAL*8). ACOS belongs to the ANSI, G77, PGI, and TRADITIONAL families of intrinsics (see appendix C.2 for an explanation of intrinsic families), which means the compiler will recognize it if any of those families is enabled. Under remarks, E and P are listed. E tells us that this is an elemental intrinsic, and P tells us that the intrinsic may be passed as an actual argument.
Here is a simple scalar call to intrinsic ACOS:
    print *, acos(1.0)
Because the intrinsic is elemental, you can also apply it to an array:
    print *, acos((/ 1.0, 0.707, 0.5 /))
NOTE: One of the lesser-known features of Fortran 90 is that you can use argument names when calling intrinsics, instead of passing all of the arguments in strictly defined order. There are only a couple of cases where it is actually useful to know the official name so that you can omit optional arguments that don't interest you (for example, call date_and_time(time=timevar)), but you're always allowed to specify the name if you like.
Intrinsic Options
If your program contains a function or subroutine whose name conflicts with that of one of the intrinsic procedures, you have three choices. Within each program unit that calls that function or subroutine, you can declare the procedure in an external statement, or you can declare it with F
122. 5 -C gasdyn.f90 -o gasdyn
The generated code checks all array accesses to ensure that they fall within the bounds of the array. If an access falls outside the bounds of the array, you will get a warning from the program printed on the standard error at runtime:
    gasdyn
    lib-4961 : WARNING Subscript 20 is out of range for dimension 1 for array X at line 11 in file t.f90 with bounds 1:10
If you set the environment variable F90_BOUNDS_CHECK_ABORT to YES, then the resulting program will abort on the first bounds check violation.
Array bounds checking has an impact on code performance, so it should be enabled only for debugging and disabled in production code that is performance sensitive.
3.6.8 Pseudo-random Numbers
The pseudo-random number generator (PRNG) implemented in the standard PathScale Fortran library is a non-linear additive feedback PRNG with a 32-entry long seed table. The period of the PRNG is approximately 16*((2**32)-1).
3.7 Mixed Code
If you have a large application that mixes Fortran code with code written in other languages, and the main entry point to your application is from C or C++, you can optionally use pathcc or pathCC to link the application, instead of pathf95. If you do, you must manually add the Fortran runtime libraries to the link line. As an example, you might do something like this:
    pathCC -o my_big_app file1.o file2.o -lpathfstart -lpathfortran
If the main program is wr
123. 64.so.2 (0x0000002a95556000)

If you use the -static option, notice that the shared libraries are no longer required:

    pathcc -o hello hello.c -static
    ldd hello
    not a dynamic executable

2.8 Large File Support

The Fortran runtime libraries are compiled with large file support. PathScale does not provide any runtime libraries for C or C++ that do I/O, so large file support is provided by the libraries in the Linux distribution being used.

2.9 Memory Model Support

The PathScale compilers currently support two memory models: small and medium. The default memory model on x86-64 systems, and the default for the compilers, is small (equivalent to GCC's -mcmodel=small). This means that offsets of code and data within binaries are represented as signed 32-bit quantities. In this model, all code in an executable must total less than 2GB, and all the data must also be less than 2GB. Note that by data we mean the static and uninitialized static data (BSS) that are compiled into an executable, not data allocated dynamically on the stack or from the heap. Pointers are 64 bits, however, so dynamically allocated memory may exceed 2GB. Programs can be statically or dynamically linked.

Additionally, the compilers support the medium memory model with the use of the option -mcmodel=medium on all of the compilation and link commands. This means that offsets of code within binaries
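The 2GB limit of the small memory model follows directly from the signed 32-bit offsets. The C sketch below shows the arithmetic for a statically allocated array of doubles. The helper names and the `SMALL_MODEL_LIMIT` macro are assumptions made for this illustration; a real link also has to account for all other code and data in the executable.

```c
#include <assert.h>
#include <stdint.h>

/* Largest offset a signed 32-bit quantity can represent (2GB - 1).
   Illustrative constant, not a PathScale-defined name. */
#define SMALL_MODEL_LIMIT ((uint64_t)INT32_MAX)

/* Size in bytes of a static array of `count` doubles. */
static uint64_t static_array_bytes(uint64_t count)
{
    assert(count <= UINT64_MAX / sizeof(double));  /* no overflow */
    return count * (uint64_t)sizeof(double);
}

/* Returns 1 if `bytes` of static data can be reached with signed
   32-bit offsets, 0 if a larger memory model would be needed. */
static int fits_signed_32bit_offsets(uint64_t bytes)
{
    return bytes <= SMALL_MODEL_LIMIT;
}
```

For example, a static array of 300 million doubles needs 2.4 billion bytes, which exceeds the limit, while the same array allocated from the heap is unaffected because pointers are 64 bits.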
124. 68 real or 12.27 user times from the -Ofast run. The reason the "Time in seconds" output by NPB is considerably lower than 12.68 is that it measures the time for the main work section of the program, ignoring the start-up and array initialization time. For the parallel versions of NPB, it is appropriate to ignore the initialization, since that time does not improve when more processes are used in the computation.

This "Time in seconds" and "Mop/s total" (millions of operations per second) from the NPB benchmarks turn out to be useful metrics for testing optimization. The -S timing-file and rate-file features can be used to search for the "Time in seconds" or the "Mop/s total" metrics. In this next example we will use the timing-file option. See section 7.9.8.4 for information on the rate-file option.

This "Time in seconds" output can be used as pathopt2's sorting criterion by using the -S timing-file option. However, the psc_build script has to be enhanced to be able to isolate the number after the "Time in seconds" part of the output. Here is how to do this, in a script found in /opt/pathscale/share/pathopt2/examples called psc_test2:

    #!/bin/sh
    bin/ft.A > logs/ft.A.txt
    grep 'in sec' logs/ft.A.txt > secs.log
    sed -e 's/Time in seconds =//' secs.log > $PSC_METRIC_FILE
    grep SUCCESSFUL logs/ft.A.txt

NOTE: pathopt2 checks the result status of the build command s
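The metric-extraction step performed by grep and sed above can also be sketched in C. This is an illustration under stated assumptions: `parse_time_in_seconds` is a hypothetical helper, and it assumes the benchmark prints the key "Time in seconds" followed by an optional `=` and the number on the same line.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: finds "Time in seconds" in one log line and
   parses the number that follows it. Returns 1 and stores the value in
   *out on success; returns 0 if the key or the number is missing. */
static int parse_time_in_seconds(const char *line, double *out)
{
    static const char key[] = "Time in seconds";

    assert(line != NULL && out != NULL);

    const char *p = strstr(line, key);
    if (p == NULL)
        return 0;
    p += sizeof key - 1;

    /* Skip blanks and the '=' separator assumed to follow the key. */
    while (*p == ' ' || *p == '\t' || *p == '=')
        p++;

    char *end;
    double value = strtod(p, &end);
    if (end == p)
        return 0;              /* no number after the key */

    *out = value;
    return 1;
}
```

Whatever tool does the extraction, the point is the same: the metric file should contain only the number, so that the sort is not influenced by compile time or start-up time.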
125. 7 PGI E P TRADITIONAL CDLOG Z 16 X Z 16 G77 PGI E P TRADITIONAL CDSIN Z 16 X Z 16 G77 PGI E P TRADITIONAL CDSQRT Z 16 X Z 16 G77 PGI E P TRADITIONAL CEILING A R 4 R 8 ANSI PGI E KIND I 1 1 2 114 TRADITIONAL I 8 CEXP Z 8 X Z 8 Z 16 ANSI G77 E P PGI TRADITIONAL C 6 C Supported Fortran Intrinsics Table of Supported Intrinsics I Table C 1 Fortran Intrinsics Supported in Version 3 2 Continued Intrinsic Name Result Arguments Families Remarks CHAR C I 111 1 2 I4 I8 ANSI G77 E KIND 171 1 2 1 4 PGI I8 TRADITIONAL CHDIR 1 4 DIR C G77 PGI O STATUS I 4 CHDIR Subroutine DIR C G77 O STATUS 1 4 CHMOD 1 4 NAME C G77 PGI O MODE C STATUS 1 4 CHMOD Subroutine NAME C G77 O MODE C STATUS 1 4 CLEAR_IEEE_ Subroutine EXCEPTION I 8 TRADITIONAL E EXCEPTION CLOC I 8 C C TRADITIONAL CLOCK C TRADITIONAL CLOG Z8 X Z 8 Z 16 ANSI G77 E P PGI TRADITIONAL CMPLX Z8 X I 1 1 2 1 4 1 8 ANSI G77 E R 4 R 8 Z 8 Z 16 PGI O Y F1 1 2 1 4 1 8 TRADITIONAL R 4 R 8 Z 8 Z 16 COMMAND 1 4 KIND I 1 1 2 1 4 ANSI O ARGUMENT 1 8 TRADITIONAL COUNT COMPARE L 4 I 1 4 TRADITIONAL E AND_SWAP J 1 4 K 114 COMPARE_ L 8 I 148 TRADITIONAL E AND SWAP J I 8 K I8 COMPL 1 11 1 2 1 4 1 8 PGI E R 4 R 8 TRADITIONAL CrayPtr L 1 L 2 L 4 L 8 C 7 C Supported Fortran Intrinsics Table of Supported Intrinsics AA
126. 8.14.3.3 Load Balancing

It is possible to gain some coarse insight into the load balancing of the OpenMP application using the top program. Depending on the version of top, you should be able to view the breakdown of user, system, and idle time per CPU. Often this view can be obtained by pressing 1. You may also want to increase the update rate, e.g. with s followed by 0.5. It is sometimes possible to see the program moving from serial to parallel phases, and also to see whether the work is being well distributed. If there is excessive time spent in the system or swapping, then this should also be investigated. It is best to run OpenMP applications on nodes with no other running applications.

If the OpenMP application uses runtime scheduling, then try varying the runtime schedule using the OMP_SCHEDULE environment variable. A good choice of schedule and chunk size is sometimes important for performance.

NOTE: The gprof profiling (-pg) does not work in conjunction with pthreads or the OpenMP library.

An alternative approach is to use OProfile, which uses hardware counters and sampling techniques to build up a profile of the system. It is possible to capture application code, dynamic libraries, kernel modules, and drivers in a profile created by OProfile, giving insight into system-wide performance characteris
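To see why the choice of chunk size matters for load balance, the following C sketch counts how many iterations each thread receives under an OpenMP static schedule with a given chunk size, where chunks are handed out round-robin in thread order. The function name is an assumption made for this illustration, and the sketch only models iteration counts; it does not measure real run time.

```c
#include <assert.h>

/* Number of iterations thread `tid` executes when `n` iterations are
   divided into chunks of `chunk` iterations and the chunks are assigned
   round-robin to `nthreads` threads (OpenMP schedule(static, chunk)). */
static long static_chunk_iterations(long n, long chunk, int nthreads, int tid)
{
    assert(n >= 0 && chunk > 0);
    assert(nthreads > 0 && tid >= 0 && tid < nthreads);

    long count = 0;
    /* Thread tid owns the chunks starting at tid*chunk, then every
       nthreads*chunk iterations after that. */
    for (long start = (long)tid * chunk; start < n;
         start += (long)nthreads * chunk) {
        long remaining = n - start;
        count += remaining < chunk ? remaining : chunk;
    }
    return count;
}
```

With 100 iterations, a chunk size of 10, and 4 threads, threads 0 and 1 each receive 30 iterations while threads 2 and 3 receive 20, an imbalance that a smaller chunk size would reduce.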
127. 8 TRADITIONAL E J 1 1 12 1 4 1 8 K I 1 I2 1 4 18 C 13 C Supported Fortran Intrinsics Table of Supported Intrinsics AA NY Table C 1 Fortran Intrinsics Supported in Version 3 2 Continued Intrinsic Name Result Arguments Families Remarks DSHIFTR I 111 I2 1 4 I 8 TRADITIONAL E J 11 I2 I4 1 8 K 11 1 2 1 4 I 8 DSIGN R 8 A R 8 ANSI G77 E P B R 8 PGI TRADITIONAL DSIN R 8 X R 8 ANSI G77 E P PGI TRADITIONAL DSIND R 8 X R 8 PGI E TRADITIONAL DSINH R 8 X R 8 ANSI G77 E P PGI TRADITIONAL DSM_ I8 ARRAY Any type TRADITIONAL CHUNKSIZE Array rank any DIM 171 1 2 1 4 I8 DSM_ I 8 ARRAY Any TRADITIONAL DISTRIBUTION type Array rank any BLOCK DIM 171 1 2 1 4 I 8 DSM_ I 8 ARRAY Any type TRADITIONAL DISTRIBUTION Array rank any CYCLIC DIM 171 1 2 1 4 I 8 DSM_ I 8 ARRAY Any type TRADITIONAL DISTRIBUTION Array rank any STAR DIM 171 172 1 4 I8 DSM_ I8 ARRAY Any type TRADITIONAL ISDISTRIBUTED Array rank any DSM_ I 8 ARRAY Any type TRADITIONAL ISRESHAPED Array rank any DSM_ I 8 ARRAY Any type TRADITIONAL NUMCHUNKS Array rank any DSM_ I 8 ARRAY Any type TRADITIONAL NUMTHREADS Array rank any DIM 171 1 2 1 4 I8 DSM REM I8 ARRAY Any type TRADITIONAL CHUNKSIZE Array rank any DIM 171 1 2 1 4 I8 INDEX I 1 172 1 4 I8 C 14 C Supported Fortran Intrinsics Table of Supported Intrinsics I Table C 1 Fo
128. A 2 PSC PROBLEM REPORT DIR A 2 Environment variables OpenMP OMP DYNAMIC A 3 OMP NESTED A 3 OMP NUM THREADS A 3 OMP SCHEDULE A 3 Environment variables PathScale OpenMP PSC OMP AFFINITY A 3 PSC OMP AFFINITY GLOBAL A 3 PSC OMP AFFINITY MAP A 4 PSC OMP CPU OFFSET A 4 PSC OMP CPU STRIDE A 4 PSC OMP GUARD SIZE A 4 PSC OMP GUIDED CHUNK DIVISOR A 4 PSC OMP GUIDED CHUNK MAX A 4 PSC OMP LOCK SPIN A 5 PSC OMP SILENT A 5 PSC OMP STACK SIZE A 5 PSC OMP STATIC FAIR A 5 PSC OMP THREAD SPIN A 5 EVERY intrinsics family C 2 Execute target 7 32 explain command 3 27 used with iostat 3 28 extension source file name 2 6 F F90 BOUNDS CHECK ABORT 3 29 Fast math functions 7 21 FDO Feedback Directed Optimization 6 2 7 18 FFT 3 40 FILENV 3 35 Final object code 7 3 fixed form 3 2 Fixed form files 3 1 3 2 Floating point calculations 10 1 Format big endian 3 35 little endian 3 35 Fortran accessing common blocks 3 33 compiler commands 3 1 debugging 3 42 dope vector data structure 3 28 file units 3 37 KIND attribute 3 37 modules 3 3 preprocessor 3 1 3 24 3 25 runtime libraries 3 29 stack size 3 2 3 46 8 11 8 21 8 23 Fortran intrinsics abort C 42 C 47 access C 42 alarm C 42 and C 42 besyn C 42 cdsqrt C 42 chdir C 42 chmod C 43 ctime C 43 date C 43 dbesyn C 43 dcmplx C 43 dconj C 43 derfc C 43 dfloat C 43 dimag C 43 dreal C 43 dtime C 43 erfc C 43 etime C 44 exit C 44 fda
129. ATIC with no chunk size specified.

OMP_NUM_THREADS
Set the number of threads to use during execution. Default is number of CPUs in the machine.

PathScale OpenMP Environment Variables

These environment variables can be used with OpenMP in both Fortran and C and C++, except as indicated.

PSC_OMP_AFFINITY
When TRUE, the operating system's affinity mechanism (where available) is used to assign threads to CPUs; otherwise no affinity assignments are made. The default value is TRUE.

PSC_OMP_AFFINITY_GLOBAL
This environment variable controls whether thread global ID or local ID values are used when assigning threads to CPUs. The default is TRUE, so that global ID values are used for calculating thread assignments.

PSC_OMP_AFFINITY_MAP
This environment variable allows the mapping from threads to CPUs to be fully specified by the user. It must be set to a list of CPU identifiers separated by commas. The list must contain at least one CPU identifier, and entries in the list beyond the maximum number of threads supported by the implementation (256) are ignored. Each CPU identifier is a decimal number between 0 and one less than the number of CPUs in the system (inclusive). The implementation generates a mapping table that enumerates the mapping from each thread to CPUs. The CPU identifiers in the PSC_OMP_AFFINITY_MAP list are inserted in the mapping table starting at the index for thread 0 and increasing upwards. If
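The rules stated above for PSC_OMP_AFFINITY_MAP (comma-separated decimal CPU identifiers, at least one entry, each between 0 and the CPU count minus one, entries past 256 ignored, table filled from thread 0 upwards) can be sketched as a small parser. This is not the PathScale implementation: `parse_affinity_map` and the `UNASSIGNED` marker are hypothetical, and what the real runtime does with table slots the list does not cover is not modeled here.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_THREADS 256    /* limit quoted in the text */
#define UNASSIGNED  (-1)   /* illustrative marker, not a PathScale value */

/* Hypothetical parser for a PSC_OMP_AFFINITY_MAP-style list.
   Fills table[0..] starting at thread 0 and returns the number of
   entries stored, or -1 if the list is empty, malformed, or names a
   CPU outside 0..ncpus-1. Entries beyond MAX_THREADS are ignored. */
static int parse_affinity_map(const char *list, int ncpus,
                              int table[MAX_THREADS])
{
    assert(list != NULL && table != NULL && ncpus > 0);

    for (int i = 0; i < MAX_THREADS; i++)
        table[i] = UNASSIGNED;

    int count = 0;
    const char *p = list;
    for (;;) {
        char *end;
        long cpu = strtol(p, &end, 10);
        if (end == p || cpu < 0 || cpu >= ncpus)
            return -1;                 /* missing or out-of-range id */
        if (count < MAX_THREADS)
            table[count++] = (int)cpu; /* extra entries are dropped */
        if (*end == '\0')
            break;
        if (*end != ',')
            return -1;                 /* unexpected separator */
        p = end + 1;
    }
    return count;
}
```

For instance, on a 4-CPU system the list `0,2,1,3` places thread 0 on CPU 0, thread 1 on CPU 2, thread 2 on CPU 1, and thread 3 on CPU 3.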
130. CPU2000  7-10
7.3.8 Invoking IPA  7-12
7.3.9 Size and Correctness Limitations to IPA  7-14
7.4 Loop Nest Optimization (LNO)  7-14
7.4.1 Loop Fusion and Fission  7-14
7.4.2 Cache Size Specification  7-15
7.4.3 Cache Blocking, Loop Unrolling, Interchange Transformations  7-16
7.4.4 Prefetch  7-16
7.4.5 Vectorization  7-17
7.5 Code Generation (-CG:)  7-17
7.6 Feedback Directed Optimization (FDO)  7-18
7.7 Aggressive Optimizations  7-19
7.7.1 Alias Analysis  7-19
7.7.2 Numerically Unsafe Optimizations  7-20
7.7.3 Fast-math Functions  7-21
7.7.4 IEEE 754 Compliance  7-21
7.7.4.1 Arithmetic  7-21
7.7.4.2 Roundoff  7-22
7.7.5 Other Unsafe Optimizations  7-23
7.7.6 Assumptions About Numerical Accuracy  7-23
7.7.6.1 Flush-to-Zero Behavior  7-24
7.8 Hardware Performance  7-24
7.8.1 Hardware Setup
131. DDT. This format is known as DWARF 2.0 and is incorporated directly into the object files. Code that has been compiled using -g will be capable of being debugged using pathdb, GDB, or other debuggers.

The -g option automatically sets the optimization level to -O0 unless an explicit optimization level is provided on the command line. Debugging of higher levels of optimization is possible, but the code transformation performed by the optimizations may make it more difficult.

Bounds checking is quite a useful debugging aid. This can also be used to debug allocated memory.

If you are noticing numerical accuracy problems, see section 7.7 for more information on numerical accuracy. See section 10 for more information on debugging and troubleshooting. See the PathScale Debugger User Guide for more information on pathdb.

3.12.1 Writing to Constants Can Cause Crashes

Some Fortran compilers allocate storage for constant values in read-write memory. The PathScale Fortran compiler allocates storage for constant values in read-only memory. Both strategies are valid, but the PathScale compiler's approach allows it to propagate constant values aggressively.

This difference in constant handling can result in crashes at runtime when Fortran programs that write to constant variables are compiled with the PathScale Fortran compiler. A typical situation is
132. DITIONAL DERF X R 4 R 8 G77 PGI E P TRADITIONAL DERFC X R 4 R 8 G77 PGI E P TRADITIONAL DEXP R 8 X R 8 ANSI G77 E P PGI TRADITIONAL DFLOAT R8 A 11 1 2 1 4 I8 G77 PGI E TRADITIONAL DFLOATI R 8 A I2 TRADITIONAL E DFLOATJ R 8 A I4 TRADITIONAL E DFLOATK R 8 A I 8 TRADITIONAL E DIGITS X 1 1 1 2 1 4 1 8 ANSI PGI E R 4 R 8 TRADITIONAL DIM R 4 X R 4 ANSI G77 E P Y R 4 PGI TRADITIONAL DIM X R 8 ANSI G77 E P Y R 8 PGI TRADITIONAL C 12 C Supported Fortran Intrinsics Table of Supported Intrinsics Table C 1 Fortran Intrinsics Supported in Version 3 2 Continued I Intrinsic Name Result Arguments Families Remarks DIM X 171 1 2 1 4 I8 ANSI G77 E P PGI TRADITIONAL Y 1 1 1 2 1 4 18 DIMAG R 8 Z Z 16 G77 PGI E TRADITIONAL DINT R 8 A R 8 ANSI G77 E P PGI TRADITIONAL DISABLE IEEE Subroutine INTERRUPT I8 TRADITIONAL E INTERRUPT DLOG R8 X R8 ANSI G77 E P PGI TRADITIONAL DLOG10 R8 X R8 ANSI G77 E P PGI TRADITIONAL DMAX1 ANSI G77 See Std PGI TRADITIONAL DMIN1 ANSI G77 See Std PGI TRADITIONAL DMOD R8 A R 8 ANSI G77 E P P R 8 PGI TRADITIONAL DNINT R 8 A R 8 ANSI G77 E P PGI TRADITIONAL DOT ANSI PGI See Std PRODUCT TRADITIONAL DPROD R8 X R4 R 8 ANSI G77 E P Y R 4 R 8 PGI TRADITIONAL DREAL R 8 A 1 1 1 2 1 4 I 8 G77 PGI E R 4 R 8 Z 8 Z 16 TRADITIONAL DSHIFTL I 1 1 I2 1 4 I
133. EE 754-1985 floating-point roundoff and overflow behavior is -OPT:IEEE_arithmetic=N, where N = 1, 2, or 3.

-OPT:IEEE_arithmetic
1  Requires strict conformance to the standard.
2  Allows use of any operations as long as exact results are produced. This allows less accurate inexact results. For example, X*0 may be replaced by 0, and X/X may be replaced by 1, even though this is inaccurate when X is +inf, -inf, or NaN. This is the default level at -O3.
3  Means to allow any mathematically valid transformations. For example, replacing x/y by x*recip(y).

For more information on the defaults for IEEE arithmetic at different levels of optimization, see Table 7-3.

7.7.4.2 Roundoff

Use -OPT:roundoff= to identify the extent of roundoff error the compiler is allowed to introduce:
0  No roundoff error
1  Limited roundoff error allowed
2  Allow roundoff error caused by re-associating expressions
3  Any roundoff error allowed

The default roundoff level with -O0, -O1, and -O2 is 0. The default roundoff level with -O3 is 1.

Listing some of the other -OPT: sub-options that are activated by various roundoff levels can give more understanding about what the levels mean.

-OPT:roundoff=1 implies:
• -OPT:fast_exp=ON. This option enables optimization of exponentiation by replacing the run-time call for exponentiation by multiplication and/or square root operations for certain compile-time constant exponents (integers and halves).
• -OP
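The two kinds of inaccuracy discussed above can be reproduced in a few lines of C. The sketch below contrasts strict IEEE evaluation with the "X*0 becomes 0" rewrite, and shows how re-associating a sum changes its rounded result. The function names are made up for this illustration, and the code assumes it is compiled without value-changing floating-point options.

```c
#include <assert.h>
#include <math.h>

/* Strict IEEE evaluation: inf * 0 and NaN * 0 both yield NaN. */
static double times_zero_strict(double x) { return x * 0.0; }

/* What the "X*0 may be replaced by 0" rewrite computes instead. */
static double times_zero_folded(double x) { (void)x; return 0.0; }

/* Strict IEEE evaluation of X/X: NaN when X is an infinity or NaN. */
static double self_divide_strict(double x) { return x / x; }

/* The same three terms summed in two different association orders. */
static double sum_left(double a, double b, double c)  { return (a + b) + c; }
static double sum_right(double a, double b, double c) { return a + (b + c); }

/* Returns 1 if the two association orders agree for these inputs. */
static int association_agrees(double a, double b, double c)
{
    assert(!isnan(a) && !isnan(b) && !isnan(c));
    return sum_left(a, b, c) == sum_right(a, b, c);
}
```

With a = 1e30, b = -1e30, and c = 1, the left-associated sum is 1 while the right-associated sum is 0, because adding 1 to -1e30 is lost to rounding. This is the kind of difference that re-association at the higher roundoff levels is allowed to introduce.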
134. FF Build scalar reductions before any loop transformation analysis. Using this flag may enable further loop transformations involving reduction loops. The default is OFF. This flag is redundant when -OPT:roundoff=2 or greater is in effect.

-LNO:blocking=(ON|OFF)
Enable or disable the cache blocking transformation. The default is ON.

-LNO:blocking_size=N
This option specifies a block size that the compiler must use when performing any blocking. N must be a positive integer number that represents the number of iterations.

-LNO:fission=(0|1|2)
This option controls loop fission. The option can be one of the following:
0  Disable loop fission (default)
1  Perform normal fission as necessary
2  Specify that fission be tried before fusion
Because -LNO:fusion is on by default, turning on fission without turning off fusion may result in their effects being nullified. Ordinarily, fusion is applied before fission. Specifying -LNO:fission=2 will turn on fission and cause it to be applied before fusion.

-LNO:full_unroll,fu=N
Fully unroll loops with trip_count <= N inside LNO. N can be any integer between 0 and 100. The default value for N is 5. Setting this flag to 0 disables full unrolling of small trip count loops inside LNO.

-LNO:full_unroll_size=N
Fully unroll loops with unrolled loop size <= N inside LNO. N can be any integer between 0 and 10000. The conditions implied by the full_unroll option must also be satisfied for the loop to be f
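Fusion and fission are inverse transformations, which is why enabling both can cancel out. The C sketch below writes the two forms by hand for a pair of independent array updates. The function names are hypothetical and the code only illustrates the shape of the loops; it is not output from the LNO phase.

```c
#include <assert.h>
#include <stddef.h>

/* Fused form: a single loop produces both output arrays, so each
   element of `in` is loaded once. */
static void update_fused(const double *in, double *scaled,
                         double *shifted, size_t n)
{
    assert(in != NULL && scaled != NULL && shifted != NULL);
    for (size_t i = 0; i < n; i++) {
        scaled[i]  = 2.0 * in[i];
        shifted[i] = in[i] + 1.0;
    }
}

/* Fissioned form: the same work split into two simpler loops, each
   touching fewer arrays per iteration. */
static void update_fissioned(const double *in, double *scaled,
                             double *shifted, size_t n)
{
    assert(in != NULL && scaled != NULL && shifted != NULL);
    for (size_t i = 0; i < n; i++)
        scaled[i] = 2.0 * in[i];
    for (size_t i = 0; i < n; i++)
        shifted[i] = in[i] + 1.0;
}

/* Returns 1 if the first n elements of a and b are identical. */
static int arrays_equal(const double *a, const double *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}
```

Both forms compute identical results; which one runs faster depends on cache behavior and register pressure, which is what the fission and fusion settings let you experiment with.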
135. For C/C++ only. Use the same size for double as for float.

-fshort-enums
For C/C++ only. Use the smallest fitting integer to hold enums.

-fshort-wchar
For C/C++ only. Use short unsigned int for wchar_t instead of the default underlying type for the target.

-ftest-coverage
Create data files for the pathcov(1) code-coverage utility. The data file names begin with the name of your source file:

SOURCENAME.bb
A mapping from basic blocks to line numbers, which pathcov uses to associate basic block execution counts with line numbers.

SOURCENAME.bbg
A list of all arcs in the program flow graph. This allows pathcov to reconstruct the program flow graph, so that it can compute all basic block and arc execution counts from the information in the SOURCENAME.da file.

Use -ftest-coverage with -fprofile-arcs; the latter option adds instrumentation to the program, which then writes execution counts to another data file:

SOURCENAME.da
Runtime arc execution counts, used in conjunction with the arc information in the file SOURCENAME.bbg.

Coverage data will map better to the source files if -ftest-coverage is used without optimization. See the gcc man pages for more information.

-ftpp
Run the Fortran source preprocessor on input Fortran source files before compiling. By default, files suffixed with .F or .F90 are run through the C source preprocessor (cpp). Files that are suffixed with .f or .f90 are not ru
136. G Red hha 3 9
3.4.6 Intrinsic Module ISO_FORTRAN_ENV  3-10
3.4.7 IEEE Floating Point  3-11
3.4.7.1 Gradual Underflow  3-12
3.4.8 Allocatable Components and Dummy Arguments  3-12
3.4.9 Fortran 2003 C Interoperability  3-13
3.4.9.1 BIND attribute  3-14
3.4.9.2 Intrinsic Module ISO_C_BINDING  3-16
3.4.9.3 Pointer Compatibility  3-18
3.4.9.4 Passing Arguments by Value  3-18
3.4.9.5 Enumerations  3-19
3.4.9.6 Example: Using C malloc from Fortran  3-20
3.4.9.7 Issues Unique to C  3-20
3.4.9.8 Pitfalls  3-21
3.5 Extensions  3-21
3.5.1 Promotion of REAL and INTEGER Types  3-21
3.5.2 Cray Pointers  3-21
3.5.3 Directives  3-22
3.5.3.1 Prefetch Directives  3-22
3.5.3.2 Changing Optimization Using Directives  3-24
3.6 Compiler and Runtime Features  3-24
3.6.1 Preprocessing Source Files with -cpp  3-24
3.6.2 Preprocessing Source Files
137. ITIONAL SIZE 1 1 1 2 1 4 1 8 ISHL I 1 1 I2 1 4 I8 TRADITIONAL E SHIFT 1 1 1 2 1 4 1 8 ISIGN 1 4 A 11 1 2 1 4 I 8 ANSI G77 E P B 1 1 1 2 1 4 1 8 PGI TRADITIONAL ISNAN X R 4 R 8 TRADITIONAL E IS IOSTAT END L 4 I 1 1 I 2 1 4 I 8 ANSI TRADITIONAL IS IOSTAT _ 74 1 11 I2 1 4 I8 ANSI EOR TRADITIONAL ITIME Subroutine TARRAY 4 G77 PGI Array rank 1 JDATE C TRADITIONAL JIABS 1 4 A l 4 PGI E TRADITIONAL JIAND 1 4 I 1 4 PGI E J I 4 TRADITIONAL JIBCHNG 1 4 I 1 4 TRADITIONAL E POS 1 1 1 2 1 4 1 8 JIBCLR 1 4 I 1 4 PGI E POS 1 1 1 2 1 4 1 8 TRADITIONAL C 25 C Supported Fortran Intrinsics Table of Supported Intrinsics AA NN Table C 1 Fortran Intrinsics Supported in Version 3 2 Continued Intrinsic Name Result Arguments Families Remarks JIBITS 1 4 I F4 PGI E POS 1 1 1 2 1 4 1 8 TRADITIONAL LEN 1 1 1 2 1 4 1 8 JIBSET 1 4 I 1 4 PGI E POS 1 1 1 2 1 4 1 8 TRADITIONAL JIDIM 1 4 X 1 4 PGI E Y 1 4 TRADITIONAL JIDINT 1 4 A R 8 PGI E TRADITIONAL JIEOR 1 4 I 1 4 PGI E J 1 4 TRADITIONAL JIFIX E A R 4 RB PGI E TRADITIONAL JINT 1 4 A R 4 PGI E TRADITIONAL JIOR 1 4 I 1 4 PGI E J 1 4 TRADITIONAL JISHA 1 4 I 1 4 TRADITIONAL E SHIFT 1 1 1 2 1 4 I8 JISHC 1 4 I 1 4 TRADITIONAL E SHIFT 11 1 2 1 4 I 8 JISHFT 1 4 I 14 PGI E SHIFT 1 1 1 2 1 4 TRADITIONAL I8 JISHFTC 1 4 I 1 4 PGI E SHIFT 1 1 1 2 1 4
138. ITIONAL E POS 1 1 172 1 4 18 IIBCLR I 2 PGI E TRADITIONAL C Supported Fortran Intrinsics Table of Supported Intrinsics I Table C 1 Fortran Intrinsics Supported in Version 3 2 Continued Intrinsic Name Result Arguments Families Remarks IIBITS I 2 I 1 2 PGI E POS I 1 I 2 1 4 1 8 TRADITIONAL IIBSET I2 I 1 2 PGI E POS I 1 1 2 1 4 8 TRADITIONAL IDIM I 2 X I2 PGl E Y 2 TRADITIONAL IIDINT I 2 A R 8 PGI E TRADITIONAL IIEOR I 2 I 2 PGI E J I 2 TRADITIONAL IIFIX I 2 A R 4 R 8 PGl E TRADITIONAL INT 1 2 A R 4 PGI E TRADITIONAL IIOR I2 I 1 2 PGl E J I 2 TRADITIONAL ISHA 1 2 I 1 2 TRADITIONAL E SHIFT 1 1 1 2 1 4 I8 IISHC I 2 I 1 2 TRADITIONAL E SHIFT 171 1 2 1 4 I 8 IISHFT I 2 I 1 2 PGl E SHIFT 171 1 2 1 4 TRADITIONAL I8 IISHFTC I2 I 1 2 PGI E SHIFT I1 1 2 P4 TRADITIONAL I 8 SIZE 11 172 1 4 1 8 ISHL I 2 I 1 2 TRADITIONAL E SHIFT 171 1 2 I 4 I 8 IISIGN I 2 A I2 PGI E P B I2 TRADITIONAL ILEN Depends on arg I 11 TRADITIONAL E P ILEN Depends on arg 172 TRADITIONAL E P ILEN Depends on arg 1 4 TRADITIONAL E P ILEN Depends on arg 178 TRADITIONAL E P C 23 C Supported Fortran Intrinsics Table of Supported Intrinsics AA NY C 24 Table C 1 Fortran Intrinsics Supported in Version 3 2 Continued Intrinsic Name Result Arguments Families Remarks
139. KAPAL ED RM 7 43
7.10.2 Using -CLIST or -FLIST  7-44
7.10.3 Verbose Flags  7-44

Section 8 Using OpenMP and Autoparallelization
8.1 OpenMP  8-1
8.2 Autoparallelization  8-2
8.3 Getting Started With OpenMP  8-3
8.4 OpenMP Compiler Directives (Fortran)  8-3
8.5 OpenMP Compiler Directives (C/C++)  8-6
8.6 OpenMP Runtime Library Calls (Fortran)  8-7
8.7 OpenMP Runtime Library Calls (C/C++)  8-9
8.8 Runtime Libraries  8-10
8.9 Environment Variables  8-11
8.9.1 Standard OpenMP Environment Variables  8-12
8.9.2 PathScale OpenMP Environment Variables  8-12
8.10 OpenMP Stack Size  8-21
8.10.1 Stack Size for Fortran  8-21
8.10.2 Stack Size for C/C++  8-22
8.11 Stack Size Algorithm  8-22
8.12 Example OpenMP Code in Fortran  8-24
8.13 Example OpenMP Code in C/C++  8-25
8.14 Tuning for OpenMP Application
140. Load subsumption is the combining of an arithmetic instruction and a memory load into one instruction. Default is OFF.

-WOPT:retype_expr=(ON|OFF)
Enables the optimization in the compiler that converts 64-bit address computation to use 32-bit arithmetic as much as possible. Default is OFF.

-WOPT:unroll=(0|1|2)
Control the unrolling of innermost loops in the scalar optimizer. Setting to 0 suppresses this unroller. The default is 1, which makes the scalar optimizer unroll only loops that contain IF statements. Setting to 2 makes the unrolling also apply to loop bodies that are straight-line code, which duplicates the unrolling done in the code generator and is thus unnecessary. The default setting of 1 makes this unrolling complementary to what is done in the code generator. This unrolling is not affected by the unrolling options under the -OPT group.

-WOPT:val=(0|1|2)
Control the number of times the value-numbering optimization is performed in the global optimizer, with the default being 1. This optimization tries to recognize expressions that will compute identical runtime values and changes the program to avoid re-computing them.

-W[no-]overloaded-virtual
For C++ only. The -Woverloaded-virtual option will warn when a function declaration hides virtual functions. -Wno-overloaded-virtual tells the compiler not to warn when a function declaration hides virtual functions.

-W[no-]packed
For C/C++ only. -Wpacked warns when packed attribut
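The effect of value numbering described above can be pictured with a hand-written before-and-after pair in C. The names are hypothetical, and the second function is only a manual rendering of the idea (recognize that two expressions compute the same value and compute it once), not actual compiler output.

```c
#include <assert.h>

/* As written by the programmer: (a + b) is computed twice. */
static long poly_naive(long a, long b, long c)
{
    return (a + b) * c + (a + b);
}

/* After the rewrite: both occurrences are recognized as the same
   value, which is computed once and reused. */
static long poly_reused(long a, long b, long c)
{
    long sum = a + b;
    return sum * c + sum;
}

/* Sanity check that the rewrite preserves the result. */
static void check_rewrite(long a, long b, long c)
{
    assert(poly_naive(a, b, c) == poly_reused(a, b, c));
}
```

Running the optimization more than once can expose further redundancies that only become visible after other transformations have simplified the code.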
141. M Fortran Version 2.1.99 f14 Tue Nov 21, 2006 14:22:16
pathf95: 9 source lines
pathf95: 2 Error(s), 0 Warning(s), 0 Other message(s), 0 ANSI(s)
pathf95: "explain pathf95-message number" gives more information about each message

Note that the real error is pointed out after the first error on line 1 is reported.

3.4 Fortran 2003 Support

This section discusses a number of the Fortran 2003 features that have been implemented in the PathScale Fortran Compiler.

3.4.1 Syntax Improvements

Names may have as many as 63 characters. Statements may have as many as 256 lines.

An array constructor may use [ and ] instead of (/ and /); for example, (/ 1, 2, 3 /) and [1, 2, 3] are synonymous.

A complex constant may use a named constant as its real or imaginary part. For example:

    real, parameter :: upper_limit = 1.2e10
    complex :: rlimit = (upper_limit, 0.0)
    complex :: ilimit = (0.0, upper_limit)

In an I/O format, the comma after a P edit descriptor is optional; for example, 1P2E12.4 and 1P,2E12.4 are synonymous.

3.4.2 Intrinsic Procedures

See also the intrinsic modules for IEEE Floating Point and for C interoperability.

COMMAND_ARGUMENT_COUNT
    integer function command_argument_count()
Retrieve the number of command-line arguments (not counting the command name itself).

GET_COMMAND
    subroutine get_command(command, length, s
142. ONG 32

A quick way to list all the predefined cpp macros would be to compile your program with the flags -dD -keep. You can find all the defines (or predefined macros) in the resulting .i file. Here is an example for C:

    $ cat hello.c
    main() { printf("Hello World\n"); }
    $ pathcc -dD -keep hello.c
    $ wc hello.i
    94 278 2606 hello.i
    $ cat hello.i

The hello.i file will contain the list of pre-defined macros.

NOTE: Generating an .i file doesn't work well with Fortran, because if the preprocessor sends the #define's to the .i file, Fortran can't parse them. See section 3.6.4.1 for information on finding pre-defined macros in Fortran.

4.2.2 Pragmas

4.2.2.1 Pragma pack

In this release, we have tested and verified that the pragma pack is supported. The syntax for this pragma is:

    #pragma pack(n)

This pragma specifies that the next structure should have each of its fields aligned to an alignment of n bytes if its natural alignment is not smaller than n.

4.2.2.2 Changing Optimization Using Pragmas

Optimization flags can now be changed via directives in the user program. In C and C++, the directive is of the form:

    #pragma options <list of options>

Any number of these can be specified inside function scopes. Each affects only the optimization of the entire function in which it is specified. The literal string can also contain an unlimi
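The effect of the pack pragma on structure layout can be observed with `offsetof`. The C sketch below compares a naturally aligned structure with one declared under `#pragma pack(1)`. It assumes a GCC-compatible compiler in which `#pragma pack()` with no argument restores the default packing; the structure and function names are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Natural layout: `value` is aligned to its own alignment, so padding
   is normally inserted after `tag`. */
struct natural_record {
    char tag;
    int  value;
};

/* Packed layout: fields are aligned to at most 1 byte. */
#pragma pack(1)
struct packed_record {
    char tag;
    int  value;
};
#pragma pack()   /* assumed to restore the default packing */

static size_t natural_value_offset(void)
{
    return offsetof(struct natural_record, value);
}

static size_t packed_value_offset(void)
{
    return offsetof(struct packed_record, value);
}

/* Packing can only move a field closer to the start of the struct. */
static void check_layout(void)
{
    assert(packed_value_offset() <= natural_value_offset());
}
```

Packed structures save space and match externally defined layouts, at the possible cost of slower, unaligned field accesses.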
143. PA on SPEC CPU 2000 Performance  7-10
7-2 Effects of IPA tuning on some SPEC CPU2000 benchmarks  7-12
7-3 Numerical Accuracy with Options  7-23
7-4 pathopt2 Options  7-30
7-5 Tags for Option Configuration File  7-34
8-1 Fortran Compiler Directives  8-4
8-2 C/C++ Compiler Directives  8-6
8-3 Fortran OpenMP Runtime Library Routines  8-8
8-4 C/C++ OpenMP Runtime Library Routines  8-9
8-5 Standard OpenMP Environment Variables  8-12
C-1 Fortran Intrinsics Supported in 3.2  C-3
E-1 Summary of Compiler Options by Function

Section 1 Introduction

This User Guide covers how to use the PathScale Compiler Suite compilers: how to configure them, how to use them to optimize your code, and how to get the best performance from them. This guide also covers the language extensions and differences from the other commonly available language compilers.

The PathScale Compiler Suite will be referred to as the PathScale Compiler Suite or the PathScale compiler in the rest of this document. The PathScale Compile
144. PASS PASS 10.45
-CG:load_exe=0 -LNO:interchange=off -OPT:unroll_times_max=16 -O3 -OPT:unroll_times_max=8 PASS PASS 10.47
-O3 -OPT:unroll_times_max=8 PASS PASS 10.47

7.9.8.4 Example 3: Using a Single Script with the rate-file

With some applications or benchmarks, it is more convenient to combine building and testing into one script. In this case, you must use the -S timing-file or rate-file feature, so that you don't use the combined compile and run time as your sorting criterion to find the best solutions. Sometimes the options that produce the fastest executable take more compile time.

One advantage of using a single script is that it is easier to parameterize and requires less editing. For example, you can pass in another benchmark executable name from the command line rather than having to edit the name in the psc_test script.

We will use -S rate-file this time rather than timing-file. The use of rate-file means that we need to use grep/sed commands in the script below that differ from those in psc_test2 above. You can copy the file compile-go-rate from /opt/pathscale/share/pathopt2/examples into your working directory. It is shown here:

bin sh cd make clean code 1 size 2 shift 2 make code CLASS Ssize FFLAGS cd pathopt2 bin code size gt logs code size txt grep Mop logs code size txt secs log sed e s Mop s total Ff
145. PGI O I4 18 TRADITIONAL HANDLER Procedure IGNDFL I 4 SIGNAL I 8 NUMBER 1 1 2 G77 PGI 1 4 1 8 TRADITIONAL HANDLER I 4 SIGNAL I 8 NUMBER I 1 1 2 G77 PGI 1 4 1 8 TRADITIONAL HANDLER I 8 C 37 C Supported Fortran Intrinsics Table of Supported Intrinsics AA NY Table C 1 Fortran Intrinsics Supported in Version 3 2 Continued Intrinsic Name Result Arguments Families Remarks SIGNAL Subroutine G77 PGI TRADITIONAL SIN R 4 X R 4 R 8 Z 8 ANSI G77 E P Z216 PGI TRADITIONAL SIND R4 X R4 R 8 PGI E P TRADITIONAL SINH R4 X R4 R 8 ANSI G77 E P PGI TRADITIONAL SIZE ANSI PGI See Std TRADITIONAL SIZEOF 1 8 X Any type Array TRADITIONAL rank any SLEEP Subroutine SECONDS 4 G77 PGI SNGL R 4 A R 8 ANSI G77 E PGI TRADITIONAL SPACING X R 4 R 8 ANSI PGI E TRADITIONAL SPREAD ANSI PGI See Std TRADITIONAL SQRT R 4 X R 4 R 8 Z 8 ANSI G77 E P Z216 PGI TRADITIONAL SRAND Subroutine SEED I 4 G77 PGI STAT 1 4 FILE C G77 PGI O SARRAY I 4 Array TRADITIONAL rank 1 STATUS I 4 STAT Subroutine FILE C G77 O SARRAY 14 Array TRADITIONAL rank 1 STATUS I 4 SUB AND I 1 4 TRADITIONAL E FETCH J 1 4 SUB_AND_ I 1 8 TRADITIONAL E FETCH J I 8 C 38 C Supported Fortran Intrinsics Table of Supported Intrinsics ls Table C 1 Fortran Intrinsics Supported in Version 3 2 Continued
146. PathScale Compiler Suite User Guide Version 3.2

Information furnished in this manual is believed to be accurate and reliable. However, PathScale LLC assumes no responsibility for its use, nor for any infringements of patents or other rights of third parties which may result from its use. PathScale LLC reserves the right to change product specifications at any time without notice. Applications described in this document for any of these products are for illustrative purposes only. PathScale LLC makes no representation nor warranty that such applications are suitable for the specified use without further testing or modification. PathScale LLC assumes no responsibility for any errors that may appear in this document.

No part of this document may be copied nor reproduced by any means, nor translated nor transmitted to any magnetic medium without the express written consent of PathScale LLC. In accordance with the terms of their valid PathScale agreements, customers are permitted to make electronic and paper copies of this document for their own exclusive use.

Linux is a registered trademark of Linus Torvalds. PathScale, the PathScale logo, and EKOPath are registered trademarks of PathScale LLC. Red Hat and all Red Hat-based trademarks are trademarks or registered trademarks of Red Hat, Inc. SuSE is a registered tra
147. Practical Programming, by Bruce Eckel, Prentice Hall, Second Edition, 2003, ISBN 0-130-35313-2
C++ Inside & Out, by Bruce Eckel, Osborne/McGraw-Hill, 1993, ISBN 0-07-881809-5
C++ How to Program, by H. M. Deitel and P. J. Deitel, Prentice Hall, 2005, 5th edition, ISBN 0-131-85757-6

Other Topics
Effective STL: 50 Specific Ways to Improve Your Use of the Standard Template Library, by Scott Meyers, Addison-Wesley Professional, 2001, ISBN 0-201-74962-9

Section 2 Compiler Quick Reference

This section describes how to get started using the PathScale Compiler Suite. The compilers follow the standard conventions of Unix and Linux compilers, produce code that follows the Linux x86_64 ABI, and run on both the AMD64 and Intel EM64T families of chips. AMD64 is the AMD 64-bit extension to the Intel IA32 architecture, often referred to as "x86". EM64T is the Intel Extended Memory 64 Technology chip family. This means that object files produced by the PathScale compilers can link with object files produced by other Linux x86_64-compliant compilers, such as Red Hat and SUSE GNU gcc, g++, and g77.

2.1 What You Installed

For details on installing the PathScale compilers, see the PathScale Compiler Suite Install Guide. The PathScale Compiler Suite includes optimizing compilers and runtime support for C, C++, and Fortran. Depending on the type of subscription you purchased, you
148. R 4 R 8 TRADITIONAL NEQV 1 11 I2 I4 I 8 PGI E R 4 R 8 TRADITIONAL CrayPtr L 1 L 2 L 4 L 8 J 11 12 N4 I 8 R 4 R 8 CrayPtr L 1 L 2 L 4 L 8 NINT 1 4 A R 4 R 8 ANSI G77 E P KIND 1 1 1 2 1 4 PGI O 48 TRADITIONAL NOT I 1 1 I2 1 4 I8 ANSI G77 E PGI TRADITIONAL NULL MOLD Any type ANSI PGI Array rank any TRADITIONAL NUM IMAGES 4 TRADITIONAL OMP_ Subroutine LOCK 1 4 1 8 OMP DESTROY LOCK OMP_ Subroutine LOCK 1 4 1 8 OMP DESTROY NEST LOCK OMP GET Depends on arg OMP DYNAMIC C Supported Fortran Intrinsics Table of Supported Intrinsics I Table C 1 Fortran Intrinsics Supported in Version 3 2 Continued Intrinsic Name Result Arguments Families Remarks OMP GET MAX Depends on arg OMP THREADS OMP GET Depends on arg OMP NESTED OMP GET Depends on arg OMP NUM PROCS OMP GET Depends on arg OMP NUM THREADS OMP GET Depends on arg OMP THREAD NUM OMP GET R 8 OMP WTICK OMP GET R 8 OMP WTIME OMP INIT Subroutine LOCK 1 4 I 8 OMP LOCK OMP INIT Subroutine LOCK 1 4 I 8 OMP NEST LOCK OMP IN Depends on arg OMP PARALLEL OMP SET Subroutine DYNAMIC OMP DYNAMIC THREADS L 4 L 8 OMP_SET_ Subroutine LOCK 1 4 I 8 OMP LOCK OMP_SET_ Subroutine NESTED L 4 L 8 OMP NESTED OMP_SET_ Subroutine LOCK 1 4 I 8 OMP NEST LOCK OMP SET NUM Subroutine NUM THREADS OMP THREADS I4 1 8 OMP TEST Depends on arg LOCK 1 4 I 8 OMP LOCK OMP TE
149. RADITIONAL ETIME R 4 TARRAY R 4 G77 PGI Array rank 1 TRADITIONAL ETIME Subroutine TARRAY R 4 G77 Array rank 1 TRADITIONAL RESULT R 4 EXIT Subroutine STATUS I 1 1 2 G77 PGI O 1 4 I 8 TRADITIONAL EXP R4 X R4 R8 Z 8 ANSI G77 E P Z216 PGI TRADITIONAL EXPONENT X R 4 R 8 ANSI PGI E TRADITIONAL FCD 1 11 I2 I4 I 8 TRADITIONAL E CrayPtr J I1 12 1 4 18 FDATE C G77 PGI TRADITIONAL FDATE Subroutine DATE C G77 PGI FETCH AND I 1 4 TRADITIONAL E ADD J 1 4 FETCH_AND_ I 1 8 TRADITIONAL E ADD J I8 FETCH_AND_ I 1 4 TRADITIONAL E AND J 1 4 FETCH_AND_ I 1 8 TRADITIONAL E AND J I 8 C Supported Fortran Intrinsics Table of Supported Intrinsics I Table C 1 Fortran Intrinsics Supported in Version 3 2 Continued Intrinsic Name Result Arguments Families Remarks FETCH AND I 1 4 TRADITIONAL E NAND J 1 4 FETCH AND I 18 TRADITIONAL E NAND J I8 FETCH AND I 1 4 TRADITIONAL E OR J 1 4 FETCH AND I 18 TRADITIONAL E OR J 18 FETCH AND I 1 4 TRADITIONAL E SUB J 174 FETCH AND I 18 TRADITIONAL E SUB J I8 FETCH AND I 1 4 TRADITIONAL E XOR J 1 4 FETCH AND I 18 TRADITIONAL E XOR J I8 FGET 1 4 C C G77 O STATUS I 4 FGET Subroutine C C G77 O STATUS I 4 FGETC 1 4 UNIT 1 4 1 8 G77 PGI O C C STATUS I 4 FGETC Subroutine UNIT 1 4 1 8 G77 O C C STATUS I 4 FLOAT R 4 A 1 1 1 2 1 4 I8 ANSI G77 E PGI TRADITIONAL FLOATI
150. 10.10 Tuning
10.11 Troubleshooting OpenMP
   10.11.1 Compiling and Linking with -mp
Appendix A Environment Variables
   A.1 Environment Variables for Use with C
   A.2 Environment Variables for Use with C++
   A.3 Environment Variables for Use with Fortran
   A.4 Language-independent Environment Variables
   A.5 Environment Variables for OpenMP
      A.5.1 Standard OpenMP Runtime Environment Variables
      A.5.2 PathScale OpenMP Environment Variables
Appendix B Implementation Dependent Behavior for OpenMP Fortran
Appendix C Supported Fortran Intrinsics
   C.1 How to Use the Intrinsics Table
   C.2 Intrinsic Options
   C.3 Table of Supported Intrinsics
   C.4 Fortran Intrinsic Extensions
Appendix D Fortran 90 Dope Vector
Appendix E Summary of Compiler Options
Appendix F eko man Page
Appendix G Glossary

Figures
   7-1 IPA Compilation Model

Tables
   4-1 Predefined Macros
   7-1 Effects of I
151. try adding -OPT:reorg_common=OFF to the flags. Alternatively, using the -mcmodel=medium option will allow this optimization.

10.6 More Inputs Than Registers

The compiler will complain if an asm has more inputs than there are available CPU registers. For -m32 (32-bit), the maximum number of asm inputs is seven (7). For -m64 (64-bit), the maximum number is fifteen (15).

10.7 Linking With libg2c

When using Fortran with a Red Hat or Fedora Core system, you cannot link libg2c automatically. In order to link successfully against libg2c on a Red Hat or Fedora Core system, you should first install the appropriate libf2c library, then add a symlink in /usr/lib64 or /usr/lib from libg2c.so.0 to libg2c.so. This problem is due to a packaging issue with Red Hat's version of this library. You will only need to take this step if you are linking against either the AMD Core Math Library (ACML) or Fortran object code that was compiled using the g77 compiler.

10.8 Linking Large Object Files

The PathScale Compiler Suite does not support the linking or assembly of large object files on the x86 platform. Earlier versions of the compiler (before 2.1) contained a bug that would truncate static data structures whose size exceeded four gigabytes. This sometimes caused a compilation error or generation of binaries that would crash or corrupt data at runtime. This bug has been fixed in the 2.1 release.

10.9 Using -ipa and -Ofast
152. ST Depends on arg LOCK 1 4 I 8 OMP NEST LOCK OMP UNSET Subroutine LOCK 1 4 I 8 OMP LOCK OMP UNSET Subroutine LOCK 1 4 I 8 OMP NEST LOCK C 33 C Supported Fortran Intrinsics Table of Supported Intrinsics AA NN Table C 1 Fortran Intrinsics Supported in Version 3 2 Continued Intrinsic Name Result Arguments Families Remarks OR 1 11 I2 N4 1 8 G77 PGI E R 4 R 8 TRADITIONAL CrayPtr L 1 L 2 L 4 L 8 J 11 12 I4 1 8 R 4 R 8 CrayPtr L 1 L 2 L 4 L 8 OR_AND_ I 1 4 TRADITIONAL E FETCH J 1 4 OR AND I 1 8 TRADITIONAL E FETCH J I 8 PACK ANSI PGI See Std TRADITIONAL PERROR Subroutine G77 PGI STRING C POPCNT 1 11 I2 I4 1 8 TRADITIONAL E R 4 R 8 CrayPtr L 1 L 2 L 4 L 8 POPPAR 1 11 I 2 I4 1 8 TRADITIONAL E R 4 R 8 CrayPtr L 1 L 2 L 4 L 8 PRECISION X R 4 R 8 Z 8 ANSI PGI E Z216 TRADITIONAL PRESENT A Procedure any ANSI PGI E type TRADITIONAL PRESENT A Any type ANSI PGI E TRADITIONAL PRODUCT ANSI PGI See Sid TRADITIONAL RADIX X 1 1 1 2 1 4 F8 ANSI PGI E R 4 R 8 TRADITIONAL RAND R 8 FLAG 74 G77 PGI O RANDOM Subroutine HARVEST R 4 R 8 ANSI PGI E NUMBER TRADITIONAL Q C 34 C Supported Fortran Intrinsics Table of Supported Intrinsics ls Table C 1 Fortran Intrinsics Supported in Version 3 2 Continued Intrinsic Name Result Arguments Families Remarks RANDOM Subrouti
153. -OPT:fast_trunc implies inlining of the NINT, ANINT, AINT, and AMOD Fortran intrinsics.

-OPT:roundoff=2 turns on the following sub-options:
• -OPT:fold_reassociate, which allows optimizations involving re-association of floating-point quantities.

-OPT:roundoff=3 turns on the following sub-options:
• -OPT:fast_complex. When this is set ON, complex absolute value (norm) and complex division use fast algorithms that overflow for an operand (the divisor, in the case of division) that has an absolute value that is larger than the square root of the largest representable floating-point number.
• -OPT:fast_nint uses a hardware feature to implement single and double-precision versions of NINT and ANINT.

7.7.5 Other Unsafe Optimizations

A few advanced optimizations, intended to exploit some exotic instructions such as CMOVE (conditional move), result in slightly changed program behavior, such as in programs which write into variables guarded by an if statement. For example:

      if (a .eq. 1) then
         a = 3
      endif

In this example, the fastest code on an x86 CPU is code which avoids a branch by always writing a; if the condition is false, it writes a's existing value into a, else it writes 3 into a. If a is a read-only value not equal to 1, this optimization will cause a segmentation fault in an odd but perfectly valid program.

7.7.6 Assumptions About Numerical Accuracy

See the following table for the
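The conditional-move case in section 7.7.5 can be sketched in C. This is only an illustration of the two code shapes, not the compiler's actual output; the function names guarded_store and unconditional_store are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Branching form: memory is written only when the guard holds. */
void guarded_store(int *a)
{
    assert(a != NULL);
    if (*a == 1) {
        *a = 3;
    }
}

/* Branch-free form (an assumed shape of the optimized code): the store
   always happens. When the guard is false the old value is written back,
   which is harmless for ordinary variables but faults if *a lives in
   read-only memory. */
void unconditional_store(int *a)
{
    int old;
    assert(a != NULL);
    old = *a;
    *a = (old == 1) ? 3 : old;   /* a candidate for a conditional move */
}
```

On writable memory the two functions leave the same value in *a. They differ only in whether a store is issued when the guard is false, and that store is what makes the optimization unsafe for read-only data.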
154. The value of this option is the compiler's estimate of the overhead, in processor cycles, of invoking the parallel version of a loop. This value affects the runtime decision on whether to execute the serial or parallel versions. Because the optimal value varies across systems and programs, this option can be used for parallel performance tuning under -apo. For more information on this option, see the eko man page.

8.3 Getting Started With OpenMP

To use OpenMP, you need to add directives where appropriate, and then compile and link your code using the -mp flag. This flag tells the compiler to honor the OpenMP directives in the program and process the source code guarded by the OpenMP conditional compilation sentinels (e.g. !$ for Fortran and #pragma for C/C++). The actual program execution is also affected by the way the OpenMP Environment Variables (see section 8.9) are set. The compiler will generate different output that causes the program to be run in multiple threads during execution. The output code is linked with the PathScale OpenMP Runtime Library for execution under multiple threads. See the Fortran code in section 8.12 and the C/C++ code in section 8.13 for examples. Because the OpenMP directives tell the compiler what constructs in the program can be parallelized, and how to parallelize them, it is possible to make mistakes in the inserted Ope
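As a minimal sketch of the workflow in section 8.3 (add a directive, then compile and link with -mp), the following C function marks a loop for parallel execution. The function name sum_array is hypothetical and does not come from this guide. The pragma is the standard OpenMP parallel for construct with a reduction clause.

```c
#include <assert.h>
#include <stddef.h>

/* Sum an array. When the compiler is asked to honor OpenMP directives,
   the loop iterations are divided among threads and the per-thread
   partial sums are combined by the reduction clause. Otherwise the
   pragma is ignored and the loop runs serially. */
double sum_array(const double *x, size_t n)
{
    double total = 0.0;
    long i;
    assert(x != NULL || n == 0);
#pragma omp parallel for reduction(+:total)
    for (i = 0; i < (long)n; i++) {
        total += x[i];
    }
    return total;
}
```

A build along the lines of pathcc -mp -c sum.c followed by a link step that also passes -mp would enable the directive. The file name sum.c is only an example.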
155. This section documents procedures that are extensions to the standard, referring to argument names shown in the table of intrinsics in table C-1.

abort
   Prints a message and then, like the C library function abort, stops the program.

access
   Like the C library function access, returns zero if the file named by name satisfies the requirements indicated by mode, but otherwise returns the error code from the C library value errno. Trailing blanks in name are ignored; you can prevent this by using char(0) to place a null character after the last significant character. mode may contain any of the following:
      r        Readable
      w        Writable
      x        Executable
      (blank)  File exists

alarm
   Uses the C library functions alarm and signal to wait the time indicated by seconds and then execute the external subroutine handler. status returns the number of seconds remaining until the previously scheduled alarm would have taken place, or 0 if no alarm was pending.

and
   Bitwise boolean AND.

besj0, besj1, besjn, besy0, besy1, besyn
   Fortran interfaces to C library functions j0, j1, jn, y0, y1, and yn (Bessel functions).

cdabs, cdcos, cdexp, cdlog, cdsin, cdsqrt
   Specific names for various mathematical functions having an argument of type complex*16.

chdir
   Like the C library function chdir, sets the current working directory to dir. The function form returns 0 on success, but otherw
156. UE
      A = 2.0
      B = 0.0
      C = A / B
      PRINT *, C
      END PROGRAM MAIN

To run the program:

   $ pathf95 ieee.f95 -o example
   $ ./example
   Floating point exception signaled at 400db2: floating point divide by zero
   Aborted

This Fortran standard feature is documented here: http://www.nag.co.uk/sc22wg5/TR15580.html. It can also be downloaded from their ftp site: ftp://ftp.nag.co.uk/sc22wg5/N1351-N1400/. Search for the document N1378.pdf.

Additionally, the -trapuv option will trap on a NaN as a side effect, but there is no control over the individual classes of trap (NaN, overflow, underflow, or zerodivide). See section 10.3 for more information on using -trapuv.

10.5 Large Object Support

Statically allocated data (.bss objects) such as Fortran COMMON blocks and C variables with file scope are currently limited to 2GB in size. If the total size exceeds that, the compilation (without the -mcmodel=medium option) will likely fail with the message:

   relocation truncated to fit: R_X86_64_PC32

For Fortran programs with only one COMMON block, or with no COMMON blocks after the one that exceeds the 2GB limit, the program may compile and run correctly. At higher optimization levels (-O3, -Ofast), -OPT:reorg_common is set to ON by default. This might split a COMMON block such that a block begins beyond the 2GB boundary. If a program builds correctly at -O2 or below but fails at -O3 or -Ofast,
157. Variable

The pathopt2 tool arranges that the specified options are passed through as arguments to the build command, using the expansion of the character on the pathopt2 command line. Usually these options will then be explicitly passed to the compiler, either directly or via a Makefile variable such as CFLAGS or FFLAGS. Alternatively, the PathScale compilers will also process options from the PSC_GENFLAGS environment variable. This provides a way to implicitly pass the pathopt2-selected options to the compiler through existing scripts and Makefiles without their modification. Note that pathopt2 itself does not set the value of PSC_GENFLAGS, but it can be easily achieved using a shell script as the build command and using the syntax export PSC_GENFLAGS

7.9.7 Using Build and Test Scripts

The first example was run without build or test scripts. However, scripts provide added flexibility to pathopt2. Here are three common reasons for using a build script:
• You might need to cd to another directory before issuing the make command.
• There may be several directories you need to go to in order to complete the build.
• There may be no make clean target, so you need a rm *.o command before the make command.

There are several reasons for using a test script:
• pathopt2 can't handle a complicated program run command with whitespace in it.
• You may need to cd to another directory
158. When compiling with -ipa, the .o files that are created are not regular .o files. IPA uses the .o files in its analysis of your program, and then does a second compilation using that information.

NOTE: When you are using -ipa, all the .o files have to have been compiled with -ipa for your compilation to be successful. Each archive, for example libfoo.a, must contain either .o files compiled with -ipa or .o files compiled without -ipa, but not both.

The requirement of -ipa may mean modifying Makefiles. If your Makefiles build libraries, and you wish this code to be built with -ipa, you will need to split these libraries into separate .o files before linking.

By default, -ipa is turned on when you use -Ofast, so the caveats above apply to using -Ofast as well.

10.10 Tuning

Our compilers often optimize loops by eliminating the loop variable and instead using a quantity related to the loop variable, called an "induction variable". If the induction variable overflows, the loop test will be incorrectly evaluated. This is a very rare circumstance. To see if this is causing your code to fail under optimization, try:

   -OPT:wrap_around_unsafe_opt=OFF

10.11 Troubleshooting OpenMP

10.11.1 Compiling and Linking with -mp

You must use the -mp flag when you compile code that contains OpenMP directives. If you do not use the -mp flag, the compiler will ignore the OpenMP directives and compile yo
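The induction variable problem described in section 10.10 can be illustrated with a small C sketch. The 8-bit type is chosen only so that the wrap-around is easy to see. The function names are hypothetical, and the second function is an assumed shape for the transformed loop rather than actual compiler output.

```c
#include <assert.h>
#include <stdint.h>

/* Original loop: the test is made on the loop variable i. */
unsigned count_with_loop_variable(unsigned n)
{
    unsigned trips = 0;
    unsigned i;
    for (i = 0; i < n; i++) {
        trips++;
    }
    return trips;
}

/* Transformed loop: i is replaced by the induction variable j = i * step,
   and the test i < n becomes j < n * step. Both j and the limit are held
   in 8 bits here, so they wrap at 256. */
unsigned count_with_induction_variable(unsigned n, uint8_t step)
{
    uint8_t limit = (uint8_t)(n * step);   /* may wrap */
    unsigned trips = 0;
    uint8_t j;
    assert(step > 0);
    for (j = 0; j < limit; j = (uint8_t)(j + step)) {
        trips++;
    }
    return trips;
}
```

With n = 3 and step = 64 nothing wraps, so both loops run three times. With n = 5 and step = 64 the limit 320 wraps to 64, and the transformed loop stops after one trip instead of five.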
159. a process or thread and as much as possible of the memory that it uses must be allocated to the same single CPU. The Linux kernel has historically had no support for setting the affinity of a process in this way. Running a non-NUMA kernel on a NUMA system can result in changes in performance while a program is running, and non-reproducibility of performance across runs. This occurs because the kernel will schedule a process to run on whatever CPU is free, without regard to where the process's memory is allocated.

Recent kernels have some degree of NUMA support. They will attempt to allocate memory local to the CPU where a process is running, but they still may not prevent that process from later being run on a different CPU after it has allocated memory. Current NUMA-aware kernels do not migrate memory across NUMA nodes, so if a process moves relative to its memory, its performance will suffer in unpredictable ways.

Note that not all vendors ship NUMA-aware kernels or C libraries that can interface to them. If you are unsure of whether your kernel supports NUMA, check with your distribution vendor.

7.8.5 Tools and APIs

Recent Linux distributions include tools and APIs that allow you to bind a thread or process to run on a specific CPU. This provides an effective workaround for the problem of the kernel moving a process away from its memory. Your Linux distribution may
160. able lists the Fortran intrinsics supported by the PathScale Compiler Suite, along with the result, arguments, families, and characteristics for each. See the Legend for more information.

Legend

Key to Types
   I               Integer
   R               Real
   Z               Complex
   C               Character
   L               Logical
   Depends on arg  Result type varies depending on the argument type
   Subroutine      Intrinsic is a subroutine, not a function

Key to Remarks
   E   Elemental intrinsic
   P   May pass intrinsic itself as an actual argument
   X   Extension to the Fortran standard
   O   Optional argument
   1   Must use -intrinsic=<name> to enable this

Table C-1. Fortran Intrinsics Supported in Version 3.2

Intrinsic Name | Result | Arguments | Families | Remarks
ABORT Subroutine G77 PGI
ABS R 4 A 1 1 1 2 1 4 1 8 ANSI G77 R 4 R 8 Z 8 Z 16 PGI TRADITIONAL
ACCESS 1 4 NAME C G77 PGI MODE C
ACHAR C I 1 1 12 1 4 I 8 ANSI G77 E PGI TRADITIONAL
ACOS R 4 X R 4 R 8 ANSI G77 E P PGI TRADITIONAL
ACOSD R 4 X R 4 R 8 PGI E TRADITIONAL
ADD AND I 1 4 TRADITIONAL E FETCH J 1 4
ADD AND 18 TRADITIONAL E FETCH J I 8
ADJUSTL STRING C ANSI PGI E TRADITIONAL
ADJUSTR STRING C ANSI PGI E TRADITIONAL
AIMAG R 4 Z Z 8 Z 16 ANSI G77 E P PGI TRADITIONAL
AINT R 4 A R 4 R 8 ANSI G77 E P KIND 1 1 1 2 1 4
161. ads, then it is simply repeated over and over again until there is a mapping for each thread. This repeat feature allows short lists to be used to specify repetitive thread mappings for all threads.

Here are some examples for assigning eight threads on an eight-CPU system:

   1. Assign all threads to the same CPU: PSC_OMP_AFFINITY_MAP=0
         CPU0: T0 T1 T2 T3 T4 T5 T6 T7
         CPU1 to CPU7: no threads

   2. Assign threads to the lower half of the machine: PSC_OMP_AFFINITY_MAP=0,1,2,3
         CPU0: T0 T4    CPU1: T1 T5    CPU2: T2 T6    CPU3: T3 T7
         CPU4 to CPU7: no threads

   3. Assign threads to the upper half of the machine: PSC_OMP_AFFINITY_MAP=4,5,6,7
         CPU0 to CPU3: no threads
         CPU4: T0 T4    CPU5: T1 T5    CPU6: T2 T6    CPU7: T3 T7

   4. Assign threads to a dual-core machine in the same way as PSC_OMP_CPU_STRIDE=2: PSC_OMP_AFFINITY_MAP=0,2,4,6,1,3,5,7
         CPU0: T0    CPU1: T4    CPU2: T1    CPU3: T5
         CPU4: T2    CPU5: T6    CPU6: T3    CPU7: T7

NOTE: When PSC_OMP_AFFINITY_MAP is defined, the values of PSC_OMP_CPU_STRIDE and PSC_OMP_CPU_OFFSET are ignored. However, the value of PSC_OMP_GLOBAL_AFFINITY still determines whether the thread's global or local ID is used in the mapping process.

PSC_OMP_CPU_STRIDE (Integer value)
   This specifies the striding factor used whe
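The repeat rule for PSC_OMP_AFFINITY_MAP described above amounts to indexing the list cyclically by thread ID. The following C sketch shows that rule under the assumption that the list has already been parsed into an array. The function name cpu_for_thread is hypothetical and is not part of the PathScale runtime.

```c
#include <assert.h>
#include <stddef.h>

/* Return the CPU number for thread `tid`, given an affinity list of
   `len` CPU numbers. A list shorter than the number of threads is
   reused from the beginning, as the repeat rule describes. */
int cpu_for_thread(const int *map, size_t len, size_t tid)
{
    assert(map != NULL && len > 0);
    return map[tid % len];
}
```

For the list 0,2,4,6,1,3,5,7 this places thread 4 on CPU 1, which matches example 4. For the four-entry list 0,1,2,3 it places thread 5 on CPU 1, which matches example 2.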
162. allow for the setting of multiple sub-options in two ways:
• Separating each sub-flag by colons, or
• Using multiple flags on the command line.

For example, the following command lines are equivalent:

   pathcc -OPT:roundoff=2:alias=restrict wh.c
   pathcc -OPT:roundoff=2 -OPT:alias=restrict wh.c

Some sub-options either enable or disable the feature. To enable a feature, either specify only the subflag name or add =1, =ON, or =TRUE. Disabling a feature is accomplished by adding =0, =OFF, or =FALSE. The following command lines mean the same thing:

   pathf95 -OPT:div_split:fast_complex=FALSE:IEEE_NaN_inf=OFF wh.F
   pathf95 -OPT:div_split=1:fast_complex=0:IEEE_NaN_inf=false wh.F

7.3 Inter-Procedural Analysis (IPA)

Software applications are normally written and organized into multiple source files that make up the program. The compilation process, usually defined by a Makefile, invokes the compiler to compile each source file, called a compilation unit, separately. This traditional build process is called separate compilation. After all compilation units have been compiled into .o files, the linker is invoked to produce the final executable.

The problem with separate compilation is that it does not provide the compiler with complete program information. The compiler has to make worst-case assumptions at places in the program that access external data or call external functions. In whole p
163. an error number such as 4198, or when the program prints out such an error number during execution, you can look up its meaning using the explain command by prefixing the number with lib-, as in explain lib-4198. For example:

   explain lib-4098
   A BACKSPACE is invalid on a piped file.
   A Fortran BACKSPACE statement was attempted on a named or unnamed pipe (FIFO file) that does not support backspace. Either remove the BACKSPACE statement or change the file so that it is not a pipe. See the man pages for pipe(2), read(2), and write(2).

3.6.6 Fortran 90 Dope Vector

Modern Fortran provides constructs that permit the program to obtain information about the characteristics of dynamically allocated objects, such as the size of arrays and character strings. Examples of the language constructs that return this information include the ubound and the size intrinsics. To implement these constructs, the compiler may maintain information about the object in a data structure called a dope vector. If there is a need to understand this data structure in detail, it can be found in the source distribution in the file clibinc/cray/dopevec.h. See Appendix D for an example of a simplified version of that data structure, extracted from that file.

3.6.7 Bounds Checking

The PathScale Fortran compiler can perform bounds checking on arrays. To enable this feature, use the -C option: pathf9
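To make the dope vector idea from section 3.6.6 concrete, here is a deliberately simplified C sketch. It is a hypothetical layout invented for illustration. It is not the structure in clibinc/cray/dopevec.h or the one shown in Appendix D, and every name in it is an assumption.

```c
#include <assert.h>
#include <stddef.h>

#define DV_MAX_RANK 7   /* Fortran 95 allows at most seven dimensions */

/* Per-dimension bookkeeping (hypothetical field names). */
struct dv_dimension {
    long lower_bound;   /* lowest valid index */
    long extent;        /* number of elements along this dimension */
    long stride;        /* distance between neighbors, in elements */
};

/* Descriptor that travels with a dynamically allocated array. */
struct dope_vector {
    void *base_address;
    size_t element_size;
    int rank;
    struct dv_dimension dim[DV_MAX_RANK];
};

/* Counterpart of ubound(array, dimension); dimension is 1-based. */
long dv_ubound(const struct dope_vector *dv, int dimension)
{
    assert(dv != NULL && dimension >= 1 && dimension <= dv->rank);
    return dv->dim[dimension - 1].lower_bound
         + dv->dim[dimension - 1].extent - 1;
}

/* Counterpart of size(array): the product of all extents. */
long dv_size(const struct dope_vector *dv)
{
    long total = 1;
    int d;
    assert(dv != NULL && dv->rank >= 0 && dv->rank <= DV_MAX_RANK);
    for (d = 0; d < dv->rank; d++) {
        total *= dv->dim[d].extent;
    }
    return total;
}
```

The point of the sketch is that ubound and size can be answered from the descriptor alone, without inspecting the array data.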
164. and code generation have been performed. In the IPA compilation model, the link step is applied very early in the compilation process, before most optimization and code generation. In this scenario, the program code being linked is not in the object code format. Instead, it is in the form of the intermediate representation (IR) used during compilation and optimization. After the program has been linked at the IR level, inter-procedural analysis and optimization are applied to the whole program. Subsequently, compilation continues with the backend phases to generate the final object code.

The IPA compilation model (see Figure 7-1) has been implemented with ease-of-use as one of its main objectives. At the user level, it is sufficient to just add the -ipa flag to both the compile line and the link line. Thus, users can avoid having to re-structure their Makefiles to use IPA. In order to do this, we have to introduce a new kind of .o files that we call IPA .o's. These are .o files in which the program code is in the form of IR, and they are different from ordinary .o files that contain object code. IPA .o files are produced when a file is compiled with the flags -ipa -c. IPA .o files can only be linked by the IPA linker. The IPA linker is invoked by adding the -ipa flag to the link command. This appears as if it is the final link step. In reality, this link step performs the following tasks: 1
165. ar to the C library function setbuf. To disable buffering on the specified logical unit so that output appears immediately, pass a variable of type character(len=0) or type character*0. To use a particular buffer in place of the default buffer for that logical unit, pass a character string whose length is greater than zero. The logical unit must be appropriate for sequential formatted output. In case of error, the function returns errno or a Fortran iostat error code; otherwise it returns zero. Note that you must enable this on the command line with -intrinsic=setbuf or -intrinsic=EVERY.

setlinebuf
   Similar to the C library function setlinebuf, this causes the specified logical unit to flush buffered output at the end of every line and before any read from the terminal. The logical unit must be appropriate for sequential formatted output. In case of error, the function returns errno or a Fortran iostat error code; otherwise it returns zero. Note that you must enable this on the command line with -intrinsic=setlinebuf or -intrinsic=EVERY.

short
   Convert to type integer*2.

signal
   Fortran interface to the C library function signal. Arrange for the signal whose number is number to trigger a call to external procedure handler, which should be a subroutine with no arguments; or restore the default response to the signal; or ignore the signal
166. aries for both 32-bit and 64-bit environments can be found here:

   /opt/pathscale/lib/<version>/libopenmp.so.1      (symbolic link to dynamic version, 64-bit)
   /opt/pathscale/lib/<version>/32/libopenmp.so.1   (symbolic link to dynamic version, 32-bit)

Be sure to use the -mp flag on both the compile and link lines.

NOTE: For running OpenMP executables compiled with the PathScale compiler on a system where no PathScale compiler is currently installed, please see the PathScale Compiler Suite Install Guide for instructions on installing the PathScale libraries on the target system.

8.9 Environment Variables

The OpenMP environment variables allow you to change the execution behavior of the program running under multiple threads. The table in this section lists the environment variables currently supported. The environment variables can be set using the shell commands. For example, in bash:

   export OMP_NUM_THREADS=4

In csh:

   setenv OMP_NUM_THREADS 4

After the previous shell commands, the following command will print 4:

   echo $OMP_NUM_THREADS
   4

Section 8.9.1 lists the available environment variables (both Standard and PathScale) for use with OpenMP.

8.9.1 Standard OpenMP Environment Variables

Table 8-5. Standard OpenMP Environment Variables

Variable | Possible Values | Description
OMP_DYNAMIC | FALSE | Enables or disables dynamic adjustment of the number of
167. ase there are limitations to the options that are processed in this options directive, and their effects on the optimization. There is no warning or error given for options that are not processed. These directives are processed only in the optimizing backend. Thus, only options that affect optimizations are processed. In addition, it will not affect the phase invocation of the backend components. For example, specifying -O0 will not suppress the invocation of the global optimizer, though the invoked backend phases will honor the specified optimization level. Apart from the optimization level flags, only flags belonging to the following option groups are processed: -LNO, -OPT, and -WOPT.

3.6 Compiler and Runtime Features

The compiler offers three different preprocessing options: -cpp, -ftpp, and now -fcoco.

3.6.1 Preprocessing Source Files with -cpp

Before being passed to the compiler front-end, source files are optionally passed through a source code preprocessor. The preprocessor searches for certain directives in the file and, based on these directives, can include or exclude parts of the source code, include other files, or define and expand macros. By default, Fortran .F, .F90, and .F95 files are passed through the C preprocessor cpp.

3.6.2 Preprocessing Source Files with -ftpp

The Fortran preprocessor ftpp accepts many of the same directives as the C preprocessor, but differs in significant details (for example
168.    assign -I -y on u:6
   assign -I -y on f:test2559.out

The following program would then use no repeat factors, because the first write statement refers explicitly to unit 6, the second write statement refers implicitly to unit 6 (by using * in place of a logical unit), and the third is bound to file test2559.out:

      real :: a(5) = 88.0
      write (6, *) a
      write (*, *) 77.0, 77.0, 77.0, 77.0, 77.0
      open (unit=17, file='test2559.out')
      write (17, *) 99.0, 99.0, 99.0, 99.0, 99.0
      end

3.11 Porting Fortran Code

The following option can help you fix problems prior to porting your code:

-r8 -i8
   Respectively promotes the default representation for REAL and INTEGER type from 4 bytes to 8 bytes. Useful for porting from Cray code, where integer and floating point data is 8 bytes long by default. Watch out for type mismatches with external libraries.

These sections contain helpful information for porting Fortran code:
• Section 3.9.1 has information on porting code that includes KINDS, sometimes a problem when porting Fortran code.
• Section 3.9 has information on source code compatibility.
• Section 3.10 has information on library compatibility.

3.12 Debugging and Troubleshooting Fortran

The flag -g tells the PathScale compilers to produce data in the form used by modern debuggers, such as PathScale's pathdb, GDB, Etnus' TotalView, Absoft Fx2, and Streamline's
169. at a and b are in non-overlapping areas of memory when it optimizes the program. The resulting program, when run, will give wrong results.

Programmers occasionally break this aliasing rule, and as a result, their programs get the wrong answer only under high levels of optimization. This sort of bug frequently is thought to be a compiler bug, so we have added this option to the compiler for testing purposes. If your failing program gets the right answer with -OPT:alias=no_parm or -WOPT:fold=off, then it is likely that your program is breaking this Fortran aliasing rule.

3.12.3 Fortran malloc Debugging

The PathScale Compiler Suite includes a feature to debug Fortran memory allocations. By setting the environment variable PSC_FDEBUG_ALLOC, memory allocations will be initialized during execution to the following values:

   PSC_FDEBUG_ALLOC   Value
   ZERO               0
   NaN                0xffa5a5a5 (4-byte NaN)
   NaN8               0xffa5a5a5fff5a5a5ll (8-byte NaN)

For example, to initialize all memory allocations to zeroes, set PSC_FDEBUG_ALLOC=ZERO before running the program. The four-byte and eight-byte NaNs will only initialize arrays that are aligned with their width (32 and 64 bits, respectively).

3.12.4 Arguments Copied to Temporary Variables

In some situations, the Fortran standard requires that actual arguments to procedure calls be copied to and from temporary variables. Oft
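The 4-byte fill value listed in section 3.12.3 is useful because it does not decode to an ordinary number. The following C sketch checks the IEEE 754 single-precision encoding of 0xffa5a5a5 by hand. The function names are hypothetical, and nothing here reflects how the PathScale runtime itself performs the fill.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* A 32-bit pattern encodes a NaN when the 8-bit exponent field is all
   ones and the 23-bit mantissa field is nonzero. */
int is_nan_pattern32(uint32_t bits)
{
    uint32_t exponent = (bits >> 23) & 0xffu;
    uint32_t mantissa = bits & 0x7fffffu;
    return exponent == 0xffu && mantissa != 0;
}

/* Reinterpret a 32-bit pattern as a float without a pointer cast. */
float float_from_bits(uint32_t bits)
{
    float value;
    assert(sizeof value == sizeof bits);
    memcpy(&value, &bits, sizeof value);
    return value;
}
```

Because a NaN compares unequal to everything, including itself, a REAL*4 value that was never assigned after such a fill tends to propagate visibly through later arithmetic. That is what makes the fill a useful debugging aid.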
170. ation in memory. This inhibits optimization. A common example in the C language is two pointers; if the compiler cannot prove that they point to different locations, a write through one of the pointers will cause the compiler to believe that the second pointer's target has changed.

assertion
   A statement in a program that a certain condition is expected to be true at this point. If it is not true when the program runs, execution stops with an output of where the program stopped and what the assertion was that failed.

base
   Set of standard flags used in SPEC runs with compiler.

bind
   To link subroutines in a program. Applications are often built with the help of many standard routines or object classes from a library, and large programs may be built as several program modules. Binding links all the pieces together. Symbolic tags are used by the programmer in the program to interface to the routine. At binding time, the tags are converted into actual memory addresses or disk locations. Or (bind) to link any element, tag, identifier, or mnemonic with another so that the two are associated in some manner. See alias and linker.

BSS
   (Block Started by Symbol) Section in a Fortran output object module that contains all the reserved but uninitialized space. It defines its label and the reserved space for a given number of words.

CG
   Code generation; a pass in the PathScale Compiler.

common block
   A Fortran term for variables s
171. ault is OFF.

-CLIST:linelength=N
Set the maximum line length to N characters. The default is unlimited.

-CLIST:show[=(ON|OFF)]
Print the input and output file names to stderr. If ON or OFF is not specified, the default is ON.

-colN
(Fortran only) Specify the line width for fixed-format source lines. Specify 72, 80, or 120 for N (-col72, -col80, or -col120). By default, fixed-format lines are 72 characters wide. Specifying -col120 implies -extend-source and recognizes lines up to 132 characters wide. For more information on specifying line length, see the -extend-source and -noextend-source options.

-convert conversion
(For Fortran only) Control the swapping of bytes during I/O so that unformatted files on a little-endian processor are read and written in big-endian format (or vice versa). In sequential unformatted files, this affects record headers as well as data. To be effective, the option must be used when compiling the Fortran main program. Setting the environment variable FILENV when running the program will override the compiled-in choice in favor of the choice established by the command assign(1). Legal values of conversion are:

native          No conversion (the default)
big_endian      Files are big-endian
little_endian   Files are little-endian

-copyright
Show the copyright for the compiler being used.

-cpp
Run the preprocessor, cpp, on all input source files, regardless of suffix, before compiling
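To make the byte swapping behind the -convert option concrete, here is a minimal C sketch that reverses the byte order of a 32-bit value, which is the kind of transformation applied to each 4-byte datum or record header when a file's endianness differs from the processor's. The function name is hypothetical; this is an illustration of the idea, not the runtime library's code.

```c
#include <stdint.h>

/* Hypothetical helper: reverse the four bytes of a 32-bit value,
 * converting between big-endian and little-endian representations. */
static uint32_t swap_bytes32(uint32_t v)
{
    return ((v & 0x000000ffu) << 24) |
           ((v & 0x0000ff00u) << 8)  |
           ((v & 0x00ff0000u) >> 8)  |
           ((v & 0xff000000u) >> 24);
}
```

Applying the swap twice returns the original value, which is why the same conversion works for both reading and writing.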
172. braries
2.8 Large File Support
2.9 Memory Model Support
2.9.1 Support for Large Memory Model
2.10 Debugging
2.11 Profiling: Locate Your Program's Hot Spots
2.12 taskset: Assigning a Process to a Specific CPU

Section 3 The PathScale Fortran Compiler
3.1 Using the Fortran Compiler
3.1.1 Fixed-form and Free-form Files
3.2 Modules
3.2.1 Order of Appearance
3.2.2 Linking Object Files to the Rest of the Program
3.3 Linking When the Main Program Is In a Library
3.3.1 Module-related Error Messages
3.4 Fortran 2003 Support
3.4.1 Syntax Improvements
3.4.2 Intrinsic Procedures
3.4.3 POINTER INTENT
3.4.4 VOLATILE Attribute and Statement
3.4.5 IMPORT Statement
173. bal variables that are never used over the program and deletes them. These variables are often exposed due to IPA's constant propagation.

Dead function elimination finds functions that are never called and deletes them. They can be the by-product of inlining and cloning.

Common padding applies to common blocks in Fortran programs. Ordinarily, compilers are incapable of changing the layout of the user variables in a common block, because this has to be coordinated among all the subroutines that use the same common block, and the subroutines may belong to different compilation units. But under IPA, all the subroutines are available. The padding improves the alignments of the arrays so they can be accessed more efficiently and even vectorized. The padding can also reduce data cache conflicts during execution.

Common block splitting also applies to common blocks in Fortran programs. This splits a common block into a number of smaller blocks, which also reduces data cache conflicts during execution.

Procedure re-ordering lays out the functions of the program in an order based on their call relationship. This can reduce thrashing in the instruction cache during execution.

7.3.4 Controlling IPA

Although the compiler tries to make the best decisions regarding how to optimize a program, it is hard to make the optimal choice in general. Thus, the compiler provides many compilation opt
174. bin. This is the same directory that contains pathcc, pathCC, pathf95, pathf90, and so on.

An option configuration file, pathopt2.xml, is provided. The default location is /opt/pathscale/share/pathopt2/pathopt2.xml. See section 7.9.3 for details on this file format. Sample programs are found in /opt/pathscale/share/pathopt2/examples.

In the following sections we review the command syntax, the option configuration file structure, and general usage information. Step-by-step examples show how to use the different features of pathopt2.

7.9.1 A Simple Example

An example is provided here to show basic usage of pathopt2. In this example you will copy a test program into your working directory and then run pathopt2 with the options file and the test program.

Copy the program factorial.c from /opt/pathscale/share/pathopt2/examples into your own working directory. factorial.c is a program that calculates a table of 50,000 factorials, from 1 to 50000. You can now run this simple example by typing:

    $ pathopt2 -f pathopt2.xml -t try5 -r factorial pathcc -o factorial factorial.c

NOTE: If you do not have . set in your PATH, you need to use ./factorial to run this command from the current working directory. The PATH for the program pathopt2 is the same as for pathcc, etc., and should already be set correctly. See the PathScale Compiler Suite and
175. bout porting code, but it is useful to mention the following option that you can use to help in porting your Fortran code.

-r8 -i8
Respectively promotes the default representation for REAL and INTEGER type from 4 bytes to 8 bytes. Useful for porting from Cray code when integer and floating point data is 8 bytes long by default. Watch out for type mismatches with external libraries.

NOTE: The -r8 and -i8 flags only affect default reals and integers, not variable declarations or constants that specify an explicit KIND. This can cause incorrect results if a 4-byte default real or integer is passed into a subprogram that declares a KIND=4 integer or real. Using an explicit KIND value like this is unportable and is not recommended. Correct usage of KIND (i.e. KIND=KIND(1) or KIND=KIND(0.0d0)) will not result in any problems.

Cray Pointers

The Cray pointer is a data type extension to Fortran to specify dynamic objects, different from the Fortran pointer. Both Cray and Fortran pointers use the POINTER keyword, but they are specified in such a way that the compiler can differentiate between them.

The declaration of a Cray pointer is:

    POINTER ( <pointer>, <pointee> )

Fortran pointers are declared using:

    POINTER :: object_name

PathScale's implementation of Cray Pointers is the Cray implementation, which is a stricter implementation than in other compilers. In
176. bverbose
Produce diagnostic output about the subscription management for the compiler.

-TENV:...
This option specifies the target environment option group. These options control the target environment assumed and/or produced by the compiler.

-TENV:frame_pointer=(ON|OFF)
Default is ON for C++ and OFF otherwise. Local variables in the function stack frame are addressed via the frame pointer register. Ordinarily, the compiler will replace this use of frame pointer by addressing local variables via the stack pointer when it determines that the stack pointer is fixed throughout the function invocation. This frees up the frame pointer for other purposes. Turning this flag on forces the compiler to use the frame pointer to address local variables. This flag defaults to ON for C++ because the exception handling mechanism relies on the frame pointer register being used to address local variables. This flag can be turned OFF for C++ for programs that do not throw exceptions.

-TENV:X=(0..4)
Specify the level of enabled exceptions that will be assumed for purposes of performing speculative code motion (default is level 1 at all optimization levels). In general, an instruction will not be speculated (i.e. moved above a branch by the optimizer) unless any exceptions it might cause are disabled by this option.

0   No speculative code motion may be performed.
1   Safe speculative code motion may be performed, with IEEE-754 underflow and inexact exceptions disable
177. by -D should not be defined in the setfile.

-fdecorate path
(For Fortran only) Specify how to decorate external Fortran identifiers to generate linker symbols. Ordinarily, we apply the rules established by options -f[no-]underscoring and -f[no-]second-underscore, but -fdecorate overrides those rules for specific identifiers. The file path should contain two blank-delimited or tab-delimited tokens per line. The first token is a Fortran identifier, and the second is the linker symbol to use for that identifier. An abbreviation is allowed in place of the second token: 0 says to append no underscore to the Fortran identifier, 1 says to append a single underscore, and 2 says to append two underscores if the Fortran identifier contains an underscore, but otherwise to append one. If an identifier appears twice, the second rule overrides the first. You may repeat this option to specify multiple files.

-f[no-]directives
(For Fortran only) -fno-directives ignores all directives such as !$OMP or C*$* PREFETCH_REF inside comments. The default is -fdirectives, which scans the comments for directives, although certain directives may have no effect unless additional options (such as -mp) are present.

-fe
Stop after the front end is run.

-f[no-]exceptions
(For C++ only) -fexceptions enables exception handling. This is the default. -fno-exceptions disables exception handling. This option has a subset of the effects of -fno-gnu-exceptions. Hence it
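The three abbreviations accepted in a -fdecorate file (0, 1, and 2) can be sketched in C as follows. The function name, its signature, and its error convention are hypothetical and exist only to illustrate the rules stated in the option description; this is not the compiler's implementation.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: build the linker symbol for a Fortran identifier
 * using one of the abbreviated rules:
 *   0 -> append no underscore
 *   1 -> append a single underscore
 *   2 -> append two underscores if the identifier already contains an
 *        underscore, otherwise append one
 * Returns 0 on success, -1 for an unknown rule or a too-small buffer. */
static int decorate_symbol(const char *ident, int rule,
                           char *out, size_t out_size)
{
    const char *suffix;

    switch (rule) {
    case 0:  suffix = "";  break;
    case 1:  suffix = "_"; break;
    case 2:  suffix = strchr(ident, '_') ? "__" : "_"; break;
    default: return -1;
    }

    int n = snprintf(out, out_size, "%s%s", ident, suffix);
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

For instance, under rule 2 the identifier my_sub would map to my_sub__, while mysub would map to mysub_.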
178. (c_int8_t)                        int8_t
integer(c_int16_t)                int16_t
integer(c_int32_t)                int32_t
integer(c_int64_t)                int64_t
integer(c_int_least8_t)           int_least8_t
integer(c_int_least16_t)          int_least16_t
integer(c_int_least32_t)          int_least32_t
integer(c_int_least64_t)          int_least64_t

Table 3-1. Compatible Fortran and C Types

Fortran Type                      C Type
integer(c_int_fast8_t)            int_fast8_t
integer(c_int_fast16_t)           int_fast16_t
integer(c_int_fast32_t)           int_fast32_t
integer(c_int_fast64_t)           int_fast64_t
integer(c_intmax_t)               intmax_t
integer(c_intptr_t)               intptr_t
real(c_float)                     float
real(c_double)                    double
real(c_long_double)               long double
complex(c_float_complex)          float _Complex
complex(c_double_complex)         double _Complex
complex(c_long_double_complex)    long double _Complex
logical(c_bool)                   _Bool
character(kind=c_char)            char

Because our compiler does not provide real*16 and complex*16 types, c_long_double and c_long_double_complex are -1, and declarations using them are not allowed.

The standard suggests that Fortran integer variables which are compatible with C signed variables are equally compatible with C unsigned variables (the bit patterns will be correct, although obviously Fortran arithmetic would treat them as if they were signed).

The ISO_C_BINDING module also provides constants corresponding to some of the special characters defined in C.

Table 3-2. Compatible Fortran and C Character Co
179. cally integrates simple functions into their callers. -fno-inline-functions does not automatically integrate simple functions into their callers.

-finstrument-functions
Insert instrumentation calls into each function, just after the function entry and just before the function returns. Refer to -OPT:cyg_instr for more details. -finstrument-functions is equivalent to -OPT:cyg_instr=3.

-fixedform
(For Fortran only) Treat all input source files, regardless of suffix, as if they were written in fixed source form (f77 72-column format) instead of F90 free format. By default, only input files suffixed with .f or .F are assumed to be written in fixed source form.

-fkeep-inline-functions
(For C/C++ only) Generate code for functions even if they are fully inlined.

-flist
Invoke all Fortran listing control options. The effect is the same as if all -FLIST options are enabled.

-FLIST:...
Invoke the Fortran listing control group, which controls production of the compiler's internal program representation back into Fortran code, after IPA inlining and loop-nest transformations. This is used primarily as a diagnostic tool, and the generated Fortran code may not always compile. With the exception of -FLIST:=OFF, any use of this option implies -flist. The arguments to the -FLIST option are as follows:

-FLIST:=setting
Enable or disable the listing. Setting can be either ON or OFF. The default is OFF. This optio
180. can be used on some C++ applications on which -fno-gnu-exceptions cannot be applied.

-ff2c-abi path
(For Fortran only) Use the GNU f2c ABI when calling any functions listed in the file at path. On the x86-64 platform, the g77 compiler generates code that does not follow the documented platform ABI in some cases involving functions returning complex or single-precision real values. You must use this flag if you are mixing code generated by g77 with code generated by the PathScale Fortran compiler. The format of an f2c ABI description file is simply a list of Fortran function names, one per line, without any of the trailing underscores that are added in object files. To generate files in this format, you can use the fsymlist(1) utility.

-f[no-]fast-math
-ffast-math improves FP speed by relaxing ANSI & IEEE rules. -ffast-math is implied by -Ofast. -fno-fast-math tells the compiler to conform to ANSI and IEEE math rules at the expense of speed. -ffast-math implies -OPT:IEEE_arithmetic=2 -fno-math-errno. -fno-fast-math implies -OPT:IEEE_arithmetic=1 -fmath-errno.

-f[no-]fast-stdlib
The -ffast-stdlib flag improves application performance by generating code to link against special versions of some standard library routines, and linking against the PathScale compiler runtime library. This option is enabled by default. If -fno-fast-stdlib is used during compilation, the compiler will not emit code
181. can use the -fno-underscoring option, but in many cases that will create symbols that conflict with those in the Fortran and C runtime libraries, so it is not the preferred choice.

Normally pathf90 passes arguments by reference, so C needs to use pointers in order to interoperate with Fortran. In many cases you can use the %val intrinsic function in Fortran to pass an argument by value.

The programmer must be careful to match argument data types. For instance, pathf90 integer*4 matches C int, integer*8 matches C long long, real matches C float (provided the C function has an explicit prototype), and doubleprecision matches C double. Fortran character is problematic because, in addition to passing a pointer to the first character, it appends an integer length-count argument to the end of the usual argument list. Fortran Cray pointers, declared with the pointer statement, correspond to C pointers, but Fortran 90 pointers, declared with the pointer attribute, are unique to Fortran. The sequence keyword makes it more likely that a Fortran 90 structure will use the same layout as a C structure, although it is wise to verify this by experiment in each case.

For arrays, it is wise to limit the interface to the kinds of arrays provided in Fortran 77, since the arrays introduced in Fortran 90 add to the data structures information that C cannot understand. Thus, for example, an argument a(5,6) or a(n) or a(1:*) (where n is a dummy argument) wi
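The calling conventions described for Fortran and C interoperation (arguments passed by reference, integer*4 matching int, doubleprecision matching double) can be sketched from the C side as below. The routine name vecsum_ is hypothetical, and the single trailing underscore is an assumption about the default name mangling for an identifier that contains no underscore; it is worth verifying against the symbols your compiler actually emits.

```c
/* Hypothetical C routine intended to be callable from Fortran as
 *     call vecsum(n, x, total)
 * with n declared integer*4 and x, total declared doubleprecision.
 * Every argument arrives as a pointer because Fortran passes by
 * reference; the array x is passed as a pointer to its first element. */
void vecsum_(const int *n, const double *x, double *total)
{
    double sum = 0.0;
    for (int i = 0; i < *n; i++)
        sum += x[i];
    *total = sum;
}
```

The name deliberately avoids an embedded underscore, since the second-underscore rule would otherwise change the symbol the linker looks for.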
182. ce these data. Use -frandom-seed to override that random number; this will force reproducibility across different compilations. You should use a different string to compile each source file.

-freeform
(For Fortran only) Treats all input source files, regardless of suffix, as if they were written in free source form. By default, only input files suffixed with .f90 or .F90 are assumed to be written in free source form.

-f[no-]rtti
(For C++ only) Using -frtti will generate runtime type information. The -fno-rtti option will not generate runtime type information.

-f[no-]second-underscore
(For Fortran only) -fsecond-underscore appends a second underscore to symbols that already contain an underscore. -fno-second-underscore tells the compiler not to append a second underscore to symbols that already contain an underscore.

-f[no-]signed-bitfields
(For C/C++ only) -fsigned-bitfields makes bitfields signed by default. -fno-signed-bitfields makes bitfields unsigned by default.

-f[no-]signed-char
(For C/C++ only) -fsigned-char makes char signed by default. -fno-signed-char makes char unsigned by default.

-f[no-]strict-aliasing
(For C/C++ only) -fstrict-aliasing tells the compiler to assume strictest aliasing rules. -fno-strict-aliasing tells the compiler not to assume strict aliasing rules.

-fshared-data
(For C/C++ only) Mark data as shared rather than private.

-fshort-double
183. chieved with the compiler options -O and -c, where objects are built with the assumption that the compiled objects will be linked into a call-shared executable later. The default is OFF. In effect, optimizations based on position-dependent code (non-PIC) are performed on the compiled objects.

-IPA:small_pu=N
A procedure with size smaller than N is not subjected to the plimit restriction. The default is 30.

-IPA:sp_partition=setting
This option enables partitioning for disk/addressing-saving purposes. The default is OFF. Mainly used for building very large programs. Normally, partitioning would be done by IPA internally.

-IPA:space=N
Inline until a program expansion of N% is reached. For example, -IPA:space=20 limits code expansion due to inlining to approximately 20%. Default is no limit.

-IPA:specfile=filename
Opens a filename to read additional options. The specification file contains zero or more lines with inliner options in the form expected on the command line. The specfile option cannot occur in a specification file, so specification files cannot invoke other specification files.

-IPA:use_intrinsic=(ON|OFF)
Enable/disable loading the intrinsic version of standard library functions. The default is OFF.

-iquote dir
Search dir for header files specified by #include "file", but not for header files specified by #include <file>. dir is searched before all directories specified by -I and the standard system directories
184. compiler not to mark strings as const char.

-w
Suppress warning messages.

-woff
Turn off named warnings.

-woffall
Turn off all warnings.

-woffoptions
Turn off warnings about options.

-woffnum
Specify message numbers to suppress. Examples:

Specifying -woff2026 suppresses message number 2026.
Specifying -woff2026-2352 suppresses messages 2026 through 2352.
Specifying -woff2026-2352,2400-2500 suppresses messages 2026 through 2352 and messages 2400 through 2500.

In the message-level indicator, the message numbers appear after the dash.

-Xlinker option
Pass option to the linker. To pass an option that requires an argument, you must use -Xlinker twice, once for the option and once for the argument.

-Yc,path
Set the path in which to find the associated phase, using the same phase names as given in the -W option. The following characters can also be specified:

I   Specifies where to search for include files
S   Specifies where to search for startup files (crt*.o)
L   Specifies where to search for libraries

-zerouv
Set uninitialized variables to zero. Affects local scalar and array variables and memory returned by alloca(). Does not affect the behavior of globals, malloc()ed memory, or Fortran common data.

file.suffix[90]
(Fortran) File or files to be processed, where suffix is either an uppercase F or a lowercase f for source files. Files ending in .i, .o, and
185. compiler recognizes only some of the intrinsics that it can support. The name can be the lower-case name of any intrinsic that the compiler can support, or it can be an upper-case name representing a predefined family of intrinsics. You can use the options to tune the compiler to provide all the intrinsics a program needs while eliminating the ones whose names conflict with those of the program's own functions and subroutines. The options may appear multiple times and will be interpreted in order. For example,

    -no-intrinsic=EVERY -intrinsic=G77 -no-intrinsic=abort

would remove all intrinsics, then add the family of G77 intrinsics, and then remove the individual intrinsic abort. Predefined families are:

EVERY         Every intrinsic that the pathf95 compiler can support
ANSI          Intrinsics defined in the ANSI standard (this is the default for the -ansi option)
G77           Intrinsics known to the GNU compiler
PGI           Intrinsics known to the PGI(TM) compiler
OMP           Intrinsics defined by the OpenMP standard (automatically enabled by the -mp option; see the eko(7) man page for more information)
TRADITIONAL   Intrinsics known to pathf95 prior to version 2.0 (this is the default in the absence of the -ansi option)

A family like PGI contains intrinsics supported by both pathf95 and the PGI compiler; that does not imply that pathf95 supports every intrinsic in the PGI compiler.

-ipa
Invoke inter-procedural analysis (IPA). Specifying this option is identical to
186. cript and of the run command script A zero status indicates that the build or run was successful while a non zero status indicates failure If running the program indicates its status in some other way this must be detected by a script and reflected in the script s return status In the example above the grep SUCCESSFUL line is a way to pass the NPB correctness test results to pathopt 2 The grep will have a status of O if the output contains this phrase and this will be the status of the whole shell script since this is the last command Next make the file executable and run pathopt2 chmod x psc test2 pathopt2 S timing file t try5 r psc test2 N psc build ft A The sorted summary will be similar to the following Sorted summary from all runs Flags Build Test Time 03 ipa PASS PASS 10 87 Ofast PASS PASS 10 87 O3 OPT Ofast PASS PASS 11 01 03 PASS PASS 11 02 02 PASS PASS 11 82 Since 03 ipa was the fastest in the try5 target we can run pathopt2 again with the peak 03 target S pathopt2 S timing file t peak 03 r psc test2 N psc build ft A 7 40 7 Tuning Options The pathopt2 Tool ls In the truncated sorted summary we can see that there is some improvement with the new options Sorted summary from all runs Flags Build Test Time 03 OPT unroll times max 8 PASS PASS 10 33 CG load exe 0 LNO interchange off CG local fwd sched on 03 OPT unroll times max 8
187. ction 7 in this document.

Shared Libraries

The PathScale Compiler Suite includes shared versions of the runtime libraries that the compilers use. The shared libraries are packaged in the pathscale-compilers-libs package. The compiler will use these shared libraries by default when linking executables and shared objects. Therefore, if you link a program with these shared libraries, you must install them on systems where that program will run.

You should continue to use the static versions of the runtime libraries if you wish to obtain maximum portability or peak performance. The latter is the case because the compiler cannot optimize shared libraries as aggressively as static libraries. Shared libraries are compiled using position-independent code, which limits some opportunities for optimization, while our static libraries are not compiled this way.

To link with static libraries instead of shared libraries, use the -static option. For example, the following code is linked using the shared libraries:

    $ pathcc -o hello hello.c
    $ ldd hello
    libpscrt.so.1 => /opt/pathscale/lib/2.3.99/libpscrt.so.1 (0x0000002a9566d000)
    libmpath.so.1 => /opt/pathscale/lib/2.3.99/libmpath.so.1 (0x0000002a9576e000)
    libc.so.6 => /lib64/libc.so.6 (0x0000002a9588b000)
    libm.so.6 => /lib64/libm.so.6 (0x0000002a95acd000)
    /lib64/ld-linux-x86-64.so.2 => /lib64/ld-linux-x86
188. culations. The compiler tries to issue one prefetch per stride iteration, but cannot guarantee it. Redundant prefetches are preferred to transformations (such as inserting conditionals) which incur other overhead.

Scope: No scope. Just generates a prefetch instruction.

The following arguments are used with this option:

array-ref   Required. The reference itself, for example, A(i, j).
str         Optional. Prefetch every str iterations of this loop. The default is 1.
lev         Optional. The level in memory hierarchy to prefetch. The default is 2. If lev=1, prefetch from L2 to L1 cache. If lev=2, prefetch from memory to L1 cache.
rd/wr       Optional. The default is read/write.
sz          Optional. The size (in Kbytes) of the array referenced in this loop. This must be a constant.

3.5.3.2 Changing Optimization Using Directives

Optimization flags can now be changed via directives in the user program. In Fortran, the directive is used in the form:

    C*$* options <"list of options">

Any number of these can be specified inside function scopes. Each affects only the optimization of the entire function in which it is specified. The literal string can also contain an unlimited number of different options separated by spaces and must include the enclosing quotes. The compilation of the next function reverts back to the settings specified in the compiler command line.

In this rele
189. cy. This is the default when optimization levels -O0, -O1, and -O2 are in effect.

2   May produce inexact result not conforming to IEEE 754. This is the default when -O3 is in effect.
3   All mathematically valid transformations are allowed.

-OPT:IEEE_NaN_Inf=(ON|OFF)
-OPT:IEEE_NaN_inf=ON forces all operations that might have IEEE-754 NaN or infinity operands to yield results that conform to ANSI/IEEE 754-1985, the IEEE Standard for Binary Floating-point Arithmetic, which describes a standard for NaN and inf operands. Default is ON. -OPT:IEEE_NaN_inf=OFF produces non-IEEE results for various operations. For example, x==x is treated as TRUE without executing a test, and x/x is simplified to 1 without dividing. OFF can enable many common optimizations that can help performance.

-OPT:inline_intrinsics=(ON|OFF)
When OFF, this option turns all Fortran intrinsics that have a library function into a call to that function. Default is ON.

-OPT:madd_height=N
Allow at most N multiply-add instructions that follow one another. If there are more than N multiply-add instructions, break them into chains of size N and sum the resulting chains. Available only for the MIPS family of processors; not available for x86/x86-64.

-OPT:malloc_algorithm=(0|1) or -OPT:malloc_alg=(0|1)
Select an alternate malloc algorithm which may improve speed. The compiler adds setup code in the C/C++/Fortran main function to enable the chosen algo
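The two simplifications mentioned for -OPT:IEEE_NaN_inf=OFF are only valid when no NaN is involved. The short C sketch below shows the IEEE-conforming behavior they would change; the function names are hypothetical, and the sketch assumes the code is compiled without any fast-math style option.

```c
#include <math.h>

/* Under IEEE 754 rules a NaN compares unequal to everything, itself
 * included, so this cannot be folded to "always true". */
static int equals_itself(double x)
{
    return x == x;
}

/* Under IEEE 754 rules 0.0/0.0 (and NaN/NaN, inf/inf) yields NaN, so
 * this cannot be folded to the constant 1. */
static double ratio_with_itself(double x)
{
    return x / x;
}
```

With IEEE semantics, equals_itself(NAN) is false and ratio_with_itself(0.0) is a NaN; a program that relies on either result should keep the default ON setting.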
190. d.
2   All IEEE-754 exceptions are disabled except divide by zero.
3   All IEEE-754 exceptions are disabled including divide by zero.
4   Memory exceptions may be disabled or ignored.

-TENV:simd_imask=(ON|OFF)
Default is ON. Turning it OFF unmasks SIMD floating-point invalid-operation exception.

-TENV:simd_dmask=(ON|OFF)
Default is ON. Turning it OFF unmasks SIMD floating-point denormalized-operand exception.

-TENV:simd_zmask=(ON|OFF)
Default is ON. Turning it OFF unmasks SIMD floating-point zero-divide exception.

-TENV:simd_omask=(ON|OFF)
Default is ON. Turning it OFF unmasks SIMD floating-point overflow exception.

-TENV:simd_umask=(ON|OFF)
Default is ON. Turning it OFF unmasks SIMD floating-point underflow exception.

-TENV:simd_pmask=(ON|OFF)
Default is ON. Turning it OFF unmasks SIMD floating-point precision exception.

-traditional
Attempt to support traditional K&R style C.

-trapuv
Trap uninitialized variables. Initialize variables to the value NaN, which helps your program crash if it uses uninitialized variables. Affects local scalar and array variables and memory returned by alloca(). Does not affect the behavior of globals, malloc()ed memory, or Fortran common data.

-U name
Remove any initial definition of name.

-Uvar
Undefine a variable for the source preprocessor. See the -Dvar option for information on defining variables.

-uvar
Make the default type of a variable und
191. d some can change the program's behavior slightly. The first class of optimizations is termed safe and the second unsafe. See section 7.7 for more information on these optimizations.

-OPT:Olimit=0 is a generally safe option but may result in the compilation taking a long time or consuming large quantities of memory. This option tells the compiler to optimize the files being compiled at the specified levels no matter how large they are.

The option -fno-math-errno bypasses the setting of ERRNO in math functions. This can result in a performance improvement if the program does not rely on IEEE exception handling to detect runtime floating point errors.

-OPT:roundoff=2 also allows for fairly extensive code transformations that may result in floating point round-off or overflow differences in computations. Refer to section 7.7.4.2 and section 7.7.4 for more information.

The option -OPT:div_split=ON allows the conversion of x/y into x*(recip(y)), which may result in less accurate floating point computations. Refer to section 7.7.4.2 and section 7.7.4 for more information.

The -OPT:alias settings allow the compiler to apply more aggressive optimizations to the program. The option -OPT:alias=typed assumes that the program has been coded in adherence with the ANSI/ISO C standard, which states that two pointers of different types cannot point to the same location in memory. Setting -OPT:alias=restrict allows the compiler to assume that point
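To see why the x/y to x*(recip(y)) conversion described for -OPT:div_split can lose accuracy, the C sketch below computes a quotient both ways. The split form rounds twice (once for the reciprocal and once for the product), so it may differ from the direct quotient in the last bit. All function names are hypothetical; this illustrates the arithmetic only, not the code the compiler generates.

```c
/* Direct division: a single correctly rounded operation. */
static double divide_direct(double x, double y)
{
    return x / y;
}

/* Split division: the reciprocal is rounded, then the product is
 * rounded again, so the result can be off by a unit in the last place. */
static double divide_split(double x, double y)
{
    double recip = 1.0 / y;
    return x * recip;
}

/* Relative difference between two non-zero results. */
static double relative_difference(double a, double b)
{
    double diff = a - b;
    if (diff < 0.0)
        diff = -diff;
    if (a < 0.0)
        a = -a;
    return diff / a;
}
```

Comparing divide_direct(x, y) with divide_split(x, y) over a program's real inputs is one way to judge whether the small discrepancy matters before enabling the option.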
192. d is part of the schedutils package/RPM.

NOTE: Some of the Linux distributions supported by the PathScale compilers do not contain the schedutils package/RPM.

The CPU affinity is represented as a bitmask, typically given in hexadecimal. Assigning a process to a specific CPU prevents the Linux scheduler from moving or splitting the process.

Example:

    $ taskset 0x00000001

This would assign the process to processor 0. If an invalid mask is given, an error is returned, so when taskset returns, it is guaranteed that the program has been scheduled on a valid and legal CPU. See the taskset(1) man page for more information.

Section 3 The PathScale Fortran Compiler

The PathScale Fortran compiler supports Fortran 77, Fortran 90, Fortran 95, and an evolving subset of Fortran 2003. The PathScale Fortran compiler:

Partial conformance with ISO/IEC 1539-1:2004 Programming Languages - Fortran - Part 1: Base Language (Fortran 2003).
Conforms to the more recent ISO/IEC 1539-1:1997 Programming languages - Fortran (Fortran 95).
Conforms to ISO/IEC TR 15580: Fortran: Floating point exception handling. See also section 14 of ISO/IEC 1539-1:2004 (the Fortran 2003 standard) for a complete description.
Conforms to ISO/IEC TR 15581: Fortran: Enhanced data type facilities.
Conforms to ISO/IEC 1539-2: Varying length character strings (section 3.6.3).
Conforms to ISO/IEC 1539-3: Conditional compilation (section 3.6.4).
Conforms to ISO/IEC 1539:199
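The affinity bitmask passed to taskset has one bit per CPU, with bit 0 standing for processor 0. As a small illustration, the C sketch below builds and inspects such masks; the function names are hypothetical and the 64-bit mask width is an assumption chosen for the example.

```c
#include <stdint.h>

/* Hypothetical helper: mask with only the bit for `cpu` set.
 * The caller must pass cpu < 64. */
static uint64_t mask_for_cpu(unsigned cpu)
{
    return (uint64_t)1 << cpu;
}

/* Hypothetical helper: non-zero if `mask` permits running on `cpu`.
 * The caller must pass cpu < 64. */
static int mask_allows_cpu(uint64_t mask, unsigned cpu)
{
    return (int)((mask >> cpu) & 1u);
}
```

So the mask 0x00000001 in the example selects processor 0 only, while a mask such as 0x5 would allow processors 0 and 2.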
193. d number x. Here is another example for two chips with four cores and PSC_OMP_CPU_STRIDE set to 4:

    <-------- CHIP0 -------->   <-------- CHIP1 -------->
    CPU0   CPU1   CPU2   CPU3   CPU4   CPU5   CPU6   CPU7
    T0     T2     T4     T6     T1     T3     T5     T7
    T8     T10    T12    T14    T9     T11    T13    T15
    T16

This variable is most useful when the number of threads is fewer than the number of CPUs. In the common case where the number of threads is the same as the number of CPUs, there is typically no need to set PSC_OMP_CPU_STRIDE. Note that the same mappings can also be obtained by enumerating the CPU numbers using the PSC_OMP_AFFINITY_MAP variable.

PSC_OMP_CPU_OFFSET (Integer value)

This specifies an integer value that is used to offset the CPU assignments for the set of threads. It takes an integer value in the range of 0 to the number of CPUs (inclusive). When a thread is mapped to a CPU, this offset is added onto the CPU number calculated after PSC_OMP_CPU_STRIDE has been applied. If the resulting value is greater than the number of CPUs, then the remainder is used from the division of this value by the number of CPUs.

The effect of this is to apply an offset to the CPU assignments for a set of threads. This is particularly useful when multiple OpenMP jobs are to be run at the same time on the same system, and allows the jobs to be separated onto different CPUs. Without this mechanism both jobs would be assigned to CPUs starting at CPU 0, causing a non-uniform di
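The thread-to-CPU table for PSC_OMP_CPU_STRIDE and the wrap-around rule for PSC_OMP_CPU_OFFSET can be reproduced with a few lines of C. The formula below is a reconstruction that matches the table shown for two chips with four cores and a stride of 4; it is an assumption made for illustration, not the OpenMP runtime's actual code, and the function name is hypothetical. It assumes stride >= 1 and ncpus >= 1.

```c
/* Hypothetical reconstruction of the mapping:
 *   1. scale the thread number by the stride and wrap it onto the CPUs,
 *      shifting by one CPU each time the scaled value wraps around;
 *   2. add the offset and take the remainder modulo the CPU count
 *      (assumption: a result equal to ncpus also wraps to CPU 0). */
static int cpu_for_thread(int thread, int stride, int offset, int ncpus)
{
    int scaled = thread * stride;
    int cpu = scaled % ncpus + (scaled / ncpus) % stride;
    return (cpu + offset) % ncpus;
}
```

With stride 4, offset 0, and 8 CPUs this places T0 on CPU0, T1 on CPU4, T2 on CPU1, and T8 back on CPU0, as in the table; a second job started with an offset of 4 would begin its placement on CPU4 instead of CPU0.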
194. d of the call list. 3.10.3 Linking with g77-compiled Libraries. If you wish to link with a library compiled by g77, and if that library contains functions that return COMPLEX or REAL types, you need to tell the compiler to treat those functions differently. Use the -ff2c-abi switch at compile time to point the PathScale compiler at a file that contains a list of functions in the g77-compiled libraries that return COMPLEX or REAL types. When the PathScale compiler generates code that calls these listed functions, it will modify its ABI behavior to match g77's expectations. The -ff2c-abi flag is used at compile time and not at link time. NOTE: You can only specify the -ff2c-abi switch once on the command line. If you have multiple g77-compiled libraries, you need to place all the appropriate symbol names into a single file. The format of the file is one symbol per line. Each symbol should be as you would specify it in your Fortran code (i.e. do not mangle the symbol). As an example, cat example.list prints the two lines sdot and cdot. You can use the fsymlist program to generate a file in the appropriate format. For example: fsymlist /opt/gnu64/lib/mylibrary.a > mylibrary.list This will find all Fortran symbols in the mylibrary.a library and place them into the mylibrary.list file. You can then use this file with the -ff2c-abi switch. NOTE: The fsymlist
195. date The subroutine form is equivalent to call ctime(date, time8()). The function form is equivalent to ctime(time8()). fget: Like fgetc, but uses logical unit 5. fgetc: Fortran interface to the C library function fgetc. Reads into c a single character from logical unit unit, treating that unit as if it were a stream of bytes. The function form returns 0 for success, -1 for end of file, or an error code from the C library value errno. The subroutine sets status to the value that the function would return. Between the opening and closing of a file, you should use either stream intrinsics (fget, fgetc, fput, fputc, fseek, and ftell) or standard Fortran I/O, but not both. flush: Flush buffered I/O for logical unit unit. If unit is omitted, flush all logical units. fnum: Return the POSIX file descriptor corresponding to the open Fortran logical unit unit. fput: Like fputc, but uses logical unit 6. fputc: Fortran interface to the C library function fputc. Writes to logical unit unit a single character c, treating that unit as if it were a stream of bytes. The function form returns 0 for success, -1 for end of file, or an error code from the C library value errno. The subroutine sets status to the value that the function would return. Between the opening and closing of a file, you should use either stream intrinsics (fget, fgetc, fput, fputc, fseek, and ftell) or standard Fortran I/O, but not both. fseek: Fortran interface to the C library fu
196. default is 0, which means unknown number of processors. The default value of 0 should be used if the program is intended to run on different systems with different numbers of processors. If the option is set to non-zero and the value is different from the number of processors, the parallelized code will not perform optimally. -LNO:sclrze=(ON|OFF): Turn ON or OFF the optimization that replaces an array by a scalar variable. The default is ON. -LNO:simd=(0|1|2): This flag controls inner loop vectorization, which makes use of SIMD instructions provided by the native processor. 0: Turn off the vectorizer. 1: (Default) Vectorize only if the compiler can determine that there is no undesirable performance impact due to sub-optimal alignment. Vectorize only if vectorization does not introduce accuracy problems with floating-point operations. 2: Most aggressive. Vectorize without any constraints. -LNO:simd_reduction=(ON|OFF): This flag controls whether reduction loops will be vectorized. Default is ON. -LNO:simd_verbose=(ON|OFF): -LNO:simd_verbose=ON prints verbose vectorizer info to stdout. Default is OFF. -LNO:svr_phase1=(ON|OFF): This flag controls whether the scalar variable naming phase should be invoked before the first phase of LNO. The default is ON. -LNO:trip_count_assumed_when_unknown,trip_count=N: This flag provides an assumed loop trip count if it is unknown at compile time. LNO uses this infor
197. default values for these are widely applicable, but some applications with guided scheduling can be fairly sensitive to their setting. See section 8.9.2 for the interpretation of these. By default, the OpenMP library employs spin locks for synchronization, and these loops can be tuned for performance using the PSC_OMP_THREAD_SPIN and PSC_OMP_LOCK_SPIN environment variables. It may be desirable to turn off the spinning and use blocking pthread calls instead for OpenMP applications that use multiple threads per CPU. This is fairly uncommon, and in the usual case the use of spin locks is a significant optimization over the use of blocking pthread calls. See section 8.9.2 for details on these environment variables. 8.14.3.5 Using Feedback Data. If an OpenMP program is instrumented via the -fb-create option to generate feedback data in feedback-directed compilation, the instrumented executable should only be run under a single thread. This can be effected via the OMP_NUM_THREADS environment variable. The reason is that the instrumentation library libinstr.so used during execution does not support simultaneous updates of the feedback data by multiple threads. Running the instrumented executable under multiple threads can result in segmentation faults. 8.15 Other Resources for OpenMP. For more information on OpenMP, you might also find the
198. demark of SUSE Linux AG. All other brand and product names are trademarks or registered trademarks of their respective owners. 2007, 2008 PathScale LLC. All rights reserved. 2006, 2007 QLogic Corporation. All rights reserved worldwide. 2004, 2005, 2006 PathScale. All rights reserved. First Published: April 2004. Printed in U.S.A. PathScale LLC, 2071 Stierlin Ct., Suite 200, Mountain View, CA 94043. Table of Contents. Section 1 Introduction: 1.1 Conventions Used in This Document; 1.2 Documentation Suite. Section 2 Compiler Quick Reference: 2.1 What You Installed; 2.2 How To Invoke the PathScale Compilers; 2.2.1 Accessing the GCC 4.x Front-ends for C and C++; 2.3 Compiling for Different Platforms; 2.3.1 Target Options for This Release; 2.3.2 Defaults Flag; 2.3.3 Compiling for an Alternate Platform; 2.3.4 Compiling Option Tool: pathhow-compiled; 2.4 Input File Types; 2.5 Other Input Files; 2.6 Common Compiler Options; 2.7 Shared LI
199. ds library and are called pthreads. The PathScale Fortran runtime environment automatically sizes the stack for the main thread and the pthreads to avoid stack size problems where possible. Additionally, diagnostics are given on memory segmentation faults to help diagnose stack size issues. The stack size limit for the main thread of an OpenMP program is set using the same algorithm as for a serial Fortran program (see section 3.13 for information about Fortran compiler stack size), except that the calculated stack limit is subsequently divided by the number of CPUs in the system. This ensures that the physical memory available for stack can be shared between as many threads as there are CPUs in the system. The limit tries to avoid excessive swapping in the case where all of these threads consume all of their available stack. Note that if there are more OpenMP threads than CPUs and they all consume all of their stack, then this will cause swapping. The stack size of the main thread can be controlled using the PSC_STACK_LIMIT environment variable, and diagnostics for its setting can be generated using the PSC_STACK_VERBOSE environment variable, in exactly the same way as for a serial Fortran program. The stack sizing of OpenMP pthreads follows a complementary approach to that for the main thread. There are some differences because the sizing of pthread stacks has differe
200. duce differently rounded results than those from the runtime function. fast_exp is OFF unless -O3 or -Ofast are specified, or -OPT:roundoff=1 is in effect. -OPT:fast_io=(ON|OFF): For C/C++ only. This option enables inlining of printf(), fprintf(), sprintf(), scanf(), fscanf(), sscanf(), and printw(). -OPT:fast_io is only in effect when the candidates for inlining are marked as intrinsic to the stdio.h and curses.h files. Default is OFF. -OPT:fast_math=(ON|OFF): Setting this to ON will tell the compiler to use the fast math functions tuned for the processor. The affected math functions include log, exp, sin, cos, sincos, expf, and pow. The default setting is OFF. It is turned on automatically when -OPT:roundoff is at 2 or above. -OPT:fast_nint=(ON|OFF): This option uses hardware features to implement NINT and ANINT (both single and double precision versions). Default is OFF, but fast_nint=ON is enabled by default if -OPT:roundoff=3 is in effect. -OPT:fast_sqrt=(ON|OFF): This option calculates square roots using the identity sqrt(x) = x * rsqrt(x), where rsqrt is the reciprocal square root operation. This transformation generates fairly accurate code. Default is OFF. Note that in order for -OPT:fast_sqrt=ON to take effect, -OPT:fast_exp must be ON, which tells the compiler to emit inlined instructions instead of calling the library pow function. Also note that -OPT:fast_sqrt is independent of -OPT:rsqrt, which tran
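The identity behind fast_sqrt, sqrt(x) = x * rsqrt(x), can be sketched in C as shown below. This is not the code the compiler emits: rsqrt_estimate is a hypothetical software stand-in for a hardware reciprocal square root instruction (a well-known bit-level initial guess refined by two Newton steps), and it assumes IEEE 754 single-precision floats. It is included only to show why the result is close to, but not necessarily identical to, the library square root.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a hardware rsqrt instruction: approximate
 * 1/sqrt(x) for x > 0 with a bit-level initial guess, then refine it
 * with two Newton-Raphson steps. */
static float rsqrt_estimate(float x)
{
    uint32_t bits;
    float y;

    assert(x > 0.0f);
    memcpy(&bits, &x, sizeof bits);        /* reinterpret float as bits */
    bits = 0x5f3759dfu - (bits >> 1);      /* rough initial guess */
    memcpy(&y, &bits, sizeof y);
    y = y * (1.5f - 0.5f * x * y * y);     /* Newton step 1 */
    y = y * (1.5f - 0.5f * x * y * y);     /* Newton step 2 */
    return y;
}

/* The identity from the text: sqrt(x) = x * rsqrt(x).
 * Zero is handled separately because rsqrt(0) is not finite. */
static float sqrt_via_rsqrt(float x)
{
    assert(x >= 0.0f);
    if (x == 0.0f)
        return 0.0f;
    return x * rsqrt_estimate(x);
}
```

With this stand-in, sqrt_via_rsqrt(4.0f) lands very close to 2 without being guaranteed to equal it exactly, which is the kind of small rounding difference the option descriptions warn about.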
201. dure within an INTERFACE block cannot access identifiers in the host, so the following example gives an error:
    type t
      integer component
    end type t
    integer, parameter :: n = 8
    interface
      subroutine s(a)
        implicit none
        type(t) :: a(n)   ! Type t and integer n are undefined here
      end subroutine s
    end interface
    end
Adding an import statement solves the problem:
    type t
      integer component
    end type t
    integer, parameter :: n = 8
    interface
      subroutine s(a)
        import t, n
        implicit none
        type(t) :: a(n)   ! Type t and integer n are imported from the host
      end subroutine s
    end interface
    end
If you omit the list of identifiers, the IMPORT statement allows the interface body to access any identifier in the host, subject to the rules that would apply to an internal procedure (for example, a local declaration overrides a declaration in the host environment):
    type t
      integer component
    end type t
    integer n
    interface
      subroutine s(a)
        import
        implicit none
        integer, parameter :: n = 8
        type(t) :: a(n)   ! Type t is imported from host, but n is local
      end subroutine s
    end interface
    end
3.4.6 Intrinsic Module ISO_FORTRAN_ENV. The intrinsic module ISO_FORTRAN_ENV provides information about the program's environment. Unlike traditional intrinsic procedures, the declarations in this module are available only if you employ the use statement to access the module
202. e Retrieve an environment variable. name is the name of the variable. value is its value (blank if the variable does not exist or has no value). length is the length of the value (zero if the variable does not exist or has no value). status is 0 if the procedure succeeds, -1 if the actual argument corresponding to value was too short, 1 if the variable does not exist, 2 if the environment does not support environment variables, or another positive number if the retrieval failed for another reason. trim_name is false if trailing blanks in the name should be considered significant, and true otherwise (the usual case). NEW_LINE: character function new_line(a). Return a CHARACTER(1) variable containing the newline character. A is a scalar or array of type CHARACTER. Binary, octal, and hex (BOZ) constants may appear as the A argument of the intrinsic functions INT, REAL, or DBLE, and as the X or Y argument of the intrinsic function CMPLX. Historically, the compiler allowed this as an extension with the REAL, DBLE, and CMPLX intrinsics, converting the BOZ value from integer to floating point. Instead, Fortran 2003 requires those intrinsics to return the floating-point value whose bit pattern matches the BOZ constant. The command-line option -ffortran2003 enables the new interpretation. With -ffortran2003 the following program prints 3.14150; without it, the program prints 1078529664: print 25 5 real z 40490E56
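The two interpretations of the BOZ constant Z'40490E56' can be mimicked in C. The sketch below is an analogy, not compiler code: the function names are hypothetical, and it assumes a 32-bit IEEE 754 float and the default round-to-nearest mode.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Fortran 2003 interpretation: the result is the floating-point value
 * whose bit pattern matches the constant. */
static float boz_as_bit_pattern(uint32_t boz)
{
    float f;

    assert(sizeof f == sizeof boz);
    memcpy(&f, &boz, sizeof f);   /* copy the bits, no conversion */
    return f;
}

/* Historical extension: treat the constant as an integer and convert
 * its value to floating point. */
static float boz_as_integer_value(uint32_t boz)
{
    return (float)boz;
}
```

For 0x40490E56 the first function returns a value close to 3.1415, while the second returns the integer value rounded to the nearest single-precision number, 1078529664, matching the two outputs quoted in the text.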
203. e building phase, where % is iteratively replaced with the rules specified in the try5 subset within the configuration file pathopt2.xml. The % character must be included somewhere in the build command, since this is the mechanism by which the chosen optimization options are propagated to the build command. Finally, factorial is used as the test command. For simple cases, the -o flag can be omitted and the default executable output, a.out, can be used as the test command: pathopt2 -f pathopt2.xml -t try5 N -r a.out pathcc % factorial.c NOTE: The order of the options in the command line does not matter. However, the required build command comes last, since it may have an arbitrary number of options and arguments of its own. When the -f option is not specified, pathopt2 will use the file pathopt2.xml if it is present in the current working directory; otherwise it will use the default pathopt2.xml that ships with the software. The available pathopt2 options are given in Table 7-4. You can also type pathopt2 -h on the command line to get usage information. Table 7-4. pathopt2 Options (columns: Option; Description; Default). Option -D. Description: Do not redirect I/O to /dev/null. This is useful for debugging problems with the compilation, the run, or the build and test scripts. Default: All I/O from the build and test commands will be sent to /dev/null, under the assumption that the program will build and run cleanly.
204. e write and format. The Fortran standard leaves it to the implementation to choose names for any runtime library functions used to implement that behavior. 3. Each compiler may use a different data structure, often called a dope vector, to implement an assumed-shape array argument, allocatable array, or Fortran pointer. In contrast with the C language, the data structure is more elaborate than a simple hardware pointer, because it must be capable of describing the shape, element type, and stride of an array or a section of an array. 4. Each compiler uses a different strategy to mangle or decorate module-level identifiers to generate symbols which will not collide in the flat namespace of the linker. For example, two modules M1 and M2 may each define a public procedure named x, and the program may define a third, Fortran 77 style external procedure which is also named x; all three must have different names from the point of view of the linker. One compiler might use M1 x to represent procedure x belonging to module M1, where another might use X in M1. 5. Each compiler pursues a different strategy to implement the use statement. Even if two compilers both expect to employ a .mod file to communicate module information from one compilation to another, the compilers generally assume different formatting of data inside the .mod file. For the special case of the g77 compiler, PathScale addresses issue 1 by using the same data representatio
205. e a null character after the last significant character. getgid: Like the POSIX function getgid, returns the group ID for this process. getlog: Sets login to the login name for this process. getpid: Like the POSIX function getpid, returns the process ID for this process. getuid: Like the POSIX function getuid, returns the user ID for this process. gmtime: Fortran interface to the C library function gmtime. Sets tarray to the broken-down time corresponding to stime, which can be obtained from the intrinsic time8. All values are in Coordinated Universal Time. tarray must have nine elements: (1) seconds since the last minute, ranging 0 to 61 due to leap seconds; (2) minutes since the last hour, ranging 0 to 59; (3) hours since midnight, ranging 0 to 23; (4) day of month, ranging 1 to 31; (5) month, ranging 0 to 11; (6) years since 1900; (7) days since Sunday, ranging 0 to 6; (8) days since January 1, ranging 0 to 365; (9) positive if daylight savings time is in effect, zero if not, or negative if unknown. hostnm: Fortran interface to the POSIX function gethostname. Sets name to the network name of the host computer. The function form returns 0 on success or an error code from the C library value errno. The subroutine form sets status to the value that the function would return. iargc: Return the number of arguments on the command line used to execute this program, not including the program name itself. idate: The single argument version stores in tarray, which must have three e
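Because the gmtime intrinsic is an interface to the C library function of the same name, the nine tarray elements correspond to the fields of the C struct tm. The sketch below shows that correspondence; the helper name fill_tarray is hypothetical, and the C array is indexed from 0 where the Fortran array would normally start at 1.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical helper: fill a nine-element array in the order the text
 * lists for tarray, using the C library function gmtime(). */
static void fill_tarray(time_t stime, int tarray[9])
{
    const struct tm *utc = gmtime(&stime);

    assert(utc != NULL);
    tarray[0] = utc->tm_sec;    /* seconds since the last minute */
    tarray[1] = utc->tm_min;    /* minutes since the last hour   */
    tarray[2] = utc->tm_hour;   /* hours since midnight          */
    tarray[3] = utc->tm_mday;   /* day of month                  */
    tarray[4] = utc->tm_mon;    /* month, 0 to 11                */
    tarray[5] = utc->tm_year;   /* years since 1900              */
    tarray[6] = utc->tm_wday;   /* days since Sunday             */
    tarray[7] = utc->tm_yday;   /* days since January 1          */
    tarray[8] = utc->tm_isdst;  /* daylight savings flag         */
}
```

Passing a time of 0 (the start of the POSIX epoch) yields day of month 1, month 0, year 70, and weekday 4, since 1 January 1970 was a Thursday.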
206. e aggressively. -LANG:rw_const=(ON|OFF): Tell the compiler whether to treat a constant parameter in Fortran as read-only or read-write. If treated as read-write, the compiler has to generate extra code in passing these constant parameters so as to tolerate their being modified in the called function. The default is OFF, which is more efficient but will cause a segmentation fault if the constant parameter is written into. -LANG:short_circuit_conditionals=(ON|OFF): Handle .AND. and .OR. via short-circuiting, in which the second operand is not evaluated if unnecessary, even if it contains side effects. Default is ON. This flag is applicable only to Fortran; the flag has no effect on C/C++ programs. -LIST: The list option group controls information that gets written to a listing (.lst) file. The individual controls in this group are: -LIST:=(ON|OFF): Enable or disable writing the listing file. The default is ON if any -LIST: group options are enabled. By default, the listing file contains a list of options enabled. -LIST:all_options=(ON|OFF): Enable or disable listing of most supported options. The default is OFF. -LIST:notes=(ON|OFF): If an assembly listing is generated (for example, on -S), various parts of the compiler (such as software pipelining) generate comments within the listing that describe what they have done. Specifying OFF suppresses th
207. e default is OFF. -CG:locs_shallow_depth=(ON|OFF): When performing local instruction scheduling to reduce register usage, give priority to instructions that have shallow depths in the dependence graph. The default is OFF. -CG:movnti=N: Convert ordinary stores to non-temporal stores when writing memory blocks of size larger than N KB. When N is set to 0, this transformation is avoided. The default value is 1000 (KB). -CG:p2align=(ON|OFF): Align loop heads to 64-byte boundaries. The default is OFF. -CG:p2align_freq=N: Align branch targets based on execution frequency. This option is meaningful only under feedback-directed compilation. The default value N=0 turns off the alignment optimization. Any other value specifies the frequency threshold at or above which this alignment will be performed by the compiler. -CG:post_local_sched=(ON|OFF): Enable the local scheduler phase after register allocation. The default is ON. -CG:pre_local_sched=(ON|OFF): Enable the local scheduler phase before register allocation. The default is ON. -CG:prefer_legacy_regs=(ON|OFF): Tell the local register allocator to use the first 8 integer and SSE registers whenever possible (%rax-%rbp, %xmm0-%xmm7). Instructions using these registers have smaller instruction sizes. The default is OFF. -CG:prefetch=(ON|OFF): Enable generation of prefetch instructions in the code generator. The default is ON. -CG:prefetch=OFF and -LNO:prefetch
208. e default static scheduling policy when no chunk size is specified is as follows. The number of iterations of the loop is divided by the number of threads in the team and rounded up to give the chunk size. Loop iterations are grouped into chunks of this size and assigned to threads in order of increasing thread id within the team. If the division was not exact, then the last thread will have fewer iterations, and possibly none at all. The policy for static scheduling when no chunk size is specified can be changed to the static fair policy by defining the environment variable PSC_OMP_STATIC_FAIR. The number of iterations is divided by the number of threads in the team and rounded down to give the chunk size. Each thread will be assigned at least this many iterations. If the division was not exact, then the remaining iterations are scheduled across the threads in increasing thread order until no more iterations are left. The set of iterations assigned to a thread is always contiguous in terms of loop iteration value. Note that the difference between the minimum and maximum number of iterations assigned to individual threads in the team is at most 1. Thus the set of iterations is shared as fairly as possible among the threads. Consider the static scheduling of four iterations across 3 threads. With the default policy, threads 0 and 1 will be assigned two iterations and thr
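The two static policies described above can be sketched as per-thread iteration counts in C. This is a minimal model of the arithmetic in the text, not the OpenMP runtime's code; both function names are hypothetical, and the fair version assumes the leftover iterations go one each to the lowest-numbered threads, which is what keeps the spread at most 1.

```c
#include <assert.h>

/* Default static policy: chunk = ceil(n / nthreads); chunks are handed
 * out in order of increasing thread id, so the last threads may get
 * fewer iterations or none at all. */
static long default_static_count(long n, int nthreads, int tid)
{
    long chunk, start, remaining;

    assert(n >= 0 && nthreads > 0 && tid >= 0 && tid < nthreads);
    chunk = (n + nthreads - 1) / nthreads;   /* rounded up */
    start = chunk * tid;
    if (start >= n)
        return 0;
    remaining = n - start;
    return remaining < chunk ? remaining : chunk;
}

/* Static fair policy: every thread gets floor(n / nthreads) iterations
 * and the leftover iterations are given, one each, to the threads with
 * the lowest ids. */
static long fair_static_count(long n, int nthreads, int tid)
{
    long base, extra;

    assert(n >= 0 && nthreads > 0 && tid >= 0 && tid < nthreads);
    base = n / nthreads;                     /* rounded down */
    extra = n % nthreads;
    return base + (tid < extra ? 1 : 0);
}
```

For four iterations on three threads the default policy gives 2, 2, and 0 iterations, while the fair policy gives 2, 1, and 1. The largest per-thread count is 2 in both cases, which is why the two policies often perform similarly.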
209. e following example, liba.a is an archive containing only a.o, b.o, and c.o. The a.o, b.o, and c.o objects are prelinked to instantiate any required template entities, and the ar -r -c -v liba.a a.o b.o c.o command is executed. All three objects must be specified with -ar, even if only b.o needs to be replaced in liba.a: pathCC -ar -WR,-v -o liba.a a.o b.o c.o See the ld(1) man page for more information about shared libraries and archives. -auto-use module_name[,module_name]...: For Fortran only. Direct the compiler to behave as if a USE module_name statement were entered in your Fortran source code for each module_name. The USE statements are entered in every program unit and interface body in the source file being compiled, for example, pathf95 -auto-use mpi_interface or pathf95 -auto-use shmem_interface. Using this option can add compile time in some situations. -backslash: Treat a backslash as a normal character rather than as an escape character. When this option is used, the preprocessor will not be called. -byteswapio: For Fortran only. Swap bytes during I/O so that unformatted files on a little-endian processor are read and written in big-endian format (or vice versa). In sequential unformatted files, this affects record headers as well as data. To be effective, the option must be used when compiling the Fortran main program. Setting the environment variable FILENV when running the program
210. e invoked when -ipa is not specified. -INLINE:=off suppresses the invocation of the lightweight inliner. The options below are applicable to both the lightweight inliner and IPA's inliner. -INLINE:all performs all possible inlining. Since this results in code bloat, it should only be used if the program is small. -INLINE:list=ON makes the inliner list its actions on the fly. This is a useful option for finding out which functions are getting inlined, which functions are not being inlined, and why. Thus, if you want to inline or not inline a function, tweaking the inlining controls based on the reasons given in the output of this flag should help. -INLINE:must=name1[,name2,...] forces inlining for the named functions. -INLINE:never=name1[,name2,...] suppresses inlining for the named functions. When -ipa is specified, IPA will invoke its own inliner and the lightweight inliner is not invoked. IPA's inliner automatically determines additional functions to inline, in addition to those that are required. Small callees or callers are favored over larger ones. If profile data is available, calls executed more frequently are preferred. Otherwise, calls inside loops are preferred. Leaf routines (functions containing no calls) are also favored. Inlining continues until no more calls satisfy the inlining criteria, which can be controlled by the inlining option
211. e line: pathcc -O3 -LNO:simd_verbose -c stream_d.c The output might look something like this: (stream_d.c:103) LOOP WAS VECTORIZED. (stream_d.c:119) LOOP WAS VECTORIZED. (stream_d.c:142) LOOP WAS VECTORIZED. (stream_d.c:147) LOOP WAS VECTORIZED. (stream_d.c:152) LOOP WAS VECTORIZED. (stream_d.c:157) LOOP WAS VECTORIZED. (stream_d.c:164) Nonvectorizable ops/non-unit stride. Loop was not vectorized. (stream_d.c:211) Nonvectorizable ops/non-unit stride. Loop was not vectorized. This would tell you more about what the compiler is doing with loops. You can also try the -LNO:vintr_verbose flag on the compile line: pathcc -O3 -LNO:vintr_verbose -c stream_d.c In this case there is no output, because there are no intrinsic functions to vectorize in STREAM. Section 8: Using OpenMP and Autoparallelization. The PathScale Compiler Suite includes OpenMP and autoparallelization for Fortran and C/C++. 8.1 OpenMP. This implementation of OpenMP supplies parallel directives that comply with the OpenMP Application Program Interface (API) specification 2.5. Runtime libraries and environment variables are also included. This section is not a tutorial on how to use OpenMP. To learn more about using OpenMP, please see a reference like Parallel Programming in OpenMP by R
212. e location of the assign file. See the assign(1) man page for more details. FTN_SUPPRESS_REPEATS (Fortran): Output multiple values instead of using the repeat factor; used at runtime. NLSPATH (Fortran): Flags for runtime and compile-time messages. PSC_CFLAGS (C): Flags to pass to the C compiler, pathcc. PSC_COMPILER_DEFAULTS_PATH: Specifies a path or colon-separated list of paths designating where the compiler is to look for the compiler.defaults(5) file. If the environment variable is set, the path /opt/pathscale/etc will not be used. If the file cannot be found, then no defaults file will be used, even if one is present in /opt/pathscale/etc. PSC_PROBLEM_REPORT_DIR: Name a directory in which to save problem reports and preprocessed source files if the compiler encounters an internal error. If not specified, the directory used is $HOME/.ekopath-bugs. PSC_CXXFLAGS (C++): Flags to pass to the C++ compiler, pathCC. PSC_ENABLE_SEGV_HANDLER (Fortran): The Fortran runtime system provides a signal handler to print helpful information if a segmentation violation occurs. If this variable exists, a value of 0 disables the handler and any other value enables it. If this variable does not exist, then the handler is disabled if the operating system core file limit (see ulimit(1)) is not zero. Core file stack traces often work better without the handler. PSC_FFLAGS (Fortran): Flags to pass to the Fortran compiler, pathf95.
213. e memory more often, which causes program slow down. In addition, too much inlining can slow down the later phases of the compilation process. Many function calls pass constants (including addresses of variables) as parameters. Replacing a formal parameter by its known constant value helps in the optimization of the function body. Very often, part of the code of the function can be determined to be useless and deleted. Function cloning creates different clones of a function with its parameters customized to the forms of the calls. It provides a subset of the benefits of inlining without increasing the size of the function that contains the call. Like inlining, it also increases the total size of the program. If IPA can determine that all the calls pass the same constant parameter, it will perform constant propagation for the parameter. This has the same benefit as function cloning but does not increase the size of the program. [Figure 7-1. IPA Compilation Model] Constant propagation also applies to global variables. If a global variable is found to be constant throughout the entire program execution, IPA will replace the variable by the constant value. Dead variable elimination finds glo
214. e of a struct has no effect. -Wno-packed tells the compiler not to warn when the packed attribute of a struct has no effect. -W[no-]padded: For C/C++ only. -Wpadded warns when padding is included in a struct. -Wno-padded tells the compiler not to warn when padding is included in a struct. -W[no-]parentheses: For C/C++ only. -Wparentheses warns about possible missing parentheses. -Wno-parentheses tells the compiler not to warn about possible missing parentheses. -W[no-]pointer-arith: For C/C++ only. -Wpointer-arith warns about function pointer arithmetic. -Wno-pointer-arith tells the compiler not to warn about function pointer arithmetic. -W[no-]redundant-decls: For C/C++ only. -Wredundant-decls warns about multiple declarations of the same object. -Wno-redundant-decls tells the compiler not to warn about multiple declarations of the same object. -W[no-]reorder: For C++ only. The -Wreorder option warns when reordering member initializers. -Wno-reorder tells the compiler not to warn when reordering member initializers. -W[no-]return-type: For C/C++ only. -Wreturn-type warns when a function return type defaults to int. -Wno-return-type tells the compiler not to warn when a function return type defaults to int. -W[no-]sequence-point: For C/C++ only. -Wsequence-point warns about code violating sequence point rules. -Wno-sequence-point tells
215. e replaced by temporaries that are incremented together with the loop variable. When strength reduction is overdone, the additional temporaries increase register pressure, resulting in excessive register spills that decrease performance. The value specified must be a positive integer, which specifies the maximum number of induction expressions that will be strength-reduced across an index variable increment. When set at 0, strength reduction is only performed for non-trivial induction expressions. The default is 11. -WOPT:const_pre=(ON|OFF): When OFF, disables the placement optimization for loading constants to registers. Default is ON. -WOPT:if_conv=(0|1|2): Controls the optimization that translates simple IF statements to conditional move instructions in the target CPU. Setting to 0 suppresses this optimization. The value of 1 designates conservative if-conversion, in which the context around the IF statement is used in deciding whether to if-convert. The value of 2 enables aggressive if-conversion by causing it to be performed regardless of the context. The default is 1. -WOPT:ivar_pre=(ON|OFF): When OFF, disables the partial redundancy elimination of indirect loads in the program. Default is ON. -WOPT:mem_opnds=(ON|OFF): Makes the scalar optimizer preserve any memory operands of arithmetic operations, so as to help bring about subsumption of memory loads into the operands of arithmetic operations
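The strength reduction described at the start of this entry can be illustrated by hand in C. The two functions below are hypothetical examples written for this sketch; they show the kind of rewrite the optimizer performs on an induction expression, not its actual output.

```c
#include <assert.h>
#include <stddef.h>

/* Before: the induction expression i * stride is recomputed with a
 * multiplication on every iteration. */
static long sum_strided_indexed(const long *a, size_t n, size_t stride)
{
    long sum = 0;

    assert(a != NULL);
    for (size_t i = 0; i < n; i++)
        sum += a[i * stride];
    return sum;
}

/* After: the multiplication is replaced by a temporary (offset) that
 * is incremented together with the loop variable. Each such temporary
 * occupies a register, which is why overdoing it raises register
 * pressure. */
static long sum_strided_reduced(const long *a, size_t n, size_t stride)
{
    long sum = 0;
    size_t offset = 0;

    assert(a != NULL);
    for (size_t i = 0; i < n; i++) {
        sum += a[offset];
        offset += stride;
    }
    return sum;
}
```

Both versions visit the same elements and return the same sum; only the way the index is produced differs.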
216. e the TEST instruction instead of CMP. See Opteron's instruction description for the difference between these two instructions. The default is OFF. -clist: For C only. Enable the C listing. Specifying -clist is the equivalent of specifying -CLIST:=ON. -CLIST: For C only. The -CLIST: option group controls emission of the compiler's internal program representation back into C code, after IPA inlining and loop-nest transformations. This is a diagnostic tool, and the generated C code may not always be compilable. The generated C code is written to two files, a header file containing file-scope declarations, and a file containing function definitions. With the exception of -CLIST:=OFF, any use of this option implies -clist. The individual controls in this group are as follows: -CLIST:=(ON|OFF): Enable the C listing. This option is implied by any of the others, but may be used to enable the listing when no other options are required. For example, specifying -CLIST:=ON is the equivalent of specifying -clist. -CLIST:dotc_file=filename: Write the program units into the specified file, filename. The default is the source file name with the extension .w2c.c. -CLIST:doth_file=filename: Specify the file into which file-scope declarations are deposited. Defaults to the source file name with the extension .w2c.h. -CLIST:emit_pfetch[=(ON|OFF)]: Display prefetch information as comments in the transformed source. If ON or OFF is not specified, the def
217. e to execute. If instead you wish to stop the program, here are two options to do so. Option 1: This option works if you are running on a machine with SSE SIMD instructions. Set these compiler options to OFF to unmask the exceptions on which you wish to trap: -TENV:simd_imask=OFF traps invalid; -TENV:simd_dmask=OFF traps denormalized; -TENV:simd_zmask=OFF traps divide by zero; -TENV:simd_omask=OFF traps overflows; -TENV:simd_umask=OFF traps underflows; -TENV:simd_pmask=OFF traps imprecise. For more information, see the eko man page. Option 2: Use the TR15580 floating-point features in your code. The following example will work on any machine, but only for Fortran. NOTE: If you are using C or C++, try the GNU C library extensions feenableexcept and fedisableexcept, which are documented in the GNU man pages. This will also work on any machine. Here is the Fortran example, named ieee.f95:
    USE IEEE_EXCEPTIONS
    REAL A, B, C
    ! Uncomment the halt mode you need to use.
    ! IEEE_USUAL implies IEEE_INVALID, IEEE_OVERFLOW and IEEE_DIVIDE_BY_ZERO
    CALL ieee_set_halting_mode(IEEE_INVALID, .TRUE.)
    CALL ieee_set_halting_mode(IEEE_OVERFLOW, .TRUE.)
    CALL ieee_set_halting_mode(IEEE_DIVIDE_BY_ZERO, .TRUE.)
    CALL ieee_set_halting_mode(IEEE_UNDERFLOW, .TRUE.)
    CALL ieee_set_halting_mode(IEEE_INEXACT, .TRUE.)
    CALL ieee_set_halting_mode(IEEE_USUAL, .TR
218. ead 2 will be assigned no iterations. With the fair policy, thread 0 will be assigned two iterations and threads 1 and 2 will be assigned one iteration each.

NOTE: The maximum number of iterations assigned to a thread (which determines the worst-case path through the schedule) is the same for the default scheduling policy and the fair scheduling policy. In many cases the performance of these two scheduling policies will be very similar.

PSC_OMP_THREAD_SPIN (Integer value)
This takes a numeric value and sets the number of times that the spin loops will spin at user level before falling back to O/S schedule/reschedule mechanisms. By default it is 100. If there are more active threads than processors and this is set very high, then the thread contention will typically cause a performance drop. Synchronization using the O/S schedule and reschedule mechanisms is significantly more expensive, but frees up execution resources for other threads.

8.10 OpenMP Stack Size

8.10.1 Stack Size for Fortran

The Fortran compiler allocates data on the stack by default. Some environments set a low limit on the size of a process stack, which may cause Fortran programs that use a large amount of data to crash shortly after they start. In an OpenMP program, there is a stack for the main thread of execution, as in serial programs, and also an additional separate stack for each additional thread created by libopenmp. These additional threads are created by the POSIX threa
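As a rough illustration of the scheduling note in this excerpt, the following C sketch computes how many iterations each thread would receive under the two static policies. The 4-iteration, 3-thread case and the ceiling-sized block model of the default policy are assumptions inferred from the iteration counts quoted in the text, and all function names are hypothetical; this is not the libopenmp implementation.

```c
#include <stdio.h>

/* Assumed model of the default static policy: iterations are handed out
 * in blocks of ceil(n / nthreads), so trailing threads may get nothing. */
int default_share(int n, int nthreads, int tid)
{
    int chunk = (n + nthreads - 1) / nthreads;
    int start = tid * chunk;
    int left = n - start;

    if (left <= 0)
        return 0;
    return left < chunk ? left : chunk;
}

/* Assumed model of the fair policy: every thread gets floor(n / nthreads)
 * iterations and the first n % nthreads threads get one extra. */
int fair_share(int n, int nthreads, int tid)
{
    return n / nthreads + (tid < n % nthreads ? 1 : 0);
}

/* Print both distributions side by side (hypothetical helper). */
void print_shares(int n, int nthreads)
{
    int tid;

    for (tid = 0; tid < nthreads; tid++)
        printf("thread %d: default=%d fair=%d\n",
               tid, default_share(n, nthreads, tid),
               fair_share(n, nthreads, tid));
}
```

Calling print_shares(4, 3) reproduces the counts described in the note. In both models the largest share is two iterations, which is consistent with the remark that the worst-case path through the schedule is the same.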
219. eatures: http://users.erols.com/dnagle/coco.html

To pass those options through the compiler driver to the preprocessor, you can use the -Wp,options flag. For example, you can use -Wp,-m to pass the -m option to the preprocessor to turn off macro preprocessing. Note that the instructions given in that web page for passing file names to the preprocessor, and identifying the "set file", are not relevant when you use the PathScale compiler, since the compiler automatically passes each source file name to the preprocessor for you, captures the preprocessor output for compilation, and identifies the set file as described in the preceding paragraphs. More information about the -fcoco option can be found in the eko man page.

3.6.4.1 Pre-defined Macros

The PathScale compiler pre-defines some macros for preprocessing code. When you use the C preprocessor cpp with Fortran, or rely on the .F, .F90, and .F95 suffixes to use the default cpp preprocessor, the PathScale compiler uses the same preprocessor it uses for C, with the addition of the following macros:

  LANGUAGE_FORTRAN 1
  _LANGUAGE_FORTRAN 1
  _LANGUAGE_FORTRAN90 1
  LANGUAGE_FORTRAN90 1
  __unix 1
  unix 1
  __unix__ 1

NOTE: When using an optimization level at -O1 or higher, the compiler will set and use the __OPTIMIZE__ macro with cpp.

See the complete list of macros for cpp in Section 4.2.1.1. If you use the Fortran preprocessor -ftpp, only these five macros are defined for you
220. ed above it.

NOTE: The pathprof program included in the PathScale Compiler Suite is a symbolic link to your system's gprof executable. The pathprof and pathcov programs link to the gprof and gcov executables in the version of GCC on which the PathScale Compiler Suite is based. Please note that the pathprof tool will generate a segmentation fault when used with OpenMP applications that are run with more than one thread. There is no current workaround for pathprof (or gprof).

Now we note that the total time that pathprof measures is 163.3 secs vs. the 150.3 that we measured for the original -O2 binary. But considering that the -O2 -pg instrumented binary took 247 seconds to run, this is a pretty good estimate. It is nice that the top hot spot, zgemm, consumes about 50% of the total time. We also note that some very small routines, zaxpy, zcopy, and lsame, are called a very large number of times. These look like ideal candidates for inlining.

In the second part of the pathprof output (after the explanation of the column headings for the flat profile) is a call-graph profile. In the example of such a profile below, one can follow the chain of calls from main to matmul, muldoe, su3mul, and zgemm_, where most of the time is consumed.

Additional call-graph profile info:

  Call graph (explanation follows)
  granularity: each sample hit covers 4 byte(s) for 0.01% of 163.32 second
221. ed from a single function. The default is 0, which implies that cloning is turned OFF by default.

-IPA:node_bloat=N specifies the maximum percentage growth in the number of procedures, relative to the original program, that cloning can produce. The default is 100.

7.3.6 Other IPA Tuning Options

The following are options unrelated to inlining and cloning, but useful in tuning:

-IPA:common_pad_size=N specifies that common block padding should use a pad size of up to N bytes. The default value is 0, which specifies that the compiler will determine the best padding size.

-IPA:linear=ON enables linearization of array references. When inlining Fortran subroutines, IPA tries to map formal array parameters to the shape of the actual parameters. The default is OFF, which means IPA will suppress the inlining if it cannot do the mapping. Turning this option ON instructs IPA to still perform the inlining, but to linearize the array references. Such linearization may cause performance problems, but the inlining may produce more performance gain.

-IPA:pu_reorder=N controls IPA's procedure reordering optimization. A value of 0 disables the optimization. N=1 enables reordering based on the frequency with which different procedures are invoked. N=2 enables procedure reordering based on caller-callee relationship. The default is 0.

-IPA:field_reorder=ON enables IPA's field reordering optimization
222. efined rather than using default Fortran 90 rules.

Print (on standard error output) the commands executed to run the stages of compilation. Also print the version number of the compiler driver program and of the preprocessor and the compiler proper.

Write compiler release version information to stdout. No input file needs to be specified when this option is used.

-Wc,arg1[,arg2...]
Pass the argument(s) argi to the compiler pass c, where c is one of [pfibal]. The c selects the compiler pass according to the following table:

  Character   Name
  p           preprocessor
  f           front-end
  i           inliner
  b           backend
  a           assembler
  l           loader

Sets of these phase names can be used to select any combination of phases. For example, -Wba,-o,foo passes the option -o foo to the b and a phases.

-Wall
Enable most warning messages.

-WB,
-WB,<arg> passes <arg> to the backend via ipacom.

-W[no-]aggregate-return
(For C/C++ only) -Waggregate-return warns about returning structures, unions, or arrays. -Wno-aggregate-return will not warn about returning structures, unions, or arrays.

-W[no-]bad-function-cast
-Wbad-function-cast warns when a function call is cast to a non-matching type. -Wno-bad-function-cast tells the compiler not to warn when a function call is cast to a non-matching type.

-W[no-]cast-align
(For C/C++ only) -Wcast-align warns about pointer casts that increase alignment. -Wno-cast-align ins
223. efined reference to `rand_'
  fantasian.o(.text+0xab48): undefined reference to `rand_'
  fantasian.o(.text+0xab82): undefined reference to `rand_'
  fantasian.o(.text+0xabbf): undefined reference to `rand_'
  fantasian.o(.text+0xee0a): more undefined references to `rand_' follow
  collect2: ld returned 1 exit status

The problem is that RAND is not ANSI. The solution is to build the code with the flag -intrinsic=PGI.

5.4.2 Name mangling

Name mangling ensures that function, subroutine, and common-block names from a Fortran program or library do not clash with names in libraries from other programming languages. This makes mixing code from C, C++, and Fortran easier. See section 3.10.1 for details on name mangling.

5.4.3 Static Data

Some codes expect data to be initialized to zero and allocated in the heap. If this is the case with your code, use the -static flag when compiling.

5.5 Porting to x86_64

Keep these things in mind when porting existing code to x86_64:

  Some source packages make assumptions about the locations of libraries and fail to look in lib64-named directories for libraries, resulting in unresolved symbols during the link.

  For the x86 platform, use the -mcpu flag x86any to specify the x86 platform, like this: -mcpu=x86any

5.6 Migrating from Other Compilers

Here is a suggested step-by-step approach to migrating code from other compilers to the
224. ematical expressions, and changing the order or number of floating-point operations can slightly change the result. Example:

  A = 2. * X
  B = 4. * Y
  C = 2. * (X + 2. * Y)

A clever compiler will notice that C = A + B. But the order of operations is different, and so a slightly different C will be the result. This particular transformation is controlled by the -OPT:roundoff flag, but there are several other numerically unsafe flags.

Some options that fall into this category are the options that control IEEE behavior, such as -OPT:roundoff=N and -OPT:IEEE_arithmetic=N. Here are a couple of others:

-OPT:div_split=(ON|OFF)
This option enables or disables transforming expressions of the form X/Y into X*(1/Y). The reciprocal is inherently less accurate than a straight division, but may be faster.

-OPT:recip=(ON|OFF)
This option allows expressions of the form 1/x to be converted to use the reciprocal instruction of the computer. This is inherently less accurate than a division, but will be faster.

These options can have performance impacts. For more information, see the eko manual page. You can view the manual page by typing man eko at the command line.

7.7.3 Fast-math Functions

When -OPT:fast_math=on is specified, the compiler uses fast versions of math functions tuned for the processor. The affected math functions include log, exp, sin, cos, sincos
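The sensitivity to operation order described in this excerpt can be seen with a small C sketch. It is a generic illustration of floating-point reassociation with values chosen for the purpose (0.1, 0.2, 0.3), not the specific A, B, C transformation quoted in the text. The function names are hypothetical, and volatile is used only to keep the compiler from reordering the two additions itself.

```c
/* Sum three doubles left to right: (a + b) + c. */
double sum_left(double a, double b, double c)
{
    volatile double t = a + b;  /* force the intermediate to be rounded */
    return t + c;
}

/* Sum the same three doubles right to left: a + (b + c). */
double sum_right(double a, double b, double c)
{
    volatile double t = b + c;  /* force the intermediate to be rounded */
    return a + t;
}

/* Return 1 if the two evaluation orders give different results. */
int order_matters(double a, double b, double c)
{
    return sum_left(a, b, c) != sum_right(a, b, c);
}
```

With IEEE double arithmetic, order_matters(0.1, 0.2, 0.3) reports a difference in the last bit, while exactly representable inputs such as 1.0, 2.0, 3.0 give the same result in either order. This is the kind of small discrepancy that a reassociating optimization can introduce.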
225. en this occurs because a program employs array features introduced in the Fortran 90 standard along with procedures having traditional Fortran 77-style implicit interfaces. In particular, Fortran 77-style procedures expect all arrays to be contiguous in memory, but Fortran 90 permits arrays whose elements are scattered or strided. The copying takes time, but contiguous arrays may better use the processor cache memory. Whether the program runs faster or slower depends on whether one of those factors dominates the other, and that depends on the details of the program.

Because unintended copying can slow program execution, the compiler provides optional warnings about it. The example below shows two out of many situations in which copying takes place: one in which copying is conditional on the nature of the array, and another in which copying is unconditional.

  $ cat cico.f90
  subroutine possible(a, n)
    implicit none
    integer :: n
    integer, dimension(n) :: a
    print "(a,25i5)", "possible", a
  end subroutine possible

  program copier
    implicit none
    logical :: l
    integer :: i
    integer, target :: a(5,5) = reshape((/ (i, i=1,25) /), (/ 5, 5 /))
    integer, pointer, dimension(:,:) :: p
    read *, l
    if (l) then
      p => a
    else
      p => a(1:5:2,1:5:2)
    endif
    ! Because possible() does not have an explicit interface, it expects
    ! a contiguous array. Therefore the compil
226. er generates a runtime test to check a "contiguous" bit belonging to the pointer p, and if the target is not contiguous, the values are copied to a temporary array before the call and copied back after the call.

    call possible(p, size(p))
    ! The compiler must always copy this sequence array to a temporary
    ! variable to make it contiguous
    call possible(a((/1,2,5/),(/2,3,5/)), size(a((/1,2,5/),(/2,3,5/))))
  end program copier

  $ pathf90 -fullwarn -c cico.f90

  call possible(p, size(p))
  pathf95-1438 pathf90: CAUTION COPIER, File = cico.f90, Line = 26, Column = 17
    This argument produces a possible copy in and out to a temporary variable.

  call possible(a((/1,2,5/),(/2,3,5/)), size(a((/1,2,5/),(/2,3,5/))))
  pathf95-1438 pathf90: CAUTION COPIER, File = cico.f90, Line = 30, Column = 18
    This argument produces a copy in to a temporary variable.

  pathf95: PathScale(TM) Fortran Version 2.9.99 (f14) Thu Dec 7, 2006 06:03:17
  pathf95: 32 source lines
  pathf95: 0 Error(s), 0 Warning(s), 2 Other message(s), 0 ANSI(s)
  pathf95: "explain pathf95-message number" gives more information about each message

One way to minimize copying while still taking advantage of Fortran 90 features is to use Fortran 90-style assumed-shape and deferred-shape arrays (that is, arrays whose bounds look like (:,:) rather than (2,3) or (n,m)) for all dummy
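The copy-in/copy-out mechanism described in this excerpt can be sketched in C. This is only an illustration of the idea, under the assumption that a strided section is gathered into a contiguous temporary, passed to the callee, and scattered back afterwards; it is not the code the PathScale compiler generates, and scale_by_two and call_with_copy are hypothetical names.

```c
#include <stdlib.h>

/* Stand-in for a Fortran 77-style procedure that expects contiguous data. */
void scale_by_two(int *a, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        a[i] *= 2;
}

/* Call `callee` on the strided section base[0], base[stride], ...
 * (count elements) by copying it in to a temporary and back out.
 * Returns 0 on success, -1 if the temporary cannot be allocated. */
int call_with_copy(int *base, size_t count, size_t stride,
                   void (*callee)(int *, size_t))
{
    int *tmp;
    size_t i;

    if (count == 0)
        return 0;
    tmp = malloc(count * sizeof *tmp);
    if (tmp == NULL)
        return -1;

    for (i = 0; i < count; i++)      /* copy in */
        tmp[i] = base[i * stride];
    callee(tmp, count);
    for (i = 0; i < count; i++)      /* copy out */
        base[i * stride] = tmp[i];

    free(tmp);
    return 0;
}
```

The two extra loops and the allocation are the cost that the compiler warnings draw attention to. When the argument is already contiguous, the callee can be handed the original storage directly and none of this work is needed.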
227. erated by the compiler. For best performance, the number of threads should typically be equal to the number of processors you will be using.

The amount of speedup you can get under parallel execution depends a great deal on the algorithms used and the way the OpenMP directives are used. Programs that exhibit a high degree of coarse-grain parallelism can achieve significant speedup as the number of processors is increased.

Appendix B describes the implementation-dependent behavior for PathScale's OpenMP in C/C++ and Fortran.

For more information on OpenMP and the OpenMP specification, please see the OpenMP website at http://www.openmp.org.

8.2 Autoparallelization

Under autoparallelization, the compiler tries to parallelize program code without depending on user directives. Autoparallelization is invoked by specifying the -apo option on the compile and link lines:

  $ pathf95 -apo -c foo.F95
  $ pathf95 -apo -o foobar foo.o bar.o

Since the compiler is only able to parallelize a subset of the loops that the user knows are parallelizable, OpenMP directives are always helpful. OpenMP directives are not seen by the compiler unless -mp is specified. Thus, for programs that contain OpenMP directives, autoparallelization can be combined with OpenMP to additionally parallelize code that does not contain OpenMP directives. In this case it is good
228. ere are more details and an example using pathprof later in section 9, but the following steps are all that are needed to get started in profiling:

  1. Add the -pg flag to both the compile and link steps with the PathScale compilers. This generates an instrumented binary.
  2. Run the program executable with the input data of interest. This creates a gmon.out file with the profile data.
  3. Run pathprof <program-name> to generate the profiles. The standard output of pathprof includes two tables:
     a. A flat profile with the time consumed in each routine and the number of times it was called, and
     b. A call-graph profile that shows, for each routine, which routines it called and which other routines called it. There is also an estimate of the inclusive time spent in a routine and all of the routines called by that routine.

NOTE: The pathprof tool will generate a segmentation fault when used with OpenMP applications that are run with more than one thread. There is no current workaround for pathprof (or gprof).

See section 9 for a more detailed example of profiling.

2.12 taskset: Assigning a Process to a Specific CPU

To improve the performance of your application on multiprocessor machines, it is useful to assign the process to a specific CPU. The tool used to do this is taskset, which can be used to retrieve or set a process' affinity. This comman
229. ero behavior is controlled by the -OPT:IEEE_arith flag. Setting it to either 2 or 3 will result in flush-to-zero. The -OPT:IEEE_arith flag defaults to 1 under -O0/-O1/-O2, and it defaults to 2 under -O3, as seen in the table above. The compilation flag works by generating instructions to do the setting at the entry to main. During runtime, it can be further set by the IEEE_SET_UNDERFLOW_MODE Fortran intrinsic, found in the intrinsic module IEEE_ARITHMETIC. Gradual underflow means "produce denormalized numbers."

  USE, INTRINSIC :: IEEE_ARITHMETIC
  CALL IEEE_SET_UNDERFLOW_MODE(GRADUAL=.TRUE.)

7.8 Hardware Performance

Although the x86_64 platform has excellent performance, there are a number of subtleties in configuring your hardware and software that can each cause substantial performance degradations. Many of these are not obvious, but they can reduce performance by 30% or more at a time. We have collected a set of techniques for obtaining best performance, described below.

7.8.1 Hardware Setup

There is no "catch-all" memory configuration that works best across all systems. We have seen instances where the number, type, and placement of memory modules on a motherboard can each affect the memory latency and bandwidth that you can achieve.

Most motherboard manuals have tables that document the effects of memory placement in different slots. We recommend that you read the table
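Whether flush-to-zero is in effect can also be observed from a running program. The following C sketch (the function name is hypothetical) halves the smallest normalized double: under gradual underflow the result is a nonzero denormalized number, while under flush-to-zero it becomes 0. It assumes IEEE double arithmetic and says nothing about how the compiler or runtime actually sets the mode.

```c
#include <float.h>

/* Return 1 if dividing the smallest normalized double by two yields a
 * nonzero (denormalized) value, 0 if the result was flushed to zero.
 * volatile keeps the compiler from folding the division at compile time. */
int gradual_underflow_active(void)
{
    volatile double smallest_normal = DBL_MIN;
    volatile double half = smallest_normal / 2.0;

    return half != 0.0;
}
```

A program built with default IEEE settings would be expected to return 1 here; one built with options that enable flush-to-zero at the entry to main would be expected to return 0.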
230. ers named constants for use in declarations. These are accessible from an intrinsic module ISO_C_BINDING, which you can obtain with an ordinary use statement (adding an intrinsic clause ensures that you use the intrinsic version in the unlikely event that your program has defined its own module named ISO_C_BINDING). For example, you can use c_int to ensure that a Fortran integer declaration is compatible with a C int declaration, and you can use c_float to ensure that a Fortran real declaration is compatible with a C float declaration:

  module m3
    use, intrinsic :: iso_c_binding
    integer(c_int), bind(c) :: m3ivar   ! Compatible with C "int"
    real(c_float), bind(c) :: m3rvar    ! Compatible with C "float"
  end module m3

In the earlier examples, we used default INTEGER and REAL types under the assumption that these are compatible with C int and float types. That assumption is correct for the PathScale Fortran and C compilers, and is likely to be correct for most compilers, but for greatest portability one would always use the predefined constants to ensure the code will work correctly even under a compiler for which that assumption did not hold. The following table shows all the types available.

Table 3-1. Compatible Fortran and C Types

  Fortran Type             C Type
  integer(c_int)           int
  integer(c_short)         short int
  integer(c_long)          long int
  integer(c_long_long)     long long int
  integer(c_signed_char)   signed char
  integer(c_size_t)        size_t
  integer
232. es, routines, and pathnames.

  variable     Italic typeface is used for variable names or concepts being defined.
  user input   Bold, fixed-space font is used for literal items the user types in. Output is shown in non-bold, fixed-space font.
  $            Indicates a command line prompt.
  [ ]          Brackets enclose optional portions of a command or directive line.
  ...          Ellipses indicate that a preceding element can be repeated.
  NOTE:        Indicates important information.

1.2 Documentation Suite

The PathScale Compiler Suite product documentation set includes:

  The PathScale Compiler Suite and Subscription Manager Install Guide
  The PathScale Compiler Suite User Guide
  The PathScale Compiler Suite Support Guide
  The PathScale Debugger User Guide

There are also online manual pages ("man pages") available describing the flags and options for the PathScale Compiler Suite. These man pages are a subset of the pages that are shipped with the Compiler Suite: eko, pathf95, pathf90, pathcc, pathCC. The pathscale_intro man page gives a complete list of all the various man pages that are included with the Compiler Suite.

Please see the PathScale website for further information about current releases and developer support: http://www.pathscale.com/support.html

In addition, you may want to refer to language reference books for more information on compilers and language usage. Programming and language reference books are often a matter of personal taste. Everyone
233. es are set to 0 if they are not available from the relevant file system.

ftell
Fortran interface to the C library function ftell. Treats logical unit unit as a stream of bytes. The function form returns the offset from the beginning of the file to the position pointer used to read or write the file, or -1 to indicate an error. The subroutine form sets offset to the value which the function would return.

gerror
Fortran interface to the C library function strerror. Sets message to the error message corresponding to the error code from the C library variable errno.

getarg
Stores into value an argument from the command line used to execute this process. pos is an index into the argument list, where 0 identifies the name of the program, 1 identifies the first argument, etc. Intrinsic iargc provides the number of arguments available.

getcwd
Fortran interface to the C library function getcwd. Sets name to the current working directory name. The function form returns 0 for success or an error code from the C library value errno. The subroutine form sets status to the value which the function would return.

getenv
Fortran interface to the C library function getenv. Sets value to the value of the environment variable whose name is name, or to blanks if the variable is missing or not set. Trailing blanks in name are ignored; you can prevent this by using char(0) to plac
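The getenv behavior described in this excerpt, where a missing variable yields blanks, can be pictured on the C side as follows. This is a sketch under the assumption that the Fortran character variable is a fixed-length, blank-padded buffer; it is not the PathScale runtime's implementation, and getenv_blank_padded is a hypothetical name.

```c
#include <stdlib.h>
#include <string.h>

/* Copy the value of environment variable `name` into the fixed-length
 * buffer `value` of `len` characters, Fortran style: the value is
 * truncated if too long, padded with blanks if shorter, and the buffer
 * is all blanks if the variable is not set. No NUL terminator is stored. */
void getenv_blank_padded(const char *name, char *value, size_t len)
{
    const char *s = getenv(name);
    size_t n = (s != NULL) ? strlen(s) : 0;

    if (n > len)
        n = len;
    memset(value, ' ', len);
    if (n > 0)
        memcpy(value, s, n);
}
```

The same blank-padding convention would apply to the other character results listed here, such as the directory name produced by getcwd or the message produced by gerror.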
234. ese comments. The default is ON.

-LIST:options=(ON|OFF)
Enable or disable listing of the options modified (directly in the command line, or indirectly as a side effect of other options). The default is OFF.

-LIST:symbols=(ON|OFF)
Enable or disable listing of information about the symbols (variables) managed by the compiler.

-LNO: ...
This group specifies options and transformations performed on loop nests by the Loop Nest Optimizer (LNO). The -LNO options are enabled only if the optimization level of -O3 or higher is in effect. For information on the LNO options that are in effect during a compilation, use the -LIST:all_options=ON option.

-LNO:apo_use_feedback=(ON|OFF)
Effective only when specified with -apo under feedback-directed compilation, this flag tells the auto-parallelizer whether to use the feedback data of the loops in deciding whether each loop should be parallelized. When the compiler parallelizes a loop, it generates both a serial and a parallel version. If the trip count of the loop is small, it is not beneficial to use the parallel version during execution. When this flag is set to ON and the feedback data indicates that the loop has a small trip count, the auto-parallelizer will not generate the parallel version, thus saving the runtime check needed to decide whether to execute the serial or parallel version of the loop. The default is OFF.

-LNO:build_scalar_reductions=(ON|O
235. et to <OFF> to ignore

Defaults/Comments entries for the preceding rows: 0 indicates no cache at that level (three rows); OFF for each option; 0 indicates no cache at that level; N is hardware dependent (two rows).

Table E-1. Summary of Compiler Options by Function

  Option                                                  Defaults        Comments
  -LNO:tlbcmp1=N,tlbcmp2=N,tlbcmp3=N,tlbcmp4=N,
       tlbdmp1=N,tlbdmp2=N,tlbdmp3=N,tlbdmp4=N                            <N> is hardware dependent

  LNO Prefetch Options                                    Defaults        Comments
  -LNO:pf1=(ON|OFF),pf2=(ON|OFF),pf3=(ON|OFF),pf4=(ON|OFF)
  -LNO:prefetch=(0|1|2|3)                                 <2>
  -LNO:prefetch_ahead=N                                   <2>
  -LNO:prefetch_manual=(ON|OFF)                           <ON>
  -LNO:trip_count_assumed_when_unknown                                    Replaces -LNO:assume_unknown_trip_count (0 1000)

  Math Precision Options                                  Defaults        Comments
  -f[no-]fast-math                                                        Implied by -Ofast
  -ffloat-store
  -fno-math-errno                                         -fmath-errno
  -f[no-]unsafe-math-optimizations
  -mx87-precision=(32|64|80)                              <80>
  -noexpopt

  Optimization Options                                    Defaults        Comments
  -apo
  -GRA:home=(ON|OFF)                                      <ON>
  -GRA:optimize_boundary=(ON|OFF)                         <OFF>
  -Ofast
  -OPT:alias=typed
  -OPT:alias=restrict
  -OPT:alias=disjoint
  -OPT:alias=no_f90_pointer_alias
  -OPT:align_unsafe=(ON|OFF)
  -OPT:asm_memory=(ON|OFF)
  -O(0|1|2|3|s)                                           <2>             This is the global optimi
236. expf and pow. In general, the accuracy is within 1 ulp of the fully precise result, though the accuracy may be worse than this in some cases. The routines may not raise IEEE exception flags. They call no error handlers, and denormal number inputs/outputs are typically treated as 0, but may also produce unexpected results. -OPT:fast_math=on takes effect when -OPT:roundoff is set to 2 or above.

A different flag, -ffast-math, improves FP speed by relaxing ANSI and IEEE rules. -fno-fast-math tells the compiler to conform to ANSI and IEEE math rules at the expense of speed. -ffast-math implies -OPT:IEEE_arithmetic=2 -fno-math-errno, while -fno-fast-math implies -OPT:IEEE_arithmetic=1 -fmath-errno. These flags apply to all languages.

Both -OPT:fast_math=on and -ffast-math are implied by -Ofast.

7.7.4 IEEE 754 Compliance

It is possible to control the level of IEEE 754 compliance through options. Relaxing the level of compliance allows the compiler greater latitude to transform the code for improved performance. The following subsections discuss some of those options.

7.7.4.1 Arithmetic

Sometimes it is possible to allow the compiler to use operations that deviate from the IEEE 754 standard to obtain significantly improved performance, while still obtaining results that satisfy the accuracy requirements of your application.

The flag regulating the level of conformance to ANSI IE
237. f -OPT:is_memx=(ON|OFF) is specified, the corresponding assocx=N specification is ignored, and any cmpx=N and dmpx=N options on the command line are ignored.

-LNO:ls1=N,ls2=N,ls3=N,ls4=N
This option specifies the line size in bytes. This is the number of bytes, specified in the form of a positive integer number (N), that are moved from the memory hierarchy level further out to this level on a miss. Specifying N=0 indicates there is no cache at that level.

Following are LNO TLB Options. These arguments control the TLB, a cache for the page table, assumed to be fully associative. The TLB control arguments are the following:

-LNO:ps1=N,ps2=N,ps3=N,ps4=N
This option specifies the number of bytes in a page, with N as a positive integer. The default for N depends on your system hardware.

-LNO:tlb1=N,tlb2=N,tlb3=N,tlb4=N
This option specifies the number of entries in the TLB for this cache level, with N as a positive integer. The default for N depends on your system hardware.

-LNO:tlbcmp1=N,tlbcmp2=N,tlbcmp3=N,tlbcmp4=N,tlbdmp1=N,tlbdmp2=N,tlbdmp3=N,tlbdmp4=N
This option specifies the number of processor cycles it takes to service a clean TLB miss (the tlbcmpx options) or a dirty TLB miss (the tlbdmpx options), with N as a positive integer. The default for N depends on your system hardware.

Following are LNO Prefetch Options. These arguments control the prefetch operation.

LNO
238. f a uniform memory system. However, OpenMP programs tend to have very good memory locality, and the correct approach is to use NUMA optimizations in the operating system to give good placement of data relative to threads. This optimization relies on "first touch": the thread that first touches the data is assumed to be the most frequent user of the data, and thus the data is allocated onto physical addresses in the DRAM associated with the CPU that is currently running that thread. This is applied by a NUMA-aware operating system at the page level. If your kernel version is not NUMA-aware, then a kernel upgrade may be required for good performance.

Similarly, thread-to-CPU affinity is also important for good OpenMP performance. The OpenMP library by default uses affinity system calls to strongly associate threads with CPUs. The idea is to keep the threads co-located with their associated data. Without affinity assignments, the threads may be migrated by the O/S scheduler to other nodes and lose their good placement relative to their data. However, sometimes the use of affinity binding can cause a load imbalance and prevent the scheduler from making sensible decisions about thread placement. In this case, the thread affinity assignments can be disabled by setting the PSC_OMP_AFFINITY environment variable to FALSE. If your kernel does not support scheduling affinity, you may need to upgrade to a newer kernel to see the performance benefit of this mechanism
239. f parallelism.

-IPA:min_hotness=N
When feedback information is available, a call site to a procedure must be invoked with a count that exceeds the threshold specified by N before the procedure will be inlined at that call site. The default is 10.

-IPA:multi_clone=N
This option specifies the maximum number of clones that can be created from a single procedure. Default value is 0. Aggressive procedural cloning may provide opportunities for inter-procedural optimization, but may also significantly increase the code size.

-IPA:node_bloat=N
When this option is used in conjunction with -IPA:multi_clone, it specifies the maximum percentage growth of the total number of procedures relative to the original program. The default is 100.

-IPA:plimit=N
This option stops inlining into a specific subprogram once it reaches size N in the intermediate representation. Default is 2500.

-IPA:pu_reorder=(0|1|2)
Control re-ordering the layout of program units based on their invocation patterns in feedback compilation to minimize instruction cache misses. This option is ignored unless under feedback compilation.
  0  Disable procedure reordering. This is the default for non-C++ programs.
  1  Reorder based on the frequency with which different procedures are invoked. This is the default for C++ programs.
  2  Reorder based on caller-callee relationship.

-IPA:relopt=(ON|OFF)
This option enables optimizations similar to those a
240. f that thread to change. Also note that all team masters will have a local ID of 0 and will therefore map to the same CPU. Usually these properties are undesirable, so the default is to use the thread global ID for scheduling assignments.

PSC_OMP_AFFINITY_INHERITANCE (TRUE or FALSE)
This determines whether the OpenMP library inherits any prevailing affinity settings from its environment; the default value is TRUE.

When affinity inheritance is disabled, the OpenMP library ignores the environment's affinity setting and sets up its own affinity mappings according to its built-in heuristics. By default, the OpenMP library will bind one thread to each CPU in the machine, though this can be overridden by OpenMP environment variables.

When affinity inheritance is enabled (the default) and the OpenMP program is run under an affinity assignment, then the OpenMP program is restricted to just the subset of CPUs specified in that affinity assignment. This behavior ensures that the OpenMP library inter-operates with programs like taskset in the expected way. The behavior is as if the OpenMP program had been run on a machine that consisted of just the CPU subset specified by taskset. The OpenMP library will then use its usual thread count and affinity rules, but applied to the CPU subset. A common approach is to run multiple OpenMP processes on a node (e.g. using MPI) such that each OpenMP process uses a distinct subset of CPUs specified by taskset. Affinity
241. ffect when -O2 or above

Comments entries for the preceding rows: C/C++ only (three rows); Suppress warning messages.

Table E-1. Summary of Compiler Options by Function

  -woffoptions
  -woffnum

  Options Affecting Global Optimizer (-O2 or Above)    Defaults    Comments
  -WOPT:aggstr=N                                       11
  -WOPT:const_pre=(ON|OFF)                             <ON>
  -WOPT:if_conv=(0|1|2)                                1
  -WOPT:ivar_pre=(ON|OFF)                              <ON>
  -WOPT:mem_opnds=(ON|OFF)                             <OFF>
  -WOPT:retype_expr=(ON|OFF)                           <OFF>
  -WOPT:unroll=(0|1|2)                                 <1>
  -WOPT:val=(0|1|2)                                    1

Appendix F: eko man Page

There are online manual pages ("man pages") available describing the flags and options for the PathScale Compiler Suite. The man pages distributed as part of the PathScale Compiler Suite are:

  pathCC(1)              Invoke the PathScale(TM) C or C++ compiler
  pathcc(1)              Invoke the PathScale(TM) C or C++ compiler
  pathf95(1)             Invoke the PathScale(TM) Fortran 77, 90, and 95 compilers
  eko(7)                 The complete list of options and flags for the PathScale(TM) Compiler Suite
  pathscale_intro(7)     Introductory page for the PathScale(TM) Compiler Suite
  compiler.defaults(5)   Default options for the PathScale(TM) Compiler Suite
  explain(1)             PathScale Fortran compiler and runtime error message explanation utility
  pathhow-compiled(1)    PathScale(TM) display compiled options utility
  pathdb(1)
  pathopt2(1)            Utility used to aid in tuning the PathScale(TM) compiler for higher
242. fic name for a function that converts its argument to type complex 16. Specific name for complex conjugate whose argument is type complex 16. Fortran interface to C library functions in erf and erfc(3m). Specific name for function that converts its argument to type real 8. Specific name for a function that returns the imaginary part of a complex 16 argument. Specific name for a function that converts its argument to type real 8. Find out the number of seconds of CPU time consumed by this process since the previous call to dtime, or, if there was no previous call, since the start of execution. tarray(1) gives user CPU time and tarray(2) gives system CPU time. The function form returns the sum of those times. The subroutine form sets result to the sum of those times. Fortran interface to C library functions described in erff and erfcf(3m).
etime: Find out the number of seconds of CPU time consumed by this process since the start of execution. tarray(1) gives user CPU time and tarray(2) gives system CPU time. The function form returns the sum of those times. The subroutine form sets result to the sum of those times.
exit: Like the C library function exit, terminate the process and return the value status to the process (usually the shell) that caused this process to execute. status defaults to 0. Open Fortran logical units are flushed and closed. f
243. for calls within a parallel region, otherwise return FALSE.
  call omp_set_nested(logical)    Enable or disable nested parallelism.
  logical omp_get_nested()        Return TRUE if nested parallelism is enabled, otherwise return FALSE.
Lock routines
  omp_init_lock(int)              Allocate and initialize lock, associating it with the lock variable passed in as a parameter.
  omp_init_nest_lock(int)         Initialize a nestable lock and associate it with a specified lock variable.
  omp_set_lock(int)               Acquire the lock, waiting until it becomes available if necessary.
  omp_set_nest_lock(int)          Set a nestable lock. The thread executing the subroutine will wait until a lock becomes available and then set that lock, incrementing the nesting count.
  omp_unset_lock(int)             Release the lock, resuming a waiting thread (if any).
  omp_unset_nest_lock(int)        Release ownership of a nestable lock. The subroutine decrements the nesting count and releases the associated thread from ownership of the nestable lock.
Table 8-3. Fortran OpenMP Runtime Library Routines (Continued)
  Routine                         Description
  logical omp_test_lock(int)      Try to acquire the lock; return TRUE if successful, FALSE if not.
  omp_test_nest_lock(int)         Attempt to set a lock using the same method as omp_set_nest_lock, but execution thread does not
244. for more information on this topic.
2.9.1 Support for Large Memory Model
At this time the PathScale compilers do not support the large memory model. The significance is that the code offsets must fit within the signed 32-bit address space. To determine if you are close to this limit, use the Linux size command:
  $ size bench
     text    data     bss     dec     hex filename
   910219    1448    3192  914859   df5ab bench
If the total value of the text segment is close to 2GB, then the size of the memory model may be an issue for you. We believe that codes that are this large are extremely rare and would like to know if you are using such an application. The size of the bss and data segments is addressed by using the medium memory model.
2.10 Debugging
The flag -g tells the PathScale compilers to produce data in the DWARF 2.0 format used by modern debuggers such as GDB and PathScale's debugger, pathdb. This format is incorporated directly into the object files. The -g option automatically sets the optimization level to -O0 unless an explicit optimization level is provided on the command line. Debugging of higher levels of optimization is possible, but the code transformation performed by the optimizations may make it more difficult. See the individual sections on the PathScale Fortran and C/C++ compilers for more language-specific debugging information, and section 10 for debu
245. g, -FLIST (Fortran Listing), -GRA (Global Register Allocator), -INLINE (Subprogram Inlining), -IPA (Inter-procedural Analyzer), -LANG (Language), -LIST (Listing), -LNO (Loop Nest Optimizer), -OPT (Miscellaneous), -TENV (Target Environment), and -WOPT (Global Optimizer Modification). The general usage format is: -PARENT_OPTION:suboption=arg. Two options, -INLINE and -IPA, have separate behavior for the PARENT_OPTION without any suboptions. Additionally, -INLINE and -inline mean the same thing; the case is similar for -IPA and -ipa. Specifying -clist is equivalent to -CLIST:=ON. Specifying -flist is equivalent to enabling all the -FLIST options.
-###: Like the -v option, only nothing is run and args are quoted.
-A pred=ans: Make an assertion with the predicate pred and answer ans. The -pred=ans form cancels an assertion with predicate pred and answer ans.
-alignN: Align data on common blocks to specified boundaries. The alignN specifications are as follows:
  Option      Action
  -align32    Align data in common blocks to 32-bit boundaries.
  -align64    Align data in common blocks to 64-bit boundaries. This is the default.
When an alignment is specified, objects smaller than the specification are aligned on boundaries according to their sizes. For example, when -align64 is specified, objects smaller than 64 bits but at least 32 bits in size are aligned on 32-bit boundaries; objects smaller than 32 bits but at least 16 bi
246. g-declarations: For C/C++ only. -Wmissing-declarations warns about global funcs without previous declarations. -Wno-missing-declarations tells the compiler not to warn about global funcs without previous declarations.
-W[no-]missing-format-attribute: For C/C++ only. For the -Wmissing-format-attribute option, if -Wformat is used, warn on candidates for format attributes. For -Wno-missing-format-attribute, do not warn on candidates for format attributes.
-W[no-]missing-noreturn: For C/C++ only. -Wmissing-noreturn warns about functions that are candidates for the noreturn attribute. -Wno-missing-noreturn tells the compiler not to warn about functions that are candidates for the noreturn attribute.
-W[no-]missing-prototypes: For C/C++ only. -Wmissing-prototypes warns about global funcs without prototypes. -Wno-missing-prototypes tells the compiler not to warn about global funcs without prototypes.
-W[no-]multichar: For C/C++ only. -Wmultichar warns if a multi-character constant is used. -Wno-multichar tells the compiler not to warn if a multi-character constant is used.
-W[no-]nested-externs: For C/C++ only. -Wnested-externs warns about externs not at file scope level. -Wno-nested-externs tells the compiler not to warn about externs not at file scope level.
-Wno-cast-qual: For C/C++ only. -Wcast-qual warns about casts that discard qualifiers. -Wno-cast-qual tells the compiler not to warn about casts that discard qualifiers.
-Wno-dep
247. g flags on the same files.
Table 7-1. Effects of IPA on SPEC CPU 2000 Performance
  Benchmark       Time w/o -ipa   Time with -ipa   Improvement
  164.gzip        170.7 s         164.7 s            3.5%
  175.vpr         202.4 s         192.3 s            5%
  176.gcc         113.6 s         113.2 s            0.4%
  181.mcf         391.9 s         390.8 s            0.3%
  186.crafty       83.5 s          83.4 s            0.1%
  197.parser      301.4 s         289.3 s            4%
  252.eon         152.8 s         126.8 s           17%
  253.perlbmk     196.2 s         192.3 s            2%
  254.gap         153.5 s         128.6 s           16.2%
  255.vortex      175.2 s         132.1 s           24.6%
  256.bzip2       210.2 s         181.0 s           13.9%
  300.twolf       376.5 s         362.2 s            3.8%
  168.wupwise     220.0 s         161.5 s           26.6%
  171.swim        181.4 s         180.7 s            0.4%
  172.mgrid       184.7 s         182.3 s            1.3%
  173.applu       282.5 s         245.2 s           13.2%
  177.mesa        155.4 s         131.5 s           15.4%
  178.galgel      150.4 s         149.9 s            0.3%
  179.art         245.7 s         221.1 s           10%
  183.equake      143.7 s         143.2 s            0.3%
  187.facerec     154.3 s         147.4 s            4.5%
  188.ammp        266.5 s         261.7 s            1.8%
  189.lucas       165.9 s         167.9 s           -1.2%
  191.fma3d       239.6 s         244.6 s           -2.1%
  200.sixtrack    265.0 s         276.9 s           -4.5%
  301.apsi        280.7 s         273.7 s            2.5%
Table 7-1 shows how -ipa affects the base runs of the CPU2000 benchmarks. IPA improves the running times of 17 out of the 26 benchmarks; the improvements range from 1.3% to 26.6%. There are six benchmarks that improve by less than 0.5%, which is withi
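The Improvement column of Table 7-1 is consistent with the relative reduction in run time, (time without -ipa minus time with -ipa) divided by time without -ipa. The sketch below recomputes it for three rows; the function name improvement_pct is a hypothetical helper, and the formula is inferred from the table values rather than stated by the guide.

```shell
#!/bin/sh
# improvement_pct BASE_SECONDS IPA_SECONDS
# Print the relative reduction in run time as a percentage with one decimal.
# A negative result means the -ipa build ran slower than the baseline.
# Hypothetical helper; the formula is inferred from the table values.
improvement_pct() {
    LC_ALL=C awk -v base="$1" -v ipa="$2" \
        'BEGIN { printf "%.1f\n", (base - ipa) / base * 100 }'
}

improvement_pct 170.7 164.7   # 164.gzip
improvement_pct 220.0 161.5   # 168.wupwise
improvement_pct 165.9 167.9   # 189.lucas
```

The three calls print 3.5, 26.6, and -1.2, matching the table rows for gzip, wupwise, and lucas (the last being one of the three benchmarks that slow down with -ipa).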
248. gain with a new execute target. The next set of refinements in the execute targets are the options with the peak_ prefix. For example, if the best results were obtained with -O2, then the next target to try will be peak_O2. Here is a summary of the target usage:
  Option in try5 with best results    Use this target for next run
  -O2                                 peak_O2
  -O3                                 peak_O3
  -O3 -OPT:Ofast                      peak_O3
  -O3 -ipa                            peak_O3
  -Ofast                              peak_Ofast
This progressive refinement is shown in more detail in section 7.9.8.3 and section 7.9.8.4.
7.9.5 Using an External Configuration File to Modify pathopt2.xml
It is possible to build hierarchies of lists and to construct new execution targets by combining existing ones. The way to do this without modifying pathopt2.xml is to create an external configuration file, then use the -g option in the pathopt2 command line to load it in. The XML files are processed in order, as if they were concatenated. The -g option can be repeated to load in more than one file. The -t option chooses the execution target as before. The rules for using the option remain the same. Here is an example of an external configuration file that extends the try5 list with a 6th possibility:
  <config>
    <execute name="try6">
      <choose k="1">
        <source from="try5_list"/>
        <option>-O1</option>
      </choose>
    </execute>
  </config>
7.9.6 PSC_GENFLAGS Environment
249. gging and troubleshooting tips. See the PathScale Debugger User Guide for more information on pathdb.
2.11 Profiling: Locate Your Program's Hot Spots
Often a program has "hot spots," a few routines or loops that are responsible for most of the execution time. Profilers are a common tool for finding these hot spots in a program. To figure out where and how to tune your code, use the time tool to get a rough estimate and determine if the issue is system load, application load, or a system resource that is slowing down your program. Then use the pathprof tool to find the program's hot spots. Once you find the hot spots in your program, you can improve your code for better performance, or use the information to help choose which compiler flags are likely to lead to better performance. The time tool provides the elapsed (or wall) time, user time, and system time of your program. Its usage is typically: time program args. Elapsed time is usually the measurement of interest, especially for parallel programs, but if your system is busy with other loads, then user time might be a more accurate estimate of performance than elapsed time. If there is substantial system time being used and you don't expect to be using substantial non-compute resources of the system, you should use a kernel profiling tool to see what is causing it. The pathprof and pathcov programs included with the compilers are symbolic links to your system's gprof and gcov executables, respectively. Th
250. h the -fb_create and -fb_opt flags.
NOTE: If the -fb_create and -fb_opt compiles are done with different compilation flags, it may or may not work, depending on whether the different compilation flags cause different code to be seen by the phase that is performing the instrumentation/feedback. We recommend using the same flags for both instrumentation and feedback.
FDO requires compiling the program at least twice. In the first pass:
  pathcc -O3 -ipa -fb_create fbdata -o foo foo.c
The executable foo will contain extra instrumentation library calls to collect feedback information; this means foo will actually run a bit slower than normal. We are using fbdata for the file name in this example; you can use any name for your file. Next, run the program foo with an example dataset:
  foo typical_input_data
During this run, a file with the prefix fbdata will be created, containing feedback information. The file name you use will become the prefix for your output file. For example, the output file from this example dataset might be named fbdata.instr0.ab342. Each file will have a unique string as part of its name so that files can't be overwritten. To use this data in a subsequent compile:
  pathcc -O3 -ipa -fb_opt fbdata -o foo foo.c
This new executable should run faster than a non-FDO foo, and will not contain any instrumentation library calls. Experiment to see if FDO provides significant benefit for your application. More details
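The two-pass FDO procedure in this item can be sketched as a dry-run script that prints the three steps and reuses one flag list for both compiles, as the NOTE recommends. This is a sketch under stated assumptions: the function name fdo_commands is hypothetical, the flag spellings -fb_create and -fb_opt and the dataset name typical_input_data are read from the extracted text, and nothing is actually compiled.

```shell
#!/bin/sh
# fdo_commands SOURCE EXE PROFILE [FLAGS...]
# Print the three steps of a feedback-directed build without running them.
# The same FLAGS are used for the instrumented and the optimized compile.
# Hypothetical helper for illustration only.
fdo_commands() {
    src=$1; exe=$2; profile=$3
    shift 3
    flags="$*"
    printf '%s\n' "pathcc $flags -fb_create $profile -o $exe $src"
    printf '%s\n' "./$exe typical_input_data"
    printf '%s\n' "pathcc $flags -fb_opt $profile -o $exe $src"
}

fdo_commands foo.c foo fbdata -O3 -ipa
```

Keeping the flags in a single variable is one simple way to avoid the mismatch between the instrumentation and feedback compiles that the NOTE warns about.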
251. hared between compilation units (source files). Common blocks are a Fortran 77 language feature that creates a group of global variables. The PathScale compiler does sophisticated padding of common blocks for higher performance when the Inter-Procedural Analysis (IPA) is in use.
constant: A constant is a variable with a value known at compile time.
DSO (dynamic shared object): A library that is linked in at runtime. In Linux, the C library (glibc) is commonly dynamically linked in. In Windows, such libraries are called DLLs.
DWARF: A debugging file format used by many compilers and debuggers to support source-level debugging. It is architecture-independent and applicable to any processor or operating system. It is widely used on Unix, Linux, and other operating systems, as well as in stand-alone environments.
EBO: The Extended Block Optimization pass in the PathScale compiler.
EM64T: The Intel Extended Memory 64 Technology family of chips.
equivalence: A Fortran feature similar to a C/C++ union, in which several variables occupy the same area of memory.
Terms: executable, feedback, flag, gcov, IPA, linker, LNO, MP, NUMA, object file.
executable: The file created by the compiler (and linker) whose contents can be interpreted and run by a computer. The compiler can also create libraries and debugging information from the source code.
feedback: A compiler optimization technique in which information from a run of the program is then
252. has personal preferences in reference books, and this list reflects the variety of opinions found within the PathScale engineering team.
Fortran Language
  Fortran 95 Handbook: Complete ISO/ANSI Reference by Jeanne C. Adams et al., MIT Press, 1997, ISBN 0-262-51096-0
  Fortran 95 Explained by Metcalf, M. and Reid, J., Oxford University Press, 1996, ISBN 0-19-851888-8
C Language
  C Programming Language by Brian W. Kernighan and Dennis M. Ritchie, Prentice Hall, 1988, 2nd edition, ISBN 0-13-110362-8
  C: A Reference Manual by Samuel P. Harbison and Guy L. Steele, Prentice Hall, 5th Edition, 2002, ISBN 0-130-89592-X
  C: How to Program by H. M. Deitel and P. J. Deitel, Prentice Hall, Fourth Edition, 2004, ISBN 0-131-42644-3
C++ Language
  The C++ Standard Library: A Tutorial and Reference by Josuttis, Nicolai M., 1999, Addison-Wesley, ISBN 0-201-37926-0
  Effective C++: 55 Specific Ways to Improve Your Programs and Design by Scott Meyers, Addison-Wesley Professional, 2005, 3rd edition, ISBN 0-321-33487-6
  More Effective C++: 35 New Ways to Improve Your Programs and Designs by Scott Meyers, Addison-Wesley Professional, 1995, ISBN 0-201-63371-X
  Thinking in C++, Volume 1: Introduction to Standard C++ by Bruce Eckel, Prentice Hall, 2nd Edition, 2000, ISBN 0-139-79809-9. NOTE: There is a later version (2002) available online as a free download.
  Thinking in C++, Vol. 2
253. hat all outer loops for which unrolling is legal should be unrolled by N, where N is a positive integer. The compiler unrolls loops by this amount or not at all.
-LNO:ou_deep=(ON|OFF): This option specifies that for loops with 3-deep (or deeper) loop nests, the compiler should outer unroll the wind-down loops that result from outer unrolling loops further out. This results in larger code size, but generates faster code whenever wind-down loop execution costs are important. Default is ON.
-LNO:ou_further=N: This option specifies whether or not the compiler performs outer loop unrolling on wind-down loops. N must be specified and be an integer. Additional unrolling can be disabled by specifying -LNO:ou_further=999999. Unrolling is enabled as much as is sensible by specifying -LNO:ou_further=3.
-LNO:ou_max=N: This option enables the compiler to unroll as many as N copies per loop, but no more.
-LNO:pwr2=(ON|OFF): For C/C++ only. This option specifies whether to ignore the leading dimension (set this to OFF to ignore).
Following are LNO Target Cache Memory Options. These arguments allow you to describe the target cache memory system. In the following arguments, the numbering starts with the cache level closest to the processor and works outward.
-LNO:assoc1=N,assoc2=N,assoc3=N,assoc4=N: This option specifies the cache set associativity. For a fully associative cache, such as main memory, N should be set to any sufficiently large number s
254. he system-imposed upper bound on the pthread stack size. The library dynamically detects this value at start-up time. For systems using linuxthreads, this limit is typically in the range of 8MB to 32MB. For systems using NPTL threads, there is typically no arbitrary limit imposed by the system on the stack size. libopenmp imposes a limit of 1GB when using the 32-bit version of libopenmp, and a limit of 4GB when using the 64-bit version of libopenmp. These limits prevent excessive stack limits when using libopenmp. When each pthread is created, the operating system will allocate virtual memory for its entire stack, as sized by the above algorithms. This essentially allocates virtual memory space for that stack so that it can grow up to its specified limit. The operating system will provide physical memory pages to back up this virtual memory as and when it is required. A consequence of this is that the top program will include the whole of these stacks in the VIRT or SIZE memory usage figure (VIRT or SIZE will be used depending on your Linux distribution), while only the allocated physical pages for these stacks will be shown in the RES or RSS (resident) figure (RES or RSS will be used depending on your Linux distribution). If the OpenMP program runs with a large pthread stack size, which is the common case, then it is quite normal for VIRT or SIZE to be a large figure. It will be at least the number of pthreads created by libopen
255. he flags -O3 -ipa -LNO:fusion=2 and -OPT:div_split=on. Testing combinations of these two flags as additions to the -O3 -ipa we have already tested gives the following results:
  -O3 -ipa -LNO:fusion=2 results in 109.74 seconds run time
  -O3 -ipa -OPT:div_split=on results in 112.24 seconds
  -O3 -ipa -OPT:div_split=on -LNO:fusion=2 results in 111.28 seconds
So -O3 -ipa is essentially a tie for the best set of flags with -O3 -ipa -LNO:fusion=2.
9.2 Using the -profile Option
This compiler option will generate extra profiling information suitable for the analysis program pathprof(1). The -profile option tells the compiler to generate profiling information for both the program and the runtime libraries, whereas the -pg option tells the compiler to generate profiling information for the program only. Use this option when compiling the source files for which you want to gather data. You must also use it when linking.
NOTE: You will need to include libc_p.a, which is available in the glibc-profile package for your distribution.
Section 10: Debugging and Troubleshooting
The PathScale Compiler Suite Support Guide contains information about getting support from PathScale and tells you how to submit a bug. We consider performance issues to be a bug. The pathbug tool, described in the Support Guide, can help you gather information for submitting your bug.
10.1 Subscription Manager Problems
For recommendations in addressing problems or issues with subscriptions
256. he last run of the ft.A executable. Copy the two scripts psc_build and psc_test from /opt/pathscale/share/pathopt2/examples into the pathopt2 directory. The scripts are shown below.
For psc_build:
  #!/bin/sh
  cd ..
  make clean
  code=$1
  size=$2
  shift 2
  make $code CLASS=$size FFLAGS="$*"
  cd pathopt2
For psc_test:
  #!/bin/sh
  bin/ft.A > logs/ft.A.txt
Make the files executable and then run pathopt2:
  $ chmod +x psc_*
  $ pathopt2 -t try5 -r psc_test psc_build ft A
Note that the first argument to the psc_build script is the name of the code, the second argument is the problem size, and all remaining arguments are the optimization options. This matches the code in the psc_build script that interprets the arguments. The output will be similar to the following:
  Sorted summary from all runs:
  Flags             Build   Test    Real    User    System
  -O3 -ipa          PASS    PASS    12.67   12.23   0
  -Ofast            PASS    PASS    12.68   12.27   0.40
  -O3 -OPT:Ofast    PASS    PASS    12.83   12.39   0.4
  -O3               PASS    PASS    13.86   12.46   0.40
  -O2               PASS    PASS    14.53   14.14   0.39
It is useful to check the output in logs/ft.A.txt:
  FT Benchmark completed
  Class = A
  Size = 256x256x128
  Iterations = 6
  Time in seconds = 10.78
  Mop/s total = 662.05
  Operation type = floating point
  Verification = SUCCESSFUL
  Version = 2.3
Since -Ofast runs last in the try5 target, the output in this file corresponds to the 12
257. he path to this directory in your shell's search path before the location of your system's gcc (which is usually /usr/bin). You can confirm the order in the search path by running "which gcc" after modifying your search path. The output should print the location of the gcc wrapper, not /usr/bin/gcc.
Section 6: Tuning Quick Reference
This section provides some ideas for tuning your code's performance with the PathScale compiler. The following sections describe a small set of tuning options that are relatively easy to try, and often give good results. These are tuning options that do not require Makefile changes, or risk the correctness of your code results. More detail on these flags can be found in the next section and in the man pages. A comprehensive list of the options for the PathScale compiler can be found in the eko man page.
6.1 Basic Optimization
Here are some things to try first when optimizing your code. The basic optimization flag -O is equivalent to -O2. This is the first flag to think about using when tuning your code. Try -O2, then -O3, and then -O3 -OPT:Ofast. For more information on the -O flags and -OPT:Ofast, see section 7.1.
6.2 IPA
Inter-Procedural Analysis (IPA), invoked most simply with -ipa, is a compilation technique that analyzes an entire program. This allows the compiler to do optimizations without regard to which source
258. her of the following:
  -O3 -OPT:Ofast:ro=1
  -O3 -OPT:Ofast:div_split=OFF
Note that ro is short for roundoff. -Ofast is equivalent to -O3 -ipa -OPT:Ofast -fno-math-errno -ffast-math, so similar cautions apply to it as to -O3 -OPT:Ofast. To use interprocedural analysis without the Ofast-type optimizations, use either of the following:
  -O3 -ipa
  -O2 -ipa
Testing different optimizations can be automated by pathopt2. This program compiles and runs your program with a variety of compiler options and creates a sorted list of the execution times for each run. The try5 target tests five flag combinations, which is easily done using pathopt2. The combinations are: -O2, -O3, -O3 -ipa, -O3 -OPT:Ofast, and -Ofast. For more information on using pathopt2, see section 7.9.
6.6 Performance Analysis
In addition to these suggestions for optimizing your code, here are some other ideas to assist you in tuning. Section 2.11 discusses figuring out where to tune your code, using time to get an overview of your code, and using pathprof to find your program's hot spots.
6.7 Optimize Your Hardware
Make sure you are optimizing your hardware as well. Section 7.8 discusses getting the best performance out of x86_64-based hardware (Opteron, Athlon 64, Athlon 64 FX, and Intel EM64T). Hardware configuration can have a significant effect on the performance of your application. Secti
259. herefore the corresponding C code would need to use s2 rather than s2 to be compatible: subroutine s2 bind c name end subroutine s2
The Fortran 2003 standard imposes many restrictions on the use of BIND, mostly to avoid situations where a Fortran construct is not implementable in C or vice versa. Some of the incompatibilities are:
1. Fortran POINTER variables are represented differently than C pointers.
2. Fortran ALLOCATABLE variables have no counterpart in C.
3. The Fortran default LOGICAL type occupies the same storage as the Fortran default INTEGER type, but the C bool type does not occupy the same storage as the C int type.
4. C does not provide OPTIONAL arguments.
5. Fortran assumed-shape dummy arguments like arg(:) are not generally compatible with C.
6. A C array of char corresponds more closely to a Fortran array of character(len=1) rather than a Fortran scalar character variable with a length greater than 1.
The details are described in the standard itself, but in general, a Fortran global variable cannot use the BIND attribute unless its data type is compatible with C, and a Fortran procedure cannot use the BIND attribute unless all its dummy arguments (and, if it is a function, its result) are compatible.
3.4.9.2 Intrinsic Module ISO_C_BINDING
To aid in choosing compatible types, the standard provides a variety of paramet
260. his is chip-major ordering, since incrementing the chip number increases the CPU number by m, while incrementing the core number only increases the CPU number by 1. For core-major ordering, a linear assignment of threads to CPU numbers will have the effect of spreading threads over chips first. For chip-major ordering, the linear assignment will fill up the first chip with threads before moving to the second chip, and so forth. This behavior can be changed by setting the stride factor to the value of m. It causes the OpenMP library to spread the threads across the chips with a stride equal to the number of cores in a chip. The decision on whether to spread threads over chips or over cores first depends on what one is trying to achieve and the system architecture. It may be desirable to spread over cores first and minimize the number of chips, to improve locality. Alternatively, it may be desirable to spread over chips first, to maximize the number of chips and thus the available system memory bandwidth.
For example, here are the generated thread assignments for a system comprising four chips, each with two cores, where PSC_OMP_CPU_STRIDE is set to 2:
     CHIP 0        CHIP 1        CHIP 2        CHIP 3
  CPU0  CPU1    CPU2  CPU3    CPU4  CPU5    CPU6  CPU7
  T0    T4      T1    T5      T2    T6      T3    T7
  T8    T12     T9    T13     T10   T14     T11   T15
  T16
Tx indicates threa
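The stride example in this item (four chips, two cores each, PSC_OMP_CPU_STRIDE set to 2) can be reproduced with a few lines of shell arithmetic. This is a sketch that matches the thread assignments listed above, not a description of how the OpenMP library computes them internally: the function name cpu_for_thread and the formula are assumptions, and the formula is only meaningful when the stride divides the CPU count evenly.

```shell
#!/bin/sh
# cpu_for_thread THREAD NCPUS STRIDE
# Print the CPU number a thread lands on when threads are laid out with the
# given stride, wrapping around after NCPUS threads.
# Assumes STRIDE divides NCPUS evenly. Hypothetical helper for illustration.
cpu_for_thread() {
    slot=$(( $1 % $2 ))      # position of the thread within one round of NCPUS
    pos=$(( slot * $3 ))     # strided position before wrap-around
    echo $(( pos % $2 + pos / $2 ))
}

# Eight CPUs (four chips with two cores each), stride 2, first 16 threads.
t=0
while [ "$t" -lt 16 ]; do
    echo "T$t -> CPU$(cpu_for_thread "$t" 8 2)"
    t=$(( t + 1 ))
done
```

For instance, T4 maps to CPU1 and T9 to CPU2, as in the assignment listing; with a stride of 1 the function degenerates to the plain linear assignment.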
261. ian files are little endian. For more details, see the pathf95 man page.
3.8.2 Reserved File Units
The PathScale Fortran compiler reserves Fortran file units 5, 6, and 0.
3.9 Source Code Compatibility
This section discusses our compatibility with source code developed for other compilers. Different compilers represent types in various ways, and this may cause some problems.
3.9.1 Fortran KINDs
The Fortran KIND attribute is a way to specify the precision or size of a type. Modern Fortran uses KINDs to declare types. This system is very flexible, but has one drawback. The recommended and portable way to use KINDs is to find out what they are, like this:
  integer :: dp_kind = kind(0.0d0)
In actuality, some users hard-wire the actual values into their programs:
  integer :: dp_kind = 8
This is an unportable practice, because some compilers use different values for the KIND of a double-precision floating point value. The majority of compilers use the number of bytes in the type as the KIND value. For floating point numbers, this means KIND=4 is 32-bit floating point and KIND=8 is 64-bit floating point. The PathScale compiler follows this convention. Unfortunately for us and our users, this is incompatible with unportable programs written using GNU Fortran, g77. g77 uses KIND=1 for single precision (32 bits) and KIND=2 for double precision (64 bits). For in
262. ill consist of 1 thread (the thread that requests the fork). Otherwise, the number of threads is specified by the NUM_THREADS clause on the parallel directive, if NUM_THREADS has been specified. Otherwise, the number of threads is specified by the most recent call to OMP_SET_NUM_THREADS, if it has been called. Otherwise, the number of threads is specified by the OMP_NUM_THREADS environment variable, if it has been defined. Otherwise, the number of threads defaults to the number of CPUs in the machine. If the number of threads is greater than 1, the request requires allocation of new threads, and this may fail if insufficient machine resources are available. The maximum number of threads that can be allocated simultaneously is limited to 256 by the implementation. Currently, nested parallelism is not supported where nested parallel directives are statically scoped within the same subroutine as the outer parallel directive. In this case, only the outer parallel directive will be parallelized, and any inner nested directives will be serialized (executed by a team of 1 thread). To achieve nested parallelism, the nested parallel directives must be moved to a separate subroutine.
OMP_SCHEDULE environment variable: The default value for this environment variable is implementation dependent (Section 4.1, page 59). The default for the OMP_SCHEDULE environment variable is static scheduling with no chunk size specified. The chunk size will defaul
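The "Otherwise ..." chain for the team size earlier in this item is a first-match precedence rule, which can be sketched as follows. The function name team_size and its argument convention (an empty string meaning "not specified") are hypothetical; the sketch leaves out the first rule, whose condition is cut off at the start of this item, and does not model the 256-thread allocation limit.

```shell
#!/bin/sh
# team_size CLAUSE SET_CALL ENV_VALUE NCPUS
#   CLAUSE     value of the NUM_THREADS clause, or "" if absent
#   SET_CALL   value from the most recent OMP_SET_NUM_THREADS call, or ""
#   ENV_VALUE  value of the OMP_NUM_THREADS environment variable, or ""
#   NCPUS      number of CPUs in the machine (the final fallback)
# Print the first value in that order that is specified.
# Hypothetical helper for illustration only.
team_size() {
    for candidate in "$1" "$2" "$3"; do
        if [ -n "$candidate" ]; then
            echo "$candidate"
            return 0
        fi
    done
    echo "$4"
}

team_size "" "" 4 8     # only OMP_NUM_THREADS is set
team_size 2 16 4 8      # the NUM_THREADS clause takes precedence
team_size "" "" "" 8    # nothing specified: fall back to the CPU count
```

The three calls print 4, 2, and 8 respectively.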
263. in Fortran; however, its linker symbol is MAIN__ rather than main, and when you link the program with pathf95, the linker will not automatically import it from a library. The usual symptom is a program which links without error, but then prints "Someone linked a Fortran program with no MAIN__". The solution is to tell the linker explicitly to import the symbol MAIN__ (with two underscores):
  pathf90 -Wl,--undefined=MAIN__ mylibrary.a
3.3.1 Module-related Error Messages
Error messages report the error as the first line in the module, even if the real error is further inside the module. The real error is reported after this first standard message. An example is given below. Here is a program, hellow.f95, which contains this module:
  MODULE HELLOW
  CONTAINS
  SUBROUTINE HELLO
  SPRINTZ Hello World
  END SUBROUTINE HELLO
  END MODULE HELLOW
Next, compile the program containing the module and look at the error that is generated:
  $ pathf95 hellow.f95
  MODULE HELLOW
  pathf95-855 pathf95: ERROR HELLOW, File = hellow.f95, Line = 1, Column = 8
  The compiler has detected errors in module "HELLOW". No module information file will be created for this module.
  SPRINTZ Hello World
  ^
  pathf95-724 pathf95: ERROR HELLO, File = hellow.f95, Line = 5, Column = 11
  Unknown statement. Expected assignment statement but found instead of or =>.
  pathf95: PathScale T
264. include optimizations that are generally beneficial, but may hurt performance. So let's look at a profile of the -O2 binary. We do need to recompile using flags -O2 -pg. Then we need to run the generated, instrumented binary again with the same reference dataset:
  $ time -p ./wupwise > wupwise.out
Here we used the -p (POSIX) flag to get a different time output format. This run generates the file gmon.out of profiling information. Then we need to run pathprof to generate the human-readable profile:
  $ pathprof wupwise
  Flat profile:
  Each sample counts as 0.01 seconds.
    %    cumulative    self                  self     total
   time    seconds    seconds      calls    s/call   s/call   name
  51.15     83.54      83.54   155648000     0.00     0.00    zgemm_
  17.65    112.37      28.83   603648604     0.00     0.00    zaxpy_
   8.72    126.61      14.24   214528306     0.00     0.00    zcopy_
   8.03    139.72      13.11   933888000     0.00     0.00    lsame_
   4.59    147.21       7.49                                  s_cmp
   1.51    149.67       2.46      512301     0.00     0.00    zdotc_
   1.49    152.11       2.44   603648604     0.00     0.00    dcabs1_
   1.37    154.34       2.23   155648000     0.00     0.00    gammul_
   1.08    156.10       1.76   155648000     0.00     0.00    su3mul_
   1.07    157.85       1.75         152     0.01     0.50    muldeo_
   0.00    163.32       0.00           1     0.00   155.83    MAIN__
   0.00    163.32       0.00           1     0.00     0.00    init_
   0.00    163.32       0.00           1     0.00     0.06    phinit_
  % time: the percentage of the total running time of the program used by this function.
  cumulative seconds: a running sum of the number of seconds accounted for by this function and those list
265. insic functions which g77 implements. With respect to linking, the PathScale Fortran compiler is not generally compatible with other Fortran compilers such as gfortran, g95, or commercial compilers when source code makes use of language features beyond Fortran 77 (although careful programming may make linking possible). PathScale Fortran is compatible with g77 with respect to linking, provided you use the command-line option -ff2c-abi. There are five major issues affecting linking compatibility:
1. ABI (application binary interface) and data representation: the size and encoding of each data type, and how each data type is passed as an argument in a procedure call. For example, one compiler might use an integer 1 to represent "true" while another might use -1; one compiler might interpret integer(kind=2) as a two-byte integer, and another interpret that as a two-word integer.
2. Each compiler may use a different runtime library to perform tasks such as I/O, string manipulation, and certain other operations which are too bulky to perform in-line. For example, in contrast with the C language, where the standard dictates that the runtime library will provide functions named strcpy, strcmp, and fputs to copy, compare, and write strings, the Fortran standard merely describes the behavior of assignment using "=", operators like ".ge.", and statements lik
266. ion 3 2 Continued Intrinsic Name Result Arguments Families Remarks KIOR I 8 I 1 8 PGI E J I 8 TRADITIONAL KISHA I8 I 1 8 TRADITIONAL E SHIFT 171 1 2 1 4 I8 KISHC I8 I 1 8 TRADITIONAL E SHIFT 171 1 2 1 4 I8 KISHFT I8 I 1 8 PGI E SHIFT 1 1 1 2 1 4 TRADITIONAL I8 KISHL I8 I 1 8 TRADITIONAL E SHIFT 171 1 2 1 4 I8 KISIGN I8 A I8 PGI E P B 18 TRADITIONAL KMOD I8 A l 8 PGI E P P 18 TRADITIONAL KMVBITS Subroutine FROM I 8 TRADITIONAL E FROMPOS 1 1 1 2 1 4 1 8 LEN 171 1 2 1 4 18 TO I 8 TOPOS I 1 1 2 1 4 I8 KNINT I8 A R 4 R 8 PGI E P TRADITIONAL KNOT I8 I 1 8 PGI E TRADITIONAL LBOUND ANSI PGI See Std TRADITIONAL LEN 1 4 STRING C ANSI G77 E P PGI TRADITIONAL LENGTH 1 11 I2 1 4 I8 TRADITIONAL E LEN TRIM 1 4 STRING C ANSI G77 E PGI TRADITIONAL C 28 C Supported Fortran Intrinsics Table of Supported Intrinsics ls Table C 1 Fortran Intrinsics Supported in Version 3 2 Continued Intrinsic Name Result Arguments Families Remarks LGE C STRING A C ANSI G77 E STRING B C PGI TRADITIONAL LGT C STRING A C ANSI G77 E STRING B C PGI TRADITIONAL LINK 1 4 PATH1 C G77 PGI PATH2 C LINK Subroutine PATH1 C G77 O PATH2 C STATUS 1 4 LLE C STRING A C ANSI G77 E STRING B C PGI TRADITIONAL LLT C STRING A C ANSI G77 E STRING B C PGI TRADITIONAL LNBLNK 1 4 STRING C G77
267. ions so the user can use them to tune for the peak performance of a program. This section presents the IPA-related compilation options that are useful in tuning programs. But first, it is worthwhile to mention that IPA is one of the compilation phases that can benefit substantially from feedback compilation. In feedback compilation, a feedback data file containing a profile of a typical run of the program is presented to the compiler. This enables IPA to make better decisions regarding which functions to inline and clone. By ensuring that busy callers and callees are placed next to each other, IPA's procedure reordering can also be more effective. Feedback compilation is enabled by the -fb-create and -fb-opt options. See section 7.6 for more details.

7.3.4.1 Inlining

There are actually two incarnations of the inliner in the PathScale compiler, depending on whether -ipa is specified. This is because inlining is nowadays a language feature and has to be performed independently of IPA. The inliner invoked when -ipa is not specified is the lightweight inliner, and it can only operate on a single compilation unit. The lightweight inliner does not do automatic inlining. It inlines strictly according to the C++ language requirement, the C inline keyword, or any -INLINE options specified by the user. It may be invoked by default. The basic options to control inlining in the lightweight inliner are:

-inline or -INLINE causes the lightweight inliner to b
268. ironment during the execution of parallel constructs.

THREADPRIVATE
  Example: #pragma omp threadprivate

8.6 OpenMP Runtime Library Calls (Fortran)

OpenMP programs can explicitly call standard routines implemented in the OpenMP runtime library. If you want to ensure the program is still compilable without -mp, you need to guard such code with the OpenMP conditional compilation sentinels (e.g. !$). The following table lists the OpenMP runtime library routines provided by version 2.0 of the OpenMP Fortran Application Program Interface.

Table 8-3. Fortran OpenMP Runtime Library Routines

Routine                             Description
call omp_set_num_threads(integer)   Set the number of threads to use in a team.
integer omp_get_num_threads()       Return the number of threads in the currently executing parallel region.
integer omp_get_max_threads()       Return the maximum value that omp_get_num_threads may return.
integer omp_get_thread_num()        Return the thread number within the team.
integer omp_get_num_procs()         Return the number of processors available to the program.
call omp_set_dynamic(logical)       Control the dynamic adjustment of the number of parallel threads.
logical omp_get_dynamic()           Return TRUE if dynamic threads is enabled, otherwise return FALSE.
logical omp_in_parallel()           Return TRUE
269. is option group controls miscellaneous optimizations. These options override defaults based on the main optimization level.

-OPT:alias=<name>
Specify the pointer aliasing model to be used. By specifying one or more of the following for <name>, the compiler is able to make assumptions throughout the compilation:

  typed
    Assume that the code adheres to the ANSI/ISO C standard, which states that two pointers of different types cannot point to the same location in memory. This is ON by default when -OPT:Ofast is specified.
  restrict
    Specify that distinct pointers are assumed to point to distinct, non-overlapping objects. This is OFF by default.
  disjoint
    Specify that any two pointer expressions are assumed to point to distinct, non-overlapping objects. This is OFF by default.
  no_f90_pointer_alias
    Specify that any two different f90 pointers are assumed to point to distinct, non-overlapping objects. This is OFF by default.

-OPT:align_unsafe=(ON|OFF)
Instruct the vectorizer (invoked at -O3) to aggressively perform vectorization by assuming that array parameters are aligned at 128-bit boundaries. The vectorizer will then generate 128-bit aligned load and store instructions, which are faster than their unaligned counterparts. If the assumption is incorrect, the aligned memory accesses will result in run-time segmentation faults. The default is OFF.

-OPT:asm_memory=(ON|OFF)
A debugging option to be used when debugging suspected
270. is specified, copies of inline functions are emitted as static functions in each compilation unit where they are called. If you are using IPA, -IPA:inline=OFF must be specified to suppress inlining.

-no-pathcc
-no-pathcc turns off the __PATHSCALE__ and other predefined preprocessor macros.

-nostartfiles
Do not use standard system startup files when linking.

-nostdinc
Direct the system to skip the standard directory /usr/include when searching for #include files and files named on INCLUDE statements.

-nostdinc++
Do not search for header files in the standard directories specific to C++.

-nostdlib
No predefined libraries or startfiles.

-o outfile
When this option is used in conjunction with the -c option and a single C source file, a relocatable object file named outfile is produced. When specified with the -S option, the -o option is ignored. If -o and -c are not specified, a file named a.out is produced. If specified, writes the executable file to outfile rather than to a.out.

-O(0|1|2|3|s)
Specify the basic level of optimization desired. The options can be one of the following:

  0  Turn off all optimizations.
  1  Turn on local optimizations that can be done quickly.
  2  Turn on extensive optimization. This is the default. The optimizations at this level are generally conservative, in the sense that they are virtually always beneficial, provide improvements commensurate to the compile time s
271. ise returns the error code from the C library value errno. The subroutine form sets status to the value that the function form would return. Trailing blanks in dir are ignored; you can prevent this by using char(0) to place a null character after the last significant character.

chmod
Like the POSIX command chmod, changes the access permissions of file name according to mode. See the operating system documentation for the characters allowed in mode. The function form returns 0 on success, but otherwise returns the error code from the C library value errno. The subroutine form sets status to the value which the function form would return. Trailing blanks in name are ignored; you can prevent this by using char(0) to place a null character after the last significant character.

ctime
Like the C library function ctime, converts stime (which can be obtained from the intrinsic time8) to a string of the form Thu Mar 2 12:45:36 PST 2006. The function form returns that string. The subroutine form sets result to that string.

date
Set the date argument to a string of the form 16-Mar-06 (DD-MMM-YY).

dbesj0, dbesj1, dbesjn, dbesy0, dbesy1, dbesyn
Fortran interfaces to C library functions j0, j1, jn, y0, y1, and yn (Bessel functions).

dcmplx
Speci
272. issing a case for an enum member.

-W[no-]system-headers
For C/C++ only. -Wsystem-headers prints warnings for constructs in system header files. -Wno-system-headers tells the compiler not to print warnings for constructs in system header files.

-W[no-]synth
For C++ only. The -Wsynth option warns about synthesis that is not backward compatible with cfront. -Wno-synth tells the compiler not to warn about synthesis that is not backward compatible with cfront.

-W[no-]traditional
For C/C++ only. -Wtraditional warns about constructs whose meanings change in ANSI C. -Wno-traditional tells the compiler not to warn about constructs whose meanings change in ANSI C.

-W[no-]trigraphs
For C/C++ only. -Wtrigraphs warns when trigraphs are encountered. -Wno-trigraphs tells the compiler not to warn when trigraphs are encountered.

-W[no-]undef
-Wundef warns if an undefined identifier appears in a #if directive. -Wno-undef tells the compiler not to warn if an undefined identifier appears in a #if directive.

-W[no-]uninitialized
-Wuninitialized warns about uninitialized automatic variables. Because the analysis to find uninitialized variables is performed in the global optimizer, invoked at -O2 or above, this option has no effect at -O0 and -O1. -Wno-uninitialized tells the compiler not to warn about uninitialized automatic variables.

-W[no-]unknown-pragmas
-Wunknown-pragmas warns when an unknown #pragma directive is encountered. -Wn
273. it does not allow C-style comments beginning with /* to extend across multiple lines. You should use the -cpp option if you wish to use the C preprocessor on Fortran source files ending in .f90 or .f95. These files will not be preprocessed unless you use either -ftpp (to select the Fortran preprocessor) or -cpp (to select the C preprocessor) on the command line.

3.6.3 Support for Varying Length Character Strings

Beginning with release 2.5, the PathScale Fortran compiler supports ISO/IEC Standard 1539-2, which provides support for varying length character strings. This is an optional add-on to the Fortran Standard. You can download and compile this module. It is available from this location: http://www.fortran.com/fortran/iso_varying_string.f95

3.6.4 Preprocessing Source Files with -fcoco

Beginning with release 2.4, the PathScale Fortran compiler supports the ISO/IEC 1539-3 conditional compilation preprocessor. When you use the -fcoco option, the compiler runs this preprocessor on each individual source file before compiling that source file, overriding the default whereby files suffixed with .F, .F90, or .F95 are preprocessed with cpp but files suffixed with .f90 or .f95 are not preprocessed. The ISO/IEC standard does not specify any command-line options for the preprocessor, but as an extension we pass -I and -D options to it
274. iterations which are executed by a thread are contiguous in terms of their loop iteration number.

NOTE: The PSC_OMP_STATIC_FAIR environment variable can be used to change the default static scheduling algorithm to an alternate scheme where the iterations are more equally balanced over the threads in cases where the division is not exact.

In the absence of the SCHEDULE clause, the default schedule is implementation dependent (section 2.3.1).

In the absence of the SCHEDULE clause, the default schedule is static scheduling. The default chunk size is set to the number of iterations of the loop divided by the number of threads in the team, rounded up to the nearest integer. The loop iterations are partitioned into chunks of the default chunk size. If the number of iterations of the loop is not an exact integer multiple of the number of threads in the team, the last chunk will be smaller than the default chunk size, and in some cases it may contain zero loop iterations. The chunks are assigned to threads starting from the thread with local index 0. The thread with the highest local index will receive the last chunk, and this may be smaller than the others or even zero. The loop iterations which are executed by a thread are contiguous in terms of their loop iteration number.

NOTE: The PSC_OMP_STATIC_FAIR environment variable can be used to change the default static scheduling algorithm to an alternate scheme where the iterations are more equally balanced
275. ithout line directives, see the -P option. For more information on controlling source preprocessing, see the -cpp, -ftpp, -macro-expand, and -nocpp options.

-extend-source
For Fortran only. Specify a 132-character line length for fixed-format source lines. By default, fixed-format lines are 72 characters wide. For more information on controlling line length, see the -colN option.

-fabi-version=N
For C++ only. Use version N of the C++ ABI. Version 1 is the version of the C++ ABI that first appeared in G++ 3.2. Version 0 will always be the version that conforms most closely to the C++ ABI specification. Therefore, the ABI obtained using version 0 will change as ABI bugs are fixed. The default is version 1.

-fb-create <path>
Used to specify that an instrumented executable program is to be generated. Such an executable is suitable for producing feedback data files with the specified prefix for use in feedback-directed compilation (FDO). The commonly used prefix is fbdata. This is OFF by default.

-fb-opt <prefix for feedback data files>
Used to specify feedback-directed compilation (FDO) by extracting feedback data from files with the specified prefix, which were previously generated using -fb-create. The commonly used prefix is fbdata. The same optimization flags must have been used in the -fb-create compile. Feedback data files created from executables compiled with different optimization flags will give checksum errors. FDO
276. itten in C or C++ but some procedures are written in Fortran, you may wish to call the function _PSC_ftn_init to initialize the Fortran runtime library. While standard Fortran I/O and most intrinsic functions will work correctly without this initialization, it is needed for runtime error messages, automatic stack sizing, and the intrinsics dealing with the command-line arguments. You should call it prior to executing any Fortran-generated code, passing it the arguments argc and argv from the C main program:

    int main(int argc, char **argv) {
        extern void _PSC_ftn_init(int argc, char **argv);
        _PSC_ftn_init(argc, argv);
        ...
    }

3.7.1 Legacy Support for Calls between C and Fortran

In calls between C and Fortran, the two issues are:

  Mapping Fortran procedure names onto C function names, and
  Matching argument types

Normally a pathf90 procedure name x not containing an underscore creates a linker symbol x_, and a pathf90 name x_y containing an underscore creates a linker symbol x_y__ (note the second underscore). A pathcc function name, by contrast, does not append any underscores when creating a linker symbol. You can write your C code to conform to this: use x_ in C so that it will match Fortran's x. Or you can use the -fdecorate option (described in man pathf90) to provide a mapping from each Fortran name onto some possibly quite different linker symbol. Or you
277. ized suite of source code based upon existing applications that has already been ported to a wide variety of platforms by its membership. The benchmarker takes this source code, compiles it for the system in question, and tunes the system for the best results. See http://www.spec.org/ for more information.

SSE3
Instruction set extension to Intel's IA-32 and IA-64 architecture to speed processing. These new instructions are supposed to enable and improve hyperthreading rather than floating point operations.

TLB
Translation Look-aside Buffer.

vectorization
An optimization technique that works on multiple pieces of data at once. For example, the PathScale Compiler Suite will turn a loop computing the mathematical function sin() into a call to the vsin() function, which is twice as fast.

WHIRL
The intermediate language (IR) used by compilers, allowing the C, C++, and Fortran front ends to share a common backend. It was developed at Silicon Graphics Inc. and is used by the Open64 compilers.

x86_64
The Linux 64-bit application binary interface (ABI).

Index

Symbols
_PSC_ftn_init 3-29
-apo 8-2
-C 3-29
-CG: see Code Generation 7-17
-CLIST: 7-44
-cpp 2-6, 3-1, 3-24
-fb-create 7-7
-fb-opt 7-7
-fcoco 3-25
-ff2c-abi 3-39
-ffast-math 7-21
-fixedform 3-1
-FLIST: 7-44
-fno-second-underscore 3-38
-fno-underscoring 3-38
-fPIC 2-10
-freeform 3-1
-ftpp 3-1, 3-24, 3-26
-g 3-42, 4-7, 7-1
278. jects.

-ignore-suffix
Determine the language of the source file being compiled by the command used to invoke the compiler. By default, the language is determined by the file suffixes (.c, .cpp, .C, .cxx, .f, .f90, .s). When the -ignore-suffix option is specified, the pathcc command invokes the C compiler, pathCC invokes the C++ compiler, and pathf95 invokes the Fortran 95 compiler.

-inline
Request inline processing.

-INLINE:...
Option group for subprogram inlining. May not always compile. With the exception of -INLINE:=OFF, any use of this option implies -inline. If you have included inlining directives in your source code, the -INLINE option must be specified in order for those directives to be honored.

-INLINE:aggressive=(ON|OFF)
Tell the compiler to be more aggressive about inlining. The default is -INLINE:aggressive=OFF.

-INLINE:list=(ON|OFF)
Tell the inliner to list inlining actions as they occur to stderr. The default is -INLINE:list=OFF.

-INLINE:preempt=(ON|OFF)
Perform inlining of functions marked preemptible in the lightweight inliner. Default is OFF. This inlining prevents another definition of such a function, in another DSO, from preempting the definition of the function being inlined.

-[no-]intrinsic=name
For Fortran only. Add a procedure to, or remove a procedure from, the set of intrinsic functions and subroutines that the compiler recognizes. By default the
279. know what the compiler did to optimize your code. There are several ways to generate a listing showing, by line number, what the compiler did to optimize a subroutine. Choose the one that seems most useful to you.

7.10.1 Using the -S flag

The -S flag can be a useful way to see what the compiler did, especially if you understand some assembly, but it is useful even if you don't. Here is an example using the STREAM benchmark. First we compile STREAM with the -S flag:

$ pathcc -O3 stream_d.c -S

This produces a stream_d.s assembly file. In this file you can see sections of human-readable comments interspersed with sections of assembly code that look something like this:

#<loop> Loop body line 118, nesting depth: 1, iterations: 250000
#<loop> unrolled 4 times
#<sched>
#<sched> Loop schedule length: 13 cycles (ignoring nested loops)
#<sched>
#<sched> 4 flops (15% of peak)
#<sched> 8 mem refs (30% of peak)
#<sched> 3 integer ops (11% of peak)
#<sched> 15 instructions (28% of peak)
#<sched>
#<freq> BB:60 frequency = 250000.00000 (heuristic)
#<freq> BB:60 => BB:60 probability = 0.99994
#<freq> BB:60 => BB:59 probability = 0.00006
#<freq>
.loc 1 120 0
# 119 for (j=0; j<N; j++)
# 120 a[j] = 2.0E0 * a[j];
movapd 0(%r8),%xmm3     # [0] id:82 a+0x0
movapd 16(%r8),%xmm2    # [1] id:82 a+0x0
addpd %xmm3,%xmm3       # [4]
addpd %xmm2,%xmm2       # [5]
m
280. lable at http://www.pathscale.com/docs.html. For the most current information on supported features, please see the Release Notes and README files for your current release.

Appendix G Glossary

This section describes common terms that are used in connection with the PathScale Compiler Suite.

ABI
Describes the interface between program components at the binary level. It encompasses details such as procedure calling convention (how parameters and return values are passed), the mangling (encoding) of function and variable names, and the dedication of registers for different usages.

affinity
Processor affinity is used to specify the preferred processor, or subset of processors, for scheduling a thread. An affinity setting might be made in order to bind a thread close to a resource and to prevent the kernel from rescheduling the thread to another processor further away from that resource. Affinity is particularly important on NUMA (non-uniform memory architectures), since memory access latency and bandwidth may vary based on the relative locations of the processor and memory.

AMD64
AMD's 64-bit extensions to Intel's IA32 (more commonly known as "x86") architecture.

alias
An alternate name used for identification, such as for naming a field or a file.

aliasing
Two variables are said to be aliased if they potentially are in the same loc
281. larations of functions or variables.

-W[no-]implicit-function-declaration
For C/C++ only. -Wimplicit-function-declaration warns when a function is used before being declared. -Wno-implicit-function-declaration tells the compiler not to warn when a function is used before being declared.

-W[no-]implicit-int
For C/C++ only. -Wimplicit-int warns when a declaration does not specify a type. -Wno-implicit-int tells the compiler not to warn when a declaration does not specify a type.

-W[no-]import
-Wimport warns about the use of the #import directive. -Wno-import tells the compiler not to warn about the use of the #import directive.

-W[no-]inline
For C/C++ only. -Winline warns if a function declared as inline cannot be inlined. -Wno-inline tells the compiler not to warn if a function declared as inline cannot be inlined.

-W[no-]larger-than-<number>
-Wlarger-than- warns if an object is larger than <number> bytes. -Wno-larger-than- tells the compiler not to warn if an object is larger than <number> bytes.

-W[no-]main
For C/C++ only. -Wmain warns about suspicious declarations of main. -Wno-main tells the compiler not to warn about suspicious declarations of main.

-W[no-]missing-braces
For C/C++ only. -Wmissing-braces warns about possibly missing braces around initializers. -Wno-missing-braces tells the compiler not to warn about possibly missing braces around initializers.

-W[no-]missin
282. le is used to assign threads to CPUs; otherwise no affinity assignments are made. If the OpenMP program is run with one initial thread (OMP_NUM_THREADS is one) or the machine has one CPU, the default value is FALSE; otherwise the default value is TRUE. The rationale for this default is that it is useful to make affinity assignments for multi-threaded programs for performance reasons, but that single-threaded programs should be run without explicit affinity assignments so that they can be scheduled freely by the operating system, just like any other serial program generated by the compiler. These defaults can of course be changed by explicitly setting PSC_OMP_AFFINITY to TRUE or FALSE.

An interesting case is when multiple OpenMP processes are run on the same node, e.g. using MPI. The OpenMP library has no specific knowledge of MPI, and each OpenMP process has no knowledge of other OpenMP processes running on that node. By default, each OpenMP process will make the same affinity assignments, and the CPU utilization may be unbalanced. In hybrid OpenMP/MPI programs using multiple OpenMP threads per process, it may be necessary to set PSC_OMP_AFFINITY to FALSE to prevent this. For hybrid OpenMP/MPI programs using a single OpenMP thread per process, the default is to disable OpenMP affinity, and the operating system will hopefully use all CPUs equitably. An alternative approach is to specify explicit and disjoint affinity assignments per MPI process using
283. lements) the current local date: day (ranging 1-31), month (ranging 1-12), year (using 4 digits). The three-argument version sets its arguments to the month, day, and year. Note that the order is different from that of the one-argument version.

ierrno
Returns the C library value errno, which is the last error code set by a C library or Linux system function. Note that a function which does not encounter an error may not set this value back to zero.

imag
Return the imaginary part of a complex number without altering precision.

imagpart
Imaginary part of a complex number; synonym for standard intrinsic aimag (which in Fortran 95 preserves the precision of its argument).

int2
Convert to type integer(kind=2).

int4
Convert to type integer(kind=4).

int8
Convert to type integer(kind=8).

irand
Fortran interface to POSIX function rand. Returns a uniform pseudorandom integer. If flag is 0, return the next number in the current sequence; if flag is 1, call POSIX function srand(0); otherwise call srand(flag) to seed a new sequence.

isatty
Fortran interface to Linux function isatty. Returns true if logical unit unit is associated with an interactive terminal device.

itime
Store in tarray (which must have three elements) the current local time: hour (ranging 0-23), minutes (ranging 0-59), seconds (ranging 0-60, to allow for leap seconds).

kill
Fortran interface
284. ler's estimate of the overhead (in processor cycles) incurred by invoking the parallel version of a loop. When the compiler parallelizes a loop, it generates both a serial and a parallel version. If the amount of work performed by the loop is small, it may not be beneficial to use the parallel version during execution. The set value of parallel_overhead is used in this determination during execution time, when the number of processors and the iteration count of the loop are taken into account. The default value is 4096. Because the optimal value varies across systems and programs, this option can be used for parallel performance tuning.

-LNO:prefetch=(0|1|2|3)
This option specifies the level of prefetching.

  0  Prefetch disabled.
  1  Prefetch is done only for arrays that are always referenced in each iteration of a loop.
  2  Prefetch is done without the above restriction. This is the default.
  3  Most aggressive.

-LNO:prefetch_ahead=N
Prefetch N cache line(s) ahead. The default is 2.

-LNO:prefetch_verbose=(ON|OFF)
-LNO:prefetch_verbose=ON prints verbose prefetch info to stdout. Default is OFF.

-LNO:processors=N
Tells the compiler to assume that the program compiled under -apo will be run on a system with the given number of processors. This helps in reducing the amount of computation during execution for determining whether to enter the parallel or serial versions of loops that are parallelized (see the -LNO:parallel_overhead option). The
285. ll pass a simple pointer that corresponds well to a C array, whereas a Fortran allocatable array or a Fortran 90 pointer array does not correspond to anything in C.

NOTE: Fortran arrays are placed in memory in column-major order, whereas C arrays use row-major order. And, of course, one must adjust for the fact that C array indices originate at zero, whereas Fortran array indices originate at 1 by default but can be declared with other origins instead.

Calls between C++ and Fortran are more difficult, for the same reason that calls between C and C++ are difficult: the C++ compiler must mangle symbol names to implement overloading, and the C++ compiler must add to data structures various information (such as virtual table pointers) that other languages cannot understand. The simplest solution is to use the extern "C" declaration within the C++ source code to tell it to generate a C-compatible interface, which reduces the problem to that of interfacing C and Fortran.

3.7.1.1 Example: Calls between C and Fortran

Here are three files you can compile and execute that demonstrate calls between C and Fortran. This is the C source code (c_part.c):

    #include <stdio.h>
    #include <alloca.h>
    #include <string.h>

    extern void f1_(char *c, int *i, long long *ll, float *f, double *d, int *l, int c_len);

    /* Demonstrate how to call Fortran from C */
    void call_fortran
286. losed block of code among the members of the team that

Table 8-2. C/C++ Compiler Directives (Continued)

FOR
  Clauses: NOWAIT, PRIVATE, FIRSTPRIVATE, LASTPRIVATE, REDUCTION, SCHEDULE (static, dynamic, guided, runtime), ORDERED
  Example: #pragma omp for [clause] for-loop

SECTIONS
  Clauses: NOWAIT, PRIVATE, FIRSTPRIVATE, LASTPRIVATE, REDUCTION
  Example: #pragma omp sections [clause] structured-block

SINGLE
  Clauses: NOWAIT, PRIVATE, FIRSTPRIVATE, COPYPRIVATE
  Example: #pragma omp single [clause] structured-block

Combined parallel work-sharing constructs: Shortcut for denoting a parallel region that contains only one work-sharing construct.

PARALLEL FOR
  Example: #pragma omp parallel for structured-block

PARALLEL SECTIONS
  Example: #pragma omp parallel sections structured-block

Synchronization constructs: Provide various aspects of synchronization; for example, access to a block of code, or execution order of statements within a block of code.

ATOMIC
  Example: #pragma omp atomic expression-statement

BARRIER
  Example: #pragma omp barrier

CRITICAL
  Example: #pragma omp critical [(name)] structured-block

FLUSH
  Example: #pragma omp flush [(list)]

MASTER
  Example: #pragma omp master structured-block

ORDERED
  Example: #pragma omp ordered structured-block

Data environments: Control the data env
287. lue as the dummy variable, regardless of intent. An allocatable function result is unallocated at the beginning of the function, but must be allocated and defined before the function returns. The result is deallocated automatically at the end of the statement which calls the function.

Fortran 2003 adds to the TR15581 document mentioned earlier a requirement that an assignment to an ordinary allocatable variable must automatically deallocate and reallocate the target to match the source, whereas Fortran 95 requires the programmer to ensure that the target is allocated and has the same shape as the source. This makes the behavior of ordinary allocatable variables consistent with that of allocatable components of structures. Our compiler does not yet provide this feature.

3.4.9 Fortran 2003 C Interoperability

A number of Fortran 2003 features allow procedures and variables coded in Fortran to interoperate with functions and variables coded in C (and, thanks to the C++ declaration extern "C", with functions and variables coded in C++). These appear in sections 15 4 6 5 1 2 15 of the standard. The features address these issues:

The language binding labels which the linker uses to represent procedures and global variables must be consistent with the external linker symbols generated by pathcc.

When a variable is accessible from both languages, its data type
288. ly Same as clist C only C only E Summary of Compiler Options ls Table E 1 Summary of Compiler Options by Function CLIST emit_pfetch ON OFF CLIST linelength N CLIST show ON OFF ffortran bounds check flist Hy LIST ON OFF Hy LIST ansi format ON OFF Hy LIST emit pfetch ON OFF LIST ftn file ile ni Hy LIST linelength N Hy LIST show setting f no permissive fullwarn g 0 1 2 3 pedantic errors subverbose trapuv zerouv FDO Options fb create lt path gt fb opt lt prefix for feedback data files gt fb phase 0 1 2 3 4 Fortran Source Form Options colN extend source fixedform OFF C only unlimited C only ON C only Fortran only Fortran only Same as FLIST ON Fortran only Same as flist Fortran only Fortran only Fortran only Fortran only Fortran only 0 Initialize variables to NaN Defaults Comments OFF If used commonly used prefix is fbdata 0 Defaults Comments s725 Fortran only Fortran only or F assumed to be written in fixed source form Fortran only E 5 E Summary of Compiler Options AA NN Table E 1 Summary of Compiler Options by Function E 6 freeform noextend source ipa IPA IPA IPA IPA IPA IPA IPA IPA IPA IPA IPA IPA IPA
289. mation for loop transformations and prefetch, etc. N can be any positive integer, and the default value is 1000.

-LNO:vintr=(0|1|2)
This flag controls loop vectorization to make use of vector intrinsic routines. (Note: a vector intrinsic routine is called once to compute a math intrinsic for the entire vector.) -LNO:vintr=1 is the default. -LNO:vintr=0 turns off the vintr optimization. Under -LNO:vintr=2 the compiler will do aggressive optimization for all vector intrinsic routines. Note that -LNO:vintr=2 could be unsafe in that some of these routines could have accuracy problems.

-LNO:vintr_verbose=(ON|OFF)
-LNO:vintr_verbose=ON prints verbose information to stdout on optimizing for vector intrinsic routines. Default is OFF. This flag will let you know which loops are vectorized to make use of vector intrinsic routines.

Following are LNO Transformation Options. Loop transformation arguments allow control of cache blocking, loop unrolling, and loop interchange. They include the following options.

-LNO:interchange=(ON|OFF)
Disable the loop interchange transformation in the loop nest optimizer. Default is ON.

-LNO:unswitch=(ON|OFF)
Turn ON or OFF the optimization that performs a simple form of loop unswitching. The default is ON.

-LNO:unswitch_verbose=(ON|OFF)
-LNO:unswitch_verbose=ON prints verbose info to stdout on unswitching loops. Default is OFF.

-LNO:ou=N
This option indicates t
290. mation on troubleshooting and debugging. See the PathScale Debugger User Guide for more information on pathdb.

4.4 Unsupported GCC Extensions

The PathScale C and C++ Compiler Suite supports most of the C and C++ extensions supported by the GCC version 4.2.0 suite. In this release, we do not support the following extensions:

For C:
  Nested functions
  Complex integer data type: Complex integer data types are not supported. Although the PathScale Compiler Suite fully supports floating point complex numbers, it does not support complex integer data types, such as _Complex int.
  SSE intrinsics
  Many of the __builtin functions
  A goto outside of the block. PathScale compilers do support taking the address of a label in the current function and doing indirect jumps to it.
  The compiler generates incorrect code for structs generated on the fly (a GCC extension).
  Java-style exceptions
  java_interface attribute
  init_priority attribute

Section 5 Porting and Compatibility

5.1 Getting Started

Here are some tips to get you started compiling selected applications with the PathScale Compiler Suite.

5.2 GNU Compatibility

The PathScale Compiler Suite C, C++, and Fortran c
291. mber calculated after PSC_OMP_CPU_STRIDE has been applied. If the resulting value is greater than the number of CPUs, then the remainder is used from the division of this value by the number of CPUs.

PSC_OMP_GUARD_SIZE
This environment variable specifies the size in bytes of a guard area that is placed below pthread stacks. This guard area is in addition to any guard pages created by your O/S.

PSC_OMP_GUIDED_CHUNK_DIVISOR
The value of PSC_OMP_GUIDED_CHUNK_DIVISOR is used to divide down the chunk size assigned by the guided scheduling algorithm. See section 8.9.2 for details.

PSC_OMP_GUIDED_CHUNK_MAX
This is the maximum chunk size that will be used by the loop scheduler for guided scheduling. See section 8.9.2 for details.

PSC_OMP_LOCK_SPIN
This chooses the locking mechanism used by critical sections and OMP locks. See section 8.9.2 for details.

PSC_OMP_SILENT
If you set PSC_OMP_SILENT to anything, then warning and debug messages from the libopenmp library are inhibited.

PSC_OMP_STACK_SIZE
(Fortran) Stack size specification follows the syntax in section 3.13.

PSC_OMP_STATIC_FAIR
This determines the default static scheduling policy when no chunk size is specified, as discussed in section 8.9.2.

PSC_OMP_THREAD_SPIN
This takes a numeric value and sets the number of times that the spin loops will spin at user level before falling back to O/S schedule/reschedule mecha
292. me implementations that aren't compatible with any Fortran or C construct. Fortran 2003 does not attempt to interoperate with C++. The best way to interface Fortran with C++ is to use the extern "C" declaration to create C-compatible functions and data structures within the C++ code, and then to use Fortran's C interoperability features to interface with those.

Linking a program which contains both Fortran and C++ code presents a special problem, because neither language automatically uses the other's libraries. Generally you should use pathCC to link the program, specifying -lpathfortran on the command line. See section 3.7 for details.

3.4.9.8 Pitfalls

It is important that declarations are consistent in their use of bind(c). In particular, on the IA32 architecture, or the X8664 architecture with the -m32 option, Fortran normally pads 8-byte data to force 8-byte alignment, but C and the bind(c) attribute require only 4-byte alignment. If one Fortran compilation declares a derived type or common block with the bind(c) attribute, but another Fortran compilation omits the attribute, the two compilations may use different memory addresses for the data.

3.5 Extensions

The PathScale Fortran compiler supports a number of extensions to the Fortran standard, which are described in this section.

3.5.1 Promotion of REAL and INTEGER Types

3.5.2

Section 5 has more information a
293. me will use no more than 1.5 gigabytes (GB) of stack. On a system with 2GB of physical memory, a limit of 90%/cpu will use no more than 0.9GB of stack (2 / 2 * 0.90).

PSC_STACK_VERBOSE
If this environment variable is set, the Fortran runtime will print detailed information about how it is computing the stack size limit to use.

A.4 Language-independent Environment Variables

FILENV
The location of the assign file. See the assign man page for more details.

PSC_COMPILER_DEFAULTS_PATH
Specifies a PATH or a colon-separated list of PATHs designating where the compiler is to look for the compiler.defaults file. If the environment variable is set, the PATH /opt/pathscale/etc will not be used. If the file cannot be found, then no defaults file will be used, even if one is present in /opt/pathscale/etc.

PSC_GENFLAGS
Generic flags passed to all compilers. This variable is used with the gcc compatibility wrapper scripts.

PSC_PROBLEM_REPORT_DIR
Name a directory in which to save problem reports and preprocessed source files, if the compiler encounters an internal error. If not specified, the directory used is $HOME/.ekopath-bugs.

A.5 Environment Variables for OpenMP

These environment variables are described in detail in section 8. They are listed here for your reference.

A.5.1 Standard OpenMP Runtime Environment Variables

These environment variables can be used with
294. ming language as documented in "Using GCC: The GNU Compiler Collection Reference Manual," October 2003, for GCC version 3.3.1. Refer to section 4.4 of this document for the list of extensions that are currently not supported.
  Complies with the C++ Application Binary Interface as defined by the GNU C++ compiler (g++) as implemented on the platforms supported by the PathScale Compiler Suite.
  Supports most of the widely used command-line options supported by g++.
  Generates code that complies with the x86_64 ABI and the 32-bit x86 ABI.

To invoke the PathScale C and C++ compilers, use these commands:

  pathcc    invoke the C compiler
  pathCC    invoke the C++ compiler

The command line flags for both compilers are compatible with those taken by the GCC suite. See section 4.1 for more discussion of this.

4.1 Using the C/C++ Compilers

If you currently use the GCC compilers, the PathScale compiler commands will be familiar. Makefiles that presently work with GCC should operate with the PathScale compilers with little effort: simply change the command used to invoke the compiler and rebuild. See section 5.7.1 for information on modifying existing scripts.

The invocation of the compiler is identical to the GCC compilers, but the flags to control the compilation are different. We have sought to provide flags compatible with GCC's flag usage whenever possible, and also provide optimizati
295. mizations for OpenMP

The most important optimizations for OpenMP applications tend to be loop nest optimization (LNO), code generation (CG), and aggressive optimizations (e.g. by reducing numerical accuracy). IPA (inter-procedural analysis) may help with OpenMP programs too; try it and see.

8.14.3.1 Libraries

Some applications spend a large amount of time in numerical libraries. At small numbers of nodes, a highly optimized and tuned serial algorithm (crafted for the target processor) may out-perform a parallel implementation based on a non-optimized algorithm. At higher numbers of nodes, the parallel version may scale and give better performance. However, best performance will typically require an OpenMP parallelization of the best serial algorithm (exploiting target features such as SSE, for example). Check to see if there are OpenMP-enabled versions of these numerical libraries available.

8.14.3.2 Memory System Performance

OpenMP applications are often very sensitive to memory system performance. An excellent approach is to tune the memory system with an OpenMP version of the STREAM benchmark. In particular, the BIOS settings for memory bank interleaving should be auto, and for node interleaving should be off.

Interleaving memory by node causes memory addresses to be striped across the various nodes at a low granularity, creating the illusion o
296. modulevar3 44 doubleprecision modulevar4 55 5 end module mymodule program myprogram use mymodule modulevarl 22 modulevar2 33 3 call mycfunction end program myprogram S cat mycprogram c include lt stdio h gt extern struct int modulevarl double modulevar2 mymodule data extern struct int modulevar3 double modulevar4 mymodule data init void mycfunction printf Sd gWMn mymodule data modulevarl mymodule data modulevar2 printf d g n mymodule data init modulevar3 mymodule data init modulevar4 cat dfile data init in mymodule mymodule data init data in mymodule in mymodule mymodule data mycfunction mycfunction pathf90 fdecorate dfile mymodule f90 mycprogram c mymodule f90 mycprogram c a out 22 33 3 44 55 5 3 8 Runtime I O Compatibility Files generated by the Fortran I O libraries on other systems may contain data in different formats than that generated or expected by codes compiled by the PathScale Fortran compiler This section discusses how the PathScale Fortran compiler interacts with files created by other systems 3 34 3 The PathScale Fortran Compiler Runtime I O Compatibility aan 3 8 1 Performing Endian Conversions Use the assign command or the ASSIGN procedure to perform endian conversions while doing file I O 3 8 1 1 The assign Command The assign command changes or displays the I O processing directives fo
297. mp times their stack size. However, RES (or RSS) will typically be much less, and this is the real physical memory requirement for the application.

NOTE: A large stack limit for the main thread does not show up in the VIRT or SIZE figure. This is because the operating system has special handling for the main thread of an application, and does not need to pre-allocate virtual memory pages for its stack up to the stack limit.

The pthread stack limit is typically much lower when using linuxthreads than with NPTL threads. Linux kernels in the 2.4 series and earlier tend to be provided with linuxthreads, while NPTL is typically the default with 2.6 series kernels. However, some distributions have back-ported NPTL to their 2.4 series kernels.

NOTE: When a program is statically linked with pthreads, this might also trigger use of linuxthreads on some distributions.

For best libopenmp performance, and to avoid stack size limitations, it is highly recommended that 2.6 series Linux kernels, NPTL, and dynamic linkage are used with OpenMP programs.

8.12 Example OpenMP Code in Fortran

The following program is a parallel version of "hello world" written using OpenMP directives. When run, it spawns multiple threads. It uses the CRITICAL directive to ensure that the printing from the various threads will not overwrite one another. Here is the program omphello.f
298. mproving performance and the ability of the system to be expanded. NUMA is used in a symmetric multiprocessing (SMP) system.

object file
The intermediate representation of code generated by a compiler after it processes a source file.

pathcov
The version of gcov that PathScale supports with its compilers. Other versions of gcov may not work with code generated by the PathScale Compiler Suite, and are not supported by PathScale.

pathprof
The version of gprof that PathScale supports with its compilers. Other versions of gprof may not work with code generated by the PathScale Compiler Suite, and are not supported by PathScale.

peak
A set of optional flags used with the compiler in SPEC runs to optimize performance.

SIMD
Single Instruction Multiple Data. An i386/AMD64 instruction set extension which allows the CPU to operate on multiple pieces of data contained in a single, wide register. These extensions were in three parts, named MMX, SSE, and SSE2.

SMP
Symmetric multiprocessing is a "tightly-coupled," "share everything" system in which multiple processors working under a single operating system access each other's memory over a common bus or "interconnect" path.

source file
A software program, usually made up of several text files, written in a programming language, that can be converted into machine-readable code through the use of a compiler.

SPEC
Standard Performance Evaluation Corporation. SPEC provides a standard
299. must be compatible.
  The representation of a pointer in one language must be converted to that of the other language.
  The Fortran interface for a procedure must agree with the C prototype for a function with regard to whether arguments are passed by value.
  Enumeration constants must have consistent values in both languages.

3.4.9.1 BIND attribute

The BIND attribute tells the Fortran compiler that a procedure, type, variable, common block, or enumeration must be compatible with C. For procedures, module-level variables, and common blocks, it can also alter the language binding label used by the linker, so as to be compatible with the external symbol generated by pathcc.

The simplest use of the BIND attribute simply declares that a variable, type, or procedure must be compatible with C:

  module m
    type, bind(c) :: t
      integer :: icomponent
      real :: rcomponent
    end type t
    type(t), bind(c) :: mvar
  contains
    subroutine s() bind(c)
      common /c/ i
      bind(c) :: /c/
    end subroutine s
  end module m

In the preceding example, pathf90 will arrange the components of type t in memory with the same alignment and padding that pathcc would use for a similar C struct. It will use the same linker external symbols for mvar, c, and s that pathcc would use for variables named mvar and c, and for a function named s.

A type cannot have both the BIND and SEQUENCE properties, but BIND behaves like SEQUENCE in the sense that two identical type
300. n
  Strength reduction and loop termination test replacement
  Dead store elimination
  Control flow optimizations
  Instruction scheduling across basic blocks

-O2 implies the flag -OPT:goto=on, which enables the conversion of GOTOs into higher-level structures like FOR loops. -O2 also sets -OPT:Olimit=6000.

-O3 turns on additional optimizations which will most likely speed your program up, but may, in rare cases, slow your program down. The optimizations provided at this level include all -O1 and -O2 optimizations, and also include, but are not limited to, the flags noted below:

  -LNO:opt=1    Turn on Loop Nest Optimization (for more details, see section 7.4)
  -OPT with the following options in the OPT group (see the opt man pages for more information):
    -OPT:roundoff=1      (see section 7.7.4.2)
    -OPT:IEEE_arith=2    (see section 7.7.4)
    -OPT:Olimit=9000     (see section 6.3)
    -OPT:reorg_common=1  (see the eko(7) man page)

NOTE: In our in-house testing, we have noticed that several codes which are slower at -O3 than -O2 are fixed by using -O3 -LNO:prefetch=0. This seems to mainly help codes that fit in cache.

7.2 Syntax for Complex Optimizations (-CG, -IPA, -LNO, -OPT, -WOPT)

The group optimizations control a variety of behaviors and can override defaults. This section covers the syntax of these options. The group options
301. n 3

  enum, bind(c)
    enumerator :: red = 1, blue
    enumerator green
  end enum

  ! tiger = 1, giraffe = 7, lion = 8
  enum, bind(c)
    enumerator :: tiger = 1, giraffe = 7, lion
  end enum

3.4.9.6 Example: Using C malloc from Fortran

The C Interoperability features can create relatively straightforward interfaces to Standard C library functions like malloc and free, as shown in the following example:

  program malloc_example
    use, intrinsic :: iso_c_binding
    implicit none
    interface
      type(c_ptr) function malloc(ksize) bind(c)
        use, intrinsic :: iso_c_binding
        implicit none
        integer(c_size_t), value :: ksize
      end function malloc
      subroutine free(p) bind(c)
        use, intrinsic :: iso_c_binding
        implicit none
        type(c_ptr), value, intent(in) :: p
      end subroutine free
    end interface
    real(c_float), pointer :: tmp(:,:)
    type(c_ptr) :: tmp_ptr
    integer :: r, c
    tmp_ptr = malloc(int(4 * 2 * 2, kind=c_size_t))
    call c_f_pointer(tmp_ptr, tmp, (/ 2, 2 /))
    do r = 1, ubound(tmp, 1)
      do c = 1, ubound(tmp, 2)
        tmp(r, c) = r * 10 + c
      end do
    end do
    print '(f10.5)', tmp
    call free(tmp_ptr)
  end program malloc_example

3.4.9.7 Issues Unique to C++

C++ compilers normally "mangle" the names of external symbols, decorating them so that overloaded identifiers have unique names at link time. In addition, many C++ constructs (such as polymorphic classes and member pointers) require runti
302. n Program Is In a Library

3.2.2 Linking Object Files to the Rest of the Program

A source file containing a module generates an object (.o) file as well as a module information (.mod) file, even if the source file contains nothing other than the module. That object file must be linked with the rest of the program. If a single command compiles and links the entire program, this will happen automatically; but if you use a separate command to link objects together, you must be careful not to omit object files resulting from source files which contain only modules. The order of object files in such a command does not matter. For example:

  pathf95 -c mymodule.f95
  pathf95 -c myprogram.f95
  pathf95 myprogram.o mymodule.o

Notice that a source file containing multiple modules will generate one object (.o) file, which takes its name from the source file, plus multiple module information (.mod) files, which take their names from the names of the modules themselves. For example, generate MYMODULE1.mod, MYMODULE2.mod, MYMODULE3.mod, and my3modules.o:

  pathf95 -c my3modules.f95

Then generate the main program, which uses the modules:

  pathf95 -c myprogram.f95
  pathf95 my3modules.o myprogram.o

3.3 Linking When the Main Program Is In a Library

When working with a long list of object files, it is possible to put them all into a single library, then specify the library in place of the object files when linking the program. If the main program is coded
303. n for default data types, and by providing the -ff2c-abi option to address a situation where g77 deviates from the Linux standard ABI for the x8664 machine. We address issue (2) by including the g77 runtime library in the PathScale library. Issues (3), (4), and (5) do not arise, because g77 does not support any of the Fortran 90/95 features which require a dope vector, the decoration of identifiers, or the generation of a .mod file.

For compilers other than g77, it may nevertheless be possible to link their object files with those generated by PathScale Fortran, even if the program uses features from Fortran 90 and later standards, provided one manages to circumvent incompatibilities when coding. Some tips:

1. When code generated by one compiler calls a procedure generated by another, use the Fortran 77 style of procedure call, avoiding any of the sorts of dummy arguments which would require the calls to be explicit in Fortran 90 and later standards. Do not use a module generated by one compiler in a procedure generated by another.

2. Use options like -fno-second-underscore and -fdecorate as needed. The gfortran, g95, ifort, pgf90, and Sun f90 compilers all behave like our -fno-second-underscore; g77 behaves like our -fsecond-underscore. These options are meant to address the name mangling problems for Fortran 77-style external identifiers, not for Fortran 90-style module-level identifie
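The effect of the second-underscore options on Fortran 77-style external names can be sketched in C. This is only an illustration of the commonly documented g77 convention (one underscore is appended to every external name, and a second one is appended when the name already contains an underscore), not code taken from the compiler. The function name mangle_f77_name and its parameters are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not a PathScale API): build the linker symbol for a
 * Fortran 77-style external name under the g77 underscore convention.
 * One underscore is always appended; when second_underscore is nonzero and
 * the name already contains an underscore, a second one is appended.
 * Assumes the name has already been folded to lower case.
 * Returns 0 on success, -1 if the output buffer is too small.
 */
int mangle_f77_name(const char *name, int second_underscore,
                    char *out, size_t out_size)
{
    const char *suffix = "_";
    if (second_underscore && strchr(name, '_') != NULL)
        suffix = "__";

    int n = snprintf(out, out_size, "%s%s", name, suffix);
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

Under this sketch a C routine meant to be called from Fortran as my_sub would need to be named my_sub__ when the Fortran side uses the second-underscore behavior, and my_sub_ otherwise.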
304. n is enabled when any other -FLIST options are enabled, but it can also be used to enable a listing when no other options are enabled.

-FLIST:ansi_format=setting
Set ANSI format. setting can be either ON or OFF. When set to ON, the compiler uses a space (instead of tab) for indentation and a maximum of 72 characters per line. The default is OFF.

-FLIST:emit_pfetch=setting
Writes prefetch information, as comments, in the transformed source file. setting can be either ON or OFF. The default is OFF. In the listing, PREFETCH identifies a prefetch and includes the variable reference (with an offset in bytes), an indication of read/write, a stride for each dimension, and a number in the range from 1 (low) to 3 (high), which reflects the confidence in the prefetch analysis. Prefetch identifies the reference(s) being prefetched by the PREFETCH descriptor. The comments occur after a read/write to a variable and note the identifier of the PREFETCH spec for each level of the cache.

-FLIST:ftn_file=file
Write the program to file. By default, the program is written to file.w2f.f.

-FLIST:linelength=N
Set the maximum line length to N characters.

-FLIST:show=setting
Write the input and output filenames to stderr. setting can be either ON or OFF. The default is ON.

-fms-extensions
For C/C++ only. Accept broken MFC extensions without warning.

-fno-asm
For C/C++ only. Do not recognize the asm keyword.

-fno-b
305. n mapping threads to CPUs. It takes an integer value in the range of 0 to the number of CPUs (inclusive). The default is a stride of 1, which causes the threads to be linearly mapped to consecutive CPUs. When there are more threads than CPUs, the mapping wraps around, giving a round-robin allocation of threads to CPUs. The behavior for a stride of 0 is the same as a stride of 1.

Strides greater than 1 are useful when there is a hierarchy of CPUs in the system and the scheduling algorithm needs to take account of this to make best use of system resources. A particularly interesting case is when the system comprises a number of multi-core chips, such that each core shares some resources (e.g. a memory interface) with other cores on that chip. It may then be desirable to spread threads across the chips first, to make best use of that resource, before scheduling multiple threads to the cores on each chip.

Let the number of CPUs in a multi-core chip be m, and the number of multi-core chips in the system be n. The total number of CPUs is then n multiplied by m. There are two typical orders in which the system may number the CPUs:

  For chip index p in [0, n) and core index c in [0, m), the CPU number is p + c * n. This is "core-major" ordering, since incrementing the core number increases the CPU number by n, while incrementing the chip number only increases the CPU number by 1.

  For chip index p in [0, n) and core index c in [0, m), the CPU number is p * m + c. T
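The two CPU numbering orders described in this entry can be written out as a minimal C sketch. The formulas assume the reading p + c * n for core-major ordering and p * m + c for the second ordering; the function names are hypothetical and are not part of libopenmp.

```c
/* Illustrative helpers only; the names are hypothetical, not part of libopenmp. */

/*
 * Core-major ordering: chip index p in [0, n), core index c in [0, m).
 * Incrementing the core index moves the CPU number by n (the chip count);
 * incrementing the chip index moves it by 1.
 */
int cpu_core_major(int p, int c, int n)
{
    return p + c * n;
}

/*
 * Second ordering: the m cores of one chip are numbered consecutively.
 * Incrementing the chip index moves the CPU number by m (cores per chip);
 * incrementing the core index moves it by 1.
 */
int cpu_chip_major(int p, int c, int m)
{
    return p * m + c;
}
```

For a system with n = 2 chips of m = 4 cores each, chip 1 core 0 is CPU 1 under the first ordering and CPU 4 under the second, which is why the useful stride differs between the two.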
306. n provide up to a 10-20x performance advantage over other compilers on certain matrix operations at -O3. In rare circumstances, this feature can make things slower, so you can use -LNO:opt=0 to disable nearly all loop nest optimization. Trying to make an -O2 compile faster by adding -LNO:opt=0 will not work, because the -LNO feature is only active with -O3 (or -Ofast, which implies -O3).

Some of the features that one can control with the -LNO group are:
  Loop fusion and fission
  Blocking to optimize cache line reuse
  Cache management
  TLB (Translation Lookaside Buffer) optimizations
  Prefetch

In this section we will highlight a few of the LNO options that have frequently been valuable.

Loop Fusion and Fission

Sometimes loop nests have too few instructions, and consecutive loops should be combined to improve utilization of CPU resources. Another name for this process is loop fusion.

Sometimes a loop nest will have too many instructions, or deal with too many data items in its inner loop, leading to too much pressure on the registers, resulting in spills of registers to memory. In this case, splitting loops can be beneficial. Like splitting an atom, splitting loops is termed fission. These are the LNO options to control these transformations:

-LNO:fusion=n
Perform loop fusion. n: 0 off, 1 conservative, 2 aggressive. Level 2 implies that outer loops in consecutive loop ne
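As a hand-written illustration of what fusion and fission mean at the source level, the following C sketch shows the same computation in unfused and fused form. It is not output from the loop nest optimizer, the function names are hypothetical, and it assumes the three arrays do not overlap.

```c
#include <stddef.h>

/* Unfused form: two consecutive loops over the same index range. */
void scale_then_offset_unfused(double *a, double *b, const double *x, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = 2.0 * x[i];
    for (size_t i = 0; i < n; i++)
        b[i] = a[i] + 1.0;
}

/*
 * Fused form: one loop executes both statements, so a[i] can be reused
 * while it is still in a register or in cache. Splitting this loop back
 * into the two loops above would be fission.
 */
void scale_then_offset_fused(double *a, double *b, const double *x, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        a[i] = 2.0 * x[i];
        b[i] = a[i] + 1.0;
    }
}
```

Both functions produce identical results; the difference lies only in how much work each loop body carries, which is the trade-off between CPU utilization and register pressure described in this entry.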
307. n runs

7.9 The pathopt2 Tool

The pathopt2 tool is used to iteratively test different options and option combinations by compiling a set of application source code files, measuring the performance of the executable, and tracking the results. The best options are obtained from the output of these runs and are used to adaptively tune successive runs, yielding the best set of compiler options for a given combination of application code, data set, hardware, and environment. A sorted list of execution times is produced for each run.

The tool uses an XML option configuration file that defines one or more execution targets. Each execution target specifies options to try and indicates how they are to be combined into a series of tests.

In general, using pathopt2 involves these steps:

1. Run pathopt2 using an execution target in the supplied option configuration file.
2. Interpret the results.
3. Choose a more detailed execution target based on the results from the first run, and repeat the process until the best compiler options are found.

The pathopt2 tool can be completely driven from its command line, or it can alternatively use scripts to build and test the programs. Scripts are useful for more complex runs, for interfacing to existing build and test mechanisms, and for automating the process.

For a standard installation, the program pathopt2 is located in /opt/pathscale
308. n the command line, it will behave as if invoked with -O2 alone, because -O2 and -O3 are exclusive options.

For additive options, the command line is used before the defaults file. For example, if the compiler.defaults file contains -I/usr/foo and the command line contains -I/usr/bar, the compiler will behave as if invoked with -I/usr/bar -I/usr/foo.

The format of the compiler.defaults file is simple. Each line can contain compiler options, separated by white space, followed by an optional comment. A comment begins with the # character and ends at the end of a line. Empty lines and lines containing only comments are skipped.

Here is an example defaults file:

  # PathScale compiler defaults file.
  #
  # Set default CPU type to optimize for, since all of our
  # systems use the same CPUs.
  -march=opteron
  # We have a recent Opteron CPU stepping, so it's safe to
  # always use SSE3.
  -msse3
  # Ensure that the FFTW library is available to users, so
  # they don't need to remember where it's installed.
  -L/share/fftw3/lib -I/share/fftw3/include
  # Use the GCC 4.x front-end by default.
  -gnu4

The environment variable PSC_COMPILER_DEFAULTS_PATH, if set, specifies a PATH or a colon-separated list of PATHs designating where the compiler is to look for the compiler.defaults file. If the environment variable is set, the PATH /opt/pathscale/etc will not be used. If the file
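The line format described in this entry (options separated by white space, an optional trailing comment, empty and comment-only lines skipped) can be sketched as a small C tokenizer. This is a hypothetical illustration and not the compiler's own parser: it assumes the comment character is '#', does no quoting, and the function name parse_defaults_line is invented here.

```c
#include <ctype.h>
#include <string.h>

/*
 * Hypothetical sketch, not the compiler's actual parser: split one line of
 * a defaults file into option tokens. Everything from '#' to the end of the
 * line is treated as a comment (this simple version does not allow '#'
 * inside an option). The line is modified in place and argv[] receives
 * pointers into it. Returns the number of options found, which is 0 for
 * empty and comment-only lines.
 */
int parse_defaults_line(char *line, char **argv, int max_args)
{
    char *hash = strchr(line, '#');
    if (hash != NULL)
        *hash = '\0';                 /* drop the comment */

    int argc = 0;
    char *p = line;
    while (*p != '\0' && argc < max_args) {
        while (isspace((unsigned char)*p))
            p++;                      /* skip leading white space */
        if (*p == '\0')
            break;
        argv[argc++] = p;             /* start of the next option */
        while (*p != '\0' && !isspace((unsigned char)*p))
            p++;
        if (*p != '\0')
            *p++ = '\0';              /* terminate this option */
    }
    return argc;
}
```

A caller would read the file line by line and append the returned tokens after the command-line options, which matches the stated rule that additive options from the command line are used before those from the defaults file.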
309. n the noise threshold. There are three FP benchmarks that slow down from 1.2% to 4.5% due to -ipa. The slowdown indicates that the benchmarks do not benefit from the default settings of the IPA parameters. By using additional IPA tuning flags, such slowdown can often be converted to performance gain. The average performance improvement over all the benchmarks listed in table 7-1 is 6%.

Table 7-2. Effects of IPA tuning on some SPEC CPU2000 benchmarks

  Benchmark     Time (peak flags    Time (peak flags    Improvement   IPA Tuning Flags
                w/o IPA tuning)     with IPA tuning)
  181.mcf       325.3 s             275.5 s             15.3%         -IPA:field_reorder=on
  197.parser    296.5 s             245.2 s             17.3%         -IPA:ctype=on
  253.perlbmk   195.1 s             177.7 s             8.9%          -IPA:min_hotness=5:plimit=20000
  168.wupwise   147.7 s             129.7 s             12.2%         -IPA:space=1000:linear=on -IPA:plimit=50000:callee_limit=5000 -INLINE:aggressive=on
  187.facerec   144.6 s             141.6 s             2.1%          -IPA:plimit=1800

Table 7-2 shows the effects of using additional IPA tuning flags on the peak runs of the CPU2000 performance. In the peak runs, each benchmark can be built with its own combination of any number of tuning flags. We started with the peak flags of the benchmarks used in PathScale's SPEC CPU2000 submission, and we found that five of the benchmarks are using IPA tuning flags. Table 7-2 lists these five benchmarks. The second column gives the running times if the IPA
310. n through any preprocessor by default. The Fortran source preprocessor does not automatically expand macros outside of preprocessor statements, so you need to specify -macro-expand if you want macros expanded.

-fullwarn
Request that the compiler generate comment-level messages. These messages are suppressed by default. Specifying this option can be useful during software development.

-f[no-]underscoring
For Fortran only. -funderscoring appends underscores to symbols. -fno-underscoring tells the compiler not to append underscores to symbols.

-f[no-]unsafe-math-optimizations
-funsafe-math-optimizations improves FP speed by violating ANSI and IEEE rules. -fno-unsafe-math-optimizations makes the compilation conform to ANSI and IEEE math rules at the expense of speed. This option is provided for GCC compatibility and is equivalent to -OPT:IEEE_arithmetic=3 -fno-math-errno.

-f[no-]unwind-tables
-funwind-tables emits unwind information. -fno-unwind-tables tells the compiler never to emit any unwind information. This is the default. Flags to enable exception handling automatically enable -funwind-tables.

-fuse-cxa-atexit
For C++ only. Register static destructors with __cxa_atexit instead of atexit.

-fwritable-strings
For C/C++ only. Attempt to support writable strings (K&R style C).

-g[N]
Specify debugging support and indicate the level of information produced by the compiler. The supported
311. nMP code that will result in incorrect execution. As long as all the OpenMP-related code is guarded by conditional compilation sentinels (e.g. !$ or #pragma), you can re-compile the same program without the -mp flag. In these cases, the resulting executable will run serially. If the error no longer occurs, you can conclude that the problems in the parallel execution are due to mistakes in the OpenMP part of the code, making the problem easier to track down and fix. See section 10.11 for more tips on troubleshooting OpenMP problems.

8.4 OpenMP Compiler Directives (Fortran)

The OpenMP directives for Fortran all start with comment characters followed by $OMP or $omp. They are only processed by the compiler if -mp is specified.

NOTE: Possible comment characters that can be used include !, C, c, and *. In the following examples, we use ! as the comment character. The OpenMP standard dictates that for fixed-form Fortran, !$OMP directives must begin in the first column of the line.

Some of the OpenMP directives also support additional clauses. The following table lists the Fortran compiler directives provided by version 2.0 of the OpenMP Fortran Application Program Interface.

Table 8-1. Fortran Compiler Directives

  Directive    Clauses    Example

  Parallel region construct: defines a parallel region
  PARALLEL     !$OMP parallel clause s
312. nction fseek, which treats logical unit unit as a stream of bytes and changes to offset the position pointer used by the next stream intrinsic which reads or writes the file. If whence is 0, offset counts bytes from the beginning of the file; if whence is 1, offset positions the pointer relative to the current position; and if whence is 2, offset positions the pointer relative to the end of the file. The function form returns 0 on success or an error code from the C library value errno. Between the opening and closing of a file, you should use either stream intrinsics (fget, fgetc, fput, fputc, fseek, and ftell) or standard Fortran I/O, but not both.

fstat
Fortran interface to the C library function fstat. Stores in sarray information about the file opened on logical unit unit. The function form returns 0 on success or an error code from the C library variable errno. The subroutine form sets status to the value which the function would return. sarray must have thirteen elements:

   1. ID of device containing file
   2. Inode number
   3. File mode
   4. Number of links
   5. UID of owner
   6. GID of owner
   7. ID of device containing directory entry for file
   8. Size of file in bytes
   9. Time of last access
  10. Time of last modification
  11. Time of last file status change
  12. Preferred I/O block size (-1 if not available)
  13. Number of blocks allocated (-1 if not available)

Except for elements 12 and 13, valu
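To show where the thirteen sarray elements come from, the following C sketch calls the C library fstat directly and copies the fields of struct stat into an array in the order listed above. It is an illustration only, not the Fortran runtime's implementation: the helper name fstat_to_sarray is hypothetical, it takes a path rather than a logical unit, and it returns -1 on failure instead of an errno value.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper, not the Fortran runtime's implementation: open a
 * file, call the C library fstat(), and copy the fields into a 13-element
 * array in the order listed for sarray (C index 0 holds element 1).
 * Returns 0 on success or -1 if the file cannot be opened or examined.
 */
int fstat_to_sarray(const char *path, long long sarray[13])
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    int rc = fstat(fd, &st);
    if (rc == 0) {
        sarray[0]  = (long long)st.st_dev;     /*  1: device containing file  */
        sarray[1]  = (long long)st.st_ino;     /*  2: inode number            */
        sarray[2]  = (long long)st.st_mode;    /*  3: file mode               */
        sarray[3]  = (long long)st.st_nlink;   /*  4: number of links         */
        sarray[4]  = (long long)st.st_uid;     /*  5: UID of owner            */
        sarray[5]  = (long long)st.st_gid;     /*  6: GID of owner            */
        sarray[6]  = (long long)st.st_rdev;    /*  7: device ID of the entry  */
        sarray[7]  = (long long)st.st_size;    /*  8: size in bytes           */
        sarray[8]  = (long long)st.st_atime;   /*  9: last access             */
        sarray[9]  = (long long)st.st_mtime;   /* 10: last modification       */
        sarray[10] = (long long)st.st_ctime;   /* 11: last status change      */
        sarray[11] = (long long)st.st_blksize; /* 12: preferred I/O block size */
        sarray[12] = (long long)st.st_blocks;  /* 13: blocks allocated        */
    }
    close(fd);
    return rc == 0 ? 0 : -1;
}
```

The array is declared as long long here so that large files and inode numbers fit; the kinds accepted by the Fortran intrinsic are those given in the intrinsics table.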
313. ndencies to specified output file.

-MG
    With -M or -MM, treat missing header files as generated files.
-MM
    Output user dependencies of source file.
-MMD
    Write user dependencies to .d output file.
-mno-sse
    Disable the use of SSE2/SSE3 instructions. SSE2 cannot be disabled under -m64, and will result in a warning.
-mno-sse2
    Disable the use of SSE2/SSE3 instructions. SSE2 cannot be disabled under -m64, and will result in a warning.
-mno-sse3
    Disable the use of SSE3 instructions.
-mno-sse4a
    Disable the use of SSE4A instructions.
-module dir
    Create the .mod file corresponding to a module statement in the directory dir instead of the current working directory. Also, when searching for modules named in use statements, examine the directory dir before the directories established by -Idir options.
-mp
    Interpret OpenMP directives to explicitly parallelize regions of code for execution by multiple threads on a multi-processor system. Most OpenMP 2.0 directives are supported by pathf95, pathcc, and pathCC. See the PathScale Compiler Suite User Guide for more information on these directives.
-MP
    With -M or -MM, add phony targets for each dependency.
-MQ
    Same as -MT, but quote characters that are special to Make.
-msse2
    Enable use of SSE2 instructions. This is the default under both -m64 and -m32.
-msse3
    Enable use of SSE3 instructions. Default is ON under -march=barcelona, -march=em64t and -march
314. ne SIZE 1 1 2 1 4 1 8 ANSI PGI O SEED PUT 1 1 1 2 1 4 1 8 TRADITIONAL OQ Array O rank 1 GET 1 1 1 2 1 4 1 8 Array rank 1 RANF TRADITIONAL E RANGE X 1 1 I2 1 4 1 8 ANSI PGI E R 4 R 8 TRADITIONAL Z8 Z 16 REAL R 4 A 1 1 1 2 1 4 1 8 ANSI G77 E R 4 R 8 PGI O Z 8 Z 16 TRADITIONAL KIND 171 1 2 1 4 I8 REALPART R 4 A 1 1 1 2 1 4 I 8 G77 E R 4 R 8 Z 8 Z 16 KIND 11 1 2 1 4 I8 REMOTE Subroutine TRADITIONAL E WRITE BARRIER REM IMAGES 1 4 TRADITIONAL RENAME 1 4 PATH1 C G77 PGI O PATH2 C STATUS I4 RENAME Subroutine G77 PATH1 C O PATH2 C STATUS I4 REPEAT Depends on arg STRING C ANSI PGI NCOPIES I 1 1 2 TRADITIONAL 1 4 1 8 RESHAPE ANSI PGI See Std TRADITIONAL RRSPACING X R 4 R 8 ANSI PGI E TRADITIONAL C 35 C Supported Fortran Intrinsics Table of Supported Intrinsics AA NN Table C 1 Fortran Intrinsics Supported in Version 3 2 Continued Intrinsic Name Result Arguments Families Remarks RSHIFT 1 11 I 2 1 4 1 8 G77 PGI E R 4 R 8 TRADITIONAL CrayPtr L 1 L 2 L 4 L 8 NEGATIVE_SHIFT F1 1 2 1 4 18 RTC TRADITIONAL E SCALE X R 4 R 8 ANSI PGI E l 1 1 Po 1 4 1 8 TRADITIONAL SCAN 1 4 STRING C ANSI PGI E SET C TRADITIONAL BACK L 1 L 2 L 4 L 8 SECNDS R 4 T R 4 G77 PGI SECOND R 4 SECONDS R 4 G77 O SECOND Subroutine SECONDS R 4 G77 SELECTED INT R 171 1 2 1
315. ng Source Files 0 0 0 cece eee eee 4 3 4 2 1 1 Pre defined Macros 0 000 cece ees 4 4 4 2 2 PIGO MAS maawat axe Na acit eet tie ne ae eb eek 4 6 4 2 2 1 Page ix PathScale Compiler Suite User Guide Version 3 2 aaa CB MG Pl AA Eee ge ete de See oe ae 4 6 4 2 2 2 Changing Optimization Using Pragmas 4 6 4 2 2 3 Code Layout Optimization Using Pragmas 4 6 4 2 3 Mixing CODE ese bi thane shea ehResheSeau ABA GAAN EP sess 4 7 4 2 4 LINKING as seek eR wae ne ak At EGRE a aes ea ACE as 4 7 4 3 Debugging and Troubleshooting C C llli 4 7 4 4 Unsupported GCC Extensions 0 eee eee eee 4 8 Section 5 Porting and Compatibility 5 1 Getting Started uae we deen kac R PE 5 1 5 2 GNU Compatibility ux eg e een Reade WA bed ee HAKA AGA 5 1 5 3 Compatibility with Other Fortran Compilers aa 5 1 5 4 PONING FOWAG saevi 6222eas onodtadieetearetscudesetuateee 5 3 5 4 1 Unngorie sce cate ook Ase Yeeeah ee ek oe OE ae eee Males 5 3 5 4 1 1 Anm EXdmple 3 153 964 duree RR oci deed eee keyed Bka NAH awd 5 4 5 4 2 Name mangling 3a hccewe OE OR RON ERA PA NAA EAS NO des 5 4 5 4 3 Static Data CDI 5 4 5 5 Porting 10 X88 04 ba vo NG puce ROLE epe 3 cen uae NENE E e At 5 4 5 6 Migrating from Other Compilers llle 5 5 5 7 Compatibili ersa 26364532084 HA KO Kada DAG DE r a a a bambang 5 5 5 7 1 gcc Compatibility Wrapper Script
316. ng declarations C C only W no missing format attribute C C only W no missing noreturn C C only W no missing prototypes C C only W no multichar C C only W no nested externs C C only Wno non template friend C C only W no non virtual dtor C C only W no old style cast C C only W no overloaded virtual C C only W no packed C C only W no padded C C only PM aaa E Summary of Compiler Options Table E 1 Summary of Compiler Options by Function Wno pm W no W no W no W no W no W no W no W no W no W no W no W no W no W no W no W no W no W no W no W no W no W no W no W no W no W no Wnonnu Wswitc Wswitc W woff woffal parentheses f conversions pointer arith redundant decls reorder return type sequence point shadow sign compare sign promo strict aliasing strict prototypes switch system headers synth traditional trigraphs undef uninitialized unknown pragmas unreachable code unused unused function unused label unused parameter unused value unused variable write strings 11 h default h enum 1 C C only C C only C C only C C only C C only C C only C C only C C only C C only C C only C C only C C only C C only C C only C C only C C only C C only Has e
317. nisms

Appendix B. Implementation Dependent Behavior for OpenMP Fortran

The OpenMP Fortran specification 2.0, Appendix E, requires that the implementation-defined behavior of PathScale's OpenMP implementation be defined and documented (see http://www.openmp.org/). For the Fortran version 2.0 OpenMP Specification, click on Specifications in the left column of the OpenMP home page. This appendix summarizes the behaviors that are described as implementation dependent in this API. The sections in italic, including the cross references, come from the Fortran 2.0 specification, and each is followed by the relevant details for the PathScale implementation in its Compiler Suite Version 3.2 release of OpenMP for Fortran.

SCHEDULE(GUIDED,chunk): chunk specifies the size of the smallest piece, except possibly the last. The size of the initial piece is implementation dependent (Table 1, page 17).

The size of the initial piece is given by the following equation:

    chunk_size = MAX(MIN(ROUNDUP(remaining_size / (number_of_threads * PSC_OMP_GUIDED_CHUNK_DIVISOR)), PSC_OMP_GUIDED_CHUNK_MAX), minimum_chunk_size)

Where:
    remaining_size is the number of iterations of the loop
    number_of_threads is the number of threads in the team
    PSC_OMP_GUIDED_CHUNK_DIVISOR is the value of the PSC_OMP_GUIDED_CHUNK_DIVISOR environment variable; defaults to
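The guided-schedule equation in this entry can be worked through in C. The sketch below is an illustration only: it assumes the equation nests as MAX(MIN(ROUNDUP(remaining / (threads * divisor)), max), minimum), and it takes the divisor and the maximum as ordinary parameters (the hypothetical names chunk_divisor and chunk_max) because the defaults of the two environment variables are not shown in this excerpt.

```c
#include <assert.h>

/* Integer ceiling division for a non-negative numerator and a
   positive denominator (the ROUNDUP step of the equation). */
static long ceil_div(long num, long den)
{
    return (num + den - 1) / den;
}

/* Size of the next piece handed out by a guided schedule, following
   the equation as read above. All parameter names are illustrative. */
long guided_chunk_size(long remaining_size, long number_of_threads,
                       long chunk_divisor, long chunk_max,
                       long minimum_chunk_size)
{
    long piece;

    assert(remaining_size >= 0);
    assert(number_of_threads > 0 && chunk_divisor > 0);

    piece = ceil_div(remaining_size, number_of_threads * chunk_divisor);
    if (piece > chunk_max)            /* MIN(..., PSC_OMP_GUIDED_CHUNK_MAX) */
        piece = chunk_max;
    if (piece < minimum_chunk_size)   /* MAX(..., minimum_chunk_size) */
        piece = minimum_chunk_size;
    return piece;
}
```

For instance, with 1000 iterations remaining, 4 threads, a divisor of 2 and a cap of 300, the piece is ROUNDUP(1000 / 8) = 125. A larger loop is clamped by the cap, and a nearly finished loop is held at the minimum chunk size.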
318. non-local user variables in the Global Register Allocator. Default is ON.

-GRA:optimize_boundary=(ON|OFF)
    Allow the Global Register Allocator to allocate the same register to different variables in the same basic block. Default is OFF.
-GRA:prioritize_by_density=(ON|OFF)
    Tell the Global Register Allocator to prioritize register assignment to variables based on the variable's reference density instead of the variable's reference count. Default is OFF.
-help
    List all available options. The compiler is not invoked.
-help:
    Print list of possible options that contain a given string.
-H
    Print the name of each header file used.
-Idir
    Specify a directory to be searched. This is used for the following types of files:
      - Files named in INCLUDE lines in the Fortran source file that do not begin with a slash (/) character
      - Files named in #include source preprocessing directives that do not begin with a slash (/) character
      - Files specified on Fortran USE statements
    Files are searched in the following order: first, in the directory that contains the input file; second, in the directories specified by dir; and third, in the standard directory, /usr/include.
-iN
    (For Fortran only) Specify the length of default integer constants, default integer variables, and logical quantities. Specify one of the following:

    Option   Action
    -i4      Specifies 32-bit (4 byte) objects. The default.
    -i8      Specifies 64-bit (8 byte) ob
319. ns PM aa Table E 1 Summary of Compiler Options by Function OPT OPT OPT HORI OPT OPT OPT OPT OPT OPT OPRTS OPT SORT OPT OPT OPT OPT OPT OPT OPT IEEE IEEE arithmetic arith 1 2 3 IEEE NaN Inf ON OFF inline intrinsics ON OFF malloc_algorithm 0 1 or malloc alg 0 1 Ofast Olimit N pad common ON OFF recip ON OFF reorg common ON OFF roundoff 0 1 2 3 or OPT ro 0 1 2 3 rsqrt 0 1 2 space ON OFF speculate ON OFF transform to memlibz ON OFF treeheight ON OFF unroll analysis ON OFF unroll times max N unroll size N wrap around unsafe opt ON OFF Preprocessor Options 1 when Oo 01 and O2 are in effect 2 when 03 in effect ON ON 0 x86 x86 64 only Equivalent to OPT ro 2 Olimit 0 div_split ON alias typed 6000 OFF OFF ON when 03 is in effect OFF when files that contain common block compiled at 02 or below 0 when O0 01 and O2 are in effect 1 when 03 is in effect 2 when OPT Ofast is enabled 0 1 if OPT roundoff is at 2 or above OFF unless Os is specified OFF ON OFF ON 4 lt 40 gt OFF when 00 is in effect ON when 03 is in effect Defaults Comments C version E 13 E Summary of Compiler Options AA NY Table
320. nstants

    Fortran Constant      C Character Constant
    C_NULL_CHAR           '\0'
    C_ALERT               '\a'
    C_BACKSPACE           '\b'
    C_FORM_FEED           '\f'
    C_NEW_LINE            '\n'
    C_CARRIAGE_RETURN     '\r'
    C_HORIZONTAL_TAB      '\t'
    C_VERTICAL_TAB        '\v'

3.4.9.3 Pointer Compatibility

The ISO_C_BINDING module provides two types, TYPE(C_PTR) and TYPE(C_FUNPTR), which are compatible with C data pointers and C function pointers. A C pointer is typically a simple memory address, whereas a Fortran pointer contains not only an address but also data type information and, for an array, the shape and stride. To aid in converting between the world of Fortran pointers and the world of C pointers, the module also provides functions C_LOC and C_FUNLOC, which obtain C pointers to Fortran data; a function C_ASSOCIATED, which tests whether one C pointer is associated with data or whether two C pointers are associated with the same data; and functions C_F_POINTER and C_F_PROCPOINTER to convert C pointers into Fortran pointers. The function C_F_PROCPOINTER cannot yet be used in our compiler, because the Fortran 2003 feature which allows pointers to procedures has not yet been added. Finally, there are constants C_NULL_PTR of type C_PTR and C_NULL_FUNPTR of type C_FUNPTR, which represent C null pointers. The standard permits C_LOC to take the address of Fortran data which isn't compatible with C, because it may be usef
321. nt *, 'I am handler1'
      count = count - 1
      if (count .le. 0) then
        previous = 0
      end if
      previous = signal(2, previous)
      end subroutine handler1

      subroutine handler2
      implicit none
      common previous, count
      integer(8) :: previous
      integer :: count
      print *, 'I am handler2'
      count = count - 1
      if (count .le. 0) then
        previous = 0
      end if
      previous = signal(2, previous)
      end subroutine handler2

Here is an example using the three-argument form:

C Keyboard interrupt (normally Control-C) triggers
C handler until 4 interrupts have occurred. Then
C restore the default so the fifth interrupt stops
C the program
      program single
      implicit none
      external handler
      intrinsic signal
      integer(8) :: previous
      common count
      integer :: count
      previous = signal(2, handler, -1)
      count = 4
      do while (.true.)
        call sleep(100)
      end do
      end

      subroutine handler
      implicit none
      intrinsic signal
      integer(8) :: previous
      common count
      integer :: count
      print *, 'I am handler'
      count = count - 1
      if (count .le. 0) then
        previous = signal(2, handler, 0)
      else
        previous = signal(2, handler, -1)
      end if
      end subroutine handler

sleep
    Like the POSIX function sleep, pauses the process for seconds seconds.

srand
    Like the POSIX function srand, restarts the random number sequence for irand or rand, using seed as the seed.
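For readers more familiar with the C library, the counting-handler pattern in the three-argument example can be sketched with the C signal function that the Fortran intrinsic resembles. This is a C analogue written for illustration, not part of the guide: the names arm_interrupt_counter, interrupts_remaining and on_interrupt are hypothetical, and the behaviour of the Fortran flag argument is not modelled here.

```c
#include <assert.h>
#include <signal.h>

/* Number of interrupts still to be absorbed before the default
   action is restored. */
static volatile sig_atomic_t remaining = 0;

/* Handler: count down, then either re-arm itself or restore the
   default disposition so that the next interrupt ends the program. */
static void on_interrupt(int signum)
{
    if (remaining > 0)
        remaining--;
    if (remaining <= 0)
        signal(signum, SIG_DFL);
    else
        signal(signum, on_interrupt);
}

/* Install the handler for SIGINT and absorb `count` interrupts. */
void arm_interrupt_counter(int count)
{
    assert(count > 0);
    remaining = count;
    signal(SIGINT, on_interrupt);
}

/* Report how many interrupts will still be absorbed. */
int interrupts_remaining(void)
{
    return (int)remaining;
}
```

After arm_interrupt_counter(4), the first four interrupts are absorbed by the handler, and the fifth is handled by the default action, which normally terminates the process.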
322. nt system-imposed limits and mechanisms. The PSC_STACK_VERBOSE flag can also be used to turn on diagnostics for the stack sizing of pthreads. However, the stack size is controlled by the PSC_OMP_STACK_SIZE environment variable, not PSC_STACK_LIMIT. The syntax and allowed values for PSC_OMP_STACK_SIZE are identical to those for PSC_STACK_LIMIT, so please see section 3.13 for instructions.

The reason for having both PSC_STACK_LIMIT and PSC_OMP_STACK_SIZE is to allow the stacks of the main thread and the OpenMP pthreads to have different limits. Often the system-imposed limits are different in these two cases, and sometimes the stack requirements of the OpenMP pthreads may be quite different from those of the main thread. For example, in some applications the main thread of an OpenMP program might allocate large arrays for the whole program on its stack, and in others the large arrays will be allocated by all of the threads.

8.10.2 Stack Size for C/C++

The stack size of serial C and C++ programs is typically set by the ulimit command provided by the shell. Since C and C++ programs typically do not allocate large arrays on the stack, it is usually convenient to use whatever default ulimit your system provides. Stricter ulimit settings can be used to catch runaway stacks or unbounded recursion before the program exhausts all available memory. For OpenMP C and C++ programs, there will be an additional stack for each pthread created by the libopenmp library
323. ntation fault.

FTN_SUPPRESS_REPEATS
    Output multiple values instead of using the repeat factor. Used at runtime.

NLSPATH
    Flags for runtime and compile-time messages. If the main function in your program is coded in C, then even though other parts of the program are coded in Fortran, the Fortran runtime library will not be able to find the file which provides runtime error messages. To remedy this, set the NLSPATH environment variable to the location of the error messages, using %N for the base name of the file. For example, if the compiler version is 2.1, set it to /opt/pathscale/lib/2.1/%N.cat.

PSC_FDEBUG_ALLOC
    Flag to debug Fortran memory allocations. This variable is used to initialize memory locations during execution.

PSC_FFLAGS
    Flags to pass to the Fortran compiler, pathf95. This variable is used with the gcc compatibility wrapper scripts.

PSC_STACK_LIMIT
    Controls the stack size limit the Fortran runtime attempts to use. This string takes the format of a floating-point number, optionally followed by one of the characters "k" (for units of 1024 bytes), "m" (for units of 1048576 bytes), "g" (for units of 1073741824 bytes), or "%" (to specify a percentage of physical memory). If the specifier is followed by the string "/cpu", the limit is divided by the number of CPUs the system has. For example, a limit of "1.5g" specifies that the Fortran runti
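The PSC_STACK_LIMIT format described in this entry is compact enough to parse in a few lines of C. The sketch below is an illustration, not the runtime's actual parser: the function name parse_stack_limit is hypothetical, the amount of physical memory and the CPU count are passed in by the caller rather than queried from the system, a bare number is assumed to mean bytes, and the per-CPU suffix is assumed to be spelled "/cpu".

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Convert a limit specifier such as "1.5g" or "50%/cpu" into bytes.
   Returns -1.0 if the string does not match the expected format. */
double parse_stack_limit(const char *spec, double physical_bytes, int ncpus)
{
    char *end;
    double value;

    assert(spec != NULL && ncpus > 0);

    value = strtod(spec, &end);
    if (end == spec || value < 0.0)
        return -1.0;                      /* no leading number */

    switch (*end) {
    case 'k': value *= 1024.0;        end++; break;
    case 'm': value *= 1048576.0;     end++; break;
    case 'g': value *= 1073741824.0;  end++; break;
    case '%': value = physical_bytes * value / 100.0; end++; break;
    default:  break;                  /* no unit: assumed to be bytes */
    }

    if (strcmp(end, "/cpu") == 0) {   /* divide the limit among the CPUs */
        value /= ncpus;
        end += 4;
    }
    return (*end == '\0') ? value : -1.0; /* reject trailing junk */
}
```

Under these assumptions "1.5g" yields 1610612736 bytes, and "50%/cpu" on a machine with 8 GiB of memory and 4 CPUs yields 1 GiB per CPU.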
324. ntrols which functions are not instrumented:

    0   Don't instrument any function (default).
    1   Don't instrument functions the GNU front-end selects for inlining.
    2   Don't instrument functions marked inline in the source.
    3   Don't instrument functions marked extern inline or always_inline.
    4   Instrument all functions. Disable deletion of extern inline functions. On some codes, this can cause linking and runtime errors.

    The option -finstrument-functions is equivalent to -OPT:cyg_instr=3. Instrumentation will be suppressed for any function assigned the attribute no_instrument_function. In particular, __cyg_profile_func_enter and __cyg_profile_func_exit must not be instrumented.

-OPT:div_split=(ON|OFF)
    Enable or disable changing x/y into x*(recip(y)). This is OFF by default, but enabled by -OPT:Ofast or -OPT:IEEE_arithmetic=3. This transformation generates fairly accurate code.

-OPT:early_mp=(ON|OFF)
    This flag has an effect only under -mp compilation. It controls whether the transformation of code to run under multiple threads should take place before or after the loop nest optimization (LNO) phase in the compilation process. The default is OFF, when the transformation occurs after LNO. Some OpenMP programs can yield better performance by enabling -OPT:early_mp, because LNO can sometimes generate more appropriate loop transformations when working on the multi-threaded forms of the loops. If -apo is
325. o-unknown-pragmas tells the compiler not to warn when an unknown #pragma directive is encountered.

-W[no-]unreachable-code
    -Wunreachable-code warns about code that will never be executed. -Wno-unreachable-code tells the compiler not to warn about code that will never be executed.
-W[no-]unused
    -Wunused warns when a variable is unused. -Wno-unused tells the compiler not to warn when a variable is unused.
-W[no-]unused-function
    -Wunused-function warns about unused static and inline functions. -Wno-unused-function tells the compiler not to warn about unused static and inline functions.
-W[no-]unused-label
    -Wunused-label warns about unused labels. -Wno-unused-label tells the compiler not to warn about unused labels.
-W[no-]unused-parameter
    -Wunused-parameter warns about unused function parameters. -Wno-unused-parameter tells the compiler not to warn about unused function parameters.
-W[no-]unused-value
    -Wunused-value warns about statements whose results are not used. -Wno-unused-value tells the compiler not to warn about statements whose results are not used.
-W[no-]unused-variable
    -Wunused-variable warns about local and static variables that are not used. -Wno-unused-variable tells the compiler not to warn about local and static variables that are not used.
-W[no-]write-strings
    -Wwrite-strings marks strings as const char *. -Wno-write-strings tells the
326. ocessor is invoked file i File generated by the source preprocessor If using Fortran and you want to retain this file specify the P option file ii Pre processed C source file file Listing file F 62 F eko man Page ls file mod Fortran module file Compiling a module generates both a module file which must be available before compiling use statements that refer to that module and an object file which must be available when linking the program When compiling multiple source files at once you must order them so that each module is compiled before any use statement which refers to that module file o Object file file s Assembly language file To retain this file specify the S option file so Dynamic Shared Object DSO library ii files Directory that contains ii files usr include Standard directory for include files usr bin ld Loader tmp cc Temporary files COPYRIGHT Copyright C 2007 2008 PathScale LLC All Rights Reserved Copyright C 2006 2007 QLogic Corp All Rights Reserved Copyright C 2003 2004 2005 2006 PathScale Inc All Rights Reserved SEE ALSO pathcc 1 pathCC 1 pathf95 1 compiler defaults 5 pathopt2 1 assign 1 explain 1 fsymlist 1 pathscale_intro 7 pathdb 1 PathScale Compiler Suite and Subscription Manager Install Guide PathScale Compiler Suite User Guide PathScale Compiler Suite Support Guide PathScale Debugger User Guide Online documentation avai
327. ohit Chandra et al., Morgan Kaufmann Publishers, 2000. ISBN 1-55860-671-8. See section 8.15 for more resources.

The OpenMP API defines compiler directives and library routines that make it relatively easy to create programs for shared memory computers (processors that share physical memory) from new or existing code. OpenMP provides a portable, scalable interface that has become the de facto standard for programming shared memory computers. Using OpenMP, you can create threads, assign work to threads, and manage data within the program. OpenMP enables incremental parallelization of your code on SMP (shared memory processor) systems, allowing you to add directives to chunks of existing code a little at a time.

The PathScale OpenMP implementation in Fortran and C/C++ consists of parallelization directives and libraries. Using directives, you can distribute the work of the application over several processors. OpenMP supports the three basic aspects of parallel programming: specifying parallel execution, communicating between multiple threads, and expressing synchronization between threads.

The OpenMP runtime library automatically creates the optimal number of threads to be executed in parallel for the multiple processors on the platform where the program is being run. If you are running the program on a system with only one processor, you will not see any speedup. In fact, the program may run slower due to the overhead in the synchronization code gen
328. oid omp_set_dynamic(int)
    Control the dynamic adjustment of the number of parallel threads.
int omp_get_dynamic(void)
    Return a non-zero value if dynamic threads is enabled, otherwise return 0.
int omp_in_parallel(void)
    Return a non-zero value for calls within a parallel region, otherwise return 0.

Table 8-4. C/C++ OpenMP Runtime Library Routines (Continued)

Routine / Description

void omp_set_nested(int)
    Enable or disable nested parallelism.
int omp_get_nested(void)
    Return a non-zero value if nested parallelism is enabled, otherwise return 0.

Lock routines

omp_init_lock(omp_lock_t *)
    Allocate and initialize lock, associating it with the lock variable passed in as a parameter.
omp_init_nest_lock(omp_nest_lock_t *)
    Initialize a nestable lock and associate it with a specified lock variable.
omp_set_lock(omp_lock_t *)
    Acquire the lock, waiting until it becomes available if necessary.
omp_set_nest_lock(omp_nest_lock_t *)
    Set a nestable lock. The thread executing the subroutine will wait until a lock becomes available and then set that lock, incrementing the nesting count.
omp_unset_lock(omp_lock_t *)
    Release the lock, resuming a waiting thread (if any).
omp_unset_nest_lock
    Release ownership of a nestable lock. The subroutine decrements the nesting count
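The simple lock routines listed in this table are typically used in an init, set, unset sequence around a shared update. The following C sketch illustrates that sequence; it is not taken from the guide. The function name locked_sum is hypothetical, omp_destroy_lock is the standard OpenMP counterpart of omp_init_lock even though it does not appear in this excerpt, and the #ifdef _OPENMP guards are there so that the same file also builds and runs serially when OpenMP is not enabled.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Sum the integers 1..n, protecting the shared accumulator with a
   simple OpenMP lock when OpenMP is enabled. */
long locked_sum(int n)
{
    long total = 0;
    int i;
#ifdef _OPENMP
    omp_lock_t lock;
    omp_init_lock(&lock);          /* allocate and initialize the lock */
#endif

    assert(n >= 0);
#pragma omp parallel for
    for (i = 1; i <= n; i++) {
#ifdef _OPENMP
        omp_set_lock(&lock);       /* wait until the lock is available */
#endif
        total += i;                /* only one thread at a time is here */
#ifdef _OPENMP
        omp_unset_lock(&lock);     /* let a waiting thread proceed */
#endif
    }

#ifdef _OPENMP
    omp_destroy_lock(&lock);
#endif
    return total;
}
```

A reduction clause would be the more efficient way to write this particular loop; the lock is used here only to show the call sequence.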
329. Table 7-4. pathopt2 Options (Continued)

    Option       Description                                      Default
    -T           Run script in temporary directory                Do not use a temporary directory
    -v           Generate more verbose output
    -w columns   Number of columns to use in formatting output    40
    -X           Don't print out a summary table

7.9.3 Option Configuration File

The PathScale Compiler Suite includes pathopt2.xml, a pre-configured option configuration file found in /opt/pathscale/share/pathopt2 that contains about 200 test flags and options. This XML file specifies a tree of options to try. A small set of tags and attributes are used. The file supports many common combinations of options in a framework that enables pathopt2 to adapt as it runs. pathopt2.xml can be used on its own or as a framework for creating a custom configuration file. More than one configuration can be described in a single file.

A single configuration in pathopt2.xml consists of two parts:

  - A list of options. This list is contained within a <define> tag. This list can also contain any number of <option>, <choose>, or <append> tags.
  - An execute target. This is a set of rules that accesses the named options list via the <source> tag. The execute target can use multiple <source> tags in order to combine different lists of options. It can also contain <option> or <append> tags.

An execute target can be addressed on the command line using the -t option. By default pathopt2 runs
330. ompilers are compatible with gcc and g77. Some packages will check strings like the gcc version or the name of the compiler to make sure you are using gcc; you may have to work around these tests. See section 5.7.1 for more information.

Some packages continue to use deprecated features of gcc. While gcc may print a warning and continue compilation, the PathScale Compiler Suite C, C++, and Fortran compilers may print an error and exit. Use the instructions in the error to substitute an updated flag. For example, some packages will specify the deprecated -Xlinker gcc flag to pass arguments to the linker, while the PathScale Compiler Suite uses the modern -Wl flag.

Some gcc flags may not yet be implemented. These will be documented in the release notes.

If a configure script is being used, PathScale provides wrapper scripts for gcc that are frequently helpful. See section 5.7.1 for more information.

5.3 Compatibility with Other Fortran Compilers

For Fortran, the term "compatibility" can mean two different things:

  - Do two compilers accept the same source code?
  - Can object files generated by two different compilers be linked together?

With respect to source code, PathScale Fortran is compatible with all other compilers, provided the program conforms strictly to the Fortran 95 standard. It is compatible with g77 with relatively few exceptions (such as the meaning of kind type parameters), even if the program uses extensions such as additional intr
331. on 7. Tuning Options

This section discusses in more depth some of the major groups of flags available in the PathScale Compiler Suite.

7.1 Basic Optimizations: The -O flag

The -O flag is the first flag to think about using. See Table 7-3 showing the default flag settings for various levels of optimization.

-O0 (O followed by a zero) specifies no optimization; this is useful for debugging. The -g debugging flag is fully compatible with this level of optimization.

NOTE: Using -g by itself without specifying -O will change the default optimization level from -O2 to -O0 unless explicitly specified.

-O1 specifies minimal optimizations with no noticeable impact on compilation time compared with -O0. Such optimizations are limited to those applied within straight-line code (basic blocks), like peephole optimizations and instruction scheduling. The -O1 level of optimization minimizes compile time.

-O2 only turns on optimizations which always increase performance, and the increased compile time (compared to -O1) is commensurate with the increased performance. This is the default if you don't use any of the -O flags. The optimizations performed at level 2 are:

  - For inner loops, perform:
      - Loop unrolling
      - Simple if-conversion
      - Recurrence-related optimizations
  - Two passes of instruction scheduling
  - Global register allocation based on first scheduling pass
  - Global optimizations within function scopes:
      - Partial redundancy eliminatio
332. on features that are absent in GCC, such as IPA and LNO. Generally speaking, instead of being a single component as in GCC, the PathScale compiler is structured into components that perform different classes of optimizations. Accordingly, compilation flags are provided under group names like -IPA, -LNO, -OPT, -CG, etc. For this reason, many of the compilation flags in our compiler will differ from those in GCC. See the eko man page for more information.

The default optimization level is 2. This is equivalent to passing -O2 as a flag. The following three commands are identical in their function:

    $ pathcc hello.c
    $ pathcc -O hello.c
    $ pathcc -O2 hello.c

See section 7.1 for information about the optimization levels available for use with the compiler.

To run with -Ofast or with -ipa, the flag must also be given on the link command:

    $ pathCC -c -Ofast warpengine.cc
    $ pathCC -c -Ofast wormhole.cc
    $ pathCC -o ftl -Ofast warpengine.o wormhole.o

See section 7.3 for information on -ipa and -Ofast.

Accessing the GCC 4.x Front-ends for C and C++

This release is compatible with version 4.2.0 of the GNU C/C++ compiler in terms of the source language constructs they support. This is the default on Linux distributions whose compiler is GNU 4.x. On systems with GNU 3.x compilers, pathcc/pathCC will generate code compatible with GNU 3.x. You can use the -gnu4 option to direct pathcc/pathCC to be compatible with GNU 4.x. A sample command for C is
333. on feedback compilation with the PathScale compilers can be found under the -fb-create and -fb-opt options in the eko man page.

7.7 Aggressive Optimizations

The PathScale Compiler Suite, like all modern compilers, has a range of optimizations. Some produce identical program output to the original; some can change the program's behavior slightly. The first class of optimizations is termed "safe" and the second "unsafe". As a general rule, our -O1, -O2, and -O3 flags only perform "safe" optimizations. But the use of "unsafe" optimizations often can produce a good speedup in a program, while producing a sufficiently accurate result.

Some "unsafe" optimizations may be "safe" depending on the coding practices used. We recommend first trying "safe" flags with your program, and then moving on to "unsafe" flags, checking for incorrect results and noting the benefit of unsafe optimizations.

Examples of unsafe optimizations include the following.

7.7.1 Alias Analysis

Both C and Fortran have occasions where it is possible that two variables might occupy the same memory. For example, in C, two pointers might point to the same location, such that writing through one pointer changes the value of the variable pointed to by another. While the C standard prohibits some kinds of aliasing, many real programs violate these rules, so the aliasing behavior of the compiler is controlled by the -OPT:alia
334. on of input/output statements in your Fortran reference manual. Since the verbose messages print more slowly and take up more room on the screen, you may wish to unset the environment variable and instead use a tool called explain to print the longer message only when you need further explanation for a particular message.

When the Fortran compiler or runtime prints out an error message, it prefixes the message with a string in the format "subsystem-number". For example, "pathf95-0724". The "pathf95-0724" is the message ID string that you will give to explain. When you type explain pathf95-0724, the explain program provides a more detailed error message:

    $ explain pathf95-0724
    Error: Unknown statement. Expected assignment statement but found "%s" instead of "=" or "=>".
    The compiler expected an assignment statement but could not find an assignment or pointer assignment operator at the correct point.

Another example:

    $ explain pathf95-0700
    Error: The intrinsic call "%s" is being made with illegal arguments.
    A function or subroutine call which invokes the name of an intrinsic procedure does not match any specific intrinsic. All dummy arguments without the OPTIONAL attribute must match in type and rank exactly.

The explain command can also be used with iostat= error numbers. When the iostat= specifier in a Fortran I/O statement provides
335. onverts the constant to an integer and returns the real value whose magnitude matches that integer. This option makes each intrinsic behave as Fortran 2003 requires, returning the real value whose bit pattern matches the boz constant.

-ffortran-bounds-check
    (For Fortran only) Check bounds.
-f[no-]gnu-exceptions
    (For C++ only) -fgnu-exceptions enables exception handling, and is equivalent to -fexceptions. This is the default. -fno-gnu-exceptions disables exception handling, and is equivalent to GNU option -fno-exceptions.
-f[no-]gnu-keywords
    (For C/C++ only) Recognize typeof as a keyword. If -fno-gnu-keywords is used, do not recognize typeof as a keyword.
-f[no-]implicit-inline-templates
    (For C++ only) -fimplicit-inline-templates emits code for inline templates instantiated implicitly. -fno-implicit-inline-templates tells the compiler to never emit code for inline templates instantiated implicitly.
-f[no-]implicit-templates
    (For C++ only) The -fimplicit-templates option emits code for non-inline templates instantiated implicitly. With -fno-implicit-templates, the compiler will not emit code for non-inline templates instantiated implicitly.
-finhibit-size-directive
    Do not generate .size directives.
-f[no-]inline
    -finline requests inline processing (same as -inline). -fno-inline disables inlining (same as -noinline).
-f[no-]inline-functions
    (For C/C++ only) -finline-functions automati
336. option gt 03 lt option gt choose k 1 s5 lt option gt ipa lt option gt lt option gt OPT Ofast lt option gt lt choose gt lt append gt 7 Tuning Options The pathopt2 Tool ls Table 7 5 Tags for Option Configuration Fle Continued lt define name name gt Defines a block of options that can be later included using the source from name gt tag Note that 2 depinss this block can include any number of option choose Or append tags source from name gt Includes a block of options previously defined with define bestof k k gt Choose the best option in the list referenced by run context context time and chosen in the context of the option listed in the context tag The k option is used as described for the choose tag context specifies an option to use as a basis for testing but bestof not to propagate to outside tags option option lt option gt lt option gt lt comment gt Standard XML comment tag ignored by the parser NOTE Alltags other than lt source gt require an end tag e g lt append gt requires a corresponding lt append gt 7 9 4 Testing Methodology Typically the execute target try5 in pathopt2 xm1 is used first with the pathopt2 command After the results of the run are available you can look for the fastest result of the 5 options and then run pathopt2 a
337. ortran 90 interface block, or you can use command-line options to tell the compiler not to provide that intrinsic. The option -ansi, if present, removes all non-standard intrinsics. The options -intrinsic=name and -no-intrinsic=name are then applied to add or remove specific intrinsics from the set of remaining ones. For example, the compile command might look like this:

    $ pathf95 myprogram.f -ansi -intrinsic=second

To make it convenient to compile programs developed under other compilers, pathf95 provides the ability to enable and disable a group or "family" of intrinsics with a single option. Family names are ANSI, EVERY, G77, PGI, OMP, and TRADITIONAL. These family names must appear in uppercase to distinguish them from the names of individual intrinsics. By default, the compiler enables either ANSI or TRADITIONAL, depending on whether you use the -ansi option. It automatically enables OMP as well if you use the -mp option.

As an example, suppose you are compiling a program that was originally developed under the GNU G77 compiler, and encounter problems because it contains subroutine names which conflict with some of the intrinsics in the TRADITIONAL family. Suppose that you have also decided that you want to use the individual intrinsic adjustl, which is not provided by G77. These options would give you the set of intrinsics you need:

    -no-intrinsic=TRADITIONAL -intrinsic=G77 -intrinsic=adjustl

C.3 Table of Supported Intrinsics
The following t
338. otal compile time can be considerably longer with IPA than without. When invoking the final link phase with -ipa (for example, pathcc -ipa -o foo *.o), significant portions of this process can be done in parallel on a system with multiple processing units. To use this feature of the compiler, use the -IPA:max_jobs flag. Here are the options for the -IPA:max_jobs flag:

-IPA:max_jobs=N
    This option limits the maximum parallelism when invoking the compiler after IPA to at most N compilations running at once. The option can take the following values:
    0     The parallelism chosen is equal to either the number of CPUs, the number of cores, or the number of hyperthreading units in the compiling system, whichever is greatest.
    1     Disable parallelization during compilation (default).
    >1    Specifically set the degree of parallelism.

7.3.9 Size and Correctness Limitations to IPA
IPA often works well on programs up to 100,000 lines, but is not recommended for use in larger programs in this release.

7.4 Loop Nest Optimization (LNO)
If your program has many nests of loops, you may want to try some of the Loop Nest Optimization group of flags. This group defines transformations and options that can be applied to loop nests. One of the nice features of the PathScale compilers is that their powerful Loop Nest Optimization feature is invoked by default at -O3. This feature ca
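The three value ranges of the IPA max jobs flag described above can be summarized in a few lines of Python. This is only a sketch of the documented rule, not compiler code: the function name and the three count parameters are hypothetical and exist purely for illustration.

```python
def effective_ipa_jobs(max_jobs, cpus, cores, hw_threads):
    """Model the documented meaning of the IPA max-jobs value (illustrative only)."""
    if max_jobs < 0:
        raise ValueError("max_jobs must be a non-negative integer")
    if max_jobs == 0:
        # 0: pick the greatest of the CPU, core, and hyperthreading-unit counts.
        return max(cpus, cores, hw_threads)
    # 1 (the default) means no parallel compilation; >1 sets the degree directly.
    return max_jobs
```

For a hypothetical machine with 2 CPUs, 8 cores, and 16 hyperthreading units, a value of 0 would therefore allow up to 16 compilations at once, while a value of 4 would cap the number at 4.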
339. ovapd 32 r8 xmm1 2 id 82 a 0x0 movapd 48 r8 xmmO 3 id 82 a 0x0 addpd xmml xmm1 6 addpd xmmO xmmO 7 movntpd xmm3 O r8 9 id 83 a 0x0 movntpd xmm2 16 r8 d 10 id 83 a 0x0 7 43 7 Tuning Options How Did the Compiler Optimize My Code EE Z2 Cc addq 64 r8 8 movntpd xmml1 32 r8 d 11 id 83 a 0x0 cmpq rbp r8 11 movntpd xmmO 16 r8 12 id 83 a 0x0 jle LBB60 main 12 Note the unrolled 4 times comment above and the original source in comments which tell you what the compiler did even if you can t read x86 assembly code 7 10 2 Using CLIST or FLIST You can use CLIST on for C codes or FLIST on for Fortran codes to see what the compiler is doing On the same STREAM source code compile with the CLIST flag pathcc 03 CLIST ON c stream d c The output will look something like this opt pathscale lib 2 3 99 be translates tmp ccI 16xQZJ into Stream w2c h and stream w2c c based on source stream c When you look at stream d w2c c with an editor you might see some pretty strange looking C code In this case there doesn t seem to be much optimizing going on but in codes where LNO Loop Nest Optimization is more important you would see a lot of the optimizations 7 10 3 Verbose Flags You can also turn on verbose flags in LNO to see vectorization activity You would do this with the LNO simd verbose flag in the compil
340. over the threads in cases where the division is not exact.

OMP_GET_NUM_THREADS: If the number of threads has not been explicitly set by the user, the default is implementation dependent (Section 3.1.2, page 48).
    If the number of threads has not been explicitly set by the user, it defaults to the number of CPUs in the machine.

OMP_SET_DYNAMIC: The default for dynamic thread adjustment is implementation dependent (Section 3.1.7, page 51).
    The default for OMP_DYNAMIC is false. Dynamic thread adjustment is not supported by this implementation: the number of threads that are assigned to a new team is not adjusted dynamically. If dynamic thread adjustment is requested by the user or program (by setting OMP_DYNAMIC to TRUE or calling OMP_SET_DYNAMIC with a TRUE parameter), the implementation produces a diagnostic message and ignores the request. The value returned by OMP_GET_DYNAMIC is always FALSE, to indicate that this mechanism is not supported.

OMP_SET_NESTED: When nested parallelism is enabled, the number of threads used to execute nested parallel regions is implementation dependent (Section 3.1.9, page 52).
    The implementation supports dynamically nested parallelism. The number of threads assigned to a new team is determined by the following algorithm: If this fork is dynamically nested inside another fork, and nesting is disabled, then the new team w
341. particular, the PathScale Fortran compiler does not treat pointers exactly like integers. The compiler will report an error if you do something like p = ((p+7)/8)*8 to align a pointer.

3.5.3 Directives
Directives within a program unit apply only to that program unit, reverting to the default values at the end of the program unit. Directives that occur outside of a program unit alter the default value, and therefore apply to the rest of the file from that point on, until overridden by a subsequent directive.

Directives within a file override the command-line options by default. To have the command-line options override directives, use the command-line option:

    -LNO:ignore_pragmas

Use the following option to control the behavior for directives contained within comments:

    -[no-]directives

-no-directives ignores all directives inside comments (such as OMP or PREFETCH_REF directives). The default is -directives, which scans the comments for directives. Note that certain directives may have no effect unless additional options, such as -mp, are present.

For the 3.2 release, the PathScale Compiler Suite supports the following prefetch directives.

3.5.3.1 Prefetch Directives
C*$* PREFETCH(N [,N])
    Specify prefetching for each level of the cache. The scope is the entire function containing the directive. N can be one of the following values:
    0   Prefetching off (the default)
    1   Prefetching on, but conservative
    2   Prefetching on and aggressive
342. penMP programs. By default the guard area size is 0 for 32-bit programs (disabling the mechanism) and 32MB for 64-bit programs, since virtual memory is typically bountiful in 64-bit environments. The PSC_OMP_GUARD_SIZE environment variable can be used to override the default value. Its format is a decimal number followed by an optional k, m, or g (in lower or uppercase) to denote kilobytes, megabytes, or gigabytes. If the size is 0, then the guard is not created. The guard area consumes no physical memory, but does consume virtual memory and will show up in the VIRT or SIZE figure of a top command.

PSC_OMP_GUIDED_CHUNK_DIVISOR (Integer value)
    The value of PSC_OMP_GUIDED_CHUNK_DIVISOR is used to divide down the chunk size assigned by the guided scheduling algorithm. If the number of iterations left to be scheduled is remaining_size and the number of threads in the team is number_of_threads, the chunk size will be determined as:

        chunk_size = remaining_size / (number_of_threads * PSC_OMP_GUIDED_CHUNK_DIVISOR)

    A value of 1 gives the biggest possible chunks and the fewest number of calls into the loop scheduler. Larger values will result in smaller chunks, giving more opportunities for the dynamic guided scheduler to assign work, balancing out variation between loop iterations at the expense of more calls into the loop scheduler. With a value
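The effect of the guided chunk divisor can be illustrated with a small Python model. It is a sketch under stated assumptions, not the runtime's implementation: it assumes the chunk formula reads remaining_size / (number_of_threads * divisor), that the division truncates to an integer, and that a chunk is never smaller than one iteration. The function names are hypothetical.

```python
def guided_chunk_size(remaining_size, number_of_threads, divisor=1):
    """Chunk size under the assumed guided-scheduling formula."""
    if number_of_threads < 1 or divisor < 1:
        raise ValueError("thread count and divisor must be at least 1")
    # Assumption: integer division, with a minimum chunk of one iteration.
    return max(1, remaining_size // (number_of_threads * divisor))


def chunk_sequence(total_iterations, number_of_threads, divisor=1):
    """List the chunks handed out until every iteration has been scheduled."""
    chunks = []
    remaining = total_iterations
    while remaining > 0:
        chunk = guided_chunk_size(remaining, number_of_threads, divisor)
        chunks.append(chunk)
        remaining -= chunk
    return chunks
```

With 1000 iterations and 4 threads, the first chunk is 250 iterations for a divisor of 1 and 125 for a divisor of 2. The larger divisor produces a longer sequence of smaller chunks, which is the trade-off described above: finer load balancing in exchange for more calls into the scheduler.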
343. pent to achieve them, and avoid changes which affect such things as floating point accuracy.
    3   Turn on aggressive optimization. The optimizations at this level are distinguished from -O2 by their aggressiveness, generally seeking highest-quality generated code even if it requires extensive compile time. They may include optimizations that are generally beneficial but may hurt performance. This includes, but is not limited to, turning on the Loop Nest Optimizer (-LNO:opt=1) and setting -OPT:ro=1:IEEE_arith=2:Olimit=9000:reorg_common=ON.
    s   Specify that code size is to be given priority in tradeoffs with execution time.
    If no value is specified, 2 is assumed.

-objectlist
    Read the following file to get a list of files to be linked.

-Ofast
    Equivalent to -O3 -ipa -OPT:Ofast -fno-math-errno -ffast-math. Use optimizations selected to maximize performance. Although the optimizations are generally safe, they may affect floating point accuracy due to rearrangement of computations.
    NOTE: -Ofast enables -ipa (inter-procedural analysis), which places limitations on how libraries and .o files are built.

-openmp
    Interpret OpenMP directives to explicitly parallelize regions of code for execution by multiple threads on a multi-processor system. Most OpenMP 2.0 directives are supported by pathf95, pathcc, and pathCC. See the PathScale Compiler Suite User Guide for more information on these directives.

-OPT: Th
344. performance with specific codes.
The PathScale™ Debugger
You can view this same information online by typing man <man_page_name>.

The eko man page information begins on the following page. For the most complete and up-to-date listing, please refer to the online version, which can be found in the support section at the PathScale web site: http://www.pathscale.com/support.html

NAME
    eko - The complete list of options and flags for the PathScale™ Compiler Suite
    -CG, -INLINE, -IPA, -LANG, -LNO, -OPT, -TENV, -WOPT: other major topics covered

DESCRIPTION
    This man page describes the various flags available for use with the PathScale pathcc, pathCC, and pathf95 compilers.

OPTIMIZATION FLAGS
    Some suboptions either enable or disable the feature. To enable a feature, either specify only the suboption name or specify =1, =ON, or =TRUE. Disabling a feature is accomplished by adding =0, =OFF, or =FALSE. These values are insensitive to case: 'on' and 'ON' mean the same thing. Below, ON and OFF are used to indicate the enabling or disabling of a feature.
    Many options have an opposite ("no-") counterpart. This is represented as [no-] in the option description and, if used, will turn off or prevent the action of the option. If no [no-] is shown, there is no opposite option to the listed option.

OPTION GROUPS
    There are twelve available compiler option groups:
    -CG      Code Generation
    -CLIST   C Listin
345. program variables. Whenever a variable has its address taken, it can potentially be pointed to by a pointer. Places that dereference or store through the pointer potentially access the variable. IPA's alias analysis keeps track of this information so that, in the presence of pointer accesses, as few variables are affected as possible, so they can be optimized more aggressively.

The mod/ref and alias information collected by IPA is not just used by IPA itself. The information is also recorded in the program representation so the optimizations in the backend phases also benefit.

7.3.3 Optimization
The most important optimization performed by IPA is inlining, in which the call to a function is replaced by the actual body of the function. Inlining is most versatile in IPA because all the user function definitions are visible to it. Apart from eliminating the function call overhead, inlining increases optimization opportunities of the backend phases by letting them work on larger pieces of code. For instance, inlining may result in the formation of a loop nest that enables aggressive loop transformations.

Inlining requires careful benefit analysis because overdoing it may result in performance degradation. The increased program size can cause a higher instruction cache miss rate. If a function is already quite large, inlining may result in the compiler running out of registers, so it has to us
346. r Suite generates both 32-bit and 64-bit code, with 64-bit code as the default. See the eko man page for details.

The information in this guide is organized into these sections:
    Section 2 is a quick reference to using the PathScale compilers
    Section 3 covers the PathScale Fortran compiler
    Section 4 covers the PathScale C/C++ compilers
    Section 5 provides suggestions for porting and compatibility
    Section 6 is a Tuning Quick Reference, with tips for getting faster code
    Section 7 discusses tuning options in more detail
    Section 8 covers using autoparallelization and OpenMP in Fortran and C/C++
    Section 9 provides an example of optimizing code
    Section 10 covers debugging and troubleshooting code
    Appendix A lists environmental variables used with the compilers
    Appendix B discusses implementation dependent behavior for OpenMP Fortran
    Appendix C is a list of the supported Fortran intrinsics
    Appendix D provides a simplified data structure from a Fortran 90 dope vector
    Appendix E is a summary of the compiler options, grouped by function
    Appendix F is a reference copy of the eko man page
    Appendix G contains a glossary of terms associated with the compilers

1.1 Conventions Used in This Document
These conventions are used throughout this document.

Convention    Meaning
command       Fixed-space font is used for literal items such as commands, fil
347. r a Fortran file or unit. The assign command allows various processing directives to be associated with a unit or file name. This can be used to perform numeric conversion while doing file I/O.

The assign command uses the file pointed to by the FILENV environment variable to store the processing directives. This file is also used by the Fortran I/O libraries to load directives at runtime. For example:

    $ FILENV=.assign
    $ export FILENV
    $ assign -N mips u:15

This instructs the Fortran I/O library to treat all numeric data read from or written to unit 15 as being MIPS-formatted data. This effectively means that the contents of the file will be translated from big-endian format (MIPS) to little-endian format (Intel) while being read. Data written to the file will be translated from little-endian format to big-endian format. See the assign(1) man page for more details and information.

3.8.1.2 Using the Wildcard Option
The wildcard option for the assign command is:

    assign -N mips p:%

Before running your program, run the following commands:

    $ FILENV=.assign
    $ export FILENV
    $ assign -N mips p:%

This example matches all files.

3.8.1.3 Converting Data and Record Headers
To convert numeric data in all unformatted units from big endian, and convert the record headers from big endian, use the following:

    $ assign -F f77.mips -N mips g:su
    $ a
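The byte-order translation that the assign command requests for MIPS-formatted data can be pictured with a short Python sketch. It only illustrates what converting 4-byte integers between big-endian and little-endian layout means; it says nothing about how the Fortran I/O library implements the conversion, and the function name is hypothetical.

```python
import struct


def big_to_little_endian_int32(raw):
    """Re-encode a buffer of 4-byte big-endian integers as little-endian."""
    if len(raw) % 4 != 0:
        raise ValueError("buffer length must be a multiple of 4 bytes")
    count = len(raw) // 4
    # Decode using big-endian (MIPS) layout, then encode little-endian (x86).
    values = struct.unpack(">%di" % count, raw)
    return struct.pack("<%di" % count, *values)
```

The integer 1 is stored as the bytes 00 00 00 01 in big-endian form and as 01 00 00 00 in little-endian form; the value is unchanged, only the byte order differs.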
348. r subsuming a memory load operation into the operand of an arithmetic instruction. The value of 0 turns off this subsumption optimization. By default this subsumption is performed only when the result of the load has only one (n=1) use. This subsumption is not performed if the number of times the result of the load is used exceeds the value n, a non-negative integer. We have found that load_exe=2 or 0 are occasionally profitable. The default for 64-bit ABI and Fortran is n=2; otherwise the default is n=1.

-CG:use_prefetchnta=ON means for the compiler to use the prefetch operation that assumes that data is Non-Temporal at All (NTA) levels of the cache hierarchy. This is for data streaming situations in which the data will not need to be re-used soon. Default is OFF.

7.6 Feedback Directed Optimization (FDO)
Feedback directed optimization uses a special instrumented executable to collect profile information about the program; for example, it records how frequently every if() statement is true. This information is then used in later compilations to tune the executable.

FDO is most useful if a program's typical execution is roughly similar to the execution of the instrumented program on its input data set; if different input data has dramatically different if() frequencies, using FDO might actually slow down the program. This section also discusses how to invoke this feature wit
349. recated-declarations
    Do not warn about deprecated declarations in code.

-Wno-format-extra-args
    (For C/C++ only) Do not warn about extra arguments to printf-like functions.

-Wno-format-y2k
    (For C/C++ only) Do not warn about strftime formats that yield two-digit years.

-Wno-long-long
    (For C/C++ only) -Wlong-long warns if the long long type is used. -Wno-long-long tells the compiler not to warn if the long long type is used.

-Wno-non-template-friend
    (For C++ only) Do not warn about friend functions declared in templates.

-Wno-pmf-conversions
    (For C++ only) Do not warn about converting PMFs to plain pointers.

-W[no-]non-virtual-dtor
    (For C++ only) -Wnon-virtual-dtor will warn when a class declares a dtor (destructor) that should be virtual. -Wno-non-virtual-dtor tells the compiler not to warn when a class declares a dtor that should be virtual.

-Wnonnull
    (For C/C++ only) Warn when passing null to functions requiring non-null pointers.

-W[no-]old-style-cast
    (For C/C++ only) -Wold-style-cast will warn when a C-style cast to a non-void type is used. -Wno-old-style-cast tells the compiler not to warn when a C-style cast to a non-void type is used.

-WOPT:
    Specifies options that affect the global optimizer. The options are enabled at -O2 or above.

-WOPT:aggstr=N
    This controls the aggressiveness of the strength reduction optimization performed by the scalar optimizer, in which induction expressions within a loop ar
350. rformed. Default is OFF.

-OPT:reorg_common=(ON|OFF)
    This option reorganizes common blocks to improve the cache behavior of accesses to members of the common block. The reorganization is done only if the compiler detects that it is safe to do so. reorg_common=ON is enabled when -O3 is in effect and when all of the files that reference the common block are compiled at -O3. reorg_common=OFF is set when the file that contains the common block is compiled at -O2 or below.

-OPT:roundoff=(0|1|2|3) or -OPT:ro=(0|1|2|3)
    Specify the level of acceptable departure from source language floating-point, round-off, and overflow semantics. The options can be one of the following:
    0   Inhibit optimizations that might affect the floating-point behavior. This is the default when optimization levels -O0, -O1, and -O2 are in effect.
    1   Allow simple transformations that might cause limited round-off or overflow differences. Compounding such transformations could have more extensive effects. This is the default when -O3 is in effect.
    2   Allow more extensive transformations, such as the reordering of reduction loops. This is the default level when -OPT:Ofast is specified.
    3   Enable any mathematically valid transformation.

-OPT:rsqrt=(0|1|2)
    This option calculates reciprocal square roots using the rsqrt machine instruction. rsqrt is faster but potentially less accurate than the regular square root operation. 0 means not to use rsqrt. 1 means to use
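Why reordering a reduction counts as a departure from source-language round-off semantics can be seen in a few lines of Python, which uses the same IEEE double-precision arithmetic. This is an illustration of floating-point non-associativity in general, not of any specific transformation the compiler performs; the function names are hypothetical.

```python
def sum_in_source_order(values):
    """Accumulate left to right, as the source loop is written."""
    total = 0.0
    for value in values:
        total += value
    return total


def sum_in_reversed_order(values):
    """Accumulate the same terms in the opposite order (a reordered reduction)."""
    total = 0.0
    for value in reversed(values):
        total += value
    return total


terms = [1e20, -1e20, 1.0]
in_order = sum_in_source_order(terms)     # (1e20 - 1e20) + 1.0
reordered = sum_in_reversed_order(terms)  # (1.0 - 1e20) + 1e20
```

In source order the two large terms cancel first and the result is 1.0. In the reversed order the 1.0 is absorbed by -1e20 before the cancellation, so the result is 0.0. Both orders are mathematically equivalent, which is why such reordering is only allowed at the higher roundoff levels.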
351. riables are initialized to zero and exist for the life of the program. This option can be useful when porting programs from older systems in which all variables are statically allocated.
    When compiling with the -static-data option, global data is allocated as part of the compiled object (.o) file. The total size of any .o file cannot exceed 2 GB, but the total size of a program loaded from multiple .o files can exceed 2 GB. An individual common block cannot exceed 2 GB, but you can declare multiple common blocks, each having that size.
    If a parallel loop in a multi-processed program calls an external routine, that external routine cannot be compiled with the -static-data option. You can mix static and multi-processed object files in the same executable, but a static routine cannot be called from within a parallel region.

-static-libgcc
    Force the use of the static libgcc library.

-std=c++98              -std option for g++
-std=c89                -std option for gcc/g++
-std=c99                -std option for gcc/g++
-std=c9x                -std option for gcc/g++
-std=gnu++98            -std option for g++
-std=gnu89              -std option for gcc/g++
-std=gnu99              -std option for gcc/g++
-std=gnu9x              -std option for gcc/g++
-std=iso9899:1990       -std option for gcc/g++
-std=iso9899:199409     -std option for gcc/g++
-std=iso9899:1999       -std option for gcc/g++
-std=iso9899:199x       -std option for gcc/g++

-stdinc
    Predefined include search path list su
352. rithm. The default is generally set to 0. Available only for the x86/x86-64 family of processors; not available for MIPS.

-OPT:Ofast
    Use optimizations selected to maximize performance. Although the optimizations are generally safe, they may affect floating point accuracy due to rearrangement of computations. This effectively turns on the following optimizations: -OPT:ro=2:Olimit=0:div_split=ON:alias=typed.

-OPT:Olimit=N
    Disable optimization when the size of a program unit is greater than N. When N is 0, program unit size is ignored and the optimization process will not be disabled due to a compile time limit. The default is 0 when -OPT:Ofast is specified, 9000 when -O3 is specified; otherwise the default is 6000.

-OPT:pad_common=(ON|OFF)
    This option reorganizes common blocks to improve the cache behavior of accesses to members of the common block. This may involve adding padding between members and/or breaking a common block into a collection of blocks. Default is OFF.
    This option should not be used unless the common block definitions (including EQUIVALENCE) are consistent among all sources making up a program. In addition, pad_common=ON should not be specified if common blocks are initialized with DATA statements. If specified, pad_common=ON must be used for all of the source files in the program.

-OPT:recip=(ON|OFF)
    This option specifies that faster but potentially less accurate reciprocal operations should be pe
353. rn a loop computing the mathematical function sin() into a call to the vsin() function, which is twice as fast. The use of vectorized versions of functions in the math library, such as sin and cosine, is controlled by the flag -LNO:vintr=(0|1|2). 0 will turn off vectorization of math intrinsics, while 1 is the default. Under -LNO:vintr=2 the compiler will vectorize all math functions. Note that vintr=2 could be unsafe, in that the vector forms of some of the functions could have accuracy problems.

Vectorization of user code (excluding these mathematical functions) is controlled by the flag -LNO:simd=(0|1|2), which enables or disables inner loop vectorization. 0 turns off the vectorizer; 1 (the default) causes the compiler to vectorize only if it can determine that there is no undesirable performance impact due to sub-optimal alignment; and 2 will vectorize without any constraints (this is the most aggressive).

-LNO:simd_verbose=ON prints vectorizer information (from vectorizing user code) to stdout.

-LNO:vintr_verbose=ON prints information about whether or not the math intrinsic functions were vectorized.

See the eko man page for more information.

7.5 Code Generation (-CG:)
The code generation group governs some aspects of instruction-level code generation that can have benefits for code tuning.

-CG:gcm=OFF turns off the instruction-level global code motion optimization phase. The default is ON.

-CG:load_exe=n specifies the threshold fo
354. rogram optimization, the compiler can collect information over the entire program, so it can make better decisions on whether it is safe to perform various optimizations. Thus, the same optimization performed under whole program compilation will become much more effective. In addition, more types of optimization can be performed under whole program compilation than separate compilation.

This section presents the compilation model that enables whole program optimization in the PathScale compiler and how it relates to the -ipa flag that invokes it at the user level. Various analyses and optimizations performed by IPA are described. How IPA improves the quality of the backend optimization is also explained. Various IPA-related flags that can be used to tune for program performance are presented and described. Finally, we have an example of the difference that IPA makes in the performance of the SPEC CPU2000 benchmark suite.

7.3.1 The IPA Compilation Model
Inter-procedural compilation is the mechanism that enables whole program compilation in the PathScale compiler. The mechanism requires a different compilation model than separate compilation. This new mode of compilation is used when the -ipa flag is specified.

Whole program compilation requires the entire program to be presented to the compiler for analysis and optimization. This is possible only after a link step is applied. Ordinarily, the link step is applied to .o files, after all optimization
355. rs 3 When linking with one compiler specify explicitly the additional runtime library or libraries needed by the other compiler If you need additional control over the order in which the linker scans libraries run the linker directly specifying the startup object file which the first compiler would use and the union of the sets of libraries which the two compilers would use For Pathscale Fortran running pathf 95 with the command line option show will print the names of these objects and libraries 4 If possible perform all I O in code generated by one compiler If that is not possible make sure that all I O related to a particular logical unit and file occurs within code generated by one compiler 5 4 Porting Fortran If you are porting Fortran code see section 3 11 for more information about Fortran specific issues 5 4 1 Intrinsics The PathScale Fortran compiler supports many intrinsics and also has many unique intrinsics of its own See Appendix C for the complete list of supported intrinsics 5 3 5 Porting and Compatibility Porting to x86 64 ee 5 4 1 1 An Example Here is some sample output from compiling Amber 8 using only ANST intrinsics You get this series of error messages S pathf95 03 msse2 m32 o fantasian fantasian o lib random o lib mexit o fantasian o In function simplexrun fantasian o text 0xaad4 undefined reference to rand_ fantasian o text 0xab0e und
356. rtran Intrinsics Supported in Version 3 2 Continued Intrinsic Name Result Arguments Families Remarks DSM_THIS_ I 8 ARRAY Any type TRADITIONAL CHUNKSIZE Array rank any DIM 171 172 1 4 18 INDEX 171 1 2 174 I8 DSM THIS I8 ARRAY Any type TRADITIONAL STARTINGINDE Array rank any X DIM 171 I 2 1 4 18 INDEX I 1 I 2 1 4 I 8 DSM THIS I8 ARRAY Any type TRADITIONAL THREADNUM Array rank any DIM 171 172 1 4 18 INDEX 171 1 2 174 I8 DSQRT R 8 X R 8 ANSI G77 E P PGl TRADITIONAL DTAN R 8 X R 8 ANSI G77 E P PGl TRADITIONAL DTAND R 8 X R 8 PGI E P TRADITIONAL DTANH R 8 X R 8 ANSI G77 E P PGl TRADITIONAL DTIME R 4 TARRAY R 4 G77 PGI Array rank 1 TRADITIONAL DTIME Subroutine TARRAY R 4 G77 Array rank 1 TRADITIONAL RESULT R 4 ENABLE IEEE X Subroutine INTERRUPT I 8 TRADITIONAL E _INTERRUPT EOSHIFT ANSI PGI See Std TRADITIONAL EPSILON X R 4 R 8 ANSI PGI E TRADITIONAL C 15 C Supported Fortran Intrinsics Table of Supported Intrinsics AA NY Table C 1 Fortran Intrinsics Supported in Version 3 2 Continued C 16 Intrinsic Name Result Arguments Families Remarks EQV 1 11 I2 I4 1 8 PGI E R 4 R 8 TRADITIONAL CrayPtr L 1 L 2 L 4 L 8 J 1 1 l2 I4 I 8 R 4 R 8 CrayPtr L 1 L 2 L 4 L 8 ERF X R 4 R 8 G77 PGI E P TRADITIONAL ERFC X R 4 R 8 G77 PGI E P T
357. rtran optimizer is allowed to assume that the value of rvalue cannot change between the assignment and PRINT statements, and might decide to eliminate the assignment and simply print a constant 5. The volatile attribute prevents this:

    real, volatile :: rvalue

The old-fashioned Fortran declaration syntax is also available:

    real rvalue
    volatile rvalue

Unlike most old-fashioned declaration statements, the VOLATILE statement does not necessarily create a local variable: if a variable is available via host association, VOLATILE merely adds an attribute to that variable:

    subroutine local
      volatile rvalue    ! Implicitly declares local variable
    end subroutine local

    subroutine outer
      rvalue = 5.0
    contains
      subroutine inner
        volatile rvalue  ! Adds attribute to variable obtained from outer
      end subroutine inner
    end subroutine outer

When used with a pointer, VOLATILE refers to the pointer rather than the target; usually it makes sense to apply VOLATILE to both the pointer and its target(s). When used with an allocatable variable, VOLATILE refers to both the allocation status and the value. When used with an equivalenced variable, it refers only to accesses via that variable; usually it makes sense to apply VOLATILE to all variables in an equivalence group.

IMPORT Statement
Fortran 2003 provides an IMPORT statement for use within an interface body. By default a proce
358. s

-IPA:inline=OFF turns off IPA's inliner, and the lightweight inliner is also suppressed since IPA is invoked. Default is ON.

-INLINE:none turns off automatic inlining by IPA, but required inlining implied by the language or specified by the user is still performed. By default, automatic inlining is turned ON.

-IPA:specfile=filename directs the compiler to open the given file to read more -IPA or -INLINE options.

The following options can be used to tune the aggressiveness of the inliner. Very aggressive inlining can cause performance degradation, as discussed in section 7.3.3.

-OPT:Olimit=N specifies the size limit N, where N is computed from the number of basic blocks that make up a function; inlining will never cause a function to exceed this size limit. The default is 6000 under -O2 and 9000 under -O3. The value 0 means no limit is imposed.

-IPA:space=N specifies that inlining should continue until a factor of N% increase in code size is reached. The default is 100%. If the program size is small, the value of N could be increased.

-IPA:plimit=N suppresses inlining into a function once its size reaches N, where N is measured in terms of the number of basic blocks and the number of calls inside a function. The default is 2500.

-IPA:small_pu=N specifies that a function with size smaller than N basic blocks is not subject to the -IPA:plimit restriction. The default is 30.

-IPA:callee_limit=n specifies that a function whose size
359. s

Here are some sample stack size settings (on a 4 CPU system with 1G of memory):

    Value       Meaning
    100000      100000 bytes
    820K        820K (839680 bytes)
    -0.25g      all but 0.25G, or 0.75G total
    128M/cpu    128M per CPU, or 512M total
    -10M/cpu    all but 10M per CPU (all but 40M total), or 0.96G total

If the Fortran runtime encounters problems while attempting to modify the stack size limit, it will print some warning messages, but will not abort.

Section 4 The PathScale C/C++ Compiler

The PathScale C and C++ compilers conform to the following set of standards and extensions.

The C compiler:
    Conforms to the ISO/IEC 9899:1990, Programming Languages - C standard
    Supports extensions to the C programming language as documented in "Using GCC: The GNU Compiler Collection Reference Manual," October 2003, for GCC version 3.3.1
    Refer to section 4.4 of this document for the list of extensions that are currently not supported
    Complies with the C Application Binary Interface as defined by the GNU C compiler (gcc) as implemented on the platforms supported by the PathScale Compiler Suite
    Supports most of the widely used command-line options supported by gcc
    Generates code that complies with the x86_64 ABI and the 32-bit x86 ABI

The C++ compiler:
    Conforms to the ISO/IEC 14882:1998(E), Programming Languages - C++ standard
    Supports extensions to the C++ program
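The arithmetic behind the sample stack size settings listed earlier in this excerpt can be checked with a short Python sketch. It is a model built on assumptions, not the Fortran runtime's parser: it assumes a leading minus sign means "all but", a /cpu suffix means "per CPU", and K, M, and G are powers of 1024. The function name, pattern, and parameters are hypothetical.

```python
import re

# Assumed unit multipliers (powers of 1024); an empty suffix means bytes.
_UNITS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}

# Assumed syntax: optional '-', a decimal number, optional k/m/g, optional '/cpu'.
_SETTING = re.compile(r"^(-?)(\d+(?:\.\d+)?)([kmg]?)(/cpu)?$", re.IGNORECASE)


def stack_limit_bytes(setting, total_memory, cpus):
    """Translate a stack size setting into bytes under the assumed syntax."""
    match = _SETTING.match(setting.strip())
    if match is None:
        raise ValueError("unrecognized stack size setting: %r" % setting)
    sign, number, unit, per_cpu = match.groups()
    amount = float(number) * _UNITS[unit.lower()]
    if per_cpu:
        amount *= cpus                   # "per CPU" scales with the CPU count
    if sign:
        amount = total_memory - amount   # "all but" subtracts from total memory
    return int(amount)
```

On a 4 CPU system with 1G (1024**3 bytes) of memory, this model reproduces the figures in the table: 820K is 839680 bytes, 128M per CPU is 512M in total, and "all but 10M per CPU" leaves 984M, which is about 0.96G.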
360. s
    Accessing the IEEE flag bits
    Generating IEEE special symbols like NaN and testing for them
    Selecting the IEEE rounding mode
    Enabling and disabling gradual underflow (IEEE denormalized numbers)

The specification of these three modules is available at http://www.nag.co.uk/sc22wg5/TR15580.html

Unlike traditional intrinsic procedures, the declarations in these modules are available only if you employ the use statement to access them.

A compiler is allowed to support only a part of the IEEE functionality, or none at all, and the user program is expected to call the procedures in IEEE_FEATURES to determine which functionality is available. Our compiler will return TRUE for all of the IEEE_SUPPORT query functions in IEEE_FEATURES except for IEEE_SUPPORT_GRADUAL_UNDERFLOW. For that procedure, it will return FALSE for the IA32 architecture if the compiler has been told, via the -mno-sse2 command-line switch, not to use SSE instructions.

The standard calls for certain behavior which imposes overhead on the program:
    On entry, each procedure must save a copy of the IEEE flags and rounding modes. It must then clear the flags.
    On return, each procedure must restore the saved copy of the flags and rounding modes.

As the standard allows, our compiler does not do this in any procedure which does not access the IEEE intrinsic modules. We also provide a command-line option
361. s index time self children called name 0 00 155 83 1 1 main 2 1 95 4 0 00 155 83 1 MAIN 1 0 00 151 19 1504 i52 matmul 3 0 05 4 47 1 1 uinith 13 0 00 0 06 T L phinit 22 0 02 0 05 1 2 rndphi 21 0 0 0 00 301 512301 zdotc 14 0 0 0 00 77 1024077 dznrm2 17 0 0 0 00 452 603648604 zaxpy 9 0 0 0 00 154 214528306 zcopy 10 0 0 0 00 75 39936075 zscal 16 0 00 0 00 1 1 init 23 0 00 151 19 152 152 MAIN 1 3 92 6 0 00 151 19 152 matmul 3 1 75 73 84 152 152 muldoe 7 1 75 73 84 152 152 muldeo 6 0 00 00 00 152 214528306 zcopy 10 0 00 00 00 152 603648604 zaxpy 9 0 88 48 33 77824000 155648000 muldeo 6 0 88 48 33 77824000 155648000 muldoe 7 4 60 3 1 76 96 65 155648000 su3mul 4 83 54 13 11 155648000 155648000 zgemm 5 83 54 13 11 155648000 155648000 su3mul 4 5 59 2 83 54 13 11 155648000 zgemm 5 23 11 0 00 933888000 933888000 lsame 11

The -ipa option can analyze the code to make smart decisions on when and which routines to inline, so we try that. -O2 -ipa results in a 133.8 second run time, a nice improvement over our previous best of 150 seconds with only -O2.

Since we heard somewhere that improvements with compiler flags are not always predictable, we also try -O3 -ipa. To our great surprise, we achieve a run time of 110.5 seconds, a 58% speed-up over our previous -O3 time, and a nice improvement over -O2 -ipa. Section 7.7 mentions t
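The speed-up figures quoted for the -ipa runs can be reproduced from the run times. This is a small illustrative sketch: the helper name `speedup_percent` is hypothetical, and the 150.3 and 174.3 second baselines are the -O2 and -O3 user times reported at the start of this wupwise example.

```python
def speedup_percent(old_seconds, new_seconds):
    """Percentage speed-up of a run relative to a slower baseline."""
    return (old_seconds / new_seconds - 1.0) * 100.0


# -O3 (174.3 s) versus -O3 -ipa (110.5 s)
o3_ipa_gain = speedup_percent(174.3, 110.5)

# -O2 (150.3 s) versus -O2 -ipa (133.8 s)
o2_ipa_gain = speedup_percent(150.3, 133.8)

print(f"-O3 -ipa over -O3: {o3_ipa_gain:.1f}%")
print(f"-O2 -ipa over -O2: {o2_ipa_gain:.1f}%")
```

The first ratio rounds to the 58% quoted in the text. The second shows that -ipa helps far less at -O2 in this example.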
362. s This option is activated by -Ofast.

• -OPT:alias=unnamed assumes that pointers never point to named objects.
• -OPT:alias=restrict tells the compiler to assume that all pointers are restricted pointers and point to distinct, non-overlapping objects. This allows the compiler to invoke as many optimizations as if the program were written in Fortran. A restricted pointer behaves as though the C restrict keyword had been used with it in the source code.
• -OPT:alias=disjoint says that any two pointer expressions are assumed to point to distinct, non-overlapping objects.

To make the opposite assertion about your program's behavior, put no_ before the value. For example, -OPT:alias=no_restrict means that distinct pointers may point to overlapping storage.

Additional -OPT:alias values are relevant to Fortran programmers in some situations:

• -OPT:alias=cray_pointer asserts that an object pointed to by a Cray pointer is never overlaid on another variable's storage. This flag also specifies that the compiler can assume that the pointed-to object is stored in memory before a call to an external procedure and is read out of memory at its next reference. It is also stored before an END or RETURN statement of a subprogram.
• -OPT:alias=parm promises that Fortran parameters do not alias to any other variable. This is the default. no_parm asserts that parameter aliasing is present in the program.

7.7.2 Numerically Unsafe Optimizations

Rearranging math
363. s that the remainder of the line is to be treated as a comment. If a ! is present at any character position on a line except for the 6th character position, then the remainder of that line is treated as a comment. Lines containing only blank characters, or empty lines, are also treated as comments.

If any character other than a blank character is present in the 6th character position on a line, that specifies that the line is a continuation from the previous line. The Fortran standard specifies that no more than 19 continuation lines can follow a line, but the PathScale compiler supports up to 499 continuation lines.

Source code appears between the 7th character position and the 72nd character position in the line, inclusive. Semicolons are used to separate multiple statements on a line. A semicolon cannot be the first non-blank character between the 7th character position and the 72nd character position.

Character positions 1 through 5 are for statement labels. Since statement labels cannot appear on continuation lines, the first five entries of a continuation line must be blank.

Free-form files have fewer limitations on line layout. Lines can be arbitrarily long, and continuation is indicated by placing an ampersand (&) at the end of the line before the continuation line. Statement labels can be placed at any character position in a line, as long as it is preceded b
364. s a factor of 4 in it signed long stride mult stride multiplier dimension 7 DopeAllocType alloc_info appears following the last actual dimension there may be fewer than 7 dimensions if alloc_cpnt is true DopeVectorType AH X D 3 D Fortran 90 Dope Vector E V Vd GCd w7 D 4 E Summary of Compiler Options ls Appendix E Summary of Compiler Options Options are grouped according to function A brief listing of defaults and comments are also listed for more detailed information see appendix F Table E 1 Summary of Compiler Options by Function EM CG CG CG CG CG CG CG CG CG CG CG CG CG General Options HHH copyright dumpversion help help show show defaults show0 showt version Code Generation Options cflow ON OFF cse regs N gem ON OFF load_exe N local_fwd_sched ON OFF movnti N p2align ON OFF p2align freg N prefer legacy regs ON OFF prefetch ON OFF ptr load use N push pop int saved regs ON OFF sse cse regs N Defaults Comments Defaults Comments lt ON gt lt positive infinity gt lt ON gt lt ON gt for 32 bit ABI OFF for 64 bit ABI 1000KB OFF 0 OFF ON 4 ON for barcelona else OFF positive infinity E Summary of Compiler Options AA NN E 2 Table E 1 Summar
365. s and attributes must reside inside this tag.

<execute name="name">
Specifies an execute target, and must contain at least one <source> tag that references a previously defined <define> tag. May also contain <execute>, <option>, or <append> tags. Specify execute targets on the command line using the -t name option.

<option>
Describes a single option. Surround the content for this option in space characters to ensure differentiation, e.g. <option> -Ofast </option> rather than <option>-Ofast</option>.

<choose k="k" hoist="true"> </choose>
Choose the best option among those provided within this tag. The k="k" attribute specifies the number of choices to run iteratively. If k is given as a range, separated by a colon (e.g. k="0:2"), pathopt2 chooses among that number of options, inclusive (e.g. between 0 and 2 options). The optional hoist="true" attribute merges the lists returned by the children of the execute tag into the list for that tag. By default, choose picks combinations only from directly related children.

<append> <option> </option> </append>
The first option described within this tag is appended to the test stream for the remaining options. The following instructs pathopt2 to find the best option between -O3 -ipa and -O3 -OPT:Ofast, but not any of these options singly: <append> <
366. s file.

    # Compile for Athlon64 and turn on 3DNow extensions.
    # One option per line.
    -march=athlon64 # anything after # is ignored
    -m3dnow

These options can also be used on the command line. See the eko man page for details.

2.3.2 Defaults Flag

This release includes a flag, -show-defaults, which directs the compiler to print out the defaults used related to ABI, ISA, and processor targets. When this flag is specified, the compiler will just print the defaults and quit. No compilation is performed.

    pathcc -show-defaults

2.3.3 Compiling for an Alternate Platform

You will need to compile with the -march=anyx86 flag if you want to run your compiled executables on both AMD and Intel platforms. See the eko man page for more information about the -march= flag.

To run code generated with the PathScale Compiler Suite on a different host machine, you will need to install the runtime libraries on your host machine, or you need to static link your programs when you compile. See section 2.7 for information on static linking and the PathScale Compiler Suite Install Guide for information on installing runtime libraries.

2.3.4 Compiling Option Tool: pathhow-compiled

The PathScale Compiler Suite includes a tool that displays the compilation options and compiler version currently being used. The tool is called pathhow-compiled and can be found after installation in opt
367. s flag. See section 7.7.4.2 for more information.

Aliases are hidden definitions and uses of data due to:
• Accesses through pointers
• Partial overlap in storage locations (e.g. unions in C)
• Procedure calls for non-local objects
• Raising of exceptions

The compiler normally has to assume that aliasing will occur. The compiler does alias analysis to identify when there is no alias, so later optimizations can be performed. Certain C and C++ language rules allow some levels of alias analysis. Fortran has additional rules which make it possible to rule out aliasing in more situations: subroutine parameters have no alias, and side effects of calls are limited to global variables and actual parameters.

For C or C++, the coding style can help the compiler make the right assumptions. Using type qualifiers such as const, restrict, or volatile can help the compiler. Furthermore, if you supply some assumptions to make concerning your program, more optimizations can then be applied.

The following are some of the various aliasing models you can specify, listed in order of increasingly stringent, and potentially dangerous, assumptions you are telling the compiler to make about your program:

• -OPT:alias=any is the default level, which implies that any two memory references can be aliased.
• -OPT:alias=typed means to activate the ANSI rule that objects are not aliased if they have different base type
368. s from hardware instructions and optimizations that can return a -0.0 result from a 0.0 value. To obtain a minus sign when printing a negative floating-point zero (-0.0), use the -z option on the assign(1) command.

-LANG:IEEE_save=setting
For Fortran only. The ISO standard requires that any procedure which accesses the standard IEEE intrinsic modules via a use statement must save the floating-point flags, halting mode, and rounding mode on entry; must restore the halting mode and rounding mode on exit; and must OR the saved flags with the current flags on exit. Setting this option OFF may improve execution speed by skipping these steps.

-LANG:recursive=setting
Invoke the language option control group to control recursion support. setting can be either ON or OFF. The default is OFF.

In either mode, the compiler supports a recursive, stack-based calling sequence. The difference lies in the optimization of statically allocated local variables, as described in the following paragraphs.

With -LANG:recursive=ON, the compiler assumes that a statically allocated local variable could be referenced or modified by a recursive procedure call. Therefore, such a variable must be stored into memory before making a call and reloaded afterwards.

With -LANG:recursive=OFF, the compiler can safely assume that a statically allocated local variable is not referenced or modified by a procedure call. This setting enables the compiler to optimize mor
369. s refer to distinct, non-overlapping objects. If these options are specified and the program does violate the assumptions being made, the program may behave incorrectly. Refer to section 7.7.1 for more information.

There are several shorthand options that can be used in place of the above options. The option -OPT:Ofast is equivalent to -OPT:roundoff=2:Olimit=0:div_split=ON:alias=typed. -Ofast is equivalent to -O3 -ipa -OPT:Ofast -fno-math-errno. When using these shorthand options, make sure the impact of the option is understood by building up the functionality stepwise with the equivalent options.

There are many more options that may help the performance of the program. These options are discussed elsewhere in the User Guide and in the associated man pages.

6.5 Compiler Flag Recommendations

As a general methodology, we usually recommend that you start tuning with -O2, then -O3, then -O3 -OPT:Ofast, and then -Ofast. With -O3 -OPT:Ofast and -Ofast, you should look to see if the results are accurate.

The -OPT:Ofast flag uses optimizations selected to maximize performance. Although the optimizations are generally safe, they may affect floating-point accuracy due to rearrangement of computations. This effectively turns on the following optimizations: -OPT:ro=2:Olimit=0:div_split=ON:alias=typed.

If there are numerical problems with -O3 -OPT:Ofast, then try eit
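The remark that rearranging computations can affect floating-point accuracy is easy to demonstrate outside the compiler. The sketch below is plain Python double-precision arithmetic, not PathScale output, and the variable names are illustrative only. It shows that the two groupings of a three-term sum, which are equal in real arithmetic, differ in IEEE 754 arithmetic.

```python
# Three IEEE 754 double-precision values whose sum depends on grouping.
a = 1.0e16
b = -1.0e16
c = 1.0

# Source order: (a + b) + c. The large terms cancel first, so c survives.
left_to_right = (a + b) + c

# Reassociated order: a + (b + c). Near 1e16 adjacent doubles are 2 apart,
# so b + c rounds back to b and the contribution of c is lost.
reassociated = a + (b + c)

print(left_to_right)  # 1.0
print(reassociated)   # 0.0
```

This is the kind of difference to look for when comparing results from a build that permits reassociation against a reference build.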
370. se resources useful:
• At the OpenMP home page, http://www.openmp.org
• For the Fortran, C, and C++ version 2.5 OpenMP Specification, click on Specifications in the left column of the OpenMP home page
• For Tutorials, Benchmarks, Publications, and Books, click on Resources in the left column of the OpenMP home page
• Parallel Programming in OpenMP by Rohit Chandra, et al.; Morgan Kaufmann Publishers, 2000. ISBN 1-55860-671-8

Section 9 Examples

9.1 Compiler Flag Tuning and Profiling With pathprof

We'll use the 168.wupwise program from the CPU2000 floating point suite for this example. This is a Physics/Quantum Chromodynamics (QCD) code. For those who care, "wupwise" is an acronym for "Wuppertal Wilson Fermion Solver," a program in the area of lattice gauge theory (quantum chromodynamics). The code is in about 2100 lines of Fortran 77 in 23 files. We'll be running and tuning wupwise performance on the reference (largest) dataset. Each run takes about two to four minutes on a 2 GHz Opteron system to complete.

Even though this is a Fortran 77 code, the PathScale Fortran compiler (pathf95) can handle it.

Outline: Try pathf95 -O2 and pathf95 -O3 first. Run times (user time) were:

           seconds
    -O2    150.3
    -O3    174.3

We're a little surprised since -O3 is supposed to be faster than -O2 in general. But the man page did say that the -O3 may
371. sforms 1/sqrt(x) to rsqrt(x). Unlike -OPT:rsqrt, the compiler does not generate extra code to refine the rsqrt result for -OPT:fast_sqrt.

-OPT:fast_stdlib=(ON|OFF)
This option controls the generation of calls to faster versions of some standard library functions. Default is ON.

-OPT:fast_trunc=(ON|OFF)
This option inlines the NINT, ANINT, and AMOD Fortran intrinsics, both single- and double-precision versions. Default is OFF. fast_trunc is enabled automatically if -OPT:roundoff=1 or greater is in effect.

-OPT:fold_reassociate=(ON|OFF)
This option allows optimizations involving reassociation of floating-point quantities. Default is OFF. fold_reassociate=ON is enabled automatically when -OPT:roundoff=2 or greater is in effect.

-OPT:fold_unsafe_relops=(ON|OFF)
This option folds relational operators in the presence of possible integer overflow. The default is ON for -O3 and OFF otherwise.

-OPT:fold_unsigned_relops=(ON|OFF)
This option folds unsigned relational operators in the presence of possible integer overflow. Default is OFF.

-OPT:goto=(ON|OFF)
Disable or enable the conversion of GOTOs into higher level structures like FOR loops. The default is ON for -O2 or higher.

-OPT:IEEE_arithmetic,IEEE_arith=(1|2|3)
Specify the level of conformance to IEEE 754 floating-point roundoff/overflow behavior. Note that -OPT:IEEE_a is a valid abbreviation for this flag. The options can be one of the following:
1 Adhere to IEEE accura
372. signed by the guided scheduling algorithm.

PSC_OMP_GUIDED_CHUNK_MAX
This is the maximum chunk size that will be used by the loop scheduler for guided scheduling.

PSC_OMP_LOCK_SPIN
This chooses the locking mechanism used by critical sections and OMP locks.

PSC_OMP_SILENT
If you set PSC_OMP_SILENT to anything, then warning and debug messages from the libopenmp library are inhibited.

PSC_OMP_STACK_SIZE
(Fortran) Stack size specification follows the syntax described in the OpenMP in Fortran section of the PathScale Compiler Suite User Guide.

PSC_OMP_STATIC_FAIR
This determines the default static scheduling policy when no chunk size is specified. It is discussed in the OpenMP in Fortran section of the PathScale Compiler Suite User Guide.

PSC_OMP_THREAD_SPIN
This takes a numeric value and sets the number of times that the spin loops will spin at user level before falling back to O/S schedule/reschedule mechanisms.

FILES

The following is a file summary:

    File                Type
    a.out               Executable output file.
    file.a              Object file archive.
    file.B              Intermediate file written by the front end of the compiler. To retain this file, specify the -keep option.
    file.c              C source file.
    file.f or file.F    Input Fortran source file in fixed source form. If file ends in .F, the C preprocessor is invoked.
    file.f90, file.f95, file.F90, or file.F95
                        Input Fortran source file in free source form. If file ends in .F90 or .F95, the C prepr
373. sing char(0) to place a null character after the last significant character. sarray must have thirteen elements:

    1   ID of device containing file
    2   Inode number
    3   File mode
    4   Number of links
    5   UID of owner
    6   GID of owner
    7   ID of device containing directory entry for file
    8   Size of file in bytes
    9   Time of last access
    10  Time of last modification
    11  Time of last file status change
    12  Preferred I/O block size (-1 if not available)
    13  Number of blocks allocated (-1 if not available)

Except for elements 12 and 13, values are set to 0 if they are not available from the relevant file system.

ltime
Fortran interface to the C library function localtime. Sets tarray to the broken-down time corresponding to stime, which can be obtained from the intrinsic time8. All values are in the local time zone. tarray must have nine elements:

    1   Seconds since the last minute, ranging 0-61 due to leap seconds
    2   Minutes since the last hour, ranging 0-59
    3   Hours since midnight, ranging 0-23
    4   Day of month, ranging 0-31
    5   Month, ranging 0-11
    6   Years since 1900
    7   Days since Sunday, ranging 0-6
    8   Days since January 1, ranging 0-365
    9   Positive if daylight savings time is in effect, zero if not, or negative if unknown

mclock, mclock8
Fortran interface to the C library function clock. Returns the number of clock ticks of CPU time since the start of execution of the process, or -1 if this is not known.
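The nine-element tarray layout described for ltime follows the fields of the C library's broken-down time. As an illustration only, the Python sketch below builds an equivalent list from `time.localtime`. The function name `ltime_like` is hypothetical, and the index adjustments exist because Python's `struct_time` numbers months from 1, weekdays from Monday, and days of the year from 1.

```python
import time


def ltime_like(stime):
    """Return a nine-element list laid out like the ltime tarray."""
    t = time.localtime(stime)
    return [
        t.tm_sec,             # 1: seconds since the last minute
        t.tm_min,             # 2: minutes since the last hour
        t.tm_hour,            # 3: hours since midnight
        t.tm_mday,            # 4: day of month
        t.tm_mon - 1,         # 5: month, 0-11 (Python uses 1-12)
        t.tm_year - 1900,     # 6: years since 1900 (Python uses the full year)
        (t.tm_wday + 1) % 7,  # 7: days since Sunday (Python counts from Monday)
        t.tm_yday - 1,        # 8: days since January 1 (Python uses 1-366)
        t.tm_isdst,           # 9: positive, zero, or negative as for ltime
    ]


print(ltime_like(time.time()))
```

The printed values depend on the local time zone, just as the ltime results do.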
374. ss or shared data desc */
            unsigned long el_len;    /* element len in bits */
        } a;
    } base_addr;
    /*
     * flags and information fields within word 3 of the header
     */
    unsigned int assoc      :1;  /* associated flag */
    unsigned int ptr_alloc  :1;  /* set if allocated by pointer */
    enum ptrarray {
        NOT_P_OR_A = 0,
        POINTTR    = 1,
        ALLOC_ARRY = 2
    } p_or_a                :2;  /* pointer or allocatable array. Use enum ptrarray values. */
    unsigned int a_contig   :1;  /* array storage contiguous flag */
    unsigned int alloc_cpnt :1;  /* this is an allocatable array whose element type is a derived type having component(s) which are themselves allocatable */
    unsigned int            :26; /* pad for first 32 bits */
    unsigned int            :29; /* pad for second 32 bits */
    unsigned int n_dim      :3;  /* number of dimensions */
    f90_type_t type_lens;        /* data type and lengths */
    void *orig_base;             /* original base address */
    unsigned long orig_size;     /* original size */
    /*
     * Per Dimension Information - array will contain only the necessary number of elements
     */
    #define MAXDIM 7
    struct DvDimen {
        signed long low_bound;   /* lower bound for ith dimension; may be negative */
        signed long extent;      /* number of elts for ith dimension */
        /*
         * The stride mult is not defined in constant units so that
         * address calculations do not always require a divide by 8 or 64.
         * For double and complex, stride mult has a factor of 2 in it.
         * For double complex, stride mult ha
375. ssign -I -F f77.mips -N mips g:du

The su specifier matches all sequential unformatted open requests. The du specifier matches all direct unformatted open requests. The -F option sets the record header format to big endian (f77.mips).

3.8.1.5 The ASSIGN Procedure

The ASSIGN procedure provides a programmatic interface to the assign command. It takes as an argument a string specifying the assign command and an integer to store a returned error code. For example:

    integer :: err
    call ASSIGN("assign -N mips u:15", err)

This example has the same effect as the example in section 3.8.1.1.

I/O Compilation Flags

Two compilation flags have been added to help with I/O: -byteswapio and -convert conversion.

The -byteswapio flag swaps bytes during I/O so that unformatted files on a little-endian processor are read and written in big-endian format (or vice versa).

The -convert conversion flag controls the swapping of bytes during I/O so that unformatted files on a little-endian processor are read and written in big-endian format (or vice versa). To be effective, the option must be used when compiling the Fortran main program. Setting the environment variable FILENV when running the program will override the compiled-in choice in favor of the choice established by the command assign.

The -convert conversion flag can take one of three arguments:
• native - no conversion (the default)
• big_endian - files are big endian
• little_end
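To make concrete what byte swapping of unformatted data involves, here is a hedged Python sketch. It assumes, purely for illustration, a sequential unformatted record framed by a 4-byte length marker before and after the payload, which is a common layout but is not confirmed by this guide. The function names are hypothetical.

```python
import struct


def pack_record(values, byte_order):
    """Pack 32-bit integers as one record framed by 4-byte length markers.

    byte_order is ">" for big endian or "<" for little endian. The framing
    is an assumed layout used only for this illustration.
    """
    payload = struct.pack(f"{byte_order}{len(values)}i", *values)
    marker = struct.pack(f"{byte_order}i", len(payload))
    return marker + payload + marker


def unpack_record(record, byte_order):
    """Read back the integers from a record written by pack_record."""
    (length,) = struct.unpack_from(f"{byte_order}i", record, 0)
    count = length // 4
    return list(struct.unpack_from(f"{byte_order}{count}i", record, 4))


big = pack_record([1, 2, 3], ">")
print(unpack_record(big, ">"))            # correct byte order: [1, 2, 3]
print(struct.unpack("<i", big[:4])[0])    # wrong byte order misreads the length
```

Reading the big-endian length marker with the wrong byte order gives 201326592 instead of 12, which is why a conversion has to be applied consistently to record headers as well as to the data.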
376. stribution

For example, consider a system with four chips, each with two cores, using chip-major numbering. Let there be 2 OpenMP jobs, each consisting of 4 threads. If these jobs are run with the default scheduling, the assignments will be:

    <-- CHIP 0 -->  <-- CHIP 1 -->  <-- CHIP 2 -->  <-- CHIP 3 -->
    CPU0    CPU1    CPU2    CPU3    CPU4    CPU5    CPU6    CPU7
    J0/T0   J0/T1   J0/T2   J0/T3
    J1/T0   J1/T1   J1/T2   J1/T3

"Jx/Ty" indicates thread y of job x. If PSC_OMP_CPU_OFFSET is set to 4 for job 1, the scheduling will be changed to:

    <-- CHIP 0 -->  <-- CHIP 1 -->  <-- CHIP 2 -->  <-- CHIP 3 -->
    CPU0    CPU1    CPU2    CPU3    CPU4    CPU5    CPU6    CPU7
    J0/T0   J0/T1   J0/T2   J0/T3   J1/T0   J1/T1   J1/T2   J1/T3

If PSC_OMP_CPU_STRIDE is set to 2 for both jobs, and PSC_OMP_CPU_OFFSET is set to 1 for job 1 only, then the scheduling will be:

    <-- CHIP 0 -->  <-- CHIP 1 -->  <-- CHIP 2 -->  <-- CHIP 3 -->
    CPU0    CPU1    CPU2    CPU3    CPU4    CPU5    CPU6    CPU7
    J0/T0   J1/T0   J0/T1   J1/T1   J0/T2   J1/T2   J0/T3   J1/T3

PSC_OMP_GUARD_SIZE
(Integer value) This environment variable specifies the size in bytes of a guard area that is placed below pthread stacks. This guard area is in addition to any guard pages created by your O/S. It is often useful to have a larger guard area to catch pthread stack overflows, particularly for Fortran O
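The PSC_OMP_CPU_OFFSET and PSC_OMP_CPU_STRIDE examples in this chunk are all consistent with a simple rule: thread t runs on CPU offset + stride * t. The sketch below encodes that rule. It is an inference from the three examples, not a statement of the library's actual algorithm, and `assign_cpus` is a hypothetical name.

```python
def assign_cpus(num_threads, offset=0, stride=1):
    """CPU index for each thread, assuming cpu = offset + stride * thread."""
    return [offset + stride * thread for thread in range(num_threads)]


# Default scheduling: both 4-thread jobs land on CPU0 through CPU3.
print(assign_cpus(4))                      # job 0 and job 1
# Offset 4 for job 1 moves it to CPU4 through CPU7.
print(assign_cpus(4, offset=4))            # job 1
# Stride 2 for both jobs, offset 1 for job 1 only.
print(assign_cpus(4, stride=2))            # job 0: even CPUs
print(assign_cpus(4, offset=1, stride=2))  # job 1: odd CPUs
```

The sketch does not model wrap-around when the computed index exceeds the number of CPUs, because the examples do not show that case.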
377. sts should be fused even if it is found that not all levels of the loop nests can be fused. The default level is 1 (standard outer loop fusion), but 2 has been known to benefit a number of well-known codes.

-LNO:fission=n
Perform loop fission, n: 0 = off, 1 = standard, 2 = try fission before fusion. The default level is 0, but 2 has been known to benefit a number of well-known codes.

Be careful with mixing the above two flags, because fusion has some precedence over fission: if -LNO:fission=[1 or 2] and -LNO:fusion=[1 or 2], then fusion is performed.

-LNO:fusion_peeling_limit=n controls the limit for the number of iterations allowed to be peeled in fusion, where n has a default of 5 but can be any non-negative integer. Peeling is done when the iteration counts in consecutive loops are different, but close, and several iterations are replicated outside the loop body to make the loop counts the same.

7.4.2 Cache Size Specification

The PathScale compilers are primarily targeted at the Opteron CPU currently, so they assume an L2 cache size of 1MB. Athlon 64 can have either a 512KB or 1MB L2 cache size. If your target machine is Athlon 64 and you have the smaller cache size, then setting -LNO:cs2=512k could help. You can also specify your target machine instead, using -march=athlon64. That would automatically set the standard machine cache sizes.

Here is the more general description of some of what is available:

-LNO:cs1=n,cs2=n,cs3=n,cs4=n
378. t Arguments Families Remarks DACOSD R 8 X R 8 PGI E TRADITIONAL DASIN R8 X R8 ANSI G77 E P PGI TRADITIONAL DASIND R 8 X R 8 PGI E TRADITIONAL DATAN R 8 X R 8 ANSI G77 EP PGI TRADITIONAL DATAN2 R 8 Y R 8 ANSI G77 E P X R 8 PGI TRADITIONAL DATAN2D R 8 Y R 8 PGI E X R 8 TRADITIONAL DATAND R 8 X R 8 PGI E TRADITIONAL DATE C G77 PGI TRADITIONAL DATE Subroutine DATE C G77 PGI DATE_AND Subroutine DATE C ANSI G77 O TIME TIME C PGI ZONE C TRADITIONAL VALUES I 1 1 2 I4 1 8 Array rank 1 DBESJO R 8 X R 8 G77 PGI DBESJ1 R 8 X R 8 G77 PGI DBESJN R 8 N 1 4 G77 PGI X R 8 DBESYO R 8 X R 8 G77 PGI DBESY1 R 8 X R 8 G77 PGI DBESYN R 8 N 1 4 G77 PGI X R 8 DBLE R 8 A 1 1 1 2 1 4 1 8 ANSI G77 E R 4 R 8 Z 8 Z 16 PGI E TRADITIONAL DCMPLX 16 X 1 1 1 2 1 4 1 8 G77 PGI E R 4 R 8 Z 8 Z 16 TRADITIONAL C 11 C Supported Fortran Intrinsics Table of Supported Intrinsics AA NY Table C 1 Fortran Intrinsics Supported in Version 3 2 Continued Intrinsic Name Result Arguments Families Remarks Y F1 1 2 I4 1 8 O R 4 R 8 Z 8 Z 16 DCONJG Z 16 Z 216 G77 PGI E TRADITIONAL DCOS R 8 X R 8 ANSI G77 E P PGI TRADITIONAL DCOSD R8 X R8 PGI E TRADITIONAL DCOSH R8 X R8 ANSI G77 E P PGI TRADITIONAL DCOT R8 X R8 TRADITIONAL E P DCOTAN R8 X R8 TRADITIONAL E P DDIM R8 X R8 ANSI G77 E P Y R 8 PGI TRA
379. t to the number of iterations of the loop divided by the number of threads in the team, rounded up to the nearest integer. The loop iterations are partitioned into chunks of the default chunk size. If the number of iterations of the loop is not an exact integer multiple of the number of threads in the team, the last chunk will be smaller than the default chunk size, and in some cases it may contain zero loop iterations. The chunks are assigned to threads starting from the thread with local index 0. The thread with the highest local index will receive the last chunk, and this may be smaller than the others or even zero. The loop iterations which are executed by a thread are contiguous in terms of their loop iteration number.

NOTE: The PSC_OMP_STATIC_FAIR environment variable can be used to change the default static scheduling algorithm to an alternate scheme where the iterations are more equally balanced over the threads in cases where the division is not exact.

OMP_NUM_THREADS environment variable: The default value is implementation-dependent (Section 4.2, page 60).
The default value of the OMP_NUM_THREADS environment variable is the number of CPUs in the machine.

OMP_DYNAMIC environment variable: The default value is implementation-dependent (Section 4.3, page 60).
The default value of the OMP_DYNAMIC environment variable is false.

An implementation can replace all ATOMIC
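The default static scheduling rule described at the start of this chunk can be worked through numerically. The following Python sketch follows that description (chunk size equals iterations divided by threads, rounded up, with contiguous chunks handed out from thread 0). The function name `default_static_chunks` is hypothetical, and the sketch does not model the PSC_OMP_STATIC_FAIR alternative.

```python
def default_static_chunks(num_iterations, num_threads):
    """Iteration ranges per thread under the default static schedule."""
    # Chunk size: iterations / threads, rounded up, using integer arithmetic.
    chunk = (num_iterations + num_threads - 1) // num_threads
    chunks = []
    for thread in range(num_threads):
        start = min(thread * chunk, num_iterations)
        stop = min(start + chunk, num_iterations)
        chunks.append(range(start, stop))  # contiguous iterations
    return chunks


# 10 iterations on 4 threads: the last chunk is smaller than the others.
print([len(c) for c in default_static_chunks(10, 4)])  # [3, 3, 3, 1]
# 9 iterations on 4 threads: the last thread receives no iterations.
print([len(c) for c in default_static_chunks(9, 4)])   # [3, 3, 3, 0]
```

The second case is the situation mentioned in the text, where the thread with the highest local index ends up with a chunk of zero iterations.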
380. tatus)
    character(*), intent(out), optional :: command
    integer, intent(out), optional :: length
    integer, intent(out), optional :: status

Retrieve the entire command line. command is set to the command line; length is set to the number of characters in the command line; and status is set to 0 if the procedure succeeds, -1 if the actual argument corresponding to command is too short, or a positive number if retrieval failed.

GET_COMMAND_ARGUMENT

    subroutine get_command_argument(number, value, &
                                    length, status)
    integer, intent(in) :: number
    character(*), intent(out), optional :: value
    integer, intent(out), optional :: length
    integer, intent(out), optional :: status

Retrieve one command line argument. number specifies the desired argument, with 0 being the command name itself, 1 the first argument, and so on. value returns the argument; length returns the length of the argument; and status returns 0 if the procedure succeeds, -1 if the actual argument corresponding to value is too short, or a positive number if retrieval failed.

GET_ENVIRONMENT_VARIABLE

    subroutine get_environment_variable(name, value, length, &
                                        status, trim_name)
    character(*), intent(in) :: name
    character(*), intent(out), optional :: value
    integer, intent(out), optional :: length
    integer, intent(out), optional :: status
    logical, intent(in), optional :: trim_nam
381. td TRADITIONAL MAXVAL ANSI PGI See Std TRADITIONAL C Supported Fortran Intrinsics Table of Supported Intrinsics ls Table C 1 Fortran Intrinsics Supported in Version 3 2 Continued Intrinsic Name Result Arguments Families Remarks MCLOCK 1 4 G77 PGI MCLOCK8 1 8 G77 MEMORY _ Subroutine TRADITIONAL E BARRIER MERGE TSOURCE Any ANSI PGI E type TRADITIONAL FSOURCE Any type MASK L 1 L 2 L 4 L 8 MIN ANSI G77 See Sid PGI TRADITIONAL MINO ANSI G77 See Std PGI TRADITIONAL MNT ANS G77 Seda PGI TRADITIONAL MINEXPONENT X R4 R 8 ANSI PGI E TRADITIONAL MINLOC ANSI PGI See Std TRADITIONAL MINVAL ANSI PGI See Sid TRADITIONAL MOD 1 4 A 1 1 1 2 1 4 1 8 ANSI G77 E P R 4 R 8 PGI P 1 1 1 2 1 4 1 8 TRADITIONAL R 4 R 8 MODULO A 1 1 1 2 1 4 I 8 ANSI PGI E R 4 R 8 TRADITIONAL P 1 1 1 2 1 4 I 8 R 4 R 8 C 31 C Supported Fortran Intrinsics Table of Supported Intrinsics AA NN C 32 Table C 1 Fortran Intrinsics Supported in Version 3 2 Continued Intrinsic Name Result Arguments Families Remarks MVBITS Subroutine FROM I 1 1 2 1 4 ANSI G77 E I8 PGI FROMPOS I 1 I 2 TRADITIONAL 1 4 1 8 LEN 171 172 1 4 18 TO 171 1 2 1 4 1 8 TOPOS 1 1 1 2 1 4 I8 NAND AND I 1 4 TRADITIONAL E FETCH J 1 4 NAND_AND_ I 1 8 TRADITIONAL E FETCH J I 8 NEAREST X R 4 R 8 ANSI PGI E S
383. ted number of different options, separated by space. The compilation of the next function reverts back to the settings specified in the compiler command line.

In this release, there are limitations to the options that are processed in this options directive, and their effects on the optimization. There is no warning or error given for options that are not processed. These directives are processed only in the optimizing backend. Thus, only options that affect optimizations are processed. In addition, it will not affect the phase invocation of the backend components. For example, specifying -O0 will not suppress the invocation of the global optimizer, though the invoked backend phases will honor the specified optimization level.

Apart from the optimization level flags, only flags belonging to the following option groups are processed: -LNO, -OPT, and -WOPT.

4.2.2.3 Code Layout Optimization Using Pragmas

This pragma is applicable to C/C++. The user can provide a hint to the compiler regarding which branch of an IF statement is more likely to be executed at runtime. This hint allows the compiler to optimize code generated for the different branches.

The directive is of the form:

    #pragma frequency_hint <hint>

where <hint> is a choice from:
• never: The branch is rarely or never executed.
• init: The branch is executed only during initialization.
384. tegers; however, g77 uses KIND=3 for 1 byte, KIND=5 for 2 bytes, KIND=1 for 4 bytes, and KIND=2 for 8 bytes. We are investigating the cost of providing a compatibility flag for unportable g77 programs. If you find this to be a problem, the best solution is to change your program to inquire for the actual KIND values instead of hard-wiring them.

If you are using -i8 or -r8, see section 3.5.1 for more details on usage.

3.10 Library Compatibility

This section discusses our compatibility with libraries compiled with C or other Fortran compilers.

Linking object code compiled with other Fortran compilers is a complex issue. Fortran 90 or 95 compilers implement modules and arrays so differently that it is extremely difficult to attempt to link code from two or more compilers. For Fortran 77, run-time libraries for things like I/O and intrinsics are different, but it is possible to link both runtime libraries to an executable.

We have experimented using object code compiled by g77. This code is not guaranteed to work in every instance. It is possible that some of our library functions have the same name but different calling conventions than some of g77's library functions. We have not tested linking object code from other compilers, with the exception of g77.

3.10.1 Name Mangling

Name mangling is a mechanism by which names of functions, procedures, and common blocks from
385. temporary array, thus promoting locality in the accesses to the array argument. This optimization is relevant only to Fortran, and this flag controls the aggressiveness of this optimization. The default is ON for -O2 or higher and OFF otherwise. -LANG:formal_deref_unsafe=(ON|OFF): Tell the compiler whether it is unsafe to speculate a dereference of a formal parameter in Fortran. The default is OFF, which is better for performance. -LANG:global_asm=(ON|OFF): When a program has a file-scope asm statement, this option may be used if the asm allocates objects to sections. Enabling this option disables some alignment optimizations so that the compiler's allocations are compatible with those in the asm statement. The default is OFF. -LANG:heap_allocation_threshold=size: Determine heap or stack allocation. If the size of an automatic array or compiler temporary exceeds size bytes, it is allocated on the heap instead of the stack. If size is -1, objects are always put on the stack. If size is 0, objects are always put on the heap. The default is -1 for maximum performance and for compatibility with previous releases. -LANG:IEEE_minus_zero=setting: Enable or disable the SIGN(3I) intrinsic function's ability to recognize negative floating-point zero (-0.0). Specify either ON or OFF for setting. The default is OFF, which suppresses the minus sign. The minus sign is suppressed by default to prevent problem
386. the default when prefetch is on. C*$* PREFETCH_MANUAL(N): Specify if manual prefetches (through directives) should be respected or ignored. Scope: Entire function containing the directive. N can be one of the following values: 0: Ignore manual prefetches. 1: Respect manual prefetches. C*$* PREFETCH_REF_DISABLE=A, size=num: This directive explicitly disables prefetching all references to array A in the current function. The auto-prefetcher runs (if enabled), ignoring array A. The size is used for volume analysis. Scope: Entire function containing the directive. size=num is the size of the array references in this loop, in Kbyte. This is an optional argument and must be a constant. C*$* PREFETCH_REF=array-ref, stride=[str][,str], level=[lev][,lev], kind=[rd|wr], size=[sz]: This directive generates a single prefetch instruction to the specified memory location. It searches for array references that match the supplied reference in the current loop nest. If such a reference is found, that reference is connected to this prefetch node with the specified parameters. If no such reference is found, this prefetch node stays free-floating and is scheduled loosely. All references to this array in this loop nest are ignored by the automatic prefetcher (if enabled). If the size is supplied, then the auto-prefetcher (if enabled) reduces the effective cache size by that amount in its cal
387. threads available for execution. Default is FALSE, since this mechanism is not supported. OMP_NESTED (TRUE or FALSE): Enables or disables nested parallelism. Default is FALSE. OMP_SCHEDULE (type[,chunk]): This environment variable only applies to DO and PARALLEL DO directives that have schedule type RUNTIME. Type can be STATIC, DYNAMIC, or GUIDED. Default is STATIC, with no chunk size specified. OMP_NUM_THREADS (integer value): Set the number of threads to use during execution. Default is number of CPUs in the machine. 8.9.2 PathScale OpenMP Environment Variables. The PathScale OpenMP environment variables provide additional control over thread scheduling through processor affinity. Processor affinity is used to specify the preferred processor, or subset of processors, for scheduling a thread. An affinity setting might be made in order to bind a thread close to a resource and to prevent the kernel from rescheduling the thread to another processor further away from that resource. The resource might be cache memory, main memory, or an I/O device, for example. Note that there is a tension between affinity and load balancing, since specifying affinities may prevent the kernel scheduler from balancing the workload over the processors. The policy of the kernel scheduler determines whether affinity or load balance prevails in cases of conflict. Affinity is particularly important on NUMA (non-uniform memory architectures
388. tics. OProfile can also attribute the samples on a thread or CPU basis, allowing load balancing and scheduling issues to be observed. OProfile can access many different performance counters, giving more detailed insight into the CPU behavior; however, these advanced features of OProfile are not easy to use. If the application uses nested OpenMP parallelism, then try turning on the nested parallelism support by setting the OMP_NESTED environment variable to TRUE. 8.14.3.4 Tuning the Application Code. If you are able to tune the code of the application, it is worth checking whether any of the OpenMP directives specify a chunk size. It may be possible to make more appropriate choices of the chunk size, perhaps influenced by the number of CPUs available, the L2 size, or the data size. You may also want to try different scheduling strategies. If the amount of work in an OpenMP loop varies significantly from iteration to iteration, then a DYNAMIC or GUIDED scheduling algorithm is preferable. The default loop scheduling algorithm is static scheduling, and this is used by the majority of OpenMP applications. If this leads to an unbalanced distribution of work across the threads, try setting the PSC_OMP_STATIC_FAIR environment variable, which will cause the library to use a fairer distribution. If the application uses guided scheduling, the PSC_OMP_GUIDED_CHUNK_DIVISOR and PSC_OMP_GUIDED_CHUNK_MAX environment variables can be used to tune the loop scheduling. The
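The scheduling advice in the chunk above can be tried on a small C loop. The following is a minimal sketch, not code from the guide: the work function and the name unbalanced_sum are hypothetical, chosen so that the cost of an iteration grows with its index, which is the case where a DYNAMIC schedule with an explicit chunk size is worth testing.

```c
#include <assert.h>

/* Hypothetical workload: iteration i costs O(i), so the work is
   deliberately unbalanced across the iteration space. */
static double work(int i)
{
    double s = 0.0;
    for (int k = 0; k <= i; k++)
        s += k;
    return s;                    /* equals i * (i + 1) / 2 */
}

/* Sum work(0) .. work(n-1), handing out iterations in chunks of
   `chunk` to whichever thread is free (DYNAMIC scheduling). */
double unbalanced_sum(int n, int chunk)
{
    assert(n >= 0 && chunk > 0);
    double total = 0.0;
    #pragma omp parallel for schedule(dynamic, chunk) reduction(+:total)
    for (int i = 0; i < n; i++)
        total += work(i);
    return total;
}
```

When OpenMP is not enabled, the pragma is ignored and the loop runs serially with the same result, so timings with and without OpenMP, and with different chunk values, can be compared directly.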
389. time slicing between many processes, the sum of user and system does not necessarily equal real, since other processes could also have run. The default metric used when comparing the performance of one set of options with another is real time. All 3 times will be displayed in the output. Additionally, pathopt2 allows arbitrary performance metrics to be used to guide option selection, using the timing file and rate file choices. When either of these options is used, pathopt2 sets an environment variable called PSC_METRIC_FILE with the name of a temporary file before running the command. The run command is required to write the performance metric into this file before it terminates. The pathopt2 tool then opens this file, reads a value from the file as a double-precision floating-point number, and deletes the temporary file. The only interpretation placed on these values is that smaller is better for timing and that larger is better for rate. The actual units of the values do not matter as far as pathopt2 is concerned, since it just performs comparisons on the values. Using the above usage as a guide, we can now summarize the simple command from the previous section: $ pathopt2 -f pathopt2.xml -t try5 -r factorial pathcc @ -o factorial factorial.c This example directs pathopt2 to use pathopt2.xml as the configuration file. The build command pathcc @ -o factorial factorial.c is used for th
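The metric-file protocol described above (pathopt2 names a temporary file in PSC_METRIC_FILE, and the run command writes one floating-point value into it before exiting) could be satisfied by a helper along these lines. This is a sketch under stated assumptions: the function names write_metric_to and report_metric are hypothetical, and the exact text format pathopt2 accepts is not given in the guide, so a single decimal number followed by a newline is assumed.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Write one metric value to `path` as text. Returns 0 on success. */
int write_metric_to(const char *path, double metric)
{
    assert(path != NULL);
    assert(metric == metric);           /* NaN is not a usable metric */
    FILE *fp = fopen(path, "w");
    if (fp == NULL)
        return -1;
    int ok = fprintf(fp, "%.17g\n", metric) > 0;
    if (fclose(fp) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Report the metric only when the environment variable is set,
   i.e. when the program was launched as a pathopt2 run command. */
int report_metric(double metric)
{
    const char *path = getenv("PSC_METRIC_FILE");
    if (path == NULL || *path == '\0')
        return -1;                      /* no metric file requested */
    return write_metric_to(path, metric);
}
```

A program would call report_metric once, just before it terminates, passing a value where smaller is better for a timing metric or larger is better for a rate metric.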
390. tion W no aggregate return W no bad function cast W no cast align Wno cast qual W no char subscripts W no comment Same as march SSE2 cannot be disabled under m64 Same as mno sse ON under m64 and m32 ON under march em64t and march core otherwise OFF Same as march Defaults Comments Use when compiling and linking Use when compiling and linking Defaults Comments ON for C otherwise OFF s15 lt ON gt lt ON gt lt ON gt lt ON gt lt ON gt ON Defaults Comments C C only C C only C C only C C only C C only C C only C C only E 15 E Summary of Compiler Options AA NN Table E 1 Summary of Compiler Options by Function E 16 W no conversion C C only W no deprecated Wno deprecated declarations W no disabled optimization W no div by zero W no endif labels W no error W no float equal W no format C C only Wno format extra args C C only W no format nonliteral C C only W no format security C C only Wno format y2k C C only W no id clash C C only W no implicit C C only W no implicit function declaration C C only W no implicit int C C only W no import W no inline C C only W no larger than lt number gt Wno long long C C only W no main C C only W no missing braces C C only W no missi
391. tion 7 17 Code tuning example 9 1 COMMON block 10 3 Compilation unit 7 3 Compiler C 4 1 C 4 1 invoking the 2 1 options common 2 8 quick reference 2 1 Compiler defaults file 2 4 compiler defaults 2 4 Compilers using the C C 4 2 Compiling on alternate platforms 2 5 COMPLEX 3 39 Conditional complilation sentinels 8 3 core 2 5 cosin 7 17 Cray pointer 3 21 CRITICAL directive 8 25 Index 2 D Debugging C C 4 7 Fortran 3 42 general information 10 1 Default optimization level 4 2 options 2 3 Directives about 3 22 ATOMIC 8 5 B 4 BARRIER 8 5 changing optimization flags with 3 24 CRITICAL 8 5 DO 8 4 FLUSH 8 5 MASTER 8 5 ORDERED 8 5 PARALLEL 8 4 PARALLEL DO 8 5 PARALLEL SECTIONS 8 5 PARALLEL WORKSHARE 8 5 SECTIONS 8 4 SINGLE 8 5 THREADPRIVATE 8 5 WORKSHARE 8 5 Dope vector 3 28 D 1 DWARF 3 42 4 7 10 1 DYNAMIC scheduling algorithm 8 30 E em64t 2 4 Environment variables Fortran 3 41 OpenMP 8 11 8 12 pathopt2 7 36 PathScale OpenMP 8 12 Environment variables C PSC CFLAGS A 1 Environment variables C PSC CXXFLAGS A 1 Environment variables Fortran F90 BOUNDS CHECK ABORT A 1 F90 DUMP MAP A 1 PathScale Compiler Suite User Guide Version 3 2 o 9 9 9 7 7 aaa FTN SUPPRESS REPEATS A 1 NLS PATH A 1 PSC FDEBUG ALLOC A 1 PSC FFLAGS A 2 PSC STACK LIMIT A 2 PSC STACK VERBOSE A 2 Environment variables language independent FILENV A 2 PSC COMPILER DEFAULTS PATH A 2 PSC GENFLAGS
392. to a variable of that type automatically deallocates and reallocates the components of the target as need be to match the source of the assignment, and then copies the components from the source to the target. If you deallocate a variable containing (directly or indirectly) an allocatable component, the compiler automatically deallocates the component as well. If a procedure has an allocatable dummy argument or function result, the procedure interface must be explicit; that is, the caller must obtain a declaration of the interface via a use statement, by nesting the function under contains, or via an interface block. When a dummy variable has the allocatable attribute, the actual argument associated with it must also have the allocatable attribute. The behavior of an allocatable dummy variable depends on its intent. On entry to a procedure, an allocatable dummy variable with intent(in) or intent(inout) has the allocation status (and value, if any) of the associated actual argument; an allocatable dummy variable with intent(out) is deallocated. During execution, a procedure may not change the allocation status of an allocatable dummy variable with intent(in), but it may allocate or deallocate a variable with intent(inout) or intent(out). On return from a procedure, the actual argument associated with an allocatable dummy variable has the same allocation status and va
393. to specify the -apo and -mp options together: $ pathf95 -apo -mp -c foo.F95 $ pathf95 -apo -mp -o foobar foo.o bar.o Other than the OpenMP directives, the compiler currently does not implement any additional directives to help the compiler in its autoparallelization analysis. Many codes benefit from autoparallelization, and the extent of the benefit may vary with the characteristics of the program and data set being used. There are cases where autoparallelization causes a small performance degradation of an application. This happens because an autoparallelized program runs under multiple threads. The runtime decision to create multiple threads, followed by their synchronization, is overhead during execution. When the compiler parallelizes a loop, it generates both a serial and a parallel version. At runtime, the generated code looks at the total amount of work performed by the loop and decides whether to execute the serial or the parallel version. This decision can only be made at runtime, when the number of processors and the loop iteration counts are available. If the amount of work is not large enough to justify the additional synchronization overhead, it will execute the serial version instead. In such cases, the performance will be slower than if the program is not compiled with -apo, due to the need to make this decision at run time. The synchronization overhead can be controlled using the -LNO:parallel_overhead option
394. tructs the compiler not to warn about pointer casts that increase alignment. -W[no-]char-subscripts: For C/C++ only. -Wchar-subscripts warns about subscripts whose type is char. The -Wno-char-subscripts option tells the compiler not to warn about subscripts whose type is char. -W[no-]comment: For C/C++ only. -Wcomment warns if nested comments are detected. -Wno-comment tells the compiler not to warn if nested comments are detected. -W[no-]conversion: For C/C++ only. -Wconversion warns about possibly confusing type conversions. -Wno-conversion tells the compiler not to warn about possibly confusing type conversions. -Wdeclaration-after-statement: For C/C++ only. Warn about declarations after statements (pre-C99). -W[no-]deprecated: -Wdeprecated will announce deprecation of compiler features. -Wno-deprecated tells the compiler not to announce deprecation of compiler features. -W[no-]disabled-optimization: -Wdisabled-optimization warns if a requested optimization pass is disabled. -Wno-disabled-optimization tells the compiler not to warn if a requested optimization pass is disabled. -W[no-]div-by-zero: -Wdiv-by-zero warns about compile-time integer division by zero. -Wno-div-by-zero suppresses warnings about compile-time integer division by zero. -W[no-]endif-labels: -Wendif-labels warns if #if or #endif is followed by text. -Wno-endif-labels tells the compiler not to warn if
395. tructured-block !$OMP end parallel. Clauses: PRIVATE, SHARED, DEFAULT (PRIVATE | SHARED | NONE), FIRSTPRIVATE, REDUCTION, COPYIN, IF, NUM_THREADS. Work-sharing constructs: Divide the execution of the enclosed block of code among the members of the team that encounter it. DO: !$OMP do [clause] do-loop !$OMP enddo [nowait]. Clauses: PRIVATE, FIRSTPRIVATE, LASTPRIVATE, REDUCTION, SCHEDULE (static, dynamic, guided, runtime), ORDERED. SECTIONS: !$OMP sections [clause] structured-block !$OMP end sections [nowait]. Clauses: PRIVATE, FIRSTPRIVATE, LASTPRIVATE, REDUCTION. Table 8-1. Fortran Compiler Directives (Continued). Columns: Directive, Clauses, Example. SINGLE: !$OMP single [clause] structured-block !$OMP end single [nowait]. Clauses: PRIVATE, FIRSTPRIVATE, COPYPRIVATE. Combined parallel work-sharing constructs: Shortcut for denoting a parallel region that contains only one work-sharing construct. PARALLEL DO: !$OMP parallel do structured-block !$OMP end parallel do. PARALLEL SECTIONS: !$OMP parallel sections structured-block !$OMP end parallel sections. PARALLEL WORKSHARE: !$OMP parallel workshare structured-block !$OMP end parallel workshare. Synchronization constructs: Provide various aspects of synchronization; for example, access to
396. ts in size are aligned on 16-bit boundaries, and objects smaller than 16 bits are aligned on 8-bit boundaries. -ansi: For Fortran only. Generate messages about constructs which violate standard Fortran syntax rules and constraints, plus messages about obsolescent and deleted features. This also disables all nonstandard intrinsic functions and subroutines, and implies -ffortran2003. Specifying -ansi in conjunction with -fullwarn causes all messages, regardless of level, to be generated. -ansi: For C/C++ only. Enable pure ANSI/ISO C mode. -apo: This auto-parallelizing option signals the compiler to automatically convert sequential code into parallel code when it is safe and beneficial to do so. The resulting executable can then run faster on a machine with more than one CPU. -ar: Create an archive using ar(1) instead of a shared object or executable. The name of the archive is specified by using the -o option. Template entities required by the objects being archived are instantiated before creating the archive. The pathCC command implicitly passes the -r and -c options of ar to ar, in addition to the name of the archive and the objects being created. Any other option that can be used in conjunction with the -c option of ar can be passed to ar using -WR,option_name. NOTE: The objects specified with this option must include all of the objects that will be included in the archive. Failure to do so may cause prelinker internal errors. In th
397. uch as 128. Specify a positive integer for N; specifying N=0 indicates there is no cache at that level. -LNO:cmp1=N,cmp2=N,cmp3=N,cmp4=N,dmp1=N,dmp2=N,dmp3=N,dmp4=N: This option specifies, in processor cycles, the time for a clean miss (cmpx) or a dirty miss (dmpx) to the next outer level of the memory hierarchy. This number is approximate because it depends on a clean or dirty line, read or write miss, etc. Specify a positive integer for N; specifying N=0 indicates there is no cache at that level. -LNO:cs1=N,cs2=N,cs3=N,cs4=N: This option specifies the cache size. N can be 0 or a positive integer followed by one of the following letters: k, K, m, or M. These letters specify the cache size in Kbytes or Mbytes. Specifying 0 indicates there is no cache at that level. cs1 is the primary cache, cs2 refers to the secondary cache, cs3 refers to memory, and cs4 is the disk. Default cache size for each type of cache depends on your system. Use -LIST:all_options=ON to see the default cache sizes used during compilation. -LNO:is_mem1=(ON|OFF),is_mem2=(ON|OFF),is_mem3=(ON|OFF),is_mem4=(ON|OFF): This option specifies that certain memory hierarchies should be modeled as memory, not cache. Default is OFF for each option. Blocking can be attempted for this memory level, and blocking appropriate for memory, rather than cache, is applied. No prefetching is performed, and any prefetching options are ignored. I
398. ue int c value double d float f int i long long i8 printf d 1f f 1f i d i8 lld n d f i i8 fflush stdout Flush output before switching languages return 4 Nonzero will be treated as true by Fortran Here is the Fortran source code part f90 program f part implicit none Explicit interface is not required but adds some error checking interface subroutine c reference d1 f1 il i2 c1 11 12 c2 c3 doubleprecision di real f1 integer i1 integer 8 i2 character c1 c3 character 4 c2 logical 11 12 end subroutine c reference logical function c value d f i i8 doubleprecision d real f integer i integer 8 i8 end function c value end interface logical 1 pointer p user user character 32 user integer 8 getlogin nounderscore File decorate txt maps this to external getlogin nounderscore getlogin without underscore intrinsic char Demonstrate calling from Fortran a C function taking arguments by reference 3 32 3 The PathScale Fortran Compiler Mixed Code ls call c_reference 9 8d0 7 6 5 4 8 hello false true amp from f part Demonstrate calling from Fortran a C function taking arguments by value 1 c value val 9 8d0 val 7 6 val 5 val 4 8 write 6 a 18 l 1 getlogin is a standard C library function which returns char px When a C function returns a pointer you must use a Cray pointer
399. uiltin: For C/C++ only. Do not recognize any built-in functions. -fno-common: For C/C++ only. Use strict ref/def initialization model. -fno-ident: Ignore #ident directives. -fno-math-errno: Do not set ERRNO after calling math functions that are executed with a single instruction, e.g. sqrt. A program that relies on IEEE exceptions for math error handling may want to use this flag for speed while maintaining IEEE arithmetic compatibility. This is implied by -Ofast. The default is -fmath-errno. -fpack-struct: For C/C++ only. Pack structure members together without holes. -f[no-]permissive: -fpermissive will downgrade messages about non-conformant code to warnings. -fno-permissive keeps messages about non-conformant code as errors. -f[no-]PIC: -fPIC tells the compiler to generate position-independent code, if possible. The default is -fno-PIC, which tells the compiler not to generate position-independent code. -fprefix-function-name: For C/C++ only. Add a prefix to all function names. -f[no-]preprocessed: -fpreprocessed tells the preprocessor that input has already been preprocessed. Using -fno-preprocessed tells the preprocessor that input has not already been preprocessed. -frandom-seed=string: For C/C++ only. The compiler normally uses a random number to generate names that have to be different in each compiled file. These names include certain symbol names, unique stamps in coverage data, and the object files that produ
400. uitable for the analysis program pathprof(1). You must use this option when compiling the source files you want data about, and you must also use it when linking. This option turns on application-level and library-level profiling (see also -pg). -r: Produce a relocatable .o and stop. -rreal_spec: For Fortran only. Specify the default kind specification for real values. Option and kind value: -r4: Use REAL(KIND=4) and COMPLEX(KIND=4) for real and complex variables, respectively (the default). -r8: Use REAL(KIND=8) and COMPLEX(KIND=8) for real and complex variables, respectively. -S: Generate an assembly file, file.s, rather than an object file (file.o). -shared: DSO-shared PIC code. -shared-libgcc: Force the use of the shared libgcc library. -show: Print the passes as they execute with their arguments and their input and output files. -show-defaults: Show the processor target settings and the default options in the compiler.defaults(5) file. For C/C++, also shows the GNU GCC version compatibility. -show0: Show what phases would be called, but don't invoke anything. -showt: Show time taken by each phase. --static: Same as -static, except --static does not cause the compiler to warn about possible confusion with -static-data. -static: Suppress dynamic linking at runtime for shared libraries; use static linking instead. -static-data: Statically allocate all local variables. Statically allocated local va
401. ul to store an opaque handle to such data within C code, even if the code cannot use the pointer to access the data itself. However, the standard does require that any argument to C_LOC have the TARGET attribute, and does restrict some arguments: for example, it requires that a pointer argument be scalar. Passing Arguments by Value. C passes arguments by value. To pass the address of a variable so that the called procedure can modify the variable, one generally declares the formal argument to be a pointer, and one explicitly passes the address of the variable as the actual argument. Fortran compilers pass arguments in a variety of ways. For the kinds of arguments allowed in the Fortran 77 standard, they commonly pass the address of the argument (that is, they pass the argument by reference), although other methods are allowed. But for some of the kinds of arguments added in the Fortran 90 and later standards, a simple address is not sufficient. The Fortran 2003 standard ensures argument-passing compatibility with C in three ways, provided a procedure has the bind(c) attribute: The Fortran 90 arguments which cannot be represented as simple addresses are generally prohibited in procedures which have the bind(c) attribute. You can use the value attribute to pass any dummy argument by value. If a procedure has the bind(c) attribute, it must pass by reference any arguments
402. ully unrolled. The default value for N is 2000. -LNO:full_unroll_outer=(ON|OFF): Control the full unrolling of loops with known trip count that do not contain a loop and are not contained in a loop. The conditions implied by both the full_unroll and the full_unroll_size options must be satisfied for the loop to be fully unrolled. The default is OFF. -LNO:fusion=(0|1|2): Perform loop fusion. The option can be one of the following: 0: Loop fusion is off. 1: Perform conservative loop fusion. This is the default. 2: Perform aggressive loop fusion. -LNO:fusion_peeling_limit=N: This option sets the limit for the number of iterations allowed to be peeled in fusion, where N >= 0. N=5 by default. -LNO:gather_scatter=(0|1|2): This option enables gather-scatter optimizations. The option can be one of the following: 0: Disable all gather-scatter optimizations. 1: Perform gather-scatter optimizations in non-nested IF statements. This is the default. 2: Perform multi-level gather-scatter optimizations. -LNO:hoistif=(ON|OFF): This option enables or disables hoisting of IF statements inside inner loops to eliminate redundant loops. Default is ON. -LNO:ignore_feedback=(ON|OFF): If the flag is ON, then feedback information from the loop annotations will be ignored in LNO transformations. The default is OFF. -LNO:ignore_pragmas=(ON|OFF): This option specifies that the command-line options override directives in the source
403. une is 64-bit; otherwise, the default is 32-bit. -macro-expand: Enable macro expansion in preprocessed Fortran source files throughout each file. Without this option specified, macro expansion is limited to preprocessor # directives in files processed by the Fortran preprocessor. When this option is specified, macro expansion occurs throughout the source file. -march=<cpu-type>: Compiler will optimize code for the selected cpu type: opteron, athlon, athlon64, athlon64fx, barcelona, em64t, pentium4, xeon, core, anyx86, auto. auto means to optimize for the platform that the compiler is running on, which the compiler determines by reading /proc/cpuinfo. anyx86 means a generic x86 processor. Under 32-bit ABI, anyx86 is a processor without SSE2/SSE3/3DNow! support; under 64-bit ABI, it is a processor with SSE2 but without SSE3/3DNow!. Core refers to the Intel Core Microarchitecture, used by 64-bit CPUs such as Woodcrest. The default is auto. -mcmodel=(small|medium): Select the code size model to use when generating offsets within object files. Most programs will work with -mcmodel=small (using 32-bit pointers), but some need -mcmodel=medium (using 32-bit pointers for code and 64-bit pointers for data). -mcpu=<cpu-type>: Behaves like -march. See -march. -MD: Write dependencies to .d output file. -MDtarget: Use the following as the target for Make dependencies. -MDupdate: Update the following file with Make dependencies. -MF: Write depe
404. ur code as if the directives were not there. Compiling and Linking with -mp: If a program compiled with -mp is linked without the -mp flag, the linker will not link with the OpenMP library, and the linker will display undefined references similar to these: undefined reference to ompc can fork libutil a diffu o text 0xa93 In function diffu tg undefined reference to ompc get thread num libutil a diffu o text 0x2400 In function tare EU 3 undefined reference to ompc fork libutil a diffu o text 0x2499 In function ompdo diffu 1 Appendix A Environment Variables. This appendix lists environment variables utilized by the compiler, along with a short description. These variables are organized by language, with a separate section for language-independent variables. A.1 Environment Variables for Use with C. PSC_CFLAGS: Flags to pass to the C compiler, pathcc. This variable is used with the gcc compatibility wrapper scripts. A.2 Environment Variables for Use with C++. PSC_CXXFLAGS: Flags to pass to the C++ compiler, pathCC. This variable is used with the gcc compatibility wrapper scripts. A.3 Environment Variables for Use with Fortran. F90_BOUNDS_CHECK_ABORT: Set to YES, causes the program to abort on the first bounds check violation. F90_DUMP_MAP: Dump memory mapping at the location of a segme
405. ust contain at least one CPU identifier, and entries in the list beyond the maximum number of threads supported by the implementation (256) are ignored. Each CPU identifier is a decimal number between 0 and one less than the number of CPUs in the system (inclusive). The implementation generates a mapping table that enumerates the mapping from each thread to CPUs. The CPU identifiers in the PSC_OMP_AFFINITY_MAP list are inserted in the mapping table starting at the index for thread 0 and increasing upwards. If the list is shorter than the maximum number of threads, then it is simply repeated over and over again until there is a mapping for each thread. This repeat feature allows short lists to be used to specify repetitive thread mappings for all threads. PSC_OMP_CPU_STRIDE: This specifies the striding factor used when mapping threads to CPUs. It takes an integer value in the range of 0 to the number of CPUs (inclusive). The default is a stride of 1, which causes the threads to be linearly mapped to consecutive CPUs. When there are more threads than CPUs, the mapping wraps around, giving a round-robin allocation of threads to CPUs. The behavior for a stride of 0 is the same as a stride of 1. PSC_OMP_CPU_OFFSET: This specifies an integer value that is used to offset the CPU assignments for the set of threads. It takes an integer value in the range of 0 to the number of CPUs (inclusive). When a thread is mapped to a CPU, this offset is added onto the CPU nu
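The repeat rule for PSC_OMP_AFFINITY_MAP can be illustrated with a few lines of C. This is a sketch of the rule as worded above, not the library's implementation: the helper name cpu_for_thread is hypothetical, and the CPU list is assumed to have been parsed into an integer array already. Stride and offset handling are left out because their description is cut off in this chunk.

```c
#include <assert.h>
#include <stddef.h>

/* Return the CPU identifier that thread `thread` maps to when the
   affinity list is inserted starting at thread 0 and repeated until
   every thread has an entry. */
int cpu_for_thread(const int *cpu_list, size_t list_len, size_t thread)
{
    assert(cpu_list != NULL);
    assert(list_len > 0);          /* the list needs at least one CPU */
    return cpu_list[thread % list_len];
}
```

With a list of 0, 2 the threads alternate between CPU 0 and CPU 2, so thread 5 lands on CPU 2. List entries past the maximum thread count are never consulted, which matches the statement that they are ignored.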
406. ut looks like this: omphello.out Hello World from thread 0 Number of threads 2 Hello World from thread 1 The same program can be compiled and linked without -mp, and the directives will be ignored. We compile the program without -mp: pathf95 -c omphello.f Link the object file and create an output file: pathf95 omphello.o -o omphello.out Run the program and the output looks like this: omphello.out Hello World from thread 0 Number of threads 1 For more examples using OpenMP, please see the sample code at http www openmp org drupal node view 1 4 There are also examples of OpenMP code in Appendix A of the OpenMP 2.0 Fortran specification. See section 8.15 for more details. 8.13 Example OpenMP Code in C/C++. The following program is a parallel version of hello world written using OpenMP directives. When run, it spawns multiple threads. It uses the CRITICAL directive to ensure that the printing from the various threads will not overwrite one another. Here is the program, omphello.c:
    #include <omp.h>
    main()
    {
        int tid = 0;
        int nthreads = 1;
        /* Fork a team of threads giving them their own copies of variable tid */
        #pragma omp parallel private(tid)
        {
    #ifdef _OPENMP
            /* Obtain and print thread id */
            tid = omp_get_thread_num();
    #endif
            #pragma omp critical
            printf("Hello World from thread %d\n", tid);
407. utput file: pathcc omphello.o -o omphello.out Run the program and the output looks like this: omphello.out Hello World from thread 0 Number of threads 1 For more examples using OpenMP, please see the sample code at http www openmp org drupal node view 1 4 There are also examples of OpenMP code in Appendix A of the OpenMP 2.0 C/C++ specification. See section 8.15 for more details. 8.14 Tuning for OpenMP Application Performance. A good first step in tuning OpenMP code is to build a serial version of the application and tune the serial performance. See section 7 for ideas and suggestions. Often, good flags for serial performance are also good for OpenMP performance. Typically, OpenMP parallelizes the outer iterations of the compute-intensive loops in a coarse fashion, leaving chunks of the outer loops and the inner loops that generally behave very similarly to the serial code. Use pathopt2 (see section 7.9 for details on pathopt2) to help find good serial tuning options for the application. You may be able to find interesting options for tuning by looking at tuned configuration files for similar codes. With this approach, you can find good options for the serial parts of the code before having to consider OpenMP-specific issues such as scheduling, scaling, and affinity. If the test case takes a long time to run or needs a lot of memory, then you may be forced to tune the flags with OpenMP enabled. 8.14.1 Reduced Datasets. You
408. voke constant global variable identification. This option marks non-scalar global variables that are never modified as constant, and propagates their constant values to all files. Default is ON. -IPA:clone_list=(ON|OFF): Tell the IPA function cloner to list cloning actions as they occur to stderr. The default is OFF. -IPA:common_pad_size=N: This specifies the amount by which to pad common block array dimensions. The value of N can affect cache behavior for common block array accesses. The default is 0. -IPA:cprop=(ON|OFF): Turn on or off inter-procedural constant propagation. This option identifies the formal parameters that always have a specific constant value. Default is ON. See also -IPA:aggr_cprop. -IPA:ctype=(ON|OFF): When ON, causes the compiler to generate faster versions of the <ctype.h> macros such as isalpha, isascii, etc. This flag is unsafe both in multi-threaded programs and in all locales other than the 7-bit ASCII (or C) locale. The default is OFF. Do not turn this on unless the program will always run under the 7-bit ASCII (or C) locale and is single-threaded. -IPA:depth=N: Identical to maxdepth=N. -IPA:dfe=(ON|OFF): Enable or disable dead function elimination. Removes any functions that are inlined everywhere they are called. The default is ON. -IPA:dve=(ON|OFF): Enable or disable dead variable elimination. This option removes variables that are never referenced b
409. wait for confirmation that the lock is available. If lock is successfully set, function increments the nesting count; if lock is unavailable, function returns a value of zero.

omp_get_wtime    Returns double precision value equal to the number of seconds since the initial value of the operating system real-time clock.
omp_get_wtick    Returns double precision floating point value equal to the number of seconds between successive clock ticks.

8.7 OpenMP Runtime Library Calls: C/C++

OpenMP programs can explicitly call standard routines implemented in the OpenMP runtime library. If you want to ensure the program is still compilable without -mp, you need to guard such code with the OpenMP conditional compilation sentinels (e.g. #pragma). The following table lists the OpenMP runtime library routines provided by version 2.1 of the OpenMP C/C++ Application Program Interface.

Table 8-4. C/C++ OpenMP Runtime Library Routines

Routine                            Description
void omp_set_num_threads(int)      Set the number of threads to use in a team.
int omp_get_num_threads(void)      Return the number of threads in the currently executing parallel region.
int omp_get_max_threads(void)      Return the maximum value that omp_get_num_threads may return.
int omp_get_thread_num(void)       Return the thread number within the team.
int omp_get_num_procs(void)        Return the number of processors available to the program.
v
410. when run on a 4 CPU system with 1G of memory, the Fortran runtime will attempt to raise the stack size limit to 1G 128M 4 or 640M.

To have the Fortran runtime tell you what it is doing with the stack size limit, set the PSC_STACK_VERBOSE environment variable before you run a Fortran program. You can control the stack size limit that the Fortran runtime attempts to use using the PSC_STACK_LIMIT environment variable.

If this is set to the empty string, the Fortran runtime will not attempt to modify the stack size limit in any way. Otherwise, this variable must contain a number. If the number is not followed by any text, it is treated as a number of bytes. If it is followed by the letter k or K, it is treated as kilobytes (1024 bytes). If m or M, it is treated as megabytes (1024K). If g or G, it is treated as gigabytes (1024M). If %, it is treated as a percentage of the system's physical memory. If the number is negative, it is treated as the amount of memory to leave free, i.e. it is subtracted from the amount of physical memory on the machine. If all of this text is followed by cpu, it is treated as a per-cpu number, and that number is multiplied by the number of CPUs on the system. This is useful for multiprocessor systems that are running several processes concurrently. The value specified (implicitly or explicitly) is the memory value per proces
411. will override the compiled-in choice in favor of the choice established by the command assign 1

For Fortran only: Perform runtime subscript range checking. Subscripts that are out of range cause fatal runtime errors. If you set the F90_BOUNDS_CHECK_ABORT environment variable to YES, the program aborts.

For C only: Keep comments after preprocessing.

Create an intermediate object file for each named source file, but do not link the object files. The intermediate object file name corresponds to the name of the source file; a .o suffix is substituted for the suffix of the source file. Because they are mutually exclusive, do not specify this option with the -r option.

The Code Generation option group controls the optimizations and transformations of the instruction-level code generator.

-CG:cflow=(ON|OFF)
OFF disables control flow optimization in the code generation. Default is ON.

-CG:cse_regs=N
When performing common subexpression elimination during code generation, assume there are N extra integer registers available over the number provided by the CPU. N can be positive, zero, or negative. The default is positive infinity. See also -CG:sse_cse_regs.

-CG:gcm=(ON|OFF)
Specifying OFF disables the instruction-level global code motion optimization phase. The default is ON.

-CG:inflate_reg_request=N
The local register allocator will inflate its register request by N percent for innermost loops
412. with ftpp
3.6.3 Support for Varying Length Character Strings
3.6.4 Preprocessing Source Files with fcoco
3.6.4.1 Pre-defined Macros
3.6.5 Error Numbers: The explain Command
3.6.6 Fortran 90 Dope Vector
3.6.7 Bounds Checking
3.6.8 Pseudo-random Numbers
3.7 Mixed Code
3.7.1 Legacy Support for Calls between C and Fortran
3.7.1.1 Example: Calls between C and Fortran
3.7.1.2 Example: Accessing Common Blocks from C
3.8 Runtime I/O Compatibility
3.8.1 Performing Endian Conversions
3.8.1.1 The assign Command
3.8.1.2 Using the Wildcard Option
3.8.1.3 Converting Data and Record Headers
3.8.1.4 The ASSIGN() Procedure
3.8.1.5 I/O Compilation Flags
3.8.2 Reserved File Units
3.9 Source Code Compatibility
413. xtensions

stat
Fortran interface to the POSIX function stat. Store in array sarray information about the file named file; if that is a symbolic link, describe the target rather than the link itself (cf. lstat). The function form returns 0 or an error code from the C library value errno. Trailing blanks in file are ignored; you can prevent this by using char(0) to place a null character after the last significant character. sarray must have thirteen elements:

1. ID of device containing file
2. Inode number
3. File mode
4. Number of links
5. UID of owner
6. GID of owner
7. ID of device containing directory entry for file
8. Size of file in bytes
9. Time of last access
10. Time of last modification
11. Time of last file status change
12. Preferred I/O block size (-1 if not available)
13. Number of blocks allocated (-1 if not available)

Except for elements 12 and 13, values are set to 0 if they are not available from the relevant file system.

symlink
Fortran interface to the POSIX function symlink. Creates a symbolic link path2 pointing to the same file as path1. The function form returns 0 on success or an error code from the C library value errno. The subroutine form sets status to the value which the function would return. Trailing blanks in path1 and path2 are ignored; you can prevent this by using char(0) to place a null character after the last significant character
414. y blank characters only. Comments start with a character anywhere on the line.

When a Fortran module is compiled, information about the module is placed into a file called MODULENAME.mod. The default location for this file is in the directory where the command is executed. This location can be changed using the -module option. The MODULENAME.mod file allows other Fortran files to use procedures, functions, variables, and any other entities defined in the module. Module files can be considered similar to C header files. Like C header files, you can use the -I option to point to the location of module files:

pathf95 -I/work/project/include -c foo.f90

This instructs the compiler to look for .mod files in the /work/project/include directory. If foo.f90 contains a 'use arith' statement, the following locations would be searched:

/work/project/include/ARITH.mod
./ARITH.mod

Order of Appearance

If a module and the use statements referring to that module appear in the same source file, the module must appear first. If a module appears in one source file and the use statements referring to that module appear in other source files, the file containing the module must be compiled first. If a single command compiles all the files, the file containing the module must appear on the command line before the files containing the use statements:

pathf95 mymodule.f95 myprogram.f95

3 The PathScale Fortran Compiler Linking When the Mai
415. y of Compiler Options by Function CG use prefetchnta ON OFF CG use test ON OFF Compilation Control Options A pred ans alignN auto use module name module name backslash byteswapio c convert conversion default64 f no check new fdecoratepath f no directives fe ff2c abipath f no unwind tables f no gnu keywords finhibit size directive fabi version N fms extensions fno asm fno builtin fno common f no exceptions fno ident f no signed char <OFF> <OFF> Defaults Comments pred ans cancels A pred ans <64> Other options are 8 16 32 128 Fortran only If used preprocessor not called Fortran only Do not use with x option since mutually exclusive natives Fortran only Synonym for r8 i8 Fortran only C only Fortran only fdirectives Fortran only Fortran only fno unwind tables C C only s15 C only C C only C C only C C only C C only <fexceptions> C only C C only E Summary of Compiler Options ls Table E 1 Summary of Compiler Options by Function fpack struct C C only frandom seed string C C only f no rtti C only f no second underscore Fortran only f no signed bitfields C C only f no strict aliasing C C only f no PIC <fno PIC> fprefix function name C C only fshared data C C only
416. y rank 1 TRADITIONAL IDATE Subroutine TARRAY 4 G77 PGI Array rank 1 TRADITIONAL IDATE Subroutine TARRAY I 8 Array G77 PGI rank 1 TRADITIONAL IDIM 1 4 X 1 1 1 2 1 4 I8 ANSI G77 E P Y 1 1 1 2 1 4 1 8 PGI TRADITIONAL IDINT 1 4 ANSI G77 E PGI TRADITIONAL A R 8 IDNINT 1 4 A R 8 ANSI G77 E P PGI TRADITIONAL C 21 C Supported Fortran Intrinsics Table of Supported Intrinsics AA NY C 22 Table C 1 Fortran Intrinsics Supported in Version 3 2 Continued Intrinsic Name Result Arguments Families Remarks IEEE BINARY Y R 4 R 8 TRADITIONAL E SCALE N 1 1 1 2 1 4 I 8 IEEE_CLASS X R 4 R 8 TRADITIONAL E IEEE COPY X R 4 R 8 TRADITIONAL E SIGN Y R 4 R 8 IEEE_ X R 4 R 8 TRADITIONAL E EXPONENT Y 1 1 1 2 1 4 I 8 O R 4 R 8 IEEE_FINITE X R 4 R 8 TRADITIONAL E IEEE_INT X R 4 R 8 TRADITIONAL E Y V1 1 2 1 4 18 O R 4 R 8 IEEE_IS_NAN X R 4 R 8 TRADITIONAL E IEEE_NEXT_ X R 4 R 8 TRADITIONAL E AFTER Y R 4 R 8 IEEE REAL X 1 1 1 2 1 4 F8 TRADITIONAL E R 4 R 8 O Y R 4 R 8 IEEE_ X R 4 R 8 TRADITIONAL E REMAINDER Y R 4 R 8 IEEE_ X R 4 R 8 TRADITIONAL E UNORDERED Y R 4 R 8 IEOR 1 4 1 11 I2 1 4 I8 ANSI G77 E J Pi P2 r4 r8 PG TRADITIONAL IERRNO 1 4 G77 PGI IFIX 1 4 A R 4 R 8 ANSI G77 E PGI TRADITIONAL IIABS I 2 A I2 PGI E TRADITIONAL HAND I2 I 1 2 PGI E J l 2 TRADITIONAL IIBCHNG I2 I 12 TRAD
417. y the program. Default is ON.

-IPA:echo=(ON|OFF)
Option to echo (to stderr) the compile commands and the final link commands that are invoked from IPA. Default is OFF. This option can help monitor the progress of a large system build.

-IPA:field_reorder=(ON|OFF)
Enable the re-ordering of fields in large structs based on their reference patterns in feedback compilation to minimize data cache misses. The default is OFF.

-IPA:forcedepth=N
This option sets inline depths, directing IPA to attempt to inline all functions at a depth of at most N in the callgraph, instead of using the default inlining heuristics. This option ignores the default heuristic limits on inlining. Functions at depth 0 make no calls to any sub-functions. Functions only making calls to depth 0 functions are at depth 1, and so on. By default this optimization is not done.

-IPA:ignore_lang=(ON|OFF)
Enable or disable inlining across language boundaries of Fortran on one side and C/C++ on the other. The compiler may not always be aware of the correct effective language semantics if this optimization is done, making it unsafe in some scenarios. The default is OFF.

-IPA:inline=(ON|OFF)
This option performs inter-file subprogram inlining during the main IPA processing. The default is ON. Does not affect the light-weight inliner.

-IPA:keeplight=(ON|OFF)
This option directs IPA not to send -keep to the compiler, in order to save space. The default is OFF.

-IPA:linear
418. zer Equivalent to -O3 -ipa -OPT:Ofast -fno-math-errno -ffast-math. ON by default when -OPT:Ofast is specified. OFF OFF OFF OFF OFF

Table E-1. Summary of Compiler Options by Function

Option                                  Defaults / Comments
-OPT:bb=N                               1300
-OPT:cis=(ON|OFF)                       <ON>
-OPT:div_split=(ON|OFF)                 <OFF>, but enabled by -OPT:Ofast or -OPT:IEEE_arithmetic=3
-OPT:early_mp=(ON|OFF)                  <OFF>. Has effect only under -mp compilation.
-OPT:early_intrinsics=(ON|OFF)          <OFF>
-OPT:fast_bit_intrinsics=(ON|OFF)       <OFF>
-OPT:fast_complex=(ON|OFF)              <OFF>, but enabled if -OPT:roundoff=3
-OPT:fast_exp=(ON|OFF)                  <OFF>, but enabled if -O3 or -Ofast are specified or -OPT:roundoff=1 is in effect
-OPT:fast_io=(ON|OFF)                   <OFF>. C/C++ only.
-OPT:fast_math=(ON|OFF)                 <OFF>, but enabled if -OPT:roundoff is at 2 or above
-OPT:fast_nint=(ON|OFF)                 <OFF>, but enabled if -OPT:roundoff=3
-OPT:fast_sqrt=(ON|OFF)                 <OFF>. If ON, -OPT:fast_exp must also be ON.
-OPT:fast_stdlib=(ON|OFF)               <ON>
-OPT:fast_trunc=(ON|OFF)                <OFF>, but enabled if -OPT:roundoff is at 1 or above
-OPT:fold_reassociate=(ON|OFF)          <OFF>, but enabled if -OPT:roundoff is at 2 or above
-OPT:fold_unsafe_relops=(ON|OFF)        <OFF>, but enabled if -O3
-OPT:fold_unsigned_relops=(ON|OFF)      <OFF>
-OPT:goto=(ON|OFF)                      <OFF>, but enabled if -O2 or higher
