
QLogic PathScale™ Compiler Suite User Guide

1. Intrinsic Name Result Arguments Families Remarks SUM ANSI PGI See Std TRADITIONAL SYMLNK 1 4 PATH1 C G77 PGI O PATH2 C STATUS 1 4 SYMLNK Subroutine PATH1 C G77 O PATH2 C STATUS I 4 SYNCHRONIZE Subroutine TRADITIONAL E SYNC IMAGES Subroutine TRADITIONAL SYNC IMAGES Subroutine IMAGE I 1 1 2 1 4 TRADITIONAL 1 8 SYNC_IMAGES Subroutine IMAGE I 1 1 2 1 4 TRADITIONAL I 8 Array rank 1 SYSTEM 1 4 COMMAND C G77 PGI O STATUS 1 4 SYSTEM Subroutine COMMAND C G77 O STATUS I 4 SYSTEM_ Subroutine COUNT 1 4 ANSI G77 O CLOCK COUNT 1 4 PGI _ 4_ TRADITIONAL SYSTEM CLOC Subroutine COUNT 1 8 ANSI G77 O K COUNT 1 8 PGI COUNT I g TRADITIONAL TAN R 4 X R 4 R 8 ANSI G77 E P PGI TRADITIONAL TAND R 4 X R 4 R 8 PGI E TRADITIONAL TANH R 4 X R 4 R 8 ANSI G77 E P PGI TRADITIONAL TEST IEEE EXCEPTION I 8 TRADITIONAL E EXCEPTION TEST IEEE INTERRUPT I 8 TRADITIONAL E INTERRUPT 1 02404 15 C 39 C Supported Fortran Intrinsics Table of Supported Intrinsics QLOGIC FWGIISTI IT IINTIT lt Gr lt GFGFEPEERcRTOGSGIGIEIEIEXEXEFEGGGIWWIICII ee Table C 1 Fortran Intrinsics Supported in 3 0 Continued Intrinsic Name Result Arguments Families Remarks THIS_IMAGE Depends on arg ARRAY Anytype TRADITIONAL O Arrayrank any DIM 1 1 1 2 1 4 1 8 TIM
2. How To Invoke the PathScale Compilers
   Accessing the GCC 4.x Front-ends for C and C++
   Compiling for Different Platforms
   Target Options for This Release
   Defaults Flag
   Compiling for an Alternate Platform
   Compiling Option Tool: pathhow-compiled
   Input File Types
   Other Input Files
   Common Compiler Options
   Shared Libraries
   Large File Support
   Memory Model Support
   Support for Large Memory Model
   Debugging
   Profiling: Locate Your Program's Hot Spots
   taskset: Assigning a Process to a Specific CPU
   The PathScale Fortran Compiler
   Using the Fortran Compiler
   Fixed-form and Free-form Files
   Modules
   Order of Appearance
   Linking Object Files to the Rest of the Program
3. Key to Types Key to Characteristics Depends on arg Result type varies depending on the argument type Subroutine Intrinsic is a subroutine not a function Table C 1 Fortran Intrinsics Supported in 3 0 Intrinsic Name Result Arguments Families Remarks ABORT Subroutine G77 PGI ABS R 4 A 1 1 1 2 I 4 1 8 ANSI G77 R 4 R 8 Z 8 Z 16 PGI TRADITIONAL ACCESS 1 4 NAME C G77 PGI MODE C ACHAR C I 1 1 12 1 4 8 ANSI G77 E PGI TRADITIO NAL ACOS R 4 X R 4 R 8 ANSI G77 E P PGI TRADITIONAL ACOSD R 4 X R 4 R 8 PGI E TRADITIONAL ADD AND I 1 4 TRADITIONAL E FETCH J 1 4 ADD AND I 1 8 TRADITIONAL E FETCH J I 8 ADJUSTL STRING C ANSI PGI E TRADITIONAL ADJUSTR STRING C ANSI PGI E TRADITIONAL AIMAG R 4 Z Z 8 Z 16 ANSI G77 E P PGI TRADITIONAL AINT R 4 A R 4 R 8 ANSI G77 E P KIND 1 1 1 2 1 4 PGI 1 8 TRADITIONAL ALARM 1 4 SECONDS 4 I 8 G77 PGI HANDLER Procedure STATUS 1 4 O 1 02404 15 C 3 C Supported Fortran Intrinsics Table of Supported Intrinsics QLOGIC FWGIISTI IT IINTIT lt Gr lt GFGFEPEERcRTOGSGIGIEIEIEXEXEFEGGGIWWIICII ee Table C 1 Fortran Intrinsics Supported in 3 0 Continued Intrinsic Name Result Arguments Families Remarks ALARM Subroutine SECONDS 1 4 1 8 G77 HANDLER Procedure STATUS 1 4 ALL ANSI PGI See Std TRADITIONAL ALLOCATED ANSI PGI See Std TRADITIONAL ALOG R 4 X R 4 R 8 ANSI G
4. Next, compile the program containing the module and look at the error that is generated:
      $ pathf95 hellow.f95
      MODULE HELLOW
      pathf95-855 pathf95: ERROR HELLOW, File = hellow.f95, Line = 1, Column = 8
        The compiler has detected errors in module "HELLOW". No module information file will be created for this module.
      SPRINTZ Hello World
      pathf95-724 pathf95: ERROR HELLO, File = hellow.f95, Line = 5, Column = 11
        Unknown statement. Expected assignment statement but found instead of "=" or "=>".
      pathf95: PathScale(TM) Fortran Version 2.1.99 14 Tue Nov 21, 2006 14:22:16
      pathf95: 9 source lines
      pathf95: 2 Error(s), 0 Warning(s), 0 Other message(s), 0 ANSI(s)
      pathf95: "explain pathf95-message number" gives more information about each message
   Note that the real error is pointed out after the first error on line 1 is reported.
   3.3 Extensions
   The PathScale Fortran compiler supports a number of extensions to the Fortran standard, which are described in this section.
   3.3.1 Promotion of REAL and INTEGER Types
   Section 5 has more information about porting code, but it is useful to mention the following option that you can use to help in porting your Fortran code:
      -r8 -i8  Respectively promotes the default representation for REA
5. Mixed Code
   Calls between C and Fortran
   Example: Calls between C and Fortran
   Example: Accessing Common Blocks from C
   Runtime Compatibility
   Performing Endian Conversions
   The assign Command
   Using the Wildcard Option
   Converting Data and Record Headers
   The ASSIGN Procedure
   Compilation Flags
   Reserved File Units
   Source Code Compatibility
   Fortran KINDs
   Library
   Name Mangling
   ABI Compatibility
   Linking with g77-compiled Libraries
   AMD Core Math Library (ACML)
   List Directed I/O and Repeat Factors
   Environment Variable
   assign Command
   Porting Fortran Code
   Debuggi
6. Module-related Error Messages
   Extensions
   Promotion of REAL and INTEGER Types
   Cray Pointers
   Directives
   F77 or F90 Prefetch Directives
   Changing Optimization Using Directives
   Compiler and Runtime
   Preprocessing Source Files with cpp
   Preprocessing Source Files with ftpp
   Support for Varying-length Character Strings
   Preprocessing Source Files with coco
   Pre-defined Macros
   Error Numbers: The explain Command
   Fortran 90 Dope Vector
   Bounds Checking
   Pseudo-random Numbers
7. Updated environment variable info for PSC_OMP_CPU_STRIDE and PSC_OMP_CPU_OFFSET (section 8.9.2)
   Corrected bookmarks in PDF document (PDF bookmarks)
   Update version number (all sections)
   Update copyright date (this page)
   Updated man page options based on checked-in changes: -OPT:malloc_algorithm, -f[no-]exceptions, -gnu (appendix E, eko(7))
   Added introductory info on processor affinity to OpenMP chapter (section 8.9.2)
   Added section on PSC_OMP_AFFINITY_INHERITANCE and PSC_OMP_AFFINITY (section 8.9.2)
   Added affinity glossary item (appendix F)
   © 2006, 2007 QLogic Corporation. All rights reserved worldwide. © PathScale 2004, 2005, 2006. All rights reserved.
   First Published: April 2004. Printed in U.S.A.
   QLogic Corporation, 26650 Aliso Viejo Parkway, Aliso Viejo, CA 92656, (800) 662-4471 or (949) 389-6000
   QLogic PathScale Compiler Suite User Guide, Version 3.0
   Table of Contents
   Section 1 Introduction
   Conventions Used in This Document
   Documentation Suite
   Section 2 Compiler Quick Reference
   What You Installed
8. OFF  Enable or disable writing the listing file. The default is ON if any -LIST: group options are enabled. By default, the listing file contains a list of options enabled.
   -LIST:all_options=(ON|OFF)  Enable or disable listing of most supported options. The default is OFF.
   -LIST:notes=(ON|OFF)  If an assembly listing is generated (for example, on -S), various parts of the compiler (such as software pipelining) generate comments within the listing that describe what they have done. Specifying OFF suppresses these comments. The default is ON.
   -LIST:options=(ON|OFF)  Enable or disable listing of the options modified (directly in the command line, or indirectly as a side effect of other options). The default is OFF.
   -LIST:symbols  Enable or disable listing of information about the symbols (variables) managed by the compiler.
   -LNO:  Specify options and transformations performed on loop nests by the Loop Nest Optimizer (LNO). The -LNO: options are enabled only if the optimization level of -O3 or higher is in effect. For information on the LNO options that are in effect during a compilation, use the -LIST:all_options=ON option.
   -LNO:apo_use_feedback=(ON|OFF)  Effective only when specified with -apo under feedback-directed compilation, this flag tells the auto-parallelizer whether to use the feedback data of the loops in deciding whether each loop should be parallelized. When the compiler parallelizes a loop, it generates both a
9. For example:
      real :: a(5) = 88.0
      write (*,*) a
      end
   This example generates the following output:
      5*88.
   This behavior conforms to the language standard. However, some users prefer to see multiple values instead of the repeat factor:
      88. 88. 88. 88. 88.
   There are two ways to accomplish this: using an environment variable, and using the assign command.
   3.8.4.1 Environment Variable
   If the environment variable FTN_SUPPRESS_REPEATS is set before the program starts executing, then list-directed write and print statements will output multiple values instead of using the repeat factor.
   To output multiple values when running within the bash shell:
      export FTN_SUPPRESS_REPEATS=yes
   To output multiple values when running within the csh shell:
      setenv FTN_SUPPRESS_REPEATS yes
   To output repeat factors when running within the bash shell:
      unset FTN_SUPPRESS_REPEATS
   To output repeat factors when running within the csh shell:
      unsetenv FTN_SUPPRESS_REPEATS
   3.8.4.2 assign Command
   Using the -y on option to the assign command will cause all list-directed output to the specified file names or unit numbers to output multiple values; using the -y off option will cause them to use repeat factors instead. For example, to output multiple values on logical unit 6 and on any logical unit which is associated with file cesc2559.out, type these commands before runn
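Because FTN_SUPPRESS_REPEATS has to be set before the Fortran program starts executing, a launcher process can set it as well as a shell. The following is a minimal C sketch of that idea, assuming a POSIX system. The helper name set_suppress_repeats is hypothetical, and the value "yes" simply mirrors the shell examples in this excerpt.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* expose setenv/unsetenv under strict -std modes */
#endif
#include <stdlib.h>              /* setenv, unsetenv (POSIX) */

/* Hypothetical helper: switch repeat-factor suppression on or off for any
 * program this process starts afterwards. Returns 0 on success, -1 on error. */
int set_suppress_repeats(int on)
{
    if (on) {
        /* Same effect as: export FTN_SUPPRESS_REPEATS=yes */
        return setenv("FTN_SUPPRESS_REPEATS", "yes", 1);
    }
    /* Same effect as: unset FTN_SUPPRESS_REPEATS */
    return unsetenv("FTN_SUPPRESS_REPEATS");
}
```

A launcher would call set_suppress_repeats(1) and then start the Fortran executable (for example with one of the exec functions), so that the child process inherits the variable.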
10. -LNO:vintr_verbose=ON prints verbose information to stdout on optimizing for vector intrinsic routines. Default is OFF. This flag will let you know which loops are vectorized to make use of vector intrinsic routines.
   Following are LNO Transformation Options. Loop transformation arguments allow control of cache blocking, loop unrolling, and loop interchange. They include the following options:
   -LNO:interchange=(ON|OFF)  Disable the loop interchange transformation in the loop nest optimizer. Default is ON.
   -LNO:unswitch=(ON|OFF)  Turn ON or OFF the optimization that performs a simple form of loop unswitching. The default is ON.
   -LNO:unswitch_verbose=(ON|OFF)  -LNO:unswitch_verbose=ON prints verbose info to stdout on unswitching loops. Default is OFF.
   -LNO:ou=N  This option indicates that all outer loops for which unrolling is legal should be unrolled by N, where N is a positive integer. The compiler unrolls loops by this amount or not at all.
   -LNO:ou_deep=(ON|OFF)  This option specifies that for loops with 3-deep (or deeper) loop nests, the compiler should outer unroll the wind-down loops that result from outer unrolling loops further out. This results in large code size, but generates faster code (whenever wind-down loop execution costs are important). Default is ON.
   -LNO:ou_further=N  This option specifies whether or not the compiler performs outer loop unrolling on wind-down loops. N must be specified and be an integer. Additiona
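To make the terms "outer unrolling" and "wind-down loop" in the options above concrete, here is a hand-written C sketch of a loop nest whose outer loop is unrolled by 2. It only illustrates the shape of the transformation; the function names are hypothetical and the code the compiler actually generates is not described in this excerpt.

```c
#include <stddef.h>

/* Reference version: y[i] += sum over j of a[i][j] * x[j], with a stored row-major. */
void matvec_plain(size_t n, size_t m, const double *a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < m; j++)
            y[i] += a[i * m + j] * x[j];
}

/* Same computation with the outer loop unrolled by 2 by hand. */
void matvec_ou2(size_t n, size_t m, const double *a, const double *x, double *y)
{
    size_t i = 0;

    /* Main loop: two rows per outer iteration, so each x[j] can serve both rows. */
    for (; i + 1 < n; i += 2) {
        for (size_t j = 0; j < m; j++) {
            y[i]     += a[i * m + j] * x[j];
            y[i + 1] += a[(i + 1) * m + j] * x[j];
        }
    }

    /* Wind-down loop: handles the leftover row when n is not a multiple of 2. */
    for (; i < n; i++)
        for (size_t j = 0; j < m; j++)
            y[i] += a[i * m + j] * x[j];
}
```

Each y[i] accumulates its terms in the same order in both versions, so the results match exactly. The wind-down loop is the extra code that ou_deep and ou_further refer to.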
11. -Wunused-parameter warns about unused function parameters. -Wno-unused-parameter tells the compiler not to warn about unused function parameters.
   -W[no-]unused-value  -Wunused-value warns about statements whose results are not used. -Wno-unused-value tells the compiler not to warn about statements whose results are not used.
   -W[no-]unused-variable  -Wunused-variable warns about local and static variables that are not used. -Wno-unused-variable tells the compiler not to warn about local and static variables that are not used.
   -W[no-]write-strings  -Wwrite-strings marks strings as const char *. -Wno-write-strings tells the compiler not to mark strings as const char *.
   -Wnonnull  For C/C++ only. Warn when passing null to functions requiring non-null pointers.
   -Wswitch-default  For C/C++ only. Warn when a switch statement has no default.
   -Wswitch-enum  For C/C++ only. Warn when a switch statement is missing a case for an enum member.
   -w  Suppress warning messages.
   -woff  Turn off named warnings.
   -woffall  Turn off all warnings.
   -woffoptions  Turn off warnings about options.
   -woffnum  Specify message numbers to suppress. Examples:
      Specifying -woff2026 suppresses message number 2026.
      Specifying -woff2026-2352 suppresses messages 2026 through 2352.
      Specifying -woff2026-2352,2400-2500 suppresses messages 2026 through 23
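As a rough illustration of what some of the warning options above look for, the following C sketch contains constructs that match their descriptions. The names are made up for the example, and which diagnostics actually appear depends on the flags passed to the compiler.

```c
/* A small enum and a switch that does not cover every member. */
enum color { RED, GREEN, BLUE };

/* Matches the descriptions of three options:
 *   -Wunused-parameter : 'verbose' is never used in the body.
 *   -Wswitch-enum      : the switch has no case for BLUE.
 *   -Wswitch-default   : the switch has no default label. */
int color_code(enum color c, int verbose)
{
    int code = -1;          /* value returned for members without a case */

    switch (c) {
    case RED:
        code = 1;
        break;
    case GREEN:
        code = 2;
        break;
    }
    return code;
}
```

The code is valid C and behaves the same with or without the warnings enabled; the options only control what the compiler reports.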
12. Section 6 Tuning Quick Reference
   This section provides some ideas for tuning your code's performance with the PathScale compiler. The following sections describe a small set of tuning options that are relatively easy to try and often give good results. These are tuning options that do not require Makefile changes or risk the correctness of your code results. More detail on these flags can be found in the next section and in the man pages. A comprehensive list of the options for the PathScale compiler can be found in the eko man page.
   6.1 Basic Optimization
   Here are some things to try first when optimizing your code. The basic optimization flag -O is equivalent to -O2. This is the first flag to think about using when tuning your code. Try -O2, then -O3, and then -O3 -OPT:Ofast. For more information on the -O flags and -OPT:Ofast, see section 7.1.
   6.2 IPA
   Inter-Procedural Analysis (IPA), invoked most simply with -ipa, is a compilation technique that analyzes an entire program. This allows the compiler to do optimizations without regard to which source file the code appears in. IPA can improve performance significantly. IPA can be used in combination with the other optimization flags: -O3 -ipa or -O2 -ipa will typically provide increased performance over the -O3 or -O2 flags alone. -ipa n
13. Testing Memory Latency and Bandwidth
   The pathopt2
   A Simple Example
   pathopt2 Usage
   Option Configuration File
   Testing Methodology
   Using an External Configuration File to Modify pathopt2.xml
   PSC_GENFLAGS Environment Variable
   Using Build and Test Scripts
   The NAS Parallel Benchmark Suite
   Set Up the Workarea
   Example 1: Run with Makefile
   Example 2: Use Build/Run Scripts and a Timing File
   Example 3: Using a Single Script with the rate file
   How Did the Compiler Optimize My Code?
   Using the -S Flag
   Using -CLIST or -FLIST
   Verbose Flags
   Using OpenMP and Autoparallelization
   OpenMP
   Autoparallelization
   Getting Started With OpenMP
   OpenMP Compiler Directives
   OpenMP Compi
14. - The C Programming Language by Brian W. Kernighan and Dennis M. Ritchie, Prentice Hall, 1988, 2nd edition, ISBN 0-13-110362-8
   - C: A Reference Manual by Samuel P. Harbison and Guy L. Steele, Prentice Hall, 5th Edition, 2002, ISBN 0-130-89592-X
   - C How to Program by H. M. Deitel and P. J. Deitel, Prentice Hall, Fourth Edition, 2004, ISBN 0-131-42644-3
   C++ Language
   - The C++ Standard Library: A Tutorial and Reference by Nicolai M. Josuttis, 1999, Addison-Wesley, ISBN 0-201-37926-0
   - Effective C++: 55 Specific Ways to Improve Your Programs and Designs by Scott Meyers, Addison-Wesley Professional, 2005, 3rd edition, ISBN 0-321-33487-6
   - More Effective C++: 35 New Ways to Improve Your Programs and Designs by Scott Meyers, Addison-Wesley Professional, 1995, ISBN 0-201-63371-X
   - Thinking in C++, Volume 1: Introduction to Standard C++ by Bruce Eckel, Prentice Hall, 2nd Edition, 2000, ISBN 0-139-79809-9. NOTE: There is a later version (2002) available online as a free download.
   - Thinking in C++, Vol. 2: Practical Programming by Bruce Eckel, Prentice Hall, Second Edition, 2003, ISBN 0-130-35313-2
   - C++ Inside & Out by Bruce Eckel, Osborne/McGraw-Hill, 1993, ISBN 0-07-881809-5
   - C++ How to Program by H. M. Deitel and P. J. Deitel, Prentice Hall, 2005, 5th edition, ISBN 0-131-85757-6
   Other Topics
   - Effective STL: 50 Specific Ways to Improve Your Use of the Standard Template Library by Scott Meyers, Addison-Wesley Professional, 20
15. is register being used to address local variables. This flag can be turned OFF for C++ for programs that do not throw exceptions.
   -TENV:X=(0..4)  Specify the level of enabled exceptions that will be assumed for purposes of performing speculative code motion (default is level 1 at all optimization levels). In general, an instruction will not be speculated (i.e., moved above a branch by the optimizer) unless any exceptions it might cause are disabled by this option.
      Level 0: No speculative code motion may be performed.
      Level 1: Safe speculative code motion may be performed, with IEEE-754 underflow and inexact exceptions disabled.
      Level 2: All IEEE-754 exceptions are disabled except divide by zero.
      Level 3: All IEEE-754 exceptions are disabled, including divide by zero.
      Level 4: Memory exceptions may be disabled or ignored.
   -TENV:simd_imask=(ON|OFF)  Default is ON. Turning it OFF unmasks SIMD floating-point invalid-operation exception.
   -TENV:simd_dmask=(ON|OFF)  Default is ON. Turning it OFF unmasks SIMD floating-point denormalized-operand exception.
   -TENV:simd_zmask=(ON|OFF)  Default is ON. Turning it OFF unmasks SIMD floating-point zero-divide exception.
   -TENV:simd_omask=(ON|OFF)  Default is ON. Turning it OFF unmasks SIMD floating-point overflow exception.
   -TENV:simd_umask=(ON|OFF)  Default is ON. Turning it OFF unmasks SIMD floating-point underflow except
16. F Glossary
   pathcov  The version of gcov that PathScale supports with its compilers. Other versions of gcov may not work with code generated by the PathScale Compiler Suite, and are not supported by PathScale.
   pathprof  The version of gprof that PathScale supports with its compilers. Other versions of gprof may not work with code generated by the PathScale Compiler Suite, and are not supported by PathScale.
   peak  Set of optional flags used with compiler in SPEC runs to optimize performance.
   SIMD  Single Instruction Multiple Data. An i386/AMD64 instruction set extension which allows the CPU to operate on multiple pieces of data contained in a single, wide register. These extensions were in three parts, named SSE and SSE2.
   SMP  Symmetric multiprocessing is a tightly coupled, share-everything system in which multiple processors working under a single operating system access each other's memory over a common bus or interconnect path.
   source file  A software program, usually made up of several text files, written in a programming language, that can be converted into machine-readable code through the use of a compiler.
   SPEC  Standard Performance Evaluation Corporation. SPEC provides a standardized suite of source code, based upon existing applications, that has already been ported to a wide variety of platforms by its membership. The benchmarker takes this source code, compiles it for the system in questi
17. name
   FLUSH           !$OMP flush [(list)]
   MASTER          !$OMP master
                   structured block
                   !$OMP end master
   ORDERED         !$OMP ordered
                   structured block
                   !$OMP end ordered
   Data environments: Control the data environment during the execution of parallel constructs.
   THREADPRIVATE   !$OMP threadprivate (/c1/, /c2/)
   WORKSHARE       !$OMP workshare
   8.5 OpenMP Compiler Directives
   The OpenMP directives for C and C++ all start with #pragma omp. They are only processed by the compiler if -mp is specified. Some of the OpenMP directives also support additional clauses. The following table lists the C and C++ compiler directives provided by version 2.0 of the OpenMP C/C++ Application Program Interface.
   Table 8-2. C/C++ Compiler Directives
   Directive | Clauses | Example
   Parallel region construct: Defines a parallel region.
   PARALLEL | PRIVATE, SHARED, FIRSTPRIVATE, DEFAULT (SHARED | NONE), REDUCTION, COPYIN, IF, NUM_THREADS | #pragma omp parallel [clause] structured-block
   Work-sharing constructs: Divide the execution of the enclosed block of code among the members of the team that encounter it
18. 1 4 1 8 TRADITIONAL KIBITS I 8 1 8 PGI E POS 1 1 1 2 1 4 1 8 TRADITIONAL LEN I 1 I 2 1 4 I 8 KIBSET 1 8 1 8 PGI E POS 1 1 1 2 1 4 1 8 TRADITIONAL KIDIM 1 8 X r8 PGI E Y 8 TRADITIONAL KIDINT 1 8 A R 8 TRADITIONAL E KIEOR I 8 1 8 TRADITIONAL E J 8 KIFIX 1 8 A R 4 R 8 PGI E TRADITIONAL KILL 1 4 PID 1 4 G77 PGI SIG 1 4 TRADITIONAL KILL Subroutine PID 1 4 G77 O SIG 1 4 TRADITIONAL STATUS 1 4 KIND 1 4 X Any type ANSI PGI E TRADITIONAL KINT 1 8 A R 4 TRADITIONAL E 1 02404 15 C 27 C Supported Fortran Intrinsics Table of Supported Intrinsics QLOGIC FWGIISTI IT IINTIT lt Gr lt GFGFEPEERcRTOGSGIGIEIEIEXEXEFEGGGIWWIICII ee Table C 1 Fortran Intrinsics Supported in 3 0 Continued Intrinsic Name Result Arguments Families Remarks KIOR I 8 I I 8 PGI E J 8 TRADITIONAL KISHA 1 8 I 8 TRADITIONAL E SHIFT 1 1 1 2 1 4 1 8 KISHC 1 8 I 8 TRADITIONAL E SHIFT 1 1 1 2 1 4 1 8 KISHFT 1 8 I 8 PGI E SHIFT 1 1 1 2 1 4 TRADITIONAL 1 8 KISHL 1 8 I 8 TRADITIONAL E SHIFT 1 1 1 2 1 4 1 8 KISIGN 1 8 I 8 PGI E P B 1 8 TRADITIONAL KMOD 1 8 I 8 PGI E P I 8 TRADITIONAL KMVBITS Subroutine FROM I 8 TRADITIONAL E FROMPOS 1 1 1 2 1 4 1 8 LEN 1 1 1 2 1 4 I 8 TO I 8 TOPOS 1 1 1 2 1 4 1 8 KNINT 1 8 A R 4 R 8 PGI E P TRADITIONAL KNOT 1 8 I 8 PGI E TRADITIONAL
19. The Fortran runtime libraries are compiled with large file support. PathScale does not provide any runtime libraries for C or C++ that do I/O, so large file support is provided by the libraries in the Linux distribution being used.
   2.9 Memory Model Support
   The PathScale compilers currently support two memory models: small and medium. The default memory model on x86_64 systems, and the default for the compilers, is small (equivalent to GCC's -mcmodel=small). This means that offsets of code and data within binaries are represented as signed 32-bit quantities. In this model, all code in an executable must total less than 2GB, and all the data must also be less than 2GB. Note that by data, we mean the static and uninitialized static data (BSS) that are compiled into an executable, not data allocated dynamically on the stack or from the heap. Pointers are 64 bits, however, so dynamically allocated memory may exceed 2GB. Programs can be statically or dynamically linked.
   Additionally, the compilers support the medium memory model with the use of the option -mcmodel=medium on all of the compilation and link commands. This means that offsets of code within binaries are represented as signed 32-bit quantities. The offsets for data within the binaries are represented as signed 64-bit quantities. In this model, all code in an executable must come to less than 2GB
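The 2GB figure follows from the signed 32-bit offsets: the largest such offset is 2^31 - 1 bytes. The C sketch below turns the rule of thumb from this excerpt into two checks. The helper names are hypothetical, the sizes would have to come from a tool such as size, and the real limits enforced by the linker may be more subtle than this.

```c
#include <stdbool.h>
#include <stdint.h>

/* Largest value of a signed 32-bit offset: 2^31 - 1 bytes, just under 2GB. */
#define OFFSET32_LIMIT ((uint64_t)INT32_MAX)

/* Small model (hypothetical check): code and static data, including BSS,
 * must each stay below 2GB. Heap and stack allocations are not counted. */
bool fits_small_model(uint64_t code_bytes, uint64_t static_data_bytes)
{
    return code_bytes <= OFFSET32_LIMIT && static_data_bytes <= OFFSET32_LIMIT;
}

/* Medium model (hypothetical check): only the code size is limited by
 * 32-bit offsets; data offsets are 64-bit. */
bool fits_medium_model(uint64_t code_bytes)
{
    return code_bytes <= OFFSET32_LIMIT;
}
```

For example, a program with 1MB of code and a 3GB static array fails the small-model check but passes the medium-model one, which is the case -mcmodel=medium is described as covering.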
20. If your program has many nests of loops, you may want to try some of the Loop Nest Optimization group of flags. This group defines transformations and options that can be applied to loop nests.
   One of the nice features of the PathScale compilers is that their powerful Loop Nest Optimization feature is invoked by default at -O3. This feature can provide up to a 10-20x performance advantage over other compilers on certain matrix operations at -O3.
   In rare circumstances, this feature can make things slower, so you can use -LNO:opt=0 to disable nearly all loop nest optimization. Trying to make -O2 compile faster by adding -LNO:opt=0 will not work, because the LNO feature is only active with -O3 (or -Ofast, which implies -O3).
   Some of the features that one can control with the LNO group are:
   - Loop fusion and fission
   - Blocking to optimize cache line reuse
   - Cache management
   - TLB (Translation Lookaside Buffer) optimizations
   - Prefetch
   In this section we will highlight a few of the LNO options that have frequently been valuable.
   7.4.1 Loop Fusion and Fission
   Sometimes loop nests have too few instructions, and consecutive loops should be combined to improve utilization of CPU resources. Another name for this process is loop fusion. Sometimes a loop nest will have too many instructions, or deal with too many data items in its inner loop, leading to too much pressure on the registers, resulting in
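Loop fusion, as described above, can be pictured with a hand-written C sketch. The two functions below compute the same result; the second is what the first looks like after its two consecutive loops are combined. The function names are hypothetical, and the sketch assumes the three arrays do not overlap.

```c
#include <stddef.h>

/* Two consecutive loops over the same range, each doing very little work. */
void scale_then_add_separate(size_t n, const double *a, double *b, double *c)
{
    for (size_t i = 0; i < n; i++)
        b[i] = 2.0 * a[i];
    for (size_t i = 0; i < n; i++)
        c[i] = b[i] + 1.0;
}

/* Fused form: one loop does both statements, so b[i] is reused while it
 * is still at hand instead of being reloaded in a second pass. */
void scale_then_add_fused(size_t n, const double *a, double *b, double *c)
{
    for (size_t i = 0; i < n; i++) {
        b[i] = 2.0 * a[i];
        c[i] = b[i] + 1.0;
    }
}
```

Loop fission is the reverse step: splitting the fused loop back into the two separate loops when a single body holds too much work for the available registers.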
21. CHIP 3
      CPU0   CPU1   CPU2   CPU3   CPU4   CPU5   CPU6   CPU7
      J0 T0  J0 T1  J0 T2  J0 T3  J1 T0  J1 T1  J1 T2  J1 T3
   If PSC_OMP_CPU_STRIDE is set to 2 for both jobs and PSC_OMP_CPU_OFFSET is set to 1 for job 1 only, then the scheduling will be:
      CHIP 0        CHIP 1        CHIP 2        CHIP 3
      CPU0   CPU1   CPU2   CPU3   CPU4   CPU5   CPU6   CPU7
      J0 T0  J1 T0  J0 T1  J1 T1  J0 T2  J1 T2  J0 T3  J1 T3
   PSC_OMP_GUARD_SIZE (Integer value)
   This environment variable specifies the size in bytes of a guard area that is placed below pthread stacks. This guard area is in addition to any guard pages created by your O/S. It is often useful to have a larger guard area to catch pthread stack overflows, particularly for Fortran OpenMP programs. By default, the guard area size is 0 for 32-bit programs (disabling the mechanism) and 32MB for 64-bit programs (since virtual memory is typically bountiful in 64-bit environments).
   The PSC_OMP_GUARD_SIZE environment variable can be used to override the default value. Its format is a decimal number followed by an optional k, m, or g (in lower or uppercase) to denote kilobytes, megabytes, or gigabytes. If the size is 0, then the guard is not created. The guard area consumes no physical memory, but does consume virtual memory and will show up in the VIRT or SIZE figure of a top command.
   PSC_OMP_GUIDED_CHUNK_DIVISOR (Integer value)
   The value of PSC_OMP_G
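The second placement above (a stride of 2 for both jobs and an offset of 1 for job 1) is consistent with a simple linear rule: thread t goes to CPU offset + stride * t. The C sketch below encodes only that rule. It is inferred from the example, the function name is hypothetical, and it makes no claim about what the runtime does in other cases, such as when the result exceeds the number of CPUs.

```c
/* Hypothetical model of the example placement: thread 'thread' of a job
 * with the given stride and offset is placed on CPU  offset + stride * thread. */
int cpu_for_thread(int thread, int stride, int offset)
{
    return offset + stride * thread;
}
```

With stride 2, job 0 (offset 0) lands on CPUs 0, 2, 4, 6 and job 1 (offset 1) on CPUs 1, 3, 5, 7, which matches the interleaved row in the example.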
22. Environment Variables for OpenMP
   These environment variables are described in detail in section 8. They are listed here for your reference.
   A.5.1 Standard OpenMP Runtime Environment Variables
   These environment variables can be used with OpenMP in either Fortran or C and C++.
   OMP_DYNAMIC  Enables or disables dynamic adjustment of the number of threads available for execution. Default is FALSE, since this mechanism is not supported.
   OMP_NESTED  Enables or disables nested parallelism. Default is FALSE.
   OMP_SCHEDULE  This environment variable only applies to DO and PARALLEL DO directives that have schedule type RUNTIME. Type can be STATIC, DYNAMIC, or GUIDED. Default is STATIC, with no chunk size specified.
   OMP_NUM_THREADS  Set the number of threads to use during execution. Default is number of CPUs in the machine.
   A.5.2 PathScale OpenMP Environment Variables
   These environment variables can be used with OpenMP in Fortran and C and C++, except as indicated.
   PSC_OMP_AFFINITY  When TRUE, the operating system's affinity mechanism (where available) is used to assign threads to CPUs; otherwise no affinity assignments are made. The default value is TRUE.
   PSC_OMP_AFFINITY_GLOBAL  This environment variable controls whether thread global ID or local ID values are used when assigning threads to
23. FOR | PRIVATE, FIRSTPRIVATE, LASTPRIVATE, REDUCTION, SCHEDULE (static, dynamic, guided, runtime), ORDERED, NOWAIT | #pragma omp for [clause] for-loop
   SECTIONS | PRIVATE, FIRSTPRIVATE, LASTPRIVATE, REDUCTION, NOWAIT | #pragma omp sections [clause] structured-block
   Table 8-2. C/C++ Compiler Directives (Continued)
   Directive | Clauses | Example
   SINGLE | PRIVATE, FIRSTPRIVATE, COPYPRIVATE, NOWAIT | #pragma omp single [clause] structured-block
   Combined parallel work-sharing constructs: Shortcut for denoting a parallel region that contains only one work-sharing construct.
   PARALLEL FOR | #pragma omp parallel for structured-block
   PARALLEL SECTIONS | #pragma omp parallel sections structured-block
   Synchronization constructs: Provide various aspects of synchronization; for example, access to a block of code, or execution order of statements within a block of code.
   ATOMIC | #pragma omp atom
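As an illustration of the combined PARALLEL FOR construct with a REDUCTION clause from the table above, here is a minimal C sketch. The function name is hypothetical. The guide states that the directives are only processed when -mp is specified; when OpenMP is not enabled, the pragma is ignored and the loop runs serially with the same result.

```c
/* Sum the integers 0 .. n-1. */
long sum_range(long n)
{
    long total = 0;

    /* With OpenMP enabled, the iterations are divided among the threads of
     * the team and each thread's partial sum is combined into 'total' by
     * the reduction clause. Without OpenMP, this is an ordinary loop. */
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < n; i++)
        total += i;

    return total;
}
```

The reduction clause matters here: without it, all threads would update the shared variable total at the same time, which is a data race.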
24. P 1 2 TRADITIONAL ILEN Depends onarg I 1 TRADITIONAL ILEN Depends arg I 1 2 TRADITIONAL ILEN Depends arg 1 4 TRADITIONAL ILEN Depends arg 1 8 TRADITIONAL 1 02404 15 C 23 C Supported Fortran Intrinsics Table of Supported Intrinsics QLOGIC P V hbbbbjbb L X Table C 1 Fortran Intrinsics Supported in 3 0 Continued Intrinsic Name Result Arguments Families Remarks IMAG Z Z 8 Z 16 G77 E TRADITIONAL IMAGPART Z Z 8 Z 16 G77 E IMOD I 2 A 1 2 PGI E P 2 TRADITIONAL IMVBITS Subroutine FROM I 2 TRADITIONAL E FROMPOS 1 1 2 1 4 1 8 LEN I 1 I 2 1 4 I 8 TO 1 2 TOPOS 1 1 1 2 1 4 1 8 1 4 STRING C ANSI G77 E P SUBSTRING C PGI L 8 ININT 1 2 A R 4 R 8 PGI E P TRADITIONAL INOT I 2 I 1 2 PGI E TRADITIONAL INT 1 4 1 1 1 2 1 4 1 8 ANSI G77 E R 4 R 8 Z 8 Z 16 PGI KIND 1 1 I 2 I 4 TRADITIONAL 1 8 INT2 1 2 A 1 1 1 2 1 4 1 8 G77 E R 4 R 8 2 8 Z 16 TRADITIONAL INT4 1 4 A 1 1 1 2 1 4 1 8 TRADITIONAL E R 4 R 8 Z 8 Z 16 INT8 1 8 1 1 2 1 4 1 8 G77 PGI E R 4 R 8 2 8 Z 16 TRADITIONAL INT MULT I 1 8 E UPPER J 8 INT_MULT_ I E UPPER J IOR 1 4 I 1 1 1 2 1 4 1 8 ANSI G77 E J 1 1 1 2 1 4 1 8 PGI TRADITIONAL IRAND 1 4 FLAG 1 4 G77 PGI O IRTC 1 8 TRADITIONAL 1 02404 15 XX C Supported
25. PGI O C C STATUS I 4 FGETC Subroutine UNIT 1 4 1 8 G77 O C C STATUS I 4 FLOAT R 4 1 1 I 2 I 4 I 8 ANSI G77 E PGI TRADITIONAL FLOATI R 4 2 PGI E TRADITIONAL FLOATJ R 4 4 PGI E TRADITIONAL FLOATK R 4 A I 8 PGI E TRADITIONAL FLOOR A R 4 R 8 ANSI PGI E KIND 1 1 1 2 1 4 TRADITIONAL 1 8 1 02404 15 C 17 C Supported Fortran Intrinsics Table of Supported Intrinsics QLOGIC e Table C 1 Fortran Intrinsics Supported in 3 0 Continued Intrinsic Name Result Arguments Families Remarks FLUSH Subroutine UNIT 1 4 1 8 G77 PGI O STATUS 1 4 O FNUM 1 4 UNIT 1 4 G77 TRADITIONAL FPUT 1 4 C C G77 O STATUS 1 4 FPUT Subroutine C C G77 O STATUS 1 4 FPUTC 1 4 UNIT 1 4 1 8 G77 PGI O C C STATUS 1 4 FPUTC Subroutine UNIT 1 4 1 8 G77 O C C STATUS 1 4 FP_CLASS Depends arg X R 4 TRADITIONAL E FP_CLASS Depends arg X R 4 TRADITIONAL E FP_CLASS Depends on arg X R 8 TRADITIONAL E FP CLASS Depends onarg X R 8 TRADITIONAL E FRACTION X R 4 R 8 ANSI PGI E TRADITIONAL FREE Subroutine 1 1 1 2 1 4 1 8 PGI CrayPtr TRADITIONAL FSEEK 1 4 UNIT 1 4 G77 PGI OFFSET 1 4 WHENCE 1 4 FSEEK Subroutine UNIT 1 4 G77 OFFSET 1 4 WHENCE 1 4 FSEEK Subroutine UNIT 1 4 G77 OFFSET 1 8 WHENCE 1 4 FSTAT 1 4 UNIT 1 1 1 2 1 4 1 8 G77 PGI SARRAY 1 1 1 2 TRADITIONAL 1 4 1 8 Array rank 1 STATUS 1
26. -W[no-]missing-declarations  For C/C++ only. -Wmissing-declarations warns about global funcs without previous declarations. -Wno-missing-declarations tells the compiler not to warn about global funcs without previous declarations.
   -W[no-]missing-format-attribute  For C/C++ only. For the -Wmissing-format-attribute option, if -Wformat is used, warn on candidates for format attributes. For -Wno-missing-format-attribute, do not warn on candidates for format attributes.
   -W[no-]missing-noreturn  For C/C++ only. -Wmissing-noreturn warns about functions that are candidates for the noreturn attribute. -Wno-missing-noreturn tells the compiler not to warn about functions that are candidates for the noreturn attribute.
   -W[no-]missing-prototypes  For C/C++ only. -Wmissing-prototypes warns about global funcs without prototypes. -Wno-missing-prototypes tells the compiler not to warn about global funcs without prototypes.
   -W[no-]multichar  For C/C++ only. -Wmultichar warns if a multi-character constant is used. -Wno-multichar tells the compiler not to warn if a multi-character constant is used.
   -W[no-]nested-externs  For C/C++ only. -Wnested-externs warns about externs not at file scope level. -Wno-nested-externs tells the compiler not to warn about externs not at file scope level.
   -Wno-non-template-friend  For C++ only. Do not warn about friend functions declared in tem
27. Table C-1. Fortran Intrinsics Supported in 3.0 (Continued)

Intrinsic Name | Result | Arguments | Families | Remarks
OR | | I: I*1, I*2, I*4, I*8, R*4, R*8, CrayPtr, L*1, L*2, L*4, L*8; J: I*1, I*2, I*4, I*8, R*4, R*8, CrayPtr, L*1, L*2, L*4, L*8 | G77, PGI, TRADITIONAL | E
OR_AND_FETCH | | I: I*4; J: I*4 | TRADITIONAL | E
OR_AND_FETCH | | I: I*8; J: I*8 | TRADITIONAL | E
PACK | | | ANSI, PGI, TRADITIONAL | See Std
PERROR | Subroutine | STRING: C | G77, PGI |
POPCNT | | I: I*1, I*2, I*4, I*8, R*4, R*8, CrayPtr, L*1, L*2, L*4, L*8 | TRADITIONAL | E
POPPAR | | I: I*1, I*2, I*4, I*8, R*4, R*8, CrayPtr, L*1, L*2, L*4, L*8 | TRADITIONAL | E
PRECISION | | X: R*4, R*8, Z*8, Z*16 | ANSI, PGI, TRADITIONAL | E
PRESENT | | A: Procedure, any type | ANSI, PGI, TRADITIONAL | E
PRESENT | | A: Any type | ANSI, PGI, TRADITIONAL | E
PRODUCT | | | ANSI, PGI, TRADITIONAL | See Std
RADIX | | X: I*1, I*2, I*4, I*8, R*4, R*8 | ANSI, PGI, TRADITIONAL | E
RAND | R*8 | FLAG: I*4 | G77, PGI | O
RANDOM_NUMBER | Subroutine | HARVEST: R*4, R*8 | ANSI, PGI, TRADITIONAL | E
RANDOM_SEED | Subroutine | SIZE: I*1, I*2, I*4, I*8; PUT: I*1, I*2, I*4, I*8 (array rank 1); GET: I*1, I*2, I*4 | ANSI, PGI, TRADITIONAL | O
28. ignore pragmas 3 6 opt 7 2 march anyx86 2 5 mcmodel medium 2 9 10 2 mcmodel small 2 9 mcpu 5 2 mp 8 2 8 3 8 6 O 3 2 7 1 O0 3 2 7 1 1 02404 13 Index O1 3 2 7 1 O2 3 2 7 1 9 1 3 2 9 1 Ofast 4 2 7 12 7 14 OPT alias 7 19 OPT early_mp 8 28 OPT fast_math 7 21 OPT IEEE_arithmetic 7 20 7 21 OPT Ofast 6 1 6 3 7 1 OPT reorg_common OFF 10 2 OPT wrap_around_unsafe_opt OFF 10 3 p 9 1 pg 2 11 r8 3 5 S 7 29 7 42 trapuv 10 1 V 2 2 Wuninitialized 10 1 zerouv 10 1 F 2 6 3 1 3 9 3 10 f 2 6 3 9 F90 2 6 3 1 3 9 3 10 f90 2 6 3 9 F95 2 6 3 1 3 9 3 10 f95 2 6 3 9 files 7 3 A ACML 10 3 Alias analysis 7 19 aliasing 3 26 Aliasing rule Fortran 3 27 AMD Core Math Library ACML 3 23 AMD64 2 1 ANSI 3 1 5 2 7 20 Application Binary Interface ABI 3 22 apropos pathscale E 1 asm 10 2 assign or ASSIGN 3 18 Index 1 QLogic PathScale Compiler Suite User Guide 3 0 Beta 1 XX QLOGIC EGENEEEEEEEEMUO Autoparallelization 8 1 8 2 B Big endian format 3 18 BIOS settings for OpenMP 8 28 setup 7 24 BLAS 3 22 3 23 Bounds checking 3 12 BSS 2 9 C Cache blocking 7 16 Call graph 7 4 Call graph profile 9 3 Calls between C and Fortran 3 13 CMOVE 7 23 Code generation 7 17 Code tuning example 9 1 COMMON block 10 2 Compilation unit 7 3 Compiler C 4 1 C 4 1 invoking the 2 1 options common 2 8 quick reference 2
29. is OFF.

-clist
(C only) Enable the C listing. Specifying -clist is the equivalent of specifying -CLIST:=ON.

-CLIST:...
(C only) Control emission of the compiler's internal program representation back into C code, after IPA inlining and loop nest transformations. This is a diagnostic tool, and the generated C code may not always be compilable. The generated C code is written to two files: a header file containing file-scope declarations, and a file containing function definitions. With the exception of -CLIST:=OFF, any use of this option implies -clist. The individual controls in this group are as follows:

=(ON|OFF)
Enable the C listing. This option is implied by any of the others, but may be used to enable the listing when no other options are required. For example, specifying -CLIST:=ON is the equivalent of specifying -clist.

dotc_file=filename
Write the program units into the specified file, filename. The default source file name has the extension .w2c.c.

doth_file=filename
Specify the file into which file-scope declarations are deposited. Defaults to the source file name with the extension .w2c.h.

emit_pfetch[=(ON|OFF)]
Display prefetch information as comments in the transformed source. If ON or OFF is not specified, the default is OFF.

linelength=N
Set the maximum line length to N characters. The default is unlimited.

show[=(ON|OFF)]
Print the input and
30. not a variable IVAR with the type integer and the value 5, while an option like -DLVAR declares a constant LVAR with the type logical and the value .true. Only integer and logical constants are allowed. You can use the -D option to override the value of a constant declaration for that identifier which might appear in the source file.

The standard requires that the preprocessor read a setfile capable of defining constants, variables, and modes of operation, but it does not specify how to find the setfile. If you use -coco, the preprocessor looks for coco.set in the current directory. If no such file exists, the preprocessor quietly proceeds without it. If you use an option like -coco=somedir/mysettings, the preprocessor looks for the file somedir/mysettings. You cannot use the -D option to override a constant declaration which appears in the setfile.

The open source package on which this feature is based does provide additional extensions and command-line options, described at http://users.erols.com/dnagle/coco.html. To pass those options through the compiler driver to the preprocessor, you can use the -Wp,options flag. For example, you can use -Wp,-m to pass the -m option to the preprocessor to turn off macro preprocessing. Note that the instructions given in that web page for passi
31. <option>-ipa</option> <option>-OPT:Ofast</option> </choose> </append>

Table 7-5. Tags for Option Configuration File (Continued)

<define name="name">
Defines a block of options that can be later included using the <source from="name"> tag. Note that this block can include any number of <option>, <choose>, or <append> tags.

<source from="name">
Includes a block of options previously defined with <define>.

<bestof k="k" context="context" options="options">
Choose the best option in the list referenced by run time, and chosen in the context of the option listed in the context tag. The k option is used as described for the <choose> tag.

<context> <option>...</option> </context> (inside <bestof>)
Specifies an option to use as a basis for testing, but not to propagate to outside tags.

<!-- comment -->
Standard XML comment tag; ignored by the parser.

NOTE: All tags other than <source> require an end tag, e.g. <append> requires a corresponding </append>.

7.9.4 Testing Methodology

Typically, the execute target try5 in pathopt2.xml is used first with the pathopt2 command. After the results of the run are available, you can look for the fastest result of the 5 options and then run pathopt2 again with a new execute target
32. option relating to code optimization.

A utility used to determine if a test suite exercises all code paths in a program.

Inter-Procedural Analysis. A sophisticated compiler technique in which multiple functions and subroutines are optimized together.

Intermediate Representation.

A step in compilation where code is linked in an intermediate representation so that inter-procedural analysis and optimization can take place.

A utility program that links a compiled or assembled program to a particular environment. Also known as a link editor, the linker unites references between program modules and libraries of subroutines. Its output is a load module, which is executable code ready to run in the computer.

Loop nest optimizer. Performs transformations on a loop nest; improves data cache performance; improves optimization opportunities in later phases of compiling; vectorizes loops by calling vector intrinsics; parallelizes loops; computes data dependency information for use by the code generator; can generate a listing of transformed code in source form.

Multiprocessor.

Non-uniform memory access is a method of configuring a cluster of microprocessors in a multiprocessing system so that they can share memory locally, improving performance and the ability of the system to be expanded. NUMA is used in a symmetric multiprocessing (SMP) system.

The intermediate representation of code generated by a compiler after it processes a source file.
33. or memset. Default is ON.

-OPT:treeheight=(ON|OFF)
The value ON enables re-association in expressions to reduce the expression's tree height. The default is OFF.

-OPT:unroll_analysis=(ON|OFF)
The default value of ON lets the compiler analyze the content of the loop to determine the best unrolling parameters, instead of strictly adhering to the -OPT:unroll_times_max and -OPT:unroll_size parameters. -OPT:unroll_analysis=ON can have the negative effect of unrolling loops less than the upper limit dictated by the -OPT:unroll_times_max and -OPT:unroll_size specifications.

-OPT:unroll_times_max=N
Unroll inner loops by a maximum of N. The default is 4.

-OPT:unroll_size=N
Set the ceiling of the maximum number of instructions for an unrolled inner loop. If N=0, the ceiling is disregarded. The default is 40.

-OPT:wrap_around_unsafe_opt=(ON|OFF)
-OPT:wrap_around_unsafe_opt=OFF disables both the induction variable replacement and linear function test replacement optimizations. By default these optimizations are enabled at -O3. This option is disabled by default at -O0. Setting -OPT:wrap_around_unsafe_opt to OFF can degrade performance. It is provided as a diagnostic tool.

When used with -E, the source preprocessor will not generate lines in the output.

-pad_char_literals
(For Fortran only) Blank pad all character literal constants that are shorter than the size of the defa
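To make the -OPT:unroll_times_max description in this excerpt more concrete, the following hand-written C sketch shows what unrolling an inner loop by a factor of 4 (the documented default maximum) looks like at source level. The function names sum_rolled and sum_unrolled4 are hypothetical, and the code the compiler actually generates will differ; this only illustrates the shape of the transformation.

```c
#include <stddef.h>

/* Original inner loop: one addition per iteration. */
long sum_rolled(const int *a, size_t n)
{
    long s = 0;
    size_t i;
    for (i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* The same loop unrolled by 4: the main loop handles four elements per
   trip, and a remainder loop handles the last n % 4 elements. */
long sum_unrolled4(const int *a, size_t n)
{
    long s = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        s += (long)a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    for (; i < n; i++)
        s += a[i];
    return s;
}
```

Both functions return the same sum; the unrolled form trades a larger loop body (which is what the -OPT:unroll_size ceiling limits) for fewer loop-control instructions.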
34. status to the value that the function would return. Between the opening and closing of a file, you should use either stream intrinsics (fget, fgetc, fput, fputc, fseek, and ftell) or standard Fortran I/O, but not both.

fseek
Fortran interface to the C library function fseek, which treats logical unit unit as a stream of bytes, and changes to offset the position pointer used by the next stream intrinsic which reads or writes the file. If whence is 0, offset counts bytes from the beginning of the file; if whence is 1, offset positions the pointer relative to the current position; and if whence is 2, offset positions the pointer relative to the end of the file. The function form returns 0 on success or an error code from the C library value errno. Between the opening and closing of a file, you should use either stream intrinsics (fget, fgetc, fput, fputc, fseek, and ftell) or standard Fortran I/O, but not both.

fstat
Fortran interface to the C library function fstat. Stores in sarray information about the file opened on logical unit unit. The function form returns 0 on success or an error code from the C library variable errno. The subroutine form sets status to the value which the function would return. sarray must have thirteen elements:
* ID of device containing file
* Inode number
* File mode
* Number of links
* UID of owner
* GID of own
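The three whence values described for fseek in this excerpt mirror the underlying C library call. As a minimal sketch, assuming a Linux/glibc system where SEEK_SET, SEEK_CUR, and SEEK_END have the values 0, 1, and 2, the C function below shows how each mode moves the position pointer. The helper name position_after_seek and the scratch file contents are hypothetical and chosen only for illustration.

```c
#include <stdio.h>

/* Write ten bytes to a scratch file, move to byte 4, then apply one more
   fseek(offset, whence) and report the resulting position via ftell.
   Returns -1 if any C library call fails. */
long position_after_seek(long offset, int whence)
{
    FILE *fp = tmpfile();
    long pos = -1;

    if (fp == NULL)
        return -1;
    if (fputs("0123456789", fp) != EOF &&
        fseek(fp, 4L, SEEK_SET) == 0 &&
        fseek(fp, offset, whence) == 0)
        pos = ftell(fp);
    fclose(fp);
    return pos;
}
```

Starting from byte 4 of the ten-byte file, an offset of 3 with SEEK_SET lands on byte 3, an offset of 3 with SEEK_CUR lands on byte 7, and an offset of -2 with SEEK_END lands on byte 8.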
35. subroutine fl c i 18 f d 1 implicit none intrinsic flush character c integer i integer 8 18 real f doubleprecision d logical 1 wrilite o 3a 2175 2f5 l T18 j 08 UB UE cda 1 call flush 6 Flush output before switching languages end subroutine f1 And here is the third file Gecorate txt getlogin nounderscore getlogin 3 16 1 02404 15 XX 3 The PathScale Fortran Compiler QLOGIC Mixed Code ls Compile and execute these three files c_part c f part f90 and decorate txt like this pathf90 Wall intrinsic flush fdecorate decorate txt f part f90 c part c a out d129 8 1 7 6 i1 5 i2 4 11 20 12 1 c1 len 5 c2 len 4 len 26 ci hello c2 2 from c3 f part hello from call fortran 123 456 7 8 9 1 T d 9 8 f 7 6 i 5 i8 4 t m johndoe 3 5 1 2 Example Accessing Common Blocks from C Variables in Fortran 90 modules are grouped into common blocks one for initialized data and another for uninitialized data It is possible to use Edecorate to access these common blocks from C as shown in this example cat mymodule 90 module mymodule public integer modulevarl doubleprecision modulevar2 integer modulevar3 44 doubleprecision modulevar4 end module mymodule program myprogram use mymodule modulevarl 22 modulevar2 33 3 call mycfunction end program myprogram cat mycprogram c include lt stdio h gt extern str
36. to stdout. -LNO:vintr_verbose=ON prints information about whether or not the math intrinsic functions were vectorized. See the eko man page for more information.

7.5 Code Generation (-CG:)

The code generation group governs some aspects of instruction-level code generation that can have benefits for code tuning.

-CG:gcm=OFF turns off the instruction-level global code motion optimization phase. The default is ON.

-CG:load_exe=n specifies the threshold for subsuming a memory load operation into the operand of an arithmetic instruction. The value of 0 turns off this subsumption optimization. By default this subsumption is performed only when the result of the load has only one use (n=1). This subsumption is not performed if the number of times the result of the load is used exceeds the value n, a non-negative integer. We have found that load_exe=2 or 0 are occasionally profitable. The default for 64-bit ABI and Fortran is n=2; otherwise the default is n=1.

-CG:use_prefetchnta=ON means for the compiler to use the prefetch operation that assumes that data is Non-Temporal at All (NTA) levels of the cache hierarchy. This is for data streaming situations in which the data will not need to be re-used soon. Default is OFF.

7.6 Feedback Directed Optimization (FDO)

Feedback directed optimization uses a special instrumented executable to colle
37. 1 2 4 2 8 j8 3 5 INLINE 7 7 INLINE aggressive 7 9 INLINE list 7 8 INLINE must 7 8 Index 5 QLogic PathScale Compiler Suite User Guide 3 0 Beta 1 XX QLOGIC INLINE never 7 8 INLINE none 7 8 intrinsic 3 17 5 2 C 1 IPA 7 12 ipa 3 2 4 2 6 1 7 8 10 3 IPA addressing 7 10 IPA alias 7 10 IPA callee_limit 7 8 IPA cgi 7 10 IPA common_pad_size 7 9 IPA cprop 7 10 IPA ctype 7 10 IPA dfe 7 10 IPA dve 7 10 IPA field reorder 7 10 IPA forcedepth 7 9 IPA inline 7 8 IPA linear 7 9 IPA max_jobs 7 13 IPA maxdepth 7 9 IPA min hotness 7 9 IPA multi_clone 7 9 IPA node_bloat 7 9 IPA plimit 7 8 IPA pu_reorder 7 10 IPA small_pu 7 8 IPA space 7 8 IPA split 7 10 keep 4 4 L 2 4 2 8 LANG formal deref unsafe 3 26 LIST options 7 16 Im 2 8 4 6 LNO 3 8 4 2 LNO assoc1 n assoc2 n assoc3 n assoc4 n 7 16 LNO blocking 7 16 LNO blocking_size 7 16 LNO cs1 n cs2 n cs3 n cs4 n 7 15 LNO fission 7 15 LNO fusion 7 15 LNO fusion_peeling_limit 7 15 LNO ignore_pragmas 3 6 LNO interchange 7 16 Index 6 LNO opt 7 2 LNO ou_prod_max 7 16 LNO outer_unroll ou n 7 16 LNO outer unroll max ou max 7 16 LNO parallel overhead 8 3 LNO prefetch 7 2 7 16 LNO prefetch ahead 7 16 LNO simd 7 17 LNO simd verbose 7 17 7 43 LNO vintr 7 17 LNO vintr verbose 7 43 Istdc 4 6 m32 2 4 5 2 m3dnow 2 4 m64 2 4
38. (row continued) I*1, I*2, I*4, I*8

Table C-1. Fortran Intrinsics Supported in 3.0 (Continued)

Intrinsic Name | Result | Arguments | Families | Remarks
FSTAT | Subroutine | UNIT: I*1, I*2, I*4, I*8; SARRAY: I*1, I*2, I*4, I*8 (array rank 1); STATUS: I*1, I*2, I*4, I*8 | G77 | O
FTELL | I*8 | UNIT: I*4 | G77, PGI |
FTELL | I*8 | UNIT: I*8 | G77, PGI |
FTELL | Subroutine | UNIT: I*4; OFFSET: I*4 | G77 |
FTELL | Subroutine | UNIT: I*4; OFFSET: I*8 | G77 |
FTELL | Subroutine | UNIT: I*8; OFFSET: I*8 | G77 |
GERROR | Subroutine | MESSAGE: C | G77, PGI |
GETARG | Subroutine | POS: I*4; VALUE: C | G77, PGI |
GETCWD | I*4 | NAME: C; STATUS: I*4 | G77, PGI | O
GETCWD | Subroutine | NAME: C; STATUS: I*4 | G77 | O
GETENV | Subroutine | NAME: C; VALUE: C | G77, PGI |
GETGID | I*4 | | G77, PGI |
GETLOG | Subroutine | LOGIN: C | G77, PGI |
GETPID | I*4 | | G77, PGI |
GETUID | I*4 | | G77, PGI |
GETPOS | | I: I*1, I*2, I*4, I*8 | TRADITIONAL | E
GET_COMMAND | Subroutine | COMMAND: C; LENGTH: I*4; STATUS: I*4 | ANSI, TRADITIONAL | O
GET_COMMAND_ARGUMENT | Subroutine | NUMBER: I*4; VALUE: C; LENGTH: I*4; STATUS: I*4 | ANSI, TRADITIONAL |
GET | Subroutine | NA
39. 2 7 3 2 1 7 3 3 7 3 4 7 3 4 1 7 3 5 7 3 6 7 3 6 1 7 3 7 7 3 8 7 3 9 7 4 7 4 1 7 4 2 7 4 3 7 4 4 7 4 5 7 5 7 6 7 7 7 7 1 7 7 2 7 7 3 7 7 4 7 7 4 1 7 7 4 2 7 7 5 7 7 6 7 8 7 8 1 7 8 2 7 8 3 7 8 4 7 8 5 Page viii Tuning Options Basic Optimizations The 7 1 Syntax for Complex Optimizations CG IPA LNO OPT WOPT 7 2 Inter Procedural Analysis 7 3 The IPA Compilation Model 7 3 Inter procedural Analysis and Optimization 7 4 PINAY SIS Ache MC EE CETT 7 4 Optimization eh Re E e E UR OE CO s 7 5 Controlling IPA Zac tri Sa yaaa au sq aa wa EROR REP c Ane e bos 7 7 Inlibipige s uo ass a ce 7 7 e oce td ect ees Sec es ac ante BB cenit ta 7 9 Other IPA Tuning Options 7 9 Disabling OptionS oe Bee aaa SERERE WEE 7 10 Case Study SPEC CPU2000 7 10 Invoking Ip c dedo sabe erasa BORA saa 7 12 Size and Correctness Limitations to 7 14 Loop Nest Optimization LNO 7 14 Loop Fusion and Fission 7 14 Cache Size Specification 7 15 Cache Blocking Loop Unrolling Interchange Transformations 7 16 Prefetch cux eR R
40. 4 1 8 G77 PGI E TRADITIONAL DFLOATI R 8 A 1 2 TRADITIONAL E DFLOATJ R 8 1 4 TRADITIONAL DFLOATK R 8 A 1 8 TRADITIONAL E DIGITS X 1 1 I 2 I 4 1 8 ANSI PGI E R 4 R 8 TRADITIONAL DIM R 4 X R 4 ANSI G77 E P Y R 4 PGI TRADITIONAL DIM X R 8 ANSI G77 E P Y R 8 PGI TRADITIONAL C 12 1 02404 15 XX QLOGIC 1 02404 15 Table C 1 Fortran Intrinsics Supported in 3 0 Continued C Supported Fortran Intrinsics Table of Supported Intrinsics I i _ Intrinsic Result Arguments Families Remarks DIM X 1 1 2 I 4 I 8 ANSI G77 E P PGI TRADITIONAL Y 1 1 1 2 1 4 1 8 DIMAG R 8 Z Z 16 G77 PGI TRADITIONAL DINT R 8 A R 8 ANSI G77 E P PGI TRADITIONAL DISABLE_IEEE_ Subroutine INTERRUPT 1 8 TRADITIONAL E INTERRUPT DLOG R 8 X R 8 ANSI G77 E P PGI TRADITIONAL DLOG10 R 8 X R 8 ANSI G77 E P PGI TRADITIONAL DMAX1 ANSI G77 See Std PGI TRADITIONAL DMIN1 ANSI G77 See Std PGI TRADITIONAL DMOD R 8 A R 8 ANSI G77 E P P R 8 PGI TRADITIONAL DNINT R 8 A R 8 ANSI G77 E P PGI TRADITIONAL DOT ANSI PGI See Std PRODUCT TRADITIONAL DPROD R 8 X R 4 R 8 ANSI G77 E P Y R 4 R 8 PGI TRADITIONAL DREAL R 8 A 1 1 1 2 1 4 1 8 G77 PGI E R 4 R 8 2 8 Z 16 TRADITIONAL DSHIFTL I 1 1 1 2 I 4 I 8 TRADITIONAL E J 1 1 I 2 1 4 1 8 1 1 1 2 I 4 I 8 C Supported Fortran Intrin
41. 4 l 1 4 PGI E POS 1 1 1 2 1 4 1 8 TRADITIONAL LEN I 1 1 2 1 4 I 8 JIBSET 1 4 l 1 4 PGI E POS 1 1 1 2 1 4 1 8 TRADITIONAL JIDIM 1 4 X 1 4 PGI E Y 1 4 TRADITIONAL JIDINT 1 4 A R 8 PGI E TRADITIONAL JIEOR 1 4 l 1 4 PGI E J 1 4 TRADITIONAL JIFIX 1 4 A R 4 R 8 PGI E TRADITIONAL JINT 1 4 A R 4 PGI E TRADITIONAL JIOR 1 4 l 1 4 PGI E J 1 4 TRADITIONAL JISHA 1 4 l 1 4 TRADITIONAL E SHIFT 1 1 I 2 1 4 I 8 JISHC 1 4 l 1 4 TRADITIONAL E SHIFT 1 1 I 2 1 4 I 8 JISHFT 1 4 l 1 4 PGI E SHIFT 1 1 1 2 1 4 TRADITIONAL I 8 JISHFTC 1 4 l 1 4 PGI E SHIFT 1 1 I2 1 4 TRADITIONAL O I 8 SIZE I 1 1 2 1 4 1 8 JISHL 1 4 l 1 4 TRADITIONAL E SHIFT 1 1 I 2 1 4 1 8 JISIGN 1 4 A 1 4 PGI E P B 1 4 TRADITIONAL JMOD 1 4 A 1 4 PGI E P P 1 4 TRADITIONAL C 26 1 02404 15 XX C Supported Fortran Intrinsics QLOGIC Table of Supported Intrinsics I i _ Table C 1 Fortran Intrinsics Supported in 3 0 Continued Intrinsic Name Result Arguments Families Remarks JMVBITS Subroutine FROM 1 4 TRADITIONAL E FROMPOS 1 1 2 1 4 1 8 LEN I 1 1 2 I 4 1 8 TO 1 4 TOPOS 1 2 1 4 1 8 1 4 A R 4 R 8 TRADITIONAL JNOT 1 4 I 1 4 PGI E TRADITIONAL KIABS I 8 1 8 PGI E TRADITIONAL KIAND I 8 I 1 8 PGI E J I 8 TRADITIONAL KIBCHNG I 8 I 8 TRADITIONAL E POS 1 1 1 2 1 4 1 8 KIBCLR I 8 1 8 PGI E POS I 1 1 2
42. 8.4 OpenMP Compiler Directives (Fortran)

The OpenMP directives for Fortran all start with comment characters followed by $OMP or $omp. They are only processed by the compiler if -mp is specified.

NOTE: Possible comment characters that can be used include !, *, C, and c. In the following examples, we use ! as the comment character. The OpenMP standard dictates that for fixed-form Fortran, !$OMP directives must begin in the first column of the line.

Some of the OpenMP directives also support additional clauses. The following table lists the Fortran compiler directives provided by version 2.0 of the OpenMP Fortran Application Program Interface.

Table 8-1. Fortran Compiler Directives

Directive | Clauses | Example

Parallel region construct: defines a parallel region.
PARALLEL | NUM_THREADS | !$OMP parallel [clause] ... structured-block !$OMP end parallel

Work-sharing constructs: divide the execution of the enclosed block of code among the members of the team that encounter it.
DO | NOWAIT, PRIVATE, FIRSTPRIVATE, LASTPRIVATE, REDUCTION, SCHEDULE (static, dynamic, guided | !$OMP do [clause] ... do_loop !$OMP enddo [nowait]
43. * Control flow optimizations
* Instruction scheduling across basic blocks

-O2 implies the flag -OPT:goto=on, which enables the conversion of GOTOs into higher-level structures like FOR loops. -O2 also sets -OPT:Olimit=6000.

-O3 turns on additional optimizations which will most likely speed your program up, but may, in rare cases, slow your program down. The optimizations provided at this level include all -O1 and -O2 optimizations and also include, but are not limited to, the flags noted below:

* -LNO:opt=1 Turn on Loop Nest Optimization (for more details, see section 7.4)
* -OPT with the following options in the OPT group (see the opt man pages for more information):
  * -OPT:roundoff=1 (see section 7.7.4.2)
  * -OPT:IEEE_arith=2 (see section 7.7.4)
  * -OPT:Olimit=9000 (see section 6.3)
  * -OPT:reorg_common=1 (see the eko(7) man page)

NOTE: In our in-house testing, we have noticed that several codes which are slower at -O3 than -O2 are fixed by using -O3 -LNO:prefetch=0. This seems to mainly help codes that fit in cache.

7.2 Syntax for Complex Optimizations (-CG, -IPA, -LNO, -OPT, -WOPT)

The group optimizations control a variety of behaviors and can override defaults. This section covers the syntax of these options. The group options allow for the setting of multiple sub-options in two ways:

* S
44. E P X R 8 PGI TRADITIONAL DATAN2D R 8 Y R 8 PGI E X R 8 TRADITIONAL DATAND R 8 X R 8 PGI E TRADITIONAL DATE C G77 PGI TRADITIONAL DATE Subroutine DATE C G77 PGI DATE_AND Subroutine DATE C ANSI G77 O _TIME TIME C PGI ZONE C TRADITIONAL VALUES 1 1 1 2 1 4 1 8 Array rank 1 DBESJO R 8 X R 8 G77 PGI DBESJ1 R 8 X R 8 G77 PGI DBESJN R 8 N 1 4 G77 PGI X R 8 DBESYO R 8 X R 8 G77 PGI DBESY1 R 8 X R 8 G77 PGI DBESYN R 8 N 1 4 G77 PGI X R 8 DBLE R 8 1 1 I2 1 4 I8 ANSI G77 E R 4 R 8 2 8 2 16 PGI E TRADITIONAL DCMPLX 16 X 1 1 I 2 1 4 8 G77 PGI E R 4 R 8 2 8 Z 16 TRADITIONAL 1 02404 15 C 11 C Supported Fortran Intrinsics Table of Supported Intrinsics QLOGIC FWI5I IGIIII III ee Table C 1 Fortran Intrinsics Supported in 3 0 Continued Intrinsic Name Result Arguments Families Remarks Y 1 1 1 2 1 4 1 8 R 4 R 8 Z 8 Z 16 DCONJG Z 16 Z Z 16 G77 PGI E TRADITIONAL DCOS R 8 X R 8 ANSI G77 E P PGI TRADITIONAL DCOSD R 8 X R 8 PGI E TRADITIONAL DCOSH R 8 X R 8 ANSI G77 E P PGI TRADITIONAL DCOT R 8 X R 8 TRADITIONAL DCOTAN R 8 X R 8 TRADITIONAL DDIM R 8 X R 8 ANSI G77 E P Y R 8 PGI TRADITIONAL DERF X R 4 R 8 G77 PGI E P TRADITIONAL DERFC X R 4 R 8 G77 PGI E P TRADITIONAL DEXP R 8 X R 8 ANSI G77 E P PGI TRADITIONAL DFLOAT R 8 A 1 1 1 2 1
45. Environment variables PathScale OpenMP PSC_OMP_AFFINITY A 3 PSC_OMP_AFFINITY_GLOBAL A 3 PSC_OMP_AFFINITY_MAP A 4 PSC_OMP_CPU_OFFSET A 4 PSC_OMP_CPU_STRIDE A 4 PSC OMP GUARD SIZE A 4 PSC OMP GUIDED CHUNK DIVISOR A 4 PSC OMP GUIDED CHUNK MAX A 4 PSC OMP LOCK SPIN A 5 PSC OMP SILENT A 5 PSC OMP STACK SIZE A 5 PSC OMP STATIC FAIR A 5 PSC OMP THREAD SPIN A 5 EVERY intrinsics family C 2 Execute target 7 32 explain command 3 11 used with iostat 3 11 extension source file name 2 6 E F90 BOUNDS CHECK ABORT 3 12 Fast math functions 7 21 FDO Feedback Directed Optimization 6 2 7 18 FFT 3 23 FILENV 3 18 3 19 Final object code 7 3 fixed form 3 2 Fixed form files 3 1 3 2 1 02404 13 Floating point calculations 10 1 Format big endian 3 18 little endian 3 18 Fortran accessing common blocks 3 17 compiler commands 3 1 debugging 3 25 dope vector data structure 3 12 file units 3 20 KIND attribute 3 20 modules 3 3 preprocessor 3 1 3 8 3 9 runtime libraries 3 13 stack size 3 2 3 29 8 11 8 21 8 23 Fortran intrinsics abort C 42 C 47 access C 42 alarm C 42 and C 42 besyn C 42 cdsqrt C 42 chdir C 42 chmod C 43 ctime C 43 date C 43 dbesyn C 43 C 43 dconj C 43 derfc C 43 dfloat C 43 dimag C 43 dreal C 43 dtime C 43 erfc C 43 etime C 44 exit C 44 fdate C 44 fget C 44 fgetc C 44 flush C 44 fnum C 44 fput C 44 fputc C 44 fseek C 44 Index 3 QLogic PathScale Compiler S
46. Fortran Intrinsics QLOGIC Table of Supported Intrinsics MENS NE Table C 1 Fortran Intrinsics Supported in 3 0 Continued Intrinsic Name Result Arguments Families Remarks ISATTY L 4 UNIT 1 4 G77 PGI ISHA I 1 1 1 2 1 4 1 8 TRADITIONAL E SHIFT 1 1 1 2 1 4 1 8 ISHC I 1 1 1 2 1 4 1 8 TRADITIONAL E SHIFT 1 1 1 2 1 4 1 8 ISHFT I 1 1 1 2 1 4 1 8 ANSI G77 E SHIFT 1 1 1 2 1 4 PGI 1 8 TRADITIONAL ISHFTC I 1 1 1 2 1 4 1 8 ANSI G77 E SHIFT 1 1 1 2 1 4 PGI 8 TRADITIONAL SIZE I 1 1 2 1 4 1 8 ISHL I 1 1 1 2 1 4 1 8 TRADITIONAL E SHIFT 1 1 1 2 1 4 1 8 ISIGN 1 4 1 1 1 2 1 4 1 8 ANSI G77 E P B 1 1 1 2 1 4 1 8 PGI TRADITIONAL ISNAN X R 4 R 8 TRADITIONAL E IS IOSTAT END L 4 I 1 1 1 2 1 4 1 8 ANSI TRADITIONAL IS_IOSTAT_ L 4 I 1 1 1 2 1 4 1 8 ANSI EOR TRADITIONAL ITIME Subroutine TARRAY 1 4 G77 PGI Array rank 1 JDATE C TRADITIONAL JIABS 1 4 A 1 4 PGI E TRADITIONAL JIAND 1 4 l 1 4 PGI E J 4 TRADITIONAL JIBCHNG 1 4 l 1 4 TRADITIONAL E POS 1 1 1 2 1 4 I 8 JIBCLR 1 4 I 1 4 PGI E POS 1 1 1 2 1 4 1 8 TRADITIONAL 1 02404 15 C 25 C Supported Fortran Intrinsics Table of Supported Intrinsics QLOGIC CO PP Table C 1 Fortran Intrinsics Supported in 3 0 Continued Intrinsic Name Result Arguments Families Remarks JIBITS 1
47. Fortran interface to Linux function isatty. Returns true if logical unit unit is associated with an interactive terminal device.

itime
Store in tarray, which must have three elements, the current local time:
* Hour, ranging 0-23
* Minutes, ranging 0-59
* Seconds, ranging 0-60 to allow for leap seconds

kill
Fortran interface to the POSIX function kill. Send to the process whose ID is pid the signal whose number is signal. The function form returns 0 on success or an error code from the C library value errno. The subroutine form sets status to the value which the function would return.

link
Fortran interface to the POSIX function link. Creates a hard link path2 pointing to the same file as path1. The function form returns 0 on success or an error code from the C library value errno. The subroutine form sets status to the value which the function would return. Trailing blanks in path1 and path2 are ignored; you can prevent this by using char(0) to place a null character after the last significant character.

lnblnk
Returns the length of its argument, neglecting trailing blanks (synonym for standard function len_trim).

loc
Returns the address of its argument in memory.

long
Convert to type integer*4.

lshift
Bitwise left shift. The high-order bit is not treated as a sign bit. The shift count must be nonnegative and less than the bit size of the data.
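The semantics of lnblnk and lshift described in this excerpt can be sketched in C. This is an illustration of the documented behavior, not the runtime library's implementation; the function names len_without_trailing_blanks and left_shift32 are hypothetical, and the 32-bit width is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Length of a fixed-size, blank-padded character buffer, neglecting
   trailing blanks (the behavior described for lnblnk / len_trim). */
size_t len_without_trailing_blanks(const char *buf, size_t len)
{
    while (len > 0 && buf[len - 1] == ' ')
        len--;
    return len;
}

/* Bitwise left shift on an unsigned 32-bit value, so the high-order bit
   is not treated as a sign bit. The shift count must be less than the
   bit size of the data, which the assert enforces. */
uint32_t left_shift32(uint32_t value, unsigned count)
{
    assert(count < 32);
    return value << count;
}
```

For a six-character buffer holding "abc" followed by three blanks, the first function returns 3. Shifting 0x80000001 left by one with the second function gives 0x00000002, because the top bit is simply shifted out.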
48. I 1 TRADITIONAL I 1 IDATE Subroutine I 1 2 G77 PGI J 2 TRADITIONAL 1 2 IDATE Subroutine l 1 4 G77 PGI J 1 4 TRADITIONAL 1 4 IDATE Subroutine I 1 8 G77 PGI J I 8 TRADITIONAL 8 IDATE Subroutine TARRAY 1 1 G77 PGI Array rank 1 TRADITIONAL IDATE Subroutine TARRAY 2 G77 PGI Array rank 1 TRADITIONAL IDATE Subroutine TARRAY 4 G77 PGI Array rank 1 TRADITIONAL IDATE Subroutine TARRAY I 8 Array G77 PGI rank 1 TRADITIONAL IDIM 1 4 X 1 1 1 2 1 4 1 8 ANSI G77 E P Y 1 1 1 2 1 4 1 8 PGI TRADITIONAL IDINT 1 4 ANSI G77 E PGI TRADITIONAL A R 8 IDNINT 1 4 A R 8 ANSI G77 E P PGI TRADITIONAL 1 02404 15 C 21 C Supported Fortran Intrinsics Table of Supported Intrinsics QLOGIC FWI5I IGIIII III ee Table C 1 Fortran Intrinsics Supported in 3 0 Continued Intrinsic Name Result Arguments Families Remarks IEEE_BINARY_ Y R 4 R 8 TRADITIONAL E SCALE N 1 1 1 2 1 4 1 8 IEEE_CLASS X R 4 R 8 TRADITIONAL E IEEE_COPY_ X R 4 R 8 TRADITIONAL E SIGN Y R 4 R 8 IEEE_ X R 4 R 8 TRADITIONAL E EXPONENT Y 1 1 I2 1 4 1 8 R 4 R 8 IEEE FINITE X R 4 R 8 TRADITIONAL IEEE INT X R 4 R 8 TRADITIONAL E Y 1 1 1 2 1 4 1 8 R 4 R 8 IEEE IS NAN X R 4 R 8 TRADITIONAL E IEEE NEXT X R 4 R 8 TRADITIONAL E AFTER Y R 4 R 8 IEEE REAL X 1 1 1 2 1 4 1 8 TRADITIONAL E R 4 R 8 Y
49. ID for this mapping. When nested parallelism is not employed, then each thread's global and local ID will be identical, and the setting of this variable is irrelevant. However, when a nested team of threads is created, that team will be assigned new local thread IDs starting at 0 for the master of that team and incrementing upwards. Note that the local ID of a thread can change when that thread performs a nested fork and then a nested join, and that these events may cause the CPU binding of that thread to change. Also note that all team masters will have a local ID of 0 and will therefore map to the same CPU. Usually these properties are undesirable, so the default is to use the thread global ID for scheduling assignments.

PSC_OMP_AFFINITY_INHERITANCE (TRUE or FALSE)

This determines whether the OpenMP library inherits any prevailing affinity settings from its environment, and the default value is TRUE. When affinity inheritance is disabled, the OpenMP library ignores the environment's affinity setting and sets up its own affinity mappings according to its built-in heuristics. By default the OpenMP library will bind one thread to each CPU in the machine, though this can be overridden by OpenMP environment variables. When affinity inheritance is enabled (the default) and the OpenMP program is run under an affinity assi
50. If all of this text is followed by "/cpu", it is treated as a per-CPU number, and that number is multiplied by the number of CPUs on the system. This is useful for multiprocessor systems that are running several processes concurrently. The value specified (implicitly or explicitly) is the memory value per process.

Here are some sample stack size settings (on a 4-CPU system with 1G of memory):

Value | Meaning
100000 | 100000 bytes
820K | 820K (839680 bytes)
-0.25g | all but 0.25G, or 0.75G total
128M/cpu | 128M per CPU, or 512M total
-10M/cpu | all but 10M per CPU (all but 40M total), or 0.96G total

If the Fortran runtime encounters problems while attempting to modify the stack size limit, it will print some warning messages, but will not abort.

Section 4
The PathScale C/C++ Compiler

The PathScale C and C++ compilers conform to the following set of standards and extensions.

The C compiler:
* Conforms to ISO/IEC 9899:1990, Programming Languages - C standard.
* Supports extensions to the C programming language as documented in "Using GCC: The GNU Compiler Collection Reference Manual," October 2003, for GCC version 3.3.1. Refer to section 4.4 of this document for the list of extensions that are currently not supported.
* Complies with the C Application Binary Interface as defined by the GNU C compiler (gcc) as implemented on the platforms supported by the PathScale Compiler Suite.
* Supports most of
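The arithmetic behind the sample stack size settings listed earlier in this excerpt can be checked with a short C sketch. It assumes K, M, and G are powers of 1024 (consistent with 820K being 839680 bytes) and that an "all but" setting is subtracted from total memory. The function effective_stack_limit and its parameters are hypothetical and do not reflect how the Fortran runtime parses the setting.

```c
#include <stdint.h>

/* value_bytes        : magnitude of the setting, already converted to bytes
   per_cpu            : nonzero if the setting is a per-CPU amount
   all_but            : nonzero if the setting means "all but this much"
   ncpus              : number of CPUs on the system
   total_memory_bytes : total memory on the system, in bytes
   Returns the resulting limit in bytes. */
int64_t effective_stack_limit(int64_t value_bytes, int per_cpu, int all_but,
                              int ncpus, int64_t total_memory_bytes)
{
    int64_t amount = per_cpu ? value_bytes * ncpus : value_bytes;
    return all_but ? total_memory_bytes - amount : amount;
}
```

On the 4-CPU, 1G example system, 128M per CPU works out to 512M total, and "all but 10M per CPU" works out to 1024M - 40M = 984M, which is about 0.96G.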
51. LNO, code generation (CG), and aggressive optimizations (e.g. by reducing numerical accuracy). IPA (inter-procedural analysis) may help with OpenMP programs too; try it and see.

Some applications spend a large amount of time in numerical libraries. At small numbers of nodes, a highly optimized and tuned serial algorithm (crafted for the target processor) may out-perform a parallel implementation based on a non-optimized algorithm. At higher numbers of nodes, the parallel version may scale and give better performance. However, best performance will typically require an OpenMP parallelization of the best serial algorithm (exploiting target features such as SSE, for example). Check to see if there are OpenMP-enabled versions of these numerical libraries available.

Memory System Performance

OpenMP applications are often very sensitive to memory system performance. An excellent approach is to tune the memory system with an OpenMP version of the STREAM benchmark. In particular, the BIOS settings for memory bank interleaving should be "auto" and for node interleaving should be "off".

Interleaving memory by node causes memory addresses to be striped across the various nodes at a low granularity, creating the illusion of a uniform memory system. However, OpenMP programs tend to have very good memory locality, and the correct approach is
52. OMP_DYNAMIC to TRUE, or calling OMP_SET_DYNAMIC with a TRUE parameter, the implementation produces a diagnostic message and ignores the request. The value returned by OMP_GET_DYNAMIC is always FALSE, to indicate that this mechanism is not supported.

OMP_SET_NESTED

When nested parallelism is enabled, the number of threads used to execute nested parallel regions is implementation dependent (Section 3.1.9, page 52).

The implementation supports dynamically nested parallelism. The number of threads assigned to a new team is determined by the following algorithm:
- If this fork is dynamically nested inside another fork, and nesting is disabled, then the new team will consist of 1 thread (the thread that requests the fork).
- Otherwise, the number of threads is specified by the NUM_THREADS clause on the parallel directive, if NUM_THREADS has been specified.
- Otherwise, the number of threads is specified by the most recent call to OMP_SET_NUM_THREADS, if it has been called.
- Otherwise, the number of threads is specified by the OMP_NUM_THREADS environment variable, if it has been defined.
- Otherwise, the number of threads defaults to the number of CPUs in the machine.

If the number of threads is greater than 1, the request requires allocation of new threads, and this may fail if insufficient machine resources are available. The maximum number of threads that can be allocated simultaneously is limited to 256 by the implementation. Currently n
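The team-size algorithm in excerpt 52 is a simple priority chain, sketched below. This is an illustration of the documented order only, not the library's code: the function `team_size` and its parameter names are hypothetical, `None` stands for "not specified", and the 256-thread allocation limit and allocation failures are deliberately left out.

```python
def team_size(nested_in_fork, nesting_enabled,
              num_threads_clause=None,
              set_num_threads_call=None,
              env_num_threads=None,
              cpu_count=1):
    """Pick the number of threads for a new team, in the documented priority order."""
    # A nested fork with nesting disabled gets only the requesting thread.
    if nested_in_fork and not nesting_enabled:
        return 1
    # NUM_THREADS clause on the parallel directive.
    if num_threads_clause is not None:
        return num_threads_clause
    # Most recent call to OMP_SET_NUM_THREADS.
    if set_num_threads_call is not None:
        return set_num_threads_call
    # OMP_NUM_THREADS environment variable.
    if env_num_threads is not None:
        return env_num_threads
    # Fall back to the number of CPUs in the machine.
    return cpu_count


if __name__ == "__main__":
    print(team_size(False, False, env_num_threads=5, cpu_count=8))
    print(team_size(True, False, num_threads_clause=4, cpu_count=8))
```

Writing the rule out this way makes one consequence easy to see: with nesting disabled, an inner parallel region runs on one thread even when it carries a NUM_THREADS clause.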
53. OpenMP, please see the sample code at http://www.openmp.org/drupal/node/view/14. There are also examples of OpenMP code in Appendix A of the OpenMP 2.0 C/C++ specification. See section 8.15 for more details.

8.14 Tuning for OpenMP Application Performance

A good first step in tuning OpenMP code is to build a serial version of the application and tune the serial performance. See section 7 for ideas and suggestions. Often, good flags for serial performance are also good for OpenMP performance. Typically, OpenMP parallelizes the outer iterations of the compute-intensive loops in a coarse fashion, leaving chunks of the outer loops and the inner loops that generally behave very similarly to the serial code.

Use pathopt2 (see section 7.9 for details on pathopt2) to help find good serial tuning options for the application. You may be able to find interesting options for tuning by looking at tuned configuration files for similar codes. With this approach, you can find good options for the serial parts of the code before having to consider OpenMP-specific issues such as scheduling, scaling, and affinity. If the test case takes a long time to run, or needs a lot of memory, then you may be forced to tune the flags with OpenMP enabled.

8.14.1 Reduced Datasets

You may find it useful to reduce the size of the data sets to give a quicker runtime, allowing the efficacy of particular tuning options to be quickly ascertained. One thing to note is that OpenMP p
54. OpenMP pthreads to have different limits. Often the system-imposed limits are different in these two cases, and sometimes the stack requirements of the OpenMP pthreads may be quite different from the main thread. For example, in some applications the main thread of an OpenMP program might allocate large arrays for the whole program on its stack, and in others the large arrays will be allocated by all of the threads.

8.10.2 Stack Size for C/C++

The stack size of serial C and C++ programs is typically set by the ulimit command provided by the shell. Since C and C++ programs typically do not allocate large arrays on the stack, it is usually convenient to use whatever default ulimit your system provides. More strict ulimit settings can be used to catch runaway stacks or unbounded recursion before the program exhausts all available memory.

For OpenMP C and C++ programs, there will be an additional stack for each pthread created by the libopenmp library (section 8.11 describes how these pthread stacks are sized).

NOTE: The automatic stack sizing algorithm used by Fortran serial programs and Fortran OpenMP programs is not employed for C and C++ programs.

8.11 Stack Size Algorithm

The stack limit for each OpenMP pthread is calculated as follows:
- If PSC_OMP_STACK_SIZE is set, then this specifies the stack limit.
- If this is a Fortran program, the stack limit is automatically set using the same approach as described in section 3.11, except that th
55. PSC_OMP_SILENT
If you set PSC_OMP_SILENT to anything, then warning and debug messages from the libopenmp library are inhibited.

PSC_OMP_STACK_SIZE
(Fortran) Stack size specification follows the syntax in section 3.11.

PSC_OMP_STATIC_FAIR
This determines the default static scheduling policy when no chunk size is specified, as discussed in section 8.9.2.

PSC_OMP_THREAD_SPIN
This takes a numeric value and sets the number of times that the spin loops will spin at user level before falling back to O/S schedule/reschedule mechanisms.

Appendix B
Implementation Dependent Behavior for OpenMP Fortran

The OpenMP Fortran specification 2.0, Appendix E, requires that the implementation-defined behavior of PathScale's OpenMP implementation be defined and documented (see http://www.openmp.org). For the Fortran version 2.0 OpenMP Specification, click on Specifications in the left column of the OpenMP home page.

This appendix summarizes the behaviors that are described as implementation dependent in this API. The sections in italic, including the cross references, come from the Fortran 2.0 specification, and each is followed by the relevant details for the PathScale implementation in its Compiler Suite 3.0 release of OpenMP for Fortran.

SCHEDULE(GUIDED,chunk): chunk specifies the size of the smallest piece
56. R 4
BESJN R 8 N R 4 G77 PGI X R 8
BESY0 R 4 X R 4 G77 PGI
BESY0 R 8 X R 8 G77 PGI
BESY1 R 4 X R 4 G77 PGI
BESY1 R 8 X R 8 G77 PGI
BESYN R 4 N R 4 G77 PGI X R 4

Table C-1. Fortran Intrinsics Supported in 3.0 (Continued)
Intrinsic Name  Result  Arguments  Families  Remarks
BESYN R 8 N R 4 G77 PGI X R 8
BITEST I 1 2 PGI E POS 1 1 1 2 1 4 1 8 TRADITIONAL
BIT_SIZE I 1 1 1 2 1 4 I 8 ANSI G77 E PGI TRADITIONAL
BJTEST l 1 4 PGI E POS 1 1 1 2 1 4 1 8 TRADITIONAL
BKTEST l 1 8 TRADITIONAL E POS 1 1 12 1 4 I 8
BTEST l 1 1 1 2 1 4 1 8 ANSI G77 E POS 1 1 1 2 1 4 1 8 PGI TRADITIONAL
CABS R 4 A Z 8 Z 16 ANSI G77 E P PGI TRADITIONAL
CCOS Z 8 X 2 8 716 ANSI G77 E P PGI TRADITIONAL
CDABS R 8 A Z 16 G77 PGI E P TRADITIONAL
CDCOS Z 16 X Z 16 G77 PGI E P TRADITIONAL
CDEXP Z 16 X Z 16 G77 PGI E P TRADITIONAL
CDLOG Z 16 X Z 16 G77 PGI E P TRADITIONAL
CDSIN Z 16 X Z 16 G77 PGI E P TRADITIONAL
CDSQRT Z 16 X Z 16 G77 PGI E P TRADITIONAL
CEILING A R 4 R 8 ANSI PGI E KIND I 1 I2 1 4 TRADITIONAL O I 8
CEXP Z 8 X Z 8 Z 16 ANSI G77 E P PGI TRADITIONAL
57. R 4 R 8
IEEE_REMAINDER X R 4 R 8 TRADITIONAL E Y R 4 R 8
IEEE_UNORDERED X R 4 R 8 TRADITIONAL E Y R 4 R 8
IEOR 1 4 I 1 1 1 2 1 4 1 8 ANSI G77 E J 1 1 1 2 1 4 1 8 PGI TRADITIONAL
IERRNO 1 4 G77 PGI
IFIX 1 4 A R 4 R 8 ANSI G77 E PGI TRADITIONAL
1 2 A 1 2 PGI E TRADITIONAL
1 2 I 1 2 PGI E J P2 TRADITIONAL
IIBCHNG 1 2 I 1 2 TRADITIONAL E POS 1 1 1 2 1 4 I 8
IIBCLR 1 2 PGI E TRADITIONAL

Table C-1. Fortran Intrinsics Supported in 3.0 (Continued)
Intrinsic Name  Result  Arguments  Families  Remarks
IIBITS I 2 I 1 2 PGI E POS 1 1 1 2 1 4 1 8 TRADITIONAL
IIBSET 1 2 I 1 2 PGI E POS I 1 1 2 1 4 8 TRADITIONAL
IIDIM I 2 X r2 PGI E Y 2 TRADITIONAL
IIDINT 1 2 A R 8 PGI E TRADITIONAL
IIEOR I 2 1 2 PGI E J l 2 TRADITIONAL
I 2 A R 4 R 8 PGI E TRADITIONAL
1 2 A R 4 PGI E TRADITIONAL
IIOR I 2 I 1 2 PGI E J l 2 TRADITIONAL
IISHA 1 2 I 1 2 TRADITIONAL E SHIFT 1 1 1 2 1 4 I 8
IISHC 1 2 I 1 2 TRADITIONAL E SHIFT 1 1 1 2 1 4 I 8
IISHFT 1 2 I 1 2 PGI E SHIFT 1 1 1 2 1 4 TRADITIONAL I 8
IISHFTC I 2 I 1 2 PGI E SHIFT 1 1 2 1 4 TRADITIONAL I 8 SIZE 1 1 1 2 1 4 I 8
IISHL I 2 I 1 2 TRADITIONAL SHIFT I 1 1 2 I 4 1 8
IISIGN 1 2 1 2 PGI E
58. REAL_KIND R I 1 1 2 1 4 1 8 TRADITIONAL
SETBUF 1 4 UNIT 1 4 TRADITIONAL BUF C
SETLINEBUF 1 4 UNIT 1 4 TRADITIONAL
SET_EXPONENT X R 4 R 8 ANSI PGI E I 1 1 1 2 1 4 I 8 TRADITIONAL
SET_IEEE_EXCEPTION Subroutine EXCEPTION I 8 TRADITIONAL E
SET_IEEE_EXCEPTIONS Subroutine STATUS 1 8 TRADITIONAL
SET_IEEE_INTERRUPTS Subroutine STATUS 1 8 TRADITIONAL
SET_IEEE_ROUNDING_MODE Subroutine STATUS I 8 TRADITIONAL
SET_IEEE_STATUS Subroutine STATUS I 8 TRADITIONAL

Table C-1. Fortran Intrinsics Supported in 3.0 (Continued)
Intrinsic Name  Result  Arguments  Families  Remarks
SHAPE ANSI PGI See Std TRADITIONAL
SHIFT I 1 1 1 2 1 4 1 8 PGI E R 4 R 8 TRADITIONAL CrayPtr L 1 L 2 L 4 L 8 J 1 1 1 2 1 4 1 8
SHIFTA I 1 1 1 2 1 4 1 8 TRADITIONAL E R 4 R 8 CrayPtr L 1 L 2 L 4 L 8 J 1 1 12 1 4 1 8
SHIFTL I 1 1 1 2 1 4 1 8 TRADITIONAL E R 4 R 8 CrayPtr L 1 L 2 L 4 L 8 J 1 1 2 1 4 I 8
SHIFTR I 1 1 1 2 1 4 1 8 TRADITIONAL E R 4 R 8 CrayPtr L 1 L 2 L 4 L 8 J 1 1 1 2 1 4 1 8
SHORT I 2 A 1 1 2 1 4 1 8 G77 E R 4 R 8 TRADITIONAL Z 8 Z 16
SIGN R 4 A 1 1 I2 1 4 1 8 ANSI G77 E P R 4 R 8 PGI 1 1 I 2 1 4 1 8 TRADITIONAL R 4 R 8
SIGNAL I 8 NUMBER 1 1 I 2 G77 PGI O 1 4 1 8
59. Table C-1. Fortran Intrinsics Supported in 3.0 (Continued)
Intrinsic Name  Result  Arguments  Families  Remarks
CVMGP I 1 1 1 2 1 4 I 8 TRADITIONAL E R 4 R 8 CrayPtr L 1 L 2 L 4 L 8 J 1 I 2 1 4 1 8 R 4 R 8 CrayPtr L 1 L 2 L 4 L 8 K 1 1 1 2 1 4 1 8 R 4 R 8 CrayPtr L 1 L 2 L 4 L 8
CVMGT I 1 1 1 2 1 4 1 8 TRADITIONAL E R 4 R 8 CrayPtr L 1 L 2 L 4 L 8 J 1 I 2 1 4 1 8 R 4 R 8 CrayPtr L 1 L 2 L 4 L 8 K L 1 L 2 L 4 L 8
CVMGZ I 1 1 1 2 1 4 I 8 TRADITIONAL E R 4 R 8 CrayPtr L 1 L 2 L 4 L 8 J 1 12 1 4 1 8 R 4 R 8 CrayPtr L 1 L 2 L 4 L 8 K 1 1 1 2 1 4 1 8 R 4 R 8 CrayPtr L 1 L 2 L 4 L 8
C LOC 1 8 X Any type Array TRADITIONAL rank any
DABS R 8 A R 8 ANSI G77 E P PGI TRADITIONAL
DACOS R 8 X R 8 ANSI G77 E P PGI TRADITIONAL
DACOSD R 8 X R 8 PGI E TRADITIONAL
DASIN R 8 X R 8 ANSI G77 E P PGI TRADITIONAL
DASIND R 8 X R 8 PGI E TRADITIONAL
DATAN R 8 X R 8 ANSI G77 E P PGI TRADITIONAL
DATAN2 R 8 Y R 8 ANSI G77
60. TRADITIONAL HANDLER Procedure IGNDFL I 4
SIGNAL I 8 NUMBER 1 1 I 2 G77 PGI 1 4 1 8 TRADITIONAL HANDLER I 4
SIGNAL I 8 NUMBER 1 1 I 2 G77 PGI 1 4 1 8 TRADITIONAL HANDLER I 8

Table C-1. Fortran Intrinsics Supported in 3.0 (Continued)
Intrinsic Name  Result  Arguments  Families  Remarks
SIGNAL Subroutine G77 PGI TRADITIONAL
SIN R 4 X R 4 R 8 Z 8 ANSI G77 E P Z 16 PGI TRADITIONAL
SIND R 4 X R 4 R 8 PGI TRADITIONAL
SINH R 4 X R 4 R 8 ANSI G77 E P PGI TRADITIONAL
SIZE ANSI PGI See Std TRADITIONAL
SIZEOF 1 8 X Any type Array TRADITIONAL rank any
SLEEP Subroutine SECONDS 1 4 G77 PGI
SNGL R 4 A R 8 ANSI G77 E PGI TRADITIONAL
SPACING X R 4 R 8 ANSI PGI E TRADITIONAL
SPREAD ANSI PGI See Std TRADITIONAL
SQRT R 4 X R 4 R 8 Z 8 ANSI G77 E P Z 16 PGI TRADITIONAL
SRAND Subroutine SEED 1 4 G77 PGI
STAT 1 4 FILE C G77 PGI O SARRAY I 4 Array TRADITIONAL rank 1 STATUS 1 4
STAT Subroutine FILE C G77 O SARRAY I 4 Array TRADITIONAL rank 1 STATUS 1 4
SUB_AND_FETCH I I 4 TRADITIONAL J 4
SUB_AND_FETCH I I 8 TRADITIONAL E J I 8
61. TRADITIONAL
MIN0 ANSI G77 See Std PGI TRADITIONAL
MIN1 ANSI G77 See Std PGI TRADITIONAL
MINEXPONENT X R 4 R 8 ANSI PGI E TRADITIONAL
MINLOC ANSI PGI See Std TRADITIONAL
MINVAL ANSI PGI See Std TRADITIONAL
MOD 1 4 A 1 1 I2 1 4 1 8 ANSI G77 E P R 4 R 8 PGI P 1 1 1 2 1 4 1 8 TRADITIONAL R 4 R 8
MODULO 1 1 1 2 1 4 1 8 ANSI PGI E R 4 R 8 TRADITIONAL P 1 1 1 2 1 4 1 8 R 4 R 8

Table C-1. Fortran Intrinsics Supported in 3.0 (Continued)
Intrinsic Name  Result  Arguments  Families  Remarks
MVBITS Subroutine FROM 1 1 2 1 4 _ ANSI G77 E 1 8 PGI FROMPOS 1 1 1 2 TRADITIONAL 1 4 1 8 LEN 1 1 1 2 1 4 1 8 TO 1 1 1 2 1 4 1 8 TOPOS 1 1 1 2 1 4 1 8
NAND_AND_FETCH l 1 4 TRADITIONAL E J 1 4
NAND_AND_FETCH I I 8 TRADITIONAL E J I 8
NEAREST X R 4 R 8 ANSI PGI E S R 4 R 8 TRADITIONAL
NEQV I 1 1 1 2 1 4 I 8 PGI E R 4 R 8 TRADITIONAL CrayPtr L 1 L 2 L 4 L 8 J 1 I 2 1 4 I 8 R 4 R 8 CrayPtr L 1 L 2 L 4 L 8
NINT 1 4 A R 4 R 8 ANSI G77 E P KIND 1 1 1 2 1 4 PGI O 1 8 TRADITIONAL
NOT I 1 1 1 2 1 4 1 8 ANSI G77 E PGI TRADITIONAL
NULL MOLD Any type ANSI PGI Array rank any TRADITIONAL
NUM_IMAGES 1 4 TRADITIO
62. TRUE

PSC_OMP_AFFINITY_GLOBAL
This environment variable controls whether thread global ID or local ID values are used when assigning threads to CPUs. The default is TRUE, so that global ID values are used for calculating thread assignments.

PSC_OMP_AFFINITY_MAP
This environment variable allows the mapping from threads to CPUs to be fully specified by the user. It must be set to a list of CPU identifiers separated by commas. The list must contain at least one CPU identifier, and entries in the list beyond the maximum number of threads supported by the implementation (256) are ignored. Each CPU identifier is a decimal number between 0 and one less than the number of CPUs in the system (inclusive).

The implementation generates a mapping table that enumerates the mapping from each thread to CPUs. The CPU identifiers in the PSC_OMP_AFFINITY_MAP list are inserted in the mapping table starting at the index for thread 0 and increasing upwards. If the list is shorter than the maximum number of threads, then it is simply repeated over and over again until there is a mapping for each thread. This repeat feature allows short lists to be used to specify repetitive thread mappings for all threads.

PSC_OMP_CPU_STRIDE
This specifies the striding factor used when mapping threads to CPUs. It takes an integer value in the range of 0 to the nu
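The PSC_OMP_AFFINITY_MAP rules in excerpt 62 (truncate at 256 entries, validate each CPU identifier, repeat a short list) can be sketched as follows. This is an illustrative model of the documented behavior, not the library's implementation; `build_affinity_table` and its parameters are hypothetical names, and how the real runtime reports an invalid entry is not stated in the text, so the sketch simply raises an error.

```python
MAX_THREADS = 256  # implementation limit quoted in the text


def build_affinity_table(map_text, num_cpus, max_threads=MAX_THREADS):
    """Expand a comma-separated CPU list into a thread-to-CPU mapping table."""
    # int() raises ValueError for an empty or malformed entry.
    cpus = [int(token) for token in map_text.split(",")]
    for cpu in cpus:
        if not 0 <= cpu < num_cpus:
            raise ValueError("CPU identifier %d is outside 0..%d" % (cpu, num_cpus - 1))
    # Entries beyond the maximum number of threads are ignored.
    cpus = cpus[:max_threads]
    # A short list is repeated until every thread has a mapping.
    return [cpus[thread % len(cpus)] for thread in range(max_threads)]


if __name__ == "__main__":
    table = build_affinity_table("0,2", num_cpus=4)
    print(table[:6])
```

With the list "0,2" on a 4-CPU system, even-numbered threads map to CPU 0 and odd-numbered threads to CPU 2, which is the "repetitive thread mapping" the text refers to.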
63. The next set of refinements in the execute targets are the options with the peak prefix. For example, if the best results were obtained with -O2, then the next target to try will be peak O2. Here is a summary of the target usage:

    Option in try5 with best results    Use this target for next run
    -O2                                 peak O2
    -O3                                 peak O3
    -O3 -OPT:Ofast                      peak O3
    -O3 -ipa                            peak O3
    -Ofast                              peak Ofast

This progressive refinement is shown in more detail in section 7.9.8.3 and section 7.9.8.4.

7.9.5 Using an External Configuration File to Modify pathopt2.xml

It is possible to build hierarchies of lists and to construct new execution targets by combining existing ones. The way to do this without modifying pathopt2.xml is to create an external configuration file, then use the -g option in the pathopt2 command line to load it in. The XML files are processed in order, as if they were concatenated. The -g option can be repeated to load in more than one file. The -t option chooses the execution target as before. The rules for using the option remain the same.

Here is an example of an external configuration file that extends the try5 list with a 6th possibility:

    <config>
      <execute name="try6">
        <choose k="1">
          <source from="try5_list"/>
          <option>-O1</option>
        </choose>
      </execute>
    </config>

7.9.6 PSC_GENFLAGS Environment Variab
64. World from thread', TID
    !$OMP END CRITICAL
    !$OMP MASTER
    !$OMP CRITICAL
    ! Only master thread does this
    !$    NTHREADS = OMP_GET_NUM_THREADS()
          PRINT *, 'Number of threads', NTHREADS
    !$OMP END CRITICAL
    !$OMP END MASTER
    ! All threads join master thread and disband
    !$OMP END PARALLEL
          END

The !$ before some of the lines are conditional compilation tokens. These lines are ignored when compiled without -mp.

We compile omphello.f for OpenMP with this command:

    pathf95 -c -mp omphello.f

Now we link it, again using -mp:

    pathf95 -mp omphello.o -o omphello.out

We set the environment variable for the number of threads with this command:

    export OMP_NUM_THREADS=5

Now run the program:

    ./omphello.out
    Hello World from thread 1
    Hello World from thread 2
    Hello World from thread 3
    Hello World from thread 0
    Number of threads 5
    Hello World from thread 4

The output from the different threads can be in a different order each time the program is run. We can change the environment variable to run with two threads:

    export OMP_NUM_THREADS=2

Now the output looks like this:

    ./omphello.out
    Hello World from thread 0
    Number of threads 2
    Hello World from thread 1

The same program can be compiled and linked without -mp, and the directives will be ignored. We compile the program without m
65. any subset of the CPU cores except the empty set. Examples include a single CPU core, all CPU cores on a particular socket, and all CPU cores on the system. Affinity may be set or retrieved from the command line using the taskset utility or similar. Run-time libraries such as the PathScale OpenMP run-time library may automatically set affinity in order to optimize thread placement. Also, application programs may themselves set affinity if required.

PSC_OMP_AFFINITY (TRUE or FALSE)
When TRUE, the operating system's affinity mechanism (where available) is used to assign threads to CPUs; otherwise no affinity assignments are made. If the OpenMP program is run with one initial thread (OMP_NUM_THREADS is one), or the machine has one CPU, the default value is FALSE; otherwise the default value is TRUE.

The rationale for this default is that it is useful to make affinity assignments for multi-threaded programs for performance reasons, but that single-threaded programs should be run without explicit affinity assignments so that they can be scheduled freely by the operating system, just like any other serial program generated by the compiler. These defaults can of course be changed by explicitly setting PSC_OMP_AFFINITY to TRUE or FALSE.

An interesting case is when multiple OpenMP processes are run on the same node, e.g., using MPI. The OpenMP library has no specific knowledge of MPI, and each OpenMP process has no knowledge of other OpenM
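The default rule for PSC_OMP_AFFINITY in excerpt 65 reduces to a small decision, sketched here under stated assumptions: `affinity_enabled` is a hypothetical helper, the environment is passed in as a plain dictionary, and any explicit value other than TRUE (case-insensitive) is treated as FALSE, which the text does not specify.

```python
def affinity_enabled(env, num_threads, num_cpus):
    """Decide whether thread-to-CPU affinity assignments are made."""
    explicit = env.get("PSC_OMP_AFFINITY")
    if explicit is not None:
        # An explicit setting overrides the default.
        return explicit.strip().upper() == "TRUE"
    # Default: FALSE for a single thread or a single CPU, TRUE otherwise.
    return not (num_threads == 1 or num_cpus == 1)


if __name__ == "__main__":
    print(affinity_enabled({}, num_threads=1, num_cpus=8))
    print(affinity_enabled({}, num_threads=4, num_cpus=8))
    print(affinity_enabled({"PSC_OMP_AFFINITY": "FALSE"}, 4, 8))
```

The single-thread case is the one worth remembering: a run with OMP_NUM_THREADS set to one is scheduled freely by the operating system unless affinity is switched on explicitly.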
66. b(100)
    do i = 2, 100
      a(i) = b(i) - b(i-1)
    enddo

Because a and b are dummy arguments, the compiler relies on the assumption that a and b are in non-overlapping areas of memory when it optimizes the program. The resulting program, when run, will give wrong results.

Programmers occasionally break this aliasing rule, and as a result their programs get the wrong answer only under high levels of optimization. This sort of bug frequently is thought to be a compiler bug, so we have added this option to the compiler for testing purposes. If your failing program gets the right answer with -OPT:alias=no_parm or -WOPT:fold=off, then it is likely that your program is breaking this Fortran aliasing rule.

3.10.3 Fortran malloc Debugging

The PathScale Compiler Suite includes a feature to debug Fortran memory allocations. By setting the environment variable PSC_FDEBUG_ALLOC, memory allocations will be initialized during execution to the following values:

    PSC_FDEBUG_ALLOC    Value
    ZERO                0
    NaN                 0xffa5a5a5 (4-byte NaN)
    NaN8                0xffa5a5a5fff5a5a5ll (8-byte NaN)

For example, to initialize all memory allocations to zeroes, set PSC_FDEBUG_ALLOC=ZERO before running the program. The four-byte and eight-byte NaNs will only initialize arrays that are aligned with their width (32 and 64 bits, respectively).

3.10.4 Arguments Copie
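The aliasing hazard discussed at the start of excerpt 66 can be reproduced outside Fortran. The sketch below is an analogy in Python, not compiler output: it assumes the loop body is a difference of neighbouring elements (the operator is not legible in the excerpt), and `reordered` models one legal optimization, loading all of b before storing into a, that is only valid when a and b do not overlap.

```python
def sequential(a, b):
    """Execute the loop exactly in source order, one store per iteration."""
    for i in range(1, len(b)):
        a[i] = b[i] - b[i - 1]


def reordered(a, b):
    """Model an optimizer that reads all of b first, assuming a and b are distinct."""
    diffs = [b[i] - b[i - 1] for i in range(1, len(b))]
    a[1:] = diffs


if __name__ == "__main__":
    # Distinct arrays: both orders agree.
    b = [1.0, 2.0, 4.0, 8.0]
    a1, a2 = [0.0] * 4, [0.0] * 4
    sequential(a1, b)
    reordered(a2, b)
    print(a1 == a2)

    # Same array passed twice (the rule violation): the results differ.
    c1 = [1.0, 2.0, 4.0, 8.0]
    c2 = [1.0, 2.0, 4.0, 8.0]
    sequential(c1, c1)
    reordered(c2, c2)
    print(c1, c2)
```

When the same list is passed for both arguments, each store in the sequential version changes a value that a later iteration reads, so the two orders diverge. That is the sense in which a program breaking the aliasing rule "gets the wrong answer only under high levels of optimization".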
67. be nonnegative and less than the bit size of the data.

secnds
Returns the number of seconds since midnight in the local time zone, minus the argument t.

second
The function form returns the sum of user and system CPU time consumed by the process since the start of execution. The subroutine form sets seconds to that value.

setbuf
This is similar to the C library function setbuf. To disable buffering on the specified logical unit so that output appears immediately, pass a variable of type character(len=0) or type character*0. To use a particular buffer in place of the default buffer for that logical unit, pass a character string whose length is greater than zero. The logical unit must be appropriate for sequential formatted output. In case of error, the function returns errno or a Fortran iostat error code; otherwise it returns zero. Note that you must enable this on the command line with -intrinsic=setbuf or -intrinsic=EVERY.

setlinebuf
Similar to the C library function setlinebuf, this causes the specified logical unit to flush buffered output at the end of every line and before any read from the terminal. The logical unit must be appropriate for sequential formatted output. In case of error, the function returns errno or a Fortran iostat error code; otherwise it returns zero. Note that you must enable this on the
68. behavior for PathScale's OpenMP in C/C++ and Fortran. For more information on OpenMP and the OpenMP specification, please see the OpenMP website at http://www.openmp.org.

8.2 Autoparallelization

Under autoparallelization, the compiler tries to parallelize program code without depending on user directives. Autoparallelization is invoked by specifying the -apo option on the compile and link lines:

    pathf95 -apo -c foo.F95
    pathf95 -apo -o foobar foo.o bar.o

Since the compiler is only able to parallelize a subset of the loops that the user knows are parallelizable, OpenMP directives are always helpful. OpenMP directives are not seen by the compiler unless -mp is specified. Thus, for programs that contain OpenMP directives, autoparallelization can be combined with OpenMP to additionally parallelize code that does not contain OpenMP directives. In this case, it is good to specify the -apo and -mp options together:

    pathf95 -apo -mp -c foo.F95
    pathf95 -apo -mp -o foobar foo.o bar.o

Other than the OpenMP directives, the compiler currently does not implement any additional directives to help the compiler in its autoparallelization analysis.

Many codes benefit from autoparallelization, and the extent of the benefit may vary with the characteristics of the program and data set being used. There are cases where autoparallelization causes small performance degradation of an application. This happens be
69. benefit of unsafe optimizations. Examples of unsafe optimizations include the following.

Alias Analysis

Both C and Fortran have occasions where it is possible that two variables might occupy the same memory. For example, in C, two pointers might point to the same location, such that writing through one pointer changes the value of the variable pointed to by another. While the C standard prohibits some kinds of aliasing, many real programs violate these rules, so the aliasing behavior of the compiler is controlled by the -OPT:alias flag. See section 7.7.4.2 for more information.

Aliases are hidden definitions and uses of data due to:
- Accesses through pointers
- Partial overlap in storage locations (e.g., unions in C)
- Procedure calls for non-local objects
- Raising of exceptions

The compiler normally has to assume that aliasing will occur. The compiler does alias analysis to identify when there is no alias, so later optimizations can be performed. Certain C and C++ language rules allow some levels of alias analysis. Fortran has additional rules which make it possible to rule out aliasing in more situations: subroutine parameters have no alias, and side effects of calls are limited to global variables and actual parameters.

For C or C++, the coding style can help the compiler make the right assumptions. Using type qualifiers such as const, restrict, or volatile can help the compiler. Furthermore, if you supply some assumption
70. by default.

gather_scatter=N
This option enables gather-scatter optimizations. N can be one of the following:
    0  Disable all gather-scatter optimizations.
    1  Perform gather-scatter optimizations in non-nested IF statements (default).
    2  Perform multi-level gather-scatter optimizations.

hoistif=(ON|OFF)
This option enables or disables hoisting of IF statements inside inner loops to eliminate redundant loops. Default is ON.

ignore_feedback=(ON|OFF)
If the flag is ON, then feedback information from the loop annotations will be ignored in LNO transformations. The default is OFF.

ignore_pragmas=(ON|OFF)
This option specifies that the command-line options override directives in the source file. Default is OFF.

local_pad_size=N
This option specifies the amount by which to pad local array dimensions. The compiler automatically (by default) chooses the amount of padding to improve cache behavior for local array accesses.

minvariant, minvar=(ON|OFF)
Enable or disable moving loop-invariant expressions out of loops. The default is ON.

non_blocking_loads
(For C/C++ only.) The option specifies whether the processor blocks on loads. If not set, the default of the current processor is used.

-LNO:oinvar=(ON|OFF)
This option controls outer loop hoisting. Default is ON.

-LNO:opt=(0|1)
This option controls the LNO optimization level. The options can
71. can be created successfully during a program's execution. This number is dependent upon the load on the system, the amount of memory allocated by the program, and the amount of implementation dependent stack space allocated to each thread. If the dynamic threads mechanism is disabled, the behavior of the program is implementation dependent when more threads are requested than can be successfully created. If the dynamic threads mechanism is enabled, requests for more threads than an implementation can support are satisfied by a smaller number of threads (Section 2.3.1, page 15).

Since the implementation does not support dynamic thread adjustment, the dynamic threads mechanism is always disabled. If more threads are requested than are available, the request will be satisfied using only the available threads. The maximum number of threads that can be allocated simultaneously is limited to 256 by the implementation. Additionally, if a system call to allocate threads, memory, or other system resources does not succeed, then the runtime library will exit with a fatal error message.

If an OMP runtime library routine interface is defined to be generic by an implementation, use of arguments of kind other than those specified by the OMP_*_KIND constants is implementation dependent (Section D.3, page 111).

No generic OMP runtime library routine inte
72. characters or empty lines are also treated as comments.

If any character other than a blank character is present in the 6th character position on a line, that specifies that the line is a continuation from the previous line. The Fortran standard specifies that no more than 19 continuation lines can follow a line, but the PathScale compiler supports up to 499 continuation lines.

Source code appears between the 7th character position and the 72nd character position in the line, inclusive. Semicolons are used to separate multiple statements on a line. A semicolon cannot be the first non-blank character between the 7th character position and the 72nd character position.

Character positions 1 through 5 are for statement labels. Since statement labels cannot appear on continuation lines, the first five entries of a continuation line must be blank.

Free-form files have fewer limitations on line layout. Lines can be arbitrarily long, and continuation is indicated by placing an ampersand (&) at the end of the line before the continuation line. Statement labels can be placed at any character position in a line, as long as they are preceded by blank characters only. Comments start with a ! character anywhere on the line.

3.2 Modules

When a Fortran module is compiled, information about the module is placed into a file called MODULENAME.mod. The default lo
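The fixed-form column rules in excerpt 72 can be expressed as a small line classifier. This is a sketch for illustration only: `classify_fixed_form_line` is a hypothetical helper, the comment markers C, c, * and ! in column 1 are taken from general Fortran practice because the start of the excerpt is cut off, and tab handling and inline comments are not modelled.

```python
def classify_fixed_form_line(line):
    """Return (kind, label, text) for one fixed-form source line.

    kind is "comment", "continuation" or "statement".
    """
    # Empty lines and lines flagged in column 1 are comments (assumed markers).
    if not line.strip() or line[0] in "Cc*!":
        return ("comment", "", "")
    padded = line.ljust(72)[:72]      # only columns 1..72 are significant
    label = padded[0:5].strip()       # columns 1-5: statement label
    text = padded[6:72].strip()       # columns 7-72: source text
    if padded[5] != " ":              # column 6 non-blank: continuation line
        return ("continuation", "", text)
    return ("statement", label, text)


if __name__ == "__main__":
    for source in ["C a comment", "   10 X = 1", "     & + Y"]:
        print(classify_fixed_form_line(source))
```

Anything past column 72 is dropped by the slice, which mirrors the rule that source code ends at the 72nd character position.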
73. control the level of IEEE 754 compliance through options. Relaxing the level of compliance allows the compiler greater latitude to transform the code for improved performance. The following subsections discuss some of those options.

7.7.4.1 Arithmetic

Sometimes it is possible to allow the compiler to use operations that deviate from the IEEE 754 standard to obtain significantly improved performance, while still obtaining results that satisfy the accuracy requirements of your application.

The flag regulating the level of conformance to ANSI/IEEE 754-1985 floating-point roundoff and overflow behavior is -OPT:IEEE_arithmetic=N, where N is 1, 2, or 3.

-OPT:IEEE_arithmetic
    1  Requires strict conformance to the standard.
    2  Allows use of any operations as long as exact results are produced. This allows less accurate inexact results. For example, X*0 may be replaced by 0, and X/X may be replaced by 1, even though this is inaccurate when X is +inf, -inf, or NaN. This is the default level at -O3.
    3  Means to allow any mathematically valid transformations. For example, replacing x/y by x*recip(y).

For more information on the defaults for IEEE arithmetic at different levels of optimization, see Table 7-3.

7.7.4.2 Roundoff

Use -OPT:roundoff to identify the extent of roundoff error the compiler is allowed to introduce:
    0  No roundoff error
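The level-2 example in excerpt 73 (taken here to be the rewrites X*0 to 0 and X/X to 1) is easy to test numerically. The sketch below uses ordinary IEEE 754 double arithmetic in Python to show where those rewrites change the result; `folding_is_safe` is a hypothetical helper and must be called with a nonzero argument, since Python raises an exception on division by zero instead of returning NaN.

```python
import math


def folding_is_safe(x):
    """Report whether X*0 -> 0 and X/X -> 1 preserve the IEEE result for x (x != 0)."""
    mul_ok = (x * 0.0 == 0.0)   # false when x*0 is NaN
    div_ok = (x / x == 1.0)     # false when x/x is NaN
    return mul_ok, div_ok


if __name__ == "__main__":
    for value in [2.5, math.inf, -math.inf, math.nan]:
        print(value, folding_is_safe(value))
```

For finite nonzero values both rewrites are harmless; for infinities and NaN the strict result is NaN, so a program that relies on NaN propagation should stay at level 1.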
74. data in the form used by modern debuggers such as pathdb or GDB. This format is known as DWARF 2.0 and is incorporated directly into the object files. Code that has been compiled using -g will be capable of being debugged using pathdb, GDB, or other debuggers. See the QLogic PathScale Debugger User Guide for more information on using pathdb.

It is advisable to use the -O0 level of optimization in conjunction with the -g flag, since code rearrangement and other optimizations can sometimes make debugging difficult. If -g is specified without an optimization level, then -O0 is the default.

Dealing with Uninitialized Variables

Uninitialized variables may cause your program to crash or to produce incorrect results. New options have been added to help identify and deal with uninitialized variables in your code. These options are -trapuv, -Wuninitialized, and -zerouv.

The -trapuv option works by initializing local variables to NaN (floating point not-a-number) and setting the CPU to detect floating point calculations involving NaNs. Floating point calculations are operations such as sin, sqrt, compare, etc. If a NaN is detected, the application will abort. Assignments are not considered floating point calculations, and so x=y doesn't trap even if y is NaN.

The -trapuv option affects local scalar and array
75. default is OFF.

emit_pfetch=setting
Writes prefetch information, as comments, in the transformed source file. setting can be either ON or OFF. The default is OFF. In the listing, PREFETCH identifies a prefetch and includes the variable reference (with an offset in bytes), an indication of read/write, a stride for each dimension, and a number in the range from 1 (low) to 3 (high), which reflects the confidence in the prefetch analysis. Prefetch identifies the reference(s) being prefetched by the PREFETCH descriptor. The comments occur after a read/write to a variable and note the identifier of the PREFETCH spec for each level of the cache.

ftn_file=file
Write the program to file. By default, the program is written to file.w2f.f.

linelength=N
Set the maximum line length to N characters.

show=setting
Write the input and output filenames to stderr. setting can be either ON or OFF. The default is ON.

-flist
Invoke all Fortran listing control options. The effect is the same as if all -FLIST options are enabled.

-fms-extensions
(For C/C++ only) Accept broken MFC extensions without warning.

-fno-asm
(For C/C++ only) Do not recognize the asm keyword.

-fno-builtin
(For C/C++ only) Do not recognize any built-in functions.

-fno-common
(For C/C++ only) Use strict ref/def initialization model.

-f[no-]exceptions
(For C++ only) -fexceptions enables exce
76. do not contain the schedutils package (RPM). The CPU affinity is represented as a bitmask, typically given in hexadecimal. Assigning a process to a specific CPU prevents the Linux scheduler from moving or splitting the process. Example: taskset 0x00000001 This would assign the process to processor 0. If an invalid mask is given, an error is returned, so when taskset returns it is guaranteed that the program has been scheduled on a valid and legal CPU. See the taskset(1) man page for more information.
Section 3 The PathScale Fortran Compiler
The PathScale Fortran compiler supports Fortran 77, Fortran 90, and Fortran 95. The PathScale Fortran compiler:
- Conforms to ISO/IEC 1539:1991 Programming languages - Fortran (Fortran 90).
- Conforms to the more recent ISO/IEC 1539-1:1997 Programming languages - Fortran (Fortran 95).
- Conforms to ISO/IEC TR 15580: Fortran: Floating-point exception handling. See also section 14 of ISO/IEC 1539-1:2004 (the Fortran 2003 standard) for a complete description.
- Conforms to ISO/IEC TR 15581: Fortran: Enhanced data type facilities.
- Conforms to ISO/IEC 1539-2: Varying length character strings (section 3.4.3).
- Conforms to ISO/IEC 1539-3: Conditional compilation (section 3.4.4).
- Supports legacy FORTRAN 77 (ANSI X3.9-1978) programs.
- Provides support for common extensions to the above language definitions.
- Links binaries generated with the GNU Fortran 77 compiler.
- Generates code that compli
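As a small illustration of the taskset affinity bitmask described above, the following Python sketch builds the mask for a set of CPU numbers, assuming the usual convention that bit i of the mask stands for processor i. The helper names affinity_mask and as_hex are hypothetical and are not part of taskset or of the PathScale tools.

```python
def affinity_mask(cpus):
    """Return the integer bitmask with one bit set per CPU number in `cpus`.

    Assumption: bit i of the mask stands for processor i.
    """
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers must be non-negative")
        mask |= 1 << cpu
    return mask


def as_hex(mask, width=8):
    """Format a mask as zero-padded hexadecimal, e.g. 0x00000001."""
    return "0x{:0{w}x}".format(mask, w=width)


if __name__ == "__main__":
    print(as_hex(affinity_mask([0])))     # mask selecting processor 0 only
    print(as_hex(affinity_mask([0, 2])))  # mask selecting processors 0 and 2
```

The first call reproduces the mask used in the taskset example; masks for several processors are formed by OR-ing the individual bits.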
77. even if the source file contains nothing other than the module. That object file must be linked with the rest of the program. If a single command compiles and links the entire program, this will happen automatically, but if you use a separate command to link objects together, you must be careful not to omit object files resulting from source files which contain only modules. The order of object files in such a command does not matter. For example:
pathf95 -c mymodule.f95
pathf95 -c myprogram.f95
pathf95 myprogram.o mymodule.o
Notice that a source file containing multiple modules will generate one object file, which takes its name from the source file, plus multiple module information (.mod) files, which take their names from the names of the modules themselves. For example, generate MYMODULE1.mod, MYMODULE2.mod, MYMODULE3.mod, and my3modules.o:
pathf95 -c my3modules.f95
Then generate the main program, which uses the modules:
pathf95 -c myprogram.f95
pathf95 my3modules.o myprogram.o
3.2.3 Module-related Error Messages
Error messages report the error as the first line in the module, even if the real error is further inside the module. The real error is reported after this first standard message. An example is given below. Here is a program, hellow.f95, which contains this module:
MODULE HELLOW
CONTAINS
SUBROUTINE HELLO
SPRINTZ Hello World
END SUBROUTINE HELLO
END MODULE HELLOW
78. except possibly the last. The size of the initial piece is implementation dependent (Table 1, page 17). The size of the initial piece is given by the following equation:
chunk_size = MAX(MIN(ROUNDUP(remaining_size / (number_of_threads * PSC_OMP_GUIDED_CHUNK_DIVISOR)), PSC_OMP_GUIDED_CHUNK_MAX), minimum_chunk_size)
Where:
- remaining_size is the number of iterations of the loop
- number_of_threads is the number of threads in the team
- PSC_OMP_GUIDED_CHUNK_DIVISOR is the value of the PSC_OMP_GUIDED_CHUNK_DIVISOR environment variable (defaults to 2)
- PSC_OMP_GUIDED_CHUNK_MAX is the value of the PSC_OMP_GUIDED_CHUNK_MAX environment variable (defaults to 300)
- minimum_chunk_size is the size of the smallest piece (this is the value of chunk in the SCHEDULE directive)
- ROUNDUP(x) rounds x upwards to the nearest higher integer
- MIN(a,b) is the minimum of a and b
- MAX(a,b) is the maximum of a and b
When SCHEDULE(RUNTIME) is specified, the decision regarding scheduling is deferred until runtime. The schedule type and chunk size can be chosen at runtime by setting the OMP_SCHEDULE environment variable. If this environment variable is not set, the resulting schedule is implementation dependent (Table 1, page 17). The default runtime schedule is static scheduling. The default chunk
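The guided chunk-size rule above can be sketched in a few lines of Python. This is a minimal sketch, not the library's implementation: it reads the equation as MAX(MIN(ROUNDUP(...), PSC_OMP_GUIDED_CHUNK_MAX), minimum_chunk_size), and it assumes that each handed-out chunk reduces remaining_size and that the final chunk is clipped to what is left. The function names are hypothetical.

```python
def roundup_div(a, b):
    """ROUNDUP(a / b) for integers with b > 0: smallest integer >= a / b."""
    return -(-a // b)


def guided_chunk_size(remaining_size, number_of_threads,
                      divisor=2, chunk_max=300, minimum_chunk_size=1):
    """Chunk size per the equation above.

    divisor and chunk_max stand in for PSC_OMP_GUIDED_CHUNK_DIVISOR and
    PSC_OMP_GUIDED_CHUNK_MAX (documented defaults 2 and 300).
    """
    piece = roundup_div(remaining_size, number_of_threads * divisor)
    return max(min(piece, chunk_max), minimum_chunk_size)


def guided_chunks(iterations, number_of_threads, **settings):
    """Successive chunk sizes until every iteration is handed out.

    Assumption (not stated in the guide): remaining_size shrinks by each
    chunk, and the last chunk is clipped to the iterations that remain.
    """
    if settings.get("chunk_max", 300) < 1:
        raise ValueError("chunk_max must be at least 1")
    chunks = []
    remaining = iterations
    while remaining > 0:
        size = min(guided_chunk_size(remaining, number_of_threads, **settings),
                   remaining)
        chunks.append(size)
        remaining -= size
    return chunks


if __name__ == "__main__":
    print(guided_chunks(100, 4))
```

With the default divisor of 2, each chunk is roughly half of an even per-thread share of the remaining work, so chunk sizes shrink as the loop drains; the cap and the minimum bound that shrinkage from both sides.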
79. guided scheduling can be fairly sensitive to their setting. See section 8.9.2 for the interpretation of these. By default, the OpenMP library employs spin locks for synchronization, and these loops can be tuned for performance using the PSC_OMP_THREAD_SPIN and PSC_OMP_LOCK_SPIN environment variables. It may be desirable to turn off the spinning and use blocking pthread calls instead for OpenMP applications that use multiple threads per CPU. This is fairly uncommon, and in the usual case the use of spin locks is a significant optimization over the use of blocking pthread calls. See section 8.9.2 for details on these environment variables.
Using Feedback Data 8 30
If an OpenMP program is instrumented via the -fb-create option to generate feedback data in feedback-directed compilation, the instrumented executable should only be run under a single thread. This can be effected via the OMP_NUM_THREADS environment variable. This is because the instrumentation library (libinstr.so) used during execution does not support simultaneous updates of the feedback data by multiple threads. Running the instrumented executable under multiple threads can result in segmentation faults.
8.15 Other Resources for OpenMP
For more information on OpenMP, you might also find these resources useful:
- At the OpenMP home page h
80. include the option -lstdc++. Note that the pathcc (C language) user needs to add -lm to the link line when calling libm functions. The second pass of feedback compilation may require an explicit -lm.
Debugging and Troubleshooting C/C++ 4 4
The flag -g tells the PathScale C and C++ compilers to produce data in the form used by modern debuggers such as pathdb or GDB. This format is known as DWARF 2.0 and is incorporated directly into the object files. Code that has been compiled using -g can be debugged using pathdb, GDB, or other debuggers. The -g option automatically sets the optimization level to -O0 unless an explicit optimization level is provided on the command line. Debugging at higher levels of optimization is possible, but the code transformations performed by the optimizations may make it more difficult. See section 10 for more information on troubleshooting and debugging. See the QLogic PathScale Debugger User Guide for more information on pathdb.
Unsupported GCC Extensions 4 6
The PathScale C and C++ Compiler Suite supports most of the C and C++ extensions supported by the GCC Version 3.3.1 Suite. In this release, we do not support the following extensions:
For C:
- Nested functions
- Complex integer data type: Complex integer data types are not supported.
Although the PathScale Compiler
81. information about the -o flag. Use the -ipa switch to enable inter-procedural analysis:
pathf95 -c -ipa matrix.f90
pathf95 -c -ipa prog.f90
pathf95 -ipa matrix.o prog.o -o prog
Note that the link line also specifies the -ipa option. This is required to perform the IPA link properly. See section 7.3 for more information on IPA.
NOTE: The compiler typically allocates data for Fortran programs on the stack for best performance. Some major Linux distributions impose a relatively low limit on the amount of stack space a program can use. When you attempt to run a Fortran program that uses a large amount of data on such a system, it will print an informative error message and abort. You can use your shell's ulimit (bash) or limit (tcsh) command to increase the stack size limit to a point where the program no longer crashes, or remove the limit entirely. See section 3.11 for more information on Fortran compiler stack size.
3.1.1 Fixed-form and Free-form Files
Fixed-form files follow the obsolete Fortran standard of assigning special meaning to the first 6 character positions of each line in a source file. If a C, !, or * character is present in the first character position on a line, that specifies that the remainder of the line is to be treated as a comment. If a ! is present at any character position on a line except for the 6th character position, then the remainder of that line is treated as a comment. Lines containing only blank
82. is expected to be true at this point. If it is not true when the program runs, execution stops with an output of where the program stopped and what the assertion was that failed.
Set of standard flags used in SPEC runs with compiler
bind
    To link subroutines in a program. Applications are often built with the help of many standard routines or object classes from a library, and large programs may be built as several program modules. Binding links all the pieces together. Symbolic tags are used by the programmer in the program to interface to the routine. At binding time, the tags are converted into actual memory addresses or disk locations. Or (bind) to link any element, tag, identifier, or mnemonic with another so that the two are associated in some manner. See alias and linker.
BSS (Block Started by Symbol)
    Section in a Fortran output object module that contains all the reserved but uninitialized space. It defines its label and the reserved space for a given number of words.
CG
    Code generation; a pass in the PathScale Compiler.
common block
    A Fortran term for variables shared between compilation units (source files). Common blocks are a Fortran 77 language feature that creates a group of global variables. The QLogic PathScale compiler does sophisticated padding of common blocks for higher performance when Inter-Procedural Analysis (IPA) is in use.
constant
    A constant is a variable
83. -noexpopt
    Do not optimize exponentiation operations.
-noextend-source
    Restrict Fortran source code lines to columns 1 through 72. See the -coln and -extend-source options for more information on controlling line length.
-nog77mangle
    The PathScale Fortran compiler modifies Fortran symbol names by appending an underscore, so a name like foo in a source file becomes foo_ in an object file. However, if a name in a Fortran source file contains an underscore, the compiler appends a second underscore in the object file, so foo_bar becomes foo_bar__ and baz_ becomes baz___. The -nog77mangle option suppresses the addition of this second underscore.
-no-gcc
    (For C/C++ only) -no-gcc turns off the __GNUC__ and other predefined preprocessor macros.
-noinline
    Suppress expansion of inline functions. When this option is specified, copies of inline functions are emitted as static functions in each compilation unit where they are called. If you are using IPA, -IPA:inline=OFF must be specified to suppress inlining.
-no-pathcc
    -no-pathcc turns off the __PATHSCALE__ and other predefined preprocessor macros.
-nostartfiles
    Do not use standard system startup files when linking.
-nostdinc
    Direct the system to skip the standard directory /usr/include when searching for #include files and files named on INCLUDE statements.
-nostdinc++
    Do not search for header files in the stand
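The underscore rule described under nog77mangle can be written out as a tiny Python sketch. It only models the suffix rule as stated in the text (one underscore always, a second one when the source name already contains an underscore); it is not the compiler's code, and the function name mangle and its second_underscore parameter are hypothetical.

```python
def mangle(name, second_underscore=True):
    """Return the object-file symbol for a Fortran source name.

    One underscore is always appended. A second one is appended when the
    name already contains an underscore, unless that behavior is
    suppressed (second_underscore=False models the suppressing option).
    """
    suffix = "_"
    if second_underscore and "_" in name:
        suffix += "_"
    return name + suffix


if __name__ == "__main__":
    for source_name in ("foo", "foo_bar", "baz_"):
        print(source_name, "->", mangle(source_name),
              "| second underscore suppressed:",
              mangle(source_name, second_underscore=False))
```

Knowing the resulting symbol is mainly useful when calling Fortran routines from C or when reading linker errors about unresolved names.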
84. not suppress the invocation of the global optimizer, though the invoked backend phases will honor the specified optimization level.
- Apart from the optimization level flags, only flags belonging to the following option groups are processed: -LNO, -OPT, and -WOPT.
3.4 Compiler and Runtime Features
The compiler offers three different preprocessing options: -cpp, -ftpp, and now -fcoco.
3.4.1 Preprocessing Source Files with -cpp
Before being passed to the compiler front-end, source files are optionally passed through a source code preprocessor. The preprocessor searches for certain directives in the file and, based on these directives, can include or exclude parts of the source code, include other files, or define and expand macros. By default, Fortran .F, .F90, and .F95 files are passed through the C preprocessor (-cpp).
3.4.2 Preprocessing Source Files with -ftpp
The Fortran preprocessor (-ftpp) accepts many of the same directives as the C preprocessor but differs in significant details (for example, it does not allow C-style comments beginning with /* to extend across multiple lines). You should use the -cpp option if you wish to use the C preprocessor on Fortran source files ending in .f90 or .f95. These files will not be preprocessed unless you use either -ftpp (to select the Fortran preprocessor) or -cpp t
85. of 1 makes this unrolling complementary to what is done in the code generator. This unrolling is not affected by the unrolling options under the -OPT group.
-WOPT:val=(0|1|2)
    Control the number of times the value-numbering optimization is performed in the global optimizer, with the default being 1. This optimization tries to recognize expressions that will compute identical runtime values and changes the program to avoid re-computing them.
-W[no-]overloaded-virtual
    (For C++ only) The -Woverloaded-virtual option will warn when a function declaration hides virtual functions. -Wno-overloaded-virtual tells the compiler not to warn when a function declaration hides virtual functions.
-W[no-]packed
    (For C/C++ only) -Wpacked warns when the packed attribute of a struct has no effect. -Wno-packed tells the compiler not to warn when the packed attribute of a struct has no effect.
-W[no-]padded
    (For C/C++ only) -Wpadded warns when padding is included in a struct. -Wno-padded tells the compiler not to warn when padding is included in a struct.
-W[no-]parentheses
    (For C/C++ only) -Wparentheses warns about possible missing parentheses. -Wno-parentheses tells the compiler not to warn about possible missing parentheses.
-Wno-pmf-conversions
    (For C++ only) Do not warn about converting PMFs to plain pointers.
-W[no-]pointer-arith
    (For C/C++ only) -Wpointer-arith warns about function pointer
86. ranging 0-365. Positive if daylight savings time is in effect, zero if not, or negative if unknown.
mclock, mclock8
    Fortran interface to the C library function clock. Returns the number of clock ticks of CPU time since the start of execution of the process, or -1 if this is not known.
or
    Bitwise Boolean OR.
perror
    Like the C library function perror, prints on the stderr stream the string, followed by a colon, a blank, and the message corresponding to the error code from the C library value errno.
rand
    Fortran interface to POSIX function rand. Returns a uniform pseudorandom integer. If flag is 0, return the next number in the current sequence; if flag is 1, call POSIX function srand(0); otherwise call srand(flag) to seed a new sequence.
realpart
    Real part of a complex number; synonym for standard intrinsic real, which in Fortran 95 preserves the precision of its argument.
rename
    Fortran interface to the C library function rename. Change name of file path1 to path2. The function form returns 0 on success or an error code from the C library value errno. The subroutine sets status to the value which the function would return. Trailing blanks in file are ignored; you can prevent this by using char(0) to place a null character after the last significant character.
rshift
    Arithmetic (sign-preserving) bitwise right shift. Shift count must
87. real, logical, and double precision objects. This option is a synonym for the pair of options -r8 -i8. Calling a routine in a specialized library, such as SCSL, requires that its 64-bit entry point be specified when 64-bit data are used. Similarly, its 32-bit entry point must be specified when 32-bit data are used.
-dumpversion
    Show the version of the compiler being used and nothing else.
-E
    Run only the source preprocessor files, without considering suffixes, and write the result to stdout. This option overrides the -nocpp option. The output file contains line directives. To generate an output file without line directives, see the -P option. For more information on controlling source preprocessing, see the -cpp, -ftpp, and -macro-expand options.
-extend-source
    (Fortran only) Specify a 132-character line length for fixed-format source lines. By default, fixed-format lines are 72 characters wide. For more information on controlling line length, see the -coln option.
-fb-create <path>
    Used to specify that an instrumented executable program is to be generated. Such an executable is suitable for producing feedback data files with the specified prefix for use in feedback-directed compilation (FDO). The commonly used prefix is fbdata. This is OFF by default.
-fb-opt <prefix for feedback data files>
    Used to specify feedback-directed compilation (FDO) by extracting feedback data from files with the spec
88. -show
    Print the passes as they execute, with their arguments and their input and output files.
-show-defaults
    Show the processor target settings and the default options in the compiler.defaults(5) file. For C/C++, also shows the GNU GCC version compatibility.
-show0
    Show what phases would be called, but don't invoke anything.
-showt
    Show time taken by each phase.
-static
    Suppress dynamic linking at runtime for shared libraries; use static linking instead.
-static-data
    Statically allocate all local variables. Statically allocated local variables are initialized to zero and exist for the life of the program. This option can be useful when porting programs from older systems in which all variables are statically allocated. When compiling with the -static-data option, global data is allocated as part of the compiled object file (.o file). The total size of any .o file cannot exceed 2 GB, but the total size of a program loaded from multiple .o files can exceed 2 GB. An individual common block cannot exceed 2 GB, but you can declare multiple common blocks each having that size. If a parallel loop in a multi-processed program calls an external routine, that external routine cannot be compiled with the -static-data option. You can mix static and multi-processed object files in the same executable, but a static routine cannot be called from within a parallel region.
-static-libgcc
    Force the use of the static libgcc library.
std c 98 st
89. size is set to the number of iterations of the loop divided by the number of threads in the team, rounded up to the nearest integer. The loop iterations are partitioned into chunks of the default chunk size. If the number of iterations of the loop is not an exact integer multiple of the number of threads in the team, the last chunk will be smaller than the default chunk size, and in some cases it may contain zero loop iterations. The chunks are assigned to threads starting from the thread with local index 0. The thread with the highest local index will receive the last chunk, and this may be smaller than the others or even zero. The loop iterations which are executed by a thread are contiguous in terms of their loop iteration number.
NOTE: The PSC_OMP_STATIC_FAIR environment variable can be used to change the default static scheduling algorithm to an alternate scheme where the iterations are more equally balanced over the threads in cases where the division is not exact.
In the absence of the SCHEDULE clause, the default schedule is implementation dependent (section 2.3.1). In the absence of the SCHEDULE clause, the default schedule is static scheduling. The default chunk size is set to the number of iterations of the loop divided by the number of threads in the team, rounded up to the nearest integer. The loop iterations are partitioned into chunks of the default chunk size. If the number of iterations of the loop is not an exact integer multiple of t
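The default static partitioning described above (chunk size rounded up, chunks handed to threads in index order, last chunk possibly short or empty) can be checked with a short Python sketch. It follows the prose only; the function name static_default_partition is hypothetical, and the alternate PSC_OMP_STATIC_FAIR scheme is not modeled.

```python
def static_default_partition(iterations, threads):
    """Return one (start, end) half-open iteration range per thread.

    The chunk size is iterations / threads rounded up; thread 0 gets the
    first chunk, and the highest-numbered threads may get a short or
    empty range when the division is not exact.
    """
    if threads < 1:
        raise ValueError("threads must be at least 1")
    chunk = -(-iterations // threads)  # integer round-up of the division
    ranges = []
    start = 0
    for _ in range(threads):
        end = min(start + chunk, iterations)
        ranges.append((start, end))
        start = end
    return ranges


if __name__ == "__main__":
    for n in (10, 9):
        sizes = [end - start for start, end in static_default_partition(n, 4)]
        print(n, "iterations over 4 threads ->", sizes)
```

With 9 iterations and 4 threads the chunk size is 3, so the last thread receives no iterations at all, which is the imbalance the NOTE about PSC_OMP_STATIC_FAIR refers to.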
90. spread threads across the chips first, to make best use of that resource, before scheduling multiple threads to the cores on each chip. Let the number of CPUs in a multi-core chip be m and the number of multi-core chips in the system be n. The total number of CPUs is then n multiplied by m. There are two typical orders in which the system may number the CPUs:
- For chip index p in [0, n) and core index c in [0, m), the CPU number is p + c * n. This is core-major ordering, since incrementing the core number increases the CPU number by n, while incrementing the chip number only increases the CPU number by 1.
- For chip index p in [0, n) and core index c in [0, m), the CPU number is p * m + c. This is chip-major ordering, since incrementing the chip number increases the CPU number by m, while incrementing the core number only increases the CPU number by 1.
For core-major ordering, a linear assignment of threads to CPU numbers will have the effect of spreading threads over chips first. For chip-major ordering, the linear assignment will fill up the first chip with threads before moving to the second chip, and so forth. This behavior can be changed by setting the stride factor to the value of m. It causes the OpenMP library to spread the threads across the chips with a stride equal to the number of cores in a chip. The decision on whether to spread threads over chips or over cores first depends on what one is trying to achieve and the system architectur
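The two CPU numbering schemes above are easy to confuse, so here is a minimal Python sketch that computes CPU numbers under each ordering and shows which chip each thread lands on. The function names are hypothetical. The wrap-around rule in strided_assignment (step by the stride, then shift by one each time the end of the CPU list is passed) is an assumption made for illustration; the guide does not spell out how the OpenMP library wraps.

```python
def core_major_cpu(p, c, n, m):
    """CPU number for chip p, core c under core-major ordering: p + c * n."""
    return p + c * n


def chip_major_cpu(p, c, n, m):
    """CPU number for chip p, core c under chip-major ordering: p * m + c."""
    return p * m + c


def chip_of(cpu, n, m, ordering):
    """Recover the chip index from a CPU number for the given ordering."""
    if ordering == "core-major":
        return cpu % n
    if ordering == "chip-major":
        return cpu // m
    raise ValueError("ordering must be 'core-major' or 'chip-major'")


def linear_assignment(threads):
    """Thread t runs on CPU t."""
    return list(range(threads))


def strided_assignment(threads, stride, total_cpus):
    """Thread t runs on CPU (t * stride), wrapping with a shift of one.

    Assumed wrap rule, for illustration only; requires threads <= total_cpus.
    """
    return [(t * stride) % total_cpus + (t * stride) // total_cpus
            for t in range(threads)]


if __name__ == "__main__":
    n, m = 2, 2  # two chips with two cores each
    linear = linear_assignment(4)
    print("core-major chips:", [chip_of(x, n, m, "core-major") for x in linear])
    print("chip-major chips:", [chip_of(x, n, m, "chip-major") for x in linear])
    strided = strided_assignment(4, m, n * m)
    print("chip-major, stride m:", [chip_of(x, n, m, "chip-major") for x in strided])
```

Under chip-major numbering the linear assignment fills chip 0 first, while a stride of m alternates between chips, matching the behavior the text describes.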
91. string. The subroutine form sets result to that string.
Set the date argument to a string of the form 16-Mar-06 (DD-MMM-YY).
Fortran interfaces to C library functions j0l, j1l, jnl, y0l, y1l, and ynl (Bessel functions).
Specific name for a function that converts its argument to type complex 16.
Specific name for complex conjugate whose argument is type complex 16.
Fortran interface to C library function in erf and erfc(3m).
Specific name for function that converts its argument to type real 8.
Specific name for a function that returns the imaginary part of a complex 16 argument.
Specific name for a function that converts its argument to type real 8.
Find out the number of seconds of CPU time consumed by this process since the previous call to dtime, or, if there was no previous call, since the start of execution. tarray(1) gives user CPU time and tarray(2) gives system CPU time. The function form returns the sum of those times. The subroutine form sets result to the sum of those times.
Fortran interface to C library functions described in erff and erfcf(3m).
etime
    Find out the number of seconds of CPU time consumed by this process since the start of execution. tarray(1) gives user CPU time and tarray(2) gives system CPU time. The function form returns the sum of those times. The subroutine form sets result
92. that RAND is not ANSI. The solution is to build the code with the flag -intrinsic=PGI.
5.3.2 Name Mangling
Name mangling ensures that function, subroutine, and common-block names from a Fortran program or library do not clash with names in libraries from other programming languages. This makes mixing code from C, C++, and Fortran easier. See section 3.8.1 for details on name mangling.
5.3.3 Static Data
Some codes expect data to be initialized to zero and allocated in the heap. If this is the case with your code, use the -static flag when compiling.
5.4 Porting to x86_64
Keep these things in mind when porting existing code to x86_64:
- Some source packages make assumptions about the locations of libraries and fail to look in lib64-named directories for libraries, resulting in unresolved symbols during the link.
- For the x86 platform, use the -mcpu flag x86any to specify the x86 platform, like this: -mcpu x86 64
5.5 Migrating from Other Compilers 5 6
Here is a suggested step-by-step approach to migrating code from other compilers to the PathScale compilers:
1. Check the compiler name in your makefile: is the correct compiler being called? For example, you may need to add a line like this:
CC=pathcc ./configure <options>
Change the compiler in your makefile to pathcc or pathf95.
2. Check any flags that are called to be sure that t
93. the -fno-underscoring options to the pathf95 compiler. PGI Fortran and Intel Fortran's default policies correspond to our -fno-second-underscore option. Common block names are also mangled. Our name for the blank common block is the same as g77 (_BLNK__). PGI's compiler uses the same name for the blank common block, while Intel's compiler uses _BLANK__.
3.8.2 ABI Compatibility
The PathScale compilers support the official x86_64 Application Binary Interface (ABI), which is not always followed by other compilers. In particular, g77 does not pass the return values from functions returning COMPLEX or REAL values according to the x86_64 ABI. (Double precision REALs are OK.) For more details about what g77 does, see the "info g77" entry for the -ff2c flag. This issue is a problem when linking binary-only libraries such as Kazushige Goto's BLAS library or the ACML library (AMD Core Math Library); we have not tested ACML on the EM64T version of the compiler suite. Libraries such as FFTW and MPICH don't have any functions returning REAL or COMPLEX, so there are no issues with these libraries. For linking with g77-compiled functions returning COMPLEX or REAL values, see section 3.8.3. Like most Fortran compilers, we represent character strings passed to subprograms with a character pointer, and add an integer length parameter to the end of the call list.
3.8.3 Linking with g77-compiled Libraries
If you wish to link with a library
94. the number of threads in the currently executing parallel region.
int omp_get_max_threads(void)
    Return the maximum value that omp_get_num_threads may return.
int omp_get_thread_num(void)
    Return the thread number within the team.
int omp_get_num_procs(void)
    Return the number of processors available to the program.
void omp_set_dynamic(int)
    Control the dynamic adjustment of the number of parallel threads.
int omp_get_dynamic(void)
    Return a non-zero value if dynamic threads is enabled, otherwise return 0.
int omp_in_parallel(void)
    Return a non-zero value for calls within a parallel region, otherwise return 0.
Table 8-4. C/C++ OpenMP Runtime Library Routines (Continued)
void omp_set_nested(int)
    Enable or disable nested parallelism.
int omp_get_nested(void)
    Return a non-zero value if nested parallelism is enabled, otherwise return 0.
Lock routines
omp_init_lock(omp_lock_t *)
    Allocate and initialize lock, associating it with the lock variable passed in as a parameter.
omp_init_nest_lock(omp_nest_lock_t *)
    Initialize a nestable lock and associate it with a specified lock variable.
omp_set_lock(omp_lock_t *)
    Acquire the lock, waiting until it becomes available if necessary.
omp_set_nest_lock(omp_nest
95. to -O3 -ipa -OPT:Ofast -fno-math-errno -ffast-math, so similar cautions apply to it as to -O3 -OPT:Ofast. To use interprocedural analysis without the Ofast-type optimizations, use either of the following:
-O3 -ipa
-O2 -ipa
Testing different optimizations can be automated by pathopt2. This program compiles and runs your program with a variety of compiler options and creates a sorted list of the execution times for each run. The try5 target tests five flag combinations, which is easily done using pathopt2. The combinations are: -O2, -O3, -O3 -ipa, -O3 -OPT:Ofast, and -Ofast. For more information on using pathopt2, see section 7.9.
6.6 Performance Analysis
In addition to these suggestions for optimizing your code, here are some other ideas to assist you in tuning. Section 2.11 discusses figuring out where to tune your code, using time to get an overview of your code, and using pathprof to find your program's hot spots.
6.7 Optimize Your Hardware
Make sure you are optimizing your hardware as well. Section 7.8 discusses getting the best performance out of x86_64-based hardware (Opteron, Athlon 64, Athlon 64 FX, and Intel EM64T). Hardware configuration can have a significant effect on the performance of your application.
Section 7 Tuning Options
This section discusses in more depth some of the major groups of flags ava
96. to the sum of those times.
exit
    Like the C library function exit, terminate the process and return the value status to the process (usually the shell) that caused this process to execute. status defaults to 0. Open Fortran logical units are flushed and closed.
fdate
    The subroutine form is equivalent to call ctime(date, time8()). The function form is equivalent to ctime(time8()).
fget
    Like getc, but uses logical unit 5.
fgetc
    Fortran interface to the C library function getc. Reads into c a single character from logical unit unit, treating that unit as if it were a stream of bytes. The function form returns 0 for success, -1 for end of file, or an error code from the C library value errno. The subroutine sets status to the value that the function would return. Between the opening and closing of a file, you should use either stream intrinsics (fget, fgetc, fput, fputc, fseek, and ftell) or standard Fortran I/O, but not both.
flush
    Flush buffered I/O for logical unit unit. If unit is omitted, flush all logical units.
fnum
    Return the POSIX file descriptor corresponding to the open Fortran logical unit unit.
fput
    Like putc, but uses logical unit 6.
fputc
    Fortran interface to the C library function putc. Writes to logical unit unit a single character c, treating that unit as if it were a stream of bytes. The function form returns 0 for success, -1 for end of file, or an error code from the C library value errno. The subroutine sets
97. using the top program. Depending on the version of top, you should be able to view the breakdown of user, system, and idle time per CPU. Often this view can be obtained by pushing 1. You may also want to increase the update rate (e.g. with s followed by 0.5). It is sometimes possible to see the program moving from serial to parallel phases and also to see whether the work is being well distributed. If there is excessive time spent in the system or swapping, then this should also be investigated. It is best to run OpenMP applications on nodes with no other running applications. If the OpenMP application uses runtime scheduling, then try varying the runtime schedule using the OMP_SCHEDULE environment variable. A good choice of schedule and chunk size is sometimes important for performance.
NOTE: The gprof profiling (-pg) does not work in conjunction with pthreads or the OpenMP library. An alternative approach is to use OProfile, which uses hardware counters and sampling techniques to build up a profile of the system. It is possible to capture application code, dynamic libraries, kernel, modules, and drivers in a profile created by OProfile, giving insight into system-wide performance characteristics. OProfile can also attribute the samples on a thread or CPU basis, allowing
98. value 0 or non-zero. This chooses the locking mechanism used by critical sections and OMP locks.
0: user-level spin locks are disabled (uses pthread mutexes).
non-zero: user-level spin locks are enabled. This is the default.
This determines whether locking in critical sections and OMP locks is implemented with user-level spin loops or using pthread mutexes. Synchronization using pthread mutexes is significantly more expensive but frees up execution resources for other threads.
PSC_OMP_SILENT
    Set or not set. If you set PSC_OMP_SILENT to anything, then warning and debug messages from the libopenmp library are inhibited. Fatal error messages are not affected by the setting of PSC_OMP_SILENT.
PSC_OMP_STACK_SIZE
    Stack size specifications. Stack size specification follows the syntax in section 3.11. See section 8.10.1 for more details.
PSC_OMP_STATIC_FAIR
    Set or not set. The default static scheduling policy when no chunk size is specified is as follows. The number of iterations of the loop is divided by the number of threads in the team and rounded up to give the chunk size. Loop iterations are grouped into chunks of this size and assigned to threads in order of increasing thread id (within the team). If the division was not exact, then the last thread will have fewer iterations, and possibly none at all. The policy for static scheduling when no chunk size is specified can be changed to the static fair policy by defining the environme
99. will improve cache behavior for common block array accesses.
-IPA:cprop=(ON|OFF)
    Turn on or off inter-procedural constant propagation. This option identifies the formal parameters that always have a specific constant value. Default is ON. See also -IPA:aggr_cprop.
-IPA:ctype=(ON|OFF)
    When ON, causes the compiler to generate faster versions of the <ctype.h> macros such as isalpha, isascii, etc. This flag is unsafe both in multi-threaded programs and in all locales other than the 7-bit ASCII (or "C") locale. The default is OFF. Do not turn this on unless the program will always run under the 7-bit ASCII (or "C") locale and is single-threaded.
-IPA:depth=N
    Identical to maxdepth=N.
-IPA:dfe=(ON|OFF)
    Enable or disable dead function elimination. Removes functions that are inlined everywhere they are called. The default is ON.
-IPA:dve=(ON|OFF)
    Enable or disable dead variable elimination. This option removes variables that are never referenced by the program. Default is ON.
-IPA:echo=(ON|OFF)
    Option to echo (to stderr) the compile commands and the final link commands that are invoked from IPA. Default is OFF. This option can help monitor the progress of a large system build.
-IPA:field_reorder=(ON|OFF)
    Enable the re-ordering of fields in large structs based on their reference patterns in feedback compilation to minimize data cache misses. The default is O
100. with an editor you might see some pretty strange looking C code. In this case there doesn't seem to be much optimizing going on, but in codes where LNO (Loop Nest Optimization) is more important, you would see a lot of the optimizations. Verbose Flags. You can also turn on verbose flags in LNO to see vectorization activity. You would do this with the -LNO:simd_verbose flag in the compile line: pathcc -O3 -LNO:simd_verbose -c stream_d.c The output might look something like this: (stream_d.c:103) LOOP WAS VECTORIZED. (stream_d.c:119) LOOP WAS VECTORIZED. (stream_d.c:142) LOOP WAS VECTORIZED. (stream_d.c:147) LOOP WAS VECTORIZED. (stream_d.c:152) LOOP WAS VECTORIZED. (stream_d.c:157) LOOP WAS VECTORIZED. (stream_d.c:164) Nonvectorizable ops/non-unit stride. Loop was not vectorized. (stream_d.c:211) Nonvectorizable ops/non-unit stride. Loop was not vectorized. This would tell you more about what the compiler is doing with loops. You can also try the -LNO:vintr_verbose flag on the compile line: pathcc -O3 -LNO:vintr_verbose -c stream_d.c In this case the output doesn't tell you much (there is no output), because there are no intrinsic functions to get vectorized in STREAM. 8.1 OpenMP. Section 8: Using OpenMP and Autoparallelization. The
101. within the team. integer omp_get_num_procs(): Return the number of processors available to the program. call omp_set_dynamic(logical): Control the dynamic adjustment of the number of parallel threads. logical omp_get_dynamic(): Return .TRUE. if dynamic threads is enabled, otherwise return .FALSE. logical omp_in_parallel(): Return .TRUE. for calls within a parallel region, otherwise return .FALSE. call omp_set_nested(logical): Enable or disable nested parallelism. logical omp_get_nested(): Return .TRUE. if nested parallelism is enabled, otherwise return .FALSE. Lock routines. omp_init_lock(int): Allocate and initialize the lock, associating it with the lock variable passed in as a parameter. omp_init_nest_lock(int): Initialize a nestable lock and associate it with a specified lock variable. omp_set_lock(int): Acquire the lock, waiting until it becomes available if necessary. omp_set_nest_lock(int): Set a nestable lock. The thread executing the subroutine will wait until a lock becomes available and then set that lock, incrementing the nesting count. omp_unset_lock(int): Release the lock, resuming a waiting thread (if any). omp_unset_nest_lock(int): Release ownership of a nestable lock. The subroutine decrements the nesting count and releases the associated thread from ownership of the nestable lock.
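The nesting-count behavior of the nestable lock routines can be illustrated with a small model. This is a sketch in Python, assuming the semantics stated in the descriptions above (the owning thread may set the lock repeatedly, and ownership is released only when the count returns to zero); the class `NestableLock` is hypothetical and is not the OpenMP runtime's implementation.

```python
import threading


class NestableLock:
    """Toy model of an OpenMP-style nestable lock (illustration only)."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._owner = None   # thread identifier of the current owner, if any
        self.count = 0       # nesting count

    def set(self):
        me = threading.get_ident()
        if self._owner == me:
            # The owning thread sets the lock again: just bump the count.
            self.count += 1
            return
        # Any other thread waits here until the lock becomes available.
        self._mutex.acquire()
        self._owner = me
        self.count = 1

    def unset(self):
        if self._owner != threading.get_ident():
            raise RuntimeError("lock is not owned by the calling thread")
        self.count -= 1
        if self.count == 0:
            # Ownership is released only when the nesting count reaches zero.
            self._owner = None
            self._mutex.release()


if __name__ == "__main__":
    lock = NestableLock()
    lock.set()
    lock.set()
    print("nesting count after two sets:", lock.count)
    lock.unset()
    lock.unset()
    print("nesting count after two unsets:", lock.count)
```

A simple (non-nestable) lock differs in that a second set by the owning thread is not permitted; the model above only covers the nestable case.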
102. 01 ISBN 0-201-74962-9. Section 2: Compiler Quick Reference. This section describes how to get started using the PathScale Compiler Suite. The compilers follow the standard conventions of Unix and Linux compilers, produce code that follows the Linux x86_64 ABI, and run on both the AMD64 and Intel EM64T families of chips. AMD64 is the AMD 64-bit extension to the Intel IA32 architecture, often referred to as x86. EM64T is the Intel Extended Memory 64 Technology chip family. This means that object files produced by the PathScale compilers can link with object files produced by other Linux x86_64-compliant compilers such as Red Hat and SUSE GNU gcc, g++, and g77. 2.1 What You Installed. For details on installing the PathScale compilers, see the QLogic PathScale Compiler Suite Install Guide. The PathScale Compiler Suite includes optimizing compilers and runtime support for C, C++, and Fortran. Depending on the type of subscription you purchased, you enabled some or all of the following: PathScale C compiler for x86_64 and EM64T architectures; PathScale C++ compiler for x86_64 and EM64T architectures; PathScale Fortran compiler for x86_64 and EM64T architectures; Documentation; Libraries; Subscription Manager client. You must have a valid subscription (and associated subscription file) in order to run the compiler. Subscriptio
103. 1 lev: Optional. The level in the memory hierarchy to prefetch. The default is 2. If lev=1, prefetch from L2 to L1 cache. If lev=2, prefetch from memory to L1 cache. rd/wr: Optional. The default is read/write. sz: Optional. The size (in Kbytes) of the array referenced in this loop. This must be a constant. 3.3.3.2 Changing Optimization Using Directives. Optimization flags can now be changed via directives in the user program. In Fortran, the directive is used in the form: C*$* options <"list-of-options"> Any number of these can be specified inside function scopes. Each affects only the optimization of the entire function in which it is specified. The literal string can also contain an unlimited number of different options separated by spaces, and must include the enclosing quotes. The compilation of the next function reverts back to the settings specified in the compiler command line. In this release, there are limitations to the options that are processed in this options directive, and to their effects on the optimization: There is no warning or error given for options that are not processed. These directives are processed only in the optimizing backend. Thus, only options that affect optimizations are processed. In addition, it will not affect the phase invocation of the backend components. For example, specifying -O0 will
104. 1 Compiler defaults file 2 4 compiler defaults 2 4 Compilers using the C C 4 2 Compiling on alternate platforms 2 5 COMPLEX 3 22 Conditional complilation sentinels 8 3 cosin 7 17 Cray pointer 3 6 CRITICAL directive 8 25 D Debugging C C 4 6 Index 2 Fortran 3 25 general information 10 1 Default optimization level 4 2 options 2 3 Directives about 3 6 ATOMIC 8 5 B 4 BARRIER 8 5 changing optimization flags with 3 8 CRITICAL 8 5 DO 8 4 FLUSH 8 5 MASTER 8 5 ORDERED 8 5 PARALLEL 8 4 PARALLEL DO 8 5 PARALLEL SECTIONS 8 5 PARALLEL WORKSHARE 8 5 SECTIONS 8 4 SINGLE 8 5 THREADPRIVATE 8 5 WORKSHARE 8 5 Dope vector 3 12 D 1 DWARF 3 25 4 6 10 1 DYNAMIC scheduling algorithm 8 30 E Environment variables Fortran 3 24 OpenMP 8 11 8 12 pathopt2 7 35 PathScale OpenMP 8 12 Environment variables C PSC CFLAGS A 1 Environment variables C PSC CXXFLAGS 1 Environment variables Fortran F90 BOUNDS CHECK ABORT A 1 F90 DUMP MAP A 1 FTN SUPPRESS REPEATS A 1 NLS PATH A 1 PSC FDEBUG ALLOC A 1 PSC FFLAGS A 2 1 02404 13 XX QLOGIC QLogic PathScale Compiler Suite User Guide 3 0 Beta 1 TnkgtCY 9 9 9 7 7 Ea wi PSC STACK LIMIT A 2 PSC STACK VERBOSE 2 Environment variables language independent FILENV A 2 PSC COMPILER DEFAULTS PATH A 2 PSC GENFLAGS A 2 PSC PROBLEM REPORT DIR A 2 Environment variables OpenMP OMP DYNAMIC A 3 OMP NESTED A 3 OMP NUM THREADS 3 OMP SCHEDULE A 3
105. 1 8 1 RANF TRADITIONAL E RANGE X 1 1 I 2 1 4 I8 ANSI PGI E R 4 R 8 TRADITIONAL Z 8 Z 16 REAL R 4 1 1 I2 1 4 1 8 ANSI G77 E R 4 R 8 PGI Z 8 Z 16 TRADITIONAL KIND I 1 1 2 1 4 I 8 REALPART R 4 1 1 1 2 1 4 I 8 G77 E R 4 R 8 Z 8 Z 16 KIND 1 1 1 2 1 4 1 8 REMOTE Subroutine TRADITIONAL E WRITE _ BARRIER REM_IMAGES 1 4 TRADITIONAL RENAME 1 4 PATH1 C G77 PGI O PATH2 C STATUS 1 4 RENAME Subroutine G77 PATH1 C O PATH2 C STATUS 1 4 REPEAT Depends on arg STRING C ANSI PGI NCOPIES I 1 1 2 TRADITIONAL 1 4 1 8 RESHAPE ANSI PGI See Std TRADITIONAL RRSPACING X R 4 R 8 ANSI PGI E TRADITIONAL C 35 C Supported Fortran Intrinsics Table of Supported Intrinsics QLOGIC e Table C 1 Fortran Intrinsics Supported in 3 0 Continued Intrinsic Name Result Arguments Families Remarks RSHIFT I 1 1 1 2 1 4 I 8 G77 PGI E R 4 R 8 TRADITIONAL CrayPtr L 1 L 2 L 4 L 8 NEGATIVE_SHIFT 1 1 1 2 1 4 1 8 RTC TRADITIONAL E SCALE X R 4 R 8 ANSI PGI E I 1 1 1 2 1 4 I 8 TRADITIONAL SCAN 1 4 STRING C ANSI PGI E SET C TRADITIONAL BACK L 1 L 2 L 4 L 8 SECNDS R 4 T R 4 G77 PGI SECOND R 4 SECONDS R 4 G77 O SECOND Subroutine SECONDS R 4 G77 SELECTED_INT R 1 1 1 2 I 4 1 8 ANSI PGI _KIND TRADITIONAL SELECTED _ Depends on arg 1 1 1 2 1 4 1 8 ANSI PGI O
106. 1 Limited roundoff error allowed. 2 Allow roundoff error caused by re-associating expressions. 3 Any roundoff error allowed. The default roundoff level with -O0, -O1, and -O2 is 0. The default roundoff level with -O3 is 1. Listing some of the other -OPT: sub-options that are activated by various roundoff levels can give more understanding about what the levels mean. -OPT:roundoff=1 implies: -OPT:fast_exp=ON. This option enables optimization of exponentiation by replacing the run-time call for exponentiation by multiplication and/or square root operations for certain compile-time constant exponents (integers and halves). -OPT:fast_trunc implies inlining of the NINT, ANINT, AINT, and AMOD Fortran intrinsics. -OPT:roundoff=2 turns on the following sub-options: -OPT:fold_reassociate, which allows optimizations involving re-association of floating-point quantities. -OPT:roundoff=3 turns on the following sub-options: -OPT:fast_complex. When this is set ON, complex absolute value (norm) and complex division use fast algorithms that overflow for an operand (the divisor, in the case of division) that has an absolute value that is larger than the square root of the largest representable floating-point number. -OPT:fast_nint uses a hardware feature to implement single and double-precision versions of NINT and ANINT. 7.7.5 Other Unsafe Op
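Why re-association and the exponentiation rewrite are classed as roundoff-affecting can be seen with ordinary IEEE double arithmetic. The following is a sketch in Python, used only because its floats are IEEE doubles; it does not invoke the compiler, and the function names are hypothetical.

```python
import math


def pow_2_5_library(x):
    # What a run-time exponentiation call would compute.
    return x ** 2.5


def pow_2_5_rewritten(x):
    # The kind of rewrite described for constant "integer and half"
    # exponents: x**2.5 becomes x * x * sqrt(x).
    return x * x * math.sqrt(x)


# Re-association changes results when magnitudes differ widely.
big = 1.0e16
left_to_right = (big + 1.0) - big   # the 1.0 is lost when added to 1e16
reassociated = (big - big) + 1.0    # the same terms in a different grouping

if __name__ == "__main__":
    print("left to right:", left_to_right)
    print("reassociated: ", reassociated)
    print("library pow:  ", pow_2_5_library(3.0))
    print("rewritten pow:", pow_2_5_rewritten(3.0))
```

The two groupings of the same three terms differ by 1.0, while the rewritten power agrees with the library call only to within rounding, which is why these transformations are tied to a permitted roundoff level rather than applied unconditionally.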
107. 32-bit/64-bit ABI, the compiler queries the machine where the compilation is happening and will compile to the best ABI supported for that machine. These defaults for the target processor and the ABI can be overridden by command-line flags or the compiler.defaults file. You can set or change the default platform for compilation using the compiler.defaults file, found in /opt/pathscale/etc. If you installed in a non-default location, the path will be <install_directory>/pathscale/etc. You can use the defaults file to provide a set of additional include or library directories to search, or to specify some default compiler optimization flags. The compiler refers to the compiler.defaults file for options to be used during compilation. The syntax in the compiler.defaults file is the same as for options specified on the compiler command line. Options are added to the command line in the order in which they appear in the defaults file. Every option is included unconditionally. For exclusive options, the command line takes precedence over the defaults file. For example, if the defaults file contains the -O3 option, but the compiler is invoked with -O2 on the command line, it will behave as if invoked with -O2 alone, because -O2 and -O3 are exclusive options. For additive options, the command line is used before the defaults file. For example, if the defaults file contains -I/usr/foo and the command line contains -I/usr/bar, the compiler will behave as if inv
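The precedence rules for exclusive and additive options can be modeled as follows. This is a rough sketch in Python under two stated assumptions: only optimization-level flags such as -O2 and -O3 are treated as exclusive, and everything else is treated as additive. The function `effective_options` is hypothetical and does not reflect how the compiler driver is actually implemented.

```python
import re

# Assumption for this sketch: only optimization-level flags are "exclusive".
OPT_LEVEL = re.compile(r"^-O[0-3s]?$")


def effective_options(command_line, defaults):
    """Combine command-line options with options from a defaults file.

    Exclusive options: the command line wins over the defaults file.
    Additive options: command-line entries come first, then the defaults,
    in the order in which they appear in the defaults file.
    """
    cmd_sets_level = any(OPT_LEVEL.match(opt) for opt in command_line)
    kept_defaults = [
        opt for opt in defaults
        if not (cmd_sets_level and OPT_LEVEL.match(opt))
    ]
    return list(command_line) + kept_defaults


if __name__ == "__main__":
    print(effective_options(["-O2"], ["-O3"]))
    print(effective_options(["-I/usr/bar"], ["-I/usr/foo"]))
```

In this model a defaults-file -O3 is dropped when the command line gives -O2, while include directories from both sources are kept with the command-line directory searched first.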
108. 4 L 8 ERF X R 4 R 8 G77 PGI E P TRADITIONAL ERFC X R 4 R 8 G77 PGI E P TRADITIONAL ETIME R 4 TARRAY R 4 G77 PGI Array rank 1 TRADITIONAL ETIME Subroutine TARRAY R 4 G77 Array rank 1 TRADITIONAL RESULT R 4 EXIT Subroutine STATUS I 1 1 2 G77 PGI 1 4 1 8 TRADITIONAL EXP R 4 X R 4 R 8 Z 8 ANSI G77 E P Z 16 PGI TRADITIONAL EXPONENT X R 4 R 8 ANSI PGI E TRADITIONAL FCD I 1 1 1 2 I 4 I 8 TRADITIONAL CrayPtr J 1 1 I 2 1 4 1 8 FDATE C G77 PGI TRADITIONAL FDATE Subroutine DATE C G77 PGI FETCH_AND_ I 1 4 TRADITIONAL E ADD J 1 4 FETCH AND I 1 8 TRADITIONAL E ADD J 1 8 FETCH AND _ I 1 4 TRADITIONAL E AND J 4 FETCH_AND_ I 1 8 TRADITIONAL E AND J 1 8 C 16 1 02404 15 XX C Supported Fortran Intrinsics QLOGIC Table of Supported Intrinsics I i _ Table C 1 Fortran Intrinsics Supported in 3 0 Continued Intrinsic Name Result Arguments Families Remarks FETCH AND _ l 1 4 TRADITIONAL E NAND J 1 4 FETCH AND _ I I 8 TRADITIONAL E NAND J 1 8 FETCH AND _ l 1 4 TRADITIONAL E OR J 1 4 FETCH_AND_ I I 8 TRADITIONAL OR J 1 8 FETCH_AND_ l 1 4 TRADITIONAL E SUB J 4 FETCH AND _ I I 8 TRADITIONAL E SUB J I 8 FETCH AND _ l 1 4 TRADITIONAL E XOR J 1 4 FETCH AND _ I I 8 TRADITIONAL E XOR J 1 8 FGET 1 4 C G G77 O STATUS I 4 FGET Subroutine C C G77 O STATUS I 4 FGETC 1 4 UNIT 1 4 1 8 G77
109. 5,6,7 (CPU4: T0, T4; CPU5: T1, T5; CPU6: T2, T6; CPU7: T3, T7). 4. Assign threads to a dual-core machine in the same way as PSC_OMP_CPU_STRIDE=2: PSC_OMP_AFFINITY_MAP=0,2,4,6,1,3,5,7 (CPU0: T0; CPU1: T4; CPU2: T1; CPU3: T5; CPU4: T2; CPU5: T6; CPU6: T3; CPU7: T7). NOTE: When PSC_OMP_AFFINITY_MAP is defined, the values of PSC_OMP_CPU_STRIDE and PSC_OMP_CPU_OFFSET are ignored. However, the value of PSC_OMP_GLOBAL_AFFINITY still determines whether the thread's global or local ID is used in the mapping process. PSC_OMP_CPU_STRIDE: Integer value. This specifies the striding factor used when mapping threads to CPUs. It takes an integer value in the range of 0 to the number of CPUs (inclusive). The default is a stride of 1, which causes the threads to be linearly mapped to consecutive CPUs. When there are more threads than CPUs, the mapping wraps around, giving a round-robin allocation of threads to CPUs. The behavior for a stride of 0 is the same as a stride of 1. Strides greater than 1 are useful when there is a hierarchy of CPUs in the system and the scheduling algorithm needs to take account of this to make best use of system resources. A particularly interesting case is when the system comprises a number of multi-core chips, such that each core shares some resources (e.g. a memory interface) with other cores on that chip. It may then be desirable to
110. 52 and messages 2400 through 2500. In the message level indicator, the message numbers appear after the dash. -Yc,path: Set the path in which to find the associated phase, using the same phase names as given in the -W option. The following characters can also be specified: I: Specifies where to search for include files. S: Specifies where to search for startup files (crt*.o). L: Specifies where to search for libraries. -zerouv: Set uninitialized variables to zero. Affects local scalar and array variables and memory returned by alloca(). Does not affect the behavior of globals, malloc()ed memory, or Fortran common data. Environment Variables. F90_BOUNDS_CHECK_ABORT (Fortran): Set to YES, causes the program to abort on the first bounds check violation. F90_DUMP_MAP (Fortran): If a segmentation fault occurs, print the current process's memory map before aborting. The memory map describes how the process's address space is allocated. The Fortran runtime will print the address of the segmentation fault; you can examine the memory map to see which mapped area was nearest to the fault address. This can help distinguish between program bugs that involve running out of stack space and null pointer dereferences. The memory map is displayed using the same format as the file /proc/self/maps. FILENV: The location of the assign file. See the assign(1) man page for more detail
111. 77 E P PGI TRADITIONAL ALOG10 R 4 X R 4 R 8 ANSI G77 E P PGI TRADITI NAL AMAXO ANSI G77 See Std PGI TRADITIONAL AMAX1 ANSI G77 See Std PGI TRADITIONAL AMINO ANSI G77 See Std PGI TRADITIONAL AMIN1 ANSI G77 See Std PGI TRADITIONAL AMOD R 4 A R 4 R 8 ANSI G77 E P P R 4 R 8 PGI TRADITIONAL AND I 1 1 1 2 1 4 1 8 ANSI G77 E R 4 R 8 PGI CrayPtr L 1 L 2 TRADITIONAL L 4 L 8 J 1 I 2 1 4 1 8 R 4 R 8 CrayPtr L 1 L 2 L 4 L 8 AND AND l I 4 TRADITIONAL FETCH J 1 4 C 4 1 02404 15 XX C Supported Fortran Intrinsics QLOGIC Table of Supported Intrinsics I i _ Table C 1 Fortran Intrinsics Supported in 3 0 Continued Intrinsic Name Result Arguments Families Remarks AND AND I 1 8 TRADITIONAL E FETCH J I 8 ANINT R 4 A R 4 R 8 ANSI G77 E P KIND 1 1 1 2 1 4 PGI I 8 TRADITIONAL ANY ANSI PGI See Std TRADITIONAL ASIN R 4 X R 4 R 8 ANSI G77 E P PGI TRADITIONAL ASIND R 4 X R 4 R 8 PGI E TRADITIONAL ASSOCIATED ANSI PGI See Std TRADITIONAL ATAN R 4 X R 4 R 8 ANSI G77 E P PGI TRADITIONAL ATAN2 R 4 Y R 4 R 8 ANSI G77 E P X R 4 R 8 PGI TRADITIONAL ATAN2D R 4 Y R 4 R 8 PGI E P X R 4 R 8 TRADITIONAL ATAND R 4 X R 4 R 8 PGI TRADITIONAL BESJO R 4 X R 4 G77 PGI BESJ1 R 4 X R 4 G77 PGI BESJ1 R 8 X R 8 G77 PGI BESJN R 4 N R 4 G77 PGI X
112. 8 ANSI G77 E P PGI TRADITIONAL COT R 4 X R 4 R 8 TRADITIONAL COTAN R 4 X R 4 R 8 TRADITIONAL COUNT ANSI PGI See Std TRADITIONAL CPU TIME Subroutine TIME R 4 ANSI G77 PGI TRADITIONAL CPU_TIME Subroutine TIME R 8 ANSI G77 PGI TRADITIONAL CSHIFT ANSI PGI See Std TRADITIONAL CSIN Z 8 X Z 8 Z 16 ANSI G77 E P PGI TRADITIONAL CSMG I 1 1 1 2 I 4 I 8 TRADITIONAL E R 4 R 8 CrayPtr L 1 L 2 L 4 L 8 J 1 1 I 2 I 4 1 8 R 4 R 8 CrayPtr L 1 L 2 L 4 L 8 1 1 1 2 1 4 I 8 R 4 R 8 CrayPtr L 1 L 2 L 4 L 8 CSQRT Z 8 X Z 8 Z 16 ANSI G77 E P PGI TRADITIONAL C 8 1 02404 15 XX C Supported Fortran Intrinsics QLOGIC Table of Supported Intrinsics ls Table C 1 Fortran Intrinsics Supported in 3 0 Continued Intrinsic Name Result Arguments Families Remarks CTIME C STIME 1 4 G77 PGI CTIME C STIME 1 8 G77 PGI CTIME Subroutine G77 STIME I 4 O RESULT C CTIME Subroutine STIME 1 8 G77 O RESULT C CVMGM I 1 1 I 2 1 4 1 8 TRADITIONAL E R 4 R 8 CrayPtr L 1 L 2 L 4 L 8 J 1 1 I 2 1 4 1 8 R 4 R 8 CrayPtr L 1 L 2 L 4 L 8 K 1 1 1 2 1 4 1 8 R 4 R 8 CrayPtr L 1 L 2 L 4 L 8 CVMGN I 1 1 I 2 1 4 1 8 TRADITIONAL E R 4 R 8 CrayPtr L 1 L 2 L 4 L 8 J 1 1 I 2 1 4 1 8 R 4 R 8 CrayPtr L 1 L 2 L 4 L 8 K 1 1 1 2 1 4 1 8 R 4 R 8 CrayPtr L 1 L 2 L 4 L 8 1 02404 15 C 9 C
113. ADS 4 Section 8.9.1 lists the available environment variables (both Standard and PathScale) for use with OpenMP. 8.9.1 Standard OpenMP Environment Variables. Table 8-5. Standard OpenMP Environment Variables (Variable; Possible Values; Description). OMP_DYNAMIC; FALSE; Enables or disables dynamic adjustment of the number of threads available for execution. Default is FALSE, since this mechanism is not supported. OMP_NESTED; TRUE or FALSE; Enables or disables nested parallelism. Default is FALSE. OMP_SCHEDULE; type[,chunk]; This environment variable only applies to DO and PARALLEL DO directives that have schedule type RUNTIME. Type can be STATIC, DYNAMIC, or GUIDED. Default is STATIC, with no chunk size specified. OMP_NUM_THREADS; Integer value; Set the number of threads to use during execution. Default is the number of CPUs in the machine. 8.9.2 PathScale OpenMP Environment Variables. The PathScale OpenMP environment variables provide additional control over thread scheduling through processor affinity. Processor affinity is used to specify the preferred processor (or subset of processors) for scheduling a thread. An affinity setting might be made in order to bind a thread close to a resource and to prev
114. Accuracy with Options. IEEE_NaN_inf: off, off, off, off, off. recip: off, off, off, off, on (on if roundoff 2). roundoff: 0, 0, 0, 1, 2. fast_math: off, off, off, off, off (on if roundoff 2). rsqrt: 0, 0, 0, 0, 1 (1 if roundoff > 2). For example, if you use -OPT:IEEE_arithmetic at -O3, the flag is set to IEEE_arithmetic=2 by default. 7.8 Hardware Performance. Although the x86_64 platform has excellent performance, there are a number of subtleties in configuring your hardware and software that can each cause substantial performance degradations. Many of these are not obvious, but they can reduce performance by 30% or more at a time. We have collected a set of techniques for obtaining best performance, described below. 7.8.1 Hardware Setup. There is no catch-all memory configuration that works best across all systems. We have seen instances where the number, type, and placement of memory modules on a motherboard can each affect the memory latency and bandwidth that you can achieve. Most motherboard manuals have tables that document the effects of memory placement in different slots. We recommend that you read the table for your motherboard and experiment. If you fail to set up your memory correctly, this can account for up to a factor of two difference in memory performance. In extreme cases, this can even affect system stability. 7.8.2 BIOS Setup. Some BIOSes allow you to change your motherboard's memory inter
115. lstat: Fortran interface to the POSIX function lstat. Store in array sarray information about the file named file; if that is a symbolic link, describe the link rather than the target of the link (cf. stat). The function form returns 0 or an error code from the C library value errno. Trailing blanks in file are ignored; you can prevent this by using char(0) to place a null character after the last significant character. sarray must have thirteen elements: 1. ID of device containing file. 2. Inode number. 3. File mode. 4. Number of links. 5. UID of owner. 6. GID of owner. 7. ID of device containing directory entry for file. 8. Size of file in bytes. 9. Time of last access. 10. Time of last modification. 11. Time of last file status change. 12. Preferred I/O block size (-1 if not available). 13. Number of blocks allocated (-1 if not available). Except for elements 12 and 13, values are set to 0 if they are not available from the relevant file system. ltime: Fortran interface to the C library function localtime. Sets tarray to the broken-down time corresponding to stime, which can be obtained from the intrinsic time8. All values are in the local time zone. tarray must have nine elements: 1. Seconds since the last minute, ranging 0-61 due to leap seconds. 2. Minutes since the last hour, ranging 0-59. 3. Hours since midnight, ranging 0-23. 4. Day of month, ranging 0-31. 5. Month, ranging 0-11. 6. Years since 1900. 7. Days since Sunday, ranging 0-6. 8. Days since January 1
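The layout of the broken-down time array can be illustrated by building the same nine values from Python's `time` module. This is a sketch only: the helper `ltime_like` is hypothetical, the conversions reflect the differences between Python's `struct_time` and the C `struct tm` conventions, and the ninth element (a daylight-saving flag) is an assumption taken from C's `struct tm`, because the list above is cut off after the eighth element.

```python
import time


def ltime_like(broken_down):
    """Arrange a time.struct_time the way the nine-element array is described.

    Hypothetical helper for illustration; it is not the Fortran intrinsic.
    """
    return [
        broken_down.tm_sec,              # seconds since the last minute
        broken_down.tm_min,              # minutes since the last hour
        broken_down.tm_hour,             # hours since midnight
        broken_down.tm_mday,             # day of month
        broken_down.tm_mon - 1,          # month 0-11 (Python counts 1-12)
        broken_down.tm_year - 1900,      # years since 1900
        (broken_down.tm_wday + 1) % 7,   # days since Sunday (Python: Monday is 0)
        broken_down.tm_yday - 1,         # days since January 1 (Python counts from 1)
        broken_down.tm_isdst,            # assumed ninth element: DST flag
    ]


if __name__ == "__main__":
    # The epoch in UTC keeps the example independent of the local time zone.
    print(ltime_like(time.gmtime(0)))
    # For local time, as the intrinsic is described: time.localtime(seconds)
    print(ltime_like(time.localtime(0)))
```

The first call uses UTC so the result is reproducible; the intrinsic itself is described as reporting local time, which corresponds to the second call.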
116. CPU identifier, and entries in the list beyond the maximum number of threads supported by the implementation (256) are ignored. Each CPU identifier is a decimal number between 0 and one less than the number of CPUs in the system (inclusive). The implementation generates a mapping table that enumerates the mapping from each thread to CPUs. The CPU identifiers in the PSC_OMP_AFFINITY_MAP list are inserted in the mapping table starting at the index for thread 0 and increasing upwards. If the list is shorter than the maximum number of threads, then it is simply repeated over and over again until there is a mapping for each thread. This repeat feature allows short lists to be used to specify repetitive thread mappings for all threads. Here are some examples for assigning eight threads on an eight-CPU system: 1. Assign all threads to the same CPU: PSC_OMP_AFFINITY_MAP=0 (CPU0: T0, T1, T2, T3, T4, T5, T6, T7; CPU1 through CPU7: no threads). 2. Assign threads to the lower half of the machine: PSC_OMP_AFFINITY_MAP=0,1,2,3 (CPU0: T0, T4; CPU1: T1, T5; CPU2: T2, T6; CPU3: T3, T7). 3. Assign threads to the upper half of the machine: PSC_OMP_AFFINITY_MAP=4
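The construction of the mapping table described for PSC_OMP_AFFINITY_MAP (ignore entries beyond the thread limit, then repeat the list until every thread has a CPU) can be sketched as follows. This is a model in Python, not the runtime's code; the function `expand_affinity_map` and its parameter names are hypothetical.

```python
def expand_affinity_map(cpu_list, num_cpus, max_threads=256):
    """Build a thread-to-CPU table from an affinity list.

    cpu_list    -- CPU identifiers in the order given by the user
    num_cpus    -- number of CPUs in the system
    max_threads -- maximum number of threads supported (256 in the text)
    """
    if not cpu_list:
        raise ValueError("the list must contain at least one CPU identifier")
    for cpu in cpu_list:
        if not 0 <= cpu < num_cpus:
            raise ValueError(f"CPU identifier {cpu} is out of range")
    # Entries beyond the maximum number of threads are ignored.
    usable = cpu_list[:max_threads]
    # A short list is repeated until there is a mapping for each thread.
    return [usable[thread % len(usable)] for thread in range(max_threads)]


if __name__ == "__main__":
    for cpus in ([0], [0, 1, 2, 3], [0, 2, 4, 6, 1, 3, 5, 7]):
        table = expand_affinity_map(cpus, num_cpus=8)
        print(cpus, "->", table[:8])
```

Printing the first eight entries shows which CPU each of threads T0 through T7 is assigned to for a given list on an eight-CPU system.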
117. CPUs. The default is TRUE, so that global ID values are used for calculating thread assignments. Variables: PSC_OMP_AFFINITY_MAP, PSC_OMP_CPU_STRIDE, PSC_OMP_CPU_OFFSET, PSC_OMP_GUARD_SIZE, PSC_OMP_GUIDED_CHUNK_DIVISOR, PSC_OMP_GUIDED_CHUNK_MAX. PSC_OMP_AFFINITY_MAP: This environment variable allows the mapping from threads to CPUs to be fully specified by the user. It must be set to a list of CPU identifiers separated by commas. The list must contain at least one CPU identifier, and entries in the list beyond the maximum number of threads supported by the implementation (256) are ignored. Each CPU identifier is a decimal number between 0 and one less than the number of CPUs in the system (inclusive). The implementation generates a mapping table that enumerates the mapping from each thread to CPUs. The CPU identifiers in the PSC_OMP_AFFINITY_MAP list are inserted in the mapping table starting at the index for thread 0 and increasing upwards. If the list is shorter than the maximum number of threads, then it is simply repeated over and over again until there is a mapping for each thread. This repeat feature allows short lists to be used to specify repetitive thread mappings for all threads. PSC_OMP_CPU_STRIDE: This specifies the striding factor used when mapping threads to CPUs. It takes an integer value in the range of 0 to the number of CPUs (inclusive). The default is
118. Controls the optimization that translates simple IF statements to conditional move instructions in the target CPU. Setting to 0 suppresses this optimization. The value of 1 designates conservative if-conversion, in which the context around the IF statement is used in deciding whether to if-convert. The value of 2 enables aggressive if-conversion by causing it to be performed regardless of the context. The default is 1. -WOPT:ivar_pre=(ON|OFF): When OFF, disables the partial redundancy elimination of indirect loads in the program. Default is ON. -WOPT:mem_opnds=(ON|OFF): Makes the scalar optimizer preserve any memory operands of arithmetic operations, so as to help bring about subsumption of memory loads into the operands of arithmetic operations. Load subsumption is the combining of an arithmetic instruction and a memory load into one instruction. Default is OFF. -WOPT:retype_expr=(ON|OFF): Enables the optimization in the compiler that converts 64-bit address computation to use 32-bit arithmetic as much as possible. Default is OFF. -WOPT:unroll=(0|1|2): Control the unrolling of innermost loops in the scalar optimizer. Setting to 0 suppresses this unroller. The default is 1, which makes the scalar optimizer unroll only loops that contain IF statements. Setting to 2 also applies the unrolling to loop bodies that are straight-line code, which duplicates the unrolling done in the code generator and is thus unnecessary. The default setting
119. E of the lesser known features of Fortran 90 is that you can use argument names when calling intrinsics, instead of passing all of the arguments in strictly defined order. There are only a couple of cases where it is actually useful to know the official name, so that you can omit optional arguments that don't interest you (for example, call date_and_time(time=timevar)), but you're always allowed to specify the name if you like. Intrinsic Options. If your program contains a function or subroutine whose name conflicts with that of one of the intrinsic procedures, you have three choices. Within each program unit that calls that function or subroutine, you can declare the procedure in an external statement, or you can declare it with a Fortran 90 interface block, or you can use command-line options to tell the compiler not to provide that intrinsic. The option -ansi, if present, removes all non-standard intrinsics. The options -intrinsic=name and -no-intrinsic=name are applied to add or remove specific intrinsics from the set of remaining ones. For example, the compile command might look like this: pathf95 myprogram.f -ansi -intrinsic=second To make it convenient to compile programs developed under other compilers, pathf95 provides the ability to enable and disable a group (or family) of intrinsics with a single option. Family names a
120. E 1 4 G77 PGI TRADITIONAL TIME8 I 8 G77 TRADITIONAL TIME Subroutine BUF C G77 TINY X R 4 R 8 ANSI PGI E TRADITIONAL TRANSFER ANSI PGI See Std TRADITIONAL TRANSPOSE Depends onarg MATRIX Any type ANSI PGI Array TRADITIONAL rank 2 TRIM Depends onarg STRING C ANSI PGI TRADITIONAL TTYNAM C UNIT 1 4 G77 PGI TTYNAM Subroutine UNIT 1 4 G77 NAME C UBOUND ANSI PGI See Std TRADITIONAL UMASK 1 4 MASK I 4 G77 UMASK Subroutine MASK I 4 G77 O OLD 1 4 UNIT I 1 1 1 2 1 4 1 8 TRADITIONAL E UNLINK 1 4 FILE C G77 PGI O STATUS 1 4 UNLINK Subroutine FILE C G77 O STATUS 1 4 UNPACK ANSI PGI See Std TRADITIONAL VERIFY 1 4 STRING C ANSI PGI E SET C TRADITIONAL BACK L 1 L 2 L 4 L 8 1 02404 15 XX C Supported Fortran Intrinsics QLOGIC Fortran Intrinsic Extensions ls Table C 1 Fortran Intrinsics Supported in 3 0 Continued Intrinsic Name Result Arguments Families Remarks WRITE _ Subroutine TRADITIONAL E MEMORY BARRIER XOR I 1 1 1 2 I 4 I 8 G77 PGI E R 4 R 8 TRADITIONAL CrayPtr L 1 L 2 L 4 L 8 J 1 1 I 2 1 4 1 8 R 4 R 8 CrayPtr L 1 L 2 L 4 L 8 XOR_AND_ I 1 4 TRADITIONAL E FETCH J 1 4 XOR_AND_ I 1 8 TRADITIONAL E FETCH J 1 8 ZABS R 8 A Z 16 G77 E P TRADITIONAL ZCOS Z 16 X Z 16 G77 E P TRADITIONAL ZEXP Z 16 X Z 16 G77 E P TRADITIONAL ZLOG Z 16 X Z 16 G77 E P TRADITIONAL ZSIN Z 16 X Z 16 G77 E P TRADITIONAL
121. E ASCII 6 DVTYPE DERIVEDBYTE 7 DVTYPE DERIVEDWORD 8 type 8 type code unsigned int dpflag 1 set if declared double precision or double complex I I enum dec codes DVD DEFAULT KIND and n absent or KIND expression which evaluates to the default KIND ie KIND 0 for integer KIND 0 0 for real D D 0 KIND 0 0 for complex KIND TRUE for logical KIND 2019A 2019 for character across on all ANSI conformant implementations KIND expression which does not qualify to be DVD DEFAULT or DVD KIND CONST or DVD KIND DOUBLE DVD STAR 2 n is specified example REAL 8 DVD KIND CONST 3 KIND expression constant across all implementations 4 KIND expression which evaluates to X DVD_KIND 1 Ts DVD KIND DOUBLI Lj 1 02404 15 D 1 D Fortran 90 Dope Vector XX QLOGIC FW IISSSTFFZsIssIIGIIIIIIOCTF III KIND 1 0D0 for real across all implementations This code may be passed for real or complex type kind or star 3 Set if KIND or n appears in the variable declaration Values are from enum dec codes unsigned int int len 12 internal length in bits of
122. FF -IPA:forcedepth=N: This option sets inline depths, directing IPA to attempt to inline all functions at a depth of (at most) N in the callgraph, instead of using the default inlining heuristics. This option ignores the default heuristic limits on inlining. Functions at depth 0 make no calls to any sub-functions. Functions only making calls to depth 0 functions are at depth 1, and so on. -IPA:ignore_lang=(ON|OFF): Enable or disable inlining across language boundaries of Fortran on one side and C/C++ on the other. The compiler may not always be aware of the correct effective language semantics if this optimization is done, making it unsafe in some scenarios. The default is OFF. -IPA:inline=(ON|OFF): This option performs inter-file subprogram inlining during the main IPA processing. The default is ON. Does not affect the light-weight inliner. -IPA:keeplight=(ON|OFF): This option directs IPA not to send -keep to the compiler, in order to save space. The default is OFF. -IPA:linear=(ON|OFF): Controls conversion of a multi-dimensional array to a single-dimensional (linear) array that covers the same block of memory. When inlining Fortran subroutines, IPA tries to map formal array parameters to the shape of the actual parameter. In the case that it cannot map the parameter, it linearizes the array reference. By d
123. I am handler count count 1 if count le 0 then previous signal 2 handler 0 else previous signal 2 handler 1 end if end subroutine handler. sleep: Like the POSIX function sleep, pauses the process for seconds seconds. srand: Like the POSIX function srand, restarts the random number sequence for irand or rand, using seed as the seed. stat: Fortran interface to the POSIX function stat. Store in array sarray information about the file named file; if that is a symbolic link, describe the target rather than the link itself (cf. lstat). The function form returns 0 or an error code from the C library value errno. Trailing blanks in file are ignored; you can prevent this by using char(0) to place a null character after the last significant character. sarray must have thirteen elements: 1. ID of device containing file. 2. Inode number. 3. File mode. 4. Number of links. 5. UID of owner. 6. GID of owner. 7. ID of device containing directory entry for file. 8. Size of file in bytes. 9. Time of last access. 10. Time of last modification. 11. Time of last file status change. 12. Preferred I/O block size (-1 if not available). 13. Number of blocks allocated (-1 if not available). Except for elements 12 and 13, values are set to 0 if they are not available from the relevant file system. sy
124. The libraries are: /opt/pathscale/lib/<version>/libopenmp.so (dynamic, 64-bit); /opt/pathscale/lib/<version>/libopenmp.a (static, 64-bit); /opt/pathscale/lib/<version>/32/libopenmp.so (dynamic, 32-bit); /opt/pathscale/lib/<version>/32/libopenmp.a (static, 32-bit). The symbolic links to the dynamic versions of the libraries, for both 32-bit and 64-bit environments, can be found here: /opt/pathscale/lib/<version>/libopenmp.so.1 (symbolic link to dynamic version, 64-bit); /opt/pathscale/lib/<version>/32/libopenmp.so.1 (symbolic link to dynamic version, 32-bit). Be sure to use the -mp flag on both the compile and link lines. NOTE: For running OpenMP executables compiled with the PathScale compiler on a system where no PathScale compiler is currently installed, please see the QLogic PathScale Compiler Suite Install Guide for instructions on installing the PathScale libraries on the target system. 8.9 Environment Variables. The OpenMP environment variables allow you to change the execution behavior of the program running under multiple threads. The table in this section lists the environment variables currently supported. The environment variables can be set using the shell commands. For example, in bash: export OMP_NUM_THREADS=4 In csh: setenv OMP_NUM_THREADS 4 After the previous shell commands, the following command will print 4: echo $OMP_NUM_THRE
125. pathcc -O2 -ipa -c a.c; pathcc -O3 -ipa -c b.c; pathcc -ipa a.o b.o. The user can pass consistent optimization options to the individual compilations to remove the warning. In the above example, the user can pass either -O2 or -O3 to both files. The -ipa flag implies -O2 -ipa, because -O2 is the default. Flags like -ipa can be used in combination with a very large number of other flags, but some typical combinations with the -O flags are shown below: -O3 -ipa or -O2 -ipa is a typical additional attempt at improved performance over the -O3 or -O2 flag alone. -ipa needs to be used both in the compile and in the link steps of a build. Using IPA with your program is usually straightforward. If you have only a few source files, you can simply use it like this: pathf95 -O3 -ipa main.f subs1.f subs2.f. If you compile files separately, the .o files generated by the compiler do not actually contain object code; they contain a representation of the source code. Actual compilation happens at link time. The link command also needs the -ipa flag added. For example, you could separately compile and then link a series of files like this: pathf95 -c -O3 -ipa main.f; pathf95 -c -O3 -ipa subs1.f; pathf95 -c -O3 -ipa subs2.f; pathf95 -O3 -ipa main.o subs1.o subs2.o. Currently there is a restriction that each archive (for example libfoo.a) must contain either .o files compiled with -ipa
126. IEEE_arithmetic=1 -fmath-errno. -f[no-]fast-stdlib: The -ffast-stdlib flag improves application performance by generating code to link against special versions of some standard library routines, and linking against the PathScale runtime library. This option is enabled by default. If -fno-fast-stdlib is used during compilation, the compiler will not emit code to link against fast versions of standard library routines. During compilation, -ffast-stdlib implies -OPT:fast_stdlib=on. If -fno-fast-stdlib is used during linking, the compiler will not link against the PathScale runtime library. If you link code with -fno-fast-stdlib that was not also compiled with this flag, you may see linker errors. Much of the PathScale Fortran runtime is compiled with -ffast-stdlib, so it is not advised to link Fortran applications with -fno-fast-stdlib. -ffloat-store: Do not store floating point variables in registers, and inhibit other options that might change whether a floating point value is taken from a register or memory. This option prevents undesirable excess precision on the X87 floating-point unit, where all floating-point computations are performed in one precision regardless of the original type (see -mx87-precision). If the program uses floating point values with less precision, the extra precision in the X87 may violate the precise definition of IEEE floating point. -ffloat-store causes all pertinent intermediate computations to be stored to m
127. L and INTEGER type from 4 bytes to 8 bytes. Useful for porting from Cray code when integer and floating point data is 8 bytes long by default. Watch out for type mismatches with external libraries. NOTE: The -r8 and -i8 flags only affect default reals and integers, not variable declarations or constants that specify an explicit KIND. This can cause incorrect results if a 4-byte default real or integer is passed into a subprogram that declares a KIND=4 integer or real. Using an explicit KIND value like this is unportable and is not recommended. Correct usage of KIND (i.e., KIND=KIND(1) or KIND=KIND(0.0d0)) will not result in any problems. 3.3.2 Pointers. The Cray pointer is a data type extension to Fortran to specify dynamic objects, different from the Fortran pointer. Both Cray and Fortran pointers use the POINTER keyword, but they are specified in such a way that the compiler can differentiate between them. The declaration of a Cray pointer is: POINTER (<pointer>, <pointee>). Fortran pointers are declared using: POINTER :: object_name. PathScale's implementation of Cray pointers is the Cray implementation, which is a stricter implementation than in other compilers. In particular, the PathScale Fortran compiler does not treat pointers exactly like integers. The compiler will report an err
128. LBOUND ANSI PGI See Std TRADITIONAL LEN 1 4 STRING C ANSI G77 E P PGI TRADITIONAL LENGTH I 1 1 1 2 1 4 1 8 TRADITIONAL E LEN_TRIM 1 4 STRING C ANSI G77 E PGI TRADITIONAL 1 02404 15 XX C Supported Fortran Intrinsics QLOGIC Table of Supported Intrinsics I i _ Table C 1 Fortran Intrinsics Supported in 3 0 Continued Intrinsic Name Result Arguments Families Remarks LGE C STRING A C ANSI G77 E STRING B C PGI TRADITIONAL LGT C STRING A C ANSI G77 E STRING B C PGI TRADITIONAL LINK 1 4 PATH1 C G77 PGI PATH2 C LINK Subroutine PATH1 C G77 O PATH2 C STATUS 1 4 LLE C STRING A C ANSI G77 E STRING B C PGI TRADITIONAL LLT C STRING A C ANSI G77 E STRING B C PGI TRADITIONAL LNBLNK 1 4 STRING C G77 PGI LOC 1 8 I Any type G77 Array rank any TRADITIONAL LOCK Subroutine I 1 4 I 8 TRADITIONAL E RELEASE LOCK TEST _ 1 4 TRADITIONAL E AND_SET J 4 LOCK TEST I 1 8 TRADITIONAL E AND_SET J I 8 LOG R 4 X R 4 R 8 Z 8 ANSI G77 E Z 16 PGI TRADITIONAL LOG10 R 4 X R 4 R 8 ANSI G77 E PGI TRADITIONAL LOG2 IMAGES 1 4 TRADITIONAL LOGICAL L 4 L L 1 L 2 L 4 L 8 ANSI PGI E KIND I 1 I2 1 4 TRADITIONAL 1 8 LONG 1 4 1 1 I 2 1 4 1 8 G77 E R 4 R 8 Z 8 2 16 TRADITIONAL 1 02404 15 C 29 C Supported Fortran Intrinsics Table of Supported Intrinsics QLOGIC FWGIISTI IT IIN
129. ME C ANSI ENVIRONMENT VALUE C TRADITIONAL LENGTH 1 4 STATUS 1 4 TRIM NAME L 4 GET IEEE _ Subroutine STATUS I 8 TRADITIONAL EXCEPTIONS GET IEEE Subroutine STATUS 1 8 TRADITIONAL INTERRUPTS GET IEEE _ Subroutine STATUS I 8 TRADITIONAL ROUNDING _ MODE GET IEEE _ Subroutine STATUS I 8 TRADITIONAL STATUS GMTIME Subroutine STIME 1 4 G77 PGI TARRAY I 4 Array rank 1 HOSTNM 1 4 NAME C G77 PGI O STATUS 1 4 HOSTNM Subroutine NAME C G77 O STATUS 1 4 HUGE X 1 1 1 2 1 4 1 8 ANSI PGI E R 4 R 8 TRADITIONAL IABS 1 4 1 1 1 2 1 4 1 8 ANSI G77 PGI TRADITIONAL IACHAR 1 4 C G ANSI G77 E PGI TRADITIONAL IAND 1 4 I 1 1 1 2 1 4 1 8 ANSI G77 E J 1 1 1 2 1 4 1 8 PGI TRADITIONAL IARGC 1 4 G77 PGI IBCHNG 1 4 I 1 1 1 2 1 4 1 8 TRADITIONAL E POS 1 1 1 2 1 4 1 8 IBCLR 1 4 I 1 1 1 2 1 4 1 8 ANSI G77 E POS 1 1 1 2 1 4 1 8 PGI TRADITIONAL 1 02404 15 XX C Supported Fortran Intrinsics QLOGIC Table of Supported Intrinsics ls Table C 1 Fortran Intrinsics Supported in 3 0 Continued Intrinsic Name Result Arguments Families Remarks IBITS 1 4 I 1 1 1 2 1 4 1 8 ANSI G77 E POS 1 1 1 2 1 4 1 8 PGI LEN 1 1 12 1 4 1 8 TRADITIONAL IBSET 1 4 I 1 1 1 2 1 4 1 8 ANSI G77 E POS 1 1 1 2 1 4 1 8 PGI TRADITIONAL ICHAR 1 4 C C ANSI G77 E PGI TRADITIONAL IDATE Subroutine I 1 1 G77 PGI J
130. NAL OMP_ Subroutine LOCK 1 4 1 8 DESTROY _ LOCK OMP __ Subroutine LOCK 1 4 1 8 DESTROY _ NEST GET Depends on arg OMP DYNAMIC C 32 1 02404 15 XX C Supported Fortran Intrinsics QLOGIC Table of Supported Intrinsics o 9 9 97 ENS Table C 1 Fortran Intrinsics Supported in 3 0 Continued Intrinsic Name Result Arguments Families Remarks OMP GET MAX Depends on arg OMP THREADS GET Depends on arg OMP NESTED OMP GET Depends on arg OMP NUM PROCS OMP GET Depends on arg OMP NUM THREADS GET Depends on arg OMP THREAD NUM GET R 8 OMP WTICK OMP GET R 8 OMP WTIME OMP INIT Subroutine LOCK 1 4 1 8 LOCK OMP_INIT_ Subroutine LOCK 1 4 1 8 NEST OMP IN _ Depends on arg OMP PARALLEL OMP SET Subroutine DYNAMIC OMP DYNAMIC THREADS L 4 L 8 OMP SET Subroutine LOCK 1 4 1 8 LOCK OMP_SET_ Subroutine NESTED L 4 L 8 NESTED OMP_SET_ Subroutine LOCK 1 4 1 8 NEST OMP SET NUM Subroutine NUM THREADS OMP THREADS 1 4 1 8 OMP TEST Depends on arg LOCK 1 4 I 8 OMP LOCK OMP TEST Depends on arg LOCK 1 4 I 8 OMP NEST LOCK OMP UNSET Subroutine LOCK 1 4 1 8 LOCK OMP UNSET Subroutine LOCK 1 4 1 8 NEST 1 02404 15 C 33 C Supported Fortran Intrinsics Table of Supported Intrinsics XX QLOGIC
131. Often a program has "hot spots": a few routines or loops that are responsible for most of the execution time. Profilers are a common tool for finding these hot spots in a program. To figure out where and how to tune your code, use the time tool to get a rough estimate and determine whether the issue is system load, application load, or a system resource that is slowing down your program. Then use the pathprof tool to find the program's hot spots. Once you find the hot spots in your program, you can improve your code for better performance, or use the information to help choose which compiler flags are likely to lead to better performance. The time tool provides the elapsed (or wall) time, user time, and system time of your program. Its usage is typically: time ./program args. Elapsed time is usually the measurement of interest, especially for parallel programs, but if your system is busy with other loads, then user time might be a more accurate estimate of performance than elapsed time. If there is substantial system time being used and you don't expect to be using substantial non-compute resources of the system, you should use a kernel profiling tool to see what is causing it. The pathprof and pathcov programs included with the compilers are symbolic links to your system's gprof and gcov executables, respectively. There are more details and an example using pathprof later in section 9, but the following steps are all that are needed to get started in profiling: 1 A
132. spills of registers to memory. In this case, splitting loops can be beneficial. Like splitting an atom, splitting loops is termed fission. These are the LNO options to control these transformations. -LNO:fusion=n: Perform loop fusion; n: 0 off, 1 conservative, 2 aggressive. Level 2 implies that outer loops in consecutive loop nests should be fused, even if it is found that not all levels of the loop nests can be fused. The default level is 1 (standard outer loop fusion), but 2 has been known to benefit a number of well-known codes. -LNO:fission=n: Perform loop fission; n: 0 off, 1 standard, 2 try fission before fusion. The default level is 0, but 2 has been known to benefit a number of well-known codes. Be careful with mixing the above two flags, because fusion has some precedence over fission: if -LNO:fission=[1 or 2] and -LNO:fusion=[1 or 2], then fusion is performed. -LNO:fusion_peeling_limit=n controls the limit for the number of iterations allowed to be peeled in fusion, where n has a default of 5 but can be any non-negative integer. Peeling is done when the iteration counts in consecutive loops are different but close, and several iterations are replicated outside the loop body to make the loop counts the same. 7.4.2 Cache Size Specification. The PathScale compilers are primarily targeted at the Opteron CPU currently, so they assume an L2 cache size of 1MB. Ath
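To make the fusion and fission terminology concrete, the following C sketch writes the same computation twice: once as two consecutive loops (the form fission produces, or the input to fusion) and once as a single fused loop. The function names are hypothetical, and this is a hand-written illustration of the transformation, not output from the PathScale compiler.

```c
#include <stddef.h>

/* Two consecutive loops over the same range. A fusing optimizer may
 * merge them; a fissioning optimizer produces this form from the fused
 * loop below, for example to reduce register pressure. */
void scale_then_add_separate(double *a, double *b, const double *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = 2.0 * c[i];
    for (size_t i = 0; i < n; i++)
        b[i] = a[i] + c[i];
}

/* The fused form: one pass over the data, so a[i] and c[i] are reused
 * while they are still in registers or cache. */
void scale_then_add_fused(double *a, double *b, const double *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        a[i] = 2.0 * c[i];
        b[i] = a[i] + c[i];
    }
}
```

Both functions produce identical results; which form runs faster depends on loop body size, register pressure, and cache behavior, which is why the text suggests experimenting with the fusion and fission levels.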
133. P processes running on that node. By default, each OpenMP process will make the same affinity assignments, and the CPU utilization may be unbalanced. In hybrid OpenMP/MPI programs using multiple OpenMP threads per process, it may be necessary to set PSC_OMP_AFFINITY to FALSE to prevent this. For hybrid OpenMP/MPI programs using a single OpenMP thread per process, the default is to disable OpenMP affinity, and the operating system will hopefully use all CPUs equitably. An alternative approach is to specify explicit and disjoint affinity assignments per MPI process, using taskset or the other OpenMP library environment variables for controlling thread affinity. See the following descriptions of related environment variables. PSC_OMP_AFFINITY_GLOBAL (boolean: TRUE or FALSE): This environment variable controls whether thread global ID or local ID values are used when assigning threads to CPUs. The default is TRUE, so that global ID values are used for calculating thread assignments. Global IDs uniquely identify each thread, and are integer values starting from 0 for the original master thread and incrementing upwards in the order in which threads are allocated. The global ID is constant for a particular thread from its fork to its join. Using the global ID for the affinity mapping ensures that threads do not change CPU in their lifetime, and ensures that threads will be evenly distributed over CPUs. The alternative is to use the thread local
134. PathScale Compiler Suite includes OpenMP and autoparallelization for Fortran and C/C++. This implementation of OpenMP supplies parallel directives that comply with the OpenMP Application Program Interface (API) specification 2.5. Runtime libraries and environment variables are also included. This section is not a tutorial on how to use OpenMP. To learn more about using OpenMP, please see a reference like Parallel Programming in OpenMP by Rohit Chandra et al., Morgan Kaufmann Publishers, 2000, ISBN 1-55860-671-8. See section 8.15 for more resources. The OpenMP API defines compiler directives and library routines that make it relatively easy to create programs for shared memory computers (processors that share physical memory) from new or existing code. OpenMP provides a portable, scalable interface that has become the de facto standard for programming shared memory computers. Using OpenMP, you can create threads, assign work to threads, and manage data within the program. OpenMP enables incremental parallelization of your code on SMP (shared memory processor) systems, allowing you to add directives to chunks of existing code a little at a time. The PathScale OpenMP implementation in Fortran and C/C++ consists of parallelization directives and libraries. Using directives, you can distribute the work of the application over several processors. OpenMP supports the three basic aspects of parallel programming: specifying parallel execution, communicat
135. Routines; 8-4 C/C++ OpenMP Runtime Library Routines; 8-5 Standard OpenMP Environment Variables; C-1 Fortran Intrinsics Supported in 3.0. Section 1 Introduction. This User Guide covers how to use the QLogic PathScale Compiler Suite compilers: how to configure them, how to use them to optimize your code, and how to get the best performance from them. This guide also covers the language extensions and differences from the other commonly available language compilers. The QLogic PathScale Compiler Suite will be referred to as the PathScale Compiler Suite or the PathScale compiler in the rest of this document. The PathScale Compiler Suite generates both 32-bit and 64-bit code, with 64-bit code as the default. See the eko man page for details. The information in this guide is organized into these sections: Section 2 is a quick reference to using the PathScale compilers. Section 3 covers the PathScale Fortran compiler. Section 4 covers the PathScale C/C++ compilers. Section 5 provides suggestions for porting and compatibility. Section 6 is a Tuning Quick Reference, with tips for getting faster code. Section 7 discusses tuning options in more detail. Section 8 cove
136. Suite fully supports floating point complex numbers, it does not support complex integer data types, such as _Complex int; thread local storage; SSE3 intrinsics; many of the __builtin functions; a goto outside of the block (PathScale compilers do support taking the address of a label in the current function and doing indirect jumps to it); structs generated on the fly, a GCC extension (the compiler generates incorrect code for these); Java-style exceptions; the java_interface attribute; the init_priority attribute. Section 5 Porting and Compatibility. 5.1 Getting Started. Here are some tips to get you started compiling selected applications with the PathScale Compiler Suite. 5.2 GNU Compatibility. The PathScale Compiler Suite C, C++, and Fortran compilers are compatible with gcc and g77. Some packages will check strings like the gcc version or the name of the compiler to make sure you are using gcc; you may have to work around these tests. See section 5.6.1 for more information. Some packages continue to use deprecated features of gcc. While gcc may print a warning and continue compilation, the PathScale Compiler Suite C, C++, and Fortran compilers may print an error and exit. Use the instructions in the error to substitute an updated flag. For example, some packages will specify the deprecated -Xlinker gcc flag t
137. TIT lt Gr lt GFGFEPEERcRTOGSGIGIEIEIEXEXEFEGGGIWWIICII ee Table C 1 Fortran Intrinsics Supported in 3 0 Continued Intrinsic Name Result Arguments Families Remarks LSHIFT I 1 1 1 2 I 4 I 8 G77 PGI E R 4 R 8 TRADITIONAL CrayPtr L 1 L 2 L 4 L 8 POSITIVE_SHIFT 1 1 1 2 1 4 1 8 LSTAT 1 4 FILE C G77 PGI O SARRAY 4 Array rank 1 STATUS 1 4 LSTAT Subroutine FILE C G77 O SARRAY I 4 Array rank 1 STATUS 1 4 LTIME Subroutine STIME 1 4 G77 PGI TARRAY I 4 Array rank 1 MALLOC I 1 1 1 2 I 4 1 8 PGI E TRADITIONAL MASK I 1 1 1 2 1 4 I 8 TRADITIONAL E R 4 R 8 CrayPtr L 1 L 2 L 4 L 8 MATMUL ANSI PGI See Std TRADITIONAL MAX ANSI G77 See Std PGI TRADITIONAL MAXO ANSI G77 See Std PGI TRADITIONAL MAX1 ANSI G77 See Std PGI TRADITIONAL MAX X R 4 R 8 ANSI PGI E EXPONENT TRADITIONAL MAXLOC ANSI PGI See Std TRADITIONAL MAXVAL ANSI PGI See Std TRADITIONAL C 30 1 02404 15 XX C Supported Fortran Intrinsics QLOGIC Table of Supported Intrinsics _ gt i Table C 1 Fortran Intrinsics Supported in 3 0 Continued Intrinsic Name Result Arguments Families Remarks MCLOCK 1 4 G77 PGI MCLOCK8 1 8 G77 MEMORY Subroutine TRADITIONAL E BARRIER MERGE TSOURCE Any ANSI PGI E type TRADITIONAL FSOURCE Any type MASK L 1 L 2 L 4 L 8 MIN ANSI G77 See Std PGI
138. Table C 1 Fortran Intrinsics Supported in 3 0 Continued Intrinsic Name Result Arguments Families Remarks CHAR C I 1 1 1 2 I 4 I 8 ANSI G77 E KIND I 1 1 2 1 4 PGI I 8 TRADITIONAL CHDIR 1 4 DIR C G77 PGI O STATUS 1 4 CHDIR Subroutine DIR C G77 O STATUS 1 4 CHMOD 1 4 NAME C G77 PGI O MODE C STATUS 1 4 CHMOD Subroutine NAME C G77 O MODE C STATUS 1 4 CLEAR_IEEE_ Subroutine EXCEPTION 1 8 TRADITIONAL E EXCEPTION CLOC 1 8 C C TRADITIONAL CLOCK C TRADITIONAL CLOG Z 8 X Z 8 Z 16 ANSI G77 E P PGI TRADITIONAL CMPLX Z 8 X I 1 1 2 1 4 1 8 ANSI G77 E R 4 R 8 Z 8 Z 16 PGI Y 1 1 1 2 1 4 78 TRADITIONAL R 4 R 8 Z 8 Z 16 COMMAND _ 1 4 KIND 1 1 I 2 1 4 O ARGUMENT _ I 8 TRADITIONAL COUNT COMPARE _ L 4 I 1 4 TRADITIONAL E AND_SWAP J 1 4 1 4 COMPARE _ L 8 I 1 8 TRADITIONAL E AND_SWAP J I 8 1 8 COMPL I 1 1 1 2 I 4 I 8 PGI E R 4 R 8 TRADITIONAL CrayPtr L 1 L 2 L 4 L 8 1 02404 15 C 7 C Supported Fortran Intrinsics Table of Supported Intrinsics XX QLOGIC ee Table C 1 Fortran Intrinsics Supported in 3 0 Continued Intrinsic Name Result Arguments Families Remarks CONJG Z 8 Z Z 8 Z 16 ANSI G77 E P PGI TRADITIONAL COS R 4 X R 4 R 8 Z 8 ANSI G77 E P Z 16 PGI TRADITIONAL COSD R 4 X R 4 R 8 PGI E P TRADITIONAL COSH R 4 X R 4 R
139. The -F option sets the record header format to big-endian (f77.mips). The ASSIGN Procedure. The ASSIGN procedure provides a programmatic interface to the assign command. It takes as arguments a string specifying the assign command and an integer to store a returned error code. For example: integer :: err; call ASSIGN("assign -N mips u:15", err). This example has the same effect as the example in section 3.6.1.1. 3.6.1.5 I/O Compilation Flags. Two compilation flags have been added to help with I/O: -byteswapio and -convert conversion. The -byteswapio flag swaps bytes during I/O so that unformatted files on a little-endian processor are read and written in big-endian format (or vice versa). The -convert conversion flag controls the swapping of bytes during I/O so that unformatted files on a little-endian processor are read and written in big-endian format (or vice versa). To be effective, the option must be used when compiling the Fortran main program. Setting the environment variable FILENV when running the program will override the compiled-in choice in favor of the choice established by the command assign. The -convert conversion flag can take one of three arguments: native (no conversion, the default); big_endian (files are big-endian); little_endian (files are little-endian). For more details, see the pathf95 man p
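The byte swapping that these flags request can be pictured with a small C routine that reverses the bytes of a 4-byte value, such as a record header. This is a sketch of the general technique only; swap_bytes32 is a hypothetical name, and nothing here is taken from the PathScale runtime.

```c
#include <stdint.h>

/* Reverse the byte order of a 32-bit value, converting between
 * little-endian and big-endian representations. */
uint32_t swap_bytes32(uint32_t v)
{
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) << 8)  |
           ((v & 0x00FF0000u) >> 8)  |
           ((v & 0xFF000000u) >> 24);
}
```

Applying the function twice returns the original value, which is why the same conversion serves for both reading and writing a file in the opposite byte order.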
140. Code Generation (CG); Feedback Directed Optimization (FDO); Aggressive Optimizations; Alias Analysis; Numerically Unsafe Optimizations; Fast-math Functions; IEEE 754 Compliance; Arithmetic; Roundoff; Other Unsafe Optimizations; Assumptions About Numerical Accuracy; Hardware; Hardware Setup; BIOS Setup; Multiprocessor Memory; Kernel and System Effects; Tools and
141. UIDED_CHUNK_DIVISOR is used to divide down the chunk size assigned by the guided scheduling algorithm. If the number of iterations left to be scheduled is remaining_size and the number of threads in the team is number_of_threads, the chunk size will be determined as: chunk_size = remaining_size / (number_of_threads * PSC_OMP_GUIDED_CHUNK_DIVISOR). A value of 1 gives the biggest possible chunks and the fewest calls into the loop scheduler. Larger values will result in smaller chunks, giving more opportunities for the dynamic guided scheduler to assign work, balancing out variation between loop iterations at the expense of more calls into the loop scheduler. With a value of PSC_OMP_GUIDED_CHUNK_DIVISOR equal to 1, the first thread will get 1/n-th of the iterations for a team of n. If these iterations happen to be particularly expensive, then this thread will be the critical path through the loop. The default value is 2. PSC_OMP_GUIDED_CHUNK_MAX (integer value): This is the maximum chunk size that will be used by the loop scheduler for guided scheduling. The default value for this is 300. Note that a minimum chunk size can already be set by the user on a guided schedule directive. This environment variable allows the user to set a maximum too, though it applies to the whole program. The rationale for setting a maximum
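The guided chunk calculation can be worked through in C: the remaining iterations are divided by the product of the team size and the divisor, and the result is then bounded. This is a sketch under assumptions: guided_chunk_size is a hypothetical function, and the order in which the minimum and maximum are applied is a guess, since the text gives only the division step and the existence of the two bounds.

```c
/* Hypothetical sketch of the guided chunk calculation.
 * divisor   plays the role of PSC_OMP_GUIDED_CHUNK_DIVISOR (default 2),
 * max_chunk plays the role of PSC_OMP_GUIDED_CHUNK_MAX (default 300),
 * min_chunk is the minimum from the guided schedule directive. */
long guided_chunk_size(long remaining_size, long number_of_threads,
                       long divisor, long min_chunk, long max_chunk)
{
    long chunk = remaining_size / (number_of_threads * divisor);

    if (chunk > max_chunk)        /* assumed: cap at the maximum first */
        chunk = max_chunk;
    if (chunk < min_chunk)        /* assumed: then raise to the minimum */
        chunk = min_chunk;
    if (chunk > remaining_size)   /* never hand out more than is left */
        chunk = remaining_size;
    return chunk;
}
```

With 1000 iterations remaining and 4 threads, a divisor of 1 gives a chunk of 250 (one quarter of the work), while the default divisor of 2 gives 125. With 10000 iterations remaining, the uncapped value of 1250 is limited to the maximum of 300.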
142. array whose element type is a derived type having component(s) which are themselves allocatable */
    unsigned int :26;            /* pad for first 32 bits */
    unsigned int :29;            /* pad for second 32 bits */
    unsigned int n_dim :3;       /* number of dimensions */
    f90_type_t type_lens;        /* data type and lengths */
    void *orig_base;             /* original base address */
    unsigned long orig_size;     /* original size */
    /* Per Dimension Information - array will contain only the necessary number of elements */
    #define MAXDIM 7
    struct DvDimen {
        signed long low_bound;   /* lower bound for ith dimension (may be negative) */
        signed long extent;      /* number of elts for ith dimension */
        /* The stride mult is not defined in constant units so that address calculations do not always require a divide by 8 or 64. For double and complex, stride mult has a factor of 2 in it. For double complex, stride mult has a factor of 4 in it. */
        signed long stride_mult; /* stride multiplier */
    } dimension[7];
    /* DopeAllocType alloc_info appears following the last actual dimension (there may be fewer than 7 dimensions) if alloc_cpnt is true */
} DopeVectorType;
Appendix E eko man Page. There are online manual pages (man pages) available describing the flags and options for the P
143. QLogic PathScale Compiler Suite User Guide, Version 3.0 (1-02404-15). Information furnished in this manual is believed to be accurate and reliable. However, QLogic Corporation assumes no responsibility for its use, nor for any infringements of patents or other rights of third parties which may result from its use. QLogic Corporation reserves the right to change product specifications at any time without notice. Applications described in this document for any of these products are for illustrative purposes only. QLogic Corporation makes no representation nor warranty that such applications are suitable for the specified use without further testing or modification. QLogic Corporation assumes no responsibility for any errors that may appear in this document. No part of this document may be copied nor reproduced by any means, nor translated nor transmitted to any magnetic medium without the express written consent of QLogic Corporation. In accordance with the terms of their valid PathScale agreements, customers are permitted to make electronic and paper copies of this document for their own exclusive use. Linux is a registered trademark of Linus Torvalds. QLA, QLogic, SANsurfer, the QLogic logo, PathScale, the PathScale logo, and InfiniPath are registered trademarks of QLogic Corporation. Red Hat and all Red Hat-based trademarks are trademark
144. ZSQRT Z 16 X Z 16 G77 E P TRADITIONAL. C.4 Fortran Intrinsic Extensions. Standard Fortran intrinsic procedures are documented in ISO 1539-1 or any good textbook on Fortran 95. This section documents procedures that are extensions to the standard, referring to argument names shown in the table of intrinsics in table C-1. abort: Prints a message and then, like the C library function abort, stops the program. access: Like the C library function access, returns zero if the file named by name satisfies the requirements indicated by mode, but otherwise returns the error code from the C library value errno. Trailing blanks in name are ignored; you can prevent this by using char(0) to place a null character after the last significant character. mode may contain any of the following: r (readable), w (writable), x (executable), blank (file exists). alarm: Uses the C library functions alarm and signal to wait the time indicated by seconds and then execute the external subroutine handler. status returns the number of seconds remaining until the previously scheduled alarm would have taken place, or 0 if no alarm was pending. and: Bitwise boolean AND. besj0, besj1, besjn, besy0, besy1, besyn: Fortran interfaces to C library functions j0, j1, jn, y0, y1, and yn (Bessel functions). cdabs, cdcos: Specific names for various mathematical f
145. a stride of 1, which causes the threads to be linearly mapped to consecutive CPUs. When there are more threads than CPUs, the mapping wraps around, giving a round-robin allocation of threads to CPUs. The behavior for a stride of 0 is the same as a stride of 1. This specifies an integer value that is used to offset the CPU assignments for the set of threads. It takes an integer value in the range of 0 to the number of CPUs (inclusive). When a thread is mapped to a CPU, this offset is added onto the CPU number calculated after PSC_OMP_CPU_STRIDE has been applied. If the resulting value is greater than the number of CPUs, then the remainder is used from the division of this value by the number of CPUs. This environment variable specifies the size in bytes of a guard area that is placed below pthread stacks. This guard area is in addition to any guard pages created by your O/S. The value of PSC_OMP_GUIDED_CHUNK_DIVISOR is used to divide down the chunk size assigned by the guided scheduling algorithm. See section 8.9.2 for details. This is the maximum chunk size that will be used by the loop scheduler for guided scheduling. See section 8.9.2 for details. PSC_OMP_LOCK_SPIN: This chooses the locking mechanism used by critical sections and OMP locks. See section 8.9.2 for details.
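The stride and offset description amounts to a small piece of modular arithmetic, sketched below in C. This is a simplified model under assumptions, not the OpenMP library's actual code: cpu_for_thread is a hypothetical function, the thread's ID is assumed to be multiplied by the stride, and the real library may treat wrap-around with strides larger than 1 differently.

```c
/* Simplified model of mapping a thread to a CPU.
 * stride corresponds to PSC_OMP_CPU_STRIDE and offset to the CPU offset
 * variable described in the text; num_cpus is the number of CPUs. */
int cpu_for_thread(int thread_id, int stride, int offset, int num_cpus)
{
    if (stride == 0)      /* the text: stride 0 behaves like stride 1 */
        stride = 1;

    /* Apply the stride, add the offset, then wrap with the remainder. */
    return (thread_id * stride + offset) % num_cpus;
}
```

On a 4-CPU system with the default stride of 1 and no offset, threads 0 through 3 land on CPUs 0 through 3 and thread 5 wraps around to CPU 1, which is the round-robin behavior the text describes.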
146. age 3.6.2 Reserved File Units. The PathScale Fortran compiler reserves Fortran file units 5, 6, and 0. 3.7 Source Code Compatibility. This section discusses our compatibility with source code developed for other compilers. Different compilers represent types in various ways, and this may cause some problems. 3.7.1 Fortran KINDs. The Fortran KIND attribute is a way to specify the precision or size of a type. Modern Fortran uses KINDs to declare types. This system is very flexible, but has one drawback. The recommended and portable way to use KINDs is to find out what they are, like this: integer :: dp_kind = kind(0.0d0). In actuality, some users hard-wire the actual values into their programs: integer :: dp_kind = 8. This is an unportable practice, because some compilers use different values for the KIND of a double-precision floating point value. The majority of compilers use the number of bytes in the type as the KIND value. For floating point numbers, this means KIND=4 is 32-bit floating point and KIND=8 is 64-bit floating point. The PathScale compiler follows this convention. Unfortunately for us and our users, this is incompatible with unportable programs written using GNU Fortran, g77. g77 uses KIND=1 for single precision (32 bits) and KIND=2 for double precision (64 bits). For integers, however, g77 uses KIND=3 for 1 byte, KIND=5 for 2 bytes, KIND=1 for 4 bytes, and KIND=2 for 8 bytes. We are investigating the cost of pro
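The two KIND conventions can be compared with a pair of lookup functions. The sketch below, with hypothetical function names, maps a g77 KIND value to the size of the type in bytes, using only the values listed in the text; under the byte-count convention that PathScale follows, the returned size is also the KIND value.

```c
/* Size in bytes of a g77 INTEGER with the given KIND, or 0 if the KIND
 * is not one of the values listed in the text. */
int g77_integer_kind_bytes(int kind)
{
    switch (kind) {
    case 3:  return 1;
    case 5:  return 2;
    case 1:  return 4;
    case 2:  return 8;
    default: return 0;
    }
}

/* Size in bytes of a g77 REAL with the given KIND, or 0 if unknown. */
int g77_real_kind_bytes(int kind)
{
    switch (kind) {
    case 1:  return 4;   /* single precision, 32 bits */
    case 2:  return 8;   /* double precision, 64 bits */
    default: return 0;
    }
}
```

For example, a program that hard-wires KIND=2 for a double-precision real under g77 would need KIND=8 under the byte-count convention, which is the mismatch the text warns about.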
147. airs using the same argument to a single call calculating both values at once. The default is ON. -OPT:div_split=(ON|OFF): Enable or disable changing x/y into x*(recip(y)). This is OFF by default, but enabled by -OPT:Ofast or -OPT:IEEE_arithmetic=3. This transformation generates fairly accurate code. -OPT:early_mp=(ON|OFF): This flag has an effect only under -mp compilation. It controls whether the transformation of code to run under multiple threads should take place before or after the loop nest optimization (LNO) phase in the compilation process. The default is OFF, when the transformation occurs after LNO. Some OpenMP programs can yield better performance by enabling -OPT:early_mp, because LNO can sometimes generate more appropriate loop transformations when working on the multi-threaded forms of the loops. If -apo is specified, the transformation of code to run under multiple threads can only take place after the LNO phase, in which case this flag is ignored. -OPT:early_intrinsics=(ON|OFF): When ON, this option causes calls to intrinsics to be expanded to inline code early in the backend compilation. This may enable more vectorization opportunities if vector forms of the expanded operations exist. Default is OFF. -OPT:fast_bit_intrinsics=(ON|OFF): Setting this to ON will turn off the check for the bit count being within range for Fortran intrinsics (like BTEST and ISHFT). The default setting is OFF. -OPT:fast_complex=(ON|OFF): Setting fas
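The div_split transformation can be illustrated by writing both forms by hand in C. This is a sketch with hypothetical function names, showing what the rewrite means at the source level; it is not compiler output, and the exact rounding behavior of the compiler's reciprocal is not specified in the text.

```c
#include <stddef.h>

/* Original form: one division per element. */
void divide_all(double *out, const double *x, size_t n, double y)
{
    for (size_t i = 0; i < n; i++)
        out[i] = x[i] / y;
}

/* Split form: one reciprocal, then a cheaper multiplication per element.
 * The results can differ from true division in the last bits. */
void divide_all_split(double *out, const double *x, size_t n, double y)
{
    const double r = 1.0 / y;   /* recip(y), computed once */

    for (size_t i = 0; i < n; i++)
        out[i] = x[i] * r;
}
```

When y is a power of two, such as 4.0, the reciprocal is exact and both forms agree bit for bit; for other divisors the split form may differ by a small rounding error, which is why the option is grouped with the numerically unsafe optimizations.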
148. all routines at depth N in the call graph, regardless of space limitation. 7.3.5 Cloning. There are two options for controlling cloning. -IPA:multi_clone=N specifies the maximum number of clones that can be created from a single function. The default is 0, which implies that cloning is turned OFF by default. -IPA:node_bloat=N specifies the maximum percentage growth in the number of procedures, relative to the original program, that cloning can produce. The default is 100. 7.3.6 Other IPA Tuning Options. The following options are unrelated to inlining and cloning, but are useful in tuning. -IPA:common_pad_size=N specifies that common block padding should use a pad size of up to N bytes. The default value is 0, which specifies that the compiler will determine the best padding size. -IPA:linear=ON enables linearization of array references. When inlining Fortran subroutines, IPA tries to map formal array parameters to the shape of the actual parameters. The default is OFF, which means IPA will suppress the inlining if it cannot do the mapping. Turning this option ON instructs IPA to still perform the inlining, but linearizes the array references. Such linearization may cause performance problems, but the inlining may produce more performance gain. -IPA:pu_reorder=N controls IPA's procedure reordering optimization. A value of 0 disables the optimization. N=1 e
149. an Intrinsics Supported in 3.0 (Continued)
Columns: Intrinsic Name; Result; Arguments; Families; Remarks
DSM_THIS_CHUNKSIZE; Result: I*8; Arguments: ARRAY (any type, array rank any), DIM (I*1, I*2, I*4, I*8), INDEX (I*1, I*2, I*4, I*8); Families: TRADITIONAL
DSM_THIS_STARTINGINDEX; Result: I*8; Arguments: ARRAY (any type, array rank any), DIM (I*1, I*2, I*4, I*8), INDEX (I*1, I*2, I*4, I*8); Families: TRADITIONAL
DSM_THIS_THREADNUM; Result: I*8; Arguments: ARRAY (any type, array rank any), DIM (I*1, I*2, I*4, I*8), INDEX (I*1, I*2, I*4, I*8); Families: TRADITIONAL
DSQRT; Result: R*8; Arguments: X (R*8); Families: ANSI, G77, PGI, TRADITIONAL; Remarks: E, P
DTAN; Result: R*8; Arguments: X (R*8); Families: ANSI, G77, PGI, TRADITIONAL; Remarks: E, P
DTAND; Result: R*8; Arguments: X (R*8); Families: PGI, TRADITIONAL; Remarks: E, P
DTANH; Result: R*8; Arguments: X (R*8); Families: ANSI, G77, PGI, TRADITIONAL; Remarks: E, P
DTIME; Result: R*4; Arguments: TARRAY (R*4, array rank 1); Families: G77, PGI, TRADITIONAL
DTIME; Result: Subroutine; Arguments: TARRAY (R*4, array rank 1), RESULT (R*4); Families: G77, TRADITIONAL
ENABLE_IEEE_INTERRUPT; Result: Subroutine; Arguments: INTERRUPT (I*8); Families: TRADITIONAL; Remarks: E
EOSHIFT; Families: ANSI, PGI, TRADITIONAL; Remarks: See Std
EPSILON; Arguments: X (R*4, R*8); Families: ANSI, PGI, TRADITIONAL; Remarks: E
EQV; Arguments: I (I*1, I*2, I*4, I*8, R*4, R*8, CrayPtr, L*1, L*2, L*4, L*8), J (I*1, I*2, I*4, I*8, R*4, R*8, CrayPtr, L*1, L*2, L; Families: PGI, TRADITIONAL; Remarks: E
150. an be 0 or a positive integer followed by one of the following letters: k, K, m, or M. These letters specify the cache size in Kbytes or Mbytes. Specifying 0 indicates there is no cache at that level. cs1 is the primary cache, cs2 refers to the secondary cache, cs3 refers to memory, and cs4 is the disk. The default cache size for each type of cache depends on your system. Use -LIST:all_options=ON to see the default cache sizes used during compilation.
-LNO:is_mem1=(ON|OFF), is_mem2=(ON|OFF), is_mem3=(ON|OFF), is_mem4=(ON|OFF): This option specifies that certain memory hierarchies should be modeled as memory, not cache. Default is OFF for each option. Blocking can be attempted for this memory level, and blocking appropriate for memory, rather than cache, is applied. No prefetching is performed, and any prefetching options are ignored. If -OPT:is_memx=(ON|OFF) is specified, the corresponding assocx=N specification is ignored, and any cmpx=N and dmpx=N options on the command line are ignored.
-LNO:ls1=N, ls2=N, ls3=N, ls4=N: This option specifies the line size in bytes. This is the number of bytes, specified in the form of a positive integer number N, that are moved from the memory hierarchy level further out to this level on a miss. Specifying 0 indicates there is no cache at that level.
Following are LNO TLB Options. These arguments control the TLB, a cache for the page table, assumed to be fully associative. The TLB control arguments are the
151. and Subscription Manager Install Guide
The QLogic PathScale Compiler Suite User Guide
The QLogic PathScale Compiler Suite Support Guide
The QLogic PathScale Debugger User Guide
There are also online manual pages ("man pages") available describing the flags and options for the PathScale Compiler Suite. These man pages are a subset of the pages that are shipped with the Compiler Suite: pathf95, pathf90, pathcc, pathCC. The pathscale_intro man page gives a complete list of all the various man pages that are included with the Compiler Suite.
Please see the QLogic website or the PathScale legacy website for further information about current releases and developer support:
http://www.qlogic.com
http://www.pathscale.com/support.html
In addition, you may want to refer to language reference books for more information on compilers and language usage. Programming and language reference books are often a matter of personal taste. Everyone has personal preferences in reference books, and this list reflects the variety of opinions found within the QLogic engineering team.
Fortran Language:
Fortran 95 Handbook: Complete ISO/ANSI Reference, by Jeanne C. Adams, et al., MIT Press, 1997, ISBN 0-262-51096-0
Fortran 95 Explained, by Metcalf, M. and Reid, J., Oxford University Press, 1996, ISBN 0-19-851888-8
C Language
152. ard directories specific to C++.
-nostdlib: No predefined libraries or startfiles.
-o outfile: When this option is used in conjunction with the -c option and a single C source file, a relocatable object file named outfile is produced. When specified with the -S option, the -o option is ignored. If -o and -c are not specified, a file named a.out is produced. If specified, writes the executable file to outfile, rather than to a.out.
-O(0|1|2|3|s): Specify the basic level of optimization desired. The options can be one of the following:
0 Turn off all optimizations.
1 Turn on local optimizations that can be done quickly.
2 Turn on extensive optimization. This is the default. The optimizations at this level are generally conservative, in the sense that they are virtually always beneficial, provide improvements commensurate to the compile time spent to achieve them, and avoid changes which affect such things as floating point accuracy.
3 Turn on aggressive optimization. The optimizations at this level are distinguished from -O2 by their aggressiveness, generally seeking highest-quality generated code even if it requires extensive compile time. They may include optimizations that are generally beneficial but may hurt performance. This includes, but is not limited to, turning on the Loop Nest Optimizer, -LNO:opt=1, and setting -OPT:ro=1:IEEE_arith=2:Olimit=9000:reorg_common=ON.
s Specify that code size is to be given priority in trade
153. are compared for equality. -Wno-float-equal tells the compiler not to warn if floating point values are compared for equality.
-W[no-]format: For C/C++ only. -Wformat warns about printf format anomalies. -Wno-format tells the compiler not to warn about printf format anomalies.
-Wno-format-extra-args: For C/C++ only. Do not warn about extra arguments to printf-like functions.
-W[no-]format-nonliteral: For C/C++ only. With the -Wformat-nonliteral option, and if -Wformat, warn if the format string is not a string literal. For -Wno-format-nonliteral, do not warn if the format string is not a string literal.
-W[no-]format-security: For C/C++ only. For -Wformat-security, if -Wformat, warn on potentially insecure format functions. For -Wno-format-security, do not warn on potentially insecure format functions.
-Wno-format-y2k: For C/C++ only. Do not warn about strftime formats that yield two-digit years.
-W[no-]id-clash: For C/C++ only. -Wid-clash warns if two identifiers have the same first num chars. -Wno-id-clash tells the compiler not to warn if two identifiers have the same first num chars.
-W[no-]implicit: For C/C++ only. -Wimplicit warns about implicit declarations of functions or variables. -Wno-implicit tells the compiler not to warn about implicit declarations of functions or variables.
-W[no-]implicit-function-declaration: For C/C++ only. -Wimpl
154. are preferred. Leaf routines (functions containing no call) are also favored. Inlining continues until no more calls satisfy the inlining criteria, which can be controlled by the inlining options:
-IPA:inline=OFF turns off IPA's inliner, and the lightweight inliner is also suppressed since IPA is invoked. Default is ON.
-INLINE:none turns off automatic inlining by IPA, but required inlining implied by the language or specified by the user is still performed. By default, automatic inlining is turned ON.
-IPA:specfile=filename directs the compiler to open the given file to read more -IPA or -INLINE options.
The following options can be used to tune the aggressiveness of the inliner. Very aggressive inlining can cause performance degradation, as discussed in Section 7.3.3.
-OPT:Olimit=N specifies the size limit N, where N is computed from the number of basic blocks that make up a function; inlining will never cause a function to exceed this size limit. The default is 6000 under -O2 and 9000 under -O3. The value 0 means no limit is imposed.
-IPA:space=N specifies that inlining should continue until a factor of N increase in code size is reached. The default is 100. If the program size is small, the value of N could be increased.
-IPA:plimit=N suppresses inlining into a function once its size reaches N, where N is measured in terms of the number of basic blocks and the number of calls inside a function. The default is 2500.
-IPA:small_pu=N
155. argument data types. For instance, pathf90 integer*4 matches C int, integer*8 matches C long long, real matches C float (provided the C function has an explicit prototype), and doubleprecision matches C double. Fortran character is problematic because, in addition to passing a pointer to the first character, it appends an integer length-count argument to the end of the usual argument list. Fortran "Cray pointers" (declared with the pointer statement) correspond to C pointers, but Fortran 90 pointers (declared with the pointer attribute) are unique to Fortran. The sequence keyword makes it more likely that a Fortran 90 structure will use the same layout as a C structure, although it is wise to verify this by experiment in each case.
For arrays, it is wise to limit the interface to the kinds of arrays provided in Fortran 77, since the arrays introduced in Fortran 90 add to the data structures information that C cannot understand. Thus, for example, an argument a(5,6) or a(n) or a(1:*) (where n is a dummy argument) will pass a simple pointer that corresponds well to a C array, whereas an allocatable array or a Fortran 90 pointer array does not correspond to anything in C.
NOTE: Fortran arrays are placed in memory in column-major order, whereas C arrays use row-major order. And, of course, one must adjust for the fact that C array in
156. arithmetic. -Wno-pointer-arith tells the compiler not to warn about function pointer arithmetic.
-W[no-]redundant-decls: For C/C++ only. -Wredundant-decls warns about multiple declarations of the same object. -Wno-redundant-decls tells the compiler not to warn about multiple declarations of the same object.
-W[no-]reorder: For C/C++ only. The -Wreorder option warns when reordering member initializers. -Wno-reorder tells the compiler not to warn when reordering member initializers.
-W[no-]return-type: For C/C++ only. -Wreturn-type warns when a function return type defaults to int. -Wno-return-type tells the compiler not to warn when a function return type defaults to int.
-W[no-]sequence-point: For C/C++ only. -Wsequence-point warns about code violating sequence point rules. -Wno-sequence-point tells the compiler not to warn about code violating sequence point rules.
-W[no-]shadow: For C/C++ only. -Wshadow warns when one local variable shadows another. -Wno-shadow tells the compiler not to warn when one local variable shadows another.
-W[no-]sign-compare: For C/C++ only. -Wsign-compare warns about signed/unsigned comparisons. -Wno-sign-compare tells the compiler not to warn about signed/unsigned comparisons.
-W[no-]sign-promo: For C/C++ only. The -Wsign-promo option warns when overload resolution promotes from unsigned to signed. -Wno-sign-promo tells the compiler not t
157. art
    implicit none
    ! Explicit interface is not required, but adds some error-checking
    interface
      subroutine c_reference(d1, f1, i1, i2, c1, l1, l2, c2, c3)
        doubleprecision d1
        real f1
        integer i1
        integer*8 i2
        character c1, c3
        character*4 c2
        logical l1, l2
      end subroutine c_reference
      logical function c_value(d, f, i, i8)
        doubleprecision d
        real f
        integer i
        integer*8 i8
      end function c_value
    end interface
    logical l
    pointer (p_user, user)
    character*32 user
    integer*8 getlogin_nounderscore   ! File decorate.txt maps this to
    external getlogin_nounderscore    ! "getlogin" without underscore
    intrinsic char
    ! Demonstrate calling from Fortran a C function taking arguments by reference
    call c_reference(9.8d0, 7.6, 5, 4_8, 'hello', .false., .true., &
      'from', 'f_part')
    ! Demonstrate calling from Fortran a C function taking arguments by value
    l = c_value(%val(9.8d0), %val(7.6), %val(5), %val(4_8))
    write(6, "(a,l8)") "l=", l
    ! "getlogin" is a standard C library function which returns "char *".
    ! When a C function returns a pointer, you must use a Cray pointer to
    ! receive the address and examine the data at that address, instead of
    ! assigning to an ordinary variable
    p_user = getlogin_nounderscore()
    write(6, "(3a)") "'", user(1:index(user, char(0)) - 1), "'"
    end program f_part
    ! Subroutine to be called from C
158. athScale Compiler Suite. You can type "man -k pathscale" or "apropos pathscale" to get a list of all the PathScale man pages on your system. The following appendix is a copy of the information found in the eko man page, which is a listing of all of the supported flags and options. You can view this same information online by typing: man eko. The eko man page information begins on the following page.
The complete list of options and flags for the QLogic PathScale™ Compiler Suite
-CG, -INLINE, -IPA, -LANG, -LNO, -OPT, -TENV, -WOPT: other major topics covered
DESCRIPTION
This man page describes the various flags available for use with the PathScale pathcc, pathCC, and pathf95 compilers.
OPTIMIZATION FLAGS
Some suboptions either enable or disable the feature. To enable a feature, either specify only the suboption name or specify =1, =ON, or =TRUE. Disabling a feature is accomplished by adding =0, =OFF, or =FALSE. These values are insensitive to case: 'on' and 'ON' mean the same thing. Below, ON and OFF are used to indicate the enabling or disabling of a feature.
Many options have an opposite ("no-") counterpart. This is represented as [no-] in the option description and, if used, will turn off or prevent the action of the option. If no [no-] is shown, there is no opposite option to the listed option.
Like the -v option, only nothing is run and ar
159. athopt2 examples make def cd
7.9.8.2 Example 1: Run with Makefile
This shows the simplest use of the application with a Makefile. There are no optimization flags in the make.def file we supply. All optimization flags are sent from pathopt2 to the compiler by propagating the value of % from the pathopt2 command line to the CFLAGS and FFLAGS Makefile variables. The command will now look like this:
pathopt2 -t try5 -r bin/ft.A make clean ft CLASS=A FFLAGS="%"
Note that we omitted the -f pathopt2.xml option in this example. As mentioned previously, when this option is omitted, pathopt2 will use the file pathopt2.xml if it is present in the current working directory; otherwise it will use the default pathopt2.xml that ships with the software.
Output from the run should be similar to the following. Only the sorted summary is shown here (some timing digits did not survive extraction and are reproduced as found):
Sorted summary from all runs:
Flags; Build; Test; Real; User; System
-OPT:Ofast; PASS; PASS; 12.74; 12.38; 0.36
-O3 -ipa; PASS; PASS; 124 747 2 31; 0.45
-O3; PASS; PASS; 2549; 12.42; 0.37
-Ofast; PASS; PASS; 13.66; 3.19; 0.47
-O2; PASS; 4.50; 4.12; 0.38
7.9.8.3 Example 2: Use Build/Run Scripts and a Timing File
Next, let's assume that we want to do our pathopt2 work in a sub-directory of NPB2.3-SER, to avoid littering the top-level directory with scripts and possibly output files:
mkdi
160. automatic array or compiler temporary exceeds size bytes, it is allocated on the heap instead of the stack. If size is -1, objects are always put on the stack. If size is 0, objects are always put on the heap. The default is -1 for maximum performance and for compatibility with previous releases.
-LANG:IEEE_minus_zero=setting: Enable or disable the SIGN(3I) intrinsic function's ability to recognize negative floating-point zero (-0.0). Specify either ON or OFF for setting. The default is OFF, which suppresses the minus sign. The minus sign is suppressed by default to prevent problems from hardware instructions and optimizations that can return a -0.0 result from a 0.0 value. To obtain a minus sign when printing a negative floating-point zero (-0.0), use the -z option on the assign(1) command.
-LANG:IEEE_save=setting: For Fortran, the ISO standard requires that any procedure which accesses the standard IEEE intrinsic modules via a use statement must save the floating-point flags, halting mode, and rounding mode on entry; must restore the halting mode and rounding mode on exit; and must OR the saved flags with the current flags on exit. Setting this option OFF may improve execution speed by skipping these steps.
-LANG:recursive=setting: Invoke the language option control group to control recursion support. setting can be either ON or OFF. The default is OFF. In either mode, the compiler supports a recursive, stack-based calling sequence. The diffe
161. be one of the following:
0 Disable nearly all loop nest optimizations.
1 Perform full loop nest transformations. This is the default.
-LNO:ou_prod_max=N: This option indicates that the product of unrolling of the various outer loops in a given loop nest is not to exceed N, where N is a positive integer. The default is 16.
-LNO:outer=(ON|OFF): This option enables or disables outer loop fusion. Default is ON.
-LNO:outer_unroll_max,ou_max=N: The outer_unroll_max option indicates that the compiler may unroll outer loops in a loop nest by as many as N per loop, but no more. The default is 5.
-LNO:parallel_overhead=N: Effective only when specified with -apo, the parallel_overhead option controls the auto-parallelizing compiler's estimate of the overhead (in processor cycles) incurred by invoking the parallel version of a loop. When the compiler parallelizes a loop, it generates both a serial and a parallel version. If the amount of work performed by the loop is small, it may not be beneficial to use the parallel version during execution. The set value of parallel_overhead is used in this determination during execution time, when the number of processors and the iteration count of the loop are taken into account. The default value is 4096. Because the optimal value varies across systems and programs, this option can be used for parallel performance tuning.
-LNO:prefetch=(0|1|2|3): This option specifies the level of prefetching.
0 Prefetch disable
162. be used to select any combination of phases. For example, -Wba,-o,foo passes the option -o foo to the b and a phases.
-Wall: Enable most warning messages.
-WB,: -WB,arg passes arg to the backend via ipacom.
-Wdeclaration-after-statement: For C/C++ only. Warn about declarations after statements (pre-C99).
-Werror-implicit-function-declaration: For C/C++ only. Give an error when a function is used before being declared.
-W[no-]aggregate-return: For C/C++ only. -Waggregate-return warns about returning structures, unions, or arrays. -Wno-aggregate-return will not warn about returning structures, unions, or arrays.
-W[no-]bad-function-cast: -Wbad-function-cast attempts to support writable-strings K&R style C. -Wno-bad-function-cast tells the compiler not to warn when a function call is cast to a non-matching type.
-W[no-]cast-align: For C/C++ only. -Wcast-align warns about pointer casts that increase alignment. -Wno-cast-align instructs the compiler not to warn about pointer casts that increase alignment.
-W[no-]cast-qual: For C/C++ only. -Wcast-qual warns about casts that discard qualifiers. -Wno-cast-qual tells the compiler not to warn about casts that discard qualifiers.
-W[no-]char-subscripts: For C/C++ only. -Wchar-subscripts warns about subscripts whose type is char. The -Wno-char-subscripts option tells the compiler not to warn about subscripts
163. by the sched_setaffinity(2) call in the C library. You will need a recent C library to be able to use this call. On systems that lack NUMA support in the kernel, and on runs that do not set process affinity before they start, we have seen variations in performance of 30% or more between individual runs.
7.8.6 Testing Memory Latency and Bandwidth
To test your memory latency and bandwidth, we recommend two tools. For memory latency, the LMbench package provides a tool called lat_mem_rd. This provides a cryptic, but fairly accurate, view of your memory hierarchy latency. LMbench is available from http://www.bitmover.com/lmbench/
For measuring memory bandwidth, the STREAM benchmark is a useful tool. Compiling either the Fortran or C version of the benchmark with the following command lines will provide excellent performance:
pathf95 -Ofast stream_d.f second_wall.c -DUNDERSCORE
pathcc -Ofast -lm stream_d.c second_wall.c
If you do not compile with at least -O3, performance may drop by 40% or more. The STREAM benchmark is available from http://www.streambench.org/
For both of these tools, we recommend that you perform a number of identical runs and average your results, as we have observed variations of more than 10% between runs.
7.9 The pathopt2 Tool
The pathopt2 tool is used to iteratively test different options and option combinations by compiling a set of application source code files, measuring the pe
164. c This variable is used with the gcc compatibility wrapper scripts.
A.3 Environment Variables for Use with Fortran
F90_BOUNDS_CHECK_ABORT: Set to YES, causes the program to abort on the first bounds check violation.
F90_DUMP_MAP: Dump memory mapping at the location of a segmentation fault.
FTN_SUPPRESS_REPEATS: Output multiple values instead of using the repeat factor; used at runtime.
NLSPATH: Flags for runtime and compile-time messages. If the main function in your program is coded in C, then even though other parts of the program are coded in Fortran, the Fortran runtime library will not be able to find the file which provides runtime error messages. To remedy this, set the NLSPATH environment variable to the location of the error messages, using %N for the base name of the file. For example, if the compiler version is 2.1, set it to /opt/pathscale/lib/2.1/%N.cat.
PSC_FDEBUG_ALLOC: Flag to debug Fortran memory allocations. This variable is used to initialize memory locations during execution.
PSC_FFLAGS: Flags to pass to the Fortran compiler, pathf95. This variable is used with the gcc compatibility wrapper scripts.
PSC_STACK_LIMIT: Controls the stack size limit the Fortran runtime attempts to use. This string takes the format of a floating-point number, optionally followed by one of the characters k (for units o
165. cation for this file is in the directory where the command is executed. This location can be changed using the -module option. The MODULENAME.mod file allows other Fortran files to use procedures, functions, variables, and any other entities defined in the module. Module files can be considered similar to C header files. Like C header files, you can use the -I option to point to the location of module files:
pathf95 -I/work/project/include -c foo.f90
This instructs the compiler to look for .mod files in the /work/project/include directory. If foo.f90 contains a 'use arith' statement, the following locations would be searched:
/work/project/include/ARITH.mod
./ARITH.mod
Order of Appearance
If a module and the use statements referring to that module appear in the same source file, the module must appear first. If a module appears in one source file and the use statements referring to that module appear in other source files, the file containing the module must be compiled first. If a single command compiles all the files, the file containing the module must appear on the command line before the files containing the use statements:
pathf95 mymodule.f95 myprogram.f95
3.2.2 Linking Object Files to the Rest of the Program
A source file containing a module generates an object (.o) file as well as a module information (.mod) file
166. cause an autoparallelized program runs under multiple threads. The runtime decision to create multiple threads, followed by their synchronization, is overhead during execution.
When the compiler parallelizes a loop, it generates both a serial and a parallel version. At runtime, the generated code looks at the total amount of work performed by the loop and decides whether to execute the serial or the parallel version. This decision can only be made at runtime, when the number of processors and the loop iteration counts are available. If the amount of work is not large enough to justify the additional synchronization overhead, it will execute the serial version instead. In such cases, the performance will be slower than if the program is not compiled with -apo, due to the need to make this decision at run time.
The synchronization overhead can be controlled using the -LNO:parallel_overhead option. The value of this option is the compiler's estimate of the overhead (in processor cycles) in invoking the parallel version of a loop. This value affects the runtime decision on whether to execute the serial or parallel versions. Because the optimal value varies across systems and programs, this option can be used for parallel performance tuning under -apo. For more information on this option, see the eko man page.
8.3 Getting Started With Op
167. character must be included somewhere in the build command, since this is the mechanism by which the chosen optimization options are propagated to the build command. Finally, ./factorial is used as the test command.
For simple cases, the -o flag can be omitted and the default executable output a.out can be used as the test command:
pathopt2 -f pathopt2.xml -t try5 -r ./a.out pathcc % factorial.c
NOTE: The order of the options in the command line does not matter. However, the required build command comes last, since it may have an arbitrary number of options and arguments of its own.
When the -f option is not specified, pathopt2 will use the file pathopt2.xml if it is present in the current working directory; otherwise it will use the default pathopt2.xml that ships with the software.
The available pathopt2 options are given in Table 7-4. You can also type pathopt2 -h on the command line to get usage information.
Table 7-4. pathopt2 Options
Columns: Option; Description; Default
Description: Do not redirect I/O to /dev/null. This is useful for debugging problems with the compilation, the run, or the build and test scripts. Default: All I/O from the build and test commands will be sent to /dev/null, under the assumption that the program will build and run cleanly.
-f configfile; Description: The -f option
168. command line with -intrinsic=setlinebuf or -intrinsic=EVERY.
short: Convert to type integer*2.
signal: Fortran interface to the C library function signal. Arrange for the signal whose number is number to trigger a call to external procedure handler (which should be a subroutine with no arguments), or restore the default response to the signal, or ignore the signal.
The optional third argument igndfl takes these values:
-1 Use the second argument to provide a handler, to restore the default response to the signal, or to ignore the signal.
0 Regardless of the value of the second argument, restore the default response to the signal.
1 Regardless of the value of the second argument, ignore the signal instead.
When igndfl is omitted, handler can be an integer with these possible values:
address An integer containing the address of the external procedure.
0 Restore the default response to the signal.
1 Ignore the signal.
The function form returns the previous state of the signal (zero if the default response was in effect, one if the signal was being ignored) or the address of a handler procedure.
Here is an example using the two-argument form:
C Keyboard interrupt (normally Control-C) alternately triggers
C handler1 and handler2 until 4 interrupts have occurred. Then
C restore the default handling so the fifth interrupt stops the
C program
169. compiled by g77, and if that library contains functions that return COMPLEX or REAL types, you need to tell the compiler to treat those functions differently. Use the -ff2c-abi switch at compile time to point the PathScale compiler at a file that contains a list of functions in the g77-compiled libraries that return COMPLEX or REAL types. When the PathScale compiler generates code that calls these listed functions, it will modify its ABI behavior to match g77's expectations. The -ff2c-abi flag is used at compile time and not at link time.
NOTE: Specify the -ff2c-abi switch only once on the command line. If you have multiple g77-compiled libraries, you need to place all the appropriate symbol names into a single file.
The format of the file is one symbol per line. Each symbol should be as you would specify it in your Fortran code (i.e. do not mangle the symbol). As an example:
cat example-list
sdot
cdot
You can use the fsymlist program to generate a file in the appropriate format. For example:
fsymlist /opt/gnu64/lib/mylibrary.a > mylibrary-list
This will find all Fortran symbols in the mylibrary.a library and place them into the mylibrary-2.0-list file. You can then use this file with the -ff2c-abi switch.
NOTE: The fsymlist program generates a list of all Fortran symbols in the library, inclu
170. ct profile information about the program (for example, it records how frequently every if() statement is true). This information is then used in later compilations to tune the executable. FDO is most useful if a program's typical execution is roughly similar to the execution of the instrumented program on its input data set; if different input data has dramatically different if() frequencies, using FDO might actually slow down the program. This section also discusses how to invoke this feature with the -fb-create and -fb-opt flags.
NOTE: If the -fb-create and -fb-opt compiles are done with different compilation flags, it may or may not work, depending on whether the different compilation flags cause different code to be seen by the phase that is performing the instrumentation/feedback. We recommend using the same flags for both instrumentation and feedback.
FDO requires compiling the program at least twice. In the first pass:
pathcc -O3 -ipa -fb-create fbdata -o foo foo.c
The executable foo will contain extra instrumentation library calls to collect feedback information; this means foo will actually run a bit slower than normal. We are using fbdata for the file name in this example; you can use any name for your file. Next, run the program foo with an example dataset:
./foo <typical-input-data>
During this run, a file with the prefix fbdata will be created, containing feedback information. The file name you use will become the p
171. ctions.
-module dir: Create the .mod file corresponding to a module statement in the directory dir instead of the current working directory. Also, when searching for modules named in use statements, examine the directory dir before the directories established by -Idir options.
-mp: Interpret OpenMP directives to explicitly parallelize regions of code for execution by multiple threads on a multi-processor system. Most OpenMP 2.0 directives are supported by pathf95, pathcc, and pathCC. See the QLogic PathScale Compiler Suite User Guide for more information on these directives.
-MP: With -M or -MM, add phony targets for each dependency.
-MQ: Same as -MT, but quote characters that are special to Make.
-msse2: Enable use of SSE2 instructions. This is the default under both -m64 and -m32.
-msse3: Enable use of SSE3 instructions. Default is ON under -march=em64t. Otherwise, it is OFF by default.
-mtune=cpu-type: Behaves like -march. See -march.
-MT: Change the target of the generated dependency rules.
-mx87-precision=(32|64|80): Specify the precision of x87 floating-point calculations. The default is 80 bits.
-nobool: Do not allow boolean keywords.
-nocpp: For Fortran only. Disable the source preprocessor. See the -cpp, -E, and -ftpp options for more information on controlling preprocessing.
-nodefaultlibs: Do not use standard system libraries when linking.
172. cy 250000.00000 (heuristic)
    #<freq> BB:60 => BB:60 probability = 0.99994
    #<freq> BB:60 => BB:59 probability = 0.00006
    #<freq>
    .loc 1 120 0
    # 119 for (j=0; j<N; j++)
    # 120 a[j] = 2.0E0 * a[j];
    movapd 0(%r8),%xmm3     # [0] id:82 a+0x0
    movapd 16(%r8),%xmm2    # [1] id:82 a+0x0
    addpd %xmm3,%xmm3       # [4]
    addpd %xmm2,%xmm2       # [5]
    movapd 32(%r8),%xmm1    # [2] id:82 a+0x0
    movapd 48(%r8),%xmm0    # [3] id:82 a+0x0
    addpd %xmm1,%xmm1       # [6]
    addpd %xmm0,%xmm0       # [7]
    movntpd %xmm3,0(%r8)    # [9] id:83 a+0x0
    movntpd %xmm2,16(%r8)   # [10] id:83 a+0x0
    addq $64,%r8            # [8]
    movntpd %xmm1,-32(%r8)  # [11] id:83 a+0x0
    cmpq %rbp,%r8           # [11]
    movntpd %xmm0,-16(%r8)  # [12] id:83 a+0x0
    jle .LBB60_main         # [12]
Note the "unrolled 4 times" comment above and the original source in comments, which tell you what the compiler did, even if you can't read x86 assembly code.
7.10.2 Using -CLIST or -FLIST
You can use -CLIST:=on (for C codes) or -FLIST:=on (for Fortran codes) to see what the compiler is doing. On the same STREAM source code, compile with the -CLIST flag:
pathcc -O3 -CLIST:=ON -c stream_d.c
The output will look something like this:
/opt/pathscale/lib/2.3.99/be translates /tmp/ccI.16xQZJ into stream_d.w2c.h and stream_d.w2c.c, based on source stream_d.c
When you look at stream_d.w2c.c
173. d
  1  Prefetch is done only for arrays that are always referenced in each iteration of a loop.
  2  Prefetch is done without the above restriction. This is the default.
  3  Most aggressive.
-LNO:prefetch_ahead=N
  Prefetch N cache line(s) ahead. The default is 2.
-LNO:prefetch_verbose=ON
  -LNO:prefetch_verbose=ON prints verbose prefetch info to stdout. Default is OFF.
-LNO:processors=N
  Tells the compiler to assume that the program compiled under -apo will be run on a system with the given number of processors. This helps in reducing the amount of computation during execution for determining whether to enter the parallel or serial versions of loops that are parallelized (see the -LNO:parallel_overhead option). The default is 0, which means an unknown number of processors. The default value of 0 should be used if the program is intended to run on different systems with different numbers of processors. If the option is set to non-zero and the value is different from the number of processors, the parallelized code will not perform optimally.
-LNO:sclrze=(ON|OFF)
  Turn ON or OFF the optimization that replaces an array by a scalar variable. The default is ON.
-LNO:simd=(0|1|2)
  This flag controls inner loop vectorization, which makes use of SIMD instructions provided by the native processor.
  0  Turn off the vectorizer.
  1  (Default) Vectorize only if the compiler can d
174. d option for g++.
-std=c89
  -std option for gcc.
-std=c99
  -std option for gcc.
-std=c9x
  -std option for gcc.
-std=gnu++98
  -std option for g++.
-std=gnu89
  -std option for gcc.
-std=gnu99
  -std option for gcc.
-std=gnu9x
  -std option for gcc.
-std=iso9899:1990
  -std option for gcc.
-std=iso9899:199409
  -std option for gcc.
-std=iso9899:1999
  -std option for gcc.
-std=iso9899:199x
  -std option for gcc.
-stdinc
  Predefined include search path list.
-subverbose
  Produce diagnostic output about the subscription management for the compiler.
-TENV:...
  This option specifies the target environment option group. These options control the target environment assumed and/or produced by the compiler.
-TENV:frame_pointer=(ON|OFF)
  Default is ON for C++ and OFF otherwise. Local variables in the function stack frame are addressed via the frame pointer register. Ordinarily, the compiler will replace this use of the frame pointer by addressing local variables via the stack pointer when it determines that the stack pointer is fixed throughout the function invocation. This frees up the frame pointer for other purposes. Turning this flag on forces the compiler to use the frame pointer to address local variables. This flag defaults to ON for C++ because the exception handling mechanism relies on the frame pointer.
175. d to Temporary Variables
In some situations, the Fortran standard requires that actual arguments to procedure calls be copied to and from temporary variables. Often this occurs because a program employs array features introduced in the Fortran 90 standard along with procedures having traditional Fortran 77-style implicit interfaces. In particular, Fortran 77-style procedures expect all arrays to be contiguous in memory, but Fortran 90 permits arrays whose elements are scattered or strided.
The copying takes time, but contiguous arrays may make better use of the processor cache memory. Whether the program runs faster or slower depends on which of those factors dominates, and that depends on the details of the program. Because unintended copying can slow program execution, the compiler provides optional warnings about it.
The example below shows two out of many situations in which copying takes place: one in which copying is conditional on the nature of the array, and another in which copying is unconditional.
  $ cat cico.f90
  subroutine possible(a, n)
    implicit none
    integer :: n
    integer, dimension(n) :: a
    print "(a,25i5)", "possible", a
  end subroutine possible
  program copier
    implicit none
    logical :: l
    integer :: i
    integer, target :: a(5,5) = reshape((/ (i, i=1,25) /), (/ 5, 5 /))
    integer, poin
176. dd the -pg flag to both the compile and link steps with the PathScale compilers. This generates an instrumented binary.
2. Run the program executable with the input data of interest. This creates a gmon.out file with the profile data.
3. Run pathprof <program-name> to generate the profiles. The standard output of pathprof includes two tables:
   a. A flat profile with the time consumed in each routine and the number of times it was called, and
   b. A call-graph profile that shows, for each routine, which routines it called and which other routines called it. There is also an estimate of the inclusive time spent in a routine and all of the routines called by that routine.
NOTE: The pathprof tool will generate a segmentation fault when used with OpenMP applications that are run with more than one thread. There is no current workaround for pathprof or gprof.
See section 9 for a more detailed example of profiling.
2.12 taskset: Assigning a Process to a Specific CPU
To improve the performance of your application on multiprocessor machines, it is useful to assign the process to a specific CPU. The tool used to do this is taskset, which can be used to retrieve or set a process' affinity. This command is part of the schedutils package/RPM.
NOTE: Some of the Linux distributions supported by the PathScale compilers
177. default.
  3  Most aggressive.
-LNO:prefetch_ahead=N
  This option prefetches the specified number of cache lines ahead of the reference. Specify a positive integer for N; the default is 2.
-LNO:prefetch_manual=(ON|OFF)
  This option specifies whether manual prefetches (through directives) should be respected or ignored.
  prefetch_manual=OFF ignores directives for prefetches.
  prefetch_manual=ON respects directives for prefetches. This is the default.
-M
  Run cpp and print list of make dependencies.
-m32
  Compile for 32-bit ABI, also known as x86 or IA32. See -m64 for defaults.
-m3dnow
  Enable use of 3DNow instructions. The default is OFF.
-m64
  Compile for 64-bit ABI, also known as AMD64, x86_64, or IA32e. On a 32-bit host, the default is 32-bit ABI. On a 64-bit host, the default is 64-bit ABI if the target platform (-march/-mcpu/-mtune) is 64-bit; otherwise the default is 32-bit.
-macro_expand
  Enable macro expansion in preprocessed Fortran source files throughout each file. Without this option specified, macro expansion is limited to preprocessor directives in files processed by the Fortran preprocessor. When this option is specified, macro expansion occurs throughout the source file.
-march=<cpu-type>
  Compiler will optimize code for the selected cpu type: opteron, athlon, athlon64, athlon64fx, em64t, pentium4, xeon, core, anyx86, auto. auto means to optimize
178. dices originate at zero, whereas Fortran array indices originate at 1 by default but can be declared with other origins instead.
Calls between C++ and Fortran are more difficult, for the same reason that calls between C and C++ are difficult: the C++ compiler must mangle symbol names to implement overloading, and the C++ compiler must add to data structures various information (such as virtual table pointers) that other languages cannot understand. The simplest solution is to use the extern "C" declaration within the C++ source code to tell it to generate a C-compatible interface, which reduces the problem to that of interfacing C and Fortran.
3.5.1.1 Example: Calls between C and Fortran
Here are three files you can compile and execute that demonstrate calls between C and Fortran. This is the C source code (c.c):
  #include <stdio.h>
  #include <alloca.h>
  #include <string.h>
  extern void f1_(char *c, int *i, long long *ll, float *f, double *d, int *l, int c_len);
  /* Demonstrate how to call Fortran from C */
  void call_fortran() {
    char *c = "hello from call_fortran";
    int i = 123;
    long long ll = 456ll;
    float f = 7.8;
    double d = 9.1;
    int nonzero = 10; /* Any nonzero integer is .true. in Fortran */
    f1_(c, &i, &ll, &f, &d, &nonzero, strlen(c));
  }
  /* C function designed to be called from Fortran, pass
179. ding those that do not return COMPLEX or REAL types. The extra symbols will be ignored by the compiler.
3.8.3.1 AMD Core Math Library (ACML)
The AMD Core Math Library (ACML) incorporates BLAS, LAPACK, and FFT routines, and is designed to obtain maximum performance from applications running on AMD platforms. This highly optimized library contains numeric functions for mathematical, engineering, scientific, and financial applications. ACML is available both as a 32-bit library (for compatibility with legacy x86 applications) and as a 64-bit library that is designed to fully exploit the large memory space and improved performance offered by the x86_64 architecture (we have not tested ACML on the EM64T version of the compiler suite).
To use ACML 1.5 with the PathScale Fortran compiler, use the following:
  $ pathf95 foo.f bar.f -lacml
To use ACML 2.0 with the PathScale Fortran compiler, use the following:
  $ pathf95 -L<path to acml lib> foo.f bar.f -lacml
ACML 2.5.1 and later, built with the PathScale compilers, is available from the AMD website at http://developer.amd.com/acml.aspx. With these later versions of ACML, the workarounds described above are unnecessary.
3.8.4 List-Directed I/O and Repeat Factors
By default, when list-directed I/O is used and two or more consecutive values are identical, the output uses a repeat factor.
180. e. It may be desirable to spread over cores first and minimize the number of chips, to improve locality. Alternatively, it may be desirable to spread over chips first, to maximize the number of chips and so the available system memory bandwidth.
For example, here are the generated thread assignments for a system comprising four chips, each with two cores, where PSC_OMP_CPU_STRIDE is set to 2:
  CHIP 0      CHIP 1      CHIP 2      CHIP 3
  CPU0  CPU1  CPU2  CPU3  CPU4  CPU5  CPU6  CPU7
  T0    T4    T1    T5    T2    T6    T3    T7
  T8    T12   T9    T13   T10   T14   T11   T15
  T16
(Tx indicates thread number x.)
Here is another example, for two chips with four cores each and PSC_OMP_CPU_STRIDE set to 4:
  CHIP 0                  CHIP 1
  CPU0  CPU1  CPU2  CPU3  CPU4  CPU5  CPU6  CPU7
  T0    T2    T4    T6    T1    T3    T5    T7
  T8    T10   T12   T14   T9    T11   T13   T15
  T16
This variable is most useful when the number of threads is fewer than the number of CPUs. In the common case where the number of threads is the same as the number of CPUs, there is typically no need to set PSC_OMP_CPU_STRIDE. Note that the same mappings can also be obtained by enumerating the CPU numbers using the PSC_OMP_AFFINITY_MAP variable.
PSC_OMP_CPU_OFFSET (Integer value)
This specifies an integer value that is u
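The two assignment tables above can be reproduced with a short calculation. The following is only a sketch: the function name stride_cpu and the formula are inferred from the two tables, not taken from the runtime library, and it assumes the stride divides the number of CPUs evenly.

```c
/* Hypothetical helper (not part of the PathScale runtime): returns the CPU
 * that a given thread lands on for a given stride, reproducing the two
 * tables in the text. Assumes ncpus is a multiple of stride. */
int stride_cpu(int thread, int stride, int ncpus)
{
    int pos = thread * stride;            /* step through the CPUs 'stride' at a time */
    return pos % ncpus                    /* position within the current pass */
         + (pos / ncpus) % stride;        /* each completed pass shifts the start by one CPU */
}
```

With stride 2 on 8 CPUs, threads 0 to 3 take the first core of each chip and threads 4 to 7 take the second core, which is the "spread over chips first" behavior described above.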
181. e calculated value is divided by the number of CPUs in the system. This ensures that the physical memory available for stack can be shared between as many threads as there are CPUs in the system.
  - Otherwise, this is a C/C++ program and the stack limit is set to a default value of 32MB.
The distinction between Fortran and C/C++ programs is determined by whether the program entry point is MAIN for Fortran or main for C/C++.
This stack size is then compared against system-imposed limits, both lower and upper. If the check fails, then a warning is generated and the stack size is automatically adjusted to the appropriate limit.
The following lower limit is imposed:
  - The minimum size of a pthread stack specified by the system. This is typically 16KB.
The following upper limits are imposed:
  - The maximum stack size that the system's pthread library will accept, i.e. the system-imposed upper bound on the pthread stack size. The library dynamically detects this value at start-up time. For systems using linuxthreads, this limit is typically in the range of 8MB to 32MB. For systems using NPTL threads, there is typically no arbitrary limit imposed by the system on the stack size.
  - libopenmp imposes a limit of 1GB when using the 32-bit version of libopenmp, and a limit of 4GB when using the 64-bit version of libopenmp. These limits pr
182. e of a simplified version of that data structure, extracted from that file.
3.4.7 Bounds Checking
The PathScale Fortran compiler can perform bounds checking on arrays. To enable this feature, use the -C option:
  $ pathf95 -C gasdyn.f90 -o gasdyn
The generated code checks all array accesses to ensure that they fall within the bounds of the array. If an access falls outside the bounds of the array, you will get a warning from the program printed on the standard error at runtime:
  $ ./gasdyn
  lib-4961: WARNING
  Subscript 20 is out of range for dimension 1 for array X at line 11 in file t.f90 with bounds 1:10
If you set the environment variable F90_BOUNDS_CHECK_ABORT to YES, then the resulting program will abort on the first bounds check violation.
Array bounds checking has an impact on code performance, so it should be enabled only for debugging and disabled in production code that is performance sensitive.
3.4.8 Pseudo-random Numbers
The pseudo-random number generator (PRNG) implemented in the standard PathScale Fortran library is a non-linear additive feedback PRNG with a 32-entry long seed table. The period of the PRNG is approximately 16*((2**32)-1).
3.5 Mixed Code
If you have a large application that mixes Fortran code with code written in other languages, and the main entry point to your application is from C o
183. e same basic block Default is OFF help List all available options The compiler is not invoked help Print list of possible options that contain a given string H Print the name of each header file used Idir Specify a directory to be searched This is used for the following types of files m Files named in INCLUDE lines in the Fortran source file that do not begin with a slash character 1 02404 15 E eko man Page QLOGIC ls m Files named in Zinclude source preprocessing directives that do not begin with a slash character m Files specified on Fortran USE statements Files are searched in the following order first in the directory that contains the input file second in the directories specified by dir and third in the standard directory usr include iN For Fortran only Specify the length of default integer constants default integer variables and logical quantities Specify one of the following Option Action i4 Specifies 32 bit 4 byte objects The default i8 Specifies 64 bit 8 byte objects ignore suffix Determine the language of the source file being compiled by the command used to invoke the compiler By default the language is determined by the file suffixes c C cxx f 90 s When the ignore suffix option is specified the pathcc command invokes the C compiler pathCC invokes the C compiler and pathf95 invokes the Fortran 95 c
184. e seconds: a running sum of the number of seconds accounted for by this function and those listed above it.
NOTE: The pathprof program included with the PathScale Compiler Suite is a symbolic link to your system's gprof executable. The pathprof and pathcov programs link to the gprof and gcov executables in the version of GCC on which the PathScale Compiler Suite is based. Please note that the pathprof tool will generate a segmentation fault when used with OpenMP applications that are run with more than one thread. There is no current workaround for pathprof or gprof.
Now we note that the total time that pathprof measures is 163.3 secs vs. the 150.3 that we measured for the original -O2 binary. But considering that the -O2 -pg instrumented binary took 247 seconds to run, this is a pretty good estimate.
It is nice that the top hot spot, zgemm, consumes about 50% of the total time. We also note that some very small routines, zaxpy, zcopy, and lsame, are called a very large number of times. These look like ideal candidates for inlining.
In the second part of the pathprof output (after the explanation of the column headings for the flat profile) is a call-graph profile. In the example of such a profile below, you can follow the chain of calls from main to matmul_, muldoe_, su3mul_, and zgemm_, where most of the time is consumed. Additional call graph pro
185. e section 5.6.1 for information on modifying existing scripts.
The invocation of the compiler is identical to the GCC compilers, but the flags to control the compilation are different. We have sought to provide flags compatible with GCC's flag usage whenever possible, and also provide optimization features that are absent in GCC, such as IPA and LNO.
Generally speaking, instead of being a single component as in GCC, the PathScale compiler is structured into components that perform different classes of optimizations. Accordingly, compilation flags are provided under group names like -IPA, -LNO, -OPT, -CG, etc. For this reason, many of the compilation flags in our compiler will differ from those in GCC. See the eko man page for more information.
The default optimization level is 2. This is equivalent to passing -O2 as a flag. The following three commands are identical in their function:
  $ pathcc hello.c
  $ pathcc -O hello.c
  $ pathcc -O2 hello.c
See section 7.1 for information about the optimization levels available for use with the compiler.
To run with -Ofast or with -ipa, the flag must also be given on the link command:
  $ pathCC -c -Ofast warpengine.cc
  $ pathCC -c -Ofast wormhole.cc
  $ pathCC -o ftl -Ofast warpengine.o wormhole.o
See section 7.3 for information on -ipa and -Ofast.
Accessing the GCC 4.x Front-ends for C and C++
This release supports GCC 4.x. If GCC 4.x is installed on your Linux distribution, you can use this new feat
186. e transformations that might cause limited round-off or overflow differences. Compounding such transformations could have more extensive effects. This is the default when -O3 is in effect.
  2  Allow more extensive transformations, such as the reordering of reduction loops. This is the default level when -OPT:Ofast is specified.
  3  Enable any mathematically valid transformation.
-OPT:rsqrt=(0|1|2)
  This option calculates reciprocal square roots using the rsqrt machine instruction. rsqrt is faster but potentially less accurate than the regular square root operation. 0 means not to use rsqrt. 1 means to use rsqrt followed by instructions to refine the result. 2 means to use rsqrt by itself. Default is 1 when -OPT:roundoff=2 or greater, else the default is 0.
-OPT:space=(ON|OFF)
  When ON, this option specifies that code size is to be given priority in tradeoffs with execution time in optimization choices. Default is OFF. This can be turned on either directly or by compiling with -Os.
-OPT:speculate=(ON|OFF)
  When ON, this option makes the compiler convert short-circuiting conditionals to their equivalent non-short-circuited forms whenever possible. This eliminates branches at the expense of more computations. Default is OFF.
-OPT:transform_to_memlib=(ON|OFF)
  When ON, this option enables transformation of loop constructs to calls to memcpy
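The difference between -OPT:rsqrt=2 (the estimate by itself) and -OPT:rsqrt=1 (the estimate followed by refinement) can be pictured with a small sketch. Everything here is an assumption made for illustration: the function names are hypothetical, a well-known integer bit trick stands in for the hardware rsqrt instruction, and the refinement shown is a single Newton-Raphson step, which may not be the exact sequence the compiler emits.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for the hardware rsqrt instruction: a coarse estimate of
 * 1/sqrt(x) for positive, finite x, computed with an integer bit trick. */
float rsqrt_estimate(float x)
{
    uint32_t bits;
    float y;
    memcpy(&bits, &x, sizeof bits);       /* reinterpret the float as an integer */
    bits = 0x5f3759dfu - (bits >> 1);     /* approximate -1/2 power on the exponent */
    memcpy(&y, &bits, sizeof y);
    return y;
}

/* Estimate plus one Newton-Raphson step: y' = y * (1.5 - 0.5 * x * y * y).
 * The extra multiplies cost time but reduce the error of the estimate. */
float rsqrt_refined(float x)
{
    float y = rsqrt_estimate(x);
    return y * (1.5f - 0.5f * x * y * y);
}
```

The trade-off is the one the option text describes: the bare estimate is cheapest, while the refined form spends a few more instructions to get closer to the exact result.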
187. ed and the program does violate the assumptions being made, the program may behave incorrectly. Refer to section 7.7.1 for more information.
There are several shorthand options that can be used in place of the above options. The option -OPT:Ofast is equivalent to -OPT:roundoff=2:Olimit=0:div_split=ON:alias=typed. -Ofast is equivalent to -O3 -ipa -OPT:Ofast -fno-math-errno. When using these shorthand options, make sure the impact of each option is understood by building up the functionality stepwise, using the equivalent options.
There are many more options that may help the performance of the program. These options are discussed elsewhere in the User Guide and in the associated man pages.
Compiler Flag Recommendations
As a general methodology, we usually recommend that you start tuning with -O2, then -O3, then -O3 -OPT:Ofast, and then -Ofast. With -O3 -OPT:Ofast and -Ofast, you should check whether the results are accurate.
The -OPT:Ofast flag uses optimizations selected to maximize performance. Although the optimizations are generally safe, they may affect floating-point accuracy due to rearrangement of computations. This effectively turns on the following optimizations:
  -OPT:ro=2:Olimit=0:div_split=ON:alias=typed
If there are numerical problems with -O3 -OPT:Ofast, then try either of the following:
  -O3 -OPT:Ofast:ro=1
  -O3 -OPT:Ofast:div_split=OFF
Note that ro is short for roundoff.
-Ofast is equivalent
188. ed in this release:
  -m32
  -m64
  -march (same as -mcpu and -mtune)
  -mcpu (same as -march and -mtune)
  -mtune (same as -march and -mcpu)
  -msse2
  -msse3
  -m3dnow
There are also -mno- versions for these options: -msse2, -msse3, -m3dnow. For example, -mno-sse3.
As indicated in this list, using the -march= flag, the architectures supported in this release are:
  -march=(opteron|athlon64|athlon64fx)
  -march=pentium4
  -march=xeon
  -march=em64t
We have also added two special options, -march=anyx86 and -march=auto.
If you want to compile the program so that it can be run on any x86 machine, you can specify anyx86 as the value of the -march, -mcpu, or -mtune options:
  -march=anyx86
If the value for the -march, -mcpu, or -mtune options is auto, the compiler will automatically choose the target processor based on the machine on which the compilation takes place:
  -march=auto
The compiler defaults to -march=auto.
Here is a sample of how options are specified in the compiler.defaults file:
  # Compile for Athlon64 and turn on 3DNow extensions.
  # One option per line.
  -march=athlon64   # anything after '#' is ignored
  -m3dnow
These options can also be used on the command line. See the eko man page for details.
2.3.2 Defaults Flag
This release includes a flag, -show-defaults, which directs the compiler to print ou
189. eeds to be used both in the compile and in the link steps of a build. See section 7.3 for more details on how to use -ipa.
6.3 Feedback Directed Optimization (FDO)
Feedback directed optimization uses a special instrumented executable to collect profile information about the program, which is then used in later compilations to tune the executable. See section 7.6 for more information.
6.4 Aggressive Optimization
The PathScale compilers provide an extensive set of additional options to cover special-case optimizations. The ones documented in section 7 include options that may significantly improve the speed or performance of your code.
This section briefly introduces some of the first tuning flags to try beyond -O2 or -O3. Some of these options require knowledge of what the algorithms are and what coding style the program uses; otherwise they may impact the program's correctness. Some of these options depend on certain coding practices to be effective. One word of caution: the PathScale Compiler Suite, like all modern compilers, has a range of optimizations. Some produce output identical to that of the non-optimized program; some can change the program's behavior slightly. The first class of optimizations is termed safe and the second unsafe. See section 7.7 for more information on these optimizations.
-OPT:Olimit=0 is ge
190. efault IPA will not inline such callsites because they may cause performance problems. The default is OFF.
-IPA:map_limit=N
  Direct when IPA enables sp_partition. N is the maximum size (in bytes) of input files mapped before IPA invokes -IPA:sp_partition.
-IPA:maxdepth=N
  This option directs IPA to not attempt to inline functions at a depth of more than N in the callgraph, where functions that make no calls are at depth 0, those that call only depth-0 functions are at depth 1, and so on. This inlining remains subject to overriding limits on code expansion. Also see -IPA:forcedepth, -IPA:space, and -IPA:plimit.
-IPA:max_jobs=N
  This option limits the maximum parallelism when invoking the compiler after IPA to at most N compilations running at once. The option can take the following values:
  0   The parallelism chosen is equal to either the number of CPUs, the number of cores, or the number of hyperthreading units in the compiling system, whichever is greatest.
  1   Disable parallelization during compilation (default).
  >1  Specifically set the degree of parallelism.
-IPA:min_hotness=N
  When feedback information is available, a call site to a procedure must be invoked with a count that exceeds the threshold specified by N before the procedure will be inlined at that call site. The default is 10.
-IPA:multi_clone=N
  This option specifies the maximum number of clones that can be created from a single procedure. Default value is 0. Aggressive pr
191. emory to force truncation to lower precision. However, the extra stores will slow down program execution substantially. -ffloat-store has no effect under -msse2, which is the default under both -m64 and -m32.
-ffortran-bounds-check
  (For Fortran only) Check bounds.
-f[no-]gnu-keywords
  (For C/C++ only) Recognize typeof as a keyword. If -fno-gnu-keywords is used, do not recognize typeof as a keyword.
-f[no-]implicit-inline-templates
  (For C++ only) -fimplicit-inline-templates emits code for inline templates instantiated implicitly. -fno-implicit-inline-templates tells the compiler to never emit code for inline templates instantiated implicitly.
-f[no-]implicit-templates
  (For C++ only) The -fimplicit-templates option emits code for non-inline templates instantiated implicitly. With -fno-implicit-templates, the compiler will not emit code for non-inline templates instantiated implicitly.
-finhibit-size-directive
  Do not generate .size directives.
-f[no-]inline-functions
  (For C/C++ only) -finline-functions automatically integrates simple functions into their callers. -fno-inline-functions does not automatically integrate simple functions into their callers.
-fabi-version=N
  (For C++ only) Use version N of the C++ ABI. Version 1 is the version of the C++ ABI that first appeared in G++ 3.2. Version 0 will always be the version that conforms most c
192. enMP
To use OpenMP, you need to add directives where appropriate, and then compile and link your code using the -mp flag. This flag tells the compiler to honor the OpenMP directives in the program and process the source code guarded by the OpenMP conditional compilation sentinels (e.g. !$ for Fortran and #pragma for C/C++). The actual program execution is also affected by the way the OpenMP Environment Variables (see section 8.9) are set.
The compiler will generate different output that causes the program to be run in multiple threads during execution. The output code is linked with the PathScale OpenMP Runtime Library for execution under multiple threads. See the Fortran code in section 8.12 and the C/C++ code in section 8.13 for examples.
Because the OpenMP directives tell the compiler what constructs in the program can be parallelized, and how to parallelize them, it is possible to make mistakes in the inserted OpenMP code that will result in incorrect execution.
As long as all the OpenMP-related code is guarded by conditional compilation sentinels (e.g. !$ or #pragma), you can re-compile the same program without the -mp flag. In these cases, the resulting executable will run serially. If the error no longer occurs, you can conclude that the problems in the parallel execution are due to mistakes in the OpenMP part of the code, making the problem easier to track down and fix. See section 10.10 for more tips on troubleshooting OpenMP problems.
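The serial fallback described above can be illustrated with a minimal C sketch. The function names sum_to and max_threads are hypothetical; the _OPENMP macro is the one that OpenMP-conforming compilers define when OpenMP is enabled, and per this guide the enabling flag for the PathScale compilers is -mp.

```c
#ifdef _OPENMP
#include <omp.h>              /* only needed when OpenMP is enabled */
#endif

/* Sum the integers 1..n. With OpenMP enabled, the loop iterations are
 * divided among threads and the partial sums are combined by the
 * reduction clause. Without it, the pragma is ignored and the loop
 * runs serially, giving the same result. */
long sum_to(int n)
{
    long total = 0;
#pragma omp parallel for reduction(+:total)
    for (int i = 1; i <= n; i++)
        total += i;
    return total;
}

/* Upper bound on the number of threads a parallel region may use;
 * a build without OpenMP always reports 1. */
int max_threads(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}
```

Building the same source once with the OpenMP flag and once without, then comparing results, is the debugging approach the paragraph above recommends.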
193. ent the kernel from rescheduling the thread to another processor further away from that resource. The resource might be cache memory, main memory, or an I/O device, for example. Note that there is a tension between affinity and load balancing, since specifying affinities may prevent the kernel scheduler from balancing the workload over the processors. The policy of the kernel scheduler determines whether affinity or load balance prevails in cases of conflict.
Affinity is particularly important on NUMA (non-uniform memory architecture) systems, since memory access latency and bandwidth may vary based on the relative locations of the processor and memory.
The affinity mechanism is often specific to a particular OS or kernel, and the following discussion is relevant to most modern Linux distributions and kernels, though details may still vary.
A processor here refers to a CPU core, and this might be a conventional single-core processor, a CPU core in a multi-core processor, or a hyper-threaded CPU core.
Affinity can be specified at the thread level, allowing distinct threads in a process to have different settings. By default, the affinity of a thread is usually set to all available CPU cores on the system, which allows the kernel to schedule that thread freely. Typically, affinity is inherited by a child process when forked from a parent process. Affinity can be modified to
194. eparating each sub-flag by colons, or
  - Using multiple flags on the command line.
For example, the following command lines are equivalent:
  pathcc -OPT:roundoff=2:alias=restrict wh.c
  pathcc -OPT:roundoff=2 -OPT:alias=restrict wh.c
Some sub-options either enable or disable the feature. To enable a feature, either specify only the subflag name or append =1, =ON, or =TRUE. Disabling a feature is accomplished by adding =0, =OFF, or =FALSE. The following command lines mean the same thing:
  pathf95 -OPT:div_split:fast_complex=FALSE:IEEE_NaN_inf=OFF wh.F
  pathf95 -OPT:div_split=1:fast_complex=0:IEEE_NaN_inf=false wh.F
7.3 Inter-Procedural Analysis (IPA)
Software applications are normally written and organized into multiple source files that make up the program. The compilation process, usually defined by a Makefile, invokes the compiler to compile each source file, called a compilation unit, separately. This traditional build process is called separate compilation. After all compilation units have been compiled into .o files, the linker is invoked to produce the final executable.
The problem with separate compilation is that it does not provide the compiler with complete program information. The compiler has to make worst-case assumptions at places in the program that access external data or call external functions. In wh
195. er
  ID of device containing directory entry for file
  Size of file in bytes
  Time of last access
  Time of last modification
  Time of last file status change
  Preferred I/O block size (-1 if not available)
  Number of blocks allocated (-1 if not available)
Except for elements 12 and 13, values are set to 0 if they are not available from the relevant file system.
ftell
  Fortran interface to the C library function ftell. Treats logical unit unit as a stream of bytes. The function form returns the offset from the beginning of the file to the position pointer used to read or write the file, or -1 to indicate an error. The subroutine form sets offset to the value which the function would return.
gerror
  Fortran interface to the C library function strerror. Sets message to the error message corresponding to the error code from the C library variable errno.
getarg
  Stores into value an argument from the command line used to execute this process. pos is an index into the argument list, where 0 identifies the name of the program, 1 identifies the first argument, etc. Intrinsic iargc provides the number of arguments available.
getcwd
  Fortran interface to the C library function getcwd. Sets name to the current working directory name. The function form returns 0 for success, or an error code from the C library value errno. The subroutine form sets status to the value which the function would return.
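Since ftell, gerror, and getcwd are described as interfaces to C library functions, it may help to see the C side of those calls. This is a sketch only: the wrapper names below are hypothetical and are not the compiler's actual runtime entry points; they simply mirror the return conventions described in the entries above.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Mirrors the function form of getcwd: 0 on success, otherwise the
 * error code taken from errno. */
int current_dir(char *name, size_t size)
{
    return getcwd(name, size) != NULL ? 0 : errno;
}

/* Mirrors gerror: copies the message for the current errno value
 * into the caller's buffer. */
void error_message(char *message, size_t size)
{
    snprintf(message, size, "%s", strerror(errno));
}

/* Mirrors ftell: offset from the beginning of the file, or -1 on error. */
long stream_offset(FILE *stream)
{
    return ftell(stream);
}
```

A Fortran caller passes a logical unit rather than a FILE pointer, so the real interface has to map units to streams; that mapping is outside the scope of this sketch.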
196. er is replaced in the command with the list of options fromthe configfilebeingconsidered The configfileistypically the provided pathopt2 xml file although you can write your own The execute target parameter specifies the execution target from the configfile The 1 02404 15 XX 7 Tuning Options QLOGIC The pathopt2 Tool YFP I2XOLT H V mm PS Ki l l Si test command parameter is the command to run the program and can be replaced with a script The program is expected to return a status value of 0 to indicate success or a non zero status to indicate failure The s option specifies the metric used for comparing performance m real the elapsed real time this is the default user the CPU time spent executing in user mode system the CPU time spent executing in system mode timing file to use a file containing a timing value rate file to use a file containing a rate value The chosen metric is used to guide the choices made by the pathopt2 algorithms when selecting options for the best performance and is used to sort the final output The interpretation of real user and system time is the same as the time 1 command real is equivalent to wall clock time An application may switch back and forth between user and kernel mode so these components are factored separately into user and system times Since the O S is typically time slicing between many processes the sum of user and system does not necessarily equal real since o
197. er thread and disband.

The #pragma and #ifdef before some of the lines are conditional compilation tokens. These lines are ignored when compiled without -mp.

We compile omphello.c for OpenMP with this command:

  pathcc -c -mp omphello.c

Now we link it, again using -mp:

  pathcc -mp omphello.o -o omphello.out

We set the environment variable for the number of threads with this command:

  export OMP_NUM_THREADS=5

Now run the program:

  ./omphello.out
  Hello World from thread 1
  Hello World from thread 2
  Hello World from thread 3
  Hello World from thread 0
  Number of threads 5
  Hello World from thread 4

The output from the different threads can be in a different order each time the program is run. We can change the environment variable to run with two threads:

  export OMP_NUM_THREADS=2

Now the output looks like this:

  ./omphello.out
  Hello World from thread 0
  Number of threads 2
  Hello World from thread 1

The same program can be compiled and linked without -mp, and the directives will be ignored. We compile the program without -mp:

  pathcc -c omphello.c

Link the object file and create an output file:

  pathcc omphello.o -o omphello.out

Run the program, and the output looks like this:

  ./omphello.out
  Hello World from thread 0
  Number of threads 1

For more examples using
198. erformance tends to get better with larger data sets because the fork/join overheads diminish as the loops get larger. Thus you should also run trials with the full data set, especially when looking at scaling issues. You can also make use of more memory and more cache on an n-way multi-processor than on a uni-processor, and this sometimes leads to a very nice superlinear speed-up.

8.14.2 Enable OpenMP

After you have tuned the serial version of the application, turn on OpenMP parallelization with the -mp flag. Try running the code on varying numbers of CPUs to see how the application scales.

One very important option for OpenMP tuning is -OPT:early_mp, which by default is off but can be turned on using -OPT:early_mp=ON. The setting of this option primarily determines the ordering of the SIMD vectorization and OpenMP parallelization optimization phases of the compiler. With late MP, loops will first be vectorized and then the vectorized loops will be parallelized. With early MP, loops will first be parallelized and then the parallel loops will be vectorized. Occasionally one of these orderings works better than the other, so you have to try both.

8.14.3 Optimizations for OpenMP

8.14.3.1 Libraries

The most important optimizations for OpenMP applications tend to be loop nest optimization
199. erification SUCCESSFUL Version 2.3

Since -Ofast runs last in the try5 target, the output in this file corresponds to the 12.68 real or 12.27 user times from the -Ofast run. The reason the "Time in seconds" output by NPB is considerably lower than 12.68 is that it measures the time for the main work section of the program, ignoring the start-up and array initialization time. For the parallel versions of NPB it is appropriate to ignore the initialization, since that time does not improve when more processes are used in the computation.

The "Time in seconds" and "Mop/s total" (millions of operations per second) values from the NPB benchmarks turn out to be useful metrics for testing optimization. The -S timing-file and rate-file features can be used to search for the "Time in seconds" or the "Mop/s total" metrics. In this next example we will use the timing-file option. See section 7.9.8.4 for information on the rate-file option.

The "Time in seconds" output can be used as pathopt2's sorting criterion by using the -S timing-file option. However, the psc build script has to be enhanced to be able to isolate the number after the "Time in seconds" part of the output. Here is how to do this in a script found in /opt/pathscale/share/pathopt2/examples called psc test2 d bin sh bin ft A logs ft A txt grep in sec logs ft A txt sec
200. es with the x86-64 ABI and the 32-bit x86 ABI.

Using the Fortran Compiler

To invoke the PathScale Fortran compiler, use this command:

  pathf95

By default, the compiler will treat input files with an .F suffix or .f suffix as fixed-form files. Files with an .f90, .F90, .F95, or .f95 suffix are treated as free-form files. This behavior can be overridden using the -fixedform and -freeform switches. See section 3.1.1 for more information on fixed-form and free-form files.

By default, all files ending in .F, .F90, or .F95 are first preprocessed using the C preprocessor (-cpp). If you specify the -ftpp option, all files are preprocessed using the Fortran preprocessor (-ftpp), regardless of suffix. See section 3.4.1 for more information on preprocessing.

Invoking the compiler without any options instructs the compiler to use optimization level -O2. These three commands are equivalent:

  pathf95 test.f90
  pathf95 -O test.f90
  pathf95 -O2 test.f90

Using optimization level -O0 instructs the compiler to do no optimization. Optimization level -O1 performs only local optimization. Level -O2, the default, performs extensive optimizations that will always shorten execution time, but may cause compile time to be lengthened. Level -O3 performs aggressive optimization that may or may not improve execution time. See section 7.1 for more
201. eshooting
  Unsupported GCC Extensions

Porting and Compatibility
  Getting Started
  GNU Compatibility
  Porting Fortran
    Intrinsics
      An Example
    Name-mangling
    Static Data
  Porting to x86_64
  Migrating from Other Compilers
  Compatibility
    gcc Compatibility Wrapper Script

Tuning Quick Reference
  Basic Optimization
  IPA
  Feedback Directed Optimization (FDO)
  Aggressive Optimization
  Compiler Flag Recommendations
  Performance Analysis
  Optimize Your Hardware
202. ested parallelism is not supported where nested parallel directives are statically scoped within the same subroutine as the outer parallel directive. In this case, only the outer parallel directive will be parallelized, and any inner nested directives will be serialized (executed by a team of 1 thread). To achieve nested parallelism, the nested parallel directives must be moved to a separate subroutine.

OMP_SCHEDULE environment variable: The default value for this environment variable is implementation-dependent (Section 4.1, page 59).

The default for the OMP_SCHEDULE environment variable is static scheduling with no chunk size specified. The chunk size will default to the number of iterations of the loop divided by the number of threads in the team, rounded up to the nearest integer. The loop iterations are partitioned into chunks of the default chunk size. If the number of iterations of the loop is not an exact integer multiple of the number of threads in the team, the last chunk will be smaller than the default chunk size, and in some cases it may contain zero loop iterations. The chunks are assigned to threads starting from the thread with local index 0. The thread with the highest local index will receive the last chunk, and this may be smaller than the others, or even zero. The loop iterations which are executed by a thread are contiguou
203. et in a configuration file. The following is a listing of the try5_list and the try5 execute target in the default pathopt2.xml file. try5 is typically the first target to use when testing options with pathopt2.

  <define name="try5_list">
    <option>-O2</option>
    <option>-O3</option>
    <choose k="1">
      <append>
        <option>-O3</option>
        <choose k="1">
          <option>-ipa</option>
          <option>-OPT:Ofast</option>
        </choose>
      </append>
    </choose>
    <option>-Ofast</option>
  </define>

  <execute name="try5">
    <choose k="1">
      <source from="try5_list"/>
    </choose>
  </execute>

The first two options, -O2 and -O3, are run in order. Next, the -O3 option is appended to both -ipa and -OPT:Ofast. Finally, -Ofast is used. This ordering is shown in the first part of the pathopt2 output when try5 is the target:

  Flags            Build  Test  Real      User  System
  -O2              PASS   PASS  2.83      2.82  0.00
  -O3              PASS   PASS  2.39      2.39  0.00
  -O3 -ipa         PASS   PASS  2.40      2.40  0.01
  -O3 -OPT:Ofast   PASS   PASS  Dye 3 17  2.38  0.00
  -Ofast           PASS   PASS  2.38      2.38  0.00

Table 7-5. Tags for Option Configuration File

  Tag                  Description
  <config> </config>   Main body tag describing the configuration. All other tags and attributes must reside inside thi
204. etermine that there is no undesirable performance impact due to sub-optimal alignment. Vectorize only if vectorization does not introduce accuracy problems with floating-point operations.
  2  Vectorize without any constraints (most aggressive).

-LNO:simd_reduction=(ON|OFF)
This flag controls whether reduction loops will be vectorized. Default is ON.

-LNO:simd_verbose=(ON|OFF)
-LNO:simd_verbose=ON prints verbose vectorizer info to stdout. Default is OFF.

-LNO:svr_phase1=(ON|OFF)
This flag controls whether the scalar variable naming phase should be invoked before the first phase of LNO. The default is ON.

-LNO:trip_count_assumed_when_unknown,trip_count=N
This flag provides an assumed loop trip count if it is unknown at compile time. LNO uses this information for loop transformations, prefetch, etc. N can be any positive integer, and the default value is 1000.

-LNO:vintr=(0|1|2)
This flag controls loop vectorization to make use of vector intrinsic routines. (Note: a vector intrinsic routine is called once to compute a math intrinsic for the entire vector.) -LNO:vintr=1 is the default. -LNO:vintr=0 turns off the vintr optimization. Under -LNO:vintr=2 the compiler will do aggressive optimization for all vector intrinsic routines. Note that -LNO:vintr=2 could be unsafe in that some of these routines could have accuracy problems.

-LNO:vintr_verbose=(ON|OFF)
205. event excessive stack limits when using libopenmp.

When each pthread is created, the operating system will allocate virtual memory for its entire stack, as sized by the above algorithms. This essentially allocates virtual memory space for that stack so that it can grow up to its specified limit. The operating system will provide physical memory pages to back up this virtual memory as and when it is required.

A consequence of this is that the top program will include the whole of these stacks in the VIRT or SIZE memory usage figure (VIRT or SIZE will be used depending on your Linux distribution), while only the allocated physical pages for these stacks will be shown in the RES or RSS resident figure (RES or RSS will be used depending on your Linux distribution). If the OpenMP program runs with a large pthread stack size, which is the common case, then it is quite normal for VIRT or SIZE to be a large figure. It will be at least the number of pthreads created by libopenmp times their stack size. However, RES or RSS will typically be much less, and this is the real physical memory requirement for the application.

NOTE: A large stack limit for the main thread does not show up in the VIRT or SIZE figure. This is because the operating system has special handling for the main thread of an application, and does not need to pre-allocate virtual memory pages for its stack up to the stack limit.

The pthread stack limit is typically much lower when usin
206. f 1024 bytes, m for units of 1048576 bytes, g for units of 1073741824 bytes, or % to specify a percentage of physical memory. If the specifier is followed by the string /cpu, the limit is divided by the number of CPUs the system has. For example, a limit of 1.5g specifies that the Fortran runtime will use no more than 1.5 gigabytes (GB) of stack. On a system with 2GB of physical memory, a limit of 90%/cpu will use no more than 0.9GB of stack (2 / 2 * 0.90).

PSC_STACK_VERBOSE
If this environment variable is set, the Fortran runtime will print detailed information about how it is computing the stack size limit to use.

A.4 Language-independent Environment Variables

FILENV
The location of the assign file. See the assign man page for more details.

PSC_COMPILER_DEFAULTS_PATH
Specifies a PATH or a colon-separated list of PATHs designating where the compiler is to look for the compiler.defaults file. If the environment variable is set, the PATH /opt/pathscale/etc will not be used. If the file cannot be found, then no defaults file will be used, even if one is present in /opt/pathscale/etc.

PSC_GENFLAGS
Generic flags passed to all compilers. This variable is used with the gcc compatibility wrapper scripts.

PSC_PROBLEM_REPORT_DIR
The directory in which to save problem reports and preprocessed source files if the compiler encounters an internal error. If not specified, the directory used is $HOME/.ekopath-bugs.
207. file info

Call graph (explanation follows)

granularity: each sample hit covers 4 byte(s) for 0.01% of 163.32 seconds

  index  % time   self  children  called                 name
                  0.00  155.83    1/1                    main [2]
  [1]    95.4     0.00  155.83    1                      MAIN [1]
                  0.00  151.19    152/152                matmul [3]
                  0.05    4.47    1/1                    uinith [13]
                  0.00    0.06    1/1                    phinit [22]
                  0.02    0.05    1/2                    rndphi [21]
                  0.00    0.00    301/512301             zdotc [14]
                  0.00    0.00    77/1024077             dznrm2 [17]
                  0.00    0.00    452/603648604          zaxpy [9]
                  0.00    0.00    154/214528306          zcopy [10]
                  0.00    0.00    75/39936075            zscal [16]
                  0.00    0.00    1/1                    init [23]
  -----------------------------------------------------------------
                  0.00  151.19    152/152                MAIN [1]
  [3]    92.6     0.00  151.19    152                    matmul [3]
                  1.75   73.84    152/152                muldoe [7]
                  1.75   73.84    152/152                muldeo [6]
                  0.00    0.00    152/214528306          zcopy [10]
                  0.00    0.00    152/603648604          zaxpy [9]
  -----------------------------------------------------------------
                  0.88   48.33    77824000/155648000     muldeo [6]
                  0.88   48.33    77824000/155648000     muldoe [7]
  [4]    60.3     1.76   96.65    155648000              su3mul [4]
                 83.54   13.11    155648000/155648000    zgemm [5]
  -----------------------------------------------------------------
                 83.54   13.11    155648000/155648000    su3mul_ [4]
  [5]    59.2    83.54   13.11    155648000              zgemm [5]
                 13.11    0.00    933888000/933888000    lsame [11]

The -ipa option can analyze the code to make smart decisions on when and which routines to inline, so we try that. -O2 -ipa results in a 133.8 second run time, a nice improvement over our previous best of 150 seconds with only -O2.

Since we heard somewhere that improvements with compiler flags are not always predic
208. following:

-LNO:ps1=N,ps2=N,ps3=N,ps4=N
This option specifies the number of bytes in a page, with N as a positive integer. The default for N depends on your system hardware.

-LNO:tlb1=N,tlb2=N,tlb3=N,tlb4=N
This option specifies the number of entries in the TLB for this cache level, with N as a positive integer. The default for N depends on your system hardware.

-LNO:tlbcmp1=N,tlbcmp2=N,tlbcmp3=N,tlbcmp4=N,tlbdmp1=N,tlbdmp2=N,tlbdmp3=N,tlbdmp4=N
This option specifies the number of processor cycles it takes to service a clean TLB miss (the tlbcmpx options) or a dirty TLB miss (the tlbdmpx options), with N as a positive integer. The default for N depends on your system hardware.

Following are LNO Prefetch Options. These arguments control the prefetch operation.

-LNO:assume_unknown_trip_count=(0..1000)
This flag is no longer supported. It has been promoted to -LNO:trip_count_assumed_when_unknown.

-LNO:pf1=(ON|OFF),pf2=(ON|OFF),pf3=(ON|OFF),pf4=(ON|OFF)
This option selectively disables or enables prefetching for cache level x, for pfx=(ON|OFF).

-LNO:prefetch=(0|1|2|3)
This option specifies the levels of prefetching. The options can be one of the following:
  0  Prefetch disabled.
  1  Prefetch is done only for arrays that are always referenced in each iteration of a loop.
  2  Prefetch is done without the above restriction. This is the
209. for the platform that the compiler is running on, which the compiler determines by reading /proc/cpuinfo. anyx86 means a generic x86 processor. Under the 32-bit ABI, anyx86 is a processor without SSE2/SSE3/3DNow support; under the 64-bit ABI, it is a processor with SSE2 but without SSE3/3DNow. Core refers to the Intel Core Microarchitecture, used by 64-bit CPUs such as Woodcrest. The default is auto.

-mcmodel=(small|medium)
Select the code size model to use when generating offsets within object files. Most programs will work with -mcmodel=small (using 32-bit pointers), but some need -mcmodel=medium (using 32-bit pointers for code and 64-bit pointers for data).

-mcpu=<cpu-type>
Behaves like -march. See -march.

-MD
Write dependencies to .d output file.

-MDtarget
Use the following as the target for Make dependencies.

-MDupdate
Update the following file with Make dependencies.

-MF
Write dependencies to specified output file.

-MG
With -M or -MM, treat missing header files as generated files.

-MM
Output user dependencies of source file.

-MMD
Write user dependencies to .d output file.

-mno-sse
Disable the use of SSE2/SSE3 instructions. SSE2 cannot be disabled under -m64; attempting to do so results in a warning.

-mno-sse2
Disable the use of SSE2/SSE3 instructions. SSE2 cannot be disabled under -m64; attempting to do so results in a warning.

-mno-sse3
Disable the use of SSE3 instru
210. g linuxthreads than with NPTL threads. Linux kernels in the 2.4 series and earlier tend to be provided with linuxthreads, while NPTL is typically the default with 2.6 series kernels. However, some distributions have back-ported NPTL to their 2.4 series kernels.

NOTE: When a program is statically linked with pthreads, this might also trigger use of linuxthreads on some distributions.

For best libopenmp performance, and to avoid stack size limitations, it is highly recommended that 2.6 series Linux kernels, NPTL, and dynamic linkage be used with OpenMP programs.

8.12 Example OpenMP Code in Fortran

The following program is a parallel version of "hello world", written using OpenMP directives. When run, it spawns multiple threads. It uses the CRITICAL directive to ensure that the printing from the various threads will not overwrite one another.

Here is the program omphello.f:

      PROGRAM HELLO
      INTEGER NTHREADS, TID, OMP_GET_NUM_THREADS,
     +        OMP_GET_THREAD_NUM
      TID = 0
      NTHREADS = 1
C Fork a team of threads giving them their own copies of variables
!$OMP PARALLEL PRIVATE(TID)
C Obtain and print thread id
!$    TID = OMP_GET_THREAD_NUM()
!$OMP CRITICAL
      PRINT *, 'Hello
211. g the Application Code
  Using Feedback Data
  Other Resources for OpenMP

Examples
  Compiler Flag Tuning and Profiling With pathprof

Section 10  Debugging and Troubleshooting
  10.1     Subscription Manager Problems
  10.2     Debugging
  10.3     Dealing with Uninitialized Variables
  10.4     Large Object Support
  10.5     More Inputs Than Registers
  10.6     Linking With libg2c
  10.7     Linking Large Object Files
  10.8     Using -ipa and -Ofast
  10.9     Tuning
  10.10    Troubleshooting
  10.10.1  Compiling and linking with -mp

Appendix A  Environment Variables
  A.1      Environment Variables for Use with C
  A.2      Environment Variables for Use with C++
  A.3      Environment Variables for Use with Fortran
  A.4      Language-independe
212. gn command uses the file pointed to by the FILENV environment variable to store the processing directives. This file is also used by the Fortran I/O libraries to load directives at runtime. For example:

  FILENV=.assign
  export FILENV
  assign -N mips u:15

This instructs the Fortran I/O library to treat all numeric data read from or written to unit 15 as being MIPS-formatted data. This effectively means that the contents of the file will be translated from big-endian format (MIPS) to little-endian format (Intel) while being read. Data written to the file will be translated from little-endian format to big-endian format.

See the assign(1) man page for more details and information.

3.6.1.2 Using the Wildcard Option

The wildcard option for the assign command is:

  assign -N mips p:%

Before running your program, run the following commands:

  FILENV=.assign
  export FILENV
  assign -N mips p:%

This example matches all files.

3.6.1.3 Converting Data and Record Headers

To convert numeric data in all unformatted units from big endian, and convert the record headers from big endian, use the following:

  assign -F f77.mips -N mips g:su
  assign -I -F f77.mips -N mips g:du

The su specifier matches all sequential unformatted open requests. The du specifier matches all direct unformatted open requests.
213. gnment, then the OpenMP program is restricted to just the subset of CPUs specified in that affinity assignment. This behavior ensures that the OpenMP library inter-operates with programs like taskset in the expected way. The behavior is as if the OpenMP program had been run on a machine that consisted of just the CPU subset specified by taskset. The OpenMP library will then use its usual thread count and affinity rules, but applied to the CPU subset.

A common approach is to run multiple OpenMP processes on a node (e.g. using MPI) such that each OpenMP process uses a distinct subset of CPUs specified by taskset. Affinity inheritance ensures that the OpenMP library creates the right number of threads, and that CPUs are not overloaded with threads.

When using affinity inheritance, any explicit affinity settings made using PSC_OMP_AFFINITY_MAP, PSC_OMP_CPU_STRIDE and PSC_OMP_CPU_OFFSET employ a virtualized CPU numbering. The virtualized CPU numbers are a sequence of incrementing integers starting from 0, and refer to the potentially non-contiguous real CPU numbers in ascending order. This means that the settings for these variables are independent of the specific CPU numbers specified by taskset.

PSC_OMP_AFFINITY_MAP (a list of integer values separated by commas)
This environment variable allows the mapping from threads to CPUs to be fully specified by the user. It must be set to a list of CPU identifiers separated by commas. The list must contain at least one
214. gs are quoted.

-A pred=ans
Make an assertion with the predicate pred and answer ans. The -pred=ans form cancels an assertion with predicate pred and answer ans.

-alignN
Align data in common blocks to specified boundaries. The alignN specifications are as follows:

  Option      Action
  -align8     Align data in common blocks to 8-bit boundaries.
  -align16    Align data in common blocks to 16-bit boundaries.
  -align32    Align data in common blocks to 32-bit boundaries.
  -align64    Align data in common blocks to 64-bit boundaries. This is the default.
  -align128   Align data in common blocks to 128-bit boundaries.

When an alignment is specified, objects smaller than the specification are aligned on boundaries that correspond to their sizes. For example, when -align64 is specified, 32-bit and larger objects are aligned on 32-bit boundaries, 16-bit and larger objects are aligned on 16-bit boundaries, and 8-bit and larger objects are aligned on 8-bit boundaries.

-ansi
(For Fortran) Generate messages about constructs which violate standard Fortran syntax rules and constraints, plus messages about obsolescent and deleted features. This also disables all nonstandard intrinsic functions and subroutines. Specifying -ansi in conjunction with -fullwarn causes all messages, regardless of level, to be generated.

-ansi
(For C/C++) Enable
215. hat the -O3 may include optimizations that are generally beneficial but may hurt performance. So let's look at a profile of the -O2 binary. We do need to recompile using flags -O2 -pg. Then we need to run the generated, instrumented binary again with the same reference dataset:

  time -p ./wupwise > wupwise.out

Here we used the -p (POSIX) flag to get a different time output format. This run generates the file gmon.out of profiling information. Then we need to run pathprof to generate the human-readable profile:

  pathprof ./wupwise

  Flat profile:

  Each sample counts as 0.01 seconds.
    %    cumulative    self                 self     total
   time    seconds    seconds      calls   s/call   s/call   name
  51.15      83.54     83.54   155648000     0.00     0.00   zgemm
  17.65     112.37     28.83   603648604     0.00     0.00   zaxpy
   8.72     126.61     14.24   214528306     0.00     0.00   zcopy
   8.03     139.72     13.11   933888000     0.00     0.00   lsame
   4.59     147.21      7.49
   1.51     149.67      2.46      512301     0.00     0.00   zdotc
   1.49     152.11      2.44   603648604     0.00     0.00   dcabs1_
   1.37     154.34      2.23   155648000     0.00     0.00   gammul
   1.08     156.10      1.76   155648000     0.00     0.00   su3mul_
   1.07     157.85      1.75         152     0.01     0.50   muldeo_
   0.00     163.32      0.00           1     0.00   155.83   MAIN
   0.00     163.32      0.00           1     0.00     0.00   Ynrt
   0.00     163.32      0.00           1     0.00     0.06   phinit

  % time      the percentage of the total running time of the program used by this function.
  cumulativ
216. he PathScale Compiler Suite supports them. See the eko man page in Appendix E for a complete listing of supported flags.

3. If you plan on using IPA, see section 7.3 for suggestions.

4. Compile your code and look at the results.
   a. Did the program compile and link correctly? Are there missing libraries that were previously linked automatically?
   b. Look for behavior differences; does the program behave correctly? Are you getting the right answer (for example, with numerical analysis)?

Compatibility

5.6.1 gcc Compatibility Wrapper Script

Many software build packages check for the existence of gcc, and may even require the compiler used to be called gcc in order to build correctly. To provide complete compatibility with gcc, we provide a set of gcc compatibility wrapper scripts in /opt/pathscale/compat-gcc/bin (or <install_directory>/compat-gcc/bin).

This script can be invoked with different names:

  gcc, cc     to look like the GNU C compiler, and call pathcc
  g++, c++    to look like the GNU C++ compiler, and call pathCC
  g77, f77    to look like the GNU Fortran compiler, and call pathf95

To use this script, you must put the path to this directory in your shell's search path before the location of your system's gcc (which is usually /usr/bin). You can confirm the order in the search path by running "which gcc" after modifying your search path. The output should print the location of the gcc wrapper, not /usr/bin/gcc.
217. he current function. The auto-prefetcher runs (if enabled), ignoring array A. The size is used for volume analysis.

Scope: Entire function containing the directive.

size=num is the size of the array references in this loop, in Kbyte. This is an optional argument and must be a constant.

C*$* PREFETCH_REF=array-ref, stride=[str][,str], level=[lev][,lev], kind=[rd/wr], size=[sz]

This directive generates a single prefetch instruction to the specified memory location. It searches for array references that match the supplied reference in the current loop nest. If such a reference is found, that reference is connected to this prefetch node with the specified parameters. If no such reference is found, this prefetch node stays free-floating and is scheduled loosely.

All references to this array in this loop nest are ignored by the automatic prefetcher (if enabled).

If the size is supplied, then the auto-prefetcher (if enabled) reduces the effective cache size by that amount in its calculations.

The compiler tries to issue one prefetch per stride iteration, but cannot guarantee it. Redundant prefetches are preferred to transformations (such as inserting conditionals) which incur other overhead.

Scope: No scope. Just generates a prefetch instruction.

The following arguments are used with this option:

  array-ref   Required. The reference itself, for example, A(i, j).
  str         Optional. Prefetch every str iterations of this loop. The default is
218. he number of threads in the team, the last chunk will be smaller than the default chunk size, and in some cases it may contain zero loop iterations. The chunks are assigned to threads starting from the thread with local index 0. The thread with the highest local index will receive the last chunk, and this may be smaller than the others, or even zero. The loop iterations which are executed by a thread are contiguous in terms of their loop iteration number.

NOTE: The PSC_OMP_STATIC_FAIR environment variable can be used to change the default static scheduling algorithm to an alternate scheme where the iterations are more equally balanced over the threads in cases where the division is not exact.

OMP_GET_NUM_THREADS: If the number of threads has not been explicitly set by the user, the default is implementation-dependent (Section 3.1.2, page 48).

If the number of threads has not been explicitly set by the user, it defaults to the number of CPUs in the machine.

OMP_SET_DYNAMIC: The default for dynamic thread adjustment is implementation-dependent (Section 3.1.7, page 51).

The default for OMP_DYNAMIC is false. Dynamic thread adjustment is not supported by this implementation: the number of threads that are assigned to a new team is not adjusted dynamically. If dynamic thread adjustment is requested by the user or program by setting
219. he spin loops will spin at user level before falling back to O/S schedule/reschedule mechanisms. By default it is 100. If there are more active threads than processors and this is set very high, then the thread contention will typically cause a performance drop. Synchronization using the O/S schedule and reschedule mechanisms is significantly more expensive, but frees up execution resources for other threads.

8.10 OpenMP Stack Size

8.10.1 Stack Size for Fortran

The Fortran compiler allocates data on the stack by default. Some environments set a low limit on the size of a process's stack, which may cause Fortran programs that use a large amount of data to crash shortly after they start.

In an OpenMP program there is a stack for the main thread of execution, as in serial programs, and also an additional separate stack for each additional thread created by libopenmp. These additional threads are created by the POSIX threads library and are called pthreads. The PathScale Fortran runtime environment automatically sizes the stack for the main thread and the pthreads to avoid stack size problems where possible. Additionally, diagnostics are given on memory segmentation faults to help diagnose stack size issues.

The stack size limit for the main thread of an OpenMP program is set using the same algorithm as for a serial Fortran program (see section 3.11 for information about Fortran compiler stack size), except that the calculated stack limit is subseque
220. he user. It may be invoked by default. The basic options to control inlining in the lightweight inliner are:

-inline or -INLINE causes the lightweight inliner to be invoked when -ipa is not specified.

-INLINE:=off suppresses the invocation of the lightweight inliner.

The options below are applicable to both the lightweight inliner and IPA's inliner:

-INLINE:all performs all possible inlining. Since this results in code bloat, it should only be used if the program is small.

-INLINE:list=ON makes the inliner list its actions on the fly. This is a useful option for finding out which functions are being inlined, which functions are not being inlined, and why. Thus, if you want to inline or not inline a function, tweaking the inlining controls based on the reasons given in the output of this flag should help.

-INLINE:must=name1[,name2,...] forces inlining for the named functions.

-INLINE:never=name1[,name2,...] suppresses inlining for the named functions.

When -ipa is specified, IPA will invoke its own inliner, and the lightweight inliner is not invoked. IPA's inliner automatically determines additional functions to inline in addition to those that are required. Small callees or callers are favored over larger ones. If profile data is available, calls executed more frequently are preferred. Otherwise, calls inside loops
221. ic expression-statement
BARRIER        #pragma omp barrier
CRITICAL       #pragma omp critical [(name)] structured-block
FLUSH          #pragma omp flush [(list)]
MASTER         #pragma omp master structured-block
ORDERED        #pragma omp ordered structured-block
Data environments: Control the data environment during the execution of parallel constructs.
THREADPRIVATE  #pragma omp threadprivate

8.6 OpenMP Runtime Library Calls (Fortran)

OpenMP programs can explicitly call standard routines implemented in the OpenMP runtime library. If you want to ensure the program is still compilable without -mp, you need to guard such code with the OpenMP conditional compilation sentinels (e.g. !$). The following table lists the OpenMP runtime library routines provided by version 2.0 of the OpenMP Fortran Application Program Interface.

Table 8.3 Fortran OpenMP Runtime Library Routines
Routine                                Description
call omp_set_num_threads(integer)      Set the number of threads to use in a team.
integer omp_get_num_threads()          Return the number of threads in the currently executing parallel region.
integer omp_get_max_threads()          Return the maximum value that omp_get_num_threads may return.
integer omp_get_thread_num()           Return the thread number
222. icit-function-declaration warns when a function is used before being declared. -Wno-implicit-function-declaration tells the compiler not to warn when a function is used before being declared.
-W[no-]implicit-int  For C/C++ only. -Wimplicit-int warns when a declaration does not specify a type. -Wno-implicit-int tells the compiler not to warn when a declaration does not specify a type.
-W[no-]import  -Wimport warns about the use of the #import directive. -Wno-import tells the compiler not to warn about the use of the #import directive.
-W[no-]inline  For C/C++ only. -Winline warns if a function declared as inline cannot be inlined. -Wno-inline tells the compiler not to warn if a function declared as inline cannot be inlined.
-W[no-]larger-than-<number>  -Wlarger-than- warns if an object is larger than number bytes. -Wno-larger-than- tells the compiler not to warn if an object is larger than number bytes.
-W[no-]long-long  For C/C++ only. -Wlong-long warns if the long long type is used. -Wno-long-long tells the compiler not to warn if the long long type is used.
-W[no-]main  For C/C++ only. -Wmain warns about suspicious declarations of main. -Wno-main tells the compiler not to warn about suspicious declarations of main.
-W[no-]missing-braces  For C/C++ only. -Wmissing-braces warns about possibly missing braces around initializers. -Wno-missing-braces tells the compiler not to warn about possibly missing braces around initializers.
223. ics Fortran Intrinsic Extensions
getenv  Fortran interface to the C library function getenv. Sets value to the value of the environment variable whose name is name, or to blanks if the variable is missing or not set. Trailing blanks in name are ignored; you can prevent this by using char(0) to place a null character after the last significant character.
getgid  Like the POSIX function getgid, returns the group ID for this process.
getlog  Sets login to the login name for this process.
getpid  Like the POSIX function getpid, returns the process ID for this process.
getuid  Like the POSIX function getuid, returns the user ID for this process.
gmtime  Fortran interface to the C library function gmtime. Sets tarray to the broken-down time corresponding to stime, which can be obtained from the intrinsic time8. All values are in Coordinated Universal Time. tarray must have nine elements:
  1. Seconds since the last minute, ranging 0-61 due to leap seconds
  2. Minutes since the last hour, ranging 0-59
  3. Hours since midnight, ranging 0-23
  4. Day of month, ranging 0-31
  5. Month, ranging 0-11
  6. Years since 1900
  7. Days since Sunday, ranging 0-6
  8. Days since January 1, ranging 0-365
  9. Positive if daylight savings time is in effect, zero if not, or negative if unknown
hostnm  Fortran interface to the POSIX function gethostname. Sets name to the network name of the host computer. The function form returns 0 on success or an e
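Since the gmtime intrinsic is described as an interface to the C library function of the same name, the nine-element layout can be sketched directly in C. This is an illustration only: broken_down_utc is a hypothetical helper, and the mapping of struct tm fields to array positions is an assumption drawn from the order of the list above, not from the PathScale sources.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical helper: fills tarray[0..8] in the order listed for the
   gmtime intrinsic. Returns 0 on success, -1 if gmtime fails.
   Note that the C gmtime function uses a static buffer and is not
   thread-safe. */
int broken_down_utc(time_t stime, int tarray[9])
{
    struct tm *utc;

    assert(tarray != NULL);
    utc = gmtime(&stime);
    if (utc == NULL)
        return -1;
    tarray[0] = utc->tm_sec;    /* seconds */
    tarray[1] = utc->tm_min;    /* minutes */
    tarray[2] = utc->tm_hour;   /* hours since midnight */
    tarray[3] = utc->tm_mday;   /* day of month */
    tarray[4] = utc->tm_mon;    /* month, 0-11 */
    tarray[5] = utc->tm_year;   /* years since 1900 */
    tarray[6] = utc->tm_wday;   /* days since Sunday */
    tarray[7] = utc->tm_yday;   /* days since January 1 */
    tarray[8] = utc->tm_isdst;  /* daylight savings flag */
    return 0;
}
```

Passing a time of zero should describe midnight on January 1, 1970, a Thursday, which is a convenient sanity check for the field order.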
224. ict  Specify that distinct pointers are assumed to point to distinct, non-overlapping objects. This is OFF by default.
disjoint  Specify that any two pointer expressions are assumed to point to distinct, non-overlapping objects. This is OFF by default.
-OPT:align_unsafe=(ON|OFF)  Instruct the vectorizer (invoked at -O3) to aggressively perform vectorization by assuming that array parameters are aligned at 128-bit boundaries. The vectorizer will then generate 128-bit aligned load and store instructions, which are faster than their unaligned counterparts. If the assumption is incorrect, the aligned memory accesses will result in run-time segmentation faults. The default is OFF.
-OPT:asm_memory  A debugging option to be used when debugging suspected buggy inline assembly. If ON, the compiler assumes each asm has "memory" specified even if it is not there. The default is OFF.
-OPT:bb=N  This specifies the maximum number of instructions a basic block (straight-line sequence of instructions with no control flow) can contain in the code generator's program representation. Increasing this value can improve the quality of optimizations that are applied at the basic block level, but can increase compilation time in programs that exhibit such large basic blocks. The default is 1300. If compilation time is an issue, use a smaller value.
-OPT:cis=(ON|OFF)  Convert SIN COS p
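Because the align_unsafe assumption turns a misaligned array into a run-time fault, it can help to verify alignment before relying on it. The sketch below is an illustration under stated assumptions: is_aligned and align_up are hypothetical helper names, and 16 bytes is used because it equals the 128-bit boundary mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns nonzero when ptr lies on a multiple of
   alignment, which must be a power of two (16 for 128-bit boundaries). */
int is_aligned(const void *ptr, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)ptr & (alignment - 1)) == 0;
}

/* Hypothetical helper: rounds ptr up to the next multiple of alignment.
   The caller must make sure the underlying buffer is large enough. */
void *align_up(void *ptr, size_t alignment)
{
    uintptr_t addr = (uintptr_t)ptr;

    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (void *)((addr + alignment - 1) & ~(uintptr_t)(alignment - 1));
}
```

A debug build could assert is_aligned(array, 16) at the top of each routine that takes array parameters, and drop the check once the callers are known to be safe.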
225. ified prefix, which were previously generated using -fb-create. The commonly used prefix is "fbdata". The same optimization flags must have been used in the -fb-create compile. Feedback data files created from executables compiled with different optimization flags will give checksum errors. FDO is OFF by default.
-fb-phase=(0,1,2,3,4)  Used to specify the compilation phase at which instrumentation for the collection of profile data is performed, so it is useful only when used with -fb-create. The values must be in the range 0 to 4. The default value is 0, which specifies the earliest phase for instrumentation, after the front-end processing.
-f[no-]check-new  For C++ only. Check the result of new for NULL. When -fno-check-new is used, the compiler will not check the result of operator new for NULL.
-fe  Stop after the front end is run.
-f[no-]unwind-tables  -funwind-tables emits unwind information. -fno-unwind-tables tells the compiler never to emit any unwind information. This is the default. Flags to enable exception handling automatically enable -funwind-tables.
-f[no-]fast-math  -ffast-math improves FP speed by relaxing ANSI & IEEE rules. -ffast-math is implied by -Ofast. -fno-fast-math tells the compiler to conform to ANSI and IEEE math rules at the expense of speed. -ffast-math implies -OPT:IEEE_arithmetic=2 -fno-math-errno. -fno-fast-math implies -OPT
226. ift 2 make code CLASS size FFLAGS cd pathopt2 bin code size > logs code size txt grep Mop logs code size txt secs log sed e s Mop s total PEEN secs log > PSC METRIC FILI grep SUCCESSFUL logs 1 2 txt

Make the file executable and run pathopt2:
chmod +x compile go rate
pathopt2 S rate fil t try5 compile go rate ft A

Sorted summary from all runs:
Flags               Build  Test  Rate
-Ofast              PASS   PASS  662.60
-O3 -ipa            PASS   PASS  662.37
-O3                 PASS   PASS  655.03
-O3 -OPT:Ofast      PASS   PASS  654.30
-O2                 PASS   PASS  603.43

Since -Ofast produced the best results in the sorted summary, we can now try the target peak_Ofast:
pathopt2 S rate fil t peak Ofast compile go rate ft A

A truncated listing of the output shows the top five results for this run.

Sorted summary from all runs:
Flags                                                                    Build  Test  Rate
-Ofast -CG:prefetch=off -CG:load_exe=0 -OPT:unroll_size=256              PASS   PASS  702.72
-Ofast -CG:prefetch=off -CG:load_exe=0                                   PASS   PASS  702.17
-Ofast -msse3 -CG:load_exe=0 -LNO:interchange=off -OPT:unroll_size=256   PASS   PASS  696.36
-Ofast -CG:prefetch=off -msse -CG:load_exe=0 -LNO:interchange=off        PASS   PASS  695.08
-Ofast -msse3 -CG:load_exe=0 -LNO:interchange=off                                     694.48

In a situation like this, with a near tie at the top, one w
227. ifying -LNO:fission=2 will turn on fission and cause it to be applied before fusion.
-LNO:full_unroll,fu=N  Fully unroll loops with trip count <= N inside LNO. N can be any integer between 0 and 100. The default value for N is 5. Setting this flag to 0 disables full unrolling of small trip count loops inside LNO.
-LNO:full_unroll_size=N  Fully unroll loops with unrolled loop size <= N inside LNO. N can be any integer between 0 and 10000. The conditions implied by the full_unroll option must also be satisfied for the loop to be fully unrolled. The default value for N is 2000.
-LNO:full_unroll_outer=(ON|OFF)  Control the full unrolling of loops with known trip count that do not contain a loop and are not contained in a loop. The conditions implied by both the full_unroll and the full_unroll_size options must be satisfied for the loop to be fully unrolled. The default is OFF.
-LNO:fusion=N  Perform loop fusion. N can be one of the following:
  0  Loop fusion is off
  1  Perform conservative loop fusion
  2  Perform aggressive loop fusion
The default is 1.
-LNO:fusion_peeling_limit=N  This option sets the limit for the number of iterations allowed to be peeled in fusion, where N >= 0. N=5
228. ilable in the PathScale Compiler Suite.

Basic Optimizations: The -O flag

The -O flag is the first flag to think about using. See table 7.3 showing the default flag settings for various levels of optimization.

-O0 (O followed by a zero) specifies no optimization; this is useful for debugging. The -g debugging flag is fully compatible with this level of optimization.
NOTE: Using -g by itself without specifying -O will change the default optimization level from -O2 to -O0 unless explicitly specified.

-O1 specifies minimal optimizations with no noticeable impact on compilation time compared with -O0. Such optimizations are limited to those applied within straight-line code (basic blocks), like peephole optimizations and instruction scheduling. The -O1 level of optimization minimizes compile time.

-O2 only turns on optimizations which always increase performance, and the increased compile time (compared to -O1) is commensurate with the increased performance. This is the default if you don't use any of the -O flags. The optimizations performed at level 2 are:
* For inner loops, perform:
    Loop unrolling
    Simple if-conversion
    Recurrence-related optimizations
* Two passes of instruction scheduling
* Global register allocation based on first scheduling pass
* Global optimizations within function scopes:
    Partial redundancy elimination
    Strength reduction and loop termination test replacement
    Dead store elimination
229. in total size. The data (both static and BSS) are allowed to exceed 2GB in size. As with the small memory model, pointers are also signed 64-bit quantities and may exceed 2GB in size.

NOTE: PathScale compilers do not support the use of the -fPIC option flag in combination with the -mcmodel=medium option. The code model medium is not supported in PIC mode.

The PathScale compilers support -mcmodel=medium and -fPIC in the same way that GCC does. When building shared libraries, only -fPIC should be used. Use the option -mcmodel=medium, but not -fPIC, when compiling and linking the main program. The reasoning behind this is that, because the shared library is self-contained, it does not know about the fixed addresses of the data in the program that it is linked with. The library will only access the program data through pointers, and such pointer data accesses are not affected by the value of the -mcmodel option. The -mcmodel value only affects the addressing of data with fixed addresses. When these addresses are larger than 2GB, the compiler has to generate longer sequences of instructions, so it does this only when the -mcmodel=medium flag is given. See section 10.4 for more information on using large objects, and your GCC 3.3.1 documentation for more information on this topic.

2.9.1 Support for Large Memory Model

At this time the PathScale compilers do not support the large memory model. The significance is that the code off
230. in addition to -keep.
-keepdollar  For Fortran only. Treat the dollar sign ($) as a normal last character in symbol names.
-L directory  In XPG4 mode, changes the algorithm of searching for libraries named in -L operands to look in the specified directory before looking in the default location. Directories specified in -L options are searched in the specified order. Multiple instances of -L options can be specified.
-l library  In XPG4 mode, searches the specified library. A library is searched when its name is encountered, so the placement of a -l operand is significant.
-LANG:...  Controls the language option group. The following sections describe the suboptions available in this group.
copyinout  When an array section is passed as the actual argument of a call, the compiler sometimes copies the array section to a temporary array and passes the temporary array, thus promoting locality in the accesses to the array argument. This optimization is relevant only to Fortran, and this flag controls the aggressiveness of this optimization. The default is ON for -O2 or higher and OFF otherwise.
formal_deref_unsafe=(ON|OFF)  Tell the compiler whether it is unsafe to speculate a dereference of a formal parameter in Fortran. The default is OFF, which is better for performance.
heap_allocation_threshold=size  Determine heap or stack allocation. If the size of an
231. ing arguments by reference void c reference double d1 float f1 int 11 long long i2 char int 11 int 12 char c2 char c3 int cl_len int c2_len int c3_len A fortran string has no null terminator so make a local copy and add a terminator Depending on the situation it might be preferable to put the terminator in place of the first trailing blank char null_terminated_cl memcpy alloca ci_len 1 cl c1 len char null terminated c2 memcpy alloca c2 len 1 c2 c2 len char null terminated c3 memcpy alloca c3 len 1 c3 c3 len null terminated 1 ci len null terminated c2 c2 len null terminated c3 c3 len 0 printf d1 1f 1 1f i1 d i2 11d 11 d 12 d cl_len d c2 len d c3_len d n d1 FEL 11 i2 11 12 cl_len c2 len c3_len printf 1 5 c2 s c3 Ss n null terminated cl null terminated c2 null terminated c3 fflush stdout Flush output before switching languages call fortran C function designed to be called from Fortran passing arguments by value int c_value__ double d float f int i long long i8 printf d 1f f 1f i d i8 lld n d f i i8 fflush stdout Flush output before switching languages return 4 Nonzero will be treated as true by Fortran Here is the Fortran source code f_part 90 program f_p
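The excerpt above makes a local, null-terminated copy of each Fortran string because Fortran strings carry a separate length and no terminator. The following sketch shows the same idea in isolation. It swaps the alloca call used in the excerpt for malloc, so the caller must free the result; fortran_to_c_string and its trim flag are hypothetical names introduced for illustration, with trim covering the alternative the excerpt mentions of ending the string at the first trailing blank.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copies a blank-padded Fortran character argument
   of length flen into a freshly allocated, null-terminated C string.
   When trim is nonzero, trailing blanks are dropped before the
   terminator is added. Returns NULL if allocation fails. */
char *fortran_to_c_string(const char *fstr, int flen, int trim)
{
    size_t n;
    char *copy;

    assert(fstr != NULL && flen >= 0);
    n = (size_t)flen;
    if (trim) {
        while (n > 0 && fstr[n - 1] == ' ')
            n--;
    }
    copy = malloc(n + 1);
    if (copy == NULL)
        return NULL;
    memcpy(copy, fstr, n);
    copy[n] = '\0';   /* the terminator Fortran does not supply */
    return copy;
}
```

The length argument corresponds to the hidden length parameters (c1_len and so on) that appear after the ordinary arguments in the excerpt.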
232. ing between multiple threads and expressing synchronization between threads. The OpenMP runtime library automatically creates the optimal number of threads to be executed in parallel for the multiple processors on the platform where the program is being run. If you are running the program on a system with only one processor, you will not see any speedup. In fact, the program may run slower due to the overhead in the synchronization code generated by the compiler. For best performance, the number of threads should typically be equal to the number of processors you will be using. The amount of speedup you can get under parallel execution depends a great deal on the algorithms used and the way the OpenMP directives are used. Programs that exhibit a high degree of coarse-grain parallelism can achieve significant speedup as the number of processors is increased.

NOTE: OpenMP with certain C++ constructs is not supported. We recommend that C++ OpenMP programs be compiled with -fno-exceptions. Compiling for C++ applications that require both OpenMP and C++ exceptions is not currently supported. In addition, C++ OpenMP applications using C++ class data structures or class templates are not supported. An application that does not satisfy these restrictions can cause compile-time failure or runtime failure.

Appendix B describes the implementation dependent
233. ing the program:
export FILENV=myassignfile
assign -I -y on u:6
assign -I -y on f:test2559.out

The following program would then use no repeat factors, because the first write statement refers explicitly to unit 6, the second write statement refers implicitly to unit 6 by using * in place of a logical unit, and the third is bound to file test2559.out:
real :: a(5) = 88.0
write(6,*) a
write(*,*) 77.0, 77.0, 77.0, 77.0, 77.0
open(unit=17, file='test2559.out')
write(17,*) 99.0, 99.0, 99.0, 99.0, 99.0
end

3.9 Porting Fortran Code

The following option can help you fix problems prior to porting your code:
-r8 -i8  Respectively promotes the default representation for REAL and INTEGER type from 4 bytes to 8 bytes. Useful for porting from Cray code, where integer and floating point data is 8 bytes long by default. Watch out for type mismatches with external libraries.

These sections contain helpful information for porting Fortran code:
* Section 3.7.1 has information on porting code that includes KINDs, sometimes a problem when porting Fortran code.
* Section 3.7 has information on source code compatibility.
* Section 3.8 has information on library compatibility.

3.10 Debugging and Troubleshooting Fortran

The flag -g tells the PathScale compilers to produce data in the form used by m
234. iolist entity 8 for character data to indicate size of each character unsigned int dec len 8 declared length in bytes for n or KIND value Ignored if kind or star DVD DEFAULT 90 type t If DopeVectorType alloc cpnt is true then following the last actual dimension or codimension not necessarily MAXDIM there is a count of the number of allocatable components followed by an array of byte offsets from the beginning of the structure to each allocatable component If DopeVectorType alloc cpnt is false neither of these appears typedef struct unsigned long n_alloc_cpnt unsigned long alloc_cpnt_offset 0 DopeAllocType typedef struct DopeVector union _fcd charptr Fortran character descriptor struct void ptr pointer to base address or shared data desc unsigned long el_len element len in bits a base_addr flags and information fields within word 3 of the header 7 unsigned int assoc 1 associated flag unsigned int ptr alloc 1 set if allocated by pointer enum ptrarray NOT P OR A 0 POINTTR 1 ALLOC ARRY 2 por a 2 pointer or allocatable array Use enum ptrarray values unsigned int a contig 1 array storage contiguous flag unsigned int alloc cpnt 1 this is an allocatable 1 02404 15 XX QLOGIC D Fortran 90 Dope
235. ion
-TENV:simd_pmask=(ON|OFF)  Default is ON. Turning it OFF unmasks SIMD floating-point precision exception.
-traditional  Attempt to support traditional K&R style C.
-trapuv  Trap uninitialized variables. Initialize variables to the value NaN, which helps your program crash if it uses uninitialized variables. Affects local scalar and array variables and memory returned by alloca(). Does not affect the behavior of globals, malloc()ed memory, or Fortran common data.
-Uname  Remove any initial definition of name.
-Uvar  Undefine a variable for the source preprocessor. See the -Dvar option for information on defining variables.
-uvar  Make the default type of a variable undefined, rather than using default Fortran 90 rules.
-v  Print (on standard error output) the commands executed to run the stages of compilation. Also print the version number of the compiler driver program and of the preprocessor and the compiler proper.
-version  Write compiler release version information to stdout. No input file needs to be specified when this option is used.
-Wc,arg1[,arg2...]  Pass the argument(s) argi to the compiler pass c, where c is one of [pfibal]. The c selects the compiler pass according to the following table:
  Character  Name
  p          preprocessor
  f          front-end
  i          inliner
  b          backend
  a          assembler
  l          loader
Sets of these phase names can
236. ion levels (-O3, -Ofast), -OPT:reorg_common is set to ON by default. This might split a COMMON block such that a block begins beyond the 2GB boundary. If a program builds correctly at -O2 or below but fails at -O3 or -Ofast, try adding -OPT:reorg_common=OFF to the flags. Alternatively, using the -mcmodel=medium option will allow this optimization.

10.5 More Inputs Than Registers

The compiler will complain if an asm has more inputs than there are available CPU registers. For -m32 (32-bit), the maximum number of asm inputs is seven (7). For -m64 (64-bit), the maximum number is fifteen (15).

10.6 Linking With libg2c

When using Fortran with a Red Hat or Fedora Core system, you cannot link libg2c automatically. In order to link successfully against libg2c on a Red Hat or Fedora Core system, you should first install the appropriate libf2c library, then add a symlink in /usr/lib64 or /usr/lib from libg2c.so.0 to libg2c.so. This problem is due to a packaging issue with Red Hat's version of this library. You will only need to take this step if you are linking against either the AMD Core Math Library (ACML) or Fortran object code that was compiled using the g77 compiler.

10.7 Linking Large Object Files

The PathScale Compiler Suite does not support the linking or assembly of large object files on the x86 platform. Earlier versions of the com
237. is to break up the iterations under guided scheduling for better dynamic load balancing between the threads. The full equation for the chunk size for guided scheduling is:

chunk_size = MAX(MIN(ROUNDUP(remaining_size / (number_of_threads * PSC_OMP_GUIDED_CHUNK_DIVISOR)), PSC_OMP_GUIDED_CHUNK_MAX), minimum_chunk_size)

Where:
* remaining_size is the number of iterations of the loop
* number_of_threads is the number of threads in the team
* PSC_OMP_GUIDED_CHUNK_DIVISOR is the value of the PSC_OMP_GUIDED_CHUNK_DIVISOR environment variable (defaults to 2)
* PSC_OMP_GUIDED_CHUNK_MAX is the value of the PSC_OMP_GUIDED_CHUNK_MAX environment variable (defaults to 300)
* minimum_chunk_size is the size of the smallest piece (this is the value of chunk in the SCHEDULE directive)
* ROUNDUP(x) rounds x upwards to the nearest higher integer
* MIN(a,b) is the minimum of a and b
* MAX(a,b) is the maximum of a and b

The minimum_chunk_size is the value specified by the user in the guided scheduling directive (defaults to 1).

NOTE: If the values of PSC_OMP_GUIDED_CHUNK_MAX and minimum_chunk_size are inconsistent (i.e. the minimum is larger than the maximum), the minimum_chunk_size takes precedence per the OpenMP specification.

PSC_OMP_LOCK_SPIN  Integer
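The chunk size calculation can be worked through in a few lines of C. This is a sketch under stated assumptions, not the libopenmp implementation: guided_chunk_size is a hypothetical name, the division by number of threads times divisor is one reading of the flattened equation, and the outer maximum is inferred from the note that the minimum chunk size takes precedence.

```c
#include <assert.h>

/* Hypothetical helper: evaluates the guided-scheduling chunk size for one
   dispatch. divisor and chunk_max stand for the values of
   PSC_OMP_GUIDED_CHUNK_DIVISOR and PSC_OMP_GUIDED_CHUNK_MAX; chunk_min is
   the chunk value from the SCHEDULE directive. */
long guided_chunk_size(long remaining, long threads,
                       long divisor, long chunk_max, long chunk_min)
{
    long denom, chunk;

    assert(remaining >= 0 && threads > 0 && divisor > 0);
    denom = threads * divisor;
    chunk = (remaining + denom - 1) / denom;  /* ROUNDUP of the quotient */
    if (chunk > chunk_max)
        chunk = chunk_max;                    /* MIN with the maximum */
    if (chunk < chunk_min)
        chunk = chunk_min;                    /* minimum takes precedence */
    return chunk;
}
```

With the documented defaults (divisor 2, maximum 300, minimum 1), four threads and 1000 remaining iterations give a chunk of 125, while 10000 remaining iterations are capped at 300.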
238. is used to specify the filename of the pathopt2 XML configuration file. If it is not specified, the tool will first check for a file called pathopt2.xml in the current working directory and use it if present; otherwise the tool will use the file <install_path>/pathscale/share/pathopt2/pathopt2.xml.
-g external_configfile  Loads in additional user-defined configfile(s). This allows a user to extend the pathopt2.xml file without having to modify it.
Show usage
Number of jobs

Table 7.4 pathopt2 Options (Continued)
-k  Keep temporary directory. Remove temporary directory (with T)
-M  Directory name pwd
-n num_iterations  Number of iterations to run on each option. 1
-r test_command  Test script. If this option is not specified, then there is no test run and the performance of the build command is used. This is useful when the program is built and run in one step and the timing-file or rate-file mechanism is used to report the performance.
-S real, user, system, timing-file, rate-file  Selects the performance metric for choosing options and for sorting the results. real
-t execute_target  Use execute_target, which corresponds to an execute tag found in configfile. The first target in configfile
-T  Run script in Do not use a temporary
239. l unrolling can be disabled by specifying -LNO:ou_further=999999. Unrolling is enabled as much as is sensible by specifying -LNO:ou_further=3.
-LNO:ou_max=N  This option enables the compiler to unroll as many as N copies per loop, but no more.
-LNO:pwr2=(ON|OFF)  For C/C++ only. This option specifies whether to ignore the leading dimension (set this to OFF to ignore).

Following are LNO Target Cache Memory Options. These arguments allow you to describe the target cache memory system. In the following arguments, the numbering starts with the cache level closest to the processor and works outward.
-LNO:assoc1=N, assoc2=N, assoc3=N, assoc4=N  This option specifies the cache set associativity. For a fully associative cache, such as main memory, N should be set to any sufficiently large number, such as 128. Specify a positive integer for N; specifying N=0 indicates there is no cache at that level.
-LNO:cmp1=N, cmp2=N, cmp3=N, cmp4=N, dmp1=N, dmp2=N, dmp3=N, dmp4=N  This option specifies, in processor cycles, the time for a clean miss (cmpx) or a dirty miss (dmpx) to the next outer level of the memory hierarchy. This number is approximate because it depends on a clean or dirty line, read or write miss, etc. Specify a positive integer for N; specifying N=0 indicates there is no cache at that level.
-LNO:cs1=N, cs2=N, cs3=N, cs4=N  This option specifies the cache size. N c
240. le. The pathopt2 tool arranges that the specified options are passed through as arguments to the build command using the expansion of the character on the pathopt2 command line. Usually these options will then be explicitly passed to the compiler, either directly or via a Makefile variable such as CFLAGS or FFLAGS. Alternatively, the PathScale compilers will also process options from the PSC_GENFLAGS environment variable. This provides a way to implicitly pass the pathopt2-selected options to the compiler through existing scripts and Makefiles without their modification. Note that pathopt2 itself does not set the value of PSC_GENFLAGS, but this can easily be achieved using a shell script as the build command and using the syntax export PSC_GENFLAGS

7.9.7 Using Build and Test Scripts

The first example was run without build or test scripts. However, scripts provide added flexibility to pathopt2. Here are three common reasons for using a build script:
* You might need to cd to another directory before issuing the make command.
* There may be several directories you need to go to in order to complete the build.
* There may be no make clean target, so you need an rm command before the make command.

There are several reasons for using a test script:
* pathopt2 can't handle a complicated program run command with white
241. le 7.1 shows how -ipa affects the base runs of the CPU2000 benchmarks. IPA improves the running times of 17 out of the 26 benchmarks; the improvements range from 1.3% to 26.6%. There are six benchmarks that improve by less than 0.5%, which is within the noise threshold. There are three FP benchmarks that slow down from 1.2% to 4.5% due to -ipa. The slowdown indicates that the benchmarks do not benefit from the default settings of the IPA parameters. By using additional IPA tuning flags, such slowdown can often be converted to performance gain. The average performance improvement over all the benchmarks listed in table 7.1 is 6%.

Table 7.2 Effects of IPA tuning on some SPEC CPU2000 benchmarks
Benchmark     Time: peak flags     Time: peak flags     Improvement  IPA Tuning Flags
              w/o IPA tuning       with IPA tuning
181.mcf       325.3 s              275.5 s              15.3%        -IPA:field_reorder=on
197.parser    296.5 s              245.2 s              17.3%        -IPA:ctype=on
253.perlbmk   195.1 s              177.7 s              8.9%         -IPA:min_hotness=5:plimit=20000
168.wupwise   147.7 s              129.7 s              12.2%        -IPA:space=1000:linear=on -IPA:plimit=50000:callee_limit=5000 -INLINE:aggressive=on
187.facerec   144.6 s              141.6 s              2.1%         -IPA:plimit=1800

Table 7.2 shows the effects of using additional IPA tuning flags on the peak runs of the CPU2000 performance. In the peak runs, each benchmark can be built with its own combination of any
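The improvement column of Table 7.2 appears to be the relative reduction in run time. As a small worked check, the sketch below computes that figure; improvement_percent is a hypothetical helper, and the reading of the 181.mcf row as 325.3 s without tuning and 275.5 s with tuning is an interpretation of the extracted digits.

```c
#include <assert.h>

/* Hypothetical helper: percentage by which the run time dropped, given
   the time before tuning and the time after tuning, both in seconds. */
double improvement_percent(double before_s, double after_s)
{
    assert(before_s > 0.0);
    return 100.0 * (before_s - after_s) / before_s;
}
```

For the 181.mcf row this gives 100 * (325.3 - 275.5) / 325.3, or about 15.3, which agrees with the improvement listed in the table.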
242. leaving options. Depending on your configuration, this may have an effect on performance. For a discussion of memory interleaving across nodes, see section 7.8.3 below.

7.8.3 Multiprocessor Memory

Traditional small multiprocessor (MP) systems use symmetric multiprocessing (SMP), in which the latency and bandwidth of memory is the same for all CPUs. This is not the case on Opteron multiprocessor systems, which provide non-uniform memory access, known as NUMA. On Opteron MP systems, each CPU has its own direct-attached memory. Although every CPU can access the memory of all others, memory that is physically closest has both the lowest latency and highest bandwidth. The larger the number of CPUs, the higher will be the latency and the lower the bandwidth between the two CPUs that are physically furthest apart. Most multiprocessor BIOSes allow you to turn on or off the interleaving of memory across nodes. Memory interleaving across nodes masks the NUMA variation in behavior, but it imposes uniformly lower performance. We recommend that you turn node interleaving off.

7.8.4 Kernel and System Effects

To achieve best performance on a NUMA system, a process or thread and as much as possible of the memory that it uses must be allocated to the same single CPU. The Linux kernel has historically had no support for setting the affinity of a process in this way. Running a
243. ler Directives
OpenMP Runtime Library Calls (Fortran)
OpenMP Runtime Library Calls (C/C++)
Runtime Libraries
Environment Variables
Standard OpenMP Environment Variables
PathScale OpenMP Environment Variables
OpenMP Stack Size
Stack Size for Fortran
Stack Size for C/C++
Stack Size Algorithm
Example OpenMP Code in Fortran
Example OpenMP Code in C/C++
Tuning for OpenMP Application Performance
Reduced Datasets
Enable OpenMP
8.14.3 Optimizations for OpenMP
8.14.3.1 Libraries
8.14.3.2 Memory System Performance
8.14.3.3 Load Balancing
8.14.3.4 Tunin
244. lls preprocessor that input has not already been preprocessed.
-freeform  For Fortran only. Treats all input source files, regardless of suffix, as if they were written in free source form. By default, only input files suffixed with .f90 or .F90 are assumed to be written in free source form.
-f[no-]rtti  For C++ only. Using -frtti will generate runtime type information. The -fno-rtti option will not generate runtime type information.
-f[no-]second-underscore  For Fortran only. -fsecond-underscore appends a second underscore to symbols that already contain an underscore. -fno-second-underscore tells the compiler not to append a second underscore to symbols that already contain an underscore.
-f[no-]signed-bitfields  For C/C++ only. -fsigned-bitfields makes bitfields be signed by default. -fno-signed-bitfields makes bitfields be unsigned by default.
-f[no-]strict-aliasing  For C/C++ only. -fstrict-aliasing tells the compiler to assume strictest aliasing rules. -fno-strict-aliasing tells the compiler not to assume strict aliasing rules.
-f[no-]PIC  -fPIC tells the compiler to generate position independent code, if possible. The default is -fno-PIC, which tells the compiler not to generate position independent code.
-fprefix-function-name  For C/C++ only. Add a prefix to all function names.
-fshared-data  For C/C++ only. Mark data as shared rather than p
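When strict aliasing rules are assumed, reading an object through a pointer of an unrelated type is not reliable, so code that inspects the bits of a value needs another route. The sketch below shows one common approach in standard C, copying the bytes with memcpy. The name float_bits is hypothetical, and the example assumes a 32-bit IEEE 754 float, as on x86 targets.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns the bit pattern of a float without casting
   a float pointer to an integer pointer, so it stays valid whether or not
   the compiler assumes strict aliasing. */
uint32_t float_bits(float value)
{
    uint32_t bits;

    assert(sizeof bits == sizeof value);
    memcpy(&bits, &value, sizeof bits);  /* byte copy, no aliased access */
    return bits;
}
```

Code written this way should behave the same under either setting of the strict aliasing option, whereas a cast such as *(uint32_t *)&value relies on the compiler not assuming strict aliasing.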
245. load balancing and scheduling issues to be observed. OProfile can access many different performance counters, giving more detailed insight into the CPU behavior; however, these advanced features of OProfile are not easy to use. If the application uses nested OpenMP parallelism, then try turning on the nested parallelism support by setting the OMP_NESTED environment variable to TRUE.

8.14.3.4 Tuning the Application Code

If you are able to tune the code of the application, it is worth checking whether any of the OpenMP directives specify a chunk size. It may be possible to make more appropriate choices of the chunk size, perhaps influenced by the number of CPUs available, the L2 size, or the data size. You may also want to try different scheduling strategies. If the amount of work in an OpenMP loop varies significantly from iteration to iteration, then a DYNAMIC or GUIDED scheduling algorithm is preferable. The default loop scheduling algorithm is static scheduling, and this is used by the majority of OpenMP applications. If this leads to an unbalanced distribution of work across the threads, try setting the PSC_OMP_STATIC_FAIR environment variable, which will cause the library to use a fairer distribution. If the application uses guided scheduling, the PSC_OMP_GUIDED_CHUNK_DIVISOR and PSC_OMP_GUIDED_CHUNK_MAX environment variables can be used to tune the loop scheduling. The default values for these are widely applicable, but some applications with
246. lock Set a nestable lock. The thread executing the subroutine will wait until a lock becomes available and then set that lock, incrementing the nesting count.
omp_unset_lock(omp_lock_t *): Release the lock, resuming a waiting thread (if any).
omp_unset_nest_lock(omp_nest_lock_t *): Release ownership of a nestable lock. The subroutine decrements the nesting count and releases the associated thread from ownership of the nestable lock.
int omp_test_lock(omp_lock_t *): Try to acquire the lock; return a non-zero value if successful, 0 if not.
omp_test_nest_lock(omp_nest_lock_t *): Attempt to set a lock using the same method as omp_set_nest_lock, but the executing thread does not wait for confirmation that the lock is available. If the lock is successfully set, the function increments the nesting count and returns the new nesting count; if the lock is unavailable, the function returns a value of zero.
double omp_get_wtime(void): Returns a double-precision value equal to the number of seconds since the initial value of the operating system real-time clock.
double omp_get_wtick(void): Returns a double-precision floating-point value equal to the number of seconds between successive clock ticks.
8.8 Runtime Libraries There are both static and dynamic versions of each library, and the libraries are supplied in both 64-bit and 32-bit versions.
247. lon 64 can have either a 512KB or 1MB L2 cache size. If your target machine is Athlon 64 and you have the smaller cache size, then setting -LNO:cs2=512k could help. You can also specify your target machine instead, using -march=athlon64. That would automatically set the standard machine cache sizes. Here is the more general description of some of what is available. -LNO:cs1=n,cs2=n,cs3=n,cs4=n This option specifies the cache size. n can be 0 or a positive integer followed by one of the following letters: k, K, m, or M. These letters specify the cache size in Kbytes or Mbytes. Specifying 0 indicates there is no cache at that level. cs1 is the primary cache, cs2 refers to the secondary cache, cs3 refers to memory, and cs4 is the disk. The default cache size for each type of cache depends on your system. Use -LIST:options=ON to see the default cache sizes used during compilation. With a smaller cache, the cache set associativity is often decreased as well. The flagset -LNO:assoc1=n,assoc2=n,assoc3=n,assoc4=n can define this appropriately for your system. Once again, the above flags are already set appropriately for Opteron. 7.4.3 Cache Blocking, Loop Unrolling, Interchange Transformations Cache blocking, also called tiling, is the process of choosing the appropriate loop interchanges and loop unrolling sizes at the correct levels of the loo
248. losely to the C++ ABI specification. Therefore, the ABI obtained using version 0 will change as ABI bugs are fixed. The default is version 1. -fixedform (For Fortran only) Treat all input source files, regardless of suffix, as if they were written in fixed source form (f77 72-column format) instead of F90 free format. By default, only input files suffixed with .f or .F are assumed to be written in fixed source form. -fkeep-inline-functions (For C/C++ only) Generate code for functions even if they are fully inlined. -FLIST:... Invoke the Fortran listing control group, which controls production of the compiler's internal program representation back into Fortran code, after IPA inlining and loop-nest transformations. This is used primarily as a diagnostic tool, and the generated Fortran code may not always compile. With the exception of -FLIST:=OFF, any use of this option implies -flist. The arguments to the -FLIST option are as follows:
Argument: Action
=setting: Enable or disable the listing. setting can be either ON or OFF. The default is OFF. This option is enabled when any other -FLIST options are enabled, but it can also be used to enable a listing when no other options are enabled.
ansi_format=setting: Set ANSI format. setting can be either ON or OFF. When set to ON, the compiler uses a space (instead of tab) for indentation and a maximum of 72 characters per line. The
249. lows optimizations involving reassociation of floating-point quantities. Default is OFF. -OPT:fold_reassociate=ON is enabled automatically when -OPT:roundoff=2 or greater is in effect. -OPT:fold_unsafe_relops=(ON|OFF) This option folds relational operators in the presence of possible integer overflow. The default is ON for -O3 and OFF otherwise. -OPT:fold_unsigned_relops=(ON|OFF) This option folds unsigned relational operators in the presence of possible integer overflow. Default is OFF. -OPT:goto=(ON|OFF) Disable or enable the conversion of GOTOs into higher-level structures like FOR loops. The default is ON for -O2 or higher. -OPT:IEEE_arithmetic,IEEE_arith=(1|2|3) Specify the level of conformance to IEEE 754 floating-point roundoff/overflow behavior. Note that -OPT:IEEE_a is a valid abbreviation for this flag. The options can be one of the following: 1 Adhere to IEEE accuracy. This is the default when optimization levels -O0, -O1, and -O2 are in effect. 2 May produce inexact results not conforming to IEEE 754. This is the default when -O3 is in effect. 3 All mathematically valid transformations are allowed. -OPT:IEEE_NaN_Inf=(ON|OFF) -OPT:IEEE_NaN_inf=ON forces all operations that might have IEEE 754 NaN or infinity operands to yield results that conform to ANSI/IEEE 754-1985, the IEEE Standard for Binary Floating-point Arithmetic, which de
250. march 2 4 mcmodel 2 9 10 2 mcpu 2 4 mp 8 2 8 3 8 11 10 4 msse2 2 4 msse3 2 4 mtune 2 4 no intrinsic C 2 noccp 4 3 O 2 8 6 1 O 2 2 2 8 O0 3 2 O1 3 2 4 4 O2 3 2 4 2 6 1 O2 ipa 6 1 O3 2 8 3 2 7 2 7 12 O3 ipa 6 1 Ofast 6 3 7 12 7 14 7 20 OPT 3 8 OPT alias 3 27 6 2 OPT alias any 7 20 OPT alias cray_pointer 7 20 OPT alias disjoint 7 20 OPT alias no_parm 3 27 OPT alias no_restrict 7 20 OPT alias restrict 6 2 7 20 OPT alias typed 6 2 7 20 OPT alias unnamed 7 20 OPT div_split 6 2 7 21 OPT early_mp 8 28 1 02404 13 XX QLOGIC QLogic PathScale Compiler Suite User Guide 3 0 Beta 1 OPT fast complex 7 23 OPT fast exp 7 22 OPT fast math 7 21 OPT fast_nint 7 23 OPT fast_trunc 7 22 OPT fold_reassociate 7 22 OPT goto 7 2 OPT IEEE_arithmetic 7 22 OPT IEEE_arithmetic N 7 20 OPT Ofast 6 3 OPT Olimit 6 2 7 8 OPT recip 7 21 OPT reorg_common 10 2 OPT roundoff 6 2 7 21 7 22 OPT wrap_around_unsafe_opt 10 3 p 9 1 pg 2 8 r8 3 5 S 7 42 show defaults 2 5 static 2 9 5 2 8 11 trapuv 10 1 version 2 2 WI 5 1 WOPT 3 8 WOPT fold 3 27 WOPT fold off 3 27 Wuninitialized 10 1 y on 3 24 zerouv 10 1 enabling and disabling features 7 2 group 7 2 IPA specfile 7 8 LANG rw_const 3 26 Ofast 10 3 OPT alias parm 7 20 OPT roundoff 6 3 syntax 7 2 Outer loop unrolling 7 16 P Parallel directives 8 1 Parallelism controlling 7 13
251. may result in the compiler running out of registers, so it has to use memory more often, which causes program slow-down. In addition, too much inlining can slow down the later phases of the compilation process. Many function calls pass constants (including addresses of variables) as parameters. Replacing a formal parameter by its known constant value helps in the optimization of the function body. Very often, part of the code of the function can be determined useless and deleted. Function cloning creates different clones of a function, with its parameters customized to the forms of the calls. It provides a subset of the benefits of inlining without increasing the size of the function that contains the call. Like inlining, it also increases the total size of the program. If IPA can determine that all the calls pass the same constant parameter, it will perform constant propagation for the parameter. This has the same benefit as function cloning, but does not increase the size of the program. [Figure 7-1. IPA Compilation Model] Constant propagation also applies to global variables. If a global variable is found to be consta
252. mber of CPUs inclusive. The default is a stride of 1, which causes the threads to be linearly mapped to consecutive CPUs. When there are more threads than CPUs, the mapping wraps around, giving a round-robin allocation of threads to CPUs. The behavior for a stride of 0 is the same as a stride of 1. PSC_OMP_CPU_OFFSET This specifies an integer value that is used to offset the CPU assignments for the set of threads. It takes an integer value in the range of 0 to the number of CPUs inclusive. When a thread is mapped to a CPU, this offset is added onto the CPU number calculated after PSC_OMP_CPU_STRIDE has been applied. If the resulting value is greater than the number of CPUs, then the remainder is used from the division of this value by the number of CPUs. PSC_OMP_GUARD_SIZE This environment variable specifies the size in bytes of a guard area that is placed below pthread stacks. This guard area is in addition to any guard pages created by your O/S. PSC_OMP_GUIDED_CHUNK_DIVISOR The value of PSC_OMP_GUIDED_CHUNK_DIVISOR is used to divide down the chunk size assigned by the guided scheduling algorithm. PSC_OMP_GUIDED_CHUNK_MAX This is the maximum chunk size that will be used by the loop scheduler for guided scheduling. PSC_OMP_LOCK_SPIN This chooses the locking mechanism used by critical sections and OMP locks. PSC_OMP_SILENT If you set PSC_OMP_SILENT to anythi
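The stride and offset rules described for PSC_OMP_CPU_STRIDE and PSC_OMP_CPU_OFFSET can be sketched as simple modular arithmetic. This is only an illustration under stated assumptions: the guide does not give the library's exact formula, so the function name `cpu_for_thread`, the zero-based CPU numbering, and the use of a plain modulo for wrap-around are all hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: map OpenMP thread number `thread` onto one of
 * `ncpus` CPUs (numbered from 0) using a stride and an offset.
 * Assumptions, not taken from the library source:
 *   - stride and offset each lie in the range 0..ncpus,
 *   - a stride of 0 behaves like a stride of 1,
 *   - wrap-around is a plain remainder after division by ncpus. */
int cpu_for_thread(int thread, int ncpus, int stride, int offset)
{
    int cpu;

    assert(thread >= 0 && ncpus > 0);
    assert(stride >= 0 && stride <= ncpus);
    assert(offset >= 0 && offset <= ncpus);

    if (stride == 0)
        stride = 1;                 /* documented: 0 acts like 1 */

    cpu = (thread * stride) % ncpus; /* apply the stride, wrapping */
    return (cpu + offset) % ncpus;   /* then add the offset, wrapping */
}
```

For example, under these assumptions a 4-CPU machine with a stride of 1 and an offset of 2 would place threads 0, 1, 2, 3 on CPUs 2, 3, 0, 1.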
253. mbly of 10 3 lat mem rd tool 7 26 libg2c 10 2 libopenmp 8 11 8 21 8 23 Library ACML 3 22 BLAS 3 22 FFTW 3 22 MPICH 3 22 limit command 3 2 Linker symbol 3 13 linuxthreads 8 23 Little endian format 3 18 LMBench tool 7 26 Load balancing using OProfile 8 29 Load balancing using top 8 29 Local ID 8 13 Loop unrolling 7 16 Macros pre defined 3 10 Makefile 2 7 4 2 5 3 7 3 man pages 1 2 2 2 E 1 Math intrinsic functions vectorizing 7 17 Memory allocation Fortran 3 27 Memory model 2 9 Memory non overlapping 3 27 Mixed code 3 13 Multiple sub options 7 2 Multiprocessor memory MP 7 25 N Name mangling 5 2 NaN 10 1 Non Temporal at All NTA 7 17 Non uniform memory NUMA 7 25 NUMA OpenMP 8 29 Numerical libraries and OpenMP 8 28 1 02404 13 O Object files generating from f90 files 2 7 OMP_DYNAMIC 8 12 OMP_NESTED 8 12 8 30 OMP_NUM_THREADS 8 12 OMP_SCHEDULE 8 12 8 29 OpenMP 8 1 OProfile 8 29 Optimization basic 6 1 Options ansi C 2 apo 8 2 byteswapio 3 19 C 3 12 C 2 8 CG gcm 7 17 CG load_exe 7 17 CG use_prefetchnta 7 17 CLIST 7 43 convert conversion 3 19 cpp 3 1 3 8 dD 4 4 F 3 19 fb create 7 18 fb opt 7 18 fcoco 3 8 fdecorate 3 13 3 17 ff2c 3 22 ff2c abi 3 22 ffast math 7 21 fixedform 3 1 FLIST 7 43 fno math errno 6 2 fno second underscore 3 22 fno underscoring 3 22 fPIC 2 10 freeform 3 1 ftpp 3 1 3 8 g 2 8 2 11 3 25 7
254. mink Fortran interface to the POSIX function symlink. Creates a symbolic link path2 pointing to the same file as path1. The function form returns 0 on success or an error code from the C library value errno. The subroutine form sets status to the value which the function would return. Trailing blanks in path1 and path2 are ignored; you can prevent this by using char(0) to place a null character after the last significant character. system Fortran interface to the C library function system. Execute command using a command interpreter or shell. The function form returns the value returned by the interpreter, conventionally 0 to indicate success and nonzero to indicate failure. The subroutine form sets status to the value which the function would return. time, time8 Fortran interface to the POSIX function time. Returns the current time as an integer suitable for use with ctime, gmtime, or ltime. Fortran interface to the POSIX function ttyname. The function form returns the name of the interactive terminal device associated with logical unit unit, or blanks if unit is not associated with such a device. The subroutine form sets name to the value that the function would return. umask Fortran interface to the POSIX function umask. Sets the file creation mask to mask. The function form returns the previous value of the mask. The s
255. mit, it will automatically increase the size of the stack allocated to a Fortran process before the Fortran program begins executing. By default, it automatically increases this limit to the total amount of physical memory on a system, less 128 megabytes per CPU. For example, when run on a 4-CPU system with 1G of memory, the Fortran runtime will attempt to raise the stack size limit to 1G - (128M * 4), or 512M. To have the Fortran runtime tell you what it is doing with the stack size limit, set the PSC_STACK_VERBOSE environment variable before you run a Fortran program. You can control the stack size limit that the Fortran runtime attempts to use with the PSC_STACK_LIMIT environment variable. If this is set to the empty string, the Fortran runtime will not attempt to modify the stack size limit in any way. Otherwise, this variable must contain a number. If the number is not followed by any text, it is treated as a number of bytes. If it is followed by the letter k or K, it is treated as kilobytes (1024 bytes). If m or M, it is treated as megabytes (1024K). If g or G, it is treated as gigabytes (1024M). If %, it is treated as a percentage of the system's physical memory. If the number is negative, it is treated as the amount of memory to leave free, i.e. it is subtracted from the amount of physical memory on the machine.
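The stack-limit rules above amount to a small piece of arithmetic, sketched here in C. This is not the Fortran runtime's code: the function names, the return-code convention, and the treatment of a negative value combined with a unit suffix are assumptions made for illustration. Applying the stated default rule to the 4-CPU, 1G example gives 1G minus 4 x 128M, which is 512M.

```c
#include <assert.h>
#include <stdlib.h>

/* Default described in the text: physical memory less 128 MB per CPU. */
long long default_stack_limit(long long phys_bytes, int ncpus)
{
    assert(phys_bytes > 0 && ncpus > 0);
    return phys_bytes - (long long)ncpus * 128LL * 1024 * 1024;
}

/* Hypothetical parser for a PSC_STACK_LIMIT-style string.
 * Returns  1 and stores the requested limit (in bytes) in *limit,
 *          0 for the empty string (leave the limit untouched),
 *         -1 for text this sketch does not understand.
 * Assumption: a negative number may carry a suffix, e.g. "-128m". */
int parse_stack_limit(const char *text, long long phys_bytes,
                      long long *limit)
{
    char *end;
    long long value;

    assert(text != NULL && limit != NULL && phys_bytes > 0);

    if (*text == '\0')
        return 0;                        /* empty: do not modify */

    value = strtoll(text, &end, 10);
    if (end == text)
        return -1;                       /* no digits at all */
    if (*end != '\0' && end[1] != '\0')
        return -1;                       /* more than one suffix char */

    switch (*end) {
    case '\0':                                   break; /* bytes */
    case 'k': case 'K': value *= 1024LL;         break;
    case 'm': case 'M': value *= 1024LL * 1024;  break;
    case 'g': case 'G': value *= 1024LL * 1024 * 1024; break;
    case '%': value = phys_bytes * value / 100;  break;
    default:  return -1;
    }

    if (value < 0)
        value = phys_bytes + value;      /* memory to leave free */

    *limit = value;
    return 1;
}
```

With 1G of physical memory, this sketch turns "-128m" into 896M and "50%" into 512M.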
256. mization and code generation have been performed. In the IPA compilation model, the link step is applied very early in the compilation process, before most optimization and code generation. In this scenario, the program code being linked is not in the object code format. Instead, it is in the form of the intermediate representation (IR) used during compilation and optimization. After the program has been linked at the IR level, inter-procedural analysis and optimization are applied to the whole program. Subsequently, compilation continues with the backend phases to generate the final object code. The IPA compilation model (see Figure 7-1) has been implemented with ease of use as one of its main objectives. At the user level, it is sufficient to just add the -ipa flag to both the compile line and the link line. Thus, users can avoid having to re-structure their Makefiles to use IPA. In order to do this, we have to introduce a new kind of .o files that we call IPA .o's. These are .o files in which the program code is in the form of IR, and they are different from ordinary .o files that contain object code. IPA .o files are produced when a file is compiled with the flags -ipa -c. IPA .o files can only be linked by the IPA linker. The IPA linker is invoked by adding the -ipa flag to the link command. This appears as if it is the final link step. In reality, this link step performs the foll
257. mmand will print to stdout all of the #define's used with -cpp on a Fortran file:
    $ echo > junk.F90; pathf95 -cpp -Wp,-dD -E junk.F90
There is no corresponding way to find out what is defined by the default Fortran preprocessor (-ftpp). See section 3.4.4.1 for information on how to find pre-defined macros in C and C++. No macros are predefined for the -coco preprocessor. Error Numbers: The explain Command The explain program is a compiler and runtime error message utility that prints a more detailed message for the numerical compiler messages you may see. When the Fortran compiler or runtime prints out an error message, it prefixes the message with a string in the format subsystem-number. For example, pathf95-0724. The pathf95-0724 is the message ID string that you will give to explain. When you type explain pathf95-0724, the explain program provides a more detailed error message:
    $ explain pathf95-0724
    Error: Unknown statement. Expected assignment statement but found "%s" instead of "=" or "=>".
    The compiler expected an assignment statement but could not find an assignment or pointer assignment operator at the correct point.
Another example:
    $ explain pathf95-0700
    Error: The intrinsic call "%s" is being made with illegal arguments.
    A function or subroutine call which invokes the name of an intrinsic procedure does not match any specific intrinsic. All dummy arguments with
258. mmon Compiler Options The PathScale Compiler Suite has command line options that are similar to many other Linux or Unix compilers:
Option          What it does
-c              Generates an intermediate object file for each source file, but doesn't link.
-g              Produces debugging information to allow full symbolic debugging.
-I<dir>         Adds <dir> to the directories searched by the preprocessor for include file resolution.
-l<library>     Searches the library specified during the linking phase for unresolved symbols.
-L<dir>         Adds <dir> to the directories searched during the linking phase for libraries.
-lm             Links using the libm math library. This is typically required in C programs that use functions such as exp(), log(), sin(), cos().
-o <filename>   Generates the named executable (binary) file.
-O3             Generates a highly optimized executable, generally numerically safe.
-O or -O2       Generates an optimized executable that is numerically safe. (This is also the default if no -O flag is used.)
-pg             Generates profile information suitable for the analysis program pathprof.
Many more options are available and described in the man pages (pathscale_intro, pathcc, pathf95, pathCC, eko) and section 7 in this document. Shared Libraries The PathScale Compiler Suite includes shared versions of the runtime libraries that the compilers use. The shared libraries are packaged in the pathscale-compilers-libs package. The compiler will u
259. n 4.2.2 Pragmas 4.2.2.1 Pragma pack In this release, we have tested and verified that the pragma pack is supported. The syntax for this pragma is: #pragma pack (n) This pragma specifies that the next structure should have each of its fields aligned to an alignment of n bytes if its natural alignment is not smaller than n. 4.2.2.2 Changing Optimization Using Pragmas Optimization flags can now be changed via directives in the user program. In C and C++, the directive is of the form: #pragma options <list of options> Any number of these can be specified inside function scopes. Each affects only the optimization of the entire function in which it is specified. The literal string can also contain an unlimited number of different options separated by space. The compilation of the next function reverts back to the settings specified in the compiler command line. In this release, there are limitations to the options that are processed in this options directive, and their effects on the optimization:
* There is no warning or error given for options that are not processed.
* These directives are processed only in the optimizing backend. Thus, only options that affect optimizations are processed.
* In addition, it will not affect the phase invocation of the backend components. For example, specifying -O0 will n
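To make the effect of the pack pragma concrete, here is a minimal C sketch comparing a naturally aligned structure with a packed one. The structure and function names are hypothetical. The sketch uses the widely supported `#pragma pack(1)` followed by `#pragma pack()` to restore the default, which is an assumption about how a given compiler scopes the pragma; the guide itself only states that the pragma applies to the next structure.

```c
#include <assert.h>
#include <stddef.h>

/* Natural layout: the compiler usually inserts padding after `tag`
 * so that `value` starts on an int-aligned boundary. */
struct natural_record {
    char tag;
    int  value;
};

/* Packed layout: with an alignment of 1 byte, `value` follows `tag`
 * immediately and no padding is inserted. */
#pragma pack(1)
struct packed_record {
    char tag;
    int  value;
};
#pragma pack()   /* restore the default packing for later structures */

/* Offset of `value` in the naturally aligned structure. */
size_t natural_value_offset(void)
{
    return offsetof(struct natural_record, value);
}

/* Offset of `value` in the packed structure. */
size_t packed_value_offset(void)
{
    return offsetof(struct packed_record, value);
}

/* Sanity check that the pragma was honored by this compiler. */
void require_packed_layout(void)
{
    assert(sizeof(struct packed_record) == sizeof(char) + sizeof(int));
}
```

Packed structures are mainly useful for matching an external binary layout; access to misaligned fields can be slower on some hardware.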
260. n Manager server. The PathScale Subscription Manager server is only required for floating subscriptions.
* PathScale debugger (pathdb)
* GNU binutils
2.2 How To Invoke the PathScale Compilers The PathScale Compiler Suite has three different front-ends to handle programs written in C, C++, and Fortran, and it has common optimization and code generation components that interface with all the language front-ends. The language your program uses determines which command (driver) name to use:
Language                                Command Name   Compiler Name
C                                       pathcc         PathScale C compiler
C++                                     pathCC         PathScale C++ compiler
Fortran 77, Fortran 90, Fortran 95      pathf95        PathScale Fortran compiler
You can create a common example program called world.c:
    #include <stdio.h>
    int main(void)
    {
        printf("Hello World\n");
        return 0;
    }
Then you can compile it from your shell prompt very simply:
    $ pathcc world.c
The default output file for the pathcc-generated executable is named a.out. You can execute it and see the output:
    $ ./a.out
    Hello World
As with most compilers, you can use the -o <filename> option to give your program executable file the desired name. If invoked with the flag -v (or -version), the compilers will emit some text that identifies the version. For example: $ pathcc -v QLogic PathScale(TM) Compiler Suite: Ver
261. nables reordering based on the frequency in which different procedures are invoked. N=2 enables procedure reordering based on caller-callee relationship. The default is 0. -IPA:field_reorder=ON enables IPA's field reordering optimization to minimize data cache misses. This optimization is based on reference patterns of fields in large structs, learned during feedback compilation. The default is OFF. -IPA:ctype=ON optimizes interfaces to constructs defined in the standard header file ctype.h by assuming that the program will not run in a multi-threaded environment. The default is OFF. 7.3.6.1 Disabling Options The following options are for disabling various optimizations in IPA. They are useful for studying the effects of the optimizations. -IPA:alias=OFF disables IPA's alias and mod-ref analyses. -IPA:addressing=OFF disables IPA's address-taken analysis, which is a component of the alias analysis. -IPA:cgi=OFF disables the constant propagation for global variables (constant global identification). -IPA:cprop=OFF disables the constant propagation for parameters. -IPA:dfe=OFF disables dead function elimination. -IPA:dve=OFF disables dead variable elimination. -IPA:split=OFF disables common block splitting. 7.3.7 Case Study on SPEC CPU2000 This section presents experimental data to show the importance of IPA in improving program performance. Our experiment is based on the SPEC CPU2000 benchmark suite, compiled using release 1.2 of the Pa
262. ndent Environment Variables A 2 Environment Variables for OpenMP A 2 Standard OpenMP Runtime Environment Variables A 3 PathScale OpenMP Environment Variables A 3 Implementation Dependent Behavior for OpenMP Fortran Supported Fortran Intrinsics How to Use the Intrinsics Table C 1 Intrinsic Options C 1 Table of Supported Intrinsics C 2 Fortran Intrinsic Extensions C 41 1 02404 15 XX QLogic PathScale Compiler Suite User Guide QLOGIC Version 3 0 a Appendix D Fortran 90 Dope Vector Appendix eko man Page AppendixF Glossary Figures Figure Page 7 1 IPA Compilation Model en 7 6 Tables Table Page 7T 1 Effects of IPA on SPEC CPU 2000 Performance 7 10 7 2 Effects of IPA tuning on some SPEC CPU2000 benchmarks 7 12 7 3 Numerical Accuracy with Options 7 23 7 4 pathopt2 Options i oa o e iaa ti a aia has 7 30 7 5 Tags for option configuration file 7 34 8 1 Fortran Compiler 8 4 8 2 C C Compiler Directives 8 6 8 3 Fortran OpenMP Runtime Library
263. nerally safe option, but may result in the compilation taking a long time or consuming large quantities of memory. This option tells the compiler to optimize the files being compiled at the specified levels no matter how large they are. The option -fno-math-errno bypasses the setting of ERRNO in math functions. This can result in a performance improvement if the program does not rely on IEEE exception handling to detect runtime floating-point errors. -OPT:roundoff=2 also allows for fairly extensive code transformations that may result in floating-point round-off or overflow differences in computations. Refer to section 7.7.4.2 and section 7.7.4 for more information. The option -OPT:div_split=ON allows the conversion of x/y into x*(recip(y)), which may result in less accurate floating-point computations. Refer to section 7.7.4.2 and section 7.7.4 for more information. The -OPT:alias settings allow the compiler to apply more aggressive optimizations to the program. The option -OPT:alias=typed assumes that the program has been coded in adherence with the ANSI/ISO C standard, which states that two pointers of different types cannot point to the same location in memory. Setting -OPT:alias=restrict allows the compiler to assume that pointers refer to distinct, non-overlapping objects. If the these options are specifi
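The accuracy trade-off behind converting x/y into x*(recip(y)) can be seen by writing both forms by hand. The sketch below is only an illustration of the arithmetic; the function names are hypothetical, and it does not show what the compiler actually emits. A direct division rounds once, while the reciprocal form rounds twice (once for 1/y, once for the multiply), so the two results can differ in the last bits.

```c
#include <assert.h>

/* Direct division: one rounding step. */
double divide_direct(double x, double y)
{
    assert(y != 0.0);
    return x / y;
}

/* Division rewritten as a multiply by the reciprocal: two rounding
 * steps, which is the kind of rewrite the text describes. */
double divide_via_reciprocal(double x, double y)
{
    double r;

    assert(y != 0.0);
    r = 1.0 / y;
    return x * r;
}

/* Absolute difference between the two forms for the same inputs. */
double rounding_gap(double x, double y)
{
    double d = divide_direct(x, y) - divide_via_reciprocal(x, y);

    return d < 0.0 ? -d : d;
}
```

When the reciprocal is exactly representable (for example y = 2.0), both forms agree exactly; for values such as y = 3.0 a small gap on the order of one unit in the last place may appear. The rewrite pays off mainly when the same divisor is reused many times, for example inside a loop.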
264. ng, then warning and debug messages from the libopenmp library are inhibited. PSC_OMP_STACK_SIZE (Fortran) Stack size specification follows the syntax described in the OpenMP in Fortran section of the QLogic PathScale Compiler Suite User Guide. PSC_OMP_STATIC_FAIR This determines the default static scheduling policy when no chunk size is specified. It is discussed in the OpenMP in Fortran section of the QLogic PathScale Compiler Suite User Guide. PSC_OMP_THREAD_SPIN This takes a numeric value and sets the number of times that the spin loops will spin at user level before falling back to O/S schedule/reschedule mechanisms. COPYRIGHT Copyright 2006, 2007 QLogic Corp. All Rights Reserved. Copyright 2003, 2004, 2005, 2006 PathScale, Inc. All Rights Reserved. SEE ALSO pathcc(1), pathCC(1), pathf95(1), compiler.defaults(5), pathopt2(1), assign(1), explain(1), fsymlist(1), pathscale_intro(7), pathdb(1); QLogic PathScale Compiler Suite and Subscription Manager Install Guide; QLogic PathScale Compiler Suite User Guide; QLogic PathScale Compiler Suite Support Guide; QLogic PathScale Debugger User Guide. Online documentation available at http://www.pathscale.com/docs.html Appendix F Glossary This section describes common terms that are used in connection wi
265. ng and Troubleshooting Fortran 3 25 Writing to Constants Can Cause Crashes 3 26 Runtime Errors Caused by Aliasing Among Fortran Dummy Arguments 3 26 Fortran malloc Debugging 3 27 Arguments Copied to Temporary Variables 3 27 1 02404 15 Page vi XX QLOGIC QLogic PathScale Compiler Suite User Guide Version 3 0 kl 3 11 Section 4 4 1 4 1 1 4 2 4 2 1 4 2 1 1 4 2 2 4 2 2 1 4 2 2 2 4 2 2 3 4 2 3 4 2 4 4 3 4 4 Section 5 5 1 5 2 5 3 5 3 1 5 3 1 1 5 3 2 5 3 3 5 4 5 5 5 6 5 6 1 Section 6 6 1 6 2 6 3 6 4 6 5 6 6 6 7 1 02404 15 Fortran Compiler Stack 51 3 29 The PathScale C C Compiler Using the C C Compilers 4 2 Accessing the GCC 4 x Front ends for C and C 4 2 Compiler and Runtime 4 3 Preprocessing Source Files 4 3 Pre defiried Macros a ate sw eo Rb e rp NP ares 4 3 PIAQM sick M E Mr Et 4 4 Pragrmaqeeleo cesses aed rented rco Me eee 4 4 Changing Optimization Using Pragmas 4 5 Code Layout Optimization Using Pragmas 4 5 Mixing GOdB ard e ode Sb ne ooa DU ede 4 6 Bid EM 4 6 Debugging and Troubl
266. ng file names to the preprocessor and identifying the set file are not relevant when you use the PathScale compiler since the compiler automatically passes each source file name to the preprocessor for you captures the preprocessor output for compilation and identifies the set file as described in the preceding paragraphs More information about the coco option can be found in the eko man page Pre defined Macros 3 10 The PathScale compiler pre defines some macros for preprocessing code When you use the C preprocessor cpp with Fortran or rely on the F F90 and F95 suffixes to use the default cpp preprocessor the PathScale compiler uses the same preprocessor it uses for C with the addition of the following macros LANGUAGE FORTRAN LANGUAGE FORTRAN 1 LANGUAGE FORTRAN90 1 LANGUAGE FORTRAN90 1 _ unix 1 unix 1 unix 1 NOTE When using an optimization level at 01 or higher the compiler will set and use the OPTIMIZE _ macro with cpp See the complete list of macros for cpp in Section 4 2 1 1 If you use the Fortran preprocessor ftpp only these five macros are defined for you LANGUAGE_FORTRAN 1 _ LANGUAGE FORTRAN90 1 LANGUAGE FORTRAN90 1 _ unix 1 unix 1 NOTE By default Fortran uses cpp You must specify the f tpp command line switch with Fortran code to use the Fortran preprocessor 1 02404 15 XX QLOGIC 3 The PathScale Fortran Compiler Compiler and Runtime Features ls 3 4 5 This co
267. nguage front-end to invoke. For example, some mixed language programs can be compiled with a single command:
    $ pathf95 stream_d.f second_wall.c -o stream
The pathf95 driver will use the .c extension to know that it should automatically invoke the C front-end on the second_wall.c module and link the generated object files into the stream executable. NOTE: GNU make does not contain a rule for generating object files from Fortran 90 files. You can add the following rules to your project Makefiles to achieve this:
    %.o: %.f90
    	$(FC) $(FFLAGS) -c $<
    %.o: %.F90
    	$(FC) $(FFLAGS) -c $<
You may need to modify this for your project, but in general the rules should follow this form. For more information on compatibility and porting existing code, see section 5. Information on GCC compatibility and a wrapper script that you can use for your build packages can be found in section 5.6.1. 2.5 Other Input Files Other possible input files, common to both C/C++ and Fortran, are assembly-language files, object files, and libraries. These can be used as inputs on the command line:
Extension   Implication to the driver
.i          preprocessed C source file
.ii         preprocessed C++ source file
.s          assembly language file
.o          object file
.a          a static library of object files
.so         a library of shared (dynamic) object files
2.6 Co
268. non-NUMA kernel on a NUMA system can result in changes in performance while a program is running, and non-reproducibility of performance across runs. This occurs because the kernel will schedule a process to run on whatever CPU is free, without regard to where the process's memory is allocated. Recent kernels have some degree of NUMA support. They will attempt to allocate memory local to the CPU where a process is running, but they still may not prevent that process from later being run on a different CPU after it has allocated memory. Current NUMA-aware kernels do not migrate memory across NUMA nodes, so if a process moves relative to its memory, its performance will suffer in unpredictable ways. Note that not all vendors ship NUMA-aware kernels or C libraries that can interface to them. If you are unsure of whether your kernel supports NUMA, check with your distribution vendor. 7.8.5 Tools and APIs Recent Linux distributions include tools and APIs that allow you to bind a thread or process to run on a specific CPU. This provides an effective workaround for the problem of the kernel moving a process away from its memory. Your Linux distribution comes with a package called schedutils, which includes a program called taskset. You can use taskset to specify that a program must run on one particular CPU. For low-level programming, this facility is provided
269. ns Sorted summary from all runs Flags Build Test Time O3 OPT unroll_times_max 8 PASS PASS 10 33 CG load exe 0 LNO interchange off CG local fwd sched on O3 OPT unroll times max 8 PASS PASS 10 45 CG load exe 0 LNO interchange off OPT unroll times max 16 03 OPT unroll times max 8 PASS PASS 10 47 03 OPT unroll times max 8 PASS PASS 10 47 7 9 8 4 Example 3 Using a Single Script with the rate file With some applications or benchmarks it is more convenient to combine building and testing into one script In this case you must use the s timing file rate file feature so that you don t use the combined compile and run time as your sorting criterion to find the best solutions Sometimes the options that produce the fastest executable take more compile time One advantage of using a single script is that it is easier to parameterize and requires less editing For example you can pass in another benchmark executable name from the command line rather than having to edit the name in the psc test script We will use S rate file this time rather than timing file The use of rate file means that we need to use grep sed commands in the script below that differ from those in psc test2 above You can copy the file compile go rate from opt pathscale share pathopt2 examples into your working directory It is show here bin sh Gd make clean code 1 Size 2 sh
270. nsafe opt OFF

See

10.10 Troubleshooting OpenMP

You must use the -mp flag when you compile code that contains OpenMP directives. If you do not use the -mp flag, the compiler will ignore the OpenMP directives and compile your code as if the directives were not there.

10.10.1 Compiling and linking with -mp

If a program compiled with -mp is linked without the -mp flag, the linker will not link with the OpenMP library, and the linker will display undefined references similar to these:

    undefined reference to `__ompc_can_fork'
    libutil.a(diffu.o)(.text+0xa93): In function `diffu':
    undefined reference to `__ompc_get_thread_num'
    libutil.a(diffu.o)(.text+0x2400): In function `diffu':
    undefined reference to `__ompc_fork'
    libutil.a(diffu.o)(.text+0x2499): In function `__ompdo_diffu_1':

Appendix A Environment Variables

This appendix lists environment variables utilized by the compiler, along with a short description. These variables are organized by language, with a separate section for language-independent variables.

A.1 Environment Variables for Use with C

PSC_CFLAGS
    Flags to pass to the C compiler, pathcc. This variable is used with the gcc compatibility wrapper scripts.

A.2 Environment Variables for Use with C++

PSC_CXXFLAGS
    Flags to pass to the C++ compiler, pathC
271. nt throughout the entire program execution. IPA will replace the variable by the constant value.

Dead variable elimination finds global variables that are never used over the program, and deletes them. These variables are often exposed due to IPA's constant propagation.

Dead function elimination finds functions that are never called, and deletes them. They can be the by-product of inlining and cloning.

Common padding applies to common blocks in Fortran programs. Ordinarily, compilers are incapable of changing the layout of the user variables in a common block, because this has to be coordinated among all the subroutines that use the same common block, and the subroutines may belong to different compilation units. But under IPA, all the subroutines are available. The padding improves the alignments of the arrays so they can be accessed more efficiently, even vectorized. The padding can also reduce data cache conflicts during execution.

Common block splitting also applies to common blocks in Fortran programs. This splits a common block into a number of smaller blocks, which also reduces data cache conflicts during execution.

Procedure re-ordering lays out the functions of the program in an order based on their call relationship. This can reduce thrashing in the instruction cache during execution.

7.3.4 Controlling IPA

Although the compile
272. nt variable PSC_OMP_STATIC_FAIR.

The number of iterations is divided by the number of threads in the team and rounded down to give the chunk size. Each thread will be assigned at least this many iterations. If the division was not exact, then the remaining iterations are scheduled across the threads in increasing thread order until no more iterations are left. The set of iterations assigned to a thread is always contiguous in terms of the loop iteration value.

Note that the difference between the minimum and maximum number of iterations assigned to individual threads in the team is at most 1. Thus the set of iterations is shared as fairly as possible among the threads.

Consider static scheduling of four iterations across 3 threads. With the default policy, threads 0 and 1 will be assigned two iterations and thread 2 will be assigned no iterations. With the fair policy, thread 0 will be assigned two iterations and threads 1 and 2 will be assigned one iteration.

NOTE: The maximum number of iterations assigned to a thread (which determines the worst-case path through the schedule) is the same for the default scheduling policy and the fair scheduling policy. In many cases the performance of these two scheduling policies will be very similar.

PSC_OMP_THREAD_SPIN (Integer value)
    This takes a numeric value and sets the number of times that t
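The arithmetic of the two static policies can be checked with a short sketch. It is an illustration only: the fair policy follows the description above, while the default policy is assumed to use a chunk size of the iteration count divided by the thread count, rounded up, which is consistent with the four-iterations-on-three-threads example but is not spelled out in this fragment. The function names are hypothetical.

```python
def static_default(iterations, threads):
    """Assumed default policy: chunk size rounded UP, handed out in thread order."""
    chunk = -(-iterations // threads)  # ceiling division without floats
    counts = []
    remaining = iterations
    for _ in range(threads):
        take = min(chunk, remaining)
        counts.append(take)
        remaining -= take
    return counts

def static_fair(iterations, threads):
    """Fair policy: chunk size rounded DOWN, leftovers go to the lowest threads."""
    base, extra = divmod(iterations, threads)
    return [base + (1 if t < extra else 0) for t in range(threads)]

if __name__ == "__main__":
    print(static_default(4, 3))  # iterations per thread, default policy
    print(static_fair(4, 3))     # iterations per thread, fair policy
```

For four iterations on three threads the sketch reproduces the distribution quoted in the text, and in both policies the busiest thread receives the same number of iterations, which is the point the NOTE makes.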
273. ntered.

-W[no-]undef
    -Wundef warns if an undefined identifier appears in a #if directive. -Wno-undef tells the compiler not to warn if an undefined identifier appears in a #if directive.

-W[no-]uninitialized
    -Wuninitialized warns about uninitialized automatic variables. Because the analysis to find uninitialized variables is performed in the global optimizer, invoked at -O2 or above, this option has no effect at -O0 and -O1. -Wno-uninitialized tells the compiler not to warn about uninitialized automatic variables.

-W[no-]unknown-pragmas
    -Wunknown-pragmas warns when an unknown #pragma directive is encountered. -Wno-unknown-pragmas tells the compiler not to warn when an unknown #pragma directive is encountered.

-W[no-]unreachable-code
    -Wunreachable-code warns about code that will never be executed. -Wno-unreachable-code tells the compiler not to warn about code that will never be executed.

-W[no-]unused
    -Wunused warns when a variable is unused. -Wno-unused tells the compiler not to warn when a variable is unused.

-W[no-]unused-function
    -Wunused-function warns about unused static and inline functions. -Wno-unused-function tells the compiler not to warn about unused static and inline functions.

-W[no-]unused-label
    -Wunused-label warns about unused labels. -Wno-unused-label tells the compiler not to warn about unused labels.

-W[no-]unused-parameter
274. ntly divided by the number of CPUs in the system. This ensures that the physical memory available for stack can be shared between as many threads as there are CPUs in the system. The limit tries to avoid excessive swapping in the case where all of these threads consume all of their available stack. Note that if there are more OpenMP threads than CPUs, and they all consume all of their stack, then this will cause swapping.

The stack size of the main thread can be controlled using the PSC_STACK_LIMIT environment variable, and diagnostics for its setting can be generated using the PSC_STACK_VERBOSE environment variable, in exactly the same way as for a serial Fortran program.

The stack sizing of OpenMP pthreads follows a complementary approach to that for the main thread. There are some differences because the sizing of pthread stacks has different system-imposed limits and mechanisms.

The PSC_STACK_VERBOSE flag can also be used to turn on diagnostics for the stack sizing of pthreads. However, the stack size is controlled by the PSC_OMP_STACK_SIZE environment variable, not PSC_STACK_LIMIT. The syntax and allowed values for PSC_OMP_STACK_SIZE are identical to those for PSC_STACK_LIMIT, so please see section 3.11 for instructions.

The reason for having both PSC_STACK_LIMIT and PSC_OMP_STACK_SIZE is to allow the stacks of the main thread and the
275. number of tuning flags. We started with the peak flags of the benchmarks used in PathScale's SPEC CPU2000 submission, and we found that five of the benchmarks are using IPA tuning flags. Table 7-1 lists these five benchmarks. The second column gives the running times if the IPA-related tuning flags are omitted. The third column gives the running times with the IPA-related tuning flags. The fifth column lists their IPA-related tuning flags. As this second table shows, proper IPA tuning can produce major improvements in applications.

7.3.8 Invoking IPA

Inter-procedural analysis is invoked in several possible ways: -ipa, -IPA, and implicitly via -Ofast. It can be used with any optimization level, but gives the biggest potential benefit when combined with -O3. The -Ofast flag turns on -ipa as part of its many optimizations.

When compiling with -ipa, the .o files that are created are not regular .o files. IPA uses the .o files in its analysis of your program, and then does a second compilation using that information to optimize the executable.

The IPA linker checks to see if the entire program is compiled with the same set of optimization options. If different optimization options are used, IPA will give a warning:

    Warning: Inconsistent optimization options detected between files involved in

For example, the following invocation will generate this warning for two C files, a.c and b.c:
276. o warn when overload resolution promotes from unsigned to signed.

-W[no-]strict-aliasing
    For C/C++ only. -Wstrict-aliasing warns about code that breaks strict aliasing rules. -Wno-strict-aliasing tells the compiler not to warn about code that breaks strict aliasing rules.

-W[no-]strict-prototypes
    For C/C++ only. -Wstrict-prototypes warns about non-prototyped function decls. -Wno-strict-prototypes tells the compiler not to warn about non-prototyped function decls.

-W[no-]switch
    For C/C++ only. -Wswitch warns when a switch statement is incorrectly indexed with an enum. -Wno-switch tells the compiler not to warn when a switch statement is incorrectly indexed with an enum.

-W[no-]system-headers
    For C/C++ only. -Wsystem-headers prints warnings for constructs in system header files. -Wno-system-headers tells the compiler not to print warnings for constructs in system header files.

-W[no-]synth
    For C++ only. The -Wsynth option warns about synthesis that is not backward compatible with cfront. -Wno-synth tells the compiler not to warn about synthesis that is not backward compatible with cfront.

-W[no-]traditional
    For C/C++ only. -Wtraditional warns about constructs whose meanings change in ANSI C. -Wno-traditional tells the compiler not to warn about constructs whose meanings change in ANSI C.

-W[no-]trigraphs
    For C/C++ only. -Wtrigraphs warns when trigraphs are encountered. -Wno-trigraphs tells the compiler not to warn when trigraphs are encou
277. o pass arguments to the linker, while the PathScale Compiler Suite uses the modern -Wl flag.

Some gcc flags may not yet be implemented. These will be documented in the release notes.

If a configure script is being used, QLogic provides wrapper scripts for gcc that are frequently helpful. See section 5.6.1 for more information.

5.3 Porting Fortran

If you are porting Fortran code, see section 3.9 for more information about Fortran-specific issues.

5.3.1 Intrinsics

The PathScale Fortran compiler supports many intrinsics and also has many unique intrinsics of its own. See Appendix C for the complete list of supported intrinsics.

5.3.1.1 An Example

Here is some sample output from compiling Amber 8 using only ANSI intrinsics. You get this series of error messages:

    pathf95 -O3 -msse2 -m32 -o fantasian fantasian.o lib/random.o lib/mexit.o
    fantasian.o: In function `simplexrun':
    fantasian.o(.text+0xaad4): undefined reference to `rand_'
    fantasian.o(.text+0xab0e): undefined reference to `rand_'
    fantasian.o(.text+0xab48): undefined reference to `rand_'
    fantasian.o(.text+0xab82): undefined reference to `rand_'
    fantasian.o(.text+0xabbf): undefined reference to `rand_'
    fantasian.o(.text+0xee0a): more undefined references to `rand_' follow
    collect2: ld returned 1 exit status

The problem is
278. o select the C preprocessor on the command line.

3.4.3 Support for Varying-length Character Strings

Beginning with Release 2.5, the PathScale Fortran compiler supports ISO/IEC Standard 1539-2, which provides support for varying-length character strings. This is an optional add-on to the Fortran Standard. You can download and compile this module. It is available from this location:

    http://www.fortran.com/fortran/iso_varying_string.f95

3.4.4 Preprocessing Source Files with -fcoco

Beginning with Release 2.4, the PathScale Fortran compiler supports the ISO/IEC 1539-3 conditional compilation preprocessor. When you use the -fcoco option, the compiler runs this preprocessor on each individual source file before compiling that source file, overriding the default whereby files suffixed with .F90 or .F95 are preprocessed with cpp, but files suffixed with .f90 or .f95 are not preprocessed.

The ISO/IEC standard does not specify any command-line options for the preprocessor, but as an extension we pass -I and -D options to it, just as we do for the cpp and tpp preprocessors. As with the other preprocessors, an option like -Isubdir (no trailing / is needed) tells the preprocessor to add subdir to the list of directories in which it will search for included files. Unlike the cpp and tpp preprocessors, this one requires that its identifiers be declared with a data type, so an option like -DTVAR=5 declares a constant
279. ocedural cloning may provide opportunities for inter-procedural optimization, but may also significantly increase the code size.

-IPA:node_bloat=N
    When this option is used in conjunction with -IPA:multi_clone, it specifies the maximum percentage growth of the total number of procedures relative to the original program.

-IPA:plimit=N
    This option stops inlining into a specific subprogram once it reaches size N in the intermediate representation. Default is 2500.

-IPA:pu_reorder=(0|1|2)
    Control re-ordering the layout of program units based on their invocation patterns in feedback compilation to minimize instruction cache misses. This option is ignored unless under feedback compilation.
    0   Disable procedure reordering. This is the default for non-C++ programs.
    1   Reorder based on the frequency in which different procedures are invoked. This is the default for C++ programs.
    2   Reorder based on caller-callee relationship.

-IPA:relopt=(ON|OFF)
    This option enables optimizations similar to those achieved with the compiler options -O and -c, where objects are built with the assumption that the compiled objects will be linked into a call-shared executable later. The default is OFF. In effect, optimizations based on position-dependent code (non-PIC) are performed on the compiled objects.

-IPA:small_pu=N
    A procedure with size smaller than N is not subjected to the plimit rest
280. odern debuggers, such as PathScale's pathdb, GDB, Etnus' TotalView, Absoft Fx2, and Streamline's DDT. This format is known as DWARF 2.0 and is incorporated directly into the object files. Code that has been compiled using -g will be capable of being debugged using pathdb, GDB, or other debuggers.

The -g option automatically sets the optimization level to -O0 unless an explicit optimization level is provided on the command line. Debugging of higher levels of optimization is possible, but the code transformation performed by the optimizations may make it more difficult.

Bounds checking is quite a useful debugging aid. This can also be used to debug allocated memory.

If you are noticing numerical accuracy problems, see section 7.7 for more information on numerical accuracy.

See section 10 for more information on debugging and troubleshooting. See the QLogic PathScale Debugger User Guide for more information on pathdb.

3.10.1 Writing to Constants Can Cause Crashes

Some Fortran compilers allocate storage for constant values in read-write memory. The PathScale Fortran compiler allocates storage for constant values in read-only memory. Both strategies are valid, but the PathScale compiler's approach allows it to propagate constant values aggressively.

This difference in constant handling can result in crashes at runtime when For
281. oduces minimal information, enough for making backtraces in parts of the program that you don't plan to debug. This is also the flag to use if the user wants backtraces but does not want the overhead of full debug information. This flag also causes --export-dynamic to be passed to the linker.

    2   Produces debugging information for symbolic debugging. Specifying -g without a debug level is equivalent to specifying -g2. If there is no explicit optimization flag specified, the -O0 optimization level is used in order to maintain the accuracy of the debugging information. If optimization options -O1, -O2, -O3, or -ipa are explicitly specified, the optimizations are performed accordingly, but the accuracy of the debugging cannot be guaranteed.
    3   Produces additional debugging information for debugging macros.

-gcc
    For C/C++ only. Define the __GNUC__ and other predefined preprocessor macros.

-gnuN
    For C/C++ only. Enables the compiler to generate code compatible with the GNU N series of compilers, where N is either 3 or 4. On systems whose system compiler is GCC 3, the default is -gnu3; on GCC 4 systems, the default is -gnu4. Use -show-defaults to display the default.

-GRA:home=(ON|OFF)
    Turn off the rematerialization optimization for non-local user variables in the Global Register Allocator. Default is ON.

-GRA:optimize_boundary=(ON|OFF)
    Allow the Global Register Allocator to allocate the same register to different variables in th
282. of the common block. This may involve adding padding between members and/or breaking a common block into a collection of blocks. Default is OFF. This option should not be used unless the common block definitions (including EQUIVALENCE) are consistent among all sources making up a program. In addition, pad_common=ON should not be specified if common blocks are initialized with DATA statements. If specified, pad_common=ON must be used for all of the source files in the program.

-OPT:recip=(ON|OFF)
    This option specifies that faster, but potentially less accurate, reciprocal operations should be performed. Default is OFF.

-OPT:reorg_common=(ON|OFF)
    This option reorganizes common blocks to improve the cache behavior of accesses to members of the common block. The reorganization is done only if the compiler detects that it is safe to do so. reorg_common=ON is enabled when -O3 is in effect and when all of the files that reference the common block are compiled at -O3. reorg_common=OFF is set when the file that contains the common block is compiled at -O2 or below.

-OPT:roundoff=(0|1|2|3) or -OPT:ro=(0|1|2|3)
    Specify the level of acceptable departure from source language floating-point round-off and overflow semantics. The options can be one of the following:
    0   Inhibit optimizations that might affect the floating-point behavior. This is the default when optimization levels -O0, -O1, and -O2 are in effect.
    1   Allow simpl
283. offs with execution time. If no value is specified, 2 is assumed.

-objectlist
    Read the following file to get a list of files to be linked.

-Ofast
    Equivalent to -O3 -ipa -OPT:Ofast -fno-math-errno -ffast-math. Use optimizations selected to maximize performance. Although the optimizations are generally safe, they may affect floating-point accuracy due to rearrangement of computations.

    NOTE: -Ofast enables -ipa (inter-procedural analysis), which places limitations on how libraries and .o files are built.

-openmp
    Interpret OpenMP directives to explicitly parallelize regions of code for execution by multiple threads on a multi-processor system. Most OpenMP 2.0 directives are supported by pathf95, pathcc, and pathCC. See the QLogic PathScale Compiler Suite User Guide for more information on these directives.

-OPT:...
    This option group controls miscellaneous optimizations. These options override defaults based on the main optimization level.

-OPT:alias=name
    Specify the pointer aliasing model to be used. By specifying one or more of the following for name, the compiler is able to make assumptions throughout the compilation:
    typed   Assume that the code adheres to the ANSI/ISO C standard, which states that two pointers of different types cannot point to the same location in memory. This is ON by default when -OPT:Ofast is specified.
    restr
284. oked with -r /usr/bar -I /usr/foo.

The format of the compiler defaults file is simple. Each line can contain compiler options, separated by white space, followed by an optional comment. A comment begins with the # character and ends at the end of a line. Empty lines and lines containing only comments are skipped.

Here is an example defaults file:

    # PathScale compiler defaults file.
    #
    # Set default CPU type to optimize for, since all of our
    # systems use the same CPUs.
    -march=opteron
    # We have a recent Opteron CPU stepping, so it's safe to
    # always use SSE3.
    -msse3
    # Ensure that the FFTW library is available to users, so
    # they don't need to remember where it's installed.
    -L/share/fftw3/lib -I/share/fftw3/include
    # Use the GCC 4.x front end by default.
    -gnu4

The environment variable PSC_COMPILER_DEFAULTS_PATH, if set, specifies a PATH or a colon-separated list of PATHs designating where the compiler is to look for the compiler defaults file. If the environment variable is set, the PATH /opt/pathscale/etc will not be used. If the file cannot be found, then no defaults file will be used, even if one is present in /opt/pathscale/etc.

For more details, see the compiler.defaults man page.

2.3.1 Target Options for This Release

These options, related to ABI, ISA, and processor target, are support
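The defaults-file format described above is simple enough to model in a few lines. The sketch below is an illustration, not the compiler's own parser: the function name parse_defaults is hypothetical, and the comment character is assumed to be "#" because the character itself did not survive in the extracted text.

```python
def parse_defaults(text, comment_char="#"):
    """Return the compiler options found in the body of a defaults file.

    Each line may hold options separated by white space, followed by an
    optional comment. Empty and comment-only lines contribute nothing.
    comment_char is an assumption (see the lead-in).
    """
    options = []
    for line in text.splitlines():
        code = line.split(comment_char, 1)[0]  # drop the trailing comment
        options.extend(code.split())           # white-space separated options
    return options

EXAMPLE = """\
# PathScale compiler defaults file.
-march=opteron
-msse3            # recent Opteron stepping

-L/share/fftw3/lib -I/share/fftw3/include
-gnu4
"""

if __name__ == "__main__":
    print(parse_defaults(EXAMPLE))
```

The sketch deliberately ignores the PSC_COMPILER_DEFAULTS_PATH lookup; it only covers how one file's lines turn into a flat option list.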
285. ole program optimization, the compiler can collect information over the entire program, so it can make better decisions on whether it is safe to perform various optimizations. Thus, the same optimization performed under whole program compilation will become much more effective. In addition, more types of optimization can be performed under whole program compilation than under separate compilation.

This section presents the compilation model that enables whole program optimization in the PathScale compiler and how it relates to the -ipa flag that invokes it at the user level. Various analyses and optimizations performed by IPA are described. How IPA improves the quality of the backend optimization is also explained. Various IPA-related flags that can be used to tune for program performance are presented and described. Finally, we have an example of the difference that IPA makes in the performance of the SPEC CPU2000 benchmark suite.

The IPA Compilation Model

Inter-procedural compilation is the mechanism that enables whole program compilation in the PathScale compiler. The mechanism requires a different compilation model than separate compilation. This new mode of compilation is used when the -ipa flag is specified.

Whole program compilation requires the entire program to be presented to the compiler for analysis and optimization. This is possible only after a link step is applied. Ordinarily, the link step is applied to .o files, after all opti
286. ompiler.

-inline
    Request inline processing.

-INLINE:...
    Specify options for subprogram inlining.
    may not always compile
    With the exception of -INLINE:=OFF, any use of this option implies -inline. If you have included inlining directives in your source code, the -INLINE option must be specified in order for those directives to be honored.

-INLINE:aggressive=(ON|OFF)
    Tell the compiler to be more aggressive about inlining. The default is -INLINE:aggressive=OFF.

-INLINE:list=(ON|OFF)
    Tell the inliner to list inlining actions as they occur to stderr. The default is -INLINE:list=OFF.

-INLINE:preempt
    Perform inlining of functions marked preemptible in the light-weight inliner. Default is OFF. This inlining prevents another definition of such a function, in another DSO, from preempting the definition of the function being inlined.

-ipa
    Invoke inter-procedural analysis (IPA). Specifying this option is identical to specifying -IPA or -IPA:. Default settings for the individual IPA suboptions are used.

-IPA:...
    The inter-procedural analyzer option group controls application of inter-procedural analysis and optimization, including inlining, constant propagation, common block array padding, dead function elimination, alias analysis, and others. Specify -IPA by itself to invoke the inter-procedural analysis phase with default options. If you compile and link in di
287. on and tunes the system for the best results. See http://www.spec.org for more information.

SSE3
    Instruction set extension to Intel's IA-32 and IA-64 architecture to speed processing. These new instructions are supposed to enable and improve hyperthreading rather than floating-point operations.

TLB
    Translation Look-aside Buffer.

vectorization
    An optimization technique that works on multiple pieces of data at once. For example, the PathScale Compiler Suite will turn a loop computing the mathematical function sin() into a call to the vsin() function, which is twice as fast.

WHIRL
    The intermediate language (IR) used by compilers, allowing the C, C++, and Fortran front-ends to share a common backend. It was developed at Silicon Graphics Inc. and is used by the Open64 compilers.

x86_64
    The Linux 64-bit application binary interface (ABI).
288. or .o files compiled without -ipa, but not both.

Note that in a non-IPA compile, most of the time is incurred with compiling all the files to create the object files, and the link step is quite fast. In an IPA compile, the creation of .o files is very fast, but the link step can take a long time. The total compile time can be considerably longer with IPA than without.

When invoking the final link phase with -ipa (for example, pathcc -ipa -o foo *.o), significant portions of this process can be done in parallel on a system with multiple processing units. To use this feature of the compiler, use the -IPA:max_jobs flag.

Here are the options for the -IPA:max_jobs flag:

-IPA:max_jobs=N
    This option limits the maximum parallelism when invoking the compiler after IPA to at most N compilations running at once. The option can take the following values:
    0    The parallelism chosen is equal to either the number of CPUs, the number of cores, or the number of hyperthreading units in the compiling system, whichever is greatest.
    1    Disable parallelization during compilation (default).
    >1   Specifically set the degree of parallelism.

7.3.9 Size and Correctness Limitations to IPA

IPA often works well on programs up to 100,000 lines, but is not recommended for use in larger programs in this release.

7.4 Loop Nest Optimization (LNO)
289. or if you do something like p 7 8 8 to align a pointer.

Directives within a program unit apply only to that program unit, reverting to the default values at the end of the program unit. Directives that occur outside of a program unit alter the default value, and therefore apply to the rest of the file from that point on, until overridden by a subsequent directive.

Directives within a file override the command-line options by default. To have the command-line options override directives, use the command-line option -LNO:ignore_pragmas.

For the 3.0 release, the PathScale Compiler Suite supports the following prefetch directives.

F77 or F90 Prefetch Directives

C*$* PREFETCH(N [,N])
    Specify prefetching for each level of the cache. The scope is the entire function containing the directive. N can be one of the following values:
    0   Prefetching off (the default)
    1   Prefetching on, but conservative
    2   Prefetching on, and aggressive (the default when prefetch is on)

C*$* PREFETCH_MANUAL(N)
    Specify if manual prefetches (through directives) should be respected or ignored. Scope: entire function containing the directive. N can be one of the following values:
    0   Ignore manual prefetches
    1   Respect manual prefetches

C*$* PREFETCH_REF_DISABLE=A [, size=num]
    This directive explicitly disables prefetching all references to array A in t
290. or the processor. The affected math functions include log, exp, sin, cos, sincos, expf, and pow. The default setting is OFF. It is turned on automatically when -OPT:roundoff is at 2 or above.

-OPT:fast_nint
    This option uses hardware features to implement NINT and ANINT (both single- and double-precision versions). Default is OFF, but fast_nint=ON is enabled by default if -OPT:roundoff=3 is in effect.

-OPT:fast_sqrt=(ON|OFF)
    This option calculates square roots using the identity sqrt(x) = x * rsqrt(x), where rsqrt is the reciprocal square root operation. This transformation generates fairly accurate code. Default is OFF. Note that in order for -OPT:fast_sqrt=ON to take effect, -OPT:fast_exp must be ON, which tells the compiler to emit inlined instructions instead of calling the library pow function. Also note that -OPT:fast_sqrt is independent of -OPT:rsqrt, which transforms 1/sqrt(x) to rsqrt(x). Unlike -OPT:rsqrt, the compiler does not generate extra code to refine the rsqrt result for -OPT:fast_sqrt.

-OPT:fast_stdlib=(ON|OFF)
    This option controls the generation of calls to faster versions of some standard library functions. Default is ON.

-OPT:fast_trunc=(ON|OFF)
    This option inlines the NINT, ANINT, and AMOD Fortran intrinsics, both single- and double-precision versions. Default is OFF. fast_trunc is enabled automatically if -OPT:roundoff=1 or greater is in effect.

-OPT:fold_reassociate
    This option al
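The identity behind -OPT:fast_sqrt, and the effect of the refinement step that -OPT:rsqrt adds, can be seen numerically. The following sketch is an illustration under stated assumptions: approx_rsqrt is a hypothetical stand-in for a hardware reciprocal-square-root estimate, modelled here by truncating the exact value to roughly 12 mantissa bits, and refine is one textbook Newton-Raphson step, not necessarily the sequence the compiler emits. Inputs are assumed to be positive.

```python
import math

def approx_rsqrt(x, bits=12):
    """Low-precision estimate of 1/sqrt(x), accurate to roughly `bits` bits."""
    mantissa, exponent = math.frexp(1.0 / math.sqrt(x))
    scale = 1 << bits
    return math.ldexp(math.floor(mantissa * scale) / scale, exponent)

def fast_sqrt(x):
    """sqrt(x) through the identity sqrt(x) = x * rsqrt(x), with no refinement."""
    return x * approx_rsqrt(x)

def refine(x, y):
    """One Newton-Raphson step improving y as an estimate of 1/sqrt(x)."""
    return y * (1.5 - 0.5 * x * y * y)

def relative_error(approx, exact):
    return abs(approx - exact) / exact

if __name__ == "__main__":
    x = 2.0
    print(relative_error(fast_sqrt(x), math.sqrt(x)))
    print(relative_error(x * refine(x, approx_rsqrt(x)), math.sqrt(x)))
```

The unrefined product inherits the error of the estimate, while a single Newton step roughly squares that error, which is why the two options trade accuracy against extra instructions.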
291. ot suppress the invocation of the global optimizer, though the invoked backend phases will honor the specified optimization level.

- Apart from the optimization level flags, only flags belonging to the following option groups are processed: -LNO, -OPT, and -WOPT.

Code Layout Optimization Using Pragmas

This pragma is applicable to C/C++. The user can provide a hint to the compiler regarding which branch of an IF statement is more likely to be executed at runtime. This hint allows the compiler to optimize code generated for the different branches. The directive is of the form:

    #pragma frequency_hint <hint>

where <hint> is a choice from:

- never: The branch is rarely or never executed.
- init: The branch is executed only during initialization.
- frequent: The branch is executed frequently.

The branch of the IF statement that contains the pragma will be affected.

4.2.3 Mixing Code

If you have a large application that mixes Fortran code with code written in other languages, and the main entry point to your application is from C or C++, you can optionally use pathcc or pathCC to link the application, instead of pathf95. If you do, you must manually add the Fortran runtime libraries to the link line. See section 3.5 for details.

4.2.4 Linking

To link object files that were generated with pathCC using pathcc or pathf95
292. ould normally use the simpler flag set for production:

    -Ofast -CG:prefetch=off -CG:load_exe=0

which can be shortened to:

    -Ofast -CG:prefetch=off:load_exe=0

7.10 How Did the Compiler Optimize My Code?

Often you may want to know what the compiler did to optimize your code. There are several ways to generate a listing showing, by line number, what the compiler did to optimize a subroutine. Choose the one that seems most useful to you.

7.10.1 Using the -S flag

The -S flag can be a useful way to see what the compiler did, especially if you understand some assembly, but it is useful even if you don't.

Here is an example using the STREAM benchmark. First we compile STREAM with the -S flag:

    pathcc -O3 stream_d.c -S

This produces a stream_d.s assembly file. In this file you can see sections of human-readable comments interspersed with sections of assembly code that look something like this:

    <loop> Loop body line 118, nesting depth: 1, iterations: 250000
    <loop> unrolled 4 times
    <sched>
    <sched> Loop schedule length: 13 cycles (ignoring nested loops)
    <sched>
    <sched>    4 flops         (15% of peak)
    <sched>    8 mem refs      (30% of peak)
    <sched>    3 integer ops   (11% of peak)
    <sched>   15 instructions  (28% of peak)
    <sched>
    <freq> BB:60 frequen
293. out the OPTIONAL attribute must match in type and rank exactly.

The explain command can also be used with iostat= error numbers. When the iostat= specifier in a Fortran I/O statement provides an error number such as 4198, or when the program prints out such an error number during execution, you can look up its meaning using the explain command by prefixing the number with lib-, as in explain lib-4198.

For example:

    explain lib-4098
    A BACKSPACE is invalid on a piped file.
    A Fortran BACKSPACE statement was attempted on a named or unnamed
    pipe (FIFO file) that does not support backspace. Either remove the
    BACKSPACE statement or change the file so that it is not a pipe.
    See the man pages for pipe(2), read(2), and write(2).

3.4.6 Fortran 90 Dope Vector

Modern Fortran provides constructs that permit the program to obtain information about the characteristics of dynamically allocated objects, such as the size of arrays and character strings. Examples of the language constructs that return this information include the ubound and the size intrinsics.

To implement these constructs, the compiler may maintain information about the object in a data structure called a dope vector. If there is a need to understand this data structure in detail, it can be found in the source distribution in the file clibinc/cray/dopevec.h.

See Appendix D for an exampl
294. output file names to stderr. If ON or OFF is not specified, the default is ON.
-colN
    (Fortran only) Specify the line width for fixed-format source lines. Specify 72, 80, or 120 for N (-col72, -col80, or -col120). By default, fixed-format lines are 72 characters wide. Specifying -col120 implies -extend-source and recognizes lines up to 132 characters wide. For more information on specifying line length, see the -extend-source and -noextend-source options.
-copyright
    Show the copyright for the compiler being used.
-cpp
    Run the preprocessor, cpp, on all input source files, regardless of suffix, before compiling. This preprocessor automatically expands macros outside of preprocessor statements. The default is to run the C preprocessor (cpp) if the input file ends in a .F or .F90 suffix. For more information on controlling preprocessing, see the -ftpp, -E, and -nocpp options. For information on enabling macro expansion, see the -macro-expand option. By default, no preprocessing is performed on files that end in a .f or .f90 suffix.
-d-lines
    (Fortran only) Compile lines with a D in column 1.
-Dvar=[def][,var=[def]...]
    Define variables used for source preprocessing as if they had been defined by a #define directive. If no def is specified, 1 is used. For information on undefining variables, see the -Uvar option.
-default64
    (Fortran only) Set the sizes of default integer
295. owing tasks:
1. Invokes the IPA linker.
2. Performs inter-procedural analysis and optimization on the linked program.
3. Invokes the backend phases to optimize and generate the object code.
4. Invokes the real linker to produce the final executable.
Under IPA compilation, the user will notice that the compilation of separate files proceeds very fast, because it does not involve the backend phases. On the other hand, the linking phase will appear much slower, because it now encompasses the compilation and optimization of the entire program.
7.3.2 Inter-procedural Analysis and Optimization
We call the phase that operates on the IR of the linked program IPA, for Inter-Procedural Analysis, but its tasks can be divided into two categories:
- Analysis to collect information over the entire program
- Optimization to transform the program so it can run faster
7.3.2.1 Analysis
IPA first constructs the program call graph. Each node in the call graph corresponds to a function in the program. The call graph represents the caller-callee relationship in the program. Once the call graph is built, based on different inlining heuristics, IPA prepares a list of function calls where it wants to inline the callee into the caller. Based on the call graph, IPA computes the mod-ref information for the program variables. This represents the information as to whether a variable is modified or referenced inside a function call. IPA also computes alias informa
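The mod-ref computation over a call graph can be illustrated with a small sketch in C. This is not the compiler's implementation; the names (sketch_call_graph, sketch_propagate_mod) and the fixed-size adjacency matrix are assumptions made for illustration. It tracks a single variable and propagates "may modify" from callees to callers until nothing changes.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_MAX_FUNCS 8

/* Hypothetical call graph for one program and one tracked variable. */
struct sketch_call_graph {
    int nfuncs;
    /* calls[f][g] is true when function f contains a call to function g. */
    bool calls[SKETCH_MAX_FUNCS][SKETCH_MAX_FUNCS];
    /* On entry: mods[f] is true when f writes the variable directly.
       On exit:  mods[f] is true when f or anything it calls may write it. */
    bool mods[SKETCH_MAX_FUNCS];
};

/* Propagate the "modifies" bit from callees to callers until a fixed point. */
void sketch_propagate_mod(struct sketch_call_graph *g)
{
    assert(g->nfuncs >= 0 && g->nfuncs <= SKETCH_MAX_FUNCS);
    bool changed = true;
    while (changed) {
        changed = false;
        for (int f = 0; f < g->nfuncs; f++) {
            for (int c = 0; c < g->nfuncs; c++) {
                if (g->calls[f][c] && g->mods[c] && !g->mods[f]) {
                    g->mods[f] = true;
                    changed = true;
                }
            }
        }
    }
}
```

With a chain in which function 0 calls 1, 1 calls 2, and only 2 writes the variable, the propagation marks 0, 1, and 2 as possibly modifying it, while an unrelated function stays unmarked. A caller can then keep the variable in a register across a call to the unmarked function.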
296. p
    pathf95 -c omphello.f
Link the object file and create an output file:
    pathf95 omphello.o -o omphello.out
Run the program, and the output looks like this:
    ./omphello.out
    Hello World from thread 0
    Number of threads 1
For more examples using OpenMP, please see the sample code at http://www.openmp.org/drupal/node/view/14. There are also examples of OpenMP code in Appendix A of the OpenMP 2.0 Fortran specification. See section 8.15 for more details.
8.13 Example OpenMP Code in C/C++
The following program is a parallel version of hello world, written using OpenMP directives. When run, it spawns multiple threads. It uses the CRITICAL directive to ensure that the printing from the various threads will not overwrite one another. Here is the program omphello.c:
    #include <stdio.h>
    #include <omp.h>
    int main(void)
    {
        int tid = 0;
        int nthreads = 1;
        /* Fork a team of threads giving them their own copies of variable tid */
        #pragma omp parallel private(tid)
        {
    #ifdef _OPENMP
            /* Obtain and print thread id */
            tid = omp_get_thread_num();
    #endif
            #pragma omp critical
            printf("Hello World from thread %d\n", tid);
            #pragma omp master
            #pragma omp critical
            {
    #ifdef _OPENMP
                /* Only master thread does this */
                nthreads = omp_get_num_threads();
    #endif
                printf("Number of threads %d\n", nthreads);
            }
        } /* All threads join mast
297. p nests so that cache reuse can be optimized and memory accesses reduced. This whole LNO feature is on by default, but can be turned off with -LNO:blocking=off. -LNO:blocking_size=n specifies a block size that the compiler must use when performing any blocking, where n is a positive integer that represents the number of iterations.
-LNO:interchange is on by default, but setting this =0 can disable the loop interchange transformation in the loop nest optimizer.
The -LNO: group controls outer loop unrolling, but the -OPT: group controls inner loop unrolling. Here are the major -LNO: flags to control loop unrolling:
-LNO:outer_unroll_max,ou_max=n specifies that the compiler may unroll outer loops in a loop nest by up to n per loop, but no more. The default is 10.
-LNO:ou_prod_max=n indicates that the product of unrolling levels of the outer loops in a given loop nest is not to exceed n, where n is a positive integer. The default is 16.
To be more specific about how much unrolling is to be done, use -LNO:outer_unroll,ou=n. This indicates that exactly n outer loop iterations should be unrolled if unrolling is legal. For loops where outer unrolling would cause problems, unrolling is not performed.
7.4.4 Prefetch
The LNO group can provide guidance to the compiler about the level and type of prefetching to enable. General guidance on how aggressively to prefetch is specified by -LNO:prefetch=n, where n=1 is the default level. n=0 disable
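As an illustration of the cache blocking transformation discussed in this fragment, the following C sketch applies blocking by hand to a matrix transpose. It is only meant to show the shape of the rewritten loop nest; the function names are hypothetical, and the code the loop nest optimizer actually generates will differ.

```c
#include <assert.h>
#include <stddef.h>

/* Straightforward transpose of an n x n row-major matrix. */
void transpose_plain(size_t n, const double *src, double *dst)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            dst[j * n + i] = src[i * n + j];
}

/* Same result, computed one block x block tile at a time so that the
   rows of src and dst touched by a tile can stay in cache while in use. */
void transpose_blocked(size_t n, size_t block, const double *src, double *dst)
{
    assert(block > 0);
    for (size_t ii = 0; ii < n; ii += block) {
        for (size_t jj = 0; jj < n; jj += block) {
            /* Clamp the tile at the matrix edge when n is not a multiple of block. */
            size_t i_end = (ii + block < n) ? ii + block : n;
            size_t j_end = (jj + block < n) ? jj + block : n;
            for (size_t i = ii; i < i_end; i++)
                for (size_t j = jj; j < j_end; j++)
                    dst[j * n + i] = src[i * n + j];
        }
    }
}
```

The block parameter plays the role that the blocking size plays for the compiler: it is the number of iterations of each loop handled per tile. Both functions write identical results, which is what makes the transformation legal.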
298. pathbug tool debugging with 10 1 1 02404 13 pathCC 2 1 4 2 pathcc 2 1 4 2 pathcov 2 11 2 12 pathdb 2 1 2 11 3 25 10 1 pathf95 2 1 3 1 pathhow compiled 2 5 pathopt2 7 26 8 27 pathopt2 xml 7 27 pathprof 2 11 2 12 pathprof command 9 2 Peeling 7 15 POSIX threads library 8 21 Pragma 4 4 options 4 5 pragma pack 4 5 Prefetch 7 16 Prefetch directives C PREFETCH 3 6 C PREFETCH_MANUAL 3 7 C PREFETCH_RE 3 7 C PREFETCH_REF_DISABLE 3 7 Preprocessing options 3 8 pre defined macros 3 10 4 3 Preprocessor C 2 6 4 3 Fortran 2 6 3 8 PRNG Pseudo random number generator 3 13 Process affinity 2 12 7 25 Processor target 2 4 pthread 8 18 8 20 pthreads 8 21 R RAND 5 2 REAL 3 22 RES 8 23 Roundoff error 7 22 RSS 8 23 Index 7 QLogic PathScale Compiler Suite User Guide 3 0 Beta 1 QLOGIC S X86 64 ABI 3 1 ORF x86_64 ABI 3 22 4 1 sched_setaffinity 7 26 x86_64 platform configuration 7 24 schedutils 2 12 7 26 Separate compilation 7 3 SIMD 8 28 sin 7 17 SIZE 8 18 8 23 Static data 2 9 Static scheduling 8 20 Statically allocated data 10 2 STREAM benchmark example 7 42 STREAM benchmark tool 7 26 STREAM with OpenMP 8 28 Striding factor 8 16 Sub options multiple 7 2 Summary table pathopt2 7 28 Symmetric multiprocessing SMP 7 25 T taskset 7 26 Thread assignments 8 13 Threads mapping to CPUs 8 14 Tiling 7 16 time tool 2 11 TRADITIONAL intrinsics family C 2 Tran
299. pathscale/share/pathopt2/pathopt2.xml. See section 7.9.3 for details on this file format. Sample programs are found in /opt/pathscale/share/pathopt2/examples.
In the following sections we review the command syntax, the option configuration file structure, and general usage information. Step-by-step examples show how to use the different features of pathopt2.
7.9.1 A Simple Example
An example is provided here to show basic usage of pathopt2. In this example, you will copy a test program into your working directory and then run pathopt2 with the options file and the test program.
Copy the program factorial.c from /opt/pathscale/share/pathopt2/examples into your own working directory. factorial.c is a program that calculates a table of 50,000 factorials, from 1 to 50000.
You can now run this simple example by typing:
    pathopt2 -f pathopt2.xml -t try5 -r factorial pathcc -o factorial factorial.c
NOTE: If you do not have . set in your PATH, you need to use ./factorial to run this command from the current working directory. The PATH for the program pathopt2 is the same as for pathcc, etc., and should already be set correctly. See the QLogic PathScale Compiler Suite and Subscription Manager Install Guide for general information on setting your PATH.
You should see a list of output summarizing
300. piler before 2.1 contained a bug that would truncate static data structures whose size exceeded four gigabytes. This sometimes caused a compilation error or generation of binaries that would crash or corrupt data at runtime. This bug has been fixed in the 2.1 release.
Using -ipa and -Ofast
When compiling with -ipa, the .o files that are created are not regular .o files. IPA uses the .o files in its analysis of your program, and then does a second compilation using that information.
NOTE: When you are using -ipa, all the .o files have to have been compiled with -ipa for your compilation to be successful.
NOTE: Each archive, for example libfoo.a, must contain either .o files compiled with -ipa or .o files compiled without -ipa, but not both.
The requirement of -ipa may mean modifying Makefiles. If your Makefiles build libraries, and you wish this code to be built with -ipa, you will need to split these libraries into separate .o files before linking.
By default, -ipa is turned on when you use -Ofast, so the caveats above apply to using -Ofast as well.
10.9 Tuning
Our compilers often optimize loops by eliminating the loop variable and instead using a quantity related to the loop variable, called an induction variable. If the induction variable overflows, the loop test will be incorrectly evaluated. This is a very rare circumstance. To see if this is causing your code to fail under optimization, try -OPT:wrap_around_u
301. plates.
-W[no-]non-virtual-dtor
    For C++ only. -Wnon-virtual-dtor will warn when a class declares a dtor (destructor) that should be virtual. -Wno-non-virtual-dtor tells the compiler not to warn when a class declares a dtor that should be virtual.
-W[no-]old-style-cast
    For C/C++ only. -Wold-style-cast will warn when a C-style cast to a non-void type is used. -Wno-old-style-cast tells the compiler not to warn when a C-style cast to a non-void type is used.
-WOPT:...
    Specifies options that affect the global optimizer. The options are enabled at -O2 or above.
-WOPT:aggstr=N
    This controls the aggressiveness of the strength reduction optimization performed by the scalar optimizer, in which induction expressions within a loop are replaced by temporaries that are incremented together with the loop variable. When strength reduction is overdone, the additional temporaries increase register pressure, resulting in excessive register spills that decrease performance. The value specified must be a positive integer value, which specifies the maximum number of induction expressions that will be strength-reduced across an index variable increment. When set at 0, strength reduction is only performed for non-trivial induction expressions. The default is 11.
-WOPT:const_pre=(ON|OFF)
    When OFF, disables the placement optimization for loading constants to registers. Default is ON.
-WOPT:if_conv=(0|1|2)
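The strength reduction described under the aggstr option can be pictured with a hand-written before-and-after pair in C. This is a sketch of the idea only, with hypothetical function names; the scalar optimizer works on its own intermediate representation, not on source code.

```c
#include <assert.h>
#include <stddef.h>

/* As written: the induction expression i * stride needs a multiply
   on every iteration. */
long sum_strided_plain(const long *a, size_t count, size_t stride)
{
    assert(a != NULL);
    long sum = 0;
    for (size_t i = 0; i < count; i++)
        sum += a[i * stride];
    return sum;
}

/* After strength reduction by hand: a temporary tracks i * stride and is
   incremented together with the loop variable, replacing the multiply
   with an add. The temporary occupies one more register. */
long sum_strided_reduced(const long *a, size_t count, size_t stride)
{
    assert(a != NULL);
    long sum = 0;
    size_t offset = 0;              /* always equal to i * stride */
    for (size_t i = 0; i < count; i++) {
        sum += a[offset];
        offset += stride;
    }
    return sum;
}
```

Each strength-reduced expression adds a temporary like offset. With many such expressions in one loop, the temporaries compete for registers, which is the register pressure that the option's limit is meant to bound.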
302. ption handling. This is the default. -fno-exceptions disables exception handling. This option has a subset of the effects of -fno-gnu-exceptions. Hence, it can be used on some C++ applications on which -fno-gnu-exceptions cannot be applied.
-f[no-]fast-math
    -ffast-math improves FP speed by relaxing ANSI & IEEE rules. -fno-fast-math tells the compiler to conform to ANSI and IEEE math rules at the expense of speed.
-f[no-]gnu-keywords
    For C/C++ only. Recognize typeof as a keyword. If -fno-gnu-keywords is used, do not recognize typeof as a keyword.
-fno-ident
    Ignore #ident directives.
-fno-math-errno
    Do not set ERRNO after calling math functions that are executed with a single instruction, e.g. sqrt(). A program that relies on IEEE exceptions for math error handling may want to use this flag for speed while maintaining IEEE arithmetic compatibility. This is implied by -Ofast. The default is -fmath-errno.
-f[no-]signed-char
    For C/C++ only. -fsigned-char makes char signed by default. -fno-signed-char makes char unsigned by default.
-fpack-struct
    For C/C++ only. Pack structure members together without holes.
-f[no-]permissive
    -fpermissive will downgrade messages about non-conformant code to warnings. -fno-permissive keeps messages about non-conformant code as errors.
-f[no-]preprocessed
    -fpreprocessed tells the preprocessor that input has already been preprocessed. Using -fno-preprocessed te
303. pure ANSI/ISO C mode.
-apo
    This auto-parallelizing option signals the compiler to automatically convert sequential code into parallel code when it is safe and beneficial to do so. The resulting executable can then run faster on a machine with more than one CPU.
-ar
    Create an archive using ar(1) instead of a shared object or executable. The name of the archive is specified by using the -o option. Template entities required by the objects being archived are instantiated before creating the archive. The pathCC command implicitly passes the -r and -c options of ar to ar, in addition to the name of the archive and the objects being created. Any other option that can be used in conjunction with the -c option of ar can be passed to ar using -WR,option_name.
    NOTE: The objects specified with this option must include all of the objects that will be included in the archive. Failure to do so may cause prelinker internal errors.
    In the following example, liba.a is an archive containing only a.o, b.o, and c.o. The a.o, b.o, and c.o objects are prelinked to instantiate any required template entities, and the ar -r -c -v liba.a a.o b.o c.o command is executed. All three objects must be specified with -ar, even if only b.o needs to be replaced in liba.a.
        pathCC -ar -WR,-v -o liba.a a.o b.o c.o
    See the ld(1) man page for more information about shared libraries and archives.
-auto-use module_name[,module_name]...
    For Fortran. Direct the compile
304. r C++, you can optionally use pathcc or pathCC to link the application instead of pathf95. If you do, you must manually add the Fortran runtime libraries to the link line. As an example, you might do something like this:
    pathCC -o my_big_app file1.o file2.o -lpathfortran
3.5.1 Calls between C and Fortran
In calls between C and Fortran, the two issues are:
- Mapping Fortran procedure names onto C function names, and
- Matching argument types
Normally a pathf90 procedure name x not containing an underscore creates a linker symbol x_, and a pathf90 name x_y containing an underscore creates a linker symbol x_y__ (note the second underscore). A pathcc function name, by contrast, does not append any underscores when creating a linker symbol. You can write your C code to conform to this (use x_ in C so that it will match Fortran's x). Or you can use the -fdecorate option (described in man pathf90) to provide a mapping from each Fortran name onto some possibly quite different linker symbol. Or you can use the -fno-underscoring option, but in many cases that will create symbols that conflict with those in the Fortran and C runtime libraries, so it is not the preferred choice.
Normally pathf90 passes arguments by reference, so C needs to use pointers in order to interoperate with Fortran. In many cases you can use the %val() intrinsic function in Fortran to pass an argument by value. The programmer must be careful to match
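The naming and argument-passing rules above can be sketched from the C side as follows. The routine names add_ints and scale_vec are hypothetical, and the sketch assumes that the Fortran default INTEGER matches C int and that DOUBLE PRECISION matches C double, which should be checked for the compiler flags in use.

```c
#include <assert.h>
#include <stddef.h>

/* Callable from Fortran as:  call add_ints(a, b, total)
   The Fortran name has no underscore, so the linker symbol gets one
   trailing underscore. All arguments arrive by reference. */
void add_ints_(const int *a, const int *b, int *total)
{
    assert(a != NULL && b != NULL && total != NULL);
    *total = *a + *b;
}

/* Callable from Fortran as:  call scale_vec(v, n, factor)
   The Fortran name contains an underscore, so the linker symbol gets
   two trailing underscores. Even the scalar n is passed as a pointer. */
void scale_vec__(double *v, const int *n, const double *factor)
{
    assert(v != NULL && n != NULL && factor != NULL);
    for (int i = 0; i < *n; i++)
        v[i] *= *factor;
}
```

Writing the underscores into the C names is the first of the three approaches listed above; it needs no extra compiler options, at the cost of C identifiers that look slightly unusual.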
305. r pathopt2
    cd pathopt2
    mkdir logs
logs is where we will keep a copy of the last run of the ft.A executable.
Copy the two scripts psc-build and psc-test from /opt/pathscale/share/pathopt2/examples into the pathopt2 directory. The scripts are shown below.
For psc-build:
    #!/bin/sh
    cd ..
    make clean
    code=$1
    size=$2
    shift 2
    make $code CLASS=$size FFLAGS="$*"
    cd pathopt2
For psc-test:
    #!/bin/sh
    ../bin/ft.A > logs/ft.A.txt
Make the files executable and then run pathopt2:
    chmod +x psc*
    pathopt2 -t try5 -r ./psc-test ./psc-build ft A
Note that the first argument to the psc-build script is the name of the code, the second argument is the problem size, and all remaining arguments are the optimization options. This matches the code in the psc-build script that interprets the arguments. The output will be similar to the following:
    Sorted summary from all runs:
    Flags            Build  Test   Real    User    System
    -O3 -ipa         PASS   PASS   12.67   12.23   0.44
    -Ofast           PASS   PASS   12.68   12.27   0.40
    -O3 -OPT:Ofast   PASS   PASS   12.83   12.39   0.44
    -O3              PASS   PASS   13.86   12.46   0.40
    -O2              PASS   PASS   14.53   14.14   0.39
It is useful to check the output in logs/ft.A.txt:
    FT Benchmark completed
    Class = A
    Size = 256x256x128
    Iterations = 6
    Time in seconds = 10.78
    Mop/s total = 662.05
    Operation type = floating point
    V
306. r to behave as if a USE module_name statement were entered in your Fortran source code for each module_name. The USE statements are entered in every program unit and interface body in the source file being compiled, for example: pathf95 -auto-use mpi_interface or pathf95 -auto-use shmem_interface. Using this option can add compiler time in some situations.
-backslash
    Treat a backslash as a normal character rather than as an escape character. When this option is used, the preprocessor will not be called.
-C
    (For Fortran) Perform runtime subscript range checking. Subscripts that are out of range cause fatal runtime errors. If you set the F90_BOUNDS_CHECK_ABORT environment variable to YES, the program aborts.
-C
    (For C) Keep comments after preprocessing.
-c
    Create an intermediate object file for each named source file, but do not link the object files. The intermediate object file name corresponds to the name of the source file; a .o suffix is substituted for the suffix of the source file. Because they are mutually exclusive, do not specify this option with the -r option.
-CG:...
    The Code Generation option group controls the optimizations and transformations of the instruction-level code generator.
-CG:cflow=(ON|OFF)
    OFF disables control flow optimization in the code generation. Default is ON.
-CG:cse_regs=N
    When pe
307. r tries to make the best decisions regarding how to optimize a program, it is hard to make the optimal choice in general. Thus, the compiler provides many compilation options so the user can use them to tune for the peak performance of the program. This section presents the IPA-related compilation options that are useful in tuning programs.
But first, it is worthwhile to mention that IPA is one of the compilation phases that can benefit substantially from feedback compilation. In feedback compilation, a feedback data file containing a profile of a typical run of the program is presented to the compiler. This enables IPA to make better decisions regarding what functions to inline and clone. By ensuring that busy callers and callees are placed next to each other, IPA's procedure re-ordering can also be more effective. Feedback compilation is enabled by the -fb-create and -fb-opt options. See section 7.6 for more details.
7.3.4.1 Inlining
There are actually two incarnations of the inliner in the PathScale compiler, depending on whether -ipa is specified. This is because inlining is nowadays a language feature, and has to be performed independent of IPA. The inliner invoked when -ipa is not specified is the lightweight inliner, and it can only operate on a single compilation unit. The lightweight inliner does not do automatic inlining. It inlines strictly according to the C++ language requirement, C inline keyword, or any -INLINE options specified by t
308. re ANSI, EVERY, G77, PGI, OMP, and TRADITIONAL. These family names must appear in uppercase to distinguish them from the names of individual intrinsics. By default, the compiler enables either ANSI or TRADITIONAL, depending on whether you use the -ansi option. It automatically enables OMP as well if you use the -mp option.
As an example, suppose you are compiling a program that was originally developed under the GNU G77 compiler, and encounter problems because it contains subroutine names which conflict with some of the intrinsics in the TRADITIONAL family. Suppose that you have also decided that you want to use the individual intrinsic adjustl, which is not provided by G77. These options would give you the set of intrinsics you need:
    -no-intrinsic=TRADITIONAL -intrinsic=G77 -intrinsic=adjustl
C.3 Table of Supported Intrinsics
The following table lists the Fortran intrinsics supported by the PathScale Compiler Suite, along with the result, arguments, families, and characteristics for each. See the Legend for more information.
Legend
    Key to Types            Key to Characteristics
    I  Integer              E  Elemental intrinsic
    R  Real                 P  May pass intrinsic itself as an actual argument
    Z  Complex              X  Extension to the Fortran standard
    C  Character            O  Optional argument
    L  Logical
309. refix for your output file. For example, the output file from this example dataset might be named fbdata.instr0.ab342. Each file will have a unique string as part of its name so that files can't be overwritten.
To use this data in a subsequent compile:
    pathcc -O3 -ipa -fb-opt fbdata -o foo foo.c
This new executable should run faster than a non-FDO foo, and will not contain any instrumentation library calls. Experiment to see if FDO provides significant benefit for your application.
More details on feedback compilation with the PathScale compilers can be found under the -fb-create and -fb-opt options in the man page.
7.7 Aggressive Optimizations
The PathScale Compiler Suite, like all modern compilers, has a range of optimizations. Some produce identical program output to the original; some can change the program's behavior slightly. The first class of optimizations is termed safe and the second unsafe. As a general rule, our -O1, -O2, and -O3 flags only perform safe optimizations. But the use of unsafe optimizations often can produce a good speedup in a program, while producing a sufficiently accurate result.
Some unsafe optimizations may be safe depending on the coding practices used. We recommend first trying safe flags with your program, and then moving on to unsafe flags, checking for incorrect results and noting the
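One way to see why an optimization can change a program's behavior slightly is floating-point reassociation. The C sketch below is a generic illustration with hypothetical function names, not a statement about which transformation any particular flag performs. It assumes IEEE double-precision arithmetic evaluated as written.

```c
#include <assert.h>

/* Sum in the order the source code specifies: (a + b) + c. */
double sum_as_written(double a, double b, double c)
{
    assert(a == a && b == b && c == c);   /* reject NaN inputs */
    return (a + b) + c;
}

/* The same mathematical sum, reassociated as a + (b + c), as an
   optimizer that ignores rounding might choose to compute it. */
double sum_reassociated(double a, double b, double c)
{
    assert(a == a && b == b && c == c);   /* reject NaN inputs */
    return a + (b + c);
}
```

With a = 1e20, b = -1e20, and c = 1.0, the first form cancels the two large values and then adds 1.0, giving 1.0. In the second form, 1.0 is lost when added to -1e20, so the result is 0.0. For well-conditioned inputs the two forms agree to within roundoff, which is why such a transformation can be acceptable for some programs and wrong for others.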
310. rence lies in the optimization of statically allocated local variables, as described in the following paragraphs.
    With -LANG:recursive=ON, the compiler assumes that a statically allocated local variable could be referenced or modified by a recursive procedure call. Therefore, such a variable must be stored into memory before making a call and reloaded afterwards.
    With -LANG:recursive=OFF, the compiler can safely assume that a statically allocated local variable is not referenced or modified by a procedure call. This setting enables the compiler to optimize more aggressively.
-LANG:rw_const=(ON|OFF)
    Tell the compiler whether to treat a constant parameter in Fortran as read-only or read-write. If treated as read-write, the compiler has to generate extra code in passing these constant parameters so as to tolerate their being modified in the called function. The default is OFF, which is more efficient but will cause a segmentation fault if the constant parameter is written into.
-LANG:short_circuit_conditionals=(ON|OFF)
    Handle .AND. and .OR. via short-circuiting, in which the second operand is not evaluated if unnecessary, even if it contains side effects. Default is ON. This flag is applicable only to Fortran; the flag has no effect on C/C++ programs.
-LIST:...
    The listing option flag controls information that gets written to a listing (.lst) file. The individual controls in this group are
311. rface is provided.
Appendix C Supported Fortran Intrinsics
The 3.0 release of the PathScale Compiler Suite supports all of the GNU g77 intrinsics. You must use -intrinsic=PGI or -intrinsic=G77 to get new G77 intrinsics, which were added in the 2.3 release. All of the argument types for each intrinsic may not be supported in this release.
C.1 How to Use the Intrinsics Table
As an example, let's look at the intrinsic ACOS. This is what it looks like in the table:
    Intrinsic Name   Result   Arguments      Families                     Remarks
    ACOS             R*4      X: R*4, R*8    ANSI, G77, PGI, TRADITIONAL  E, P
For the intrinsic ACOS, the result is R*4, which means REAL*4 or REAL(KIND=4), and its argument X can be either R*4 (REAL*4) or R*8 (REAL*8). ACOS belongs to the ANSI, G77, PGI, and TRADITIONAL families of intrinsics (see appendix C.2 for an explanation of intrinsic families), which means the compiler will recognize it if any of those families is enabled. Under remarks, E and P are listed. E tells us that this is an elemental intrinsic, and P tells us that the intrinsic may be passed as an actual argument.
Here is a simple scalar call to intrinsic ACOS:
    print *, acos(1.0)
Because the intrinsic is elemental, you can also apply it to an array:
    print *, acos((/ 1.0, 0.707, 0.5 /))
NOT
312. rformance of the executable and tracking the results. The best options are obtained from the output of these runs, and are used to adaptively tune successive runs, yielding the best set of compiler options for a given combination of application code, data set, hardware, and environment. A sorted list of execution times is produced for each run.
The tool uses an XML option configuration file that defines one or more execution targets. Each execution target specifies options to try and indicates how they are to be combined into a series of tests. In general, using pathopt2 involves these steps:
1. Run pathopt2 using an execution target in the supplied option configuration file.
2. Interpret the results.
3. Choose a more detailed execution target based on the results from the first run, and repeat the process until the best compiler options are found.
The pathopt2 tool can be completely driven from its command line, or it can alternatively use scripts to build and test the programs. Scripts are useful for more complex runs, for interfacing to existing build and test mechanisms, and for automating the process.
For a standard installation, the program pathopt2 is located in /opt/pathscale/bin. This is the same directory that contains pathcc, pathCC, pathf95, pathf90, and so on. An option configuration file, pathopt2.xml, is provided. The default location is /opt
313. rforming common subexpression elimination during code generation, assume there are N extra integer registers available over the number provided by the CPU. N can be positive, zero, or negative. The default is positive infinity. See also -CG:sse_cse_regs.
-CG:gcm=(ON|OFF)
    Specifying OFF disables the instruction-level global code motion optimization phase. The default is ON.
-CG:load_exe=N
    Specify the threshold for subsuming a memory load operation into the operand of an arithmetic instruction. The value of 0 turns off this subsumption optimization. If N is 1, this subsumption is performed only when the result of the load has only one use. This subsumption is not performed if the number of times the result of the load is used exceeds the value N, a non-negative integer. The default value varies based on processor target and source language.
-CG:local_fwd_sched=(ON|OFF)
    Change the instruction scheduling algorithm to work forward instead of backward for the instructions in each basic block. The default is OFF for 64-bit ABI, and ON for 32-bit ABI.
-CG:movnti=N
    Convert ordinary stores to non-temporal stores when writing memory blocks of size larger than N KB. When N is set to 0, this transformation is avoided. The default value is 1000 KB.
-CG:p2align=(ON|OFF)
    Align loop heads to 64-byte boundaries. The default is OFF.
-CG:p2align_freq=N
    Align branch targets ba
314. riction. The default is 30.
-IPA:sp_partition=setting
    This option enables partitioning for disk/addressing-saving purposes. The default is OFF. Mainly used for building very large programs. Normally, partitioning would be done by IPA internally.
-IPA:space=N
    Inline until a program expansion of N% is reached. For example, -IPA:space=20 limits code expansion due to inlining to approximately 20%. Default is no limit.
-IPA:specfile=filename
    Opens filename to read additional options. The specification file contains zero or more lines with inliner options in the form expected on the command line. The specfile option cannot occur in a specification file, so specification files cannot invoke other specification files.
-IPA:use_intrinsic=(ON|OFF)
    Enable/disable loading the intrinsic version of standard library functions. The default is OFF.
-isystem dir
    Search dir for header files, after all directories specified by -I but before the standard system directories. Mark it as a system directory, so that it gets the same special treatment as is applied to the standard system directories.
-keep
    Write all intermediate compilation files. file.s contains the generated assembly language code. file.i contains the preprocessed source code. These files are retained after compilation is finished. If IPA is in effect and you want to retain file.s, you must specify -IPA:keeplight=OFF
315. rivate.
-fshort-double
    For C/C++ only. Use the same size for double as for float.
-fshort-enums
    For C/C++ only. Use the smallest fitting integer to hold enums.
-fshort-wchar
    For C/C++ only. Use short unsigned int for wchar_t instead of the default underlying type for the target.
-ftest-coverage
    Create data files for the pathcov(1) code coverage utility. The data file names begin with the name of your source file:
    SOURCENAME.bb   A mapping from basic blocks to line numbers, which pathcov uses to associate basic block execution counts with line numbers.
    SOURCENAME.bbg  A list of all arcs in the program flow graph. This allows pathcov to reconstruct the program flow graph, so that it can compute all basic block and arc execution counts from the information in the SOURCENAME.da file.
    Use -ftest-coverage with -fprofile-arcs; the latter option adds instrumentation to the program, which then writes execution counts to another data file:
    SOURCENAME.da   Runtime arc execution counts, used in conjunction with the arc information in the file SOURCENAME.bbg.
    Coverage data will map better to the source files if -ftest-coverage is used without optimization. See the gcc man pages for more information.
-ftpp
    Run the Fortran source preprocessor on input Fortran source files before compiling. By default, files suffixed with .F or .F90 are
316. rror code from the C library value errno. The subroutine form sets status to the value that the function would return.
iargc
    Return the number of arguments on the command line used to execute this program, not including the program name itself.
idate
    The single-argument version stores the current local date in its argument, which must have three elements: day (ranging 1-31), month (ranging 1-12), and year (using 4 digits). The three-argument version sets its arguments to the month, day, and year. Note that the order is different from that of the one-argument version.
ierrno
    Returns the C library value errno, which is the last error code set by a C library or Linux system function. Note that a function which does not encounter an error may not set this value back to zero.
imag
    Return the imaginary part of a complex number without altering precision.
imagpart
    Imaginary part of a complex number; synonym for standard intrinsic aimag, which in Fortran 95 preserves the precision of its argument.
int2
    Convert to type integer*2.
int4
    Convert to type integer*4.
int8
    Convert to type integer*8.
irand
    Fortran interface to POSIX function rand(). Returns a uniform pseudorandom integer. If flag is 0, return the next number in the current sequence; if flag is 1, call POSIX function srand(0); otherwise, call srand(flag) to seed a new sequence.
isatty
317. rs using autoparallelization and OpenMP in Fortran and C/C++.
Section 9 provides an example of optimizing code.
Section 10 covers debugging and troubleshooting code.
Appendix A lists environmental variables used with the compilers.
Appendix B discusses implementation-dependent behavior for OpenMP Fortran.
Appendix C is a list of the supported Fortran intrinsics.
Appendix D provides a simplified data structure from a Fortran 90 dope vector.
Appendix E is a reference copy of the eko man page.
Appendix F contains a glossary of terms associated with the compilers.
Conventions Used in This Document
These conventions are used throughout this document.
    Convention     Meaning
    command        Fixed-space font is used for literal items such as commands, files, routines, and pathnames.
    variable       Italic typeface is used for variable names or concepts being defined.
    user input     Bold, fixed-space font is used for literal items the user types in. Output is shown in non-bold, fixed-space font.
    (prompt)       Indicates a command line prompt.
    [ ]            Brackets enclose optional portions of a command or directive line.
    ...            Ellipses indicate that a preceding element can be repeated.
    NOTE:          Indicates important information.
1.2 Documentation Suite
The PathScale Compiler Suite product documentation set includes:
- The QLogic PathScale Compiler Suite
318. run through the C source preprocessor (cpp). Files that are suffixed with .f or .f90 are not run through any preprocessor by default. The Fortran source preprocessor does not automatically expand macros outside of preprocessor statements, so you need to specify macro expand if you want macros expanded.
  -fullwarn: Request that the compiler generate comment-level messages. These messages are suppressed by default. Specifying this option can be useful during software development.
  -f[no-]underscoring: For Fortran only. -funderscoring appends underscores to symbols; -fno-underscoring tells the compiler not to append underscores to symbols.
  -f[no-]unsafe-math-optimizations: -funsafe-math-optimizations improves FP speed by violating ANSI and IEEE rules; -fno-unsafe-math-optimizations makes the compilation conform to ANSI and IEEE math rules at the expense of speed. This option is provided for GCC compatibility and is equivalent to -OPT:IEEE_arithmetic=3 -fno-math-errno.
  -fuse-cxa-atexit: For C++ only. Register static destructors with __cxa_atexit instead of atexit.
  -fwritable-strings: For C/C++ only. Attempt to support writable strings (K&R style C).
  -g[N]: Specify debugging support and indicate the level of information produced by the compiler. The supported values for N are:
    0: No debugging information for symbolic debugging is produced. This is the default.
    1: Pr
319. runtime ORDERED
Table 8-1. Fortran Compiler Directives (Continued)
Columns: Directive; Example; Clauses
  SECTIONS: !$OMP sections [clause] structured-block !$OMP end sections [nowait]; clauses PRIVATE, FIRSTPRIVATE, LASTPRIVATE, REDUCTION
  SINGLE: !$OMP single [clause] structured-block !$OMP end single [nowait]; clauses PRIVATE, FIRSTPRIVATE, COPYPRIVATE
Combined parallel work-sharing constructs: shortcut for denoting a parallel region that contains only one work-sharing construct.
  PARALLEL DO: !$OMP parallel do structured-block !$OMP end parallel do
  PARALLEL SECTIONS: !$OMP parallel sections structured-block !$OMP end parallel sections
  PARALLEL WORKSHARE: !$OMP parallel workshare structured-block !$OMP end parallel workshare
Synchronization constructs: provide various aspects of synchronization, for example access to a block of code, or execution order of statements within a block of code.
  ATOMIC: !$OMP atomic expression-statement
  BARRIER: !$OMP barrier
  CRITICAL: !$OMP critical [(name)] structured-block !$OMP end critical
320. ry Calls C/C++
Table 8-3. Fortran Runtime Library Routines (Continued)
  logical omp_test_lock(int): Try to acquire the lock; return TRUE if successful, FALSE if not.
  omp_test_nest_lock(int): Attempt to set a lock using the same method as omp_set_nest_lock, but the execution thread does not wait for confirmation that the lock is available. If the lock is successfully set, the function increments the nesting count; if the lock is unavailable, the function returns a value of zero.
  omp_get_wtime: Returns a double-precision value equal to the number of seconds since the initial value of the operating system real-time clock.
  omp_get_wtick: Returns a double-precision floating-point value equal to the number of seconds between successive clock ticks.
8.7 OpenMP Runtime Library Calls: C/C++
OpenMP programs can explicitly call standard routines implemented in the OpenMP runtime library. If you want to ensure the program is still compilable without -mp, you need to guard such code with the OpenMP conditional compilation sentinels (e.g. #pragma). The following table lists the OpenMP runtime library routines provided by version 2.1 of the OpenMP C/C++ Application Program Interface.
Table 8-4. C/C++ OpenMP Runtime Library Routines
  void omp_set_num_threads(int): Set the number of threads to use in a team.
  int omp_get_num_threads(void): Return
321. s
  FTN_SUPPRESS_REPEATS (Fortran): Output multiple values instead of using the repeat factor; used at runtime.
  NLSPATH (Fortran): Flags for runtime and compile-time messages.
  PSC_CFLAGS (C): Flags to pass to the C compiler, pathcc.
  PSC_COMPILER_DEFAULTS_PATH: Specifies a path or colon-separated list of paths designating where the compiler is to look for the compiler.defaults file. If the environment variable is set, the path /opt/pathscale/etc will not be used. If the file cannot be found, then no defaults file will be used, even if one is present in /opt/pathscale/etc.
  PSC_PROBLEM_REPORT_DIR: Names a directory in which to save problem reports and preprocessed source files if the compiler encounters an internal error. If not specified, the directory used is $HOME/.ekopath-bugs.
  PSC_CXXFLAGS (C++): Flags to pass to the C++ compiler, pathCC.
  PSC_FFLAGS (Fortran): Flags to pass to the Fortran compiler, pathf95.
  PSC_GENFLAGS: Generic flags passed to all compilers.
  PSC_STACK_LIMIT (Fortran): Controls the stack size limit the Fortran runtime attempts to use. This string takes the format of a floating-point number, optionally followed by one of the characters k (for units of 1024 bytes), m (for units of 1048576 bytes), g (for units of 1073741824 bytes), or % (to specify a percentage of physical memory). If the specifier is followed by the string cpu, the limit is di
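The PSC_STACK_LIMIT format described above (a floating-point number with an optional k, m, g, or % suffix) can be illustrated with a small parser. This is only a sketch of the stated format, not the runtime's actual code: the function name `parse_stack_limit`, the `physical_memory` parameter, and the treatment of a bare number as bytes are assumptions, and the trailing `cpu` modifier is deliberately left out because its description is cut off in this excerpt.

```python
import re

# Multipliers for the unit suffixes listed for PSC_STACK_LIMIT.
UNITS = {"k": 1024, "m": 1048576, "g": 1073741824}


def parse_stack_limit(text, physical_memory):
    """Return a byte count for a PSC_STACK_LIMIT-style string.

    Assumptions (not taken from the guide): a bare number means bytes,
    and the trailing "cpu" modifier is not handled.
    """
    match = re.fullmatch(r"\s*([0-9]*\.?[0-9]+)\s*([kmg%]?)\s*", text)
    if match is None:
        raise ValueError(f"unrecognized stack limit: {text!r}")
    value = float(match.group(1))
    suffix = match.group(2)
    if suffix == "%":
        # Percentage of physical memory, supplied by the caller in bytes.
        return int(physical_memory * value / 100)
    return int(value * UNITS.get(suffix, 1))


if __name__ == "__main__":
    eight_gib = 8 * 2**30
    for setting in ("512k", "1.5g", "25%"):
        print(setting, parse_stack_limit(setting, eight_gib))
```

Passing physical memory in as a parameter keeps the sketch independent of any operating system query.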
322. s. For example, a Fortran subroutine called foo gets turned into the name foo_ when placed in the object file. We do this to avoid name collisions with similar functions in other libraries. This makes mixing code from C, C++, and Fortran easier.
Name mangling ensures that function, subroutine, and common-block names from a Fortran program or library do not clash with names in libraries from other programming languages. For example, the Fortran library contains a function named access, which performs the same function as the function access in the standard C library. However, the Fortran library access function takes four arguments, making it incompatible with the standard C library access function, which takes only two arguments. If your program links with the standard C library, this would cause a symbol name clash. Mangling the Fortran symbols prevents this from happening.
By default, we follow the same name mangling conventions as the GNU g77 compiler and libf2c library when generating mangled names. Names without an underscore have a single underscore appended to them, and names containing an underscore have two underscores appended to them. The following examples should help make this clear:
    molecule -> molecule_
    run_check -> run_check__
    energy_ -> energy___
This behavior can be modified by using the -fno-second-underscore and
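The mangling rule stated in this passage (one underscore appended to names without an underscore, two appended to names that already contain one) is small enough to write down directly. The sketch below is illustrative only: the function name `mangle` and the `second_underscore` parameter are hypothetical, and the parameter reflects one reading of what disabling the second underscore would do.

```python
def mangle(name, second_underscore=True):
    """Apply the g77-style underscore rule to a Fortran symbol name.

    `second_underscore=False` is an assumed model of turning off the
    second underscore: every name then gets exactly one.
    """
    if "_" in name and second_underscore:
        return name + "__"
    return name + "_"


if __name__ == "__main__":
    for symbol in ("foo", "molecule", "run_check", "energy_"):
        print(symbol, "->", mangle(symbol))
```

A C caller would declare the mangled name (for example `foo_`) to link against the Fortran routine.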
323. s prefetching in loop nests, while n=2 means to prefetch more aggressively than the default. -LNO:prefetch_ahead=n defines how many cache lines ahead of the current data being loaded should be prefetched. The default is n=2 cache lines.
7.4.5 Vectorization
Vectorization is an optimization technique that works on multiple pieces of data at once. For example, the compiler will turn a loop computing the mathematical function sin() into a call to the vsin() function, which is twice as fast.
The use of vectorized versions of functions in the math library (like sin, cosin) is controlled by the flag -LNO:vintr=0|1|2. 0 turns off vectorization of math intrinsics, while 1 is the default. Under -LNO:vintr=2 the compiler will vectorize all math functions. Note that vintr=2 could be unsafe, in that the vector forms of some of the functions could have accuracy problems.
Vectorization of user code (excluding these mathematical functions) is controlled by the flag -LNO:simd=0|1|2, which enables or disables inner-loop vectorization. 0 turns off the vectorizer; 1 (the default) causes the compiler to vectorize only if it can determine that there is no undesirable performance impact due to sub-optimal alignment; and 2 vectorizes without any constraints (this is the most aggressive setting).
-LNO:simd_verbose=ON prints vectorizer information from vectorizing user code
324. s in terms of their loop iteration number.
NOTE: The PSC_OMP_STATIC_FAIR environment variable can be used to change the default static scheduling algorithm to an alternate scheme where the iterations are more equally balanced over the threads in cases where the division is not exact.
OMP_NUM_THREADS environment variable: The default value is implementation dependent (Section 4.2, page 60). The default value of the OMP_NUM_THREADS environment variable is the number of CPUs in the machine.
OMP_DYNAMIC environment variable: The default value is implementation dependent (Section 4.3, page 60). The default value of the OMP_DYNAMIC environment variable is false.
An implementation can replace all ATOMIC directives by enclosing the statement in a critical section (Section 2.5.4, page 27). Many ATOMIC directives are implemented with in-line atomic code for the atomic statement, while others are implemented using a critical section, due to the absence of hardware support.
If the dynamic threads mechanism is enabled on entering a parallel region, the allocation status of an allocatable array that is not affected by a COPYIN clause that appears on the region is implementation dependent (Section 2.6.1, page 32). The allocation status of the thread's copy of an allocatable array will be retained on entering a parallel region.
Due to resource constraints, it is not possible for an implementation to document the maximum number of threads that
325. s log Sed e s Time in seconds secs log gt PSC METRIC FILE grep SUCCESSFUL logs ft A txt
NOTE: pathopt2 checks the result status of the build command script and of the run command script. A zero status indicates that the build or run was successful, while a non-zero status indicates failure. If running the program indicates its status in some other way, this must be detected by a script and reflected in the script's return status. In the example above, the grep SUCCESSFUL line is a way to pass the NPB correctness test results to pathopt2. The grep will have a status of 0 if the output contains this phrase, and this will be the status of the whole shell script, since this is the last command.
Next, make the file executable and run pathopt2: chmod x psc test2 pathopt2 S timing file t try5 r psc test2 psc build ft A
The sorted summary will be similar to the following: Sorted summary from all runs Flags Build Test Time 03 ipa PASS PASS 10 87 Ofast PASS PASS 10 87 03 OPT Ofast PASS PASS 01 03 PASS PASS 02 02 PASS PASS 82
Since O3 ipa was the fastest in the try5 target, we can run pathopt2 again with the peak O3 target: pathopt2 S timing file t peak O3 r psc test2 psc build ft A
In the truncated sorted summary, we can see that there is some improvement with the new optio
326. s or registered trademarks of Red Hat Inc SuSE is a registered trademark of SuSE Linux AG All other brand and product names are trademarks or registered trademarks of their respective owners Document Revision History 1 4 2 0 2 1 2 2 2 3 2 4 2 5Beta 2 5 February 21 2007 Version 3 0 New Sections 3 3 3 1 3 7 4 10 4 10 5 Added Appendix B Supported Fortran intrinsics New Sections 2 3 8 9 7 11 8 Added Chapter 8 Using OpenMP in Fortran New Appendix B Implementation dependent behavior for OpenMP Fortran Expanded and updated Appendix C Supported Fortran intrinsics Added Chapter 9 Using OpenMP in C C Appendix E eko man page Expanded and updated Appendix B and Appendix C New Sections 3 5 1 4 3 5 2 combined OpenMP chapters Added to reference list in Chapter 1 new Section 8 2 on autoparallelization Expanded and updated Section 3 4 Section 3 5 and Section 7 9 Updated Section C 3 added Section C 4 Fortran intrinstic extensions Updated Fortran Dope Vector info in Appendix D Added info in support for varying length character strings for Fortran in section 3 4 3 Changes Removed sentence stating not possible to use FDO with OpenMP programs Document Sections Affected section 8 14 3 Updated O3 option wording to be same as in man page section 7 1 Page ii 1 02404 15 QLogic PathScale Compiler Suite User Guide QLOGIC Version 3 0 I
327. s tag.
  <execute name="name">: Specifies an execute target and must contain at least one <source> tag that references a previously defined <define> tag. May also contain <execute>, <option>, or <append> tags. Specify execute targets on the command line using the -t name option.
  <option>: Describes a single option. Surround the content for this option in space characters to ensure differentiation, e.g. <option> -Ofast </option> rather than <option>-Ofast</option>.
  <choose k="x" hoist="true"> </choose>: Choose the best option among those provided within this tag. The k="x" attribute specifies the number of choices to run iteratively. If k is given as a range separated by a colon, e.g. k="0:2", pathopt2 chooses among that number of options inclusive, e.g. between 0 and 2 options. The optional hoist="true" attribute merges the lists returned by the children of the execute tag into the list for that tag. By default, choose picks combinations only from directly related children.
  <append> <option> <option> </append>: The first option described within this tag is appended to the test stream for the remaining options. The following instructs pathopt2 to find the best option between O3 O3 OPT Ofast, but not any of these options singly: <append> <option> -O3 </option> <choose k="1">
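One way to picture what a choose tag with a k range and a surrounding append tag ask pathopt2 to try is to enumerate the candidate flag sets. The sketch below is an interpretation of the description above, not pathopt2's implementation: the helper names `choose` and `append_first`, the sample option list, and the idea that candidates are unordered subsets are all assumptions.

```python
from itertools import combinations


def choose(options, k_min, k_max):
    """List every subset of `options` with between k_min and k_max members."""
    candidates = []
    for k in range(k_min, k_max + 1):
        for subset in combinations(options, k):
            candidates.append(" ".join(subset))
    return candidates


def append_first(first, candidates):
    """Prefix each candidate with the fixed first option, as <append> is described."""
    return [f"{first} {candidate}".strip() for candidate in candidates]


if __name__ == "__main__":
    # Hypothetical option list, used only to show the enumeration.
    options = ["-ipa", "-OPT:Ofast", "-fno-math-errno"]
    for flags in append_first("-O3", choose(options, 0, 2)):
        print(flags)
```

With three options and k from 0 to 2, this yields seven candidates, including the fixed option on its own.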
328. s to make concerning your program, more optimizations can then be applied. The following are some of the various aliasing models you can specify, listed in order of increasingly stringent, and potentially dangerous, assumptions you are telling the compiler to make about your program:
  -OPT:alias=any: the default level, which implies that any two memory references can be aliased.
  -OPT:alias=typed: activates the ANSI rule that objects are not aliased if they have different base types. This option is activated by -Ofast.
  -OPT:alias=unnamed: assumes that pointers never point to named objects.
  -OPT:alias=restrict: tells the compiler to assume that all pointers are restricted pointers and point to distinct, non-overlapping objects. This allows the compiler to invoke as many optimizations as if the program were written in Fortran. A restricted pointer behaves as though the C restrict keyword had been used with it in the source code.
  -OPT:alias=disjoint: says that any two pointer expressions are assumed to point to distinct, non-overlapping objects.
To make the opposite assertion about your program's behavior, put no_ before the value. For example, -OPT:alias=no_restrict means that distinct pointers may point to overlapping storage.
Additional -OPT:alias values are relevant to Fortran programmers in some situations:
  -OPT:alias=cray_pointer: as
329. scribes a standard for NaN and inf operands. Default is ON. -OPT:IEEE_NaN_inf=OFF produces non-IEEE results for various operations. For example, x == x is treated as TRUE without executing a test, and x/x is simplified to 1 without dividing. OFF can enable many common optimizations that can help performance.
  -OPT:inline_intrinsics=ON|OFF: When OFF, this option turns all Fortran intrinsics that have a library function into a call to that function. Default is ON.
  -OPT:malloc_algorithm=0|1 or -OPT:malloc_alg=0|1: Select an alternate malloc algorithm which may improve speed. The compiler adds setup code in the C, C++, or Fortran main function to enable the chosen algorithm. The default is 0.
  -OPT:Ofast: Use optimizations selected to maximize performance. Although the optimizations are generally safe, they may affect floating-point accuracy due to rearrangement of computations. This effectively turns on the following optimizations: -OPT:ro=2:Olimit=0:div_split=ON:alias=typed.
  -OPT:Olimit=N: Disable optimization when the size of a program unit is greater than N. When N is 0, program unit size is ignored and the optimization process will not be disabled due to the compile-time limit. The default is 0 when -OPT:Ofast is specified and 9000 when -O3 is specified; otherwise the default is 6000.
  -OPT:pad_common=ON|OFF: This option reorganizes common blocks to improve the cache behavior of accesses to members
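The two simplifications named for the NaN/inf setting (treating x == x as always true, and replacing x/x with 1) are only valid when x is an ordinary number. The short sketch below shows the IEEE 754 behavior that such a setting gives up; it uses Python floats, which are IEEE doubles, purely for illustration, and the helper names are hypothetical.

```python
import math

nan = float("nan")
inf = float("inf")


def self_equal(x):
    """IEEE comparison of a value with itself; False only for NaN."""
    return x == x


def self_ratio(x):
    """IEEE division of a value by itself; NaN for infinities and NaN."""
    return x / x


if __name__ == "__main__":
    for value in (2.0, nan, inf):
        print(value, self_equal(value), self_ratio(value))
```

Code that detects NaN with a self-comparison is therefore the kind of code to check before relaxing this setting.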
330. se these shared libraries by default when linking executables and shared objects. Therefore, if you link a program with these shared libraries, you must install them on systems where that program will run.
You should continue to use the static versions of the runtime libraries if you wish to obtain maximum portability or peak performance. The latter is the case because the compiler cannot optimize shared libraries as aggressively as static libraries. Shared libraries are compiled using position-independent code, which limits some opportunities for optimization, while our static libraries are not compiled this way.
To link with static libraries instead of shared libraries, use the -static option. For example, the following code is linked using the shared libraries:
    $ pathcc -o hello hello.c
    $ ldd hello
    libpscrt.so.1 => /opt/pathscale/lib/2.3.99/libpscrt.so.1 (0x0000002a9566d000)
    libmpath.so.1 => /opt/pathscale/lib/2.3.99/libmpath.so.1 (0x0000002a9576e000)
    libc.so.6 => /lib64/libc.so.6 (0x0000002a9588b000)
    libm.so.6 => /lib64/libm.so.6 (0x0000002a95acd000)
    /lib64/ld-linux-x86-64.so.2 => /lib64/ld-linux-x86-64.so.2 (0x0000002a95556000)
If you use the -static option, notice that the shared libraries are no longer required:
    $ pathcc -o hello hello.c -static
    $ ldd hello
    not a dynamic executable
2.8 Large File Support
331. sed on execution frequency. This option is meaningful only under feedback-directed compilation. The default value N=0 turns off the alignment optimization. Any other value specifies the frequency threshold at or above which this alignment will be performed by the compiler.
  -CG:prefer_legacy_regs=ON|OFF: Tell the local register allocator to use the first 8 integer and SSE registers whenever possible (%rax-%rbp, %xmm0-%xmm7). Instructions using these registers have smaller instruction sizes. The default is OFF.
  -CG:prefetch=ON|OFF: Enable generation of prefetch instructions in the code generator. The default is ON. -CG:prefetch=OFF and -LNO:prefetch=0 both suppress the generation of prefetch instructions, but -LNO:prefetch=0 also affects LNO optimizations that depend on prefetch.
  -CG:sse_cse_regs=N: When performing common subexpression elimination during code generation, assume there are N extra SSE registers available over the number provided by the CPU. N can be positive, zero, or negative. The default is positive infinity. See also -CG:cse_regs.
  -CG:use_prefetchnta=ON|OFF: Prefetch when data is non-temporal at all levels of the cache hierarchy. This is for data streaming situations in which the data will not need to be re-used soon. The default is OFF.
  -CG:use_test=ON|OFF: Make the code generator use the TEST instruction instead of CMP. See Opteron's instruction description for the difference between these two instructions. The default
332. sed to offset the CPU assignments for the set of threads. It takes an integer value in the range of 0 to the number of CPUs, inclusive. When a thread is mapped to a CPU, this offset is added onto the CPU number calculated after PSC_OMP_CPU_STRIDE has been applied. If the resulting value is greater than the number of CPUs, then the remainder is used from the division of this value by the number of CPUs.
The effect of this is to apply an offset to the CPU assignments for a set of threads. This is particularly useful when multiple OpenMP jobs are to be run at the same time on the same system, and allows the jobs to be separated onto different CPUs. Without this mechanism, both jobs would be assigned to CPUs starting at CPU 0, causing a non-uniform distribution.
For example, consider a system with four chips, each with two cores, using chip-major numbering. Let there be 2 OpenMP jobs, each consisting of 4 threads. If these jobs are run with the default scheduling, the assignments will be:
    CHIP 0          CHIP 1          CHIP 2          CHIP 3
    CPU0    CPU1    CPU2    CPU3    CPU4    CPU5    CPU6    CPU7
    J0-T0   J0-T1   J0-T2   J0-T3
    J1-T0   J1-T1   J1-T2   J1-T3
(Jx-Ty indicates thread y of job x.)
If PSC_OMP_CPU_OFFSET is set to 4 for job 1, the scheduling will be changed to:
    CHIP 0          CHIP 1          CHIP 2
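The offset arithmetic described above (add the offset to the strided CPU number, then take the remainder on division by the CPU count) can be sketched in a few lines. This is an illustrative model rather than the runtime's code: the function name `assign_cpu` is hypothetical, and the way the stride is applied (thread index multiplied by the stride) is an assumption, since only the offset step is spelled out in this excerpt.

```python
def assign_cpu(thread, num_cpus, stride=1, offset=0):
    """Map a thread index to a CPU number.

    Assumption: the stride step is modeled as thread * stride. The offset
    is then added and the result wrapped with a remainder, as described.
    """
    return (thread * stride + offset) % num_cpus


if __name__ == "__main__":
    num_cpus = 8
    job0 = [assign_cpu(t, num_cpus) for t in range(4)]
    job1 = [assign_cpu(t, num_cpus, offset=4) for t in range(4)]
    print("job 0 on CPUs", job0)
    print("job 1 on CPUs", job1)
```

With an offset of 4 for the second job, the two four-thread jobs occupy disjoint halves of the eight-CPU system in this model.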
333. serial and a parallel version. If the trip count of the loop is small, it is not beneficial to use the parallel version during execution. When this flag is set to ON and the feedback data indicates that the loop has a small trip count, the auto-parallelizer will not generate the parallel version, thus saving the runtime check needed to decide whether to execute the serial or parallel version of the loop. The default is OFF.
  -LNO:build_scalar_reductions=ON|OFF: Build scalar reductions before any loop transformation analysis. Using this flag may enable further loop transformations involving reduction loops. The default is OFF. This flag is redundant when -OPT:roundoff=2 or greater is in effect.
  -LNO:blocking=ON|OFF: Enable or disable the cache blocking transformation. The default is ON.
  -LNO:blocking_size=N: This option specifies a block size that the compiler must use when performing any blocking. N must be a positive integer that represents the number of iterations.
  -LNO:fission=0|1|2: This option controls loop fission. The options can be one of the following:
    0: Disable loop fission (default).
    1: Perform normal fission as necessary.
    2: Specify that fission be tried before fusion.
  Because -LNO:fusion is on by default, turning on fission without turning off fusion may result in their effects being nullified. Ordinarily, fusion is applied before fission. Spec
334. serts that an object pointed to by a Cray pointer is never overlaid on another variable's storage. This flag also specifies that the compiler can assume that the pointed-to object is stored in memory before a call to an external procedure and is read out of memory at its next reference. It is also stored before an END or RETURN statement of a subprogram.
-OPT:alias=parm promises that Fortran parameters do not alias to any other variable. This is the default. no_parm asserts that parameter aliasing is present in the program.
7.7.2 Numerically Unsafe Optimizations
Rearranging mathematical expressions and changing the order or number of floating-point operations can slightly change the result. Example:
    A = 2. * X
    B = 4. * Y
    C = 2. * (X + 2. * Y)
A clever compiler will notice that C = A + B. But the order of operations is different, and so a slightly different C will be the result. This particular transformation is controlled by the -OPT:roundoff flag, but there are several other numerically unsafe flags.
Some options that fall into this category are the options that control IEEE behavior, such as -OPT:roundoff=N and -OPT:IEEE_arithmetic=N. Here are a couple of others:
  -OPT:div_split=ON|OFF: This option enables or disables transforming expressions of the form X/Y into X*(1/Y). The reciprocal is inherently less accurate than a straigh
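Both effects mentioned here, reordering additions and splitting a division into a multiplication by a reciprocal, can be observed with ordinary double-precision arithmetic. The sketch below uses Python floats (IEEE doubles) to show that the rewritten forms are not always bit-identical to the originals; the function names are hypothetical and nothing here reflects how the compiler itself decides to apply these transformations.

```python
def reassociation_differs(a, b, c):
    """True when (a + b) + c and a + (b + c) round to different doubles."""
    return (a + b) + c != a + (b + c)


def reciprocal_mismatches(limit):
    """Pairs (x, y) of small integers where x / y differs from x * (1 / y)."""
    return [
        (x, y)
        for x in range(1, limit)
        for y in range(1, limit)
        if x / y != x * (1.0 / y)
    ]


if __name__ == "__main__":
    print("sum order matters for 0.1, 0.2, 0.3:",
          reassociation_differs(0.1, 0.2, 0.3))
    pairs = reciprocal_mismatches(50)
    print(len(pairs), "of", 49 * 49, "small quotients change under X*(1/Y)")
```

The differences are in the last bit or so, which is why these options are described as slightly changing results rather than breaking them.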
335. sets must fit within the signed 32-bit address space. To determine if you are close to this limit, use the Linux size command:
    $ size bench
       text    data     bss     dec     hex filename
     910219    1448    3192  914859   df5ab bench
If the total value of the text segment is close to 2GB, then the size of the memory model may be an issue for you. We believe that codes that are this large are extremely rare and would like to know if you are using such an application. The size of the bss and data segments is addressed by using the medium memory model.
2.10 Debugging
The flag -g tells the PathScale compilers to produce data in the DWARF 2.0 format used by modern debuggers such as GDB and PathScale's debugger, pathdb. This format is incorporated directly into the object files. The -g option automatically sets the optimization level to -O0 unless an explicit optimization level is provided on the command line. Debugging of higher levels of optimization is possible, but the code transformation performed by the optimizations may make it more difficult.
See the individual sections on the PathScale Fortran and C/C++ compilers for more language-specific debugging information, and section 10 for debugging and troubleshooting tips. See the QLogic PathScale Debugger User Guide for more information on pathdb.
2.11 Profiling: Locate Your Program's Hot Spots
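The columns printed by the size command are related by simple arithmetic: dec is the sum of text, data, and bss, and hex is the same total in hexadecimal. The sketch below checks the figures quoted for bench and compares the text segment with the 2 GB bound mentioned above. The function name `segment_report` and the use of 2**31 bytes as the bound are assumptions made for illustration.

```python
# Assumed bound: the signed 32-bit address space, i.e. 2**31 bytes.
LIMIT = 2**31


def segment_report(text, data, bss):
    """Recompute the dec and hex columns and the text share of the limit."""
    total = text + data + bss
    return {
        "dec": total,
        "hex": format(total, "x"),
        "text_fraction_of_limit": text / LIMIT,
    }


if __name__ == "__main__":
    report = segment_report(text=910219, data=1448, bss=3192)
    print(report["dec"], report["hex"])
    print(f"text uses {report['text_fraction_of_limit']:.4%} of the limit")
```

For the bench example the text segment is far below the bound, so the memory model is not a concern there.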
336. sic Extensions ls program once implicit none external handlerl handler2 common previous count integer 8 previous integer count previous signal 2 handler1 previous signal 2 handler2 count 4 do while true call sleep 100 end do end subroutine handler1 implicit common previous count integer 8 previous integer count print I am handlerl count count 1 if count le 0 then previous 0 end if previous signal 2 previous end subroutine handler1 subroutine handler2 implicit none common previous count integer 8 previous integer count print I am handler2 count count 1 if count le 0 then previous 0 end if previous signal 2 previous end subroutine handler2 1 02404 15 C 51 C Supported Fortran Intrinsics Fortran Intrinsic Extensions QLOGIC Se is an example using the three argument form C Keyboard interrupt normally Control C triggers C handler until 4 interrupts have occurred Then C restore the default so the fifth interrupt stops C the program program single implicit none external handler intrinsic signal integer 8 previous common count integer count previous signal 2 handler 1 count 4 do while true call sleep 100 end do end subroutine handler implicit none intrinsic signal integer 8 previous common count integer count print
337. sics Table of Supported Intrinsics QLOGIC FWI5I IGIIII III ee Table C 1 Fortran Intrinsics Supported in 3 0 Continued Intrinsic Name Result Arguments Families Remarks DSHIFTR I 1 1 I 2 1 4 I 8 TRADITIONAL E J I 1 I 2 1 4 I 8 1 1 I 2 1 4 1 8 DSIGN R 8 A R 8 ANSI G77 E P B R 8 PGI TRADITIONAL DSIN R 8 X R 8 ANSI G77 E P PGI TRADITIONAL DSIND R 8 X R 8 PGI E TRADITIONAL DSINH R 8 X R 8 ANSI G77 E P PGI TRADITIONAL DSM 1 8 ARRAY Any type TRADITIONAL CHUNKSIZE Array rank any DIM I 1 1 2 1 4 I 8 DSM 1 8 ARRAY Any TRADITIONAL DISTRIBUTION type Array rank any BLOCK DIM I 1 1 2 1 4 I 8 DSM 1 8 ARRAY Any type TRADITIONAL DISTRIBUTION Array rank any CYCLIC DIM 1 1 1 2 1 4 I 8 DSM_ 1 8 ARRAY Any type TRADITIONAL DISTRIBUTION _ Array rank any STAR DIM I 1 1 2 I 4 1 8 DSM 1 8 ARRAY Any type TRADITIONAL ISDISTRIBUTED Array rank any DSM 1 8 ARRAY Any type TRADITIONAL ISRESHAPED Array rank any DSM 1 8 ARRAY Any type TRADITIONAL NUMCHUNKS Array rank any DSM 1 8 ARRAY Any type TRADITIONAL NUMTHREADS Array rank any DIM I 1 1 2 1 4 I 8 DSM REM 1 8 ARRAY type TRADITIONAL CHUNKSIZE Array rank any DIM I 1 1 2 1 4 I 8 INDEX I 1 I 2 I 4 I 8 C 14 1 02404 15 XX C Supported Fortran Intrinsics QLOGIC Table of Supported Intrinsics o 9 9 97 MEN Table C 1 Fortr
338. sion 3.0 Built on: 2007-02-21 07:03:08 -0800 Thread model: posix GNU gcc version 4.0.2 (PathScale 3.0 driver)
There are online manual pages ("man pages") with descriptions of the large number of command line options that are available. Type man pathscale_intro at the command line to see the pathscale_intro man page and its overview of the various man pages included with the Compiler Suite.
2.2.1 Accessing the GCC 4.x Front-ends for C and C++
This release supports GCC 3.x and GCC 4.x. The compiler defaults to -gnu3 or -gnu4 depending on whether the system-installed gcc/g++ is a 3.x or 4.x compiler. It is possible to override this choice using -gnu3 or -gnu4 to get the compiler to use the alternate front-end instead of the default one. A sample command for C is:
    $ pathcc -gnu4 world.c
This default option can be changed in your compiler.defaults file by adding this line:
    -gnu4
See section 2.3 for an example compiler.defaults file. The option has no effect on pathf90 or pathf95. There are currently some limitations when using this option. Please see the Release Notes for more information.
2.3 Compiling for Different Platforms
The PathScale Compiler Suite currently compiles and optimizes your code for the Opteron processor, independent of where the compilation is happening. This may change in the future. To select the
339. slation Lookaside Buffer TLB 7 14 U ulimit command 3 2 V VIRT 8 18 8 23 vsin 7 17 W Whole program optimization IPA 7 3 X x86 ABI 3 1 4 1 Index 8 1 02404 13
340. space in it.
  You may need to cd to another directory before running the program.
  You want to take advantage of the S rate file or S timing file feature that requires some grep and sed commands to isolate the number in the output to use as the performance metric of interest (e.g. a megaflops number in the rate file case).
The next sections provide examples of a Makefile, build and test scripts, and the rate and timing files.
7.9.8 The NAS Parallel Benchmark Suite
Next is a concrete example with measurable results. The NAS Parallel Benchmark (NPB) suite is commonly used for both serial and parallel benchmarking. It consists of a set of dissimilar pieces of applications illustrating the various numerical techniques used by NASA's high-performance applications. The benchmark comes with several data set sizes, with W being a workstation size (smallest), and A and B being two sizes appropriate to a cluster or supercomputer size problem. These examples use the Class A data set. Several examples will be provided showing usage in a step-by-step manner. By following these steps, you will get a better idea of how pathopt2 works.
7.9.8.1 Set Up the Workarea
The NAS Parallel Benchmark Suite (NPB) can be downloaded by going to http://www.nas.nasa.gov/Software/NPB and following the links to the file. Download the file to a writable working directory. Then:
    $ tar zxf NPB2.3.tar.gz
    $ cd NPB2.3/NPB2.3-SER/config
    $ cp /opt/pathscale/share/p
341. specifies that a function with size smaller than N basic blocks is not subject to the -IPA:plimit restriction. The default is 30.
-IPA:callee_limit=n specifies that a function whose size exceeds this limit will never be automatically inlined by IPA. The default is 500.
-IPA:min_hotness=N is applicable only under feedback compilation. A call site's invocation count must be at least N before it can be inlined by IPA. The default is 10.
-INLINE:aggressive=ON increases the aggressiveness of the inlining, in which more non-leaf and out-of-loop calls are inlined. Default is OFF.
We mentioned that leaf functions are good candidates to be inlined. These functions do not contain calls that may inhibit various backend optimizations. To amplify the effect of leaf functions, IPA provides two options that exploit its call-tree-based inlining feature. This is based on the fact that a function that calls only leaf functions can become a leaf function if all of its calls are inlined. This in turn can be applied repeatedly up the call graph. In the description of the following two options, a function is said to be at depth N if it is never more than N edges from a leaf node in the call graph. A leaf function has depth 0.
-IPA:maxdepth=N causes IPA to inline all routines at depth N in the call graph, subject to space limitation.
-IPA:forcedepth=N causes IPA to inline
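The depth definition used here (a leaf has depth 0, and a function is at depth N if it is never more than N edges from a leaf) amounts to the longest call path from the function down to a leaf. The sketch below computes that quantity on a small made-up call graph. The graph, the function names, and the helper `within_depth` are hypothetical, recursion-free input is assumed, and the selection is only a reading of which routines a depth limit would cover, not IPA's inlining logic.

```python
# Hypothetical call graph: each function maps to the functions it calls.
CALLS = {
    "main": ["solve", "report"],
    "solve": ["dot", "axpy"],
    "report": ["fmt"],
    "dot": [],
    "axpy": [],
    "fmt": [],
}


def depth(func, calls):
    """Longest number of call edges from `func` down to a leaf function."""
    callees = calls[func]
    if not callees:
        return 0  # a leaf function has depth 0
    return 1 + max(depth(callee, calls) for callee in callees)


def within_depth(calls, limit):
    """Functions that are never more than `limit` edges from a leaf."""
    return sorted(f for f in calls if depth(f, calls) <= limit)


if __name__ == "__main__":
    for name in CALLS:
        print(name, depth(name, CALLS))
    print("depth <= 1:", within_depth(CALLS, 1))
```

In this example, inlining the leaves into solve and report would in turn make those two functions leaves, which is the repeated effect the text describes.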
342. stinct steps, you must specify at least -IPA for the compile step, and specify -IPA and the individual options in the group for the link step. If you specify -IPA for the compile step and do not specify -IPA for the link step, you will receive an error.
  -IPA:addressing: Invoke the analysis of address operator usage. The default is OFF. -IPA:alias=ON is a prerequisite for this option.
  -IPA:aggr_cprop=ON|OFF: Enable or disable aggressive inter-procedural constant propagation. This attempts to avoid passing constant parameters, replacing the corresponding formal parameters by the constant values. Less aggressive inter-procedural constant propagation is done by default. The default setting is ON.
  -IPA:alias=ON|OFF: Invoke alias/mod/ref analysis. The default is ON.
  -IPA:callee_limit=N: Functions whose size exceeds this limit will never be automatically inlined by the compiler. The default is 500.
  -IPA:cgi=ON|OFF: Invoke constant global variable identification. This option marks non-scalar global variables that are never modified as constant, and propagates their constant values to all files. Default is ON.
  -IPA:clone_list=ON|OFF: Tell the IPA function cloner to list cloning actions as they occur to stderr. The default is -IPA:clone_list=OFF.
  -IPA:common_pad_size=N: This specifies the amount by which to pad common block array dimensions. By default, an amount is automatically chosen that
343. t_complex=ON enables fast calculations for values declared to be of the type complex. When this is set to ON, complex absolute value (norm) and complex division use fast algorithms that overflow for an operand (the divisor, in the case of division) that has an absolute value that is larger than the square root of the largest representable floating-point number. This would also apply to an underflow for a value that is smaller than the square root of the smallest representable floating-point number. OFF is the default. -OPT:fast_complex=ON is enabled if -OPT:roundoff=3 is in effect.
-OPT:fast_exp=(ON|OFF)  This option enables optimization of exponentiation by replacing the runtime call for exponentiation by multiplication and/or square root operations for certain compile-time constant exponents (integers and halves). This can produce differently rounded results than those from the runtime function. -OPT:fast_exp is OFF unless -O3 or -Ofast are specified, or -OPT:roundoff=1 is in effect.
-OPT:fast_io=(ON|OFF)  For C/C++ only. This option enables inlining of printf, fprintf, sprintf, scanf, fscanf, sscanf, and printw. -OPT:fast_io is only in effect when the candidates for inlining are marked as intrinsic to the stdio.h and curses.h files. Default is OFF.
-OPT:fast_math=(ON|OFF)  Setting this to ON will tell the compiler to use the fast math functions tuned f
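To make the overflow condition described for fast complex division concrete, here is a minimal sketch in C. It is an illustration only: the `cplx` type and both function names are hypothetical, the "fast" form is the textbook formula, and the "careful" form is Smith's method, which is one well-known alternative and not necessarily what the PathScale runtime uses. Neither function handles a zero divisor.

```c
/* A complex number as a plain pair; the type name is hypothetical. */
typedef struct {
    double re;
    double im;
} cplx;

/* Textbook division: forms c*c + d*d for the divisor c + di, which
   overflows to infinity once |c| or |d| exceeds the square root of the
   largest representable double. */
cplx div_fast(cplx x, cplx y)
{
    double den = y.re * y.re + y.im * y.im;
    cplx q;

    q.re = (x.re * y.re + x.im * y.im) / den;
    q.im = (x.im * y.re - x.re * y.im) / den;
    return q;
}

/* Smith's method: scales by the ratio of the divisor's parts so that no
   squared term is formed. It costs an extra division and a branch. */
cplx div_careful(cplx x, cplx y)
{
    double abs_re = y.re < 0 ? -y.re : y.re;
    double abs_im = y.im < 0 ? -y.im : y.im;
    cplx q;

    if (abs_re >= abs_im) {
        double r = y.im / y.re;
        double den = y.re + y.im * r;
        q.re = (x.re + x.im * r) / den;
        q.im = (x.im - x.re * r) / den;
    } else {
        double r = y.re / y.im;
        double den = y.re * r + y.im;
        q.re = (x.re * r + x.im) / den;
        q.im = (x.im * r - x.re) / den;
    }
    return q;
}
```

On IEEE double arithmetic without extended-precision intermediates, dividing 1e200 + 1e200i by itself gives NaN with the textbook form (the denominator overflows) and 1 + 0i with the careful form.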
344. t division but may be faster.
-OPT:recip=(ON|OFF)  This option allows expressions of the form 1/x to be converted to use the reciprocal instruction of the computer. This is inherently less accurate than a division, but will be faster.
These options can have performance impacts. For more information, see the eko manual page. You can view the manual page by typing man eko at the command line.
7.7.3 Fast-math Functions
When -OPT:fast_math=on is specified, the compiler uses fast versions of math functions tuned for the processor. The affected math functions include log, exp, sin, cos, sincos, expf, and pow. In general, the accuracy is within 1 ulp of the fully precise result, though the accuracy may be worse than this in some cases. The routines may not raise IEEE exception flags. They call no error handlers, and denormal number inputs/outputs are typically treated as 0, but may also produce unexpected results. -OPT:fast_math=on takes effect when -OPT:roundoff is set to 2 or above.
A different flag, -ffast-math, improves FP speed by relaxing ANSI and IEEE rules. -fno-fast-math tells the compiler to conform to ANSI and IEEE math rules at the expense of speed. -ffast-math implies -OPT:IEEE_arithmetic=2 -fno-math-errno, while -fno-fast-math implies -OPT:IEEE_arithmetic=1 -fmath-errno. These flags apply to all languages.
Both -OPT:fast_math=on and -ffast-math are implied by -Ofast.
7.7.4 IEEE 754 Compliance
It is possible to
345. t the defaults used related to ABI, ISA, and processor targets. When this flag is specified, the compiler will just print the defaults and quit. No compilation is performed.
$ pathcc -show-defaults
2.3.3 Compiling for an Alternate Platform
You will need to compile with the -march=anyx86 flag if you want to run your compiled executables on both AMD and Intel platforms. See the eko man page for more information about the -march= flag. To run code generated with the PathScale Compiler Suite on a different host machine, you will need to install the runtime libraries on your host machine, or you need to static link your programs when you compile. See section 2.7 for information on static linking, and the QLogic PathScale Compiler Suite Install Guide for information on installing runtime libraries.
2.3.4 Compiling Option Tool: pathhow-compiled
The PathScale Compiler Suite includes a tool that displays the compilation options and compiler version currently being used. The tool is called pathhow-compiled and can be found after installation in /opt/pathscale/bin (or <install_directory>/bin if you installed to a non-default location). When a .o file, archive, or an executable is passed to pathhow-compiled, it will display the compilation options for each .o file constituting the argument file. This includes any linked archives. For example, compile
346. table, we also try -O3 -ipa. To our great surprise, we achieve a run time of 110.5 seconds, a 58% speed-up over our previous -O3 time, and a nice improvement over -O2 -ipa.
Section 7.7 mentions the flags -O3 -ipa -LNO:fusion=2 and -OPT:div_split=on. Testing combinations of these two flags as additions to the -O3 -ipa we have already tested results in:
-O3 -ipa -LNO:fusion=2 results in 109.74 seconds run time
-O3 -ipa -OPT:div_split=on results in 112.24 seconds
-O3 -ipa -OPT:div_split=on -LNO:fusion=2 results in 111.28 seconds
So -O3 -ipa essentially ties for the best set of flags with -O3 -ipa -LNO:fusion=2.
Section 10 Debugging and Troubleshooting
The QLogic PathScale Compiler Suite Support Guide contains information about getting support from QLogic and tells you how to submit a bug. We consider performance issues to be a bug. The pathbug tool, described in the Support Guide, can help you gather information for submitting your bug.
10.1 Subscription Manager Problems
For recommendations in addressing problems or issues with subscriptions, refer to "Troubleshooting" in the QLogic PathScale Install Guide.
10.2 Debugging
The earlier sections on the PathScale Fortran and C/C++ compilers contain language-specific debugging information. See section 3.10 and section 4.3. More general information on debugging can be found in this section. The flag -g tells the PathScale compilers to produce
347. temporary directory directory
-v  Generate more verbose output
-w columns  Number of columns to use in formatting output 40
Don't print out a summary table
Option Configuration File
The PathScale Compiler Suite includes pathopt2.xml, a pre-configured option configuration file found in /opt/pathscale/share/pathopt2/, that contains about 200 test flags and options. This XML file specifies a tree of options to try. A small set of tags and attributes are used. The file supports many common combinations of options in a framework that enables pathopt2 to adapt as it runs. pathopt2.xml can be used on its own or as a framework for creating a custom configuration file. More than one configuration can be described in a single file.
A single configuration in pathopt2.xml consists of two parts:
- A list of options. This list is contained within a define tag. This list can also contain any number of option, choose, or append tags.
- An execute target. This is a set of rules that accesses the named options list via the <source> tag. The execute target can use multiple <source> tags in order to combine different lists of options. It can also contain option or append tags. An execute target can be addressed on the command line using the -t option. By default, pathopt2 runs only the first execute targ
348. ter dimension p read 1 if 1 then p gt a else p gt a 1 5 2 1 5 2 endif Because possible does not have an explicit interface it expects a contiguous array Therefore the compiler generates a runtime test to check a contiguous bit belonging to the pointer p and if the target is not contiguous the values are copied to a temporary array before the call and copied back after the call call possible p size p The compiler must always copy this sequence array to a temporary variable to make it contiguous call possible a 1 2 5 2 3 5 Size a 1 2 5 2 3 5 end program copier pathf90 fullwarn c cico f90 call possible p size p pathf95 1438 pathf90 CAUTION COPIER File cico f90 Line 26 Column 17 This argument produces a possible copy in and out to a temporary variable call possible a 1 2 5 2 3 5 size a 1 2 5 2 3 5 pathf95 1438 pathf90 CAUTION COPIER File cico f90 Line 30 3 28 1 02404 15 XX QLOGIC 3 The PathScale Fortran Compiler Fortran Compiler Stack Size I E l l l i s 3 11 Column 18 This argument produces a copy in to a temporary variable pathf95 PathScale TM Fortran Version 2 9 99 fl4 Thu Dec 7 2006 06 03 17 pathf95 32 source lines pathf95 0 Error s 0 Warning s 2 Other message s 0 ANSI s pathf95 explain pathf95 message number gives more information abo
349. th the PathScale Compiler Suite.
ABI  Describes the interface between program components at the binary level. It encompasses details such as procedure calling convention (how parameters and return values are passed), the mangling/encoding of function and variable names, and the dedication of registers for different usages.
affinity  Processor affinity is used to specify the preferred processor (or subset of processors) for scheduling a thread. An affinity setting might be made in order to bind a thread close to a resource and to prevent the kernel from rescheduling the thread to another processor further away from that resource. Affinity is particularly important on NUMA (non-uniform memory) architectures, since memory access latency and bandwidth may vary based on the relative locations of the processor and memory.
AMD64  64-bit extensions to Intel's IA32 (more commonly known as x86) architecture.
alias  An alternate name used for identification, such as for naming a field or a file.
aliasing  Two variables are said to be aliased if they potentially are in the same location in memory. This inhibits optimization. A common example in the C language is two pointers: if the compiler cannot prove that they point to different locations, a write through one of the pointers will cause the compiler to believe that the second pointer's target has changed.
assertion  A statement in a program that a certain condition
base
350. thScale compiler. The compiled benchmarks are run on a 1.4 GHz Opteron system. Two sets of data are shown here. The first set studies the effects of using the single option, -ipa. The second set shows the effects of additional IPA-related tuning flags on the same files.
Table 7-1. Effects of IPA on SPEC CPU 2000 Performance
Benchmark       Time w/o -ipa   Time with -ipa   Improvement
164.gzip        170.7 s         164.7 s            3.5%
175.vpr         202.4 s         192.3 s            5%
176.gcc         113.6 s         113.2 s            0.4%
181.mcf         391.9 s         390.8 s            0.3%
186.crafty       83.5 s          83.4 s            0.1%
197.parser      301.4 s         289.3 s            4%
252.eon         152.8 s         126.8 s           17%
253.perlbmk     196.2 s         192.3 s            2%
254.gap         153.5 s         128.6 s           16.2%
255.vortex      175.2 s         132.1 s           24.6%
256.bzip2       210.2 s         181.0 s           13.9%
300.twolf       376.5 s         362.2 s            3.8%
168.wupwise     220.0 s         161.5 s           26.6%
171.swim        181.4 s         180.7 s            0.4%
172.mgrid       184.7 s         182.3 s            1.3%
173.applu       282.5 s         245.2 s           13.2%
177.mesa        155.4 s         131.5 s           15.4%
178.galgel      150.4 s         149.9 s            0.3%
179.art         245.7 s         221.1 s           10%
183.equake      143.7 s         143.2 s            0.3%
187.facerec     154.3 s         147.4 s            4.5%
188.ammp        266.5 s         261.7 s            1.8%
189.lucas       165.9 s         167.9 s           -1.2%
191.fma3d       239.6 s         244.6 s           -2.1%
200.sixtrack    265.0 s         276.9 s           -4.5%
301.apsi        280.7 s         273.7 s            2.5%
Tab
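The Improvement column in the table is consistent with the reduction in run time expressed as a percentage of the time without -ipa. A minimal sketch of that arithmetic in C follows; the function name is hypothetical, and the formula is inferred from the table values rather than stated in the guide.

```c
/* Percentage reduction in run time relative to the baseline.
   Positive means the second run was faster; negative means slower. */
double improvement_percent(double time_without, double time_with)
{
    return 100.0 * (time_without - time_with) / time_without;
}
```

For 168.wupwise, improvement_percent(220.0, 161.5) is about 26.6, matching the table. For 189.lucas, where the run with -ipa is slower (165.9 s versus 167.9 s), the result is negative, about -1.2.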
351. the file myfile.c with pathcc and then use the pathhow-compiled tool:
$ pathcc myfile.c -o myfile
$ pathhow-compiled myfile.o
The output would look something like this:
QLogic PathScale Compiler Version 3.0 compiled myfile.c with options: -O2 -march=opteron -msse2 -mno-sse3 -mno-3dnow -m64
2.4 Input File Types
The name for a source file usually has the form filename.ext, where ext is a one to three character extension used on a source code file that can have various meanings.
Extension           Implication to the driver
.c                  C source file that will be preprocessed
.C .cc .cpp .cxx    C++ source file that will be preprocessed
.f .f90 .f95        Fortran source file
                    .f is fixed format, no preprocessor
                    .f90 is freeform format, no preprocessor
                    .f95 is freeform format, no preprocessor
.F .F90 .F95        Fortran source file
                    .F is fixed format, invokes preprocessor
                    .F90 is freeform format, invokes preprocessor
                    .F95 is freeform format, invokes preprocessor
For Fortran files with the extensions .f, .f90, or .f95, you can use -ftpp (to invoke the Fortran preprocessor) or -cpp (to invoke the C preprocessor) on the pathf95 command line. The default preprocessor for files with .F, .F90, or .F95 extensions is -cpp. See section 3.4.1 for more information on preprocessing.
The compiler drivers can use the extension to determine which la
352. the result of all the runs. The first set of flags are listed in the order in which they were run. This is followed by a summary table, which sorts the same output by time, from fastest to slowest. Sample output from this run is shown below:
Flags             Build   Test   Real   User   System
-O2               PASS    PASS   2.83   2.82   0.00
-O3               PASS    PASS   2.39   2.39   0.00
-O3 -ipa          PASS    PASS   2.40   2.40   0.01
-O3 -OPT:Ofast    PASS    PASS   2.37   2.38   0.00
-Ofast            PASS    PASS   2.38   2.38   0.00
Sorted summary from all runs:
Flags             Build   Test   Real   User   System
-O3 -OPT:Ofast    PASS    PASS   2.37   2.38   0.00
-Ofast            PASS    PASS   2.38   2.38   0.00
-O3               PASS    PASS   2.39   2.39   0.00
-O3 -ipa          PASS    PASS   2.40   2.40   0.01
-O2               PASS    PASS   2.83   2.82   0.00
From these results, we see that the best option from this run is -O3 -OPT:Ofast. The next sections will discuss details on usage, command-line options, and the configuration file format.
Usage
Basic usage is as follows:
pathopt2 -n num_iterations -f configfile -t execute_target -r test_command -S real|user|system build_command args
The command line above shows the most commonly used options; for the complete list of options, see Table 7-4. The pathopt2 tool runs build_command with the provided arguments, and using additional options as specified in configfile. The build command can be a PathScale invocation command (pathcc, pathf95, pathCC), a make command, or a script which eventually invokes the compiler, perhaps via a make command. The charact
353. the widely used command-line options supported by gcc
- Generates code that complies with the x86_64 ABI and the 32-bit x86 ABI
The C++ compiler:
- Conforms to ISO/IEC 14882:1998(E), Programming Languages - C++ standard
- Supports extensions to the C++ programming language as documented in "Using GCC: The GNU Compiler Collection Reference Manual," October 2003, for GCC version 3.3.1
- Refer to section 4.4 of this document for the list of extensions that are currently not supported
- Complies with the C++ Application Binary Interface as defined by the GNU C++ compiler (g++) as implemented on the platforms supported by the PathScale Compiler Suite
- Supports most of the widely used command-line options supported by g++
- Generates code that complies with the x86_64 ABI and the 32-bit x86 ABI
To invoke the PathScale C and C++ compilers, use these commands:
pathcc - invoke C compiler
pathCC - invoke the C++ compiler
The command-line flags for both compilers are compatible with those taken by the GCC suite. See section 4.1 for more discussion of this.
4.1 Using the C/C++ Compilers
If you currently use the GCC compilers, the PathScale compiler commands will be familiar. Makefiles that presently work with GCC should operate with the PathScale compilers effortlessly; simply change the command used to invoke the compiler and rebuild. Se
354. ther processes could also have run. The default metric used when comparing the performance of one set of options with another is real time. All three times will be displayed in the output.
Additionally, pathopt2 allows arbitrary performance metrics to be used to guide option selection, using the timing file and rate file choices. When either of these options is used, pathopt2 sets an environment variable called PSC_METRIC_FILE with the name of a temporary file before running the command. The run command is required to write the performance metric into this file before it terminates. The pathopt2 tool then opens this file, reads a value from the file as a double-precision floating-point number, and deletes the temporary file. The only interpretation placed on these values is that smaller is better for timing and that larger is better for rate. The actual units of the values do not matter as far as pathopt2 is concerned, since it just performs comparisons on the values.
Using the above usage as a guide, we can now summarize the simple command from the previous section:
$ pathopt2 -f pathopt2.xml -t try5 -r factorial pathcc -o factorial factorial.c
This example directs pathopt2 to use pathopt2.xml as the configuration file. The build command pathcc -o factorial factorial.c is used for the building phase, where option is iteratively replaced with the rules specified in the try5 subset within the configuration file pathopt2.xml. The
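The metric-file handshake described above (pathopt2 sets PSC_METRIC_FILE, and the run command writes one value into that file before exiting) could look roughly like the following on the program side. This is a sketch under assumptions: the function name `report_metric` is hypothetical, and the value is written as decimal text, which the guide does not spell out beyond saying it is read as a double-precision number.

```c
#include <stdio.h>
#include <stdlib.h>

/* Write one metric value to the file named by PSC_METRIC_FILE.
   Returns 0 on success, -1 if the variable is unset or the write fails.
   Assumption: the value is stored as decimal text on a single line. */
int report_metric(double value)
{
    const char *path = getenv("PSC_METRIC_FILE");
    FILE *fp;
    int ok;

    if (path == NULL)
        return -1;

    fp = fopen(path, "w");
    if (fp == NULL)
        return -1;

    ok = fprintf(fp, "%.17g\n", value) > 0;
    if (fclose(fp) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

A test program would call report_metric once, just before it terminates, passing whatever quantity it wants compared (smaller is better for timing, larger is better for rate).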
355. timizations
A few advanced optimizations, intended to exploit some exotic instructions such as CMOVE (conditional move), result in slightly changed program behavior, such as programs which write into variables guarded by an if statement. For example:
   if (a .eq. 1) then
      a = 3
   endif
In this example, the fastest code on an x86 CPU is code which avoids a branch by always writing a; if the condition is false, it writes a's existing value into a, else it writes 3 into a. If a is a read-only value not equal to 1, this optimization will cause a segmentation fault in an odd but perfectly valid program.
7.7.6 Assumptions About Numerical Accuracy
See the following table for the assumptions made about numerical accuracy at different levels of optimization.
Table 7-3. Numerical Accuracy with Options
-OPT: option name       -O0   -O1   -O2   -O3   -Ofast   Notes
div_split               off   off   off   off   on       on if IEEE_a=3
fast_complex            off   off   off   off   off      on if roundoff=3
fast_exp                off   off   off   on    on       on if roundoff>=1
fast_nint               off   off   off   off   off      on if roundoff=3
fast_sqrt               off   off   off   off   off
fast_trunc              off   off   off   on    on       on if roundoff>=1
fold_reassociate        off   off   off   off   on       on if roundoff>=2
fold_unsafe_relops      on    on    on    on    on
fold_unsigned_relops    off   off   off   off   off
IEEE_arithmetic         1     1     1     2     2
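The conditional-move transformation described earlier in this excerpt can be sketched in C. The two hypothetical functions below show the source-level difference between the guarded store and the always-store form; an actual compiler is free to generate either instruction sequence from either function, so this only illustrates why the second form touches memory even when the condition is false.

```c
/* Guarded form: stores to *a only when the condition holds. */
void set_if_one_branch(int *a)
{
    if (*a == 1)
        *a = 3;
}

/* Branch-free form: always stores to *a, writing back the old value
   when the condition is false. The select can map to a conditional
   move. If *a lived in read-only memory and was not 1, the store here
   would fault, while the guarded form above would not. */
void set_if_one_select(int *a)
{
    int old = *a;
    *a = (old == 1) ? 3 : old;
}
```

Both functions leave the same value in *a for writable storage; they differ only in whether a store is issued when the condition is false.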
356. tion for all the program variables. Whenever a variable has its address taken, it can potentially be pointed to by a pointer. Places that dereference or store through the pointer potentially access the variable. IPA's alias analysis keeps track of this information so that, in the presence of pointer accesses, as few variables are affected as possible, so they can be optimized more aggressively.
The mod/ref and alias information collected by IPA is not just used by IPA itself. The information is also recorded in the program representation, so the optimizations in the backend phases also benefit.
7.3.3 Optimization
The most important optimization performed by IPA is inlining, in which the call to a function is replaced by the actual body of the function. Inlining is most versatile in IPA because all the user function definitions are visible to it. Apart from eliminating the function call overhead, inlining increases optimization opportunities of the backend phases by letting them work on larger pieces of code. For instance, inlining may result in the formation of a loop nest that enables aggressive loop transformations.
Inlining requires careful benefit analysis because overdoing it may result in performance degradation. The increased program size can cause higher instruction cache miss rate. If a function is already quite large, inlining
357. to use NUMA optimizations in the operating system to give good placement of data relative to threads. This optimization relies on "first touch": the thread that first touches the data is assumed to be the most frequent user of the data, and thus the data is allocated onto physical addresses in the DRAM associated with the CPU that is currently running that thread. This is applied by a NUMA-aware operating system at the page level. If your kernel version is not NUMA-aware, then a kernel upgrade may be required for good performance.
Similarly, thread-to-CPU affinity is also important for good OpenMP performance. The OpenMP library by default uses affinity system calls to strongly associate threads with CPUs. The idea is to keep the threads co-located with their associated data. Without affinity assignments, the threads may be migrated by the O/S scheduler to other nodes and lose their good placement relative to their data. However, sometimes the use of affinity binding can cause a load imbalance and prevent the scheduler from making sensible decisions about thread placement. In this case, the thread affinity assignments can be disabled by setting the PSC_OMP_AFFINITY environment variable to FALSE. If your kernel does not support scheduling affinity, you may need to upgrade to a newer kernel to see the performance benefit of this mechanism.
8.14.3.3 Load Balancing
It is possible to gain some coarse insight into the load balancing of the OpenMP application
358. tran programs that write to constant variables are compiled with the PathScale Fortran compiler. A typical situation is that an argument to a subroutine or function is given a constant value such as 0 or .FALSE., but the subroutine or function tries to assign a new value to that argument.
We recommend that, where possible, you fix code that assigns to constants so that it no longer does this. Such a change will continue to work with other Fortran compilers, but will allow the PathScale Fortran compiler to generate code that will not crash and will run more efficiently.
If you cannot modify your code, we provide an option called -LANG:rw_const=on that will change the compiler's behavior so that it allocates constant values in read-write memory. We do not make this option the default, as it reduces the compiler's ability to propagate constant values, which makes the resulting executables slower.
You might also try the -LANG:formal_deref_unsafe option. This option tells the compiler whether it is unsafe to speculate a dereference of a formal parameter in Fortran. The default is OFF, which is better for performance. See the eko man page for more details on these two flags.
Runtime Errors Caused by Aliasing Among Fortran Dummy Arguments
The Fortran standards require that arguments to functions and subroutines not alias each other. As an example, this is illegal:
   program bar
   call foo(c,c)
   subroutine foo(a,b)
   integer i
   real a(100
359. ttp://www.openmp.org
   - For the Fortran, C, and C++ version 2.5 OpenMP Specification, click on Specifications in the left column of the OpenMP home page.
   - For Tutorials, Benchmarks, Publications, and Books, click on Resources in the left column of the OpenMP home page.
- Parallel Programming in OpenMP by Rohit Chandra, et al.; Morgan Kaufmann Publishers, 2000. ISBN 1-55-860671-8
Section 9 Examples
9.1 Compiler Flag Tuning and Profiling With pathprof
We'll use the 168.wupwise program from the CPU2000 floating point suite for this example. This is a Physics/Quantum Chromodynamics (QCD) code. wupwise is an acronym for "Wuppertal Wilson Fermion Solver," a program in the area of lattice gauge theory (quantum chromodynamics). The code is in about 2100 lines of Fortran 77 in 23 files. We'll be running and tuning wupwise performance on the reference (largest) dataset. Each run takes about two to four minutes on a 2 GHz Opteron system to complete.
Even though this is a Fortran 77 code, the PathScale Fortran compiler (pathf95) can handle it.
Outline: Try pathf95 -O2 and pathf95 -O3 first. Run times (user time) were:
        seconds
-O2     150.3
-O3     174.3
We're a little surprised, since -O3 is supposed to be faster than -O2 in general. But the man page did say t
360. u_linux__
__GNUC__ 3
__GNUC_MINOR__ 3
__GNUC_PATCHLEVEL__ 1
__PATHSCALE__ "2.0"
__PATHCC__ 2
__PATHCC_MINOR__ 0
__PATHCC_PATCHLEVEL__ 0
NOTE: The __GNU* and __PATH* values are derived from the respective compiler version numbers, and will change with each release.
These Fortran macros will also be used if the source file is Fortran, but cpp is used:
_LANGUAGE_FORTRAN 1
LANGUAGE_FORTRAN 1
_LANGUAGE_FORTRAN90 1
LANGUAGE_FORTRAN90 1
For 32-bit compilation, the following macros are defined:
__i386__ __i386 i386
For 64-bit, the following macros are defined:
__LP64__ _LP64
NOTE: When using an optimization level at -O1 or higher, the compiler will use the __OPTIMIZE__ macro.
A quick way to list all the predefined cpp macros would be to compile your program with the flags -dD -keep. You can find all the defines (or predefined macros) in the resulting .i file. Here is an example for C:
$ cat hello.c
main(){ printf("Hello World\n"); }
$ pathcc -dD -keep hello.c
$ wc hello.i
94 278 2606 hello.i
$ cat hello.i
The hello.i file will contain the list of pre-defined macros.
NOTE: Generating an .i file doesn't work well with Fortran, because if the preprocessor sends the #define's to the .i file, Fortran can't parse them. See section 3.4.4.1 for information on finding pre-defined macros in Fortran
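As a small illustration of how the predefined macros listed above might be used, the following C sketch selects behavior at compile time. The function names are hypothetical; the macros tested (`__LP64__`, `_LP64`, `__i386__`, `__i386`, `__OPTIMIZE__`) are the ones named in the text, and other compilers may define a different set.

```c
/* Report the data model chosen at compile time, using the 64-bit and
   32-bit macros named in the text. */
const char *data_model(void)
{
#if defined(__LP64__) || defined(_LP64)
    return "LP64";
#elif defined(__i386__) || defined(__i386)
    return "ILP32 (x86)";
#else
    return "unknown";
#endif
}

/* Returns 1 when the translation unit was compiled with optimization
   enabled (the compiler defined __OPTIMIZE__), otherwise 0. */
int built_with_optimization(void)
{
#ifdef __OPTIMIZE__
    return 1;
#else
    return 0;
#endif
}
```

Compiling the same file with and without an optimization flag is a quick way to confirm which branch of built_with_optimization is taken.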
361. ubroutine form sets 1 to the previous value of the mask unlink Fortran interface to the POSIX function unlink Remove the link to the file named file The function form returns 0 on success or the error code from the C library value errno The subroutine form sets status to the value which the function would return Trailing blanks in file are ignored you can prevent this by using char 0 to place a null character after the last significant character xor Bitwise Boolean zabs zcos Specific names for various mathematical functions having an zexp zlog argument of type complex 16 zsin zsqrt C 54 1 02404 15 C Supported Fortran Intrinsics QLOGIC Fortran Intrinsic Extensions ls Notes 1 02404 15 C 55 C Supported Fortran Intrinsics Fortran Intrinsic Extensions QLOGIC HEFFEEENECXAEXAEN C w C 56 1 02404 15 Appendix D Fortran 90 Dope Vector Here is an example of a simplified data structure from a Fortran 90 dope vector from the file clibinc cray dopevec h found in the source distribution See section 3 4 6 for more details typedef struct FCD char c pointer C character pointer unsigned long byte len Length of item in bytes b fcd typedef struct f90 type unsigned int 32 used for future development enum typecodes DVTYPE UNUSED 0 DVTYPE TYPELESS 1 DVTYPE INTEGER 2 DVTYPE REAL 3 DVTYPE COMPLEX 4 DVTYPE LOGICAL 5 DVTYP
362. uct int modulevarl1 double modulevar2 mymodule_data extern struct int modulevar3 double modulevar4 mymodule_data_init void mycfunction 55 5 printf d gWNn mymodule data modulevarl1 mymodule data modulevar2 printf d g n mymodule data init modulevar3 mymodule data init modulevar4 cat dfile data init in mymodule mymodule data init data in mymodule in mymodule mymodule data mycfunction mycfunction pathf90 fdecorate dfile mymodule f90 mycprogram c mymodule 90 mycprogram c a out 22 33 3 44 55 5
3.6 Runtime I/O Compatibility
Files generated by the Fortran I/O libraries on other systems may contain data in different formats than that generated or expected by codes compiled by the PathScale Fortran compiler. This section discusses how the PathScale Fortran compiler interacts with files created by other systems.
3.6.1 Performing Endian Conversions
Use the assign command, or ASSIGN() procedure, to perform endian conversions while doing file I/O.
3.6.1.1 The assign Command
The assign command changes or displays the I/O processing directives for a Fortran file or unit. The assign command allows various processing directives to be associated with a unit or file name. This can be used to perform numeric conversion while doing file I/O. The assi
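For readers unfamiliar with what an endian conversion does to the data, the following C sketch reverses the byte order of a single 32-bit value. It only illustrates the concept; it is not the mechanism used by the assign command, and the function name is hypothetical.

```c
#include <stdint.h>

/* Reverse the byte order of a 32-bit value, converting between
   big-endian and little-endian representations. */
uint32_t swap_u32(uint32_t v)
{
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) << 8)  |
           ((v & 0x00FF0000u) >> 8)  |
           ((v & 0xFF000000u) >> 24);
}
```

Applying the function twice returns the original value, which is why the same conversion serves for both reading and writing foreign-endian files.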
364. ult integer type and that are passed as actual arguments. The padding extends the length to the size of the default integer type.
-pathcc  Define __PATHCC__ and other macros.
-pedantic-errors  Issue warnings needed by strict compliance to ANSI C.
-pg  Generate extra code to profile information suitable for the analysis program pathprof(1). You must use this option when compiling the source files you want data about, and you must also use it when linking. This option turns on application level profiling but not library level profiling; see also -profile. See the gcc man pages for more information.
-profile  Generate extra code to profile information suitable for the analysis program pathprof(1). You must use this option when compiling the source files you want data about, and you must also use it when linking. This option turns on application level and library level profiling; see also -pg.
-r  Produce a relocatable .o and stop.
-rreal_spec  For Fortran only. Specify the default kind specification for real values.
   Option   Kind value
   -r4      Use REAL(KIND=4) and COMPLEX(KIND=4) for real and complex variables, respectively (the default).
   -r8      Use REAL(KIND=8) and COMPLEX(KIND=8) for real and complex variables, respectively.
-S  Generate an assembly file, file.s, rather than an object file (file.o).
-shared  DSO-shared PIC code.
-shared-libgcc  Force the use of the shared libgcc library
365. unctions having an cdexp cdlog argument of type complex 16 cdsin cdsqrt chdir Like the C library function chdir sets the current working directory to dir The function form returns 0 on success but otherwise returns the error code from the C library value errno The subroutine form sets status to the value that the function form would return Trailing blanks in dir are ignored you can prevent this by using char 0 to place a null character after the last significant character C 42 1 02404 15 XX QLOGIC C Supported Fortran Intrinsics Fortran Intrinsic Extensions I w a wwssEii 1 02404 15 chmod ctime date dbesj0 dbesj1 dbesjn dbesy0 dbesy1 dbesyn dcmplx dconj derf derfc dfloat dimag dreal dtime erf erfc Like the POSIX command chmod changes the access permissions of file name according to mode See the operating system documentation for the characters allowed in mode The function form returns 0 on success but otherwise returns the error code from the C library value errno The subroutine form sets status to the value which the function form would return Trailing blanks in name are ignored you can prevent this by using char 0 toplace a null character after the last significant character Like the C library function ctime converts stime which can be obtained from the intrinsic t ime8 to a string of the form Thu Mar 2 12 45 36 PST 2006 The function form returns that
366. ure
The -gnu4 option causes pathcc and pathCC to be compatible with version 4.0.2 of GCC and g++ in terms of the source language constructs they support. A sample command for C is:
$ pathcc -gnu4 world.c
When this option is not in effect, pathcc/pathCC are compatible with the default version (3.x) of GCC and g++. This default option can be changed in your compiler.defaults file by adding this line:
-gnu4
See section 2.3 for an example compiler.defaults file. The option has no effect on pathf90 or pathf95. There are currently some limitations when using this option. Please see the Release Notes for more information.
4.2 Compiler and Runtime Features
4.2.1 Preprocessing Source Files
Before being passed to the compiler front-end, source files are optionally passed through a source code preprocessor. The preprocessor searches for certain directives in the file and, based on these directives, can include or exclude parts of the source code, include other files, or define and expand macros. All C and C++ files are passed through the C preprocessor unless the -nocpp flag is specified.
4.2.1.1 Pre-defined Macros
The PathScale compiler pre-defines some macros for preprocessing code. These include the following:
linux __linux __linux__ unix __unix __unix__ __gn
367. ut each message One way to minimize copying while still taking advantage of Fortran 90 features is to use Fortran 90 style assumed-shape and deferred-shape arrays (that is, arrays whose bounds look like (:,:) rather than (2,3) or (n,m)) for all dummy array arguments, so that procedure calls pass a bit indicating whether the array is contiguous. This requires that the program use explicit interfaces for all procedures (with interface blocks, with module use statements, or by nesting one procedure inside another with contains). Each of those methods provides the compiler with an explicit interface from the viewpoint of the Fortran standard.

NOTE: Redundant interfaces are incorrect; don't provide an interface block for a procedure whose interface is already imported via a use statement.

The compiler will also copy noncontiguous arrays to temporary variables in some situations where the standard does not require it, but where heuristics suggest that this will improve performance by better using the cache. To disable this category of copying, use the command line option -LANG:copyinout=off.

Fortran Compiler Stack Size

The Fortran compiler allocates data on the stack by default. Some environments set a low limit on the size of a process's stack, which may cause Fortran programs that use a large amount of data to crash shortly after they start. If the PathScale Fortran runtime environment detects a low stack size li
368. variables and memory returned by alloca. It does not affect the behavior of globals, memory allocated with malloc, or Fortran common data. The option initializes integer variables to the bit pattern for floating point NaN (integers don't have NaNs). The CPU doesn't trap on these integer operands, although the NaN bit pattern will make the wrong result more obvious. This option is not supported under the 32-bit ABI without SSE2.

The -Wuninitialized option warns about uninitialized automatic variables at compile time. -Wno-uninitialized tells the compiler not to warn about uninitialized automatic variables.

The new -zerouv option sets uninitialized variables in your code to zero at program runtime. Doing this will have a slight performance impact. This option affects local scalar and array variables and memory returned by alloca. It does not affect the behavior of globals, memory allocated with malloc, or Fortran common data.

10.4 Large Object Support

Statically allocated data (.bss objects), such as Fortran COMMON blocks and C variables with file scope, are currently limited to 2GB in size. If the total size exceeds that, the compilation (without the -mcmodel=medium option) will likely fail with the message:

relocation truncated to fit: R_X86_64_PC32

For Fortran programs with only one COMMON block, or with no COMMON blocks after the one that exceeds the 2GB limit, the program may compile and run correctly. At higher optimizat
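The remark that a NaN bit pattern in an integer variable does not trap but makes a wrong result more obvious can be illustrated in C. The excerpt does not state which bit pattern the compiler actually uses, so the quiet-NaN pattern 0x7fc00000 below is an assumption, as are the helper names and the assumption that float is a 32-bit IEEE 754 type.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: reinterpret a 32-bit pattern as a float and
   report whether it is a NaN (a NaN never compares equal to itself). */
int pattern_is_nan(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);   /* assumes 32-bit IEEE 754 float */
    return f != f;
}

/* Hypothetical helper: the same bits read as a signed integer.
   No trap occurs; the value is just an ordinary, very large number. */
int32_t pattern_as_int(uint32_t bits)
{
    int32_t i;
    memcpy(&i, &bits, sizeof i);
    return i;
}
```

With the assumed pattern 0x7fc00000, pattern_is_nan returns nonzero, while pattern_as_int returns 2143289344, a value large enough to stand out in most integer computations.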
369. vided by the number of CPUs the system has. For example, a limit of 1.5g specifies that the Fortran runtime will use no more than 1.5 gigabytes (GB) of stack. On a system with 2GB of physical memory, a limit of 90%/cpu will use no more than 0.9GB of stack (2 / 2 * 0.90).

PSC_STACK_VERBOSE (Fortran): If this environment variable is set, the Fortran runtime will print detailed information about how it is computing the stack size limit to use.

Standard OpenMP Runtime Environment Variables

These environment variables can be used with OpenMP in either Fortran or C and C++.

OMP_DYNAMIC: Enables or disables dynamic adjustment of the number of threads available for execution. Default is FALSE, since this mechanism is not supported.

OMP_NESTED: Enables or disables nested parallelism. Default is FALSE.

OMP_SCHEDULE: This environment variable only applies to DO and PARALLEL DO directives that have schedule type RUNTIME. Type can be STATIC, DYNAMIC, or GUIDED. Default is STATIC, with no chunk size specified.

OMP_NUM_THREADS: Set the number of threads to use during execution. Default is the number of CPUs in the machine.

PathScale OpenMP Environment Variables

These environment variables can be used with OpenMP in both Fortran and C and C++, except as indicated.

PSC_OMP_AFFINITY: When TRUE, the operating system's affinity mechanism (where available) is used to assign threads to CPUs; otherwise no affinity assignments are made. The default value is
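The per-CPU stack limit arithmetic in the example above (2GB of physical memory, a 90 percent limit divided per CPU, giving 0.9GB) can be written out as a small C function. This is a sketch of the arithmetic only, not the runtime's actual implementation; the function name is hypothetical, and the CPU count of 2 is inferred from the stated calculation rather than given explicitly in the excerpt.

```c
/* Hypothetical helper: stack limit in GB when a percentage of
   physical memory is divided by the number of CPUs. */
double per_cpu_stack_limit_gb(double physical_gb, int ncpus, double percent)
{
    /* physical memory / CPUs, scaled by the percentage */
    return physical_gb / ncpus * (percent / 100.0);
}
```

For the example in the text, per_cpu_stack_limit_gb(2.0, 2, 90.0) evaluates 2 / 2 * 0.90, which is 0.9GB.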
370. viding a compatibility flag for unportable g77 programs. If you find this to be a problem, the best solution is to change your program to inquire for the actual KIND values instead of hard-wiring them.

If you are using -i8 or -r8, see section 3.3.1 for more details on usage.

3.8 Library Compatibility

This section discusses our compatibility with libraries compiled with C or other Fortran compilers.

Linking object code compiled with other Fortran compilers is a complex issue. Fortran 90 or 95 compilers implement modules and arrays so differently that it is extremely difficult to attempt to link code from two or more compilers. For Fortran 77, run-time libraries for things like I/O and intrinsics are different, but it is possible to link both runtime libraries to an executable.

We have experimented using object code compiled by g77. This code is not guaranteed to work in every instance. It is possible that some of our library functions have the same name but different calling conventions than some of g77's library functions. We have not tested linking object code from other compilers, with the exception of g77.

3.8.1 Name Mangling

Name mangling is a mechanism by which names of functions, procedures, and common blocks from Fortran source files are converted into an internal representation when compiled into object file
371. whose type is char

-W[no-]comment: For C/C++ only. -Wcomment warns if nested comments are detected. -Wno-comment tells the compiler not to warn if nested comments are detected.

-W[no-]conversion: For C/C++ only. -Wconversion warns about possibly confusing type conversions. -Wno-conversion tells the compiler not to warn about possibly confusing type conversions.

-W[no-]deprecated: -Wdeprecated will announce deprecation of compiler features. -Wno-deprecated tells the compiler not to announce deprecation of compiler features.

-Wno-deprecated-declarations: Do not warn about deprecated declarations in code.

-W[no-]disabled-optimization: -Wdisabled-optimization warns if a requested optimization pass is disabled. -Wno-disabled-optimization tells the compiler not to warn if a requested optimization pass is disabled.

-W[no-]div-by-zero: -Wdiv-by-zero warns about compile-time integer division by zero. -Wno-div-by-zero suppresses warnings about compile-time integer division by zero.

-W[no-]endif-labels: -Wendif-labels warns if #if or #endif is followed by text. -Wno-endif-labels tells the compiler not to warn if #if or #endif is followed by text.

-W[no-]error: -Werror makes all warnings into errors. -Wno-error tells the compiler not to make all warnings into errors.

-W[no-]float-equal: -Wfloat-equal warns if floating point values
372. with a value known at compile time

DSO (dynamic shared object): A library that is linked in at runtime. In Linux, the C library (glibc) is commonly dynamically linked in. In Windows, such libraries are called DLLs.

DWARF: A debugging file format used by many compilers and debuggers to support source-level debugging. It is architecture-independent and applicable to any processor or operating system. It is widely used on Unix, Linux, and other operating systems, as well as in stand-alone environments.

EBO: The Extended Block Optimization pass in the PathScale compiler.

EM64T: The Intel Extended Memory 64 Technology family of chips.

equivalence: A Fortran feature similar to a C/C++ union, in which several variables occupy the same area of memory.

executable: The file created by the compiler (and linker) whose contents can be interpreted and run by a computer. The compiler can also create libraries and debugging information from the source code.

feedback: A compiler optimization technique in which information from a run of the program is then used by the compiler to generate better code. The PathScale Compiler Suite uses feedback information for branches, loop counts, calls, switch statements, and variable values.

flag: A command line option for the compiler, usually an

Other glossary terms listed on the same page: gcov, IPA, IR, linker, LNO, MP, NUMA, object file